### Overview

This project focuses on scraping video details from a YouTube channel using Selenium and BeautifulSoup, and organizing the data into a structured pandas DataFrame. By automating browser interactions, we dynamically extract titles, views, video links, upload dates, and thumbnail URLs, handling missing or dynamic content gracefully.

We will start by importing the necessary libraries to build our web scraping workflow. ***chromedriver_binary*** sets up ***ChromeDriver*** for seamless browser automation with ***Selenium***, while ***BeautifulSoup*** helps parse and extract meaningful data from HTML content. Finally, ***pandas*** is used to structure and manage the scraped data in a tabular format, making it easy to analyze or export. These libraries form the core of an efficient and streamlined web scraping process.

In [None]:
import chromedriver_binary 
from bs4 import BeautifulSoup
import pandas as pd

Next, we configure ***Selenium*** to automate the browser. The ***webdriver*** module and its related classes control the browser, while ***Service*** points Selenium at the location of the ***ChromeDriver*** executable. ***ChromeDriverManager*** from ***webdriver_manager*** is imported as a way to manage the ***ChromeDriver*** version, and ***WebDriverWait*** together with ***expected_conditions*** is imported to deal with dynamically loaded content later. The cell initializes the ***WebDriver*** from ***chrome_driver_path***, then tests the setup by opening the ***Videos*** section of the ***GeeksforGeeks YouTube channel*** and printing the page title.

In [49]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

# Path to the ChromeDriver executable (placeholder: replace with your own path)
chrome_driver_path = "path/to/chromedriver"
service = Service(chrome_driver_path)

# Initialize the WebDriver
driver = webdriver.Chrome(service=service)

# Test the setup
driver.get("https://www.youtube.com/@GeeksforGeeksVideos/videos")
print(driver.title)

GeeksforGeeks - YouTube


The ***driver.page_source*** attribute returns the ***HTML*** source of the page currently displayed in the browser controlled by ***Selenium***. This is especially useful for web scraping because it reflects the ***DOM*** at the moment it is read, including any content that JavaScript has already rendered.

In this context, ***driver.page_source*** returns the ***HTML*** of the ***Videos*** section of the GeeksforGeeks ***YouTube*** channel. That ***HTML*** can then be passed to a parsing library such as ***BeautifulSoup*** to extract specific pieces of data like ***video titles***, ***views***, or upload ***dates***. This is the step that takes us from browser automation to actual data scraping.
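Because ***driver.page_source*** is a snapshot, only the videos already loaded in the browser appear in it. YouTube's channel grid loads more items as the page scrolls, so one common approach is to scroll to the bottom repeatedly before reading the source. The following is a minimal sketch under that assumption; ***scroll_to_bottom***, ***pause***, and ***max_rounds*** are hypothetical names and values that are not part of the original notebook, and a suitable pause depends on network speed.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=200):
    """Scroll until the page height stops growing; return the number of scrolls performed.

    `driver` is assumed to be a Selenium WebDriver (anything with execute_script works).
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page to trigger loading of more items
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(pause)  # give the page time to append new items
        new_height = driver.execute_script(get_height)
        rounds += 1
        if new_height == last_height:
            break  # nothing new was loaded, so we have probably reached the end
        last_height = new_height
    return rounds
```

Calling ***scroll_to_bottom(driver)*** before reading ***driver.page_source*** should leave more of the channel's videos in the snapshot, although a slow connection can end the loop early if the pause is too short.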

In [None]:
driver.page_source

The statement ***soup = BeautifulSoup(driver.page_source, "html.parser")*** parses the ***HTML*** source that ***Selenium*** exposes through ***driver.page_source***. Here's what it does:

1. **HTML Parsing:** The raw HTML from the browser is parsed using the BeautifulSoup library, which converts it into a more structured form that is easier to access and search through.

2. **Parser Specification:** The "html.parser" argument selects Python's built-in HTML parser. It requires no extra dependencies; lxml or html5lib can be used instead when you need more speed or more lenient handling of malformed markup.

3. **Purpose:** The result is a BeautifulSoup object, stored in soup, that lets you search for specific tags, attributes, or text content with methods such as .find() or .find_all().

Here, soup will contain the structured HTML of the GeeksforGeeks YouTube "Videos" page which we will scrape for important elements such as video titles, links, views, or dates.
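To make the difference between .find() and .find_all() concrete, here is a small sketch on a hand-written HTML string. The markup is a hypothetical, heavily simplified stand-in for the real YouTube page, and ***soup_demo*** is an illustrative name chosen so that it does not overwrite the notebook's ***soup***.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real YouTube HTML is far larger
html = """
<div id="contents">
  <ytd-rich-item-renderer>
    <a class="yt-simple-endpoint" href="/watch?v=abc123">First video</a>
  </ytd-rich-item-renderer>
  <ytd-rich-item-renderer>
    <a class="yt-simple-endpoint" href="/watch?v=def456">Second video</a>
  </ytd-rich-item-renderer>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches)
first_link = soup_demo.find("a")

# find_all() returns every matching tag as a list-like result
items = soup_demo.find_all("ytd-rich-item-renderer")
titles = [item.find("a").get_text(strip=True) for item in items]
hrefs = [item.find("a").get("href") for item in items]

print(first_link.get("href"))
print(titles)
print(hrefs)
```

The same two calls are what the scraping loop relies on: find_all() to collect each video container, and find() to pick individual tags inside it.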

In [118]:
soup = BeautifulSoup(driver.page_source, "html.parser")

The code snippet below scrapes video details from the parsed page using BeautifulSoup and collects them in a list for further processing. Here's a concise breakdown of what it does:

1. **Initialize Data Storage:** The ***data*** list is created to store scraped information for each video as a list of details.

2. **Iterate Through Video Items:** The ***for sp in soup.find_all('ytd-rich-item-renderer')*** loop iterates over all video elements identified by the ***ytd-rich-item-renderer*** tag.

3. **Extract Video Details:**

a) ***Title:*** Retrieved from the text content of the ***a*** tag that matches a specific class.

b) ***Video Link:*** Extracted from the ***href*** attribute of the same ***a*** tag.

c) ***Views:*** Read from the first ***span*** tag in the metadata block; defaults to ***None*** if unavailable.

d) ***Date/Time:*** Read from the second ***span*** tag in the metadata block, with the same fallback to ***None***.

e) ***Thumbnail Link:*** Extracted from the ***src*** attribute of the ***img*** tag, with any query parameters stripped; defaults to ***None*** if the image is missing.

4. **Append Data:** Each video's details are appended as a list to the ***data*** list, maintaining a structured format.

In [52]:
data = []

for sp in soup.find_all('ytd-rich-item-renderer'):
    
    # Title and link come from the same anchor tag, so look it up once
    anchor = sp.find('a', class_='yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media')
    title = anchor.text
    video_link = anchor.get('href')

    # Views and upload date are the first and second metadata spans, when present
    meta = sp.find_all('span', class_='inline-metadata-item style-scope ytd-video-meta-block')
    views = meta[0].text if len(meta) > 0 else None
    date_time = meta[1].text if len(meta) > 1 else None

    # Thumbnail URL without query parameters; None if the image or its src is missing
    try:
        thumbnail_link = sp.find('img').get('src').split('?')[0]
    except AttributeError:
        thumbnail_link = None

    data.append([title, views, video_link, date_time, thumbnail_link])

***len(data)*** returns the total number of videos scraped and stored in the data list, helping verify the scraper's success. 

In [53]:
len(data)

1048

***data[0]*** retrieves the first video's details from the data list. It will return a list containing the extracted information (title, views, video link, date/time, and thumbnail link) for the first video.

In [54]:
data[0]

['Ready to take on the challenge? | 90% Refund Challenge by GeeksforGeeks',
 '5.2K views',
 '/watch?v=B05KjM_OLTA',
 '1 day ago',
 'https://i.ytimg.com/vi/B05KjM_OLTA/hqdefault.jpg']
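The record above stores the view count as display text ('5.2K views') and the link as a relative path. For numeric analysis or clickable URLs, both can be normalized. The helpers below are a minimal sketch and not part of the original notebook: ***parse_views*** and ***absolute_link*** are hypothetical names, and the parser assumes English labels with the suffixes K, M, and B.

```python
from urllib.parse import urljoin

# Multipliers for abbreviated counts; assumes English-language suffixes
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_views(text):
    """Convert text such as '5.2K views' to an int; return None if it cannot be parsed."""
    parts = text.split() if text else []
    if not parts:
        return None
    token = parts[0].replace(",", "")
    factor = _MULTIPLIERS.get(token[-1].upper(), 1)
    number = token[:-1] if factor != 1 else token
    try:
        return int(round(float(number) * factor))
    except ValueError:
        return None  # e.g. 'No views' or an unexpected label


def absolute_link(video_link, base="https://www.youtube.com"):
    """Turn a relative '/watch?v=...' path into a full URL."""
    return urljoin(base, video_link)


print(parse_views("5.2K views"))
print(absolute_link("/watch?v=B05KjM_OLTA"))
```

Because the abbreviated text is already rounded by YouTube, the parsed numbers are approximations rather than exact view counts.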

The line below converts the data list into a ***pandas*** DataFrame with columns named ***title***, ***views***, ***video_link***, ***date_time***, and ***thumbnail_link***. Each inner list in data becomes one row, giving a clean tabular structure for analysis, manipulation, or export. For example, you can inspect the data with ***df.head()*** or save it as a ***CSV*** file with ***df.to_csv('data.csv', index=False)***.

In [76]:
df = pd.DataFrame(data, columns = ['title', 'views', 'video_link', 'date_time', 'thumbnail_link'])

In [77]:
df

| | title | views | video_link | date_time | thumbnail_link |
|---|---|---|---|---|---|
| 0 | Ready to take on the challenge? \| 90% Refund C... | 5.2K views | /watch?v=B05KjM_OLTA | 1 day ago | https://i.ytimg.com/vi/B05KjM_OLTA/hqdefault.jpg |
| 1 | I went to the House of a Microsoft Software En... | 3.5K views | /watch?v=nQ-dpHJPS2c | 2 days ago | https://i.ytimg.com/vi/nQ-dpHJPS2c/hqdefault.jpg |
| 2 | The Only Cyber Security Roadmap you need for 2... | 1.8K views | /watch?v=0CMNIpd-s_c | 3 days ago | https://i.ytimg.com/vi/0CMNIpd-s_c/hqdefault.jpg |
| 3 | GeeksforGeeks is going to rock the world of Ed... | 1.6K views | /watch?v=Na5zVw3g9So | 5 days ago | https://i.ytimg.com/vi/Na5zVw3g9So/hqdefault.jpg |
| 4 | How to get a job in MAANG as a College Student... | 4.2K views | /watch?v=dE-MlQnCT78 | 5 days ago | https://i.ytimg.com/vi/dE-MlQnCT78/hqdefault.jpg |
| ... | ... | ... | ... | ... | ... |
| 1043 | Recursive O(logy) function for pow(x,y) \| Geek... | 4.7K views | /watch?v=2a_Snu8qjko | 5 years ago | |
| 1044 | DSA Online Course \| GeeksforGeeks | 69K views | /watch?v=BnJWi0E3Mv0 | 5 years ago | |
| 1045 | Lets Create Threads (JAVA) \| GeeksforGeeks | 13K views | /watch?v=kviGixevLNA | 5 years ago | |
| 1046 | Introduction to Multithreading (JAVA) \| Geeks... | 48K views | /watch?v=J_7FtjWgoC4 | 5 years ago | |


In [78]:
df.isnull().sum()

title              0
views              0
video_link         0
date_time          0
thumbnail_link    40
dtype: int64
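The count above shows 40 rows without a thumbnail. In the scraped rows that do have one, the URL follows the pattern https://i.ytimg.com/vi/VIDEO_ID/hqdefault.jpg, where VIDEO_ID matches the ***v*** parameter of the video link. Assuming that pattern also holds for the missing rows, which is not verified here, the gaps could be filled from the link itself. ***thumbnail_from_link*** is a hypothetical helper sketched for illustration.

```python
from urllib.parse import urlparse, parse_qs


def thumbnail_from_link(video_link):
    """Build a thumbnail URL from a '/watch?v=<id>' link; return None if no ID is found."""
    if not video_link:
        return None
    # parse_qs maps each query parameter to a list of values
    video_ids = parse_qs(urlparse(video_link).query).get("v")
    if not video_ids:
        return None
    # Assumed URL pattern, taken from the thumbnails that were scraped successfully
    return f"https://i.ytimg.com/vi/{video_ids[0]}/hqdefault.jpg"


print(thumbnail_from_link("/watch?v=B05KjM_OLTA"))
```

One way to apply it to the DataFrame would be ***df['thumbnail_link'] = df['thumbnail_link'].fillna(df['video_link'].map(thumbnail_from_link))***, which leaves existing values untouched and fills only the missing ones.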

In [79]:
df.to_csv('data.csv', index=False)

In [80]:
df.tail()

| | title | views | video_link | date_time | thumbnail_link |
|---|---|---|---|---|---|
| 1043 | Recursive O(logy) function for pow(x,y) \| Geek... | 4.7K views | /watch?v=2a_Snu8qjko | 5 years ago | |
| 1044 | DSA Online Course \| GeeksforGeeks | 69K views | /watch?v=BnJWi0E3Mv0 | 5 years ago | |
| 1045 | Lets Create Threads (JAVA) \| GeeksforGeeks | 13K views | /watch?v=kviGixevLNA | 5 years ago | |
| 1046 | Introduction to Multithreading (JAVA) \| Geeks... | 48K views | /watch?v=J_7FtjWgoC4 | 5 years ago | |
| 1047 | How to declare a pointer to a function? \| Geek... | 4.6K views | /watch?v=eEHzPJrhK7g | 5 years ago | |
