# **YouTube Data Scraper Using Selenium**

Here I extract data from one particular channel (**@Kurzgesagt**).

This notebook demonstrates how to extract:

- **Video_link**
- **No_of_likes**
- **Date_of_upload**
- **No_of_views**
- **No_of_comments**
- **Cleaned_description**

**And how to save the data to a CSV file**


## Importing Libraries

In [27]:
# Import libraries

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from tqdm import tqdm
from datetime import datetime
import numpy as np
import pandas as pd
import json
import re
import time




Plain HTTP requests (for example with the `requests` library) don't work well on YouTube and many other JavaScript-heavy websites, because most of the content is added to the page by JavaScript after the initial HTML arrives. That's why we have to use a WebDriver.

A WebDriver can handle this because it actually runs a real browser, so the page loads exactly as it does when you open Chrome yourself.

How it works:

1️⃣ You install a WebDriver for your browser (Selenium 4.6 and later can download the matching driver for you), such as:

- chromedriver for Chrome
- geckodriver for Firefox

2️⃣ You use a tool like Selenium to tell that WebDriver what to do. (Playwright is a similar tool, but it talks to the browser through its own protocol rather than through a WebDriver.)

3️⃣ The WebDriver opens a real browser window for you, does all the clicks and typing, and returns the page's final HTML, including content loaded by JavaScript.
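
To make the difference concrete, here is a small sketch that needs no browser. Both HTML strings are made-up stand-ins (not real YouTube responses): one imitates what a plain HTTP request might receive, the other imitates `driver.page_source` after the browser has run the page's JavaScript. The helper names `TagCounter` and `count_tag` are hypothetical.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given tag is opened in an HTML document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag):
    counter = TagCounter(tag)
    counter.feed(html)
    return counter.count


# Made-up stand-in for the HTML a plain HTTP request might return:
# an empty container plus the script that would fill it in a real browser.
static_html = """
<html><body>
  <div id="contents"></div>
  <script>/* JavaScript that would add the video tiles */</script>
</body></html>
"""

# Made-up stand-in for driver.page_source after the script has run.
rendered_html = """
<html><body>
  <div id="contents">
    <ytd-rich-item-renderer><h3>Video A</h3></ytd-rich-item-renderer>
    <ytd-rich-item-renderer><h3>Video B</h3></ytd-rich-item-renderer>
  </div>
</body></html>
"""

print(count_tag(static_html, "ytd-rich-item-renderer"))    # 0
print(count_tag(rendered_html, "ytd-rich-item-renderer"))  # 2
```

The video tiles only exist in the second string, which is why the scraper reads the HTML from the browser instead of from a plain request.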

## Initialize Selenium WebDriver

In [154]:
driver = webdriver.Chrome()
driver.get('https://www.youtube.com/@kurzgesagt/videos')

#Wait for page to load
time.sleep(5)

Normally you would have to scroll down manually to the last video on the Videos page, because YouTube uses infinite scrolling: only a few videos load initially, and more load as you scroll down. If you don't scroll, Selenium will only find the first few videos.
The code below handles this by scrolling automatically until it reaches the bottom of the page.
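
The scroll-until-the-height-stops-growing idea can also be written as a reusable helper. The sketch below is one possible variant, not the code used in this notebook: the names `scroll_until_stable` and `FakePage` are hypothetical, and the `max_rounds` cap is an added safety limit so a page that never stops growing cannot loop forever. A fake page stands in for the browser so the sketch runs on its own.

```python
import time


def scroll_until_stable(get_height, scroll_down, pause=2.0, max_rounds=200):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_down()
        time.sleep(pause)  # give the page time to load the next batch
        new_height = get_height()
        if new_height == last_height:
            return rounds  # nothing new loaded, so this is the bottom
        last_height = new_height
    return max_rounds


class FakePage:
    """Stand-in for a browser: each scroll reveals the next height."""

    def __init__(self, heights):
        self.heights = heights
        self.position = 0

    def height(self):
        return self.heights[self.position]

    def scroll(self):
        self.position = min(self.position + 1, len(self.heights) - 1)


page = FakePage([1000, 2000, 3000])
print(scroll_until_stable(page.height, page.scroll, pause=0))  # 3
```

With a real driver, `get_height` would be `lambda: driver.execute_script("return document.documentElement.scrollHeight")` and `scroll_down` would be `lambda: driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")`.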


## Extracting Video Details
- Title
- Likes
- Views (exact number)
- Upload date and time
- Description
- Comments_Num
- Video_link
- Thumbnail_link

In [156]:
# Scrolling till bottom of the page automatically
last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)
    
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Parse the fully loaded page to collect every video's link and title
soup = BeautifulSoup(driver.page_source , 'html.parser')
video_block = soup.find_all('ytd-rich-item-renderer')


#creating empty list
data = []

#extracting link and title
#--------------------------link and title block------------------------------------

for sp in video_block:
    try:
        title = sp.find('h3').text
    except Exception:
        title = np.nan
    try:
        video_link = 'https://www.youtube.com' + sp.find('a').get('href')
    except Exception:
        video_link = np.nan  # no link found for this tile
    try:
        thumbnail_link = sp.find('img').get('src').split('?')[0]
    except Exception:
        thumbnail_link = np.nan
        
    #appending data and setting default values so nothing breaks if extraction fails     
    data.append({
        'title':title,
        'views':np.nan,
        'upload_date':np.nan,
        'upload_time':np.nan,
        'comments_nubm':np.nan,
        'vid_description':np.nan,
        'video_link':video_link,
        'thumbnail_link':thumbnail_link})


#--------------------------link and title block------------------------------------
  

# Iterate through each video and collect its data
for link in tqdm(data):
   
    try:
        driver.get(link['video_link'])
    
            
        time.sleep(5)
    
        #extracting Views, Date and time of upload
        #--------------------------views, date and time block------------------------------------
    
        
        # Search for the ytInitialPlayerResponse JSON
        pattern = r'var ytInitialPlayerResponse = ({.*?});'
        match = re.search(pattern,driver.page_source)
        
        if match:
            data_json = json.loads(match.group(1))
            # Exact view count
            exact_views = data_json['videoDetails']['viewCount']
            # Exact upload date (ISO format)
            raw_date = data_json['microformat']['playerMicroformatRenderer']['uploadDate']
            if raw_date:
                dt = datetime.fromisoformat(raw_date)
                formatted_date = dt.strftime('%d-%m-%Y')
                formatted_time = dt.strftime('%I:%M:%S %p')
            
                link['views']= exact_views
                link['upload_date'] = formatted_date
                link['upload_time'] = formatted_time
        
           
       
        #--------------------------views, date and time block------------------------------------
        
        #Extracting Comments
        #------------------------------------comments--------------------------------------------
        
        # Scroll down a bit first (YouTube needs a small scroll to load comments)
        try:
            driver.execute_script("window.scrollBy(0, 10000);")
            
            # Wait until the comment count element is present
            time.sleep(3)
            
            # finding the tag where comment num is present , then scrolling to it 
            comment_section  = driver.find_element(By.CSS_SELECTOR ,'yt-formatted-string.count-text.style-scope.ytd-comments-header-renderer' )
            driver.execute_script('arguments[0].scrollIntoView(true);',comment_section)
            comment_num = comment_section.text
            
            #getting exact comment number in int format
            link['comments_nubm'] = int(''.join(filter(str.isdigit,comment_num)))
            
        except:
            link['comments_nubm'] = np.nan #comments disabled or not found
        #------------------------------------comments--------------------------------------------
        
        #Extracting Cleaned Description
        #---------------------------------Cleaned Description------------------------------------
        
        # Try to click Show More if it exists
        try:
            try:
                show_more = WebDriverWait(driver, 2).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, "tp-yt-paper-button#expand"))
                )
                driver.execute_script("arguments[0].click();", show_more)
                #print("Clicked Show More button")
            except:
                pass #No Show More button
                
            span = driver.find_elements(By.CSS_SELECTOR , 'span.yt-core-attributed-string--link-inherit-color')
            des = ''
            
            for spans in span:
                text = spans.text
                if len(spans.text)>300:
                    des=  text
            
            def clear_description(des):
                # Normalize whitespace: split on any run of whitespace (newlines, tabs, spaces) and rejoin with single spaces
                des = ' '.join(des.split())
                # Remove junk keywords
                junk_keywords = ['OUR CHANNELS', 'Follow us', 'Subscribe', 'http', 'More videos', 'German']
                for word in junk_keywords:
                    if word in des:
                        des = des.split(word)[0]
                        break
                # Remove relative dates 
                des = re.sub(r'\b\d+\s+(second|seconds|minute|minutes|hour|hours|day|days|week|weeks|month|months|year|years)\s+ago\b', '', des, flags=re.IGNORECASE)
                return des.strip()
            
            link['vid_description']= clear_description(des)
        except:
            link['vid_description'] = np.nan

    except Exception as e:
        # Double quotes outside so the f-string also parses on Python versions before 3.12
        print(f"Error processing {link['video_link']}: {e}")
        
    
        #---------------------------------Cleaned Description------------------------------------
        
        
# Create a DataFrame from the collected list and save it as a CSV file
df = pd.DataFrame(data)
df.to_csv('YouTube(kurzgesagt)(2).csv' , index = False)

100%|██████████| 223/223 [51:10<00:00, 13.77s/it]
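
One possible hardening of the `ytInitialPlayerResponse` step, as a sketch: the non-greedy pattern `({.*?});` stops at the first `};` it meets, so it could cut the JSON short if a title or other string happened to contain those characters. `json.JSONDecoder().raw_decode` instead parses one complete JSON value from a given position and ignores whatever follows. The helper names `extract_player_response` and `parse_upload_date` are hypothetical, the sample page string is made up, and the sketch assumes the marker is followed directly by `{`, as the pattern above also assumes.

```python
import json
from datetime import datetime

MARKER = "var ytInitialPlayerResponse = "


def extract_player_response(page_source):
    """Return the JSON object that follows MARKER, or None if it is absent."""
    start = page_source.find(MARKER)
    if start == -1:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and returns it together with the index where parsing stopped.
    obj, _end = json.JSONDecoder().raw_decode(page_source, start + len(MARKER))
    return obj


def parse_upload_date(raw_date):
    # Python versions before 3.11 do not accept a trailing "Z" in fromisoformat
    return datetime.fromisoformat(raw_date.replace("Z", "+00:00"))


# Made-up page source; the title deliberately contains "};"
fake_page = (
    '<script>var ytInitialPlayerResponse = '
    '{"videoDetails": {"viewCount": "123", "title": "a}; b"}, '
    '"microformat": {"playerMicroformatRenderer": '
    '{"uploadDate": "2013-08-22T06:24:56-07:00"}}};'
    'var other = 1;</script>'
)

data_json = extract_player_response(fake_page)
raw_date = data_json['microformat']['playerMicroformatRenderer']['uploadDate']

print(data_json['videoDetails']['viewCount'])                 # 123
print(parse_upload_date(raw_date).strftime('%d-%m-%Y'))       # 22-08-2013
```

On this sample the regular expression would capture only up to the `};` inside the title, and `json.loads` would then raise an error.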


In [170]:
df

| Unnamed: 0 | title | views | upload_date | upload_time | comments_nubm | video_link | thumbnail_link |
|---|---|---|---|---|---|---|---|
| 0 | Let's Kill You a Billion Times to Make You Imm... | 2464691 | 29-07-2025 | 07:00:02 AM | | https://www.youtube.com/watch?v=7wK4peez9zE | https://i.ytimg.com/vi/7wK4peez9zE/hqdefault.jpg |
| 1 | When Your Body Attacks Itself – Autoimmune | 2358145 | 01-07-2025 | 07:00:01 AM | | https://www.youtube.com/watch?v=efOW5NUTYB8 | https://i.ytimg.com/vi/efOW5NUTYB8/hqdefault.jpg |
| 2 | How Nuclear Flies Protect You from Flesh-Eatin... | 4741889 | 03-06-2025 | 07:00:05 AM | 8454.0 | https://www.youtube.com/watch?v=zxq60I5RSW8 | https://i.ytimg.com/vi/zxq60I5RSW8/hqdefault.jpg |
| 3 | Why Does Fentanyl Feel So Good? | 6967320 | 20-05-2025 | 07:00:01 AM | 19392.0 | https://www.youtube.com/watch?v=m6KnVTYtSc0 | |
| 4 | What If It Rains Bananas For A Single Day? (Sp... | 4154942 | 06-05-2025 | 07:00:01 AM | 10662.0 | https://www.youtube.com/watch?v=tRXy-b6_lBc | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 218 | How The Stock Exchange Works (For Dummies) | 8488160 | 28-11-2013 | 09:03:32 AM | 6902.0 | https://www.youtube.com/watch?v=F3QpgXBtDeo | https://i.ytimg.com/vi/F3QpgXBtDeo/hqdefault.jpg |
| 219 | The Gulf Stream Explained | 6204998 | 11-10-2013 | 12:11:39 PM | 2010.0 | https://www.youtube.com/watch?v=UuGrBhK2c7U | https://i.ytimg.com/vi/UuGrBhK2c7U/hqdefault.jpg |
| 220 | Fracking explained: opportunity or danger | 7417981 | 03-09-2013 | 02:12:24 AM | 8109.0 | https://www.youtube.com/watch?v=Uti2niW2BRA | https://i.ytimg.com/vi/Uti2niW2BRA/hqdefault.jpg |
| 221 | The Solar System -- our home in space | 6461624 | 22-08-2013 | 06:24:56 AM | 6116.0 | https://www.youtube.com/watch?v=KsF_hdjWJjo | https://i.ytimg.com/vi/KsF_hdjWJjo/hqdefault.jpg |


In [160]:
df.isnull().sum()

title                0
views                0
upload_date          0
upload_time          0
comments_nubm        2
vid_description      0
video_link           0
thumbnail_link     129
dtype: int64

## **Note:**

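- I couldn't extract `vid_description` properly: most of the description data came back missing, because it was difficult to click "Show more" and then pick out the actual description text from the whole description section. (These are probably stored as empty strings rather than NaN, since `clear_description` returns an empty string when no long text is found, which would explain why `isnull()` reports 0 for `vid_description`.)
- I have still left that part in the code, in case anyone wants to see what I attempted.
- With slight modifications, this code can be applied to any YouTube channel to scrape most of these details.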
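
One possible workaround for the description problem, offered as an untested assumption rather than a confirmed fix: the `videoDetails` object that the code already reads `viewCount` from appears to also carry the description text under a `shortDescription` key. That key name is an assumption about YouTube's undocumented page data and may change or be absent. The helper name `description_from_player_response` and the sample dictionary below are hypothetical.

```python
def description_from_player_response(data_json):
    """Return the description text if the assumed key is present, else None."""
    # ASSUMPTION: the key is called 'shortDescription'; YouTube does not document this.
    return data_json.get('videoDetails', {}).get('shortDescription')


# Made-up sample shaped like the parsed ytInitialPlayerResponse
sample = {
    'videoDetails': {
        'viewCount': '123',
        'shortDescription': 'A short made-up description.\nSecond line.',
    }
}

description = description_from_player_response(sample)
if description is not None:
    # Collapse whitespace the same way clear_description does
    description = ' '.join(description.split())

print(description)                               # A short made-up description. Second line.
print(description_from_player_response({}))      # None
```

If the key turns out to exist, this would avoid clicking "Show more" altogether, and the result could still be passed through `clear_description` to drop the junk keywords.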
