# YouTube Video Scraper – Second Drop

## 1. Overview
This dataset was generated with a custom Python **web scraping pipeline** that collects metadata from YouTube videos.  
**Selenium** loads each video page in a real browser so that dynamic content renders, **BeautifulSoup** parses the resulting HTML, and the extracted fields are stored in CSV format.

The dataset is designed for **Exploratory Data Analysis (EDA)**, visualization, and insights generation from YouTube content.

---

## 2. Data Collection Process
- **Tool Used:** Python (Selenium + BeautifulSoup)
- **Target Website:** YouTube
- **Scraping Approach:**
  - Navigated to each video’s URL.
  - Waited a fixed 5 seconds for dynamic elements to load (description, like counts).
  - Extracted data from the rendered HTML, including YouTube's custom elements such as `yt-formatted-string`.
  - Stored results in CSV format.
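
The "wait for dynamic elements" step can be thought of as polling a condition until it holds or a timeout expires. Selenium offers this for browser elements through `WebDriverWait`; the sketch below only illustrates the same idea in plain Python so it can run without a browser. The helper name `wait_until` and its parameters are hypothetical and are not part of the scraper.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration; not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # the awaited value is ready
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # back off briefly before checking again


# Usage: a stand-in condition that becomes ready on the third check.
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "loaded" if attempts["count"] >= 3 else None


status = wait_until(element_ready, timeout=2.0, poll=0.01)
```

Compared with a fixed sleep, polling returns as soon as the content is available and fails loudly when it never appears.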

---

## 3. Dataset Structure

| Column Name    | Data Type | Description |
|----------------|-----------|-------------|
| `video_link`   | string    | Full URL to the YouTube video |
| `title`        | string    | Video title as shown on YouTube |
| `views`        | integer   | Total number of views |
| `likes`        | integer   | Number of likes on the video |
| `description`  | string    | Full text description of the video |
| `date`         | date      | Upload date of the video |
| `channel_name` | string    | Name of the channel uploading the video |
| `duration`     | string    | Length of the video in `HH:MM:SS` format |
| `comments`     | integer   | Total number of comments (if available) |

---

## 4. Data Quality Notes
- **Missing Values:** Some videos have no `likes` or `description` because the feature is disabled or the content is restricted; the scraper records these fields as `NaN`.
- **Dynamic Content:** View and like counts change continuously, so the values are accurate only as of the time of scraping.
- **Encoding:** All text fields are UTF-8 encoded to preserve emojis and special characters.
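
As a hedged illustration of the encoding and missing-value notes, the sketch below writes a few rows to CSV with an explicit UTF-8 encoding and reads them back using only the standard library. The row values, the file name `sample.csv`, and the helper names are made up for the example; the notebook itself saves its data with pandas.

```python
import csv
import os
import tempfile

FIELDNAMES = ["title", "like", "description"]

# Illustrative rows (values are made up); empty strings stand in for missing fields.
rows = [
    {"title": "Complete Roadmap 🔥", "like": "1.2K", "description": "Résumé tips"},
    {"title": "Likes hidden", "like": "", "description": ""},
]


def write_rows(path, rows):
    """Write dict rows to `path` as UTF-8 CSV with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read a UTF-8 CSV back into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Round trip through a temporary file that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    write_rows(csv_path, rows)
    restored = read_rows(csv_path)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform default, which matters on Windows where the default is often not UTF-8.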

---



## 5. Potential Uses
- **Trend Analysis:** Identify which topics gain the most engagement.
- **Content Strategy:** Understand optimal video lengths and posting times.
- **Sentiment Analysis:** Analyze description text for tone and recurring topics.
- **Time Series Analysis:** Study how views and likes vary with upload date. Tracking how a single video's counts change over time would require repeated scrapes.

---

## 6. Limitations
- The dataset represents a snapshot in time.
- Some private or region-restricted videos are excluded.
- YouTube's HTML structure changes frequently, requiring scraper updates.

---

## 7. Next Steps
- Perform **deep EDA** on engagement metrics.

---


In [1]:
import time
import numpy as np
import pandas as pd
from tqdm import tqdm

from bs4 import BeautifulSoup
from selenium import webdriver

import chromedriver_binary
from selenium.webdriver.common.by import By 

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


In [3]:
df = pd.read_csv('data.csv')

df.head()

Unnamed: 0,title,views,date_time,video_link,thumbnail_link
0,NIMCET 2026 | How To Prepare NIMCET 2026? | Bi...,1.3K views,1 day ago,/watch?v=Ho_PIAAVLlE,https://i.ytimg.com/vi/Ho_PIAAVLlE/hqdefault.jpg
1,NIMCET Roadmap | NIMCET 2026 Preparation | NIM...,2.3K views,1 day ago,/watch?v=cgeb6Gojoho,https://i.ytimg.com/vi/cgeb6Gojoho/hqdefault.jpg
2,AI Engineer Roadmap – How to Learn AI in 2025 ...,15K views,2 days ago,/watch?v=JagRXz_mTU8,https://i.ytimg.com/vi/JagRXz_mTU8/hqdefault.jpg
3,How to Score 9+ CGPA in College 🔥 Complete Roa...,10K views,3 days ago,/watch?v=UttzVuaF-f0,https://i.ytimg.com/vi/UttzVuaF-f0/hqdefault.jpg
4,Course Walkthrough - How to Utilize the Free C...,9K views,8 days ago,/watch?v=Dl-eEZlv_pk&pp=0gcJCccJAYcqIYzv,https://i.ytimg.com/vi/Dl-eEZlv_pk/hqdefault.jpg
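
The `date_time` values above are relative strings such as `1 day ago`, which only make sense together with the date of the scrape. The sketch below shows one way to turn them into approximate calendar dates. The function name `approximate_upload_date`, the 30-day month and 365-day year, and the scrape date used in the example are all assumptions.

```python
import re
from datetime import date, timedelta

# Days per unit; month and year lengths are rough approximations.
_UNIT_DAYS = {"minute": 0, "hour": 0, "day": 1, "week": 7, "month": 30, "year": 365}

_RELATIVE_RE = re.compile(r"(\d+)\s+(minute|hour|day|week|month|year)s?\s+ago")


def approximate_upload_date(text, scraped_on):
    """Estimate the upload date from text like '8 days ago'; None if unparseable."""
    if not isinstance(text, str):
        return None  # covers NaN and other missing values
    match = _RELATIVE_RE.fullmatch(text.strip())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return scraped_on - timedelta(days=amount * _UNIT_DAYS[unit])


# Usage with a hypothetical scrape date.
scraped_on = date(2025, 8, 10)
estimates = [approximate_upload_date(value, scraped_on) for value in ["1 day ago", "8 days ago"]]
```

Because YouTube rounds relative ages, the result is an estimate, and the error grows for values expressed in months or years.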


In [5]:
chromedriver_binary.chromedriver_filename

'C:\\Users\\Infinix\\anaconda3\\Lib\\site-packages\\chromedriver_binary\\chromedriver.exe'

In [7]:
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options



In [11]:
# Start Chrome; Service() without a path lets Selenium locate the driver
# (chromedriver_binary adds one to PATH on import)
options = Options()
service = Service()
browser = webdriver.Chrome(service=service, options=options)
browser.get('https://www.youtube.com/')

time.sleep(2)

data = []

for link in tqdm(df['video_link']):

    # video_link values already start with '/', so the base URL has no trailing slash
    link = 'https://www.youtube.com' + link
    browser.get(link)

    # Fixed pause so dynamic elements (description, like count) can render
    time.sleep(5)
    soup = BeautifulSoup(browser.page_source, 'html.parser')

    # Each field falls back to NaN when its element is missing
    try:
        title = soup.find('yt-formatted-string', class_="style-scope ytd-watch-metadata").text
    except Exception:
        title = np.nan

    try:
        info_tag = soup.find('yt-formatted-string', id='info')
        if info_tag:
            spans = info_tag.find_all('span')
            views = spans[0].text.strip() if len(spans) > 0 else "Views not found"
        else:
            views = "Info section not found"
    except Exception:
        # Fixed: the original assigned to `view`, leaving `views` stale or undefined
        views = np.nan

    try:
        date_time = soup.find_all('yt-formatted-string', class_='style-scope ytd-video-primary-info-renderer')[1].text.strip()
    except Exception:
        date_time = np.nan

    try:
        like = soup.find('button-view-model', class_="yt-spec-button-view-model").text
    except Exception:
        like = np.nan

    try:
        description = soup.find('ytd-text-inline-expander', class_="style-scope ytd-watch-metadata").text
    except Exception:
        description = np.nan

    data.append([title, date_time, like, views, link, description])

# Close the browser once all pages are processed
browser.quit()




100%|████████████████████████████████████████████████████████████████████████████| 2047/2047 [4:56:34<00:00,  8.69s/it]
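
The loop above builds each URL by concatenating the base URL with the `video_link` value, and some of those values carry extra query parameters (for example `&pp=...`). A minimal sketch of an alternative is to resolve the relative link and keep only the `v` parameter. The function name `canonical_watch_url` is hypothetical, and dropping the other parameters is an assumption about what is needed downstream.

```python
from urllib.parse import parse_qs, urljoin, urlparse

BASE_URL = "https://www.youtube.com"


def canonical_watch_url(relative_link):
    """Return 'https://www.youtube.com/watch?v=<id>' or None if no video id is found."""
    # urljoin handles the leading '/' so no double slash is produced.
    absolute = urljoin(BASE_URL, relative_link)
    video_ids = parse_qs(urlparse(absolute).query).get("v")
    if not video_ids:
        return None
    return f"{BASE_URL}/watch?v={video_ids[0]}"


# Usage with links taken from data.csv as shown in this notebook.
plain = canonical_watch_url("/watch?v=Ho_PIAAVLlE")
with_extra = canonical_watch_url("/watch?v=Dl-eEZlv_pk&pp=0gcJCccJAYcqIYzv")
```

Normalizing links this way also gives a stable key (the video id) for detecting duplicate rows.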


In [13]:
len(data)

2047

In [15]:
df = pd.DataFrame(data, columns = ['title','date_time','like','views','link','description'])


In [17]:
df.to_csv('gfg_data.csv',index = False)

In [19]:
df['description']

0       NIMCET Batch Link - https://www.geeksforgeeks....
1       NIMCET Batch Link - https://www.geeksforgeeks....
2       Are you ready to launch your career as an AI E...
3       Want to score 9+ CGPA consistently in college,...
4       Bonus Rewards added: Get a chance to win free ...
                              ...                        
2042    Explanation for the article: http://www.geeksf...
2043    Explanation for the article: http://geeksquiz....
2044    Explanation for the article: http://www.geeksf...
2045    Explanation for the Article: http://www.geeksf...
2046    Here's you next clue  - Our comprehensive guid...
Name: description, Length: 2047, dtype: object