# Cat-egories YouTube Channel Data Scraper

This notebook uses the YouTube Data API v3 to collect metadata from cat-themed YouTube channels.

## What it does:
- Reads channel IDs from `accounts.txt`
- Fetches channel metadata (subscribers, total views, etc.)
- Retrieves video data including:
  - Titles, descriptions, tags, hashtags
  - View counts, likes, comments
  - Published dates
- Exports data to separate CSV files for each channel
- Creates a summary CSV with all channels

## Before running:
1. **Get a YouTube API Key:**
   - Go to [Google Cloud Console](https://console.cloud.google.com/)
   - Create a new project or select an existing one
   - Enable YouTube Data API v3
   - Create credentials (API Key)

2. **Add your API key to the `.env` file:**
   - Open the `.env` file in this directory
   - Set `YOUTUBE_API_KEY` by replacing `your_api_key_here` with your actual YouTube API key
   - Save the file

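3. **Add channel identifiers** to `accounts.txt`, one per line (already populated with examples). Channel IDs (`UC...`), `@handles`, and legacy usernames are accepted; blank lines and lines starting with `#` are skipped.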

## What Data Gets Scraped?

### For Each Channel:
The scraper collects **channel-level metadata**:
- **Channel Title** - Name of the channel
- **Channel Description** - About section text
- **Subscriber Count** - Number of subscribers
- **Total View Count** - All-time views across all videos
- **Video Count** - Total number of videos published
- **Published Date** - When the channel was created

### For Each Video (currently up to 50 per channel):
The scraper collects **video-level data**:

**Content Information:**
- **Video Title** - The video's title
- **Description** - Full video description text
- **Tags** - YouTube tags the creator assigned (stored as pipe-separated: `tag1|tag2|tag3`)
- **Hashtags** - Hashtags extracted from the description (stored as pipe-separated)
- **Duration** - Video length as an ISO 8601 duration string (e.g. `PT5M32S`)
- **Published Date** - When the video was uploaded
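
Because tags and hashtags are stored as pipe-separated strings, they need to be split back into lists when a CSV is read. The following is a minimal sketch using only the standard library; the sample rows and the `split_pipe` helper are illustrative and not part of the notebook.

```python
import csv
import io

def split_pipe(value):
    """Split a pipe-separated field into a list; an empty field gives []."""
    return value.split('|') if value else []

# Illustrative CSV content in the same shape the scraper writes.
sample_csv = io.StringIO(
    "video_id,tags,hashtags\n"
    "abc123,cat|funny|pets,#catsofyoutube|#funny\n"
    "def456,,\n"
)

rows = []
for row in csv.DictReader(sample_csv):
    row['tags'] = split_pipe(row['tags'])
    row['hashtags'] = split_pipe(row['hashtags'])
    rows.append(row)

print(rows[0]['tags'])
print(rows[1]['tags'])
```

Note that a tag which itself contains a `|` character would be split incorrectly with this approach.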

**Engagement Metrics:**
- **View Count** - Number of views
- **Like Count** - Number of likes
- **Comment Count** - Number of comments

**Identifiers:**
- **Video ID** - Unique YouTube video identifier

Channel ID and channel title are not repeated in the per-video rows; they are stored in the summary CSV.

### Example Row from CSV:
```
video_id: dQw4w9WgXcQ
title: Cute Cat Playing with Box
description: My cat loves this box! #catsofyoutube #funny
tags: cat|funny|pets|animals
hashtags: #catsofyoutube|#funny
view_count: 125000
like_count: 3500
comment_count: 450
duration: PT5M32S
published_at: 2024-03-15T10:30:00Z
```
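
The `duration` value is an ISO 8601 duration string, which is awkward to sort or average. One possible way to convert it to seconds is sketched below; `duration_to_seconds` is a hypothetical helper, and the pattern only covers days, hours, minutes, and whole seconds.

```python
import re

# Illustrative subset of ISO 8601 durations: PT5M32S, PT1H2M3S, P1DT2H, ...
_DURATION_RE = re.compile(
    r'^P(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$'
)

def duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to whole seconds, or None if it does not match."""
    match = _DURATION_RE.match(duration)
    if not match:
        return None
    # Missing components (e.g. no hours in PT5M32S) count as zero.
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (parts['days'] * 86400 + parts['hours'] * 3600
            + parts['minutes'] * 60 + parts['seconds'])

print(duration_to_seconds('PT5M32S'))  # 332
```

Applied to a DataFrame column, this could be used as `df['duration'].map(duration_to_seconds)`.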

## 1. Install Dependencies

In [None]:
%pip install google-api-python-client
%pip install pandas tqdm python-dotenv

## 2. Import Libraries, API Config, and Print Helpers

In [None]:
# Import required libraries
from googleapiclient.discovery import build
import pandas as pd
from datetime import datetime
import os
from tqdm import tqdm
import json
from dotenv import load_dotenv

# Color codes for terminal output
class Colors:
    GREEN = '\033[92m'
    RED = '\033[91m'
    YELLOW = '\033[93m'
    BLUE = '\033[94m'
    RESET = '\033[0m'
    BOLD = '\033[1m'

def print_success(message):
    """Print success message in green"""
    print(f"{Colors.GREEN}{message}{Colors.RESET}")

def print_error(message):
    """Print error message in red"""
    print(f"{Colors.RED}{message}{Colors.RESET}")

def print_warning(message):
    """Print warning message in yellow"""
    print(f"{Colors.YELLOW}{message}{Colors.RESET}")

def print_info(message):
    """Print info message in blue"""
    print(f"{Colors.BLUE}{message}{Colors.RESET}")

# Load environment variables from .env file
load_dotenv()

# Get YouTube API Key from environment variable
API_KEY = os.getenv('YOUTUBE_API_KEY')

if not API_KEY or API_KEY == 'your_api_key_here':
    raise ValueError("Please set your YOUTUBE_API_KEY in the .env file")

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=API_KEY)
print_success("YouTube API client initialized successfully")

## 3. Define Helper Functions

These functions handle:
- Reading channel IDs from file
- Fetching channel information
- Retrieving video lists
- Getting detailed video metadata

In [None]:
def resolve_channel_identifier(youtube, identifier):
    """
    Resolve a channel identifier to a channel ID.
    Handles @handles, legacy usernames, and direct channel IDs.
    """
    identifier = identifier.strip()
    
    # If it's already a channel ID (starts with UC), return it
    if identifier.startswith('UC') and len(identifier) == 24:
        return identifier
    
    # If it starts with @, it's a username handle
    if identifier.startswith('@'):
        username = identifier[1:]  # Remove the @
        try:
            request = youtube.channels().list(
                part='id',
                forHandle=username
            )
            response = request.execute()
            if response.get('items'):
                return response['items'][0]['id']
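        except Exception as e:
            # Report the failure, then fall through to the username lookup
            print_warning(f"Handle lookup failed for {identifier}: {e}")

    # Fall back to a legacy username lookup (forUsername)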
    try:
        request = youtube.channels().list(
            part='id',
            forUsername=identifier.replace('@', '')
        )
        response = request.execute()
        if response.get('items'):
            return response['items'][0]['id']
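    except Exception as e:
        # Report the failure; the caller receives None below
        print_warning(f"Username lookup failed for {identifier}: {e}")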
    
    print_error(f"Could not resolve: {identifier}")
    return None


def read_channel_ids(filename='accounts.txt'):
    """
    Read channel identifiers from a text file (resolution to channel IDs happens later).
    Skips empty lines and lines starting with #.
    """
    identifiers = []
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('#'):
                identifiers.append(line)
    return identifiers

# Test the function
identifiers = read_channel_ids('accounts.txt')
print_info(f"Found {len(identifiers)} channel identifiers:")
for identifier in identifiers:
    print(f"  - {identifier}")

In [None]:
def get_channel_info(youtube, channel_id):
    """
    Fetch channel metadata including title, description, subscriber count, view count, etc.
    """
    try:
        request = youtube.channels().list(
            part='snippet,statistics,contentDetails',
            id=channel_id
        )
        response = request.execute()
        
        if not response.get('items'):
            print(f"No channel found for ID: {channel_id}")
            return None
        
        channel = response['items'][0]
        
        # Clean description for CSV compatibility
        description = channel['snippet']['description']
        description_clean = description.replace('\n', ' ').replace('\r', ' ')
        
        channel_data = {
            'channel_id': channel_id,
            'channel_title': channel['snippet']['title'],
            'channel_description': description_clean,  # Cleaned description
            'published_at': channel['snippet']['publishedAt'],
            'subscriber_count': channel['statistics'].get('subscriberCount', 0),
            'view_count': channel['statistics'].get('viewCount', 0),
            'video_count': channel['statistics'].get('videoCount', 0),
            'uploads_playlist_id': channel['contentDetails']['relatedPlaylists']['uploads']
        }
        
        return channel_data
    except Exception as e:
        print(f"Error fetching channel info for {channel_id}: {e}")
        return None

In [None]:
def get_channel_videos(youtube, uploads_playlist_id, max_results=50):
    """
    Fetch video IDs from a channel's uploads playlist.
    Returns a list of video IDs.
    """
    video_ids = []
    next_page_token = None
    
    try:
        while len(video_ids) < max_results:
            request = youtube.playlistItems().list(
                part='contentDetails',
                playlistId=uploads_playlist_id,
                # The API returns at most 50 items per page; raise max_results to page through more
                maxResults=min(50, max_results - len(video_ids)),
                pageToken=next_page_token
            )
            response = request.execute()
            
            for item in response['items']:
                video_ids.append(item['contentDetails']['videoId'])
            
            next_page_token = response.get('nextPageToken')
            
            if not next_page_token:
                break
                
    except Exception as e:
        print(f"Error fetching videos: {e}")
    
    return video_ids

In [None]:
def get_video_details(youtube, video_ids):
    """
    Fetch detailed information for a list of video IDs.
    Includes title, description, tags, views, likes, comments, etc.
    NOTE: Does NOT include channel-level data - that goes in the summary CSV.
    """
    all_video_data = []
    
    # YouTube API allows max 50 videos per request
    for i in range(0, len(video_ids), 50):
        batch = video_ids[i:i+50]
        
        try:
            request = youtube.videos().list(
                part='snippet,statistics,contentDetails',
                id=','.join(batch)
            )
            response = request.execute()
            
            for video in response['items']:
                # Extract hashtags from description
                description = video['snippet'].get('description', '')
                # Replace newlines with spaces to prevent CSV issues
                description_clean = description.replace('\n', ' ').replace('\r', ' ')
                
                hashtags = [word for word in description.split() if word.startswith('#')]
                
                video_data = {
                    'video_id': video['id'],
                    'title': video['snippet']['title'],
                    'description': description_clean,  # Cleaned description
                    'published_at': video['snippet']['publishedAt'],
                    'tags': '|'.join(video['snippet'].get('tags', [])),  # Join tags with |
                    'hashtags': '|'.join(hashtags),  # Join hashtags with |
                    'duration': video['contentDetails']['duration'],
                    'view_count': video['statistics'].get('viewCount', 0),
                    'like_count': video['statistics'].get('likeCount', 0),
                    'comment_count': video['statistics'].get('commentCount', 0),
                }
                
                all_video_data.append(video_data)
                
        except Exception as e:
            print(f"Error fetching video details: {e}")
    
    return all_video_data

In [None]:
def scrape_channel_data(youtube, channel_id, max_videos=50):
    """
    Main function to scrape all data for a single channel.
    Returns a tuple (df, channel_info): a DataFrame of video details and the
    channel metadata dict, or (None, None) if the channel lookup fails.
    """
    print(f"\n{'='*60}")
    print(f"Scraping channel: {channel_id}")
    print(f"{'='*60}")
    
    # Get channel info
    channel_info = get_channel_info(youtube, channel_id)
    if not channel_info:
        return None, None
    
    print(f"Channel: {channel_info['channel_title']}")
    print(f"Subscribers: {channel_info['subscriber_count']}")
    print(f"Total Views: {channel_info['view_count']}")
    print(f"Total Videos: {channel_info['video_count']}")
    
    # Get video IDs
    print(f"\nFetching up to {max_videos} videos...")
    video_ids = get_channel_videos(youtube, channel_info['uploads_playlist_id'], max_videos)
    print(f"Found {len(video_ids)} videos")
    
    # Get video details
    print("Fetching video details...")
    video_data = get_video_details(youtube, video_ids)
    
    # Create DataFrame
    df = pd.DataFrame(video_data)
    
    # Convert numeric columns
    numeric_cols = ['view_count', 'like_count', 'comment_count']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)
    
    print(f"Successfully scraped {len(df)} videos")
    
    return df, channel_info

## 4. Main Scraping Process
Scrape every channel listed in `accounts.txt` and save the results under `data/`.

### File Structure:
 - **Individual channel CSVs**: `data/{ChannelName}.csv` (channel title with special characters removed and spaces replaced by underscores)
   - Contains ONLY video data (no channel info)
 - **Summary CSV**: `data/channels_summary.csv`
   - Contains one row per channel with channel metadata
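
The per-channel filename is derived from the channel title. A minimal sketch of that clean-up as a standalone function follows; `safe_filename` and its `fallback` argument are hypothetical names introduced for illustration, with the fallback covering titles that contain no usable characters.

```python
def safe_filename(channel_title, fallback):
    """Build a file-system-friendly name from a channel title.

    Keeps alphanumerics, spaces, hyphens, and underscores, converts spaces to
    underscores, and returns `fallback` if nothing usable remains.
    """
    kept = ''.join(c for c in channel_title if c.isalnum() or c in (' ', '-', '_'))
    cleaned = kept.replace(' ', '_').strip('_')
    return cleaned or fallback

print(safe_filename('Cats & Co. (Official)', 'UC_example'))
print(safe_filename('!!!', 'UC_example'))
```

Two channels whose titles reduce to the same cleaned name would overwrite each other's file, so including the channel ID in the filename may be worth considering.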

In [None]:
# Create data directory if it doesn't exist
os.makedirs('data', exist_ok=True)

# Read channel identifiers
identifiers = read_channel_ids('accounts.txt')

print_info("Resolving channel identifiers...")
channel_ids = []
for identifier in identifiers:
    channel_id = resolve_channel_identifier(youtube, identifier)
    if channel_id:
        print_success(f"  [SUCCESS] {identifier} -> {channel_id}")
        channel_ids.append(channel_id)
    else:
        print_error(f"  [FAILED] {identifier} -> Could not resolve")

print_info(f"\nResolved {len(channel_ids)} out of {len(identifiers)} channels\n")

# Store all results
all_channels_data = []
channel_metadata = []

# Scrape each channel
for channel_id in tqdm(channel_ids, desc="Scraping channels"):
    df, channel_info = scrape_channel_data(youtube, channel_id, max_videos=50)
    
    if df is not None and channel_info is not None:
        channel_name = channel_info['channel_title']
        # Keep only alphanumerics, spaces, hyphens, and underscores for the filename
        clean_name = ''.join(c for c in channel_name if c.isalnum() or c in (' ', '-', '_'))
        # Fall back to the channel ID if the title has no usable characters
        clean_name = clean_name.replace(' ', '_').strip('_') or channel_id
        filename = f"data/{clean_name}.csv"
        
        df.to_csv(filename, index=False, encoding='utf-8')
        print_success(f"[SAVED] {filename}\n")
        
        all_channels_data.append(df)
        channel_metadata.append(channel_info)

print(f"\n{'='*60}")
print_success("Scraping Complete!")
print_info(f"Successfully scraped {len(all_channels_data)} channels")
print_info("CSV files saved in the 'data/' directory")
print(f"{'='*60}")

## 5. Summary CSV Creation
- Channel metadata collected during scraping is combined with per-channel engagement averages computed from the scraped videos
- The merged result is saved as `data/channels_summary.csv`

In [None]:
# Create summary DataFrame from channel metadata
summary_df = pd.DataFrame(channel_metadata)

# Calculate average engagement per channel from video data
# Match up with channel metadata by index (they're in the same order)
engagement_stats = []
for i, df in enumerate(all_channels_data):
    if len(df) > 0:
        # Get the corresponding channel info
        channel_info = channel_metadata[i]
        
        stats = {
            'channel_id': channel_info['channel_id'],
            'channel_title': channel_info['channel_title'],
            'total_videos_scraped': len(df),
            'avg_views': df['view_count'].mean(),
            'avg_likes': df['like_count'].mean(),
            'avg_comments': df['comment_count'].mean(),
            'total_views_scraped_videos': df['view_count'].sum(),
            'total_likes_scraped_videos': df['like_count'].sum(),
        }
        engagement_stats.append(stats)

engagement_df = pd.DataFrame(engagement_stats)

# Merge with channel metadata
if len(engagement_df) > 0:
    summary_full = pd.merge(summary_df, engagement_df, on=['channel_id', 'channel_title'], how='left')
    
    # Save summary
    summary_full.to_csv('data/channels_summary.csv', index=False)
    print("Channel Summary:")
    print(summary_full[['channel_title', 'subscriber_count', 'video_count', 
                        'total_videos_scraped', 'avg_views', 'avg_likes']].to_string(index=False))
    print("\nSummary saved to: data/channels_summary.csv")
else:
    print("No data scraped")