<a href="https://www.kaggle.com/code/georgiosspyrou1/youtube-video-virality-predictor-eda?scriptVersionId=225303221" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🌍 YouTube Trending Videos Analysis & Virality Prediction 🚀  

## 📌 Project Overview  
This project aims to **collect, analyze, and model YouTube trending video data** across multiple countries using the **YouTube API**. By gathering daily trending video statistics, we seek to uncover the patterns and factors that contribute to a video's virality.  

## 🛠️ How the Data is Collected  
The data is automatically collected using the **YouTube API** and stored in this GitHub repository:  
🔗 [Trending Video Metadata](https://github.com/gpsyrou/tube-virality/tree/main/assets/meta/trending)  

The main project repository can be found here: https://github.com/gpsyrou/tube-virality

### **Data Collection Process**  
1. **Fetching Trending Videos**:  
   - We start by retrieving the latest trending videos from **multiple countries** via the **YouTube API**.  
   - These trending videos are stored in JSON format in our repository.  

2. **Daily Statistics Updates**:  
   - Each day, a GitHub Actions workflow runs automatically to **gather fresh statistics** for all stored trending videos.  
   - The script fetches video engagement metrics (views, likes, comments, etc.) and updates them in JSON files.  
   - The updated dataset is saved here:  
     🔗 [Video Statistics](https://github.com/gpsyrou/tube-virality/tree/main/assets/meta/video_stats)  
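
The collection script itself is not shown in this notebook, so the following is only a minimal sketch of what the fetching step could look like. It assumes the YouTube Data API v3 `videos.list` method with `chart=mostPopular`; the helper names `build_trending_url` and `flatten_item` are hypothetical, and the selected fields are chosen to match the column names used later in this notebook.

```python
from urllib.parse import urlencode

# Endpoint of the YouTube Data API v3 `videos.list` method.
VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def build_trending_url(region_code: str, api_key: str, max_results: int = 50) -> str:
    """Build a `videos.list` request URL for one country's most-popular chart."""
    params = {
        "part": "snippet,statistics,contentDetails",
        "chart": "mostPopular",
        "regionCode": region_code,   # ISO 3166-1 alpha-2 country code, e.g. "GR"
        "maxResults": max_results,   # the API accepts values from 1 to 50
        "key": api_key,
    }
    return f"{VIDEOS_ENDPOINT}?{urlencode(params)}"


def flatten_item(item: dict) -> dict:
    """Pick a few fields out of one `items[]` entry of the API response."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "id": item.get("id"),
        "title": snippet.get("title"),
        "publishedAt": snippet.get("publishedAt"),
        "categoryId": snippet.get("categoryId"),
        # The API returns counts as strings; they are cast to int during cleaning.
        "viewCount": stats.get("viewCount"),
        "likeCount": stats.get("likeCount"),
        "commentCount": stats.get("commentCount"),
    }
```

Keeping URL construction and response flattening as separate pure functions makes both steps easy to check without an API key or network access.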

## 🔍 Objectives  
- **Understand Viral Trends**: Identify factors that make a video go viral across different countries.  
- **Statistical Analysis**: Explore engagement trends (views, likes, comments) over time.  
- **Machine Learning Models**: Build predictive models to estimate a video's likelihood of becoming viral.  

## 📊 Dataset & Features  
Our dataset consists of **daily trending videos** from multiple regions, with key features such as:  
- **Video Metadata**: Title, description, channel, published date, tags  
- **Engagement Metrics**: Views, likes, comments, favorites  
- **Technical Details**: Duration, resolution, captions availability  
- **Categorical Information**: Topic categories, privacy settings, license type  

## 🛠️ Methodology  
1. **Data Collection**: Fetching trending video data daily using the YouTube API.  
2. **Data Cleaning & Transformation**: Standardizing column formats, handling missing values, and processing categorical features.  
3. **Exploratory Analysis**: Visualizing key trends in video engagement across countries.  
4. **Feature Engineering**: Extracting meaningful attributes (e.g., video age, growth rate).  
5. **Model Building**: Training machine learning models (classification, regression) to predict virality.  
6. **Evaluation & Interpretation**: Understanding model results and key drivers of viral content.  
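
As a rough illustration of the feature engineering step, the sketch below derives a video's age and its day-over-day view growth from daily snapshots. The data is made up for illustration, and the derived column names `video_age_days` and `view_growth_rate` are assumptions rather than names taken from the project.

```python
import pandas as pd

# Hypothetical daily snapshots for two videos; values are made up for illustration.
snapshots = pd.DataFrame({
    "video_id": ["a", "a", "a", "b", "b"],
    "collection_day": pd.to_datetime(
        ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-01", "2025-01-02"]
    ),
    "published_at": pd.to_datetime(
        ["2024-12-30", "2024-12-30", "2024-12-30", "2025-01-01", "2025-01-01"]
    ),
    "view_count": [1000, 1500, 1800, 200, 800],
})

# Growth is only meaningful when snapshots are ordered in time within each video.
snapshots = snapshots.sort_values(["video_id", "collection_day"])

# Video age in days at the time of each snapshot.
snapshots["video_age_days"] = (
    snapshots["collection_day"] - snapshots["published_at"]
).dt.days

# Day-over-day relative growth in views per video (NaN on each video's first snapshot).
snapshots["view_growth_rate"] = snapshots.groupby("video_id")["view_count"].pct_change()

print(snapshots[["video_id", "video_age_days", "view_growth_rate"]])
```

Grouping by `video_id` before calling `pct_change` prevents the first snapshot of one video from being compared with the last snapshot of another.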

## 🚀 Expected Outcomes  
- A **comprehensive dataset** of YouTube trending videos across multiple countries.  
- Insights into **what makes a video go viral** based on statistical analysis.  
- Predictive models that can estimate a video's potential virality.  
- Open-source tools for **researchers, content creators, and marketers** to better understand online video trends.  


In [1]:
import json
import requests
import pandas as pd
from io import StringIO

In [2]:
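class GitHubDataLoader:
    """Downloads a CSV file from a public GitHub repository via its raw URL."""

    def __init__(self, user, repo, file_path, branch="main"):
        self.user = user
        self.repo = repo
        self.branch = branch
        self.file_path = file_path
        self.raw_url = f"https://raw.githubusercontent.com/{self.user}/{self.repo}/{self.branch}/{self.file_path}"

    def get_csv_file(self, timeout=30):
        """Return the raw CSV text; raises requests.HTTPError on an error response."""
        # A timeout keeps the notebook from hanging if GitHub is unreachable
        response = requests.get(self.raw_url, timeout=timeout)
        response.raise_for_status()
        return response.text

    def csv_to_dataframe(self, csv_content):
        """Parse CSV text into a DataFrame."""
        return pd.read_csv(StringIO(csv_content))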

In [3]:
YOUTUBE_CATEGORY_MAP = {
    1: "Film & Animation",
    2: "Autos & Vehicles",
    10: "Music",
    15: "Pets & Animals",
    17: "Sports",
    18: "Short Movies",
    19: "Travel & Events",
    20: "Gaming",
    21: "Videoblogging",
    22: "People & Blogs",
    23: "Comedy",
    24: "Entertainment",
    25: "News & Politics",
    26: "Howto & Style",
    27: "Education",
    28: "Science & Technology",
    29: "Nonprofits & Activism",
    30: "Movies",
    31: "Anime/Animation",
    32: "Action/Adventure",
    33: "Classics",
    34: "Comedy (Movies)",
    35: "Documentary",
    36: "Drama",
    37: "Family",
    38: "Foreign",
    39: "Horror",
    40: "Sci-Fi/Fantasy",
    41: "Thriller",
    42: "Shorts",
    43: "Shows",
    44: "Trailers"
}

## Trending Videos Analysis

In [4]:
loader = GitHubDataLoader("gpsyrou", "tube-virality", "db/ods/trending_videos.csv")

trending_videos_content = loader.get_csv_file()

df = loader.csv_to_dataframe(trending_videos_content)

In [5]:
print(df.shape)

(3000, 15)


In [6]:
class ODSToStageProcessor:
    """
    Processes and cleans YouTube data from ODS (raw) format to a cleaner stage format.
    """
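    def __init__(self, category_map: dict = None):
        # Avoid a mutable default argument; fall back to an empty mapping
        self.category_map = category_map if category_map is not None else {}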
        
    def trending_ods_to_stage(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Cleans and preprocesses a YouTube trending videos DataFrame for analysis.
        """
        df = df.copy()
        df.rename(columns={'id': 'video_id'}, inplace=True)
        
        # Parse 'publishedAt' and store it as a timezone-free 'YYYY-MM-DD HH:MM:SS' string
        df["publishedAt"] = pd.to_datetime(df["publishedAt"], errors="coerce").dt.strftime('%Y-%m-%d %H:%M:%S')

        # Convert float columns to int (handling NaN values first)
        for col in ["viewCount", "likeCount", "commentCount"]:
            df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)

        # Ensure categoryId is numeric and convert to Int64 before mapping
        df["categoryId"] = pd.to_numeric(df["categoryId"], errors="coerce").fillna(0).astype("Int64")

        # Map categoryId to category description
        df["category_descr"] = df["categoryId"].map(self.category_map).fillna("Unknown")
        
        # Convert 'defaultAudioLanguage' to uppercase if it exists; the nullable
        # "string" dtype keeps missing values as <NA> instead of the text "NAN"
        if "defaultAudioLanguage" in df.columns:
            df["defaultAudioLanguage"] = df["defaultAudioLanguage"].astype("string").str.upper()

        # Define the preferred column order
        column_order = [
            "video_id", "collection_date", "position", "publishedAt", "title", "channelTitle", "categoryId", "category_descr",
            "viewCount", "likeCount", "commentCount", "defaultAudioLanguage"
        ]
        
        # Reorder columns, keeping any additional columns at the end
        df = df[[col for col in column_order if col in df.columns] + [col for col in df.columns if col not in column_order]]
        
        return df
    

    def video_stats_ods_to_stage(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Cleans and preprocesses a YouTube video statistics DataFrame for analysis.
        """
        df = df.copy()
        
        # Parse 'published_at' and store it as a timezone-free 'YYYY-MM-DD HH:MM:SS' string
        df["published_at"] = pd.to_datetime(df["published_at"], errors="coerce").dt.strftime('%Y-%m-%d %H:%M:%S')
        
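        # Convert boolean columns to proper boolean types. Compare text values
        # instead of calling astype(bool), which would turn NaN and the string
        # "False" into True; missing values become False here
        for col in ["caption", "licensed_content", "embeddable", "public_stats_viewable"]:
            df[col] = df[col].astype(str).str.lower().isin(["true", "1", "1.0"])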
        
        # Convert numeric columns to integers
        for col in ["view_count", "like_count", "comment_count"]:
            df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
        
        # Convert 'tags' column to a comma-separated string if it's a list
        if "tags" in df.columns:
            df["tags"] = df["tags"].apply(lambda x: ", ".join(x) if isinstance(x, list) else str(x))
        
        # Transform 'duration' column from PT format to minutes
        if "duration" in df.columns:
            def pt_to_minutes(pt_string):
                """Convert an ISO 8601 duration (e.g. 'PT1H2M3S' or 'P1DT2H') to minutes."""
                if not isinstance(pt_string, str) or not pt_string.startswith("P"):
                    return 0.0

                # Split into the date part (days) and the time part (after 'T')
                date_part, _, time_str = pt_string[1:].partition("T")

                days = hours = minutes = seconds = 0
                try:
                    # Videos longer than 24 hours carry a day component, e.g. 'P1DT2H'
                    if date_part.endswith("D"):
                        days = int(date_part[:-1])
                    if "H" in time_str:
                        head, time_str = time_str.split("H", 1)
                        hours = int(head)
                    if "M" in time_str:
                        head, time_str = time_str.split("M", 1)
                        minutes = int(head)
                    if "S" in time_str:
                        seconds = int(time_str.rstrip("S"))
                except ValueError:
                    # Unparseable durations fall back to 0.0, like non-string input
                    return 0.0

                # Convert everything to minutes
                return (days * 24 * 60) + (hours * 60) + minutes + (seconds / 60)
            
            df["duration_in_minutes"] = df["duration"].apply(pt_to_minutes)
        
        # Define the preferred column order
        column_order = [
            "video_id", "channel_id", "title", "description", "published_at", "tags", "view_count", "like_count", "comment_count", 
            "duration", "duration_in_minutes", "dimension", "definition", "caption", "licensed_content", "projection", "privacy_status", "license", 
            "embeddable", "public_stats_viewable", "topic_categories", "collection_day", "country_code"
        ]
        
        # Reorder columns, keeping any additional columns at the end
        df = df[[col for col in column_order if col in df.columns] + [col for col in df.columns if col not in column_order]]
        
        return df

In [7]:
ots_processor = ODSToStageProcessor(category_map=YOUTUBE_CATEGORY_MAP)

In [8]:
trending_df = ots_processor.trending_ods_to_stage(df=df)

## Video Statistics Analysis

In [9]:
loader = GitHubDataLoader("gpsyrou", "tube-virality", "db/ods/merged_video_stats.csv")

video_stats_content = loader.get_csv_file()

df = loader.csv_to_dataframe(video_stats_content)

In [10]:
video_stats_df = ots_processor.video_stats_ods_to_stage(df=df)