<a href="https://www.kaggle.com/code/georgiosspyrou1/youtube-video-virality-predictor-eda?scriptVersionId=226450250" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🚀 Tube Virality Project  

![Python](https://img.shields.io/badge/-Python-000?&logo=Python)  
![Go](https://img.shields.io/badge/-Golang-000?&logo=go)  

## 🎯 YouTube Trending Video Analytics API  

### **Project Purpose**  
The **Tube Virality** project aims to **collect, analyze, and model YouTube trending video data** across multiple countries using the **YouTube API**. Beyond consuming the existing API, we are **building a custom API** that retrieves metadata, including view counts, likes, and descriptions, for YouTube videos and channels.

### **Key Objectives**  
- ✅ **Develop a custom API** to fetch YouTube video statistics.  
- ✅ **Collect trending videos** from various countries and store historical data.  
- ✅ **Analyze the collected data** to identify trends and patterns in virality.  
- ✅ **Build predictive models** to estimate a video's potential to go viral.

---

## 🛠️ How the Data is Collected  

The data is automatically collected using the **YouTube API** and stored in this GitHub repository:  
🔗 [Trending Video Metadata](https://github.com/gpsyrou/tube-virality/tree/main/assets/meta/trending)  

### **Collection Process**  
1. **Fetching Trending Videos**  
   - Using the YouTube API, trending videos from multiple countries are retrieved.  
   - The list of trending videos is stored and continuously updated.  

2. **Daily Statistics Updates** (Automated via **GitHub Actions**)  
   - A scheduled **GitHub Actions** workflow updates video statistics (views, likes, comments, etc.).  
   - These updates provide **historical trends** for analysis.  
   - The latest data is stored here:  
     🔗 [Video Statistics](https://github.com/gpsyrou/tube-virality/tree/main/assets/meta/video_stats)  
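
As a minimal sketch of the first step, the helpers below target the YouTube Data API v3 `videos` endpoint with `chart=mostPopular`, which lists a region's most popular videos. The helper names (`build_trending_request`, `parse_trending_items`), the API key placeholder, and the output record layout are assumptions for illustration; the repository's `trending.py` may be organized differently.

```python
from typing import Any, Dict, List, Tuple

# YouTube Data API v3 endpoint for video resources
VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def build_trending_request(
    api_key: str, region_code: str, max_results: int = 50
) -> Tuple[str, Dict[str, Any]]:
    """Return the URL and query parameters for a 'most popular' request."""
    params = {
        "part": "snippet,statistics",
        "chart": "mostPopular",
        "regionCode": region_code,
        "maxResults": max_results,
        "key": api_key,
    }
    return VIDEOS_ENDPOINT, params


def parse_trending_items(payload: Dict[str, Any], region_code: str) -> List[Dict[str, Any]]:
    """Flatten an API response into one record per video (hypothetical layout)."""
    records = []
    for position, item in enumerate(payload.get("items", []), start=1):
        snippet = item.get("snippet", {})
        stats = item.get("statistics", {})
        records.append({
            "id": item.get("id"),
            "trending_position": position,  # rank within the returned list
            "country_code": region_code,
            "title": snippet.get("title"),
            "channelTitle": snippet.get("channelTitle"),
            "viewCount": stats.get("viewCount"),
            "likeCount": stats.get("likeCount"),
        })
    return records
```

With `requests`, the pair returned by `build_trending_request` could be passed as `requests.get(url, params=params, timeout=30)`, and the decoded JSON handed to `parse_trending_items` once per country.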

```mermaid
graph TD;
    A[trending.py: Fetch Trending Videos] -->|Generates daily JSON files - one per country| B[trending_db.py: Aggregate Trending Data];
    B -->|Merges all country JSONs into a unified CSV| C[video_stats.py: Extract & Fetch Video Stats];
    C -->|Creates a daily JSON file with statistics for all videos| D[video_stats_db.py: Compile Video Stats History];
    D -->|Combines all daily stats JSONs into a final dataset| E[Complete Merged Video Stats JSON];
```
---

## 🔍 Understanding Video Virality  

### **What Defines a Viral Video?**  
A video's **virality** isn't simply measured by view count—it depends on engagement, growth rate, and audience reach. Here are key factors:  
📌 **Engagement Rate** – Likes, comments, and shares relative to views.  
📌 **Subscriber Growth** – New subscribers gained after the video is posted.  
📌 **Rapid View Growth** – Views gained in the first 24-48 hours.  

For instance:  
- A channel with **1M subscribers** reaching **20M views** (20 views per subscriber) is in line with expectations.  
- A channel with **10K subscribers** reaching **2M views** (200 views per subscriber) is **extraordinary**.  

Our models will classify videos as **"success" (viral)** or **"non-success"**, based on these metrics.
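
To make these metrics concrete, here is a small sketch of how engagement rate and reach relative to channel size might be computed. The function names and the `ratio_threshold` cut-off are hypothetical and not taken from the project; shares are left out because they do not appear among the engagement metrics listed for the dataset.

```python
def engagement_rate(likes: int, comments: int, views: int) -> float:
    """Likes plus comments per view; 0.0 when there are no views."""
    return (likes + comments) / views if views > 0 else 0.0


def views_per_subscriber(views: int, subscribers: int) -> float:
    """Reach relative to channel size; 0.0 when the subscriber count is unknown."""
    return views / subscribers if subscribers > 0 else 0.0


def is_viral(views: int, subscribers: int, ratio_threshold: float = 100.0) -> bool:
    """Label a video 'success' when its reach far exceeds its channel size.

    ratio_threshold is a hypothetical cut-off, not a value from the project.
    """
    return views_per_subscriber(views, subscribers) >= ratio_threshold


# The two channels from the example above
print(views_per_subscriber(20_000_000, 1_000_000))  # 20.0
print(views_per_subscriber(2_000_000, 10_000))      # 200.0
print(is_viral(20_000_000, 1_000_000))              # False
print(is_viral(2_000_000, 10_000))                  # True
```

A single ratio threshold is the simplest possible rule; a trained classifier would combine it with engagement and growth features.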

---

## 📊 Dataset & Features  

Our dataset includes key **video metadata** and **engagement statistics**, such as:  

- **Video Details**: Title, description, duration, resolution  
- **Engagement Metrics**: Views, likes, comments, favorite count  
- **Channel Details**: Subscriber count, total videos, upload frequency  
- **Trending History**: How long a video remains on the trending list  
- **Country-Based Analysis**: Virality trends across different regions  

📌 **Goal:** Use these features to identify patterns and train models for virality prediction.  

---

## 🔬 Methodology  

1️⃣ **Data Collection** – Retrieve daily trending videos across countries.  
2️⃣ **Data Cleaning & Preprocessing** – Handle missing values, outliers, and standardize data.  
3️⃣ **Exploratory Analysis** – Identify key trends and patterns.  
4️⃣ **Feature Engineering** – Extract additional insights like growth rate and engagement score.  
5️⃣ **Model Development** – Train ML models for virality prediction.  
6️⃣ **Evaluation & Interpretation** – Validate predictions and refine models.  
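
As an illustration of the feature engineering step, the sketch below derives a daily view gain and a relative growth rate from repeated statistics snapshots. It assumes one row per video per `collection_day` with a `view_count` column, as in the video statistics data; the function name, the output column names, and the toy numbers are made up for the example.

```python
import pandas as pd


def add_daily_view_growth(stats: pd.DataFrame) -> pd.DataFrame:
    """Add per-video daily view gain and relative growth columns."""
    out = stats.sort_values(["video_id", "collection_day"]).copy()
    grouped = out.groupby("video_id")["view_count"]
    # Views gained since the previous snapshot of the same video
    out["views_gained"] = grouped.diff()
    # Relative growth versus the previous snapshot (NaN for the first one)
    out["view_growth_rate"] = grouped.pct_change()
    return out


# Toy snapshots (made-up numbers) for two videos
toy = pd.DataFrame({
    "video_id": ["a", "a", "a", "b", "b"],
    "collection_day": ["2025-03-01", "2025-03-02", "2025-03-03", "2025-03-01", "2025-03-02"],
    "view_count": [1000, 1500, 3000, 200, 220],
})
print(add_daily_view_growth(toy))
```

The first snapshot of each video has no predecessor, so both derived columns are `NaN` there and need to be dropped or imputed before modeling.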

---

## 💡 Technologies Utilized  

The **Tube Virality** project relies on the following technologies:  

🔹 **Python 3.9** – Data processing, analysis, and ML model training.  
🔹 **SQL** – Storing structured video metadata for analysis.  
🔹 **Go** – Enhancing API performance and concurrent processing.  

---


In [1]:
import json
import requests
import pandas as pd
from io import StringIO

In [2]:
class GitHubDataLoader:
    def __init__(self, user, repo, file_path, branch="main"):
        self.user = user
        self.repo = repo
        self.branch = branch
        self.file_path = file_path
        self.raw_url = f"https://raw.githubusercontent.com/{self.user}/{self.repo}/{self.branch}/{self.file_path}"

    def get_csv_file(self):
        response = requests.get(self.raw_url)
        response.raise_for_status()
        return response.text 

    def csv_to_dataframe(self, csv_content):
        df = pd.read_csv(StringIO(csv_content))
        return df

In [3]:
YOUTUBE_CATEGORY_MAP = {
    1: "Film & Animation",
    2: "Autos & Vehicles",
    10: "Music",
    15: "Pets & Animals",
    17: "Sports",
    18: "Short Movies",
    19: "Travel & Events",
    20: "Gaming",
    21: "Videoblogging",
    22: "People & Blogs",
    23: "Comedy",
    24: "Entertainment",
    25: "News & Politics",
    26: "Howto & Style",
    27: "Education",
    28: "Science & Technology",
    29: "Nonprofits & Activism",
    30: "Movies",
    31: "Anime/Animation",
    32: "Action/Adventure",
    33: "Classics",
    34: "Comedy (Movies)",
    35: "Documentary",
    36: "Drama",
    37: "Family",
    38: "Foreign",
    39: "Horror",
    40: "Sci-Fi/Fantasy",
    41: "Thriller",
    42: "Shorts",
    43: "Shows",
    44: "Trailers"
}

## Trending Videos Analysis

In [4]:
loader = GitHubDataLoader("gpsyrou", "tube-virality", "db/ods/trending_videos.csv")

trending_videos_content = loader.get_csv_file()

df = loader.csv_to_dataframe(trending_videos_content)

In [5]:
print(df.shape)

(4500, 15)


In [6]:
import pandas as pd
import logging

class ODSToStageProcessor:
    """
    Processes and cleans YouTube data from ODS (raw) format to a cleaner stage format.
    """
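    def __init__(self, category_map: dict = None):
        # Avoid a shared mutable default argument; fall back to an empty mapping
        self.category_map = category_map if category_map is not None else {}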
        
        # Configure logging
        logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
        self.logger = logging.getLogger(__name__)

    def trending_ods_to_stage(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Cleans and preprocesses a YouTube trending videos DataFrame for analysis.
        """
        self.logger.info("Started processing trending ODS data.")
        df = df.copy()
        
        try:
            df.rename(columns={'id': 'video_id'}, inplace=True)
            self.logger.info("Renamed columns successfully.")
        except Exception as e:
            self.logger.error(f"Error renaming columns: {e}")

        try:
            # Parse 'publishedAt' and store it as a timezone-free 'YYYY-MM-DD HH:MM:SS' string
            df["publishedAt"] = pd.to_datetime(df["publishedAt"], errors="coerce").dt.strftime('%Y-%m-%d %H:%M:%S')
            self.logger.info("'publishedAt' column converted successfully.")
        except Exception as e:
            self.logger.error(f"Error converting 'publishedAt' column: {e}")
        
        # Convert float columns to int (handling NaN values first)
        for col in ["viewCount", "likeCount", "commentCount"]:
            try:
                df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
                self.logger.info(f"Converted {col} to integers.")
            except Exception as e:
                self.logger.error(f"Error converting {col}: {e}")

        # Ensure categoryId is numeric and convert to Int64 before mapping
        try:
            df["categoryId"] = pd.to_numeric(df["categoryId"], errors="coerce").fillna(0).astype("Int64")
            self.logger.info("'categoryId' column converted successfully.")
        except Exception as e:
            self.logger.error(f"Error converting 'categoryId': {e}")
        
        # Map categoryId to category description
        df["category_descr"] = df["categoryId"].map(self.category_map).fillna("Unknown")
        self.logger.info("Category mapping completed.")
        
        # Convert 'defaultAudioLanguage' to uppercase if it exists,
        # keeping missing values as <NA> rather than the literal string "NAN"
        if "defaultAudioLanguage" in df.columns:
            df["defaultAudioLanguage"] = df["defaultAudioLanguage"].astype("string").str.upper()
            self.logger.info("Converted 'defaultAudioLanguage' to uppercase.")
        
        # Define the preferred column order
        column_order = [
            "video_id", "collection_date", "position", "publishedAt", "title", "channelTitle", "categoryId", "category_descr",
            "viewCount", "likeCount", "commentCount", "defaultAudioLanguage"
        ]
        
        # Reorder columns, keeping any additional columns at the end
        df = df[[col for col in column_order if col in df.columns] + [col for col in df.columns if col not in column_order]]
        
        self.logger.info("Completed processing of trending ODS data.")
        return df
    

    def video_stats_ods_to_stage(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Cleans and preprocesses a YouTube video statistics DataFrame for analysis.
        """
        self.logger.info("Started processing video stats ODS data.")
        df = df.copy()
        
        try:
            # Parse 'published_at' and store it as a timezone-free 'YYYY-MM-DD HH:MM:SS' string
            df["published_at"] = pd.to_datetime(df["published_at"], errors="coerce").dt.strftime('%Y-%m-%d %H:%M:%S')
            self.logger.info("'published_at' column converted successfully.")
        except Exception as e:
            self.logger.error(f"Error converting 'published_at' column: {e}")

        # Convert boolean columns to proper boolean types
        for col in ["caption", "licensed_content", "embeddable", "public_stats_viewable"]:
            try:
                df[col] = df[col].astype(bool)
                self.logger.info(f"Converted {col} to boolean.")
            except Exception as e:
                self.logger.error(f"Error converting {col}: {e}")
        
        # Convert numeric columns to integers
        for col in ["view_count", "like_count", "comment_count"]:
            try:
                df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
                self.logger.info(f"Converted {col} to integers.")
            except Exception as e:
                self.logger.error(f"Error converting {col}: {e}")
        
        # Convert 'tags' column to a comma-separated string if it's a list
        if "tags" in df.columns:
            try:
                df["tags"] = df["tags"].apply(lambda x: ", ".join(x) if isinstance(x, list) else str(x))
                self.logger.info("Converted 'tags' column to string.")
            except Exception as e:
                self.logger.error(f"Error processing 'tags' column: {e}")
        
        # Transform 'duration' column from PT format to minutes
        if "duration" in df.columns:
            try:
                def pt_to_minutes(pt_string):
                    """Convert ISO 8601 duration format (PT format) to minutes, handling hours."""
                    if not isinstance(pt_string, str) or not pt_string.startswith("PT"):
                        return 0.0
                    
                    # Use regex to handle hours, minutes, and seconds
                    import re
                    match = re.match(r"^PT(\d+H)?(\d+M)?(\d+S)?$", pt_string)
                    if match:
                        hours = int(match.group(1).replace('H', '') if match.group(1) else 0)
                        minutes = int(match.group(2).replace('M', '') if match.group(2) else 0)
                        seconds = int(match.group(3).replace('S', '') if match.group(3) else 0)
                        return hours * 60 + minutes + (seconds / 60)
                    return 0.0
                
                df["duration_in_minutes"] = df["duration"].apply(pt_to_minutes)
                self.logger.info("Transformed 'duration' to minutes.")
            except Exception as e:
                self.logger.error(f"Error transforming 'duration' column: {e}")
        
        # Define the preferred column order
        column_order = [
            "video_id", "channel_id", "title", "description", "published_at", "tags", "view_count", "like_count", "comment_count", 
            "duration", "duration_in_minutes", "dimension", "definition", "caption", "licensed_content", "projection", "privacy_status", "license", 
            "embeddable", "public_stats_viewable", "topic_categories", "collection_day", "country_code"
        ]
        
        # Reorder columns, keeping any additional columns at the end
        df = df[[col for col in column_order if col in df.columns] + [col for col in df.columns if col not in column_order]]
        
        self.logger.info("Completed processing of video stats ODS data.")
        return df
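
The nested `pt_to_minutes` helper above only matches durations that start with `PT`, so a value with a day component such as `P1DT2H` falls through to `0.0`. The standalone variant below is a sketch of one way to cover that case; the name `iso_duration_to_minutes` is hypothetical and it is not part of the project's code.

```python
import re

# Optional day, hour, minute, and second components of an ISO 8601 duration
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso_duration_to_minutes(value) -> float:
    """Convert strings such as 'PT13M47S' or 'P1DT2H' to minutes; 0.0 if unparseable."""
    if not isinstance(value, str):
        return 0.0
    match = _DURATION_RE.match(value)
    if match is None:
        return 0.0
    # Missing components come back as None and count as zero
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 24 * 60
        + parts["hours"] * 60
        + parts["minutes"]
        + parts["seconds"] / 60
    )


print(iso_duration_to_minutes("PT46M24S"))  # 46.4
print(iso_duration_to_minutes("P1DT2H"))    # 1560.0
```

Because it takes a single value and returns a float, it could be applied to a column with `df["duration"].apply(iso_duration_to_minutes)` in the same way as the nested helper.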


In [7]:
ots_processor = ODSToStageProcessor(category_map=YOUTUBE_CATEGORY_MAP)

In [8]:
trending_df = ots_processor.trending_ods_to_stage(df=df)

In [9]:
trending_df.head(3)

Unnamed: 0,video_id,collection_date,publishedAt,title,channelTitle,categoryId,category_descr,viewCount,likeCount,commentCount,defaultAudioLanguage,trending_position,country_code,channelId,description,thumbnail_url
0,hTpi0SyZfx8,2025-03-04,2025-03-03 15:30:24,NEW! Taarak Mehta Ka Ooltah Chashmah | Ep 4335...,Sony SAB,24,Entertainment,5446361,89355,3103,HI,1,IN,UC6-F5tO8uklgE9Zy8IvbdFw,Click here to subscribe to SAB: https://www.yo...,https://i.ytimg.com/vi/hTpi0SyZfx8/hqdefault.jpg
1,NkZFnpDhdCk,2025-03-04,2025-03-03 05:47:08,The Paradise Glimpse : RAW STATEMENT - Telugu ...,SLV Cinemas,1,Film & Animation,18241379,286388,5303,EN,2,IN,UCiWKEIAFbdMIEicYBwKfi0g,The most anticipated #NaniOdela2 The Paradise ...,https://i.ytimg.com/vi/NkZFnpDhdCk/hqdefault.jpg
2,jl-sgSDwJHs,2025-03-04,2025-02-28 13:32:07,Good Bad Ugly Tamil Teaser | Ajith Kumar | Tri...,Mythri Movie Makers,24,Entertainment,34924808,876063,57634,TE,3,IN,UCKZSn5C-RzrLjuWJF8wWiDw,Good Bad Ugly Tamil Teaser on Mythri Movie Mak...,https://i.ytimg.com/vi/jl-sgSDwJHs/hqdefault.jpg


In [10]:
trending_df.groupby('video_id').agg(
    min_collection_date=('collection_date', 'min'),
    max_collection_date=('collection_date', 'max'),
    min_trending_position=('trending_position', 'min'),
    max_trending_position=('trending_position', 'max')
)

video_id,min_collection_date,max_collection_date,min_trending_position,max_trending_position
--MGIIuNZy8,2025-02-24,2025-02-25,32,45
-0dKgdKWJP0,2025-03-01,2025-03-02,15,33
-3Bme6nVN-4,2025-03-08,2025-03-08,33,33
-5RSoJ5Ky00,2025-02-23,2025-02-23,36,36
-7tWhJzScqk,2025-02-23,2025-02-24,38,50
...,...,...,...,...
zgzgtLOaJhU,2025-03-05,2025-03-05,46,46
zlNUbqAtU6E,2025-03-02,2025-03-04,8,50
zpUpkImSZvY,2025-03-08,2025-03-08,27,27
zpieZkvFnlE,2025-02-23,2025-02-23,35,43
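
Building on the aggregation above, the "how long a video remains on the trending list" feature can be approximated as the span between a video's first and last collection dates. The sketch below uses a toy frame with made-up values; `trending_span` and the output column names are hypothetical, and the span ignores any days in between on which the video did not trend.

```python
import pandas as pd


def trending_span(trending: pd.DataFrame) -> pd.DataFrame:
    """Per-video first/last trending date, inclusive day span, and best rank."""
    dates = pd.to_datetime(trending["collection_date"])
    summary = (
        trending.assign(collection_date=dates)
        .groupby("video_id")
        .agg(
            first_seen=("collection_date", "min"),
            last_seen=("collection_date", "max"),
            best_position=("trending_position", "min"),
        )
    )
    # Inclusive span: a video seen on a single day counts as 1
    summary["days_on_trending"] = (summary["last_seen"] - summary["first_seen"]).dt.days + 1
    return summary


# Toy rows (made-up values)
toy = pd.DataFrame({
    "video_id": ["x", "x", "y"],
    "collection_date": ["2025-03-02", "2025-03-04", "2025-03-08"],
    "trending_position": [50, 8, 27],
})
print(trending_span(toy))
```

Since the data covers several countries, grouping by both `video_id` and `country_code` would give a per-region span instead of a global one.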


## Video Statistics Analysis

In [11]:
loader = GitHubDataLoader("gpsyrou", "tube-virality", "db/ods/merged_video_stats.csv")

video_stats_content = loader.get_csv_file()

df = loader.csv_to_dataframe(video_stats_content)

In [12]:
video_stats_df = ots_processor.video_stats_ods_to_stage(df=df)

In [13]:
video_stats_df.head(3)

Unnamed: 0,video_id,channel_id,title,description,published_at,tags,view_count,like_count,comment_count,duration,...,caption,licensed_content,projection,privacy_status,license,embeddable,public_stats_viewable,topic_categories,collection_day,country_code
0,CpzMAiDwfHc,UCiwQRG2sCcfjKkgxMEdJGPg,[굿데이] 10년째 어려운 내 친구 지드래곤..★ 역대급 라인업으로 88나라에 모인...,*ENG SUB AVAILABLE\n[굿데이] 일요일 밤 9시 10분 방송!\nMB...,2025-02-23 23:00:10,"굿데이,지디,gd,gdragon,빅뱅,김수현,정해인,광희,임시완,이수혁,김태호,무도...",3092983,40651,3968,PT13M47S,...,True,True,rectangular,public,youtube,True,True,['https://en.wikipedia.org/wiki/Entertainment'...,2025-03-03,KR
1,kshqQSqIW9M,UCCZ-gBdN59pF39tbm16xvdQ,pH-1이 찰스를 왜이리 좋아해..이상형이라고..? [월간데이트 2월호] (ENG ...,*이 영상은 아라바그의 유료광고를 포함하고 있습니다. \n\n연습생 여러분 안녕하세...,2025-02-22 09:00:54,"pH1,pH-1,ph1,피에이치원,준빵조교,박준원,찰스,찰스엔터,찰스준원,찰스준빵,...",2678854,88469,16462,PT46M24S,...,True,True,rectangular,public,youtube,True,True,['https://en.wikipedia.org/wiki/Lifestyle_(soc...,2025-03-03,KR
2,QdUhS8BNjWo,UCthyk2hY8NnO-UZejtAZyNw,#17 8년 전에 만났으면 우리 조금 달랐을까? [오래된 만남 추구] ep.5,"#오만추 #오래된만남추구\nKBS Joy, KBS2 일요일 밤 09:20 방송\n\...",2025-02-23 14:00:21,"KBSN,KBSJoy,korean,오만추,오래된 만남 추구,오래된,오래,만남,추구,...",1357923,13039,2545,PT15M26S,...,False,True,rectangular,public,youtube,True,True,['https://en.wikipedia.org/wiki/Entertainment'...,2025-03-03,KR
