<a href="https://www.kaggle.com/code/georgiosspyrou1/youtube-video-virality-predictor-eda?scriptVersionId=225127668" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🌍 YouTube Trending Videos Analysis & Virality Prediction 🚀  

## 📌 Project Overview  
This project aims to **collect, analyze, and model YouTube trending video data** across multiple countries using the **YouTube API**. By gathering daily trending video statistics, we seek to uncover the patterns and factors that contribute to a video's virality.  

## 🛠️ How the Data is Collected  
The data is automatically collected using the **YouTube API** and stored in this GitHub repository:  
🔗 [Trending Video Metadata](https://github.com/gpsyrou/tube-virality/tree/main/assets/meta/trending)  

The main project repository can be found here: https://github.com/gpsyrou/tube-virality

### **Data Collection Process**  
1. **Fetching Trending Videos**:  
   - We start by retrieving the latest trending videos from **multiple countries** via the **YouTube API**.  
   - These trending videos are stored in JSON format in our repository.  

2. **Daily Statistics Updates**:  
   - Each day, a GitHub Actions workflow runs automatically to **gather fresh statistics** for all stored trending videos.  
   - The script fetches video engagement metrics (views, likes, comments, etc.) and updates them in JSON files.  
   - The updated dataset is saved here:  
     🔗 [Video Statistics](https://github.com/gpsyrou/tube-virality/tree/main/assets/meta/video_stats)  
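
The first step above can be sketched as follows. This is a minimal illustration, not the repository's actual collector: it assumes the YouTube Data API v3 `videos.list` endpoint with `chart=mostPopular`, and the helper names (`build_trending_params`, `flatten_items`, `fetch_trending`) and output field names are hypothetical. It uses only the standard library so it runs without extra dependencies; a valid API key is required for the network call.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

YOUTUBE_VIDEOS_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_trending_params(region_code, api_key, max_results=50):
    """Query parameters for the most-popular chart of one region."""
    return {
        "part": "snippet,statistics",
        "chart": "mostPopular",
        "regionCode": region_code,
        "maxResults": max_results,
        "key": api_key,
    }


def flatten_items(payload, region_code):
    """Turn a videos.list response into flat records (illustrative fields)."""
    records = []
    for item in payload.get("items", []):
        stats = item.get("statistics", {})
        records.append({
            "video_id": item["id"],
            "region": region_code,
            "title": item.get("snippet", {}).get("title"),
            # The API returns counts as strings; likeCount can be absent.
            "view_count": int(stats.get("viewCount", 0)),
            "like_count": int(stats["likeCount"]) if "likeCount" in stats else None,
        })
    return records


def fetch_trending(region_code, api_key):
    """Fetch one page of trending videos for a region (needs a valid key)."""
    query = urlencode(build_trending_params(region_code, api_key))
    with urlopen(f"{YOUTUBE_VIDEOS_URL}?{query}", timeout=30) as response:
        payload = json.load(response)
    return flatten_items(payload, region_code)
```

Keeping the parsing in `flatten_items` separate from the network call makes it possible to check the record layout against a saved JSON response without spending API quota.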

## 🔍 Objectives  
- **Understand Viral Trends**: Identify factors that make a video go viral across different countries.  
- **Statistical Analysis**: Explore engagement trends (views, likes, comments) over time.  
- **Machine Learning Models**: Build predictive models to estimate a video's likelihood of becoming viral.  

## 📊 Dataset & Features  
Our dataset consists of **daily trending videos** from multiple regions, with key features such as:  
- **Video Metadata**: Title, description, channel, published date, tags  
- **Engagement Metrics**: Views, likes, comments, favorites  
- **Technical Details**: Duration, resolution, captions availability  
- **Categorical Information**: Topic categories, privacy settings, license type  

## 🛠️ Methodology  
1. **Data Collection**: Fetching trending video data daily using the YouTube API.  
2. **Data Cleaning & Transformation**: Standardizing fields, handling missing values, and processing categorical features.  
3. **Exploratory Analysis**: Visualizing key trends in video engagement across countries.  
4. **Feature Engineering**: Extracting meaningful attributes (e.g., video age, growth rate).  
5. **Model Building**: Training machine learning models (classification, regression) to predict virality.  
6. **Evaluation & Interpretation**: Understanding model results and key drivers of viral content.  
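
As a rough sketch of the feature engineering step, the two attributes named above (video age and growth rate) could be derived with pandas as shown here. The `video_id` and `collection_day` columns appear in the notebook's data; `published_at` and `view_count` are assumed column names, the toy values are made up for illustration, and both date columns are assumed to be timezone-naive.

```python
import pandas as pd


def add_growth_features(stats):
    """Add video age and day-over-day view growth per video."""
    out = stats.copy()
    out["collection_day"] = pd.to_datetime(out["collection_day"])
    out["published_at"] = pd.to_datetime(out["published_at"])
    out = out.sort_values(["video_id", "collection_day"])

    # Age of the video, in whole days, at each collection date.
    out["video_age_days"] = (out["collection_day"] - out["published_at"]).dt.days

    # Relative change in views versus the previous snapshot of the same video.
    # The first snapshot of each video has no predecessor, so it is NaN.
    out["view_growth_rate"] = out.groupby("video_id")["view_count"].pct_change()
    return out.reset_index(drop=True)


# Toy data: two videos with a few daily snapshots each (values are invented).
toy = pd.DataFrame({
    "video_id": ["a", "a", "a", "b", "b"],
    "collection_day": ["2025-01-01", "2025-01-02", "2025-01-03",
                       "2025-01-01", "2025-01-02"],
    "published_at": ["2024-12-30"] * 3 + ["2025-01-01"] * 2,
    "view_count": [100, 150, 300, 1000, 1100],
})
features = add_growth_features(toy)
print(features[["video_id", "video_age_days", "view_growth_rate"]])
```

Grouping by `video_id` before calling `pct_change` keeps the growth rate from leaking across videos when snapshots of different videos sit next to each other in the table.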

## 🚀 Expected Outcomes  
- A **comprehensive dataset** of YouTube trending videos across multiple countries.  
- Insights into **what makes a video go viral** based on statistical analysis.  
- Predictive models that can estimate a video's potential virality.  
- Open-source tools for **researchers, content creators, and marketers** to better understand online video trends.  
  

**Stay tuned as we uncover the secrets of YouTube virality! 🚀📈**  


In [None]:
import pandas as pd
import json
import requests

In [None]:
class GitHubDataLoader:
    """Download a file from a GitHub repository via raw.githubusercontent.com."""

    def __init__(self, user, repo, file_path, branch="main"):
        self.user = user
        self.repo = repo
        self.file_path = file_path
        self.branch = branch
        self.raw_url = f"https://raw.githubusercontent.com/{self.user}/{self.repo}/{self.branch}/{self.file_path}"

    def get_csv_file(self):
        """Return the CSV file's content as a string."""
        # A timeout keeps the notebook from hanging on a stalled connection.
        response = requests.get(self.raw_url, timeout=30)
        response.raise_for_status()  # Raise an HTTPError on 4xx/5xx responses
        return response.text


class CSVLoader:
    """Parse in-memory CSV text into a pandas DataFrame."""

    def __init__(self, csv_content):
        self.csv_content = csv_content

    def to_dataframe(self):
        from io import StringIO
        return pd.read_csv(StringIO(self.csv_content))



In [None]:
loader = GitHubDataLoader("gpsyrou", "tube-virality", "db/ods/trending_videos.csv")
csv_content = loader.get_csv_file()

In [None]:
processor = CSVLoader(csv_content)
df = processor.to_dataframe()

In [None]:
df.head()

In [None]:
df['id'].nunique()

In [None]:
df.sort_values(by=['id'])

In [None]:
loader = GitHubDataLoader("gpsyrou", "tube-virality", "db/ods/merged_video_stats.csv")
csv_content = loader.get_csv_file()

In [None]:
processor = CSVLoader(csv_content)
df = processor.to_dataframe()

In [None]:
df['video_id'].nunique()

In [None]:
df[df['video_id'] == 'CpzMAiDwfHc'].sort_values(by=['collection_day'], ascending=False)