# Final Presentation

## Group 4 Members

* Shyam Akhil Nekkanti - 8982123
* Jun He (Helena) - 8903073
* Zheming Li (Brendan) - 8914152

### Field of Inquiry: **YouTube Video Metrics**

### Research Question: **What factors most significantly impact the views of a YouTube video?**

 
We analyze YouTube video engagement metrics to predict video virality and help creators optimize their content strategies. By examining features such as the likes-to-views ratio, comments-to-views ratio, and video category, we aim to identify patterns that correlate with a video trending.

 ### Revised Hypothesis:
"Videos with higher engagement ratios (likes and comments per view) are more likely to trend compared to videos with high view counts alone. Additionally, specific categories (e.g., Entertainment, Music) have a higher trending probability."

This hypothesis will be tested using Pearson’s correlation to identify relationships, logistic regression for classification, and probabilistic reasoning to estimate the likelihood of trending under varying conditions.
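
As a minimal sketch of the correlation step, the snippet below computes Pearson's r between likes per view and comments per view for a handful of made-up videos. The numbers and the helper name `pearson_r` are illustrative assumptions, not values from the dataset; in the notebook itself `scipy.stats.pearsonr` would do the same job.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical (views, likes, comment_count) triples, for illustration only
videos = [
    (100_000, 5_000, 400),
    (250_000, 7_500, 600),
    (50_000, 4_000, 350),
    (1_000_000, 20_000, 1_500),
]

likes_per_view = [likes / views for views, likes, _ in videos]
comments_per_view = [comments / views for views, _, comments in videos]

r = pearson_r(likes_per_view, comments_per_view)
print(f"Pearson r between the two engagement ratios: {r:.3f}")
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear association.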


## 1. Data Understanding
We loaded the YouTube dataset and explored its structure, including basic statistics and data types.
### Dataset Description
The dataset contains information about trending YouTube videos across different countries, including the US, Canada, Germany, and others; this notebook loads the US file (`USvideos.csv`). Its attributes include video title, channel title, publication time, trending date, views, likes, dislikes, comment count, and tags, along with the video category, a thumbnail link, and flags for whether comments or ratings are disabled.

The dataset allows us to explore patterns in video performance and understand which factors, such as engagement metrics or metadata, may contribute to a video trending.
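
The date columns need parsing before they can serve as features. Judging from the sample rows printed below, `trending_date` appears to use a `yy.dd.mm` layout while `publish_time` is an ISO 8601 UTC timestamp. Both format strings in this sketch are assumptions inferred from those rows, and `days_until_trending` is a hypothetical helper name.

```python
from datetime import datetime

def days_until_trending(publish_time: str, trending_date: str) -> int:
    """Whole days between the publish date and the trending date."""
    # Assumed formats, inferred from the sample rows of USvideos.csv
    published = datetime.strptime(publish_time, "%Y-%m-%dT%H:%M:%S.%fZ").date()
    trended = datetime.strptime(trending_date, "%y.%d.%m").date()
    return (trended - published).days

# First row of the preview: published 2017-11-13, trending date 17.14.11
print(days_until_trending("2017-11-13T17:13:01.000Z", "17.14.11"))
```

A derived column like this could be one of the metadata features examined alongside the engagement ratios.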



### Data Cleaning and Preparation

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.naive_bayes import GaussianNB
from scipy.stats import pearsonr


In [5]:
# Load the data
file_path = "youtube-dataset/USvideos.csv"
df = pd.read_csv(file_path)

# Inspect data

# Display the first few rows of the dataset
print(df.head())

# Display basic information about the dataset
# (info() prints its report directly and returns None, so no print() is needed)
df.info()

# Display summary statistics
print(df.describe())


      video_id trending_date  \
0  2kyS6SvSYSE      17.14.11   
1  1ZAPwfrtAFY      17.14.11   
2  5qpjK5DgCt4      17.14.11   
3  puqaWrEC7tY      17.14.11   
4  d380meD0W0M      17.14.11   

                                               title          channel_title  \
0                 WE WANT TO TALK ABOUT OUR MARRIAGE           CaseyNeistat   
1  The Trump Presidency: Last Week Tonight with J...        LastWeekTonight   
2  Racist Superman | Rudy Mancuso, King Bach & Le...           Rudy Mancuso   
3                   Nickelback Lyrics: Real or Fake?  Good Mythical Morning   
4                           I Dare You: GOING BALD!?               nigahiga   

   category_id              publish_time  \
0           22  2017-11-13T17:13:01.000Z   
1           24  2017-11-13T07:30:00.000Z   
2           23  2017-11-12T19:05:24.000Z   
3           24  2017-11-13T11:00:04.000Z   
4           24  2017-11-12T18:01:41.000Z   

                                                tags    views   lik

## 2. Encapsulation in Classes and Methods
We encapsulated the data loading and basic analysis in two classes, `DataHandler` and `VideoAnalysis`, to improve code organization and reusability.



In [6]:
import pandas as pd
import json
from scipy.stats import boxcox, yeojohnson  # used by VideoAnalysis.data_transformation


def load_json(file_path):
  """
  Load JSON data from a file
  :param file_path: path to the JSON file
  :return: JSON data
  """

  with open(file_path, 'r') as file:
    data = json.load(file)
  return data


def convert_raw_categories_to_dict(raw_categories):
  """
  Convert raw categories data to a dictionary
  :param raw_categories: raw categories data
  :return: categories dictionary
  """

  categories_dict = {}

  for item in raw_categories['items']:
    categories_dict[int(item['id'])] = item['snippet']['title']

  return categories_dict

class DataHandler:
    def __init__(self, filepath):
        self.data = pd.read_csv(filepath)

    def preprocess(self):
        # Drop irrelevant or missing data
        self.data.dropna(subset=['views', 'likes', 'dislikes', 'comment_count'], inplace=True)
        self.data['likes_to_views_ratio'] = self.data['likes'] / self.data['views']
        self.data['comments_to_views_ratio'] = self.data['comment_count'] / self.data['views']
        return self.data

    def summary(self):
        print("Data Summary:")
        print(self.data.describe())


class VideoAnalysis:
  """
  A class to explore the YouTube trending videos dataset
  """
  
  def __init__(self, file_path):
    self.df = pd.read_csv(file_path)

  # Method to explore the dataset
  def explore_data(self):
    """
    Explore the dataset by displaying column names and summary statistics
    :return: None
    """

    print("Column Names:", self.df.columns)
    print("Dataset Summary:\n", self.df.describe())

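  def data_clean(self):
    """
    Handle missing values.
    Convert categorical data into dummy variables.
    Convert date fields into ordinal day numbers.
    """
    # Work on a copy so the new columns do not trigger SettingWithCopyWarning
    data = self.df.dropna().copy()
    # Convert the publish date to a day number; toordinal() returns the
    # proleptic Gregorian ordinal, stored under the original 'julian_date' name
    data['publish_date'] = pd.to_datetime(data['publish_time']).dt.date
    data['julian_date'] = data['publish_date'].apply(lambda x: x.toordinal())

    # Convert categorical columns to dummies
    categorical_columns = ['category_id']
    data = pd.get_dummies(data, columns=categorical_columns, drop_first=True)
    # Return the cleaned frame (this class has no self.data attribute)
    return data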

  def data_transformation(self):
    """
    Box-Cox transformation: for skewed data like views.
    Yeo-Johnson transformation: applied to likes here in place of
    Tukey's ladder of powers.
    """
    # Box-Cox transformation on 'views' (+1 keeps every value strictly positive)
    self.df['views_bc'], _ = boxcox(self.df['views'] + 1)

    # Yeo-Johnson on 'likes'; the column keeps its original 'likes_tukey' name
    self.df['likes_tukey'], _ = yeojohnson(self.df['likes'] + 1)
    return self.df
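
For readers unfamiliar with the transformation, Box-Cox maps a positive value x to (x^λ - 1)/λ, falling back to ln x when λ = 0. The sketch below applies it with a fixed λ to show how it compresses a right-skewed column. `scipy.stats.boxcox`, by contrast, also estimates λ from the data when none is supplied. The function name `box_cox` is hypothetical, and the sample values are the rounded minimum, quartiles, and maximum of `views` from the dataset's summary statistics.

```python
import math

def box_cox(x: float, lmbda: float) -> float:
    """Box-Cox transform of a single positive value for a fixed lambda."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    if lmbda == 0:
        return math.log(x)
    return (x ** lmbda - 1) / lmbda

# Rounded min, quartiles and max of 'views' from the summary statistics
views = [549, 242_329, 681_861, 1_823_157, 225_211_900]

# lambda = 0 is the natural log, a common choice for heavy right skew
transformed = [box_cox(v, 0.0) for v in views]

spread_before = max(views) / min(views)
spread_after = max(transformed) / min(transformed)
print(f"max/min before: {spread_before:,.0f}, after: {spread_after:.2f}")
```

The ratio between the largest and smallest value shrinks by several orders of magnitude, which is the effect the transformation is meant to have on skewed counts.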
  


In [7]:
video_analysis = VideoAnalysis(file_path)

# Call the method to explore the dataset
video_analysis.explore_data()


# Load and preprocess the data
handler = DataHandler(file_path)
data = handler.preprocess()
handler.summary()

Column Names: Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')
Dataset Summary:
         category_id         views         likes      dislikes  comment_count
count  40949.000000  4.094900e+04  4.094900e+04  4.094900e+04   4.094900e+04
mean      19.972429  2.360785e+06  7.426670e+04  3.711401e+03   8.446804e+03
std        7.568327  7.394114e+06  2.288853e+05  2.902971e+04   3.743049e+04
min        1.000000  5.490000e+02  0.000000e+00  0.000000e+00   0.000000e+00
25%       17.000000  2.423290e+05  5.424000e+03  2.020000e+02   6.140000e+02
50%       24.000000  6.818610e+05  1.809100e+04  6.310000e+02   1.856000e+03
75%       25.000000  1.823157e+06  5.541700e+04  1.938000e+03   5.755000e+03
max       43.000000  2.252119e+08  5.613827e+06  1.674420e+0
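
The helpers `load_json` and `convert_raw_categories_to_dict` are defined above but never called. As a sketch of how they could feed the analysis, the snippet below maps `category_id` values to readable names. The inline `raw_categories` dictionary is a hand-written stand-in for the category JSON file (its path is not shown in this notebook), and the helper is repeated in compact form so the block runs on its own.

```python
def convert_raw_categories_to_dict(raw_categories):
    """Map integer category ids to their titles."""
    return {int(item['id']): item['snippet']['title']
            for item in raw_categories['items']}

# Hand-written stand-in for the category JSON (structure assumed from the helper)
raw_categories = {
    "items": [
        {"id": "10", "snippet": {"title": "Music"}},
        {"id": "22", "snippet": {"title": "People & Blogs"}},
        {"id": "24", "snippet": {"title": "Entertainment"}},
    ]
}

categories = convert_raw_categories_to_dict(raw_categories)

# category_id values taken from the first rows of the preview
for category_id in [22, 24, 23]:
    print(category_id, categories.get(category_id, "Unknown"))
```

With the real file loaded, `df['category_id'].map(categories)` would add a category-name column that is easier to read in plots and tables than the numeric id.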

## 3. Summary Update
The term project aims to analyze YouTube video data to understand trends, user engagement, and content performance. The dataset includes various features such as video titles, publish time, views, likes, dislikes, comment count, and more. The final hypothesis is that certain video attributes (e.g., title length, publish time) significantly impact user engagement metrics (e.g., views, likes).



## 4. Comprehensive Data Analysis
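We ran three statistical tests to understand the data's distribution and relationships: a Shapiro-Wilk normality test on `views`, a two-sample t-test comparing `views` between categories 1 and 2, and a chi-square test of independence between `category_id` and `comments_disabled`. All three p-values are far below 0.05, suggesting that views are not normally distributed, that the two categories differ in mean views, and that comment settings vary by category.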



In [8]:
from scipy import stats

# Normality test
print(stats.shapiro(df['views']))

# T-test
group1 = df[df['category_id'] == 1]['views']
group2 = df[df['category_id'] == 2]['views']
print(stats.ttest_ind(group1, group2))

# Chi-Square test
contingency_table = pd.crosstab(df['category_id'], df['comments_disabled'])
print(stats.chi2_contingency(contingency_table))


ShapiroResult(statistic=0.26586852271187644, pvalue=6.296434677503596e-148)
TtestResult(statistic=5.973831642243304, pvalue=2.6185932588214564e-09, df=2727.0)
Chi2ContingencyResult(statistic=801.1345737017033, pvalue=4.877334028605682e-161, dof=15, expected_freq=array([[2.30875040e+03, 3.62496032e+01],
       [3.78064031e+02, 5.93596913e+00],
       [6.37195419e+03, 1.00045813e+02],
       [9.05778407e+02, 1.42215927e+01],
       [2.14039376e+03, 3.36062419e+01],
       [3.95785782e+02, 6.21421769e+00],
       [8.04370607e+02, 1.26293927e+01],
       [3.16037901e+03, 4.96209920e+01],
       [3.40356082e+03, 5.34391804e+01],
       [9.80997397e+03, 1.54026032e+02],
       [2.44855532e+03, 3.84446751e+01],
       [4.08191008e+03, 6.40899167e+01],
       [1.63040113e+03, 2.55988669e+01],
       [2.36388473e+03, 3.71152653e+01],
       [5.61188796e+01, 8.81120418e-01],
       [5.61188796e+01, 8.81120418e-01]]))
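
The `expected_freq` array in the output holds the counts each cell would have if category and `comments_disabled` were independent: row total × column total / grand total. The sketch below reproduces that arithmetic on a small made-up table (the counts and function names are hypothetical) and flags cells below the common rule-of-thumb threshold of 5. That check is relevant here because the last two rows of the real output have expected counts of about 0.88 in the second column.

```python
def expected_counts(table):
    """Expected cell counts under independence for a 2-D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

# Hypothetical counts: rows are categories, columns are comments enabled/disabled
observed = [
    [90, 10],
    [60, 40],
]

expected = expected_counts(observed)
statistic = chi_square_statistic(observed)
small_cells = [e for row in expected for e in row if e < 5]

print("expected:", expected)
print(f"chi-square statistic: {statistic:.2f}")
print("cells with expected count < 5:", len(small_cells))
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default for 2×2 tables, so its statistic for this toy table would be somewhat smaller than the uncorrected value computed here.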




## 5. Dimensionality Reduction
We standardized the numeric columns and applied PCA, keeping the first two components so the data is easier to visualize and analyze.



In [9]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df.select_dtypes(include=[float, int]))

# Apply PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_df)

# Add PCA results to the dataframe
df['PCA1'] = pca_result[:, 0]
df['PCA2'] = pca_result[:, 1]
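
Two things are worth checking after a PCA like the one above: how much variance the two components retain, and whether the target variable leaked into the inputs. `df.select_dtypes(include=[float, int])` keeps `views`, so any model that later predicts views from `PCA1` and `PCA2` sees a transformed copy of its own target. The sketch below uses NumPy on synthetic data (the column meanings and all numbers are assumptions for illustration) to run PCA via SVD on standardized features with the target held out, and reports the explained-variance ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for likes, dislikes and comment_count (illustration only)
n = 200
likes = rng.lognormal(mean=9.0, sigma=1.0, size=n)
dislikes = 0.05 * likes * rng.lognormal(mean=0.0, sigma=0.3, size=n)
comments = 0.10 * likes * rng.lognormal(mean=0.0, sigma=0.3, size=n)

# 'views' is deliberately left out so the target cannot leak into the components
features = np.column_stack([likes, dislikes, comments])

# Standardize each column to zero mean and unit variance
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

# PCA via singular value decomposition of the standardized matrix
_, singular_values, components = np.linalg.svd(standardized, full_matrices=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Project onto the first two principal components
projected = standardized @ components[:2].T

print("explained variance ratio:", np.round(explained_variance_ratio, 3))
print("projected shape:", projected.shape)
```

With scikit-learn, the same ratio is available as `pca.explained_variance_ratio_` on the fitted `PCA` object.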


## 6. Algorithm Implementation
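We implemented clustering (K-means), regression (linear regression), and classification (logistic regression) on the PCA output to gain insights from the data; probabilistic reasoning follows in Section 7. Only the classification report is printed below.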




In [10]:
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # fixed seed for reproducible clusters
df['cluster'] = kmeans.fit_predict(scaled_df)

# Regression
X = df[['PCA1', 'PCA2']]
y = df['views']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Classification
df['high_views'] = df['views'] > df['views'].median()
X = df[['PCA1', 'PCA2']]
y = df['high_views']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

       False       0.88      0.94      0.90      4047
        True       0.93      0.87      0.90      4143

    accuracy                           0.90      8190
   macro avg       0.90      0.90      0.90      8190
weighted avg       0.90      0.90      0.90      8190
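
The linear regression fitted in this cell is never scored, so its quality cannot be judged from the notebook output. One common summary is the coefficient of determination R², sketched below in plain Python on made-up numbers. `r_squared` is a hypothetical helper; `sklearn.metrics.r2_score` computes the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up view counts and predictions, for illustration only
y_true = [100_000, 250_000, 400_000, 650_000]
y_pred = [120_000, 230_000, 420_000, 600_000]

score = r_squared(y_true, y_pred)
print(f"R^2 = {score:.3f}")
```

An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.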



## 7. Probabilistic Reasoning
We used Gaussian Naive Bayes for probabilistic reasoning to predict high view counts based on PCA components.



In [11]:
from sklearn.naive_bayes import GaussianNB

# Probabilistic Reasoning
X = df[['PCA1', 'PCA2']]
y = df['high_views']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

       False       0.68      0.93      0.78      4047
        True       0.89      0.56      0.69      4143

    accuracy                           0.75      8190
   macro avg       0.79      0.75      0.74      8190
weighted avg       0.79      0.75      0.74      8190
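
To make the probabilistic reasoning explicit: Gaussian Naive Bayes models each feature within each class as a normal distribution, multiplies the per-feature likelihoods by the class prior, and normalizes. The sketch below does this for a single feature in plain Python. The training values, labels, and function names are hypothetical, and the code is not scikit-learn's implementation (which, among other things, adds a small variance-smoothing term).

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(values, labels):
    """Estimate the prior, mean and variance of one feature for each class."""
    params = {}
    for label in set(labels):
        xs = [v for v, l in zip(values, labels) if l == label]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[label] = (len(xs) / len(values), mean, var)
    return params

def posterior(x, params):
    """P(class | x) from Bayes' rule, normalized over all classes."""
    joint = {c: prior * gaussian_pdf(x, mean, var)
             for c, (prior, mean, var) in params.items()}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}

# Hypothetical first-principal-component scores and high_views labels
pca1 = [-1.2, -0.8, -0.5, -0.9, 0.7, 1.1, 0.9, 1.5]
high_views = [False, False, False, False, True, True, True, True]

params = fit(pca1, high_views)
probs = posterior(1.0, params)
print(f"P(high_views | PCA1 = 1.0) = {probs[True]:.3f}")
```

The fitted `GaussianNB` model exposes comparable class probabilities through `nb.predict_proba(X_test)`, which is useful when a likelihood estimate is wanted rather than a hard label.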



## Conclusion
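The analysis provided insights into YouTube video performance and user engagement, but the models shown here only partly test our hypothesis.

Goal: To understand the factors influencing YouTube video virality and predict whether a video will trend.

Findings:

* Logistic regression and Gaussian Naïve Bayes, both trained on the first two principal components of the numeric columns, predicted `high_views` (views above the median). Logistic regression achieved the higher accuracy (~90%, versus ~75% for Naïve Bayes).
* `views` is one of the columns fed into the PCA, so these accuracies are likely optimistic. The `likes_to_views_ratio` and `comments_to_views_ratio` features computed in `DataHandler.preprocess` were not used as model inputs, so their value as predictors of trending status remains to be tested.

Future Steps: Refine the model using the engagement ratios and additional features like video duration and category, and incorporate advanced machine learning techniques for improved predictions.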