# Data Preprocessing 

## 1. Import and Load Data
This section loads the primary YouTube dataset and the category mapping file, 
then merges them to create a single dataset with descriptive category titles 
instead of numeric IDs. This step ensures the dataset is ready for exploratory 
checks and later feature engineering.

In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 
import json

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split



In [3]:
# Loading main data
df = pd.read_csv('../data/raw/Usvideos.csv')
df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [4]:
# Loading category id json data
with open('../data/raw/US_category_id.json', 'r') as cat_id:
    category_data = json.load(cat_id)

# Build a lookup table of integer category IDs and their titles
category_items = category_data['items']
categories_df = pd.DataFrame(
    [{'category_id': int(item['id']), 'category_title': item['snippet']['title']}
     for item in category_items]
)

In [5]:
# Combining datasets on the category_id key
df = df.merge(categories_df, on='category_id', how='left')
df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,category_title
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,People & Blogs
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",Entertainment
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,Comedy
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...,Entertainment
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...,Entertainment
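
A left merge silently keeps rows whose `category_id` has no match in the mapping file, leaving `category_title` null for them. The following is a minimal sketch of one way to check for that, using toy stand-in tables (the `videos` and `categories` frames and the ID `99` are illustrative assumptions, not values from the dataset). It relies on the `validate` and `indicator` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the video table and the category lookup (illustrative values)
videos = pd.DataFrame({"video_id": ["a", "b", "c"], "category_id": [22, 24, 99]})
categories = pd.DataFrame(
    {"category_id": [22, 24], "category_title": ["People & Blogs", "Entertainment"]}
)

# validate raises MergeError if a category_id is duplicated in `categories`;
# indicator adds a `_merge` column recording where each row found a match
merged = videos.merge(
    categories, on="category_id", how="left", validate="many_to_one", indicator=True
)

# Category IDs present in the videos but absent from the lookup
unmatched = merged.loc[merged["_merge"] == "left_only", "category_id"].unique()
print(unmatched)
```

In this notebook the later null count reports zero missing `category_title` values, so every ID appears to have matched.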


## Exploratory Checks for Preprocessing
This section inspects column types and null counts to inform feature removal, data cleaning and feature engineering.

In [5]:
# Assessing features
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   video_id                40949 non-null  object
 1   trending_date           40949 non-null  object
 2   title                   40949 non-null  object
 3   channel_title           40949 non-null  object
 4   category_id             40949 non-null  int64 
 5   publish_time            40949 non-null  object
 6   tags                    40949 non-null  object
 7   views                   40949 non-null  int64 
 8   likes                   40949 non-null  int64 
 9   dislikes                40949 non-null  int64 
 10  comment_count           40949 non-null  int64 
 11  thumbnail_link          40949 non-null  object
 12  comments_disabled       40949 non-null  bool  
 13  ratings_disabled        40949 non-null  bool  
 14  video_error_or_removed  40949 non-null  bool  
 15  de

In [6]:
# Assessing Null value counts for features  
df.isna().sum()

video_id                    0
trending_date               0
title                       0
channel_title               0
category_id                 0
publish_time                0
tags                        0
views                       0
likes                       0
dislikes                    0
comment_count               0
thumbnail_link              0
comments_disabled           0
ratings_disabled            0
video_error_or_removed      0
description               570
category_title              0
dtype: int64
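
The raw counts above show 570 missing descriptions out of 40,949 rows, roughly 1.4%. Expressing nulls as a percentage can make that easier to judge at a glance. The sketch below shows the idea on a toy frame (the `toy` data is made up for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame with two missing descriptions out of four rows (illustrative values)
toy = pd.DataFrame(
    {"title": ["a", "b", "c", "d"], "description": ["x", np.nan, "y", np.nan]}
)

# isna() gives booleans, so the column mean is the share of missing values
null_share = toy.isna().mean().mul(100).round(2)
print(null_share)
```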

## Removing Non-Predictive & Leaky Variables
This section removes features that are:
  - Unique identifiers (video_id), which do not carry any predictive value.
  - Redundant fields (category_id), already represented by category_title after the merge.
  - Features that cause data leakage (trending_date, video_error_or_removed, likes, dislikes, comment_count). Their values are only known after a video has been published, so they would not be available at prediction time.
  - Fields irrelevant for pre-upload prediction (channel_title, thumbnail_link).

Removing these columns reduces noise and prevents leakage in the modeling phase. 

In [7]:
# Dropping non-predictive, redundant and leaky variables
df = df.drop(['video_id', 'category_id', 'thumbnail_link', 'channel_title', 'video_error_or_removed',
              'comment_count', 'likes', 'dislikes', 'trending_date'], axis=1)
df.head()

Unnamed: 0,title,publish_time,tags,views,comments_disabled,ratings_disabled,description,category_title
0,WE WANT TO TALK ABOUT OUR MARRIAGE,2017-11-13T17:13:01.000Z,SHANtell martin,748374,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,People & Blogs
1,The Trump Presidency: Last Week Tonight with J...,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,False,False,"One year after the presidential election, John...",Entertainment
2,"Racist Superman | Rudy Mancuso, King Bach & Le...",2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,Comedy
3,Nickelback Lyrics: Real or Fake?,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,False,False,Today we find out if Link is a Nickelback amat...,Entertainment
4,I Dare You: GOING BALD!?,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,False,False,I know it's been a while since we did this sho...,Entertainment


## 2. Data Cleaning 
This section replaces null description values with empty strings to prepare for feature engineering.  

In [8]:
# Assessing the null values in the description feature
df[df['description'].isnull()].head()

Unnamed: 0,title,publish_time,tags,views,comments_disabled,ratings_disabled,description,category_title
42,Dennis Smith Jr. and LeBron James go back and ...,2017-11-13T15:11:00.000Z,[none],945,False,False,,Sports
47,Stephon Marbury and Jimmer Fredette fight in C...,2017-11-10T18:23:05.000Z,"NBA|""Basketball""|""Sports""",956169,False,False,,Sports
175,Sphaera - demonstrating interaction,2017-11-04T20:48:16.000Z,[none],1827,False,False,,Science & Technology
267,Dennis Smith Jr. and LeBron James go back and ...,2017-11-13T15:11:00.000Z,[none],21544,False,False,,Sports
312,Stephon Marbury and Jimmer Fredette fight in C...,2017-11-10T18:23:05.000Z,"NBA|""Basketball""|""Sports""",1015189,False,False,,Sports


In [9]:
# Replacing NA values with empty strings so that string operations in feature engineering do not return NaN
df['description'] = df['description'].fillna('')
df[df['description'] == ''].head()

Unnamed: 0,title,publish_time,tags,views,comments_disabled,ratings_disabled,description,category_title
42,Dennis Smith Jr. and LeBron James go back and ...,2017-11-13T15:11:00.000Z,[none],945,False,False,,Sports
47,Stephon Marbury and Jimmer Fredette fight in C...,2017-11-10T18:23:05.000Z,"NBA|""Basketball""|""Sports""",956169,False,False,,Sports
175,Sphaera - demonstrating interaction,2017-11-04T20:48:16.000Z,[none],1827,False,False,,Science & Technology
267,Dennis Smith Jr. and LeBron James go back and ...,2017-11-13T15:11:00.000Z,[none],21544,False,False,,Sports
312,Stephon Marbury and Jimmer Fredette fight in C...,2017-11-10T18:23:05.000Z,"NBA|""Basketball""|""Sports""",1015189,False,False,,Sports
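
To illustrate why the nulls are filled before feature engineering, here is a small sketch on a toy series (the values are made up). The `.str.len()` accessor returns NaN for a missing entry, which also turns the result into floats. After `fillna('')`, the same call returns an integer length of 0.

```python
import numpy as np
import pandas as pd

# Toy descriptions: one present, one missing, one already empty
desc = pd.Series(["hello", np.nan, ""])

raw_len = desc.str.len()                # missing entry stays NaN
filled_len = desc.fillna("").str.len()  # missing entry becomes length 0

print(raw_len.tolist())
print(filled_len.tolist())
```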


## 3. Feature Engineering 
This section creates new features from the information present in the dataset. Because the models to be used cannot work with raw text or timestamps directly, the new features encode information that would otherwise be inaccessible to them.

In [10]:
# Engineering Description length to assess the effect of having longer or shorter descriptions
df['description_length'] = df['description'].str.len()

In [11]:
# Title Length to assess the effect of having longer titles on video views
df['title_length'] = df['title'].str.len()


In [12]:
# Tag count to assess the effect of having more tags 
df['tag_count'] = df['tags'].str.split('|').str.len()
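
One caveat with counting split pieces: rows without tags hold the placeholder string `[none]` (visible in the outputs above), and splitting that string yields one piece, so such rows get a tag count of 1. A possible adjustment, sketched here on a toy series and assuming `[none]` always means "no tags", is to set those rows to 0.

```python
import pandas as pd

# Toy tag strings in the dataset's pipe-separated style
tags = pd.Series(['NBA|"Basketball"|"Sports"', "[none]", "SHANtell martin"])

# Count pipe-separated pieces, then zero out the "[none]" placeholder rows
tag_count = tags.str.split("|").str.len()
tag_count = tag_count.where(tags != "[none]", 0)

print(tag_count.tolist())
```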

In [13]:
# Day of the week video is posted to assess the effect of day of the week on video views 
df['publish_time'] = pd.to_datetime(df['publish_time'])
df['week_day'] = df['publish_time'].dt.day_name()

In [14]:
# Hour of the day that the video is posted to assess the effect the hour of day has on video success 
df['hour'] = df['publish_time'].dt.hour
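
The `publish_time` strings end in `Z`, so `pd.to_datetime` parses them as UTC, and the weekday and hour above are UTC values. If local posting time matters, one option is to convert before extracting the features. The sketch below assumes US Eastern time (`America/New_York`) purely for illustration; the notebook does not state which timezone, if any, is intended.

```python
import pandas as pd

# A single illustrative timestamp in the dataset's ISO 8601 format
publish_time = pd.to_datetime(pd.Series(["2017-11-13T02:00:00.000Z"]))

# Convert from UTC to an assumed local timezone before extracting features
local_time = publish_time.dt.tz_convert("America/New_York")

utc_day, utc_hour = publish_time.dt.day_name()[0], publish_time.dt.hour[0]
local_day, local_hour = local_time.dt.day_name()[0], local_time.dt.hour[0]

print(utc_day, utc_hour)
print(local_day, local_hour)
```

In this example the UTC and local values fall on different weekdays, which shows how the choice of timezone can shift both engineered features.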

In [15]:
# Dropping features used in feature engineering due to either being uninterpretable 
# by the models to be used or unhelpful for prediction 
df = df.drop(['publish_time', 'description', 'tags', 'title'], axis = 1)

## 4. Data Encoding
This section creates one-hot encoded dummy variables from the categorical variables. This encoding allows models that require numeric inputs to use the categorical variables for prediction.

In [16]:
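# One-hot encoding categorical features; drop_first removes one level per feature to avoid redundant columns
df_encoded = pd.get_dummies(data=df, columns=['week_day', 'hour', 'category_title'], drop_first=True)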
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 50 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   views                                 40949 non-null  int64 
 1   comments_disabled                     40949 non-null  bool  
 2   ratings_disabled                      40949 non-null  bool  
 3   description_length                    40949 non-null  int64 
 4   title_length                          40949 non-null  int64 
 5   tag_count                             40949 non-null  object
 6   week_day_Monday                       40949 non-null  bool  
 7   week_day_Saturday                     40949 non-null  bool  
 8   week_day_Sunday                       40949 non-null  bool  
 9   week_day_Thursday                     40949 non-null  bool  
 10  week_day_Tuesday                      40949 non-null  bool  
 11  week_day_Wednesday          
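
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy `week_day` column (made-up values). The first level in sorted order, here `Friday`, is dropped and becomes the implicit reference level, which is consistent with the absence of a `week_day_Friday` column in the output above.

```python
import pandas as pd

# Toy weekday column (illustrative values)
toy = pd.DataFrame({"week_day": ["Monday", "Friday", "Sunday", "Friday"]})

# Levels are sorted, then the first one ("Friday") is dropped
dummies = pd.get_dummies(toy, columns=["week_day"], drop_first=True)

print(list(dummies.columns))
```

A row with all weekday dummies equal to False therefore represents a Friday.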

## Feature Transformation 
This section transforms the target variable to account for its significant right skew. The log transformation brings the distribution closer to normal, which helps it satisfy certain model assumptions.

In [17]:
# Log-transforming the right-skewed target; log1p handles zero views safely
df_encoded['log_views'] = np.log1p(df_encoded['views'])


In [18]:
# Dropping raw views to simplify modeling 
df_encoded = df_encoded.drop(['views'], axis = 1 )
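
Since the raw `views` column is dropped, predictions made on the log scale need to be mapped back when reporting view counts. `np.expm1` is the inverse of `np.log1p`. A minimal sketch with a few illustrative values:

```python
import numpy as np

# Illustrative view counts, including zero
views = np.array([0, 945, 748374])

log_views = np.log1p(views)      # log(1 + views), defined at zero
recovered = np.expm1(log_views)  # exp(x) - 1 inverts the transform

print(log_views)
print(recovered)
```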

## Outlier Handling 

Because the outlier values in the dataset, in areas such as views and likes, are representative of the nature of the underlying system, they will not be removed so as to avoid blurring the signal.

## Train-Test Split
This section splits the dataset into train and test datasets. This is done to allow for proper model evaluation.

In [19]:
# Separating dataset into train and test sets
train_df, test_df = train_test_split(df_encoded, test_size=0.2, random_state=42)
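
The earlier outputs show the same video appearing on several rows (one per trending day), so a purely random split can place copies of one video in both sets. One possible alternative, not what this notebook does, is a group-aware split. The sketch below uses scikit-learn's `GroupShuffleSplit` on toy data and assumes a `video_id` column is still available at split time, which would mean dropping it later than the notebook currently does.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: several videos appear on more than one row (illustrative values)
toy = pd.DataFrame({
    "video_id": ["a", "a", "b", "c", "c", "d", "e", "e"],
    "log_views": [6.8, 9.9, 13.7, 7.5, 8.1, 12.0, 10.2, 10.9],
})

# Split by group so that all rows of a video land on the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["video_id"]))
train_toy, test_toy = toy.iloc[train_idx], toy.iloc[test_idx]

# No video should appear in both sets
overlap = set(train_toy["video_id"]) & set(test_toy["video_id"])
print(overlap)
```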

## Standardization
This section standardises the features and target variable to avoid scale issues when using models like lasso and ridge regression. The scaler is fit on the training data only to avoid leakage.

In [20]:
# Scaling to unit variance (with_mean=False skips centering); fit on training data only
scaler = StandardScaler(with_mean=False)
scaler.fit(train_df)
train_df = pd.DataFrame(scaler.transform(train_df), columns=df_encoded.columns)
test_df = pd.DataFrame(scaler.transform(test_df), columns=df_encoded.columns)
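
The cell above scales every column, including the target. A variation, sketched here on toy data and not the notebook's approach, is to scale only the feature columns so that `log_views` stays on its original log scale and predictions are easier to interpret. The column names are taken from the notebook; the values and the default centering are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train and test sets (illustrative values)
train = pd.DataFrame({"title_length": [10.0, 20.0, 30.0], "log_views": [5.0, 6.0, 7.0]})
test = pd.DataFrame({"title_length": [40.0], "log_views": [8.0]})
feature_cols = ["title_length"]

# Fit on training features only, then reuse the training statistics for the test set
scaler = StandardScaler()
train_scaled, test_scaled = train.copy(), test.copy()
train_scaled[feature_cols] = scaler.fit_transform(train[feature_cols])
test_scaled[feature_cols] = scaler.transform(test[feature_cols])

print(train_scaled)
print(test_scaled)
```

Because the test set is transformed with the training mean and variance, its scaled values need not have zero mean, which is expected.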

## Saving Data 
This section exports the training and test datasets as CSV files to be used later in the pipeline.

In [21]:
# Saving training and test data as csv files 
train_df.to_csv('../data/processed/training_data.csv', index = False, encoding= 'utf-8')
test_df.to_csv('../data/processed/test_data.csv', index = False, encoding= 'utf-8')
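
`to_csv` does not create missing directories, so the export fails if the output folder does not exist yet. A minimal sketch of guarding against that follows. It writes to a temporary directory so it can run anywhere; the toy frame and the temporary path are illustrative, and in the notebook the target would be the `../data/processed` folder used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the processed training data (illustrative values)
toy = pd.DataFrame({"title_length": [34, 62], "log_views": [13.5, 14.75]})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder (and any parents) if it is not there yet
    out_dir = Path(tmp) / "data" / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "training_data.csv"
    toy.to_csv(out_path, index=False, encoding="utf-8")

    # Read the file back to confirm the round trip
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```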