# Numeric Features
The YouTube dataset contains a variety of numeric variables. This section examines these variables and their relationship with the number of days a video remains on the trending page, as well as the number of days a video takes to reach the trending page.

The variables given in the dataset which will be considered in this section are:

- Views
- Likes
- Dislikes
- Publish Date
- Publish Time
- Trending Date 
- Category


## Motivation
Our motivation in this section is to understand which of the variables listed above are considered by YouTube's algorithm and to identify, given these features, whether it is possible to predict how soon after being published a video will reach trending and, once it does, how long it will remain trending.

In [55]:
import numpy as np 
import pandas as pd 
import os
import json
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline
plt.style.use('ggplot')
import seaborn as sns
%matplotlib notebook

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression


from lime.lime_text import LimeTextExplainer
from tqdm import tqdm
import string
import random
import operator
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from statistics import *
import concurrent.futures
import time
#import pyLDAvis.sklearn
from pylab import bone, pcolor, colorbar, plot, show, rcParams, savefig
import warnings
import nltk


# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# keras module for building LSTM 
#from keras.preprocessing.sequence import pad_sequences
#from keras.layers import Embedding, LSTM, Dense, Dropout
#from keras.preprocessing.text import Tokenizer
#from keras.callbacks import EarlyStopping
#from keras.models import Sequential
#import keras.utils as ku 
from tensorflow.keras import backend
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
import tensorflow.keras.utils as ku
# set seeds for reproducibility
from numpy.random import seed
seed(1)

warnings.filterwarnings("ignore")


In [45]:
ENG_df = pd.read_csv('../../data/data.csv')

#The publish_time column contains the date and time together, so first I will parse it into the correct format.
ENG_df['publish_time'] = pd.to_datetime(ENG_df['publish_time'], errors='coerce', format='%Y-%m-%dT%H:%M:%S.%fZ')

#Removing any null values
ENG_df = ENG_df[ENG_df['trending_date'].notnull()]
ENG_df = ENG_df[ENG_df['publish_time'].notnull()]

#Separating previous publish_time column into two separate columns, publish date and publish time. 
ENG_df.insert(4, 'publish_date', ENG_df['publish_time'].dt.date)
ENG_df['publish_time'] = ENG_df['publish_time'].dt.time

#splitting publish_time column into separate hour, minute and second columns
ENG_df['publish_time'] = ENG_df['publish_time'].astype(str)
ENG_df[['hour','minute','second']] = ENG_df.publish_time.str.split(":", expand=True).astype(int)

#Adding like_rate, dislike_rate and comment_rate features to observe. I expect there to be a relationship between these features
#and time to reach trending, as well as time remained on trending. 
#These features represent viewer engagement, what percentage of viewers actually like, dislike and/or comment on the videos.
ENG_df['like_rate'] = ENG_df['likes']/ENG_df['views']*100
ENG_df['dislike_rate'] = ENG_df['dislikes']/ENG_df['views']*100
ENG_df['comment_rate'] = ENG_df['comment_count']/ENG_df['views']*100



In [46]:
US_init = ENG_df[ENG_df['country']== 'Country.us']
GB_init = ENG_df[ENG_df['country']== 'Country.gb']
CA_init = ENG_df[ENG_df['country']== 'Country.ca']

In [50]:
#Adding a column to each region's dataframe which represents the number of days a video remains trending.
occurances_US = US_init.groupby(['video_id']).size()
days_trending_US = occurances_US.to_frame(name = 'days_trending').reset_index()

US = pd.merge(left=US_init, right=days_trending_US, left_on='video_id', right_on='video_id', how='outer')

occurances_GB = GB_init.groupby(['video_id']).size()
days_trending_GB = occurances_GB.to_frame(name = 'days_trending').reset_index()

GB = pd.merge(left=GB_init, right=days_trending_GB, left_on='video_id', right_on='video_id', how='outer')

occurances_CA = CA_init.groupby(['video_id']).size()
days_trending_CA = occurances_CA.to_frame(name = 'days_trending').reset_index()

CA = pd.merge(left=CA_init, right=days_trending_CA, left_on='video_id', right_on='video_id', how='outer')
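
The three per-region blocks above all follow the same pattern: count the rows per `video_id`, then merge the count back. As a minimal sketch on a made-up toy frame (not the real data), the same column could be produced in one step with `groupby(...).transform('size')`. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for one region's frame; the real data has many more columns.
toy = pd.DataFrame({
    "video_id": ["a", "a", "b", "a", "c", "b"],
    "views": [10, 20, 5, 30, 7, 9],
})

# Count how many rows (trending days) each video_id has and broadcast that
# count back onto every row, without a separate merge step.
toy["days_trending"] = toy.groupby("video_id")["video_id"].transform("size")

print(toy)
```

One difference worth noting: this keeps the input row order, so any later step that relies on the row order produced by the merge should be checked before swapping approaches.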


In [52]:
#Creating dataframes which consist of the first occurrence of a video on trending,
#as well as one for a video's last occurrence on trending
US_last = US.drop_duplicates(['video_id'], keep='last')
US_first = US.drop_duplicates(['video_id'], keep='first')

GB_last = GB.drop_duplicates(['video_id'], keep='last')
GB_first = GB.drop_duplicates(['video_id'], keep='first')

CA_last = CA.drop_duplicates(['video_id'], keep='last')
CA_first = CA.drop_duplicates(['video_id'], keep='first')

#US.set_index(['trending_date','video_id'], inplace= True)
#US.set_index(['trending_date', 'video_id'], inplace=True)
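
`drop_duplicates(keep='first')` and `keep='last'` pick rows by position, so "first" and "last" only mean earliest and latest trending day if the rows of each video are already in chronological order. The sketch below makes that ordering explicit on a toy frame. The ISO date strings are an assumption; the real `trending_date` column may need its own format string when parsed.

```python
import pandas as pd

# Toy rows deliberately out of chronological order. The ISO date strings
# are an assumption made for this sketch, not the dataset's actual format.
toy = pd.DataFrame({
    "video_id": ["a", "b", "a", "b", "a"],
    "trending_date": ["2018-01-03", "2018-01-02", "2018-01-01",
                      "2018-01-01", "2018-01-02"],
    "views": [300, 80, 100, 50, 200],
})
toy["trending_date"] = pd.to_datetime(toy["trending_date"])

# Sort chronologically within each video so that keep='first' / keep='last'
# select the earliest and latest trending day respectively.
ordered = toy.sort_values(["video_id", "trending_date"])
first = ordered.drop_duplicates("video_id", keep="first")
last = ordered.drop_duplicates("video_id", keep="last")

print(first[["video_id", "views"]])
print(last[["video_id", "views"]])
```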


Now I will use sklearn's SelectKBest on each region's dataset in turn. I will consider like rate, comment rate, dislike rate, and the hour, minute and second of publication, and assess how informative each is about days trending. Additionally, I will check whether the number of views a video has on its first occurrence on trending is a useful predictor of the number of days it remains on trending.

In [53]:
print(US_first.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6351 entries, 0 to 40948
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   video_id                6351 non-null   object 
 1   trending_date           6351 non-null   object 
 2   title                   6351 non-null   object 
 3   channel_title           6351 non-null   object 
 4   publish_date            6351 non-null   object 
 5   category_id             6351 non-null   int64  
 6   publish_time            6351 non-null   object 
 7   tags                    6351 non-null   object 
 8   views                   6351 non-null   int64  
 9   likes                   6351 non-null   int64  
 10  dislikes                6351 non-null   int64  
 11  comment_count           6351 non-null   int64  
 12  thumbnail_link          6351 non-null   object 
 13  comments_disabled       6351 non-null   bool   
 14  ratings_disabled        6351 non-null  

In [64]:
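#Beginning with the US dataset. Considering the features of a video on its first occurrence on trending.
#Selecting features to consider for X_US_first: views, like_rate, dislike_rate, comment_rate, hour, minute and second of publish.
#Columns are selected by name rather than by position so the selection does not depend on column order.
X_US_first = US_first[['views', 'like_rate', 'dislike_rate', 'comment_rate', 'hour', 'minute', 'second']]
y_US_first = US_first['days_trending']

#print(X_first.head)
#print(Y_first.head)

#For the last occurrence the same features are used, without views.
X_US_last = US_last[['like_rate', 'dislike_rate', 'comment_rate', 'hour', 'minute', 'second']]
Y_US_last = US_last['days_trending']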

print(X_US_last.head())
print(Y_US_last.head())

    like_rate  dislike_rate  comment_rate  hour  minute  second
6    3.755347      0.310811      0.863541    17      13       1
13   2.475692      0.188365      0.324418     7      30       0
20   3.523733      0.136921      0.187942    19       5      24
27   1.831773      0.151763      0.327177    11       0       4
33   5.441241      0.085701      0.729767    18       1      41
6     7
13    7
20    7
27    7
33    6
Name: days_trending, dtype: int64


In [61]:
US_first_selector = SelectKBest(score_func=mutual_info_regression, k=5)
US_first_new = US_first_selector.fit_transform(X_US_first, y_US_first)
#print(US_first_new[:5])
#print(X_US_first)

scores_US_first = pd.DataFrame({'Variable' : X_US_first.columns, 'Score': US_first_selector.scores_})
#pvalue_US_first = pd.DataFrame({'Variable' : X_US_first.columns, 'p values' : US_first_selector.pvalues_})

print(scores_US_first)




       Variable     Score
0         views  0.083957
1     like_rate  0.047186
2  dislike_rate  0.016911
3  comment_rate  0.032966
4          hour  0.019325
5        minute  0.020259
6        second  0.005865


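Selecting the best 5 features, the scores_ show that the best features to consider on the first occurrence of a video on the trending page, in order to predict the number of days it will remain trending, are (in descending order of score):
1. Views
2. Like rate
3. Comment rate
4. Minute uploaded
5. Hour uploaded

Dislike rate (0.0169) falls just outside the top five, close behind hour (0.0193) and minute (0.0203), so the ordering of these three should not be over-interpreted.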
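
Reading the top k features off a printed table by eye is error-prone, especially when several scores are close. As a minimal sketch on synthetic data (the columns `signal`, `noise_a` and `noise_b` are invented for illustration), the scores can be sorted and the selected features read directly from the selector with `get_support()`. The wrapper `mi_scores` is a hypothetical helper that fixes `random_state`, since `mutual_info_regression` is otherwise randomized between runs.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic stand-in data: 'signal' drives the target, the noise columns do not.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y = 3 * X["signal"] + rng.normal(scale=0.1, size=500)


def mi_scores(X, y):
    """Hypothetical helper: mutual information with a fixed seed."""
    return mutual_info_regression(X, y, random_state=0)


selector = SelectKBest(score_func=mi_scores, k=2).fit(X, y)

# Rank every feature by score, highest first.
ranking = (
    pd.DataFrame({"Variable": X.columns, "Score": selector.scores_})
    .sort_values("Score", ascending=False)
    .reset_index(drop=True)
)

# Ask the selector which columns it actually kept.
selected = list(X.columns[selector.get_support()])

print(ranking)
print("Selected:", selected)
```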

Next I will perform the same feature selection on the last occurrence of a video on the trending page. Because this is the last occurrence, by which point the view count already reflects the time spent on trending, I will omit the views feature from selection.

In [65]:
US_last_selector = SelectKBest(score_func=mutual_info_regression, k=5)
US_last_new = US_last_selector.fit_transform(X_US_last, Y_US_last)

US_scores_last = pd.DataFrame({'Variable' : X_US_last.columns, 'Score': US_last_selector.scores_})
#print(X_last_new[:5])
#print(X_last)

print(US_scores_last)

       Variable     Score
0     like_rate  0.020920
1  dislike_rate  0.012689
2  comment_rate  0.011331
3          hour  0.026092
4        minute  0.013858
5        second  0.000000


Considering the last occurrence of a video on trending, the 5 best selected features, in descending order of score, were:

1. Hour uploaded
2. Like rate
3. Minute uploaded
4. Dislike rate
5. Comment rate
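
The same scoring is meant to be applied to each region's dataset. A rough sketch of how the per-region scores could be collected side by side is given below. It calls `mutual_info_regression` directly, which is the scoring function passed to SelectKBest above. The frames built by `make_toy_region` are random placeholders (an assumption of this sketch), so the printed numbers carry no meaning; in practice the region frames such as `US_last`, `GB_last` and `CA_last` would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

FEATURES = ["like_rate", "dislike_rate", "comment_rate", "hour", "minute", "second"]


def make_toy_region(seed, n=300):
    """Hypothetical stand-in for a region's *_last frame; values are random."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "like_rate": rng.uniform(0, 10, n),
        "dislike_rate": rng.uniform(0, 1, n),
        "comment_rate": rng.uniform(0, 2, n),
        "hour": rng.integers(0, 24, n),
        "minute": rng.integers(0, 60, n),
        "second": rng.integers(0, 60, n),
    })
    df["days_trending"] = rng.integers(1, 15, n)
    return df


regions = {
    "US": make_toy_region(1),
    "GB": make_toy_region(2),
    "CA": make_toy_region(3),
}

# One column of mutual-information scores per region, one row per feature.
scores = pd.DataFrame(
    {
        name: mutual_info_regression(df[FEATURES], df["days_trending"], random_state=0)
        for name, df in regions.items()
    },
    index=FEATURES,
)

print(scores.round(4))
```

Laying the regions out in one table makes it easier to see whether a feature ranks consistently or only stands out in a single region.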

