# Feature Analysis of Trending YouTube Videos Across the USA, Great Britain, Canada and Mexico

# Notes on the Data

The data for this project is on the public domain and was obtained from ”Kaggle.com”. The data itself contains trending video information from the USA, Great Britain, Germany, Canada, France,Russia, Mexico, South Korea, Japan, and India.  Each data set contains a dated list (spanning 200 days) of the top 200 trending videos for each country,  as well as 16 additional features such as:the video’s title, its description (written by the videos author), video ‘tags’, URL, number of likes,dislikes, and views, a generic category ID, and upload date. This means that videos that stay on thetrending list over several days appear repeatedly on the data, once on every daily trending list. The original data set size for each country is about 40000 by 16.

# Loading main libraries

In [29]:
### Loading the main libraries
import warnings #prevent "future warning" errors
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
%matplotlib inline

#Getting Stop words for languange analysis
from nltk.corpus import stopwords
stop = stopwords.words('english')
stopEs = stopwords.words('spanish')

#Importing K folds
from sklearn import model_selection

#Loading latent analysis models
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

#loading Regression Models
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFE
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc, roc_auc_score,average_precision_score

# Defining Main Model Functions

## LDA Functions

In [6]:
# LDA function
def LDA(NComponents,dataSource):
    """Function for performing LDA, returns data organized into NComponents"""
    
    model = LatentDirichletAllocation(n_components = NComponents) 
    model.fit(dataSource.values)
    out = model.transform(dataSource.values)
    components = model.components_
    print ("Transformed Size: ", out.shape)
    print ("Component Size: ", components.shape)

    return out,components

##Choosing LDA's Nth topic and analyzing the features within top chosen percentile
def topicDescription(dataSource,components,topic,percentile):
    """
    This function gets the LDA-transformed data as an input and 
    outputs an analysis of the features within top chosen percentile of the Nth topic.
    """
    topicData = components.iloc[topic,:] 
    indices = np.where(topicData > np.percentile(topicData, percentile))
    
    topTopic = pd.DataFrame({'Feature': dataSource.columns.values[indices],('Distribution of Topic #'+str(topic+1)): topicData.iloc[indices]}) #if from csv file
    topTopic.sort_values(('Distribution of Topic #'+str(topic+1)), inplace=True, ascending=False)
    topTopic.head(n=80)
    return topTopic

##Choosing the users that most belog to each topic
def videoDescription(dataSource,transformed,topic,percentile):
    """
    This function gets the LDA-transformed data as an input and 
    outputs an analysis of the users within the Nth topic.
    """
    videoData = transformed[topic]  
    dummyData = dataSource.copy()
    dummyData.insert(0, ("Weight of topic #"+ str(topic+1)), videoData)
    dummyData.sort_values(("Weight of topic #"+ str(topic+1)), inplace=True, ascending=False)

    return dummyData

## Data Proccesing Functions

In [20]:
#Adding Categories To Countries (changing category ID numbers to actual category name)
def assignCategories(data):
    data['category_id'].replace(1,'Film and Animation',inplace=True)
    data['category_id'].replace(2,'Vehicles',inplace=True)
    data['category_id'].replace(10,'Music',inplace=True)
    data['category_id'].replace(15,'Pets and Animals',inplace=True)
    data['category_id'].replace(17,'Sports',inplace=True)
    data['category_id'].replace(18,'Short Movies',inplace=True)
    data['category_id'].replace(19,'Travel and Events',inplace=True)
    data['category_id'].replace(20,'Gaming',inplace=True)
    data['category_id'].replace(22,'Peaple and Blogs',inplace=True)
    data['category_id'].replace(23,'Comedy',inplace=True)
    data['category_id'].replace(24,'Entertainment',inplace=True)
    data['category_id'].replace(25,'News and Politics',inplace=True)
    data['category_id'].replace(26,'Style',inplace=True)
    data['category_id'].replace(27,'Education',inplace=True)
    data['category_id'].replace(28,'Science and Tech.',inplace=True)
    data['category_id'].replace(29,'Activism and Nonprofits',inplace=True)
    data['category_id'].replace(30,'Movies',inplace=True)
    data['category_id'].replace(43,'Shows',inplace=True)

# create numerical, time, and one-hot encoded features
#(one hot enconding categories and calculating continous variables from likes and views)
def makeFeatures(data):
    dummy = data
    dummy['percent_likes'] = 100*dummy['likes']/dummy['views']
    dummy['percent_dislikes'] = 100*dummy['dislikes']/dummy['views']
    dummy['percent_comments'] = 100*dummy['comment_count']/dummy['views']
    dummy['likes/dislikes']   = 100*dummy['likes']/(dummy['dislikes']+1)
    dummy['likes/comments']   = 100*dummy['likes']/(dummy['comment_count']+1)

    
    ##add published day and time 
    dummy["day_published"] = dummy["publish_time"].apply(lambda x: datetime.datetime.strptime(x[:10], "%Y-%m-%d").date().strftime('%a'))
    dummy["hour_published"] = dummy["publish_time"].apply(lambda x: x[11:13])
    dummy.drop(labels='publish_time', axis=1, inplace=True)
    
    #doing one-hot encoding on the categories
    dummy = pd.get_dummies(dummy, columns =['category_id'] ) #)= list(dummy['category_id'].unique()))

    return dummy

# Procces data in order to make sure we have a complete data set (elimante NA and bad values
def correlationData(data):

    #drop unused 
    dataSimple = data.drop(['percent_comments','hour_published','day_published','title','tags','video_id','thumbnail_link', 'comments_disabled','ratings_disabled','video_error_or_removed','description',], axis=1)
    
    # Look at correlations between features (from precept 6)
    plt.figure(figsize = (5,5), dpi=200)
    plt.matshow(dataSimple.corr(), fignum=1, cmap=plt.cm.bwr)
    cols = list(dataSimple.columns)
    plt.xticks(list(range(len(cols))), cols, rotation=90)
    plt.yticks(list(range(len(cols))), cols)
    plt.colorbar()  


## Language Processing Functions

In [21]:

# count individual words within the video descriptions
def countWords(data,name):
    """
    Input: the data for a particular country
    Output: a file named "name" with a list of all the words within the video descriptions
    """
    
    #concotenating all the essays into 1
    combinedStringCols = data["title"].map(str)+ data["tags"].map(str) + data["description"].map(str)

    #removing stop-words
    combinedStringCols = combinedStringCols.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
    combinedStringCols = combinedStringCols.apply(lambda x: ' '.join([word for word in x.split() if word not in (stopEs)]))

    ##separating into individual strings/words
    words = combinedStringCols.str.split(" ",n=-1,expand=True)
    
    ##Counting the number of times each word is said
    wordValueCounts = pd.value_counts(words.values.flatten())

    #writing a text file withh the word counts
    wordValueCounts = wordValueCounts.reset_index(level=0, inplace=False)
    n=0
    with open(name, 'w') as f:
        for item in wordValueCounts['index']:
            f.write("%s\n" % wordValueCounts['index'].iloc[n])
            n = n + 1
            
            
#doing Bag of words representation function       
def makeBOW(data,nwords,name):
    """
    Input: data and number of top words we want in the BOW description
    Output: a Bag of words file named "name"
    """
    
    #Choose the top "n" words and load them from the proccesed word list:
    nwords = nwords
    words = pd.read_csv(name, low_memory=False,header=None,error_bad_lines=False,nrows = nwords)
    #print(words)
    
    combinedStringCols = data["title"].map(str)+ data["tags"].map(str) + data["description"].map(str)
    
    ##counting fully capitalized words
    #b = pd.DataFrame(df['A'].values.tolist()).stack().str.split(expand=True).stack().str.isupper().sum()
    #print (b)

    ##making a list out of words:
    wordList = list(words[0])
    #print(wordList)

    #removing duplicates
    wordList = list(dict.fromkeys(wordList))
    #print(wordList)

    ####counting words in each line and adding them as a collumn to categorical data
    bagOfWords = data.copy()
    for m in wordList: 
        word  = m
        count = combinedStringCols.str.count(word) 
        bagOfWords.insert(0,word,count,allow_duplicates=True)
    
    #print(bagOfWords)
    
    return bagOfWords,wordList

## Function for creating feature for total "days trending"

In [11]:
#Creating function that gets the number of days a video was on the list and combines all data about each unique
#video into a single entry. (videos are listed once for every day they are in the list)

def groupVideos(data):
    dummy = data.groupby(data['title']).mean()
    dummy.insert(0,"days_trending",pd.value_counts(data['title'])) #adding days trending as a collumn    
    dummy.sort_values("days_trending", inplace=True, ascending=False)

    return dummy


# Identifing Top Trending Topics Through LDA

## Loading the Data

In [17]:
#Loading the data for all 4 studied countries (these files are not included here but are available at Kaggele.com)

dataUSA = pd.read_csv("Data/USvideos.csv") # USA
dataGB = pd.read_csv("Data/GBvideos.csv") # Great Britain
dataCA = pd.read_csv("Data/CAvideos.csv") # Canada
dataMX = pd.read_csv("Data/MXvideos.csv",encoding = "Latin-1") # Mexico

##printing shape and looking at the data
print('USA data Shape:',dataUSA.shape)
print('GB data Shape:',dataGB.shape)
print('CA data Shape:',dataCA.shape)
print('MX data Shape:',dataMX.shape)

print("")
print("Sample of data for US-based videos:")
dataUSA.head()


USA data Shape: (40949, 16)
GB data Shape: (38916, 16)
CA data Shape: (40881, 16)
MX data Shape: (40451, 16)

Sample of data for US-based videos:


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


## Assigning Categories to Data (changing category ID's to actual names)


In [18]:
#Assigning Categories to Data 
assignCategories(dataUSA)
assignCategories(dataGB)
assignCategories(dataCA)
assignCategories(dataMX)

## Make New Features

In [22]:
#one hot enconding categories and calculating continous variables from likes and views
dataUSAEng=makeFeatures(dataUSA)
dataGBEng=makeFeatures(dataGB)
dataCAEng=makeFeatures(dataCA) 
dataMXEng=makeFeatures(dataMX)

## Adding Capitalized Word Indicator to Each Column
We want to see if capitalization has an impact on video trendability. 

In [23]:
#counting fully capitalized words and adding them to each data set
data = dataUSAEng.copy()
combinedStringCols = data["title"].map(str)+ data["tags"].map(str) + data["description"].map(str)
capitalized = pd.DataFrame(combinedStringCols.values.tolist()).stack().str.isupper()
dataUSAEng['capitalized'] = capitalized.values

data = dataGBEng.copy()
combinedStringCols = data["title"].map(str)+ data["tags"].map(str) + data["description"].map(str)
capitalized = pd.DataFrame(combinedStringCols.values.tolist()).stack().str.isupper()
dataGBEng['capitalized'] = capitalized.values

data = dataCAEng.copy()
combinedStringCols = data["title"].map(str)+ data["tags"].map(str) + data["description"].map(str)
capitalized = pd.DataFrame(combinedStringCols.values.tolist()).stack().str.isupper()
dataCAEng['capitalized'] = capitalized.values

data = dataMXEng.copy()
combinedStringCols = data["title"].map(str)+ data["tags"].map(str) + data["description"].map(str)
capitalized = pd.DataFrame(combinedStringCols.values.tolist()).stack().str.isupper()
dataMXEng['capitalized'] = capitalized.values

#printing shape and looking at the data
print('USA Eng. data Shape:',dataUSAEng.shape)
print('GB Eng. data Shape:',dataGBEng.shape)
print('CA Eng. data Shape:',dataCAEng.shape)
print('MX Eng. data Shape:',dataMXEng.shape)

USA Eng. data Shape: (40949, 38)
GB Eng. data Shape: (38916, 38)
CA Eng. data Shape: (40881, 39)
MX Eng. data Shape: (40451, 38)


## Counting Individual Words

In [11]:
#identifing individual words
countWords(dataUSAEng,'wordCountsUSA.txt')
countWords(dataGBEng,'wordCountsGB.txt')
countWords(dataCAEng,'wordCountsCA.txt')
countWords(dataMXEng,'wordCountsMX.txt')


## Converting into Bag of Words Representation

In [30]:
##Convert into BOW representation

nWords = 100 # number of words in BOW representation

completeUSA,wordListUSA = makeBOW(dataUSAEng,nWords,'Data/wordCountsUSA.txt')
completeGB,wordListGB = makeBOW(dataGBEng,nWords,'Data/wordCountsGB.txt')
completeCA,wordListCA = makeBOW(dataCAEng,nWords,'Data/wordCountsCA.txt')
completeMX,wordListMX = makeBOW(dataMXEng,nWords,'Data/wordCountsMX.txt')

print('Size of USA completely proccesed data:',completeUSA.shape)
print('Size of GB completely proccesed data:',completeGB.shape)
print('Size of CA completely proccesed data:',completeCA.shape)
print('Size of MX completely proccesed data:',completeMX.shape)


Size of USA completely proccesed data: (40949, 138)
Size of GB completely proccesed data: (38916, 138)
Size of CA completely proccesed data: (40881, 139)
Size of MX completely proccesed data: (40451, 138)


## Creating a data set with count-data only


In [31]:
##Creating a data set with count-data only (i.e. integers only)
countUSA = completeUSA.drop(['thumbnail_link', 'comments_disabled','ratings_disabled','video_error_or_removed','description','percent_likes','percent_dislikes','percent_comments','tags','title','video_id','trending_date','channel_title','trending_date','views','likes','comment_count','dislikes',"day_published",'hour_published','likes/dislikes','likes/comments'], axis=1)
countGB = completeGB.drop(['thumbnail_link', 'comments_disabled','ratings_disabled','video_error_or_removed','description','percent_likes','percent_dislikes','percent_comments','tags','title','video_id','trending_date','channel_title','trending_date','views','likes','comment_count','dislikes',"day_published",'hour_published','likes/dislikes','likes/comments'], axis=1)
countCA = completeCA.drop(['thumbnail_link', 'comments_disabled','ratings_disabled','video_error_or_removed','description','percent_likes','percent_dislikes','percent_comments','tags','title','video_id','trending_date','channel_title','trending_date','views','likes','comment_count','dislikes',"day_published",'hour_published','likes/dislikes','likes/comments'], axis=1)
countMX = completeMX.drop(['thumbnail_link', 'comments_disabled','ratings_disabled','video_error_or_removed','description','percent_likes','percent_dislikes','percent_comments','tags','title','video_id','trending_date','channel_title','trending_date','views','likes','comment_count','dislikes',"day_published",'hour_published','likes/dislikes','likes/comments'], axis=1)

print('Size of USA count data:',countUSA.shape)
print('Size of GB count data:',countGB.shape)
print('Size of CA count data:',countCA.shape)
print('Size of MX count data:',countMX.shape)

Size of USA count data: (40949, 117)
Size of GB count data: (38916, 117)
Size of CA count data: (40881, 118)
Size of MX count data: (40451, 117)


## Performing LDA with N=10 on the Count Data Set

In [None]:
%%time 
#Warning, this takes a LONG time, about 13 minutes total
#No need to run since I have included the files in the Data directory

#Number of Topics
ntopics = 10

transformedUSA,componentsUSA = LDA(ntopics,countUSA)
transformedGB,componentsGB = LDA(ntopics,countGB)
transformedCA,componentsCA = LDA(ntopics,countCA)
transformedMX,componentsMX = LDA(ntopics,countMX)

np.savetxt("LDA_transformedUSA_10.csv", transformedUSA, delimiter=",")
np.savetxt("LDA_componentsUSA_10.csv", componentsUSA, delimiter=",")

np.savetxt("LDA_transformedGB_10.csv", transformedGB, delimiter=",")#
np.savetxt("LDA_componentsGB_10.csv", componentsGB, delimiter=",")

np.savetxt("LDA_transformedCA_10.csv", transformedCA, delimiter=",")
np.savetxt("LDA_componentsCA_10.csv", componentsCA, delimiter=",")

np.savetxt("LDA_transformedMX_10.csv", transformedMX, delimiter=",")
np.savetxt("LDA_componentsMX_10.csv", componentsMX, delimiter=",")



In [32]:
#Reading LDA results from text files 

componentsUSA = pd.read_csv("Data/LDA_componentsUSA_10.csv", low_memory=False,header=None) #topic by features
transformedUSA = pd.read_csv("Data/LDA_transformedUSA_10.csv", low_memory=False,header=None) #users by topic
componentsUSA = componentsUSA/componentsUSA.sum(axis=1)[:, np.newaxis] ##normalizing components

componentsGB = pd.read_csv("Data/LDA_componentsGB_10.csv", low_memory=False,header=None) #topic by features
transformedGB = pd.read_csv("Data/LDA_transformedGB_10.csv", low_memory=False,header=None) #users by topic
componentsGB = componentsGB/componentsGB.sum(axis=1)[:, np.newaxis] ##normalizing components

componentsCA = pd.read_csv("Data/LDA_componentsCA_10.csv", low_memory=False,header=None) #topic by features
transformedCA = pd.read_csv("Data/LDA_transformedCA_10.csv", low_memory=False,header=None) #users by topic
componentsCA = componentsCA/componentsCA.sum(axis=1)[:, np.newaxis] ##normalizing components

componentsMX = pd.read_csv("Data/LDA_componentsMX_10.csv", low_memory=False,header=None) #topic by features
transformedMX = pd.read_csv("Data/LDA_transformedMX_10.csv", low_memory=False,header=None) #users by topic
componentsMX = componentsMX/componentsMX.sum(axis=1)[:, np.newaxis] ##normalizing components


## Identifing top words used in description in english-speaking and all countries

In [37]:
commonWordsEnglish = (list(set(wordListUSA).intersection(wordListGB,wordListCA)))
commonWordsAll = (list(set(wordListUSA).intersection(wordListGB,wordListCA,wordListMX)))
#print(wordListMX)

print('Common top words in english speaking countries:\n\n',commonWordsEnglish, "\n\n")
print('Common top words in all countries:\n\n',commonWordsAll)

Common top words in english speaking countries:

 ['movie', 'time', 'make', 'Follow', 'see', 'want', 'Twitter:', 'know', 'More', 'Instagram:', 'Show', 'videos', 'Late', 'Live', 'much', 'new', 'full', 'show', 'Subscribe', 'Official', 'ON', 'use', 'made', '2017', 'channel', 'THE', 'Kimmel', 'every', 'best', 'first', 'love', 'world', 'life', 'YouTube', 'Night', 'take', 'Watch', 'Jimmy', 'latest', 'get', 'Music', 'vs', 'find', 'live', '2', 'official', 'music', '2018', 'favorite', 'Facebook:', 'like', 'back'] 


Common top words in all countries:

 ['use', '2017', '2', 'videos', '2018', 'Twitter:', 'Facebook:', 'like', 'Instagram:', 'Music']


## Analyizing the top correlated features for each latent topic in each country

In [43]:
#For brevetiy I will only print the results from the first USA-based topic
#however feel free to explore by yourself for different topics and different countries

print(topicDescription(countUSA,componentsUSA,0,80))
#print(topicDescription(countUSA,componentsUSA,1,80))
#print(topicDescription(countUSA,componentsUSA,2,80))
#print(topicDescription(countUSA,componentsUSA,3,80))
#print(topicDescription(countUSA,componentsUSA,4,80))
#print(topicDescription(countUSA,componentsUSA,5,80))
#print(topicDescription(countUSA,componentsUSA,6,80))
#print(topicDescription(countUSA,componentsUSA,7,80))
#print(topicDescription(countUSA,componentsUSA,8,80))
#print(topicDescription(countUSA,componentsUSA,9,80))


                       Feature  Distribution of Topic #1
65                        show                  0.118268
83                        Show                  0.114253
87                        Late                  0.108224
48                         CBS                  0.058304
72                       Watch                  0.041590
33                     episode                  0.036151
2                       Follow                  0.035995
59                           2                  0.035929
6                        night                  0.032214
23                       James                  0.028296
60                        live                  0.027242
103  category_id_Entertainment                  0.024122
89                        This                  0.020127
73                        full                  0.019036
61                        time                  0.017543
77                   Subscribe                  0.017092
99                         new 

## Conclusion:
USA trending topics can be generally labeled as: 1) Entertainment, 2) News, 3) Movies, 4) Music, 5) Fashion, 6) Late-Night Shows, 7) Lifestyle, 8) Social Media, 9) Generic Videos, and 10) Sports.

Please refer to the Report for a complete list for the remaining countries. 

# Prediction of Continued Video Trendability

## Defining Classifier Functions

We will test if a particular video will (label =1) or will not (label=)0) trend for longer than the average trending video.

In [49]:
#Defining logistic regression
def LogisticReg(data):
    
    ##Making boolean values of days trending more than avg. 
    dummy = data.copy()
    mean = dummy['days_trending'].mean()

    mask1 = dummy['days_trending'] >= mean
    mask2 = dummy['days_trending'] < mean

    dummy.loc[mask1, 'days_trending'] = 1
    dummy.loc[mask2, 'days_trending'] = 0

    ##separating into training and test data
    kfold = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 42)
    result = next(kfold.split(dummy), None)
    
    train = dummy.iloc[result[0]]
    trainIn = train.drop(['days_trending'],axis=1)
    trainOut = train['days_trending']
    
    test =  dummy.iloc[result[1]]
    testIn = train.drop(['days_trending'],axis=1)
    testOut = train['days_trending']
    
    ##Doing logistic regression
    log = LogisticRegression(penalty = "l1",solver='liblinear',max_iter=1000,C=0.1)
    log.fit(trainIn,trainOut) 
    prediction = log.predict(testIn)
    probs = log.predict_proba(testIn)
    probs = probs[:, 1]
    score = log.score(testIn,testOut)
    coeff = log.coef_
    print("Accuracy:","\n",score)
    print("Precision Score:",average_precision_score(testOut,prediction),"\n\n")
    print("ROC area under curve:", roc_auc_score(testOut,probs))
    print("Confusion Matrix:","\n",confusion_matrix(testOut,prediction),"\n")
    
#Recursive Feature Elimination to extract feature importances
def RecursiveFeatureElimination(data,model,nFeatures):
    ##separating into training and test data
    dummy = data.copy()
    kfold = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 42)
    result = next(kfold.split(dummy), None)
    
    train = dummy.iloc[result[0]]
    trainIn = train.drop(['days_trending'],axis=1)
    trainOut = train['days_trending']
    
    test =  dummy.iloc[result[1]]
    testIn = train.drop(['days_trending'],axis=1)
    testOut = train['days_trending']
    
    
    ## Feature Selection based on Recursive Feature Elimination
    numFeatures = nFeatures #number of words that we want to use

    rfe = RFE(model,numFeatures) 
    rfe = rfe.fit(trainIn, trainOut)
    
    features = list(testIn.columns.values)
    bestFeatures = []
    num = 0 
    
    #choosing the best words from the rfe analysis and putting them into a dictinoary
    for bool in rfe.support_:
        if bool  == True:
            bestFeatures.append(features[num]) 
        num = num+1
        
    n=0
    
    #writing a text file with most representative words
    with open('bestFeaturesGB.txt', 'w') as f:
        for item in bestFeatures:
            f.write("%s\n" % bestFeatures[n])
            n = n + 1
    
    print(bestFeatures) 

##Defining Random Forest Classifier
def RFClassifier(data):
    ##Making boolean values of days trending more than avg. 
    dummy = data.copy()
    mean = dummy['days_trending'].mean()

    mask1 = dummy['days_trending'] >= mean
    mask2 = dummy['days_trending'] < mean

    dummy.loc[mask1, 'days_trending'] = 1
    dummy.loc[mask2, 'days_trending'] = 0

    ##separating into training and test data
    kfold = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 42)
    result = next(kfold.split(dummy), None)
    
    train = dummy.iloc[result[0]]
    trainIn = train.drop(['days_trending'],axis=1)
    trainOut = train['days_trending']
    
    test =  dummy.iloc[result[1]]
    testIn = train.drop(['days_trending'],axis=1)
    testOut = train['days_trending']
    
    ##Doing rf classifier 
    log = RandomForestClassifier(n_estimators=100, max_depth=2,random_state=0)
    log.fit(trainIn,trainOut) 
    prediction = log.predict(testIn)
    probs = log.predict_proba(testIn)
    probs = probs[:, 1]
    score = log.score(testIn,testOut)
    print("Accuracy:","\n",score)
    print("Precision Score:",average_precision_score(testOut,prediction),"\n\n")
    print("ROC area under curve:", roc_auc_score(testOut,probs))
    print("Confusion Matrix:","\n",confusion_matrix(testOut,prediction),"\n")
    
##Grid Search For Random Forest Clasifier
def RFClassifierCV(data):
    ##Making boolean values of days trending more than avg. 
    dummy = data.copy()
    mean = dummy['days_trending'].mean()

    mask1 = dummy['days_trending'] >= mean
    mask2 = dummy['days_trending'] < mean

    dummy.loc[mask1, 'days_trending'] = 1
    dummy.loc[mask2, 'days_trending'] = 0

    ##separating into training and test data
    kfold = model_selection.KFold(n_splits = 5, shuffle = True, random_state = 42)
    result = next(kfold.split(dummy), None)
    
    train = dummy.iloc[result[0]]
    trainIn = train.drop(['days_trending'],axis=1)
    trainOut = train['days_trending']
    
    test =  dummy.iloc[result[1]]
    testIn = train.drop(['days_trending'],axis=1)
    testOut = train['days_trending']
    
    ##Doing random forest classifier with grid search
    RF = RandomForestClassifier(random_state=0)
    parameters = {"n_estimators":[25, 50, 100, 125,175,200], 
              "max_depth":list(range(1, 5)), 
              "max_features":['auto', 'sqrt', 'log2']}
    
    GS = GridSearchCV(RF, parameters, scoring='roc_auc', cv=5)
    GS.fit(trainIn,trainOut)
    bestRF = GS.best_estimator_
    
    ##Finding scoring parameters for best estimator
    bestRF.fit(trainIn,trainOut) 
    prediction = bestRF.predict(testIn)
    probs = bestRF.predict_proba(testIn)
    probs = probs[:, 1]
    score = bestRF.score(testIn,testOut)
    #print("Best Estimator:","\n",GS.best_estimator_)
    print("ROC area under curve:", roc_auc_score(testOut,probs))
    print("Accuracy:","\n",score)
    print("Precision Score:",average_precision_score(testOut,prediction),"\n\n")
    print("Confusion Matrix:","\n",confusion_matrix(testOut,prediction),"\n")    
        
    bestFeatures = pd.DataFrame(bestRF.fit(trainIn,trainOut).feature_importances_).transpose()
    bestFeatures.columns = list(testIn.columns.values)
    bestFeatures = bestFeatures.transpose()
    bestFeatures.sort_values(0, inplace=True, ascending=False)

    return bestFeatures


## Adding "days trending" to the data set and grouping repeated videos

In [45]:
groupUSA = groupVideos(completeUSA)
groupGB = groupVideos(completeGB)
groupCA = groupVideos(completeCA)
groupMX = groupVideos(completeMX)

print('Size of USA grouped data:',groupUSA.shape)
print('Size of GB grouped data:',groupGB.shape)
print('Size of CA grouped data:',groupCA.shape)
print('Size of MX grouped data:',groupMX.shape)

Size of USA grouped data: (6455, 130)
Size of GB grouped data: (3369, 130)
Size of CA grouped data: (24573, 131)
Size of MX grouped data: (33785, 130)


## Applying the algorithms to the updated data set

In [54]:
print("Results for logistic regression classifier:\n")
LogisticReg(groupGB)

print("...........................................\n")

print("Results for random forest classifier:\n")
RFClassifier(groupGB)

Results for logistic regression classifier:

Accuracy: 
 0.6805194805194805
Precision Score: 0.5047639755385419 


ROC area under curve: 0.7478959431557467
Confusion Matrix: 
 [[1487  131]
 [ 730  347]] 

...........................................

Results for random forest classifier:

Accuracy: 
 0.6623376623376623
Precision Score: 0.48095390706937424 


ROC area under curve: 0.7549983185908758
Confusion Matrix: 
 [[1524   94]
 [ 816  261]] 



## Most predictive feature selection based on Recursive Feature Elimination (RFE)


In [None]:
# Feature Selection for logistic regresion based on Recursive Feature Elimination (RFE)

#(THIS TAKES FOREVER, so I included files with the results in the "Data" folder)
model = LogisticRegresssion(penalty = "l1",solver='liblinear',max_iter=1000,C=0.1)
nFeatures = 10

RecursiveFeatureElimination(groupUSA,model,nFeatures)
RecursiveFeatureElimination(groupGB,model,nFeatures)
RecursiveFeatureElimination(groupCA,model,nFeatures)
RecursiveFeatureElimination(groupMX,model,nFeatures)


In [None]:
##Getting Best Features From Random Forest and Grid Search
#(THIS TAKES FOREVER, so I included files with the results in the "Data" folder)

bestFeaturesUSA = RFClassifierCV(groupUSA)
bestFeaturesUSA.to_csv('Data/bestFeaturesUSA.csv')

bestFeaturesGB = RFClassifierCV(groupGB)
bestFeaturesGB.to_csv('Data/bestFeaturesGB.csv')

bestFeaturesCA = RFClassifierCV(groupCA)
bestFeaturesCA.to_csv('Data/bestFeaturesCA.csv')

bestFeaturesMX = RFClassifierCV(groupMX)
bestFeaturesMX.to_csv('Data/bestFeaturesMX.csv')

bestFeaturesUSA.head(n=20)

In [59]:
#Loading most predictive features from the Data folder

BestFeaturesUSA = pd.read_csv('Data/bestFeaturesUSA.csv', low_memory=False,error_bad_lines=False)
BestFeaturesGB = pd.read_csv('Data/bestFeaturesGB.csv', low_memory=False,error_bad_lines=False)
BestFeaturesCA = pd.read_csv('Data/bestFeaturesCA.csv', low_memory=False,error_bad_lines=False)
BestFeaturesMX = pd.read_csv('Data/bestFeaturesMX.csv', low_memory=False,error_bad_lines=False)

#I only list the most predictive features for Mexico here, but feel free to test around
print("Sample best features for continued video trendability in Mexico: ")
BestFeaturesMX.head(n=20)

Sample best features for continued video trendability in Mexico: 


Unnamed: 0.1,Unnamed: 0,0
0,dislikes,0.230405
1,views,0.221185
2,likes,0.172999
3,comment_count,0.147273
4,percent_dislikes,0.027338
5,likes/dislikes,0.021233
6,percent_likes,0.018493
7,percent_comments,0.014103
8,4,0.008243
9,category_id_Music,0.008197


## Conclusion

Mexican videos are more likely to trend based on the number of user interactions they obtain (views, dislikes, likes, comments ext..). This should be obvious. However, we can also conclude that videos that contain 1)  Music, 2) Comedy, 3) Twitter mentions 4) and Mexico-related content are more likely to trend as well. Capitalization of the video has no effect on the continued trendability of the video. 

Please refer to the Report for a complete analysis for the remaining countries. 

## Looking at the top categories across countries (for fun)

In [60]:
print('\nCategory Prevalence in USA:\n',dataUSA.category_id.value_counts())
print('\nCategory Prevalence in GB:\n',dataGB.category_id.value_counts())
print('\nCategory Prevalence in CA:\n',dataCA.category_id.value_counts())
print('\nCategory Prevalence in MX:\n',dataMX.category_id.value_counts())



Category Prevalence in USA:
 Entertainment              9964
Music                      6472
Style                      4146
Comedy                     3457
Peaple and Blogs           3210
News and Politics          2487
Science and Tech.          2401
Film and Animation         2345
Sports                     2174
Education                  1656
Pets and Animals            920
Gaming                      817
Travel and Events           402
Vehicles                    384
Activism and Nonprofits      57
Shows                        57
Name: category_id, dtype: int64

Category Prevalence in GB:
 Music                      13754
Entertainment               9124
Peaple and Blogs            2926
Film and Animation          2577
Style                       1928
Sports                      1907
Comedy                      1828
Gaming                      1788
News and Politics           1225
Pets and Animals             534
Science and Tech.            518
Education                    457
V