# Sentiment Analysis of Usernames and Titles on YouTube and Twitch

## Data Preparation, Wrangling and Cleaning

This notebook concatenates all the files that were pushed on a scheduled basis from the social media platforms to GitHub.

The data is then filtered to the period from 28 May 2021 to 28 July 2021. The textual data is processed, cleaned and made ready for visualisation and sentiment analysis.
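The date filter described above can be sketched in plain Python. This is only an illustration, assuming timestamps in the ISO 8601 format used by the scraped files; the helper name `in_study_window` and the sample timestamps are hypothetical.

```python
from datetime import date, datetime

# Study window taken from the text above: 28 May 2021 to 28 July 2021, inclusive.
WINDOW_START = date(2021, 5, 28)
WINDOW_END = date(2021, 7, 28)


def in_study_window(api_call_time: str) -> bool:
    """Return True if an ISO 8601 timestamp falls on a day inside the window."""
    call_date = datetime.fromisoformat(api_call_time).date()
    return WINDOW_START <= call_date <= WINDOW_END


# Hypothetical timestamps, made up for illustration.
samples = [
    "2021-05-16T17:14:29.983501",  # before the window
    "2021-05-28T00:00:00.000000",  # first day, kept
    "2021-07-28T23:59:59.999999",  # last day, kept
    "2021-07-29T00:00:00.000000",  # after the window
]
kept = [ts for ts in samples if in_study_window(ts)]
print(kept)

# Number of calendar days covered by the window, both ends included.
print((WINDOW_END - WINDOW_START).days + 1)
```

Both boundary days are kept because the comparison is inclusive on each side.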

## Setting up environments

In [None]:
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-py3-none-any.whl size=829180943 sha256=30e0efff01b79fe13856dcf9b1d866a81418cc1d2bf72b4e76a3f73553625007
  Stored in directory: /tmp/pip-ephem-wheel-cache-f7gqt_n0/wheels/11/95/ba/2c36cc368c0bd339b44a791c2c1881a1fb714b78c29a4cb8f5
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [None]:
!pip install PyGithub 

Collecting PyGithub
  Downloading PyGithub-1.55-py3-none-any.whl (291 kB)

In [None]:
!pip install scipy



## Ingesting Data 

In [None]:
!git clone https://github.com/JefNtungila/master_thesis.git

Cloning into 'master_thesis'...
remote: Enumerating objects: 6719, done.
remote: Counting objects: 100% (228/228), done.
remote: Compressing objects: 100% (228/228), done.
remote: Total 6719 (delta 92), reused 0 (delta 0), pack-reused 6491
Receiving objects: 100% (6719/6719), 19.41 MiB | 18.89 MiB/s, done.
Resolving deltas: 100% (3535/3535), done.


In [None]:
%%time

!cp master_thesis/* /content/master_thesis/

cp: 'master_thesis/twitch_2021-05-16T17:14:29.983501.csv' and '/content/master_thesis/twitch_2021-05-16T17:14:29.983501.csv' are the same file
cp: 'master_thesis/twitch_2021-05-16T19:30:08.054865.csv' and '/content/master_thesis/twitch_2021-05-16T19:30:08.054865.csv' are the same file
cp: 'master_thesis/twitch_2021-05-16T19:46:14.540208.csv' and '/content/master_thesis/twitch_2021-05-16T19:46:14.540208.csv' are the same file
cp: 'master_thesis/twitch_2021-05-16T20:17:07.501535.csv' and '/content/master_thesis/twitch_2021-05-16T20:17:07.501535.csv' are the same file
cp: 'master_thesis/twitch_2021-05-16T20:49:25.717908.csv' and '/content/master_thesis/twitch_2021-05-16T20:49:25.717908.csv' are the same file
cp: 'master_thesis/twitch_2021-05-16T21:17:07.632850.csv' and '/content/master_thesis/twitch_2021-05-16T21:17:07.632850.csv' are the same file
cp: 'master_thesis/twitch_2021-05-16T21:25:13.540832.csv' and '/content/master_thesis/twitch_2021-05-16T21:25:13.540832.csv' are the same file

In [None]:
# import OS module
import os
import pandas as pd

pd.options.display.max_colwidth = 100

# list every file and directory saved in the master_thesis directory
path = '/content/master_thesis/'
dir_list = os.listdir(path)

# read each platform's CSV files and concatenate them into one dataframe per platform
twitch_files = pd.concat([pd.read_csv(f'{path}{file_}') for file_ in dir_list if 'twitch' in file_])
youtube_files = pd.concat([pd.read_csv(f'{path}{file_}') for file_ in dir_list if 'youtube' in file_])

print(len(twitch_files))
print(len(youtube_files))


186800
12947
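The comprehensions above keep any file whose name contains the platform string anywhere. A slightly stricter selection is sketched below, assuming the `<platform>_<timestamp>.csv` naming visible in the `cp` output; the helper `files_for_platform` and every entry in the listing other than the two Twitch files are hypothetical.

```python
def files_for_platform(filenames, platform):
    """Keep CSV files whose name starts with '<platform>_', sorted by name."""
    prefix = f"{platform}_"
    return sorted(
        name for name in filenames
        if name.startswith(prefix) and name.endswith(".csv")
    )


# Hypothetical directory listing; only the two Twitch names come from the output above.
listing = [
    "twitch_2021-05-16T19:30:08.054865.csv",
    "README.md",
    "twitch_2021-05-16T17:14:29.983501.csv",
    "youtube_2021-05-29T14:37:08.454652.csv",
    "notes_on_twitch.txt",
]
print(files_for_platform(listing, "twitch"))
print(files_for_platform(listing, "youtube"))
```

Because the timestamps in the file names are ISO formatted, sorting by name also sorts the files chronologically.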


## Data Wrangling and Cleaning

In [None]:
%time
from spacy.tokenizer import Tokenizer
import spacy
import html
from html.parser import HTMLParser
import re

#loading the large English spaCy model (en_core_web_lg); it must be installed first and the runtime restarted
nlp = spacy.load('en_core_web_lg')
tokenizer = Tokenizer(nlp.vocab)



CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs


In [None]:
def token_text(doc):
  '''
  clean a title and tokenise it for visualisation
  decode HTML entities, keep letters only and lowercase the text
  return the tokens that are neither punctuation nor stop words
  '''
  doc = html.unescape(doc) #decode HTML entities such as &amp;
  doc = re.sub(r'[^a-zA-Z0-9 ]', '', doc) #keep alphanumeric characters and spaces
  doc = re.sub('[0-9]+', '', doc) #remove numerical characters
  doc = doc.lower() #convert all strings to lowercase
  tokens = [token.text for token in tokenizer(doc) if not token.is_punct and not token.is_stop]
  tokens = [''.join(x.split()) for x in tokens if x] #strip any whitespace left inside a token
  tokens = [token for token in tokens if token != ''] #drop tokens that are now empty
  return tokens
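The character-level steps of `token_text` can be tried without loading spaCy. The sketch below is a simplified stand-in: it replaces the spaCy tokeniser with `str.split` and does not remove stop words. The helper `clean_text` and the sample title are hypothetical. Note that inside a regex character class a second `^` is a literal caret, so the class is written here as `[^a-zA-Z0-9 ]`.

```python
import html
import re


def clean_text(raw: str) -> str:
    """Apply the character-level cleaning steps, without spaCy tokenisation."""
    text = html.unescape(raw)                   # decode entities such as &amp;
    text = re.sub(r"[^a-zA-Z0-9 ]", "", text)   # keep letters, digits and spaces only
    text = re.sub(r"[0-9]+", "", text)          # drop digits
    return text.lower()


# Hypothetical title, made up for illustration.
raw_title = "Tom &amp; Jerry LIVE!! | Day 12"
cleaned = clean_text(raw_title)
print(cleaned.split())
```

Removing symbols leaves runs of spaces behind, which is why the notebook function filters out empty tokens afterwards.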

In [None]:
import datetime

def wrangle(X):

  ''' sort values by api call time and index
  format the api call time into usable date and hour features
  filter the data from the 28th of May 2021, when youtube data was first ingested cleanly
  keep 2 months of data for the study
  create a tokenised version of the titles'''

  X = X.sort_values(['api_call_time', 'Unnamed: 0'])
  X['date_api_call_time'] = pd.to_datetime(X['api_call_time']).dt.date
  X['hour_api_call_time']  = pd.to_datetime(X['api_call_time']).dt.hour
  #keep calls made between 28 May 2021 and 28 July 2021, inclusive
  X = X[X['date_api_call_time'] >= datetime.date(2021, 5, 28)]
  X = X[X['date_api_call_time'] <= datetime.date(2021, 7, 28)]
  X = X.rename(columns={'Unnamed: 0': 'reference_index'})
  X = X.reset_index(drop=True)
  #the first column whose name contains 'title' ('title' for twitch, 'video_title' for youtube)
  title_column = X.columns[X.columns.str.contains('title')][0]
  X['tokenised_title'] = X[title_column].apply(lambda x : token_text(str(x)))

  return X



In [None]:
def token_text_with_stopwords(doc):

  '''
  tokenising for NLP purposes
  decode HTML entities, keep letters only and lowercase the text
  tokenise the sentence into words, dropping punctuation but keeping stop words
  remove potential leading spaces
  '''

  doc = html.unescape(doc) #decode HTML entities such as &amp;
  doc = re.sub(r'[^a-zA-Z0-9 ]', '', doc) #keep alphanumeric characters and spaces
  doc = re.sub('[0-9]+', '', doc) #remove numerical characters
  doc = doc.lower() #convert all strings to lowercase
  tokens = [token.text for token in tokenizer(doc) if not token.is_punct]
  tokens = [''.join(x.split()) for x in tokens if x] #strip any whitespace left inside a token
  tokens = [token for token in tokens if token != ''] #drop tokens that are now empty

  return tokens

In [None]:
twitch_data = wrangle(twitch_files)
youtube_data = wrangle(youtube_files)

In [None]:
#tokenising titles and keeping stopwords
twitch_data['tokenised_titles_with_stopwords'] = twitch_data['title'].apply(lambda x: token_text_with_stopwords(str(x)))
youtube_data['tokenised_titles_with_stopwords'] = youtube_data['video_title'].apply(lambda x : token_text_with_stopwords(str(x)))

In [None]:
print(twitch_data.shape)
print(youtube_data.shape)

(139500, 11)
(11783, 13)


## Adding Twitch Data Genres

In [None]:
vgsales = pd.read_csv('https://raw.githubusercontent.com/JefNtungila/Sentiment-Analysis-of-Usernames-and-Titles-on-YouTube-and-Twitch/main/data/vgsales.csv')

In [None]:
#manually labelling the top 30 games that are missing from vgsales, often because they were published after 2018
#this complements the classification by vgsales (VGS), which labelled more than 30K titles

game_genre = pd.DataFrame({'game_name': ['Call of Duty: Warzone', 'VALORANT',
       'Hearthstone', 'Genshin Impact', 'FIFA 21', 'Escape from Tarkov',
       'Teamfight Tactics', "Tom Clancy's Rainbow Six Siege", 'SMITE',
       'Resident Evil Village'],
       'game_genre': ['Shooter','Shooter','Misc','Misc','Sports','Shooter','Strategy', 'Shooter', 'Misc', 'Misc']})

In [None]:
#merging with vgsales
twitch_data = twitch_data.merge(vgsales[['Name', 'Genre']].drop_duplicates(subset= ['Name']), how = 'left', left_on='game_name', right_on='Name')
#merging with modern games dataframe (game_genre)
twitch_data = twitch_data.merge(game_genre, how = 'left', left_on = ['game_name'], right_on = ['game_name'])
twitch_data['Genre'] = twitch_data['Genre'].fillna(twitch_data['game_genre'])
twitch_data['Genre'] = twitch_data['Genre'].fillna('Other')
twitch_data = twitch_data.rename(columns = {'Genre': 'genre'})
twitch_data = twitch_data.drop(columns = ['game_genre', 'Name'])
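The two merges and the two `fillna` calls above amount to a lookup chain: the vgsales genre if there is one, otherwise the manual label, otherwise `'Other'`. A minimal dictionary-based sketch of that chain follows; the function `resolve_genre` and the vgsales excerpt are hypothetical, while the manual labels are copied from `game_genre`.

```python
def resolve_genre(game_name, vgsales_genres, manual_genres, default="Other"):
    """Look up a genre in vgsales first, then in the manual labels, else default."""
    if game_name in vgsales_genres:
        return vgsales_genres[game_name]
    return manual_genres.get(game_name, default)


# Hypothetical vgsales excerpt; the manual labels are taken from game_genre above.
vgsales_genres = {"Grand Theft Auto V": "Action"}
manual_genres = {"VALORANT": "Shooter", "FIFA 21": "Sports"}

for game in ["Grand Theft Auto V", "VALORANT", "Just Chatting", None]:
    print(game, "->", resolve_genre(game, vgsales_genres, manual_genres))
```

Rows with a missing `game_name` end up in `'Other'` as well, which is one reason that category is large.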



In [None]:
twitch_data.columns.values

array(['reference_index', 'user_name', 'game_name', 'title',
       'viewer_count', 'started_at', 'api_call_time',
       'date_api_call_time', 'hour_api_call_time', 'tokenised_title',
       'tokenised_titles_with_stopwords', 'genre'], dtype=object)

In [None]:
#producing summary statistics for twitch data
#the viewer count feature is right-skewed: check min/max and consider a boxplot
#check the most frequently occurring title (Bob Ross?) against the 'top' column of the output

twitch_data[['reference_index', 'user_name', 'game_name', 'title',
       'viewer_count', 'started_at', 'api_call_time',
       'date_api_call_time', 'hour_api_call_time',  'genre']].describe(include = 'all').T

| column | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| reference_index | 139500 | | | | 49.5 | 28.8662 | 0.0 | 24.75 | 49.5 | 74.25 | 99.0 |
| user_name | 139500 | 4062.0 | ops1x | 937.0 | | | | | | | |
| game_name | 138386 | 1378.0 | Grand Theft Auto V | 12780.0 | | | | | | | |
| title | 139043 | 43567.0 | Ramee \| !Twitter \| Chang Gang \| NoPixel | 299.0 | | | | | | | |
| viewer_count | 139500 | | | | 7108.38 | 12417.7 | 0.0 | 2254.0 | 3430.0 | 6788.0 | 400648.0 |
| started_at | 139500 | 34500.0 | 2021-06-17 08:00:29 | 54.0 | | | | | | | |
| api_call_time | 139500 | 1395.0 | 2021-06-24T22:06:02.526449 | 100.0 | | | | | | | |
| date_api_call_time | 139500 | 62.0 | 2021-06-18 | 2800.0 | | | | | | | |
| hour_api_call_time | 139500 | | | | 11.8222 | 6.80187 | 0.0 | 6.0 | 12.0 | 18.0 | 23.0 |
| genre | 139500 | 18.0 | Other | 37612.0 | | | | | | | |


## Adding YouTube Genres

In [None]:
vid_genre = pd.read_csv('https://raw.githubusercontent.com/JefNtungila/Sentiment-Analysis-of-Usernames-and-Titles-on-YouTube-and-Twitch/main/data/Trending_CrowdSourced_Classification.csv', 
                        encoding = 'latin1')
#folding the conventionally viral youtubers (CV) into the general youtuber class (YT)
vid_genre['classification'] = vid_genre['classification'].replace({'CV':'YT'})

#replacing the acronyms in the classification column with full words
vid_genre['classification'] = vid_genre['classification'].replace({'YT': 'Youtuber',
                                                                    'CO': 'Commercial',
                                                                    'TR':'Trailer',
                                                                    'MU': 'Music',
                                                                    'TM': 'Traditional Media'})
vid_genre = vid_genre[vid_genre['classification'].notna()]
vid_genre = vid_genre.rename(columns={'classification':'genre'})

In [None]:
#merging with the crowdsourced video genre
youtube_data = youtube_data.merge(vid_genre[['channel', 'genre']], how = 'left', left_on='username', right_on='channel')
youtube_data = youtube_data.drop(columns=['channel'])



In [None]:
from sklearn.feature_extraction.text import CountVectorizer


#creating a sparse count-vectorised word matrix from the youtube titles
countvec = CountVectorizer()
#get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
df_one_hot_encoded = pd.DataFrame(countvec.fit_transform(youtube_data['video_title']).toarray(), 
                                  index=youtube_data['video_title'], columns=countvec.get_feature_names_out())
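To make the shape of the matrix concrete, here is a small pure-Python sketch of what a count vectoriser produces: one row per title, one column per vocabulary word, and cell values equal to word counts. It approximates `CountVectorizer` with default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) and is not the library's implementation. The function `bag_of_words` and the titles are hypothetical.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def bag_of_words(titles):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in titles[i]."""
    counts = [Counter(TOKEN_PATTERN.findall(title.lower())) for title in titles]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical titles, made up for illustration.
titles = ["Official Video", "Official Trailer 2", "Video video"]
vocabulary, rows = bag_of_words(titles)
print(vocabulary)
for row in rows:
    print(row)
```

Most cells are zero because each title uses only a handful of the vocabulary words, which is why the matrix is described as sparse.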

In [None]:
#adding the target column to the countvectorised matrix

df_processed = pd.concat([df_one_hot_encoded.reset_index(drop=True),  
                          youtube_data['genre'].reset_index(drop = True)], axis = 1)

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

#selecting labelled data for training and testing

X = df_processed[df_processed['genre'].notna()].drop(columns= ['genre'])
y = df_processed[df_processed['genre'].notna()]['genre']

#splitting in 80, 20
X_train, X_val, y_train, y_val = train_test_split( X, y, train_size = 0.8, stratify=y, random_state=42)

print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)


(2391, 4757) (2391,) (598, 4757) (598,)
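The split above passes `stratify=y`, so each genre keeps roughly the same share in the training and validation sets. The idea can be sketched in pure Python as below. This is a simplified illustration, not scikit-learn's algorithm, and its row allocation can differ from `train_test_split`. The function `stratified_split` and the labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_size=0.8, seed=42):
    """Split row positions so each class contributes about train_size of its rows."""
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)

    rng = random.Random(seed)
    train, val = [], []
    for positions in by_class.values():
        rng.shuffle(positions)                     # shuffle within the class
        cut = round(len(positions) * train_size)   # rows of this class sent to train
        train.extend(positions[:cut])
        val.extend(positions[cut:])
    return sorted(train), sorted(val)


# Hypothetical, imbalanced labels made up for illustration.
labels = ["Youtuber"] * 10 + ["Music"] * 5
train_idx, val_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in val_idx))
```

Without stratification a rare class such as Trailer could end up almost entirely in one of the two sets.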


Reference:

Chen, T. and Guestrin, C., 2016, August. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

In [None]:
!pip install xgboost



In [None]:
from xgboost import XGBClassifier

#both the training and the validation set are evaluated after every boosting round
eval_set = [(X_train, y_train), (X_val, y_val)]


# Fit on train and score on val with at most 100 trees; stop when the validation error has not improved for 30 rounds
# note: xgboost 2.0 and later take early_stopping_rounds in the XGBClassifier constructor and expect integer-encoded labels
model = XGBClassifier(n_estimators = 100, n_jobs = -1)
model.fit(X_train, y_train, eval_set = eval_set , early_stopping_rounds = 30)

y_pred = model.predict(X_val) #storing the predictions for the validation set
print('Validation Accuracy', model.score(X_val, y_val))

[0]	validation_0-merror:0.212882	validation_1-merror:0.210702
Multiple eval metrics have been passed: 'validation_1-merror' will be used for early stopping.

Will train until validation_1-merror hasn't improved in 30 rounds.
[1]	validation_0-merror:0.193643	validation_1-merror:0.185619
[2]	validation_0-merror:0.212882	validation_1-merror:0.198997
[3]	validation_0-merror:0.189879	validation_1-merror:0.182274
[4]	validation_0-merror:0.186533	validation_1-merror:0.182274
[5]	validation_0-merror:0.181932	validation_1-merror:0.17893
[6]	validation_0-merror:0.179423	validation_1-merror:0.17893
[7]	validation_0-merror:0.179423	validation_1-merror:0.17893
[8]	validation_0-merror:0.179423	validation_1-merror:0.17893
[9]	validation_0-merror:0.18486	validation_1-merror:0.183946
[10]	validation_0-merror:0.184023	validation_1-merror:0.182274
[11]	validation_0-merror:0.178168	validation_1-merror:0.17893
[12]	validation_0-merror:0.179841	validation_1-merror:0.17893
[13]	validation_0-merror:0.178168	v
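The `merror` values in the log are multiclass error rates: the number of misclassified rows divided by the total number of rows. The short sketch below shows the calculation and relates the logged value 0.17893 to the 598-row validation set. The function `merror` is a hypothetical re-implementation for illustration, and the toy labels are made up.

```python
def merror(y_true, y_pred):
    """Multiclass error rate: the share of predictions that differ from the label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Hypothetical labels and predictions, made up for illustration.
y_true = ["Music", "Youtuber", "Trailer", "Youtuber", "Music"]
y_pred = ["Music", "Youtuber", "Youtuber", "Youtuber", "Commercial"]
print(merror(y_true, y_pred))     # 2 wrong out of 5

# Relating a logged value to the 598-row validation set.
n_val = 598
print(round(0.17893 * n_val))     # approximate number of misclassified rows
print(round(1 - 0.17893, 3))      # the corresponding accuracy
```

Accuracy is simply one minus `merror`, so a validation error of 0.17893 corresponds to an accuracy of about 0.821.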

In [None]:
#predicting genres for the rows that don't have a genre
pred = model.predict(df_processed[df_processed['genre'].isna()].drop(columns= ['genre']))
index_no_genre = youtube_data.index[youtube_data['genre'].isna()].tolist()

#creating a dataframe with the predicted genres and the row indices on which they should be merged
pred_df = pd.DataFrame({'label': pred,
                        'index': index_no_genre})

pred_df['label'].value_counts()

Youtuber             5499
Music                1763
Traditional Media    1310
Commercial            151
Trailer                71
Name: label, dtype: int64

In [None]:
#merge original dataframe with new labels
youtube_data = pd.merge(youtube_data.reset_index(), pred_df, how = 'left', on='index')
youtube_data['genre'] = youtube_data['genre'].fillna(youtube_data['label'])
#deleting columns that were used as reference for the merge
youtube_data = youtube_data.drop(columns=['index', 'label'])
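The model returns its predictions as a bare array in the row order of the unlabelled rows, so they have to be matched back to the right positions before filling the gaps. A minimal list-based sketch of that alignment is shown below; the function `backfill_labels` and the sample values are hypothetical.

```python
def backfill_labels(genres, predictions):
    """Fill missing genres (None) with predictions, matched by row position."""
    missing_rows = [i for i, genre in enumerate(genres) if genre is None]
    if len(missing_rows) != len(predictions):
        raise ValueError("one prediction is needed per unlabelled row")
    filled = list(genres)  # copy, so the input list is left untouched
    for row, label in zip(missing_rows, predictions):
        filled[row] = label
    return filled


# Hypothetical column: two crowdsourced labels and two gaps.
genres = ["Music", None, "Trailer", None]
predictions = ["Youtuber", "Music"]  # model output, in row order of the gaps
print(backfill_labels(genres, predictions))
```

Existing crowdsourced labels are never overwritten; only the gaps receive a predicted genre.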

In [None]:
youtube_data[['reference_index', 'username', 'video_title', 'publish_time',
       'view_count', 'comment_count', 'like_count', 'dislike count',
       'api_call_time', 'date_api_call_time', 'hour_api_call_time', 'genre']].describe(include = 'all').T

| column | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| reference_index | 11783 | | | | 23.6104 | 13.9866 | 0.0 | 12.0 | 24.0 | 36.0 | 49.0 |
| username | 11783 | 1303.0 | America's Got Talent | 202.0 | | | | | | | |
| video_title | 11783 | 1910.0 | Ed Sheeran - Bad Habits [Official Video] | 74.0 | | | | | | | |
| publish_time | 11783 | 1878.0 | 2021-06-25T04:00:33Z | 74.0 | | | | | | | |
| view_count | 11783 | | | | 7986870.0 | 16004500.0 | 65009.0 | 1032880.0 | 2544510.0 | 6969420.0 | 150066000.0 |
| comment_count | 11783 | | | | 23365.9 | 87742.1 | 0.0 | 2500.0 | 6386.0 | 16928.0 | 1106400.0 |
| like_count | 11783 | | | | 351735.0 | 633044.0 | 589.0 | 46449.0 | 124633.0 | 360195.0 | 7425210.0 |
| dislike count | 11783 | | | | 10204.1 | 23490.7 | 17.0 | 722.5 | 2002.0 | 7422.0 | 252543.0 |
| api_call_time | 11783 | 245.0 | 2021-05-29T14:37:08.454652 | 50.0 | | | | | | | |
| date_api_call_time | 11783 | 62.0 | 2021-05-30 | 200.0 | | | | | | | |


## Username Wrangling

In [None]:
!pip install nltk



In [None]:
import nltk
from nltk.corpus import words

nltk.download('words') #download the English word list on first use
english_vocab = set(word.lower() for word in words.words())

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [None]:
def word_finder(name):
  '''
  return the English dictionary words found inside a username, stop words removed
  return NaN when no word is found
  '''

  #the substring bounds are not arbitrary: empirical testing gave good results

  name = name.lower()

  #all substrings of length 2 up to len(name) - 1
  all_possible_words = {name[i:j + i] for j in range(2, len(name)) for i in range(len(name)- j + 1)}
  #finding the intersection with the words in the dictionary
  all_words = list(english_vocab.intersection(all_possible_words))
  if not all_words:
    all_words = float('NaN')
  else:
    #transforming the list of words into a string of words
    all_words = ' '.join(all_words)
    #removing stopwords
    all_words = [token.text for token in tokenizer(all_words) if not token.is_stop]

  return all_words
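The substring search in `word_finder` can be tried on its own with a tiny vocabulary. The sketch below is a hypothetical variant rather than a copy: the helper `substrings` also considers the full name as a candidate, whereas `range(2, len(name))` in the cell above stops one character short of it. The vocabulary and the username are made up.

```python
def substrings(name, min_len=2):
    """All contiguous substrings of `name` with at least `min_len` characters."""
    name = name.lower()
    return {
        name[start:start + length]
        for length in range(min_len, len(name) + 1)
        for start in range(len(name) - length + 1)
    }


# Tiny hypothetical vocabulary standing in for the NLTK word list.
vocab = {"sun", "king", "sunk", "in", "kin", "moon"}
found = sorted(vocab & substrings("SunKing42"))
print(found)
```

Overlapping matches such as "sunk" and "king" are all returned, so the result is a set of candidate words and not a segmentation of the username.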




In [None]:
youtube_data.columns

Index(['reference_index', 'username', 'video_title', 'publish_time',
       'view_count', 'comment_count', 'like_count', 'dislike count',
       'api_call_time', 'date_api_call_time', 'hour_api_call_time',
       'tokenised_title', 'tokenised_titles_with_stopwords', 'genre'],
      dtype='object')

In [None]:
twitch_data['words_in_names'] = twitch_data['user_name'].apply(lambda x : word_finder(x))
youtube_data['words_in_names'] = youtube_data['username'].apply(lambda x : word_finder(x))

## Pushing data to drive and GitHub

In [None]:
from github import Github

twitch_data.to_csv('twitch_data.csv')
!cp twitch_data.csv 'drive/My Drive/'

youtube_data.to_csv('youtube_data.csv')
!cp youtube_data.csv 'drive/My Drive/'

g = Github('github_key') #placeholder for a personal access token
repo = g.get_repo('github_repo') #placeholder for the repository name
#create_file parameters are path, commit message, content
#the files are too big to update in place, so an existing copy has to be deleted before it is created again

repo.create_file('twitch_data.csv', 'twitch hourly top 50 streams', twitch_data.to_csv())
repo.create_file('youtube_data.csv', 'youtube top 100 streams worldwide taken every 6 hours', youtube_data.to_csv())
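The delete-then-create workaround mentioned in the comments can be wrapped in a small helper. The sketch below assumes a repository object exposing `get_contents`, `delete_file` and `create_file` with the argument order used by PyGithub. The helper `replace_file` and the in-memory `FakeRepo` are hypothetical, and `FakeRepo` exists only so the example runs without a token or network access. Whether GitHub accepts a file of a given size is outside what this sketch checks.

```python
from types import SimpleNamespace


def replace_file(repo, path, message, content):
    """Delete `path` if it already exists in the repo, then create it again."""
    try:
        existing = repo.get_contents(path)
    except Exception:  # simplification: a real run should catch the library's "not found" error
        existing = None
    if existing is not None:
        repo.delete_file(existing.path, f"remove old {path}", existing.sha)
    return repo.create_file(path, message, content)


class FakeRepo:
    """In-memory stand-in for a repository object, for a dry run only."""

    def __init__(self):
        self.files = {}
        self.calls = []

    def get_contents(self, path):
        if path not in self.files:
            raise KeyError(path)
        return SimpleNamespace(path=path, sha="fake-sha")

    def delete_file(self, path, message, sha):
        self.calls.append("delete")
        del self.files[path]

    def create_file(self, path, message, content):
        self.calls.append("create")
        self.files[path] = content


repo = FakeRepo()
replace_file(repo, "twitch_data.csv", "first upload", "a,b\n1,2\n")
replace_file(repo, "twitch_data.csv", "second upload", "a,b\n3,4\n")
print(repo.calls)
```

With a real repository the same helper would be called once per CSV file, so rerunning the notebook does not fail on files that already exist.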