## Chapter 1 YouTube Comments Spam Detection using Python


In our first NLP tutorial, we'll work with the UCI YouTube Spam Collection dataset [1]. We will use a simple technique to convert the text into a numeric format that machine learning algorithms can process, and then apply the Random Forest algorithm to build a model. Finally, we will predict the target variable and evaluate the model's performance. This book expects you to have some prior knowledge of Python and of coding machine learning and deep learning (neural network) models. We will jump straight into writing code and try to understand it as we write it.

In [125]:
# Import the required libraries.
import pandas as pd
import numpy as np
# The imports below are for building and evaluating the machine learning model.
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [126]:
# Ignore warnings.
import warnings 
warnings.filterwarnings('ignore')

### A quick look at our data

- Let's look at our data... 

In [168]:
# Read the data files available in the same folder as this code.
Youtube01_psy = pd.read_csv('Youtube01-Psy.csv')
Youtube02_katyperry = pd.read_csv('Youtube02-KatyPerry.csv')
Youtube03_lmfao = pd.read_csv('Youtube03-LMFAO.csv')
Youtube04_eminem = pd.read_csv('Youtube04-Eminem.csv')
Youtube05_shakira = pd.read_csv('Youtube05-Shakira.csv')

In [102]:
# Let's check the size of each dataset.
print(Youtube01_psy.shape)
print(Youtube02_katyperry.shape)
print(Youtube03_lmfao.shape)
print(Youtube04_eminem.shape)
print(Youtube05_shakira.shape)

(350, 5)
(350, 5)
(438, 5)
(448, 5)
(370, 5)


In [169]:
# Combine all five datasets.
combined_df = pd.concat([Youtube01_psy, Youtube02_katyperry, Youtube03_lmfao, Youtube04_eminem, Youtube05_shakira])

# Reset the index
combined_df.reset_index(drop=True, inplace=True)
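
The index is reset because `pd.concat` keeps each frame's original row labels, so the combined frame would otherwise contain repeated labels. A minimal sketch with two toy frames (`part_a` and `part_b` are hypothetical names and data, not part of the dataset):

```python
import pandas as pd

# Two toy frames standing in for the per-video CSV files (hypothetical data).
part_a = pd.DataFrame({"CONTENT": ["nice song", "check my channel"], "CLASS": [0, 1]})
part_b = pd.DataFrame({"CONTENT": ["love this", "subscribe please"], "CLASS": [0, 1]})

# Without a reset, each frame keeps its own 0..n-1 labels, so labels repeat.
stacked = pd.concat([part_a, part_b])
labels_before = stacked.index.tolist()

# Dropping the old labels gives one clean 0..N-1 index.
stacked = stacked.reset_index(drop=True)
labels_after = stacked.index.tolist()

print(labels_before)  # [0, 1, 0, 1]
print(labels_after)   # [0, 1, 2, 3]
```

Passing `ignore_index=True` to `pd.concat` should give the same result in a single step.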

In [22]:
combined_df.head(3)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1


In [170]:
# Select only the useful "CONTENT" and "CLASS" columns.
combined_df = combined_df[["CONTENT", "CLASS"]]

In [150]:
# Randomly select 5 rows
random_sample = combined_df.sample(n=5)
print(random_sample)

                                                CONTENT  CLASS
987                            subscribe to my chanell﻿      1
1727                            BEST SONG! GO SHAKI :D﻿      0
1280  Share Eminem&#39;s Artist of the Year video so...      1
1241  MEGAN FOX AND EMINEM TOGETHER IN A VIDEO  DOES...      0
1126             I learned the shuffle because of them﻿      0


- Note the emojis and the misalignment due to spaces in the CLASS column.

In [151]:
combined_df.shape

(1956, 2)

In [105]:
# Optionally map 0 to "Not Spam" and 1 to "Spam" (kept commented out).
#combined_df['CLASS'] = combined_df['CLASS'].map({0 : "Not Spam", 1 : "Spam"})

In [171]:
# Randomly select 5 rows
random_sample = combined_df.sample(n=5)
print(random_sample)

                                                CONTENT  CLASS
1003               Check out this playlist on YouTube:﻿      1
1455                                So freaking sad...﻿      0
474   Imagine this in the news crazy woman found act...      0
1343  I know that maybe no one will read this but PL...      1
37    SUB 4 SUB PLEASE LIKE THIS COMMENT I WANT A SU...      1


In [172]:
# Separate the features and the target.
X = np.array(combined_df['CONTENT'])
y = np.array(combined_df['CLASS'])

In [159]:
X.shape

(1956,)

### Building vectors

The words contained in a comment are a good pointer to whether it is spam or not. This is not always true, but we will use it as the starting point in our first NLP tutorial. Below we use scikit-learn's CountVectorizer to count the words in each comment and turn them into a data format that our machine learning model can understand. A vector, in this context, is a set of numbers that a machine learning algorithm can work with. The related code, its usage, and its output are given below.

- Working of CountVectorizer: https://www.educative.io/answers/countvectorizer-in-python
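
To make the idea of word counting concrete before using the library, here is a rough pure-Python sketch. It assumes lowercasing and tokens of two or more word characters, which approximates CountVectorizer's default behaviour but is not the library's implementation. The function `count_words` and the sample comment are hypothetical.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text and count tokens of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

# Hypothetical comment, not taken from the dataset.
sample_comment = "Check out my channel, please check it out"
word_counts = count_words(sample_comment)

# A fixed, sorted vocabulary turns the counts into a vector.
vocabulary = sorted(word_counts)
vector = [word_counts[word] for word in vocabulary]

print(vocabulary)  # ['channel', 'check', 'it', 'my', 'out', 'please']
print(vector)      # [1, 2, 1, 1, 2, 1]
```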


### An example.

In [35]:
demo_text = ["Stella is a good girl. She loves to swim"] # Demo sentence

In [36]:
count_vectorizer = feature_extraction.text.CountVectorizer() # Instantiate CountVectorizer()
count_vectorizer.fit(demo_text) # Fit the demo text
print(count_vectorizer.vocabulary_) # Print results

{'stella': 5, 'is': 2, 'good': 1, 'girl': 0, 'she': 4, 'loves': 3, 'to': 7, 'swim': 6}


In [37]:
# encode document
demo_vector = count_vectorizer.transform(demo_text)
print(demo_vector.shape)
print(demo_vector.toarray())

(1, 8)
[[1 1 1 1 1 1 1 1]]


- The vector representing demo_text, [[1 1 1 1 1 1 1 1]], has eight elements (indices 0 to 7) because the vocabulary contains eight distinct tokens. By default, CountVectorizer keeps only tokens of two or more characters, so the word "a" is dropped.
- All the entries in the vector are 1 because no token in demo_text is repeated (every token has a frequency of 1).

### Let's try another example

In [38]:
demo_text2 = ["The sky is blue. I wish to fly in the blue sky"] # Demo sentence 2
count_vectorizer = feature_extraction.text.CountVectorizer() # Instantiate CountVectorizer()
count_vectorizer.fit(demo_text2) # Fit the demo text
print(count_vectorizer.vocabulary_) # Print results

{'the': 5, 'sky': 4, 'is': 3, 'blue': 0, 'wish': 7, 'to': 6, 'fly': 1, 'in': 2}


In [39]:
# encode document
demo_vector = count_vectorizer.transform(demo_text2)
print(demo_vector.shape)
print(demo_vector.toarray())

(1, 8)
[[2 1 1 1 2 2 1 1]]


- We are simply counting how often each word occurs and putting those counts in the vector.
- The words "the", "sky", and "blue" each have a frequency of 2 in demo_text2.
- So, these three words are each represented by 2 in the demo vector [[2 1 1 1 2 2 1 1]].

### Let's get counts for the first 5 entries in combined_df.

In [44]:
## let's get counts for the first 5 entries in "CONTENT" column.
demo_vectors = count_vectorizer.fit_transform(combined_df['CONTENT'][0:5])

In [45]:
type(demo_vectors) # Checking data type.

scipy.sparse._csr.csr_matrix

In [46]:
demo_vectors.shape # Checking shape.

(5, 46)

In [47]:
# we use .todense() here because these vectors are "sparse" 
# (only non-zero elements are kept to save space)
print(demo_vectors[0].todense().shape)
print(demo_vectors[0].todense())

(1, 46)
[[0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 1 0 1 0 0 0 0 0 1]]


The above code and result show us the following.
- There are 46 distinct words (called "tokens") in the first five comments.
- The first comment (like every other comment) contains only some of those 46 tokens.
- The vector contains 46 elements because there are 46 distinct tokens.
- Every non-zero count in the vector above marks a token that occurs in that entry of the "CONTENT" column.
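
Because each comment uses only a few of the tokens, most entries are zero, which is why a sparse matrix is returned. The sketch below illustrates this on three hypothetical comments (not rows from the dataset) by comparing the number of stored entries with the size of the dense equivalent.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical comments standing in for the first rows of combined_df.
toy_comments = [
    "check out my channel",
    "great song",
    "subscribe to my channel please",
]

toy_vectors = CountVectorizer().fit_transform(toy_comments)

n_rows, n_tokens = toy_vectors.shape
stored = toy_vectors.nnz      # entries actually kept in the sparse matrix
total = n_rows * n_tokens     # entries a dense matrix would hold

print(toy_vectors.shape)      # (3, 9)
print(stored, "of", total, "entries are non-zero")

# Column positions of the tokens present in the first comment.
print(toy_vectors[0].nonzero()[1])
```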

Now let's create numeric vectors for all the entries in our dataset.

In [157]:
combined_df.shape

(1956, 2)

In [160]:
X.shape

(1956,)

In [173]:
# Split into train and test datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [174]:
X_train.shape

(1369,)

In [175]:
X_test.shape

(587,)
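
The sizes above follow from `test_size=0.3`: 30% of 1956 is 586.8, and the test set ends up with 587 rows, leaving 1369 for training. The sketch below reproduces this with placeholder arrays of the same length (`features` and `labels` are hypothetical stand-ins for `X` and `y`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same length as the combined dataset (1956 rows).
features = np.arange(1956)
labels = np.zeros(1956, dtype=int)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)

print(len(f_train), len(f_test))  # 1369 587

# Every row lands in exactly one of the two sets.
print(len(set(f_train) & set(f_test)))  # 0
```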

In [176]:
# Fit the vectorizer on X_train only, then transform X_train.
X_train = count_vectorizer.fit_transform(X_train)

In [177]:
X_test = count_vectorizer.transform(X_test) # Only transform() X_test, using the vocabulary learned from X_train.

In [178]:
X_train.shape

(1369, 3458)

In [179]:
X_test.shape

(587, 3458)
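
Both matrices have the same number of columns because the vocabulary is learned from the training comments only and then reused for the test comments. A small sketch with hypothetical comments shows that words unseen during fitting are simply ignored at transform time:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test comments.
train_comments = ["check out my channel", "great song"]
test_comments = ["check out this amazing song"]

vec = CountVectorizer()
train_matrix = vec.fit_transform(train_comments)  # learns the vocabulary
test_matrix = vec.transform(test_comments)        # reuses that vocabulary

print(sorted(vec.vocabulary_))   # ['channel', 'check', 'great', 'my', 'out', 'song']
print(train_matrix.shape, test_matrix.shape)  # (2, 6) (1, 6)

# "this" and "amazing" were never seen in training, so they get no column.
print(test_matrix.toarray())     # [[0 1 0 0 1 1]]
```

Calling `fit_transform()` on the test data instead would build a different vocabulary, and the columns would no longer line up with those the model was trained on.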

### Let's build a simple machine learning model to predict the "target" variable.

In [180]:
# Initialize the Random Forest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

- Below we will train the model and evaluate it.
- The evaluation metric we use for this purpose is the F1 score.
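
As a reminder, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand on a few hypothetical labels (1 = spam, 0 = not spam) and compares the result with scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_hat = [1, 0, 1, 0, 1, 1, 0, 1]

# Count true positives, false positives, and false negatives.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(round(f1_manual, 3))                 # 0.8
print(round(f1_score(y_true, y_hat), 3))   # 0.8
```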

In [137]:
combined_df.columns

Index(['CONTENT', 'CLASS'], dtype='object')

In [181]:
scores = model_selection.cross_val_score(clf, X_train, y_train, cv=3, scoring="f1")
scores

array([0.95067265, 0.95594714, 0.95594714])

- The above F1 scores look good! You may get better results with other NLP techniques like TF-IDF, LSA, LSTMs / RNNs, and many others. We will play with them in the upcoming chapters.
- For now, let's make predictions on the test data.

In [182]:
X_train.shape

(1369, 3458)

In [183]:
clf.fit(X_train, y_train)

In [184]:
X_test.shape

(587, 3458)

In [185]:
# Make predictions
y_pred = clf.predict(X_test)

In [186]:
y_pred.shape

(587,)

In [187]:
# Construct a dataframe with columns as y_test and y_pred. 
test_df = pd.DataFrame()
test_df["y"] = y_test
test_df["y_predict"] = y_pred

In [188]:
# Display 10 random rows from test_df
random_sample = test_df.sample(n=10)
print(random_sample)

     y  y_predict
454  1          1
475  0          0
252  1          1
298  1          1
108  1          1
316  1          1
101  1          1
230  0          0
437  1          1
428  0          0


- The predictions look good, and we are finishing the solution here.
- For a real-world analysis, the F1 score should also be calculated on the test data. We are skipping this step.
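
For readers who want to try the skipped step, a test-set evaluation could look roughly like the sketch below. The arrays `y_test_demo` and `y_pred_demo` are hypothetical stand-ins so that the snippet runs on its own, and the numbers it prints are not results for this dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Stand-ins for the notebook's y_test and y_pred (hypothetical values).
y_test_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 1, 0, 0, 0, 1, 1])

test_f1 = f1_score(y_test_demo, y_pred_demo)
test_accuracy = accuracy_score(y_test_demo, y_pred_demo)
# Rows are true classes, columns are predicted classes (order: 0, 1).
test_confusion = confusion_matrix(y_test_demo, y_pred_demo)

print(test_f1, test_accuracy)  # 0.75 0.75
print(test_confusion)
```

In the notebook, replacing the stand-ins with `y_test` and `y_pred` would give the scores for the actual test data.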

- Note: We have skipped much of the data cleaning and pre-processing in this tutorial, yet we still managed to get respectable model results. In chapter 5, we will take up the same tutorial again, this time with reasonable cleaning of the data. It will be interesting to see whether data cleaning improves the model results any further.

- This completes our job for now. Remember, this was an oversimplified example, created only for demonstration.

#### References

[1] Alberto, T.C. and Lochter, J.V. (2017). YouTube Spam Collection. UCI Machine Learning Repository. https://doi.org/10.24432/C58885