<a href="https://colab.research.google.com/github/ClamFD/NLP/blob/main/NLP_YoutubeSpam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YouTube comment spam detection

# Section 1

This project uses the "YouTube Spam Collection" dataset from the UCI Machine Learning Repository.

The aim is to detect spam comments, which are a big problem on all social media platforms because they dilute real comments and fill the comment section with junk.

The project uses two natural language processing pipelines to determine whether comments are categorised as spam or legitimate.

The goal is to decide which method gives the most accurate predictions on the comments. My target accuracy is between 85% and 95%, as this would show the methods are making meaningful predictions rather than guessing by luck.

The dataset is roughly half spam and half legitimate comments, labelled 1 for spam and 0 for legitimate. The models learn from these labels during training; during testing the labels are hidden from the models and only used to check their predictions.


Import libraries

In [5]:
!pip install ucimlrepo
import pandas as pd
from ucimlrepo import fetch_ucirepo
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer



Fetch data from the UCI repository

In [6]:
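# Dataset 380 in the UCI repository is the YouTube Spam Collection
df_raw = fetch_ucirepo(id=380)
X = df_raw.data.features   # feature columns
y = df_raw.data.targets    # label column
# Combine features and labels into one DataFrame for exploration and cleaning
df = pd.concat([X, y], axis=1)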

In [7]:
df.head()

Unnamed: 0,AUTHOR,DATE,CONTENT,CLASS
0,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


Data Exploration

In [8]:
df.shape

(1956, 4)

In [9]:
df.columns

Index(['AUTHOR', 'DATE', 'CONTENT', 'CLASS'], dtype='object')

Check if data is balanced

In [10]:
print(df['CLASS'].value_counts())

CLASS
1    1005
0     951
Name: count, dtype: int64


The classes are roughly balanced (1005 spam vs 951 legitimate), so no resampling is needed

Remove missing data

In [11]:
df.dropna(subset=['CONTENT'], inplace=True)

Data cleaning

In [12]:
df['CONTENT'] = df['CONTENT'].str.lower()
df['CONTENT'] = df['CONTENT'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))
df['CONTENT'] = df['CONTENT'].str.replace(r'\s+', ' ', regex=True).str.strip()
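
To see what the three cleaning steps do to a single comment, here is a minimal sketch that applies the same lowercase, strip and collapse rules with plain `re`. The helper name `clean_comment` is my own for illustration and is not part of the notebook; the two sample strings are taken from the `df.head()` output above.

```python
import re

def clean_comment(text: str) -> str:
    """Lowercase, keep only a-z and whitespace, then collapse runs of whitespace."""
    text = text.lower()
    # Drop digits, punctuation and any other character that is not a-z or whitespace
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse repeated whitespace into single spaces and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(clean_comment("just for test I have to say murdev.com"))
# just for test i have to say murdevcom
print(clean_comment("watch?v=vtaRGgvGtWQ Check this out ."))
# watchvvtarggvgtwq check this out
```

One side effect worth noting is that URLs are squashed into single tokens ("murdev.com" becomes "murdevcom") and all digits are lost, so each distinct link ends up as its own rare word.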

Setup NLTK

In [13]:
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("punkt_tab")

stop_words = set(stopwords.words("english"))
lemm = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Tokenisation, stopword removal and lemmatisation

In [14]:
df['tokens'] = df['CONTENT'].apply(lambda x: [
    lemm.lemmatize(word) for word in word_tokenize(x) if word not in stop_words
])

Reconstruct and print

In [15]:
df['processed_text'] = df['tokens'].apply(lambda x: " ".join(x))

df.head()

Unnamed: 0,AUTHOR,DATE,CONTENT,CLASS,tokens,processed_text
0,Julius NM,2013-11-07T06:20:48,huh anyway check out this youtube channel koby...,1,"[huh, anyway, check, youtube, channel, kobyoshi]",huh anyway check youtube channel kobyoshi
1,adam riyati,2013-11-07T12:37:15,hey guys check out my new channel and our firs...,1,"[hey, guy, check, new, channel, first, vid, u,...",hey guy check new channel first vid u monkey i...
2,Evgeny Murashkin,2013-11-08T17:34:21,just for test i have to say murdevcom,1,"[test, say, murdevcom]",test say murdevcom
3,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy,1,"[shaking, sexy, as, channel, enjoy]",shaking sexy as channel enjoy
4,GsMega,2013-11-10T16:05:38,watchvvtarggvgtwq check this out,1,"[watchvvtarggvgtwq, check]",watchvvtarggvgtwq check


# Section 2


Before training, the comments must be converted into a numerical format that the algorithms can work with; this is called vectorisation. I chose the TF-IDF method for this. The TF part, "term frequency", counts how often each word appears in a comment. On its own this would give a lot of weight to unhelpful words such as "a". The IDF part, "inverse document frequency", corrects for this by giving less weight to words that appear in many comments. It is calculated from the number of comments that contain the word, so a word used in 80% of all comments is emphasised less than a word used in 40%, because it probably appears in both spam and legitimate comments. As a result, words that are distinctive of spam stand out more clearly to the machine learning algorithms later on.
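
As a rough check on how the IDF weighting behaves, the sketch below assumes scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of comments and df is the number of comments containing the term. The function name `smoothed_idf` is hypothetical. With the 1956 comments in this dataset, a term found in a single comment gets about 7.886 and a term found in two comments gets about 7.481, which agrees with the idf values the notebook prints.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1956  # number of comments in the dataset

# Rare terms get a high weight
print(smoothed_idf(n_docs, 1))  # about 7.886
print(smoothed_idf(n_docs, 2))  # about 7.481

# A term in roughly 80% of comments is weighted less than one in roughly 40%
idf_80 = smoothed_idf(n_docs, round(0.8 * n_docs))
idf_40 = smoothed_idf(n_docs, round(0.4 * n_docs))
print(idf_80 < idf_40)  # True
```

The final TF-IDF value of a term in a comment is its term frequency multiplied by this idf, after which each comment's vector is normalised to unit length by default.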


Count word frequencies and weight them

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

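# Keep the 5000 most frequent terms, using single words and two-word phrases
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
# Learn the vocabulary and idf weights, then turn each comment into a TF-IDF row
X_vectorized = tfidf.fit_transform(df['processed_text'])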

In [17]:
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)

Streaming output truncated to the last 5000 lines.
aaaaaaa sexy : 7.886020786836131
abbastfuck : 7.886020786836131
abbastfuck lneadwbftotoufjzvfflfnaxykwzsivqhimxenotorious : 7.886020786836131
ablaze : 7.886020786836131
ablaze crabby : 7.886020786836131
able : 7.480555678727967
able advertise : 7.886020786836131
able cover : 7.886020786836131
abominable : 7.886020786836131
abominable generation : 7.886020786836131
abomination : 7.480555678727967
abomination clap : 7.886020786836131
abomination subscribe : 7.886020786836131
abonner : 7.886020786836131
abonner chane : 7.886020786836131
absolutely : 6.633257818340763
abuse : 7.480555678727967
abusive : 7.480555678727967
access : 7.480555678727967
account : 6.633257818340763
acoustic : 7.480555678727967
act : 7.192873606276185
act renewal : 7.480555678727967
acting : 7.192873606276185
acting like : 7.480555678727967
active : 7.480555678727967
actor : 7.480555678727967
actually : 6.1812726945977055
ad : 7.480555678727967
ad do

In [18]:
print('\nWord indexes:')
print(tfidf.vocabulary_)
print('\ntf-idf value:')
print(X_vectorized)
print('\ntf-idf values in matrix form:')
print(X_vectorized.toarray())


Word indexes:

tf-idf value:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 20568 stored elements and shape (1956, 5000)>
  Coords	Values
  (0, 3278)	0.49374548568178345
  (0, 110)	0.5134930486670009
  (0, 326)	0.16655950427730998
  (0, 4964)	0.22351474464476637
  (0, 292)	0.23282449023192786
  (0, 363)	0.43084789603970153
  (0, 4966)	0.4183326690891067
  (1, 326)	0.08136311514839734
  (1, 292)	0.11373308230173367
  (1, 3116)	0.1383132956439126
  (1, 2774)	0.12770588701004978
  (1, 3812)	0.14380877560845653
  (1, 1908)	0.17146336708734267
  (1, 4701)	0.2275950673033927
  (1, 3720)	0.5016752928975462
  (1, 3299)	0.12631431995594594
  (1, 4857)	0.2275950673033927
  (1, 3475)	0.19470594623272303
  (1, 3497)	0.1062131981222456
  (1, 421)	0.13910231919291527
  (1, 4001)	0.11191024475273306
  (1, 4473)	0.10951408608314117
  (1, 3125)	0.16482988342716967
  (1, 2784)	0.1684584963338457
  (1, 344)	0.17594093555838675
  :	:
  (1950, 4849)	0.4179365428143152
  (1950, 966)	0.424380

# Section 3
The models I'll be comparing are Naive Bayes and a support vector machine (SVM). Naive Bayes uses probability to predict the class, while the SVM draws a boundary between the two groups, positioned to leave the widest possible margin, and classifies each comment by which side of that boundary it falls on. These are two completely different approaches, but both can run on the TF-IDF vectors.

Naive Bayes uses Bayes' theorem to calculate the likelihood that a word will be seen in each type of message, based on how often it appears in each class in the training data. For example, words like "link" and "check" are very common in spam comments and rarely seen in legitimate ones, because spammers are trying to get you to go to their own channel or website. For each class, the model counts how often a word appears relative to all the words in that class's comments, turns that into a probability, and combines the probabilities of the words in a comment to give a relatively accurate prediction. It is called "naive" because it treats each word independently and doesn't consider it in relation to the rest of the comment. As a rough example, on a video about chains, comments mentioning a "chain link" could be flagged as spam even though they don't mean a link to click through to another site.
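
The word-counting idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the notebook's model: it uses four made-up training comments, raw word counts instead of TF-IDF values, and add-one (Laplace) smoothing, and the function names `fit`, `log_posterior` and `predict` are my own.

```python
import math
from collections import Counter

# Made-up training comments: 1 = spam, 0 = legitimate
train = [
    ("check my channel link", 1),
    ("click link subscribe", 1),
    ("love this song", 0),
    ("great song love it", 0),
]

def fit(samples, alpha=1.0):
    """Count words per class; alpha is the add-one smoothing constant."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab, alpha

def log_posterior(model, text, label):
    """Log prior plus the sum of per-word log likelihoods for one class."""
    word_counts, class_counts, vocab, alpha = model
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            count = word_counts[label][word]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
    return score

def predict(model, text):
    """Return the class with the higher log posterior."""
    return max((0, 1), key=lambda label: log_posterior(model, text, label))

model = fit(train)
print(predict(model, "check this link"))  # 1 (spam)
print(predict(model, "love the song"))    # 0 (legitimate)
print(predict(model, "chain link"))       # 1, the "naive" false positive
```

The last call shows the weakness described above: "chain" was never seen in training, so the decision rests entirely on "link", which the toy data associates with spam.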

A support vector machine finds a boundary, called a hyperplane, that separates the spam comments from the real ones. Each comment is a point whose coordinates are its TF-IDF values, and the hyperplane is placed in the gap with the biggest margin between the two classes. A prediction is then made based on which side of the boundary a comment falls on.
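
To make "which side of the boundary" concrete, here is a minimal sketch of how a linear SVM scores a comment once it has been trained. The weights, bias and feature values are hypothetical numbers chosen for illustration; in the notebook they would be learned by `LinearSVC` from the TF-IDF matrix. The sketch covers prediction only, not the training step that finds the widest margin.

```python
import math

# Hypothetical learned weights: positive pushes towards spam, negative towards legitimate
weights = {"check": 1.2, "channel": 0.9, "song": -1.1}
bias = -0.3

def decision_value(features):
    """Signed score w . x + b; terms without a weight contribute nothing."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items()) + bias

def predict(features):
    """1 (spam) if the comment lies on the positive side of the hyperplane, else 0."""
    return 1 if decision_value(features) > 0 else 0

def distance_to_boundary(features):
    """Geometric distance from the comment to the hyperplane."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return abs(decision_value(features)) / norm

# Hypothetical TF-IDF values for two comments
spam_like = {"check": 0.5, "channel": 0.4}
legit_like = {"song": 0.7, "love": 0.6}

print(predict(spam_like), round(decision_value(spam_like), 2))    # 1 0.66
print(predict(legit_like), round(decision_value(legit_like), 2))  # 0 -1.07
```

Comments with a score close to zero sit near the boundary, so they are the ones most likely to be misclassified.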

Naive Bayes is quicker to train and uses less computational power than the SVM, so if Naive Bayes reaches similar accuracy it would be the better option. If it is noticeably less accurate, it would be worth going for the SVM for better results.


Split data into training (80%) and testing (20%)

In [19]:
from sklearn.model_selection import train_test_split
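# Take the labels from df so they stay aligned with the rows kept after dropna,
# and pass them as a 1-D Series rather than a one-column DataFrame
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, df['CLASS'], test_size=0.2, random_state=42)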
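
As a quick sanity check on the split sizes, the sketch below assumes that `train_test_split` rounds the test share up to a whole number of rows. With 1956 comments and `test_size=0.2` that gives 392 test comments, which matches the support total in the classification reports.

```python
import math

n_samples = 1956   # rows in the TF-IDF matrix
test_size = 0.2

# Assumed rounding rule: the test set gets the ceiling of its share
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 1564 392
```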

In [20]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score

Train models

In [29]:
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)



In [30]:

svm_model = LinearSVC()
svm_model.fit(X_train, y_train)




Predictions

In [28]:
nb_preds = nb_model.predict(X_test)


In [27]:
svm_preds = svm_model.predict(X_test)

Evaluation

In [26]:
print("Naive Bayes")
print(classification_report(y_test, nb_preds))
print(accuracy_score(y_test, nb_preds))

print("SVM Performance")
print(classification_report(y_test, svm_preds))
print(accuracy_score(y_test, svm_preds))

Naive Bayes
              precision    recall  f1-score   support

           0       0.86      0.88      0.87       176
           1       0.90      0.88      0.89       216

    accuracy                           0.88       392
   macro avg       0.88      0.88      0.88       392
weighted avg       0.88      0.88      0.88       392

0.8801020408163265
SVM Performance
              precision    recall  f1-score   support

           0       0.83      0.95      0.88       176
           1       0.95      0.84      0.89       216

    accuracy                           0.89       392
   macro avg       0.89      0.89      0.89       392
weighted avg       0.90      0.89      0.89       392

0.8877551020408163


# Section 4

To compare the models I used an 80/20 train/test split. After training both models on the TF-IDF vectors, I printed sklearn's classification report, which shows precision, recall and F1 score, along with the accuracy score.
On the test set the F1 score for the spam class is 89% for both models. The SVM has slightly higher weighted-average precision at 90% vs 88% and slightly higher accuracy at 88.8% vs 88.0%, so both models fall within the 85% to 95% target. Naive Bayes was faster to train and still had decent overall accuracy, but it classifies legitimate comments as spam more often (its recall on legitimate comments is 88% vs 95% for the SVM), which is likely because it relies only on the frequency of individual words and not on patterns.
The SVM had slightly higher precision but took longer to train, and on a bigger dataset this would add up. So if I were tasked with implementing this, I'd start with Naive Bayes and switch to the SVM if its performance wasn't good enough.
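
For reference, this is how precision, recall, F1 and accuracy are derived from prediction counts. The notebook does not print a confusion matrix, so the four counts below are illustrative values that I picked to be consistent with the rounded Naive Bayes report (176 legitimate and 216 spam test comments, 345 correct in total); they are not confirmed results.

```python
# Illustrative counts only, chosen to be consistent with the rounded Naive Bayes report
tn, fp = 155, 21   # actual legitimate: predicted legitimate / predicted spam
fn, tp = 26, 190   # actual spam: predicted legitimate / predicted spam

def precision(true_pos, false_pos):
    """Of everything predicted as this class, the share that was right."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Of everything that really is this class, the share that was found."""
    return true_pos / (true_pos + false_neg)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

spam_p, spam_r = precision(tp, fp), recall(tp, fn)
legit_p, legit_r = precision(tn, fn), recall(tn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"spam:  precision={spam_p:.2f} recall={spam_r:.2f} f1={f1(spam_p, spam_r):.2f}")
# spam:  precision=0.90 recall=0.88 f1=0.89
print(f"legit: precision={legit_p:.2f} recall={legit_r:.2f} f1={f1(legit_p, legit_r):.2f}")
# legit: precision=0.86 recall=0.88 f1=0.87
print(f"accuracy={accuracy:.4f}")
# accuracy=0.8801
```

Recall on the legitimate class is the number to watch for false alarms: every legitimate comment that is not recalled has been flagged as spam.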