# Spam Comments Detection Using Machine Learning

* **I am going to build a model for spam comment detection using machine learning algorithms.** 

* **I will use datasets from Kaggle containing YouTube comments for two channels, PSY and Eminem.** 

# Import Dataset and Important Libraries


### PSY Channel

In [14]:
import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

In [3]:
data = pd.read_csv(r'Youtube01-Psy.csv')

In [4]:
data.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


#### We only need the CONTENT and CLASS columns from the dataset for the rest of the task, so let's select those two columns and move on:

In [5]:
data = data[["CONTENT","CLASS"]]
data.head()

Unnamed: 0,CONTENT,CLASS
0,"Huh, anyway check out this you[tube] channel: ...",1
1,Hey guys check out my new channel and our firs...,1
2,just for test I have to say murdev.com,1
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,watch?v=vtaRGgvGtWQ Check this out .﻿,1


## Checking Null Values and Duplicated Values

In [6]:
data.isnull().sum()

CONTENT    0
CLASS      0
dtype: int64

In [8]:
data.duplicated().sum()

1

In [10]:
# Reassign instead of using inplace=True, since data was created by selecting columns from another frame
data = data.drop_duplicates()

**The "CLASS" column contains the values 0 and 1, where 0 means not spam and 1 means spam. To make the output easier to read, I will use the labels "Not Spam" and "Spam Comment" instead of 0 and 1:**

In [11]:
data["CLASS"] = data["CLASS"].map({0 : "Not Spam", 1: "Spam Comment"})

In [12]:
data.head()

Unnamed: 0,CONTENT,CLASS
0,"Huh, anyway check out this you[tube] channel: ...",Spam Comment
1,Hey guys check out my new channel and our firs...,Spam Comment
2,just for test I have to say murdev.com,Spam Comment
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,Spam Comment
4,watch?v=vtaRGgvGtWQ Check this out .﻿,Spam Comment


In [13]:
x = np.array(data["CONTENT"])
y = np.array(data["CLASS"])

In [15]:
cv = CountVectorizer()

In [17]:
x = cv.fit_transform(x)
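
To see what `fit_transform` produces, here is a minimal sketch on three made-up comments (the `docs` list and the `toy_` names are illustrative, not taken from the dataset). With default settings, `CountVectorizer` lowercases the text, keeps tokens of two or more word characters, and assigns column indices in alphabetical order of the vocabulary. `BernoulliNB` then binarizes these counts by default, so only the presence or absence of a word matters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy comments, not rows from the Kaggle files
docs = ["check out my channel", "great song", "Check this song out"]

toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(docs)  # sparse matrix: one row per comment

# vocabulary_ maps each token to its column index; sort tokens by that index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_matrix.toarray())
```

Each row of the printed array is one comment, and each column counts how often the corresponding vocabulary word appears in it.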

In [19]:
x_train, x_test , y_train, y_test = train_test_split(x,y,test_size=0.25,random_state=42)

In [20]:
model = BernoulliNB()

In [21]:
model.fit(x_train,y_train)

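## Model Score (Training Accuracy)

In [23]:
# Accuracy on the training split; use x_test, y_test to measure performance on unseen comments
model.score(x_train,y_train)*100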

99.61685823754789
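
The score above is computed on the training rows, so it shows how well the model fits comments it has already seen. A minimal sketch of scoring the held-out split instead, using a small made-up corpus in place of the CSV (the `texts` and `labels` lists and all `toy_` names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus standing in for data["CONTENT"] and data["CLASS"]
texts = [
    "check out my channel", "subscribe to my channel please",
    "visit my website for free gifts", "free followers click here",
    "I love this song", "this video is amazing",
    "best song ever", "great beat and lyrics",
]
labels = ["Spam Comment"] * 4 + ["Not Spam"] * 4

toy_x = CountVectorizer().fit_transform(texts)
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, labels, test_size=0.25, random_state=42, stratify=labels
)

toy_model = BernoulliNB().fit(toy_x_train, toy_y_train)

# Compare accuracy on rows the model was fitted on with accuracy on unseen rows
train_acc = accuracy_score(toy_y_train, toy_model.predict(toy_x_train))
test_acc = accuracy_score(toy_y_test, toy_model.predict(toy_x_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

In this notebook the equivalent call would be `model.score(x_test, y_test)`. A large gap between the two numbers usually points to overfitting.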

**Now let’s test the model by giving spam and not spam comments as input:**

In [29]:
sample = "Check this out: https://xyz.com/" 

In [30]:
sample_data = cv.transform([sample]).toarray()

In [32]:
print(model.predict(sample_data))

['Spam Comment']


In [28]:
sample = "Lack of information!" 
sample_data = cv.transform([sample]).toarray()
print(model.predict(sample_data))

['Not Spam']
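
One possible refinement, which is not part of the original notebook: wrapping the vectorizer and the classifier in a scikit-learn pipeline lets `predict` accept raw strings directly, and the vocabulary is learned only from the rows passed to `fit`. The sketch below uses made-up comments; `train_texts`, `train_labels`, and `spam_pipeline` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical training comments, not rows from the Kaggle files
train_texts = [
    "check out my channel", "check this out http link",
    "subscribe to my channel", "click here for free gifts",
    "I love this song", "this song is amazing",
    "great video", "best music ever",
]
train_labels = ["Spam Comment"] * 4 + ["Not Spam"] * 4

# The pipeline vectorizes the text and then applies the classifier in one call
spam_pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
spam_pipeline.fit(train_texts, train_labels)

predictions = spam_pipeline.predict(["Check this out", "I love this video"])
print(predictions)
```

With this design there is no separate `cv.transform(...)` step to remember when classifying new comments.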


# Eminem Channel

In [33]:
# Relative path, matching how the PSY file was loaded; adjust if the CSV lives elsewhere
data = pd.read_csv(r"Youtube04-Eminem.csv")

In [34]:
data.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,z12rwfnyyrbsefonb232i5ehdxzkjzjs2,Lisa Wellas,,+447935454150 lovely girl talk to me xxx﻿,1
1,z130wpnwwnyuetxcn23xf5k5ynmkdpjrj04,jason graham,2015-05-29T02:26:10.652000,I always end up coming back to this song<br />﻿,0
2,z13vsfqirtavjvu0t22ezrgzyorwxhpf3,Ajkal Khan,,"my sister just received over 6,500 new <a rel=...",1
3,z12wjzc4eprnvja4304cgbbizuved35wxcs,Dakota Taylor,2015-05-29T02:13:07.810000,Cool﻿,0
4,z13xjfr42z3uxdz2223gx5rrzs3dt5hna,Jihad Naser,,Hello I&#39;am from Palastine﻿,1


In [35]:
data = data[["CONTENT","CLASS"]]
data.head()

Unnamed: 0,CONTENT,CLASS
0,+447935454150 lovely girl talk to me xxx﻿,1
1,I always end up coming back to this song<br />﻿,0
2,"my sister just received over 6,500 new <a rel=...",1
3,Cool﻿,0
4,Hello I&#39;am from Palastine﻿,1


In [36]:
data.isnull().sum()

CONTENT    0
CLASS      0
dtype: int64

In [37]:
data.duplicated().sum()

36

In [38]:
# Reassign instead of using inplace=True, since data was created by selecting columns from another frame
data = data.drop_duplicates()

In [39]:
data["CLASS"] = data["CLASS"].map({0 : "Not Spam", 1: "Spam Comment"})

In [40]:
data.head()

Unnamed: 0,CONTENT,CLASS
0,+447935454150 lovely girl talk to me xxx﻿,Spam Comment
1,I always end up coming back to this song<br />﻿,Not Spam
2,"my sister just received over 6,500 new <a rel=...",Spam Comment
3,Cool﻿,Not Spam
4,Hello I&#39;am from Palastine﻿,Spam Comment


In [41]:
x = np.array(data["CONTENT"])
y = np.array(data["CLASS"])

In [42]:
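# Refit the vectorizer on the Eminem comments; this replaces the vocabulary learned from the PSY data
x = cv.fit_transform(x)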

In [43]:
x_train, x_test , y_train, y_test = train_test_split(x,y,test_size=0.25,random_state=42)

In [44]:
model = BernoulliNB()

In [45]:
model.fit(x_train,y_train)

In [46]:
# Accuracy on the training split; use x_test, y_test to measure performance on unseen comments
model.score(x_train,y_train)*100

81.55339805825243
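
The Eminem section repeats the PSY steps line for line. A sketch of folding those steps into one function, under the assumption that every file has `CONTENT` and `CLASS` columns with 0/1 labels; `train_spam_model` and `toy_frame` are hypothetical names, and the small frame below stands in for a real CSV:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB


def train_spam_model(frame, test_size=0.25, random_state=42):
    """Hypothetical helper: clean a CONTENT/CLASS frame, fit BernoulliNB, return scores."""
    frame = frame[["CONTENT", "CLASS"]].drop_duplicates()
    y = frame["CLASS"].map({0: "Not Spam", 1: "Spam Comment"})
    vectorizer = CountVectorizer()
    x = vectorizer.fit_transform(frame["CONTENT"])
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state
    )
    model = BernoulliNB().fit(x_train, y_train)
    return {
        "vectorizer": vectorizer,
        "model": model,
        "train_accuracy": model.score(x_train, y_train),
        "test_accuracy": model.score(x_test, y_test),
    }


# Made-up rows for illustration; the last row duplicates the first and is dropped
toy_frame = pd.DataFrame({
    "CONTENT": [
        "check out my channel", "free gifts click here",
        "subscribe to my channel", "visit my website now",
        "I love this song", "great video",
        "best music ever", "this beat is amazing",
        "check out my channel",
    ],
    "CLASS": [1, 1, 1, 1, 0, 0, 0, 0, 1],
})

result = train_spam_model(toy_frame)
print(result["train_accuracy"], result["test_accuracy"])
```

With the real files the call would look like `train_spam_model(pd.read_csv("Youtube01-Psy.csv"))`, assuming the CSV sits in the working directory. Each call creates its own vectorizer, so the vocabularies of the two channels stay separate.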

## So this is how I trained a machine learning model for spam comment detection using Python.