## Spam Comment Detection

Classifying comments as spam or not spam matters for platforms like YouTube, which uses Machine Learning to filter out spam automatically and help creators keep their comment sections clean. In this article, I'll guide you through building a spam comment detection model with Python, with a practical implementation and a check of its accuracy.

Spam comment detection is a text classification task in Machine Learning. The comments it targets typically try to redirect users to other social media accounts, websites, or content. Building such a model requires labeled data, and I found a dataset of YouTube spam comments on Kaggle that fits this task. In the following section, I'll walk you through detecting spam comments with this dataset using Python.

### Spam Comment Detection using Python

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Local path to the Kaggle CSV; adjust it to wherever the file is saved
data = pd.read_csv("C:/Users/asus/OneDrive/Desktop/ML_Datasets/project/More_Projects/Youtube01-Psy.csv")
data.head(5)

,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


We only need the CONTENT and CLASS columns from the dataset for the rest of the task, so let’s select both columns and move on.

In [2]:
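# .copy() gives an independent DataFrame, so the column assignment in the next cell does not act on a slice
data = data[["CONTENT", "CLASS"]].copy()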
data.head(5)

,CONTENT,CLASS
0,"Huh, anyway check out this you[tube] channel: ...",1
1,Hey guys check out my new channel and our firs...,1
2,just for test I have to say murdev.com,1
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,watch?v=vtaRGgvGtWQ Check this out .﻿,1
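
With only two columns left, it can be worth checking for missing comments and seeing how the two classes are balanced before training. The following is a minimal sketch on a small hand-made frame; the rows and the name `toy` are illustrative assumptions and are not taken from the Kaggle file. In the notebook, the same calls would run on `data`.

```python
import pandas as pd

# Hypothetical stand-in rows with the same two columns as the notebook's DataFrame
toy = pd.DataFrame({
    "CONTENT": [
        "check out my channel",
        "great song",
        "subscribe to me please",
        "love this video",
        None,  # a missing comment, included to show what isnull() reports
    ],
    "CLASS": [1, 0, 1, 0, 0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows per class (0 = not spam, 1 = spam)
class_counts = toy["CLASS"].value_counts()
print(class_counts)
```

A strongly skewed class count would be a reason to read the accuracy score with more care later on.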


The CLASS column contains the values 0 and 1, where 0 indicates not spam and 1 indicates spam. To make the output easier to read, I will replace 0 and 1 with the labels "Not Spam" and "Spam Comment".

In [3]:
data["CLASS"] = data["CLASS"].map({0: "Not Spam",
                                   1: "Spam Comment"})
data.head(5)

,CONTENT,CLASS
0,"Huh, anyway check out this you[tube] channel: ...",Spam Comment
1,Hey guys check out my new channel and our firs...,Spam Comment
2,just for test I have to say murdev.com,Spam Comment
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,Spam Comment
4,watch?v=vtaRGgvGtWQ Check this out .﻿,Spam Comment


Now let’s train a Machine Learning model to classify comments as spam or not spam. This is a binary classification problem over text, so I will use the **Bernoulli Naive Bayes algorithm**, which models each word as either present in or absent from a comment.
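
To make the idea of word presence concrete, here is a small sketch on made-up comments; the names `toy_comments`, `toy_cv`, and `toy_model` are illustrative and not part of the notebook. It shows the count matrix that `CountVectorizer` produces and that `BernoulliNB`, with its default `binarize=0.0`, treats any count above zero as "word present".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up comments, purely for illustration
toy_comments = [
    "check out my channel channel",  # "channel" appears twice
    "subscribe to my channel",
    "great song",
    "love this song",
]
toy_labels = ["Spam Comment", "Spam Comment", "Not Spam", "Not Spam"]

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(toy_comments)

# Vocabulary in column order, followed by the word-count matrix
print(sorted(toy_cv.vocabulary_))
print(counts.toarray())

# Train once on raw counts and once on explicit 0/1 presence features
toy_model = BernoulliNB().fit(counts, toy_labels)
presence = (counts.toarray() > 0).astype(int)
presence_model = BernoulliNB().fit(presence, toy_labels)

# Both models give the same predictions, because BernoulliNB binarizes counts itself
same_predictions = (toy_model.predict(counts) == presence_model.predict(presence)).all()
print(same_predictions)

# Classify a new comment with the count-based model
new_prediction = toy_model.predict(toy_cv.transform(["my new channel"]))
print(new_prediction)
```

Because only presence matters here, a word repeated many times in one comment counts no more than a word used once.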

In [4]:
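x = np.array(data["CONTENT"])  # raw comment text
y = np.array(data["CLASS"])    # "Spam Comment" / "Not Spam" labels

# Turn each comment into a vector of word counts
# (the vocabulary is learned from all comments, before the split)
cv = CountVectorizer()
x = cv.fit_transform(x)
# Hold out 20% of the comments for testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)

# Train the classifier and print its accuracy on the held-out comments
model = BernoulliNB()
model.fit(xtrain, ytrain)
print(model.score(xtest, ytest))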

0.9857142857142858
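
In the cell above, the `CountVectorizer` is fitted on all comments before the split, so its vocabulary also contains words that occur only in test comments. One common alternative is to split the raw text first and chain the vectorizer and classifier in a scikit-learn `Pipeline`. The sketch below assumes made-up toy comments, not the Kaggle data, so its score is unrelated to the figure printed above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up comments and labels, purely for illustration
texts = [
    "check out my channel",
    "subscribe to my channel please",
    "visit my website for free gifts",
    "follow me on instagram",
    "click this link to win money",
    "great song",
    "love this video",
    "this never gets old",
    "best music video ever",
    "still listening in the car",
]
labels = ["Spam Comment"] * 5 + ["Not Spam"] * 5

# Split the raw text first; stratify keeps both classes in the test set
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training comments only
spam_pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
spam_pipeline.fit(text_train, label_train)

test_accuracy = spam_pipeline.score(text_test, label_test)
print(test_accuracy)

# Vocabulary learned by the vectorizer step (training comments only)
fitted_vocabulary = spam_pipeline.named_steps["countvectorizer"].vocabulary_
print(len(fitted_vocabulary))
```

A fitted pipeline also accepts raw strings directly, for example `spam_pipeline.predict(["some comment"])`, so no separate `transform` call is needed at prediction time.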


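Accuracy alone does not show which class the errors fall in, so let’s also print the precision, recall, and F1-score for each class.

In [5]: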
from sklearn.metrics import classification_report

# Predict on the test set with the model trained in the previous cell
predictions = model.predict(xtest)

# Print the classification report
print(classification_report(ytest, predictions))

              precision    recall  f1-score   support

    Not Spam       0.96      1.00      0.98        27
Spam Comment       1.00      0.98      0.99        43

    accuracy                           0.99        70
   macro avg       0.98      0.99      0.99        70
weighted avg       0.99      0.99      0.99        70
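
The report can be cross-checked with a confusion matrix. The sketch below does not use the notebook's actual predictions: the two label lists are reconstructed from the support, precision, and recall figures in the report (27 "Not Spam" comments all classified correctly, and 42 of 43 spam comments caught), and the variable names are illustrative.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

label_order = ["Not Spam", "Spam Comment"]

# Labels reconstructed from the report above, not taken from the real test set
true_labels = ["Not Spam"] * 27 + ["Spam Comment"] * 43
predicted_labels = ["Not Spam"] * 27 + ["Not Spam"] * 1 + ["Spam Comment"] * 42

# Rows are true classes, columns are predicted classes, both in label_order
matrix = confusion_matrix(true_labels, predicted_labels, labels=label_order)
print(matrix)

# 69 of the 70 comments are classified correctly
accuracy = accuracy_score(true_labels, predicted_labels)
print(accuracy)
```

Read this way, the single error is a spam comment that was labeled "Not Spam", which is why the precision for "Not Spam" and the recall for "Spam Comment" both fall just below 1.00.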



Now let’s test the model by giving it a spam comment and a non-spam comment as input.

In [6]:
sample = input("Enter a comment: ")  # User input
# Use a new name so the DataFrame stored in `data` is not overwritten
sample_vector = cv.transform([sample]).toarray()
print(model.predict(sample_vector))

Enter a comment: Check out:http://localhost:8888/
['Spam Comment']


In [7]:
sample = input("Enter a comment: ")  # User input
# Use a new name so the DataFrame stored in `data` is not overwritten
sample_vector = cv.transform([sample]).toarray()
print(model.predict(sample_vector))

Enter a comment: great initiative
['Not Spam']
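
The two interactive cells above could also be wrapped in a small function that classifies several comments at once and reports how confident the model is. This is a sketch under assumptions: `classify_comments` is a hypothetical helper, and a tiny toy model is trained on made-up comments so the block runs on its own. In the notebook, the fitted `cv` and `model` would be passed in instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB


def classify_comments(comments, vectorizer, classifier):
    """Return (comment, predicted label, probability of that label) tuples."""
    features = vectorizer.transform(comments)
    predicted = classifier.predict(features)
    # Highest class probability per comment, i.e. the probability of the predicted label
    probabilities = classifier.predict_proba(features).max(axis=1)
    return list(zip(comments, predicted, probabilities))


# Made-up training comments, purely for illustration
toy_texts = [
    "check out my channel",
    "subscribe to my channel",
    "great song",
    "love this song",
]
toy_labels = ["Spam Comment", "Spam Comment", "Not Spam", "Not Spam"]

toy_cv = CountVectorizer()
toy_model = BernoulliNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

results = classify_comments(["check out my channel", "great song"], toy_cv, toy_model)
for comment, label, probability in results:
    print(f"{comment!r}: {label} ({probability:.2f})")
```

Returning the probability alongside the label makes it possible to flag borderline comments for manual review instead of trusting every prediction equally.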
