<a href="https://colab.research.google.com/github/Karishma-Kuria/CMPE-256-Adv-DataMining/blob/main/ML_Based_Spam_Filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ML Based Spam Filter**

In [40]:
# Importing relevant Libraries
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string

In [41]:
# Reading the CSV file containing the spam documents
dataset_path = 'https://github.com/Karishma-Kuria/CMPE-256-Adv-DataMining/blob/main/SpamDoc.csv?raw=true'
ds = pd.read_csv(dataset_path)
ds.head()

Unnamed: 0,text,spam
0,Free -Coupons for next movie. The above links ...,1
1,Free -Coupons for next movie. The above links ...,1
2,Our records indicate your Pension is under per...,1
3,"I know that's an incredible statement, but bea...",1
4,"Dear recipient, Avangar Technologies announces...",1


All of these documents are spam, so the spam column is set to 1 for every row.

In [42]:
ds.shape

(6, 2)

In [43]:
# Removing Duplicates
ds.drop_duplicates(inplace=True)
print(ds.shape)

(5, 2)


In [44]:
# Downloading the NLTK stopwords package
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The function below cleans the text in the dataset and returns a list of tokens. It first removes punctuation and then removes stop words (e.g. "is", "the").


In [46]:
# Tokenizer used as the CountVectorizer analyzer: returns a list of tokens

# Build the stop-word set once instead of reloading the list for every word
STOP_WORDS = set(stopwords.words('english'))

def cleaning_process(text):
    # 1. Remove punctuation
    no_punctuation = ''.join(char for char in text if char not in string.punctuation)

    # 2. Remove stop words (case-insensitive comparison)
    clean = [word for word in no_punctuation.split() if word.lower() not in STOP_WORDS]
    return clean


# to show the tokenization
ds['text'].head().apply(cleaning_process)

0    [Free, Coupons, next, movie, links, take, stra...
2    [records, indicate, Pension, performing, see, ...
3    [know, thats, incredible, statement, bear, exp...
4    [Dear, recipient, Avangar, Technologies, annou...
5    [Enter, win, 25000, get, Free, Hotel, Night, c...
Name: text, dtype: object
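
To make the two cleaning steps concrete, here is a minimal sketch on a single sentence. The tiny `STOP_WORDS` set, the `clean_tokens` name, and the sample text are assumptions chosen so the example runs without the NLTK download; the notebook itself uses NLTK's full English stop-word list.

```python
import string

# Hypothetical mini stop-word list; the notebook uses stopwords.words('english')
STOP_WORDS = {"is", "the", "for", "a", "to", "your"}

def clean_tokens(text):
    # Step 1: drop punctuation characters
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # Step 2: drop stop words (case-insensitive), keeping the original casing
    return [w for w in no_punct.split() if w.lower() not in STOP_WORDS]

sample = "Free -Coupons for the next movie!"
print(clean_tokens(sample))  # ['Free', 'Coupons', 'next', 'movie']
```

Note that the tokens keep their original casing, so "Free" and "free" would be treated as different tokens later on.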

Feature Engineering: Feature Extraction (non-relevant tokens are dropped by the cleaning step and the remaining tokens are counted)

In [48]:
# Convert the text into a matrix of token counts

from sklearn.feature_extraction.text import CountVectorizer
cleaning_message = CountVectorizer(analyzer=cleaning_process).fit_transform(ds['text'])
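
As a rough illustration of what a token-count matrix looks like, the sketch below builds one by hand for two already-cleaned documents. The token lists are made up, and this is only a plain-Python approximation of the idea, not `CountVectorizer`'s actual implementation.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists (one per document)
docs = [
    ["Free", "Coupons", "next", "movie"],
    ["Free", "Hotel", "Night", "Free"],
]

# Vocabulary: sorted unique tokens, one matrix column per token
vocab = sorted({tok for doc in docs for tok in doc})

# One row per document; each cell holds the raw count of that token
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts.get(tok, 0) for tok in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row corresponds to one document and each column to one distinct token, which is how a shape such as (documents, distinct tokens) arises.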

In [49]:
# Splitting the data into training and testing sets
# 70%: Training and 30%: Testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(cleaning_message, ds['spam'], test_size=0.30, random_state=0)
# Checking shape of the data
print(cleaning_message.shape)
print(x_train.shape)
print(x_test.shape)

(5, 197)
(3, 197)
(2, 197)
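
The 3/2 split of the five remaining rows can be checked with a little arithmetic. This sketch assumes, as far as I understand scikit-learn's behaviour, that a fractional `test_size` is rounded up to a whole number of rows and the rest goes to training.

```python
import math

n_samples = 5      # rows left after removing the duplicate
test_size = 0.30   # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # 1.5 rounds up to 2
n_train = n_samples - n_test                # the remaining 3 rows

print(n_train, n_test)  # 3 2
```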


In [50]:
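from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)

# fit() only learns the IDF weights and returns the transformer itself
x_train_transformed = tfidf_transformer.fit(x_train)

# Note: this re-fits the same transformer on the test counts
x_test_transformed = tfidf_transformer.fit(x_test)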

In [51]:
print(x_train_transformed)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
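
The printed object is the fitted transformer rather than a weighted matrix, because `fit` returns the transformer; obtaining TF-IDF values would normally mean calling `fit_transform` on the training counts and `transform` on the test counts. As a hedged sketch of the weighting itself, the plain-Python code below applies the smoothed IDF formula documented for `smooth_idf=True`, followed by L2 row normalization (`norm='l2'`), to a small made-up count matrix.

```python
import math

# Hypothetical count matrix: 2 documents x 3 tokens
counts = [
    [1, 1, 0],
    [0, 2, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents containing each token
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(row):
    # Scale the row to unit Euclidean length (leave all-zero rows alone)
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

for row in tfidf:
    print([round(v, 3) for v in row])
```

A token that appears in every document gets an IDF of exactly 1, so it is down-weighted relative to rarer tokens.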


In [31]:
# Creating and training the Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(x_train, y_train)

Comparing the classifier's predictions with the actual labels on the training data

In [52]:
print(classifier.predict(x_train))
print(y_train.values)

[1 1 1]
[1 1 1]


Next, we evaluate the Naive Bayes classifier on the training data using the classification report, confusion matrix, and accuracy score.

In [53]:
# Evaluating model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(x_train)
print(classification_report(y_train, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(y_train, pred))
print("Accuracy: \n", accuracy_score(y_train, pred))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00         3

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3


Confusion Matrix: 
 [[3]]
Accuracy: 
 1.0


In [54]:
# Print the classifier Prediction
print(classifier.predict(x_test))
# Print the actual values
print(y_test.values)

[1 1]
[1 1]


The result above shows that every test document was predicted as spam (spam = 1), matching the actual labels.

In [56]:
# Evaluating model on test data
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
predict = classifier.predict(x_test)
print(classification_report(y_test, predict))
print()
print("Confusion Matrix: \n", confusion_matrix(y_test, predict))
print("Accuracy: \n", accuracy_score(y_test, predict))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00         2

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2


Confusion Matrix: 
 [[2]]
Accuracy: 
 1.0


The results above show that the classifier labeled every test message as spam, giving 100% accuracy on the test data. This is expected because the dataset contains only spam messages, so the score does not show how well the model separates spam from non-spam.
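
The effect of having a single class can be sketched with a trivial baseline that labels every message as spam. The `always_spam` helper and the mixed label list are hypothetical and only serve to contrast the two situations.

```python
def accuracy(y_true, y_pred):
    # Fraction of labels predicted correctly
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def always_spam(n):
    # Trivial baseline: label every message as spam (1)
    return [1] * n

# Labels like the notebook's test split: spam only
spam_only = [1, 1]
# Hypothetical mixed labels: 1 = spam, 0 = not spam
mixed = [1, 0, 1, 0]

print(accuracy(spam_only, always_spam(len(spam_only))))  # 1.0
print(accuracy(mixed, always_spam(len(mixed))))          # 0.5
```

On spam-only labels the baseline is perfect without learning anything, while on mixed labels it is right only half the time; a dataset with both classes would be needed for the accuracy to be informative.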