# Spam detector

Today almost everyone has a cell phone and regularly receives messages (SMS or email) on it. A large share of those messages is spam, and only some are messages the recipient actually wants. Scammers create fraudulent messages to trick you into giving them personal information such as your password, account number, or social security number. With this information, they can gain access to your email, bank, or other accounts.

I will use a spam-detection dataset that contains the text of each message and the corresponding label (ham, meaning a legitimate message, or spam).

First, let's create and train a model that determines whether a message is spam. We start by importing the libraries the program needs.

In [1]:
import string

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

We also need to download the NLTK 'stopwords' corpus.

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anvar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Now let's load the dataset from Kaggle that we will train the model on and see what it consists of.

In [3]:
df = pd.read_csv('spam_ham_dataset.csv')
df

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


From this dataset we only need the text and label_num columns

The text column also contains '\r\n' line breaks, which we replace with spaces.

In [4]:
df['text'] = df['text'].apply(lambda x: x.replace('\r\n', ' '))

Now let's check if there are missing values in the dataset

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


Since there are no missing values in the dataset, we can move on

Next, we convert all text to lowercase, remove punctuation, and reduce each word to its stem. For example:

In [6]:
stemmer = PorterStemmer()
stemmer.stem('running')

'run'

In [7]:
stemmer = PorterStemmer()
stemmer.stem('sophisticated')

'sophist'

So the stem is the base part of a word that remains after its suffixes are stripped. As 'sophist' shows, it is not always a dictionary word.

In [8]:
stemmer = PorterStemmer()
corpus = []

stopwords_set = set(stopwords.words('english'))

We need the English stopword set because the emails in the dataset are written in English.

In [9]:
for i in range(len(df)):
    text = df['text'].iloc[i].lower()  # lowercase
    text = text.translate(str.maketrans('', '', string.punctuation)).split()  # remove punctuation, split into words
    text = [stemmer.stem(word) for word in text if word not in stopwords_set]  # drop stopwords, stem the rest
    text = ' '.join(text)  # rejoin the words into one string
    corpus.append(text)

Now let's compare the original and the cleaned-up version.

In [10]:
df.text.iloc[0]

"Subject: enron methanol ; meter # : 988291 this is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary flow data provided by daren } . please override pop ' s daily volume { presently zero } to reflect daily activity you can obtain from gas control . this change is needed asap for economics purposes ."

In [11]:
corpus[0]

'subject enron methanol meter 988291 follow note gave monday 4 3 00 preliminari flow data provid daren pleas overrid pop daili volum present zero reflect daili activ obtain ga control chang need asap econom purpos'
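The cleaning steps in the loop above could also be packaged as one function, so that exactly the same preprocessing can be reused later for new messages. The following is only a sketch: `clean_text`, `demo_stem` and `DEMO_STOPWORDS` are hypothetical names, and the toy stemmer and tiny stopword set are stand-ins so the example runs without NLTK.

```python
import string

# Hypothetical stand-ins so the sketch runs without NLTK.
# In the notebook, pass stemmer.stem and stopwords_set instead.
DEMO_STOPWORDS = {"this", "is", "a", "the", "to", "you", "on"}


def demo_stem(word):
    """Toy stemmer: strips one trailing 's'. This is NOT the Porter algorithm."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, stem=demo_stem, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip punctuation, drop stopwords, stem, and rejoin."""
    words = text.lower().translate(_PUNCT_TABLE).split()
    return " ".join(stem(w) for w in words if w not in stop_words)


cleaned = clean_text("This is a follow up to the NOTE, please override the volumes!")
print(cleaned)
```

In the notebook, `corpus = [clean_text(t, stemmer.stem, stopwords_set) for t in df['text']]` should build the same list as the loop.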

Now we vectorize this data

In [12]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
y = df.label_num

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X[0]

array([1, 0, 0, ..., 0, 0, 0], shape=(42637,))

Now instead of text, we have an array of numbers.
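To make the idea behind this array concrete, here is a simplified pure-Python sketch of a bag-of-words count. It is an illustration only, not scikit-learn's implementation: `bag_of_words` and `toy_corpus` are hypothetical names, and the sketch splits on whitespace, whereas `CountVectorizer` applies its own tokenization rules.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_vectors) for whitespace-tokenized documents."""
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocabulary)}

    vectors = []
    for doc in docs:
        row = [0] * len(vocabulary)
        # Count how often each word occurs in this document
        for word, n in Counter(doc.split()).items():
            row[index[word]] = n
        vectors.append(row)
    return vocabulary, vectors


# Hypothetical two-message corpus of already cleaned text
toy_corpus = ["cheap offic cheap", "meter volum chang"]
vocabulary, vectors = bag_of_words(toy_corpus)
print(vocabulary)
print(vectors)
```

Each row has one entry per word in the vocabulary, which is why `X[0]` above has 42637 entries, most of them zero.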

And we can train the model.

To train the model I will use RandomForestClassifier.

In [13]:
clf = RandomForestClassifier(n_jobs=-1)

clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.9768115942028985

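The trained model reaches about 97.7% accuracy on the held-out test set. The exact value varies slightly between runs because the train/test split is random.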
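For a classifier, `clf.score` returns the mean accuracy, that is, the fraction of test messages labelled correctly. Because the dataset contains more ham than spam, it may be worth also checking precision and recall for the spam class. The sketch below computes all three by hand on made-up labels. The label lists are hypothetical toy data, not results from this model, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class `positive` (here 1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy labels: 6 ham (0) and 4 spam (1)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy(y_true_demo, y_pred_demo)
prec, rec = precision_recall(y_true_demo, y_pred_demo)
print(acc, prec, rec)
```

On the real data, the same quantities could be computed from `y_test` and `clf.predict(X_test)`.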

Now, to let a user check whether an email is spam, we need to preprocess the message the same way as the training data: convert it to lowercase, remove punctuation, stem the words, and vectorize it.

In [None]:
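email_to_classify = input()
# lowercase, remove punctuation, split into words
email_text = email_to_classify.lower().translate(str.maketrans('', '', string.punctuation)).split()
# drop stopwords and stem the remaining words of the new message
email_text = [stemmer.stem(word) for word in email_text if word not in stopwords_set]
# the vectorizer expects a list of strings, one per document
email_corpus = [' '.join(email_text)]

X_email = vectorizer.transform(email_corpus)
clf.predict(X_email)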

This spam detector was made by Anvar Kamaleyev, a student from Almaty, Kazakhstan