# Detecting Spam
## Project Overview
Years ago I made a project that classified spam and non-spam texts. The notebook was very sloppy and did not make a lot of sense. Spam classification is a fairly easy task, but the previous notebook did not give it the explanation and attention it deserved, so I am redoing it. In this updated version of the notebook I will go over how we can classify messages and explain the strategies and model evaluation in more depth. As mentioned, I originally created this notebook years ago as a beginner's project. There are more complex models that could perform better, but for this case I will stick to small models.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [5]:
# The data is in a tab-separated file called "data", so we can simply load it
# into a pandas DataFrame, naming the two columns ourselves
data = pd.read_csv("./data", sep="\t", names=["label", "message"])

In [6]:
data

      label  message
0     spam   URGENT! This is the 2nd attempt to contact U!U...
1     ham    :( but your not here....
2     ham    Not directly behind... Abt 4 rows behind ü...
3     spam   Congratulations ur awarded 500 of CD vouchers ...
4     spam   Had your contract mobile 11 Mnths? Latest Moto...
...   ...    ...
5567  ham    hiya hows it going in sunny africa? hope u r a...
5568  ham    At WHAT TIME should i come tomorrow
5569  spam   Wanna have a laugh? Try CHIT-CHAT on your mobi...
5570  ham    CHA QUITEAMUZING THATSCOOL BABE,PROBPOP IN & ...


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


From this we know that we have a total of 5572 samples. We now need to turn the messages themselves into something more useful. For starters, we can remove stop words, which are words like `the`, `of`, or `in`. These words won't be a determining factor in telling spam from ham, as they are very common and appear in both kinds of messages. We can also tokenize each message, which simply means turning it into a list of the words it contains. These are two simple approaches we will take to clean our data a bit for classification. We will also be lowercasing our data so that words like `Free` and `free` are treated as the same token. Additionally, we have to convert the labels into numerical values: 0 for ham and 1 for spam.
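
The cleaning steps described above can be sketched in plain Python. This is only a minimal illustration, not the code this notebook relies on: the `preprocess` helper, the regular expression, and the tiny `STOP_WORDS` set are all assumptions made for the example. In the notebook itself, tokenization and stop-word removal are left to `TfidfVectorizer(stop_words='english')`.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# real lists (such as the one scikit-learn ships) are much longer.
STOP_WORDS = {"the", "of", "in", "a", "to", "is", "and"}


def preprocess(message):
    """Lowercase a message, split it into word tokens, and drop stop words."""
    lowered = message.lower()
    # Treat runs of letters, digits, and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", lowered)
    return [tok for tok in tokens if tok not in STOP_WORDS]


example_tokens = preprocess("The winner of the PRIZE is in the list")
print(example_tokens)  # ['winner', 'prize', 'list']
```

Only the words that carry some signal survive, which is the point of the cleanup.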

In [8]:
# converting text label into numerical values such as ham=0, spam=1
label_mapping = {'ham': 0, 'spam': 1}
data['label'] = data['label'].map(label_mapping)

In [9]:
data['message'] = data['message'].str.lower() # Convert to lowercase

In [10]:
# split into X (features) and y (labels)
X = data['message']
y = data['label']

In [None]:
# split the data into train and test sets: the test set will be 20 percent
# of the data and the train set the remaining 80 percent
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
X_train.info()

<class 'pandas.core.series.Series'>
Index: 4457 entries, 1978 to 860
Series name: message
Non-Null Count  Dtype 
--------------  ----- 
4457 non-null   object
dtypes: object(1)
memory usage: 69.6+ KB


We now have 4457 samples that form our training set, which leaves 1115 samples for testing.
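
As a quick sanity check, the split sizes can be reproduced with a little arithmetic. The sketch below assumes the fractional test size is rounded up to a whole number of samples, which matches the counts reported in this notebook. The variable names are just for illustration.

```python
import math

n_samples = 5572  # total rows reported by data.info()
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test set size is rounded up to the next whole sample.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)  # 4457 1115
```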

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [15]:
tfidf = TfidfVectorizer(stop_words='english')

In [16]:
tfidf

In [None]:
X_train_tfidf = tfidf.fit_transform(X_train)

In [17]:
X_train_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 34432 stored elements and shape (4457, 7436)>
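
The matrix has one row per training message and one column per vocabulary term. Only 34432 of its 4457 × 7436 cells hold a value, which is about 0.1%, and that is why it is stored in a sparse format. To give a feel for what each stored value means, here is a simplified pure-Python sketch of TF-IDF weighting. It uses the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The toy corpus and the `tfidf_vector` helper are hypothetical, and this is an illustration, not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and tokenized.
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))


def tfidf_vector(doc):
    """Return {term: weight} using smoothed IDF and L2 normalization."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    # Scale the vector to unit length so long messages do not dominate.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(docs[0])
print(vec)
```

In the first toy message, `prize` appears in only one document while `free` appears in two, so `prize` gets the larger weight. Rarer terms count for more.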

In [None]:
X_test_tfidf = tfidf.transform(X_test)

In [12]:
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

In [13]:
y_pred = model.predict(X_test_tfidf)

In [17]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       963
           1       1.00      0.80      0.89       152

    accuracy                           0.97      1115
   macro avg       0.98      0.90      0.94      1115
weighted avg       0.97      0.97      0.97      1115
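
Reading the report: class 1 (spam) has a precision of 1.00, so practically nothing the model flagged as spam was actually ham. Its recall of 0.80 means roughly one in five spam messages in the test set slipped through as ham. The f1-score column is the harmonic mean of precision and recall. The small sketch below recomputes it from the rounded values in the report. The helper name `f1_score_from` is made up for this example.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall values copied from the report above.
ham_f1 = f1_score_from(0.97, 1.00)
spam_f1 = f1_score_from(1.00, 0.80)

print(round(ham_f1, 2), round(spam_f1, 2))  # 0.98 0.89
```

Because the harmonic mean is pulled toward the smaller of the two numbers, the lower spam recall drags the spam f1-score down to 0.89 despite perfect precision.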



In [18]:
# Predict a new message (for example)
new_message = ["Congratulations! You've won a free ticket to the Bahamas!"]
new_message_tfidf = tfidf.transform(new_message)
prediction = model.predict(new_message_tfidf)

In [19]:
prediction

array([0])
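
The output uses the numeric encoding from earlier, so `0` means the model labeled this message as ham even though it reads like spam. That fits the 0.80 spam recall in the classification report: some spam gets through. To make predictions easier to read, the numbers can be mapped back to label names. The sketch below is a hypothetical helper and not part of the original notebook. It uses a plain list as a stand-in for the NumPy array returned by `model.predict`.

```python
# Same mapping used earlier in the notebook to encode the labels.
label_mapping = {"ham": 0, "spam": 1}
# Invert it so numeric predictions can be turned back into names.
label_names = {value: name for name, value in label_mapping.items()}


def decode(predictions):
    """Translate numeric predictions (0/1) into 'ham'/'spam' strings."""
    return [label_names[int(p)] for p in predictions]


decoded = decode([0])  # stand-in for the array([0]) shown above
print(decoded)  # ['ham']
```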