# Project Overview
Years ago I made a project that classified text messages as spam or not spam. The notebook was very sloppy and did not make a lot of sense. It is a fairly easy task, but it did not get the explanation and attention it deserved in the previous notebook, so I am redoing it. In this updated version I will go over how we can classify messages and explain the strategies and model evaluations in more depth. As mentioned, I originally created this notebook as a beginner's project. There are more complex models that could perform better, but for this case I will stick to naive Bayes and support vector machines. Naive Bayes is a common approach for this task, which is why I am using it. I chose the support vector machine because I expect it to perform well.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Above we have imported some libraries. The data is in a file called `data`. Loading it is pretty straightforward: we just read it in using pandas. Once it is loaded we will begin some small data exploration and feature engineering.

In [2]:
data = pd.read_csv("./data", sep="\t", names=["label", "message"])  # tab-separated file without a header row

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


From this we know that we have a total of 5572 samples. There are no null values, and both columns are of type `object`. Before we get into data exploration it would be helpful to make the labels `int` type. Let's first see what our data looks like.

In [4]:
data.head()

Unnamed: 0,label,message
0,spam,URGENT! This is the 2nd attempt to contact U!U...
1,ham,:( but your not here....
2,ham,Not directly behind... Abt 4 rows behind ü...
3,spam,Congratulations ur awarded 500 of CD vouchers ...
4,spam,Had your contract mobile 11 Mnths? Latest Moto...


To start, we need to turn our labels into numerical values. `spam` will be represented by `1` and `ham` will be represented by `0`.

In [5]:
# converting text label into numerical values such as ham=0, spam=1
label_mapping = {'ham': 0, 'spam': 1}
data['label'] = data['label'].map(label_mapping)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   int64 
 1   message  5572 non-null   object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB


In [7]:
data['label'].value_counts() 

label
0    4825
1     747
Name: count, dtype: int64
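
The raw counts above are easier to compare as proportions. The sketch below rebuilds a label column with the same counts (4825 `ham`, 747 `spam`) purely for illustration; the name `labels` is an assumption, and in the notebook the same call would run on `data['label']`.

```python
import pandas as pd

# Illustrative stand-in for data['label'], built from the counts shown above
labels = pd.Series([0] * 4825 + [1] * 747, name="label")

# normalize=True returns class fractions instead of raw counts
proportions = labels.value_counts(normalize=True)
spam_share = proportions[1]
print(f"ham: {proportions[0]:.1%}, spam: {spam_share:.1%}")
```

With these counts, roughly 13% of the messages are spam.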

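We have 747 `spam` examples against 4825 `ham` examples, so the classes are imbalanced: only about 13% of the messages are spam. That is still a good number of examples for the model to learn what spam looks like, but it means accuracy alone can be misleading, so we will also look at precision and recall when we evaluate the model. Now we should move on to the text message itself. For this column we will make all the messages lowercase.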

In [10]:
data['message'] = data['message'].str.lower() # Convert to lowercase

In [11]:
data.head(10)

Unnamed: 0,label,message
0,1,urgent! this is the 2nd attempt to contact u!u...
1,0,:( but your not here....
2,0,not directly behind... abt 4 rows behind ü...
3,1,congratulations ur awarded 500 of cd vouchers ...
4,1,had your contract mobile 11 mnths? latest moto...
5,1,urgent! call 09066350750 from your landline. y...
6,0,no plans yet. what are you doing ?
7,0,hi ....my engagement has been fixd on &lt;#&g...
8,0,not course. only maths one day one chapter wit...
9,0,wow didn't think it was that common. i take it...


Now our data is in lowercase and we have the labels just as we need them. I have not done much feature exploration because raw text does not lend itself to it. What I will do is try to create some columns and see if they have any effect on whether a message is spam or not. If there seems to be a relationship, we will keep that column for training. I will do this by counting words and seeing if any of them appear more often in `spam` messages. Before that, though, we must remove stop words. Stop words are words like `the`, `to` and `a`. These words are highly likely to appear in both `spam` and `ham` messages, so they would have little to no effect on detecting `spam`.

In [None]:
from collections import Counter
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word list used below

In [32]:
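# Build the stop word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    """Return text with English stop words removed (splits on whitespace only)."""
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in STOP_WORDS]
    return ' '.join(filtered_words)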

In [35]:
data['message'] = data['message'].apply(remove_stopwords)
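
To make the effect of the filter concrete, here is a minimal sketch that applies the same whitespace-split logic to one message from the dataset. It uses a tiny hand-written stop word set (`SMALL_STOP_WORDS`, an assumption for illustration) rather than the full NLTK list, so it runs without downloading any corpus.

```python
# Hypothetical, tiny stop word set used only for this illustration;
# the notebook itself uses NLTK's full English list.
SMALL_STOP_WORDS = {"but", "your", "not", "the", "to", "a", "is", "this"}

def remove_stopwords_demo(text, stop_words=SMALL_STOP_WORDS):
    """Drop whitespace-separated tokens that are stop words."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

message = ":( but your not here...."
cleaned = remove_stopwords_demo(message)
print(cleaned)  # → :( here....
```

Because the split is on whitespace only, punctuation stays attached to tokens: `here....` is compared as a whole and is never matched against the plain word `here`.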

In [49]:
spam = data[data['label'] == 1]
ham = data[data['label'] == 0]

In [50]:
spam_words = ' '.join(spam['message']).lower().split()
ham_words = ' '.join(ham['message']).lower().split()

In [51]:
spam_word_counts = Counter(spam_words)
ham_word_counts = Counter(ham_words)

In [52]:
spam.head(10)

Unnamed: 0,label,message,urgent
0,1,urgent! 2nd attempt contact u!u £1000call 0907...,1
3,1,congratulations ur awarded 500 cd vouchers 125...,0
4,1,"contract mobile 11 mnths? latest motorola, nok...",0
5,1,urgent! call 09066350750 landline. complimenta...,1
10,1,ur chance win £250 wkly shopping spree txt: sh...,0
12,1,u secret admirer looking 2 make contact u-find...,0
13,1,"mila, age23, blonde, new uk. look sex uk guys....",0
18,1,well done england! get official poly ringtone ...,0
28,1,freemsg: txt: call no: 86888 & claim reward 3 ...,0
42,1,sunshine quiz! win super sony dvd recorder can...,0


In [53]:
ham.head(10)

Unnamed: 0,label,message,urgent
1,0,:( here....,0
2,0,directly behind... abt 4 rows behind ü...,0
6,0,plans yet. ?,0
7,0,hi ....my engagement fixd &lt;#&gt; th next mo...,0
8,0,course. maths one day one chapter one month fi...,0
9,0,wow think common. take back ur freak! unless u...,0
11,0,noooooooo please. last thing need stress. life...,0
14,0,"see swing bit, got things take care firsg",0
15,0,wanted wish happy new year wanted talk legal a...,0
16,0,finished work yet something?,0


In [54]:
spam_word_count_df = pd.DataFrame(spam_word_counts.items(), columns=['word', 'count']).sort_values(by='count', ascending=False)
ham_word_count_df = pd.DataFrame(ham_word_counts.items(), columns=['word', 'count']).sort_values(by='count', ascending=False)

In [55]:
spam_word_count_df.head(10)

Unnamed: 0,word,count
47,call,342
20,free,180
22,2,169
12,ur,144
26,txt,136
81,u,117
40,text,112
30,mobile,109
134,claim,106
196,reply,101


In [56]:
ham_word_count_df.head(10)

Unnamed: 0,word,count
44,u,881
150,get,293
149,2,288
16,&lt;#&gt;,276
41,ur,241
333,go,238
58,got,228
91,.,228
102,like,223
92,come,218
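
The top-10 lists above include tokens such as `.` and `&lt;#&gt;`, and raw counts are hard to compare because there are far more `ham` messages than `spam` messages. One possible refinement, sketched below on a handful of made-up messages, is to tokenize with a regular expression so punctuation is dropped, then compare how often each word occurs per message in each class. The messages and the helper names `tokenize` and `per_message_rate` are assumptions for illustration, not part of the original notebook.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def per_message_rate(messages):
    """Average number of occurrences of each word per message."""
    counts = Counter(word for msg in messages for word in tokenize(msg))
    return {word: n / len(messages) for word, n in counts.items()}

# Made-up examples, not taken from the dataset
spam_msgs = ["URGENT! call now to claim your free prize", "free entry, call now!"]
ham_msgs = ["call me when you are free", "ok see you later", "are you home?", "lol ok"]

spam_rate = per_message_rate(spam_msgs)
ham_rate = per_message_rate(ham_msgs)

# Compare per-message rates rather than raw counts
for word in ["call", "free", "ok"]:
    print(word, spam_rate.get(word, 0.0), ham_rate.get(word, 0.0))
```

A word whose rate is much higher in one class than the other is a candidate signal; a word with similar rates in both classes is unlikely to help.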


In [12]:
# data['urgent'] = data['message'].apply(lambda x: 1 if 'urgent' in x else 0)

In [13]:
# data.head()

Unnamed: 0,label,message,urgent
0,1,urgent! this is the 2nd attempt to contact u!u...,1
1,0,:( but your not here....,0
2,0,not directly behind... abt 4 rows behind ü...,0
3,1,congratulations ur awarded 500 of cd vouchers ...,0
4,1,had your contract mobile 11 mnths? latest moto...,0


In [10]:
# split into X (features) and y (labels)
X = data['message']
y = data['label']

In [None]:
# split the data into train and test sets, the test size will be 20 percent of the data while the train will
# be 80 percent of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
X_train.info()

<class 'pandas.core.series.Series'>
Index: 4457 entries, 1978 to 860
Series name: message
Non-Null Count  Dtype 
--------------  ----- 
4457 non-null   object
dtypes: object(1)
memory usage: 69.6+ KB


We now have 4457 samples that form our training set. 
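
Because only about 13% of the messages are spam, a purely random split can leave the train and test sets with slightly different spam proportions. An optional variation, not used in this notebook, is to pass `stratify` to `train_test_split` so both sets keep the class ratio. The toy labels below are made up for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 20% positive class (made up for illustration)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```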

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [15]:
tfidf = TfidfVectorizer(stop_words='english')

In [16]:
tfidf

In [None]:
X_train_tfidf = tfidf.fit_transform(X_train)

In [17]:
X_train_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 34432 stored elements and shape (4457, 7436)>
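
The output above says the matrix has 4457 rows (one per training message), 7436 columns (one per vocabulary term) and 34432 stored non-zero values. A quick calculation, using only those three numbers, shows how sparse it is:

```python
# Numbers copied from the sparse matrix summary above
n_rows, n_cols, n_stored = 4457, 7436, 34432

density = n_stored / (n_rows * n_cols)      # fraction of cells that are non-zero
avg_terms_per_message = n_stored / n_rows   # average non-zero terms per message

print(f"density: {density:.4%}")
print(f"average non-zero terms per message: {avg_terms_per_message:.1f}")
```

Each message uses only a handful of the vocabulary terms, which is why the result is stored as a sparse matrix rather than a dense array.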

In [None]:
X_test_tfidf = tfidf.transform(X_test)

In [12]:
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

In [13]:
y_pred = model.predict(X_test_tfidf)

In [17]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       963
           1       1.00      0.80      0.89       152

    accuracy                           0.97      1115
   macro avg       0.98      0.90      0.94      1115
weighted avg       0.97      0.97      0.97      1115
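
To read the report: for class `1` (spam), precision 1.00 means every message flagged as spam really was spam, while recall 0.80 means roughly a fifth of the 152 spam messages in the test set were missed. The sketch below shows how the three columns are computed from confusion counts. The counts `tp=122`, `fp=0`, `fn=30` are an assumption chosen to be consistent with the rounded report, not values read from the notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Assumed counts for the spam class (tp + fn = 152, the reported support)
precision, recall, f1 = precision_recall_f1(tp=122, fp=0, fn=30)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a spam filter this trade-off is often acceptable: missing some spam is usually less costly than sending a legitimate message to the spam folder.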



In [18]:
# Predict a new message (for example)
new_message = ["are u free tonight?"]
new_message_tfidf = tfidf.transform(new_message)
prediction = model.predict(new_message_tfidf)

In [19]:
prediction

array([0])
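
The vectorizer and the classifier always have to be applied in the same order, so one option is to bundle them in a scikit-learn `Pipeline`, which then accepts raw strings directly. This is a sketch on a few made-up messages rather than the notebook's data; the names `toy_messages`, `toy_labels` and `spam_pipeline` are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data: 1 = spam, 0 = ham
toy_messages = [
    "urgent! call now to claim your free prize",
    "free entry win cash txt now",
    "are you coming home tonight",
    "see you at lunch tomorrow",
]
toy_labels = [1, 1, 0, 0]

spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
spam_pipeline.fit(toy_messages, toy_labels)

# Raw text goes in; the pipeline vectorizes it before predicting
prediction = spam_pipeline.predict(["are u free tonight?"])
print(prediction)
```

Fitting the pipeline only on training data also guards against accidentally refitting the vectorizer on test data.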