# 04-Spam-Classifier

It's time to make our first real Machine Learning application of NLP: a spam classifier!

A spam classifier is a Machine Learning model that classifies texts (email or SMS) into two categories: spam (1) or legitimate (0).

To do that, we will reuse what we already know: we will apply preprocessing and BOW (Bag Of Words) to a dataset of texts.
Then we will use a classifier to predict which class a new email/SMS belongs to, based on its BOW.

First things first: import the needed libraries.

In [1]:
# Import NLTK and all the needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Now load the dataset in *spam.csv* using pandas. Use the 'latin-1' encoding as a loading option.

In [2]:
# TODO: Load the dataset 
df = pd.read_csv('spam.csv', encoding='latin-1')
print(df.head(5))

  Class                                            Message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


As usual, I suggest you explore this dataset a bit.

In [3]:
# TODO: explore the dataset
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
None


As you can see, we have one column containing the labels and one column containing the text to classify.
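
Beyond `df.info()`, it can help to look at the class balance and the message lengths. The sketch below uses a tiny stand-in dataframe with invented values so that it runs on its own; on the real data you would apply the same calls to `df`.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values invented for illustration)
toy_df = pd.DataFrame({
    "Class": ["ham", "ham", "spam", "ham"],
    "Message": ["Ok lar", "See you soon", "Free entry now", "Call me later"],
})

# Absolute and relative number of messages per class
class_counts = toy_df["Class"].value_counts()
class_share = toy_df["Class"].value_counts(normalize=True)
print(class_counts)
print(class_share)

# Average message length (in characters) per class
message_length = toy_df["Message"].str.len()
print(toy_df.assign(length=message_length).groupby("Class")["length"].mean())
```

An imbalanced dataset is worth knowing about early, because accuracy alone can then look good even when many spam messages are missed.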

We will begin by doing the usual preprocessing: tokenization, punctuation removal, stop word removal and stemming (used here in place of lemmatization).

In [4]:
# TODO: Perform preprocessing over all the text
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
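# Remove all characters except letters, digits, and spaces
# (regex=True is required: since pandas 2.0 the pattern is otherwise treated as a literal string)
df['Message'] = df['Message'].str.replace('[^a-zA-Z0-9 ]', '', regex=True)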

# Initialize the Porter stemmer
stemmer = PorterStemmer()

# Build the stop word set once rather than on every call
stop_words = set(stopwords.words('english'))

# Function to preprocess text and return tokens
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    
    # Remove punctuation
    tokens = [token for token in tokens if token not in string.punctuation]
    
    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    
    # Stemming
    tokens = [stemmer.stem(token) for token in tokens]
    
    return tokens

# Apply preprocessing to the 'Message' column
df['tokens'] = df['Message'].apply(preprocess_text)

# Display the first and last five cleaned messages, split on whitespace
# (note: this shows the cleaned 'Message' column, not the stemmed 'tokens' column)
messages = pd.concat([df[['Message']].head(), df[['Message']].tail()])
messages = messages.apply(lambda x: '[' + x.str.split().str.join(', ') + ']')
print(messages)
print("Name: tokens, Length:", len(df['tokens']), ", dtype: object")


                                                Message
0     [Go, until, jurong, point, crazy, Available, o...
1                        [Ok, lar, Joking, wif, u, oni]
2     [Free, entry, in, 2, a, wkly, comp, to, win, F...
3     [U, dun, say, so, early, hor, U, c, already, t...
4     [Nah, I, dont, think, he, goes, to, usf, he, l...
5567  [This, is, the, 2nd, time, we, have, tried, 2,...
5568          [Will, b, going, to, esplanade, fr, home]
5569  [Pity, was, in, mood, for, that, Soany, other,...
5570  [The, guy, did, some, bitching, but, I, acted,...
5571                   [Rofl, Its, true, to, its, name]
Name: tokens, Length: 5572 , dtype: object
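
The cleaning step relies on a regular expression, so it is worth checking that the pattern is really interpreted as one. In pandas 2.0 and later, `str.replace` treats the pattern as a literal string unless `regex=True` is passed. A minimal sketch on two toy messages (invented for illustration):

```python
import pandas as pd

# Toy messages (invented for illustration)
messages = pd.Series(["Free entry!!", "Ok lar... Joking"])
pattern = r"[^a-zA-Z0-9 ]"

# Literal matching: the pattern text never occurs, so nothing changes
literal = messages.str.replace(pattern, "", regex=False)

# Regex matching: everything except letters, digits and spaces is removed
cleaned = messages.str.replace(pattern, "", regex=True)

print(literal.tolist())
print(cleaned.tolist())
```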


OK, now we have our preprocessed data. The next step is to compute the BOW.

In [9]:
# TODO: compute the BOW
from sklearn.feature_extraction.text import CountVectorizer

# Convert the list of tokens into strings
df['preprocessed_message'] = df['tokens'].apply(lambda x: ' '.join(x))

# Initialize CountVectorizer to convert text to BOW
count_vectorizer = CountVectorizer()

# Fit and transform the preprocessed text data
bow_matrix = count_vectorizer.fit_transform(df['preprocessed_message'])

# Display the shape of the BOW matrix
print("Shape of BOW matrix:", bow_matrix.shape)


Shape of BOW matrix: (5572, 8075)


Then, as usual, build a new dataframe to get a visual idea of the words used and their frequencies.

In [10]:
# TODO: Make a new dataframe with the BOW
# Create a DataFrame for the BOW matrix
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

# Display the first few rows of the new DataFrame
print(bow_df.head())

   008704050406  0089mi  0121  01223585236  01223585334  0125698789  02  \
0             0       0     0            0            0           0   0   
1             0       0     0            0            0           0   0   
2             0       0     0            0            0           0   0   
3             0       0     0            0            0           0   0   
4             0       0     0            0            0           0   0   

   020603  0207  02070836089  ...  zebra  zed  zero  zhong  zindgi  zoe  \
0       0     0            0  ...      0    0     0      0       0    0   
1       0     0            0  ...      0    0     0      0       0    0   
2       0     0            0  ...      0    0     0      0       0    0   
3       0     0            0  ...      0    0     0      0       0    0   
4       0     0            0  ...      0    0     0      0       0    0   

   zogtoriu  zoom  zouk  zyada  
0         0     0     0      0  
1         0     0     0      0  

Let's check which word is the most used in the spam category and in the non-spam category.

There are two steps: first add the class to the BOW dataframe. Second, filter on a class, sum all the values and print the most frequent one.

In [11]:
# TODO: print the most used word in the spam and non spam category
# Add the 'Class' column to the BOW dataframe
bow_df['Class'] = df['Class']

# Filter on spam and non-spam categories, sum all the values excluding the 'Class' column
spam_df = bow_df[bow_df['Class'] == 'spam'].drop(columns=['Class'])
non_spam_df = bow_df[bow_df['Class'] == 'ham'].drop(columns=['Class'])

most_used_word_spam = spam_df.sum().idxmax()
most_used_word_non_spam = non_spam_df.sum().idxmax()

print("Most used word in spam category:", most_used_word_spam)
print("Most used word in non-spam category:", most_used_word_non_spam)

Most used word in spam category: call
Most used word in non-spam category: im


With the preprocessing above, the most frequent spam word is 'call' (depending on your preprocessing, you may find 'free' instead). Not so surprising, right?

Now we can make a classifier based on our BOW. We will use a simple logistic regression here for the example.

You're an expert, you know what to do, right? Split the data, train your model, predict and see the performance.

In [15]:
# TODO: Perform a classification to predict whether a message is a spam or not
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Split the data into train and test sets
X = bow_df.drop(columns=['Class'])
y = bow_df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
logreg_classifier = LogisticRegression()
logreg_classifier.fit(X_train, y_train)

# Predict the labels for test set
y_pred = logreg_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print(classification_report(y_test, y_pred))


Accuracy: 0.97847533632287
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       965
        spam       1.00      0.84      0.91       150

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115



What precision do you get? Check by hand some samples that were not predicted well, to see what could go wrong...
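
In the report above, the spam class has a precision of 1.00 but a recall of 0.84, which suggests that most errors are spam messages predicted as ham. One way to inspect them is a confusion matrix plus a filter on the misclassified rows. The sketch below uses invented labels and predictions so that it runs on its own; with the real data you would build the dataframe from `y_test`, `y_pred` and the matching messages.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy messages, true labels and predictions (invented for illustration)
results = pd.DataFrame({
    "Message": ["see you soon", "win cash now", "ur account update",
                "ok lar", "reply to claim"],
    "y_true": ["ham", "spam", "spam", "ham", "spam"],
    "y_pred": ["ham", "spam", "ham", "ham", "ham"],
})

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(results["y_true"], results["y_pred"], labels=["ham", "spam"])
print(cm)

# Spam messages that the model let through as ham
missed_spam = results[(results["y_true"] == "spam") & (results["y_pred"] == "ham")]
print(missed_spam["Message"].tolist())
```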

Try to use other models and try to improve your results.
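
As one possible starting point, here is a sketch that swaps in a Multinomial Naive Bayes classifier (a different model from the logistic regression used above) and wraps it with the vectorizer in a scikit-learn pipeline. With a pipeline, the vocabulary is learned from the training messages only. The toy messages are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (invented for illustration)
train_texts = [
    "win free prize now", "free entry win cash", "claim your free prize",
    "see you at lunch", "call me when you are home", "are you coming home tonight",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The vectorizer is fitted inside the pipeline, on the training texts only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Predict the class of unseen messages
new_texts = ["free cash prize", "see you at home"]
predictions = model.predict(new_texts)
print(list(predictions))
```

On the real dataset, you would pass the preprocessed messages of the training split to `fit` and compare the classification report with the one obtained above.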