# 04-Spam-Classifier

It's time to make our first real Machine Learning application of NLP: a spam classifier!

A spam classifier is a Machine Learning model that classifies texts (emails or SMS messages) into two categories: spam (1) or legitimate (0).

To do that, we will reuse what we already know: we will apply preprocessing and BOW (Bag Of Words) to a dataset of texts.
Then we will use a classifier to predict which class a new email/SMS belongs to, based on its BOW.

First things first: import the needed libraries.

In [4]:
# Import NLTK and all the needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

Now load the dataset in *spam.csv* using pandas. Use the 'latin-1' encoding as a loading option.

In [5]:
# TODO: Load the dataset
filename = "spam.csv"
# Use the 'latin-1' encoding, as required by the instructions above
df = pd.read_csv(filename, encoding='latin-1')
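
To see why the encoding option matters, here is a small sketch using a made-up byte string (an assumption for illustration, not read from the file). A pound sign stored as a single Latin-1 byte decodes fine as 'latin-1' but is not valid UTF-8.

```python
# Hypothetical sample: a pound sign encoded as one Latin-1 byte
raw = "£750 prize".encode("latin-1")
print(raw)

# Decoding with the matching encoding recovers the text
decoded = raw.decode("latin-1")
print(decoded)

# Decoding the same bytes as UTF-8 fails
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print("utf-8 decoding failed:", err.reason)
```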

As usual, I suggest you explore this dataset a bit.

In [8]:
# TODO: explore the dataset
print(df.head())

  Class                                            Message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


As you can see, we have one column containing the labels and one column containing the text to classify.
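
Two quick checks are often useful at this point: the share of each class, and a numeric version of the labels (spam = 1, legitimate = 0). The sketch below uses a hypothetical four-row miniature with the same column names, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset with the same two columns
toy_df = pd.DataFrame({
    "Class": ["ham", "ham", "spam", "ham"],
    "Message": ["Ok lar", "See you at 5", "Free entry, win now", "Call me"],
})

# Share of each class (useful to spot class imbalance)
class_share = toy_df["Class"].value_counts(normalize=True)
print(class_share)

# Numeric labels: spam -> 1, ham -> 0
toy_labels = toy_df["Class"].map({"ham": 0, "spam": 1})
print(toy_labels.tolist())
```

The same two lines can be run on `df` once it is loaded.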

We will begin by doing the usual preprocessing: tokenization, punctuation and stopword removal, and lemmatization.

In [11]:
# TODO: Perform preprocessing over all the text
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

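# Download the NLTK resources used below (only needed once;
# recent NLTK releases may also require 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))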

def preprocess_text(text):
    # Tokenize
    tokens = word_tokenize(text.lower())
    
    # Remove punctuation and stopwords
    tokens = [t for t in tokens if t not in string.punctuation and t not in stop_words]
    
    # Lemmatize
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    
    # Return cleaned text
    return ' '.join(tokens)

# Apply preprocessing to the 'Message' column
df['clean_text'] = df['Message'].apply(preprocess_text)
print(df['clean_text'])

0       go jurong point crazy .. available bugis n gre...
1                         ok lar ... joking wif u oni ...
2       free entry 2 wkly comp win fa cup final tkts 2...
3             u dun say early hor ... u c already say ...
4                 nah n't think go usf life around though
                              ...                        
5567    2nd time tried 2 contact u. u £750 pound prize...
5568                         �_ b going esplanade fr home
5569                             pity mood ... suggestion
5570    guy bitching acted like 'd interested buying s...
5571                                       rofl true name
Name: clean_text, Length: 5572, dtype: object


OK, now we have our preprocessed data. The next step is to compute a BOW.

In [16]:
# TODO: compute the BOW
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['clean_text'])

# Convert BOW to DataFrame
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Concatenate BOW with the original DataFrame
df_bow = pd.concat([df['Message'], bow_df], axis=1)

Then make a new dataframe as usual to get a visual idea of the words used and their frequencies.

In [13]:
# TODO: Make a new dataframe with the BOW
print(df_bow.head())


Let's check which word is used most in the spam category and in the non-spam category.

There are two steps: first, select the rows of the BOW dataframe that belong to a class, using the `Class` column of the original dataframe. Second, sum all the values and print the most frequent ones.

In [14]:
# TODO: print the most used word in the spam and non spam category
# Boolean mask over the rows, built from the 'Class' column ('spam' / 'ham')
is_spam = df['Class'] == 'spam'

print("Top words in spam messages:")
print(bow_df[is_spam].sum().sort_values(ascending=False).head(10))

print("\nTop words in non-spam messages:")
print(bow_df[~is_spam].sum().sort_values(ascending=False).head(10))

You should find that the most frequent spam word is 'free', not so surprising, right?

Now we can make a classifier based on our BOW. We will use a simple Multinomial Naive Bayes classifier here as an example.

You're an expert, you know what to do, right? Split the data, train your model, predict and see the performance.

In [None]:
# TODO: Perform a classification to predict whether a message is a spam or not
# Prepare data for classification: BOW counts as features, spam = 1 / ham = 0 as labels
X = bow_df
y = (df['Class'] == 'spam').astype(int)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier
nb_classifier.fit(X_train, y_train)

# Predictions
y_pred = nb_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2f}")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

What precision do you get? Check by hand some samples where it did not predict well to see what could go wrong...
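
One possible way to do this check is to put the messages, true labels and predictions side by side and keep only the rows where they disagree. The helper `misclassified` and the toy data below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def misclassified(messages, y_true, y_pred):
    """Return the messages whose predicted label differs from the true one."""
    report = pd.DataFrame({
        "message": list(messages),
        "true": list(y_true),
        "predicted": list(y_pred),
    })
    return report[report["true"] != report["predicted"]]

# Hypothetical messages and labels (1 = spam, 0 = ham), for illustration only
toy_messages = ["win a free prize", "see you tonight", "free for lunch?", "urgent: claim now"]
toy_true = [1, 0, 0, 1]
toy_pred = [1, 0, 1, 0]

errors = misclassified(toy_messages, toy_true, toy_pred)
print(errors)
```

In the notebook, something like `misclassified(df.loc[y_test.index, 'Message'], y_test, y_pred)` should work, assuming `y_test` is a pandas Series that kept the original row index after the split.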

Try to use other models and try to improve your results.
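
As a starting point, here is a sketch that trains two candidate models on the same BOW features and compares their accuracy. The texts, labels and the choice of `LogisticRegression` as the second model are assumptions made for illustration. With so few samples the scores mean nothing, and only the structure of the comparison carries over to the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = spam, 0 = ham), not taken from spam.csv
train_texts = [
    "win free prize now", "free entry win cash", "claim your free prize",
    "see you at lunch", "call me when you arrive", "are we still meeting today",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["free cash prize", "see you today"]
test_labels = [1, 0]

# Each candidate is a pipeline: BOW vectorizer followed by a classifier
candidates = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(
        CountVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

# Fit every candidate on the same training data and record its test accuracy
scores = {}
for name, model in candidates.items():
    model.fit(train_texts, train_labels)
    scores[name] = model.score(test_texts, test_labels)

print(scores)
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned from the training texts only, which avoids leaking information from the test set.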