# Spam/Ham SMS Message Classifier

**A machine learning project to automatically detect spam messages.**

In [1]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

RANDOM_SEED = 42
TEST_SIZE = 0.2

### Project Constants

**`RANDOM_SEED` ensures that our train-test split is reproducible, meaning it will be the same every time we run the code.**

**`TEST_SIZE` defines the proportion of the dataset that will be used for testing (in this case, 20%).**
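
**A minimal sketch of what these two constants control, using a toy list of ten items instead of the SMS dataset (the toy data is an assumption made purely for illustration). Two separate calls with the same seed should return the same partition, and 20% of ten items gives a test set of two.**

```python
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
TEST_SIZE = 0.2

# Toy stand-in for the dataset: ten numbered "messages" (hypothetical data).
items = list(range(10))

# Two independent calls with the same seed produce the same partition.
train_a, test_a = train_test_split(items, test_size=TEST_SIZE, random_state=RANDOM_SEED)
train_b, test_b = train_test_split(items, test_size=TEST_SIZE, random_state=RANDOM_SEED)

print(len(train_a), len(test_a))  # 8 2
print(test_a == test_b)           # True
```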

In [7]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(message):
    message = re.sub('[^a-zA-Z]', ' ', message).lower().strip().split()
    message = [word for word in message if word not in stop_words]
    return ' '.join(message)


### Text Preprocessing

**This step prepares the text for analysis. First, we download a list of common English stopwords (e.g., 'a', 'the', 'is') from NLTK. These words are usually uninformative and can be removed.**

**The `preprocess_text` function then performs the following actions on each message:**

1. **Replaces every character that is not an ASCII letter (punctuation, digits, symbols) with a space.**
2. **Converts all characters to lowercase.**
3. **Splits the text into individual words (tokens).**
4. **Removes all stopwords from the list of tokens.**
5. **Joins the cleaned words back into a single string.**
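
**The steps above can be traced on a single message. This sketch uses a small hand-picked stopword set (`demo_stop_words`, a hypothetical stand-in) so that it runs without downloading NLTK data; the notebook itself uses NLTK's full English list, and the sample message is invented.**

```python
import re

# Small hand-picked stopword set so the example runs without NLTK data;
# the notebook uses NLTK's full English list instead.
demo_stop_words = {"a", "the", "is", "you", "have", "been", "for", "now"}

def preprocess_demo(message, stop_words=demo_stop_words):
    # Steps 1-3: replace non-letters with spaces, lowercase, split into tokens.
    tokens = re.sub(r"[^a-zA-Z]", " ", message).lower().split()
    # Step 4: drop stopwords.
    tokens = [word for word in tokens if word not in stop_words]
    # Step 5: join the remaining tokens back into one string.
    return " ".join(tokens)

cleaned = preprocess_demo("WINNER!! You have been selected for a $1000 prize. Call 09061701461 now!")
print(cleaned)  # winner selected prize call
```

**Note that the digits and punctuation disappear entirely, so the phone number and the prize amount leave no trace in the cleaned text.**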

In [6]:
data = pd.read_csv('spam.csv', encoding='latin1')

data_renamed = data[['v1', 'v2']].rename(columns={"v1": "label", "v2": "message"})

### Data Loading and Cleaning

**We load the dataset using pandas. `pd.read_csv` decodes files as UTF-8 by default, so we pass `encoding='latin1'` to prevent potential encoding errors with this specific file.**

**The initial columns are named 'v1' and 'v2'. We keep only these two columns and rename them to `label` and `message` for better readability.**

In [None]:
data_renamed['message_len'] = data_renamed['message'].str.len()

avg = data_renamed.groupby(['label'])['message_len'].mean()
print(f"AVG:\n{avg}\n")

### Exploratory Data Analysis (EDA)

**Let's investigate whether spam and ham messages differ in length. We create a new column, `message_len`, to store the length of each message in characters.**

**By grouping the data by `label` and calculating the mean, we can see the average message length for each class.**

**The output shows that spam messages are, on average, significantly longer than ham messages. Message length is therefore a potentially useful signal, although the model below relies only on the message text.**
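
**A minimal sketch of the same group-and-average pattern on a four-row toy DataFrame, so that the result can be checked by hand. The messages here are hypothetical and are not rows from `spam.csv`.**

```python
import pandas as pd

# Hypothetical toy messages, not rows from spam.csv.
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "message": ["ok", "see you", "free prize now", "win cash"],
})

# .str.len() counts characters, including spaces.
toy["message_len"] = toy["message"].str.len()

# Mean length per class: ham = (2 + 7) / 2, spam = (14 + 8) / 2.
avg_len = toy.groupby("label")["message_len"].mean()
print(avg_len["ham"], avg_len["spam"])  # 4.5 11.0
```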

In [None]:
data_renamed['cleaned_message'] = data_renamed['message'].apply(preprocess_text)
print(f"\nDATA:\n{data_renamed[['message', 'cleaned_message']].head()}\n")

**Now, we apply our `preprocess_text` function to every message in the dataset to create a new `cleaned_message` column.**

In [None]:
X = data_renamed['cleaned_message']
y = data_renamed['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_SEED)

### Feature and Target Split

**We define our features (`X`) as the cleaned messages and our target (`y`) as the labels. Then, we split the data into training and testing sets using `train_test_split`.**

In [None]:
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

### Text Vectorization (TF-IDF)

**Machine learning models cannot work with raw text directly, so we need to convert the messages into numerical vectors. We use `TfidfVectorizer` for this, which weights each word by its "Term Frequency-Inverse Document Frequency" (TF-IDF). A word scores high when it occurs often in a given message but rarely across the rest of the collection, so the weight reflects how important the word is to that message.**

**Crucially, we `fit_transform` on the training data and only `transform` the test data to prevent data leakage.**
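
**A small sketch of why the fit/transform distinction matters, using hypothetical toy documents. The vocabulary is learned from the training documents only, so a word that appears only in the test document ("lottery" here) is simply ignored instead of influencing the features.**

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy splits, not rows from the real dataset.
train_docs = ["free prize call", "see lunch today"]
test_docs = ["free lottery call"]  # "lottery" never appears in training

demo_vectorizer = TfidfVectorizer()
train_matrix = demo_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = demo_vectorizer.transform(test_docs)        # reuses them unchanged

vocab = sorted(demo_vectorizer.vocabulary_)
print(vocab)               # ['call', 'free', 'lunch', 'prize', 'see', 'today']
print("lottery" in vocab)  # False
print(train_matrix.shape[1] == test_matrix.shape[1])  # True
```

**Both matrices share the same columns, which is what allows a model trained on one to score the other.**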

In [None]:
model = MultinomialNB()

model.fit(X_train_tfidf, y_train)

### Model Training

**We choose the Multinomial Naive Bayes classifier (`MultinomialNB`), a simple but effective algorithm for text classification.**

**The `model.fit()` command trains the model. During this process, the model learns the probability of each word appearing in spam versus ham messages based on the training data (`X_train_tfidf` and `y_train`).**
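
**To make "learning word probabilities" concrete, here is a sketch that trains on four hypothetical messages and inspects the fitted model. It reads `feature_log_prob_`, the per-class log probability that scikit-learn stores for each vocabulary word; the toy data and variable names are assumptions for illustration only.**

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set, not rows from the real dataset.
toy_messages = ["free prize now", "win free cash", "see you at lunch", "call me later"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_messages)
toy_model = MultinomialNB().fit(toy_features, toy_labels)

# feature_log_prob_ has one row per class (ordered as in classes_)
# and one column per vocabulary word.
classes = toy_model.classes_.tolist()
free_col = toy_vectorizer.vocabulary_["free"]
log_p_free_ham = toy_model.feature_log_prob_[classes.index("ham"), free_col]
log_p_free_spam = toy_model.feature_log_prob_[classes.index("spam"), free_col]

print(classes)                           # ['ham', 'spam']
print(log_p_free_spam > log_p_free_ham)  # True
```

**Because "free" occurs only in the toy spam messages, the model assigns it a higher probability under the spam class.**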

In [None]:
y_pred = model.predict(X_test_tfidf)

### Making Predictions

**Now that the model is trained, we use it to make predictions on the test data (`X_test_tfidf`), which it has never seen before. The results are stored in `y_pred`.**
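
**The same pattern applies to any new message: it must pass through the already fitted vectorizer before it reaches the model. The sketch below shows this on hypothetical toy data; in the notebook, a raw message would first go through `preprocess_text` as well.**

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real training split.
train_messages = ["free prize now", "win free cash", "see you at lunch", "call me later"]
train_labels = ["spam", "spam", "ham", "ham"]

demo_vectorizer = TfidfVectorizer()
demo_model = MultinomialNB().fit(demo_vectorizer.fit_transform(train_messages), train_labels)

# Unseen messages go through transform(), never fit_transform().
new_messages = ["free cash prize", "see you later"]
predictions = demo_model.predict(demo_vectorizer.transform(new_messages)).tolist()
print(predictions)  # ['spam', 'ham']
```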

In [None]:
# First way to check model accuracy (handmade, pure Python)
correct = sum(actual == predicted for actual, predicted in zip(y_test, y_pred))
accuracy_1 = correct / len(y_pred)
print(f"\nAccuracy 1: {accuracy_1}\n")

# Second way to check accuracy (handmade, better, with numpy)
matches = (y_test == y_pred)
accuracy_2 = np.sum(matches) / len(y_test)
print(f"\nAccuracy 2: {accuracy_2}\n")

# Third way to check accuracy (auto)
accuracy_3 = model.score(X_test_tfidf, y_test)
print(f"\nAccuracy 3: {accuracy_3}\n")

**As we can see, all three methods yield the exact same accuracy score. This confirms our understanding of how the metric is calculated.**

In [None]:
# 4th way to check accuracy (and other info)
class_rep = classification_report(y_test, y_pred)
print(f"\nClassification Report:\n{class_rep}\n")

**While accuracy is a good starting point, it can be misleading, especially with imbalanced datasets. For a more detailed view of the model's performance, we use a `classification_report`.**

**It provides key metrics like `precision`, `recall`, and `f1-score` for each class.**
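
**A short sketch of why accuracy alone can mislead, using an invented imbalanced test set of 90 ham and 10 spam messages. A "model" that labels everything as ham still reaches 90% accuracy while catching no spam at all, which is exactly what recall exposes.**

```python
# Hypothetical imbalanced test set: 90 ham, 10 spam.
y_true = ["ham"] * 90 + ["spam"] * 10
# A useless "model" that labels every message as ham.
y_always_ham = ["ham"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_always_ham)) / len(y_true)

# Recall for spam = spam messages caught / all actual spam messages.
spam_total = sum(t == "spam" for t in y_true)
spam_caught = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_always_ham))
spam_recall = spam_caught / spam_total

print(accuracy)     # 0.9
print(spam_recall)  # 0.0
```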

In [None]:
conf_mat = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{conf_mat}")

### Confusion Matrix

**The confusion matrix gives us a clear breakdown of the model's predictions versus the actual labels.**

**From this matrix, treating 'spam' as the positive class, we can see:**

- **False Positives (FP): 0.** The model never incorrectly classified a 'ham' message as 'spam'. This is an excellent result.
- **False Negatives (FN): 36.** The model missed 36 spam messages, classifying them as 'ham'.

**This explains the `recall` score of 0.76 for the 'spam' class: the model successfully identified 76% of all actual spam messages. While the overall accuracy is high, there is still room for improvement in catching more spam.**
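
**The link between the matrix cells and the reported metrics can be sketched on a tiny hypothetical example. The labels below are invented so that the counts are easy to verify by hand; they are not the notebook's results.**

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen so the counts are easy to verify by hand.
y_true = ["ham", "ham", "ham", "spam", "spam", "spam", "spam"]
y_hat = ["ham", "ham", "ham", "spam", "spam", "spam", "ham"]

# With labels ordered ['ham', 'spam'], rows are actual classes and columns
# are predicted classes, so ravel() yields tn, fp, fn, tp for 'spam'.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=["ham", "spam"]).ravel()

spam_precision = tp / (tp + fp)  # share of predicted spam that really is spam
spam_recall = tp / (tp + fn)     # share of actual spam that was caught

print(tn, fp, fn, tp)               # 3 0 1 3
print(spam_precision, spam_recall)  # 1.0 0.75
```

**As in the notebook's result, zero false positives gives perfect precision for spam, while every false negative lowers recall.**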