# **Spam Filtering Project**  
*Supervised Learning with Text Classification*  

**Group Members:**  
- Tamim Aleid (412111473)   
- Mohammed Saad (412112496)  

**Dataset:**  
[SMS Spam Collection Dataset](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)  
- Features: message (text), label (ham/spam)
- Goal: Build a machine learning model to classify messages as spam or not spam


**Key Steps:**  
1. Data Loading & Exploration
2. Text Preprocessing & Vectorization (TF-IDF)  
3. Model Training using Logistic Regression
4. Evaluation & Spam Prediction on New Messages

**Library Imports**

In [None]:
# This project uses Natural Language Processing (NLP) and Logistic Regression to detect spam messages.

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


**Load the Dataset**

In [None]:
# Load the SMS Spam Collection Dataset from a tab-separated copy hosted on GitHub
# The file has no header row, so the two columns are named here:
# 'label' (ham/spam) and 'message' (text content)
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_table(url, header=None, names=['label', 'message'])

# Display the first few records
df.head()


| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


**Basic Dataset Exploration**

In [None]:
# Display dataset shape and structure
print("Dataset Shape:", df.shape)
print("\nDataset Info:")
df.info()

# Show class distribution (how many ham vs spam)
print("\nClass Distribution:")
print(df['label'].value_counts())


Dataset Shape: (5572, 2)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB

Class Distribution:
label
ham     4825
spam     747
Name: count, dtype: int64
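
The counts above show a clear class imbalance, which matters when reading accuracy later. As a small sketch, the shares can be worked out directly from the printed numbers (these are the counts before duplicate removal; the dictionary name `class_counts` is introduced here only for illustration):

```python
# Counts copied from the class distribution printed above.
class_counts = {"ham": 4825, "spam": 747}
total = sum(class_counts.values())

for label, count in class_counts.items():
    print(f"{label}: {count} messages ({count / total:.1%})")

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(class_counts.values()) / total
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Roughly 13% of the messages are spam, so a model that always answers "ham" would already reach about 87% accuracy on this distribution. Per-class precision and recall are therefore more informative than accuracy alone.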


**Encode Target Labels**

In [None]:
# Convert textual labels to binary values: 'ham' = 0, 'spam' = 1
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})

# Check the updated dataframe
df.head()


| | label | message | label_num |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


**Clean and Prepare Data**

In [None]:
# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Remove duplicate rows (same label and same message), if any
df.drop_duplicates(inplace=True)

# Confirm shape after cleaning
print("Dataset shape after removing duplicates:", df.shape)


Missing values:
 label        0
message      0
label_num    0
dtype: int64
Dataset shape after removing duplicates: (5169, 3)


**Text Feature Extraction using TF-IDF**

In [None]:
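# Transform the 'message' text data into numerical features using TF-IDF,
# ignoring common English stop words
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the text data
# Note: the vectorizer is fitted on the full dataset before the train/test split,
# so test-set vocabulary and IDF statistics influence the features; fitting on
# the training portion only would give a stricter evaluation
X = vectorizer.fit_transform(df['message'])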

# Assign target variable
y = df['label_num']


**Split Dataset into Train and Test Sets**

In [None]:
# Split data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)


Training set size: (4135, 8444)
Testing set size: (1034, 8444)
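
One possible variation, sketched here under assumptions rather than taken from the notebook: split the raw text first, keep the ham/spam ratio equal in both parts with `stratify`, and wrap the vectorizer and classifier in a scikit-learn pipeline so that TF-IDF is fitted on the training messages only. The toy messages, labels, and the name `spam_pipeline` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (0 = ham, 1 = spam), only to keep the sketch runnable.
messages = [
    "win a free prize now", "free cash claim your reward",
    "urgent call to claim free entry", "you won a free holiday",
    "free ringtone text now to win", "are we still meeting for lunch",
    "see you at the office tomorrow", "can you call me after class",
    "thanks for dinner last night", "running late be there soon",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw text; stratify keeps the class ratio the same in both parts.
msg_train, msg_test, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels)

# The pipeline fits TF-IDF on the training messages only, then the classifier.
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"), LogisticRegression())
spam_pipeline.fit(msg_train, y_tr)

print("Test labels:     ", y_te)
print("Test predictions:", list(spam_pipeline.predict(msg_test)))
```

With this layout the pipeline accepts raw strings directly, so a new message can be passed to `spam_pipeline.predict([...])` without a separate `transform` call.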


**Train Logistic Regression Model**

In [None]:
# Initialize the logistic regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(X_train, y_train)

print("Model training completed.")


Model training completed.


**Evaluate Model Performance**

In [None]:
# Predict labels on the test set
y_pred = model.predict(X_test)

# Display performance metrics
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))


Confusion Matrix:
 [[892   2]
 [ 58  82]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      1.00      0.97       894
           1       0.98      0.59      0.73       140

    accuracy                           0.94      1034
   macro avg       0.96      0.79      0.85      1034
weighted avg       0.94      0.94      0.94      1034

Accuracy Score: 0.941972920696325
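
The spam-class scores in the report can be reproduced by hand from the confusion matrix, which helps in reading them. A short sketch, using only the four counts printed above (rows are actual classes, columns are predicted classes):

```python
# Counts copied from the confusion matrix above (0 = ham, 1 = spam).
tn, fp = 892, 2   # actual ham:  predicted ham, predicted spam
fn, tp = 58, 82   # actual spam: predicted ham, predicted spam

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Spam precision: {precision:.2f}")
print(f"Spam recall:    {recall:.2f}")
print(f"Spam F1-score:  {f1:.2f}")
print(f"Accuracy:       {accuracy:.4f}")
```

The model rarely flags ham as spam (2 false positives) but misses 58 of the 140 spam messages in the test set, which is why spam recall is only 0.59 even though overall accuracy is 0.94.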


**Predict on a Custom Message**

In [None]:
# Predict whether a custom message is spam or not
sample_message = ["Congratulations! You've won a free cruise. Call now to claim your prize."]
sample_transformed = vectorizer.transform(sample_message)
prediction = model.predict(sample_transformed)

# Output the prediction result
print("Prediction:", "Spam" if prediction[0] == 1 else "Ham")


Prediction: Spam
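
Since `predict` only returns the final label, it can be useful to look at the estimated spam probability as well, and optionally apply a cutoff other than 0.5. The sketch below is self-contained and trains on a few made-up messages; the `toy_` names and the `threshold` value are assumptions for illustration. In the notebook, the same `predict_proba` call could be applied to `model` and `sample_transformed`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data (0 = ham, 1 = spam).
toy_messages = [
    "win a free prize now", "free cash claim your reward",
    "you won a free holiday call now", "are we still meeting for lunch",
    "see you at the office tomorrow", "thanks for dinner last night",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

sample = ["Congratulations! You've won a free cruise. Call now to claim your prize."]

# predict_proba returns one column per class, ordered as in toy_model.classes_.
spam_column = list(toy_model.classes_).index(1)
spam_probability = toy_model.predict_proba(
    toy_vectorizer.transform(sample))[0, spam_column]

threshold = 0.4  # hypothetical cutoff; predict() behaves like a cutoff of 0.5
print(f"P(spam) = {spam_probability:.3f}")
print("Prediction:", "Spam" if spam_probability >= threshold else "Ham")
```

Lowering the cutoff trades some precision for higher recall, which may be worth exploring given that spam recall on the test set is much lower than spam precision.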
