
# 📧 Email/SMS Spam Detection Using Machine Learning

Welcome! In this notebook we build a spam detection system using machine learning. Each step is explained as we go, so the notebook stays beginner-friendly.

**Dataset:** `spam.csv` (uploaded in the same directory)  
**Goal:** Classify SMS or email messages as **spam** or **ham (not spam)** using a machine learning model.

Let's get started! 🚀



## 1️⃣ Importing Required Libraries

Let's begin by importing all the essential libraries for data handling, text processing, model training, and evaluation.


In [1]:

# Importing Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import joblib



## 2️⃣ Loading the Dataset

We'll load the dataset `spam.csv`. Make sure it is in the same directory as this notebook.


In [2]:

# Load Dataset
df = pd.read_csv('spam.csv', encoding='latin-1')
df.head()


| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |
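
Since only the first two columns carry data for almost every row, one option is to load just those columns and name them up front. The sketch below uses a tiny in-memory stand-in for `spam.csv` (the rows and the `sample_df` name are made up for illustration); with the real file you would pass `'spam.csv'` and keep `encoding='latin-1'`.

```python
import io

import pandas as pd

# Hypothetical stand-in for spam.csv with the same header layout:
# two named columns followed by three unnamed, mostly empty ones.
sample_csv = io.StringIO(
    "v1,v2,,,\n"
    "ham,Ok lar... Joking wif u oni...,,,\n"
    "spam,Free entry in 2 a wkly comp,,,\n"
)

# usecols keeps only the label and message columns; rename gives them clearer names.
sample_df = pd.read_csv(sample_csv, usecols=["v1", "v2"]).rename(
    columns={"v1": "label", "v2": "text"}
)

print(sample_df)
```

This is an alternative to dropping the extra columns later; the rest of the notebook does not depend on it.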



## 3️⃣ Data Exploration

Let's check the dataset structure, inspect column names, and understand what we have.


In [3]:

# Explore Data
print("Columns in dataset:", df.columns)
df.info()
df.describe()


Columns in dataset: Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | MK17 92H. 450Ppw 16" | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |
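
Two things stand out in the summary: `ham` accounts for 4825 of the 5572 rows (about 87%), and only 5169 of the 5572 messages are unique, so some texts repeat. A minimal sketch of how to check both directly, shown here on a made-up five-row frame (`toy` is a hypothetical name) rather than the real data:

```python
import pandas as pd

# Made-up rows that mimic the v1/v2 layout of the dataset.
toy = pd.DataFrame(
    {
        "v1": ["ham", "ham", "spam", "ham", "ham"],
        "v2": [
            "see you soon",
            "call me later",
            "win a free prize now",
            "call me later",  # repeated message
            "ok",
        ],
    }
)

# Share of each class; normalize=True returns fractions instead of counts.
class_share = toy["v1"].value_counts(normalize=True)

# Number of rows whose message text already appeared earlier in the frame.
n_duplicates = int(toy.duplicated(subset="v2").sum())

print(class_share)
print("Duplicate messages:", n_duplicates)
```

On the real dataset the same two calls would be applied to `df['v1']` and `df['v2']`. The class imbalance matters later, because accuracy alone can look good even when many spam messages are missed.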



## 4️⃣ Data Preprocessing

We'll:
- Rename `v1` and `v2` to `label` and `text`, and drop the three mostly empty `Unnamed` columns
- Encode labels ('ham' → 0, 'spam' → 1)
- Check for missing values


In [4]:

# Preprocessing
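# Give the columns descriptive names (adjust the list if your file has a different layout)
df.columns = ['label', 'text', 'Unnamed_2', 'Unnamed_3', 'Unnamed_4']
# Keep only the two useful columns; .copy() avoids pandas' SettingWithCopyWarning below
df = df[['label', 'text']].copy()

# Encode labels: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})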

# Check for missing values
print(df.isnull().sum())

df.head()


label    0
text     0
dtype: int64


| | label | text |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
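
One detail worth knowing about `Series.map` with a dictionary: any value that is not a key in the dictionary becomes `NaN`. The dataset here only contains `ham` and `spam`, so nothing is lost, but a differently formatted file could contain variants such as `"Ham "`. The sketch below, on made-up rows, shows the effect and one possible way to normalize labels before encoding (the names `toy`, `raw_encoded` and `encoded` are illustrative):

```python
import pandas as pd

label_map = {"ham": 0, "spam": 1}

# Made-up labels, one of them with different casing and a trailing space.
toy = pd.DataFrame({"label": ["ham", "spam", "Ham "], "text": ["a", "b", "c"]})

# Mapping the raw labels silently turns the unexpected variant into NaN.
raw_encoded = toy["label"].map(label_map)
print("Unmapped without cleaning:", int(raw_encoded.isna().sum()))

# Stripping whitespace and lowercasing first makes the mapping complete.
encoded = toy["label"].str.strip().str.lower().map(label_map)
unmapped = int(encoded.isna().sum())
print("Unmapped after cleaning:", unmapped)
```

Running the missing-value check after the encoding step, as the cell above does, catches this case too.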



## 5️⃣ Splitting the Data

We'll split the dataset into **training** (80%) and **testing** (20%) sets so the model is evaluated on messages it has not seen during training.


In [5]:

# Split Data
X = df['text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")


Training samples: 4457
Testing samples: 1115
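
Because spam is the minority class, a purely random split can shift the spam share between the training and testing sets. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below demonstrates this on synthetic labels (100 made-up messages, 20% spam); it is an optional variation, not what the cell above does, and using it on the real data would change the numbers reported later.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic data: 100 placeholder messages, the first 20 labelled spam (1).
texts = [f"message {i}" for i in range(100)]
labels = [1 if i < 20 else 0 for i in range(100)]

# stratify=labels preserves the 80/20 ham/spam ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Train class counts:", Counter(y_tr))
print("Test class counts:", Counter(y_te))
```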



## 6️⃣ Feature Extraction (TF-IDF)

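We'll use **TF-IDF** (Term Frequency-Inverse Document Frequency) to convert text into numerical features suitable for machine learning. Each word is weighted by how often it appears in a message and down-weighted if it appears in many messages.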


In [6]:

# Feature Extraction (TF-IDF)
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"TF-IDF shape: {X_train_tfidf.shape}")


TF-IDF shape: (4457, 7472)



## 7️⃣ Model Training

We'll train a **Logistic Regression** model to classify messages as spam or ham.


In [7]:

# Model Training
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predictions
y_pred = model.predict(X_test_tfidf)
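
The vectorizer and the classifier always have to be used together, in the same order. scikit-learn's `make_pipeline` can bundle them into a single object so that raw text goes in and predictions come out. The following is a minimal sketch on a handful of made-up messages, not a replacement for the cells above; the `spam_pipeline` name and the `max_iter=1000` setting are choices made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = spam, 0 = ham.
train_texts = [
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent winner call to claim reward",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you send me the notes",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the model in one call.
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
spam_pipeline.fit(train_texts, train_labels)

# Raw strings can be passed directly; vectorization happens inside the pipeline.
preds = spam_pipeline.predict(["free prize waiting for you"])
print(preds)
```

A pipeline also means only one object needs to be saved and loaded later.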



## 8️⃣ Model Evaluation

Let's evaluate the model using **accuracy** and a **classification report**.


In [8]:

# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print("Classification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.95
Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       965
           1       0.97      0.67      0.79       150

    accuracy                           0.95      1115
   macro avg       0.96      0.83      0.88      1115
weighted avg       0.95      0.95      0.95      1115
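
Note that the 0.95 accuracy hides an imbalance between the classes: spam recall is 0.67, so roughly one in three spam messages in the test set was classified as ham. A confusion matrix makes this kind of error visible. The sketch below uses ten made-up labels (not the notebook's predictions) to show how precision and recall follow from the four counts.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")

# The manual values agree with scikit-learn's helpers.
assert abs(precision - precision_score(y_true, y_hat)) < 1e-9
assert abs(recall - recall_score(y_true, y_hat)) < 1e-9
```

On the real data, `confusion_matrix(y_test, y_pred)` would give the corresponding counts for the model trained above.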




## 9️⃣ Save the Model (Optional)

We save the trained model and the fitted vectorizer so they can be reused later. Both files are needed to classify new messages.


In [9]:

# Save Model (Optional)
joblib.dump(model, 'spam_classifier.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
print("Model and vectorizer saved!")


Model and vectorizer saved!
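
To use the saved files later, load both with `joblib.load` and apply them in the same order as during training: vectorize first, then predict. The sketch below is self-contained, so it trains a tiny model on made-up messages and writes to a temporary directory instead of touching the real `.pkl` files; in practice you would load `spam_classifier.pkl` and `tfidf_vectorizer.pkl` directly.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set: 1 = spam, 0 = ham.
texts = ["win a free prize now", "claim your free cash", "see you at lunch", "call me later"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer().fit(texts)
clf = LogisticRegression().fit(vec.transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "spam_classifier.pkl")
    vec_path = os.path.join(tmp, "tfidf_vectorizer.pkl")

    # Save, then load again, as a later session would.
    joblib.dump(clf, model_path)
    joblib.dump(vec, vec_path)
    loaded_clf = joblib.load(model_path)
    loaded_vec = joblib.load(vec_path)

new_messages = ["win a free prize now", "see you later"]

# Use transform (not fit_transform) so the saved vocabulary is reused.
original = clf.predict(vec.transform(new_messages))
restored = loaded_clf.predict(loaded_vec.transform(new_messages))

print("Predictions match:", bool((original == restored).all()))
```

Only load `.pkl` files from sources you trust, since unpickling can execute arbitrary code.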



## ✅ Conclusion

Congratulations! 🎉 You've successfully built a spam detection model using machine learning. This project helps you understand data preprocessing, feature extraction, model training, and evaluation.

Feel free to experiment with other models (like Naive Bayes, SVM, Random Forest) or fine-tune hyperparameters for better performance, in particular to raise the spam recall of 0.67 reported above.

Happy Learning! 🚀
