# Email Spam Classifier

## Project Overview
This notebook trains a logistic regression model on TF-IDF features to classify email messages as spam or ham (legitimate) based on their text content.

## Dataset
- **Features**: Email message text content
- **Target**: Binary classification (0 = Spam, 1 = Ham/Legitimate)
- **Preprocessing**: TF-IDF vectorization with stop word removal
- **Algorithm**: Logistic Regression for binary text classification
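
As a compact illustration of how the pieces listed above fit together, the sketch below chains TF-IDF vectorization and logistic regression with scikit-learn's `make_pipeline`. The toy corpus and the names `toy_messages`, `toy_labels`, and `toy_pipeline` are assumptions made for this example only; they are not part of the notebook's dataset or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not the notebook's dataset)
toy_messages = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free prize today",
    "are we still meeting for lunch",
    "see you at home tonight",
    "call me when you get home",
]
toy_labels = [0, 0, 0, 1, 1, 1]  # same encoding as the notebook: 0 = spam, 1 = ham

# Vectorizer and classifier chained so that raw text goes in and labels come out
toy_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
toy_pipeline.fit(toy_messages, toy_labels)

toy_prediction = toy_pipeline.predict(["free prize"])
print(toy_prediction)
```

The notebook itself keeps the vectorizer and the model as separate objects, which makes the intermediate feature matrix easier to inspect.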

In [52]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


## 1. Data Loading and Initial Exploration

In [None]:
df = pd.read_csv("data/mail_data.csv")

In [54]:
print(df)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


## 2. Data Preprocessing
### 2.1 Handling Missing Values

In [None]:
# Replace null values with empty strings to ensure clean text processing
# This prevents errors during text feature extraction
data = df.where(pd.notnull(df), '')
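
For comparison, the same replacement can be written with `fillna`. The sketch below uses a hypothetical two-row frame (`toy_df`, not the notebook's data) to show that both forms give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing message
toy_df = pd.DataFrame({"Category": ["ham", "spam"], "Message": ["hello", np.nan]})

# Form used in the notebook: keep non-null cells, replace the rest with ''
cleaned_where = toy_df.where(pd.notnull(toy_df), "")

# Equivalent, shorter form
cleaned_fillna = toy_df.fillna("")

print(cleaned_where.equals(cleaned_fillna))
```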

### 2.2 Data Exploration

In [None]:
# Display the first few rows of the cleaned data
data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [57]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [None]:
# Display dataset dimensions (rows, columns)
data.shape

(5572, 2)

### 2.3 Label Encoding

In [59]:
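# Convert categorical labels to numerical values for machine learning
# spam -> 0 (negative class)
# ham -> 1 (positive class, legitimate emails)
data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1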
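
An alternative way to express the same encoding is a dictionary passed to `Series.map`, which also yields an integer column directly. This is a sketch on a hypothetical three-element series (`toy_categories`), not a change to the notebook's approach.

```python
import pandas as pd

# Hypothetical labels standing in for the Category column
toy_categories = pd.Series(["ham", "spam", "ham"], name="Category")

# Same encoding as the notebook: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
encoded = toy_categories.map(label_map).astype(int)

print(encoded.tolist())  # [1, 0, 1]
```

With this form the later `astype(int)` conversion of the targets would not be needed, because the column never holds mixed object values.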

### 2.4 Feature-Target Separation

In [None]:
# Separate features (email messages) from target labels
# X contains the email text content
X = data['Message']

# Y contains the binary labels (0 = spam, 1 = ham)
Y = data['Category']

### 2.5 Data Verification

In [None]:
# Display the feature data (email messages) to verify separation
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [None]:
# Display the target labels to verify encoding (0 = spam, 1 = ham)
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


## 3. Data Splitting
### 3.1 Train-Test Split

In [None]:
# Split data into training and testing sets (80% train, 20% test)
# random_state ensures reproducible results
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

### 3.2 Data Shape Verification

In [None]:
# Verify the shapes of the feature sets
print(X.shape)        # Original feature set
print(X_train.shape)  # Training features
print(X_test.shape)   # Testing features

(5572,)
(4457,)
(1115,)


In [None]:
# Verify the shapes of the target sets
print(Y.shape)        # Original target set
print(Y_train.shape)  # Training targets
print(Y_test.shape)   # Testing targets

(5572,)
(4457,)
(1115,)
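
The split above is purely random. If the spam and ham classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both subsets. The sketch below assumes a hypothetical 20-message dataset with 20% spam; the names `toy_X`, `toy_Y`, and the `_tr`/`_te` variables are illustrative only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 4 spam (0) and 16 ham (1) messages
toy_X = [f"message {i}" for i in range(20)]
toy_Y = [0] * 4 + [1] * 16

# stratify=toy_Y preserves the 20% spam share in both subsets
X_tr, X_te, Y_tr, Y_te = train_test_split(
    toy_X, toy_Y, test_size=0.25, random_state=3, stratify=toy_Y
)

print(len(Y_te), Y_te.count(0))  # 5 test messages, 1 of them spam
```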


## 4. Feature Extraction
### 4.1 TF-IDF Vectorization

In [None]:
# Convert text data to numerical features using TF-IDF (Term Frequency-Inverse Document Frequency)
# min_df=1: Include terms that appear in at least 1 document
# stop_words='english': Remove common English stop words
# lowercase=True: Convert all text to lowercase
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Fit the vectorizer on the training data only, then transform both sets
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# Convert target labels to integer type for classification
Y_train = Y_train.astype(int)
Y_test = Y_test.astype(int)
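
The vectorizer is fitted on the training messages only, so the vocabulary never sees test text. A small sketch of what that implies, using hypothetical two-document training data (`toy_train`) and a one-document test set (`toy_test`): a word that appears only in the test set is simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test messages
toy_train = ["free prize money", "lunch at home"]
toy_test = ["free lunch tomorrow"]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# The vocabulary is learned from the training messages only
toy_train_features = toy_vectorizer.fit_transform(toy_train)

# The test message is mapped onto that fixed vocabulary
toy_test_features = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))  # "at" is dropped as a stop word
print(toy_test_features.shape)             # one row, one column per training term
print(toy_test_features.nnz)               # "tomorrow" is unseen, so only 2 entries
```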


### 4.2 Feature Verification

In [None]:
# Display original training text data
print(X_train)

3075                  Don know. I did't msg him recently.
1787    Do you know why god created gap between your f...
1614                         Thnx dude. u guys out 2nite?
4304                                      Yup i'm free...
3266    44 7732584351, Do you want a New Nokia 3510i c...
                              ...                        
789     5 Free Top Polyphonic Tones call 087018728737,...
968     What do u want when i come back?.a beautiful n...
1667    Guess who spent all last night phasing in and ...
3321    Eh sorry leh... I din c ur msg. Not sad alread...
1688    Free Top ringtone -sub to weekly ringtone-get ...
Name: Message, Length: 4457, dtype: object


In [None]:
# Display TF-IDF vectorized features (sparse matrix representation)
print(X_train_features)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 34775 stored elements and shape (4457, 7431)>
  Coords	Values
  (0, 2329)	0.38783870336935383
  (0, 3811)	0.34780165336891333
  (0, 2224)	0.4131033779433779
  (0, 4456)	0.4168658090846482
  (0, 5413)	0.6198254967574347
  (1, 3811)	0.17419952275504036
  (1, 3046)	0.25037127926135183
  (1, 1991)	0.33036995955537024
  (1, 2956)	0.33036995955537024
  (1, 2758)	0.32264078859437995
  (1, 1839)	0.2784903590561455
  (1, 918)	0.22871581159877652
  (1, 2746)	0.33982970028640835
  (1, 2957)	0.33982970028640835
  (1, 3325)	0.31610586766078863
  (1, 3185)	0.2969448295769459
  (1, 4080)	0.18880584110891166
  (2, 6601)	0.6056811524587516
  (2, 2404)	0.45287711070606745
  (2, 3156)	0.4107239318312698
  (2, 407)	0.509272536051008
  (3, 7414)	0.8100020912469564
  (3, 2870)	0.5864269879324768
  (4, 2870)	0.41872147309323754
  (4, 487)	0.2899118421746198
  :	:
  (4454, 2855)	0.472106650836418
  (4454, 2246)	0.472106650836418
  (4455, 4456)	0.24

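## 5. Model Training and Evaluation

In [69]:
# Initialize the logistic regression classifier with default settings
model = LogisticRegression()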

In [None]:
# Train the logistic regression model on the TF-IDF training features
model.fit(X_train_features, Y_train)

In [72]:
# Predict on the training data and measure training accuracy
prediction_on_train_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_train_data)

In [None]:
print('Acc on training data : ', accuracy_on_training_data)

Acc on training data :  0.9676912721561588


In [None]:
# Predict on the held-out test data and measure test accuracy
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

In [79]:
print('acc on test data : ', accuracy_on_test_data)

acc on test data :  0.9668161434977578
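
Accuracy alone can hide weak spam detection when one class dominates, so a confusion matrix and per-class precision and recall are a common complement. The sketch below uses hypothetical label vectors (`toy_true`, `toy_pred`), not the notebook's predictions, and treats spam (label 0) as the class of interest via `pos_label=0`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = spam, 1 = ham
toy_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true labels, columns are predicted labels, both ordered [spam, ham]
matrix = confusion_matrix(toy_true, toy_pred, labels=[0, 1])

# Precision and recall for the spam class
spam_precision = precision_score(toy_true, toy_pred, pos_label=0)
spam_recall = recall_score(toy_true, toy_pred, pos_label=0)

print(matrix)
print(spam_precision, spam_recall)
```

In this toy case the accuracy is 0.9, yet one of the three spam messages is missed, which only the recall figure makes visible.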


In [None]:
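# Classify a new message with the fitted vectorizer and model
input_your_mail = ["hi u win prize contact us "]

# Reuse the vectorizer fitted on the training data (transform only, no refit)
input_data_features = feature_extraction.transform(input_your_mail)

prediction = model.predict(input_data_features)
print(prediction)

# Labels follow the encoding used throughout: 1 = ham, 0 = spam
if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')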


[0]
Spam mail
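
The prediction step can also be wrapped in a small helper that reports the estimated spam probability alongside the label. `classify_message` is a hypothetical function introduced for this sketch, and the vectorizer and classifier are fitted on a toy corpus rather than on the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_message(text, vectorizer, classifier):
    """Return (label, spam_probability) for one message (hypothetical helper)."""
    features = vectorizer.transform([text])
    # classes_ gives the column order of predict_proba; spam is encoded as 0
    spam_column = list(classifier.classes_).index(0)
    spam_probability = float(classifier.predict_proba(features)[0, spam_column])
    label = "Ham mail" if classifier.predict(features)[0] == 1 else "Spam mail"
    return label, spam_probability


# Hypothetical toy corpus: 0 = spam, 1 = ham
toy_messages = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free prize today",
    "are we still meeting for lunch",
    "see you at home tonight",
    "call me when you get home",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
toy_classifier = LogisticRegression()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

label, spam_probability = classify_message("win a free prize", toy_vectorizer, toy_classifier)
print(label, round(spam_probability, 3))
```

A probability lets the caller choose a stricter or looser threshold than the default 0.5 used by `predict`.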
