# Project 2: Spam Email Classification

## Problem Statement: Develop a machine learning model to classify emails as spam or non-spam based on their content and metadata.
### Student Name:- SUMAN RAKSHIT 
### CSI ID:- CT-CSI23/DS0605

#### Objectives of this project
- Classify emails as spam or non-spam based on their content.

# Data Loading

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('spam.csv', encoding='ISO-8859-1')

In [3]:
df

| | category | message | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | no_spam | Go until jurong point, crazy.. Available only ... | | | |
| 1 | no_spam | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | no_spam | U dun say so early hor... U c already then say... | | | |
| 4 | no_spam | Nah I don't think he goes to usf, he lives aro... | | | |
| ... | ... | ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | | | |
| 5568 | no_spam | Will Ì_ b going to esplanade fr home? | | | |
| 5569 | no_spam | Pity, * was in mood for that. So...any other s... | | | |
| 5570 | no_spam | The guy did some bitching but I acted like i'd... | | | |


# Data Preprocessing

In [4]:
# Drop the three unnamed trailing columns
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

In [5]:
df.sample(5)

| | category | message |
|---|---|---|
| 1332 | no_spam | It's ok lar. U sleep early too... Nite... |
| 5356 | no_spam | Tell me something. Thats okay. |
| 2832 | spam | You've won tkts to the EURO2004 CUP FINAL or å... |
| 5436 | no_spam | Mode men or have you left. |
| 5310 | no_spam | yeah, that's what I was thinking |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  5572 non-null   object
 1   message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
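
Before splitting, it can help to check how the two labels are distributed, since a skewed ratio changes how accuracy should be read and is the motivation for the resampling step later on. A minimal sketch, using a small made-up frame (`toy_df` and its rows are hypothetical stand-ins, not the contents of `spam.csv`):

```python
import pandas as pd

# Hypothetical stand-in for the real data; only the column names match the notebook.
toy_df = pd.DataFrame({
    "category": ["no_spam", "no_spam", "no_spam", "spam"],
    "message": ["ok lar", "see you at home", "call me later", "win a free prize"],
})

# Absolute and relative frequency of each label
label_counts = toy_df["category"].value_counts()
label_share = toy_df["category"].value_counts(normalize=True)

print(label_counts)
print(label_share)
```

On the real frame the same two calls would be `df['category'].value_counts()` and `df['category'].value_counts(normalize=True)`.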


# Splitting the Data and Model Training

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Split the data into features (emails) and labels (spam or non-spam)
X = df['message']
y = df["category"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text as token counts, dropping English stop words (no stemming is applied)
vectorizer = CountVectorizer(stop_words="english")
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
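
The split between `fit_transform` on the training messages and `transform` on the test messages matters: the vocabulary is learned from the training set only, so words that first appear in the test set are ignored rather than leaking into the features. A small sketch of that behaviour, with made-up messages (`train_docs`, `test_docs` and `toy_vectorizer` are hypothetical names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from spam.csv
train_docs = ["free entry to win a cup", "are you coming home"]
test_docs = ["win a free holiday"]

toy_vectorizer = CountVectorizer(stop_words="english")
train_counts = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_counts = toy_vectorizer.transform(test_docs)        # reuses it, learns nothing new

print(sorted(toy_vectorizer.vocabulary_))
print(train_counts.shape, test_counts.shape)
print(test_counts.toarray())
```

Both matrices have the same number of columns, and "holiday" gets no column because it never occurred in the training messages.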


# Converting the Messages into Numeric Values Using TfidfVectorizer

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
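
TF-IDF reweights the raw counts so that terms occurring in many messages contribute less than rare ones. The sketch below works through the weighting by hand, following the formula documented for scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`). It is a hand-rolled illustration on hypothetical pre-tokenized messages, not the library's code, and `idf` and `tfidf_vector` are made-up helper names.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy messages
docs = [["free", "prize", "free"], ["call", "home"], ["free", "call"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    # Term count times idf, then scaled to unit Euclidean length
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

# "call" occurs in two documents, "home" in only one
weights = tfidf_vector(docs[1])
print(weights)
```

In the second toy message, "home" ends up with a larger weight than "call" because it appears in fewer documents.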

# Model Selection and Evaluation

In [9]:
# Baseline model: multinomial Naive Bayes trained on the CountVectorizer features
from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vectorized, y_train)

# Evaluation using test set
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = nb_classifier.predict(X_test_vectorized)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label="spam")
recall = recall_score(y_test, y_pred, pos_label="spam")
f1 = f1_score(y_test, y_pred, pos_label="spam")

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Accuracy: 0.9838565022421525
Precision: 0.9583333333333334
Recall: 0.92
F1 Score: 0.9387755102040817
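
With `pos_label="spam"`, precision is the share of messages flagged as spam that really are spam, and recall is the share of actual spam that was caught. F1 is the harmonic mean of the two, which can be checked directly against the printed values:

```python
# Precision and recall as printed by the baseline cell above
precision = 0.9583333333333334
recall = 0.92

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```

The result agrees with the printed F1 score up to floating-point rounding.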


# Model Tuning 

In [10]:
# Address class imbalance by oversampling the minority class in the training set with SMOTE (Synthetic Minority Over-sampling Technique)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tfidf, y_train)
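
SMOTE balances the classes by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and places a new point somewhere on the line between them. The sketch below illustrates only that interpolation idea in plain Python. It is a simplification, not imblearn's implementation, and `smote_like_samples` and the 2-D points are hypothetical.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority-class points
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points)
```

In the notebook the resampling is applied to the training features only, so the test set still reflects the original class ratio.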

In [11]:
# Retrain Naive Bayes on the SMOTE-resampled TF-IDF training features
nb_classifier_improved = MultinomialNB()
nb_classifier_improved.fit(X_train_resampled, y_train_resampled)

# Evaluate the retrained model on the original (not resampled) TF-IDF test set
y_pred_improved = nb_classifier_improved.predict(X_test_tfidf)

accuracy_improved = accuracy_score(y_test, y_pred_improved)
precision_improved = precision_score(y_test, y_pred_improved, pos_label="spam")
recall_improved = recall_score(y_test, y_pred_improved, pos_label="spam")
f1_improved = f1_score(y_test, y_pred_improved, pos_label="spam")

print("Improved Model - Accuracy:", accuracy_improved)
print("Improved Model - Precision:", precision_improved)
print("Improved Model - Recall:", recall_improved)
print("Improved Model - F1 Score:", f1_improved)


Improved Model - Accuracy: 0.9650224215246637
Improved Model - Precision: 0.8363636363636363
Improved Model - Recall: 0.92
Improved Model - F1 Score: 0.8761904761904761
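
Despite the variable names, this second run scores lower than the baseline on accuracy, precision and F1, while recall is unchanged. It also differs from the baseline in two ways at once (TF-IDF features instead of raw counts, plus SMOTE), so the drop cannot be attributed to either change alone. To see where the difference comes from, the confusion counts can be recovered from the printed metrics. This sketch assumes the test set holds 1115 messages (20% of 5572, rounded up); `counts_from_metrics` is a hypothetical helper, and the counts are inferred from the metrics, not printed by the notebook.

```python
def counts_from_metrics(accuracy, precision, recall, n_test):
    """Recover (TP, FP, FN, TN) for the positive class from summary metrics."""
    errors = round((1 - accuracy) * n_test)  # FP + FN
    # With P actual positives: TP = r * P, FN = (1 - r) * P, FP = TP * (1 - p) / p
    positives = round(errors / (recall * (1 - precision) / precision + (1 - recall)))
    tp = round(recall * positives)
    fn = positives - tp
    fp = errors - fn
    tn = n_test - tp - fp - fn
    return tp, fp, fn, tn

N_TEST = 1115  # assumption: ceil(0.2 * 5572)

baseline = counts_from_metrics(0.9838565022421525, 0.9583333333333334, 0.92, N_TEST)
smote_run = counts_from_metrics(0.9650224215246637, 0.8363636363636363, 0.92, N_TEST)
print("baseline  (TP, FP, FN, TN):", baseline)
print("SMOTE run (TP, FP, FN, TN):", smote_run)
```

Under these assumptions both runs catch the same number of spam messages, and the second run flags more legitimate messages as spam, which is what pulls precision down.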


# Saving The Model

In [12]:
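import pickle
# Use context managers so the file handles are closed after writing
with open('vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(tfidf_vectorizer, vectorizer_file)
with open('model.pkl', 'wb') as model_file:
    pickle.dump(nb_classifier_improved, model_file)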

In [13]:
# Load the TF-IDF vectorizer
with open('vectorizer.pkl', 'rb') as vectorizer_file:
    loaded_tfidf = pickle.load(vectorizer_file)

# Load the model
with open('model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)
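
Saving the vectorizer and the classifier as two files works, but the two must always be loaded and used as a matching pair. One possible alternative, which is not what this notebook does, is to wrap both in a single scikit-learn pipeline and pickle that one object. The sketch below uses hypothetical toy data and leaves out the SMOTE step:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, not taken from spam.csv
toy_messages = ["win a free prize now", "free entry to win cash",
                "are you coming home", "see you at lunch tomorrow"]
toy_labels = ["spam", "spam", "no_spam", "no_spam"]

# Vectorizer and classifier travel together as one estimator
toy_pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
toy_pipeline.fit(toy_messages, toy_labels)

# Round trip through pickle in memory (a file opened with 'wb'/'rb' works the same way)
restored = pickle.loads(pickle.dumps(toy_pipeline))

sample = ["free prize inside"]
print(toy_pipeline.predict(sample), restored.predict(sample))
```

The restored pipeline accepts raw text directly, so there is no separate `transform` call to forget at prediction time.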

# Prediction Using the Model

In [14]:
email = ["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times."]

In [15]:
# Use the vectorizer loaded from disk to confirm the saved artifact works
email_vectorized = loaded_tfidf.transform(email)

In [16]:
predict = loaded_model.predict(email_vectorized)
predict

array(['no_spam'], dtype='<U7')
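
A bare label hides how confident the classifier is. `MultinomialNB` also exposes `predict_proba`, whose columns follow the order of `classes_`. The sketch below wraps both in a small helper; `classify_with_confidence` and the toy data are hypothetical, standing in for the fitted objects from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the fitted notebook objects
toy_messages = ["win a free prize now", "free entry to win cash",
                "are you coming home", "see you at lunch tomorrow"]
toy_labels = ["spam", "spam", "no_spam", "no_spam"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

def classify_with_confidence(messages, vectorizer, model):
    """Return (label, probability of that label) for each message."""
    probabilities = model.predict_proba(vectorizer.transform(messages))
    results = []
    for row in probabilities:
        best = row.argmax()  # index of the most probable class
        results.append((str(model.classes_[best]), float(row[best])))
    return results

results = classify_with_confidence(["win a free prize", "coming home"],
                                   toy_vectorizer, toy_model)
print(results)
```

With the notebook's objects the call would be `classify_with_confidence(email, loaded_tfidf, loaded_model)`.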