### **Spam Email Classification using NLP and Machine Learning**
Spam detection uses machine learning to sort messages into two categories: "spam" (unwanted or junk messages) and "ham" (legitimate messages). The text of each message is converted into word counts, and a model is trained on labeled examples to learn which words and phrases tend to appear in spam. Once trained, the model can classify new messages automatically, which helps keep an inbox free of unwanted or harmful mail.

In [2]:
#import library
import numpy as np
import pandas as pd

In [3]:
#prepare and load the dataset
df=pd.read_csv("spam.csv", encoding="latin-1")

In [4]:
#Show the first five rows
df.head()

,class,message,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [5]:
#see the data column names
df.columns

Index(['class', 'message', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [6]:
#Drop the unnecessary columns
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

In [7]:
df.head()

,class,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [8]:
#Encode the labels as integers: ham -> 0, spam -> 1
df['class']=df['class'].map({'ham':0, 'spam':1})
df.head()

,class,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [9]:
df.columns

Index(['class', 'message'], dtype='object')

In [10]:
df.shape

(5572, 2)

In [13]:
#Feature Extraction
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
#Splitting Features and Labels
X=df['message']
y=df['class']

In [15]:
X.shape, y.shape

((5572,), (5572,))

In [16]:
#Checking Missing values
df.isnull()

,class,message
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
5567,False,False
5568,False,False
5569,False,False
5570,False,False


In [17]:
df.isnull().sum()

class      0
message    0
dtype: int64

In [18]:
#Text to Numerical Conversion using CountVectorizer
cv = CountVectorizer()  # Initialize CountVectorizer
X = cv.fit_transform(X)  # Transform text data into numerical format

In [19]:
#Splitting into Training and Testing Sets
from sklearn.model_selection import train_test_split
x_train, x_test,y_train, y_test=train_test_split(X,y, test_size=0.2, random_state=42)
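
In the cells above, `CountVectorizer` is fitted on every message before the split, so the vocabulary is learned partly from messages that later land in the test set. One possible alternative, sketched here on made-up data, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training part only. The toy `texts` and `labels` are hypothetical stand-ins for `df['message']` and `df['class']`, and `stratify` is an added assumption that is not used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['message'] and df['class']
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free reward", "urgent prize waiting call now",
    "are you coming to lunch", "see you at home tonight",
    "ok talk to you later", "can you call me after work",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the spam/ham ratio similar in both parts
text_train, text_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text only
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(text_train, lab_train)

# predict() vectorizes the raw test text with the training vocabulary
test_predictions = pipe.predict(text_test)
print(len(text_train), len(text_test))
```

With this layout, words that occur only in the test messages are ignored at prediction time, which is closer to how the model would meet genuinely new messages.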

In [20]:
#Shape of Training and Testing Data
x_train.shape, x_test.shape

((4457, 8672), (1115, 8672))

In [21]:
#Model Import and Initialization
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()

In [22]:
#Model Training
model.fit(x_train, y_train)

In [23]:
#Model Accuracy
model.score(x_test, y_test)

0.97847533632287

In [24]:
#Making Predictions:
msg = "Congratulations! You've won a lifetime supply of pizza!"
data = [msg]  # Wrapping the message in a list
vect = cv.transform(data).toarray()  # Converting to numerical format
my_prediction = model.predict(vect)  # Making the prediction

In [25]:
if my_prediction[0] == 1:
    print("Spam Alert: This message is spam!")
else:
    print("Ham: This message is genuine!")


Spam Alert: This message is spam!


In [26]:
#Model Evaluation: confusion matrix on the test set
from sklearn.metrics import confusion_matrix, classification_report
predictions = model.predict(x_test)
cm = confusion_matrix(y_test, predictions)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[952  13]
 [ 11 139]]
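
As a cross-check, the headline metrics can be recomputed by hand from the four numbers in the confusion matrix above. This is a small sketch; the variable names are hypothetical, and the values are copied from the printed output rather than computed from the model.

```python
import numpy as np

# Values copied from the confusion matrix printed above.
# Rows are true labels, columns are predicted labels (0 = ham, 1 = spam).
cm_values = np.array([[952, 13],
                      [11, 139]])

# For a 2x2 matrix in this layout, ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm_values.ravel()

accuracy = (tp + tn) / cm_values.sum()   # share of all messages classified correctly
spam_precision = tp / (tp + fp)          # of messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)             # of actual spam, how many were caught

print(f"accuracy={accuracy:.4f} precision={spam_precision:.4f} recall={spam_recall:.4f}")
```

Here 13 legitimate messages were flagged as spam and 11 spam messages slipped through. The recomputed accuracy agrees with the value returned by `model.score`.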


In [27]:
print("Classification Report:\n", classification_report(y_test, predictions))

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       965
           1       0.91      0.93      0.92       150

    accuracy                           0.98      1115
   macro avg       0.95      0.96      0.95      1115
weighted avg       0.98      0.98      0.98      1115
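
Because ham is far more common than spam in the test set, accuracy alone can look flattering. A quick sketch of the baseline, using the support counts from the report above: a hypothetical classifier that labels every message as ham would already reach the accuracy computed below.

```python
# Support counts copied from the classification report above
ham_support = 965
spam_support = 150

# Accuracy of a hypothetical classifier that always predicts ham (class 0)
baseline_accuracy = ham_support / (ham_support + spam_support)
print(f"always-ham baseline accuracy: {baseline_accuracy:.4f}")
```

This is why the per-class precision and recall for class 1 are the more informative numbers for judging the spam detector.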



In [28]:
#Save the trained model and CountVectorizer for future use
import pickle

# Save the model and vectorizer
with open('spam_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

with open('vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(cv, vectorizer_file)
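
For completeness, here is a minimal sketch of the reverse step: loading both pickles and classifying a new message. So that it runs without spam.csv, it trains a tiny hypothetical model on made-up messages and writes to a temporary directory; in practice the files would be the `spam_model.pkl` and `vectorizer.pkl` saved above.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data so the example runs without spam.csv
toy_texts = ["win a free prize now", "free cash prize",
             "see you at lunch", "call me after work"]
toy_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_model.pkl")
    vect_path = os.path.join(tmp_dir, "vectorizer.pkl")

    # Save, mirroring the cell above
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(vect_path, "wb") as f:
        pickle.dump(toy_cv, f)

    # Load both objects back
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vect_path, "rb") as f:
        loaded_cv = pickle.load(f)

# Use transform (not fit_transform) so the columns match what the model was trained on
new_message = ["free prize"]
loaded_prediction = loaded_model.predict(loaded_cv.transform(new_message))
print(loaded_prediction[0])
```

The model and the vectorizer have to be saved and loaded as a pair, since the model only understands the column layout of the vectorizer it was trained with. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.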

Check library versions

In [29]:
import numpy as np
print(np.__version__)

2.1.3


In [30]:
import pandas as pd
print(pd.__version__)

2.2.3


In [31]:
import sklearn
print(sklearn.__version__)

1.5.2


In [32]:
import nltk
print(nltk.__version__)

3.9.1


In [33]:
import streamlit
print(streamlit.__version__)

1.40.2
