 <h1><center>Email Spam Detection Using Machine Learning</center></h1>

### Introduction

- Spam is a familiar problem in digital communication. These unsolicited messages, commonly known as junk mail, fill inboxes with cryptic content, scams, and, most dangerously, phishing attempts.

- This project builds a spam detector in Python. A machine learning model is trained to classify each message into one of two categories: spam or non-spam (ham). The data used is the SMS Spam Collection dataset linked below, so the messages are short texts rather than full emails.

**Dataset Link** : https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

### Importing the Necessary Libraries

In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score

### Data Collection & Pre-Processing

In [57]:
# loading the data from csv file to a pandas Dataframe
df = pd.read_csv('spam.csv', encoding='ISO-8859-1')
df

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [58]:
df.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [59]:
df1 = df.where(pd.notnull(df), '')  # replace null values with empty strings

In [60]:
df1.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [61]:
df1.isnull().sum()

v1            0
v2            0
Unnamed: 2    0
Unnamed: 3    0
Unnamed: 4    0
dtype: int64

In [62]:
# checking the number of rows and columns in the dataframe
df1.shape

(5572, 5)

In [63]:
df1.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [64]:
# removing unwanted columns (columns= already implies axis=1)
df1.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

In [65]:
# renaming columns
df1.rename(columns={'v1': 'Category', 'v2': 'Message'}, inplace=True)
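The cleaning steps above (dropping the empty columns, then renaming) can also be done while loading. The sketch below is one possible way to do it, not the notebook's method: it reads a small in-memory CSV (made-up rows, with three empty trailing columns to mimic the file) and uses `usecols` to keep only `v1` and `v2`. The names `csv_text` and `toy` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for spam.csv; the three unnamed
# trailing columns are empty, as in the DataFrame shown above.
csv_text = (
    "v1,v2,,,\n"
    "ham,see you at lunch,,,\n"
    "spam,win a free prize now,,,\n"
)

# usecols keeps only the label and text columns, so nothing has to be
# dropped afterwards; rename gives them descriptive names.
toy = (
    pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"])
    .rename(columns={"v1": "Category", "v2": "Message"})
)

print(toy.columns.tolist())
print(toy.shape)
```

With the real file, the same call would also need the `encoding` argument used in the loading cell.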

### Label Encoding

In [66]:
# label encoding: spam -> 0, ham -> 1 (stored in a new 'category' column)
df1.loc[df1['Category'] == 'spam', 'category'] = 0
df1.loc[df1['Category'] == 'ham', 'category'] = 1

In [67]:
# removing the original text-label column; the encoded 'category' column is kept
df1.drop(columns='Category', inplace=True)
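Assigning through `.loc` into a column that does not exist yet produces a `float64` column, which is why the labels print as `1.0` and `0.0` further down and are cast to integers before training. As an alternative sketch, a dictionary passed to `Series.map` yields integer labels in one step. The toy frame and the name `label_map` are assumptions for illustration; the encoding itself (spam = 0, ham = 1) matches the notebook.

```python
import pandas as pd

# Toy frame standing in for df1; the messages are made up.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at lunch", "win a free prize now", "ok thanks"],
})

# Same encoding as the notebook (spam -> 0, ham -> 1), stored as integers.
label_map = {"spam": 0, "ham": 1}
toy["category"] = toy["Category"].map(label_map).astype("int64")

# Drop the original text labels once the encoded column exists.
toy = toy.drop(columns="Category")

print(toy["category"].tolist())
```

Any label that is missing from `label_map` would become `NaN` and make the `astype` call fail, which surfaces unexpected categories early.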

In [72]:
# separating the data into texts (x) and labels (y)
x = df1['Message']

y = df1['category']

In [75]:
print(x)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [76]:
print(y)

0       1.0
1       1.0
2       0.0
3       1.0
4       1.0
       ... 
5567    0.0
5568    1.0
5569    1.0
5570    1.0
5571    1.0
Name: category, Length: 5572, dtype: float64


### Splitting the data into training & testing data

In [77]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=20)

In [79]:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(4457,)
(4457,)
(1115,)
(1115,)
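The split above is purely random. When one class is much rarer than the other, which is common for spam data, the `stratify` argument of `train_test_split` keeps the class ratio the same in both partitions. The following is a minimal sketch on hypothetical toy labels; the `toy_*` and `t*_` names are not from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 ham (1) and 20 spam (0) messages.
toy_x = pd.Series([f"message {i}" for i in range(100)])
toy_y = pd.Series([1] * 80 + [0] * 20)

# stratify=toy_y preserves the 80/20 class ratio in both partitions.
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=20, stratify=toy_y
)

print("spam in train:", int((ty_train == 0).sum()), "of", len(ty_train))
print("spam in test: ", int((ty_test == 0).sum()), "of", len(ty_test))
```

In the notebook this would amount to adding `stratify=y` to the existing call; whether it changes the reported scores has not been checked here.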


### Feature Extraction

In [80]:
# Transform the text data into TF-IDF feature vectors that can be used as input to the logistic regression
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
feature_extraction

TfidfVectorizer(stop_words='english')

In [81]:
# Fit the vectorizer on the training text only, then reuse it to transform the test text
X_train_features=feature_extraction.fit_transform(X_train)
X_test_features=feature_extraction.transform(X_test)
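The vectorizer is fitted on the training text and only applied to the test text, so the vocabulary (and therefore the number of feature columns) is fixed by the training data. A small sketch with made-up sentences shows the effect: a word that appears only at test time gets no column. The variable names and sentences below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages; "pizza" occurs only in the test text.
train_texts = ["free prize waiting", "see you at lunch", "free lunch today"]
test_texts = ["free pizza today"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training text.
train_features = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary; unseen words are silently ignored.
test_features = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(train_features.shape, test_features.shape)
```

Calling `fit_transform` on the test text instead would build a different vocabulary, and the model's learned weights would no longer line up with the columns.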

In [82]:
# converting the train and test labels from float to integer
Y_test = Y_test.astype('int')
Y_train = Y_train.astype('int')

### Training the model

In [84]:
model=LogisticRegression()

In [85]:
model.fit(X_train_features,Y_train)

LogisticRegression()

### Evaluating the trained model

In [95]:
# prediction on training data
Model_predict_train = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, Model_predict_train)

In [96]:
print('Accuracy on training data : ', accuracy_on_training_data)

Accuracy on training data :  0.9679156383217411


In [89]:
# prediction on test data
Model_predict_test = model.predict(X_test_features)
accuracy_on_testing_data = accuracy_score(Y_test, Model_predict_test)

In [90]:
print('Accuracy on testing data : ', accuracy_on_testing_data)

Accuracy on testing data :  0.9605381165919282
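Accuracy summarizes both classes in one number, so it can look high even when many spam messages are missed. A confusion matrix plus per-class precision and recall gives a fuller picture. The sketch below uses hypothetical labels, not the notebook's predictions, with the same encoding (0 = spam, 1 = ham); `y_true` and `y_pred` are illustrative names.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 8 ham (1) and 2 spam (0); one spam is predicted as ham.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# pos_label=0 treats spam as the class of interest.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print("accuracy:", accuracy)
print("confusion matrix:\n", cm)
print("spam precision:", spam_precision, "spam recall:", spam_recall)
```

Here accuracy is 0.9 while spam recall is only 0.5, since one of the two spam messages is missed. The same calls could be applied to `Y_test` and `Model_predict_test` from the cells above.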


### Predictive System

In [92]:
inp=["Lol your always so convincing."]

In [93]:
input_features=feature_extraction.transform(inp)
prediction=model.predict(input_features)
print(prediction)

[1]


In [94]:
if prediction[0] == 1:
    print("HAM MAIL")
else:
    print("SPAM MAIL")

HAM MAIL
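The predictive system above repeats two steps by hand: vectorize the input, then predict. One way to bundle them is a scikit-learn `Pipeline`, so a raw string can be classified with a single call. This is a sketch under stated assumptions, not the notebook's code: the tiny corpus is made up, and `spam_pipeline` and `classify` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training messages with the notebook's encoding: 0 = spam, 1 = ham.
texts = [
    "win a free prize now",
    "free entry claim your cash reward",
    "urgent call now to claim prize",
    "see you at lunch tomorrow",
    "can you send me the notes",
    "running late be there soon",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF first, then logistic regression.
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(texts, labels)


def classify(message: str) -> str:
    """Return 'HAM MAIL' or 'SPAM MAIL' for a single raw message."""
    label = spam_pipeline.predict([message])[0]
    return "HAM MAIL" if label == 1 else "SPAM MAIL"


print(classify("Lol your always so convincing."))
```

A model trained on six sentences is only a demonstration of the interface; in the notebook the pipeline would be fitted on `X_train` and `Y_train` instead.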
