
In [92]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Load the Dataset:

In [93]:
raw_mail_data = pd.read_csv("/mail_data.csv")

Data Analysis:

In [94]:
raw_mail_data.head()

   Category  Message
0  ham       Go until jurong point, crazy.. Available only ...
1  ham       Ok lar... Joking wif u oni...
2  spam      Free entry in 2 a wkly comp to win FA Cup fina...
3  ham       U dun say so early hor... U c already then say...
4  ham       Nah I don't think he goes to usf, he lives aro...


In [95]:
raw_mail_data.shape

(5572, 2)

In [96]:
# Replace any missing values with empty strings, then confirm that none remain.
mail_data = raw_mail_data.where(pd.notnull(raw_mail_data), '')
mail_data.isnull().sum()

Category    0
Message     0
dtype: int64

In [97]:
mail_data.shape

(5572, 2)

Label Encoding:

> spam -> 0

> ham (not spam) -> 1

In [98]:
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1
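
The two `.loc` assignments above typically leave the column with `object` dtype, which is why the labels are cast with `astype('int')` later in the notebook. As a hedged alternative, the sketch below does the encoding and the cast in one step with `Series.map`. The `toy` frame and `label_map` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for mail_data (hypothetical rows, not the real dataset).
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at 5", "WIN a free prize now", "ok"],
})

# Map the string labels to integers in one step: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
toy["Category"] = toy["Category"].map(label_map).astype("int64")

print(toy["Category"].tolist())
print(toy["Category"].dtype)
```

Any category missing from `label_map` would become `NaN` and make the cast fail, which can be a useful early warning about unexpected labels.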

In [99]:
x = mail_data['Message']
y = mail_data['Category']


Data Splitting:

In [100]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=3)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(4457,)
(1115,)
(4457,)
(1115,)
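
The split above does not pass `stratify`, so the spam/ham ratio in each split is left to chance. The sketch below, using hypothetical toy labels rather than the real dataset, shows how `stratify` could keep the class proportions the same in both splits. The names `x_toy` and `y_toy` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (1) and 20 spam (0).
y_toy = np.array([1] * 80 + [0] * 20)
x_toy = np.arange(len(y_toy))  # placeholder features

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=3, stratify=y_toy
)

# Count the spam messages that ended up in each split.
print((y_tr == 0).sum(), "spam of", len(y_tr), "training samples")
print((y_te == 0).sum(), "spam of", len(y_te), "testing samples")
```

In the notebook this would amount to adding `stratify=y` to the existing call, assuming `y` already holds the encoded labels.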


Feature Extraction:

In [101]:
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)

#converting all y values to integers:

y_train = y_train.astype('int')
y_test = y_test.astype('int')
print(x_train_features.shape)
print(x_test_features.shape)
print(y_train.shape)
print(y_test.shape)

(4457, 7431)
(1115, 7431)
(4457,)
(1115,)
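
The vectorizer is fitted on the training messages only and then reused on the test messages, which is why both matrices have the same 7431 columns. The sketch below illustrates the same pattern on a hypothetical three-document corpus: a word that appears only in the test text gets no column of its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the dataset.
train_docs = ["free prize click now", "meeting at noon", "free lunch at noon"]
test_docs = ["free crypto prize"]  # "crypto" never appears in train_docs

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them unchanged

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print("crypto" in vectorizer.vocabulary_)
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the model's learned coefficients would no longer line up with the columns.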


Training The Model:

In [102]:
logistic_model = LogisticRegression()

#fit the model with the training data:

logistic_model.fit(x_train_features,y_train)

Evaluating the Model on the Training Data:

In [103]:
predictions = logistic_model.predict(x_train_features)

#checking the accuracy:

score = accuracy_score(y_train,predictions)
print(f"The accuracy score is : {score*100:0.2f}%")

The accuracy score is : 96.70%


Making Predictions on Testing Data:

In [104]:
predictions_test = logistic_model.predict(x_test_features)

#checking the accuracy:

score_test = accuracy_score(y_test,predictions_test)
print(f"The accuracy score is : {score_test*100:0.2f}%")

The accuracy score is : 96.59%


## **The training accuracy (96.70%) and testing accuracy (96.59%) are nearly identical, at about 97%, which suggests that the model is neither overfitting nor underfitting the dataset**
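
Accuracy alone does not show which class the errors fall in. The sketch below uses hypothetical labels, not the notebook's predictions, to show how a confusion matrix and the precision and recall of the spam class could be computed with `sklearn.metrics`. The same calls could be applied to `y_test` and `predictions_test`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels (0 = spam, 1 = ham); these are not results from the notebook.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Treat spam (label 0) as the class of interest.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print(f"spam precision: {spam_precision:0.2f}, spam recall: {spam_recall:0.2f}")
```

In this toy case the accuracy is 90%, yet one of the three spam messages is missed, a detail that only the per-class numbers reveal.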

Checking if the model is predicting correctly:

In [105]:
# This message is expected to be classified as spam.
test_mail = ["Congratulations! you just won a 100% free membership for our club! Click the link NOW!"]
test_feature = feature_extraction.transform(test_mail)
predicts = logistic_model.predict(test_feature)
# predict() returns an array, so compare its single element (0 = spam, 1 = ham).
if predicts[0] == 0:
  print("The Model has predicted the mail to be Spam")
else:
  print("The Model has predicted the mail to be Ham (not Spam)")

The Model has predicted the mail to be Spam
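
The transform-then-predict steps above can be bundled so that new messages are classified in a single call. The following is a minimal sketch, assuming a hypothetical toy corpus and a hypothetical helper named `classify`. It uses scikit-learn's `make_pipeline`, which the original notebook does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; labels follow the notebook: 0 = spam, 1 = ham.
messages = [
    "win a free prize now", "free entry click the link",
    "claim your free membership", "are we still meeting today",
    "see you at lunch", "call me when you get home",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF and then logistic regression in one object.
pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
pipeline.fit(messages, labels)


def classify(mail: str) -> str:
    """Return 'spam' or 'ham' for one message (hypothetical helper)."""
    return "spam" if pipeline.predict([mail])[0] == 0 else "ham"


print(classify("free prize, click now"))
```

Keeping the vectorizer and the classifier in one object avoids accidentally transforming new text with a vectorizer other than the one used in training.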
