
Importing the Dependencies

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Data Collection and Preprocessing

In [2]:
# load the data into a pandas DataFrame
raw_mail_data = pd.read_csv('/content/drive/MyDrive/csv files for colab/mail_data.csv')

In [3]:
print(raw_mail_data)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [4]:
raw_mail_data.shape

(5572, 2)

In [5]:
# replace null values with empty strings
mail_data = raw_mail_data.where((pd.notnull(raw_mail_data)),'')

In [6]:
mail_data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
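The null-handling step can be checked on a small frame. This is a minimal sketch on hypothetical rows (the `toy` frame is not part of the dataset): it counts missing values per column and shows that `fillna('')` is a shorter way to get the same result as the `where(pd.notnull(...), '')` call above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for raw_mail_data; one message is missing
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["ok lar", np.nan, "see you later"],
})

# Count missing values per column before cleaning
null_counts = toy.isnull().sum()
print(null_counts)

# Two ways to replace missing values with empty strings
via_where = toy.where(pd.notnull(toy), "")
via_fillna = toy.fillna("")

print(via_where["Message"].tolist())
print(via_fillna["Message"].tolist())
```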


Label Encoding

spam mail = 0, ham mail = 1

In [7]:
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

In [8]:
mail_data.head()

  Category                                            Message
0        1  Go until jurong point, crazy.. Available only ...
1        1                      Ok lar... Joking wif u oni...
2        0  Free entry in 2 a wkly comp to win FA Cup fina...
3        1  U dun say so early hor... U c already then say...
4        1  Nah I don't think he goes to usf, he lives aro...
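An alternative way to encode the labels is a dictionary lookup with `Series.map`. The sketch below uses hypothetical rows and a hypothetical `Label` column; it keeps the same convention (spam = 0, ham = 1) and produces an integer column directly, so no later cast from object dtype would be needed.

```python
import pandas as pd

# Hypothetical rows standing in for mail_data
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at lunch", "WIN a free prize now", "ok lar"],
})

# Same convention as above: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map)

print(toy["Label"].tolist())
print(toy["Label"].dtype)
```

Any category missing from `label_map` would become NaN, which makes unexpected label values easy to spot.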


In [9]:
# separate the data into text and labels
X = mail_data['Message']
Y = mail_data['Category']

Splitting the data into training and test data

In [10]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, random_state=2)

In [11]:
print(X.shape, X_train.shape, X_test.shape)

(5572,) (4457,) (1115,)
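The split above is purely random. If the classes are imbalanced, passing `stratify` to `train_test_split` keeps the spam/ham ratio the same in both partitions. This is a sketch on hypothetical labels (80 ham, 20 spam), not the notebook's actual split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 ham (1) and 20 spam (0)
texts = [f"message {i}" for i in range(100)]
labels = [1] * 80 + [0] * 20

# stratify=labels preserves the 80/20 class ratio in train and test
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=2, stratify=labels
)

train_counts = Counter(train_y)
test_counts = Counter(test_y)
print(train_counts)
print(test_counts)
```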


Feature Extraction, converting the text values into numerical values

In [12]:
# transform the text into TF-IDF feature vectors that can be fed to Logistic Regression
# TfidfVectorizer scores each word by how often it occurs in a message,
# down-weighted by how many messages contain that word
# min_df=1 keeps every word that appears in at least one message (this is the default)

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# convert Y_train and Y_test to integers; they currently have object dtype
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
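To see what `min_df` does, here is a small sketch on a hypothetical three-message corpus. `min_df` is a document-frequency threshold: a term is kept only if it appears in at least that many documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus, not taken from mail_data.csv
corpus = [
    "free prize call now",
    "call me later",
    "free lunch later",
]

# min_df=1 (the default) keeps every term that occurs in at least one document
keep_all = TfidfVectorizer(min_df=1)
keep_all.fit(corpus)
print(sorted(keep_all.vocabulary_))

# min_df=2 drops terms that occur in fewer than two documents
drop_rare = TfidfVectorizer(min_df=2)
drop_rare.fit(corpus)
print(sorted(drop_rare.vocabulary_))
```

With `min_df=2` only `call`, `free`, and `later` survive, because each of the other terms appears in a single message.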

Model Training: Logistic Regression

In [13]:
model = LogisticRegression()

In [14]:
model.fit(X_train_features,Y_train)

LogisticRegression()

Model Evaluation

In [15]:
# accuracy on training data
prediction_training_data = model.predict(X_train_features)
accuracy_training_data = accuracy_score(Y_train, prediction_training_data)
print('Accuracy on training data : ',accuracy_training_data)

Accuracy on training data :  0.9683643706529056


In [16]:
#accuracy on test data
prediction_test_data = model.predict(X_test_features)
accuracy_test_data = accuracy_score(Y_test, prediction_test_data)
print('Accuracy on test data : ',accuracy_test_data)

Accuracy on test data :  0.9524663677130045
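Accuracy alone can hide how well the minority class is handled, so it may help to also look at a confusion matrix and at precision and recall for spam. The sketch below uses hypothetical labels and predictions (not the notebook's results) with the same convention, 0 = spam and 1 = ham.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions: 0 = spam, 1 = ham
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, ordered [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Treat spam (0) as the positive class when scoring
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)
print("spam precision:", spam_precision)
print("spam recall:", spam_recall)
```

In this toy case accuracy is 0.9, yet one of the three spam messages is missed, which shows up only in the recall figure.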


Building a predictive system

In [17]:
# sample input message
input_mail = ["Aft i finish my lunch then i go str down lor. Ard 3 smth lor. U finish ur lunch already?"]

#converting text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

#making prediction
prediction = model.predict(input_data_features)
#print(prediction)

if prediction[0] == 0:
  print("This mail is a spam mail")
else:
  print("The mail is not a spam mail")

The mail is not a spam mail
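The vectorize-then-predict steps can also be bundled into a single scikit-learn pipeline, so new text goes through the same transformation automatically. This is a minimal sketch under assumptions: the training messages are hypothetical, and `spam_pipeline` and `classify` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set; 0 = spam, 1 = ham as above
train_texts = [
    "win a free prize now", "free cash claim your prize",
    "urgent call now to claim free entry", "free entry win cash",
    "are you coming to lunch", "see you at home later",
    "ok i will call you later", "did you finish your lunch",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline fits the vectorizer and the classifier together
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(train_texts, train_labels)

def classify(message: str) -> str:
    """Return 'spam' or 'ham' for a single message (hypothetical helper)."""
    label = spam_pipeline.predict([message])[0]
    return "spam" if label == 0 else "ham"

print(classify("claim your free prize"))
```

A pipeline like this avoids calling `transform` by hand before every prediction and keeps the vectorizer and model in one object that can be saved and reloaded together.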
