<a href="https://colab.research.google.com/github/HemanthKumarNP/Spam-Email-Classification-Model/blob/main/Spam_Email_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing **Dependencies**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score


Data **collection** & **Preprocessing**

In [None]:
#loading data from csv file to pandas dataframe
raw_mail_data = pd.read_csv('/content/mail_data.csv')

In [None]:
print(raw_mail_data)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5577     spam  Unlock the secrets to eternal youth with our r...
5578     spam  Get a free iPhone X by participating in our su...
5579     spam  Your bank account has been compromised. Update...
5580     spam  Meet singles in your area tonight! Join now fo...
5581     spam  Incredible business opportunity - earn six fig...

[5582 rows x 2 columns]


In [None]:
#replacing null values with empty strings
mail_data = raw_mail_data.where((pd.notnull(raw_mail_data)),'')
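
The `where` call above keeps every non-null cell and substitutes `''` elsewhere. As a minimal sketch on an invented toy frame (`toy`, `via_where` and `via_fillna` are illustrative names, not part of the notebook), the same result can be obtained with `fillna('')`, which some readers may find easier to scan:

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_mail_data; the rows are invented for illustration only.
toy = pd.DataFrame({
    "Category": ["ham", "spam", None],
    "Message": ["hi there", np.nan, "bye"],
})

# Notebook approach: keep non-null cells, put '' everywhere else.
via_where = toy.where(pd.notnull(toy), "")

# Alternative spelling of the same operation.
via_fillna = toy.fillna("")

# Both frames should hold identical values with no nulls left.
print(via_where.equals(via_fillna))
print(via_fillna.isna().sum().sum())
```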

In [None]:
#printing the first 5 rows of the Dataframe
mail_data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [None]:
#number of rows and columns in the dataframe
mail_data.shape

(5582, 2)

Label Encoding

In [None]:
#label spam as 0
mail_data.loc[mail_data['Category']=='spam','Category']=0

#label ham as 1
mail_data.loc[mail_data['Category']=='ham','Category']=1

spam =0
ham=1
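
The `.loc` assignments above leave `Category` with `object` dtype, which is why the labels are cast to integers before training. A minimal sketch of an alternative, assuming an invented toy frame (`toy` and `label_map` are illustrative names, not part of the notebook), encodes both labels in one step and yields an integer column directly:

```python
import pandas as pd

# Toy stand-in for mail_data; the rows are invented for illustration only.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at 5", "win a free prize now", "ok thanks"],
})

# Same convention as the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
toy["Category"] = toy["Category"].map(label_map)

print(toy["Category"].tolist())
print(toy["Category"].dtype)
```

Any category missing from `label_map` would become `NaN` under this approach, so it is worth checking for nulls afterwards on real data.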

In [None]:
#separating the data as text and labels
X = mail_data['Message']
Y = mail_data['Category']


In [None]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5577    Unlock the secrets to eternal youth with our r...
5578    Get a free iPhone X by participating in our su...
5579    Your bank account has been compromised. Update...
5580    Meet singles in your area tonight! Join now fo...
5581    Incredible business opportunity - earn six fig...
Name: Message, Length: 5582, dtype: object


In [None]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5577    0
5578    0
5579    0
5580    0
5581    0
Name: Category, Length: 5582, dtype: object


Splitting data into **training data** & **testing data**

In [None]:
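# NOTE: train_size=0.2 trains on 20% of the rows and tests on the remaining 80%; test_size=0.2 would give the more common 80/20 split
X_train , X_test , Y_train , Y_test = train_test_split(X,Y,train_size=0.2,random_state=2)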

In [None]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5582,)
(1116,)
(4466,)
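
The shapes above show 1116 training rows and 4466 test rows. As a hedged sketch of a different configuration, assuming an 80/20 split is what is wanted, the example below uses `test_size=0.2` together with `stratify`, which keeps the spam/ham ratio the same in both parts. The arrays `X_toy` and `y_toy` are invented for illustration and are not the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 messages, 20 spam (0) and 80 ham (1); invented for illustration.
X_toy = np.array([f"message {i}" for i in range(100)])
y_toy = np.array([0] * 20 + [1] * 80)

# Hold out 20% for testing and preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

print(X_tr.shape, X_te.shape)
print("ham share in train:", y_tr.mean())
print("ham share in test:", y_te.mean())
```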


**Feature Extraction**

In [None]:
# transform the text data to feature vectors that can be used as input to the Logistic regression

feature_extraction = TfidfVectorizer(min_df = 1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# convert Y_train and Y_test values to integers

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [None]:
print(X_train_features)

  (0, 488)	0.6297741025105246
  (0, 812)	0.7767783337652148
  (1, 3205)	0.4267602255053194
  (1, 1839)	0.2870618285688008
  (1, 2016)	0.3863783326579773
  (1, 965)	0.2897720268344958
  (1, 3307)	0.2819926539394581
  (1, 2725)	0.40313833248148434
  (1, 2409)	0.3329963739783447
  (1, 1071)	0.3863783326579773
  (2, 1738)	0.6366441229377474
  (2, 2570)	0.7711577404972516
  (3, 2679)	0.5297541920514164
  (3, 919)	0.40833708923786927
  (3, 2144)	0.5052785957691327
  (3, 1489)	0.5452658601217104
  (4, 2139)	0.37987424413692655
  (4, 1373)	0.4867893057182753
  (4, 1813)	0.4916285354555131
  (4, 3008)	0.434184933873566
  (4, 620)	0.434184933873566
  (5, 1861)	0.5997222058568904
  (5, 2757)	0.634862932127123
  (5, 1955)	0.48711634463760056
  (6, 2616)	0.6294583154081537
  :	:
  (1113, 973)	0.2590555740178533
  (1113, 674)	0.2104144269483401
  (1113, 3143)	0.24828564462887073
  (1113, 468)	0.24828564462887073
  (1113, 2058)	0.4083208746497984
  (1113, 3223)	0.19197762647261163
  (1113, 2565)	0.16
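
Each printed entry above is a `(row, column)` position with its TF-IDF weight. The sketch below, on an invented three-message corpus (`train_docs`, `vec` and the other names are illustrative, not part of the notebook), shows three properties of the same vectorizer settings: English stop words are dropped, each row is scaled to unit length under the default `norm='l2'`, and words not seen during `fit_transform` are ignored by `transform`, which is why the vectorizer is fitted on the training text only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented messages, used only to illustrate the vectorizer's behaviour.
train_docs = [
    "Free prize waiting for you",
    "Are you free tonight",
    "Claim your prize now",
]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
train_mat = vec.fit_transform(train_docs)

# Vocabulary learned from the training text (stop words such as "you" are removed).
print(sorted(vec.vocabulary_))

# Euclidean length of each row; the default norm='l2' scales rows to length 1.
row_norms = np.asarray(np.sqrt(train_mat.multiply(train_mat).sum(axis=1))).ravel()
print(row_norms)

# A message made only of unseen words maps to an all-zero row.
unseen = vec.transform(["zebra xylophone"])
print(unseen.nnz)
```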

Training the model

**Logistic Regression**

In [None]:
model = LogisticRegression()

In [None]:
# training the Logistic Regression model with the training data
model.fit(X_train_features, Y_train)

Evaluating the trained model

In [None]:
#prediction on training data
prediction_on_train_data = model.predict(X_train_features)
accuracy_on_train_data = accuracy_score(Y_train , prediction_on_train_data)
print('Accuracy score on training data is',accuracy_on_train_data)

Accuracy score on training data is 0.9301075268817204


In [None]:
#prediction on test data
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test,prediction_on_test_data)
print('Accuracy score on testing data is',accuracy_on_test_data)

Accuracy score on testing data is 0.9223018360949395
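
Accuracy alone can hide weak spam detection if one class is much more common than the other. The following sketch uses invented labels (`y_true` and `y_pred` are illustrative, not the notebook's predictions) to show how a confusion matrix plus precision and recall for the spam class (label 0 in this notebook) give a fuller picture:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented labels using the notebook's convention: 0 = spam, 1 = ham.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)

# Treat spam (0) as the class of interest.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print("accuracy:", acc)
print("spam precision:", spam_precision)
print("spam recall:", spam_recall)
```

In this toy case accuracy is 0.8 even though only one of the three spam messages is caught, which is the kind of gap these extra metrics are meant to reveal.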


Building a Predictive System

In [None]:
input_mail=["Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."]

#convert text data to feature vector
input_feature = feature_extraction.transform(input_mail)

#making prediction
prediction = model.predict(input_feature)

if prediction[0]==1:
  print('Ham mail')
else:
  print('Spam mail')

Ham mail


In [None]:
input_mail=["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]

#convert text data to feature vector
input_feature = feature_extraction.transform(input_mail)

#making prediction
prediction = model.predict(input_feature)

if prediction[0]==1:
  print('Ham mail')
else:
  print('Spam mail')

Spam mail
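
The two cells above repeat the same transform-then-predict steps. One way to avoid the repetition, sketched here under stated assumptions, is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline` and wrap the prediction in a helper. The corpus, `clf` and `classify_mail` are hypothetical names introduced for illustration, and the tiny training set is far too small to give meaningful predictions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus; labels follow the notebook: 0 = spam, 1 = ham.
texts = [
    "win a free prize today",
    "claim your free cash reward",
    "urgent prize winner call now",
    "lunch at noon tomorrow",
    "see you at the meeting",
    "can you send the report",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF and then logistic regression in one object.
clf = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
clf.fit(texts, labels)


def classify_mail(pipeline, text):
    """Return 'Ham mail' or 'Spam mail' for a single message string."""
    prediction = pipeline.predict([text])
    return "Ham mail" if prediction[0] == 1 else "Spam mail"


result = classify_mail(clf, "free prize waiting for the winner")
print(result)
```

Because the pipeline accepts raw strings, the separate `feature_extraction.transform` step is no longer needed at prediction time.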
