**Spam Email Prediction Model**

Importing the required packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Reading the CSV dataset file

In [2]:
df = pd.read_csv('mail_data.csv')

Printing the first 5 rows of the dataset

In [3]:
print(df.head())

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


Replacing NaN values with an empty string

In [4]:
data = df.where((pd.notnull(df)),'')
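
The same replacement is more commonly written with `DataFrame.fillna`. The sketch below uses a small made-up frame (the rows are hypothetical, not taken from `mail_data.csv`) to check that both spellings give the same result.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["Ok lar", np.nan, "See you soon"],
})

# Form used in this notebook: keep non-null cells, fill the rest with ''
via_where = toy.where(pd.notnull(toy), "")

# Shorter equivalent
via_fillna = toy.fillna("")

same = via_where.equals(via_fillna)
filled_messages = via_fillna["Message"].tolist()
print(same, filled_messages)
```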

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [6]:
data.shape

(5572, 2)

Encoding the labels as integers: spam as 0 and ham as 1.

In [7]:
data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1
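
As an alternative sketch, the same encoding can be done with `Series.map`, which produces an integer column directly instead of leaving the values in an `object` column. The toy frame and the `Label` column name are assumptions for illustration only.

```python
import pandas as pd

# Toy frame (hypothetical rows)
toy = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Same convention as this notebook: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map).astype(int)

labels = toy["Label"].tolist()
print(labels)
```

With an integer column from the start, a later `astype('int')` cast on the labels would not be needed.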

In [8]:
x = data['Message']
y = data['Category']

In [9]:
print(x.head())

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Message, dtype: object


In [10]:
print(y.head())

0    1
1    1
2    0
3    1
4    1
Name: Category, dtype: object


Splitting the data into training and testing sets in an 80:20 ratio

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=42)
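
If spam is much rarer than ham in the data (the class counts are not shown in this notebook), a plain random split can leave the two sets with different spam proportions. A possible refinement, sketched here on made-up data, is to pass `stratify` so both sets keep the same class ratio.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 spam (0) and 40 ham (1) messages
messages = [f"message {i}" for i in range(50)]
labels = [0] * 10 + [1] * 40

# stratify=labels preserves the 20% spam share in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(y_tr), y_tr.count(0))
print(len(y_te), y_te.count(0))
```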

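Logistic regression needs numeric input, so the text messages are converted into TF-IDF feature vectors. The vectorizer is fitted on the training data only and then applied to the test data.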

In [12]:
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
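
To make the fit/transform distinction concrete, here is a minimal sketch on a made-up corpus. The vocabulary is learned from the training documents only, so a word that appears only in the test document gets no column, and the test matrix has the same number of columns as the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical)
train_docs = ["win a prize today", "lunch at noon", "prize draw tonight"]
test_docs = ["prize lunch tomorrow"]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

print(train_matrix.shape, test_matrix.shape)
print("tomorrow" in vec.vocabulary_)  # word seen only at test time
```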

In [13]:
model = LogisticRegression()

Training the model on the training data

In [14]:
model.fit(X_train_features,Y_train)

Checking accuracy on training data

In [15]:
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train,prediction_on_training_data)

In [16]:
print('Accuracy on training data : ',accuracy_on_training_data)

Accuracy on training data :  0.9670181736594121


Checking accuracy on testing data

In [17]:
prediction_on_testing_data = model.predict(X_test_features)
accuracy_on_testing_data = accuracy_score(Y_test,prediction_on_testing_data)

In [18]:
print('Accuracy on testing data : ',accuracy_on_testing_data)

Accuracy on testing data :  0.967713004484305
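
Accuracy alone can hide poor performance on a minority class. As a hedged illustration on made-up labels (not this notebook's predictions), the sketch below computes a confusion matrix plus precision and recall for the spam class, using the same 0 = spam, 1 = ham convention.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy labels (hypothetical): 2 spam and 8 ham, with one spam message missed
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Treat spam (0) as the class of interest
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(acc, spam_precision, spam_recall)
print(cm)
```

In this toy case the accuracy is high even though half of the spam messages are missed, which is the kind of gap these extra metrics are meant to expose.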


Testing on user input data

In [20]:
# Testing on user input data
usrInput = ["U've been selected to stay in 1 of 250 top British hotels-FOR NOTHING! Holiday valued at £350! Dial 08712300220 to claim-National Rate Call.Bx526,SW73SS"]

usrInput_data_features = feature_extraction.transform(usrInput)

predict = model.predict(usrInput_data_features)

print(predict)
# 0 means spam
# 1 means ham

[0]
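
For repeated use, the transform and predict steps can be wrapped in a small helper that returns readable labels instead of 0/1. This is only a sketch: `classify`, `LABEL_NAMES`, and the tiny training set are hypothetical names and data, and a model trained on so few examples is not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Same convention as this notebook: 0 means spam, 1 means ham
LABEL_NAMES = {0: "spam", 1: "ham"}

def classify(texts, vectorizer, model):
    """Vectorize raw texts with a fitted vectorizer and return label names."""
    features = vectorizer.transform(texts)
    return [LABEL_NAMES[int(p)] for p in model.predict(features)]

# Toy training data (hypothetical), just to make the sketch runnable
toy_texts = [
    "win a free prize now",
    "claim your free holiday",
    "see you at lunch",
    "running late, be there soon",
]
toy_labels = [0, 0, 1, 1]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(toy_texts), toy_labels)

results = classify(["free prize holiday", "lunch soon?"], vec, clf)
print(results)
```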


In [21]:
import pickle

In [22]:
# Save the trained model; the with-block closes the file after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
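
A saved model on its own cannot score raw text later, because new messages must be transformed by the same fitted vectorizer. One possible approach, sketched below with toy data, is to pickle the vectorizer and the model together. The file name `spam_bundle.pkl` and the dictionary keys are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (hypothetical)
texts = ["win a free prize", "free holiday offer", "see you at lunch", "call me later"]
labels = [0, 0, 1, 1]  # 0 = spam, 1 = ham

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# Save both fitted objects in one file (written to a temporary directory here)
path = os.path.join(tempfile.mkdtemp(), "spam_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vec, "model": clf}, f)

# Load them back and check that predictions are unchanged
with open(path, "rb") as f:
    bundle = pickle.load(f)

sample = ["free prize offer"]
before = clf.predict(vec.transform(sample))
after = bundle["model"].predict(bundle["vectorizer"].transform(sample))
print(before, after)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.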