There are two types of email: spam and ham (legitimate mail).

Workflow

Mail data --> Data preprocessing --> Train/test split --> Logistic regression model [binary classification: two classes, spam and ham]

  new data --> Trained logistic regression model --> Prediction [spam or ham]
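
The same flow can be sketched end to end with scikit-learn's `make_pipeline`, which chains the vectorizer and the classifier into a single object. This is not how the notebook below is organized (it keeps the two steps separate); the four messages and their labels are made up purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, only to show the flow end to end.
toy_messages = [
    "win a free prize now",
    "free cash offer, claim now",
    "are we still meeting for lunch",
    "see you at home tonight",
]
toy_labels = [0, 0, 1, 1]  # 0 = spam, 1 = ham (the encoding used later in this notebook)

# Text -> TF-IDF features -> logistic regression, fitted in one call.
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_messages, toy_labels)

# New data goes through the same steps automatically.
toy_prediction = toy_pipeline.predict(["claim your free prize"])
print(toy_prediction)
```

A pipeline is convenient when the model is saved or cross-validated, because the vectorizer can never be applied inconsistently between training and prediction.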

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer  # text data -> numerical
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score     # evaluate the model

Data Collection and preprocessing

In [3]:
raw_data = pd.read_csv('/content/Spam_email_prediction.csv')

In [4]:
raw_data.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
raw_data.shape

(5572, 2)

In [6]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [7]:
raw_data.describe()

,Category,Message
count,5572,5572
unique,2,5157
top,ham,"Sorry, I'll call later"
freq,4825,30


In [8]:
# check whether any missing values exist
raw_data.isnull().sum()

Category    0
Message     0
dtype: int64

In [9]:
# replace any null values with an empty string
mail_data = raw_data.where((pd.notnull(raw_data)),'')
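
Because `isnull().sum()` reports zero missing values in both columns, this replacement leaves the data unchanged; it only guards against nulls in other versions of the file. A more common spelling of the same operation is `fillna('')`. The sketch below uses a made-up two-row frame (`toy_nulls` is not part of the dataset) to check that the two forms agree:

```python
import pandas as pd

# Hypothetical frame with one missing value in each column.
toy_nulls = pd.DataFrame({
    "Category": ["ham", None],
    "Message": [None, "hello"],
})

# Form used in the notebook: keep non-null cells, otherwise substitute ''.
via_where = toy_nulls.where(pd.notnull(toy_nulls), "")

# Equivalent, shorter form.
via_fillna = toy_nulls.fillna("")

print(via_where.equals(via_fillna))        # True
print(int(via_fillna.isnull().sum().sum()))  # 0
```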

In [10]:
mail_data.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [11]:
mail_data.shape

(5572, 2)

Label Encoding

spam = 0, ham = 1

In [12]:
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

In [13]:
mail_data.head()

,Category,Message
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."
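
The `.loc` assignments leave `Category` with dtype `object` (visible in the `y.head()` output further down), which is why the labels are cast to `int` before training. An alternative, sketched here on a made-up frame, is `Series.map` with an explicit dictionary; it produces an integer column in one step. `toy_mail` and `label_map` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real data.
toy_mail = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you soon", "win cash now", "ok thanks"],
})

# Same encoding as the notebook: spam = 0, ham = 1.
label_map = {"spam": 0, "ham": 1}
toy_mail["Category"] = toy_mail["Category"].map(label_map)

print(toy_mail["Category"].tolist())  # [1, 0, 1]
print(pd.api.types.is_integer_dtype(toy_mail["Category"]))  # True
```

One caveat: any category missing from the dictionary becomes `NaN` under `map`, so the dictionary should cover every label present in the column.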


In [14]:
# split into features (messages) and target (1 = ham, 0 = spam)
x = mail_data['Message']
y = mail_data['Category']

In [15]:
x.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Message, dtype: object

In [16]:
y.head()

0    1
1    1
2    0
3    1
4    1
Name: Category, dtype: object

In [17]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=2)

In [18]:
print('      x.shape :',x.shape)
print('x_train.shape :',x_train.shape, '   x_test.shape :',x_test.shape)
print('y_train.shape :',y_train.shape, '   y_test.shape :',y_test.shape)

      x.shape : (5572,)
x_train.shape : (4457,)    x_test.shape : (1115,)
y_train.shape : (4457,)    y_test.shape : (1115,)
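
The `describe()` output shows that 4825 of the 5572 messages are ham, so the classes are imbalanced. `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both splits. The notebook does not use it, so the following is only a sketch of the option on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 80% ham (1) and 20% spam (0).
toy_x = np.arange(100)
toy_y = np.array([1] * 80 + [0] * 20)

# stratify=toy_y preserves the 80/20 ratio in both the train and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=2, stratify=toy_y
)

print(y_tr.mean(), y_te.mean())  # 0.8 0.8
```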


**Feature Extraction**

In [22]:
# convert text data to numerical feature vectors, used as input to logistic regression
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
# min_df=1 : keep every term that appears in at least 1 document (the default, so no term is dropped for rarity)
# stop_words='english' : drop common English words that carry little meaning (is, the, etc.)
# lowercase=True : convert all text to lowercase before tokenizing
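
To see what these parameters do in practice, here is a small sketch on three made-up sentences (not drawn from the dataset). Raising `min_df` to 2 keeps only terms found in at least two documents, while `min_df=1` keeps everything that survives stop-word removal:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents for illustration only.
docs = ["free prize now", "free lunch today", "the lunch was good"]

# min_df=1: no term is dropped for being rare.
keep_all = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True).fit(docs)

# min_df=2: a term must occur in at least two documents to be kept.
keep_common = TfidfVectorizer(min_df=2, stop_words='english', lowercase=True).fit(docs)

print(sorted(keep_all.vocabulary_))
print(sorted(keep_common.vocabulary_))   # ['free', 'lunch']
print('the' in keep_all.vocabulary_)     # False, removed as a stop word
```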

In [23]:
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)
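
The vectorizer is fitted on the training messages only and then reused, unchanged, on the test messages. This keeps the vocabulary and IDF weights free of test-set information, and it means words never seen in training are simply ignored at prediction time. A minimal illustration with made-up text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data for illustration only.
train_docs = ["free prize", "lunch today"]
test_docs = ["free lunch voucher"]   # "voucher" never appears in training

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them; learns nothing new

print(train_matrix.shape, test_matrix.shape)  # (2, 4) (1, 4)
print("voucher" in vec.vocabulary_)           # False
print(test_matrix.nnz)                        # 2, only "free" and "lunch" are counted
```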

In [24]:
# convert y_train and y_test from object dtype to integers
y_train = y_train.astype('int')
y_test = y_test.astype('int')

In [26]:
print(x_train_features)

  (0, 4334)	0.42941702167641554
  (0, 3958)	0.6161071828926096
  (0, 6586)	0.44333254982109394
  (0, 6927)	0.48935591439341625
  (1, 2121)	0.3573617143022146
  (1, 1428)	0.5869421390016223
  (1, 6971)	0.42812434651556874
  (1, 3168)	0.5869421390016223
  (2, 5115)	0.3408491178137899
  (2, 7353)	0.319881180619685
  (2, 3852)	0.3408491178137899
  (2, 4884)	0.35749230587184955
  (2, 5695)	0.35749230587184955
  (2, 806)	0.2673024939370533
  (2, 5894)	0.35749230587184955
  (2, 1876)	0.2875172512410733
  (2, 6878)	0.35749230587184955
  (3, 197)	0.3652223710706673
  (3, 3723)	0.16297045459835785
  (3, 2435)	0.26698378141852
  (3, 1825)	0.26858331513730566
  (3, 5231)	0.2266831802864503
  (3, 300)	0.29159698754651975
  (3, 7248)	0.23571908490908416
  (3, 5005)	0.3169028431039865
  :	:
  (4454, 2244)	0.2526916142542512
  (4454, 666)	0.28653660324238944
  (4454, 1575)	0.20946314330145208
  (4454, 1094)	0.24862733340971147
  (4454, 5068)	0.2228435763245017
  (4454, 311)	0.1954719597423795
  (4454,

Training the ML model - Logistic Regression

In [27]:
model = LogisticRegression()

In [28]:
# training the Logistic Regression model with the training data
model.fit(x_train_features, y_train)

Evaluating the trained model

In [33]:
# prediction on training data
prediction_on_training_data = model.predict(x_train_features)
accuracy_on_training_data = accuracy_score(y_train, prediction_on_training_data)
print('Accuracy on training data :', accuracy_on_training_data)

Accuracy on training data : 0.9683643706529056


In [34]:
prediction_on_test_data = model.predict(x_test_features)
accuracy_on_test_data = accuracy_score(y_test, prediction_on_test_data)
print('Accuracy on testing data :', accuracy_on_test_data)

Accuracy on testing data : 0.9524663677130045
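
Accuracy alone can flatter a model on imbalanced data. From the `describe()` output, ham makes up 4825 of 5572 messages, so a classifier that always answers ham would already score about 0.866 on the full dataset. The sketch below uses hypothetical labels (not the notebook's predictions) to show how a confusion matrix and the recall on the spam class expose this:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 ham (1) and 2 spam (0); not the notebook's results.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# A "classifier" that predicts ham for every message.
y_always_ham = np.ones_like(y_true)

acc = accuracy_score(y_true, y_always_ham)
spam_recall = recall_score(y_true, y_always_ham, pos_label=0)
cm = confusion_matrix(y_true, y_always_ham, labels=[0, 1])

print(acc)          # 0.8, looks respectable
print(spam_recall)  # 0.0, yet no spam message is caught
print(cm)           # rows: true spam, true ham; columns: predicted spam, predicted ham
```

Applied to the notebook's own variables, the same calls would take `y_test` and `prediction_on_test_data` as arguments.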


Building a Predictive System

In [35]:
input_mail = ["'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's'"]

In [37]:
# convert text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# making the prediction
prediction = model.predict(input_data_features)
print(prediction)

[0]
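
The two steps above (vectorize, then predict) can be wrapped in a small helper that also translates the numeric label back to a name. `classify_mail` and `LABEL_NAMES` are hypothetical names, and the toy model exists only to make the sketch runnable; in the notebook one would pass `feature_extraction` and `model` instead. Note that the sample message above appears to be row 2 of the dataset itself, so a message the model has never seen would be a stronger check.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Encoding used in this notebook: spam = 0, ham = 1.
LABEL_NAMES = {0: "spam", 1: "ham"}

def classify_mail(text, vectorizer, classifier):
    """Return 'spam' or 'ham' for a single message (hypothetical helper)."""
    features = vectorizer.transform([text])       # reuse the fitted vocabulary
    label = int(classifier.predict(features)[0])  # 0 or 1
    return LABEL_NAMES[label]

# Toy stand-ins for the notebook's fitted vectorizer and model (assumed data).
toy_messages = [
    "win a free prize now",
    "free cash offer, claim now",
    "are we still meeting for lunch",
    "see you at home tonight",
]
toy_labels = [0, 0, 1, 1]
toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

result = classify_mail("claim your free prize", toy_vectorizer, toy_model)
print(result)
```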
