Problem Statement:-
Predict whether a mail is spam or non-spam (ham)

Work Flow:-
1) Data collection
2) Data preprocessing
3) Split the data into training and test sets
4) Logistic regression model
5) Train the logistic regression model
6) New mail data
7) Predict whether the new mail is spam or non-spam

Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
#train_test_split splits the data into training and test sets
from sklearn.feature_extraction.text import TfidfVectorizer
#TfidfVectorizer converts text into numeric TF-IDF feature vectors
from sklearn.linear_model import LogisticRegression
#LogisticRegression classifies each mail as spam or ham
from sklearn.metrics import accuracy_score
#accuracy_score measures the fraction of predictions that are correct

In [2]:
#data collection and preprocessing

In [3]:
#load the data into pandas dataframe
data = pd.read_csv("D:/DS-ML-projects/mail_data.csv")

In [4]:
#there are 5572 rows and 2 columns

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [6]:
#replace any null values with empty strings (data.info() above reports no nulls, so this is only a safeguard)
mail_data = data.where((pd.notnull(data)),'')

In [7]:
#print first 5 rows of dataframe
mail_data.head()

   Category  Message
0  ham       Go until jurong point, crazy.. Available only ...
1  ham       Ok lar... Joking wif u oni...
2  spam      Free entry in 2 a wkly comp to win FA Cup fina...
3  ham       U dun say so early hor... U c already then say...
4  ham       Nah I don't think he goes to usf, he lives aro...


In [8]:
#print the shape of this dataset
mail_data.shape

(5572, 2)

In [9]:
#there are 5572 rows and 2 columns

Label Encoding

In [10]:
#label spam mail as 0 and ham mail as 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category']=0
mail_data.loc[mail_data['Category'] == 'ham', 'Category']=1
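
The two `.loc` assignments above leave the column with dtype `object` (visible in the later `print(Y)` output), which is why the labels are cast with `astype('int')` further down. As a minimal sketch of an alternative, assuming the same spam = 0 / ham = 1 convention, a dictionary passed to `map` encodes and casts in one step. The `toy` frame, the `Label` column, and `label_map` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame standing in for mail_data (hypothetical rows, not taken from the dataset)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you soon", "win a free prize now", "ok thanks"],
})

# Same convention as the notebook: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map).astype("int64")

print(toy["Label"].tolist())
print(toy["Label"].dtype)
```

Writing to a new column keeps the original text labels available for inspection.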

In [11]:
mail_data.head()

   Category  Message
0  1         Go until jurong point, crazy.. Available only ...
1  1         Ok lar... Joking wif u oni...
2  0         Free entry in 2 a wkly comp to win FA Cup fina...
3  1         U dun say so early hor... U c already then say...
4  1         Nah I don't think he goes to usf, he lives aro...


In [12]:
#separating the data as text and label

In [13]:
#use Message as the input (X) and Category as the output (Y)

In [14]:
X = mail_data['Message']
Y = mail_data['Category']

In [15]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [16]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


In [17]:
#split the x and y data into train and test data

In [18]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
#test_size is the fraction of the data held out for testing, here 20%,
#so the remaining 80% is used for training
#random_state fixes the random seed so that the same split is produced on every run
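
A small sketch of what `random_state` and `test_size` do, on hypothetical toy data rather than the mail dataset. It also passes `stratify`, which the notebook does not use; this is an optional addition that keeps the spam/ham ratio the same in both splits and may help when one class is much rarer than the other.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 ham (label 1) and 5 spam (label 0)
messages = [f"message {i}" for i in range(25)]
labels = [1] * 20 + [0] * 5

def split(seed):
    # stratify=labels is an assumption of this sketch, not part of the notebook
    return train_test_split(
        messages, labels, test_size=0.2, random_state=seed, stratify=labels
    )

X_tr, X_te, y_tr, y_te = split(3)
X_tr2, X_te2, y_tr2, y_te2 = split(3)

print(len(X_tr), len(X_te))          # sizes of the 80% / 20% splits
print(y_tr.count(0), y_te.count(0))  # spam messages in train and test
print(X_te == X_te2)                 # same seed, same split
```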

In [19]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5572,)
(4457,)
(1115,)


Feature Extraction

In [20]:
#transform the text data to feature vector that can be used as input to the logistic regression

In [39]:
feature_extraction = TfidfVectorizer(min_df =1, stop_words='english', lowercase=True)
#TfidfVectorizer converts each message into a vector of numeric TF-IDF weights
#min_df is a document-frequency threshold: terms that appear in fewer than min_df messages are ignored
#with min_df=1 every term that appears in at least one message is kept
#stop_words='english' drops common English words such as "the" and "is" that carry little information
X_train_feature = feature_extraction.fit_transform(X_train)
#fit_transform learns the vocabulary and IDF weights from the training data and transforms it into numeric features
X_test_feature = feature_extraction.transform(X_test)
#the test data is only transformed, so the vocabulary and weights come from the training data alone

#convert the Y_train and Y_test values to integers

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
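
To make the `min_df` and `stop_words` parameters concrete, here is a minimal sketch on three hypothetical messages (not from the dataset). It compares the vocabulary learned with `min_df=1` and `min_df=2`, and shows that a word unseen during fitting is dropped when new text is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training messages, for illustration only
train_docs = [
    "free prize winner",
    "free ticket winner",
    "the lunch meeting is tomorrow",
]

# min_df=1 keeps every non-stop-word term
vec_all = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
vec_all.fit(train_docs)

# min_df=2 keeps only terms that occur in at least two messages
vec_common = TfidfVectorizer(min_df=2, stop_words="english", lowercase=True)
vec_common.fit(train_docs)

print(sorted(vec_all.vocabulary_))
print(sorted(vec_common.vocabulary_))

# "voucher" was never seen during fit, so it contributes nothing
new_vec = vec_all.transform(["free lunch voucher"])
print(new_vec.shape, new_vec.nnz)
```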

In [40]:
print(X_train)

3075                  Don know. I did't msg him recently.
1787    Do you know why god created gap between your f...
1614                         Thnx dude. u guys out 2nite?
4304                                      Yup i'm free...
3266    44 7732584351, Do you want a New Nokia 3510i c...
                              ...                        
789     5 Free Top Polyphonic Tones call 087018728737,...
968     What do u want when i come back?.a beautiful n...
1667    Guess who spent all last night phasing in and ...
3321    Eh sorry leh... I din c ur msg. Not sad alread...
1688    Free Top ringtone -sub to weekly ringtone-get ...
Name: Message, Length: 4457, dtype: object


In [41]:
#print features in numeric values
print(X_train_feature)

  (0, 2329)	0.38783870336935383
  (0, 3811)	0.34780165336891333
  (0, 2224)	0.413103377943378
  (0, 4456)	0.4168658090846482
  (0, 5413)	0.6198254967574347
  (1, 3811)	0.17419952275504033
  (1, 3046)	0.2503712792613518
  (1, 1991)	0.33036995955537024
  (1, 2956)	0.33036995955537024
  (1, 2758)	0.3226407885943799
  (1, 1839)	0.2784903590561455
  (1, 918)	0.22871581159877646
  (1, 2746)	0.3398297002864083
  (1, 2957)	0.3398297002864083
  (1, 3325)	0.31610586766078863
  (1, 3185)	0.29694482957694585
  (1, 4080)	0.18880584110891163
  (2, 6601)	0.6056811524587518
  (2, 2404)	0.45287711070606745
  (2, 3156)	0.4107239318312698
  (2, 407)	0.509272536051008
  (3, 7414)	0.8100020912469564
  (3, 2870)	0.5864269879324768
  (4, 2870)	0.41872147309323743
  (4, 487)	0.2899118421746198
  :	:
  (4454, 2855)	0.47210665083641806
  (4454, 2246)	0.47210665083641806
  (4455, 4456)	0.24920025316220423
  (4455, 3922)	0.31287563163368587
  (4455, 6916)	0.19636985317119715
  (4455, 4715)	0.30714144758811196
  (

In [42]:
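#each line reads (row, column) weight: the row is the message index, the column is the index of a term in the vocabulary, and the value is that term's TF-IDF weight in the message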

Logistic Regression

In [43]:
model = LogisticRegression()

In [44]:
model.fit(X_train_feature, Y_train)
#X_train_feature holds the input features (the numeric TF-IDF vectors) and
#Y_train holds the target labels, i.e. 0 for spam and 1 for ham

Evaluating the trained model

In [45]:
#prediction on training data
prediction_on_training_data = model.predict(X_train_feature)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

In [46]:
print("Accuracy on training data = ",accuracy_on_training_data)

Accuracy on training data =  0.9676912721561588


In [47]:
#we get about 96.8% accuracy on the training data

In [48]:
#prediction on test data
prediction_on_test_data = model.predict(X_test_feature)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

In [49]:
print("Accuracy on testing data = ",accuracy_on_test_data)

Accuracy on testing data =  0.9668161434977578


In [50]:
#we get about 96.7% accuracy on the test data

In [51]:
#the accuracy on the training data and the test data is nearly the same, so the model generalizes well
#if accuracy is high on the training data but much lower on the test data, the model is overfitted
#if accuracy is low on both the training data and the test data, the model is underfitted

We get about 97% accuracy on both the training and the test data, so the model is a good fit.
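
Accuracy alone can hide weak spam detection if ham messages dominate the data (an assumption here; the class balance is not shown above). The sketch below uses hypothetical labels, not the model's real predictions, to show how a confusion matrix and the precision and recall of the spam class complement the accuracy score.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels with the notebook's encoding: 1 = ham, 0 = spam
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Accuracy of always predicting the most frequent class
baseline = max(y_true.count(0), y_true.count(1)) / len(y_true)

# Rows are true labels, columns are predicted labels, ordered [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Treat spam (0) as the class of interest
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(accuracy, baseline)
print(cm)
print(spam_precision, spam_recall)
```

In this toy case accuracy is 0.9, yet only half of the spam messages are caught, which the spam recall makes visible.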

Building a Predictive System

In [57]:
input_mail = input("Enter mail for checking spam or ham: ")

# Convert text to feature vectors
input_data_features = feature_extraction.transform([input_mail])  # Wrap input_mail in a list

# Make the prediction (1 = ham, 0 = spam)
prediction = model.predict(input_data_features)
print(prediction)
if prediction[0] == 1:
    print("mail is ham, i.e. non-spam")
else:
    print("mail is spam")

Enter mail for checking spam or ham:  Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's


[0]
mail is spam
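
The vectorizer and the model must always be applied together and in the same order. One way to guarantee that, sketched here under the assumption that a scikit-learn `Pipeline` is acceptable for this project, is to bundle both steps and wrap prediction in a small function. The training messages, `spam_filter`, and `classify` are hypothetical and stand in for the real dataset and model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 0 = spam, 1 = ham
train_messages = [
    "win a free prize today",
    "free entry to win cash",
    "claim your free prize",
    "lunch at noon tomorrow",
    "meeting moved to monday",
    "see you at lunch",
]
train_labels = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes the text and then classifies it
spam_filter = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_filter.fit(train_messages, train_labels)

def classify(message):
    """Return 'ham' or 'spam' for a single message string."""
    label = int(spam_filter.predict([message])[0])
    return "ham" if label == 1 else "spam"

print(classify("free prize waiting for you"))
```

Because raw text goes in and a label comes out, the fitted pipeline can also be saved and reused without keeping a separate vectorizer object.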
