Project 5: Spam Mail Classifier

In [1]:
import numpy as np
import pandas as pd


Load dataset

In [4]:
sp_dataset=pd.read_csv("/content/mail_data.csv")

In [5]:
sp_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


Replacing the null values

In [7]:
# .where() keeps each non-null value and replaces null values with an empty string
# (info() above reports no nulls, so this is only a safeguard)
mail_data=sp_dataset.where((pd.notnull(sp_dataset)),'')

In [9]:
mail_data.head()

   Category                                            Message
0       ham  Go until jurong point, crazy.. Available only ...
1       ham                      Ok lar... Joking wif u oni...
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...
3       ham  U dun say so early hor... U c already then say...
4       ham  Nah I don't think he goes to usf, he lives aro...
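To make the null-replacement step concrete, here is a minimal sketch on a made-up two-row frame (the `toy` data is hypothetical, not taken from mail_data.csv). It suggests that `.where(pd.notnull(...), '')` and the more direct `.fillna('')` give the same result for this kind of data.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing message (illustration only)
toy = pd.DataFrame({"Category": ["ham", "spam"], "Message": ["hello", np.nan]})

# .where keeps entries where the condition is True and substitutes '' elsewhere
via_where = toy.where(pd.notnull(toy), "")

# .fillna('') expresses the same replacement more directly
via_fillna = toy.fillna("")

print(via_where["Message"].tolist())
print(via_fillna["Message"].tolist())
```

Either form should work in the notebook; `.fillna('')` is arguably easier to read.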


Encoding the label

In [58]:
# Encode the labels: spam -> 1, ham -> 0
mail_data.loc[mail_data['Category']=='spam','Category']=1
mail_data.loc[mail_data['Category']=='ham','Category']=0

In [59]:
mail_data.head()

   Category                                            Message
0         0  Go until jurong point, crazy.. Available only ...
1         0                      Ok lar... Joking wif u oni...
2         1  Free entry in 2 a wkly comp to win FA Cup fina...
3         0  U dun say so early hor... U c already then say...
4         0  Nah I don't think he goes to usf, he lives aro...
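As an alternative to the two `.loc` assignments, the encoding can be sketched with `Series.map`, which converts every label in one step and yields an integer column directly. The frame `toy` and the column name `Label` are made up for illustration.

```python
import pandas as pd

# Made-up labels (illustration only)
toy = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Map each text label to an integer; any label missing from the dict becomes NaN
toy["Label"] = toy["Category"].map({"ham": 0, "spam": 1})

print(toy["Label"].tolist())
```

One design consideration: an unexpected label shows up as NaN with `map`, whereas the `.loc` approach would silently leave it as text.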


In [60]:
print(mail_data)

      Category                                            Message
0            0  Go until jurong point, crazy.. Available only ...
1            0                      Ok lar... Joking wif u oni...
2            1  Free entry in 2 a wkly comp to win FA Cup fina...
3            0  U dun say so early hor... U c already then say...
4            0  Nah I don't think he goes to usf, he lives aro...
...        ...                                                ...
5567         1  This is the 2nd time we have tried 2 contact u...
5568         0               Will ü b going to esplanade fr home?
5569         0  Pity, * was in mood for that. So...any other s...
5570         0  The guy did some bitching but I acted like i'd...
5571         0                         Rofl. Its true to its name

[5572 rows x 2 columns]


Splitting the mail data into input (message) and output (label) data

In [61]:
X=mail_data["Message"]
y=mail_data["Category"]

In [62]:
print(X)
print()
print(y)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: Category, Length: 5572, dtype: int64


Splitting the data into TRAINING AND TESTING DATA

In [16]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=3)


In [63]:
print(X_train.shape)
print()
print(X_test.shape)
print()
print(y_test)

(4457,)

(1115,)

2632    1
454     0
983     1
1282    0
4610    0
       ..
4827    0
5291    0
3325    0
3561    0
1136    0
Name: Category, Length: 1115, dtype: int64
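If the spam and ham classes are unevenly represented, a purely random split can leave the test set with a different spam ratio than the training set. A minimal sketch, on made-up labels, of the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts (the arrays `X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 ham (0) and 20 spam (1) labels
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100)

# stratify=y_toy asks for the same class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=3, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())  # share of spam in each part
```

In the notebook this would amount to adding `stratify=y` to the existing call; note that doing so changes which rows land in each split, so the reported accuracies would differ.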


Feature extraction: transform the text data into feature vectors that can be passed as input to the logistic regression model

We use TfidfVectorizer() to transform the text into feature vectors. The parameters used here are:
- min_df: the minimum number of documents a term must appear in to be kept; with min_df=1 every term is kept.
- stop_words="english": drops common English words such as "is" and "are" that carry little information.
- lowercase=True: converts all text to lowercase before tokenizing.
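The effect of these parameters can be seen on a small made-up corpus. In this sketch `min_df=2` is used instead of the notebook's `min_df=1` so that the filtering is visible; the three sentences in `docs` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus (illustration only)
docs = [
    "Win a free prize today",
    "Free entry to win cash",
    "Are you coming home today",
]

# min_df=2 keeps only terms that occur in at least two documents;
# stop_words="english" removes common words such as "are" and "you"
vec = TfidfVectorizer(min_df=2, stop_words="english", lowercase=True)
matrix = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # terms that survived the filtering
print(matrix.shape)             # (number of documents, number of kept terms)
```

Terms that occur in only one document, such as "prize", are dropped here, and so are the stop words. With `min_df=1`, as in the notebook, only the stop words are removed.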

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [22]:
# lowercase expects a boolean, not the string "True"
vectorizer=TfidfVectorizer(min_df=1,stop_words="english",lowercase=True)

In [64]:
# Vectorize the X_train dataset
X_train_feature=vectorizer.fit_transform(X_train)

# Vectorize the X_test dataset
X_test_feature=vectorizer.transform(X_test)

# Convert y_train and y_test to integer dtype
y_train=y_train.astype('int')
y_test=y_test.astype('int')



In [66]:
print(y_test)

2632    1
454     0
983     1
1282    0
4610    0
       ..
4827    0
5291    0
3325    0
3561    0
1136    0
Name: Category, Length: 1115, dtype: int64


MODEL TRAINING

In [67]:
#using logistic Regression model
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()


Fitting the model to the training data

In [68]:
model.fit(X_train_feature,y_train)

Predicting the labels of the training data

In [69]:
y_train_predict=model.predict(X_train_feature)

Calculating the accuracy on training data

In [70]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
Accuracy_train=accuracy_score(y_train,y_train_predict)

In [71]:
print(Accuracy_train *100,"%")

96.70181736594121 %


TESTING THE MODEL

In [73]:
y_test_predict= model.predict(X_test_feature)

Calculating the accuracy on testing data

In [74]:
Accuracy_test=accuracy_score(y_test,y_test_predict)
print (Accuracy_test*100,"%")

96.59192825112108 %
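Accuracy is the fraction of predictions that match the true labels. A minimal sketch on made-up labels showing that `accuracy_score` agrees with a manual count, and how `confusion_matrix` breaks the result down per class (the arrays `y_true` and `y_pred` are hypothetical, not the notebook's results):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (0 = ham, 1 = spam)
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
manual = (y_true == y_pred).mean()  # correct predictions / all predictions

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(acc, manual)
print(cm)
```

The confusion matrix can be worth checking alongside accuracy, since a single accuracy figure does not show how many spam messages were missed versus how many ham messages were flagged.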


SPAM E-MAIL CLASSIFYING SYSTEM

In [88]:
# input the mail from the user
input_mail=["07732584351 - Rodger Burns - MSG = We tried to call you re your reply to our sms for a free nokia mobile + free camcorder. Please call now 08000930705 for delivery tomorrow"]

# convert the input mail into feature vectors
vec_ip_mail=vectorizer.transform(input_mail)

# make a prediction
prediction=model.predict(vec_ip_mail)

if prediction[0]==1:  # predict() returns an array, so compare its first element
  print(prediction)
  print("SPAM MAIL !!!")
else:
  print(prediction)
  print("HAM MAIL !!")


[1]
SPAM MAIL !!!
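The vectorize-then-predict steps can also be bundled into one object so that new messages are classified with a single call. The following is a sketch under stated assumptions, not the notebook's own method: the training texts are made up, and the names `spam_clf` and `classify_mail` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data (illustration only): 1 = spam, 0 = ham
train_texts = [
    "Win a free prize now",
    "Free entry to win cash",
    "Claim your free mobile today",
    "Are you coming home tonight",
    "See you at lunch tomorrow",
    "Thanks for the notes from class",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline applies TF-IDF and then logistic regression in one fit/predict call
spam_clf = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_clf.fit(train_texts, train_labels)

def classify_mail(text):
    """Return 'SPAM' or 'HAM' for a single message string."""
    prediction = spam_clf.predict([text])
    return "SPAM" if prediction[0] == 1 else "HAM"

print(classify_mail("Call now to claim your free prize"))
```

In the notebook, the same pipeline could be fitted on `X_train` and `y_train` in place of the toy lists, which avoids having to remember to call `vectorizer.transform` before every prediction.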
