<a href="https://colab.research.google.com/github/ShreyaNayak04/EmailSpamClassifier/blob/main/spamMailPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [38]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer    # converts text to numerical feature vectors
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Data Collection and Pre-processing

In [39]:
df = pd.read_csv('mail_data.csv')

In [40]:
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [41]:
df.tail()

Unnamed: 0,Category,Message
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [42]:
df.describe()

Unnamed: 0,Category,Message
count,5572,5572
unique,2,5157
top,ham,"Sorry, I'll call later"
freq,4825,30
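
The `describe()` output lists ham as the top category with 4825 of the 5572 rows, so the two classes are imbalanced. As a rough sketch of what that means for later accuracy figures, the arithmetic below uses only the counts printed above to work out how well a trivial "always ham" classifier would score; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the df.describe() output above.
total_messages = 5572
ham_messages = 4825
spam_messages = total_messages - ham_messages

# Accuracy of a trivial classifier that labels every message as ham.
majority_baseline = ham_messages / total_messages

print(f"spam messages: {spam_messages}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any model trained on this data should be compared against that baseline rather than against 50%.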


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


**Replace null values with an empty string**

In [44]:
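# Replace missing values with an empty string. df.info() above reports no nulls,
# so df1 holds the same values as df, which the later cells keep using.
df1 = df.where(pd.notnull(df), '')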
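
To see what this null replacement does when a value is actually missing, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from `mail_data.csv`). It also shows `fillna('')`, which is a more common way to get the same result.

```python
import pandas as pd

# Hypothetical toy frame with one missing message.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["hello", None, "see you"],
})

# Same pattern as the notebook: keep non-null cells, replace the rest with ''.
filled_where = toy.where(pd.notnull(toy), "")

# Equivalent and shorter alternative.
filled_fillna = toy.fillna("")

print(filled_where)
print(filled_where.equals(filled_fillna))
```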

In [45]:
df1.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [46]:
df1.shape

(5572, 2)

Change ham to 1 and spam to 0

In [47]:
df.loc[df['Category'] == 'spam', 'Category'] = 0
df.loc[df['Category'] == 'ham', 'Category'] = 1

In [48]:
df.head()

Unnamed: 0,Category,Message
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."
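
The same encoding can also be written with `Series.map`, which returns integers directly instead of leaving the column as `object` dtype. This is a sketch on a made-up Series, assuming the only labels are `ham` and `spam`; `label_map` is an illustrative name.

```python
import pandas as pd

# Hypothetical labels standing in for df['Category'].
labels = pd.Series(["ham", "spam", "ham", "ham"], name="Category")

# Same convention as the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
encoded = labels.map(label_map).astype("int64")

print(encoded.tolist())
print(encoded.dtype)
```

Any label missing from `label_map` would become NaN and make the `astype` call fail, which surfaces unexpected categories early.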


Separating the text and label

In [49]:
X = df['Message']
y = df['Category']

In [50]:
print(X.shape)
print(y.shape)

(5572,)
(5572,)

In [51]:
X

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object

In [52]:
y

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object

Train Test Split

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2529)

In [54]:
print(X.shape)
print(X_train.shape)
print(y_train.shape)

(5572,)
(4457,)
(4457,)
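
Because spam is the minority class, one option worth considering is a stratified split, which keeps the spam/ham ratio the same in both partitions. The notebook does not do this; the sketch below only illustrates the `stratify` argument of `train_test_split` on made-up data with a 20/80 class ratio.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical data: 20 "spam" (0) and 80 "ham" (1) placeholders.
messages = [f"message {i}" for i in range(100)]
labels = [0] * 20 + [1] * 80

X_tr, X_te, y_tr, y_te = train_test_split(
    messages,
    labels,
    test_size=0.2,
    random_state=2529,
    stratify=labels,  # preserve the class ratio in train and test
)

print(Counter(y_tr))
print(Counter(y_te))
```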


Feature Extraction.
Transform the text data into feature vectors that can be used as input to the LogisticRegression model.

In [65]:
# fit TF-IDF on the training messages only, then reuse the fitted vocabulary for the test messages

feature_extraction = TfidfVectorizer(min_df = 1, stop_words='english')

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# convert y_train and y_test from object dtype to integers

y_train = y_train.astype('int')
y_test = y_test.astype('int')
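
The reason for calling `fit_transform` on the training text and only `transform` on the test text is that the vocabulary must come from the training data alone. Here is a small sketch with made-up sentences: a word the vectorizer never saw during fitting (`unknownword`, a placeholder) is simply ignored at transform time, and both matrices share the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, not taken from mail_data.csv.
train_texts = ["free entry win cup", "ok lar joking", "win free prize"]
test_texts = ["free prize unknownword"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")

# Learn vocabulary and IDF weights from the training texts only.
train_features = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary; unseen words get no column.
test_features = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(train_features.shape, test_features.shape)
```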

In [66]:
print(X_train)

4824                                              :-) :-)
2853                           how tall are you princess?
2456    Abeg, make profit. But its a start. Are you us...
4038    Dont flatter yourself... Tell that man of mine...
1173                              Happy new years melody!
                              ...                        
3656                      Senthil group company Apnt 5pm.
5039    Thanks for being there for me just to talk to ...
4836    OH RITE. WELL IM WITH MY BEST MATE PETE, WHO I...
2876    Idk. You keep saying that you're not, but sinc...
5472    Well obviously not because all the people in m...
Name: Message, Length: 4457, dtype: object


In [67]:
print(X_train_features)

  (1, 5305)	0.5789200998158937
  (1, 6559)	0.8153842762950213
  (2, 2640)	0.4187286783470488
  (2, 6257)	0.4391746169598045
  (2, 7052)	0.3580172791257166
  (2, 6306)	0.3026187775921764
  (2, 5338)	0.39296984204650115
  (2, 4256)	0.25443626334336766
  (2, 775)	0.4391746169598045
  (3, 4425)	0.33143345179494843
  (3, 1651)	0.46978962898995447
  (3, 5119)	0.46978962898995447
  (3, 4265)	0.3124018804762068
  (3, 6608)	0.2566552158373531
  (3, 2850)	0.46978962898995447
  (3, 2395)	0.2541223047474187
  (4, 4365)	0.6771495151319988
  (4, 7488)	0.4943298750030536
  (4, 4673)	0.3706911710537141
  (4, 3289)	0.3996180232907298
  (5, 5340)	0.7357795587192053
  (5, 4948)	0.6772211167491541
  (6, 4898)	0.48744754584163635
  (6, 5140)	0.5134166517828614
  (6, 6869)	0.43318007014481547
  :	:
  (4454, 3557)	0.24042713516295264
  (4454, 4125)	0.3357732335001563
  (4454, 7265)	0.4506147193317603
  (4455, 7159)	0.29159107459830874
  (4455, 1573)	0.29159107459830874
  (4455, 3324)	0.278015943019586
  (445
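
Each printed entry is `(row, column)` followed by the TF-IDF weight of that term in that message. `TfidfVectorizer` L2-normalises every row by default, which can be checked by hand for row 1 using the two weights shown at the top of the output; this is a quick sanity check, not part of the original notebook.

```python
import math

# The two non-zero weights printed above for row 1.
row_1_weights = [0.5789200998158937, 0.8153842762950213]

# With the default norm='l2', the squared weights of a row sum to 1.
row_1_norm = math.sqrt(sum(w * w for w in row_1_weights))

print(row_1_norm)
```

Row 0 does not appear in the output at all, which is consistent with the first training message (`:-) :-)`) containing no tokens the vectorizer keeps.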

Training the ML model

In [68]:
model = LogisticRegression()

In [69]:
# training the model with the training data
model.fit(X_train_features, y_train)
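
An alternative to keeping the vectorizer and the model as separate objects is to chain them with scikit-learn's `make_pipeline`, so that raw text can be passed straight to `fit` and `predict`. The notebook does not use a pipeline; this is a sketch on a tiny made-up corpus, and `spam_pipeline` and the example messages are illustrative names and data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, only to show the mechanics (0 = spam, 1 = ham).
train_texts = [
    "win free prize now",
    "free entry win cash",
    "claim free prize cash",
    "lunch at noon today",
    "meeting tomorrow morning",
    "ok talk soon",
]
train_labels = [0, 0, 0, 1, 1, 1]

# The vectorizer settings mirror the ones used in the notebook.
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english"),
    LogisticRegression(),
)
spam_pipeline.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes them before predicting.
new_messages = ["free cash prize", "lunch tomorrow"]
predictions = spam_pipeline.predict(new_messages)
print(predictions)
```

Bundling the two steps makes it harder to accidentally fit the vectorizer on test data or to forget the transform step when scoring new mail.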

Evaluate the trained model

In [77]:
prediction_on_train = model.predict(X_train_features)
accuracy_on_train = accuracy_score(y_train, prediction_on_train)

In [78]:
print('Accuracy on training data is : ', accuracy_on_train)

Accuracy on training data is :  0.9690374691496523


Prediction on test data

In [79]:
prediction_on_test = model.predict(X_test_features)
accuracy_on_test = accuracy_score(y_test, prediction_on_test)
print('Accuracy on test data is : ', accuracy_on_test)

Accuracy on test data is :  0.967713004484305
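
Accuracy alone can hide how well the minority class is handled, so it may be worth looking at a confusion matrix and at precision and recall for the spam label (0). The sketch below uses small made-up label arrays rather than the notebook's predictions, purely to show the scikit-learn calls.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (0 = spam, 1 = ham).
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, ordered [spam, ham].
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

accuracy = accuracy_score(y_true, y_pred)
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(matrix)
print(accuracy, spam_precision, spam_recall)
```

In the notebook itself, the same calls would take `y_test` and `prediction_on_test` as arguments.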
