<a href="https://colab.research.google.com/github/Dharma-Ranganathan/AllAboutPython/blob/main/Spam_Mail_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Overview:
1. Data collection from my drive
2. Data cleaning and Pre-processing
3. Splitting Training and Testing data
4. Model Selection
5. Model Evaluation

Importing required libraries

In [26]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

Data collection

In [2]:
mail = pd.read_csv('/content/drive/MyDrive/Colab_python/mail_data.csv')
mail.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Data cleaning and Pre-processing

In [3]:
# check whether the dataset contains null values
mail.isnull().sum()

Column,Null count
Category,0
Message,0


In [4]:
# check whether the two classes are balanced
mail.value_counts('Category')

Category,count
ham,4825
spam,747


The dataset is imbalanced (4825 ham mails vs. 747 spam mails), so we downsample the ham class to match the spam class.

In [6]:
ham_mail = mail[mail['Category'] == 'ham']
spam_mail = mail[mail['Category'] == 'spam']

In [7]:
# check the shapes of the ham and spam subsets
print(ham_mail.shape, spam_mail.shape)

(4825, 2) (747, 2)


In [9]:
# randomly downsample ham_mail to the same number of rows as spam_mail
ham_mail_sampling = ham_mail.sample(n=len(spam_mail))
ham_mail_sampling.shape

(747, 2)

We now have 747 ham mails and 747 spam mails.
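The downsampling step can be sketched on toy data as follows. The toy DataFrame and the `random_state=42` argument are assumptions added for illustration; the notebook itself samples without a fixed seed, so its selected rows change on every run.

```python
import pandas as pd

# Toy stand-in for the mail dataset (hypothetical data, not the real CSV)
toy = pd.DataFrame({
    "Category": ["ham"] * 8 + ["spam"] * 3,
    "Message": [f"message {i}" for i in range(11)],
})

ham = toy[toy["Category"] == "ham"]
spam = toy[toy["Category"] == "spam"]

# Downsample the majority class to the size of the minority class.
# random_state is an assumed addition so that reruns pick the same rows.
ham_down = ham.sample(n=len(spam), random_state=42)

balanced = pd.concat([ham_down, spam], axis=0)
counts = balanced["Category"].value_counts()
print(counts)
```

Fixing the seed keeps the balanced subset, and therefore the accuracy figures, stable between runs.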

In [12]:
# concatenate the sampled ham mails with the spam mails

mail = pd.concat([ham_mail_sampling, spam_mail], axis=0)
mail.shape

(1494, 2)

The mail dataset is now balanced. Next, we separate the feature (message text) from the label (category).

In [24]:
feature = mail['Message']
label = mail['Category']

print(feature)
print(label)

4358    HELLOGORGEOUS, HOWS U? MY FONE WAS ON CHARGE L...
5475    Dhoni have luck to win some big title.so we wi...
202     Hello darlin ive finished college now so txt m...
1410                             Where at were hungry too
3979    Reason is if the team budget is available at l...
                              ...                        
5537    Want explicit SEX in 30 secs? Ring 02073162414...
5540    ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
5547    Had your contract mobile 11 Mnths? Latest Moto...
5566    REMINDER FROM O2: To get 2.50 pounds free call...
5567    This is the 2nd time we have tried 2 contact u...
Name: Message, Length: 1494, dtype: object
4358     ham
5475     ham
202      ham
1410     ham
3979     ham
        ... 
5537    spam
5540    spam
5547    spam
5566    spam
5567    spam
Name: Category, Length: 1494, dtype: object


Converting features into numerical data

In [25]:
# convert the message text into TF-IDF feature vectors
vectorizer = TfidfVectorizer()

vectorizer.fit(feature)

feature = vectorizer.transform(feature)

print(feature)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 25140 stored elements and shape (1494, 4613)>
  Coords	Values
  (0, 427)	0.2161028460882127
  (0, 544)	0.23773774080088791
  (0, 786)	0.2250821386869225
  (0, 930)	0.1372684010743065
  (0, 1275)	0.181812349261572
  (0, 1583)	0.181812349261572
  (0, 1938)	0.1907916418602818
  (0, 2188)	0.23773774080088791
  (0, 2239)	0.23773774080088791
  (0, 2259)	0.20913796565027248
  (0, 2307)	0.16915674714760656
  (0, 2319)	0.09763592983457864
  (0, 2405)	0.23773774080088791
  (0, 2615)	0.2250821386869225
  (0, 2648)	0.23773774080088791
  (0, 2657)	0.17484746882363178
  (0, 2739)	0.1062664751703504
  (0, 2893)	0.11690766512354622
  (0, 2949)	0.181812349261572
  (0, 2958)	0.23773774080088791
  (0, 3037)	0.09781177568172816
  (0, 3957)	0.17700091264106105
  (0, 4043)	0.23773774080088791
  (0, 4412)	0.14930591258846437
  (0, 4452)	0.2250821386869225
  :	:
  (1493, 163)	0.2975844946775175
  (1493, 303)	0.20049962075893688
  (1493, 428)	0.2021

Label encoding the labels

Ham mail = 0
Spam mail = 1

In [34]:
encoder = LabelEncoder()
label = encoder.fit_transform(label)
print(label)

[0 0 0 ... 1 1 1]
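
`LabelEncoder` assigns integer codes in sorted class order, which is why ham becomes 0 and spam becomes 1. A minimal sketch for checking the mapping, using a small hand-written label list as an assumption:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["spam", "ham", "ham", "spam"])

# classes_ holds the original labels; position i is the label for code i
print(list(enc.classes_))
print(codes.tolist())

# inverse_transform maps codes back to the original labels
print(list(enc.inverse_transform([0, 1])))
```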


Splitting Training and Testing data

In [62]:
x_train,x_test,y_train,y_test = train_test_split(feature,label,test_size=0.2, stratify=label,random_state=42)
print(feature.shape, x_train.shape, x_test.shape)

(1494, 4613) (1195, 4613) (299, 4613)
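
In the cells above, the vectorizer is fitted on all 1494 messages before the split, so its vocabulary and IDF weights also reflect the test messages. A common alternative, sketched here on toy data, is to split the raw text first and fit the vectorizer on the training portion only. The toy messages and variable names are assumptions, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy messages (hypothetical): 0 = ham, 1 = spam
texts = [
    "see you at lunch", "call me when you are home", "ok thanks a lot",
    "are we meeting today", "happy birthday to you",
    "win a free prize now", "free entry call now", "claim your cash prize",
    "urgent call this number", "you have won free cash",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Split the raw text before any fitting takes place
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

vec = TfidfVectorizer()
X_tr = vec.fit_transform(text_train)  # learn vocabulary and IDF from training text only
X_te = vec.transform(text_test)       # reuse the fitted vocabulary on the test text

print(X_tr.shape, X_te.shape)
```

Words that occur only in the test messages are ignored by `transform`, which mirrors what happens when the model later sees new mail.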


Model Selection - LogisticRegression

In [63]:
# create the model and train it on the training data
model = LogisticRegression()

model.fit(x_train,y_train)

Prediction on the training data (seen data)

In [64]:
# predict on the training data

x_train_pred = model.predict(x_train)

# accuracy; accuracy_score expects (y_true, y_pred)

x_train_acc = accuracy_score(y_train, x_train_pred)

print(f"accuracy of x trained : {x_train_acc * 100:.2f}% ")

accuracy of x trained : 98.08% 


Prediction on the test data (unseen data)

In [65]:
# predict on the test data
x_test_pred = model.predict(x_test)

# accuracy; accuracy_score expects (y_true, y_pred)
x_test_acc = accuracy_score(y_test, x_test_pred)

print(f"accuracy of X test data : {x_test_acc * 100:.2f}% ")

accuracy of X test data : 95.32% 
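
Accuracy alone does not show which kind of mistake the model makes. For a spam filter it can help to look at the confusion matrix, precision and recall as well. The sketch below uses small hand-written label lists as an assumption; in the notebook the same calls could be applied to `y_test` and `x_test_pred`.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true and predicted labels: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
prec = precision_score(y_true, y_pred)  # share of predicted spam that really is spam
rec = recall_score(y_true, y_pred)      # share of real spam that was caught

print(acc, prec, rec)
print(cm)
```

In this toy case one ham mail is flagged as spam and one spam mail slips through, which the single accuracy figure would hide.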


Building a predictive system using unseen ham and spam texts

In [74]:
# sample ham message
input_ham = ["Where would I be without my baby? The thought alone might break me and I don't wanna go crazy but everybody needs his lady xxxxxxxx"]

# sample spam message
input_spam = ["You have 1 new message. Please call 08712400200"]

Vectorizing the ham and spam input texts

In [75]:
# vectorize the ham message with the already fitted vectorizer

input_ham = vectorizer.transform(input_ham)

# vectorize the spam message

input_spam = vectorizer.transform(input_spam)

# inspect the resulting sparse vectors
print(input_ham)
print(input_spam)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 21 stored elements and shape (1, 4613)>
  Coords	Values
  (0, 844)	0.25741114276255955
  (0, 868)	0.11516756778010721
  (0, 977)	0.26461131486198963
  (0, 1016)	0.1447561304162616
  (0, 1133)	0.25741114276255955
  (0, 1180)	0.15511731803512993
  (0, 1459)	0.2513234839935269
  (0, 1632)	0.1900878933008219
  (0, 1785)	0.300797112518723
  (0, 2061)	0.15927817459296623
  (0, 2212)	0.24605011410529318
  (0, 2739)	0.13445340559349922
  (0, 2775)	0.26461131486198963
  (0, 2893)	0.14791714593569985
  (0, 2929)	0.26461131486198963
  (0, 4065)	0.10331595378849494
  (0, 4096)	0.23723781565527471
  (0, 4401)	0.22687662803640637
  (0, 4464)	0.21188997130600745
  (0, 4501)	0.26461131486198963
  (0, 4538)	0.22394998478681197
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 7 stored elements and shape (1, 4613)>
  Coords	Values
  (0, 112)	0.6656994523230196
  (0, 1196)	0.2128136966650467
  (0, 2163)	0.2695521168296356
  (0, 276

Prediction system and its helper function

In [76]:
# helper that prints a readable label for a predicted class
def proper(pred):
    if pred == 1:
        print('obtained is spam mail')
    else:
        print('obtained is ham(normal) mail ')

In [79]:
# ham and spam prediction

ham_pred = model.predict(input_ham)
print(ham_pred)
proper(ham_pred[0])

spam_pred = model.predict(input_spam)
print(spam_pred)
proper(spam_pred[0])

[0]
obtained is ham(normal) mail 
[1]
obtained is spam mail
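
The vectorize-then-predict steps can also be bundled so that a raw string goes in and a label comes out. The following is a minimal sketch using a scikit-learn pipeline; the toy training messages and the names `spam_clf` and `classify_message` are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical): 0 = ham, 1 = spam
train_texts = [
    "see you at lunch", "lunch at home today", "see you today", "home at lunch",
    "win free call prize", "prize call cash now", "win free now", "cash call prize",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline applies TF-IDF and then logistic regression in one object
spam_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
spam_clf.fit(train_texts, train_labels)

def classify_message(text):
    """Return 'spam' or 'ham' for a single raw message string."""
    pred = spam_clf.predict([text])[0]
    return "spam" if pred == 1 else "ham"

print(classify_message("free cash prize"))
print(classify_message("see you at home"))
```

Keeping the vectorizer and the model in one pipeline avoids the risk of transforming new text with a different vectorizer than the one used in training.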


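So far, the model performs well: it reaches 98.08% accuracy on the training data and 95.32% on the test data, and it classifies both sample messages correctly.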

Thank you...