<a href="https://colab.research.google.com/github/OSegun/Zummit-Africa-ML-AL-Projects/blob/main/Email_Classification_With_LGR_TFD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Email Classification Using Logistic Regression

This model classifies incoming emails as either spam or not spam.

A dataset from Kaggle is used to train a Logistic Regression model that labels new emails as spam or not spam.

The dataset has two columns:

- Category Feature

- Message Feature

The dataset is available on Kaggle: https://www.kaggle.com/datasets/bhaskarreddy072/mail-datacsv

In [8]:
# Import all necessary dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer # To convert text data into numerical values

In [2]:
mail_data = pd.read_csv('/content/mail_data.csv') # reading the dataset to pandas

In [3]:
mail_data.head() # Display the first few rows of the dataset

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
mail_data.shape # Number of observations (rows) and features (columns)

(5572, 2)

In [7]:
mail_data.describe()

Unnamed: 0,Category,Message
count,5572,5572
unique,2,5157
top,ham,"Sorry, I'll call later"
freq,4825,30
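
The summary above shows that `ham` is the most frequent category (4825 of 5572 rows), so the classes are imbalanced. As a rough sketch, the snippet below works out what accuracy a trivial "always predict ham" rule would reach, using only the counts printed above. The variable names are illustrative and not part of the notebook.

```python
# Counts copied from the describe() output above.
total_messages = 5572
ham_messages = 4825
spam_messages = total_messages - ham_messages

# Accuracy of a rule that labels every message as ham.
majority_baseline = ham_messages / total_messages

print(f"spam messages: {spam_messages}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy reported later is easier to interpret when compared against this baseline.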


In [6]:
mail_data.isnull().sum() # Checking for missing values

Category    0
Message     0
dtype: int64

In [None]:
#mail_data = mail_data.where(pd.notnull(mail_data),'') 

In [9]:
mail_data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [10]:
# Encode the labels: spam = 0, not spam (ham) = 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1
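
An alternative way to encode the labels is `Series.map` with an explicit dictionary, which gives an integer column in one step. The sketch below uses a small made-up frame (`toy`) and a hypothetical `Label` column rather than the real `mail_data`, and keeps the same convention of spam = 0 and ham = 1.

```python
import pandas as pd

# Toy frame standing in for mail_data (illustrative rows only).
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at noon", "win a free prize today", "ok thanks"],
})

# Same encoding as the notebook: spam = 0, ham = 1.
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map).astype(int)

print(toy[["Category", "Label"]])
```

Writing to a separate column keeps the original text labels available for inspection.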

In [11]:
# Split the dataset into input (features) and output (target)
X = np.asarray(mail_data['Message'])
Y = np.asarray(mail_data['Category'])

In [12]:
print(X)
print(Y)

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
 'Ok lar... Joking wif u oni...'
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
 ... 'Pity, * was in mood for that. So...any other suggestions?'
 "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free"
 'Rofl. Its true to its name']
[1 1 0 ... 1 1 1]


In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3) # Split the dataset into 70% train and 30% test

In [14]:
print(X.shape, X_train.shape, X_test.shape)

(5572,) (3900,) (1672,)
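
Because the split above is random and unseeded, the exact rows in each set change from run to run. A possible variation, sketched here on made-up data, passes `random_state` for reproducibility and `stratify` so both sets keep the same spam-to-ham ratio. The toy arrays and the 25% test size are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 messages, 4 of them spam (label 0), to mimic class imbalance.
X_toy = np.array([f"message {i}" for i in range(20)])
Y_toy = np.array([0] * 4 + [1] * 16)

# stratify keeps the class ratio equal in both sets; random_state fixes the shuffle.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.25, stratify=Y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print("spam in train:", (Y_tr == 0).sum(), "spam in test:", (Y_te == 0).sum())
```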


In [16]:
# Initialise the TF-IDF vectorizer, which converts the text to numerical values.
# min_df=1 keeps every term that appears in at least one document, stop_words='english'
# removes common English words (such as pronouns), and lowercase=True lowercases all text.
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

Y_train = Y_train.astype('int') # Convert all values to integer
Y_test =  Y_test.astype('int')
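
To see what `min_df` and `stop_words` do in practice, the following sketch fits two vectorizers on a three-sentence toy corpus and compares their vocabularies. The corpus and variable names are invented for this illustration. With `min_df=2`, only terms found in at least two documents are kept.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (not taken from the dataset).
docs = [
    "win a free prize",
    "free cash prize",
    "the meeting is after lunch",
]

# min_df=1 keeps every non-stop-word term that occurs in at least one document.
vec_all = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
vec_all.fit(docs)

# min_df=2 keeps only terms that occur in at least two documents.
vec_common = TfidfVectorizer(min_df=2, stop_words="english", lowercase=True)
vec_common.fit(docs)

print(sorted(vec_all.vocabulary_))
print(sorted(vec_common.vocabulary_))
```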

In [17]:
print(X_train_features)

  (0, 6403)	0.5126834950492223
  (0, 1476)	0.5740143978671064
  (0, 2556)	0.3805505733138392
  (0, 957)	0.5126834950492223
  (1, 5802)	0.8343586581298142
  (1, 5925)	0.5512219422372587
  (2, 6857)	0.3402475309356483
  (2, 3954)	0.308706217308303
  (2, 1939)	0.33323298182593064
  (2, 6715)	0.2854613543701438
  (2, 773)	0.3586980166311305
  (2, 3292)	0.3485439816248344
  (2, 6622)	0.27910155751709925
  (2, 6600)	0.25507211485402015
  (2, 5190)	0.2956153893764399
  (2, 5293)	0.28324118250310315
  (2, 2666)	0.18888891106532127
  (3, 1041)	0.3709311110965177
  (3, 2493)	0.3234128005862772
  (3, 5196)	0.3709311110965177
  (3, 5311)	0.2692270060646602
  (3, 4313)	0.3709311110965177
  (3, 3639)	0.24135661256904692
  (3, 1965)	0.3533935165266352
  (3, 2857)	0.4764570496179511
  :	:
  (3897, 6634)	0.25690269151428047
  (3898, 1490)	0.4677765236390843
  (3898, 4649)	0.4677765236390843
  (3898, 2580)	0.4677765236390843
  (3898, 4038)	0.33000851220997285
  (3898, 3896)	0.32006362823490025
  (3898, 

In [18]:
model = LogisticRegression() # Instantiate the Logistic Regression model

In [19]:
model.fit(X_train_features, Y_train)
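
One way to keep the vectorizer and the classifier together is scikit-learn's `make_pipeline`, so that raw text can be passed straight to `fit` and `predict`. This is a minimal sketch on four invented messages, not the notebook's trained model. The name `spam_pipeline` and the example messages are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative messages only; labels follow the notebook (spam = 0, ham = 1).
messages = [
    "win a free prize today",
    "free cash reward claim today",
    "lunch meeting moved to friday",
    "see you at the gym tonight",
]
labels = [0, 0, 1, 1]

# The pipeline applies TF-IDF first, then fits the classifier on the result.
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(messages, labels)

# New raw text is vectorized with the fitted vocabulary before prediction.
new_messages = ["claim your free prize", "gym after lunch"]
predictions = spam_pipeline.predict(new_messages)
print(predictions)
```

With a pipeline, the vectorizer is only ever fitted on the training text, which avoids accidentally calling `fit_transform` on test data.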

In [20]:
X_train_prediction = model.predict(X_train_features)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [21]:
print('Accuracy Score of the training data : ', training_data_accuracy)

Accuracy Score of the training data :  0.9651282051282051


In [22]:
X_test_prediction = model.predict(X_test_features)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [23]:
print('Accuracy Score of the test data : ', test_data_accuracy)

Accuracy Score of the test data :  0.9581339712918661
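
Accuracy alone can hide weak spam detection when most messages are ham. As a hedged illustration, the sketch below computes a confusion matrix plus precision and recall for the spam class (label 0) on hypothetical labels and predictions. These numbers are made up and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (spam = 0, ham = 1), not notebook results.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Treat spam (0) as the class of interest.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print("spam precision:", spam_precision)
print("spam recall:", spam_recall)
```

In this toy case the accuracy is 80%, yet only one of the three spam messages is caught. The same calls could be applied to `Y_test` and `X_test_prediction`.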


In [24]:
# Print each test message as: predicted label, message text, true label
for i in range(len(X_test_prediction)):
  print(X_test_prediction[i], X_test[i], Y_test[i])

1 Lol wtf random. Btw is that your lunch break 1
1 No. She's currently in scotland for that. 1
0 Hi babe its Jordan, how r u? Im home from abroad and lonely, text me back if u wanna chat xxSP visionsms.com Text stop to stopCost 150p 08712400603 0
1 St andre, virgil's cream 1
1 The guy at the car shop who was flirting with me got my phone number from the paperwork and called and texted me. I'm nervous because of course now he may have my address. Should i call his boss and tell him, knowing this may get him fired? 1
1 Nope i waiting in sch 4 daddy... 1
1 K I'll be sure to get up before noon and see what's what 1
1 Theyre doing it to lots of places. Only hospitals and medical places are safe. 1
1 sry can't talk on phone, with parents 1
1 We're finally ready fyi 1
1 Yeah that'd pretty much be the best case scenario 1
1 Single line with a big meaning::::: "Miss anything 4 ur "Best Life" but, don't miss ur best life for anything... Gud nyt... 1
1 No calls..messages..missed calls 1
1 Yar lor