For classification purposes, emails fall into two main categories:

> "Spam" refers to unwanted or unsolicited emails that are typically sent in bulk and often contain scams, fake offers, or links to phishing websites. They are the emails which we want to filter out and classify as spam.

> "Ham" refers to legitimate, non-spam emails. These are sent for personal or professional reasons and are neither unwanted nor unsolicited. They are the emails we want to classify as non-spam.

Now that we know the data has to be classified into two categories, we will use logistic regression.

A logistic regression model can classify email messages as spam or ham by learning from a dataset of labeled messages (provided to us) and then predicting labels for new, unseen messages. It works by assigning each email a probability of being spam based on the words and phrases it contains.
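To make the probability idea concrete, here is a minimal sketch of the logistic (sigmoid) function applied to a weighted sum of word features. The weights, bias, and word choices are made up for illustration, and the sketch treats spam as the positive class purely for readability; none of these numbers come from the trained model.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights for three words plus a bias term (illustrative values only).
weights = np.array([2.0, 1.5, -1.0])  # "free", "win", "meeting"
bias = -0.5

# Feature vector for a message that contains "free" and "win" but not "meeting".
features = np.array([1.0, 1.0, 0.0])

score = weights @ features + bias      # linear score: 2.0 + 1.5 - 0.5 = 3.0
probability = sigmoid(score)           # probability of the positive class
label = int(probability >= 0.5)        # predict the positive class at or above 0.5

print(f"score={score:.1f}, probability={probability:.3f}, label={label}")
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities match the labeled examples as closely as possible.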

### Importing required libraries

In [183]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Collecting and preparing data for analysis

In [184]:
# Load the data from the CSV file into a pandas DataFrame.
# The file was not UTF-8 encoded, so it was first re-saved as a UTF-8 CSV using Excel.

data = pd.read_csv('C:/Users/SirFa/Downloads/Oasis/Datasets/spam.csv')
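As an alternative to re-saving the file in Excel, `pd.read_csv` accepts an `encoding` argument. The sketch below assumes the original file is Latin-1 (ISO-8859-1) encoded, which is a guess; the only thing known here is that it was not UTF-8. It writes a small stand-in file to a temporary folder so that it runs on its own.

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in CSV saved as Latin-1; the "ü" byte is not valid UTF-8 in this encoding.
raw = "v1,v2\nham,Will ü b going to esplanade fr home?\nspam,Free entry in 2 a wkly comp\n"
path = os.path.join(tempfile.mkdtemp(), "spam_sample.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write(raw)

# Passing the encoding explicitly avoids the decode error raised by the default UTF-8 reader.
sample = pd.read_csv(path, encoding="latin-1")
print(sample)
```

If the real file uses a different encoding, the same call works with that encoding's name instead.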

In [185]:
data.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [186]:
data.tail()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | | | |
| 5568 | ham | Will ü b going to esplanade fr home? | | | |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | | | |
| 5570 | ham | The guy did some bitching but I acted like i'd... | | | |
| 5571 | ham | Rofl. Its true to its name | | | |


In [187]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


Only the first two columns are needed; the remaining three are almost entirely empty, so we drop them.<br>
The column names are also uninformative, so we rename them.

In [188]:
data.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [189]:
data.drop(["Unnamed: 2", "Unnamed: 3","Unnamed: 4"], axis=1, inplace=True)
data.rename(columns={'v1':'Category','v2':'Mail'}, inplace=True)
data.columns


Index(['Category', 'Mail'], dtype='object')

In [190]:
# checking the number of rows and columns in the dataframe
print("Rows :",data.shape[0], "\nColumns :", data.shape[1])

Rows : 5572 
Columns : 2


### Label Encoding
Encode the Category column as:<br>
Spam: 0<br>
Ham: 1

In [191]:
# Label spam mail as 0 and ham mail as 1
data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1
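The same encoding can also be written as a single dictionary lookup with `Series.map`. This is a minimal sketch on a made-up three-row frame (the `toy` frame and `label_map` name are illustrative); one side effect is that the resulting column already has an integer dtype, whereas the `.loc` assignments leave it as `object`.

```python
import pandas as pd

# Made-up stand-in for the real DataFrame, using the same column names.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Mail": ["Ok lar...", "Free entry...", "See you at 10"],
})

# Same encoding as in the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
toy["Category"] = toy["Category"].map(label_map)

print(toy["Category"].tolist())
print(toy["Category"].dtype)
```

Any category missing from the dictionary would become `NaN`, which makes unexpected labels easy to spot.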

In [192]:
# Assign the label column to Y and the text column to X for convenience
Y = data['Category']
X = data['Mail']

### Train-Test Split :
Now split the data into training and testing sets in a 70:30 ratio (`test_size=0.3` holds out 30% of the rows for testing).

In [193]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=4)

In [194]:
print("Total No of Rows  :",X.shape[0])
print("Rows for training :",X_train.shape[0])
print("Rows for testing  :",X_test.shape[0])

Total No of Rows  : 5572
Rows for training : 3900
Rows for testing  : 1672
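The row counts above follow from how the split sizes are rounded: the test set gets 30% of 5572 rows rounded up, and the training set gets the rest. A small sketch that reproduces the arithmetic on a plain list of row indices (a stand-in for the real data):

```python
import math

from sklearn.model_selection import train_test_split

n_rows = 5572
test_size = 0.3

# 0.3 * 5572 = 1671.6, rounded up to 1672 test rows; the remaining 3900 go to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# Stand-in for the real data: one integer per row.
rows = list(range(n_rows))
train_rows, test_rows = train_test_split(rows, test_size=test_size, random_state=4)

print(n_train, n_test)
print(len(train_rows), len(test_rows))
```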


### Feature Extraction

In [195]:
# Transform the text data into TF-IDF feature vectors that can be used as input to the logistic regression

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Learn the vocabulary from the training data only, then reuse it for the test data
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
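The difference between `fit_transform` and `transform` matters here: the vocabulary is learned from the training texts only, and test texts are projected onto that fixed vocabulary. The sketch below uses made-up sentences to show that a word seen only at test time ("lunch") gets no column of its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts for illustration.
train_texts = ["free prize waiting", "meeting at noon", "claim your free prize"]
test_texts = ["free lunch at the meeting"]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
train_matrix = vectorizer.fit_transform(train_texts)  # learns the vocabulary and IDF weights
test_matrix = vectorizer.transform(test_texts)        # reuses them without refitting

vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(train_matrix.shape, test_matrix.shape)  # same number of columns in both
print("lunch" in vocab)                       # unseen test-only words are ignored
```

Fitting on the test data as well would leak information about it into the features, which would make the test accuracy look better than it should.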

In [196]:
# The Category column still has object dtype, so convert the labels to integers

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [197]:
print(X_train)
print("-----------------------------------------------------------------------------------------------------------------------")
print(X_train_features)

1256       Not yet chikku..going to room nw, i'm in bus..
4163                  Its ok, called mom instead have fun
1994               Have you been practising your curtsey?
3587    If you were/are free i can give. Otherwise nal...
1598                  Daddy will take good care of you :)
                              ...                        
3671     came to look at the flat, seems ok, in his 50...
709     4mths half price Orange line rental & latest c...
2487    K ill drink.pa then what doing. I need srs mod...
174     Well, i'm gonna finish my bath now. Have a goo...
1146                            Babe ? I lost you ... :-(
Name: Mail, Length: 3900, dtype: object
-----------------------------------------------------------------------------------------------------------------------
  (0, 1416)	0.4541283081025466
  (0, 4354)	0.5404817119861236
  (0, 5206)	0.4327984288284353
  (0, 2863)	0.31622834091458635
  (0, 1612)	0.46296020908336527
  (1, 2748)	0.4325789275779803
  (1, 3321)	0.

### Training the model using Logistic Regression

In [198]:
model = LogisticRegression()
model.fit(X_train_features, Y_train)

LogisticRegression()

### Model Evaluation :

In [199]:
# prediction on training data

prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

print('Accuracy on training data : ', accuracy_on_training_data)

Accuracy on training data :  0.9666666666666667


In [200]:
# prediction on test data

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

print('Accuracy on test data : ', accuracy_on_test_data)

Accuracy on test data :  0.9551435406698564
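Accuracy alone does not show which kind of mistake the model makes. As a possible complement, `confusion_matrix` and `classification_report` from `sklearn.metrics` break the errors down per class. The labels below are made up for illustration and are not the notebook's predictions; they follow the same encoding (0 = spam, 1 = ham).

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration: 3 spam and 7 ham messages.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Per-class precision and recall alongside the overall accuracy.
print(classification_report(y_true, y_pred, labels=[0, 1], target_names=["spam", "ham"]))
```

In this toy case the accuracy is 80%, yet two of the three spam messages slip through as ham, which the first row of the matrix makes visible.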


### Predicting for new inputs:

#### For Spam: deliberately entering a new spam mail to check

In [201]:
# Get input mail from user
get_mail = input("Enter the text of the email :")

Enter the text of the email :"Congratulations! You have been selected to receive a $1000 gift card from our exclusive rewards program. To claim your gift card, click on the link below and enter the code provided.But hurry, this offer expires in 24 hours"


In [202]:
# Convert input to a list
list_mail = [get_mail]

In [203]:
# Convert text to feature vectors
input_data_features = feature_extraction.transform(list_mail)

In [204]:
# Make prediction
prediction = model.predict(input_data_features)

In [205]:
# Check prediction and print result
if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')

Spam mail


#### For Ham: deliberately entering a new ham mail to check

In [206]:
# Get input mail from user
get_mail = input("Enter the text of the email :")

Enter the text of the email :"Just a reminder that we have a meeting scheduled for tomorrow at 10am in the conference room. The agenda for the meeting is attached. Please review it before the meeting."


In [207]:
# Convert input to a list
list_mail = [get_mail]

In [208]:
# Convert text to feature vectors
input_data_features = feature_extraction.transform(list_mail)

In [209]:
# Make prediction
prediction = model.predict(input_data_features)

In [210]:
# Check prediction and print result
if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')

Ham mail
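The two checks above repeat the same steps: wrap the text in a list, vectorize it, predict, and translate the label. One way to avoid the repetition is a small helper built on a scikit-learn `Pipeline`. This is a sketch only: `classify_mail` and `spam_filter` are hypothetical names, and the four training sentences are made up so that the block runs on its own, so its predictions say nothing about the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, only so the sketch is runnable (0 = spam, 1 = ham).
texts = [
    "Congratulations, claim your free gift card now",
    "Win a cash prize, click the link today",
    "Reminder: project meeting tomorrow at 10am",
    "Please review the attached agenda before lunch",
]
labels = [0, 0, 1, 1]

# The pipeline chains vectorization and classification into one object.
spam_filter = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_filter.fit(texts, labels)


def classify_mail(text, model=spam_filter):
    """Return 'Ham mail' or 'Spam mail' for a single message (hypothetical helper)."""
    prediction = model.predict([text])[0]
    return "Ham mail" if prediction == 1 else "Spam mail"


print(classify_mail("Claim your free prize today"))
print(classify_mail("Agenda for the meeting tomorrow"))
```

Keeping the vectorizer and the classifier in one pipeline also guarantees that new messages are transformed with the same vocabulary the model was trained on.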


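### Conclusion
A logistic regression model trained on TF-IDF features from 3,900 labeled messages reached an accuracy of 96.7% on the training data and 95.5% on the 1,672 held-out test messages. The small gap between the two figures suggests the model generalizes reasonably well rather than memorizing the training set, and it classified both hand-written examples above correctly. These results indicate that logistic regression is an effective baseline for separating spam from ham on this dataset.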