<div style="text-align: center;">
    <h1 style="font-size:50px">Email Spam Detection Project</h1>
</div>

<h1>Objective</h1>

<p style="font-size:15px;line-height: 1.5;">The objective of this project is to develop machine learning models that can accurately classify emails as either 'spam' or 'ham'. The aim is to identify and filter out unwanted or malicious emails to enhance email security and user experience.</p>

<h1>Dataset</h1>

<p style="font-size:15px;line-height: 1.5;">The dataset consists of 5572 rows and 2 columns. Each row holds an email message and its label (spam or ham).</p>

<h1>Approach</h1>

<ul style="font-size:15px;line-height: 1.5;">
    <li>First, the dataset is explored to understand its structure.</li>
    <li>The dataset is pre-processed by encoding the target variable and splitting the data into training and testing sets in an 8:2 ratio.</li>
    <li>Six different models are trained for classification:
        <ul style="font-size:15px;line-height: 1.5;">
            <li>Logistic Regression</li>
            <li>Multinomial Naive Bayes</li>
            <li>Decision Tree</li>
            <li>Random Forest</li>
            <li>Support Vector Machine</li>
            <li>K-Nearest Neighbors</li>
        </ul>
    </li>
    <li>Each model is saved as a '.pkl' file.</li>
    <li>The user is then prompted to load the model of their choice and to enter an email message interactively to get a prediction.</li>
</ul>

# Import The Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Load The Dataset

In [2]:
dataset = pd.read_csv('spam.csv', usecols=['v1', 'v2'], encoding='latin1')
dataset

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


# Handling Missing Values

In [4]:
dataset = dataset.where((pd.notnull(dataset)), '')

# Printing Information About The Dataset

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


# Encoding Target Variables

<ul style="font-size:15px;line-height: 1.5;">
    <li>The target variable is the value we want to predict. Here it has two classes: spam and ham.</li>
    <li>0 is assigned to spam and 1 is assigned to ham.</li>
    <li>So the output is 0 when an email is spam and 1 when it is ham.</li>
</ul>

In [6]:
dataset.loc[dataset['v1'] == 'spam', 'v1'] = 0
dataset.loc[dataset['v1'] == 'ham', 'v1'] = 1
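
<p style="font-size:15px;line-height: 1.5;">The same encoding can also be written as a single mapping. The sketch below is only an illustration on a made-up series (the names <code>toy_labels</code> and <code>label_map</code> are not part of this notebook). It keeps the convention used here, spam as 0 and ham as 1, and returns integers directly instead of a column of dtype object.</p>

```python
import pandas as pd

# Made-up labels standing in for the 'v1' column (illustrative data only).
toy_labels = pd.Series(['ham', 'spam', 'ham', 'ham', 'spam'], name='v1')

# Same convention as this notebook: spam -> 0, ham -> 1.
label_map = {'spam': 0, 'ham': 1}
encoded = toy_labels.map(label_map).astype('int')

print(encoded.tolist())
print(encoded.dtype)
```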

# Separating The Training Data And Target Variable

<ul style="font-size:15px;line-height: 1.5;">
    <li>The input data, i.e. the email messages, are assigned to x.</li>
    <li>The target variable, i.e. the label (spam or ham) of each message, is assigned to y.</li>
</ul>

In [7]:
x = dataset['v2']
y = dataset['v1']

## Printing The Training Data

In [8]:
print(x)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: v2, Length: 5572, dtype: object


## Printing The Target Variable

In [9]:
print(y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: v1, Length: 5572, dtype: object


# Splitting The Dataset

<ul style="font-size:15px;line-height: 1.5;">
    <li>The entire dataset is split into 2 parts - training and testing in the ratio 8:2.</li>
    <li>The training data will be used to train the models.</li>
    <li>The testing data is the unseen data with which the model accuracy will be checked.</li>
</ul>

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2)
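
<p style="font-size:15px;line-height: 1.5;">The call above draws a different random split on every run, so the scores reported further down will vary slightly between runs. As a hedged sketch on made-up data (not the split used in this notebook), <code>random_state</code> makes the split repeatable and <code>stratify</code> keeps the spam/ham proportion the same in both parts. The value 42 and the toy lists are assumptions chosen for illustration.</p>

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 80 ham (1) and 20 spam (0), following the notebook's label convention.
toy_messages = [f"message {i}" for i in range(100)]
toy_targets = [1] * 80 + [0] * 20

# random_state fixes the shuffle; stratify preserves the class ratio in both parts.
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_messages, toy_targets, test_size=0.2, random_state=42, stratify=toy_targets
)

print(len(toy_x_train), len(toy_x_test))
print(Counter(toy_y_train), Counter(toy_y_test))
```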

## Printing The Training Data Shape

In [11]:
print(x_train.shape)
print(y_train.shape)

(4457,)
(4457,)


## Printing the Testing Data Shape

In [12]:
print(x_test.shape)
print(y_test.shape)

(1115,)
(1115,)


# Feature Extraction

<p style="font-size:15px;line-height: 1.5;">To extract meaningful features, the text is converted into a numerical representation using TF-IDF weights.</p>

In [13]:
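feature_extraction = TfidfVectorizer(min_df = 1, stop_words = 'english', lowercase = True)

# Fit the vocabulary on the training messages only, then reuse it for the test messages
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)

# The labels were stored with dtype object, so convert them to integers
y_train = y_train.astype('int')
y_test = y_test.astype('int')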

## Before Feature Extraction

In [14]:
print(x_train)

2758              What time. IÛ÷m out until prob 3 or so
2941    Hello. No news on job, they are making me wait...
5519    Can you pls send me that company name. In saib...
3154                                                Ok...
2505                 Congrats kano..whr s the treat maga?
                              ...                        
1759    Do u ever get a song stuck in your head for no...
2052    Call 09094100151 to use ur mins! Calls cast 10...
4854                                         Same to u...
3166    When people see my msgs, They think Iam addict...
3784    Let me know if you need anything else. Salad o...
Name: v2, Length: 4457, dtype: object


## After Feature Extraction

In [15]:
print(x_train_features)

  (0, 5284)	0.8359467152211945
  (0, 6677)	0.5488106133366001
  (1, 3403)	0.18743087463448296
  (1, 2665)	0.3455115147127925
  (1, 7176)	0.36238236112270333
  (1, 7309)	0.36238236112270333
  (1, 3536)	0.21689748249572097
  (1, 7398)	0.21213702487846833
  (1, 7189)	0.20399236179839308
  (1, 2792)	0.36238236112270333
  (1, 7107)	0.2292899647547473
  (1, 4231)	0.2709588634380846
  (1, 3743)	0.24097792731086692
  (1, 4644)	0.2709588634380846
  (1, 3330)	0.24573838492811956
  (2, 1881)	0.563712980819382
  (2, 5722)	0.563712980819382
  (2, 1916)	0.4332976890082253
  (2, 5839)	0.27968152351623693
  (2, 5113)	0.31382592087847383
  (3, 4788)	1.0
  (4, 4211)	0.48113845790094695
  (4, 6803)	0.4018052765381354
  (4, 7235)	0.49840543969763945
  (4, 3808)	0.44754972732392345
  :	:
  (4453, 1613)	0.1758709224921876
  (4453, 6297)	0.1295659714848464
  (4453, 4393)	0.15669732745643003
  (4453, 6961)	0.10735471600873726
  (4453, 3786)	0.10287024105925678
  (4455, 840)	0.548960569798217
  (4455, 3496)	0.
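
<p style="font-size:15px;line-height: 1.5;">Each line of the sparse output above reads as <code>(row, column) weight</code>: the row is the message, the column is a word in the learned vocabulary, and the weight is its TF-IDF score. The sketch below repeats this on three made-up messages so the columns can be mapped back to words. The names <code>toy_messages</code>, <code>toy_vectorizer</code> and <code>index_to_term</code> are illustrative assumptions. Because <code>TfidfVectorizer</code> scales every row to unit length by default, a message with a single surviving word gets the weight 1.0, as row 3 does above.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up messages (illustrative only, not taken from spam.csv).
toy_messages = ["free prize waiting", "lunch tonight", "ok"]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
toy_features = toy_vectorizer.fit_transform(toy_messages)

# Invert the vocabulary so a column index can be mapped back to its word.
index_to_term = {index: term for term, index in toy_vectorizer.vocabulary_.items()}

# Walk the non-zero entries as (row, column, weight) triples.
coo = toy_features.tocoo()
for row, col, weight in zip(coo.row, coo.col, coo.data):
    print(f"({row}, {col})  {index_to_term[int(col)]:<8} {weight:.3f}")

# Words that were not seen during fit are dropped at transform time.
unseen = toy_vectorizer.transform(["free pizza"])
print("non-zero entries for 'free pizza':", unseen.nnz)
```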

# Import Libraries For Model Training And Testing

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Logistic Regression

## Training And Testing The Model

In [17]:
model1 = LogisticRegression()
model1.fit(x_train_features, y_train)
y_pred = model1.predict(x_test_features)
print("Accuracy : ", model1.score(x_test_features,y_test))

print('Classification Report')
print(classification_report(y_test, y_pred))
print()

print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

Accuracy :  0.9704035874439462
Classification Report
              precision    recall  f1-score   support

           0       0.98      0.74      0.84       120
           1       0.97      1.00      0.98       995

    accuracy                           0.97      1115
   macro avg       0.97      0.87      0.91      1115
weighted avg       0.97      0.97      0.97      1115


Confusion Matrix
[[ 89  31]
 [  2 993]]
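
<p style="font-size:15px;line-height: 1.5;">The figures in the classification report can be recomputed by hand from the confusion matrix above. In scikit-learn's layout the rows are the true classes and the columns are the predicted classes, so with this notebook's encoding the first row is spam (0) and the second is ham (1). The variable names in this sketch are illustrative.</p>

```python
# Confusion matrix reported above for Logistic Regression.
# Rows are true classes, columns are predicted classes; index 0 = spam, 1 = ham.
cm = [[89, 31],
      [2, 993]]

spam_as_spam, spam_as_ham = cm[0]
ham_as_spam, ham_as_ham = cm[1]
total = sum(cm[0]) + sum(cm[1])

# Accuracy: share of all messages that were labelled correctly.
accuracy = (spam_as_spam + ham_as_ham) / total
# Precision for spam: of the messages flagged as spam, how many really were spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall for spam: of the real spam messages, how many were caught.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy       {accuracy:.4f}")
print(f"spam precision {spam_precision:.2f}")
print(f"spam recall    {spam_recall:.2f}")
print(f"spam f1-score  {spam_f1:.2f}")
```

<p style="font-size:15px;line-height: 1.5;">The recall of 0.74 means that 31 of the 120 spam messages in the test set were let through as ham, even though the overall accuracy is about 97%.</p>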


## Saving The Model

In [18]:
import joblib
filename = 'logistic_regression_model.pkl'
joblib.dump(model1, filename)
print(f"Model saved as {filename}")

Model saved as logistic_regression_model.pkl
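
<p style="font-size:15px;line-height: 1.5;">Note that the saved file contains only the classifier. The interactive cell at the end of this notebook still relies on the fitted <code>feature_extraction</code> object being in memory. One possible alternative, sketched here on made-up data and not used in this notebook, is to bundle the vectorizer and the classifier in a scikit-learn pipeline so that a single file accepts raw text. The file name <code>spam_pipeline_demo.pkl</code> and the toy messages are assumptions for illustration.</p>

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data (0 = spam, 1 = ham, as in this notebook).
toy_texts = [
    "win a free prize now", "free entry to win cash", "claim your free prize",
    "are you coming home tonight", "lunch at noon tomorrow", "meeting at the office",
]
toy_targets = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes the raw text and then classifies it.
pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words='english', lowercase=True),
    LogisticRegression(),
)
pipeline.fit(toy_texts, toy_targets)

# Save to a temporary folder and load it back, as a separate session would.
demo_path = os.path.join(tempfile.mkdtemp(), 'spam_pipeline_demo.pkl')
joblib.dump(pipeline, demo_path)
restored = joblib.load(demo_path)

# The restored object takes raw strings; no separate vectorizer is needed.
sample = ["free prize inside", "home for lunch tomorrow"]
print(restored.predict(sample))
```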


# Multinomial Naive Bayes

## Training And Testing The Model

In [19]:
model2 = MultinomialNB()
model2.fit(x_train_features, y_train)

# Predict and evaluate the model
y_pred = model2.predict(x_test_features)
print("Accuracy:", model2.score(x_test_features, y_test))

print('Classification Report:')
print(classification_report(y_test, y_pred))
print()

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9721973094170404
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.74      0.85       120
           1       0.97      1.00      0.98       995

    accuracy                           0.97      1115
   macro avg       0.98      0.87      0.92      1115
weighted avg       0.97      0.97      0.97      1115


Confusion Matrix:
[[ 89  31]
 [  0 995]]


## Saving The Model

In [20]:
filename = 'multinomial_bayes_model.pkl'
joblib.dump(model2, filename)
print(f"Model saved as {filename}")

Model saved as multinomial_bayes_model.pkl


# Decision Tree

## Training And Testing The Model

In [21]:
model3 = DecisionTreeClassifier()
model3.fit(x_train_features, y_train)
y_pred = model3.predict(x_test_features)
print("Accuracy : ", model3.score(x_test_features,y_test))

print('Classification Report')
print(classification_report(y_test, y_pred))
print()

print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

Accuracy :  0.9704035874439462
Classification Report
              precision    recall  f1-score   support

           0       0.88      0.83      0.86       120
           1       0.98      0.99      0.98       995

    accuracy                           0.97      1115
   macro avg       0.93      0.91      0.92      1115
weighted avg       0.97      0.97      0.97      1115


Confusion Matrix
[[100  20]
 [ 13 982]]


## Saving The Model

In [22]:
filename = 'decision_tree_model.pkl'
joblib.dump(model3, filename)
print(f"Model saved as {filename}")

Model saved as decision_tree_model.pkl


# Random Forest

## Training And Testing The Model

In [23]:
model4 = RandomForestClassifier()
model4.fit(x_train_features, y_train)
y_pred = model4.predict(x_test_features)
print("Accuracy : ", model4.score(x_test_features,y_test))

print('Classification Report')
print(classification_report(y_test, y_pred))
print()

print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

Accuracy :  0.9802690582959641
Classification Report
              precision    recall  f1-score   support

           0       1.00      0.82      0.90       120
           1       0.98      1.00      0.99       995

    accuracy                           0.98      1115
   macro avg       0.99      0.91      0.94      1115
weighted avg       0.98      0.98      0.98      1115


Confusion Matrix
[[ 98  22]
 [  0 995]]


## Saving The Model

In [24]:
filename = 'random_forest_model.pkl'
joblib.dump(model4, filename)
print(f"Model saved as {filename}")

Model saved as random_forest_model.pkl


# Support Vector Machine

## Training And Testing The Model

In [25]:
model5 = SVC()
model5.fit(x_train_features, y_train)
y_pred = model5.predict(x_test_features)
print("Accuracy : ", model5.score(x_test_features,y_test))

print('Classification Report')
print(classification_report(y_test, y_pred))
print()

print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

Accuracy :  0.9856502242152466
Classification Report
              precision    recall  f1-score   support

           0       1.00      0.87      0.93       120
           1       0.98      1.00      0.99       995

    accuracy                           0.99      1115
   macro avg       0.99      0.93      0.96      1115
weighted avg       0.99      0.99      0.99      1115


Confusion Matrix
[[104  16]
 [  0 995]]


## Saving The Model

In [26]:
filename = 'svm_model.pkl'
joblib.dump(model5, filename)
print(f"Model saved as {filename}")

Model saved as svm_model.pkl


# K-Nearest Neighbors

## Training And Testing The Model

In [27]:
model6 = KNeighborsClassifier()
model6.fit(x_train_features, y_train)
y_pred = model6.predict(x_test_features)
print("Accuracy : ", model6.score(x_test_features,y_test))

print('Classification Report')
print(classification_report(y_test, y_pred))
print()

print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

Accuracy :  0.9282511210762332
Classification Report
              precision    recall  f1-score   support

           0       1.00      0.33      0.50       120
           1       0.93      1.00      0.96       995

    accuracy                           0.93      1115
   macro avg       0.96      0.67      0.73      1115
weighted avg       0.93      0.93      0.91      1115


Confusion Matrix
[[ 40  80]
 [  0 995]]


## Saving The Model

In [28]:
filename = 'knn_model.pkl'
joblib.dump(model6, filename)
print(f"Model saved as {filename}")

Model saved as knn_model.pkl


# Accuracies Of Different Models

In [29]:
print(f'Logistic Regression accuracy: {model1.score(x_test_features,y_test)*100}')
print(f'Multinomial Bayes accuracy: {model2.score(x_test_features,y_test)*100}')
print(f'Decision Tree accuracy: {model3.score(x_test_features,y_test)*100}')
print(f'Random Forest accuracy: {model4.score(x_test_features,y_test)*100}')
print(f'Support Vector Machine accuracy: {model5.score(x_test_features,y_test)*100}')
print(f'K-Nearest Neighbors accuracy: {model6.score(x_test_features,y_test)*100}')

Logistic Regression accuracy: 97.04035874439462
Multinomial Bayes accuracy: 97.21973094170404
Decision Tree accuracy: 97.04035874439462
Random Forest accuracy: 98.02690582959642
Support Vector Machine accuracy: 98.56502242152466
K-Nearest Neighbors accuracy: 92.82511210762333
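
<p style="font-size:15px;line-height: 1.5;">Only 120 of the 1115 test messages are spam, so accuracy alone hides how much spam each model misses. As a rough sketch, the share of spam caught (recall for class 0) can be computed from the first row of each confusion matrix reported above. The dictionary below simply copies those counts; its name and layout are illustrative.</p>

```python
# (spam caught, spam missed), copied from the first row of each confusion matrix above.
spam_rows = {
    'Logistic Regression': (89, 31),
    'Multinomial Bayes': (89, 31),
    'Decision Tree': (100, 20),
    'Random Forest': (98, 22),
    'Support Vector Machine': (104, 16),
    'K-Nearest Neighbors': (40, 80),
}

# Recall for spam: caught / (caught + missed).
spam_recall = {
    name: caught / (caught + missed)
    for name, (caught, missed) in spam_rows.items()
}

# Print the models from best to worst spam recall.
for name, recall in sorted(spam_recall.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} {recall:.2%}")
```

<p style="font-size:15px;line-height: 1.5;">On this particular split the Support Vector Machine catches the most spam (104 of 120), while K-Nearest Neighbors catches only 40 of 120 despite an accuracy of about 93%.</p>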


# Taking User Input And Making Predictions

In [37]:
import warnings
# Suppress warnings
warnings.filterwarnings("ignore")

# Dictionary mapping model names to their filenames
model_files = {
    'Logistic Regression': 'logistic_regression_model.pkl',
    'Multinomial Bayes': 'multinomial_bayes_model.pkl',
    'Decision Tree': 'decision_tree_model.pkl',
    'Random Forest': 'random_forest_model.pkl',
    'Support Vector Machine': 'svm_model.pkl',
    'K-Nearest Neighbors': 'knn_model.pkl'
}

def choose_model():
    print("Choose a model to load:")
    for idx, model_name in enumerate(model_files.keys(), 1):
        print(f"{idx}. {model_name}")
    while True:
        try:
            choice = int(input("Enter your choice (1-6): "))
            if choice < 1 or choice > 6:
                raise ValueError("Please enter a number between 1 and 6.")
            break
        except ValueError as e:
            print(e)
    selected_model_name = list(model_files.keys())[choice - 1]
    selected_model_file = model_files[selected_model_name]
    return selected_model_name, selected_model_file

def load_selected_model(selected_model_file):
    loaded_model = joblib.load(selected_model_file)
    return loaded_model

def get_user_input():
    text_message = input("Enter a text message: ")
    user_input_df = pd.DataFrame([text_message], columns=['v2'])
    return user_input_df

def make_prediction(loaded_model, selected_model_name):
    user_input_df = get_user_input()
    user_input_features = feature_extraction.transform(user_input_df['v2'])  # Reuse the same vectorizer
    predicted_class = loaded_model.predict(user_input_features)
    print(f"The predicted class for the given message using {selected_model_name} is: {'spam' if predicted_class[0] == 0 else 'ham'}")


selected_model_name, selected_model_file = choose_model()
loaded_model = load_selected_model(selected_model_file)

while True:
    make_prediction(loaded_model, selected_model_name)
    while True:
        yes_or_no = input("Do you want to quit? (yes/no) ").lower()
        
        if yes_or_no == 'yes':
            break
        elif yes_or_no == 'no':
            while True:
                continue_or_change = input("Do you want to continue with the same model or choose a different model? (continue/change): ").lower()

                if continue_or_change == 'change':
                    print()
                    selected_model_name, selected_model_file = choose_model()
                    loaded_model = load_selected_model(selected_model_file)
                    break
                elif continue_or_change == 'continue':
                    break
                else:
                    print("Please enter either 'continue' or 'change'.")
            break
        else:
            print("Please enter either 'yes' or 'no'.")
    if yes_or_no == 'yes':
        break

Choose a model to load:
1. Logistic Regression
2. Multinomial Bayes
3. Decision Tree
4. Random Forest
5. Support Vector Machine
6. K-Nearest Neighbors
Enter your choice (1-6): 1
Enter a text message: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's			
The predicted class for the given message using Logistic Regression is: spam
Do you want to quit? (yes/no) no
Do you want to continue with the same model or choose a different model? (continue/change): change

Choose a model to load:
1. Logistic Regression
2. Multinomial Bayes
3. Decision Tree
4. Random Forest
5. Support Vector Machine
6. K-Nearest Neighbors
Enter your choice (1-6): 4
Enter a text message: Nah I don't think he goes to usf, he lives around here though			
The predicted class for the given message using Random Forest is: ham
Do you want to quit? (yes/no) y
Please enter either 'yes' or 'no'.
Do you want to quit? (yes/no)