<a href="https://colab.research.google.com/github/Russell-Robinson/Russell-Robinson.github.io/blob/main/Russell_project_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project builds a machine learning model that classifies emails as spam (1) or not spam (0, labeled "ham" in the data) using the Naive Bayes algorithm. The dataset is a CSV file of labeled emails. The project follows four steps: data preprocessing, feature extraction, model training, and evaluation.

First, I import the libraries needed to manipulate data, process text, and build the machine learning model.

In [None]:
#import libraries
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string

Here, I upload the dataset, which is in CSV format, and read it into a Pandas DataFrame for further manipulation.

In [None]:
#Load the data
from google.colab import files
uploaded = files.upload()

Saving spam_ham_dataset.csv to spam_ham_dataset (2).csv


In [None]:
#Read the csv file
df = pd.read_csv('spam_ham_dataset.csv')

# Print the first 5 rows of data
df.head(5)



,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


Once the data is loaded, it’s essential to explore its structure to understand what it contains and how to manipulate it.

In [None]:
#Print the shape (Get the number of rows and columns)
df.shape

(5171, 4)

In [None]:
#get the header names (column names)
df.columns

Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')

Data cleaning is a crucial part of any machine learning project. We first check for duplicate records and missing values, then handle them accordingly.

In [None]:
#check for duplicates and remove them
df.drop_duplicates(inplace = True)

In [None]:
#show the new shape (number of rows and columns)
df.shape

(5171, 4)
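
The shape is unchanged after removing duplicates. Note that drop_duplicates() compares whole rows by default, so a column of row identifiers such as 'Unnamed: 0' (assuming its values are unique) would keep two emails with identical text from being treated as duplicates. A minimal sketch on a made-up three-row table shows the difference; the toy data and the choice of subset=['text'] are illustrative assumptions, and whether the real dataset contains repeated email bodies is not checked here.

```python
import pandas as pd

# Made-up toy table (illustrative only): rows 0 and 1 share the same text
# but have different identifiers in 'Unnamed: 0'.
toy = pd.DataFrame({
    "Unnamed: 0": [10, 11, 12],
    "text": ["buy now", "buy now", "see attached"],
    "label": ["spam", "spam", "ham"],
})

# Default behaviour: all columns must match for a row to count as a duplicate.
full_row = toy.drop_duplicates()

# Restricting the comparison to the 'text' column catches repeated emails.
by_text = toy.drop_duplicates(subset=["text"])

print(len(full_row), len(by_text))
```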

In [None]:
#show the number of missing (NAN , NaN , na) data for each column
df.isnull().sum()

Unnamed: 0,0
label,0
text,0
label_num,0


In [None]:
# Import the nltk module
import nltk

#Download the stopwords package
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

This is where the actual cleaning of the email text happens. We remove punctuation and stopwords to reduce noise and make the data more useful for training.

In [None]:
# Import necessary libraries
import nltk
import string
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Build the stop-word set once; looking words up in a set is much faster
# than re-reading the stop-word list for every word
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
    """Remove punctuation and stopwords, then return the remaining words as a list."""
    if not isinstance(text, str):
        return []  # Return an empty list for non-string values

    # 1. Remove punctuation
    nopunc = ''.join(char for char in text if char not in string.punctuation)

    # 2. Remove stopwords (case-insensitive comparison)
    clean_words = [word for word in nopunc.split() if word.lower() not in STOP_WORDS]

    # 3. Return the list of clean words
    return clean_words

# Preview the cleaned tokens for the first five emails
df['text'].head().apply(process_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


,text
0,"[Subject, enron, methanol, meter, 988291, foll..."
1,"[Subject, hpl, nom, january, 9, 2001, see, att..."
2,"[Subject, neon, retreat, ho, ho, ho, around, w..."
3,"[Subject, photoshop, windows, office, cheap, m..."
4,"[Subject, indian, springs, deal, book, teco, p..."
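
To make the two cleaning steps concrete, here is a minimal sketch that applies them to a made-up sentence. It uses a tiny hard-coded stop-word set (an assumption, standing in for nltk's English list) so that it runs without any downloads; the function name clean_tokens is hypothetical.

```python
import string

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# nltk's stopwords.words('english') instead.
STOP_WORDS = {"this", "is", "for", "you", "the", "a", "to"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Strip punctuation, then drop stop words (case-insensitive)."""
    if not isinstance(text, str):
        return []
    # Delete every punctuation character in a single pass.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and keep only words that are not stop words.
    return [word for word in no_punct.split() if word.lower() not in stop_words]

# Made-up example email text.
sample = "Subject: cheap software , buy now ! This offer is for you ."
tokens = clean_tokens(sample)
print(tokens)  # ['Subject', 'cheap', 'software', 'buy', 'now', 'offer']
```

The original casing of each kept word is preserved, which is why 'Subject' stays capitalized in the notebook output as well.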


In [None]:
#show the tokenization (a list of tokens for each email)
df['text'].head().apply(process_text)

,text
0,"[Subject, enron, methanol, meter, 988291, foll..."
1,"[Subject, hpl, nom, january, 9, 2001, see, att..."
2,"[Subject, neon, retreat, ho, ho, ho, around, w..."
3,"[Subject, photoshop, windows, office, cheap, m..."
4,"[Subject, indian, springs, deal, book, teco, p..."


I convert the cleaned text data into a format that can be used by machine learning algorithms. This involves tokenizing the text and using a bag-of-words approach.

In [None]:
#convert a collection of text to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
messages_bow = CountVectorizer(analyzer=process_text).fit_transform(df['text'])
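
As a rough illustration of what a bag-of-words matrix contains, the sketch below builds one by hand for two made-up, already-tokenized documents. It is not how CountVectorizer is implemented (which returns a sparse matrix); it only shows the idea of one row per document and one column per distinct token.

```python
from collections import Counter

# Made-up, already-tokenized documents (illustrative only).
docs = [
    ["cheap", "software", "buy", "cheap"],
    ["meeting", "schedule", "buy"],
]

# The vocabulary is the set of all distinct tokens; each one gets a column.
vocab = sorted({token for doc in docs for token in doc})

def to_counts(doc):
    """Return how often each vocabulary token occurs in one document."""
    counts = Counter(doc)
    return [counts.get(token, 0) for token in vocab]

matrix = [to_counts(doc) for doc in docs]

print(vocab)   # ['buy', 'cheap', 'meeting', 'schedule', 'software']
print(matrix)  # [[1, 2, 0, 0, 1], [1, 0, 1, 1, 0]]
```

Word order is discarded: only the counts survive, which is what the Naive Bayes model consumes.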

I split the data into training and testing sets to evaluate the performance of our model.

In [None]:
#Split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages_bow, df['label'], test_size=0.20, random_state=0)
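
The split sizes can be checked with simple arithmetic. This sketch assumes scikit-learn rounds the test share up when test_size is a fraction, which agrees with the support totals (4136 and 1035) reported in the evaluation output of this notebook.

```python
import math

n_samples = 5171   # number of rows, from df.shape
test_size = 0.20   # fraction passed to train_test_split

# Assumption: the test share is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 4136 1035
```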

In [None]:
#get the shape of the messages_bow
messages_bow.shape

(5171, 50381)

Naive Bayes is a popular algorithm for text classification tasks. Here, I train the model on our training set.

In [None]:
#Create and train the naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, y_train)
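
For intuition about what the classifier learns, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on a made-up toy corpus. The function names train_nb and predict_nb are hypothetical, and this is a simplified illustration rather than scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate class log-priors and smoothed per-token log-likelihoods."""
    vocab = {token for doc in docs for token in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Smoothing adds alpha to every token count so unseen tokens
        # never get probability zero.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_like

def predict_nb(model, doc):
    """Return the class with the highest log-posterior score."""
    log_prior, log_like = model
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_like[c][t] for t in doc if t in log_like[c]
        )
    return max(scores, key=scores.get)

# Made-up toy corpus (illustrative only).
docs = [
    ["cheap", "software", "buy", "now"],
    ["cheap", "pills", "buy"],
    ["meeting", "schedule", "attached"],
    ["see", "attached", "schedule", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, ["buy", "cheap", "software"]))  # spam
print(predict_nb(model, ["meeting", "attached"]))       # ham
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on long emails.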

In [None]:
#print the predictions
print(classifier.predict(X_train))

#print the actual values
print(y_train.values)

['ham' 'ham' 'ham' ... 'spam' 'ham' 'ham']
['ham' 'ham' 'ham' ... 'spam' 'ham' 'ham']
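
The bag-of-words matrix in this notebook is built from all emails before the split, so its vocabulary also reflects words that occur only in test emails. One common alternative, sketched here on a made-up toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn Pipeline, so the vocabulary is learned from the training portion only. The toy texts, the 25% test share, and the use of the default analyzer are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for df['text'] and df['label'].
texts = [
    "cheap software buy now",
    "buy cheap pills online",
    "win money now cheap offer",
    "limited offer buy now",
    "meeting schedule attached",
    "see attached schedule for the meeting",
    "please review the attached report",
    "project meeting moved to monday",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Split the raw text first, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits the vectorizer on the training text only, then the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(list(zip(X_test, preds)))
```

In the notebook itself, the same idea would apply with CountVectorizer(analyzer=process_text) in place of the default vectorizer.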


I evaluate the model with standard metrics (accuracy, precision, recall, and F1-score), first on the training set and then on the held-out test set.

In [None]:
#Evaluate the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(X_train)
print(classification_report(y_train, pred))
print()
print('Confusion Matrix: \n', confusion_matrix(y_train, pred))
print()
print('Accuracy: ', accuracy_score(y_train, pred))


              precision    recall  f1-score   support

         ham       0.99      0.99      0.99      2940
        spam       0.98      0.97      0.98      1196

    accuracy                           0.99      4136
   macro avg       0.99      0.98      0.98      4136
weighted avg       0.99      0.99      0.99      4136


Confusion Matrix: 
 [[2918   22]
 [  30 1166]]

Accuracy:  0.9874274661508704


In [None]:
#print the predictions
print(classifier.predict(X_test))

#print the actual values
print(y_test.values)

['ham' 'ham' 'ham' ... 'ham' 'spam' 'ham']
['ham' 'ham' 'ham' ... 'ham' 'spam' 'ham']


In [None]:
#Evaluate the model on the test data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(X_test)
print(classification_report(y_test, pred))
print()
print('Confusion Matrix: \n', confusion_matrix(y_test, pred))
print()
print('Accuracy: ', accuracy_score(y_test, pred))

              precision    recall  f1-score   support

         ham       0.98      0.98      0.98       732
        spam       0.95      0.96      0.96       303

    accuracy                           0.97      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.97      0.97      0.97      1035


Confusion Matrix: 
 [[718  14]
 [ 13 290]]

Accuracy:  0.9739130434782609
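
The reported test metrics can be reproduced by hand from the test-set confusion matrix above. The sketch assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes, in the order ham, spam), which is consistent with the support counts of 732 and 303, and treats spam as the positive class.

```python
# Test-set confusion matrix copied from the output above.
# Rows: actual [ham, spam]; columns: predicted [ham, spam].
cm = [[718, 14],
      [13, 290]]

tn, fp = cm[0]  # ham kept as ham, ham flagged as spam
fn, tp = cm[1]  # spam missed, spam caught

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of flagged emails that really are spam
recall = tp / (tp + fn)      # share of spam emails that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.9739 0.95 0.96 0.96
```

In practical terms, 14 legitimate emails were flagged as spam and 13 spam emails slipped through. Test accuracy (about 0.974) is slightly below training accuracy (about 0.987).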
