<a href="https://colab.research.google.com/github/DaveChui/Spam-Email-Detection/blob/main/New_HAMSPAM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Email spam classification case study
```text
Download the datafile email data.tar.gz. This datafile contains around 5,000 emails
divided into two folders, ‘ham’ and ‘spam’ (about 3,500 emails in the ‘ham’ folder
and 1,500 emails in the ‘spam’ folder). Each email is a separate text file in these folders.
These emails have been slightly preprocessed to remove metadata.
```

## (i) (Embedding text data in Euclidean space) 
```text
The first challenge you face is how to systematically embed text data in a Euclidean space.
One successful way of transforming text data into vectors is the “bag-of-words” model.
Given a dictionary of all possible words in some fixed order, each text document can be
represented as a word-count vector recording how often each dictionary word occurs in that document.
```
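
To make the word-count idea concrete, here is a minimal sketch of a bag-of-words embedding on two made-up documents. The helper names `build_vocabulary` and `bag_of_words` and the toy sentences are illustrative assumptions, not part of the assignment or of the notebook's own code.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of the distinct whitespace-separated words."""
    return sorted({word for doc in documents for word in doc.split()})

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

# Toy documents (hypothetical, for illustration only)
docs = ["free prize claim free", "meeting notes attached"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
print(vectors)
```

Each document becomes a vector with one coordinate per dictionary word, so all documents live in the same Euclidean space regardless of their length.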

> Your first task is to embed the given email data in a Euclidean space by first performing word stemming and then applying the bag-of-words model.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# Import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import nltk 
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
ps = PorterStemmer()

In [7]:
import os

# Folder containing the ham emails (the path is absolute, so joining it with os.getcwd() has no effect)
ham_folder = "/content/drive/MyDrive/freelance_gig/COMS 4771 HW1 (Fall 2022) (Recieved: 24.9.22) (Due: 27.9.22)/ham/"

# Walk the folder and read every email as text
ham_data = []
for root, folders, files in os.walk(ham_folder):
  for file in files:
    path = os.path.join(root, file)
    with open(path) as inf:
      ham_data.append(inf.read())

# Create a dataframe for the ham emails (a single column named 0)
ham_df = pd.DataFrame(ham_data)

In [8]:
len(ham_data)

3672

In [9]:
#add a test_label column with values "ham"
ham_df['test_label']='ham'

In [10]:
# Rename column 0 to "emails"
ham_df["emails"] = ham_df.pop(0)

In [11]:
ham_df.head()

Unnamed: 0,test_label,emails
0,ham,Subject: # 9760\ntried to get fancy with your ...
1,ham,Subject: enron actuals for march 30 - april 1 ...
2,ham,"Subject: hpl nom for may 30 , 2001\n( see atta..."
3,ham,Subject: hpl nom for may 31 2001\n( see attach...
4,ham,"Subject: hpl nom for may 25 , 2001\n( see atta..."


In [12]:
ham_df.iloc[0]

test_label                                                  ham
emails        Subject: # 9760\ntried to get fancy with your ...
Name: 0, dtype: object

In [13]:
# Reading the files in the spam folder
spam_folder = "/content/drive/MyDrive/freelance_gig/COMS 4771 HW1 (Fall 2022) (Recieved: 24.9.22) (Due: 27.9.22)/spam/"

# Walk the folder and read every email
spam_data = []
for root, folders, files in os.walk(spam_folder):
  for file in files:
    path = os.path.join(root, file)
    # 'rb' reads raw bytes, so each spam email is stored as a bytes object (b'...')
    with open(path, 'rb') as inf:
      spam_data.append(inf.read())

# Create a dataframe for the spam emails (a single column named 0)
spam_df = pd.DataFrame(spam_data)

In [14]:
#add a test_label column with values "spam"
spam_df['test_label']='spam'

In [15]:
# Rename column 0 to "emails"
spam_df["emails"] = spam_df.pop(0)

In [16]:
spam_df.tail()

Unnamed: 0,test_label,emails
1495,spam,b'Subject: cheap soft cialis tabs\r\nthese pil...
1496,spam,b'Subject: ordering this pain medication\r\npa...
1497,spam,"b""Subject: cialis , xanax , valium , viagra at..."
1498,spam,"b""Subject: is it big enough ?\r\nyou ' ve seen..."
1499,spam,b'Subject: the . m here on the\r\nhtmlbody\r\n...
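
The `b'...'` prefix in the spam rows comes from reading those files in binary mode. If such values are later converted with `str()`, the wrapper and the escaped `\r\n` become part of the text, which is where tokens like `bsubject` and `rn` originate once punctuation is stripped. The sketch below contrasts `str()` with `bytes.decode()` on a made-up value; the choice of UTF-8 with `errors="ignore"` is an assumption, since the actual encoding of the dataset is not stated.

```python
# Toy bytes value, shaped like what open(path, 'rb').read() returns
raw = b"Subject: hello\r\nworld"

# str() keeps the b'...' wrapper and escapes the line break as literal characters
as_str = str(raw)

# decode() recovers the actual text (the encoding here is an assumption)
decoded = raw.decode("utf-8", errors="ignore")

print(as_str)
print(repr(decoded))
```

Decoding before any text cleaning keeps ham and spam emails in the same form, so a classifier cannot separate them by read-mode artifacts alone.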


In [17]:
from itertools import chain
all_emails = {key: list(chain(ham_df[key], spam_df[key])) for key in ham_df}

In [18]:
# Create a dataframe to store all our emails
all_emails_df = pd.DataFrame.from_dict(all_emails)

In [19]:
#Preview the top of the data
all_emails_df.head(5)

Unnamed: 0,test_label,emails
0,ham,Subject: # 9760\ntried to get fancy with your ...
1,ham,Subject: enron actuals for march 30 - april 1 ...
2,ham,"Subject: hpl nom for may 30 , 2001\n( see atta..."
3,ham,Subject: hpl nom for may 31 2001\n( see attach...
4,ham,"Subject: hpl nom for may 25 , 2001\n( see atta..."


In [20]:
all_emails_df.tail(5)

Unnamed: 0,test_label,emails
5167,spam,b'Subject: cheap soft cialis tabs\r\nthese pil...
5168,spam,b'Subject: ordering this pain medication\r\npa...
5169,spam,"b""Subject: cialis , xanax , valium , viagra at..."
5170,spam,"b""Subject: is it big enough ?\r\nyou ' ve seen..."
5171,spam,b'Subject: the . m here on the\r\nhtmlbody\r\n...


In [21]:
# Shuffle the emails
all_emails_df = all_emails_df.sample(frac = 1)

In [22]:
all_emails_df.tail()

Unnamed: 0,test_label,emails
834,ham,"Subject: noms\ndaren , there seems to be two n..."
3742,spam,b'Subject: claim your winning prize\r\nbingoli...
4773,spam,"b""Subject: meet me for wild sex . i ' m finall..."
402,ham,"Subject: enron actuals for june 08 , 2000\ntec..."
1763,ham,Subject: juvenile diabetes foundation fundrais...


In [23]:
# converting labels to binary
all_emails_df['test_label']=(all_emails_df['test_label']=='spam').astype(int)
all_emails_df

Unnamed: 0,test_label,emails
3275,0,"Subject: cleburne\ndaren & john ,\nsee the att..."
185,0,Subject: txu noms . for 10 / 3 / 2000\nattache...
2297,0,"Subject: 98 - 2601\nhi daren ,\ni ' m attempti..."
1408,0,Subject: panenergy marketing exchange deal\nsi...
4805,1,"b""Subject: impact equity report\r\nmineral exp..."
...,...,...
834,0,"Subject: noms\ndaren , there seems to be two n..."
3742,1,b'Subject: claim your winning prize\r\nbingoli...
4773,1,"b""Subject: meet me for wild sex . i ' m finall..."
402,0,"Subject: enron actuals for june 08 , 2000\ntec..."


In [24]:
all_emails_df.isnull().sum()

test_label    0
emails        0
dtype: int64

In [25]:
# convert the email text to lower case
all_emails_df['emails'] = all_emails_df['emails'].apply(lambda x: x.lower())

In [26]:
all_emails_df.columns

Index(['test_label', 'emails'], dtype='object')

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [28]:
all_emails_df['emails'] = all_emails_df['emails'].astype(str)

In [29]:
# Keep only the rows whose email is a str or bytes object
# (after the astype(str) call above, every row already qualifies)
mask = [isinstance(item, (str, bytes)) for item in all_emails_df['emails']]
all_emails_df = all_emails_df.loc[mask]

In [30]:
# Tokenization (the tokens are kept in a separate variable; the 'emails' column is left unchanged)
tokens = all_emails_df['emails'].apply(word_tokenize)
all_emails_df.tail(5)

Unnamed: 0,test_label,emails
834,0,"subject: noms\ndaren , there seems to be two n..."
3742,1,b'subject: claim your winning prize\r\nbingoli...
4773,1,"b""subject: meet me for wild sex . i ' m finall..."
402,0,"subject: enron actuals for june 08 , 2000\ntec..."
1763,0,subject: juvenile diabetes foundation fundrais...


In [31]:
all_emails_df.head()

Unnamed: 0,test_label,emails
3275,0,"subject: cleburne\ndaren & john ,\nsee the att..."
185,0,subject: txu noms . for 10 / 3 / 2000\nattache...
2297,0,"subject: 98 - 2601\nhi daren ,\ni ' m attempti..."
1408,0,subject: panenergy marketing exchange deal\nsi...
4805,1,"b""subject: impact equity report\r\nmineral exp..."


In [32]:
# Clean the emails by removing punctuation
import string

def clean_emails(email):
    """Return the email text with every punctuation character removed."""
    return "".join(char for char in email if char not in string.punctuation)

In [33]:
all_emails_df['emails'] = all_emails_df['emails'].apply(clean_emails)

In [34]:
all_emails_df.head()

Unnamed: 0,test_label,emails
3275,0,subject cleburne\ndaren john \nsee the attach...
185,0,subject txu noms for 10 3 2000\nattached ar...
2297,0,subject 98 2601\nhi daren \ni m attempting t...
1408,0,subject panenergy marketing exchange deal\nsit...
4805,1,bsubject impact equity reportrnmineral explora...


In [35]:
# Remove all the stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # a set makes the membership test fast
all_emails_df['emails'] = all_emails_df['emails'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
# preview to check the stopwords are removed
all_emails_df.head()

Unnamed: 0,test_label,emails
3275,0,subject cleburne daren john see attached docum...
185,0,subject txu noms 10 3 2000 attached txu nomina...
2297,0,subject 98 2601 hi daren attempting clear ment...
1408,0,subject panenergy marketing exchange deal sita...
4805,1,bsubject impact equity reportrnmineral explora...
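
The task statement asks for word stemming before the bag-of-words step, and `PorterStemmer` is imported above. A minimal sketch of that step might look like the following; the helper name `stem_text` and the sample string are illustrative assumptions.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated token and rejoin the stems with spaces."""
    return " ".join(stemmer.stem(token) for token in text.split())

# Toy input (hypothetical), already lower-cased and stripped of punctuation
sample = "running emails attached"
stemmed = stem_text(sample)
print(stemmed)
```

Applied column-wise, for example with `all_emails_df['emails'].apply(stem_text)`, this would merge inflected forms of a word into one dictionary entry before counting.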


## (ii) (Developing a spam classifier)

Once you have a nice Euclidean representation of the email data, your next task is to develop a spam classifier that labels new emails as spam or not-spam. You should compare the performance of naive Bayes, nearest neighbor (with the L1, L2 and L∞ metrics) and decision tree classifiers. (You may use built-in functions for performing basic linear algebra and probability calculations, but you should write the classifiers from scratch.) You must submit your code to Courseworks to receive full credit.

In [38]:
# Now we can create the bag-of-words model.
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
bag_of_words = count_vectorizer.fit_transform(all_emails_df['emails'])
X = pd.DataFrame(bag_of_words.toarray(),
                 columns=count_vectorizer.get_feature_names_out())
y = all_emails_df['test_label']  # 0 = ham, 1 = spam

In [47]:
# Split the data into train and test sets
# (30% held out, which matches the 1,552 test rows reported below)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0, shuffle = True)



In [40]:
print(y_test.shape)

(1552,)


## Building the Models
### 1. Multinomial Naive Bayes

In [43]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [85]:
# Training our model
model = MultinomialNB()
model.fit(X_train,y_train)

MultinomialNB()

In [88]:
y_predict = model.predict(X_test)

In [90]:
accuracyScore = accuracy_score(y_test,y_predict)*100

In [91]:
print("Model's prediction Accuracy :",accuracyScore)

Model's prediction Accuracy : 99.74226804123711


In [92]:
print(classification_report(y_test, y_predict))

print(confusion_matrix(y_test, y_predict))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1115
           1       0.99      1.00      1.00       437

    accuracy                           1.00      1552
   macro avg       1.00      1.00      1.00      1552
weighted avg       1.00      1.00      1.00      1552

[[1111    4]
 [   0  437]]
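
Since the assignment asks for classifiers written from scratch, here is a minimal NumPy sketch of multinomial naive Bayes with Laplace smoothing. It is an illustration under stated assumptions, not the implementation behind `MultinomialNB`; the function names and the toy count matrix are hypothetical.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Total count of every word within each class, plus the smoothing term
    word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    """Pick the class with the highest posterior log score for each row."""
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Toy count matrix: each column holds the counts of one hypothetical word
X_toy = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
y_toy = np.array([1, 1, 0, 0])

params = fit_multinomial_nb(X_toy, y_toy)
pred = predict_multinomial_nb(np.array([[5, 0], [0, 5]]), *params)
print(pred)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing term `alpha` keeps unseen words from producing a zero probability.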


### 2. Decision Tree

In [95]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split=50, random_state=0)
dt.fit(X_train,y_train)
y_dtc = dt.predict(X_test)


In [96]:
print('The Decision tree Accuracy: ', accuracy_score(y_test, y_dtc)*100, '%')
# confusion_matrix expects the true labels first, then the predictions
print('The Decision tree confusion_matrix: ', confusion_matrix(y_test, y_dtc))

The Decision tree Accuracy:  100.0 %
The Decision tree confusion_matrix:  [[1115    0]
 [   0  437]]
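
For a from-scratch decision tree, the core step is the search for the split that leaves the purest child nodes. The sketch below shows one such search using Gini impurity on toy data. The function names are hypothetical, and this is only the single-node split step that a full tree would apply recursively, not scikit-learn's algorithm.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature, threshold, impurity) minimising the weighted Gini impurity."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):
        # The largest value of a feature cannot separate any rows, so skip it
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy word-count matrix with two hypothetical features
X_toy = np.array([[0, 2], [1, 0], [3, 1], [4, 2]])
y_toy = np.array([0, 0, 1, 1])

feature, threshold, impurity = best_split(X_toy, y_toy)
print(feature, threshold, impurity)
```

A full tree would call `best_split` on each child subset until a stopping rule is met, such as a minimum number of samples per node.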


### 3. K-Nearest Neighbours

In [41]:
from sklearn.neighbors import KNeighborsClassifier

In [80]:
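# k value = 3
knn_3 = KNeighborsClassifier(n_neighbors=3)
knn_3.fit(X_train, y_train)
# Predict on the held-out test set so the predictions line up with y_test.
# The report recorded below was produced by predicting on X_train, which
# pairs each prediction with an unrelated label.
y_pred = knn_3.predict(X_test)
print("Classification report is: ")
print(classification_report(y_test, y_pred))
print("Accuracy Score of KNN(K=3): ")
print(round(accuracy_score(y_test, y_pred), 3))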

Classification report is: 
              precision    recall  f1-score   support

           0       0.70      0.71      0.70      1830
           1       0.27      0.26      0.26       756

    accuracy                           0.58      2586
   macro avg       0.48      0.48      0.48      2586
weighted avg       0.57      0.58      0.57      2586

Accuracy Score of KNN(K=3): 
0.577


In [46]:
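# k value = 5
knn_5 = KNeighborsClassifier(n_neighbors=5)
knn_5.fit(X_train, y_train)
# Predict on the held-out test set so the predictions line up with y_test.
# The report recorded below was produced by predicting on X_train, which
# pairs each prediction with an unrelated label.
y_pred = knn_5.predict(X_test)
print("Classification report is: ")
print(classification_report(y_test, y_pred))
print("Accuracy Score of KNN(K=5): ")
print(round(accuracy_score(y_test, y_pred), 5))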

Classification report is: 
              precision    recall  f1-score   support

           0       0.71      0.69      0.70      1845
           1       0.29      0.31      0.30       741

    accuracy                           0.58      2586
   macro avg       0.50      0.50      0.50      2586
weighted avg       0.59      0.58      0.59      2586

Accuracy Score of KNN(K=5): 
0.57966
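
The task calls for nearest neighbor under the L1, L2 and L∞ metrics, written from scratch. A minimal NumPy sketch of that comparison is shown below on toy points. The function names, the metric labels `"l1"`, `"l2"`, `"linf"` and the data are illustrative assumptions, and the vote assumes binary 0/1 labels held in NumPy arrays.

```python
import numpy as np

def pairwise_distance(A, B, metric="l2"):
    """Distances between every row of A and every row of B."""
    diff = np.abs(A[:, None, :] - B[None, :, :])
    if metric == "l1":
        return diff.sum(axis=2)
    if metric == "l2":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "linf":
        return diff.max(axis=2)
    raise ValueError(f"unknown metric: {metric}")

def knn_predict(X_train, y_train, X_test, k=3, metric="l2"):
    """Majority vote among the k nearest training rows (binary 0/1 labels)."""
    dist = pairwise_distance(X_test, X_train, metric)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nearest]
    return (votes.mean(axis=1) > 0.5).astype(int)

# Toy data: two well-separated clusters (hypothetical)
X_tr = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_tr = np.array([0, 0, 0, 1, 1, 1])
X_te = np.array([[0.5, 0.5], [5.5, 5.5]])

for metric in ("l1", "l2", "linf"):
    print(metric, knn_predict(X_tr, y_tr, X_te, k=3, metric=metric))
```

The broadcasting in `pairwise_distance` builds an array of shape (test rows, train rows, features), so on the full bag-of-words matrix the test rows would need to be processed in small batches. A pandas `y_train` would also need converting with `.to_numpy()` first.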
