# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or normal emails (0).

🧹 First, you will apply cleaning techniques to the text data.

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [1]:
!pip install nltk



In [2]:
# When using nltk for the first time, we also need to download a few of its data resources

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /Users/dima/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/dima/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/dima/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/dima/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [3]:
import pandas as pd


df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
df.head(10)

,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
5,"Subject: great nnews hello , welcome to medzo...",1
6,Subject: here ' s a hot play in motion homela...,1
7,Subject: save your money buy getting this thin...,1
8,Subject: undeliverable : home based business f...,1
9,Subject: save your money buy getting this thin...,1


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are classified as ham (0) or spam (1). You need to clean the dataset before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [4]:
import numpy as np

In [5]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [6]:
# Nested loop: the outer loop goes through each email, the inner loop strips each punctuation character

clean_text = []
for text in df.text:
    for punc in string.punctuation:
        text = text.replace(punc, '')
    clean_text.append(text)

# Assign the list directly; wrapping it in a DataFrame is unnecessary
df['clean_text'] = clean_text

df.head(10)

,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,Subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,Subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,Subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,Subject do not have money get software cds fr...
5,"Subject: great nnews hello , welcome to medzo...",1,Subject great nnews hello welcome to medzonl...
6,Subject: here ' s a hot play in motion homela...,1,Subject here s a hot play in motion homeland...
7,Subject: save your money buy getting this thin...,1,Subject save your money buy getting this thing...
8,Subject: undeliverable : home based business f...,1,Subject undeliverable home based business for...
9,Subject: save your money buy getting this thin...,1,Subject save your money buy getting this thing...
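The exercise asks for a function, so here is a minimal sketch of one possible version, assuming a translation table built from `string.punctuation`. The names `PUNCTUATION_TABLE` and `remove_punctuation` are illustrative and not part of the original notebook.

```python
import string

# Translation table that maps every punctuation character to "nothing"
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    """Return `text` with all characters from string.punctuation removed."""
    return text.translate(PUNCTUATION_TABLE)


sample = remove_punctuation("Hello, world!")
print(sample)
```

With the notebook's dataframe, this could be applied as `df['clean_text'] = df['text'].apply(remove_punctuation)`. Spaces around removed punctuation are left in place, so sequences such as ` , ` become two consecutive spaces.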


### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [7]:
def lowercase(clean_text):
    clean_text = [text.lower() for text in clean_text]
    return clean_text


In [8]:
df.clean_text = lowercase(df.clean_text)

In [9]:
df.head(10)

,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
5,"Subject: great nnews hello , welcome to medzo...",1,subject great nnews hello welcome to medzonl...
6,Subject: here ' s a hot play in motion homela...,1,subject here s a hot play in motion homeland...
7,Subject: save your money buy getting this thin...,1,subject save your money buy getting this thing...
8,Subject: undeliverable : home based business f...,1,subject undeliverable home based business for...
9,Subject: save your money buy getting this thin...,1,subject save your money buy getting this thing...


### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [10]:
def num_rem(text):
    return "".join(char for char in text if not char.isdigit())


In [11]:
df.clean_text = df.clean_text.apply(num_rem)
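As an alternative sketch, the same step can be written with a regular expression. The function name `remove_digits` is hypothetical, and the whitespace clean-up at the end is an optional extra that `num_rem` does not perform.

```python
import re


def remove_digits(text):
    """Drop every run of digits, then collapse the leftover whitespace."""
    without_digits = re.sub(r"\d+", "", text)
    # Removing "4" from "subject 4 color" leaves a double space; split/join tidies it up
    return " ".join(without_digits.split())


sample = remove_digits("subject 4 color printing")
print(sample)
```

Collapsing whitespace is not strictly needed here, since the later tokenization steps split on whitespace anyway.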

### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [12]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

from nltk.tokenize import word_tokenize

In [13]:
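def remove_stopwords(text):
    # Named remove_stopwords so it does not shadow the `stopwords` corpus imported from nltk
    text_toke = word_tokenize(text)
    return " ".join(w for w in text_toke if w not in stop_words)

In [14]:
df.clean_text = df.clean_text.apply(remove_stopwords)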

In [15]:
print(df.clean_text[0])

subject naturally irresistible corporate identity lt really hard recollect company market full suqgestions information isoverwhelminq good catchy logo stylish statlonery outstanding website make task much easier promise havinq ordered iogo company automaticaily become world ieader isguite ciear without good products effective business organization practicable aim hotat nowadays market promise marketing efforts become much effective list clear benefits creativeness hand made original logos specially done reflect distinctive company image convenience logo stationery provided formats easy use content management system letsyou change website content even structure promptness see logo drafts within three business days affordability marketing break make gaps budget satisfaction guaranteed provide unlimited amount changes extra fees surethat love result collaboration look portfolio interested
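To see the filtering logic in isolation, here is a small sketch that does not depend on the NLTK downloads. `SAMPLE_STOP_WORDS` is a hypothetical hand-picked subset, not the full NLTK English list, and `drop_stop_words` splits on whitespace instead of calling `word_tokenize`.

```python
# Hypothetical, tiny stop-word set used only for this illustration
SAMPLE_STOP_WORDS = {"your", "the", "is", "a", "to"}


def drop_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Keep only the whitespace-separated words that are not in `stop_words`."""
    return " ".join(word for word in text.split() if word not in stop_words)


sample = drop_stop_words("subject naturally irresistible your corporate identity")
print(sample)
```

Storing the stop words in a `set` keeps each membership check fast, which matters when the function runs over thousands of emails.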


### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [17]:
from nltk.stem import WordNetLemmatizer

In [25]:
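wordnet_lemmatizer = WordNetLemmatizer()  # create once rather than once per word

def lemmatizer(text):
    text_toke = word_tokenize(text)
    # pos='v' lemmatizes each token as a verb
    text_lemmatized = [wordnet_lemmatizer.lemmatize(word, pos='v') for word in text_toke]
    return " ".join(text_lemmatized)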

In [26]:
df.clean_text = df.clean_text.apply(lemmatizer)

## (2) Bag-of-words Modelling

### (2.1) Converting the text data into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

In [44]:
count_vectorizer = CountVectorizer()

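# fit_transform returns a sparse matrix of word counts (one row per email, one column per word)
X = count_vectorizer.fit_transform(df.clean_text)

# Dense DataFrame for inspection; each row is labelled with its cleaned email text
X_bow = pd.DataFrame(X.toarray(),columns=count_vectorizer.get_feature_names_out(),index=df.clean_text)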

In [45]:
X_bow

clean_text,aa,aaa,aaaenerfax,aadedeji,aagrawal,aal,aaldous,aaliyah,aall,aanalysis,...,zwzm,zxghlajf,zyban,zyc,zygoma,zymg,zzmacmac,zzn,zzncacst,zzzz
subject naturally irresistible corporate identity lt really hard recollect company market full suqgestions information isoverwhelminq good catchy logo stylish statlonery outstanding website make task much easier promise havinq order iogo company automaticaily become world ieader isguite ciear without good product effective business organization practicable aim hotat nowadays market promise market effort become much effective list clear benefit creativeness hand make original logo specially do reflect distinctive company image convenience logo stationery provide format easy use content management system letsyou change website content even structure promptness see logo draft within three business day affordability market break make gap budget satisfaction guarantee provide unlimited amount change extra fee surethat love result collaboration look portfolio interest,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject stock trade gunslinger fanny merrill muzo colza attainder penultimate like esmark perspicuous ramble segovia group try sling kansa tanzania yes chameleon continuant clothesman libretto chesapeake tight waterway herald hawthorn like chisel morristown superior deoxyribonucleic clockwork try hall incredible mcdougall yes hepburn einsteinian earmark sapling boar duane plain palfrey inflexible like huzzah pepperoni bedtime nameable attire try edt chronography optimum yes pirogue diffusion albeit,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject unbelievable new home make easy im want show homeowner pre approve home loan fix rate offer extend unconditionally credit way factor take advantage limit time opportunity ask visit website complete minute post approval form look foward hear dorcas pittman,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject color print special request additional information click click printable version order form pdf format phone fax e mail ramsey goldengraphix com request additional information click click printable version order form pdf format golden graphix print azusa canyon rd irwindale ca e mail message advertisement solicitation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject money get software cd software compatibility great grow old along best yet tradgedies finish death comedy end marriage,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
subject research development charge gpg forward shirley crenshaw hou ect vince j kaminski pm vera apodaca et enron enron cc vince j kaminski hou ect ect shirley crenshaw hou ect ect pinnamaneni krishnarao hou ect ect subject research development charge gpg vera shall talk account group correction vince pm vera apodaca enron vera apodaca enron vera apodaca enron pm pm pinnamaneni krishnarao hou ect ect cc vince j kaminski hou ect ect subject research development charge gpg per mail date june kim watson suppose occur true july fist six month review july actuals able locate entry would pls let know whether entry make intend process thank,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject receipt visit jim thank invitation visit lsu shirley fedex receipt tomorrow vince jam r garven pm vince j kaminski cc subject receipt visit dear vince thank take time visit faculty student get lot presentation favor ask concern expense reimbursement process mail travel lodge receipt secretary joan payne follow address joan payne department finance ceba louisiana state university baton rouge la thank jim garven jam r garven william h wright jr endow chair financial service department finance ceba e j ourso college business administration louisiana state university baton rouge la voice fax e mail jgarven lsu edu home page http garven lsu edu vita http garven lsu edu dossier html research paper archive http garven lsu edu research html,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject enron case study update wow day super thank much vince come baylor monday next week hash question list thank john pm write good afternoon john want drop line update andy fastow confirm one hour interview slot mr fastow monday december th noon addition schedule interview mr lay mr skilling outline question please hesitate contact regard cindy forward cindy derecskey corp enron pm cindy derecskey john martin cc vince j kaminski hou ect ect christie patrick hou ect ect subject enron case study document link cindy derecskey pm good afternoon john hope thing well write update status meet andy fastow ken lay jeff skilling arrange follow meet date time ken lay jeff skilling still try work andy fastow schedule jeff skilling december th p ken lay december th p also attempt schedule meet andy fastow december th convenience also allow u possibly schedule additional meet th need let know soon successful regard cindy derecskey university affair enron corp john martin carr p collins chair finance finance department baylor university po box waco tx office fax j martin baylor edu web http hsb baylor edu html martinj home html,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
subject interest david please call shirley crenshaw assistant extension set vince david p dupre pm vince j kaminski hou ect ect cc subject interest time available next day thank david vince j kaminski pm david p dupre hou ect ect cc vince j kaminski hou ect ect subject interest david please stop chat minute vince david p dupre vince j kaminski hou ect ect cc subject interest may meet discus interest join group strong quantitative discipline highly numerate thank david forward david p dupre hou ect david p dupre hou ect ect cc subject interest vince kaminski,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
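The table above is hard to read at this scale, so here is a hand-rolled sketch of the same idea on three toy documents. The documents are made up for illustration. Unlike `CountVectorizer`, this sketch simply splits on whitespace and does not lowercase or drop single-character tokens.

```python
from collections import Counter

# Toy corpus, invented for this illustration
docs = ["free money free", "meeting monday", "free meeting"]

# The vocabulary is the sorted set of all words seen in the corpus
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, each cell a word count
bow_rows = []
for doc in docs:
    counts = Counter(doc.split())
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

Each row ignores word order and keeps only how often each vocabulary word occurs, which is exactly the information Multinomial Naive Bayes works with.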


### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [53]:
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

In [57]:
# Target: 1 for spam, 0 for ham (the features are the bag-of-words matrix X_bow)
y = df.spam

In [65]:
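# Note: scoring=['recall'] reports the mean recall over the 5 folds, not accuracy;
# pass scoring=['accuracy'] and read 'test_accuracy' to get the accuracy the question asks for
score = cross_validate(model, X_bow, y, cv=5, scoring=['recall'])['test_recall'].mean()

print(score)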

0.9890403999893052
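Fitting the vectorizer on the full dataset before cross-validating lets words from the validation folds shape the feature space. One common way to avoid this is to wrap the vectorizer and the model in a scikit-learn pipeline, so the vocabulary is refit on each training fold. The sketch below assumes a tiny made-up corpus (`toy_texts`, `toy_labels`) instead of the real emails and reports both accuracy and recall; its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for this illustration: 6 spam-like and 6 ham-like texts
toy_texts = [
    "win free money now", "free prize claim now", "cheap loan offer free",
    "win cash prize today", "claim free cash offer", "cheap money win now",
    "meeting schedule for monday", "please review the report",
    "lunch with the team monday", "project report attached",
    "schedule the review meeting", "team lunch on friday",
]
toy_labels = [1] * 6 + [0] * 6

# The vectorizer is refit inside every fold, on the training part only
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

results = cross_validate(
    pipeline, toy_texts, toy_labels, cv=3, scoring=["accuracy", "recall"]
)
mean_accuracy = results["test_accuracy"].mean()
mean_recall = results["test_recall"].mean()

print(f"accuracy: {mean_accuracy:.3f}, recall: {mean_recall:.3f}")
```

In the notebook, the same pipeline could be cross-validated on `df.clean_text` and `df.spam` directly, without building `X_bow` first.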


🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!