# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spams (1) or normal emails (0)

🧹 First, you will apply cleaning techniques to these textual data

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation

✉️ Eventually, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either a spam or a regular email.

## (0) The NTLK library (Natural Language Toolkit)

In [None]:
#!pip install nltk

In [1]:
# When importing nltk for the first time, we need to also download a few built-in libraries

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/orchidaung/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/orchidaung/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/orchidaung/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/orchidaung/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [98]:
import pandas as pd

df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are classified as ham [0] or spam[1]. You need to clean the dataset before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [99]:
#Remove punctuation

import string
string.punctuation  

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [100]:
for pun in string.punctuation:
    df['clean_text'] = df['text'].str.replace(pun,'')
df.head()

  df['clean_text'] = df['text'].str.replace(pun,'')


Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject: naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...,1,Subject: the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...,1,Subject: unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...,1,Subject: 4 color printing special request add...
4,"Subject: do not have money , get software cds ...",1,"Subject: do not have money , get software cds ..."


### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [101]:
df['clean_text'] = df['clean_text'].str.lower()
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject: naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...,1,subject: the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...,1,subject: unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...,1,subject: 4 color printing special request add...
4,"Subject: do not have money , get software cds ...",1,"subject: do not have money , get software cds ..."


### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [102]:
df['clean_text'] = df['clean_text'].str.replace('\d','')
df.head()

  df['clean_text'] = df['clean_text'].str.replace('\d','')


Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject: naturally irresistible your corporate...
1,Subject: the stock trading gunslinger fanny i...,1,subject: the stock trading gunslinger fanny i...
2,Subject: unbelievable new homes made easy im ...,1,subject: unbelievable new homes made easy im ...
3,Subject: 4 color printing special request add...,1,subject: color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,"subject: do not have money , get software cds ..."


### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [103]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))

In [104]:
df['clean_text'] =  df['clean_text'].apply(lambda x : word_tokenize(x))
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,"[subject, :, naturally, irresistible, your, co..."
1,Subject: the stock trading gunslinger fanny i...,1,"[subject, :, the, stock, trading, gunslinger, ..."
2,Subject: unbelievable new homes made easy im ...,1,"[subject, :, unbelievable, new, homes, made, e..."
3,Subject: 4 color printing special request add...,1,"[subject, :, color, printing, special, request..."
4,"Subject: do not have money , get software cds ...",1,"[subject, :, do, not, have, money, ,, get, sof..."


In [105]:
stopwords_removed = df['clean_text'].apply(lambda x : [w for w in x if w in stop_words])
stopwords_removed.head()

0    [your, is, to, a, the, is, of, and, the, but, ...
1    [the, is, but, not, and, is, not, or, no, is, ...
2    [to, you, this, you, have, been, for, a, at, a...
3    [now, here, here, for, a, of, our, now, here, ...
4    [do, not, have, from, here, ain, t, it, with, ...
Name: clean_text, dtype: object

In [106]:
df['clean_text'] = df['clean_text'].apply(lambda x : [w for w in x if not w in stop_words])
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,"[subject, :, naturally, irresistible, corporat..."
1,Subject: the stock trading gunslinger fanny i...,1,"[subject, :, stock, trading, gunslinger, fanny..."
2,Subject: unbelievable new homes made easy im ...,1,"[subject, :, unbelievable, new, homes, made, e..."
3,Subject: 4 color printing special request add...,1,"[subject, :, color, printing, special, request..."
4,"Subject: do not have money , get software cds ...",1,"[subject, :, money, ,, get, software, cds, !, ..."


### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [107]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [108]:
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(w, pos='v') for w in text])

In [109]:
df['clean_text'] = df['clean_text'].apply(lemmatize_text)
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject : naturally irresistible corporate ide...
1,Subject: the stock trading gunslinger fanny i...,1,subject : stock trade gunslinger fanny merrill...
2,Subject: unbelievable new homes made easy im ...,1,subject : unbelievable new home make easy im w...
3,Subject: 4 color printing special request add...,1,subject : color print special request addition...
4,"Subject: do not have money , get software cds ...",1,"subject : money , get software cds ! software ..."


## (2) Bag-of-words Modelling

### (2.1) Digitizing the textual data into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [110]:
from sklearn.feature_extraction.text import CountVectorizer

In [111]:
count_vectorizer = CountVectorizer()
df['clean_text'].head()

0    subject : naturally irresistible corporate ide...
1    subject : stock trade gunslinger fanny merrill...
2    subject : unbelievable new home make easy im w...
3    subject : color print special request addition...
4    subject : money , get software cds ! software ...
Name: clean_text, dtype: object

In [143]:
X_bow = count_vectorizer.fit_transform(df['clean_text'])
X_bow.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [114]:
count_vectorizer.get_feature_names_out()

array(['additional', 'advantage', 'advertisement', 'affordability', 'aim',
       'albeit', 'along', 'amount', 'approval', 'approve', 'ask',
       'attainder', 'attire', 'automaticaily', 'azusa', 'become',
       'bedtime', 'benefit', 'best', 'boar', 'break', 'budget',
       'business', 'ca', 'canyon', 'catchy', 'cds', 'chameleon', 'change',
       'chesapeake', 'chisel', 'chronography', 'ciear', 'clear', 'click',
       'clockwork', 'clothesman', 'collaboration', 'color', 'colza',
       'com', 'comedies', 'company', 'compatibility', 'complete',
       'content', 'continuant', 'convenience', 'corporate',
       'creativeness', 'credit', 'days', 'death', 'deoxyribonucleic',
       'diffusion', 'distinctive', 'do', 'dorcas', 'draft', 'duane',
       'earmark', 'easier', 'easy', 'edt', 'effective', 'efforts',
       'einsteinian', 'end', 'esmark', 'even', 'extend', 'extra',
       'factor', 'fanny', 'fax', 'fee', 'finish', 'fix', 'form', 'format',
       'foward', 'full', 'gap', 'get',

In [122]:
vectorized_df = pd.DataFrame(
    X_bow.toarray(),
    columns = count_vectorizer.get_feature_names_out()
)
vectorized_df.head()

Unnamed: 0,additional,advantage,advertisement,affordability,aim,albeit,along,amount,approval,approve,...,visit,want,waterway,way,website,within,without,world,yes,yet
0,0,0,0,1,1,0,0,1,0,0,...,0,0,0,0,2,1,1,1,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,3,0
2,0,1,0,0,0,0,0,0,1,1,...,1,1,0,1,1,0,0,0,0,0
3,2,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1


### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [123]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

In [139]:
mulNB = MultinomialNB()

y = df['spam']

In [145]:
cv_result = cross_validate(mulNB, X_bow, y, cv = 5, scoring= ['accuracy'])
cv_result['test_accuracy'].mean()

0.9888272098889626

🏁 Congratulations !

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge !