# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or normal emails (0)

🧹 First, you will apply cleaning techniques to this text data

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [2]:
# !pip install nltk

In [3]:
# When importing nltk for the first time, we also need to download a few of its data resources

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/alanoud/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/alanoud/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/alanoud/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/alanoud/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [4]:
import pandas as pd

df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
df

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are classified as ham [0] or spam [1]. You need to clean the dataset before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [5]:
# YOUR CODE HERE
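def remove_punctuation(text):
    # Characters to strip; the backslash is escaped so Python does not warn about an invalid "\," escape
    punctuation = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
    # Keep every character that is not in the punctuation set
    return ''.join(char for char in text if char not in punctuation)

df['clean_text'] = df['text'].apply(remove_punctuation)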

df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,Subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,Subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,Subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,Subject do not have money get software cds fr...
...,...,...,...
5723,Subject: re : research and development charges...,0,Subject re research and development charges t...
5724,"Subject: re : receipts from visit jim , than...",0,Subject re receipts from visit jim thanks ...
5725,Subject: re : enron case study update wow ! a...,0,Subject re enron case study update wow all ...
5726,"Subject: re : interest david , please , call...",0,Subject re interest david please call shi...


In [6]:
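# Drop the "Subject" header token; note that this removes the word wherever it occurs, not only at the start
df['clean_text'] = df['clean_text'].str.replace(r'(Subject|subject)', '', regex=True, case=True)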

In [7]:
df['clean_text']

0        naturally irresistible your corporate identit...
1        the stock trading gunslinger  fanny is merril...
2        unbelievable new homes made easy  im wanting ...
3        4 color printing special  request additional ...
4        do not have money  get software cds from here...
                              ...                        
5723     re  research and development charges to gpg  ...
5724     re  receipts from visit  jim   thanks again f...
5725     re  enron case study update  wow  all on the ...
5726     re  interest  david   please  call shirley cr...
5727     news  aurora 5  2 update  aurora version 5  2...
Name: clean_text, Length: 5728, dtype: object
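
A possible alternative for this step, shown only as a sketch: Python's built-in `str.translate` can delete punctuation in a single pass. It relies on `string.punctuation`, which also contains `+`, `=`, `|` and the backtick, four characters that the hand-written list above does not include, so the two versions are not strictly identical. The name `remove_punctuation_translate` is made up for this illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to "delete"
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(text: str) -> str:
    """Strip ASCII punctuation from `text` in a single pass."""
    return text.translate(PUNCT_TABLE)

sample = "Subject: do not have money , get software cds !"
print(remove_punctuation_translate(sample))
```

It could be applied in the same way as the function above, with `df['text'].apply(remove_punctuation_translate)`.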

### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [8]:
# YOUR CODE HERE
def lowercase_text(text):
    return text.lower()
df['clean_text'] = df['clean_text'].apply(lowercase_text)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,naturally irresistible your corporate identit...
1,Subject: the stock trading gunslinger fanny i...,1,the stock trading gunslinger fanny is merril...
2,Subject: unbelievable new homes made easy im ...,1,unbelievable new homes made easy im wanting ...
3,Subject: 4 color printing special request add...,1,4 color printing special request additional ...
4,"Subject: do not have money , get software cds ...",1,do not have money get software cds from here...
...,...,...,...
5723,Subject: re : research and development charges...,0,re research and development charges to gpg ...
5724,"Subject: re : receipts from visit jim , than...",0,re receipts from visit jim thanks again f...
5725,Subject: re : enron case study update wow ! a...,0,re enron case study update wow all on the ...
5726,"Subject: re : interest david , please , call...",0,re interest david please call shirley cr...


### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [9]:
# YOUR CODE HERE
def numbers_remove(text):
    return ''.join(char for char in text if not char.isdigit())

df['clean_text'] = df['clean_text'].apply(numbers_remove)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,naturally irresistible your corporate identit...
1,Subject: the stock trading gunslinger fanny i...,1,the stock trading gunslinger fanny is merril...
2,Subject: unbelievable new homes made easy im ...,1,unbelievable new homes made easy im wanting ...
3,Subject: 4 color printing special request add...,1,color printing special request additional i...
4,"Subject: do not have money , get software cds ...",1,do not have money get software cds from here...
...,...,...,...
5723,Subject: re : research and development charges...,0,re research and development charges to gpg ...
5724,"Subject: re : receipts from visit jim , than...",0,re receipts from visit jim thanks again f...
5725,Subject: re : enron case study update wow ! a...,0,re enron case study update wow all on the ...
5726,"Subject: re : interest david , please , call...",0,re interest david please call shirley cr...
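
For reference, the same step can be sketched with a regular expression. `\d` matches Unicode decimal digits, while `str.isdigit()` also accepts a few extra characters such as superscript digits, so the two may differ slightly on unusual inputs. Removing digits leaves runs of spaces behind; the helper `collapse_spaces` (a hypothetical name) tidies them up, although any later `text.split()` ignores extra whitespace anyway.

```python
import re

# One or more consecutive decimal digits
DIGITS = re.compile(r"\d+")

def remove_numbers_regex(text: str) -> str:
    """Delete every run of digits from `text`."""
    return DIGITS.sub("", text)

def collapse_spaces(text: str) -> str:
    """Replace any run of whitespace with a single space and trim both ends."""
    return " ".join(text.split())

sample = "aurora 5 2 update"
print(collapse_spaces(remove_numbers_regex(sample)))
```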


### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [10]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [None]:
# YOUR CODE HERE
def re_stopwords(text):
    return ' '.join(word for word in text.split() if word not in stop_words)
df['clean_text'] = df['clean_text'].apply(re_stopwords)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,naturally irresistible corporate identity lt r...
1,Subject: the stock trading gunslinger fanny i...,1,stock trading gunslinger fanny merrill muzo co...
2,Subject: unbelievable new homes made easy im ...,1,unbelievable new homes made easy im wanting sh...
3,Subject: 4 color printing special request add...,1,color printing special request additional info...
4,"Subject: do not have money , get software cds ...",1,money get software cds software compatibility ...
...,...,...,...
5723,Subject: re : research and development charges...,0,research development charges gpg forwarded shi...
5724,"Subject: re : receipts from visit jim , than...",0,receipts visit jim thanks invitation visit lsu...
5725,Subject: re : enron case study update wow ! a...,0,enron case study update wow day super thank mu...
5726,"Subject: re : interest david , please , call...",0,interest david please call shirley crenshaw as...
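
One thing to keep in mind, sketched below under simplifying assumptions: NLTK's English stop-word list contains contractions such as `don't`, but punctuation was already removed in (1.1), so the cleaned text contains `dont` instead, which no longer matches. The snippet uses a tiny hand-written set (`DEMO_STOP_WORDS`, hypothetical) so that it runs without any NLTK data; with the real list you would pass `stopwords.words('english')` instead.

```python
# Hypothetical miniature stop-word list, standing in for stopwords.words('english')
DEMO_STOP_WORDS = {"the", "not", "don't", "isn't"}

def with_unpunctuated_variants(words):
    """Return the stop words plus copies with apostrophes removed (don't -> dont)."""
    words = set(words)
    return words | {word.replace("'", "") for word in words}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token found in `stop_words`."""
    return " ".join(word for word in text.split() if word not in stop_words)

text = "dont send the money"
# With the plain list, "dont" is not recognised as a stop word
print(remove_stopwords(text, DEMO_STOP_WORDS))
# With the apostrophe-free variants added, "dont" is filtered out as well
print(remove_stopwords(text, with_unpunctuated_variants(DEMO_STOP_WORDS)))
```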


### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [12]:
# YOUR CODE HERE
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# punkt, wordnet and omw-1.4 were already downloaded in section (0)
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Lemmatize each token as a verb and join the result back into a single string
    return ' '.join(lemmatizer.lemmatize(w, pos='v') for w in word_tokenize(text))

df['clean_text'] = df['clean_text'].apply(lemmatize_text)
df


Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,naturally irresistible corporate identity lt r...
1,Subject: the stock trading gunslinger fanny i...,1,stock trade gunslinger fanny merrill muzo colz...
2,Subject: unbelievable new homes made easy im ...,1,unbelievable new home make easy im want show h...
3,Subject: 4 color printing special request add...,1,color print special request additional informa...
4,"Subject: do not have money , get software cds ...",1,money get software cds software compatibility ...
...,...,...,...
5723,Subject: re : research and development charges...,0,research development charge gpg forward shirle...
5724,"Subject: re : receipts from visit jim , than...",0,receipt visit jim thank invitation visit lsu s...
5725,Subject: re : enron case study update wow ! a...,0,enron case study update wow day super thank mu...
5726,"Subject: re : interest david , please , call...",0,interest david please call shirley crenshaw as...
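
The cleaning steps of this section can also be chained into one function, which makes it easier to reuse them on new emails. The sketch below is an assumption-laden illustration, not the notebook's own code: `clean_email` is a made-up name, it only drops a leading `subject` token, and the lemmatizer is passed in as a plain callable (identity by default) so that the snippet runs without NLTK.

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_email(text, stop_words=frozenset(), lemmatize=lambda word: word):
    """Apply the cleaning steps in order and return a single string."""
    text = text.translate(PUNCT_TABLE)          # (1.1) remove punctuation
    text = text.lower()                         # (1.2) lowercase
    text = re.sub(r"^subject\s*", "", text)     # drop the leading "subject" header token
    text = re.sub(r"\d+", "", text)             # (1.3) remove numbers
    tokens = [w for w in text.split() if w not in stop_words]   # (1.4) remove stop words
    return " ".join(lemmatize(w) for w in tokens)               # (1.5) lemmatize

print(clean_email("Subject: 4 color printing , special request !", stop_words={"the"}))
```

With the objects defined in this notebook, a call such as `df['text'].apply(clean_email, stop_words=stop_words, lemmatize=lambda w: lemmatizer.lemmatize(w, pos='v'))` should give a comparable `clean_text` column.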


## (2) Bag-of-words Modelling

### (2.1) Converting the text data into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [13]:
clean_text = df['clean_text']

In [14]:
# YOUR CODE HERE
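from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary and count word occurrences per email (returns a sparse matrix)
count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(clean_text)

# Convert to a dense array for display
X_bow = X.toarray()
X_bow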

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
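
The array above is almost entirely zeros because each email uses only a small part of the vocabulary. To see what a Bag-of-Words matrix contains, here is a simplified pure-Python sketch on three toy documents. It only approximates `CountVectorizer`: tokens come from `str.split()`, whereas the default vectorizer lowercases the text and keeps tokens of two or more alphanumeric characters. The function name and the toy documents are invented for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in docs[i]."""
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # Counter returns 0 for words that do not appear in this document
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["free money free prize", "meeting schedule", "free meeting"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one word, so the first row records that `free` appears twice in the first toy document.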

### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓
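
To make the model less of a black box, here is a minimal from-scratch sketch of Multinomial Naive Bayes with Laplace smoothing on a four-email toy corpus. It is not scikit-learn's implementation; `train_mnb`, `predict_mnb` and the toy data are invented for illustration. Each class gets a log prior plus a smoothed log probability per word, and a prediction sums these over the words of a new email.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log probabilities."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    docs_per_class = Counter(labels)
    word_counts = {label: Counter() for label in docs_per_class}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}
    log_likelihood = {}
    for c in docs_per_class:
        # Laplace smoothing: add alpha to every word count so no probability is zero
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + alpha) / total) for word in vocabulary
        }
    return log_prior, log_likelihood

def predict_mnb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy corpus: 1 = spam, 0 = ham
docs = ["free money now", "win free prize", "meeting schedule today", "project meeting notes"]
labels = [1, 1, 0, 0]
model = train_mnb(docs, labels)
print(predict_mnb(model, "free prize money"))
print(predict_mnb(model, "meeting today"))
```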

In [15]:
# YOUR CODE HERE
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB

# Feature/Target
X = df["clean_text"]
y = df["spam"]

# Pipeline vectorizer + Naive Bayes
pipeline_naive_bayes = make_pipeline(
    CountVectorizer(),
    MultinomialNB()
)
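# Cross-validation
# Note: the question asks for accuracy, but this cell reports recall (the share of spam emails caught).
# Use scoring=["accuracy"] and cv_results["test_accuracy"] to get accuracy instead.
cv_results = cross_validate(pipeline_naive_bayes, X, y, cv=5, scoring=["recall"])
average_recall = cv_results["test_recall"].mean()
np.round(average_recall, 2)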

0.98
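
The prompt above asks for accuracy while the cell scores recall. The two metrics answer different questions, as this small sketch on made-up labels shows: accuracy is the share of all emails classified correctly, recall is the share of actual spam emails that were caught. The function name and the ten labels are invented for illustration.

```python
def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for truth, pred in pairs if truth == pred)
    true_positives = sum(1 for truth, pred in pairs if truth == 1 and pred == 1)
    false_negatives = sum(1 for truth, pred in pairs if truth == 1 and pred == 0)

    accuracy = correct / len(pairs)
    actual_positives = true_positives + false_negatives
    # Guard against division by zero when there is no positive example
    recall = true_positives / actual_positives if actual_positives else 0.0
    return accuracy, recall

# Toy labels: 3 spam emails (1) and 7 regular emails (0)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(accuracy, recall)
```

Here one spam email is missed and one regular email is flagged, so the two scores differ even though they are computed from the same predictions.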

🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!