# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or regular emails (0)

🧹 First, you will apply cleaning techniques to this text data

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [1]:
# !pip install nltk

In [2]:
# When using nltk for the first time, we also need to download a few of its data packages (corpora and tokenizer models)

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [3]:
import pandas as pd

df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are labelled as ham (0) or spam (1). You need to clean the dataset before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [4]:
import string

In [8]:
def punctuation(df, column_name):
    df['clean_text'] = df[column_name].apply(lambda text: text.translate(str.maketrans('', '', string.punctuation)))
    return df
df = punctuation(df, 'text')
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,Subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,Subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,Subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,Subject do not have money get software cds fr...
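To see what the `translate` call does on its own, here is a minimal sketch on a single hand-written string (the sample sentence and the helper name `strip_punctuation` are assumptions for illustration, not part of the challenge). `str.maketrans('', '', string.punctuation)` builds a table that maps every ASCII punctuation character to deletion.

```python
import string

# Translation table that deletes every character listed in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text: str) -> str:
    """Return text with ASCII punctuation removed (hypothetical helper)."""
    return text.translate(PUNCT_TABLE)

sample = "Subject: do not have money , get software cds !"
cleaned = strip_punctuation(sample)
print(repr(cleaned))
```

Note that the spaces around the removed characters are kept, so a token like ` , ` leaves a double space behind. `string.punctuation` only covers ASCII, so characters such as curly quotes would pass through untouched.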


### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [10]:
def lowercase(df, column_name):
    df[column_name] = df[column_name].apply(lambda text: text.lower())
    return df
df = lowercase(df, 'clean_text')
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...


### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [11]:
def numbers(df, column_name):
    df[column_name] = df[column_name].apply(lambda text: ''.join([char for char in text if not char.isdigit()]))
    return df
df = numbers(df, 'clean_text')
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject color printing special request addit...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
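Removing digits character by character leaves the surrounding spaces in place. As an optional variation, assuming regular expressions are acceptable here, the sketch below removes digit runs with `re.sub` and then collapses the leftover whitespace. The helper name `strip_digits` is hypothetical.

```python
import re

def strip_digits(text: str) -> str:
    """Remove digits, then collapse the whitespace runs they leave behind."""
    no_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", no_digits).strip()

print(strip_digits("subject 4 color printing special request"))
```

Collapsing whitespace is not required for the later steps, since both `str.split()` and `CountVectorizer` ignore repeated spaces. It only makes the intermediate text easier to read.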


### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [13]:
from nltk.corpus import stopwords

In [14]:
stop_words = set(stopwords.words('english'))

In [15]:
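def remove_stopwords(df, column_name):
    # Named remove_stopwords so that it does not shadow the `stopwords` corpus imported from nltk
    df[column_name] = df[column_name].apply(lambda text: ' '.join([word for word in text.split() if word.lower() not in stop_words]))
    return df
df = remove_stopwords(df, 'clean_text')
df.head()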

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible corporate ident...
1,Subject: the stock trading gunslinger fanny i...,1,subject stock trading gunslinger fanny merrill...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im wa...
3,Subject: 4 color printing special request add...,1,subject color printing special request additio...
4,"Subject: do not have money , get software cds ...",1,subject money get software cds software compat...
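The filtering logic can be tried without the NLTK download. The sketch below uses a tiny hand-picked stop-word set as a stand-in for `stopwords.words('english')`; the set contents and the helper name `drop_stop_words` are assumptions for illustration. Using a `set` keeps each membership test cheap, which matters when the check runs once per word of every email.

```python
# Tiny hand-picked stop-word set, standing in for stopwords.words('english')
TOY_STOP_WORDS = {"do", "not", "have", "the", "your", "is"}

def drop_stop_words(text: str, stop_words: set) -> str:
    """Keep only the whitespace-separated tokens that are not stop words."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

filtered = drop_stop_words("subject do not have money get software cds", TOY_STOP_WORDS)
print(filtered)
```

As row 4 of the output above shows, negations such as "not" are dropped together with the other stop words, so "do not have money" and "have money" end up with the same tokens.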


### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [16]:
from nltk.stem import WordNetLemmatizer

In [17]:
lemmatizer = WordNetLemmatizer()

In [18]:
def lemmatize_text(df, column_name):
    df[column_name] = df[column_name].apply(lambda text: ' '.join([lemmatizer.lemmatize(word) for word in text.split()]))
    return df
df = lemmatize_text(df, 'clean_text')
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible corporate ident...
1,Subject: the stock trading gunslinger fanny i...,1,subject stock trading gunslinger fanny merrill...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new home made easy im wan...
3,Subject: 4 color printing special request add...,1,subject color printing special request additio...
4,"Subject: do not have money , get software cds ...",1,subject money get software cd software compati...


## (2) Bag-of-words Modelling

### (2.1) Converting the text data into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

In [20]:
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['clean_text'])

In [23]:
print(df['clean_text'].head(10))

0    subject naturally irresistible corporate ident...
1    subject stock trading gunslinger fanny merrill...
2    subject unbelievable new home made easy im wan...
3    subject color printing special request additio...
4    subject money get software cd software compati...
5    subject great nnews hello welcome medzonline s...
6    subject hot play motion homeland security inve...
7    subject save money buy getting thing tried cia...
8    subject undeliverable home based business grow...
9    subject save money buy getting thing tried cia...
Name: clean_text, dtype: object


In [22]:
print(vectorizer.get_feature_names_out()[:50])

['aa' 'aaa' 'aaaenerfax' 'aadedeji' 'aagrawal' 'aal' 'aaldous' 'aaliyah'
 'aall' 'aanalysis' 'aaron' 'aawesome' 'ab' 'aba' 'abacha' 'abacus'
 'abahy' 'abaixo' 'abandon' 'abandoned' 'abandonment' 'abargain' 'abarr'
 'abattoir' 'abb' 'abbas' 'abbestellen' 'abbott' 'abbreviated'
 'abbreviation' 'abc' 'abcsearch' 'abdalla' 'abdallat' 'abdelnour' 'abdul'
 'abdulla' 'abdullah' 'abeis' 'abel' 'abello' 'aber' 'abernathy' 'abetted'
 'abeyance' 'abf' 'abhay' 'abide' 'abidjan' 'abiiity']


In [21]:
X_bow = pd.DataFrame(X_bow.toarray(), columns=vectorizer.get_feature_names_out())
X_bow.head()

Unnamed: 0,aa,aaa,aaaenerfax,aadedeji,aagrawal,aal,aaldous,aaliyah,aall,aanalysis,...,zwzm,zxghlajf,zyban,zyc,zygoma,zymg,zzmacmac,zzn,zzncacst,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Explore the dataset to check whether the words in this list really occur in the emails.

### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [28]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

In [34]:
X = X_bow
y = df['spam']

In [35]:
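# Single-step pipeline: cross_val_score fits a fresh copy of it on each of the 5 folds
pipeline = Pipeline([
    ('model', MultinomialNB())
])
# Mean accuracy over the 5 folds
mean_accuracy = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy').mean()
mean_accuracy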

0.9895252901681946
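In the cell above, the vocabulary is learned from all emails before the data is split into folds. A possible variation, sketched here on made-up toy emails rather than the course dataset, puts `CountVectorizer` inside the `Pipeline`, so that each fold learns its vocabulary from its own training part only. It also passes the sparse counts straight to `MultinomialNB` instead of converting them to a dense DataFrame. The toy texts and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy emails: 10 spam (1) and 10 ham (0), only to make the sketch runnable
toy_texts = ["free money offer click now"] * 10 + ["meeting agenda attached for review"] * 10
toy_labels = [1] * 10 + [0] * 10

# The vectorizer is a pipeline step, so it is re-fitted on the training part of each fold
text_pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("model", MultinomialNB()),
])

toy_scores = cross_val_score(text_pipeline, toy_texts, toy_labels, cv=5, scoring="accuracy")
print(toy_scores.mean())
```

With the real data, the call would take `df['clean_text']` and `df['spam']` in place of the toy lists. The score may differ slightly from the one above, because words that appear only in a validation fold are then unknown to the model.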

🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!