# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or ham, i.e. normal emails (0)

🧹 First, you will apply cleaning techniques to this text data

üë©üèª‚Äçüî¨ Then, you will convert the cleaned texts into a numerical representation

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [1]:
!pip install nltk



In [2]:
# The first time you use NLTK, you also need to download a few of its data packages (corpora and tokenizer models)

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gasparburgi/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gasparburgi/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gasparburgi/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/gasparburgi/nltk_data...


True

In [3]:
import pandas as pd

df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


## (1) Cleaning the (text) dataset

The dataset is made up of emails labelled as ham (0) or spam (1). You need to clean the text before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove punctuation. Apply it to the `text` column and store the output in a new dataframe column called `clean_text` ❓

In [7]:
import string

In [9]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [36]:
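def basic_cleaning(sentence:str):
    # Lowercase and drop digits up front (these are also covered by steps 1.2 and 1.3)
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())

    # Remove every ASCII punctuation character
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation,'')

    # Trim leading and trailing whitespace
    sentence = sentence.strip()

    return sentence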

In [39]:
# YOUR CODE HERE

df['clean_text']=df.text.map(lambda x: basic_cleaning(x))

In [40]:
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject color printing special request addit...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject re research and development charges t...
5724,"Subject: re : receipts from visit jim , than...",0,subject re receipts from visit jim thanks ...
5725,Subject: re : enron case study update wow ! a...,0,subject re enron case study update wow all ...
5726,"Subject: re : interest david , please , call...",0,subject re interest david please call shi...
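As an alternative to replacing punctuation characters one at a time, the same step can be sketched with `str.translate`, which deletes all of them in a single pass. The helper name `remove_punctuation` and the constant `PUNCT_TABLE` are hypothetical and not part of the challenge; the sketch only handles the ASCII characters listed in `string.punctuation`.

```python
import string

# Translation table that maps every ASCII punctuation character to None (i.e. deletes it)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(sentence: str) -> str:
    """Delete ASCII punctuation in a single pass (hypothetical helper)."""
    return sentence.translate(PUNCT_TABLE)

example = "Subject: do not have money , get software cds !"
print(remove_punctuation(example))
```

Note that deleting punctuation this way leaves the surrounding spaces untouched, so runs of several spaces can remain in the cleaned text, as in the `clean_text` column above.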


### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [41]:
def lower_case(sentence:str):
    return sentence.lower()

In [42]:
# YOUR CODE HERE
# Store the result back in the column, then display it
df['clean_text'] = df['clean_text'].map(lower_case)
df['clean_text']

0       subject naturally irresistible your corporate ...
1       subject the stock trading gunslinger  fanny is...
2       subject unbelievable new homes made easy  im w...
3       subject  color printing special  request addit...
4       subject do not have money  get software cds fr...
                              ...                        
5723    subject re  research and development charges t...
5724    subject re  receipts from visit  jim   thanks ...
5725    subject re  enron case study update  wow  all ...
5726    subject re  interest  david   please  call shi...
5727    subject news  aurora    update  aurora version...
Name: clean_text, Length: 5728, dtype: object

### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [43]:
# YOUR CODE HERE
def remove_num(sentence:str):
    sentence = ''.join(char for char in sentence if not char.isdigit())
    return sentence

In [44]:
# Store the result back in the column, then display it
df['clean_text'] = df['clean_text'].map(remove_num)
df['clean_text']

0       subject naturally irresistible your corporate ...
1       subject the stock trading gunslinger  fanny is...
2       subject unbelievable new homes made easy  im w...
3       subject  color printing special  request addit...
4       subject do not have money  get software cds fr...
                              ...                        
5723    subject re  research and development charges t...
5724    subject re  receipts from visit  jim   thanks ...
5725    subject re  enron case study update  wow  all ...
5726    subject re  interest  david   please  call shi...
5727    subject news  aurora    update  aurora version...
Name: clean_text, Length: 5728, dtype: object

### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [48]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [46]:
# YOUR CODE HERE
stop_words = set(stopwords.words('english'))

In [49]:
def rm_stopwords(sentence:str):
    
    word_tokens = word_tokenize(sentence)
    tokens_cleaned = [w for w in word_tokens if w not in stop_words]
    
    return tokens_cleaned

In [50]:
df.clean_text.map(lambda x: rm_stopwords(x))

0       [subject, naturally, irresistible, corporate, ...
1       [subject, stock, trading, gunslinger, fanny, m...
2       [subject, unbelievable, new, homes, made, easy...
3       [subject, color, printing, special, request, a...
4       [subject, money, get, software, cds, software,...
                              ...                        
5723    [subject, research, development, charges, gpg,...
5724    [subject, receipts, visit, jim, thanks, invita...
5725    [subject, enron, case, study, update, wow, day...
5726    [subject, interest, david, please, call, shirl...
5727    [subject, news, aurora, update, aurora, versio...
Name: clean_text, Length: 5728, dtype: object
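To make the filtering step concrete without any NLTK data, here is a minimal sketch that uses a tiny hand-picked stopword set and plain whitespace splitting. Both `TOY_STOP_WORDS` and `drop_stopwords` are hypothetical names chosen for illustration; NLTK's English stopword list is much longer, and `word_tokenize` is more careful than `str.split`.

```python
# Tiny hand-picked stopword set, for illustration only (NLTK's English list is much longer)
TOY_STOP_WORDS = frozenset({"the", "is", "do", "not", "have", "your", "a", "of"})

def drop_stopwords(sentence: str, stop_words=TOY_STOP_WORDS) -> list:
    """Split on whitespace and keep only the tokens that are not stopwords."""
    return [token for token in sentence.split() if token not in stop_words]

tokens = drop_stopwords("subject do not have money  get software cds")
print(tokens)            # list of kept tokens
print(" ".join(tokens))  # joined back into a single string
```

Joining the tokens back with `" ".join(...)` is what turns the list into the single string that the vectorizer expects later on.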

### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [52]:
from nltk.stem.wordnet import WordNetLemmatizer

In [54]:
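def lemmatized(sentence:str):
    # Tokenize and drop stopwords first (rm_stopwords returns a list of tokens)
    tokens = rm_stopwords(sentence)

    # Instantiate the lemmatizer once instead of once per word
    lemmatizer = WordNetLemmatizer()

    # Lemmatize as verbs first, then as nouns
    verb_lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in tokens]
    noun_lemmatized = [lemmatizer.lemmatize(word, pos='n') for word in verb_lemmatized]

    # Join the tokens back into a single string
    result_string = ' '.join(noun_lemmatized)
    return result_string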

In [56]:
# YOUR CODE HERE
df['clean_text']=df.clean_text.map(lambda x:lemmatized(x))

In [57]:
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible corporate ident...
1,Subject: the stock trading gunslinger fanny i...,1,subject stock trade gunslinger fanny merrill m...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new home make easy im wan...
3,Subject: 4 color printing special request add...,1,subject color print special request additional...
4,"Subject: do not have money , get software cds ...",1,subject money get software cd software compati...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject research development charge gpg forwar...
5724,"Subject: re : receipts from visit jim , than...",0,subject receipt visit jim thank invitation vis...
5725,Subject: re : enron case study update wow ! a...,0,subject enron case study update wow day super ...
5726,"Subject: re : interest david , please , call...",0,subject interest david please call shirley cre...


## (2) Bag-of-words Modelling

### (2.1) Digitizing the textual data into numbers

❓ Vectorize `clean_text` into a Bag-of-Words representation with a default CountVectorizer. Save it as `X_bow`. ❓

In [58]:
from sklearn.feature_extraction.text import CountVectorizer

In [68]:
# YOUR CODE HERE
count_vectorized = CountVectorizer()
X_bow = count_vectorized.fit_transform(df['clean_text'])

# Dense DataFrame for inspection only; X_bow itself stays a sparse matrix
vectorized_texts = pd.DataFrame(
    X_bow.toarray(),
    columns=count_vectorized.get_feature_names_out()
)
vectorized_texts

Unnamed: 0,aa,aaa,aaaenerfax,aadedeji,aagrawal,aal,aaldous,aaliyah,aall,aanalysis,...,zwzm,zxghlajf,zyban,zyc,zygoma,zymg,zzmacmac,zzn,zzncacst,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
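The table above has one row per email and one column per vocabulary word, each cell holding a word count. The counting idea can be sketched in plain Python on two toy documents. The documents and the helper `to_counts` are made up for illustration, and the sketch skips what a default `CountVectorizer` additionally does (lowercasing, dropping single-character tokens, returning a sparse matrix).

```python
from collections import Counter

# Two toy documents, assumed to be already cleaned
docs = [
    "free money free software",
    "meeting update money",
]

# Vocabulary: sorted unique tokens across all documents (one column per token)
vocabulary = sorted({token for doc in docs for token in doc.split()})

def to_counts(doc):
    """Return how many times each vocabulary token occurs in the document."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

# One row per document
bow_matrix = [to_counts(doc) for doc in docs]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Most cells are zero even in this tiny example, which is why the real vectorizer stores its result as a sparse matrix.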


### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [79]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

In [80]:
# YOUR CODE HERE
X = df['clean_text']
y = df['spam']

# MultinomialNB cannot be fitted on raw strings: cross-validating the bare model
# on the text column fails with "ValueError: could not convert string to float".
# Wrapping the vectorizer and the model in a pipeline lets each fold vectorize
# its own training texts before fitting.
pipeline = make_pipeline(
        CountVectorizer(),
        MultinomialNB()
)

# The exercise asks for accuracy
cv_results = cross_validate(pipeline, X, y, cv=5, scoring='accuracy')
cv_results['test_score'].mean()
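To show what a multinomial Naive Bayes classifier computes from word counts, here is a minimal from-scratch sketch on a toy corpus. It assumes add-one (Laplace) smoothing and whitespace tokenization; the training sentences and the helpers `log_posterior` and `predict` are hypothetical, and this is not scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy training set: (text, label) with 1 = spam, 0 = ham
train = [
    ("free money now", 1),
    ("free software offer", 1),
    ("meeting schedule update", 0),
    ("project meeting notes", 0),
]

vocabulary = {tok for text, _ in train for tok in text.split()}
classes = sorted({label for _, label in train})

# Per-class token counts and per-class document counts
token_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for text, label in train:
    token_counts[label].update(text.split())
    doc_counts[label] += 1

def log_posterior(text, c, alpha=1.0):
    """Unnormalised log P(c | text) with add-alpha smoothing."""
    total = sum(token_counts[c].values())
    score = math.log(doc_counts[c] / len(train))  # log prior of class c
    for tok in text.split():
        if tok not in vocabulary:
            continue  # tokens never seen in training are ignored
        score += math.log(
            (token_counts[c][tok] + alpha) / (total + alpha * len(vocabulary))
        )
    return score

def predict(text):
    """Pick the class with the highest log posterior."""
    return max(classes, key=lambda c: log_posterior(text, c))

print(predict("free money offer"))
print(predict("meeting notes"))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long emails, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.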


🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!