# Ham or Spam?

In [154]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings("ignore")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [239]:
# On first use, NLTK needs these corpora and tokenizer models downloaded
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

import re

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Simplon\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Simplon\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Simplon\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [156]:
import pandas as pd

df = pd.read_csv("emails.csv")

df.head()

| Unnamed: 0 | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


The dataset is made up of emails classified as ham (`0`) or spam (`1`). You need to clean the text before training a prediction model.

## Remove Punctuation

👇 Create a function to remove the punctuation. Apply it to the whole `text` column and store the output in a new dataframe column called `clean_text`.

In [157]:
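def remove_punctuation(string: str) -> str:
    # Raw string avoids the invalid "\s" escape warning; every character that is
    # not a letter, digit or whitespace becomes a space
    return re.sub(r"[^a-zA-Z0-9\s]", " ", string)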


df["clean_text"] = df.text.apply(remove_punctuation)

## Lower Case

👇 Create a function to lowercase the text. Apply it to `clean_text`.

In [158]:
df.clean_text = df.clean_text.str.lower()

## Remove Numbers

👇 Create a function to remove numbers from the text. Apply it to `clean_text`

In [159]:
def remove_numbers(string: str) -> str:
    # r"\d" matches the same characters as "[^\D]": every digit becomes a space
    return re.sub(r"\d", " ", string)

df["clean_text"] = df.clean_text.apply(remove_numbers)
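To see what the three steps so far do to one email, here is a minimal sketch that chains them on a single string. The helper name `clean` and the sample text are made up for illustration; only the regular expressions come from the cells above.

```python
import re


def clean(text: str) -> str:
    """Apply the cleaning steps in order: punctuation, case, digits."""
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # punctuation -> space
    text = text.lower()                           # lowercase
    text = re.sub(r"\d", " ", text)               # each digit -> space
    return text


# Hypothetical sample, not taken from emails.csv
sample = "Subject: 4 Color printing, 50% OFF!"
cleaned = clean(sample)

# Every removed character is replaced by one space, so the length is unchanged
print(repr(cleaned))

# Collapsing the runs of spaces makes the result easier to read
print(" ".join(cleaned.split()))  # subject color printing off
```

The leftover runs of spaces are harmless here, because both `str.split()` and `CountVectorizer` ignore repeated whitespace.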

## Remove StopWords

👇 Create a function to remove stopwords from the text. Apply it to `clean_text`.

In [160]:
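def remove_stopwords(series: pd.Series, stop_words: list) -> pd.Series:
    # Word boundaries (\b) ensure only whole words are removed
    pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, stop_words)))
    # regex=True is required: pandas >= 2.0 treats the pattern as a literal by default
    return series.str.replace(pat, " ", regex=True)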

stop_words = stopwords.words("english")
stop_words.append("subject")
df["clean_text"] = remove_stopwords(df.clean_text, stop_words)
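The word-boundary pattern is the part that is easy to get wrong, so here is a small standalone sketch of it using only `re`. The four-word stop list and the helper `strip_stopwords` are assumptions for illustration; the notebook itself uses NLTK's full English list.

```python
import re

# Hypothetical mini stop list; the notebook uses stopwords.words("english")
stop_words = ["the", "a", "is", "subject"]

# \b keeps "a" from matching inside "scam" and "the" from matching inside "theme";
# re.escape protects against stop words that contain regex metacharacters
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, stop_words))))


def strip_stopwords(text: str) -> str:
    # Replace each stop word with a space, then collapse the extra whitespace
    return " ".join(pattern.sub(" ", text).split())


result = strip_stopwords("subject the offer is a scam theme")
print(result)  # offer scam theme
```

Without the `\b` anchors, the same alternation would also delete the letter "a" from "scam" and the prefix "the" from "theme".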

## Lemmatize

👇 Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`.

In [163]:
lemmatizer = WordNetLemmatizer()  # built once and reused for every row

def lemmatize_text(string: str) -> str:
    # Lemmatize each word as a verb, then as a noun, then as an adjective
    words = string.split()
    for i, word in enumerate(words):
        for pos in ("v", "n", "a"):
            word = lemmatizer.lemmatize(word, pos=pos)
        words[i] = word

    return " ".join(words)

df.clean_text = df.clean_text.apply(lemmatize_text)

Wall time: 17.3 s


## Bag-of-words Modelling

👇 Vectorize the `clean_text` to a Bag-of-Words representation with a default `CountVectorizer`. Save as `X_bow`.

In [222]:
def vectorize_text(series: pd.Series):
    vectorizer = CountVectorizer()
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    return vectorizer.fit_transform(series).toarray(), vectorizer.get_feature_names_out()

X_bow, words = vectorize_text(df.clean_text)

👇 Cross-validate a MultinomialNB model with the Bag-of-words. Score the model's accuracy.

In [242]:
def multinomial_naive_bayes(X, y):
    # Note: this scores a single 70/30 hold-out split, not a k-fold cross-validation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    multinomial = MultinomialNB()
    multinomial.fit(X_train, y_train)
    y_pred = multinomial.predict(X_test)
    return accuracy_score(y_test, y_pred)

multinomial_naive_bayes(X_bow, df.spam)

0.9866201279813845
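The prompt asks for cross-validation, whereas the score above comes from a single train/test split. One possible way to cross-validate is sketched below, assuming scikit-learn's `cross_val_score` and a pipeline so that the vocabulary is learned from the training folds only. The toy corpus and labels are invented for illustration; with the real data you would pass `df.clean_text` and `df.spam` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = ham
texts = [
    "free money offer", "win cash prize", "cheap pills offer", "free prize win",
    "meeting agenda attached", "project schedule update",
    "lunch meeting tomorrow", "quarterly report attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer lives inside the pipeline, so it is refit on each training fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# cv=4 uses stratified folds for a classifier: each fold holds one spam and one ham
scores = cross_val_score(pipeline, texts, labels, cv=4, scoring="accuracy")

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```

Averaging over several folds gives a less split-dependent estimate than one hold-out score, and keeping `CountVectorizer` inside the pipeline avoids fitting the vocabulary on text that later serves as test data.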

⚠️ Please push the exercise once you are done 🙃

## 🏁 