# Ham or Spam?

In [1]:
# when installing nltk for the first time we need to also download a few built in libraries

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Simplon\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Simplon\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Simplon\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
import pandas as pd
import re
df = pd.read_csv("emails.csv")

df.tail()

Unnamed: 0,text,spam
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0
5727,Subject: news : aurora 5 . 2 update aurora ve...,0


The dataset is made up of email that are classified as ham [0] or spam[1]. You need to clean the dataset before training a prediction model.

## Remove Punctuation

👇 Create a function to remove the punctuation. Apply it to the entire data and add the output as a new column in the dataframe called `clean_text`

In [3]:
def clean_text_ponc(text):
    
    for i in range(len(text)):
        clean_text = text[i].strip()
        whitelist = set("'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")
        clean_text = ''.join(filter(whitelist.__contains__, clean_text))
        clean_text = clean_text.strip()
        text[i] = clean_text

    return text

In [4]:
df['clean_text']=clean_text_ponc(df.text)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  text[i] = clean_text


In [10]:
df

Unnamed: 0,text,spam,clean_text
0,Subject naturally irresistible your corporate ...,1,Subject naturally irresistible your corporate ...
1,Subject the stock trading gunslinger fanny is...,1,Subject the stock trading gunslinger fanny is...
2,Subject unbelievable new homes made easy im w...,1,Subject unbelievable new homes made easy im w...
3,Subject 4 color printing special request addi...,1,Subject 4 color printing special request addi...
4,Subject do not have money get software cds fr...,1,Subject do not have money get software cds fr...
...,...,...,...
5723,Subject re research and development charges t...,0,Subject re research and development charges t...
5724,Subject re receipts from visit jim thanks ...,0,Subject re receipts from visit jim thanks ...
5725,Subject re enron case study update wow all ...,0,Subject re enron case study update wow all ...
5726,Subject re interest david please call shi...,0,Subject re interest david please call shi...


## Lower Case

👇 Create a function to lower case the text. Apply it to `clean_text`

In [11]:
def lowercase(text):
     clean_text = text.apply(lambda x: x.lower())
     return clean_text

In [12]:
df['clean_text']=lowercase(df.clean_text)

## Remove Numbers

👇 Create a function to remove numbers from the text. Apply it to `clean_text`

In [13]:
def clean_text_num(text):
    clean_text = text.apply(lambda x:''.join([i for i in x if not i.isdigit()]))
    return clean_text
    

In [14]:
df['clean_text']=clean_text_num(df.clean_text)

## Remove StopWords

👇 Create a function to remove stopwords from the text. Apply it to `clean_text`.

In [17]:
from nltk.corpus import stopwords


def text_stopwords(text):
    clean_text =text.apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords.words('english')]))
    return clean_text



In [18]:
df['clean_text']=text_stopwords(df.clean_text)

## Lemmatize

👇 Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`.

In [None]:
from nltk.stem import WordNetLemmatizer

def text_lemmatizer(text):
    lemmatizer = WordNetLemmatizer()
    clean_text=text.apply(lambda x: lemmatizer.lemmatize(x))
    return clean_text

In [None]:
df['clean_text']=text_lemmatizer(df.clean_text)

## Bag-of-words Modelling

👇 Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer . Save as `X_bow`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(df.clean_text)
vector = vectorizer.transform(df.clean_text)
print(vectorizer.vocabulary_)


👇 Cross-validate a MultinomialNB model with the Bag-of-words. Score the model's accuracy.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

#cross validation
x=vector.toarray()
y=df.spam
X_train, X_test, y_train, y_test= train_test_split(x, y, test_size= .2, random_state = 42, stratify= y)
#implement MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
#predictions
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
#accuracy score 
train_pred_score = accuracy_score(y_train, y_train_pred)
test_pred_score = accuracy_score(y_test, y_test_pred)
print('Training Set Accuracy Score: ', (100 * train_pred_score))
print('Testing Set Accuracy Score: ', (100 * test_pred_score))

⚠️ Please push the exercise once you are done 🙃

## 🏁 