# Ham or Spam?

In [1]:
# When installing NLTK for the first time, also download the data packages used below (uncomment to run)
#!pip install nltk
import nltk
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')

In [2]:
import pandas as pd

df = pd.read_csv("emails.csv")

df.head()

|   | text | spam |
|---|------|------|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


The dataset is made up of emails that are classified as ham (`0`) or spam (`1`). You need to clean the text before training a prediction model.

## Remove Punctuation

👇 Create a function to remove punctuation. Apply it to the whole `text` column and store the output in a new dataframe column called `clean_text`.

In [3]:
import string 

#string.punctuation

In [4]:
def rem_punct(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

df['clean_text'] = df['text'].apply(rem_punct)
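
As an aside, the same cleaning can be done in a single pass over each string with `str.translate`. This is a minimal sketch of an alternative, not the approach used in this notebook; the name `rem_punct_fast` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def rem_punct_fast(text):
    """Remove ASCII punctuation in one pass over the string."""
    return text.translate(_PUNCT_TABLE)

print(rem_punct_fast("Subject: hello, world!"))  # Subject hello world
```

Like `rem_punct`, this only strips the characters listed in `string.punctuation`, so non-ASCII punctuation is left in place.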

## Lower Case

👇 Create a function to lower case the text. Apply it to `clean_text`

In [5]:
def low_case(text):
    text = text.lower() 
    return text

df['clean_text'] = df['clean_text'].apply(low_case)

## Remove Numbers

👇 Create a function to remove numbers from the text. Apply it to `clean_text`

In [6]:
def rem_number(text):
    # Iterating over a string yields characters, so each digit character is dropped
    text = ''.join(char for char in text if not char.isdigit())
    return text

df['clean_text'] = df['clean_text'].apply(rem_number)
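
A regular expression is another way to express the same step. The sketch below assumes the text contains only ordinary decimal digits; `rem_number_re` is a hypothetical name and is not used elsewhere in the notebook.

```python
import re

def rem_number_re(text):
    """Delete every run of decimal digits; all other characters are kept."""
    return re.sub(r'\d+', '', text)

# repr() makes the leftover double space visible
print(repr(rem_number_re("call 555 now")))  # 'call  now'
```

Note that removing digits leaves the surrounding spaces untouched, which is harmless here because the text is tokenized in the next step.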

## Remove StopWords

👇 Create a function to remove stopwords from the text. Apply it to `clean_text`.

In [7]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english')) 

def rem_stop_words(text):
    # Tokenize, then keep only the tokens that are not English stop words
    word_tokens = word_tokenize(text)
    text = [w for w in word_tokens if w not in stop_words]
    return text  # a list of tokens, not a string

df['clean_text'] = df['clean_text'].apply(rem_stop_words)
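
To make the filtering step concrete without the NLTK data files, here is a minimal sketch that uses a tiny hand-written stop-word set and whitespace splitting. Both are stand-ins (assumptions) for `stopwords.words('english')` and `word_tokenize`, which cover far more cases.

```python
# Hypothetical, tiny stop-word set; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "is", "a", "to", "and", "of"}

def rem_stop_words_toy(text):
    """Split on whitespace and keep the tokens that are not stop words."""
    return [token for token in text.split() if token not in TOY_STOP_WORDS]

print(rem_stop_words_toy("the offer is valid to the end of june"))
# ['offer', 'valid', 'end', 'june']
```

Storing the stop words in a `set` keeps each membership check cheap, which matters when the filter runs over every token of every email.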

## Lemmatize

👇 Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`.

In [8]:
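from nltk.stem.porter import PorterStemmer

# Note: PorterStemmer performs stemming (rule-based suffix stripping), not
# dictionary-based lemmatization; nltk.stem.WordNetLemmatizer would lemmatize.
stemmer = PorterStemmer()

def app_lemmatize(text):
    # `text` is the list of tokens produced by rem_stop_words
    stemmed = [stemmer.stem(word) for word in text]
    return stemmed  # still a list; it is joined into a single string in the next cell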

df['clean_text'] = df['clean_text'].apply(app_lemmatize)



In [9]:
# Join each token list back into a single string.
# Assigning the whole column at once avoids the chained assignment
# (df['clean_text'][i] = ...) that triggers pandas' SettingWithCopyWarning.
df['clean_text'] = df['clean_text'].apply(' '.join)


## Bag-of-words Modelling

👇 Vectorize `clean_text` into a Bag-of-Words representation with a default `CountVectorizer`. Save it as `X_bow`.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(df['clean_text'])

#X.toarray()

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_bow = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
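
The idea behind a bag-of-words matrix can be sketched with the standard library alone: build a sorted vocabulary, then count each word per document. This is a simplified illustration, not scikit-learn's implementation. In particular it splits on whitespace, whereas the default `CountVectorizer` tokenizer lowercases the text and only keeps tokens of two or more word characters. The names `bag_of_words` and `docs` are hypothetical.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["free money free offer", "meeting agenda offer"]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)  # ['agenda', 'free', 'meeting', 'money', 'offer']
print(rows)        # [[0, 2, 0, 1, 1], [1, 0, 1, 0, 1]]
```

Each row plays the role of one line of `X_bow`: word order is discarded and only the counts remain.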

👇 Cross-validate a MultinomialNB model with the Bag-of-words. Score the model's accuracy.

In [11]:
y = df['spam']

In [12]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: this rebinds `vectorizer` and `X`; from here on `X` holds the TF-IDF features
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df['clean_text'])
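
For intuition, TF-IDF reweights raw counts so that words appearing in many documents contribute less. The sketch below follows the formula the scikit-learn documentation describes for the default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). It is an illustration on whitespace-split text, not the library's implementation, and `tfidf_rows` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # Scale the row to unit Euclidean length (skip all-zero rows)
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

vocabulary, rows = tfidf_rows(["free money", "free meeting"])
print(vocabulary)  # ['free', 'meeting', 'money']
print(rows[0])     # 'free' (in both documents) gets a smaller weight than 'money'
```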


In [14]:
from sklearn.model_selection import cross_validate


# Bag-of-words features (dense DataFrame built in cell 10)
cv_results = cross_validate(nb_model, X_bow, y, cv=5, n_jobs=-1)
print(f"CountVectorizer: {cv_results['test_score'].mean()}")

# TF-IDF features (sparse matrix built in cell 13)
cv_results = cross_validate(nb_model, X, y, cv=5, n_jobs=-1)
print(f"TfidfVectorizer: {cv_results['test_score'].mean()}")


CountVectorizer: 0.9884778649107966
TfidfVectorizer: 0.8966472332091117
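
Each score above is the mean accuracy over five held-out folds. The basic splitting idea can be sketched as follows. This is a plain, unshuffled k-fold; for classifiers, scikit-learn's `cross_validate` uses a stratified variant that also preserves class proportions in each fold, which this sketch omits. `k_fold_indices` is a hypothetical name.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, non-overlapping folds."""
    # The first (n_samples % k) folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

for train, test in k_fold_indices(10, k=5):
    print(test)
# [0, 1]
# [2, 3]
# [4, 5]
# [6, 7]
# [8, 9]
```

Every sample is used for testing exactly once, and the model evaluated on a fold never sees that fold during fitting.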


👇 Fit a MultinomialNB model with the Bag-of-words. Score the model's accuracy.

In [15]:

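nb_model.fit(X_bow, y)

# Accuracy on the same data used for fitting (training accuracy),
# so it is optimistic compared with the cross-validated score above
nb_model.score(X_bow, y)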

0.9951117318435754

In [16]:

# Same procedure with the TF-IDF features held in `X`
nb_model.fit(X, y)

nb_model.score(X, y)

0.9355796089385475

⚠️ Please push the exercise once you are done 🙃

## 🏁 