## Review of conditional probability and its application to text

- Assume this small dataset is given:

<img src="../Notebooks/Images/spam_ham_data_set.png" width="600" height="600">

## Activity: Create spam and ham dictionary

- Create two dictionaries for spam and ham where keys are unique words and values are the frequency of each word
    - Example: if the word "password" shows up 4 times in the text, then in the dictionary, the key would be "password" and the value would be 4
- Create the dictionaries programmatically using `for` loops
- Use the below text to create your dictionaries:
    - `spam_text= ['Send us your password', 'review us', 'Send your password', 'Send us your account']`
    - `ham_text= ['Send us your review', 'review your password']`

In [14]:
spam_text = ['Send us your password', 'review us', 'Send your password', 'Send us your account']
ham_text = ['Send us your review', 'review your password']

spam = {}
ham = {}

for item in spam_text:
    for j in item.lower().split(' '):
        if j not in spam:
            spam[j] = 1
        else:
            spam[j] += 1
print('Spam Dictionary')
print(spam)
print("\n")
for item in ham_text:
    for j in item.lower().split(' '):
        if j not in ham:
            ham[j] = 1
        else:
            ham[j] += 1
print('Ham Dictionary')            
ham

Spam Dictionary
{'send': 3, 'us': 3, 'your': 3, 'password': 2, 'review': 1, 'account': 1}


Ham Dictionary


{'send': 1, 'us': 1, 'your': 2, 'review': 2, 'password': 1}
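
The same counts can also be produced with `collections.Counter` from the standard library. The following is a minimal alternative sketch, not a replacement for the `for`-loop exercise; the helper name `count_words` is made up for illustration.

```python
from collections import Counter

spam_text = ['Send us your password', 'review us', 'Send your password', 'Send us your account']
ham_text = ['Send us your review', 'review your password']

def count_words(messages):
    """Return a Counter mapping each lowercased word to its frequency (hypothetical helper)."""
    counts = Counter()
    for message in messages:
        # split() with no argument splits on any run of whitespace
        counts.update(message.lower().split())
    return counts

spam_counts = count_words(spam_text)
ham_counts = count_words(ham_text)
print(dict(spam_counts))
print(dict(ham_counts))
```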

## Question: Given our dictionaries from the last activity, if we know an email is spam, what is the probability that the word "password" is in the email? 

What is the frequency of "password" in a spam email?

- Answer:

 $P(password \mid spam) = 2/(3+3+3+2+1+1) = 2/13 \approx 15.38\%$ 

In [15]:
p_password_given_spam = spam['password']/sum(spam.values())
print(p_password_given_spam)

0.15384615384615385


## Question: Given our dictionaries from the last activity, if we know an email is ham, what is the probability that the word "password" is in the email? 

What is the frequency of "password" in a ham email?

- Answer:

$P(password \mid ham) = 1/(1+2+1+1+2+0) = 1/7 \approx 14.29\%$ 
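
The same calculation in code, following the pattern used for the spam case. The `ham` dictionary is restated here only so that the snippet runs on its own.

```python
# Word counts for ham, copied from the dictionary built in the earlier activity
ham = {'send': 1, 'us': 1, 'your': 2, 'review': 2, 'password': 1}

# Relative frequency of "password" among all words seen in ham messages
p_password_given_ham = ham['password'] / sum(ham.values())
print(p_password_given_ham)
```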

## Question: Assume we have seen the word "password" in an email, what is the probability that the email is spam?

- $P(spam \mid password) = ?$
- Hint: Use Bayes' rule and Law of Total Probability (LOTP):
    - Bayes' Rule: $P(spam \mid password) = (P(password \mid spam) P(spam))/ P(password)$ 
    - LOTP: $P(password) = P(password \mid spam) P(spam) + P(password \mid ham) P(ham)$
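
One way to work through the hint numerically is sketched below. It assumes the priors are estimated from the message counts in the toy lists from the first activity (4 spam messages, 2 ham messages); that choice of prior is an assumption of this sketch, not something fixed by the question.

```python
from fractions import Fraction

# Likelihoods from the word-frequency dictionaries in the first activity
p_password_given_spam = Fraction(2, 13)
p_password_given_ham = Fraction(1, 7)

# Assumed priors: share of spam/ham messages in the toy data (4 spam, 2 ham)
p_spam = Fraction(4, 6)
p_ham = Fraction(2, 6)

# Law of Total Probability
p_password = p_password_given_spam * p_spam + p_password_given_ham * p_ham

# Bayes' rule
p_spam_given_password = p_password_given_spam * p_spam / p_password
print(p_spam_given_password, float(p_spam_given_password))
```

Under these assumptions the result is 28/41, or about 68.3%.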

## Activity: Apply Naive Bayes to the spam/ham email dataset

**In groups of 3, complete the following activity**

1. Please read this article, starting at the **Naive Bayes Assumption** section: https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/
1. We will use the [Spam Dataset](Datasets/spam.csv)
1. In the article, for the code block of the `fit` method, which line(s) of the method calculate the probability of ham and spam?
1. For the same `fit` method, which line(s) of the method build the spam and ham dictionaries?
1. In the article, for the code block of the `predict` method, which line(s) compare the scores of ham and spam based on log probabilities?

We will discuss as a class after working in groups.

## Activity: Find the Naive Bayes core parts in the SpamDetector Class

**In groups of 3, complete the following activity**

Assume we have written the `SpamDetector` class from the article. Train this model from the given [Spam Dataset](Datasets/spam.csv), and use it to make a prediction!

Use the starter code below, and then fill in the TODOs in the `main`.

**Hints:**

- You will need to use [train_test_split from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to obtain your training and test (prediction) data
- You will need to instantiate your `SpamDetector`, fit the training data to it, predict using the test values, and then measure the accuracy
- To calculate accuracy: divide the number of correct predictions by the total number of predictions
- Use the following code to get your data ready for transforming/manipulating:
```
import pandas as pd

data = pd.read_csv('Datasets/spam.csv', encoding='latin-1')
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":'label', "v2":'text'})
print(data.head())
tags = data["label"]
texts = data["text"]
X, y = texts, tags
```
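
The accuracy hint above amounts to the fraction of predictions that match the true labels. A minimal sketch on made-up labels follows; the `accuracy` helper and the toy lists are illustrative and not part of the starter code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label (hypothetical helper)."""
    y_true, y_pred = list(y_true), list(y_pred)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels, invented for illustration
y_true = ['ham', 'spam', 'ham', 'ham']
y_pred = ['ham', 'spam', 'spam', 'ham']
print(accuracy(y_true, y_pred))  # 3 of 4 correct -> 0.75
```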

In [35]:
import pandas as pd 
data = pd.read_csv('Datasets/spam.csv',encoding='latin-1')
data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)
data = data.rename(columns={"v1":'label', "v2":'text'})
data.head()


| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [36]:
tags = data["label"]
texts = data["text"]
X, y = texts, tags

In [37]:
import os
import re
import string
import math
import pandas as pd

class SpamDetector(object):
    """Implementation of Naive Bayes for binary classification"""

    # clean up our string by removing punctuation
    def clean(self, s):
        translator = str.maketrans("", "", string.punctuation)
        return s.translate(translator)

    #  tokenize our string into words
    def tokenize(self, text):
        text = self.clean(text).lower()
        # raw string avoids Python's invalid-escape-sequence warning for "\W"
        return re.split(r"\W+", text)

    # count up how many of each word appears in a list of words.
    def get_word_counts(self, words):
        word_counts = {}
        for word in words:
            word_counts[word] = word_counts.get(word, 0.0) + 1.0
        return word_counts

    def fit(self, X, Y):
        """Fit our classifier by computing class probabilities and the ham/spam dictionaries
        Arguments:
            X {list} -- list of document contents
            Y {list} -- correct labels
        """
        self.num_messages = {}
        self.log_class_priors = {}
        self.word_counts = {}
        self.vocab = set()

        # Compute log class priors (the probability that any given message is spam/ham),
        # by counting how many messages are spam/ham, 
        # dividing by the total number of messages, and taking the log.
        n = len(X)
        self.num_messages['spam'] = sum(1 for label in Y if label == 'spam')
        self.num_messages['ham'] = sum(1 for label in Y if label == 'ham')
        self.log_class_priors['spam'] = math.log(self.num_messages['spam'] / n )
        self.log_class_priors['ham'] = math.log(self.num_messages['ham'] / n )
        self.word_counts['spam'] = {}
        self.word_counts['ham'] = {}

        # for each (document, label) pair, tokenize the document into words.
        for x, y in zip(X, Y):
            c = 'spam' if y == 'spam' else 'ham'
            counts = self.get_word_counts(self.tokenize(x))
            # For each word, either add it to the vocabulary for spam/ham, 
            # if it isn’t already there, and update the number of counts. 
            for word, count in counts.items():
                # Add that word to the global vocabulary.
                if word not in self.vocab:
                    self.vocab.add(word)
                if word not in self.word_counts[c]:
                    self.word_counts[c][word] = 0.0

                self.word_counts[c][word] += count

    # function to actually output the class label for new data.
    def predict(self, X):
        # adding log of the word given the email is spam 
        result = []
        # Given a document...
        for x in X:
            counts = self.get_word_counts(self.tokenize(x))
            spam_score = 0
            ham_score = 0
            # We iterate through each of the words...
            for word, _ in counts.items():
                if word not in self.vocab: continue
                # ... and compute log p(w_i|Spam), and sum them all up. The same will happen for Ham
                # add Laplace smoothing
                # https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf
                log_w_given_spam = math.log( (self.word_counts['spam'].get(word, 0.0) + 1) / (self.num_messages['spam'] + len(self.vocab)) )
                log_w_given_ham = math.log( (self.word_counts['ham'].get(word, 0.0) + 1) / (self.num_messages['ham'] + len(self.vocab)) )

                spam_score += log_w_given_spam
                ham_score += log_w_given_ham
            
            # Then we add the log class priors...
            spam_score += self.log_class_priors['spam']
            ham_score += self.log_class_priors['ham']

            # ... and check to see which score is bigger for that document.
            # Whichever is larger, that is the predicted label!
            if spam_score > ham_score:
                result.append('spam')
            else:
                result.append('ham')
        return result
        

# TODO: Fill in the below function to make a prediction, 
# your answer should match the final number in the below output (approximately 0.9704)
if __name__ == '__main__':
    from sklearn.model_selection import train_test_split 
    import numpy as np 
    sd = SpamDetector()
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

    sd.fit(X_train, y_train)
    print(np.mean(np.array(sd.predict(X_test)) == y_test))

0.9704035874439462
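
The `predict` method above smooths each word probability with a denominator of `num_messages + len(vocab)`. The more common textbook form of Laplace (add-one) smoothing for multinomial Naive Bayes divides by the total number of words in the class plus the vocabulary size, so that the estimates sum to 1 over the vocabulary. The sketch below shows that textbook variant on the toy dictionaries from the first activity; it is an illustration under that assumption, not the article's code, and the helper name `smoothed_log_prob` is made up.

```python
import math

spam = {'send': 3, 'us': 3, 'your': 3, 'password': 2, 'review': 1, 'account': 1}
ham = {'send': 1, 'us': 1, 'your': 2, 'review': 2, 'password': 1}
vocab = set(spam) | set(ham)

def smoothed_log_prob(word, counts, vocab):
    """log P(word | class) with add-one smoothing (hypothetical helper)."""
    total_words = sum(counts.values())
    return math.log((counts.get(word, 0) + 1) / (total_words + len(vocab)))

# "account" never appears in ham, yet its smoothed probability is non-zero
print(math.exp(smoothed_log_prob('account', ham, vocab)))   # (0 + 1) / (7 + 6)
print(math.exp(smoothed_log_prob('account', spam, vocab)))  # (1 + 1) / (13 + 6)
```

Working in log space, as `predict` does, turns the product of many small probabilities into a sum, which avoids numerical underflow on long messages.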


## Activity: Apply sklearn's CountVectorizer and MultinomialNB to the spam email dataset

As we've seen with previous topics, sklearn has a lot of built-in functionality that can save us from writing the code from scratch. We are going to solve the same problem as in the previous activity, but using sklearn!

For example, the `SpamDetector` class in the previous activity is an example of a **Multinomial Naive Bayes (MNB) model**. In an MNB, the word counts of a document given its class (i.e. $P(w_1, w_2, ..., w_n \mid spam)$) are modeled with a multinomial distribution (the generalization of the binomial distribution to more than two outcomes), rather than another type of distribution.

**In groups of 3, complete the activity by using the provided starter code and following the steps below:**

1 - Split the dataset

`from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)`

2 - Vectorize the dataset : `vect = CountVectorizer()`

3 - Transform training data into a document-term matrix (BoW): `X_train_dtm = vect.fit_transform(X_train)`

4 - Build and evaluate the model
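
To make steps 2 and 3 concrete, the sketch below builds a tiny document-term matrix by hand. It is a plain-Python illustration of the bag-of-words idea only; `CountVectorizer` has its own tokenization rules and returns a sparse matrix, and the function name `document_term_matrix` is made up.

```python
def document_term_matrix(docs):
    """Return (sorted vocabulary, list of count rows) for a list of strings (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for tokens in tokenized:
        # one row per document, one column per vocabulary word
        row = [0] * len(vocab)
        for word in tokens:
            row[index[word]] += 1
        rows.append(row)
    return vocab, rows

docs = ['Send us your review', 'review your password']
vocab, rows = document_term_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

Each row is one document and each column counts one vocabulary word, which is the numerical form a Naive Bayes model is fitted on.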

**Hints:**

- Remember how you prepared/cleaned/labeled the dataset, created texts and tags, and split the data into train vs test in the previous activity. You'll need to do so again here
- Review the [CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to see how you can transform text into numerical vectors
- Need more help? Check out this [MNB Vectorization](https://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/) article and see what you can use from it.

In [47]:
## starter code:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import pandas as pd
from sklearn.metrics import confusion_matrix

# TODO: Prepare the dataset
data = pd.read_csv('Datasets/spam.csv',encoding='latin-1')
data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)
data = data.rename(columns={"v1":'label', "v2":'text'})
# TODO: create texts and tags
tags = data["label"]
texts = data["text"]
X, y = texts, tags
# TODO: split the data into train vs test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# TODO: transform text into numerical vectors
vectorizer = CountVectorizer()
X_train_dtm = vectorizer.fit_transform(X_train)
# print(X_train_dtm[:1])
# print(X_train[:1])
# instantiate Multinomial Naive Bayes model
nb = MultinomialNB()
# fit the model on the training portion of the dataset
nb.fit(X_train_dtm, y_train)
X_test_dtm = vectorizer.transform(X_test)
# make prediction
y_pred_class = nb.predict(X_test_dtm)
# test accuracy of prediction
metrics.accuracy_score(y_test, y_pred_class)

0.9874439461883409

In [49]:
cm = confusion_matrix(y_test, y_pred_class)
cm

array([[947,   2],
       [ 12, 154]])
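
In sklearn's `confusion_matrix`, rows are true labels and columns are predicted labels, with labels in sorted order, so here row and column 0 are `ham` and row and column 1 are `spam`. The sketch below recomputes accuracy and the spam precision and recall from the matrix shown above, treating `spam` as the positive class.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted
cm = [[947, 2],    # true ham:  947 predicted ham, 2 predicted spam
      [12, 154]]   # true spam: 12 predicted ham, 154 predicted spam

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```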

In [50]:
from sklearn.metrics import classification_report

In [58]:
# sklearn orders the labels alphabetically ('ham', then 'spam'), so the names must follow that order
target_names = ['Ham', 'Spam']
print(classification_report(y_test, y_pred_class, target_names=target_names))

              precision    recall  f1-score   support

         Ham       0.99      1.00      0.99       949
        Spam       0.99      0.93      0.96       166

    accuracy                           0.99      1115
   macro avg       0.99      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115