### Codio Activity 18.6: Naive Bayes Algorithm

**Expected Time = 60 minutes** 

**Total Points = 35** 

This activity focuses on the implementation of the Naive Bayes algorithm.  You will use the scikit-learn `MultinomialNB` estimator together with your earlier vectorization strategies to model the WhatsApp text, and compare the results to your earlier work with Logistic Regression.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

[Back to top](#-Index)

### Problem 1

#### Small Example

**10 Points**

The example below is adapted from Marsland's *Machine Learning: An Algorithmic Perspective*.  It is a small dataset whose features are whether or not a student has a looming deadline, whether there is a party going on, and whether or not the student feels lazy.  The `activity` column is the target, and your aim is to use the Naive Bayes formula below:

$$P(C_i) \prod_{k} P(X_j^k = a_k | C_i)$$

In [2]:
deadline = ['urgent','urgent','near', 'none', 'none', 'none', 'near', 'near', 'near','urgent']
party = ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no']
lazy = ['yes', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no']
activity = ['party', 'study', 'party', 'party', 'pub', 'party', 'study', 'tv', 'party', 'study']

In [3]:
df = pd.DataFrame({'deadline': deadline, 
                  'party': party,
                  'lazy': lazy,
                  'activity': activity})
df

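|   | deadline | party | lazy | activity |
|---|----------|-------|------|----------|
| 0 | urgent | yes | yes | party |
| 1 | urgent | no | yes | study |
| 2 | near | yes | yes | party |
| 3 | none | yes | no | party |
| 4 | none | no | yes | pub |
| 5 | none | yes | no | party |
| 6 | near | no | no | study |
| 7 | near | no | yes | tv |
| 8 | near | yes | yes | party |
| 9 | urgent | no | no | study |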


Here, $C_i$ represents a class in the `activity` column.  Suppose we want to predict the activity for the following input:

```
deadline = near
party = no
lazy = yes
```

This means we need four probabilities, one per class (each is proportional to the posterior probability of that class given the input):

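- $P(\text{party}) \times P(\text{near} \mid \text{party}) \times P(\text{no party} \mid \text{party}) \times P(\text{lazy} \mid \text{party})$
- $P(\text{study}) \times P(\text{near} \mid \text{study}) \times P(\text{no party} \mid \text{study}) \times P(\text{lazy} \mid \text{study})$
- $P(\text{pub}) \times P(\text{near} \mid \text{pub}) \times P(\text{no party} \mid \text{pub}) \times P(\text{lazy} \mid \text{pub})$
- $P(\text{tv}) \times P(\text{near} \mid \text{tv}) \times P(\text{no party} \mid \text{tv}) \times P(\text{lazy} \mid \text{tv})$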

Compute these four probabilities and assign them to the list `probs` in the order above (party, study, pub, tv). 

Hint: No need to calculate the probabilities by hand.

In [4]:
### GRADED
probs = []

    
# YOUR CODE HERE
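# Each entry is P(class) * P(near | class) * P(no party | class) * P(lazy | class);
# once a factor is 0 the remaining factors are omitted.
probs = [
    1/2 * 2/5 * 0,          # party: P(no party | party) = 0
    3/10 * 1/3 * 1 * 1/3,   # study
    1/10 * 0,               # pub: P(near | pub) = 0
    1/10 * 1 * 1 * 1,       # tv
]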

### ANSWER CHECK
print(probs)

[0.0, 0.03333333333333333, 0.0, 0.1]
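As the hint suggests, the same four values can be obtained from the data rather than by counting rows by hand. The following is a minimal sketch, assuming the same ten rows as above; the helper name `nb_score` and the variable `probs_check` are hypothetical and not part of the graded solution.

```python
import pandas as pd

df = pd.DataFrame({
    'deadline': ['urgent', 'urgent', 'near', 'none', 'none', 'none', 'near', 'near', 'near', 'urgent'],
    'party':    ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no'],
    'lazy':     ['yes', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no'],
    'activity': ['party', 'study', 'party', 'party', 'pub', 'party', 'study', 'tv', 'party', 'study'],
})

def nb_score(data, target, label, query):
    """Hypothetical helper: P(C) times the product of P(feature = value | C)."""
    subset = data[data[target] == label]
    score = len(subset) / len(data)                    # prior P(C)
    for feature, value in query.items():
        score *= (subset[feature] == value).mean()     # P(feature = value | C)
    return score

query = {'deadline': 'near', 'party': 'no', 'lazy': 'yes'}
classes = ['party', 'study', 'pub', 'tv']
probs_check = [nb_score(df, 'activity', c, query) for c in classes]
print(dict(zip(classes, probs_check)))
```

Because each conditional probability is just the mean of a boolean column within the class subset, a class that never co-occurs with one of the query values gets a score of exactly zero.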


[Back to top](#-Index)

### Problem 2

#### MAP solution

**10 Points**

Using these probabilities, the maximum a posteriori (MAP) solution selects the outcome with the highest probability.  Use your list of probabilities to identify the `argmax`.  Note that you can use `np.argmax` for this or just inspect the values.  What is the activity associated with the MAP solution?  Assign your answer as a string -- `party`, `study`, `pub`, or `tv` -- to `ans2` below.

In [5]:
### GRADED
ans2 = ''

    
# YOUR CODE HERE
ans2 = 'tv'

### ANSWER CHECK
print(ans2)

tv
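For reference, the `np.argmax` route mentioned in the prompt could look like the sketch below. It assumes the scores from Problem 1 in the order party, study, pub, tv; the names `classes`, `map_class`, and `posteriors` are illustrative only.

```python
import numpy as np

classes = ['party', 'study', 'pub', 'tv']
probs = [0.0, 1/30, 0.0, 0.1]          # scores from Problem 1, same order as classes

# np.argmax returns the position of the largest score
best_index = int(np.argmax(probs))
map_class = classes[best_index]

# Dividing by the total turns the scores into posterior probabilities
posteriors = np.array(probs) / np.sum(probs)
print(map_class, posteriors.round(3))
```

Normalizing is not needed to find the MAP class, since dividing every score by the same total does not change which one is largest.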


### Larger Example

Now, you will use the scikit-learn vectorizers together with the `MultinomialNB` estimator to implement the Naive Bayes algorithm for classifying the WhatsApp data.  The data is loaded and split for you below.

In [6]:
happy_df = pd.read_csv('data/Emotion(happy).csv')
sad_df = pd.read_csv('data/Emotion(sad).csv.zip', compression = 'zip')
full_df = pd.concat([happy_df, sad_df]).reset_index(drop = True)
X = full_df.drop('sentiment', axis = 1)
y = full_df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

[Back to top](#-Index)

### Problem 3

#### Pipeline with `CountVectorizer`

**5 Points**

Below, create a pipeline called `cvect_pipe` with named steps `cvect` and `bayes` that first vectorizes the text with `CountVectorizer` and then uses the `MultinomialNB` estimator with all default settings.  Fit it on the training data and score it on the test data, assigning the accuracy to `cvect_acc` below.

In [7]:
### GRADED
cvect_pipe = ''

cvect_acc = ''

    
# YOUR CODE HERE
cvect_pipe = Pipeline([('cvect', CountVectorizer()), ('bayes', MultinomialNB())])
cvect_pipe.fit(X_train, y_train)
cvect_acc = cvect_pipe.score(X_test, y_test)


### ANSWER CHECK
cvect_pipe.named_steps

{'cvect': CountVectorizer(), 'bayes': MultinomialNB()}

[Back to top](#-Index)

### Problem 4

#### Pipeline with `TfidfVectorizer`

**5 Points**

Below, create a pipeline called `tfidf_pipe` with named steps `tfidf` and `bayes` that first vectorizes the text with `TfidfVectorizer` and then uses the `MultinomialNB` estimator with all default settings.  Fit it on the training data and score it on the test data, assigning the accuracy to `tfidf_acc` below.


In [8]:
### GRADED
tfidf_pipe = ''

tfidf_acc = ''

    
# YOUR CODE HERE
tfidf_pipe = Pipeline([('tfidf', TfidfVectorizer()), ('bayes', MultinomialNB())])
tfidf_pipe.fit(X_train, y_train)
tfidf_acc = tfidf_pipe.score(X_test, y_test)


### ANSWER CHECK
tfidf_pipe.named_steps

{'tfidf': TfidfVectorizer(), 'bayes': MultinomialNB()}

[Back to top](#-Index)

### Problem 5

#### Assessing performance

**5 Points**

Now, consider searching the hyperparameters of the model.  Specifically, what is the parameter of `MultinomialNB` that controls Laplace (additive) smoothing?  Assign your answer as a string to `ans5` below.  As an extra activity, perform a grid search over this parameter and compare the performance to that of `LogisticRegression`.  Also, compare how long the logistic regression and Naive Bayes models take to fit.

In [9]:
### GRADED
ans5 = ''

    
# YOUR CODE HERE
ans5 = 'alpha'

### ANSWER CHECK
print(ans5)

alpha
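To see why smoothing matters, recall that two of the four scores in Problem 1 were exactly zero because a single feature value never occurred with that class. The sketch below applies additive smoothing to the small categorical example. It is an illustration under stated assumptions, not what `MultinomialNB` computes internally (that estimator smooths word counts): the helper `smoothed_score` is hypothetical, the prior is left unsmoothed, and each conditional is estimated as (count + alpha) / (class size + alpha × number of distinct values of the feature).

```python
import pandas as pd

df = pd.DataFrame({
    'deadline': ['urgent', 'urgent', 'near', 'none', 'none', 'none', 'near', 'near', 'near', 'urgent'],
    'party':    ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no'],
    'lazy':     ['yes', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no'],
    'activity': ['party', 'study', 'party', 'party', 'pub', 'party', 'study', 'tv', 'party', 'study'],
})

def smoothed_score(data, target, label, query, alpha=1.0):
    """Hypothetical helper: naive Bayes score with additive smoothing on the conditionals."""
    subset = data[data[target] == label]
    score = len(subset) / len(data)                  # prior, left unsmoothed (assumption)
    for feature, value in query.items():
        count = (subset[feature] == value).sum()     # rows in this class with the value
        n_values = data[feature].nunique()           # distinct values of the feature
        score *= (count + alpha) / (len(subset) + alpha * n_values)
    return score

query = {'deadline': 'near', 'party': 'no', 'lazy': 'yes'}
classes = ['party', 'study', 'pub', 'tv']
unsmoothed = [smoothed_score(df, 'activity', c, query, alpha=0.0) for c in classes]
smoothed = [smoothed_score(df, 'activity', c, query, alpha=1.0) for c in classes]
print(unsmoothed)
print(smoothed)
```

With `alpha=0.0` the helper reproduces the unsmoothed scores, zeros included; with `alpha=1.0` no class is ruled out by a single unseen feature value.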


In [18]:
pipelines = {
    'lr': Pipeline([('tfidf', TfidfVectorizer()), ('lgr', LogisticRegression())]),
    'nb': Pipeline([('tfidf', TfidfVectorizer()), ('bayes', MultinomialNB())]),
}


In [20]:
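scores = {}
for model_name, pipeline in pipelines.items():
    # Fit on the full training set; cross_val_score below clones the pipeline
    # and refits it on each fold, so this fit does not affect the CV scores.
    pipeline.fit(X_train, y_train)
    # Mean accuracy over 5 cross-validation folds of the training data
    score = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy').mean()
    scores[model_name] = score
    print(f"{model_name} with accuracy score {score:.3f}")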

lr with accuracy score 0.776
nb with accuracy score 0.721


In [21]:
best_model_name = max(scores, key=scores.get)
print(f"Best Model: {best_model_name} with accuracy score: {scores[best_model_name]}")

Best Model: lr with accuracy score: 0.7755381508300083


In [32]:
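params = {'tfidf__max_features': [100, 500, 1000, 2000],
          'tfidf__stop_words': ['english', None],
          # The step-specific entries below are commented out: this grid is used for
          # both pipelines, and each entry is valid for only one of them.
          # 'bayes__alpha': [0.5, 1],
          # 'lgr__penalty': ['l1', 'l2'],
          }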

In [33]:
for pipeline in list(pipelines.values()):
    grid = GridSearchCV(pipeline, param_grid=params)
    grid.fit(X_train, y_train)
    print(grid.best_params_)

{'tfidf__max_features': 500, 'tfidf__stop_words': 'english'}
{'tfidf__max_features': 500, 'tfidf__stop_words': None}
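The search above only varies the shared `tfidf` step. One way to also search step-specific parameters such as `bayes__alpha`, and to time the fits as the extra activity suggests, is to keep one grid per pipeline. The sketch below assumes a small made-up corpus in place of `X_train` and `y_train` so that it runs on its own; the texts, labels, grid values, `max_iter` setting, and the names `param_grids` and `results` are all illustrative, and the resulting scores and timings say nothing about the WhatsApp data.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train / y_train
texts = [
    "what a wonderful sunny day", "i love this so much", "feeling great and happy",
    "best news ever today", "so excited for the weekend", "this made me smile",
    "i feel so alone tonight", "this is a terrible day", "i miss you so much it hurts",
    "feeling down and tired", "worst news ever today", "nothing is going right",
]
labels = ['happy'] * 6 + ['sad'] * 6

pipelines = {
    'lr': Pipeline([('tfidf', TfidfVectorizer()), ('lgr', LogisticRegression(max_iter=1000))]),
    'nb': Pipeline([('tfidf', TfidfVectorizer()), ('bayes', MultinomialNB())]),
}

# One grid per pipeline, so every step-prefixed name exists in the pipeline it is used with
param_grids = {
    'lr': {'lgr__C': [0.1, 1.0, 10.0]},
    'nb': {'bayes__alpha': [0.1, 0.5, 1.0]},
}

results = {}
for name, pipe in pipelines.items():
    grid = GridSearchCV(pipe, param_grid=param_grids[name], cv=3, scoring='accuracy')
    start = time.perf_counter()
    grid.fit(texts, labels)                      # includes the final refit on all data
    elapsed = time.perf_counter() - start
    results[name] = {'best_params': grid.best_params_,
                     'best_score': grid.best_score_,
                     'fit_seconds': elapsed}
    print(f"{name}: {grid.best_params_}, cv accuracy {grid.best_score_:.3f}, {elapsed:.2f}s")
```

Keeping the grids in a dictionary keyed by the same names as `pipelines` avoids the invalid-parameter error that `GridSearchCV` raises when a grid names a step the pipeline does not contain.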


In [28]:
list(pipelines.values())

[Pipeline(steps=[('tfidf', TfidfVectorizer()), ('lgr', LogisticRegression())]),
 Pipeline(steps=[('tfidf', TfidfVectorizer()), ('bayes', MultinomialNB())])]