### Week 3: The Naive Bayes classifier - Multinomial model

Instructor: Nedelina Teneva <br>
Email: cilin@ischool.berkeley.edu <br>


Citations: <br>
 - https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/
 - https://github.com/MIDS-W207/cilin-coursework/tree/master/live_sessions
 


### Objectives: 
 - short intro, and 
 - examples of supervised classification using the Naive Bayes classifier.

### Naive Bayes classifier:
- is a linear classifier, simple but efficient.
- the probabilistic model is based on Bayes' theorem.
- 'naive' because of the assumption that the features in the dataset are mutually independent.
- tends to perform well even under the unrealistic assumption of independence. 
- for example, if the sample size is small, it can outperform more powerful classifiers.
- performs well especially in the fields of document classification and disease prediction.
- however, if the assumption of independence is strongly violated and the classification problem is non-linear, then the model can perform very poorly.


Two types of Naive Bayes classifiers we will be studying today:
- **multi-variate Bernoulli model**: based on binary data (tokens in the feature vector of a document can take the value of 1 or 0)
- **multinomial model**: rather than binary data, use term frequency (the number of times a given token appears in a document). Note: we will be focusing on this model here.
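To make the difference between the two models concrete, here is a minimal sketch of both feature representations for a single sentence. It uses only the standard library; the helper names `count_features` and `binary_features` and the four-word vocabulary are illustrative assumptions, not part of the course code.

```python
from collections import Counter

def count_features(sentence, vocabulary):
    """Multinomial view: how many times each vocabulary token occurs."""
    counts = Counter(sentence.lower().split())
    return [counts[token] for token in vocabulary]

def binary_features(sentence, vocabulary):
    """Bernoulli view: 1 if the token occurs at all, else 0."""
    return [int(c > 0) for c in count_features(sentence, vocabulary)]

vocabulary = ['a', 'game', 'great', 'match']  # illustrative vocabulary
sentence = 'A great great game'

print(count_features(sentence, vocabulary))   # [1, 1, 2, 0]
print(binary_features(sentence, vocabulary))  # [1, 1, 1, 0]
```

The repeated word 'great' is the only place where the two representations differ: the multinomial model sees it twice, the Bernoulli model only records that it is present.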

### Step 1: Import packages

In [22]:
import os 
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

### Step 2: Define working directories

In [23]:
in_dir = os.getcwd()[:-len("live_sessions/week03")] + "data/week03/"
print(in_dir)

/Users/nteneva/Downloads/berkeley/cilin-coursework-master/data/week03/


### Step 3: Define classes

### Step 4: Define functions

### Step 5: Read data

The data contains only 5 sentences, each labeled 'Sports' or 'Not sports'.

In [24]:
df = pd.read_csv(in_dir + 'sport_text.csv')
df.head()

Unnamed: 0,Text,Category
0,A great great game,Sports
1,The election was over,Not sports
2,Very clean match,Sports
3,A clean but forgettable game,Sports
4,It was a close election,Not sports


___
**Problem statement**: we want to know whether the sentence 'a very close game' belongs to the 'Sports' or the 'Not sports' class.

**Idea**: calculate the probabilities P(Sports|'a very close game') and P(Not sports|'a very close game'), then assign the sentence to the class with the larger probability.

**Tools**: Bayes' Theorem! useful when working with conditional probabilities.

$ P(Sports|a\, very\, close\, game) = \frac{P(a\, very\, close\, game| Sports) \times P(Sports)}{P(a\, very\, close\, game)}$ 

$ P(Not \, Sports|a\, very\, close\, game) = \frac{P(a\, very\, close\, game| Not \, Sports) \times P(Not \, Sports)}{P(a\, very\, close\, game)} $

___
Question:

Looking at the formula above, define the following terms:
   - posterior probability
   - conditional probability
   - prior probability
   - evidence

___
We can discard the denominator because it is the same for both classes and we only want to compare the two probabilities. So we only need to compare:

$ P(a\, very\, close\, game| Sports) \times P(Sports)$ 

with

$P(a\, very\, close\, game| Not\, Sports) \times P(Not\, Sports)$

---
Question:

Assuming conditional independence (pretty 'naive' to do so), how can we rewrite the conditional probabilities?

$ P(a\, very\, close\, game| Sports) = P(a|Sports) \times P(very|Sports) \times P(close|Sports) \times P(game|Sports)$
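The factorization above can be sketched in a few lines of Python. This sketch assumes each P(word|Sports) is estimated as a relative frequency over all tokens in the Sports sentences (the same estimate Step 7.2 uses), and it scores the illustrative sentence 'a great game' rather than the test sentence. The helper `word_prob` is a hypothetical name.

```python
from collections import Counter
from fractions import Fraction
from math import prod

# The three 'Sports' sentences from the training data, lowercased.
sports_sentences = [
    'a great great game',
    'very clean match',
    'a clean but forgettable game',
]

# Token counts over the whole class, and the total number of tokens.
token_counts = Counter(t for s in sports_sentences for t in s.split())
total_tokens = sum(token_counts.values())

def word_prob(token):
    """Unsmoothed estimate of P(token | Sports) as a relative frequency."""
    return Fraction(token_counts[token], total_tokens)

# Under the naive assumption, the sentence likelihood is the product
# of the per-word probabilities.
sentence = 'a great game'
likelihood = prod(word_prob(t) for t in sentence.split())
print(likelihood)  # 1/216
```

Each of the three words appears 2 times among the 12 Sports tokens, so the product is (2/12)^3 = 1/216.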

---

### Step 6: Data preprocessing 

#### Step 6.1: Cleaning text data

Before we build our feature vector (X) it's important to clean the data by stripping it of all unwanted characters. Fortunately, our sport dataset looks pretty clean. 

The only thing we want to do is to convert the strings in the Text column to lowercase.

In [25]:
df['Text'] = df.Text.str.lower()
df.head()

Unnamed: 0,Text,Category
0,a great great game,Sports
1,the election was over,Not sports
2,very clean match,Sports
3,a clean but forgettable game,Sports
4,it was a close election,Not sports


#### Step 6.2: Transform words into feature vectors

Remember that last week we discussed how important it is to convert categorical data (text or words) into numerical form before passing it to an ML algorithm.

The feature names will be the unique tokens (words) in the Text column. The feature values will be the word frequencies for each sentence in the dataset.

Question: why use the frequency and not just a binary measure for each token in the feature vector?

In [26]:
## construct X_train
count = CountVectorizer(token_pattern='\\b\\w+\\b')

# create a np.array with all train sentences
train_sentence = []
for _, row in df.iterrows():
    train_sentence.append(row.Text)  
    
train_sentence = np.array(train_sentence)
print('Sentences in train data:')
print(train_sentence)

# get feature names and values
X_train = count.fit_transform(train_sentence)

# feature names
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_train_names = count.get_feature_names_out().tolist()
print('\nUnique tokens(words) in the data: ', list(X_train_names))

# feature values
X_train_values = X_train.toarray()
print('\nFeature values: ')
print(X_train_values)

# put everything together
X_train = pd.DataFrame(X_train_values)
X_train.columns = X_train_names

## construct y train
y_train = df['Category']

# print the dataset
print('\nTraining data after transformation:')
train_df = pd.concat((y_train, X_train), axis=1)
train_df

Sentences in train data:
['a great great game' 'the election was over' 'very clean match'
 'a clean but forgettable game' 'it was a close election']

Unique tokens(words) in the data:  ['a', 'but', 'clean', 'close', 'election', 'forgettable', 'game', 'great', 'it', 'match', 'over', 'the', 'very', 'was']

Feature values: 
[[1 0 0 0 0 0 1 2 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 1 1 0 1]
 [0 0 1 0 0 0 0 0 0 1 0 0 1 0]
 [1 1 1 0 0 1 1 0 0 0 0 0 0 0]
 [1 0 0 1 1 0 0 0 1 0 0 0 0 1]]

Training data after transformation:


Unnamed: 0,Category,a,but,clean,close,election,forgettable,game,great,it,match,over,the,very,was
0,Sports,1,0,0,0,0,0,1,2,0,0,0,0,0,0
1,Not sports,0,0,0,0,1,0,0,0,0,0,1,1,0,1
2,Sports,0,0,1,0,0,0,0,0,0,1,0,0,1,0
3,Sports,1,1,1,0,0,1,1,0,0,0,0,0,0,0
4,Not sports,1,0,0,1,1,0,0,0,1,0,0,0,0,1


#### Step 6.3: Compute stats needed for Bayes' probabilities

In [27]:
# compute word frequency for each category (this will facilitate computation of conditional probabilities)
words_sum = train_df.groupby(['Category']).sum()
words_sum

Category,a,but,clean,close,election,forgettable,game,great,it,match,over,the,very,was
Not sports,1,0,0,1,2,0,0,0,1,0,1,1,0,2
Sports,2,1,2,0,0,1,2,2,0,1,0,0,1,0


In [28]:
# compute number of words in each category (this is the denominator of the word-level conditional probabilities)
category_sum = words_sum.sum(axis=1)
category_sum

Category
Not sports     9
Sports        12
dtype: int64

In [29]:
# count the sentences overall and per class (this will facilitate computation of prior probabilities)
count_y_all = train_df['Category'].count()
count_y_sports = train_df[train_df.Category=='Sports']['Category'].count()
count_y_not_sports = train_df[train_df.Category=='Not sports']['Category'].count()

print('count_y_all: ', count_y_all)
print('count_y_sports: ', count_y_sports)
print('count_y_not_sports: ', count_y_not_sports)

count_y_all:  5
count_y_sports:  3
count_y_not_sports:  2


---
### Step 7: Analysis - Naive Bayes as a classification algorithm

In particular, we want to know whether the sentence "A very close game" belongs to the 'Sports' or the 'Not sports' class.

Idea: calculate the probabilities P(Sports|'a very close game') and P(Not sports|'a very close game'), then assign the sentence to the class with the larger probability.

#### Step 7.1 Define test example and tokenize

In [30]:
test_sentence = "A very close game"
test_words = test_sentence.lower().split()
test_words

['a', 'very', 'close', 'game']
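Note that `split()` only separates on whitespace, while the training features in Step 6.2 were built with the token pattern `\b\w+\b`. For this test sentence the two agree, but they can diverge once punctuation appears. A minimal sketch of a tokenizer that mirrors the training pattern (the function name `tokenize` and the sentence with an exclamation mark are illustrative assumptions):

```python
import re

# Same token pattern as the CountVectorizer in Step 6.2.
TOKEN_PATTERN = re.compile(r'\b\w+\b')

def tokenize(sentence):
    """Lowercase the sentence and extract word tokens."""
    return TOKEN_PATTERN.findall(sentence.lower())

print('A very close game!'.lower().split())  # ['a', 'very', 'close', 'game!']
print(tokenize('A very close game!'))        # ['a', 'very', 'close', 'game']
```

With plain `split()`, the token 'game!' would not match the 'game' column of the training data.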

#### Step 7.2 Compute conditional probabilities

$ P(a\, very\, close\, game| Not\, sports) = P(a|Not\, sports) \times P(very|Not\, sports) \times P(close|Not\, sports) \times P(game|Not\, sports)$

$ P(a\, very\, close\, game| Sports) = P(a|Sports) \times P(very|Sports) \times P(close|Sports) \times P(game|Sports)$

In [31]:
# compute conditional probability for each word
cond_prob = words_sum[test_words].apply(lambda x: x/category_sum)

# multiply the per-word probabilities to get the conditional probability of the whole sentence
cond_prob['cond_prob'] = cond_prob.apply(lambda x: x.a * x.very * x.close * x.game, axis=1)
cond_prob

Category,a,very,close,game,cond_prob
Not sports,0.111111,0.0,0.111111,0.0,0.0
Sports,0.166667,0.083333,0.0,0.166667,0.0


These results don't look promising at all: the conditional probability of the sentence is 0 for both classes. Why?

To avoid the problem of zero probabilities, we can add a smoothing term (α) to the
multinomial Bayes model. 

Options for additive smoothing:
 - Lidstone smoothing (α < 1). 
 - Laplace smoothing (α = 1).
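With additive smoothing, the estimate for a word w in class c is usually written as follows, where $N_c$ is the total number of tokens in class c and $|V|$ is the vocabulary size:

$ P(w|c) = \frac{count(w, c) + \alpha}{N_c + \alpha \times |V|}$

A minimal sketch of this estimate, using the counts from Step 6.3 (12 tokens in Sports, 14 unique tokens overall). The function name `smoothed_prob` and the Lidstone value α = 0.5 are illustrative assumptions:

```python
def smoothed_prob(word_count, class_total, vocab_size, alpha=1.0):
    """Additive-smoothing estimate of P(word | class)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# 'close' never occurs in the Sports class: 0 of 12 tokens, 14-word vocabulary.
print(smoothed_prob(0, 12, 14))             # Laplace: 1/26, about 0.0385
print(smoothed_prob(0, 12, 14, alpha=0.5))  # Lidstone: 0.5/19, about 0.0263
```

Adding 1 to every word count and then re-summing per category is the α = 1 case of this formula.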

In [32]:
# recompute word frequency for each category
words_sum_smooth = words_sum.apply(lambda x: x+1)
words_sum_smooth

Category,a,but,clean,close,election,forgettable,game,great,it,match,over,the,very,was
Not sports,2,1,1,2,3,1,1,1,2,1,2,2,1,3
Sports,3,2,3,1,1,2,3,3,1,2,1,1,2,1


In [33]:
# recompute number of words in each category
category_sum_smooth = words_sum_smooth.sum(axis=1)
category_sum_smooth

Category
Not sports    23
Sports        26
dtype: int64

In [34]:
# compute conditional probability for each word
cond_prob_smooth = words_sum_smooth[test_words].apply(lambda x: x/category_sum_smooth)

# multiply the per-word probabilities to get the conditional probability of the whole sentence
cond_prob_smooth['cond_prob'] = cond_prob_smooth.apply(lambda x: x.a * x.very * x.close * x.game, axis=1)
cond_prob_smooth

Category,a,very,close,game,cond_prob
Not sports,0.086957,0.043478,0.086957,0.043478,1.4e-05
Sports,0.115385,0.076923,0.038462,0.115385,3.9e-05
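The products above are already very small for a four-word sentence. For longer documents, a common practice (not something this session does) is to sum log-probabilities instead of multiplying probabilities, which avoids numerical underflow and preserves the ranking of the classes. A minimal sketch using the smoothed per-word values from the table above; the variable names are illustrative:

```python
import math

# Smoothed per-word probabilities for 'a', 'very', 'close', 'game'.
word_probs = {
    'Sports':     [3 / 26, 2 / 26, 1 / 26, 3 / 26],
    'Not sports': [2 / 23, 1 / 23, 2 / 23, 1 / 23],
}

# Sum of logs instead of product of probabilities.
log_likelihood = {
    label: sum(math.log(p) for p in probs)
    for label, probs in word_probs.items()
}

for label, value in log_likelihood.items():
    # exp() of the log-likelihood recovers the product from the table.
    print(label, value, math.exp(value))
```

Because the logarithm is monotonic, the class with the larger log-likelihood is also the class with the larger product.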


#### Step 7.3 Compute prior probabilities

P(Sports) and P(Not sports)

In [35]:
prior_sports = count_y_sports/count_y_all
prior_not_sports = count_y_not_sports/count_y_all

print('P(Sports)=', prior_sports)
print('P(Not sports)=', prior_not_sports)

P(Sports)= 0.6
P(Not sports)= 0.4


#### Step 7.4 Compute conditional probabilities * prior probabilities

In [36]:
numerator_sports =  cond_prob_smooth.loc['Sports','cond_prob'] * prior_sports
numerator_sports

2.363362627358986e-05

In [37]:
numerator_not_sports =  cond_prob_smooth.loc['Not sports','cond_prob'] * prior_not_sports
numerator_not_sports

5.7175324559303314e-06

#### Step 7.5 Compute evidence

P(a very close game)

In [38]:
denominator = (numerator_sports + numerator_not_sports)
denominator

2.9351158729520193e-05
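Summing the two numerators gives the evidence because of the law of total probability, applied over the two classes:

$ P(a\, very\, close\, game) = P(a\, very\, close\, game| Sports) \times P(Sports) + P(a\, very\, close\, game| Not\, sports) \times P(Not\, sports)$

Dividing each numerator by this sum is what makes the two posterior probabilities in Step 7.6 add up to 1.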

#### Step 7.6 Compute posterior probabilities and compare

P(Sports|a very close game)

P(Not sports| a very close game)

In [39]:
posterior_prob = [numerator_sports/denominator, numerator_not_sports/denominator]
posterior_prob

[0.8052024961392794, 0.19479750386072053]

Question: So what is the class assigned to "A very close game"?

In [40]:
max(posterior_prob)

0.8052024961392794
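As a cross-check on the hand computation, the same steps can be sketched with scikit-learn's `MultinomialNB`. This estimator is not used elsewhere in this session; the sketch assumes `alpha=1.0` (Laplace smoothing) and class priors estimated from the labels, which matches the choices made above. The training sentences are retyped here so the block is self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_text = [
    'a great great game',
    'the election was over',
    'very clean match',
    'a clean but forgettable game',
    'it was a close election',
]
train_labels = ['Sports', 'Not sports', 'Sports', 'Sports', 'Not sports']

# Same token pattern as Step 6.2, so one-letter words such as 'a' are kept.
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
X = vectorizer.fit_transform(train_text)

# alpha=1.0 is Laplace smoothing; class priors are estimated from the labels.
model = MultinomialNB(alpha=1.0)
model.fit(X, train_labels)

# Transform the test sentence with the vocabulary learned on the training data.
X_test = vectorizer.transform(['A very close game'])
predicted = model.predict(X_test)[0]
posterior = dict(zip(model.classes_, model.predict_proba(X_test)[0]))

print(predicted)
print(posterior)
```

The predicted label should be 'Sports', and the posterior for 'Sports' should agree with the hand-computed value of about 0.805. Unlike `max(posterior_prob)`, which returns only the largest probability, `predict` returns the class label itself.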