In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA


### Discriminative and Generative Models

#### Discriminative Models

Logistic regression, multivariate logistic and softmax regression are discriminative models. They all directly compute the conditional probability P(y|x), i.e. the probability of a class given the features x.

Discriminative models determine the boundary between classes. They classify points without providing a model of how the points are actually generated. They directly model the mapping from the independent variables to the dependent ones.

Classification is done by applying a threshold to the computed probability, which defines the boundary that determines class assignment.

Discriminative models don't make as many assumptions as generative models.
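
As a rough illustration of thresholding a computed probability, the sketch below scores one point with a two-feature logistic model. The weights `w`, bias `b`, and the helper names `sigmoid` and `predict_class` are hypothetical, chosen only for this example.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, w, b, threshold=0.5):
    """Return (P(y=1|x), predicted label) for a linear-logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical weights and bias, for illustration only
w, b = [1.5, -2.0], 0.25

p, label = predict_class([1.0, 0.5], w, b)
print(round(p, 3), label)
```

The model never describes how x is distributed; it only maps x to P(y|x) and compares that to the threshold.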

#### Generative Models

Naive Bayes and Linear Discriminant Analysis (LDA) are generative models. They first model P(x|y), i.e. the distribution of x for each class, and then use Bayes' Theorem to assign the class label.

They compute the joint distribution, P(x,y) = P(x|y)P(y). This models how the data was generated and enables generation of samples for each class.

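Generative models make assumptions about the structure of the data. For example, Naive Bayes assumes the features are conditionally independent given the class, and LDA assumes each class-conditional distribution P(x|y) is Gaussian with a covariance matrix shared across classes.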
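
To make "enables generation of samples" concrete, here is a minimal sketch that fits P(y) and a one-dimensional Gaussian P(x|y) to made-up data and then draws new (x, y) pairs. The data, the Gaussian choice, and the helper `sample` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D training data for two classes (labels 0 and 1)
x = np.array([1.0, 1.2, 0.8, 1.1, 3.0, 3.3, 2.9])
y = np.array([0, 0, 0, 0, 1, 1, 1])

classes, counts = np.unique(y, return_counts=True)
prior = counts / counts.sum()                                # P(y)
mu = np.array([x[y == c].mean() for c in classes])           # per-class mean of x
sigma = np.array([x[y == c].std(ddof=1) for c in classes])   # per-class std of x

def sample(n):
    """Draw n pairs: first y ~ P(y), then x ~ P(x|y), assumed Gaussian."""
    ys = rng.choice(classes, size=n, p=prior)
    xs = rng.normal(mu[ys], sigma[ys])   # works because the labels are 0 and 1
    return xs, ys

xs, ys = sample(5)
print(np.column_stack([xs.round(2), ys]))
```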

## Naive Bayes

A family of classification algorithms based on Bayes' Theorem.

<div style="font-size: 125%;">
$$P(Class|Features) = \frac{P(Class)P(Features|Class)}{P(Features)}$$
</div>

They all assume that the features (i.e. predictors) are independent of each other given the class, the "Naive" assumption.

### Naive Bayes Intuition

In [3]:
golf = pd.read_csv("golf.csv",quotechar="'",escapechar="/")
golf.head(14)

Unnamed: 0,forecast,temp,humidity,wind,Play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes


#### Golf dataset

The data set is a matrix of 14 observations of 4 features: forecast, temp, humidity, and wind. The response is a vector containing one of two classes for each observation: Play = Yes or Play = No.

The goal is to predict whether to play golf given a set of weather conditions. For today's data, D = (forecast, temp, humidity, wind), which is more probable, P(Play = yes|D) or P(Play = no|D)?

We assume the features are independent and are equally important in predicting Play.

#### Class Probabilities

The class probabilities P(C) are the counts of 'Yes' and 'No' divided by the total number of observations.

In [4]:
cnts = golf.Play.value_counts()
print(cnts)
L = np.sum(cnts)

P_yes = cnts['Yes']/L
P_no = cnts['No']/L
P_yes.round(2),P_no.round(2)

Yes    9
No     5
Name: Play, dtype: int64


(0.64, 0.36)

#### Conditional Feature Probabilities P(feature = value | Class = value)

The likelihood of a feature value given the class is the number of times that value occurred together with the class, divided by the total number of instances of the class.


In [5]:
def cond_feat_prob(feat, val, cls='Yes'):
    # P(feat = val | Play = cls): joint count / class count, rounded to 2 decimals
    return (sum((golf[feat] == val) & (golf.Play == cls)) / cnts[cls]).round(2)

In [6]:
# Humidity
P_normal_yes = cond_feat_prob('humidity','Normal')
P_high_yes = cond_feat_prob('humidity','High')
P_normal_no = cond_feat_prob('humidity','Normal',cls='No')
P_high_no = cond_feat_prob('humidity','High',cls='No')
P_normal_yes,P_high_yes,P_normal_no,P_high_no

(0.67, 0.33, 0.2, 0.8)

In [7]:
# Wind
P_weak_yes = cond_feat_prob('wind','Weak')
P_strong_yes = cond_feat_prob('wind','Strong')
P_weak_no = cond_feat_prob('wind','Weak',cls='No')
P_strong_no = cond_feat_prob('wind','Strong',cls='No')
P_weak_yes,P_strong_yes,P_weak_no,P_strong_no

(0.67, 0.33, 0.4, 0.6)

#### Probability Play today = yes  given today's  features:  P(Play=yes|X) ~ P(X|Play=yes) * P(Play=yes)

Probability that you **will** play today is proportional to the Likelihood of the features times the Prior probability that you will play based on your past experiences

In [8]:
today = ('High','Strong') # Today has High Humidity and Strong Wind

In [9]:
P_yes_today = (P_high_yes * P_strong_yes) * P_yes # Independence Assumption
P_yes_today.round(3)

0.07

#### Probability Play today = no  given today's  features: P(Play=no | X) ~ P(X|Play=no) * P(Play = no)

Probability that you **won't** play today is proportional to the Likelihood of the features times the Prior probability that you won't play based on your past experiences

In [10]:
P_no_today = (P_high_no * P_strong_no) * P_no #Independence Assumption
P_no_today.round(3)

0.171

#### Play?

Compare P_yes_today to P_no_today. You don't need to compute P(Features) to compare them, because it is the same for both classes.

In [11]:
print('Play' if P_yes_today > P_no_today else 'Study')

Study


#### Convert to probabilities by normalizing


In [12]:
P_today = P_yes_today + P_no_today 

P_yes_X = P_yes_today / P_today
P_no_X = P_no_today / P_today

P_yes_X,P_no_X

(0.2899618354486554, 0.7100381645513446)
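
The conditional probabilities above were rounded to two decimals before being multiplied, which shifts the final posterior slightly. The sketch below repeats the calculation with exact fractions. The counts are an assumption: they are inferred from the printed outputs (for example 0.33 of 9 'Yes' days is taken to be 3 days), not read from `golf.csv`.

```python
from fractions import Fraction

# Counts inferred from the rounded outputs above (an assumption)
n_yes, n_no = 9, 5
high_yes, strong_yes = 3, 3   # High humidity / Strong wind among the 9 'Yes' days
high_no, strong_no = 4, 3     # High humidity / Strong wind among the 5 'No' days

prior_yes = Fraction(n_yes, n_yes + n_no)
prior_no = Fraction(n_no, n_yes + n_no)

# Unnormalized posteriors: likelihoods (independence assumption) times prior
score_yes = Fraction(high_yes, n_yes) * Fraction(strong_yes, n_yes) * prior_yes
score_no = Fraction(high_no, n_no) * Fraction(strong_no, n_no) * prior_no

# Normalize so the two posteriors sum to 1
post_yes = score_yes / (score_yes + score_no)
post_no = score_no / (score_yes + score_no)

print(post_yes, post_no)                                      # 5/17 12/17
print(round(float(post_yes), 3), round(float(post_no), 3))    # 0.294 0.706
```

The decision is unchanged, but the exact posteriors differ from the rounded ones in the third decimal place.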

### Bayes Classifier
 

A Bayes Classifier assigns to each observation the most likely class given its feature vector. 

<div style="font-size: 125%;">
$$ \arg\max_{y} Pr(Y = y|X = x)\text{, }x \in R^d\text{, }y\in \{1,2,3,..,m\}$$
</div>

Naive Bayes approximates the Bayes Classifier by assuming the features are conditionally independent given the class.

#### Test Error Rate for Classification:

<div style="font-size: 110%;">
$$\frac{1}{n}\sum^n_{i=1}I(y_i \neq \hat{y}_i)$$  
</div>

$I(y_i \neq \hat{y}_i)$ is an indicator variable = 1 if $y_i \neq \hat{y}_i$ and 0 if $y_i = \hat{y}_i$
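
A small sketch of the error-rate formula, using made-up label vectors. The helper name `error_rate` is hypothetical.

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # The comparison t != p plays the role of the indicator I(y_i != y_hat_i)
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: two of the five predictions are wrong
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]
print(error_rate(y_true, y_pred))   # 0.4
```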

It can be proven that the Bayes Classifier produces the lowest possible test error rate, called the Bayes Error Rate.

<div style="font-size: 105%;">
$$ BayesErrorRate = 1 - E(\max_{y}Pr(Y=y|X))$$
</div>

The Expectation (E) averages the probability over all possible values of X.
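
When the true posterior is known, the Bayes Error Rate can be computed directly as one minus the expected largest class posterior. The sketch below does this for a made-up discrete X with three possible values; the distributions `p_x` and `posterior` are assumptions for illustration.

```python
# Hypothetical distribution of a discrete feature X with three values
p_x = [0.5, 0.3, 0.2]

# Hypothetical true posteriors [P(Y=0|X=x), P(Y=1|X=x)] for each value of X
posterior = [
    [0.1, 0.9],
    [0.5, 0.5],
    [0.8, 0.2],
]

# E[max_y P(Y=y|X)]: average the largest posterior over the distribution of X
expected_max = sum(px * max(post) for px, post in zip(p_x, posterior))
bayes_error = 1 - expected_max
print(round(bayes_error, 2))   # 0.24
```

Even the best possible classifier errs here, because for some values of X neither class has posterior probability 1.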

#### How to calculate P(Y=Class|X = features)?

In general you can't compute the posterior conditional distribution because you have on the order of $2^n$ parameters if you have n binary features. Therefore you need to make some assumptions.
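
To see how quickly the parameter count grows, the sketch below compares a full joint model of n binary features per class with the Naive Bayes factorization. The function names are hypothetical, and the class prior is left out of both counts.

```python
def n_params_full(n_features, n_classes):
    """Free parameters of a full joint P(x_1..x_n | y) over binary features."""
    # 2**n outcomes per class, minus 1 because the probabilities sum to 1
    return n_classes * (2 ** n_features - 1)

def n_params_naive(n_features, n_classes):
    """Free parameters when each binary feature gets its own P(x_i = 1 | y)."""
    return n_classes * n_features

for n in (5, 10, 20):
    print(n, n_params_full(n, 2), n_params_naive(n, 2))
```

With 20 binary features and 2 classes, the full model needs 2,097,150 parameters while the factorized model needs 40.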
  

### Naive Bayes

#### Naive Bayes Assumptions:
 
1) All features are of equal importance.   
2) All features are conditionally independent given the class.

The second assumption is often violated, since the features are often somewhat correlated. Even so, it works surprisingly well in many cases. 

 

#### Posterior Probability
 
What is the probability that a particular object belongs to class y given its observed feature values?

Given:

y is the class variable  
$x_i$ represents the ith feature in the feature vector


<div style="font-size: 125%;">

$$P(y|x_1,...,x_d) = \frac{P(y)P(x_1,...,x_d | y)}{P(x_1,...,x_d)}$$

</div>

#### Independence Assumption
<div style="font-size: 115%;">
$$ P(x_i|y,x_1,...,x_{i-1},x_{i+1},...,x_d) = P(x_i|y)$$
</div>

#### Applying Independence Assumption

<div style="font-size: 125%;">
$$P(y|x_1,...,x_d) = \frac{P(y)\prod^d_{i=1}P(x_i|y)}{P(x_1,...,x_d)}$$
</div>

$P(x_1,...,x_d)$ doesn't depend on y (it is a constant), so

<div style="font-size: 125%;">
$$P(y|x_1,...,x_d) \propto{P(y)\prod^d_{i=1}P(x_i|y)}$$
</div>

#### Classification Rule
 
Maximize the posterior probability given the training data to formulate a decision rule for new data.

<div style="font-size: 125%;">
$$\hat{y} = \arg\max_{y} P(y)\prod^d_{i=1}P(x_i|y)$$
</div>
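
In practice the product of many small probabilities can underflow, so the rule is usually evaluated as a sum of logarithms, which has the same argmax. The following is a minimal sketch with made-up priors and likelihood tables; `nb_predict` and the spam/ham numbers are hypothetical, and it assumes every probability is strictly positive.

```python
import math

def nb_predict(x, priors, likelihoods):
    """Return (best class, log scores) for feature values x.

    priors:      {class: P(y)}
    likelihoods: {class: [ {value: P(x_i = value | y)} for each feature i ]}
    """
    scores = {}
    for c, prior in priors.items():
        # log P(y) + sum_i log P(x_i | y)
        s = math.log(prior)
        for i, value in enumerate(x):
            s += math.log(likelihoods[c][i][value])
        scores[c] = s
    return max(scores, key=scores.get), scores

# Hypothetical two-class, two-feature example
priors = {'spam': 0.4, 'ham': 0.6}
likelihoods = {
    'spam': [{'yes': 0.8, 'no': 0.2}, {'yes': 0.1, 'no': 0.9}],
    'ham':  [{'yes': 0.2, 'no': 0.8}, {'yes': 0.5, 'no': 0.5}],
}

label, scores = nb_predict(('yes', 'no'), priors, likelihoods)
print(label, {c: round(math.exp(s), 3) for c, s in scores.items()})
```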

#### Estimate the prior P(y) and the Likelihood $P(x_i|y)$ from the training data

#### Estimate Prior Probability P(y)
 
The prior probability is the probability that a new observation belongs to class y before its features are observed. It is based on expert knowledge or estimated from the training data (assuming the training data is i.i.d.).

The Maximum Likelihood Estimate is the relative frequency of class y in the training data.

<div style="font-size: 115%;">
$$ \hat{P}(Y = y) = \frac{NumY}{NumTotal}$$
</div>
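
The relative-frequency estimate can be sketched in a few lines. The label list below reuses the 9 'Yes' and 5 'No' counts printed earlier; the helper `class_priors` is a hypothetical name.

```python
from collections import Counter

def class_priors(labels):
    """Maximum likelihood estimate of P(Y = y): class count / total count."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

labels = ['Yes'] * 9 + ['No'] * 5
priors = class_priors(labels)
print({c: round(p, 2) for c, p in priors.items()})   # {'Yes': 0.64, 'No': 0.36}
```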

### Estimate the  Likelihoods $P(x_i|y)$

Because of the independence assumption, each feature ($x_i$) can have its own distribution. The feature can be Categorical or Numerical.

#### Likelihood for Categorical data

The likelihood of each categorical feature value is estimated simply as its relative frequency within the class.

The Maximum Likelihood Estimate is:

<div style="font-size: 115%;">
$$\hat{P}(X_i = x|Y=y) = \frac{NumJoint(x,y)}{NumTotal(Y=y)}$$
</div>
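
A sketch of the categorical estimate for a single feature, on made-up data. The helper `categorical_likelihoods` and the values 'a', 'b', 'pos', 'neg' are hypothetical.

```python
from collections import Counter

def categorical_likelihoods(feature, labels):
    """Estimate P(X = x | Y = y) as NumJoint(x, y) / NumTotal(Y = y)."""
    joint = Counter(zip(feature, labels))   # counts of (x, y) pairs
    per_class = Counter(labels)             # counts of each class
    return {(x, y): n / per_class[y] for (x, y), n in joint.items()}

# Hypothetical single categorical feature and its class labels
feature = ['a', 'a', 'b', 'a', 'b', 'b']
labels = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

lik = categorical_likelihoods(feature, labels)
print(lik)
```

Any (value, class) pair that never occurs in training is simply missing from the result, which corresponds to an estimated probability of 0.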

#### Likelihood for Gaussian numerical data

<div style="font-size: 115%;">
$$\hat{P}(x_i|y) =  \frac{1}{\sqrt{2\pi\sigma^2_y}}exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$
</div>
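
The Gaussian likelihood can be evaluated directly from a class's sample mean and variance. The sketch below uses made-up values for one feature within one class and checks the formula against `statistics.NormalDist` from the standard library. The population (maximum likelihood) variance is used here as an assumption.

```python
import math
from statistics import NormalDist, mean, pvariance

def gaussian_likelihood(x, mu, var):
    """Gaussian density of x given the class mean mu and class variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical values of one feature for the observations in one class
values = [2.0, 4.0, 6.0]
mu = mean(values)          # 4.0
var = pvariance(values)    # 8/3, population variance

p = gaussian_likelihood(5.0, mu, var)
print(round(p, 4))

# Cross-check against the standard library's normal distribution
p_check = NormalDist(mu, math.sqrt(var)).pdf(5.0)
```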

### sklearn Naive Bayes

sklearn has Naive Bayes models for the following likelihoods:

* Gaussian: continuous features
* Multinomial: discrete count features (generalization of binomial)
* Bernoulli: binary (boolean) features

#### sklearn Gaussian Naive Bayes

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [13]:
# Naive Bayes

df = pd.read_csv('KNN_data.csv')
df.head()

Unnamed: 0,Age,EstimatedSalary,Purchased
0,19,19000,0
1,35,20000,0
2,26,43000,0
3,27,57000,0
4,19,76000,0


In [14]:
X = df.iloc[:, [0, 1]].values
y = df.iloc[:, 2].values
X[0:5,:],y[0:5]

(array([[   19, 19000],
        [   35, 20000],
        [   26, 43000],
        [   27, 57000],
        [   19, 76000]], dtype=int64),
 array([0, 0, 0, 0, 0], dtype=int64))

In [15]:
# Split the data into the training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
                                                    random_state = 1234,stratify= y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((300, 2), (100, 2), (300,), (100,))

In [16]:
# Scale Age and Salary features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
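
The scaling step learns the mean and standard deviation from the training data only and reuses them on the test data. A minimal NumPy sketch of that idea follows; it uses the five rows shown by `df.head()` with a hypothetical 4/1 train/test split, and is not the split used in this notebook.

```python
import numpy as np

# Rows taken from df.head() above; this 4/1 split is hypothetical
X_tr = np.array([[19., 19000.], [35., 20000.], [26., 43000.], [27., 57000.]])
X_te = np.array([[19., 76000.]])

# Statistics come from the training rows only
mu = X_tr.mean(axis=0)
sd = X_tr.std(axis=0)      # population standard deviation (ddof=0)

X_tr_s = (X_tr - mu) / sd  # analogous to fit_transform on the training set
X_te_s = (X_te - mu) / sd  # analogous to transform on the test set

print(X_tr_s.round(2))
print(X_te_s.round(2))
```

Using the training statistics for the test set avoids leaking information about the test data into the model.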

In [17]:
# Fit Naive Bayes model to the training data

model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()

In [18]:
# Predict the test data
preds = model.predict(X_test)
X_test.shape

(100, 2)

In [19]:
# Make  the Confusion Matrix and compute the accuracy
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, preds)
print(cm)
accuracy = np.trace(cm)/np.sum(cm)
print("Accuracy: ", accuracy)

[[61  3]
 [10 26]]
Accuracy:  0.87
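
Accuracy is only one summary of the confusion matrix. The sketch below reads the four cells of the matrix printed above (rows are true classes, columns are predicted classes) and also computes precision and recall for class 1. The variable names are illustrative.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[61, 3],
      [10, 26]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all predictions that are correct
precision = tp / (tp + fp)      # of the predicted 1s, how many are truly 1
recall = tp / (tp + fn)         # of the true 1s, how many were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))   # 0.87 0.897 0.722
```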


In [20]:
from matplotlib.colors import ListedColormap
cmap = ListedColormap(["white", "black"])
def plot_decision_boundary(model,X_data,y_data,ax):
    # Make grid: min to max of 1st column, min to max of 2nd column in small increments
    X1,X2 = np.meshgrid(np.arange(X_data[:, 0].min() - 1, X_data[:, 0].max() + 1, step = 0.01),
                     np.arange(X_data[:, 1].min() - 1, X_data[:, 1].max() + 1, step = 0.01))
    # Flatten X1 and X2, create an array and transpose to a 2-column array (one for each feature)
    v = np.array([X1.ravel(), X2.ravel()]).T
    #  Using fitted model, predict the points in v and reshape to a 2-dimensional array
    ax.contourf(X1, X2, model.predict(v).reshape(X1.shape),alpha = 0.75,cmap = ListedColormap(('magenta','cyan')))
    ax.set_xlim(X1.min(), X1.max())
    ax.set_ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_data)): # For each class
        ax.scatter(X_data[y_data == j, 0], X_data[y_data == j, 1],
                label = j,
                c = np.array(ListedColormap(('white', 'black'))(i)).reshape(1,-1))


In [None]:
# Plot Results
fig,(ax1,ax2)=plt.subplots(1,2,figsize = (10,6))

# Visualising the Training set results
plot_decision_boundary(model,X_train,y_train,ax1)
ax1.set_title('Training Data')
ax1.set_xlabel('Age')
ax1.set_ylabel('Salary')


# Visualising the Test set results
plot_decision_boundary(model,X_test,y_test,ax2)
ax2.set_title('Test Data')
ax2.set_xlabel('Age')
ax2.set_ylabel('Salary');



### Multinomial Distributed Data

Classification with discrete features, e.g. word counts for text classification. 

Multinomial distribution: extends the binomial to more than 2 categories. Parameter n is the number of trials. Parameter p is a vector of probabilities, one per category, which Naive Bayes estimates from the per-class counts using additive smoothing.

Used in Natural Language Processing.
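
As a sketch of how a count vector is scored under the multinomial model, the example below compares two classes for one short document. The word probabilities, priors, and the helper `multinomial_log_score` are hypothetical; the multinomial coefficient is omitted because it is the same for every class.

```python
import math

def multinomial_log_score(counts, prior, word_probs):
    """log P(y) + sum_i count_i * log p_i, up to a class-independent constant."""
    return math.log(prior) + sum(n * math.log(p) for n, p in zip(counts, word_probs))

# Hypothetical 3-word vocabulary and per-class word probabilities
word_probs = {'A': [0.7, 0.2, 0.1], 'B': [0.2, 0.3, 0.5]}
priors = {'A': 0.5, 'B': 0.5}

doc = [3, 1, 0]   # word counts for one document
scores = {c: multinomial_log_score(doc, priors[c], word_probs[c]) for c in priors}
best = max(scores, key=scores.get)
print(best, {c: round(s, 3) for c, s in scores.items()})
```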

#### Additive Smoothing

In testing, it is possible to encounter a feature value that never occurred with a given class in training. The estimated conditional probability for that feature is then 0, which makes the whole product, and so the posterior for that class, 0.

Additive smoothing keeps the conditional probability from being 0 by adding a smoothing term to every count.
    
<div style="font-size: 115%;">
$$\hat{P}(x_i|y_j) = \frac{N_{x_i,y_j} + \alpha}{N_{y_j} + \alpha d}\text{ }\forall{i} = 1,2,...d$$
</div>

* d is the number of features
* $\alpha$ is a parameter for additive smoothing  
    - Laplace Smoothing: $\alpha$ = 1  
    - Lidstone Smoothing: $\alpha$ < 1
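
The smoothing formula can be checked on a tiny example. The sketch below applies it to made-up per-class feature counts in which one feature was never seen; `smoothed_probs` is a hypothetical helper and exact fractions are used to keep the arithmetic visible.

```python
from fractions import Fraction

def smoothed_probs(counts, alpha=1):
    """Additive smoothing: (N_i + alpha) / (N + alpha * d) for each of d features."""
    alpha = Fraction(alpha)
    d = len(counts)
    total = sum(counts)
    return [(c + alpha) / (total + alpha * d) for c in counts]

# Hypothetical counts of 3 features within one class; the second was never seen
counts = [3, 0, 1]

laplace = smoothed_probs(counts, alpha=1)
lidstone = smoothed_probs(counts, alpha=Fraction(1, 2))
print(laplace)    # 4/7, 1/7, 2/7
print(lidstone)   # 7/11, 1/11, 3/11
```

The unseen feature now gets a small nonzero probability, and the estimates still sum to 1.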

#### sklearn Multinomial Naive Bayes

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

#### Simulate some data: counts of words in documents

Each row is a document (one per topic) and each column is a word.

In [None]:
np.random.seed(1234)
X = np.random.randint(5, size=(6, 100)) # counts drawn from 0..4, 6 = # of observations, 100 = # of features
y = np.array([1, 2, 3, 4, 5, 6])
print(X.shape)
print(np.unique(X,return_counts=True))
X[:,0:10]

In [None]:
#alpha is the Laplace/Lidstone smoothing parameter, default = Laplace
model = MultinomialNB(alpha = 1.0) 
model.fit(X, y)


In [None]:
X[2]

#### Predict class of features X[2] and probabilities of each class 

Truth is topic = 3

In [None]:
print(model.predict(X[2:3]))

model.predict_proba(X[2:3])

In [None]:
model.score(X,y)

### Bernoulli Distributed Data

Data is distributed according to multivariate Bernoulli distributions. There may be multiple features but each one is assumed to be a boolean variable. 

Used in Natural Language Processing for Text Classification.
    
<div style="font-size: 115%;">
$$ \hat{P}(x_i|y) = P(i|y)x_i + (1 - P(i|y))(1 - x_i)$$
</div>
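
A small sketch of the Bernoulli likelihood for a binary feature vector, combined across features with the independence assumption. The probabilities in `p` and the helper `bernoulli_likelihood` are hypothetical.

```python
def bernoulli_likelihood(x, p):
    """Product over features of p_i*x_i + (1 - p_i)*(1 - x_i) for binary x_i."""
    lik = 1.0
    for xi, pi in zip(x, p):
        # Contributes p_i when the feature is present and 1 - p_i when it is absent
        lik *= pi * xi + (1 - pi) * (1 - xi)
    return lik

# Hypothetical P(feature i present | class) for three features
p = [0.8, 0.1, 0.5]

print(round(bernoulli_likelihood([1, 0, 1], p), 3))   # 0.36
print(round(bernoulli_likelihood([0, 0, 0], p), 3))   # 0.09
```

Unlike the multinomial model, an absent feature contributes explicitly through the (1 - p) term.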

#### sklearn Bernoulli Naive Bayes

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html

In [None]:
np.random.seed(1234)
X = np.random.randint(2, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
print(X.shape)
print(np.unique(X,return_counts=True))
X[:,0:10]

In [None]:
#alpha is the Laplace/Lidstone smoothing parameter, default = Laplace
model = BernoulliNB(alpha = 1.0) 
model.fit(X, y)

In [None]:
X2 = X[2].reshape(1,-1);X2

#### Predict class of features X[2] and probabilities of each class

In [None]:
print(model.predict(X2))
model.predict_proba(X2)

In [None]:
model.score(X,y)

### Naive Bayes Pros and Cons

#### Pros

* Easy and fast to predict classes of the test set
* Can perform better than other algorithms when the independence assumption holds
* Performs well with categorical input

#### Cons

* Test set can have features with zero frequency in training set
* Predicted probabilities are often poorly calibrated, so it is not a good estimator of the probabilities
* Independence assumption often violated 

### Why independence assumption?

* Even though it is often a poor assumption, it simplifies the model

#### A simple model with a poor assumption often outperforms a more complex one