# **Lecture 8: Naive Bayes**
## **Applied Machine Learning**

## **Part 1: Text Classification**
We will now do a quick detour to talk about an important application of machine learning: Text classification.
Afterwards, we will see how text classification motivates new classification algorithms.

### **Review: Classification**
Consider a training dataset $\mathcal{D}$.
We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$:

*   **Regression:** The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$
*   **Classification:** The target variable is discrete and takes on one of $K$ possible values: $\mathcal{Y}=\{y_1,y_2,\dots,y_K\}$. Each discrete value corresponds to a class that we want to predict.



### **Text Classification**
An interesting instance of a classification problem is classifying text.

*   Includes a lot of applied problems: spam filtering, fraud detection, medical record classification, etc.
*   Inputs $x$ are sequences of words of an arbitrary length.
*   The dimensionality of the text input is usually very large, proportional to the size of the vocabulary.


### **Classification Dataset: Twenty Newsgroups**
To illustrate the text classification problem, we will use a popular dataset called [20-newsgroups]().

*   It contains ~20,000 documents collected approximately evenly from 20 different online newsgroups.
*   Each newsgroup covers a different topic such as medicine, computer graphics, or religion.
*   This dataset is widely used to benchmark text classification and other types of algorithms.




In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# For this lecture, we will restrict our attention to just 4 different newsgroups
categories = ['alt.atheism','soc.religion.christian','comp.graphics','sci.med']

# Load the dataset
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

# Print some information on it
print(twenty_train.DESCR)

In [None]:
# The set of targets in this dataset are the newsgroup topics
twenty_train.target_names

In [None]:
# Let's examine one data point 
print(twenty_train.data[4])

In [None]:
# We have about 2k data points in total
print(len(twenty_train.data))

### **Feature Representations for Text**
Each data point $x$ in this dataset is a sequence of characters of an arbitrary length.

How do we transform these into $d$-dimensional features $\phi(x)$ that can be used with our machine learning algorithms?

* We may devise hand-crafted features by inspecting the data:
 * Does the message contain the word "church"? Does the email of the user originate outside the United States? Is the organization a university?
* We can count the number of occurrences of each word:
 * Does this message contain "ABC" ...
* Finally, many modern deep learning models can directly work with sequences of characters of an arbitrary length.

### **Bag of Words Representations**
Perhaps the most widely used approach to representing text documents is called "bag of words".

We start by defining a vocabulary $V$ containing all the possible words we are interested in, e.g.:
$$V = \{ \text{church, doctor, purple, snow, kitchen},\dots\}$$

A bag of words representation of a document $x$ is a function $\phi: \mathcal{X} \to \{0,1\}^{|V|}$ that outputs a feature vector:
$$\phi(x) = \left(
\begin{array}{c}
0 \\
1 \\
\vdots \\
0 \\
\vdots \\
\end{array}
\right)
\begin{array}{l}
\; \text{church} \\
\; \text{doctor} \\
\\
\; \text{snow}\\
\\
\end{array}
$$

of dimension $|V|$. The $j$-th component $\phi(x)_j$ equals $1$ if $x$ contains the $j$-th word in $V$ and $0$ otherwise.
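
As a minimal sketch of the definition above, the following builds a binary bag-of-words vector by hand. The toy vocabulary, the crude regular-expression tokenizer, and the helper name `bag_of_words` are assumptions made for illustration; this is not how `sklearn` implements it internally.

```python
import re

# Hypothetical toy vocabulary V; the position in the list is the index j
vocab = ["church", "doctor", "purple", "snow", "kitchen"]

def bag_of_words(doc, vocab):
  """Return the binary feature vector phi(x) for a document string."""
  # Crude tokenizer (an assumption): lowercase, keep alphabetic runs
  tokens = set(re.findall(r"[a-z]+", doc.lower()))
  # phi(x)_j = 1 if the j-th vocabulary word occurs in the document
  return [1 if word in tokens else 0 for word in vocab]

phi = bag_of_words("The doctor walked through the snow.", vocab)
print(phi)  # [0, 1, 0, 1, 0]
```

Note that word order and repetition are discarded: any document containing exactly the words "doctor" and "snow" from $V$ maps to the same vector.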

Let's see an example of this approach on 20-newsgroups.

We start by computing these features using the **sklearn library**.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the training dataset
count_vec = CountVectorizer(binary=True)
X_train = count_vec.fit_transform(twenty_train.data)
X_train.shape

In `sklearn` we can retrieve the index of $\phi(x)$ associated with each `word` using the expression `count_vec.vocabulary_.get(word)`.

In [None]:
# The CountVectorizer class records the index j associated with each word in V
print('The index for the word "church"', count_vec.vocabulary_.get(u'church'))
print('The index for the word "computer"', count_vec.vocabulary_.get(u'computer'))


Our featurized dataset is in the matrix `X_train`. We can use the above indices to retrieve the binary ($0$ or $1$) value that has been computed for each word.

In [None]:
# We can check whether any of these words are present in a data point
print(twenty_train.data[3])

# Let's see if it contains these words
print('Value at the index for the word "church"', X_train[3, count_vec.vocabulary_.get(u"church")])
print('Value at the index for the word "computer"', X_train[3, count_vec.vocabulary_.get(u"computer")])
print('Value at the index for the word "slow"', X_train[3, count_vec.vocabulary_.get(u"slow")])
print('Value at the index for the word "relation"', X_train[3, count_vec.vocabulary_.get(u"relation")])

### **Practical Considerations**
In practice, we may use some additional modifications of this technique:
* Sometimes, the feature $\phi(x)_j$ for the $j$-th word holds the count of occurrences of word $j$ instead of just its binary occurrence.
* The raw text is usually preprocessed. One common technique is *stemming*, in which we only keep the root of the word.
 * e.g., "slowly" and "slowness" both map to "slow"
* Common *stopwords* such as "the", "a", and "and" are usually filtered out. Similarly, rare words are also typically excluded.
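
Two of these modifications, count-valued features and stopword filtering, can be sketched in a few lines. The vocabulary, the stopword set, and the helper name `count_features` below are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical toy vocabulary and stopword list (assumptions for illustration)
vocab = ["church", "doctor", "slow", "snow"]
stopwords = {"the", "a", "and"}

def count_features(doc, vocab, stopwords):
  """Return count-valued bag-of-words features after stopword filtering."""
  tokens = re.findall(r"[a-z]+", doc.lower())
  # Drop stopwords before counting
  counts = Counter(t for t in tokens if t not in stopwords)
  # phi(x)_j holds the number of occurrences of the j-th vocabulary word
  return [counts[word] for word in vocab]

features = count_features(
    "The doctor and the other doctor saw a slow snow plow.", vocab, stopwords)
print(features)  # [0, 2, 1, 1]
```

In `sklearn`, the `CountVectorizer` class exposes comparable options through its `binary`, `stop_words`, and `min_df` arguments.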

### **Classification using the BoW Features**
Let's now have a look at the performance of classification over bag of words features.

Now that we have a feature representation $\phi(x)$, we can apply the classifier of our choice, such as logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression

# Create an instance of Softmax Regression and fit the data
logreg = LogisticRegression(C=1e5, multi_class='multinomial', verbose=True)
logreg.fit(X_train, twenty_train.target)

And now we can use this model to make predictions on new inputs.

In [None]:
docs_new = ['God is simple','take a medicine']

X_new = count_vec.transform(docs_new)
predicted = logreg.predict(X_new)

for doc, category in zip(docs_new, predicted):
  print('%r -> %s' % (doc, twenty_train.target_names[category]))

### **Summary of Text Classification**
* Classifying text normally requires specifying features over the raw data.
* A widely used representation is "bag of words", in which features are occurrences or counts of words.
* Once text is featurized, any off-the-shelf supervised learning algorithm can be applied, but some work better than others, as we will see next.

## **Part 2: Naive Bayes**
Next, we are going to look at Naive Bayes, a generative classification algorithm. We will apply Naive Bayes to the text classification problem.

### **Review: Generative Models**
There are two types of probabilistic models: *generative* and *discriminative*
$$\underbrace{P_{\theta}(x,y): \mathcal{X} \times \mathcal{Y} \to [0,1]}_{\text{generative models}}  \; \;   \underbrace{P_{\theta}(y \mid x): \mathcal{X} \times \mathcal{Y} \to [0,1]}_{\text{discriminative models}}$$

Given a new data point $x$, we can match it against each class model and find the class that looks most similar to it:
 $$\arg\max_y\log p(y\mid x) = \arg\max_y\log \frac{p(x \mid y)p(y)}{p(x)} = \arg\max_y \log p(x \mid y)p(y),$$
 where we have applied Bayes' rule in the first equality and dropped $p(x)$ in the second because it does not depend on $y$.
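
A small numeric check can make the last step concrete. The likelihoods and priors below are made-up values for a single input and three classes; the point is only that normalizing by $p(x)$ shifts every class score by the same constant and so leaves the $\arg\max$ unchanged.

```python
import math

# Hypothetical values for a single input x and K = 3 classes
log_px_given_y = [-12.1, -10.4, -11.0]  # log p(x | y = k), assumed
p_y = [0.5, 0.2, 0.3]                   # class priors p(y = k), assumed

# Unnormalized log-posterior: log p(x | y) + log p(y)
scores = [lpx + math.log(py) for lpx, py in zip(log_px_given_y, p_y)]

# Normalizing by log p(x) subtracts the same constant from every class
log_px = math.log(sum(math.exp(s) for s in scores))
log_posterior = [s - log_px for s in scores]

best_unnormalized = max(range(3), key=lambda k: scores[k])
best_normalized = max(range(3), key=lambda k: log_posterior[k])
print(best_unnormalized, best_normalized)  # 1 1
```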

### **Review: Gaussian Discriminant Analysis**
The GDA algorithm defines the following model family:
* The probability $P(x \mid y = k)$ of the data under class $k$ is a multivariate Gaussian $\mathcal{N}(x; \mu_k, \Sigma_k)$ with parameters $\mu_k, \Sigma_k$.
* The distribution over classes is [Categorical](), denoted $\text{Categorical}(\phi_1,\phi_2,\dots,\phi_K)$. Thus, $P_{\theta}(y=k) = \phi_k$

Thus, the marginal $P_{\theta}(x)$ is a mixture of $K$ Gaussians:
$$P_{\theta}(x) = \sum_{k=1}^KP_{\theta}(y=k)P_{\theta}(x \mid y=k) = \sum_{k=1}^K\phi_k\mathcal{N}(x;\mu_k,\Sigma_k)$$
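
The following sketch evaluates these quantities for made-up parameters. The helper name `gaussian_density` and all parameter values are assumptions for illustration: each class contributes $\phi_k\mathcal{N}(x;\mu_k,\Sigma_k)$, and summing over the classes gives the mixture density at $x$.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
  """Evaluate the multivariate Gaussian density N(x; mu, Sigma) at a point x."""
  d = mu.shape[0]
  diff = x - mu
  # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
  quad = diff @ np.linalg.solve(Sigma, diff)
  norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
  return np.exp(-0.5 * quad) / norm_const

# Hypothetical parameters for K = 2 classes in d = 2 dimensions
phis = np.array([0.4, 0.6])
mus = np.array([[0.0, 0.0], [2.0, 2.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])

x = np.array([0.0, 0.0])
# phi_k * N(x; mu_k, Sigma_k) for each class k
joint = np.array([phis[k] * gaussian_density(x, mus[k], Sigmas[k]) for k in range(2)])
# Summing over classes gives the mixture density at x
print(joint, joint.sum())
```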

### **Problem 1: Discrete data**
What would happen if we used GDA to perform text classification? The first problem we face is that the input data is discrete.
$$\phi(x) = \left(
\begin{array}{c}
0 \\
1 \\
\vdots \\
0 \\
\vdots \\
\end{array}
\right)
\begin{array}{l}
\; \text{church} \\
\; \text{doctor} \\
\\
\; \text{snow}\\
\\
\end{array}
$$
This data does not follow a Normal distribution, hence the GDA model is clearly misspecified.

### **Problem 2: High Dimensionality**
A first solution is to assume that $x$ is sampled from a Categorical distribution that assigns a probability to each possible state of $x$.
$$
p(x) = p \left( 
\begin{array}{c}
0 \\
1 \\
0 \\
\vdots \\
0 
\end{array}
\right.
\left.
\begin{array}{l}
\;\text{church} \\
\;\text{doctor} \\
\;\text{fervently} \\
\vdots \\
\;\text{purple}
\end{array}
\right) = 0.0012
$$

However, if the dimension $d$ of $x$ is high (e.g., the vocabulary has size 10,000), $x$ can take a huge number of values ($2^{10000}$ in our example). We would need to specify $2^d - 1$ parameters for the categorical distribution.

### **Naive Bayes Assumption**
In order to deal with high dimensional $x$, we simplify the problem by making the Naive Bayes Assumption:
$$p(x\mid y) = \prod_{j=1}^dp(x_j \mid y)$$

In other words, the probability $p(x \mid y)$ is factorized over each dimension.
* For example, if $x$ is a binary bag of words representation, then $p(x_j \mid y)$ is the probability of seeing the $j$-th word in a document of class $y$.
* We can model each $p(x_j \mid y )$ via a Bernoulli distribution, which has only one parameter.
* Hence, for each class $y$, it takes only $d$ parameters (instead of $2^d - 1$) to specify the entire distribution $p(x \mid y) =  \prod_{j=1}^dp(x_j \mid y).$
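
To get a feel for the size of this saving, the sketch below compares the two parameter counts for the vocabulary size used in the example above. The number of classes `K = 4` is an assumption that matches the four newsgroups used in this lecture.

```python
d = 10_000  # vocabulary size from the example above
K = 4       # number of classes (an assumption, matching the four newsgroups)

# A full categorical distribution over {0,1}^d needs 2^d - 1 free parameters
# per class; the number is too long to print, so we count its decimal digits
full_params_digits = len(str(2**d - 1))

# Under the Naive Bayes assumption: one Bernoulli parameter per word and class
nb_params = d * K

print(f"full categorical: a number with {full_params_digits} digits, per class")
print(f"naive Bayes: {nb_params} parameters in total")
# full categorical: a number with 3011 digits, per class
# naive Bayes: 40000 parameters in total
```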

### **Bernoulli Naive Bayes Model**
We can apply the Naive Bayes assumption to obtain a model when $x$ is in a bag of words representation.
The *Bernoulli Naive Bayes model* is defined as follows:
* The distribution over classes is [Categorical](), denoted $\text{Categorical}(\phi_1,\phi_2,\dots,\phi_K)$. Thus, $P(y = k) = \phi_k$
* The conditional probability of the data under class $k$ factorizes as $P(x \mid y = k ) = \prod_{j=1}^d P(x_j \mid y = k)$ (the Naive Bayes assumption), where each $P(x_j \mid y = k)$ is a $\text{Bernoulli}(\psi_{jk})$.

Formally, we have 

\begin{align*}
P_{\theta}(y) &= \text{Categorical}(\phi_1,\phi_2,\dots,\phi_K) \\
P(x_j \mid y = k) &= \text{Bernoulli}(\psi_{jk}) \\
P(x \mid y = k ) &= \prod_{j=1}^d P(x_j \mid y = k)
\end{align*}
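
As a minimal sketch of how this model assigns probabilities, the code below evaluates $\log P(x, y = k)$ for made-up parameters with $K = 2$ classes and $d = 3$ words. The parameter values and the helper name `log_joint` are assumptions; the array layout `psis[k][j]` mirrors $\psi_{jk}$ with the class index first.

```python
import math

# Hypothetical parameters for K = 2 classes and d = 3 vocabulary words
phis = [0.5, 0.5]                 # phis[k] = P(y = k)
psis = [[0.8, 0.1, 0.3],          # psis[k][j] = P(x_j = 1 | y = k)
        [0.2, 0.7, 0.3]]

def log_joint(x, k):
  """Return log P(x, y = k) = log P(y = k) + sum_j log P(x_j | y = k)."""
  total = math.log(phis[k])
  for x_j, psi in zip(x, psis[k]):
    # Bernoulli probability: psi if the word is present, 1 - psi otherwise
    total += math.log(psi) if x_j == 1 else math.log(1 - psi)
  return total

x = [1, 0, 1]
print([round(math.exp(log_joint(x, k)), 4) for k in range(2)])
# [0.108, 0.009]
```

Here class $0$ explains the document better, since $0.5 \cdot 0.8 \cdot 0.9 \cdot 0.3 = 0.108$ exceeds $0.5 \cdot 0.2 \cdot 0.3 \cdot 0.3 = 0.009$.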

## **Part 3: Naive Bayes Learning**
We are going to continue our discussion of Naive Bayes.

We will now turn our attention to learning the parameters of the model and using them to make predictions.

### **Review: Maximum Likelihood Learning**
In order to fit the probabilistic models, we use the following objective:
$$\max_{\theta}\mathbb{E}_{x,y \sim \mathbb{P}}\log P_{\theta}(x,y).$$

This seeks to find a model that assigns high probability to the training data.

Let's use maximum likelihood to fit the Bernoulli Naive Bayes model. Note that the model parameters $\theta$ are the union of the parameters of each sub-model:
$$\theta = (\phi_1,\phi_2, \dots, \phi_K, \psi_{11}, \psi_{21}, \dots, \psi_{dK}).$$

### **Learning a Bernoulli Naive Bayes Model**
Given a training set $\mathcal{D}$, we want to optimize the log-likelihood $\ell(\theta) = \log L(\theta):$
\begin{align*}
\ell(\theta) &= \sum_{i=1}^n \log P_{\theta}(x^{(i)},y^{(i)}) = \sum_{i=1}^n\log P_{\theta}(x^{(i)} \mid y^{(i)}) + \sum_{i=1}^n \log P_{\theta}(y^{(i)}) \\
&= \sum_{k=1}^K \sum_{j=1}^d \underbrace{\sum_{i: y^{(i)} = k}\log P(x_j^{(i)} \mid y^{(i)}; \psi_{jk})}_{\text{all the terms that involve} ~ \psi_{jk}} + \underbrace{\sum_{i=1}^n \log P(y^{(i)};\vec{\phi}) }_{\text{all the terms that involve} ~ \vec{\phi}}.
\end{align*}

Notice that each parameter $\psi_{jk}$ appears in only one subset of terms, and the $\phi_k$ all appear together in a single separate set of terms.

As in Gaussian Discriminant Analysis, the log-likelihood decomposes into a sum of terms. To optimize for some $\psi_{jk}$, we only need to look at the set of terms that contain $\psi_{jk}$:
$$\arg\max_{\psi_{jk}}\ell(\theta) = \arg \max_{\psi_{jk}}\sum_{i:y^{(i)}=k}\log p(x_j^{(i)}\mid y^{(i)};\psi_{jk}).$$

Similarly, optimizing for $\vec{\phi} = (\phi_1,\phi_2, \dots,\phi_K)$ only involves a single term:
$$\max_{\vec{\phi}}\sum_{i=1}^n\log P_{\theta}(x^{(i)},y^{(i)};\vec{\phi}) = \max_{\vec{\phi}}\sum_{i=1}^n \log P_{\theta}(y^{(i)};\vec{\phi})$$

### **Optimizing the Model Parameters**
These observations greatly simplify the optimization of the model. Let's first consider the optimization over $\vec{\phi} = (\phi_1,\phi_2, \dots,\phi_K)$.

As in Gaussian Discriminant Analysis, we can take the derivative over $\phi_k$ and set it to zero to obtain:
$$\phi_k = \frac{n_k}{n}$$
for each $k$, where $n_k=|\{i:y^{(i)}=k\}|$ is the number of training targets with class $k$.

Thus, the optimal $\phi_k$ is just the proportion of data points with class $k$ in the training set!

Similarly, we can maximize the likelihood over the other parameters to obtain the closed-form solution:
$$\psi_{jk} = \frac{n_{jk}}{n_k},$$

where $n_{jk} = |\{i: x_j^{(i)} = 1, y^{(i)} = k\}|$ is the number of $x^{(i)}$ with label $k$ and a positive occurrence of the $j$-th word.

Each $\psi_{jk}$ is simply the proportion of documents in class $k$ that contain the $j$-th word.
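
To fill in the step behind this closed form, write $n_k$ for the number of training points with class $k$ and $n_{jk}$ for the number of those that also have $x_j^{(i)} = 1$. Each such point contributes $\log \psi_{jk}$ and each of the remaining $n_k - n_{jk}$ points contributes $\log(1-\psi_{jk})$, so the terms involving $\psi_{jk}$ and their stationarity condition are:

\begin{align*}
\sum_{i: y^{(i)} = k}\log P(x_j^{(i)} \mid y^{(i)}; \psi_{jk}) &= n_{jk}\log\psi_{jk} + (n_k - n_{jk})\log(1-\psi_{jk}) \\
\frac{n_{jk}}{\psi_{jk}} - \frac{n_k - n_{jk}}{1-\psi_{jk}} = 0 \quad &\Longrightarrow \quad \psi_{jk} = \frac{n_{jk}}{n_k}.
\end{align*}

For instance, if $50$ of the $200$ training documents in class $k$ contain the $j$-th word, then $\psi_{jk} = 50/200 = 0.25$.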

### **Querying the Model**
How do we ask the model for predictions? As discussed earlier, we can apply Bayes' rule:
$$\arg\max_y P_{\theta}(y\mid x) = \arg\max_y P_{\theta}(x\mid y)P(y).$$

Thus, we can estimate the probability of $x$ under each $P_{\theta}(x\mid y)P(y)$ and choose the class that explains the data best.

### **Classification Dataset: 20-Newsgroups**

Let's load the dataset

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# for this lecture, we will restrict our attention to just 4 different newsgroups:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

# load the dataset
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

# print some information on it
print(twenty_train.DESCR)

### **Example: Text Classification**

Let's see how this approach can be used in practice on the text classification dataset.
* We will learn a good set of parameters for a Bernoulli Naive Bayes Model.
* We will compare the predictions to the true labels.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the training set
count_vec = CountVectorizer(binary=True,max_features=1000)
y_train = twenty_train.target
X_train = count_vec.fit_transform(twenty_train.data).toarray()
X_train.shape

Let's compute the maximum likelihood model parameters on our dataset

In [None]:
n = X_train.shape[0] # size of the dataset
d = X_train.shape[1] # number of features in our dataset
K = 4                # number of classes

# These are shapes of the parameters
psis = np.zeros([K,d])
phis = np.zeros([K])
# We now compute the parameters
for k in range(K):
  X_k = X_train[y_train==k]
  psis[k] = np.mean(X_k, axis=0)
  phis[k] = X_k.shape[0]/float(n)

# print out the class proportions and the per-class word probabilities
print(phis)
print(psis)

We can compute predictions using Bayes' rule

In [None]:
# We can implement this in numpy 
def nb_predictions(x, psis, phis):
  """This returns class assignments and scores under the NB model.
  We compute argmax_y p(y | x) as argmax_y p(x | y) p(y).
  """
  # Adjust shapes so that the arrays broadcast to (K, n, d)
  n, d = x.shape
  K = psis.shape[0]  # number of classes, read from the parameters
  x = np.reshape(x, (1,n,d))
  psis = np.reshape(psis, (K,1,d))

  # Clip probability to avoid log(0)
  psis = psis.clip(1e-14, 1-1e-14)

  # Compute the log-probabilities
  logpy = np.log(phis).reshape([K,1])
  logpxy = x*np.log(psis) + (1-x)*np.log(1-psis)
  logpyx = logpxy.sum(axis=2) + logpy

  return logpyx.argmax(axis=0).flatten(), logpyx.reshape([K,n])

idx, logpyx = nb_predictions(X_train, psis, phis)
print(idx[:10])
print(logpyx)

We can measure the accuracy on the training set

In [None]:
(idx==y_train).mean()

In [None]:
docs_new = ['Open GL is fast', 'God is love','I don\'t wanna go to hospital']

X_new = count_vec.transform(docs_new).toarray()
predicted, logpyx_new = nb_predictions(X_new, psis,phis)

for doc, category in zip(docs_new, predicted):
  print('%r -> %s'% (doc, twenty_train.target_names[category]))

### **Algorithm: Bernoulli Naive Bayes**
* **Type:** Supervised learning (multi-class classification)
* **Model Family:** Mixtures of Bernoulli distributions
* **Objective function:** Log-Likelihood
* **Optimizer:** closed form solution

## **Part 4: Discriminative vs. Generative Algorithms**

We conclude our lecture on generative algorithms by visiting the question of how they compare to discriminative algorithms.

### **Classification Dataset: Iris Flowers**

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris(as_frame=True)

# print part of the dataset
iris_X, iris_y = iris.data, iris.target
pd.concat([iris_X, iris_y], axis=1).head()

If we only consider the first two feature columns, we can visualize the dataset in 2D.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]

# create 2d version of dataset
X = iris_X.to_numpy()[:,:2]
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

# Plot also the training points
p1 = plt.scatter(X[:, 0], X[:, 1], c=iris_y, edgecolor='k', s=60, cmap=plt.cm.Paired)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.legend(handles=p1.legend_elements()[0], labels=['Setosa', 'Versicolour', 'Virginica'], loc='lower right')

### **Linear Discriminant Analysis**

When the covariances $\Sigma_k$ in GDA are all equal, we have an algorithm called Linear Discriminant Analysis or LDA.

Let's try this algorithm on the Iris dataset.

We may compute the parameters of this model similarly to how we did for GDA.

In [None]:
d = 2
K = 3
n = X.shape[0]

# These are the shapes of the parameters
mus = np.zeros([K,d])
Sigmas = np.zeros([K,d,d])
phis = np.zeros([K])

# We can now compute the parameters
for k in range(K):
  X_k = X[iris_y == k]
  mus[k] = np.mean(X_k, axis=0)
  Sigmas[k] = np.cov(X.T) # shared covariance: this is now X.T instead of X_k.T
  phis[k] = X_k.shape[0] / float(n)

# print out the means
print(mus)

We can compute predictions using Bayes' rule

In [None]:
def gda_predictions(x, mus, sigmas, phis):
  """This returns class assignments and joint probabilities under the GDA model.
  We compute argmax_y p(y | x) as argmax_y p(x | y) p(y).
  """
  # Adjust shapes so that the arrays broadcast to (K, n, ...)
  n, d = x.shape
  K = mus.shape[0]
  x = np.reshape(x, (1,n,d,1))
  mus = np.reshape(mus, (K,1,d,1))
  sigmas = np.reshape(sigmas, (K,1,d,d))  # use the argument, not the global Sigmas

  # Compute the prior p(y) and the Gaussian density p(x | y)
  py = np.tile(phis.reshape((K,1)), (1,n)).reshape([K,n,1,1])
  pxy = (
      1. / np.sqrt((2*np.pi)**d * np.linalg.det(sigmas)).reshape([K,1,1,1])
      * np.exp(
          -0.5*np.matmul(np.matmul((x-mus).transpose([0,1,3,2]), np.linalg.inv(sigmas)), x-mus)
      )
  )
  pyx = pxy*py

  return pyx.argmax(axis=0).flatten(), pyx.reshape([K,n])

idx, pyx = gda_predictions(X, mus, Sigmas, phis)
print(idx)

We visualize the predictions as we did earlier.

In [None]:
from matplotlib.colors import LogNorm
xx, yy = np.meshgrid(np.arange(x_min,x_max,0.02), np.arange(y_min, y_max, 0.02))
Z, pyx = gda_predictions(np.c_[xx.ravel(), yy.ravel()], mus, Sigmas, phis)
logpy = np.log(pyx)  # log p(x, y=k) at every grid point

# Put the result into a color plot
Z = Z.reshape(xx.shape)
contours = np.zeros([K, xx.shape[0], xx.shape[1]])
for k in range(K):
  contours[k] = logpy[k].reshape(xx.shape)
plt.pcolormesh(xx,yy,Z,cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:,0], X[:,1], c=iris_y, s=40, cmap=plt.cm.Paired, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.show()



Linear Discriminant Analysis outputs decision boundaries that are linear.
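
To see why, compare the scores of two classes $k$ and $l$ when both share a covariance $\Sigma$. The Gaussian normalizing constants are identical and cancel, and so do the quadratic terms $x^\top\Sigma^{-1}x$:

\begin{align*}
\log\frac{P_{\theta}(x \mid y=k)P_{\theta}(y=k)}{P_{\theta}(x \mid y=l)P_{\theta}(y=l)}
&= -\tfrac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k) + \tfrac{1}{2}(x-\mu_l)^\top\Sigma^{-1}(x-\mu_l) + \log\frac{\phi_k}{\phi_l} \\
&= (\mu_k-\mu_l)^\top\Sigma^{-1}x - \tfrac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \tfrac{1}{2}\mu_l^\top\Sigma^{-1}\mu_l + \log\frac{\phi_k}{\phi_l}.
\end{align*}

The boundary between the two classes is the set of $x$ where this expression equals zero, which is a linear equation in $x$.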

Softmax or logistic regression also produces linear boundaries. In fact, both types of algorithms make use of the same model class.

What is the difference between them, then?

### **Generative vs. Discriminative Model Classes**
In binary classification, we can also show that the conditional probability $P_{\theta}(y \mid x)$ of a Bernoulli Naive Bayes or LDA model has the form:
$$P_{\theta}(y \mid x) = \frac{P_{\theta}(x\mid y)P_{\theta}(y)}{\sum_{y' \in \mathcal{Y}}P_{\theta}(x\mid y')P_{\theta}(y')} = \frac{1}{1 + \exp(-\gamma^\top x)}$$

for some set of parameters $\gamma$ (whose expression can be derived from $\theta$), which is the same form as Logistic Regression!
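
This claim can be checked numerically for Bernoulli Naive Bayes. The sketch below uses made-up parameters and derives a weight vector `w` and a bias `b` from them; the bias corresponds to a component of $\gamma$ if we assume $x$ is augmented with a constant feature equal to $1$. All names and values are assumptions for illustration.

```python
import math

# Hypothetical Bernoulli Naive Bayes parameters for classes y in {0, 1}
phis = [0.4, 0.6]            # phis[k] = P(y = k)
psis = [[0.8, 0.1, 0.3],     # psis[k][j] = P(x_j = 1 | y = k)
        [0.2, 0.7, 0.5]]

def posterior_bayes(x):
  """P(y = 1 | x) computed directly with Bayes' rule."""
  joint = []
  for k in range(2):
    p = phis[k]
    for x_j, psi in zip(x, psis[k]):
      p *= psi if x_j == 1 else 1 - psi
    joint.append(p)
  return joint[1] / (joint[0] + joint[1])

# Logistic parameters derived from the Naive Bayes parameters
w = [math.log(p1 / p0) - math.log((1 - p1) / (1 - p0))
     for p0, p1 in zip(psis[0], psis[1])]
b = math.log(phis[1] / phis[0]) + sum(
    math.log((1 - p1) / (1 - p0)) for p0, p1 in zip(psis[0], psis[1]))

def posterior_logistic(x):
  """P(y = 1 | x) as a sigmoid of a linear function of x."""
  z = b + sum(w_j * x_j for w_j, x_j in zip(w, x))
  return 1 / (1 + math.exp(-z))

# The two columns agree up to floating-point error
for x in ([1, 0, 1], [0, 1, 0], [1, 1, 1]):
  print(x, round(posterior_bayes(x), 6), round(posterior_logistic(x), 6))
```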

Does it mean that the two sets of algorithms are equivalent? No! Although they assume the same model class $\mathcal{M}$, they use a different objective $J$ to select a model in $\mathcal{M}$.

### **Generative Model vs. Logistic Regression**
Given that both algorithms find linear boundaries, how should one choose between the two?

* Bernoulli Naive Bayes or LDA implies a logistic form for $p(y\mid x)$. But the converse is not true: logistic regression does not assume a NB or LDA model for $p(x,y)$.
* Generative models make stronger modeling assumptions. If these assumptions hold true, the generative models will perform better.
* But if they don't, logistic regression will be more robust to outliers and model misspecification, and achieve higher accuracy.

### **Other features of Generative Models**
Generative models can also do things that discriminative models can't do:
* Generation: we can sample $x \sim p(x\mid y)$ to generate new data (images, audio, etc.)
* Missing value imputation: if $x_j$ is missing, we can infer it using $p(x\mid y)$
* Outlier detection: given a new $x'$, we can try detecting via $p(x')$ whether $x'$ is invalid
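
The generation use case can be sketched for a Bernoulli Naive Bayes model. The vocabulary, the parameter values, and the helper name `sample_document` are hypothetical; under the Naive Bayes assumption, each word is included independently with its class-specific probability.

```python
import random

# Hypothetical parameters: psis[k][j] = P(x_j = 1 | y = k)
vocab = ["church", "doctor", "pixel"]
psis = [[0.9, 0.1, 0.05],
        [0.05, 0.2, 0.9]]

def sample_document(k, rng):
  """Sample a binary bag-of-words vector x ~ p(x | y = k)."""
  # Each word is included independently with probability psis[k][j]
  return [1 if rng.random() < psi else 0 for psi in psis[k]]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
for k in range(2):
  x = sample_document(k, rng)
  print(k, [word for word, x_j in zip(vocab, x) if x_j == 1])
```

Because bag-of-words features discard word order, a sample is a set of words rather than readable text.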

### **Discriminative Approaches**

Discriminative algorithms are deservedly very popular.
* Most state-of-the-art algorithms for classification are discriminative.
* They are often more accurate because they make fewer modeling assumptions.

### **Generative Approaches**
But generative models have many advantages:
* Can do more than just prediction: generation, filling in missing features, etc.
* Can include extra prior knowledge: if the prior knowledge is correct, the model will be more accurate.
* Often have closed-form solutions, hence are faster to train.