# Introduction to Machine Learning
## Practical Session 05

Sharon Ong, Department of Cognitive Science and Artificial Intelligence – Tilburg University 
# Naive Bayes and Decision Trees 

# 1. Naive Bayes Classifiers 
Naive Bayes classifiers are a family of classifiers that learn parameters by looking at each feature individually and collecting simple per-class statistics from each feature.

There are three kinds of naive Bayes classifiers implemented in scikit-learn:
* Gaussian Naive Bayes (_GaussianNB_)
* Multinomial Naive Bayes (_MultinomialNB_)  
* Bernoulli Naive Bayes (_BernoulliNB_) 

Gaussian Naive Bayes classifiers assume that features follow a normal distribution. Multinomial Naive Bayes classifiers assume count data (that is, each feature represents an integer count of something, such as how often a word appears in a sentence, or integer pixel intensities in a digit image). Bernoulli Naive Bayes models are useful if your feature vectors are binary (i.e. zeros and ones) or contain continuous values that can be split (binarized) with a predefined threshold. One application is text classification with a ‘bag of words’ model, where the 1s and 0s mean “word occurs in the document” and “word does not occur in the document” respectively. 

## 1.1  Gaussian Naive Bayes Classifier
The following code implements a Gaussian Naive Bayes classifier to predict someone's gender from their height, weight and foot size.  

In [1]:
import numpy as np
from sklearn.naive_bayes import GaussianNB
import pandas as pd 

gender = ['male','male','male','male','female','female','female','female']
height = [180,177.6,167.4,177.6,150,165,162.6,172.5]
weight = [90,85,83,82,50,75,65,75]
foot_size = [12,11,12,10,6,8,7,9] 

# the following code converts the list "gender" to numerical data (0 for male and 1 for female) 
# to create the target variable
b = pd.get_dummies(gender)

print(b['female'].values)
print(b['male'].values)
y = b['female'].values

# Create the feature vector 
X = np.concatenate((height, weight,foot_size), axis=0)
# Feature vector is  
print(X)
X = X.reshape(len(height),3,order='F')
print(X)

# train a Naive Bayes classifier. 
clf = GaussianNB()
clf.fit(X, y)


[False False False False  True  True  True  True]
[ True  True  True  True False False False False]
[180.  177.6 167.4 177.6 150.  165.  162.6 172.5  90.   85.   83.   82.
  50.   75.   65.   75.   12.   11.   12.   10.    6.    8.    7.    9. ]
[[180.   90.   12. ]
 [177.6  85.   11. ]
 [167.4  83.   12. ]
 [177.6  82.   10. ]
 [150.   50.    6. ]
 [165.   75.    8. ]
 [162.6  65.    7. ]
 [172.5  75.    9. ]]


A new person has the following feature values: [Height = 180, Weight = 80, Foot Size = 10]. The class (gender) and the probability of each class can be predicted with the classifier's predict and predict_proba methods.  
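
The prediction for this new person can be sketched as follows. This is a minimal, self-contained sketch rather than the original cell: it rebuilds the training data from the cell above, encodes the target by hand, and the name `new_person` is introduced here for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Training data copied from the cell above
height = [180, 177.6, 167.4, 177.6, 150, 165, 162.6, 172.5]
weight = [90, 85, 83, 82, 50, 75, 65, 75]
foot_size = [12, 11, 12, 10, 6, 8, 7, 9]
X = np.column_stack((height, weight, foot_size))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = male, 1 = female

clf = GaussianNB()
clf.fit(X, y)

# new_person is a hypothetical name for the new observation
new_person = np.array([[180, 80, 10]])
predicted_class = clf.predict(new_person)            # class label
class_probabilities = clf.predict_proba(new_person)  # one column per class
print(predicted_class)
print(class_probabilities)
```

The columns of the probability array follow the order of `clf.classes_`, so here the first column belongs to class 0 (male) and the second to class 1 (female).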

**Exercise 1**

Try varying the height, weight and foot size values.  

In [None]:
#
# Your code goes here
#

**Exercise 2**

Apply a Gaussian Naive Bayes classifier on the two moons dataset, then plot and display the results.  

In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt 

# Apply this on the two moons dataset 
# and then see how it works 
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

#
# Your code goes here 
#

## 1.2 Bernoulli Naive Bayes Classifier

If X is a Bernoulli-distributed random variable, it can take only two values (for simplicity, let’s call them 0 and 1). 

The following code generates a dummy dataset. Bernoulli naive Bayes expects binary feature vectors. The class BernoulliNB has a binarize parameter that specifies a threshold used internally to transform the features into binary values. 

By setting a threshold of 0.0, each point can be characterized by the quadrant in which it is located. For each feature, all values less than or equal to 0.0 are set to 0 and all values greater than 0.0 are set to 1.  
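
The thresholding step can be illustrated by hand. The sketch below uses plain NumPy rather than BernoulliNB itself, and the array `points` is a hypothetical set of four points, one per quadrant.

```python
import numpy as np

# Hypothetical sample points, one per quadrant
points = np.array([[0.5, 1.2], [-0.3, 0.8], [-1.0, -2.0], [2.0, -0.1]])
threshold = 0.0

# Values greater than the threshold become 1, all others become 0
binarized = (points > threshold).astype(int)
print(binarized)
```

After binarization, each point is described only by the signs of its two features, which is why the threshold of 0.0 reduces every point to its quadrant.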

In [None]:
import numpy as np
from sklearn.datasets import make_classification
import mglearn
import matplotlib.pyplot as plt 
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

nb_samples = 300
[X, y] = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)
plt.figure()
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# train a Bernoulli Naive Bayes
bnb = BernoulliNB(binarize=0.0)

bnb.fit(X_train,y_train)
print(bnb.score(X_test, y_test))
# you can check the Bernoulli NB solution as well 
data = np.array([[0, -1], [1, 0], [-1, -1], [1, 1]])
print(bnb.predict(data))
print(bnb.predict_proba(data))

**Exercise 3**

Compare the Bernoulli naive Bayes with the Gaussian naive Bayes classifier on this dataset. 

In [None]:
#
# Your code goes here 
#

## 1.3 Multinomial Naive Bayes Classifier

Let's compare the performance of a multinomial naive Bayes and a Gaussian naive Bayes classifier on the digits dataset. This dataset contains images of hand-written digits with 10 classes, where each class refers to a digit. Each sample is an 8×8 image whose pixels are encoded as integers in the range 0 to 16. The Multinomial classifier has been implemented for you.  

**Exercise 4**

Compare its performance with a Gaussian naive Bayes classifier. 

In [None]:
from sklearn.datasets import load_digits
from sklearn.naive_bayes import MultinomialNB

digits = load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# let's extract out the 3rd row to the variable 'img' 
img = digits.data[2]
plt.figure()
plt.imshow(img.reshape(8,8)) 

mnb = MultinomialNB()
mnb.fit(X_train,y_train)
print(mnb.score(X_test, y_test))

#
# Your code goes here 
#

# 2. Decision Tree classifier 

Learning a decision tree involves learning the sequence of if/else questions that gets us to the "true" answer with the least number of questions. In the machine learning setting, these questions are called tests (not to be confused with the test set, which is the data we use to test to see how generalizable our model is). 

Usually data does not come in the form of binary yes/no features as in the animal example, but is instead represented as continuous features such as in the 2D dataset we will explore. The tests that are used on continuous data are of the form “Is feature i larger than value a?” To build a tree, the algorithm searches over all possible tests and finds the one that is most informative about the target variable.
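
The search over all possible tests can be sketched with a brute-force loop. This is an illustration under stated assumptions, not scikit-learn's implementation: it assumes Gini impurity as the measure of how informative a test is, and the helper names `gini` and `best_test` as well as the toy data are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def best_test(X, y):
    """Return (feature index, threshold, weighted impurity) of the best test."""
    best = (None, None, np.inf)
    n_samples, n_features = X.shape
    for i in range(n_features):
        values = np.unique(X[:, i])
        # Candidate thresholds: midpoints between consecutive sorted values
        for a in (values[:-1] + values[1:]) / 2:
            right = X[:, i] > a  # the test "is feature i larger than value a?"
            impurity = (right.sum() * gini(y[right])
                        + (~right).sum() * gini(y[~right])) / n_samples
            if impurity < best[2]:
                best = (i, a, impurity)
    return best

# Hypothetical toy data: feature 1 separates the classes, feature 0 does not
X_toy = np.array([[1.0, 0.2], [2.0, 0.4], [3.0, 1.6],
                  [4.0, 1.8], [1.5, 1.7], [3.5, 0.3]])
y_toy = np.array([0, 0, 1, 1, 1, 0])
feature, threshold, impurity = best_test(X_toy, y_toy)
print(feature, threshold, impurity)
```

A full tree learner would apply the same search recursively to the two subsets produced by the chosen test.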

## 2.1 Decision Tree classifier on a 2D dataset.
The following code implements a Decision Tree Classifier on a 2D dataset. The function get_depth() returns the depth of the tree. 

In [None]:
from sklearn.datasets import make_moons
import mglearn
import matplotlib.pyplot as plt 
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
plt.figure()
mglearn.plots.plot_tree_partition(X_train, y_train, tree) 
print(tree.get_depth())

print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

You can control the number of tests (splits) performed by setting the max_depth hyperparameter. 

**Exercise 5**

Set max_depth to 1, 2 and 4. 

In [None]:
#
# Your code goes here 
#
tree = DecisionTreeClassifier(random_state=0, max_depth =1)
tree.fit(X_train, y_train)
plt.figure()
mglearn.plots.plot_tree_partition(X_train, y_train, tree) 
print(tree.get_depth())

print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

#
tree = DecisionTreeClassifier(random_state=0, max_depth =2)
tree.fit(X_train, y_train)
plt.figure()
mglearn.plots.plot_tree_partition(X_train, y_train, tree) 
print(tree.get_depth())

print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

#
tree = DecisionTreeClassifier(random_state=0, max_depth =4)
tree.fit(X_train, y_train)
plt.figure()
mglearn.plots.plot_tree_partition(X_train, y_train, tree) 
print(tree.get_depth())

print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

You can also set the criterion for splitting. 

**Exercise 6**

Change the criterion to "entropy" (criterion = "entropy") 

In [None]:
#
# Your code goes here 
#

## 2.2 Decision Tree classifier on a high dimensional dataset 
Let’s look at the effect of max_depth in more detail on the Breast Cancer dataset. As always, we import the dataset and split it into a training and a test part. Then we build a model using the default setting of fully developing the tree (growing the tree until all leaves are pure).

In [None]:
from sklearn.datasets import load_breast_cancer 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

#display the tree depth 
print(tree.get_depth())

**Exercise 7**

Set max_depth to 3, 4, 5 or 6. Which value of max_depth gives the best test accuracy?

In [None]:
#
# Your code goes here 
#


# 3. Decision Tree Regression 

Decision trees can also be applied to regression with a DecisionTreeRegressor. The code below implements a Decision Tree Regressor. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.tree import DecisionTreeRegressor
# #############################################################################
# Generate sample data
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()

# #############################################################################
# Add noise to targets
y[::5] += 3 * (0.5 - np.random.rand(8))

regr = DecisionTreeRegressor()
regr.fit(X, y)

# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_pred = regr.predict(X_test)
print(regr.get_depth())
plt.figure()
plt.scatter(X,y)
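# the tree is grown without a depth limit, so label the curve with its actual depth
plt.plot(X_test, y_pred, 'r', label='max_depth = {}'.format(regr.get_depth()))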
plt.legend()



**Exercise 8**

Set the max_depth parameter to 1, 2, 3 and 4. Plot the results of your regression.

In [None]:
#
# Your code goes here 
#

One particular property of tree-based models for regression is that they are not able to extrapolate, that is, to make predictions outside the range of the training data. We will use the dataset in "ram_price.csv", which contains historical computer memory (RAM) prices. The training data is the historical data before the year 2000. The test data is the RAM prices from the year 2000 onwards.   

In [None]:
import pandas as pd
ram_prices = pd.read_csv("ram_price.csv")

data_train = ram_prices[ram_prices.date < 2000]
data_test = ram_prices[ram_prices.date >= 2000]
# predict prices based on date
# convert to a NumPy array first; recent pandas versions do not allow [:, np.newaxis] on a Series
X_train = data_train.date.to_numpy()[:, np.newaxis]
# we use a log-transform to get a simpler relationship of data to target
y_train = np.log(data_train.price)

X_test = data_test.date.to_numpy()[:, np.newaxis]
# apply the same log-transform to the test targets
y_test = np.log(data_test.price)

plt.semilogy(X_train,np.exp(y_train),label='train')
plt.semilogy(X_test,np.exp(y_test),label='test')
plt.xlabel("Year")
plt.ylabel("Price in $/Mbyte")
plt.legend()
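
The inability of trees to extrapolate can be seen on a small synthetic example. The sketch below does not use the RAM price data: the linear target y = 2x and the names `X_demo`, `y_demo`, `demo_tree` and `X_outside` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data on the interval [0, 5] with target y = 2x
X_demo = np.linspace(0, 5, 50)[:, np.newaxis]
y_demo = 2 * X_demo.ravel()

demo_tree = DecisionTreeRegressor(random_state=0)
demo_tree.fit(X_demo, y_demo)

# Points beyond the training range all fall into the same outermost leaf
X_outside = np.array([[6.0], [10.0], [100.0]])
outside_predictions = demo_tree.predict(X_outside)
print(outside_predictions)
```

Although the true relationship keeps growing, every point to the right of the training range receives the same prediction, because a tree can only return values stored in its leaves.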


**Exercise 9**

Train a Linear Regression and a Decision Tree Regression. Make the predictions over the test data and visualize your solution. 

In [None]:
#
# Your code goes here 
#

# 4. Naive Bayes Classifier without sklearn libraries 
Let's create a Naive Bayes Classifier without sklearn libraries to predict someone's gender from their height, weight and foot size. The following code creates a pandas dataframe with the data we previously used in this worksheet.  

In [None]:
# Create an empty dataframe
data = pd.DataFrame()

# Create the target variable (y)
data['Gender'] = ['male','male','male','male','female','female','female','female']

# Create the feature variables (X)
data['Height'] = [180,177.6,167.4,177.6,150,165,162.6,172.5]
data['Weight'] = [90,85,83,82,50,75,65,75]
data['Foot_Size'] = [12,11,12,10,6,8,7,9]

In a Bayes classifier, we are interested in finding out the class (e.g. male or female) of an observation given the data:

\begin{equation}    
p(\text{class} \mid \mathbf {\text{data}} )={\frac {p(\mathbf {\text{data}} \mid \text{class}) * p(\text{class})}{p(\mathbf {\text{data}} )}}
\end{equation}
where
- $\text{class}$ is a particular class (e.g. male)
- $\mathbf {\text{data}}$ is an observation’s data
- $p(\text{class} \mid \mathbf{\text{data}})$ is called the posterior
- $p(\mathbf{\text{data}} \mid \text{class})$ is called the likelihood
- $p(\text{class})$ is called the prior
- $p(\text{data})$ is called the marginal probability

In a Bayes classifier, we calculate the posterior for every class for each observation. Then, classify the observation based on the class with the largest posterior value. In our example, we have one observation to predict and two possible classes (e.g. male and female), therefore we will calculate two posteriors: one for male and one for female.
\begin{equation}
\begin{split}
p(\text{person is male} \mid \mathbf {\text{person’s data}} ) &={\frac {p(\mathbf {\text{person’s data}} \mid \text{person is male}) * p(\text{person is male})}{p(\mathbf {\text{person’s data}} )}}  \\
&= {\frac {p({\text{male}})\,p({\text{height}}\mid{\text{male}})\,p({\text{weight}}\mid{\text{male}})\,p({\text{foot size}}\mid{\text{male}})}{\text{marginal probability}}}
\end{split}
\end{equation}
\begin{equation}
\begin{split}
p(\text{person is female} \mid \mathbf {\text{person’s data}} ) &={\frac {p(\mathbf {\text{person’s data}} \mid \text{person is female}) * p(\text{person is female})}{p(\mathbf {\text{person’s data}} )}} \\
&={\frac {p({\text{female}})\,p({\text{height}}\mid{\text{female}})\,p({\text{weight}}\mid{\text{female}})\,p({\text{foot size}}\mid{\text{female}})}{\text{marginal probability}}}
\end{split}
\end{equation}
For the first equation:
- $p(\text{male})$ is the prior, which is the probability that an observation is male.
- $p({\text{height}}\mid{\text{male}})\,p({\text{weight}}\mid{\text{male}})\,p({\text{foot size}}\mid{\text{male}})$ is the likelihood. 
- The marginal probability is hard to compute, but it can be ignored for classification because its value is the same for all classes. 
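
The last point can be checked with a small numeric sketch. The scores below are hypothetical values of prior times likelihood, not results computed from the dataset. With a finite set of classes, normalizing the scores so that they sum to one has the same effect as dividing by the marginal probability.

```python
# Hypothetical unnormalized scores: prior * likelihood for each class
unnormalized = {"male": 0.5 * 1.2e-4, "female": 0.5 * 3.0e-6}

# Dividing every score by the same constant turns them into posteriors
marginal = sum(unnormalized.values())
posteriors = {c: s / marginal for c, s in unnormalized.items()}

# The winning class is the same before and after normalization
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posteriors, key=posteriors.get)
print(posteriors)
print(best_unnormalized, best_posterior)
```

Because every class is divided by the same positive number, the ordering of the classes does not change, so the numerators alone are enough for classification.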


View the data in pandas.  The dataset above is used to construct our classifier. 

In [None]:
data 


## 4.1 Calculate Priors

Priors are either constants or probability distributions. In this example, the priors are the probability of the target variable, which is the gender.  

The following code counts the number of males, females and the total number of rows.

**Exercise 10**

(1) Next, find the priors by dividing the number of males or females by the total number of rows.

In [None]:
# Number of males

n_male = data['Gender'][data['Gender'] == 'male'].count()

# Number of females
n_female = data['Gender'][data['Gender'] == 'female'].count()

# Total rows
total_ppl = data['Gender'].count()

#
# Your code goes here 
#

## 4.2 Calculate Likelihood 
The following code finds the mean for each gender. 

In [None]:
# Group the data by gender and calculate the means of each feature
data_means = data.groupby('Gender').mean()

# View the values
data_means


**Exercise 11**

(2) Find the variance of each feature for each gender using the var() method in pandas. 

In [None]:
#
# Your code goes here 
#

The code below is a function that you can call to calculate the likelihood of x given y 

In [None]:
#We can calculate p(x|y) with the following function. Here x is the feature value of a new data point 

def p_x_given_y(x, mean_y, variance_y):

    # Input the arguments into a probability density function
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    
    # return p
    return p

The following code shows you how to obtain the mean height, weight and foot size of the "male" class. Find the mean of each feature for the female class. 

**Exercise 12**

Find the variance of each feature for both classes.  

In [None]:
# Means for male
#print( data_means['Height'][data_means.index == 'male'].values[0])


mhm = data_means['Height'][data_means.index == 'male'].values[0]
mwm = data_means['Weight'][data_means.index == 'male'].values[0]
mfm = data_means['Foot_Size'][data_means.index == 'male'].values[0]
#
# Your code goes here 
#

## 4.3 Apply Bayes Classifier To New Test Point
In Gaussian Naive Bayes classifiers, we assume that the features are independent of each other given the class, for example, that foot size is independent of weight or height. This is obviously not true, and is a “naive” assumption, hence the name “naive Bayes”. We also assume that the values of the features (e.g. height, weight) are normally (Gaussian) distributed. This means that the likelihood can be evaluated using the following example equation: 

\begin{equation}
p(\text{height}\mid\text{female})=\frac{1}{\sqrt{2\pi\text{variance of female height in the data}}}\,e^{ -\frac{(\text{observation’s height}-\text{average height of females in the data})^2}{2\text{variance of female height in the data}} }
\end{equation} 
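
As a sanity check, the density formula above can be compared with a reference implementation. The sketch below assumes SciPy is available and uses hypothetical values for the observation, mean and variance. The helper `gaussian_likelihood` is an illustrative name that mirrors the formula.

```python
import numpy as np
from scipy.stats import norm

def gaussian_likelihood(x, mean, variance):
    """Normal probability density of x for the given mean and variance."""
    return 1 / np.sqrt(2 * np.pi * variance) * np.exp(-(x - mean) ** 2 / (2 * variance))

# Hypothetical values chosen only for illustration
x, mean, variance = 165.0, 160.0, 25.0
manual = gaussian_likelihood(x, mean, variance)

# scipy's norm.pdf expects the standard deviation, not the variance
reference = norm.pdf(x, loc=mean, scale=np.sqrt(variance))
print(manual, reference)
```

Note that the formula takes the variance, while `norm.pdf` takes the standard deviation through its `scale` argument. Mixing the two up is a common source of errors.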

**Exercise 13**

(3) A new person has the following feature values: [Height = 170, Weight = 65, Foot Size = 10]. Evaluate the posterior for each class. Predict whether this person is more likely to be male or female. 

In [None]:
#
# Your code goes here 
# 