# Classifying IRIS species using a univariate Gaussian classifier

**Note:** You may use built-in functions for the mean, variance, covariance, determinant, etc.

In [None]:
# Standard includes
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Useful module for dealing with the Gaussian density
from scipy.stats import norm, multivariate_normal # in case you use the built-in library
# Packages for interactive graphs
import ipywidgets as widgets
from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual, IntSlider
from sklearn import datasets

### Loading the IRIS dataset

In [None]:
iris = datasets.load_iris()
X = iris.data
Y = iris.target
# Names follow the column order of iris.data (see iris.feature_names)
featurenames = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

Confirm the dimensions:

In [None]:
X.shape, Y.shape

In [None]:
# Split 150 instances into training set (trainx, trainy) of size 105 and test set (testx, testy) of size 45
np.random.seed(0)
perm = np.random.permutation(150)
trainx = X[perm[0:105],:]
trainy = Y[perm[0:105]]
testx = X[perm[105:150],:]
testy = Y[perm[105:150]]

Let's see how many training points there are from each class.

In [None]:
sum(trainy==0), sum(trainy==1), sum(trainy==2)
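
The same tally can be produced in a single call. This is a minimal sketch, assuming the labels are small non-negative integers; `toy_labels` is a made-up array used only for illustration, not the notebook's data.

```python
import numpy as np

# Hypothetical label vector standing in for a real label array
toy_labels = np.array([0, 2, 1, 0, 2, 2])

# np.bincount counts how often each integer 0, 1, 2, ... occurs;
# minlength=3 guarantees one entry per class even if a class is absent
counts = np.bincount(toy_labels, minlength=3)
print(counts)
```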

### Q1. Can you figure out how many test points there are from each class? 

In [None]:
# TODO: add your code to find how many test points there are from each class


### Look at the distribution of a single feature from one of the species

Let's pick just one feature: the first one, that is, number 0 (`featurenames[0]`). Here is a *histogram* of this feature's values for species 1, along with the *Gaussian fit* to this distribution.

<img src="density.png">

In [None]:
@interact_manual( feature=IntSlider(0,0,3), label=IntSlider(0,0,2))
def density_plot(feature, label):
    plt.hist(trainx[trainy==label,feature], density=True)
    #
    mu = np.mean(trainx[trainy==label,feature]) # mean
    var = np.var(trainx[trainy==label,feature]) # variance
    std = np.sqrt(var) # standard deviation
    x_axis = np.linspace(mu - 3*std, mu + 3*std, 1000)
    plt.plot(x_axis, norm.pdf(x_axis,mu,std), 'r', lw=2)
    plt.title("Species "+str(label) )
    plt.xlabel(featurenames[feature], fontsize=14, color='red')
    plt.ylabel('Density', fontsize=14, color='red')
    plt.show()

### Q2. In the function **density_plot**, the code for plotting the Gaussian density focuses on the region within 3 standard deviations of the mean. Do you see where this happens? Why do you think we make this choice?
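
One way to explore this question numerically is to ask how much probability mass a Gaussian places within a given number of standard deviations of its mean. The sketch below is an illustration only, not part of the exercise code; the helper name `mass_within` is made up for this example.

```python
from scipy.stats import norm

def mass_within(num_std):
    """Probability that a Gaussian variable lies within num_std standard deviations of its mean."""
    # For a standard normal this is cdf(num_std) - cdf(-num_std);
    # the value is the same for any mean and variance.
    return norm.cdf(num_std) - norm.cdf(-num_std)

for num_std in (1, 2, 3):
    print(num_std, round(mass_within(num_std), 4))
```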

### Q3. Here's something for you to figure out: for which feature (0-3) does the distribution of (training set) values for species 2 have the *smallest* standard deviation? What is the value?
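
As a hint on the mechanics, per-column standard deviations can be computed in one call with `axis=0`. The sketch below uses a small made-up matrix, `toy_data`, rather than the IRIS data, so it shows only the pattern and not the answer.

```python
import numpy as np

# Made-up data: 3 rows (points) and 2 columns (features)
toy_data = np.array([[1.0, 10.0],
                     [2.0, 10.0],
                     [3.0, 13.0]])

# Standard deviation of each column (population form, dividing by n)
col_std = np.std(toy_data, axis=0)

# Index of the column with the smallest spread
smallest = int(np.argmin(col_std))
print(col_std, smallest)
```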

In [None]:
# modify this cell
std = np.zeros(4)
### START CODE HERE ###

### END CODE HERE ###

### Fit a Gaussian to each class

Let's define a function that will fit a Gaussian generative model to the three classes, restricted to just a single feature.
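
As a rough illustration of what "fitting" means here, the sketch below estimates a mean, a variance, and a class weight per class on a tiny made-up data set. The arrays `toy_x` and `toy_y` and the function name `fit_toy_model` are assumptions for this example only, and the weight of a class is assumed to be its fraction of the training points.

```python
import numpy as np

# Made-up one-dimensional data with two classes (0 and 1)
toy_x = np.array([1.0, 2.0, 3.0, 10.0, 12.0])
toy_y = np.array([0, 0, 0, 1, 1])

def fit_toy_model(x, y, k):
    """Estimate per-class mean, variance and weight for 1-D data x with labels 0..k-1."""
    mu = np.zeros(k)
    var = np.zeros(k)
    pi = np.zeros(k)
    for label in range(k):
        values = x[y == label]
        mu[label] = np.mean(values)        # sample mean of the class
        var[label] = np.var(values)        # variance (population form, dividing by n)
        pi[label] = len(values) / len(y)   # fraction of points in this class
    return mu, var, pi

toy_mu, toy_var, toy_pi = fit_toy_model(toy_x, toy_y, k=2)
print(toy_mu, toy_var, toy_pi)
```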

In [None]:
# Assumes y takes on values 0,1,2
def fit_generative_model(x,y,feature):
    k = 3 # number of classes
    mu = np.zeros(k) # list of means, one per class
    var = np.zeros(k) # list of variances, one per class
    pi = np.zeros(k) # list of class weights
    for label in range(0,k):
        indices = (y==label)
        ### START CODE HERE ###
        mu[label] = None
        var[label] = None 
        pi[label] = None
        ### END CODE HERE ###
    return mu, var, pi

Call this function on feature 0 (`featurenames[0]`). What are the class weights?

In [None]:
feature = 0 # first feature, featurenames[0]
### START CODE HERE ###


### END CODE HERE ###

Next, display the Gaussian distribution for each of the three classes.

In [None]:
@interact_manual( feature=IntSlider(0,0,3) )
def show_densities(feature):
    mu, var, pi = fit_generative_model(trainx, trainy, feature)
    colors = ['r', 'k', 'g']
    for label in range(0,3):
        m = mu[label]
        s = np.sqrt(var[label])
        x_axis = np.linspace(m - 3*s, m+3*s, 1000)
        plt.plot(x_axis, norm.pdf(x_axis,m,s), colors[label], label="species-" + str(label))
    plt.xlabel(featurenames[feature], fontsize=14, color='red')
    plt.ylabel('Density', fontsize=14, color='red')
    plt.legend()
    plt.show()

### Questions:

Use the widget above to look at the three class densities for each of the 4 features. Here are some questions for you:
1. For which feature (0-3) do the densities for classes 0 and 2 *overlap* the most?
2. For which feature (0-3) is class 2 the most spread out relative to the other two classes?
3. For which feature (0-3) do the three classes seem the most *separated* (this is somewhat subjective at present)?

How well can we predict the class (0, 1, 2) based just on one feature? The code below lets us find this out.

In [None]:
@interact( feature=IntSlider(0,0,3) )
def test_model(feature):
    mu, var, pi = fit_generative_model(trainx, trainy, feature)

    k = 3 # Number of classes; labels are 0,1,...,k-1
    n_test = len(testy) # Number of test points
    score = np.zeros((n_test,k)) # one column per class
    for i in range(0,n_test):
        for label in range(0,k):
            ### START CODE HERE ###
            # Implement the formula for the normal pdf yourself.
            # Falling back on the built-in norm.logpdf() is allowed, but full marks require your own implementation.
            
            score[i,label] = None 
    predictions = None #think about using np.argmax on score[]
    ### END CODE HERE ###
    # Finally, tally up the errors
    errors = np.sum(predictions != testy)
    print("Test error using feature " + featurenames[feature] + ": " + str(errors) + "/" + str(n_test))
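
A common rule for this kind of generative classifier is to pick the class with the largest value of log π_j + log P_j(x), where P_j is the Gaussian fitted to class j. A minimal sketch on made-up parameters follows. The names `gaussian_logpdf`, `predict`, `toy_mu`, `toy_var` and `toy_pi` are hypothetical, and working in log space is an assumption chosen here for numerical stability, not something the notebook prescribes.

```python
import numpy as np

def gaussian_logpdf(x, mu, var):
    """Log of the univariate normal density N(x; mu, var)."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(x, mu, var, pi):
    """Return, for each point in x, the label with the highest log-score."""
    x = np.asarray(x, dtype=float)
    # score[i, j] = log pi_j + log N(x_i; mu_j, var_j), computed by broadcasting
    score = np.log(pi)[None, :] + gaussian_logpdf(x[:, None], mu[None, :], var[None, :])
    return np.argmax(score, axis=1)

# Made-up parameters for two classes
toy_mu = np.array([2.0, 11.0])
toy_var = np.array([1.0, 1.0])
toy_pi = np.array([0.6, 0.4])

print(predict([1.5, 9.0, 12.0], toy_mu, toy_var, toy_pi))
```

Note that the score array has exactly one column per class, so `np.argmax` can only return a valid label.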

### Questions:
In this notebook, we are looking at classifiers that use just one out of a possible 4 features. Choosing a subset of features is called **feature selection**. In general, this is something we would need to do based solely on the *training set*, that is, without peeking at the *test set*.
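
A sketch of how selection on the training set alone might look: given one training error per candidate feature, pick the feature with the fewest training mistakes, and consult the test error only after the choice is made. The error counts below are invented placeholders for three hypothetical features; they are not results from the IRIS data.

```python
import numpy as np

# Invented error counts for three hypothetical features (NOT IRIS results)
toy_train_errors = np.array([12, 5, 9])
toy_test_errors = np.array([6, 3, 2])

# Select using the training errors only
best_feature = int(np.argmin(toy_train_errors))

# Rank all features from best to worst by training error
ranking = np.argsort(toy_train_errors)

# The test error is looked up only after the choice has been made
print(best_feature, ranking, toy_test_errors[best_feature])
```

In this made-up example the feature chosen on training error is not the one with the lowest test error, which is exactly the situation where peeking at the test set would bias the reported result.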

For the IRIS data, compute the training error and test error associated with each choice of feature.

In [None]:
### Write your code here

Based on your findings, answer the following questions:
* Which two features have the lowest training error? List them in order (best first).
* Which two features have the lowest test error? List them in order (best first).