# Winery classification with the multivariate Gaussian

In this notebook, we return to winery classification, using the full set of 13 features.

## 1. Load in the data 

As usual, we start by loading in the Wine data set. Make sure the file `wine.data.txt` is in the same directory as this notebook.

Recall that there are 178 data points, each with 13 features and a label (1, 2, or 3). As before, we will divide these into a training set of 130 points and a test set of 48 points.

In [1]:
# Standard includes
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Useful module for dealing with the Gaussian density
from scipy.stats import norm, multivariate_normal 

In [2]:
# Load data set.
data = np.loadtxt('wine.data.txt', delimiter=',')
# Names of features
featurenames = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash','Magnesium', 'Total phenols', 
                'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 
                'OD280/OD315 of diluted wines', 'Proline']
# Split 178 instances into training set (trainx, trainy) of size 130 and test set (testx, testy) of size 48
np.random.seed(0)
perm = np.random.permutation(178)
trainx = data[perm[0:130],1:14]
trainy = data[perm[0:130],0]
testx = data[perm[130:178], 1:14]
testy = data[perm[130:178],0]

In [3]:
testx[:5]

array([[1.388e+01, 1.890e+00, 2.590e+00, 1.500e+01, 1.010e+02, 3.250e+00,
        3.560e+00, 1.700e-01, 1.700e+00, 5.430e+00, 8.800e-01, 3.560e+00,
        1.095e+03],
       [1.242e+01, 2.550e+00, 2.270e+00, 2.200e+01, 9.000e+01, 1.680e+00,
        1.840e+00, 6.600e-01, 1.420e+00, 2.700e+00, 8.600e-01, 3.300e+00,
        3.150e+02],
       [1.281e+01, 2.310e+00, 2.400e+00, 2.400e+01, 9.800e+01, 1.150e+00,
        1.090e+00, 2.700e-01, 8.300e-01, 5.700e+00, 6.600e-01, 1.360e+00,
        5.600e+02],
       [1.258e+01, 1.290e+00, 2.100e+00, 2.000e+01, 1.030e+02, 1.480e+00,
        5.800e-01, 5.300e-01, 1.400e+00, 7.600e+00, 5.800e-01, 1.550e+00,
        6.400e+02],
       [1.383e+01, 1.570e+00, 2.620e+00, 2.000e+01, 1.150e+02, 2.950e+00,
        3.400e+00, 4.000e-01, 1.720e+00, 6.600e+00, 1.130e+00, 2.570e+00,
        1.130e+03]])

In [4]:
testy[:5]

array([1., 2., 3., 3., 1.])

## 2. Fit a Gaussian generative model

We now define a function that fits a Gaussian generative model to the data.
For each class (`j=1,2,3`), we have:
* `pi[j]`: the class weight
* `mu[j,:]`: the mean, a 13-dimensional vector
* `sigma[j,:,:]`: the 13x13 covariance matrix

This means that `pi` is a length-4 array (Python arrays are indexed starting at zero, and we aren't using `j=0`), `mu` is a 4x13 array, and `sigma` is a 4x13x13 array.
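For reference, here is a sketch of the standard formulation these parameters plug into, written with the symbols above. The counts $n_j$ (number of training points with label $j$), $n$ (total number of training points), and $d$ (number of features) are notation introduced here for illustration. The covariance is normalized by $n_j$, which corresponds to the `bias` setting passed to `np.cov` in the fitting code.

$$
\pi_j = \frac{n_j}{n}, \qquad
\mu_j = \frac{1}{n_j} \sum_{i:\, y_i = j} x_i, \qquad
\Sigma_j = \frac{1}{n_j} \sum_{i:\, y_i = j} (x_i - \mu_j)(x_i - \mu_j)^T
$$

$$
P_j(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_j|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right)
$$

A new point $x$ is then assigned the label

$$
\hat{y} = \arg\max_{j \in \{1,2,3\}} \; \bigl[ \log \pi_j + \log P_j(x) \bigr].
$$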

In [5]:
def fit_generative_model(x,y):
    k = 3  # labels 1,2,...,k
    d = x.shape[1]  # number of features
    mu = np.zeros((k+1,d))
    sigma = np.zeros((k+1,d,d))
    pi = np.zeros(k+1)
    for label in range(1,k+1):
        indices = (y == label)  # boolean mask of the points in this class
        mu[label] = np.mean(x[indices,:], axis=0)
        # bias=True normalizes by the class size rather than (class size - 1)
        sigma[label] = np.cov(x[indices,:], rowvar=False, bias=True)
        pi[label] = np.sum(indices) / len(y)  # fraction of points in this class
    return mu, sigma, pi

In [6]:
# Fit a Gaussian generative model to the training data
mu, sigma, pi = fit_generative_model(trainx,trainy)

In [7]:
print(mu[1]) # Mean for label 1
print(mu[1,[2, 4, 6]]) # Mean for label 1 on features [2,4,6]

[1.37853488e+01 2.02232558e+00 2.42790698e+00 1.68813953e+01
 1.05837209e+02 2.85162791e+00 2.99627907e+00 2.89069767e-01
 1.93069767e+00 5.63023256e+00 1.06232558e+00 3.16674419e+00
 1.14190698e+03]
[  2.42790698 105.8372093    2.99627907]


In [8]:
i = 1
# print(sigma[i]) # Full covariance matrix of class i
# print(sigma[i][[2,4,6],:][:,[2,4,6]]) # 3x3 covariance submatrix for features 2, 4 and 6
# Variances of features 0, 2 and 6 placed on a diagonal; the off-diagonal
# covariances are dropped, which amounts to treating the features as independent
print(np.diag(sigma[[i],[0,2,6],[0,2,6]]))
print(sigma[i].shape)

[[0.23325279 0.         0.        ]
 [0.         0.03677469 0.        ]
 [0.         0.         0.15240941]]
(13, 13)


In [9]:
pi

array([0.        , 0.33076923, 0.41538462, 0.25384615])
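The fitted parameters can be sanity-checked before they are used for prediction: the class weights should sum to 1 (the three printed values above do), and each covariance matrix should be symmetric with non-negative eigenvalues. The following is a minimal sketch of such checks. It uses synthetic stand-in data with 4 features rather than the Wine data, so every array in it is an assumption made for illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the Wine data): 60 points, 4 features, labels 1..3.
rng = np.random.default_rng(1)
x = rng.normal(size=(60, 4))
y = np.repeat([1.0, 2.0, 3.0], [25, 20, 15])

k, d = 3, x.shape[1]
pi = np.zeros(k + 1)               # index 0 is unused, as in the notebook
sigma = np.zeros((k + 1, d, d))
for label in range(1, k + 1):
    members = (y == label)
    pi[label] = members.mean()     # fraction of points with this label
    sigma[label] = np.cov(x[members], rowvar=False, bias=True)

# Check 1: class weights form a probability distribution.
weights_sum_to_one = np.isclose(pi.sum(), 1.0)

# Check 2: every covariance matrix is symmetric.
covs_symmetric = all(np.allclose(sigma[j], sigma[j].T) for j in range(1, k + 1))

# Check 3: smallest eigenvalue per class is non-negative up to rounding error.
min_eigs = [np.linalg.eigvalsh(sigma[j]).min() for j in range(1, k + 1)]

print(weights_sum_to_one, covs_symmetric, min_eigs)
```

The same three checks apply unchanged to the `pi` and `sigma` arrays returned by `fit_generative_model`.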

## 3. Use the model to make predictions on the test set

<font color="magenta">**For you to do**</font>: Define a general purpose testing routine that takes as input:
* the arrays `pi`, `mu`, `sigma` defining the generative model, as above
* the test set (points `tx` and labels `ty`)
* a list of features `features` (chosen from 0-12)

It should return the number of mistakes made by the generative model on the test data, *when restricted to the specified features*. For instance, using just the three features 2 (`'Ash'`), 4 (`'Magnesium'`) and 6 (`'Flavanoids'`) results in 7 mistakes (out of 48 test points), so 

        `test_model(mu, sigma, pi, [2,4,6], testx, testy)` 

should return 7 (that is, 7 mistakes out of 48 test points).

**Hint:** The way you restrict attention to a subset of features is by choosing the corresponding coordinates of the full 13-dimensional mean and the appropriate submatrix of the full 13x13 covariance matrix.
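The hint can be illustrated with a small sketch. The arrays `mu_j` and `sigma_j` below are hypothetical stand-ins for one class's mean and covariance, built from random numbers rather than the Wine estimates. `np.ix_` is NumPy's helper for selecting a row/column cross product; chaining a row slice and a column slice selects the same submatrix.

```python
import numpy as np

# Hypothetical stand-ins for one class's parameters (not the Wine estimates).
rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 13))
mu_j = samples.mean(axis=0)                          # shape (13,)
sigma_j = np.cov(samples, rowvar=False, bias=True)   # shape (13, 13)

features = [2, 4, 6]

# Restrict the mean to the chosen coordinates.
mu_sub = mu_j[features]                              # shape (3,)

# Restrict the covariance to the chosen rows AND columns.
sigma_sub = sigma_j[np.ix_(features, features)]      # shape (3, 3)

# Chained slicing (rows first, then columns) gives the same submatrix.
sigma_chained = sigma_j[features, :][:, features]
same = np.array_equal(sigma_sub, sigma_chained)

# Pitfall: passing the two index lists together pairs them up elementwise,
# so this returns only the three diagonal entries (the variances).
diag_only = sigma_j[features, features]              # shape (3,)

print(mu_sub.shape, sigma_sub.shape, same, diag_only.shape)
```

The pitfall in the last step is worth remembering: a result of shape `(3,)` where a `(3, 3)` matrix was expected usually means the index lists were paired instead of crossed.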

In [22]:
# Now test the performance of a predictor based on a subset of features
def test_model(mu, sigma, pi, features, tx, ty):
    ###
    ### Your code goes here   
    ###
    
    nb_labels = pi.shape[0]-1
    n, d = tx.shape
    
    # Classify each data point in tx.
    # Column 0 of t_prd holds the predicted label; columns 1..nb_labels hold
    # the score log(pi[label]) + log-density for each label.
    t_prd = np.zeros((n, nb_labels+1))
    for i in range(n):
        for label in range(1, nb_labels+1):
            t_prd[i, label] = np.log(pi[label]) + multivariate_normal.logpdf(tx[i][features], 
                                                                             mean=mu[label][features], 
                                                                             cov=sigma[label][features,:][:,features])
        t_prd[i,0] = np.argmax(t_prd[i,1:]) + 1
    
    # Count misclassified points
    misclassified = (t_prd[:,0] != ty)
    n_misclassified = np.sum(misclassified)
    
    return n_misclassified
        

In [23]:
features = [2, 4, 6]
n_misclassified = test_model(mu, sigma, pi, features, tx=testx, ty=testy)

In [24]:
n_misclassified

7
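The per-point loop can also be avoided, since `multivariate_normal.logpdf` accepts a whole matrix of points at once and returns one log-density per row. The function below, `count_errors_vectorized`, is a hypothetical variant written for illustration (it is not part of the assignment) and keeps the same argument order as `test_model`. The demonstration runs on synthetic two-class data, not the Wine data, so it is a sketch under those assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def count_errors_vectorized(mu, sigma, pi, features, tx, ty):
    """Hypothetical vectorized variant of test_model (same argument order)."""
    features = list(features)            # also accepts range objects
    k = len(pi) - 1                      # labels are 1..k; index 0 is unused
    sub = tx[:, features]                # shape (n, len(features))
    scores = np.empty((tx.shape[0], k))
    for label in range(1, k + 1):
        # One score per test point: log class weight + log class density.
        scores[:, label - 1] = np.log(pi[label]) + multivariate_normal.logpdf(
            sub,
            mean=mu[label, features],
            cov=sigma[label][np.ix_(features, features)],
        )
    predictions = np.argmax(scores, axis=1) + 1   # shift back to labels 1..k
    return int(np.sum(predictions != ty))

# Synthetic demonstration: two well-separated classes in two dimensions.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(loc=0.0, size=(20, 2)),
               rng.normal(loc=10.0, size=(20, 2))])
y = np.array([1.0] * 20 + [2.0] * 20)

# Fit per-class parameters using the notebook's 1-based layout.
mu = np.zeros((3, 2))
sigma = np.zeros((3, 2, 2))
pi = np.zeros(3)
for label in (1, 2):
    members = (y == label)
    mu[label] = x[members].mean(axis=0)
    sigma[label] = np.cov(x[members], rowvar=False, bias=True)
    pi[label] = members.mean()

errors_both = count_errors_vectorized(mu, sigma, pi, [0, 1], x, y)
errors_single = count_errors_vectorized(mu, sigma, pi, [0], x, y)
print(errors_both, errors_single)
```

Only the loop over labels remains, and it runs three times on the Wine data regardless of the test set size.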

### <font color="magenta">Fast exercises</font>

*Note down the answers to these questions. You will need to enter them as part of this week's assignment.*

Exercise 1. How many errors are made on the test set when using the single feature 'Ash'?

In [25]:
test_model(mu, sigma, pi, [2], testx, testy)

29

Exercise 2. How many errors when using 'Alcohol' and 'Ash'?

In [26]:
test_model(mu, sigma, pi, [0,2], testx, testy)

12

Exercise 3. How many errors when using 'Alcohol', 'Ash', and 'Flavanoids'?

In [27]:
test_model(mu, sigma, pi, [0,2,6], testx, testy)

3

Exercise 4. How many errors when using all 13 features?

In [28]:
test_model(mu, sigma, pi, range(0,13), testx, testy)

2

Exercise 5. In lecture, we got somewhat different answers to these questions. Why do you think that might be?