#### The University of Melbourne, School of Computing and Information Systems
# COMP90049 Introduction to Machine Learning, 2020 Semester 2

## Week 4 

This week, we will be using scikit-learn to classify some data, and to evaluate some classifiers.

In [1]:
import numpy as np
from sklearn import datasets
from collections import Counter
import matplotlib.pyplot as plt

### Exercise 1.
Please load the Car Evaluation dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data).

The common terminology in scikit-learn is that the array defining the attribute values is called X and the array defining the gold–standard (“ground truth”) labels is called y; create these variables for the car data.

- **(a)** Load the data into a suitable format for scikit-learn:


In [14]:
X = []
y = []
with open('car.data', mode='r') as fin:
    for line in fin:
        atts = line.strip().split(",")
        X.append(atts[:-1]) #all atts, excluding the class
        y.append(atts[-1])
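
The loop above assumes that every line of `car.data` is a complete record. A slightly more defensive variant can be sketched with the standard-library `csv` module, skipping blank lines and checking the number of fields. The helper name `load_car_rows`, its `n_columns` parameter, and the in-memory sample rows are illustrative assumptions, not part of the exercise.

```python
import csv
import io

def load_car_rows(handle, n_columns=7):
    """Read comma-separated rows into attribute lists X and labels y."""
    X, y = [], []
    for row in csv.reader(handle):
        if not row:  # skip blank lines, e.g. an empty line at the end of the file
            continue
        if len(row) != n_columns:
            raise ValueError(f"expected {n_columns} fields, got {len(row)}: {row}")
        X.append(row[:-1])  # all attributes, excluding the class
        y.append(row[-1])   # the class label is the last field
    return X, y

# Hand-written sample in the same format as car.data (not the real file)
sample = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "\n"
    "low,low,5more,more,big,high,vgood\n"
)
X_demo, y_demo = load_car_rows(sample)
print(X_demo)
print(y_demo)
```

With the real file, the same helper could be called as `load_car_rows(open('car.data', newline=''))` inside a `with` block.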

- **(b)** How many instances are there in this collection? How many attributes, and of what type(s)? What is the class we’re trying to predict, and how many values does it take?

In [8]:
from collections import Counter
print('There are', len(X), 'instances')
print('There are', len(X[0]), "attributes, for example:", X[0])
print('There are', len(set(y)), "class labels:", set(y))   
#use Counter to count the number of labels
label_counter = Counter(y)
print("Label frequencies: %s" %str(label_counter.most_common()))

There are 1728 instances
There are 6 attributes, for example: ['vhigh', 'vhigh', '2', '2', 'small', 'low']
There are 4 class labels: {'vgood', 'acc', 'good', 'unacc'}
Label frequencies: [('unacc', 1210), ('acc', 384), ('good', 69), ('vgood', 65)]
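
The label frequencies are quite skewed, which is worth keeping in mind when reading the accuracies later in this sheet. As a rough reference point, the sketch below computes the accuracy of always predicting the most frequent label, using only the counts printed above. The variable names are illustrative.

```python
from collections import Counter

# Label frequencies copied from the output above
label_counts = Counter({'unacc': 1210, 'acc': 384, 'good': 69, 'vgood': 65})

n_instances = sum(label_counts.values())
majority_label, majority_count = label_counts.most_common(1)[0]

# Accuracy of a classifier that ignores the attributes entirely
baseline_accuracy = majority_count / n_instances

print(f"Always predicting '{majority_label}' is correct for "
      f"{majority_count}/{n_instances} = {baseline_accuracy:.3f} of the instances")
```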


### Exercise 2
Unfortunately, scikit-learn isn’t set up to deal with our attributes in this format.

- **(a)** Write some functions that transform our **categorical** attributes into **numerical** attributes, by (perhaps arbitrarily) assigning each categorical value to an integer, for example:

```python
def convert_class(raw):
    if raw=="unacc": return 0
    elif raw=="acc": return 1
    elif raw=="good": return 2
    elif raw=="vgood": return 3
```


In [9]:
# We could check this from the "car.names" file linked above
# Here's one (somewhat inefficient) way of reading this from the data itself
feature_1_values = set([X[i][0] for i in range(len(X))])
feature_2_values = set([X[i][1] for i in range(len(X))])
feature_3_values = set([X[i][2] for i in range(len(X))])
feature_4_values = set([X[i][3] for i in range(len(X))])
feature_5_values = set([X[i][4] for i in range(len(X))])
feature_6_values = set([X[i][5] for i in range(len(X))])
print("feature 1: %s" %str(feature_1_values))
print("feature 2: %s" %str(feature_2_values))
print("feature 3: %s" %str(feature_3_values))
print("feature 4: %s" %str(feature_4_values))
print("feature 5: %s" %str(feature_5_values))
print("feature 6: %s" %str(feature_6_values))

feature 1: {'high', 'vhigh', 'med', 'low'}
feature 2: {'high', 'vhigh', 'med', 'low'}
feature 3: {'5more', '2', '3', '4'}
feature 4: {'more', '2', '4'}
feature 5: {'big', 'small', 'med'}
feature 6: {'high', 'med', 'low'}


In [27]:
import numpy as np

def convert_feature_1and2and6(raw):
    if raw == "low": return 0
    elif raw == "med": return 1
    elif raw == "high": return 2
    elif raw == "vhigh": return 3
    # In general, we might want to catch unexpected values, too
def convert_feature_3(raw):
    if raw == "2": return 0
    elif raw == "3": return 1
    elif raw == "4": return 2
    elif raw == "5more": return 3
def convert_feature_4(raw):
    if raw == "2": return 0
    elif raw == "4": return 1
    elif raw == "more": return 2
def convert_feature_5(raw):
    if raw == "small": return 0
    elif raw == "med": return 1
    elif raw == "big": return 2
def convert_class(raw):
    if raw == "unacc": return 0
    elif raw == "acc": return 1
    elif raw == "good": return 2
    elif raw == "vgood": return 3

X_ordinal = []
for x in X:
    f1, f2, f3, f4, f5, f6 = x
    f1 = convert_feature_1and2and6(f1)
    f2 = convert_feature_1and2and6(f2)
    f3 = convert_feature_3(f3)
    f4 = convert_feature_4(f4)
    f5 = convert_feature_5(f5)
    f6 = convert_feature_1and2and6(f6)
    x = [f1, f2, f3, f4, f5, f6]
    X_ordinal.append(x)
    
#convert to int array to make sure everything is converted.
X_ordinal = np.array(X_ordinal, dtype='int')

#convert ys
y_numeric = []
for this_y in y:
    this_y = convert_class(this_y)
    y_numeric.append(this_y)

y_num = np.array(y_numeric, dtype='int')

print('X shape: {}, y shape: {}'.format(X_ordinal.shape, y_num.shape))
print(X_ordinal) 

X shape: (1728, 6), y shape: (1728,)
[[3 3 0 0 0 0]
 [3 3 0 0 0 1]
 [3 3 0 0 0 2]
 ...
 [0 0 3 2 2 0]
 [0 0 3 2 2 1]
 [0 0 3 2 2 2]]
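
The comment in the conversion cell notes that we might want to catch unexpected values. One way to do that, sketched here with lookup tables instead of `if`/`elif` chains, is to raise an error as soon as an unknown value appears. The table names and the helper `encode_row` are hypothetical; the integer codes are the same as in the `convert_*` functions above.

```python
# One lookup table per group of attributes, using the same codes as above
MAP_1_2_6 = {"low": 0, "med": 1, "high": 2, "vhigh": 3}
MAP_3 = {"2": 0, "3": 1, "4": 2, "5more": 3}
MAP_4 = {"2": 0, "4": 1, "more": 2}
MAP_5 = {"small": 0, "med": 1, "big": 2}

# Column order follows the six attributes in X
COLUMN_MAPS = [MAP_1_2_6, MAP_1_2_6, MAP_3, MAP_4, MAP_5, MAP_1_2_6]

def encode_row(row):
    """Map one row of categorical strings to integers, failing on unknown values."""
    if len(row) != len(COLUMN_MAPS):
        raise ValueError(f"expected {len(COLUMN_MAPS)} attributes, got {len(row)}")
    encoded = []
    for position, (value, mapping) in enumerate(zip(row, COLUMN_MAPS)):
        if value not in mapping:
            raise ValueError(f"unexpected value {value!r} in attribute {position + 1}")
        encoded.append(mapping[value])
    return encoded

print(encode_row(['vhigh', 'vhigh', '2', '2', 'small', 'low']))
```

Applied to every row, `[encode_row(x) for x in X]` would produce the same integer codes as the loop above, assuming `X` holds only the values listed in the feature printout.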


- **(b)** Load the dataset again, this time as integers. Observe that we can actually build a model using this data.

In [22]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_ordinal, y_num)

LinearSVC()

- **(c)** Split the data into training (80%) and test sets (20%)

In [23]:
from sklearn.model_selection import train_test_split # Newer versions
#from sklearn.cross_validation import train_test_split # Older versions
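# Note: the exercise asks for an 80/20 split (test_size=0.2); this solution holds out 33%,
# which matches the test_size=0.33 used in the later cells and the shapes printed below.
X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33)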
print('X_train: {} X_test: {}'.format(X_train.shape, X_test.shape))

X_train: (1157, 6) X_test: (571, 6)
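
The exercise asks for an 80/20 split, while the cell above passes `test_size=0.33`. To see how the fraction translates into set sizes, here is a minimal hold-out split in plain Python. It is a sketch, not `train_test_split` itself: the function name `shuffle_split` is hypothetical, the shuffled order will differ from scikit-learn's, and rounding the test share up is an assumption chosen because it reproduces the 1157/571 shapes printed above.

```python
import math
import random

def shuffle_split(n_samples, test_size, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split."""
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test share up
    n_train = n_samples - n_test
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    return indices[:n_train], indices[n_train:]

split_sizes = {}
for share in (0.33, 0.2):
    train_idx, test_idx = shuffle_split(1728, share)
    split_sizes[share] = (len(train_idx), len(test_idx))
    print(f"test_size={share}: {len(train_idx)} train, {len(test_idx)} test")
```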


### Exercise 3.
Read up on different implementations of the Naive Bayes classifier in `sklearn.naive_bayes`. Which one do you think is most suitable for the dataset we have?

- **(a)** Implement the Bernoulli Naive Bayes classifier and inspect its performance

In [24]:
import sklearn.naive_bayes as nb
from sklearn.naive_bayes import BernoulliNB

bnb_accs = []
bnb = BernoulliNB()

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=i)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    
print('Avg BNB score: {}'.format(np.mean(bnb_accs)))

    

BNB score 0.763573 
BNB score 0.814361 
BNB score 0.779335 
Avg BNB score: 0.7857559836544074


### Exercise 4.

Read up on the implementation of the KNN classifier in `sklearn.neighbors.KNeighborsClassifier` and the implementation of distance functions in `sklearn.neighbors.DistanceMetric`. Implement the KNN classifier with
- Manhattan distance
- inverse distance weighting
- K=5

- **(a)** Play with different values of K and weighting strategies


In [25]:
from sklearn.neighbors import KNeighborsClassifier
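# Note: the exercise asks for Manhattan distance (metric='manhattan'); this solution uses
# Hamming distance, which only checks whether attribute values differ, not by how much.
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='hamming')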
knn_accs = []

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=i)
    
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)
    print("KNN score %f " %acc)
    knn_accs.append(acc)
    
print('Avg KNN score: {}'.format(np.mean(knn_accs)))




KNN score 0.833625 
KNN score 0.821366 
KNN score 0.823117 
Avg KNN score: 0.8260361938120256
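
The exercise names Manhattan distance, while the cell above passes `metric='hamming'`. The sketch below contrasts the two on ordinal-encoded rows and shows a simplified inverse-distance vote. All function names are hypothetical, and the treatment of zero distances is a simplification; scikit-learn's own handling of ties and exact matches may differ.

```python
from collections import defaultdict

def manhattan(a, b):
    """Sum of absolute differences between attribute codes."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Fraction of attributes whose codes differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def weighted_vote(neighbours):
    """Pick a label from (distance, label) pairs using 1/distance weights."""
    # Simplification: an exact match (distance 0) decides the vote on its own
    exact = [label for dist, label in neighbours if dist == 0]
    if exact:
        return exact[0]
    totals = defaultdict(float)
    for dist, label in neighbours:
        totals[label] += 1.0 / dist
    return max(totals, key=totals.get)

query = [3, 3, 0, 0, 0, 0]
one_step = [3, 3, 0, 0, 0, 1]   # last attribute differs by one level
two_steps = [3, 3, 0, 0, 0, 2]  # last attribute differs by two levels

# Manhattan separates the two rows; Hamming treats them as equally far away
print(manhattan(query, one_step), manhattan(query, two_steps))
print(hamming(query, one_step), hamming(query, two_steps))

# One close neighbour of class 1 outweighs two distant neighbours of class 0
print(weighted_vote([(0.25, 1), (1.0, 0), (1.0, 0)]))
```

In this toy vote an unweighted majority would choose class 0, so the example also shows how the weighting strategy can change a prediction.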


### Exercise 5.
The transformation of the data in Q2 implicitly creates ordinal attributes. At first glance, such a strategy does seem reasonable in light of the given values (such as *small, med, big*).
A different strategy would be to `binarise` the attributes: to replace a categorical attribute having `m` values with `m binary attributes`. One way of doing this in scikit-learn is using the **OneHotEncoder** :

```python
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(X)
X_trans = ohe.transform(X).toarray()
```

Note that this transformation should be done before we split the data into training and test sets. (Why?)

- **(a)** Check the shape of `X_trans` — how many attributes do we have now? Does this correspond to your expectations?

In [26]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(X_ordinal)
X_trans = ohe.transform(X_ordinal).toarray()

print(X_trans.shape)
print('X[0]:', X[0])
print('X_trans[0]:', X_trans[0])


(1728, 21)
X[0]: ['vhigh', 'vhigh', '2', '2', 'small', 'low']
X_trans[0]: [0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0.]
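
The 21 columns reported above are what the value counts from Exercise 2 suggest: 4 + 4 + 4 + 3 + 3 + 3. The following plain-Python sketch rebuilds the first encoded row by hand to make the layout visible. The function `one_hot_row` is hypothetical and assumes the integer codes of each attribute run from 0 upwards, as in `X_ordinal`.

```python
def one_hot_row(row, n_values):
    """Expand integer codes into concatenated indicator vectors."""
    encoded = []
    for code, size in zip(row, n_values):
        indicator = [0.0] * size
        indicator[code] = 1.0  # exactly one position is switched on per attribute
        encoded.extend(indicator)
    return encoded

# Number of distinct values per attribute, read off the feature listing in Exercise 2
n_values = [4, 4, 4, 3, 3, 3]

first_row = one_hot_row([3, 3, 0, 0, 0, 0], n_values)
print(len(first_row))
print(first_row)
```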


- **(b)** Split the dataset comprised of `one–hot attributes` into **train** and **test** sets. Compare the accuracy of the Bernoulli Naive Bayes and KNN models using ordinal attributes with that of the same models using `one–hot attributes`: are you surprised? What can we infer?



In [28]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

bnb_accs = []
bnb = BernoulliNB()
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='hamming')
knn_accs = []

for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X_trans, y_num, test_size=0.33, random_state=i)
   
    bnb.fit(X_train, y_train)
    knn.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    kacc = knn.score(X_test, y_test)
    print(f"BNB score {acc}\tKNN score {kacc}")
    bnb_accs.append(acc)
    knn_accs.append(kacc)
    
print(f'\nAvg BNB score: {np.mean(bnb_accs)} \t ({np.std(bnb_accs)})')
print(f'Avg KNN score: {np.mean(knn_accs)} \t ({np.std(knn_accs)})')

BNB score 0.8371278458844134	KNN score 0.882661996497373
BNB score 0.8914185639229422	KNN score 0.8739054290718039
BNB score 0.8581436077057794	KNN score 0.8931698774080561
BNB score 0.8546409807355516	KNN score 0.8669001751313485
BNB score 0.8879159369527145	KNN score 0.8756567425569177

Avg BNB score: 0.8658493870402802 	 (0.020739574561243358)
Avg KNN score: 0.8784588441330998 	 (0.00890246236577151)
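
The figures in parentheses come from `np.std`, which by default divides by the number of scores (population standard deviation). The sketch below recomputes them from the five scores printed above using the standard-library `statistics` module, and also shows the sample standard deviation, which divides by n - 1 and is somewhat larger for so few runs. Variable names are illustrative.

```python
import statistics

# Scores copied from the output above
bnb_scores = [0.8371278458844134, 0.8914185639229422, 0.8581436077057794,
              0.8546409807355516, 0.8879159369527145]
knn_scores = [0.882661996497373, 0.8739054290718039, 0.8931698774080561,
              0.8669001751313485, 0.8756567425569177]

summary = {}
for name, scores in (("BNB", bnb_scores), ("KNN", knn_scores)):
    mean = statistics.mean(scores)
    population_std = statistics.pstdev(scores)  # divides by n, like np.std's default
    sample_std = statistics.stdev(scores)       # divides by n - 1
    summary[name] = (mean, population_std, sample_std)
    print(f"{name}: mean={mean:.4f}  pstdev={population_std:.4f}  stdev={sample_std:.4f}")
```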


A few observations.

*Both classifiers see a performance boost (on average, BNB from about 0.79 to 0.87 and KNN from about 0.83 to 0.88) and now perform very similarly. The standard deviation of the BNB results is more than twice that of KNN.*

*The Bernoulli NB classifier improves in performance, which is unsurprising: only after one-hot encoding does the data distribution match the assumptions of the classifier (namely, that the values of each feature are generated by a Bernoulli distribution).*

*At this point, we can also observe that scikit-learn's Bernoulli NB is able to do ... something ... with non-binary attributes. Most classifiers will return SOME result for whatever data is fed in. Is the Bernoulli classifier on the ordinal data representation a useful machine learning model? Is it a valid machine learning model? Which model can we trust more in deployment? These questions are vital, and they can only be answered with a sound understanding of the underlying algorithms.*
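
One candidate for the "something": `BernoulliNB` has a `binarize` parameter that defaults to `0.0`, so feature values greater than zero are mapped to 1 before fitting. Assuming that default was in effect for the ordinal runs, the sketch below imitates the thresholding on three rows of `X_ordinal`. The helper `threshold_rows` is hypothetical and only mimics that step.

```python
def threshold_rows(rows, threshold=0.0):
    """Map every value above the threshold to 1 and everything else to 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]

# First three rows of X_ordinal, copied from the output in Exercise 2
ordinal_rows = [
    [3, 3, 0, 0, 0, 0],
    [3, 3, 0, 0, 0, 1],
    [3, 3, 0, 0, 0, 2],
]

binary_rows = threshold_rows(ordinal_rows)
for before, after in zip(ordinal_rows, binary_rows):
    print(before, "->", after)
```

Under this thresholding the second and third rows become identical, so the distinction between the codes 1, 2 and 3 is lost; only "lowest value" versus "anything else" survives for each attribute.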