# School of Computing and Information Systems

## The University of Melbourne
### COMP30027 MACHINE LEARNING (Semester 1, 2019)
### Practical exercises: Week 6
Today, we expect you to refer to the scikit-learn API (http://scikit-learn.org/stable/modules/classes.html); you should also refer to previous weeks’ exercises where necessary.

# 1.
We will use the Car Evaluation dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data).

*There is an excellent chance that you already have this data, from Project 1.*

### (a) 
Load the data into a suitable format for scikit-learn, for example:
```python
X, y = [], []
with open("car.data") as f:
    for line in f:
        atts = line.strip().split(",")
        X.append(atts[:-1])  # attribute values
        y.append(atts[-1])   # class label
```

In [2]:
X = []
y = []
with open('car.data', mode='r') as fin:
    for line in fin:
        atts = line.strip().split(",")
        X.append(atts[:-1]) #all atts, excluding the class
        y.append(atts[-1])

### (b) 
How many instances are there in this collection? How many attributes, and of what type(s)?
What is the class we’re trying to predict, and how many values does it take?

In [3]:
from collections import Counter
print('There are', len(X), 'instances')
print('There are', len(X[0]), "attributes, for example:", X[0])
print('There are', len(set(y)), "class labels:", set(y))   
#use Counter to count the number of labels
label_counter = Counter(y)
print("Label frequencies: %s" %str(label_counter.most_common()))

There are 1728 instances
There are 6 attributes, for example: ['vhigh', 'vhigh', '2', '2', 'small', 'low']
There are 4 class labels: {'good', 'unacc', 'acc', 'vgood'}
Label frequencies: [('unacc', 1210), ('acc', 384), ('good', 69), ('vgood', 65)]


### (c) 
Are there any missing attribute values? Is there any evidence that this is an artificially constructed
dataset?

*You might want to check out the data description:*
https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names

*The exact (hierarchical) mechanism for determining the class value is described there. Moreover, every combination of attribute values is listed in the dataset exactly once, which is unusual for real datasets. (Effectively, we might expect that this is the entire population, and not an incomplete sample of it.)*
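
*One way to check the "every combination exactly once" observation is to enumerate the full Cartesian product of the attribute values. The sketch below assumes the value lists given in the data description; `is_full_factorial` is a hypothetical helper name, not part of any library.*

```python
from itertools import product

# Possible values of the six attributes (assumed from the data description).
values = [
    ["vhigh", "high", "med", "low"],
    ["vhigh", "high", "med", "low"],
    ["2", "3", "4", "5more"],
    ["2", "4", "more"],
    ["small", "med", "big"],
    ["low", "med", "high"],
]

all_combinations = set(product(*values))
print(len(all_combinations))  # 1728 = 4 * 4 * 4 * 3 * 3 * 3

def is_full_factorial(X):
    """True if X lists every possible attribute combination exactly once."""
    rows = [tuple(row) for row in X]
    return len(rows) == len(set(rows)) and set(rows) == all_combinations
```

*Calling `is_full_factorial(X)` on the loaded attribute lists would then confirm (or refute) that the dataset is the entire population rather than a sample.*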

### (d) 
What happens if we try to build a classifier (using fit()) using this data?

In [4]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X, y)

'''
scikit-learn expects numerical values, so we need to find a way to convert the string values to numbers.
Otherwise, we see the following error.
'''

ValueError: could not convert string to float: 'vhigh'

# 2. 
Unfortunately, scikit-learn isn’t set up to deal with our attributes in this format.

### (a) 
Write some functions that transform our categorical attributes into numerical attributes, by
(perhaps arbitrarily) assigning each categorical value to an integer, for example:

```python
def convert_class(raw):
    if raw=="unacc": return 0
    elif raw=="acc": return 1
    elif raw=="good": return 2
    elif raw=="vgood": return 3
```


In [5]:
# We could check this from the "car.names" file linked above
# Here's one (somewhat inefficient) way of reading this from the data itself
feature_1_values = set([X[i][0] for i in range(len(X))])
feature_2_values = set([X[i][1] for i in range(len(X))])
feature_3_values = set([X[i][2] for i in range(len(X))])
feature_4_values = set([X[i][3] for i in range(len(X))])
feature_5_values = set([X[i][4] for i in range(len(X))])
feature_6_values = set([X[i][5] for i in range(len(X))])
print("feature 1: %s" %str(feature_1_values))
print("feature 2: %s" %str(feature_2_values))
print("feature 3: %s" %str(feature_3_values))
print("feature 4: %s" %str(feature_4_values))
print("feature 5: %s" %str(feature_5_values))
print("feature 6: %s" %str(feature_6_values))

feature 1: {'low', 'med', 'vhigh', 'high'}
feature 2: {'low', 'med', 'vhigh', 'high'}
feature 3: {'5more', '2', '4', '3'}
feature 4: {'2', '4', 'more'}
feature 5: {'med', 'big', 'small'}
feature 6: {'low', 'med', 'high'}


In [6]:
import numpy as np

def convert_feature_1and2and6(raw):
    if raw == "low": return 0
    elif raw == "med": return 1
    elif raw == "high": return 2
    elif raw == "vhigh": return 3
    # In general, we might want to catch unexpected values, too
def convert_feature_3(raw):
    if raw == "2": return 0
    elif raw == "3": return 1
    elif raw == "4": return 2
    elif raw == "5more": return 3
def convert_feature_4(raw):
    if raw == "2": return 0
    elif raw == "4": return 1
    elif raw == "more": return 2
def convert_feature_5(raw):
    if raw == "small": return 0
    elif raw == "med": return 1
    elif raw == "big": return 2
def convert_class(raw):
    if raw == "unacc": return 0
    elif raw == "acc": return 1
    elif raw == "good": return 2
    elif raw == "vgood": return 3

X_ordinal = []
for x in X:
    f1, f2, f3, f4, f5, f6 = x
    f1 = convert_feature_1and2and6(f1)
    f2 = convert_feature_1and2and6(f2)
    f3 = convert_feature_3(f3)
    f4 = convert_feature_4(f4)
    f5 = convert_feature_5(f5)
    f6 = convert_feature_1and2and6(f6)
    x = [f1, f2, f3, f4, f5, f6]
    X_ordinal.append(x)
    
#convert to int array to make sure everything is converted.
X_ordinal = np.array(X_ordinal, dtype='int')


#convert ys
y_numeric = []
for this_y in y:
    this_y = convert_class(this_y)
    y_numeric.append(this_y)

y_num = np.array(y_numeric, dtype='int')


print('X shape: {}, y shape: {}'.format(X_ordinal.shape, y_num.shape))

X shape: (1728, 6), y shape: (1728,)
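
*The same conversion can be written more compactly with one dictionary per attribute, which also catches unexpected values: a dictionary lookup raises `KeyError` instead of silently returning `None`. This is only a sketch; the names `FEATURE_MAPS` and `encode_row` are hypothetical, and the integer codes follow the functions above.*

```python
import numpy as np

# Integer codes follow the convert_* functions above (an arbitrary but natural ordering).
LEVELS = {"low": 0, "med": 1, "high": 2, "vhigh": 3}
FEATURE_MAPS = [
    LEVELS,                                # feature 1
    LEVELS,                                # feature 2
    {"2": 0, "3": 1, "4": 2, "5more": 3},  # feature 3
    {"2": 0, "4": 1, "more": 2},           # feature 4
    {"small": 0, "med": 1, "big": 2},      # feature 5
    {"low": 0, "med": 1, "high": 2},       # feature 6
]

def encode_row(row):
    """Map one row of strings to integers; raises KeyError on an unexpected value."""
    return [mapping[value] for mapping, value in zip(FEATURE_MAPS, row)]

X_demo = [["vhigh", "vhigh", "2", "2", "small", "low"]]
X_demo_ordinal = np.array([encode_row(row) for row in X_demo], dtype=int)
print(X_demo_ordinal)  # [[3 3 0 0 0 0]]
```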


### (b) 
Load the dataset again, this time as integers. Observe that we can actually build a model
using this data.

In [7]:
clf.fit(X_ordinal, y_num)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

### (c) 
Split the data into training and test sets

In [8]:
from sklearn.model_selection import train_test_split # Newer versions
#from sklearn.cross_validation import train_test_split # Older versions
X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33)
print('X_train: {} X_test: {}'.format(X_train.shape, X_test.shape))

X_train: (1157, 6) X_test: (571, 6)


# 3. 
Read up on different implementations of the Naive Bayes classifier in sklearn.naive_bayes.
Which one do you think is most suitable for the dataset we have?

### (a) 
Train the (default) Naive Bayes model and determine its accuracy on the held–out test set.

### (b)
Compare the accuracies of all three different kinds of Naive Bayes classifier. Does this accord with your expectations?

In [9]:
import sklearn.naive_bayes as nb
print(dir(nb))
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb_accs = []
mnb_accs = []
bnb_accs = []
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=i)
    gnb.fit(X_train, y_train)
    acc = gnb.score(X_test, y_test)
    print("GNB score %f " %acc)
    gnb_accs.append(acc)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(X_test, y_test)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    
print('Avg GNB score: {}'.format(np.mean(gnb_accs)))
print('Avg MNB score: {}'.format(np.mean(mnb_accs)))
print('Avg BNB score: {}'.format(np.mean(bnb_accs)))

GNB score 0.686515 
MNB score 0.711033 
BNB score 0.763573 
GNB score 0.709282 
MNB score 0.705779 
BNB score 0.814361 
GNB score 0.719790 
MNB score 0.718039 
BNB score 0.779335 
Avg GNB score: 0.705195563339171
Avg MNB score: 0.7116170461179219
Avg BNB score: 0.7857559836544074


*It's no real surprise that Multinomial NB doesn't work here; for example "high" (2), is not really "medium" (1) repeated twice.*

*We might have expected that Gaussian NB would work a little bit here, but the ordering appears to be less significant than the feature values themselves. A secondary concern might be the uniform distribution of attribute values.*

### (c)

By default, this implementation of Naive Bayes uses Laplace smoothing. Turn this off, and
see what happens — what is the significance of the reported accuracy?

In [12]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

mnb_accs = []
bnb_accs = []
# Gaussian NB doesn't use smoothing; all of the probabilities for the Gaussian are already non-zero
# You can try this for yourself, but scikit-learn will flatly refuse to do it
#mnb = MultinomialNB(alpha=0)
#bnb = BernoulliNB(alpha=0)
mnb = MultinomialNB(alpha=1.0e-10)
bnb = BernoulliNB(alpha=1.0e-10)

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=i)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(X_test, y_test)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    
print('Avg MNB score: {}'.format(np.mean(mnb_accs)))
print('Avg BNB score: {}'.format(np.mean(bnb_accs)))

MNB score 0.711033 
BNB score 0.763573 
MNB score 0.705779 
BNB score 0.814361 
MNB score 0.718039 
BNB score 0.779335 
Avg MNB score: 0.7116170461179219
Avg BNB score: 0.7857559836544074


*Due to the implementation (as log-probabilities), numerical errors would result from unseen events.*

*This is now add-k smoothing, for a very small k. You can see that the predictions are largely the same for this particular dataset.*
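
*To see why a smoothing parameter of exactly zero is a problem, consider the add-k estimate itself. The sketch below uses made-up counts, and `smoothed_probs` is a hypothetical helper rather than scikit-learn's internal code; it only illustrates the general form (count + k) / (total + k × number of values).*

```python
import numpy as np

def smoothed_probs(counts, alpha):
    """Add-alpha estimate: (count + alpha) / (total + alpha * number_of_values)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

counts = [3, 0, 1]  # made-up counts; the second value was never seen with this class

with np.errstate(divide="ignore"):
    # Without smoothing, the unseen value has probability 0, so its log is -inf.
    print(np.log(smoothed_probs(counts, 0.0)))

# A tiny alpha keeps every log-probability finite (if very negative).
print(np.log(smoothed_probs(counts, 1e-10)))

# Laplace (add-one) smoothing, the scikit-learn default alpha=1.0.
print(smoothed_probs(counts, 1.0))
```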

### (d)

What happens if you increase the smoothing parameter instead? Calculate the accuracy for
a range of values from 5 to 500. For the very large values, examine the predicted classes for
the test instances — what is happening?

In [14]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

mnb_accs = []
bnb_accs = []
# Let's not mess around, and go straight to a large value:
mnb = MultinomialNB(alpha=500)
bnb = BernoulliNB(alpha=500)

for i in range(1):
    X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=i)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(X_test, y_test)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    print(bnb.predict(X_test))

MNB score 0.698774 
BNB score 0.698774 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

*For large values of the smoothing parameter, every instance is predicted to be the majority-class - effectively, this is the same behaviour as 0-R!*
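
*A rough sketch of why this happens: as the smoothing parameter grows, every smoothed likelihood drifts towards 1 / (number of values), so the likelihoods stop distinguishing the classes and the class prior decides. The counts passed to the hypothetical `smoothed` helper are made up; the label frequencies are those reported in Q1(b).*

```python
from collections import Counter

# Label frequencies reported in Q1(b).
label_counts = Counter({"unacc": 1210, "acc": 384, "good": 69, "vgood": 65})
majority_label, majority_count = label_counts.most_common(1)[0]
zero_r_accuracy = majority_count / sum(label_counts.values())
print(majority_label, round(zero_r_accuracy, 4))  # unacc 0.7002

def smoothed(count, total, n_values, alpha):
    """Add-alpha estimate of P(value | class)."""
    return (count + alpha) / (total + alpha * n_values)

# Made-up example: a value seen 30 times out of 40, for an attribute with 4 values.
for alpha in (1, 5, 50, 500):
    print(alpha, round(smoothed(30, 40, 4, alpha), 3))
```

*The 0-R accuracy over the whole dataset (about 0.70) is close to the 0.698774 reported above for the held-out split.*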

# 4.
The transformation of the data in Q2 implicitly creates ordinal attributes. At first glance, such a
strategy does seem reasonable in light of the given values (such as small, med, big).
A different strategy would be to binarise the attributes: to replace a categorical attribute having
m values with m binary attributes. One way of doing this in scikit-learn is using the
OneHotEncoder:
```python
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(X)
X_trans = ohe.transform(X).toarray()
```

Note that this transformation should be done before we split the data into training and test sets.
(Why?)
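
*One possible reason, sketched on toy data: if a value happens to be absent from the training split, an encoder fitted only on that split does not know about it. The tiny `train` and `test` lists below are made up for illustration.*

```python
from sklearn.preprocessing import OneHotEncoder

train = [["low"], ["med"]]   # toy training split: "high" never appears
test = [["high"]]

ohe_train_only = OneHotEncoder()
ohe_train_only.fit(train)
try:
    ohe_train_only.transform(test)
    unseen_rejected = False
except ValueError:
    # By default (handle_unknown='error'), unseen categories are rejected.
    unseen_rejected = True
print(unseen_rejected)

# Fitting on all of the data gives every split the same set of columns.
ohe_all = OneHotEncoder().fit(train + test)
n_columns = ohe_all.transform(test).toarray().shape[1]
print(n_columns)  # 3
```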

### (a) 
Check the shape of X_trans — how many attributes do we have now? Does this correspond
to your expectations?

In [15]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(X_ordinal)
X_trans = ohe.transform(X_ordinal).toarray()
print(X_trans.shape)
print("feature 1: %s" %str(feature_1_values))
print("feature 2: %s" %str(feature_2_values))
print("feature 3: %s" %str(feature_3_values))
print("feature 4: %s" %str(feature_4_values))
print("feature 5: %s" %str(feature_5_values))
print("feature 6: %s" %str(feature_6_values))
print('X[0]:', X[0])
print('X_trans[0]:', X_trans[0])
#print("categories: ", ohe.categories_)

(1728, 21)
feature 1: {'low', 'med', 'vhigh', 'high'}
feature 2: {'low', 'med', 'vhigh', 'high'}
feature 3: {'5more', '2', '4', '3'}
feature 4: {'2', '4', 'more'}
feature 5: {'med', 'big', 'small'}
feature 6: {'low', 'med', 'high'}
X[0]: ['vhigh', 'vhigh', '2', '2', 'small', 'low']
X_trans[0]: [0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0.]
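
*The 21 columns are simply the sum of the number of values per attribute (4 + 4 + 4 + 3 + 3 + 3). The sketch below re-creates the first row by hand, assuming the columns are laid out attribute by attribute in order of integer code; `one_hot_row` is a hypothetical helper, not the scikit-learn implementation.*

```python
n_values = [4, 4, 4, 3, 3, 3]   # number of distinct values of each attribute
print(sum(n_values))            # 21

def one_hot_row(row, n_values):
    """One-hot encode a row of integer codes, one block of columns per attribute."""
    encoded = []
    for value, n in zip(row, n_values):
        block = [0] * n
        block[value] = 1
        encoded.extend(block)
    return encoded

# X_ordinal[0] is [3, 3, 0, 0, 0, 0] for ['vhigh', 'vhigh', '2', '2', 'small', 'low'].
print(one_hot_row([3, 3, 0, 0, 0, 0], n_values))
# [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
```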


### (b) 

Split the dataset comprised of one–hot attributes into train and test sets. Compare the accuracies of the three Naive Bayes models using ordinal attributes with the three models using one–hot attributes: are you surprised? What can we infer?

In [16]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb_accs = []
mnb_accs = []
bnb_accs = []
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_trans, y_num, test_size=0.33, random_state=i)
    gnb.fit(X_train, y_train)
    acc = gnb.score(X_test, y_test)
    print("GNB score %f " %acc)
    gnb_accs.append(acc)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(X_test, y_test)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    
print('Avg GNB score: {}'.format(np.mean(gnb_accs)))
print('Avg MNB score: {}'.format(np.mean(mnb_accs)))
print('Avg BNB score: {}'.format(np.mean(bnb_accs)))

GNB score 0.793345 
MNB score 0.816112 
BNB score 0.837128 
GNB score 0.824869 
MNB score 0.865149 
BNB score 0.891419 
GNB score 0.789842 
MNB score 0.814361 
BNB score 0.858144 
Avg GNB score: 0.8026853473438411
Avg MNB score: 0.8318739054290717
Avg BNB score: 0.8622300058377116


*This is a fairly drastic difference: Bernoulli NB is still the best option, but both Gaussian and Multinomial NB are no longer useless. It appears that all of these learners can identify meaningful patterns, just by taking the attribute value in isolation (and not in relation to the presumed ordering) - and so, perhaps our original assignment of 0,1,2,3 was too simple to discover patterns.*

*At this point, we can also observe that the default behaviour of scikit-learn's Bernoulli NB is to do ... something ... with non-binary attributes, but it is usually better to make them explicitly binary using the one-hot transformer. (If you're curious, in this case, it's treating whichever value is 0 as "N", and the other values as "Y".)*
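
*The "N"/"Y" behaviour comes from the `binarize` parameter of `BernoulliNB`, which defaults to 0.0: values above the threshold become 1, everything else 0. The sketch below checks this on randomly generated toy data (not the car dataset) by comparing the default model against one trained on explicitly thresholded inputs.*

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X_toy = rng.randint(0, 4, size=(200, 6))   # toy ordinal codes 0..3
y_toy = rng.randint(0, 4, size=200)        # toy class labels

# Default: BernoulliNB thresholds the inputs itself (binarize=0.0).
implicit = BernoulliNB().fit(X_toy, y_toy)

# Explicit: threshold by hand and tell BernoulliNB the input is already binary.
X_bin = (X_toy > 0).astype(int)
explicit = BernoulliNB(binarize=None).fit(X_bin, y_toy)

same_model = np.allclose(implicit.feature_log_prob_, explicit.feature_log_prob_)
print(same_model)
```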

# 5. 
Recall that we built a DecisionTreeClassifier in Week 4.

### (a) 
Do you think the dataset comprised of ordinal attributes, or the dataset comprised of one–hot
attributes would be more appropriate for a typical Decision Tree? Check the test accuracy
of the default Decision Tree model built on each of these datasets.

In [21]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

dt = DecisionTreeClassifier(max_depth=None)
dt_accs = []
for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=i)
    
    dt.fit(X_train, y_train)
    acc = dt.score(X_test, y_test)
    print("DT score (ordinal) %f " %acc)
    dt_accs.append(acc)
print('Avg DT score (ordinal): {}'.format(np.mean(dt_accs)))

dt_accs = []
for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_trans, y_num, test_size=0.33, random_state=i)
    
    dt.fit(X_train, y_train)
    acc = dt.score(X_test, y_test)
    print("DT score (one-hot) %f " %acc)
    dt_accs.append(acc)
print('Avg DT score (one-hot): {}'.format(np.mean(dt_accs)))


DT score (ordinal) 0.975482 
DT score (ordinal) 0.957968 
DT score (ordinal) 0.975482 
Avg DT score (ordinal): 0.9696438995913602
DT score (one-hot) 0.961471 
DT score (one-hot) 0.947461 
DT score (one-hot) 0.947461 
Avg DT score (one-hot): 0.9521307647402217


### (b) 
How does the accuracy of the Decision Tree models compare with Naive Bayes on these
datasets? Why might this be?


*Since we already know that the test label is decided through a hierarchical logical process, the fact that the Decision Tree works well is probably not so surprising.*

*It's a little hard to see, but the DT is slightly better on the ordinal dataset compared to the one-hot dataset.* 

### (c) 
(n.b. This step might not be possible in the labs.) The main strategy for visualising a Decision
Tree in scikit-learn is through the export_graphviz() method. Read up on the method,
and explore whether the trees created in the previous question are indeed different.

In [23]:
from sklearn import tree
dt = DecisionTreeClassifier(max_depth=None)
dt.fit(X_ordinal, y_num)
tree.export_graphviz(dt, out_file='tree_ordinal.dot')     
dt.fit(X_trans, y_num)
tree.export_graphviz(dt, out_file='tree_onehot.dot')    

#print('go to http://viz-js.com/ (or any online website/tool that can visualise the Graphviz format) and copy/paste the content of the files to visualise the trees.')

*There are various web-hosted toolkits for visualising the graphviz format; one example is http://viz-js.com*

*You will notice that the one-hot encoded tree is larger (as it cannot group a sequence of ordered values into a single node). This does suggest that there is a little bit of significance to the attribute ordering, which a Decision Tree can learn, but Gaussian NB cannot (based on a dataset of this size).*
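
*The grouping effect can be seen on a toy problem with a single attribute and a made-up rule (class 1 if the code is 2 or 3). With the ordinal encoding, one threshold separates the classes; with one-hot columns, each split can only isolate a single value, so more nodes are needed.*

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

values = np.tile(np.arange(4), 10)      # ordinal codes 0..3, ten copies of each
labels = (values >= 2).astype(int)      # toy rule: codes 2 and 3 -> class 1

X_ord = values.reshape(-1, 1)           # one ordinal column
X_hot = np.eye(4, dtype=int)[values]    # one binary column per value

n_ord = DecisionTreeClassifier(random_state=0).fit(X_ord, labels).tree_.node_count
n_hot = DecisionTreeClassifier(random_state=0).fit(X_hot, labels).tree_.node_count
print(n_ord, n_hot)  # 3 5
```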

### (d) 
Try altering the value of the max_depth parameter, between 1 and None. Visualise the resulting trees, if you can. Compare the estimated training and test accuracies; is there any
evidence that the algorithm is over–fitting (or under–fitting) for trees of certain depths?

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=4)
for max_depth in range(1, 12):
    dt = DecisionTreeClassifier(max_depth=max_depth)
    dt.fit(X_train, y_train)
    acc_test = dt.score(X_test, y_test)
    acc_train = dt.score(X_train, y_train)
    print('DT ordinal depth={}:'.format(max_depth), 'acc_train: {:.2f} acc_test: {:.2f}'.format(acc_train, acc_test))
    tree.export_graphviz(dt, out_file='tree_ordinal_depth={}.dot'.format(max_depth))   

DT ordinal depth=1: acc_train: 0.70 acc_test: 0.71
DT ordinal depth=2: acc_train: 0.77 acc_test: 0.80
DT ordinal depth=3: acc_train: 0.79 acc_test: 0.79
DT ordinal depth=4: acc_train: 0.85 acc_test: 0.86
DT ordinal depth=5: acc_train: 0.87 acc_test: 0.88
DT ordinal depth=6: acc_train: 0.93 acc_test: 0.94
DT ordinal depth=7: acc_train: 0.94 acc_test: 0.93
DT ordinal depth=8: acc_train: 0.97 acc_test: 0.98
DT ordinal depth=9: acc_train: 0.99 acc_test: 0.96
DT ordinal depth=10: acc_train: 0.99 acc_test: 0.98
DT ordinal depth=11: acc_train: 1.00 acc_test: 0.98


*Both the training accuracy and the test accuracy start low (a depth-1 tree is effectively 1-R), and gradually improve. Between depths 8 and 10, the test accuracy reaches a plateau, but the training accuracy can be improved further - this is probably where over-fitting happens.*

*After all, there are only 21 attribute values in total; for the deeper trees, we are considering almost all of them!*

# 6. 
The filtering approach to Feature Selection in scikit-learn can be done using SelectKBest.

### (a) 
What happens to the shape of X_train and X_test now?

### (b)
Find out what the best features were for your dataset, according to $\chi^2$.

In [40]:
from sklearn.feature_selection import SelectKBest, chi2

X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=4)

x2 = SelectKBest(chi2, k=3)
X_train = x2.fit_transform(X_train,y_train)
X_test = x2.transform(X_test)
print(X_train.shape)
print(X_test.shape)

for feat_num in x2.get_support(indices=True):
    print(feat_num)
    
#It's more difficult to keep track of the various attributes after they have been one-hot encoded:
X_train, X_test, y_train, y_test = train_test_split(X_trans, y_num, test_size=0.33, random_state=4)

x2 = SelectKBest(chi2, k=5)
X_train = x2.fit_transform(X_train,y_train)
X_test = x2.transform(X_test)
print(X_train.shape)
print(X_test.shape)

for feat_num in x2.get_support(indices=True):
    print(feat_num)

(1157, 3)
(571, 3)
0
3
5
(1157, 5)
(571, 5)
0
12
13
18
20


*Comparing these, note that the one-hot indices refer to columns, not attributes: columns 0, 12, 13, 18 and 20 come from attributes 0, 3 and 5, which are the same three attributes selected on the ordinal data. Of course, we would rather not do feature selection over the one-hot data, because there will be numerous test instances that don't have any of the most predictive values!*
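
*To interpret the one-hot column indices, it can help to map each column back to its attribute and value. The sketch below assumes the attribute names listed in car.names and that the one-hot columns are laid out attribute by attribute, in order of the integer codes from Q2; `COLUMN_LABELS` is a hypothetical name.*

```python
# Attribute names as listed in car.names; value order follows the Q2 integer codes.
ATTRIBUTES = [
    ("buying", ["low", "med", "high", "vhigh"]),
    ("maint", ["low", "med", "high", "vhigh"]),
    ("doors", ["2", "3", "4", "5more"]),
    ("persons", ["2", "4", "more"]),
    ("lug_boot", ["small", "med", "big"]),
    ("safety", ["low", "med", "high"]),
]

# One (attribute, value) pair per one-hot column, in column order.
COLUMN_LABELS = [(name, value) for name, vals in ATTRIBUTES for value in vals]

selected = [0, 12, 13, 18, 20]  # indices printed by get_support(indices=True) above
for col in selected:
    print(col, COLUMN_LABELS[col])
```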