#### The University of Melbourne, School of Computing and Information Systems
# COMP90049 Introduction to Machine Learning, 2023 Semester 1

## Week 4 

This week, we will be using `scikit-learn` to classify some data, and to evaluate some classifiers.

Scikit-learn is a popular Python library for machine learning that provides a wide range of tools for data preprocessing, modeling, and evaluation. It is built on top of NumPy, SciPy, and Matplotlib, and is designed to be easy to use and efficient for both small and large-scale machine learning tasks.

Scikit-learn includes a variety of machine learning algorithms for classification, regression, clustering, and dimensionality reduction, and provides a consistent API for using them. It also includes tools for feature extraction, selection, and scaling, as well as for evaluating and optimizing machine learning models.
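
As a rough illustration of what a "consistent API" means in practice, the sketch below trains two unrelated classifiers with the same `fit` / `score` calls. The Iris data, the choice of `KNeighborsClassifier`, and the variable names (`X_demo`, `demo_scores`, etc.) are assumptions made for this example only; they are not part of this week's exercises.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, used here only to demonstrate the API
X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Every estimator exposes the same fit / score methods
demo_scores = {}
for model in (GaussianNB(), KNeighborsClassifier()):
    model.fit(X_tr, y_tr)                                      # learn from training data
    demo_scores[type(model).__name__] = model.score(X_te, y_te)  # accuracy on held-out data

print(demo_scores)
```

Because the interface is shared, swapping one classifier for another usually means changing a single line.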



In [1]:
import numpy as np
import pandas as pd

from sklearn import datasets
import matplotlib.pyplot as plt

### Exercise 1.
Please load the Car Evaluation dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data).

The common terminology in scikit-learn is that the array defining the attribute values is called `X` and the array defining the gold-standard (“ground truth”) labels is called `y`; create these variables for the car data.

- **(a)** Load the data into a suitable format for scikit-learn:


In [2]:
data = pd.read_csv('car.data', header = None)
X = ...
y = data.iloc[:,-1]


- **(b)** How many instances are there in this collection? How many attributes, and of what type(s)? What is the class we’re trying to predict, and how many values does it take?

In [3]:
from collections import Counter
print('There are', ..., 'instances')
print('There are', ..., "attributes")

unique_labels, label_counts = np.unique(y, return_counts=True)

print('There are', len(unique_labels), "class labels:", unique_labels)   
#use Counter to count the number of labels
label_counter = Counter(y)
print("Label frequencies:", list(zip(unique_labels, label_counts)))

There are Ellipsis instances
There are Ellipsis attributes
There are 4 class labels: ['acc' 'good' 'unacc' 'vgood']
Label frequencies: [('acc', 384), ('good', 69), ('unacc', 1210), ('vgood', 65)]


### Exercise 2
Unfortunately, scikit-learn isn’t set up to deal with our attributes in this format.

- **(a)** Write some functions that transform our **categorical** attributes into **numerical** attributes, by (perhaps arbitrarily) assigning each categorical value to an integer, for example:


"unacc" = 0
"acc" = 1
"good" = 2
"vgood" = 3

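One possible way to build such a mapping is sketched below on a tiny made-up table. The column names `size` and `safety`, the toy values, and the helper `encode_column` are all hypothetical; the real `car.data` file has no header row and different columns. The sketch assigns integers in alphabetical order, which is arbitrary and need not match any natural ordering of the values.

```python
import pandas as pd

# Toy data with hypothetical column names (not the real car.data columns)
toy = pd.DataFrame({
    "size": ["small", "big", "med", "small"],
    "safety": ["low", "high", "med", "high"],
})

def encode_column(col):
    """Map each distinct value of a categorical column to an integer.

    Codes follow alphabetical order of the values, which is an arbitrary choice.
    """
    mapping = {value: code for code, value in enumerate(sorted(col.unique()))}
    return col.map(mapping), mapping

encoded = toy.copy()
mappings = {}
for name in toy.columns:
    encoded[name], mappings[name] = encode_column(toy[name])

print(mappings)
print(encoded)
```

Note that alphabetical coding puts `big` before `med` before `small`; if the order of the integers matters to the classifier, an explicit hand-written dictionary per attribute may be the better choice.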


In [4]:
# Here's one way of reading this from the data itself
for column in X:
    print("for feature", column, "the values are", np.unique(X[column]))

TypeError: 'ellipsis' object is not iterable

In [None]:
X_num = ...  

In [None]:
for column in X_num:
    print("for feature", column, "the values are", np.unique(X_num[column]))

- **(b)** Split the data into training (80%) and test sets (20%)

In [None]:
from sklearn.model_selection import train_test_split 

# Hold out 20% of the numeric data for testing, matching the 80/20 split requested above
X_train, X_test, y_train, y_test = train_test_split(X_num, y, test_size=0.2)
print('X_train: {} X_test: {}'.format(X_train.shape, X_test.shape))

### Exercise 3.
Read up on different implementations of the Naive Bayes classifier in `sklearn.naive_bayes`. Which one do you think is most suitable for the dataset we have?

- **(a)** Compare the accuracies of all three different kinds of Naive Bayes classifier. Does this accord with your expectations?

In [None]:
import sklearn.naive_bayes as nb
##print(dir(nb))
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb_accs = []
mnb_accs = []
bnb_accs = []
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_num, y, test_size=0.33, random_state=i)
    gnb.fit(X_train, y_train)
    acc = gnb.score(X_test, y_test)
    print("\nGNB score %f " %acc)
    gnb_accs.append(acc)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(X_test, y_test)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    
print('\nAvg GNB score:', ...)
print('Avg MNB score:', ...)
print('Avg BNB score:', ...)

    

- **(b)** By default, this implementation of Naive Bayes uses Laplace smoothing. Turn this off, and see what happens — what is the significance of the reported accuracy?

In [None]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

mnb_accs = []
bnb_accs = []
# Gaussian NB doesn't use smoothing; all of the probabilities for the Gaussian are already non-zero
# You can try this for yourself, but scikit-learn will flatly refuse to do it
mnb = MultinomialNB(alpha=0)
bnb = BernoulliNB(alpha=0)
mnb.fit(X_train, y_train)
acc = mnb.score(X_test, y_test)
print("MNB score %f " %acc)
    
    
bnb.fit(X_train, y_train)
acc = bnb.score(X_test, y_test)
print("BNB score %f " %acc)

*Due to the implementation (as log-probabilities), numerical errors would result from unseen events.*
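
To make the note above concrete, here is a minimal sketch of additive smoothing on made-up counts. The counts and the helper `smoothed_probs` are illustrative assumptions; this is not scikit-learn's internal code, only the textbook formula (count + alpha) / (total + alpha * number of values).

```python
import numpy as np

# Hypothetical counts of three attribute values within one class;
# the second value was never seen with this class
counts = np.array([3, 0, 7])

def smoothed_probs(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * n_values)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

unsmoothed = smoothed_probs(counts, alpha=0.0)   # plain relative frequencies
laplace = smoothed_probs(counts, alpha=1.0)      # Laplace (add-one) smoothing

# Taking logs of the unsmoothed estimates yields -inf for the unseen value
with np.errstate(divide="ignore"):
    log_unsmoothed = np.log(unsmoothed)
log_laplace = np.log(laplace)

print("unsmoothed:", unsmoothed, "log:", log_unsmoothed)
print("laplace:   ", laplace, "log:", log_laplace)
```

With `alpha=0`, the unseen value gets probability 0 and log-probability `-inf`, so any instance containing it cannot be scored sensibly for that class; with `alpha=1`, every log-probability is finite.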



- **(c)** What happens if we change the smoothing parameter ($\alpha$)? Calculate the accuracy for a range of values from 5 to 500. For the very large values, examine the predicted classes for the test instances — what is happening?

In [None]:
Alpha_list = [...]

for i in Alpha_list:
    mnb = MultinomialNB(alpha=i)
    bnb = BernoulliNB(alpha=i)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(X_test, y_test)
    print("\nMNB with alpha =", i ," score is %f " %acc)

    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB with alpha =", i ," score is %f " %acc)
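
The effect of a very large smoothing parameter can be sketched on synthetic data. Everything below (the random attributes, the 70/30 class split, and the exaggerated value `alpha=1e9`) is an assumption chosen to make the behaviour obvious; results on the car data with alpha up to 500 may be less extreme.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_toy = rng.integers(0, 4, size=(100, 6))   # synthetic attributes coded 0..3
y_toy = np.array([0] * 70 + [1] * 30)       # synthetic, imbalanced labels

# Compare the predicted class distribution for a small and a huge alpha
for alpha in (1.0, 1e9):
    preds = MultinomialNB(alpha=alpha).fit(X_toy, y_toy).predict(X_toy)
    labels, counts = np.unique(preds, return_counts=True)
    print("alpha =", alpha, "predicted label counts:", dict(zip(labels, counts)))

# Keep the huge-alpha predictions for inspection
huge_alpha_preds = MultinomialNB(alpha=1e9).fit(X_toy, y_toy).predict(X_toy)
```

When alpha dwarfs the observed counts, the smoothed likelihoods become almost identical across classes, so the class prior decides the prediction and every instance is assigned the majority class.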

### Exercise 4.
The transformation of the data in Exercise 2 implicitly creates ordinal attributes. At first glance, such a strategy seems reasonable given values such as *small, med, big*.
A different strategy is to *binarise* the attributes: to replace a categorical attribute having `m` values with `m` binary attributes. One way of doing this in scikit-learn is using the **OneHotEncoder**:

```python
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(X)
X_trans = ohe.transform(X).toarray()
```

Note that this transformation should be done before we split the data into training and test sets. (Why?)
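
One practical reason can be sketched with a toy attribute: if the encoder is fitted on a subset in which some value never occurs, it has no column for that value. The arrays `train_part` and `test_part` below are made up for illustration, and the sketch relies on `OneHotEncoder`'s default `handle_unknown='error'` setting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single attribute; 'high' never appears in the training part
train_part = np.array([["low"], ["med"], ["low"]])
test_part = np.array([["high"]])

# Fitting on the training part alone: 'high' is unknown at transform time
enc = OneHotEncoder()
enc.fit(train_part)
try:
    enc.transform(test_part)
    unseen_error = None
except ValueError as err:
    unseen_error = err
print("Error on unseen category:", unseen_error)

# Fitting on all the data first gives every value its own column
full_enc = OneHotEncoder().fit(np.vstack([train_part, test_part]))
full_row = full_enc.transform(test_part).toarray()
print(full_enc.categories_, full_row)
```

Encoding before the split also guarantees that the training and test sets have the same number of columns in the same order.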

- **(a)** Check the shape of `X_trans` — how many attributes do we have now? Does this correspond to your expectations?

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(X)
X_trans_ohe = ohe.transform(X).toarray()
X_trans = pd.DataFrame(X_trans_ohe)

print(X_trans.shape)
print('X:', X.iloc[0])
print('X_trans:', X_trans.iloc[0])


- **(b)** Split the dataset comprised of `one–hot attributes` into **train** and **test** sets. Compare the accuracies of the three Naive Bayes models using ordinal attributes with the three models using `one–hot attributes`: are you surprised? What can we infer?



In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb_accs = []
mnb_accs = []
bnb_accs = []
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_trans, y, test_size=0.33, random_state=i)
    gnb.fit(X_train, y_train)
    acc = gnb.score(X_test, y_test)
    print("\nGNB score %f " %acc)
    gnb_accs.append(acc)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(X_test, y_test)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    
print('\nAvg GNB score: {}'.format(np.mean(gnb_accs)))
print('Avg MNB score: {}'.format(np.mean(mnb_accs)))
print('Avg BNB score: {}'.format(np.mean(bnb_accs)))

*This is a fairly drastic difference: Bernoulli NB is still the best option, but both Gaussian and Multinomial NB are no longer useless. It appears that all of these learners can identify meaningful patterns, just by taking the attribute value in isolation (and not in relation to the presumed ordering) - and so, perhaps our original assignment of 0,1,2,3 was too simple to discover patterns.*

*At this point, we can also observe that the default behaviour of scikit-learn's Bernoulli NB is to do ... something ... with non-binary attributes, but it is usually better to make them explicitly binary using the one-hot transformer. (If you're curious, in this case, it's treating whichever value is 0 as "N", and the other values as "Y".)*
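
The thresholding behaviour described above can be checked with a small sketch on synthetic data (the arrays `X_ord` and `y_syn` are random and purely illustrative). It relies on `BernoulliNB`'s `binarize` parameter, whose default is `0.0`: attribute values greater than the threshold are treated as 1 and the rest as 0.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X_ord = rng.integers(0, 4, size=(50, 3))   # synthetic attributes coded 0..3
y_syn = rng.integers(0, 2, size=50)        # synthetic binary labels

# Model fitted on the raw integer codes (default binarize=0.0 thresholds them)
bnb_raw = BernoulliNB().fit(X_ord, y_syn)

# Model fitted on attributes thresholded by hand: 0 stays 0, anything else becomes 1
bnb_bin = BernoulliNB().fit((X_ord > 0).astype(int), y_syn)

same = np.allclose(bnb_raw.feature_log_prob_, bnb_bin.feature_log_prob_)
print("Same learned parameters:", same)
```

Both models learn the same parameters, which suggests that the integer codes 1, 2 and 3 are indistinguishable to Bernoulli NB unless the attributes are one-hot encoded first.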

### Exercise 5.

Now let's look at some other evaluation metrics.

In [None]:
from sklearn.metrics import classification_report

gnb.fit(X_train, y_train)
gnb_predictions = gnb.predict(X_test)

mnb.fit(X_train, y_train)
mnb_predictions = mnb.predict(X_test)

bnb.fit(X_train, y_train)
bnb_predictions = bnb.predict(X_test)

print("\n\n ===========\n MNB FULL RESULTS\n===========")
print(classification_report(y_test,mnb_predictions))

print("\n\n ===========\n BNB FULL RESULTS\n===========")
print(classification_report(y_test,bnb_predictions))

print("\n\n ===========\n GNB FULL RESULTS\n===========")
print(classification_report(y_test,gnb_predictions))



- **(a)** What is the difference between the macro average and the weighted average? Why do the results differ between Bernoulli, Multinomial and Gaussian Naive Bayes?

- **(b)** In this dataset, which label has the best F1-score across all the NB models? Can you explain why?
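
As a small aid for part (a), the sketch below computes both averages on a made-up, imbalanced label set where the classifier always predicts the majority class. The labels `a` and `b` and the predictions are invented for illustration; only `f1_score` and its `average` and `zero_division` parameters come from scikit-learn.

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels (8 of class 'a', 2 of class 'b') and predictions
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10   # the classifier never predicts 'b'

# Per-class F1, then the two ways of averaging it
per_class = f1_score(y_true, y_pred, average=None, labels=["a", "b"], zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)        # unweighted mean
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)  # weighted by support

print("per-class F1:", per_class)
print("macro F1:    ", macro)
print("weighted F1: ", weighted)
```

The macro average treats both classes equally, so the class that is never predicted halves the score; the weighted average scales each class by its number of true instances, so the majority class dominates.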

## Challenge Question
In this dataset, which label has the worst F1-score across all the NB models? Can you explain why?