# School of Computing and Information Systems
### The University of Melbourne
### COMP30027 MACHINE LEARNING (Semester 1, 2019)
### Practical exercises: Week 7
Today, we’re going to do a “bake–off” between Logistic Regression and the classifiers which are
its most obvious competitors: Naive Bayes (a simpler probabilistic approach) and Support Vector Machines (using a linear kernel).
Don’t forget that you should refer back to earlier weeks or the scikit-learn API where necessary.

###  1. 
Let’s begin with the simple Iris dataset. Recall that this dataset has four numerical attributes, three classes, and only a small number of instances.
Using Zero-R as a baseline, and the following classifiers in their default configurations (i.e. no parameter tuning):

(i) naive_bayes.MultinomialNB

(ii) naive_bayes.GaussianNB

(iii) svm.LinearSVC

(iv) svm.SVC (this should be the only time you try this model today!)

(v) linear_model.LogisticRegression

Which has the best accuracy when cross–validating on this dataset? (Note that the whole bake–off should only take at most a couple of seconds to run.) Why?
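As a reminder of what the Zero-R baseline does, here is a minimal sketch in plain Python. The helper name `zero_r_accuracy` is made up for illustration, and the labels are a stand-in with the same class balance as Iris (50 instances of each of three classes); it scores on the same labels it was "trained" on rather than cross-validating.

```python
from collections import Counter

def zero_r_accuracy(train_labels, test_labels):
    """Predict the most frequent training class for every test instance."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return correct / len(test_labels)

# Stand-in labels with the Iris class balance: 50 instances per class.
iris_like = [0] * 50 + [1] * 50 + [2] * 50
acc = zero_r_accuracy(iris_like, iris_like)
print(acc)
```

With perfectly balanced classes, Zero-R can do no better than one in three, which is the figure any real classifier needs to beat.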

In [13]:
from sklearn import svm, datasets
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
import time
import numpy as np
np.random.seed(30027)
iris = datasets.load_iris()
#print(dir(iris))
X = iris.data
y = iris.target

models = [DummyClassifier(strategy='most_frequent'),
          GaussianNB(),
          MultinomialNB(),
          svm.LinearSVC(),
          svm.SVC(),
          LogisticRegression()]
titles = ['Zero-R',
          'GNB',
          'MNB',
          'LinearSVC',
          'SVC',
          'Logistic Regression']


for title, model in zip(titles, models):
    start = time.time()
    acc = np.mean(cross_val_score(model, X, y, cv=10))
    end = time.time()
    t = end - start
    print(title, acc, 'time:', t)

Zero-R 0.33333333333333337 time: 0.009168148040771484
GNB 0.9533333333333334 time: 0.014063835144042969
MNB 0.9533333333333334 time: 0.011994123458862305
LinearSVC 0.9666666666666668 time: 0.07668805122375488
SVC 0.9800000000000001 time: 0.018596887588500977
Logistic Regression 0.9533333333333334 time: 0.015269041061401367


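The classes are not fully linearly separable, so SVC (which uses a non-linear RBF kernel by default) is able to slightly outperform the linear models. With only 150 instances, though, the gap amounts to a handful of instances.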

### 2.
Download the Abalone dataset from the UCI ML repository https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data. The class here is numeric
(number of rings, which defines the age of the mollusc), but we can set up a two-class problem:
```python
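def convert_class(raw):
    # 10 rings or fewer -> class 0; 11 rings or more -> class 1
    if int(raw) <= 10:
        return 0
    else:
        return 1

X, y = [], []
with open('abalone.data') as f:
    for line in f:
        atts = line[:-1].split(",")
        X.append(atts[1:-1])  # skip atts[0] and the class label
        y.append(convert_class(atts[-1]))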
```

Don’t forget to convert X to a numpy array using .astype(float). Again, use Zero-R as a
baseline, and the four classifiers above (forget about SVC() ):

### (a)
Who wins this bake–off? Why might that be? (It should again take at most a few seconds to
run.)

In [2]:
### Python can download files, too (-:
#import os
#import urllib.request
#
#if os.path.isfile('abalone.data'):
#    print('already downloaded.')
#else:
#    urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', 'abalone.data')

In [17]:
models = [DummyClassifier(strategy='most_frequent'),
          GaussianNB(),
          MultinomialNB(),
          svm.LinearSVC(),
          LogisticRegression()]
titles = ['Zero-R',
          'GNB',
          'MNB',
          'LinearSVC',
          'Logistic Regression']

In [5]:
def convert_class(raw, num_class=2):
    raw = int(raw)
    if num_class == 2:
        if raw<=10: return 0
        else: return 1
    elif num_class == 3:
        if raw <= 8:
            return 0
        elif 9<=raw<=10:
            return 1
        elif 11<=raw:
            return 2
    elif num_class == 29:
        return raw

def load_abalone(addsex=False, num_class=2):
    X, y = [], []
    with open('abalone.data', 'r') as fin:
        for line in fin:
            atts = line[:-1].split(",")
            if not addsex:
                X.append(atts[1:-1])
            else:
                sex = atts[0]
                if sex == "M": sex = 0
                elif sex=="I": sex = 1
                elif sex=="F": sex = 2
                else: sex = 3
                
                X.append([sex] + atts[1:-1])
            y.append(convert_class(atts[-1], num_class))
    X = np.array(X, dtype=float)
    return X, y

X, y = load_abalone(addsex=False, num_class=2)

#print(X[0], y[0])

for title, model in zip(titles, models):
    start = time.time()
    acc = np.mean(cross_val_score(model, X, y, cv=10))
    end = time.time()
    t = end - start
    print(title, acc, 'time:', t)

Zero-R 0.6535799111906648 time: 0.02172088623046875
GNB 0.6767787683728615 time: 0.03435921669006348
MNB 0.6535799111906648 time: 0.03149080276489258
LinearSVC 0.7744707583215724 time: 0.4357430934906006
Logistic Regression 0.7648967907014101 time: 0.06963706016540527


NB here is largely the same as 0-R: this is to be expected for Multinomial NB (as the attributes are not frequencies), whose accuracy is identical to 0-R, which suggests it is simply predicting the majority class. It's somewhat surprising that Gaussian NB is almost as bad. Perhaps the attribute distributions are not normal, for example because the data has outliers.

LinearSVC is very slightly better than Logistic Regression here, but the difference is small, and might not be significant.


### (b) 
Modify convert_class() so that it instead sets up a three-class problem: 8 rings or less; 9
or 10 rings; 11 rings or more — which classifier wins now?

In [6]:
X, y = load_abalone(addsex=False, num_class=3)

for title, model in zip(titles, models):
    start = time.time()
    acc = np.mean(cross_val_score(model, X, y, cv=10))
    end = time.time()
    t = end - start
    print(title, acc, 'time:', t)

Zero-R 0.34642075045918785 time: 0.018862009048461914
GNB 0.577260492694719 time: 0.03551197052001953
MNB 0.5570936039372009 time: 0.03243899345397949
LinearSVC 0.644483665671723 time: 1.0409269332885742
Logistic Regression 0.6320370619049334 time: 0.1367959976196289


Mostly this is the same story, but notice that NB now is substantially better than 0-R. It seems that there are some patterns that NB can find after all.

### (c) 
Set up the 29-class problem (!) as follows: y.append(int(atts[-1])) — now which classifier is the best?

Naive Bayes seems like it would be woefully inadequate for a problem like
this: how is it going? This run will probably take noticeably longer than the others — which classifier is the main culprit?

In [7]:
X, y = load_abalone(addsex=False, num_class=29)

for title, model in zip(titles, models):
    start = time.time()
    acc = np.mean(cross_val_score(model, X, y, cv=10))
    end = time.time()
    t = end - start
    print(title, acc, 'time:', t)



Zero-R 0.16500855703117906 time: 0.03310108184814453
GNB 0.23591700319832828 time: 0.08421778678894043
MNB 0.16525547061142598 time: 0.05653882026672363
LinearSVC 0.258368737839951 time: 4.7963268756866455
Logistic Regression 0.24687601563564093 time: 1.2486748695373535


There are a few things we can notice here; one is that scikit-learn warns us that we are trying to cross-validate with very few instances of some classes. When a class has fewer instances than there are folds, it cannot be represented in every fold, so proper stratification is impossible, which is undesirable in an evaluation framework.
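A quick way to see which classes trigger this problem is to count the instances per class and compare against the number of folds. The sketch below uses only the standard library; the helper name and the toy label list are hypothetical, and with the real data you would pass in the `y` returned by `load_abalone(num_class=29)`.

```python
from collections import Counter

def classes_smaller_than_folds(labels, n_folds):
    """Return {class: count} for classes with fewer instances than folds."""
    counts = Counter(labels)
    return {c: n for c, n in counts.items() if n < n_folds}

# Hypothetical toy labels: two common classes and two very rare ones.
toy_y = [9] * 30 + [10] * 25 + [1] * 2 + [29]
small = classes_smaller_than_folds(toy_y, n_folds=10)
print(small)
```

Any class that shows up in this dictionary cannot appear in every one of the folds, so some test folds will contain classes the model has barely seen, or not at all.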

Looking at the performance, the 29-class task is far more difficult; even the better models are not much better than 0-R. Multinomial NB is back to useless, but now Gaussian NB is roughly as effective as the more complex linear models. One reason for this is probably that there is very little information available to support a prediction.

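LinearSVC is now noticeably slower than the other models; this is to be expected with a large number of classes, as it trains one binary classifier per class (one-vs-rest). SVC would be very difficult to train in this environment, since its one-vs-one scheme would need a classifier for every pair of classes (29 × 28 / 2 = 406).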


### (d) 
The sex attribute ( atts[0] ) is mostly unhelpful for this problem. Nevertheless, let’s
incorporate it into the models as a one–hot attribute and see what happens.
Pay particularly close attention to the performance on Abalone-29: which model(s) are most
resilient to the addition of these three attributes? Why do you suppose this would be?

In [8]:
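from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
X, y = load_abalone(addsex=True, num_class=29)
print('before', X.shape)
# OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22.
# ColumnTransformer one-hot encodes column 0 and passes the rest through;
# sparse_threshold=0 forces a dense array, so no .toarray() is needed.
ohe = ColumnTransformer([('sex', OneHotEncoder(), [0])],
                        remainder='passthrough', sparse_threshold=0)
X = ohe.fit_transform(X)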
print(X[0:5])
print('after', X.shape)
for title, model in zip(titles, models):
    start = time.time()
    acc = np.mean(cross_val_score(model, X, y, cv=10))
    end = time.time()
    t = end - start
    print(title, acc, 'time:', t)

before (4177, 8)
[[1.     0.     0.     0.455  0.365  0.095  0.514  0.2245 0.101  0.15  ]
 [1.     0.     0.     0.35   0.265  0.09   0.2255 0.0995 0.0485 0.07  ]
 [0.     0.     1.     0.53   0.42   0.135  0.677  0.2565 0.1415 0.21  ]
 [1.     0.     0.     0.44   0.365  0.125  0.516  0.2155 0.114  0.155 ]
 [0.     1.     0.     0.33   0.255  0.08   0.205  0.0895 0.0395 0.055 ]]
after (4177, 10)
Zero-R 0.16500855703117906 time: 0.029197216033935547
GNB 0.09995373301345349 time: 0.09067392349243164
MNB 0.2200867294398984 time: 0.05632901191711426
LinearSVC 0.24991067442390813 time: 6.016511917114258
Logistic Regression 0.24658107691038467 time: 1.3327722549438477


The linear models are essentially unchanged; it's likely that the weights for the extra attributes are close to 0. Multinomial NB is surprisingly _better_ in this situation, but still pretty terrible. Gaussian NB has totally fallen apart (it is now worse than 0-R): the one-hot attributes are decidedly not normally distributed, so the probabilities that are being "learned" for the sex attribute end up dominating the more meaningful probabilities for the other attributes.
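To see how a one-hot attribute could dominate a Gaussian NB prediction, the sketch below fits a normal distribution to a 0/1 indicator and compares its log density with the log probability a Bernoulli model would assign. The class proportion (1 instance in 100) is a made-up illustration, not a figure from the Abalone data, and the calculation ignores the small variance smoothing that scikit-learn's GaussianNB applies.

```python
import math

def gaussian_log_density(x, mean, var):
    """Log of the normal density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Hypothetical class in which the indicator is 1 for 1 of 100 instances.
p = 0.01
mean, var = p, p * (1 - p)  # mean and variance of a 0/1 variable

log_dens_one = gaussian_log_density(1.0, mean, var)   # indicator is on
log_dens_zero = gaussian_log_density(0.0, mean, var)  # indicator is off
bernoulli_log_prob_one = math.log(p)                  # what the data supports

print(log_dens_one, log_dens_zero, bernoulli_log_prob_one)
```

Under these assumptions the Gaussian penalty for seeing the indicator switched on is many times larger (in log terms) than the Bernoulli one, so a single indicator can outweigh the combined evidence from the continuous attributes.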

### 3.

Let’s look at something a little bigger: the US census data Adult. We’ve uploaded a slightly pre-processed version of this dataset (adult.txt) to the LMS.
Note that this dataset is substantially larger; we recommend saving time by doing hold–out instead of cross–validation — but note that the accuracies you observe will be more subject to variance.
Many of the attributes are nominal; for now, let’s ignore them (or you can convert to one-hot if you’re feeling brave!). An inelegant way of loading the data is to smush the numeric attributes into a list, for example:
```python
X.append([atts[0],atts[2],atts[4],atts[10],atts[11],atts[12]])
```
Don’t forget to convert the class to integers! Which of the classifiers wins this bake–off? What might be different about this problem, compared to the previous ones?
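For a rough feel of how much a single hold-out accuracy can vary, the sketch below treats each test prediction as an independent coin flip and computes the binomial standard error. The independence assumption and the helper name are illustrative only. The 25% test fraction is the default for train_test_split, 32561 is the instance count reported later in this worksheet, and 0.80 is a round figure in the region of the accuracies seen here.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimated on n_test instances."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

n_total = 32561
n_test = math.ceil(0.25 * n_total)  # default hold-out size
se = accuracy_standard_error(0.80, n_test)
print(n_test, se)
```

On this view, differences between classifiers of well under one percentage point on a single split should not be read as meaningful.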

In [18]:
X, y = [], []
with open('adult.txt', mode='r') as fin:
    for line in fin:
        atts = line.strip().split(",")
        X.append([atts[0],atts[2],atts[4],atts[10],atts[11],atts[12]])
        #labels: >50K, <=50K.
        if atts[-1] == '>50K':
            label = 1
        else:
            label = 0
        y.append(label)

X = np.array(X, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(X, y)


for title, model in zip(titles, models):
    start = time.time()
    model.fit(X_train,y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    end = time.time()
    t = end - start
    print(title, acc, 'time:', t)

Zero-R 0.7632968922736765 time: 0.005110025405883789
GNB 0.8008844122343692 time: 0.012317895889282227
MNB 0.7897064242722024 time: 0.010676145553588867
LinearSVC 0.3000859845227859 time: 1.7251451015472412
Logistic Regression 0.8059206485689718 time: 0.06301021575927734


Leaving LinearSVC aside for a moment, there isn't much to choose between these methods; they are all better than 0-R, but only by a few percentage points.

If you run this a few times, you will find that LinearSVC sometimes becomes horrible (~30% accuracy, as in the run shown above). For some random partitions, the SVM learns entirely the wrong margin (effectively, over-fitting the training data). A likely contributing factor is that the attributes are on very different scales, which makes the optimisation harder, so standardising the features is worth trying. This matters if we're using repeated random subsampling or cross-validation: the averaged performance will be worse than the other models, but we won't necessarily realise why. Most of the folds will be just as good as the other models, but one fold will randomly be far worse, lowering the "average" effectiveness.
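The effect of one bad fold on an averaged score is easy to check by hand. The fold accuracies below are invented for illustration (four ordinary folds and one collapse); in practice you would inspect the array returned by cross_val_score before taking np.mean of it.

```python
from statistics import mean, stdev

# Hypothetical per-fold accuracies: four typical folds and one failure.
fold_scores = [0.80, 0.81, 0.79, 0.80, 0.30]

summary = {
    "mean": mean(fold_scores),
    "min": min(fold_scores),
    "max": max(fold_scores),
    "stdev": stdev(fold_scores),
}
print(summary)
```

The mean alone suggests a mediocre model, while the minimum and the spread reveal that the model is fine on most folds and broken on one.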

### 4.
A better way of loading in a mixed data set like this one is the DictVectorizer(), which converts a list of dictionaries into a sparse representation, for example:
```python
for line in f:
    atts = line[:-1].split(",")
    this = {}
    this["age"]=int(atts[0])
    this["workclass"]=atts[1]
    ...
    X.append(this)
```
(Note that the numeric attributes shouldn’t be left as strings!) and:
```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
X = vec.fit_transform(X).toarray()
```
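To make the behaviour concrete, here is a simplified plain-Python imitation of the idea: numeric values keep a single column named after the key, while each distinct string value gets its own indicator column. The function `vectorize` is a hypothetical stand-in written for this illustration; it is not the scikit-learn implementation and it returns dense lists rather than a sparse matrix.

```python
def vectorize(records):
    """Toy imitation of the DictVectorizer idea (dense, illustration only)."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            # Strings become 'key=value' indicators; numbers keep the key.
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return names, rows

records = [{"age": 39, "workclass": "State-gov"},
           {"age": 50, "workclass": "Private"}]
names, rows = vectorize(records)
print(names)
print(rows)

# The same numeric attribute left as a string: one column per distinct value.
names_str, rows_str = vectorize([{"age": "39"}, {"age": "50"}])
print(names_str)
```

The second call shows why numeric attributes must be converted first: left as strings, every distinct value turns into a separate column.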

[For completeness, here are the attribute specifications from UCI.]
age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.



### (a) 
Any changes to the winner of the bake–off? Was it worthwhile adding all of these various
extra attributes? (There are a lot!)
### (b) 
As a final summary, compare the results of the four classifiers (and the baseline) when cross–validating over this large dataset. (It will probably take a few minutes to run.) Confirm that the averaged accuracy for LinearSVC is consistent with what you expect, based on your observations across the folds (or when trying different hold–out partitions).

In [19]:
from sklearn.feature_extraction import DictVectorizer
X, y = [], []
with open('adult.txt', mode='r') as fin:
    for line in fin:
        atts = line.strip().split(",")
        if len(atts) != 15:
            continue
        atts = [att.strip() for att in atts]
        this = {}
        this['age'] = int(atts[0])
        this['workclass'] = atts[1]
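        # NOTE: fnlwgt is left as a string here, so DictVectorizer one-hot
        # encodes every distinct value; int(atts[2]) would keep it numeric.
        this['fnlwgt'] = atts[2]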
        this['education'] = atts[3]
        this['education-num'] = int(atts[4])
        this['marital-status'] = atts[5]
        this['occupation'] = atts[6]
        this['relationship'] = atts[7]
        this['race'] = atts[8]
        this['sex'] = atts[9]
        this['capital-gain'] = int(atts[10])
        this['capital-loss'] = int(atts[11])
        this['hours-per-week'] = int(atts[12])
        this['native-country'] = atts[13]

        X.append(this)
        #labels: >50K, <=50K.
        if atts[-1] == '>50K':
            label = 1
        else:
            label = 0
        y.append(label)

vec = DictVectorizer()
X = vec.fit_transform(X).toarray()
print(X.shape)

for title, model in zip(titles, models):
    start = time.time()
    acc = np.mean(cross_val_score(model, X, y, cv=5))
    end = time.time()
    t = end - start
    print(title, acc, 'time:', t)

(32561, 21755)
Zero-R 0.7591904454179904 time: 53.013346672058105
GNB 0.8008352836945651 time: 441.9942111968994
MNB 0.7791530009344381 time: 93.86210513114929
LinearSVC 0.7951229085959625 time: 102.93285512924194
Logistic Regression 0.8508953601019469 time: 81.79078817367554


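Compared to just the numeric features, these one-hot nominal features don't help NB very much. On the other hand, both Logistic Regression and LinearSVC are noticeably improved, so there appear to be some predictive attribute values. Note that the vast majority of the 21755 columns come from fnlwgt, which the code above leaves as a string, so every distinct value becomes its own one-hot column; this also goes a long way towards explaining the long run times.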

LinearSVC should actually have roughly the same performance as Logistic Regression here; again, it's likely that one of the cross-validation partitions has a margin that is overfitting, and is substantially lowering the overall accuracy.