In [67]:
import numpy as np
import seaborn as sb
import pandas
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, RidgeClassifier, LogisticRegression, Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, StratifiedKFold
%matplotlib notebook

# Linear Classification

Let's first repurpose our regression model to do classification. For this, we will train two linear regressors on the same training data: one on the original 0/1 labels and one on the inverted labels. We then run both on the data we want to classify, and the one with the higher output "wins".

In [26]:
# get iris data and only select the first two iris species
iris = load_iris()
X = iris.data[0:100]
y = iris.target[0:100]

# normalize
X = (X-X.mean(axis=0))/X.std(axis=0)

# train first classifier
lr1 = LinearRegression()
lr1.fit(X,y)

# train second classifier on inverse labels
lr2 = LinearRegression()
lr2.fit(X,1-y)

# plot the classifier outputs and the true labels
fig,ax = plt.subplots(figsize=(8,6))
plt.plot(lr1.predict(X),label='classifier 1')
plt.plot(lr2.predict(X),label='classifier 2')
plt.plot(y,label='true label')
plt.legend(loc='best')
# accuracy in percent: fraction of samples where the "winning" classifier matches the label
plt.title('{:.1f}% recognition accuracy'.format(100*np.mean((lr1.predict(X)>lr2.predict(X))==y)))
plt.show()



As we can see, this works pretty well. Of course, as you know by now, we made a big mistake here:

**We did not do cross validation!**

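Our accuracy estimate is therefore most likely too optimistic ("overfit"), because it was measured on the same data the classifiers were trained on. Let's remedy this: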

In [101]:
# get iris data and only select the last two iris species
# we know that these are a bit harder to separate!
iris = load_iris()
X = iris.data[50:]
# change the labels from "1","2" to "0","1"
y = iris.target[50:]-1

# normalize
X = (X-X.mean(axis=0))/X.std(axis=0)

# make cross validation splits, taking care to keep the same
# proportion of each class in every split!
CV = StratifiedKFold(n_splits=8)

# make a list of accuracy values to be populated 
acc = list()

# init two classifiers
lr1 = LinearRegression()
lr2 = LinearRegression()

# this loops across all splits for the data
for train, test in CV.split(X, y):
    
    # fit the two classifiers
    lr1.fit(X[train],y[train])
    lr2.fit(X[train],1-y[train])
    
    # check the accuracy on the test data: the predicted class is 1
    # whenever the first classifier has the higher output
    pred = (lr1.predict(X[test])>lr2.predict(X[test])).astype(int)
    acc.append(float(np.mean(pred==y[test])))

print('found mean accuracy of {:.3f}'.format(np.mean(acc)))
print(acc)

found mean accuracy of 0.958
[1.0, 1.0, 0.9166666666666666, 0.9166666666666666, 1.0, 0.8333333333333334, 1.0, 1.0]


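As we see, not all splits reach this level of performance: the per-split accuracy ranges from 83.3% to 100%. Hence, our earlier estimate of 100% accuracy, which was measured on the training data, is definitely over-optimistic! Note that part of the drop also comes from the data itself, since this run uses the harder-to-separate pair of species.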
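A side note on the two-regressor scheme used in this section: because `LinearRegression` fits an intercept, the regressor trained on `1-y` should simply output one minus the output of the regressor trained on `y`. Under that assumption, "the higher output wins" is the same as thresholding a single regressor at 0.5. The following is a minimal sketch that checks this numerically on the last two iris species; the variable names `mirrored`, `pred_two` and `pred_one` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# same data preparation as in the cell above
iris = load_iris()
X = iris.data[50:]
y = iris.target[50:] - 1
X = (X - X.mean(axis=0)) / X.std(axis=0)

# one regressor on the labels, one on the inverted labels
lr1 = LinearRegression().fit(X, y)
lr2 = LinearRegression().fit(X, 1 - y)
out1 = lr1.predict(X)
out2 = lr2.predict(X)

# check: the second regressor mirrors the first one around 0.5
mirrored = np.allclose(out2, 1 - out1)

# decision of the two-regressor scheme vs. a single threshold at 0.5
pred_two = (out1 > out2).astype(int)
pred_one = (out1 > 0.5).astype(int)

print('outputs mirrored:', mirrored)
print('fraction of identical decisions:', np.mean(pred_two == pred_one))
```

So the second regressor does not add information here; it mainly makes the "competition" between two classes explicit, which generalizes more naturally to more than two classes.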

# Logistic regression

Here, we will use the `sklearn` implementation of logistic regression (which, despite its name, is a classification algorithm). The implementation offers a lot more options than we discussed in class, but for now we will configure it so that it behaves like the standard, unregularized model.

In [102]:
acc = list()

# init logistic regression - note that the default logistic
# regression in sklearn has a regularizer built in on the weights;
# if we specify a very large "C" value, we effectively switch this
# OFF and obtain the "standard" logistic regression model!
logit = LogisticRegression(C=1000000)

# this loops across all splits for the data
for train, test in CV.split(X, y): 
    # fit the classifier
    logit.fit(X[train],y[train])    
    acc.append(logit.score(X[test],y[test]))

print('found mean accuracy of {:.3f}'.format(np.mean(acc)))
print(acc)

found mean accuracy of 0.969
[1.0, 1.0, 1.0, 0.91666666666666663, 1.0, 0.83333333333333337, 1.0, 1.0]


It seems that logistic regression is a tiny bit better than the linear classifier.

Do not be overawed by the 100.0% compared to 91.7% in the third split - the number of examples that are classified better is likely just one! To justify that logistic regression really fares better, we should compare the results on a few more test sets and then do proper statistical testing (t-tests would be fine).
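As a rough illustration of such a test, here is a minimal sketch of a paired t-test on the eight per-split accuracies printed above, using `scipy.stats.ttest_rel`. The use of SciPy and the variable names are assumptions of this sketch, and eight splits of the same data set are not the "few more test sets" asked for, so the result should be read as a sanity check only.

```python
import numpy as np
from scipy import stats

# per-split accuracies copied from the outputs above
# (0.9167 = 11/12 and 0.8333 = 10/12)
acc_linear = np.array([1.0, 1.0, 11/12, 11/12, 1.0, 10/12, 1.0, 1.0])
acc_logit = np.array([1.0, 1.0, 1.0, 11/12, 1.0, 10/12, 1.0, 1.0])

# paired test: both classifiers were evaluated on the same splits
result = stats.ttest_rel(acc_logit, acc_linear)

print('t statistic: {:.3f}'.format(result.statistic))
print('p value:     {:.3f}'.format(result.pvalue))
```

The two accuracy lists differ in a single split only, so the p value stays far above the usual 0.05 threshold: these numbers alone do not show that logistic regression is better.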

We can also take a look at the class probabilities for one of the test splits, since logistic regression outputs these:

In [103]:
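# "test" still holds the indices of the last split from the loop above;
# each row is [P(class 0), P(class 1)] for one test example
print(logit.predict_proba(X[test]))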

[[  9.99981472e-01   1.85278255e-05]
 [  9.99999667e-01   3.33451458e-07]
 [  9.99996111e-01   3.88904910e-06]
 [  9.99997072e-01   2.92795384e-06]
 [  1.00000000e+00   9.83672429e-11]
 [  9.99997012e-01   2.98826572e-06]
 [  1.68711954e-08   9.99999983e-01]
 [  8.56820552e-06   9.99991432e-01]
 [  1.00732959e-03   9.98992670e-01]
 [  1.15722983e-03   9.98842770e-01]
 [  5.66114541e-06   9.99994339e-01]
 [  2.42994662e-02   9.75700534e-01]]
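To see where these probabilities come from, the following minimal sketch recomputes them by hand: for a two-class problem, the probability of class 1 should be the logistic (sigmoid) function applied to the linear score $w \cdot x + b$, which `sklearn` exposes as `decision_function`. For brevity the model is fit on all 100 flowers instead of a training split, and `max_iter=1000` is an assumption added here to give the solver enough iterations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# last two iris species, normalized as before
iris = load_iris()
X = iris.data[50:]
y = iris.target[50:] - 1
X = (X - X.mean(axis=0)) / X.std(axis=0)

# (almost) unregularized logistic regression
logit = LogisticRegression(C=1000000, max_iter=1000).fit(X, y)

# linear score w.x + b for every example
z = logit.decision_function(X)

# sigmoid of the score = probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = logit.predict_proba(X)[:, 1]

matches = np.allclose(p_manual, p_sklearn)
rows_sum_to_one = np.allclose(logit.predict_proba(X).sum(axis=1), 1.0)
print('manual sigmoid matches predict_proba:', matches)
print('each row sums to one:', rows_sum_to_one)
```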


# Perceptrons - learning the decision hyperplane

Let's try to implement the perceptron according to the algorithm in the Data Mining book.

```
Set all weights to zero
Until all instances in the training data are classified correctly
  For each instance I in the training data
    If I is classified incorrectly by the perceptron
      If I belongs to the first class add it to the weight vector
      else subtract it from the weight vector
```

In [171]:
# load the two classes of IRIS data
iris = load_iris()

firstTwo = True

if firstTwo==True:
    X = iris.data[:100]
    y = iris.target[:100]
else:
    X = iris.data[50:]
    y = iris.target[50:]-1    

# normalize
X = (X-X.mean(axis=0))/X.std(axis=0)

# for the iris data, we have four inputs, so we need 4+1 weights
w = np.zeros(5)

# everybody classified correctly?
allCorrect = False

# number of iterations
numIter = 0

# update weights until everything is correct, or we have reached
# maximum number of iterations
while (allCorrect==False and numIter<100):
    numIter +=1
    currCorrect = 0
    # go through training examples
    for idx,xc in enumerate(X):   
        # add the one to the beginning, this is our instance
        I = np.hstack((1,xc))
        # get the current class
        yc = y[idx]
        # classify example and check whether it belongs to first class
        isFirstClass = np.dot(w,I)>=0     
        # check the two correct cases
        if (yc == 0 and isFirstClass):
            currCorrect += 1
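        # note: we use "not" instead of "~", since "~" only acts as a
        # logical negation on numpy booleans, not on plain Python ones
        if (yc == 1 and not isFirstClass):
            currCorrect += 1
        # update weights for the two incorrect cases
        if (yc == 0 and not isFirstClass):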
            w += I  
        if (yc == 1 and isFirstClass):
            w -= I      
    print('Iteration',numIter,': classified',currCorrect,'examples correctly')
    if currCorrect==len(X):
        allCorrect=True

print('final weights:',w)

Iteration 1 : classified 98 examples correctly
Iteration 2 : classified 100 examples correctly
final weights: [-2.         -1.50043433  1.24136855 -1.57912603 -1.4730698 ]


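This works very well and very quickly for the first two IRIS classes, but it fails to converge on the last two IRIS classes (set `firstTwo = False` to try this).

In the latter case, a few flowers overlap with the other class, so the two classes are not linearly separable and the stopping condition "all instances classified correctly" can never be met.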
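To compare both cases without editing the `firstTwo` flag by hand, the loop above can be wrapped in a function. The following is a minimal sketch of that idea; the function name `train_perceptron` and its return values are made up for this illustration, while the update logic follows the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris

def train_perceptron(X, y, max_epochs=100):
    """Hypothetical helper: perceptron without learning rate.

    Returns the weights, the number of epochs used, and whether
    all training examples were classified correctly in the end.
    """
    # prepend a column of ones for the bias weight
    Xb = np.hstack((np.ones((len(X), 1)), X))
    w = np.zeros(Xb.shape[1])
    for epoch in range(1, max_epochs + 1):
        n_correct = 0
        for xi, yi in zip(Xb, y):
            is_first_class = np.dot(w, xi) >= 0
            if is_first_class == (yi == 0):
                n_correct += 1      # classified correctly
            elif yi == 0:
                w += xi             # first class missed: add instance
            else:
                w -= xi             # second class missed: subtract it
        if n_correct == len(Xb):
            return w, epoch, True
    return w, max_epochs, False

def normalize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

iris = load_iris()

# first two species
_, epochs_first, converged_first = train_perceptron(
    normalize(iris.data[:100]), iris.target[:100])

# last two species
_, epochs_second, converged_second = train_perceptron(
    normalize(iris.data[50:]), iris.target[50:] - 1)

print('first two species: converged =', converged_first,
      'after', epochs_first, 'epochs')
print('last two species:  converged =', converged_second,
      'after', epochs_second, 'epochs')
```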

# Perceptron - Version 2

Let's implement the second version of the perceptron and add the learning rate according to the class notes.

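We will therefore update each weight $w_i$ after every training instance according to:

$w_i \leftarrow w_i + \eta \, (\mathrm{target} - \mathrm{output}) \, x_i$

Here, $\mathrm{target}$ and $\mathrm{output}$ are the true and the predicted label (0 or 1) of the current instance, $x_i$ is its $i$-th input (with $x_0 = 1$ for the bias weight), and $\eta$ is the learning rate.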

In [172]:
# load the two classes of IRIS data
iris = load_iris()

firstTwo = True

if firstTwo==True:
    X = iris.data[:100]
    y = iris.target[:100]
else:
    X = iris.data[50:]
    y = iris.target[50:]-1    

# normalize
X = (X-X.mean(axis=0))/X.std(axis=0)

# for the iris data, we have four inputs, so we need 4+1 weights
w = np.zeros(5)

# everybody classified correctly?
allCorrect = False

# number of iterations
numIter = 0

# learning rate
eta = 0.02

# update weights until everything is correct, or we have reached
# maximum number of iterations
while (allCorrect==False and numIter<100):
    numIter +=1
    currCorrect = 0
    # go through training examples
    for idx,xc in enumerate(X):   
        # add the one to the beginning, this becomes our instance I
        I = np.hstack((1,xc))
        # get the current class
        yc = y[idx]
        # classify example and produce output
        output = 1 if np.dot(w,I)>=0 else 0
        # the error is the target minus the output (-1, 0, or +1)
        error = yc-output
        # now check and update weights if necessary
        if (error ==0):
            currCorrect += 1
        else:
            w += eta*error*I
    print('Iteration',numIter,': classified',currCorrect,'examples correctly')
    if currCorrect==len(X):
        allCorrect=True

print('final weights:',w)

Iteration 1 : classified 99 examples correctly
Iteration 2 : classified 100 examples correctly
final weights: [-0.02        0.01162132 -0.01714271  0.02028719  0.02076496]


This works well enough for the first two iris species, but it fails to converge for the last two, just like the previous version of the algorithm that did not include the learning rate.
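One observation about this particular setup: since the weights start at zero and the output is a hard threshold, the learning rate should only rescale the weight vector without changing any decision along the way. The following minimal sketch checks this for two values of $\eta$ on the first two iris species; the helper `run_perceptron` is made up for this illustration and mirrors the loop above.

```python
import numpy as np
from sklearn.datasets import load_iris

def run_perceptron(X, y, eta, max_epochs=100):
    """Hypothetical helper: perceptron with learning rate eta.

    Returns the final weights and the number of correctly
    classified examples in each epoch.
    """
    Xb = np.hstack((np.ones((len(X), 1)), X))
    w = np.zeros(Xb.shape[1])
    history = []
    for _ in range(max_epochs):
        n_correct = 0
        for xi, yi in zip(Xb, y):
            output = 1 if np.dot(w, xi) >= 0 else 0
            error = yi - output
            if error == 0:
                n_correct += 1
            else:
                w += eta * error * xi
        history.append(n_correct)
        if n_correct == len(Xb):
            break
    return w, history

iris = load_iris()
X = iris.data[:100]
y = iris.target[:100]
X = (X - X.mean(axis=0)) / X.std(axis=0)

w_small, hist_small = run_perceptron(X, y, eta=0.02)
w_big, hist_big = run_perceptron(X, y, eta=1.0)

print('correct per epoch (eta=0.02):', hist_small)
print('correct per epoch (eta=1.0): ', hist_big)
print('weights identical up to the factor eta:',
      np.allclose(w_small / 0.02, w_big))
```

With non-zero initial weights or a smooth output function, as in the multi-layer networks discussed later, this no longer holds and the choice of $\eta$ matters.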

Optimizing the learning rate $\eta$ is a science in itself. We will come back to this issue when we talk about multi-layer neural networks.