# Linear Classification

In this lab you will implement parts of a linear classification model using the regularized empirical risk minimization principle. By completing this lab and analysing the code, you will gain a deeper understanding of this type of model and of gradient descent.


## Problem Setting

The dataset describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient is classified into one of two categories: normal (1) or abnormal (0). The training data contains 80 SPECT images from which 22 binary features have been extracted. The goal is to predict the labels for an unseen test set of 187 tomography images.

In [2]:
import urllib.request
import pandas as pd
import numpy as np
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# URLopener has been deprecated since Python 3.3; urlretrieve downloads a URL to a local file.
urllib.request.urlretrieve("http://archive.ics.uci.edu/ml/machine-learning-databases/spect/SPECT.train", "SPECT.train")
urllib.request.urlretrieve("http://archive.ics.uci.edu/ml/machine-learning-databases/spect/SPECT.test", "SPECT.test")

df_train = pd.read_csv('SPECT.train',header=None)
df_test = pd.read_csv('SPECT.test',header=None)

train = df_train.values
test = df_test.values

y_train = train[:,0]
X_train = train[:,1:]
y_test = test[:,0]
X_test = test[:,1:]

np.random.seed(404)



In [3]:
# First I'd like to understand my data and the problem at hand.
# We have a very small data set with only 80 training examples.
# All features are binary, and so is our target: normal or abnormal tomography.
# Since the target is included in the data as a label, this is supervised learning.
# And since we only have two classes, this is a binary classification problem.
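
The observations above (size, binary features, two classes) can be checked programmatically. The following is a minimal sketch, assuming a hypothetical helper named summarize and toy stand-in arrays rather than the SPECT files; in the notebook it could be called as summarize(X_train, y_train).

```python
import numpy as np

def summarize(X, y):
    # Report the shape, the set of feature values, and the class counts.
    labels, counts = np.unique(y, return_counts=True)
    return {
        "n_examples": X.shape[0],
        "n_features": X.shape[1],
        "feature_values": np.unique(X).tolist(),
        "class_counts": dict(zip(labels.tolist(), counts.tolist())),
    }

# Toy stand-in data (not the SPECT files): 4 examples, 3 binary features.
X_toy = np.array([[0, 1, 1], [1, 0, 0], [1, 1, 0], [0, 0, 0]])
y_toy = np.array([1, 0, 1, 1])
print(summarize(X_toy, y_toy))
```

A strongly skewed class count would be worth keeping in mind when reading accuracy values later on.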



### Exercise 1

Analyze the function learn_reg_ERM(X,y,lambda) which for a given $n\times m$ data matrix $\textbf{X}$ and binary class label $\textbf{y}$ learns and returns a linear model $\textbf{w}$.
The binary class label has to be transformed so that its range is $\left \{-1,1 \right \}$. 
The trade-off parameter between the empirical loss and the regularizer is given by $\lambda > 0$. 
The learning rate is adapted with the Barzilai-Borwein method.

Try to understand each step of the learning algorithm and comment each line.
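
Reading the implementation below, the objective that learn_reg_ERM appears to minimize can be sketched as follows. This is an interpretation of the code, not a formula given in the lab text: $\textbf{x}_{i}$ denotes the $i$-th row of $\textbf{X}$, and the loss is summed (not averaged) over the $n$ examples.

$$f(\textbf{w}) = \sum_{i=1}^{n} \max(0, 1 - y_{i} \textbf{x}_{i}^{T}\textbf{w}) + \frac{\lambda}{2} \textbf{w}^{T}\textbf{w}$$

A (sub)gradient of this objective, with $h_{i} = \textbf{x}_{i}^{T}\textbf{w}$, is

$$\nabla f(\textbf{w}) = \textbf{X}^{T}\textbf{g} + \lambda \textbf{w}, \qquad g_{i} = -y_{i} \text{ if } 1 - y_{i} h_{i} > 0, \text{ else } g_{i} = 0$$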


In [4]:

np.random.seed(404)

def learn_reg_ERM(X, y, lbda):

    # We get our data matrix as X, our labels (in {-1, 1}) as y, and our lambda value.
    max_iter = 200  # Maximum number of optimization iterations (like epochs: each one uses the full training set).
    e = 0.001       # Epsilon: the threshold below which we assume the model is no longer changing.
    alpha = 1.      # Initial learning rate: how far we move along the gradient in each update.

    w = np.random.randn(X.shape[1])
    # We now have a linear model that holds one random weight per feature in the training data.
    for k in np.arange(max_iter):  # k runs from 0 to max_iter - 1.

        h = np.dot(X, w)  # Linear scores: the dot product of each example's features with the weights.
        l, lg = loss(h, y)  # Hinge loss per example and its gradient w.r.t. h (see Exercise 2).
        # print('loss: {}'.format(np.mean(l)))  # Uncomment to track the mean loss over all training examples.
        r, rg = reg(w, lbda)  # L2 regularizer and its gradient; it penalizes large weights to reduce overfitting.
        # Chain rule: X.T times the loss gradient gives the gradient w.r.t. w; adding rg gives the total gradient.
        g = np.dot(X.T, lg) + rg
        if (k > 0):  # For every iteration except the first one, where no previous gradient exists.
            alpha = alpha * (np.dot(g_old.T, g_old)) / (np.dot((g_old - g).T, g_old))
            # Barzilai-Borwein method: the new learning rate is the old one times
            #   the dot product of the previous gradient with itself,
            #   divided by
            #   the dot product of (previous gradient minus current gradient) with the previous gradient.
            # This means the learning rate is recomputed in every iteration.
            # (To research: what alternatives to Barzilai-Borwein are there?)

        w = w - alpha * g
        # We multiply the gradient by the learning rate and subtract that amount from the model.
        if (np.linalg.norm(alpha * g) < e):
            # The update is smaller than epsilon, so we treat the model as converged and stop.
            break
        g_old = g
    return w
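
To see the step-size rule in isolation, here is a minimal sketch that applies the same Barzilai-Borwein update to a toy quadratic. The function name bb_gradient_descent, the objective (A, b), and the zero-denominator guard are assumptions made for this illustration and are not part of the lab code.

```python
import numpy as np

def bb_gradient_descent(grad, w0, alpha0=1.0, max_iter=200, eps=1e-3):
    """Gradient descent using the same step-size update as learn_reg_ERM."""
    w = np.asarray(w0, dtype=float)
    alpha = alpha0
    g_old = None
    for k in range(max_iter):
        g = grad(w)
        if g_old is not None:
            denom = np.dot(g_old - g, g_old)
            if denom == 0:  # guard added for this sketch: avoid dividing by zero
                break
            alpha = alpha * np.dot(g_old, g_old) / denom
        w = w - alpha * g
        if np.linalg.norm(alpha * g) < eps:
            break
        g_old = g
    return w, k + 1

# Hypothetical toy objective: f(w) = 0.5 * w^T A w - b^T w, minimized where A w = b.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 2.0])
w_star, n_steps = bb_gradient_descent(lambda w: A @ w - b, w0=np.zeros(2))
print(w_star, n_steps)
print(np.linalg.solve(A, b))  # exact minimizer for comparison
```

With a fixed learning rate of 1 this quadratic would diverge along the steep direction; the adaptive rule shrinks the step after the first overshoot.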



### Exercise 2

Fill in the code for the function loss(h,y) which computes the hinge loss and its gradient. 
This function takes a vector $\textbf{y}$ with the true labels $\in \left \{-1,1\right \}$ and a vector $\textbf{h}$ with the function values of the linear model as inputs. The function returns a vector $\textbf{l}$ with the hinge loss $l_{i} = \max(0, 1 - y_{i} h_{i})$ and a vector $\textbf{g}$ with the gradients of the hinge loss w.r.t. $\textbf{h}$. (Note: The partial derivative of the hinge loss with respect to $h_{i}$ is $g_{i} = -y_{i}$ if $l_{i} > 0$, else $g_{i} = 0$.)

In [5]:
def loss(h, y):
    # Hinge loss formula: l_i = max(0, 1 - y_i * h_i)
    #################

    l = np.maximum(0, 1 - y * h)

    # Gradient w.r.t. h: -y_i where the loss is positive, 0 elsewhere.
    g = np.zeros(l.shape)
    g[l > 0] = -y[l > 0]
    ##################

    return l, g
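
A small worked example may help to check the function by hand. The sketch below repeats the same computation under the hypothetical name hinge_loss so that it runs on its own; the scores in h are made-up values.

```python
import numpy as np

def hinge_loss(h, y):
    # Same computation as loss(h, y) above, repeated so the example is standalone.
    l = np.maximum(0, 1 - y * h)
    g = np.zeros(l.shape)
    g[l > 0] = -y[l > 0]
    return l, g

y = np.array([1, 1, -1, -1])          # true labels
h = np.array([2.0, 0.5, -3.0, 0.2])   # hypothetical model scores
l, g = hinge_loss(h, y)
print(l)  # losses: 0, 0.5, 0, 1.2
print(g)  # gradients: 0, -1, 0, 1
```

Examples 1 and 3 are on the correct side with a margin of at least 1, so they contribute neither loss nor gradient. Example 2 is correct but inside the margin, and example 4 is misclassified.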

### Exercise 3

Fill in the code for the function reg(w,lambda) which computes the $\mathcal{L}_2$-regularizer and the gradient of the regularizer function at point $\textbf{w}$. 


$$r = \frac{\lambda}{2} \textbf{w}^{T}\textbf{w}$$

$$g = \lambda \textbf{w}$$

In [6]:
def reg(w, lbda):
    
    r = (lbda / 2) * np.dot(w.T, w)
    g = lbda * w
    return r, g
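
One way to gain confidence in an analytic gradient is to compare it with finite differences. The following sketch does this for the regularizer; numerical_gradient is a hypothetical helper, and reg is copied here only so that the check runs standalone.

```python
import numpy as np

def reg(w, lbda):
    # Copy of reg(w, lbda) from the cell above.
    return (lbda / 2) * np.dot(w, w), lbda * w

def numerical_gradient(f, w, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
w = rng.normal(size=5)
lbda = 0.1
_, analytic = reg(w, lbda)
numeric = numerical_gradient(lambda v: reg(v, lbda)[0], w)
print(np.allclose(analytic, numeric, atol=1e-6))
```

The same helper could be pointed at the hinge loss, with the caveat that the hinge is not differentiable exactly at $1 - y_{i} h_{i} = 0$.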


### Exercise 4

Fill in the code for the function predict(w,x) which predicts the class label $y$ for a data point $\textbf{x}$ or a matrix $X$ of data points (row-wise) for a previously trained linear model $\textbf{w}$. If only a single data point is given, the function should return a scalar value. If a matrix is given, it should return a vector of predictions.

In [7]:
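def predict(w, X):
    # Sign of the linear score. Note: np.sign returns 0 when the score is exactly 0
    # (e.g. for an all-zero feature row), which matches neither class label.
    preds = np.sign(np.dot(X, w))
    return preds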
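
The scalar-versus-vector requirement follows directly from how np.dot handles 1-D and 2-D inputs. A minimal sketch with made-up weights and data points:

```python
import numpy as np

def predict(w, X):
    # Same one-liner as above, repeated so the example is standalone.
    return np.sign(np.dot(X, w))

w = np.array([1.0, -2.0])                 # hypothetical weights
x = np.array([1, 0])                      # a single data point
X = np.array([[1, 0], [0, 1], [1, 1]])    # three data points, one per row

single = predict(w, x)
batch = predict(w, X)
print(np.ndim(single), single)   # 0 dimensions: a scalar prediction
print(batch.shape, batch)        # one prediction per row
```

Here the scores are 1 for the single point and 1, -2, -1 for the three rows, so the predicted labels are 1 and 1, -1, -1.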

### Exercise 5

#### 5.1 
Train a linear model on the training data, then classify all 187 test instances using the function predict. 
Note that the given class labels are in $\left \{0,1 \right \}$, whereas the learning algorithm expects labels in $\left \{-1,1 \right \}$. Then compute the accuracy of your trained linear model on both the training and the test data. 

In [8]:
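##################
# Map the labels from {0, 1} to {-1, 1}, as the learning algorithm expects.
y_train[y_train == 0] = -1
y_test[y_test == 0] = -1

w = learn_reg_ERM(X_train, y_train, 0.1)  # lambda = 0.1
y_hat = predict(w, X_test)
accuracy = np.mean(y_test == y_hat)  # Fraction of correctly classified test instances.
print(accuracy)
##################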

0.6898395721925134
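
The value above is the test accuracy only, while the exercise also asks for the training accuracy. A minimal sketch of reusable helpers follows, assuming the hypothetical names to_signed_labels and accuracy and toy arrays in place of the SPECT data.

```python
import numpy as np

def to_signed_labels(y):
    # Map {0, 1} labels to {-1, 1} without modifying the input array.
    return np.where(np.asarray(y) == 0, -1, 1)

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy stand-ins for labels and predictions (not the SPECT data).
y_true = to_signed_labels([1, 0, 1, 0])
y_pred = np.array([1, 1, 1, -1])
print(accuracy(y_true, y_pred))  # 0.75: three of four predictions match

# In the notebook, the same helper would be called twice, e.g.
# accuracy(y_train, predict(w, X_train)) and accuracy(y_test, predict(w, X_test)).
```

Returning a new array from the label mapping, instead of overwriting in place, also makes it safe to re-run the cell.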


#### 5.2
Compare the accuracy of the linear model with the accuracy of a random forest and a decision tree on the training and test data set.

In [9]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from skopt import BayesSearchCV
from sklearn.model_selection import cross_val_score

In [10]:
##################
# Tune a decision tree with Bayesian hyperparameter search.
clf_tree = DecisionTreeClassifier(criterion='entropy')

search_space = {
    'max_depth': (1, 50),
    'min_samples_split': (2, 20),
    'min_samples_leaf': (1, 20),
    'max_features': (0.1, 1.0, 'uniform'),
}

opt = BayesSearchCV(
    clf_tree,
    search_spaces=search_space,
    n_iter=32,           # Number of hyperparameter settings to sample.
    scoring='accuracy',
    cv=5,                # Each setting is scored with 5-fold cross-validation.
    random_state=42,
    n_jobs=-1            # Run the evaluations in parallel on all available cores.
)

opt.fit(X_train, y_train)
print("Best parameters:", opt.best_params_)
print("Best score:", opt.best_score_)

##################

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

Best parameters: OrderedDict({'max_depth': 50, 'max_features': 0.8010471965418119, 'min_samples_leaf': 1, 'min_samples_split': 20})
Best score: 0.7875
Training Accuracy: 0.9375
Test Accuracy: 0.7754
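
The output above reports a cross-validation score for the decision tree and train/test accuracy for the random forest only. To compare all models on the same footing, one option is a small helper that fits a model and returns both accuracies. This is a sketch: train_test_accuracy is a hypothetical name, and the data here is synthetic, not the SPECT arrays. In the notebook, the tuned tree should be available as opt.best_estimator_ (assuming the default refit behaviour) and could be passed to model.score directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def train_test_accuracy(model, X_train, y_train, X_test, y_test):
    # Fit on the training split and report accuracy on both splits.
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# Synthetic binary data as a stand-in for the SPECT arrays.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 22))
y = np.where(X[:, 0] + X[:, 1] + X[:, 2] >= 2, 1, -1)
X_tr, y_tr, X_te, y_te = X[:80], y[:80], X[80:], y[80:]

results = {}
for name, model in [
    ("decision tree", DecisionTreeClassifier(criterion="entropy", random_state=0)),
    ("random forest", RandomForestClassifier(random_state=0)),
]:
    results[name] = train_test_accuracy(model, X_tr, y_tr, X_te, y_te)
    print(f"{name}: train={results[name][0]:.4f}, test={results[name][1]:.4f}")
```

A large gap between training and test accuracy, as in the random forest output above, is a typical sign of overfitting on a data set this small.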
