# Laboratory exercise: Classification

In this laboratory exercise we will apply classification and regression algorithms over a synthetic dataset.


## Data load.

Load the data variables from the npz file provided with this exercise

In [1]:
import numpy as np

# Load data from file 10012345.mat .
data = np.load('10012345.npz')

xRtrain = data['xRtrain']
xRtrainLost = data['xRtrainLost']
xRval = data['xRval']
sRtrain = data['sRtrain']
sRval = data['sRval']
xCtrain = data['xCtrain']
xCval = data['xCval']
yCtrain = data['yCtrain']
yCval = data['yCval']

Initialize all requested variables to 0.

In [2]:
# Classification:
w_full, e_full, p20, emin, nvar, wmin, cv0, rp_opt, fpr = 0, 0, 0, 0, 0, 0, 0, 0, 0
# Regression:
wML, AAE, NLL, wmean, Vw = 0, 0, 0, 0, 0


## Part 1: Clasification

Each of the data matrices `xCtrain` and `xCval` contains 240 data vectors with dimension $D=5$. 

Assume that the binary labels `yCtrain` and `yCval` (with values in {0, 1}) were generated according to a logistic regression model:
$$p(y = 1 | {\bf w}, {\bf x}) = \frac{1}{1 + \exp(-{\bf w}^T {\bf z})}$$
where  
$$
{\bf z} = \begin{pmatrix} 1 \\ {\bf x} \end{pmatrix}
$$

### Exercise C0 [extra]:

Normalize the input matrices in such a way that each feature has zero mean and unit standard deviation. You can do it using standard python commands or by means of the `preprocessing.StandardScaler` from `sklearn`. Use `xCtrain` to estimate the mean and variance of each feature, and make sure that the same normalization is applied to any input, ${\bf x}$, no matter if it belogs to the training or the validation set.

Store the normalized matrices in the same variables, `xCtrain`and `xCval`.

In [3]:
# Write your code here.
# <SOL>

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(xCtrain)
xCtrain = scaler.transform(xCtrain)
xCval = scaler.transform(xCval)

# </SOL>    

### Exercise C1:

As a preliminary task, fit a logistic regression model using the training data available in `xCtrain` and `yCtrain`, using the implementation available from `sklearn`. Use regularization parameter $C=2$, set the `random_state` to 42 and use the default values for all other arguments. 

Store the resulting weight vector in the variable `w_full`.


In [4]:
# Write your code here.
#<SOL>
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression(C=2, random_state=42)
logistic.fit(xCtrain, yCtrain)

w_full = logistic.coef_.flatten()

# Add intercept term to the weight vector.
w_full = np.concatenate((logistic.intercept_, w_full))
#</SOL>

print(w_full)

[ 0.02964435 -0.12589325  1.15997149 -0.15527988  0.7133148   0.27936155]


### Exercise C2.

Determine the classification error rate measured on the validation data (`xCval` and `yCval`). Store the error rate in the variable `e_full`.

In [5]:
# Write your code here.
#<SOL>
e_full = 1 - logistic.score(xCval, yCval)
# </SOL>

# Print the error rate.
print(e_full)

0.2510416666666667


### Exercise C3

Determine the probability that the $k$-th sample in the validation set belongs to category  $y_k = 1$, according to the model computed in exercise 1, for $k = 0, 1, \dots, 19$. Store the result in the variable `p20`.



In [6]:
# Write your code here
# <SOL>
p20 = logistic.predict_proba(xCval[:20])[:,1]
# </SOL>


# Print the probabilities.
print(p20)

[0.88397831 0.28853908 0.08093859 0.02582069 0.87146138 0.84385989
 0.19683427 0.33489592 0.41218717 0.23983442 0.23100049 0.41446499
 0.15930776 0.16412988 0.8642481  0.19540716 0.2827784  0.28855852
 0.02787595 0.39437038]


### Exercise C4.

It is known that all coefficients  $w_i$  (with  $i > n$) are zero, that is, all variables $x_{n+1}, \dots, x_{D-1}$ are irrelevant for the classification task, but the value of $n$ is unknown. Consequently, the goal is to fit a model that includes only the relevant variables:

Train $D$ different logistic regression models, starting with the model that uses only the first variable, and adding one variable at a time, so that the $i$-th model will use only the variables $x_0, x_1, \dots, x_{i-1}$.  Using $C=2$, `random_state`=42 and all other default parameters.

For each model, compute the classification error rate (on the validation data), and keep the best result. 

Store the following variables:

  * `emin`: the lowest validation error
  * `nvar`: an integer indicating the number of variables in the model, 
  * `wmin`: the corresponding weight vector (only for the best case).



In [7]:
# Write your code here.
#<SOL>
D = xCtrain.shape[1]
emin = 1
nvar = 0
wmin = np.zeros(D+1)

# Iterate over the number of variables.
for i in range(1, D+1):
    # Train the model.
    logistic = LogisticRegression(C=2, random_state=42)
    logistic.fit(xCtrain[:, :i], yCtrain)
    
    # Compute the error rate.
    e = 1 - logistic.score(xCval[:, :i], yCval)
    
    # Check if the error rate is lower than the current minimum.
    if e < emin:
        emin = e
        nvar = i
        wmin = logistic.coef_.flatten()
        wmin = np.concatenate((logistic.intercept_, wmin))
#</SOL>

# Print the results.
print(f"emin = {emin}")
print(f"nvar = {nvar}")
print(wmin)

emin = 0.2510416666666667
nvar = 4
[ 0.03121687 -0.13570936  1.16054771 -0.15423108  0.72103519]


### Exercise C5 [extra].

In this exercise we will train a classifier based on **quadratic discriminant analysis**, using the appropriate class from `sklearn`.

The algorithm has a regularization parameter, `reg_param`, that must take some value between 0 and 1. We will select the appropriate value by means of 10-fold cross validation. As a validation metric, we will use the <a href=https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score>F1-score</a>.

To do so:

  1. Join the train and validation sets into a single dataset, by stacking matrices `xCtrain` (on top) and `xCval` (down) into a single matrix `xCV`. In a similar way, join labels into aarray `yV`.
  2. Using the CV dataset and the `cross_val_score` method from `sklearn.model_selection`, compute the cross validation F1-score (averaged over all folds), for `reg_param=0`. Save the result in variable `cv0`
  3. Select the best value of `reg_param` in $\{0, 0.1, 0.2, 0.3, \ldots, 1.0\}$ by 10-fold cross validation, according to the F1-score. Save the result in variable `rp_opt`.


In [8]:
# Write your code here.
#<SOL>
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Join the train and validation sets into a single dataset.
xCV = np.vstack((xCtrain, xCval))
yCV = np.hstack((yCtrain, yCval))

# Compute the cross-validation score for reg_param=0.
qda = QuadraticDiscriminantAnalysis(reg_param=0)
cv0 = np.mean(cross_val_score(qda, xCV, yCV, cv=10))

# Select the best value of reg_param in {0, 0.1, 0.2, ..., 1.0}.
best_score = 0
rp_opt = 0
for rp in np.arange(0, 1.1, 0.1):
    qda = QuadraticDiscriminantAnalysis(reg_param=rp)
    score = np.mean(cross_val_score(qda, xCV, yCV, cv=10, scoring='f1'))
    if score > best_score:
        best_score = score
        rp_opt = rp
#</SOL>

# Print the results.
print(f"cv0 = {cv0}")
print(f"rp_opt = {rp_opt}")

cv0 = 0.7416666666666668
rp_opt = 0.9


### Exercise C6 [extra].

Take the regularization parameter selected in C5, train the quadratic discriminant using `xCtrain`and `yCtrain`, and compute the false positive rate (i.e. the ratio of false positives vs the total number of negative samples) of the classifier over the validation set. Save the result in variable `fpr`.

In [9]:
# Write your code here.
#<SOL>

from sklearn.metrics import confusion_matrix

# Train the QDA classifier with the optimal regularization parameter.
qda = QuadraticDiscriminantAnalysis(reg_param=rp_opt)
qda.fit(xCtrain, yCtrain)

# Predict the labels for the validation set.
y_pred = qda.predict(xCval)

# Compute the confusion matrix.
tn, fp, fn, tp = confusion_matrix(yCval, y_pred).ravel()

# Compute the false positive rate.
fpr = fp / (fp + tn)
#</SOL>

# Print the false positive rate.
print(fpr)

0.2698072805139186


## Part 2: Regression

The mentioned variables include a training set consisting of 300 data points, each consisting of input-output pairs:  $D = \{{\bf x}_k, s_k\}_{k=0}^{299}$. The input vectors are provided as the rows of the variable `xRtrain`, while their corresponding labels are available in the vector `sRtrain`. Use these variables as provided, without applying any normalization procedure.

Assume the data were generated according to the following model:
$$
s = w_0 + w_1 x_0 + w_2 x_2^3 + w_3 \exp(x_4) + \varepsilon
$$
where the noise samples follow a Gaussian distribution with zero mean and variance $\sigma^2_{\varepsilon} = 0.4$.

### Exercise R1

Obtain the maximum likelihood estimator of the model. Store your result in the variable `wML`.


In [10]:
xRtrain.shape

(300, 5)

In [11]:
# Write your code here.
#<SOL>

# Transform variables (x0, x1, x2, x3, x4, x5) into a new matrix.
Z = np.column_stack((np.ones(xRtrain.shape[0]), xRtrain[:, 0], xRtrain[:, 2]**3, np.exp(xRtrain[:, 4])))


# Compute the maximum likelihood estimator.
wML = np.linalg.inv(Z.T @ Z) @ Z.T @ sRtrain
#</SOL>

# Print the result.
print(wML)                    

[-0.22762496  0.10460595  0.00210107  0.00307295]


### Exercise R2

For the previously obtained estimator, determine the average absolute error on the training dataset, i.e.,
$$
\text{AAE} = \frac{1}{N} \sum_{i=1}^{N} |s(i) - \hat{s}(i)|
$$
where  N  is the number of training data points. Store your result in the variable AAE.


In [12]:
# Write your code here.
#<SOL>
AAE = np.mean(np.abs(sRtrain - Z @ wML))
#</SOL>

# Print the result.
print(AAE)

0.8189814763278366


### Exercise R3

Compute the negative log-likelihood,  of the previously obtained estimator using the training data, and store the result in the variable `NLL`.

In [13]:
# Write your code here.
#<SOL>
n_samples = len(sRtrain)
NLL = 0.5 * n_samples * np.log(2 * np.pi * 0.4) + 0.5 * np.sum((sRtrain - Z @ wML)**2) / 0.4
#</SOL>

### Exercise R4

Assume that the weight vector ${\bf w}$  has a prior distribution  $p_W({\bf w})$ , which is Gaussian with zero mean, unit variances ($\text{var}\{w_i\} = 1$), and covariances $\text{cov}\{w_i, w_j\} = 0.5$,  $i \neq j$. 

Compute the posterior mean and the posterior covariance matrix of ${\bf w}$ . Store your results in the variables `wmean` and `Vw`.

In [14]:
# Write your code here.
#<SOL>

# Compute the posterior covariance matrix.
Sigma = (np.ones((Z.shape[1], Z.shape[1])) + np.eye(Z.shape[1]))/2.0
Vw = np.linalg.inv(Z.T @ Z/0.4 + 2 * np.linalg.inv(Sigma))

# Compute the posterior mean.
wmean = Vw @ Z.T @ sRtrain / 0.4
#</SOL>

# Print the results.
print(wmean)
print(Vw)

[-0.22602065  0.10412973  0.00210942  0.00306339]
[[ 1.75338425e-03 -3.37670835e-04  9.75937904e-06 -9.78155808e-06]
 [-3.37670835e-04  3.90910576e-04 -8.95788025e-07  2.61584204e-06]
 [ 9.75937904e-06 -8.95788025e-07  7.35878113e-07 -2.35934261e-08]
 [-9.78155808e-06  2.61584204e-06 -2.35934261e-08  2.24170791e-06]]


In [15]:
# ###########################################
# Save results in file results.npz
np.savez('results.npz',
         w_full=w_full, e_full=e_full, p20=p20, emin=emin, nvar=nvar,
         cv0=cv0, rp_opt=rp_opt, fpr=fpr, xCtrain=xCtrain, xCval=xCval,
         wmin=wmin, wML=wML, AAE=AAE, NLL=NLL, wmean=wmean, Vw=Vw)