# Regression exercise
In this exercise we will code linear and logistic regression from scratch.

In [1]:
#Import all the packages
import numpy as np
import matplotlib.pyplot as plt

These are the functions you will need; load them first.

In [2]:
def mse(y,y_hat):
    '''Calculates mean square error'''
    return np.mean(np.square(np.subtract(y,y_hat)))

In [3]:
def rmse(y, y_hat):
    '''Calculates 1 - RMSE (root mean square error), so that higher values mean a better fit'''
    return 1-np.sqrt(mse(y,y_hat))

In [4]:
def logistic_loss(y,y_hat):
    '''Calculates logistic loss'''
    return -np.mean(y * np.log(y_hat) + (1-y) * np.log(1 - y_hat))

In [5]:
def sigmoid(Z):
    '''Calculates the sigmoid function'''
    return 1 / (1+np.exp(-Z))

In [6]:
def accuracy(y,y_hat):
    '''Calculates the accuracy of a prediction'''
    pred_y = np.round(y_hat)
    return np.sum(pred_y == y)/len(y)

In [9]:
## Symbolic link to the data: 
%cd
%mkdir ml_data
%cd ml_data
!ln -s /exercises/ml_intro/ml_data/freq_train.txt ./freq_train.txt # command to make symbolic link
!ln -s /exercises/ml_intro/ml_data/freq_val.txt ./freq_val.txt # command to make symbolic link
!ln -s /exercises/ml_intro/ml_data/label_train.txt ./label_train.txt # command to make symbolic link
!ln -s /exercises/ml_intro/ml_data/label_val.txt ./label_val.txt # command to make symbolic link
!pwd
!ls

/home/jupyter-admin
/home/jupyter-admin/ml_data
/home/jupyter-admin/ml_data
freq_train.txt	freq_val.txt  label_train.txt  label_val.txt


Load the datasets using the function `np.loadtxt`.

In [10]:
X_train = np.loadtxt('freq_train.txt')
X_valid = np.loadtxt('freq_val.txt')
y_train = np.loadtxt('label_train.txt')
y_train = y_train.reshape(-1,1)
y_valid = np.loadtxt('label_val.txt')
y_valid = y_valid.reshape(-1,1)
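
The `reshape(-1,1)` calls above matter because NumPy broadcasts a `(n,)` array against a `(n,1)` array into a `(n,n)` matrix, which silently distorts any loss computed from the difference. A minimal sketch on made-up toy arrays (not the exercise data) illustrates the pitfall:

```python
import numpy as np

# Toy values, chosen only for illustration
y = np.array([1.0, 2.0, 3.0])              # shape (3,)
y_hat = np.array([[1.0], [2.0], [3.0]])    # shape (3, 1)

# Mismatched shapes: broadcasting produces a (3, 3) matrix of differences
diff_bad = y - y_hat
# Matching shapes: element-wise differences, shape (3, 1)
diff_ok = y.reshape(-1, 1) - y_hat

mse_bad = np.mean(np.square(diff_bad))
mse_ok = np.mean(np.square(diff_ok))

print(diff_bad.shape, mse_bad)   # non-zero error although predictions are perfect
print(diff_ok.shape, mse_ok)     # zero error, as expected
```

Checking `.shape` on the labels and predictions before computing a loss is a cheap way to catch this.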

### Exercise 1

**Q1**: For linear regression, we will predict the frequency of amino acid A from the frequencies of amino acids C, D and E.
The frequencies of C, D and E are thus the inputs, and the frequencies of A are the labels when you train your model. **(5 points)**

In [11]:
# The dataset for linear regression exercise 1
X_train_lin = X_train[:,1:4]
X_valid_lin = X_valid[:,1:4]
y_train_lin =  X_train[:,0]
y_train_lin = y_train_lin.reshape(-1,1)
y_valid_lin =  X_valid[:,0]
y_valid_lin = y_valid_lin.reshape(-1,1)

**Step 1:** Initialize the weight $w_{1}$ and bias $w_{0}$ with zeros. Think carefully about which shapes $w_{1}$ and $w_{0}$ should have.

In [None]:
w1 = 
w0 =

**Step 2**: Based on **slide 13**, complete the training loop for linear regression. Feel free to add lines if you need them.

In [None]:
# n_epochs: number of times the weights are updated, try to increase or decrease them
n_epochs = 10000
# lr: learning rate, rate at which the weights are changed
lr = 0.1
# list to save loss for each epoch
loss_epochs = []
for epoch in range(n_epochs):
    y_hat = 
    loss =  
    D0 =
    D1 = 
    w1 =
    w0 = 
    loss_epochs.append(loss)
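
As a point of comparison, here is a minimal sketch of the same loop structure for a single input feature, using scalar weights and synthetic data generated from $y = 2x + 1$. It assumes the mean square error as loss and the usual gradient descent update; the slides may scale the gradients differently, and the multi-feature case in the exercise needs array-shaped weights.

```python
import numpy as np

# Synthetic data (assumption for illustration): y = 2x + 1, no noise
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0

w1, w0 = 0.0, 0.0      # scalar weight and bias for the one-feature case
lr = 0.1
n_epochs = 5000
loss_epochs = []
for epoch in range(n_epochs):
    y_hat = w1 * x + w0                       # prediction
    loss = np.mean((y - y_hat) ** 2)          # mean square error
    D1 = -2 * np.mean(x * (y - y_hat))        # d(loss)/d(w1)
    D0 = -2 * np.mean(y - y_hat)              # d(loss)/d(w0)
    w1 = w1 - lr * D1                         # gradient descent step
    w0 = w0 - lr * D0
    loss_epochs.append(loss)

print(w1, w0)   # should approach the true slope and intercept
```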

**Step 3**: Plot the training curve using the following code:

In [None]:
plt.figure(figsize=(10,7))
plt.plot(loss_epochs)
plt.title('Training curve')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.grid(linestyle='--')
plt.show()

**Q2**: Looking at the training curve, has the model converged? Explain why or why not **(2 points)**
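
Besides reading the plot, one can check numerically whether the loss is still changing. The sketch below uses a hypothetical helper, `has_plateaued`, with an arbitrary window and tolerance; both are assumptions, and a flat loss only indicates that training has stalled, not that the fit is good.

```python
import numpy as np

def has_plateaued(losses, window=100, rel_tol=1e-3):
    """Hypothetical helper: True if the loss changed by less than rel_tol
    (relative to its earlier value) over the last `window` epochs."""
    if len(losses) <= window:
        return False
    old, new = losses[-window - 1], losses[-1]
    return bool(abs(old - new) <= rel_tol * max(abs(old), 1e-12))

# Synthetic loss curve (assumption): exponential decay towards 0.05
epochs = np.arange(2000)
toy_losses = list(0.05 + np.exp(-epochs / 100))

print(has_plateaued(toy_losses[:200]))   # early in training, loss still dropping
print(has_plateaued(toy_losses))         # late in training, loss is flat
```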

**Q3**: Calculate the accuracy on the training set and validation set and report both of them. Has the model overfitted? Explain why or why not **(2 points)**

In [None]:
y_hat_train = 
y_hat_valid = 
train_accuracy =
valid_accuracy =

### Exercise 2

**Q4**: Now we will predict the frequency of amino acid A from the frequencies of the other 19 amino acids, again using the frequency of amino acid A as the label. **(3 points)**

In [None]:
# The dataset for linear regression exercise 2
X_train_lin = X_train[????]  # <--- input the correct slice to get 19 amino acids out of the 20
X_valid_lin = X_valid[????]  # <--- input the correct slice to get 19 amino acids out of the 20
y_train_lin =  X_train[:,0]
y_train_lin = y_train_lin.reshape(-1,1)
y_valid_lin =  X_valid[:,0]
y_valid_lin = y_valid_lin.reshape(-1,1)

In [None]:
w1 = 
w0 =

In [None]:
# n_epochs: number of times the weights are updated, try to increase or decrease them
n_epochs = 10000
# lr: learning rate, rate at which the weights are changed
lr = 0.1
# list to save loss for each epoch
loss_epochs = []
for epoch in range(n_epochs):
    y_hat = 
    loss =  
    D0 =
    D1 = 
    w1 =
    w0 = 
    loss_epochs.append(loss)

Plot the training curve using the following code:

In [None]:
plt.figure(figsize=(10,7))
plt.plot(loss_epochs)
plt.title('Training curve')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.grid(linestyle='--')
plt.show()

**Q5**: Looking at the training curve, has the model converged? Explain why or why not **(2 points)**

**Q6**: Calculate the accuracy on the training set and validation set and report both of them. Has the model overfitted? Explain why or why not **(2 points)**

In [None]:
y_hat_train = 
y_hat_valid = 
train_accuracy =
valid_accuracy =

**Q7**: Which of the two datasets gave the better results, the one from Exercise 1 or the one from Exercise 2? Explain why. **(3 points)**

### Exercise 3

Knowing how to code from scratch is good for your general understanding, but knowing the available libraries and how to use them makes implementation quicker in practice.

You will therefore now use `sklearn`.

In [None]:
from sklearn import linear_model
# set up the model:
lm_model = linear_model.LinearRegression()

# train the model:
lm_model.fit(X_train_lin, y_train_lin.reshape(-1,))

# use the model to make predictions:
y_hat_train = lm_model.predict(X_train_lin)
y_hat_valid = lm_model.predict(X_valid_lin)

# calculate performance
train_accuracy = rmse(y_train_lin.reshape((-1,)),y_hat_train)
valid_accuracy = rmse(y_valid_lin.reshape((-1,)),y_hat_valid)

**Q8**: Report the accuracy on the training and validation set. Report the difference in accuracy between your model and the `sklearn` model. **(1 point)**
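
When comparing against `sklearn`, it helps to know that `LinearRegression` fits ordinary least squares, which has an exact solution, whereas gradient descent only approaches that solution over many epochs. The sketch below computes the least-squares fit with `np.linalg.lstsq` on synthetic data; the data, `true_w` and `true_b` are made up for illustration.

```python
import numpy as np

# Synthetic, noise-free data (assumption): y = X @ true_w + true_b
rng = np.random.default_rng(0)
X = rng.random((200, 3))
true_w = np.array([[0.5], [-1.0], [2.0]])
true_b = 0.3
y = X @ true_w + true_b                      # shape (200, 1)

# Prepend a column of ones so the bias is fitted as an extra weight
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
coef, residuals, rank, sv = np.linalg.lstsq(X_aug, y, rcond=None)

b_hat = coef[0, 0]     # fitted bias
w_hat = coef[1:]       # fitted weights, shape (3, 1)
print(b_hat, w_hat.ravel())
```

If your gradient descent weights are far from such a least-squares fit on the same data, the loop has probably not converged yet.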

### Exercise 4

**Q9**: In this exercise you will use the data as it was loaded: `X_train` as is, with all 20 frequencies. Instead of using amino acid A as the label as in the earlier exercises, you will use `y_train` as the label for `X_train` (and `y_valid` for `X_valid`). **(5 points)**

**Step 1:** Initialize the weight $w_{1}$ and bias $w_{0}$ with zeros. Think carefully about which shapes $w_{1}$ and $w_{0}$ should have.

In [None]:
w1 = 
w0 =

**Step 2**: Based on **slide 21**, complete the training loop for logistic regression. Feel free to add lines if you need them.

In [None]:
# n_epochs: number of times the weights are updated, try to increase or decrease them
n_epochs = 10000
# lr: learning rate, rate at which the weights are changed
lr = 0.1
# list to save loss for each epoch
loss_epochs = []
for epoch in range(n_epochs):
    Z =
    y_hat = 
    loss =  
    D0 =
    D1 = 
    w1 =
    w0 = 
    loss_epochs.append(loss)
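
For orientation, here is a minimal sketch of logistic regression with a single input feature and scalar weights on synthetic, perfectly separable data. It assumes the logistic loss defined earlier and its standard gradient; the clipping with `eps` is an added safeguard against `log(0)` and is not part of the original `logistic_loss`. The exercise itself uses 20 features, so the weights there need array shapes.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic data (assumption): label is 1 when x > 0, else 0
x = np.linspace(-2, 2, 40)
y = (x > 0).astype(float)

w1, w0 = 0.0, 0.0
lr = 0.5
eps = 1e-12            # guards the logarithms against exact 0 or 1
loss_epochs = []
for epoch in range(2000):
    Z = w1 * x + w0                                   # linear part
    y_hat = sigmoid(Z)                                # predicted probability
    p = np.clip(y_hat, eps, 1 - eps)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    D1 = np.mean((y_hat - y) * x)                     # d(loss)/d(w1)
    D0 = np.mean(y_hat - y)                           # d(loss)/d(w0)
    w1 = w1 - lr * D1
    w0 = w0 - lr * D0
    loss_epochs.append(loss)

train_acc = np.mean(np.round(sigmoid(w1 * x + w0)) == y)
print(loss_epochs[0], loss_epochs[-1], train_acc)
```

On separable data like this the loss keeps shrinking slowly as `w1` grows, so the curve flattens without ever reaching a true minimum.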

**Step 3**: Plot the training curve using the following code:

In [None]:
plt.figure(figsize=(10,7))
plt.plot(loss_epochs)
plt.title('Training curve')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.grid(linestyle='--')
plt.show()

**Q10**: Looking at the training curve, has the model converged? Explain why or why not **(2 points)**

**Step 4**: Run the training again, this time with `n_epochs = 100000` and `lr = 0.5`, and plot the training curve as well.

**Q11**: Looking at the new training curve, has the model converged? Explain why or why not **(2 points)**

**Q12**: Calculate the accuracy on the training set and validation set and report both of them. Has the model overfitted? Explain why or why not **(2 points)**



In [None]:
y_hat_train = 
y_hat_valid = 
train_accuracy =
valid_accuracy = 

### Exercise 5

**Q13**: Here is an example of how to do logistic regression with `sklearn`. Report the accuracy on the training and validation sets, and the difference in accuracy between your model and the `sklearn` model. **(1 point)**

In [None]:
from sklearn import linear_model
# set up the model:
lm_model = linear_model.LogisticRegression()

# train the model:
lm_model.fit(X_train, y_train.reshape(-1,))

# use the model to make predictions:
y_hat_train = lm_model.predict(X_train)
y_hat_valid = lm_model.predict(X_valid)

# calculate performance
train_accuracy = accuracy(y_train.reshape((-1,)),y_hat_train)
valid_accuracy = accuracy(y_valid.reshape((-1,)),y_hat_valid)

**Q14**: When should we use linear regression and why? **(3 points)**

**Q15**: When should we use logistic regression and why? **(3 points)**