# Exercise Sheet 4: Machine Learning Fundamentals & Linear Regression (Deadline: 01 Dec 23:59)

# ML Fundamentals(7 points)
For theoretical tasks you are encouraged to write in $\\LaTeX$. Jupyter notebooks support them by default. For reference, please have a look at the examples in this short excellent guide: [Typesetting Equations](http://nbviewer.jupyter.org/github/ipython/ipython/blob/3.x/examples/Notebook/Typesetting%20Equations.ipynb)

Alternatively, you can upload the solutions in the written form as images and paste them inside the cells. But if you do this, **make sure** that the images are of high quality, so that we can read them without any problems.

###### 1. Sigmoid Function (1.5 points)
The special case of the logistic function is the *sigmoid function* which is defined as:

\begin{equation*}
  \sigma(a) = \frac{1}{1 + e^{-a}}
\end{equation*}

a) Compute its gradient analytically. (0.5 points)

\begin{align}
(1) \; \; \; &\sigma'(a) = \frac{d}{da}(1+e^{-a})^{-1} = (-1)(1+e^{-a})^{-2}\frac{d}{da}(1+e^{-a}) =\\
    &(-1)(1+e^{-a})^{-2}(0 +\frac{d}{da}e^{-a}) = (-1)(1+e^{-a})^{-2}(e^{-a}\frac{d}{da}(-a)) =\\
    &(-1)(1+e^{-a})^{-2}e^{-a}(-1) = \frac{e^{-a}}{(1+e^{-a})^{2}}\\[15pt]
    &\sigma'(a) =  \frac{e^{-a}+1-1}{(1+e^{-a})^{2}} =  \frac{1 + e^{-a}}{(1+e^{-a})^{2}} -\frac{1}{(1+e^{-a})^{2}} =\\
(2)\; \; \; &\frac{1}{(1+e^{-a})} - \frac{1}{(1+e^{-a})^{2}} = \frac{1}{(1+e^{-a})}\left(1-\frac{1}{(1+e^{-a})}\right) = \sigma(a)(1-\sigma(a))\\
 \end{align}   

b) What are the inherent properties that you observe from the above computed gradient? (0.5 points) <br />
   *Hint: Think about how would the gradient signal be for the whole domain of the sigmoid function*

As we can see in (2) the calculation of $\sigma'(a)$ is very easy and can be very efficiently reused and modified in the course of the training. However, for very big or very small $a$ the values of $\sigma'(a)$ saturate to 0. Eventually this might cause that the signal that flows to the neuron is too small and the network doesn't learn. One should also take this into account when choosing initial weights for the network.

c) Prove that the sigmoid function is symmetric. (0.5 points)

As the point of symmetry of the sigmoid function is not (0,0) but (0, $\frac{1}{2}$) and the function is odd, we want to prove, that $f(x) = \sigma(x) - \frac{1}{2}$ is symmetric at the origin (0,0). Odd functions are symmetric iff $f(-x) = -f(x)$ <br/>
\begin{align} 
f(x) &=  \sigma(x) - \frac{1}{2} = \frac{1}{1+e^{-x}} - \frac{1}{2} = \frac{2-1-e^{-x}}{2(1+e^{-x})} = \frac{1-e^{-x}}{2(1+e^{-x})} = \frac{1-\frac{1}{e^{x}}}{2(1+\frac{1}{e^{x}})} = \frac{\frac{e^{x}-1}{e^{x}}}{2(\frac{e^{x}+1}{e^{x}})} =\frac{e^{x}-1}{2(e^{x}+1)}\\
-f(x) &= -\frac{e^{x}-1}{2(e^{x}+1)} = \frac{1 - e^{x}}{2(e^{x}+1)}\\[15pt]
f(-x) &=  \sigma(-x) - \frac{1}{2} = \frac{1}{1+e^{x}} - \frac{1}{2} = \frac{2-1-e^{x}}{2(1+e^{x})} = \frac{1-e^{x}}{2(e^{x}+1)}\\
\end{align}
$f(-x)$ = $-f(x)$, so we have shown, that $f(x)$ is odd and symmetric at the origin. From this follows that $\sigma(x)$ is symmetric at (0, $\frac{1}{2}$).

###### 2. Regularization (3.5 points)

In the lecture, we've seen that we can add a *regularizer* to our cost function to avoid *over or underfitting*. For example, consider the following training criterion for linear regression:

\begin{equation*}
  J(\textbf{w}) = \frac{1}{m}\sum_{i=1}^{m} \Vert\hat{y}^{(i)} - y^{(i)}\Vert^{2} + \lambda\Omega(\textbf{w})
\end{equation*}
where $\Omega(\textbf{w}) = \textbf{w}^{T}\textbf{w}$ is the regularizer.

a) In the above criterion, what is the role of the regularization parameter $\lambda$ on the regularizer (i.e. parameters of our model) while minimizing $J(\textbf{w})$? (1.0 point)

b) Is $\lambda$ the model parameter or a hyperparameter? Justify.(0.5 points)

c) Derive the closed form solution for the weights ($\textbf{w}$) in the above criterion.(2.0 points)

###### 3. Maximum Likelihood Estimation (MLE) (2 points)
Consider the density function of a ***univariate Gaussian distribution***


\begin{equation*}
 p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}exp\left(-\frac{1}{2\sigma^2}(x-\mu)^{2}\right)
\end{equation*}
where $\mu$ is the $\textit{mean}$ and $\sigma^{2}$ is the $\textit{variance}$. 

Let's say you're given *N* samples (i.e. $x_1, x_2, x_3, ..., x_N$) which are drawn from the above stated distribution. Also, you can assume that these samples are **i.i.d** (i.e. [independent and identically distributed](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables)).

Now, please derive the *MLE step-by-step* for:

a) *mean* $(\mu)$. (1.0 point)

b) *variance* $(\sigma^2)$. (1.0 point)

# Multiple Linear Regression (13 points)

#### 1. Introduction
As we have seen in first assignment sheet, when we have one independent (or explanatory) variable and a scalar dependent variable, it is called **simple linear regression**.
But, when there are more than one explanatory variable (i.e. $x^{(1)}, x^{(2)}, ...,x^{(k)}$), and a single scalar dependent variable (*y*), then it's called $\textit{multiple linear regression}$. (Please don't confuse this with *multivariate linear regression* where we predict more than one (correlated) dependent variable.)

Here, we will implement a **multiple linear regression** model in Python/NumPy using the *Gradient Descent* algorithm. Particularly, we will be using $\textit{stochastic gradient descent}$ (*SGD*) where one performs the update step using a small set of training samples of size *batch_size* which we will set to 64. This is again a hyperparameter but in this exercise we will just use a fixed batch-size of *64* (i.e. we go through the training samples sampling 64 at a time and perform gradient descent.) Such a procedure is sometimes called *mini-batch gradient descent* in the deep learning community.

Going through all the training samples *once* is called an **epoch**. Ideally, the algorithm has to go through multiple epochs over the training samples, each time shuffling it, until a convergence criterion has been satisfied. <br />

Here, we will set a *tolerance value* for the difference in error (i.e. change in MSE values between subsequent epochs) that we will accept. Once this difference falls below the *tolerance value*, we terminate our training phase and return the parameters. 

We repeat the above training procedure for all possible hyperparameter combinations. Later on, using these parameters (*i.e. weight vectors*), we compute the prediction for validation data and the corresponding MSE values. And then, we pick the hyperparameter combination which yielded the least MSE.

As a next step, we will combine training data and validation data and make it as our *new training data*. We keep the test data as it is. Using the hyperparameter combination (for the least MSE) that we found above, we train the model again with the *new training data* and obtain the parameter (*i.e. weight vector*) after convergence according to our *tolerance value*.

Phew! That will be our much desired *weight vector*. This is then used on the *test data*, which has not been seen by our algorithm so far, to make a prediction. The resulting MSE value will be the so-called [*generalization error*](https://en.wikipedia.org/wiki/Generalization_error).

It is this *generalization error* that we want it to be as low as possible for *unseen data* (implies that we can achieve higher accuracy).

#### 2. Dataset
For our task, we will be using the *Wine Quality* dataset and predict the quality of white wine based on 11 features such as acidity, citric acid content, residual sugar etc. .

In [2]:
%matplotlib inline
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# get data
data_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
data = pd.read_csv(data_url, sep=';')

# inspect data
print(data.head())
#print(data.shape)

# data as np array
data_npr = data.values

print(data_npr)

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.0              0.27         0.36            20.7      0.045   
1            6.3              0.30         0.34             1.6      0.049   
2            8.1              0.28         0.40             6.9      0.050   
3            7.2              0.23         0.32             8.5      0.058   
4            7.2              0.23         0.32             8.5      0.058   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 45.0                 170.0   1.0010  3.00       0.45   
1                 14.0                 132.0   0.9940  3.30       0.49   
2                 30.0                  97.0   0.9951  3.26       0.44   
3                 47.0                 186.0   0.9956  3.19       0.40   
4                 47.0                 186.0   0.9956  3.19       0.40   

   alcohol  quality  
0      8.8        6  
1      9.5        6  
2     10.1        6 

#### 3. Loss function
We will use a *regularized* form of the MSE loss function. In matrix form it can be written as follows:

\begin{equation*}
    J(\textbf{w}) = \frac{1}{2} \Vert{X\textbf{w}-\textbf{y}}\Vert^{2} + \frac{\lambda}{2}\Vert{\textbf{w}}\Vert^{2}
\end{equation*}

It's important to note that, in the above equation, $X$, called *design matrix*, is the horizontal concatenation of shape *(batch_size, num_features)* according to the *order* of the polynomial. To make things easier, you can add the *bias* term as the first column of $X$. Take care to have the *weight* vector $\textbf{w}$ with matching dimensions.

$\textit{Hint}$: see [Design_matrix#Multiple_regression](https://en.wikipedia.org/wiki/Design_matrix#Multiple_regression) for how $X$ with 2 features looks like for $1^{st}$ degree polynomial.

a) Derive the gradient (w.r.t $\textbf{w}$) for the regularized loss function given in **3**. (1.0 point)

\begin{equation*}
    J(\textbf{w}) = \frac{1}{2m} ({X\textbf{w}-\textbf{y}})^{T}({X\textbf{w}-\textbf{y}}) + \frac{\lambda}{2m}\sum_{j=1}^{n}{\textbf{w}_j}^{2}
\end{equation*}
\begin{equation*}
\nabla_wJ(\textbf{w}) = X^T({X\textbf{w}-\textbf{y}})+ \lambda\textbf{w}
\end{equation*}

#### 4. Matrix format for higher order polynomial

Written in matrix form, linear regression model for second order would look like: <br />
$$\hat{\textbf{y}} = X\textbf{w}_{1} + X^{2}\textbf{w}_{2} + \textbf{b}$$

where $X^{2}$ is the element-wise squaring of the original design matrix $X$, $\textbf{w}_1$ and $\textbf{w}_2$ are the *weight* vectors, and **b** is the *bias* vector.

a) Now, please write down the matrix format for a $9^{th}$ order linear regression model (0.5 points)
$$\hat{\textbf{y}} =\textbf{b} + \sum_{i=1}^{9} w_{i}x^{i} $$

#### 5. Hyperparameters
we will experiment with three hyperparameters:

i) regularization parameter $\lambda$ <br />
ii) learning rate $\epsilon$ <br />
iii) order of polynomial *p*

And do a grid search over the values that these hyperparameters can take in order to select the best combination (i.e. the one that achieves lowest test error). This approach is called **hyperparameter optimization or tuning**.

In [3]:
polynomial_order = [1, 5, 9]
learning_rates = [1e-5, 1e-8]
lambdas = [0.1, 0.8]

#hyperparams combination
comb_gen = itertools.product(*(polynomial_order, learning_rates, lambdas))
hparams_comb = list(comb_gen)

batch_size = 64

#### 6. Normalization
First of all, inspect the data, and understand its structure and features. Ideally, before starting to train our learning algorithm, we would want the data to be normalized. Here, we normalize the data (i.e. normalize each column) using the formula:

\begin{equation*}
  norm\_x_i = \frac{x_i - min(x)}{max(x) - min(x)}
\end{equation*}
where $x_i$ is the $i^{th}$ sample in feature $x$

a) Complete the following function which performs normalization (i.e. normalizes columns of $X$). (0.5 points)

In [4]:
def data_normalization(data):
    # TODO: implement
    newDF = pd.DataFrame() #creates a new dataframe that's empty
    for column in data:
        min_col = np.amin(data[column])
        max_col = np.amax(data[column])
        new_col = (data[column] - min_col)/(max_col - min_col)
        newDF = pd.concat([newDF, new_col], axis=1) 
    
    data_normalized = newDF.values
    return data_normalized

# perform data normalization
data_normalized = data_normalization(data)
data_npr = data_normalized
print (data_npr)

[[ 0.30769231  0.18627451  0.21686747 ...,  0.26744186  0.12903226  0.5       ]
 [ 0.24038462  0.21568627  0.20481928 ...,  0.31395349  0.24193548  0.5       ]
 [ 0.41346154  0.19607843  0.24096386 ...,  0.25581395  0.33870968  0.5       ]
 ..., 
 [ 0.25961538  0.15686275  0.11445783 ...,  0.27906977  0.22580645  0.5       ]
 [ 0.16346154  0.20588235  0.18072289 ...,  0.18604651  0.77419355
   0.66666667]
 [ 0.21153846  0.12745098  0.22891566 ...,  0.11627907  0.61290323  0.5       ]]


In [5]:
def split_data(data_npr):
    # (in-place) shuffling of data_npr along axis 0
    np.random.shuffle(data_npr)

    n_tr = 3898
    n_va = n_tr + 500
    n_te = n_va + 500
    
    X_train = data_npr[0:n_tr, 0:-1]
    Y_train = data_npr[0:n_tr, -1]
    
    X_val = data_npr[n_tr:n_va, 0:-1]
    Y_val = data_npr[n_tr:n_va, -1]
    
    X_test = data_npr[n_va:, 0:-1]
    Y_test = data_npr[n_va:, -1]
    
    return [(X_train, Y_train), (X_val, Y_val), (X_test, Y_test)]


# shuffle only the training data along axis 0
def shuffle_train_data(X_train, Y_train):
    """called after each epoch"""
    perm = np.random.permutation(len(Y_train))
    Xtr_shuf = X_train[perm]
    Ytr_shuf = Y_train[perm]
    
    return Xtr_shuf, Ytr_shuf

###### 7. Implementation of required functions

Complete the following function which computes the MSE value. (0.5 point) <br />
(i.e. just a vanilla version of it.) That is, you can ignore the regularization term and also the constants $\frac{1}{2}$

In [6]:
def compute_mse(prediction, ground_truth):
    # TODO: implement
    residual = np.subtract(prediction, ground_truth)
    squared = residual**2
    sum_of_squared = np.sum(squared)
    mse = (1/ground_truth.size)*sum_of_squared
    return mse

prediction = np.array([1, 3, 6])
truth = np.array([0, 0, 0])
mse = compute_mse(prediction, truth)
print (mse)

15.3333333333


Implement a function which computes the prediction of your model. (0.5 point)

In [7]:
def get_prediction(X, W):
    # TODO: implement
    Yhat = np.matmul(X, W)
    return Yhat

dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
coef = [0.4, 0.8]
for row in dataset:
    yhat = get_prediction(dataset, np.transpose(coef))
    
print (yhat)

[ 1.2  3.2  4.   2.8  6. ]


Implement a function which computes the gradient of your loss function. (1.0 point) <br />
*Hint: Just implementing the gradient computed in **3.** (a)*

In [8]:
def compute_gradient(X, Y, Yhat, W, lambda_):
    # TODO: implement
    X_T = np.transpose(X)
    pre = np.matmul(X,W)
    gradient = X_T*(pre - Y) + W * lambda_
    return gradient

X = np.array([1, 1])
coef = np.transpose(np.array([0.4, 0.8]))
truth = np.array([1])
Yhat = np.array([1.2])
print (compute_gradient(X, 1.0, Yhat, coef, 1.0))


[ 0.6  1. ]


Implement a function which performs a single update step of SGD. (0.5 point)

In [9]:
# Hint: avoid in-place modification
def sgd(gradient, lr, cur_W):
    # TODO: implement
    new_W = cur_W - (lr * gradient)
    return new_W

Complete the following function which reformats your data as a design matrix. (0.5 point)

In [None]:
# concatenate X acc. to order of polynomial; likewise do it for W
# where X is design matrix, W is the corresponding weight vector
# [1 X X^2 X^3], [1 W1 W2 W3].T

#this function is not done yet
def prepare_data_matrix(X, W, order):
    # TODO: implement
    for i in range(0, order+1):
        if(i==0):
            X_mat = np.ones(X.shape)
            W_vec = np.ones(W.shape)
        else:
            X_mat = np.concatenate((X_mat, X**(i)), axis=1)
            W_vec = np.concatenate((W_vec, W**(i)))
    
#     X_mat = None
    W_vec = np.transpose(W_vec)
    return X_mat, W_vec


# X = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
# W = np.array([[1, 2, 3]], np.int32)
# x_mat, w_vec = prepare_data_matrix(X, W, 3)
# print (x_mat)
# print (w_vec)
# prediction = get_prediction(x_mat, w_vec)

###### 8. Training
Complete the code in the following cell such that it performs **mini-batch gradient descent** on the training data for all possible hyperparameter combinations. (4.0 points)

Note: You can also define a function, named appropriately, which performs training. But, take care to do correct bookkeeping of hyperparameter combinations, weight vectors, and the MSE values.

In [None]:
splits = split_data(data_npr)
X_train, Y_train, X_val, Y_val, X_test, Y_test = itertools.chain(*splits)

tolerance = 1e-3
start = 1

# initialize weight vector from normal distribution
# TODO: implement
w_shape = X_train.shape[1]
W_init = np.random.randn(w_shape)

# cache weights for each hyperparam combination
# TODO: implement
weights_hist = {}
for order in polynomial_order:
    for lr in learning_rates:
        for lamb in lambdas:
            weights_hist[(order, lr, lamb)] = W_init

# keep track of MSE for each hparam combination. will be useful for plotting
# TODO: implement
mse_hist = {}
for order in polynomial_order:
    for lr in learning_rates:
        for lamb in lambdas:
            mse_hist[(order, lr, lamb)] = 0.0

# find optimal hyperparameters
for order in polynomial_order:
    for lr in learning_rates:
        for lamb in lambdas:
            # initialize necessary stuffs
            # TODO: implement
            W = weights_hist[(order, lr, lamb)]
            
            # design matrix needed at this point
            # use the function that we defined above
            # TODO: implement
            X_mat, W_vec = prepare_data_matrix(X_train, W, order)

            epochs = 1
            # goes through multiple epochs
            while True:
                # good idea to shuffle the train data
                # TODO: implement
                X_mat, Y_train = shuffle_train_data(X_mat, Y_train)
                
                # some more initialization
                # TODO: implement
                bs = 0
                nsamples = X_train.shape[0]
                prediction = np.empty(Y_train.shape)
                # goes through 1 epoch
                while bs < nsamples:
                    x = X_mat[bs]
                    prediction[bs] = get_prediction(x, W_vec)
                    gradient = compute_gradient(x, Y_train[bs], prediction[bs], W_vec, lamb)
                    W_vec = sgd(gradient, lr, W_vec)
#                     print("prediction: {} , ground truth: {} ".format(prediction[bs], Y_train[bs]))
                    bs = bs + 1
                    # complete code for 1 epoch
                    # TODO: implement
                    
                # after each epoch
                # get prediction for whole X_train
                # compute the MSE
                # might need to do bookkeeping of mse values as well
                mse = compute_mse(prediction, Y_train)
                mse_hist[(order, lr, lamb)] = mse

                # stopping/convergence criterion
                # check whether diff-in-mse < tolerance
                # TODO: implement
                if(mse < tolerance):
                    break
                epochs += 1
                weights_hist[(order, lr, lamb)] = W_vec
                    # cache weight vector for later use
                    # but we also need the hparam combination
                    # TODO: implement
                print("order: {} , learning rate: {} , regularizer: {} ".format(order, lr, lamb))
                print("Convergence after epoch {} with MSE {}".format(epochs, mse), "\n")
            

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2 with MSE 84.59030123662546 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 3 with MSE 33.629184710153844 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 4 with MSE 13.404989000795506 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 5 with MSE 5.3803902232362075 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 6 with MSE 2.1967087350399654 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 7 with MSE 0.9337222511501009 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 8 with MSE 0.43282826887708087 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 9 with MSE 0.23395895739774208 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 10 with MSE 0.15493970193207565 

order: 1 , learning rate:

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 81 with MSE 0.0587773686695766 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 82 with MSE 0.058356604019046354 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 83 with MSE 0.05792785601392933 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 84 with MSE 0.05751216758154337 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 85 with MSE 0.057091218501921885 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 86 with MSE 0.05668560525150137 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 87 with MSE 0.056268471887808265 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 88 with MSE 0.05587480338387038 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 89 with MSE 0.05547482190306782 

order: 1

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 160 with MSE 0.03577739693628679 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 161 with MSE 0.035583592840001496 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 162 with MSE 0.03541089541776219 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 163 with MSE 0.0352262324324917 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 164 with MSE 0.035042427767048245 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 165 with MSE 0.03487539272440728 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 166 with MSE 0.03469495914607296 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 167 with MSE 0.034521373332450575 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 168 with MSE 0.0343565095308009 




order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 239 with MSE 0.025968168162256312 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 240 with MSE 0.02589172289521226 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 241 with MSE 0.025814746199337473 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 242 with MSE 0.025739112660120342 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 243 with MSE 0.025661524830245818 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 244 with MSE 0.025593504764696896 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 245 with MSE 0.02551687778147573 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 246 with MSE 0.025449283997793136 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 247 with MSE 0.02537086475530

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 319 with MSE 0.021834100733019328 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 320 with MSE 0.021801983230027203 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 321 with MSE 0.021771258142987974 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 322 with MSE 0.021737261430775215 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 323 with MSE 0.021710939250128423 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 324 with MSE 0.021677769410406326 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 325 with MSE 0.021649749363555952 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 326 with MSE 0.02162125754075724 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 327 with MSE 0.02159352113239

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 398 with MSE 0.020186989128239314 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 399 with MSE 0.02017190877046409 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 400 with MSE 0.02015803588877171 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 401 with MSE 0.020145645951517847 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 402 with MSE 0.020135617771503203 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 403 with MSE 0.020127607850635984 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 404 with MSE 0.020109275348206314 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 405 with MSE 0.02009768786925816 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 406 with MSE 0.0200906149726644

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 478 with MSE 0.019544035823660138 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 479 with MSE 0.01954111904117343 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 480 with MSE 0.019538246925644712 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 481 with MSE 0.019534789527146396 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 482 with MSE 0.01952897062816453 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 483 with MSE 0.019520985228455207 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 484 with MSE 0.01951583808898775 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 485 with MSE 0.01951542134887939 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 486 with MSE 0.01951232366867126

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 558 with MSE 0.01932712105476183 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 559 with MSE 0.019325507347236613 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 560 with MSE 0.01932105620002989 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 561 with MSE 0.019322106095134226 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 562 with MSE 0.019318391211009065 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 563 with MSE 0.019322584267271122 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 564 with MSE 0.019322861429124354 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 565 with MSE 0.019320978065784263 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 566 with MSE 0.019319467573175

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 638 with MSE 0.01927374410235688 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 639 with MSE 0.01927529005264754 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 640 with MSE 0.019271815261939718 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 641 with MSE 0.0192725890516035 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 642 with MSE 0.01927398615055298 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 643 with MSE 0.019270692040231234 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 644 with MSE 0.019269490773262545 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 645 with MSE 0.019271143706864335 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 646 with MSE 0.0192739834116244 


order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 717 with MSE 0.01927667259814086 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 718 with MSE 0.019275581139421696 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 719 with MSE 0.01927606348872446 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 720 with MSE 0.019274756523778185 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 721 with MSE 0.019276918384170963 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 722 with MSE 0.01926734512904051 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 723 with MSE 0.019274479065927722 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 724 with MSE 0.019277534253227657 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 725 with MSE 0.0192741645636900

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 798 with MSE 0.019287226647213645 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 799 with MSE 0.01929186670278897 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 800 with MSE 0.019292073949256482 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 801 with MSE 0.019293459263175886 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 802 with MSE 0.01929241176310421 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 803 with MSE 0.019290820865087033 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 804 with MSE 0.01929155299904051 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 805 with MSE 0.01929227078477191 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 806 with MSE 0.01929211892927280

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 878 with MSE 0.01930767149201894 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 879 with MSE 0.01930686761646479 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 880 with MSE 0.019308654885551158 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 881 with MSE 0.019307492096711013 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 882 with MSE 0.019309816085406946 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 883 with MSE 0.019310857077757492 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 884 with MSE 0.01930820022637428 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 885 with MSE 0.019305800706115202 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 886 with MSE 0.0193075534471363

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 958 with MSE 0.019314847816728813 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 959 with MSE 0.019322218777213466 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 960 with MSE 0.019324059499456012 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 961 with MSE 0.019319741235681257 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 962 with MSE 0.019321321664642264 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 963 with MSE 0.01931861013167492 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 964 with MSE 0.01932070834440693 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 965 with MSE 0.019317449072572438 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 966 with MSE 0.019323917994648

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1039 with MSE 0.019327079454349964 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1040 with MSE 0.019325273923204454 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1041 with MSE 0.019329096269559073 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1042 with MSE 0.019329382790762878 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1043 with MSE 0.019332810764754224 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1044 with MSE 0.01933073996076984 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1045 with MSE 0.019327621753752477 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1046 with MSE 0.019330810029202332 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1047 with MSE 0.01933

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1119 with MSE 0.019334974291785757 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1120 with MSE 0.019332187984080835 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1121 with MSE 0.01933379215269263 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1122 with MSE 0.019336516890520856 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1123 with MSE 0.01933235532494496 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1124 with MSE 0.019339588736124055 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1125 with MSE 0.019333794530399854 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1126 with MSE 0.019334109889363932 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1127 with MSE 0.019336

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1198 with MSE 0.019335747742664975 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1199 with MSE 0.019339859494413362 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1200 with MSE 0.019334161082887664 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1201 with MSE 0.01933882225761047 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1202 with MSE 0.019338742387020047 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1203 with MSE 0.019339582233813057 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1204 with MSE 0.019338796195649995 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1205 with MSE 0.019336051240555034 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1206 with MSE 0.01933

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1278 with MSE 0.019340842417587233 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1279 with MSE 0.019341557119171803 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1280 with MSE 0.019336961910802777 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1281 with MSE 0.019340308152363022 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1282 with MSE 0.0193406761387756 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1283 with MSE 0.0193393804978498 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1284 with MSE 0.019336338345927635 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1285 with MSE 0.019339467697514615 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1286 with MSE 0.01934008

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1357 with MSE 0.01934131920710245 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1358 with MSE 0.01934191501875093 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1359 with MSE 0.019337612955934353 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1360 with MSE 0.01934115892457766 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1361 with MSE 0.01934272024720031 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1362 with MSE 0.019340020934645284 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1363 with MSE 0.01933973776136878 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1364 with MSE 0.019338376397491416 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1365 with MSE 0.019343623

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1437 with MSE 0.019342978484977815 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1438 with MSE 0.019341499370979877 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1439 with MSE 0.019341213266046212 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1440 with MSE 0.01934120830954139 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1441 with MSE 0.019342398705822224 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1442 with MSE 0.019338648455408985 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1443 with MSE 0.019342573878886462 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1444 with MSE 0.0193434976161208 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1445 with MSE 0.0193405

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1517 with MSE 0.019344789322089655 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1518 with MSE 0.019338291875094643 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1519 with MSE 0.019342847831279666 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1520 with MSE 0.019343352941185762 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1521 with MSE 0.019341632046663452 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1522 with MSE 0.019341252676495387 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1523 with MSE 0.019340190159916094 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1524 with MSE 0.019342279168985864 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1525 with MSE 0.0193

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1597 with MSE 0.019343531420065507 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1598 with MSE 0.019337518006889043 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1599 with MSE 0.019343743500594996 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1600 with MSE 0.01934007211932839 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1601 with MSE 0.0193407838723275 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1602 with MSE 0.019339243270462183 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1603 with MSE 0.019341293583786456 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1604 with MSE 0.01933978146513547 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1605 with MSE 0.01934073

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1670 with MSE 0.019341406912090898 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1671 with MSE 0.01934162107351726 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1672 with MSE 0.01934069868567875 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1673 with MSE 0.01934007676515616 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1674 with MSE 0.01934276338730943 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1675 with MSE 0.019339614860822526 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1676 with MSE 0.019339787152159765 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1677 with MSE 0.01934465474040343 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1678 with MSE 0.019340146

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1743 with MSE 0.019342184455147245 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1744 with MSE 0.0193438413183635 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1745 with MSE 0.019340302548983553 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1746 with MSE 0.019338821122332976 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1747 with MSE 0.01934189386443222 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1748 with MSE 0.019341155212113798 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1749 with MSE 0.01934128913465192 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1750 with MSE 0.019342637084831717 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1751 with MSE 0.01934091

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1824 with MSE 0.01934054195539226 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1825 with MSE 0.01934210409835911 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1826 with MSE 0.01934182781249522 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1827 with MSE 0.019343289305533258 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1828 with MSE 0.01934220255256369 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1829 with MSE 0.019336772394095213 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1830 with MSE 0.019341415017247145 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1831 with MSE 0.019339057879766183 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1832 with MSE 0.01934037

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1897 with MSE 0.019339291968195834 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1898 with MSE 0.01934073340580878 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1899 with MSE 0.019340171057230557 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1900 with MSE 0.019340719475269816 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1901 with MSE 0.019339862278402325 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1902 with MSE 0.01933869422840985 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1903 with MSE 0.019339454945874024 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1904 with MSE 0.01934072226734081 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1905 with MSE 0.0193395

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1977 with MSE 0.019339763416702026 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1978 with MSE 0.019339782777006313 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1979 with MSE 0.019337619273746665 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1980 with MSE 0.019341909288381618 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1981 with MSE 0.019340301918102383 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1982 with MSE 0.019339454102541554 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1983 with MSE 0.019344166305846427 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1984 with MSE 0.019337547603401388 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 1985 with MSE 0.0193

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2057 with MSE 0.019337500136801623 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2058 with MSE 0.019341208613146085 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2059 with MSE 0.019341870992518995 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2060 with MSE 0.019342941093005255 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2061 with MSE 0.01934106930932812 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2062 with MSE 0.01934152393422392 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2063 with MSE 0.01934005588889921 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2064 with MSE 0.01934021873080491 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2065 with MSE 0.01934245

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2130 with MSE 0.019340849313801405 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2131 with MSE 0.01934108037407489 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2132 with MSE 0.01933962831909686 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2133 with MSE 0.019338615708658947 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2134 with MSE 0.019342930140765383 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2135 with MSE 0.019336320532240363 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2136 with MSE 0.01934441118576548 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2137 with MSE 0.01933995782604614 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2138 with MSE 0.01934178

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2210 with MSE 0.01934145239088924 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2211 with MSE 0.01933766325493316 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2212 with MSE 0.019335139702871828 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2213 with MSE 0.01934202007127108 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2214 with MSE 0.019340697662088197 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2215 with MSE 0.01934352678483394 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2216 with MSE 0.01933905562207809 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2217 with MSE 0.019343397076889885 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2218 with MSE 0.019340251

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2284 with MSE 0.019339379355666535 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2285 with MSE 0.019338570379761964 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2286 with MSE 0.019342572520127065 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2287 with MSE 0.019342508941814242 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2288 with MSE 0.019338802779298125 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2289 with MSE 0.0193385096219872 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2290 with MSE 0.01933709196886068 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2291 with MSE 0.019341653130369476 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2292 with MSE 0.0193415

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2364 with MSE 0.019342269223559208 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2365 with MSE 0.01934238006238631 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2366 with MSE 0.019339886808165202 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2367 with MSE 0.01934209367918691 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2368 with MSE 0.019338626260765407 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2369 with MSE 0.019339449141988502 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2370 with MSE 0.01934038046561685 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2371 with MSE 0.019338196279642228 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2372 with MSE 0.0193426

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2437 with MSE 0.019339492597094297 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2438 with MSE 0.01934044275384906 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2439 with MSE 0.019340894852821585 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2440 with MSE 0.019338458242308173 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2441 with MSE 0.01934303620517021 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2442 with MSE 0.01934286843056731 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2443 with MSE 0.01933715456540937 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2444 with MSE 0.019339980768076054 

order: 1 , learning rate: 1e-05 , regularizer: 0.1 
Convergence after epoch 2445 with MSE 0.01934257

Complete the following function which selects the best hyperparameter combination (i.e. the one that gives lowest MSE on **validation data**). (0.5 point)

In [None]:
# find hparams of minimum MSE on Validation data
def find_best_hparams(weights_hist):
    # TODO: implement
    hpm_best, mse_best = None
    return hpm_best, mse_best

best_hpm_combination = find_best_hparams(weights_hist)

###### 9. Re-Training on Train + Validation data
Complete the following function which does re-training on the combined training and validation data. (**1 point**)

In [None]:
# re-run the training on X_train + X_val combined
# Later test it on X_test; That will be our best possible MSE on test data
# this will be more or less the same training code as you did above
# but, here we just have only one value for each hyperparameter.

# TODO: implement

In [None]:
# plot the convergence of MSE values using matplotlib
# i.e. #epochs on X-axis and MSE values on Y-axis
# TODO: implement

###### 10. Evaluation on Test set
Evaluate your model on test data. (1.0 point)

**Please note that you should keep X_test undisturbed throughout this whole phase.** Else restart the kernel and start from beginning. The whole point of this exercise would not make sense if test data has been *seen in training*.

In [None]:
# finally!!!
# test it on X_test with the Weight vector that you found above
# this will be the generalization error of our model!!
# TODO: implement

#print("Finally!!! MSE achieved on X_test is : {}".format(round(mse_test, 6)))

###### 11. Results
Please report the following

a) MSE value on Test data. (0.5 points)

b) Which hyperparameter combination turned out to be the best? In your understanding, why do you think such a combination turned out to be the best for this task? (1.0 point)

# Bonus (2 points)

Now, please repeat the whole *training, validation, re-training, and testing* procedure that we talked about above with the following hyperparameter combination:

In [None]:
polynomial_order = [1]
learning_rates = [0.1]
lambdas = [0.1]

What are your observations during the training phase? Please explain why such a behaviour happened.

---

## Submission instructions
You should provide a single Jupyter notebook as the solution. The naming should include the assignment number and matriculation IDs of all members in your team in the following format:
**assignment-4_matriculation1_matriculation2_matriculation3.ipynb** (in case of 3 members in a team). 
Make sure to keep the order matriculation1_matriculation2_matriculation3 the same for all assignments.

Please submit the solution to your tutor (with **[NNIA][assignment-4]** in email subject):
1. Maksym Andriushchenko <s8mmandr@stud.uni-saarland.de>
2. Marius Mosbach <s9msmosb@stud.uni-saarland.de>
3. Rajarshi Biswas <rbisw17@gmail.com>
4. Marimuthu Kalimuthu <s8makali@stud.uni-saarland.de>

Note: **If you are in a team, please submit only 1 solution to only 1 tutor.**