# Lab 1 Report
## Problem 1: Linear Regression
### 1.1 Loss function and gradient

$$L = L_{d} + L_{W} = \sum_{n=1}^N\frac{1}{2N} \|W^{T}x^{(n)} + b - y^{(n)}\|^2 + \frac{\lambda}{2}\|W\|^2$$

The gradient with respect to b is
$$\frac{\partial L}{\partial b} = \frac{1}{N}\sum_{n=1}^N\left(W^{T}x^{(n)} + b - y^{(n)}\right)$$

The gradient with respect to W is
$$\frac{\partial L}{\partial W} = \frac{1}{N}\sum_{n=1}^N\left(W^{T}x^{(n)} + b - y^{(n)}\right)x^{(n)} + {\lambda}W$$

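The following snippet implements `MSE` and `gradMSE`:

In [1]:
import numpy as np
from numpy import linalg as la  # assumption: `la` is numpy.linalg (the import is not shown in the report)


def MSE(W, b, x, y, reg):
    # Regularized mean squared error: ||xW + b - y||^2 / (2N) + (reg / 2) * ||W||^2
    N, M = x.shape   # 3500x784
    mse = la.norm(np.dot(x, W) + b - y) ** 2
    total_loss = 1 / (2 * N) * mse + reg / 2 * (la.norm(W) ** 2)
    return total_loss


def gradMSE(W, b, x, y, reg):
    # Gradients of the loss above with respect to the bias and the weights
    N, M = x.shape   # 3500x784
    grad_term = np.dot(x, W) + b - y   # residuals, shape (N, 1)
    gradMSE_bias = 1 / N * np.sum(grad_term)
    gradMSE_W = 1 / N * np.dot(x.T, grad_term) + reg * W

    return gradMSE_bias, gradMSE_W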
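
One way to sanity-check the gradient formulas in Section 1.1 is to compare them with central finite differences. The sketch below is only an illustration: it re-implements the loss and gradients under the hypothetical names `mse_loss` and `mse_grad`, and it runs on small random toy data rather than the lab dataset.

```python
import numpy as np

def mse_loss(W, b, x, y, reg):
    # Regularized mean squared error, matching the loss in Section 1.1.
    N = x.shape[0]
    residual = x @ W + b - y
    return np.sum(residual ** 2) / (2 * N) + reg / 2 * np.sum(W ** 2)

def mse_grad(W, b, x, y, reg):
    # Analytic gradients with respect to the bias and the weights.
    N = x.shape[0]
    residual = x @ W + b - y
    return np.sum(residual) / N, x.T @ residual / N + reg * W

def numeric_grad_W(W, b, x, y, reg, h=1e-6):
    # Central finite differences, perturbing one weight at a time.
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        step = np.zeros_like(W)
        step[i, 0] = h
        grad[i, 0] = (mse_loss(W + step, b, x, y, reg)
                      - mse_loss(W - step, b, x, y, reg)) / (2 * h)
    return grad

# Toy data for illustration only (not the lab dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))
y = rng.integers(0, 2, size=(20, 1)).astype(float)
W = rng.normal(size=(5, 1))
b = 0.3
reg = 0.1

grad_b, grad_W = mse_grad(W, b, x, y, reg)
num_W = numeric_grad_W(W, b, x, y, reg)
num_b = (mse_loss(W, b + 1e-6, x, y, reg) - mse_loss(W, b - 1e-6, x, y, reg)) / 2e-6

max_err_W = np.max(np.abs(grad_W - num_W))
err_b = abs(grad_b - num_b)
print(max_err_W, err_b)
```

Because the loss is quadratic in W and b, the central difference should agree with the analytic gradient up to floating-point rounding.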

### 1.2 Gradient Descent Implementation
The following snippet shows `grad_descent` for the MSE loss only:

In [None]:
def grad_descent(W, b, trainingData, trainingLabels, alpha, epochs, reg, EPS):
    # Batch gradient descent on the regularized MSE loss.
    W_comb = []   # weights recorded after each epoch
    b_comb = []   # bias recorded after each epoch
    losses = []   # training loss recorded after each epoch

    print('in GD with \u03B1 = {}, \u03BB = {}'.format(alpha, reg))
    train_W = W
    train_bias = b

    for i in range(epochs):
        gradMSE_bias, gradMSE_W = gradMSE(train_W, train_bias, trainingData, trainingLabels, reg)
        old_W = train_W
        train_W = train_W - alpha * gradMSE_W
        # Stop once the weight update is smaller than the tolerance EPS.
        if la.norm(train_W - old_W) < EPS:
            break
        train_bias = train_bias - alpha * gradMSE_bias
        mse = MSE(train_W, train_bias, trainingData, trainingLabels, reg)
        W_comb.append(train_W)
        b_comb.append(train_bias)
        losses.append(mse)
    print('GD with \u03B1 = {}, \u03BB = {} finished'.format(alpha, reg))
    return b_comb, W_comb, losses

### 1.3 Tuning the Learning Rate
We trained the model with $\alpha$ = {0.005, 0.001, 0.0001} and plotted the training, validation, and test losses for each $\alpha$. The figures follow.
![alt text](train_loss_linear.png)
<center>Figure 1: Losses on the training data with $\alpha$ = 0.005, 0.001, and 0.0001, respectively</center>
![title](valid_loss_linear.png)
<center>Figure 2: Losses on the validation data with $\alpha$ = 0.005, 0.001, and 0.0001, respectively</center>
![title](test_loss_linear.png)
<center>Figure 3: Losses on the test data with $\alpha$ = 0.005, 0.001, and 0.0001, respectively</center>
<br><br><br>
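
As a rough intuition for why the learning rate matters over a fixed number of epochs, the sketch below runs gradient descent on the toy quadratic $f(w) = \frac{1}{2}w^2$. The function `gd_on_quadratic` and the step count of 1000 are assumptions chosen for illustration; they are not taken from the lab code and do not reproduce the figures above.

```python
def gd_on_quadratic(alpha, steps, w0=1.0):
    # Gradient descent on f(w) = 0.5 * w**2, whose gradient is simply w.
    w = w0
    for _ in range(steps):
        w = w - alpha * w
    return 0.5 * w ** 2

# Final loss after the same number of steps for each learning rate.
final_loss = {alpha: gd_on_quadratic(alpha, steps=1000)
              for alpha in (0.005, 0.001, 0.0001)}
for alpha, loss in final_loss.items():
    print(alpha, loss)
```

Each step multiplies `w` by `1 - alpha`, so within the same budget of steps a smaller learning rate leaves the loss further from its minimum, provided the larger rate is still small enough to be stable.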



<center>Table 1: Accuracy on the training, validation, and test sets with $\alpha$ = 0.005, 0.001, and 0.0001, respectively</center>
<style>
td {
  font-size: 100px
}
</style>

| | Training | Validation | Test |
| --- | --- | --- | --- |
| $\alpha$ = 0.005 | 0.758 | 0.67 | 0.744 |
| $\alpha$ = 0.001 | 0.649 | 0.61 | 0.572 |
| $\alpha$ = 0.0001 | 0.554 | 0.57 | 0.544 |
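
Accuracy for a linear-regression classifier can be computed by thresholding the linear output, as the evaluation code in Section 3.2 does with a threshold of 0.5. The following is a minimal sketch assuming 0/1 labels; the helper name `accuracy` and the toy values are hypothetical.

```python
import numpy as np

def accuracy(W, b, x, y, threshold=0.5):
    # Predict class 1 when the linear output reaches the threshold,
    # then report the fraction of predictions that match the labels.
    predictions = (x @ W + b) >= threshold
    return np.mean(predictions == (y == 1))

# Toy values for illustration only.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([[0], [0], [1], [1]])
W = np.array([[0.4]])

acc = accuracy(W, -0.1, x, y)          # outputs -0.1, 0.3, 0.7, 1.1
acc_shifted = accuracy(W, 0.2, x, y)   # outputs  0.2, 0.6, 1.0, 1.4
print(acc, acc_shifted)
```

With the shifted bias the second sample crosses the threshold and is misclassified, so the accuracy drops from 1.0 to 0.75.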


### 1.4 Generalization
We trained the model with fixed $\alpha$ = 0.005 and $\lambda$ = {0.001, 0.1, 0.5}, and plotted the training, validation, and test losses for each $\lambda$. The figures follow.
![alt text](train_loss_linear_reg.png)
<center>Figure 4: Losses on the training data with $\lambda$ = 0.001, 0.1, and 0.5, respectively</center>
![title](valid_loss_linear_reg.png)
<center>Figure 5: Losses on the validation data with $\lambda$ = 0.001, 0.1, and 0.5, respectively</center>
![title](test_loss_linear_reg.png)
<center>Figure 6: Losses on the test data with $\lambda$ = 0.001, 0.1, and 0.5, respectively</center>
<br><br><br>



<center>Table 2: Accuracy on the training, validation, and test sets with $\lambda$ = 0.001, 0.1, and 0.5, respectively</center>
<style>
td {
  font-size: 100px
}
</style>

| | Training | Validation | Test |
| --- | --- | --- | --- |
| $\lambda$ = 0.001 | 0.763 | 0.68 | 0.751 |
| $\lambda$ = 0.1 | 0.977 | 0.98 | 0.965 |
| $\lambda$ = 0.5 | 0.976 | 0.97 | 0.965 |

### 1.5 Comparing Batch GD with normal equation
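
For reference, setting both gradients from Section 1.1 to zero gives a linear system in the stacked parameters $\theta = [W; b]$:

$$(\tilde{X}^{T}\tilde{X} + N\lambda D)\,\theta = \tilde{X}^{T}y$$

where $\tilde{X}$ is the data matrix with a column of ones appended and $D$ is the identity with its last diagonal entry set to zero, so the bias is not regularized. The sketch below solves this system under those assumptions. The function name `normal_equation` is hypothetical, and the check runs on random toy data, not on the lab dataset.

```python
import numpy as np

def normal_equation(x, y, reg):
    # Closed-form minimizer of ||xW + b - y||^2 / (2N) + (reg / 2) * ||W||^2.
    N, M = x.shape
    x_aug = np.hstack([x, np.ones((N, 1))])   # extra column of ones carries the bias
    penalty = N * reg * np.eye(M + 1)
    penalty[-1, -1] = 0.0                     # the bias is not regularized
    # lstsq returns a least-squares solution even if the system is singular.
    theta = np.linalg.lstsq(x_aug.T @ x_aug + penalty, x_aug.T @ y, rcond=None)[0]
    return theta[:-1], theta[-1, 0]

# Toy data for illustration only (not the lab dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 4))
y = rng.integers(0, 2, size=(30, 1)).astype(float)
W_star, b_star = normal_equation(x, y, reg=0.1)

# Both gradients from Section 1.1 should vanish at the solution.
residual = x @ W_star + b_star - y
grad_W = x.T @ residual / 30 + 0.1 * W_star
grad_b = np.mean(residual)
print(np.max(np.abs(grad_W)), abs(grad_b))
```

Unlike batch gradient descent, this approach needs no learning rate or epoch count, at the cost of solving an (M + 1) by (M + 1) linear system.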

## Problem 2: Logistic Regression
### 2.1 Binary cross-entropy loss

$$L=L_{d}+L_{W} = \sum_{n=1}^N\frac{1}{N} [-y^{(n)}\log\hat{y}(x^{(n)})-(1-y^{(n)})\log(1-\hat{y}(x^{(n)}))] + \frac{\lambda}{2}\|W\|^2$$

where $\hat{y}(x) = \frac{1}{1+e^{-(W^{T}x + b)}}$ is the sigmoid output.

The gradient with respect to b is
$$\frac{\partial L}{\partial b} = -\frac{1}{N}\sum_{n=1}^N[y^{(n)} - \frac{1}{1+e^{-(W^{T}x^{(n)} + b)}}]$$

The gradient with respect to W is
$$\frac{\partial L}{\partial W} = -\frac{1}{N}\sum_{n=1}^N [y^{(n)} - \frac{1}{1+e^{-(W^{T}x^{(n)} + b)}}]x^{(n)} + {\lambda}W$$

The following snippet implements `crossEntropyLoss` and `gradCE`:

In [None]:
def crossEntropyLoss(W, b, x, y, reg):
    N, M = x.shape   # 3500x784
    sigmoid = 1/(1 + np.exp(-np.dot(x, W) - b))  #sigmoid function with input Wx+b
    cross_entropy = np.multiply(y, np.log(sigmoid)) + np.multiply(1-y, np.log(1 - sigmoid))
    total_loss = -1/N * np.sum(cross_entropy) + reg / 2 * (la.norm(W) ** 2)
    return total_loss


def gradCE(W, b, x, y, reg):
    N, M = x.shape  # 3500x784
    sigmoid = 1 / (1 + np.exp(-np.dot(x, W) - b))
    gradCE_bias = -1/N * np.sum(y - sigmoid)
    gradCE_W = -1/N * np.dot(x.T, y - sigmoid) + reg * W
    return gradCE_bias, gradCE_W
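
Taking `np.log` of a sigmoid output can produce `-inf` when the logit is very large in magnitude. A common alternative rewrites the per-example loss as $\log(1 + e^{z}) - yz$ with $z = W^{T}x + b$, which is algebraically the same quantity. The sketch below is one possible variant, not the implementation used for the results in this report; the function names are hypothetical.

```python
import numpy as np

def cross_entropy_naive(W, b, x, y, reg):
    # Direct translation of the loss formula in Section 2.1.
    s = 1 / (1 + np.exp(-(x @ W + b)))
    ce = y * np.log(s) + (1 - y) * np.log(1 - s)
    return -np.mean(ce) + reg / 2 * np.sum(W ** 2)

def cross_entropy_stable(W, b, x, y, reg):
    # Same loss written as log(1 + exp(z)) - y * z, evaluated with logaddexp.
    z = x @ W + b
    return np.mean(np.logaddexp(0.0, z) - y * z) + reg / 2 * np.sum(W ** 2)

# Moderate toy data: both versions should agree.
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=(20, 1)).astype(float)
W = rng.normal(size=(3, 1))
naive = cross_entropy_naive(W, 0.1, x, y, 0.1)
stable = cross_entropy_stable(W, 0.1, x, y, 0.1)

# Extreme logits: the stable version stays finite.
x_big = np.array([[1000.0], [-1000.0]])
y_big = np.array([[1.0], [0.0]])
extreme = cross_entropy_stable(np.array([[1.0]]), 0.0, x_big, y_big, 0.0)
print(naive, stable, extreme)
```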

### 2.2 Learning
We trained the model with fixed $\alpha$ = 0.005 and $\lambda$ = 0.1, and plotted the training, validation, and test losses and accuracies. The following snippet shows the modified `grad_descent`, which supports both the MSE and the cross-entropy loss:

In [1]:
def grad_descent(W, b, trainingData, trainingLabels, alpha, epochs, reg, EPS, lossType="None"):
    W_comb = []
    b_comb = []

    if lossType == "None":
        # Default: batch gradient descent on the regularized MSE loss.
        print('in GD with \u03B1 = {}, \u03BB = {}'.format(alpha, reg))
        train_W = W
        train_bias = b
        losses = []

        for i in range(epochs):
            gradMSE_bias, gradMSE_W = gradMSE(train_W, train_bias, trainingData, trainingLabels, reg)
            old_W = train_W
            train_W = train_W - alpha * gradMSE_W
            # Stop once the weight update is smaller than the tolerance EPS.
            if la.norm(train_W - old_W) < EPS:
                break
            train_bias = train_bias - alpha * gradMSE_bias
            mse = MSE(train_W, train_bias, trainingData, trainingLabels, reg)
            W_comb.append(train_W)
            b_comb.append(train_bias)
            losses.append(mse)
        print('GD with \u03B1 = {}, \u03BB = {} finished'.format(alpha, reg))
    else:
        # Any other lossType dispatches to the cross-entropy variant (defined elsewhere, not shown here).
        b_comb, W_comb, losses = grad_descent_CE(W, b, trainingData, trainingLabels, alpha, epochs, reg, EPS)

    return b_comb, W_comb, losses

The following figure shows the losses and accuracies on the training, validation, and test data with $\alpha$ = 0.005 and $\lambda$ = 0.1.

![alt text](log_loss_acc.png)
<center>Figure 7: losses and accuracy on all data with $\alpha$ = 0.005, $\lambda$ = 0.1</center>
<br><br><br>

### 2.3 Comparison to Linear Regression
The following figure compares the cross-entropy and MSE losses.
![alt text](linear_log_compare.png)
<center>Figure 8: cross entropy and MSE loss comparison ($\alpha$ = 0.005, $\lambda$ = 0)</center>
<br><br><br>

## Problem 3: Batch Gradient Descent vs. SGD and Adam
### 3.1 SGD
The following snippet shows `buildGraph`:

In [None]:
import tensorflow as tf  # TensorFlow 1.x API (placeholders, tf.train)


def buildGraph(beta1, beta2, epsilon, loss="None"):
    # Set the graph-level seed before creating variables so the initial weights are reproducible.
    tf.set_random_seed(421)

    W = tf.Variable(tf.truncated_normal([784, 1], stddev=0.5, dtype=tf.float32))
    b = tf.Variable(tf.zeros(1))

    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.placeholder(tf.float32, [None, 1])
    lambda_ = tf.placeholder(tf.float32)

    if loss == "MSE":
        y_hat = tf.matmul(x, W) + b
        loss_t = 0.5 * tf.reduce_mean(tf.square(y - y_hat)) + lambda_ * tf.nn.l2_loss(W)
    elif loss == "CE":
        logits = tf.matmul(x, W) + b
        # Average the per-example losses so that loss_t is a scalar.
        loss_t = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)) + lambda_ * tf.nn.l2_loss(W)
    else:
        raise ValueError('loss must be "MSE" or "CE"')

    adam_op = tf.train.AdamOptimizer(learning_rate=0.001, beta1=beta1, beta2=beta2, epsilon=epsilon).minimize(loss_t)
    return x, y, W, b, lambda_, loss_t, adam_op

### 3.2 Implementing Stochastic Gradient Descent
The following snippet shows the SGD implementation:

In [None]:
def minibatch(minibatch_size, trainingData, trainingTarget, beta1, beta2, epsilon, plot_loss, lossType):
    # validData, validTarget, testData and testTarget are read from the enclosing (module-level) scope.
    N, _ = trainingData.shape
    x, y, W, b, lambda_, loss_t, adam_op = buildGraph(beta1, beta2, epsilon, loss=lossType)

    n_epochs = 700
    iterations = N // minibatch_size
    # The CE model outputs a logit, so probability 0.5 corresponds to a logit of 0;
    # the MSE model thresholds its linear output at 0.5.
    threshold = 0.0 if lossType == "CE" else 0.5

    # "minibatch" training
    acc_train = np.zeros(n_epochs)
    loss_train = np.zeros(n_epochs)
    acc_valid = np.zeros(n_epochs)
    loss_valid = np.zeros(n_epochs)
    acc_test = np.zeros(n_epochs)
    loss_test = np.zeros(n_epochs)

    init = tf.global_variables_initializer()
    sess = tf.InteractiveSession()

    sess.run(init)

    for i in range(n_epochs):
        # shuffle data and targets with the same permutation
        s = np.arange(N)
        np.random.shuffle(s)
        trainingData = trainingData[s]
        trainingTarget = trainingTarget[s]

        # iterate over the mini-batches
        for j in range(iterations):
            batch_data = trainingData[j*minibatch_size:(j+1)*minibatch_size, :]
            batch_target = trainingTarget[j*minibatch_size:(j+1)*minibatch_size, :]
            sess.run(adam_op, feed_dict={x: batch_data, y: batch_target, lambda_: 0})

        # read the parameters after the last update of the epoch
        train_W, train_b = sess.run([W, b])

        # accuracy and loss for each epoch
        acc_train[i] = np.sum((np.dot(trainingData, train_W) + train_b >= threshold) == trainingTarget) / trainingTarget.shape[0]
        loss_train[i] = sess.run(loss_t, feed_dict={x: trainingData, y: trainingTarget, lambda_: 0})
        # validation and test
        acc_valid[i] = np.sum((np.dot(validData, train_W) + train_b >= threshold) == validTarget) / validTarget.shape[0]
        loss_valid[i] = sess.run(loss_t, feed_dict={x: validData, y: validTarget, lambda_: 0})
        acc_test[i] = np.sum((np.dot(testData, train_W) + train_b >= threshold) == testTarget) / testTarget.shape[0]
        loss_test[i] = sess.run(loss_t, feed_dict={x: testData, y: testTarget, lambda_: 0})
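
The shuffling and slicing steps of the training loop can be isolated in a small generator. The sketch below is a NumPy-only illustration with a hypothetical helper `iterate_minibatches`; like the loop above, it yields `N // batch_size` batches per epoch and drops any remainder.

```python
import numpy as np

def iterate_minibatches(data, target, batch_size, rng):
    # One permutation is applied to both arrays so samples stay paired with their labels.
    order = rng.permutation(data.shape[0])
    for start in range(0, data.shape[0] - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]
        yield data[idx], target[idx]

# Toy data for illustration only: the target is twice the input.
rng = np.random.default_rng(0)
data = np.arange(10).reshape(10, 1)
target = 2 * data
batches = list(iterate_minibatches(data, target, batch_size=4, rng=rng))
print(len(batches), [d.shape for d, _ in batches])
```

With 10 samples and a batch size of 4, each epoch yields two full batches and leaves two samples unused until the next shuffle.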

The following figure shows the baseline case with a minibatch size of 500.
![alt text](SGD_MSE_500.png)
<center>Figure 9: SGD with a minibatch size of 500 (MSE, $\lambda$ = 0)</center>
<br><br><br>

### 3.3 Batch Size Investigation
The following figures show the effect of changing the minibatch size.
![alt text](SGD_MSE_100.png)
<center>Figure 10: SGD with a minibatch size of 100 (MSE, $\lambda$ = 0)</center>
<br><br><br>
![alt text](SGD_MSE_700.png)
<center>Figure 11: SGD with a minibatch size of 700 (MSE, $\lambda$ = 0)</center>
<br><br><br>
![alt text](SGD_MSE_1750.png)
<center>Figure 12: SGD with a minibatch size of 1750 (MSE, $\lambda$ = 0)</center>
<br><br><br>
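
The minibatch size also determines how many parameter updates happen per epoch. The short calculation below assumes a training-set size of 3500 (from the shape comment in Section 1.1) and the 700 epochs set in the code of Section 3.2.

```python
N = 3500        # training-set size, from the "3500x784" comment in Section 1.1
n_epochs = 700  # epoch count used in the minibatch function

# Number of Adam updates per epoch and in total for each minibatch size.
updates_per_epoch = {size: N // size for size in (100, 500, 700, 1750)}
total_updates = {size: n_epochs * n for size, n in updates_per_epoch.items()}

print(updates_per_epoch)  # {100: 35, 500: 7, 700: 5, 1750: 2}
print(total_updates)      # {100: 24500, 500: 4900, 700: 3500, 1750: 1400}
```

Smaller batches therefore take many more, but noisier, steps over the same number of epochs.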

### 3.4 Hyperparameter Investigation
The following figures show the effect of the Adam hyperparameters.
![alt text](SGD_MSE_BETA1.png)
<center>Figure 13: SGD with $\beta_{1}$ = 0.95, 0.99 (MSE, minibatch size 500)</center>
<br><br><br>
![alt text](SGD_MSE_BETA2.png)
<center>Figure 14: SGD with $\beta_{2}$ = 0.99, 0.9999 (MSE, minibatch size 500)</center>
<br><br><br>
![alt text](SGD_MSE_EPI.png)
<center>Figure 15: SGD with $\epsilon$ = 1e-9, 1e-4 (MSE, minibatch size 500)</center>
<br><br><br>
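
To make the roles of $\beta_{1}$, $\beta_{2}$, and $\epsilon$ concrete, the sketch below implements a single scalar Adam update in the textbook form of the algorithm. It is an illustration only: the function name `adam_step` is hypothetical, and TensorFlow's `AdamOptimizer` may differ in implementation details such as how $\epsilon$ is applied.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # beta1 controls the running average of the gradient (first moment).
    m = beta1 * m + (1 - beta1) * grad
    # beta2 controls the running average of the squared gradient (second moment).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # epsilon keeps the denominator away from zero and damps steps for tiny gradients.
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta, m, v

# First update from theta = 0 with a small gradient, for the two epsilon values of Figure 15.
grad = 1e-4
theta_small, _, _ = adam_step(0.0, grad, m=0.0, v=0.0, t=1, epsilon=1e-9)
theta_large, _, _ = adam_step(0.0, grad, m=0.0, v=0.0, t=1, epsilon=1e-4)
step_small = -theta_small
step_large = -theta_large
print(step_small, step_large)
```

When the gradient is comparable in size to $\epsilon$, the larger $\epsilon$ roughly halves the step in this example, whereas with $\epsilon$ = 1e-9 the step stays close to the learning rate.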

### 3.5 Cross Entropy Loss Investigation
The following figures show the losses computed with cross entropy. The baseline run uses a minibatch size of 500, and the hyperparameter runs can be compared with those in Section 3.4.
![alt text](SGD_CE_500.png)
<center>Figure 16: SGD with a minibatch size of 500 (CE, $\lambda$ = 0)</center>
<br><br><br>
![alt text](SGD_CE_BETA1.png)
<center>Figure 17: SGD with $\beta_{1}$ = 0.95, 0.99 (CE, minibatch size 500)</center>
<br><br><br>
![alt text](SGD_CE_BETA2.png)
<center>Figure 18: SGD with $\beta_{2}$ = 0.99, 0.9999 (CE, minibatch size 500)</center>
<br><br><br>
![alt text](SGD_CE_EPI.png)
<center>Figure 19: SGD with $\epsilon$ = 1e-9, 1e-4 (CE, minibatch size 500)</center>
<br><br><br>

### 3.6 Comparison against Batch GD
Comparison between CE batch gradient descent and SGD:
![alt](log_loss_acc.png)
<center>Figure 7: losses and accuracy on all data (CE)</center>
![alt](SGD_CE_500.png)
<center>Figure 16: SGD with a minibatch size of 500 (CE)</center>