
# Problem Set 1
### Due Wednesday, April 3rd by 11:59pm EST
**Your Harvard ID**: 81631239

Note: when completed, please download as an IPython Notebook file and upload to Canvas.

### Question 1
You will be performing one iteration of the forward pass and backpropagation calculations for a small network using Python. Here we will focus on the calculations for one training example, though in reality your data sets will be much larger and require matrix computation. You will also calculate the associated loss.

Let $X_1 = 2$ and $X_2 = -1$ be the feature inputs and initialize the weights as shown in the figure below. This is a neural network with a single hidden layer consisting of three nodes. The blue numbers within each node are the bias terms and the black numbers along the edges are the weights. The hidden layer feeds a single output node, which is used for binary classification. The label for this training example is $y = 1$.



![network](https://drive.google.com/uc?id=1lfmGA56cIu81xD0y1SPKB7VuddS11G5o)


Implement a single forward pass of the network. Assume the hidden layer uses a linear activation function (equivalent to no activation function). You do not need to implement the network in Keras; use NumPy operations (either scalar or matrix) instead. Please use the variable names and print statements provided in the code chunks to display results for the TAs.

In [3]:
# Your code here
import numpy as np
# Initialize the inputs and weights
X = np.array([2, -1])
W_hidden = np.array([[1, 1.1], [0.2, 0], [-0.6, -0.3]])
b_hidden = np.array([-1.8, -0.4, 0.96])
W_output = np.array([0.5, 0.1, 1.3])
b_output = 2

# Perform the forward pass
hidden = np.dot(W_hidden, X) + b_hidden
output = np.dot(W_output, hidden) + b_output
y_hat = 1 / (1 + np.exp(-output))  # Sigmoid for binary classification
prediction = 1 if y_hat >= 0.5 else 0

print('The values for the hidden layer are:', hidden)
print('The value for the output layer is:', output)
print('The predicted probability is:', y_hat)
print('The prediction is:', prediction)

The values for the hidden layer are: [-0.9   0.    0.06]
The value for the output layer is: 1.6280000000000001
The predicted probability is: 0.8358954745887476
The prediction is: 1
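
As a check on the printed values, the arithmetic can be written out by hand. The symbols $h_j$, $z$, and $\hat{y}$ are shorthand introduced here (not part of the assignment) for the hidden-node values, the output pre-activation, and the predicted probability; the weights and biases are the ones used in the code cell above.

$$
\begin{aligned}
h_1 &= (1)(2) + (1.1)(-1) - 1.8 = -0.9 \\
h_2 &= (0.2)(2) + (0)(-1) - 0.4 = 0 \\
h_3 &= (-0.6)(2) + (-0.3)(-1) + 0.96 = 0.06 \\
z &= (0.5)(-0.9) + (0.1)(0) + (1.3)(0.06) + 2 = 1.628 \\
\hat{y} &= \frac{1}{1 + e^{-1.628}} \approx 0.836
\end{aligned}
$$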


Calculate the loss for the training example making sure to select the appropriate loss function.

In [4]:
# Your code here
# Given true label
y_true = 1

# Compute the loss using binary cross-entropy
loss_i = - (y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

print('The loss is:',loss_i)

The loss is: 0.17925170411062194
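
The loss above can be cross-checked with a small helper. This is a minimal sketch, not part of the assignment: the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions, added only to guard against `log(0)` when a predicted probability is exactly 0 or 1.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy for one example, clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# With y_true = 1 the second term vanishes, so the loss is just -log(y_prob).
loss_check = binary_cross_entropy(1, 0.8358954745887476)
print('Loss check:', loss_check)
```

Because the label is 1, only the $-\log(\hat{y})$ term contributes, which is why a confident correct prediction gives a small loss.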


Implement a single backward pass of the network. Again use numpy and report the values using the print statements provided. Please interpret these values. In other words, what are the values you just calculated used for?

In [5]:
# Your code here
# The forward pass was computed above; hidden, output, and y_hat are reused here.

# Backward pass: Compute the gradient of the loss w.r.t. the network's parameters
d_loss_output = y_hat - y_true  # Gradient of loss w.r.t. output of network (pre-activation)
d_loss_hidden = W_output * d_loss_output  # Gradient of loss w.r.t. hidden layer output

# Gradient of loss w.r.t. hidden layer weights and biases
d_loss_W_hidden = np.outer(d_loss_hidden, X)  # Entry (j, i) is d_loss_hidden[j] * X[i]
d_loss_b_hidden = d_loss_hidden  # Linear hidden activation, so the bias gradient equals d_loss_hidden

# Gradient of loss w.r.t. output layer weights and bias
d_loss_W_output = hidden * d_loss_output  # Element-wise multiplication
d_loss_b_output = d_loss_output  # Gradient of loss w.r.t. output layer bias

dl_dw_h = d_loss_W_hidden
dl_db_h = d_loss_b_hidden

dl_dw_1 = d_loss_W_hidden[0]
dl_dw_2 = d_loss_W_hidden[1]
dl_dw_3 = d_loss_W_hidden[2]

print('The gradients of the loss wrt to the hidden weights are:', dl_dw_h)
print('The gradient of the loss wrt to the hidden bias is:', dl_db_h)
print('The gradients of the loss wrt to the input weights going to hidden node 1 are:', dl_dw_1)
print('The gradients of the loss wrt to the input weights going to hidden node 2 are:', dl_dw_2)
print('The gradients of the loss wrt to the input weights going to hidden node 3 are:', dl_dw_3)

The gradients of the loss wrt to the hidden weights are: [[-0.16410453  0.08205226]
 [-0.03282091  0.01641045]
 [-0.42667177  0.21333588]]
The gradient of the loss wrt to the hidden bias is: [-0.08205226 -0.01641045 -0.21333588]
The gradients of the loss wrt to the input weights going to hidden node 1 are: [-0.16410453  0.08205226]
The gradients of the loss wrt to the input weights going to hidden node 2 are: [-0.03282091  0.01641045]
The gradients of the loss wrt to the input weights going to hidden node 3 are: [-0.42667177  0.21333588]
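
One way to see what these gradients are used for is to take a single gradient-descent step and recompute the loss. The sketch below repeats the forward and backward passes so it runs on its own. The learning rate `lr = 0.1` and the helper names `forward` and `bce` are assumptions made for illustration; the problem does not specify them.

```python
import numpy as np

def forward(X, W_h, b_h, W_o, b_o):
    """Linear hidden layer followed by a sigmoid output unit."""
    hidden = W_h @ X + b_h
    y_hat = 1 / (1 + np.exp(-(W_o @ hidden + b_o)))
    return hidden, y_hat

def bce(y, y_hat):
    """Binary cross-entropy for a single example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

X, y = np.array([2.0, -1.0]), 1
W_h = np.array([[1, 1.1], [0.2, 0], [-0.6, -0.3]])
b_h = np.array([-1.8, -0.4, 0.96])
W_o = np.array([0.5, 0.1, 1.3])
b_o = 2.0

hidden, y_hat = forward(X, W_h, b_h, W_o, b_o)
loss_before = bce(y, y_hat)

# Same gradients as in the backward pass above
delta = y_hat - y
grad_W_o, grad_b_o = hidden * delta, delta
grad_b_h = W_o * delta
grad_W_h = np.outer(grad_b_h, X)

lr = 0.1  # hypothetical learning rate, not given in the problem
W_h_new, b_h_new = W_h - lr * grad_W_h, b_h - lr * grad_b_h
W_o_new, b_o_new = W_o - lr * grad_W_o, b_o - lr * grad_b_o

_, y_hat_new = forward(X, W_h_new, b_h_new, W_o_new, b_o_new)
loss_after = bce(y, y_hat_new)
print('Loss before update:', loss_before)
print('Loss after update: ', loss_after)
```

Each parameter moves a small step against its gradient, so the loss on this training example should be lower after the update.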


### Question 2
In class we considered classification problems, where the goal is to predict a single discrete label for an input data point. Another common type of machine learning problem is "regression", which consists of predicting a continuous value instead of a discrete label: for instance, predicting tomorrow's temperature from meteorological data, or predicting how long a software project will take to complete from its specifications.

You will be attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the suburb at the time, such as the crime rate, the local property tax rate, etc.

The dataset you will be using differs from our previous examples in two ways. It has very few data points: only 506 in total, split into 404 training samples and 102 test samples. Also, each "feature" in the input data (e.g. the crime rate) has a different scale: some values are proportions between 0 and 1, others range from 1 to 12, and others from 0 to 100.

The input data consist of 13 features:

1. Per capita crime rate.
2. Proportion of residential land zoned for lots over 25,000 square feet.
3. Proportion of non-retail business acres per town.
4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. Nitric oxides concentration (parts per 10 million).
6. Average number of rooms per dwelling.
7. Proportion of owner-occupied units built prior to 1940.
8. Weighted distances to five Boston employment centres.
9. Index of accessibility to radial highways.
10. Full-value property-tax rate per $10,000.
11. Pupil-teacher ratio by town.
12. 1000(Bk - 0.63)^2 where Bk is the proportion of Black people by town.
13. Percentage of the population with lower socioeconomic status.

The targets (outcomes, y) are the median values of owner-occupied homes, in thousands of dollars. The prices are typically between 10,000 and 50,000 dollars. If that sounds cheap, remember this was the mid-1970s, and these prices are not inflation-adjusted.

In [6]:
# Import needed packages
import tensorflow as tf
import numpy as np
from tensorflow import keras
import pandas as pd
from keras.utils import to_categorical
from tensorflow.keras import layers
import matplotlib.pyplot as plt
%matplotlib inline

In [7]:
# Load the data
from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) =  boston_housing.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz


Print the dimensions of the training set, i.e. its shape

In [None]:
# Your code here

Print the dimensions of the test set, i.e. its shape

In [None]:
# Your code here
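
A NumPy array reports its dimensions through the `.shape` attribute. The sketch below uses stand-in arrays, named `train_stub` and `test_stub` here as placeholders, with the sizes quoted in the text. In the notebook, the same attribute would be read from the arrays returned by `boston_housing.load_data()`.

```python
import numpy as np

# Stand-in arrays with the sizes quoted in the text:
# 404 training samples, 102 test samples, 13 features each.
train_stub = np.zeros((404, 13))
test_stub = np.zeros((102, 13))

print('Training set shape:', train_stub.shape)
print('Test set shape:', test_stub.shape)
```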

Feeding a neural network values that span wildly different ranges is problematic. The network might be able to adapt automatically to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice for such data is feature-wise normalization: for each feature in the input data (a column in the input data matrix), we subtract the mean of the feature and divide by its standard deviation, so that the feature is centered around 0 and has unit standard deviation.

Normalize the data. Be sure to normalize the test set with the training set mean and standard deviation.

In [None]:
# Your code here
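
Feature-wise normalization can be sketched on synthetic data as follows. The arrays `train_demo` and `test_demo` are made-up stand-ins for the real training and test matrices, and the three feature scales are arbitrary. The point is that the mean and standard deviation are computed per column from the training set only and then reused for the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: three features on very different scales.
scales = np.array([1.0, 100.0, 0.01])
train_demo = rng.normal(loc=5.0, size=(404, 3)) * scales
test_demo = rng.normal(loc=5.0, size=(102, 3)) * scales

# Per-feature statistics from the training set only
mean = train_demo.mean(axis=0)
std = train_demo.std(axis=0)

train_norm = (train_demo - mean) / std
test_norm = (test_demo - mean) / std  # reuse the training statistics

print('Train mean/std:', train_norm.mean(axis=0).round(6), train_norm.std(axis=0).round(6))
print('Test mean/std: ', test_norm.mean(axis=0).round(3), test_norm.std(axis=0).round(3))
```

The normalized test features will generally be close to, but not exactly, mean 0 and standard deviation 1. That is expected, since no information from the test set went into the statistics.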

Fit a fully connected neural network with 2 hidden layers and an output layer. Include 64 hidden units in each hidden layer and an appropriate number of units in the output layer. You are free to choose the activation functions. Use the `rmsprop` optimizer, and choose an appropriate loss function and model performance measure. Referring to the table shown in lectures 2 and 3 may help with these choices. Train the network for 50 epochs with a batch size of 10.

In [None]:
# Your code here
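
For a continuous target, one common pairing is mean squared error as the loss and mean absolute error as the performance measure. This pairing is an assumption here, not a choice the assignment dictates. The NumPy sketch below shows both quantities on made-up prices, along with how many weight updates the stated settings imply, assuming the final partial batch of each epoch is kept.

```python
import math
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: penalizes large errors more heavily."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average miss, in the units of the target."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up prices in thousands of dollars, for illustration only
y_true_demo = np.array([20.0, 30.0, 15.0])
y_pred_demo = np.array([22.0, 27.0, 15.0])
print('MSE:', mse(y_true_demo, y_pred_demo))
print('MAE:', mae(y_true_demo, y_pred_demo))

# Training settings from the text: 404 samples, batch size 10, 50 epochs
n_train, batch_size, epochs = 404, 10, 50
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print('Weight updates per epoch:', updates_per_epoch)
print('Total weight updates:', total_updates)
```

Since the targets are in thousands of dollars, an MAE of 2.5 would mean predictions are off by about $2,500 on average.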

Report the test set accuracy (as in, a measure of how good your predictions are) and compare it to the training set accuracy. **Interpret what this means in words, in terms of what you are trying to do with your network**.

In [None]:
# Your code here

Answer:

Now fit the same network as above but with 16 hidden nodes in each hidden layer. **What is the test set accuracy and how does it compare to the first network you created? Which model do you think is better?**

In [None]:
# Your code here
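
When comparing the two networks, it can help to know how many trainable parameters each has. The sketch below counts weights and biases for a stack of fully connected layers. The helper `dense_param_count` is hypothetical, and a single output unit is assumed.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 13 input features, two hidden layers, one output unit (assumed)
wide = dense_param_count([13, 64, 64, 1])
narrow = dense_param_count([13, 16, 16, 1])
print('64-unit network parameters:', wide)
print('16-unit network parameters:', narrow)
```

The 64-unit network has roughly ten times as many parameters as the 16-unit one. This is worth keeping in mind when interpreting any gap between training and test performance on only 404 training samples.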

Answer: