# Homework Assignment 7 - Chem 277B
## Neural Networks and Deep Learning

### 1) Objective

The goal is to perform several regression and classification tasks with neural networks and to compare their performance against standard tools such as linear regression. To understand how an artificial neural network (ANN) actually works, we use our custom layers (see lecture) for the analysis.<br>
**Note:** To build your ANN efficiently, you can follow the structure provided in the lecture material (ANNI.ipynb, ANNII.ipynb and ANNIII.ipynb).

### 2) Preparation

Before starting, import the necessary libraries for data analysis and visualization. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from tqdm import tqdm

from sklearn.datasets import make_moons

Next, we define our custom building blocks: the dense layer, the activation functions, an optimizer, and the loss functions.

In [None]:
class Layer_Dense():
    
    def __init__(self, n_inputs, n_neurons):
        self.weights = np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
        
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
        self.inputs = inputs

    def backward(self, dvalues):
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        self.dinputs = np.dot(dvalues, self.weights.T)

class Activation_Step():

    def forward(self, inputs):
        self.output = (inputs >= 0).astype(float)
        self.inputs = inputs

    def backward(self, dvalues):
        self.dinputs = np.zeros_like(dvalues)

class Activation_ReLU():

    def forward(self, inputs):
        self.output = np.maximum(0, inputs)
        self.inputs = inputs

    def backward(self, dvalues):
        self.dinputs = dvalues.copy()
        self.dinputs[self.inputs <= 0] = 0

class Activation_Sigmoid():
        
    def forward(self, inputs):
        self.output = np.clip(1 / (1 + np.exp(-inputs)), 1e-7, 1 - 1e-7)

    def backward(self, dvalues):
        sigm = self.output
        deriv = sigm * (1 - sigm)
        self.dinputs = deriv * dvalues

class Activation_Softmax():

    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities

    def backward(self, dvalues):
        # Reference implementation (kept for comparison): one Jacobian per sample in a Python loop.
        # self.dinputs = np.empty_like(dvalues)
        # for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
        #     single_output = single_output.reshape(-1, 1)
        #     jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
        #     self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)
        # Vectorized equivalent: jacobian_matrices[i, j, k] = s[i, j] * delta[j, k] - s[i, j] * s[i, k]
        jacobian_matrices = (
            np.einsum('ij,jk->ijk', self.output, np.eye(self.output.shape[1]))
            - np.einsum('ij,ik->ijk', self.output, self.output)
        )
        # Apply each sample's Jacobian to its upstream gradient
        self.dinputs = np.einsum('ijk,ik->ij', jacobian_matrices, dvalues)

class Optimizer_SGD:

    def __init__(self, learning_rate = 0.01):
        self.learning_rate = learning_rate
        
    def update_params(self, layer):
        weight_updates = -self.learning_rate * layer.dweights
        bias_updates = -self.learning_rate * layer.dbiases

        layer.weights += weight_updates
        layer.biases += bias_updates

class Loss_MeanSquaredError:

    def forward(self, y_pred, y_true):
        assert y_pred.shape == y_true.shape, "Shapes of predicted and true values must match."
        return np.mean((y_pred - y_true) ** 2)

    def backward(self, dvalues, y_true):
        Nsamples = len(dvalues)
        self.dinputs = 2 * (dvalues - y_true) / Nsamples

class Loss_BinaryCrossEntropy:

    def forward(self, y_pred, y_true):
        assert y_pred.shape == y_true.shape, "Shapes of predicted and true values must match."
        correct_confidences = y_pred * y_true + (1 - y_pred) * (1 - y_true)
        negative_log_likelihoods = -np.log(correct_confidences)
        return np.mean(negative_log_likelihoods)
    
    def backward(self, dvalues, y_true):
        Nsamples = len(dvalues)
        self.dinputs = - (y_true / dvalues - (1 - y_true) / (1 - dvalues)) / Nsamples

class Loss_MultiClassCrossEntropy:

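    def forward(self, y_pred, y_true):
        assert y_pred.shape == y_true.shape, "Shapes of predicted and true values must match."
        # Clip to avoid log(0) if a softmax output underflows to exactly zero
        y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
        correct_confidences = np.sum(y_pred * y_true, axis=1)
        negative_log_likelihoods = -np.log(correct_confidences)
        return np.mean(negative_log_likelihoods)
    
    def backward(self, dvalues, y_true):
        Nsamples = len(dvalues)
        # Clip to avoid division by zero (0 / 0 would produce NaN gradients)
        dvalues = np.clip(dvalues, 1e-7, 1 - 1e-7)
        self.dinputs = - (y_true / dvalues) / Nsamples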
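
Before building larger models, it can help to check that a layer's `backward` method agrees with a numerical derivative. The sketch below is only an illustration, not part of the lecture material: it repeats a minimal copy of `Layer_Dense` so that it runs on its own, and the helpers `mse` and `numerical_weight_gradient` as well as the random toy data are assumptions introduced here.

```python
import numpy as np

class Layer_Dense:
    """Minimal copy of the dense layer defined above, repeated so the sketch runs on its own."""

    def __init__(self, n_inputs, n_neurons):
        self.weights = np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.inputs = inputs
        self.output = np.dot(inputs, self.weights) + self.biases

    def backward(self, dvalues):
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        self.dinputs = np.dot(dvalues, self.weights.T)

def mse(y_pred, y_true):
    """Mean squared error (hypothetical helper for this sketch)."""
    return np.mean((y_pred - y_true) ** 2)

def numerical_weight_gradient(layer, x, y, eps=1e-6):
    """Central finite differences of the MSE with respect to each weight (hypothetical helper)."""
    grad = np.zeros_like(layer.weights)
    for idx in np.ndindex(*layer.weights.shape):
        original = layer.weights[idx]
        layer.weights[idx] = original + eps
        layer.forward(x)
        loss_plus = mse(layer.output, y)
        layer.weights[idx] = original - eps
        layer.forward(x)
        loss_minus = mse(layer.output, y)
        layer.weights[idx] = original          # restore the weight
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

np.random.seed(0)
x = np.random.randn(5, 3)                      # toy inputs: 5 samples, 3 features
y = np.random.randn(5, 1)                      # toy targets
layer = Layer_Dense(3, 1)

# Analytic gradient: forward pass, loss gradient, backward pass
layer.forward(x)
dloss = 2 * (layer.output - y) / len(y)        # same expression as Loss_MeanSquaredError.backward
layer.backward(dloss)

# Compare against the numerical gradient
numeric = numerical_weight_gradient(layer, x, y)
max_abs_diff = np.max(np.abs(layer.dweights - numeric))
print("largest deviation:", max_abs_diff)
```

If the two gradients disagree by more than rounding error, the backward pass (or the way the layers are chained) is the first place to look.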

### 3) Regression Task

#### 3.1) Data Generation

First, we will perform a regression task using a neural network on a synthetic dataset generated from an exponential function with added noise.

In [None]:
np.random.seed(42)

x = np.random.uniform(-2, 2, (100, 1))
y = np.exp(x) + 0.1 * np.random.randn(len(x), 1)

plt.plot(x, y, 'k.', label='Data points')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

#### 3.2) Linear Model

Without using a neural network, fit a linear regression model to the data and visualize the results. For this purpose, you can use a single dense layer without activation as the model (**recall: the neuron itself is just a linear model**) and use gradient descent to minimize the mean squared error. Plot the original data points and the model predictions. Discuss the performance of the linear regression model.
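
For reference, the quantities involved can be written out as follows. The symbols $w$, $b$, $\eta$ (learning rate) and $N$ (number of samples) are notation introduced here, chosen to match `Layer_Dense`, `Loss_MeanSquaredError` and `Optimizer_SGD` for a single input and a single output:

$$
\hat{y}_i = w x_i + b, \qquad L(w, b) = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2
$$

$$
w \leftarrow w - \eta \, \frac{2}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right) x_i, \qquad
b \leftarrow b - \eta \, \frac{2}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)
$$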

In [None]:
np.random.seed(0)

######## Fill in the code below ########

########################################

*Your discussion*



Some of the early activation functions, like a linear function or a Heaviside step function, are not used in modern neural networks. Explain why these activation functions are not suitable for training deep neural networks.

*Your discussion*


#### 3.3) Single Neuron

Use a single neuron to fit the same data. Again, train the network using gradient descent to minimize the mean squared error, and plot the original data points and the model predictions. Compare the performance of the single neuron model with that of the linear regression model.

Hint: You need a linear layer to scale the input of the neuron, an activation function (e.g., ReLU, which I recommend for its simplicity and effectiveness), and another linear layer to scale the output of the neuron.
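
Written as a formula, the structure described in the hint is the following, where the scalars $w_1, b_1, w_2, b_2$ are notation introduced here for the weights and biases of the two linear layers:

$$
\hat{y}(x) = w_2 \, \mathrm{ReLU}(w_1 x + b_1) + b_2, \qquad \mathrm{ReLU}(z) = \max(0, z)
$$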

In [None]:
np.random.seed(0)

######## Fill in the code below ########

########################################

*Your discussion*



#### 3.4) Neural Network

Now, use a neural network with 1 hidden layer containing 2 neurons and an appropriate activation function (e.g., ReLU) to fit the same data. Train the model, plot the original data points and the model predictions, and compare the performance with the previous models.

In [None]:
np.random.seed(0)

######## Fill in the code below ########

########################################

*Your discussion*


#### 3.5) Universal Approximation Theorem

According to the universal approximation theorem, a neural network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function. Try a large number of neurons (e.g., 128 or 1024) in the hidden layer, and see how well the model fits the data. Discuss your observations. Does the theory hold in practice? If not, what could be the reasons and possible solutions? 

Hint: You don't need to implement the solutions; discussing them is enough. If you want to try them, you can do so in Section 6 afterwards.

In [None]:
np.random.seed(0)

######## Fill in the code below ########

########################################

*Your discussion*


### 4) Binary Classification Task

#### 4.1) Data Generation

The second task is a binary classification problem. Generate the double moon dataset using the provided function.

In [None]:
x, y = make_moons(n_samples=1000, noise=0.2, shuffle=True, random_state=42)
x[:, 0] = x[:, 0] - 0.5
x[:, 1] = x[:, 1] - 0.25
y = y.reshape(-1, 1)

plt.scatter(x[:, 0], x[:, 1], c=y, s=10)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

#### 4.2) Linear Model

Using a single dense layer without activation as the model, train the network using gradient descent to minimize the binary cross-entropy loss. Plot the data points and the model predictions. Discuss the performance of the linear model.

Hint: The model should output a single value between 0 and 1 to represent the probability of one of the classes. Use the sigmoid activation function after the dense layer to output probabilities between 0 and 1.
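
The plotting lines provided in the code cells of this section use three arrays, `xx`, `yy` and `zz`, which the template leaves for you to define. One possible way to build them is sketched below. The function `predict_proba` is a hypothetical stand-in for the forward pass of your trained model, and the grid limits and resolution are assumed values chosen to cover the shifted data.

```python
import numpy as np

def predict_proba(points):
    """Hypothetical stand-in for a trained model: maps (N, 2) points to (N, 1) probabilities.
    Replace the body with the forward pass through your own layers."""
    logits = points[:, [0]] - points[:, [1]]   # arbitrary placeholder, not a trained model
    return 1 / (1 + np.exp(-logits))

# Regular grid over the input plane (limits and resolution are assumptions)
x1_grid = np.linspace(-2.0, 2.0, 200)
x2_grid = np.linspace(-1.5, 1.5, 150)
xx, yy = np.meshgrid(x1_grid, x2_grid)

# Flatten the grid into an (N, 2) array, evaluate the model, reshape back to the grid shape
grid_points = np.column_stack([xx.ravel(), yy.ravel()])
zz = predict_proba(grid_points).reshape(xx.shape)

print(xx.shape, yy.shape, zz.shape)
```

With `zz` holding the model output on the grid, the `levels=[0, 0.5, 1]` contour marks the 0.5 decision boundary.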

In [None]:
np.random.seed(0)

######## Fill in the code below ########

########################################
plt.contour(xx, yy, zz, levels=[0, 0.5, 1], alpha=0.5)
plt.contourf(xx, yy, zz, levels=np.linspace(0, 1, 33), alpha=0.2, zorder=-1)
plt.xlabel('x1')
plt.ylabel('x2')
plt.colorbar(label='Model output')
plt.show()

*Your discussion*



#### 4.3) Neural Network Model

Using a neural network with 1 hidden layer, fit the double moon dataset. How many neurons are needed in the hidden layer? Discuss the performance of the model.

In [None]:
np.random.seed(0)

######## Fill in the code below ########

########################################
plt.contour(xx, yy, zz, levels=[0, 0.5, 1], alpha=0.5)
plt.contourf(xx, yy, zz, levels=np.linspace(0, 1, 33), alpha=0.2, zorder=-1)
plt.xlabel('x1')
plt.ylabel('x2')
plt.colorbar(label='Model output')
plt.show()

*Your discussion*


#### 4.4) Universal Approximation Theorem

Instead of going wider with more neurons in the hidden layer, this time try adding more hidden layers to the network. Use more than 1 hidden layer with the same number of neurons as before for each layer and discuss the performance of the model.

In [None]:
np.random.seed(0)

######## Fill in the code below ########

########################################
plt.contour(xx, yy, zz, levels=[0, 0.5, 1], alpha=0.5)
plt.contourf(xx, yy, zz, levels=np.linspace(0, 1, 33), alpha=0.2, zorder=-1)
plt.xlabel('x1')
plt.ylabel('x2')
plt.colorbar(label='Model output')
plt.show()

*Your discussion*


### 5) Multiclass Classification Task

#### 5.1) Data Generation

The last task is a multiclass classification problem. Combine two double moon datasets using the provided function.

In [None]:
x1, y1 = make_moons(n_samples=500, noise=0.2, shuffle=True, random_state=42)
x2, y2 = make_moons(n_samples=500, noise=0.2, shuffle=True, random_state=24)
x1[:, 0] = x1[:, 0] - 0.5
x1[:, 1] = x1[:, 1] + 0.75
x2[:, 0] = x2[:, 0] - 0.5
x2[:, 1] = x2[:, 1] - 1.25
x = np.vstack([x1, x2])
y = np.hstack([y1, y2 + 2])
y_onehot = np.zeros((len(y), 4))
y_onehot[np.arange(len(y)), y] = 1

plt.scatter(x[:, 0], x[:, 1], c=y, s=10)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

#### 5.2) Linear Model

At this point, you should be familiar with the process. Using a single dense layer without activation as the model, train the network using gradient descent to minimize the multi-class cross-entropy loss. Plot the data points and the model predictions. Discuss the performance of the linear model.

Hint: The model should output 4 values (one for each class) for each data point. Use the softmax activation function after the dense layer to convert the outputs into probabilities.
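
To turn the four softmax probabilities into a single predicted class per point, one common choice is to take the index of the largest probability. The sketch below illustrates this with made-up numbers; the arrays `probabilities` and `labels` are assumptions for illustration only. Using the predicted class index as `zz` is one possible reading of the plotting code, whose contour levels run from 0 to 4.

```python
import numpy as np

# Made-up softmax outputs for three points and four classes (each row sums to 1)
probabilities = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10],
    [0.20, 0.20, 0.25, 0.35],
])

# Predicted class index = position of the largest probability in each row
predicted_class = np.argmax(probabilities, axis=1)

# One-hot encoding of integer labels, using the same indexing as the data generation cell
labels = np.array([0, 2, 3])
onehot = np.zeros((len(labels), 4))
onehot[np.arange(len(labels)), labels] = 1

# Accuracy: fraction of points whose predicted class equals the true label
accuracy = np.mean(predicted_class == labels)
print(predicted_class, accuracy)
```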

In [None]:
np.random.seed(0)

######## Fill in the code below ########

########################################
plt.contourf(xx, yy, zz, levels=np.linspace(0, 4, 33), alpha=0.2, zorder=-1)
plt.xlabel('x1')
plt.ylabel('x2')
plt.colorbar(label='Model output')
plt.show()

*Your discussion*


#### 5.3) Neural Network Model

Use a neural network to fit the multiclass dataset. How many hidden layers, and how many neurons per hidden layer, are needed? Discuss the performance of the model.

In [None]:
np.random.seed(0)

######## Fill in the code below ########

########################################
plt.contourf(xx, yy, zz, levels=np.linspace(0, 4, 33), alpha=0.2, zorder=-1)
plt.xlabel('x1')
plt.ylabel('x2')
plt.colorbar(label='Model output')
plt.show()

*Your discussion*



### 6) Moving Beyond (Optional)

You might have noticed that simply increasing the number of neurons or layers does not always lead to better performance. The best way to improve the model is highly dependent on the specific problem and dataset, and often requires experimentation and tuning. Try out some of the following techniques to see if you can achieve better results on the regression task from Section 3 (the data is regenerated in the first code cell below).

- Model architecture changes
    - Adding more layers (deepening the network)
    - Adding more neurons (widening the network)
    - Using different activation functions (e.g., sigmoid, tanh, Leaky ReLU)
    - Changing the initialization of weights and biases (e.g., Kaiming initialization, Xavier initialization)
- Loss function modifications
    - Incorporating regularization techniques (e.g., L1, L2 regularization)
- Optimization techniques
    - Using advanced optimizers (e.g., Adam, RMSprop)
    - Implementing learning rate schedules (e.g., step decay, exponential decay)
- Data processing
    - Normalizing or standardizing input features
    - Normalizing or standardizing target values
    - Batching the data for training

Since you have access to all the attributes of the layers, loss functions, and optimizers, feel free to modify them as needed. Document your findings and discuss the impact of these changes on the model's performance.
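
As one illustration of the initialization item in the list above, the sketch below compares the weight scale used in `Layer_Dense.__init__` (standard normal) with Kaiming (He) normal initialization, which uses a standard deviation of $\sqrt{2 / n_\text{inputs}}$. The helper `kaiming_normal`, the layer width of 1024 and the random stand-in activations are assumptions made for this example; it is not a full training run.

```python
import numpy as np

def kaiming_normal(n_inputs, n_neurons, rng):
    """Kaiming (He) normal initialization: standard deviation sqrt(2 / n_inputs)."""
    return rng.standard_normal((n_inputs, n_neurons)) * np.sqrt(2.0 / n_inputs)

rng = np.random.default_rng(0)
n_hidden = 1024

# Stand-in for the ReLU outputs of a wide hidden layer (1000 samples, 1024 neurons)
hidden = np.maximum(0, rng.standard_normal((1000, n_hidden)))

# Output-layer weights: default scale versus Kaiming scale
w_default = rng.standard_normal((n_hidden, 1))   # same scale as Layer_Dense.__init__
w_kaiming = kaiming_normal(n_hidden, 1, rng)

# Spread of the network output at initialization for both choices
std_default = np.std(hidden @ w_default)
std_kaiming = np.std(hidden @ w_kaiming)
print("default init:", std_default)
print("Kaiming init:", std_kaiming)
```

With the default scale the initial outputs spread much more widely than the targets, which tends to make the first gradient steps large. To try the alternative in your own model, you could overwrite `layer.weights` after constructing a `Layer_Dense`.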

In [None]:
np.random.seed(42)

x = np.random.uniform(-2, 2, (100, 1))
y = np.exp(x) + 0.1 * np.random.randn(len(x), 1)

In [None]:
np.random.seed(0)

######## Fill in the code below ########


########################################
xx = np.linspace(-2, 2, 1000).reshape(-1, 1)
dense1.forward(xx)
relu.forward(dense1.output)
dense2.forward(relu.output)
plt.plot(xx, dense2.output, 'r-', label='Model predictions')

plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

*Your discussion*
