# Machine Learning - Assignment 3

## Artificial Neural Network

The aim of the assignment is to implement an artificial neural network (mostly) from scratch. This includes implementing or fixing the following:

* Add support for additional activation functions and their derivatives.
* Add support for loss functions and their derivatives.
* Add the use of a bias in the forward propagation.
* Add the use of a bias in the backward propagation.

In addition, you will be doing the following:

* Test the algorithm on 3 datasets.
* Compare neural networks with and without scaling.
* Hyper-parameter tuning.

The forward and backward propagation functions each operate on a single layer, and are reused once per layer to handle networks with multiple layers.

Follow the instructions and implement what is missing to complete the assignment. Some functions have been started to help you a little bit with the implementation.

**Note:** You might need to go back and forth during your implementation of the code. The structure is set up to make the implementation easier, but you may still find yourself going back to change something in order to make a later step easier.

## Assignment preparations

We help you out with importing the libraries.

**IMPORTANT NOTE:** You may not import any more libraries than the ones already imported!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# We set seed to better reproduce results later on.
#np.random.seed(12345)

## Neural Network utility functions

In [None]:
np.random.seed(0)

def spiral_data(samples, classes):
    X = np.zeros((samples*classes, 2))
    y = np.zeros(samples*classes, dtype='uint8')
    for class_number in range(classes):
        ix = range(samples*class_number, samples*(class_number+1))
        r = np.linspace(0.0, 1, samples) # radius
        t = np.linspace(class_number*4, (class_number+1)*4, samples) + np.random.randn(samples)*0.2
        X[ix] = np.c_[r*np.sin(t*2.5), r*np.cos(t*2.5)]
        y[ix] = class_number
    return X, y

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        # We need to know the size of the incoming input and how many neurons this layer should have
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.inputs = inputs
        self.output = np.dot(inputs, self.weights) + self.biases
        # Why use an activation function?
        # It lets the network form more complex decisions: a non-linear activation function is needed to fit non-linear data.
        # activation = 

    def backward(self, dvalues):
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradient on values
        self.dinputs = np.dot(dvalues, self.weights.T)


class Activation_ReLu:
    def forward(self, inputs):
        self.inputs = inputs # Remember input values
        self.output = np.maximum(0, inputs) # Calculate output values from inputs

    def backward(self, dvalues):
        self.dinputs = dvalues.copy()
        self.dinputs[self.inputs <= 0] = 0  # Zero gradient where input values were zero or negative

class Activation_Softmax:
    def forward(self, inputs):
        self.inputs = inputs  # Remember input values
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True)) # Unnormalized probabilities; subtracting the row maximum prevents overflow
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True) # Normalize values
        self.output = probabilities
    
    def backward(self, dvalues):
        self.dinputs = np.empty_like(dvalues) # Creates an uninitialized array with the same shape as dvalues
        # Enumerate outputs and gradients
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            single_output = single_output.reshape(-1, 1) # Reshape the output into a column vector
            # Calculate Jacobian matrix of the output
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            # Calculate sample-wise gradient and add it to the array of sample gradients
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)


class Loss:
    def calculate(self, output, y):
        sample_losses = self.forward(output, y)
        data_loss = np.mean(sample_losses) # or batch loss
        return data_loss


class Loss_CategoricalCrossEntropy(Loss):
    def forward(self, y_pred, y_true): # y_pred NN predictions;  y_true target training values
        samples = len(y_pred)
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7) # Clip to prevent taking log(0) below

        if len(y_true.shape) == 1: # Then we have passed scalar values
            correct_confidences = y_pred_clipped[
                range(samples), 
                y_true
            ]
        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(
                y_pred_clipped * y_true, 
                axis=1
            )
        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
    
    def backward(self, dvalues, y_true):
        samples = len(dvalues) # Number of samples
        labels = len(dvalues[0]) # Number of labels in every sample
        
        # If labels are sparse, turn them into one-hot vectors
        if len(y_true.shape) == 1:
            y_true = np.eye(labels)[y_true]
        
        #Calculate gradient
        self.dinputs = -y_true / dvalues
        # Normalize gradient
        self.dinputs = self.dinputs / samples


class Activation_Softmax_Loss_CategoricalCrossEntropy():
    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossEntropy()
        
    def forward(self, inputs, y_true):
        self.activation.forward(inputs)
        self.output = self.activation.output
        return self.loss.calculate(self.output, y_true)
    
    def backward(self, dvalues, y_true):
        samples = len(dvalues)
        # If labels are one-hot encoded we turn them into discrete values
        if len(y_true.shape) == 2: 
            y_true = np.argmax(y_true, axis=1)
        
        self.dinputs = dvalues.copy() # Copy to safely modify 
        self.dinputs[range(samples), y_true] -= 1 # Calculate gradient 
        self.dinputs = self.dinputs / samples # Normalize gradient

# Optimizer Stochastic Gradient Descent 
class Optimizer_SGD:
    def __init__(self, learning_rate=1., decay=0., momentum=0.):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum
    
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))
    
    # Update parameters
    def update_params(self, layer):
        if self.momentum: 
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                layer.bias_momentums = np.zeros_like(layer.biases)
                            
            weight_updates = \
                self.momentum * layer.weight_momentums - \
                self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates

            bias_updates = \
                self.momentum * layer.bias_momentums - \
                self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates

        else:
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates   = -self.current_learning_rate * layer.dbiases
        
        layer.weights += weight_updates
        layer.biases += bias_updates
    
    def post_update_params(self):
        self.iterations +=1


class Optimizer_Adam:
    def __init__(self, learning_rate=0.001, decay=0., epsilon=1e-7, beta_1=0.9, beta_2=0.999):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.beta_1 = beta_1
        self.beta_2 = beta_2

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
            (1. / (1. + self.decay * self.iterations))
        
    # Update parameters
    def update_params(self, layer):

        # If layer does not contain cache arrays, create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)
            layer.bias_cache = np.zeros_like(layer.biases)
        
        layer.weight_momentums = self.beta_1 * layer.weight_momentums + (1 - self.beta_1) * layer.dweights
        layer.bias_momentums = self.beta_1 * layer.bias_momentums + (1 - self.beta_1) * layer.dbiases

        weight_momentums_corrected = layer.weight_momentums / (1 - self.beta_1 ** (self.iterations + 1))
        bias_momentums_corrected = layer.bias_momentums / (1 - self.beta_1 ** (self.iterations + 1))

        layer.weight_cache = self.beta_2 * layer.weight_cache + (1 - self.beta_2) * layer.dweights**2
        layer.bias_cache = self.beta_2 * layer.bias_cache + (1 - self.beta_2) * layer.dbiases**2

        weight_cache_corrected = layer.weight_cache / (1 - self.beta_2 ** (self.iterations + 1))
        bias_cache_corrected = layer.bias_cache / (1 - self.beta_2 ** (self.iterations + 1))

        layer.weights += -self.current_learning_rate * weight_momentums_corrected / \
                        (np.sqrt(weight_cache_corrected) + self.epsilon)
        layer.biases += -self.current_learning_rate * bias_momentums_corrected / \
                        (np.sqrt(bias_cache_corrected) + self.epsilon)
    
    def post_update_params(self):
        self.iterations +=1


X, y = spiral_data(samples=100, classes=3)
num_classes = 3

n_inputs = len(X[0])

dense1 = Layer_Dense(n_inputs, 64)
activation1 = Activation_ReLu()
dense2 = Layer_Dense(64, num_classes)
loss_activation = Activation_Softmax_Loss_CategoricalCrossEntropy()

optimizer = Optimizer_Adam(learning_rate=0.05, decay=5e-7)

#optimizer = Optimizer_SGD(decay=1e-3, momentum=0.9)

for epoch in range(10001):

    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = loss_activation.forward(dense2.output, y)

    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)

    if not epoch % 100:
        print(f"epoch: {epoch}, " +  
            f"accuracy: {accuracy:.3f}, " + 
            f"loss: {loss:.3f}, " +
            f"lr: {optimizer.current_learning_rate}"
        )

    # Backward pass (Back propagation)
    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)

    # Update weights and biases
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()


X_test, y_test = spiral_data(samples=100, classes=3)

# Perform a forward pass of our testing data through this layer
dense1.forward(X_test)

# Perform a forward pass through activation function
# takes the output of first dense layer here
activation1.forward(dense1.output)

# Perform a forward pass through second Dense layer
# takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)

# Perform a forward pass through the activation/loss function
# takes the output of second dense layer here and returns loss
loss = loss_activation.forward(dense2.output, y_test)

# Calculate accuracy from output of activation2 and targets
# calculate values along first axis
predictions = np.argmax(loss_activation.output, axis=1)
if len(y_test.shape) == 2:
    y_test = np.argmax(y_test, axis=1)
accuracy = np.mean(predictions==y_test)

print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')



epoch: 0, accuracy: 0.360, loss: 1.099, lr: 0.05
epoch: 100, accuracy: 0.717, loss: 0.675, lr: 0.04999752512250644
epoch: 200, accuracy: 0.793, loss: 0.525, lr: 0.04999502549496326
epoch: 300, accuracy: 0.857, loss: 0.431, lr: 0.049992526117345455
epoch: 400, accuracy: 0.860, loss: 0.377, lr: 0.04999002698961558
epoch: 500, accuracy: 0.900, loss: 0.321, lr: 0.049987528111736124
epoch: 600, accuracy: 0.900, loss: 0.281, lr: 0.049985029483669646
epoch: 700, accuracy: 0.913, loss: 0.249, lr: 0.049982531105378675
epoch: 800, accuracy: 0.917, loss: 0.231, lr: 0.04998003297682575
epoch: 900, accuracy: 0.917, loss: 0.211, lr: 0.049977535097973466
epoch: 1000, accuracy: 0.920, loss: 0.195, lr: 0.049975037468784345
epoch: 1100, accuracy: 0.927, loss: 0.195, lr: 0.049972540089220974
epoch: 1200, accuracy: 0.933, loss: 0.186, lr: 0.04997004295924593
epoch: 1300, accuracy: 0.933, loss: 0.180, lr: 0.04996754607882181
epoch: 1400, accuracy: 0.933, loss: 0.176, lr: 0.049965049447911185
epoch: 1500, a

### 1) Activation functions

Below is some setup for choosing activation function. Implement 2 additional activation functions, "ReLU" and one more of your choosing.

In [None]:
# activations = input signal
# Activation functions
def activate(activations, selected_function = "none"):
    y = activations
    if selected_function == "none":
        y = activations
    elif selected_function == "relu":
        y = np.maximum(0, activations)
    elif selected_function == "elu":
        alpha = 1
        y = np.where(activations > 0, activations, alpha * (np.exp(activations) - 1))
    return y

In [None]:
# TODO Test your activation functions: are the returned values what you expect?
print(activate(0, selected_function="elu"))

### 2) Activation function derivatives

Neural networks need both the activation function and its derivative. Finish the code below.

In [None]:
# ReLU: https://stats.stackexchange.com/questions/333394/what-is-the-derivative-of-the-relu-activation-function
# ELU:  https://medium.com/@krishnakalyan3/introduction-to-exponential-linear-unit-d3e2904b366c
def d_activate(activations, selected_function = "none"):
    dy = 0
    if selected_function == "none":
        dy = np.ones_like(activations)
    elif selected_function == "relu":
        dy =  np.where(activations > 0, 1, 0)
    elif selected_function == "elu" :
        alpha = 1
        dy = np.where(activations > 0, 1, alpha * np.exp(activations))
    return dy

In [None]:
# TODO Test your activation function derivatives: are the returned values what you expect?
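
One way to test a derivative is to compare it against a central finite difference. The sketch below is self-contained: `elu` and `d_elu` are hypothetical stand-ins that mirror the "elu" branches above with alpha = 1, and the step size `eps` and the test points are assumptions chosen for illustration.

```python
import numpy as np

def elu(z, alpha=1.0):
    # Stand-in for activate(z, "elu")
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def d_elu(z, alpha=1.0):
    # Stand-in for d_activate(z, "elu")
    return np.where(z > 0, 1.0, alpha * np.exp(z))

def numerical_derivative(f, z, eps=1e-6):
    # Central difference approximation of f'(z), applied element-wise
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])
max_error = np.max(np.abs(d_elu(z) - numerical_derivative(elu, z)))
print(max_error)  # should be very close to zero
```

The same check works for any other activation function, as long as the test points stay away from kinks such as z = 0 for ReLU.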

### 3) Loss functions

To penalize the network when it predicts incorrectly, we need to measure how "bad" the prediction is. This is done with loss functions.

As with the activation functions, each loss function needs its derivative as well.

Finish the MSE_loss (Mean Squared Error loss), and add one additional loss function.

In [None]:
# This is the loss for a set of predictions y_hat compared to a set of real values y
def MSE_loss(y_hat, y): # y_hat = predictions, y = the real values (targets)
    y_hat = np.array(y_hat)
    y = np.array(y)
    loss = np.mean(np.square(np.subtract(y, y_hat)))
    return loss

y_h = [2,3,4,5]
y = [1,1,1,1]

print(MSE_loss(y_h,y))

# TODO: Choose another loss function and implement it
def MAE_loss(y_hat, y):
    y_hat = np.array(y_hat)
    y = np.array(y)
    loss = np.mean(np.abs(np.subtract(y, y_hat)))
    return loss

The derivative of each loss is taken with respect to the predicted value **y_hat**.
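
For reference, a sketch of the standard definitions and their derivatives, assuming $N$ is the total number of predicted values and $\hat{y}_i$, $y_i$ are the individual predictions and targets (this notation is introduced here and is not part of the assignment text):

$$
L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2
\qquad\Rightarrow\qquad
\frac{\partial L_{\text{MSE}}}{\partial \hat{y}_i} = \frac{2}{N}\,(\hat{y}_i - y_i)
$$

$$
L_{\text{MAE}} = \frac{1}{N}\sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert
\qquad\Rightarrow\qquad
\frac{\partial L_{\text{MAE}}}{\partial \hat{y}_i} = \frac{1}{N}\,\operatorname{sign}(\hat{y}_i - y_i)
$$

The MAE derivative is undefined where $\hat{y}_i = y_i$; using 0 at that point is a common convention.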

In [None]:
def d_MSE_loss(y_hat, y): # y_hat = predictions, y = the real values (targets)
    y_hat = np.array(y_hat)
    y = np.array(y)
    # d/dy_hat of mean((y_hat - y)^2) is 2 * (y_hat - y) / N, with N the number of values
    dy = (2 / y_hat.size) * np.subtract(y_hat, y)
    return dy

# TODO: Choose another loss function and implement it
def d_MAE_loss(y_hat, y):
    y_hat = np.array(y_hat)
    y = np.array(y)
    # d/dy_hat of mean(|y_hat - y|) is sign(y_hat - y) / N (taken as 0 where y_hat == y)
    dy = np.sign(np.subtract(y_hat, y)) / y_hat.size
    return dy

### 4) Forward propagation

The first "fundamental" function for neural networks is to be able to propagate the data forward through the neural network. We will implement this function here.

In [None]:
def propagate_forward(weights, activations, biases, activation_function="none"):
    # NOTE: activations = input 
    # Reshape the bias to a row vector so that it broadcasts over the batch dimension
    dot_product = np.dot(activations, weights) + np.reshape(biases, (1, -1))
    new_activations = activate(dot_product, activation_function)

    return new_activations
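
A detail that is easy to get wrong in the forward pass is the shape of the bias. The following self-contained sketch shows how NumPy broadcasting treats a row bias versus a column bias; the batch and layer sizes are arbitrary assumptions.

```python
import numpy as np

batch, n_in, n_out = 5, 3, 4
x = np.ones((batch, n_in))
w = np.ones((n_in, n_out))
z = x @ w                                    # shape (5, 4)

row_bias = np.arange(n_out, dtype=float).reshape(1, n_out)  # shape (1, 4)
col_bias = row_bias.T                                        # shape (4, 1)

# A row bias adds one value per neuron to every sample in the batch
out_row = z + row_bias
print(out_row.shape)

# A column bias does not broadcast against a (5, 4) array
try:
    z + col_bias
    broadcast_ok = True
except ValueError:
    broadcast_ok = False
print(broadcast_ok)

# Reshaping the column bias to a row gives the intended result
out_fixed = z + col_bias.reshape(1, -1)
```

Note that when the batch size happens to equal the number of neurons, the column bias broadcasts without an error but adds the wrong values, so an explicit reshape is the safer choice.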

### 5) Back-propagation

To be able to train a neural network, we need to be able to propagate the loss backwards and update the weights. We will implement this function here.

In [None]:
# Calculates the gradients that are passed through the layer in the backward pass.
# Returns the derivative of the loss with respect to the weights, the biases and the input signal (activations).

def propagate_backward(weights, activations, dl_dz, biases, activation_function="none"):
    # NOTE: dl_dz is the derivative of the loss with respect to this layer's output
    
    dot_product = np.dot(activations, weights) + np.reshape(biases, (1, -1))  # Reshape the bias to a row so it matches dot_product
    d_loss = d_activate(dot_product, activation_function) * dl_dz
    d_weights = np.dot(activations.T, d_loss)
    d_activations = np.dot(d_loss, weights.T)
    d_biases = np.sum(d_loss, axis=0).reshape(np.shape(biases))  # Reshape to match the shape of biases
    return d_weights, d_biases, d_activations

# Test
# Example usage
weights = np.array([[0.2, 0.8], [0.5, 0.1]])
activations = np.array([[1.0, 2.0], [3.0, 4.0]])
dl_dz = np.array([[0.1, 0.2], [0.3, 0.4]])
bias = np.array([0.1, 0.2])

# NOTE: propagate_backward returns (d_weights, d_biases, d_activations), in that order
d_weights, d_biases, d_activations = propagate_backward(weights, activations, dl_dz, bias, activation_function="relu")
print("d_weights:", d_weights)
print("d_biases:", d_biases)
print("d_activations:", d_activations)
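
A backward pass can be verified with a numerical gradient check. The sketch below is self-contained: `layer_output` and `layer_backward` are hypothetical stand-ins for a single ReLU layer (not the assignment's functions), and `scalar_loss` is an assumed loss constructed so that its gradient with respect to the layer output is exactly `dl_dz`.

```python
import numpy as np

def layer_output(weights, activations, biases):
    # Forward pass of one ReLU layer
    return np.maximum(0, activations @ weights + biases.reshape(1, -1))

def layer_backward(weights, activations, dl_dz, biases):
    # Analytic gradients of the same layer
    z = activations @ weights + biases.reshape(1, -1)
    d_loss = np.where(z > 0, 1.0, 0.0) * dl_dz
    d_weights = activations.T @ d_loss
    d_biases = d_loss.sum(axis=0)
    d_activations = d_loss @ weights.T
    return d_weights, d_biases, d_activations

weights = np.array([[0.2, 0.8], [0.5, 0.1]])
activations = np.array([[1.0, 2.0], [3.0, 4.0]])
dl_dz = np.array([[0.1, 0.2], [0.3, 0.4]])
biases = np.array([0.1, 0.2])

def scalar_loss(w):
    # Its gradient with respect to the layer output equals dl_dz
    return np.sum(layer_output(w, activations, biases) * dl_dz)

# Central finite difference for every weight
eps = 1e-6
numeric = np.zeros_like(weights)
for i in range(weights.shape[0]):
    for j in range(weights.shape[1]):
        step = np.zeros_like(weights)
        step[i, j] = eps
        numeric[i, j] = (scalar_loss(weights + step) - scalar_loss(weights - step)) / (2 * eps)

d_weights, d_biases, d_activations = layer_backward(weights, activations, dl_dz, biases)
max_error = np.max(np.abs(d_weights - numeric))
print(max_error)  # should be very close to zero
```

If the analytic and numerical gradients disagree by more than a small tolerance, the usual suspects are a missing transpose or a sign error.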

## Neural network implementation

### 6) Fixing the neural network

Below is a class implementation of an MLP neural network. This implementation still lacks several features that are needed for the network to be robust and function well. Your task is to improve and fix it with the following:

1. Add a bias to the activation functions, and make sure the bias is also updated during training. 
2. Add a function that trains the network using minibatches (such that the neural network trains on a few samples at a time). 
3. Make use of a validation set in the training function. The model should stop training when the loss starts to increase for the validation set. It should be possible to turn this feature on and off to test the difference.
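
The mini-batch and early-stopping ideas in items 2 and 3 can be sketched on a plain linear model. Everything here is an illustrative assumption rather than the expected solution: the names `iterate_minibatches` and `train_linear`, the `val_fraction` and `seed` parameters, and the rule of stopping at the first increase in validation loss.

```python
import numpy as np

def iterate_minibatches(x, y, batch_size, rng):
    # Shuffle once per epoch, then yield consecutive slices
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        yield x[idx], y[idx]

def train_linear(x, y, batch_size=32, epochs=100, learning_rate=1e-2,
                 use_validation_data=False, val_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    if use_validation_data:
        # Hold out a random part of the training data for validation
        order = rng.permutation(len(x))
        n_val = int(len(x) * val_fraction)
        x_val, y_val = x[order[:n_val]], y[order[:n_val]]
        x, y = x[order[n_val:]], y[order[n_val:]]

    w = np.zeros((x.shape[1], 1))
    b = np.zeros((1, 1))
    history, best_val = [], np.inf
    for epoch in range(epochs):
        for xb, yb in iterate_minibatches(x, y, batch_size, rng):
            error = xb @ w + b - yb
            # Gradient of the batch MSE with respect to w and b
            w -= learning_rate * (2 / len(xb)) * xb.T @ error
            b -= learning_rate * (2 / len(xb)) * error.sum(axis=0, keepdims=True)
        history.append(float(np.mean((x @ w + b - y) ** 2)))
        if use_validation_data:
            val_loss = float(np.mean((x_val @ w + b - y_val) ** 2))
            if val_loss > best_val:
                break  # validation loss started to increase
            best_val = val_loss
    return history

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
y = x @ np.array([[1.0], [-2.0], [0.5]]) + 0.1
history = train_linear(x, y, epochs=50)
print(history[0], history[-1])
```

Stopping at the very first increase is the simplest possible rule; allowing a few epochs of "patience" before stopping is a common, less noise-sensitive variation.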


In [None]:
# MLP = Multi Layer Perceptron Neural Network
class NeuralNet(object):
    
    # Setup all parameters and activation functions.
    # This function runs directly when a new instance of this class is created. 
    # Input_dim is the size of the input (number of features?), output_dim is the size of the output (number of classes) and neurons is a list of the number of neurons in each layer.
    def __init__ (self, input_dim, output_dim, neurons = []):
        # NOTE: The "neurons" parameter is given as a list.
        # E.g., [4, 8, 4] means 4 neurons in layer 1, 8 neurons in layer 2 etc...

        # TODO: Add support for bias for each neuron in the code below.
        self.weights = [0.01 * np.random.randn(n, m) for n, m in zip([input_dim] + neurons, neurons + [output_dim])]
        self.biases = [0.01 * np.random.randn(n, 1) for n in neurons + [output_dim]]
        self.activation_functions = ["relu"] * len(neurons) + ["none"]
    
    # Pass the input through the network and calculate the output.
    def forward(self, x):
        """ x is the input to the network, or to the next layer. """
        # TODO: Add support for a bias for each neuron in the code below.
        for layer_weights, layer_biases, layer_activation_function in zip(self.weights, self.biases, self.activation_functions):
            x = propagate_forward(layer_weights, x, layer_biases, layer_activation_function)
            
        return x
    
    # Adjust the weights in the network to better fit the desired output (y), given the input (x).
    # The weight updates are happening "in-place", thus we are only returning the loss from this function.
    # Note that this function can handle a variable size of the input (x), both full datasets or smaller parts of the dataset.
    def adjust_weights(self, x, y, learning_rate=1e-4):
        activation = x
        activation_history = []
        
        for layer_weights, layer_biases, layer_activation_function in zip(self.weights, self.biases, self.activation_functions):
            activation_history.append(activation)
            activation = propagate_forward(layer_weights, activation, layer_biases, layer_activation_function)

        loss = MSE_loss(activation, y)
        d_activations = d_MSE_loss(activation, y)
        
        for layer_weights, layer_activation_function, layer_biases, previous_activations in reversed(list(zip(self.weights, self.activation_functions, self.biases, activation_history))):
            d_weights, d_biases, d_activations = propagate_backward(layer_weights, previous_activations, d_activations, layer_biases, layer_activation_function)
            layer_weights -= learning_rate * d_weights
            layer_biases -= learning_rate * d_biases
                    
        return loss
    
    # A function for the training of the network.
    def train_net(self, x, y, batch_size=32, epochs=100, learning_rate=1e-4, use_validation_data=False):
        
        # TODO: Add a training loop where the weights and biases of the network is learnt over several epochs.

        # TODO: Add support for mini batches. That is, in each epoch the data should be split into several
        #       smaller subsets and the model should be trained on each of these subsets one at a time.

        # TODO: Implement the use of validation data, that is, splitting the training data into training data and validation data.
        #       The validation data should be used to stop the training when the model stops to generalise and starts to overfit.
        #       This feature should be able to be turned on and off to test the difference.

        # NOTE: Make use of previously implemented functions here.

        ...
    

## Train Neural Networks

### 7) Simple test

This is a very simple test for you to use and toy around with before using the datasets.

Make sure to test both the **adjust_weights** function and the **train_net** function. What is the difference between the two?

Also, be sure to **plot the loss for each epoch** to see how the network training is progressing!

In [None]:
n = 1000
input_dimension = 4
output_dim = 1
neurons = [18, 12]  # NOTE: 18 neurons in layer 1, 12 neurons in layer 2 etc...

k = np.random.randint(0, 10, (input_dimension, 1))
x = np.random.normal(0, 1, (n, input_dimension))
y = np.dot(x, k) + 0.1 + np.random.normal(0, 0.01, (n, 1))

# Create an instance of the NeuralNet class
nn = NeuralNet(input_dimension, output_dim, neurons)

# Adjust weights using the adjust_weights function
loss_1 = [nn.adjust_weights(x, y) for _ in range(1000)]

# Train the network using the train_net function
loss_2 = []
for epoch in range(100):
    loss = nn.train_net(x, y, batch_size=32, epochs=1, learning_rate=1e-4, use_validation_data=False)
    loss_2.append(loss)

# Plot the losses
plt.plot(loss_1)
plt.title("Loss 1")
plt.show()

plt.plot(loss_2)
plt.title("Loss 2")
plt.show()

### Real test and preprocessing

When using real data with neural networks, it is very important to scale the data to a small range, usually between 0 and 1. This is because neural networks train less reliably on inputs with large values than on inputs with small values.

To test this, we will use our first dataset and train with and without scaling.

As in assignment 2, we will use the scikit-learn library for this preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
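
As a minimal sketch of what scaling to the range 0 to 1 does, the NumPy-only example below applies the column-wise min-max formula, which mirrors what `sklearn.preprocessing.MinMaxScaler` computes with its default range. The helper names `fit_min_max` and `apply_min_max` and the toy values are assumptions for illustration. Note that the minimum and range are learned from the training data only and then reused for the test data.

```python
import numpy as np

def fit_min_max(x_train):
    # Column-wise minimum and range, learned from the training data only
    x_min = x_train.min(axis=0)
    x_range = x_train.max(axis=0) - x_min
    x_range[x_range == 0] = 1.0  # avoid division by zero for constant columns
    return x_min, x_range

def apply_min_max(x, x_min, x_range):
    return (x - x_min) / x_range

x_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]])
x_test = np.array([[2.5, 600.0]])

x_min, x_range = fit_min_max(x_train)
x_train_scaled = apply_min_max(x_train, x_min, x_range)
x_test_scaled = apply_min_max(x_test, x_min, x_range)
print(x_train_scaled)
print(x_test_scaled)
```

Test values can fall slightly outside the range 0 to 1 when the test set contains values beyond the training minimum or maximum; that is expected.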

### 8) Dataset 1: Wine - with and without scaling

Wine dataset: https://archive.ics.uci.edu/dataset/109/wine

Train two neural networks, one with scaling and one without. Are we able to see any difference in training results or loss over time?

**Note:** Do not train for too many epochs (no more than maybe 50-100). The network might "learn" anyway in the end, but you should still be able to see a difference when training.

In [None]:
from sklearn import preprocessing

data_wine = pd.read_csv("wine.csv").to_numpy()

# TODO: Set up the data and split it into train and test-sets.

# TODO: Train and test your neural networks.
# NOTE: Use the same train/test split for both neural network models!

# TODO: Do the above at least 3 times
# NOTE: Use loops here!

# TODO: Plot the results with matplotlib (plt)
# NOTE: One combined lineplot with the scaling and one without the scaling, 2 plots in total.
# NOTE: Plot both the accuracy and the loss!

### Real data and hyper-parameter tuning

Now we are going to use real data, preprocess it, and do hyper-parameter tuning.

Choose two hyper-parameters to tune to try to achieve an even better result.

**NOTE:** Changing the number of epochs should be part of the tuning, but it does not count towards the two hyper parameters.
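
A grid search over two hyper-parameters can be sketched as below. To stay self-contained, the sketch tunes a stand-in linear model rather than the neural network; the function `train_and_score`, the choice of learning rate and batch size as the two parameters, and the grid values are all assumptions for illustration.

```python
import itertools
import numpy as np

def train_and_score(x_train, y_train, x_val, y_val, learning_rate, batch_size, epochs=20):
    # Stand-in model: linear regression trained with mini-batch gradient descent
    w = np.zeros((x_train.shape[1], 1))
    for _ in range(epochs):
        for start in range(0, len(x_train), batch_size):
            xb = x_train[start:start + batch_size]
            yb = y_train[start:start + batch_size]
            error = xb @ w - yb
            w -= learning_rate * (2 / len(xb)) * xb.T @ error
    # Score on held-out data, not on the data used for training
    return float(np.mean((x_val @ w - y_val) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 2))
y = x @ np.array([[2.0], [-1.0]])
x_train, y_train, x_val, y_val = x[:200], y[:200], x[200:], y[200:]

grid = {"learning_rate": [1e-3, 1e-2, 1e-1], "batch_size": [16, 64]}
results = []
for learning_rate, batch_size in itertools.product(grid["learning_rate"], grid["batch_size"]):
    val_loss = train_and_score(x_train, y_train, x_val, y_val, learning_rate, batch_size)
    results.append((val_loss, learning_rate, batch_size))

best_loss, best_lr, best_batch_size = min(results)
print(best_lr, best_batch_size, best_loss)
```

Keeping every tested combination in `results` makes it easy to report or plot the whole grid afterwards, not just the best setting.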

### 9) Dataset 2: Mushroom

Mushroom dataset: https://archive.ics.uci.edu/dataset/73/mushroom

Note: This dataset has one feature with missing values. Remove this feature.
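
The preprocessing could look roughly like the sketch below, which uses a toy table instead of the real file. The column names, the values, and the assumption that missing entries are marked with "?" are all hypothetical; check how missing values appear in your copy of the data.

```python
import pandas as pd

# Toy stand-in for the mushroom table; column names and values are made up
df = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "cap_shape": ["x", "b", "x", "f"],
    "stalk_root": ["?", "c", "?", "e"],  # "?" is assumed to mark a missing value
})

# Drop every column that contains the missing-value marker
has_missing = [col for col in df.columns if (df[col] == "?").any()]
df = df.drop(columns=has_missing)

# Binary target and one-hot encoded features as float arrays
y = (df["class"] == "p").to_numpy(dtype=float).reshape(-1, 1)
x = pd.get_dummies(df.drop(columns=["class"])).to_numpy(dtype=float)
print(has_missing, x.shape, y.ravel())
```

One-hot encoding is used here because the features are categorical letters with no natural ordering.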

In [None]:
data_mushroom = pd.read_csv("mushroom.csv").to_numpy()

# TODO: Preprocess the data.

# TODO: Split the data into train and test

# TODO: Train a neural network on the data

# TODO: Visualize the loss for each epoch

# TODO: Visualize the test accuracy for each epoch

When hyper-parameter tuning, please write the parameters and network sizes you test here:

* Parameter 1: 
* Parameter 2:

* Neural network sizes: 

In [None]:
# TODO: Hyper-parameter tuning

# TODO: Visualize the loss after hyper-parameter tuning for each epoch

# TODO: Visualize the test accuracy after hyper-parameter tuning for each epoch

### 10) Dataset 3: Adult

Adult dataset: https://archive.ics.uci.edu/dataset/2/adult

**IMPORTANT NOTE:** This dataset is much larger than the previous two (48842 instances). If your code runs slowly on your own computer, you may exclude parts of this dataset, but you must keep a minimum of 10000 datapoints.

In [None]:
dataset_3 = pd.read_csv(...) # TODO: Read the data.

# TODO: Preprocess the data.

# TODO: Split the data into train and test

# TODO: Train a neural network on the data

# TODO: Visualize the loss for each epoch

# TODO: Visualize the test accuracy for each epoch

When hyper-parameter tuning, please write the parameters and network sizes you test here:

* Parameter 1: 
* Parameter 2:

* Neural network sizes: 

In [None]:
# TODO: Hyper-parameter tuning

# TODO: Visualize the loss after hyper-parameter tuning for each epoch

# TODO: Visualize the test accuracy after hyper-parameter tuning for each epoch

# Questions for examination:

In addition to completing the assignment with all its tasks, you should also prepare to answer the following questions:

1) Why would we want to use different activation functions?

2) Why would we want to use different loss functions?

3) Why are neural networks sensitive to large input values?

4) What is the role of the bias? 

5) What is the purpose of hyper-parameter tuning?

6) A small example neural network will be shown during the oral examination. You will be asked a few basic questions related to the number of weights, biases, inputs and outputs.

If we don't apply an activation function, every node will just be a linear combination of every node before it (plus a bias term).
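
This can be checked numerically: two layers without an activation function between them compute the same mapping as a single linear layer. The sketch below uses random weights and arbitrary layer sizes, all of which are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
w1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
w2, b2 = rng.normal(size=(4, 2)), rng.normal(size=(1, 2))

# Two layers without an activation function in between
two_layers = (x @ w1 + b1) @ w2 + b2

# The same mapping expressed as a single linear layer
w_combined = w1 @ w2
b_combined = b1 @ w2 + b2
one_layer = x @ w_combined + b_combined

print(np.allclose(two_layers, one_layer))  # True
```

Inserting a non-linear function such as ReLU between the two layers breaks this equivalence, which is what allows deeper networks to fit non-linear data.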

# Finished!

Was part of the setup incorrect? Did you spot any inconsistencies in the assignment? Could something be improved?

If so, please write them down and send them by email to:

* marcus.gullstrand@ju.se

Thank you!