# Requirements
- After finishing the second exercise, you should be familiar with data loading, data inspection and simple models including their evaluation.
- Building on this technical understanding, we combine it with the theory from lectures two to four. Among other things, you should have learned the basics of simple neural networks and their evaluation, which are the focus of this exercise.
- In the source code, gaps marked with **N. TODO** have to be filled and theoretical questions have to be answered. For experimentation, further changes to the code should be made at the end, if requested. In total there are **18 TODO**s.

---

## Load modules & Check PyTorch

In [1]:
import numpy as np
import pandas as pd
import os

# torch modules
import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.svm import SVC

First, check whether a CUDA-compatible GPU is available in your computer. If so, device = cuda:0 is set, which means that all later calculations are performed on the graphics card. If no GPU is available, the calculations run on the CPU, which is also absolutely sufficient for the examples in these exercises.

In [2]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

if device.type == 'cpu':
    device_num = 0
    print('No GPU available.')
else:
    device_num = torch.cuda.device_count()
    print('Device:', device, '-- Number of devices:', device_num)

No GPU available.


---

# Regression network
The first neural network you develop will be a regression network. The following source code, which estimates life expectancy from features collected by the WHO, is incomplete and has to be filled in.

## Load & prepare data

Load the data set "LifeExpectancyData.csv" and print its size, i.e. the number of samples, and its dimension, i.e. the number of features per sample (**1. TODO**). It may be necessary to adjust the path for reading the data.

In [3]:
# Path to WHO data
data_path = '../data/LifeExpectancyData.csv'

# Read csv sheet with pandas
df = pd.read_csv(data_path)

df = df.dropna()

# Get numpy out of pandas dataframe
data = df.values

# Get column names to see, which columns we have to extract as x and y
column_names = np.array(df.columns)

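# TODO
# data.shape is (number of samples, number of columns per sample)
print('Dimension of the dataset:', data.shape)
print('Column names:', column_names)

Dimension of the dataset: (1649, 22)
Column names: ['Country' 'Year' 'Status' 'Life expectancy ' 'Adult Mortality'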
 'infant deaths' 'Alcohol' 'percentage expenditure' 'Hepatitis B'
 'Measles ' ' BMI ' 'under-five deaths ' 'Polio' 'Total expenditure'
 'Diphtheria ' ' HIV/AIDS' 'GDP' 'Population' ' thinness  1-19 years'
 ' thinness 5-9 years' 'Income composition of resources' 'Schooling']


Divide the data set into input $x$ and target $y$ features (**2. TODO**). For a start, the BMI column shall be used as $x$ and the column with the life expectancy as $y$.

In [4]:
# Split in X and Y
# Column 10 holds the BMI, column 3 the life expectancy (see the column names above)

# TODO
x = np.array(data[:, 10], dtype=np.float32)
y = np.array(data[:, 3], dtype=np.float32)

print('x shape:', x.shape)
print('y shape:', y.shape)

x shape: (1649,)
y shape: (1649,)


In the following, $x$ and $y$ are normalized, which is very important for the stability of neural networks. If several input features were used without normalization, those in lower value ranges would possibly be suppressed.
They are also converted to torch tensors, $x \rightarrow X$ and $y \rightarrow Y$, so that they can later pass through the network.

In [5]:
# If multiple features in X are selected, each feature is normalized individually
scale_x = np.max(x, axis=0)
scale_y = np.max(y, axis=0)
x = x/scale_x
y = y/scale_y
print('Scale_x:',scale_x)
print('Scale_y:',scale_y)

Scale_x: 77.1
Scale_y: 89.0
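
The effect of the per-feature scaling can be checked on a small example. The following is a minimal sketch with made-up numbers (the array values are assumptions, not WHO data); it shows that `axis=0` yields one scale factor per column and that multiplying by the stored factor restores the original units. Note that dividing by the maximum maps values into (0, 1] only if the data is non-negative.

```python
import numpy as np

# Toy data: 3 samples, 2 features with very different value ranges (made-up numbers)
x = np.array([[20.0, 1000.0],
              [40.0, 4000.0],
              [10.0, 2000.0]], dtype=np.float32)

# Column-wise maximum: one scale factor per feature
scale_x = np.max(x, axis=0)      # shape (2,)
x_norm = x / scale_x             # broadcasting divides each column by its own maximum

# Multiplying by the stored scale factor recovers the original units
x_back = x_norm * scale_x

print('Scale_x:', scale_x)
print('Column maxima after scaling:', x_norm.max(axis=0))
```

This is also why *scale_y* is kept: it is needed later to convert normalized predictions back to years.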


In [6]:
# Convert to torch tensors
# If tensors have only one dimension, an artificial dimension is created with unsqueeze (e.g. [10]->[10,1], so 1D->2D)
Y = torch.from_numpy(y)
Y = Y.float()
if len(Y.shape)==1:
    Y = Y.unsqueeze(dim=1)

X = torch.from_numpy(x)
X = X.float()
if len(X.shape)==1:
    X = X.unsqueeze(dim=1)

Divide $X$ and $Y$ into a training, validation and test set. Choose the proportions of the sets appropriately; they must add up to 1. Explain why a split into training, validation and test set is important (**3. TODO**).

...

In [7]:
# Split dataset in training, validation and test tensors
# TODO
prop_train = 0.8
prop_val = 0.1
prop_test = 0.1

sample_num = {'all': X.shape[0], 
              'train': round(prop_train*X.shape[0]),
              'val': round(prop_val*X.shape[0]),
              'test': round(prop_test*X.shape[0])}

# idx shuffle
idx = np.random.choice(sample_num['all'], sample_num['all'], replace=False)
# assign idx to each sample
sample_idx = {'all': idx[:], 
              'train': idx[0:sample_num['train']],
              'val': idx[sample_num['train']:sample_num['train']+sample_num['val']],
              'test': idx[sample_num['train']+sample_num['val']:]}

# Create train data
X_train = X[sample_idx['train']]
Y_train = Y[sample_idx['train']]

# Create validation data
X_val = X[sample_idx['val']]
Y_val = Y[sample_idx['val']]

# Create test data
X_test = X[sample_idx['test']]
Y_test = Y[sample_idx['test']]


# Show data point
print('Input of first ten train Sample:', X_train[0:10])
print('Target of first ten train Sample:', Y_train[0:10])

Input of first ten train Sample: tensor([[0.6031],
        [0.2918],
        [0.4721],
        [0.5694],
        [0.6641],
        [0.2309],
        [0.7588],
        [0.6394],
        [0.0661],
        [0.7224]])
Target of first ten train Sample: tensor([[0.8202],
        [0.7360],
        [0.7348],
        [0.7404],
        [0.8449],
        [0.7506],
        [0.8180],
        [0.8730],
        [0.8225],
        [0.8966]])
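
Because the shuffle above is random, every run produces a different split. If reproducible results are wanted, one option is a seeded random generator. The following is a minimal sketch, assuming a hypothetical helper `split_indices` and a seed of 0, neither of which is part of the exercise code:

```python
import numpy as np

def split_indices(n_samples, prop_train=0.8, prop_val=0.1, seed=0):
    """Return shuffled, disjoint index arrays for train, validation and test."""
    rng = np.random.default_rng(seed)     # fixed seed -> same shuffle on every run
    idx = rng.permutation(n_samples)
    n_train = round(prop_train * n_samples)
    n_val = round(prop_val * n_samples)
    # The test set takes all remaining indices, so no sample is lost to rounding
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1649)
print(len(train_idx), len(val_idx), len(test_idx))   # 1319 165 165
```

As in the cell above, the test set is defined as "everything that is left", which keeps the three sets disjoint and complete even when the rounded sizes do not add up exactly.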


---

## Build neural network

After preparing the data we can now start designing the architecture of our regression network.

Look at the RegressNet class. It consists of two methods: *\_\_init\_\_(self, inputSize, outputSize)* specifies the layers that this class holds, and *forward(self, x)* specifies how these layers are connected. Note that activation functions are also used here. Describe the architecture of the network using the technical vocabulary you learned in the lecture (**4. TODO**).

...

In [8]:
# Class of neural network 'RegressNet'
# Set up layer and architecture of network in constructor __init__
# Define operations on layer in forward pass method

class RegressNet(nn.Module):
    
    def __init__(self, inputSize, outputSize):
        super(RegressNet, self).__init__()
        self.fc1 = nn.Linear(inputSize, 128)
        self.fc2 = nn.Linear(128, 32)
        self.fc3 = nn.Linear(32, outputSize)
    
    def forward(self, x):
        # Two hidden layers with ReLU activation, followed by a linear output layer
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
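
To make the data flow through such a network more tangible, the forward pass can be written out with plain matrix products. The following is a numpy sketch, not the PyTorch implementation: the random weights are an assumption, and the weight matrices are stored as (inputs, outputs), whereas nn.Linear stores them transposed.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Rectified linear unit: negative values are set to zero."""
    return np.maximum(z, 0.0)

# Random weights with the same layer sizes as RegressNet(1, 1)
W1, b1 = rng.normal(size=(1, 128)), np.zeros(128)
W2, b2 = rng.normal(size=(128, 32)), np.zeros(32)
W3, b3 = rng.normal(size=(32, 1)), np.zeros(1)

x = rng.random(size=(10, 1))      # batch of 10 samples with 1 feature each
h1 = relu(x @ W1 + b1)            # first hidden layer  -> shape (10, 128)
h2 = relu(h1 @ W2 + b2)           # second hidden layer -> shape (10, 32)
y_hat = h2 @ W3 + b3              # linear output layer -> shape (10, 1)

print(h1.shape, h2.shape, y_hat.shape)
```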

Consider which input and output dimensions the network needs to have in the current case. Afterwards, create an instance *net* of RegressNet (**5. TODO**).

Hint: Think about the dimensions of $X$ and $Y$.

In [9]:
# Specify network hyperparameter and create instance of RegressNet
# TODO
inputDim = 1
outputDim = 1

# Create instance of RegressNet
net = RegressNet(inputDim, outputDim)

print(net)

RegressNet(
  (fc1): Linear(in_features=1, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=1, bias=True)
)
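
The printed layer sizes also determine how many trainable parameters the network has, which is useful to keep in mind when the depth and width are varied later. A small sketch of the arithmetic (the helper `linear_params` is hypothetical and only used for illustration):

```python
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

layer_sizes = [(1, 128), (128, 32), (32, 1)]   # taken from the printed RegressNet
per_layer = [linear_params(i, o) for i, o in layer_sizes]
total = sum(per_layer)

print(per_layer, total)   # [256, 4128, 33] 4417
```

Most of the parameters sit in the middle layer, because its size is the product of the two hidden widths.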


If a graphics card is available, $X$, $Y$ and *net* are now sent to the GPU for efficient computation. Otherwise, the network is trained on the CPU.

In [10]:
# Send tensors and networks to GPU (if you have one which supports cuda) for faster computations
X_train, Y_train = X_train.to(device), Y_train.to(device)
X_val, Y_val = X_val.to(device), Y_val.to(device)
X_test, Y_test = X_test.to(device), Y_test.to(device)

# The network itself must also be sent to the GPU. Either you write net = RegressNet() and then later net.to(device) or directly net = RegressNet().to(device)
# The latter option may have the advantage that the instance net is created directly on the GPU, whereas in variant 1 it must first be sent to the GPU.
if device_num>1:
    print("Let's use", device_num, "GPU's")
    net = nn.DataParallel(net)
net.to(device)

RegressNet(
  (fc1): Linear(in_features=1, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=1, bias=True)
)

---

## Train Network

In order to train the network, the so-called hyperparameters must be defined. In this case, these are the number of epochs (*num_epoch*) for which the network is trained, the learning rate (*learn_rate*), the loss function and the optimizer. The latter two have already been defined: we use the mean squared error loss function MSELoss() and the Adam optimizer. Choose suitable values for the former two (**6. TODO**).

Hint: The learning rate is usually in a range from 1e-1 to 1e-5.
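
Why the learning rate matters can be seen on a one-dimensional toy problem. The following sketch uses plain gradient descent on $f(w) = w^2$, not the Adam optimizer used in this exercise, so the concrete numbers do not transfer; it only illustrates that a very small rate converges slowly and a too large rate diverges.

```python
def gradient_descent(learn_rate, num_steps=20, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(num_steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learn_rate * grad      # gradient descent update
    return w

for lr in (1e-3, 1e-1, 1.1):
    # Distance to the optimum w = 0 after 20 steps
    print(lr, abs(gradient_descent(lr)))
```

With a rate of 1e-3 the parameter has barely moved after 20 steps, with 1e-1 it is close to the optimum, and with 1.1 every step overshoots so that the distance grows.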

In [None]:
# Specify hyperparameters
# Hyperparameters: num_epoch, learn_rate, loss_func, optimizer

# TODO
num_epoch = 100
learn_rate = 1e-5

# Loss and optimizer
loss_func = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=learn_rate)

In [None]:
# Loss before training
# Compute the loss on the train, validation and test data before training the network (with random weights)
Y_pred_train_before = net(X_train)
loss_train_before = loss_func(Y_pred_train_before, Y_train)
Y_pred_val_before = net(X_val)
loss_val_before = loss_func(Y_pred_val_before, Y_val)
Y_pred_test_before = net(X_test)
loss_test_before = loss_func(Y_pred_test_before, Y_test)

Now follows the training. In a loop over the epochs, the training data passes through the network, which is called the forward pass. The training loss is calculated from the forward pass prediction and the reference. Then the network is optimized by backpropagation.

For comparison purposes, the forward pass and the loss are also calculated for the validation set within the same loop in each epoch. Insert the calculation for *Y_pred_val* and *loss_val* (**7. TODO**).

In order to evaluate the training process, we monitor how the loss changes with the number of epochs. For this purpose, both the training and the validation loss are computed.

In [None]:
# Training
# Monitor loss curve during training
plt.figure() 

for epoch in range(num_epoch):
    # Classical forward pass -> predict new output from train data
    Y_pred_train = net(X_train)
    # Compute loss    
    loss_train = loss_func(Y_pred_train, Y_train)
    
    # Reset the gradients of the previous iteration
    optimizer.zero_grad()
    # Compute gradients. Calling .backward() multiple times accumulates the gradient (by addition) for each parameter.
    # This is why optimizer.zero_grad() has to be called before each new .backward() call.
    # Note that following the first .backward() call, a second call is only possible after you have performed another forward pass.
    loss_train.backward()
    # Perform a parameter update based on the current gradient (stored in the .grad attribute of a parameter)
    optimizer.step()
    
    # TODO
    # forward pass for validation
    Y_pred_val = net(X_val)
    loss_val = loss_func(Y_pred_val, Y_val)
    
    # Plot train and val loss
    plt.scatter(epoch, loss_train.item(), color='b', s=10, marker='o')
    plt.scatter(epoch, loss_val.item(), color='r', s=10, marker='o')
    
    # Print message with current losses (progress counts the epoch that has just finished)
    print('Train Epoch: {}/{} ({:.0f}%)\ttrain_Loss: {:.6f}\tval_Loss: {:.6f}'.format(
        epoch+1, num_epoch, (epoch+1)/num_epoch*100, loss_train.item(), loss_val.item()))


# Show training and validation loss    
plt.legend(['train-loss','val-loss'])
plt.xlabel('epoch')
plt.ylabel('loss')
plt.savefig('../results/who_loss.png')
#plt.show()

print('Train loss before training was:', loss_train_before.item())
print('Train loss after training is:', loss_train.item())
print('Val loss before training was:', loss_val_before.item())
print('Val loss after training is:', loss_val.item())

A second figure compares the predicted life expectancy with the reference. With a perfect model, these values are identical, i.e. all points lie on the black line.

In [None]:
# Pred vs. Ref Figure Train/Val set
# Plot the prediction against the reference for the train/val points

plt.figure()
plt.title('pred vs. ref: train/val points')
# Explicit labels keep each legend entry attached to the right plot element, independent of the drawing order
plt.scatter(Y_train.cpu().numpy(), Y_pred_train.cpu().detach().numpy(), color='b', s=5, marker='o', label='train-sample after tr')
plt.scatter(Y_val.cpu().numpy(), Y_pred_val.cpu().detach().numpy(), color='r', s=5, marker='o', label='val-sample after tr')
plt.scatter(Y_val.cpu().numpy(), Y_pred_val_before.cpu().detach().numpy(), color='m', s=5, marker='^', label='val-sample before tr')
plt.plot((0,1),(0,1), color='k', label='perfect model')
plt.xlabel('reference')
plt.ylabel('prediction')
plt.legend()
plt.xlim((0,1))
plt.ylim((0,1))
plt.savefig('../results/who_pred_vs_ref_val.png')

**8. TODO**: Analyze the loss curves: how can you tell from the loss curves for how many epochs a network should be trained? Discuss the training loss as well as the validation loss. To do so, vary the hyperparameters *learn_rate* and *num_epoch* and look at the resulting loss curves.
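
One common way to turn a validation loss curve into a stopping point is to look for the epoch after which the validation loss no longer improves. The following is a minimal sketch of that idea, assuming a hypothetical helper `best_epoch`, a patience of 3 epochs and made-up loss values; it is not part of the exercise code and does not replace the discussion asked for above.

```python
def best_epoch(val_losses, patience=3):
    """Return the 0-based epoch with the lowest validation loss, stopping the
    search once the loss has not improved for `patience` consecutive epochs."""
    best, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best

# Made-up validation losses: falling at first, rising again later
val_losses = [0.50, 0.30, 0.20, 0.15, 0.14, 0.16, 0.18, 0.21, 0.25]
print(best_epoch(val_losses))   # 4
```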


**9. TODO**: To further optimize the network, its architecture can be improved: change the depth (number of hidden layers) and the width (number of neurons per hidden layer) of the network in the class RegressNet. Then briefly describe the influence on the loss curves and the computing time with unchanged hyperparameters.

Hint 1: The PyTorch documentation (https://pytorch.org/docs/stable/nn.html) is useful for more information about the layers.

Hint 2: Afterwards the hyperparameters should be adjusted to fit the new architecture.

**10. TODO**: Improve the model by predicting life expectancy not only as a function of BMI, but of multiple features. Use any number of other features combined as $X$. To do this, go back to the point where you split the data into $X$ and $Y$. Do not use the first three columns, since they contain no float values or only metadata.

Important: The input dimension of the network must also be adjusted. 

Note: If you make a change to an already executed cell in the .ipynb script, all other cells should also be executed again.
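
Selecting several columns at once can be sketched as follows. The table below is a made-up stand-in for the WHO data, and the column choice `feature_cols` is a hypothetical example; the point is that the input dimension of the network follows directly from the second dimension of the selected array.

```python
import numpy as np

# Toy stand-in for the data table: 4 samples, 6 columns (made-up numbers)
data = np.array([[0, 2015, 0, 65.0, 263.0, 19.1],
                 [0, 2014, 0, 59.9, 271.0, 18.6],
                 [1, 2015, 1, 77.8,  74.0, 58.0],
                 [1, 2014, 1, 77.5,   8.0, 57.2]])

feature_cols = [4, 5]                       # hypothetical choice of input columns
x = np.array(data[:, feature_cols], dtype=np.float32)
y = np.array(data[:, 3], dtype=np.float32)

x = x / np.max(x, axis=0)                   # each feature is scaled by its own maximum
inputDim = x.shape[1]                       # network input size = number of features
print(x.shape, inputDim)                    # (4, 2) 2
```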

...

---

## Testing

Finally, the trained neural network can be tested on new, unseen data.

**11. TODO**: Insert the calculation of *Y_pred_test* and *loss_test*. Also look at the mean absolute difference between the predicted and the reference life expectancy on the test set.

In [None]:
# Test results
# TODO
Y_pred_test = net(X_test)
loss_test = loss_func(Y_pred_test, Y_test)

print('Test loss before training was:', loss_test_before.item())
print('Test loss after training is:', loss_test.item())

# Print mean abs difference between prediction and reference, converted back to years with scale_y
print('Mean abs difference:', np.mean(abs(Y_pred_test.cpu().detach().numpy()-Y_test.cpu().numpy()), axis=0)*scale_y, 'years')
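
Since the network works on normalized targets, the error has to be multiplied by the scale factor to be interpretable in years. A small numpy sketch with made-up predictions (only the value of *scale_y* is taken from the output earlier in this notebook):

```python
import numpy as np

scale_y = 89.0                                  # scale factor printed earlier in the notebook
y_ref_norm = np.array([0.80, 0.70, 0.90])       # made-up normalized references
y_pred_norm = np.array([0.82, 0.65, 0.90])      # made-up normalized predictions

mae_norm = np.mean(np.abs(y_pred_norm - y_ref_norm))   # error in normalized units
mae_years = mae_norm * scale_y                          # error in years
print('Mean abs difference:', mae_years, 'years')
```

In this toy case the three predictions are off by 0.02, 0.05 and 0 in normalized units, which corresponds to an average error of roughly two years.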

---

# Classification Network

The step to the first classification network is no longer a big one. For this task, you are provided with source code that is very similar to that of the regression network. The iris flower dataset consists of 150 iris flowers, each described by 4 features and assigned to one of 3 classes. The aim is to use a simple neural classification network to predict the class from the features.

- Solve the **TODO**s analogously to the regression network. Note that we now have to work with a one-hot encoded target vector ([take a look here](https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179)).
- The evaluation is slightly modified: Along with the loss plots we use a confusion matrix now to compare reference and prediction.
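
One-hot encoding turns each integer class label into a vector with a single 1 at the position of the class. The following numpy sketch with made-up labels illustrates the idea; the notebook itself builds the encoding with torch further below.

```python
import numpy as np

y = np.array([0, 2, 1, 2])               # made-up integer class labels for 4 samples
nc = 3                                   # number of classes

# Row i of the identity matrix is the one-hot vector of class i
y_oh = np.eye(nc, dtype=np.float32)[y]
print(y_oh)

# Taking the argmax along each row recovers the integer labels
print(np.argmax(y_oh, axis=1))           # [0 2 1 2]
```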


**18. TODO**: The classification may not yet work optimally, because there is no activation function on the output of the network. Explain why it makes sense to use a softmax activation for a classification task. Build such an activation with *F.softmax(output, dim=1)* into the architecture and compare *y_pred* after the respective runs with and without activation. Note: Instead of manually adding a softmax function, you can also use another loss function, e.g. CrossEntropyLoss(), which applies the softmax activation internally. We will use this in the following exercise. It might make sense to solve this **TODO** after all the other **TODO**s.
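
What the softmax does to the raw network outputs can be checked by hand. The following is a numpy sketch with made-up scores, not the *F.softmax* call itself: each row is mapped to positive values that sum to 1, the position of the largest entry stays the same, and the cross-entropy for integer targets is the negative log-probability of the true class.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],      # made-up raw network outputs for 2 samples
                   [0.1, 0.2,  3.0]])
probs = softmax(logits)

print(probs.sum(axis=1))                  # each row sums to 1
print(np.argmax(probs, axis=1))           # [0 2], same as the argmax of the logits

# Cross-entropy for integer targets: mean negative log-probability of the true class
targets = np.array([0, 2])
ce = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
print(ce)
```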


...

## Load & prepare data

**12. TODO**

In [None]:
# Read data
data_path = '../data/iris.data'

# Read csv sheet with pandas
df = pd.read_csv(data_path, sep=',')

df = df.dropna()

# Get numpy out of pandas dataframe
data = df.values

column_names = np.array(df.columns[:])

# TODO
print('Dimensions of the dataset:', data.shape)

In [None]:
# Split in X and Y
x = np.array(data[:,:-1], dtype=np.float32)

# pd.factorize returns the integer codes together with the class names in matching order
y, class_names = pd.factorize(data[:,-1])

# Save number of classes
nc = np.max(y)+1

print('x shape:', x.shape)
print('y shape:', y.shape)
print('number of classes', nc)

In [None]:
# Normalize X between (0,1). If multiple features in X are selected, each feature is normalized individually
scale_x = np.max(x, axis=0)
x = x/scale_x
print('Scale_x:',scale_x)

In [None]:
# Convert to torch tensors
# if tensors have only one dimension, an artificial dimension is created with unsqueeze (e.g. [10]->[10,1], so 1D->2D)
Y = torch.from_numpy(y)
Y = Y.long()

# Produce onehot target tensor
# The scatter_ method fills the tensor with values from a source tensor along the indices provided as arguments
# oh = one hot encoding
Y_oh = torch.zeros(Y.shape[0], nc)
Y_oh.scatter_(1,Y.unsqueeze(1), 1.0)

X = torch.from_numpy(x)
X = X.float()
if len(X.shape)==1:
    X = X.unsqueeze(dim=1)

**13. TODO**:

In [None]:
# Split dataset in training, validation and test tensors
# TODO
prop_train = 0.8
prop_val = 0.1
prop_test = 0.1

sample_num = {'all': X.shape[0], 
              'train': round(prop_train*X.shape[0]),
              'val': round(prop_val*X.shape[0]),
              'test': round(prop_test*X.shape[0])}

# idx shuffle
idx = np.random.choice(sample_num['all'], sample_num['all'], replace=False)

# Assign idx to each sample
sample_idx = {'all': idx[:], 
              'train': idx[0:sample_num['train']],
              'val': idx[sample_num['train']:sample_num['train']+sample_num['val']],
              'test': idx[sample_num['train']+sample_num['val']:]}

# Create train data
X_train = X[sample_idx['train']]
Y_train_oh = Y_oh[sample_idx['train']]
Y_train = Y[sample_idx['train']]

# Create validation data
X_val = X[sample_idx['val']]
Y_val_oh = Y_oh[sample_idx['val']]
Y_val = Y[sample_idx['val']]

# Create test data
X_test = X[sample_idx['test']]
Y_test_oh = Y_oh[sample_idx['test']]
Y_test = Y[sample_idx['test']]

# Show data point
print('Input of first ten train Sample:', X_train[0:10])
print('Target of first ten train Sample:', Y_train[0:10])
print('One-Hot-Encoded Target of first ten train Sample:', Y_train_oh[0:10])

---

## Build classification neural network

In [None]:
# Class of neural network 'ClassificationNet'
# Set up layer and architecture of network in constructor __init__
# Define operations on layer in forward pass method

class ClassificationNet(nn.Module):
    
    def __init__(self, inputSize, outputSize):
        super(ClassificationNet, self).__init__()
        self.fc1 = nn.Linear(inputSize, 128)
        self.fc2 = nn.Linear(128, 32)
        self.fc3 = nn.Linear(32, outputSize)
    
    def forward(self, x):
        # Two hidden layers with ReLU activation, followed by a linear output layer
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

**14. TODO**:

In [None]:
# Specify network parameter
# TODO
inputDim = 4
outputDim = 3
 
# Create instance of ClassificationNet
net = ClassificationNet(inputDim, outputDim)


---

## Train network

In [None]:
# Send tensors and networks to GPU (if you have one which supports cuda) for faster computations
# Note: Y is one-hot-encoded
X_train, Y_train_oh = X_train.to(device), Y_train_oh.to(device)
X_val, Y_val_oh = X_val.to(device), Y_val_oh.to(device)
X_test, Y_test_oh = X_test.to(device), Y_test_oh.to(device)

# The network itself must also be sent to the GPU. Either you write net = ClassificationNet() and then later net.to(device) or directly net = ClassificationNet().to(device)
# The latter option may have the advantage that the instance net is created directly on the GPU, whereas in variant 1 it must first be sent to the GPU.
if device_num>1:
    print("Let's use", device_num, "GPU's")
    net = nn.DataParallel(net)
net.to(device) 
print(net)

**15. TODO**:

In [None]:
# Specify hyperparameters
# Hyperparameters: num_epoch, num_lr, loss_func, optimizer
# TODO
num_epoch = 2000
num_lr = 1e-4

# Loss and optimizer
# loss_func = nn.MSELoss() # -> one hot encoded 'target' to loss-function
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=num_lr)

In [None]:
# Loss and Accuracy before training
# Compute loss of test data before training the network (with random weights)

Y_pred_test_before_oh = net(X_test)
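# Loss function input looks as follows: loss_func(prediction, target)
# Note: for CrossEntropyLoss(): the prediction holds one raw score per class; the target is either a vector of class indices
# or, as used here, a one-hot encoded tensor (interpreted as class probabilities in recent PyTorch versions)
# for MSELoss(): prediction and target both have to be one-hot encoded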

loss_test_before = loss_func(Y_pred_test_before_oh, Y_test_oh)

# Accuracy before training
y_pred_test_before = np.argmax(Y_pred_test_before_oh.cpu().detach().numpy(), axis=1)
correct_before = np.sum(y_pred_test_before == Y_test.numpy())
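
The accuracy computation used here and in the training loop boils down to an argmax per row followed by a comparison with the reference classes. A minimal numpy sketch with made-up scores and labels:

```python
import numpy as np

scores = np.array([[0.1, 2.0, -0.3],      # made-up network outputs, one row per sample
                   [1.5, 0.2,  0.1],
                   [0.0, 0.4,  0.9],
                   [0.3, 0.2,  0.1]])
y_ref = np.array([1, 0, 1, 0])            # made-up reference classes

y_pred = np.argmax(scores, axis=1)        # index of the largest score in each row
correct = np.sum(y_pred == y_ref)         # number of matching predictions
accuracy = correct / y_ref.shape[0]
print(y_pred, correct, accuracy)          # [1 0 2 0] 3 0.75
```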

**16. TODO**:

In [None]:
# Train the network
plt.figure() 

for epoch in range(num_epoch):
    # classical forward pass -> predict new output from train data
    Y_pred_train_oh = net(X_train)
    # Compute loss    
    loss_train = loss_func(Y_pred_train_oh, Y_train_oh)
    
    # Reset the gradients of the previous iteration
    optimizer.zero_grad()
    # Compute gradients. Calling .backward() multiple times accumulates the gradient (by addition) for each parameter.
    # This is why optimizer.zero_grad() has to be called before each new .backward() call.
    # Note that following the first .backward() call, a second call is only possible after you have performed another forward pass.
    loss_train.backward()
    # Perform a parameter update based on the current gradient (stored in the .grad attribute of a parameter)
    optimizer.step()
    
    # TODO
    # Forward pass for validation
    Y_pred_val_oh = net(X_val)
    loss_val = loss_func(Y_pred_val_oh, Y_val_oh)
    
    # Compute actual train accuracy
    y_pred_train = np.argmax(Y_pred_train_oh.cpu().detach().numpy(), axis=1)
    correct_train = np.sum(y_pred_train == Y_train.numpy())
    
    # Compute actual val accuracy
    y_pred_val = np.argmax(Y_pred_val_oh.cpu().detach().numpy(), axis=1)
    correct_val = np.sum(y_pred_val == Y_val.numpy())
    
    # Plot train and val loss and accuracies
    plt.scatter(epoch, loss_train.item(), color='r', s=10, marker='o')
    plt.scatter(epoch, loss_val.item(), color='b', s=10, marker='o')
    plt.scatter(epoch, correct_train/Y_train.shape[0], color='m', s=10, marker='o')
    plt.scatter(epoch, correct_val/Y_val.shape[0], color='c', s=10, marker='o')
    
    # Print message with current losses (progress counts the epoch that has just finished)
    print('Train Epoch: {}/{} ({:.0f}%)\ttrain_Loss: {:.6f}\tval_Loss: {:.6f}'.format(
        epoch+1, num_epoch, (epoch+1)/num_epoch*100, loss_train.item(), loss_val.item()))
       

# Show training and validation loss    
plt.legend(['train-loss','val-loss','train-acc','val-acc'])
plt.xlabel('epoch')
plt.ylabel('loss / accuracy')
plt.savefig('../results/irisflower_loss.png')


---

## Testing

**17. TODO**:

In [None]:
# Test results
# TODO
# Forward pass 
# Y_pred_test_oh is on the GPU if net and X_test are on the GPU; it is moved to the CPU below before the conversion to numpy.
Y_pred_test_oh = net(X_test)
# Compute and print losses
loss_test = loss_func(Y_pred_test_oh, Y_test_oh)

print('Test loss before training was:', loss_test_before.item())
print('Test loss after training is:', loss_test.item())

# Compute and print accuracies
y_pred_test = np.argmax(Y_pred_test_oh.cpu().detach().numpy(), axis=1)
correct = np.sum(y_pred_test == Y_test.numpy())
print('Test accuracy before training: ', correct_before/Y_test.shape[0]*100, '%')
print('Test accuracy after training: ', correct/Y_test.shape[0]*100, '%')

In [None]:
# Evaluation: confusion matrix of reference vs. predicted classes on the test set
# Rows are the reference classes, columns are the classes predicted by the network
cm = confusion_matrix(Y_test.numpy(), y_pred_test, labels=np.arange(nc))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot()
# Save before plt.show(), otherwise the saved figure may be empty
plt.savefig('../results/irisflower_confusion.png')
plt.show()
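
How a confusion matrix is counted can be reproduced in a few lines. The following numpy sketch uses a hypothetical helper `confusion_counts` and made-up labels; it follows the same convention as scikit-learn, with reference classes in the rows and predicted classes in the columns.

```python
import numpy as np

def confusion_counts(y_ref, y_pred, num_classes):
    """Count how often reference class r (row) was predicted as class p (column)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for r, p in zip(y_ref, y_pred):
        cm[r, p] += 1
    return cm

y_ref = np.array([0, 0, 1, 1, 2, 2])       # made-up reference classes
y_pred = np.array([0, 0, 1, 2, 2, 2])      # made-up predictions
cm = confusion_counts(y_ref, y_pred, 3)
print(cm)

# The diagonal holds the correct predictions, so its share of all samples is the accuracy
print(np.trace(cm) / cm.sum())
```

Off-diagonal entries show which classes are confused with each other; in this toy case one sample of class 1 is predicted as class 2.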