# Neural Networks

In this exercise you will learn how to implement a feedforward neural network and train it with backpropagation.

In [42]:
import numpy as np
from numpy.random import multivariate_normal
from numpy.random import uniform
from scipy.stats import zscore

We define two helper functions, `init_toy_data` and `init_model`, to create a simple data set to work on and a 2-layer neural network.

First, we create toy data with categorical labels by sampling from different multivariate normal distributions for each class. 

In [43]:
def init_toy_data(num_samples,num_features, num_classes, seed=3):
    # num_samples: number of samples *per class*
    # num_features: number of features (excluding bias)
    # num_classes: number of class labels
    # seed: random seed
    np.random.seed(seed)
    X=np.zeros((num_samples*num_classes, num_features))
    y=np.zeros(num_samples*num_classes)
    for c in range(num_classes):
        # initialize multivariate normal distribution for this class:
        # choose a mean for each feature
        means = uniform(low=-10, high=10, size=num_features) # draw uniformly distributed random numbers from the given range
        # choose a variance for each feature
        var = uniform(low=1.0, high=5, size=num_features)
        # for simplicity, all features are uncorrelated (covariance between any two features is 0)
        cov = var * np.eye(num_features) # build the covariance matrix; np.eye creates an identity matrix, so cov is diagonal
        # draw samples from normal distribution
        X[c*num_samples:c*num_samples+num_samples,:] = multivariate_normal(means, cov, size=num_samples)
        # set label
        y[c*num_samples:c*num_samples+num_samples] = c
    return X,y


In [44]:
def init_model(input_size,hidden_size,num_classes, seed=3):
    # input size: number of input features
    # hidden_size: number of units in the hidden layer
    # num_classes: number of class labels, i.e., number of output units
    np.random.seed(seed)
    model = {}
    # initialize weight matrices and biases randomly
    model['W1'] = uniform(low=-1, high=1, size=(input_size, hidden_size))  # weights between input and hidden layer
    model['b1'] = uniform(low=-1, high=1, size=hidden_size)                # biases of the hidden layer
    model['W2'] = uniform(low=-1, high=1, size=(hidden_size, num_classes)) # weights between hidden and output layer
    model['b2'] = uniform(low=-1, high=1, size=num_classes)                # biases of the output layer
    return model

In [45]:
# create toy data
X,y= init_toy_data(2,4,3) # 2 samples per class; 4 features, 3 classes
# Normalize data
X = zscore(X, axis=0)
print('X: ' + str(X)) # 2 samples for each of the 3 classes (0, 1, 2); each sample has 4 features
print('y: ' + str(y)) # class labels

X: [[ 0.39636145  1.09468144 -0.89360845  0.91815536]
 [ 0.94419323 -0.94027869  1.22268078  1.29597409]
 [-1.41577399  1.15477931 -0.62099631  0.08323307]
 [-1.35264614 -0.13598976 -1.14221784  0.26928935]
 [ 0.9352123   0.38225626  1.419864   -1.51152157]
 [ 0.49265316 -1.55544856  0.01427781 -1.0551303 ]]
y: [0. 0. 1. 1. 2. 2.]


We now initialise our neural net with one hidden layer consisting of $10$ units and an output layer consisting of $3$ units. Here we expect (any number of) training samples with $4$ features. We do not apply any activation functions yet. The following figure shows a graphical representation of this neural net.

<img src="nn.graphviz.png"  width="30%" height="30%">

In [46]:
# initialize model
model = init_model(input_size=4, hidden_size=10, num_classes=3)

print('model: ' + str(model)) 
print('model[\'W1\'].shape: ' + str(model['W1'].shape))
print('model[\'W2\'].shape: ' + str(model['W2'].shape))
print('model[\'b1\'].shape: ' + str(model['b1'].shape))
print('model[\'b2\'].shape: ' + str(model['b2'].shape))
print('number of parameters: ' + str((model['W1'].shape[0] * model['W1'].shape[1]) + 
     np.sum(model['W2'].shape[0] * model['W2'].shape[1]) + 
     np.sum(model['b1'].shape[0]) +
     np.sum(model['b2'].shape[0] )))

model: {'W1': array([[ 0.10159581,  0.41629565, -0.41819052,  0.02165521,  0.78589391,
         0.79258618, -0.74882938, -0.58551424, -0.89706559, -0.11838031],
       [-0.94024758, -0.08633355,  0.2982881 , -0.44302543,  0.3525098 ,
         0.18172563, -0.95203624,  0.11770818, -0.48149511, -0.16979761],
       [-0.43294984,  0.38627584, -0.11909256, -0.68626452,  0.08929804,
         0.56062953, -0.38727294, -0.55608423, -0.22405748,  0.8727673 ],
       [ 0.95199084,  0.34476735,  0.80566822,  0.69150174, -0.24401192,
        -0.81556598,  0.30682181,  0.11568152, -0.27687047, -0.54989099]]), 'b1': array([-0.18696017, -0.0621195 , -0.46152884, -0.41641445, -0.0846272 ,
        0.72106783,  0.17250581, -0.43302428, -0.44404499, -0.09075585]), 'W2': array([[-0.58917931, -0.59724258,  0.02807012],
       [-0.82554126, -0.03282894, -0.27564758],
       [ 0.41537324,  0.49349245,  0.38218584],
       [ 0.37836083, -0.25279975,  0.33626961],
       [-0.32030267,  0.14558774, -0.34838568]

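- W1: weights between the input and the hidden layer. Each of the 4 features is connected to 10 hidden units, giving 4 x 10 = 40 weights (parameters). (weight = $\theta$)
- b1: biases of the hidden layer, one per hidden unit (10 in total). (bias = $\theta_0$)
- W2: weights between the hidden layer with 10 units and the output layer with 3 classes, giving 10 x 3 = 30 weights (parameters).
- b2: biases of the output layer, one per class (3 in total).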
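
The parameter counts listed above can be checked with a short calculation. This is a minimal sketch, assuming the same layer sizes (4 inputs, 10 hidden units, 3 outputs); the helper name `count_parameters` and the stand-in dictionary `toy_model` are hypothetical and not part of the exercise.

```python
import numpy as np

def count_parameters(model):
    # Sum the number of entries of every array in the model dictionary.
    return sum(int(np.size(p)) for p in model.values())

# Stand-in model with the same shapes as init_model(4, 10, 3); the values do not matter here.
toy_model = {
    'W1': np.zeros((4, 10)),
    'b1': np.zeros(10),
    'W2': np.zeros((10, 3)),
    'b2': np.zeros(3),
}
n_params = count_parameters(toy_model)
print(n_params)  # 40 + 10 + 30 + 3 = 83
```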

<b>Exercise 1</b>: Implement softmax layer.

Implement the softmax function given by 

$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}}$,

where $J$ is the total number of classes, i.e. the length of  **x** .

Note: Implement the function such that it takes a matrix X of shape (N, J) as input rather than a single instance **x**; N is the number of instances.

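Neural networks mostly deal with classification or regression problems; softmax is used for classification, specifically for multi-class models. It converts the raw outputs of the network into values that can be interpreted as the probabilities that an input belongs to each class (e.g. dog, cat, person), and the input is assigned to the class with the highest probability. In the formula above, the $x_j$ are the inputs to the softmax function.
Each output is (exponential of the $i$-th input) / (sum of the exponentials of all inputs), so the outputs are non-negative and sum to 1 and can therefore be read as probabilities. The exponential function is used because it is differentiable and because it makes large inputs relatively larger and small inputs relatively smaller, so the entries of the input vector are separated more clearly.

In the formula above, $x_i$ is the $i$-th neuron of the output layer and $J$ is the number of output neurons, i.e. the number of classes. Put simply, the numerator is the exponential of the input signal $x_i$ and the denominator is the sum of the exponentials of all input signals.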
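
Softmax does not change when the same constant is subtracted from every entry of a row, which is why implementations commonly subtract the row maximum before exponentiating. The sketch below contrasts a direct translation of the formula with a shifted variant on large inputs; the function names `naive_softmax` and `shifted_softmax` and the example values are illustrative assumptions.

```python
import numpy as np

def naive_softmax(X):
    # Direct translation of the formula; np.exp overflows for large inputs.
    exps = np.exp(X)
    return exps / np.sum(exps, axis=1, keepdims=True)

def shifted_softmax(X):
    # Subtracting the row maximum leaves the result unchanged
    # but keeps every exponential at or below 1.
    exps = np.exp(X - np.max(X, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = small + 1000.0  # same differences between entries, much larger values

with np.errstate(over='ignore', invalid='ignore'):
    naive_large = naive_softmax(large)  # exp overflows to inf, and inf / inf gives nan

print(naive_softmax(small))    # agrees with shifted_softmax(small)
print(naive_large)             # contains nan
print(shifted_softmax(large))  # same probabilities as for `small`
```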

In [47]:
def softmax(X):
    # subtract the row-wise maximum to prevent overflow in np.exp (the result is unchanged)
    X = X - np.max(X, axis=1, keepdims=True)
    exi = np.exp(X)
    # sum of the exponentials for each row
    exj = np.sum(exi, axis=1, keepdims=True)
    # broadcasting divides every row by its own sum
    return exi / exj

Check if everything is correct.

In [48]:
x = np.array([[0.1, 0.7],[0.7,0.4]])
exact_softmax = np.array([[ 0.35434369,  0.64565631],
                         [ 0.57444252,  0.42555748]])
sm = softmax(x)
difference = np.sum(np.abs(exact_softmax - sm))
try:
    assert difference < 0.000001   
    print("Testing successful.")
    print(sm)
except AssertionError:
    print("Tests failed.")

Testing successful.
[[0.35434369 0.64565631]
 [0.57444252 0.42555748]]


<b>Exercise 2</b>: Implement the forward propagation algorithm for the model defined above.

The activation function of the hidden neurons is a Rectified Linear Unit $relu(x)=max(0,x)$ (to be applied element-wise to the hidden units).
The activation function of the output layer is a softmax function (as implemented in Exercise 1).

The function should return both the activation of the hidden units (after having applied the $relu$ activation function) (shape: $(N, num\_hidden)$) and the softmax model output (shape: $(N, num\_classes)$). 

In [49]:
def forward_prop(X,model):
    W1=model['W1']
    b1=model['b1']
    W2=model['W2']
    b2=model['b2']

    # hidden pre-activation: H = X W1 + b1
    H = np.dot(X, W1) + b1
    # hidden activation: ReLU applied element-wise
    hidden_activations = np.maximum(H, 0)
    # output pre-activation: O = hidden_activations W2 + b2
    O = np.dot(hidden_activations, W2) + b2
    # output = softmax(O)
    probs = softmax(O)
    
    return hidden_activations,probs

In [50]:
acts,probs = forward_prop(X, model)
correct_probs = np.array([[0.22836388, 0.51816433, 0.25347179],
                            [0.15853289, 0.33057078, 0.51089632],
                            [0.40710319, 0.41765056, 0.17524624],
                            [0.85151353, 0.03656425, 0.11192222],
                            [0.66016592, 0.19839791, 0.14143618],
                            [0.70362036, 0.08667923, 0.20970041]])

# the difference should be very small.
difference =  np.sum(np.abs(probs - correct_probs))

try:
    assert probs.shape==(X.shape[0],len(set(y)))
    assert difference < 0.00001   
    print("Testing successful.")
except AssertionError:
    print("Tests failed.")

Testing successful.


<b>Exercise 3:</b> 

How would you train the above defined neural network? Which loss-function would you use? You do not need to implement this.

- To train the network with backpropagation, we need a loss function, an optimizer (e.g. gradient descent) and an evaluation metric; in Keras, the model is compiled with these before training.
- Because the output layer is a softmax, categorical cross-entropy (softmax loss) is a natural choice for the loss.
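
One possible way to carry out such training is sketched below, under assumptions: categorical cross-entropy as the loss and plain full-batch gradient descent with gradients obtained by backpropagation. The function names (`forward`, `cross_entropy`, `backprop_step`), the learning rate, the number of steps and the random toy inputs are illustrative choices, not part of the exercise.

```python
import numpy as np

def forward(X, model):
    # Same architecture as above: ReLU hidden layer, softmax output.
    H = np.maximum(X @ model['W1'] + model['b1'], 0)
    O = H @ model['W2'] + model['b2']
    E = np.exp(O - O.max(axis=1, keepdims=True))
    return H, E / E.sum(axis=1, keepdims=True)

def cross_entropy(probs, y):
    # Mean negative log-probability of the true class (y holds integer labels).
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def backprop_step(X, y, model, lr=0.05):
    # One gradient-descent update; returns the loss measured before the update.
    n = len(y)
    H, probs = forward(X, model)
    loss = cross_entropy(probs, y)
    dO = probs.copy()
    dO[np.arange(n), y] -= 1               # gradient w.r.t. the softmax inputs
    dO /= n
    dW2 = H.T @ dO
    db2 = dO.sum(axis=0)
    dH = (dO @ model['W2'].T) * (H > 0)    # ReLU passes gradient only where it was active
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)
    for name, grad in (('W1', dW1), ('b1', db1), ('W2', dW2), ('b2', db2)):
        model[name] = model[name] - lr * grad
    return loss

# Hypothetical toy inputs with the same shapes as in the exercise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))
y_demo = np.array([0, 0, 1, 1, 2, 2])
demo_model = {
    'W1': rng.uniform(-1, 1, size=(4, 10)), 'b1': np.zeros(10),
    'W2': rng.uniform(-1, 1, size=(10, 3)), 'b2': np.zeros(3),
}
losses = [backprop_step(X_demo, y_demo, demo_model) for _ in range(200)]
print(losses[0], losses[-1])  # the loss should shrink over the updates
```

In practice one would use mini-batches and an adaptive optimizer such as Adam rather than a fixed learning rate; the sketch only shows how the gradients flow backwards through the two layers.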

<b>Part 2 (Neural Net using Keras)</b>

Instead of implementing the model learning ourselves, we can use the neural network library Keras for Python (https://keras.io/). Keras is an abstraction layer that builds on top of either Theano or Google's TensorFlow. So please install Keras and TensorFlow/Theano for this lab.

<b>Exercise 4:</b>
    Implement the same model as above using Keras:
    
    ** 1 hidden layer with 10 units
    ** softmax output layer with three units
    ** 4 input features
    
Compile the model using categorical cross-entropy (also referred to as 'softmax-loss') as loss function and using categorical crossentropy together with categorical accuracy as metrics for runtime evaluation during training.

Hint 1: Use the Sequential Class API of Keras (https://keras.io/api/models/sequential/ or https://www.tensorflow.org/guide/keras/sequential_model)

Hint 2: You can use the Adam optimizer of Keras for the model compilation

In [51]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.utils import to_categorical

# define the model 
model = Sequential() # from tensorflow.keras.models
model.add(Dense(10, input_dim=4, activation="relu")) # 10 hidden units, 4 input units
#model.add(Activation("relu"))
model.add(Dense(3, activation="softmax")) # 3 output units

model.summary()

# compile the model
model.compile(optimizer = "adam", loss = "categorical_crossentropy", metrics = ["categorical_accuracy"])

The description of the current network can always be looked at via the summary method. The layers can be accessed via model.layers and weights can be obtained with the method get_weights. Check if your model is as expected. 

In [52]:
# Check model architecture and initial weights.

W_1, b_1 = model.layers[0].get_weights()
print("First layer weights: %s; shape: %s" % (W_1,W_1.shape))
print("First layer bias: %s; shape: %s" % (b_1,b_1.shape))
W_2, b_2 = model.layers[1].get_weights()
print("Second layer weights: %s; shape: %s" % (W_2,W_2.shape))
print("Second layer bias: %s; shape: %s" % (b_2,b_2.shape))
print("number of layers: " + str(len(model.layers)))
model.summary()


First layer weights: [[-0.3349277  -0.20815682  0.07303619  0.5585607   0.43144965 -0.3591347
  -0.14473945 -0.46396878  0.4558953  -0.26218727]
 [ 0.02293992  0.5296025   0.47306192 -0.30638427  0.479069    0.5002371
  -0.2770795  -0.4525519   0.2796756   0.64514565]
 [ 0.5460888   0.16778523  0.3889724  -0.32512313  0.1853447  -0.20481151
  -0.4875121   0.20600724 -0.53105044  0.31995165]
 [ 0.21698058 -0.1970346   0.63833404  0.20191354  0.2528442  -0.00997531
   0.6020616   0.6002233  -0.64970714  0.6070931 ]]; shape: (4, 10)
First layer bias: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]; shape: (10,)
Second layer weights: [[-5.2669907e-01  3.1668967e-01 -1.9972163e-01]
 [ 7.0066333e-02  2.0276010e-02 -6.4692861e-01]
 [ 4.4379890e-01 -3.5927296e-03 -4.0056151e-01]
 [ 2.9915571e-04 -3.7570083e-01  2.8342664e-01]
 [-6.1577672e-01 -3.0768493e-01 -4.1390425e-01]
 [-6.4112395e-01 -3.2214803e-01 -6.5467513e-01]
 [-1.4445269e-01 -6.3309640e-01  5.1082373e-01]
 [-4.6026725e-01 -1.7106354e-02  5.6230903

<b>Exercise 5:</b> Train the model on the toy data set generated below: 

Hints: 

* Keras expects one-hot-coded labels 

* Don't forget to normalize the data
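
Both hints can be illustrated with plain NumPy. The sketch below one-hot encodes integer labels and standardizes features using the mean and standard deviation of the training split, then reuses those statistics for the test split. Reusing the training statistics is a common convention and an assumption here, not something the exercise prescribes; the helper names `one_hot` and `standardize` and the small arrays are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Row i has a 1 in column labels[i] and 0 elsewhere.
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes)[labels]

def standardize(train, test):
    # Estimate mean and standard deviation on the training split only,
    # then apply the same transformation to both splits.
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

labels_demo = np.array([0., 2., 1., 2.])
encoded = one_hot(labels_demo, num_classes=3)
print(encoded)

train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_demo = np.array([[2.0, 40.0]])
train_z, test_z = standardize(train_demo, test_demo)
print(train_z.mean(axis=0), train_z.std(axis=0))  # approximately 0 and 1 per feature
print(test_z)
```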

In [53]:
from sklearn.model_selection import train_test_split

X, y = init_toy_data(1000,4,3, seed=3)

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=67)

# Normalize data
X_train = zscore(X_train, axis=0)
X_test = zscore(X_test, axis=0)

# one-hot encoding
y_train = to_categorical(y_train, num_classes=3)
y_test = to_categorical(y_test, num_classes=3)

# train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test)) # epochs and batch_size, refer to the example provided

train_loss, train_accuracy = model.evaluate(X_train, y_train)
print(f"Train Accuracy: {train_accuracy:.4f}")

test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.4f}")


Epoch 1/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - categorical_accuracy: 0.7069 - loss: 0.8321 - val_categorical_accuracy: 0.8101 - val_loss: 0.6600
Epoch 2/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 936us/step - categorical_accuracy: 0.8054 - loss: 0.6229 - val_categorical_accuracy: 0.8788 - val_loss: 0.4965
Epoch 3/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 938us/step - categorical_accuracy: 0.8910 - loss: 0.4621 - val_categorical_accuracy: 0.9313 - val_loss: 0.3792
Epoch 4/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 989us/step - categorical_accuracy: 0.9333 - loss: 0.3577 - val_categorical_accuracy: 0.9576 - val_loss: 0.2909
Epoch 5/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - categorical_accuracy: 0.9631 - loss: 0.2716 - val_categorical_accuracy: 0.9687 - val_loss: 0.2239
Epoch 6/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s

Compare the test accuracy with the train accuracy. What can you see? Is the model performing well?

- Accuracy increases and loss decreases with each epoch shown, on both the training data and the validation data.
- Train and test accuracy are both high and close to each other, which suggests that the model generalizes well and is not overfitting.
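
Categorical accuracy, the metric reported during training, is the fraction of samples whose most probable predicted class matches the one-hot label. A minimal NumPy sketch, with made-up probabilities and a hypothetical helper name `categorical_accuracy_np`:

```python
import numpy as np

def categorical_accuracy_np(y_true_one_hot, y_prob):
    # Compare the index of the true class with the index of the most probable class.
    true_idx = np.argmax(y_true_one_hot, axis=1)
    pred_idx = np.argmax(y_prob, axis=1)
    return float(np.mean(true_idx == pred_idx))

y_true_demo = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob_demo = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.4, 0.3],
                        [0.2, 0.5, 0.3]])
acc_demo = categorical_accuracy_np(y_true_demo, y_prob_demo)
print(acc_demo)  # 3 of 4 predictions are correct -> 0.75
```

Computing this value separately for the training and test splits and comparing the two numbers is one simple way to look for overfitting.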