
## Deep Learning

### Understanding Gradient Descent 

### Lab instructor: Paulina Garcia Corral

### Lab 09 

### Date: 18.11.2022

Code adapted from dhavalsays's script for coding a Gradient Descent from scratch.

TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.

In particular, TensorFlow (tf) lets us use pre-trained models, such as BERT, and also train our own models using the same architectures.

In [4]:
# Import TensorFlow in Google Colab like this:
import tensorflow as tf

# Keras is an open-source software library that provides a Python interface 
# for artificial neural networks. Keras acts as an interface for the TensorFlow library. 
from tensorflow import keras

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

Gradient descent is the optimization algorithm that neural networks use to find the values of the weights and biases. It works on top of a prediction function:

$$y = \beta_1 x_1 + \beta_2 x_2 + \text{bias}, \qquad z = \frac{1}{1 + e^{-y}}$$

Forward pass: we multiply each feature by its weight, add the bias, and get a raw score y. Passing y through the sigmoid gives the prediction z.
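As a minimal sketch of the forward pass for one person, the snippet below uses plain Python so every step is visible. The values of `w1`, `w2`, `bias` and the input features are illustrative assumptions, not results from the lab.

```python
import math

def sigmoid(y):
    """Squash any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-y))

# Illustrative (assumed) parameters and one input example.
w1, w2, bias = 1.0, 1.0, 0.0
age, last_elections = 0.47, 1

# Linear part: weighted sum of the features plus the bias.
y = w1 * age + w2 * last_elections + bias
# Activation: turn the raw score into a value between 0 and 1.
z = sigmoid(y)

print(f"y = {y:.2f}, z = {z:.4f}")
```

With these starting values the raw score is 1.47, and the sigmoid maps it to a value a little above 0.8.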

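From there we can calculate the error. For logistic regression we use the log loss, where $y$ now denotes the true label and $\hat{y}$ the predicted probability (the sigmoid output of the forward pass):

$$\text{error} = -\left(y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right)$$

We add up the errors of all samples and take a simple average (binary cross-entropy). This is called the "epoch total loss".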
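To make the averaging concrete, here is a small sketch in plain Python. The labels and predicted probabilities are toy values assumed for illustration only.

```python
import math

def log_loss_single(y, y_hat):
    """Log loss for one sample with true label y (0 or 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Assumed toy labels and predicted probabilities.
y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.2, 0.6, 0.7]

errors = [log_loss_single(y, p) for y, p in zip(y_true, y_pred)]
epoch_loss = sum(errors) / len(errors)  # binary cross-entropy

for y, p, e in zip(y_true, y_pred, errors):
    print(f"y={y}, y_hat={p}: error={e:.4f}")
print(f"epoch total loss: {epoch_loss:.4f}")
```

The last sample (true label 0, prediction 0.7) contributes the largest error, which is the behaviour we want: confident mistakes are penalised most.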

Update the weights to reduce the log loss, given a learning rate that we set ourselves (0.1 is a common starting value).

We update each weight using w1 = w1 - learning rate * (partial derivative of the loss with respect to w1),

and begin again with a new forward pass (epoch 2...).
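One full update step can be sketched as follows. The three data rows, the starting values and the learning rate of 0.5 are assumptions chosen for illustration; the derivative used is the standard one for the average log loss with a sigmoid output, the mean of (prediction - label) * feature.

```python
import math

def sigmoid(y):
    return 1 / (1 + math.exp(-y))

# Assumed toy data: age / 100, voted last time, votes next time.
ages = [0.22, 0.47, 0.52]
last = [1, 1, 0]
y_true = [0, 1, 0]

w1, w2, bias = 1.0, 1.0, 0.0   # starting values
learning_rate = 0.5            # illustrative choice
n = len(y_true)

# Forward pass for every sample.
y_hat = [sigmoid(w1 * a + w2 * v + bias) for a, v in zip(ages, last)]

# Partial derivatives of the average log loss.
residuals = [p - y for p, y in zip(y_hat, y_true)]
w1_d = sum(r * a for r, a in zip(residuals, ages)) / n
w2_d = sum(r * v for r, v in zip(residuals, last)) / n
bias_d = sum(residuals) / n

# Move each parameter against its gradient.
w1 -= learning_rate * w1_d
w2 -= learning_rate * w2_d
bias -= learning_rate * bias_d

print(w1, w2, bias)
```

Because the initial predictions are too high on average for this toy data, all three parameters move downward after the first step.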



## Let's first apply it using the tf methods

We will predict whether someone will vote in the next election based on their age and whether they voted in the last election. We will use a fictitious dataset called voting, which is in this week's repo.

In [3]:
# Read csv
df = pd.read_csv("voting.csv")
df.head()

Unnamed: 0,age,last_elections,next_election
0,22,1,0
1,25,0,0
2,47,1,1
3,52,0,0
4,46,1,1


**Scaling the data**

In [8]:
# Even if we are using tf, the sklearn library has so many useful functions that
# we can use them to prep our data

# Let's try with a Standard Scaler first.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(df[['age']])

df[['age']] = scaler.transform(df[['age']])

# This is a normal process but there are other approaches that can keep interpretability!
df.head()

Unnamed: 0,age,last_elections,next_election
0,-1.175749,1,0
1,-0.978617,0,0
2,0.467014,1,1
3,0.795566,0,0
4,0.401303,1,1
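For intuition about what the scaler does, the sketch below standardizes by hand: subtract the mean and divide by the population standard deviation, which is what `StandardScaler` computes. It uses only the five ages visible in `df.head()`, so the numbers will not match the output above, where the scaler was fitted on the full dataset.

```python
from statistics import fmean, pstdev

ages = [22, 25, 47, 52, 46]  # only the five rows shown by df.head()

mean = fmean(ages)
std = pstdev(ages)  # population standard deviation (divides by n)

# Standardize: zero mean and unit variance.
scaled = [(a - mean) / std for a in ages]

print([round(s, 3) for s in scaled])
```

After this transformation the values have mean 0 and standard deviation 1, which makes them harder to read as ages but comparable in scale to the 0/1 features.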


In [10]:
# Knowing that last_elections ranges from 0 to 1, we can transform the age variable
# into an interpretable, scaled version. Just a different way to look at
# preprocessing possibilities :)


df = pd.read_csv("voting.csv")

df['age'] = df['age'] / 100

df

In [14]:
# We are now going to split our data into training and testing sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['age','last_elections']],
                                                    df.next_election,test_size=0.2, random_state=25)

In [18]:
# We will build a model using the tf method to create a simple neural network

# Number of units (neurons) in our single Dense layer
lay = 1

# Number of features in our data
xs = 2

# We specify a sequential model, then add the layer with its units and input features.
# The activation has to be sigmoid; we initialize the weights to ones and the bias to zero.
model = keras.Sequential([
    keras.layers.Dense(lay, input_shape=(xs,), activation='sigmoid', 
                       kernel_initializer='ones', bias_initializer='zeros')
])

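# We then compile with the Adam optimizer and binary cross-entropy loss, plus two evaluation metrics.
# Note: tf.keras.metrics.Accuracy() counts exact matches between predictions and labels,
# so it stays at or near 0 for sigmoid outputs; tf.keras.metrics.BinaryAccuracy() thresholds at 0.5.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Accuracy()])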

# We will now train just as we did with the sklearn methods, starting with 6000 epochs
# model.fit(X_train, y_train, epochs=6000)

# We see that 6000 epochs are too many: the loss reaches a minimum and starts growing again.
# Let's set it to 5000

model.fit(X_train, y_train, epochs=5000)

Streaming output truncated to the last 5000 lines.
Epoch 2502/5000
Epoch 2503/5000
...

<keras.callbacks.History at 0x7f6bac6473d0>

In [19]:
# We can call evaluation and observe the metrics

model.evaluate(X_test,y_test)



[0.5840319991111755, 0.75, 0.0]

In [20]:
# Just as before, we can see the predictions on unobserved data

y_hat = model.predict(X_test)

# And compare. Because the output is a sigmoid value z, values below 0.5 are classified as 0
# and values above 0.5 are classified as 1.
print(y_hat, y_test)

[[0.7318652 ]
 [0.17979313]
 [0.24374732]
 [0.52211785]
 [0.7503508 ]
 [0.84273803]] 2     1
10    1
21    0
11    0
14    1
9     1
Name: next_election, dtype: int64
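The printed predictions can be used to recompute the metrics by hand. The sketch below copies the six probabilities and labels from the output above and thresholds them at 0.5. It also computes an exact-match accuracy, on the understanding that `tf.keras.metrics.Accuracy()` compares raw predictions with labels for equality, which would explain the 0.0 returned by `model.evaluate`.

```python
# Predictions and labels copied from the output above.
y_hat = [0.7318652, 0.17979313, 0.24374732, 0.52211785, 0.7503508, 0.84273803]
y_test = [1, 1, 0, 0, 1, 1]

# Threshold the sigmoid outputs at 0.5 to get class labels.
y_class = [1 if p > 0.5 else 0 for p in y_hat]

# Precision: share of predicted positives that are truly positive.
true_pos = sum(1 for c, y in zip(y_class, y_test) if c == 1 and y == 1)
pred_pos = sum(y_class)
precision = true_pos / pred_pos

# Accuracy on the thresholded classes.
binary_accuracy = sum(c == y for c, y in zip(y_class, y_test)) / len(y_test)

# Exact-match accuracy compares the raw probabilities with the labels.
exact_accuracy = sum(p == y for p, y in zip(y_hat, y_test)) / len(y_test)

print(precision, binary_accuracy, exact_accuracy)
```

The precision of 0.75 and the exact-match accuracy of 0.0 agree with the evaluation output, while the thresholded accuracy is 4 out of 6.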


In [21]:
# Now get the value of weights and bias from the model as we do in sklearn

weight, bias = model.get_weights()
print(weight, bias)

[[4.8188286]
 [1.1243962]] [-2.3851388]


We can now take it piece by piece and code it from scratch. 

We know we need: 



1.   W1, W2 and bias to initialize
2.   prediction equation 
3.   Sigmoid
4.   Log loss
5.   Total epoch loss
6.   Weight update
7.   Number of epochs

Let's set the functions and put it together. 



In [22]:
# Define the sigmoid function

def sigmoid(X):
    return 1/(1+np.exp(-X))

sigmoid(np.array([12,0,1]))

array([0.99999386, 0.5       , 0.73105858])

In [23]:
# Define the prediction function, reusing the weight and bias learned by the TensorFlow model

def prediction_function(age, last_elections):
    pred_ = weight[0]*age + weight[1]*last_elections + bias
    return sigmoid(pred_)

prediction_function(.47, 1)

array([0.7318652], dtype=float32)
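As a quick sanity check, the same number can be reproduced with plain arithmetic. The weights and bias below are copied from the `model.get_weights()` output, and the inputs (age 0.47, voted last time) correspond to the first test row.

```python
import math

# Weights and bias copied from the model.get_weights() output above.
w_age, w_last, b = 4.8188286, 1.1243962, -2.3851388

y = w_age * 0.47 + w_last * 1 + b   # linear score
z = 1 / (1 + math.exp(-y))          # sigmoid

print(round(y, 4), round(z, 4))
```

The result agrees with both `prediction_function(.47, 1)` and the first entry of `y_hat`, up to float32 rounding.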

In [24]:
# Set log loss

def log_loss(y_true, y_predicted):
    # Clip predictions away from 0 and 1 so that np.log never receives 0
    epsilon = 1e-15
    y_predicted_new = np.clip(np.asarray(y_predicted), epsilon, 1 - epsilon)
    return -np.mean(y_true*np.log(y_predicted_new)+(1-y_true)*np.log(1-y_predicted_new))
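The role of `epsilon` is easiest to see on a single sample. The sketch below is a plain-Python variant written for illustration (the helper names `clip` and `safe_log_loss` are hypothetical, not part of the lab code): a prediction of exactly 0 for a true label of 1 would require log(0), so the prediction is first pushed just inside the interval (0, 1).

```python
import math

epsilon = 1e-15

def clip(p, low=epsilon, high=1 - epsilon):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, low), high)

def safe_log_loss(y, y_hat):
    """Log loss for one sample, with the prediction clipped first."""
    p = clip(y_hat)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident but wrong prediction: true label 1, predicted exactly 0.
# Without clipping, math.log(0) raises ValueError.
worst = safe_log_loss(1, 0.0)
# A confident and correct prediction: the loss is essentially zero.
best = safe_log_loss(1, 1.0)

print(worst, best)
```

Clipping turns an undefined loss into a large but finite penalty, so the epoch average remains a usable number.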

In [25]:
# Wrap it all together. One loop over the epochs; the samples are handled with vectorized operations

def gradient_descent(age, last_elections, y_true, epochs, break_th):
    # Same starting point as the Keras model: weights at one, bias at zero
    w1 = 1
    w2 = 1
    bias = 0
    rate = 0.5  # learning rate
    n = len(age)
    for i in range(epochs):
        # Forward pass
        pred_ = w1 * age + w2 * last_elections + bias
        y_predicted = sigmoid(pred_)
        loss = log_loss(y_true, y_predicted)

        # Partial derivatives of the log loss with respect to each parameter
        w1d = (1/n)*np.dot(np.transpose(age),(y_predicted-y_true))
        w2d = (1/n)*np.dot(np.transpose(last_elections),(y_predicted-y_true))
        bias_d = np.mean(y_predicted-y_true)

        # Weight update
        w1 = w1 - rate * w1d
        w2 = w2 - rate * w2d
        bias = bias - rate * bias_d

        print(f'Epoch:{i}, w1:{w1}, w2:{w2}, bias:{bias}, loss:{loss}')

        # Stop early once the loss falls to the threshold
        if loss <= break_th:
            break

    return w1, w2, bias
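One way to gain confidence in the derivative used for `w1d` is a numerical gradient check: compare the analytic formula with a finite-difference estimate of the loss slope. The sketch below does this in plain Python on the five rows shown by `df.head()` (with age divided by 100). The helper `mean_log_loss` and the step size `h` are assumptions introduced for this check.

```python
import math

def sigmoid(y):
    return 1 / (1 + math.exp(-y))

def mean_log_loss(w1, w2, bias, rows):
    """Average log loss over all rows for the given parameters."""
    total = 0.0
    for age, last, y in rows:
        p = sigmoid(w1 * age + w2 * last + bias)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(rows)

# Rows: (age / 100, last_elections, next_election).
rows = [(0.22, 1, 0), (0.25, 0, 0), (0.47, 1, 1), (0.52, 0, 0), (0.46, 1, 1)]
w1, w2, bias = 1.0, 1.0, 0.0

# Analytic derivative for w1: mean of age * (prediction - label).
analytic = sum(
    age * (sigmoid(w1 * age + w2 * last + bias) - y) for age, last, y in rows
) / len(rows)

# Central finite difference as an independent estimate.
h = 1e-6
numeric = (
    mean_log_loss(w1 + h, w2, bias, rows) - mean_log_loss(w1 - h, w2, bias, rows)
) / (2 * h)

print(analytic, numeric)
```

If the two printed values agree to several decimal places, the update rule is following the true slope of the loss.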

In [27]:
# Let's roll it out and see how well it does compared to the tf model

gradient_descent(X_train['age'],X_train['last_elections'], y_train, 5000, 0.5)

print("TensorFlow model: ", weight, bias)

# We could add epochs or change the learning rate to see if this helps our function. Try it yourself!

Streaming output truncated to the last 5000 lines.
Epoch:2, w1:0.9520571026446174, w2:0.9150039849592233, bias:-0.2260171515128791, loss:0.6440852390199736
Epoch:3, w1:0.9434664684078207, w2:0.8974334014773501, bias:-0.2828494040855703, loss:0.634944402601831
Epoch:4, w1:0.9378472145492064, w2:0.8842411148460922, bias:-0.3322668094324422, loss:0.6282411528158327
Epoch:5, w1:0.9347857552961497, w2:0.8748195337188789, bias:-0.3753018408998168, loss:0.6233240477293662
Epoch:6, w1:0.9339033428406024, w2:0.868599560205655, bias:-0.41289313051660936, loss:0.6196941855981541
Epoch:7, w1:0.9348621075962664, w2:0.8650653743704472, bias:-0.4458734435508208, loss:0.6169803577689964
Epoch:8, w1:0.9373665344288216, w2:0.8637607008604709, bias:-0.4749681623868305, loss:0.6149125053521447
Epoch:9, w1:0.9411619854930613, w2:0.8642892755432325, bias:-0.500800405310719, loss:0.6132975437700648
Epoch:10, w1:0.9460315249807447, w2:0.866311739376513, bias:-0.5238998246352516, loss:0.611999252