# Study of Gradient Descent with Squared Error in Python
This notebook studies the gradient descent technique in Python, using the squared error function.

<br>

### Formula reference

<br> 

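In a simple one-layer neural network (inputs feeding a single activation/output unit), with $h = \sum_i w_i x_i$, $\hat{y} = f(h)$ and error $E = \frac{1}{2}(y-\hat{y})^2$, the gradient of the squared error with respect to $w_i$ is: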

<br>

$$\frac{\partial E}{\partial w_i} = -(y-\hat{y}){f}'(h)x_i$$

<br>

where:

<br>

$${f}'(h) = {\sigma}'(h) = \sigma(h)(1-\sigma(h))$$

<br>

leading to:

<br>

$$\frac{\partial E}{\partial w_i} = -(y-\hat{y})\sigma(h)(1-\sigma(h))x_i$$

<br>
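
As a sanity check on the gradient formula above, the analytic expression can be compared with a finite-difference estimate. The sketch below assumes the error is defined as $E = \frac{1}{2}(y-\hat{y})^2$ and uses illustrative values for `x`, `y` and `w`; the helper names `squared_error`, `analytic_gradient` and `numerical_gradient` are hypothetical and exist only for this check.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def squared_error(w, x, y):
    # Assumed error definition: E = 1/2 * (y - y_hat)^2
    return 0.5 * (y - sigmoid(np.dot(x, w))) ** 2

def analytic_gradient(w, x, y):
    # dE/dw_i = -(y - y_hat) * sigma(h) * (1 - sigma(h)) * x_i
    y_hat = sigmoid(np.dot(x, w))
    return -(y - y_hat) * y_hat * (1 - y_hat) * x

def numerical_gradient(w, x, y, eps=1e-6):
    # Central finite differences, one weight at a time
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (squared_error(w + step, x, y)
                   - squared_error(w - step, x, y)) / (2 * eps)
    return grad

# Illustrative values (assumptions made for this check)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.5
w = np.array([0.5, -0.5, 0.3, 0.1])

grad_analytic = analytic_gradient(w, x, y)
grad_numeric = numerical_gradient(w, x, y)
print(np.allclose(grad_analytic, grad_numeric, atol=1e-7))
```

If the two gradients agree to within the tolerance, the formula is consistent with the assumed error definition.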

The weight update step moves against the gradient, scaled by the learning rate $\alpha$:

<br>

$$\Delta w_i = \alpha(y-\hat{y}){f}'(h)x_i$$

where, for convenience, we define an "error term" $\delta$, given by:

<br>

$$\delta  = (y-\hat{y}){f}'(h)$$

<br>

Giving us the following weight update formula:

<br>

$$w_i=w_i+\alpha\delta x_i$$

### Simple Python implementation
A single input sample, using the sigmoid as the activation function $f(h)$.

In [14]:
import numpy as np

# Sigmoid activation function
def sigmoid(x):
    return 1 / (1+np.exp(-x))

# Derivative of the sigmoid activation function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Simulation of gradient descent

learning_rate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)
w = np.array([0.5, -0.5, 0.3, 0.1])

# Prediction or feedforward
h = np.dot(x, w)

# NN output
output = sigmoid(h)

# Error calculation
error = (y - output)

# Error term
error_term = error * sigmoid_prime(h)

# Weights update
del_w = [learning_rate * error_term * x]

print('Neural Network output:')
print(output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.6899744811276125
Amount of Error:
-0.1899744811276125
Change in Weights:
[array([-0.02031869, -0.04063738, -0.06095608, -0.08127477])]
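
The cell above applies the update rule once. To see the rule actually reduce the error, the same step can be repeated on the same sample. This is a minimal sketch reusing the values from the cell above; the step count `n_steps` and the `errors` list are assumptions added for illustration.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

learning_rate = 0.5
x = np.array([1, 2, 3, 4])
y = 0.5
w = np.array([0.5, -0.5, 0.3, 0.1])

n_steps = 10  # hypothetical number of repeated updates
errors = []
for step in range(n_steps):
    output = sigmoid(np.dot(x, w))
    error = y - output
    errors.append(abs(error))
    # Error term: (y - y_hat) * sigma'(h), with sigma'(h) = output * (1 - output)
    error_term = error * output * (1 - output)
    # Weight update: w_i = w_i + alpha * delta * x_i
    w = w + learning_rate * error_term * x

print('First absolute error:', errors[0])
print('Last absolute error:', errors[-1])
```

With a single sample the weights simply adapt until the output matches the target, which is why several samples and a held-out test set are needed to say anything about generalization.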


### Gradient Descent applied to a Neural Network for multiple weights update - UCLA admissions dataset
The previous example simulated a very small neural network with a single input sample on the input layer, run for only one epoch.

<br>

Now the same idea is scaled up to a simple neural network with one input layer and one activation/output layer, this time using multiple input samples in the training and testing datasets and running for many epochs (1000 by default).

In [21]:
import numpy as np
import pandas as pd
np.random.seed(42)

In [22]:
# Dataset load
data = pd.read_csv('../dataset/ucla-admissions')
# Transform the rank column into one-hot encoded columns
# (dtype=float keeps the feature matrix numeric on recent pandas versions)
rank_one_hot = pd.get_dummies(data['rank'], prefix='rank', dtype=float)
# Add the one-hot encoded columns to the main dataframe and
# drop the old categorical rank column
data = pd.concat([data, rank_one_hot], axis=1)
data = data.drop('rank', axis=1)
# Standardize features to zero mean and unit standard deviation
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data[field] = (data[field] - mean) / std
# Split into training (90%) and test (10%) sets, then into features and targets
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
train_data, test_data = data.loc[sample], data.drop(sample)
x_train = train_data.drop('admit', axis=1)
y_train = train_data['admit']
x_test = test_data.drop('admit', axis=1)
y_test = test_data['admit']
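
To make the preprocessing steps concrete, here is a small sketch that applies the same one-hot encoding and standardization to a toy dataframe. The rows in `toy` are made up for illustration and are not taken from the admissions dataset; only the column names mirror the cell above.

```python
import pandas as pd

# Hypothetical toy rows, not real admissions data
toy = pd.DataFrame({
    'admit': [0, 1, 1, 0],
    'gre': [500.0, 600.0, 700.0, 800.0],
    'gpa': [3.0, 3.2, 3.6, 3.8],
    'rank': [1, 2, 2, 4],
})

# One-hot encode the categorical rank column
rank_one_hot = pd.get_dummies(toy['rank'], prefix='rank', dtype=float)
toy = pd.concat([toy, rank_one_hot], axis=1).drop('rank', axis=1)

# Standardize the continuous features
for field in ['gre', 'gpa']:
    mean, std = toy[field].mean(), toy[field].std()
    toy[field] = (toy[field] - mean) / std

print(toy)
```

Each distinct rank value present in the data becomes its own 0/1 column, and the standardized `gre` and `gpa` columns end up with zero mean and unit sample standard deviation.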

In [23]:
# Sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def training(x_train, y_train, epochs=1000, learning_rate=0.5):
    n_records, n_features = x_train.shape
    last_loss = None
    weights = np.random.normal(scale=1 / n_features**.5, size=n_features)
    
    for e in range(epochs):
        del_w = np.zeros(weights.shape)
        for x, y in zip(x_train.values, y_train):
            # Network output or prediction
            output = sigmoid(np.dot(x, weights))
            # Error from previous prediction
            error = y - output
            # Calculate the error term
            error_term = error * output * (1 - output)
            # The gradient step for the current sample being analyzed
            del_w += error_term * x
        
        # After analyzing all samples, update the weights using the
        # gradient step averaged over the samples
        weights += learning_rate * (del_w/n_records)

        # Print the mean squared error on the training set
        if e % (epochs / 10) == 0:
            out = sigmoid(np.dot(x_train, weights))
            loss = np.mean((out - y_train) ** 2)
            if last_loss is not None and last_loss < loss:
                print("Train loss: ", loss, "  WARNING - Loss Increasing")
            else:
                print("Train loss: ", loss)
            last_loss = loss
    
    return weights
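
The inner loop over samples accumulates one gradient step per record. As a sketch of an alternative (not the method used in this notebook), the same per-epoch sum can be computed with matrix operations. The data below is random and purely illustrative, and the names `X`, `del_w_loop` and `del_w_vec` are assumptions made for this comparison.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical random data: 5 records, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0, 1, 1, 0, 1])
weights = rng.normal(scale=1 / 3 ** 0.5, size=3)

# Per-sample loop, as in the training function above
del_w_loop = np.zeros(weights.shape)
for x_i, y_i in zip(X, y):
    output = sigmoid(np.dot(x_i, weights))
    error_term = (y_i - output) * output * (1 - output)
    del_w_loop += error_term * x_i

# Vectorized equivalent: one output and one error term per record,
# then a weighted sum of the rows of X
outputs = sigmoid(np.dot(X, weights))
error_terms = (y - outputs) * outputs * (1 - outputs)
del_w_vec = np.dot(error_terms, X)

print(np.allclose(del_w_loop, del_w_vec))
```

The vectorized form avoids the Python-level loop, which usually matters as the number of records grows.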

In [25]:
# Train model
optimized_weights = training(x_train, y_train)

# Calculate accuracy on test data
test_out = sigmoid(np.dot(x_test, optimized_weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == y_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

Train loss:  0.27399399505027644
Train loss:  0.21258469351613782
Train loss:  0.2015285048367407
Train loss:  0.19878389269310112
Train loss:  0.19782937236263026
Train loss:  0.19742326198825286
Train loss:  0.19722617724052421
Train loss:  0.19712124179879126
Train loss:  0.19706147612351815
Train loss:  0.19702571727085305
Prediction accuracy: 0.725
