# Implementing gradient descent
Okay, now we know how to update our weights: Δw_i = η δ x_i

You've seen how to implement that for a single update, but how do we translate that code to calculate many weight updates so our network will learn? As an example, I'm going to have you use gradient descent to train a network on graduate school admissions data (found at http://www.ats.ucla.edu/stat/data/binary.csv). This dataset has three input features: GRE score, GPA, and the rank of the undergraduate school (numbered 1 through 4). Institutions with rank 1 have the highest prestige; those with rank 4 have the lowest.

The goal here is to predict if a student will be admitted to a graduate program based on these features. For this, we'll use a network with one output layer with one unit. We'll use a sigmoid function for the output unit activation.

# Algorithm

1. Set the weight step to zero: Δw_i = 0

2. For each record in the training data:
    - Make a forward pass through the network, calculating the output ŷ = f(Σ_i w_i x_i)
    - Calculate the error term for the output unit, δ = (y − ŷ) * f'(Σ_i w_i x_i)
    - Update the weight step: Δw_i = Δw_i + δ x_i

3. Update the weights: w_i = w_i + η * Δw_i / m, where η is the learning rate and m is the number of records. Here we're averaging the weight steps to help reduce any large variations in the training data.

4. Repeat for e epochs.

You can also update the weights on each record instead of averaging the weight steps after going through all the records.
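
The per-record variant can be sketched as follows. This is a minimal sketch, not the implementation used later in this notebook: the function name `train_per_record`, the toy data, and the default `learnrate` and `epochs` values are all made up for illustration.

```python
import numpy as np

def sigmoid(h):
    """Sigmoid activation."""
    return 1 / (1 + np.exp(-h))

def train_per_record(features, targets, weights, learnrate=0.5, epochs=100):
    """Update the weights after every record instead of once per epoch."""
    weights = weights.copy()
    for _ in range(epochs):
        for x, y in zip(features, targets):
            # Forward pass for a single record
            output = sigmoid(np.dot(x, weights))
            # Error term: (y - y_hat) * f'(h), using f'(h) = output * (1 - output)
            error_term = (y - output) * output * (1 - output)
            # Apply the step immediately, with no averaging over records
            weights += learnrate * error_term * x
    return weights

# Toy data (hypothetical): the target is 1 when the first feature is positive
toy_features = np.array([[1.0, 0.5], [-1.0, 0.2], [0.8, -0.3], [-0.7, -0.6]])
toy_targets = np.array([1, 0, 1, 0])

toy_weights = train_per_record(toy_features, toy_targets, np.zeros(2))
toy_predictions = sigmoid(toy_features @ toy_weights) > 0.5
print(toy_weights, toy_predictions)
```

Updating per record gives noisier steps than the averaged update, so it may need a smaller learning rate on real data.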


# Data cleanup
## Get Dummies

You might think there will be three input units, but we actually need to transform the data first. The rank feature is categorical: the numbers don't encode any sort of relative values. Rank 2 is not twice as much as rank 1, and rank 3 is not 1.5 times rank 2. Instead, we need to use dummy variables to encode rank, splitting the data into four new columns encoded with ones or zeros. Rows with rank 1 have a one in the rank 1 dummy column and zeros in all other columns. Rows with rank 2 have a one in the rank 2 dummy column and zeros in all other columns. And so on.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
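
To make the encoding concrete, here is a small sketch on a made-up table. The rows and the name `toy` are hypothetical and not taken from the admissions file; `dtype=int` is passed only so the dummy columns print as ones and zeros.

```python
import pandas as pd

# Made-up rows for illustration; the real dataset has many more records
toy = pd.DataFrame({'gre': [380, 660, 800, 640], 'rank': [3, 3, 1, 4]})

# One new column per rank value that appears in the data
dummies = pd.get_dummies(toy['rank'], prefix='rank', dtype=int)

# Replace the original rank column with the dummy columns
encoded = pd.concat([toy.drop('rank', axis=1), dummies], axis=1)
print(encoded)
```

Note that `get_dummies` only creates columns for values that actually occur, so this toy table has no `rank_2` column. The full dataset contains all four ranks and therefore yields four dummy columns.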

## Standardize the GRE and GPA 
We'll also need to standardize the GRE and GPA data, which means scaling the values so that they have zero mean and a standard deviation of 1. This is necessary because the sigmoid function squashes very negative and very positive inputs. The gradient of the sigmoid approaches zero for such inputs, which means that the gradient descent step will go to zero too. Since the raw GRE and GPA values are fairly large, we would have to be really careful about how we initialize the weights, or the gradient descent steps will die off and the network won't train. Instead, if we standardize the data, we can initialize the weights easily and everyone is happy.
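
The effect on the gradient can be checked numerically. The sketch below uses made-up GRE-like scores and an arbitrary weight of 0.1, both assumptions for illustration. It also uses NumPy's `std`, which divides by n, whereas the pandas `std` used later in this notebook divides by n − 1; the difference is negligible for this purpose.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def sigmoid_prime(h):
    """Derivative of the sigmoid, f'(h) = f(h) * (1 - f(h))."""
    return sigmoid(h) * (1 - sigmoid(h))

# Made-up raw scores on the GRE scale (hypothetical values)
gre = np.array([380.0, 660.0, 800.0, 640.0, 520.0])

# Standardize: subtract the mean, divide by the standard deviation
gre_std = (gre - gre.mean()) / gre.std()

weight = 0.1  # arbitrary small weight, chosen for illustration
raw_grad = sigmoid_prime(weight * gre)      # inputs of 38 and up: gradient vanishes
std_grad = sigmoid_prime(weight * gre_std)  # inputs near 0: gradient stays near 0.25

print("raw:", raw_grad)
print("standardized:", std_grad)
```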

Now that the data is ready, we see that there are six input features: gre, gpa, and the four rank dummy variables.
    
# Mean Square Error
We're going to make a small change to how we calculate the error here. Instead of the SSE, we're going to use the mean of the square errors (MSE). Now that we're using a lot of data, summing up all the weight steps can lead to really large updates that make the gradient descent diverge. To compensate for this, you'd need to use a quite small learning rate. Instead, we can just divide by the number of records in our data, m, to take the average. This way, no matter how much data we use, our learning rates will typically be in the range of 0.001 to 0.01. Then, we can use the MSE (shown below) to calculate the gradient and the result is the same as before, just averaged instead of summed.

https://classroom.udacity.com/nanodegrees/nd101/parts/94643112-2cab-46f8-a5be-1b6e4fa7a211/modules/89a1ec1d-4c22-4a77-b230-b0da99240c89/lessons/07f472eb-0210-446f-8ec2-d297b06c86d0/concepts/4b167ce0-9d45-45e1-bfe6-891b2c68ac94
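
One way to write this out, assuming the conventional factor of 1/2 in the error (an assumption here, chosen so that it cancels when differentiating), with μ indexing the records and h^μ = Σ_i w_i x_i^μ:

```latex
E = \frac{1}{2m} \sum_{\mu} \left( y^{\mu} - \hat{y}^{\mu} \right)^2

\frac{\partial E}{\partial w_i}
  = -\frac{1}{m} \sum_{\mu} \left( y^{\mu} - \hat{y}^{\mu} \right) f'(h^{\mu}) \, x_i^{\mu}
  = -\frac{1}{m} \sum_{\mu} \delta^{\mu} x_i^{\mu}

w_i \leftarrow w_i - \eta \, \frac{\partial E}{\partial w_i}
  = w_i + \frac{\eta}{m} \sum_{\mu} \delta^{\mu} x_i^{\mu}
```

The last line matches step 3 of the algorithm: the summed weight step Σ_μ δ^μ x_i^μ is divided by m before being scaled by the learning rate.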

# Implementing with NumPy

For the most part, this is pretty straightforward with NumPy.

First, you'll need to initialize the weights. We want these to be small such that the input to the sigmoid is in the linear region near 0 and not squashed at the high and low ends. It's also important to initialize them randomly so that they all have different starting values and diverge, breaking symmetry. So, we'll initialize the weights from a normal distribution centered at 0. A good value for the scale is 1/√n, where n is the number of input units. This keeps the input to the sigmoid low for increasing numbers of input units.
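
A quick numerical check of why the 1/√n scale helps is sketched below. The helper `preactivation_std`, the sample size of 1000, and the seed are assumptions for illustration; the inputs are drawn from a standard normal to mimic standardized features.

```python
import numpy as np

def preactivation_std(n_inputs, scale, seed=0):
    """Spread of the sigmoid input h = x . w over random standardized records."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(1000, n_inputs))       # standardized inputs
    w = rng.normal(scale=scale, size=n_inputs)  # weights centered at 0
    return np.std(x @ w)

for n in (6, 600):
    scaled = preactivation_std(n, 1 / np.sqrt(n))
    unscaled = preactivation_std(n, 1.0)
    print(n, "inputs: scaled", scaled, "unscaled", unscaled)
```

With the 1/√n scale the spread of h stays around 1 as the number of inputs grows, while a fixed scale of 1 lets it grow roughly like √n and pushes the sigmoid toward its flat regions.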


In [None]:
import numpy as np
import pandas as pd

# Reading from the csv file
admissions = pd.read_csv('binary.csv')

# Make dummy variables for rank
# The prefix for all of them is 'rank'; dtype=float keeps every column numeric
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank', dtype=float)], axis=1)
# Remove the original rank column
data = data.drop('rank', axis=1)


# Standardize features
for field in ['gre', 'gpa']:

    # Getting the mean and standard deviation
    mean, std = data[field].mean(), data[field].std()

    # Standardizing the current feature, be it GRE or GPA
    # (assigning the whole column lets an integer column become float)
    data[field] = (data[field] - mean) / std
    
# Split off a random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
# .ix was removed from pandas; .loc selects rows by index label
data, test_data = data.loc[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

In [None]:

# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape

last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.491

for e in range(epochs):
    
    # The accumulated weight step for this epoch
    del_w = np.zeros(weights.shape)

    # Loop through all records, x is the input, y is the target
    for x, y in zip(features.values, targets):

        # Activation of the output unit
        #   Notice we multiply the inputs and the weights here
        #   rather than storing h as a separate variable
        output = sigmoid(np.dot(x, weights))

        # The error, the target minus the network output
        error = y - output

        # The error term
        #   Notice we calculate f'(h) here from the stored output instead of
        #   defining a separate function for the sigmoid derivative
        error_term = error * output * (1 - output)

        # The gradient descent step, the error term times the inputs
        del_w += error_term * x

    # Update the weights here. The learning rate times the
    # change in weights, divided by the number of records to average
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
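
The inner loop over records above can also be expressed as a single matrix operation. The following is a sketch under assumptions: the function names `batch_step_loop` and `batch_step_vectorized` are hypothetical, and random toy data stands in for the admissions features so the block runs on its own. It checks that one vectorized update gives the same weights as the per-record accumulation.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def batch_step_loop(X, y, weights, learnrate):
    """One epoch of the averaged update, accumulating record by record."""
    del_w = np.zeros(weights.shape)
    for x_i, y_i in zip(X, y):
        output = sigmoid(np.dot(x_i, weights))
        del_w += (y_i - output) * output * (1 - output) * x_i
    return weights + learnrate * del_w / len(X)

def batch_step_vectorized(X, y, weights, learnrate):
    """The same update computed for all records at once."""
    output = sigmoid(X @ weights)                      # shape (m,)
    error_term = (y - output) * output * (1 - output)  # shape (m,)
    # error_term @ X sums error_term[k] * X[k] over all records k
    return weights + learnrate * (error_term @ X) / len(X)

# Random toy data (hypothetical): 20 records, 6 features, binary targets
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = rng.integers(0, 2, size=20)
w0 = rng.normal(scale=1 / 6**0.5, size=6)

w_loop = batch_step_loop(X, y, w0, learnrate=0.5)
w_vec = batch_step_vectorized(X, y, w0, learnrate=0.5)
print(np.allclose(w_loop, w_vec))
```

The vectorized form avoids the Python-level loop, which usually matters more as the number of records grows.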