## Implementing Gradient Descent
- Use gradient descent to train a network on graduate school admissions data
- File path "/home/isaac/UdacityDL/00_Prep/binary.csv"

### Data Cleanup
- we actually need to transform the data first. The rank feature is categorical, the numbers don't encode any sort of relative values. Rank 2 is not twice as much as rank 1, rank 3 is not 1.5 more than rank 2. Instead, we need to use dummy variables to encode rank, splitting the data into four new columns encoded with ones or zeros.
- We'll also need to standardize the GRE and GPA data, which means to scale the values such they have zero mean and a standard deviation of 1. This is necessary because the sigmoid function squashes really small and really large inputs. The gradient of really small and large inputs is zero, which means that the gradient descent step will go to zero too. 

### Algorithm Explained
- Set the weight step to zero: $\Delta w_{i} = 0$
- For each record in the training data
    1. Make a forward pass through the network, calculating the output $\hat y = f(\sum_{i}w_{i}x_{i})$
    2. Calculate the error gradient in the output unit, $\delta = (y - \hat y) \times f'(\sum_{i}w_{i}x_{i})$
    3. Update the weight step $\Delta w_{i} = w_{i} + \delta x_{i}$
- Update the weights $ \large w_{i} = w_{i} + \frac {\eta \Delta w_{i}} {m}, \hspace{2mm}$ where $\eta$ is the learning rate and m is the number of records
- Repeat for $e$ epochs

#### Notes:
- Use Sigmoid for the activation function: $ \large f(h) = \frac {1}{1 + e^{-x}}$
- Gradient of the Sigmoid is $f'(h) = f(h) (1 - f(h))$
- $h$ is the input to the output unit: $h = \sum_{i}w_{i}x_{i}$

In [1]:
# Import Numpy
import numpy as np

1. Need to initialize the weights
2. Weights should be small such that the input to the sigmoid is in the linear region near 0 and not squashed at the high and low ends
3. It's also important to initialize them randomly so that they all have different starting values and diverge, breaking symmetry
4. So, we'll initialize the weights from a normal distribution centered at 0. A good value for the scale is $\large \frac{1}{\sqrt n}$ where $n$ is the number of input units. This keeps input to the sigmoid low for increasing numbers of input units

Initialize weights
weights = np.random.normal(scale = 1/n_features ** (-0.5), size = n_features)

In [4]:
# Data Preparation
import numpy as np
import pandas as pd

admissions = pd.read_csv('binary.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

# Standarize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:,field] = (data[field]-mean)/std
    
# Split off random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.ix[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

In [24]:
# Define Sigmoid Function
def sigmoid(x):
    return (1 / (1 + np.exp(-x)))

In [39]:
# Run Steps for Gradient Descent
def run():
    #set seed to generate same results
    np.random.seed(42)
    n_records, n_features = features.shape
    last_loss = None
    
    # Initialize weights
    weights = np.random.normal(scale = 1/n_features ** (-0.5), size = n_features)
    
    # Set up NN hyperparameters
    epochs = 1000
    learnrate = 0.5
    
    for e in range(epochs):
        del_w = np.zeros(len(weights))
        for x,y in zip(features.values, targets):
            ## 1. Calculate the output
            nn_output = sigmoid(np.dot(weights, x))
            ## 2. Calculate the error & Gradient Error
            nn_error = y - nn_output
            nn_grd_error = nn_error * nn_output * (1 - nn_output)
            ## 3. Calculate the weight change
            del_w += nn_grd_error * x
        weights += learnrate * del_w / n_records
        
        #Printing the MSE on the training set
        if e % (epochs / 10) == 0:
            out = sigmoid(np.dot(features, weights))
            loss = np.mean((out - targets) ** 2)
            if last_loss and last_loss < loss:
                print("Training Loss: ", loss, "Warning - loss increasing")
            else:
                print("Training Loss: ", loss)
            last_loss = loss
    
    #Calculate accuracy on test data
    test_out = sigmoid(np.dot(features_test, weights))
    prediction = test_out > 0.5
    accuracy = np.mean(prediction == targets_test)
    print("Prediction Accuracy: {a}".format(a = accuracy))

In [40]:
#Write the main function
if __name__ == "__main__":
    run()

Training Loss:  0.368439541983512
Training Loss:  0.349170785218876
Training Loss:  0.33217440574485835
Training Loss:  0.30603721927356853
Training Loss:  0.25612134085589033
Training Loss:  0.21026563561120573
Training Loss:  0.19921050410028213
Training Loss:  0.1974844717751469
Training Loss:  0.1971205928346378
Training Loss:  0.19702125482094843
Prediction Accuracy: 0.725
