## Building the neural network from scratch
In this notebook we will build a simple neural network from scratch. The dataset is taken from http://www.ats.ucla.edu/stat/data/binary.csv This dataset has three input features: GRE score, GPA, and the rank of the undergraduate school (numbered 1 through 4). Institutions with rank 1 have the highest prestige, those with rank 4 have the lowest.

The goal here is to predict if a student will be admitted to a graduate program based on these features. For this, we'll use a network with one output layer with one unit. We'll use a sigmoid function for the output unit activation.

In [1]:
#importing the libraries
import numpy as np
import pandas as pd

In [5]:
#reading the data
dataset = pd.read_csv('binary.csv')
#printing the first 5 values
dataset.head()

Unnamed: 0,admit,gre,gpa,rank
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


### Encoding the categorical variable
The rank feature is categorical, the numbers don't encode any sort of relative valuesRank 2 is not twice as much as rank 1, rank 3 is not 1.5 more than rank 2. Hence we need to split the data into four new columns

In [6]:
dataset= pd.get_dummies(dataset, prefix='rank', columns=['rank'], drop_first=True)
dataset.head()

Unnamed: 0,admit,gre,gpa,rank_2,rank_3,rank_4
0,0,380,3.61,0,1,0
1,1,660,3.67,0,1,0
2,1,800,4.0,0,0,0
3,1,640,3.19,0,0,1
4,0,520,2.93,0,0,1


### Scaling the values
We'll also need to standardize the GRE and GPA data, which means to scale the values such that they have zero mean and a standard deviation of 1. This is necessary because the sigmoid function squashes really small and really large inputs. The gradient of really small and large inputs is zero, which means that the gradient descent step will go to zero too. Since the GRE and GPA values are fairly large, we have to be really careful about how we initialize the weights or the gradient descent steps will die off and the network won't train. Instead, if we standardize the data, we can initialize the weights easily and everyone is happy.

In [7]:
for x in ['gre', 'gpa']:
    mean, std = dataset[x].mean(), dataset[x].std()
    dataset[x] = (dataset[x]-mean)/std

dataset.head()

Unnamed: 0,admit,gre,gpa,rank_2,rank_3,rank_4
0,0,-1.798011,0.578348,0,1,0
1,1,0.625884,0.736008,0,1,0
2,1,1.837832,1.603135,0,0,0
3,1,0.452749,-0.525269,0,0,1
4,0,-0.586063,-1.208461,0,0,1


### Splitting the dataset
First we split the dataset into training and testing data. We take 90% of the data for training and 10% of the data for testing. Next we split the training and testing data into features and targets

In [20]:
np.random.seed(0)
sample = np.random.choice(dataset.index, size=int(len(dataset)*0.9), replace=False)

#splitting into training and testing set
data_train, data_test = dataset.iloc[sample], dataset.drop(sample)
print(data_train.head())

#splitting into features and targets
features_train, targets_train = data_train.drop('admit', axis=1), data_train['admit']
features_test, targets_test = data_test.drop('admit', axis=1), data_test['admit']

     admit       gre       gpa  rank_2  rank_3  rank_4
132      0 -0.066657  0.026539       1       0       0
309      0 -1.278605 -1.077078       0       1       0
341      1 -0.239793 -1.944205       0       1       0
196      0  0.625884 -0.840588       0       1       0
246      0  0.799020 -0.131120       1       0       0


In [14]:
#defining the utility functions
def sigmoid(x):
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x)*(1 - sigmoid(x))

In [42]:
n_records, n_features = features_train.shape
#initialising the weights
weights = np.random.normal(scale = 1/n_features**-0.5, size=n_features)

In [43]:
epochs = 4000
learn_rate=0.1
print_every = 1000
#training
for e in range(epochs):
    del_w = 0
    for x, y in zip(features_train.values, targets_train):
        
        #calculating the output x*w
        output = sigmoid(np.dot(x, weights))
        
        #calculating the error
        error = y - output
        
        #calculating the error_term which is (y-output)*sigmoid_prime(output)
        error_term = error*sigmoid_prime(output)
        
        #adding to the change in weights
        del_w += error_term*x
    
    weights += learn_rate*del_w/n_records
    
    
    if e%print_every==0:
        #printing the loss on training data for every 1000 epochs
        output = sigmoid(np.dot(features_train, weights))
        training_loss = np.mean((output - targets_train)**2)
        
        print('Epoch: {}/{} -- Training Loss: {}'.format(e+1, epochs, training_loss))
        

Epoch: 1/4000 -- Training Loss: 0.21593654103673804
Epoch: 1001/4000 -- Training Loss: 0.19814690720794337
Epoch: 2001/4000 -- Training Loss: 0.19629137631492913
Epoch: 3001/4000 -- Training Loss: 0.19541598171972419


In [46]:
#testing on test data
output = sigmoid(np.dot(features_test, weights))
#we take all the probabilities whose value is more than 50% as 1
pred = output >= 0.5
accuracy = np.mean((targets_test==pred))
print('Accuracy on test set: ', accuracy)

Accuracy on test set:  0.75
