<a href="https://colab.research.google.com/github/TobiasSunderdiek/my_udacity_deep_learning_solutions/blob/master/intro-neural-networks/student_admissions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting student admissions


This notebook is based on the Udacity deep learning nanodegree exercise for gradient descent, which can be found here:

https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-neural-networks/student-admissions/StudentAdmissions.ipynb

The original version is implemented in Python with NumPy. As an exercise to learn Swift, I try to implement it in Swift only.

In addition to the implementation, I derive the formula for updating the weights to better understand the underlying calculus.

## Math - backpropagation with mean-squared-error as error function within perceptron

The underlying single-layer perceptron [1] of this notebook can be shown as follows:


<p><a href="https://commons.wikimedia.org/wiki/File:Perceptron.svg#/media/File:Perceptron.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/3/31/Perceptron.svg" alt="Perceptron.svg" height="353" width="440"></a><br>By <a href="https://en.wikipedia.org/wiki/User:Mat_the_w" class="extiw" title="wikipedia:User:Mat the w">Mat the w</a> at <a href="https://en.wikipedia.org/wiki/" class="extiw" title="wikipedia:">English Wikipedia</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=23766733">Link</a></p>

In our case, we have the inputs $greScaled$, $gpaScaled$ and $encodedRank$ and their weights. Due to the one-hot encoding of rank, the input size is 7 ($w_1 x_1 + w_2 x_2 + \dots + w_7 x_7$), so $n=7$. We add a bias $b$, and $o$ is the output ($= \hat{y}$): $$\hat{y} = f(w_1 x_1 + w_2 x_2 + \dots + w_7 x_7 + b)$$
The function $f$ is our activation function, which is the sigmoid: $$\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + \dots + w_7 x_7 + b)$$

#### Updating the weights

Generally, gradient descent[2] describes the change in each weight for multilayer perceptrons as:

$$\Delta w_{ji} (n) = -\eta\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} y_i(n)$$

$n$ is the index of the training example (not the input size from above), and $j$ is the output node; in this case we have a single-layer perceptron with one node

$v_j(n)$ is the weighted sum of the input connections of node $j$, in this case $w_1 x_1 + w_2 x_2 + \dots + w_7 x_7 + b$, including the bias

$y_i(n)$ is the output of the previous node $i$; in this case there is no previous node, so the inputs $x_i$ of the perceptron take its place

$\mathcal{E}(n)$ is the error function, in this case the squared error of one example, $\mathcal{E}(n)=\frac{1}{2}\sum_j (y-\hat{y})_j^2(n)$, summed over the output nodes $j$; averaging it over all examples gives the mean squared error

$\eta$ is the learning rate

Given the information above, with one output node ($j=1$) and the inputs $x_i$ in place of $y_i(n)$:

$$\Delta w_{i} (n) = -\eta\frac{\partial (\frac{1}{2} (y-\hat{y})^2(n))}{\partial v_j} x_i$$


How much the error changes with $v_j$ is given by the partial derivative of the loss function with respect to $v_j$:

$$\frac{\partial (\frac{1}{2} (y-\hat{y})^2)}{\partial v_j}$$


As $\hat{y} = \sigma(w_1 x_1 + w_2 x_2... + b)$, and $v_j = w_1 x_1 + w_2 x_2... + b$, we have $\hat{y} = \sigma(v_j)$

$$\frac{\partial (\frac{1}{2} (y-\sigma(v_j))^2)}{\partial v_j}$$

Now, we use the chain rule $\frac{\partial}{\partial z}p(q(z)) = \frac{\partial p}{\partial q} \frac{\partial q}{\partial z}$ with $z = v_j$, where $q(z) = y-\sigma(v_j)$ and $p=\frac{1}{2}q^2$:

$$\begin{aligned}
\frac{\partial \frac{1}{2}q^2}{\partial q} \frac{\partial (y-\sigma(v_j))}{\partial v_j} &= q\frac{\partial }{\partial v_j}(y-\sigma(v_j)) \\
&= q\,(0 - \sigma(v_j)(1-\sigma(v_j))) \\
&= (y-\sigma(v_j))(-\sigma(v_j)(1-\sigma(v_j))) \\
&= -(y-\hat{y})\,\sigma^\prime(v_j)
\end{aligned}$$
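
The result of the chain rule can be checked numerically. The following is a minimal sketch in plain Swift (no TensorFlow), assuming a hypothetical target `y` and weighted sum `v`; the function names are only illustrative and not part of the notebook. It compares the analytic expression $-(y-\sigma(v))\sigma^\prime(v)$ with a central finite difference of $\frac{1}{2}(y-\sigma(v))^2$.

```swift
import Foundation

// Logistic sigmoid and its derivative.
func sigmoid(_ v: Double) -> Double {
  return 1 / (1 + exp(-v))
}
func sigmoidPrime(_ v: Double) -> Double {
  return sigmoid(v) * (1 - sigmoid(v))
}

// Error of a single example as a function of the weighted sum v.
func halfSquaredError(_ y: Double, _ v: Double) -> Double {
  let diff = y - sigmoid(v)
  return 0.5 * diff * diff
}

// Analytic result of the chain rule: -(y - sigma(v)) * sigma'(v).
func analyticGradient(_ y: Double, _ v: Double) -> Double {
  return -(y - sigmoid(v)) * sigmoidPrime(v)
}

// Central finite difference as an independent numerical estimate.
func numericGradient(_ y: Double, _ v: Double, h: Double = 1e-6) -> Double {
  return (halfSquaredError(y, v + h) - halfSquaredError(y, v - h)) / (2 * h)
}

let y = 1.0  // hypothetical target
let v = 0.3  // hypothetical weighted sum
print(analyticGradient(y, v), numericGradient(y, v))
```

Both printed values should agree up to the accuracy of the finite difference.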

Putting this back in, the two minus signs cancel:

$$\Delta w_{i} (n) = \eta(y-\hat{y})\sigma^\prime(v_j) x_i$$

This is the derivation of the formula for the weight update; the same holds for the bias:

 $$w_i \longrightarrow w_i + \eta(y-\hat{y})\sigma^\prime(v_j) x_i$$
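
Why the input appears as a factor, and why the bias follows the same rule, can be sketched with one more application of the chain rule. This assumes, as above, that $v_j$ is the weighted sum $w_1 x_1 + \dots + w_7 x_7 + b$ of the single node, so that $\frac{\partial v_j}{\partial w_i} = x_i$ and $\frac{\partial v_j}{\partial b} = 1$:

$$\frac{\partial \mathcal{E}}{\partial w_i} = \frac{\partial \mathcal{E}}{\partial v_j}\frac{\partial v_j}{\partial w_i} = -(y-\hat{y})\,\sigma^\prime(v_j)\,x_i, \qquad \frac{\partial \mathcal{E}}{\partial b} = \frac{\partial \mathcal{E}}{\partial v_j}\frac{\partial v_j}{\partial b} = -(y-\hat{y})\,\sigma^\prime(v_j)$$

Stepping against the gradient with learning rate $\eta$ then gives the bias update:

$$b \longrightarrow b + \eta(y-\hat{y})\,\sigma^\prime(v_j)$$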

[1] https://en.wikipedia.org/wiki/Perceptron

[2] https://en.wikipedia.org/wiki/Multilayer_perceptron

## Loading dataset from github
The original dataset is located here:

https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-neural-networks/student-admissions/student_data.csv

which originally came from: http://www.ats.ucla.edu/

In [0]:
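import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives here on Linux with newer Swift toolchains
#endif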

let url = "https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-neural-networks/student-admissions/student_data.csv"

// author of this query function: https://gist.github.com/groz/85b95f663f79ba17946269ea65c2c0f4
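// Fetches the content of `address` synchronously and returns it as a string.
func query(address: String) -> String {
    let url = URL(string: address)
    // The semaphore blocks the caller until the asynchronous request has finished.
    let semaphore = DispatchSemaphore(value: 0)
    
    var result: String = ""
    
    let task = URLSession.shared.dataTask(with: url!) {(data, response, error) in
        // Force-unwrapping crashes if the request fails or the body is not UTF-8.
        result = String(data: data!, encoding: String.Encoding.utf8)!
        semaphore.signal()
    }
    
    task.resume()
    semaphore.wait()
    return result
}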

let rawData = query(address: url)

### Convert data
- Make data an array of Tensors

- Drop first row with column header


|admit|	gre|	gpa|	rank|
|-|-|-|-|
|0	|380	|3.61	|3|
|1	|660	|3.67	|3|
|1	|800	|4.00	|1|
|1	|640	|3.19	|4|
|0	|520	|2.93	|4|

- One-hot encode the numerical values of the column *rank*, which contains the values 1 to 4:

> 1 -> [0.0, 1.0, 0.0, 0.0, 0.0]

> 3 -> [0.0, 0.0, 0.0, 1.0, 0.0]

> 4 -> [0.0, 0.0, 0.0, 0.0, 1.0]

|admit|gre|	gpa|	encodedRank|
|-|-|-|-|
|0	|380	|3.61	|0.0 0.0 0.0 1.0 0.0|
|1	|660	|3.67	|0.0 0.0 0.0 1.0 0.0|
|1	|800	|4.00	|0.0 1.0 0.0 0.0 0.0|
|1	|640	|3.19	|0.0 0.0 0.0 0.0 1.0|
|0	|520	|2.93	|0.0 0.0 0.0 0.0 1.0|

> **Difference from the pandas `get_dummies` function used in the original solution: `oneHotAtIndices` starts counting indices at 0. Since the rank is used directly as the index, this results in 5 instead of 4 columns for the encoded rank, and the first column is always 0.**

- Scale the data

> *gre* has values between 200 and 800 and *gpa* has values between 1.0 and 4.0, so the features are on very different scales and have to be normalized.

> Fit the features into the range 0 to 1 by dividing *gre* by 800 and *gpa* by 4

|admit|greScaled|	gpaScaled|	encodedRank|
|-|-|-|-|
|0	|0.475	|0.9025	|0.0 0.0 0.0 1.0 0.0|
|1	|0.825	|0.9175	|0.0 0.0 0.0 1.0 0.0|
|1	|1.0	|1.0	|0.0 1.0 0.0 0.0 0.0|
|1	|0.8	|0.7975	|0.0 0.0 0.0 0.0 1.0|
|0	|0.65	|0.7325|0.0 0.0 0.0 0.0 1.0|
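
The conversion steps above can be sketched for a single row in plain Swift, without tensors. The helper names `oneHot` and `encodeRow` are only illustrative and not part of the notebook; the sketch assumes, as described above, that the rank is used directly as the index of a one-hot vector of depth 5.

```swift
// One-hot encode `index` into a vector of length `depth`.
func oneHot(_ index: Int, depth: Int) -> [Double] {
  var encoded = [Double](repeating: 0.0, count: depth)
  encoded[index] = 1.0
  return encoded
}

// Turn one raw row (gre, gpa, rank) into the 7 input features.
func encodeRow(gre: Double, gpa: Double, rank: Int) -> [Double] {
  let greScaled = gre / 800  // gre is at most 800
  let gpaScaled = gpa / 4    // gpa is at most 4.0
  // Depth 5 because the rank (1 to 4) is the index, so index 0 stays unused.
  return [greScaled, gpaScaled] + oneHot(rank, depth: 5)
}

// First row of the table: gre 380, gpa 3.61, rank 3.
print(encodeRow(gre: 380, gpa: 3.61, rank: 3))
```

The result has 7 entries, which matches the input size used in the math section.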

In [0]:
import TensorFlow

let rows = rawData.components(separatedBy: "\n")
let featuresAndTargetsAsString = rows.dropFirst().map({ $0.components(separatedBy: ",") }).filter {$0[0] != ""}              
var data = [Tensor<Double>]()
for featureWithTarget in featuresAndTargetsAsString {
  let admit = Double(featureWithTarget[0])!
  let gre = Double(featureWithTarget[1])!
  let gpa = Double(featureWithTarget[2])!
  let rank = Int32(featureWithTarget[3])!
  
  let greScaled = gre/800
  let gpaScaled = gpa/4
  let encodedRank = Tensor<Double>(oneHotAtIndices: Tensor<Int32>(rank), depth:5)
  
  let admitGreGpa = Tensor<Double>([admit, greScaled, gpaScaled])
  let feature = admitGreGpa.concatenated(with: encodedRank)
  data.append(feature)
}

### Split into training and test set
The test set is 10% of the total size.

In [0]:
let dataShuffled = data.shuffled()
let ninetyPercentCount = data.count * 90 / 100

let dataTrain = dataShuffled.prefix(upTo: ninetyPercentCount)
let dataTest = dataShuffled.suffix(from: ninetyPercentCount)

let featuresTrain = dataTrain.map { $0.slice(lowerBounds: [1], upperBounds: [8]) }
let targetsTrain = dataTrain.map { $0.slice(lowerBounds: [0], upperBounds: [1]).scalars[0] }

let featuresTest = dataTest.map { $0.slice(lowerBounds: [1], upperBounds: [8]) }
let targetsTest = dataTest.map { $0.slice(lowerBounds: [0], upperBounds: [1]) }
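
The split arithmetic can be tried in isolation. This is a minimal sketch in plain Swift with a hypothetical 400-element array in place of the tensors; `trainTestSplit` and its `trainPercent` parameter are illustrative names, the cell above hard-codes 90.

```swift
// Split `items` into a training part and a test part after shuffling.
func trainTestSplit<T>(_ items: [T], trainPercent: Int) -> (train: [T], test: [T]) {
  let shuffled = items.shuffled()
  let trainCount = items.count * trainPercent / 100  // integer division rounds down
  let trainPart = Array(shuffled.prefix(trainCount))
  let testPart = Array(shuffled.suffix(from: trainCount))
  return (trainPart, testPart)
}

let (trainPart, testPart) = trainTestSplit(Array(0..<400), trainPercent: 90)
print(trainPart.count, testPart.count)  // prints "360 40"
```

Because the array is shuffled once and then cut at a single index, every element ends up in exactly one of the two parts.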

### Output (prediction) formula, sigmoid
These functions are the same as in my `gradient_descent.ipynb` in

https://github.com/TobiasSunderdiek/my_udacity_deep_learning_solutions/blob/master/intro-neural-networks/gradient_descent.ipynb

The calculation of the output differs from the Udacity version of this notebook: a bias is added here.

In [0]:
func mySigmoid(_ x: Tensor<Double>) -> Tensor<Double> {
  return 1 / (1 + exp(-x))
}
func myOutputFormula(_ features: Tensor<Double>, _ weights: Tensor<Double>, _ bias: Tensor<Double>) -> Double {
  let res = mySigmoid((features * weights).sum() + bias)
  return res.scalar!
}
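
To see the effect of the bias on the output in isolation, here is a minimal sketch with plain arrays instead of tensors. The names `sigmoid` and `output` are illustrative stand-ins for `mySigmoid` and `myOutputFormula`, and the all-zero weights are an assumption chosen to make the result easy to check by hand.

```swift
import Foundation

func sigmoid(_ v: Double) -> Double {
  return 1 / (1 + exp(-v))
}

// Weighted sum of the inputs plus bias, passed through the sigmoid.
func output(_ features: [Double], _ weights: [Double], _ bias: Double) -> Double {
  let weightedSum = zip(features, weights).map { $0 * $1 }.reduce(0, +) + bias
  return sigmoid(weightedSum)
}

let features = [0.475, 0.9025, 0.0, 0.0, 0.0, 1.0, 0.0]  // first row of the scaled table
let zeroWeights = [Double](repeating: 0.0, count: 7)
print(output(features, zeroWeights, 0.0))  // 0.5, since sigmoid(0) = 0.5
print(output(features, zeroWeights, 1.0))  // larger than 0.5, the bias shifts the weighted sum
```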

### Error function, error term formula, gradient descent step, backpropagation, sigmoid prime

The error formula (its result) is not used in the original version. I use the MSE to calculate the loss, as in the original version, but put the squared error into the error formula function instead. I sum up the errors during the iteration over an epoch and calculate the mean afterwards.

The weight update is the gradient descent step of backpropagation, with the mean squared error as the error function. In the original version, the update is calculated within each iteration in `error_term_formula` and applied after each epoch. Here it is applied at each step of the iteration in `myUpdateWeights`.

The bias is updated in the same step.

I chose this approach to keep the train function the same as in `gradient_descent.ipynb`.

In [0]:
func sigmoidPrime(_ x: Tensor<Double>) -> Tensor<Double> {
  return sigmoid(x) * (1 - sigmoid(x))
}
func myErrorFormula(_ y: Double, _ output: Double) -> Double {
  return pow(y-output, 2)
}
func myUpdateWeights(_ features: Tensor<Double>, _ targets: Double, _ weights: Tensor<Double>, _ bias: Tensor<Double>, _ learningRate: Double) -> (Tensor<Double>, Tensor<Double>) {
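  // sigmoidPrime(features) has shape [7], so it is summed to get a scalar.
  // Note: sigmoidPrime is evaluated on the features here, whereas the derivation above uses the weighted sum v.
  let delta = learningRate * (targets - myOutputFormula(features, weights, bias)) * sigmoidPrime(features).sum()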
  let updatedWeights = weights + delta * features
  let updatedBias = bias + delta

  return (updatedWeights, updatedBias)
}
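
For comparison, the update rule from the math section, $w_i \longrightarrow w_i + \eta(y-\hat{y})\sigma^\prime(v) x_i$, can be sketched in plain Swift. Unlike the cell above, this sketch evaluates $\sigma^\prime$ at the weighted sum $v$ and not at the features. The name `updateStep` and the example values are assumptions for illustration only.

```swift
import Foundation

func sigmoid(_ v: Double) -> Double {
  return 1 / (1 + exp(-v))
}

// One gradient descent step for a single example:
// w_i -> w_i + eta * (y - yHat) * sigma'(v) * x_i, and the same for the bias with x = 1.
func updateStep(_ features: [Double], _ target: Double,
                _ weights: [Double], _ bias: Double,
                _ learningRate: Double) -> (weights: [Double], bias: Double) {
  let v = zip(features, weights).map { $0 * $1 }.reduce(0, +) + bias
  let yHat = sigmoid(v)
  let sigmoidPrimeOfV = yHat * (1 - yHat)  // sigma'(v) expressed through the output
  let delta = learningRate * (target - yHat) * sigmoidPrimeOfV
  let newWeights = zip(weights, features).map { $0 + delta * $1 }
  return (newWeights, bias + delta)
}

// Zero weights and bias give yHat = 0.5 and sigma'(v) = 0.25,
// so delta = 0.5 * (1 - 0.5) * 0.25 = 0.0625.
let (w, b) = updateStep([1.0, 2.0], 1.0, [0.0, 0.0], 0.0, 0.5)
print(w, b)  // prints "[0.0625, 0.125] 0.0625"
```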

## Training function
The course initializes the weights with

`weights = np.random.normal(scale=1 / n_features**.5, size=n_features)`

which draws from a normal distribution with standard deviation $1/\sqrt{n_{features}}$. My version does not apply this scaling:

`var weights = Tensor<Double>(randomNormal: [n_features])`
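
A scaled initialization like the one in the course could be sketched in plain Swift as follows. This is only an illustration and not what the notebook does: the helpers `randomNormal(standardDeviation:)` and `scaledInitialWeights(featureCount:)` are hypothetical, and the normal samples are generated with the Box-Muller transform instead of a library call.

```swift
import Foundation

// Draw one sample from a normal distribution with mean 0 and the given
// standard deviation, using the Box-Muller transform.
func randomNormal(standardDeviation: Double) -> Double {
  let u1 = Double.random(in: Double.leastNonzeroMagnitude...1)  // avoids log(0)
  let u2 = Double.random(in: 0..<1)
  return standardDeviation * sqrt(-2 * log(u1)) * cos(2 * Double.pi * u2)
}

// Weights with standard deviation 1 / sqrt(featureCount), as in
// `np.random.normal(scale=1 / n_features**.5, size=n_features)`.
func scaledInitialWeights(featureCount: Int) -> [Double] {
  let scale = 1 / sqrt(Double(featureCount))
  return (0..<featureCount).map { _ in randomNormal(standardDeviation: scale) }
}

let initialWeights = scaledInitialWeights(featureCount: 7)
print(initialWeights)
```

With 7 features the standard deviation is about 0.38, so the initial weighted sum tends to stay closer to 0, where the sigmoid has its largest slope.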

The training function is the same as in https://github.com/TobiasSunderdiek/my_udacity_deep_learning_solutions/blob/master/intro-neural-networks/gradient_descent.ipynb

In [0]:
func train(_ features: [Tensor<Double>], _ targets: [Double], epochs: Int, learningRate: Double) { // learningRate is a Double because Swift won't multiply a Double with a Float
  let numberRecords = Double(features.count)
  var weights = Tensor<Double>(randomNormal: features[0].shape)
  var bias = Tensor<Double>.zero
  var lastLoss = Double.infinity
  
  for epoch in 0...epochs {
    var errors = 0.0
    var correctPredictions = 0.0
    var prediction = 0.0

    for (x, y) in zip(features, targets) {
      let output = myOutputFormula(x, weights, bias)
      errors += myErrorFormula(y, output)
      (weights, bias) = myUpdateWeights(x, y, weights, bias, learningRate)

      if (output > 0.5) {
        prediction = 1.0
      } else {
        prediction = 0.0
      }
      
      if (prediction == y) {
        correctPredictions+=1
      }
    }
    
    let loss = errors / numberRecords

    if epoch % (epochs / 10) == 0 {
      print("Epoch: \(epoch)")
      
      let warning = lastLoss < loss ? "WARNING - Loss increasing" : ""
      print("Train loss: \(loss) \(warning)")
      lastLoss = loss
      
      let accuracy = correctPredictions / numberRecords
      print("Accuracy: \(accuracy)")
      
      print("Errors: \(errors)")
    }
  }        
}

## Train

In [16]:
train(featuresTrain, targetsTrain, epochs: 1000, learningRate: 0.5)

Epoch: 0
Train loss: 0.26075068142563673 
Accuracy: 0.6333333333333333
Errors: 93.87024531322922
Epoch: 100
Train loss: 0.2538984042839295 
Accuracy: 0.6472222222222223
Errors: 91.40342554221463
Epoch: 200
Train loss: 0.25389840428215077 
Accuracy: 0.6472222222222223
Errors: 91.40342554157428
Epoch: 300
Train loss: 0.25389840428215077 
Accuracy: 0.6472222222222223
Errors: 91.40342554157428
Epoch: 400
Train loss: 0.25389840428215077 
Accuracy: 0.6472222222222223
Errors: 91.40342554157428
Epoch: 500
Train loss: 0.25389840428215077 
Accuracy: 0.6472222222222223
Errors: 91.40342554157428
Epoch: 600
Train loss: 0.25389840428215077 
Accuracy: 0.6472222222222223
Errors: 91.40342554157428
Epoch: 700
Train loss: 0.25389840428215077 
Accuracy: 0.6472222222222223
Errors: 91.40342554157428
Epoch: 800
Train loss: 0.25389840428215077 
Accuracy: 0.6472222222222223
Errors: 91.40342554157428
Epoch: 900
Train loss: 0.25389840428215077 
Accuracy: 0.6472222222222223
Errors: 91.40342554157428
Epoch: 100

Compare with the Python calculation:

Epoch: 0
Train loss:  0.27151046424991654

Epoch: 100
Train loss:  0.20925670061926063

Epoch: 900
Train loss:  0.203646868060691

Prediction accuracy: 0.725