# General Introduction

### 1. Architecture

![title](Architecture1.png)
![title](Architecture2.png)
![title](Architecture3.png)

### 2. Feedforward

![title](Feedforward1.png)

### 3. Backpropagation
Step 1: Run a feedforward pass.  
Step 2: Compare the output of the model with the desired output.  
Step 3: Calculate the error.  
Step 4: Run the feedforward operation backwards (backpropagation) to spread the error to each of the weights.  
Step 5: Use this to update the weights and get a better model.  
Step 6: Repeat until the model is good enough.  
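
The six steps above can be sketched for a tiny network. This is a minimal illustration, not the notebook's own implementation: the network shape (2 inputs, 3 sigmoid hidden units, 1 sigmoid output), the cross-entropy error, the OR-gate data, and all names (`train`, `n_hidden`, `learn_rate`) are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train(X, y, n_hidden=3, learn_rate=0.5, epochs=200, seed=0):
    """Full-batch training of a 2-layer sigmoid network (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Step 1: feedforward
        h = sigmoid(X @ W1 + b1)            # hidden activations, shape (m, n_hidden)
        p = sigmoid(h @ W2 + b2)[:, 0]      # predicted probabilities, shape (m,)
        # Steps 2-3: compare with the desired output and compute the error
        loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        losses.append(loss)
        # Step 4: backpropagation (sigmoid output + cross-entropy gives p - y)
        d_out = (p - y)[:, None] / len(y)   # shape (m, 1)
        grad_W2 = h.T @ d_out
        grad_b2 = d_out.sum(axis=0)
        d_hidden = (d_out @ W2.T) * h * (1 - h)
        grad_W1 = X.T @ d_hidden
        grad_b1 = d_hidden.sum(axis=0)
        # Step 5: update the weights
        W1 -= learn_rate * grad_W1
        b1 -= learn_rate * grad_b1
        W2 -= learn_rate * grad_W2
        b2 -= learn_rate * grad_b2
        # Step 6: the loop repeats for the chosen number of epochs
    return (W1, b1, W2, b2), losses

# Made-up data: the OR gate
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])
params, losses = train(X, y)
print(losses[0], losses[-1])
```

Here "good enough" is replaced by a fixed epoch count for simplicity; a validation-based stopping rule would be the more usual choice.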

![title](Backpropagation1.png)
![title](Backpropagation2.png)

In [25]:
# gradient descent
# f(x) = x**2 - 2*x + 1
import numpy as np
testF = lambda x: x**2 - 2*x + 1
# f: function to minimize
# init: initial value
# step: learning rate (step size)
# eps: stopping tolerance (precision)
def gd(f, init = 0, step = 0.1, eps = 0.0001):
    deltaX = 0.01
    devF = lambda x: (f(x) - f(x-deltaX))/deltaX
    newpre = f(init)
    error = np.inf
    while error > eps:
        init = init - devF(init)*step
        error = abs(newpre - f(init))
        newpre = f(init)
    return init

In [26]:
gd(testF)

0.9905164235983765

# I. Outline key points for neural network

### perceptron
#### Input
1. One-hot encoding (for input)  
pd.get_dummies(data)  
sklearn.preprocessing.OneHotEncoder(sparse = False).fit_transform(data) (in scikit-learn 1.2 and later the parameter is named sparse_output)
2. Normalization (for input), min-max method: (x - min x)/(max x - min x)
3. No correlation between features (optional)

#### Output I: linear combination
1. linear combination (for output)

#### Output II: Activation function
1. step function (discrete output)
2. sigmoid function (continuous output)
$$S(t)=\frac{1}{1+e^{-t}}$$
$$S'(t)=S(t)(1-S(t))$$
3. softmax function (multi-class classification problem)
$$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
**Reference:** http://blog.nex3z.com/2017/05/02/sigmoid-%E5%87%BD%E6%95%B0%E5%92%8C-softmax-%E5%87%BD%E6%95%B0%E7%9A%84%E5%8C%BA%E5%88%AB%E5%92%8C%E5%85%B3%E7%B3%BB/
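
The sigmoid and softmax formulas above translate directly into NumPy. The sketch below is illustrative: the function names and the example scores are made up, and subtracting the maximum inside `softmax` is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_prime(t):
    # S'(t) = S(t) * (1 - S(t))
    s = sigmoid(t)
    return s * (1.0 - s)

def softmax(x):
    # Shift by the maximum so np.exp cannot overflow
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])   # made-up class scores
probs = softmax(scores)
print(probs, probs.sum())
```

With two classes and scores `[t, 0]`, the first softmax output equals `sigmoid(t)`, which is the relationship between the two functions.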


1. Perceptron algorithm (weight adjustment)
2. Maximum likelihood (log transformation)
3. Cross entropy: the higher the probability assigned to the correct outcome, the lower the cross entropy  
For a 0-1 (binary) problem: 
$$\text{Cross-Entropy}(y', y) = -\sum_{i=1}^m \left[ y'_i \ln(y_i) + (1-y'_i)\ln(1-y_i) \right]$$
where $y_i$ is the predicted probability that sample $i$ belongs to class 1, $y'_i$ is its true label, and $m$ is the number of samples.  
For a multi-class problem:
$$\text{Cross-Entropy} = \sum_{i=1}^m \sum_{j=1}^n -y'^{(j)}_i \ln(y^{(j)}_i)$$
where $n$ is the number of classes.
4. Error function
$$E(w, b) = -\frac{1}{m} \sum_{i=1}^m \left[ y'_i \ln(\sigma(wx_i+b)) + (1-y'_i)\ln(1-\sigma(wx_i+b)) \right]$$
5. Gradient descent
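
The two cross-entropy formulas above can be checked numerically. This is a small sketch with made-up labels and probabilities; the function names and the `eps` clipping (which guards against log(0)) are additions for the example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true_onehot, y_pred, eps=1e-12):
    y_true = np.asarray(y_true_onehot, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

# Confident, correct predictions give a lower cross entropy
good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
bad = binary_cross_entropy([1, 0, 1], [0.2, 0.7, 0.3])
print(good, bad)

# The same "good" case written as a 2-class one-hot problem
good_multi = categorical_cross_entropy(
    [[0, 1], [1, 0], [0, 1]],
    [[0.1, 0.9], [0.9, 0.1], [0.2, 0.8]],
)
```

Dividing `binary_cross_entropy` by the number of samples gives the error function $E(w, b)$ when the predictions come from $\sigma(wx_i+b)$.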

### Neural Network
1. Feedforward Neural Network
2. Backpropagation

***notes for onehot encoder***

In [46]:
# 1. Onehot
# example 1:
print('Example 1:')
import pandas as pd
from sklearn import preprocessing
data = pd.read_excel('data.xlsx')
print(data.head(2))
method_pd = pd.get_dummies(data, columns = ['y'])
print(method_pd.head(2))
sklearn_onehot = preprocessing.OneHotEncoder(sparse = False)
method_sklearn = sklearn_onehot.fit_transform(data['y'].values.reshape(-1, 1))

Example 1:
        x1        x2  y
0  0.78051 -0.063669  1
1  0.28774  0.291390  1
        x1        x2  y_0  y_1
0  0.78051 -0.063669    0    1
1  0.28774  0.291390    0    1


In [44]:
# example 2:
print('Example 2:')
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([[0, 0, 3],
         [1, 1, 0],
         [0, 2, 1],
         [1, 0, 2]])
# toarray() returns a dense array; without it, transform returns a sparse matrix
# same effect as passing the parameter sparse = False
ans = enc.transform([[0, 1, 2]]).toarray()
# first feature: 0: 10, 1: 01
# second feature: 0: 100, 1: 010, 2: 001
# third feature: 0: 1000, 1: 0100, 2: 0010, 3: 0001
print(ans)

Example 2:
[[1. 0. 0. 1. 0. 0. 0. 1. 0.]]


***notes for normalization***  
***reference:*** https://scikit-learn.org/stable/modules/preprocessing.html

In [54]:
# fit scaler
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit(data[['x1', 'x2']])
X_train_minmax.transform(data[['x1', 'x2']])[:2]

# new data input
X_test = np.array([[-3., -1.]])
X_test_minmax = X_train_minmax.transform(X_test)
X_test_minmax

array([[-3.03408479, -0.88028419]])
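
The scaled values above fall outside [0, 1] because the new point lies outside the range seen during fitting. The sketch below reproduces the min-max formula by hand to make that visible. The data and the helper names `fit_min_max` and `apply_min_max` are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def fit_min_max(X_train):
    # Column-wise minimum and maximum learned from the training data
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_min_max(X, col_min, col_max):
    # (x - min x) / (max x - min x), using the training statistics
    return (X - col_min) / (col_max - col_min)

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])  # made-up data
col_min, col_max = fit_min_max(X_train)
X_train_scaled = apply_min_max(X_train, col_min, col_max)

# A new point outside the training range maps outside [0, 1]
X_new_scaled = apply_min_max(np.array([[0.0, 25.0]]), col_min, col_max)
print(X_train_scaled)
print(X_new_scaled)
```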

In [56]:
data.corr()

          x1        x2         y
x1  1.000000  0.625802 -0.767375
x2  0.625802  1.000000 -0.769329
y  -0.767375 -0.769329  1.000000


Batch, epoch

### Problems when training neural network
1. Overfitting & underfitting  
(1) Model complexity graph (determine the number of epochs, early stopping)  
(2) L1 (feature selection) and L2 (the more commonly used) regularization: penalize large weights  
(3) Dropout: randomly turn off some nodes in each epoch  
e.g. if the probability that each node is dropped is 0.2, about 20% of the nodes are turned off
2. Bias & variance
3. Gradient descent (local minima, vanishing gradient)  
(1) Change the activation function (tanh, ReLU) to counter the vanishing gradient  
(2) Stochastic gradient descent (mini-batches, a different subset of the data at each step)  
(3) Learning rate (learning rate decay strategy)  
(4) Random restarts: start from several random points and run gradient descent from each of them   
(5) Momentum $\beta$  
$$STEP(n) \leftarrow STEP(n)+\beta STEP(n-1)+\beta^2STEP(n-2)+...$$  
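
The momentum rule above can be computed without storing every past step: keeping a running `velocity = step + beta * velocity` unrolls to exactly the weighted sum of previous steps. The sketch below is an illustration on the quadratic f(x) = x**2 - 2*x + 1; the function name, the hyperparameter values, and the use of the analytic gradient 2x - 2 are assumptions for the example.

```python
def gd_momentum(grad, init=0.0, learn_rate=0.1, beta=0.9, num_steps=200):
    x = init
    velocity = 0.0
    for _ in range(num_steps):
        # Current step plus beta times the accumulated previous steps
        velocity = learn_rate * grad(x) + beta * velocity
        x -= velocity
    return x

grad_f = lambda x: 2 * x - 2   # derivative of x**2 - 2*x + 1
x_min = gd_momentum(grad_f)
print(x_min)
```

The accumulated velocity lets the iterate carry past small bumps, at the cost of some oscillation around the minimum before it settles.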


![title](model_complex.png)
![title](reg.png)
![title](biasvar.png)
![title](randstart.png)
![title](mom.png)
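
Dropout, as listed under overfitting above, can be sketched as a random mask over a layer's activations. This is a minimal illustration rather than a library implementation: the function name is made up, and rescaling the kept activations by 1 / (1 - drop_prob) ("inverted dropout") is an assumption that the notes do not mention.

```python
import numpy as np

def dropout(activations, drop_prob=0.2, rng=None):
    """Zero each activation with probability drop_prob (training time only)."""
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= drop_prob
    # Rescale so the expected activation is unchanged (assumed convention)
    return activations * keep_mask / (1.0 - drop_prob)

hidden = np.ones(10_000)                         # made-up activations
dropped = dropout(hidden, drop_prob=0.2, rng=np.random.default_rng(0))
fraction_off = np.mean(dropped == 0)
print(fraction_off)
```

At prediction time the mask is not applied and all nodes are used.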

# II. Algorithms

### 1. perceptron algorithm
* step 1: start with random weights: w1, ..., wn, b  
* step 2: for every misclassified point  
```
if prediction = 0:
    for i in range(n):
        change wi to wi + a*xi
    change b to b + a
if prediction = 1:
    for i in range(n):
        change wi to wi - a*xi
    change b to b - a
```
where a is the learning rate.

In [1]:
import numpy as np
# Setting the random seed, feel free to change it and see different solutions.
np.random.seed(42)

def stepFunction(t):
    if t >= 0:
        return 1
    return 0

def prediction(X, W, b):
    return stepFunction((np.matmul(X,W)+b)[0])

# Implementation of the perceptron trick.
# The function should receive as inputs the data X, the labels y,
# the weights W (as an array), and the bias b,
# update the weights and bias W, b, according to the perceptron algorithm,
# and return W and b.
def perceptronStep(X, y, W, b, learn_rate = 0.01):
    for i in range(len(X)):
        y_hat = prediction(X[i],W,b)
        if y[i]-y_hat == 1:
            W[0] += X[i][0]*learn_rate
            W[1] += X[i][1]*learn_rate
            b += learn_rate
        elif y[i]-y_hat == -1:
            W[0] -= X[i][0]*learn_rate
            W[1] -= X[i][1]*learn_rate
            b -= learn_rate
    return W, b
    
# This function runs the perceptron algorithm repeatedly on the dataset,
# and returns a few of the boundary lines obtained in the iterations,
# for plotting purposes.
# Feel free to play with the learning rate and the num_epochs,
# and see your results plotted below.
def trainPerceptronAlgorithm(X, y, learn_rate = 0.01, num_epochs = 25):
    x_min, x_max = min(X.T[0]), max(X.T[0])
    y_min, y_max = min(X.T[1]), max(X.T[1])
    W = np.array(np.random.rand(2,1))
    b = np.random.rand(1)[0] + x_max
    # These are the solution lines that get plotted below.
    boundary_lines = []
    for i in range(num_epochs):
        # In each epoch, we apply the perceptron step.
        W, b = perceptronStep(X, y, W, b, learn_rate)
        boundary_lines.append((-W[0]/W[1], -b/W[1]))
    return boundary_lines

### 2. Gradient Descent
* step 1: start with random weights: w1, ..., wn, b  
* step 2: for every point $(x_1, x_2, ..., x_n)$  
```
for i in range(n):
    update wi to wi - a*partial(E)/partial(wi)
update b to b - a*partial(E)/partial(b)
```
* Workshop: Implementing the Gradient Descent Algorithm
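
For a sigmoid output with the cross-entropy error, the per-point partial derivatives are $\partial E/\partial w_j = -(y - \hat{y})x_j$ and $\partial E/\partial b = -(y - \hat{y})$, so the update becomes `w += a * (y - y_hat) * x`. The sketch below applies that rule; the data, the function names, and the hyperparameters are made up for illustration and are not taken from the workshop.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def error(X, y, W, b):
    # Mean cross-entropy over all points
    p = sigmoid(X @ W + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_descent_step(X, y, W, b, learn_rate=0.1):
    # One pass over the data, updating after every point
    for x_i, y_i in zip(X, y):
        y_hat = sigmoid(np.dot(x_i, W) + b)
        W = W + learn_rate * (y_i - y_hat) * x_i
        b = b + learn_rate * (y_i - y_hat)
    return W, b

# Made-up, linearly separable data
X = np.array([[0.0, 0.2], [0.3, 0.1], [0.8, 0.9], [0.9, 0.7]])
y = np.array([0, 0, 1, 1])

W, b = np.zeros(2), 0.0
error_before = error(X, y, W, b)
for _ in range(100):
    W, b = gradient_descent_step(X, y, W, b)
error_after = error(X, y, W, b)
print(error_before, error_after)
```

Unlike the perceptron step, every point contributes an update here, scaled by how far the prediction is from the label.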

### 3. Backpropagation
* Workshop: Student Admissions
* Reference:  
(1) https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/  
(2) http://neuralnetworksanddeeplearning.com/chap2.html