# Single Neuron

## Example - The Linear Unit as a Model

Below is a single neuron with multiple inputs:  
<img src="https://i.imgur.com/vyXSnlZ.png" width="50%">

The formula for this neuron would be: $y=w_0x_0+w_1x_1+w_2x_2+b$
* $x_0, x_1, x_2$: inputs
* $w_0, w_1, w_2$: weights. Whenever a value flows through a connection, you multiply the value by the connection's **weight**. For the input **x**, what reaches the neuron is **w * x**. A neural network "learns" by modifying its **weights**.
* $b$: a special kind of weight we call the **bias**. It has no input attached, so it lets the neuron shift its output independently of the inputs.
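To make the formula concrete, here is a minimal sketch in plain Python that evaluates one linear unit by hand. The function name `linear_unit` and all numeric values are illustrative assumptions, not taken from a trained model.

```python
def linear_unit(inputs, weights, bias):
    """Return w_0*x_0 + w_1*x_1 + ... + b for a single neuron."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Illustrative values only (hypothetical, not learned weights).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
b = 0.25

y = linear_unit(x, w, b)
print(y)  # 0.5*1 - 1*2 + 2*3 + 0.25 = 4.75
```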

## Linear Units in Keras

We could define a linear model accepting three input features ($x_0, x_1, x_2$) and producing a single output ($y$) like below:

In [15]:
from tensorflow import keras
from tensorflow.keras import layers

In [16]:
# Create a network with 1 linear unit
model = keras.Sequential([
    layers.Dense(units=1, input_shape=[3])
])

* With the first argument, **units**, we define how many outputs we want. **It is equal to the number of neurons in the layer**. Here the only output is $y$.
* With the second argument, **input_shape**, we tell Keras the dimensions of the inputs. Here the inputs are $x_0, x_1, x_2$, so the shape is `[3]`.
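As a quick sanity check on what `units` and `input_shape` imply, the sketch below counts the trainable parameters of a dense layer: one weight per input per unit, plus one bias per unit. The helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, units):
    # one weight per (input, unit) pair plus one bias per unit
    return n_inputs * units + units

# The layer above has 3 inputs and 1 unit: w_0, w_1, w_2 and b.
print(dense_param_count(n_inputs=3, units=1))  # 4
```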

# Deep Neural Networks

## Layers

Neural networks typically organize their neurons into layers. When we collect together linear units having a common set of inputs we get a dense layer.  

<img src="https://i.imgur.com/2MA4iMV.png">

## Activation Function

An activation function is simply some function we apply to each of a layer's outputs (its activations). The most common is the rectifier function $\max(0, x)$.  

<img src="https://i.imgur.com/aeIyAlF.png">

**Without activation functions, neural networks can only learn linear relationships**. In order to fit curves, we'll need to use activation functions. The rectifier function has a graph that's a line with the negative part "rectified" to zero. Applying the function to the outputs of a neuron will put a bend in the data, moving us away from simple lines.  

When we attach the rectifier to a linear unit, we get a **rectified linear unit or ReLU**. (For this reason, it's common to call the rectifier function the "ReLU function".) 
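The claim that linear units alone can only express linear relationships can be checked on a toy case. In the sketch below (all coefficients are made up for illustration), two stacked linear units collapse into the single line $6x - 1$, while inserting a ReLU between them puts a bend in the output.

```python
def relu(x):
    """Rectifier: negative values are clipped to zero."""
    return max(0.0, x)

def two_linear_units(x):
    hidden = 2.0 * x + 1.0
    return 3.0 * hidden - 4.0      # algebraically the same as 6x - 1

def with_relu(x):
    hidden = relu(2.0 * x + 1.0)   # same units, but with a ReLU in between
    return 3.0 * hidden - 4.0

for x in [-2.0, -1.0, 0.0, 1.0]:
    print(x, two_linear_units(x), with_relu(x))
```

For inputs where the hidden value is negative, the ReLU version stays flat at -4 instead of following the straight line.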

## Stacking Dense Layers

<a id = "chapter_2.3"></a> 
Now that we have some nonlinearity, let's see how we can stack layers to get complex data transformations.  

<img src="https://i.imgur.com/Y5iwFQZ.png">

The layers before the output layer are sometimes called **hidden**.  

Notice that the final (output) layer is a linear unit (meaning, no activation function). That makes this network appropriate to a regression task, where we are trying to predict some arbitrary numeric value. Other tasks (like classification) might require an activation function on the output.

## Building Sequential Models

The model below is built according to the picture in [Chapter 2.3](#chapter_2.3).

In [17]:
from tensorflow import keras
from tensorflow.keras import layers

In [18]:
model = keras.Sequential([
    # the hidden ReLU layers
    # the first hidden layer
    layers.Dense(units=4, activation='relu', input_shape=[2]),
    #the second hidden layer
    layers.Dense(units=3, activation='relu'),
    # the linear output layer 
    layers.Dense(units=1),
])

**Be sure to pass all the layers together in a list, like [layer, layer, layer, ...].**
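To see what such a stack computes, here is a minimal plain-Python sketch of a forward pass through the same 2 → 4 → 3 → 1 architecture. It is not how Keras implements layers; the `dense` helper and every weight value are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def relu(value):
    return max(0.0, value)

def dense(inputs, weights, biases, activation=None):
    """One dense layer: each row of `weights` belongs to one unit."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs

# Hypothetical weights (a trained network would learn these).
hidden1_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]                  # 4 units, 2 inputs
hidden2_w = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # 3 units, 4 inputs
output_w = [[1.0, 1.0, 1.0]]                                                    # 1 linear unit

x = [1.0, 2.0]
h1 = dense(x, hidden1_w, [0.0] * 4, activation=relu)   # first hidden ReLU layer
h2 = dense(h1, hidden2_w, [0.0] * 3, activation=relu)  # second hidden ReLU layer
y = dense(h2, output_w, [0.5])                         # linear output layer
print(h1, h2, y)
```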

# Stochastic Gradient Descent

## Terminology

- Loss Function: it measures how far the network's predictions are from the true values (e.g. MSE, MAE);  


- Optimizer: it tells the network how to change its weights to reduce the loss (e.g. stochastic gradient descent);
    - SGD Steps:
      1. Sample some training data (minibatch) and run it through the network to make predictions;
      2. Measure the loss between the predictions and the true values;
      3. Finally, adjust the weights in a direction that makes the loss smaller.<br> 
      <br>
    - Minibatch (batch): each iteration's sample of training data;
    - Epoch: a complete round of the training data. The number of epochs you train for is how many times the network will see each training example.  
    
- Learning Rate: the size of each weight update. A smaller learning rate means the network needs to see more minibatches before its weights converge to their best values.
      


<img src='https://i.imgur.com/rFI1tIk.gif' width='80%'>
The animation shows the linear model  being trained with SGD. 
The pale red dots depict the entire training set, while the solid red dots are the minibatches.  

Every time SGD sees a new minibatch, it will shift the weights (w the slope and b the y-intercept) toward their correct values on that batch. Batch after batch, the line eventually converges to its best fit. You can see that the loss gets smaller as the weights get closer to their true values.
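The three SGD steps can be sketched in plain Python for a one-feature linear model. This is a simplified illustration, not the Keras training loop: the toy data (points on the line $y = 3x + 1$), the learning rate, the batch size and the use of MSE as the loss are all assumptions made for the example.

```python
import random

random.seed(0)  # fixed seed so the shuffling is repeatable

# Toy training set on the line y = 3x + 1 (values chosen for illustration).
xs = [i / 19 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]

def mse(w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, b = 0.0, 0.0
learning_rate = 0.2
batch_size = 4
initial_loss = mse(w, b)

for epoch in range(300):
    order = list(range(len(xs)))
    random.shuffle(order)                       # one epoch = one shuffled pass over the data
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # 1. run the minibatch through the model and compare with the targets
        errors = [w * xs[i] + b - ys[i] for i in batch]
        # 2. gradient of the minibatch MSE with respect to w and b
        grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
        grad_b = 2 * sum(errors) / len(batch)
        # 3. move the weights in the direction that makes the loss smaller
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mse(w, b)
print(round(w, 3), round(b, 3), final_loss)
```

Lowering `learning_rate` in this sketch means more minibatches are needed before `w` and `b` settle near the slope and intercept of the line.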

## Example - Adding the Loss and Optimizer

> model.compile(
>     optimizer="adam",
>     loss="mae",
> )

Adam is an SGD algorithm that has an adaptive learning rate that makes it suitable for most problems without any parameter tuning (it is "self tuning", in a sense). Adam is a great general-purpose optimizer.

### Import data

In [19]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

In [20]:
data = pd.read_csv('red-wine.csv')
data


In [None]:
data.describe()

### Data Processing
<a id = "chapter_3.2.2"></a> 

In [None]:
x = data.drop(columns='quality')
y = data['quality']
x_col_name = x.columns

##### Splitting data

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.3, random_state=0)

##### Scaling data
Neural networks tend to perform best when their inputs are on a common scale. The scaler is fitted on the training set only and then applied unchanged to the validation set.

In [None]:
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train, columns=x_col_name)

x_valid = scaler.transform(x_valid)
x_valid = pd.DataFrame(x_valid, columns=x_col_name)
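
What min-max scaling does to a single column can be sketched in plain Python. This illustrates the idea rather than scikit-learn's implementation; the helper names and the numbers are hypothetical. Note that the minimum and maximum come from the training values only, so validation values can land outside the 0 to 1 range.

```python
def fit_min_max(column):
    """Learn the scaling range from the training column."""
    return min(column), max(column)

def transform_min_max(column, low, high):
    """Map `low` to 0 and `high` to 1; other values scale linearly."""
    return [(value - low) / (high - low) for value in column]

train = [2.0, 4.0, 6.0]   # hypothetical feature values
valid = [5.0, 8.0]

low, high = fit_min_max(train)                      # fit on training data only
train_scaled = transform_min_max(train, low, high)
valid_scaled = transform_min_max(valid, low, high)  # reuse the training range
print(train_scaled, valid_scaled)
```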

### Building Model
<a id = "chapter_3.2.3"></a> 

In [None]:
model = keras.Sequential([
    # the hidden ReLU layers
    # the first hidden layer
    layers.Dense(units=512, activation='relu', input_shape=[11]),
    #the second hidden layer
    layers.Dense(units=512, activation='relu'),
    #the third hidden layer
    layers.Dense(units=512, activation='relu'),
    # the linear output layer 
    layers.Dense(units=1),
])

In [None]:
model.compile(optimizer='adam', loss='mae')
history = model.fit(
    x_train, y_train,
    validation_data=(x_valid, y_valid),
    batch_size=256,
    epochs=10,
)

In [None]:
model_history = pd.DataFrame(history.history)
model_history

In [None]:
model_history.plot(
    figsize=(8,5), 
    title='Learning Curve',
    xlabel='Epoch',
    ylabel='MAE',
    fontsize=10
)

# Overfitting and Underfitting

## Terminology

- Signal: it can help our model make predictions from new data.  


- Noise: it is only true of the training data; the noise is all of the random fluctuation that comes from data in the real world or all of the incidental, non-informative patterns that can't actually help the model make predictions. The noise is the part that might look useful but really isn't.  


- Learning Curves: plot training loss and validation loss against epochs. The training loss will go down either when the model learns signal or when it learns noise. But the validation loss will go down only when the model learns signal.
<img src="https://i.imgur.com/tHiVFnM.png" width="50%">

- Underfitting: the model doesn't learn enough signal.   


- Overfitting: the model learns too much noise.

## Ways to Decrease Overfitting and Underfitting

### Capacity

A model's capacity refers to the size and complexity of the patterns it is able to learn. For neural networks, it is largely determined by how many neurons the network has and how they are connected together. If your model is underfitting, you should increase the capacity.  


Methods to increase capacity:
- Wider networks (more units in existing layers): have an easier time learning more linear relationships.
- Deeper networks (more layers): prefer more nonlinear relationships.

Examples of wider and deeper networks:
```Python
model = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])
wider = keras.Sequential([
    layers.Dense(32, activation='relu'),
    layers.Dense(1),
])
deeper = keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])
```

### Early Stopping

#### Introduction

Simply stop the training whenever it seems the validation loss isn't decreasing anymore. 
<img src="https://i.imgur.com/eP0gppr.png" width="50%">

Code:
```Python
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(
    min_delta=0.01, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)
```
These parameters say: "If there hasn't been at least an improvement of 0.01 in the validation loss over the previous 20 epochs, then stop the training and keep the best model you found." It can sometimes be hard to tell if the validation loss is rising due to overfitting or just due to random batch variation. The parameters allow us to set some allowances around when to stop.<p>
**<font color=red size=6>how to understand EarlyStopping?</font>**
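
One way to understand these parameters is to trace the bookkeeping by hand. The sketch below is a simplified plain-Python model of the stopping rule, not the Keras source code: the function name, the loss values and the small `patience` are assumptions for illustration. A loss counts as an improvement only if it beats the best value so far by more than `min_delta`; training stops once `patience` epochs in a row fail to improve.

```python
def early_stopping_epoch(val_losses, min_delta=0.01, patience=3):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:       # a real improvement
            best, best_epoch, wait = loss, epoch, 0
        else:                             # too small a change, or worse
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch.
losses = [0.60, 0.50, 0.45, 0.449, 0.46, 0.47, 0.40]
stopped_at, best_at = early_stopping_epoch(losses)
print(stopped_at, best_at)
```

In this trace the change from 0.45 to 0.449 is smaller than `min_delta`, so it does not reset the counter, and training stops before the later value 0.40 is ever seen. With `restore_best_weights=True`, the weights from the best epoch are the ones kept.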


#### Example

In [None]:
from tensorflow import keras
from tensorflow.keras import layers, callbacks

Here we again use the model built in [Chapter 3.2.3](#chapter_3.2.3).

##### Set EarlyStopping

In [None]:
early_stopping = callbacks.EarlyStopping(
    min_delta=0.01, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

In [None]:
history = model.fit(
    x_train, y_train,
    validation_data=(x_valid, y_valid),
    epochs=100,
    batch_size=256,
    callbacks=[early_stopping],   # put your callbacks in a list
)

In [None]:
model_history = pd.DataFrame(history.history)

val_loss_min = model_history.val_loss.min()
epoch_min = model_history[model_history.val_loss == val_loss_min].index
val_loss_min = round(val_loss_min, 4)
epoch_min = np.array(epoch_min)[0]
print('Minimum validation loss: {}'.format(val_loss_min))
print('Epoch number: {}'.format(epoch_min))

_,ax1 = plt.subplots()
model_history.plot(
    figsize=(10,6), 
    xlabel='Epoch',
    ylabel='MAE',
    fontsize=10,
    ax=ax1,
    xticks=np.arange(0, 26, 5),
    yticks=np.arange(0, 0.6, 0.05)
)
ax1.set_title('Learning Curves', size=30)
ax1.scatter(epoch_min, val_loss_min, color='green',s=80)
ax1.grid(linestyle='--')
ax1.text(epoch_min, val_loss_min, '({},{})'.format(epoch_min, val_loss_min),size=15)


The plot shows that Keras stopped training before the full 100 epochs.

# Dropout and Batch Normalization

## Dropout

Dropout randomly drops some fraction of a layer's input units at **every step of training**, making it much harder for the network to learn spurious patterns in the **training data**. <p>
You could also think about dropout as creating a kind of ensemble of networks. The predictions will no longer be made by one big network, but instead by **a committee of smaller networks**. Individuals in the committee tend to make different kinds of mistakes, but be right at the same time, making the committee as a whole better than any individual. (If you're familiar with **random forests** as an ensemble of decision trees, it's the same idea.)<p>
The picture below shows 50% dropout added between the two hidden layers.
<img src="https://i.imgur.com/a86utxY.gif" width='80%'>

    
Code:<p>
```Python
keras.Sequential([
    # ...
    # Put the Dropout layer just before the layer you want the dropout applied to
    layers.Dropout(rate=0.3), # apply 30% dropout to the next layer 
    layers.Dense(16),
    # ...
])
```
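
The mechanics of a dropout layer can be sketched in plain Python. This is a simplified illustration, not the Keras implementation; the function name and values are hypothetical. During training each input is zeroed with probability `rate` and the survivors are scaled up by `1 / (1 - rate)` so the expected sum stays the same; at inference time the inputs pass through unchanged.

```python
import random

def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` while training."""
    if not training:
        return list(values)              # inference: nothing is dropped
    keep_scale = 1.0 / (1.0 - rate)      # rescale the surviving values
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)                   # seeded generator for repeatability
activations = [1.0, 2.0, 3.0, 4.0]

print(dropout(activations, 0.5, True, rng))    # a different mask on every call
print(dropout(activations, 0.5, False, rng))   # unchanged at inference time
```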

## Batch Normalization (Batchnorm)

Batchnorm can help correct training that is slow or unstable. The reason is that SGD will shift the network weights in proportion to how large an activation the data produces. Features that tend to produce activations of very different sizes can make for unstable training behavior.<p>
A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then also putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.<p>
It's good to normalize the data before it goes into the network:
```Python
keras.Sequential([
    # first layer
    layers.BatchNormalization(), # acts as a kind of adaptive preprocessor
    layers.Dense(16),
    # ...
])
```
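
The two stages described above (normalize with the batch's own statistics, then rescale with two trainable parameters) can be sketched for a single feature in plain Python. This covers the training-time computation only; it leaves out the moving averages a real layer keeps for inference. The names `gamma`, `beta` and `epsilon` and all values are assumptions made for the example.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature over a batch, then rescale with gamma and beta."""
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    # epsilon guards against division by zero when the variance is tiny
    return [gamma * (v - mean) / math.sqrt(variance + epsilon) + beta for v in batch]

batch = [10.0, 20.0, 30.0, 40.0]                    # hypothetical activations
normalized = batch_norm(batch)                      # roughly zero mean, unit variance
shifted = batch_norm(batch, gamma=2.0, beta=1.0)    # the "new scale" step
print(normalized)
print(shifted)
```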

Most often, batchnorm is added as an aid to the optimization process (though it can sometimes also help prediction performance). Models with batchnorm tend to need fewer epochs to complete training. Moreover, batchnorm can also fix various problems that can cause the training to get "stuck". Consider adding batch normalization to your models, especially if you're having trouble during training.<p>
It seems that batch normalization can be used at almost any point in a network. You can put it after a layer...
```Python
layers.Dense(16, activation='relu'),
layers.BatchNormalization(),
```
... or between a layer and its activation function:
```Python
layers.Dense(16),
layers.BatchNormalization(),
layers.Activation('relu'),
```
**<font color=red size=6>What is the influence by adding batchnorm between different layers?</font>**


## Example - Using Dropout and Batch Normalization

Here we again use the dataset processed in [Chapter 3.2.2](#chapter_3.2.2).

In [None]:
model = keras.Sequential([
    layers.Dense(1024, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1),
])

In [None]:
model.compile(
    optimizer='adam',
    loss='mae',
)

history = model.fit(
    x_train, y_train,
    validation_data=(x_valid, y_valid),
    batch_size=256,
    epochs=100,
    verbose=0,
)

In [None]:
model_history = pd.DataFrame(history.history)

val_loss_min = model_history.val_loss.min()
epoch_min = model_history[model_history.val_loss == val_loss_min].index
val_loss_min = round(val_loss_min, 4)
epoch_min = np.array(epoch_min)[0]
print('Minimum validation loss: {}'.format(val_loss_min))
print('Epoch number: {}'.format(epoch_min))

_,ax2 = plt.subplots()
model_history.plot(
    figsize=(10,6), 
    xlabel='Epoch',
    ylabel='MAE',
    fontsize=10,
    ax=ax2,
    xticks=np.arange(0, 105, 5),
)

ax2.set_title('Learning Curves', size=30)
ax2.scatter(epoch_min, val_loss_min, color='green',s=80)
ax2.grid(linestyle='--')
ax2.text(epoch_min, val_loss_min, '({},{})'.format(epoch_min, val_loss_min),size=15)


# Binary Classification

## Terminology

- Accuracy: $\text{accuracy} = \text{number}_{\text{correct}} ~/~ \text{total}$
- Binary Cross-Entropy: $H_p(q) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \cdot \log(p(y_i)) + (1-y_i)\cdot \log(1-p(y_i)) \right]$
  - $y_i$: 1 for positive, 0 for negative;
  - $p(y_i)$: the predicted probability that sample $i$ is positive.
<img src="https://i.imgur.com/DwVV9bR.png" width="50%">
- Sigmoid Activation: $\sigma (x) = \frac{1}{1+e^{-x}}$, it maps real numbers into the interval (0, 1).<img src='https://i.imgur.com/FYbRvJo.png' width='48%'>
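
The three definitions above can be checked on a handful of numbers with plain Python. This is a minimal sketch: the labels and raw scores are made up, and the 0.5 threshold used to turn probabilities into class predictions is an assumption of the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(y_true, p_pred):
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred))
    return -total / len(y_true)

def accuracy(y_true, p_pred, threshold=0.5):
    correct = sum((p > threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return correct / len(y_true)

y_true = [1, 0, 1, 0]                    # hypothetical labels
scores = [2.0, -1.0, -0.5, 0.5]          # hypothetical raw outputs of the last layer
probs = [sigmoid(s) for s in scores]     # squash the scores into (0, 1)

print(probs)
print(accuracy(y_true, probs))
print(binary_cross_entropy(y_true, probs))
```

Two of the four predictions fall on the wrong side of the threshold, so the accuracy is 0.5. The cross-entropy also reflects how confident each prediction was, which is why it works better as a loss than accuracy.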

## Example - Binary Classification

### Import data

In [None]:
data = pd.read_csv('ion.csv', index_col=0)
data.head()

### Data processing

##### Change the class column into numbers

In [None]:
# assign the result back instead of replacing in place on an attribute access
data['Class'] = data['Class'].map({'good': 1, 'bad': 0})

##### Split data

In [None]:
x = data.drop(columns='Class')
y = data.Class
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.3, random_state=0)

##### Scale data

In [None]:
scaler = MinMaxScaler()
x_train = pd.DataFrame(scaler.fit_transform(x_train))
x_valid = pd.DataFrame(scaler.transform(x_valid))

### Build Model

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
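model = keras.Sequential([
    # pass activation and input_shape by keyword: the third positional
    # argument of Dense is use_bias, not the input shape
    layers.Dense(4, activation='relu', input_shape=[x_train.shape[1]]),
    layers.Dense(4, activation='relu'),
    # the final layer uses a sigmoid activation to output a probability
    layers.Dense(1, activation='sigmoid')
])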

In [None]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy'],
)

In [None]:
early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True
)

In [None]:
history = model.fit(
    x_train, y_train,
    validation_data=(x_valid, y_valid),
    batch_size=512,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=0,
)

In [None]:
model_history = pd.DataFrame(history.history)
model_history.head()

In [None]:
_, ax = plt.subplots(1,2, figsize=(13, 5))
model_history[['loss', 'val_loss']].plot(ax=ax[0])
model_history[['accuracy', 'val_accuracy']].plot(ax=ax[1])

# NLP

## Tokenization


### Bag of Words (BOW)

In [21]:
import torch
from torch import nn

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

In [23]:
vectorizer1 = CountVectorizer()

corpus = [
          "This is the first document.",
          "And this is the second one."
]
X1 = vectorizer1.fit_transform(corpus)
X1.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 1],
       [1, 0, 0, 1, 1, 1, 1, 1]])

In [24]:
vectorizer1.vocabulary_

{'and': 0,
 'document': 1,
 'first': 2,
 'is': 3,
 'one': 4,
 'second': 5,
 'the': 6,
 'this': 7}
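
To show what the vectorizer is doing, the count matrix above can be rebuilt with plain Python. This is a sketch, not scikit-learn's implementation; the regular expression is an assumption that approximates the default tokenizer (lowercased words of two or more characters).

```python
import re

corpus = [
    "This is the first document.",
    "And this is the second one.",
]

def tokenize(text):
    # lowercase, then keep words of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct word, sorted, mapped to a column index.
words = sorted({token for doc in corpus for token in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(words)}

# One row per document, one column per word, each cell a word count.
counts = [[tokenize(doc).count(word) for word in vocabulary] for doc in corpus]

print(vocabulary)
print(counts)
```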

### TF-IDF (Term Frequency-Inverse Document Frequency)

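$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $

$ \text{IDF}(t) = \ln \frac{1+N}{1+\text{DF}(t)} + 1 $


*   TF(t, d) : the frequency of word t in document d;
*   DF(t) : the number of documents containing word t;
*   N : the total number of documents.

By default, `TfidfVectorizer` also scales each document's vector to unit Euclidean (L2) length, which is why the values in the output below are smaller than the raw TF × IDF products.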



In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [26]:
vectorizer2 = TfidfVectorizer()
X2 = vectorizer2.fit_transform(corpus)
X2.toarray()

array([[0.        , 0.53309782, 0.53309782, 0.37930349, 0.        ,
        0.        , 0.37930349, 0.37930349],
       [0.47042643, 0.        , 0.        , 0.33471228, 0.47042643,
        0.47042643, 0.33471228, 0.33471228]])
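
The numbers above can be reproduced by hand, which makes the formula easier to trust. The sketch below is plain Python rather than scikit-learn's code; it assumes the natural logarithm, the smoothed IDF given above, and a final scaling of each row to unit Euclidean length. The tokenizer regular expression is an approximation of the default one.

```python
import math
import re

corpus = [
    "This is the first document.",
    "And this is the second one.",
]

def tokenize(text):
    # lowercase words of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(text) for text in corpus]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

def idf(word):
    df = sum(word in doc for doc in docs)          # documents containing the word
    return math.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF

tfidf = []
for doc in docs:
    raw = [doc.count(word) * idf(word) for word in vocabulary]   # TF x IDF
    norm = math.sqrt(sum(value * value for value in raw))        # L2 length of the row
    tfidf.append([value / norm for value in raw])

for row in tfidf:
    print([round(value, 8) for value in row])
```

Words that appear in both documents ("is", "the", "this") get the minimum IDF of 1, so after normalization they weigh less than words unique to one document.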

## Word Embedding

<img src="https://github.com/CT608/Deep_Learning/blob/main/Pictures/93adea57dd44868dd97ce7378be4dd0.jpg?raw=true" width=70%>

In [27]:
embedding = nn.Embedding(11, 4, padding_idx=10)
embedding.weight #this is similar to the right matrix of the above picture

Parameter containing:
tensor([[-0.7666, -1.1007, -1.0370, -0.8972],
        [ 1.6282, -0.1722, -0.1903, -0.0537],
        [-1.4381, -0.3847, -0.7910, -0.0032],
        [ 2.2032, -0.1362, -0.4988, -0.8137],
        [-0.8977, -1.3183,  0.2952, -1.8086],
        [-0.3296, -0.8729,  0.6099,  0.1308],
        [-1.4520, -0.8888,  0.8420, -0.4980],
        [ 0.2580,  0.2150,  0.6035,  0.6584],
        [-1.4233,  1.8499,  1.0712, -0.5191],
        [-0.6894,  0.2017, -0.1428,  0.9833],
        [ 0.0000,  0.0000,  0.0000,  0.0000]], requires_grad=True)

In [28]:
input = torch.LongTensor([7, 5, 2, 4, 10, 10, 10])
embedding(input) #this is similar to the middle matrix of the above picture

tensor([[ 0.2580,  0.2150,  0.6035,  0.6584],
        [-0.3296, -0.8729,  0.6099,  0.1308],
        [-1.4381, -0.3847, -0.7910, -0.0032],
        [-0.8977, -1.3183,  0.2952, -1.8086],
        [ 0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000]], grad_fn=<EmbeddingBackward0>)
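
Conceptually, the embedding layer is a lookup table: each token index selects one row of the weight matrix. The sketch below shows the same idea in plain Python with a made-up 4 x 2 table; it leaves out everything that makes `nn.Embedding` trainable, and the name `embed` and all values are hypothetical.

```python
# Hypothetical embedding table: 4 tokens, 2 dimensions each.
table = [
    [0.1, 0.2],   # token 0
    [0.3, 0.4],   # token 1
    [0.5, 0.6],   # token 2
    [0.0, 0.0],   # token 3 is the padding index, kept at zero
]
padding_idx = 3

def embed(token_ids):
    """Replace each token id with its row of the table."""
    return [table[token_id] for token_id in token_ids]

# A two-token sentence padded to length 4.
sentence = [2, 0, padding_idx, padding_idx]
print(embed(sentence))
```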

## RNN (Recurrent Neural Network)

### Vanilla Recurrent Neural Network

<img src="https://github.com/CT608/Deep_Learning/blob/main/Pictures/4f931c9e94382b6ce906f6694fc054b.png?raw=true" width=50%>

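$ y_t = h_t = \tanh(W_{ih}x_t + b_{ih} + W_{hh}h_{t-1} + b_{hh}) $



*   Each hidden unit receives 2 inputs ($x_t,h_{t-1}$): the current input and the hidden state from the previous time step.
*   $ \tanh x = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ squashes values into (-1, 1). Backpropagating through many time steps may cause the vanishing or exploding gradient problem.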

<img src="https://github.com/CT608/Deep_Learning/blob/main/Pictures/tanhx.jpg?raw=true" width=30%>




```
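# code (input_size and hidden_size are required; sizes chosen for illustration)
rnn = torch.nn.RNN(input_size=4, hidden_size=8)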
```
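
To see the recurrence at work, here is a minimal plain-Python sketch of the update formula for a single hidden unit with a scalar input. It is not how `torch.nn.RNN` is implemented; the function name `rnn_step`, the weights and the input sequence are assumptions chosen for illustration.

```python
import math

def rnn_step(x_t, h_prev, w_ih, w_hh, b_ih=0.0, b_hh=0.0):
    """h_t = tanh(w_ih * x_t + b_ih + w_hh * h_prev + b_hh) for one unit."""
    return math.tanh(w_ih * x_t + b_ih + w_hh * h_prev + b_hh)

inputs = [1.0, 0.0, -1.0]    # hypothetical input sequence
h = 0.0                      # initial hidden state
hidden_states = []
for x_t in inputs:
    # the same weights are reused at every time step
    h = rnn_step(x_t, h, w_ih=0.5, w_hh=0.8)
    hidden_states.append(h)

print(hidden_states)
```

At the second step the input is zero, yet the hidden state is not: it still carries information from the first step through `h_prev`.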









### Long Short-Term Memory (LSTM)

<img src="https://github.com/CT608/Deep_Learning/blob/main/Pictures/LSTM.jpg?raw=true" width=70%>




*   $c_{t-1}$: cell state at the previous time step; 
*   $c_{t}$: cell state at the current time step; 
*   $h_{t-1}$: activation (hidden state) from the previous time step; 
*   $h_{t}$: activation (hidden state) at the current time step, passed on to the next time step; 
*   $ f_t $: forget gate, controls which information in the cell state is remembered and which is forgotten;
*   $ i_t $: input gate, controls how much new information is written to the current cell state.


```
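# code (input_size and hidden_size are required; sizes chosen for illustration)
lstm = torch.nn.LSTM(input_size=4, hidden_size=8)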
```
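
How the gates interact can be sketched for a single unit with scalar values. This follows the standard LSTM formulation rather than anything specific to this notebook: besides the forget gate $f_t$ and input gate $i_t$ listed above, it uses a candidate value `g_t` and an output gate `o_t`, which are names introduced here. The parameter dictionary and its values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; `p` holds weights w_*, u_* and biases b_*."""
    f_t = sigmoid(p["w_f"] * x_t + p["u_f"] * h_prev + p["b_f"])    # forget gate
    i_t = sigmoid(p["w_i"] * x_t + p["u_i"] * h_prev + p["b_i"])    # input gate
    g_t = math.tanh(p["w_g"] * x_t + p["u_g"] * h_prev + p["b_g"])  # candidate value
    o_t = sigmoid(p["w_o"] * x_t + p["u_o"] * h_prev + p["b_o"])    # output gate
    c_t = f_t * c_prev + i_t * g_t     # keep part of the old state, add new information
    h_t = o_t * math.tanh(c_t)         # expose part of the cell state
    return h_t, c_t

# With all parameters at zero every gate equals 0.5 and the candidate is 0,
# so the cell state is simply halved at each step.
zero_params = {name: 0.0 for name in
               ["w_f", "u_f", "b_f", "w_i", "u_i", "b_i",
                "w_g", "u_g", "b_g", "w_o", "u_o", "b_o"]}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(h_t, c_t)
```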




### Gated Recurrent Unit (GRU)

<img src="https://github.com/CT608/Deep_Learning/blob/main/Pictures/GRU.jpg?raw=true" width=50%>

A simplified version of the LSTM. The performance of LSTM and GRU is often close.


```
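# code (input_size and hidden_size are required; sizes chosen for illustration)
gru = torch.nn.GRU(input_size=4, hidden_size=8)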
```