# Deep Learning Crash Course for Beginners
<hr>

## What is deep learning
### Classification
- **AI**
- **Machine Learning:**
Teaching computers to recognize patterns in data
- **Deep Learning:**<br>
Technique that learns features and tasks directly from data.
    - neural networks: hidden layers

### Why now
- Data is prevalent
- Improved hardware architectures
- TensorFlow/PyTorch

### Neural Networks
- Input layer -> hidden layer -> output layer
- neurons

### Learning Process
- Forward Propagation: *activation function*
  - *Weight*: how important is the neuron
  - *Bias*: shifts the activation function $\sigma$ to the left or right
$$
\hat{y} = \sigma \left( \sum_{i=1}^{n} x_i w_i + b \right)
$$
- Back Propagation -> Gradient Descent implemented on a network<br>
The *loss function* helps quantify the deviation from the expected output.
    - use the loss function to measure the error
    - go backwards & adjust the initial weights and biases
    - values are adjusted until the model fits better
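
The forward pass of a single neuron can be sketched in a few lines. This is a minimal illustration, assuming one bias per neuron and a sigmoid as $\sigma$; the function names and the input values are hypothetical, not part of the course material.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Hypothetical example values, chosen only for illustration.
y_hat = neuron_forward([1.0, 2.0], [0.5, -0.25], 0.0)
print(y_hat)  # z = 0.5 - 0.5 + 0 = 0, and sigmoid(0) = 0.5
```

Raising the bias pushes `z` up for every input, which is the "shift" of the activation described in the notes.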

### Learning Algorithm
- Initialize the parameters with random values
- Feed the input data to the network
- Compare the predicted value with the expected value & calculate the loss
- Perform Back Propagation to propagate this loss back through the network
- Update the parameters based on the loss
- Iterate the previous steps until the loss is *minimized*
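
The six steps can be sketched for the smallest possible network: one sigmoid neuron learning the OR gate. This is an illustrative sketch, not a general training framework. The dataset, the learning rate of 0.5, the epoch count, and all function names are assumptions made for the example, and it relies on the fact that for binary cross-entropy with a sigmoid output the gradient with respect to the weighted sum is simply `prediction - target`.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(inputs, weights, bias):
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)

def train_neuron(data, epochs=2000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the parameters with random values.
    weights = [rng.uniform(-1, 1) for _ in data[0][0]]
    bias = rng.uniform(-1, 1)
    for _ in range(epochs):  # Step 6: iterate the previous steps.
        for inputs, target in data:
            # Step 2: feed the input forward through the neuron.
            prediction = predict(inputs, weights, bias)
            # Steps 3-4: compare with the target. For sigmoid plus binary
            # cross-entropy, the loss gradient w.r.t. the weighted sum
            # is (prediction - target).
            error = prediction - target
            # Step 5: move each parameter against its gradient.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical toy dataset: the logical OR of two binary inputs.
or_gate = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(or_gate)
for inputs, target in or_gate:
    print(inputs, target, round(predict(inputs, weights, bias), 3))
```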
<hr>

## Terms used in neural network

### Activation Functions
- Non-linearity in the network
- Decide whether a neuron can contribute to the next layer
- Functions:
    1. Step Function:<br>
    if value > threshold (e.g. 0) -> activate, else -> do not activate
    2. Linear Function:<br>
    Derivative is constant
    3. Sigmoid Function **Binary Classification**
        - Non-linear
        - Analog Outputs: $\text{sig}(t) \in (0,1)$
        - Vanishing Gradient Problem
    4. Tanh Function
        - Non-linear
        - Derivative steeper than Sigmoid: $\text{tanh}(x) \in (-1,1)$
        - Vanishing Gradient Problem
    5. ReLU Function (Rectified Linear Unit) **If unsure**
        - Non-linear
        - Sparse Activations
        - $\text{R}(z) \in [0, \infty)$
        - Gradient = 0 for negative inputs -> Dying ReLU Problem
    6. Leaky ReLU Function
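
The activation functions listed above are short enough to write out directly. This is a minimal sketch using only the standard library; the leaky slope of 0.01 is a commonly used value and is an assumption here, as the notes do not specify one.

```python
import math

def step(z, threshold=0.0):
    """Fires (1) only when the input exceeds the threshold."""
    return 1.0 if z > threshold else 0.0

def sigmoid(z):
    """Output in (0, 1); saturates for large |z| (vanishing gradient)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Output in (-1, 1); steeper than sigmoid around zero."""
    return math.tanh(z)

def relu(z):
    """Zero for negative inputs, identity otherwise."""
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but keeps a small slope for negative inputs
    (slope=0.01 is an assumed default)."""
    return z if z > 0 else slope * z

for z in (-2.0, 0.0, 2.0):
    print(z, step(z), sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

Because `leaky_relu` never has a zero gradient for negative inputs, it is one way to avoid the dying ReLU problem.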

### Loss Functions
*Regression:* Squared Error, Huber Loss

*Binary Classification:* Binary Cross-Entropy, Hinge Loss

*Multi-Class Classification:* Multi-Class Cross-Entropy, Kullback-Leibler Divergence
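
Several of these losses can be sketched for a single example. The function names, the Huber `delta=1.0` default, and the small `eps` used to avoid `log(0)` are assumptions for illustration; libraries differ in details such as averaging over a batch.

```python
import math

def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    error = abs(y_true - y_pred)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)

def binary_cross_entropy(y_true, p, eps=1e-12):
    """y_true is 0 or 1; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def hinge(y_true, score):
    """y_true is -1 or +1; score is the raw (unsquashed) model output."""
    return max(0.0, 1.0 - y_true * score)

def cross_entropy(true_index, probabilities, eps=1e-12):
    """Multi-class case: negative log-probability of the true class."""
    return -math.log(max(probabilities[true_index], eps))

print(squared_error(3.0, 1.0), huber(0.0, 3.0), hinge(1, 2.0))
```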

### Optimizers
Tie together the *loss function* and the *model parameters* by **updating** the network based on the output of the loss function.

The *loss function* **guides** the optimizer.

### Gradient Descent
An iterative algorithm that starts at a random point on the loss function and travels down its **slope** in steps until it reaches the lowest point of the function.
- Algorithm
  1. calculate what small change in each individual weight would do to the loss function
  2. adjust each parameter based on its gradient
  3. repeat steps 1 and 2, until loss function is as low as possible
- Learning Rate
  - a small number such as 0.001
  - ensures that any changes made to the weights are quite small
- Stochastic Gradient Descent (SGD)
  - use a **subset** of training examples rather than the *entire* lot
  - implementation uses *batches* on each pass
  - use *momentum* to accumulate gradients
  - less intensive computationally
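
The update rule can be sketched on a one-parameter toy loss, $(w - 3)^2$, whose gradient is known in closed form. The loss, the learning rate of 0.1, and the momentum coefficient of 0.9 are assumptions picked for the example. The batching part of SGD is left out here so the update itself stays visible.

```python
def loss_gradient(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    w = start
    for _ in range(steps):
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad(w)
    return w

def momentum_descent(grad, start, learning_rate=0.1, momentum=0.9, steps=300):
    w, velocity = start, 0.0
    for _ in range(steps):
        # The velocity accumulates a decaying sum of past gradients.
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w

print(gradient_descent(loss_gradient, start=0.0))
print(momentum_descent(loss_gradient, start=0.0))
```

Both runs end close to `w = 3`. A much larger learning rate would overshoot the minimum, which is why small values are the usual starting point.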

### Parameters & Hyperparameters
- Model parameters: can be estimated from the data<br>
Examples: weights, biases
- Model hyperparameters: cannot be estimated from the data<br>
Configurations external to the neural network, set before training.<br>
Example: learning rate.

### Epochs, Batches, Batch Sizes & Iterations
Needed only if the dataset is large: the data is fed to the network in **chunks**, one by one.
- Epochs<br>
One epoch: the entire dataset is passed forward and backward through the neural network exactly **ONCE** <br>
Multiple epochs help the model generalize better
- Batches & Batch Size<br>
Divide large dataset into smaller batches. Total number of training examples in a batch is the size.
- Iteration<br>
Number of batches needed to complete one epoch
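
The relationship between these terms is simple arithmetic. The sketch below assumes a hypothetical dataset of 1000 examples and a batch size of 100; the helper name `iterate_batches` is made up for the example.

```python
import math

def iterate_batches(dataset, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

dataset = list(range(1000))   # stand-in for 1000 training examples
batch_size = 100

# Iterations per epoch = number of batches needed to see every example once.
iterations_per_epoch = math.ceil(len(dataset) / batch_size)
batches = list(iterate_batches(dataset, batch_size))
print(iterations_per_epoch, len(batches))  # both are 10
```

When the dataset size is not a multiple of the batch size, the last batch is simply smaller, which is why the count uses `ceil`.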
<hr>


## Types of Learning

### Supervised Learning
Predict the correct label for unseen data.
- Algorithm designed to learn by example
- Trained on *well-labeled* data
- Examples consist of:
  - Input object(vector)
  - Desired output(supervisory signal)

During training, the SL algorithm searches for patterns that correlate with the desired output. After training, it takes unseen inputs and determines which label to assign to them.
1. Classification
  - sort data into classes/categories
  - the model finds features in the data that correlate to a class and creates a *mapping function* -> classify unseen data
  - Popular Algorithm
    1. Linear Classifiers
    2. Support Vector Machines
    3. K-Nearest Neighbor
    4. Random Forest
2. Regression
  - Find relationship between dependent & independent variables
  - predict a continuous value
  - Popular Algorithm
    1. Linear Regression
    2. Lasso Regression
    3. Multivariate Regression

- Applications
  - Bioinformatics
  - Object Recognition
  - Spam Detection
  - Speech Recognition
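
As one concrete instance of learning from labeled examples, a k-nearest-neighbor classifier fits in a few lines. This is a minimal sketch with made-up 2D points and labels; `k=3` is an assumed default.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k closest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))  # a
print(knn_predict(points, labels, (5.1, 5.0)))  # b
```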

### Unsupervised Learning
- Used to uncover **underlying patterns** in data
- Used in exploratory data analysis
- Does not use labelled data, rather relies on the **data features**
- **Goal:** Analyze data and find important underlying patterns

1. Clustering<br>
    grouping data into different clusters or groups
    - Partition Clustering -> each data point belongs to a single cluster
    - Hierarchical Clustering -> clusters within clusters
- Popular Algorithm
  1. K-Means
  2. Expectation Maximization
  3. Hierarchical Cluster Analysis (HCA)
2. Association<br>
   find relationship between different entities

- Application
  - AirBnB
  - Amazon
  - Credit card fraud detection
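
K-Means, the first clustering algorithm listed, can be sketched without any labels at all. The version below takes the starting centroids as an argument so the run is reproducible; the points and starting centroids are invented for the example, and real implementations usually pick the starting centroids randomly.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid   # keep an empty cluster's centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```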

### Reinforcement Learning
  - Enables an *agent* to learn in an interactive *environment* by trial & error, based on feedback from its own actions & experience
  - Uses *rewards* & *punishments* as signals for positive & negative behavior
  - Goal: find a model that gets the maximum reward

Modeled as a *Markov Decision Process*
- Application
  - Robotics
  - Business Strategy Planning
  - Traffic Light Control
  - Web System Configuration
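
A tiny Markov Decision Process makes the reward idea concrete. The sketch below uses tabular Q-learning, one standard reinforcement learning algorithm that the notes do not name, so treat it as a swapped-in illustration. The environment is a hypothetical corridor of 5 states where only reaching the rightmost state pays a reward of 1. The agent explores with purely random actions, which Q-learning tolerates because it is off-policy.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    goal = n_states - 1
    # q[state][action]: action 0 moves left, action 1 moves right.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            action = rng.randrange(2)  # trial and error: random exploration
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Move the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q = q_learning()
policy = ["left" if left > right else "right" for left, right in q[:-1]]
print(policy)
```

The learned policy prefers moving right in every non-goal state, since that is the only way to collect the reward.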

### Regularization
Core problem: a model should perform well on both the training data & NEW data; the most common failure is *Overfitting*

- Tackling Overfitting
1. Dropout:
randomly removes some nodes & their connections during training
2. Dataset Augmentation:
More data -> better model<br>
apply transformations to the existing dataset to synthesize more data
3. Early Stopping: stop training once the training error keeps decreasing while the validation error starts to increase
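
Two of these techniques can be sketched directly. The dropout below is the "inverted" variant, which rescales the surviving activations so their expected value is unchanged; that choice, the `patience` parameter, and the validation losses are assumptions made for illustration.

```python
import random

def dropout(activations, drop_probability=0.5, rng=random):
    """Zero each activation with the given probability and rescale
    the survivors (inverted dropout). Used during training only."""
    keep = 1.0 - drop_probability
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch with the best validation loss, stopping the scan
    once the loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

print(dropout([1.0, 1.0, 1.0, 1.0], rng=random.Random(0)))
# Hypothetical validation curve: improves, then starts rising.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8]))  # 2
```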
<hr>

## NN Architectures

### Fully-Connected Feed Forward NN
Each neuron is connected to every neuron in the subsequent layer, with no backward connections
- Inputs
- Output
- Hidden Layers
- Neurons per hidden layer
- Activation functions
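
These ingredients can be seen together in a forward pass through a small fully connected network. The layer sizes (2 inputs, 3 hidden neurons, 1 output), the weights, and the choice of ReLU and sigmoid are all hypothetical values picked so the arithmetic is easy to follow.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: every output neuron sees every input.
    weights holds one row of input weights per output neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical parameters, chosen only for illustration.
hidden = dense([1.0, 2.0],
               weights=[[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]],
               biases=[0.0, 0.0, 1.0],
               activation=relu)
output = dense(hidden, weights=[[1.0, 1.0, 1.0]], biases=[-1.5],
               activation=sigmoid)
print(hidden)  # [0.0, 1.5, 0.0]
print(output)  # [0.5]
```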

### Recurrent NN
- Feed-Forward Neural Networks<br>
take fixed-sized input and return fixed-sized outputs
- Vanilla NN can not process *sequence data*
- *feedback loop* in the hidden layer
- Train RNN
  - back propagation algorithm
  - applied for every *sequence data point*
  - back propagation through time (BPTT)
  - short-term memory of a RNN is due to Vanishing Gradient Problem(VGP)
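
The feedback loop amounts to carrying a hidden state from one sequence element to the next. The sketch below is a scalar, forward-only vanilla RNN with made-up weights; training with back propagation through time is not shown.

```python
import math

def rnn_forward(sequence, w_input, w_hidden, bias, h0=0.0):
    """Apply the same cell to each element, feeding the previous
    hidden state back in alongside the new input."""
    hidden_states = []
    h = h0
    for x in sequence:
        h = math.tanh(w_input * x + w_hidden * h + bias)
        hidden_states.append(h)
    return hidden_states

# A single impulse followed by zeros (hypothetical input and weights).
states = rnn_forward([1.0, 0.0, 0.0], w_input=1.0, w_hidden=0.5, bias=0.0)
print(states)
```

The hidden state shrinks at every step after the impulse, a small-scale picture of why a vanilla RNN only keeps short-term memory.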

### LSTMs & GRNNs
- Long Short Term Memory
  - Input Gate
  - Forget Gate
  - Output Gate
- Gated RNN
  - Update Gate
  - Reset Gate
- Application
  - NLP
  - Sentiment Analysis
  - DNA sequence classification
  - Speech recognition
  - Language translation

### Convolutional NNs (CNN)
- inspired by the organization of neurons in the *visual cortex* of human brain
- Good for processing data such as images, audio and video
- Hidden Layers
  - convolutional layers<br>
    Input -> 2D, Output -> 1D, extract features in *chunks* by *kernel*
  - pooling layers<br>
    reduce the number of neurons: 1. Max & 2. Min pooling
  - fully-connected layers
  - normalization layers
- Application
  - Computer vision
  - image recognition
  - image processing
  - image segmentation
  - video analysis
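
The convolution and pooling layers can be sketched on a tiny image with plain nested lists. The 5x5 image and the 2x2 vertical-edge kernel are invented for the example, and the sketch assumes stride 1 with no padding for the convolution and non-overlapping 2x2 windows for max pooling.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take
    the sum of elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# Hypothetical image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]

features = convolve2d(image, vertical_edge)
pooled = max_pool2d(features)
print(features)  # each row is [0, 2, 0, 0]
print(pooled)    # [[2, 0], [2, 0]]
```

The kernel responds only where brightness changes from left to right, and pooling shrinks the 4x4 feature map to 2x2 while keeping that response.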
<hr>

## Creating a DL Model

1. Gathering Data<br>
 - picking data is the key
 - Size
   - amount of data = 10x number of model parameters
   - regression: 10 examples per predictor variable
   - image: 1000 images per class
 - Quality
   - label errors
   - noisy
 - resource:
   - [UCI](https://archive.ics.uci.edu)
   - [kaggle](https://kaggle.com/datasets)
   - [Google dataset](https://datasetsearch.research.google.com)
2. Preprocessing the data
- splitting dataset into *subsets*
   - Train on *training data*
   - Evaluate on *validation data*
   - Test it on *testing data*
   - number of hyperparameters -> decides how big the validation set should be
   - Cross-Validation
- Formatting -> CSV file
- Missing data -> 'NaN' or 'Null'
   - Eliminating
   - imputing the missing value
- Sampling -> small sample of the dataset<br>
    Example weight = Original Weight x Downsampling Factor
    - Faster convergence
    - Reduce disk space
    - dataset is in *similar ratio*
- Feature Scaling
  1. Normalization
  2. Standardization  
3. Training the model<br>
Feed data -> Forward propagation -> Loss function -> Back propagation
4. Evaluation<br>
Test the model on *validation data*
5. Optimization
    - Hyperparameter Tuning
      - increase number of epochs
      - adjust learning rate
      - initial conditions play a large role
    - Addressing Overfitting
      - Getting more data/Reducing model size
      - Regularization: L1 / L2
    - Data Augmentation
      - artificially increasing dataset 
    - Dropout
      - randomly drops out some neurons
      - reduces co-dependency of neurons
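
The preprocessing steps from item 2 can be sketched with the standard library. The 70/15/15 split ratio, the use of `None` to stand for a missing value, and mean imputation are assumptions for this example; the right choices depend on the dataset.

```python
import random
import statistics

def split_dataset(examples, train=0.7, validation=0.15, seed=0):
    """Shuffle, then cut into training, validation and testing subsets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def impute_mean(values):
    """Replace missing entries (None) with the mean of the known ones."""
    mean = statistics.fmean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

def normalize(values):
    """Min-max scaling to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

train_set, validation_set, test_set = split_dataset(list(range(100)))
print(len(train_set), len(validation_set), len(test_set))  # 70 15 15
print(impute_mean([1.0, None, 3.0]))   # [1.0, 2.0, 3.0]
print(normalize([2.0, 4.0, 6.0]))      # [0.0, 0.5, 1.0]
print(standardize([2.0, 4.0, 6.0]))
```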