In [None]:
'''

Best Practices

1. Never initialize all weights to 0 or to the same value: every unit in a layer then receives the
   same update and learns the same feature. Random initialization is needed to break this symmetry.
   
2. Always normalize the inputs.

3. Use ReLU as the hidden-layer activation, but be careful with the learning rate and monitor the fraction of dead units.

4. If ReLU gives problems, try Leaky ReLU, PReLU, or Maxout. Do not use sigmoid in hidden layers.

5. Never initialize the weights to large values; otherwise the activations and gradients can blow up.

6. The variance of the initial weights should be inversely proportional to the number of inputs to the layer.

7. The sigmoid and hyperbolic tangent activation functions are hard to use in networks with many layers due to 
   the vanishing gradient problem.
   
8. Use He initialization for ReLU and Leaky ReLU.

9. The Rectified Linear Unit (ReLU) activation function helps prevent vanishing gradients.

10. Vanishing gradients make the weight updates in the early layers very small,
    so the model needs more time to converge.

'''
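
Points 6 and 8 above can be illustrated with a minimal sketch of He initialization, assuming a normal distribution with standard deviation sqrt(2 / fan_in). The helper names `he_normal` and `sample_std` are made up for this example and are not part of any library.

```python
import math
import random


def he_normal(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out weight matrix with std = sqrt(2 / fan_in)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def sample_std(matrix):
    """Standard deviation over all entries of a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# The more inputs a layer has, the smaller its initial weights.
weights = he_normal(fan_in=256, fan_out=128)
print(round(sample_std(weights), 3), round(math.sqrt(2.0 / 256), 3))
```

The two printed numbers should be close: the measured spread of the sampled weights and the target sqrt(2 / fan_in).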

In [None]:
'''
Best practices:

1. Use batch normalization; it can help prevent exploding gradients.
    - It normalizes each layer's output activations to a mean of zero and a standard deviation of one.

2. Use gradient clipping:
    - Gradient clipping forces the gradient values (element-wise) to a specific minimum or maximum value 
      if the gradient exceeds an expected range.

3. Use callbacks to help build better models.

4. The learning rate should not be too low, because training then progresses very slowly
   and takes many weight updates to reach the minimum.
   
5. The learning rate should not be too high, because the weight updates then become too large
   and training often fails to converge or even diverges.
   
6. If you want to train the neural network in less time and more efficiently, Adam is usually a good default optimizer.

7. Use checkpoints to save the model during training.
'''
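
The element-wise gradient clipping described in point 2 can be sketched in plain Python as follows. `clip_by_value` mirrors the description above; `clip_by_norm` is a commonly used alternative that is added here as an assumption, not something the notes mention. Both function names are illustrative.

```python
import math


def clip_by_value(grads, min_value, max_value):
    """Clip each gradient component into [min_value, max_value]."""
    return [max(min_value, min(max_value, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [0.5, -3.0, 12.0]
print(clip_by_value(grads, -1.0, 1.0))
print(clip_by_norm([3.0, 4.0], 1.0))
```

Value clipping can change the direction of the gradient, while norm clipping keeps the direction and only shortens the step.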

In [None]:
'''
Recommendation on Activation Functions

    - Softmax is used only for the output layer
    - Sigmoid and tanh functions are sometimes avoided in hidden layers due to the vanishing gradient problem
    - Tanh is often avoided in deep networks because it saturates (the dead neuron problem applies to ReLU, not tanh)
    - ReLU function should only be used in the hidden layers
    - The output layer can use a linear activation function in regression problems.
    - Multilayer Perceptron (MLP): ReLU activation function.
    - Convolutional Neural Network (CNN): ReLU activation function.
    - Recurrent Neural Network: Tanh and/or Sigmoid activation function.
    - As a rule of thumb, begin with the ReLU function and move to other activation functions 
      if ReLU does not give good results
'''
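
For reference, the activation functions named above can be written in a few lines of plain Python. This is a sketch for scalar inputs (and a list of logits for softmax); the `alpha=0.01` slope for Leaky ReLU is an assumed, commonly used default.

```python
import math


def relu(x):
    """max(0, x): zero output and zero gradient for negative inputs."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return x if x > 0 else alpha * x


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), leaky_relu(-2.0), sigmoid(0.0), math.tanh(0.0))
print(softmax([1.0, 2.0, 3.0]))
```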

In [None]:
'''
Checklist to improve performance:

1. Analyze errors (bad predictions) in the validation dataset.

2. Monitor the activations. Consider batch or layer normalization if they are not zero-centered or approximately normally distributed.

3. Monitor the percentage of dead nodes.

4. Apply gradient clipping (in particular for NLP models) to control exploding gradients.

5. Shuffle dataset (manually or programmatically).

6. Balance the dataset (each class has a similar number of samples).

'''
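
One way to monitor the percentage of dead nodes (item 3) is to count the units whose ReLU output is zero for every sample in a batch. The sketch below assumes the activations are available as a list of rows, one row per sample; `dead_unit_fraction` is an illustrative name.

```python
def dead_unit_fraction(activations):
    """Fraction of units whose output is zero for every sample.

    activations: list of rows, one row per sample, one column per unit.
    """
    n_units = len(activations[0])
    dead = sum(
        1
        for j in range(n_units)
        if all(row[j] == 0.0 for row in activations)
    )
    return dead / n_units


# Units 0 and 2 never fire on this batch, so half of the layer is dead.
batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.8],
    [0.0, 2.1, 0.0, 0.0],
]
print(dead_unit_fraction(batch))
```

A unit that is zero on one batch is not necessarily dead for good, so it helps to check this over several batches.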

In [None]:
'''
Tuning
    - Learning rate tuning
    - Mini-batch size
    - Regularization factors
    - Layer-specific hyperparameters (like dropout)
'''

'''
Advanced Tuning
    - Learning rate decay schedule
    - Momentum
    - Early stopping

'''

In [None]:
'''

Hyperparameters in Deep Learning :
  - Number of Hidden units
  - Learning rate
  - Epochs
  - Network weight Initialization
  - Activation Function
  - Batch Size
  - Momentum

'''

In [None]:
'''
Challenges:

    1. Availability of Enough Data for Training our Model
    2. Model Overfitting
    3. Model Underfitting
    4. Training Time is too High

'''

In [None]:
'''
Challenges with all types of gradient descent:
    - Choosing an optimum value of the learning rate. 
    - If the learning rate is too small, then gradient descent may take ages to converge.
    - Having a constant learning rate for all the parameters. There may be some parameters which we may not want to 
      change at the same rate.
    - It may get trapped in local minima.
'''
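
The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, whose gradient is 2*x; the function, the starting point and the three learning rates are chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=50, x0=1.0):
    """Minimize f(x) = x**2 (gradient 2*x) and return the final x."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x


# Too small: barely moves. Moderate: reaches the minimum at 0. Too large: diverges.
for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

Each step multiplies x by (1 - 2 * learning_rate), so the iterate shrinks toward 0 only when that factor has magnitude below 1.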

In [None]:
'''
Improving The Keras Model 
    - Add additional layers to our network(Hidden layers with neurons)
    - Improving the simple net in Keras with dropout
    - Testing different optimizers in Keras
    - Tune the number of neurons in hidden layers, etc.
    - Add or reduce the number of dense layers
    - Increasing the number of epochs
    - Controlling the optimizer learning rate
    - Increasing the number of internal hidden neurons
    - Increasing the size of batch computation(Batch Size)
    - Adopting regularization for avoiding overfitting
    - Hyperparameters tuning
        - hidden neurons, BATCH_SIZE, epochs
'''

In [None]:
'''
Improving Simple Keras Network :
    - By adding additional layers (Hidden)
    - Randomly drop some of the propagated values with the dropout probability
    - Try different Optimizers (RMSProp, Adam, SGD)
    - Increasing the number of epochs (20 to 200), raising the value by 10 each time
    - Controlling the optimizer learning rate: [1E-0, 1E-1, 1E-2, 1E-3, 1E-4, 1E-5, 1E-6, 1E-7] = [1, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001]
    - Increasing the number of internal hidden neurons [32,64,128,256,512,1024]
    - Increasing the size of batch computation [64,128,256,512]
    - Try different types of regularization (L1 regularization; L2 regularization; Elastic net regularization)
'''
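
A search over the value lists above can be set up as a simple grid. This is only a sketch of how the combinations might be enumerated; it uses a subset of the values, and the training call itself is left out because it depends on the model.

```python
from itertools import product

# Candidate values (a subset of the lists above, chosen for illustration).
learning_rates = [10.0 ** -k for k in range(1, 5)]  # roughly 0.1 down to 0.0001
hidden_units = [32, 64, 128, 256]
batch_sizes = [64, 128, 256, 512]

# One dictionary per combination; each would be passed to a training run.
grid = [
    {"learning_rate": lr, "hidden_units": h, "batch_size": b}
    for lr, h, b in product(learning_rates, hidden_units, batch_sizes)
]

print(len(grid))
print(grid[0])
```

The grid grows multiplicatively with every added hyperparameter, so it is usually worth starting with a coarse grid and refining around the best values.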

In [None]:
'''
Data Optimization :

    - Balance your data set
        - Subsample Majority Class: You can balance the class distributions by subsampling the majority class.
        - Oversample Minority Class: Sampling with replacement can be used to increase your minority class proportion.

'''
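
Both balancing strategies can be sketched with the standard library. The helper names below (`group_by_class`, `oversample_minority`, `subsample_majority`) are made up for this example, and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter


def group_by_class(samples, labels):
    """Map each label to the list of samples that carry it."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return groups


def oversample_minority(samples, labels, seed=0):
    """Sample with replacement until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = max(len(items) for items in groups.values())
    out = []
    for y, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out.extend((x, y) for x in items + extra)
    rng.shuffle(out)
    return [x for x, _ in out], [y for _, y in out]


def subsample_majority(samples, labels, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = min(len(items) for items in groups.values())
    out = []
    for y, items in groups.items():
        out.extend((x, y) for x in rng.sample(items, target))
    rng.shuffle(out)
    return [x for x, _ in out], [y for _, y in out]


samples = list(range(10))
labels = [0] * 8 + [1] * 2
_, over_labels = oversample_minority(samples, labels)
_, under_labels = subsample_majority(samples, labels)
print(Counter(over_labels), Counter(under_labels))
```

Resampling should be applied to the training split only, so that the validation data keeps the original class distribution.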

In [None]:
'''
Note :

    - Poor choice of learning rate that results in large weight updates.
    - Poor choice of data preparation, allowing large differences in the target variable.
    - Poor choice of loss function, allowing the calculation of large error values.

'''

In [None]:
'''
When to use : Optimizers

    - Mini Batch SGD           --> Use when network is small or shallow
    - Momentum with GD/ SGD    --> Works well in most cases but is slightly slower to converge
    - AdaGrad/AdaDelta/RMSProp --> Use when there is sparse data
    - Adam and its variations  --> A good default in most cases

'''

In [None]:
'''
Consider while Weight Initialization :
    - Initializing all weights to 0:
            - Initializing all the weights with zeros leads the neurons to learn the same features during training.
            - The network then performs no better than a linear model.
    - Initializing weights randomly:
            - Using a standard normal distribution
                - leads to two issues
                    - Vanishing gradients
                    - Exploding gradients
    - Initializing weights too large:
        - A too-large initialization leads to exploding gradients

    - Initializing weights too small:
        - A too-small initialization leads to vanishing gradients

'''

In [None]:
'''

How to find appropriate initialization values:
    - The mean of the activations should be zero.
    - The variance of the activations should stay the same across every layer.


'''
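
The "same variance across every layer" criterion can be checked numerically. The sketch below pushes one random input through a stack of ReLU layers and records the root-mean-square activation per layer for three weight scales. The layer width, depth and the three scales are assumptions chosen for illustration, and RMS is used instead of variance because ReLU outputs are not zero-mean.

```python
import math
import random


def forward_scale(weight_std_fn, n_layers=10, width=64, seed=0):
    """Return the RMS activation after each of n_layers random ReLU layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    scales = []
    for _ in range(n_layers):
        std = weight_std_fn(width)
        # Each output unit is a ReLU over a freshly sampled weighted sum of x.
        x = [
            max(0.0, sum(rng.gauss(0.0, std) * xi for xi in x))
            for _ in range(width)
        ]
        scales.append(math.sqrt(sum(v * v for v in x) / width))
    return scales


too_large = forward_scale(lambda fan_in: 1.0)                  # standard normal
too_small = forward_scale(lambda fan_in: 0.01)                 # tiny weights
he_scaled = forward_scale(lambda fan_in: math.sqrt(2.0 / fan_in))

print("std = 1.0        :", too_large[-1])
print("std = 0.01       :", too_small[-1])
print("std = sqrt(2/n)  :", he_scaled[-1])
```

With the standard normal the signal grows layer by layer, with tiny weights it shrinks toward zero, and with the fan-in-based scale it stays at roughly the same order of magnitude.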

In [None]:
# Metrics

'''
When to Use it?


Accuracy :

- When to use: classification problems that are well balanced, with no skew or class imbalance
- When not to use: the target classes are imbalanced (one class makes up the majority of the data)


Sensitivity / Specificity
- There are some cases where Sensitivity is important and needs to be close to 1.
- There are business cases where Specificity is important and needs to be close to 1.
- We need to understand the business problem and decide the importance of Sensitivity and Specificity.


- Precision is a useful metric in cases where false positives are a higher concern than false negatives.
  
- Recall is important in medical cases where it doesn’t matter whether we raise a false alarm but 
  the actual positive cases should not go undetected.

'''
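
The metrics above all come from the four confusion-matrix counts. The sketch below computes them for a made-up, heavily imbalanced example (10 positives, 990 negatives); the counts are invented for illustration, and the function assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),    # how many predicted positives are real
        "recall": tp / (tp + fn),       # also called sensitivity
        "specificity": tn / (tn + fp),  # how many negatives are recognized
    }


# The model finds only 2 of the 10 positives but is right on almost all negatives.
metrics = classification_metrics(tp=2, fp=1, fn=8, tn=989)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Accuracy looks excellent here even though most positive cases are missed, which is why accuracy alone is a poor choice on imbalanced data.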

In [None]:
'''

When to use Metrics (MAE,MSE and RMSE)


When there are outliers in the data, do we include them in model creation or ignore them?
    - If I accept the outliers as part of model development, I would use the MSE to ensure that
      my model takes these outliers into account more.
    - If I wanted to downplay their significance, I would use the MAE, since the outlier residuals won’t contribute 
      as much to the total error as with MSE.
      
MAE:  
    - If there are many outliers, consider using Mean Absolute Error; it does not punish huge errors as heavily as MSE.
    - If we want to treat all errors equally,  MAE is a better measure.


MSE : 
    - It is mostly used when the dataset contains outliers or unexpected values (too high or too low) and we want 
      the model to account for them, since these values dominate the MSE.
    - If we want to give more weight to large errors, MSE/RMSE is better
      
      
RMSE : 
       - RMSE assigns a higher weight to larger errors. 
       
       - RMSE is therefore much more useful when large errors are present and they drastically affect the 
         model's performance
       - Use it when the conditional distribution of your observations (target value) is asymmetric and you want an unbiased fit.
       
- If you have outliers in the data and you want to ignore them, MAE is a better option, but if you want to account for them 
  in your loss function, go for MSE/RMSE.

'''
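
The difference between the three metrics shows up clearly on a small example with one outlier. The numbers below are made up: the first set of predictions is off by 1 everywhere, the second is identical except for a single error of 20.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: large errors are weighted quadratically."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
y_pred_clean = [11.0, 11.0, 12.0, 12.0, 13.0]    # every error is 1
y_pred_outlier = [11.0, 11.0, 12.0, 12.0, 32.0]  # last error is 20

for name, y_pred in (("clean", y_pred_clean), ("outlier", y_pred_outlier)):
    print(name, mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

The single large error raises the MAE moderately but dominates the MSE and RMSE, which is the behavior the notes above describe.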