# Regularization in Deep Learning

Regularization reduces overfitting and improves generalization, so the model performs better on unseen data. Below are the most common techniques:

In [3]:
from tensorflow import keras
from tensorflow.keras import layers, models, losses

## L1 Regularization (Lasso)
- Adds a penalty term to the loss function that discourages large weights
- The penalty is the sum of the absolute values of the weights, scaled by a regularization factor
- Tends to drive some weights to exactly zero, which makes the weights sparse
- Applied to an individual layer (recommendation: hidden layer with highest complexity)

![lasso](https://www.researchgate.net/publication/352564627/figure/fig1/AS:1037128163135494@1624282031207/Neural-network-representation-of-the-lasso-problem-Equation-2.png)

In [2]:
from tensorflow.keras.regularizers import l1

In [None]:
nn_model = models.Sequential([
                          layers.Input(shape=(28,28)),
                          layers.Flatten(),
                          layers.Dense(140, activation='relu', kernel_regularizer=l1(0.01)),
                          layers.Dense(10)
])
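
To make the penalty concrete, here is a minimal sketch in plain Python (so it runs without TensorFlow) of what an L1 term adds to the loss. The function name `l1_penalty`, the example weights, and the data-loss value are illustrative assumptions; the factor `0.01` mirrors the `l1(0.01)` used in the model.

```python
def l1_penalty(weights, factor):
    """Return factor * sum(|w|) over all weights."""
    return factor * sum(abs(w) for w in weights)

# Hypothetical weights of one layer, flattened into a list.
weights = [0.5, -1.5, 0.0, 2.0]

data_loss = 0.30  # assumed value of the unregularized loss
total_loss = data_loss + l1_penalty(weights, factor=0.01)

print(l1_penalty(weights, 0.01))  # the extra cost paid for the weight magnitudes
print(total_loss)                 # the value the optimizer actually minimizes
```

Because the penalty grows with the absolute size of every weight, the optimizer can lower the total loss by pushing unimportant weights all the way to zero.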

## L2 Regularization (Ridge)
- Adds a penalty term to the loss function that discourages large weights
- The penalty is the sum of the squared weights, scaled by a regularization factor
- Tends to shrink the weights toward zero, but doesn't drive them to exactly zero
- Applied to an individual layer (recommendation: hidden layer with highest complexity)

In [None]:
from tensorflow.keras.regularizers import l2

nn_model = models.Sequential([
                        layers.Input(shape=(28,28)),
                        layers.Flatten(),
                        layers.Dense(140, activation='relu', kernel_regularizer=l2(0.01)),
                        layers.Dense(10)
])
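
The shrinking behaviour can be seen in a small sketch, again in plain Python. It assumes plain gradient descent and looks at the penalty alone (the data loss is ignored); the function names, starting weights, and learning rate are illustrative assumptions.

```python
def l2_penalty(weights, factor):
    """Return factor * sum(w**2) over all weights."""
    return factor * sum(w * w for w in weights)

def l2_shrink_step(weights, factor, learning_rate):
    """One gradient-descent step on the L2 penalty alone.

    The gradient of factor * w**2 is 2 * factor * w, so every weight
    is multiplied by the same constant (1 - 2 * learning_rate * factor).
    """
    return [w - learning_rate * 2 * factor * w for w in weights]

weights = [0.5, -1.5, 2.0]  # hypothetical starting weights
for _ in range(100):
    weights = l2_shrink_step(weights, factor=0.01, learning_rate=0.1)

print(weights)  # smaller in magnitude than at the start, but none is exactly zero
```

Each step scales the weights by a constant slightly below 1, which is why they get smaller without ever reaching zero.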

## L1 and L2 Combined (ElasticNet)
- Adds both penalties to the loss, so it combines the properties of L1 (sparsity) and L2 (weight shrinkage)

In [None]:
from tensorflow.keras.regularizers import l1_l2

nn_model = keras.models.Sequential([
                        layers.Flatten(input_shape=(28, 28)),
                        layers.Dense(140, activation='relu', kernel_regularizer=l1_l2(l1=1e-5, l2=1e-4)), # notice the very small values
                        layers.Dense(10)
                      ])
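
As a rough sketch, the combined penalty is simply the sum of the two terms. The function below is an illustrative plain-Python stand-in, not the Keras implementation; the weights are hypothetical and the factors mirror `l1_l2(l1=1e-5, l2=1e-4)`.

```python
def l1_l2_penalty(weights, l1, l2):
    """Return l1 * sum(|w|) + l2 * sum(w**2)."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return l1_term + l2_term

weights = [0.5, -1.5, 0.0, 2.0]  # hypothetical layer weights
penalty = l1_l2_penalty(weights, l1=1e-5, l2=1e-4)
print(penalty)
```

Setting either factor to zero recovers pure L2 or pure L1 regularization.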

## Dropout
- Randomly sets a fraction of a layer's outputs to zero during training; nothing is dropped at inference time
- Introduces noise and prevents co-adaptation of neurons (reduces interdependence)
- It's added as a layer object, but it acts on the output of the previous layer
- Applied to layers individually (recommendation: hidden layer with highest complexity)
 
![dr](https://dataheroes.ai/wp-content/uploads/2023/04/post-2378-03.png)
 


In [None]:

nn_model = keras.models.Sequential([
                        layers.Flatten(input_shape=(28, 28)),
                        layers.Dense(140, activation='relu'),
                        layers.Dropout(0.1), #Fraction of the input units to randomly drop (10% dropped)
                        layers.Dense(10)
                      ])
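
The mechanics of dropout can be sketched in a few lines of plain Python. This is an illustration of the "inverted dropout" scheme, in which kept units are scaled up by `1 / (1 - rate)`; the function name, the `rng` argument, and the example activations are assumptions made for the sketch.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations."""
    if not training:
        return list(values)  # inference: nothing is dropped
    keep_scale = 1.0 / (1.0 - rate)
    # Each unit is dropped independently with probability `rate`;
    # kept units are scaled up so the expected sum stays the same.
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10

print(dropout(activations, rate=0.1, training=True, rng=rng))   # zeros and scaled-up values
print(dropout(activations, rate=0.1, training=False, rng=rng))  # unchanged
```

Because a different random subset is dropped at every training step, no neuron can rely on one specific neighbour always being present.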

## Batch Normalization

- **Normalization** Normalizes the activations of the previous layer over each mini-batch to zero mean and unit variance (standard deviation of 1)
- **Scaling and Shifting** Two learnable parameters then rescale (gamma) and shift (beta) the normalized output
- Applied to layers individually (recommendation: hidden layer with highest complexity)

![bn](https://miro.medium.com/v2/resize:fit:709/1*Y0EtAQpR2iBsv97YrwRywg.png)

In [None]:

nn_model = keras.models.Sequential([
                        layers.Flatten(input_shape=(28, 28)),
                        layers.Dense(140, activation='relu'),
                        layers.BatchNormalization(),
                        layers.Dense(140, activation='relu'),
                        layers.Dense(10)
                      ])
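
The two steps can be sketched for a single feature in plain Python. The sketch covers only the training-time computation on one mini-batch (at inference, running averages of the mean and variance are used instead); the function name and the example batch are illustrative assumptions, and `epsilon` is a small constant that avoids division by zero.

```python
import math

def batch_norm(batch, gamma, beta, epsilon=1e-3):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(variance + epsilon) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]

# Hypothetical activations of a single neuron for a mini-batch of 4 samples.
batch = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(batch, gamma=1.0, beta=0.0)
print(out)  # centred on zero with a variance close to one
```

With `gamma=1.0` and `beta=0.0` the output is just the normalized input; during training the network learns values for both parameters, so it can undo the normalization where that helps.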

## Early Stopping

- It monitors a validation metric (usually the validation loss) and stops training when that metric stops improving
- In other words, training stops once the model's performance on the validation subset no longer improves or starts degrading
- Unlike the other techniques, it is passed to `fit()` as a callback and is not part of the model architecture
- You set 2 key parameters:
    - `monitor`: The metric to be monitored (e.g. `val_loss`)
    - `patience`: Number of epochs with no improvement after which training will be stopped

![ES](https://miro.medium.com/v2/resize:fit:708/1*LSjaVNMa-ku85I35of-mAw.png)

In [5]:
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
# neural net structure
nn_model = keras.models.Sequential([
                        layers.Flatten(input_shape=(28, 28)),
                        layers.Dense(140, activation='relu'), #activation=tf.nn.relu
                        layers.BatchNormalization(),
                        layers.Dense(10)
                      ])
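# compiling; the last Dense layer outputs raw logits, so the loss needs from_logits=True
nn_model.compile(optimizer='adam',
                 loss=losses.SparseCategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

# stop once val_loss has not improved for 2 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=2)

# X_train and y_train are assumed to be loaded earlier (28x28 images with integer labels)
model_hist = nn_model.fit(X_train, y_train, epochs=10, validation_split=0.2, callbacks=[early_stopping])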
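
The `patience` logic can be traced with a small plain-Python sketch. It is a simplified illustration of the idea, not the Keras source; the function name and the list of validation losses are made up for the example.

```python
def epochs_until_stop(val_losses, patience):
    """Return how many epochs run before early stopping triggers."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch + 1  # stop after this epoch
    return len(val_losses)  # never triggered: all epochs run

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61]
print(epochs_until_stop(val_losses, patience=2))  # 5
```

The best loss (0.60) is reached in the third epoch; the next two epochs do not beat it, so training stops after the fifth epoch and the sixth never runs.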

**Bonus** You can pass several callbacks to `fit()` through the `callbacks` argument, such as:
- `EarlyStopping`
- `ModelCheckpoint` to save a snapshot of the model at the end of every epoch
- `TensorBoard` to log the model's performance for monitoring and visualization in TensorBoard

In [None]:
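my_callbacks = [
    keras.callbacks.EarlyStopping(patience=2),  # monitors val_loss by default
    keras.callbacks.ModelCheckpoint(filepath='model.{epoch:02d}-{val_loss:.2f}.h5'),
    keras.callbacks.TensorBoard(log_dir='./logs'),
]
# validation data is required: EarlyStopping and the checkpoint file name both use val_loss
nn_model.fit(X_train, y_train, epochs=10, validation_split=0.2, callbacks=my_callbacks)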
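
The checkpoint `filepath` is a format template that is filled in with the epoch number and the logged metrics. The sketch below only reproduces that string formatting in plain Python; the metric values are hypothetical.

```python
template = 'model.{epoch:02d}-{val_loss:.2f}.h5'

# Hypothetical metrics logged at the end of the third epoch.
logs = {'loss': 0.3912, 'val_loss': 0.4567}
filename = template.format(epoch=3, **logs)
print(filename)  # model.03-0.46.h5

# Without validation data there is no 'val_loss' entry to fill in.
try:
    template.format(epoch=3, loss=0.3912)
except KeyError as missing:
    print('missing metric:', missing)
```

This is why a template that mentions `val_loss` only works when `fit()` is given validation data.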

## Choosing The Right Regularization Technique
- The right choice depends on the specific problem and dataset.
- Finding the best method or combination of methods is therefore a trial-and-error process (hyperparameter tuning)
- Recommendations:
    - **Fine Tuning** Start with a low level of regularization and increase it gradually until you get the desired outcome
    - **Model Complexity** Complex models (e.g. many layers or a large number of neurons) may need stronger regularization
    - **Dataset Size** Larger datasets usually need less regularization
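
The "start low and increase" advice can be sketched as a simple sweep. Everything here is hypothetical: `toy_validation_loss` is a made-up stand-in for "build, train, and validate a model with this regularization strength", chosen only so the sketch runs without any training.

```python
import math

def sweep_regularization(evaluate, strengths):
    """Try strengths from weakest to strongest; return the best one and all results."""
    results = {s: evaluate(s) for s in sorted(strengths)}
    best = min(results, key=results.get)
    return best, results

def toy_validation_loss(strength):
    """Made-up U-shaped curve: too little overfits, too much underfits."""
    return (math.log10(strength) + 3) ** 2 + 0.3

best, results = sweep_regularization(toy_validation_loss, [1e-5, 1e-4, 1e-3, 1e-2, 1e-1])
print(best)  # the strength with the lowest (toy) validation loss
```

In practice `evaluate` would train a model for each strength and return its validation loss, which is far more expensive, so the list of candidate strengths is usually kept short.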