# When to Use L1/L2/Elastic Net Regularization

Choosing between L1, L2, and Elastic Net regularization involves considering the specific characteristics of your dataset, your model's requirements, and the computational resources at your disposal. Here's a detailed summary of when to use each, along with their pros and cons:

### L1 Regularization (Lasso)

**When to Use:**
- When you have a high-dimensional dataset with many features, but suspect only a few are actually important (sparse feature set).
- When model interpretability is important, as L1 can zero out the coefficients of less important features, effectively performing feature selection.

**Pros:**
- Encourages sparsity, thus performing implicit feature selection.
- Can help improve model interpretability by eliminating irrelevant features.

**Cons:**
- Can be unstable in the presence of highly correlated features, arbitrarily selecting one feature over another.
- Not well-suited for situations where many small or correlated features contribute to the outcome.

### L2 Regularization (Ridge)

**When to Use:**
- When dealing with multicollinearity (i.e., when independent variables are highly correlated).
- When you expect that many small or moderate effects contribute to the outcome, rather than a few variables with large effects.

**Pros:**
- Tends to give better prediction accuracy than L1 when features are correlated.
- Stabilizes the estimation of regression coefficients by penalizing the size of coefficients without setting them to zero.

**Cons:**
- Does not perform feature selection — all features are included in the model, which can make the model complex and harder to interpret.
- Might not be as effective in high-dimensional spaces where sparsity is preferable.

### Elastic Net

**When to Use:**
- When you want to combine the benefits of L1 and L2 regularization.
- In the presence of correlated features, when you also need to perform feature selection.
- When dealing with high-dimensional data where L1 regularization might select too few features, or L2 regularization does not provide enough regularization.

**Pros:**
- Balances between feature selection (L1) and regularization (L2), making it versatile for various scenarios.
- Can outperform L1 and L2 regularization alone in many cases, especially with datasets that have multiple correlated features.

**Cons:**
- Adds an extra layer of complexity in choosing the regularization parameters (requires tuning two hyperparameters).
- Computationally more intensive due to the need for parameter tuning.

### Summary

- **L1 Regularization** is best suited for scenarios where sparsity and feature selection are crucial.
- **L2 Regularization** is ideal for problems with multicollinearity or when all features are expected to contribute towards the prediction.
- **Elastic Net** offers a middle ground, leveraging the strengths of both L1 and L2 regularization, making it a powerful choice for complex datasets, albeit at the cost of increased computational complexity and the need for more sophisticated hyperparameter tuning.

The choice among L1, L2, and Elastic Net should be guided by cross-validation to determine which method offers the best performance for your specific dataset and problem.

# Why Dropout Layers Are Effective

Dropout is a regularization technique used in neural networks that helps prevent overfitting. The intuition behind how dropout acts as a regularizer can be understood through its operational mechanism and the effects it has on the training process and the model's generalization capability. Here’s a breakdown to build that intuition:

### Mechanism of Dropout

- **Random Deactivation**: During training, dropout randomly deactivates a proportion of neurons (i.e., units) in the network at each training step or epoch. This means that a randomly selected subset of neurons does not contribute to the forward pass (calculation of the output) and does not get updated during the backpropagation step.
- **Dynamic Network Thinning**: By deactivating different subsets of neurons at each training step, dropout effectively trains a "thinned" version of the network. Since the deactivated neurons change every time, it’s like training a large ensemble of smaller, different networks.

### Effects and Intuition

1. **Prevention of Co-Adaptations**: Dropout prevents neurons from co-adapting too closely. When neurons rely on the presence of specific other neurons to correct their mistakes, the network can become overly complex and specialized to the training data, leading to overfitting. Dropout ensures that neurons cannot rely on the presence of others, forcing them to become more robust and capable of making correct decisions independently.

2. **Ensemble Interpretation**: The process of randomly dropping out neurons during training can be seen as training a large number of different "thinned" networks with shared weights. At test time, using the full network with scaled-down weights is akin to averaging the predictions of these thinned networks, similar to ensemble methods like bagging. Ensemble methods are known for their ability to improve model robustness and generalization.

3. **Effective Capacity Reduction**: Dropout reduces the effective capacity of the network by limiting the number of active neurons during training. This reduction in capacity helps prevent the network from overfitting by ensuring it cannot memorize the training data too closely.

4. **Noise Introduction**: Dropout introduces noise into the training process, which can be beneficial for preventing overfitting. This noise forces the network to learn more robust features that are invariant to the presence or absence of particular neurons, thereby improving the model's generalization to unseen data.

5. **Regularization without Explicit Constraints**: Unlike traditional regularization methods that add explicit constraints to the loss function (e.g., L1 or L2 penalties), dropout regularizes the model implicitly. It encourages the learning of more generalized representations without directly manipulating the loss function's form.

### Summary

The intuitive understanding of dropout as a regularizer comes from its ability to prevent complex co-adaptations among neurons, simulate an ensemble-like effect by averaging over many thinned networks, reduce the model's effective capacity, and introduce beneficial noise into the training process. These factors contribute to the development of a more robust and generalizable model that performs better on unseen data, effectively serving the purpose of regularization.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


# !gdown 1mMRZKe5Qm99fJBE9y0mLdvVZYDfHvkxY

df = pd.read_csv('Amazon.csv', encoding='latin-1')
df.dropna(inplace = True)

In [4]:
!pip install category-encoders


Collecting category-encoders
  Obtaining dependency information for category-encoders from https://files.pythonhosted.org/packages/7f/e5/79a62e5c9c9ddbfa9ff5222240d408c1eeea4e38741a0dc8343edc7ef1ec/category_encoders-2.6.3-py2.py3-none-any.whl.metadata
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
   ---------------------------------------- 0.0/81.9 kB ? eta -:--:--
   ----- ---------------------------------- 10.2/81.9 kB ? eta -:--:--
   ------------------------------ --------- 61.4/81.9 kB 812.7 kB/s eta 0:00:01
   ---------------------------------------- 81.9/81.9 kB 918.2 kB/s eta 0:00:00
Installing collected packages: category-encoders
Successfully installed category-encoders-2.6.3


In [3]:
X = df.drop(columns=['ID','Returned'])
y = df['Returned']


from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.2, random_state=42)

print('Train : ', X_train.shape, y_train.shape)
print('Validation:', X_val.shape, y_val.shape)
print('Test  : ', X_test.shape, y_test.shape)



Train :  (7039, 10) (7039,)
Validation: (1760, 10) (1760,)
Test  :  (2200, 10) (2200,)


ModuleNotFoundError: No module named 'category_encoders'

In [5]:


from category_encoders import TargetEncoder



enc = TargetEncoder(cols=['Warehouse_block','Mode_of_Shipment','Product_importance','Gender'])
X_train = enc.fit_transform(X_train, y_train)

X_val = enc.transform(X_val, y_val)
X_test = enc.transform(X_test, y_test)

X_train.head()


Unnamed: 0,Warehouse_block,Mode_of_Shipment,Customer_care_calls,Customer_rating,Cost_of_the_Product,Prior_purchases,Product_importance,Gender,Discount_offered,Weight_in_gms
10286,0.578005,0.608167,6,2,196,2,0.600878,0.596148,10,5180
7746,0.600336,0.600251,4,3,228,5,0.600878,0.599487,9,1044
1789,0.601109,0.608167,5,2,231,4,0.586928,0.596148,41,2992
2521,0.601109,0.600251,6,4,221,10,0.586928,0.596148,42,2972
10404,0.600336,0.576471,5,3,243,6,0.600878,0.599487,1,1856


In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

In [7]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense


# For Reproducibility
np.random.seed(42)
tf.random.set_seed(42)




In [8]:
def create_baseline():

  model = Sequential([
                    Dense(256, activation="relu"),
                    Dense(128, activation="relu"),
                    Dense(64, activation="relu"),
                    Dense(1 , activation = 'sigmoid')])
  return model


model = create_baseline()


model.compile(optimizer = tf.keras.optimizers.Adam(),
                loss = tf.keras.losses.BinaryCrossentropy(),
                metrics=["accuracy"])



history = model.fit(X_train, y_train, validation_data = (X_val, y_val),  epochs=10, batch_size=128, verbose=1)



Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [9]:
model.evaluate(X_train, y_train)




[0.48271700739860535, 0.717289388179779]

In [10]:
model.evaluate(X_val, y_val)




[0.5455969572067261, 0.6380681991577148]

# l1/l2 regularization

In [11]:
def create_baseline():
    # lambda = 0.01
    L2Reg = tf.keras.regularizers.L2(l2=1e-6)
    model = Sequential([
                    Dense(256, activation="relu", kernel_regularizer = L2Reg ),
                    Dense(128, activation="relu", kernel_regularizer = L2Reg),
                    Dense(64, activation="relu", kernel_regularizer = L2Reg),
                    Dense(1 , activation = 'sigmoid')])
    return model


model = create_baseline()

model.compile(optimizer = tf.keras.optimizers.Adam(),
                loss = tf.keras.losses.BinaryCrossentropy(),
                metrics=["accuracy"])


history = model.fit(X_train, y_train, validation_data = (X_val, y_val),  epochs=10, batch_size=128, verbose=1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [12]:
model.evaluate(X_train, y_train)




[0.4843883812427521, 0.714448094367981]

In [13]:
model.evaluate(X_val, y_val)




[0.5457697510719299, 0.6380681991577148]

# Dropout

In [14]:
from tensorflow.keras.layers import Dropout
def create_Dropout():
    # lambda = 0.01
    L2Reg = tf.keras.regularizers.L2(l2=1e-6)
    model = Sequential([
                    Dense(256, activation="relu", kernel_regularizer = L2Reg ),
                    Dropout(0.3),
                    Dense(128, activation="relu", kernel_regularizer = L2Reg),
                    Dropout(0.3),
                    Dense(64, activation="relu", kernel_regularizer = L2Reg),
                    Dense(1 , activation = 'sigmoid')])
    return model


model = create_Dropout()

model.compile(optimizer = tf.keras.optimizers.Adam(),
                  loss = tf.keras.losses.BinaryCrossentropy(),
                  metrics=["accuracy"])


history = model.fit(X_train, y_train, validation_data = (X_val, y_val),  epochs=10, batch_size=128, verbose=1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [15]:
model.evaluate(X_train, y_train)




[0.5027924180030823, 0.6931382417678833]

In [16]:
model.evaluate(X_val, y_val)




[0.5389804244041443, 0.6403409242630005]

# Batch Normalization

In [20]:
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Activation
from tensorflow.keras.activations import relu

def create_BatchNormalization_model():
    L2Reg = tf.keras.regularizers.L2(l2=1e-6)
    model = Sequential([
                    Dense(256, kernel_regularizer = L2Reg),
                    BatchNormalization(),
                    Activation(relu),
                    Dropout(0.2),
                    Dense(128, kernel_regularizer = L2Reg),
                    BatchNormalization(),
                    Activation(relu),
                    Dense(64,kernel_regularizer = L2Reg ),
                    BatchNormalization(),
                    Activation(relu),
                    Dense(1)])
    return model


model = create_BatchNormalization_model()

model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=0.001),
                loss = tf.keras.losses.BinaryCrossentropy(),
                metrics=["accuracy"])


history = model.fit(X_train, y_train, validation_data = (X_val, y_val),  epochs=15, batch_size=128, verbose=1)


Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [23]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_16 (Dense)            (None, 256)               2816      
                                                                 
 batch_normalization_3 (Bat  (None, 256)               1024      
 chNormalization)                                                
                                                                 
 activation_3 (Activation)   (None, 256)               0         
                                                                 
 dropout_3 (Dropout)         (None, 256)               0         
                                                                 
 dense_17 (Dense)            (None, 128)               32896     
                                                                 
 batch_normalization_4 (Bat  (None, 128)               512       
 chNormalization)                                     

In [24]:
256/2 + 512/2 + 1024/2 

896.0

In [21]:
model.evaluate(X_train, y_train)




[2.714524984359741, 0.6161386370658875]

In [22]:
model.evaluate(X_val, y_val)




[2.860673189163208, 0.5931817889213562]