# Total Optimizers ==> 
In deep learning, optimizers are algorithms used to update the parameters of a neural network model during the training process. They play a crucial role in minimizing the loss function and improving the accuracy of the model. Here are several popular optimizers along with their explanations:

(1). Stochastic Gradient Descent (SGD):

SGD is one of the simplest optimizers. It updates the parameters in the direction of the negative gradient of the loss function with respect to the current mini-batch of training data.
However, plain SGD can converge slowly, and on non-convex losses it can stall in poor local minima, saddle points, or flat regions.


(2). Momentum:

The Momentum optimizer builds upon SGD by adding a momentum term. It accumulates a fraction of the previous gradients to determine the direction of the update.
This helps to accelerate convergence and navigate through flat regions and local minima.


(3). Nesterov Accelerated Gradient (NAG):

NAG is an improvement over the Momentum optimizer. Instead of evaluating the gradient at the current parameters, it evaluates it at a look-ahead point: the current parameters shifted by the momentum step, which serves as an estimate of the future parameters.
By looking ahead, NAG allows the optimizer to better anticipate the momentum's effect and adjust its update accordingly.


(4). AdaGrad:

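AdaGrad adapts the learning rate for each parameter based on the historical gradients. It scales each parameter's step by the inverse square root of that parameter's accumulated squared gradients, so parameters tied to infrequent features keep a relatively large effective learning rate while frequently updated parameters get a smaller one.
This makes AdaGrad well suited to sparse data, but because the accumulated sum only grows, the effective learning rate can become too small over time.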


(5). RMSprop:

RMSprop addresses the diminishing learning rate issue of AdaGrad by introducing an exponentially decaying average of squared gradients.
By keeping a moving average of the squared gradients, RMSprop normalizes the learning rates and improves convergence.


(6). Adam (Adaptive Moment Estimation):

Adam combines the concepts of momentum and RMSprop. It maintains an exponentially decaying average of past gradients and squared gradients.
Adam uses bias correction to account for the initialization bias in the first few iterations, making it perform well in practice across different deep learning tasks.


(7).  AdaDelta:

AdaDelta is similar to RMSprop: both replace AdaGrad's ever-growing sum of squared gradients with a decaying average, which avoids the steady decay of the learning rate.
Instead of using a global learning rate, AdaDelta scales each step by the ratio of the root mean square of recent parameter updates to the root mean square of recent gradients.


(8). Adamax:

Adamax is a variant of Adam that uses the infinity norm (maximum absolute value) of the gradient instead of the L2 norm.
It is generally less sensitive to the scale of the gradients and is useful when dealing with sparse gradients.


(9).  Nadam:

Nadam is an extension of Adam that incorporates the Nesterov momentum into its update rule.
By including the Nesterov momentum, Nadam aims to converge faster than Adam on some problems; any effect on generalization is problem dependent.


(10). AMSGrad:

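AMSGrad is a modification of Adam that prevents the effective learning rate from increasing when the second-moment estimate shrinks, a behavior that can keep Adam from converging on some problems.
It maintains the running maximum of the second-moment estimates (the exponential averages of squared gradients) and uses that maximum in the update rule, whereas Adam uses the current exponential average directly.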
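
To make the difference between the last few Adam variants in the list above more concrete, here is a minimal sketch that tracks only the quantity each one puts in the denominator of its update. The function name `second_moment_trackers`, the toy gradient sequence, and the choice `beta2=0.9` are assumptions made for illustration (bias correction and the parameter update itself are left out).

```python
def second_moment_trackers(grads, beta2=0.9):
    """Track the denominator statistic of Adam, Adamax and AMSGrad.

    Returns a list of (v, u, v_max) tuples, one per gradient:
      v     - Adam: exponential average of squared gradients
      u     - Adamax: exponentially weighted infinity norm
      v_max - AMSGrad: running maximum of v
    """
    v, u, v_max = 0.0, 0.0, 0.0
    history = []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g ** 2
        u = max(beta2 * u, abs(g))
        v_max = max(v_max, v)
        history.append((v, u, v_max))
    return history


# Hypothetical gradient sequence: one large gradient followed by small ones.
history = second_moment_trackers([10.0] + [0.1] * 20)
first_v, first_u, first_v_max = history[0]
last_v, last_u, last_v_max = history[-1]
print("Adam v:        ", first_v, "->", last_v)
print("Adamax u:      ", first_u, "->", last_u)
print("AMSGrad v_max: ", first_v_max, "->", last_v_max)
```

In this toy run Adam's average shrinks once the large gradient is in the past, so its effective step grows again, while the AMSGrad maximum stays where it was.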

# Advantages And Disadvantages of SGD ==> 

(1).  Advantages of Stochastic Gradient Descent (SGD):

(a).  Simplicity: SGD is a straightforward and easy-to-implement optimization algorithm compared to more complex optimizers.
    
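(b). Memory Efficiency: Since SGD updates the parameters using a single sample or a small batch at a time, it requires less memory than full-batch gradient descent, which processes the entire dataset per update. It also keeps no per-parameter optimizer state, unlike momentum-based or adaptive optimizers.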
    
(c). Computational Efficiency: SGD can be computationally efficient, especially for large datasets, as it processes data in small batches rather than the entire dataset at once.
    
(d). Generalization: SGD can help prevent overfitting by introducing randomness through the use of mini-batches, which can improve the generalization performance of the model.
    
(2). Disadvantages of Stochastic Gradient Descent (SGD):

(a).  Noisy Updates: SGD introduces noise into the parameter updates due to the randomness of the mini-batch sampling. This noise can lead to slower convergence and instability.


(b).  High Variance: The estimates of the gradient obtained from each mini-batch can have a high variance, which can lead to oscillations and suboptimal convergence.


(c).  Learning Rate Selection: SGD is sensitive to the learning rate choice. Setting a learning rate that is too high can cause the optimization process to diverge, while setting it too low can result in slow convergence.


(d).  Local Minima: SGD is susceptible to getting trapped in local minima, especially in high-dimensional and non-convex optimization problems. It may struggle to escape shallow minima and converge to the global minimum.
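
The learning rate point in the list above can be seen on a one-dimensional toy problem. The sketch below is an assumption-laden illustration, not a real training loop: it uses the objective f(w) = w**2, deterministic gradients (no mini-batch noise), and a hypothetical helper named `gradient_descent`.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient steps and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w                  # derivative of w**2
        w = w - learning_rate * grad    # plain gradient descent update
    return w


# Each step multiplies w by (1 - 2 * learning_rate).
too_high = gradient_descent(learning_rate=1.1)    # factor -1.2: |w| grows
too_low = gradient_descent(learning_rate=0.001)   # factor 0.998: barely moves
reasonable = gradient_descent(learning_rate=0.3)  # factor 0.4: converges quickly

print("too high:  ", too_high)
print("too low:   ", too_low)
print("reasonable:", reasonable)
```

With the high rate the iterate moves away from the minimum at 0, with the low rate it is still close to its starting value after 20 steps, and the intermediate rate gets very close to 0.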

# (2). Momentum ==> 


Momentum is an optimization algorithm that builds upon the basic Stochastic Gradient Descent (SGD) method. It helps accelerate convergence and navigate through flat regions and local minima by introducing a momentum term that keeps track of the past gradients. Here is a complete explanation of the Momentum optimizer:

Algorithm Overview:

In Momentum optimization, the parameter update is influenced not only by the current gradient but also by the accumulated gradient from previous iterations.
At each iteration, the optimizer adds the scaled current gradient to a decayed copy of the previous velocity, so the velocity is an exponentially decaying sum of past gradients, and this velocity is used to update the parameters.

Math Formulation:

The update rule for the Momentum optimizer can be expressed as follows:

v(t) = m * v(t-1) + learning_rate * gradient(t)

parameter(t) = parameter(t-1) - v(t)

Where:

v(t) is the velocity or accumulated gradient at time step t.

m is the momentum coefficient, usually a value between 0 and 1.

learning_rate is the learning rate or step size.

gradient(t) is the gradient of the loss function with respect to the parameters at time step t.

parameter(t) represents the parameters of the model at time step t.
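
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration for a single scalar parameter; the function name `momentum_step` and the toy objective f(w) = w**2 are assumptions, not part of any library.

```python
def momentum_step(param, velocity, grad, learning_rate=0.1, m=0.9):
    """Apply one momentum update and return the new (param, velocity)."""
    # v(t) = m * v(t-1) + learning_rate * gradient(t)
    velocity = m * velocity + learning_rate * grad
    # parameter(t) = parameter(t-1) - v(t)
    param = param - velocity
    return param, velocity


# Toy objective f(w) = w**2, whose gradient is 2 * w.
w, v = 5.0, 0.0
for step in range(3):
    w, v = momentum_step(w, v, grad=2.0 * w)
    print(f"step {step + 1}: w = {w:.4f}, v = {v:.4f}")
```

The velocity keeps growing over these three steps even though the gradient shrinks, which is the inertia effect that the momentum term is meant to provide.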



(a).  Momentum Effect:

The momentum term, m * v(t-1), introduces inertia to the update process. It allows the optimizer to continue moving in the same direction as previous iterations if the gradients consistently point in that direction.
The learning_rate * gradient(t) term adjusts the update direction based on the current gradient.


Advantages of Momentum Optimizer:

Accelerated Convergence: The momentum term helps the optimizer to move faster towards the minimum by accumulating the gradients from previous iterations. This can lead to faster convergence and reduce the time needed for training.

Smoother Optimization Trajectory: Momentum helps the optimizer smooth out the oscillations that often occur when using plain SGD, resulting in a more stable and consistent optimization trajectory.

Improved Exploration: Momentum allows the optimizer to escape shallow local minima and navigate through flat regions, making it less likely to get stuck.
    
    
(b). Hyperparameter Tuning:

The momentum coefficient (m) is a hyperparameter that needs to be tuned. A value close to 1 (e.g., 0.9) is commonly used, but different problems may benefit from different values. It determines how much of the past gradients should be taken into account.
The learning rate also needs to be properly set, considering the interaction between the momentum and learning rate hyperparameters. A high learning rate may lead to overshooting, while a low learning rate may result in slow convergence.


(c). Variants of Momentum:

Nesterov Accelerated Gradient (NAG): NAG is a modification of the momentum optimizer that calculates the gradient based on the estimated future parameters. It helps the optimizer anticipate the momentum's effect and adjust the update direction accordingly, leading to improved performance.
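
As a rough sketch of the look-ahead idea described above, the snippet below evaluates the gradient at the position the momentum term alone would reach, then applies the usual velocity update. The names `nag_step` and `grad_fn` and the toy objective f(w) = w**2 are assumptions for illustration; library implementations may use an algebraically different but related form.

```python
def nag_step(param, velocity, grad_fn, learning_rate=0.1, m=0.9):
    """Apply one Nesterov-style update and return the new (param, velocity)."""
    # Estimate where the parameters will be after the momentum step alone.
    lookahead = param - m * velocity
    # Evaluate the gradient at the look-ahead point, not at the current one.
    grad = grad_fn(lookahead)
    velocity = m * velocity + learning_rate * grad
    return param - velocity, velocity


def grad_fn(w):
    """Gradient of the toy objective f(w) = w**2."""
    return 2.0 * w


w, v = 5.0, 0.0
for step in range(3):
    w, v = nag_step(w, v, grad_fn)
    print(f"step {step + 1}: w = {w:.4f}, v = {v:.4f}")
```

Because the gradient is measured closer to where the parameters are heading, the velocity is corrected earlier than it would be if the gradient were taken at the current position.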

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

data = pd.read_csv("C:\\Users\\rajen\\OneDrive\\Desktop\\data\\Churn_Modelling.csv")
# data.head() 


# Preprocess the dataset
X = data.iloc[:, 3:-1].values
y = data.iloc[:, -1].values

# Encode categorical features
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])  # Geography
X[:, 2] = le.fit_transform(X[:, 2])  # Gender

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the neural network architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_dim=X_train.shape[1]),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Define the loss function and metrics
loss_fn = tf.keras.losses.BinaryCrossentropy()
metrics = ['accuracy']

# Define the optimizer with Momentum
learning_rate = 0.001
momentum = 0.9
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum)

# Compile the model
model.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)

# Train the model
epochs = 10
batch_size = 32
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_acc)

# Disadvantages of Momentum ==> 

(1).   Overshooting:

One potential issue with the Momentum optimizer is overshooting or oscillation. If the learning rate or the momentum coefficient is set too high, the optimizer may overshoot the optimal solution and oscillate around it. This can lead to slower convergence or instability in the optimization process.


(2).  Difficulty in Converging in Some Cases:

In certain scenarios, such as optimization problems with noisy or sparse gradients, the Momentum optimizer may face difficulties in converging to the optimal solution. The accumulated gradients from previous iterations can cause the optimizer to overshoot or miss important regions of the loss landscape, making convergence slower or even preventing it altogether.

(3).  Sensitivity to Hyperparameters:

The performance of the Momentum optimizer is highly dependent on the choice of hyperparameters, specifically the momentum coefficient and the learning rate. Finding the optimal values for these hyperparameters can be challenging and require careful tuning. Suboptimal choices can result in slow convergence or unstable behavior.

(4).  Increased Memory Requirements:

The Momentum optimizer requires additional memory to store and update the velocity or accumulated gradients at each iteration. While this memory overhead may not be significant for small models and datasets, it can become a concern when dealing with large-scale deep learning models or limited computational resources.

(5).  Lack of Adaptivity:

Momentum does not adaptively adjust its behavior based on the characteristics of the optimization problem or the current stage of training. It applies a fixed momentum coefficient throughout the training process, which may not be optimal for different stages or regions of the loss landscape. This lack of adaptivity can lead to suboptimal convergence in some cases.

In [2]:
import numpy as np 
import pandas as pd 

data = pd.read_csv("C:\\Users\\rajen\\OneDrive\\Desktop\\data\\Churn_Modelling.csv")
# data.head() 


import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler



# Preprocess the dataset
X = data.iloc[:, 3:-1].values
y = data.iloc[:, -1].values

# Encode categorical features
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])  # Geography
X[:, 2] = le.fit_transform(X[:, 2])  # Gender

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the neural network architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_dim=X_train.shape[1]),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Define the loss function and metrics
loss_fn = tf.keras.losses.BinaryCrossentropy()
metrics = ['accuracy']

# Define the optimizer with Nesterov Accelerated Gradient (NAG)
learning_rate = 0.001
momentum = 0.9
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum, nesterov=True)

# Compile the model
model.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)

# Train the model
epochs = 10
batch_size = 32
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_acc)

# Disadvantages of NAG ==> 

(1). Sensitivity to Learning Rate: NAG can be sensitive to the choice of learning rate. If the learning rate is set too high, it may result in unstable updates and difficulty in convergence. On the other hand, setting the learning rate too low can lead to slow convergence.


(2). Hyperparameter Tuning: Nesterov Accelerated Gradient requires tuning of the momentum coefficient and learning rate. Finding the optimal values for these hyperparameters can be a challenging task and may require extensive experimentation.


(3). Computational Complexity: NAG evaluates the gradient at the estimated future position of the parameters instead of at the current position. This is still one gradient evaluation per step, so the extra cost over basic Momentum optimization is small, but the update is slightly more involved to implement and reason about.


(4). Limited Applicability: While NAG can be effective in many scenarios, it may not always outperform other optimization algorithms. Its performance can vary depending on the specific problem and dataset. In certain cases, alternative optimizers like Adam or RMSprop may yield better results.


(5). Lack of Robustness to Noise: Nesterov Accelerated Gradient may exhibit reduced performance in the presence of noisy or sparse gradients. The estimation of future parameters based on the current gradient direction can be influenced by noisy updates, leading to suboptimal convergence.


(6). Local Minima: Like other optimization algorithms, NAG is not immune to the problem of getting stuck in local minima. While it offers faster convergence in many cases, there is still a possibility of converging to suboptimal solutions depending on the nature of the loss landscape.

# (4). AdaGrad ==> 

Adagrad (Adaptive Gradient) is an optimization algorithm used in deep learning that adapts the learning rate for each parameter based on its historical gradients. It addresses the challenge of choosing a suitable learning rate by automatically adjusting it during training. 

(1).  Algorithm Overview:


Adagrad adjusts the learning rate for each parameter individually based on the sum of squared gradients for that parameter.
It gives a larger effective learning rate to parameters whose accumulated squared gradients are small and a smaller effective learning rate to parameters whose accumulated squared gradients are large.


This adaptation allows more rapid progress along directions with small or infrequent gradients and slower progress along directions with large or frequent gradients.


(2).  Math Formulation:

The update rule for Adagrad can be expressed as follows:

cache(t) = cache(t-1) + gradient(t)^2

parameter(t) = parameter(t-1) - (learning_rate / sqrt(cache(t) + epsilon)) * gradient(t)

Where:

cache(t) is the sum of squared gradients at time step t.

gradient(t) is the gradient of the loss function with respect to the parameters at time step t.

parameter(t) represents the parameters of the model at time step t.

learning_rate is the initial learning rate.

epsilon is a small constant (e.g., 1e-8) added to the denominator for numerical stability.

(3).  Accumulation of Gradients:

Adagrad accumulates the squared gradients over time by summing up the square of each gradient for a specific parameter.
Parameters that receive large or frequent gradients build up a large sum and therefore take smaller steps, while parameters with infrequent gradients keep a comparatively large effective learning rate.


(4).  Learning Rate Decay:

Adagrad inherently performs learning rate decay as the accumulated squared gradients keep increasing over time.
The learning rate becomes smaller as the cache term in the denominator becomes larger, which ensures that the learning rate decreases monotonically.
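
The update rule and the built-in decay described above can be sketched as follows for a single scalar parameter. The helper name `adagrad_step`, the toy objective f(w) = w**2, and the hyperparameter values are assumptions chosen for illustration.

```python
import math


def adagrad_step(param, cache, grad, learning_rate=0.1, epsilon=1e-8):
    """Apply one Adagrad update; also return the effective learning rate."""
    # cache(t) = cache(t-1) + gradient(t)^2
    cache = cache + grad ** 2
    effective_lr = learning_rate / math.sqrt(cache + epsilon)
    param = param - effective_lr * grad
    return param, cache, effective_lr


# Toy objective f(w) = w**2, whose gradient is 2 * w.
w, cache = 5.0, 0.0
rates = []
for step in range(5):
    w, cache, lr_t = adagrad_step(w, cache, grad=2.0 * w)
    rates.append(lr_t)
    print(f"step {step + 1}: w = {w:.4f}, effective learning rate = {lr_t:.6f}")
```

Since the cache only ever grows while the gradient is nonzero, the printed effective learning rate shrinks at every step.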

# Advantages of Adagrad Gradient Descent:

(1).  Adaptive Learning Rates: Adagrad adapts the learning rates based on the historical gradients, enabling efficient learning for different parameters.

(2).  Sparse Feature Handling: Adagrad handles sparse features well by giving them larger learning rates, making it suitable for natural language processing and recommendation systems.

(3).  Less Manual Learning Rate Tuning: Adagrad automatically adjusts the per-parameter learning rate during training, reducing the need for manual tuning. An initial learning rate still has to be chosen.


# Disadvantages of Adagrad Gradient Descent:

(1).  Cumulative Gradient Squares: As Adagrad accumulates squared gradients over time, the sum of squares keeps increasing. This can result in diminishing learning rates, making the algorithm converge too slowly.


(2).  Lack of Robustness: Adagrad might not perform well in cases where there are sudden changes in gradients or when the problem has a non-convex loss landscape.


(3).  Sensitivity to Early Gradients: Because the accumulated sum never shrinks, unusually large gradients early in training permanently reduce a parameter's effective learning rate, even if later gradients are small. Parameters whose gradient is zero are simply not updated, and their accumulated sum does not grow.

In [3]:
import numpy as np 
import pandas as pd 

data = pd.read_csv("C:\\Users\\rajen\\OneDrive\\Desktop\\data\\Churn_Modelling.csv")
# data.head() 


import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler



# Preprocess the dataset
X = data.iloc[:, 3:-1].values
y = data.iloc[:, -1].values

# Encode categorical features
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])  # Geography
X[:, 2] = le.fit_transform(X[:, 2])  # Gender

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the neural network architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_dim=X_train.shape[1]),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Define the loss function and metrics
loss_fn = tf.keras.losses.BinaryCrossentropy()
metrics = ['accuracy']

# Define the optimizer with Adagrad
learning_rate = 0.01
optimizer = tf.keras.optimizers.Adagrad(learning_rate=learning_rate)

# Compile the model
model.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)

# Train the model
epochs = 10
batch_size = 32
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_acc)


# (5). RMSprop ==> 

RMSprop (Root Mean Square Propagation) is an optimization algorithm commonly used in deep learning. It addresses the limitations of the Adagrad optimizer by introducing a moving average of squared gradients. Here's a complete explanation of RMSprop:

(1).  Algorithm Overview:

RMSprop adapts the learning rate for each parameter based on the moving average of squared gradients.
It maintains a weighted average of the squared gradients over time, which helps control the learning rate.

Math Formulation:

The update rule for RMSprop can be expressed as follows:

cache(t) = decay_rate * cache(t-1) + (1 - decay_rate) * gradient(t)^2

parameter(t) = parameter(t-1) - (learning_rate / sqrt(cache(t) + epsilon)) * gradient(t)

Where:

cache(t) is the moving average of squared gradients at time step t.

gradient(t) is the gradient of the loss function with respect to the parameters at time step t.

parameter(t) represents the parameters of the model at time step t.

learning_rate is the initial learning rate.

decay_rate is a hyperparameter that controls the weightage of previous squared gradients.

epsilon is a small constant (e.g., 1e-8) added to the denominator for numerical stability.
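
A minimal sketch of this update for a single scalar parameter is given below. The function name `rmsprop_step` and the hypothetical gradient sequence (one large gradient followed by many small ones) are assumptions chosen to show that the cache can shrink again, unlike a plain running sum.

```python
import math


def rmsprop_step(param, cache, grad, learning_rate=0.01, decay_rate=0.9,
                 epsilon=1e-8):
    """Apply one RMSprop update and return the new (param, cache)."""
    # cache(t) = decay_rate * cache(t-1) + (1 - decay_rate) * gradient(t)^2
    cache = decay_rate * cache + (1.0 - decay_rate) * grad ** 2
    param = param - (learning_rate / math.sqrt(cache + epsilon)) * grad
    return param, cache


w, cache = 0.0, 0.0
w, cache = rmsprop_step(w, cache, grad=10.0)   # one large gradient
peak = cache
for _ in range(50):                            # followed by small gradients
    w, cache = rmsprop_step(w, cache, grad=0.1)

print("cache after the large gradient:", peak)
print("cache after 50 small gradients:", cache)
```

The cache falls back toward the scale of the recent small gradients, so the effective learning rate recovers instead of staying permanently small.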


(2).  Accumulation of Squared Gradients:

RMSprop accumulates the squared gradients by taking a weighted average of the squared gradients from previous time steps.
The decay_rate hyperparameter controls the rate at which previous gradients contribute to the current cache.


(3).  Learning Rate Adaptation:

RMSprop adapts the learning rate for each parameter based on the square root of the average squared gradients.
It divides the learning rate by the square root of the cache value, which effectively scales the learning rate based on the magnitude of the gradients.

# Advantages of RMSprop:

(1). Adaptive Learning Rates: RMSprop adapts the learning rates based on the average squared gradients, allowing for efficient learning across different parameters.
    
    
(2). Stable Convergence: By maintaining a moving average of squared gradients, RMSprop can handle noisy or sparse gradients more robustly, leading to more stable convergence.
    
    
(3). Reduced Sensitivity to Learning Rate: RMSprop's adaptation of the learning rate reduces the need for manual tuning, making it less sensitive to the choice of the initial learning rate.
    

# Disadvantages of RMSprop:

(1).  Still Requires a Global Learning Rate: Unlike AdaGrad, RMSprop's moving average forgets old gradients, so the effective learning rate does not shrink toward zero. The base learning rate and decay_rate still have to be chosen, and a poor choice can make training slow or unstable.


(2).  Limited Memory: RMSprop summarizes past squared gradients in a single exponentially decaying average per parameter, which mostly reflects roughly the last 1 / (1 - decay_rate) steps. Gradient behavior from earlier in training is forgotten, which may limit how well the algorithm adapts when the gradient scale fluctuates over longer horizons.



In [4]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


data = pd.read_csv("C:\\Users\\rajen\\OneDrive\\Desktop\\data\\Churn_Modelling.csv")
# data.head() 

# Preprocess the dataset
X = data.iloc[:, 3:-1].values
y = data.iloc[:, -1].values

# Encode categorical features
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])  # Geography
X[:, 2] = le.fit_transform(X[:, 2])  # Gender

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the neural network architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_dim=X_train.shape[1]),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Define the loss function and metrics
loss_fn = tf.keras.losses.BinaryCrossentropy()
metrics = ['accuracy']

# Define the optimizer with RMSprop
learning_rate = 0.001
optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)

# Compile the model
model.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)

# Train the model
epochs = 10
batch_size = 32
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_acc)

# Adam ==> 

The Adam optimizer is a popular optimization algorithm commonly used in deep learning and machine learning. It stands for "Adaptive Moment Estimation" and combines the concepts of momentum and adaptive learning rates.

Here's a complete overview of the Adam optimizer:

Background:
The Adam optimizer was introduced in 2014 by Diederik P. Kingma and Jimmy Ba. It was designed to address some limitations of other optimization algorithms like stochastic gradient descent (SGD) and its variants.

Algorithm:
The Adam optimizer maintains a set of adaptive learning rates for individual parameters. It computes them from estimates of the first moment (the mean) and the second raw moment (the uncentered variance) of the gradients.

The algorithm involves the following steps:

a. Initialization:

Initialize the first-moment variable (mean of gradients) to zero.

Initialize the second-moment variable (uncentered variance of gradients) to zero.

Set the hyperparameters: learning rate (α), decay rates for the moment estimates (β1 and β2), and a small constant for numerical stability (ε).

b. Update at each iteration:

Compute the gradient of the objective function with respect to the parameters.


Update the first-moment variable:

Multiply the first-moment variable by β1.

Add (1 - β1) times the gradient to the first-moment variable.

Update the second-moment variable:

Multiply the second-moment variable by β2.

Add (1 - β2) times the squared gradient to the second-moment variable.

Correct for bias:

Compute the bias-corrected first-moment estimate.

Compute the bias-corrected second-moment estimate.

Update the parameters:

Update the parameters by subtracting the learning rate multiplied by the bias-corrected first-moment estimate divided by the square root of the bias-corrected second-moment estimate (plus ε for numerical stability).

The algorithm adapts the learning rates for each parameter based on the gradient's statistics, allowing it to converge faster and handle sparse gradients efficiently.
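
The steps above can be sketched for a single scalar parameter as follows. The function name `adam_step` and the toy objective f(w) = w**2 are assumptions for illustration; the default values shown are the commonly used ones (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8), and the iteration counter `t` is assumed to start at 1.

```python
import math


def adam_step(param, m, v, grad, t, learning_rate=0.001, beta1=0.9,
              beta2=0.999, epsilon=1e-8):
    """Apply one Adam update and return the new (param, m, v)."""
    # Update the biased first- and second-moment estimates.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Correct the bias caused by initializing m and v at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Parameter update.
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v


# Toy objective f(w) = w**2, whose gradient is 2 * w.
w, m, v = 5.0, 0.0, 0.0
w, m, v = adam_step(w, m, v, grad=2.0 * w, t=1)
print("after one step: w =", w, " m =", m, " v =", v)
```

On the first step the bias correction cancels the (1 - β) factors, so the ratio of the corrected moments has magnitude close to 1 and the parameter moves by roughly the learning rate, regardless of the gradient's scale.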


Hyperparameters:
The Adam optimizer requires setting several hyperparameters:

(1). Learning rate (α): Determines the step size taken during parameter updates.

(2). Decay rate for the first-moment estimate (β1): Controls the exponential decay of the moving average of gradients.

(3). Decay rate for the second-moment estimate (β2): Controls the exponential decay of the moving average of squared gradients.

(4). Small constant (ε): Added for numerical stability to prevent division by zero.

The values of these hyperparameters can significantly affect the performance of the optimizer and need to be tuned based on the specific problem and dataset.

# Advantages of Adam ==> 

(1).  Efficient: Adam maintains per-parameter learning rates, adapting them individually based on the estimated moments of the gradients.
    
(2).  Robust to sparse gradients: It performs well even when dealing with sparse gradients, making it suitable for tasks like natural language processing (NLP).
    
(3).  Converges faster: Adam's adaptive learning rates allow it to converge faster compared to traditional optimization algorithms.
    

# Disadvantages of Adam ==> 

(1). Increased memory usage: Adam stores two additional values per parameter (the first- and second-moment estimates), leading to higher memory requirements compared to simpler optimizers like SGD.
    
    
(2). Sensitivity to hyperparameters: The performance of Adam can be sensitive to the choice of hyperparameters, and improper tuning may result in suboptimal convergence.
    
    

In [None]:
import numpy as np 
import pandas as pd 


data = pd.read_csv("C:\\Users\\rajen\\OneDrive\\Desktop\\data\\Churn_Modelling.csv")
# data.head() 

import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler



# Preprocess the categorical variables
label_encoder = LabelEncoder()
data['Geography'] = label_encoder.fit_transform(data['Geography'])
data['Gender'] = label_encoder.fit_transform(data['Gender'])

# Split the data into features and labels
X = data.drop(['Exited', 'RowNumber', 'CustomerId', 'Surname'], axis=1)
y = data['Exited']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the numerical features (fit on the training set only, so that
# test-set statistics do not leak into training)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the model architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")


# AdaDelta ==> 

AdaDelta is an optimization algorithm used in deep learning and machine learning to update the parameters of a neural network. It is a variant of the stochastic gradient descent (SGD) optimization method that aims to address some of the limitations of traditional SGD, such as the sensitivity to learning rate tuning.

Complete overview of the AdaDelta optimizer:

(1). Background:
AdaDelta was proposed by Matthew D. Zeiler in 2012 as an extension of the Adagrad optimizer. Adagrad uses a per-parameter learning rate that adapts over time, but it suffers from a problem known as "learning rate decay," which causes the learning rates to become very small, making further learning difficult. AdaDelta was designed to overcome this issue and provide a more stable and effective optimization method.

(2). Adaptive Learning Rate:
Like Adagrad, AdaDelta adapts the learning rate for each parameter based on the magnitudes of its gradients. However, instead of accumulating all squared gradients over time, AdaDelta restricts the window of accumulated past gradients to a fixed size. In practice this is implemented as an exponentially decaying average rather than by storing past gradients.

(3). Accumulated Gradient:
AdaDelta keeps track of a running average of squared gradients, denoted by E[g^2], for each parameter. It is computed using an exponential moving average formula, where the accumulated value is decayed and updated with each iteration.

(4). Root Mean Square Update (RMS):
In addition to the accumulated gradient, AdaDelta maintains a running average of squared parameter updates, denoted by E[dx^2]. Like the accumulated gradient, this value is computed using an exponential moving average. Its square root, RMS[dx], represents the typical magnitude of recent updates to the parameters.

(5). Parameter Update:
The update for each parameter is calculated using the following steps:
    
    
(a). Update the running average of squared gradients:

E[g^2] = rho * E[g^2] + (1 - rho) * g^2

Here, rho is a decay rate hyperparameter (e.g., 0.9) that controls the weighting of the accumulated average relative to the current gradient.

(b). Compute the root mean square of the gradients:

RMS[g] = sqrt(E[g^2] + epsilon)

Here, epsilon is a small constant (e.g., 1e-8) added for numerical stability.

(c). Compute the root mean square of the previous parameter updates (using E[dx^2] from the previous iteration):

RMS[dx] = sqrt(E[dx^2] + epsilon)

(d). Compute the update step for each parameter:

dx = -(RMS[dx] / RMS[g]) * g

(e). Update the running average of squared parameter updates:

E[dx^2] = rho * E[dx^2] + (1 - rho) * dx^2

(f). Update the parameters:

param += dx
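
The parameter update can be sketched for a single scalar parameter as shown below. The function name `adadelta_step`, the variable names, and the toy objective f(w) = w**2 are assumptions for illustration. The sketch refreshes the running average of squared gradients before computing the step, which is the usual ordering for this algorithm.

```python
import math


def adadelta_step(param, avg_sq_grad, avg_sq_update, grad, rho=0.9,
                  epsilon=1e-8):
    """Apply one AdaDelta update; returns (param, avg_sq_grad, avg_sq_update)."""
    # Running average of squared gradients, E[g^2].
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    rms_grad = math.sqrt(avg_sq_grad + epsilon)
    # RMS of the updates made so far, from E[dx^2] of the previous iteration.
    rms_update = math.sqrt(avg_sq_update + epsilon)
    # The step has no explicit learning rate: it is a ratio of two RMS values.
    dx = -(rms_update / rms_grad) * grad
    # Running average of squared updates, E[dx^2].
    avg_sq_update = rho * avg_sq_update + (1.0 - rho) * dx ** 2
    return param + dx, avg_sq_grad, avg_sq_update


# Toy objective f(w) = w**2, whose gradient is 2 * w.
w, eg2, edx2 = 5.0, 0.0, 0.0
for _ in range(100):
    w, eg2, edx2 = adadelta_step(w, eg2, edx2, grad=2.0 * w)
print("w after 100 steps:", w)
```

Because E[dx^2] starts at zero, the first steps are on the order of sqrt(epsilon), so progress in this toy run is slow at the beginning and the choice of epsilon matters more than its name suggests.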

(6). Learning Rate Adaptation:
    
    
In its original formulation, AdaDelta eliminates the need for manually setting a learning rate. The effective step size for each parameter is the ratio of the magnitude of recent updates to the magnitude of recent gradients, so it adapts as training progresses. This makes learning more robust and reduces the need for fine-tuning a learning rate hyperparameter. Note that framework implementations such as tf.keras.optimizers.Adadelta still expose a learning_rate argument that scales the computed update.

# Benefits of AdaDelta:

(1).  AdaDelta can handle sparse gradients well, making it suitable for problems with large-scale datasets and high-dimensional spaces.

(2).  It eliminates the need for setting a global learning rate manually, reducing the sensitivity to learning rate tuning.


(3).  AdaDelta's adaptive learning rate allows for smoother convergence during training and helps prevent overshooting or getting stuck in plateaus.


# Limitations of AdaDelta:

(1).  AdaDelta has some additional hyperparameters to tune, such as the decay rate (rho) and a small constant (epsilon), although they are generally less sensitive than the learning rate in traditional optimization methods.

(2).  It may require more memory to store the additional parameters (E[g^2] and E[dx^2]) for each parameter compared to standard optimization methods.


In [None]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the churn dataset (adjust the path to wherever the CSV lives on your machine)
df = pd.read_csv("C:\\Users\\rajen\\OneDrive\\Desktop\\data\\Churn_Modelling.csv")
# df.head()

# Perform label encoding for categorical variables
label_encoder = LabelEncoder()
df['Geography'] = label_encoder.fit_transform(df['Geography'])
df['Gender'] = label_encoder.fit_transform(df['Gender'])

# Split the dataset into features and target variable
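# RowNumber and CustomerId are row identifiers, not predictors, so they are dropped as well
X = df.drop(['Exited', 'Surname', 'RowNumber', 'CustomerId'], axis=1).values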
y = df['Exited'].values

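# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Perform feature scaling; fit the scaler on the training split only to avoid leaking test statistics
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)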

# Define the neural network architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=64, activation='relu', input_dim=X_train.shape[1]),
    tf.keras.layers.Dense(units=64, activation='relu'),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])

# Compile the model
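# Keras defaults Adadelta to learning_rate=0.001; 1.0 matches the learning-rate-free form described above
model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=1.0),  # Use AdaDelta optimizer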
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test Loss:', loss)
print('Test Accuracy:', accuracy)

# Adamax Optimizer ===> 


Adamax is an optimization algorithm that is commonly used in deep learning models. It is a variant of the Adam (Adaptive Moment Estimation) optimizer, which combines per-parameter adaptive learning rates with a moving average of gradients. Where Adam scales each update by an L2-norm-based estimate of past gradients (the square root of the second moment), Adamax scales it by an exponentially weighted infinity norm (max norm) of past gradients.

The key components and equations used in the Adamax optimizer:

(1).  Initialization:

(a). Set the hyperparameters:

Learning rate (α): Typically a small value, e.g., 0.001.

β1: Exponential decay rate for the first moment estimates. Commonly set to 0.9.

β2: Exponential decay rate for the exponentially weighted infinity norm. Commonly set to 0.999.

ε: A small value added to the denominator to avoid division by zero. Typically around 1e-8.

(b). Initialize the first moment vector (m) and the exponentially weighted infinity norm vector (u) for each parameter to zero.


(2).  Iterative update:
    
(a). For each iteration/timestep t:

Compute the gradients of the loss function with respect to the parameters.

(b). Update the first moment estimates (m) using the exponential decay rate β1:
    
m = β1 * m + (1 - β1) * gradients

(c). Update the exponentially weighted infinity norm (u) using the exponential decay rate β2:
    
u = max(β2 * u, abs(gradients))

(d). Compute the bias-corrected first moment estimate:
    
m_hat = m / (1 - β1^t)


(e). Update the parameters using the Adamax update rule:
    
parameter = parameter - α * (m_hat / (u + ε))
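The iteration above can be sketched in plain Python for a single scalar parameter. This is an illustrative sketch rather than a framework implementation: the function name `adamax_trajectory` and the toy objective f(x) = (x - 3)^2 are assumptions, and the sketch assumes the bias correction 1 / (1 - β1^t) is applied exactly once, through m_hat.

```python
def adamax_trajectory(grad_fn, x0, alpha=0.001, beta1=0.9, beta2=0.999,
                      epsilon=1e-8, steps=1000):
    """Run Adamax on one scalar parameter and return every visited value."""
    x = x0
    m = 0.0  # first moment estimate
    u = 0.0  # exponentially weighted infinity norm
    trajectory = [x]
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        # The max replaces Adam's average of squared gradients.
        u = max(beta2 * u, abs(g))
        # Bias-correct the first moment; u needs no correction.
        m_hat = m / (1 - beta1 ** t)
        x -= alpha * m_hat / (u + epsilon)
        trajectory.append(x)
    return trajectory

# Hypothetical toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
path = adamax_trajectory(lambda x: 2.0 * (x - 3.0), x0=0.0)
print("first step:", path[1] - path[0])
print("final x:", path[-1])
```

On the first iteration m_hat equals the gradient and u equals its magnitude, so the step size is almost exactly α regardless of how large the gradient is.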

# Summary ==> 
The Adamax optimizer uses the first moment estimate (m) and the exponentially weighted infinity norm (u) to adaptively update the parameters. The first moment estimate is a moving average of past gradients, and u takes the place of Adam's second moment estimate by tracking the largest recent gradient magnitude. By using the infinity norm, Adamax provides a more stable update step when the gradients have varying magnitudes.

Adamax is known for its robustness and ability to handle sparse gradients effectively. It ships as a built-in optimizer in major deep learning frameworks such as Keras and PyTorch.
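To make the difference between the two scaling terms concrete, the following sketch feeds the same gradient stream to Adamax's u and to an Adam-style second moment v. The helper name `scaling_terms` and the gradient values are made up for illustration, and v is left without bias correction to keep the comparison short.

```python
import math

def scaling_terms(gradients, beta2=0.999):
    """Return (u, sqrt(v)) after each gradient: Adamax's norm and Adam's RMS-style term."""
    u = 0.0
    v = 0.0
    history = []
    for g in gradients:
        u = max(beta2 * u, abs(g))             # Adamax: decayed running maximum
        v = beta2 * v + (1 - beta2) * g * g    # Adam: decayed average of squares (uncorrected)
        history.append((u, math.sqrt(v)))
    return history

# Hypothetical gradient stream: small values with a single spike in the middle.
grads = [0.1] * 5 + [10.0] + [0.1] * 5
history = scaling_terms(grads)
for step, (u, root_v) in enumerate(history, start=1):
    print(f"step {step:2d}  |g|={abs(grads[step - 1]):5.1f}  u={u:.4f}  sqrt(v)={root_v:.4f}")
```

Because u is never smaller than the current |g|, the ratio |g| / u cannot exceed 1, which is one way to read the stability remark above.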

# Advantages of Adamax optimizer ==> 

(1). Adaptive learning rate: Adamax adjusts the learning rate for each parameter individually based on the magnitudes of the past gradients. This adaptive behavior allows it to handle different learning rates for different parameters, leading to efficient optimization.
    

(2). Momentum-based updates: Adamax incorporates momentum through the first moment estimates (m), which helps accelerate convergence and smooth out noisy gradients. The momentum term allows the optimizer to continue moving in the right direction even when the gradients fluctuate.
    

(3). Robustness to sparse gradients: Adamax performs well in scenarios where the gradients are sparse, meaning only a few parameters have significant updates. It adapts to different magnitudes of gradients by using the infinity norm, which can handle large gradients effectively.
    

(4). Modest memory requirements: Adamax stores two extra vectors, the first moment estimates (m) and the exponentially weighted infinity norm (u), each with the same dimensionality as the model parameters. This is the same footprint as Adam.
    

# Disadvantages of Adamax optimizer ==> 

(1). Sensitivity to learning rate: Adamax can be sensitive to the choice of the initial learning rate. Setting a learning rate that is too high may result in unstable updates or failure to converge, while setting it too low can lead to slow convergence.
    

(2). Potential for overshooting: The momentum-based updates in Adamax may cause the optimizer to overshoot the minimum, especially when the gradients are noisy or the learning rate is high. This behavior can result in slower convergence or oscillations around the optimal solution.
    

(3). Lack of theoretical guarantees: Unlike some other optimization algorithms, such as stochastic gradient descent with momentum, Adamax does not have strong theoretical convergence guarantees. While it often performs well in practice, its convergence properties are not yet well-understood.
    

(4). Additional hyperparameters: Adamax introduces additional hyperparameters, such as the exponential decay rates (β1 and β2) and the small constant (ε) added to the denominator. Choosing appropriate values for these hyperparameters may require some trial and error or hyperparameter tuning.
    

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow import keras
from tensorflow.keras import layers, optimizers

# Load the churn dataset (adjust the path to wherever the CSV lives on your machine)
data = pd.read_csv("C:\\Users\\rajen\\OneDrive\\Desktop\\data\\Churn_Modelling.csv")
# data.head()

# Separate features and target variable
# RowNumber and CustomerId are row identifiers, not predictors, so they are dropped as well
X = data.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Exited'])
y = data['Exited']

# Perform label encoding for categorical variables
label_encoder = LabelEncoder()
X['Geography'] = label_encoder.fit_transform(X['Geography'])
X['Gender'] = label_encoder.fit_transform(X['Gender'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the numerical features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the model architecture
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile the model with Adamax optimizer
optimizer = optimizers.Adamax(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)


# Nadam Optimizer ===> 

Nadam (Nesterov-accelerated Adaptive Moment Estimation) is an optimization algorithm that combines the advantages of Nesterov accelerated gradient (NAG) and Adam (Adaptive Moment Estimation). It is a variant of Adam that incorporates Nesterov momentum into its update rule. Nadam aims to provide faster convergence and better generalization performance compared to traditional momentum-based optimizers.

The key components and equations used in the Nadam optimizer:

(1). Initialization:

Set the hyperparameters:

Learning rate (α): Typically a small value, e.g., 0.001.

β1: Exponential decay rate for the first moment estimates. Commonly set to 0.9.

β2: Exponential decay rate for the second moment estimates. Commonly set to 0.999.

ε: A small value added to the denominator to avoid division by zero. Typically around 1e-8.

Initialize the first moment vector (m) and the second moment vector (v) for each parameter to zero.


(2).  Iterative update:

For each iteration/timestep t:

Compute the gradients of the loss function with respect to the parameters.

Update the first moment estimates (m) using the exponential decay rate β1:

m = β1 * m + (1 - β1) * gradients

Update the second moment estimates (v) using the exponential decay rate β2:

v = β2 * v + (1 - β2) * gradients^2


Compute the bias-corrected first and second moment estimates:

m_hat = m / (1 - β1^t)
v_hat = v / (1 - β2^t)

Compute the Nesterov momentum term, which blends the bias-corrected first moment with the bias-corrected current gradient:

momentum = β1 * m_hat + ((1 - β1) * gradients) / (1 - β1^t)

Update the parameters using the Nadam update rule:

parameter = parameter - (α / (sqrt(v_hat) + ε)) * momentum
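As a rough sketch, the Nadam iteration can be written for a single scalar parameter as follows. The function name `nadam_trajectory` and the toy objective f(x) = (x - 3)^2 are assumptions for illustration, and the sketch assumes the Nesterov momentum term is used directly as the numerator of the parameter step.

```python
import math

def nadam_trajectory(grad_fn, x0, alpha=0.001, beta1=0.9, beta2=0.999,
                     epsilon=1e-8, steps=1000):
    """Run Nadam on one scalar parameter and return every visited value."""
    x = x0
    m = 0.0  # first moment estimate
    v = 0.0  # second moment estimate
    trajectory = [x]
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Nesterov look-ahead: weight the corrected momentum and add the corrected current gradient.
        momentum = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
        x -= alpha * momentum / (math.sqrt(v_hat) + epsilon)
        trajectory.append(x)
    return trajectory

# Hypothetical toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
path = nadam_trajectory(lambda x: 2.0 * (x - 3.0), x0=0.0)
print("first step:", path[1] - path[0])
print("final x:", path[-1])
```

Compared with Adam, the only change is the numerator: Adam would use m_hat alone, while this sketch adds the extra weight on the current gradient.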

# Short summary ==> 
Nadam combines the look-ahead correction of Nesterov accelerated gradient with the adaptive learning rate and momentum features of Adam. By incorporating Nesterov momentum, Nadam often converges somewhat faster than Adam on deep learning models, although the gain depends on the task.

# Advantages of Nadam Optimizers ==> 

(1).  Fast convergence: Nadam combines the benefits of Nesterov accelerated gradient (NAG) and Adam optimizer, resulting in faster convergence compared to traditional momentum-based optimizers. The inclusion of Nesterov momentum allows the optimizer to make more accurate updates and handle complex optimization landscapes.

(2).  Good generalization performance: Nadam has been observed to have good generalization performance, meaning it can effectively generalize to unseen data beyond the training set. This can lead to improved accuracy and robustness of deep learning models.

(3).  Adaptive learning rate: Similar to Adam, Nadam adjusts the learning rate for each parameter individually based on the magnitudes of the first and second moment estimates. This adaptiveness enables efficient optimization and helps overcome the challenges of using a fixed learning rate.

(4). Robustness to noisy gradients: The combination of Nesterov momentum and adaptive learning rate in Nadam helps handle noisy gradients effectively. The optimizer can navigate through noisy or sparse gradients, resulting in stable and reliable updates.

# Disadvantages of Nadam Optimizers ==> 

(1). Hyperparameter sensitivity: Nadam, like other optimization algorithms, requires careful tuning of hyperparameters to achieve optimal performance. The choice of learning rate, decay rates, and other hyperparameters can impact the convergence and generalization ability of the model. It may require some experimentation and tuning to find the best set of hyperparameters.
    

(2). Computational overhead: Nadam involves additional computations compared to simpler optimization algorithms. It requires calculating and maintaining the first and second moment estimates for each parameter, which can increase the computational overhead. While this additional complexity is generally acceptable in most scenarios, it might be a consideration for computationally constrained environments.
    

(3). Lack of theoretical guarantees: Although Nadam has shown promising performance in practice, it lacks strong theoretical convergence guarantees like some other optimization algorithms. Its convergence properties are still an active area of research, and the behavior of Nadam in different optimization scenarios is not yet fully understood.
    

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow import keras
from tensorflow.keras import layers, optimizers

# Load the churn dataset (adjust the path to wherever the CSV lives on your machine)
data = pd.read_csv("C:\\Users\\rajen\\OneDrive\\Desktop\\data\\Churn_Modelling.csv")
# data.head()

# Separate features and target variable
# RowNumber and CustomerId are row identifiers, not predictors, so they are dropped as well
X = data.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Exited'])
y = data['Exited']

# Perform label encoding for categorical variables
label_encoder = LabelEncoder()
X['Geography'] = label_encoder.fit_transform(X['Geography'])
X['Gender'] = label_encoder.fit_transform(X['Gender'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the numerical features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the model architecture
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile the model with Nadam optimizer
optimizer = optimizers.Nadam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)


# AMSGrad Optimizer ==> 

AMSGrad is an optimization algorithm that extends the Adam (Adaptive Moment Estimation) optimizer. It addresses a limitation of Adam: the second moment estimate (v) is an exponential moving average, so it can shrink after a run of small gradients, which raises the effective learning rate and can lead to suboptimal convergence or oscillations in certain scenarios. AMSGrad modifies the update rule of Adam to use the running maximum of the second moment estimates (v_max), a scaling term that never decreases.

The key components and equations used in the AMSGrad optimizer:

(1).  Initialization:

(a). Set the hyperparameters:

Learning rate (α): Typically a small value, e.g., 0.001.

β1: Exponential decay rate for the first moment estimates. Commonly set to 0.9.

β2: Exponential decay rate for the second moment estimates. Commonly set to 0.999.

ε: A small value added to the denominator to avoid division by zero. Typically around 1e-8.

(b). Initialize the first moment vector (m), the second moment vector (v), and the maximum second moment vector (v_max) for each parameter to zero.



(2). Iterative update:
For each iteration/timestep t:

Compute the gradients of the loss function with respect to the parameters.

Update the first moment estimates (m) using the exponential decay rate β1:
    
m = β1 * m + (1 - β1) * gradients

Update the second moment estimates (v) using the exponential decay rate β2:
    
v = β2 * v + (1 - β2) * gradients^2

Update the maximum second moment estimates (v_max) as the maximum element-wise value between the current v_max and v:
    
v_max = max(v_max, v)

Compute the bias-corrected first and second moment estimates:
    
m_hat = m / (1 - β1^t)

v_hat = v_max / (1 - β2^t)

Update the parameters using the AMSGrad update rule:
    
parameter = parameter - (α / (sqrt(v_hat) + ε)) * m_hat
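The AMSGrad iteration can be sketched for a single scalar parameter as follows. The function name `amsgrad_trajectory` and the toy objective f(x) = (x - 3)^2 are assumptions for illustration. The sketch follows the equations as written here, including bias correction of v_max, which not every implementation applies.

```python
import math

def amsgrad_trajectory(grad_fn, x0, alpha=0.001, beta1=0.9, beta2=0.999,
                       epsilon=1e-8, steps=1000):
    """Run AMSGrad on one scalar parameter and return every visited value."""
    x = x0
    m = 0.0      # first moment estimate
    v = 0.0      # second moment estimate
    v_max = 0.0  # largest second moment estimate seen so far
    trajectory = [x]
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # The only change from Adam: keep the running maximum of v.
        v_max = max(v_max, v)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v_max / (1 - beta2 ** t)
        x -= alpha * m_hat / (math.sqrt(v_hat) + epsilon)
        trajectory.append(x)
    return trajectory

# Hypothetical toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
path = amsgrad_trajectory(lambda x: 2.0 * (x - 3.0), x0=0.0)
print("first step:", path[1] - path[0])
print("final x:", path[-1])
```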


# Summary ==> 
AMSGrad introduces the v_max term so that the second moment estimate used in the update never decreases. By dividing by the maximum of the past second moment estimates, AMSGrad keeps the effective per-parameter step size from growing again after a large gradient has been seen, preventing the oscillatory or non-convergent behavior that can occur in Adam when v shrinks rapidly.
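To see what the v_max term does in isolation, the sketch below tracks v and v_max over a made-up gradient stream with one large gradient followed by many small ones. The helper name `second_moment_histories` and the gradient values are assumptions for illustration, and bias correction is left out so that only the effect of the max is visible.

```python
def second_moment_histories(gradients, beta2=0.999):
    """Return the per-step values of Adam's v and AMSGrad's v_max."""
    v = 0.0
    v_max = 0.0
    v_history = []
    v_max_history = []
    for g in gradients:
        v = beta2 * v + (1 - beta2) * g * g
        v_max = max(v_max, v)
        v_history.append(v)
        v_max_history.append(v_max)
    return v_history, v_max_history

# Hypothetical gradient stream: one large gradient, then 200 small ones.
grads = [5.0] + [0.1] * 200
v_history, v_max_history = second_moment_histories(grads)
print("v:     first", v_history[0], "last", v_history[-1])
print("v_max: first", v_max_history[0], "last", v_max_history[-1])
```

Here v decays step by step once the small gradients arrive, so an Adam-style step α / sqrt(v) would grow, while v_max stays at the value set by the large gradient.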

# Advantages of AMSGrad optimizer ==> 

(1). Stability and avoidance of oscillations: Because v_max never decreases, the effective step size cannot suddenly grow after a run of small gradients, which helps prevent oscillations and improves stability during optimization. This stability is particularly beneficial in scenarios with non-convex and ill-conditioned optimization landscapes.
    

(2). Improved convergence in certain cases: AMSGrad has demonstrated improved convergence properties compared to Adam in scenarios where Adam fails to converge because its second moment estimates shrink and the effective learning rate grows. AMSGrad can provide more reliable optimization in such cases.
    

# Disadvantages of AMSGrad optimizer ==> 

(1). Potentially slower convergence: In some cases, keeping the maximum of the second moment estimates may lead to slower convergence compared to Adam. Because v_max never decreases, a single large gradient can keep the effective learning rate small for the rest of training, which sacrifices some of the adaptive learning rate properties of Adam.
    

(2). Additional computational and memory overhead: AMSGrad must compute and store the maximum second moment estimates (v_max), one extra value per parameter on top of Adam's m and v. This additional cost is generally acceptable in most scenarios.
    