
# Optimization
- Optimization is the process of adjusting a model's parameters (weights and biases) to minimize (or maximize) an objective function, such as a loss function.
- The goal is to minimize the loss so that the model's predictions become more accurate.

## Importance of Optimization in Training Linear Regression Model
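- Optimization is crucial for training a linear regression model because the model's accuracy depends directly on finding the best values of $w$ and $b$.
- Poor optimization (for example, too few iterations or a badly chosen learning rate) can leave the model underfit, while optimizing the training loss too aggressively on a flexible model can lead to overfitting.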

### Steps to Optimize Linear Regression Model
1. **Initializing parameters** -> Assigning random values to $w$ and $b$
2. **Forward Pass** -> Computing the predictions $\hat{y}$
3. **Calculate loss** -> Using a loss function to measure the error between $\hat{y}$ and $y$
4. **Backpropagation** -> Computing the gradients of the loss function w.r.t. $w$ and $b$
5. **Update Parameters** -> Adjusting $w$ and $b$ using an optimization algorithm like *Gradient Descent*
6. **Repeat** -> Steps 2 to 5 are repeated until the loss converges to a minimum or a stopping criterion is reached.
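
For step 4, the gradients can be written out explicitly. This is a sketch that assumes the loss is the mean squared error over $m$ training examples; the symbols $m$, $L$, $\eta$ and the example index $i$ are introduced here for illustration only.

$$L(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2, \qquad \hat{y}^{(i)} = w x^{(i)} + b$$

$$\frac{\partial L}{\partial w} = -\frac{2}{m} \sum_{i=1}^{m} x^{(i)} \left( y^{(i)} - \hat{y}^{(i)} \right), \qquad \frac{\partial L}{\partial b} = -\frac{2}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)$$

Step 5 then moves each parameter against its gradient, scaled by a learning rate $\eta$:

$$w = w - \eta \cdot \frac{\partial L}{\partial w}, \qquad b = b - \eta \cdot \frac{\partial L}{\partial b}$$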

In [None]:
# Importing libraries
import numpy as np
import tensorflow as tf

In [None]:
# Generating synthetic data for training & testing
np.random.seed(42)
X_train = np.random.rand(100, 1) # 100 samples, 1 feature
y_train = 4 * X_train + np.random.randn(100, 1) # Linear relation with some noise

X_test = np.random.rand(20, 1)
y_test = 4 * X_test + np.random.randn(20, 1)

In [None]:
# Creating linear regression model in tensorflow
class LinearRegressionModel(tf.Module):
  def __init__(self):
    super().__init__()  # sets up the module's name and name scope
    # Initializing weight and bias randomly
    self.w = tf.Variable(tf.random.normal([1]), name="weight")
    self.b = tf.Variable(tf.random.normal([1]), name="bias")

  def __call__(self, X):
    return self.w * X + self.b

model = LinearRegressionModel()

In [None]:
# Loss Function (Mean Squared Error)
def loss_fn(y_true, y_pred):
  return tf.reduce_mean(tf.square(y_true - y_pred))

In [None]:
# Training
def train(model, X_train, y_train, epochs):
  # Initializing optimizer after model creation
  optimizer = tf.optimizers.SGD(learning_rate=0.1)

  for epoch in range(epochs):
    with tf.GradientTape() as tape:
      predictions = model(X_train)
      loss = loss_fn(y_train, predictions)

    # Computing gradients and updating parameters
    gradients = tape.gradient(loss, [model.w, model.b])
    optimizer.apply_gradients(zip(gradients, [model.w, model.b]))

    if epoch % 10 == 0:
      print(f"Epoch {epoch}: Loss = {loss.numpy():.4f}, w = {model.w.numpy()}, b = {model.b.numpy()}")

train(model, X_train, y_train, epochs=100)

Epoch 0: Loss = 4.5440, w = [0.25911975], b = [0.5524367]
Epoch 10: Loss = 1.3363, w = [1.1261972], b = [1.3596457]
Epoch 20: Loss = 1.1972, w = [1.4720178], b = [1.2571218]
Epoch 30: Loss = 1.1003, w = [1.7482758], b = [1.121551]
Epoch 40: Loss = 1.0274, w = [1.9864544], b = [1.0012794]
Epoch 50: Loss = 0.9726, w = [2.1929066], b = [0.89683014]
Epoch 60: Loss = 0.9314, w = [2.3719232], b = [0.80624974]
Epoch 70: Loss = 0.9005, w = [2.527154], b = [0.72770417]
Epoch 80: Loss = 0.8772, w = [2.6617596], b = [0.65959466]
Epoch 90: Loss = 0.8597, w = [2.77848], b = [0.6005348]


In [None]:
# Testing

# Defining error metrics (MAE and RMSE); lower values mean a better fit
def mean_abs_error(y_true, y_pred):
  return tf.reduce_mean(tf.abs(y_true-y_pred))

def root_mean_sqr_error(y_true, y_pred):
  return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

def test(model, X_test, y_test):
  predictions = model(X_test)
  loss = loss_fn(y_test, predictions)
  mae = mean_abs_error(y_test, predictions)
  rmse = root_mean_sqr_error(y_test, predictions)
  print(f"Test MAE: {mae.numpy(): .4f}, Test RMSE: {rmse.numpy():.4f}, Test Loss: {loss.numpy():.4f}")
  return mae, rmse

In [None]:
test(model=model, X_test=X_test, y_test=y_test)

Test MAE:  0.6896, Test RMSE: 0.8188, Test Loss: 0.6704


(<tf.Tensor: shape=(), dtype=float32, numpy=0.6896304>,
 <tf.Tensor: shape=(), dtype=float32, numpy=0.8187523>)

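- MAE = 0.6896 means that, on average, the predictions differ from the actual test values by about 0.69, so there is still room for improvement.
- RMSE = 0.8188 is higher than the MAE, which indicates that some predictions have larger errors, since RMSE penalizes large errors more heavily than MAE does.
- Test loss = 0.6704 is the MSE, i.e. the square of the RMSE ($0.8188^2 \approx 0.6704$). The noise added to the data has unit variance, so a loss of this size is comparable to the noise level. The last logged weight ($w \approx 2.78$) is still well below the true slope of 4, so longer training could improve the fit.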
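
The relationship between the three metrics can be checked by hand. The following is a small sketch in plain Python; the four targets and predictions are hypothetical values chosen only for illustration, not taken from the test set above.

```python
import math

# Hypothetical targets and predictions, chosen only for illustration
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.0, 6.0]

# Per-example errors: [-0.5, 0.5, 0.0, -2.0]
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e * e for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                            # root mean squared error

print(f"MAE = {mae:.4f}, MSE = {mse:.4f}, RMSE = {rmse:.4f}")
# prints: MAE = 0.7500, MSE = 1.1250, RMSE = 1.0607
```

The single large error (-2.0) pulls the RMSE well above the MAE, which is the same effect seen in the test results above.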

## Gradient Descent

- Gradient descent is an optimization algorithm used to minimize the cost function (or loss function) in ML models, particularly in deep learning and neural networks.
- Its main goal is to adjust the model's parameters (weights & biases) iteratively by moving in the direction that reduces the error or cost most rapidly.
- The process of gradient descent involves:
    - Computing the gradient (partial derivatives) of the cost function with respect to each parameter.
    - Updating the parameters in the opposite direction of the gradient to reduce the cost function.
    - Repeating the process until the cost function reaches a minimum or a stopping criterion is met.
- Formula to update each parameter:

    $\theta = \theta - \eta \cdot \frac{\partial J(\theta)}{\partial \theta}$

    Here,
    - $\eta$ = learning rate; a small positive constant determining the step size for each iteration
    - $\frac{\partial J(\theta)}{\partial \theta}$ = gradient of the cost function $J(\theta)$ with respect to $\theta$
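
As a minimal illustration of the update rule, the sketch below applies it to a single parameter. The cost $J(\theta) = (\theta - 3)^2$, the starting point and the learning rate are hypothetical choices made for this example.

```python
# Hypothetical one-parameter cost: J(theta) = (theta - 3)^2, minimized at theta = 3
def gradient(theta):
    # dJ/dtheta for J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0  # arbitrary starting point
eta = 0.1    # learning rate

for step in range(50):
    # Update rule: theta = theta - eta * dJ/dtheta
    theta = theta - eta * gradient(theta)

print(f"theta after 50 steps: {theta:.4f}")
```

For this particular cost, each step shrinks the distance to the minimum by a factor of $1 - 2\eta = 0.8$, so $\theta$ ends very close to 3.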

In [None]:
import numpy as np
import tensorflow as tf

In [None]:
# Creating synthetic data
np.random.seed(0)
X = np.random.rand(100, 1).astype(np.float32) # 100 data points as inputs
y_true = 3.5 * X + 1.2 + np.random.randn(100, 1) * 0.1 # y=3.5*X + 1.2 with some noise

In [None]:
# Defining the linear model: y = w * X + b
class LinearModel(tf.Module):
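  def __init__(self):
    super().__init__()  # sets up the module's name and name scope
    # Scalar weight and bias, initialized randomly
    self.w = tf.Variable(np.random.randn(), dtype=tf.float32)
    self.b = tf.Variable(np.random.randn(), dtype=tf.float32)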

  def __call__(self, X):
    return self.w * X + self.b

In [None]:
# Defining the loss function (Mean Squared Error)
def loss_fn(y_pred, y_true):
  return tf.reduce_mean(tf.square(y_pred - y_true))

In [None]:
# Training the model using Gradient Descent
def train_step(model, X, y_true, learning_rate):
  with tf.GradientTape() as tape:
    y_pred = model(X)
    loss = loss_fn(y_pred, y_true)
  gradients = tape.gradient(loss, [model.w, model.b])
  model.w.assign_sub(learning_rate * gradients[0])
  model.b.assign_sub(learning_rate * gradients[1])
  return loss

In [None]:
# Initializing model
model = LinearModel()

# Training parameters
learning_rate = 0.1
epochs = 500

# Training the model
for epoch in range(epochs):
  loss = train_step(model, X, y_true, learning_rate)
  if epoch % 50 == 0: # Printing the loss every 50 epochs
    print(f"Epoch {epoch}: Loss = {loss.numpy(): .4f}, w: {model.w.numpy():.4f}, b: {model.b.numpy():.4f}")

Epoch 0: Loss =  30.6112, w: -0.0641, b: -1.1431
Epoch 50: Loss =  0.0890, w: 2.5380, b: 1.7065
Epoch 100: Loss =  0.0304, w: 3.0071, b: 1.4688
Epoch 150: Loss =  0.0152, w: 3.2460, b: 1.3478
Epoch 200: Loss =  0.0113, w: 3.3676, b: 1.2861
Epoch 250: Loss =  0.0103, w: 3.4295, b: 1.2548
Epoch 300: Loss =  0.0100, w: 3.4610, b: 1.2388
Epoch 350: Loss =  0.0099, w: 3.4771, b: 1.2306
Epoch 400: Loss =  0.0099, w: 3.4852, b: 1.2265
Epoch 450: Loss =  0.0099, w: 3.4894, b: 1.2244


In [None]:
# Final learned parameters
print(f"\nFinal parameters: w = {model.w.numpy():.4f}, b = {model.b.numpy():.4f}")


Final parameters: w = 3.4915, b = 1.2233


### Types of Gradient Descent

**1. Batch Gradient Descent (BGD)**
- Here, the gradient of the cost function is computed using the entire dataset for each iteration.
- It is a deterministic method, as it computes the average gradient of all training examples.

**Advantages**
- Smooth convergence to the global or local minimum.
- Computationally efficient when dealing with smaller datasets.

**Disadvantages**
- For large datasets, computing the gradient over the entire dataset at each step can be slow and memory-intensive.
- May be infeasible for very large datasets.

**Update Rule**

$\theta = \theta - \eta \cdot \frac{1}{m} \sum_{i=1}^m \nabla_\theta J (\theta, x^{(i)}, y^{(i)})$

where $m$ = number of training examples

In [7]:
import numpy as np
import tensorflow as tf

# Generating sample data
np.random.seed(42)
X = np.random.rand(100, 1) # 100 samples, 1 feature
y = 3*X+2 + np.random.randn(100, 1)*0.1 # y = 3x + 2 + noise

# Converting data to TensorFlow tensors
X_train = tf.constant(X, dtype=tf.float32)
y_train = tf.constant(y, dtype=tf.float32)

# Defining model parameters
weights = tf.Variable(tf.random.normal([1, 1]), dtype=tf.float32)
bias = tf.Variable(tf.random.normal([1]), dtype=tf.float32)

# Defining hyperparameters
learning_rate = 0.1
epochs = 1000

# Training loop for batch gradient descent
for epoch in range(epochs):
  with tf.GradientTape() as tape:
    # Linear model: y_pred = X*weights + bias
    y_pred = tf.matmul(X_train, weights) + bias

    # Mean Squared Error Loss
    loss = tf.reduce_mean(tf.square(y_train - y_pred))

  # Computing gradients
  gradients = tape.gradient(loss, [weights, bias])

  # Updating parameters
  weights.assign_sub(learning_rate * gradients[0])
  bias.assign_sub(learning_rate * gradients[1])

  # Logging progress every 100 epochs
  if epoch % 100 == 0:
    print(f"Epoch {epoch}: Loss = {loss.numpy()}")

# Output final parameters
print(f"Final parameters: w = {weights.numpy()}, b = {bias.numpy()}")

Epoch 0: Loss = 9.431061744689941
Epoch 100: Loss = 0.02597496658563614
Epoch 200: Loss = 0.009100452065467834
Epoch 300: Loss = 0.008125617168843746
Epoch 400: Loss = 0.008069302886724472
Epoch 500: Loss = 0.00806605163961649
Epoch 600: Loss = 0.008065860718488693
Epoch 700: Loss = 0.008065851405262947
Epoch 800: Loss = 0.008065850473940372
Epoch 900: Loss = 0.008065846748650074
Final parameters: w = [[2.954013]], b = [2.0215147]


Here, synthetic data is generated for a simple linear regression problem, `y = 3x + 2 + noise`. The linear model is defined as `y_pred = X*weights + bias`, and MSE is used to measure the error. `tf.GradientTape` computes the gradients of the loss w.r.t. the parameters, and these gradients, scaled by the learning rate, are subtracted from `weights` & `bias`.

The loss decreases over the epochs, and the trained parameters (`weights` ≈ 2.954, `bias` ≈ 2.022) approach the true values 3 & 2 respectively.
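
To make explicit what `tape.gradient` computes in the loop above, the same batch update can be sketched in plain Python with the MSE gradients written by hand. This is an illustrative sketch, not the notebook's implementation: the data comes from Python's `random` module rather than NumPy, and the names `xs`, `ys`, `grad_w` and `grad_b` are assumptions introduced here.

```python
import random

random.seed(0)
# Hypothetical data following y = 3x + 2 with small Gaussian noise
xs = [random.random() for _ in range(100)]
ys = [3.0 * x + 2.0 + random.gauss(0.0, 0.1) for x in xs]

w, b = 0.0, 0.0
learning_rate = 0.1
m = len(xs)

for epoch in range(1000):
    # Residuals over the ENTIRE dataset: this is what makes it "batch"
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]

    # Analytic gradients of the mean squared error
    grad_w = -2.0 / m * sum(x * r for x, r in zip(xs, residuals))
    grad_b = -2.0 / m * sum(residuals)

    # One parameter update per pass over the data
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

With these settings the parameters should end near the true values 3 and 2, up to the noise in the generated data.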

**2. Stochastic Gradient Descent (SGD)**
- Here, the gradient of the cost function is computed for one training example at a time, and the parameters are updated after each example, so the model starts learning before the entire dataset has been processed.

**Advantages**
- Each iteration is fast, as parameters are updated after every single training example.
- Can handle large datasets as it processes one example at a time.
- The noise in its updates can help it escape shallow local minima.

**Disadvantages**
- The convergence path is noisier than in Batch Gradient Descent, making it harder to settle exactly at the minimum.
- With a fixed learning rate, it may overshoot the minimum and keep fluctuating around it.

**Update Rule**

$\theta = \theta - \eta \cdot \nabla_\theta J(\theta, x^{(i)}, y^{(i)})$

where $(x^{(i)}, y^{(i)})$ is a single training example.
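
The update rule can be sketched in plain Python as follows. This is a minimal example under stated assumptions: the data (`y = 3x + 2` plus noise), the learning rate, the number of epochs and all variable names are hypothetical choices, and the examples are visited in a freshly shuffled order each epoch.

```python
import random

random.seed(0)
# Hypothetical data following y = 3x + 2 with small Gaussian noise
xs = [random.random() for _ in range(100)]
ys = [3.0 * x + 2.0 + random.gauss(0.0, 0.1) for x in xs]

w, b = 0.0, 0.0
learning_rate = 0.05
indices = list(range(len(xs)))

for epoch in range(50):
    random.shuffle(indices)  # new random visiting order each epoch
    for i in indices:
        # Gradient of the squared error for ONE example (x_i, y_i)
        residual = ys[i] - (w * xs[i] + b)
        grad_w = -2.0 * xs[i] * residual
        grad_b = -2.0 * residual

        # Parameters are updated immediately, before seeing the next example
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

Because every update uses a single noisy example, `w` and `b` keep fluctuating slightly around the best-fit values instead of settling exactly, which matches the disadvantages listed above.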

**3. Mini-Batch Gradient Descent (MBGD)**
- Provides a balance between Batch & Stochastic Gradient Descent.
- Instead of using the entire dataset (BGD) or one sample (SGD), Mini-Batch Gradient Descent updates the parameters by computing the gradient based on a small batch of training examples.

**Advantages**
- Provides benefits of both BGD and SGD: faster than BGD and more stable than SGD.
- Efficiently utilizes vectorized operations and hardware acceleration (e.g., GPUs).
- Noise in the gradient updates helps avoid local minima, while averaging over a batch stabilizes the updates.

**Disadvantages**
- Requires tuning the batch size, which can affect convergence speed and performance.

**Update Rule**

$\theta = \theta - \eta \cdot \frac{1}{n} \sum_{i=1}^n \nabla_{\theta}J(\theta, x^{(i)}, y^{(i)})$

where $n$ = mini-batch size (e.g., 32, 64, 128)
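
A plain-Python sketch of this rule is shown below. It is illustrative only: the data (`y = 3x + 2` plus noise), the batch size of 32, the learning rate, the epoch count and the variable names are assumptions made for this example.

```python
import random

random.seed(0)
# Hypothetical data following y = 3x + 2 with small Gaussian noise
xs = [random.random() for _ in range(100)]
ys = [3.0 * x + 2.0 + random.gauss(0.0, 0.1) for x in xs]

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 32
indices = list(range(len(xs)))

for epoch in range(200):
    random.shuffle(indices)  # reshuffle so batches differ between epochs
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        n = len(batch)  # the last batch may be smaller than batch_size

        # Average MSE gradient over the current mini-batch only
        residuals = [ys[i] - (w * xs[i] + b) for i in batch]
        grad_w = -2.0 / n * sum(xs[i] * r for i, r in zip(batch, residuals))
        grad_b = -2.0 / n * sum(residuals)

        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

Setting `batch_size = 1` in this sketch gives stochastic gradient descent, and setting it to `len(xs)` gives batch gradient descent, which is why mini-batch sits between the two.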