# Stochastic Gradient Descent
### ELI5

Imagine you're at the top of a hilly park and you have a ball. You want to roll the ball down to the lowest point of the park. Instead of looking at the whole park to find the quickest path down, you just look at the ground right in front of you. You give the ball a little push in the direction that seems to go downwards. You keep doing this, giving the ball little pushes, and watching which way it goes down. After several little pushes, the ball eventually reaches the lowest point.

In this analogy:
- The hilly park is like a function we want to minimize.
- The ball is our current guess or solution.
- The little pushes are small steps we take to adjust our solution to make it better.

### Or in other words

Stochastic Gradient Descent (SGD) is an optimization algorithm commonly used in machine learning, especially for training deep neural networks.

Let's break it down:

1. **Gradient**: This refers to the slope or direction of the steepest increase of a function. For a multi-variable function, it's a vector of partial derivatives. In simpler terms, the gradient points in the direction where the function is increasing the fastest.
2. **Descent**: This means we want to go in the opposite direction of the gradient. Why? Because we usually want to minimize a function, like the error in a machine learning model. By moving opposite to the gradient, we're heading towards the minimum.
3. **Stochastic**: In the context of SGD, this means that instead of using the entire dataset to compute the gradient (which can be computationally expensive), we use just one or a small batch of data points. This introduces randomness (hence "stochastic"), as the direction of the gradient can vary based on which data points are chosen.
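
To make the "gradient" and "descent" parts concrete, here is a minimal sketch. The two-variable function `f(x, y) = x^2 + 3*y^2`, the starting point, and the step size are assumptions chosen for illustration. It computes the vector of partial derivatives and compares a small step against the gradient with a small step along it.

```python
import numpy as np

def f(w):
    # Hypothetical example function: f(x, y) = x^2 + 3*y^2
    x, y = w
    return x**2 + 3 * y**2

def grad_f(w):
    # Vector of partial derivatives: (df/dx, df/dy) = (2x, 6y)
    x, y = w
    return np.array([2 * x, 6 * y])

w = np.array([1.0, -2.0])   # arbitrary starting point
g = grad_f(w)               # gradient at w
step = 0.1                  # small step size (assumed)

w_down = w - step * g       # move against the gradient ("descent")
w_up = w + step * g         # move along the gradient

print("f at start:          ", f(w))
print("f after step against:", f(w_down))
print("f after step along:  ", f(w_up))
```

Here the function value drops from 13 to about 2.56 when stepping against the gradient and rises to about 32.16 when stepping along it, which is why the update uses a minus sign.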

The process is iterative:
- Start with an initial guess for the solution (like initial weights in a neural network).
- Randomly pick a data point (or a small batch) from the dataset.
- Compute the gradient using that data point.
- Update the solution by moving a small step in the opposite direction of the gradient.
- Repeat until convergence or for a set number of iterations.
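
The steps above can be sketched as a short mini-batch SGD loop. This is only an illustration under stated assumptions: the synthetic data (`y = 3*x + 2` plus noise), the mean squared error loss, and the helper name `minibatch_sgd` with its default parameters are all made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3*x + 2 plus small noise
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=200)

def minibatch_sgd(X, y, learning_rate=0.1, batch_size=8, num_iterations=2000):
    w, b = 0.0, 0.0                                      # initial guess
    for _ in range(num_iterations):
        idx = rng.integers(0, len(X), size=batch_size)   # pick a random batch
        xb, yb = X[idx], y[idx]
        err = (w * xb + b) - yb                          # prediction error on the batch
        grad_w = 2 * np.mean(err * xb)                   # d/dw of the batch mean squared error
        grad_b = 2 * np.mean(err)                        # d/db of the batch mean squared error
        w -= learning_rate * grad_w                      # step against the gradient
        b -= learning_rate * grad_b
    return w, b

w, b = minibatch_sgd(X, y)
print("estimated slope and intercept:", w, b)
```

The estimates should land close to the slope 3 and intercept 2 used to generate the data, though not exactly on them, because every update is based on only 8 noisy points.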

The main advantage of SGD over traditional (full-batch) gradient descent is speed. Because it uses only a subset of the data at each step, each update is cheap, and the solution starts improving without a full pass over the dataset. The trade-off is noise: each step follows an estimate of the gradient rather than the exact gradient, so the path to the minimum is less direct. With a suitable learning rate and batch size, it can still converge to a good solution.
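
The speed-versus-noise trade-off can be seen numerically. The sketch below assumes synthetic data (`y = 3*x` plus noise), a mean squared error loss, and a batch size of 32; the helper `mse_gradient` is hypothetical. It compares the gradient computed on all 10,000 points with many gradients computed on small random batches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): y = 3*x plus noise
X = rng.uniform(-1, 1, size=10_000)
y = 3.0 * X + rng.normal(0, 0.5, size=10_000)

def mse_gradient(w, xs, ys):
    # Gradient of mean((w*x - y)^2) with respect to w
    return 2 * np.mean((w * xs - ys) * xs)

w = 0.0
full_grad = mse_gradient(w, X, y)            # uses all 10,000 points

batch_grads = []
for _ in range(1000):
    idx = rng.integers(0, len(X), size=32)   # 32 randomly chosen points
    batch_grads.append(mse_gradient(w, X[idx], y[idx]))
batch_grads = np.array(batch_grads)

print("full gradient:          ", full_grad)
print("mean of batch gradients:", batch_grads.mean())
print("std of batch gradients: ", batch_grads.std())
```

Each batch gradient touches about 0.3% of the data, and individual batches scatter around the full gradient, yet their average stays close to it. That is the sense in which SGD steps are cheap but noisy.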


### Basic example in Python
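We'll find the minimum of the quadratic function f(x) = x**2 with the gradient descent update rule that SGD is built on. This toy function has no dataset to sample from, so the gradient is exact and the only randomness is the starting point; in real SGD the gradient would be a noisy estimate computed from a random data point or batch.

The true minimum of this function is at x = 0.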

Here's how the SGD process would look:

* Initialization: Start with a random guess for x.
* Gradient Calculation: f'(x) = 2x
* Update Rule: Update x by subtracting a fraction (learning rate) of the gradient.
* Iteration: Repeat the gradient calculation and update steps until convergence.
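
For this particular function the update rule has a simple closed form, which makes the effect of the learning rate easy to see. Substituting f'(x) = 2x into the update gives x_new = x - learning_rate * 2x = (1 - 2 * learning_rate) * x, so every step multiplies x by the same factor. The sketch below checks this numerically; the starting point `x0 = 5.0` and the helper name `gd_step` are assumptions made for illustration.

```python
def gd_step(x, learning_rate):
    # One update for f(x) = x^2: subtract learning_rate * f'(x), where f'(x) = 2x
    return x - learning_rate * 2 * x

x0 = 5.0              # hypothetical starting point
learning_rate = 0.1   # each step then multiplies x by 1 - 2*0.1 = 0.8

x = x0
for t in range(1, 6):
    x = gd_step(x, learning_rate)
    closed_form = (1 - 2 * learning_rate) ** t * x0
    print(t, x, closed_form)   # the two columns should agree up to rounding

# With a learning rate above 1 the factor |1 - 2*learning_rate| exceeds 1,
# so the iterate moves away from the minimum instead of towards it.
diverging = gd_step(1.0, 1.5)
print(diverging)
```

With a learning rate of 0.1 the distance to the minimum shrinks by 20% per step. For this function, any learning rate between 0 and 1 converges, while values above 1 diverge.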

In [1]:
import numpy as np

# Function f(x) = x^2
def f(x):
    return x**2

# Derivative of f, f'(x) = 2x
def df(x):
    return 2*x

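# Gradient descent loop. Note: df(x) is the exact gradient, so no step here is
# stochastic; a true SGD step would use a noisy gradient estimate from sampled data.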
def sgd(initial_x, learning_rate, num_iterations):
    x = initial_x
    history = [x]
    
    for i in range(num_iterations):
        gradient = df(x)
        x = x - learning_rate * gradient
        history.append(x)
        
    return x, history

# Parameters
initial_x = np.random.uniform(-10, 10)  # Starting with a random guess between -10 and 10
learning_rate = 0.1
num_iterations = 50

final_x, history = sgd(initial_x, learning_rate, num_iterations)

final_x, history

(-3.7806166769441725e-05,
 [-2.64888617180133,
  -2.119108937441064,
  -1.6952871499528512,
  -1.356229719962281,
  -1.0849837759698249,
  -0.8679870207758599,
  -0.6943896166206879,
  -0.5555116932965503,
  -0.4444093546372402,
  -0.35552748370979215,
  -0.2844219869678337,
  -0.22753758957426698,
  -0.1820300716594136,
  -0.14562405732753086,
  -0.11649924586202469,
  -0.09319939668961975,
  -0.0745595173516958,
  -0.05964761388135664,
  -0.04771809110508531,
  -0.03817447288406825,
  -0.0305395783072546,
  -0.02443166264580368,
  -0.019545330116642945,
  -0.015636264093314357,
  -0.012509011274651486,
  -0.01000720901972119,
  -0.008005767215776952,
  -0.006404613772621562,
  -0.00512369101809725,
  -0.0040989528144778,
  -0.00327916225158224,
  -0.0026233298012657918,
  -0.0020986638410126334,
  -0.0016789310728101067,
  -0.0013431448582480853,
  -0.0010745158865984683,
  -0.0008596127092787747,
  -0.0006876901674230197,
  -0.0005501521339384157,
  -0.0004401217071507326,
  -0.0003