# Implement Gradient Descent Variants with MSE Loss
In this problem, you implement a single function that performs three variants of gradient descent: Stochastic Gradient Descent (SGD), Batch Gradient Descent, and Mini-Batch Gradient Descent, each using Mean Squared Error (MSE) as the loss function. The `method` parameter (`'batch'`, `'stochastic'`, or `'mini_batch'`) selects the variant, and `batch_size` sets the number of samples per mini-batch.

## Example
```python
import numpy as np

# Sample data
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])

# Parameters
learning_rate = 0.01
n_iterations = 1000
batch_size = 2

# Initialize weights
weights = np.zeros(X.shape[1])

# Test Batch Gradient Descent
final_weights = gradient_descent(X, y, weights, learning_rate, n_iterations, method='batch')
# output: [float, float]

# Test Stochastic Gradient Descent
final_weights = gradient_descent(X, y, weights, learning_rate, n_iterations, method='stochastic')
# output: [float, float]

# Test Mini-Batch Gradient Descent
final_weights = gradient_descent(X, y, weights, learning_rate, n_iterations, batch_size, method='mini_batch')
# output: [float, float]
```

## Understanding Gradient Descent Variants with MSE Loss

Gradient Descent is an optimization algorithm used to minimize the cost function in machine learning models, particularly in linear regression and neural networks. The Mean Squared Error (MSE) loss function is commonly used in regression tasks. There are three main types of gradient descent based on how much data is used to compute the gradient at each iteration:
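
All three update rules follow from the same gradient. As a short derivation, assuming a linear model $h_{\theta}(x) = \theta^T x$ (an assumption made here for illustration), the MSE loss over $m$ samples is:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$$

Differentiating with respect to $\theta$ by the chain rule gives:

$$\nabla_\theta J(\theta) = \frac{2}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}$$

The variants differ only in which samples the sum runs over before each parameter update.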

## 1. Batch Gradient Descent

Batch Gradient Descent computes the gradient of the MSE loss function with respect to the parameters for the entire training dataset. It updates the parameters after processing the entire dataset:

$$\theta = \theta - \alpha \cdot \frac{2}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}$$

Where $\alpha$ is the learning rate, $m$ is the number of samples, $h_{\theta}(x^{(i)})$ is the model's prediction for sample $i$, and the term multiplied by $\alpha$ is the gradient $\nabla_\theta J(\theta)$ of the MSE loss function.
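
To make the update concrete, here is a minimal sketch of a single batch step on the sample data from the example, assuming a linear model where predictions are `X.dot(weights)`. The variable names `errors` and `new_weights` are illustrative only.

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
learning_rate = 0.01
weights = np.zeros(X.shape[1])

# Prediction errors for all samples at once: [-2, -3, -4, -5] with zero weights
errors = X.dot(weights) - y

# Gradient of the MSE over the whole dataset: [-20, -7]
gradient = 2 * X.T.dot(errors) / len(y)

# One parameter update, approximately [0.2, 0.07]
new_weights = weights - learning_rate * gradient
print(new_weights)
```

The whole dataset contributes to this one update, so each step costs a full pass over the data.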

## 2. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent updates the parameters after each individual training example. Each update is cheap to compute, but because it is based on a single sample, the updates are noisier:

$$\theta = \theta - \alpha \cdot 2(h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}$$

Where $x^{(i)}$ and $y^{(i)}$ are the features and target of a single training example.
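
As a rough illustration, the sketch below applies the per-sample update to the first two rows of the sample data, again assuming a linear model. The `history` list is a hypothetical addition used only to record the weights after each step.

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
learning_rate = 0.01
weights = np.zeros(X.shape[1])

history = []  # weights after each single-sample update
for i in range(2):  # only the first two samples, to keep the arithmetic short
    error = X[i].dot(weights) - y[i]   # scalar error for sample i
    gradient = 2 * error * X[i]        # gradient from this sample alone
    weights = weights - learning_rate * gradient
    history.append(weights.copy())

print(history[0])  # approximately [0.04, 0.04]
print(history[1])  # approximately [0.1552, 0.0976]
```

Note that the second update already uses the weights produced by the first, which is what distinguishes this from computing one gradient over both samples.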

## 3. Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a compromise between Batch and Stochastic Gradient Descent. It updates the parameters after processing a small batch of training examples, without shuffling the data:

$$\theta = \theta - \alpha \cdot \frac{2}{b} \sum_{i = 1}^b (h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}$$

Where $b$ is the batch size, and the sum runs over the $b$ examples in the current mini-batch, a subset of the training dataset.
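
One detail worth checking is how the data splits into consecutive, unshuffled batches. The sketch below slices the sample data with a `batch_size` of 3, chosen here only because it does not divide the 4 samples evenly. The `batch_lengths` list is an illustrative name.

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
batch_size = 3  # deliberately does not divide len(y) == 4

batch_lengths = []
for start in range(0, len(y), batch_size):
    # Slicing past the end of an array is safe: NumPy just returns what is left
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    batch_lengths.append(len(y_batch))

print(batch_lengths)  # [3, 1]
```

The last slice can be shorter than `batch_size`, which matters whenever the gradient is averaged over the number of samples in the batch.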

Each method has its advantages: Batch Gradient Descent is more stable but slower, Stochastic Gradient Descent is faster but noisy, and Mini-Batch Gradient Descent strikes a balance between the two.
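
One way to see the trade-off is to count parameter updates per pass over the data. The helper below is a hypothetical sketch, not part of the required solution, and assumes each iteration visits every sample exactly once.

```python
import math

def updates_per_pass(m, method, batch_size=1):
    """Number of parameter updates in one pass over m samples (illustrative helper)."""
    if method == 'batch':
        return 1                           # one update from the full gradient
    if method == 'stochastic':
        return m                           # one update per sample
    if method == 'mini_batch':
        return math.ceil(m / batch_size)   # one update per (possibly short) batch
    raise ValueError(f"Unknown method: {method!r}")

for method in ('batch', 'stochastic', 'mini_batch'):
    print(method, updates_per_pass(4, method, batch_size=2))
```

For the four-sample example with `batch_size=2`, this gives 1, 4, and 2 updates respectively.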

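```python
import numpy as np

def gradient_descent(X, y, weights, learning_rate, n_iterations, batch_size=1, method='batch'):
    """Minimize MSE with batch, stochastic, or mini-batch gradient descent."""
    if method not in ('batch', 'stochastic', 'mini_batch'):
        raise ValueError(f"Unknown method: {method!r}")
    m = len(y)
    for _ in range(n_iterations):
        if method == 'batch':
            # One update per iteration, using the gradient over all m samples.
            errs = X.dot(weights) - y
            gradient = 2 * X.T.dot(errs) / m
            weights = weights - learning_rate * gradient
        elif method == 'stochastic':
            # One update per sample, visited in order (no shuffling).
            for i in range(m):
                err = X[i].dot(weights) - y[i]
                gradient = 2 * err * X[i]
                weights = weights - learning_rate * gradient
        elif method == 'mini_batch':
            # One update per consecutive slice of up to batch_size samples.
            for i in range(0, m, batch_size):
                xbi = X[i:i + batch_size]
                ybi = y[i:i + batch_size]
                errs = xbi.dot(weights) - ybi
                # Divide by the actual slice length: the last batch may be shorter.
                gradient = 2 * xbi.T.dot(errs) / len(ybi)
                weights = weights - learning_rate * gradient
    return weights
```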
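
A quick sanity check on any of the variants is that the MSE decreases as training proceeds. The sketch below defines a hypothetical `mse` helper and repeats the batch update inline for 100 iterations so that it runs on its own. It is an illustration under the linear-model assumption, not part of the graded solution.

```python
import numpy as np

def mse(X, y, weights):
    """Mean squared error of a linear model (illustrative helper)."""
    residuals = X.dot(weights) - y
    return float(np.mean(residuals ** 2))

X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
learning_rate = 0.01
weights = np.zeros(X.shape[1])

initial_loss = mse(X, y, weights)  # 13.5 with zero weights

# Same update as the batch variant, written inline for self-containment
for _ in range(100):
    gradient = 2 * X.T.dot(X.dot(weights) - y) / len(y)
    weights = weights - learning_rate * gradient

final_loss = mse(X, y, weights)
print(initial_loss, final_loss)
```

The data satisfy $y = x + 1$ exactly, so the loss reaches zero at weights `[1, 1]`. After only 100 iterations the weights are still on their way there.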

```python
import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
weights = np.zeros(X.shape[1])
learning_rate = 0.01
n_iterations = 100
# Test Batch Gradient Descent
output = gradient_descent(X, y, weights, learning_rate, n_iterations, method='batch')
print('Test Case 1: Accepted' if np.allclose(output, [1.14905239, 0.56176776]) else 'Test Case 1: Failed')
print('Input:')
print('import numpy as np\nX = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])\ny = np.array([2, 3, 4, 5])\nweights = np.zeros(X.shape[1])\nlearning_rate = 0.01\nn_iterations = 100\n# Test Batch Gradient Descent\noutput = gradient_descent(X, y, weights, learning_rate, n_iterations, method=\'batch\')\nprint(output)')
print()
print('Output:')
print(output)
print()
print('Expected:')
print('[1.14905239 0.56176776]')
print()
print()


import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
weights = np.zeros(X.shape[1])
learning_rate = 0.01
n_iterations = 100
# Test Stochastic Gradient Descent
output = gradient_descent(X, y, weights, learning_rate, n_iterations, method='stochastic')
print('Test Case 2: Accepted' if np.allclose(output, [1.0507814, 0.83659454]) else 'Test Case 2: Failed')
print('Input:')
print('import numpy as np\nX = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])\ny = np.array([2, 3, 4, 5])\nweights = np.zeros(X.shape[1])\nlearning_rate = 0.01\nn_iterations = 100\n# Test Stochastic Gradient Descent\noutput = gradient_descent(X, y, weights, learning_rate, n_iterations, method=\'stochastic\')\nprint(output)')
print()
print('Output:')
print(output)
print()
print('Expected:')
print('[1.0507814  0.83659454]')
print()
print()


import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
weights = np.zeros(X.shape[1])
learning_rate = 0.01
n_iterations = 100
batch_size = 2
# Test Mini-Batch Gradient Descent
output = gradient_descent(X, y, weights, learning_rate, n_iterations, batch_size, method='mini_batch')
print('Test Case 3: Accepted' if np.allclose(output, [1.10334065, 0.68329431]) else 'Test Case 3: Failed')
print('Input:')
print('import numpy as np\nX = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])\ny = np.array([2, 3, 4, 5])\nweights = np.zeros(X.shape[1])\nlearning_rate = 0.01\nn_iterations = 100\nbatch_size = 2\n# Test Mini-Batch Gradient Descent\noutput = gradient_descent(X, y, weights, learning_rate, n_iterations, batch_size, method=\'mini_batch\')\nprint(output)')
print()
print('Output:')
print(output)
print()
print('Expected:')
print('[1.10334065 0.68329431]')
```

Test Case 1: Accepted
Input:
import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
weights = np.zeros(X.shape[1])
learning_rate = 0.01
n_iterations = 100
# Test Batch Gradient Descent
output = gradient_descent(X, y, weights, learning_rate, n_iterations, method='batch')
print(output)

Output:
[1.14905239 0.56176776]

Expected:
[1.14905239 0.56176776]


Test Case 2: Accepted
Input:
import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
weights = np.zeros(X.shape[1])
learning_rate = 0.01
n_iterations = 100
# Test Stochastic Gradient Descent
output = gradient_descent(X, y, weights, learning_rate, n_iterations, method='stochastic')
print(output)

Output:
[1.0507814  0.83659454]

Expected:
[1.0507814  0.83659454]


Test Case 3: Accepted
Input:
import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])
y = np.array([2, 3, 4, 5])
weights = np.zeros(X.shape[1])
learning_rate = 0.01
n_iterations = 100
batch_size 