# Gradient Descent, Regression, MLE and MSE
Sterling Hayden

_Note: I had ChatGPT format all my equations into LaTeX._

## Parameter update rule for Linear Regression using Gradient Descent

**Linear Regression Equation:**

The linear regression equation is defined as:

$$y = b_0 + b_1x_1 + b_2x_2 + \ldots + b_Jx_J$$

Where:
- \(y\) is the predicted output (dependent variable).
- \(x_1, x_2, \ldots, x_J\) are the \(J\) input features (independent variables).
- \(b_0\) is the y-intercept.
- \(b_1, b_2, \ldots, b_J\) are the weights associated with each feature.


**Step 1:** Define Mean Squared Error (MSE):

$$MSE = \frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \left(b_0 + \sum_{j=1}^{J} b_j x_{ij}\right)\right)^2$$

Where:
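- \(N\) is the number of data points.
- \(i\) indexes the data points, so \(y_i\) is the observed output and \(x_{ij}\) is the value of feature \(j\) for data point \(i\).
- \(j\) indexes the features.
- The factor \(\frac{1}{2}\) is a convenience that cancels when differentiating; it does not change the minimizer.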
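
To make the definition concrete, here is a minimal NumPy sketch of this loss, including the \(\frac{1}{2N}\) factor. The function name `mse_loss` and the two-point dataset are made up for illustration; they are small enough to check by hand.

```python
import numpy as np

def mse_loss(X, y, b0, b):
    """Loss from Step 1: (1 / 2N) * sum of squared residuals."""
    y_hat = b0 + X @ b            # predictions, shape (N,)
    residuals = y - y_hat
    return np.sum(residuals ** 2) / (2 * len(y))

# Illustrative data: N = 2 points, J = 1 feature
X = np.array([[1.0], [2.0]])
y = np.array([3.0, 5.0])

perfect = mse_loss(X, y, b0=1.0, b=np.array([2.0]))  # y = 1 + 2x fits exactly
off = mse_loss(X, y, b0=0.0, b=np.array([2.0]))      # every residual equals 1
print(perfect, off)  # 0.0 0.5
```

With the intercept off by 1, both residuals are 1, so the loss is \(\frac{1}{2 \cdot 2}(1 + 1) = 0.5\).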


**Step 2:** Partial derivatives of MSE  
   
For b<sub>0</sub>:  
$$
\frac{\partial MSE}{\partial b_0} = -\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \left(b_0 + \sum_{j=1}^{J} b_j x_{ij}\right)\right)
$$
  
For b<sub>j</sub> (the inner sum uses \(k\) so that it does not clash with the fixed index \(j\)):  
$$
\frac{\partial MSE}{\partial b_j} = -\frac{1}{N} \sum_{i=1}^{N} x_{ij} \left(y_i - \left(b_0 + \sum_{k=1}^{J} b_k x_{ik}\right)\right)
$$
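
One way to gain confidence in these derivatives is to compare them with finite differences. The sketch below does that on random data; `mse_loss`, `mse_gradients`, the data and the starting parameters are all hypothetical names and values chosen for this check.

```python
import numpy as np

def mse_loss(X, y, b0, b):
    residuals = y - (b0 + X @ b)
    return np.sum(residuals ** 2) / (2 * len(y))

def mse_gradients(X, y, b0, b):
    """Analytic partial derivatives of the 1/(2N) loss."""
    residuals = y - (b0 + X @ b)           # shape (N,)
    grad_b0 = -np.mean(residuals)          # -(1/N) * sum_i r_i
    grad_b = -(X.T @ residuals) / len(y)   # -(1/N) * sum_i x_ij * r_i
    return grad_b0, grad_b

# Arbitrary data and parameters, only used to test the formulas
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
b0, b = 0.5, np.array([1.0, -2.0, 0.3])

grad_b0, grad_b = mse_gradients(X, y, b0, b)

# Central finite differences as an independent numerical estimate
eps = 1e-6
fd_b0 = (mse_loss(X, y, b0 + eps, b) - mse_loss(X, y, b0 - eps, b)) / (2 * eps)
fd_b = np.array([
    (mse_loss(X, y, b0, b + eps * e) - mse_loss(X, y, b0, b - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print("analytic:", grad_b0, grad_b)
print("numeric: ", fd_b0, fd_b)
```

The two printed lines should agree to several decimal places.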

**Step 3:** Update weights, scaling each step by a learning rate \(\alpha > 0\)  
$$b_0 \leftarrow b_0 - \alpha \frac{\partial MSE}{\partial b_0}$$

$$b_j \leftarrow b_j - \alpha \frac{\partial MSE}{\partial b_j}$$


**Step 4:** Repeat steps 2 and 3 until convergence
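
The four steps can be put together in a short NumPy loop. This is only a sketch: the learning rate of 0.2, the stopping tolerance and the iteration cap are assumptions picked for this dataset, not values from the derivation. It uses the same synthetic data as the scikit-learn example in this section and compares the result with NumPy's least-squares solver.

```python
import numpy as np

# Same synthetic data as the scikit-learn example in this section
np.random.seed(0)
X = 2 * np.random.rand(100, 3)
y = 4 + 3*X[:, 0] + 2*X[:, 1] - 1.5*X[:, 2] + np.random.randn(100)

def gradient_descent(X, y, alpha=0.2, tol=1e-8, max_iters=20_000):
    """Minimize the 1/(2N) squared-error loss with plain gradient descent."""
    n_samples, n_features = X.shape
    b0, b = 0.0, np.zeros(n_features)
    for _ in range(max_iters):
        # Step 2: partial derivatives of the loss
        residuals = y - (b0 + X @ b)
        grad_b0 = -np.mean(residuals)
        grad_b = -(X.T @ residuals) / n_samples
        # Step 4: treat a near-zero gradient as convergence
        if np.sqrt(grad_b0 ** 2 + grad_b @ grad_b) < tol:
            break
        # Step 3: move against the gradient
        b0 -= alpha * grad_b0
        b -= alpha * grad_b
    return b0, b

b0, b = gradient_descent(X, y)

# Reference: direct least-squares solution on [1, X]
A = np.column_stack([np.ones(len(y)), X])
reference, *_ = np.linalg.lstsq(A, y, rcond=None)

print("gradient descent:", b0, b)
print("least squares:   ", reference)
```

If the learning rate is too large for the data, the loop diverges instead of converging, so `alpha` usually needs tuning.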

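Several Python libraries can fit a linear regression for us, which is more convenient than writing the loop by hand. One of the most commonly used is scikit-learn. Note that its `LinearRegression` solves the least-squares problem directly rather than by gradient descent, so there is no learning rate to choose (`SGDRegressor` is scikit-learn's gradient-based alternative). Here's how we can use scikit-learn to perform linear regression easily: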

In [10]:
# imports for LR
from sklearn.linear_model import LinearRegression
import numpy as np

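# Generate synthetic data: 100 samples, 3 features in [0, 2),
# true intercept 4 and weights (3, 2, -1.5), plus standard normal noise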
np.random.seed(0)
X = 2 * np.random.rand(100, 3)
y = 4 + 3*X[:, 0] + 2*X[:, 1] - 1.5*X[:, 2] + np.random.randn(100)

# Create LR model
model = LinearRegression()

# Fit model
model.fit(X, y)

# Get weights
intercept = model.intercept_
coefficients = model.coef_

print("Intercept (b0):", intercept)
print("Slope (b1):", coefficients[0])
print("Slope (b2):", coefficients[1])
print("Slope (b3):", coefficients[2])


Intercept (b0): 3.9502139024683567
Slope (b1): 2.7956623869503465
Slope (b2): 1.9899783163480775
Slope (b3): -1.3733845071909832


## Parameter update rule for Logistic Regression using Gradient Descent

In Logistic Regression, the sigmoid function (also known as the logistic function) is used to model the probability of a binary outcome. The sigmoid function is defined as:  

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$


where z is the linear combination of the feature values and the model parameters:  

$$z = \mathbf{w} \cdot \mathbf{x}$$


Here, \(\mathbf{w}\) is the parameter vector and \(\mathbf{x}\) is the feature vector for a given data point (an intercept can be included by adding a constant feature equal to 1).  
The probability of the positive class (\(y = 1\)) can be expressed as:  

$$p(y=1|\mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x})$$


And the probability of the negative class (\(y = 0\)) is the complement:  

$$p(y=0|\mathbf{x}) = 1 - p(y=1|\mathbf{x}) = 1 - \sigma(\mathbf{w} \cdot \mathbf{x})$$
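
As a small illustration of these two probabilities, the sketch below evaluates the sigmoid for one data point. The parameter and feature values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])   # hypothetical parameter vector
x = np.array([1.0, 2.0, 0.5])    # hypothetical feature vector

z = w @ x            # 0.5 - 2.0 + 1.0 = -0.5
p1 = sigmoid(z)      # p(y = 1 | x)
p0 = 1.0 - p1        # p(y = 0 | x)
print(z, p1, p0)
```

Since \(z < 0\) here, the positive class gets a probability below 0.5 (about 0.38), and \(z = 0\) would give exactly 0.5.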


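To derive the parameter update rule, we need the gradient of the log-likelihood. For \(N\) independent data points with labels \(y_i \in \{0, 1\}\), the log-likelihood is:

$$\log L(\mathbf{w}) = \sum_{i=1}^{N} \left[ y_i \log \sigma(\mathbf{w} \cdot \mathbf{x}_i) + (1 - y_i) \log\left(1 - \sigma(\mathbf{w} \cdot \mathbf{x}_i)\right) \right]$$

Using \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), its gradient with respect to the parameter \(w_j\) is:  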

$$\frac{\partial \log L(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} \left( y_i - \sigma(\mathbf{w} \cdot \mathbf{x}_i) \right) \cdot x_{ij}$$
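
This gradient can be checked numerically. The sketch below uses the Bernoulli log-likelihood for binary labels and compares the formula with central finite differences; the function names, random data and parameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """Bernoulli log-likelihood of binary labels y under the logistic model."""
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_likelihood_gradient(w, X, y):
    """Component j is sum_i (y_i - sigmoid(w . x_i)) * x_ij."""
    return X.T @ (y - sigmoid(X @ w))

# Arbitrary data, only used to test the formula
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.integers(0, 2, size=40)
w = np.array([0.2, -0.4, 0.1])

analytic = log_likelihood_gradient(w, X, y)

eps = 1e-6
numeric = np.array([
    (log_likelihood(w + eps * e, X, y) - log_likelihood(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

print("analytic:", analytic)
print("numeric: ", numeric)
```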


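Because we are maximizing the log-likelihood, each update steps along the gradient (gradient ascent), which is the same as gradient descent on the negative log-likelihood. The parameter update rule in terms of the sigmoid function is then:  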

$$w_j \leftarrow w_j + \alpha \cdot \sum_{i=1}^{N} \left( y_i - \sigma(\mathbf{w} \cdot \mathbf{x}_i) \right) \cdot x_{ij}$$


This update rule is applied to each parameter \(w_j\) iteratively to learn the optimal values that maximize the likelihood of the observed data under the logistic regression model. The learning rate \(\alpha\) controls the step size in the gradient descent process.
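
A minimal sketch of this update rule in NumPy follows. The synthetic data, the "true" parameters used to generate it, the learning rate of 0.01 and the fixed 5000 iterations are all assumptions made for the illustration. A constant column of ones is added so that the first parameter acts as the intercept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical synthetic data: intercept column plus 2 features
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_w = np.array([-0.5, 2.0, -1.0])
y = (rng.random(n) < sigmoid(X @ true_w)).astype(float)

alpha = 0.01                 # learning rate (assumed; needs tuning in general)
w = np.zeros(X.shape[1])
for _ in range(5000):
    # Gradient of the log-likelihood: sum_i (y_i - sigmoid(w . x_i)) * x_i
    gradient = X.T @ (y - sigmoid(X @ w))
    # Step in the direction that increases the log-likelihood
    w = w + alpha * gradient

final_gradient = X.T @ (y - sigmoid(X @ w))
print("estimated w:", w)
print("gradient norm at the end:", np.linalg.norm(final_gradient))
```

Because the gradient is a sum over all data points, the step size that works depends on \(N\); dividing the gradient by \(N\) is a common way to make \(\alpha\) less sensitive to the dataset size.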


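The following code fits a logistic regression with scikit-learn. Note that `LogisticRegression` does not run the plain update rule above: by default, recent versions use the L-BFGS solver with L2 regularization. Because the labels in this example are generated at random, independently of `X`, the fitted coefficients are expected to be small: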

In [11]:
# imports for LogisticRegression
from sklearn.linear_model import LogisticRegression
import numpy as np

# Generate data: 3 random features and random binary labels (independent of X)
np.random.seed(0)
X = 2 * np.random.rand(100, 3)
y = (np.random.rand(100) < 0.5).astype(int)

# init model
model = LogisticRegression()

model.fit(X, y)
intercept = model.intercept_
coefficients = model.coef_

print("Intercept (b0):", intercept)
print("Coefficient (b1):", coefficients[0][0])
print("Coefficient (b2):", coefficients[0][1])
print("Coefficient (b3):", coefficients[0][2])
# Positive coefficients increase the likelihood of the positive class, while negative coefficients decrease it.

Intercept (b0): [0.40279443]
Coefficient (b1): -0.0011113147498026527
Coefficient (b2): -0.2721459386001874
Coefficient (b3): 0.06241684222277979


## Proof that MSE is a special case of MLE

In Linear Regression we assume the noise is Gaussian. In other words, we assume each \(y_i\) independently follows a normal distribution with mean \(\mu_i\) and variance \(\sigma^2\).  
  
Thus the likelihood of y:  
$$
L(\mu, \sigma^2 \mid y) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right)
$$

Substituting the linear model \(\mu_i = B^T x_i\) (with the intercept absorbed into \(B\) through a constant feature):  
$$
\rightarrow L(B, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \prod_{i=1}^n \exp\left(-\frac{(y_i - B^T x_i)^2}{2\sigma^2}\right)
$$

Therefore, log-likelihood is:  
$$
\log L(B, \sigma^2) = \log\left[\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \prod_{i=1}^n \exp\left(-\frac{(y_i - B^T x_i)^2}{2\sigma^2}\right)\right]
$$

$$
\rightarrow \log L(B, \sigma^2) = n\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \sum_{i=1}^n \frac{-(y_i - B^T x_i)^2}{2\sigma^2}
$$
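
The step from the log of a product to a sum can be verified numerically. In the sketch below the data, the parameter vector `B` and the variance `sigma2` are hypothetical; the sample is kept small so that the product of densities does not underflow.

```python
import numpy as np

# Hypothetical data generated from a linear model with Gaussian noise
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # first column = intercept
B = np.array([1.0, 2.0, -0.5])
sigma2 = 0.7
y = X @ B + rng.normal(scale=np.sqrt(sigma2), size=n)

residuals = y - X @ B

# Direct form: log of the product of the n Gaussian densities
densities = np.exp(-residuals ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
direct = np.log(np.prod(densities))

# Expanded form: n * log(1 / sqrt(2 pi sigma^2)) - sum of squared residuals / (2 sigma^2)
expanded = n * np.log(1 / np.sqrt(2 * np.pi * sigma2)) - np.sum(residuals ** 2) / (2 * sigma2)

print(direct, expanded)
```

Both numbers should match up to floating-point rounding. For large \(n\) the product itself underflows to zero, which is one practical reason to work with the log-likelihood.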

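If we take the negative log-likelihood as our loss:  
$$
loss = -\log L(B,\sigma^2)
$$

Recall that we are interested in the optimal solution (the minimizer \(\hat{B}\)) rather than the optimal value of the loss.  
In other words, we want to minimize the negative log-likelihood, which we can find by setting the partial derivatives with respect to \(B\) to 0.    
$$
\hat{B} = \arg\min_B \left[-\log L(B, \sigma^2)\right]
$$

The term \(n\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)\) does not depend on \(B\), and \(\frac{1}{2\sigma^2}\) is a positive constant, so neither changes the minimizer:  
$$
\rightarrow\hat{B} = \arg\min_B \sum_{i=1}^n (y_i - B^T x_i)^2
$$

$$
\rightarrow\hat{B} = \arg\min_B \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$
where \(\hat{y}_i = B^T x_i\) is the model's prediction. This is the sum of squared errors; dividing it by \(n\) gives the MSE without changing the minimizer.  

Hence in Linear regression, minimizing MSE is a special case of MLE when we assume the model is Gaussian.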
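
The conclusion can also be illustrated numerically. The sketch below, on hypothetical synthetic data, computes the least-squares solution and checks that, for several values of \(\sigma^2\), no randomly perturbed parameter vector achieves a lower negative log-likelihood. This is an illustration under those assumptions, not a proof.

```python
import numpy as np

def negative_log_likelihood(B, X, y, sigma2):
    """Gaussian negative log-likelihood of y given the linear model X @ B."""
    residuals = y - X @ B
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum(residuals ** 2) / (2 * sigma2)

# Hypothetical data: intercept column plus 2 features, standard normal noise
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([4.0, 3.0, -1.5]) + rng.normal(size=n)

# Minimizer of the sum of squared errors (and therefore of the MSE)
B_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

checks = []
for sigma2 in (0.5, 1.0, 4.0):
    best = negative_log_likelihood(B_ls, X, y, sigma2)
    perturbed = [
        negative_log_likelihood(B_ls + rng.normal(scale=0.1, size=3), X, y, sigma2)
        for _ in range(1000)
    ]
    checks.append(best <= min(perturbed))
    print("sigma^2 =", sigma2, "-> least squares is at least as good:", checks[-1])
```

The value of \(\sigma^2\) changes the loss value but not which \(B\) minimizes it, which matches the argument above.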
