This notebook covers the theory, mathematical concepts, and code examples for a Simple Linear Regression model with a single feature, trained using the Gradient Descent algorithm.

Most of the theory and notation is taken from Professor Andrew Ng's lectures.

**Simple Linear Regression with One Feature** **(Univariate Linear Regression)**

**Overview:** Simple linear regression is a statistical method used to model the relationship between a single independent variable (feature) and a dependent variable (target/label). The goal is to find a linear equation that best predicts the target variable based on the feature. The 'hypothesis' of a simple linear regression model can be expressed as:
\begin{equation}
h(x) = \theta_0 + \theta_1 x
\end{equation}

where:

h(x) is the hypothesis/model

x is the independent variable

$\theta_0$ is the y-intercept or the bias term

$\theta_1$ is the slope of the line (coefficient of the feature)

$\theta_0$ & $\theta_1$ together are also called the parameters or coefficients or the weights

Note the above hypothesis is an affine function
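As a quick illustration of the hypothesis, here is a minimal NumPy sketch. The function name `hypothesis` and the parameter values (intercept 1, slope 2) are made up for this example and are not part of the original material.

```python
import numpy as np

def hypothesis(x, theta_0, theta_1):
    """Evaluate h(x) = theta_0 + theta_1 * x for a scalar or an array x."""
    return theta_0 + theta_1 * np.asarray(x, dtype=float)

# Illustrative parameters (assumed for this demo): intercept 1, slope 2
x = np.array([0.0, 1.0, 2.0])
predictions = hypothesis(x, theta_0=1.0, theta_1=2.0)
print(predictions)  # [1. 3. 5.]
```

Because the function works elementwise, the same call predicts for one input or for a whole array of inputs.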


**Cost/Loss Function: Mean Squared Error (MSE) cost function**

The most common cost function used in linear regression is the Mean Squared Error (MSE), which measures the average squared difference between the predicted values and the actual values:

\begin{equation}
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
\end{equation}

where:

m is the number of training examples

$h_\theta(x^{(i)})$ is the hypothesis function, which predicts output for *i*-th training example

$y^{(i)}$ is the actual target value of the *i*-th training example

$\theta_0$ & $\theta_1$ are the parameters of the linear regression model

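The mean squared error cost function measures the average squared error between the predicted output and the actual output over all training examples. The extra factor of 1/2 is a convenience: it cancels the 2 that appears when the squared term is differentiated.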

The goal of linear regression is to find the values of $\theta_0$ & $\theta_1$ that minimize the cost function J($\theta_0$, $\theta_1$)
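To make the cost function concrete, here is a small sketch that evaluates it on toy data. The helper name `compute_cost` and the three data points are assumptions chosen for illustration.

```python
import numpy as np

def compute_cost(x, y, theta_0, theta_1):
    """Return J(theta_0, theta_1) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = x.size
    errors = (theta_0 + theta_1 * x) - y
    return np.sum(errors ** 2) / (2 * m)

# Toy data lying exactly on the line y = 2x (assumed for this demo)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

cost_perfect = compute_cost(x, y, theta_0=0.0, theta_1=2.0)  # parameters of the true line
cost_zero = compute_cost(x, y, theta_0=0.0, theta_1=0.0)     # all-zero parameters
print(cost_perfect, cost_zero)
```

The parameters of the true line give a cost of zero, while the all-zero parameters give (4 + 16 + 36) / 6, which is the kind of gap that minimization closes.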

The graph of the cost function against $\theta_0$ and $\theta_1$ is a 3-dimensional bowl-shaped surface. A general cost function may have several local minima, but the squared error cost for linear regression is convex, so it has a single global minimum and no other local minima.

You can also use a contour plot to visualize it better.
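A contour plot of the cost could be produced along the following lines. This is only a sketch: the helper names `cost_grid` and `plot_cost_contour`, the toy data, and the grid ranges are all assumptions made for the example.

```python
import numpy as np

def cost_grid(x, y, theta_0_values, theta_1_values):
    """Evaluate J on every (theta_0, theta_1) pair of a rectangular grid."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    T0, T1 = np.meshgrid(theta_0_values, theta_1_values)
    # Broadcast to shape (len(theta_1_values), len(theta_0_values), m)
    errors = T0[..., None] + T1[..., None] * x - y
    J = np.sum(errors ** 2, axis=-1) / (2 * x.size)
    return T0, T1, J

def plot_cost_contour(x, y, theta_0_values, theta_1_values):
    """Draw contour lines of J; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt
    T0, T1, J = cost_grid(x, y, theta_0_values, theta_1_values)
    plt.contour(T0, T1, J, levels=20)
    plt.xlabel('theta_0')
    plt.ylabel('theta_1')
    plt.title('Contours of the cost function')
    plt.show()

# Toy data on y = 2x (assumed), so the minimum sits at theta_0 = 0, theta_1 = 2
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
theta_0_values = np.linspace(-2, 2, 41)
theta_1_values = np.linspace(0, 4, 41)
T0, T1, J = cost_grid(x, y, theta_0_values, theta_1_values)

# Locate the grid point with the smallest cost
row, col = np.unravel_index(np.argmin(J), J.shape)
print(T0[row, col], T1[row, col])
```

Calling `plot_cost_contour(x, y, theta_0_values, theta_1_values)` draws the contours; the closed curves shrink toward the single minimum found on the grid.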

**Gradient Descent Algorithm**


The outline of the algorithm is to start with initial values of $\theta_0$ & $\theta_1$ (random, or simply zero) and keep changing both to reduce J($\theta_0$, $\theta_1$) until we settle at or near a minimum.

The algorithm looks like this:

repeat until convergence{
  \begin{equation}
    \theta_0 = \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
  \end{equation}

  \begin{equation}
    \theta_1 = \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J (\theta_0, \theta_1)
  \end{equation}
}

where $\alpha$ is the "learning rate" that determines how big a step to take at each iteration.

Here, $\theta_0$ & $\theta_1$ must be updated simultaneously: both partial derivatives are evaluated with the pre-update values before either parameter is overwritten.
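The difference between a simultaneous and a sequential update can be seen in a small sketch. The function names and the stand-in cost $J = \theta_0^2 + \theta_0 \theta_1 + \theta_1^2$ are assumptions chosen only to keep the demo short; they are not the regression cost.

```python
def simultaneous_update(theta_0, theta_1, alpha, dJ_dtheta_0, dJ_dtheta_1):
    """Return updated (theta_0, theta_1); both gradients use the old values."""
    temp_0 = theta_0 - alpha * dJ_dtheta_0(theta_0, theta_1)
    temp_1 = theta_1 - alpha * dJ_dtheta_1(theta_0, theta_1)
    return temp_0, temp_1

def sequential_update(theta_0, theta_1, alpha, dJ_dtheta_0, dJ_dtheta_1):
    """Incorrect variant: the gradient for theta_1 already sees the new theta_0."""
    theta_0 = theta_0 - alpha * dJ_dtheta_0(theta_0, theta_1)
    theta_1 = theta_1 - alpha * dJ_dtheta_1(theta_0, theta_1)
    return theta_0, theta_1

# Partial derivatives of the stand-in cost J = theta_0^2 + theta_0*theta_1 + theta_1^2
def d0(t0, t1):
    return 2 * t0 + t1

def d1(t0, t1):
    return t0 + 2 * t1

simultaneous = simultaneous_update(1.0, 1.0, 0.1, d0, d1)
sequential = sequential_update(1.0, 1.0, 0.1, d0, d1)
print(simultaneous, sequential)
```

Starting from (1, 1), the simultaneous update moves both parameters to about 0.7, while the sequential variant ends with a different $\theta_1$ (about 0.73) because it mixed old and new values.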

The sign of the partial derivative gives the direction of the slope (+/-), and subtracting it moves the parameter downhill toward the minimum; that is why the partial derivative term appears in the update.
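For reference, the partial derivatives follow from applying the chain rule to the cost function J, using $h_\theta(x) = \theta_0 + \theta_1 x$. The factor of 2 from the square cancels the 1/2 in J:

\begin{equation}
\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2 (h_\theta(x^{(i)}) - y^{(i)}) \cdot 1 = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})
\end{equation}

\begin{equation}
\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2 (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}
\end{equation}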

Substituting the partial derivatives of the cost function, the updates become:

repeat until convergence{
  \begin{equation}
    \theta_0 = \theta_0 - \alpha \left(\frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\right)
  \end{equation}

  \begin{equation}
    \theta_1 = \theta_1 - \alpha \left(\frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}\right)
  \end{equation}
}
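One way to sanity-check the gradient terms in the update rule is to compare them with a finite-difference approximation of the cost. The sketch below does that on made-up data; the function names, the data points, and the step size `eps` are assumptions for this example.

```python
import numpy as np

def cost(x, y, theta_0, theta_1):
    """J(theta_0, theta_1) = 1/(2m) * sum of squared errors."""
    return np.sum((theta_0 + theta_1 * x - y) ** 2) / (2 * x.size)

def analytic_gradients(x, y, theta_0, theta_1):
    """Gradient terms as they appear in the update rule."""
    errors = theta_0 + theta_1 * x - y
    return np.sum(errors) / x.size, np.sum(errors * x) / x.size

def numeric_gradients(x, y, theta_0, theta_1, eps=1e-6):
    """Central finite differences of the cost in each parameter."""
    g0 = (cost(x, y, theta_0 + eps, theta_1) - cost(x, y, theta_0 - eps, theta_1)) / (2 * eps)
    g1 = (cost(x, y, theta_0, theta_1 + eps) - cost(x, y, theta_0, theta_1 - eps)) / (2 * eps)
    return g0, g1

# Made-up data and parameter values, used only for the check
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 10.0])
analytic = analytic_gradients(x, y, 0.5, 1.5)
numeric = numeric_gradients(x, y, 0.5, 1.5)
print(analytic, numeric)
```

If the two pairs agree to several decimal places, the analytic expressions are consistent with the cost function they were derived from.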

That covers the overview of SLR & the Gradient Descent algorithm. Now let's apply it "manually", using only the NumPy library for the computation (and Matplotlib for the plots).

In [129]:
#Import all the necessary libraries
import numpy as np
import matplotlib.pyplot as plt

In [130]:
#Generate some synthetic data
X = np.random.randn(100, 1) #generate 100 samples from a standard normal distribution
Y = 2 * X + np.random.randn(100, 1) #linear relation with some noise

In [None]:
#Implementing Gradient Descent

#setting up the hyperparameters
learning_rate = 0.01
iterations = 1000
m = X.shape[0] #number of training examples

#initializing the parameters
theta_0 = 0
theta_1 = 0

#save the cost for plotting later
Cost = []

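# gradient descent
for iteration in range(iterations):
    # predictions with the current (pre-update) parameters
    h = theta_0 + theta_1 * X
    # partial derivatives of the cost with respect to theta_0 and theta_1
    gradientTheta_0 = (1/m) * np.sum(h - Y)
    gradientTheta_1 = (1/m) * np.sum((h - Y) * X)
    # simultaneous update: both gradients were computed before either parameter changes
    theta_0 -= learning_rate * gradientTheta_0
    theta_1 -= learning_rate * gradientTheta_1
    # cost of the pre-update parameters, since h was computed before the update
    cost = (1/(2*m)) * np.sum((h - Y) ** 2)
    Cost.append(cost)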

# Plotting the cost over iterations
plt.plot(range(iterations), Cost)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('Cost vs Iterations')
plt.show()

# Print final parameters
print(f'Final parameters: theta_0 = {theta_0}, theta_1 = {theta_1}')

# Predicting the regression line
Y_pred = theta_0 + theta_1 * X

# Plotting the results
plt.scatter(X, Y, label='Actual Data')
plt.plot(X, Y_pred, color='red', label='Regression Line')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()

With a learning rate of 0.01, the cost curve above flattens out at around 200 iterations. You can choose different values of $\alpha$ and different initial values of $\theta_0$ & $\theta_1$ to see when the algorithm converges and what linear equation it finds.
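One way to run that experiment is sketched below. The function name `gradient_descent`, the fixed random seed, and the particular learning rates are assumptions for illustration; the least-squares fit from `np.polyfit` is included only as a reference point to compare against.

```python
import numpy as np

def gradient_descent(X, Y, alpha, iterations, theta_0=0.0, theta_1=0.0):
    """Run batch gradient descent and return the parameters and cost history."""
    m = X.shape[0]
    costs = []
    for _ in range(iterations):
        errors = theta_0 + theta_1 * X - Y
        costs.append(np.sum(errors ** 2) / (2 * m))
        grad_0 = np.sum(errors) / m
        grad_1 = np.sum(errors * X) / m
        # Tuple assignment keeps the update simultaneous
        theta_0, theta_1 = theta_0 - alpha * grad_0, theta_1 - alpha * grad_1
    return theta_0, theta_1, costs

# Synthetic data as in the notebook, but with a fixed seed (assumed) for repeatability
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1))
Y = 2 * X + rng.standard_normal((100, 1))

results = {}
for alpha in (0.001, 0.01, 0.1):
    theta_0, theta_1, costs = gradient_descent(X, Y, alpha, iterations=1000)
    results[alpha] = (theta_0, theta_1, costs[-1])
    print(f'alpha = {alpha}: theta_0 = {theta_0:.4f}, theta_1 = {theta_1:.4f}, final cost = {costs[-1]:.4f}')

# Least-squares reference fit; np.polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(X.ravel(), Y.ravel(), 1)
print(f'least squares: theta_0 = {intercept:.4f}, theta_1 = {slope:.4f}')
```

In this setup the smallest learning rate has not finished converging after 1000 iterations, while the larger ones end up close to the least-squares line.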