
# Lecture 2 - Regression

## Representing Supervised Learning
We can represent supervised learning like so:
<!-- ![Supervised Learning](img/supervised-learning.png) -->

## Regression Hypothesis
The regression hypothesis function for $n$ input features can be represented like so:
$$
h(x) = \theta_0 + \theta_1x_1 + \dots + \theta_nx_n
$$

We can write this more compactly as a sum:
$$
h(x) = \sum \limits _{j=0} ^{n} \theta_jx_j
$$

Where:
- $x_0 = 1$, so the $j = 0$ term contributes the intercept $\theta_0$.

We can represent our input variables as a vector (including $x_0 = 1$):
$$
\vec{x} = \begin{bmatrix}
x_0 \\
x_1 \\
\vdots \\
x_n
\end{bmatrix}
$$
We can represent our parameters like so:
$$
\vec{\theta} = \begin{bmatrix}
\theta_0 \\
\theta_1 \\
\vdots \\
\theta_n
\end{bmatrix}
$$
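
With the inputs and parameters written as vectors, the hypothesis is a dot product. The following is a minimal NumPy sketch; the function name `hypothesis` and the numeric values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def hypothesis(theta, x):
    """Return h(x) = sum_j theta_j * x_j for one example, using x_0 = 1."""
    x_with_bias = np.concatenate(([1.0], x))  # prepend x_0 = 1
    return float(np.dot(theta, x_with_bias))

# Illustrative values (assumed): theta_0, theta_1, theta_2 and two features
theta = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0])

print(hypothesis(theta, x))  # 1 + 2*4 + 3*5 = 24.0
```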


### Terminology
- $x_j$ is the j-th input variable (feature).
- $n$ is the number of input features.
- $y$ is the output variable.
- $m$ is the number of training examples.
- $\theta$ are the parameters.
- $(x, y)$ is a training example.
- $(x^{(i)}, y^{(i)})$ is the i-th training example.

__Note__: The superscript (i) is used to denote the i-th training example.

## Cost Function
The cost function is the average of the squared differences between the predicted values and the actual values, scaled by $\frac{1}{2}$. In linear regression, we want to find the parameters $\theta$ that minimize it:
$$
J(\theta) = \frac{1}{2m} \sum \limits _{i=1} ^m (y^{(i)} - h(x^{(i)}))^2
$$
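
The formula above can be sketched in NumPy as follows. This is an illustration under stated assumptions: the name `cost`, the design matrix `X` (one row per training example, with a leading column of ones standing in for $x_0 = 1$), and the toy data are all hypothetical.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (y_i - h(x_i))^2.

    X is assumed to already contain a leading column of ones (x_0 = 1).
    """
    m = len(y)
    residuals = y - X @ theta          # y^(i) - h(x^(i)) for every example
    return float(residuals @ residuals) / (2 * m)

# Toy data (assumed): y = 2 * x_1 exactly
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

print(cost(np.array([0.0, 2.0]), X, y))  # perfect fit, so the cost is 0.0
print(cost(np.array([0.0, 1.0]), X, y))  # residuals 1, 2, 3 give 14 / 6
```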

### Explanation
The cost function measures how well our hypothesis function fits the training data. The closer $J(\theta)$ is to 0, the closer the predictions are to the actual values on the training set.

1. $(y^{(i)} - h(x^{(i)}))^2$ is the squared error for the i-th training example.
2. $\sum \limits _{i=1} ^m$ sums the squared errors over all $m$ training examples.
3. Multiplying that sum by $\frac{1}{m}$ gives the average squared error. The extra factor of $\frac{1}{2}$ cancels the 2 that appears when the square is differentiated, which makes the derivative calculations easier.
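
To see the effect of the factor $\frac{1}{2}$, we can differentiate $J(\theta)$ with respect to a single parameter $\theta_j$. This is a worked sketch assuming the linear hypothesis $h(x) = \sum_j \theta_jx_j$ from earlier, with $x_j^{(i)}$ denoting feature $j$ of the i-th training example:
$$
\frac{\partial}{\partial \theta_j} J(\theta)
= \frac{1}{2m} \sum \limits _{i=1} ^m 2\,(y^{(i)} - h(x^{(i)})) \cdot (-x_j^{(i)})
= \frac{1}{m} \sum \limits _{i=1} ^m (h(x^{(i)}) - y^{(i)})\, x_j^{(i)}
$$
The 2 from the power rule cancels the $\frac{1}{2}$, leaving a plain average.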


## Gradient Descent
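Gradient descent is an algorithm that minimizes the cost function $J(\theta)$ by repeatedly moving the parameters $\theta$ a small step in the direction of the negative gradient, which is the direction in which $J$ decreases fastest.

Each iteration applies the following update to every parameter $\theta_j$ simultaneously, where $\alpha$ is the learning rate (step size):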
$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
$$
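
For the linear regression cost, the partial derivative works out to $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum \limits _{i=1} ^m (h(x^{(i)}) - y^{(i)})\, x_j^{(i)}$. Below is a minimal sketch of batch gradient descent using that gradient. The names `gradient_descent`, `X`, and `num_iters`, the default learning rate, and the toy data are assumptions made for illustration; `X` is assumed to carry a leading column of ones for $x_0 = 1$.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Minimize J(theta) for linear regression with batch gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)                        # start from all-zero parameters
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m   # one partial derivative per theta_j
        theta = theta - alpha * gradient       # simultaneous update of all theta_j
    return theta

# Toy data (assumed): generated from y = 1 + 2 * x_1 with no noise
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = gradient_descent(X, y)
print(theta)  # approaches [1, 2]
```

If $\alpha$ is too large the updates can overshoot and diverge; if it is too small, convergence takes many more iterations.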







In [None]:
%pip install numpy
%pip install matplotlib

import numpy as np
import matplotlib.pyplot as plt

# Define the function to minimize
def f(x):
    return x**2 + 10*np.sin(x)

# Define the derivative of the function
def grad_f(x):
    return 2*x + 10*np.cos(x)

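# Initialize the starting point. f is non-convex, so the result depends on
# where we start: from x_0 = 3 the iterates settle in a local minimum near x = 3.84
x_0 = 3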

# Perform gradient descent
learning_rate = 0.1
x_list = [x_0]
for i in range(50):
    x_list.append(x_list[-1] - learning_rate*grad_f(x_list[-1]))

# Plot the function and the progress of gradient descent
x = np.linspace(-5, 5, 100)
y = f(x)
plt.plot(x, y, label='f(x)')
# x_list is a plain Python list, so convert it to an array before calling f
plt.scatter(np.array(x_list), f(np.array(x_list)), c='r', label='Gradient Descent')
plt.legend()
plt.show()