Regression - using features to predict a **continuous** output (the label)

### Types of regression
- Linear regression: linear relation ($y = mx + b$)
- Polynomial regression: curved relation (e.g. $y = mx^2 + b$)
- Multiple Linear Regression: more than one feature/independent variable ($y = w_1x_1 + w_2x_2 + b$)

### Feature Vector

Given you have $n$ features, here is one sample in the data set, where $i$ is the index of the sample:

$x^{(i)} = [x_{(1)}^{(i)}, x_{(2)}^{(i)}, x_{(3)}^{(i)}, ..., x_{(n)}^{(i)}] $

Make a design feature vector by adding a 1 to the beginning of the vector to account for the bias $b$:

$x^{(i)} = [1, x_{(1)}^{(i)}, x_{(2)}^{(i)}, x_{(3)}^{(i)}, ..., x_{(n)}^{(i)}] $

Now create a weight vector of size $n + 1$, one weight per entry of the design feature vector:

$w = [w_{(1)}, w_{(2)}, ..., w_{(n+1)}]$
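As a minimal sketch of the construction above, assuming a small made-up feature matrix `X_raw` with $m = 3$ samples and $n = 2$ features (the values are illustrative only), the 1 can be prepended to every sample at once with NumPy:

```python
import numpy as np

# Hypothetical raw features: 3 samples (rows), 2 features (columns).
X_raw = np.array([[2.0, 5.0],
                  [3.0, 1.0],
                  [4.0, 7.0]])

# Prepend a column of ones so the first weight acts as the bias b.
X = np.insert(X_raw, 0, 1.0, axis=1)

# One weight per column of X, i.e. n + 1 weights.
w = np.ones(X.shape[1])

print(X.shape)  # (3, 3)
print(w.shape)  # (3,)
```

Each row of `X` is one design feature vector, so the whole dataset is stored as an $m \times (n + 1)$ matrix.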

### Hypothesis 

Your prediction or hypothesis is $h(x^{(i)};w) = w^T x^{(i)}$

$w^T$ is the transposed weight vector and $x^{(i)}$ is the design feature vector of sample $i$:

$h(x^{(i)};w) = [w_{(1)}, w_{(2)}, ..., w_{(n+1)}] \cdot \begin{pmatrix} 1 \\ x_{(1)}^{(i)} \\ \vdots \\ x_{(n)}^{(i)} \end{pmatrix}$

which would yield:

$h(x^{(i)};w) = w_{(1)} \cdot 1 + w_{(2)}x_{(1)}^{(i)} + ... + w_{(n+1)}x_{(n)}^{(i)} $


Since $w_{(1)} \cdot 1$ is the same as $w_{(1)}$, we can just call that $b$, the bias.
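A quick numeric check of the expansion above, using made-up numbers for one sample with $n = 2$ features: the dot product of the weights with the design feature vector should equal the bias plus the weighted features.

```python
import numpy as np

# Hypothetical design vector for one sample: leading 1, then n = 2 features.
x_i = np.array([1.0, 2.0, 5.0])
# Hypothetical weights: w[0] plays the role of the bias b.
w = np.array([0.5, 2.0, -1.0])

h = np.dot(w, x_i)                          # w^T x
h_expanded = w[0] + np.dot(w[1:], x_i[1:])  # b + weighted features

print(h, h_expanded)  # -0.5 -0.5
```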


### Measuring our Regression
We can use Mean Squared Error to see how good our prediction is compared to the actual data.

We create a **cost/loss** function:

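$J(w) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)};w) - y^{(i)})^2$


where $i$ is the index of the training sample, $m$ is the number of samples, $y^{(i)}$ is the label, $x^{(i)}$ is the feature vector, and $h(x^{(i)};w)$ is the prediction. Dividing by $m$ makes this a mean, and the extra factor of $\frac{1}{2}$ cancels the 2 that appears when the square is differentiated.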
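The cost can be computed for all samples at once. The sketch below assumes the halved-mean scaling $\frac{1}{2m}$ and a design matrix `X` whose first column is all ones; the function name `cost` and the data are made up for illustration.

```python
import numpy as np

def cost(X, y, w):
    """Halved mean squared error of predictions X @ w against labels y."""
    m = len(y)
    residuals = X @ w - y
    return np.sum(residuals ** 2) / (2 * m)

# Hypothetical data: design matrix with a leading column of ones.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

print(cost(X, y, np.array([0.0, 2.0])))  # 0.0 (perfect fit, y = 2x)
print(cost(X, y, np.array([0.0, 1.0])))  # 14 / 6, about 2.33
```

A lower value means the predictions are closer to the labels, so the second weight vector is the worse fit.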

# Gradient Descent

Gradient descent incrementally changes the values of the weight vector to minimize the cost function $J(w)$. At each step $j$ we take the derivative (gradient) of the cost function, $\nabla J(w_{j})$, scale it by a learning rate $\alpha$, and subtract the result from the current weights $w_{j}$ to obtain $w_{j+1}$:

$w_{j+1} = w_{j} - \alpha \nabla J(w_{j})$
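For reference, the gradient can be worked out with the chain rule. This sketch assumes the cost is scaled as $\frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)};w) - y^{(i)})^2$ (with the unscaled $\frac{1}{2}\sum$ form the result is the same without the $\frac{1}{m}$), and uses $k$ as a hypothetical index over the entries of $w$. Because $h(x^{(i)};w) = w^T x^{(i)}$, the derivative of $h$ with respect to the $k$-th weight is the $k$-th entry of the design feature vector, written $x_{k}^{(i)}$ here:

$\frac{\partial J}{\partial w_{(k)}} = \frac{1}{2m} \sum_{i=1}^{m} 2\,(h(x^{(i)};w) - y^{(i)}) \frac{\partial h(x^{(i)};w)}{\partial w_{(k)}} = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)};w) - y^{(i)})\, x_{k}^{(i)}$

Stacking the partial derivatives for every $k$ gives the full gradient:

$\nabla J(w) = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)};w) - y^{(i)})\, x^{(i)}$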

## Batch Gradient Descent Algorithm
- Goes over the entire dataset each iteration to calculate the gradient.
- Upside: Converges steadily (for a suitably small $\alpha$)
- Downside: Slow, since every update needs all $m$ samples

**Until Convergence:**
- $w := w - \alpha (\frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)};w) - y^{(i)}) x^{(i)}) $ (all entries of $w$ are updated together)
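The update above can be sketched in vectorized form. "Until convergence" is interpreted here as "until the cost improves by less than a tolerance", which is an assumption; the function name, `tol`, `max_iters`, and the synthetic data are all made up for illustration.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, tol=1e-10, max_iters=10_000):
    """Repeat full-dataset gradient steps until the cost stops improving."""
    m, n_plus_1 = X.shape
    w = np.zeros(n_plus_1)
    prev_cost = np.inf
    for _ in range(max_iters):
        residuals = X @ w - y                  # h(x^(i); w) - y^(i) for every i
        w = w - alpha * (X.T @ residuals) / m  # one update uses all m samples
        cost = np.sum((X @ w - y) ** 2) / (2 * m)
        if prev_cost - cost < tol:             # hypothetical stopping rule
            break
        prev_cost = cost
    return w

# Hypothetical noiseless data generated from y = 1 + 2x.
x_raw = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x_raw), x_raw])
y = 1.0 + 2.0 * x_raw

w = batch_gradient_descent(X, y)
print(np.round(w, 2))  # approximately [1. 2.]
```

The learning rate that works depends on the scale of the features; a value that is too large makes the cost grow instead of shrink.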

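## Stochastic Gradient Descent
- Chooses a random sample each iteration.
- Upside: Faster, since each update uses a single sample
- Downside: Never fully converges; with a fixed $\alpha$ the weights keep fluctuating around the minimum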

**Until Convergence:**
- repeat $m$ times:
    - Choose a random sample $(x^{(i)}, y^{(i)})$
    - $w := w - \alpha (h(x^{(i)};w) - y^{(i)}) x^{(i)} $

# Normal Equations

- Directly computes the parameters of a model that minimize the sum of the squared differences between the actual values and the predicted values
- Can only be used for linear regression
- Assumes that $X$ is a feature matrix that includes the bias term (a column of ones)

$w = (X^{T}X)^{-1} \cdot (X^{T}y)$
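A minimal sketch of the normal equations on made-up, noiseless data generated from $y = 1 + 2x$. Rather than forming the inverse explicitly, it solves the equivalent linear system $(X^{T}X)w = X^{T}y$ with `np.linalg.solve`, which is generally the numerically safer choice.

```python
import numpy as np

# Hypothetical noiseless data generated from y = 1 + 2x.
x_raw = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x_raw), x_raw])  # bias column included
y = 1.0 + 2.0 * x_raw

# Solve (X^T X) w = X^T y for w.
w = np.linalg.solve(X.T @ X, X.T @ y)

print(np.round(w, 6))  # approximately [1. 2.]
```

No learning rate or iteration count is needed, but the system can only be solved this way when $X^{T}X$ is invertible.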

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('lreg.csv')

# Design matrix: a column of ones (bias) followed by the study-time feature.
x = np.insert(data['studytime'].to_numpy(dtype=float).reshape(-1, 1), 0, 1, axis=1)
y = data['score'].to_numpy(dtype=float).reshape(-1, 1)  # labels as a column
w = np.ones((2, 1))  # [bias, slope], both initialised to 1
m = len(data)        # number of samples
epochs = 1000
alpha = .001

loss = []

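for epoch in range(epochs):
    for i in range(m):
        index = np.random.randint(m)   # pick one random sample
        x_i = x[index].reshape(-1, 1)  # design vector as a column, shape (2, 1)
        error = (w.T @ x_i).item() - y[index].item()  # h(x; w) - y
        w = w - alpha * error * x_i    # update both weights at once

    # Record the halved mean squared error over the whole dataset once per epoch.
    loss.append(float(np.sum((x @ w - y) ** 2) / (2 * m)))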


plt.plot(data["studytime"], data['score'], 'o', label='Actual')  # Plot the data points
plt.title('Stochastic Linear Regression for Test Scores')
plt.xlabel('Study Time')
plt.ylabel('Score')
plt.plot(data["studytime"], np.dot(x, w), '-', label='Prediction')  # Plot the linear regression line
plt.legend(loc="upper left")
plt.show()

plt.title('Loss over Epochs (Stochastic)')
plt.plot(range(epochs), loss)  # Plot the loss recorded after each epoch
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('lreg.csv')

# Design matrix: a column of ones (bias) followed by the study-time feature.
x = np.insert(data['studytime'].to_numpy(dtype=float).reshape(-1, 1), 0, 1, axis=1)
y = data['score'].to_numpy(dtype=float).reshape(-1, 1)  # labels as a column
w = np.ones((2, 1))  # [bias, slope], both initialised to 1
m = len(data)        # number of samples
epochs = 100
alpha = .001

loss = []

def gradient(w, x, y):
    """Gradient of the halved mean squared error over the full dataset."""
    residuals = x @ w - y              # shape (m, 1)
    return (x.T @ residuals) / len(y)  # shape (2, 1)


for epoch in range(epochs):
    # Batch gradient descent: one update per epoch, using every sample.
    # Raise `epochs` if the loss curve has not flattened.
    w = w - alpha * gradient(w, x, y)

    # Record the halved mean squared error over the whole dataset.
    loss.append(float(np.sum((x @ w - y) ** 2) / (2 * m)))


plt.plot(data["studytime"], data['score'], 'o', label='Actual')  # Plot the data points
plt.title('Batch Linear Regression for Test Scores')
plt.xlabel('Study Time')
plt.ylabel('Score')
plt.plot(data["studytime"], np.dot(x, w), '-', label='Prediction')  # Plot the linear regression line
plt.legend(loc="upper left")
plt.show()

plt.title('Loss over Epochs (Batch)')
plt.plot(range(epochs), loss)  # Plot the loss recorded after each epoch
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()