# Linear Regression
The goal of linear regression is to predict the value of a dependent (target) variable from one or more independent (predictor) variables. A linear regression model looks like this:

$$ \hat y = \hat w_{0} + \hat w_{1}x_{1} + \hat w_{2}x_{2} + \dots + \hat w_{m}x_{m} $$

where the coefficient $\hat w_{0}$ is the bias and $\hat w_{1}, \hat w_{2}, \dots, \hat w_{m}$ are the estimated weights.
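To make the formula concrete, here is a minimal sketch of the prediction step in plain Python. The function name `predict` and the coefficient values are hypothetical, chosen only to show the arithmetic; they are not fitted to any data.

```python
def predict(bias, weights, features):
    """Return y_hat = w0 + w1*x1 + ... + wm*xm for a single observation."""
    if len(weights) != len(features):
        raise ValueError("need exactly one weight per feature")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical coefficients and inputs (illustration only).
y_hat = predict(bias=1.0, weights=[2.0, -0.5], features=[3.0, 4.0])
print(y_hat)  # 1.0 + 2.0*3.0 + (-0.5)*4.0 = 5.0
```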

### Simple linear regression model

Given data $x$ and $y$, we can plot a scatter chart. Consider the following small dataset, where $x$ is the year and $y$ is the revenue of a company (in millions of dollars):

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

data = [
    {'X': 2018, 'Y': 50},
    {'X': 2019, 'Y': 54},
    {'X': 2020, 'Y': 58},
    {'X': 2021, 'Y': 55},
    {'X': 2022, 'Y': 60},
]

df = pd.DataFrame(data)
print(df)

      X   Y
0  2018  50
1  2019  54
2  2020  58
3  2021  55
4  2022  60


![alt text](images/5.3%20Linear%20Regression.png 'Linear Regression')

The image above shows that, even when the relationship really is linear, the observed value $y$ and the estimated value $\hat y$ usually differ because of noise in the data. In other words, there is a random error term $\epsilon$ such that:

$$y = w_{0} + w_{1}x + \epsilon$$

where the bias $w_{0}$ is the intercept and $w_{1}$ is the slope. The best linear model is the one that keeps the errors as small as possible across all data points, so we want the coefficients that minimize the total squared error. The least squares method does exactly this.
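As a rough illustration of what "minimizing the error" means, the sketch below scores two candidate lines on the revenue dataset by their sum of squared residuals. The helper name `sum_squared_errors` and both candidate lines are assumptions made for this example; neither line is claimed to be the best fit.

```python
# Revenue dataset from the table above.
xs = [2018, 2019, 2020, 2021, 2022]
ys = [50, 54, 58, 55, 60]


def sum_squared_errors(w0, w1, xs, ys):
    """Sum of squared residuals (y_i - w0 - w1*x_i)^2 over all points."""
    return sum((y - w0 - w1 * x) ** 2 for x, y in zip(xs, ys))


mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Candidate 1 (hypothetical): a flat line at the mean revenue.
flat = sum_squared_errors(mean_y, 0.0, xs, ys)

# Candidate 2 (hypothetical): slope 2.0, passing through (mean_x, mean_y).
sloped = sum_squared_errors(mean_y - 2.0 * mean_x, 2.0, xs, ys)

# The candidate with the smaller value fits the data better.
print(flat, sloped)
```

Least squares picks, out of all possible lines, the one for which this quantity is smallest.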

Given $y_{i} = w_{0} + w_{1}x_{i} + \epsilon_{i}$ for $i = 1, \dots, n$, the sum of squared errors is:
$$\sum_{i=1}^{n}\epsilon_{i}^{2} = \sum_{i=1}^{n}(y_{i} - w_{0} - w_{1}x_{i})^{2}$$

To minimize this sum, we set its partial derivative with respect to each coefficient to zero:

$$\frac{\partial}{\partial w_{0}} \sum_{i=1}^{n}\epsilon_{i}^{2} = 0$$
and
$$\frac{\partial}{\partial w_{1}} \sum_{i=1}^{n}\epsilon_{i}^{2} = 0$$

Solving these two equations for $w_{0}$ and $w_{1}$ (the algebra is tedious but routine) leads to the following results:

$$w_{0} = \frac{1}{n}\sum_{i=1}^{n}y_{i} - w_{1}\frac{1}{n}\sum_{i=1}^{n}x_{i}$$
and
$$w_{1} = \frac{\sum_{i=1}^{n}y_{i}x_{i} - \frac{(\sum_{i=1}^{n}y_{i})(\sum_{i=1}^{n}x_{i})}{n} }{\sum_{i=1}^{n}x_{i}^{2} - \frac{(\sum_{i=1}^{n}x_i)^{2}}{n}}$$
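For readers who want the intermediate steps, the standard derivation can be sketched as follows. Setting each partial derivative of $\sum_{i=1}^{n}(y_{i} - w_{0} - w_{1}x_{i})^{2}$ to zero and dividing by $-2$ gives two equations:

$$\sum_{i=1}^{n}y_{i} = n w_{0} + w_{1}\sum_{i=1}^{n}x_{i}$$
$$\sum_{i=1}^{n}y_{i}x_{i} = w_{0}\sum_{i=1}^{n}x_{i} + w_{1}\sum_{i=1}^{n}x_{i}^{2}$$

Dividing the first equation by $n$ and rearranging gives the expression for $w_{0}$. Substituting that expression into the second equation and collecting the $w_{1}$ terms gives

$$\sum_{i=1}^{n}y_{i}x_{i} - \frac{(\sum_{i=1}^{n}y_{i})(\sum_{i=1}^{n}x_{i})}{n} = w_{1}\left(\sum_{i=1}^{n}x_{i}^{2} - \frac{(\sum_{i=1}^{n}x_{i})^{2}}{n}\right)$$

and dividing both sides by the bracketed term yields the expression for $w_{1}$.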

Calculating each term individually is simple using the dataframe from earlier.

In [5]:
df['YX'] = df['Y']*df['X']
df['X2'] = df['X']*df['X']
print(df)

      X   Y      YX       X2
0  2018  50  100900  4072324
1  2019  54  109026  4076361
2  2020  58  117160  4080400
3  2021  55  111155  4084441
4  2022  60  121320  4088484


In [13]:
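n = len(df.index)  # number of observations

# Sums that appear in the closed-form expression for w1
sum_yx = sum(df['YX'])                     # sum of y_i * x_i
sumy_sumx = sum(df['X'])*sum(df['Y'])      # (sum of x_i) * (sum of y_i)
sum_x2 = sum(df['X2'])                     # sum of x_i squared
sum_x_2 = (sum(df['X']))*(sum(df['X']))    # (sum of x_i), squared

w1 = (sum_yx - (sumy_sumx / n)) / (sum_x2 - (sum_x_2 / n))
print(w1)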

2.1


In [14]:
# Intercept: mean of y minus w1 times mean of x
w0 = sum(df['Y']) / (len(df.index)) - (w1/len(df.index))*sum(df['X'])
print(w0)

-4186.6


This means that we now have a linear equation as follows:

$\hat y = -4186.6 + 2.1\cdot x$

So, in 2024, the predicted revenue is $-4186.6 + 2.1\cdot 2024 = 63.8$ million dollars.
Of course, we also need to assess the performance and reliability of the linear regression model on unseen data using model validation techniques. Only then can we have confidence in its ability to make accurate predictions.
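One possible validation technique is leave-one-out cross-validation: refit the line with one year held out, predict that year, and repeat for every year. The sketch below is only an illustration under that assumption; the text does not prescribe a particular method, the helper name `fit_line` is hypothetical, and five data points are far too few for a reliable estimate.

```python
years = [2018, 2019, 2020, 2021, 2022]
revenue = [50, 54, 58, 55, 60]


def fit_line(xs, ys):
    """Least squares intercept and slope from the closed-form sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    w1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    w0 = sum_y / n - w1 * sum_x / n
    return w0, w1


abs_errors = []
for i in range(len(years)):
    # Hold out observation i and fit on the remaining four points.
    train_x = years[:i] + years[i + 1:]
    train_y = revenue[:i] + revenue[i + 1:]
    w0, w1 = fit_line(train_x, train_y)
    # Absolute error of the prediction for the held-out year.
    abs_errors.append(abs(revenue[i] - (w0 + w1 * years[i])))

mean_abs_error = sum(abs_errors) / len(abs_errors)
print(mean_abs_error)
```

The mean absolute error on the held-out years gives a rough sense of how far off a prediction for an unseen year might be.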