# Linear Regression

## Simple Linera regression

$$\hat{y} = w_0 + w_1\times x$$

where $w_0$ (y-intercept or bias) and $w_1$ (slope, or weight) are coefficient parameters, and $\hat{y}$ is the predicted response variable based on the predictor variable $x$.

### Estimate

**sum of square error**:

$$E = \sum_{i=1}^{n} ( y_i - (w_0 + w_1\times x_i))^2$$

In the scikit-learn's LinearRegression class, it is using the  Ordinary Least Squares method, which to minimize the sum of the squared residuals between the observed and predicted values. 

It uses a closed-form solution to directly calculate the optimal values for the coefficients without an iterative process.The closed-form solution for estimating the coefficients in the OLS method is:

$$β = (X^T * X)^(-1) * X^T * y$$


**Mean Absolute Error (MAE)**:
$$\begin{equation} MAE = \frac{\sum_{i=1}^{n} \lvert( y_i - \hat{y}_i)\rvert }{n} \end{equation}$$


**Root Mean Square Error (RMSE)**
$$\begin{equation} RMSE = \sqrt{\frac{\sum_{i=1}^{n} ( y_i - \hat{y}_i)^2 }{n}} \end{equation}$$

**R-squared**
$$ \begin{equation} R^2 = 1 - \frac{\sum_{i=1}^{n} ( y_i - \hat{y}_i)^2 }{\sum_{i=1}^{n} (y_i- \bar y)^2} \end{equation}$$


## Multiple Linear Regression 

$$\hat{y} = w_0 + w_1\times x_1 + w_2 \times x_2 ... + w_n \times x_n$$

In [14]:
# import and prepare
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

data=pd.read_csv('../Poll_result.csv')

# create linear regression model
lr_model = LinearRegression()

# set training set
x = data['Q1'].values.reshape(-1, 1)
y = data['Q5Average'].values.reshape(-1, 1)

# train 
lr_model.fit(x,y)

# y_predict
y_predi = lr_model.predict(x)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))

In [15]:
y_predi

array([[2.26686404],
       [2.03930874],
       [2.26686404],
       ...,
       [2.11516051],
       [2.11516051],
       [2.11516051]])

In [16]:
rmse

0.6125387763554059

In [17]:
x1 = data['Q7Average'].values.reshape(-1, 1)
x2 = data['Q5Average'].values.reshape(-1, 1)
y = data['Q1'].values.reshape(-1, 1)

# Create linear regression model
lr_model = LinearRegression()

# Concatenate the input variables
X = np.concatenate((x1, x2), axis=1)

# Train the model
lr_model.fit(X, y)

# Predict y values
y_pred = lr_model.predict(X)

In [12]:
y_pred

array([[3.1085424 ],
       [2.66962693],
       [2.66962693],
       ...,
       [2.66962693],
       [2.62634354],
       [2.62634354]])