# Linear Regression in Python
## Statsmodels


### About Linear Regression
Linear regression is a predictive model that assumes a *linear relationship* between the dependent variable (the variable we are trying to predict or estimate) and one or more independent variables (the inputs used in the prediction).
<br>
<br> Under simple linear regression, only *one* independent/input variable is used to predict the dependent variable. It has the following structure:

$$ Y = MX + B$$

* Y = Dependent Variable (output/outcome/prediction/estimation)
* M = Slope of the regression line (the change in Y for a one-unit change in X)
* X = Independent Variable (input variable used in the prediction of Y)
* B = Constant (Y-intercept)
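
To make the formula concrete, here is a minimal sketch that estimates M and B by ordinary least squares in plain Python. The function name `fit_simple_line` and the four data points are hypothetical, chosen so that the points lie exactly on Y = 2X + 1.

```python
def fit_simple_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line passes through the point of means
    b = mean_y - m * mean_x
    return m, b


# Illustrative data only: points that lie on Y = 2X + 1
m, b = fit_simple_line([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b)  # 2.0 1.0
```

With real data the points will not lie exactly on a line, and M and B are the values that minimize the sum of squared vertical distances to the line.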
<br>
<br>
For models where relationships may exist between a dependent variable and *multiple* independent variables, we use the **Multiple Linear Regression** structure:

$$ Y = B + M_{1}X_{1} + M_{2}X_{2} + ... $$
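
As a sketch of how the multiple regression structure produces a prediction, the hypothetical helper `predict` below adds the constant term to the sum of each slope times its input. The coefficient values are made up for illustration and are not the fitted values for the dataset used in this tutorial.

```python
def predict(constant, slopes, inputs):
    """Evaluate constant + M1*X1 + M2*X2 + ... for one observation."""
    if len(slopes) != len(inputs):
        raise ValueError("need exactly one slope per input variable")
    return constant + sum(m * x for m, x in zip(slopes, inputs))


# Hypothetical coefficients: constant 10, M1 = 2, M2 = -3
y_hat = predict(10.0, [2.0, -3.0], [4.0, 1.5])
print(y_hat)  # 10 + 2*4 - 3*1.5 = 13.5
```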

### Example with a Dataset
Suppose you have a fictitious economy where index_price is the dependent variable and the two independent (input) variables are:
* interest_rate
* unemployment_rate


In [1]:
import pandas as pd

data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
        'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
        'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
        'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
        'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]        
        }

df = pd.DataFrame(data)
df.head()

| | year | month | interest_rate | unemployment_rate | index_price |
|---|---|---|---|---|---|
| 0 | 2017 | 12 | 2.75 | 5.3 | 1464 |
| 1 | 2017 | 11 | 2.5 | 5.3 | 1394 |
| 2 | 2017 | 10 | 2.5 | 5.3 | 1357 |
| 3 | 2017 | 9 | 2.5 | 5.3 | 1293 |
| 4 | 2017 | 8 | 2.5 | 5.4 | 1256 |


#### Coding using Statsmodels
Let's apply the following code to perform linear regression in Python using statsmodels:

In [2]:
import pandas as pd
import statsmodels.api as sm

data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
        'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
        'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
        'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
        'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]        
        }

df = pd.DataFrame(data) 

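# Independent (input) variables and the dependent variable
x = df[['interest_rate', 'unemployment_rate']]
y = df['index_price']

# sm.OLS does not include an intercept by default, so add a constant column
x = sm.add_constant(x)

# Fit ordinary least squares and compute the in-sample fitted values
model = sm.OLS(y, x).fit()
predictions = model.predict(x)

# Print the regression summary table
print_model = model.summary()
print(print_model)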

                            OLS Regression Results                            
Dep. Variable:            index_price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Wed, 12 Oct 2022   Prob (F-statistic):           4.04e-11
Time:                        17:09:47   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1798.4040    899.24
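
The Adj. R-squared value in the summary can be reproduced from R-squared, the number of observations, and the number of independent variables (Df Model). The sketch below uses the rounded figures printed in the summary, so the result agrees only to the displayed precision; the function name `adjusted_r_squared` is hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values as printed in the summary: R-squared 0.898, 24 observations, Df Model 2
adj = adjusted_r_squared(0.898, 24, 2)
print(round(adj, 3))  # 0.888
```

The denominator n - p - 1 equals the Df Residuals entry (21) in the summary.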

### Interpreting the Regression Results from statsmodels
Several important concepts can be seen within the results above:
* 1. **Adj. R-squared**: Reflects the fit of the model, penalized for the number of independent variables. R-squared ranges from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met. Adjusted R-squared is never larger than R-squared and can fall below 0 for a poorly fitting model.
* 2. **const coefficient**: The Y-intercept. If both interest_rate and unemployment_rate are zero, the expected output (i.e., the Y) equals the const coefficient.
* 3. **interest_rate coefficient**: Represents the change in the output Y due to a change of one unit in the interest rate (everything else held constant).
* 4. **unemployment_rate coefficient**: Represents the change in the output Y due to a change of one unit in the unemployment rate (everything else held constant).
* 5. **std err**: The standard error of each coefficient estimate. The lower it is, the more precisely the coefficient is estimated.
* 6. **P>|t|**: The p-value for the test that the coefficient equals zero. A p-value of less than 0.05 is conventionally considered statistically significant.
* 7. **Confidence Interval**: The [0.025, 0.975] columns give the 95% confidence interval for each coefficient, i.e., the range of values that is plausible for the true coefficient at that confidence level.
<br>
<br> The following tutorial includes an 
[example of multiple linear regression using both sklearn and statsmodels](https://datatofish.com/multiple-linear-regression-python/).