### Comparing sklearn and statsmodels

In sklearn, <i>LinearRegression</i> fits an intercept by default (<i>fit_intercept=True</i>). In statsmodels, <i>sm.OLS</i> does not add an intercept automatically, so a column of ones has to be added to the predictors, for example with <i>sm.add_constant</i>.

statsmodels takes y first, then x, when constructing the model: <i>sm.OLS(y1, x1)</i>. sklearn's <i>fit</i> takes x first, then y.

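Why add a column of ones? The coefficient that OLS estimates for that column is the intercept. Without it, <i>sm.OLS</i> fits a line through the origin and reports an uncentered R^2, which is why the two libraries can show very different R^2 values for the same data.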

reference: https://stats.stackexchange.com/questions/249892/wildly-different-r2-between-statsmodels-linear-regression-and-sklearn-linear

In [6]:
# Import
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

import statsmodels.api as sm

In [1]:
# Data used to compare sklearn and statsmodels
x1 = [26.0, 31.0, 47.0, 51.0, 50.0, 49.0, 37.0, 33.0, 49.0, 54.0, 31.0, 49.0, 48.0, 49.0, 49.0, 47.0, 44.0, 48.0, 35.0, 43.0]
y1 = [116.0, 94.0, 100.0, 102.0, 116.0, 116.0, 68.0, 118.0, 91.0, 104.0, 78.0, 116.0, 90.0, 109.0, 116.0, 118.0, 108.0, 119.0, 110.0, 102.0]

### statsmodels

In [4]:
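# Add a constant column so statsmodels fits an intercept (sklearn does this by default).
# Store it under a new name so x1 stays the raw predictor for sklearn.
x1_sm = sm.add_constant(x1)
print(x1_sm[0:5])

# Fit and summarize the statsmodels OLS model (y first, then x)
model_sm = sm.OLS(y1, x1_sm)
result_sm = model_sm.fit()
print(result_sm.summary())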

[[ 1. 26.]
 [ 1. 31.]
 [ 1. 47.]
 [ 1. 51.]
 [ 1. 50.]]
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.058
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     1.117
Date:                Thu, 07 Oct 2021   Prob (F-statistic):              0.305
Time:                        18:31:33   Log-Likelihood:                -80.427
No. Observations:                  20   AIC:                             164.9
Df Residuals:                      18   BIC:                             166.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------

### sklearn

In [7]:
# Create sklearn linear regression object. fit_intercept is set True by default
ols_sk = LinearRegression(fit_intercept=True)

# fit model
model_sk = ols_sk.fit(pd.DataFrame(x1), pd.DataFrame(y1))

# sklearn coefficient of determination
coef_of_det = model_sk.score(pd.DataFrame(x1), pd.DataFrame(y1))

print('sklearn R^2: ' + str(coef_of_det))

sklearn R^2: 0.05840690736642229
