## Pair Programming - Linear Regression 7

### ANOVA

---

In [31]:
# Data handling
import pandas as pd
import numpy as np

# Statistics
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Warning configuration
import warnings
warnings.filterwarnings('ignore')

We know that our data do not meet any of the assumptions, so strictly speaking we should not run the ANOVA, but for the sake of practice we decided to go ahead with it.
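As a reminder of what checking those assumptions can look like, here is a minimal sketch on synthetic data (the `demo` DataFrame and its columns are hypothetical, not the metro dataset). It runs one common test per assumption on the residuals of an OLS fit; other tests could be used instead.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Synthetic stand-in data (hypothetical; not the metro dataset).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=n),
    "x": rng.normal(size=n),
})
demo["y"] = 2.0 * demo["x"] + 1.5 * (demo["group"] == "b") + rng.normal(size=n)

model = ols("y ~ x + group", data=demo).fit()
resid = np.asarray(model.resid)

# Normality of residuals (Jarque-Bera, the same test reported by summary()).
jb_stat, jb_pvalue = stats.jarque_bera(resid)

# Homoscedasticity (Breusch-Pagan against the model's design matrix).
bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, model.model.exog)

# Independence of residuals (Durbin-Watson; values near 2 suggest no autocorrelation).
dw = durbin_watson(resid)

print(f"Jarque-Bera p-value:   {jb_pvalue:.3f}")
print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}")
print(f"Durbin-Watson:         {dw:.3f}")
```

Small p-values in the first two tests, or a Durbin-Watson statistic far from 2, would point to a violated assumption.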

### 1. Load our df.

In [32]:
df_estandarizado= pd.read_csv('../archivos/metro_C.csv', index_col=0)

In [33]:
df_estandarizado.head(2)

Unnamed: 0,index,season,cat_time,weekday,holiday_cat,traffic_volume,traffic_box,temp_c,snow_1h
0,0,Autumn,morning,martes,no,5546,544.504895,0.55231,-0.027233
1,1,Autumn,morning,martes,no,4517,472.646491,0.637259,-0.027233


### 2. Fit the model and run the ANOVA

In [34]:
lm = ols("traffic_box ~ temp_c + weekday + holiday_cat + snow_1h + cat_time + season", data=df_estandarizado).fit()
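String columns such as `weekday` are dummy-coded by the formula interface using treatment coding: one level becomes the baseline and gets no coefficient of its own. By default the baseline is the first level in sorted order. The sketch below, on synthetic data with hypothetical values, shows the default and how a different baseline could be chosen with `Treatment(reference=...)`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (hypothetical; three weekday levels only).
rng = np.random.default_rng(2)
days = ["domingo", "lunes", "martes"]
demo = pd.DataFrame({"weekday": rng.choice(days, size=300)})
effect = {"domingo": 0.0, "lunes": 5.0, "martes": 7.0}
demo["y"] = demo["weekday"].map(effect) + rng.normal(size=300)

# Default coding: the alphabetically first level ("domingo") is the baseline.
default_fit = ols("y ~ weekday", data=demo).fit()

# Explicit baseline: every coefficient is now a difference from "lunes".
relevel_fit = ols("y ~ C(weekday, Treatment(reference='lunes'))", data=demo).fit()

print(list(default_fit.params.index))
print(list(relevel_fit.params.index))
```

Each `[T.level]` coefficient is then read as the difference in the response between that level and the baseline, holding the other predictors fixed.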

In [35]:
sm.stats.anova_lm(lm)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
weekday,6.0,61073910.0,10178990.0,1546.147287,0.0
holiday_cat,1.0,3015261.0,3015261.0,458.006203,3.871014e-101
cat_time,3.0,1034958000.0,344986100.0,52402.019915,0.0
season,3.0,2354139.0,784713.0,119.194784,6.556425e-77
temp_c,1.0,1791850.0,1791850.0,272.174814,5.604636e-61
snow_1h,1.0,22258.74,22258.74,3.381014,0.06595761
Residual,48154.0,317019500.0,6583.451,,
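The `F` column can be reproduced from the table itself: each F statistic is the term's mean square divided by the residual mean square. A quick check with the values copied from the table above (they are rounded in the output, so the last digits may differ slightly):

```python
# Mean squares copied from the ANOVA table above (rounded as displayed).
mean_sq = {
    "weekday": 10178990.0,
    "holiday_cat": 3015261.0,
    "cat_time": 344986100.0,
    "season": 784713.0,
    "temp_c": 1791850.0,
    "snow_1h": 22258.74,
}
residual_mean_sq = 6583.451

# F statistic of each term = mean square of the term / residual mean square.
f_values = {term: ms / residual_mean_sq for term, ms in mean_sq.items()}

for term, f in f_values.items():
    print(f"{term:12s} F = {f:.2f}")
```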


### 3. Print the summary.

In [36]:
lm.summary()

0,1,2,3
Dep. Variable:,traffic_box,R-squared:,0.777
Model:,OLS,Adj. R-squared:,0.777
Method:,Least Squares,F-statistic:,11170.0
Date:,"Mon, 30 Jan 2023",Prob (F-statistic):,0.0
Time:,21:03:49,Log-Likelihood:,-280110.0
No. Observations:,48170,AIC:,560200.0
Df Residuals:,48154,BIC:,560400.0
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,287.4740,1.409,203.974,0.000,284.712,290.236
weekday[T.jueves],99.7813,1.389,71.860,0.000,97.060,102.503
weekday[T.lunes],72.6934,1.377,52.807,0.000,69.995,75.392
weekday[T.martes],89.3055,1.387,64.397,0.000,86.587,92.024
weekday[T.miercoles],95.0596,1.383,68.725,0.000,92.348,97.771
weekday[T.sabado],35.7325,1.387,25.754,0.000,33.013,38.452
weekday[T.viernes],103.4204,1.388,74.530,0.000,100.701,106.140
holiday_cat[T.si],-7.9241,10.424,-0.760,0.447,-28.356,12.507
cat_time[T.midday],167.6423,1.229,136.416,0.000,165.234,170.051

0,1,2,3
Omnibus:,2714.576,Durbin-Watson:,0.733
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5376.947
Skew:,-0.407,Prob(JB):,0.0
Kurtosis:,4.42,Cond. No.,35.4
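The R-squared in the summary can be cross-checked against the ANOVA table from step 2: since that table uses sequential sums of squares, the term rows plus the residual row add up to the total sum of squares, and R-squared is one minus the residual share. A small check using the displayed (rounded) values:

```python
# Sums of squares copied from the ANOVA table of the standardized model.
sum_sq_terms = [
    61073910.0,    # weekday
    3015261.0,     # holiday_cat
    1034958000.0,  # cat_time
    2354139.0,     # season
    1791850.0,     # temp_c
    22258.74,      # snow_1h
]
sum_sq_residual = 317019500.0

# Total variation = explained by the terms + left in the residuals.
sum_sq_total = sum(sum_sq_terms) + sum_sq_residual
r_squared = 1 - sum_sq_residual / sum_sq_total

print(f"R-squared = {r_squared:.3f}")
```

This prints an R-squared of 0.777, matching the summary.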


Looking at the results, we conclude that all of our predictor variables have a p-value below 0.05, except "holiday_cat" and "season[T.Winter]", and our R-squared is 0.777 😀
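Instead of reading the p-values off the summary by eye, they can be filtered from the fitted model's `pvalues` attribute. A minimal sketch on synthetic data (the `demo` DataFrame, its columns, and the threshold name `alpha` are hypothetical):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data: "signal" drives y, "noise" does not.
rng = np.random.default_rng(3)
n = 1000
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["y"] = 4.0 * demo["signal"] + rng.normal(size=n)

model = ols("y ~ signal + noise", data=demo).fit()

# Keep only the coefficients whose p-value is not below the threshold.
alpha = 0.05
not_significant = model.pvalues[model.pvalues >= alpha]
print(not_significant)
```

Applied to our `lm`, the same filter would list the coefficients that do not reach the 0.05 level.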


### 4. Load our unstandardized df.

In [37]:
df_sin_estandarizar= pd.read_csv('../archivos/metro_A.csv', index_col=0)

In [38]:
df_sin_estandarizar.head(2)

Unnamed: 0,season,weekday,cat_time,holiday_cat,temp_c,snow_1h,traffic_volume
0,Autumn,martes,morning,no,15.13,0.0,5545
1,Autumn,martes,morning,no,16.21,0.0,4516


In [39]:
lm = ols("traffic_volume ~ temp_c + weekday + holiday_cat + snow_1h + cat_time + season", data=df_sin_estandarizar).fit()

In [40]:
sm.stats.anova_lm(lm)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
weekday,6.0,10205400000.0,1700900000.0,1893.057262,0.0
holiday_cat,1.0,404173700.0,404173700.0,449.834681,2.235496e-99
cat_time,3.0,135572100000.0,45190690000.0,50296.046948,0.0
season,3.0,389826200.0,129942100.0,144.622109,2.682546e-93
temp_c,1.0,381232000.0,381232000.0,424.301149,7.187423e-94
snow_1h,1.0,3697502.0,3697502.0,4.115223,0.04250442
Residual,48171.0,43281350000.0,898493.9,,


In [41]:
lm.summary()

0,1,2,3
Dep. Variable:,traffic_volume,R-squared:,0.772
Model:,OLS,Adj. R-squared:,0.772
Method:,Least Squares,F-statistic:,10900.0
Date:,"Mon, 30 Jan 2023",Prob (F-statistic):,0.0
Time:,21:03:50,Log-Likelihood:,-398650.0
No. Observations:,48187,AIC:,797300.0
Df Residuals:,48171,BIC:,797500.0
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2066.6950,16.641,124.191,0.000,2034.078,2099.312
weekday[T.jueves],1276.7151,16.210,78.762,0.000,1244.944,1308.487
weekday[T.lunes],930.7509,16.072,57.911,0.000,899.250,962.252
weekday[T.martes],1148.0866,16.190,70.915,0.000,1116.355,1179.818
weekday[T.miercoles],1221.5815,16.148,75.651,0.000,1189.932,1253.231
weekday[T.sabado],415.0813,16.198,25.625,0.000,383.333,446.830
weekday[T.viernes],1305.7471,16.201,80.599,0.000,1273.994,1337.500
holiday_cat[T.si],-194.8252,121.782,-1.600,0.110,-433.519,43.869
cat_time[T.midday],2224.0694,14.379,154.675,0.000,2195.886,2252.252

0,1,2,3
Omnibus:,1912.656,Durbin-Watson:,0.673
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2857.801
Skew:,-0.38,Prob(JB):,0.0
Kurtosis:,3.92,Cond. No.,1850.0


We conclude that the results do not differ much from those of our standardized df, although our R-squared has dropped slightly, from 0.777 to 0.772, and the p-values of the predictors "holiday_cat" and "season[T.Winter]" have more or less swapped, although both remain above 0.05.
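One thing to keep in mind when comparing the two fits: they do not share the same response (`traffic_box` in the first, `traffic_volume` in the second). Standardizing a numeric predictor by itself leaves R-squared and the t statistic of that predictor unchanged, whereas transforming the response changes the fit. The sketch below illustrates this on synthetic data; the column names are hypothetical, and the log transform is only a stand-in, on the assumption that `traffic_box` is a Box-Cox-style transform of the traffic volume.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data with a skewed, strictly positive response.
rng = np.random.default_rng(4)
n = 1000
demo = pd.DataFrame({"temp": rng.normal(loc=10.0, scale=8.0, size=n)})
demo["volume"] = np.exp(1.0 + 0.05 * demo["temp"] + 0.3 * rng.normal(size=n))

# Standardize the predictor only (z-score).
demo["temp_std"] = (demo["temp"] - demo["temp"].mean()) / demo["temp"].std()
# Transform the response (log as a simple stand-in for a Box-Cox-style transform).
demo["volume_log"] = np.log(demo["volume"])

raw = ols("volume ~ temp", data=demo).fit()
scaled = ols("volume ~ temp_std", data=demo).fit()
transformed = ols("volume_log ~ temp_std", data=demo).fit()

print(f"raw predictor:        R2 = {raw.rsquared:.4f}")
print(f"standardized:         R2 = {scaled.rsquared:.4f}")
print(f"transformed response: R2 = {transformed.rsquared:.4f}")
```

The first two R-squared values coincide, while the third differs because the model is now explaining a different response.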