# Pair Programming ANOVA

In today's pair programming we will use the dataset you saved in the normalization and standardization pair programming.

In today's exercise you will run an ANOVA on your data and interpret the results.

📌 NOTE: Your data may not satisfy all of the assumptions. That is fine: run the ANOVA anyway and interpret the results. In upcoming lessons we will learn what we can do in that situation.

In [1]:
# Data handling
# ------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import random

# Plots
# ------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
# ------------------------------------------------------------------------------
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA
from sklearn.preprocessing import StandardScaler

plt.rcParams["figure.figsize"] = (10, 8)

In [2]:
df = pd.read_csv('../archivos/coste_vida_estandar.csv', index_col = 0)
df.head()

| | city | country | basic | basic_boxcox | mcdonalds | cappuccino | milk | rice | eggs | chicken | ... | gasoline | internet | gym_monthly | cinema | preschool | primary_school | apt_3beds_outcentre | square_meter_incentre | monthly_salary | mortgage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Seoul | South Korea | 182.13 | 16.271842 | -0.210227 | 0.601852 | 2.395833 | 1.052023 | 1.311594 | 0.749091 | ... | 0.241935 | -0.300807 | 1.062337 | 0.207547 | 0.109541 | 1.275564 | 0.89581 | 6.846204 | 0.515595 | -0.346667 |
| 1 | Shanghai | China | 66.0 | 10.244243 | -0.340909 | 0.625 | 3.520833 | -0.283237 | -0.007246 | -0.290909 | ... | -0.129032 | -0.437216 | 1.436441 | 0.053701 | 1.630723 | 2.267313 | 0.74298 | 5.333712 | -0.001095 | -0.049524 |
| 2 | Guangzhou | China | 59.65 | 9.760717 | -0.542614 | 0.421296 | 1.791667 | -0.393064 | -0.376812 | -0.489091 | ... | -0.145161 | -0.447554 | -0.004914 | 0.053701 | 0.34434 | 2.054004 | -0.044698 | 3.635133 | -0.085812 | -0.019048 |
| 3 | Mumbai | India | 43.57 | 8.371859 | -0.911932 | -0.069444 | -0.625 | -0.508671 | -0.927536 | -0.503636 | ... | 0.112903 | -0.632375 | -0.437517 | -0.561684 | -0.352316 | -0.388079 | -0.034103 | 1.255105 | -0.318112 | 0.508571 |
| 4 | Delhi | India | 58.07 | 9.635477 | -0.735795 | -0.398148 | -0.666667 | -0.49711 | -0.876812 | -0.481818 | ... | -0.064516 | -0.667171 | -0.642511 | -0.473149 | -0.403728 | -0.589125 | -0.36832 | 0.000156 | -0.340228 | 0.527619 |


In [3]:
len(df['city'].unique())

4043

In [4]:
df.shape

(4081, 28)

In [8]:
df.drop('basic', axis=1, inplace=True)

In [9]:
df.columns

Index(['city', 'country', 'basic_boxcox', 'mcdonalds', 'cappuccino', 'milk',
       'rice', 'eggs', 'chicken', 'beef', 'banana', 'water', 'wine', 'beer',
       'cigarettes_marlboro', 'public_transport_ticket', 'taxi', 'gasoline',
       'internet', 'gym_monthly', 'cinema', 'preschool', 'primary_school',
       'apt_3beds_outcentre', 'square_meter_incentre', 'monthly_salary',
       'mortgage'],
      dtype='object')

In [11]:
lm = ols('basic_boxcox ~ country + mcdonalds + cappuccino + milk + rice + eggs + chicken + beef + banana + water + wine + beer + cigarettes_marlboro + public_transport_ticket + taxi + gasoline + internet + gym_monthly + cinema + preschool + primary_school + apt_3beds_outcentre + square_meter_incentre + monthly_salary + mortgage', data=df).fit()
sm.stats.anova_lm(lm)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| country | 207.0 | 7065.405489 | 34.132394 | 4.632669 | 1.720601e-83 |
| mcdonalds | 1.0 | 16363.044761 | 16363.044761 | 2220.898179 | 0.0 |
| cappuccino | 1.0 | 3380.333982 | 3380.333982 | 458.800773 | 2.9631e-96 |
| milk | 1.0 | 212.995283 | 212.995283 | 28.909096 | 8.036171e-08 |
| rice | 1.0 | 421.930519 | 421.930519 | 57.267137 | 4.732527e-14 |
| eggs | 1.0 | 1797.714856 | 1797.714856 | 243.997478 | 2.228626e-53 |
| chicken | 1.0 | 115.821324 | 115.821324 | 15.720019 | 7.477691e-05 |
| beef | 1.0 | 207.622414 | 207.622414 | 28.179856 | 1.167909e-07 |
| banana | 1.0 | 325.433148 | 325.433148 | 44.16989 | 3.433584e-11 |
| water | 1.0 | 50.415884 | 50.415884 | 6.84277 | 0.008934835 |
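The columns of the ANOVA table are linked by simple arithmetic: `mean_sq` is `sum_sq` divided by `df`, and `F` is `mean_sq` divided by the residual mean square. The residual row is not shown in the output above, so this minimal sketch recovers the residual mean square from the `country` row (an inferred value, not one printed by the notebook) and checks it against the `mcdonalds` row.

```python
# Values copied from the `country` and `mcdonalds` rows of the ANOVA table.
country = {"df": 207.0, "sum_sq": 7065.405489, "mean_sq": 34.132394, "F": 4.632669}
mcdonalds = {"df": 1.0, "sum_sq": 16363.044761, "mean_sq": 16363.044761, "F": 2220.898179}

# mean_sq is the sum of squares divided by its degrees of freedom.
country_mean_sq = country["sum_sq"] / country["df"]

# F = mean_sq / residual mean square, so the residual mean square can be
# recovered from any row (assumption: the Residual row is not shown above).
resid_mean_sq = country["mean_sq"] / country["F"]

# The same residual mean square should reproduce the F of every other row.
mcdonalds_F = mcdonalds["mean_sq"] / resid_mean_sq

print(f"country mean_sq:       {country_mean_sq:.6f}")
print(f"residual mean_sq:      {resid_mean_sq:.4f}")
print(f"mcdonalds F (rebuilt): {mcdonalds_F:.2f}")
```

Note that `anova_lm` uses sequential (type I) sums of squares by default, so the `sum_sq` attributed to each term depends on the order in which the terms appear in the formula.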


In [14]:
lm.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | basic_boxcox | R-squared: | 0.583 |
| Model: | OLS | Adj. R-squared: | 0.558 |
| Method: | Least Squares | F-statistic: | 23.3 |
| Date: | Fri, 12 May 2023 | Prob (F-statistic): | 0.0 |
| Time: | 10:17:59 | Log-Likelihood: | -9746.4 |
| No. Observations: | 4081 | AIC: | 19960.0 |
| Df Residuals: | 3849 | BIC: | 21420.0 |
| Df Model: | 231 | | |
| Covariance Type: | nonrobust | | |
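As a sanity check on the summary above, the degrees of freedom and the adjusted R-squared can be rebuilt by hand. The sketch below assumes 207 dummy variables for `country` (the `df` reported for that term in the ANOVA table) plus the 24 numeric predictors in the formula; `n_numeric` is counted by hand from the formula, not printed by the notebook.

```python
n_obs = 4081          # No. Observations
country_df = 207      # df of `country` in the ANOVA table
n_numeric = 24        # numeric predictors in the formula (counted by hand)
r_squared = 0.583     # R-squared (rounded in the summary)

df_model = country_df + n_numeric
df_resid = n_obs - df_model - 1   # minus 1 for the intercept

# Adjusted R-squared penalizes R-squared for the number of parameters.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(df_model, df_resid, round(adj_r_squared, 3))  # 231 3849 0.558
```

The gap between 0.583 and 0.558 is the price paid for the 231 estimated parameters, most of which come from the `country` dummies.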

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 13.2700 | 1.927 | 6.887 | 0.000 | 9.492 | 17.048 |
| country[T.Albania] | 0.3773 | 2.080 | 0.181 | 0.856 | -3.702 | 4.456 |
| country[T.Algeria] | -0.7575 | 1.990 | -0.381 | 0.703 | -4.658 | 3.143 |
| country[T.Andorra] | -3.2932 | 2.487 | -1.324 | 0.186 | -8.169 | 1.583 |
| country[T.Angola] | -1.2899 | 3.345 | -0.386 | 0.700 | -7.848 | 5.269 |
| country[T.Anguilla] | 0.3261 | 3.335 | 0.098 | 0.922 | -6.212 | 6.864 |
| country[T.Antigua And Barbuda] | -0.5987 | 3.330 | -0.180 | 0.857 | -7.127 | 5.930 |
| country[T.Argentina] | 0.1939 | 2.027 | 0.096 | 0.924 | -3.779 | 4.167 |
| country[T.Armenia] | -0.9623 | 2.225 | -0.433 | 0.665 | -5.324 | 3.399 |

| | | | |
|---|---|---|---|
| Omnibus: | 108.64 | Durbin-Watson: | 1.969 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 167.856 |
| Skew: | 0.262 | Prob(JB): | 3.55e-37 |
| Kurtosis: | 3.844 | Cond. No. | 1.01e+06 |
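The Jarque-Bera statistic in the diagnostics table can be approximated from the skewness and kurtosis reported next to it, using `JB = n/6 * (S^2 + (K - 3)^2 / 4)`. Under the null hypothesis of normality, JB follows a chi-squared distribution with 2 degrees of freedom, whose survival function is `exp(-x/2)`. This is a minimal sketch that uses only the rounded values printed above, so the rebuilt statistic differs slightly from the reported one.

```python
import math

n_obs = 4081           # No. Observations
skew = 0.262           # Skew (rounded in the summary)
kurtosis = 3.844       # Kurtosis (rounded in the summary)
jb_reported = 167.856  # Jarque-Bera (JB) as printed in the summary

# Jarque-Bera statistic from sample skewness and kurtosis.
jb_rebuilt = n_obs / 6 * (skew**2 + (kurtosis - 3) ** 2 / 4)

# Survival function of a chi-squared distribution with 2 degrees of freedom.
p_value = math.exp(-jb_reported / 2)

print(f"JB rebuilt from rounded inputs: {jb_rebuilt:.2f}")
print(f"p-value for the reported JB:    {p_value:.2e}")
```

A p-value this small rejects the hypothesis that the residuals are normally distributed.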


All of the predictors shown in the ANOVA table have p-values below 0.05, so each of them explains our dependent variable to some extent, and the model as a whole reaches an R-squared of 0.583. However, we must remember that these results were obtained without meeting one of the basic assumptions, the normality of the dependent variable, so this is a purely practical exercise. The Omnibus and Jarque-Bera tests in the summary, both with p-values close to 0, also reject normality of the residuals.

In fact, our conclusion remains that linear regression is not an appropriate method for predicting our dependent variable.