# Pair Programming ANOVA

In today's pair programming we will use the dataset you saved in the normalization and standardization pair programming.

In today's exercise you will run an ANOVA on your data and interpret the results.

📌 NOTE: Your data may not meet all of the assumptions. That is fine: run the ANOVA and interpret the results anyway. In upcoming lessons we will learn what to do in this situation.

In [1]:
import numpy as np
import pandas as pd
import random 

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA
from sklearn.preprocessing import StandardScaler

plt.rcParams["figure.figsize"] = (10,8) 

import sys
sys.path.append("../../")
from src import funciones as fun
from src import variables as var

In [2]:
df = pd.read_pickle('../archivos/coste_vida_estandar.pkl')
df.head()

| | country | mcdonalds | cappuccino | milk | rice | eggs | chicken | beef | banana | water | ... | gasoline | basic | internet | gym_monthly | cinema | preschool | primary_school | apt_3beds_outcentre | monthly_salary | basic_boxcox |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | South Korea | -0.219373 | 0.601852 | 2.395833 | 1.046243 | 1.304348 | 0.747273 | 4.294342 | 3.245552 | 0.307692 | ... | 0.245902 | 0.465023 | -0.302919 | 1.074934 | 0.208531 | 0.101671 | 1.262442 | 0.904358 | 0.517125 | 0.397744 |
| 1 | China | -0.350427 | 0.625 | 3.520833 | -0.289017 | -0.014493 | -0.292727 | 0.338771 | 0.241993 | -0.142857 | ... | -0.131148 | -0.534202 | -0.439942 | 1.455932 | 0.053956 | 1.617468 | 2.248014 | 0.750069 | -0.00458 | -0.617292 |
| 2 | China | -0.552707 | 0.421296 | 1.791667 | -0.398844 | -0.384058 | -0.490909 | 0.14856 | 0.014235 | -0.285714 | ... | -0.147541 | -0.58884 | -0.450326 | -0.011987 | 0.053956 | 0.335639 | 2.036034 | -0.045124 | -0.090119 | -0.698365 |
| 3 | India | -0.923077 | -0.069444 | -0.625 | -0.514451 | -0.934783 | -0.505455 | -0.656716 | -1.081851 | -0.461538 | ... | 0.114754 | -0.727198 | -0.635978 | -0.452562 | -0.564346 | -0.358551 | -0.390842 | -0.034428 | -0.324674 | -0.930884 |
| 4 | India | -0.746439 | -0.398148 | -0.666667 | -0.50289 | -0.884058 | -0.483636 | -0.690038 | -0.967972 | -0.450549 | ... | -0.065574 | -0.602435 | -0.67093 | -0.661335 | -0.475392 | -0.409781 | -0.590637 | -0.371835 | -0.347005 | -0.719354 |


We know that a linear regression model is not appropriate here, because the dependent variable is not adequately normalized and the remaining assumptions are not met for our predictors. Even so, we will continue the process and run an ANOVA as a teaching exercise.
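The assumptions mentioned above can be checked with standard tests before fitting anything. The following is a minimal sketch, assuming SciPy is available; the `toy` frame, its `group` and `response` columns, and the group spreads are made up for illustration and are not the cost-of-living data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for the real dataset (assumption).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 50),
    "response": np.concatenate([
        rng.normal(0.0, 1.0, 50),
        rng.normal(0.5, 1.0, 50),
        rng.normal(1.0, 3.0, 50),  # deliberately larger spread in group "c"
    ]),
})

# Normality of the response: Shapiro-Wilk (H0: the sample is normal).
shapiro_stat, shapiro_p = stats.shapiro(toy["response"])

# Homogeneity of variance across groups: Levene (H0: equal variances).
samples = [g["response"].to_numpy() for _, g in toy.groupby("group")]
levene_stat, levene_p = stats.levene(*samples)

print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")
print(f"Levene p-value:       {levene_p:.4f}")
```

A small p-value in either test is evidence against the corresponding assumption. On the real data, the analogous calls would presumably use `df['basic_boxcox']` as the response and `df.groupby('country')` for the groups.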

In [3]:
# Drop the original response variable; the ANOVA uses the Box-Cox-transformed version (basic_boxcox) instead.
df = df.drop(columns='basic')

In [4]:
df.columns

Index(['country', 'mcdonalds', 'cappuccino', 'milk', 'rice', 'eggs', 'chicken',
       'beef', 'banana', 'water', 'wine', 'beer', 'cigarettes_marlboro',
       'public_transport_ticket', 'taxi', 'gasoline', 'internet',
       'gym_monthly', 'cinema', 'preschool', 'primary_school',
       'apt_3beds_outcentre', 'monthly_salary', 'basic_boxcox'],
      dtype='object')

In [5]:
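# Fit an OLS model with every remaining column as a predictor; country is a string column, so it is treated as categorical.
lm = ols('basic_boxcox ~ country + mcdonalds + cappuccino + milk + rice + eggs + chicken + beef + banana + water + wine + beer + cigarettes_marlboro + public_transport_ticket + taxi + gasoline + internet + gym_monthly + cinema + preschool + primary_school + apt_3beds_outcentre + monthly_salary', data=df).fit()
# anova_lm defaults to Type I (sequential) sums of squares, so the order of the terms matters.
sm.stats.anova_lm(lm)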

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| country | 207.0 | 1562.075273 | 7.546257 | 65.633744 | 0.0 |
| mcdonalds | 1.0 | 3.759891 | 3.759891 | 32.701734 | 1.148357e-08 |
| cappuccino | 1.0 | 4.863252 | 4.863252 | 42.298245 | 8.745378e-11 |
| milk | 1.0 | 1.584234 | 1.584234 | 13.778915 | 0.0002082722 |
| rice | 1.0 | 1.158475 | 1.158475 | 10.075859 | 0.001513003 |
| eggs | 1.0 | 3.420634 | 3.420634 | 29.751041 | 5.192365e-08 |
| chicken | 1.0 | 1.699627 | 1.699627 | 14.782541 | 0.0001224173 |
| beef | 1.0 | 0.740445 | 0.740445 | 6.440038 | 0.01119316 |
| banana | 1.0 | 2.667936 | 2.667936 | 23.204438 | 1.507488e-06 |
| water | 1.0 | 0.15729 | 0.15729 | 1.368033 | 0.2422159 |


In [6]:
lm.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | basic_boxcox | R-squared: | 0.767 |
| Model: | OLS | Adj. R-squared: | 0.755 |
| Method: | Least Squares | F-statistic: | 60.95 |
| Date: | Sat, 20 May 2023 | Prob (F-statistic): | 0.0 |
| Time: | 12:50:27 | Log-Likelihood: | -1389.5 |
| No. Observations: | 4468 | AIC: | 3239.0 |
| Df Residuals: | 4238 | BIC: | 4712.0 |
| Df Model: | 229 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.6730 | 0.249 | -2.704 | 0.007 | -1.161 | -0.185 |
| country[T.Albania] | 0.1430 | 0.265 | 0.540 | 0.589 | -0.376 | 0.662 |
| country[T.Algeria] | 0.0650 | 0.256 | 0.254 | 0.799 | -0.436 | 0.566 |
| country[T.Andorra] | 0.0128 | 0.318 | 0.040 | 0.968 | -0.611 | 0.636 |
| country[T.Angola] | -0.1554 | 0.422 | -0.368 | 0.713 | -0.982 | 0.672 |
| country[T.Anguilla] | 0.8896 | 0.424 | 2.099 | 0.036 | 0.059 | 1.720 |
| country[T.Antigua And Barbuda] | 0.4973 | 0.425 | 1.170 | 0.242 | -0.336 | 1.330 |
| country[T.Argentina] | 0.0927 | 0.260 | 0.357 | 0.721 | -0.417 | 0.603 |
| country[T.Armenia] | 0.4861 | 0.285 | 1.704 | 0.088 | -0.073 | 1.045 |

| | | | |
|---|---|---|---|
| Omnibus: | 241.167 | Durbin-Watson: | 1.949 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 538.294 |
| Skew: | 0.347 | Prob(JB): | 1.29e-117 |
| Kurtosis: | 4.552 | Cond. No. | 2.70e+06 |


Our predictors explain the dependent variable to a fair extent: the model reports an R-squared of 0.767 (adjusted 0.755), and in the ANOVA rows shown every term except water (p = 0.242) has a p-value below 0.05. However, we must remember that these results were obtained without meeting the basic assumptions, such as normality of the dependent variable or homoscedasticity and independence of the independent variables, so this is purely a practice exercise.

In fact, our conclusion remains that linear regression is not an appropriate method for predicting our dependent variable.
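The assumptions discussed above can also be examined on the residuals of a fitted model. The sketch below is one possible way to do so with statsmodels: a Jarque-Bera test for normality (the same statistic reported in the summary output) and a Breusch-Pagan test for heteroscedasticity. The `toy` data are synthetic and deliberately heteroscedastic; none of the names or numbers come from the cost-of-living dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import jarque_bera

# Hypothetical toy data whose noise grows with x (assumption, for illustration).
rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({"x": rng.uniform(1.0, 10.0, size=n)})
noise = rng.normal(scale=0.3 * toy["x"].to_numpy(), size=n)
toy["y"] = 2.0 + 0.5 * toy["x"] + noise

model = ols("y ~ x", data=toy).fit()

# Jarque-Bera (H0: residuals are normally distributed).
jb_stat, jb_p, skew, kurtosis = jarque_bera(model.resid)

# Breusch-Pagan (H0: residual variance is constant).
bp_stat, bp_p, bp_f, bp_f_p = het_breuschpagan(model.resid, model.model.exog)

print(f"Jarque-Bera p-value:   {jb_p:.4g}")
print(f"Breusch-Pagan p-value: {bp_p:.4g}")
```

For the model fitted in this notebook, the equivalent calls would presumably be `jarque_bera(lm.resid)` and `het_breuschpagan(lm.resid, lm.model.exog)`. Small p-values would support the conclusion that the assumptions do not hold.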