# Pair Programming ANOVA

In today's pair programming session we will use the dataset you saved in the normalization and standardization pair programming session.

So far you have been examining the characteristics of your dataset and have explored it thoroughly. Now it is time to run your first ANOVA! In today's exercise you will run an ANOVA on your data and interpret the results.

📌 NOTE: Your data may not meet all of the assumptions. That is fine: run the ANOVA and interpret the results anyway. In upcoming lessons we will learn what we can do in that situation.

In [7]:
# Data handling
# -----------------------------------------------------------------------
import numpy as np
import pandas as pd
import random 

# Plotting
# ------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
# ------------------------------------------------------------------------------
import statsmodels.api as sm
from statsmodels.formula.api import ols

from statsmodels.multivariate.manova import MANOVA
from sklearn.preprocessing import StandardScaler

plt.rcParams["figure.figsize"] = (10,8) 

In [10]:
df = pd.read_csv("../files/endesarrollo_estandarizado.csv", index_col = 0)
df.sample(5)

Unnamed: 0,index,country,status,continente,year,life_expectancy,life_box,adult_mortality,infant_deaths,alcohol,measles,bmi,total_expenditure,diphtheria,hiv/aids,gdp,thinness__1-19_years,income_composition_of_resources,schooling
1543,1879,Niger,Developing,Africa,2013.0,69.0,100866.195655,0.4,1.62963,-0.519069,2.550449,-0.468728,0.168986,-1.045455,0.357143,-0.266892,-0.660714,-1.074599,-1.684211
478,542,Chad,Developing,Africa,2001.0,48.0,34216.818723,-0.99375,1.37037,-0.461615,52.59588,-0.595255,0.5666,-2.909091,3.357143,-0.328546,1.178571,-1.229462,-1.789474
1643,1995,Paraguay,Developing,South America,2011.0,73.4,121261.015214,-0.925,-0.074074,0.626053,-0.035922,0.353702,1.568588,-0.045455,0.0,0.739472,-0.464286,0.186969,0.157895
1732,2116,Moldova,Developing,Europe,2002.0,67.5,94473.647457,0.3875,-0.148148,0.808321,10.379292,0.356578,1.127237,0.318182,0.0,-0.327618,-0.232143,-0.208648,0.015376
1661,2013,Peru,Developing,South America,2009.0,73.8,123240.201537,-0.93125,0.185185,0.408123,-0.035922,0.48023,-0.045726,0.136364,0.142857,0.789851,-0.625,0.30406,0.421053


In [15]:
df.drop(columns=["index", "life_box"], inplace=True)

In [16]:
df.isnull().sum()

country                            0
status                             0
continente                         0
year                               0
life_expectancy                    0
adult_mortality                    0
infant_deaths                      0
alcohol                            0
measles                            0
bmi                                0
total_expenditure                  0
diphtheria                         0
hiv/aids                           0
gdp                                0
thinness__1-19_years               0
income_composition_of_resources    0
schooling                          0
dtype: int64

In [14]:
df.columns

Index(['index', 'country', 'status', 'continente', 'year', 'life_expectancy',
       'life_box', 'adult_mortality', 'infant_deaths', 'alcohol', 'measles',
       'bmi', 'total_expenditure', 'diphtheria', 'hiv/aids', 'gdp',
       'thinness__1-19_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')

We rename these columns because the formula interface cannot handle column names that contain `/` or `-`; inside a formula those characters are read as operators.

In [23]:
df.rename(columns={'hiv/aids':"hiv_aids", "thinness__1-19_years": "thinness_1_19_years"}, inplace=True)

In [24]:
df.columns

Index(['country', 'status', 'continente', 'year', 'life_expectancy',
       'adult_mortality', 'infant_deaths', 'alcohol', 'measles', 'bmi',
       'total_expenditure', 'diphtheria', 'hiv_aids', 'gdp',
       'thinness_1_19_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')
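
The manual rename above covers the two problematic columns in this dataset. As a more general sketch, the same cleanup can be derived from the labels themselves. The helper name `to_formula_safe` and the `col_` prefix are assumptions made for illustration; they are not part of pandas or statsmodels.

```python
import re


def to_formula_safe(name: str) -> str:
    """Hypothetical helper: turn a column label into a formula-safe identifier."""
    # Replace every run of characters that is not a letter, digit or underscore.
    cleaned = re.sub(r"\W+", "_", name)
    # Collapse repeated underscores and trim them from both ends.
    cleaned = re.sub(r"_+", "_", cleaned).strip("_")
    # Identifiers cannot start with a digit, so add an (assumed) prefix.
    if cleaned[:1].isdigit():
        cleaned = "col_" + cleaned
    return cleaned


columns = ["hiv/aids", "thinness__1-19_years", "life_expectancy"]
renamed = {old: to_formula_safe(old) for old in columns}
print(renamed)
```

The resulting dictionary has the shape that `df.rename(columns=...)` expects, so it could be passed there instead of a hand-written mapping.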

### ols method

In [25]:
lm = ols('life_expectancy ~ adult_mortality + infant_deaths + alcohol + measles + bmi + total_expenditure + diphtheria + hiv_aids + gdp + thinness_1_19_years + income_composition_of_resources + schooling', data=df).fit()
sm.stats.anova_lm(lm)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
adult_mortality,1.0,86170.403012,86170.403012,5166.565896,0.0
infant_deaths,1.0,9041.352535,9041.352535,542.09731,2.857812e-108
alcohol,1.0,6742.319245,6742.319245,404.252916,3.248814e-83
measles,1.0,176.021336,176.021336,10.553807,0.00117551
bmi,1.0,13646.642858,13646.642858,818.219216,4.456056e-155
total_expenditure,1.0,38.334735,38.334735,2.298457,0.1296345
diphtheria,1.0,9971.036787,9971.036787,597.838896,4.568499e-118
hiv_aids,1.0,12958.14977,12958.14977,776.93886,2.417567e-148
gdp,1.0,2604.988685,2604.988685,156.188729,9.105485e-35
thinness_1_19_years,1.0,57.225559,57.225559,3.431104,0.06410218
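
To make the `F` column less of a black box: each F value is the term's mean square divided by the residual mean square. The residual row is not visible in the output above, so this sketch does not invent it. Instead it backs out the residual mean square implied by a few rows and checks that they agree. The numbers are copied from the table; the variable names are illustrative.

```python
# (mean_sq, F) pairs copied from the anova_lm output above.
rows = {
    "adult_mortality": (86170.403012, 5166.565896),
    "infant_deaths": (9041.352535, 542.09731),
    "alcohol": (6742.319245, 404.252916),
    "measles": (176.021336, 10.553807),
}

# F = mean_sq / MS_residual, so MS_residual = mean_sq / F.
implied_ms_resid = {term: mean_sq / f for term, (mean_sq, f) in rows.items()}

for term, ms in implied_ms_resid.items():
    print(f"{term:<16} implied residual mean square = {ms:.4f}")

# Every row shares the same denominator, so the spread should be tiny.
spread = max(implied_ms_resid.values()) - min(implied_ms_resid.values())
print(f"spread across rows = {spread:.6f}")
```

Because every term is compared against the same residual mean square, a term with a large sum of squares gets a large F and a small p-value.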


### summary method

In [27]:
df.columns

Index(['country', 'status', 'continente', 'year', 'life_expectancy',
       'adult_mortality', 'infant_deaths', 'alcohol', 'measles', 'bmi',
       'total_expenditure', 'diphtheria', 'hiv_aids', 'gdp',
       'thinness_1_19_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')

In [28]:
lm2 = ols('life_expectancy ~ continente + adult_mortality + infant_deaths + alcohol + measles + bmi + total_expenditure + diphtheria + hiv_aids + gdp + thinness_1_19_years + income_composition_of_resources + schooling', data=df).fit()
sm.stats.anova_lm(lm2)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
continente,5.0,103013.354184,20602.670837,1471.626704,0.0
adult_mortality,1.0,25051.065369,25051.065369,1789.37076,1.985217e-292
infant_deaths,1.0,3704.487132,3704.487132,264.607547,1.638871e-56
alcohol,1.0,264.072781,264.072781,18.862436,1.463495e-05
measles,1.0,24.495639,24.495639,1.749697,0.1860412
bmi,1.0,5192.324333,5192.324333,370.882164,5.975783e-77
total_expenditure,1.0,12.488168,12.488168,0.892016,0.3450268
diphtheria,1.0,5702.822089,5702.822089,407.346472,8.884014e-84
hiv_aids,1.0,6554.012119,6554.012119,468.146064,5.789739e-95
gdp,1.0,2191.329212,2191.329212,156.524298,7.809527e-35
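
The rows for the numeric predictors differ between the two tables even though the data are the same. `sm.stats.anova_lm` uses sequential (Type I) sums of squares unless `typ` is set otherwise, so each row measures what a term adds after the terms listed before it. The sketch below only compares `sum_sq` values copied from the two outputs above; the dictionary names are illustrative.

```python
# sum_sq values copied from the two anova_lm tables above.
without_continente = {
    "adult_mortality": 86170.403012,
    "infant_deaths": 9041.352535,
    "alcohol": 6742.319245,
    "measles": 176.021336,
}
with_continente_first = {
    "adult_mortality": 25051.065369,
    "infant_deaths": 3704.487132,
    "alcohol": 264.072781,
    "measles": 24.495639,
}

# Share of each term's sum of squares that remains once continente is entered first.
ratios = {
    term: with_continente_first[term] / without_continente[term]
    for term in without_continente
}

for term, ratio in ratios.items():
    print(f"{term:<16} keeps {ratio:.1%} of its sequential sum of squares")
```

Every ratio is below 1, which suggests that `continente` already accounts for part of the variation these predictors explained in the first model. If order-independent tests are preferred, `anova_lm` also accepts `typ=2`.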


In [33]:
# df[df["continente"] == "Oceania"]

In [29]:
lm2.summary()

Dep. Variable:,life_expectancy,R-squared:,0.829
Model:,OLS,Adj. R-squared:,0.828
Method:,Least Squares,F-statistic:,684.1
Date:,"Mon, 23 Jan 2023",Prob (F-statistic):,0.0
Time:,10:30:03,Log-Likelihood:,-6590.7
No. Observations:,2410,AIC:,13220.0
Df Residuals:,2392,BIC:,13320.0
Df Model:,17,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,66.5190,0.188,354.522,0.000,66.151,66.887
continente[T.Asia],3.8071,0.234,16.256,0.000,3.348,4.266
continente[T.Europe],5.1188,0.386,13.273,0.000,4.363,5.875
continente[T.North America],5.9957,0.311,19.252,0.000,5.385,6.606
continente[T.Oceania],2.6140,0.407,6.425,0.000,1.816,3.412
continente[T.South America],4.5921,0.365,12.586,0.000,3.877,5.308
adult_mortality,-2.2199,0.124,-17.839,0.000,-2.464,-1.976
infant_deaths,-0.1819,0.040,-4.540,0.000,-0.260,-0.103
alcohol,-0.7856,0.162,-4.862,0.000,-1.102,-0.469

Omnibus:,121.89,Durbin-Watson:,0.617
Prob(Omnibus):,0.0,Jarque-Bera (JB):,433.432
Skew:,-0.078,Prob(JB):,7.61e-95
Kurtosis:,5.072,Cond. No.,199.0


## Interpretation

Looking at the Adj. R-squared in the summary, we see that 82.8% of the variation in life_expectancy, the response variable, is explained by the predictor variables.
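
As a quick check on where the adjusted value comes from, it can be reproduced from three numbers in the `lm2.summary()` header: R-squared, No. Observations and Df Model. This is a minimal sketch using the standard adjustment formula and the rounded values shown in the summary, so the last digit may differ slightly.

```python
r_squared = 0.829  # R-squared from lm2.summary()
n_obs = 2410       # No. Observations
df_model = 17      # Df Model (model terms, intercept excluded)

# Adjusted R-squared penalizes R-squared for the number of fitted terms.
# n_obs - df_model - 1 equals the Df Residuals reported in the summary (2392).
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)

print(f"adjusted R-squared = {adj_r_squared:.3f}")
```

With 2410 observations and only 17 model terms the penalty is very small, which is why R-squared and Adj. R-squared are almost identical here.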

`Coefficient`
- The coefficient is the average change in the response variable per one-unit change in the predictor variable, holding the other predictors fixed. What matters here is its sign.
- The variables adult_mortality, infant_deaths, alcohol, hiv_aids and thinness_1_19_years have a negative sign, which means that when these variables increase, life expectancy decreases.
- The variables measles, bmi, total_expenditure, diphtheria, gdp, income_composition_of_resources and schooling have a positive sign, which means that when these variables increase, life expectancy increases as well.

`P>|t|`
- Looking at the predictor variables individually, measles, total_expenditure and thinness_1_19_years are NOT significant for predicting our response variable.
- For the categorical variable continente, every level has a p-value below 0.05, so all of them are significant for predicting our response variable. Each level is compared against the reference continent, which is the one omitted from the table.
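
The two readings above, sign and significance, can be sketched in a few lines. The tuples are copied from the rows of `lm2.summary()` that are visible in the output. The function `describe` is a hypothetical helper, and the rule it applies assumes the usual two-sided t-test: a 95% confidence interval that excludes zero corresponds to `P>|t|` below 0.05.

```python
# (term, coef, ci_low, ci_high) copied from the visible rows of lm2.summary().
rows = [
    ("continente[T.Asia]", 3.8071, 3.348, 4.266),
    ("continente[T.Europe]", 5.1188, 4.363, 5.875),
    ("continente[T.North America]", 5.9957, 5.385, 6.606),
    ("continente[T.Oceania]", 2.6140, 1.816, 3.412),
    ("continente[T.South America]", 4.5921, 3.877, 5.308),
    ("adult_mortality", -2.2199, -2.464, -1.976),
    ("infant_deaths", -0.1819, -0.260, -0.103),
    ("alcohol", -0.7856, -1.102, -0.469),
]


def describe(coef, ci_low, ci_high):
    """Hypothetical helper: return (direction, significant_at_5_percent)."""
    direction = "raises" if coef > 0 else "lowers"
    # The interval excludes zero when both bounds lie on the same side of it.
    significant = ci_low > 0 or ci_high < 0
    return direction, significant


report = {term: describe(coef, lo, hi) for term, coef, lo, hi in rows}

for term, (direction, significant) in report.items():
    flag = "significant" if significant else "not significant"
    print(f"{term:<28} {direction} life_expectancy ({flag} at 5%)")
```

For the `continente[T.…]` rows, "raises" means higher expected life expectancy than the reference continent, not an effect of increasing a numeric variable.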