# Pair Programming ANOVA

In today's pair programming we will use the dataset you saved in the normalization and standardization pair programming.

So far you have been evaluating the characteristics of your dataset and have done a thorough exploration. Now it is time to run your first ANOVA! In today's exercise you will run an ANOVA on your data and interpret the results.

📌 NOTE Your data may not meet all of the assumptions. That is fine: run the ANOVA and interpret the results anyway. In upcoming lessons we will learn what we can do in that situation.
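
As a rough sketch of how two of the usual ANOVA assumptions could be checked, the snippet below runs a Shapiro-Wilk test for normality within each group and a Levene test for equal variances across groups. The data are synthetic and the group names are hypothetical. Using `scipy.stats` is an assumption here, since this notebook does not import it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic example: three hypothetical groups with the same spread
groups = {
    "group_a": rng.normal(loc=60, scale=5, size=200),
    "group_b": rng.normal(loc=70, scale=5, size=200),
    "group_c": rng.normal(loc=75, scale=5, size=200),
}

# Normality within each group (H0: the sample comes from a normal distribution)
normality_pvalues = {
    name: stats.shapiro(values)[1] for name, values in groups.items()
}

# Homogeneity of variances across groups (H0: all groups share the same variance)
levene_stat, levene_pvalue = stats.levene(*groups.values())

print(normality_pvalues)
print(levene_pvalue)
```

A small p-value in either test suggests that the corresponding assumption does not hold for that sample.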

In [90]:
# Data handling
import numpy as np
import pandas as pd
import random 

# Plots
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
import statsmodels.api as sm
from statsmodels.formula.api import ols

from statsmodels.multivariate.manova import MANOVA
from sklearn.preprocessing import StandardScaler

plt.rcParams["figure.figsize"] = (10,8) 

In [91]:
df = pd.read_csv("../files/endesarrollo_estandarizado.csv", index_col = 0)
df.sample(5)

Unnamed: 0,index,country,status,continente,year,life_expectancy,life_box,adult_mortality,infant_deaths,alcohol,measles,bmi,total_expenditure,diphtheria,hiv/aids,gdp,thinness__1-19_years,income_composition_of_resources,schooling
2338,2866,Venezuela,Developing,South America,2007.0,73.4,121261.015214,0.0125,0.148148,1.202576,0.031696,0.655643,-0.089463,-1.272727,0.0,0.455733,-0.553571,0.387158,0.315789
2284,2812,Uruguay,Developing,South America,2013.0,76.8,138772.221672,-0.2875,-0.185185,0.631996,-0.035922,0.808052,1.274354,0.181818,0.0,4.371975,-0.571429,0.613787,1.0
421,485,Cameroon,Developing,Africa,2010.0,55.3,52167.194389,-0.7875,1.777778,0.697375,0.47121,-0.253055,-0.077535,-0.272727,3.857143,-0.050877,0.25,-0.549575,-0.526316
129,161,Bahamas,Developing,North America,2014.0,75.4,131371.507695,-0.91875,-0.185185,1.351164,-0.035922,0.833932,0.900596,0.272727,0.0,0.758513,-0.392857,0.617564,0.236842
502,566,China,Developing,Asia,2009.0,74.9,128793.353618,-0.4375,9.0,0.445765,110.816693,-0.215672,0.129225,0.409091,0.0,0.6973,-0.053571,0.213409,0.131579


In [92]:
df.drop(columns=["index", "life_box"], inplace=True)

In [93]:
df.isnull().sum()

country                            0
status                             0
continente                         0
year                               0
life_expectancy                    0
adult_mortality                    0
infant_deaths                      0
alcohol                            0
measles                            0
bmi                                0
total_expenditure                  0
diphtheria                         0
hiv/aids                           0
gdp                                0
thinness__1-19_years               0
income_composition_of_resources    0
schooling                          0
dtype: int64

In [94]:
df.columns

Index(['country', 'status', 'continente', 'year', 'life_expectancy',
       'adult_mortality', 'infant_deaths', 'alcohol', 'measles', 'bmi',
       'total_expenditure', 'diphtheria', 'hiv/aids', 'gdp',
       'thinness__1-19_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')

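We rename these columns because the formula interface used by `ols` cannot handle column names that contain `/` or `-`: inside a formula both characters are read as operators.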

In [95]:
df.rename(columns={'hiv/aids':"hiv_aids", "thinness__1-19_years": "thinness_1_19_years"}, inplace=True)

In [96]:
df.columns

Index(['country', 'status', 'continente', 'year', 'life_expectancy',
       'adult_mortality', 'infant_deaths', 'alcohol', 'measles', 'bmi',
       'total_expenditure', 'diphtheria', 'hiv_aids', 'gdp',
       'thinness_1_19_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')
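
When many columns need cleaning, the renaming could also be done programmatically. The sketch below uses a hypothetical helper, `sanitize_column_name`, which is not part of pandas or statsmodels; it replaces every run of non-alphanumeric characters with a single underscore.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Replace each run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")


original = ["hiv/aids", "thinness__1-19_years", "life_expectancy"]
cleaned = [sanitize_column_name(c) for c in original]

print(cleaned)
# ['hiv_aids', 'thinness_1_19_years', 'life_expectancy']
```

On a DataFrame this could be applied with `df.columns = [sanitize_column_name(c) for c in df.columns]`. A name that starts with a digit would still need separate handling before it can be used in a formula.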

### ols method

In [101]:
# Fit the linear model, then build the ANOVA table (sequential, Type I sums of squares by default)
lm = ols('life_expectancy ~ continente + adult_mortality + infant_deaths + alcohol + measles + bmi + total_expenditure + diphtheria + hiv_aids + gdp + thinness_1_19_years + income_composition_of_resources + schooling', data=df).fit()
sm.stats.anova_lm(lm)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
continente,5.0,103013.354184,20602.670837,1471.626704,0.0
adult_mortality,1.0,25051.065369,25051.065369,1789.37076,1.9852170000000001e-292
infant_deaths,1.0,3704.487132,3704.487132,264.607547,1.638871e-56
alcohol,1.0,264.072781,264.072781,18.862436,1.463495e-05
measles,1.0,24.495639,24.495639,1.749697,0.1860412
bmi,1.0,5192.324333,5192.324333,370.882164,5.9757829999999996e-77
total_expenditure,1.0,12.488168,12.488168,0.892016,0.3450268
diphtheria,1.0,5702.822089,5702.822089,407.346472,8.884014e-84
hiv_aids,1.0,6554.012119,6554.012119,468.146064,5.789739e-95
gdp,1.0,2191.329212,2191.329212,156.524298,7.809527e-35


### summary method

In [102]:
df["continente"].unique()

array(['Asia', 'Europe', 'Africa', 'North America', 'South America',
       'Oceania'], dtype=object)

In [80]:
# df[df["continente"] == "Oceania"]

In [103]:
lm.summary()

0,1,2,3
Dep. Variable:,life_expectancy,R-squared:,0.829
Model:,OLS,Adj. R-squared:,0.828
Method:,Least Squares,F-statistic:,684.1
Date:,"Mon, 23 Jan 2023",Prob (F-statistic):,0.0
Time:,12:27:05,Log-Likelihood:,-6590.7
No. Observations:,2410,AIC:,13220.0
Df Residuals:,2392,BIC:,13320.0
Df Model:,17,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,66.5190,0.188,354.522,0.000,66.151,66.887
continente[T.Asia],3.8071,0.234,16.256,0.000,3.348,4.266
continente[T.Europe],5.1188,0.386,13.273,0.000,4.363,5.875
continente[T.North America],5.9957,0.311,19.252,0.000,5.385,6.606
continente[T.Oceania],2.6140,0.407,6.425,0.000,1.816,3.412
continente[T.South America],4.5921,0.365,12.586,0.000,3.877,5.308
adult_mortality,-2.2199,0.124,-17.839,0.000,-2.464,-1.976
infant_deaths,-0.1819,0.040,-4.540,0.000,-0.260,-0.103
alcohol,-0.7856,0.162,-4.862,0.000,-1.102,-0.469

0,1,2,3
Omnibus:,121.89,Durbin-Watson:,0.617
Prob(Omnibus):,0.0,Jarque-Bera (JB):,433.432
Skew:,-0.078,Prob(JB):,7.609999999999999e-95
Kurtosis:,5.072,Cond. No.,199.0


## Interpretation

`Looking at the Adj. R-squared, we see that 82.8% of the variation in life_expectancy, the response variable, is explained by the predictor variables.`
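
The adjusted R-squared can be recovered from other figures in the summary with the usual formula `1 - (1 - R²) * (n - 1) / (n - p - 1)`. Below is a small worked check using the R-squared, number of observations and model degrees of freedom reported in the summary. R-squared is only shown to three decimals, so the result is approximate.

```python
r_squared = 0.829  # R-squared from the summary
n_obs = 2410       # No. Observations
df_model = 17      # Df Model (number of estimated slopes)

# Residual degrees of freedom: observations minus slopes minus the intercept
df_resid = n_obs - df_model - 1
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(df_resid)                 # 2392
print(round(adj_r_squared, 3))  # 0.828
```

Both values agree with the `Df Residuals` and `Adj. R-squared` entries of the summary table.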

`Coefficient`
- It is the average change in the response variable per unit change in the predictor variable, holding the other predictors constant; what interests us here is the sign. For adult_mortality, infant_deaths, alcohol, hiv_aids and thinness_1_19_years the sign is negative, meaning that when these variables increase, life expectancy decreases.

- For variables with a positive sign, namely measles, bmi, total_expenditure, diphtheria, gdp, income_composition_of_resources and schooling, an increase in the variable goes with an increase in life expectancy.

`P>|t|`
- Looking at the predictors individually, measles, total_expenditure and thinness_1_19_years are NOT significant when predicting our response variable.
- For the categorical variable continente, every level has a p-value below 0.05, so each continent differs significantly from the reference level (Africa, which is absorbed into the intercept). All of them are therefore significant for predicting our response variable.
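
Instead of reading the p-values off the table by eye, the non-significant terms could be listed from the fitted model's `pvalues` attribute. This is a minimal sketch on synthetic data; the column names `signal` and `noise` and the 0.05 threshold are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 500

# Synthetic data: "signal" drives y, "noise" is unrelated to it (hypothetical names)
toy = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
toy["y"] = 3.0 * toy["signal"] + rng.normal(size=n)

toy_model = ols("y ~ signal + noise", data=toy).fit()

alpha = 0.05
pvalues = toy_model.pvalues.drop("Intercept")  # one p-value per model term
not_significant = pvalues[pvalues > alpha]     # terms that fail the threshold

print(not_significant)
```

With the model fitted in this notebook, the same filter would be applied to `lm.pvalues`.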

In [104]:
df.columns

Index(['country', 'status', 'continente', 'year', 'life_expectancy',
       'adult_mortality', 'infant_deaths', 'alcohol', 'measles', 'bmi',
       'total_expenditure', 'diphtheria', 'hiv_aids', 'gdp',
       'thinness_1_19_years', 'income_composition_of_resources', 'schooling'],
      dtype='object')

### Dropping non-significant and out-of-scope columns before building the model

`In addition to the non-significant columns flagged by the summary, we decided to drop country, status and year, since they are not aligned with the model we want to build.`

In [105]:
len(df["country"].unique())

159

In [106]:
df.drop(columns=["year", "country", "status", "measles", "total_expenditure", "thinness_1_19_years"], inplace=True)

In [107]:
df.head(2)

Unnamed: 0,continente,life_expectancy,adult_mortality,infant_deaths,alcohol,bmi,diphtheria,hiv_aids,gdp,income_composition_of_resources,schooling
0,Asia,65.0,0.625,2.111111,-0.519069,-0.451474,-1.136364,0.0,-0.219529,-0.553352,-0.421053
1,Asia,59.9,0.675,2.185185,-0.519069,-0.465852,-1.272727,0.0,-0.211517,-0.564684,-0.447368


In [108]:
df.to_csv('../files/endesarrollo_anova.csv')