# Challenge - Classification from an Econometric Perspective

### Challenge 1: Preparing the working environment

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('southafricanheart.csv').drop('Unnamed: 0', axis=1)

In [3]:
df

,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
0,160,12.00,5.73,23.11,Present,49,25.30,97.20,52,1
1,144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
2,118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
3,170,7.50,6.41,38.03,Present,51,31.99,24.26,58,1
4,134,13.60,3.50,27.78,Present,60,25.99,57.34,49,1
...,...,...,...,...,...,...,...,...,...,...
457,214,0.40,5.98,31.72,Absent,64,28.45,0.00,58,0
458,182,4.20,4.41,32.10,Absent,52,28.61,18.72,52,1
459,108,3.00,1.59,15.23,Absent,40,20.09,26.64,55,0
460,118,5.40,11.61,30.79,Absent,64,27.35,23.97,40,0


Variable descriptions:

- sbp: Systolic blood pressure.
- tobacco: Average tobacco consumed per day.
- ldl: Low-density lipoprotein.
- adiposity: Adiposity.
- famhist: Family history of heart disease. (Binary)
- typea: Type A personality.
- obesity: Obesity.
- alcohol: Current alcohol consumption.
- age: Age.
- chd: Coronary heart disease. (Dummy)

#### Descriptive analysis

In [4]:
df.loc[:,'sbp':'age'].describe()

,sbp,tobacco,ldl,adiposity,typea,obesity,alcohol,age
count,462.0,462.0,462.0,462.0,462.0,462.0,462.0,462.0
mean,138.32684,3.635649,4.740325,25.406732,53.103896,26.044113,17.044394,42.816017
std,20.496317,4.593024,2.070909,7.780699,9.817534,4.21368,24.481059,14.608956
min,101.0,0.0,0.98,6.74,13.0,14.7,0.0,15.0
25%,124.0,0.0525,3.2825,19.775,47.0,22.985,0.51,31.0
50%,134.0,2.0,4.34,26.115,53.0,25.805,7.51,45.0
75%,148.0,5.5,5.79,31.2275,60.0,28.4975,23.8925,55.0
max,218.0,31.2,15.33,42.49,78.0,46.58,147.19,64.0


In [5]:
df['famhist'].value_counts()

Absent     270
Present    192
Name: famhist, dtype: int64

In [6]:
df['chd'].value_counts()

0    302
1    160
Name: chd, dtype: int64

### Challenge 2

In [7]:
df['famhist'] = np.where(df['famhist']== 'Present', 1, 0)

In [8]:
df

,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
0,160,12.00,5.73,23.11,1,49,25.30,97.20,52,1
1,144,0.01,4.41,28.61,0,55,28.87,2.06,63,1
2,118,0.08,3.48,32.28,1,52,29.14,3.81,46,0
3,170,7.50,6.41,38.03,1,51,31.99,24.26,58,1
4,134,13.60,3.50,27.78,1,60,25.99,57.34,49,1
...,...,...,...,...,...,...,...,...,...,...
457,214,0.40,5.98,31.72,0,64,28.45,0.00,58,0
458,182,4.20,4.41,32.10,0,52,28.61,18.72,52,1
459,108,3.00,1.59,15.23,0,40,20.09,26.64,55,0
460,118,5.40,11.61,30.79,0,64,27.35,23.97,40,0


In [9]:
modelo_coronaria = smf.logit('chd ~ famhist', df).fit()

Optimization terminated successfully.
         Current function value: 0.608111
         Iterations 5


In [10]:
modelo_coronaria.summary2().tables[0][2]

0    Pseudo R-squared:
1                 AIC:
2                 BIC:
3      Log-Likelihood:
4             LL-Null:
5         LLR p-value:
6               Scale:
7                     
Name: 2, dtype: object

In [11]:
modelo_coronaria.summary2()

0,1,2,3
Model:,Logit,Pseudo R-squared:,0.057
Dependent Variable:,chd,AIC:,565.8944
Date:,2022-01-17 02:38,BIC:,574.1655
No. Observations:,462,Log-Likelihood:,-280.95
Df Model:,1,LL-Null:,-298.05
Df Residuals:,460,LLR p-value:,4.9371e-09
Converged:,1.0000,Scale:,1.0
No. Iterations:,5.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,-1.1690,0.1431,-8.1687,0.0000,-1.4495,-0.8885
famhist,1.1690,0.2033,5.7514,0.0000,0.7706,1.5674


In [12]:
def inverse_logit(x):
    return 1/(1+np.exp(-x))

In [13]:
round(inverse_logit(modelo_coronaria.params['Intercept']),3)

0.237

In [14]:
# The famhist coefficient is a difference in log-odds, not a probability.
# The probability with family history uses the full linear predictor.
round(inverse_logit(modelo_coronaria.params['Intercept'] + modelo_coronaria.params['famhist']),3)

0.5

**What is the probability that an individual with a family history has coronary heart disease?**

The probability of coronary heart disease for an individual with a family history of the disease is 50%: the log-odds are -1.1690 + 1.1690 = 0, and the inverse logit of 0 is 0.5.

**What is the probability that an individual without a family history has coronary heart disease?**

The probability of coronary heart disease for an individual **without** a family history of the disease is 24% (the inverse logit of the intercept, 0.237).

**What is the difference in probability between an individual with a family history and one without?**

The difference in the probability of coronary heart disease between an individual with a family history and one without is about 26 percentage points (0.500 - 0.237 = 0.263).
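
A predicted probability in this model comes from the full linear predictor: the intercept plus the `famhist` coefficient times the dummy value, passed through the inverse logit. The following is a minimal sketch of that calculation, assuming the rounded coefficients printed in the logit summary table are hard-coded. The names `B0`, `B1` and `predicted_probability` are illustrative and not part of the notebook.

```python
import math

# Assumption: rounded coefficients copied by hand from the logit summary table
B0 = -1.1690  # Intercept
B1 = 1.1690   # famhist

def inverse_logit(x):
    """Map log-odds to a probability."""
    return 1 / (1 + math.exp(-x))

def predicted_probability(famhist):
    """Probability of chd for famhist = 0 (absent) or 1 (present)."""
    return inverse_logit(B0 + B1 * famhist)

p_without = predicted_probability(0)
p_with = predicted_probability(1)
difference = p_with - p_without

print(round(p_without, 3), round(p_with, 3), round(difference, 3))
```

Because the model has a single dummy predictor, these two probabilities should coincide with the observed share of `chd` cases in each `famhist` group.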

**Replicate the model with smf.ols and comment on the similarities between the estimated coefficients.**

In [15]:
modelo_coronaria_lineal = smf.ols('chd ~ famhist', df).fit()

In [16]:
modelo_coronaria_lineal.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.072
Dependent Variable:,chd,AIC:,593.1725
Date:,2022-01-17 02:38,BIC:,601.4437
No. Observations:,462,Log-Likelihood:,-294.59
Df Model:,1,F-statistic:,36.86
Df Residuals:,460,Prob (F-statistic):,2.66e-09
R-squared:,0.074,Scale:,0.2105

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,0.2370,0.0279,8.4893,0.0000,0.1822,0.2919
famhist,0.2630,0.0433,6.0713,0.0000,0.1778,0.3481

0,1,2,3
Omnibus:,768.898,Durbin-Watson:,1.961
Prob(Omnibus):,0.0,Jarque-Bera (JB):,58.778
Skew:,0.579,Prob(JB):,0.0
Kurtosis:,1.692,Condition No.:,2.0


In [17]:
modelo_coronaria.params['famhist']/4

0.29224827135747744

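Dividing the logit coefficient (a difference in log-odds) by 4 gives 0.292, which is close to the `famhist` coefficient of the linear model (0.263). The linear model's intercept (0.237) also matches the inverse logit of the logit intercept.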
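
The divide-by-4 rule follows from the slope of the logistic curve: the derivative of the inverse logit is p(1 - p) times the coefficient, and p(1 - p) is at most 0.25 (reached at p = 0.5). The coefficient divided by 4 is therefore an upper bound on the change in probability. A minimal sketch, assuming the rounded coefficients from the two summary tables are hard-coded (the constant names are illustrative):

```python
import math

# Assumption: rounded values copied by hand from the summary tables
B0, B1 = -1.1690, 1.1690  # logit Intercept and famhist coefficient
OLS_SLOPE = 0.2630        # famhist coefficient of the linear model

def inverse_logit(x):
    """Map log-odds to a probability."""
    return 1 / (1 + math.exp(-x))

# Upper bound on the change in probability for a one-unit change in famhist
upper_bound = B1 / 4

# Exact change in predicted probability when famhist goes from 0 to 1
exact_change = inverse_logit(B0 + B1) - inverse_logit(B0)

print(round(upper_bound, 4), round(exact_change, 4), OLS_SLOPE)
```

The exact change should agree with the linear model's slope, while the divide-by-4 figure sits slightly above both.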

### Challenge 3: Full estimation

In [18]:
modelo_completo = smf.logit('chd ~ sbp+tobacco+ldl+adiposity+famhist+typea+obesity+alcohol+age', df).fit()

Optimization terminated successfully.
         Current function value: 0.510974
         Iterations 6


In [19]:
modelo_completo.summary2()

0,1,2,3
Model:,Logit,Pseudo R-squared:,0.208
Dependent Variable:,chd,AIC:,492.14
Date:,2022-01-17 02:38,BIC:,533.4957
No. Observations:,462,Log-Likelihood:,-236.07
Df Model:,9,LL-Null:,-298.05
Df Residuals:,452,LLR p-value:,2.0548e-22
Converged:,1.0000,Scale:,1.0
No. Iterations:,6.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,-6.1507,1.3083,-4.7015,0.0000,-8.7149,-3.5866
sbp,0.0065,0.0057,1.1350,0.2564,-0.0047,0.0177
tobacco,0.0794,0.0266,2.9838,0.0028,0.0272,0.1315
ldl,0.1739,0.0597,2.9152,0.0036,0.0570,0.2909
adiposity,0.0186,0.0293,0.6346,0.5257,-0.0388,0.0760
famhist,0.9254,0.2279,4.0605,0.0000,0.4787,1.3720
typea,0.0396,0.0123,3.2138,0.0013,0.0154,0.0637
obesity,-0.0629,0.0442,-1.4218,0.1551,-0.1496,0.0238
alcohol,0.0001,0.0045,0.0271,0.9784,-0.0087,0.0089


In [20]:
inverse_logit(0.0001)

0.5000249999999792

**Next, I refine the model by dropping the predictors that are not significant (sbp, adiposity, obesity, alcohol)**

In [21]:
modelo_completo_r = smf.logit('chd ~ tobacco+ldl+famhist+typea+age', df).fit()

Optimization terminated successfully.
         Current function value: 0.514811
         Iterations 6


In [22]:
modelo_completo_r.summary2()

0,1,2,3
Model:,Logit,Pseudo R-squared:,0.202
Dependent Variable:,chd,AIC:,487.6856
Date:,2022-01-17 02:38,BIC:,512.499
No. Observations:,462,Log-Likelihood:,-237.84
Df Model:,5,LL-Null:,-298.05
Df Residuals:,456,LLR p-value:,2.5537e-24
Converged:,1.0000,Scale:,1.0
No. Iterations:,6.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,-6.4464,0.9209,-7.0004,0.0000,-8.2513,-4.6416
tobacco,0.0804,0.0259,3.1057,0.0019,0.0297,0.1311
ldl,0.1620,0.0550,2.9470,0.0032,0.0543,0.2697
famhist,0.9082,0.2258,4.0228,0.0001,0.4657,1.3507
typea,0.0371,0.0122,3.0505,0.0023,0.0133,0.0610
age,0.0505,0.0102,4.9442,0.0000,0.0305,0.0705


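Goodness-of-fit statistics:
- BIC decreases in the model with fewer variables (533.50 to 512.50), and so does AIC (492.14 to 487.69).
- The log-likelihood stays practically the same (-236.07 vs. -237.84); LL-Null is identical (-298.05) because it depends only on the dependent variable.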
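
The information criteria can be reproduced from the log-likelihoods, and the two nested models can also be compared with a likelihood-ratio test. The sketch below assumes the rounded log-likelihoods from the summary tables are hard-coded, so results may differ from the tables in the last decimals. The helper names are hypothetical, and the chi-square tail uses a closed form that holds only for even degrees of freedom.

```python
import math

# Assumption: rounded values copied by hand from the two summary tables
N_OBS = 462
LL_FULL, K_FULL = -236.07, 10       # 9 predictors + intercept
LL_REDUCED, K_REDUCED = -237.84, 6  # 5 predictors + intercept

def aic(ll, k):
    """Akaike information criterion."""
    return 2 * k - 2 * ll

def bic(ll, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * ll

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

# Likelihood-ratio test: H0 is that the four dropped coefficients are all zero
lr_stat = 2 * (LL_FULL - LL_REDUCED)
lr_df = K_FULL - K_REDUCED
lr_pvalue = chi2_sf_even_df(lr_stat, lr_df)

print(round(aic(LL_FULL, K_FULL), 2), round(aic(LL_REDUCED, K_REDUCED), 2))
print(round(bic(LL_FULL, K_FULL, N_OBS), 2), round(bic(LL_REDUCED, K_REDUCED, N_OBS), 2))
print(round(lr_stat, 2), lr_df, round(lr_pvalue, 2))
```

With these inputs the p-value is far above 0.05, which suggests the four dropped variables are not jointly significant and supports keeping the smaller model.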

The retained variables are all significant at the 99% confidence level for explaining coronary heart disease: the absolute value of each z-score exceeds the critical value of 2.58. This lets us reject, for each variable, the null hypothesis that its coefficient equals zero (that is, that the variable does not help explain the presence of coronary heart disease).
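
The link between a z-score, its two-sided p-value and the 99% critical value can be checked with the standard library alone. This is a minimal sketch, assuming the z-scores are hard-coded from the reduced model's summary table; `two_sided_p_value` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def two_sided_p_value(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Critical value for a two-sided test at the 99% confidence level
critical_z_99 = NormalDist().inv_cdf(0.995)

# Assumption: z-scores copied by hand from the reduced model's summary table
z_scores = {
    'tobacco': 3.1057,
    'ldl': 2.9470,
    'famhist': 4.0228,
    'typea': 3.0505,
    'age': 4.9442,
}

for name, z in z_scores.items():
    print(name, round(two_sided_p_value(z), 4), abs(z) > critical_z_99)
```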

### Challenge 4: Profile estimation

**Below, the probability of coronary heart disease for each requested profile, computed from the full linear predictor of the reduced model:**

A probability requires the complete linear predictor (the intercept plus each coefficient times the profile's value for that variable); the inverse logit of a single coefficient is not a probability.

In [23]:
# Average profile: every predictor held at its sample mean
predictores = ['tobacco', 'ldl', 'famhist', 'typea', 'age']
perfil_medio = df[predictores].mean()
coefs = modelo_completo_r.params
logit_medio = coefs['Intercept'] + (coefs[predictores] * perfil_medio).sum()
round(inverse_logit(logit_medio), 2)

0.29

In [24]:
# High ldl: swap the mean ldl for the sample maximum, everything else unchanged
logit_ldl_alto = logit_medio + coefs['ldl'] * (df['ldl'].max() - perfil_medio['ldl'])
round(inverse_logit(logit_ldl_alto), 2)

0.7

In [25]:
# Low ldl: swap the mean ldl for the sample minimum, everything else unchanged
logit_ldl_bajo = logit_medio + coefs['ldl'] * (df['ldl'].min() - perfil_medio['ldl'])
round(inverse_logit(logit_ldl_bajo), 2)

0.18

**The probability of coronary heart disease for an individual with characteristics similar to the sample.**

With every predictor at its sample mean, the probability of coronary heart disease for this individual is about 29%.

**The probability of coronary heart disease for an individual with high levels of low-density lipoprotein, holding all other characteristics constant.**

With ldl at the sample maximum (15.33) and the other predictors at their means, the probability of coronary heart disease for this individual is about 70%.

**The probability of coronary heart disease for an individual with low levels of low-density lipoprotein, holding all other characteristics constant.**

With ldl at the sample minimum (0.98) and the other predictors at their means, the probability of coronary heart disease for this individual is about 18%.
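
A profile's probability comes from the full linear predictor: the intercept plus each coefficient multiplied by the profile's value for that variable. The sketch below sweeps `ldl` across its minimum, quartiles and maximum while holding the other predictors at their sample means. It assumes the rounded coefficients and descriptive statistics from the tables above are hard-coded, and takes the `famhist` mean as 192/462 from the value counts. `COEFS`, `MEANS` and `profile_probability` are hypothetical names.

```python
import math

# Assumption: rounded coefficients copied by hand from the reduced model's summary
COEFS = {
    'Intercept': -6.4464,
    'tobacco': 0.0804,
    'ldl': 0.1620,
    'famhist': 0.9082,
    'typea': 0.0371,
    'age': 0.0505,
}

# Assumption: sample means copied from describe(); famhist mean from value_counts()
MEANS = {
    'tobacco': 3.635649,
    'ldl': 4.740325,
    'famhist': 192 / 462,
    'typea': 53.103896,
    'age': 42.816017,
}

# ldl minimum, quartiles and maximum from describe()
LDL_VALUES = [0.98, 3.2825, 4.34, 5.79, 15.33]

def inverse_logit(x):
    """Map log-odds to a probability."""
    return 1 / (1 + math.exp(-x))

def profile_probability(profile):
    """Probability of chd for a profile given as {variable: value}."""
    log_odds = COEFS['Intercept'] + sum(
        COEFS[name] * value for name, value in profile.items()
    )
    return inverse_logit(log_odds)

# Replace only ldl in the average profile, one value at a time
probabilities = [profile_probability({**MEANS, 'ldl': ldl}) for ldl in LDL_VALUES]

for ldl, p in zip(LDL_VALUES, probabilities):
    print(ldl, round(p, 3))
```

Since the `ldl` coefficient is positive, the probabilities rise monotonically with `ldl`. The steps are not evenly spaced, because the inverse logit is nonlinear.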