## Applying Python to the Dataset
### First, import the libraries to be used

In [19]:
import pandas as pd

### Store the dataset in a variable:

In [20]:
df = pd.read_csv("Change criteria hypertension peru.csv")
df

Unnamed: 0,id,city,masl,sex,age_years,systolic_bp,diastolic_bp,weight_kg,height_cm,body_mass_index,...,physical_activity,msnm,region,sist_old,diast_old,sist_new,diast_new,treatment,HTA_new,BMI_cat
0,3574,Huancayo,3250,Female,23,119,74,58.0,163.0,22.0,...,Yes,2-4mil msnm,Mountain,Optma,Optma,Normal,Norm/elev,No,No,Normal
1,1092,Loreto,100,Male,60,110,70,54.0,160.0,21.0,...,No,<2000msm,Jungle,Optma,Optma,Normal,Norm/elev,No,No,Normal
2,861,Lima,500,Female,38,120,80,65.0,163.0,25.0,...,Yes,<2000msm,Coast,Optma,Normal,Normal,HTA 1,No,Yes,Normal
3,835,Lima,500,Female,43,110,80,60.0,157.0,24.0,...,No,<2000msm,Coast,Optma,Normal,Normal,HTA 1,No,Yes,Normal
4,4654,Huanuco,1900,Female,30,95,60,50.0,152.0,22.0,...,Yes,<2000msm,Mountain,Optma,Optma,Normal,Norm/elev,No,No,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5610,5490,Puno,3800,Male,25,80,60,,,,...,Yes,2-4mil msnm,Mountain,Optma,Optma,Normal,Norm/elev,No,No,
5611,4731,Puno,3800,Female,55,105,63,,,,...,Yes,2-4mil msnm,Mountain,Optma,Optma,Normal,Norm/elev,No,No,
5612,4715,Puno,3800,Male,46,123,85,,,,...,No,2-4mil msnm,Mountain,Normal,Normal/alta,Elevado,HTA 1,No,Yes,
5613,4750,Puno,3800,Male,24,115,65,,,,...,No,2-4mil msnm,Mountain,Optma,Optma,Normal,Norm/elev,No,No,


### Data analysis and cleaning
#### Finding the NaNs

In [21]:
df.isna().sum()

id                           0
city                         0
masl                         0
sex                          0
age_years                    0
systolic_bp                  0
diastolic_bp                 0
weight_kg                  197
height_cm                  201
body_mass_index            211
diabetes_mellitus           18
dm_treatment                44
cv_diseases                 28
cd_treatment                25
smoking                    132
smoking_years             5129
hypertension_dx              0
hypertension_years        4814
hypertension_treatment    4783
physical_activity            0
msnm                         0
region                       0
sist_old                     0
diast_old                    0
sist_new                     0
diast_new                    0
treatment                    0
HTA_new                      0
BMI_cat                    211
dtype: int64

Some values are changed here because many `NaN` appear, not because the data are incomplete. For example, in `smoking_years`, people who have never smoked were left blank instead of being given the value zero. The blanks in some of these columns are therefore replaced.

In [22]:
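# Work on a copy so the original df keeps its NaN values
df_1 = df.copy()
# Blank year counts are treated as 0 (e.g. the patient never smoked)
df_1['hypertension_years'] = df_1['hypertension_years'].fillna(0)
df_1['smoking_years'] = df_1['smoking_years'].fillna(0)
# Blank treatment entries are marked as not applicable
df_1['hypertension_treatment'] = df_1['hypertension_treatment'].fillna("N/A")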
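
In pandas, plain assignment (`a = b`) only binds a second name to the same DataFrame, whereas `.copy()` creates an independent one. The following is a minimal sketch on a toy frame (the values are invented for illustration; only the column names come from the dataset) that fills several columns in one call without touching the original:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; only the column names come from the dataset.
toy = pd.DataFrame({
    "smoking_years": [np.nan, 12.0, np.nan],
    "hypertension_years": [np.nan, np.nan, 3.0],
    "hypertension_treatment": [np.nan, np.nan, "Yes"],
})

# .copy() gives an independent DataFrame, so `toy` keeps its NaNs.
filled = toy.copy()

# fillna accepts a dict mapping column name -> replacement value.
filled = filled.fillna({
    "smoking_years": 0,
    "hypertension_years": 0,
    "hypertension_treatment": "N/A",
})

print(filled.isna().sum().sum())  # total NaNs left in the copy
print(toy.isna().sum().sum())     # total NaNs still in the original
```

Keeping the unfilled frame around makes it possible to compare row counts or NaN totals before and after cleaning.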

In [23]:
df_1.isna().sum()

id                          0
city                        0
masl                        0
sex                         0
age_years                   0
systolic_bp                 0
diastolic_bp                0
weight_kg                 197
height_cm                 201
body_mass_index           211
diabetes_mellitus          18
dm_treatment               44
cv_diseases                28
cd_treatment               25
smoking                   132
smoking_years               0
hypertension_dx             0
hypertension_years          0
hypertension_treatment      0
physical_activity           0
msnm                        0
region                      0
sist_old                    0
diast_old                   0
sist_new                    0
diast_new                   0
treatment                   0
HTA_new                     0
BMI_cat                   211
dtype: int64

Rows that still contain at least one `NaN` in any of the remaining columns are dropped.

In [24]:
df_2 = df_1.dropna(subset=['weight_kg', 'height_cm', 'diabetes_mellitus', 'dm_treatment', 'cv_diseases', 'cd_treatment', 'smoking', 'BMI_cat'])

In [25]:
df_2.isna().sum()

id                        0
city                      0
masl                      0
sex                       0
age_years                 0
systolic_bp               0
diastolic_bp              0
weight_kg                 0
height_cm                 0
body_mass_index           0
diabetes_mellitus         0
dm_treatment              0
cv_diseases               0
cd_treatment              0
smoking                   0
smoking_years             0
hypertension_dx           0
hypertension_years        0
hypertension_treatment    0
physical_activity         0
msnm                      0
region                    0
sist_old                  0
diast_old                 0
sist_new                  0
diast_new                 0
treatment                 0
HTA_new                   0
BMI_cat                   0
dtype: int64

In [26]:
print('Number of rows dropped: {}'.format(len(df) - len(df_2)))

Number of rows dropped: 368


The `smoking_years` and `hypertension_years` columns are converted to integer type.

In [27]:
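# Note: astype returns a new Series and does not modify df_2 in place;
# the results are not assigned here, so the stored columns stay float64.
df_2['smoking_years'].astype(int)
df_2['hypertension_years'].astype(int)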

0       0
1       0
2       0
3       0
4       0
       ..
5399    0
5400    1
5401    0
5402    0
5403    0
Name: hypertension_years, Length: 5247, dtype: int32
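
Because `astype` returns a new object, the conversion only persists if the result is assigned back. The following is a small sketch on a toy frame (values invented for illustration) that converts both columns in one call by passing a dict:

```python
import pandas as pd

# Toy frame invented for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "smoking_years": [0.0, 5.0],
    "hypertension_years": [0.0, 1.0],
})

# Assign the result back to keep the change.
# Passing a dict converts several columns in one call.
toy = toy.astype({"smoking_years": int, "hypertension_years": int})

print(toy.dtypes)
```

The exact integer width (for example int32 or int64) depends on the platform and the NumPy version.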

A total of 368 rows were dropped, leaving 5247 rows.
The values in the `msnm` column (metros sobre el nivel del mar, meters above sea level) use inconsistent formats: some appear as `2-4mil msnm`, others as `<2000msm` and `>4000 msnm`. Since the column name already says msnm, repeating it in each cell is redundant, so it is removed. The remaining choice is whether to write the numbers as `2 mil` (2 thousand) or `2000`; for now they are kept as `2 mil`.

In [28]:
# Nested dict: within the 'msnm' column only, map each old label to its cleaned form
df_3 = df_2.replace({'msnm': {'2-4mil msnm': '2-4 mil', '<2000msm': '< 2 mil', '>4000 msnm': '> 4 mil'}})
df_3['msnm']

0       2-4 mil
1       < 2 mil
2       < 2 mil
3       < 2 mil
4       < 2 mil
         ...   
5399    2-4 mil
5400    < 2 mil
5401    < 2 mil
5402    2-4 mil
5403    > 4 mil
Name: msnm, Length: 5247, dtype: object
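
The nested-dict form of `replace` can be tried on a toy frame first. This sketch (the `region` values are invented for illustration) also checks for labels that the mapping might have missed:

```python
import pandas as pd

# Toy frame invented for illustration; the msnm labels are the ones named in the text.
toy = pd.DataFrame({
    "msnm": ["2-4mil msnm", "<2000msm", ">4000 msnm"],
    "region": ["Mountain", "Coast", "Mountain"],
})

# Nested dict: {column: {old_value: new_value}}; other columns are not touched.
mapping = {"msnm": {"2-4mil msnm": "2-4 mil",
                    "<2000msm": "< 2 mil",
                    ">4000 msnm": "> 4 mil"}}
cleaned = toy.replace(mapping)

# Any value outside the expected set would signal a label the mapping missed.
expected = {"2-4 mil", "< 2 mil", "> 4 mil"}
unexpected = set(cleaned["msnm"]) - expected

print(cleaned["msnm"].tolist(), unexpected)
```

`replace` returns a new DataFrame, so the input frame keeps its original labels.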

The systolic and diastolic blood pressure category columns, old and new, also use inconsistent labels:
<ul>
    <li>In the `sist_old` and `diast_old` columns, one label is spelled `Optma`; it is replaced with `Optima`.</li>
    <li>In the `diast_new` column, many rows contain `Norm/elev`, whereas `sist_old` and `diast_old` use `Normal/alta`; `Norm/elev` is changed to `Normal/alta` so the format is consistent.</li>
</ul>

In [29]:
df_4 = df_3.replace({'sist_old': {'Optma': 'Optima'}, 'diast_old': {'Optma': 'Optima'}, 'diast_new': {'Norm/elev': 'Normal/alta'}})
df_4[['sist_old','diast_old','diast_new']]

Unnamed: 0,sist_old,diast_old,diast_new
0,Optima,Optima,Normal/alta
1,Optima,Optima,Normal/alta
2,Optima,Normal,HTA 1
3,Optima,Normal,HTA 1
4,Optima,Optima,Normal/alta
...,...,...,...
5399,Normal/alta,HTA 1,HTA 2
5400,HTA 2,HTA 2,HTA 2
5401,Optima,Optima,Normal/alta
5402,HTA 1,HTA 1,HTA 2


The last column, `BMI_cat`, holds the body mass index category. Comparing it with `body_mass_index` shows that some rows are miscategorized, and the values in `body_mass_index` are rounded. Each BMI value is therefore recalculated from weight and height, and `BMI_cat` is then corrected.

In [30]:
def bmi_change(row):
    # Classify one row by its BMI; returns None when the BMI is NaN
    bmi = row['body_mass_index']
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25.0:
        return 'Normal'
    elif bmi < 30.0:
        return 'Overweight'
    elif bmi < 35.0:
        return 'Obese class 1'
    elif bmi < 40.0:
        return 'Obese class 2'
    elif bmi >= 40.0:
        return 'Obese class 3'

In [31]:
df_4['body_mass_index'] = round(df_4['weight_kg'] / ((df_4['height_cm'] / 100) ** 2), 2)
df_4['BMI_cat'] = df_4.apply(bmi_change, axis = 1)
df_4

Unnamed: 0,id,city,masl,sex,age_years,systolic_bp,diastolic_bp,weight_kg,height_cm,body_mass_index,...,physical_activity,msnm,region,sist_old,diast_old,sist_new,diast_new,treatment,HTA_new,BMI_cat
0,3574,Huancayo,3250,Female,23,119,74,58.0,163.0,21.83,...,Yes,2-4 mil,Mountain,Optima,Optima,Normal,Normal/alta,No,No,Normal
1,1092,Loreto,100,Male,60,110,70,54.0,160.0,21.09,...,No,< 2 mil,Jungle,Optima,Optima,Normal,Normal/alta,No,No,Normal
2,861,Lima,500,Female,38,120,80,65.0,163.0,24.46,...,Yes,< 2 mil,Coast,Optima,Normal,Normal,HTA 1,No,Yes,Normal
3,835,Lima,500,Female,43,110,80,60.0,157.0,24.34,...,No,< 2 mil,Coast,Optima,Normal,Normal,HTA 1,No,Yes,Normal
4,4654,Huanuco,1900,Female,30,95,60,50.0,152.0,21.64,...,Yes,< 2 mil,Mountain,Optima,Optima,Normal,Normal/alta,No,No,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5399,1528,Cajamarca,2750,Female,49,135,90,77.0,147.0,35.63,...,Yes,2-4 mil,Mountain,Normal/alta,HTA 1,HTA 1,HTA 2,No,No,Obese class 2
5400,1038,Loreto,100,Male,49,164,105,101.0,165.0,37.10,...,Yes,< 2 mil,Jungle,HTA 2,HTA 2,HTA 2,HTA 2,No,No,Obese class 2
5401,298,Piura-Chiclayo,30,Male,47,104,58,67.0,141.0,33.70,...,No,< 2 mil,Coast,Optima,Optima,Normal,Normal/alta,No,No,Obese class 1
5402,2752,Cusco,3350,Female,49,140,95,91.0,165.0,33.43,...,No,2-4 mil,Mountain,HTA 1,HTA 1,HTA 2,HTA 2,No,No,Obese class 1
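
The BMI is the weight in kilograms divided by the square of the height in meters. For the first row, 58 kg and 163 cm give 58 / 1.63² ≈ 21.83. As a hedged alternative to the row-wise `apply`, the sketch below uses `pd.cut` to assign the categories in a vectorized way, assuming the same cut-offs as `bmi_change`. The weight and height pairs are taken from rows shown in the output above.

```python
import numpy as np
import pandas as pd

# Weight/height pairs taken from rows shown in the output above.
toy = pd.DataFrame({
    "weight_kg": [58.0, 77.0, 101.0, 67.0],
    "height_cm": [163.0, 147.0, 165.0, 141.0],
})

# BMI = weight in kg divided by height in meters squared.
toy["body_mass_index"] = (toy["weight_kg"] / (toy["height_cm"] / 100) ** 2).round(2)

# Same cut-offs as bmi_change; right=False makes each bin [lower, upper).
bins = [-np.inf, 18.5, 25.0, 30.0, 35.0, 40.0, np.inf]
labels = ["Underweight", "Normal", "Overweight",
          "Obese class 1", "Obese class 2", "Obese class 3"]
toy["BMI_cat"] = pd.cut(toy["body_mass_index"], bins=bins, labels=labels, right=False)

print(toy)
```

Note that `pd.cut` returns a categorical column rather than plain strings, and a `NaN` BMI stays `NaN` instead of becoming `None`.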


Finally, the DataFrame index is set to each patient's id.

In [32]:
df_4 = df_4.set_index('id')

In [38]:
df_4.groupby('hypertension_dx').size()

hypertension_dx
.       119
No     4374
Yes     754
dtype: int64

Now that the data are clean, the DataFrame is saved to a new file named `hypertension_peru_clean.csv`.

In [16]:
# Unused helper ideas, kept for reference:
# fumador = lambda smoking: smoking == "Yes"
# diabetico = lambda diabetes_mellitus: diabetes_mellitus == "Yes"
# Sort by patient id (the index) and save the cleaned data
df_4 = df_4.sort_values(by=['id'])
df_4.to_csv("hypertension_peru_clean.csv")
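
Since `id` is the index at this point, `to_csv` writes it as the first column of the file. The following is a minimal round-trip sketch, using an in-memory buffer and a toy frame (both invented for illustration) instead of the real file, that shows how the index could be restored when reading the data back:

```python
import io
import pandas as pd

# Toy frame invented for illustration, indexed by patient id like df_4.
toy = pd.DataFrame({"id": [3, 1, 2], "city": ["Lima", "Puno", "Cusco"]}).set_index("id")

# sort_index orders rows by the index; to_csv writes the index as the first column.
buffer = io.StringIO()
toy.sort_index().to_csv(buffer)

# index_col="id" restores the id column as the index when reading back.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="id")

print(restored)
```

Here `sort_index()` is used as an equivalent way to order rows by the `id` index.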

In [183]:
df_4.dtypes

city                       object
masl                        int64
sex                        object
age_years                   int64
systolic_bp                 int64
diastolic_bp                int64
weight_kg                 float64
height_cm                 float64
body_mass_index           float64
diabetes_mellitus          object
dm_treatment               object
cv_diseases                object
cd_treatment               object
smoking                    object
smoking_years             float64
hypertension_dx            object
hypertension_years        float64
hypertension_treatment     object
physical_activity          object
msnm                       object
region                     object
sist_old                   object
diast_old                  object
sist_new                   object
diast_new                  object
treatment                  object
HTA_new                    object
BMI_cat                    object
dtype: object

In [184]:
mayor_igual_a_30 = df_4['age_years'] >= 30
df_4[mayor_igual_a_30]

id,city,masl,sex,age_years,systolic_bp,diastolic_bp,weight_kg,height_cm,body_mass_index,diabetes_mellitus,...,physical_activity,msnm,region,sist_old,diast_old,sist_new,diast_new,treatment,HTA_new,BMI_cat
1,Piura-Chiclayo,4,Female,84,135,58,50.0,140.0,25.51,No,...,No,< 2 mil,Coast,Normal/alta,Optima,HTA 1,Normal/alta,Yes,Yes,Overweight
2,Piura-Chiclayo,4,Female,64,120,80,60.0,160.0,23.44,No,...,No,< 2 mil,Coast,Optima,Normal,Normal,HTA 1,No,Yes,Normal
3,Piura-Chiclayo,4,Male,47,108,63,72.0,159.0,28.48,No,...,No,< 2 mil,Coast,Optima,Optima,Normal,Normal/alta,Yes,No,Overweight
4,Piura-Chiclayo,4,Male,63,123,75,65.0,157.0,26.37,No,...,Yes,< 2 mil,Coast,Normal,Optima,Elevado,Normal/alta,No,No,Overweight
5,Piura-Chiclayo,4,Female,65,113,63,63.0,155.0,26.22,No,...,Yes,< 2 mil,Coast,Optima,Optima,Normal,Normal/alta,No,No,Overweight
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5613,Rinconada,5100,Female,43,88,65,70.0,150.0,31.11,No,...,No,> 4 mil,Mountain,Optima,Optima,Normal,Normal/alta,No,No,Obese class 1
5614,Rinconada,5100,Male,49,110,80,68.0,157.0,27.59,No,...,No,> 4 mil,Mountain,Optima,Normal,Normal,HTA 1,No,Yes,Overweight
5615,Rinconada,5100,Male,30,120,80,81.0,167.0,29.04,No,...,No,> 4 mil,Mountain,Optima,Normal,Normal,HTA 1,No,Yes,Overweight
5616,Rinconada,5100,Male,57,124,68,91.0,147.0,42.11,No,...,No,> 4 mil,Mountain,Normal,Optima,Elevado,Normal/alta,No,No,Obese class 3


In [185]:
# Number of patients newly classified as hypertensive (HTA_new == 'Yes') with no previous diagnosis
antiguos_no_HTA = df_4['hypertension_dx'] == 'No'
recien_HTA = df_4['HTA_new'] == 'Yes'
# Count the matching rows (selecting columns with a set is not supported in recent pandas)
len(df_4[recien_HTA & antiguos_no_HTA])

1047
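
The count above relies on combining boolean masks. The following is a small sketch on a toy frame (values invented for illustration, including the `.` placeholder that appears in `hypertension_dx`) of how `&` and `sum()` give the same kind of count:

```python
import pandas as pd

# Toy frame invented for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "hypertension_dx": ["No", "No", "Yes", "."],
    "HTA_new": ["Yes", "No", "Yes", "Yes"],
})

not_previously_dx = toy["hypertension_dx"] == "No"
new_criteria_positive = toy["HTA_new"] == "Yes"

# & combines the masks element-wise; True counts as 1, so sum() gives the row count.
newly_classified = (not_previously_dx & new_criteria_positive).sum()

print(newly_classified)
```

Rows whose `hypertension_dx` is `.` match neither `'Yes'` nor `'No'`, so they are excluded from both groups.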

In [186]:
# Number of patients who smoke, are diabetic, are overweight or heavier, and have hypertension under the new criteria (HTA_new)
antiguos_yes_HTA = df_4['hypertension_dx'] == 'Yes'  # defined here, used in a later cell
mayor_normal_BMI = df_4['body_mass_index'] >= 25.0
diabetico = df_4['diabetes_mellitus'] == 'Yes'
fumador = df_4['smoking'] == 'Yes'

len(df_4[recien_HTA & mayor_normal_BMI & diabetico & fumador])

12

In [189]:
# Patients previously diagnosed with hypertension whose diagnosis is now negative
recien_no_HTA = df_4['HTA_new'] == 'No'
df_4[recien_no_HTA & antiguos_yes_HTA]

id,city,masl,sex,age_years,systolic_bp,diastolic_bp,weight_kg,height_cm,body_mass_index,diabetes_mellitus,...,physical_activity,msnm,region,sist_old,diast_old,sist_new,diast_new,treatment,HTA_new,BMI_cat
6,Piura-Chiclayo,4,Male,83,155,85,59.0,149.0,26.58,No,...,No,< 2 mil,Coast,HTA 1,Normal/alta,HTA 2,HTA 1,Yes,No,Overweight
11,Piura-Chiclayo,4,Female,48,125,50,53.0,145.0,25.21,No,...,No,< 2 mil,Coast,Normal,Optima,Elevado,Normal/alta,Yes,No,Overweight
12,Piura-Chiclayo,4,Male,77,140,85,62.0,158.0,24.84,No,...,No,< 2 mil,Coast,HTA 1,Normal/alta,HTA 2,HTA 1,Yes,No,Normal
16,Piura-Chiclayo,4,Female,36,128,78,74.0,160.0,28.91,No,...,Yes,< 2 mil,Coast,Normal,Optima,Elevado,Normal/alta,Yes,No,Overweight
21,Piura-Chiclayo,4,Female,65,138,93,57.0,150.0,25.33,No,...,No,< 2 mil,Coast,Normal/alta,HTA 1,HTA 1,HTA 2,Yes,No,Overweight
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5609,Rinconada,5100,Male,59,135,90,89.0,140.0,45.41,No,...,No,> 4 mil,Mountain,Normal/alta,HTA 1,HTA 1,HTA 2,No,No,Obese class 3
5610,Rinconada,5100,Male,61,135,95,74.0,164.0,27.51,No,...,No,> 4 mil,Mountain,Normal/alta,HTA 1,HTA 1,HTA 2,No,No,Overweight
5613,Rinconada,5100,Female,43,88,65,70.0,150.0,31.11,No,...,No,> 4 mil,Mountain,Optima,Optima,Normal,Normal/alta,No,No,Obese class 1
5618,Rinconada,5100,Female,35,100,75,78.0,142.0,38.68,No,...,Yes,> 4 mil,Mountain,Optima,Optima,Normal,Normal/alta,No,No,Obese class 2


In [192]:
# Newly diagnosed hypertensive patients who already receive treatment and exercise
treatment_new_yes = df_4['treatment'] == 'Yes'
physical_activity_yes = df_4['physical_activity'] == 'Yes'
df_4[recien_HTA & treatment_new_yes & physical_activity_yes]

id,city,masl,sex,age_years,systolic_bp,diastolic_bp,weight_kg,height_cm,body_mass_index,diabetes_mellitus,...,physical_activity,msnm,region,sist_old,diast_old,sist_new,diast_new,treatment,HTA_new,BMI_cat
408,Lima,500,Male,44,138,70,80.0,160.0,31.25,No,...,Yes,< 2 mil,Coast,Normal/alta,Optima,HTA 1,Normal/alta,Yes,Yes,Obese class 1
444,Lima,500,Female,25,130,75,74.0,163.0,27.85,No,...,Yes,< 2 mil,Coast,Normal/alta,Optima,HTA 1,Normal/alta,Yes,Yes,Overweight
844,Lima,500,Female,64,115,80,54.0,154.0,22.77,No,...,Yes,< 2 mil,Coast,Optima,Normal,Normal,HTA 1,Yes,Yes,Normal
936,Lima,500,Female,61,130,70,64.0,161.0,24.69,No,...,Yes,< 2 mil,Coast,Normal/alta,Optima,HTA 1,Normal/alta,Yes,Yes,Normal
1032,Loreto,100,Male,67,128,80,80.0,155.0,33.3,Yes,...,Yes,< 2 mil,Jungle,Normal,Normal,Elevado,HTA 1,Yes,Yes,Obese class 1
1301,Cajamarca,2750,Female,68,135,75,64.0,155.0,26.64,No,...,Yes,2-4 mil,Mountain,Normal/alta,Optima,HTA 1,Normal/alta,Yes,Yes,Overweight
2406,Cerro de Pasco,4350,Male,66,137,85,59.0,160.0,23.05,No,...,Yes,> 4 mil,Mountain,Normal/alta,Normal/alta,HTA 1,HTA 1,Yes,Yes,Normal
2661,Cusco,3350,Female,54,120,80,110.0,155.0,45.79,No,...,Yes,2-4 mil,Mountain,Optima,Normal,Normal,HTA 1,Yes,Yes,Obese class 3
2691,Cusco,3350,Female,70,120,80,67.0,164.0,24.91,No,...,Yes,2-4 mil,Mountain,Optima,Normal,Normal,HTA 1,Yes,Yes,Normal
3471,Huancayo,3250,Male,72,135,80,80.0,167.0,28.69,No,...,Yes,2-4 mil,Mountain,Normal/alta,Normal,HTA 1,HTA 1,Yes,Yes,Overweight


https://zenodo.org/record/4567767#.Yj9dpzWjlEY