# Cleaning the unified dataset

In [43]:
import pandas as pd
import numpy as np

In [44]:
df = pd.read_csv('./DataFiles/enfermedad_cardiaca.csv', index_col=0)
df

Unnamed: 0,age,sex,Chest pain type,resting blood pressure,serum cholestoral,fasting blood sugar,Resting EEG,maximum heart rate,exercise induced angina,ST depression,slope ST,number of major vessels,thal,diagnosis of heart disease
0,28.0,1.0,2.0,130,132,0,2,185,0,0.0,?,?,?,0
1,29.0,1.0,2.0,120,243,0,0,160,0,0.0,?,?,?,0
2,29.0,1.0,2.0,140,?,0,0,170,0,0.0,?,?,?,0
3,30.0,0.0,1.0,170,237,0,1,170,0,0.0,?,?,6,0
4,31.0,0.0,2.0,100,219,0,1,150,0,0.0,?,?,?,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
792,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
793,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
794,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
795,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1


We will go through the dataset column by column to unify and clean the data.

The *age* column was already encoded correctly in the previous phase.

### Attribute *sex*

In [45]:
df['sex'].value_counts()

sex
1.0    613
0.0    184
Name: count, dtype: int64

We only need to convert it to int.

In [46]:
df['sex'] = df['sex'].astype('int')
df['sex'].value_counts()

sex
1    613
0    184
Name: count, dtype: int64

In [47]:
# Map 1 -> 'hombre' (male) and 0 -> 'mujer' (female) in the 'sex' column
df['sex'] = df['sex'].map({1: 'hombre', 0: 'mujer'}).astype('category')
df['sex'].value_counts()

sex
hombre    613
mujer     184
Name: count, dtype: int64

### Attribute *Chest pain type*

In [48]:
df['Chest pain type'].value_counts()

Chest pain type
4.0    398
3.0    187
2.0    170
1.0     42
Name: count, dtype: int64

In [49]:
remplazar_cp = {
    1: 'Angina típica',
    2: 'Angina atípica',
    3: 'Dolor no anginosa',
    4: 'Asintomático'
}

df['Chest pain type'] = df['Chest pain type'].map(remplazar_cp).astype('category')
df['Chest pain type'].value_counts()

Chest pain type
Asintomático         398
Dolor no anginosa    187
Angina atípica       170
Angina típica         42
Name: count, dtype: int64
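A dictionary passed to `.map()` silently turns any code it does not list into NaN. The sketch below, on hypothetical toy data (the code `5.0` is invented for illustration), shows one way to detect such unmapped codes before casting to `category`.

```python
import pandas as pd

# Toy data (hypothetical): 5.0 is not a documented chest pain code.
codes = pd.Series([1.0, 2.0, 4.0, 5.0], name="Chest pain type")

labels = {
    1: "Angina típica",
    2: "Angina atípica",
    3: "Dolor no anginosa",
    4: "Asintomático",
}

mapped = codes.map(labels)

# Codes that were present in the input but produced NaN after mapping
unmapped = codes[mapped.isna() & codes.notna()].unique()
print(unmapped)
```

An empty result means every code was covered by the dictionary.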

### Attribute *resting blood pressure*

Some records in this attribute contain *"?"*. Because they are a minority (57 records), they will be replaced with the median of the remaining values.

In [50]:
df['resting blood pressure'].value_counts()

resting blood pressure
120      81
130      69
140      60
?        57
120.0    37
         ..
106.0     1
156.0     1
154.0     1
114.0     1
164.0     1
Name: count, Length: 99, dtype: int64

In [51]:
# Replace '?' with NaN
df['resting blood pressure'] = df['resting blood pressure'].replace('?', np.nan).astype('float')

# There is also a value of 0; convert it to NaN as well
df['resting blood pressure'] = df['resting blood pressure'].replace(0.0, np.nan)



In [52]:
df['resting blood pressure'].info()

<class 'pandas.core.series.Series'>
Index: 797 entries, 0 to 796
Series name: resting blood pressure
Non-Null Count  Dtype  
--------------  -----  
739 non-null    float64
dtypes: float64(1)
memory usage: 12.5 KB


In [53]:
median_pressure = df['resting blood pressure'].median()
print(f"The median is: {median_pressure}")
df['resting blood pressure'] = df['resting blood pressure'].fillna(median_pressure)
df['resting blood pressure'].info()


The median is: 130.0
<class 'pandas.core.series.Series'>
Index: 797 entries, 0 to 796
Series name: resting blood pressure
Non-Null Count  Dtype  
--------------  -----  
797 non-null    float64
dtypes: float64(1)
memory usage: 12.5 KB
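The same cleaning can also be written with `pd.to_numeric`, which converts every non-numeric entry to NaN in a single step. This is a minimal sketch on hypothetical toy values, not the notebook's actual data.

```python
import pandas as pd

# Toy column (hypothetical values) mixing numeric strings, '?' and a 0 reading.
raw = pd.Series(["120", "130.0", "?", "0", "140"], name="resting blood pressure")

# errors="coerce" turns anything that is not a number (here '?') into NaN
pressure = pd.to_numeric(raw, errors="coerce")

# A reading of 0 is treated as missing, as in the cell above
pressure = pressure.mask(pressure == 0)

# Impute the missing entries with the median of the observed values
median = pressure.median()
pressure = pressure.fillna(median)
print(pressure.tolist())
```

Computing the median before filling keeps the imputed value based only on observed readings.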


### Attribute *serum cholestoral*

In [54]:
df['serum cholestoral'].value_counts()

serum cholestoral
0        49
?        30
216       7
220       7
204.0     6
         ..
187.0     1
157.0     1
176.0     1
241.0     1
131.0     1
Name: count, Length: 335, dtype: int64

This column appears to contain both zeros and missing values (?), for a total of 79 missing records. They will be replaced with the median of the remaining records.

In [55]:
df['serum cholestoral'] = df['serum cholestoral'].replace('?', np.nan).astype('float')
df['serum cholestoral'] = df['serum cholestoral'].replace(0, np.nan).astype('float')

In [56]:
df['serum cholestoral'] = df['serum cholestoral'].replace(np.nan, df['serum cholestoral'].median()).astype('float')

In [57]:
df['serum cholestoral'].info()

<class 'pandas.core.series.Series'>
Index: 797 entries, 0 to 796
Series name: serum cholestoral
Non-Null Count  Dtype  
--------------  -----  
797 non-null    float64
dtypes: float64(1)
memory usage: 12.5 KB
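Since the same pattern (mark sentinels as missing, then fill with the median) recurs across several columns, it could be wrapped in a small helper. The function name `impute_median` and its `zero_is_missing` parameter are hypothetical, introduced here only for illustration.

```python
import pandas as pd

def impute_median(series, zero_is_missing=False):
    """Return (imputed numeric series, number of values that were missing).

    Hypothetical helper: non-numeric entries such as '?' become NaN, zeros
    are optionally treated as missing, and NaN is filled with the median.
    """
    numeric = pd.to_numeric(series, errors="coerce")
    if zero_is_missing:
        numeric = numeric.mask(numeric == 0)
    n_missing = int(numeric.isna().sum())
    return numeric.fillna(numeric.median()), n_missing

# Toy data (hypothetical values)
raw = pd.Series(["0", "?", "216", "220", "204.0"], name="serum cholestoral")
clean, n_missing = impute_median(raw, zero_is_missing=True)
print(n_missing, clean.tolist())
```

Returning the count of imputed values makes it easy to compare against the number reported in the text.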


### Attribute *fasting blood sugar*

In [58]:
df['fasting blood sugar'].value_counts()

fasting blood sugar
0      391
0.0    258
1       88
1.0     45
?       15
Name: count, dtype: int64

Because this is a binary variable and only 15 records are missing, we propose removing those records from the dataset: there is no value we can reasonably assume for them, and no standard summary statistic to impute.

In [59]:
# .copy() makes the filtered frame independent, which avoids SettingWithCopyWarning
df = df[df['fasting blood sugar'] != '?'].copy()

In [60]:
df['fasting blood sugar'] = df['fasting blood sugar'].map({
    '0': 0,
    '0.0': 0,
    '1': 1,
    '1.0': 1,
})


In [61]:
df['fasting blood sugar'].value_counts()

fasting blood sugar
0    649
1    133
Name: count, dtype: int64
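Filtering a DataFrame and then assigning to one of its columns is a common trigger for pandas' `SettingWithCopyWarning`. The following sketch, on a hypothetical toy frame, shows the filter-then-copy pattern together with a conversion that does not need every string spelling listed.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same mixed spellings as the real column
toy = pd.DataFrame({"fasting blood sugar": ["0", "1.0", "?", "0.0", "1"]})

# .copy() makes the filtered frame independent of the original
kept = toy[toy["fasting blood sugar"] != "?"].copy()

# Going through float accepts both '0' and '0.0'; int then gives clean 0/1 flags
kept["fasting blood sugar"] = kept["fasting blood sugar"].astype(float).astype(int)
print(kept["fasting blood sugar"].tolist())
```

The original frame `toy` is left untouched, so the dropped rows can still be inspected afterwards.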

### Attribute *Resting EEG*

In [62]:
df['Resting EEG'].value_counts()

Resting EEG
0      306
0.0    151
2.0    148
1      139
2       33
1.0      4
?        1
Name: count, dtype: int64

The dataset documentation describes this variable as a categorization into three classes:

     19 restecg: resting electrocardiographic results
        -- Value 0: normal
        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST 
                    elevation or depression of > 0.05 mV)
        -- Value 2: showing probable or definite left ventricular hypertrophy
                    by Estes' criteria

We propose transforming it into a categorical variable with the following mapping:

``` 
{ 
   0: 'normal',
   1: 'anomalia',
   2: 'hipertrofia'
}
```
The only missing value ('?') will be dropped from the dataset.

In [63]:
df = df[df['Resting EEG'] != '?']

In [64]:
df['Resting EEG'].value_counts()

Resting EEG
0      306
0.0    151
2.0    148
1      139
2       33
1.0      4
Name: count, dtype: int64

In [65]:
resting_EEG_ordinal_dtype = pd.CategoricalDtype(categories=['normal', 'anomalia', 'hipertrofia'], ordered=True)

df['Resting EEG'] = df['Resting EEG'].map({
    '0': 'normal',
    '0.0': 'normal',
    '1': 'anomalia',
    '1.0': 'anomalia',
    '2': 'hipertrofia',
    '2.0': 'hipertrofia',
}).astype(resting_EEG_ordinal_dtype)

In [66]:
df['Resting EEG'].value_counts()

Resting EEG
normal         457
hipertrofia    181
anomalia       143
Name: count, dtype: int64
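As an alternative to listing every string spelling in the mapping, the codes can first be normalized to integers and then turned into an ordered categorical with `pd.Categorical.from_codes`. This is a sketch on hypothetical toy values; it assumes every code is 0, 1 or 2 once the '?' rows have been removed.

```python
import pandas as pd

# Toy column (hypothetical values) with mixed spellings of the same codes
raw = pd.Series(["0", "0.0", "1", "2.0", "2"], name="Resting EEG")

dtype = pd.CategoricalDtype(
    categories=["normal", "anomalia", "hipertrofia"], ordered=True
)

# '0' and '0.0' both become the integer 0, and so on
codes = raw.astype(float).astype(int)

# Each integer code indexes into the category list above
eeg = pd.Series(
    pd.Categorical.from_codes(codes.to_numpy(), dtype=dtype),
    index=raw.index,
    name=raw.name,
)
print(eeg.tolist())
```

This relies on the category order matching the numeric codes, which holds for the mapping proposed above.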

### Attribute *maximum heart rate*

In [67]:
df['maximum heart rate'].value_counts()

maximum heart rate
?        54
150      33
140      31
120      23
130      23
         ..
117.0     1
71.0      1
118.0     1
134.0     1
90.0      1
Name: count, Length: 183, dtype: int64

In [68]:
df['maximum heart rate'] = df['maximum heart rate'].replace('?', np.nan).astype('float')

In [69]:
df['maximum heart rate'].describe()

count    727.000000
mean     140.286107
std       25.122688
min       69.000000
25%      121.500000
50%      142.000000
75%      160.000000
max      202.000000
Name: maximum heart rate, dtype: float64

The summary statistics suggest a roughly symmetric distribution (mean ≈ 140.3, median 142), and the minimum (69) and maximum (202) do not look anomalous.

The missing values will be replaced with the median of the remaining data.
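`describe()` alone cannot confirm the shape of a distribution. A quick numeric check is to look at the skewness and at the 1.5 × IQR outlier fences, as in this sketch on hypothetical toy values (not the notebook's data).

```python
import pandas as pd

# Toy heart-rate values (hypothetical)
rates = pd.Series([69, 110, 121, 135, 142, 150, 160, 175, 202], dtype=float)

# Skewness near 0 is consistent with a roughly symmetric distribution
skewness = rates.skew()

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = rates.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = rates[(rates < lower) | (rates > upper)]

print(f"skewness={skewness:.2f}, fences=({lower}, {upper}), outliers={len(outliers)}")
```

A histogram or Q-Q plot would give a more complete picture; the numbers above are only a first screen.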

In [70]:
df['maximum heart rate'] = df['maximum heart rate'].replace(np.nan, df['maximum heart rate'].median())
df['maximum heart rate'].describe()

count    781.000000
mean     140.404609
std       24.241365
min       69.000000
25%      123.000000
50%      142.000000
75%      159.000000
max      202.000000
Name: maximum heart rate, dtype: float64

In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 781 entries, 0 to 796
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   age                         781 non-null    float64 
 1   sex                         781 non-null    category
 2   Chest pain type             781 non-null    category
 3   resting blood pressure      781 non-null    float64 
 4   serum cholestoral           781 non-null    float64 
 5   fasting blood sugar         781 non-null    int64   
 6   Resting EEG                 781 non-null    category
 7   maximum heart rate          781 non-null    float64 
 8   exercise induced angina     781 non-null    object  
 9   ST depression               781 non-null    object  
 10  slope ST                    781 non-null    object  
 11  number of major vessels     781 non-null    object  
 12  thal                        781 non-null    object  
 13  diagnosis of heart diseas

### Attribute *exercise induced angina*

In [72]:
df['exercise induced angina'].value_counts()

exercise induced angina
0      247
0.0    204
1      177
1.0     99
?       54
Name: count, dtype: int64

In [73]:
df['exercise induced angina'] = df['exercise induced angina'].replace('?', np.nan).astype('float')
df['exercise induced angina'].value_counts()

exercise induced angina
0.0    451
1.0    276
Name: count, dtype: int64

This variable can be left as it is; we cannot assume a value for the missing (NaN) records either.
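If integer 0/1 flags are preferred while still keeping the missing entries, pandas' nullable `Int64` dtype is one option. The sketch below uses hypothetical toy values.

```python
import pandas as pd

# Toy column (hypothetical values) with a missing marker
raw = pd.Series(["0", "1.0", "?", "1", "0.0"], name="exercise induced angina")

# '?' becomes NaN, then the nullable Int64 dtype stores it as <NA>
angina = pd.to_numeric(raw, errors="coerce").astype("Int64")
print(angina.value_counts(dropna=False))
```

Unlike `float64`, this dtype keeps the observed values as integers without having to fill or drop the missing ones.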


In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 781 entries, 0 to 796
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   age                         781 non-null    float64 
 1   sex                         781 non-null    category
 2   Chest pain type             781 non-null    category
 3   resting blood pressure      781 non-null    float64 
 4   serum cholestoral           781 non-null    float64 
 5   fasting blood sugar         781 non-null    int64   
 6   Resting EEG                 781 non-null    category
 7   maximum heart rate          781 non-null    float64 
 8   exercise induced angina     727 non-null    float64 
 9   ST depression               781 non-null    object  
 10  slope ST                    781 non-null    object  
 11  number of major vessels     781 non-null    object  
 12  thal                        781 non-null    object  
 13  diagnosis of heart diseas

### Attribute *ST depression* (oldpeak)

In [75]:
df['ST depression'].value_counts()

ST depression
0.0     282
?        56
1.0      52
1.5      41
2.0      40
0        39
2        22
1        17
1.2      17
0.8      15
3.0      14
0.6      14
0.5      14
3        13
1.4      13
2.5      13
1.6      12
0.2      12
1.8      10
0.4       9
0.1       7
2.8       6
2.6       6
1.9       5
4.0       4
2.2       4
3.6       4
4         4
1.3       3
3.4       3
0.3       3
2.4       3
0.9       3
3.5       2
4.2       2
1.1       2
3.2       2
2.3       2
5.0       1
-0.5      1
1.7       1
3.1       1
2.9       1
5.6       1
6.2       1
2.1       1
3.8       1
0.7       1
4.4       1
Name: count, dtype: int64

In [76]:
df[df['ST depression'] == '-0.5']

Unnamed: 0,age,sex,Chest pain type,resting blood pressure,serum cholestoral,fasting blood sugar,Resting EEG,maximum heart rate,exercise induced angina,ST depression,slope ST,number of major vessels,thal,diagnosis of heart disease
299,66.0,hombre,Dolor no anginosa,120.0,239.5,0,anomalia,120.0,0.0,-0.5,1,?,?,0


This variable takes many floating-point values, has 56 missing records, and has one record with a value of -0.5.

The documentation describes the variable as follows:

``` 
oldpeak = ST depression induced by exercise relative to rest
```
On further investigation, the value -0.5 does not appear to be necessarily an error, so it will be kept for the analysis.

The remaining missing values will be replaced with the median.

**Recommendation**

Although the value -0.5, at index 299, will be kept for the analysis, we recommend removing it if the data are used to train a machine learning model, since a single record of this kind is unlikely to be informative.
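For the modeling case, the negative record could be excluded with a simple filter that leaves the analysis DataFrame untouched. This is a sketch on a hypothetical toy frame; the name `model_df` is an assumption, not part of the notebook.

```python
import pandas as pd

# Toy frame (hypothetical values); index 299 carries the negative reading
toy = pd.DataFrame(
    {"ST depression": [0.0, 1.5, -0.5, 2.0]},
    index=[10, 11, 299, 12],
)

# Keep only non-negative values in a separate copy used for training
model_df = toy[toy["ST depression"] >= 0].copy()
print(model_df.index.tolist())
```

Filtering by value rather than by a hard-coded index also covers any other negative readings that might appear later.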

In [77]:
df['ST depression'] = df['ST depression'].replace('?', np.nan).astype('float')
df['ST depression'].describe()

count    725.000000
mean       0.915172
std        1.097456
min       -0.500000
25%        0.000000
50%        0.500000
75%        1.500000
max        6.200000
Name: ST depression, dtype: float64

In [78]:
df['ST depression'] = df['ST depression'].replace(np.nan, df['ST depression'].median()).astype('float')
df['ST depression'].describe()

count    781.000000
mean       0.885403
std        1.062745
min       -0.500000
25%        0.000000
50%        0.500000
75%        1.500000
max        6.200000
Name: ST depression, dtype: float64

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 781 entries, 0 to 796
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   age                         781 non-null    float64 
 1   sex                         781 non-null    category
 2   Chest pain type             781 non-null    category
 3   resting blood pressure      781 non-null    float64 
 4   serum cholestoral           781 non-null    float64 
 5   fasting blood sugar         781 non-null    int64   
 6   Resting EEG                 781 non-null    category
 7   maximum heart rate          781 non-null    float64 
 8   exercise induced angina     727 non-null    float64 
 9   ST depression               781 non-null    float64 
 10  slope ST                    781 non-null    object  
 11  number of major vessels     781 non-null    object  
 12  thal                        781 non-null    object  
 13  diagnosis of heart diseas

### Attribute *slope ST*