# Cleaning the unified dataset

In [42]:
import pandas as pd
import numpy as np

In [43]:
df = pd.read_csv('./DataFiles/enfermedad_cardiaca.csv', index_col=0)
df

Unnamed: 0,age,sex,Chest pain type,resting blood pressure,serum cholestoral,fasting blood sugar,Resting EEG,maximum heart rate,exercise induced angina,ST depression,slope ST,number of major vessels,thal,diagnosis of heart disease
0,28.0,1.0,2.0,130,132,0,2,185,0,0.0,?,?,?,0
1,29.0,1.0,2.0,120,243,0,0,160,0,0.0,?,?,?,0
2,29.0,1.0,2.0,140,?,0,0,170,0,0.0,?,?,?,0
3,30.0,0.0,1.0,170,237,0,1,170,0,0.0,?,?,6,0
4,31.0,0.0,2.0,100,219,0,1,150,0,0.0,?,?,?,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
792,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
793,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
794,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
795,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1


We will go through the dataset column by column to unify and clean the data.

The *age* column was already encoded correctly in the previous phase.

### Attribute *sex*

In [44]:
df['sex'].value_counts()

sex
1.0    613
0.0    184
Name: count, dtype: int64

The only change needed here is converting the values to int.

In [45]:
df['sex'] = df['sex'].astype('int')
df['sex'].value_counts()

sex
1    613
0    184
Name: count, dtype: int64

In [46]:
# Map 1 -> 'hombre' (male) and 0 -> 'mujer' (female) in the 'sex' column
df['sex'] = df['sex'].map({1: 'hombre', 0: 'mujer'}).astype('category')
df['sex'].value_counts()

sex
hombre    613
mujer     184
Name: count, dtype: int64

### Attribute *Chest pain type*

In [47]:
df['Chest pain type'].value_counts()

Chest pain type
4.0    398
3.0    187
2.0    170
1.0     42
Name: count, dtype: int64

In [48]:
remplazar_cp = {
    1: 'Angina típica',
    2: 'Angina atípica',
    3: 'Dolor no anginosa',
    4: 'Asintomático'
}

df['Chest pain type'] = df['Chest pain type'].map(remplazar_cp).astype('category')
df['Chest pain type'].value_counts()

Chest pain type
Asintomático         398
Dolor no anginosa    187
Angina atípica       170
Angina típica         42
Name: count, dtype: int64
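
The mapping above works even though the column holds floats (`4.0`) and the dictionary keys are integers (`4`), because the two compare equal. One thing to watch with `Series.map` is that any code missing from the dictionary silently becomes `NaN`. A minimal sketch of that check, using toy values rather than the real column (the names `codes`, `labels` and `unmapped` are illustrative assumptions):

```python
import pandas as pd

# Toy data standing in for the real column (values are illustrative only).
codes = pd.Series([4.0, 3.0, 2.0, 1.0, 4.0], name='Chest pain type')

labels = {
    1: 'Angina típica',
    2: 'Angina atípica',
    3: 'Dolor no anginosa',
    4: 'Asintomático',
}

# Float values such as 4.0 match the integer key 4 because they compare equal.
mapped = codes.map(labels)

# Any code missing from the dictionary would have become NaN, so count them.
unmapped = int(mapped.isna().sum())
print(f'Unmapped codes: {unmapped}')

# Convert to category only after confirming nothing was lost.
mapped = mapped.astype('category')
```

In the real column the counts before and after the mapping both add up to 797, so no code was dropped there.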

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 797 entries, 0 to 796
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   age                         797 non-null    float64 
 1   sex                         797 non-null    category
 2   Chest pain type             797 non-null    category
 3   resting blood pressure      797 non-null    object  
 4   serum cholestoral           797 non-null    object  
 5   fasting blood sugar         797 non-null    object  
 6   Resting EEG                 797 non-null    object  
 7   maximum heart rate          797 non-null    object  
 8   exercise induced angina     797 non-null    object  
 9   ST depression               797 non-null    object  
 10  slope ST                    797 non-null    object  
 11  number of major vessels     797 non-null    object  
 12  thal                        797 non-null    object  
 13  diagnosis of heart diseas
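
The `df.info()` output above reports most columns as `object` with no nulls, which is consistent with missing values being stored as the string `'?'` rather than as `NaN`. A quick way to see how many placeholders each column holds is sketched below on a toy frame (the frame `toy` and its values are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'age': [28.0, 29.0, 30.0],
    'resting blood pressure': ['130', '?', '170'],
    'thal': ['?', '?', '6'],
})

# Comparing against '?' yields False for numeric cells, so the sum counts
# only the cells that hold the placeholder string.
placeholder_counts = (toy == '?').sum()

# Show only the columns that actually contain placeholders.
print(placeholder_counts[placeholder_counts > 0])
```

Running the same comparison on `df` would give a per-column count to compare against the `value_counts()` outputs in the sections that follow.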

### Attribute *resting blood pressure*

This attribute contains records marked with *"?"*. Since they are a minority (57 records), they will be replaced with the median of the remaining values.

In [50]:
df['resting blood pressure'].value_counts()

resting blood pressure
120      81
130      69
140      60
?        57
120.0    37
         ..
106.0     1
156.0     1
154.0     1
114.0     1
164.0     1
Name: count, Length: 99, dtype: int64

In [51]:
# Replace '?' with NaN
df['resting blood pressure'] = df['resting blood pressure'].replace('?', np.nan).astype('float')

# There is also one value of 0; convert it to NaN as well
df['resting blood pressure'] = df['resting blood pressure'].replace(0.0, np.nan).astype('float')



In [52]:
df['resting blood pressure'].info()

<class 'pandas.core.series.Series'>
Index: 797 entries, 0 to 796
Series name: resting blood pressure
Non-Null Count  Dtype  
--------------  -----  
739 non-null    float64
dtypes: float64(1)
memory usage: 12.5 KB


In [53]:
print(f"The median is: {df['resting blood pressure'].median()}")
df['resting blood pressure'] = df['resting blood pressure'].fillna(df['resting blood pressure'].median())
df['resting blood pressure'].info()


The median is: 130.0
<class 'pandas.core.series.Series'>
Index: 797 entries, 0 to 796
Series name: resting blood pressure
Non-Null Count  Dtype  
--------------  -----  
797 non-null    float64
dtypes: float64(1)
memory usage: 12.5 KB
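
The three steps above (placeholder to `NaN`, implausible zeros to `NaN`, fill with the median) recur in later columns, so they could be folded into one function. The sketch below is one possible way to do that. `impute_with_median` is a hypothetical helper, not part of the original notebook. It also uses `pd.to_numeric(errors='coerce')`, which turns every non-numeric entry into `NaN`, not only `'?'`.

```python
import pandas as pd


def impute_with_median(series, invalid_values=(0,)):
    """Hypothetical helper: coerce to float, blank out invalid values, fill with the median."""
    # Non-numeric entries such as '?' become NaN.
    numeric = pd.to_numeric(series, errors='coerce')
    # Values listed as invalid (for example 0) are also treated as missing.
    numeric = numeric.mask(numeric.isin(invalid_values))
    # median() skips NaN by default, so it is computed from valid values only.
    return numeric.fillna(numeric.median())


# Toy values standing in for the real column.
toy = pd.Series(['120', '?', '140.0', '0', '130'])
cleaned = impute_with_median(toy)
print(cleaned.tolist())
```

With the toy values, the median of the valid entries (120, 140, 130) is 130, and both the `'?'` and the `0` are filled with it.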


### Attribute *serum cholestoral*

In [54]:
df['serum cholestoral'].value_counts()

serum cholestoral
0        49
?        30
216       7
220       7
204.0     6
         ..
187.0     1
157.0     1
176.0     1
241.0     1
131.0     1
Name: count, Length: 335, dtype: int64

This column appears to contain both zeros and missing values (?), for a total of 79 missing records. They will be replaced with the median of the remaining records.

In [55]:
df['serum cholestoral'] = df['serum cholestoral'].replace('?', np.nan).astype('float')
df['serum cholestoral'] = df['serum cholestoral'].replace(0, np.nan).astype('float')

In [59]:
df['serum cholestoral'] = df['serum cholestoral'].fillna(df['serum cholestoral'].median())

In [61]:
df['serum cholestoral'].info()

<class 'pandas.core.series.Series'>
Index: 797 entries, 0 to 796
Series name: serum cholestoral
Non-Null Count  Dtype  
--------------  -----  
797 non-null    float64
dtypes: float64(1)
memory usage: 12.5 KB


### Attribute *fasting blood sugar*

In [63]:
df['fasting blood sugar'].value_counts()

fasting blood sugar
0      391
0.0    258
1       88
1.0     45
?       15
Name: count, dtype: int64
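
The counts above list `0` and `0.0` (and `1` and `1.0`) separately, which suggests the column holds strings in two different notations alongside the `'?'` placeholder. A minimal sketch of how a numeric conversion collapses them, on toy values and assuming the entries are indeed strings (how the `?` records should then be handled is left open here):

```python
import pandas as pd

# Toy values mimicking the mixed notations seen in the counts above.
toy = pd.Series(['0', '0.0', '1', '1.0', '?', '0'], name='fasting blood sugar')

# As strings, '0' and '0.0' are distinct categories.
print(toy.value_counts())

# After numeric conversion they merge, and '?' becomes NaN.
numeric = pd.to_numeric(toy, errors='coerce')
counts = numeric.value_counts(dropna=False)
print(counts)
```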

In [64]:
df['thal'].value_counts()

thal
?      434
3.0    166
7.0    117
7       33
6       18
6.0     18
3       11
Name: count, dtype: int64