In [176]:
import pandas as pd
df = pd.read_csv("heart.csv")

# Description of the dataset  
## background info
This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer-valued from 0 (no presence) to 4.


## columns  
1- age: age in years  
  
2- sex: 1 = male, 0 = female  
  
3- chest pain type (4 values):  
Chest pain type, coded from 0 to 3  
  
4- resting blood pressure:  
Our prior assumption is that blood pressure and the prevalence of heart disease are correlated.  
  
5- serum cholesterol in mg/dl:  
Blood cholesterol is commonly blamed for heart disease, so we again expect a large correlation.  
  
6- fasting blood sugar > 120 mg/dl:  
May indicate diabetes, which has high comorbidity rates.  
  
7- resting electrocardiographic results (values 0,1,2):   
0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria  
  
8- maximum heart rate achieved:  
Maximum heart rate can tell us how hard the heart is straining to supply the body with sufficient oxygen. The higher the heart rate, the greater the strain.  
  
9- exercise induced angina:  
Chest pain caused by reduced blood flow to the heart. It can worsen during exercise because the heart then requires more blood flow.  
  
10- oldpeak = ST depression induced by exercise relative to rest: 
  
  
11- the slope of the peak exercise ST segment: 
  
12- number of major vessels (0-3) colored by fluoroscopy: 
  
13- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect:  
Thalassemia is a hereditary blood disorder that may increase the risk of heart disease.  

  
# Acknowledgements  
## Creators:  

1- Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.  
2- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.  
3- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.  
4- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.  
## Donor:
1- David W. Aha (aha '@' ics.uci.edu) (714) 856-8779  
## Inspiration  
Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
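
The presence-versus-absence framing described above can be sketched as a simple binarization. This is a minimal illustration on made-up values; the `goal` series below is hypothetical and is not read from `heart.csv`.

```python
import pandas as pd

# Hypothetical goal values on the original 0-4 scale (not taken from heart.csv)
goal = pd.Series([0, 2, 0, 4, 1, 3], name="goal")

# Presence (values 1, 2, 3, 4) becomes 1; absence (value 0) stays 0
has_disease = (goal > 0).astype(int)
print(has_disease.tolist())
```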

In [177]:
df.head(10)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


In [178]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

Here we check for null values. There are no null (NaN) values in any column of this dataset.
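
For comparison, this is roughly what the same check looks like when a column does contain a null. The `toy` frame is an invented example, not part of the heart data.

```python
import numpy as np
import pandas as pd

# Invented frame with one missing age, for illustration only
toy = pd.DataFrame({"age": [63, 37, np.nan], "chol": [233, 250, 204]})

# Per-column count of missing entries, as in the cell above
null_counts = toy.isnull().sum()
print(null_counts)

# A single yes/no answer for the whole frame
any_nulls = bool(toy.isnull().values.any())
print(any_nulls)
```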

In [179]:
for col in df.columns:
    print(f"set of {col}: ",set(df[col]))

set of age:  {29, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 74, 76, 77}
set of sex:  {0, 1}
set of cp:  {0, 1, 2, 3}
set of trestbps:  {128, 129, 130, 132, 134, 135, 136, 138, 140, 142, 144, 145, 146, 148, 150, 152, 154, 155, 156, 160, 164, 165, 170, 172, 174, 178, 180, 192, 200, 94, 100, 101, 102, 104, 105, 106, 108, 110, 112, 114, 115, 117, 118, 120, 122, 123, 124, 125, 126}
set of chol:  {564, 126, 131, 141, 149, 157, 160, 164, 166, 167, 168, 169, 172, 174, 175, 176, 177, 178, 180, 182, 183, 184, 185, 186, 187, 188, 192, 193, 195, 196, 197, 198, 199, 200, 201, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,

We then list the distinct values in each column to catch missing values encoded in other ways, such as special characters, and find no such issues.
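
Reading long sets by eye is error-prone, so one possible refinement is to compare each column against the codes it is documented to take. The sketch below assumes a hypothetical `expected` dictionary of allowed codes and runs on an invented frame.

```python
import pandas as pd

# Invented frame; the 0 in "thal" stands in for an undocumented code
toy = pd.DataFrame({"thal": [1, 2, 3, 0, 2], "sex": [1, 0, 1, 1, 0]})

# Hypothetical allowed codes per column, written down by hand
expected = {"thal": {1, 2, 3}, "sex": {0, 1}}

# For each column, collect the values that fall outside the allowed set
unexpected = {
    col: sorted(set(toy[col].unique().tolist()) - allowed)
    for col, allowed in expected.items()
}
print(unexpected)
```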

In [180]:
# Drop the rows where thal == 0 (a placeholder for a missing value)
df.drop(index=df.index[df["thal"] == 0], inplace=True)

Fixing an error: thal values of 0 are supposed to be NaN, so we drop the rows that contain them.
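
An equivalent route, shown here only as a sketch on an invented frame, is to make the missing values explicit first and then drop them. Note that the surviving rows keep their original index labels, which is why the index can run past the row count after a drop.

```python
import numpy as np
import pandas as pd

# Invented frame in which 0 is a placeholder for a missing thal value
toy = pd.DataFrame({"thal": [1, 0, 2, 3, 0], "target": [1, 1, 0, 0, 1]})

# Turn the placeholder into a real NaN, then drop the affected rows
toy["thal"] = toy["thal"].replace(0, np.nan)
dropped = toy.dropna(subset=["thal"])

print(len(toy), len(dropped))
print(dropped.index.tolist())  # original labels are preserved, gaps included
```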

In [181]:
# check data types
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 301 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       301 non-null    int64  
 1   sex       301 non-null    int64  
 2   cp        301 non-null    int64  
 3   trestbps  301 non-null    int64  
 4   chol      301 non-null    int64  
 5   fbs       301 non-null    int64  
 6   restecg   301 non-null    int64  
 7   thalach   301 non-null    int64  
 8   exang     301 non-null    int64  
 9   oldpeak   301 non-null    float64
 10  slope     301 non-null    int64  
 11  ca        301 non-null    int64  
 12  thal      301 non-null    int64  
 13  target    301 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 35.3 KB


We then check the data types of the columns. Here we see the first possible issue: categorical data is stored as integers.  
The column names are also cryptic, so let's rename them for ease of interpretation.

In [182]:
df.columns = ["age", "sex", "chest_pain", "resting_bp", "chol", "fstng_bld_sgr", "elect_result", "max_hrt_rt", "exercise_induced_angina", "st_dep", "st_slope", "major_vessels", "thalassemia", "target"]
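
The positional assignment above depends on the columns arriving in a fixed order. A dictionary-based rename is one alternative that does not; the sketch below uses an invented three-column frame with a few of the same old and new names.

```python
import pandas as pd

# Invented frame using three of the original column names
toy = pd.DataFrame({"cp": [3, 2], "trestbps": [145, 130], "thalach": [150, 187]})

# Each old name is paired explicitly with its new name
renamed = toy.rename(
    columns={"cp": "chest_pain", "trestbps": "resting_bp", "thalach": "max_hrt_rt"}
)
print(renamed.columns.tolist())
```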

In [183]:
# Assign each relabeled column back to df; chained indexing such as
# df["sex"].loc[...] = ... may modify a temporary copy instead of df.
df["sex"] = df["sex"].replace({1: "male", 0: "female"})
df["fstng_bld_sgr"] = df["fstng_bld_sgr"].replace({1: "high", 0: "not_high"})
df["chest_pain"] = df["chest_pain"].replace({0: "asymptomatic", 1: "atypical_angina", 2: "non-anginal_pain", 3: "typical_angina"})
df["elect_result"] = df["elect_result"].replace({0: "normal", 1: "having_ST-T_wave_abnormality", 2: "left_ventricular_hypertrophy"})
df["exercise_induced_angina"] = df["exercise_induced_angina"].replace({1: "present", 0: "not_present"})
df["st_slope"] = df["st_slope"].replace({0: "upsloping", 1: "flat", 2: "downsloping"})
df["thalassemia"] = df["thalassemia"].replace({1: "Normal", 2: "fixed_defect", 3: "reversable_defect"})

Giving the categorical values more meaningful names.
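
When relabeling codes, it can help to confirm that every code had a label. One way, sketched here on an invented frame, is `Series.map`, which yields NaN for any code missing from the dictionary. The code 9 below is deliberately undocumented.

```python
import pandas as pd

# Invented frame; 9 is a deliberately undocumented chest pain code
toy = pd.DataFrame({"sex": [1, 0, 1], "chest_pain": [0, 3, 9]})

sex_labels = {1: "male", 0: "female"}
cp_labels = {
    0: "asymptomatic",
    1: "atypical_angina",
    2: "non-anginal_pain",
    3: "typical_angina",
}

# map() returns NaN wherever a code has no entry in the dictionary
toy["sex"] = toy["sex"].map(sex_labels)
toy["chest_pain"] = toy["chest_pain"].map(cp_labels)

unmapped = int(toy["chest_pain"].isna().sum())
print(unmapped)
```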

In [185]:
df['sex'] = df['sex'].astype('object')
df['chest_pain'] = df['chest_pain'].astype('object')
df['fstng_bld_sgr'] = df['fstng_bld_sgr'].astype('object')
df['elect_result'] = df['elect_result'].astype('object')
df['exercise_induced_angina'] = df['exercise_induced_angina'].astype('object')
df['st_slope'] = df['st_slope'].astype('object')
df['thalassemia'] = df['thalassemia'].astype('object')

Ensuring that the categorical columns are treated as categorical rather than numeric.
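
pandas also offers a dedicated `category` dtype, which records the set of allowed labels alongside the values. The following is a sketch of that alternative on an invented frame, not what this notebook does.

```python
import pandas as pd

# Invented frame with one label column and one numeric column
toy = pd.DataFrame({"sex": ["male", "female", "male"], "age": [63, 37, 41]})

# Convert the label column to the dedicated categorical dtype
toy["sex"] = toy["sex"].astype("category")

print(toy.dtypes)
print(toy["sex"].cat.categories.tolist())
```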

In [186]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 301 entries, 0 to 302
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      301 non-null    int64  
 1   sex                      301 non-null    object 
 2   chest_pain               301 non-null    object 
 3   resting_bp               301 non-null    int64  
 4   chol                     301 non-null    int64  
 5   fstng_bld_sgr            301 non-null    object 
 6   elect_result             301 non-null    object 
 7   max_hrt_rt               301 non-null    int64  
 8   exercise_induced_angina  301 non-null    object 
 9   st_dep                   301 non-null    float64
 10  st_slope                 301 non-null    object 
 11  major_vessels            301 non-null    int64  
 12  thalassemia              301 non-null    object 
 13  target                   301 non-null    int64  
dtypes: float64(1), int64(6), o

In [190]:
cleaned = pd.get_dummies(df,drop_first= True)

Creating dummy (one-hot) variables for the categorical columns so they can be used in machine learning models.
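
To make the effect of `drop_first=True` concrete, here is a small sketch on an invented frame. Each categorical column with k labels becomes k - 1 indicator columns, and the first label in sorted order is dropped to serve as the baseline.

```python
import pandas as pd

# Invented frame with one numeric and one categorical column
toy = pd.DataFrame(
    {"age": [63, 37, 41], "st_slope": ["upsloping", "flat", "downsloping"]}
)

# "downsloping" sorts first, so it is dropped and becomes the baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A row with zeros in both `st_slope_flat` and `st_slope_upsloping` therefore represents the dropped label.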

In [191]:
cleaned.to_csv("clean_data.csv")
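
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. If that is unwanted, `index=False` skips it. The sketch below uses an invented frame and in-memory buffers instead of real files.

```python
import io

import pandas as pd

# Invented frame with a gap in the index, as left behind by dropping rows
toy = pd.DataFrame({"age": [63, 37], "target": [1, 1]}, index=[0, 2])

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without = pd.read_csv(without_index)

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```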