Oct 11th 2025

## Objective:
Load the UCI ML heart disease dataset, perform EDA, describe the features, and build an understanding of the dataset, its correlations, and its limitations. Then create a processed dataset to be used by the scikit-learn pipeline.

## Setup

In [3]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

import kagglehub

## Loading data

In [4]:
os.environ['KAGGLEHUB_CACHE'] = r'C:\\Users\\patil\\Documents\\GitHub\\ml_projects\\heart_disease_prediction\\data\\raw'

# Download latest version
path = kagglehub.dataset_download("redwankarimsony/heart-disease-data")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/redwankarimsony/heart-disease-data?dataset_version_number=6...


100%|█████████████████████████████████████████████████████████████████████████████| 12.4k/12.4k [00:00<00:00, 3.16MB/s]

Extracting model files...
Path to dataset files: C:\\Users\\patil\\Documents\\GitHub\\ml_projects\\heart_disease_prediction\\data\\raw\datasets\redwankarimsony\heart-disease-data\versions\6





In [5]:
# Reuse the directory returned by kagglehub instead of hard-coding it
filename = 'heart_disease_uci.csv'
full_file_path = os.path.join(path, filename)
df = pd.read_csv(full_file_path)

In [6]:
df

| | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 915 | 916 | 54 | Female | VA Long Beach | asymptomatic | 127.0 | 333.0 | True | st-t abnormality | 154.0 | False | 0.0 | NaN | NaN | NaN | 1 |
| 916 | 917 | 62 | Male | VA Long Beach | typical angina | NaN | 139.0 | False | st-t abnormality | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 917 | 918 | 55 | Male | VA Long Beach | asymptomatic | 122.0 | 223.0 | True | st-t abnormality | 100.0 | False | 0.0 | NaN | NaN | fixed defect | 2 |
| 918 | 919 | 58 | Male | VA Long Beach | asymptomatic | NaN | 385.0 | True | lv hypertrophy | NaN | NaN | NaN | NaN | NaN | NaN | 0 |


## Exploratory Data Analysis

In [18]:
df.dtypes

id            int64
age           int64
sex          object
dataset      object
cp           object
trestbps    float64
chol        float64
fbs          object
restecg      object
thalch      float64
exang        object
oldpeak     float64
slope        object
ca          float64
thal         object
num           int64
dtype: object

In [24]:
numeric_cols = []
categorical_cols = []
for col in df.columns:
    # is_numeric_dtype covers both the int64 and float64 columns;
    # comparing a dtype against np.number triggers a NumPy DeprecationWarning
    if pd.api.types.is_numeric_dtype(df[col]):
        numeric_cols.append(col)
    else:
        categorical_cols.append(col)
print('numeric:', numeric_cols)
print('categorical:', categorical_cols)

numeric: ['id', 'age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca', 'num']
categorical: ['sex', 'dataset', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
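
The same split can be obtained without an explicit loop. The sketch below uses `DataFrame.select_dtypes` on a small stand-in frame; the `toy` frame, its values, and the `*_alt` names are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same mix of dtypes as the real data (illustrative values).
toy = pd.DataFrame({
    "age": [63, 67, 37],                  # int64
    "chol": [233.0, np.nan, 250.0],       # float64
    "sex": ["Male", "Male", "Female"],    # text
    "fbs": [True, False, np.nan],         # object, because of the missing value
})

# "number" matches integer and float columns; everything else is treated as categorical.
numeric_cols_alt = toy.select_dtypes(include="number").columns.tolist()
categorical_cols_alt = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols_alt)
print("categorical:", categorical_cols_alt)
```

Note that a boolean column with no missing values would have dtype `bool` and would also land in the categorical list with this approach.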



In [22]:
for col in categorical_cols:
    print(col, df[col].unique())

sex ['Male' 'Female']
dataset ['Cleveland' 'Hungary' 'Switzerland' 'VA Long Beach']
cp ['typical angina' 'asymptomatic' 'non-anginal' 'atypical angina']
fbs [True False nan]
restecg ['lv hypertrophy' 'normal' 'st-t abnormality' nan]
exang [False True nan]
slope ['downsloping' 'flat' 'upsloping' nan]
thal ['fixed defect' 'normal' 'reversable defect' nan]
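
The `nan` entries above show that several categorical columns contain missing values. A quick way to quantify this is to count and rank the missing values per column. The sketch below does so on a made-up stand-in frame (`toy` and its values are illustrative, not rows from the dataset); the same calls should apply to `df` directly.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in data with a few missing entries.
toy = pd.DataFrame({
    "slope": ["flat", np.nan, np.nan, "upsloping"],
    "thal": ["normal", np.nan, "fixed defect", "normal"],
    "cp": ["asymptomatic", "non-anginal", "typical angina", "asymptomatic"],
})

# Count and percentage of missing values per column, worst first.
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(summary)
```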


### Meaning of some of the technical terms
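Dataset: the study site where the sample was collected (Cleveland, Hungary, Switzerland, or VA Long Beach)

CP: chest pain type - typical angina (chest pain caused by reduced blood flow to the heart), atypical angina (chest pain that only partly matches the angina pattern), non-anginal (chest pain not related to the heart), asymptomatic (no chest pain)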

Trestbps: resting blood pressure in mm Hg (on admission to the hospital)

Chol: serum cholesterol in mg/dl

FBS: fasting blood sugar > 120 mg/dl (True or False)

Rest ECG: lv hypertrophy (showing probable or definite left ventricular hypertrophy), normal, st-t abnormality (having ST-T wave abnormality)

Thalch: maximum heart rate achieved

Exang: exercise-induced angina (did exercise bring on chest pain?)

Oldpeak: ST depression induced by exercise relative to rest

Slope: Slope of the peak exercise ST segment - downsloping (bad sign), flat (medium sign), upsloping (good sign)

CA: number of major vessels (0-3) colored by fluoroscopy

Thal: thallium stress test - fixed defect (defect present even during rest), normal (no defect), reversable defect (defect only appears during exercise)

Num: diagnosis of heart disease - 0 (no HD), 1 (mild HD), 2 (moderate HD), 3 (severe HD), 4 (very severe HD)
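
Since `num` encodes severity on a 0-4 scale, a later modelling step may want a simpler binary target. The sketch below shows one common convention, which is an assumption here and not something this notebook has committed to: any `num` above 0 counts as heart disease. The `toy` frame and the `target` column name are hypothetical.

```python
import pandas as pd

# Illustrative severity labels, not rows from the dataset.
toy = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Collapse severity 1-4 into a single "disease present" class.
toy["target"] = (toy["num"] > 0).astype(int)

print(toy["target"].value_counts().sort_index())
```

Collapsing the classes loses the severity information, so it is worth checking how many samples fall into each `num` level before deciding.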

### First impressions from data features:
Patients in this dataset can be grouped into three types:
1. no HD
2. some HD that can be reversed
3. severe HD

- Features that may point to no HD: non-anginal chest pain, fbs False, normal rest ECG, exang False, upsloping slope, ca 0, thal normal
- Features that may point to some HD: asymptomatic chest pain, flat slope, reversable defect
- Features that may point to severe HD: typical angina, fbs True, abnormal rest ECG, exang True, high oldpeak, downsloping slope, ca 2 or 3, fixed defect
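
These impressions are hypotheses, and each one can be checked by cross-tabulating a feature against `num`. The sketch below does this for `slope` on a made-up stand-in frame (`toy` and its values are illustrative only), using `pd.crosstab` with row normalisation so that each row shows how `num` is distributed for one feature value.

```python
import pandas as pd

# Illustrative data, not rows from the dataset.
toy = pd.DataFrame({
    "slope": ["upsloping", "upsloping", "flat", "downsloping", "upsloping", "flat"],
    "num":   [0, 0, 2, 3, 1, 0],
})

# normalize="index" turns counts into per-row proportions (each row sums to 1).
table = pd.crosstab(toy["slope"], toy["num"], normalize="index")

print(table.round(2))
```

If a hypothesis holds, the rows for the "good sign" values should put most of their mass on `num` 0, and the rows for the "bad sign" values should shift toward higher `num`.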

### Looking at distributions of the features

In [None]:
def plot_feautres(df, filename = 'feature_dist.png'):
    n_plots = len(numeric_cols) + len(categorical_cols)
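
The function body above is unfinished. One way such a function might be filled in is sketched below as a separate, hypothetical helper named `plot_feature_grid`; it is not the author's implementation. It draws one panel per feature: a histogram for numeric columns and a bar chart of value counts for categorical ones. The stand-in data, the three-column layout, and the bin count are all assumptions.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_feature_grid(df, numeric_cols, categorical_cols, filename=None, n_cols=3):
    """Plot one panel per feature and optionally save the figure to `filename`."""
    cols = list(numeric_cols) + list(categorical_cols)
    n_rows = math.ceil(len(cols) / n_cols)
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows), squeeze=False
    )
    flat_axes = axes.ravel()

    for ax, col in zip(flat_axes, cols):
        if col in numeric_cols:
            # Histogram of the non-missing values.
            ax.hist(df[col].dropna(), bins=20)
        else:
            # Bar chart of category counts, with missing values as their own bar.
            counts = df[col].value_counts(dropna=False)
            labels = [str(v) for v in counts.index]
            ax.bar(labels, counts.values)
            ax.tick_params(axis="x", labelrotation=45)
        ax.set_title(col)

    # Hide any unused panels in the last row.
    for ax in flat_axes[len(cols):]:
        ax.set_visible(False)

    fig.tight_layout()
    if filename is not None:
        fig.savefig(filename)
    return fig


# Illustrative stand-in data, not rows from the dataset.
toy = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "chol": [233.0, 286.0, np.nan, 204.0],
    "cp": ["typical angina", "asymptomatic", "non-anginal", np.nan],
})
fig = plot_feature_grid(toy, ["age", "chol"], ["cp"])
```

Passing the column lists as arguments, rather than reading the notebook's global `numeric_cols` and `categorical_cols`, keeps the helper reusable on the processed dataset later.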

### Plotting output by each feature