## Day 4: Data Understanding

The objective of this notebook is to understand the structure, meaning, and quality of the UCI Heart Disease dataset.

This step focuses on:
- Understanding what each feature represents medically
- Identifying data types and target variable meaning
- Detecting missing or invalid values
- Making high-level observations about the data

No preprocessing, feature engineering, or modeling is performed at this stage.

This step is essential before building a Patient Digital Twin, as each row represents a clinical snapshot of a patient.


In [2]:
import pandas as pd
import numpy as np


In [3]:
df = pd.read_csv("../data/heart_disease.csv")
df.head()


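|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |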


In [4]:
df.shape


(1025, 14)

### Dataset Shape

- The dataset has 1,025 rows, each representing one patient record, and 14 columns
- Each column represents a clinical or demographic attribute
- The dataset is tabular and structured
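The file has no patient identifier column, so the assumption that every row is a distinct patient can only be checked indirectly. A minimal sketch of one such check, counting rows that exactly repeat an earlier row; `count_duplicate_rows` is a hypothetical helper name and the small frame is illustrative, not the real data:

```python
import pandas as pd


def count_duplicate_rows(frame: pd.DataFrame) -> int:
    """Return the number of rows that exactly repeat an earlier row."""
    return int(frame.duplicated().sum())


# Illustrative frame (not the real data): the third row repeats the first.
sample = pd.DataFrame(
    {"age": [52, 53, 52], "chol": [212, 203, 212], "target": [0, 0, 0]}
)
n_dupes = count_duplicate_rows(sample)
print(n_dupes)
```

On the notebook's data this would be called as `count_duplicate_rows(df)`. A non-zero result would not prove that patients are repeated, but it would be worth recording as an observation for the preprocessing phase.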


In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


### Dataset Structure

- The dataset contains a mix of numerical and categorical features
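- All 14 columns have numeric dtypes (13 `int64`, 1 `float64`), so categorical features such as `cp`, `restecg`, `slope`, and `thal` are encoded as integers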
- No explicit datetime column is present
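Because integer-coded categories share the `int64` dtype with true measurements, the dtype alone cannot tell them apart. One rough heuristic is to count distinct values per column and flag columns with very few. The sketch below assumes an arbitrary threshold; `low_cardinality_columns` and `max_unique` are hypothetical names, and the sample frame reuses a few columns from the five rows shown by `df.head()`:

```python
import pandas as pd


def low_cardinality_columns(frame: pd.DataFrame, max_unique: int = 3) -> pd.Series:
    """Return distinct-value counts for columns with at most max_unique values."""
    counts = frame.nunique()
    return counts[counts <= max_unique].sort_values()


# A few columns from the five rows displayed by df.head().
sample = pd.DataFrame(
    {
        "age": [52, 53, 70, 61, 62],
        "sex": [1, 1, 1, 1, 0],
        "cp": [0, 0, 0, 0, 0],
        "chol": [212, 203, 174, 203, 294],
        "thal": [3, 3, 3, 3, 2],
    }
)
candidates = low_cardinality_columns(sample, max_unique=3)
print(candidates)
```

On the full dataset a somewhat larger threshold would likely be needed, and the result is only a list of candidates: a column with few distinct values is not necessarily categorical.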


### Feature Description (Medical Context)

- **age**: Age of the patient in years (numerical, static)
- **sex**: Sex of the patient (1 = male, 0 = female) (categorical, static)
- **cp**: Chest pain type (0–3) (categorical, time-varying)
- **trestbps**: Resting blood pressure in mm Hg (numerical, time-varying)
- **chol**: Serum cholesterol in mg/dl (numerical, time-varying)
- **fbs**: Fasting blood sugar > 120 mg/dl (binary, time-varying)
- **restecg**: Resting electrocardiographic results (categorical, time-varying)
- **thalach**: Maximum heart rate achieved (numerical, time-varying)
- **exang**: Exercise-induced angina (binary, time-varying)
- **oldpeak**: ST depression induced by exercise (numerical, time-varying)
- **slope**: Slope of the peak exercise ST segment (categorical, time-varying)
- **ca**: Number of major vessels (0–3) colored by fluoroscopy (numerical, time-varying)
- **thal**: Thalassemia status (categorical, time-varying)
- **target**: Presence of heart disease (1 = yes, 0 = no) (binary, prediction target)

Each feature represents a physiological, symptomatic, or diagnostic aspect of a patient, which aligns well with the concept of a Patient Digital Twin.
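The feature list can also be kept as a small data dictionary in code, so later steps can look up which columns are numerical, categorical, or binary. This is only a sketch: the kinds are transcribed from the list above, `target` is marked binary based on the target section of this notebook, and `FEATURE_KINDS`, `group_by_kind`, and `undocumented_columns` are hypothetical names.

```python
from collections import defaultdict

# Column kinds transcribed from the feature description list.
# "target" is labeled binary as an assumption based on its two values (0 and 1).
FEATURE_KINDS = {
    "age": "numerical",
    "sex": "categorical",
    "cp": "categorical",
    "trestbps": "numerical",
    "chol": "numerical",
    "fbs": "binary",
    "restecg": "categorical",
    "thalach": "numerical",
    "exang": "binary",
    "oldpeak": "numerical",
    "slope": "categorical",
    "ca": "numerical",
    "thal": "categorical",
    "target": "binary",
}


def group_by_kind(kinds):
    """Group column names by their documented kind, preserving order."""
    groups = defaultdict(list)
    for column, kind in kinds.items():
        groups[kind].append(column)
    return dict(groups)


def undocumented_columns(columns, kinds):
    """Return the columns that have no entry in the data dictionary."""
    return [column for column in columns if column not in kinds]


groups = group_by_kind(FEATURE_KINDS)
print(groups["binary"])
```

Calling `undocumented_columns(df.columns, FEATURE_KINDS)` on the loaded frame would show whether any column lacks a description.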


In [6]:
df["target"].value_counts()


target
1    526
0    499
Name: count, dtype: int64

### Target Variable Explanation

- The target variable is binary and roughly balanced: 526 rows (51.3%) have target = 1 and 499 rows (48.7%) have target = 0
- **target = 1** indicates presence of heart disease
- **target = 0** indicates absence of heart disease

This variable represents the health state of the patient, which the digital twin aims to model and track over time.
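Class balance is easier to judge as proportions than as raw counts. A short sketch using the counts reported by `value_counts()` above; `class_shares` is a hypothetical helper name:

```python
import pandas as pd


def class_shares(counts: pd.Series) -> pd.Series:
    """Convert class counts into fractions of the total."""
    return counts / counts.sum()


# Counts copied from the value_counts() output in this notebook.
target_counts = pd.Series({1: 526, 0: 499}, name="count")
shares = class_shares(target_counts)
print(shares.round(3))
```

On the loaded frame, `df["target"].value_counts(normalize=True)` gives the same proportions directly.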


In [7]:
df.isnull().sum()


age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

### Missing and Invalid Values

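- `df.isnull().sum()` reports no missing values in any column
- Missing data may still be hidden behind placeholder codes: in the `df.describe()` output, `ca` reaches a maximum of 4 although it counts 0–3 vessels, and `thal` has a minimum of 0, which may be such a code
- These data quality issues will be addressed in the preprocessing phase (Day 5)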
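Since `isnull()` cannot see placeholder codes, a domain check is one way to surface them. The sketch below counts values that fall outside an expected set per column. The expected sets are assumptions (0–3 for `ca`, 1–3 for `thal` in this integer encoding) and should be confirmed against the dataset documentation; `EXPECTED_VALUES` and `count_out_of_domain` are hypothetical names, and the sample frame is illustrative.

```python
import pandas as pd

# Assumed valid codes per column; verify against the dataset documentation.
EXPECTED_VALUES = {
    "ca": {0, 1, 2, 3},
    "thal": {1, 2, 3},
}


def count_out_of_domain(frame: pd.DataFrame, expected: dict) -> dict:
    """Count, per column, the values that are not in the expected set."""
    return {
        column: int((~frame[column].isin(allowed)).sum())
        for column, allowed in expected.items()
        if column in frame.columns
    }


# Illustrative frame (not the real data) with one suspicious value per column.
sample = pd.DataFrame({"ca": [2, 0, 4, 1], "thal": [3, 0, 2, 3]})
suspicious = count_out_of_domain(sample, EXPECTED_VALUES)
print(suspicious)
```

This only counts suspicious values and does not modify the data, which keeps it within the scope of this step.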


In [8]:
df.describe()


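|       | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-------|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.434146 | 0.69561 | 0.942439 | 131.611707 | 246.0 | 0.149268 | 0.529756 | 149.114146 | 0.336585 | 1.071512 | 1.385366 | 0.754146 | 2.323902 | 0.513171 |
| std | 9.07229 | 0.460373 | 1.029641 | 17.516718 | 51.59251 | 0.356527 | 0.527878 | 23.005724 | 0.472772 | 1.175053 | 0.617755 | 1.030798 | 0.62066 | 0.50007 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |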


### Statistical Summary Observations

- Age ranges from 29 to 77 years, cholesterol from 126 to 564 mg/dl, and resting blood pressure from 94 to 200 mm Hg
- Maximum heart rate ranges from 71 to 202, and ST depression from 0 to 6.2
- No conclusions are drawn at this stage; only the value ranges are inspected
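For range inspection alone, a compact min/max table can be easier to scan than the full `describe()` output. A minimal sketch, with `value_ranges` as a hypothetical helper name and a sample frame built from three columns of the five rows shown by `df.head()`:

```python
import pandas as pd


def value_ranges(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return min, max, and their difference for the selected columns."""
    summary = frame[columns].agg(["min", "max"]).T
    summary["range"] = summary["max"] - summary["min"]
    return summary


# Three columns from the five rows displayed by df.head().
sample = pd.DataFrame(
    {
        "age": [52, 53, 70, 61, 62],
        "chol": [212, 203, 174, 203, 294],
        "oldpeak": [1.0, 3.1, 2.6, 0.0, 1.9],
    }
)
ranges = value_ranges(sample, ["age", "chol", "oldpeak"])
print(ranges)
```

On the loaded frame the same helper could be applied to the numerical columns, for example `value_ranges(df, ["age", "trestbps", "chol", "thalach", "oldpeak"])`.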


### Conclusion of Day 4

- The structure and meaning of the dataset have been clearly understood
- Data types and target variable interpretation are identified
- Potential data quality issues are noted
- The dataset is ready for preprocessing and feature preparation

Next Step:
**Day 5 – Data Preprocessing and Quality Handling**
