# Week 3 Assignment: Heart Disease Dataset Analysis

## I 320D: Data Science for Biomedical Informatics | Spring 2026

### 📋 Assignment Version D

---

## 🎯 This Week's Mantra

> **"Every Column Tells a Story"**

In this assignment, you'll apply the 10-Point Data Inspection to a real-world healthcare dataset focused on heart disease diagnosis. By the end, you'll understand not just *what* the data contains, but *why* each variable matters for clinical decision-making.

---

## Learning Objectives

By completing this assignment, you will be able to:

1. ✅ Apply the systematic 10-Point Inspection to a new healthcare dataset
2. ✅ Identify and classify feature types (continuous, discrete, categorical, ordinal)
3. ✅ Detect and document data quality issues (missing values, unexpected values)
4. ✅ Research and document clinical meaning for healthcare variables
5. ✅ Create meaningful data groupings based on clinical standards
6. ✅ Formulate answerable research questions about heart disease risk factors

---

## About the Dataset

**Dataset:** Heart Disease UCI (Combined)  
**Source:** UCI Machine Learning Repository / Kaggle  
**File:** `heart_disease_uci.csv`  
**Target Variable:** `num` (diagnosis of heart disease: 0 = no disease, 1-4 = presence of disease)

### Clinical Context

Heart disease remains the leading cause of death globally, accounting for approximately 17.9 million deaths annually according to the World Health Organization. This dataset combines patient records from four medical institutions: Cleveland Clinic, Hungarian Institute of Cardiology, University Hospital Zurich (Switzerland), and VA Long Beach Medical Center. Understanding these diagnostic variables is crucial for:

- Early identification of high-risk patients
- Understanding regional variations in heart disease presentation
- Clinical decision support systems
- Risk stratification and preventive care planning

---

## Getting Started

First, load the dataset and import your libraries:

In [2]:
import pandas as pd
import numpy as np
# Load dataset
df = pd.read_csv("heart_disease_uci.csv")

# Check first few rows
df.head()

,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


## Part 1: The 10-Point Data Inspection (40 points)

Complete each inspection step and document your findings.

### Step 1: Shape (4 points)

**Your Code:**

In [3]:
df.shape

(920, 16)

**Your Findings:**
- How many rows (observations)? ___920____________
- How many columns (features)? ______16_________
- What does each row represent in clinical terms? ___Each row represents one individual patient and their clinical heart disease test results____________

### Step 2: Column Names (4 points)

**Your Code:**

In [4]:
df.columns.tolist()

['id',
 'age',
 'sex',
 'dataset',
 'cp',
 'trestbps',
 'chol',
 'fbs',
 'restecg',
 'thalch',
 'exang',
 'oldpeak',
 'slope',
 'ca',
 'thal',
 'num']

**Your Findings:**
- List all column names: `id`, `age`, `sex`, `dataset`, `cp`, `trestbps`, `chol`, `fbs`, `restecg`, `thalch`, `exang`, `oldpeak`, `slope`, `ca`, `thal`, `num`
- Which columns might need further research to understand? (Hint: Many use medical abbreviations!)

  Most of them need clarification because they are medical abbreviations:

  - cp (chest pain type)
  - trestbps (resting blood pressure)
  - chol (cholesterol level)
  - fbs (fasting blood sugar)
  - restecg (resting ECG results)
  - thalch (maximum heart rate achieved)
  - exang (exercise-induced angina)
  - oldpeak (ST depression induced by exercise)
  - slope (slope of peak exercise ST segment)
  - ca (number of major vessels colored by fluoroscopy)
  - thal (thallium stress test result)
  - num (diagnosis of heart disease; target variable)

### Step 3: Data Types (4 points)


In [5]:
df.dtypes

id            int64
age           int64
sex          object
dataset      object
cp           object
trestbps    float64
chol        float64
fbs          object
restecg      object
thalch      float64
exang        object
oldpeak     float64
slope        object
ca          float64
thal         object
num           int64
dtype: object

**Your Findings:**
- Which columns are numeric (int64 or float64)?

  
id (int64)

age (int64)

trestbps (float64)

chol (float64)

thalch (float64)

oldpeak (float64)

ca (float64)

num (int64)
- Which columns are categorical (object/string)?
sex

dataset

cp

fbs

restecg

exang

slope

thal

- Are there any data types that seem incorrect based on what you know about the data?

- `fbs`: stored as object but should be boolean (True/False)
- `exang`: stored as object but should be boolean (True/False)
- `ca`: stored as float64 but holds whole-number counts (integer)
- `num`: stored as int64 but represents diagnosis categories (categorical)
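
The type mismatches noted above can be corrected after loading. The following is a minimal sketch on a few hypothetical rows, not the full dataset. It assumes the boolean-like columns hold `True`/`False` values plus missing entries, as the `df.head()` output suggests, and that the `sample` frame stands in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows that mimic the dtypes reported by df.dtypes
sample = pd.DataFrame({
    "fbs": pd.Series([True, False, np.nan], dtype=object),
    "exang": pd.Series([False, True, np.nan], dtype=object),
    "ca": [0.0, 3.0, np.nan],
    "num": [0, 2, 1],
})

# Nullable boolean dtype keeps missing entries as <NA> instead of forcing object
sample["fbs"] = sample["fbs"].astype("boolean")
sample["exang"] = sample["exang"].astype("boolean")

# Nullable integer dtype stores whole-number vessel counts alongside missing values
sample["ca"] = sample["ca"].astype("Int64")

# Ordered categorical makes explicit that 0-4 are diagnosis levels, not quantities
sample["num"] = pd.Categorical(sample["num"], categories=[0, 1, 2, 3, 4], ordered=True)

print(sample.dtypes)
```

The nullable dtypes (`"boolean"`, `"Int64"`) are one reasonable choice here because plain `bool` and `int64` cannot represent missing values.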

### Step 4: First Look (4 points)


In [6]:
df.head(10)

,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0
5,6,56,Male,Cleveland,atypical angina,120.0,236.0,False,normal,178.0,False,0.8,upsloping,0.0,normal,0
6,7,62,Female,Cleveland,asymptomatic,140.0,268.0,False,lv hypertrophy,160.0,False,3.6,downsloping,2.0,normal,3
7,8,57,Female,Cleveland,asymptomatic,120.0,354.0,False,normal,163.0,True,0.6,upsloping,0.0,normal,0
8,9,63,Male,Cleveland,asymptomatic,130.0,254.0,False,lv hypertrophy,147.0,False,1.4,flat,1.0,reversable defect,2
9,10,53,Male,Cleveland,asymptomatic,140.0,203.0,True,lv hypertrophy,155.0,True,3.1,downsloping,0.0,reversable defect,1


**Your Findings:**
- What do the actual values look like?

*Demographic data, clinical measurements, test results, and heart disease indicators.*

- Do you notice any categorical variables that are already human-readable vs. encoded?
Human-readable:

*sex (Male, Female)*

*cp (typical angina, asymptomatic, etc.)*

*restecg (normal, lv hypertrophy, st-t abnormality)*

*slope (upsloping, flat, downsloping)*

*thal (normal, fixed defect, reversable defect)*

*dataset (Cleveland, VA Long Beach, etc.)*

- Are there any values that look like they might be placeholders for missing data?

*Some numeric columns have 0 values (e.g., trestbps or chol), which may indicate missing or incorrectly recorded data.*

*The latter rows show NaN values, indicating missing data.*

### Step 5: Last Look (4 points)


In [7]:
df.tail(10)

,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
910,911,51,Female,VA Long Beach,asymptomatic,114.0,258.0,True,lv hypertrophy,96.0,False,1.0,upsloping,,,0
911,912,62,Male,VA Long Beach,asymptomatic,160.0,254.0,True,st-t abnormality,108.0,True,3.0,flat,,,4
912,913,53,Male,VA Long Beach,asymptomatic,144.0,300.0,True,st-t abnormality,128.0,True,1.5,flat,,,3
913,914,62,Male,VA Long Beach,asymptomatic,158.0,170.0,False,st-t abnormality,138.0,True,0.0,,,,1
914,915,46,Male,VA Long Beach,asymptomatic,134.0,310.0,False,normal,126.0,False,0.0,,,normal,2
915,916,54,Female,VA Long Beach,asymptomatic,127.0,333.0,True,st-t abnormality,154.0,False,0.0,,,,1
916,917,62,Male,VA Long Beach,typical angina,,139.0,False,st-t abnormality,,,,,,,0
917,918,55,Male,VA Long Beach,asymptomatic,122.0,223.0,True,st-t abnormality,100.0,False,0.0,,,fixed defect,2
918,919,58,Male,VA Long Beach,asymptomatic,,385.0,True,lv hypertrophy,,,,,,,0
919,920,62,Male,VA Long Beach,atypical angina,120.0,254.0,False,lv hypertrophy,93.0,True,0.0,,,,1


**Your Findings:**
- Does the data end cleanly?

**Structurally, it does end cleanly (the last row is index 919), but many NaN values appear in the last rows.**

- Are the last rows consistent with the first rows?

**They have the same column structure and general format but the last rows contain significantly more missing values compared to the first rows**

- Do you notice more or fewer missing values in later rows?

MORE

### Step 6: Memory Usage (4 points)


In [8]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  861 non-null    float64
 6   chol      890 non-null    float64
 7   fbs       830 non-null    object 
 8   restecg   918 non-null    object 
 9   thalch    865 non-null    float64
 10  exang     865 non-null    object 
 11  oldpeak   858 non-null    float64
 12  slope     611 non-null    object 
 13  ca        309 non-null    float64
 14  thal      434 non-null    object 
 15  num       920 non-null    int64  
dtypes: float64(5), int64(3), object(8)
memory usage: 413.6 KB


**Your Findings:**
- How much memory does the dataset use? ___~0.40___ MB
- Is this a "small" or "large" dataset by data science standards?

  **This is a small dataset by data science standards; large datasets often run to hundreds of MB or several GB.**
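
The conversion behind the memory figure is simple arithmetic. As a small sketch, the first part converts the value reported by `df.info()` above, and the second shows how the same total could be computed directly; `tiny` is a hypothetical stand-in frame, not the assignment data.

```python
import pandas as pd

# Figure reported by df.info(memory_usage="deep") above
reported_kb = 413.6
reported_mb = reported_kb / 1024  # pandas reports binary units (1 MB = 1024 KB)
print(f"{reported_mb:.2f} MB")

# The same total can be computed directly; `tiny` is a hypothetical stand-in frame
tiny = pd.DataFrame({"sex": ["Male", "Female"], "age": [63, 41]})
total_bytes = tiny.memory_usage(deep=True).sum()
print(f"{total_bytes / 1024**2:.6f} MB")
```

Passing `deep=True` matters for this dataset because eight of its columns are `object` dtype, and their string contents are only counted in deep mode.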

### Step 7: Missing Values (4 points)


In [9]:
missing_counts = df.isnull().sum()
missing_percent = (missing_counts / len(df)) * 100

pd.DataFrame({
    "Missing Count": missing_counts,
    "Missing %": missing_percent.round(2)
})


,Missing Count,Missing %
id,0,0.0
age,0,0.0
sex,0,0.0
dataset,0,0.0
cp,0,0.0
trestbps,59,6.41
chol,30,3.26
fbs,90,9.78
restecg,2,0.22
thalch,55,5.98


**Your Findings:**
- Which columns have missing values?

trestbps

chol

fbs

restecg

thalch

exang

oldpeak

slope

ca

thal
- What percentage of each column is missing?

  - trestbps → 6.41%
  - chol → 3.26%
  - fbs → 9.78%
  - restecg → 0.22%
  - thalch → 5.98%
  - exang → 5.98%
  - oldpeak → 6.74%
  - slope → 33.59%
  - ca → 66.41%
  - thal → 52.83%

- Which columns have the MOST missing data? What might explain this?

Top 3:
ca, thal, slope

These might be because:
Some hospitals did not record these tests or these tests may not have been performed for all patients
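
The percentages listed above can be cross-checked against the non-null counts that `df.info()` printed in Step 6. This sketch only re-derives them from those printed counts; it does not read the CSV.

```python
import pandas as pd

n_rows = 920
# Non-null counts copied from the df.info() output in Step 6
non_null = pd.Series({
    "trestbps": 861, "chol": 890, "fbs": 830, "restecg": 918, "thalch": 865,
    "exang": 865, "oldpeak": 858, "slope": 611, "ca": 309, "thal": 434,
})

# Missing percentage = rows without a value divided by total rows
missing_pct = ((n_rows - non_null) / n_rows * 100).round(2)

# Rank columns from most to least missing
ranked = missing_pct.sort_values(ascending=False)
print(ranked)
```

The ranking confirms `ca`, `thal`, and `slope` as the three columns with the most missing data.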

### Step 8: Duplicates (4 points)


In [10]:
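# Only a cell's last expression is displayed; wrap this in print() to see the duplicate-row count
df.duplicated().sum()

# Compare the number of unique patient IDs with the number of rows
df["id"].nunique(), len(df)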



(920, 920)

**Your Findings:**
- Are there any duplicate rows? ___no____________
- Are all patient IDs unique? ______yes_________
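
One caveat worth illustrating: when every row has a unique `id`, a whole-row duplicate check can never find anything, even if a patient's measurements were entered twice. The sketch below uses hypothetical records to show a check that ignores the identifier column; whether such repeats exist in the real dataset is not something the outputs above establish.

```python
import pandas as pd

# Hypothetical records: rows 0 and 2 share identical measurements but differ in id
records = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [63, 67, 63],
    "sex": ["Male", "Male", "Male"],
    "chol": [233.0, 286.0, 233.0],
})

# Whole-row comparison finds nothing because every id is unique
full_row_dupes = records.duplicated().sum()

# Comparing every column except the identifier reveals the repeated measurements
clinical_cols = records.columns.drop("id").tolist()
clinical_dupes = records.duplicated(subset=clinical_cols).sum()

print(full_row_dupes, clinical_dupes)
```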


### Step 9: Basic Statistics (4 points)


In [11]:
df.describe()


,id,age,trestbps,chol,thalch,oldpeak,ca,num
count,920.0,920.0,861.0,890.0,865.0,858.0,309.0,920.0
mean,460.5,53.51087,132.132404,199.130337,137.545665,0.878788,0.676375,0.995652
std,265.725422,9.424685,19.06607,110.78081,25.926276,1.091226,0.935653,1.142693
min,1.0,28.0,0.0,0.0,60.0,-2.6,0.0,0.0
25%,230.75,47.0,120.0,175.0,120.0,0.0,0.0,0.0
50%,460.5,54.0,130.0,223.0,140.0,0.5,0.0,1.0
75%,690.25,60.0,140.0,268.0,157.0,1.5,1.0,2.0
max,920.0,77.0,200.0,603.0,202.0,6.2,3.0,4.0


**Your Findings:**
- What is the age range in the dataset? ____28___________ to ___77____________
- What is the range of resting blood pressure (trestbps)? ____0___________ to __200_____________
- What is the range of serum cholesterol (chol)? _______0________ to __603_____________
- What is the range of maximum heart rate achieved (thalch)? ________60_______ to __202_____________
- Do any min/max values seem impossible or clinically unlikely?


Yes:

trestbps minimum = 0 is impossible

chol minimum = 0 is impossible

oldpeak minimum = -2.6 is unusual (as negative ST depression is uncommon)
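
The same screening can be done programmatically. As a rough sketch, the snippet below takes the minimums printed by `df.describe()` and flags measurements whose minimum is not strictly positive. The rule "must be greater than zero" is an assumption that fits age, blood pressure, cholesterol, and heart rate; `oldpeak` is left out because 0 is a valid reading for it.

```python
import pandas as pd

# Minimum values copied from the df.describe() output above
minimums = pd.Series({"age": 28.0, "trestbps": 0.0, "chol": 0.0, "thalch": 60.0})

# Assumption: each of these measurements must be strictly positive in a living patient
flagged = minimums[minimums <= 0].index.tolist()
print(flagged)
```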

### Step 10: Unique Counts (4 points)


In [12]:
df.nunique().sort_values()



sex           2
fbs           2
exang         2
restecg       3
slope         3
thal          3
cp            4
dataset       4
ca            4
num           5
age          50
oldpeak      53
trestbps     61
thalch      119
chol        217
id          920
dtype: int64

**Your Findings:**
- Which columns have very few unique values (likely categorical)?
sex (2)

fbs (2)

exang (2)

restecg (3)

slope (3)

thal (3)

cp (4)

dataset (4)

ca (4)

num (5)

- Which columns have many unique values (likely continuous or IDs)?
  
id (920)

chol (217)

thalch (119)

trestbps (61)

oldpeak (53)

age (50)


- Does the number of unique IDs match the number of rows? _____yes__________
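
The split between "few" and "many" unique values can be made explicit with a cutoff. This sketch reuses the counts printed by `df.nunique()`; the cutoff of 10 is a hypothetical choice for illustration, not a rule from the assignment.

```python
import pandas as pd

# Unique counts copied from the df.nunique() output above
unique_counts = pd.Series({
    "sex": 2, "fbs": 2, "exang": 2, "restecg": 3, "slope": 3, "thal": 3,
    "cp": 4, "dataset": 4, "ca": 4, "num": 5, "age": 50, "oldpeak": 53,
    "trestbps": 61, "thalch": 119, "chol": 217, "id": 920,
})

# Hypothetical cutoff: columns with at most 10 distinct values are categorical candidates
CARDINALITY_CUTOFF = 10
likely_categorical = unique_counts[unique_counts <= CARDINALITY_CUTOFF].index.tolist()
likely_continuous_or_id = unique_counts[unique_counts > CARDINALITY_CUTOFF].index.tolist()

# A column whose unique count equals the row count behaves like an identifier
n_rows = 920
identifier_like = unique_counts[unique_counts == n_rows].index.tolist()

print(likely_categorical)
print(likely_continuous_or_id)
print(identifier_like)
```

A cutoff is only a first pass: `ca` and `num` are numeric but land in the low-cardinality group, which is why they were flagged in Step 3.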

## Part 2: Data Dictionary (20 points)


Complete the following data dictionary. For each column, you must:
1. **Research** the clinical meaning (these are standard cardiac assessment terms)
2. **Identify** the feature type (Continuous, Discrete, Categorical-Nominal, Categorical-Ordinal, Binary, Identifier)
3. **Document** the valid values/range you observe
4. **Note** any issues or questions

| Column | Description | Feature Type | Valid Values/Range | Notes/Issues |
|--------|-------------|--------------|-------------------|--------------|
| `id` | Unique ID for each patient | Identifier | 1 - 920 | |
| `age` | Age of the patient in years | Discrete | 28 - 77 | |
| `sex` | Patient sex | Categorical-Nominal | Male, Female | |
| `dataset` | Medical center where the data was collected | Categorical-Nominal | Cleveland, Hungary, Switzerland, VA Long Beach | |
| `cp` | Chest pain type | Categorical-Nominal | typical angina, atypical angina, non-anginal, asymptomatic | |
| `trestbps` | Resting blood pressure (in mm Hg on admission to the hospital) | Continuous | 0 - 200 | 0 is clinically impossible; treat as missing |
| `chol` | Serum cholesterol in mg/dl | Continuous | 0 - 603 | 172 values of 0; treat as missing |
| `fbs` | Whether fasting blood sugar > 120 mg/dl | Binary | True, False | |
| `restecg` | Resting electrocardiographic results | Categorical-Nominal | normal, st-t abnormality, lv hypertrophy | |
| `thalch` | Maximum heart rate achieved | Continuous | 60 - 202 | |
| `exang` | Exercise-induced angina | Binary | True, False | |
| `oldpeak` | ST depression induced by exercise relative to rest | Continuous | -2.6 - 6.2 | Negative values are unusual |
| `slope` | Slope of the peak exercise ST segment | Categorical-Nominal | upsloping, flat, downsloping | About 34% missing |
| `ca` | Number of major vessels (0-3) colored by fluoroscopy | Discrete | 0 - 3 | About 66% missing |
| `thal` | Thallium stress test result | Categorical-Nominal | normal, fixed defect, reversable defect | About 53% missing |
| `num` | Diagnosis of heart disease (target variable): 0 = no disease, 1-4 = presence of disease | Discrete | 0 - 4 | |

### Clinical Research Questions for Version D

Answer these questions based on your research (you may need to use Google):

**1. How is maximum heart rate (thalch) measured during a cardiac stress test? What is the formula for calculating age-predicted maximum heart rate, and why is achieving a target percentage important?**

Your answer:
Maximum heart rate is measured by having the patient exercise on a treadmill or stationary bike while heart rate is continuously monitored with ECG equipment.
The formula for age-predicted maximum heart rate is 220 - age.
Doctors typically aim for patients to reach 85% of their predicted maximum so that the test is diagnostically useful: if the heart is not stressed adequately, restricted blood flow may not appear. An inadequate heart rate response may also indicate chronotropic incompetence, which itself can signal heart disease.

---

**2. What is ST depression (oldpeak)? Why is the amount of ST depression during exercise an important diagnostic indicator for coronary artery disease?**

Your answer:
ST depression is a downward shift in the ST segment of an ECG tracing during exercise. It suggests that the heart muscle may not be receiving enough oxygen. The greater the ST depression, the higher the likelihood of coronary artery blockage and coronary artery disease.

---

**3. What are the different types of chest pain (cp) in this dataset? Which type is most commonly associated with heart disease, and why might "asymptomatic" patients still be tested?**

Your answer:
The different types are: typical angina, atypical angina, non-anginal pain, and asymptomatic.
Typical angina is most strongly associated with heart disease because it follows a classic pattern and reflects reduced blood flow.
Asymptomatic patients may still be tested because they have abnormal ECG findings or other risk factors, or because some patients with heart disease simply do not experience pain.

---

**4. What is chronotropic incompetence? How might a patient's inability to achieve their target heart rate during exercise testing indicate underlying heart disease?**

Your answer:
Chronotropic incompetence is the inability of the heart to increase its rate appropriately during exercise.
If a patient cannot reach 85% of the predicted maximum heart rate, it may indicate sinus node dysfunction, coronary artery disease, or another underlying cardiac impairment. Chronotropic incompetence is an independent predictor of cardiovascular mortality.

---


## Part 3: Data Validation (15 points)
### 3.1 Blood Pressure Validation (5 points)

Write code to check:
- How many blood pressure readings are 0? (This is clinically impossible for a living patient!)
- How many readings are below 80 mmHg? Above 200 mmHg?
- What might a value of 0 represent in this dataset?



In [13]:
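# Only a cell's last expression is displayed; wrap the first two checks in print() to see them
(df["trestbps"] == 0).sum()  # readings of exactly 0
(df["trestbps"] < 80).sum(), (df["trestbps"] > 200).sum()  # below 80 / above 200 mmHg
df["trestbps"].value_counts().head()  # most common readings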


trestbps
120.0    131
130.0    115
140.0    102
110.0     59
150.0     56
Name: count, dtype: int64

**Your Findings:**

- Are there any impossible blood pressure values?

Yes. A value of 0 mmHg is clinically impossible for a living patient.

- How should values of 0 be treated - as missing data or as valid values?

As missing data

- Does this match what `.isnull()` reports for this column?
No. `.isnull()` does not count 0 as missing, so these zeros were likely used as placeholders for missing blood pressure readings and are not included in the missing-value counts from Step 7.
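
The gap between zeros and true nulls can be made concrete. The sketch below uses a handful of hypothetical readings (not the real column) to show that `.isnull()` skips a placeholder 0 until it is recoded as `NaN`; the `trestbps_clean` column name is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one true NaN and one 0 used as a placeholder
bp = pd.DataFrame({"trestbps": [145.0, 0.0, np.nan, 120.0]})

zeros = (bp["trestbps"] == 0).sum()
nulls_before = bp["trestbps"].isnull().sum()

# Recode the impossible 0 as NaN in a new column so the raw values stay available
bp["trestbps_clean"] = bp["trestbps"].replace(0, np.nan)
nulls_after = bp["trestbps_clean"].isnull().sum()

print(zeros, nulls_before, nulls_after)
```

Writing the cleaned values to a separate column keeps the original data intact, which makes the recoding easy to audit later.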


### 3.2 Cholesterol Validation (5 points)

Write code to examine cholesterol values closely. Check for any unusual values including zeros.

In [26]:
(df["chol"] == 0).sum()



np.int64(172)

In [27]:
df["chol"].describe()

count    890.000000
mean     199.130337
std      110.780810
min        0.000000
25%      175.000000
50%      223.000000
75%      268.000000
max      603.000000
Name: chol, dtype: float64

**Your Findings:**

- Did you find any cholesterol values of 0?

Yes. The minimum value is 0, and the count above shows 172 records with a cholesterol value of 0.

- Is a cholesterol level of 0 clinically possible?
No. This is impossible for humans

- How many such impossible values exist?
172


- What would happen if you calculated the mean cholesterol without handling this?

The zeros would pull the mean down, giving an inaccurate estimate of the population's average cholesterol level. These values should be treated as missing before computing summary statistics.
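
The size of the distortion can be estimated from numbers already printed above, without reloading the data. This is a back-of-the-envelope sketch: it reconstructs the column total from the reported count and mean, then divides by the number of non-zero readings.

```python
# Values copied from df["chol"].describe() and the zero count above
count_non_null = 890
mean_with_zeros = 199.130337
zero_count = 172

# Zeros add nothing to the sum, so the total is unchanged when they are removed
total = count_non_null * mean_with_zeros
mean_without_zeros = total / (count_non_null - zero_count)

print(round(mean_with_zeros, 2), round(mean_without_zeros, 2))
```

Under these figures the mean rises from about 199 mg/dl to about 247 mg/dl once the 172 placeholder zeros are excluded.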
  

### 3.3 Multi-Source Data Validation (5 points)

This dataset combines data from four different medical centers. Write code to:
1. Count patients from each source (dataset column)
2. Check if missing data patterns differ by source

In [28]:
df["dataset"].value_counts()



dataset
Cleveland        304
Hungary          293
VA Long Beach    200
Switzerland      123
Name: count, dtype: int64

In [29]:
# Percent of missing values per column within each source
df.isnull().groupby(df["dataset"]).mean() * 100


dataset,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num,max_hr_predicted,percent_max_hr,hr_response,disease_binary
Cleveland,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.328947,1.644737,0.986842,0.0,0.0,0.0,0.0,0.0
Hungary,0.0,0.0,0.0,0.0,0.0,0.341297,7.849829,2.730375,0.341297,0.341297,0.341297,0.0,64.505119,98.976109,90.443686,0.0,0.0,0.341297,0.0,0.0
Switzerland,0.0,0.0,0.0,0.0,0.0,1.626016,0.0,60.97561,0.813008,0.813008,0.813008,4.878049,13.821138,95.934959,42.276423,0.0,0.0,0.813008,0.0,0.0
VA Long Beach,0.0,0.0,0.0,0.0,0.0,28.0,3.5,3.5,0.0,26.5,26.5,28.0,51.0,99.0,83.0,0.0,0.0,26.5,0.0,0.0


**Your Findings:**

- Which medical center contributed the most patients?

Cleveland
- Which medical center contributed the fewest?

Switzerland (123 patients)
- Do certain columns have more missing data in certain sources? Which ones?

Yes.  
Hungary: ca (~99%), thal (~90%), slope (~65%)  
Switzerland: ca (~96%), fbs (~61%), thal (~42%)  
VA Long Beach: ca (~99%), thal (~83%), slope (~51%)
- Why might data completeness vary across medical institutions?

Different diagnostic protocols, equipment variability, differences in documentation standards, research focus, and data collection practices can all contribute to this difference in outcomes.
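
To read the per-source table systematically rather than by eye, the most affected columns can be ranked for each center. The sketch below copies four columns of the table above (values rounded to two decimals) into a small frame; it is an illustration of the ranking step, not a rerun of the analysis.

```python
import pandas as pd

# Missing percentages copied (rounded) from the per-source table above
missing_by_source = pd.DataFrame(
    {
        "fbs": [0.00, 2.73, 60.98, 3.50],
        "slope": [0.33, 64.51, 13.82, 51.00],
        "ca": [1.64, 98.98, 95.93, 99.00],
        "thal": [0.99, 90.44, 42.28, 83.00],
    },
    index=["Cleveland", "Hungary", "Switzerland", "VA Long Beach"],
)

# For each center, list the three columns with the highest missing percentage
top_missing = {
    center: row.nlargest(3).index.tolist()
    for center, row in missing_by_source.iterrows()
}
for center, cols in top_missing.items():
    print(center, cols)
```

Cleveland stands out as nearly complete on all four columns, while the other three centers are each missing most of `ca`.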

## Part 4: Create Heart Rate Response Groups (10 points)

Create a new column called `hr_response` that categorizes patients based on their maximum heart rate achieved (thalch) during stress testing, relative to their age-predicted maximum.

### Version D: Heart Rate Response Categories

First, you'll need to calculate each patient's **age-predicted maximum heart rate** using the formula:
```
Age-Predicted Max HR = 220 - age
```

Then calculate what **percentage of their maximum** they achieved:
```
Percent of Max = (thalch / Age-Predicted Max HR) * 100
```

Use these categories based on stress test interpretation guidelines:

| HR Response Category | Percent of Age-Predicted Max | Clinical Significance |
|---------------------|------------------------------|----------------------|
| Missing/Invalid | thalch is NaN or 0 | Data quality issue - cannot classify |
| Poor Response | < 70% | Possible chronotropic incompetence |
| Submaximal | 70-84% | Below diagnostic threshold |
| Adequate | 85-99% | Diagnostically valid test |
| Excellent | ≥ 100% | Exceeded predicted maximum |
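
A worked example may help before applying the formulas to the full dataset. The sketch below uses five (age, thalch) pairs taken from the `df.head()` and `df.tail()` outputs shown earlier and classifies them with `np.select`, a vectorized alternative to a row-by-row function. It assumes the bands in the table are half-open (for example, Submaximal means 70 ≤ percent < 85), since the table does not say where a value such as 84.5% belongs.

```python
import numpy as np
import pandas as pd

# (age, thalch) pairs taken from the df.head() and df.tail() outputs shown earlier
patients = pd.DataFrame({
    "age": [63, 67, 37, 51, 62],
    "thalch": [150.0, 108.0, 187.0, 96.0, np.nan],
})

patients["max_hr_predicted"] = 220 - patients["age"]
patients["percent_max_hr"] = patients["thalch"] / patients["max_hr_predicted"] * 100

# Assumption: bands are half-open, e.g. Submaximal is 70 <= percent < 85.
# Conditions are checked in order, so the missing/zero test must come first.
conditions = [
    patients["thalch"].isna() | (patients["thalch"] == 0),
    patients["percent_max_hr"] < 70,
    patients["percent_max_hr"] < 85,
    patients["percent_max_hr"] < 100,
]
labels = ["Missing/Invalid", "Poor Response", "Submaximal", "Adequate"]
patients["hr_response"] = np.select(conditions, labels, default="Excellent")

print(patients.round(1))
```

For the first patient, 220 - 63 = 157 and 150 / 157 is about 95.5%, which falls in the Adequate band.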

In [16]:
# Step 1: Calculate age-predicted maximum heart rate
df["max_hr_predicted"] = 220 - df["age"]


# Step 2: Calculate percentage of maximum achieved
df["percent_max_hr"] = (df["thalch"] / df["max_hr_predicted"]) * 100


# Step 3: Create the hr_response category column
def classify_hr(row):
    if pd.isna(row["thalch"]) or row["thalch"] == 0:
        return "Missing/Invalid"
    elif row["percent_max_hr"] < 70:
        return "Poor Response"
    elif row["percent_max_hr"] < 85:
        return "Submaximal"
    elif row["percent_max_hr"] < 100:
        return "Adequate"
    else:
        return "Excellent"

df["hr_response"] = df.apply(classify_hr, axis=1)

# IMPORTANT: Handle missing/zero thalch values appropriately

In [17]:
### Verify your groupings worked:
df["hr_response"].value_counts()

# Show counts per heart rate response category

hr_response
Adequate           292
Submaximal         290
Poor Response      179
Excellent          104
Missing/Invalid     55
Name: count, dtype: int64

In [18]:
### Calculate heart disease rate by heart rate response:
df["disease_binary"] = (df["num"] > 0).astype(int)

(
    df[df["hr_response"] != "Missing/Invalid"]
    .groupby("hr_response")["disease_binary"]
    .mean()
    .mul(100)
    .round(2)
)


# Create a binary indicator for heart disease (num > 0 means disease present)
# Then calculate the percentage of patients with heart disease in each HR response category
# Exclude the "Missing/Invalid" category from your interpretation



hr_response
Adequate         43.49
Excellent        31.73
Poor Response    75.98
Submaximal       61.38
Name: disease_binary, dtype: float64

### Analysis Questions:

**1. How many patients are in each heart rate response category? How many have missing/invalid heart rate readings?**

Your answer:
Adequate: 292

Submaximal: 290

Poor Response: 179

Excellent: 104

Missing/Invalid: 55

---

**2. What is the heart disease rate (percentage) for each heart rate response category (excluding invalid)?**

Your answer:
Poor Response: 75.98%

Submaximal: 61.38%

Adequate: 43.49%

Excellent: 31.73%

---

**3. Does heart disease prevalence vary with heart rate response as expected? Patients who can't achieve their target heart rate often have cardiac issues - do you see this pattern?**

Your answer:
Yes. Patients with a Poor Response have the highest disease rate (75.98%).
Disease prevalence decreases as heart rate response improves; patients with an Excellent response have the lowest disease rate (31.73%).
This matches expectations: patients who cannot reach their predicted heart rate may have some impairment of cardiac function.

---
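
Because `groupby` prints categories alphabetically, the trend is easier to see when the rates are reordered by response level. This sketch reuses the four percentages from the output in this part and checks that they fall steadily from the weakest to the strongest response.

```python
import pandas as pd

# Disease rates (%) copied from the groupby output above
rates = pd.Series({
    "Adequate": 43.49, "Excellent": 31.73,
    "Poor Response": 75.98, "Submaximal": 61.38,
})

# Reorder from weakest to strongest heart rate response instead of alphabetically
response_order = ["Poor Response", "Submaximal", "Adequate", "Excellent"]
ordered_rates = rates.reindex(response_order)
print(ordered_rates)

# True when each step up in response category has a lower disease rate
print(ordered_rates.is_monotonic_decreasing)
```

A steady decrease is an association only; it does not show that a poor heart rate response causes disease or the reverse.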


## Part 5: Research Questions (15 points)

### 5.1 Write Three Answerable Questions (9 points)

Write three questions that THIS dataset can answer. Remember: the data can show relationships and patterns, but cannot prove causation.

**Your questions must explore these specific areas:**

1. **A question about thallium stress test results (thal) and heart disease:**

Is there an association between abnormal thallium stress test results (thal) and the presence of heart disease (num > 0)?

---

2. **A question comparing patients across the four medical centers (dataset):**
Does the prevalence of heart disease differ across the four medical centers?

---

3. **A question about the combination of ST depression (oldpeak) AND ST slope (slope):**
How does the combination of ST depression (oldpeak) and ST slope (slope) relate to heart disease prevalence?

---
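
As an illustration of how the first question could be approached, the sketch below builds a row-normalized cross-tabulation of `thal` against a binary disease flag. It uses only the ten rows shown by `df.head(10)` in Step 4, so the percentages demonstrate the technique and say nothing reliable about the full dataset. Note that the data spells the category "reversable defect".

```python
import pandas as pd

# thal and num values copied from the df.head(10) output in Step 4
first_rows = pd.DataFrame({
    "thal": ["fixed defect", "normal", "reversable defect", "normal", "normal",
             "normal", "normal", "normal", "reversable defect", "reversable defect"],
    "num": [0, 2, 1, 0, 0, 0, 3, 0, 2, 1],
})
first_rows["disease_binary"] = (first_rows["num"] > 0).astype(int)

# Row-normalized cross-tabulation: percent without (0) and with (1) disease per thal result
thal_table = pd.crosstab(
    first_rows["thal"], first_rows["disease_binary"], normalize="index"
).mul(100).round(2)
print(thal_table)
```

On the full dataset, the roughly 53% of rows with a missing `thal` value would need to be handled first, for example by reporting them as their own group.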

### 5.2 Identify One Question the Data CANNOT Answer (3 points)

Write one question about **patient symptoms or quality of life** that this dataset cannot answer, and explain why.

**Question:**
How does heart disease affect patients' daily living and quality of life?

**Why it cannot be answered with this data:**

This dataset contains only clinical and diagnostic variables. It does not include patients' own reports of how they feel, how severe their symptoms are, or any quality-of-life measures. Therefore, we cannot evaluate how heart disease impacts their daily living.

---


### 5.3 Grouping Analysis (3 points)

Answer this question using a groupby analysis:

**"What is the average age for each value of the number of major vessels colored by fluoroscopy (ca)?"**

(Note: You'll need to handle missing values in ca - about 66% of this column is missing!)

In [19]:
df_ca = df.dropna(subset=["ca"])

df_ca.groupby("ca")["age"].mean().round(2)


ca
0.0    51.76
1.0    57.51
2.0    60.15
3.0    59.90
Name: age, dtype: float64

**Your Interpretation:**

Is there a relationship between age and number of diseased vessels? Does this make clinical sense? What are the limitations of this analysis given the high missingness in the ca column?


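There appears to be a positive relationship between age and number of diseased vessels. Patients with more vessels colored by fluoroscopy tend to be older on average, although the mean for 3 vessels (59.90) is slightly below the mean for 2 vessels (60.15).
This makes sense clinically because heart disease progresses with age, and older patients are more likely to have advanced coronary artery disease.
Limitation: only 309 of the 920 patients (about 34%) have a recorded ca value, and the per-source table in Part 3 shows that almost all of them come from Cleveland, so these averages may not generalize to the other three centers.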
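
Reporting group sizes next to the means makes the missingness limitation visible in the output itself. The following sketch uses the `age` and `ca` values from `df.head(10)` plus two `df.tail(10)` rows where `ca` is missing; it illustrates the technique on twelve rows and is not a substitute for the full analysis.

```python
import numpy as np
import pandas as pd

# age and ca values copied from df.head(10), plus two df.tail(10) rows where ca is missing
vessels = pd.DataFrame({
    "age": [63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 51, 62],
    "ca": [0.0, 3.0, 2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, np.nan, np.nan],
})

# groupby drops NaN keys by default, so rows without ca are excluded automatically
summary = vessels.groupby("ca")["age"].agg(["mean", "count"]).round(2)
print(summary)

# Share of rows that contribute to the summary at all
coverage = vessels["ca"].notna().mean() * 100
print(f"{coverage:.1f}% of rows have a ca value")
```

A mean based on one or two patients, as for `ca` = 1 and 3 in this small sample, should be read with much more caution than one based on many.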
    

## Part 6: Target Variable Analysis (Bonus - 5 points)

The `num` column is our **target variable** - what we're trying to predict. It has 5 possible values (0-4) representing the severity of heart disease narrowing. Analyze its distribution.

**Your Code:**

In [20]:
# Show the count and percentage for each value of num

# Also create a binary version (0 = no disease, 1 = disease present) and show its distribution

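
The two steps described in the comments above could be sketched as follows. The example runs on the ten `num` values visible in `df.head(10)`, so its percentages are illustrative only; on the real data the same calls would be applied to `df["num"]`.

```python
import pandas as pd

# num values copied from the df.head(10) output; the real column has 920 entries
num = pd.Series([0, 2, 1, 0, 0, 0, 3, 0, 2, 1], name="num")

# Count and percentage for each severity level
distribution = pd.DataFrame({
    "count": num.value_counts().sort_index(),
    "percent": num.value_counts(normalize=True).sort_index().mul(100).round(2),
})
print(distribution)

# Binary version: 0 = no disease, 1 = disease present (num > 0)
disease_binary = (num > 0).astype(int)
print(disease_binary.value_counts(normalize=True).mul(100).round(2))
```

Levels that never occur in the sample (here, level 4) simply do not appear in `value_counts()`, which is worth remembering when comparing subsets.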

### Bonus Questions:

**1. What percentage of patients in this dataset have some form of heart disease (num > 0)?**

Your answer:

---

**2. Is this dataset balanced or imbalanced between disease/no disease?**

Your answer:

---

**3. Looking at the 5 severity levels (0-4), which level is most common? Is this distribution what you'd expect clinically?**

Your answer:

---

**4. How might the multi-severity nature of the target (0-4) versus a simple binary (disease/no disease) affect how you'd approach a machine learning problem?**

Your answer:

---

## Submission Checklist

Before submitting, verify you have completed:

- [ ] **Part 1:** All 10 inspection steps with code AND written findings
- [ ] **Part 2:** Complete data dictionary with all 16 columns filled in
- [ ] **Part 2:** Answered all 4 clinical research questions
- [ ] **Part 3:** All 3 validation checks with code and answers
- [ ] **Part 4:** Created `hr_response` column using **Heart Rate Response Categories**
- [ ] **Part 4:** Calculated heart disease rate by HR response with interpretation
- [ ] **Part 5:** Three research questions (thal, medical centers, oldpeak+slope)
- [ ] **Part 5:** One unanswerable question about symptoms/quality of life
- [ ] **Part 5:** Age by number of vessels (ca) groupby analysis
- [ ] **Bonus (Optional):** Target variable analysis

---

## Grading Rubric

| Component | Points | Requirements for Full Credit |
|-----------|--------|------------------------------|
| Part 1: 10-Point Inspection | 40 | All 10 steps complete with working code AND thoughtful written analysis |
| Part 2: Data Dictionary | 20 | All 16 columns documented with correct feature types and clinical research |
| Part 3: Data Validation | 15 | All validation checks complete with working code and insightful answers |
| Part 4: HR Response Categories | 10 | Working code that correctly calculates % of max HR and creates categories (including invalid handling) AND meaningful interpretation |
| Part 5: Research Questions | 15 | Three good questions in specified areas, one clear limitation, groupby analysis complete |
| **Bonus:** Target Analysis | +5 | Thoughtful analysis of target variable distribution with clinical connection |
| **Total** | 100 (+5 bonus) | |

---

## Hints (Read Before You Get Stuck!)

### ⚠️ Common Pitfalls:

1. **Values of 0 that aren't really zero**
   - Some columns (like `trestbps` and `chol`) have 0 values that represent missing data, not actual measurements
   - The `.isnull()` method won't detect these!
   - Always examine your data with `value_counts()` and check if extreme values make clinical sense

2. **Missing data varies by source**
   - The four medical centers have different data completeness
   - Some columns like `ca`, `thal`, and `slope` have extensive missing data
   - Understanding WHY data is missing is as important as knowing that it's missing

3. **The target variable has multiple levels**
   - `num` ranges from 0-4, not just 0-1
   - For many analyses, you'll want to convert this to binary (disease present/absent)

4. **Sex is coded as text**
   - Unlike some versions of this dataset, sex is "Male"/"Female" not 0/1

5. **Heart rate response calculation**
   - Remember to handle cases where thalch is missing or 0
   - The formula is (thalch / (220 - age)) * 100

### 💡 Pro Tips:

- Use `value_counts(dropna=False)` to see if there are any null values
- When something seems weird (like 0 blood pressure), investigate it; don't assume it's valid
- Cross-reference columns: do the patterns make clinical sense together?
- Read the original UCI documentation for additional context
- Always state what the data CAN tell you vs. what would require additional information

---

## Useful Resources

- **UCI Repository Page:** https://archive.ics.uci.edu/ml/datasets/heart+disease
- **Stress Test Interpretation:** https://www.mayoclinic.org/tests-procedures/stress-test/about/pac-20385234
- **Chronotropic Incompetence:** https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3188536/
- **Target Heart Rate Calculator:** https://www.heart.org/en/healthy-living/fitness/fitness-basics/target-heart-rates
- **Pandas Documentation:** https://pandas.pydata.org/docs/

---

*Remember: "Every Column Tells a Story" - your job is to figure out what that story is!*

---

**Due Date:** [See Canvas]

**Submission:** Upload your completed Jupyter notebook (.ipynb) to Canvas
