# Week 3 Assignment: Heart Disease Dataset Analysis

## I 320D: Data Science for Biomedical Informatics | Spring 2026

### 📋 Assignment Version D

---

## 🎯 This Week's Mantra

> **"Every Column Tells a Story"**

In this assignment, you'll apply the 10-Point Data Inspection to a real-world healthcare dataset focused on heart disease diagnosis. By the end, you'll understand not just *what* the data contains, but *why* each variable matters for clinical decision-making.

---

## Learning Objectives

By completing this assignment, you will be able to:

1. ✅ Apply the systematic 10-Point Inspection to a new healthcare dataset
2. ✅ Identify and classify feature types (continuous, discrete, categorical, ordinal)
3. ✅ Detect and document data quality issues (missing values, unexpected values)
4. ✅ Research and document clinical meaning for healthcare variables
5. ✅ Create meaningful data groupings based on clinical standards
6. ✅ Formulate answerable research questions about heart disease risk factors

---

## About the Dataset

**Dataset:** Heart Disease UCI (Combined)  
**Source:** UCI Machine Learning Repository / Kaggle  
**File:** `heart_disease_uci.csv`  
**Target Variable:** `num` (diagnosis of heart disease: 0 = no disease, 1-4 = presence of disease)
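
Because `num` mixes "no disease" (0) with four levels of disease (1-4), it can help to see how the scale collapses into a yes/no flag. The sketch below uses a toy series rather than the real file, and `has_disease` is a hypothetical name, not a column in the dataset.

```python
import pandas as pd

# Toy stand-in for the `num` column; values 0-4 as described above.
num = pd.Series([0, 2, 1, 0, 3, 4], name="num")

# Collapse the 0-4 scale into a binary flag: 0 = no disease, 1 = any disease.
# `has_disease` is a hypothetical name chosen for this illustration.
has_disease = (num > 0).astype(int).rename("has_disease")

print(has_disease.tolist())
```

Whether a binary or a five-level target is more appropriate depends on the research question, so treat this as one option rather than a required step.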

### Clinical Context

Heart disease remains the leading cause of death globally, accounting for approximately 17.9 million deaths annually according to the World Health Organization. This dataset combines patient records from four medical institutions: Cleveland Clinic, Hungarian Institute of Cardiology, University Hospital Zurich (Switzerland), and VA Long Beach Medical Center. Understanding these diagnostic variables is crucial for:

- Early identification of high-risk patients
- Understanding regional variations in heart disease presentation
- Clinical decision support systems
- Risk stratification and preventive care planning

---

## Getting Started

First, load the dataset and import your libraries:

In [4]:
import pandas as pd
import numpy as np
# Load dataset
df = pd.read_csv("heart_disease_uci.csv")

# Check first few rows
df.head()

|  | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |


## Part 1: The 10-Point Data Inspection (40 points)

Complete each inspection step and document your findings.

### Step 1: Shape (4 points)

**Your Code:**

In [5]:
df.shape

(920, 16)

**Your Findings:**
- How many rows (observations)? _______________
- How many columns (features)? _______________
- What does each row represent in clinical terms? _______________

### Step 2: Column Names (4 points)

**Your Code:**

In [6]:
df.columns.tolist()

['id',
 'age',
 'sex',
 'dataset',
 'cp',
 'trestbps',
 'chol',
 'fbs',
 'restecg',
 'thalch',
 'exang',
 'oldpeak',
 'slope',
 'ca',
 'thal',
 'num']

**Your Findings:**
- List all column names:

- Which columns might need further research to understand? (Hint: Many use medical abbreviations!)


### Step 3: Data Types (4 points)


In [7]:
df.dtypes

id            int64
age           int64
sex          object
dataset      object
cp           object
trestbps    float64
chol        float64
fbs          object
restecg      object
thalch      float64
exang        object
oldpeak     float64
slope        object
ca          float64
thal         object
num           int64
dtype: object

**Your Findings:**
- Which columns are numeric (int64 or float64)?

- Which columns are categorical (object/string)?

- Are there any data types that seem incorrect based on what you know about the data?
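
One way to think about the last question: columns such as `fbs` and `exang` hold True/False values but are stored as `object`, most likely because they also contain missing entries. The sketch below, on a toy frame rather than the real file, shows how pandas' nullable `boolean` and `category` dtypes could represent such columns. Whether you convert them in your own analysis is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy rows imitating two object-typed columns reported by df.dtypes.
sample = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "fbs": pd.Series([True, False, np.nan], dtype="object"),
})

# The nullable boolean dtype keeps the missing entry as <NA>.
sample["fbs"] = sample["fbs"].astype("boolean")

# A fixed set of labels fits the category dtype.
sample["sex"] = sample["sex"].astype("category")

print(sample.dtypes)
```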

### Step 4: First Look (4 points)


In [8]:
df.head(10)

|  | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |
| 5 | 6 | 56 | Male | Cleveland | atypical angina | 120.0 | 236.0 | False | normal | 178.0 | False | 0.8 | upsloping | 0.0 | normal | 0 |
| 6 | 7 | 62 | Female | Cleveland | asymptomatic | 140.0 | 268.0 | False | lv hypertrophy | 160.0 | False | 3.6 | downsloping | 2.0 | normal | 3 |
| 7 | 8 | 57 | Female | Cleveland | asymptomatic | 120.0 | 354.0 | False | normal | 163.0 | True | 0.6 | upsloping | 0.0 | normal | 0 |
| 8 | 9 | 63 | Male | Cleveland | asymptomatic | 130.0 | 254.0 | False | lv hypertrophy | 147.0 | False | 1.4 | flat | 1.0 | reversable defect | 2 |
| 9 | 10 | 53 | Male | Cleveland | asymptomatic | 140.0 | 203.0 | True | lv hypertrophy | 155.0 | True | 3.1 | downsloping | 0.0 | reversable defect | 1 |


**Your Findings:**
- What do the actual values look like?

- Do you notice any categorical variables that are already human-readable vs. encoded?

- Are there any values that look like they might be placeholders for missing data?

### Step 5: Last Look (4 points)


In [9]:
df.tail(10)

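|  | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 910 | 911 | 51 | Female | VA Long Beach | asymptomatic | 114.0 | 258.0 | True | lv hypertrophy | 96.0 | False | 1.0 | upsloping | NaN | NaN | 0 |
| 911 | 912 | 62 | Male | VA Long Beach | asymptomatic | 160.0 | 254.0 | True | st-t abnormality | 108.0 | True | 3.0 | flat | NaN | NaN | 4 |
| 912 | 913 | 53 | Male | VA Long Beach | asymptomatic | 144.0 | 300.0 | True | st-t abnormality | 128.0 | True | 1.5 | flat | NaN | NaN | 3 |
| 913 | 914 | 62 | Male | VA Long Beach | asymptomatic | 158.0 | 170.0 | False | st-t abnormality | 138.0 | True | 0.0 | NaN | NaN | NaN | 1 |
| 914 | 915 | 46 | Male | VA Long Beach | asymptomatic | 134.0 | 310.0 | False | normal | 126.0 | False | 0.0 | NaN | NaN | normal | 2 |
| 915 | 916 | 54 | Female | VA Long Beach | asymptomatic | 127.0 | 333.0 | True | st-t abnormality | 154.0 | False | 0.0 | NaN | NaN | NaN | 1 |
| 916 | 917 | 62 | Male | VA Long Beach | typical angina | NaN | 139.0 | False | st-t abnormality | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 917 | 918 | 55 | Male | VA Long Beach | asymptomatic | 122.0 | 223.0 | True | st-t abnormality | 100.0 | False | 0.0 | NaN | NaN | fixed defect | 2 |
| 918 | 919 | 58 | Male | VA Long Beach | asymptomatic | NaN | 385.0 | True | lv hypertrophy | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 919 | 920 | 62 | Male | VA Long Beach | atypical angina | 120.0 | 254.0 | False | lv hypertrophy | 93.0 | True | 0.0 | NaN | NaN | NaN | 1 |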


**Your Findings:**
- Does the data end cleanly?

- Are the last rows consistent with the first rows?

- Do you notice more or fewer missing values in later rows?
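
Since the first rows come from Cleveland and the last rows from VA Long Beach, one way to check whether missingness differs by source is to group the missing-value indicator by the `dataset` column. The sketch below uses four rows abridged from the head and tail outputs (a subset of columns), so the fractions it prints describe the toy sample only, not the full file.

```python
import numpy as np
import pandas as pd

# Four rows abridged from the head/tail outputs (subset of columns).
sample = pd.DataFrame({
    "dataset": ["Cleveland", "Cleveland", "VA Long Beach", "VA Long Beach"],
    "trestbps": [145.0, 160.0, np.nan, 122.0],
    "ca": [0.0, 3.0, np.nan, np.nan],
})

# Fraction of missing values per column within each source institution.
missing_by_site = (
    sample.drop(columns="dataset")
    .isna()
    .groupby(sample["dataset"])
    .mean()
)
print(missing_by_site)
```

On the real data, replacing `sample` with `df` should give the same kind of table for all four institutions.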

### Step 6: Memory Usage (4 points)


In [10]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  861 non-null    float64
 6   chol      890 non-null    float64
 7   fbs       830 non-null    object 
 8   restecg   918 non-null    object 
 9   thalch    865 non-null    float64
 10  exang     865 non-null    object 
 11  oldpeak   858 non-null    float64
 12  slope     611 non-null    object 
 13  ca        309 non-null    float64
 14  thal      434 non-null    object 
 15  num       920 non-null    int64  
dtypes: float64(5), int64(3), object(8)
memory usage: 413.6 KB


**Your Findings:**
- How much memory does the dataset use? _______________ MB
- Is this a "small" or "large" dataset by data science standards?
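
`df.info()` reports memory in KB here, while the question asks for MB. A minimal sketch of the conversion follows, assuming 1 MB = 1024 KB. The toy frame only demonstrates the `memory_usage` call; its size says nothing about the real dataset.

```python
import pandas as pd

# Toy frame; with the real data, use `df` instead of `sample`.
sample = pd.DataFrame({"age": [63, 67, 37], "sex": ["Male", "Male", "Female"]})

# memory_usage(deep=True) returns bytes per column (plus the index).
total_bytes = sample.memory_usage(deep=True).sum()
total_mb = total_bytes / 1024**2

# Converting the figure reported by df.info() above: 413.6 KB to MB.
reported_mb = 413.6 / 1024

print(f"toy sample: {total_mb:.6f} MB")
print(f"reported by df.info(): {reported_mb:.3f} MB")
```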

### Step 7: Missing Values (4 points)


In [11]:
missing_counts = df.isnull().sum()
missing_percent = (missing_counts / len(df)) * 100

pd.DataFrame({
    "Missing Count": missing_counts,
    "Missing %": missing_percent.round(2)
})


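|  | Missing Count | Missing % |
|---|---|---|
| id | 0 | 0.0 |
| age | 0 | 0.0 |
| sex | 0 | 0.0 |
| dataset | 0 | 0.0 |
| cp | 0 | 0.0 |
| trestbps | 59 | 6.41 |
| chol | 30 | 3.26 |
| fbs | 90 | 9.78 |
| restecg | 2 | 0.22 |
| thalch | 55 | 5.98 |
| exang | 55 | 5.98 |
| oldpeak | 62 | 6.74 |
| slope | 309 | 33.59 |
| ca | 611 | 66.41 |
| thal | 486 | 52.83 |
| num | 0 | 0.0 |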


**Your Findings:**
- Which columns have missing values?

- What percentage of each column is missing?

- Which columns have the MOST missing data? What might explain this?
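
To answer the "MOST missing" question at a glance, the same table can be sorted so the worst columns appear first. The helper below is a sketch. `missing_summary` is a hypothetical name, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages, worst columns first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Missing %": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("Missing Count", ascending=False)

# Toy frame for illustration; with the real data, call missing_summary(df).
sample = pd.DataFrame({
    "age": [63, 67, 62, 55],
    "slope": ["downsloping", "flat", np.nan, np.nan],
    "ca": [0.0, np.nan, np.nan, np.nan],
})
summary = missing_summary(sample)
print(summary)
```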


### Step 8: Duplicates (4 points)


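In [12]:
# Print both checks; a notebook cell only displays its last expression
print("Duplicate rows:", df.duplicated().sum())
print("Unique IDs:", df["id"].nunique(), "of", len(df), "rows")

Duplicate rows: 0
Unique IDs: 920 of 920 rows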

**Your Findings:**
- Are there any duplicate rows? _______________
- Are all patient IDs unique? _______________
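
One caveat worth considering: because `id` is unique for every row, a full-row check can never flag two rows as identical. A possible extra check, sketched below on toy data, drops `id` first so that repeated measurements under different IDs become visible. Whether such repeats exist in the real file is not something this example establishes.

```python
import pandas as pd

# Toy frame: rows 0 and 2 hold identical measurements under different IDs.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [63, 67, 63],
    "chol": [233.0, 286.0, 233.0],
})

# Full-row comparison includes `id`, so nothing matches.
full_row_dupes = sample.duplicated().sum()

# Ignoring `id` exposes the repeated record.
dupes_ignoring_id = sample.drop(columns="id").duplicated().sum()

print(full_row_dupes, dupes_ignoring_id)
```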


### Step 9: Basic Statistics (4 points)


In [13]:
df.describe()


|  | id | age | trestbps | chol | thalch | oldpeak | ca | num |
|---|---|---|---|---|---|---|---|---|
| count | 920.0 | 920.0 | 861.0 | 890.0 | 865.0 | 858.0 | 309.0 | 920.0 |
| mean | 460.5 | 53.51087 | 132.132404 | 199.130337 | 137.545665 | 0.878788 | 0.676375 | 0.995652 |
| std | 265.725422 | 9.424685 | 19.06607 | 110.78081 | 25.926276 | 1.091226 | 0.935653 | 1.142693 |
| min | 1.0 | 28.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 | 0.0 |
| 25% | 230.75 | 47.0 | 120.0 | 175.0 | 120.0 | 0.0 | 0.0 | 0.0 |
| 50% | 460.5 | 54.0 | 130.0 | 223.0 | 140.0 | 0.5 | 0.0 | 1.0 |
| 75% | 690.25 | 60.0 | 140.0 | 268.0 | 157.0 | 1.5 | 1.0 | 2.0 |
| max | 920.0 | 77.0 | 200.0 | 603.0 | 202.0 | 6.2 | 3.0 | 4.0 |


**Your Findings:**
- What is the age range in the dataset? _______________ to _______________
- What is the range of resting blood pressure (trestbps)? _______________ to _______________
- What is the range of serum cholesterol (chol)? _______________ to _______________
- What is the range of maximum heart rate achieved (thalch)? _______________ to _______________
- Do any min/max values seem impossible or clinically unlikely?
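
The minimum of 0.0 for both `trestbps` and `chol` is a natural place to start. The sketch below counts zeros per column and then recodes them as missing on a copy. It assumes a zero is a placeholder rather than a real measurement, which you should justify in your findings before applying it to the real data.

```python
import numpy as np
import pandas as pd

# Toy values; the real columns report a minimum of 0.0 in df.describe().
sample = pd.DataFrame({
    "trestbps": [145.0, 0.0, 130.0],
    "chol": [233.0, 0.0, 0.0],
})

# Count how many zeros each column holds before changing anything.
zero_counts = (sample == 0).sum()

# Assumption: a zero is a placeholder, so recode it as NaN in a new frame.
cleaned = sample.replace(0, np.nan)

print(zero_counts)
print(cleaned.describe().loc["min"])
```

`replace` returns a new frame, so the original values stay available for comparison.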


### Step 10: Unique Counts (4 points)


In [14]:
df.nunique().sort_values()



sex           2
fbs           2
exang         2
restecg       3
slope         3
thal          3
cp            4
dataset       4
ca            4
num           5
age          50
oldpeak      53
trestbps     61
thalch      119
chol        217
id          920
dtype: int64

**Your Findings:**
- Which columns have very few unique values (likely categorical)?

- Which columns have many unique values (likely continuous or IDs)?

- Does the number of unique IDs match the number of rows? _______________
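
For the low-cardinality columns, listing the actual labels and their frequencies is often more informative than the count alone. The sketch below uses a toy frame with category labels seen in the outputs above. Note that `nunique()` ignores missing values by default, so `dropna=False` is passed to make them visible.

```python
import numpy as np
import pandas as pd

# Toy rows using category labels seen in the outputs above.
sample = pd.DataFrame({
    "cp": ["typical angina", "asymptomatic", "asymptomatic", "non-anginal"],
    "thal": ["fixed defect", "normal", np.nan, "normal"],
})

# dropna=False keeps missing entries as their own row in the tally.
for col in ["cp", "thal"]:
    print(sample[col].value_counts(dropna=False), end="\n\n")

cp_counts = sample["cp"].value_counts(dropna=False)

# nunique() skips NaN unless told otherwise.
print(sample["thal"].nunique(), sample["thal"].nunique(dropna=False))
```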

## Part 2: Data Dictionary (20 points)
