# Step 1 — Load & Basic Cleaning (Cleveland Heart Dataset)

**Goal**
- Load `cleveland.csv` (renamed from `processed.cleveland.data`).
- Add official column names.
- Make missing values explicit (`?` → `NaN`) and ensure numeric dtypes.
- Snapshot: `shape`, `info()`, `describe()`, missing-values table, target distribution.

> The raw label `target` takes values 0–4 (0 = no disease, 1–4 = disease present). We’ll binarize it in a later step.
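A minimal sketch of what that later binarization might look like, assuming the convention stated above (0 stays 0, anything from 1 to 4 becomes 1). The sample values and the name `target_bin` are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Illustrative stand-in for the raw label column (values 0-4); not real data.
raw_target = pd.Series([0, 2, 1, 0, 4, 3], name="target")

# Any value above 0 is treated as "disease present".
target_bin = (raw_target > 0).astype(int)

print(target_bin.value_counts().sort_index())
```

Comparing against 0 rather than mapping each level explicitly keeps the rule in one place, so it still holds if a level happens to be absent from a subset of the data.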


In [11]:
# Import libraries
import numpy as np
import pandas as pd

In [None]:
# Load the dataset (no headers in file)
df = pd.read_csv("../data/cleveland.csv", header=None)

print("Shape:", df.shape)   
df.head()

Shape: (303, 14)


      0    1    2      3      4    5    6      7    8    9   10   11   12  13
0  63.0  1.0  1.0  145.0  233.0  1.0  2.0  150.0  0.0  2.3  3.0  0.0  6.0   0
1  67.0  1.0  4.0  160.0  286.0  0.0  2.0  108.0  1.0  1.5  2.0  3.0  3.0   2
2  67.0  1.0  4.0  120.0  229.0  0.0  2.0  129.0  1.0  2.6  2.0  2.0  7.0   1
3  37.0  1.0  3.0  130.0  250.0  0.0  0.0  187.0  0.0  3.5  3.0  0.0  3.0   0
4  41.0  0.0  2.0  130.0  204.0  0.0  2.0  172.0  0.0  1.4  1.0  0.0  3.0   0


In [15]:
# Assign official UCI column names
df.columns = [
    "age","sex","cp","trestbps","chol","fbs","restecg",
    "thalach","exang","oldpeak","slope","ca","thal","target"
]

In [16]:
# Handle missing values and enforce numeric types
# '?' → NaN
df = df.replace("?", np.nan)

for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors="coerce")
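As an alternative sketch, the loading, naming, and `?` handling can usually be folded into a single `pd.read_csv` call by passing `names=` and `na_values="?"`. The example below reads from an in-memory string so it runs without the file. The first row is copied from the preview above, the second is invented to include a `?`, and `COLUMNS` and `df_alt` are hypothetical names.

```python
import io

import pandas as pd

COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]

# In-memory stand-in for the CSV file. The second row is made up
# so that the 'ca' field contains the '?' placeholder.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0\n"
)

# names= assigns the headers; na_values="?" turns the placeholder into NaN
# at parse time, so the affected columns are read as float directly.
df_alt = pd.read_csv(sample, header=None, names=COLUMNS, na_values="?")

print(df_alt.dtypes)
print("Missing in ca:", df_alt["ca"].isna().sum())
```

The two-step approach in the cell above has one advantage: `errors="coerce"` also converts any other unexpected non-numeric token to `NaN`, whereas `na_values` only handles the tokens you list.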

In [18]:
# Structure & dtypes
df.info()

# Summary stats for all columns
print("\n=== describe() ===")
print(df.describe())

# Missing values per column
print("\n=== Missing values per column ===")
print(df.isna().sum().sort_values(ascending=False))

# Target class distribution (raw 0–4)
print("\n=== Target distribution (raw 0–4) ===")
print(df["target"].value_counts().sort_index())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
 13  target    303 non-null    int64  
dtypes: float64(13), int64(1)
memory usage: 33.3 KB

=== describe() ===
              age         sex          cp    trestbps        chol         fbs  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000   
mean    54.438944    0.679868    3.15