# **1. Dataset Introduction (Heart Disease / UCI)**
Dataset: *Heart Disease*, from the UCI Machine Learning Repository.

**Goal**: predict whether a patient has heart disease based on clinical features.

**Notes on the label/target**:
- Some versions of the dataset use a `target` column (0/1).
- The classic UCI version uses `num` (0–4), which is usually converted to binary: `num > 0` → 1 (disease).

**File location** (recommended):
- Store the raw CSV at `../namadataset_raw/heart.csv` or `../namadataset_raw/heart_disease.csv`.

If you run this notebook on Kaggle/Colab without internet access, simply **upload** the CSV file to that folder.

As a first step, make sure your dataset meets the requirements (tabular, has a target column, reasonable size).

# **2. Import Libraries**
In this step we import the libraries used for data loading, EDA, and preprocessing.

Main libraries: `pandas`, `numpy`, `matplotlib`, and `scikit-learn`.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Keep wide DataFrame output readable
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 120)


# **3. Loading the Dataset**
We look for the CSV file in the `namadataset_raw` folder.

If your file has a different name, add its path to `CANDIDATE_PATHS`; `DATA_PATH` is set to the first candidate that exists.

In [None]:
# Adjust to match your file (usually one of these exists)
CANDIDATE_PATHS = [
    os.path.join("..", "namadataset_raw", "heart.csv"),
    os.path.join("..", "namadataset_raw", "heart_disease.csv"),
    os.path.join("..", "namadataset_raw", "dataset.csv"),
]

DATA_PATH = None
for p in CANDIDATE_PATHS:
    if os.path.exists(p):
        DATA_PATH = p
        break

if DATA_PATH is None:
    raise FileNotFoundError(
        "Dataset file not found. Upload or place the CSV in ../namadataset_raw/ "
        "named heart.csv, heart_disease.csv, or dataset.csv"
    )

df = pd.read_csv(DATA_PATH)
print("✅ Loaded:", DATA_PATH)
df.head()
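
Some raw exports of the classic UCI files mark missing values with `?`, which makes pandas read the affected columns as text. The sketch below shows one way to handle that at load time with the `na_values` argument of `pd.read_csv`. The in-memory CSV and the name `toy_df` are made up for illustration; whether your file uses `?` is an assumption you should check first.

```python
import io
import pandas as pd

# Tiny made-up CSV that mimics the "?" missing-value marker; not real patient data.
raw_csv = io.StringIO(
    "age,ca,thal,num\n"
    "63,0,6,0\n"
    "67,?,3,2\n"
    "41,1,?,0\n"
)

# na_values tells pandas to parse "?" as NaN, so `ca` and `thal` stay numeric.
toy_df = pd.read_csv(raw_csv, na_values="?")

print(toy_df.dtypes)
print(toy_df.isna().sum())
```

If this applies to your file, the same argument can be passed to the `pd.read_csv(DATA_PATH)` call above.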


# **4. Exploratory Data Analysis (EDA)**
Goal of the EDA: understand the structure of the data, check for missing values, check the target distribution, and review summary statistics.

In [None]:
# 4.1 Basic structure
print("Shape:", df.shape)
display(df.sample(5, random_state=42))

print("\nInfo:")
df.info()

print("\nMissing values per column:")
display(df.isna().sum().sort_values(ascending=False).head(20))

print("\nNumeric summary statistics:")
display(df.describe(include=[np.number]).T)


In [None]:
# 4.2 Detect the target column (commonly 'target' or 'num')
possible_targets = [c for c in df.columns if c.lower() in ["target", "num", "diagnosis", "label"]]
print("Candidate target columns:", possible_targets)

# Priority order: target > num > first remaining candidate
target_col = None
if "target" in df.columns:
    target_col = "target"
elif "num" in df.columns:
    target_col = "num"
elif len(possible_targets) > 0:
    target_col = possible_targets[0]

if target_col is None:
    raise ValueError("Target column not found. Make sure the dataset has a label column such as 'target' or 'num'.")

print("✅ target_col =", target_col)

# Inspect the target distribution
display(df[target_col].value_counts(dropna=False).to_frame("count"))


In [None]:
# 4.3 Simple visualization (optional, but helpful for reviewers)
# Target distribution (if the target is num 0-4, all levels are shown)
plt.figure(figsize=(6,4))
df[target_col].value_counts().sort_index().plot(kind="bar")
plt.title("Target Distribution")
plt.xlabel(target_col)
plt.ylabel("Count")
plt.show()


# **5. Data Preprocessing**
We prepare the data for training:
- Separate features (X) and labels (y)
- Handle missing values
- Encode categorical features (one-hot)
- Scale numeric features
- Split into train/test sets
- Save the output to the `namadataset_preprocessing` folder.

In [None]:
# 5.1 Separate X and y
X = df.drop(columns=[target_col]).copy()
y_raw = df[target_col].copy()

# If the target takes values 0-4 (UCI 'num'), convert it to binary (0 = no disease, >0 = disease)
if y_raw.nunique() > 2:
    y = (y_raw.astype(float) > 0).astype(int)
    print("Target converted to binary from", target_col, "(value > 0 -> 1).")
else:
    y = y_raw.astype(int)

print("X shape:", X.shape, "| y shape:", y.shape)
print("Distribution of y (binary):")
display(pd.Series(y).value_counts().to_frame("count"))


In [None]:
# 5.2 Identify numeric and categorical columns (by dtype)
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

print("Numeric cols:", len(numeric_cols), numeric_cols[:10])
print("Categorical cols:", len(categorical_cols), categorical_cols[:10])

# Preprocessing pipeline for each data type
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
    remainder="drop"
)


In [None]:
# 5.3 Train/test split (stratified so both sets keep the same class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y if pd.Series(y).nunique() == 2 else None
)

print("Train:", X_train.shape, "Test:", X_test.shape)
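
To see what `stratify` does in the split above, the following sketch runs the same call on made-up labels with a 70/30 class balance. The arrays `toy_X` and `toy_y` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70 negatives and 30 positives.
toy_y = np.array([0] * 70 + [1] * 30)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, the positive rate in each split matches the full data (30%).
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, the two rates can drift apart by chance, which matters more on small datasets.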


In [None]:
# 5.4 Fit and transform the training data; only transform the test data
X_train_proc = preprocessor.fit_transform(X_train)
X_test_proc = preprocessor.transform(X_test)

print("X_train_proc shape:", X_train_proc.shape)
print("X_test_proc shape:", X_test_proc.shape)
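
The reason for calling `fit_transform` on the training set and only `transform` on the test set is that the imputation and scaling statistics should come from training rows alone. A minimal sketch with `StandardScaler` on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data; the test value lies outside the training range.
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[10.0]])

# Statistics are learned from the training rows only.
scaler = StandardScaler().fit(toy_train)

print(scaler.mean_)                # mean of toy_train
print(scaler.transform(toy_test))  # test row scaled with the training mean and std
```

Fitting on the full data instead would let test-set values influence the mean and standard deviation, which leaks information into training.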


In [None]:
# 5.5 Save the preprocessing output
# Save as training-ready CSV files:
# - The one-hot output can be sparse, so convert it to a dense array and then to a DataFrame.
# - One-hot column names are taken from the OneHotEncoder.

# Collect feature names (numeric columns first, matching the ColumnTransformer order)
feature_names = []
if len(numeric_cols) > 0:
    feature_names += numeric_cols

if len(categorical_cols) > 0:
    ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
    ohe_names = ohe.get_feature_names_out(categorical_cols).tolist()
    feature_names += ohe_names

def to_dense_df(X_proc, feature_names):
    # ColumnTransformer may return a SciPy sparse matrix; densify it before building the DataFrame.
    X_dense = X_proc.toarray() if hasattr(X_proc, "toarray") else np.asarray(X_proc)
    return pd.DataFrame(X_dense, columns=feature_names)

train_df = to_dense_df(X_train_proc, feature_names)
train_df["label"] = y_train.values

test_df = to_dense_df(X_test_proc, feature_names)
test_df["label"] = y_test.values

out_dir = "namadataset_preprocessing"
os.makedirs(out_dir, exist_ok=True)  # to_csv does not create missing directories

out_train = os.path.join(out_dir, "train_preprocessed.csv")
out_test  = os.path.join(out_dir, "test_preprocessed.csv")

train_df.to_csv(out_train, index=False)
test_df.to_csv(out_test, index=False)

print("✅ Saved:", out_train, "shape:", train_df.shape)
print("✅ Saved:", out_test, "shape:", test_df.shape)
train_df.head()
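
A quick round-trip check can confirm that a saved CSV reloads with the expected columns and labels. The sketch below uses a made-up stand-in frame (`toy_train_df`) and a temporary directory, so it does not touch the real output files.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-in for `train_df`; the real frame comes from the cell above.
toy_train_df = pd.DataFrame({"age": [0.5, -1.2, 0.7], "label": [1, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_preprocessed.csv")
    toy_train_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Column order, feature values, and integer labels should survive the round trip.
print(list(reloaded.columns) == list(toy_train_df.columns))
print(np.allclose(reloaded["age"], toy_train_df["age"]))
print(reloaded["label"].tolist())
```

The same comparison could be run against `out_train` and `train_df` after saving.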
