# **1. Dataset Introduction**

## **Heart Disease Dataset (Cleveland)**

This experiment uses the **Heart Disease Dataset** from the **UCI Machine Learning Repository**, one of the most widely used medical datasets for heart disease classification.

### **Dataset Description:**
- **Source**: UCI Machine Learning Repository - Cleveland Heart Disease Database
- **URL**: https://archive.ics.uci.edu/ml/datasets/heart+Disease
- **Number of Samples**: 303 patients
- **Number of Features**: 13 medical features + 1 target variable
- **Problem Type**: Binary classification (heart disease present or not)
- **Target Variable** (after binarization; raw values 1-4 are merged into class 1):
  - `0` = No disease
  - `1` = Disease present

### **Features in the Dataset:**

1. **age**: Patient age (in years)
2. **sex**: Sex (1 = male, 0 = female)
3. **cp**: Chest pain type
   - 0: Typical angina
   - 1: Atypical angina
   - 2: Non-anginal pain
   - 3: Asymptomatic
4. **trestbps**: Resting blood pressure (mm Hg)
5. **chol**: Serum cholesterol (mg/dl)
6. **fbs**: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
7. **restecg**: Resting electrocardiographic results
   - 0: Normal
   - 1: ST-T wave abnormality
   - 2: Left ventricular hypertrophy
8. **thalach**: Maximum heart rate achieved
9. **exang**: Exercise-induced angina (1 = yes, 0 = no)
10. **oldpeak**: ST depression induced by exercise relative to rest
11. **slope**: Slope of the peak exercise ST segment
    - 0: Upsloping
    - 1: Flat
    - 2: Downsloping
12. **ca**: Number of major vessels (0-3) colored by fluoroscopy
13. **thal**: Thallium stress test result (often labeled "thalassemia")
    - 1: Normal
    - 2: Fixed defect
    - 3: Reversible defect
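
The feature list above mixes continuous measurements with coded categories. The snippet below is a minimal sketch of how the 13 columns might be grouped for later preprocessing. The grouping itself and the helper name `check_columns` are assumptions made for illustration; they are not part of the original notebook.

```python
# Continuous measurements (assumed grouping based on the feature descriptions).
NUMERIC_FEATURES = ["age", "trestbps", "chol", "thalach", "oldpeak"]

# Binary flags, coded categories, and the vessel count.
CATEGORICAL_FEATURES = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

ALL_FEATURES = NUMERIC_FEATURES + CATEGORICAL_FEATURES


def check_columns(columns):
    """Return expected feature names that are absent from `columns` (hypothetical helper)."""
    present = set(columns)
    return [name for name in ALL_FEATURES if name not in present]


# Example: a column list that lacks 'thal' is reported as incomplete.
missing = check_columns(["age", "sex", "cp", "trestbps", "chol", "fbs",
                         "restecg", "thalach", "exang", "oldpeak", "slope", "ca"])
print(missing)  # ['thal']
```

A check like this could be run right after loading, for example as `check_columns(df.columns)`, to catch renamed or missing columns early.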

### **Experiment Goal:**
Build a machine learning model to **predict whether a person has heart disease** based on the 13 medical features above.

### **Why Was This Dataset Chosen?**
- A medical dataset with important real-world applications
- Moderate size (303 samples), suitable for experimentation
- Binary classification, a classic problem that is easy to understand
- Contains missing values, which gives an opportunity to practice data cleaning
- Publicly available in the UCI ML Repository


The dataset was obtained from the **UCI Machine Learning Repository**, a large and widely trusted repository of machine learning data. The Heart Disease (Cleveland) dataset has been used in many scientific studies and is a standard benchmark for heart disease classification.


# **2. Importing Libraries**

The libraries used are pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for visualization, and scikit-learn for preprocessing and modeling.


In [None]:
# Libraries for data manipulation and analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Libraries for preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Libraries for modeling
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Libraries for evaluation
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Libraries for saving models and files
import joblib
import pickle
import os

# Set the visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ All libraries imported successfully!")

# **3. Loading the Dataset**

The Heart Disease dataset is loaded with pandas from a local raw CSV file derived from the UCI ML Repository. Missing values (marked with '?' in the source data) are handled, and the target is then converted to a binary label to simplify modeling.


In [None]:
# Load the raw Heart Disease dataset from a local file
import pandas as pd
import os

# Read the existing CSV; '?' marks missing values in the source data
df = pd.read_csv('../heart_disease_raw.csv', na_values='?')

# Convert the target to binary classification if it is not already binary
# 0 = no disease, 1-4 = disease present -> 0 = no disease, 1 = disease
df['target'] = (df['target'] > 0).astype(int)

print("="*70)
print("HEART DISEASE DATASET - LOADING")
print("="*70)
print(f"\n📊 Dataset Shape: {df.shape}")
print(f"📝 Number of samples: {df.shape[0]}")
print(f"📝 Number of features: {df.shape[1] - 1} (+ 1 target)")
print("\n✅ Dataset loaded successfully!")
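
To make the loading step above concrete, here is a minimal sketch of how `'?'` placeholders become `NaN` and how graded target values collapse to a binary label. The three rows are made up for illustration (they are not real patient records), and the headerless layout with explicit `names=` is an assumption; the notebook's own CSV may already include a header row.

```python
import io

import pandas as pd

# Column names assumed from the feature list in Section 1, plus the target.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Three made-up rows for illustration only.
raw_text = """\
63,1,3,145,233,1,2,150,0,2.3,2,0,1,0
67,1,3,160,286,0,2,108,1,1.5,1,?,2,2
41,0,1,130,204,0,2,172,0,1.4,0,0,?,4
"""

# na_values="?" tells pandas to parse the placeholder as NaN instead of text.
sample = pd.read_csv(io.StringIO(raw_text), names=COLUMNS, na_values="?")

# Collapse severity grades 1-4 into a single positive class.
sample["target"] = (sample["target"] > 0).astype(int)

print(sample[["ca", "thal", "target"]])
print(sample.isnull().sum().sum())  # 2 missing cells
```

Without `na_values`, a column containing `'?'` is read as strings, which would later break numeric steps such as `describe()` or `corr()`.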

### 3.1 Informasi Dasar Dataset

In [None]:
# Show basic information about the dataset
print("="*70)
print("DATASET INFORMATION")
print("="*70)
print("\nColumn Names & Data Types:")
print(df.dtypes)
print(f"\nDataset Shape: {df.shape}")
print(f"Total Rows: {len(df)}")
print(f"Total Columns: {len(df.columns)}")
print(f"\nColumn Names: {list(df.columns)}")

### 3.2 Distribusi Target Variable

In [None]:
# Analyze the distribution of the target variable
print("="*70)
print("TARGET VARIABLE DISTRIBUTION")
print("="*70)
print("\nTarget Value Counts:")
print(df['target'].value_counts().sort_index())

print(f"\n   - Class 0 (No Disease): {(df['target']==0).sum()} patients ({(df['target']==0).sum()/len(df)*100:.1f}%)")
print(f"   - Class 1 (Disease):    {(df['target']==1).sum()} patients ({(df['target']==1).sum()/len(df)*100:.1f}%)")

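# Visualize the target distribution
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
# sort_index() keeps the bars in class order (0, 1) so they match the tick labels below
df['target'].value_counts().sort_index().plot(kind='bar', color=['lightblue', 'lightcoral'])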
plt.title('Distribution of Target Variable', fontsize=14, fontweight='bold')
plt.xlabel('Target (0=No Disease, 1=Disease)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks([0, 1], ['No Disease', 'Disease'], rotation=0)
plt.tight_layout()
plt.show()
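
One way to read the class distribution above is through the majority-class baseline: the accuracy a model would reach by always predicting the most frequent class. The sketch below uses toy labels, and `majority_baseline` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd


def majority_baseline(target: pd.Series) -> float:
    """Accuracy of always predicting the most frequent class (hypothetical helper)."""
    return float(target.value_counts(normalize=True).max())


# Toy labels for illustration only: 6 negatives, 4 positives.
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
print(f"Majority-class baseline accuracy: {majority_baseline(toy_target):.2f}")  # 0.60
```

Applied to the notebook's data as `majority_baseline(df['target'])`, this gives a floor that any trained classifier should clearly beat.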

### 3.3 Preview Data (Head & Tail)

In [None]:
# Show the first and last 5 rows
print("="*70)
print("FIRST AND LAST 5 ROWS OF DATASET")
print("="*70)
display(df.head())
display(df.tail())

# **4. Exploratory Data Analysis (EDA)**

Exploratory Data Analysis (EDA) is carried out to understand the characteristics of the dataset in depth. This stage covers descriptive statistics, data distributions, correlations between features, outlier detection, and the relationship between features and the target variable. EDA guides the choice of preprocessing steps and helps reveal patterns in the data.


In [None]:
# ========================================
# 4.1 Basic Dataset Information
# ========================================
print("="*60)
print("BASIC DATASET INFORMATION - HEART DISEASE")
print("="*60)
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print(f"\nColumn names:\n{df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")

# ========================================
# 4.2 Descriptive Statistics
# ========================================
print("\n" + "="*60)
print("DESCRIPTIVE STATISTICS")
print("="*60)
display(df.describe())

# ========================================
# 4.3 Check Missing Values
# ========================================
print("\n" + "="*60)
print("MISSING VALUES")
print("="*60)
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percentage
})
if missing_df['Missing Count'].sum() == 0:
    print("✅ No missing values!")
else:
    # Show only the columns that actually contain missing values
    print(missing_df[missing_df['Missing Count'] > 0])
    print(f"\n⚠️ Total missing values: {missing_df['Missing Count'].sum()}")

# ========================================
# 4.4 Check for Duplicate Rows
# ========================================
print("\n" + "="*60)
print("DUPLICATE DATA")
print("="*60)
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
print(f"Duplicate percentage: {(duplicates/len(df)*100):.2f}%")

# ========================================
# 4.5 Target Variable Distribution (Heart Disease)
# ========================================
print("\n" + "="*60)
print("TARGET VARIABLE DISTRIBUTION (HEART DISEASE)")
print("="*60)
print(df['target'].value_counts().sort_index())

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
df['target'].value_counts().sort_index().plot(kind='bar', color=['skyblue', 'salmon'], edgecolor='black')
plt.title('Heart Disease Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Disease Status (0=No Disease, 1=Disease)')
plt.ylabel('Frequency')
plt.xticks([0, 1], ['No Disease', 'Disease'], rotation=0)

plt.subplot(1, 2, 2)
df['target'].value_counts().sort_index().plot(kind='pie', autopct='%1.1f%%', startangle=90,
                                              colors=['skyblue', 'salmon'])
plt.title('Heart Disease Proportion', fontsize=14, fontweight='bold')
plt.ylabel('')
plt.legend(['No Disease', 'Disease'], loc='best')
plt.tight_layout()
plt.show()

# ========================================
# 4.6 Numeric Feature Distributions
# ========================================
print("\n" + "="*60)
print("NUMERIC FEATURE DISTRIBUTIONS")
print("="*60)

# Plot a histogram for every column
df.hist(bins=20, figsize=(20, 12), edgecolor='black', color='lightblue')
plt.suptitle('Distribution of All Features - Heart Disease Dataset', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

# ========================================
# 4.7 Correlation Matrix
# ========================================
print("\n" + "="*60)
print("CORRELATION MATRIX")
print("="*60)

plt.figure(figsize=(14, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Heart Disease Dataset', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Correlation with the target variable
print("\nCorrelation with Target (Heart Disease) - sorted:")
target_corr = correlation_matrix['target'].sort_values(ascending=False)
print(target_corr)

# ========================================
# 4.8 Outlier Detection with Boxplots
# ========================================
print("\n" + "="*60)
print("OUTLIER DETECTION")
print("="*60)

features = df.columns.drop('target')  # All columns except the target, regardless of its position
fig, axes = plt.subplots(4, 4, figsize=(18, 16))
axes = axes.ravel()

for idx, col in enumerate(features):
    if idx < len(axes):
        axes[idx].boxplot(df[col].dropna(), vert=True, patch_artist=True,
                         boxprops=dict(facecolor='lightblue', color='black'),
                         medianprops=dict(color='red', linewidth=2))
        axes[idx].set_title(f'{col}', fontweight='bold')
        axes[idx].set_ylabel('Value')
        axes[idx].grid(True, alpha=0.3)

# Hide extra subplots
for idx in range(len(features), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Boxplots for Outlier Detection - Heart Disease', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# ========================================
# 4.9 Relationship Between Features and Target
# ========================================
print("\n" + "="*60)
print("RELATIONSHIP BETWEEN FEATURES AND TARGET (HEART DISEASE)")
print("="*60)

# Pick the 4 features with the strongest correlation to the target.
# Rank by absolute value so strong negative correlations are not ignored.
top_features = target_corr.drop('target').abs().sort_values(ascending=False).head(4).index.tolist()
print(f"Top 4 features by absolute correlation: {top_features}")

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for idx, feature in enumerate(top_features):
    axes[idx].scatter(df[feature], df['target'], alpha=0.5, c=df['target'], 
                     cmap='coolwarm', edgecolors='black', linewidth=0.5, s=50)
    axes[idx].set_xlabel(feature, fontweight='bold', fontsize=12)
    axes[idx].set_ylabel('Target (0=No Disease, 1=Disease)', fontweight='bold', fontsize=12)
    axes[idx].set_title(f'{feature} vs Heart Disease\n(Correlation: {target_corr[feature]:.3f})', 
                       fontweight='bold', fontsize=12)
    axes[idx].set_yticks([0, 1])
    axes[idx].set_yticklabels(['No Disease', 'Disease'])
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Exploratory Data Analysis completed!")
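
The boxplots above flag outliers visually. As a numeric complement, the sketch below counts values outside the 1.5 × IQR fences, which is the same rule matplotlib uses by default for boxplot whiskers. The cholesterol-like values are toy data, and `iqr_outlier_count` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN (hypothetical helper)."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())


# Toy values for illustration only; 564 sits far above the upper fence.
toy_chol = pd.Series([200, 210, 220, 230, 240, 250, 260, 564])
print(iqr_outlier_count(toy_chol))  # 1
```

On the notebook's data, something like `df[features].apply(iqr_outlier_count)` would give one count per column, assuming all feature columns are numeric.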


# **5. Data Preprocessing**

Data preprocessing is a crucial step for ensuring data quality before the data is used in a machine learning model. Raw data often contains missing values, duplicates, or inconsistent value ranges. This stage cleans and prepares the data so that it is ready for modeling.

The preprocessing steps are:
1. **Handling Missing Values** - Fill missing values using median imputation
2. **Removing Duplicates** - Drop duplicate rows to avoid bias
3. **Feature Scaling** - Standardize features with StandardScaler
4. **Train-Test Split** - Split the data with stratified sampling to preserve class proportions
5. **Data Saving** - Save the processed data for the modeling stage

Each step is tailored to the Heart Disease dataset, which is structured data with numeric features.
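
The notebook cell that follows fits the imputer and scaler on the full dataset before splitting. A common alternative, shown here as a minimal sketch rather than the notebook's method, is to split first and fit the preprocessing on the training rows only, so that test rows do not influence the learned medians, means, and standard deviations. The data is synthetic, and the names `X_demo`, `y_demo`, and `preprocess` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (random values, not the Heart Disease dataset).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(loc=50, scale=10, size=(100, 3)),
                      columns=["f1", "f2", "f3"])
X_demo.iloc[::10, 0] = np.nan  # inject a few missing values
y_demo = pd.Series([0, 1] * 50)

# Split first, so held-out rows never influence the fitted statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X_tr_scaled = preprocess.fit_transform(X_tr)  # learn medians, means, stds on train only
X_te_scaled = preprocess.transform(X_te)      # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 for each column
print(X_te_scaled.shape)
```

With this ordering the test-set means are generally not exactly zero, which is expected: the test rows are scaled with statistics they did not contribute to.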


In [None]:
# ========================================
# 5.1 Handling Missing Values
# ========================================
print("="*60)
print("STEP 1: HANDLING MISSING VALUES")
print("="*60)
print("Missing values before handling:")
print(df.isnull().sum())

# Impute missing values with the median of each affected column
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

# Get columns with missing values
cols_with_missing = df.columns[df.isnull().any()].tolist()
print(f"\nColumns with missing values: {cols_with_missing}")

if cols_with_missing:
    df[cols_with_missing] = imputer.fit_transform(df[cols_with_missing])
    print("\n✅ Missing values have been filled with the median")
else:
    print("\n✅ No missing values")

print("\nMissing values after handling:")
print(df.isnull().sum())

# ========================================
# 5.2 Handling Duplicate Data
# ========================================
print("\n" + "="*60)
print("STEP 2: HANDLING DUPLICATE DATA")
print("="*60)
print(f"Rows before removing duplicates: {len(df)}")
df_clean = df.drop_duplicates()
print(f"Rows after removing duplicates: {len(df_clean)}")
print(f"Duplicates removed: {len(df) - len(df_clean)}")

# ========================================
# 5.3 Target Distribution Visualization
# ========================================
print("\n" + "="*60)
print("STEP 3: TARGET VARIABLE DISTRIBUTION")
print("="*60)
print("Heart Disease Distribution (Binary Classification):")
print(df_clean['target'].value_counts().sort_index())
print(f"\nClass Balance:")
print(f"  - No Disease (0): {(df_clean['target']==0).sum()} ({(df_clean['target']==0).sum()/len(df_clean)*100:.1f}%)")
print(f"  - Disease (1):    {(df_clean['target']==1).sum()} ({(df_clean['target']==1).sum()/len(df_clean)*100:.1f}%)")

# Visualization
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
df_clean['target'].value_counts().sort_index().plot(kind='bar', color=['lightgreen', 'coral'], edgecolor='black')
plt.title('Heart Disease Distribution (Binary)', fontweight='bold')
plt.xlabel('Target (0=No Disease, 1=Disease)')
plt.ylabel('Frequency')
plt.xticks([0, 1], ['No Disease', 'Disease'], rotation=0)

plt.subplot(1, 2, 2)
df_clean['target'].value_counts().sort_index().plot(kind='pie', autopct='%1.1f%%',
                                                     colors=['lightgreen', 'coral'], startangle=90)
plt.title('Heart Disease Proportion', fontweight='bold')
plt.ylabel('')
plt.legend(['No Disease', 'Disease'], loc='best')
plt.tight_layout()
plt.show()

# ========================================
# 5.4 Feature Scaling - Standardization
# ========================================
print("\n" + "="*60)
print("STEP 4: FEATURE SCALING (STANDARDIZATION)")
print("="*60)

# Separate features and target
X = df_clean.drop('target', axis=1)
y = df_clean['target']

print(f"Shape of Features (X): {X.shape}")
print(f"Shape of Target (y): {y.shape}")
print(f"\nFeatures: {X.columns.tolist()}")
print(f"\nTarget distribution:\n{y.value_counts().sort_index()}")

# Standardization
# Note: the scaler is fit on all rows here, i.e. before the train-test split in Step 5
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

print("\n✅ Features scaled successfully!")
print("\nStatistics after scaling (mean ≈ 0, std ≈ 1):")
display(X_scaled_df.describe())

# Compare distributions before and after scaling
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Before scaling
axes[0].boxplot([X[col] for col in X.columns[:6]], labels=X.columns[:6], patch_artist=True,
                boxprops=dict(facecolor='lightblue'))
axes[0].set_title('Before Scaling (Sample 6 Features)', fontweight='bold', fontsize=14)
axes[0].set_ylabel('Value')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3)

# After scaling
axes[1].boxplot([X_scaled_df[col] for col in X_scaled_df.columns[:6]], labels=X_scaled_df.columns[:6], 
                patch_artist=True, boxprops=dict(facecolor='lightgreen'))
axes[1].set_title('After Scaling (Sample 6 Features)', fontweight='bold', fontsize=14)
axes[1].set_ylabel('Standardized Value')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# ========================================
# 5.5 Train-Test Split
# ========================================
print("\n" + "="*60)
print("STEP 5: TRAIN-TEST SPLIT (STRATIFIED)")
print("="*60)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled_df, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Testing set size: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"\nTarget distribution in training set:")
print(y_train.value_counts().sort_index())
print(f"  - No Disease: {(y_train==0).sum()} ({(y_train==0).sum()/len(y_train)*100:.1f}%)")
print(f"  - Disease:    {(y_train==1).sum()} ({(y_train==1).sum()/len(y_train)*100:.1f}%)")
print(f"\nTarget distribution in testing set:")
print(y_test.value_counts().sort_index())
print(f"  - No Disease: {(y_test==0).sum()} ({(y_test==0).sum()/len(y_test)*100:.1f}%)")
print(f"  - Disease:    {(y_test==1).sum()} ({(y_test==1).sum()/len(y_test)*100:.1f}%)")

# ========================================
# 5.6 Save Preprocessed Data
# ========================================
print("\n" + "="*60)
print("STEP 6: SAVE PREPROCESSED DATA")
print("="*60)

os.makedirs('data/preprocessed', exist_ok=True)

# Recombine X and y so each split is saved as a single file
train_data = pd.concat([X_train.reset_index(drop=True), y_train.reset_index(drop=True)], axis=1)
test_data = pd.concat([X_test.reset_index(drop=True), y_test.reset_index(drop=True)], axis=1)

train_data.to_csv('data/preprocessed/train_data.csv', index=False)
test_data.to_csv('data/preprocessed/test_data.csv', index=False)

# Save the fitted scaler
joblib.dump(scaler, 'data/preprocessed/scaler.pkl')

print("✅ Preprocessed data saved successfully!")
print(f"  - Train data: data/preprocessed/train_data.csv")
print(f"    Shape: {train_data.shape}, Size: {os.path.getsize('data/preprocessed/train_data.csv')/1024:.2f} KB")
print(f"  - Test data: data/preprocessed/test_data.csv")
print(f"    Shape: {test_data.shape}, Size: {os.path.getsize('data/preprocessed/test_data.csv')/1024:.2f} KB")
print(f"  - Scaler: data/preprocessed/scaler.pkl")

# ========================================
# 5.7 Summary
# ========================================
print("\n" + "="*60)
print("PREPROCESSING SUMMARY - HEART DISEASE DATASET")
print("="*60)
print(f"✅ Original data: {len(df)} samples")
print(f"✅ After handling missing values: {len(df)} samples")
print(f"✅ After removing duplicates: {len(df_clean)} samples")
print(f"✅ Training samples: {len(X_train)} ({len(X_train)/len(df_clean)*100:.1f}%)")
print(f"✅ Testing samples: {len(X_test)} ({len(X_test)/len(df_clean)*100:.1f}%)")
print(f"✅ Number of features: {X_train.shape[1]}")
print(f"✅ Number of classes: {len(y.unique())} (Binary Classification)")
print(f"✅ Feature scaling: StandardScaler (mean≈0, std≈1)")
print(f"✅ Train-test split: Stratified (maintains class balance)")
print("\n🎉 Data Preprocessing Completed Successfully!")
print("="*60)
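
The scaler saved above is meant to be reused in the modeling stage. The sketch below shows a `joblib` save and reload round trip on a toy matrix, written to a temporary directory instead of `data/preprocessed/`; the names `scaler_demo`, `restored`, and `new_row` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training matrix for illustration only.
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler_demo = StandardScaler().fit(X_fit)

# Save to a temporary location and load it back, as a later stage would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.pkl")
    joblib.dump(scaler_demo, path)
    restored = joblib.load(path)

# A new raw observation is scaled with the statistics learned at fit time.
new_row = np.array([[2.0, 20.0]])
print(restored.transform(new_row))  # [[0. 0.]]
```

Reloading the fitted scaler, rather than fitting a new one, keeps new inputs on the same scale the model was trained on.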
