# Machine Learning Project Report - Eva Meivina Dwiana

## Import Libraries

In this step, the libraries needed for data analysis, preprocessing, model training, and evaluation are imported.


In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

## Load Dataset
The dataset used in this project originally comes from Kaggle and was uploaded to a GitHub repository so that it can be read directly from a URL.

In [17]:
url = "https://raw.githubusercontent.com/Evameivina/heart_ml/refs/heads/main/heart.csv"
heart_df = pd.read_csv(url)

## Data Understanding
This stage examines the structure of the data: the first five rows, general dataset information, descriptive statistics, duplicate and missing-value checks, and the distribution of the target.

In [19]:
print("First 5 rows:")
print(heart_df.head())

First 5 rows:
   Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  MaxHR  \
0   40   M           ATA        140          289          0     Normal    172   
1   49   F           NAP        160          180          0     Normal    156   
2   37   M           ATA        130          283          0         ST     98   
3   48   F           ASY        138          214          0     Normal    108   
4   54   M           NAP        150          195          0     Normal    122   

  ExerciseAngina  Oldpeak ST_Slope  HeartDisease  
0              N      0.0       Up             0  
1              N      1.0     Flat             1  
2              N      0.0       Up             0  
3              Y      1.5     Flat             1  
4              N      0.0       Up             0  


In [28]:
print("Dataset info:")
heart_df.info()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [27]:
print("Descriptive statistics:")
print(heart_df.describe())

Descriptive statistics:
              Age   RestingBP  Cholesterol   FastingBS       MaxHR  \
count  918.000000  918.000000   918.000000  918.000000  918.000000   
mean    53.510893  132.396514   198.799564    0.233115  136.809368   
std      9.432617   18.514154   109.384145    0.423046   25.460334   
min     28.000000    0.000000     0.000000    0.000000   60.000000   
25%     47.000000  120.000000   173.250000    0.000000  120.000000   
50%     54.000000  130.000000   223.000000    0.000000  138.000000   
75%     60.000000  140.000000   267.000000    0.000000  156.000000   
max     77.000000  200.000000   603.000000    1.000000  202.000000   

          Oldpeak  HeartDisease  
count  918.000000    918.000000  
mean     0.887364      0.553377  
std      1.066570      0.497414  
min     -2.600000      0.000000  
25%      0.000000      0.000000  
50%      0.600000      1.000000  
75%      1.500000      1.000000  
max      6.200000      1.000000  
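
The summary above reports a minimum of 0 for both `RestingBP` and `Cholesterol`. Such values are unlikely to be real measurements and may stand for missing data, so it can be worth counting them before modelling. A minimal sketch, using a small hypothetical frame in place of `heart_df`; the helper name `count_zeros` is an assumption made for illustration:

```python
import pandas as pd

def count_zeros(df, columns):
    """Return the number of rows equal to 0 in each of the given columns."""
    return (df[columns] == 0).sum()

# Hypothetical toy data standing in for heart_df
toy_df = pd.DataFrame({
    "RestingBP": [140, 0, 130, 120],
    "Cholesterol": [289, 0, 0, 214],
})

zero_counts = count_zeros(toy_df, ["RestingBP", "Cholesterol"])
print(zero_counts)
```

Applied to the real data, the same call would be `count_zeros(heart_df, ['RestingBP', 'Cholesterol'])`. How to treat the affected rows (drop, impute, or keep) is a separate decision that this sketch does not make.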


In [26]:
print("Duplicate check:")
print(f"Number of duplicate rows: {heart_df.duplicated().sum()}")

Duplicate check:
Number of duplicate rows: 0


In [30]:
print("Missing value check:")
print(heart_df.isnull().sum())

Missing value check:
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64


In [31]:
print("Distribution of the HeartDisease target:")
print(heart_df['HeartDisease'].value_counts(normalize=True))

Distribution of the HeartDisease target:
HeartDisease
1    0.553377
0    0.446623
Name: proportion, dtype: float64
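
With about 55% of the rows in the positive class, a classifier that always predicts `1` would already reach roughly 0.55 accuracy, which gives a floor for judging the trained models. The sketch below illustrates this with scikit-learn's `DummyClassifier` on hypothetical labels that only mimic the class balance; it does not use the real dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with roughly the same balance as the target (55% positive)
y_toy = np.array([1] * 55 + [0] * 45)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```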


### Separate Features and Target

In [32]:
X = heart_df.drop('HeartDisease', axis=1)
y = heart_df['HeartDisease']

### Define Numerical and Categorical Features

In [33]:
num_features = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']
cat_features = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

### Data Preparation

In [35]:
num_transformer = StandardScaler()

In [36]:
cat_transformer = OneHotEncoder(handle_unknown='ignore')

In [37]:
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])

### Train/test split (80:20), stratified on the target

In [38]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
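
Passing `stratify=y` keeps the class proportions of the target (approximately) equal in the training and test subsets. A small sketch on hypothetical labels shows the effect; the names `X_toy`, `y_toy`, and the 60/40 balance are assumptions chosen so the counts divide evenly:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 positives, 40 negatives
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both subsets keep the 60/40 class ratio
print(f"train positives: {y_tr.mean():.2f}, test positives: {y_te.mean():.2f}")
```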

### Fit and transform on train, transform only on test

In [41]:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("Preprocessing complete.")
print(f"Shape of X_train_processed: {X_train_processed.shape}")
print(f"Shape of X_test_processed: {X_test_processed.shape}")

Preprocessing complete.
Shape of X_train_processed: (734, 20)
Shape of X_test_processed: (184, 20)
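
The 20 output columns are the 6 scaled numeric features plus one indicator column per category level. The sketch below reproduces that count. The number of levels per categorical column is an assumption taken from the public dataset description, not something checked in this notebook:

```python
# Number of numeric features passed through StandardScaler
n_numeric = 6

# Assumed number of distinct levels per categorical column
# (from the public dataset description, not verified here)
assumed_levels = {
    "Sex": 2,             # M, F
    "ChestPainType": 4,   # TA, ATA, NAP, ASY
    "RestingECG": 3,      # Normal, ST, LVH
    "ExerciseAngina": 2,  # Y, N
    "ST_Slope": 3,        # Up, Flat, Down
}

# OneHotEncoder without `drop` creates one column per level
n_onehot = sum(assumed_levels.values())
n_total = n_numeric + n_onehot
print(n_total)  # 20
```

In recent scikit-learn versions, calling `preprocessor.get_feature_names_out()` on the fitted transformer lists the actual column names, which is a direct way to confirm the assumed levels.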


### Model 1: Logistic Regression

In [43]:
logreg = LogisticRegression(random_state=42, max_iter=1000)
logreg.fit(X_train_processed, y_train)

### Model 2: Random Forest

In [44]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_processed, y_train)

### Hyperparameter tuning for Random Forest

In [45]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train_processed, y_train)

best_rf = grid_search.best_estimator_
print(f"Best parameters RF: {grid_search.best_params_}")

Best parameters RF: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 100}
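
In the grid search above, the scaler and encoder were fitted on the whole training set before cross-validation, so every validation fold has already contributed to the preprocessing statistics. This can make the cross-validated scores slightly optimistic. One common alternative is to wrap preprocessing and model in a `Pipeline`, so that both are refit inside each fold. The following is a sketch on hypothetical toy data, not the notebook's actual setup; the column names `num_a` and `cat_b` and the reduced grid are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical feature
toy_X = pd.DataFrame({
    "num_a": np.linspace(-1.0, 1.0, 60),
    "cat_b": np.tile(["x", "y", "z"], 20),
})
toy_y = (toy_X["num_a"] > 0).astype(int)  # 30 negatives, 30 positives

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["num_a"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_b"]),
])

# Preprocessing is a pipeline step, so it is refit on each CV training fold
pipe = Pipeline(steps=[
    ("prep", toy_preprocessor),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Step parameters are addressed as "<step name>__<parameter>"
toy_grid = {"rf__n_estimators": [10, 20], "rf__max_depth": [None, 5]}

search = GridSearchCV(pipe, toy_grid, cv=3, scoring="f1")
search.fit(toy_X, toy_y)
print(search.best_params_)
```

With this layout the search is fitted on the raw `X_train` rather than on `X_train_processed`, and `search.best_estimator_` can predict directly on raw `X_test`.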


## Evaluation
The trained models are then evaluated on the test set using accuracy, precision, recall, F1-score, and a classification report.

In [46]:
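def evaluate_model(model, X_test, y_test):
    """Print a classification report and return (accuracy, precision, recall, F1).

    Precision, recall, and F1 are computed for the positive class (label 1).
    """
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(classification_report(y_test, y_pred))
    return acc, prec, rec, f1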
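
The precision, recall, and F1 values returned by `evaluate_model` all derive from the confusion-matrix counts. The sketch below shows that relationship on hypothetical labels (not taken from the heart dataset) and checks the hand-computed values against scikit-learn's own functions:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```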

In [48]:
print("Logistic Regression Evaluation")
logreg_metrics = evaluate_model(logreg, X_test_processed, y_test)

Logistic Regression Evaluation
              precision    recall  f1-score   support

           0       0.91      0.83      0.87        82
           1       0.87      0.93      0.90       102

    accuracy                           0.89       184
   macro avg       0.89      0.88      0.88       184
weighted avg       0.89      0.89      0.89       184



In [50]:
print("Random Forest Evaluation")
rf_metrics = evaluate_model(rf, X_test_processed, y_test)

Random Forest Evaluation
              precision    recall  f1-score   support

           0       0.89      0.87      0.88        82
           1       0.89      0.91      0.90       102

    accuracy                           0.89       184
   macro avg       0.89      0.89      0.89       184
weighted avg       0.89      0.89      0.89       184



In [52]:
print("Tuned Random Forest Evaluation")
tuned_rf_metrics = evaluate_model(best_rf, X_test_processed, y_test)

Tuned Random Forest Evaluation
              precision    recall  f1-score   support

           0       0.91      0.85      0.88        82
           1       0.89      0.93      0.91       102

    accuracy                           0.90       184
   macro avg       0.90      0.89      0.89       184
weighted avg       0.90      0.90      0.90       184
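
`evaluate_model` returns a tuple of (accuracy, precision, recall, F1) for each model, but the three tuples are never shown side by side. One possible way to compare them is to collect them in a single table. This is a sketch only: the helper name `summarize_metrics` is hypothetical and the numbers below are placeholders, not results from this notebook:

```python
import pandas as pd

def summarize_metrics(results):
    """Build a comparison table from {model name: (acc, prec, rec, f1)}."""
    return pd.DataFrame.from_dict(
        results,
        orient="index",
        columns=["accuracy", "precision", "recall", "f1"],
    )

# Placeholder tuples for illustration only; in the notebook these would be
# logreg_metrics, rf_metrics, and tuned_rf_metrics
example_results = {
    "Model A": (0.80, 0.75, 0.90, 0.82),
    "Model B": (0.85, 0.80, 0.88, 0.84),
}

summary = summarize_metrics(example_results)
print(summary)
```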

