<a href="https://colab.research.google.com/github/LatiefDataVisionary/feature-engineering-college-task/blob/main/Tugas_Kelompok_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Feature Engineering Assignment**


Group members:
-
-
-

**Dataset title: Heart Attack Analysis & Prediction Dataset**

**Dataset link**: https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset/discussion/234843

## **1. Importing the Required Libraries**

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
import numpy as np

**Library notes:**

1. `import pandas as pd`: Imports the `pandas` library under the alias `pd`. It is used for data manipulation and analysis, mainly through DataFrames.
2. `from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer`: Imports three classes from the `sklearn.preprocessing` module:
  * `StandardScaler`: Standardizes each feature by subtracting its mean and dividing by its standard deviation.
  * `MinMaxScaler`: Rescales each feature to a given range, by default 0 to 1.
  * `PowerTransformer`: Applies a power transform to make the data more normally distributed.
3. `from sklearn.decomposition import PCA`: Imports the `PCA` (Principal Component Analysis) class from the `sklearn.decomposition` module. PCA reduces dimensionality by finding the linear combinations of features that capture the most variance.
4. `from sklearn.model_selection import train_test_split`: Imports the `train_test_split` function from the `sklearn.model_selection` module. It splits the dataset into training and test sets.
5. `from sklearn.metrics import accuracy_score`: Imports the `accuracy_score` function from the `sklearn.metrics` module. It computes the accuracy of a classification model.
6. `import xgboost as xgb`: Imports the `xgboost` library under the alias `xgb`. It provides an efficient and flexible implementation of gradient boosting.
7. `from sklearn.preprocessing import LabelEncoder`: Imports the `LabelEncoder` class from the `sklearn.preprocessing` module. It converts categorical labels to integers.
8. `import numpy as np`: Imports the `numpy` library under the alias `np`. It is used for numerical computing, mainly on multidimensional arrays.

**In short:**
* `pandas` for data manipulation.
* `StandardScaler`, `MinMaxScaler`, and `PowerTransformer` for scaling and transforming data.
* `PCA` ***(Principal Component Analysis)*** for dimensionality reduction.
* `train_test_split` for splitting data into training and test sets.
* `accuracy_score` for evaluating model performance by accuracy.
* `xgboost`, a popular library for boosting models based on decision trees.
* `LabelEncoder` for converting categorical labels to integers.
* `numpy` for numerical operations.
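
To make the two scalers concrete, here is a minimal sketch on a made-up single column (the values are illustrative, not taken from the dataset). It checks that `StandardScaler` agrees with (x - mean) / std and that `MinMaxScaler` agrees with (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical single-feature column (illustrative values only)
x = np.array([[120.0], [130.0], [140.0], [150.0], [200.0]])

standardized = StandardScaler().fit_transform(x)
rescaled = MinMaxScaler().fit_transform(x)

# Manual versions of the same formulas
manual_standardized = (x - x.mean(axis=0)) / x.std(axis=0)  # population std (ddof=0)
manual_rescaled = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(np.allclose(standardized, manual_standardized))
print(np.allclose(rescaled, manual_rescaled))
```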

## **2. Mounting Drive, Loading, and Displaying the Dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# df = pd.read_csv('/content/drive/MyDrive/Classroom/Coba/Iris.csv')
df = pd.read_csv('/content/drive/MyDrive/Feature Engineering/heart.csv')
df

In [None]:
df.describe()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


**Explanation**:

This dataset contains patient information for **heart disease risk analysis**. It has **303 rows** and **14 columns**. The columns, as they appear in the data, are:

1. `age`: Patient age in years.
2. `sex`: Sex (`1` = male; `0` = female).
3. `cp` (*chest pain*): Chest pain type:
  * `1`: *Typical angina*.
  * `2`: *Atypical angina*.
  * `3`: *Non-anginal pain* (pain not originating from the heart).
  * `0`: *Asymptomatic*.
4. `trtbps`: Resting blood pressure (mm Hg) on admission.
5. `chol`: Serum cholesterol (mg/dl).
6. `fbs`: Fasting blood sugar > 120 mg/dl (`1` = true; `0` = false).
7. `restecg`: Resting electrocardiographic result:
  * `0`: *Hypertrophy*.
  * `1`: *Normal*.
  * `2`: ST-T wave abnormality.
8. `thalachh`: Maximum heart rate achieved.
9. `exng`: Exercise-induced angina (`1` = yes; `0` = no).
10. `oldpeak`: ST depression induced by exercise relative to rest.
11. `slp`: Slope of the ST segment:
  * `2`: *Upsloping*.
  * `1`: *Flat*.
  * `0`: *Downsloping*.
12. `caa`: Number of major vessels (`0`-`3`) visualized by fluoroscopy. Note that `df.describe()` reports a maximum of `4` for this column.
13. `thall`: Defect type:
  * `2`: *Normal*.
  * `1`: *Fixed defect*.
  * `3`: *Reversible defect*.
14. `output`: Heart disease diagnosis, stored as `1`/`0`:
  * `1` (yes): Indication of heart disease (diameter narrowing > 50%).
  * `0` (no): No indication of heart disease.


**Notes**:

* The dataset is suited to predictive analysis, for example training a machine learning model to predict the presence of heart disease (`output`).
* **Numeric variables** such as `age`, `trtbps`, `chol`, `thalachh`, and `oldpeak` can be used for statistical analysis or scaling.
* **Categorical variables** such as `sex`, `cp`, `restecg`, and `slp` may need encoding before they are used in a machine learning model.
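
As one possible way to encode the categorical columns, the sketch below one-hot encodes `cp` and `slp` with `pd.get_dummies` on a small hypothetical sample that reuses the dataset's column names. The notebook itself passes the integer codes straight to the models, so this is only an illustration of the idea, not the approach used here.

```python
import pandas as pd

# Hypothetical mini-sample using the dataset's column names (illustrative values)
sample = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": [3, 2, 1],
    "slp": [0, 0, 2],
})

# One-hot encode the nominal columns; numeric columns pass through unchanged
encoded = pd.get_dummies(sample, columns=["cp", "slp"], dtype=int)
print(encoded.columns.tolist())
```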

## **3. Trying a Baseline Model**

We first test model performance using only a **label encoder**, with no other feature engineering.

In [None]:
X = df[[i for i in df.columns if i != 'output']]
y = df['output']
print(f"X:{X}")

X:     age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  oldpeak  slp  \
0     63    1   3     145   233    1        0       150     0      2.3    0   
1     37    1   2     130   250    0        1       187     0      3.5    0   
2     41    0   1     130   204    0        0       172     0      1.4    2   
3     56    1   1     120   236    0        1       178     0      0.8    2   
4     57    0   0     120   354    0        1       163     1      0.6    2   
..   ...  ...  ..     ...   ...  ...      ...       ...   ...      ...  ...   
298   57    0   0     140   241    0        1       123     1      0.2    1   
299   45    1   3     110   264    0        1       132     0      1.2    1   
300   68    1   0     144   193    1        1       141     0      3.4    1   
301   57    1   0     130   131    0        1       115     1      1.2    1   
302   57    0   1     130   236    0        0       174     0      0.0    1   

     caa  thall  
0      0      1  
1      0     

In [None]:
# Encode the target labels as integers (they are already 0/1 here, so the values are unchanged)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
y = y_encoded

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Convert the data to DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# XGBoost parameters
params = {
    'objective': 'multi:softmax',
    'num_class': len(np.unique(y)),
    'max_depth': 4,
    'eta': 0.3,
    'eval_metric': 'mlogloss',
    'seed': 42
}

# Train the XGBoost model
num_round = 50
bst = xgb.train(params, dtrain, num_round)

# Make predictions
y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)

print(f"XGBoost accuracy with no feature engineering other than the label encoder: {accuracy*100:.2f}%")

XGBoost accuracy with no feature engineering other than the label encoder: 83.61%


In [None]:
# Import KNeighborsClassifier from sklearn
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Instantiate the model
knn = KNeighborsClassifier(n_neighbors=3)


# Fit the model to the training set
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(f"KNN accuracy with no feature engineering other than the label encoder: {np.sum(y_pred == y_test)/len(y_test) * 100}%")

KNN accuracy with no feature engineering other than the label encoder: 65.57377049180327%
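
One plausible reason for KNN's lower score on the raw features is that Euclidean distance is dominated by columns with large numeric ranges, such as `chol`. The sketch below uses two made-up patients (hypothetical values, only two features) to show how little a difference in `sex` contributes next to a modest difference in `chol`.

```python
import numpy as np

# Two hypothetical patients as [sex, chol]; they differ in sex and slightly in chol
a = np.array([0.0, 240.0])
b = np.array([1.0, 260.0])

diff = a - b
distance = np.linalg.norm(diff)

# Share of the squared distance contributed by each feature
contribution = diff**2 / np.sum(diff**2)
print(distance, contribution)
```

Here almost all of the distance comes from `chol`, which is the kind of imbalance that feature scaling is meant to remove.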


## **4. Feature Engineering with PCA**

We take every feature except the label, apply feature extraction (PCA) to those features, and then drop the original columns that PCA replaces.

In [None]:
# Apply PCA to the feature columns to reduce them to two components
col = [i for i in df.columns if i != 'output']
pca = PCA(n_components=2)
numerical_features = df[col]
pca_features = pca.fit_transform(numerical_features)

# Add the PCA features to the DataFrame
pca_df = pd.DataFrame(pca_features, columns=["PCA1", "PCA2"])
df_pca = pd.concat([df, pca_df], axis=1)

# Drop the original columns
df_pca.drop(col, axis=1, inplace=True)
df_pca

Unnamed: 0,output,PCA1,PCA2
0,1,-12.267345,-2.873838
1,1,2.690137,39.871374
2,1,-42.950214,23.636820
3,1,-10.944756,28.438036
4,1,106.979053,15.874468
...,...,...,...
298,0,-4.554121,-27.490169
299,0,16.428008,-12.921716
300,0,-51.963811,-13.323798
301,0,-114.755981,-36.435184
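
The PCA cell above runs on unscaled columns, so the components may be driven mostly by the features with the largest variance. The following sketch uses synthetic data (two made-up columns, not the heart dataset) to show the effect and how standardizing first changes the loadings. It is an illustration under those assumptions, not a claim about what would improve the results here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: one wide-range feature and one narrow-range feature
wide = rng.normal(loc=250.0, scale=50.0, size=300)   # e.g. a cholesterol-like column
narrow = rng.normal(loc=1.0, scale=1.0, size=300)    # e.g. an oldpeak-like column
data = np.column_stack([wide, narrow])

raw_pca = PCA(n_components=1).fit(data)
scaled_pca = PCA(n_components=1).fit(StandardScaler().fit_transform(data))

# Absolute loadings of the first component on [wide, narrow]
raw_loadings = np.abs(raw_pca.components_[0])
scaled_loadings = np.abs(scaled_pca.components_[0])
print(raw_loadings, raw_pca.explained_variance_ratio_)
print(scaled_loadings, scaled_pca.explained_variance_ratio_)
```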


Apply a label encoder to the target column so the labels can be used by the ML models.

In [None]:
X = df_pca[[i for i in df_pca.columns if i != 'output']]
y = df_pca['output']
print(f"X:{X}")

# Encode the target labels as integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
y = y_encoded

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X:           PCA1       PCA2
0    -12.267345  -2.873838
1      2.690137  39.871374
2    -42.950214  23.636820
3    -10.944756  28.438036
4    106.979053  15.874468
..          ...        ...
298   -4.554121 -27.490169
299   16.428008 -12.921716
300  -51.963811 -13.323798
301 -114.755981 -36.435184
302  -10.396142  23.302401

[303 rows x 2 columns]


We now try KNN and XGBoost on the feature-engineered dataset.

In [None]:
# Convert the data to DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# XGBoost parameters
params = {
    'objective': 'multi:softmax',
    'num_class': len(np.unique(y)),
    'max_depth': 4,
    'eta': 0.3,
    'eval_metric': 'mlogloss',
    'seed': 42
}

# Train the XGBoost model
num_round = 50
bst = xgb.train(params, dtrain, num_round)

# Make predictions
y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)

print(f"XGBoost accuracy with PCA: {accuracy*100:.2f}%")

XGBoost accuracy with PCA: 75.41%


In [None]:
# Import KNeighborsClassifier from sklearn
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Instantiate the model
knn = KNeighborsClassifier(n_neighbors=3)


# Fit the model to the training set
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(f"KNN accuracy with PCA: {np.sum(y_pred == y_test)/len(y_test) * 100}%")

KNN accuracy with PCA: 63.934426229508205%


## **5. Feature Engineering with Standard Scaler**

We take every feature except the label, apply feature construction in the form of the Standard Scaler to those features, and then drop the original unscaled columns.

In [None]:
# Apply the standard scaler
col = [i for i in df.columns if i != 'output']
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[col])

# Put the scaled features back into a DataFrame.
# Note: columns=[col] (a nested list) yields tuple-style column names such as ('age',),
# which is why the output shows "(age,)" and why they do not clash with the original names.
scaled_df = pd.DataFrame(scaled_features, columns=[col])
df_scaled = pd.concat([df, scaled_df], axis=1)

# Drop the original (unscaled) columns
df_scaled.drop(col, axis=1, inplace=True)
df_scaled

Unnamed: 0,output,"(age,)","(sex,)","(cp,)","(trtbps,)","(chol,)","(fbs,)","(restecg,)","(thalachh,)","(exng,)","(oldpeak,)","(slp,)","(caa,)","(thall,)"
0,1,0.952197,0.681005,1.973123,0.763956,-0.256334,2.394438,-1.005832,0.015443,-0.696631,1.087338,-2.274579,-0.714429,-2.148873
1,1,-1.915313,0.681005,1.002577,-0.092738,0.072199,-0.417635,0.898962,1.633471,-0.696631,2.122573,-2.274579,-0.714429,-0.512922
2,1,-1.474158,-1.468418,0.032031,-0.092738,-0.816773,-0.417635,-1.005832,0.977514,-0.696631,0.310912,0.976352,-0.714429,-0.512922
3,1,0.180175,0.681005,0.032031,-0.663867,-0.198357,-0.417635,0.898962,1.239897,-0.696631,-0.206705,0.976352,-0.714429,-0.512922
4,1,0.290464,-1.468418,-0.938515,-0.663867,2.082050,-0.417635,0.898962,0.583939,1.435481,-0.379244,0.976352,-0.714429,-0.512922
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,0,0.290464,-1.468418,-0.938515,0.478391,-0.101730,-0.417635,0.898962,-1.165281,1.435481,-0.724323,-0.649113,-0.714429,1.123029
299,0,-1.033002,0.681005,1.973123,-1.234996,0.342756,-0.417635,0.898962,-0.771706,-0.696631,0.138373,-0.649113,-0.714429,1.123029
300,0,1.503641,0.681005,-0.938515,0.706843,-1.029353,2.394438,0.898962,-0.378132,-0.696631,2.036303,-0.649113,1.244593,1.123029
301,0,0.290464,0.681005,-0.938515,-0.092738,-2.227533,-0.417635,0.898962,-1.515125,1.435481,0.138373,-0.649113,0.265082,1.123029
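
The tuple-style names such as `(age,)` in the output come from passing `columns=[col]` (a list inside a list). A minimal sketch of an alternative that keeps the plain column names is shown below on a small hypothetical frame; `df_demo` and its values are made up for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-frame with the same layout: features plus an `output` label
df_demo = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "output": [1, 1, 0, 0],
})
feature_cols = [c for c in df_demo.columns if c != "output"]

scaled = pd.DataFrame(
    StandardScaler().fit_transform(df_demo[feature_cols]),
    columns=feature_cols,   # a plain list keeps the original names
    index=df_demo.index,
)

# Reattach the label instead of concatenating everything and dropping columns
df_demo_scaled = pd.concat([df_demo[["output"]], scaled], axis=1)
print(df_demo_scaled.columns.tolist())
```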


Apply a label encoder to the target column so the labels can be used by the ML models.

In [None]:
X = df_scaled[[i for i in df_scaled.columns if i != 'output']]
y = df_scaled['output']
print(f"X:{X}")

# Apply the label encoder
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
y = y_encoded

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X:       (age,)    (sex,)     (cp,)  (trtbps,)   (chol,)    (fbs,)  (restecg,)  \
0    0.952197  0.681005  1.973123   0.763956 -0.256334  2.394438   -1.005832   
1   -1.915313  0.681005  1.002577  -0.092738  0.072199 -0.417635    0.898962   
2   -1.474158 -1.468418  0.032031  -0.092738 -0.816773 -0.417635   -1.005832   
3    0.180175  0.681005  0.032031  -0.663867 -0.198357 -0.417635    0.898962   
4    0.290464 -1.468418 -0.938515  -0.663867  2.082050 -0.417635    0.898962   
..        ...       ...       ...        ...       ...       ...         ...   
298  0.290464 -1.468418 -0.938515   0.478391 -0.101730 -0.417635    0.898962   
299 -1.033002  0.681005  1.973123  -1.234996  0.342756 -0.417635    0.898962   
300  1.503641  0.681005 -0.938515   0.706843 -1.029353  2.394438    0.898962   
301  0.290464  0.681005 -0.938515  -0.092738 -2.227533 -0.417635    0.898962   
302  0.290464 -1.468418  0.032031  -0.092738 -0.198357 -0.417635   -1.005832   

     (thalachh,)   (exng,)  (oldpeak,

We now try KNN and XGBoost on the feature-engineered dataset.

In [None]:
# Convert the data to DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# XGBoost parameters
params = {
    'objective': 'multi:softmax',
    'num_class': len(np.unique(y)),
    'max_depth': 4,
    'eta': 0.3,
    'eval_metric': 'mlogloss',
    'seed': 42
}

# Train the XGBoost model
num_round = 50
bst = xgb.train(params, dtrain, num_round)

# Make predictions
y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)

print(f"XGBoost accuracy with Standard Scaler: {accuracy*100:.2f}%")

XGBoost accuracy with Standard Scaler: 83.61%


In [None]:
# Import KNeighborsClassifier from sklearn
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the model
knn = KNeighborsClassifier(n_neighbors=3)


# Fit the model to the training set
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(f"KNN accuracy with Standard Scaler: {np.sum(y_pred == y_test)/len(y_test) * 100}%")

KNN accuracy with Standard Scaler: 86.88524590163934%
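
In this section the scaler is fit on all 303 rows before the train/test split, so statistics from the test rows influence the scaling. A common alternative is to put the scaler and the classifier in one pipeline that is fit on the training split only. The sketch below shows the pattern on synthetic data from `make_classification` (a stand-in for the heart data, so its accuracy is not comparable to the numbers above).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data: 303 rows, 13 features, binary label
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fit on the training split only, then reused on the test split
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(f"{test_accuracy:.4f}")
```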


## **6. Feature Engineering with MinMax Scaler**


We take every feature column except the label, apply feature construction in the form of the Min-Max Scaler to those features, and then drop the original unscaled columns.

In [None]:
# Apply the Min-Max scaler
col = [i for i in df.columns if i != 'output']
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(df[col])

# Put the scaled features back into a DataFrame
# (columns=[col] yields tuple-style names such as ('age',), shown as "(age,)" in the output)
scaled_df = pd.DataFrame(scaled_features, columns=[col])
df_scaled = pd.concat([df, scaled_df], axis=1)

# Drop the original (unscaled) columns
df_scaled.drop(col, axis=1, inplace=True)
df_scaled

Unnamed: 0,output,"(age,)","(sex,)","(cp,)","(trtbps,)","(chol,)","(fbs,)","(restecg,)","(thalachh,)","(exng,)","(oldpeak,)","(slp,)","(caa,)","(thall,)"
0,1,0.708333,1.0,1.000000,0.481132,0.244292,1.0,0.0,0.603053,0.0,0.370968,0.0,0.00,0.333333
1,1,0.166667,1.0,0.666667,0.339623,0.283105,0.0,0.5,0.885496,0.0,0.564516,0.0,0.00,0.666667
2,1,0.250000,0.0,0.333333,0.339623,0.178082,0.0,0.0,0.770992,0.0,0.225806,1.0,0.00,0.666667
3,1,0.562500,1.0,0.333333,0.245283,0.251142,0.0,0.5,0.816794,0.0,0.129032,1.0,0.00,0.666667
4,1,0.583333,0.0,0.000000,0.245283,0.520548,0.0,0.5,0.702290,1.0,0.096774,1.0,0.00,0.666667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,0,0.583333,0.0,0.000000,0.433962,0.262557,0.0,0.5,0.396947,1.0,0.032258,0.5,0.00,1.000000
299,0,0.333333,1.0,1.000000,0.150943,0.315068,0.0,0.5,0.465649,0.0,0.193548,0.5,0.00,1.000000
300,0,0.812500,1.0,0.000000,0.471698,0.152968,1.0,0.5,0.534351,0.0,0.548387,0.5,0.50,1.000000
301,0,0.583333,1.0,0.000000,0.339623,0.011416,0.0,0.5,0.335878,1.0,0.193548,0.5,0.25,1.000000


Apply a label encoder to the target column so the labels can be used by the ML models.

In [None]:
X = df_scaled[[i for i in df_scaled.columns if i != 'output']]
y = df_scaled['output']
print(f"X:{X}")

# Apply the label encoder
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
y = y_encoded

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X:       (age,)  (sex,)     (cp,)  (trtbps,)   (chol,)  (fbs,)  (restecg,)  \
0    0.708333     1.0  1.000000   0.481132  0.244292     1.0         0.0   
1    0.166667     1.0  0.666667   0.339623  0.283105     0.0         0.5   
2    0.250000     0.0  0.333333   0.339623  0.178082     0.0         0.0   
3    0.562500     1.0  0.333333   0.245283  0.251142     0.0         0.5   
4    0.583333     0.0  0.000000   0.245283  0.520548     0.0         0.5   
..        ...     ...       ...        ...       ...     ...         ...   
298  0.583333     0.0  0.000000   0.433962  0.262557     0.0         0.5   
299  0.333333     1.0  1.000000   0.150943  0.315068     0.0         0.5   
300  0.812500     1.0  0.000000   0.471698  0.152968     1.0         0.5   
301  0.583333     1.0  0.000000   0.339623  0.011416     0.0         0.5   
302  0.583333     0.0  0.333333   0.339623  0.251142     0.0         0.0   

     (thalachh,)  (exng,)  (oldpeak,)  (slp,)  (caa,)  (thall,)  
0       0.603053   

We now try KNN and XGBoost on the feature-engineered dataset.

In [None]:
# Convert the data to DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# XGBoost parameters
params = {
    'objective': 'multi:softmax',
    'num_class': len(np.unique(y)),
    'max_depth': 4,
    'eta': 0.3,
    'eval_metric': 'mlogloss',
    'seed': 42
}

# Train the XGBoost model
num_round = 50
bst = xgb.train(params, dtrain, num_round)

# Make predictions
y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)

print(f"XGBoost accuracy with MinMax Scaler: {accuracy*100:.2f}%")

XGBoost accuracy with MinMax Scaler: 83.61%


In [None]:
# Import KNeighborsClassifier from sklearn
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Instantiate the model
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model to the training set
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(f"KNN accuracy with MinMax Scaler: {np.sum(y_pred == y_test)/len(y_test) * 100}%")

KNN accuracy with MinMax Scaler: 83.60655737704919%


## **7. Feature Engineering with Power Transformer**


We take every feature column except the label, apply feature construction in the form of the Power Transformer to those features, and then drop the original untransformed columns.

In [None]:
# Apply the Power Transformer
col = [i for i in df.columns if i != 'output']
scaler = PowerTransformer()
scaled_features = scaler.fit_transform(df[col])

# Put the transformed features back into a DataFrame
# (columns=[col] yields tuple-style names such as ('age',), shown as "(age,)" in the output)
scaled_df = pd.DataFrame(scaled_features, columns=[col])
df_scaled = pd.concat([df, scaled_df], axis=1)

# Drop the original (untransformed) columns
df_scaled.drop(col, axis=1, inplace=True)
df_scaled

Unnamed: 0,output,"(age,)","(sex,)","(cp,)","(trtbps,)","(chol,)","(fbs,)","(restecg,)","(thalachh,)","(exng,)","(oldpeak,)","(slp,)","(caa,)","(thall,)"
0,1,0.956171,0.681005,1.487217,0.831717,-0.159046,2.394438,-1.021730,-0.081881,-0.696631,1.180998,-1.958263,-0.839679,-1.946718
1,1,-1.831006,0.681005,1.058192,0.015350,0.187234,-0.417635,0.933925,1.871628,-0.696631,1.592215,-1.958263,-0.839679,-0.583232
2,1,-1.442978,-1.468418,0.379389,0.015350,-0.819710,-0.417635,-1.021730,1.015375,-0.696631,0.686518,1.015783,-0.839679,-0.583232
3,1,0.147656,0.681005,0.379389,-0.623762,-0.095947,-0.417635,0.933925,1.347213,-0.696631,0.166444,1.015783,-0.839679,-0.583232
4,1,0.260722,-1.468418,-1.015652,-0.623762,1.859954,-0.417635,0.933925,0.543973,1.435481,-0.069148,1.015783,-0.839679,-0.583232
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,0,0.260722,-1.468418,-1.015652,0.576126,0.007273,-0.417635,0.933925,-1.179283,1.435481,-0.706744,-0.736554,-0.839679,1.176072
299,0,-1.039024,0.681005,1.487217,-1.360101,0.453389,-0.417635,0.933925,-0.843304,-0.696631,0.537070,-0.736554,-0.839679,1.176072
300,0,1.557230,0.681005,-1.015652,0.781806,-1.097969,2.394438,0.933925,-0.477666,-0.696631,1.564525,-0.736554,1.341145,1.176072
301,0,0.260722,0.681005,-1.015652,0.015350,-3.088792,-0.417635,0.933925,-1.453449,1.435481,0.537070,-0.736554,0.871923,1.176072
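
To illustrate what `PowerTransformer` does, the sketch below applies it to a synthetic right-skewed column (log-normal samples, not data from this notebook) and compares the skewness before and after. With its default settings the transformer uses the Yeo-Johnson method and also standardizes the result; the `skewness` helper is a small function written for this example.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic right-skewed feature (log-normal), standing in for a skewed column
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return float((centered**3).mean() / (centered**2).mean() ** 1.5)

# Defaults: method="yeo-johnson", standardize=True
transformed = PowerTransformer().fit_transform(skewed)

print(skewness(skewed), skewness(transformed))
print(transformed.mean(), transformed.std())
```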


Apply a label encoder to the target column so the labels can be used by the ML models.

In [None]:
X = df_scaled[[i for i in df_scaled.columns if i != 'output']]
y = df_scaled['output']
print(f"X:{X}")

# Apply the label encoder
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
y = y_encoded

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X:       (age,)    (sex,)     (cp,)  (trtbps,)   (chol,)    (fbs,)  (restecg,)  \
0    0.956171  0.681005  1.487217   0.831717 -0.159046  2.394438   -1.021730   
1   -1.831006  0.681005  1.058192   0.015350  0.187234 -0.417635    0.933925   
2   -1.442978 -1.468418  0.379389   0.015350 -0.819710 -0.417635   -1.021730   
3    0.147656  0.681005  0.379389  -0.623762 -0.095947 -0.417635    0.933925   
4    0.260722 -1.468418 -1.015652  -0.623762  1.859954 -0.417635    0.933925   
..        ...       ...       ...        ...       ...       ...         ...   
298  0.260722 -1.468418 -1.015652   0.576126  0.007273 -0.417635    0.933925   
299 -1.039024  0.681005  1.487217  -1.360101  0.453389 -0.417635    0.933925   
300  1.557230  0.681005 -1.015652   0.781806 -1.097969  2.394438    0.933925   
301  0.260722  0.681005 -1.015652   0.015350 -3.088792 -0.417635    0.933925   
302  0.260722 -1.468418  0.379389   0.015350 -0.095947 -0.417635   -1.021730   

     (thalachh,)   (exng,)  (oldpeak,

We now try KNN and XGBoost on the feature-engineered dataset.

In [None]:
# Convert the data to DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# XGBoost parameters
params = {
    'objective': 'multi:softmax',
    'num_class': len(np.unique(y)),
    'max_depth': 4,
    'eta': 0.3,
    'eval_metric': 'mlogloss',
    'seed': 42
}

# Train the XGBoost model
num_round = 50
bst = xgb.train(params, dtrain, num_round)

# Make predictions
y_pred = bst.predict(dtest)
accuracy = accuracy_score(y_test, y_pred)

print(f"XGBoost accuracy with Power Transformer: {accuracy*100:.2f}%")

XGBoost accuracy with Power Transformer: 83.61%


In [None]:
# Import KNeighborsClassifier from sklearn
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the model
knn = KNeighborsClassifier(n_neighbors=3)


# Fit the model to the training set
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(f"KNN accuracy with Power Transformer: {np.sum(y_pred == y_test)/len(y_test) * 100}%")

KNN accuracy with Power Transformer: 80.32786885245902%
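
For a side-by-side view, the accuracies printed in sections 3 to 7 can be collected into one table. The sketch below simply re-enters the reported numbers by hand, rounded to two decimals. It does not retrain anything, and the row labels are chosen for this example.

```python
import pandas as pd

# Accuracies (%) as reported by the cells above, rounded to two decimals
results = pd.DataFrame(
    {
        "XGBoost": [83.61, 75.41, 83.61, 83.61, 83.61],
        "KNN": [65.57, 63.93, 86.89, 83.61, 80.33],
    },
    index=["Baseline", "PCA", "Standard Scaler", "MinMax Scaler", "Power Transformer"],
)

print(results)
print(results.idxmax())
```

XGBoost's accuracy is identical for the baseline and all three scalers, which is consistent with tree splits being unaffected by monotonic rescaling of a feature. KNN, which relies on distances, changes noticeably and scores highest with the Standard Scaler on this split.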
