<a href="https://colab.research.google.com/github/Luseat/feature_enginering/blob/main/Latihan_Studi_Kasus_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import the required libraries**

In [36]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import OneHotEncoder, StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from collections import Counter
from sklearn.ensemble import RandomForestClassifier


In [37]:
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=10, n_redundant=2,
    n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=42,
)

After running the code above, you will have 1,000 samples with 15 independent features and one dependent variable (the target). By default, the function creates two classes, but you can choose a different number of classes by setting n_classes.

Because this material ends with an oversampling method, you need to set weights (the class ratio) to control how many samples fall into each class. In this case, 90% of the data goes to the first class and 10% to the second.
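As a quick check of how weights shapes the class balance, the sketch below regenerates the same synthetic data and counts the labels. The names `X_demo`, `y_demo`, `class_counts`, and `majority_share` are illustrative and not part of the notebook. When only one weight is given for two classes, scikit-learn infers the weight of the last class.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Same settings as the notebook; weights=[0.9] gives class 0 about 90% of the samples
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=15, n_informative=10, n_redundant=2,
    n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=42,
)

class_counts = Counter(y_demo)
majority_share = class_counts[0] / len(y_demo)
print(class_counts, f"majority share: {majority_share:.1%}")
```

Because flip_y=0, no labels are randomly flipped, so the split stays close to the requested ratio.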

In [38]:
# Arrange the dataset in a DataFrame for convenience
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(1, 16)])
df['Target'] = y

# Simulate two categorical features (values are drawn at random, without a seed)
df['Fitur_12'] = np.random.choice(['A', 'B', 'C'], size=1000)
df['Fitur_13'] = np.random.choice(['X', 'Y', 'Z'], size=1000)

df

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,feature_11,feature_12,feature_13,feature_14,feature_15,Target,Fitur_12,Fitur_13
0,0.093303,-3.472520,-1.314199,3.525743,0.642138,2.247328,3.067502,1.146301,-2.173112,2.765828,-1.821258,1.459826,-1.024592,1.005559,-0.276558,0,B,X
1,-0.189574,-1.770842,-1.578851,-1.372201,-2.025230,0.518655,-0.764750,-3.958705,-0.598147,1.018789,4.194233,2.236310,-0.001984,-0.243630,0.285979,0,B,Y
2,0.916269,-2.051770,3.631998,0.824844,1.674093,-0.436273,-0.460407,0.031633,-1.140149,2.069694,1.935251,0.671318,-3.175360,2.486020,-2.867291,0,A,X
3,-0.914665,-1.608657,-0.735184,-1.742743,-1.753532,0.383412,-1.057937,-2.897416,-0.830328,1.572469,5.334621,0.776033,-0.494986,-0.788215,1.255376,0,A,Z
4,-0.756784,-2.362885,-3.909120,-0.474571,-4.029843,0.947114,0.581146,-3.435229,-2.142380,2.332385,3.816539,3.038337,-0.391516,0.712335,2.810524,0,A,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,-1.927482,-0.017325,4.884411,0.542628,3.033376,-1.904407,0.953035,1.278882,-1.293396,1.772169,-1.191263,3.201742,-0.576096,-0.387151,-2.629004,0,A,X
996,0.347761,-1.690916,5.949207,-2.289729,2.238469,-0.067922,-0.069702,-1.436622,-2.153011,-0.867583,3.962758,0.661162,-2.550410,0.886822,-1.248408,0,B,X
997,1.201967,-1.263417,-1.331925,-2.468434,1.777577,2.270456,-0.431749,-1.846263,1.753033,1.858452,4.264568,0.571311,5.103484,0.067260,0.931995,1,C,Z
998,-2.127846,-0.975838,0.279144,0.151578,-0.443749,0.650616,-1.410265,-1.017319,-0.643070,2.142898,3.399255,-0.890017,-1.193708,-0.128774,0.800834,0,C,Z


The code above maps the random values generated by make_classification() into a DataFrame and adds two simulated categorical columns, producing the dataset shown above.
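The categorical columns are drawn without a seed, so their values change on every run. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator is acceptable here; the seed value and the names `rng` and `df_demo` are illustrative choices, and the result will not match the unseeded values shown in the notebook output.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the simulated categorical columns reproducible
rng = np.random.default_rng(42)  # the seed value 42 is an arbitrary choice

df_demo = pd.DataFrame({
    'Fitur_12': rng.choice(['A', 'B', 'C'], size=1000),
    'Fitur_13': rng.choice(['X', 'Y', 'Z'], size=1000),
})

# Inspect how often each simulated category occurs
print(df_demo['Fitur_12'].value_counts())
```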

In [39]:
# Separate the features and the target
X = df.drop('Target', axis=1)
y = df['Target']

In [40]:
print("Class distribution before oversampling:", Counter(y))

Class distribution before oversampling: Counter({0: 901, 1: 99})


Because this dataset has quite a lot of features, select a subset of them (feature selection). In this exercise, let's use an embedded method so that you get used to the most complex of the selection techniques.
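For contrast with the embedded approach, a filter method scores each feature on its own, independently of any model. The sketch below uses `SelectKBest` with `f_classif`, both imported at the top of the notebook. The value k=6 and the names `X_num`, `y_num`, `kbest`, and `kept_positions` are illustrative assumptions, not choices made by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Regenerate the numeric features so the sketch is self-contained
X_num, y_num = make_classification(
    n_samples=1000, n_features=15, n_informative=10, n_redundant=2,
    n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=42,
)

# Filter method: rank features by the ANOVA F-statistic and keep the top k
kbest = SelectKBest(score_func=f_classif, k=6)  # k=6 is an illustrative choice
X_filtered = kbest.fit_transform(X_num, y_num)

kept_positions = kbest.get_support(indices=True)  # zero-based column positions
print("Kept column positions:", kept_positions)
print("Shape after selection:", X_filtered.shape)
```

A filter method is cheaper to compute, but it ignores interactions between features that a tree-based embedded method can pick up.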

In [41]:
# ------------------- Embedded Methods -------------------
# Use a Random Forest to estimate the feature importances
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
X_integer = X.drop(['Fitur_12', 'Fitur_13'], axis=1)  # numeric features only
rf_model.fit(X_integer, y)

# Retrieve the feature importances
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]  # positions sorted from most to least important

# Set the importance threshold
threshold = 0.05  # for example, a 5% threshold
important_features_indices = [i for i in range(len(importances)) if importances[i] >= threshold]


# Print the selected features and their importances
print("Features selected with the embedded method (above the threshold)")
for i in important_features_indices:
    # The indices refer to X_integer, so take the column names from it
    print(f"{X_integer.columns[i]}: {importances[i]}")


# Get the names of the important columns
important_features = X_integer.columns[important_features_indices]



# Keep only the important features in a new variable
X_important = X_integer[important_features]

# X_important now holds only the important features
print("\nShape of the data with the important features", X_important.shape)

Features selected with the embedded method (above the threshold)
feature_2: 0.10914759494718489
feature_5: 0.07368007013958638
feature_9: 0.23505259533043404
feature_10: 0.0735348049797358
feature_13: 0.13300405135982538
feature_15: 0.11253091664500435

Shape of the data with the important features (1000, 6)


We set the feature-importance threshold to 5%, which leaves six numeric features.

After the numeric features have been selected, you need to recombine the numeric and categorical data as before.
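The same thresholding step can also be expressed with scikit-learn's `SelectFromModel`, which fits the estimator and keeps the features whose importance is at least the threshold. This is a sketch of an alternative, not the notebook's own code; the names `X_num`, `y_num`, `sfm`, `mask`, and `sfm_importances` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Regenerate the numeric features so the sketch is self-contained
X_num, y_num = make_classification(
    n_samples=1000, n_features=15, n_informative=10, n_redundant=2,
    n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=42,
)

# Fit the forest inside the selector and keep features with importance >= 0.05
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=0.05,
)
X_reduced = sfm.fit_transform(X_num, y_num)

mask = sfm.get_support()  # boolean mask over the original columns
sfm_importances = sfm.estimator_.feature_importances_
for position in mask.nonzero()[0]:
    print(f"column {position}: {sfm_importances[position]:.4f}")
```

Wrapping the selection in a transformer makes it easy to place inside a `Pipeline`, so the threshold is applied consistently to any new data.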

In [42]:
X_Selected = pd.concat([X_important, X[['Fitur_12', 'Fitur_13']]], axis=1)
X_Selected

Unnamed: 0,feature_2,feature_5,feature_9,feature_10,feature_13,feature_15,Fitur_12,Fitur_13
0,-3.472520,0.642138,-2.173112,2.765828,-1.024592,-0.276558,B,X
1,-1.770842,-2.025230,-0.598147,1.018789,-0.001984,0.285979,B,Y
2,-2.051770,1.674093,-1.140149,2.069694,-3.175360,-2.867291,A,X
3,-1.608657,-1.753532,-0.830328,1.572469,-0.494986,1.255376,A,Z
4,-2.362885,-4.029843,-2.142380,2.332385,-0.391516,2.810524,A,Y
...,...,...,...,...,...,...,...,...
995,-0.017325,3.033376,-1.293396,1.772169,-0.576096,-2.629004,A,X
996,-1.690916,2.238469,-2.153011,-0.867583,-2.550410,-1.248408,B,X
997,-1.263417,1.777577,1.753033,1.858452,5.103484,0.931995,C,Z
998,-0.975838,-0.443749,-0.643070,2.142898,-1.193708,0.800834,C,Z


In [43]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Encode Fitur_12 (A, B, C -> 0, 1, 2)
X_Selected['Fitur_12'] = label_encoder.fit_transform(X_Selected['Fitur_12'])
# print(label_encoder.inverse_transform(X_Selected['Fitur_12']))
# Encode Fitur_13 (X, Y, Z -> 0, 1, 2); refitting replaces the Fitur_12 mapping
X_Selected['Fitur_13'] = label_encoder.fit_transform(X_Selected['Fitur_13'])
# print(label_encoder.inverse_transform(X_Selected['Fitur_13']))

print(X_Selected)

     feature_2  feature_5  feature_9  feature_10  feature_13  feature_15  \
0    -3.472520   0.642138  -2.173112    2.765828   -1.024592   -0.276558   
1    -1.770842  -2.025230  -0.598147    1.018789   -0.001984    0.285979   
2    -2.051770   1.674093  -1.140149    2.069694   -3.175360   -2.867291   
3    -1.608657  -1.753532  -0.830328    1.572469   -0.494986    1.255376   
4    -2.362885  -4.029843  -2.142380    2.332385   -0.391516    2.810524   
..         ...        ...        ...         ...         ...         ...   
995  -0.017325   3.033376  -1.293396    1.772169   -0.576096   -2.629004   
996  -1.690916   2.238469  -2.153011   -0.867583   -2.550410   -1.248408   
997  -1.263417   1.777577   1.753033    1.858452    5.103484    0.931995   
998  -0.975838  -0.443749  -0.643070    2.142898   -1.193708    0.800834   
999   1.387667  -2.834755   2.625895    0.246120    1.583909    2.511309   

     Fitur_12  Fitur_13  
0           1         0  
1           1         1  
2        
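`LabelEncoder` assigns integer codes in sorted label order, which is why B becomes 1 and X becomes 0 in the output above. Reusing a single encoder for both columns overwrites the first mapping, so the sketch below keeps one encoder per column. The toy frame `cat_demo` and the `encoders` dictionary are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small toy frame with the same categories as the notebook
cat_demo = pd.DataFrame({
    'Fitur_12': ['B', 'B', 'A', 'C'],
    'Fitur_13': ['X', 'Y', 'X', 'Z'],
})

# One encoder per column, so each mapping stays available for inverse_transform
encoders = {}
for col in ['Fitur_12', 'Fitur_13']:
    encoders[col] = LabelEncoder()
    cat_demo[col] = encoders[col].fit_transform(cat_demo[col])

for col, enc in encoders.items():
    # classes_ is sorted; the position of each label is its integer code
    print(col, dict(zip(enc.classes_, range(len(enc.classes_)))))

# Recover the original labels of Fitur_12 from its own encoder
restored = encoders['Fitur_12'].inverse_transform(cat_demo['Fitur_12'])
print(list(restored))
```

Integer codes imply an order that nominal categories do not have. Tree-based models usually tolerate this, while linear or distance-based models may be better served by `OneHotEncoder`, which is already imported at the top of the notebook.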

In [44]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


# Select the numeric columns, excluding the label-encoded categorical ones
numeric_columns = X_Selected.select_dtypes(include='number').columns
numeric_columns = numeric_columns.drop(['Fitur_12', 'Fitur_13'])

# Work on a copy so the original data stays intact
X_cleaned = X_important.copy()

In [45]:
for col in numeric_columns:
    # Compute the outlier bounds with the IQR (interquartile range)
    Q1 = X_cleaned[col].quantile(0.25)
    Q3 = X_cleaned[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify the outliers in this column
    outliers = X_cleaned[(X_cleaned[col] < lower_bound) | (X_cleaned[col] > upper_bound)]

    # Drop the outliers; later columns are evaluated on the already-filtered rows
    X_cleaned = X_cleaned.drop(outliers.index)
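The IQR rule can also be applied in one pass with a boolean mask, which makes it straightforward to filter the target with the same rows. This is a sketch of a variant: it computes the bounds once on the full data, so it may keep a different number of rows than the column-by-column loop. The helper `iqr_mask` and the toy data `iqr_demo` and `target_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd


def iqr_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Return True for rows whose values all lie within the IQR bounds."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own bounds, then require all columns to pass
    within = frame.ge(lower, axis=1) & frame.le(upper, axis=1)
    return within.all(axis=1)


# Toy data with one planted extreme value in row 0
rng = np.random.default_rng(0)
iqr_demo = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
iqr_demo.loc[0, 'a'] = 50.0
target_demo = pd.Series(rng.integers(0, 2, size=200), name='Target')

keep = iqr_mask(iqr_demo)
iqr_clean = iqr_demo[keep]
target_clean = target_demo[keep]  # the target must drop the same rows as the features

print(len(iqr_demo), "->", len(iqr_clean))
```

In the notebook, the equivalent alignment step would be selecting the target rows by the index of the cleaned features, for example `y.loc[X_cleaned.index]`.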


In [46]:
X_cleaned

Unnamed: 0,feature_2,feature_5,feature_9,feature_10,feature_13,feature_15
0,-3.472520,0.642138,-2.173112,2.765828,-1.024592,-0.276558
1,-1.770842,-2.025230,-0.598147,1.018789,-0.001984,0.285979
2,-2.051770,1.674093,-1.140149,2.069694,-3.175360,-2.867291
3,-1.608657,-1.753532,-0.830328,1.572469,-0.494986,1.255376
4,-2.362885,-4.029843,-2.142380,2.332385,-0.391516,2.810524
...,...,...,...,...,...,...
994,-2.041914,-0.697435,-1.075422,2.308076,-3.342327,0.056426
995,-0.017325,3.033376,-1.293396,1.772169,-0.576096,-2.629004
996,-1.690916,2.238469,-2.153011,-0.867583,-2.550410,-1.248408
998,-0.975838,-0.443749,-0.643070,2.142898,-1.193708,0.800834


The cleaned data now contains 949 rows,