# Tutorial: Encoding Categorical Variables with Scikit-learn

*Machine learning* models are mathematical models. They operate on numbers, not on text such as "Pria", "Wanita", or "Jakarta". So before we can train a model, we **must convert** every categorical (text) column into a numeric representation. This process is called **encoding**.

In this notebook, we focus on two main encoding techniques using Scikit-learn:
- **One-Hot Encoding** for nominal data
- **Ordinal Encoding** for ordinal data

## 1. Preparing the Sample Data

Let's build a simple DataFrame that contains several kinds of categorical data.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Data Sample
data = {
    'Kota Asal': ['Jakarta', 'Bandung', 'Surabaya', 'Bandung','Jakarta','Medan'],
    'Ukuran Baju': ['M','L','S','L','M','XL'],
    'Jenis Kelamin': ['Pria', 'Wanita', 'Wanita', 'Pria', 'Wanita', 'Pria'],
    'Membeli': [1,0,0,0,1,1]
}

df = pd.DataFrame(data)
print(df)

  Kota Asal Ukuran Baju Jenis Kelamin  Membeli
0   Jakarta           M          Pria        1
1   Bandung           L        Wanita        0
2  Surabaya           S        Wanita        0
3   Bandung           L          Pria        0
4   Jakarta           M        Wanita        1
5     Medan          XL          Pria        1


In this data:

- **Kota Asal** (city of origin) and **Jenis Kelamin** (gender) are **nominal** data (no order or ranking).
- **Ukuran Baju** (shirt size) is **ordinal** data (there is a clear order: S < M < L < XL).

## 2. A Common Pitfall: Why Label Encoding Is Often Wrong

The first approach that comes to mind is usually to replace each category with an integer (for example: Jakarta=0, Bandung=1, Surabaya=2). This is called **Label Encoding**.

**The problem:** a model can assume that these integers carry a mathematical relationship or order (for example, Surabaya > Bandung > Jakarta). For **nominal** data no such order exists, so the encoding can mislead the model and hurt its performance, particularly for models that use the magnitude of the values, such as linear or distance-based models.

**Avoid Label Encoding for nominal features!**
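
To make the pitfall concrete, here is a minimal sketch. It assumes that an `OrdinalEncoder` left at its defaults is an acceptable stand-in for "label encoding" a feature column; the variable names are illustrative only. With no explicit order given, the categories are sorted alphabetically, so the resulting integers say nothing about the cities themselves.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cities = pd.DataFrame({'Kota Asal': ['Jakarta', 'Bandung', 'Surabaya', 'Medan']})

# No `categories` argument: the encoder sorts the unique values alphabetically
# and numbers them 0, 1, 2, ... in that order.
naive_encoder = OrdinalEncoder()
codes = naive_encoder.fit_transform(cities).ravel()

# Map each city to its integer code for easy inspection.
city_to_code = {city: int(code) for city, code in zip(cities['Kota Asal'], codes)}
print(city_to_code)
# {'Jakarta': 1, 'Bandung': 0, 'Surabaya': 3, 'Medan': 2}
```

A model that treats these codes as magnitudes would see Surabaya (3) as "larger" than Jakarta (1), which is purely an artifact of alphabetical sorting.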


## 3. Main Technique for Nominal Data: One-Hot Encoding

**One-Hot Encoding** is the appropriate solution for nominal data. It creates one new column for each unique category. Then, for each row, it places a **1** in the column that matches the row's category and a **0** in all the other columns.

Let's apply it using `OneHotEncoder` from Scikit-learn.

**OneHotEncoder**: in short, for unordered string values, each category gets its own column that holds 1 if the row has that category and 0 if it does not. The output of `OneHotEncoder` contains only 0s and 1s.

**OrdinalEncoder**: in short, you supply the list of values in their proper order, for example the shirt sizes `['S','M','L','XL']`, and each value is replaced by its position in that list, giving `[0,1,2,3]`.
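
The mechanism described above can be sketched by hand, without Scikit-learn. This is only an illustration of the idea, not how `OneHotEncoder` is implemented internally; the list `values` is made-up sample input.

```python
import numpy as np

values = ['Jakarta', 'Bandung', 'Surabaya', 'Bandung']

# Step 1: collect the unique categories (sorted for a stable column order).
categories = sorted(set(values))  # ['Bandung', 'Jakarta', 'Surabaya']

# Step 2: one row per value, one column per category;
# put 1 where the value matches the column's category, 0 elsewhere.
one_hot = np.array([[1 if v == c else 0 for c in categories] for v in values])

print(categories)
print(one_hot)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.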

In [3]:
from sklearn.preprocessing import OneHotEncoder

In [4]:
# separate the features and the target
X = df.drop('Membeli',axis=1)
Y = df['Membeli']

print(X)
print(Y)

  Kota Asal Ukuran Baju Jenis Kelamin
0   Jakarta           M          Pria
1   Bandung           L        Wanita
2  Surabaya           S        Wanita
3   Bandung           L          Pria
4   Jakarta           M        Wanita
5     Medan          XL          Pria
0    1
1    0
2    0
3    0
4    1
5    1
Name: Membeli, dtype: int64


In [5]:
# split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.33,random_state=42)

X_train_nominal = X_train[['Kota Asal', 'Jenis Kelamin']]
X_test_nominal = X_test[['Kota Asal', 'Jenis Kelamin']]
print(X_train_nominal)
print(X_test_nominal)

  Kota Asal Jenis Kelamin
5     Medan          Pria
2  Surabaya        Wanita
4   Jakarta        Wanita
3   Bandung          Pria
  Kota Asal Jenis Kelamin
0   Jakarta          Pria
1   Bandung        Wanita


In [6]:
# 1. Create the OneHotEncoder object
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
# handle_unknown="ignore": if a category that was not seen during fit appears at transform time,
#   all one-hot columns for that feature are set to 0 for that row
# sparse_output=False: return a dense NumPy array instead of a SciPy sparse matrix

In [7]:
# 2. Fit on the training data only. This is similar to DISTINCT or GROUP BY in MySQL:
# given, say, 100 rows, fit finds the unique values that appear in each column
print('Fitting OHE on X_train_nominal')
ohe.fit(X_train_nominal)

Fitting OHE on X_train_nominal




In [8]:
# 3. Transform the training and test data
X_train_encode = ohe.transform(X_train_nominal)
X_test_encode = ohe.transform(X_test_nominal)

# get the names of the newly generated columns
encoded_cols = ohe.get_feature_names_out(['Kota Asal', 'Jenis Kelamin'])
print(f'new column names : {encoded_cols}')

new column names : ['Kota Asal_Bandung' 'Kota Asal_Jakarta' 'Kota Asal_Medan'
 'Kota Asal_Surabaya' 'Jenis Kelamin_Pria' 'Jenis Kelamin_Wanita']


In [9]:
# convert the resulting arrays into DataFrames so they are easier to read
X_train_encode_df = pd.DataFrame(X_train_encode, columns=encoded_cols, index=X_train_nominal.index)
X_test_encode_df = pd.DataFrame(X_test_encode, columns=encoded_cols, index=X_test_nominal.index)

In [10]:
print('---- one-hot encoding result on the training data ----')
print(X_train_encode_df)

print('---- one-hot encoding result on the test data ----')
print(X_test_encode_df)

---- one-hot encoding result on the training data ----
   Kota Asal_Bandung  Kota Asal_Jakarta  Kota Asal_Medan  Kota Asal_Surabaya  \
5                0.0                0.0              1.0                 0.0   
2                0.0                0.0              0.0                 1.0   
4                0.0                1.0              0.0                 0.0   
3                1.0                0.0              0.0                 0.0   

   Jenis Kelamin_Pria  Jenis Kelamin_Wanita  
5                 1.0                   0.0  
2                 0.0                   1.0  
4                 0.0                   1.0  
3                 1.0                   0.0  
---- one-hot encoding result on the test data ----
   Kota Asal_Bandung  Kota Asal_Jakarta  Kota Asal_Medan  Kota Asal_Surabaya  \
0                0.0                1.0              0.0                 0.0   
1                1.0                0.0              0.0                 0.0   

   Jenis Kelamin_Pria  Jenis Kelamin_Wanita  
0                 1.0                   0.0  
1                 0.0                   1.0  

## 4. Technique for Ordinal Data: Ordinal Encoding

In [11]:
from sklearn.preprocessing import OrdinalEncoder

# define the correct category order
ukuran_baju_order = ['S','M','L','XL']

# initialize the OrdinalEncoder with the order we just defined
ordinal_encoder = OrdinalEncoder(categories=[ukuran_baju_order])

# select the ordinal column that we will fit and transform
X_train_ordinal = X_train[['Ukuran Baju']]
X_test_ordinal = X_test[['Ukuran Baju']]
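
Why pass the order explicitly? The sketch below compares an `OrdinalEncoder` left at its defaults with one that receives the size order. The small `sizes` frame is created here only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'Ukuran Baju': ['S', 'M', 'L', 'XL']})

# Default behaviour: categories are sorted alphabetically (L, M, S, XL).
default_codes = (
    OrdinalEncoder().fit_transform(sizes).ravel().astype(int).tolist()
)

# Explicit order: the position in the list becomes the code.
explicit_codes = (
    OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
    .fit_transform(sizes).ravel().astype(int).tolist()
)

print('default :', default_codes)   # [2, 1, 0, 3]
print('explicit:', explicit_codes)  # [0, 1, 2, 3]
```

With the default, 'L' gets the smallest code and 'S' a larger one, which contradicts the real ordering S < M < L < XL.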

In [12]:
# fit on the training data, then transform both the training and test data
X_train_ordinal_encoded = ordinal_encoder.fit_transform(X_train_ordinal)
X_test_ordinal_encoded = ordinal_encoder.transform(X_test_ordinal)

In [13]:
# convert the results into DataFrames
X_train_ordinal_df = pd.DataFrame(X_train_ordinal_encoded,columns=['Ukuran Baju Encoded'],index=X_train_ordinal.index)
X_test_ordinal_df = pd.DataFrame(X_test_ordinal_encoded,columns=['Ukuran Baju Encoded'],index=X_test_ordinal.index)

print('---- ordinal encoding result on the training data ----')
print(pd.concat([X_train_ordinal,X_train_ordinal_df], axis=1))

print('---- ordinal encoding result on the test data ----')
print(pd.concat([X_test_ordinal,X_test_ordinal_df], axis=1))

---- ordinal encoding result on the training data ----
  Ukuran Baju  Ukuran Baju Encoded
5          XL                  3.0
2           S                  0.0
4           M                  1.0
3           L                  2.0
---- ordinal encoding result on the test data ----
  Ukuran Baju  Ukuran Baju Encoded
0           M                  1.0
1           L                  2.0
