# Getting to Know Categorical Encoding: Label Encoding & One Hot Encoding

## What Is Categorical Encoding?
Categorical Encoding is the process of converting categorical values into numerical values.

There are many kinds of Categorical Encoding. Two of them are:
- Label Encoding
- One Hot Encoding

Reference: https://en.wikipedia.org/wiki/One-hot

## Label Encoding
In Label Encoding, the categories of a feature are sorted alphabetically and each category is represented by an integer value.
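As a rough illustration of this rule, the plain-Python sketch below sorts the distinct categories and assigns each one its position in the sorted list. The helper `label_encode` is a hypothetical name used only for this example; it is not how scikit-learn implements `LabelEncoder` internally.

```python
# Hypothetical helper for illustration only, not part of any library.
def label_encode(values):
    # Sort the distinct categories so the integer codes follow alphabetical order.
    classes = sorted(set(values))
    # Map each category to its position in the sorted list.
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[value] for value in values], classes


countries = ['India', 'US', 'Japan', 'US', 'Japan']
codes, classes = label_encode(countries)
print(codes)    # [0, 2, 1, 2, 1]
print(classes)  # ['India', 'Japan', 'US']
```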

### Dataset

In [1]:
import pandas as pd

df = pd.DataFrame({
    'country': ['India', 'US', 'Japan', 'US', 'Japan'],
    'age': [44, 34, 46, 35, 23],
    'salary': [72000, 65000, 98000, 45000, 34000]
})

df



|   | country | age | salary |
|---|---------|-----|--------|
| 0 | India   | 44  | 72000  |
| 1 | US      | 34  | 65000  |
| 2 | Japan   | 46  | 98000  |
| 3 | US      | 35  | 45000  |
| 4 | Japan   | 23  | 34000  |


### Label Encoding in scikit-learn

In [2]:
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the 'country' column and replace it with integer codes
label_encoder = LabelEncoder()
df['country'] = label_encoder.fit_transform(df['country'])
df

|   | country | age | salary |
|---|---------|-----|--------|
| 0 | 0       | 44  | 72000  |
| 1 | 2       | 34  | 65000  |
| 2 | 1       | 46  | 98000  |
| 3 | 2       | 35  | 45000  |
| 4 | 1       | 23  | 34000  |


In [3]:
label_encoder.classes_

array(['India', 'Japan', 'US'], dtype=object)
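Because `classes_` records which category received which integer, the codes can be turned back into the original labels. The following is a minimal sketch using `LabelEncoder.inverse_transform`; the variable names `codes` and `restored` are chosen only for this example.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(['India', 'US', 'Japan', 'US', 'Japan'])

# classes_[i] is the category that was assigned the integer i,
# so inverse_transform can look each code up again.
restored = label_encoder.inverse_transform(codes)
print(list(codes))
print(list(restored))
```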

## One Hot Encoding
In One Hot Encoding, the categories of a feature are sorted alphabetically and each category is represented as a set of bits: one bit per category, with only the bit for the matching category set to 1.
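The idea can be sketched in plain Python before turning to a library. The helper `one_hot_encode` below is hypothetical and only meant to show the bit layout; it is not scikit-learn's implementation.

```python
# Hypothetical helper for illustration only, not part of any library.
def one_hot_encode(values):
    # Sorted distinct categories define the column order.
    classes = sorted(set(values))
    position = {category: i for i, category in enumerate(classes)}
    rows = []
    for value in values:
        row = [0] * len(classes)   # one bit per category, all off
        row[position[value]] = 1   # switch on the bit for this category
        rows.append(row)
    return rows, classes


rows, classes = one_hot_encode(['India', 'US', 'Japan', 'US', 'Japan'])
print(classes)  # ['India', 'Japan', 'US']
for row in rows:
    print(row)
```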

### Dataset

In [4]:
df = pd.DataFrame({
    'country': ['India', 'US', 'Japan', 'US', 'Japan'],
    'age': [44, 34, 46, 35, 23],
    'salary': [72000, 65000, 98000, 45000, 34000]
})

df

|   | country | age | salary |
|---|---------|-----|--------|
| 0 | India   | 44  | 72000  |
| 1 | US      | 34  | 65000  |
| 2 | Japan   | 46  | 98000  |
| 3 | US      | 35  | 45000  |
| 4 | Japan   | 23  | 34000  |


### One Hot Encoding in scikit-learn

In [5]:
# OneHotEncoder expects 2D input, so reshape the column to (n_samples, 1)
X = df['country'].values.reshape(-1,1)
X

array([['India'],
       ['US'],
       ['Japan'],
       ['US'],
       ['Japan']], dtype=object)

In [6]:
from sklearn.preprocessing import OneHotEncoder

# fit_transform returns a sparse matrix; toarray() converts it to a dense array
onehot_encoder = OneHotEncoder()
X = onehot_encoder.fit_transform(X).toarray()
X

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In [7]:
onehot_encoder.categories_

[array(['India', 'Japan', 'US'], dtype=object)]

In [8]:
# Wrap the encoded array in a DataFrame with columns named '0', '1', '2'
df_onehot = pd.DataFrame(X, columns=[str(i) for i in range(X.shape[1])])
df_onehot

|   | 0   | 1   | 2   |
|---|-----|-----|-----|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 |


In [9]:
# Place the one-hot columns next to the original columns (axis=1 joins column-wise)
df = pd.concat([df_onehot, df], axis=1)
df

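|   | 0   | 1   | 2   | country | age | salary |
|---|-----|-----|-----|---------|-----|--------|
| 0 | 1.0 | 0.0 | 0.0 | India   | 44  | 72000  |
| 1 | 0.0 | 0.0 | 1.0 | US      | 34  | 65000  |
| 2 | 0.0 | 1.0 | 0.0 | Japan   | 46  | 98000  |
| 3 | 0.0 | 0.0 | 1.0 | US      | 35  | 45000  |
| 4 | 0.0 | 1.0 | 0.0 | Japan   | 23  | 34000  |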


In [10]:
# The original text column is no longer needed once it has been encoded
df = df.drop(['country'], axis=1)
df

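|   | 0   | 1   | 2   | age | salary |
|---|-----|-----|-----|-----|--------|
| 0 | 1.0 | 0.0 | 0.0 | 44  | 72000  |
| 1 | 0.0 | 0.0 | 1.0 | 34  | 65000  |
| 2 | 0.0 | 1.0 | 0.0 | 46  | 98000  |
| 3 | 0.0 | 0.0 | 1.0 | 35  | 45000  |
| 4 | 0.0 | 1.0 | 0.0 | 23  | 34000  |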
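The encode, concatenate, and drop steps can also be collapsed into a single call with pandas. This is an alternative technique, not the scikit-learn approach used in this notebook: a minimal sketch with `pd.get_dummies`, where the name `df_dummies` is chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['India', 'US', 'Japan', 'US', 'Japan'],
    'age': [44, 34, 46, 35, 23],
    'salary': [72000, 65000, 98000, 45000, 34000]
})

# Encode only the 'country' column; the other columns are kept as they are.
# dtype=int yields 0/1 integers instead of booleans.
df_dummies = pd.get_dummies(df, columns=['country'], dtype=int)
print(df_dummies.columns.tolist())
```

Unlike a fitted `OneHotEncoder`, `pd.get_dummies` does not remember the categories it saw, so it may produce different columns when applied to new data that contains a different set of countries.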


## Label Encoding vs One Hot Encoding

Use One Hot Encoding when:
- The categorical values are nominal
- The number of categories is not too large

Use Label Encoding when:
- The categorical values are ordinal
- The number of categories is relatively large
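One caveat for ordinal values: `LabelEncoder` assigns integers in alphabetical order, which may not match the real ranking of the categories. A possible workaround, sketched here with a hypothetical `size` feature that does not appear in the dataset above, is scikit-learn's `OrdinalEncoder` with an explicit category order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: small < medium < large.
sizes = np.array(['medium', 'small', 'large', 'small']).reshape(-1, 1)

# Pass the intended order explicitly; alphabetical order would instead
# give large=0, medium=1, small=2.
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
codes = encoder.fit_transform(sizes)
print(codes.ravel().tolist())  # [1.0, 0.0, 2.0, 0.0]
```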