## Topic: One-hot Encoding

### OUTCOMES

- 1. Introduction to one-hot encoding.

- 2. Code implementation of one-hot encoding.

- 3. One-hot encoding for real-world datasets.

### 1. Introduction to one-hot encoding.

- Definition:
    - Encoding converts categorical data into numerical data.

- Types of categorical data
    - 1. Nominal data
        - categories with no inherent order
        - e.g. color, city, gender

    - 2. Ordinal data
        - categories with a meaningful order
        - e.g. rank, course feedback, review rating


- One-hot encoding:
    - One-hot encoding is applied to nominal data.
    - It converts each nominal category into numerical columns.
    - Each value in those columns is either 0 or 1.


- Steps of one-hot encoding:
    - step_01: find the nominal features
        - create a new column for each unique value.
        - in each new column, assign 1 to the rows that have that value and 0 to all other rows.

    - step_02: combine the new columns with the original data

    - step_03: drop the old column
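
The three steps can be sketched by hand, without any encoding helper, to show what each step produces. This is only an illustration under assumptions: the tiny `color` frame and the `color_` column prefix are made up for the example.

```python
import pandas as pd

# Small illustrative frame with one nominal feature (assumed data).
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# step_01 + step_02: for each unique value, add a new 0/1 column
# to the original data (1 where the row has that value, else 0).
for value in sorted(df["color"].unique()):
    df[f"color_{value}"] = (df["color"] == value).astype(int)

# step_03: drop the old, unencoded column.
encoded = df.drop(columns="color")

print(encoded)
```

Every row of `encoded` contains exactly one 1, which is where the name "one-hot" comes from.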

 

### 2. Code implementation of one-hot encoding.

In [1]:
import numpy as np

import pandas as pd


In [2]:
df = pd.DataFrame({
    "id":[1,2,3,4],
    "color":["red","blue","green","red"],
    "size":["Small","Medium","Large","Medium"],
    "price":[10,12,15,11]
})

df

Unnamed: 0,id,color,size,price
0,1,red,Small,10
1,2,blue,Medium,12
2,3,green,Large,15
3,4,red,Medium,11


- Here, the column to encode (nominal categorical) is color.

In [3]:
# step_01: detect column ['color'] and apply one hot encoding

d_color = pd.get_dummies(df['color'], prefix = 'C', dtype = int)

d_color

Unnamed: 0,C_blue,C_green,C_red
0,0,0,1
1,1,0,0
2,0,1,0
3,0,0,1


In [14]:
# step_02: combine the new columns with the original data

de_enconded = pd.concat([df,d_color], axis = 1)

de_enconded


Unnamed: 0,id,color,size,price,C_blue,C_green,C_red
0,1,red,Small,10,0,0,1
1,2,blue,Medium,12,1,0,0
2,3,green,Large,15,0,1,0
3,4,red,Medium,11,0,0,1


In [None]:
# step_03: Drop the old column (color)

de_enconded = de_enconded.drop('color', axis = 1)

de_enconded

Unnamed: 0,id,size,price,C_blue,C_green,C_red
0,1,Small,10,0,0,1
1,2,Medium,12,1,0,0
2,3,Large,15,0,1,0
3,4,Medium,11,0,0,1
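
As a possible shortcut, `pd.get_dummies` can also take the whole DataFrame together with a `columns` list, which should perform all three steps in one call. The sketch below rebuilds the same example frame so it runs on its own; the name `one_call` is only illustrative.

```python
import pandas as pd

# Same example data as above, rebuilt so this cell is self-contained.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "color": ["red", "blue", "green", "red"],
    "size": ["Small", "Medium", "Large", "Medium"],
    "price": [10, 12, 15, 11],
})

# Encode only the listed column; the original 'color' column is
# replaced by the new C_* columns, other columns are left untouched.
one_call = pd.get_dummies(df, columns=["color"], prefix="C", dtype=int)

print(one_call)
```

The step-by-step version is still useful for learning, since it makes each intermediate result visible.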


### 3. One-hot encoding for real-world datasets

In [16]:
df = pd.read_csv('Data.csv')

df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


- Here, the feature to encode is Country, because it is nominal data.

In [None]:
# step_01: Apply one-hot encoding for Country column

d_country = pd.get_dummies(df['Country'], prefix = 'C', dtype = int)

d_country

Unnamed: 0,C_France,C_Germany,C_Spain
0,1,0,0
1,0,0,1
2,0,1,0
3,0,0,1
4,0,1,0
5,1,0,0
6,0,0,1
7,1,0,0
8,0,1,0
9,1,0,0


In [22]:
# step_02: concatenate the new columns with the original data

df_encode = pd.concat([df, d_country], axis = 1)

df_encode

Unnamed: 0,Country,Age,Salary,Purchased,C_France,C_Germany,C_Spain
0,France,44.0,72000.0,No,1,0,0
1,Spain,27.0,48000.0,Yes,0,0,1
2,Germany,30.0,54000.0,No,0,1,0
3,Spain,38.0,61000.0,No,0,0,1
4,Germany,40.0,,Yes,0,1,0
5,France,35.0,58000.0,Yes,1,0,0
6,Spain,,52000.0,No,0,0,1
7,France,48.0,79000.0,Yes,1,0,0
8,Germany,50.0,83000.0,No,0,1,0
9,France,37.0,67000.0,Yes,1,0,0


In [23]:
# step_03: drop the old, unencoded column (Country)

df_encode = df_encode.drop("Country", axis = 1)

df_encode

Unnamed: 0,Age,Salary,Purchased,C_France,C_Germany,C_Spain
0,44.0,72000.0,No,1,0,0
1,27.0,48000.0,Yes,0,0,1
2,30.0,54000.0,No,0,1,0
3,38.0,61000.0,No,0,0,1
4,40.0,,Yes,0,1,0
5,35.0,58000.0,Yes,1,0,0
6,,52000.0,No,0,0,1
7,48.0,79000.0,Yes,1,0,0
8,50.0,83000.0,No,0,1,0
9,37.0,67000.0,Yes,1,0,0
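
Since the same three steps were used for both datasets, they could be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `one_hot_encode` is a hypothetical name, and the sample frame just reuses the first three rows of the Country and Age columns so the cell runs without Data.csv.

```python
import pandas as pd


def one_hot_encode(df, column, prefix):
    """Hypothetical helper: one-hot encode `column` and return a new DataFrame."""
    # step_01: build the 0/1 columns for the nominal feature
    dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
    # step_02: combine the new columns with the original data
    combined = pd.concat([df, dummies], axis=1)
    # step_03: drop the old, unencoded column
    return combined.drop(column, axis=1)


# First three rows of the Country/Age columns shown above.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
})

sample_encoded = one_hot_encode(sample, "Country", prefix="C")

print(sample_encoded)
```

The helper returns a new frame and leaves the input unchanged, so it can be called again with a different column.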


### Note

- The Country column is now encoded as numbers and ready to be used for training a model.