# **Data Preprocessing - Encoding Techniques**

## Introduction to Encoding Techniques
Encoding techniques convert categorical variables into numbers so that machine learning algorithms can use them. Two common techniques are:

1.   **One-Hot Encoding**: This technique converts a categorical column into a set of binary columns, one per unique category. In each row, exactly one of these columns is 1 (the category that is present) and all the others are 0. It is useful when the categories have no ordinal relationship.

2.   **Ordinal Encoding**: This technique assigns each category a unique integer that reflects its position in an inherent order. It is suitable for ordinal variables, where the categories have a meaningful sequence.
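To make the distinction concrete, here is a minimal pure-Python sketch of both ideas. It is an illustration only, not how scikit-learn implements its encoders; the helper names `one_hot` and `ordinal` and the small sample lists are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))  # fix a column order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def ordinal(values, order):
    """Map each value to its position in the caller-supplied order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


# Hypothetical sample data for this illustration.
cities = ['Tampa', 'Orlando', 'Miami']
sizes = ['Small', 'Large', 'Medium']

city_columns, city_rows = one_hot(cities)
print(city_columns)  # ['Miami', 'Orlando', 'Tampa']
print(city_rows)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(ordinal(sizes, order=['Small', 'Medium', 'Large']))  # [0, 2, 1]
```

Note that `one_hot` needs no ordering information, while `ordinal` is only meaningful when the caller supplies the order.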

## One-Hot Encoding

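One-Hot Encoding creates a new binary column for each unique value in the categorical column. The cells below apply it to the `city` column with scikit-learn's `OneHotEncoder`. The `sparse_output` parameter and the `set_output` method used here require scikit-learn 1.2 or later.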

In [38]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [39]:
d = {
    'sales': [100000, 222000, 1000000, 522000, 111111, 222222,
              1111111, 20000, 75000, 90000, 1000000, 10000],
    'city': ['Tampa', 'Tampa', 'Orlando', 'Jacksonville', 'Miami', 'Jacksonville',
             'Miami', 'Miami', 'Orlando', 'Orlando', 'Orlando', 'Orlando'],
    'size': ['Small', 'Medium', 'Large', 'Large', 'Small', 'Medium',
             'Large', 'Small', 'Medium', 'Medium', 'Medium', 'Small'],
}

In [40]:
df = pd.DataFrame(data=d)
df.head()

|   | sales | city | size |
|---|---|---|---|
| 0 | 100000 | Tampa | Small |
| 1 | 222000 | Tampa | Medium |
| 2 | 1000000 | Orlando | Large |
| 3 | 522000 | Jacksonville | Large |
| 4 | 111111 | Miami | Small |


In [41]:
df['city'].unique()

array(['Tampa', 'Orlando', 'Jacksonville', 'Miami'], dtype=object)

In [42]:
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform="pandas")

In [43]:
ohetransform = ohe.fit_transform(df[['city']])

In [44]:
ohetransform

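|   | city_Jacksonville | city_Miami | city_Orlando | city_Tampa |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | 0.0 | 1.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 1.0 | 0.0 |
| 9 | 0.0 | 0.0 | 1.0 | 0.0 |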


In [45]:
ohetransform.head()

|   | city_Jacksonville | city_Miami | city_Orlando | city_Tampa |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 |


In [46]:
df = pd.concat([df,ohetransform], axis=1).drop(columns='city')

In [47]:
df.head()

|   | sales | size | city_Jacksonville | city_Miami | city_Orlando | city_Tampa |
|---|---|---|---|---|---|---|
| 0 | 100000 | Small | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 222000 | Medium | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 1000000 | Large | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 522000 | Large | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 111111 | Small | 0.0 | 1.0 | 0.0 | 0.0 |


In [48]:
df.to_csv('OneHotEncoded.csv', index=True)

## Ordinal Encoding

Ordinal Encoding assigns each category an integer according to a predefined order. It suits columns whose values have a meaningful sequence, such as `size` here, where Small comes before Medium and Medium before Large.

In [49]:
ordinal_df = pd.DataFrame(data=d)
ordinal_df.head()

|   | sales | city | size |
|---|---|---|---|
| 0 | 100000 | Tampa | Small |
| 1 | 222000 | Tampa | Medium |
| 2 | 1000000 | Orlando | Large |
| 3 | 522000 | Jacksonville | Large |
| 4 | 111111 | Miami | Small |


In [50]:
ordinal_df['size'].unique()

array(['Small', 'Medium', 'Large'], dtype=object)

In [51]:
sizes = ['Small', 'Medium', 'Large']

In [52]:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(categories = [sizes])

In [53]:
enc.fit_transform(ordinal_df[['size']])

array([[0.],
       [1.],
       [2.],
       [2.],
       [0.],
       [1.],
       [2.],
       [0.],
       [1.],
       [1.],
       [1.],
       [0.]])
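Passing `categories=[sizes]` is what makes the codes follow Small, Medium, Large. As a hedged illustration on a small stand-in frame (`sizes_df` is made up for this example), the sketch below compares that with an encoder given no `categories`, in which case scikit-learn sorts the unique values and the codes follow alphabetical order instead.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in data: one row per size label.
sizes_df = pd.DataFrame({'size': ['Small', 'Medium', 'Large']})

# No categories given: the unique values are sorted, so the order is alphabetical.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(sizes_df[['size']]).ravel()

# Explicit order: Small, then Medium, then Large.
ordered_enc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
ordered_codes = ordered_enc.fit_transform(sizes_df[['size']]).ravel()

print(list(default_enc.categories_[0]))  # ['Large', 'Medium', 'Small']
print(default_codes.tolist())            # [2.0, 1.0, 0.0]
print(ordered_codes.tolist())            # [0.0, 1.0, 2.0]
```

The default encoding would rank Large below Small, which reverses the intended order, so supplying the list explicitly is the safer choice for ordinal columns.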

In [54]:
ordinal_df['size'] = enc.fit_transform(ordinal_df[['size']])

In [55]:
ordinal_df.head()

|   | sales | city | size |
|---|---|---|---|
| 0 | 100000 | Tampa | 0.0 |
| 1 | 222000 | Tampa | 1.0 |
| 2 | 1000000 | Orlando | 2.0 |
| 3 | 522000 | Jacksonville | 2.0 |
| 4 | 111111 | Miami | 0.0 |
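The encoded `size` column holds floats because `OrdinalEncoder` outputs `float64` by default. The following sketch, on a stand-in frame named `frame` that is made up for this example, shows two optional variations: requesting integer codes through the `dtype` parameter, and recovering the original labels with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = ['Small', 'Medium', 'Large']
# Stand-in data mirroring the first five rows of the size column.
frame = pd.DataFrame({'size': ['Small', 'Medium', 'Large', 'Large', 'Small']})

# dtype=int requests integer codes instead of the default float64.
enc = OrdinalEncoder(categories=[sizes], dtype=int)
codes = enc.fit_transform(frame[['size']])

# inverse_transform maps the codes back to the original labels.
restored = enc.inverse_transform(codes)

print(codes.ravel().tolist())     # [0, 1, 2, 2, 0]
print(restored.ravel().tolist())  # ['Small', 'Medium', 'Large', 'Large', 'Small']
```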


In [56]:
ordinal_df.to_csv('OrdinalEncoded.csv', index=True)