# **One-Hot Encoder**
One-hot encoding is a technique for converting a categorical variable into numerical form. 
Each category becomes its own binary (0/1) column, so the data can be processed by machine learning algorithms that expect numerical input.

**Example: One-Hot Encoding of Colors**

Suppose we have a dataset with a column called "Color" that contains the following values: 
"red", "green", "blue", "red", "green", "blue", "red", "green", "blue".

We can use one-hot encoding to convert this column into a numerical format. With the positions ordered as (red, green, blue), each value becomes a vector with a single 1 in the position of its category: 
[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1].
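
As a quick check of the vectors listed above, here is a minimal sketch using `pandas.get_dummies`. The variable names (`colors`, `color_type`, `encoded`, `vectors`) are illustrative, and the category order is fixed explicitly to (red, green, blue) because pandas would otherwise sort the columns alphabetically.

```python
import pandas as pd

# The "Color" column from the example above.
colors = pd.Series(
    ["red", "green", "blue", "red", "green", "blue", "red", "green", "blue"],
    name="Color",
)

# Assumption: fix the column order to red, green, blue so the result
# matches the vectors in the text (the default order would be alphabetical).
color_type = pd.CategoricalDtype(categories=["red", "green", "blue"])
encoded = pd.get_dummies(colors.astype(color_type), dtype=int)

# One list of 0/1 values per row.
vectors = encoded.to_numpy().tolist()
print(vectors[:3])  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```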

One-hot encoding adds one column per category, so it works best when a categorical variable has a small to moderate number of categories. A variable with a very large number of categories produces a wide, mostly zero matrix. 
One-hot encoding is intended for nominal data, which is data that has no inherent order or ranking. Because every category gets its own column, the encoding does not impose an artificial order on the categories.

- Label Encoder: scikit-learn's `LabelEncoder` normalizes labels so that they take integer values from 0 to n_classes - 1. It can also transform non-numerical labels (as long as they are hashable and comparable) into numerical labels. It is meant for encoding the target variable rather than input features.

- Ordinal Encoder: Ordinal encoding converts categorical data into numeric data by assigning a unique integer to each category. It is particularly useful when the categorical variable has an inherent ordering or ranking, such as the sizes Small, Medium, and Large.
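
To make the difference between the two encoders concrete, the sketch below applies both to a few size values. The sample values and variable names are assumptions chosen for illustration. The explicit `categories` order passed to `OrdinalEncoder` is also an assumption about what the ranking should be.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative sample with an inherent order: Small < Medium < Large.
sizes = pd.DataFrame({"size": ["Small", "Medium", "Large", "Large", "Small"]})

# LabelEncoder assigns integers by sorted (alphabetical) order of the labels:
# Large -> 0, Medium -> 1, Small -> 2. The natural ranking is not preserved.
le = LabelEncoder()
label_codes = le.fit_transform(sizes["size"])
print(label_codes.tolist())  # [2, 1, 0, 0, 2]

# OrdinalEncoder accepts an explicit category order, so the integers
# can follow the ranking: Small -> 0, Medium -> 1, Large -> 2.
oe = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = oe.fit_transform(sizes[["size"]])
print(ordinal_codes.ravel().tolist())  # [0.0, 1.0, 2.0, 2.0, 0.0]
```

Note that `OrdinalEncoder` expects a 2D input (hence `sizes[["size"]]`), while `LabelEncoder` expects a 1D sequence of labels.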

In [14]:
import pandas as pd

In [15]:
d = {
    'sales': [100000, 222000, 1000000, 522000, 111111, 222222, 1111111, 20000, 75000, 90000, 1000000, 10000],
    'city': ['Tampa', 'Tampa', 'Orlando', 'Jacksonville', 'Miami', 'Jacksonville', 'Miami', 'Miami', 'Orlando', 'Orlando', 'Orlando', 'Orlando'],
    'size': ['Small', 'Medium', 'Large', 'Large', 'Small', 'Medium', 'Large', 'Small', 'Medium', 'Medium', 'Medium', 'Small'],
}
display(d)

{'sales': [100000,
  222000,
  1000000,
  522000,
  111111,
  222222,
  1111111,
  20000,
  75000,
  90000,
  1000000,
  10000],
 'city': ['Tampa',
  'Tampa',
  'Orlando',
  'Jacksonville',
  'Miami',
  'Jacksonville',
  'Miami',
  'Miami',
  'Orlando',
  'Orlando',
  'Orlando',
  'Orlando'],
 'size': ['Small',
  'Medium',
  'Large',
  'Large',
  'Small',
  'Medium',
  'Large',
  'Small',
  'Medium',
  'Medium',
  'Medium',
  'Small']}

In [16]:
df = pd.DataFrame(data=d)
df.head()

|   | sales | city | size |
|---|---|---|---|
| 0 | 100000 | Tampa | Small |
| 1 | 222000 | Tampa | Medium |
| 2 | 1000000 | Orlando | Large |
| 3 | 522000 | Jacksonville | Large |
| 4 | 111111 | Miami | Small |


In [17]:
df['city'].unique()

array(['Tampa', 'Orlando', 'Jacksonville', 'Miami'], dtype=object)

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Create an instance of OneHotEncoder (sparse_output and set_output require scikit-learn 1.2 or later)
# handle_unknown='ignore': categories not seen during fit are encoded as all zeros instead of raising an error
# sparse_output=False: return a dense array instead of a sparse matrix
# set_output(transform='pandas'): return a DataFrame with named columns
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')
# Fit and transform the 'city' column (double brackets keep the input 2D)
ohetransform = ohe.fit_transform(df[['city']])

In [19]:
display(ohetransform)

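|   | city_Jacksonville | city_Miami | city_Orlando | city_Tampa |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | 0.0 | 1.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 1.0 | 0.0 |
| 9 | 0.0 | 0.0 | 1.0 | 0.0 |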


In [20]:
ohetransform.head()

|   | city_Jacksonville | city_Miami | city_Orlando | city_Tampa |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 |


In [21]:
# Append the encoded columns, then drop the original 'city' column
df = pd.concat([df, ohetransform], axis=1).drop(columns=['city'])

In [22]:
df.head(20)

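|   | sales | size | city_Jacksonville | city_Miami | city_Orlando | city_Tampa |
|---|---|---|---|---|---|---|
| 0 | 100000 | Small | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 222000 | Medium | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 1000000 | Large | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 522000 | Large | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 111111 | Small | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | 222222 | Medium | 1.0 | 0.0 | 0.0 | 0.0 |
| 6 | 1111111 | Large | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | 20000 | Small | 0.0 | 1.0 | 0.0 | 0.0 |
| 8 | 75000 | Medium | 0.0 | 0.0 | 1.0 | 0.0 |
| 9 | 90000 | Medium | 0.0 | 0.0 | 1.0 | 0.0 |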
