## **Data Encoding**
1. **Nominal/OHE Encoding**
2. **Label and Ordinal Encoding**
3. **Target Guided Ordinal Encoding**

#### **1. Nominal/OHE (One Hot Encoding)**

- One Hot Encoding (OHE), also known as nominal encoding, is a technique used to **represent categorical data as numerical data**, which is more suitable for machine learning algorithms.

- In this technique, *each category is represented as a binary vector where each bit corresponds to a unique category*.

- For example, if we have a categorical variable "color" with 3 possible values (Red, Green, Blue), we can represent it with one hot encoding as follows:

1. Red [1, 0, 0]
2. Green [0, 1, 0]
3. Blue [0, 0, 1]

##### **Disadvantages**
- If a feature has too many unique categories, One Hot Encoding is not suitable, because it adds one new column per category.
- The result is a large sparse matrix (mostly zeros), which can make the model prone to overfitting.
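To make the column growth concrete, here is a minimal sketch using a made-up high-cardinality column (the `id_...` values and the variable names are assumptions for illustration, not part of the notebook's data). It only inspects the shape and density of the encoded matrix.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column: 1,000 rows cycling through 500 distinct IDs
ids = np.array([f"id_{i % 500}" for i in range(1000)]).reshape(-1, 1)

encoder = OneHotEncoder()             # returns a SciPy sparse matrix by default
encoded = encoder.fit_transform(ids)

n_rows, n_cols = encoded.shape        # one column per distinct category
density = encoded.nnz / (n_rows * n_cols)  # fraction of non-zero cells

print(n_rows, n_cols, density)
```

Each row holds exactly one non-zero entry, so the more categories there are, the emptier the matrix becomes.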

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
# create a simple dataframe
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue']
})

In [3]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,red
4,blue


#### **1. Create an Instance of OHE**

In [4]:
# create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [5]:
# fit and transform: returns a sparse matrix, converted to a dense array with toarray()
encoded = encoder.fit_transform(df[['color']]).toarray()

In [6]:
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [7]:
encoder_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0


In [8]:
# for the new data
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])

In [9]:
pd.concat([df, encoder_df], axis=1)

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,red,0.0,0.0,1.0
4,blue,1.0,0.0,0.0
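For quick exploration, pandas offers `pd.get_dummies` as an alternative to `OneHotEncoder`. The sketch below rebuilds the same small color frame under a different name (`df_colors`, chosen here only to avoid clobbering `df`). Unlike a fitted encoder, `get_dummies` does not remember the categories, so it is less suited to transforming new data later.

```python
import pandas as pd

# Same toy data as above, rebuilt so the example is self-contained
df_colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue']})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df_colors['color'], prefix='color', dtype=int)

print(dummies)
```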


## **Data Encoding HW**
Apply data encoding to the tips dataset from seaborn.

In [10]:
import seaborn as sns
sns.load_dataset('tips') # look for the categorical columns and get started

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


#### **2. Label and Ordinal Encoding**

- Label Encoding and Ordinal Encoding are two techniques used to encode categorical data as numerical data.

- Label Encoding involves *assigning a **unique numerical** label to each category in the variable*.

- The labels are usually assigned in alphabetical order or **based on frequency of the categories**. scikit-learn's `LabelEncoder` uses sorted (alphabetical) order, starting from 0.

- For example, if we have a categorical variable "color" with 3 possible values (red, green, blue), one possible label encoding is:
    - Red: 1
    - Green: 2
    - Blue: 3

In [11]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,red
4,blue


In [12]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [13]:
# LabelEncoder expects a 1D input, so pass the column as a Series
lbl_encoder.fit_transform(df['color'])


array([2, 0, 1, 2, 0])

In [14]:
lbl_encoder.transform(['red'])


array([2])

In [15]:
lbl_encoder.transform(['green'])


array([1])

In [16]:
lbl_encoder.transform(['blue'])


array([0])
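The outputs above follow from the encoder sorting the categories alphabetically. A small sketch of how to inspect that mapping and reverse it, using `classes_` and `inverse_transform` (the variable names `colors`, `lbl`, `mapping`, and `decoded` are chosen here for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(['red', 'blue', 'green', 'red', 'blue'], name='color')

lbl = LabelEncoder()
codes = lbl.fit_transform(colors)

# classes_ holds the sorted categories; the position of each one is its label
mapping = {str(label): code for code, label in enumerate(lbl.classes_)}

# inverse_transform recovers the original categories from the integer codes
decoded = lbl.inverse_transform(codes)

print(mapping)
print(list(decoded))
```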

- A model may misinterpret these integer labels as ranks (for example, red > green > blue) and use that ordering when predicting or classifying.

- If that is not desirable, avoid label encoding for input features; one hot encoding is a common alternative for nominal data.

- If we do want to make use of ranks, we need **Ordinal Encoding**.

##### **Ordinal Encoding**

- It is used to **encode categorical data** that *have an intrinsic order or ranking*.

- In this technique, **each category is assigned a numerical value based on its position in the order.**

- For example, if we have a categorical variable "education level" with 4 possible values (high_school, college, graduate, post_graduate), we can represent it using ordinal encoding as follows:

1. High_School: 1
2. College: 2
3. Graduate: 3
4. Post_Graduate: 4

In [17]:
## Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder

In [18]:
df = pd.DataFrame({
    'size': ['small','medium','large','medium','small','large']
})

In [19]:
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [20]:
# assign rankings
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

In [21]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [23]:
encoder.transform([['small']])



array([[0.]])
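The explicit `categories` argument matters because, without it, `OrdinalEncoder` falls back to sorted (alphabetical) order, which would not match the intended ranking here. A minimal comparison, with `sizes`, `default_codes`, and `ranked_codes` as illustrative names:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small', 'large']})

# Default: categories are sorted alphabetically (large, medium, small)
default_codes = OrdinalEncoder().fit_transform(sizes[['size']]).ravel()

# Explicit ranking: small < medium < large
ranked = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ranked_codes = ranked.fit_transform(sizes[['size']]).ravel()

print(default_codes)
print(ranked_codes)
```

With the default ordering, `large` receives the smallest code, so the numbers no longer reflect the size ranking.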

#### **3. Target Guided Ordinal Encoding**

- It is a technique used to **encode categorical variables** *based on their relationship with the target variable*.

- *This encoding technique is useful when* we have a **categorical variable with a large number of unique categories** and **we want to use this variable as a feature** in our machine learning model.

- In Target Guided Ordinal Encoding *we replace each category in the categorical variable with a numerical value based on the **mean or median of the target variable** for that category*. 

- This creates a **monotonic relationship** *between the categorical variable and the target variable*, which can improve the predictive power of our model.

In [24]:
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [25]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [26]:
df.groupby('city')['price'].mean()

city
London      150.0
New York    190.0
Paris       310.0
Tokyo       250.0
Name: price, dtype: float64

In [27]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [28]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [29]:
df['city_encoded'] = df['city'].map(mean_price)

In [30]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0


In [32]:
# now we can select only the columns we want to feed to the model
df[['city', 'city_encoded']]

Unnamed: 0,city,city_encoded
0,New York,190.0
1,London,150.0
2,Paris,310.0
3,Tokyo,250.0
4,New York,190.0
5,Paris,310.0
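The cells above map each city to its raw mean price. A common ordinal variant, sketched here as an assumption rather than as the notebook's own method, ranks the categories by that mean and uses the rank as the encoded value (`df_city`, `rank_map`, and `city_rank` are illustrative names). In practice the means would normally be computed on the training split only, so that target information from validation or test rows does not leak into the feature.

```python
import pandas as pd

df_city = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

# Mean target per category, sorted from lowest to highest
mean_price = df_city.groupby('city')['price'].mean().sort_values()

# Rank 1 goes to the city with the lowest mean price
rank_map = {city: rank for rank, city in enumerate(mean_price.index, start=1)}

df_city['city_rank'] = df_city['city'].map(rank_map)

print(rank_map)
print(df_city)
```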


## Homework

In [33]:
# encode the 'time' column based on total_bill, using target guided encoding
import seaborn as sns
sns.load_dataset('tips')

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
