## Data Encoding
1. Nominal / One-Hot Encoding (preferred when the number of categories to encode is small)
2. Label and Ordinal Encoding
3. Target-Guided Ordinal Encoding

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [3]:
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green', 'blue']})

In [4]:
# Create an instance of the OneHotEncoder class
encoder = OneHotEncoder()

In [7]:
encoder.fit_transform(df[['color']]).toarray()  # df[['color']] (double brackets) is a 2D DataFrame, which the encoder requires

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [11]:
encoded = encoder.fit_transform(df[['color']]).toarray()
encoded_data = pd.DataFrame(encoded, columns=encoder.get_feature_names_out()) # sets the columns to the feature names
encoded_data

| | color_blue | color_green | color_red |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 |


### Label Encoding
Label encoding maps each category to an integer code. <br>
For example: `Blue: 0 | Green: 1 | Red: 2` (scikit-learn assigns the codes in sorted order of the categories)

In [12]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [13]:
le.fit_transform(df['color'])  # LabelEncoder expects a 1D array, so pass the Series, not df[['color']]

array([2, 1, 0, 2, 1, 0])

In [17]:
print(le.transform(['red']))  # encode individual values (1D input)
print(le.transform(['green']))

[2]
[1]


The problem with label encoding is that the assigned integers, such as <br> `Red: 2` | `Blue: 0` | `Green: 1` <br> may be interpreted by the model as a ranking (blue < green < red), which is undesirable for nominal data.
Integer codes are appropriate only when the categories have a genuine order. Note also that scikit-learn documents `LabelEncoder` as intended for target labels (`y`) rather than input features.
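
To see where the integer codes come from, the fitted encoder can be inspected. This is a minimal sketch using the same colors as above; `mapping` and `restored` are illustrative names, not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'green', 'blue', 'red', 'green', 'blue']

le = LabelEncoder()
codes = le.fit_transform(colors)  # 1D input, as LabelEncoder expects

# classes_ holds the categories in sorted order; a category's position is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels
restored = le.inverse_transform(codes)
print(list(restored))
```

The codes follow alphabetical order, not any meaningful ranking of the colors.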

### Ordinal Encoding
When the categories do have a natural order, we use ordinal encoding, which assigns integers that follow that order. <br>

For example, we can assign the following ranks to education levels: <br>

`HighSchool: 1 | College: 2 | Graduate: 3 | Post-Graduate: 4`

In [18]:
from sklearn.preprocessing import OrdinalEncoder


In [23]:
df = pd.DataFrame({'Size': ['small', 'medium', 'large', 'small', 'medium', 'large']})

In [25]:
# The order of the list defines the codes: small -> 0, medium -> 1, large -> 2
oe = OrdinalEncoder(categories=[['small', 'medium', 'large']])
oe.fit_transform(df[['Size']])

array([[0.],
       [1.],
       [2.],
       [0.],
       [1.],
       [2.]])
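
The education example from the text can be encoded the same way. The sketch below assumes a made-up `edu` frame and uses scikit-learn's `handle_unknown='use_encoded_value'` option to map categories outside the list (here the hypothetical `Doctorate`) to -1. Note that scikit-learn's codes start at 0, not 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Lowest to highest; the list order defines the codes 0, 1, 2, 3
levels = ['HighSchool', 'College', 'Graduate', 'Post-Graduate']
edu = pd.DataFrame({'education': ['College', 'HighSchool', 'Post-Graduate', 'Graduate']})

# Categories not in `levels` are encoded as -1 instead of raising an error
oe = OrdinalEncoder(
    categories=[levels],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
edu['education_rank'] = oe.fit_transform(edu[['education']]).ravel()
print(edu)

# 'Doctorate' is not in the category list
unseen = oe.transform(pd.DataFrame({'education': ['Doctorate']}))
print(unseen)
```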

### Target-Guided Ordinal Encoding
* Encodes a categorical variable based on its relationship with the target variable
* Useful when the variable has a large number of unique categories
* Replaces each category with a numerical value based on the `mean or median of the target variable` for that category

In [28]:
df = pd.DataFrame({
    'city': ['New York', 'London', 'Tokyo', 'New York', 'London', 'Tokyo'],
    'price': [100, 200, 150, 120, 180, 250]
})

In [30]:
mean_price = df.groupby('city')['price'].mean().to_dict()
mean_price

{'London': 190.0, 'New York': 110.0, 'Tokyo': 200.0}

In [32]:
# Map each city to its mean price, so the mean becomes the encoded value for that city.
# The "city_encoded" column can then be used in place of the "city" column.
df['city_encoded'] = df['city'].map(mean_price)
df

| | city | price | city_encoded |
|---|---|---|---|
| 0 | New York | 100 | 110.0 |
| 1 | London | 200 | 190.0 |
| 2 | Tokyo | 150 | 200.0 |
| 3 | New York | 120 | 110.0 |
| 4 | London | 180 | 190.0 |
| 5 | Tokyo | 250 | 200.0 |
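
The description above also mentions the median, and the "ordinal" in the name suggests integer ranks rather than raw means. The following sketch shows both variants on the same toy data; the column names `city_median` and `city_rank` are illustrative, and ranking categories by their mean is one common reading of the technique, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Tokyo', 'New York', 'London', 'Tokyo'],
    'price': [100, 200, 150, 120, 180, 250],
})

# Variant 1: median of the target per category
median_price = df.groupby('city')['price'].median()
df['city_median'] = df['city'].map(median_price)

# Variant 2: rank the categories by mean target (0 = lowest mean)
mean_price = df.groupby('city')['price'].mean()
city_rank = mean_price.rank(method='first').astype(int) - 1
df['city_rank'] = df['city'].map(city_rank)

print(df)
```

With only two prices per city, the median equals the mean here; on skewed data the two would differ.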


In [34]:
import seaborn as sns
data = sns.load_dataset('tips')

In [35]:
data

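| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |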


In [None]:
data['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [44]:
# 'time' is a categorical column; passing observed explicitly avoids the pandas FutureWarning
means = data.groupby('time', observed=False)['tip'].mean().to_dict()
means

{'Lunch': 2.7280882352941176, 'Dinner': 3.102670454545455}

In [47]:
data['time_encoded'] = data['time'].map(means)
data

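| | total_bill | tip | sex | smoker | day | time | size | time_encoded |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 3.10267 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.10267 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 3.10267 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 3.10267 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 3.10267 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 3.10267 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 3.10267 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 3.10267 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 3.10267 |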
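
Because the encoded values are computed from the target itself, they are usually learned from training rows only and then applied to unseen rows, so that the target of the rows being predicted does not leak into their features. The sketch below illustrates that idea under stated assumptions: the `train`/`test` split, the `Paris` category, and the fallback to the overall training mean are all hypothetical choices, not something the notebook above does.

```python
import pandas as pd

# Hypothetical split: the test rows have no known price yet
train = pd.DataFrame({
    'city': ['New York', 'London', 'Tokyo', 'New York'],
    'price': [100, 200, 150, 120],
})
test = pd.DataFrame({'city': ['London', 'Tokyo', 'Paris']})  # 'Paris' is unseen

# Learn the category -> mean mapping from the training rows only
city_means = train.groupby('city')['price'].mean()
global_mean = train['price'].mean()

train['city_encoded'] = train['city'].map(city_means)

# Categories missing from the training data fall back to the overall training mean
test['city_encoded'] = test['city'].map(city_means).fillna(global_mean)
print(test)
```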
