# Data Encoding

Data encoding converts categorical data into numerical values so that it can be used for modeling.

## Encoding Techniques

### 1. Nominal / OHE Encoding (One Hot Encoding)

Creates one binary column per category. Each row has a 1 in the column for its own category and 0 in the others.

**Use Case**: Non-ordinal categorical variables (e.g., colors, cities, brands)

```python
from sklearn.preprocessing import OneHotEncoder

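# `sparse_output=False` returns a dense array (the parameter is named `sparse` before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
# `data` is assumed to be a 2-D array-like or DataFrame of categorical columns
encoded_data = encoder.fit_transform(data)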
```

**Pros**:
- Works well with nominal (non-ordinal) categories
- Compatible with most machine learning algorithms

**Cons**:
- Creates sparse matrices with high dimensionality
- Can lead to overfitting with high cardinality features
- Increases computational complexity
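
To make the dimensionality cost concrete, the sketch below encodes a hypothetical 50-row frame and inspects the shape of the result. The `user_id` and `color` columns are made up for illustration, and the sketch assumes scikit-learn's default behaviour of returning a SciPy sparse matrix.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one high-cardinality column and one low-cardinality column
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(50)],  # 50 distinct values
    "color": ["red", "blue"] * 25,            # 2 distinct values
})

# The default output is a SciPy sparse matrix
encoded = OneHotEncoder().fit_transform(df)

n_rows, n_columns = encoded.shape  # one output column per distinct category
n_nonzero = encoded.nnz            # one 1 per row for each input column
print(n_rows, n_columns, n_nonzero)
```

Two input columns become 52 output columns, and only 100 of the 2,600 cells are non-zero, which is why the sparse representation matters for high-cardinality features.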

---

### 2. Label and Ordinal Encoding

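Assigns an integer to each category. Label encoding assigns integers without regard to any meaningful order (scikit-learn's `LabelEncoder` numbers the categories in sorted order and is intended for target labels), while ordinal encoding assigns integers according to an order you specify.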

**Use Case**: Ordinal categorical variables (e.g., education level, ratings, rankings)

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

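# Label Encoding: `data` must be 1-D here (e.g. a single column or the target)
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(data)

# Ordinal Encoding: `data` must be 2-D here (e.g. df[['level']]),
# and `categories` lists the values from lowest to highest
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded_ordinal = ordinal_encoder.fit_transform(data)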
```

**Pros**:
- Simple and efficient
- Preserves ordinality when appropriate
- Creates fewer features than one-hot encoding

**Cons**:
- Can introduce false ordinal relationships
- Algorithms may interpret encoded numbers as having inherent magnitude
- Not suitable for nominal data
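
The "false ordinal relationship" risk can be seen by encoding the same values both ways. This is a minimal sketch on made-up `levels` data, not a recommendation for either encoder.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal values
levels = np.array(["low", "medium", "high", "medium"])

# LabelEncoder numbers the classes in sorted order: high=0, low=1, medium=2
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(levels)

# OrdinalEncoder follows the order given in `categories`: low=0, medium=1, high=2
ordinal_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal_encoder.fit_transform(levels.reshape(-1, 1)).ravel()

print(label_codes.tolist())    # codes follow alphabetical order
print(ordinal_codes.tolist())  # codes follow the specified order
```

With label encoding, `high` receives a smaller code than `low`, so a model that treats the codes as magnitudes would see the ranking reversed.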

---

### 3. Target Guided Ordinal Encoding

Encodes categories based on their relationship with the target variable. Categories are ranked by their mean target value or other statistical measures.

**Use Case**: Categorical variables with a relationship to the target variable

```python
# Manual implementation example
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'A', 'B'],
    'target': [10, 20, 15, 12, 22]
})

# Calculate mean target value per category
target_mean = df.groupby('category')['target'].mean().sort_values()

# Create mapping
mapping = {cat: idx for idx, cat in enumerate(target_mean.index)}
df['encoded'] = df['category'].map(mapping)
```

**Pros**:
- Incorporates information from the target variable
- Can improve model performance for tree-based algorithms
- Reduces dimensionality compared to one-hot encoding

**Cons**:
- Risk of overfitting if not properly validated
- Requires a target variable (supervised approach)
- Can introduce data leakage if not handled carefully during cross-validation
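
One way to limit leakage is to learn the mapping from the training split only and then apply it to held-out rows. The sketch below assumes a hypothetical `train`/`valid` split and uses `-1` as an arbitrary placeholder code for categories never seen in training; both choices are illustrative.

```python
import pandas as pd

# Hypothetical training and validation splits
train = pd.DataFrame({
    "category": ["A", "B", "C", "A", "B"],
    "target": [10, 20, 15, 12, 22],
})
valid = pd.DataFrame({"category": ["B", "D"]})  # "D" never appears in train

# Rank categories by mean target using the training rows only
train_means = train.groupby("category")["target"].mean().sort_values()
mapping = {cat: rank for rank, cat in enumerate(train_means.index)}

# Apply the learned mapping; unseen categories get the placeholder code -1
valid["encoded"] = valid["category"].map(mapping).fillna(-1).astype(int)
print(valid)
```

The validation targets never enter the mapping, so the encoded values cannot carry information about the rows they are later used to evaluate.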

---

## Comparison Table

| Technique | Data Type | Dimensionality | Interpretability | Use Case |
|-----------|-----------|-----------------|------------------|----------|
| One Hot Encoding | Nominal | High | Easy | Non-ordinal categories |
| Label Encoding | Nominal (target labels) | Low | Moderate | Encoding the target column |
| Ordinal Encoding | Ordinal | Low | Moderate | Ranked categories with custom order |
| Target Guided | Mixed | Low | Moderate | Categories with target relationship |

---

## Selection Guide

- **Use One Hot Encoding** for non-ordinal categorical features (colors, cities, brands)
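- **Use Ordinal Encoding** for ordinal features with natural ordering (education, ratings); keep Label Encoding for the target column, since it numbers categories in sorted order rather than by rank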
- **Use Target Guided Encoding** for categorical features that correlate with the target variable and when you need to reduce dimensionality
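
When a dataset mixes nominal and ordinal features, the choices above can be applied per column. The following is a sketch using scikit-learn's `ColumnTransformer` on a made-up frame; the column names and the `small < medium < large` order are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical frame with one nominal and one ordinal feature
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "size": ["small", "large", "medium", "small"],
})

preprocess = ColumnTransformer(
    [
        ("nominal", OneHotEncoder(), ["color"]),
        ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), ["size"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: color_blue, color_green, color_red, size
encoded = preprocess.fit_transform(df)
print(encoded)
```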


In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

### 1. Nominal / OHE Encoding (One Hot Encoding)

In [2]:
df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green', 'red', 'blue', 'red']
})

In [3]:
encoder = OneHotEncoder()

In [4]:
encoded = encoder.fit_transform(df[['color']]).toarray()
print(encoded)

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]]


In [5]:
import pandas as pd
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['color']))

print(encoded_df)

    color_blue  color_green  color_red
0          0.0          0.0        1.0
1          0.0          1.0        0.0
2          1.0          0.0        0.0
3          0.0          0.0        1.0
4          0.0          1.0        0.0
5          1.0          0.0        0.0
6          0.0          0.0        1.0
7          0.0          1.0        0.0
8          0.0          0.0        1.0
9          1.0          0.0        0.0
10         0.0          0.0        1.0


In [6]:
concat_encoded_df = pd.concat([df, encoded_df], axis=1)
print(concat_encoded_df)

    color  color_blue  color_green  color_red
0     red         0.0          0.0        1.0
1   green         0.0          1.0        0.0
2    blue         1.0          0.0        0.0
3     red         0.0          0.0        1.0
4   green         0.0          1.0        0.0
5    blue         1.0          0.0        0.0
6     red         0.0          0.0        1.0
7   green         0.0          1.0        0.0
8     red         0.0          0.0        1.0
9    blue         1.0          0.0        0.0
10    red         0.0          0.0        1.0


#### Exercise based on One Hot Encoding

In [7]:
import seaborn as sns

df = sns.load_dataset('tips')

In [18]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [9]:
def dataset_encoder(pendingEncodingColumnList, dataset):
    """One-hot encode each listed column and return the encoded columns side by side."""
    concat_encoded_column_df = pd.DataFrame()
    for column in pendingEncodingColumnList:
        # Refit the shared `encoder` (created in In [3]) on one column at a time
        encoded_column = encoder.fit_transform(dataset[[column]]).toarray()
        encoded_column_df = pd.DataFrame(encoded_column, columns=encoder.get_feature_names_out([column]))
        concat_encoded_column_df = pd.concat([concat_encoded_column_df, encoded_column_df], axis=1)

    return concat_encoded_column_df

In [10]:
pending_encoding_columns_list = ['sex', 'smoker', 'day', 'time']

concat_encoded_dataset = dataset_encoder(pending_encoding_columns_list, df)


In [11]:
concat_encoded_dataset.head()

Unnamed: 0,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [12]:
# Side-by-side view of an original column and its encoded columns

encoded_column_sex_df = pd.concat([df['sex'], concat_encoded_dataset['sex_Female'], concat_encoded_dataset['sex_Male']], axis=1)
encoded_column_sex_df.head()

Unnamed: 0,sex,sex_Female,sex_Male
0,Female,1.0,0.0
1,Male,0.0,1.0
2,Male,0.0,1.0
3,Male,0.0,1.0
4,Female,1.0,0.0


In [None]:
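def getDatasetColumnDict(columnList, dataset):
    # Distinct values of each column, kept for reference
    column_dist_dict = {
        column: dataset[column].unique().tolist() for column in columnList
    }

    encoded_columns_df = dataset_encoder(columnList, dataset)
    # Return the encoded frame so that calling the function displays it
    return encoded_columns_df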


In [22]:
encode_required_columns_list = ['sex', 'smoker', 'day', 'time']
getDatasetColumnDict(encode_required_columns_list, df)

     sex_Female  sex_Male  smoker_No  smoker_Yes  day_Fri  day_Sat  day_Sun  \
0           1.0       0.0        1.0         0.0      0.0      0.0      1.0   
1           0.0       1.0        1.0         0.0      0.0      0.0      1.0   
2           0.0       1.0        1.0         0.0      0.0      0.0      1.0   
3           0.0       1.0        1.0         0.0      0.0      0.0      1.0   
4           1.0       0.0        1.0         0.0      0.0      0.0      1.0   
..          ...       ...        ...         ...      ...      ...      ...   
239         0.0       1.0        1.0         0.0      0.0      1.0      0.0   
240         1.0       0.0        0.0         1.0      0.0      1.0      0.0   
241         0.0       1.0        0.0         1.0      0.0      1.0      0.0   
242         0.0       1.0        1.0         0.0      0.0      1.0      0.0   
243         1.0       0.0        1.0         0.0      0.0      0.0      0.0   

     day_Thur  time_Dinner  time_Lunch  
0         

### 2. Label and Ordinal Encoding

In [27]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

In [24]:
labelEncoder = LabelEncoder()
ordinalEncoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

In [31]:
df = pd.DataFrame({
    'size' : ['small', 'medium', 'large', 'large', 'medium', 'small', 'large', 'small', 'medium']
})

In [32]:
df.head()

Unnamed: 0,size
0,small
1,medium
2,large
3,large
4,medium


In [20]:
# LabelEncoder expects 1-D input, so pass the Series rather than a one-column DataFrame
label_encoded = labelEncoder.fit_transform(df['size'])
label_encoded_df = pd.DataFrame(label_encoded, columns=['size_Labeled'])

label_encoded_df.head()

concat_label_encoded_df = pd.concat([df, label_encoded_df], axis=1)

# Classes are numbered in sorted order: large=0, medium=1, small=2
concat_label_encoded_df.head()


Unnamed: 0,size,size_Labeled
0,small,2
1,medium,1
2,large,0
3,large,0
4,medium,1


In [34]:
ordinal_encoded = ordinalEncoder.fit_transform(df[['size']])

ordinal_encoded_df = pd.DataFrame(ordinal_encoded, columns=['size_ordinal_encoded'])

ordinal_encoded_df.head()

concat_ordinal_encoded_df = pd.concat([df, ordinal_encoded_df], axis=1)

concat_ordinal_encoded_df.head()


Unnamed: 0,size,size_ordinal_encoded
0,small,0.0
1,medium,1.0
2,large,2.0
3,large,2.0
4,medium,1.0
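
The ordinal codes can be mapped back to the original labels, and the encoder can be told how to treat values it has not seen. This is a sketch under assumptions: the `huge` value and the `-1` placeholder are made up, and `handle_unknown="use_encoded_value"` requires a reasonably recent scikit-learn release.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large"]})

size_encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen values
    unknown_value=-1,                    # placeholder code for unseen values
)

codes = size_encoder.fit_transform(sizes)                           # small=0, medium=1, large=2
unseen = size_encoder.transform(pd.DataFrame({"size": ["huge"]}))   # not in categories
decoded = size_encoder.inverse_transform(codes)                     # back to the labels

print(codes.ravel(), unseen.ravel(), decoded.ravel())
```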


### 3. Target Guided Ordinal Encoding

In [10]:
import pandas as pd

In [11]:
df = pd.DataFrame({
    'city': ['NewYork', 'London', 'Paris', 'NewYork', 'London', 'Paris'],
    'price': [100, 200, 300, 150, 250, 350]
})

In [14]:
mean_city_prices = df.groupby('city')['price'].mean().to_dict()

mean_city_prices

{'London': 225.0, 'NewYork': 125.0, 'Paris': 325.0}

In [15]:
df['encoded_city'] = df['city'].map(mean_city_prices)
df.head()

Unnamed: 0,city,price,encoded_city
0,NewYork,100,125.0
1,London,200,225.0
2,Paris,300,325.0
3,NewYork,150,125.0
4,London,250,225.0
