# Categorical Data Encoding

## 1. Introduction and Preparation
Encoding categorical data involves converting categorical variables into numerical representations to be used in machine learning or statistical models. This process assigns numerical values to categories, allowing the data to be processed effectively by algorithms that work with numerical inputs. Common encoding techniques include one-hot encoding, label encoding, ordinal encoding, target encoding, binary encoding, frequency encoding, and hash encoding. By encoding categorical data, we enable the incorporation of these variables into models and leverage the information they provide for analysis and predictions.

There are several ways to encode categorical data, depending on the specific requirements and characteristics of the data. Here are some common methods for categorical data encoding:

1. **One-Hot Encoding (Dummy Coding)**: This method creates one binary column per category in the original variable. A value of 1 indicates the presence of that category, and 0 indicates its absence. This approach is suitable for nominal variables, whose categories have no inherent order.
Example: Using the **pd.get_dummies()** function in pandas or the **OneHotEncoder** class in scikit-learn.

2. **Label Encoding**: This method replaces each category with a unique integer. The integers are arbitrary (scikit-learn's **LabelEncoder** assigns them in sorted order of the category names), so a model may read them as a ranking that does not exist in the data. When the order of the categories matters, use ordinal encoding instead.
Example: Using the **LabelEncoder** class in scikit-learn.

3. **Ordinal Encoding**: This method maps the categories to integers according to a predefined order or mapping, so that larger integers correspond to higher-ranked categories. This encoding is suitable for ordinal categorical variables.
Example: Using a mapping dictionary or the **OrdinalEncoder** class in scikit-learn.

4. **Binary Encoding**: This method converts each category to an integer and then writes that integer in binary, one column per binary digit. A variable with n categories needs roughly log2(n) columns instead of the n columns that one-hot encoding would create, which makes this approach suitable for variables with a large number of categories.
Example: Using libraries like category_encoders or feature-engine.

5. **Frequency Encoding**: This method replaces each category with its count or proportion in the dataset. This approach is useful when how often a category occurs is itself informative.
Example: Manually calculating frequencies or using libraries like category_encoders.

6. **Hash Encoding**: This method applies a hash function to each category and maps the result into a fixed number of output columns, regardless of how many categories exist. Different categories can collide in the same column, which is the price paid for the fixed width. It is useful for variables with high cardinality.
Example: Using libraries like category_encoders or feature-engine.

The choice of encoding technique depends on the specific characteristics of the data, the nature of the categories (nominal or ordinal, few or many), and the requirements of the analysis or modeling task.
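Hash encoding is described above but not demonstrated in the sections that follow, so here is a minimal sketch of the idea using only the standard library. The helper name `hash_encode`, the choice of MD5, and the bucket count of 8 are assumptions made for illustration; library implementations differ in detail.

```python
import hashlib

def hash_encode(value, n_components=8):
    """Map a category string to a one-hot vector over n_components hash buckets."""
    # hashlib gives a stable hash; Python's built-in hash() is salted per process for strings.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_components
    vector = [0] * n_components
    vector[bucket] = 1
    return vector

colors = ["red", "green", "blue", "red"]
encoded = [hash_encode(c) for c in colors]

# The same category always lands in the same bucket;
# two different categories may collide if n_components is small.
for color, row in zip(colors, encoded):
    print(color, row)
```

The output width stays at `n_components` no matter how many distinct categories appear, which is what makes the technique attractive for high-cardinality variables.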

In [12]:
import pandas as pd
!pip install category_encoders



## 2. One-Hot Encoding (Dummy Coding)

In [2]:
# Create a DataFrame with the "color" column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# Apply one-hot encoding
one_hot_encoded = pd.get_dummies(df['color'])

In [3]:
one_hot_encoded

Unnamed: 0,blue,green,red
0,0,0,1
1,0,1,0
2,1,0,0
3,0,0,1


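The resulting **one_hot_encoded** DataFrame has three binary columns, "blue," "green," and "red," where 1 indicates the presence of that color and 0 indicates its absence. Because **pd.get_dummies()** was applied to a single Series here, the columns carry the bare category names rather than a "color_" prefix. Recent pandas versions display True/False instead of 1/0 unless `dtype=int` is passed.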
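If prefixed column names are wanted, one option is to pass the whole DataFrame together with the `columns` argument. The sketch below also shows `drop_first`, which removes one redundant column; whether to drop it depends on the model, so treat that part as an optional variation rather than a recommendation.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# Passing the DataFrame plus `columns` prefixes each new column with "color_".
# dtype=int forces 0/1 output; recent pandas versions return booleans by default.
prefixed = pd.get_dummies(df, columns=["color"], dtype=int)
print(prefixed.columns.tolist())  # ['color_blue', 'color_green', 'color_red']

# drop_first=True drops the first category (here "blue"),
# which is then implied by zeros in all remaining columns.
reduced = pd.get_dummies(df, columns=["color"], dtype=int, drop_first=True)
print(reduced.columns.tolist())  # ['color_green', 'color_red']
```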

## 3. Label Encoding

In [18]:
# Create a DataFrame with the "color" column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# Apply label encoding
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['color_encoded'] = label_encoder.fit_transform(df['color'])

In [19]:
df

Unnamed: 0,color,color_encoded
0,red,2
1,green,1
2,blue,0
3,red,2


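The resulting DataFrame has an additional column named "color_encoded" that contains the integer assigned to each category: 2 for "red," 1 for "green," and 0 for "blue." **LabelEncoder** sorts the category names before numbering them, so these values reflect alphabetical order only and carry no meaningful ranking. scikit-learn documents **LabelEncoder** as intended for target labels rather than input features.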
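The same integer mapping can be reproduced with pandas alone, which also makes the alphabetical ordering easy to inspect. This is a sketch of an equivalent approach, not what `LabelEncoder` does internally; the name `code_to_label` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# Converting to a categorical dtype sorts the string categories lexically,
# so the integer codes follow alphabetical order, not any meaningful ranking.
as_category = df["color"].astype("category")
df["color_encoded"] = as_category.cat.codes

# Keep the code -> label mapping so encoded values can be decoded later.
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)  # {0: 'blue', 1: 'green', 2: 'red'}
```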

## 4. Ordinal Encoding

In [6]:
# Create a DataFrame with the "color" column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# Define the order of categories
category_order = ['blue', 'green', 'red']

# Apply ordinal encoding
df['color_encoded'] = df['color'].apply(lambda x: category_order.index(x))

In [7]:
df

Unnamed: 0,color,color_encoded
0,red,2
1,green,1
2,blue,0
3,red,2


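The resulting DataFrame has an additional column named "color_encoded" that contains each category's position in **category_order**: 2 for "red," 1 for "green," and 0 for "blue." Colors have no natural ranking, so the order here is chosen purely for illustration; with a genuinely ordinal variable the list would follow the real ranking. Note that `category_order.index(x)` raises a `ValueError` for any value missing from the list.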
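For a variable that does have a natural order, an ordered pandas categorical is one way to make the ranking explicit. The "size" column and its values below are hypothetical and are not part of the example above.

```python
import pandas as pd

# Hypothetical ordinal variable: small < medium < large
df = pd.DataFrame({"size": ["medium", "small", "large", "medium"]})
size_order = ["small", "medium", "large"]

# An ordered categorical assigns codes by position in `categories`.
ordered = pd.Categorical(df["size"], categories=size_order, ordered=True)
df["size_encoded"] = ordered.codes
print(df)

# Values outside the declared categories become missing and get code -1
# instead of raising an error, so they are easy to detect afterwards.
unknown_codes = pd.Categorical(["huge"], categories=size_order, ordered=True).codes
print(unknown_codes)  # [-1]
```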

## 5. Target Encoding (Mean Encoding)

In [8]:
# Create a DataFrame with the "color" and "target" columns
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red'],
                   'target': [1, 0, 1, 1]})

# Calculate the mean target for each category
mean_target = df.groupby('color')['target'].mean()

# Apply target encoding
df['color_encoded'] = df['color'].map(mean_target)

In [9]:
df

Unnamed: 0,color,target,color_encoded
0,red,1,1.0
1,green,0,0.0
2,blue,1,1.0
3,red,1,1.0


The resulting DataFrame has an additional column named "color_encoded" that contains the mean target value for each category: 1.0 for "red" (both red rows have target 1), 0.0 for "green," and 1.0 for "blue."
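In the output above, "green" and "blue" each occur once, so their encoded value is simply that row's own target. A common way to soften this is to shrink each category mean toward the global mean. The sketch below assumes a simple additive smoothing formula and a weight of 1.0; both are illustrative choices, not something prescribed by the example above.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                   "target": [1, 0, 1, 1]})

smoothing = 1.0                      # assumed weight given to the global mean
global_mean = df["target"].mean()    # 0.75 for this data

# Per-category mean and row count
stats = df.groupby("color")["target"].agg(["mean", "count"])

# Weighted average of the category mean and the global mean:
# categories with few rows are pulled more strongly toward the global mean.
smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
    stats["count"] + smoothing
)

df["color_encoded"] = df["color"].map(smoothed)
print(df)
```

With these numbers, "green" moves from 0.0 to 0.375 and "blue" from 1.0 to 0.875, while "red," which has two rows, stays closer to its own mean. In practice the category statistics would be computed on training data only and then applied to validation or test data.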

## 6. Binary Encoding

In [13]:
import category_encoders as ce

# Create a DataFrame with the "color" column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# Apply binary encoding
binary_encoder = ce.BinaryEncoder(cols=['color'])
df_encoded = binary_encoder.fit_transform(df)

In [14]:
df_encoded

Unnamed: 0,color_0,color_1
0,0,1
1,1,0
2,1,1
3,0,1


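The resulting **df_encoded** DataFrame replaces "color" with two binary columns, "color_0" and "color_1." Read together, they give "red" the code 01, "green" 10, and "blue" 11, so three categories fit in two columns instead of the three that one-hot encoding would need.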
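To see where those bit patterns come from, the encoding can be reproduced by hand: number the categories, then split each number into binary digits. This is a sketch that mirrors the output shown above under the assumption that categories are numbered from 1 in order of first appearance; it is not the library's actual implementation.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# Integer codes in order of first appearance, starting at 1:
# red=1, green=2, blue=3
codes = pd.factorize(df["color"])[0] + 1

# Number of binary digits needed for the largest code
n_bits = int(codes.max()).bit_length()

# Column i holds the i-th binary digit, most significant digit first.
manual = pd.DataFrame({
    f"color_{i}": (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
})
print(manual)
```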

## 7. Frequency Encoding

In [15]:
# Create a DataFrame with the "color" column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# Calculate the frequency of each category
frequency = df['color'].value_counts(normalize=True)

# Apply frequency encoding
df['color_encoded'] = df['color'].map(frequency)

In [16]:
df

Unnamed: 0,color,color_encoded
0,red,0.5
1,green,0.25
2,blue,0.25
3,red,0.5


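The resulting DataFrame has an additional column named "color_encoded" that contains the proportion of rows belonging to each category: 0.5 for "red" and 0.25 for both "green" and "blue." Categories that occur equally often receive the same value, so frequency encoding cannot tell them apart.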
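When the encoding is applied to new data, the frequencies are normally taken from the original (training) data, and categories that were never seen need a fallback. The sketch below assumes a fallback of 0.0; the `train` and `new` frames and the "purple" value are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
new = pd.DataFrame({"color": ["red", "purple"]})  # "purple" was never seen in train

# Frequencies are computed from the training data only
frequency = train["color"].value_counts(normalize=True)

# Unseen categories map to NaN, so fill them with an assumed default of 0.0
new["color_encoded"] = new["color"].map(frequency).fillna(0.0)
print(new)
```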

These examples demonstrate how each encoding method can be applied to a categorical variable. It's important to adapt the code to your specific dataset and encoding requirements.