# Encoders


## Categorical Variables

A **categorical variable**, also known as a qualitative variable, is a variable that takes one of a fixed number of distinct categories or groups. Each category represents a qualitative characteristic or attribute. The categories are mutually exclusive, meaning each observation belongs to exactly one category.

Categorical variables come in two types: **nominal** and **ordinal**. Nominal variables consist of categories that lack any inherent order, such as the color of a car (red, blue, green, etc.). On the other hand, ordinal variables contain categories that have a natural order or ranking, like educational level (high school, bachelor's, master's, Ph.D.).

Here are some examples of categorical variables:
- The outcome of a roll of a six-sided die, which can result in one of six possibilities: 1, 2, 3, 4, 5, or 6.
- Health attributes such as disease status (healthy, sick).
- Blood type, which can be A, B, AB, or O.



Numerical labels in nominal variables are identifiers, not quantities. For example, in a die roll, 6 isn't "greater" than 1; it's just a different outcome. In ordinal variables, however, the labels do carry an order. For example, survey responses from "strongly disagree" (1) to "strongly agree" (5) show a clear ranking.
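
The difference between nominal and ordinal labels can be sketched with pandas' `Categorical` type. This is only an illustration: the variable names (`blood`, `survey`) and the sample values are assumptions, not data used elsewhere in this notebook.

```python
import pandas as pd

# Nominal: blood types have no inherent order
blood = pd.Categorical(["A", "O", "AB", "A"],
                       categories=["A", "B", "AB", "O"], ordered=False)

# Ordinal: survey responses have a natural ranking
levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
survey = pd.Categorical(["agree", "neutral", "strongly agree"],
                        categories=levels, ordered=True)

# Ordered categories support comparisons and min/max
print(survey.min(), "|", survey.max())
print(survey > "neutral")

# Unordered categories do not: asking for a minimum raises a TypeError
try:
    blood.min()
except TypeError as err:
    print("No ordering for blood type:", err)
```

Declaring the order explicitly, rather than relying on the alphabetical order of the labels, is what makes the comparisons meaningful.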

## One Hot Encoding

One hot encoding is a technique to convert categorical variables into numerical values for machine learning models. It creates a new column for each category and assigns a binary value of 1 or 0 to indicate the presence or absence of that category. For example, suppose we have a variable called "color" with three possible values: "red", "green", and "blue". One hot encoding would create three new columns: "color_red", "color_green", and "color_blue". Each row would have a 1 in the column that matches its original color value, and 0 in the other columns. The table below illustrates this process:

| color | color_red | color_green | color_blue |
| ----- | --------- | ----------- | ---------- |
| red   | 1         | 0           | 0          |
| green | 0         | 1           | 0          |
| blue  | 0         | 0           | 1          |

<font color='Blue'><b>Example:</b></font>

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a dataframe with a categorical variable
df = pd.DataFrame({"Fruit": ["Apple", "Mango", "Apple", "Orange"]})

# Print the original dataframe
display(df)

# Use OneHotEncoder for one hot encoding
encoder_onehot = OneHotEncoder(sparse_output=False, dtype = 'int')
df_encoded = pd.DataFrame(encoder_onehot.fit_transform(df[["Fruit"]]),
                          columns = encoder_onehot.get_feature_names_out(["Fruit"]))

# Concatenate the original dataframe with the encoded values
df = pd.concat([df, df_encoded], axis=1)

# Print the encoded dataframe
display(df.drop(["Fruit"], axis=1))

In [None]:
# Get output feature names for transformation.
encoder_onehot.get_feature_names_out(['Fruit'])

We could do this using pandas too.

In [None]:
# Import pandas library
import pandas as pd

# Create a dataframe with a categorical variable
df = pd.DataFrame({"Fruit": ["Apple", "Mango", "Apple", "Orange"]})

# Print the original dataframe
display(df)

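# Use pandas.get_dummies() to perform one hot encoding
# (dtype=int gives 0/1 columns, matching the OneHotEncoder output above)
df_encoded = pd.get_dummies(df, columns=["Fruit"], prefix="Fruit", dtype=int)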

# Print the encoded dataframe
display(df_encoded)

## Ordinal Encoding

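Ordinal encoding is a technique that transforms categorical variables into numerical values by assigning a unique integer to each category. This is useful when the categories have some inherent order or ranking, such as low, medium and high. Because it keeps each variable in a single column, whereas one hot encoding adds one column per category, ordinal encoding also keeps the dimensionality of the data low, which some algorithms handle more easily. Applied to nominal categories, however, it imposes an order that does not exist in the data.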

<font color='Blue'><b>Example:</b></font>

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Create a dataframe with an ordinal categorical variable
df = pd.DataFrame({"survey_response": ["strongly disagree", "disagree", "neutral", "agree", "strongly agree", "strongly agree"]})
display(df)

# Define the order of survey responses
survey_order = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Use OrdinalEncoder to encode the categorical variable
encoder = OrdinalEncoder(categories=[survey_order], dtype=int)
df_encoded = df.copy()
# fit_transform returns a 2D array; flatten it to fill a single column
df_encoded["survey_response"] = encoder.fit_transform(df[["survey_response"]]).ravel()
display(df_encoded)

# print the categories of each feature
from pprint import pprint
pprint(encoder.categories_)

To do a similar encoding using pandas, we can use the following:

In [None]:
# Import necessary libraries
import pandas as pd
from pprint import pprint

# Create a dataframe with an ordinal categorical variable
df = pd.DataFrame({"survey_response": ["strongly disagree", "disagree", "neutral", "agree", "strongly agree", "strongly agree"]})

# Define the order of survey responses
survey_order = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Create a mapping dictionary for ordinal encoding
mapping_dict = {category: index for index, category in enumerate(survey_order)}
pprint(mapping_dict)

# Map the categorical variable to its corresponding numerical values
df["survey_response_encoded"] = df["survey_response"].map(mapping_dict)
display(df)

# Print the categories of each feature
categories = {"survey_response": survey_order}
pprint(categories)
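
As an alternative to the mapping dictionary, the same integers can be obtained from an ordered categorical dtype. This is a minimal sketch using the survey data from the cell above; the name `ordered_dtype` is introduced here only for illustration.

```python
import pandas as pd

survey_order = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
df = pd.DataFrame({"survey_response": ["strongly disagree", "disagree", "neutral",
                                       "agree", "strongly agree", "strongly agree"]})

# Declare the category order once, as part of the dtype
ordered_dtype = pd.CategoricalDtype(categories=survey_order, ordered=True)

# .cat.codes gives each value's position in the declared order
df["survey_response_encoded"] = df["survey_response"].astype(ordered_dtype).cat.codes

print(df["survey_response_encoded"].tolist())  # [0, 1, 2, 3, 4, 4]
```

Values that are not listed in the categories become missing during the conversion and receive the code -1, so it is worth checking for that value after encoding.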

## SimpleImputer

SimpleImputer can replace missing values with a constant value, or with a statistic (such as mean, median, or mode) calculated from each column. The choice of strategy depends on the type and distribution of the data, as well as the goal of the analysis.

To use SimpleImputer, import it from the sklearn.impute module, create an instance of the class with the desired parameters, and then fit it on the data and transform the data. For example, the following code snippet shows how to use SimpleImputer to replace missing values with the mean of each column, where `data` stands for any numeric array or DataFrame that contains missing values:

```python
# Import SimpleImputer
from sklearn.impute import SimpleImputer

# Create an instance of SimpleImputer with mean strategy
imputer = SimpleImputer(strategy="mean")

# Fit the imputer on the data
imputer.fit(data)

# Transform the data with the imputed values
data_imputed = imputer.transform(data)
```


In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = {'A': [1, 4, 7, 8, 11],
        'B': [2, np.nan, 6, 6, 8],
        'C': [0, np.nan, 6, 6, 9]}

df = pd.DataFrame(data)
display(df)

# Strategies to be used by SimpleImputer
strategies = ['mean', 'median', 'most_frequent', 'constant']

# Loop through each strategy and print the imputed data
# (fill_value is only used by the 'constant' strategy)
for strategy in strategies:
    imputer = SimpleImputer(strategy=strategy, fill_value=5)
    imputed_data = pd.DataFrame(data = imputer.fit_transform(df), columns = df.columns)
    print(f"\nImputed data using {strategy} strategy:")
    display(imputed_data)
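
Fitting and transforming are separate steps, so the statistics learned from one dataset can be reused on another. The sketch below assumes two small hypothetical frames, `train` and `new`, that are not part of the examples above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: fit on `train`, then fill gaps in `new`
train = pd.DataFrame({"A": [1.0, 4.0, 7.0], "B": [2.0, np.nan, 6.0]})
new = pd.DataFrame({"A": [np.nan, 10.0], "B": [np.nan, 3.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# Column means learned from `train` (missing values are ignored)
print(imputer.statistics_)

# Missing values in `new` are filled with the means learned from `train`
new_imputed = pd.DataFrame(imputer.transform(new), columns=new.columns)
print(new_imputed)
```

Reusing the fitted imputer this way keeps the fill values consistent between the data used for fitting and any data seen later.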


## Column Transformer

ColumnTransformer is a class in the scikit-learn library that allows you to apply different data preparation transforms to different columns of a dataset. For example, you can use ColumnTransformer to scale the numerical columns and encode the categorical columns of a pandas DataFrame. ColumnTransformer also supports dropping or passing through columns without any transformation.

You can create a ColumnTransformer object by passing a list of tuples, each containing a name, a transformer, and a column selector. You can also use the make_column_transformer function to create a ColumnTransformer without naming the transformers. ColumnTransformer can be used as a single transformer or as part of a pipeline.

In [None]:
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Quality': ['Low', 'Medium', 'High', 'Low', 'Medium'],
    'Fruit': ['Apple', 'Orange', 'Grapes', 'Pineapple', 'Banana'],
    'Weight': [10, 20, 30, np.nan, 15],
}

df = pd.DataFrame(data)
display(df)

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer

# Define transformers and preprocessor
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
onehot_encoder = OneHotEncoder(sparse_output = False)
imputer = SimpleImputer(strategy='mean')

# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[('ordinal', ordinal_encoder, ['Quality']),
                  ('onehot', onehot_encoder, ['Fruit']),
                  ('imputer', imputer, ['Weight'])],
    remainder='passthrough'  # Pass through any other columns not specified
)

# Apply the column transformer to the dataset
transformed_data = preprocessor.fit_transform(df)

# Create a dataframe from the encoded data
transformed_df = pd.DataFrame(transformed_data, columns=preprocessor.get_feature_names_out())

# Display the new dataframe
display(transformed_df)