# Feature Encoding

## 1) Objective


Feature vectors usually contain a mixture of **categorical** and **numerical** features/variables. While numerical values can be normalized via scaling functions such as *StandardScaler* or *MinMaxScaler*, categorical features have to be encoded. There are four criteria that determine which encoding strategy is most suitable:<br> 
<br>
- Are the feature values/states nominal (= **no inherent order**, e.g. gene names, cities)?<br>
- Are the feature values/states ordinal (= **there is an inherent order** and the values can be ranked, e.g. health status, tax bracket)?<br>
- What is the cardinality $c$ of the feature (= how many states/values can the feature take)?<br>
- How much memory does the encoded feature require?<br>
<br>
There are many encoding strategies. Here, we want to discuss the most common ones:<br>
<br>
- one-hot encoding<br>
- dummy encoding<br>
- ordinal encoding<br>
- binary encoding<br>
- count encoding<br>
<br>
Later, for example in Chem 277B, we will discuss more sophisticated encoding strategies like word embedding.

<br>

## 2) Preparation

First, we import the standard libraries as usual.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Next, we load a dataset that contains information on Alzheimer's patients. The original source can be found [here](https://pubmed.ncbi.nlm.nih.gov/34233656/).

In [None]:
AD = pd.read_excel('../Datasets/AD_data.xlsx', sheet_name = 'Summary Form')

In [None]:
AD.head()

In [None]:
print(list(AD.columns))

Let us visualize the entire dataset in a pair plot:

In [None]:
Pair = sns.pairplot(AD, kind = 'kde')
Pair.map_lower(sns.scatterplot, c = 'k')
Pair.map_upper(sns.scatterplot, c = 'k')
Pair.map_diag(sns.histplot, stat = 'density')

We immediately see that most of the features are **categorical**. We therefore check for **cardinality**:

In [None]:
Categoricals = AD.columns[:-2]
for c in Categoricals:
    card = len(set(AD[c]))
    print("Cardinality of " + c + ': '+ str(card))

Fortunately, cardinality is low, but we need to distinguish between ordinal and nominal features.
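
The same cardinality check can be written more compactly with the pandas method `nunique`. The following is a minimal sketch on a small made-up table; the column names and values are illustrative assumptions and are not taken from the Alzheimer dataset.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "sex":       ["f", "m", "f", "f"],
    "education": [0, 1, 2, 1],
    "city":      ["Berkeley", "Oakland", "Berkeley", "Richmond"],
})

# nunique() counts the distinct values per column
# (missing values are ignored by default)
cardinality = toy.nunique()
print(cardinality)
```

Note that `nunique` skips missing values unless `dropna = False` is passed, so the result can differ from `len(set(...))` when a column contains NaN.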

<br>

## 3) Encoding

**3.1) One-hot**

The simplest encoding strategy is **one-hot encoding**, which **assigns a vector** to each state. The vector contains a single 1 (True) at the position indicating the state and 0 (False) everywhere else. One-hot encoding is applicable if the feature values are nominal and if cardinality is low. For example, we can one-hot encode a DNA sequence (states are "ACGT" and $c=4$), which is done quite often in bioinformatics.

In [None]:
#1) defining keys and values for encoding via a dictionary
Cat = "ACGT"                # keys
Enc = np.identity(len(Cat)) # values

In [None]:
#2) actual encoder: a dictionary
Encoding = {c: e for c, e in zip(Cat, Enc)}
Encoder  = lambda Sequence: [Encoding[NT] for NT in Sequence] # defining a function via lambda that performs the encoding

In [None]:
#3) running an arbitrary test sequence:
S = "CCTGGTACACTATAGGCT"
print(np.array(Encoder(S)))

The sequence is now encoded. Each nucleotide is encoded via a one-hot vector. The length of the vector equals the cardinality of the feature.
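
Because each one-hot vector has exactly one non-zero entry, the encoding can be reversed by looking up the position of that entry. The sketch below illustrates the round trip; the helper names `one_hot` and `decode` are hypothetical and not part of the notebook.

```python
import numpy as np

states = "ACGT"                                 # the four nucleotide states
index  = {s: i for i, s in enumerate(states)}   # state -> position of the 1

def one_hot(sequence):
    """Return an array of shape (len(sequence), len(states))."""
    encoded = np.zeros((len(sequence), len(states)), dtype=int)
    encoded[np.arange(len(sequence)), [index[s] for s in sequence]] = 1
    return encoded

def decode(encoded):
    """Recover the sequence from the position of the 1 in each row."""
    return "".join(states[i] for i in encoded.argmax(axis=1))

X = one_hot("CCTGGTACACTATAGGCT")
print(X.shape)      # (18, 4): one row per nucleotide, one column per state
print(decode(X))    # CCTGGTACACTATAGGCT
```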

<br>

**3.2) Dummy**

Instead of representing a feature state by an entire vector, we can represent each possible state by a **binary variable** (0/1 or False/True). The downside is that this artificially adds new features (the "dummies"), one for each possible state of the given feature.<br>
*Pandas* has a very convenient function for that, which is called *get_dummies*. Let's run this function for all categoricals in the Alzheimer dataset and try to understand what it does: 

In [None]:
AD_dummy = pd.get_dummies(AD, columns = Categoricals, dtype = bool)
AD_dummy.head()

In [None]:
AD_dummy = pd.get_dummies(AD, columns = Categoricals, dtype = int)
AD_dummy.head()

The features are now dummy encoded, but the information is redundant. For example, for a feature with two states, $state_0 = 1$ implies $state_1 = 0$, which is a perfect anti-correlation. More generally, the dummy columns of a feature always sum to one, so any one of them is determined by the others. Such redundancy can cause some machine learning methods to fail or perform poorly. Therefore, we drop one column (usually the first) of each feature:

In [None]:
AD_dummy = pd.get_dummies(AD, columns = Categoricals, dtype = int, drop_first = True)
AD_dummy.head()
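
To see why no information is lost by dropping a column, consider the following minimal sketch. The three marital states are made up for illustration and do not reflect the actual coding in the Alzheimer dataset.

```python
import pandas as pd

# Hypothetical nominal feature with three states
toy = pd.DataFrame({"marriage": ["single", "married", "widowed", "married"]})

full    = pd.get_dummies(toy, columns=["marriage"], dtype=int)
reduced = pd.get_dummies(toy, columns=["marriage"], dtype=int, drop_first=True)

# With all dummies present, every row sums to exactly 1 ...
print(full.sum(axis=1).tolist())

# ... so the dropped column can be rebuilt from the remaining ones
rebuilt = 1 - reduced.sum(axis=1)
print((rebuilt == full["marriage_married"]).all())
```

Here *get_dummies* orders the states alphabetically, so `drop_first = True` removes the column for "married".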

However, some of the features in the data are ordinal and have an inherent order. These features are $economic$, $education$, $age$ (which should not have been binned in the first place!), $health$ and $lifestyle$. For these features, another encoding strategy has to be used.

<br>

**3.3) Ordinal**

Ordinal encoding is essentially enumerating the feature states according to their inherent order. That has been done already in the data set:

In [None]:
AD[['age range(0: <75; 1: >=75)', 'lifestyle', 'economic', 'education', 'health']].head()

Hence, we leave these features as they are, but dummy encode those which are nominal:

In [None]:
AD_dummy_proper = pd.get_dummies(AD, columns = ['sex', 'heridity', 'marriage',], dtype = int, drop_first = True)
AD_dummy_proper.head()
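
If an ordinal feature comes as text labels rather than as already enumerated values, the enumeration has to be done explicitly. One possible way is an ordered pandas categorical, as sketched below. The labels and their order are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical raw labels; the ranking poor < fair < good is an assumption
levels = ["poor", "fair", "good"]
health = pd.Series(["good", "poor", "fair", "good"])

# An ordered categorical maps each label to its rank in `levels`
ordered = pd.Categorical(health, categories=levels, ordered=True)
codes   = pd.Series(ordered.codes, name="health")
print(codes.tolist())   # [2, 0, 1, 2]
```

Passing the order explicitly matters: without `categories = levels`, the labels would be sorted alphabetically, which rarely matches the intended ranking.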

**Important**:<br> 
<br>
1) Now that the features have been encoded, we need to scale/normalize them as discussed in the lecture. Most of the features are not normally distributed; therefore, a min-max scaler is most suitable. 

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
SMinMax = MinMaxScaler(feature_range = (0, 1))                    # initializing the scaler
Scaled  = SMinMax.fit_transform(AD_dummy_proper)                  # fit/transform the data
Data    = pd.DataFrame(Scaled, columns = AD_dummy_proper.columns) # turning output into a dataframe

In [None]:
Data.head()

Now, the dataset is ready for analysis!

2) Encoding always implies bias, especially ordinal encoding. Avoid ordinal encoding where possible and use objective measurements instead. For example, instead of $health$, features like blood pressure or cholesterol contain more information and are less biased. The same applies to economic status (use annual income, savings, real estate values etc.) and lifestyle (hours of exercise per week, cigarette and alcohol consumption etc.). Avoid vague features like *"social status"* or *"anxiety level"*.
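
Returning to point 1: with `feature_range = (0, 1)`, min-max scaling maps each column via $x' = (x - x_{min})\,/\,(x_{max} - x_{min})$. The following sketch checks this by hand against *MinMaxScaler* on a small made-up matrix (the numbers are hypothetical).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 800.0]])

# Column-wise min-max scaling written out by hand
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The scikit-learn scaler should give the same result
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
print(np.allclose(manual, scaled))
```

After scaling, both columns span the interval from 0 to 1, regardless of their original units.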

<br>

**3.4) Binary**

A combination of ordinal and one-hot encoding is binary encoding, which is suitable if cardinality is high. We first assign an ordinal value to each state of the feature and then encode this number as binary, usually 8-bit.<br> 
For example, one-hot encoding a sentence is very inefficient:

In [None]:
#1) building encoder as before
Cat      = "abcdefghijklmnopqrstuvwxyz .!?"
Enc      = np.identity(len(Cat))

Encoding = {c: e for c, e in zip(Cat, Enc)}
Encoder  = lambda Sentence: [Encoding[letter] for letter in Sentence]

In [None]:
#2) encoding one sentence   
My_Sentence = 'this is a sentence.'

In [None]:
print(np.array(Encoder(My_Sentence)))

As we can see, storing this sentence in a one-hot encoded array is very memory inefficient, since cardinality $c$ is high. The sentence is stored in an array of shape $len(My\_Sentence)\,\times\,c$, i.e. $19\,\times\,30$.

Let us encode the same sentence now using binary encoding:

In [None]:
#1) building encoder as before
bytes_object = Cat.encode('utf-8') # first, the string is converted into a UTF-8 encoded bytes object
Encoding     = {c: f"{b:08b}" for c, b in zip(Cat, bytes_object)} # each byte value is written as an 8-bit binary string

In [None]:
Encoder  = lambda Sentence: [Encoding[letter] for letter in Sentence]

In [None]:
print(np.array(Encoder(My_Sentence)))

As we can see, the array is a lot smaller now: each character is represented by an 8-bit string instead of a vector of length $c = 30$.
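
Eight bits are more than strictly necessary for 30 states. As a rough sketch, assuming we enumerate the states ourselves instead of relying on their UTF-8 byte values, $\lceil \log_2 c \rceil$ bits are enough. The variable names below are illustrative.

```python
import math

alphabet = "abcdefghijklmnopqrstuvwxyz .!?"      # same 30 states as above
n_bits   = math.ceil(math.log2(len(alphabet)))    # 5 bits suffice for 30 states

# Ordinal step: state -> index; binary step: index -> fixed-width bit list
encoding = {ch: [int(b) for b in format(i, f"0{n_bits}b")]
            for i, ch in enumerate(alphabet)}

sentence = "this is a sentence."
binary   = [encoding[ch] for ch in sentence]

print(n_bits)                           # 5
print(len(sentence) * len(alphabet))    # one-hot entries: 570
print(len(sentence) * n_bits)           # binary entries: 95
```

The trade-off is that the individual bits carry no meaning of their own, whereas each one-hot column corresponds to exactly one state.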

<br>

**3.5) Count encoding**

Count encoding (or frequency encoding) replaces each feature state/value with the number of its occurrences. This is particularly helpful if we want to derive quantities like information and specificity (i.e. calculating **entropy**) from those features. For example, one can count how often a gene was overexpressed for certain disease types in medical records and thereby determine the specificity. However, that only works for large datasets, where interpreting the relative frequency of a value as a probability is a good approximation (see sampling methods).<br>
Let us run this encoding for the same sentence as before:

In [None]:
Encoding = {s: My_Sentence.count(s) for s in set(My_Sentence)} # counting occurrences of each character in the sentence

In [None]:
Encoder  = lambda Sentence: [Encoding[letter] for letter in Sentence]

In [None]:
print(np.array(Encoder(My_Sentence)))
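
To connect the counts to entropy, the counts can be turned into relative frequencies and inserted into the Shannon formula $H = -\sum_i p_i \log_2 p_i$. The sketch below does this for the example sentence. With only 19 characters, the frequencies are poor probability estimates, so the number is purely illustrative.

```python
import math
from collections import Counter

sentence = "this is a sentence."
counts   = Counter(sentence)            # state -> number of occurrences
total    = sum(counts.values())

# Relative frequencies, interpreted as (rough) probability estimates
freqs = {s: n / total for s, n in counts.items()}

# Shannon entropy in bits: H = -sum(p * log2(p))
entropy = -sum(p * math.log2(p) for p in freqs.values())

print([counts[ch] for ch in sentence])  # count-encoded sentence
print(round(entropy, 3))
```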