Most machine learning libraries accept only numeric input. However, as we have seen, many fields in tabular data are stored as categories. Some fields are stored as numbers but should still be treated as categories. For example, a user ID can be any value as long as it is unique. The IDs may be the numbers 1, 2, 3, … but these values should not be fed directly into the model.

One point worth emphasizing: a good machine learning model returns similar outputs when its (numeric) inputs are close to each other. User codes, product codes, or any other codes numbered in arbitrary order cannot be considered similar just because two codes are numerically close. Even when codes are assigned in a deliberate order, they are only close to each other in a one-dimensional space, whereas being “close to each other” may be better defined in a higher-dimensional space. As another example, suppose the days of the week are numbered 1 (Sunday), 2 (Monday), ..., 7 (Saturday). Days 1 and 2 are close to each other, but days 1 and 7 are arguably even closer because they belong to the same weekend. Placing the days as points on a circle in two-dimensional space can be more useful, because day 1 is then close to both day 7 and day 2.
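As a minimal sketch of the circle idea, assuming the days are numbered 1 to 7 as above: each day is mapped to an angle and represented by its cosine and sine. The helper names `encode_day_on_circle` and `circle_distance` are hypothetical and used only for illustration.

```python
import numpy as np


def encode_day_on_circle(day: int, period: int = 7) -> np.ndarray:
    """Map a day number in 1..period to a point on the unit circle."""
    angle = 2 * np.pi * (day - 1) / period
    return np.array([np.cos(angle), np.sin(angle)])


def circle_distance(day_a: int, day_b: int) -> float:
    """Euclidean distance between the circle encodings of two days."""
    diff = encode_day_on_circle(day_a) - encode_day_on_circle(day_b)
    return float(np.linalg.norm(diff))


# On the circle, day 1 is as close to day 7 as it is to day 2,
# while day 4 lies almost on the opposite side.
print(circle_distance(1, 2))
print(circle_distance(1, 7))
print(circle_distance(1, 4))
```

With the raw numbering, days 1 and 7 would be six units apart; on the circle they are neighbours.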

Thus, with categorical data, we not only need to convert the values to numeric form so that algorithms can process them, but also need to map them to sensible values in a multi-dimensional space to get good results.

# One-hot encoding

The most traditional way to convert categorical data to numeric form is one-hot encoding. This encoding requires building a "dictionary" containing all possible values of each categorical field. Each value is then encoded as a binary vector whose elements are all 0 except for a single 1 at the position of that value in the dictionary.

For example, if a column contains the values "New York", "California", and "Los Angeles", we perform the following steps:
- Build a dictionary. In this case, we can build the dictionary ["New York", "California", "Los Angeles"].
- After building the dictionary, save the index of each value in it. With the dictionary above, the corresponding indices are [0, 1, 2].
- Finally, encode the original values as follows:

<table>
    <tr>
        <th>Original value</th>
        <th>Encoded value</th>
    </tr>
    <tr>
        <td>New York</td>
        <td>[1, 0, 0]</td>
    </tr>
    <tr>
        <td>California</td>
        <td>[0, 1, 0]</td>
    </tr>
    <tr>
        <td>Los Angeles</td>
        <td>[0, 0, 1]</td>
    </tr>
</table>

Since each value is encoded as a vector with only one element equal to 1, at its position in the dictionary, this vector is called a “one-hot vector”. The dimension of this vector equals the number of entries in the dictionary. Interpreted another way, each binary element of the vector indicates whether the value under consideration "is" the corresponding entry in the dictionary. New values that are not in the dictionary (out-of-vocabulary, or OOV) can be encoded as [0, 0, 0], in the sense that they match none of the dictionary entries.
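The three steps above, together with the all-zero encoding of OOV values, can be sketched in plain Python. This is only an illustration; the names `vocabulary`, `index_of`, and `one_hot` are hypothetical.

```python
from typing import Dict, List

# Step 1: build the dictionary.
vocabulary: List[str] = ["New York", "California", "Los Angeles"]

# Step 2: remember the index of each value in the dictionary.
index_of: Dict[str, int] = {value: i for i, value in enumerate(vocabulary)}


def one_hot(value: str) -> List[int]:
    """Step 3: return the one-hot vector of `value`; OOV values map to all zeros."""
    vector = [0] * len(vocabulary)
    if value in index_of:
        vector[index_of[value]] = 1
    return vector


print(one_hot("California"))  # [0, 1, 0]
print(one_hot("Texas"))       # [0, 0, 0] because "Texas" is not in the dictionary
```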

Another common way to encode values that are not in the dictionary is to add an "unknown" entry to the dictionary and place all new values in this "unknown" category. Note, however, that "unknown" may itself be a legitimate value in the data set; encoding it and the out-of-dictionary values with the same vector can mislead the model into treating them as the same value. If you know that certain values will appear often in the future, include them explicitly in the dictionary so that they get their own encoding and do not overlap with other values. If values occur rarely, we can group them under a single code and treat them as sharing the property of being "rare". Encoding each rare value separately uses a lot of memory and makes the model more complex as it tries to learn unique cases, which easily leads to overfitting.
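One possible way to lump rare values together is sketched below with pandas. The threshold `min_count`, the sample data, and the bucket name "rare" are assumptions made for this illustration; in practice the bucket name must not collide with a real value in the data.

```python
import pandas as pd

# Hypothetical training column: two frequent values and two rare ones.
locations = pd.Series(
    ["New York", "New York", "California", "California", "California", "Austin", "Boise"]
)

min_count = 2  # assumed threshold: values seen fewer times are treated as rare
counts = locations.value_counts()
frequent_values = set(counts[counts >= min_count].index)


def to_category(value: str) -> str:
    """Keep frequent values; lump everything else (including unseen values) into 'rare'."""
    return value if value in frequent_values else "rare"


lumped = locations.map(to_category)
print(lumped.tolist())
print(to_category("Chicago"))  # never seen before, so it also falls into "rare"
```

The resulting column has only three distinct values, so its one-hot vectors have three elements instead of four, and new values no longer require a dictionary update.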

## Example with Sklearn

In [7]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_train = pd.DataFrame(
    data={"location": ["New York", "California", "Los Angeles"], "population (M)": [7, 9, 0.5]} # Example numbers
)
df_train

Unnamed: 0,location,population (M)
0,New York,7.0
1,California,9.0
2,Los Angeles,0.5


Next, we apply one-hot encoding to the "location" column:

In [8]:
onehot = OneHotEncoder()

onehot_encoded_location = onehot.fit_transform(df_train[["location"]])
print(type(onehot_encoded_location))
print(onehot_encoded_location)

<class 'scipy.sparse._csr.csr_matrix'>
  (0, 2)	1.0
  (1, 0)	1.0
  (2, 1)	1.0


There are a few points to note here. First, by default the result <code>onehot_encoded_location</code> is stored as a <code>scipy.sparse.csr_matrix</code>, a type designed for two-dimensional arrays in which most elements are zero. This format saves a lot of memory here, because each one-hot vector has only one non-zero element. If the dictionary grows to millions of values and we store the matrix in regular (dense) form, most of the memory is wasted on zeros that carry little information.

When printing <code>onehot_encoded_location</code>, we see two columns. The first column holds the (row, column) coordinates of the non-zero elements; the second holds the value at each coordinate, which is always 1 in this case.
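To get a feel for the savings, the sketch below builds a one-hot matrix directly in CSR form and compares the number of stored values with the number of cells a dense matrix would need. The sizes and the random category indices are made up for illustration.

```python
import numpy as np
from scipy import sparse

n_rows, n_categories = 1_000, 5_000  # hypothetical sizes
rng = np.random.default_rng(0)
category_index = rng.integers(0, n_categories, size=n_rows)

# One non-zero entry per row, placed at the column of that row's category.
one_hot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), category_index)),
    shape=(n_rows, n_categories),
)

dense_cells = n_rows * n_categories
print(f"dense matrix cells:   {dense_cells}")
print(f"stored sparse values: {one_hot.nnz}")
```

The sparse matrix stores one value per row regardless of the dictionary size, while the dense form grows with the number of categories.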

To return the result as a regular (dense) matrix, we can pass <code>sparse_output=False</code> when initializing the encoder (in scikit-learn versions before 1.2, this parameter is named <code>sparse</code>):

In [9]:
onehot = OneHotEncoder(sparse_output=False)

onehot_encoded_location = onehot.fit_transform(df_train[["location"]])
print(onehot_encoded_location)

[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]


In [10]:
onehot.categories_

[array(['California', 'Los Angeles', 'New York'], dtype=object)]

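The <code>categories_</code> attribute holds the dictionary in the order used for the columns of the encoded matrix. We need to save this order (for example, by keeping the fitted encoder) so that data is encoded consistently later.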

For values not in the dictionary, sklearn offers several ways to handle them through the <code>handle_unknown</code> parameter (see the official documentation for more details). The two basic options are 'error' (the default) and 'ignore'. With 'error', the program stops and raises an error when it encounters a value that is not in the dictionary. With 'ignore', the encoder transforms unseen values into an all-zero vector. Neither option lumps new values into a separate category of their own; recent versions of scikit-learn (1.1 and later) add 'infrequent_if_exist', which maps unseen values to the infrequent category when one was formed during fitting. Which option to use depends on the context. If you know for sure all possible values of the field, use 'error' to catch erroneous inputs. Otherwise, use 'ignore'; however, be careful with misspellings, which are then silently encoded as all zeros!
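The 'ignore' behaviour can be checked with a small sketch. The data frames `train` and `new_data` are hypothetical, and "Texas" plays the role of a value that was not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"location": ["New York", "California", "Los Angeles"]})
new_data = pd.DataFrame({"location": ["California", "Texas"]})  # "Texas" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# .toarray() converts the sparse result to a dense NumPy array for display.
encoded = encoder.transform(new_data).toarray()
print(encoder.categories_[0])
print(encoded)
```

The row for "California" has a single 1 in its dictionary position, while the row for "Texas" contains only zeros. With the default `handle_unknown="error"`, the same `transform` call would raise an error instead.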

# Hashing

One-hot encoding has a major limitation: the dictionary and its size must be known in advance. The dictionary must also be saved in order to encode new data in the future. Imagine an e-commerce store that has 1000 items today and adds 1000 new items a month from now. How should those new items be encoded? Encoding them all with a zero vector, or with the one-hot vector for "unknown", will reduce model quality because the 1000 new items cannot be told apart. To keep using one-hot encoding, we need to update the dictionary and re-encode all category values. This changes the model's input, and we will very likely also need to change the model's architecture to match the new input size.

A widely used technique for this problem is hashing. A hash function maps any input value to an integer. A good hash function spreads different input values evenly over the range of possible outputs (32-bit integers or more, depending on the hash function). It also maps different input values to different integers with high probability, especially when the output uses a large number of bits.

To use hashing to turn category values into natural numbers that a machine learning model can use, we can perform the following steps:

- Convert the category values to strings (some hash functions only accept string input).

- Choose a "deterministic" hash function, that is, a function that returns the same number in every run for the same input. This is very important: if the hash function returns different outputs for the same value, the machine learning model cannot know that the inputs are the same. Note that some hash functions return different values across runs, typically for security reasons (for example, Python's built-in <code>hash()</code> for strings is randomized per process by default); these must be avoided.

- Estimate the number of distinct values of the categorical field, then choose a natural number $K$ as the modulus. The remainder of the hash from step two divided by $K$ is the index of the corresponding category.
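When choosing $K$, it can help to estimate how many values will end up sharing a bucket. The sketch below assumes an idealized hash function that spreads values uniformly and independently over the $K$ buckets; the function names and the example sizes are hypothetical.

```python
def expected_occupied_buckets(num_values: int, num_buckets: int) -> float:
    """Expected number of buckets that receive at least one value."""
    return num_buckets * (1.0 - (1.0 - 1.0 / num_buckets) ** num_values)


def collision_probability(num_values: int, num_buckets: int) -> float:
    """Probability that a given value shares its bucket with at least one other value."""
    return 1.0 - (1.0 - 1.0 / num_buckets) ** (num_values - 1)


num_values = 10_000  # assumed number of distinct category values
for num_buckets in (1_000, 100_000, 10_000_000):
    occupied = expected_occupied_buckets(num_values, num_buckets)
    shared = collision_probability(num_values, num_buckets)
    print(f"K={num_buckets}: ~{occupied:.0f} occupied buckets, P(shared bucket)={shared:.3f}")
```

A small $K$ keeps the input compact but forces many values to share an index; a large $K$ reduces collisions at the cost of a larger input space.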

## Example with Predict Future Sales

In [11]:
import pandas as pd

sales_path = "https://media.githubusercontent.com/media/tiepvupsu/tabml_data/master/sales/"
df_items = pd.read_csv(sales_path + "items.csv")
print(f"Number of items: {len(df_items)}")
print(f"Number of category: {len(df_items['item_category_id'].unique())}")
df_items.head(5)

Number of items: 22170
Number of category: 84


Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


Thus, there are 22,170 different products divided into 84 different categories. The category IDs can be used directly to build one-hot vectors. However, many products fall into the same category. Without other features to tell products apart, we would build a model in which all products of a category behave identically. To distinguish products, we could extract additional information from the product names in the first column, but this is relatively difficult because the names are in Russian, which not all engineers know. Another way is to use <code>item_id</code> as a categorical feature and build a one-hot vector with 22,170 elements for this column. This is a relatively large number of elements. In addition, in the training data (the "sales_train.csv" file), many <code>item_id</code> values appear only once. A one-hot encoding with 22,170 elements therefore runs a high risk of overfitting, since too many items have very little data.

Hashing is one technique that can be applied to <code>item_name</code>. Below is a simple implementation of the hashing technique, written against the sklearn API, with 1000 hash buckets:

In [12]:
import hashlib

from sklearn.base import BaseEstimator, TransformerMixin


def hash_modulo(val, mod):
    """Map any value to a bucket index in [0, mod) with a deterministic hash."""
    md5 = hashlib.md5()  # can be other deterministic hash functions
    md5.update(str(val).encode())
    return int(md5.hexdigest(), 16) % mod


class FeatureHasher(BaseEstimator, TransformerMixin):
    def __init__(self, num_buckets: int):
        self.num_buckets = num_buckets

    def fit(self, X: pd.Series, y=None):
        # Hashing is stateless, so there is nothing to learn from the data.
        return self

    def transform(self, X: pd.Series):
        return X.apply(lambda x: hash_modulo(x, self.num_buckets))


fh = FeatureHasher(num_buckets=1000)

df_items["hashed_item"] = fh.transform(df_items["item_name"])
df_items.head(5)

Unnamed: 0,item_name,item_id,item_category_id,hashed_item
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40,252
1,!ABBYY FineReader 12 Professional Edition Full...,1,76,812
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40,198
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40,584
4,***КОРОБКА (СТЕКЛО) D,4,40,210


# Crossing

We are used to processing each feature separately from its own data field. In practice, data fields may be related to each other, but simple machine learning models have difficulty capturing these relationships on their own. Such relationships are usually discovered by data scientists or come from domain knowledge.

Take the problem of predicting California house prices as a small example, where longitude and latitude are two separate data fields. If we build two separate features from these two fields, the model may learn that areas sharing the same latitude, or the same longitude, have similar house prices. This is clearly not correct. However, if a single value carries information about both longitude and latitude, the model can learn something more useful.

Feature crossing can solve this problem. A cross feature represents the co-occurrence of values in several other features and is itself a categorical feature.
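For the longitude and latitude example, one possible approach is to bucket each coordinate and then cross the two buckets, so that every cross value identifies a grid cell. The sketch below is only an illustration: the coordinates are hypothetical and the 1-degree bucket width is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical coordinates, only for illustration.
houses = pd.DataFrame(
    {
        "latitude": [34.05, 34.10, 37.77, 37.80],
        "longitude": [-118.24, -122.40, -122.42, -118.30],
    }
)

# Turn each continuous coordinate into a coarse bucket (1 degree wide here).
houses["lat_bucket"] = houses["latitude"].floordiv(1).astype(int)
houses["lon_bucket"] = houses["longitude"].floordiv(1).astype(int)

# The cross feature identifies a grid cell: two rows share a value
# only if BOTH their latitude bucket and their longitude bucket match.
houses["lat_X_lon"] = (
    houses["lat_bucket"].astype(str) + "_X_" + houses["lon_bucket"].astype(str)
)
print(houses)
```

Here the first two rows share a latitude bucket but lie in different longitude buckets, so they receive different cross values.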

Consider the example below with a data set with three features <code>col1</code>, <code>col2</code> and <code>col3</code>:

In [13]:
import typing
import pandas as pd

df = pd.DataFrame(
    data={
        "col1": ["A", "B", "C", "A", "A"],
        "col2": ["x", "x", "y", "x", "z"],
        "col3": [1, 3, 2, 1, 2],
    }
)
df

Unnamed: 0,col1,col2,col3
0,A,x,1
1,B,x,3
2,C,y,2
3,A,x,1
4,A,z,2


Below is an example of how to create a cross feature based on: (i) the first two columns and (ii) all three columns of the DataFrame <code>df</code>:

In [14]:
def add_cross(df: pd.DataFrame, cols: typing.List[str]) -> pd.DataFrame:
    """Add a column to a copy of the dataframe as a cross feature.

    Args:
        df: input dataframe.
        cols: a list of columns in df that are used to create the new cross feature.

    Returns:
        A new dataframe with the new cross feature.
    """
    cross_col = "_X_".join(cols)

    def cross_value(row):
        # Join the string form of each component value with the same separator.
        return "_X_".join(str(row[col]) for col in cols)

    df = df.copy()  # keep the caller's dataframe unchanged
    df[cross_col] = df.apply(cross_value, axis=1)
    return df


first_cross = ["col1", "col2"]
second_cross = ["col1", "col2", "col3"]
df = add_cross(df, first_cross)
df = add_cross(df, second_cross)
df

Unnamed: 0,col1,col2,col3,col1_X_col2,col1_X_col2_X_col3
0,A,x,1,A_X_x,A_X_x_X_1
1,B,x,3,B_X_x,B_X_x_X_3
2,C,y,2,C_X_y,C_X_y_X_2
3,A,x,1,A_X_x,A_X_x_X_1
4,A,z,2,A_X_z,A_X_z_X_2


You can give the cross feature any name, as long as it does not clash with the names of other features. By convention, the name of a cross feature column can be built by joining the names of the component features with the string "_X_", where the X stands for "cross".

Similarly, the values of cross features can be defined as strings built by joining the string representations of the values in the component columns. You may join these values in a different way; however, you must ensure that rows whose component columns have the same values always receive the same cross value, and that different combinations do not end up with the same cross value.
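As an alternative way of joining the values, the sketch below builds the same kind of cross feature without a row-wise helper function, by converting the component columns to strings and joining each row. The small data frame `df_example` is hypothetical.

```python
import pandas as pd

df_example = pd.DataFrame(
    {"col1": ["A", "B", "A"], "col2": ["x", "x", "x"], "col3": [1, 3, 1]}
)
cols = ["col1", "col2", "col3"]

# Convert every component to a string, then join the values of each row.
cross_col = "_X_".join(cols)
df_example[cross_col] = df_example[cols].astype(str).agg("_X_".join, axis=1)
print(df_example)
```

The first and last rows have identical component values and therefore receive the same cross value, which is the property required above.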

## Transforming Nominal Attributes

Nominal attributes consist of discrete categorical values with no notion or sense of order amongst them. The idea here is to transform these attributes into a more representative numerical format which can be easily understood by downstream code and pipelines.

In [16]:
vg_df = pd.read_csv('vgsales.csv', encoding='utf-8')
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]

Unnamed: 0,Name,Platform,Year,Genre,Publisher
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo
5,Tetris,GB,1989.0,Puzzle,Nintendo
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo


Let’s focus on the video game `Genre` attribute as depicted in the above data frame. It is quite evident that this is a nominal categorical attribute just like `Publisher` and `Platform`. We can easily get the list of unique video game genres as follows.

In [18]:
import numpy as np

genres = np.unique(vg_df['Genre'])
genres

array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

This tells us that we have 12 distinct video game genres. We can now generate a label encoding scheme for mapping each category to a numeric value by leveraging `scikit-learn`.

In [19]:
from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()

genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings

{0: 'Action',
 1: 'Adventure',
 2: 'Fighting',
 3: 'Misc',
 4: 'Platform',
 5: 'Puzzle',
 6: 'Racing',
 7: 'Role-Playing',
 8: 'Shooter',
 9: 'Simulation',
 10: 'Sports',
 11: 'Strategy'}

In [20]:
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]

Unnamed: 0,Name,Platform,Year,Genre,GenreLabel
1,Super Mario Bros.,NES,1985.0,Platform,4
2,Mario Kart Wii,Wii,2008.0,Racing,6
3,Wii Sports Resort,Wii,2009.0,Sports,10
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,7
5,Tetris,GB,1989.0,Puzzle,5
6,New Super Mario Bros.,DS,2006.0,Platform,4


These labels can often be used directly, especially with frameworks like `scikit-learn`, if you plan to use them as response variables for prediction. However, as discussed earlier, they need an additional encoding step before they can be used as features.
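One possible form of that additional step is one-hot encoding the genre, sketched here with `pandas.get_dummies`. The small data frame `sample` is a hypothetical stand-in for `vg_df`, and the variable names are chosen only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small hypothetical sample standing in for vg_df['Genre'].
sample = pd.DataFrame({"Genre": ["Platform", "Racing", "Sports", "Platform"]})

sample_encoder = LabelEncoder()
sample["GenreLabel"] = sample_encoder.fit_transform(sample["Genre"])

# One binary column per genre, so no artificial order is implied between genres.
genre_onehot = pd.get_dummies(sample["Genre"], prefix="Genre", dtype=int)
print(pd.concat([sample, genre_onehot], axis=1))
```

Unlike the integer labels, the one-hot columns do not suggest that, say, Sports is "larger" than Platform.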

## Transforming Ordinal Attributes

Ordinal attributes are categorical attributes with a sense of order amongst the values. Let’s consider our Pokémon dataset. Let’s focus more specifically on the Generation attribute.

In [21]:
poke_df = pd.read_csv('Pokemon.csv', encoding = 'utf-8')
poke_df = poke_df.sample(random_state = 1, frac = 1).reset_index(drop = True)
np.unique(poke_df['Generation'])

array([1, 2, 3, 4, 5, 6], dtype=int64)

Based on the above output, we can see there are a total of 6 generations. Each Pokémon typically belongs to a specific generation based on the video games (when they were released), and the television series follows a similar timeline. This attribute is typically ordinal (domain knowledge is necessary here), because most Pokémon belonging to Generation 1 were introduced earlier in the video games and the television shows than those of Generation 2, and so on. Fans can check out the following figure to remember some of the popular Pokémon of each generation (views may differ among fans!).