# Create Embeddings for Categorical Variables

* A categorical variable is used to represent categories or labels.

* Machine learning (ML) and deep learning (DL) models work only with numerical inputs. Therefore, a categorical variable must be converted into numerical values before it can be fed into an ML or DL model.

* Traditionally, we convert categorical variables into numbers by either ***one-hot encoding*** or ***label encoding***.

## One-hot Encoding

* In one-hot encoding, we create one binary feature for each unique category of a variable. For every row, the feature representing that row's category is set to 1 and all the other features are set to 0.

* This technique becomes problematic when a variable has many categories (unique values), because the resulting data is very sparse. In addition, every one-hot vector is equidistant from every other one, so any relationship between categories is lost.
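
To make the equidistance point concrete, here is a minimal sketch using `pandas.get_dummies` on a toy column. The toy values are borrowed from the `region` variable used later in this tutorial; the variable names are ours and purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy column with four categories (illustrative data only)
regions = pd.Series(
    ["southwest", "southeast", "northwest", "northeast"], name="region"
)

# One column per unique category; each row contains a single 1
one_hot = pd.get_dummies(regions, dtype=int)
print(one_hot)

# Pairwise Euclidean distances between the one-hot rows:
# every pair of distinct categories is exactly sqrt(2) apart
vectors = one_hot.to_numpy()
distances = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
print(distances.round(3))
```

Because all off-diagonal distances are identical, the encoding itself cannot express that two categories are more alike than a third.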

## Label Encoding

* Label encoding simply converts each category in a column to an integer. The technique is very simple, but because the integers form a sequence, it implies an ordering among the categories.

* For example, if we have three transportation modes (bus, car, and bicycle) and label them 1, 2, and 3 respectively, we implicitly assume there is an order or weight associated with each mode, which may not be what we intend.
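
The transportation example can be sketched in a few lines. We use a hand-written mapping here so that the integers match the labels in the text; the sample data is made up for illustration.

```python
import pandas as pd

# Made-up sample of transportation modes
modes = pd.Series(["bus", "car", "bicycle", "car", "bus"], name="mode")

# Hand-written mapping that mirrors the labels used in the text
mapping = {"bus": 1, "car": 2, "bicycle": 3}
encoded = modes.map(mapping)
print(encoded.tolist())  # [1, 2, 3, 2, 1]

# A model sees ordinary integers, so it can treat "bicycle" (3) as
# larger than "bus" (1), even though the modes have no natural order
print(mapping["bicycle"] > mapping["bus"])  # True
```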

## Categorical Embedding

* In categorical embedding, each category of a categorical variable is mapped to an n-dimensional vector. This mapping is learned by a neural network during a standard supervised training process.

* After training, we replace each category in our data with its corresponding vector.

* Categorical embeddings have two advantages: (1) we can limit the number of columns needed per variable, which is useful when a variable has many categories; and (2) the embeddings learned by the neural network reveal intrinsic properties of the categorical variable, meaning that similar categories will have similar embeddings.

See the [article](https://medium.com/analytics-vidhya/categorical-embedder-encoding-categorical-variables-via-neural-networks-b482afb1409d) for more details.
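
Mechanically, an embedding is a lookup table: one row of numbers per category. The sketch below uses NumPy only and fills the table with random placeholder values, whereas in practice a neural network learns them. The embedding size of 2 and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

categories = ["northeast", "northwest", "southeast", "southwest"]
embedding_dim = 2  # hypothetical choice for this illustration

# One row per category. These are random placeholders, not learned values.
embedding_matrix = rng.normal(size=(len(categories), embedding_dim))

# Integer codes for four rows of data (each code indexes into `categories`)
codes = np.array([3, 2, 2, 1])

# Replacing a category with its embedding is plain row indexing
embedded = embedding_matrix[codes]
print(embedded.shape)  # (4, 2)
```

Rows that share a category (the second and third here) receive identical vectors, and a variable with four categories needs only two columns instead of four.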

This tutorial shows how to create categorical embeddings for ML or DL models.

**Medical Cost Personal Datasets**

Source: https://www.kaggle.com/mirichoi0218/insurance

Variable definitions:

*age:* age of primary beneficiary

*sex:* insurance contractor's gender (female or male)

*bmi:* Body mass index, defined as kg / m^2

*children:* Number of children/dependents covered by health insurance

*smoker:* Smoking status (yes or no)

*region:* beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)

*charges:* Individual medical costs billed by health insurance

In [None]:
# The categorical_embedder package works with older versions of keras and tensorflow,
# so we need to downgrade keras and tensorflow accordingly.
# We will need to restart the runtime before we can import the downgraded versions.
!pip install tensorflow_addons==0.8.3 --quiet
!pip install tqdm==4.41.1 --quiet
!pip install keras==2.3.1 --quiet
!pip install tensorflow==2.2.0 --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
panel 0.12.1 requires tqdm>=4.48.0, but you have tqdm 4.41.1 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.8.2+zzzcolab20220929150707 requires keras<2.9,>=2.8.0rc0, but you have keras 2.3.1 which is incompatible.

In [None]:
# Install categorical_embedder
!pip install categorical-embedder --quiet

  Building wheel for sklearn (setup.py) ... done


In [None]:
# Import libraries
from google.colab import drive
import tensorflow as tf
import keras
import categorical_embedder as ce
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [None]:
# mount the google drive to colab
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Open the health insurance dataset saved in the google drive
df = pd.read_csv('/content/drive/MyDrive/Machine Learning/Machine Learning Datasets/insurance.csv')
print(df.shape)
df.head()

(1338, 7)


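| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |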


In [None]:
# Separate features from the target
X = df.drop(['charges'], axis = 1)
y = df['charges']

In [None]:
# ce.get_embedding_info identifies the categorical variables.
# The function returns a dictionary that maps each variable to a tuple:
# (number of categories, embedding size)
# Note: By default, the embedding size is half the number of categories.
# We can also override the default by handcrafting the dictionary.
embedding_info = ce.get_embedding_info(X)
embedding_info

{'sex': (2, 1), 'smoker': (2, 1), 'region': (4, 2)}
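
To show what such a dictionary looks like when built by hand, here is a rough sketch. The helper `sketch_embedding_info` is hypothetical and is not part of `categorical_embedder`; it only applies the "half the number of categories" rule described in the comment above, and the small data frame is a stand-in for `X`.

```python
import pandas as pd

# Small stand-in for X with the same categorical columns
X_demo = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

def sketch_embedding_info(frame):
    """Hypothetical helper: map each non-numeric column to
    (number of categories, half that number, but at least 1)."""
    info = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue  # skip numeric columns such as age
        n_categories = frame[col].nunique()
        info[col] = (n_categories, max(1, n_categories // 2))
    return info

info = sketch_embedding_info(X_demo)
print(info)  # {'sex': (2, 1), 'smoker': (2, 1), 'region': (4, 2)}

# Handcrafted override: ask for a 3-dimensional embedding for region
info["region"] = (4, 3)
```

A dictionary edited this way has the same structure as the one returned above, so it could be supplied wherever the embedding info is expected.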

In [None]:
# ce.get_label_encoded_data integer-encodes the categorical variables
# and prepares the data for feeding into a neural network.
X_encoded, encoders = ce.get_label_encoded_data(X)
X_encoded.head()

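| | age | sex | bmi | children | smoker | region |
|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 3 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 2 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 2 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 1 |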


In [None]:
# Show the encoders schema
encoders

{'sex': LabelEncoder(),
 'smoker': LabelEncoder(),
 'region': LabelEncoder()}
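
Since the encoders are scikit-learn `LabelEncoder` objects, a similar encoding can be sketched directly with scikit-learn. This is our own approximation of the step, not the library's actual implementation; the data frame repeats the first five rows shown earlier, and the category lists are written out by hand.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# First five rows of the categorical columns, copied from the preview above
X_demo = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# Full category lists, so that the codes do not depend on the sample rows
categories = {
    "sex": ["female", "male"],
    "smoker": ["no", "yes"],
    "region": ["northeast", "northwest", "southeast", "southwest"],
}

X_demo_encoded = X_demo.copy()
demo_encoders = {}
for col, values in categories.items():
    # LabelEncoder assigns integers in sorted (alphabetical) order
    encoder = LabelEncoder().fit(values)
    X_demo_encoded[col] = encoder.transform(X_demo[col].tolist())
    demo_encoders[col] = encoder

print(X_demo_encoded)

# Keeping the encoders lets us map integers back to the original labels
print(demo_encoders["region"].inverse_transform([3]))  # ['southwest']
```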

In [None]:
# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y)
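
Called without extra arguments, `train_test_split` holds out 25% of the rows and shuffles differently on every run, so the learned embeddings will vary between runs. If reproducibility matters, one option is to fix the seed, as in this small sketch on dummy arrays. The `test_size` and `random_state` values are illustrative choices, not taken from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 20 rows, 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Fixing random_state makes the shuffle repeatable across runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
```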

In [None]:
# ce.get_embeddings trains a neural network model,
# extracts the embeddings, and returns them in a dictionary
embeddings = ce.get_embeddings(
  # Provide the train set
  X_train, y_train, 
  # Provide the embedding info
  categorical_embedding_info = embedding_info, 
  # Our target (healthcare charges) is continuous, so this is not a classification task
  is_classification = False,  
  # Specify epochs and batch size 
  epochs = 100, batch_size = 32)




In [None]:
# Take a look at the learned embeddings
embeddings

{'sex': array([[0.14685598],
        [0.3512356 ]], dtype=float32), 'smoker': array([[ 1.4771585],
        [-1.2328681]], dtype=float32), 'region': array([[ 0.00809342, -0.06681861],
        [ 0.13251856, -0.14833237],
        [ 0.04736542, -0.04221172],
        [ 0.42672214, -0.41650957]], dtype=float32)}

In [None]:
# Shapes of embeddings
print(embeddings['sex'].shape)
print(embeddings['smoker'].shape)
print(embeddings['region'].shape)

(2, 1)
(2, 1)
(4, 2)


In [None]:
# If you prefer not to work with the dictionary format,
# we can convert it to dataframes for easier readability
dfs = ce.get_embeddings_in_dataframe(
  embeddings = embeddings, 
  encoders = encoders)





In [None]:
# Embeddings for regions
dfs['region']

| | region_embedding_0 | region_embedding_1 |
|---|---|---|
| northeast | 0.008093 | -0.066819 |
| northwest | 0.132519 | -0.148332 |
| southeast | 0.047365 | -0.042212 |
| southwest | 0.426722 | -0.41651 |
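
One way to check the claim that similar categories receive similar embeddings is to compare distances between the learned vectors. The sketch below copies the region values printed above and computes pairwise Euclidean distances. These numbers come from a single training run with a random split, so a rerun may give a different picture.

```python
import numpy as np
import pandas as pd

# Region embeddings copied from the output above
region_emb = pd.DataFrame(
    {
        "region_embedding_0": [0.008093, 0.132519, 0.047365, 0.426722],
        "region_embedding_1": [-0.066819, -0.148332, -0.042212, -0.416510],
    },
    index=["northeast", "northwest", "southeast", "southwest"],
)

# Pairwise Euclidean distances between the four embedding vectors
vectors = region_emb.to_numpy()
diff = vectors[:, None, :] - vectors[None, :, :]
dist = pd.DataFrame(
    np.linalg.norm(diff, axis=-1),
    index=region_emb.index,
    columns=region_emb.index,
)
print(dist.round(3))
```

Unlike one-hot vectors, these distances are not all equal: in this run, northeast sits closest to southeast and farthest from southwest.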


In [None]:
# Embeddings for sex
dfs['sex']

| | sex_embedding_0 |
|---|---|
| female | 0.146856 |
| male | 0.351236 |


In [None]:
# Embeddings for smoker
dfs['smoker']

| | smoker_embedding_0 |
|---|---|
| no | 1.477159 |
| yes | -1.232868 |


In [None]:
# Include these embeddings in the dataset
data = ce.fit_transform(
  X, 
  embeddings = embeddings, 
  encoders = encoders, 
  # Remove the original categorical variables
  drop_categorical_vars = True)
data.head()





| | age | bmi | children | sex_embedding_0 | smoker_embedding_0 | region_embedding_0 | region_embedding_1 |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 27.9 | 0 | 0.146856 | -1.232868 | 0.426722 | -0.41651 |
| 1 | 18 | 33.77 | 1 | 0.351236 | 1.477159 | 0.047365 | -0.042212 |
| 2 | 28 | 33.0 | 3 | 0.351236 | 1.477159 | 0.047365 | -0.042212 |
| 3 | 33 | 22.705 | 0 | 0.351236 | 1.477159 | 0.132519 | -0.148332 |
| 4 | 32 | 28.88 | 0 | 0.351236 | 1.477159 | 0.132519 | -0.148332 |
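
The replacement step can also be reproduced by hand, which helps clarify what the transformed table contains. The sketch below is our own approximation using NumPy indexing and is not the library's implementation; the encoded rows and embedding values are copied (rounded) from the outputs above.

```python
import numpy as np
import pandas as pd

# Label-encoded rows copied from the X_encoded preview
X_enc_demo = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": [0, 1, 1],
    "smoker": [1, 0, 0],
    "region": [3, 2, 2],
})

# Learned embeddings copied from the output above
emb_demo = {
    "sex": np.array([[0.146856], [0.351236]]),
    "smoker": np.array([[1.477159], [-1.232868]]),
    "region": np.array([
        [0.008093, -0.066819],
        [0.132519, -0.148332],
        [0.047365, -0.042212],
        [0.426722, -0.416510],
    ]),
}

# Start from the non-categorical columns, then append the embedding columns
parts = [X_enc_demo.drop(columns=list(emb_demo))]
for col, matrix in emb_demo.items():
    # Row indexing: each integer code selects that category's vector
    looked_up = matrix[X_enc_demo[col].to_numpy()]
    names = [f"{col}_embedding_{i}" for i in range(matrix.shape[1])]
    parts.append(pd.DataFrame(looked_up, columns=names, index=X_enc_demo.index))

manual = pd.concat(parts, axis=1)
print(manual)
```

The first row (a smoker from the southwest) picks up the same embedding values as row 0 of the table above.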
