<a href="https://colab.research.google.com/github/PrabhuRajendhran/reimagined-ML/blob/main/text%2Bnumerical_categorical.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Combining numerical/categorical data with transformers

For simple text classification, the last hidden state of the CLS token is passed through a classifier, which produces a score for each label. A simple way to combine text with numerical or categorical data is to concatenate the CLS embedding with the extra data. For example, if the CLS embedding is [1.0, 2.0, 3.0] and the extra data is 5.0, the concatenated vector is [1.0, 2.0, 3.0, 5.0]. For categorical data, convert the variable to a one-hot encoding and concatenate that instead.
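
As a minimal sketch of the concatenation described above, the snippet below uses the toy 3-dimensional CLS embedding from the text. The categorical variable with 4 categories is a hypothetical example, not something taken from a real dataset.

```python
import torch
import torch.nn.functional as F

# Toy CLS embedding from the text (a real one has hidden_size entries, e.g. 768)
cls_embedding = torch.tensor([1.0, 2.0, 3.0])

# Numerical feature: append the value directly
numerical = torch.tensor([5.0])
with_numerical = torch.cat((cls_embedding, numerical), dim=-1)
print(with_numerical)  # tensor([1., 2., 3., 5.])

# Categorical feature: hypothetical variable with 4 categories, this example is category 2
category = torch.tensor([2])
one_hot = F.one_hot(category, num_classes=4).float().squeeze(0)  # one_hot returns integers, so cast to float
with_categorical = torch.cat((cls_embedding, one_hot), dim=-1)
print(with_categorical)  # tensor([1., 2., 3., 0., 0., 1., 0.])
```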

## UPDATE: See [this notebook](https://colab.research.google.com/drive/1F7COnwHqcLDPg_SS-oFgW3c2GPDWnS5Y) for version 2, which can be used with the Hugging Face Trainer

In [None]:
!pip install transformers -q

In [None]:
from transformers import AutoConfig, AutoModel
import torch

class CustomModel(torch.nn.Module):
    """
    Takes a transformer backbone and puts a slightly modified classification head on top:
    the head sees the CLS embedding concatenated with the extra numerical/categorical data.
    """

    def __init__(self, model_name, num_extra_dims, num_labels):
        # num_extra_dims corresponds to the number of extra dimensions of numerical/categorical data

        super().__init__()

        self.config = AutoConfig.from_pretrained(model_name)
        self.transformer = AutoModel.from_pretrained(model_name, config=self.config)
        # The hidden size depends on the model (common sizes are 768 and 1024); see the model's config.json
        num_hidden_size = self.transformer.config.hidden_size
        self.classifier = torch.nn.Linear(num_hidden_size+num_extra_dims, num_labels)


    def forward(self, input_ids, extra_data, attention_mask=None):
        """
        extra_data should be of shape [batch_size, dim]
        where dim is the number of additional numerical/categorical dimensions
        """

        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        # outputs.last_hidden_state has shape [batch size, sequence length, hidden size]

        cls_embeds = outputs.last_hidden_state[:, 0, :] # [batch size, hidden size]

        concat = torch.cat((cls_embeds, extra_data), dim=-1) # [batch size, hidden size+num extra dims]

        output = self.classifier(concat) # [batch size, num labels]

        return output


In [None]:
model_name = "distilbert-base-cased"
num_extra_dims = 5
num_labels = 7
batch_size = 8

In [None]:
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained(model_name)
custom_model = CustomModel(model_name, num_extra_dims=num_extra_dims, num_labels=num_labels)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Duplicating input sentence for simplicity while still showing batching
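# padding=True is a no-op here (all sentences are identical) but is needed once the sentences differ in length
encoded = tokenizer(["This is an example sentence"]*batch_size, padding=True, return_tensors="pt")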

# Dummy data (can be numerical or categorical)
extra_data = torch.rand((batch_size, num_extra_dims))
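
With real data, the random tensor above would be replaced by features from your dataset. The sketch below shows one possible way to build such a tensor. The feature names (`age`, `price`, `color`) and their values are hypothetical, and standardizing the numerical columns is a common preprocessing choice assumed here, not something the original notebook prescribes. The result is stored under a separate name so it does not overwrite `extra_data`.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw features for a batch of 4 examples
age = torch.tensor([23.0, 35.0, 41.0, 58.0])      # numerical
price = torch.tensor([9.99, 4.50, 20.00, 15.25])  # numerical
color = torch.tensor([0, 2, 1, 2])                # categorical with 3 classes

# Stack the numerical columns and standardize each one (assumed preprocessing step)
numerical = torch.stack((age, price), dim=1)      # [batch size, 2]
numerical = (numerical - numerical.mean(dim=0)) / numerical.std(dim=0)

# One-hot encode the categorical column; cast to float to match the embedding dtype
color_one_hot = F.one_hot(color, num_classes=3).float()  # [batch size, 3]

# 2 numerical + 3 one-hot dimensions = 5, matching num_extra_dims above
extra_data_example = torch.cat((numerical, color_one_hot), dim=1)
print(extra_data_example.shape)  # torch.Size([4, 5])
```

In practice the mean and standard deviation would usually be computed once on the training set and reused at inference time, rather than recomputed per batch as in this sketch.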

In [None]:
encoded["input_ids"] # [batch size, sequence length]

tensor([[ 101, 1188, 1110, 1126, 1859, 5650,  102],
        [ 101, 1188, 1110, 1126, 1859, 5650,  102],
        [ 101, 1188, 1110, 1126, 1859, 5650,  102],
        [ 101, 1188, 1110, 1126, 1859, 5650,  102],
        [ 101, 1188, 1110, 1126, 1859, 5650,  102],
        [ 101, 1188, 1110, 1126, 1859, 5650,  102],
        [ 101, 1188, 1110, 1126, 1859, 5650,  102],
        [ 101, 1188, 1110, 1126, 1859, 5650,  102]])

In [None]:
extra_data # pretend this is what you are adding to each sequence

tensor([[0.4903, 0.9065, 0.0020, 0.7355, 0.6623],
        [0.6330, 0.0328, 0.4184, 0.3743, 0.3858],
        [0.8357, 0.9689, 0.0681, 0.4625, 0.7332],
        [0.1433, 0.1529, 0.9276, 0.6186, 0.1452],
        [0.6958, 0.8449, 0.9475, 0.0389, 0.7816],
        [0.6270, 0.7992, 0.8266, 0.6182, 0.5612],
        [0.2574, 0.6601, 0.2305, 0.0559, 0.9399],
        [0.5884, 0.2998, 0.5159, 0.2381, 0.3132]])

In [None]:
with torch.no_grad():
    output = custom_model(encoded["input_ids"], extra_data, attention_mask=encoded["attention_mask"])
print(output.shape) # [batch size, num labels]
output

torch.Size([8, 7])


tensor([[-0.1769,  0.1392,  0.0755, -0.4035,  0.1356,  0.0360,  0.5473],
        [-0.1795,  0.1093,  0.0946, -0.4394,  0.1943,  0.0083,  0.5606],
        [-0.1757,  0.1427,  0.0633, -0.4032,  0.1493,  0.0373,  0.5623],
        [-0.1733,  0.1077,  0.1128, -0.4273,  0.1900,  0.0122,  0.5520],
        [-0.1772,  0.1222,  0.0839, -0.4004,  0.1861,  0.0357,  0.5760],
        [-0.1609,  0.1364,  0.0839, -0.3939,  0.1697,  0.0435,  0.5625],
        [-0.2037,  0.1042,  0.0959, -0.4214,  0.1607,  0.0178,  0.5537],
        [-0.1827,  0.1135,  0.0890, -0.4345,  0.1914,  0.0081,  0.5663]])
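
The values above are raw scores (logits), one per label. A minimal sketch of turning them into probabilities and a predicted label follows. The first row of the output is copied in by hand so the snippet runs on its own. Since the classifier layer here is freshly initialized and untrained, the resulting prediction carries no meaning yet.

```python
import torch

# First row of the logits printed above, copied by hand for illustration
logits = torch.tensor([[-0.1769, 0.1392, 0.0755, -0.4035, 0.1356, 0.0360, 0.5473]])

probs = torch.softmax(logits, dim=-1)  # [batch size, num labels], each row sums to 1
predicted = probs.argmax(dim=-1)       # [batch size], index of the highest-scoring label

print(predicted)  # tensor([6])
```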