#### 2. Transferable Tabular Transformer

2-1. Transtab
- 컬럼명과 cell 값을 결합하여 Transferable한 Tabular 모델을 만드는 방법론

- Tabular 데이터에서 중요한 정보는 컬럼명에 담겨 있다. 예를 들어, 특정 값이 60 일 때, 우리는 컬럼명을 보고 나이를 의미하는 지 몸무게를 의미하는 지 알 수 있다. 즉, 컬럼명과 결합이 될 때, cell 값은 의미를 지닐 수 있다. 
- Tabular 도메인에서는 일반적인 경우, 전이학습이 불가능하다. 새로운 피쳐가 생기거나 없어졌을때 혹은 피쳐의 이름과 순서만 달라져도, 모델이 동작할 수 없기 때문이다. 
- 그러나, 컬럼의 의미를 학습한 모델은 순서의 이름이 달라지더라도 그 특징을 파악할 수 있다. 예를 들어, smoked와 smoking history 라는 두 컬럼이 동일한 의미를 지니는 것을 기존의 모델들은 알 수 없었지만, Transtab은 그 의미를 알 수 있기때문에 두 컬럼을 맵핑할 수 있다.


![fig2](../img/transtab-fig2.png)

- Transtab은 컬럼을 타입별로 나누어 인코딩한다.

|Age|Gender|Birth Country|Married|Smoked|Working Hour|
|---|---|---|---|---|---|
|32|Male|US|1|0|40

- (1) categorical 
    > - 컬럼의 이름과 값을 나열하여 하나의 문장으로 만들어 토크나이징 -> 임베딩
    > - tokenizer('Gender Male, Birth Country US')

- (2) binary
    > - 1 인 경우에만 컬럼명을 나열하여 토크나이징 -> 임베딩 
    > - tokenizer('Married')

- (3) numerical
    > - 컬럼 이름을 각각 토크나이징 -> 임베딩 -> 임베딩 값에 cell 값을 곱해준다
    > - token_embedding(tokenizer(['Age', 'Working Hour'])) * [32, 1, 40]

<br>

- 모든 임베딩을 concat 하고 CLS 토큰을 붙여 인코딩을 완료

- 인코딩된 인풋은 트랜스포머 인코더로 학습되는데, 이 때 gated layer를 추가하여 모델의 logit을 normalize 할 수 있도록 한다. 이를 gated tranformer라고 명명


Reference
 - NeurIPS'22 | TransTab: Learning Transferable Tabular Transformers Across Tables
 - https://github.com/RyanWangZf/transtab

In [1]:
!nvidia-smi

Sat Apr  6 00:49:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0              25W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

데이터 다운로드 및 필요 패키지 설치

In [2]:
!gdown 17x8tjQVBQkBLTNnNaRWk9Jwq0loGvjUB
!pip install transtab

Downloading...
From: https://drive.google.com/uc?id=17x8tjQVBQkBLTNnNaRWk9Jwq0loGvjUB
To: /content/open.zip
  0% 0.00/532k [00:00<?, ?B/s]100% 532k/532k [00:00<00:00, 118MB/s]
Collecting transtab
  Downloading transtab-0.0.5-py3-none-any.whl (29 kB)
Collecting loguru (from transtab)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.5/62.5 kB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m
Collecting openml>=0.10.0 (from transtab)
  Downloading openml-0.14.2.tar.gz (144 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m144.5/144.5 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting liac-arff>=2.4.0 (from openml>=0.10.0->transtab)
  Downloading liac-arff-2.5.0.tar.gz (13 k

In [2]:
import transtab

import torch
import random
import numpy as np
import torch.backends.cudnn as cudnn

def seed_everything(seed):
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.cuda.manual_seed_all(seed)
  np.random.seed(seed)
  cudnn.benchmark=False
  cudnn.deterministic=True
  random.seed(seed)
  transtab.random_seed(seed)

SEED = 555
seed_everything(SEED)

import warnings
warnings.filterwarnings('ignore')

데이터셋

- 한 사람에 관련된 다양한 사회적, 경제적 정보
- https://dacon.io/competitions/official/236230/data

```
ID : 학습 데이터 고유 ID
Age
Gender
Education_Status
Employment_Status
Working_Week (Yearly)
Industry_Status
Occupation_Status
Race
Hispanic_Origin
Martial_Status
Household_Status
Household_summary
Citizenship
Birth_Country
Birth_Country (Father)
Birth_Country (Mother)
Tax_Status
Gains
Losses
Divdends
Incom_Status
Income : 예측 목표, 1시간 단위의 소득을 예측
```

In [3]:
from zipfile import ZipFile

import pandas as pd

with ZipFile('/content/open.zip', 'r') as zipfile:
  with zipfile.open('open/train.csv', 'r') as zf:
    train = pd.read_csv(zf)

train.shape

(20000, 23)

In [4]:
# 범주형(or문자형) 변수와 수치형 변수를 구분
cat_cols = [col for col in train.columns[1:-1] if train[col].dtype=='object']
num_cols = [col for col in train.columns[1:-1] if train[col].dtype=='int']

In [5]:
# 변수형 변수 전처리
import re

# 성별 컬럼은 풀어서
train.Gender = train.Gender.apply(lambda x: 'Male' if x=='M' else 'Female')

# 연속된 숫자는 띄어서 ex. '12' -> '1 2'
for col in cat_cols:
    train[col] = train[col].apply(lambda x: re.sub(r'(\d)(\d)', r'\1 \2', x))

In [6]:
# 수치형 변수 전처리
from sklearn.preprocessing import RobustScaler, StandardScaler
from collections import defaultdict

# 컬럼별로 스케일링 해주고 test셋에 적용하기 위하여 저장
scalers = defaultdict(StandardScaler)

for col in num_cols:
  scaler = StandardScaler()
  train[col] = scaler.fit_transform(train[col].values.reshape(-1,1)).flatten()
  scalers[col] = scaler

In [7]:
# 훈련, 검증셋으로 나누기
from sklearn.model_selection import train_test_split

trn_X, val_X, trn_y, val_y = train_test_split(train.drop(['ID', 'Income'], axis=1), train['Income'], test_size=0.2, random_state=SEED)

#### Pretraining

![fig3](../img/transtab-fig3.png)

Vertical Partitioning Contrastive Learning

- 주어진 테이블을 'vertical' 하게 잘라 여러개의 테이블로 분해한 뒤 contrastive learning 방식으로 테이블의 특징을 훈련
- Supervised 방식은, label을 기준으로 positive pair가 구성되고, Self-supervised 방식은 row 단위로, positive pair가 구성되는 차이가 있음

In [8]:
# 본 데이터셋의 label은 class가 없으므로 self-VPCL 방식으로 사전 훈련
SAVE_PATH = './transtab-output'
model, collate_fn = transtab.build_contrastive_learner(
    cateogorical_columns=cat_cols, # 타입별로 나누어 컬럼 리스트 입력
    numerical_columns=num_cols,
    binary_columns=None,
    supervised=False, # self-supervised 방식으로 설정
    num_partition=4, # 테이블을 몇 개로 나눌 것인지 (= 배치 내 positive sample의 수)
    overlap_ratio=0.5, # 겹치는 컬럼 비율 설정
    device='cuda',
    #hidden_dim=64,
    #ffn_dim=128,
    #projection_dim=64,
    #num_attention_head=8,
)

In [12]:
training_args = {
    'num_epoch':1000,
    'batch_size':32,
    'eval_batch_size':32,
    'lr':1e-4,
    'eval_metric':'val_loss',
    'eval_less_is_better':True,
    'output_dir':f'{SAVE_PATH}/pretrained',
    'patience':10,
    'num_workers':2,
    'warmup_steps':2,
}

transtab.train(model, (trn_X, trn_y), (val_X, val_y), collate_fn=collate_fn, **training_args)

[32m2024-04-06 00:50:08.727[0m | [1mINFO    [0m | [36mtranstab.trainer[0m:[36mtrain[0m:[36m105[0m - [1mset warmup training in initial 500000.0 steps[0m


Epoch:   0%|          | 0/1000 [00:00<?, ?it/s]

epoch: 0, test val_loss: 4.736674
epoch: 0, train loss: 2378.5265, lr: 0.000100, spent: 37.5 secs
epoch: 1, test val_loss: 4.721186
epoch: 1, train loss: 2363.8419, lr: 0.000100, spent: 78.8 secs
epoch: 2, test val_loss: 4.716955
epoch: 2, train loss: 2359.4826, lr: 0.000100, spent: 117.7 secs
epoch: 3, test val_loss: 4.708981
epoch: 3, train loss: 2357.1172, lr: 0.000100, spent: 154.6 secs
epoch: 4, test val_loss: 4.706509
epoch: 4, train loss: 2353.8605, lr: 0.000100, spent: 190.0 secs
epoch: 5, test val_loss: 4.703464
epoch: 5, train loss: 2353.0034, lr: 0.000100, spent: 227.7 secs
epoch: 6, test val_loss: 4.705224
EarlyStopping counter: 1 out of 10
epoch: 6, train loss: 2351.6065, lr: 0.000100, spent: 264.0 secs
epoch: 7, test val_loss: 4.702949
epoch: 7, train loss: 2350.3725, lr: 0.000100, spent: 300.6 secs
epoch: 8, test val_loss: 4.700225
epoch: 8, train loss: 2350.4754, lr: 0.000100, spent: 336.5 secs
epoch: 9, test val_loss: 4.700397
EarlyStopping counter: 1 out of 10
epoch: 

[32m2024-04-06 01:16:05.073[0m | [1mINFO    [0m | [36mtranstab.trainer[0m:[36mtrain[0m:[36m136[0m - [1mload best at last from /content/drive/MyDrive/Colab Notebooks/transtab-output/pretrained[0m
[32m2024-04-06 01:16:05.121[0m | [1mINFO    [0m | [36mtranstab.trainer[0m:[36msave_model[0m:[36m247[0m - [1msaving model checkpoint to /content/drive/MyDrive/Colab Notebooks/transtab-output/pretrained[0m


epoch: 39, test val_loss: 4.693530
EarlyStopping counter: 10 out of 10
early stopped


[32m2024-04-06 01:16:05.503[0m | [1mINFO    [0m | [36mtranstab.trainer[0m:[36mtrain[0m:[36m141[0m - [1mtraining complete, cost 1556.8 secs.[0m


---
Downstream Task

In [154]:
# 오픈소스에는 분류 모델만 있으므로 회귀모델을 새로 정의해주어야 함
# transtab 인코더와 linear layer를 연결하여 최종 모델을 정의
class TransTabRegressor(transtab.TransTabModel):
  def __init__(self,
        categorical_columns=None,
        numerical_columns=None,
        binary_columns=None,
        feature_extractor=None,
        num_class=1,
        hidden_dim=128,
        num_layer=2,
        num_attention_head=8,
        hidden_dropout_prob=0,
        ffn_dim=256,
        activation='relu',
        device='cuda:0',
        **kwargs,
        ) -> None:
        super().__init__(
            categorical_columns=categorical_columns,
            numerical_columns=numerical_columns,
            binary_columns=binary_columns,
            feature_extractor=feature_extractor,
            hidden_dim=hidden_dim,
            num_layer=num_layer,
            num_attention_head=num_attention_head,
            hidden_dropout_prob=hidden_dropout_prob,
            ffn_dim=ffn_dim,
            activation=activation,
            device=device,
            **kwargs,
        )
        self.num_class = num_class
        self.clf = transtab.modeling_transtab.TransTabLinearClassifier(num_class=num_class, hidden_dim=hidden_dim)
        self.loss_fn = torch.nn.HuberLoss() 
        self.to(device)

  def forward(self, x, y=None):

        inputs = self.input_encoder.feature_extractor(x)

        outputs = self.input_encoder.feature_processor(**inputs)
        outputs = self.cls_token(**outputs)

        # get CLS
        encoder_output = self.encoder(**outputs) 

        # classifier
        logits = self.clf(encoder_output)

        if y is not None:
            y_ts = torch.tensor(y.values).to(self.device).float()
            loss = self.loss_fn(logits.flatten(), y_ts)
            loss = loss.mean()
        else:
            loss = None

        return logits, loss


def build_regressor(
    categorical_columns=None,
    numerical_columns=None,
    binary_columns=None,
    feature_extractor=None,
    num_class=1,
    hidden_dim=128,
    num_layer=2,
    num_attention_head=8,
    hidden_dropout_prob=0,
    ffn_dim=256,
    activation='relu',
    device='cuda:0',
    checkpoint=None,
    **kwargs) -> TransTabRegressor:
    
    model = TransTabRegressor(
        categorical_columns = categorical_columns,
        numerical_columns = numerical_columns,
        binary_columns = binary_columns,
        feature_extractor = feature_extractor,
        num_class=num_class,
        hidden_dim=hidden_dim,
        num_layer=num_layer,
        num_attention_head=num_attention_head,
        hidden_dropout_prob=hidden_dropout_prob,
        ffn_dim=ffn_dim,
        activation=activation,
        device=device,
        **kwargs,
        )

    if checkpoint is not None:
        model.load(checkpoint)

    return model

In [157]:
training_args = {
    'num_epoch':1000,
    'batch_size':64,
    'eval_batch_size':64,
    'lr':1e-3,
    'eval_metric':'val_loss',
    'eval_less_is_better':True,
    'output_dir':f'{SAVE_PATH}/downstream',
    'patience':10,
    'num_workers':2,
    'warmup_steps':2,
    'shuffle':True,
}

# self-VPCL 훈련이 완료되어 저장된 가중치를 불러와 regression model 만들고 downstream task 학습
clf = build_regressor(checkpoint=f'{SAVE_PATH}/pretrained')
transtab.train(clf, trainset=(trn_X, trn_y), valset=(val_X, val_y), **training_args)

[32m2024-04-06 04:27:30.063[0m | [1mINFO    [0m | [36mtranstab.trainer[0m:[36mtrain[0m:[36m105[0m - [1mset warmup training in initial 250000.0 steps[0m


Epoch:   0%|          | 0/1000 [00:00<?, ?it/s]

epoch: 0, test val_loss: 837.425829
epoch: 0, train loss: 213072.6148, lr: 0.001000, spent: 28.3 secs
epoch: 1, test val_loss: 763.110691
epoch: 1, train loss: 198519.0906, lr: 0.001000, spent: 55.0 secs
epoch: 2, test val_loss: 680.523547
epoch: 2, train loss: 177654.5424, lr: 0.001000, spent: 81.5 secs
epoch: 3, test val_loss: 627.168341
epoch: 3, train loss: 160598.3363, lr: 0.001000, spent: 111.4 secs
epoch: 4, test val_loss: 609.700383
epoch: 4, train loss: 152575.5951, lr: 0.001000, spent: 139.6 secs
epoch: 5, test val_loss: 606.022073
epoch: 5, train loss: 149632.3834, lr: 0.001000, spent: 166.4 secs
epoch: 6, test val_loss: 607.547186
EarlyStopping counter: 1 out of 10
epoch: 6, train loss: 149860.9231, lr: 0.001000, spent: 193.3 secs
epoch: 7, test val_loss: 607.527159
EarlyStopping counter: 2 out of 10
epoch: 7, train loss: 150060.7104, lr: 0.001000, spent: 219.8 secs
epoch: 8, test val_loss: 607.541962
EarlyStopping counter: 3 out of 10
epoch: 8, train loss: 149442.5963, lr:

[32m2024-04-06 04:34:46.468[0m | [1mINFO    [0m | [36mtranstab.trainer[0m:[36mtrain[0m:[36m136[0m - [1mload best at last from /content/drive/MyDrive/Colab Notebooks/transtab-output/downstream[0m
[32m2024-04-06 04:34:46.529[0m | [1mINFO    [0m | [36mtranstab.trainer[0m:[36msave_model[0m:[36m247[0m - [1msaving model checkpoint to /content/drive/MyDrive/Colab Notebooks/transtab-output/downstream[0m


epoch: 15, test val_loss: 607.234356
EarlyStopping counter: 10 out of 10
early stopped


[32m2024-04-06 04:34:46.956[0m | [1mINFO    [0m | [36mtranstab.trainer[0m:[36mtrain[0m:[36m141[0m - [1mtraining complete, cost 436.9 secs.[0m


In [158]:
with ZipFile('/content/open.zip', 'r') as zipfile:
  with zipfile.open('open/test.csv', 'r') as zf:
    test = pd.read_csv(zf)

test.shape

(10000, 22)

#### Inference

In [159]:
test[cat_cols] = test[cat_cols].fillna('Unknown')

test.Gender = test.Gender.apply(lambda x: 'Male' if x=='M' else 'Female')

for col in cat_cols:
    test[col] = test[col].apply(lambda x: re.sub(r'(\d)(\d)', r'\1 \2', x))

for col in num_cols:
  scaler = scalers[col]
  test[col] = scaler.transform(test[col].values.reshape(-1,1)).flatten()

In [162]:
from tqdm import tqdm

clf.eval()

bs = 1024
pred = []
with torch.no_grad():
  for i in tqdm(range(int(len(test)/64 + 1))):
    rows = test.iloc[i*bs:(i+1)*bs, :]
    if len(rows)>0:
      logits, _ = clf(rows.drop('ID', axis=1))
      pred.extend(logits.detach().cpu().numpy().flatten())

100%|██████████| 157/157 [00:06<00:00, 23.18it/s]


In [163]:
pred = [p if p>0 else 0 for p in pred]

In [164]:
submission = pd.DataFrame()
submission['ID'] = test.ID
submission['Income'] = pred
submission.shape

(10000, 2)

In [165]:
submission.to_csv('submission-2.csv', index=False)