<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Student-Model" data-toc-modified-id="Student-Model-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Student Model</a></span><ul class="toc-item"><li><span><a href="#Data-processing" data-toc-modified-id="Data-processing-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data processing</a></span></li><li><span><a href="#Model" data-toc-modified-id="Model-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model</a></span><ul class="toc-item"><li><span><a href="#Define-Model" data-toc-modified-id="Define-Model-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Define Model</a></span></li><li><span><a href="#Train" data-toc-modified-id="Train-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Train</a></span></li></ul></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Evaluation</a></span><ul class="toc-item"><li><span><a href="#ROC-AUC" data-toc-modified-id="ROC-AUC-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>ROC AUC</a></span></li><li><span><a href="#Compression-rate" data-toc-modified-id="Compression-rate-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Compression rate</a></span></li></ul></li></ul></li></ul></div>

# Student Model


We need to train a small model on the [soft targets](https://drive.google.com/file/d/1tBbPOUT-Ow9f3zTDApykGXYwt-KslYle/view?usp=sharing) of the teacher model, such that its quality is not much worse than the teacher's.
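
As a rough illustration of what training on soft targets means, the per-example loss can be sketched as a weighted sum of two binary cross-entropies: one against the teacher's probability and one against the true label. The helper names `bce` and `distillation_loss` are hypothetical, and the default weight of 0.8 mirrors `distillation_loss_weight` from the training cell of this notebook.

```python
import math


def bce(pred, target):
    """Binary cross-entropy for one prediction; target may be a soft value in [0, 1]."""
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))


def distillation_loss(pred, hard_label, soft_target, alpha=0.8):
    """Weighted sum of the loss on the teacher's soft target and on the hard label."""
    return alpha * bce(pred, soft_target) + (1 - alpha) * bce(pred, hard_label)


# Student predicts 0.6, the teacher said 0.7, the true label is 1.
loss = distillation_loss(pred=0.6, hard_label=1, soft_target=0.7)
print(loss)
```

With `alpha=1.0` the loss is minimized when the student reproduces the teacher's probability exactly; the hard-label term pulls the prediction towards the true class.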

In [1]:
import os

In [2]:
DATA_PATH = '../../data/criteo'

TRAIN_PATH = os.path.join(DATA_PATH, 'train.csv')
SOFT_TARGETS_PATH = os.path.join(DATA_PATH, 'soft_targets_full.csv')

## Data processing

The data should be split into Train/Validation/Test as 80/10/10.

In [3]:
from itertools import chain, islice

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [4]:
num_features = [f'_c{i}' for i in range(1, 14)]
cat_features = [f'_c{i}' for i in range(14, 40)]

cat_features_dims = dict([
    ('c14', 1445),
    ('c15', 556),
    ('c16', 1130758),
    ('c17', 360209),
    ('c18', 304),
    ('c19', 21),
    ('c20', 11845),
    ('c21', 631),
    ('c22', 3),
    ('c23', 49223),
    ('c24', 5194),
    ('c25', 985420),
    ('c26', 3157),
    ('c27', 26),
    ('c28', 11588),
    ('c29', 715441),
    ('c30', 10),
    ('c31', 4681),
    ('c32', 2029),
    ('c33', 4),
    ('c34', 870796),
    ('c35', 17),
    ('c36', 15),
    ('c37', 87605),
    ('c38', 84),
    ('c39', 58187)])


MAX_DICT_SIZE = 10000


label_features = [feature for feature, cardinality in cat_features_dims.items() if cardinality <= MAX_DICT_SIZE]
hash_features = [feature for feature, cardinality in cat_features_dims.items() if cardinality > MAX_DICT_SIZE]

In [5]:
data = pd.read_csv(TRAIN_PATH)

soft_targets = pd.read_csv(SOFT_TARGETS_PATH)

In [6]:
num_encoder = Pipeline([
    ('fill', SimpleImputer(missing_values=np.nan, strategy='mean')), 
    ('scale', StandardScaler())
])

cat_encoder = Pipeline([
    ('fill', SimpleImputer(missing_values=np.nan, strategy='constant')),
    ('ord', OrdinalEncoder())
])

data_transformer = ColumnTransformer([
    ('num_encoder', num_encoder, num_features), 
    ('cat_encoder', cat_encoder, cat_features)
])

x = data_transformer.fit_transform(data).astype(np.float32)

In [7]:
y = data['_c0'].values
soft_targets = soft_targets['prob'].values.astype(np.float32)

In [8]:
x_train, x_tv, y_train, y_tv, soft_targets_train, soft_targets_tv = train_test_split(x, y, soft_targets, train_size=.8)
x_val, x_test, y_val, y_test, soft_targets_val, soft_targets_test = \
    train_test_split(x_tv, y_tv, soft_targets_tv, test_size=.5)

In [9]:
del data, x, y, soft_targets

## Model

Pruning and/or Quantization can also be used.
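
Neither technique is applied in this notebook. As a minimal sketch of the ideas only, the NumPy snippet below zeroes the smallest-magnitude half of a weight matrix (magnitude pruning) and stores the matrix as int8 with a single scale factor (symmetric quantization). The function names and the random matrix are assumptions made for illustration; a real setup would more likely rely on the tooling shipped with the training framework.

```python
import numpy as np


def magnitude_prune(weights, amount):
    """Zero out the `amount` fraction of entries with the smallest magnitude."""
    k = int(amount * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0, weights).astype(weights.dtype)


def quantize_int8(weights):
    """Symmetric linear quantization of a float array to int8."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)  # stand-in for a Linear weight

pruned = magnitude_prune(w, amount=0.5)
q, scale = quantize_int8(w)

print("zeroed fraction:", float(np.mean(pruned == 0)))
print("bytes float32 -> int8:", w.nbytes, "->", q.nbytes)
print("max abs error:", float(np.abs(w - dequantize(q, scale)).max()))
```

The int8 copy takes a quarter of the memory, and the rounding error per weight stays within about half of `scale`.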

In [10]:
import torch
import torch.nn as nn

In [11]:
class CrossModule(nn.Module):
    def __init__(self, dim):
        super(CrossModule, self).__init__()
        self.dim = dim
        self.w = nn.Parameter(torch.FloatTensor(dim))
        self.b = nn.Parameter(torch.FloatTensor(dim))
        nn.init.normal_(self.w, std=1 / dim)
        nn.init.normal_(self.b, std=1 / dim)
        
    def forward(self, x0, x_prev):
        return x0.reshape(-1, self.dim, 1) @ x_prev.reshape(-1, 1, self.dim) @ self.w + self.b + x0
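
The forward pass above builds a full `dim × dim` outer product per example before multiplying by `w`. Since matrix multiplication is associative, the same value can be obtained from a dot product, without materializing that matrix. The NumPy sketch below checks this on random data; `cross_outer` and `cross_dot` are illustrative names, not part of the notebook.

```python
import numpy as np


def cross_outer(x0, x_prev, w, b):
    """Mirror of CrossModule.forward: explicit (batch, dim, dim) outer product."""
    outer = x0[:, :, None] @ x_prev[:, None, :]
    return outer @ w + b + x0


def cross_dot(x0, x_prev, w, b):
    """Same result via x0 * (x_prev . w), using only O(dim) memory per example."""
    return x0 * (x_prev @ w)[:, None] + b + x0


rng = np.random.default_rng(0)
batch, dim = 4, 6
x0 = rng.normal(size=(batch, dim))
x_prev = rng.normal(size=(batch, dim))
w = rng.normal(size=dim)
b = rng.normal(size=dim)

print(np.allclose(cross_outer(x0, x_prev, w, b), cross_dot(x0, x_prev, w, b)))
```

For wide inputs this matters: the intermediate tensor shrinks from `batch * dim * dim` to `batch * dim` elements.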

In [12]:
def calc_emb_dim(vocab_size):
    return int(4 * vocab_size ** 0.25)


class HashEmbedding(nn.Module):
    #  A trivial hash function is used
    #  For large vocabularies, integer division by the maximum vocabulary size serves as a second hash
    #  The vocabulary size is expected not to exceed the square of the embedding vocabulary limit
    def __init__(self, vocab_size, emb_vocab_size):
        super(HashEmbedding, self).__init__()
        self.emb_vocab_size = emb_vocab_size
        emb1_dim = min(vocab_size, emb_vocab_size)
        self.emb1 = nn.Embedding(emb1_dim, calc_emb_dim(emb1_dim))
        if vocab_size > emb_vocab_size:
            emb2_dim = vocab_size // emb_vocab_size + 1
            self.emb2 = nn.Embedding(emb2_dim, calc_emb_dim(emb2_dim))
            self.out_dim = calc_emb_dim(emb1_dim) + calc_emb_dim(emb2_dim)
        else:
            self.emb2 = None
            self.out_dim = calc_emb_dim(emb1_dim)
        
    def forward(self, x):
        if self.emb2 is not None:
            return torch.cat((self.emb1(x % self.emb_vocab_size), self.emb2(x // self.emb_vocab_size)), dim=1)
        return self.emb1(x % self.emb_vocab_size)
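
To make the two-hash scheme concrete, here is a small plain-Python sketch of how a category id is split into a remainder (index into the first table) and a quotient (index into the second). The helper `split_index` is a hypothetical name; the numbers are the cardinality of `c16` from the table above and the limit of 5000 passed to `StudentNet` later in the notebook.

```python
def split_index(x, emb_vocab_size):
    """Return the (remainder, quotient) pair used to index the two tables."""
    return x % emb_vocab_size, x // emb_vocab_size


vocab_size, emb_vocab_size = 1130758, 5000
emb1_rows = min(vocab_size, emb_vocab_size)
emb2_rows = vocab_size // emb_vocab_size + 1

r, q = split_index(vocab_size - 1, emb_vocab_size)  # largest valid id
print(emb1_rows, emb2_rows, (r, q))  # 5000 227 (757, 226)
```

The pair `(r, q)` identifies the id uniquely, because `r + q * emb_vocab_size` recovers it, so two distinct ids never share both embeddings.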

In [13]:
class StudentNet(nn.Module):
    def __init__(self, num_dim, cat_vocabs, max_vocab, cross_layers, deep_layers, deep_dim):
        super(StudentNet, self).__init__()
        self.max_vocab = max_vocab
        self.embeddings = nn.ModuleList(HashEmbedding(vocab_size, max_vocab) for vocab_size in cat_vocabs)
        
        self.num_dim = num_dim
        self.n_cat_features = len(cat_vocabs)
        data_dim = num_dim + sum(emb.out_dim for emb in self.embeddings)
        
        self.cross = nn.ModuleList(CrossModule(data_dim) for _ in range(cross_layers))
        self.deep = nn.Sequential(
            nn.Linear(data_dim, deep_dim), nn.ReLU(),
            *chain(*((nn.Linear(deep_dim, deep_dim), nn.ReLU()) for _ in range(deep_layers - 1)))
        )
        self.final = nn.Linear(data_dim + deep_dim, 1)
        
        
    def forward(self, x):
        num_part = x[:, :self.num_dim]
        cat_part = [x[:, i].long() for i in range(self.num_dim, self.num_dim + self.n_cat_features)]
        emb_features = [emb(cats) for emb, cats in zip(self.embeddings, cat_part)]
        embedded_batch = torch.cat([num_part, *emb_features], dim=1)
        cross_out = self.cross[0](embedded_batch, embedded_batch)
        for cross_layer in self.cross[1:]:
            cross_out = cross_layer(embedded_batch, cross_out)
        deep_out = self.deep(embedded_batch)
        out = self.final(torch.cat([cross_out, deep_out], dim=1))
        out = torch.sigmoid(out)
        return out

### Define Model

### Train

In [15]:
class BatchedDataset:
    def __init__(self, batch_size, *args):
        self.batch_size = batch_size
        self.data_streams = args
        self.size = len(args[0])

    def __iter__(self):
        for batch_start in range(0, self.size, self.batch_size):
            yield (torch.tensor(i[batch_start:batch_start + self.batch_size]) for i in self.data_streams)

In [16]:
BATCH_SIZE = 256
train_dataset = BatchedDataset(BATCH_SIZE, x_train, y_train, soft_targets_train)
val_dataset = BatchedDataset(BATCH_SIZE, x_val, y_val, soft_targets_val)
test_dataset = BatchedDataset(BATCH_SIZE, x_test, y_test, soft_targets_test)

In [17]:
from tqdm import tqdm


distillation_loss_weight = torch.tensor(.8).cuda()
pred_loss_weight = torch.tensor(1.).cuda() - distillation_loss_weight


def train(model, n_epochs, train_dataset, val_dataset):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
    loss_func = nn.BCELoss()
    
    def get_loss(pred, y_batch, soft_target_batch):
        # Weighted sum of the distillation loss (teacher soft targets)
        # and the ordinary loss on the hard labels.
        return loss_func(pred, soft_target_batch) * distillation_loss_weight + \
                loss_func(pred, y_batch.float()) * pred_loss_weight
    
    model = model.cuda()
    for epoch in range(n_epochs):
        train_losses, val_losses, val_aucs = [], [], []
        model.train()
        for x_batch, y_batch, soft_target_batch in tqdm(train_dataset):
            x_batch, y_batch, soft_target_batch = x_batch.cuda(), y_batch.cuda(), soft_target_batch.cuda()
            optimizer.zero_grad()
            pred = model(x_batch).flatten()
            loss = get_loss(pred, y_batch, soft_target_batch)
            train_losses.append(loss.item())
            loss.backward()
            optimizer.step()
        train_loss = np.mean(train_losses)
        
        model.eval()
        with torch.no_grad():  # validation needs no gradients
            for x_batch, y_batch, soft_target_batch in val_dataset:
                x_batch, y_batch, soft_target_batch = x_batch.cuda(), y_batch.cuda(), soft_target_batch.cuda()
                pred = model(x_batch).flatten()
                loss = get_loss(pred, y_batch, soft_target_batch)
                val_losses.append(loss.item())
                # Note: this is the mean of per-batch AUCs, not the AUC over the whole validation set
                val_aucs.append(roc_auc_score(y_batch.cpu().numpy(), pred.cpu().numpy()))
        val_loss = np.mean(val_losses)
        val_auc = np.mean(val_aucs)
        
        print(f'Epoch {epoch}: train loss = {train_loss}, val loss = {val_loss}, val auc = {val_auc}')

In [19]:
model = StudentNet(13, cat_features_dims.values(), 5000, 3, 3, 128)
train(model, 2, train_dataset, val_dataset)

11453it [27:09,  7.03it/s]
0it [00:00, ?it/s]

Epoch 0: train loss = 0.4879852831363678, val loss = 0.4566759169101715, val auc = 0.7766140730814006


11453it [27:06,  7.04it/s]


Epoch 1: train loss = 0.47401997447013855, val loss = 0.4486025273799896, val auc = 0.7834970444593512


## Evaluation

Our main goal is to obtain a model that
* is not much worse than the teacher model in terms of ROC AUC, and at the same time
* is much smaller in size

### ROC AUC

Let's compare the student model's ROC AUC with the teacher's.

Teacher ROC AUC: 0.802
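
For intuition about the metric being compared, ROC AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one (ties count as one half). The sketch below computes it by brute force over all pairs; `pairwise_auc` is a hypothetical helper meant for tiny inputs only, not a replacement for `roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, hence 0.75.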

In [20]:
with torch.no_grad():  # inference only, no gradients needed
    test_preds = [model(x_batch.cuda()).cpu().numpy() for x_batch, _, _ in test_dataset]
test_preds = np.concatenate(test_preds).flatten()

In [21]:
roc_auc_score(y_test, test_preds)

0.7823543453206878

### Compression rate

Let
* $a$ be the number of parameters in the original model $M$
* $a^{*}$ be the number of parameters in the compressed model $M^{*}$

then the compression rate is $$\alpha(M,M^{*}) = \frac{a}{a^{*}}$$

The compression rate can also be computed simply as the ratio of the actual model sizes.
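
The parameter-count definition can be tried on a single embedding table. The sketch below compares a hypothetical unhashed table for `c16` (one row per category, which is an assumption made for illustration and not the teacher's actual layout) with the two hashed tables that `HashEmbedding` would allocate for the same feature at a limit of 5000. The helper names are made up; `calc_emb_dim` repeats the rule defined earlier in the notebook.

```python
def calc_emb_dim(vocab_size):
    # Same rule as in the notebook
    return int(4 * vocab_size ** 0.25)


def full_table_params(vocab_size):
    """Parameters of one embedding table with a row per category."""
    return vocab_size * calc_emb_dim(vocab_size)


def hashed_table_params(vocab_size, emb_vocab_size):
    """Parameters of the remainder table plus, if needed, the quotient table."""
    rows1 = min(vocab_size, emb_vocab_size)
    params = rows1 * calc_emb_dim(rows1)
    if vocab_size > emb_vocab_size:
        rows2 = vocab_size // emb_vocab_size + 1
        params += rows2 * calc_emb_dim(rows2)
    return params


a = full_table_params(1130758)
a_star = hashed_table_params(1130758, 5000)
alpha = a / a_star
print(a, a_star, alpha)
```

This covers one feature only; the rate for a whole model would sum the parameters of every layer on both sides before dividing.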

Teacher model size: 168 MB


In [22]:
from pathlib import Path

MODEL_PATH = 'model.pt'

torch.save(model.state_dict(), MODEL_PATH)
model_size = Path(MODEL_PATH).stat().st_size / (2 ** 10 * 2 ** 10)
model_size

8.856290817260742

In [23]:
compression_rate = 168 / model_size
compression_rate

18.969566770839457