## Salary prediction, episode II: make it actually work (4 points)

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)

You can use either __pytorch__ or __tensorflow__ or any other framework (e.g. pure __keras__). Feel free to adapt the seminar code for your needs. For tensorflow version, consider `seminar_tf2.ipynb` as a starting point.


In [21]:
import torch 
import torch.nn as nn
import torch.nn.functional as F

from nltk.tokenize import WordPunctTokenizer

import numpy as np
import pandas as pd

In [None]:
from collections import Counter

### Prepare data

In [None]:
data = pd.read_csv("./Train_rev1.zip", compression='zip', index_col=None)
data.shape

text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
TARGET_COLUMN = "Log1pSalary"


tokenizer = WordPunctTokenizer()


def tokenize(x):
    if not isinstance(x, str):
        return ''
    
    x = x.lower()
    x = tokenizer.tokenize(x)
    return ' '.join(x)


def get_tokens(data, min_count: int = 10):
    token_counts = Counter()
    
    for line in data['FullDescription']:
        token_counts.update(line.split(' '))
        
    for line in data['Title']:
        token_counts.update(line.split(' '))

    tokens = sorted(t for t, c in token_counts.items() if c >= min_count)
    tokens = ["UNK", "PAD"] + tokens
    return tokens


def prepare_data(data: pd.DataFrame):
    data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')
    data[categorical_columns] = data[categorical_columns].fillna('NaN')

    data["FullDescription"] = data["FullDescription"].apply(lambda x: tokenize(x))
    data["Title"] = data["Title"].apply(lambda x: tokenize(x))

    tokens = get_tokens(data)
    token_to_id = {token : id_ for id_, token in enumerate(tokens)}
    # return the results so the caller can actually use them
    return data, tokens, token_to_id


data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')

text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
TARGET_COLUMN = "Log1pSalary"

data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast missing values to string "NaN"

data.sample(3)

import nltk

tokenizer = nltk.tokenize.WordPunctTokenizer()

def tokenize(x):
    if not isinstance(x, str):
        return ''
    
    x = x.lower()
    x = tokenizer.tokenize(x)
    return ' '.join(x)

data["FullDescription"] = data["FullDescription"].apply(lambda x: tokenize(x))
data["Title"] = data["Title"].apply(lambda x: tokenize(x))

print("Tokenized:")
print(data["FullDescription"][2::100000])
assert data["FullDescription"][2][:50] == 'mathematical modeller / simulation analyst / opera'
assert data["Title"][54321] == 'international digital account manager ( german )'

from collections import Counter
token_counts = Counter()


for line in data['FullDescription']:
    token_counts.update(line.split(' '))
    
for line in data['Title']:
    token_counts.update(line.split(' '))

min_count = 10

# tokens from token_counts keys that had at least min_count occurrences throughout the dataset
tokens = sorted(t for t, c in token_counts.items() if c >= min_count)

# Add special tokens for unknown and empty words
UNK, PAD = "UNK", "PAD"
tokens = [UNK, PAD] + tokens

token_to_id = {token : id_ for id_, token in enumerate(tokens)}

UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])

def as_matrix(sequences, max_len=None):
    """ Convert a list of tokens into a matrix with padding """
    if isinstance(sequences[0], str):
        sequences = list(map(str.split, sequences))
        
    max_len = min(max(map(len, sequences)), max_len or float('inf'))
    
    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))
    for i,seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    
    return matrix

from sklearn.feature_extraction import DictVectorizer

# we only consider top-1k most frequent companies to minimize memory usage
top_companies, top_counts = zip(*Counter(data['Company']).most_common(1000))
recognized_companies = set(top_companies)
data["Company"] = data["Company"].apply(lambda comp: comp if comp in recognized_companies else "Other")

categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))

from sklearn.model_selection import train_test_split

data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)
data_train.index = range(len(data_train))
data_val.index = range(len(data_val))

print("Train size = ", len(data_train))
print("Validation size = ", len(data_val))

import torch
import torch.nn as nn
import torch.nn.functional as F


device = 'cuda' if torch.cuda.is_available() else 'cpu'


def to_tensors(batch, device):
    batch_tensors = dict()
    for key, arr in batch.items():
        if key in ["FullDescription", "Title"]:
            batch_tensors[key] = torch.tensor(arr, device=device, dtype=torch.int64)
        else:
            batch_tensors[key] = torch.tensor(arr, device=device)
    return batch_tensors


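def make_batch(data, max_len=None, word_dropout=0, device=device):
    """
    Creates a dict of torch tensors from the batch data.
    :param word_dropout: replaces token index with UNK_IX with this probability
    :returns: a dict with {'Title' : int64[batch, title_max_len], 'FullDescription' : int64[batch, descr_max_len],
        'Categorical' : float32[batch, n_cat_features]} plus TARGET_COLUMN : float32[batch] if it is present in data
    """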
    batch = {}
    batch["Title"] = as_matrix(data["Title"].values, max_len)
    batch["FullDescription"] = as_matrix(data["FullDescription"].values, max_len)
    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))
    
    if word_dropout != 0:
        batch["FullDescription"] = apply_word_dropout(batch["FullDescription"], 1. - word_dropout)
    
    if TARGET_COLUMN in data.columns:
        batch[TARGET_COLUMN] = data[TARGET_COLUMN].values
    
    return to_tensors(batch, device)

def apply_word_dropout(matrix, keep_prob, replace_with=UNK_IX, pad_ix=PAD_IX):
    # 1 marks a position to be replaced; it is drawn with probability 1 - keep_prob
    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prob, 1 - keep_prob])
    dropout_mask &= matrix != pad_ix  # never replace padding
    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])

### Salary prediction model

In [22]:
class MaxOverTimePooling(nn.Module):
    def forward(self, x, mask):
        x = torch.max(x, dim=1).values
        return x


class AvgOverTimePooling(nn.Module):
    def forward(self, x, mask):
        norm_constant = mask.sum(dim=1)
        out = torch.sum(x * mask, dim=1) / norm_constant
        return out


class SoftmaxPooling(nn.Module):
    def forward(self, x, mask):
        weights = torch.softmax(x, dim=1)
        out = torch.sum(x * weights, dim=1)
        return out


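class AttentivePooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()  # required before registering submodules
        self.attention_layer = nn.Linear(dim, 1, bias=False)

    def forward(self, x, mask):
        # x: [batch, time, dim]; attention: one score per time step, [batch, time, 1]
        attention = self.attention_layer(x)
        # normalize the attention scores (not the features) over the time axis
        weights = torch.softmax(attention, dim=1)
        out = torch.sum(x * weights, dim=1)
        return out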

In [23]:
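class ConvLayer(nn.Module):
    def __init__(self, dim, kernel_size, n_conv_layers: int, use_bn: bool = True):
        super().__init__()  # required before registering submodules
        self.dropout = nn.Dropout()
        self.conv_layers = nn.ModuleList()

        for _ in range(n_conv_layers):
            # padding='same' keeps the sequence length, so the residual sum in forward is valid
            self.conv_layers.append(nn.Conv1d(dim, dim, kernel_size=kernel_size, padding='same'))

        self.norm = nn.BatchNorm1d(dim) if use_bn else nn.LayerNorm(dim)
        self.activation = nn.GELU()

    def forward(self, x):
        # x: [batch, dim, time]
        residual = x
        x = self.dropout(x)
        if isinstance(self.norm, nn.LayerNorm):
            # LayerNorm normalizes the last axis, so move channels there and back
            x = self.norm(x.transpose(1, 2)).transpose(1, 2)
        else:
            x = self.norm(x)

        # The loop body was left empty in the draft; conv + activation with a
        # residual connection is an assumed completion, not the only option.
        for layer in self.conv_layers:
            x = self.activation(layer(x))

        return x + residual
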

class TextEncoder(nn.Module):
    def __init__(self, ch_out, kernel, emb_size: int, emb_dim: int, ):
        super().__init__()
        self.embeddings = nn.Embedding(emb_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, ch_out, kernel)
        self.bn = nn.BatchNorm1d(ch_out)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool1d(1)
        
    def forward(self, x):
        emb = self.embeddings(x)
        emb = emb.moveaxis(1, 2)
        h = self.relu(self.pool(self.bn(self.conv(emb))))
        return h.squeeze(2)

In [None]:
class SalaryPredictor(nn.Module):
    def __init__(self, n_tokens, n_cat_features, hid_size=64):
        super().__init__()
        
        # use the constructor arguments instead of hard-coded vocabulary / feature sizes
        self.title_encoder = TextEncoder(128, 3, n_tokens, 64)
        self.description_encoder = TextEncoder(128, 3, n_tokens, 64)
        self.cat_encoder = nn.Sequential(
            nn.Linear(n_cat_features, 1024),
            nn.ReLU(),
            nn.Linear(1024, 128)
        )
        self.final_proj = nn.Linear(128 * 3, 1)
        
        
    def forward(self, batch):
        h_title = self.title_encoder(batch['Title'])
        h_descr = self.description_encoder(batch['FullDescription'])
        h_cat = self.cat_encoder(batch['Categorical'])

        h = torch.cat([h_title, h_descr, h_cat], dim=1)
        out = self.final_proj(h)

        return out.squeeze(-1)  # drop only the last axis so a batch of size 1 stays 1-D

### A short report

Please tell us what you did and how it worked.

`<YOUR_TEXT_HERE>`, i guess...

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several `nn.Conv1d` layers to the same embeddings and concatenate their output channels.
* More layers, more neurons, ya know...
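
As a rough illustration of the parallel-convolution idea above, here is a minimal PyTorch sketch. The class name `ParallelConvEncoder`, the kernel sizes, the channel counts and the dropout rate are all assumptions made for the example, not part of the seminar code.

```python
import torch
import torch.nn as nn


class ParallelConvEncoder(nn.Module):
    """Hypothetical encoder: several Conv1d branches applied to the same embeddings."""

    def __init__(self, n_tokens, emb_dim=64, ch_out=64, kernel_sizes=(2, 3, 5), dropout=0.2):
        super().__init__()
        self.embeddings = nn.Embedding(n_tokens, emb_dim)
        # one Conv1d -> BatchNorm -> ReLU branch per kernel size
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, ch_out, kernel_size=k, padding=k // 2),
                nn.BatchNorm1d(ch_out),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ix):
        # token_ix: int64 [batch, time] -> emb: [batch, emb_dim, time]
        emb = self.embeddings(token_ix).transpose(1, 2)
        # max over time in every branch, then concatenate along the channel axis
        pooled = [branch(emb).max(dim=2).values for branch in self.branches]
        return self.dropout(torch.cat(pooled, dim=1))


encoder = ParallelConvEncoder(n_tokens=1000)
features = encoder(torch.randint(0, 1000, (4, 12)))
print(features.shape)  # [batch, ch_out * len(kernel_sizes)]
```

Each branch sees n-grams of a different width, so the concatenated vector mixes short and long patterns. The output size grows with the number of branches, which the layer that consumes it has to account for.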


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time (independently for each feature)
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The optimal score is usually achieved by concatenating several different poolings, including several attentive poolings with different $NN_{attn}$ (aka multi-headed attention).

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)
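
In PyTorch these poolings take only a few lines each. The sketch below is one possible way to write softmax and attentive pooling so that PAD positions get zero weight. It assumes activations shaped `[batch, time, dim]` and `PAD_IX = 1` as in the vocabulary above; the names `MaskedAttentivePooling` and `masked_softmax_pooling` are made up for the example.

```python
import torch
import torch.nn as nn


class MaskedAttentivePooling(nn.Module):
    """Attentive pooling over time that ignores PAD positions (illustrative sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.attention_layer = nn.Linear(dim, 1, bias=False)  # plays the role of NN_attn

    def forward(self, x, mask):
        # x: float [batch, time, dim]; mask: bool [batch, time], True for real tokens
        scores = self.attention_layer(x).squeeze(-1)          # [batch, time]
        scores = scores.masked_fill(~mask, -1e9)              # PAD gets ~zero weight
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)  # [batch, time, 1]
        return (x * weights).sum(dim=1)                       # [batch, dim]


def masked_softmax_pooling(x, mask):
    # softmax over time, independently for each feature, with PAD excluded
    scores = x.masked_fill(~mask.unsqueeze(-1), -1e9)
    weights = torch.softmax(scores, dim=1)
    return (x * weights).sum(dim=1)


PAD_IX = 1  # assumption: same index as in the vocabulary built above
token_ix = torch.tensor([[5, 7, 9, PAD_IX], [4, PAD_IX, PAD_IX, PAD_IX]])
mask = token_ix != PAD_IX
x = torch.randn(2, 4, 8)

pool = MaskedAttentivePooling(8)
out = torch.cat([pool(x, mask), masked_softmax_pooling(x, mask)], dim=1)
print(out.shape)  # two poolings concatenated along the feature axis
```

A large negative fill value is used instead of `-inf` so that a row consisting only of PAD yields a uniform average rather than NaN. Several `MaskedAttentivePooling` instances with separate weights can be concatenated the same way to get the multi-headed variant.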

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here are a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix for both the title and description encoders
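
For the PyTorch route, a possible way to initialize an embedding layer from pre-trained vectors is sketched below. The helper `build_embedding_layer` is hypothetical, and the toy dictionary stands in for real vectors; the helper only assumes that `word_vectors` supports `word in word_vectors` and `word_vectors[word]`, which the object returned by `gensim.downloader.load` should also do.

```python
import numpy as np
import torch
import torch.nn as nn


def build_embedding_layer(tokens, word_vectors, emb_dim, pad_ix=None, freeze=False):
    """
    tokens: vocabulary list where position == token id (like `tokens` above).
    word_vectors: mapping-like object with pre-trained vectors (assumption: supports
                  `in` and `[]`, e.g. a plain dict or gensim KeyedVectors).
    """
    rng = np.random.default_rng(0)
    # tokens without a pre-trained vector keep a small random initialization
    matrix = rng.normal(scale=0.1, size=(len(tokens), emb_dim)).astype('float32')
    n_found = 0
    for ix, token in enumerate(tokens):
        if token in word_vectors:
            matrix[ix] = word_vectors[token]
            n_found += 1
    # freeze=False lets gradient descent fine-tune the pre-trained weights
    layer = nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=freeze, padding_idx=pad_ix)
    return layer, n_found


# Toy stand-in for real pre-trained vectors
toy_tokens = ["UNK", "PAD", "python", "developer"]
toy_vectors = {"python": np.ones(4, dtype='float32'), "developer": np.zeros(4, dtype='float32')}

shared_emb, n_found = build_embedding_layer(toy_tokens, toy_vectors, emb_dim=4, pad_ix=1)
print(n_found, "of", len(toy_tokens), "tokens initialized from pre-trained vectors")
```

Passing the same `shared_emb` module to both the title and the description encoder is enough to share the embedding matrix, since both encoders then update the same parameters.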


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification either. With some tricks, of course.

* As with convolutional layers, LSTM outputs should be pooled into a fixed-size vector with one of the poolings.
* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
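
A bidirectional LSTM followed by pooling could look roughly like the sketch below. The class name `BiLSTMEncoder`, the layer sizes and `pad_ix=1` are assumptions for illustration; `nn.LSTM(..., bidirectional=True)` already runs both directions and concatenates their outputs along the last axis.

```python
import torch
import torch.nn as nn


class BiLSTMEncoder(nn.Module):
    """Hypothetical text encoder: embeddings -> bidirectional LSTM -> average over time."""

    def __init__(self, n_tokens, emb_dim=64, hid_size=64, pad_ix=1):
        super().__init__()
        self.pad_ix = pad_ix
        self.embeddings = nn.Embedding(n_tokens, emb_dim, padding_idx=pad_ix)
        self.lstm = nn.LSTM(emb_dim, hid_size, batch_first=True, bidirectional=True)

    def forward(self, token_ix):
        # token_ix: int64 [batch, time]
        states, _ = self.lstm(self.embeddings(token_ix))   # [batch, time, 2 * hid_size]
        mask = (token_ix != self.pad_ix).unsqueeze(-1).to(states.dtype)
        # average over time, excluding PAD; clamp avoids division by zero on empty rows
        return (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


encoder = BiLSTMEncoder(n_tokens=100)
token_ix = torch.tensor([[5, 6, 7, 1], [8, 1, 1, 1], [9, 10, 1, 1]])
out = encoder(token_ix)
print(out.shape)  # [batch, 2 * hid_size]
```

With right-padded batches the right-to-left LSTM reads the PAD tokens first, which slightly changes its states. `torch.nn.utils.rnn.pack_padded_sequence` is one way to avoid that if it turns out to matter.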


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at the [early stopping callback (keras)](https://keras.io/callbacks/#earlystopping) or the one in [pytorch (lightning)](https://pytorch-lightning.readthedocs.io/en/latest/common/early_stopping.html).
  * In short, train until you notice that the validation error has stopped improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
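
Without a callback, early stopping can be written by hand in plain PyTorch. The loop below is a minimal sketch on synthetic data: the linear model, the `patience` value and the random tensors are placeholders, and a real run would iterate over batches from `make_batch` instead.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data (assumption): replace with real training / validation batches
X_train, X_val = torch.randn(256, 10), torch.randn(64, 10)
true_w = torch.randn(10)
y_train, y_val = X_train @ true_w, X_val @ true_w

model = nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=0.05)

patience, max_epochs = 5, 200
best_mae, best_state, epochs_without_improvement = float('inf'), None, 0
val_mae_history = []  # one value per epoch, ready for a learning-curve plot

for epoch in range(max_epochs):
    model.train()
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X_train).squeeze(-1), y_train)
    loss.backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_mae = (model(X_val).squeeze(-1) - y_val).abs().mean().item()
    val_mae_history.append(val_mae)

    if val_mae < best_mae:
        best_mae, epochs_without_improvement = val_mae, 0
        best_state = copy.deepcopy(model.state_dict())  # best-on-validation snapshot
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation MAE has not improved for `patience` epochs

model.load_state_dict(best_state)  # restore the best snapshot before evaluation
print(f"stopped after {len(val_mae_history)} epochs, best val MAE = {best_mae:.4f}")
```

Here the snapshot lives in memory; `torch.save(model.state_dict(), path)` would write it to disk instead. `val_mae_history` is the series to plot as the MAE/epoch learning curve.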
  
Good luck! And may the force be with you!