**Salary prediction, episode II: make it actually work (4 points)**

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)

You can use either __pytorch__ or __tensorflow__ or any other framework (e.g. pure __keras__). Feel free to adapt the seminar code for your needs. For tensorflow version, consider `seminar_tf2.ipynb` as a starting point.


# Preparations

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
!wget https://ysda-seminars.s3.eu-central-1.amazonaws.com/Train_rev1.zip
!unzip Train_rev1.zip
data = pd.read_csv("./Train_rev1.csv", index_col=None)
data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')

--2022-11-01 12:45:29--  https://ysda-seminars.s3.eu-central-1.amazonaws.com/Train_rev1.zip
Resolving ysda-seminars.s3.eu-central-1.amazonaws.com (ysda-seminars.s3.eu-central-1.amazonaws.com)... 52.219.171.58
Connecting to ysda-seminars.s3.eu-central-1.amazonaws.com (ysda-seminars.s3.eu-central-1.amazonaws.com)|52.219.171.58|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 128356352 (122M) [application/zip]
Saving to: ‘Train_rev1.zip’


2022-11-01 12:45:40 (12.2 MB/s) - ‘Train_rev1.zip’ saved [128356352/128356352]

Archive:  Train_rev1.zip
  inflating: Train_rev1.csv          
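The target is `log1p(SalaryNormalized)`, so an absolute error in this space behaves roughly like a relative error in the original salary. A minimal sketch of the round trip on three toy salaries (the variable names here are illustrative and not part of the notebook):

```python
import numpy as np

# Toy salaries; any positive values would do.
salary = np.array([22500.0, 32520.0, 26000.0])

# Forward transform used for the training target.
log_salary = np.log1p(salary).astype("float32")

# Inverse transform turns a log-space value back into a salary.
restored = np.expm1(log_salary.astype("float64"))

# A constant log-space error of 0.2 multiplies (salary + 1) by exp(0.2).
shifted = np.expm1(log_salary.astype("float64") + 0.2)
ratio = (shifted + 1.0) / (restored + 1.0)

print(restored)
print(ratio)
```

Read this way, a validation MAE of about 0.2 in log space means predictions are typically off by a factor of about `exp(0.2)`, i.e. roughly 20%.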


In [3]:
text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
TARGET_COLUMN = "Log1pSalary"

data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast missing values to string "NaN"

data.sample(3)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName,Log1pSalary
52707,68671091,Executive Secretary/PA,"A well organised, experienced Executive Assist...",Leeds West Yorkshire Yorkshire,Leeds,,contract,Carlisle Managed Solutions,Accounting & Finance Jobs,22000 - 23000 per annum,22500,totaljobs.com,10.021315
76234,69006472,Teachers required for Primary Schools,Desperately Seeking Talented Primary Teachers ...,Birmingham,Birmingham,,contract,PK Education,Teaching Jobs,550 - 805/week,32520,cv-library.co.uk,10.389642
72594,68849217,Marketing & Brand Planner,"Our client, a leading business within its fiel...","Northampton,Northamptonshire",UK,,contract,one2one Recruitment,"PR, Advertising & Marketing Jobs","25,000 - 27,000 PA",26000,jobstoday.co.uk,10.165891


In [4]:
import nltk

tokenizer = nltk.tokenize.WordPunctTokenizer()
data["FullDescription"] = ([' '.join(tokenizer.tokenize(x.lower())) for x in data["FullDescription"]])
data["Title"] = ([' '.join(tokenizer.tokenize(str(x).lower())) for x in data["Title"]])

In [5]:
from collections import Counter
token_counts = Counter()

for row in data['FullDescription'].values:
    token_counts.update(row.split())
for row in data['Title'].values:
    token_counts.update(row.split())

In [6]:
min_count = 10

# tokens from token_counts keys that had at least min_count occurrences throughout the dataset
tokens = sorted(t for t, c in token_counts.items() if c >= min_count)

# Add special tokens for unknown and empty (padding) words
UNK, PAD = "UNK", "PAD"
tokens = [UNK, PAD] + tokens

token_to_id = dict(map(reversed, enumerate(tokens)))
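To make the vocabulary step concrete, here is a small sketch on a toy corpus. The `toy_` names and the threshold of 2 are assumptions for illustration; the notebook itself uses `min_count = 10` on the real columns.

```python
from collections import Counter

# Toy corpus standing in for the tokenized Title / FullDescription columns.
toy_corpus = ["senior python developer", "junior python developer", "python tester"]

toy_counts = Counter()
for line in toy_corpus:
    toy_counts.update(line.split())

toy_min_count = 2  # toy threshold; the notebook uses 10
UNK, PAD = "UNK", "PAD"
toy_vocab = [UNK, PAD] + sorted(t for t, c in toy_counts.items() if c >= toy_min_count)
toy_token_to_id = {token: ix for ix, token in enumerate(toy_vocab)}

# Rare or unseen words fall back to the UNK index.
toy_ids = [toy_token_to_id.get(w, toy_token_to_id[UNK])
           for w in "senior python tester".split()]
print(toy_vocab)
print(toy_ids)
```

Words seen fewer than `toy_min_count` times ("senior", "tester") never enter the vocabulary, so they share the `UNK` id at lookup time.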

In [7]:
UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])

def as_matrix(sequences, max_len=None):
    """ Convert a list of tokens into a matrix with padding """
    if isinstance(sequences[0], str):
        sequences = list(map(str.split, sequences))
        
    max_len = min(max(map(len, sequences)), max_len or float('inf'))
    
    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))
    for i,seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    
    return matrix
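A quick way to check the padding logic is to run it on a handful of strings. The sketch below re-implements the same idea on a toy vocabulary so that it runs on its own; `toy_as_matrix` and `toy_token_to_id` are hypothetical stand-ins, not notebook objects.

```python
import numpy as np

toy_token_to_id = {"UNK": 0, "PAD": 1, "developer": 2, "python": 3}
UNK_IX, PAD_IX = toy_token_to_id["UNK"], toy_token_to_id["PAD"]

def toy_as_matrix(lines, max_len=None):
    """Same idea as as_matrix above, restricted to whitespace-separated strings."""
    seqs = [line.split() for line in lines]
    longest = max(map(len, seqs))
    width = longest if max_len is None else min(longest, max_len)
    matrix = np.full((len(seqs), width), PAD_IX, dtype=np.int32)
    for i, seq in enumerate(seqs):
        row = [toy_token_to_id.get(w, UNK_IX) for w in seq[:width]]
        matrix[i, :len(row)] = row
    return matrix

full = toy_as_matrix(["python developer", "senior python developer wanted", "python"])
cropped = toy_as_matrix(["senior python developer wanted"], max_len=2)
print(full)
print(cropped)
```

Short rows are right-padded with `PAD_IX`, unknown words become `UNK_IX`, and `max_len` crops long rows from the right.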

In [8]:
from sklearn.feature_extraction import DictVectorizer

# we only consider top-1k most frequent companies to minimize memory usage
top_companies, top_counts = zip(*Counter(data['Company']).most_common(1000))
recognized_companies = set(top_companies)
data["Company"] = data["Company"].apply(lambda comp: comp if comp in recognized_companies else "Other")

categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))

DictVectorizer(dtype=<class 'numpy.float32'>, sparse=False)

In [9]:
from sklearn.model_selection import train_test_split

data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)
data_train.index = range(len(data_train))
data_val.index = range(len(data_val))

print("Train size = ", len(data_train))
print("Validation size = ", len(data_val))

Train size =  195814
Validation size =  48954


In [10]:
import torch

target_column = 'Log1pSalary'

def to_tensors(batch, device):
    batch_tensors = dict()
    for key, arr in batch.items():
        if key in ["FullDescription", "Title"]:
            batch_tensors[key] = torch.tensor(arr, device=device, dtype=torch.int64)
        else:
            batch_tensors[key] = torch.tensor(arr, device=device)
    return batch_tensors

def make_batch(data, max_len=None, word_dropout=0, device=torch.device('cuda')):
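    """
    Creates a dict of torch tensors from a slice of the dataframe.
    :param word_dropout: replaces token index with UNK_IX with this probability
    :returns: a dict with {'Title': int64[batch, title_max_len], 'FullDescription': int64[batch, desc_max_len],
              'Categorical': float32[batch, n_cat_features]}, plus the target column when it is present
    """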
    batch = {}
    batch["Title"] = as_matrix(data["Title"].values, max_len)
    batch["FullDescription"] = as_matrix(data["FullDescription"].values, max_len)
    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))
    
    if word_dropout != 0:
        batch["FullDescription"] = apply_word_dropout(batch["FullDescription"], 1. - word_dropout)
    
    if target_column in data.columns:
        batch[target_column] = data[target_column].values
    
    return to_tensors(batch, device)

def apply_word_dropout(matrix, keep_prob, replace_with=UNK_IX, pad_ix=PAD_IX):
    # mask == 1 marks tokens to replace; it is drawn with probability 1 - keep_prob
    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prob, 1 - keep_prob])
    dropout_mask &= matrix != pad_ix  # never replace padding
    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])
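Word dropout can be sanity-checked on a tiny batch. The sketch below is an equivalent formulation with `np.where` and a seeded generator, written so that it runs on its own. The name `toy_word_dropout`, the seed, and the hard-coded indices are assumptions for illustration.

```python
import numpy as np

UNK_IX, PAD_IX = 0, 1  # same convention as the vocabulary above
rng = np.random.default_rng(0)  # seeded generator, added here for reproducibility

def toy_word_dropout(matrix, drop_prob, rng):
    """Replace each non-PAD token with UNK_IX with probability drop_prob."""
    drop = rng.random(matrix.shape) < drop_prob
    drop &= matrix != PAD_IX  # never touch padding
    return np.where(drop, UNK_IX, matrix)

batch = np.array([[5, 6, 7, 1, 1],
                  [8, 9, 1, 1, 1]])
noisy = toy_word_dropout(batch, drop_prob=0.5, rng=rng)
print(noisy)
```

Note that `make_batch` applies word dropout only to `FullDescription`; titles are left intact.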

In [11]:
import tqdm.notebook  # binds `tqdm` and makes sure the tqdm.notebook submodule used below is loaded

BATCH_SIZE = 256
EPOCHS = 10
DEVICE = torch.device('cuda')

In [12]:
def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, device=torch.device('cuda'), **kwargs):
    """ iterates minibatches of data in random order """
    while True:
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)

        for start in range(0, len(indices), batch_size):
            batch = make_batch(data.iloc[indices[start : start + batch_size]], device=device, **kwargs)
            yield batch
        
        if not cycle: break

def print_metrics(model, data, batch_size=BATCH_SIZE, name="", **kw):
    squared_error = abs_error = num_samples = 0.0
    model.eval()
    with torch.no_grad():
        for batch in iterate_minibatches(data, batch_size=batch_size, shuffle=False, **kw):
            batch_pred = model(batch)
            squared_error += torch.sum(torch.square(batch_pred - batch[TARGET_COLUMN]))
            abs_error += torch.sum(torch.abs(batch_pred - batch[TARGET_COLUMN]))
            num_samples += len(batch_pred)
    mse = squared_error.detach().cpu().numpy() / num_samples
    mae = abs_error.detach().cpu().numpy() / num_samples
    print("%s results:" % (name or ""))
    print("Mean square error: %.5f" % mse)
    print("Mean absolute error: %.5f" % mae)
    return mse, mae

# CNN with regular pooling

In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalMaxPooling(nn.Module):
    def __init__(self, dim=-1):
        super(self.__class__, self).__init__()
        self.dim = dim
        
    def forward(self, x):
        return x.max(dim=self.dim)[0]

class TitleEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64):
        """ 
        A simple sequential encoder for titles.
        x -> emb -> conv -> global_max -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding(n_tokens, 64, padding_idx=PAD_IX)
        self.conv1 = nn.Conv1d(64, out_size, kernel_size=3, padding=1)
        self.pool1 = GlobalMaxPooling()  #Softmax_pooling()       
        self.dense = nn.Linear(out_size, out_size)

    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        h = torch.transpose(h, 1, 2)
        
        # Apply the layers as defined above. Add some ReLUs before dense.
        h = self.conv1(h)
        h = self.pool1(h)
        h = h.relu()
        h = self.dense(h)
        
        return h

# Define an encoder for job descriptions.
# Use any means you want so long as it's torch.nn.Module.
class descriptionsEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64):
        """ 
        A simple sequential encoder for descriptions.
        x -> emb -> conv -> global_max -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding(n_tokens, 64, padding_idx=PAD_IX)
        self.conv1 = nn.Conv1d(64, out_size, kernel_size=3, padding=1)
        self.pool1 = GlobalMaxPooling()  #Softmax_pooling()    
        self.dense = nn.Linear(out_size, out_size)

    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        h = torch.transpose(h, 1, 2)
        
        # Apply the layers as defined above. Add some ReLUs before dense.
        h = self.conv1(h)
        h = self.pool1(h)
        h = h.relu()
        h = self.dense(h)
        
        return h

class SalaryPredictor(nn.Module):
    """
    This class does all the steps from (title, desc, categorical) features -> predicted target
    It unites the title & desc encoders defined above, as well as some layers for the head and the categorical branch.
    """
    
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_)):
        super(self.__class__, self).__init__()
        
        self.title_encoder = TitleEncoder(out_size=64)
        self.desc_encoder = descriptionsEncoder(out_size=64)
        
        # define layers for categorical features. A few dense layers would do.
        self.layers = nn.Sequential(nn.Linear(n_cat_features, 192),
                                    nn.BatchNorm1d(192),
                                    nn.ReLU(),
                                    #nn.Dropout(0.3),
                                    nn.Linear(192, 132),
                                    nn.BatchNorm1d(132),
                                    nn.ReLU(),
                                    nn.Linear(132, 64))
        
        # define "output" layers that combine the three encoded vectors into the answer
        self.output = nn.Sequential(nn.Linear(192, 64),
                                    nn.BatchNorm1d(64),
                                    nn.ReLU(),
                                    #nn.Dropout(0.3),
                                    nn.Linear(64, 32),
                                    nn.BatchNorm1d(32),
                                    nn.ReLU(),
                                    nn.Linear(32, 1)
                                   )
    def forward(self, batch):
        """
        :param batch: dict produced by make_batch with keys
            'Title': int64 tensor [batch, title_len], job titles encoded by as_matrix
            'FullDescription': int64 tensor [batch, desc_len], job descriptions encoded by as_matrix
            'Categorical': float32 tensor [batch, n_cat_features]
        :returns: float32 tensor 1d [batch], predicted log1p-salary
        """
        
        title_ix = batch['Title']
        desc_ix = batch['FullDescription']
        cat_features = batch['Categorical']
        
        # process each data source with its respective encoder
        title_h = self.title_encoder(title_ix)
        desc_h = self.desc_encoder(desc_ix)
        
        # apply categorical encoder
        cat_h = self.layers(cat_features)
        
        # concatenate all vectors together...
        joint_h = torch.cat([title_h, desc_h, cat_h], dim=1)
        
        # ... and stack a few more layers at the top
        output = self.output(joint_h)[:, 0]
        
        # Note 1: do not forget to select first columns, [:, 0], to get to 1d outputs
        # Note 2: please do not use output nonlinearities.
        
        return output

In [14]:
model = SalaryPredictor().to(DEVICE)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.notebook.tqdm(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val)

epoch: 0
 results:
Mean square error: 0.13948
Mean absolute error: 0.28643
epoch: 1
 results:
Mean square error: 0.10888
Mean absolute error: 0.24948
epoch: 2
 results:
Mean square error: 0.10252
Mean absolute error: 0.24229
epoch: 3
 results:
Mean square error: 0.09540
Mean absolute error: 0.23355
epoch: 4
 results:
Mean square error: 0.09327
Mean absolute error: 0.23095
epoch: 5
 results:
Mean square error: 0.08920
Mean absolute error: 0.22440
epoch: 6
 results:
Mean square error: 0.08479
Mean absolute error: 0.21790
epoch: 7
 results:
Mean square error: 0.08530
Mean absolute error: 0.21867
epoch: 8
 results:
Mean square error: 0.08142
Mean absolute error: 0.21276
epoch: 9
 results:
Mean square error: 0.08102
Mean absolute error: 0.21224
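
One detail of the training loop above is worth keeping in mind before changing `BATCH_SIZE` or the loss: `nn.MSELoss(reduction='sum')` sums the squared errors over the batch, so for plain SGD each step equals a step on the mean-reduced loss with the learning rate multiplied by the batch size $B$. A short derivation, with $f_\theta$ denoting the model and $\eta$ the learning rate (notation introduced here for illustration):

```latex
\begin{aligned}
L_{\text{sum}}(\theta) &= \sum_{i=1}^{B} \bigl(f_\theta(x_i) - y_i\bigr)^2 = B \, L_{\text{mean}}(\theta), \\
\theta - \eta \, \nabla_\theta L_{\text{sum}}(\theta) &= \theta - (\eta B) \, \nabla_\theta L_{\text{mean}}(\theta).
\end{aligned}
```

With $\eta = 10^{-4}$ and $B = 256$ this corresponds to an effective learning rate of about $0.0256$ on the mean-reduced loss.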


Great, the plain seminar model reaches about 0.08 validation MSE (0.212 MAE). Let's try training some more, this time with word dropout.

In [15]:
EPOCHS = 10

model = SalaryPredictor().to(DEVICE)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.notebook.tqdm(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE, word_dropout=0.1)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val)

epoch: 0
 results:
Mean square error: 0.15084
Mean absolute error: 0.30219
epoch: 1
 results:
Mean square error: 0.10661
Mean absolute error: 0.24732
epoch: 2
 results:
Mean square error: 0.09866
Mean absolute error: 0.23680
epoch: 3
 results:
Mean square error: 0.09704
Mean absolute error: 0.23639
epoch: 4
 results:
Mean square error: 0.09094
Mean absolute error: 0.22708
epoch: 5
 results:
Mean square error: 0.08617
Mean absolute error: 0.21965
epoch: 6
 results:
Mean square error: 0.09526
Mean absolute error: 0.23418
epoch: 7
 results:
Mean square error: 0.08779
Mean absolute error: 0.22399
epoch: 8
 results:
Mean square error: 0.08027
Mean absolute error: 0.21229
epoch: 9
 results:
Mean square error: 0.07812
Mean absolute error: 0.20833


0.078 MSE (0.208 MAE): already better!
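
Since the task asks to compare models by their best validation MAE, the two logs above can be summarized in a few lines. The sketch below copies the per-epoch MAE values from those logs; the run names and the `best_epoch` helper are made up for illustration.

```python
# Validation MAE per epoch, copied from the two training logs above.
history = {
    "baseline": [0.28643, 0.24948, 0.24229, 0.23355, 0.23095,
                 0.22440, 0.21790, 0.21867, 0.21276, 0.21224],
    "word_dropout_0.1": [0.30219, 0.24732, 0.23680, 0.23639, 0.22708,
                         0.21965, 0.23418, 0.22399, 0.21229, 0.20833],
}

def best_epoch(maes):
    """Return (epoch index, MAE) of the lowest validation MAE in a run."""
    epoch = min(range(len(maes)), key=maes.__getitem__)
    return epoch, maes[epoch]

summary = {name: best_epoch(maes) for name, maes in history.items()}
for name, (epoch, mae) in sorted(summary.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: best MAE {mae:.5f} at epoch {epoch}")
```

Passing each list to `plt.plot` gives the MAE/epoch learning curves requested in the task. Both runs are still improving at the last epoch, and the gap of about 0.004 MAE comes from a single run each, so it may partly reflect seed noise.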

Let's also add Dropout inside the network.

In [16]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TitleEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64):
        """ 
        A simple sequential encoder for titles.
        x -> emb -> conv -> global_max -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding(n_tokens, 64, padding_idx=PAD_IX)
        self.conv1 = nn.Conv1d(64, out_size, kernel_size=3, padding=1)
        self.pool1 = GlobalMaxPooling()  #Softmax_pooling()       
        self.dense = nn.Linear(out_size, out_size)

    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        h = torch.transpose(h, 1, 2)
        
        # Apply the layers as defined above. Add some ReLUs before dense.
        h = self.conv1(h)
        h = self.pool1(h)
        h = h.relu()
        h = self.dense(h)
        
        return h

# Define an encoder for job descriptions.
# Use any means you want so long as it's torch.nn.Module.
class descriptionsEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64):
        """ 
        A simple sequential encoder for descriptions.
        x -> emb -> conv -> global_max -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding(n_tokens, 64, padding_idx=PAD_IX)
        self.conv1 = nn.Conv1d(64, out_size, kernel_size=3, padding=1)
        self.pool1 = GlobalMaxPooling()  #Softmax_pooling()    
        self.dense = nn.Linear(out_size, out_size)

    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        h = torch.transpose(h, 1, 2)
        
        # Apply the layers as defined above. Add some ReLUs before dense.
        h = self.conv1(h)
        h = self.pool1(h)
        h = h.relu()
        h = self.dense(h)
        
        return h

class SalaryPredictor(nn.Module):
    """
    This class does all the steps from (title, desc, categorical) features -> predicted target
    It unites the title & desc encoders defined above, as well as some layers for the head and the categorical branch.
    """
    
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_)):
        super(self.__class__, self).__init__()
        
        self.title_encoder = TitleEncoder(out_size=64)
        self.desc_encoder = descriptionsEncoder(out_size=64)
        
        # define layers for categorical features. A few dense layers would do.
        self.layers = nn.Sequential(nn.Linear(n_cat_features, 192),
                                    nn.BatchNorm1d(192),
                                    nn.ReLU(),
                                    nn.Dropout(0.3),
                                    nn.Linear(192, 132),
                                    nn.BatchNorm1d(132),
                                    nn.ReLU(),
                                    nn.Linear(132, 64))
        
        # define "output" layers that combine the three encoded vectors into the answer
        self.output = nn.Sequential(nn.Linear(192, 64),
                                    nn.BatchNorm1d(64),
                                    nn.ReLU(),
                                    nn.Dropout(0.3),
                                    nn.Linear(64, 32),
                                    nn.BatchNorm1d(32),
                                    nn.ReLU(),
                                    nn.Linear(32, 1)
                                   )
    def forward(self, batch):
        """
        :param batch: dict produced by make_batch with keys
            'Title': int64 tensor [batch, title_len], job titles encoded by as_matrix
            'FullDescription': int64 tensor [batch, desc_len], job descriptions encoded by as_matrix
            'Categorical': float32 tensor [batch, n_cat_features]
        :returns: float32 tensor 1d [batch], predicted log1p-salary
        """
        
        title_ix = batch['Title']
        desc_ix = batch['FullDescription']
        cat_features = batch['Categorical']
        
        # process each data source with its respective encoder
        title_h = self.title_encoder(title_ix)
        desc_h = self.desc_encoder(desc_ix)
        
        # apply categorical encoder
        cat_h = self.layers(cat_features)
        
        # concatenate all vectors together...
        joint_h = torch.cat([title_h, desc_h, cat_h], dim=1)
        
        # ... and stack a few more layers at the top
        output = self.output(joint_h)[:, 0]
        
        # Note 1: do not forget to select first columns, [:, 0], to get to 1d outputs
        # Note 2: please do not use output nonlinearities.
        
        return output

In [17]:
EPOCHS = 10

model = SalaryPredictor().to(DEVICE)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.notebook.tqdm(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE, word_dropout=0.1)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val)

epoch: 0
 results:
Mean square error: 0.17741
Mean absolute error: 0.32648
epoch: 1
 results:
Mean square error: 0.19413
Mean absolute error: 0.34816
epoch: 2
 results:
Mean square error: 0.12855
Mean absolute error: 0.27310
epoch: 3
 results:
Mean square error: 0.11177
Mean absolute error: 0.25148
epoch: 4
 results:
Mean square error: 0.10881
Mean absolute error: 0.24876
epoch: 5
 results:
Mean square error: 0.10505
Mean absolute error: 0.24450
epoch: 6
 results:
Mean square error: 0.09266
Mean absolute error: 0.22766
epoch: 7
 results:
Mean square error: 0.08984
Mean absolute error: 0.22359
epoch: 8
 results:
Mean square error: 0.09436
Mean absolute error: 0.23080
epoch: 9
 results:
Mean square error: 0.09016
Mean absolute error: 0.22451


Dropout in the dense layers (categorical branch and output head) gave nothing useful: validation MSE ends at 0.090, against 0.078 without it.

In [18]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalMaxPooling(nn.Module):
    def __init__(self, dim=-1):
        super(self.__class__, self).__init__()
        self.dim = dim
        
    def forward(self, x):
        return x.max(dim=self.dim)[0]

class TitleEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64):
        """ 
        A simple sequential encoder for titles.
        x -> emb -> conv -> global_max -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding(n_tokens, 64, padding_idx=PAD_IX)
        self.conv1 = nn.Conv1d(64, out_size, kernel_size=3, padding=1)
        self.pool1 = GlobalMaxPooling()  #Softmax_pooling()       
        self.dense = nn.Linear(out_size, out_size)

    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        # print(len(tokens))
        h = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        h = torch.transpose(h, 1, 2)
        
        # Apply the layers as defined above. Add some ReLUs before dense.
        h = self.conv1(h)
        h = self.pool1(h)
        h = h.relu()
        h = self.dense(h)
        
        return h

# Define an encoder for job descriptions.
# Use any means you want so long as it's torch.nn.Module.
class descriptionsEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64):
        """ 
        An encoder for descriptions with two parallel convolutions.
        x -> emb -> [conv k=3, conv k=5] -> global_max each -> concat -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding(n_tokens, 64, padding_idx=PAD_IX)
        self.conv1 = nn.Conv1d(64, out_size, kernel_size=3, padding=1)
        self.pool1 = GlobalMaxPooling()  #Softmax_pooling()   
        self.conv2 = nn.Conv1d(64, out_size, kernel_size=5, padding=1)
        self.pool2 = GlobalMaxPooling()  #Softmax_pooling()    
        self.dense = nn.Linear(out_size * 2, out_size)

    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        h = torch.transpose(h, 1, 2)
        
        # Apply the layers as defined above. Add some ReLUs before dense.
        h1 = self.conv1(h)
        h1 = self.pool1(h1)  # pool the conv output, not the raw embeddings

        h2 = self.conv2(h)
        h2 = self.pool2(h2)

        h = torch.cat([h1, h2], dim=1)
        h = h.relu()
        h = self.dense(h)
        
        return h

class SalaryPredictor(nn.Module):
    """
    This class does all the steps from (title, desc, categorical) features -> predicted target
    It unites the title & desc encoders defined above, as well as some layers for the head and the categorical branch.
    """
    
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_)):
        super(self.__class__, self).__init__()
        
        self.title_encoder = TitleEncoder(out_size=64)
        self.desc_encoder = descriptionsEncoder(out_size=64)
        
        # define layers for categorical features. A few dense layers would do.
        self.layers = nn.Sequential(nn.Linear(n_cat_features, 192),
                                    nn.BatchNorm1d(192),
                                    nn.ReLU(),
                                    nn.Linear(192, 132),
                                    nn.BatchNorm1d(132),
                                    nn.ReLU(),
                                    nn.Linear(132, 64))
        
        # define "output" layers that combine the three encoded vectors into the answer
        self.output = nn.Sequential(nn.Linear(192, 64),
                                    nn.BatchNorm1d(64),
                                    nn.ReLU(),
                                    nn.Linear(64, 32),
                                    nn.BatchNorm1d(32),
                                    nn.ReLU(),
                                    nn.Linear(32, 1)
                                   )
    def forward(self, batch):
        """
        :param batch: dict produced by make_batch with keys
            'Title': int64 tensor [batch, title_len], job titles encoded by as_matrix
            'FullDescription': int64 tensor [batch, desc_len], job descriptions encoded by as_matrix
            'Categorical': float32 tensor [batch, n_cat_features]
        :returns: float32 tensor 1d [batch], predicted log1p-salary
        """
        
        title_ix = batch['Title']
        desc_ix = batch['FullDescription']
        cat_features = batch['Categorical']
        
        # process each data source with its respective encoder
        title_h = self.title_encoder(title_ix)
        desc_h = self.desc_encoder(desc_ix)
        
        # apply categorical encoder
        cat_h = self.layers(cat_features)
        
        # concatenate all vectors together...
        joint_h = torch.cat([title_h, desc_h, cat_h], dim=1)
        
        # ... and stack a few more layers at the top
        output = self.output(joint_h)[:, 0]
        
        # Note 1: do not forget to select first columns, [:, 0], to get to 1d outputs
        # Note 2: please do not use output nonlinearities.
        
        return output

In [19]:
EPOCHS = 10

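# NOTE: the three lines below are commented out, so this loop keeps training the
# `model` left over from the previous cell instead of the two-convolution encoder
# defined above. Uncomment them to train the new architecture from scratch.
# model = SalaryPredictor().to(DEVICE)
# criterion = nn.MSELoss(reduction='sum')
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)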

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.notebook.tqdm(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE, word_dropout=0.1)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val)

epoch: 0
 results:
Mean square error: 0.08973
Mean absolute error: 0.22494
epoch: 1
 results:
Mean square error: 0.09078
Mean absolute error: 0.22667
epoch: 2
 results:
Mean square error: 0.08219
Mean absolute error: 0.21314
epoch: 3
 results:
Mean square error: 0.08725
Mean absolute error: 0.22164
epoch: 4
 results:
Mean square error: 0.08602
Mean absolute error: 0.21891
epoch: 5
 results:
Mean square error: 0.08043
Mean absolute error: 0.21060
epoch: 6
 results:
Mean square error: 0.07722
Mean absolute error: 0.20633
epoch: 7
 results:
Mean square error: 0.07889
Mean absolute error: 0.20817
epoch: 8
 results:
Mean square error: 0.07367
Mean absolute error: 0.20041
epoch: 9
 results:
Mean square error: 0.07404
Mean absolute error: 0.20138


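This architecture did not bring a clear improvement. Note that the cell above does not re-create `model` (those lines are commented out), so the numbers come from training the previous model for 10 more epochs rather than from the new two-convolution encoder.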

# CNN with pretrained embeddings

In [20]:
import gensim.downloader

w2v = gensim.downloader.load("glove-wiki-gigaword-100")
weights = torch.FloatTensor(w2v.vectors) 
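
One caveat about the cell above: the rows of `w2v.vectors` follow GloVe's own vocabulary order, while the ids produced by `token_to_id` follow the notebook's `tokens` list. For an embedding lookup to return the vector of the intended word, the matrix has to be re-indexed to match `tokens`. A minimal sketch of such a re-indexing on toy data follows. `build_embedding_matrix`, the random initialization for missing words, and the toy inputs are assumptions for illustration, not the notebook's method.

```python
import numpy as np

def build_embedding_matrix(vocab, vectors, dim, pad_ix, seed=0):
    """Stack one row per vocabulary token, in vocabulary order.

    `vectors` is anything supporting `token in vectors` and `vectors[token]`
    (a plain dict here; gensim KeyedVectors offers the same two operations).
    Tokens without a pretrained vector get a small random row, PAD gets zeros.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    found = 0
    for ix, token in enumerate(vocab):
        if token in vectors:
            matrix[ix] = vectors[token]
            found += 1
    matrix[pad_ix] = 0.0
    return matrix, found

# Toy stand-ins for `tokens` and the GloVe vectors.
toy_vocab = ["UNK", "PAD", "developer", "python"]
toy_vectors = {"python": np.ones(4, dtype=np.float32),
               "java": np.full(4, 2.0, dtype=np.float32)}

toy_weights, n_found = build_embedding_matrix(toy_vocab, toy_vectors, dim=4, pad_ix=1)
print(toy_weights.shape, n_found)
```

With the notebook's objects this would presumably be called as `build_embedding_matrix(tokens, w2v, dim=100, pad_ix=PAD_IX)`, with the result wrapped in `torch.FloatTensor` before it is passed to `nn.Embedding.from_pretrained`. Printing `n_found / len(tokens)` shows how much of the vocabulary is actually covered by the pretrained vectors.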

In [21]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalMaxPooling(nn.Module):
    def __init__(self, dim=-1):
        super(self.__class__, self).__init__()
        self.dim = dim
        
    def forward(self, x):
        return x.max(dim=self.dim)[0]

class TitleEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=100):
        """ 
        A simple sequential encoder for titles.
        x -> emb -> conv -> global_max -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding.from_pretrained(weights, padding_idx=PAD_IX)
        self.conv1 = nn.Conv1d(out_size, out_size, kernel_size=3, padding=1)
        self.pool1 = GlobalMaxPooling()
        self.dense = nn.Linear(out_size, 64)


    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        h = torch.transpose(h, 1, 2)
        
        # Apply the layers as defined above. Add some ReLUs before dense.
        h = self.conv1(h)
        h = self.pool1(h)
        h = h.relu()
        h = self.dense(h)
        
        return h

# Define an encoder for job descriptions.
# Use any means you want so long as it's torch.nn.Module.
class descriptionsEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=100):
        """ 
        A simple sequential encoder for descriptions.
        x -> emb -> conv -> global_max -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding.from_pretrained(weights, padding_idx=PAD_IX)
        self.conv1 = nn.Conv1d(out_size, out_size, kernel_size=3, padding=1)
        self.pool1 = GlobalMaxPooling()
        self.dense = nn.Linear(out_size, 64)

    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        h = torch.transpose(h, 1, 2)
        
        # Apply the layers as defined above. Add some ReLUs before dense.
        h = self.conv1(h)
        h = self.pool1(h)
        h = h.relu()
        h = self.dense(h)
        
        return h

class SalaryPredictor(nn.Module):
    """
    This class does all the steps from (title, desc, categorical) features -> predicted target
    It unites title & desc encoders you defined above as well as some layers for head and categorical branch.
    """
    
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_)):
        super(self.__class__, self).__init__()
        
        self.title_encoder = TitleEncoder()
        self.desc_encoder = descriptionsEncoder()
        
        # define layers for categorical features. A few dense layers would do.
        self.layers = nn.Sequential(nn.Linear(n_cat_features, 192),
                                    nn.BatchNorm1d(192),
                                    nn.ReLU(),
                                    #nn.Dropout(0.3),
                                    nn.Linear(192, 132),
                                    nn.BatchNorm1d(132),
                                    nn.ReLU(),
                                    nn.Linear(132, 64))
        
        # define "output" layers that combine the three encoded vectors (3 * 64 = 192 units) into the answer
        self.output = nn.Sequential(nn.Linear(192, 64),
                                    nn.BatchNorm1d(64),
                                    nn.ReLU(),
                                    #nn.Dropout(0.3),
                                    nn.Linear(64, 32),
                                    nn.BatchNorm1d(32),
                                    nn.ReLU(),
                                    nn.Linear(32, 1)
                                   )
    def forward(self, batch):
        """
        :param title_ix: int32 Variable [batch, title_len], job titles encoded by as_matrix
        :param desc_ix:  int32 Variable [batch, desc_len] , job descriptions encoded by as_matrix
        :param cat_features: float32 Variable [batch, n_cat_features]
        :returns: float32 Variable 1d [batch], predicted log1p-salary
        """
        
        title_ix = batch['Title']
        desc_ix = batch['FullDescription']
        cat_features = batch['Categorical']
        
        # process each data source with its respective encoder
        title_h = self.title_encoder(title_ix)
        desc_h = self.desc_encoder(desc_ix)
        
        # apply categorical encoder
        cat_h = self.layers(cat_features)
        
        # concatenate all vectors together...
        joint_h = torch.cat([title_h, desc_h, cat_h], dim=1)
        
        # ... and stack a few more layers at the top
        output = self.output(joint_h)[:, 0]
        
        # Note 1: do not forget to select first columns, [:, 0], to get to 1d outputs
        # Note 2: please do not use output nonlinearities.
        
        return output

In [None]:
EPOCHS = 20

model = SalaryPredictor().to(DEVICE)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.notebook.tqdm(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE, word_dropout=0.1)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val)

epoch: 0
 results:
Mean square error: 0.12147
Mean absolute error: 0.26720
epoch: 1
 results:
Mean square error: 0.10790
Mean absolute error: 0.25010
epoch: 2
 results:
Mean square error: 0.09893
Mean absolute error: 0.23807
epoch: 3
 results:
Mean square error: 0.09495
Mean absolute error: 0.23270
epoch: 4
 results:
Mean square error: 0.08630
Mean absolute error: 0.22059
epoch: 5
 results:
Mean square error: 0.08208
Mean absolute error: 0.21494
epoch: 6
 results:
Mean square error: 0.08110
Mean absolute error: 0.21293
epoch: 7
 results:
Mean square error: 0.07955
Mean absolute error: 0.21095
epoch: 8
 results:
Mean square error: 0.07640
Mean absolute error: 0.20643
epoch: 9
 results:
Mean square error: 0.07768
Mean absolute error: 0.20755
epoch: 10
 results:
Mean square error: 0.07488
Mean absolute error: 0.20358
epoch: 11
 results:
Mean square error: 0.07810
Mean absolute error: 0.21004
epoch: 12
 results:
Mean square error: 0.07335
Mean absolute error: 0.20139
epoch: 13
 results:
Mean square error: 0.07603
Mean absolute error: 0.20631
epoch: 14
 results:
Mean square error: 0.07275
Mean absolute error: 0.20038
epoch: 15
 results:
Mean square error: 0.07201
Mean absolute error: 0.19916
epoch: 16

About 0.072 validation MSE, that's better!

# LSTM

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalMaxPooling(nn.Module):
    def __init__(self, dim=-1):
        super(self.__class__, self).__init__()
        self.dim = dim
        
    def forward(self, x):
        return x.max(dim=self.dim)[0]

class TitleEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=100):
        """ 
        A simple sequential encoder for titles.
        x -> emb -> conv -> global_max -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding.from_pretrained(weights, padding_idx=PAD_IX)
        self.conv1 = nn.Conv1d(out_size, out_size, kernel_size=3, padding=1)
        self.pool1 = GlobalMaxPooling()
        self.dense = nn.Linear(out_size, 64)


    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        h = torch.transpose(h, 1, 2)
        
        # Apply the layers as defined above. Add some ReLUs before dense.
        h = self.conv1(h)
        h = self.pool1(h)
        h = h.relu()
        h = self.dense(h)
        
        return h

# Define an encoder for job descriptions.
# Use any means you want so long as it's torch.nn.Module.
class descriptionsEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=100):
        """ 
        A simple sequential encoder for descriptions.
        x -> emb -> bidirectional LSTM -> global_max -> relu -> dense
        """
        super(self.__class__, self).__init__()
        self.emb = nn.Embedding.from_pretrained(weights, padding_idx=PAD_IX)
        self.lstm1 = nn.LSTM(out_size, 50, bidirectional=True, batch_first=True)
        self.pool1 = GlobalMaxPooling()
        self.dense = nn.Linear(out_size, 64)

    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = self.emb(text_ix)

        # Run the bidirectional LSTM; with batch_first=True its output is [batch, time, 2 * hidden]
        h, _ = self.lstm1(h)

        # transpose to [batch, units, time] so that GlobalMaxPooling takes the max over time
        h = torch.transpose(h, 1, 2)
        h = self.pool1(h)
        h = h.relu()
        h = self.dense(h)
        
        return h

class SalaryPredictor(nn.Module):
    """
    This class does all the steps from (title, desc, categorical) features -> predicted target
    It unites title & desc encoders you defined above as well as some layers for head and categorical branch.
    """
    
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_)):
        super(self.__class__, self).__init__()
        
        self.title_encoder = TitleEncoder()
        self.desc_encoder = descriptionsEncoder()
        
        # define layers for categorical features. A few dense layers would do.
        self.layers = nn.Sequential(nn.Linear(n_cat_features, 192),
                                    nn.BatchNorm1d(192),
                                    nn.ReLU(),
                                    #nn.Dropout(0.3),
                                    nn.Linear(192, 132),
                                    nn.BatchNorm1d(132),
                                    nn.ReLU(),
                                    nn.Linear(132, 64))
        
        # define "output" layers that combine the three encoded vectors (3 * 64 = 192 units) into the answer
        self.output = nn.Sequential(nn.Linear(192, 64),
                                    nn.BatchNorm1d(64),
                                    nn.ReLU(),
                                    #nn.Dropout(0.3),
                                    nn.Linear(64, 32),
                                    nn.BatchNorm1d(32),
                                    nn.ReLU(),
                                    nn.Linear(32, 1)
                                   )
    def forward(self, batch):
        """
        :param title_ix: int32 Variable [batch, title_len], job titles encoded by as_matrix
        :param desc_ix:  int32 Variable [batch, desc_len] , job descriptions encoded by as_matrix
        :param cat_features: float32 Variable [batch, n_cat_features]
        :returns: float32 Variable 1d [batch], predicted log1p-salary
        """
        
        title_ix = batch['Title']
        desc_ix = batch['FullDescription']
        cat_features = batch['Categorical']
        
        # process each data source with its respective encoder
        title_h = self.title_encoder(title_ix)
        desc_h = self.desc_encoder(desc_ix)
        
        # apply categorical encoder
        cat_h = self.layers(cat_features)
        
        # concatenate all vectors together...
        joint_h = torch.cat([title_h, desc_h, cat_h], dim=1)
        
        # ... and stack a few more layers at the top
        output = self.output(joint_h)[:, 0]
        
        # Note 1: do not forget to select first columns, [:, 0], to get to 1d outputs
        # Note 2: please do not use output nonlinearities.
        
        return output

In [None]:
EPOCHS = 20

model = SalaryPredictor().to(DEVICE)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.notebook.tqdm(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE, word_dropout=0.1)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val)

# Let's try predicting something

In [None]:
model.eval()
with torch.no_grad():
    # Show a few predictions next to the true salaries (examples come from the training set)
    for i, batch in enumerate(iterate_minibatches(data_train, batch_size=1, device=DEVICE)):
        if i >= 4:
            break
        pred = model(batch)
        # The target is log1p(salary), so invert it with expm1 rather than exp
        print(f'Predicted salary {i + 1}: {torch.expm1(pred).item():.2f}')
        print(f'Actual salary {i + 1}: {torch.expm1(batch[TARGET_COLUMN]).item():.2f}')

# Not bad at all

# A short report

Please tell us what you did and how it worked.

`<YOUR_TEXT_HERE>`, i guess...

# Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several nn.Conv1d to the same embeddings and concatenate output channels.
* More layers, more neurons, ya know...
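
The parallel-convolution idea from the list above could look roughly like this. It is a sketch, not the seminar's reference solution: the class name `ParallelConvEncoder`, the kernel sizes and the layer widths are all assumptions picked for illustration.

```python
import torch
import torch.nn as nn

class ParallelConvEncoder(nn.Module):
    """Several Conv1d branches over the same embeddings, max-pooled and concatenated."""

    def __init__(self, n_tokens, emb_size=64, n_filters=32,
                 kernel_sizes=(2, 3, 5), out_size=64, dropout=0.3):
        super().__init__()
        self.emb = nn.Embedding(n_tokens, emb_size)
        # One convolution per kernel size, all reading the same embeddings
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_size, n_filters, kernel_size=k, padding=k // 2)
            for k in kernel_sizes
        ])
        joint_size = n_filters * len(kernel_sizes)
        self.bn = nn.BatchNorm1d(joint_size)
        self.dropout = nn.Dropout(dropout)
        self.dense = nn.Linear(joint_size, out_size)

    def forward(self, text_ix):
        h = self.emb(text_ix).transpose(1, 2)                      # [batch, emb_size, time]
        pooled = [conv(h).max(dim=-1)[0] for conv in self.convs]   # each [batch, n_filters]
        h = torch.cat(pooled, dim=1)                               # concatenate output channels
        h = self.dropout(torch.relu(self.bn(h)))
        return self.dense(h)

encoder = ParallelConvEncoder(n_tokens=100)
out = encoder(torch.randint(0, 100, (4, 12)))   # -> shape [4, 64]
```

Pooling each branch before concatenation sidesteps the fact that different kernel sizes produce slightly different output lengths.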


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time (independently for each feature)
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

, where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.
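
A possible PyTorch version of these poolings is sketched below. It assumes `h` has shape `[batch, units, time]` (as after the transpose in the encoders above), that `mask` is a boolean `[batch, time]` tensor such as `text_ix != PAD_IX`, and that every sequence has at least one real token. The function and class names are made up for this example.

```python
import torch
import torch.nn as nn

def masked_mean_pool(h, mask):
    """Average over time, ignoring PAD positions."""
    m = mask.unsqueeze(1).to(h.dtype)                        # [batch, 1, time]
    return (h * m).sum(dim=-1) / m.sum(dim=-1).clamp(min=1.0)

def softmax_pool(h, mask):
    """Weights are a softmax of the features themselves over time, separately per unit."""
    scores = h.masked_fill(~mask.unsqueeze(1), float("-inf"))
    weights = torch.softmax(scores, dim=-1)                  # [batch, units, time]
    return (h * weights).sum(dim=-1)                         # [batch, units]

class AttentivePooling(nn.Module):
    """Weights come from a dense layer NN_attn applied to every time step."""

    def __init__(self, units):
        super().__init__()
        self.attn = nn.Linear(units, 1)

    def forward(self, h, mask):
        scores = self.attn(h.transpose(1, 2)).squeeze(-1)    # [batch, time]
        scores = scores.masked_fill(~mask, float("-inf"))    # PAD gets zero weight
        weights = torch.softmax(scores, dim=-1).unsqueeze(1) # [batch, 1, time]
        return (h * weights).sum(dim=-1)                     # [batch, units]
```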

The optimal score is usually achieved by concatenating several different poolings, including several attentive pooling with different $NN_{attn}$ (aka multi-headed attention).
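
As an illustration of such a concatenation, here is a sketch that joins max pooling with several attentive heads. The name `MultiHeadPooling` and the choice of two heads are assumptions; `h` is `[batch, units, time]` and `mask` is a boolean `[batch, time]` tensor with at least one `True` per row.

```python
import torch
import torch.nn as nn

class MultiHeadPooling(nn.Module):
    """Concatenates max pooling with `n_heads` attentive poolings."""

    def __init__(self, units, n_heads=2):
        super().__init__()
        self.attn = nn.Linear(units, n_heads)    # one score per head and time step
        self.out_size = units * (1 + n_heads)

    def forward(self, h, mask):
        pad = ~mask.unsqueeze(1)                                   # [batch, 1, time]
        h_max = h.masked_fill(pad, float("-inf")).max(dim=-1)[0]   # [batch, units]
        scores = self.attn(h.transpose(1, 2)).transpose(1, 2)      # [batch, n_heads, time]
        weights = torch.softmax(scores.masked_fill(pad, float("-inf")), dim=-1)
        # Weighted sum over time for every head: [batch, n_heads, units]
        h_attn = torch.einsum("bnt,but->bnu", weights, h)
        return torch.cat([h_max, h_attn.flatten(1)], dim=1)        # [batch, out_size]
```

The following dense layer then needs `pool.out_size` input features instead of `units`.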

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here are a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See the last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix in title and desc vectorizer
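
A sketch of the last two tricks in PyTorch, assuming `weights` is a float tensor whose rows are aligned with the token ids. Note that `nn.Embedding.from_pretrained` freezes the vectors by default, so fine-tuning needs `freeze=False`. The class name, the random stand-in weights and the learning rates are illustrative only.

```python
import torch
import torch.nn as nn

class SharedEmbeddingEncoders(nn.Module):
    """Title and description branches that read from one shared embedding matrix."""

    def __init__(self, weights, pad_ix, hid_size=64):
        super().__init__()
        # freeze=False lets gradient descent update the pre-trained vectors
        self.emb = nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=pad_ix)
        emb_size = weights.shape[1]
        self.title_conv = nn.Conv1d(emb_size, hid_size, kernel_size=3, padding=1)
        self.desc_conv = nn.Conv1d(emb_size, hid_size, kernel_size=3, padding=1)

    def forward(self, title_ix, desc_ix):
        title_h = self.title_conv(self.emb(title_ix).transpose(1, 2)).max(dim=-1)[0]
        desc_h = self.desc_conv(self.emb(desc_ix).transpose(1, 2)).max(dim=-1)[0]
        return torch.relu(title_h), torch.relu(desc_h)

weights = torch.randn(50, 16)    # random stand-in for real pre-trained vectors
model = SharedEmbeddingEncoders(weights, pad_ix=1)

# A smaller learning rate for the embeddings keeps them close to the pre-trained values
optimizer = torch.optim.Adam([
    {"params": model.emb.parameters(), "lr": 1e-4},
    {"params": list(model.title_conv.parameters()) + list(model.desc_conv.parameters()),
     "lr": 1e-3},
])
```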


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification as well. With some tricks, of course.

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
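
One way to put these points together is a bidirectional LSTM that skips PAD positions and is then max-pooled over time. This is a sketch under two assumptions: padding sits at the end of every sequence, and `pad_ix` is the PAD token id. The class name and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BiLSTMEncoder(nn.Module):
    def __init__(self, n_tokens, pad_ix, emb_size=64, hid_size=32, out_size=64):
        super().__init__()
        self.pad_ix = pad_ix
        self.emb = nn.Embedding(n_tokens, emb_size, padding_idx=pad_ix)
        # bidirectional=True runs left-to-right and right-to-left LSTMs and
        # concatenates their outputs along the unit axis
        self.lstm = nn.LSTM(emb_size, hid_size, bidirectional=True, batch_first=True)
        self.dense = nn.Linear(2 * hid_size, out_size)

    def forward(self, text_ix):
        mask = text_ix != self.pad_ix                           # [batch, time]
        lengths = mask.sum(dim=1).clamp(min=1).cpu()            # lengths must live on CPU
        packed = pack_padded_sequence(self.emb(text_ix), lengths,
                                      batch_first=True, enforce_sorted=False)
        out, _ = self.lstm(packed)
        out, _ = pad_packed_sequence(out, batch_first=True,
                                     total_length=text_ix.shape[1])  # [batch, time, 2 * hid]
        # Max over time, ignoring PAD positions
        out = out.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        pooled = out.max(dim=1)[0]
        return self.dense(torch.relu(pooled))
```

Packing means the backward direction starts from the last real token rather than from padding, so the output does not depend on how much padding a batch happens to contain.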


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at [early stopping callback(keras)](https://keras.io/callbacks/#earlystopping) or in [pytorch(lightning)](https://pytorch-lightning.readthedocs.io/en/latest/common/early_stopping.html).
  * In short, train until you notice that the validation error stops improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
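
For plain PyTorch, early stopping can be written by hand. The sketch below assumes two hypothetical callbacks, `train_one_epoch(model)` and `evaluate_mae(model) -> float`, which would wrap the training loop and a validation pass like the ones used above. It keeps the best-on-validation snapshot in memory rather than in a file.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate_mae,
                              max_epochs=100, patience=3):
    """Stop once validation MAE has not improved for `patience` epochs in a row.

    Returns (history, best_mae); the best weights are restored into `model`.
    """
    best_mae, best_state, bad_epochs = float("inf"), None, 0
    history = []
    for epoch in range(max_epochs):
        train_one_epoch(model)
        mae = evaluate_mae(model)
        history.append(mae)
        if mae < best_mae:
            best_mae, bad_epochs = mae, 0
            # Best-on-validation snapshot (deepcopy so later updates do not overwrite it)
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return history, best_mae
```

The returned `history` is exactly what a learning curve needs, for instance `plt.plot(history)` with epochs on the x-axis and validation MAE on the y-axis.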
  
Good luck! And may the force be with you!