### Kaggle Configuration

In [None]:
import os
from pathlib import Path

is_kaggle = "KAGGLE_WORKING_DIR" in os.environ or "/kaggle" in os.getcwd()
print("Running on Kaggle:", is_kaggle)

if is_kaggle:
    path = Path("/kaggle/input/us-patent-phrase-to-phrase-matching")
    ! pip install -q datasets
else:
    path = Path(os.getcwd())

In [None]:
import pandas as pd

df = pd.read_csv(path/"train.csv")

A good starting point in any Kaggle competition is to check what our data consists of. To do this we should:

1. Print out the data frame
2. Read the [Dataset description](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data)
3. Call the DataFrame's `describe()` method

In [None]:
df

In [None]:
df.describe(include="object")

We're going to do a lot with not much unique data. The values repeat heavily and each entry is only 3-4 words long.

Next we want to create an input column for our NLP model to read. It labels each of our text columns and combines them into the single string we'll feed into the model.

In [None]:
df["input"] = "TEXT1: " + df.context + "; TEXT2: " + df.target + "; ANC1: " + df.anchor
df.input.head()

## Tokenization

Machine learning models operate on numbers, not text. We need to convert our input into numbers. To do this we need to do two things:
1. Tokenization: Split each text up into tokens
2. Numericalization: Convert each token into a number
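
As a rough illustration of these two steps, here is a minimal sketch that splits on whitespace and builds its own vocabulary. The example texts and the `vocab` dictionary are made up for this illustration; real tokenizers use a vocabulary learned from data rather than one built on the fly like this.

```python
# Toy illustration only: real tokenizers use learned subword vocabularies.
texts = ["abatement of pollution", "act of abating"]

# Step 1, tokenization: split each text into tokens (here, on whitespace).
tokenized = [text.split() for text in texts]

# Build a vocabulary that maps each distinct token to an integer id.
vocab = {}
for tokens in tokenized:
    for token in tokens:
        vocab.setdefault(token, len(vocab))

# Step 2, numericalization: replace each token with its id.
numericalized = [[vocab[token] for token in tokens] for tokens in tokenized]

print(tokenized)
print(numericalized)
```

Note that the word "of" receives the same id in both texts, which is what lets a model treat repeated tokens consistently.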

Transformers uses a `Dataset` (from the `datasets` library) for storing data.

In [None]:
from datasets import Dataset, DatasetDict
ds = Dataset.from_pandas(df)
ds

How to tokenize and numericalize text varies between models, so we first need to choose which model to use.

In [None]:
model_name = "microsoft/deberta-v3-small"

Tokens aren't necessarily words. We need to be able to handle text that isn't made up of words, such as URLs, and we need to limit the size of our vocabulary, so less common words will be split up into smaller pieces.
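
To make the idea of splitting rarer words concrete, here is a minimal sketch of a greedy longest-match splitter. This is not the algorithm the `deberta-v3-small` tokenizer uses, and both `split_into_subwords` and `toy_vocab` are hypothetical names invented for this illustration. It only shows how a small vocabulary can still cover words it has no whole entry for.

```python
def split_into_subwords(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Characters that match no piece are emitted as single-character tokens.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

# Hypothetical vocabulary, chosen only for this illustration.
toy_vocab = {"token", "iz", "er", "s", "the"}

print(split_into_subwords("the", toy_vocab))         # common word stays whole
print(split_into_subwords("tokenizers", toy_vocab))  # rarer word is split up
```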

In [None]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.tokenize("This is Declan's tokenizer")

In [None]:
def tokenization_func(dataset): return tokenizer(dataset["input"])
tokenized_dataset = ds.map(tokenization_func, batched=True)

This adds a new item to our dataset, `input_ids`, which holds our text converted into numbers. Each number matches one of our model's tokens. This list of tokens is known as the model's vocabulary.

In [None]:
row = tokenized_dataset[0]
row["input"], row["input_ids"]

In [None]:
tokenizer.convert_ids_to_tokens(54453)

Transformers always assumes that your labels are in a column named `labels`, so we need to rename our `score` column.

In [None]:
tokenized_dataset = tokenized_dataset.rename_columns({'score':'labels'})

## Create a Validation Set

In practice, making a random split for a validation set is [often a bad practice](https://www.fast.ai/2017/11/13/validation-sets/).

Note that for Kaggle competitions we split our validation set off from the training data. The competition's test data serves as our test set.

In [None]:
ds_dict = tokenized_dataset.train_test_split(0.25, seed=42)
ds_dict
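
For comparison with the random split above, here is a minimal sketch of one non-random alternative: keeping every row that shares a grouping key in the same set. Grouping by `anchor` is an assumption made for illustration, not something the competition prescribes, and `group_split` and the sample rows are hypothetical.

```python
import random

def group_split(rows, group_key, valid_fraction=0.25, seed=42):
    """Split rows so that every group lands entirely in one of the two sets."""
    # Sort first so the shuffle is reproducible for a given seed.
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_valid = max(1, round(len(groups) * valid_fraction))
    valid_groups = set(groups[:n_valid])
    train = [row for row in rows if row[group_key] not in valid_groups]
    valid = [row for row in rows if row[group_key] in valid_groups]
    return train, valid

# Made-up rows that only mimic the shape of the competition data.
rows = [
    {"anchor": "abatement", "target": "act of abating"},
    {"anchor": "abatement", "target": "forest region"},
    {"anchor": "acid absorption", "target": "absorb acid"},
    {"anchor": "acid absorption", "target": "acid reflux"},
    {"anchor": "air flow line", "target": "air duct"},
    {"anchor": "angular contact bearing", "target": "ball bearing"},
]

train_rows, valid_rows = group_split(rows, "anchor")
print(len(train_rows), len(valid_rows))
```

With a split like this, the validation score measures how well the model handles anchors it never saw during training, which a random split cannot tell us.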

## Create a Test Set

In [None]:
test_df = pd.read_csv(path/"test.csv")
test_df["input"] = "TEXT1: " + test_df.context + "; TEXT2: " + test_df.target + "; ANC1: " + test_df.anchor
test_ds = Dataset.from_pandas(test_df).map(tokenization_func, batched=True)
test_ds

## Training

The batch size may need adjusting to fit the GPU we choose to use. If training crashes with an out-of-memory error, a lower batch size will be required.

The learning rate will also require some trial and error. As with batch size, a larger value trains faster, but a value that is too large will cause training to fail.

In [None]:
from transformers import TrainingArguments, Trainer
learning_rate = 8e-5
epoch_count = 4
if is_kaggle:
    batch_size = 128
    fp16 = True
else:
    batch_size = 32
    fp16 = False

The arguments below should work fine in most cases; diving into them isn't necessary at this stage.

In [None]:
args = TrainingArguments('outputs', 
                         learning_rate=learning_rate,
                         warmup_ratio=0.1,
                         lr_scheduler_type='cosine',
                         fp16=fp16,
                         evaluation_strategy='epoch',  # named `eval_strategy` in newer transformers releases
                         per_device_train_batch_size=batch_size,
                         per_device_eval_batch_size=batch_size*2,
                         num_train_epochs=epoch_count,
                         weight_decay=0.01,
                         report_to='none')
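
For the curious, here is a minimal sketch of how `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` shape the learning rate over training: a linear ramp up to the peak, then a cosine decay to zero. The `lr_at_step` helper and the `total_steps` value are hypothetical, and the formula is an approximation of what the library does rather than its actual implementation.

```python
import math

def lr_at_step(step, total_steps, peak_lr=8e-5, warmup_ratio=0.1):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 1000  # hypothetical; the real value depends on dataset and batch size
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The learning rate we pass in is therefore the peak value, reached only at the end of warmup.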

We can now create a `Trainer` which combines our model and data together. It will spit out a few warnings we can ignore.

In [None]:
import torch
import numpy as np
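# num_labels=1 sets the model up for regression: it outputs a single score
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
# Pearson correlation between predictions and labels, reported at each evaluation
def corr(x,y): return np.corrcoef(x,y)[0][1]
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}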

trainer = Trainer(model, 
                  args, 
                  train_dataset=ds_dict['train'], 
                  eval_dataset=ds_dict['test'],
                  tokenizer=tokenizer,
                  compute_metrics=corr_d)
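
The `corr` helper above reports the Pearson correlation between predictions and labels. As a minimal sketch of what that number means, here it is computed by hand on made-up values. The `pearson` function and the sample lists are illustrative only and are not part of the training code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the shared 1/n factor cancels in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

labels = [0.0, 0.25, 0.5, 0.75, 1.0]          # made-up scores
preds_good = [0.1, 0.2, 0.6, 0.7, 0.9]        # roughly tracks the labels
preds_inverted = [1.0, 0.75, 0.5, 0.25, 0.0]  # exactly the wrong way round

print(pearson(labels, preds_good))
print(pearson(labels, preds_inverted))
```

Values near 1 mean predictions rise and fall with the labels, values near 0 mean no linear relationship, and values near -1 mean the predictions are inverted.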

In [None]:
trainer.train()