### Kaggle Configuration

In [1]:
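import os
from pathlib import Path

is_kaggle = "KAGGLE_WORKING_DIR" in os.environ or "/kaggle" in os.getcwd()
print("Running on Kaggle:", is_kaggle)

if is_kaggle:
    path = Path(".../input/us-patent-to-phrase-matching")
    ! pip install -q datasets
else:
    # Outside Kaggle, expect the CSV files in the current working directory.
    path = Path.cwd()

Running on Kaggle: False


In [2]:
import pandas as pd

# `path` is a Path in both branches above, so join with `/` instead of string concatenation.
df = pd.read_csv(path / "train.csv")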

A good starting point in any Kaggle competition is to check what our data consists of. To do this we should:

1. Print out the data frame
2. Read the [Dataset description](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data)
3. Call the data frame's `describe()` method

In [3]:
df

| | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
| ... | ... | ... | ... | ... | ... |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
| 36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |


In [4]:
df.describe(include="object")

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


We're going to do a lot with not much unique data. There's a lot of repetition (only 733 unique anchors and 106 unique contexts across 36,473 rows), and each entry is only a few words long.

Next we want to create an input column for our NLP model to read. It labels and combines all our text columns into one string that we'll feed into the model.

In [5]:
df["input"] = "TEXT1: " + df.context + "; TEXT2: " + df.target + "; ANC1: " + df.anchor
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

## Tokenization

Machine learning models operate on numbers, not text, so we need to convert our input into numbers. This takes two steps:
1. Tokenization: Split each text up into tokens
2. Numericalization: Convert each token into a number
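The two steps above can be sketched in plain Python. This is a toy illustration only: it splits on whitespace and builds its own tiny vocabulary (`texts`, `vocab`, and `ids` are made-up names), whereas the real tokenizer used later in this notebook works differently.

```python
# Toy illustration only: real tokenizers are more sophisticated than this.
texts = ["abatement of pollution", "act of abating"]

# 1. Tokenization: split each text into tokens (here, naively on whitespace).
tokenized = [text.split() for text in texts]

# 2. Numericalization: give every distinct token an integer id,
#    numbered in order of first appearance.
vocab = {}
for tokens in tokenized:
    for token in tokens:
        vocab.setdefault(token, len(vocab))

ids = [[vocab[token] for token in tokens] for tokens in tokenized]

print(tokenized)
print(ids)
```

Note that the shared token `of` gets the same id in both texts, which is the whole point of having a vocabulary.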

Transformers uses `Dataset` objects from the `datasets` library to store data.

In [6]:
from datasets import Dataset, DatasetDict
ds = Dataset.from_pandas(df)
ds


Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

How text is tokenized and numericalized varies between models, so we load the tokenizer that matches the model we're going to use.

In [7]:
model_name = "microsoft/deberta-v3-small"

Tokens aren't necessarily words. We need to be able to handle text that isn't made up of words, such as URLs, and we need to limit the size of our vocabulary, so less common words are split up into smaller pieces.
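To see why a limited vocabulary forces rarer words to be split, here is a minimal sketch of greedy longest-match splitting. The function `split_into_subwords` and the four-entry `toy_vocab` are hypothetical; a real tokenizer learns its vocabulary from data and may use a different splitting algorithm.

```python
def split_into_subwords(word, vocab, unknown="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing in the vocabulary matches at this position.
            return [unknown]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary: "tokenizer" is too rare to get its own entry.
toy_vocab = {"this", "is", "token", "izer"}

print(split_into_subwords("this", toy_vocab))
print(split_into_subwords("tokenizer", toy_vocab))
print(split_into_subwords("xyz", toy_vocab))
```

A word that is in the vocabulary stays whole, a rarer word is assembled from smaller known pieces, and text with no matching pieces falls back to an unknown token.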

In [8]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.tokenize("This is Declan's tokenizer")



['▁This', '▁is', '▁Declan', "'", 's', '▁token', 'izer']

In [9]:
def tokenization_func(dataset): return tokenizer(dataset["input"])
tokenized_dataset = ds.map(tokenization_func, batched=True)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Map: 100%|██████████| 36473/36473 [00:00<00:00, 61211.40 examples/s]


This adds a new item, `input_ids`, to our dataset. It holds our text converted into numbers, each of which matches one of our model's tokens. The full list of tokens is known as the model's vocabulary.

In [10]:
row = tokenized_dataset[0]
row["input"], row["input_ids"]

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

In [11]:
tokenizer.convert_ids_to_tokens(54453)

'▁TEXT'

Transformers always assumes that your labels are in a column named `labels`, so we need to rename our `score` column.

In [12]:
tokenized_dataset = tokenized_dataset.rename_columns({'score':'labels'})

## Create a Validation Set

In practice, making a random split for a validation set is [often a bad practice](https://www.fast.ai/2017/11/13/validation-sets/).

Note that for Kaggle competitions we carve our validation set out of the training data; Kaggle's test data serves as our test set. `train_test_split` names the held-out split `test`, but here we use it as our validation set.
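As one possible alternative to a purely random split, the sketch below keeps every row that shares an anchor in the same split, so the validation set only contains anchors the model never saw during training. This is not what the notebook does in the next cell, and whether it suits this competition depends on how the hidden test set was built. The helper `split_by_group` and the sample `rows` are hypothetical.

```python
import random


def split_by_group(rows, group_key, valid_fraction=0.25, seed=42):
    """Split `rows` so every value of `group_key` lands in exactly one split."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out a fraction of the groups (not of the rows) for validation.
    n_valid = max(1, round(len(groups) * valid_fraction))
    valid_groups = set(groups[:n_valid])
    train = [row for row in rows if row[group_key] not in valid_groups]
    valid = [row for row in rows if row[group_key] in valid_groups]
    return train, valid


rows = [
    {"anchor": "abatement", "target": "abatement of pollution"},
    {"anchor": "abatement", "target": "act of abating"},
    {"anchor": "wood article", "target": "wooden box"},
    {"anchor": "wood article", "target": "wooden handle"},
]
train_rows, valid_rows = split_by_group(rows, "anchor", valid_fraction=0.5)
print(len(train_rows), len(valid_rows))
```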

In [13]:
ds_dict = tokenized_dataset.train_test_split(0.25, seed=42)
ds_dict

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

## Create a Test Set

In [14]:
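# Read test.csv from the same directory as train.csv (os.path.join accepts both str and Path).
test_df = pd.read_csv(os.path.join(path, "test.csv"))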
test_df["input"] = "TEXT1: " + test_df.context + "; TEXT2: " + test_df.target + "; ANC1: " + test_df.anchor
test_ds = Dataset.from_pandas(test_df).map(tokenization_func, batched=True)
test_ds

Map: 100%|██████████| 36/36 [00:00<00:00, 9328.16 examples/s]


Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36
})

## Training

The batch size may need adjusting to fit the GPU we choose to use. If we run out of memory during training, we'll need a lower batch size.

The learning rate also requires some trial and error. As with batch size, a larger value trains faster, but a value that's too large will cause training to fail.

In [15]:
from transformers import TrainingArguments, Trainer
batch_size = 128
epoch_count = 4
learning_rate = 8e-5

The arguments below should work fine in most cases; diving into them isn't necessary at this stage.

In [16]:
import torch

args = TrainingArguments('outputs',
                         learning_rate=learning_rate,
                         warmup_ratio=0.1,
                         lr_scheduler_type='cosine',
                         # fp16 mixed precision needs a supported GPU, so only enable it when CUDA is available.
                         fp16=torch.cuda.is_available(),
                         evaluation_strategy='epoch',
                         per_device_train_batch_size=batch_size,
                         per_device_eval_batch_size=batch_size*2,
                         num_train_epochs=epoch_count,
                         weight_decay=0.01,
                         report_to='none')

Hard-coding `fp16=True` fails on a machine without a supported GPU, as it did when this notebook was run outside Kaggle:

ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA or NPU devices or certain XPU devices (with IPEX).
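To make `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` a little more concrete, here is a rough sketch of a learning rate that ramps up linearly and then decays along a cosine curve. The function `lr_at_step` is hypothetical and only approximates what the library does internally. The step counts assume a single device and the 27,354 training rows from the split above.

```python
import math


def lr_at_step(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Linear warmup to `base_lr`, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale the learning rate up from zero.
        return base_lr * step / warmup_steps
    # Decay phase: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 27,354 training rows, batch size 128, 4 epochs (single device assumed).
steps_per_epoch = math.ceil(27354 / 128)
total_steps = steps_per_epoch * 4

for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The learning rate starts at zero, peaks at `learning_rate` once the warmup ends, and falls back towards zero by the final step.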

We can now create a `Trainer` which combines our model and data together. It will spit out a few warnings we can ignore.
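The `Trainer` cell below passes `compute_metrics=corr_d`, but no definition of `corr_d` appears in the cells shown here. A plausible definition, assuming the metric is the Pearson correlation coefficient (which this competition is scored on), might look like the following sketch. The helper name `corr` is an assumption.

```python
import numpy as np


def corr(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0][1])


def corr_d(eval_pred):
    """Metric callback: takes (predictions, labels), returns a dict of named metrics."""
    predictions, labels = eval_pred
    # With num_labels=1 the predictions have shape (n, 1); flatten them to match the labels.
    return {"pearson": corr(np.ravel(predictions), np.ravel(labels))}
```

The `Trainer` calls this function after each evaluation and reports every key in the returned dictionary alongside the validation loss.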

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
trainer = Trainer(model, 
                  args, 
                  train_dataset=ds_dict['train'], 
                  eval_dataset=ds_dict['test'],
                  tokenizer=tokenizer,
                  # corr_d is the metric callback; it must be defined before this cell runs.
                  compute_metrics=corr_d)

In [None]:
trainer.train()