> :warning: **If you are using macOS**: This notebook cannot be trained on
> macOS. If you run all cells, the notebook will automatically stop
> execution at the training stage and transfer execution to a remote Kaggle
> GPU.

## Setup

In [None]:
import logging
import os
from pathlib import Path
from platform import system
import warnings
from zipfile import ZipFile

from datasets import Dataset, DatasetDict
from fastai.imports import *
import torch
from torch.utils.data import DataLoader
import transformers
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In [None]:
try:
    from fastkaggle import *
except ModuleNotFoundError:
    ! pip install -Uq fastkaggle
    from fastkaggle import *

In [None]:
environment = system()
environment, iskaggle

In [None]:
data_path = setup_comp('us-patent-phrase-to-phrase-matching')
if not iskaggle:
    ZipFile(f'{data_path}.zip').extractall(data_path)

## Explore Data

### Training Data

In [None]:
train_df = pd.read_csv(data_path / 'train.csv')
train_df

In [None]:
train_df.describe(include='object')

#### Targets

In [None]:
train_df['target'].value_counts()

The vast majority of targets are unique. Most targets also contain very few
words. The least frequent targets tend to contain more than one word.
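
One way to check the relationship between target frequency and target length is to put both side by side. The sketch below uses a small made-up Series in place of `train_df['target']`; the phrases and the `summary` name are illustrative assumptions, not data from the competition.

```python
import pandas as pd

# Toy stand-in for train_df['target']; these phrases are illustrative only.
targets = pd.Series([
    'composition',
    'composition',
    'data',
    'metal alloy',
    'heat exchange unit',
])

# How often each distinct target appears, most frequent first.
frequency = targets.value_counts()

# Number of whitespace-separated words in each distinct target.
word_counts = frequency.index.to_series().str.split().str.len()

# Both Series share the same index (the target text), so they align by row.
summary = pd.DataFrame({'frequency': frequency, 'words': word_counts})
print(summary)
```

Running the same steps on the real column should show whether the rarest targets are indeed the longer ones.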

#### Anchors

In [None]:
train_df['anchor'].value_counts()

There are many more targets than anchors. The length of each anchor varies
quite a bit too, though anchors tend to be between 2 and 4 words long.

#### Context

In [None]:
train_df['context'].value_counts()

The first letter in each context code references the section under which the
patent was filed.
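
As a rough illustration, the section letter can be mapped to a readable title. This sketch assumes the context codes follow the Cooperative Patent Classification (CPC) scheme; the `CPC_SECTIONS` dictionary and the `section_of` helper are hypothetical names introduced here, and the title for section `Y` is abbreviated.

```python
# Section titles from the CPC scheme (assumed to be the scheme behind the
# context codes). The title for 'Y' is shortened.
CPC_SECTIONS = {
    'A': 'Human Necessities',
    'B': 'Performing Operations; Transporting',
    'C': 'Chemistry; Metallurgy',
    'D': 'Textiles; Paper',
    'E': 'Fixed Constructions',
    'F': 'Mechanical Engineering; Lighting; Heating; Weapons; Blasting',
    'G': 'Physics',
    'H': 'Electricity',
    'Y': 'General Tagging of New Technological Developments',
}


def section_of(context_code: str) -> str:
    """Return the section title for a context code such as 'A47'."""
    return CPC_SECTIONS.get(context_code[0].upper(), 'Unknown section')


# Example codes, used here only for illustration.
print(section_of('A47'))
print(section_of('H01'))
```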

#### Scores

In [None]:
train_df['score'].hist();

Most patents seem to be somewhat similar or not very similar.

Below are the data points that have a score of 1.0.

In [None]:
train_df[train_df['score']==1]

The anchors and targets that scored 1.0 are minor rewordings of each other.
Each patent's context doesn't seem to play a significant role.

### Testing Data

In [None]:
test_df = pd.read_csv(data_path/'test.csv')
len(test_df)

In [None]:
test_df.head()

In [None]:
test_df.describe(include='object')

## Data Processing

### Context Section

The first letter of each patent's context code refers to the section the
patent was filed under.

In [None]:
train_df['context'].value_counts()

Separating this letter into its own column may help with performance.

In [None]:
train_df['section'] = train_df['context'].str[0]
train_df['section'].value_counts()

### Tokenization and Numericalization

In [None]:
# Disable warnings, since HuggingFace emits many unnecessary ones.
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)

I'll initially use deberta-v3-small for experimentation. The larger version
can be used at the end.

In [None]:
model_name = 'microsoft/deberta-v3-small'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

I'll need to iterate on how to combine the anchors, targets, and contexts
together since there is not much research on this.

To begin with, I'll combine each data point's context, anchor, and target
into a single string. A special token will be used to separate these
sections.
In [None]:
separator = tokenizer.sep_token; separator

In [None]:
train_df['inputs'] = train_df['context'] + separator + train_df['anchor'] + \
                     separator + train_df['target']

Better performance is obtained when Pandas DataFrames are converted into
HuggingFace Datasets.

The `score` column will also be renamed to `label` in the training set, since
that is what HuggingFace expects.

In [None]:
train_ds = Dataset.from_pandas(train_df).rename_column('score', 'label')
test_ds = Dataset.from_pandas(test_df)

In [None]:
def tokenize_function(document):
    return tokenizer(document['inputs'])

In [None]:
train_ds[0]

In [None]:
tokenize_function(train_ds[0])

`1` and `2` are special tokens. `1` represents the start of a document and
`2` represents the separator token I used.

In [None]:
tokenizer.all_special_tokens

In [None]:
inputs = ('anchor', 'target', 'context',)
tok_train_ds = train_ds.map(tokenize_function, batched=True,
                            remove_columns=inputs+('inputs', 'id', 'section'))

In [None]:
tok_train_ds[0]

### Validation Set

I'll create a validation set that contains anchors that are not present in
the training set.
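
The idea can be sketched without any of the notebook's data: shuffle the distinct anchors, hold out a fraction of them, and assign every row according to its anchor. The `split_by_anchor` helper and the toy `rows` list are assumptions made for illustration; the cells that follow do the same thing with NumPy.

```python
import random


def split_by_anchor(anchors, valid_fraction=0.25, seed=42):
    """Return (train_indices, valid_indices) with no anchor in both sets."""
    unique_anchors = sorted(set(anchors))
    random.Random(seed).shuffle(unique_anchors)
    n_valid = int(len(unique_anchors) * valid_fraction)
    valid_anchors = set(unique_anchors[:n_valid])
    train_idx = [i for i, a in enumerate(anchors) if a not in valid_anchors]
    valid_idx = [i for i, a in enumerate(anchors) if a in valid_anchors]
    return train_idx, valid_idx


# Toy anchor column: four distinct anchors, each repeated over several rows.
rows = (['abatement'] * 3 + ['act of abating'] * 2 +
        ['active area'] * 4 + ['acid absorption'] * 1)
train_idx, valid_idx = split_by_anchor(rows)

train_anchors = {rows[i] for i in train_idx}
held_out_anchors = {rows[i] for i in valid_idx}
print(held_out_anchors, train_anchors & held_out_anchors)
```

Splitting by anchor rather than by row means the validation score measures how well the model handles anchors it has never seen.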

In [None]:
anchors = train_df.anchor.unique()
np.random.seed(42)
np.random.shuffle(anchors)
anchors[:5]

In [None]:
split_ratio = 0.25
valid_set_size = int(len(anchors) * split_ratio)
valid_set_size

In [None]:
valid_anchors = anchors[:valid_set_size]
valid_anchors[:5]

In [None]:
valid_documents = np.isin(train_df['anchor'], valid_anchors)
indicies = np.arange(len(train_df))
valid_indices = indicies[valid_documents]
train_indicies = indicies[~valid_documents]
len(valid_indices), len(train_indicies)

In [None]:
# If error with this dictionary, change 'valid' to 'train'.
ds_dict = DatasetDict({
    'train': tok_train_ds.select(train_indicies),
    'valid': tok_train_ds.select(valid_indices)
})

## Train Model

### Use Kaggle remotely if on Mac

In [None]:
if environment == 'Darwin':
    # Create metadata file.
    nb_meta(
        user='forbo7',
        id='forbo7/push-patent-similarity-iteration',
        title='[PUSH] Patent Similarity Iteration',
        file='patent_similarity_iteration.ipynb',
        competition='us-patent-phrase-to-phrase-matching',
        private=True,
        gpu=True
    )

    # Push to Kaggle.
    push_notebook(
        user='forbo7',
        id='forbo7/push-patent-similarity-iteration',
        title='[PUSH] Patent Similarity Iteration',
        file='patent_similarity_iteration.ipynb',
        competition='us-patent-phrase-to-phrase-matching',
        private=True,
        gpu=True
    )

### Warning

In [None]:
class StopExecution(Exception):
    def _render_traceback_(self):
        return [
            "The training portion of this notebook cannot be run on a "
            "Mac. Instead, the notebook is now being executed remotely on "
            "Kaggle. This notebook will not execute further locally."
        ]

In [None]:
if environment == 'Darwin':
    raise StopExecution

### Metric Function

In [None]:
def pearson_correlation(valid_prediction):
    return {'pearson': np.corrcoef(*valid_prediction)[0][1]}
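
For reference, the Pearson correlation is the covariance of the two sequences divided by the product of their standard deviations. The sketch below computes it by hand on toy values; the `pearson` helper is hypothetical, and flattening the predictions with `ravel()` is an assumption about their shape (one value per row).

```python
import numpy as np


def pearson(predictions, labels):
    """Pearson correlation between two equally long sequences of numbers."""
    x = np.asarray(predictions, dtype=float).ravel()  # flatten (N, 1) to (N,)
    y = np.asarray(labels, dtype=float).ravel()
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = x_centered @ y_centered
    denominator = np.sqrt((x_centered @ x_centered) *
                          (y_centered @ y_centered))
    return float(numerator / denominator)


# Toy values: predictions that are a linear function of the labels.
labels = [0.0, 0.25, 0.5, 0.75, 1.0]
predictions = [[0.1], [0.2], [0.3], [0.4], [0.5]]
print(pearson(predictions, labels))
```

Because the metric only measures linear association, predictions do not need to be on the same scale as the scores to achieve a high value.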

### Other Hyperparameters

In [None]:
learning_rate = 8e-5
batch_size = 128
weight_decay = 0.01
epochs = 4

### Training Arguments

In [None]:
arguments = TrainingArguments(
    'outputs',
    learning_rate=learning_rate,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    fp16=True,
    evaluation_strategy='epoch',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size*2,
    num_train_epochs=epochs,
    weight_decay=weight_decay,
    report_to='none'
)
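
The combination of `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` means the learning rate ramps up linearly for roughly the first 10% of steps and then decays along a cosine curve. The function below is only a sketch of that shape, not the code the library runs; `cosine_with_warmup`, the step count, and the rounding of the warmup length are assumptions.

```python
import math


def cosine_with_warmup(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_with_warmup(step, total))
```

The peak value equals `learning_rate`, so changing the warmup ratio shifts when the peak is reached rather than how high it is.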

### Trainer Creation

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=1)

In [None]:
trainer = Trainer(
    model,
    arguments,
    train_dataset=ds_dict['train'],
    eval_dataset=ds_dict['valid'],
    tokenizer=tokenizer,
    compute_metrics=pearson_correlation
)

### Train Trainer

In [None]:
trainer.train()

## Convenience Functions

To experiment more quickly, let's create one function that tokenizes a
DataFrame and splits it into training and validation sets, and another that
creates the trainer.

In [None]:
def create_ds_dict(dataframe):
    dataset = Dataset.from_pandas(dataframe).rename_column('score', 'label')
    tokenized_ds = dataset.map(tokenize_function, batched=True,
                               remove_columns=inputs+('inputs', 'id',
                                                      'section'))
    return DatasetDict({
        'train': tokenized_ds.select(train_indicies),
        'valid': tokenized_ds.select(valid_indices)
    })

In [None]:
def get_model():
    return AutoModelForSequenceClassification.from_pretrained(model_name,
                                                              num_labels=1)

def create_trainer(ds_dict, model=None):
    if model is None: model = get_model()

    arguments = TrainingArguments(
        'outputs',
        learning_rate=learning_rate,
        warmup_ratio=0.1,
        lr_scheduler_type='cosine',
        fp16=True,
        evaluation_strategy='epoch',
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size*2,
        num_train_epochs=epochs,
        weight_decay=weight_decay,
        report_to='none'
    )

    return Trainer(
        model,
        arguments,
        train_dataset=ds_dict['train'],
        eval_dataset=ds_dict['valid'],
        tokenizer=tokenizer,
        compute_metrics=pearson_correlation
    )

## Improving the Model

Make sure the initial model is stable; that is, it gives roughly the same
result in each run.
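
One simple way to judge stability is to repeat the baseline a few times and compare the spread of the validation metric against a tolerance. The helpers and the tolerance below are hypothetical, and the scores are placeholder numbers, not results from this notebook.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return the mean and sample standard deviation of a list of scores."""
    return mean(scores), stdev(scores)


def is_stable(scores, tolerance=0.01):
    """Treat runs as stable when their spread stays within `tolerance`."""
    return max(scores) - min(scores) <= tolerance


# Placeholder Pearson scores for three runs; NOT results from this notebook.
runs = [0.800, 0.803, 0.798]
print(summarize_runs(runs), is_stable(runs))
```

An improvement from a later experiment is only meaningful if it is clearly larger than this run-to-run spread.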

### Separator

Let's try using a different separator.

In [None]:
separator = ' [s] '
train_df['inputs'] = train_df['context'] + separator + train_df['anchor'] + \
                     separator + train_df['target']
ds_dict = create_ds_dict(train_df)

In [None]:
create_trainer(ds_dict).train()

### Lowercase

Changing to lowercase is often helpful.

In [None]:
train_df['inputs'] = train_df['inputs'].str.lower()
ds_dict = create_ds_dict(train_df)
create_trainer(ds_dict).train()

### Patent Section

Let's try making the patent section a special token. It may help the model
recognize that different sections are to be handled in different ways.

In [None]:
train_df['section token'] = '[' + train_df['section'] + ']'
section_tokens = list(train_df['section token'].unique())
tokenizer.add_special_tokens({'additional_special_tokens': section_tokens})

In [None]:
train_df['inputs'] = (train_df['section token'] + separator +
                      train_df['context'] + separator +
                      train_df['anchor'].str.lower() + separator +
                      train_df['target'])
ds_dict = create_ds_dict(train_df)

The embedding matrix now needs to be resized due to the addition of more tokens.
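
Conceptually, resizing keeps the existing rows of the embedding matrix and appends one new row per added token. The NumPy sketch below illustrates that idea only; it is not the library's implementation, and the `resize_embeddings` helper and the way new rows are initialized (small random values) are assumptions.

```python
import numpy as np


def resize_embeddings(embeddings, new_vocab_size, seed=0):
    """Return a copy of `embeddings` with `new_vocab_size` rows."""
    old_vocab_size, hidden_size = embeddings.shape
    if new_vocab_size <= old_vocab_size:
        return embeddings[:new_vocab_size].copy()
    rng = np.random.default_rng(seed)
    # New rows get small random values; existing rows are kept unchanged.
    new_rows = rng.normal(0.0, 0.02,
                          size=(new_vocab_size - old_vocab_size, hidden_size))
    return np.vstack([embeddings, new_rows])


old = np.ones((5, 4))            # toy matrix: 5 tokens, hidden size 4
new = resize_embeddings(old, 8)  # pretend 3 section tokens were added
print(old.shape, new.shape)
```

The new rows start untrained, so the section tokens only become useful after fine-tuning.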

In [None]:
model = get_model()
model.resize_token_embeddings(len(tokenizer))

In [None]:
trainer = create_trainer(ds_dict, model=model)
trainer.train()

## Submit Predictions

In [1]:
# TODO: Use large deberta model.
# TODO: Train on full set before submitting.

## Credit

[Iterate like a grandmaster! by Jeremy Howard](https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster)