This notebook works through the U.S. Patent Phrase to Phrase Matching Kaggle competition.

## The Problem
Because the patent database is so large, it is difficult to determine whether a requested patent already exists in the archives. The task is to determine whether two phrases are semantically similar within a given patent context, in order to help the patent office find out whether an invention has been described before.

## The Data

In [1]:
#Imports
import numpy as np
import pandas as pd

In [2]:
train_data_original = pd.read_csv('train.csv')
train_data = train_data_original.copy()
train_data

|  | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
| ... | ... | ... | ... | ... | ... |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
| 36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |


The columns can be described as:
- id : unique identifier for a pair of phrases
- anchor : the first phrase
- target : the second phrase
- context : CPC classification which indicates the subject within which the similarity will be scored
- score : similarity score

Each score value has a meaning:

- 1.0 : very close match
- 0.75 : close synonym
- 0.5 : synonyms that don't have the same meaning
- 0.25 : somewhat related
- 0.0 : unrelated

The scores move in increments of 0.25, so the task can be treated as a 5-class classification problem.

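That leaves an architectural decision: either output 5 neurons (one per class) and pick the most probable class, or output a single neuron and, if needed, snap its value to the nearest 0.25 interval. I am going with the single-neuron approach for simplicity. Note that with `num_labels=1`, as used below, the model is trained as a regression, so the raw output is not squashed by a sigmoid and can fall slightly outside [0, 1].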
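One way to read "take the nearest interval" is to clip the single output to [0, 1] and round it to the closest multiple of 0.25. The sketch below assumes that interpretation. `snap_to_grid` is a hypothetical helper that is not used later in this notebook, and the raw values are made up for illustration.

```python
import numpy as np

def snap_to_grid(preds, step=0.25):
    """Clip predictions to [0, 1] and round each to the nearest multiple of `step`."""
    clipped = np.clip(np.asarray(preds, dtype=float), 0.0, 1.0)
    return np.round(clipped / step) * step

# Made-up raw outputs from a single-neuron model
raw = np.array([0.62, 0.70, -0.02, 1.11, 0.21])
print(snap_to_grid(raw))  # values become 0.5, 0.75, 0.0, 1.0, 0.25
```

Snapping is optional: if predictions are scored with a correlation metric, the clipped continuous values can be used directly.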

In [3]:
# Check the data for any missing data
# Data Summary
train_data.shape

print(train_data.dtypes)
missing_values = train_data.isnull().any()
print(missing_values)

id          object
anchor      object
target      object
context     object
score      float64
dtype: object
id         False
anchor     False
target     False
context    False
score      False
dtype: bool


Luckily, no column in the training set has missing values, so we don't have to deal with them.

In [4]:
train_data.describe(include='object')

|  | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


## Feature Engineering

My plan of action is to concatenate the features into a single sentence that can be tokenized, fed into a large language model, and then used to fine-tune the model to predict the scores.

In [5]:
train_data['input'] = 'Feature1: ' + train_data.anchor + '; Feature2: ' + train_data.target + '; Context: ' + train_data.context

In [6]:
train_data.input.head()

0    Feature1: abatement; Feature2: abatement of po...
1    Feature1: abatement; Feature2: act of abating;...
2    Feature1: abatement; Feature2: active catalyst...
3    Feature1: abatement; Feature2: eliminating pro...
4    Feature1: abatement; Feature2: forest region; ...
Name: input, dtype: object
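Since the same concatenation has to be applied to the test set later, it could be wrapped in a small helper. This is only a sketch: `build_input` and its parameter names are hypothetical, and the field labels are exposed as arguments so that other labels can be tried without rewriting the expression.

```python
import pandas as pd

def build_input(df, first="Feature1", second="Feature2", ctx="Context"):
    """Join anchor, target, and context into one labelled string per row."""
    return (first + ": " + df["anchor"] + "; "
            + second + ": " + df["target"] + "; "
            + ctx + ": " + df["context"])

# One row copied from the training data shown earlier
sample = pd.DataFrame({
    "anchor": ["abatement"],
    "target": ["abatement of pollution"],
    "context": ["A47"],
})
print(build_input(sample)[0])
# Feature1: abatement; Feature2: abatement of pollution; Context: A47
```

With the default labels, the result has the same format as the `input` column built above.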

## Tokenization

We can't pass text directly into the large language model because it expects numbers as inputs. Therefore we need to do two things:
- Tokenize : which is to split each text up into tokens (words/characters/etc)
- Numericalize : after getting the tokens, we need to map those tokens into numerical values
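To make the two steps concrete, here is a toy sketch that splits on whitespace and builds its own vocabulary. It is not how the real tokenizer works (pretrained tokenizers use subword pieces and a fixed vocabulary), and the function names are made up for illustration.

```python
def tokenize(text):
    """Toy tokenizer: lowercase the text and split it on whitespace."""
    return text.lower().split()

def build_vocab(texts):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab):
    """Map each token of `text` to its id in `vocab` (unknown tokens raise KeyError)."""
    return [vocab[tok] for tok in tokenize(text)]

texts = ["abatement of pollution", "act of abating"]
vocab = build_vocab(texts)
print(vocab)                                     # token -> id mapping
print(numericalize("act of pollution", vocab))   # [3, 1, 2]
```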

In [7]:
# Convert the pandas DataFrame into a HuggingFace Dataset
from datasets import Dataset,DatasetDict

ds = Dataset.from_pandas(train_data)

In [8]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

Because tokenization is model dependent, we have to choose a specific model to work with and then tokenize our data accordingly.

I decided to work with a small model so that training is fast and we can iterate quickly.

In [9]:
model_nm = 'microsoft/deberta-v3-small'

Using HuggingFace, we can utilize AutoTokenizer so that we automatically utilize the right tokenizer.

In [10]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_nm, use_fast=False)

In [11]:
tokenizer.tokenize("Hello, this is a tokenized sentence")

['▁Hello', ',', '▁this', '▁is', '▁a', '▁token', 'ized', '▁sentence']

In [12]:
# Function to tokenize our inputs
def tok_func(x): return tokenizer(x["input"])

In [13]:
# Tokenize the whole dataset in batches
tokenized_dataset = ds.map(tok_func, batched=True)


In [14]:
row = tokenized_dataset[0]
row['input'], row['input_ids']

('Feature1: abatement; Feature2: abatement of pollution; Context: A47',
 [1,
  16855,
  435,
  294,
  47284,
  346,
  16855,
  445,
  294,
  47284,
  265,
  6435,
  346,
  26846,
  294,
  336,
  5753,
  2])

In [15]:
tokenized_dataset

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

This creates a tokenized dataset in which the numerical representation of 'input' is stored in 'input_ids'. The numbers come from the vocabulary of the tokenizer we used, which assigns an integer to every token.

Now we need to prepare the data for fine-tuning. The model expects the target column to be called 'labels', so we need to rename 'score'.

In [16]:
tokenized_dataset = tokenized_dataset.rename_columns({'score':'labels'})

## Validation and Test Sets

In [17]:
#Validation Set
dds = tokenized_dataset.train_test_split(0.27, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 26625
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9848
    })
})

In [18]:
eval_df = pd.read_csv('test.csv')
eval_df.describe()

|  | id | anchor | target | context |
|---|---|---|---|---|
| count | 36 | 36 | 36 | 36 |
| unique | 36 | 34 | 36 | 29 |
| top | 4112d61851461f60 | el display | inorganic photoconductor drum | G02 |
| freq | 1 | 2 | 1 | 3 |


In [19]:
# Test Set
eval_df['input'] = 'Feature1: ' + eval_df.anchor + '; Feature2: ' + eval_df.target + '; Context: ' + eval_df.context
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)


In [20]:
# Metrics
def corr(x,y): return np.corrcoef(x,y)[0][1]
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
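The metric above is the Pearson correlation coefficient, which measures how linearly the predictions track the labels and ranges from -1 to 1. It is not an accuracy: it does not count exact matches. The sketch below, using made-up toy values and a hypothetical `pearson` helper, computes it from its definition and compares the result with `np.corrcoef`.

```python
import numpy as np

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up labels and predictions, for illustration only
toy_labels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
toy_preds = np.array([0.1, 0.2, 0.6, 0.7, 0.9])

r = pearson(toy_preds, toy_labels)
print(np.isclose(r, np.corrcoef(toy_preds, toy_labels)[0][1]))  # same value as np.corrcoef
print(np.isclose(r, pearson(2 * toy_preds + 3, toy_labels)))    # unchanged by rescaling and shifting
```

Because the coefficient ignores scale and offset, a model can score well even when its raw outputs fall slightly outside [0, 1].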

## Training

In [21]:
from transformers import TrainingArguments, Trainer

In [22]:
# HyperParameters
bs = 128
epochs = 4
lr = 8e-5

In [23]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
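The combination of `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` means the learning rate ramps up linearly over the first 10% of training steps and then decays along a half cosine towards zero. The sketch below reproduces that shape in plain Python. It is an approximation for intuition rather than the library's own code, `lr_at` is a hypothetical helper, and the step count assumes a single device with no gradient accumulation.

```python
import math

def lr_at(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Assumed step count: 26,625 training rows, batch size 128, 4 epochs, one device
steps_per_epoch = math.ceil(26625 / 128)   # 209
total_steps = steps_per_epoch * 4          # 836

for step in (0, 42, 84, 460, 836):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

Under these assumptions the rate peaks at `8e-5` at step 84 and is back to roughly half of that by step 460.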

Create the model and a `Trainer`, which combines our data and model.

In [24]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokenizer, compute_metrics=corr_d)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
trainer.train();

| Epoch | Training Loss | Validation Loss | Pearson |
|---|---|---|---|
| 1 | No log | 0.025053 | 0.807928 |
| 2 | No log | 0.021268 | 0.826449 |
| 3 | 0.028800 | 0.021663 | 0.835616 |
| 4 | 0.028800 | 0.022188 | 0.837072 |


A Pearson correlation of about 0.84 on the validation set is pretty good. Note that this is a correlation, not an accuracy. Now let's predict on our test set.

In [26]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

array([[ 0.61962891],
       [ 0.69873047],
       [ 0.56054688],
       [ 0.34936523],
       [-0.01913452],
       [ 0.56445312],
       [ 0.52001953],
       [-0.02534485],
       [ 0.33349609],
       [ 1.10546875],
       [ 0.21020508],
       [ 0.25878906],
       [ 0.72558594],
       [ 0.89306641],
       [ 0.77587891],
       [ 0.43530273],
       [ 0.29077148],
       [-0.00983429],
       [ 0.65185547],
       [ 0.30981445],
       [ 0.52001953],
       [ 0.2199707 ],
       [ 0.08215332],
       [ 0.22973633],
       [ 0.59082031],
       [-0.02456665],
       [-0.03564453],
       [-0.0496521 ],
       [-0.04367065],
       [ 0.4753418 ],
       [ 0.29443359],
       [-0.00561142],
       [ 0.68652344],
       [ 0.54882812],
       [ 0.45458984],
       [ 0.23803711]])

In [27]:
# Clip the predictions so they lie between 0 and 1
preds = np.clip(preds, 0, 1)

Finally, let's export our predictions.

In [28]:
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)


1052

## Reflections and Next Steps

This notebook used the HuggingFace Transformers library, and it took a fair amount of studying to learn how it should be used, for example that the target column has to be called 'labels'. It was good experience, since the library is so popular in the field. I also noticed the power of fine-tuning a pre-trained model: even with a small model, the validation Pearson correlation was already about 0.81 after the first epoch and rose to about 0.84 after only 4 epochs of training.

There were definitely some design choices that could be optimized.

For the input feature, I arbitrarily chose the labels 'Feature1', 'Feature2' and 'Context'. There may be some gain from picking better tokens to differentiate each field.

The validation set could have been chosen better. It was built with the `train_test_split` function, which picks rows at random, so the same anchors and contexts can appear in both splits. That makes it an unreliable way to get a validation set. A possibly better way is to hold out whole groups, for example keeping some contexts out of the training set entirely so that the validation set contains only unseen contexts. Would this make the validation more trustworthy?
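The idea of leaving unseen contexts in the validation set could be sketched as a group-wise split. This is a minimal illustration on made-up rows, not something that was run for this notebook. `group_split` is a hypothetical helper, and whether to group by `context` or by another column such as `anchor` is left open through the `group_col` argument.

```python
import numpy as np
import pandas as pd

def group_split(df, group_col, valid_frac=0.25, seed=42):
    """Hold out whole groups so no value of `group_col` appears in both splits."""
    rng = np.random.default_rng(seed)
    groups = rng.permutation(sorted(df[group_col].unique()))
    n_valid = max(1, round(len(groups) * valid_frac))
    valid_groups = groups[:n_valid].tolist()
    is_valid = df[group_col].isin(valid_groups)
    return df[~is_valid], df[is_valid]

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "anchor":  ["abatement", "abatement", "wood article",
                "wood article", "el display", "el display"],
    "context": ["A47", "A47", "B44", "B44", "G02", "H01"],
})
train_part, valid_part = group_split(toy, "context")
print(sorted(train_part["context"].unique()), sorted(valid_part["context"].unique()))
```

Every context ends up in exactly one of the two parts, so the validation score reflects how the model handles contexts it has never seen.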