In [1]:
# import necessary libraries and packages 
import os
import pandas as pd 
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
! pip install -q datasets transformers sentencepiece

creds = '{"username":"<your-username>","key":"<your-api-key>"}'                                     # Kaggle API credentials (placeholders; never publish a real key)
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

path = Path('us-patent-phrase-to-phrase-matching')                                                  # path for the data of the competition
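
Hardcoding an API key in a notebook makes it easy to leak. As a minimal sketch, the same `kaggle.json` file can be built from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables instead. The helper name `write_kaggle_credentials` is hypothetical and only illustrates the idea; it assumes both variables are set before the cell runs.

```python
import json
import os
from pathlib import Path

def write_kaggle_credentials(cred_path: Path) -> bool:
    """Write kaggle.json from environment variables; return True if a file was written."""
    username = os.environ.get("KAGGLE_USERNAME")
    key = os.environ.get("KAGGLE_KEY")
    # Do nothing if the file already exists or the variables are missing.
    if cred_path.exists() or not (username and key):
        return False
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    cred_path.chmod(0o600)  # readable and writable by the owner only
    return True
```

It could then be called as `write_kaggle_credentials(Path('~/.kaggle/kaggle.json').expanduser())` in place of the hardcoded string.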


In [2]:
if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)                                                 # extract the data of the competition

Downloading us-patent-phrase-to-phrase-matching.zip to /content


100%|██████████| 682k/682k [00:00<00:00, 2.43MB/s]

# Import data and exploratory data analysis 

The first phase, before training the model, is exploratory data analysis. For this task, we first have to import the train and test datasets.

In [3]:
if iskaggle:
    path = Path('../input/us-patent-phrase-to-phrase-matching')
    ! pip install -q datasets                                                                       # install the datasets library

!ls {path}

sample_submission.csv  test.csv  train.csv


To work with CSV files we can use the `pandas` library:

In [4]:
df = pd.read_csv(path/'train.csv')                                                                  # read train.csv into a DataFrame
df.head()

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.5
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.5
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.0


The objective of the competition is to match phrases in context, so that related points in different patent documents can be connected. This dataset contains the following columns:
- `id` - a unique identifier for a pair of phrases
- `anchor` - the first phrase
- `target` - the second phrase
- `context` - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored
- `score` - the similarity between the two phrases, which is the value the model has to predict

In [5]:
df.describe(include='object')                                                                       # brief description of the dataset 

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,37d61fd2272659b1,component composite coating,composition,H01
freq,1,152,24,2186


In [6]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor 
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
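
The column expression above concatenates three fields into one string per row. As a small illustration, the same template can be written as a plain function for a single row. The function name `build_input` is hypothetical; the markers `TEXT1`, `TEXT2` and `ANC1` are the ones used in the cell above.

```python
def build_input(context: str, target: str, anchor: str) -> str:
    """Combine the three text fields of one row into a single model input."""
    return f"TEXT1: {context}; TEXT2: {target}; ANC1: {anchor}"

# Row 4 of the training data shown earlier.
print(build_input("A47", "forest region", "abatement"))
# TEXT1: A47; TEXT2: forest region; ANC1: abatement
```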

In [7]:
from datasets import Dataset, DatasetDict

The Hugging Face libraries store data in a `Dataset` object, so we convert our pandas DataFrame into one:

In [8]:
ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

We can't pass raw text directly to a deep learning model. We first have to do two things:
- **Tokenization**: split each text into tokens (words or pieces of words)
- **Numericalization**: convert each token into a number
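
The two steps can be sketched with a toy example in plain Python. This is only an illustration under simplifying assumptions: it splits on words and punctuation and builds its vocabulary from the texts it sees, whereas real tokenizers, such as the pretrained one loaded later in this notebook, work on subword pieces and ship with a fixed vocabulary. All function names here are hypothetical.

```python
import re

def tokenize(text: str) -> list[str]:
    """Toy tokenizer: lowercase, then split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(texts: list[str]) -> dict[str, int]:
    """Assign the next free integer to every token not seen before."""
    vocab = {"<unk>": 0}  # id 0 is reserved for unknown tokens
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def numericalize(text: str, vocab: dict[str, int]) -> list[int]:
    """Map each token to its id, falling back to the unknown id."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]

vocab = build_vocab(["abatement of pollution", "act of abating"])
ids = numericalize("abatement of noise", vocab)
print(ids)  # [1, 2, 0] -> "noise" is not in the vocabulary
```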

With this in mind, we have to pick a model for our problem, because the tokenizer depends on it. There is a huge number of models to choose from; here we use:

In [9]:
model_nm = 'microsoft/deberta-v3-small'

`AutoTokenizer.from_pretrained` creates an appropriate tokenizer for the selected model:

In [10]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)                                                      # create an appropriate tokenizer for the given model

Downloading:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/578 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")                                              # example of a text tokenized

['▁G',
 "'",
 'day',
 '▁folks',
 ',',
 '▁I',
 "'",
 'm',
 '▁Jeremy',
 '▁from',
 '▁fast',
 '.',
 'ai',
 '!']

We can define a function to tokenize any input:

In [12]:
def tok_func(x): return tokz(x["input"])                                                            # tokenize the "input" column

We can then use it to tokenize every row of the dataset, processing the rows in batches:

In [13]:
tok_ds = ds.map(tok_func, batched=True)

#row = tok_ds[0]                                                                                    # example of a tokenized text
#row['input'], row['input_ids']

  0%|          | 0/37 [00:00<?, ?ba/s]
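
With `batched=True`, the mapped function receives a dictionary of columns (each a list of values) rather than one row at a time. The sketch below imitates that behaviour in plain Python; `map_batched` and `toy_tok_func` are hypothetical names, and the batch size of 1000 is the default documented for `Dataset.map`, which is consistent with the 37 batches shown in the progress bar.

```python
import math

def map_batched(rows, func, batch_size=1000):
    """Apply func to column-oriented batches and collect the new columns."""
    out = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in func(batch).items():
            out.setdefault(key, []).extend(values)
    return out

def toy_tok_func(batch):
    """Stand-in for a tokenizer: one output value per input text."""
    return {"n_chars": [len(text) for text in batch["input"]]}

rows = [{"input": "a"}, {"input": "bb"}, {"input": "ccc"}]
result = map_batched(rows, toy_tok_func, batch_size=2)
print(result)                      # {'n_chars': [1, 2, 3]}
print(math.ceil(36473 / 1000))     # 37 batches for the training data
```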

The tokenizer has a dictionary called `vocab` that assigns a unique number to every possible token string. For example, for the word "abatement":

In [14]:
tokz.vocab['▁abatement']

47284

In [15]:
# now for the labels: Transformers expects the label column to be called "labels"

tok_ds = tok_ds.rename_columns({'score':'labels'})

## Test and validation sets

In [16]:
eval_df = pd.read_csv(path/'test.csv')                                                              # this is the test set 
eval_df.describe()

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,el display,inorganic photoconductor drum,G02
freq,1,2,1,3


Now let's split our training data into training and validation sets. Note that `train_test_split` names the validation portion `test`:

In [17]:
dds = tok_ds.train_test_split(0.25,seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})
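
The row counts above are consistent with rounding the 25% validation share up to a whole row. The sketch below reproduces that arithmetic and shows a seeded shuffle split in plain Python. It is an illustration only: the helper names are hypothetical, and Python's `random` module will not select the same rows as the `datasets` library does for `seed=42`.

```python
import math
import random

def split_sizes(n_rows, test_size):
    """Return (n_train, n_valid), rounding the validation share up."""
    n_valid = math.ceil(test_size * n_rows)
    return n_rows - n_valid, n_valid

def seeded_split(items, test_size, seed):
    """Shuffle indices with a fixed seed, then cut them into two parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_train, _ = split_sizes(len(items), test_size)
    train = [items[i] for i in indices[:n_train]]
    valid = [items[i] for i in indices[n_train:]]
    return train, valid

print(split_sizes(36473, 0.25))  # (27354, 9119)
```

Fixing the seed makes the split reproducible, so validation scores from different runs are measured on the same rows.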

In [18]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_df.head()

Unnamed: 0,id,anchor,target,context,input
0,4112d61851461f60,opc drum,inorganic photoconductor drum,G02,TEXT1: G02; TEXT2: inorganic photoconductor dr...
1,09e418c93a776564,adjust gas flow,altering gas flow,F23,TEXT1: F23; TEXT2: altering gas flow; ANC1: ad...
2,36baf228038e314b,lower trunnion,lower locating,B60,TEXT1: B60; TEXT2: lower locating; ANC1: lower...
3,1f37ead645e7f0c8,cap component,upper portion,D06,TEXT1: D06; TEXT2: upper portion; ANC1: cap co...
4,71a5b6ad068d531f,neural stimulation,artificial neural network,H04,TEXT1: H04; TEXT2: artificial neural network; ...


In [19]:
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

eval_ds

  0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36
})

In [20]:
import numpy as np

def corr(x,y): return np.corrcoef(x,y)[0][1]
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
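
The metric above is the Pearson correlation coefficient between predictions and labels. As a dependency-free sketch of what `np.corrcoef(x,y)[0][1]` computes for two one-dimensional sequences, the formula can be written out directly (the function name `pearson` and the sample values are illustrative only):

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

preds = [0.1, 0.4, 0.5, 0.9]       # made-up predictions
labels = [0.0, 0.25, 0.5, 1.0]     # made-up similarity scores
print(pearson(preds, labels))
```

Because the coefficient only measures linear agreement, shifting or rescaling the predictions by a positive factor leaves it unchanged.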

In [21]:
from transformers import TrainingArguments, Trainer 

In [22]:
bs = 128                                                                                            # batch size that fits on our GPU
epochs = 4 

In [23]:
lr = 8e-5                                                                                           # learning rate for our model 

In [24]:
args = TrainingArguments('outputs', 
                         learning_rate=lr, 
                         warmup_ratio=0.1, 
                         lr_scheduler_type='cosine', 
                         fp16=True, 
                         evaluation_strategy="epoch", 
                         per_device_train_batch_size=bs, 
                         per_device_eval_batch_size=bs*2, 
                         num_train_epochs=epochs, 
                         weight_decay=0.01, 
                         report_to='none')
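
To make `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` more concrete, here is a rough sketch of the learning rate such a schedule would produce at each step: a linear ramp up during warmup, followed by half a cosine wave down to zero. The function name `lr_at_step` is hypothetical, and the step count assumes a single device with no gradient accumulation, using the 27,354 training rows from the split above.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = math.ceil(27354 / 128)   # 214 batches per epoch
total_steps = steps_per_epoch * 4          # 856 steps over 4 epochs

for step in (0, 43, 86, 471, 856):
    print(step, lr_at_step(step, total_steps, 8e-5, 0.1))
```

The total of 856 steps matches the number of optimization steps reported in the training log further down.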

Now we can create our model and train it:

In [25]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels = 1)
trainer = Trainer(model, args, 
                  train_dataset=dds['train'], 
                  eval_dataset = dds['test'], 
                  tokenizer = tokz, 
                  compute_metrics = corr_d)

Downloading:   0%|          | 0.00/286M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.classifier.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.dense.bias', 'mask_predictions.classifier.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

In [26]:
trainer.train();

The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: id, anchor, context, target, input. If id, anchor, context, target, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 27354
  Num Epochs = 4
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 856
  Number of trainable parameters = 141895681
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.02473,0.796711
2,No log,0.025609,0.821054
3,0.034300,0.022936,0.8303
4,0.034300,0.022419,0.831828


The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: id, anchor, context, target, input. If id, anchor, context, target, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9119
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: id, anchor, context, target, input. If id, anchor, context, target, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9119
  Batch size = 256
Saving model checkpoint to outputs/checkpoint-500
Configuration saved in outputs/checkpoint-500/config.json
Model weights saved in outputs/checkpoint-500/pytorch_model.bin
tokenizer config file saved 