This is my first time doing anything NLP-related, so I basically copied the "Getting started with NLP for absolute beginners" notebook by Jeremy Howard. You can find the original notebook on Kaggle: https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners

===========================================================================================================================

### Import and prepare dataset

In [1]:
import os

folder_path = './patient_phrase_data'
dataset_name = 'us-patent-phrase-to-phrase-matching'

if not os.path.exists(folder_path):
    os.makedirs(folder_path)

You need to set the KAGGLE_USERNAME and KAGGLE_KEY environment variables to use the kaggle command. You can get both values from your Kaggle profile (profile icon -> Settings -> API -> Create New Token). Kaggle API docs: https://github.com/Kaggle/kaggle-api
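
A minimal sketch of checking the credentials before calling the CLI. The helper name `missing_kaggle_vars` is hypothetical; only the two variable names come from the text above.

```python
import os

# The two variables the kaggle command reads for authentication.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")

def missing_kaggle_vars(env=None):
    """Return the names of required Kaggle variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_kaggle_vars()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Running this in a cell before the download gives a clearer message than a failed `kaggle` call.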

Link to the competition data: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data (make sure you accept the competition's rules before downloading).

In [2]:
!kaggle competitions download -c {dataset_name} -p {folder_path}

Downloading us-patent-phrase-to-phrase-matching.zip to ./patient_phrase_data






In [3]:
import zipfile

zipfile.ZipFile(f'{folder_path}/{dataset_name}.zip').extractall(folder_path)

In [2]:
!ls {folder_path}

sample_submission.csv
test.csv
train.csv
us-patent-phrase-to-phrase-matching.zip


In [3]:
import pandas as pd

In [4]:
df = pd.read_csv(f'{folder_path}/train.csv')
df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


In [5]:
df.describe()

Unnamed: 0,score
count,36473.0
mean,0.362062
std,0.258335
min,0.0
25%,0.25
50%,0.25
75%,0.5
max,1.0


In [6]:
df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,37d61fd2272659b1,component composite coating,composition,H01
freq,1,152,24,2186


We can see that in the 36473 rows, there are 733 unique anchors, 106 contexts, and nearly 30000 targets. Some anchors are very common, with "component composite coating" for instance appearing 152 times.
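
To make the `unique`, `top` and `freq` rows less abstract, here is a small sketch that computes the same summary by hand. The `anchors` list is hypothetical toy data, not the real column.

```python
from collections import Counter

# Hypothetical toy column; the real 'anchor' column has 36473 values.
anchors = ["abatement", "abatement", "wood article", "abatement", "wood article"]

counts = Counter(anchors)
top, freq = counts.most_common(1)[0]  # most frequent value and its count

summary = {
    "count": len(anchors),   # number of rows
    "unique": len(counts),   # number of distinct values
    "top": top,
    "freq": freq,
}
print(summary)
```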

The model needs a single string per row, so we combine the three text columns into one, something like "TEXT1: A47; TEXT2: eliminating process; ANC1: abatement". Here TEXT1 holds the context, TEXT2 the target, and ANC1 the anchor. In pandas, we just use + to concatenate, like so:

In [5]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
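
The same layout for a single row, written as a plain Python function. This is only a sketch to show the format; `build_input` is a hypothetical helper that the notebook does not use.

```python
def build_input(context: str, target: str, anchor: str) -> str:
    """Assemble one model input in the same layout as the pandas expression."""
    return f"TEXT1: {context}; TEXT2: {target}; ANC1: {anchor}"

# Values taken from the first row of train.csv shown above.
example = build_input("A47", "abatement of pollution", "abatement")
print(example)
```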

### Tokenization

In [6]:
from huggingface_hub import HfApi

In [7]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

In [8]:
model_nm = 'microsoft/deberta-v3-small'

In [9]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokz = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
tokz.tokenize("Ladies and gentelman, I'm Mateusz Czajka and I'm doing fast.ai course!")

['▁Ladies',
 '▁and',
 '▁gent',
 'elman',
 ',',
 '▁I',
 "'",
 'm',
 '▁Mate',
 'usz',
 '▁C',
 'za',
 'jka',
 '▁and',
 '▁I',
 "'",
 'm',
 '▁doing',
 '▁fast',
 '.',
 'ai',
 '▁course',
 '!']
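
In the output above, tokens that start a new word carry a `▁` prefix, while continuation pieces such as `elman` do not. Assuming that convention holds, the sketch below rebuilds the text from the tokens. `join_tokens` is a hypothetical helper, not part of the tokenizer API.

```python
def join_tokens(tokens):
    """Rebuild text from subword tokens, treating '▁' as a word-start marker."""
    return "".join(tokens).replace("▁", " ").strip()

# First part of the token list printed above.
tokens = ['▁Ladies', '▁and', '▁gent', 'elman', ',', '▁I', "'", 'm',
          '▁Mate', 'usz', '▁C', 'za', 'jka']
print(join_tokens(tokens))
```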

In [10]:
def tokenize(x): 
    return tokz(x['input'])

In [11]:
tokz_ds = ds.map(tokenize, batched=True) 
tokz_ds


Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

In [11]:
row = tokz_ds[0]

In [14]:
row

{'id': '37d61fd2272659b1',
 'anchor': 'abatement',
 'target': 'abatement of pollution',
 'context': 'A47',
 'score': 0.5,
 'input': 'TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 'input_ids': [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [14]:
row['input']

'TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement'

In [15]:
row['input_ids']

[1,
 54453,
 435,
 294,
 336,
 5753,
 346,
 54453,
 445,
 294,
 47284,
 265,
 6435,
 346,
 23702,
 435,
 294,
 47284,
 2]

In [16]:
row['token_type_ids']

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [17]:
row['attention_mask']

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [18]:
tokz.vocab['▁of']

265

Transformers always assumes that the labels are in a column named labels, so we need to rename the score column to labels:

In [12]:
tokz_ds = tokz_ds.rename_columns({'score': 'labels'})

### Test and validation sets

In [13]:
eval_df = pd.read_csv(f'{folder_path}/test.csv')
eval_df.head()

Unnamed: 0,id,anchor,target,context
0,4112d61851461f60,opc drum,inorganic photoconductor drum,G02
1,09e418c93a776564,adjust gas flow,altering gas flow,F23
2,36baf228038e314b,lower trunnion,lower locating,B60
3,1f37ead645e7f0c8,cap component,upper portion,D06
4,71a5b6ad068d531f,neural stimulation,artificial neural network,H04


In [14]:
eval_df.describe()

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,el display,inorganic photoconductor drum,G02
freq,1,2,1,3


In [14]:
dds = tokz_ds.train_test_split(0.2, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 29178
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7295
    })
})
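
The row counts above can be checked by hand. The sketch below assumes the test share is rounded up and the rest goes to training, which is consistent with the printed sizes. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows: int, test_size: float):
    """Row counts for a fractional split: ceil for test, the rest for train."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_train, n_test = split_sizes(36473, 0.2)
print(n_train, n_test)
```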

In [24]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tokenize, batched=True)


### Metrics and correlation

In [28]:
import numpy as np

In [26]:
# Pearson correlation coefficient between x and y (off-diagonal entry of the 2x2 matrix)
def corr(x,y): return np.corrcoef(x,y)[0][1]

In [27]:
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
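
`corr` delegates the computation to `np.corrcoef`. As a rough illustration of what that number means, here is the Pearson formula in plain Python. `pearson` is a hypothetical name, and the notebook itself keeps using the NumPy version.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Perfectly aligned predictions give 1, perfectly reversed ones give -1.
print(pearson([0.0, 0.25, 0.5, 1.0], [0.0, 0.5, 1.0, 2.0]))
print(pearson([0.0, 0.25, 0.5, 1.0], [1.0, 0.75, 0.5, 0.0]))
```

Because the metric ignores scale and offset, a model can score well here even if its raw predictions fall slightly outside the 0 to 1 range.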

### Training

In [17]:
from transformers import TrainingArguments, Trainer

In [18]:
bs = 16
epochs = 4
lr = 8e-5

In [19]:
args = TrainingArguments(
    'outputs', 
    learning_rate=lr, 
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    fp16=True,
    evaluation_strategy='epoch',
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs,
    weight_decay=0.01,
    report_to='none'
)
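
To get a feel for what `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` do to the learning rate, here is a sketch of a linear warmup followed by cosine decay. It assumes a single device and no gradient accumulation (neither is stated above), and it approximates the shape of the schedule rather than reproducing the Trainer's exact implementation. `lr_at` is a hypothetical helper.

```python
import math

def lr_at(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# 29178 training rows, batch size 16, 4 epochs (values from this notebook).
steps_per_epoch = math.ceil(29178 / 16)
total_steps = steps_per_epoch * 4

for step in (0, 365, 730, total_steps // 2, total_steps):
    print(step, lr_at(step, total_steps))
```

The rate climbs from zero to `lr` over the first 10% of the steps, then falls smoothly back toward zero by the end of training.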

In [20]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

In [21]:
trainer = Trainer(
    model,
    args,
    train_dataset=dds['train'],
    eval_dataset=dds['test'],
    tokenizer=tokz,
    compute_metrics=corr_d
)

In [22]:
import torch
torch.cuda.empty_cache()
# torch.cuda.memory_summary(device=None, abbreviated=False)

In [29]:
trainer.train();

Epoch,Training Loss,Validation Loss,Pearson
1,0.0223,0.026893,0.806768
2,0.0133,0.021714,0.82325
3,0.0092,0.022262,0.827541
4,0.0084,0.021835,0.828479


In [32]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

array([[ 0.38525391],
       [ 0.75      ],
       [ 0.58447266],
       [ 0.28686523],
       [-0.01696777],
       [ 0.50976562],
       [ 0.48242188],
       [-0.01130676],
       [ 0.21264648],
       [ 1.05566406],
       [ 0.26635742],
       [ 0.26025391],
       [ 0.70849609],
       [ 0.93505859],
       [ 0.7265625 ],
       [ 0.46142578],
       [ 0.33911133],
       [-0.01318359],
       [ 0.54052734],
       [ 0.44677734],
       [ 0.36865234],
       [ 0.25073242],
       [ 0.24230957],
       [ 0.24328613],
       [ 0.52880859],
       [-0.01269531],
       [-0.00924683],
       [-0.01522827],
       [-0.01551056],
       [ 0.72460938],
       [ 0.31591797],
       [-0.01083374],
       [ 0.703125  ],
       [ 0.57861328],
       [ 0.33203125],
       [ 0.24584961]])

In [33]:
preds = np.clip(preds, 0, 1)
preds

array([[0.38525391],
       [0.75      ],
       [0.58447266],
       [0.28686523],
       [0.        ],
       [0.50976562],
       [0.48242188],
       [0.        ],
       [0.21264648],
       [1.        ],
       [0.26635742],
       [0.26025391],
       [0.70849609],
       [0.93505859],
       [0.7265625 ],
       [0.46142578],
       [0.33911133],
       [0.        ],
       [0.54052734],
       [0.44677734],
       [0.36865234],
       [0.25073242],
       [0.24230957],
       [0.24328613],
       [0.52880859],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.72460938],
       [0.31591797],
       [0.        ],
       [0.703125  ],
       [0.57861328],
       [0.33203125],
       [0.24584961]])

In [36]:
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)


1045