In [22]:
from torchvision.transforms import ToTensor
from datasets import Dataset,DatasetDict
from torch.utils.data import DataLoader
from torchvision import datasets
from pathlib import Path
from torch import nn
import kaggle
import matplotlib.pyplot as plt
import torch
import numpy
import os
import pandas as pd

One area where deep learning has dramatically improved in the last couple of years is natural language processing (NLP). Computers can now generate text, translate automatically from one language to another, analyze comments, label words in sentences, and much more.

Perhaps the most widely practically useful application of NLP is classification -- that is, classifying a document automatically into some category. This can be used, for instance, for:

- Sentiment analysis (e.g., are people saying positive or negative things about your product?)
- Author identification (what author most likely wrote some document?)
- Legal discovery (which documents are in scope for a trial?)
- Organizing documents by topic
- Triaging inbound emails
- ...and much more!

Classification models can also be used to solve problems that are not, at first, obviously appropriate. For instance, consider the Kaggle **U.S. Patent Phrase to Phrase Matching competition**. In this, we are tasked with comparing two words or short phrases, and scoring them based on whether they're similar or not, based on which patent class they were used in. With a score of 1 it is considered that the two inputs have identical meaning, and 0 means they have totally different meaning. For instance, abatement and eliminating process have a score of 0.5, meaning they're somewhat similar, but not identical.

It turns out that this can be represented as a classification problem. How? By representing the question like this:

For the following text...: *"TEXT1: abatement; TEXT2: eliminating process"* ...choose a category of meaning similarity: "Different; Similar; Identical".

In this notebook we'll see how to solve the Patent Phrase Matching problem by treating it as a classification task, by representing it in a very similar way to that shown above.
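As a rough sketch of this framing (not the notebook's actual pipeline), the snippet below formats a phrase pair as a prompt and buckets a numeric score into one of the three categories. The function names and the bucket cut-offs are assumptions chosen for illustration; the competition itself only defines the score scale.

```python
def build_prompt(text1: str, text2: str) -> str:
    """Format a phrase pair the way the example question above does."""
    return f"TEXT1: {text1}; TEXT2: {text2}"


def score_to_category(score: float) -> str:
    """Bucket a similarity score in [0, 1] into a coarse label.

    The cut-offs are hypothetical, picked only to illustrate the idea.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score == 1.0:
        return "Identical"
    if score == 0.0:
        return "Different"
    return "Similar"


prompt = build_prompt("abatement", "eliminating process")
label = score_to_category(0.5)
print(prompt, "->", label)
```

A real model would predict the category from the prompt; here the label is derived from the known score only to show how the two representations line up.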

---

# Setting up and getting the data ready

First we need to download the dataset from Kaggle.
Get your `kaggle.json` from Kaggle's website and place it in `User/{name}/.kaggle`, then install the Kaggle library.

```bash
pip install kaggle
```

In [23]:
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

creds = ''
cred_path = Path('./data/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

path = Path('us-patent-phrase-to-phrase-matching')

if not iskaggle and not path.exists():
    import zipfile
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

Documents in NLP datasets are generally in one of two main forms:

- Larger documents: One text file per document, often organised into one folder per category.
- Smaller documents: One document (or document pair, optionally with metadata) per row in a CSV file.
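To make the two layouts concrete, here is a minimal sketch that builds a tiny example of each in a temporary directory and loads it with the standard library only. The folder names, file names, and helper functions are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def load_folder_per_category(root: Path) -> list[tuple[str, str]]:
    """Return (text, label) pairs where each sub-folder name is the label."""
    records = []
    for file in sorted(root.glob("*/*.txt")):
        records.append((file.read_text(encoding="utf-8"), file.parent.name))
    return records


def load_csv_rows(csv_path: Path) -> list[dict[str, str]]:
    """Return one dict per row, keyed by the CSV header."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Form 1: larger documents, one file per document, one folder per category.
    for label, text in [("pos", "Great product."), ("neg", "Broke in a day.")]:
        folder = tmp_path / "reviews" / label
        folder.mkdir(parents=True)
        (folder / "doc0.txt").write_text(text, encoding="utf-8")

    # Form 2: smaller documents, one document pair per CSV row.
    csv_path = tmp_path / "pairs.csv"
    csv_path.write_text(
        "anchor,target,score\nabatement,eliminating process,0.5\n",
        encoding="utf-8",
    )

    folder_records = load_folder_per_category(tmp_path / "reviews")
    csv_records = load_csv_rows(csv_path)

print(folder_records)
print(csv_records)
```

The patent dataset used in this notebook follows the second form, which is why a tabular library is a natural fit.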

Let's take a look at our dataset.

In [24]:
ls {path}

sample_submission.csv  test.csv               train.csv


Our dataset consists of CSV files, so we can use **pandas**, a library for working with CSV files and other tabular data.

In [25]:
dataframe = pd.read_csv(path/'train.csv')
dataframe

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


It's important to carefully read the dataset description to understand how each of these columns is used.

One of the most useful features of `DataFrame` is the `describe()` method:

In [26]:
dataframe.describe()

Unnamed: 0,score
count,36473.0
mean,0.362062
std,0.258335
min,0.0
25%,0.25
50%,0.25
75%,0.5
max,1.0


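The numeric summary above covers only `score`. A summary of the text columns (for example via `dataframe.describe(include='object')`) shows that in the 36473 rows, there are 733 unique anchors, 106 contexts, and nearly 30000 targets. Some anchors are very common, with "component composite coating" for instance appearing 152 times.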
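The figures quoted here (number of unique values, most frequent value, and its frequency per text column) can also be worked out by hand. The sketch below does so with the standard library on a hypothetical four-row sample rather than the real file; `summarize_column` is an illustrative helper, not a pandas function.

```python
from collections import Counter

# Hypothetical sample rows mirroring the text columns of train.csv.
rows = [
    {"anchor": "abatement", "target": "act of abating", "context": "A47"},
    {"anchor": "abatement", "target": "forest region", "context": "A47"},
    {"anchor": "abatement", "target": "active catalyst", "context": "A47"},
    {"anchor": "wood article", "target": "wooden box", "context": "B44"},
]


def summarize_column(rows: list[dict[str, str]], column: str) -> dict[str, object]:
    """Return count, number of unique values, most common value, and its frequency."""
    counts = Counter(row[column] for row in rows)
    top, freq = counts.most_common(1)[0]
    return {"count": len(rows), "unique": len(counts), "top": top, "freq": freq}


summary = {col: summarize_column(rows, col) for col in ("anchor", "target", "context")}
for col, stats in summary.items():
    print(col, stats)
```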

Earlier, I suggested we could represent the input to the model as something like "TEXT1: abatement; TEXT2: eliminating process". We'll need to add the context to this too. In pandas, we just use `+` to concatenate string columns, like so:

In [27]:
dataframe['input'] = 'TEXT1: ' + dataframe.context + '; TEXT2: ' + dataframe.target + '; ANC1: ' + dataframe.anchor

We can refer to a column (also known as a series) either using regular Python "dotted" notation, or by accessing it like a dictionary. To get the first few rows, use `head()`:

In [28]:
dataframe.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

Neural Networks work with numbers. Therefore, we need to convert our strings into numbers.

We need to take two steps to achieve that.

**1- [Tokenization](00_what_is_what.ipynb#tokenization):** Split each text up into words (or actually, as we'll see, into tokens)

**2- [Numericalization](00_what_is_what.ipynb#numericalization) (Vectorization):** Convert each word (or token) into a number.
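The two steps can be sketched with a deliberately naive tokenizer. This is only an illustration under simple assumptions (lowercasing and whitespace splitting); the helper names are hypothetical, and the real tokenizer used later works on learned sub-word pieces instead.

```python
def toy_tokenize(text: str) -> list[str]:
    """Split on whitespace; real tokenizers use learned sub-word rules."""
    return text.lower().split()


def build_vocab(texts: list[str]) -> dict[str, int]:
    """Assign each distinct token the next free integer, in order of first use."""
    vocab: dict[str, int] = {}
    for text in texts:
        for token in toy_tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def numericalize(text: str, vocab: dict[str, int]) -> list[int]:
    """Replace each token with its integer ID from the vocabulary."""
    return [vocab[token] for token in toy_tokenize(text)]


corpus = ["abatement of pollution", "act of abating"]
vocab = build_vocab(corpus)
ids = numericalize("act of pollution", vocab)
print(vocab)
print(ids)
```

Note that this toy version raises a `KeyError` for any word it has not seen, which is one reason practical tokenizers fall back to smaller sub-word units.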

## 1- Tokenization

We can use the Hugging Face `transformers` library to take care of tokenization for us.
It works with Hugging Face `Dataset` objects, so we first turn our pandas dataframe into one.

In [29]:
hf_dataset = Dataset.from_pandas(dataframe)
hf_dataset

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

Once the text is tokenized into smaller units (words, subwords, etc.), the next step is to convert these tokens into a format that a machine learning model can understand. This is called numericalization.

The details of how this is done depend on the particular model we use, so first we need to pick one. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is `microsoft/deberta-v3-small`, which we select below (once you've finished exploring, replace "small" with "large" for a slower but more accurate model).

In general, a tokenizer's vocabulary works like a dictionary: given a token from your split text, it returns a number. In the `deberta-v3` vocabulary, for example, the number for the word "of" is `265`.

`AutoTokenizer` will create a tokenizer appropriate for a given model:

In [36]:
model_nm = 'microsoft/deberta-v3-small'

In [37]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Here's an example of how the tokenizer splits a text into "tokens" (which are like words, but can be sub-word pieces, as you see below):

In [38]:
tokenizer.tokenize("G'day folks, I'm Jeremy from fast.ai!")

['▁G',
 "'",
 'day',
 '▁folks',
 ',',
 '▁I',
 "'",
 'm',
 '▁Jeremy',
 '▁from',
 '▁fast',
 '.',
 'ai',
 '!']
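In this output, the `▁` prefix marks a token that begins a new word, while tokens without it continue the previous one. As a small sketch of that convention, the hypothetical helper below reassembles the original sentence from the token list; the tokenizer's own decoding logic may handle more cases than this.

```python
WORD_START = "▁"  # marker that prefixes tokens starting a new word


def detokenize(tokens: list[str]) -> str:
    """Join sub-word pieces and turn word-start markers back into spaces."""
    return "".join(tokens).replace(WORD_START, " ").strip()


tokens = ["▁G", "'", "day", "▁folks", ",", "▁I", "'", "m",
          "▁Jeremy", "▁from", "▁fast", ".", "ai", "!"]
text = detokenize(tokens)
print(text)
```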

Let's create a function to tokenize our string inputs:

In [51]:
def tokenize(batch: dict): return tokenizer(batch['input'])

To run this quickly in parallel on every row in our dataset, use `map` with `batched=True`:

In [53]:
tokenized_datasets = hf_dataset.map(tokenize, batched=True)
tokenized_datasets

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
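With `batched=True`, the mapped function receives a dictionary of lists (one list of values per column) rather than a single row, and the lists it returns become new columns. The following is a simplified stand-in for that behaviour in plain Python; `batched_map` and `toy_tokenize_batch` are hypothetical names, and the real library does considerably more (caching, typing, multiprocessing options).

```python
from typing import Callable

Batch = dict[str, list]


def batched_map(rows: list[dict], fn: Callable[[Batch], Batch], batch_size: int = 2) -> list[dict]:
    """Apply fn to column-wise batches and merge the new columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Column-wise view: one list of values per column name.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{key: values[i] for key, values in new_columns.items()}})
    return out


def toy_tokenize_batch(batch: Batch) -> Batch:
    """Stand-in for the real tokenizer: whitespace-split every input string."""
    return {"tokens": [text.split() for text in batch["input"]]}


rows = [{"input": "act of abating"}, {"input": "forest region"}, {"input": "wooden box"}]
result = batched_map(rows, toy_tokenize_batch)
print(result)
```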

This adds new items to our dataset, the most important of which is `input_ids` (alongside `token_type_ids` and `attention_mask`). For instance, here is the input and IDs for the first row of our data:

In [54]:
row = tokenized_datasets[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

So, what are those IDs and where do they come from? The secret is that the tokenizer holds a dictionary called `vocab`, which maps every possible token string to a unique integer. We can look tokens up like this, for instance to find the ID of the token for the word "of":

In [55]:
tokenizer.vocab['▁of']

265
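Going the other way, from IDs back to token strings, only requires inverting that mapping. Here is a minimal sketch on a toy vocabulary: the `▁of` to `265` pair is taken from the output above, while every other entry and the `decode_ids` helper are hypothetical.

```python
# Toy vocabulary: only the "▁of" -> 265 entry comes from the notebook output;
# every other token/ID pair here is hypothetical.
toy_vocab = {"[CLS]": 0, "[SEP]": 1, "▁of": 265, "▁act": 300, "▁abating": 301}

# Invert the mapping so IDs can be turned back into token strings.
id_to_token = {idx: token for token, idx in toy_vocab.items()}


def decode_ids(ids: list[int]) -> list[str]:
    """Look up each ID, falling back to a placeholder for unknown IDs."""
    return [id_to_token.get(i, "[UNK]") for i in ids]


decoded = decode_ids([0, 300, 265, 301, 1, 9999])
print(decoded)
```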

## 2- Numericalization (Vectorization)