# Harvard USPTO Patent Dataset (HUPD)

## Loading the Dataset Using Hugging Face's Datasets and Transformers Libraries

In this tutorial, we show how to load and work with HUPD using Hugging Face's Datasets and Transformers libraries.

In [1]:
## Import relevant libraries and dependencies
# Pretty print
from pprint import pprint
# Datasets load_dataset function
from datasets import load_dataset
# Transformers AutoTokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
# Standard PyTorch DataLoader
from torch.utils.data import DataLoader

Let's use the `load_dataset` function to load all the patent applications that were filed with the USPTO in January 2016. We specify the date ranges of the training and validation sets as January 1-21, 2016 and January 22-31, 2016, respectively.

In [2]:
# Data loading example
dataset_dict = load_dataset('/mnt/data/HUPD/patents-project-dataset/datasets/patents/patents.py', 
    data_dir='/mnt/data/HUPD/distilled',
    cache_dir='/mnt/data/HUPD/cache',
    icpr_label=None,
    train_filing_start_date='2016-01-01',
    train_filing_end_date='2016-01-21',
    val_filing_start_date='2016-01-22',
    val_filing_end_date='2016-01-31',
)

print('Loading is done!')

Using custom data configuration default
Reusing dataset patents (/mnt/data/HUPD/cache/patents/default-f6746976a4961295/1.0.1/704348b414e8c2991a15841dda7af72c4d35249bc4c98b06c41e6deb8b2367e8)


Loading is done!
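The four date arguments above decide which split each application lands in. The following is a minimal sketch of that partitioning logic, not the loader's actual implementation: `assign_split` is a hypothetical helper, and it assumes that both ends of each date range are inclusive.

```python
from datetime import date

# Hypothetical helper (not part of the HUPD loading script): assigns a filing
# date to a split, ASSUMING both ends of each range are inclusive.
def assign_split(filing_date, train_range, val_range):
    if train_range[0] <= filing_date <= train_range[1]:
        return 'train'
    if val_range[0] <= filing_date <= val_range[1]:
        return 'validation'
    return None  # outside both ranges

# Same date ranges as in the load_dataset call above
train_range = (date(2016, 1, 1), date(2016, 1, 21))
val_range = (date(2016, 1, 22), date(2016, 1, 31))

for d in (date(2016, 1, 21), date(2016, 1, 22), date(2016, 2, 1)):
    print(d.isoformat(), assign_split(d, train_range, val_range))
```

Because the validation range starts the day after the training range ends, the two sets are disjoint and the validation set contains only later filings.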


Let's display some information about the training and validation sets.

In [3]:
# Dataset info
print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],
        num_rows: 17614
    })
    validation: Dataset({
        features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],
        num_rows: 9194
    })
})


We can also pretty-print the dataset dictionary together with the cache files that back it, and then print the sizes of the training and validation sets.

In [4]:
# Print dataset dictionary contents and cache directory
print('Dataset dictionary contents:')
pprint(dataset_dict)
print('Dataset dictionary cached to:')
pprint(dataset_dict.cache_files)

Dataset dictionary contents:
{'train': Dataset({
    features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],
    num_rows: 17614
}),
 'validation': Dataset({
    features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],
    num_rows: 9194
})}
Dataset dictionary cached to:
{'train': [{'filename': '/mnt/data/HUPD/cache/patents/default-f6746976a4961295/1.0.1/704348b414e8c2991a15841dda7af72c4d35249bc4c98b06c41e6deb8b2367e8/patents-train.arrow',
            'skip': 0,
            'take': 17614}],
 'validation': [{'filename': '/mnt/data/HUPD/cache/patents/default-f6746976a4961295/1.0.1/704348b414e8c2991a15841dda7af72c4d35249bc4c98b06c41e6deb8b2367e8/patents-validation.arrow',
                 'ski

In [5]:
# Print info about the sizes of the train and validation sets
print(f'Train dataset size: {dataset_dict["train"].shape}')
print(f'Validation dataset size: {dataset_dict["validation"].shape}')

Train dataset size: (17614, 14)
Validation dataset size: (9194, 14)


## Pre-Processing Steps

First, let's establish the label-to-index mapping for the decision status field, which assigns a class index to each decision status label.

In [6]:
# Label-to-index mapping for the decision status field
decision_to_str = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 2, 'CONT-REJECTED': 3, 'CONT-ACCEPTED': 4, 'CONT-PENDING': 5}

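# Helper function: replaces the decision label with its integer class index
# (despite the name, the returned value is an index, not a string)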
def map_decision_to_string(example):
    return {'decision': decision_to_str[example['decision']]}
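To see what the helper above does to a single record, here is a small sketch that runs without the dataset. The two toy records are made up for illustration, and the list comprehension only emulates how `Dataset.map` merges the returned dictionary back into each example.

```python
# Same mapping and helper as in the cell above, repeated so this snippet
# runs on its own
decision_to_str = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 2,
                   'CONT-REJECTED': 3, 'CONT-ACCEPTED': 4, 'CONT-PENDING': 5}

def map_decision_to_string(example):
    return {'decision': decision_to_str[example['decision']]}

# Toy records; the identifiers are invented for illustration only
toy_examples = [
    {'patent_number': 'A', 'decision': 'ACCEPTED'},
    {'patent_number': 'B', 'decision': 'CONT-REJECTED'},
]

# Emulate Dataset.map: the returned dict overwrites matching fields
relabeled = [{**ex, **map_decision_to_string(ex)} for ex in toy_examples]
print(relabeled)
```

Only the `decision` field changes; every other field is carried over untouched.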

Let's now re-label the decision status fields of the examples in the training and validation sets.

In [7]:
# Re-labeling/mapping.
train_set = dataset_dict['train'].map(map_decision_to_string)
val_set = dataset_dict['validation'].map(map_decision_to_string)


In [8]:
# Display the cached directories of the processed train and validation sets
print('Processed train and validation sets are cached to: ')
pprint(train_set.cache_files)
pprint(val_set.cache_files)

Processed train and validation sets are cached to: 
[{'filename': '/mnt/data/HUPD/cache/patents/default-f6746976a4961295/1.0.1/704348b414e8c2991a15841dda7af72c4d35249bc4c98b06c41e6deb8b2367e8/cache-96136822676cc6f9.arrow'}]
[{'filename': '/mnt/data/HUPD/cache/patents/default-f6746976a4961295/1.0.1/704348b414e8c2991a15841dda7af72c4d35249bc4c98b06c41e6deb8b2367e8/cache-a5514eba224d4829.arrow'}]


For the time being, let's focus on the _abstract_ section of the patent applications.

In [9]:
# Select the section to tokenize (here, the abstract)
_SECTION_ = 'abstract'

In [10]:
# Training set
train_set = train_set.map(
    lambda e: tokenizer(e[_SECTION_], truncation=True, padding='max_length'),
    batched=True)

In [11]:
# Validation set
val_set = val_set.map(
    lambda e: tokenizer(e[_SECTION_], truncation=True, padding='max_length'),
    batched=True)

In [12]:
# Set the format
train_set.set_format(type='torch', 
    columns=['input_ids', 'attention_mask', 'decision'])

val_set.set_format(type='torch', 
    columns=['input_ids', 'attention_mask', 'decision'])
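The tokenization calls above pass `truncation=True` and `padding='max_length'`, so every example ends up with the same number of token ids. The sketch below illustrates that idea on plain Python lists. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation; in particular, the real tokenizer also preserves special tokens such as `[SEP]` when it truncates, which this sketch ignores. The token ids and the padding id of 0 are illustrative.

```python
# Hypothetical helper (not the tokenizer's implementation): pads or truncates
# a list of token ids to a fixed length and builds the matching attention mask.
def pad_or_truncate(ids, max_length, pad_id=0):
    ids = ids[:max_length]            # truncation: drop ids past max_length
    n_pad = max_length - len(ids)     # padding: fill the remainder
    return {
        'input_ids': ids + [pad_id] * n_pad,
        'attention_mask': [1] * len(ids) + [0] * n_pad,
    }

# A short sequence is padded, a long one is cut (ids are illustrative)
short = pad_or_truncate([101, 1996, 2556, 102], max_length=8)
long_seq = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short)
print(long_seq)
```

The attention mask marks real tokens with 1 and padding with 0, which is why `attention_mask` is kept alongside `input_ids` in the formatted columns.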

Let's use `DataLoader` to create our training set and validation set loaders.

In [13]:
# Create train_dataloader and val_dataloader
train_dataloader = DataLoader(train_set, batch_size=16)
val_dataloader = DataLoader(val_set, batch_size=16)
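As a quick sanity check on the loaders above, the arithmetic below works out how many batches each one yields for the set sizes reported earlier. It assumes the `DataLoader` default of `drop_last=False`, so the final, smaller batch is kept; `batch_stats` is a hypothetical helper introduced only for this illustration.

```python
import math

# Hypothetical helper: number of batches and size of the final batch,
# assuming the last partial batch is kept (DataLoader default drop_last=False)
def batch_stats(num_rows, batch_size):
    num_batches = math.ceil(num_rows / batch_size)
    last_batch_size = num_rows - (num_batches - 1) * batch_size
    return num_batches, last_batch_size

print(batch_stats(17614, 16))  # training set   -> (1101, 14)
print(batch_stats(9194, 16))   # validation set -> (575, 10)
```

Both loaders are created without `shuffle=True`, so they iterate over the examples in their stored order.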


In [14]:
# Get the next batch
batch = next(iter(train_dataloader))
# Print the ids
pprint(batch['input_ids'])
# Print the labels
pprint(batch['decision'])

tensor([[  101,  1996,  2556,  ...,     0,     0,     0],
        [  101,  7861,  5092,  ...,     0,     0,     0],
        [  101,  1037, 12109,  ...,     0,     0,     0],
        ...,
        [  101,  5622,  4226,  ...,     0,     0,     0],
        [  101,  1037,  3259,  ...,     0,     0,     0],
        [  101,  1996,  2556,  ...,     0,     0,     0]])
tensor([1, 1, 2, 1, 0, 1, 0, 1, 0, 1, 2, 1, 1, 0, 0, 0])
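The label tensor above holds class indices rather than decision names. As a small sketch, the mapping defined earlier can be inverted to read the batch back as labels; the list below is copied from the printed tensor, and `index_to_decision` is a name introduced here for illustration.

```python
from collections import Counter

# Same mapping as defined in the pre-processing step
decision_to_str = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 2,
                   'CONT-REJECTED': 3, 'CONT-ACCEPTED': 4, 'CONT-PENDING': 5}

# Invert it: class index -> decision label
index_to_decision = {idx: label for label, idx in decision_to_str.items()}

# Labels copied from the batch printed above
batch_labels = [1, 1, 2, 1, 0, 1, 0, 1, 0, 1, 2, 1, 1, 0, 0, 0]

decoded = [index_to_decision[i] for i in batch_labels]
counts = Counter(decoded)
print(decoded[:3])
print(counts)
```

In this batch of 16, eight applications are accepted, six are rejected, and two are pending.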


In [15]:
# Print the input and output shapes
input_shape = batch['input_ids'].shape
output_shape = batch['decision'].shape
print(f'Input shape: {input_shape}')
print(f'Output shape: {output_shape}')

Input shape: torch.Size([16, 512])
Output shape: torch.Size([16])


In [16]:
# A helper function that converts ids into tokens
def convert_ids_to_string(tokenizer, input_ids):
    # `input_ids` avoids shadowing the built-in `input`
    return ' '.join(tokenizer.convert_ids_to_tokens(input_ids))
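WordPiece tokenizers such as the one loaded above mark word continuations with a `##` prefix, so the space-joined string produced by the helper above still shows words split into pieces. The sketch below merges those pieces back together. `merge_wordpieces` is a hypothetical helper written for illustration; in practice the tokenizer's own `decode` method handles this.

```python
# Hypothetical helper: glue WordPiece continuation tokens (prefixed with '##')
# back onto the preceding token
def merge_wordpieces(tokens):
    words = []
    for tok in tokens:
        if tok.startswith('##') and words:
            words[-1] += tok[2:]   # strip the '##' marker and append
        else:
            words.append(tok)
    return ' '.join(words)

tokens = ['[CLS]', 'em', '##bo', '##diment', '##s', 'of', 'the', 'invention']
print(merge_wordpieces(tokens))  # [CLS] embodiments of the invention
```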

Let's print an example in the batch.

In [17]:
# Print the example
pprint(convert_ids_to_string(tokenizer, batch['input_ids'][1]))

('[CLS] em ##bo ##diment ##s of the invention provide a method of reading and '
 'verify ##ing a tag based on inherent disorder during a manufacturing process '
 '. the method includes using a first reader to take a first reading of an '
 'inherent disorder feature of the tag , and using a second reader to take a '
 'second reading of the inherent disorder feature of the tag . the method '
 'further includes matching the first reading with the second reading , and '
 'determining one or more acceptance criteria , wherein at least one of the '
 'acceptance criteria is based on whether the first reading and the second '
 'reading match within a pre ##de ##ter ##mined threshold . if the acceptance '
 'criteria are met , then the tag is accepted , and a finger ##print for the '
 'tag is recorded . the invention further provides a method of testing and '
 'character ##izing a reader of inherent disorder tags during a manufacturing '
 'process . the method includes taking a reading of a know