# Key Concepts

This page highlights key concepts and terminology used throughout the package.

### Patent Number

Patent numbers are assigned by the USPTO following the format described `here <https://www.uspto.gov/patents/apply/applying-online/patent-number>`_. Note that patent numbers have no leading zeroes.

### Mention ID

An inventor's mention ID is a reference to a specific inventor listed on a specific patent. It takes the format ``US<patent_number>-<sequence_number>``, such as ``US12345-0``, where ``patent_number`` is the patent number and ``sequence_number`` is the inventor's position in the patent's list of inventors (0 for the first inventor, 1 for the second inventor, etc.).
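As an illustration of the format above, here is a minimal sketch for splitting a mention ID into its two parts and building one back. The names `Mention`, `parse_mention_id` and `make_mention_id` are hypothetical helpers written for this page, not functions provided by the package, and the regular expression assumes the patent number is everything between the ``US`` prefix and the last hyphen.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    """Hypothetical container for the two parts of a mention ID."""

    patent_number: str
    sequence_number: int


# Assumption: the patent number is whatever sits between "US" and the last
# hyphen, and the sequence number is a non-negative integer.
_MENTION_ID = re.compile(r"^US(?P<patent_number>.+)-(?P<sequence_number>\d+)$")


def parse_mention_id(mention_id: str) -> Mention:
    """Split 'US<patent_number>-<sequence_number>' into its components."""
    match = _MENTION_ID.match(mention_id)
    if match is None:
        raise ValueError(f"Not a valid mention ID: {mention_id!r}")
    return Mention(match["patent_number"], int(match["sequence_number"]))


def make_mention_id(patent_number: str, sequence_number: int) -> str:
    """Build a mention ID from a patent number and a 0-based sequence number."""
    return f"US{patent_number}-{sequence_number}"


mention = parse_mention_id("US12345-0")
```

Here `mention.patent_number` is the string `"12345"` and `mention.sequence_number` is the integer `0`. The patent number is kept as a string so that non-numeric patent numbers, if any occur in the data, are not lost.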

### Cluster

An inventor cluster is a set of mention IDs thought to refer to the same person. *Predicted* clusters are produced by disambiguation algorithms, whereas *true* clusters are *ground-truth* sets of mentions for inventors.
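To make the distinction concrete, the sketch below represents clusters as plain Python sets of mention IDs and lists which true clusters a prediction recovers exactly. The mention IDs are made up for illustration, and this set comparison is not one of the package's evaluation metrics.

```python
# Hypothetical mention IDs, for illustration only.
true_clusters = [
    {"US12345-0", "US23456-1"},
    {"US34567-0"},
    {"US45678-0", "US56789-2"},
]
predicted_clusters = [
    {"US12345-0", "US23456-1"},   # matches a true cluster exactly
    {"US34567-0", "US45678-0"},   # wrongly merges two inventors
    {"US56789-2"},                # wrongly splits an inventor
]


def as_frozensets(clusters):
    """Convert a list of sets into a set of frozensets so clusters can be compared."""
    return {frozenset(cluster) for cluster in clusters}


# True clusters that also appear, unchanged, among the predicted clusters.
exactly_recovered = as_frozensets(true_clusters) & as_frozensets(predicted_clusters)
```

In this example only the first true cluster is recovered exactly; the other two are affected by one merge error and one split error.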

### Membership Vector

A clustering is typically represented as a *membership vector*. This is a map from mention IDs to the clusters to which they are assigned.

In this package, membership vectors are represented by pandas Series with mention IDs as the index and cluster assignments as the values. All clusterings and disambiguation results follow this format.

Below is an example of a membership vector for a subset of inventor mention IDs. The values appearing in the right column (cluster IDs) are arbitrary; the only convention is that mention IDs corresponding to the same inventor (belonging to the same cluster) should have the same cluster ID.

```python
from pv_evaluation.benchmark import load_israeli_inventors_benchmark

load_israeli_inventors_benchmark()
```

```
mention-id
US3858246-0    11797
US3858578-0    11797
US3858674-0    16606
US3859165-0    13384
US3859616-0     9865
               ...
US6009346-0    12734
US6009390-1     7694
US6009409-2    11416
US6009543-0    19168
US6009552-0      650
Name: unique-id, Length: 9156, dtype: int64
```
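The convention that cluster IDs are arbitrary can be checked on a small hand-built membership vector. The sketch below uses made-up mention IDs and labels, and `clusters_of` is a hypothetical helper written for this page rather than part of the package. It assumes only that a membership vector is a pandas Series indexed by mention ID, as described above.

```python
import pandas as pd

# Hypothetical mention IDs and cluster labels, for illustration only.
membership = pd.Series(
    [0, 0, 1, 2],
    index=pd.Index(
        ["US12345-0", "US23456-1", "US34567-0", "US45678-2"],
        name="mention-id",
    ),
    name="unique-id",
)

# Relabel every cluster; which mentions are grouped together does not change.
relabelled = membership.map({0: "a", 1: "b", 2: "c"})


def clusters_of(membership_vector):
    """Return the clustering as a set of frozensets of mention IDs."""
    clusters = {}
    for mention_id, cluster_id in membership_vector.items():
        clusters.setdefault(cluster_id, set()).add(mention_id)
    return {frozenset(mention_ids) for mention_ids in clusters.values()}


# Both vectors describe the same clustering despite having different labels.
same_clustering = clusters_of(membership) == clusters_of(relabelled)
```

Comparing the sets of clusters, rather than the raw label values, is what makes two membership vectors with different cluster IDs compare as equal.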