# Performance Estimates for Lai's 2011 Benchmark

This notebook demonstrates our pairwise precision and recall estimators on Lai's 2011 benchmark dataset.

Note that Lai's 2011 dataset only covers patents granted before 2010. As such, we can only estimate the performance of the current disambiguation algorithm for this time period.

We assume that Lai's 2011 benchmark is a uniform sample of inventors. The inventors in this benchmark were identified from a set of CVs rather than by sampling individual patents, which would have biased the sample towards large clusters.
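To illustrate why the sampling mechanism matters, here is a minimal simulation sketch. The population, the cluster sizes, and the helper `mean_size` are all hypothetical and are not taken from the benchmark; the point is only that picking inventors through their patents favours prolific inventors.

```python
import random

random.seed(0)

# Hypothetical population: 900 inventors with 1 patent, 100 inventors with 50.
cluster_sizes = [1] * 900 + [50] * 100
inventors = range(len(cluster_sizes))

# (a) Uniform sample of inventors: every inventor is equally likely.
uniform_sample = random.choices(inventors, k=10_000)

# (b) Sample a patent uniformly and take its inventor: an inventor is
# selected with probability proportional to their cluster size.
size_biased_sample = random.choices(inventors, weights=cluster_sizes, k=10_000)


def mean_size(sample):
    """Average cluster size of the sampled inventors."""
    return sum(cluster_sizes[i] for i in sample) / len(sample)


# The population mean cluster size is (900 * 1 + 100 * 50) / 1000 = 5.9.
print("uniform sample mean size:    ", mean_size(uniform_sample))
print("size-biased sample mean size:", mean_size(size_biased_sample))
```

Under scheme (b), large clusters dominate the sample, so an estimator that assumes uniform weights would be applied to the wrong design.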

## Data Preparation

First, we import the required modules and load the current disambiguation from `rawinventor.tsv`, using `patent.tsv` to look up patent dates. The current disambiguation is then filtered to keep only inventor mentions on patents granted between 1975 and 2010.

In [1]:
from pv_evaluation.estimators import (
    pairwise_precision_estimator,
    pairwise_recall_estimator,
    pairwise_precision_std,
    pairwise_recall_std,
)
from pv_evaluation.benchmark import load_lai_2011_inventors_benchmark

import pandas as pd
import numpy as np
import wget
import zipfile
import os

if not os.path.isfile("rawinventor.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip")
    with zipfile.ZipFile("rawinventor.tsv.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("rawinventor.tsv.zip")

if not os.path.isfile("patent.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/patent.tsv.zip")
    with zipfile.ZipFile("patent.tsv.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("patent.tsv.zip")

In [2]:
# Read only the columns we need, as strings so that IDs are preserved exactly.
patent = pd.read_csv("patent.tsv", sep="\t", dtype=str, usecols=["id", "date"])
rawinventor = pd.read_csv("rawinventor.tsv", sep="\t", dtype=str, usecols=["patent_id", "sequence", "inventor_id"])

# Replace each patent's date with its year.
date = pd.DatetimeIndex(patent.date)
patent["date"] = date.year.astype(int)
# Attach the year to every inventor mention.
joined = rawinventor.merge(patent, left_on="patent_id", right_on="id", how="left")

In [3]:
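# Mention IDs have the form "US<patent_id>-<sequence>".
joined["mention-id"] = "US" + joined.patent_id + "-" + joined.sequence
# Keep mentions on patents dated from 1975 through 2010 (inclusive).
joined = joined.query('date >= 1975 and date <= 2010')
# Series mapping each mention ID to its assigned inventor ID.
current_disambiguation = joined.set_index("mention-id")["inventor_id"]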
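The same join, ID construction, and year filter can be sketched on toy records in plain Python. Every value below is made up for illustration, and `mention_id` is a hypothetical helper, not part of `pv_evaluation`.

```python
# Toy stand-ins for rows of rawinventor.tsv and for patent years (hypothetical).
rawinventor_rows = [
    {"patent_id": "4000001", "sequence": "0", "inventor_id": "inv-a"},
    {"patent_id": "4000001", "sequence": "1", "inventor_id": "inv-b"},
    {"patent_id": "9000002", "sequence": "0", "inventor_id": "inv-a"},
]
grant_year = {"4000001": 1976, "9000002": 2015}


def mention_id(patent_id, sequence):
    """Build a mention ID of the form "US<patent_id>-<sequence>"."""
    return f"US{patent_id}-{sequence}"


# Keep only mentions on patents dated 1975 through 2010, keyed by mention ID.
toy_disambiguation = {
    mention_id(row["patent_id"], row["sequence"]): row["inventor_id"]
    for row in rawinventor_rows
    if 1975 <= grant_year.get(row["patent_id"], -1) <= 2010
}

print(toy_disambiguation)
```

The 2015 patent is dropped by the filter, so only the two mentions on the 1976 patent remain.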

## Precision and Recall Estimates

We can now estimate precision and recall using the `"cluster_block"` sampling type (the benchmark is a sample of true clusters) with uniform probability weights.
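As a reminder of what is being estimated, the sketch below computes pairwise precision and recall directly on a tiny, fully labelled toy example. It shows the textbook definition only. It is not how `pv_evaluation` computes its estimates, which have to account for the sampling design. The data and the helper `linked_pairs` are hypothetical.

```python
from itertools import combinations


def linked_pairs(membership):
    """Return the set of unordered mention pairs that share a cluster label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(membership), 2)
        if membership[a] == membership[b]
    }


# Hypothetical toy data: mention ID -> cluster label.
predicted = {"m1": "x", "m2": "x", "m3": "x", "m4": "y"}
truth = {"m1": "A", "m2": "A", "m3": "B", "m4": "B"}

pred_pairs = linked_pairs(predicted)  # m1-m2, m1-m3, m2-m3
true_pairs = linked_pairs(truth)      # m1-m2, m3-m4
common = pred_pairs & true_pairs      # m1-m2

# Precision: share of predicted links that are correct.
precision = len(common) / len(pred_pairs)
# Recall: share of true links that were recovered.
recall = len(common) / len(true_pairs)

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Here one of the three predicted links is correct, and one of the two true links is recovered.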

Precision estimate:

In [4]:
pairwise_precision_estimator(current_disambiguation, load_lai_2011_inventors_benchmark(), sampling_type="cluster_block", weights="uniform")

0.9061700591403344

In [5]:
pairwise_precision_std(current_disambiguation, load_lai_2011_inventors_benchmark(), sampling_type="cluster_block", weights="uniform")

0.02694415809739732

Recall estimate:

In [6]:
pairwise_recall_estimator(current_disambiguation, load_lai_2011_inventors_benchmark(), sampling_type="cluster_block", weights="uniform")

0.9096034933749487

In [7]:
pairwise_recall_std(current_disambiguation, load_lai_2011_inventors_benchmark(), sampling_type="cluster_block", weights="uniform")

0.05017639288406865
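One way to read these numbers together is to turn each estimate and its standard deviation into a rough interval. The sketch below assumes that the reported standard deviations can be treated as standard errors and that a normal approximation is reasonable. Neither assumption is checked in this notebook, and `approx_interval` is a hypothetical helper.

```python
# Estimates and standard deviations copied from the outputs above.
estimates = {
    "precision": (0.9061700591403344, 0.02694415809739732),
    "recall": (0.9096034933749487, 0.05017639288406865),
}

Z_95 = 1.96  # normal-approximation critical value (an assumption)


def approx_interval(estimate, std, z=Z_95):
    """Return estimate +/- z * std, clipped to the valid range [0, 1]."""
    lower = max(0.0, estimate - z * std)
    upper = min(1.0, estimate + z * std)
    return lower, upper


for name, (est, std) in estimates.items():
    lower, upper = approx_interval(est, std)
    print(f"{name}: {est:.3f} (approx. 95% interval: {lower:.3f} to {upper:.3f})")
```

The recall interval is noticeably wider than the precision interval, reflecting its larger standard deviation.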