# Performance Estimates for the Israeli Inventors Dataset

This notebook demonstrates our pairwise precision and recall estimators on the Israeli inventors benchmark dataset.

Note that the Israeli dataset covers only patents granted between 1963 and 1999. As such, we can estimate the performance of the current disambiguation algorithm only for this time period.

Furthermore, we assume that the Israeli benchmark follows the "single block" sampling process.

In [1]:
from pv_evaluation.estimators import pairwise_precision_estimator, pairwise_recall_estimator
from pv_evaluation.benchmark import load_israeli_inventors_benchmark

In [2]:
# TODO: Constrain `current_disambiguation` to only contain inventor mentions for granted patents between 1963 and 1999.
import os
import zipfile

import pandas as pd
import wget

# Download and extract the raw inventor table once; reuse the local copy on later runs.
if not os.path.isfile("rawinventor.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip")
    with zipfile.ZipFile("rawinventor.tsv.zip", "r") as zip_ref:
        zip_ref.extractall(".")
    os.remove("rawinventor.tsv.zip")

rawinventor = pd.read_csv("rawinventor.tsv", sep="\t")
# Build mention IDs of the form "US<patent_id>-<sequence>".
rawinventor["mention-id"] = "US" + rawinventor.patent_id.astype(str) + "-" + rawinventor.sequence.astype(str)
# Membership vector: mention ID -> inventor ID assigned by the current disambiguation.
current_disambiguation = rawinventor.set_index("mention-id")["inventor_id"]
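
The TODO in this cell could be addressed along the following lines. This is only a sketch: the helper `restrict_to_grant_years` and its `grant_dates` input (a series of grant dates indexed by patent ID) are hypothetical and not part of `pv_evaluation`, and the grant dates would have to come from a separate table that this notebook does not load. The data below are toy values.

```python
import pandas as pd


def restrict_to_grant_years(disambiguation, grant_dates, first_year=1963, last_year=1999):
    """Keep only mentions whose patent was granted in [first_year, last_year].

    disambiguation: Series indexed by mention ID ("US<patent_id>-<sequence>").
    grant_dates: Series of grant dates indexed by unique patent ID (hypothetical input).
    """
    # Recover the patent ID by stripping the "US" prefix and the "-<sequence>" suffix.
    patent_ids = disambiguation.index.to_series().str.extract(r"^US(.+)-\d+$")[0]
    # Grant year per patent, keyed by the patent ID as a string.
    years = pd.to_datetime(grant_dates).dt.year
    years.index = years.index.astype(str)
    # Mentions whose patent has no known grant date get NaN and are dropped.
    mention_years = patent_ids.map(years)
    keep = mention_years.between(first_year, last_year)
    return disambiguation[keep.to_numpy()]


# Toy data: patent "111" falls inside the benchmark period, patent "222" does not.
toy_disambiguation = pd.Series(
    ["inventor-a", "inventor-a", "inventor-b"],
    index=["US111-0", "US111-1", "US222-0"],
)
toy_grant_dates = pd.Series({"111": "1980-05-06", "222": "2005-03-01"})
filtered = restrict_to_grant_years(toy_disambiguation, toy_grant_dates)
print(filtered)
```

Applied to `current_disambiguation`, a filter of this kind would keep only mentions from the period that the benchmark covers.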

In [None]:
pairwise_precision_estimator(current_disambiguation, load_israeli_inventors_benchmark(), sampling_type="single_block", weights="uniform")

0.0964202816671318

In [None]:
pairwise_recall_estimator(current_disambiguation, load_israeli_inventors_benchmark(), sampling_type="single_block", weights="uniform")

0.9409842164000638
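
For intuition about what these two numbers measure, the sketch below computes plain pairwise precision and recall on a small, fully labelled toy example. It is not the estimator used above: it ignores the sampling process and simply compares pairs among mentions present in both clusterings. The function names `coclustered_pairs` and `naive_pairwise_metrics` are illustrative and not part of `pv_evaluation`.

```python
from itertools import combinations

import pandas as pd


def coclustered_pairs(membership):
    """Return the set of unordered mention pairs that share a cluster label."""
    pairs = set()
    for mentions in membership.groupby(membership).groups.values():
        pairs.update(frozenset(pair) for pair in combinations(sorted(mentions), 2))
    return pairs


def naive_pairwise_metrics(prediction, reference):
    """Pairwise precision and recall over the mentions common to both clusterings."""
    common = prediction.index.intersection(reference.index)
    predicted = coclustered_pairs(prediction.loc[common])
    true = coclustered_pairs(reference.loc[common])
    correct = len(predicted & true)
    # Precision: share of predicted links that are true links.
    precision = correct / len(predicted) if predicted else float("nan")
    # Recall: share of true links that were predicted.
    recall = correct / len(true) if true else float("nan")
    return precision, recall


# Toy example: the reference groups m1, m2, m3 together; the prediction splits off m3.
reference = pd.Series({"m1": "A", "m2": "A", "m3": "A", "m4": "B"})
prediction = pd.Series({"m1": "x", "m2": "x", "m3": "y", "m4": "y"})
precision, recall = naive_pairwise_metrics(prediction, reference)
print(precision, recall)
```

Here the prediction makes two links, of which one is correct (precision 1/2), and recovers one of the three true links (recall 1/3).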