# 🎯 Performance Estimates for the Israeli Inventors Dataset

This notebook demonstrates our pairwise precision and recall estimators on the Israeli inventors benchmark dataset.

Note that the Israeli dataset only covers patents granted between 1963 and 1999. As such, we can only estimate the performance of the current disambiguation algorithm for this time period.

Furthermore, we treat the Israeli benchmark as following the "single block" sampling process. That is, we assume that the data corresponds to a sample of a single block, here the block of Israeli inventors.

## Data Preparation

First, we import the required modules and recover the current disambiguation from `rawinventor.tsv`, using the patent dates in `patent.tsv`. The current disambiguation is then filtered to contain only inventor mentions on patents granted between 1963 and 1999.

In [1]:
from pv_evaluation.estimators import pairwise_precision_estimator, pairwise_recall_estimator
from pv_evaluation.benchmark import load_israeli_inventors_benchmark

import pandas as pd
import numpy as np
import wget
import zipfile
import os

if not os.path.isfile("rawinventor.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip")
    with zipfile.ZipFile("rawinventor.tsv.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("rawinventor.tsv.zip")

if not os.path.isfile("patent.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/patent.tsv.zip")
    with zipfile.ZipFile("patent.tsv.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("patent.tsv.zip")

In [2]:
# Load only the columns we need; read everything as strings so IDs are kept verbatim.
patent = pd.read_csv("patent.tsv", sep="\t", dtype=str, usecols=["id", "date"])
rawinventor = pd.read_csv("rawinventor.tsv", sep="\t", dtype=str, usecols=["patent_id", "sequence", "inventor_id"])

# Replace each patent's date with its year.
date = pd.DatetimeIndex(patent.date)
patent["date"] = date.year.astype(int)
# Attach the year to every inventor mention.
joined = rawinventor.merge(patent, left_on="patent_id", right_on="id", how="left")

In [3]:
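# Build mention IDs of the form "US<patent_id>-<sequence>".
joined["mention_id"] = "US" + joined.patent_id + "-" + joined.sequence
# Keep only mentions on patents dated 1963 through 1999.
joined = joined.query('date >= 1963 and date <= 1999')
# Map each mention ID to its currently assigned inventor ID.
current_disambiguation = joined.set_index("mention_id")["inventor_id"]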
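
To make the preparation steps above concrete, here is a small sketch that repeats them on toy data. All IDs, dates, and the `year` column name are made up for illustration; only the column names `id`, `date`, `patent_id`, `sequence`, and `inventor_id` come from the cells above.

```python
import pandas as pd

# Toy stand-ins for patent.tsv and rawinventor.tsv (all values are invented).
toy_patent = pd.DataFrame({
    "id": ["1000", "2000"],
    "date": ["1985-03-12", "2005-06-07"],
})
toy_rawinventor = pd.DataFrame({
    "patent_id": ["1000", "1000", "2000", "3000"],
    "sequence": ["0", "1", "0", "0"],
    "inventor_id": ["a", "b", "a", "c"],
})

# Extract the year and attach it to each inventor mention.
toy_patent["year"] = pd.to_datetime(toy_patent["date"]).dt.year
toy_joined = toy_rawinventor.merge(
    toy_patent, left_on="patent_id", right_on="id", how="left"
)

# Mention IDs follow the "US<patent_id>-<sequence>" pattern.
toy_joined["mention_id"] = "US" + toy_joined["patent_id"] + "-" + toy_joined["sequence"]

# Mentions whose patent is missing from the patent table get a NaN year
# after the left join, so the range filter drops them as well.
toy_kept = toy_joined[toy_joined["year"].between(1963, 1999)]
toy_disambiguation = toy_kept.set_index("mention_id")["inventor_id"]

print(toy_disambiguation.to_dict())
```

Only the two mentions on the 1985 patent survive the filter: the 2005 patent falls outside the window, and patent `3000` has no date to compare.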

## Precision and Recall Estimates

Next, we estimate precision and recall using the "single_block" sampling type. Because only a single block has been sampled, no standard deviation estimate can be provided.

In [4]:
pairwise_precision_estimator(current_disambiguation, load_israeli_inventors_benchmark(), sampling_type="single_block", weights="uniform")

0.7860646822490067

In [5]:
pairwise_recall_estimator(current_disambiguation, load_israeli_inventors_benchmark(), sampling_type="single_block", weights="uniform")

0.9409842164000638
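
For intuition about what these two numbers measure, the sketch below computes naive pairwise precision and recall on a toy example. It is not the estimator implemented in `pv_evaluation`: the function names and data are hypothetical, and it simply restricts both clusterings to their shared mentions, whereas the library's estimators may treat mentions outside the benchmark differently.

```python
from itertools import combinations

import pandas as pd


def coreferent_pairs(membership):
    """Return the set of unordered mention pairs that share a cluster label."""
    pairs = set()
    for _, group in membership.groupby(membership):
        pairs.update(combinations(sorted(group.index), 2))
    return pairs


def naive_pairwise_scores(prediction, reference):
    """Naive pairwise precision and recall on the mentions common to both inputs."""
    common = prediction.index.intersection(reference.index)
    predicted = coreferent_pairs(prediction.loc[common])
    true = coreferent_pairs(reference.loc[common])
    correct = len(predicted & true)
    # Guard against empty pair sets to avoid dividing by zero.
    precision = correct / len(predicted) if predicted else float("nan")
    recall = correct / len(true) if true else float("nan")
    return precision, recall


# Toy membership vectors: index = mention ID, value = cluster label.
toy_prediction = pd.Series({"m1": "x", "m2": "x", "m3": "x", "m4": "y"})
toy_reference = pd.Series({"m1": "A", "m2": "A", "m3": "B", "m4": "B"})

precision, recall = naive_pairwise_scores(toy_prediction, toy_reference)
print(round(precision, 3), recall)  # 0.333 0.5
```

Here the prediction links three pairs, of which only (`m1`, `m2`) is correct, giving a precision of 1/3. The reference contains two pairs, one of which is recovered, giving a recall of 1/2.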