# Record Disambiguation

In this notebook we perform entity disambiguation on person records, matching each record to candidate Wikidata entities.

In [247]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

import sys
sys.path.append("../../..")

from heritageconnector.disambiguation.helpers import load_training_data, plot_performance_curves
from heritageconnector.disambiguation.pipelines import Disambiguator

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)



## 1. Load data
The training and test data have already been generated using `Disambiguator.save_training_data_to_folder` and `Disambiguator.save_test_data_to_folder` respectively.

In [177]:
train_dir = "/Volumes/Kalyan_SSD/SMG/disambiguation/train_people_1110/"
test_dir = "/Volumes/Kalyan_SSD/SMG/disambiguation/test_people_1110/"

In [263]:
X, y, pairs, pids = load_training_data(train_dir)
X_new, pairs_new, pids_new = load_training_data(test_dir)

In [264]:
pids

['label', 'P735', 'P734', 'P21', 'P569', 'P570', 'P106', 'P31']

## 2. Train classifier
The disambiguator wraps `sklearn.tree.DecisionTreeClassifier` and takes its parameters as inputs.

### 2a. Test classifier performance
We'll perform a train/test split on the labelled data to quickly test the classifier's performance using its `score` method. 

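The `score` method here reports [balanced accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html), alongside precision and recall. Balanced accuracy is the mean of the recall obtained on each class, so both classes count equally even when one is much rarer than the other.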
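To make the definition concrete, here is a minimal pure-Python sketch of balanced accuracy as the mean of per-class recall. The labels are made-up toy values, not data from this notebook, and `balanced_accuracy` is a hypothetical helper written for illustration rather than part of `heritageconnector`.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (toy re-implementation for illustration)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Indices of the samples whose true label is class c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy, imbalanced labels: 8 non-matches and 2 matches (assumed values).
y_true = [False] * 8 + [True] * 2
# A classifier that finds only one of the two true matches.
y_pred = [False] * 8 + [True, False]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)

print(plain_accuracy)  # 0.9
print(balanced)        # 0.75
```

Plain accuracy looks high because the majority class dominates, while balanced accuracy exposes the missed match. This matters for disambiguation, where most candidate pairs are expected to be non-matches.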

In [265]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [281]:
clf = Disambiguator().fit(X_train, y_train)
print(clf.score(X_test, y_test))

balanced accuracy score: 0.9796144483822518
precision score: 0.9357429718875502
recall score: 0.9628099173553719


### 2b. Use classifier to predict new Wikidata links

In [271]:
clf = Disambiguator().fit(X, y)
y_pred = clf.predict(X_new, threshold=0.5)

# Count the positive predictions directly; this also works when no links are predicted.
print(f"{int(np.sum(y_pred))} potential new links found")

2520 potential new links found
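How `Disambiguator.predict` applies `threshold` internally is not shown in this notebook. One plausible reading, sketched below on synthetic data, is that a pair is labelled a match when the tree's predicted probability for the positive class reaches the threshold. The helper `predict_with_threshold` and the random data are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic similarity features in [0, 1] and a synthetic match label.
rng = np.random.default_rng(0)
X_toy = rng.random((200, 3))
y_toy = X_toy[:, 0] + X_toy[:, 1] > 1.0

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)


def predict_with_threshold(model, X, threshold=0.5):
    """Label a row True when P(class == True) is at least `threshold`."""
    # classes_ is sorted, so column 1 holds the probability of True.
    proba_true = model.predict_proba(X)[:, 1]
    return proba_true >= threshold


X_unseen = rng.random((50, 3))
lenient = predict_with_threshold(tree, X_unseen, threshold=0.5)
strict = predict_with_threshold(tree, X_unseen, threshold=0.9)

print(int(lenient.sum()), "matches at threshold 0.5")
print(int(strict.sum()), "matches at threshold 0.9")
```

Raising the threshold can only remove predicted matches, never add them, so it trades recall for precision when reviewing candidate links.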


In [275]:
pairs_new["y_pred"] = y_pred
pairs_new.sort_values("y_pred", ascending=False).head(20)

| | internal_id | wikidata_id | y_pred |
|---|---|---|---|
| 63528 | https://collection.sciencemuseumgroup.org.uk/people/cp134465 | Q1691137 | True |
| 31939 | https://collection.sciencemuseumgroup.org.uk/people/cp118567 | Q5345072 | True |
| 103911 | https://collection.sciencemuseumgroup.org.uk/people/cp93519 | Q20991622 | True |
| 27843 | https://collection.sciencemuseumgroup.org.uk/people/cp117281 | Q21556609 | True |
| 20473 | https://collection.sciencemuseumgroup.org.uk/people/cp89366 | Q616068 | True |
| 71131 | https://collection.sciencemuseumgroup.org.uk/people/cp39886 | Q3099164 | True |
| 75325 | https://collection.sciencemuseumgroup.org.uk/people/cp167048 | Q238948 | True |
| 11604 | https://collection.sciencemuseumgroup.org.uk/people/cp44586 | Q42590721 | True |
| 27850 | https://collection.sciencemuseumgroup.org.uk/people/cp117281 | Q21455568 | True |
| 94545 | https://collection.sciencemuseumgroup.org.uk/people/cp166863 | Q7108130 | True |
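In the rows above, record `cp117281` is predicted to match two different Wikidata items (Q21556609 and Q21455568). A minimal sketch of how such ambiguous records could be flagged for review follows; it uses a small hand-built DataFrame with shortened IDs copied from the table rather than the full `pairs_new`, and the variable names are illustrative.

```python
import pandas as pd

# Three rows copied from the table above, with internal IDs shortened.
pairs_toy = pd.DataFrame(
    {
        "internal_id": ["cp117281", "cp117281", "cp89366"],
        "wikidata_id": ["Q21556609", "Q21455568", "Q616068"],
        "y_pred": [True, True, True],
    }
)

# Keep only predicted matches, then count distinct candidates per record.
positives = pairs_toy[pairs_toy["y_pred"]]
n_candidates = positives.groupby("internal_id")["wikidata_id"].nunique()

# Records with more than one predicted Wikidata item need a tie-break.
ambiguous = n_candidates[n_candidates > 1].index.tolist()
print(ambiguous)  # ['cp117281']
```

The classifier scores each pair independently, so nothing stops one record from receiving several positive predictions. A check like this makes those cases visible.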


## 3. Explain classifier
In the tree printed below, we can see that the classifier prioritises P569/P570 (birth and death dates), P21 (gender), label similarity, and P106 (occupation).

It's interesting to note that P31 (instance of), which tells the classifier whether the Wikidata record is a human, is not used. This is likely because P569/P570/P106/P21 are properties that, in practice, only humans have, so P31 adds little extra information.

P31 is likely to be much more important when classifying objects, for example when distinguishing between paintings and posters.
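Reading the printed tree is one way to see which features are used. Another is to inspect `feature_importances_` on a fitted scikit-learn tree. The sketch below uses a plain `DecisionTreeClassifier` on synthetic data, because whether `Disambiguator` exposes the underlying tree is not shown here. The constant column stands in for a feature that adds no information, which is the suggested explanation for P31.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 300
informative = rng.random(n)   # feature that determines the label
constant = np.ones(n)         # feature with no information (assumed stand-in)
X_toy = np.column_stack([informative, constant])
y_toy = informative > 0.5

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

names = ["informative", "constant"]
importances = dict(zip(names, tree.feature_importances_))

# Text rendering of the fitted tree, similar in layout to the output below.
print(export_text(tree, feature_names=names))
print(importances)
```

A feature that never appears in a split has an importance of exactly zero, so this gives a quick numeric check of what the tree diagram shows.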

In [272]:
clf.print_tree(feature_names=pids_new)

|--- P569 <= 1.00
|   |--- P106 <= 0.50
|   |   |--- P570 <= 1.00
|   |   |   |--- label <= 0.99
|   |   |   |   |--- label <= 0.94
|   |   |   |   |   |--- class: False
|   |   |   |   |--- label >  0.94
|   |   |   |   |   |--- class: False
|   |   |   |--- label >  0.99
|   |   |   |   |--- P21 <= 0.50
|   |   |   |   |   |--- class: False
|   |   |   |   |--- P21 >  0.50
|   |   |   |   |   |--- class: False
|   |   |--- P570 >  1.00
|   |   |   |--- label <= 0.97
|   |   |   |   |--- class: False
|   |   |   |--- label >  0.97
|   |   |   |   |--- P734 <= 0.97
|   |   |   |   |   |--- class: False
|   |   |   |   |--- P734 >  0.97
|   |   |   |   |   |--- class: True
|   |--- P106 >  0.50
|   |   |--- label <= 0.95
|   |   |   |--- label <= 0.87
|   |   |   |   |--- P570 <= 0.28
|   |   |   |   |   |--- class: False
|   |   |   |   |--- P570 >  0.28
|   |   |   |   |   |--- class: False
|   |   |   |--- label >  0.87
|   |   |   |   |--- P569 <= 0.90
|   |   |   |   |   |--- class