### Precision and Recall Analysis
The final step of this class is to evaluate our disambiguation against the **eval_als.txt** reference dataset.

### Package imports

In [1]:
import pandas as pd

from pv_evaluation.metrics import cluster_precision_recall
from pv_evaluation.benchmark import inspect_clusters_to_split
from pv_evaluation.benchmark import inspect_clusters_to_merge

### Data imports
Loading in the reference dataset and formatting into a series.

In [99]:
ref = pd.read_csv('input/eval_als.txt', sep='\t', header=None, names=['mention', 'cluster'], dtype={'mention': 'string', 'cluster': 'int16'})
# prefix each mention with "US" so the ids match the format used in our predictions
ref['mention'] = "US" + ref['mention']

#convert to series
ref.set_index('mention', inplace=True)
ref_series = ref.iloc[:, 0]
ref_series

mention
US5294443-0       0
US6207855-1       1
US5767288-2       2
US4996481-1       3
US6727070-0       4
               ... 
US5150706-0     493
US7556935-2    1516
US8405044-1    2534
US7088104-0    1517
US6068972-0     528
Name: cluster, Length: 41347, dtype: int16

Loading in our prediction dataset and formatting it into a series. We have to join it with the original **patents_2005_012.tsv** because we previously dropped 'id11'. The final series is indexed by the newly derived 'mention' id and uses 'id11' as the cluster id.

In [100]:
#loading result and id datasets for merge
pred_mention_id = pd.read_csv('output/autosequence_cleaned.csv', dtype="string")
pred_cluster_id = pd.read_csv('input/patents_2005_012.tsv', sep='\t', usecols=['id11', 'patent', 'fname', 'mname', 'lname', 'suffix'], 
    dtype="string")

#join to match mention_id with cluster_id
pred_cluster_id.drop_duplicates(inplace=True)
pred = pd.merge(pred_mention_id, pred_cluster_id, on=['patent', 'fname', 'mname', 'lname', 'suffix'])

#convert to series
pred['id11'] = pd.to_numeric(pred['id11'])
pred.set_index('mention', inplace=True)
pred_series = pred['id11']
pred_series

mention
US6205043-1        1
US6434031-1        1
US6583702-1        1
US6423491-1        2
US4184768-1        3
               ...  
US6723088-1    17241
US6811612-4    28601
US6829498-3    40943
US6829498-4    41841
US6830895-1    49818
Name: id11, Length: 142522, dtype: int64
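
Because the join above is an inner merge on several name columns, any mention whose key has no match in the id table is dropped without warning. The following is a minimal sketch, on made-up toy data, of how a left merge with `indicator=True` can surface such rows. The frames `mentions` and `ids`, their values, and the reduced key list are hypothetical and are not taken from the actual files.

```python
import pandas as pd

# Toy stand-ins for the two frames being joined (hypothetical values).
keys = ["patent", "fname", "lname"]
mentions = pd.DataFrame({
    "mention": ["US1-0", "US2-0", "US3-1"],
    "patent": ["1", "2", "3"],
    "fname": ["Ann", "Bob", "Cy"],
    "lname": ["Lee", "Ray", "Fox"],
}, dtype="string")
ids = pd.DataFrame({
    "id11": ["7", "8"],
    "patent": ["1", "2"],
    "fname": ["Ann", "Bob"],
    "lname": ["Lee", "Ray"],
}, dtype="string")

# A left merge keeps every mention; the "_merge" column records whether
# a matching row was found in the id table.
checked = mentions.merge(ids, on=keys, how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["mention"] + keys])
```

If `unmatched` is non-empty on the real data, those mentions would be missing from `pred_series` and therefore from the evaluation.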

### Testing
Calculating our precision and recall values. Note that we can only compute these on the mentions present in both datasets, and the overlap between **eval_als.txt** and **patents_2005_012.tsv** is small: 22,652 mentions, or about 16% of our 142,522 predictions.

In [111]:
join = pd.concat({'pred': pred_series, 'ref': ref_series}, axis=1, join="inner")

print(cluster_precision_recall(join.pred, join.ref))

(0.9997167941093175, 0.9963057686842853)
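
To build intuition for what precision and recall mean for a clustering, here is a sketch of the closely related pairwise variant of these metrics. This is an illustration only: it is not the implementation of `cluster_precision_recall`, whose exact definition may differ, and the helper name `pairwise_precision_recall_sketch` is hypothetical. Pairwise precision is the share of mention pairs placed together by the prediction that the reference also places together; pairwise recall is the reverse.

```python
import pandas as pd

def pairwise_precision_recall_sketch(pred: pd.Series, ref: pd.Series):
    """Pairwise precision/recall of two clusterings (illustrative helper)."""
    # Keep only mentions present in both clusterings.
    data = pd.concat({"pred": pred, "ref": ref}, axis=1, join="inner")

    def n_pairs(sizes: pd.Series) -> int:
        # A cluster of size n contains n * (n - 1) / 2 unordered pairs.
        return int((sizes * (sizes - 1) // 2).sum())

    pred_pairs = n_pairs(data.groupby("pred").size())
    ref_pairs = n_pairs(data.groupby("ref").size())
    # Pairs that are together in both the prediction and the reference.
    both_pairs = n_pairs(data.groupby(["pred", "ref"]).size())

    precision = both_pairs / pred_pairs if pred_pairs else 1.0
    recall = both_pairs / ref_pairs if ref_pairs else 1.0
    return precision, recall

# Toy example: the prediction lumps a, b, c together; the reference pairs a-b and c-d.
toy_pred = pd.Series({"a": 1, "b": 1, "c": 1, "d": 2})
toy_ref = pd.Series({"a": 10, "b": 10, "c": 20, "d": 20})
precision, recall = pairwise_precision_recall_sketch(toy_pred, toy_ref)
print(precision, recall)
```

In the toy example only one of the three predicted pairs is correct, and only one of the two reference pairs is recovered, so over-merging lowers precision while splitting lowers recall.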


In [112]:
print("Pred length:", len(pred_series))
print("Ref length:", len(ref_series))
print("Join length:", len(join))

Pred length: 142522
Ref length: 41347
Join length: 22652


### Output
Writing out the cases where our two disambiguations disagree: mentions that share a reference cluster but were assigned to different predicted clusters.

In [154]:
df = inspect_clusters_to_merge(join.pred, join.ref)
info = []

# attach the name fields of each mention so the mismatches can be reviewed by hand
for index in df.index:
    val = pred.loc[index]
    info.append({'fname': val.fname, 'mname': val.mname, 'lname': val.lname, 'suffix': val.suffix})

df['info'] = info
df.to_csv('output/clusters.csv')
df

mention,reference,prediction,info
US5656497-0,25,60231,"{'fname': 'Joseph', 'mname': 'Gregory', 'lname..."
US6495023-0,25,60228,"{'fname': 'Gregory', 'mname': 'J.', 'lname': '..."
US6270649-0,25,60231,"{'fname': 'Joseph', 'mname': 'Gregory', 'lname..."
US5980890-1,25,60231,"{'fname': 'Joseph', 'mname': 'Gregory', 'lname..."
US5908924-1,25,60231,"{'fname': 'Joseph', 'mname': 'Gregory', 'lname..."
...,...,...,...
US5459317-0,3061,49630,"{'fname': 'Gary', 'mname': 'W.', 'lname': 'Sma..."
US6534062-2,3097,44221,"{'fname': 'Douglas', 'mname': '&', 'lname': 'R..."
US6686462-4,3097,44222,"{'fname': 'Douglas', 'mname': 'D.', 'lname': '..."
US6111071-0,3262,17617,"{'fname': 'Eric', 'mname': '&', 'lname': 'Gers..."
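
The rows above all belong to reference clusters that the prediction spreads over more than one predicted cluster. A rough sketch of how such rows could be selected with plain pandas follows. It assumes this is the selection criterion and is not the implementation of `inspect_clusters_to_merge`; the helper name `split_reference_clusters_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def split_reference_clusters_sketch(pred: pd.Series, ref: pd.Series) -> pd.DataFrame:
    """Mentions whose reference cluster maps to several predicted clusters."""
    data = pd.concat({"prediction": pred, "reference": ref}, axis=1, join="inner")
    # For every mention, count the distinct predicted clusters in its reference cluster.
    n_pred = data.groupby("reference")["prediction"].transform("nunique")
    return data[n_pred > 1].sort_values(["reference", "prediction"])

# Toy example: reference cluster 10 is split into predicted clusters 1 and 2.
toy_pred = pd.Series({"a": 1, "b": 2, "c": 3})
toy_ref = pd.Series({"a": 10, "b": 10, "c": 20})
to_merge = split_reference_clusters_sketch(toy_pred, toy_ref)
print(to_merge)
```

Each returned group is a candidate for merging: the reference says the mentions are the same inventor, while the prediction keeps them apart.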
