### Cleaning Results
Before verifying the output with cluster precision/recall analysis, it requires cleaning. The sequence numbers were obtained so that each patent-inventor instance would have a unique identifier. However, our results contain 42 rows with duplicated identifiers, which defeats that purpose, so we correct these sequence numbers through manual review. We also resolve the 163 cases where **get_sequence()** reported 'No Match'.

### Package Imports

In [264]:
import pandas as pd
import numpy as np
from IPython.display import clear_output
import os

### Data Imports
We load the results from **get-inventor-sequence-number** and split the dataframe into three parts: 'autosequence' (completed cases), 'no_sequence' (sequence number is "NaN"), and 'duplicated' (rows that share a mention ID).

In [335]:
#read in results
autosequence = pd.read_csv('output/autosequence.csv', index_col = 0, dtype="string")
init_length = len(autosequence)

#separate dataframe into one with sequence numbers and one without
no_sequence = autosequence[autosequence['sequence'].isna()]
autosequence = autosequence[autosequence['sequence'].notna()].copy() #copy so the new column can be added without a SettingWithCopyWarning

#get mention id
autosequence['mention'] = "US" + autosequence.patent + "-" + autosequence.sequence

#separate dataframe (with sequence numbers) into one with duplicates and one without
is_duplicated = autosequence.mention.duplicated(keep=False) #keep=False flags every row in a duplicate group
duplicated = autosequence[is_duplicated]
autosequence = autosequence[~is_duplicated]

print({"Initial autosequence length": init_length, "Separated autosequence length": len(autosequence), 
    "No sequence length": len(no_sequence), "Duplicated length": len(duplicated)})

{'Initial autosequence length': 142619, 'Separated autosequence length': 142363, 'No sequence length': 214, 'Duplicated length': 42}
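
The three-way split above can be illustrated on a toy dataframe. This is only a sketch: the patent numbers and sequence values are made up, and the `toy_*` names are hypothetical stand-ins for the real variables.

```python
import pandas as pd

# Toy stand-in for output/autosequence.csv (values are invented for illustration)
toy = pd.DataFrame(
    {
        "patent": ["100", "100", "200", "300", "300"],
        "sequence": ["0", "1", pd.NA, "2", "2"],
    },
    dtype="string",
)

# Rows with no sequence number
toy_no_sequence = toy[toy["sequence"].isna()]

# Rows with a sequence number; build the mention identifier
toy_with_sequence = toy[toy["sequence"].notna()].copy()
toy_with_sequence["mention"] = (
    "US" + toy_with_sequence["patent"] + "-" + toy_with_sequence["sequence"]
)

# keep=False flags every member of a duplicate group, not just the later ones
is_dup = toy_with_sequence["mention"].duplicated(keep=False)
toy_duplicated = toy_with_sequence[is_dup]
toy_clean = toy_with_sequence[~is_dup]

# The three pieces partition the original rows
assert len(toy_clean) + len(toy_no_sequence) + len(toy_duplicated) == len(toy)
print(toy_clean["mention"].tolist())       # ['US100-0', 'US100-1']
print(toy_duplicated["mention"].tolist())  # ['US300-2', 'US300-2']
```

The same reconciliation holds for the real counts printed above: 142363 + 214 + 42 = 142619.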


We again load **rawinventor.tsv** for use in the manual inspection.

In [None]:
#reading in raw_inventor (and sorting for efficiency)
rawinventor = pd.read_csv("input/rawinventor.tsv", sep="\t", usecols=["patent_id", "sequence", "name_first", "name_last"], 
    dtype={"patent_id": "string", "sequence": "int16", "name_first": "string", "name_last": "string"})
rawinventor.set_index(['patent_id', 'sequence'], inplace=True)
rawinventor.sort_index(inplace=True)
rawinventor.head(10)
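
Indexing by `(patent_id, sequence)` is what makes the per-patent lookups in the manual inspection convenient. A minimal sketch of how such a lookup behaves, using an invented toy table (`toy_raw` and its names are hypothetical, not real data):

```python
import pandas as pd

# Toy stand-in for rawinventor.tsv (names and ids are invented)
toy_raw = pd.DataFrame(
    {
        "patent_id": ["200", "100", "100"],
        "sequence": [0, 1, 0],
        "name_first": ["Ada", "Grace", "Alan"],
        "name_last": ["Lovelace", "Hopper", "Turing"],
    }
)
toy_raw = toy_raw.set_index(["patent_id", "sequence"]).sort_index()

# Membership test against the first index level only (a partial key)
print("100" in toy_raw.index)  # True
print("999" in toy_raw.index)  # False

# Selecting by patent_id drops that level, leaving rows indexed by sequence
inventors = toy_raw.loc["100"]
print(inventors.index.tolist())       # [0, 1]
print(inventors.loc[0, "name_last"])  # Turing
```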

### Manual Inspection
Below are two separate processes for manual inspection: one of no_sequence cases and the other of duplicated cases. Results are stored in the variables below.

In [45]:
#variables for storing manual results
no_sequence_results = []
no_sequence_observed = set()
duplicated_results = []
duplicated_observed = set()

In [244]:
#manually entering sequence number for no_sequence cases
for index, row in no_sequence.iterrows(): #'row' avoids shadowing the built-in dict
    if index not in no_sequence_observed:
        if row['patent'] in rawinventor.index:
            print(pd.DataFrame(row).T)
            print(rawinventor.loc[row['patent']])
            row['sequence'] = input("Sequence number: ")
            clear_output(wait = True)
        else:
            row['sequence'] = "NaN" #patent missing from rawinventor; separated out as an error later
        no_sequence_observed.add(index)
        no_sequence_results.append(row)

In [255]:
#manually entering sequence number for duplicated cases
for patent, group in duplicated.groupby('patent'):
    if patent not in duplicated_observed:
        if patent in rawinventor.index:
            group = group.copy() #edit a copy so the assignments below are not lost
            print(group)
            print(rawinventor.loc[patent])
            #assign by position; chained group.sequence[0] looks up the label 0 and may not update the group
            seq_col = group.columns.get_loc('sequence')
            group.iloc[0, seq_col] = input("First sequence number: ")
            group.iloc[1, seq_col] = input("Second sequence number: ")
            duplicated_observed.add(patent)
            duplicated_results.append(group)
            clear_output(wait = True)
        else: #shouldn't encounter but just in case
            print("error")
            input("move on?")
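
Because the sequence numbers are typed by hand, a typo can silently produce an identifier that does not exist. One possible safeguard, which is not part of the original workflow, is to check each typed value against the inventor table before accepting it. The helper `is_valid_sequence` below is a hypothetical name, and the sketch assumes a table indexed by `(patent_id, sequence)` as above.

```python
import pandas as pd

def is_valid_sequence(raw, patent, value):
    """Return True if `value` parses as an int and is a sequence listed for `patent`.

    `raw` is assumed to be indexed by (patent_id, sequence).
    """
    try:
        seq = int(value)
    except (TypeError, ValueError):
        return False
    return (patent, seq) in raw.index

# Toy inventor table (invented values)
toy_raw = pd.DataFrame(
    {"patent_id": ["100", "100"], "sequence": [0, 1], "name_last": ["Turing", "Hopper"]}
).set_index(["patent_id", "sequence"]).sort_index()

print(is_valid_sequence(toy_raw, "100", "1"))    # True
print(is_valid_sequence(toy_raw, "100", "7"))    # False
print(is_valid_sequence(toy_raw, "100", "abc"))  # False
```

If a check like this were wrapped around the `input()` calls, the "NaN" sentinel used for unresolvable cases would need to be allowed through explicitly, since it does not parse as an integer.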

### Output
The final step is to combine the three dataframes once the 'no_sequence' and 'duplicated' cases have been resolved. After concatenation, we write **autosequence_cleaned.csv**, plus a separate **autosequence_errors.csv** for the cases that could not be resolved.

In [369]:
#converting our results to pandas dataframes
no_sequence = pd.DataFrame(no_sequence_results, dtype="string")
duplicated = pd.concat(duplicated_results).astype("string") #stack the per-patent groups back into one dataframe

#separating unresolved cases (sequence entered as "NaN") from resolved ones
no_sequence_errors = no_sequence[no_sequence['sequence'] == "NaN"]
no_sequence = no_sequence[no_sequence['sequence'] != "NaN"].copy()
duplicated_errors = duplicated[duplicated['sequence'] == "NaN"].drop('mention', axis='columns')
duplicated = duplicated[duplicated['sequence'] != "NaN"].copy()

#updating identifier
no_sequence['mention'] = "US" + no_sequence.patent + "-" + no_sequence.sequence
duplicated['mention'] = "US" + duplicated.patent + "-" + duplicated.sequence

#combining final output
autosequence_cleaned = pd.concat([autosequence, no_sequence, duplicated])
autosequence_cleaned.set_index('mention', inplace = True)
autosequence_errors = pd.concat([no_sequence_errors, duplicated_errors])

#write only if the cleaned file does not exist yet, so earlier manual work is not overwritten
if not os.path.isfile("output/autosequence_cleaned.csv"):
    autosequence_cleaned.to_csv('output/autosequence_cleaned.csv')
    autosequence_errors.to_csv('output/autosequence_errors.csv')
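
Since the point of the cleaning is a unique identifier per patent-inventor instance, it may be worth confirming that property before relying on the written file. The sketch below is an assumed addition, not part of the original notebook; `check_cleaned` is a hypothetical helper, shown here on toy data.

```python
import pandas as pd

def check_cleaned(cleaned, errors, expected_total):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # The mention index should identify each row uniquely
    if not cleaned.index.is_unique:
        dupes = cleaned.index[cleaned.index.duplicated()].unique().tolist()
        problems.append(f"duplicate mention ids: {dupes}")
    # Every input row should end up in either the cleaned or the errors frame
    if len(cleaned) + len(errors) != expected_total:
        problems.append(
            f"row count mismatch: {len(cleaned)} + {len(errors)} != {expected_total}"
        )
    return problems

# Toy frames (invented values)
toy_cleaned = pd.DataFrame(
    {"patent": ["100", "100", "300"]},
    index=pd.Index(["US100-0", "US100-1", "US300-2"], name="mention"),
)
toy_errors = pd.DataFrame({"patent": ["200"]})

print(check_cleaned(toy_cleaned, toy_errors, expected_total=4))  # []
```

In the notebook this could be called as `check_cleaned(autosequence_cleaned, autosequence_errors, init_length)`. A non-empty result would point to a mistyped sequence number or to a case that was skipped during manual review.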