### Get Inventor Sequence Number
This document walks through the get-inventor-sequence-number class, describing its methodology, data cleaning, and performance metrics. It works with the following datasets:

**patents_2005_012.tsv:** patent-inventor instances containing inventor information such as:
- **id11:** inventor cluster identifier
- **patent:** patent identifier
- **fname:** first name
- **mname:** middle name
- **lname:** last name
- **suffix:** suffix
- other inventor information including location

**rawinventor.tsv:** a second dataset of patent-inventor instances that includes each inventor's sequence number:
- **patent_id:** patent identifier
- **name_first:** first name
- **name_last:** last name
- **sequence:** sequence number

**eval_als.txt:** manual disambiguation of patent-inventor instances
- **mention_id:** mention identifier for patent-inventor instances, following the format "US_PatentNumber-SequenceNumber" (e.g., US5294443-0)
- **cluster_id:** inventor cluster identifier; patent-inventor instances that share a cluster_id refer to the same inventor
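
The mention identifier format can be illustrated with a small sketch. The helper names `make_mention_id` and `parse_mention_id` are hypothetical and not part of the original class; they only assume the "US" prefix and the hyphen separator shown in the example above.

```python
def make_mention_id(patent, sequence):
    """Build a mention ID such as 'US5294443-0' from a patent number and a sequence number."""
    return f"US{patent}-{sequence}"


def parse_mention_id(mention_id):
    """Split a mention ID back into (patent number, sequence number)."""
    # Split on the last hyphen so the sequence number is always the final piece.
    patent, _, sequence = mention_id.rpartition("-")
    if patent.startswith("US"):
        patent = patent[2:]
    return patent, int(sequence)


print(make_mention_id("5294443", 0))
print(parse_mention_id("US5294443-0"))
```

This is why the sequence number matters: without it, a row in patents_2005_012.tsv cannot be mapped to a mention_id in eval_als.txt.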

### Objective

The patents_2005_012.tsv dataset does not include the sequence number of each inventor on a given patent. PatentsView uses the sequence number to index patent-inventor instances, so it is a necessary component for organizing the data. We obtain it by cross-referencing the names in patents_2005_012.tsv with the names on the same patent in rawinventor.tsv.

The main difficulty is that the same inventor's name is not guaranteed to be written identically in both datasets. For example, "G. Kenneth Adams 3rd" in patents_2005_012.tsv appears as "George K. Adams, III" in rawinventor.tsv. The get_sequence function compares our inventor's name in patents_2005_012.tsv with the names of all inventors on the same patent in rawinventor.tsv. If a close match is found, the sequence number is returned.
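
To make "close match" concrete, here is a minimal pure-Python sketch of a normalized edit distance. It is an illustration only: the normalization used here (dividing by the longer string's length) is an assumption and may differ from the one applied by the stringcompare package, and the split of "George K. Adams, III" into a last-name field of "Adams, III" is hypothetical.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(s, t):
    """Edit distance scaled to [0, 1] by the longer string's length (an assumed normalization)."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


# Last name plus converted suffix vs. a hypothetical rawinventor last-name field.
print(normalized_distance("adams iii", "adams, iii"))
# First names differ much more, which is why last names are checked first.
print(normalized_distance("g. kenneth", "george k."))
```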

### Package Imports

In [None]:
!pip install git+https://github.com/OlivierBinette/StringCompare.git@release

In [4]:
import stringcompare
import pandas as pd
import numpy as np
import wget
import zipfile
import os

if not os.path.isfile("input/rawinventor.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip")
    with zipfile.ZipFile("rawinventor.tsv.zip", 'r') as zip_ref:
        zip_ref.extractall("input")  # extract into input/, where the check above and read_csv below look for the file
    os.remove("rawinventor.tsv.zip")

### Data Imports

First we have the **rawinventor.tsv** file. Attribute information is listed above. We set and sort the index of rawinventor.tsv on (**patent_id**, **sequence**) because the code performs a large number of **patent_id** lookups in this dataframe, and lookups on a sorted index are much faster.

In [None]:
rawinventor = pd.read_csv("input/rawinventor.tsv", sep="\t", usecols=["patent_id", "sequence", "name_first", "name_last"], 
    dtype={"patent_id": "string", "sequence": "int16", "name_first": "string", "name_last": "string"})
rawinventor.set_index(['patent_id', 'sequence'], inplace=True)
rawinventor.sort_index(inplace=True)
rawinventor.head(10)
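
The effect of the sorted two-level index can be seen on a toy dataframe. The rows below are made-up illustration data, not records from rawinventor.tsv.

```python
import pandas as pd

# Toy stand-in for rawinventor.tsv (hypothetical values).
toy = pd.DataFrame({
    "patent_id": ["5294443", "5294443", "4000000"],
    "sequence": [1, 0, 0],
    "name_first": ["Bob", "Alice", "Carol"],
    "name_last": ["Jones", "Smith", "Lee"],
})
toy = toy.set_index(["patent_id", "sequence"]).sort_index()

# Selecting on the first level returns all inventors of one patent,
# indexed by their sequence number.
inventors = toy.loc["5294443"]
print(inventors)

# Membership tests on the first level work the same way get_sequence() uses them.
print("5294443" in toy.index, "9999999" in toy.index)
```

Selecting by **patent_id** alone leaves **sequence** as the index of the result, which is what allows a matched row to be translated back into a sequence number.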

Next is the **patents_2005_012.tsv** file (attribute information listed above):

In [None]:
patents_2005_012 = pd.read_csv("patents_2005_012.tsv", sep="\t", usecols=["patent", "fname", "mname", "lname", "suffix"], dtype="string")
patents_2005_012.head(10)

### Functions

**get_word()** is used in cases where the entire name is not a match. In these cases, **get_sequence()** attempts to compare the "first word" (substring before first whitespace) of the names in **rawinventor.tsv** with the "first word" of the name in **patents_2005_012.tsv**.

In [None]:
def get_word(name):
    index = name.find(' ')
    if(index != -1):
        return name[0: index]
    else:
        return name
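
For reference, the same "first word" behaviour can be sketched with `str.partition`. The name `first_word` is hypothetical; this is an equivalent alternative for illustration, not the function used in the class.

```python
def first_word(name):
    """Return the text before the first space, or the whole string if there is no space."""
    return name.partition(" ")[0]


print(first_word("George K."))   # first name with a middle initial
print(first_word("Adams III"))   # last name with a suffix
print(first_word("Adams"))       # no whitespace: returned unchanged
```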

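**get_sequence()** is the main function used in this class. For a single patent_id and name, it returns the sequence number for the nearest string match found. More precisely, it returns the position of the best match among that patent's inventors sorted by sequence, which equals the sequence number when sequences run contiguously from 0. If no close match is found, it returns the string "NaN". Note the **autosequence_log** list, which must be defined before the function is called and which records information whenever **get_sequence()** detects a possible error.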

In [14]:
comparator = stringcompare.Levenshtein()

def get_sequence(patent_id, name_first, name_last, name_middle, suffix):

    if patent_id in rawinventor.index:
        #combined names
        first_half = name_first
        second_half = name_last

        #concat middle name/initial
        if name_middle != "&":
            first_half += " " + name_middle

        #concat suffix
        if suffix != "&":
            if suffix == "2nd":
                suffix = "II"
            elif suffix == "3rd":
                suffix = "III"
            second_half += " " + suffix

        #computing string distances
        dat = rawinventor.loc[patent_id]
        last_distances = comparator.pairwise([second_half.lower()], dat.name_last.str.lower().values)[0]
        first_distances = comparator.pairwise([first_half.lower()], dat.name_first.str.lower().values)[0]

        #one last name match
        if sum(last_distances == 0) == 1: 
            return np.argmin(last_distances)

        #multiple last name matches
        elif sum(last_distances == 0) > 1:
            return np.argmin(first_distances)

        #close matches
        elif sum(last_distances < 0.2) >= 1:
            #log the close match to autosequence_log and return sequence number
            index = np.argmin(last_distances + first_distances)
            entry = {'patent_id': patent_id, 'name_last': second_half, 'name_first': first_half, 'index': index,
                'referenced_last': dat.name_last[index], 'referenced_first': dat.name_first[index], 'type': "Close Match"}
            autosequence_log.append(entry)
            return index
        
        #vague matches
        elif sum(last_distances < 0.3) >= 1 or sum(first_distances < 0.3) >= 1:
            #log the vague match to autosequence_log and return sequence number
            index = np.argmin(last_distances + first_distances)
            entry = {'patent_id': patent_id, 'name_last': second_half, 'name_first': first_half, 'index': index,
                'referenced_last': dat.name_last[index], 'referenced_first': dat.name_first[index], 'type': "Vague Match"}
            autosequence_log.append(entry)
            return index

        #no matches for full word comparison
        else:
            #get first word for each name
            firsts = dat.apply(lambda x: get_word(x.name_first), axis=1)
            lasts = dat.apply(lambda x: get_word(x.name_last), axis=1)
            name_last = get_word(name_last)
            name_first = get_word(name_first)
            
            #recompute string distances
            last_distances = comparator.pairwise([name_last.lower()], lasts.str.lower().values)[0]
            first_distances = comparator.pairwise([name_first.lower()], firsts.str.lower().values)[0]

            #check if first word matches -> half match, otherwise no match
            if sum(last_distances < 0.2) >= 1 and sum(first_distances < 0.2) >= 1:
                #log the half match to autosequence_log and return sequence number
                index = np.argmin(last_distances + first_distances)
                entry = {'patent_id': patent_id, 'name_last': second_half, 'name_first': first_half, 'index': index,
                    'referenced_last': dat.name_last[index], 'referenced_first': dat.name_first[index], 'type': "Half Match"}
                autosequence_log.append(entry)
                return index
            else:
                #still log the nearest candidate for manual review, but return "NaN"
                index = np.argmin(last_distances + first_distances)
                entry = {'patent_id': patent_id, 'name_last': second_half, 'name_first': first_half, 'index': index,
                    'referenced_last': dat.name_last[index], 'referenced_first': dat.name_first[index], 'type': "No Match"}
                autosequence_log.append(entry)
                return "NaN"
    else:
        #if the patent is not present in rawinventor.tsv, log a bare "No Match" entry
        entry = {'type': "No Match"}
        autosequence_log.append(entry)
        return "NaN"
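
The order in which **get_sequence()** tries its thresholds can be summarized in a simplified sketch. The function `match_tier` and the "Fallback" label are hypothetical; the thresholds (0, 0.2, 0.3) are taken from the code above, and "Fallback" stands for the branch that retries the comparison on first words only.

```python
def match_tier(last_distances, first_distances):
    """Return the first tier whose condition holds, mirroring the branch order above."""
    # Any exact last-name match: one hit is used directly,
    # several hits are disambiguated by first name.
    if any(d == 0 for d in last_distances):
        return "Exact Match"
    # A last name within 0.2 of some inventor on the patent.
    if any(d < 0.2 for d in last_distances):
        return "Close Match"
    # Either name within 0.3 of some inventor on the patent.
    if any(d < 0.3 for d in last_distances) or any(d < 0.3 for d in first_distances):
        return "Vague Match"
    # Otherwise the comparison is retried on first words only.
    return "Fallback"


print(match_tier([0.0, 0.8], [0.5, 0.1]))
print(match_tier([0.1, 0.9], [0.9, 0.9]))
print(match_tier([0.9, 0.9], [0.25, 0.9]))
print(match_tier([0.9], [0.9]))
```

Only the tiers below "Exact Match" write an entry to **autosequence_log**, which is why exact matches do not appear in the logged results.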

### Execution
We use **DataFrame.apply()** with `axis=1` to run **get_sequence()** row-wise on **patents_2005_012.tsv**. Depending on your machine and the size of your data files, this step can take a long time, so it is skipped when the output file already exists.

In [16]:
if not os.path.isfile("output/autosequence.csv"):
    autosequence_log = []
    patents_2005_012["sequence"] = patents_2005_012.apply(lambda x: get_sequence(x.patent, x.fname, x.lname, x.mname, x.suffix), axis=1)
    patents_2005_012.to_csv("output/autosequence.csv")
    results = pd.DataFrame(autosequence_log)
    results.to_csv("output/autosequence_log.csv")
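
Because unmatched rows receive the string "NaN", the resulting **sequence** column mixes integers and strings. One possible way to clean it afterwards, shown here on made-up values and not part of the original class, is to coerce it to a nullable integer column:

```python
import pandas as pd

# Toy stand-in for the "sequence" column produced by get_sequence().
sequences = pd.Series([0, 2, "NaN", 1], dtype="object")

# Coerce non-numeric entries to missing values, then use a nullable integer dtype.
clean = pd.to_numeric(sequences, errors="coerce").astype("Int16")
print(clean)
```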

### Results
**patents_2005_012.tsv** has 142,619 rows. After executing our code, **get_sequence()** logged the following counts for each match 'type':

In [14]:
#uncomment to reload the log when the execution cell above was skipped
#results = pd.read_csv('output/autosequence_log.csv')
results.type.value_counts()

Close Match    2598
Vague Match    1682
No Match        214
Half Match      107
Name: type, dtype: int64

The table does not report exact matches, because **get_sequence()** does not log them. Their count is the difference between the number of rows in the dataframe and the number of logged results: 138,018 'Exact Matches'.
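
The arithmetic behind the exact-match count can be checked directly from the figures reported above:

```python
total_rows = 142619  # rows in patents_2005_012.tsv
logged = {"Close Match": 2598, "Vague Match": 1682, "No Match": 214, "Half Match": 107}

logged_total = sum(logged.values())
exact_matches = total_rows - logged_total
print(logged_total, exact_matches)  # 4601 138018
```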

Of the 214 'No Match' cases, we attribute 51 to a 'No Key Error', where the 'patent' in **patents_2005_012.tsv** did not exist in **rawinventor.tsv**. The other 163 cases require manual review, as **get_sequence()** could not reliably determine the sequence number.
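
One way to separate the two kinds of 'No Match' is to use the fact that the missing-key branch logs only a 'type' field, so its patent_id is empty in the results dataframe. The sketch below uses a made-up three-row log to show the idea; it is an assumption about how the split could be computed, not a record of how the 51 was obtained.

```python
import pandas as pd

# Toy log (hypothetical rows) shaped like autosequence_log.
toy_log = pd.DataFrame([
    {"patent_id": "5294443", "type": "No Match"},     # names compared, nothing close
    {"type": "No Match"},                             # patent missing from rawinventor.tsv
    {"patent_id": "4000000", "type": "Close Match"},
])

no_match = toy_log[toy_log["type"] == "No Match"]
missing_key = int(no_match["patent_id"].isna().sum())
needs_review = len(no_match) - missing_key
print(missing_key, needs_review)

# The same subtraction with the reported figures.
print(214 - 51)
```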