### Project Outline

* Read the data and create a subset of the records that don't have a VIAF ID.
    * We might work with the entire dataset though, because more data might make clustering names easier.
* Create 3 new columns: `parsedName`, `_personID`, `regularizedName`
    * The leading underscore indicates that `_personID` is a temporary column. The ID can just be a sequential integer as long as it's unique; it doesn't need to be a true hash.
    * Whether we can reliably create a `regularizedName` (i.e. a modern, regularized form that should replace all variants in the metadata) remains to be seen. It might not be possible without a lot of human intervention.
* We need the following functions:
    * `name_preprocess()`: Returns a cleaned up version of the name string or `None` if it doesn't look like a name.
    * `name_pair()`: Takes a pair of preprocessed names and returns `True` if they are a close enough match that we should compute `weighted_levenshtein()`, or `False` if we should ignore them.
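    * `weighted_levenshtein()`: Returns the weighted Levenshtein distance between a pair of preprocessed names (lower means more similar).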
    * `substitution_cost_dict_generate()`: Generates the substitution cost dict for `weighted_levenshtein()`.
    * `ner_pubStmt()`: Takes the `pubStmt` field and runs NER on it with `spaCy`.
* Procedure:
    * For each row, run `name_preprocess()`. 
        * If we get a name back, we store it in `parsedName`.
    * After we are done with the entire DataFrame, take the subset that has `parsedName` set.
        * The other rows get written out to a CSV without the 3 new columns; these need to be run through NER and/or checked by hand.
    * In the new subset DataFrame, we generate combinations of `pd.DataFrame` `index` ids to get all possible name pairs, then run each pair through `name_pair()`.
    * If they pass, we run them through `weighted_levenshtein()` and store the result in a `networkx` graph where the nodes are `pd.DataFrame` `index` ids and the edge weights are `1 / weighted_levenshtein()` (i.e. the more similar the nodes, the higher the weight). Identical names have a distance of 0, so that case needs special handling. If the `weighted_levenshtein()` score is above a certain threshold, we don't add the edge to the graph.
    * When we are done, we break down the graph into its connected components (mutually disjoint subgraphs) using [this approach](https://stackoverflow.com/questions/61536745/how-to-separate-an-unconnected-networkx-graph-into-multiple-mutually-disjoint-gr). Each subgraph will be one name in all its variant forms. This performs the clustering for us.
    * We sort these subgraphs by number of nodes, assign each of them a unique ID starting at 1, and then write everything out to a CSV.
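
The pairing, scoring, and clustering steps of the procedure could look roughly like the sketch below. It is a minimal illustration, not the final implementation, and it rests on several assumptions that are not settled in the outline: the body of `name_pair()` (a simple length check here), the `max_distance` threshold, the example substitution costs, sorting the clusters largest first, and using `1 / (1 + distance)` as the edge weight so that identical names don't cause a division by zero. A plain dict of `{row id: name}` stands in for the DataFrame index and `parsedName` column.

```python
from itertools import combinations

import networkx as nx


def weighted_levenshtein(a, b, sub_costs=None):
    """Edit distance where substituting x for y costs sub_costs[(x, y)].

    Insertions, deletions, and unlisted substitutions cost 1.0.
    """
    sub_costs = sub_costs or {}
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            else:
                # Treat the cost dict as symmetric: look up (x, y), then (y, x).
                sub = sub_costs.get((ca, cb), sub_costs.get((cb, ca), 1.0))
            curr.append(min(prev[j] + 1.0,       # deletion
                            curr[j - 1] + 1.0,   # insertion
                            prev[j - 1] + sub))  # substitution or match
        prev = curr
    return prev[-1]


def name_pair(a, b):
    """Cheap pre-filter. Hypothetical rule: lengths differ by at most 3."""
    return abs(len(a) - len(b)) <= 3


def cluster_names(names, sub_costs=None, max_distance=2.0):
    """Map each row id in `names` ({row id: name}) to a cluster ID."""
    graph = nx.Graph()
    graph.add_nodes_from(names)  # unmatched names become one-node clusters
    for (id_a, a), (id_b, b) in combinations(names.items(), 2):
        if not name_pair(a, b):
            continue
        dist = weighted_levenshtein(a, b, sub_costs)
        if dist <= max_distance:
            # Assumption: 1 / (1 + dist) rather than 1 / dist, so dist == 0 is safe.
            graph.add_edge(id_a, id_b, weight=1.0 / (1.0 + dist))
    # Assumption: largest cluster gets ID 1.
    components = sorted(nx.connected_components(graph), key=len, reverse=True)
    return {node: cid
            for cid, component in enumerate(components, start=1)
            for node in component}


# Made-up example data and costs, for illustration only.
names = {
    0: "john smith",
    1: "iohn smith",
    2: "john smyth",
    3: "william shakespeare",
    4: "william shakespear",
    5: "thomas more",
}
costs = {("i", "j"): 0.2, ("i", "y"): 0.2}
clusters = cluster_names(names, sub_costs=costs)
# Rows 0, 1, 2 share cluster 1; rows 3, 4 share cluster 2; row 5 is alone in cluster 3.
```

With a real DataFrame, the returned mapping could be written into `_personID` via the index. Note that comparing all pairs grows quadratically with the number of names, which is why the cheap `name_pair()` filter runs before the distance computation.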