# Tutorial: name analysis

This tutorial outlines the general functionality of name comparison, using the function `find_duplicate_persons` and its supporting functions.

`find_duplicate_persons` takes a list of names, for example names extracted from a text using NER, and tries to find, for each name, all the names that likely represent the same person.
<br> 
It takes the following rule-based steps:
- it removes titles from names, if any. See `keywords.Titles.titles` for the titles considered. The final name comparison is based on the names without titles.
- it considers first name abbreviations (`abbreviate`). When two names are compared and only one of them has an abbreviated first name, it tries to abbreviate the first name of the other and then applies the comparison.
- it compares two names using the token set ratio (`get_tsr`) to determine how similar they are.
The ratio required to consider two names similar enough is in the range of 90-100% and depends on the form of the names (length, number of terms, presence of initials). The required ratio was chosen based on experience.
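
As an illustration of the abbreviation and comparison steps above, the following is a minimal sketch built only on the standard library. It is not the code behind `abbreviate` or `get_tsr`, whose internals are not shown in this tutorial: `abbreviate_first_name` and `token_set_ratio` are hypothetical helpers, `difflib.SequenceMatcher` is a stand-in similarity measure, and the lowercasing and whitespace tokenisation are assumptions.

```python
from difflib import SequenceMatcher


def abbreviate_first_name(name: str) -> str:
    """Hypothetical helper: turn 'James Brown' into 'J. Brown'."""
    first, _, rest = name.partition(" ")
    if not rest or first.endswith("."):
        return name  # single token, or first name already abbreviated
    return f"{first[0]}. {rest}"


def _ratio(a: str, b: str) -> float:
    """Character-level similarity of two strings as a percentage."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_ratio(name_a: str, name_b: str) -> float:
    """Stand-in token set ratio: compare shared and unshared tokens."""
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    # The best of the three pairwise comparisons is the score
    return max(
        _ratio(common, combined_a),
        _ratio(common, combined_b),
        _ratio(combined_a, combined_b),
    )


# One name has an abbreviated first name, so the other is abbreviated first
print(token_set_ratio("J. Brown", abbreviate_first_name("James Brown")))  # prints 100.0
```

In this sketch a name scores 100 whenever its tokens are a subset of the other name's tokens, so word order alone never lowers the score.
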
<br>
For each name, a list of names is created containing all the similar names found. The list is ordered from the longest to the shortest version of the name. The longest name is the one with the most characters, unless it contains an abbreviation, in which case the next longest name is selected. Duplicate lists, if any, are removed.
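
The ordering rule can be sketched as follows. This is an assumption-laden illustration, not the library's `sort_select_name`: `has_abbreviation` and `sort_longest_first` are hypothetical helpers, and treating any token that ends in a period as an abbreviation only makes sense once titles have been removed.

```python
from typing import List


def has_abbreviation(name: str) -> bool:
    """Assumption: a token ending in '.' is an abbreviation (titles already removed)."""
    return any(token.endswith(".") for token in name.split())


def sort_longest_first(names: List[str]) -> List[str]:
    """Order names from longest to shortest, leading with an unabbreviated name if one exists."""
    ordered = sorted(names, key=len, reverse=True)
    for index, name in enumerate(ordered):
        if not has_abbreviation(name):
            # Move the longest unabbreviated name to the front
            ordered.insert(0, ordered.pop(index))
            break
    return ordered


print(sort_longest_first(["J.A.B. Brown", "James Brown"]))
# prints ['James Brown', 'J.A.B. Brown']
```
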
<br>

The lists of names for each name are collected in a combined list, so the function returns a list of lists.
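
To make the shape of the result concrete, here is a small sketch of collecting per-name lists into one list of lists while dropping duplicates. `collect_unique` is a hypothetical helper and the input data is invented for illustration. Whether the library compares lists in exactly this way is not stated in this tutorial; the sketch treats two lists as duplicates only when they hold the same names in the same order.

```python
from typing import List


def collect_unique(name_lists: List[List[str]]) -> List[List[str]]:
    """Combine per-name lists into a list of lists, keeping the first copy of each list."""
    seen = set()
    combined = []
    for names in name_lists:
        key = tuple(names)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            combined.append(list(names))
    return combined


# Invented example: the first two names produced the same list of similar names
per_name = [
    ["James Brown", "J. Brown"],
    ["James Brown", "J. Brown"],
    ["Mary Smith"],
]
print(collect_unique(per_name))
# prints [['James Brown', 'J. Brown'], ['Mary Smith']]
```
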

In [12]:
# Load the requirements
%matplotlib inline
import sys
sys.path.append('../')

from nedextract.utils import nameanalysis
from nedextract.utils import keywords

In [14]:
# check out the available functions from nameanalysis
help(nameanalysis)

Help on module nedextract.utils.nameanalysis in nedextract.utils:

NAME
    nedextract.utils.nameanalysis - This file contains the class NameAnalysis.

CLASSES
    builtins.object
        NameAnalysis
    
    class NameAnalysis(builtins.object)
     |  NameAnalysis(names: list)
     |  
     |  This class contains functions used to analyse names.
     |  
     |  It contains the functions:
     |  - abbreviate
     |  - get_tsr
     |  - strip_names_from_title
     |  - sort_select_name
     |  - find_similar_names
     |  - find_duplicate_persons
     |  
     |  Methods defined here:
     |  
     |  __init__(self, names: list)
     |      Define class variables.
     |  
     |  find_duplicate_persons(self)
     |      Find duplicate names.
     |      
     |      From a list of names, find which names represent different writings of the same name,
     |      e.g. James Brown and J. Brown. Returns a list consisting of sublists, in which each sublist
     |      contains all versi

In [15]:
# check out the titles used for name analysis
keywords.Titles.titles

['prof.',
 'dr.',
 'mr.',
 'ir.',
 'drs.',
 'bacc.',
 'kand.',
 'dr.h.c.',
 'ing.',
 'bc.',
 'phd',
 'phd.',
 'dhr.',
 'mevr.',
 'mw.',
 'ds.',
 'mgr.',
 'mevrouw',
 'meneer',
 'jhr.']
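
A title list like the one above could be applied to a name as in the sketch below. This is not the library's `strip_names_from_title`: `strip_titles` is a hypothetical helper, the list is copied from the output above, and case-insensitive matching of whole whitespace-separated tokens is an assumption.

```python
# Titles copied from the output of keywords.Titles.titles shown above
TITLES = {
    'prof.', 'dr.', 'mr.', 'ir.', 'drs.', 'bacc.', 'kand.', 'dr.h.c.',
    'ing.', 'bc.', 'phd', 'phd.', 'dhr.', 'mevr.', 'mw.', 'ds.', 'mgr.',
    'mevrouw', 'meneer', 'jhr.',
}


def strip_titles(name: str, titles=TITLES) -> str:
    """Drop every token that matches a known title, ignoring case."""
    kept = [token for token in name.split() if token.lower() not in titles]
    return " ".join(kept)


print(strip_titles("Prof. dr. James Brown"))  # prints James Brown
```
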