# Lesson 6

In this tutorial we work out a way to build a lookup table for a person's name, a string variable which consists of multiple components, i.e. first name, middle name and last name.

As a library we use _metaphone_, which may not be installed in your environment. If you work in a conda Python environment, you may have to install the package via the _pip_ command, as described here:

* https://www.puzzlr.org/install-packages-pip-conda-environment/

## Introduction

Suppose that the person is called _Peter Alfred Escher_ and we got that name via an official API.

We use the standard library module _itertools_ to calculate our permutations, and we require the _doublemetaphone_ algorithm.

In [1]:
import itertools
from metaphone import doublemetaphone

Suppose we retrieved the fully qualified name into one of our variables and that we also have a unique identifier for this person.

In [2]:
person_qualified_name = "Peter Alfred Escher"
person_id = "A123"

As a refresher, we use the _doublemetaphone_ algorithm to generate a tuple of keys (a primary and a secondary key) which can be used for the comparison. Refer to the tutorial __[Medium Towards Data Science Tutorial](https://towardsdatascience.com/python-tutorial-fuzzy-name-matching-algorithms-7a6f43322cc5)__.

In [3]:
tp = doublemetaphone("Peter Alfred Escher")
str(tp)

"('PTRLFRTXR', 'PTRLFRTSKR')"

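To make the comparison step concrete, here is a minimal sketch of how two key tuples might be compared. The helper `compare_keys` and its labels `"strong"`, `"weak"` and `"none"` are hypothetical names chosen for this illustration. The tuples are written out as literals, so the sketch runs without the _metaphone_ package; the tuple `("PTRSKR", "")` is made up to show a weak match and was not computed.

```python
def compare_keys(keys_a, keys_b):
    """Classify the similarity of two (primary, secondary) key tuples."""
    # Identical, non-empty primary keys count as the strongest match.
    if keys_a[0] and keys_a[0] == keys_b[0]:
        return "strong"
    # Otherwise look for any shared non-empty key (primary or secondary).
    shared = {k for k in keys_a if k} & {k for k in keys_b if k}
    return "weak" if shared else "none"


print(compare_keys(("PTRXR", "PTRSKR"), ("PTRXR", "PTRSKR")))
print(compare_keys(("PTRXR", "PTRSKR"), ("PTRSKR", "")))
print(compare_keys(("PTRXR", "PTRSKR"), ("AXRPTR", "ASKRPTR")))
```

The three calls print `strong`, `weak` and `none`. Empty keys are ignored on purpose, so that two names without a secondary key are not treated as similar.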
We now want to calculate all permutations of this name which could be used as a Twitter account name. Our person may well leave out some components in the Twitter account name, for example by using the name "Peter Escher" or "Peter A. Escher".
As a first step we calculate all permutations of the 3 name components:

In [4]:
name_list = list( itertools.permutations(["peter", "alfred", "escher"]))
print(name_list)

[('peter', 'alfred', 'escher'), ('peter', 'escher', 'alfred'), ('alfred', 'peter', 'escher'), ('alfred', 'escher', 'peter'), ('escher', 'peter', 'alfred'), ('escher', 'alfred', 'peter')]
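The name components above are typed in by hand. As a small sketch, assuming the qualified name separates its components with whitespace, they could also be derived from the name string itself. The variable `name_components` is a name chosen for this illustration.

```python
import itertools

person_qualified_name = "Peter Alfred Escher"

# Lowercase the name and split it on whitespace (assumption: the components
# are separated by spaces and contain no spaces themselves).
name_components = person_qualified_name.lower().split()

name_list = list(itertools.permutations(name_components))
print(name_components)
print(len(name_list))
```

This prints `['peter', 'alfred', 'escher']` and `6`. Names with multi-word components (e.g. "van der Berg") would need a different splitting rule.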


## Build up a Lookup Dictionary
We now build up a lookup dictionary. We concatenate each combination of name components into a single string, generate a metaphone tuple from it and add both keys to our lookup structure, which is a list of two dictionaries (one for the primary key, one for the secondary key). As the value we store our unique person identifier. For this we define a function.

In [5]:
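# lookup_dict[0] is keyed by the primary metaphone key, lookup_dict[1] by the secondary one
lookup_dict = [{},{}]

def add_permutations_to_dictionary(name_list,person_id):
    for name_component in name_list:
        # concatenate the components of one permutation, e.g. 'peteralfredescher'
        name = ''.join(name_component)
        metaphone_tuple = doublemetaphone(name)
        # note: this first version overwrites any value already stored under the key
        lookup_dict[0][metaphone_tuple[0]] = [person_id]
        lookup_dict[1][metaphone_tuple[1]] = [person_id]

add_permutations_to_dictionary(name_list, person_id)
print("Our dictionary: "+str(lookup_dict))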

Our dictionary: [{'PTRLFRTXR': ['A123'], 'PTRXRLFRT': ['A123'], 'ALFRTPTRXR': ['A123'], 'ALFRTXRPTR': ['A123'], 'AXRPTRLFRT': ['A123'], 'AXRLFRTPTR': ['A123']}, {'PTRLFRTSKR': ['A123'], 'PTRSKRLFRT': ['A123'], 'ALFRTPTRSKR': ['A123'], 'ALFRTSKRPTR': ['A123'], 'ASKRPTRLFRT': ['A123'], 'ASKRLFRTPTR': ['A123']}]


As one can see in the above output, the value stored in the dictionary is a list, `[person_id]`. We do that because we may later want to store a second name combination which maps to the same key. Therefore we have to ensure that we can store multiple person identifiers in the value.

Perhaps our user left out a name component and registered on Twitter with the account name _Peter Escher_ or _Escher Peter_. So let's also get all these permutations. This is easily achieved by passing an additional argument (in our case 2) to the _permutations_ function, which defines the number of elements in each permutation.

In [6]:
name_list = list( itertools.permutations(["peter", "alfred", "escher"],2))
print(name_list)

[('peter', 'alfred'), ('peter', 'escher'), ('alfred', 'peter'), ('alfred', 'escher'), ('escher', 'peter'), ('escher', 'alfred')]


Let's add these permutations to our lookup dictionary:

In [7]:
add_permutations_to_dictionary(name_list, person_id)
print("Our dictionary: "+lookup_dict.__str__())

Our dictionary: [{'PTRLFRTXR': ['A123'], 'PTRXRLFRT': ['A123'], 'ALFRTPTRXR': ['A123'], 'ALFRTXRPTR': ['A123'], 'AXRPTRLFRT': ['A123'], 'AXRLFRTPTR': ['A123'], 'PTRLFRT': ['A123'], 'PTRXR': ['A123'], 'ALFRTPTR': ['A123'], 'ALFRTXR': ['A123'], 'AXRPTR': ['A123'], 'AXRLFRT': ['A123']}, {'PTRLFRTSKR': ['A123'], 'PTRSKRLFRT': ['A123'], 'ALFRTPTRSKR': ['A123'], 'ALFRTSKRPTR': ['A123'], 'ASKRPTRLFRT': ['A123'], 'ASKRLFRTPTR': ['A123'], '': ['A123'], 'PTRSKR': ['A123'], 'ALFRTSKR': ['A123'], 'ASKRPTR': ['A123'], 'ASKRLFRT': ['A123']}]


Finally, our user could have used just one of the name components when registering with Twitter.

In [8]:
name_list = list( itertools.permutations(["peter", "alfred", "escher"],1))
print(name_list)

[('peter',), ('alfred',), ('escher',)]
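As a cross-check on the three outputs so far: the number of permutations of length `r` taken from `n` components is `n! / (n - r)!`. The following sketch compares that formula with what `itertools.permutations` generates for our three components. The variable names are chosen for this illustration, and `math.perm` requires Python 3.8 or later.

```python
import itertools
import math

components = ["peter", "alfred", "escher"]
n = len(components)

total = 0
for r in range(n, 0, -1):
    generated = len(list(itertools.permutations(components, r)))
    expected = math.perm(n, r)  # n! / (n - r)!
    print(r, generated, expected)
    total += generated

print("total:", total)
```

For three components this gives 6, 6 and 3 permutations, so 15 name variants in total.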


Let's also add these permutations to our dictionary.

In [9]:
add_permutations_to_dictionary(name_list, person_id)
print("Our dictionary: "+lookup_dict.__str__())

Our dictionary: [{'PTRLFRTXR': ['A123'], 'PTRXRLFRT': ['A123'], 'ALFRTPTRXR': ['A123'], 'ALFRTXRPTR': ['A123'], 'AXRPTRLFRT': ['A123'], 'AXRLFRTPTR': ['A123'], 'PTRLFRT': ['A123'], 'PTRXR': ['A123'], 'ALFRTPTR': ['A123'], 'ALFRTXR': ['A123'], 'AXRPTR': ['A123'], 'AXRLFRT': ['A123'], 'PTR': ['A123'], 'ALFRT': ['A123'], 'AXR': ['A123']}, {'PTRLFRTSKR': ['A123'], 'PTRSKRLFRT': ['A123'], 'ALFRTPTRSKR': ['A123'], 'ALFRTSKRPTR': ['A123'], 'ASKRPTRLFRT': ['A123'], 'ASKRLFRTPTR': ['A123'], '': ['A123'], 'PTRSKR': ['A123'], 'ALFRTSKR': ['A123'], 'ASKRPTR': ['A123'], 'ASKRLFRT': ['A123'], 'ASKR': ['A123']}]
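The second dictionary above contains an empty string `''` as a key. Judging from the output, `doublemetaphone` returns an empty secondary key when a name has no alternative encoding, so every such name ends up under the same `''` key. One possible way to avoid this is to skip empty keys. The sketch below uses a hypothetical helper `add_keys` and key tuples written out as literals rather than computed:

```python
lookup_dict = [{}, {}]


def add_keys(metaphone_tuple, person_id):
    """Store person_id under each non-empty key of a (primary, secondary) tuple."""
    for index, key in enumerate(metaphone_tuple):
        if not key:
            # An empty key carries no phonetic information, so it is skipped.
            continue
        # Plain assignment, like the first version of the function above.
        lookup_dict[index][key] = [person_id]


add_keys(("PTRXR", "PTRSKR"), "A123")   # both keys present
add_keys(("PTRLFRT", ""), "A123")       # no secondary key
print(lookup_dict)
```

With this guard the second dictionary only receives `'PTRSKR'`, and no `''` entry is created.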


Let's enhance our function so that it handles existing keys and values:
* check whether the key already exists
* if the key already exists, check whether the value list already contains `person_id`
* if not, add the `person_id` to the value list
* if the key does not exist yet, create it with the value `[person_id]`

In [13]:
def add_permutations_to_dictionary(perm_tuple_list,person_id):
    for perm_tuple in perm_tuple_list:
        concat_name = ''.join(perm_tuple)
        metaphone_tuple = doublemetaphone(concat_name)
        # primary key: append the id to an existing entry, or create the entry
        if metaphone_tuple[0] in lookup_dict[0]:
            if person_id not in lookup_dict[0][metaphone_tuple[0]]:
                lookup_dict[0][metaphone_tuple[0]].append(person_id)
        else:
            lookup_dict[0][metaphone_tuple[0]] = [person_id]
        # secondary key: same logic on the second dictionary
        if metaphone_tuple[1] in lookup_dict[1]:
            if person_id not in lookup_dict[1][metaphone_tuple[1]]:
                lookup_dict[1][metaphone_tuple[1]].append(person_id)
        else:
            lookup_dict[1][metaphone_tuple[1]] = [person_id]


# one permutation, passed as a tuple of name components
add_permutations_to_dictionary([("peter", "alfred")], "A555")
print("Our dictionary: "+str(lookup_dict))

Our dictionary: [{'PTRLFRTXR': ['A123'], 'PTRXRLFRT': ['A123'], 'ALFRTPTRXR': ['A123'], 'ALFRTXRPTR': ['A123'], 'AXRPTRLFRT': ['A123'], 'AXRLFRTPTR': ['A123'], 'PTRLFRT': ['A123', 'A555'], 'PTRXR': ['A123', 'A235'], 'ALFRTPTR': ['A123'], 'ALFRTXR': ['A123'], 'AXRPTR': ['A123', 'A235'], 'AXRLFRT': ['A123'], 'PTR': ['A123', 'A235', 'A345'], 'ALFRT': ['A123'], 'AXR': ['A123', 'A235'], 'PTRJNKT': ['A345'], 'PTRKTJN': ['A345'], 'JNPTRKT': ['A345'], 'JNKTPTR': ['A345'], 'KTPTRJN': ['A345'], 'KTJNPTR': ['A345'], 'PTRJN': ['A345'], 'PTRKT': ['A345'], 'JNPTR': ['A345'], 'JNKT': ['A345'], 'KTPTR': ['A345'], 'KTJN': ['A345'], 'JN': ['A345'], 'KT': ['A345']}, {'PTRLFRTSKR': ['A123'], 'PTRSKRLFRT': ['A123'], 'ALFRTPTRSKR': ['A123'], 'ALFRTSKRPTR': ['A123'], 'ASKRPTRLFRT': ['A123'], 'ASKRLFRTPTR': ['A123'], '': ['A123', 'A235', 'A345', 'A555'], 'PTRSKR': ['A123', 'A235'], 'ALFRTSKR': ['A123'], 'ASKRPTR': ['A123', 'A235'], 'ASKRLFRT': ['A123'], 'ASKR': ['A123', 'A235'], 'ANPTRKT': ['A345'], 'ANKTPTR'

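As one can see in the above example, certain keys now point to two person identifiers, e.g. `'PTRLFRT': ['A123', 'A555']`. The output also lists `A235` and `A345`; judging by the cell numbers, these entries stem from the cells further below, which had already been executed in the notebook session when this output was recorded.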
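The check-then-append logic can also be written more compactly with `dict.setdefault`, which returns the existing list for a key or stores and returns a new empty one. This is a minimal sketch with a hypothetical helper `add_id`, using keys written out as literals:

```python
lookup_dict = [{}, {}]


def add_id(index, key, person_id):
    """Append person_id to the list stored under key, creating the list if needed."""
    ids = lookup_dict[index].setdefault(key, [])
    if person_id not in ids:
        ids.append(person_id)


add_id(0, "PTRLFRT", "A123")
add_id(0, "PTRLFRT", "A555")
add_id(0, "PTRLFRT", "A555")  # duplicate, ignored
print(lookup_dict[0])
```

This prints `{'PTRLFRT': ['A123', 'A555']}`. A `set` as the value would make the duplicate check implicit, at the cost of losing the insertion order of the identifiers.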

## Generalize our Name Component Permutation Generator

Up to now we prepared our `name_list` manually for 3, 2 and 1 element(s). Let's implement that in a function as well. We define a general `generate_permutations` function, which calculates the permutations of every length from `n` down to 1 for the `n` name components passed in via `name_list`:

In [11]:
def generate_permutations(name_list):
    perms = []
    perms.extend(itertools.permutations(name_list))
    i = len(name_list)-1
    while i > 0:
        perms.extend(itertools.permutations(name_list,i))
        i -=1
    return perms

perms = generate_permutations(["peter","escher"])
perms.__str__()

"[('peter', 'escher'), ('escher', 'peter'), ('peter',), ('escher',)]"

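As an alternative sketch, the same list can be built without an explicit counter by chaining the permutations of each length. The function name `generate_permutations_chained` is chosen for this illustration and is not part of the lesson's code.

```python
import itertools


def generate_permutations_chained(name_list):
    """Return all permutations of every length from len(name_list) down to 1."""
    lengths = range(len(name_list), 0, -1)
    return list(itertools.chain.from_iterable(
        itertools.permutations(name_list, r) for r in lengths))


perms = generate_permutations_chained(["peter", "escher"])
print(perms)
```

For `["peter", "escher"]` this prints `[('peter', 'escher'), ('escher', 'peter'), ('peter',), ('escher',)]`. Note that the number of permutations grows quickly with the number of components: four components already yield 24 + 24 + 12 + 4 = 64 variants.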
Let's define the overall function which calls the two functions we defined:

In [15]:
lookup_dict = [{},{}]
def add_person_to_lookup_table(person_id, name_list):
    perms = generate_permutations(name_list) 
    add_permutations_to_dictionary(perms,person_id)

add_person_to_lookup_table("A123",["peter", "alfred", "escher"])
add_person_to_lookup_table("A235",["peter", "escher"])
add_person_to_lookup_table("A345",["peter", "john", "goood"])
print("Our dictionary: "+lookup_dict.__str__())

Our dictionary: [{'PTRLFRTXR': ['A123'], 'PTRXRLFRT': ['A123'], 'ALFRTPTRXR': ['A123'], 'ALFRTXRPTR': ['A123'], 'AXRPTRLFRT': ['A123'], 'AXRLFRTPTR': ['A123'], 'PTRLFRT': ['A123'], 'PTRXR': ['A123', 'A235'], 'ALFRTPTR': ['A123'], 'ALFRTXR': ['A123'], 'AXRPTR': ['A123', 'A235'], 'AXRLFRT': ['A123'], 'PTR': ['A123', 'A235', 'A345'], 'ALFRT': ['A123'], 'AXR': ['A123', 'A235'], 'PTRJNKT': ['A345'], 'PTRKTJN': ['A345'], 'JNPTRKT': ['A345'], 'JNKTPTR': ['A345'], 'KTPTRJN': ['A345'], 'KTJNPTR': ['A345'], 'PTRJN': ['A345'], 'PTRKT': ['A345'], 'JNPTR': ['A345'], 'JNKT': ['A345'], 'KTPTR': ['A345'], 'KTJN': ['A345'], 'JN': ['A345'], 'KT': ['A345']}, {'PTRLFRTSKR': ['A123'], 'PTRSKRLFRT': ['A123'], 'ALFRTPTRSKR': ['A123'], 'ALFRTSKRPTR': ['A123'], 'ASKRPTRLFRT': ['A123'], 'ASKRLFRTPTR': ['A123'], '': ['A123', 'A235', 'A345'], 'PTRSKR': ['A123', 'A235'], 'ALFRTSKR': ['A123'], 'ASKRPTR': ['A123', 'A235'], 'ASKRLFRT': ['A123'], 'ASKR': ['A123', 'A235'], 'ANPTRKT': ['A345'], 'ANKTPTR': ['A345'], 'ANP

As an example: the single component "peter" yields the key `PTR`, whose value list holds the three identifiers: `'PTR': ['A123', 'A235', 'A345']`

So we now have a function ready which populates our lookup table. Next we have to design a `match_name` function which checks whether a name or any of its permutations matches a lookup table entry.

## MatchName Function
In this first iteration we construct a `match_list` which stores all lookup entries that match any permutation of our `name_list`. Note that only the primary key is looked up here.

In [35]:
def match_name(name_list):
    match_list = []
    permutation_list = generate_permutations(name_list) 
    for perm_tuple in permutation_list:
        concat_name = ''.join(perm_tuple)
        metaphone_tuple = doublemetaphone(concat_name)
        if metaphone_tuple[0] in lookup_dict[0]:
            match_list.append((concat_name, lookup_dict[0][metaphone_tuple[0]]))
    print("Match with "+ match_list.__str__())
              
match_name(["peter","alfred"])

Match with [('peteralfred', ['A123']), ('alfredpeter', ['A123']), ('peter', ['A123', 'A235', 'A345']), ('alfred', ['A123'])]
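The function above only consults the primary dictionary, `lookup_dict[0]`. As a hedged sketch of how the secondary key could be included as well, the hypothetical helper `lookup_ids` below collects the identifiers found under either key. The sample table and the key tuples are written out as literals, and the tuple `("XYZ", "")` is a made-up key that matches nothing.

```python
def lookup_ids(lookup_dict, metaphone_tuple):
    """Return the ids stored under the primary key, then extra ones under the secondary key."""
    found = []
    for index, key in enumerate(metaphone_tuple):
        if not key:
            continue  # ignore empty keys
        for person_id in lookup_dict[index].get(key, []):
            if person_id not in found:
                found.append(person_id)
    return found


sample_lookup = [
    {"PTRXR": ["A123", "A235"], "PTR": ["A123", "A235", "A345"]},
    {"PTRSKR": ["A123", "A235"]},
]
print(lookup_ids(sample_lookup, ("PTRXR", "PTRSKR")))
print(lookup_ids(sample_lookup, ("PTR", "")))
print(lookup_ids(sample_lookup, ("XYZ", "")))
```

The three calls print `['A123', 'A235']`, `['A123', 'A235', 'A345']` and `[]`.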


As we can see, three tuples point to exactly one `person_id`, and all of them carry the same id `'A123'`. So we enhance our function with this uniqueness check, which means:

* Does our match list contain one or more tuples which all point to one and the same person?


In [46]:
def match_name(name_list):
    match_list = []
    permutation_list = generate_permutations(name_list)
    for perm_tuple in permutation_list:
        concat_name = ''.join(perm_tuple)
        metaphone_tuple = doublemetaphone(concat_name)
        if metaphone_tuple[0] in lookup_dict[0]:
            match_list.append((concat_name, lookup_dict[0][metaphone_tuple[0]]))
    # Our enhancement: look only at matches which point to exactly one person
    unique_id = None
    for match_tuple in match_list:
        if len(match_tuple[1]) == 1:
            if unique_id is not None:
                # two single-person matches disagree, so there is no unique id
                if unique_id != match_tuple[1][0]:
                    unique_id = None
                    break
            else:
                unique_id = match_tuple[1][0]
    return (unique_id, match_list)

# 'result' instead of 'id', which would shadow the built-in function id()
result = match_name(["peter","alfred"])
print("Unique Id (expect A123): "+str(result[0]))
result = match_name(["peter","john"])
print("Unique Id (expect A345): "+str(result[0]))
result = match_name(["peter","escher"])
print("Unique Id (expect None): "+str(result[0]))

Unique Id (expect A123): A123
Unique Id (expect A345): A345
Unique Id (expect None): None
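The uniqueness check can also be expressed with a set: collect the identifier of every match that points to exactly one person, and accept the result only if exactly one distinct identifier remains. This is a minimal sketch, with `unique_person_id` as a hypothetical helper name and the match list written out as a literal:

```python
def unique_person_id(match_list):
    """Return the single id shared by all one-person matches, or None."""
    single_ids = {ids[0] for _, ids in match_list if len(ids) == 1}
    return single_ids.pop() if len(single_ids) == 1 else None


match_list = [
    ("peteralfred", ["A123"]),
    ("alfredpeter", ["A123"]),
    ("peter", ["A123", "A235", "A345"]),
    ("alfred", ["A123"]),
]
print(unique_person_id(match_list))
```

This prints `A123`. If the one-person matches disagree, or if there are none at all, the helper returns `None`.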
