# Resolving Incorrect Names with Fuzzy Matching

This notebook shows how fuzzy search can help identify the *correct* name for each entity after receiving warning messages from the `pybel` debugger. It catches typos and simple mistakes, but is not as powerful as a text mining approach at identifying logical and domain-specific (biological) mistakes.

In [1]:
import pybel
import os
from fuzzywuzzy import process, fuzz
from pybel.manager import database_models
from collections import defaultdict

Parse BEL document, get lots of name errors (PyBEL132)

The log file is generated with code like this:
```python
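>>> import os
>>> import pybel
>>> # open() does not expand '~', so expand the paths explicitly
>>> log_path = os.path.expanduser('~/Desktop/testoutput.txt')
>>> with open(log_path, 'w') as f:
...     g = pybel.from_path(os.path.expanduser('~/Path/to/file.bel'), log_stream=f)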
```

In [2]:
log_path = os.path.expanduser('~/Desktop/testoutput.txt')

In [None]:
# cat ~/.pybel/test.txt | grep "PyBEL132" | cut -d "-" -f 6,8 | cut -d ":" -f 1,2

In [3]:
unique_lines = defaultdict(set)

with open(log_path) as f:
    for line in f:
        if 'PyBEL132' not in line:
            continue
        line = line.strip().split(' - ')
        line_number = line[3].strip('Line ')
        line = line[5].strip('Invalid').strip().split(':')
        namespace = line[0][:-5]
        name = line[1].strip()
        unique_lines[namespace, name].add(line_number)
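
One caveat about the parsing cell above: `str.strip('Invalid')` and `str.strip('Line ')` remove any of the listed *characters* from both ends of the string, not a literal prefix. Whether this matters depends on the exact log format, which is not shown here. The sketch below uses a made-up field and a hypothetical helper, `remove_prefix`, to show the difference.

```python
def remove_prefix(text, prefix):
    """Return text without a leading prefix, leaving the rest untouched."""
    return text[len(prefix):] if text.startswith(prefix) else text


# Hypothetical field, for illustration only; real PyBEL log lines may look different
field = 'Invalid differentiation'

# Character-set stripping also eats the trailing 'n', since 'n' is in 'Invalid'
print(repr(field.strip('Invalid')))

# Prefix removal only touches the start of the string
print(repr(remove_prefix(field, 'Invalid').strip()))
```

If the names in the real log can end in one of the letters of `Invalid`, a prefix-based approach like this is the safer choice.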

In [4]:
# Only the unique (namespace, name) pairs are searched, since fuzzy matching is expensive
len(unique_lines)

126

Load namespaces, do fuzzy search to find correct terms. This relies on having run the corresponding file on this machine, and already having the namespaces loaded!

In [5]:
manager = pybel.manager.namespace_cache.DefinitionCacheManager()
manager.ensure_cache()

In [6]:
# This cell hacks up the definitions manager. This functionality will be built in soon :)
values = {}

def get_values(key):
    if key not in values:
        url = manager.sesh.query(database_models.Definition).filter_by(keyword=key, definitionType='N').first().url
        values[key] = set(manager.namespace_cache[url])
    return values[key]

The default scorer (`fuzz.WRatio`) works better for global similarity:

In [8]:
%%time
for namespace, name in unique_lines:
    print('{} {} [{}]'.format(namespace, name, ', '.join(unique_lines[namespace, name])))
    for putative, score in process.extract(name, get_values(namespace), limit=5):
        print('[{}%] {}'.format(score, putative))
    print()

GOBP negative regulation of dopaminergic neuron differentiation [0016013, 0010252]
[95%] negative regulation of neuron differentiation
[90%] neuron differentiation
[90%] dopaminergic neuron differentiation
[89%] negative regulation of cell differentiation
[88%] negative regulation of forebrain neuron differentiation

dbSNP rs212805 [0006145, 0000787, 0006132]
[75%] rs201825
[67%] rs12129547
[67%] rs12989701
[67%] rs12610605
[67%] rs17523802

dbSNP rs2890982 [0006114, 0000781, 0006139]
[71%] rs201825
[67%] rs9498
[67%] rs8106922
[67%] rs8079215
[67%] rs4291702

GOBP regulation of ubiquitin protein ligase activity [0010228, 0010229]
[90%] positive regulation of ubiquitin-protein ligase activity involved in regulation of mitotic cell cycle transition
[90%] negative regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle
[90%] regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle
[90%] regulation of ubiquitin-protein ligase activity involved 

The partial ratio scorer (`fuzz.partial_ratio`) finds substring matches more easily:

In [7]:
%%time
for namespace, name in unique_lines:
    print('{} {} [{}]'.format(namespace, name, ', '.join(unique_lines[namespace, name])))
    for putative, score in process.extract(name, get_values(namespace), scorer=fuzz.partial_ratio, limit=5):
        print('[{}%] {}'.format(score, putative))
    print()

GOBP negative regulation of dopaminergic neuron differentiation [0016013, 0010252]
[100%] neuron differentiation
[100%] dopaminergic neuron differentiation
[92%] fermentation
[91%] GABAergic neuron differentiation
[88%] regulation of pH

dbSNP rs212805 [0006145, 0000787, 0006132]
[75%] rs201825
[75%] rs12129547
[62%] rs10889061
[62%] rs2618697
[62%] rs1385600

dbSNP rs2890982 [0006114, 0000781, 0006139]
[67%] rs6859
[67%] rs12989701
[67%] rs8106922
[67%] rs8079215
[67%] rs4291702

GOBP regulation of ubiquitin protein ligase activity [0010228, 0010229]
[100%] positive regulation of ubiquitin-protein ligase activity involved in regulation of mitotic cell cycle transition
[100%] negative regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle
[100%] regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle
[100%] regulation of ubiquitin-protein ligase activity involved in meiotic cell cycle
[100%] positive regulation of ubiquitin-protein ligase 
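
The difference between the two result sets above comes from what each scorer compares. As a rough illustration, the sketch below uses only the standard library's `difflib`. It is not how `fuzzywuzzy` implements its scorers, and the function names `simple_ratio` and `simple_partial_ratio` are hypothetical. A whole-string ratio penalizes length differences, while a partial ratio scores only the best-aligned window of the longer string.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    starts = range(len(longer) - len(shorter) + 1)
    return max(simple_ratio(shorter, longer[i:i + len(shorter)]) for i in starts)


query = 'negative regulation of dopaminergic neuron differentiation'
candidate = 'neuron differentiation'

# The candidate is much shorter than the query, so the whole-string score is low
print(simple_ratio(query, candidate))

# The candidate occurs verbatim inside the query, so its best window matches exactly
print(simple_partial_ratio(query, candidate))
```

This is also a plausible reason why short terms can rank highly under partial matching: only their best-aligned window counts, so a high partial score is best read as "shares a substring" and not as "is the same term".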