# Resolving Incorrect Names with Fuzzy Matching

**Author:** [Charles Tapley Hoyt](https://github.com/cthoyt/)

**Estimated Run Time:**

This notebook demonstrates how a fuzzy string search can help identify the *correct* name for incorrectly named entities in a BEL document, after receiving warning messages during compilation. This method can find typos and simple mistakes, but it is not as powerful as a text mining approach at identifying logical and domain-specific (biological) mistakes.

In [1]:
from collections import defaultdict
import os
import time

from fuzzywuzzy import process, fuzz
import pybel
from pybel.manager import database_models

In [2]:
pybel.__version__

'0.2.6'

In [3]:
time.asctime()

'Wed Nov 16 12:16:45 2016'

In [4]:
# load definitions manager
manager = pybel.manager.namespace_cache.DefinitionCacheManager()
manager.ensure_cache()

In [5]:
# This cell hacks up the definitions manager. This functionality will be built in soon :)
values = {}

def get_values(key):
    if key not in values:
        url = manager.sesh.query(database_models.Definition).filter_by(keyword=key, definitionType='N').first().url
        values[key] = set(manager.namespace_cache[url])
    return values[key]

## Generation of Error Log

During the parsing of a BEL document, name errors ([IllegalNamespaceNameException](http://pybel.readthedocs.io/en/latest/logging.html#pybel.parser.parse_exceptions.IllegalNamespaceNameException)) are marked in the log with `PyBEL132`. The Selventa example corpora do not have these errors, so a working copy of the [AETIONOMY](www.aetionomy.eu) Parkinson's Disease Knowledge Model is used as an example.

In [6]:
bel_path = os.path.expanduser('~/ownCloud/BEL/PD_Aetionomy.bel')
log_path = os.path.expanduser('~/Downloads/pd_log.txt')

In [7]:
%%time
with open(log_path, 'w') as f:
    g = pybel.from_path(bel_path, log_stream=f)

CPU times: user 27.4 s, sys: 1.59 s, total: 29 s
Wall time: 29.3 s


## Parsing Error Log

While a simple chain of bash commands can extract the relevant lines, this notebook uses Python to iterate over the log file and collect the namespace, name, and line number of each error.

``` sh
cat ~/.pybel/test.txt | grep "PyBEL132" | cut -d "-" -f 6,8 | cut -d ":" -f 1,2
```

An example line looks like: `2016-11-16 12:17:28,687 - pybel - WARNING - Line 0015655 - PyBEL132 - Invalid GOBP name: mitophagy: translocation(g(HGNC:PARK2),MESHCS:"Endoplasmic Reticulum" ,MESHCS:Mitochondria) causesNoChange bp(GOBP:mitophagy)`
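As a minimal sketch of how the fields in such a line can be pulled apart, the following uses a regular expression instead of chained string splits. The pattern and the helper name `parse_name_error` are assumptions made for illustration and are inferred only from the single example line above; log lines from other PyBEL versions may be formatted differently, and a name that itself contains `: ` would be cut short.

```python
import re

# Hypothetical pattern, inferred from the one example line shown above
NAME_ERROR_RE = re.compile(
    r' - Line (?P<line_number>\d+) - PyBEL132 - '
    r'Invalid (?P<namespace>\S+) name: (?P<name>.*?): '
)


def parse_name_error(log_line):
    """Return (namespace, name, line_number), or None for any other kind of line."""
    match = NAME_ERROR_RE.search(log_line)
    if match is None:
        return None
    return match.group('namespace'), match.group('name'), match.group('line_number')


example = (
    '2016-11-16 12:17:28,687 - pybel - WARNING - Line 0015655 - PyBEL132 - '
    'Invalid GOBP name: mitophagy: translocation(g(HGNC:PARK2),'
    'MESHCS:"Endoplasmic Reticulum" ,MESHCS:Mitochondria) causesNoChange bp(GOBP:mitophagy)'
)
print(parse_name_error(example))
```

Named groups keep the extraction readable and make it obvious which part of the line each value comes from.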

In [8]:
unique_lines = defaultdict(set)

with open(log_path) as f:
    for line in f:
        if 'PyBEL132' not in line:
            continue
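        # Fields: timestamp, logger, level, line number, error code, message
        fields = line.strip().split(' - ')
        line_number = fields[3].replace('Line ', '', 1)
        # The message looks like "Invalid <namespace> name: <name>: <statement>"
        descriptor, name = fields[5].split(':')[:2]
        namespace = descriptor[len('Invalid '):-len(' name')]
        unique_lines[namespace, name.strip()].add(line_number)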

In [9]:
# just get the unique ones, since fuzzy search is some heavy business
len(unique_lines)

126

## Searching the Database

The `fuzzywuzzy` package is used to perform fuzzy string searches. It uses the Levenshtein distance for local, global, and partial string matching. The [documentation](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) provides excellent examples. This notebook runs two searches for each invalid name: a global match using the default scorer of `process.extract`, and a partial ratio match using `fuzz.partial_ratio`. Together they provide two sets of candidate names.
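To make the difference between a global and a partial match concrete, here is a small pure-Python sketch. It is not `fuzzywuzzy`'s implementation and its scores will not match the library's exactly; the function names `levenshtein`, `global_similarity`, and `partial_similarity` are hypothetical and exist only for this illustration.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def global_similarity(a, b):
    """Percentage similarity computed over the whole of both strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def partial_similarity(a, b):
    """Best global_similarity of the shorter string against any equally long window of the longer."""
    shorter, longer = sorted((a, b), key=len)
    width = len(shorter)
    return max(
        global_similarity(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


print(global_similarity('purine', '1H-purine'))
print(partial_similarity('purine', '1H-purine'))
```

In this sketch the partial score is 100 whenever one string is contained in the other. That is why a partial search tends to rank very short names highly, and why it is worth reading both result lists side by side.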

In [10]:
%%time
for namespace, name in sorted(unique_lines):
    ns_values = get_values(namespace)
    print('{}: {} [{}]'.format(namespace, name, ', '.join(sorted(unique_lines[namespace, name]))))
    print('Global match')
    for putative, score in process.extract(name, ns_values, limit=6):
        print('[{}%] {}'.format(score, putative))
    print('Partial Ratio')
    for putative, score in process.extract(name, ns_values, scorer=fuzz.partial_ratio, limit=6):
        print('[{}%] {}'.format(score, putative))
    print()

CHEBI: 7,9-Dihydro-1H-purine-2,6,8(3H)-trione [0007572, 0007573]
Global match
[100%] 7,9-dihydro-1H-purine-2,6,8(3H)-trione
[92%] 5,7-dihydro-1H-purine-2,6,8(9H)-trione
[90%] purine
[90%] 1H-purine
[86%] 9(R)-HPETE
[86%] 2-(1-Aziridinyl)ethanol
Partial Ratio
[100%] purine
[100%] 7,9-dihydro-1H-purine-2,6,8(3H)-trione
[100%] RI
[100%] 1H-purine
[100%] R
[92%] 5,7-dihydro-1H-purine-2,6,8(9H)-trione

CHEBI: Cycloheximide [0014585]
Global match
[100%] cycloheximide
[90%] imide
[79%] cyclohexylamine
[77%] cyclothiazide
[75%] imidate
[75%] cyclohexene
Partial Ratio
[100%] cycloheximide
[100%] imide
[83%] imidate
[80%] amide
[80%] imine
[80%] cycloxydim

CHEBI: Norepinephrine sulfate [0011442]
Global match
[90%] sulfate
[86%] heparan sulfate D-glucuronyl-D-galactosyl-D-galactosyl-D-xylosyl-L-serine
[86%] sodium(E)-((1R,4aS,4bS,8aS,10aS)-1-(2-((2S,5R)-2,5-dimethoxy-2,5-dihydrofuran-3-yl)ethyl)-4b,8,8,10a-tetramethyldecahydrophenanthren-2(1H,3H,4bH)-ylidene)methyl sulfate
[86%] N-acetyl-D-galac

## Conclusions

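Many name errors are due to a simple misspelling or incorrect capitalization. The Global Match is very good at identifying these errors: for `Cycloheximide`, for example, the correctly cased `cycloheximide` is returned with a score of 100%.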
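Because capitalization errors are so common, a cheap exact lookup on lowercased names could resolve them before running any fuzzy search. The following is a minimal sketch of that idea, not part of PyBEL; `build_case_index` and `resolve_capitalization` are hypothetical helpers, and the sample names are a handful of the CHEBI candidates printed in the search output earlier in this notebook.

```python
def build_case_index(names):
    """Map each lowercased name to the set of canonical spellings that share it."""
    index = {}
    for name in names:
        index.setdefault(name.lower(), set()).add(name)
    return index


def resolve_capitalization(name, index):
    """Return the canonical spellings that differ from name only by case, sorted."""
    return sorted(index.get(name.lower(), ()))


# A few CHEBI names taken from the search output above, for illustration only
chebi_names = {
    'cycloheximide',
    'imide',
    'cyclohexylamine',
    '7,9-dihydro-1H-purine-2,6,8(3H)-trione',
}
index = build_case_index(chebi_names)

print(resolve_capitalization('Cycloheximide', index))
print(resolve_capitalization('7,9-Dihydro-1H-purine-2,6,8(3H)-trione', index))
print(resolve_capitalization('Norepinephrine sulfate', index))
```

Names that return an empty list here would still need the fuzzy search, so the lookup acts as a fast first pass and not as a replacement.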

In this document, there are many identifiers from dbSNP that are simply not included in the namespace. This prompts an update of the namespace to a more modern listing of the names. Other errors have shown that there are terms in the hierarchy of GOBP that are not included in the namespace, and therefore a more general assertion from a publication is difficult to represent in BEL. Interestingly, some of these errors are consistently used multiple times. This prompts an investigation of the annotation practices used in creating this document.
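A short summary of the collected errors can make both observations easier to check: which namespaces account for most invalid names, and which names recur across several lines. The sketch below assumes a mapping shaped like `unique_lines`, from `(namespace, name)` to a set of line numbers. The helper names are hypothetical, and the three entries are only the ones visible in the example log line and output of this notebook, not the full set of 126.

```python
from collections import Counter

# Same shape as unique_lines: (namespace, name) -> set of BEL line numbers
example_errors = {
    ('CHEBI', '7,9-Dihydro-1H-purine-2,6,8(3H)-trione'): {'0007572', '0007573'},
    ('CHEBI', 'Cycloheximide'): {'0014585'},
    ('GOBP', 'mitophagy'): {'0015655'},
}


def errors_per_namespace(errors):
    """Count the distinct invalid names found in each namespace."""
    return Counter(namespace for namespace, _ in errors)


def repeated_errors(errors, minimum=2):
    """List (namespace, name, count) for names used on at least `minimum` lines."""
    repeated = [
        (namespace, name, len(line_numbers))
        for (namespace, name), line_numbers in errors.items()
        if len(line_numbers) >= minimum
    ]
    # Most frequent first, then alphabetical for a stable order
    return sorted(repeated, key=lambda entry: (-entry[2], entry[0], entry[1]))


print(errors_per_namespace(example_errors).most_common())
print(repeated_errors(example_errors))
```

Since line numbers are stored in a set, a name that appears twice on the same line is counted once.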

Additionally, this search does not take into account semantics. Proper ontologies could provide much better synonym matching and make better use of modern text-mining techniques. 