# Simple Demo

First we read the example files, which will be annotated automatically.  
Optionally you can pass your own tokenization method.

In [1]:
from data_utils import Records, Record
import glob

records = Records()
for i, file in enumerate(glob.glob('data/example/*.txt')):
    with open(file, 'r', encoding='utf-8') as r_file:
        records.append(Record(i, r_file.read()))
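
As a rough illustration of what a custom tokenization method might look like, here is a minimal regex-based tokenizer. The name `simple_tokenize` is hypothetical, and this demo does not show how `Record` accepts a tokenizer, so check `data_utils` for the actual parameter before wiring it in.

```python
import re

# Hypothetical tokenizer for illustration; it is not part of data_utils.
# The pattern matches runs of word characters or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


print(simple_tokenize("Start Hydrochloorthiazide 12,5 mg."))
```

Lowercasing here is a design choice that makes later lexicon lookups case-insensitive; drop it if the original casing matters for your annotations.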


Then we build our lexicon from the `lexicon.txt` file and open the annotation GUI.  
You can close the GUI without marking anything for this demo.

In [2]:
from data_utils import build_lexicon
from extraction import extract_all
from gui import AnnotationGUI
from curation import get_false_positives

lexicon = build_lexicon()
print('Lexicon:', ', '.join(e.raw for e in lexicon.values()))

extract_all(records, lexicon)  # pre-annotate tokens that are contained in the lexicon
gui = AnnotationGUI(records)
tags = gui.annotated
fps = get_false_positives(records, lexicon)

Lexicon: hydrochloorthiazide
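
To make the pre-annotation step more concrete, the sketch below marks every token that appears in a lexicon. It is only an illustration: `pre_annotate` and the plain set of strings are assumptions, and the real `extract_all` operates on `Records` and lexicon entries whose internals are not shown in this demo.

```python
def pre_annotate(tokens, lexicon_terms):
    """Return (token, is_match) pairs.

    A token counts as a match if its lowercase form is in the lexicon.
    """
    terms = {term.lower() for term in lexicon_terms}
    return [(token, token.lower() in terms) for token in tokens]


# Made-up example sentence, already split into tokens.
tokens = ["patient", "takes", "Hydrochloorthiazide", "daily"]
annotated = pre_annotate(tokens, {"hydrochloorthiazide"})
matches = [token for token, is_match in annotated if is_match]
print(matches)
```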


Next, the `CandidateFinder` finds the misspelled example once we call its `extend_fuzzy` method. It then opens a review GUI presenting the final example, which it found from its context.

In [3]:
from context_utils import CandidateFinder

cdf = CandidateFinder(records, lexicon)
cdf.extend_fuzzy()  # find fuzzy matches
print('Lexicon after fuzzy matching:', ', '.join(e.raw for e in lexicon.values()))
cdf.process_contexts()  # find lexicon match contexts
cdf.get_candidates()  # sets cdf.candidates to a list of candidates ordered by their occurrence count
cdf.start_review()  # opens the review GUI
cdf.save_all()  # saves the lexicon, rejected entries, and partial match statistics
print('Final lexicon:', ', '.join(e.raw for e in lexicon.values()))

Lexicon after fuzzy matching: hydrochloorthiazide, hydrocholoorthizide
Final lexicon: hydrochloorthiazide, hydrocholoorthizide, calci chew dD
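
The fuzzy-matching step above can be pictured with a small edit-distance sketch. This is an assumption about the general idea, not a description of how `extend_fuzzy` is implemented: the helper names `edit_distance` and `fuzzy_matches` and the threshold `max_distance=2` are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(tokens, lexicon_terms, max_distance=2):
    """Return tokens not in the lexicon that lie within max_distance of an entry."""
    found = []
    for token in tokens:
        if token in lexicon_terms:
            continue  # exact matches are already covered by the lexicon
        if any(edit_distance(token, term) <= max_distance for term in lexicon_terms):
            found.append(token)
    return found


tokens = ["hydrochloorthiazide", "hydrocholoorthizide", "paracetamol"]
print(fuzzy_matches(tokens, {"hydrochloorthiazide"}))
```

With a threshold of 2, the misspelling from the demo output is picked up because it differs from the lexicon entry by one inserted and one deleted character, while an unrelated token is left alone.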
