# Sample Workflow for Dataturks NER Tools

In [1]:
# Local Modules
from formatting import format_labelled_data, format_unlabelled_data
from training import train_crf
from pre_annotate import pre_annotate_unlabelled
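
The cells below call a `title()` helper that is not among the imports above. Its definition is not shown in this notebook, so the following is only a minimal sketch, assuming it prints the text between two dashed rules as the cell outputs suggest. The `banner()` function and the `width` parameter are hypothetical names introduced for illustration.

```python
def banner(text, width=40):
    """Build a three-line banner: a dashed rule, the text, a dashed rule."""
    rule = "-" * width  # width is an assumption based on the printed output
    return "\n".join([rule, text, rule])


def title(text, width=40):
    """Print the banner for `text` (assumed behaviour of the notebook's helper)."""
    print(banner(text, width))


title("Formatting Unlabelled Data...")
```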

## Modules:

The repo is broken down into three main files:

1. formatting.py
2. training.py
3. pre_annotate.py

## Sample Workflow:

Say you have just completed your first batch of annotations for a project using Dataturks.

You have saved the unlabelled data you initially uploaded and the next batch you would like to annotate, respectively, as:

- #### data/unlabelled/unlabelled_batch_1.txt
- #### data/unlabelled/unlabelled_batch_2.txt

The annotations you made for batch 1 through Dataturks were downloaded in "Standard NER Format" and saved as:

- #### data/labelled/labelled_batch_1.tsv

Additionally, you have folders for your models and pre-annotated data:
- #### models/
- #### data/pre_annotated/

And a file to keep your results in:
- #### data/model_results.csv
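
If these folders do not exist yet, they can be created up front. The snippet below is a minimal sketch using only the standard library. `create_layout()` is a hypothetical helper, not part of the repo, and it uses the folder name that the later cells write to (`data/pre_annotated/`).

```python
from pathlib import Path


def create_layout(root="."):
    """Create the folders and the results file used in this walkthrough.

    Hypothetical helper: the repo itself may create these differently.
    """
    root = Path(root)
    folders = [
        root / "data" / "unlabelled",
        root / "data" / "labelled",
        root / "data" / "pre_annotated",
        root / "models",
    ]
    for folder in folders:
        # exist_ok=True makes the call safe to repeat
        folder.mkdir(parents=True, exist_ok=True)

    results = root / "data" / "model_results.csv"
    results.touch(exist_ok=True)  # creates an empty file if it is missing
    return folders, results
```

Calling `create_layout(".")` from the repo root leaves any existing files untouched, because both `mkdir` and `touch` are told to accept paths that already exist.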

With this setup, you can use each file as follows.

### Using formatting.py:

This file is called by the others to format unlabelled and labelled data before passing it to a CRF model. 

You can use <code>format_unlabelled_data()</code> for .txt files formatted for upload to Dataturks:

In [2]:
# Test Unlabelled Data Formatting
title("Formatting Unlabelled Data...")
unlabelled_file = "./data/unlabelled/unlabelled_batch_1.txt"
x,tokens = format_unlabelled_data(unlabelled_file)
print("\nRaw Text:")
print(' '.join([a for a,b in tokens[0]]))
print("\nToken Sample:")
print([a for a,b in tokens[0]])
print("\nFirst Word Features:")
print(x[0][0],"\n")

----------------------------------------
Formatting Unlabelled Data...
----------------------------------------

Raw Text:
HISTORY OF PRESENT ILLNESS 41-year-old black gentleman status post Nissen fundoplication five years ago

Token Sample:
['HISTORY', 'OF', 'PRESENT', 'ILLNESS', '41-year-old', 'black', 'gentleman', 'status', 'post', 'Nissen', 'fundoplication', 'five', 'years', 'ago']

First Word Features:
{'bias': 1.0, 'word.lower()': 'history', 'word[-3:]': 'ORY', 'word[-2:]': 'RY', 'word.isupper()': True, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'NN', 'postag[:2]': 'NN', 'BOS': True, '+1:word.lower()': 'of', '+1:word.istitle()': False, '+1:word.isupper()': True, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'} 
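
The feature dictionary above has the same keys as the per-token features commonly used in CRF tutorials. The sketch below shows one way such a dictionary could be built from `(word, postag)` pairs. It is an assumption about what `formatting.py` does internally, not a copy of it, and `word2features()` is a hypothetical name.

```python
def word2features(sent, i):
    """Build a feature dict for token i of `sent`, a list of (word, postag) pairs."""
    word, postag = sent[i]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "postag": postag,
        "postag[:2]": postag[:2],
    }
    if i > 0:
        # Context from the previous token
        prev_word, prev_tag = sent[i - 1]
        features.update({
            "-1:word.lower()": prev_word.lower(),
            "-1:word.istitle()": prev_word.istitle(),
            "-1:word.isupper()": prev_word.isupper(),
            "-1:postag": prev_tag,
            "-1:postag[:2]": prev_tag[:2],
        })
    else:
        features["BOS"] = True  # beginning of sequence

    if i < len(sent) - 1:
        # Context from the next token
        next_word, next_tag = sent[i + 1]
        features.update({
            "+1:word.lower()": next_word.lower(),
            "+1:word.istitle()": next_word.istitle(),
            "+1:word.isupper()": next_word.isupper(),
            "+1:postag": next_tag,
            "+1:postag[:2]": next_tag[:2],
        })
    else:
        features["EOS"] = True  # end of sequence
    return features


sentence = [("HISTORY", "NN"), ("OF", "NNP")]
print(word2features(sentence, 0))
```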



You can use <code>format_labelled_data()</code> for data that has been annotated and downloaded as a .tsv file from Dataturks:

In [3]:
labelled_files = ["./data/labelled/labelled_batch_1.tsv"]

title("Formatting Labelled Data...")
x,y,tokens = format_labelled_data(labelled_files)
print("\nRaw Text:")
print(" ".join(tokens[0]))
print("\nText Sample:")
print(tokens[0])
print("\nLabel Sample:")
print(y[0])
print("\nFirst Word Features:")
print(x[0][0],"\n")

----------------------------------------
Formatting Labelled Data...
----------------------------------------

Raw Text:
HISTORY OF PRESENT ILLNESS 41-year-old black gentleman status post Nissen fundoplication five years ago

Text Sample:
['HISTORY', 'OF', 'PRESENT', 'ILLNESS', '41-year-old', 'black', 'gentleman', 'status', 'post', 'Nissen', 'fundoplication', 'five', 'years', 'ago']

Label Sample:
['O', 'O', 'O', 'O', 'B-Age', 'O', 'B-Gender', 'O', 'O', 'B-Procedure', 'I-Procedure', 'B-Time', 'I-Time', 'I-Time']

First Word Features:
{'bias': 1.0, 'word.lower()': 'history', 'word[-3:]': 'ORY', 'word[-2:]': 'RY', 'word.isupper()': True, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'NN', 'postag[:2]': 'NN', 'BOS': True, '+1:word.lower()': 'of', '+1:word.istitle()': False, '+1:word.isupper()': True, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'} 
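
The label sample uses the BIO scheme: `B-` marks the first token of an entity, `I-` marks a continuation, and `O` marks tokens outside any entity. The sketch below shows how plain per-token labels could be converted to BIO tags. It assumes the download carries one plain label per token, which is not confirmed here, and `to_bio()` is a hypothetical helper rather than a function from `formatting.py`.

```python
def to_bio(labels, outside="O"):
    """Convert plain per-token labels to BIO tags.

    Consecutive tokens with the same label are treated as one entity, so two
    adjacent entities of the same type cannot be told apart by this sketch.
    """
    bio = []
    previous = outside
    for label in labels:
        if label == outside:
            bio.append(outside)
        elif label == previous:
            bio.append("I-" + label)  # continues the current entity
        else:
            bio.append("B-" + label)  # starts a new entity
        previous = label
    return bio


plain = ["O", "Age", "O", "Gender", "Procedure", "Procedure"]
print(to_bio(plain))
```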



### Using training.py

You can use training.py to train and evaluate CRF models with the downloaded annotated data. Simply call the <code>train_crf()</code> function and pass it a list of labelled file names (here <code>labelled_files</code>) to get a baseline CRF model.

In [4]:
title("Testing CRF training...")
crf = train_crf(labelled_files)

----------------------------------------
Testing CRF training...
----------------------------------------
Test Results:

------------------------------------------------------------
                            precision    recall  f1-score   support

                      Dose       0.99      0.89      0.94        76
                       DOS       0.75      0.76      0.76       525
                     Route       0.97      0.87      0.92        70
                 Procedure       0.79      0.68      0.73       157
                      Time       0.74      0.74      0.74        78
                 Condition       0.77      0.81      0.79       246
                      Date       0.93      0.86      0.89        58
                      Drug       0.90      0.87      0.88       200
                      BODY       0.76      0.64      0.70       127
                       GEO       0.93      0.96      0.95       148
Other Measurement / Result       0.62      0.58      0.60       155
 

By default, this prints an evaluation of every entity being labelled, saves the overall F1 score, and saves the model.
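
The saved model path used later in this walkthrough (`crf_Thu_Jan__2_19:06:58_2020`) looks like a timestamp with spaces replaced by underscores. The sketch below shows one way to produce such a name and to append a score to a results file. It is an illustration under that assumption, not the code in `training.py`: `model_name()`, `append_result()`, and the `model`/`f1` column names are all hypothetical.

```python
import csv
import time
from pathlib import Path


def model_name(when=None, prefix="crf"):
    """Return e.g. 'crf_Thu_Jan__2_19:06:58_2020' for the given struct_time.

    Uses the current local time when `when` is None.
    """
    stamp = time.asctime(when) if when is not None else time.asctime()
    return f"{prefix}_{stamp.replace(' ', '_')}"


def append_result(results_file, name, f1):
    """Append one (model, f1) row, writing a header if the file is empty."""
    path = Path(results_file)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["model", "f1"])  # assumed column names
        writer.writerow([name, f"{f1:.4f}"])
```

Keeping one row per trained model makes it easy to compare batches later. Note that a name containing colons is not a valid file name on every operating system.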

### Using pre_annotate.py

Now that you have a saved model, you can use it to pre-annotate your next batch of unlabelled data. Pre-annotated data can speed up annotation considerably, because the model handles much of the work on the easier NER tags. Simply pass the path to the saved CRF model, the unlabelled file, and the new save file to the function <code>pre_annotate_unlabelled()</code> (the example below also passes a list of tags to ignore), and the script will pre-annotate your data and format it so that it can be uploaded to Dataturks immediately.

In [5]:
crf_path = "./models/crf_Thu_Jan__2_19:06:58_2020"
unlabelled_file = "./data/unlabelled/unlabelled_batch_2.txt"
save_file = "./data/pre_annotated/pre_annotated_batch_2.txt"
ignore_tags = ["O"]
title("Getting Predictions for Unannotated Data...")

# Load Model, Generate Pre-Annotated File
pre_annotate_unlabelled(crf_path,unlabelled_file,save_file,ignore_tags)

----------------------------------------
Getting Predictions for Unannotated Data...
----------------------------------------
----------------------------------------
Annotations Saved to: ./data/pre_annotated/pre_annotated_batch_2.txt
----------------------------------------

Sample Raw Text:
HISTORY OF PRESENT ILLNESS 80 Russian female with h/o CAD, AF s/p PPM, HTN, CHF EF 45-50% , CRI Cr 1.5 , lung CA s/p resection in 2153 , chronic pain who presents to the ED with complaints of progressive LE pain and weakness over the past several days to weeks

Sample Prediction:
[['HISTORY', 'O'], ['OF', 'O'], ['PRESENT', 'O'], ['ILLNESS', 'O'], ['80', 'Age'], ['Russian', 'O'], ['female', 'Gender'], ['with', 'O'], ['h/o', 'O'], ['CAD, AF', 'Condition'], ['s/p', 'O'], ['PPM, HTN, CHF', 'Condition'], ['EF 45-50%', 'Other Measurement / Result'], [',', 'O'], ['CRI Cr 1.5', 'Other Measurement / Result'], [',', 'O'], ['lung', 'BODY'], ['CA', 'Condition'], ['s/p', 'O'], ['resection', 'Procedure'], ['in

The saved annotations can be uploaded directly to Dataturks, where the model's predictions take care of some of the easier NER categories.
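
In the sample prediction, consecutive tokens of the same entity appear merged into one `[text, label]` pair, while each `O` token stays on its own. The sketch below shows one way to get that shape from per-token BIO tags. It is an assumption about the post-processing, not the code in `pre_annotate.py`, and `merge_bio()` is a hypothetical name.

```python
def merge_bio(tokens, tags, outside="O"):
    """Merge BIO-tagged tokens into [text, label] pairs.

    Entity tokens are joined with spaces; outside tokens are kept one per pair.
    """
    spans = []
    for token, tag in zip(tokens, tags):
        if tag == outside:
            spans.append([token, outside])
            continue
        prefix, sep, label = tag.partition("-")
        if not sep:
            # Tag has no B-/I- prefix: treat it as the start of an entity
            prefix, label = "B", tag
        if prefix == "I" and spans and spans[-1][1] == label:
            spans[-1][0] += " " + token  # extend the current entity
        else:
            spans.append([token, label])  # start a new entity
    return spans


tokens = ["80", "Russian", "female", "with", "h/o", "CAD,", "AF"]
tags = ["B-Age", "O", "B-Gender", "O", "O", "B-Condition", "I-Condition"]
print(merge_bio(tokens, tags))
```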

---