# BRAT-Standoff Dataset Processing

_Motivation_: The purpose of this notebook is to facilitate data cleaning, exploration, and visualization of corpora in the biomedical domain.

_Prerequisites_: Corpora must be in the BRAT-Standoff annotation format (also known as A1).

### Table of Contents

1. [Data Cleaning](#Data-Cleaning)
2. [Corpus Processing](#Corpus-Processing)
3. [Corpus Statistics](#Corpus-Statistics) 
4. [Harmonize SSC with multiple GSC](#Harmonize-SSC-with-multiple-GSC)
    - [Extract disjunction of annotated entity sets](#Extract-disjunction-of-annotated-entity-sets)
5. [Evaluating a model](#Evaluating-a-Model)

Uncomment and run the following cell to install all requirements. _Note: it's best to launch this notebook from a virtual environment_.

In [None]:
# ! pip install -U nltk scikit-learn

All code is stored in a single Python file.

In [None]:
%load_ext autoreload
%autoreload 2
%aimport brat_standoff_corpus_proccessing

from brat_standoff_corpus_proccessing import *

It is useful to set the paths to all datasets you are working with as constants in the following cell.

In [None]:
# it is useful to set paths to different corpora here

## SSC
PATH_TO_CALBC_PRGE = ''
PATH_TO_CALBC_LIVB = ''
PATH_TO_CALBC_CHED = ''
PATH_TO_CALBC_DISO = ''

## GSC
# CHED
PATH_TO_CEMP = ''
PATH_TO_CDR_CHED = ''
PATH_TO_BIOSEMANTICS = ''
PATH_TO_SCAI = ''
# DISO
PATH_TO_CDR_DISO = ''
PATH_TO_MIRNA_DISO = ''
PATH_TO_NCBI_DISEASE = ''
PATH_TO_AZDC = ''
# LIVB
PATH_TO_S800 = ''
PATH_TO_LINNAEUS = ''
PATH_TO_MIRNA_LIVB = ''
PATH_TO_CELLFINDER_LIVB = ''
# PRGE
PATH_TO_BIOINFER = ''
PATH_TO_DECA = ''
PATH_TO_MIRNA_PRGE = ''
PATH_TO_FSU_PRGE = ''
PATH_TO_CELLFINDER_PRGE = ''
PATH_TO_IEPA = ''
PATH_TO_OSIRIS = ''

PATH_TO_PRGE_CORPORA = [PATH_TO_BIOINFER, PATH_TO_DECA, PATH_TO_MIRNA_PRGE, PATH_TO_FSU_PRGE, PATH_TO_CELLFINDER_PRGE]
PATH_TO_LIVB_CORPORA = [PATH_TO_S800, PATH_TO_LINNAEUS, PATH_TO_MIRNA_LIVB, PATH_TO_CELLFINDER_LIVB]
PATH_TO_CHED_CORPORA = [PATH_TO_CEMP, PATH_TO_CDR_CHED, PATH_TO_BIOSEMANTICS]
PATH_TO_DISO_CORPORA = [PATH_TO_MIRNA_DISO, PATH_TO_CDR_DISO, PATH_TO_NCBI_DISEASE]

CALBC_CHED_BLACKLIST = ''
CALBC_PRGE_BLACKLIST = ''
CALBC_LIVB_BLACKLIST = ''
CALBC_DISO_BLACKLIST = ''

## Data Cleaning

This collection of functions performs various cleaning tasks on a corpus. Examples of their use are given below.

__General cleaning tasks__: Remove hidden files, lone pairs, and invalid annotation pairs from a given corpus.

In [None]:
# set path to corpus to clean here
corpus_dir = PATH_TO_LINNAEUS

rm_hidden_files(corpus_dir)
rm_lone_pairs(corpus_dir)
_ = rm_invalid_ann(corpus_dir, remove=True)
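To make the notion of a "lone pair" concrete, here is a minimal sketch, assuming that each document in a BRAT-Standoff corpus consists of a `.txt` file and an `.ann` file sharing the same base name. The helper `find_lone_files` is hypothetical and is not part of `brat_standoff_corpus_proccessing`; it only lists candidates and deletes nothing.

```python
from pathlib import Path


def find_lone_files(corpus_dir):
    """Return names of .txt/.ann files whose partner file is missing.

    Hypothetical helper for illustration; it only reports, it never deletes.
    """
    corpus = Path(corpus_dir)
    txt_stems = {p.stem for p in corpus.glob('*.txt')}
    ann_stems = {p.stem for p in corpus.glob('*.ann')}
    # symmetric difference: base names that have only one of the two files
    lone_stems = txt_stems ^ ann_stems
    return sorted(p.name for p in corpus.iterdir()
                  if p.suffix in {'.txt', '.ann'} and p.stem in lone_stems)
```

Listing the files before removing them is a cheap way to check that a cleaning step will not discard more than expected.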

Use the following function if you need to change entity labels in a given corpus.

In [None]:
change_ann_labels(corpus_dir, new_label='PRGE', labels_to_replace=['gene'], drop=True)
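For orientation, a text-bound annotation in a BRAT-Standoff `.ann` file is a tab-separated line of the form `ID<TAB>TYPE START END<TAB>TEXT`. The sketch below relabels a single such line. The helper `relabel_ann_line` is hypothetical, and it does not model whatever the `drop` argument of `change_ann_labels` does.

```python
def relabel_ann_line(line, new_label, labels_to_replace):
    """Relabel one text-bound annotation line (hypothetical helper)."""
    line = line.rstrip('\n')
    if not line.startswith('T'):
        return line  # leave relations, events, notes, etc. untouched
    # ID <TAB> TYPE START END <TAB> TEXT
    ann_id, type_and_span, text = line.split('\t', 2)
    label, span = type_and_span.split(' ', 1)
    if label in labels_to_replace:
        label = new_label
    return '\t'.join([ann_id, f'{label} {span}', text])


print(relabel_ann_line('T1\tgene 0 3\tp53', 'PRGE', ['gene']))
```

Only the label changes; the identifier, character offsets, and covered text are passed through unchanged.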

Once data is cleaned, it's best to compress it.

In [None]:
# Compress <directory> to <directory>.tar.gz
# --------------------------------------
# ! tar -zcvf <directory>.tar.gz <directory>

## Corpus Processing

A collection of functions for various corpus processing tasks. Examples of their use are given below.

`extract_ann` is a simple function which extracts all unique entities in a corpus and returns a counter object. It is mainly used for downstream corpus processing tasks. Here is an example where it is used to print the top N most common entities in a corpus:

In [None]:
corpus_dir = PATH_TO_BIOINFER

annotations = extract_ann(corpus_dir)
print_top_bottom_n(annotations, n=100)
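If the returned counter behaves like a `collections.Counter` (an assumption, since the return type is only described as a "counter object"), the most and least common entities can also be read off directly. The entity mentions below are made up for illustration.

```python
from collections import Counter

# hypothetical entity mentions, standing in for the annotations of a corpus
mentions = ['p53', 'BRCA1', 'p53', 'insulin', 'p53', 'BRCA1']
counts = Counter(mentions)

# most_common(n) returns (entity, count) pairs, most frequent first
top_two = counts.most_common(2)
# the least common entities sit at the tail of the full ranking
bottom_one = counts.most_common()[-1:]

print(top_two)     # [('p53', 3), ('BRCA1', 2)]
print(bottom_one)  # [('insulin', 1)]
```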

`find_entity_in_corpra` returns a list of the documents in a corpus in which a given entity occurs. The function can search through both annotation and text files, or through annotation files only (the latter is equivalent to checking whether an annotation exists in a corpus). _Example usage_:

In [None]:
find_entity_in_corpra(PATH_TO_CALBC_LIVB, 'Respondents', search_text=False)

`split_brat_standoff` randomly splits a corpus into disjoint train, test, and validation sets. 

In [None]:
# set path to corpus to split
corpus_dir = PATH_TO_AZDC

# set train, test and valid partition sizes
TRAIN_SIZE = 0.6
TEST_SIZE = 0.3
VALID_SIZE = 0.1
RANDOM_SEED = 42

In [None]:
split_brat_standoff(corpus_dir, TRAIN_SIZE, TEST_SIZE, VALID_SIZE, random_seed=RANDOM_SEED)
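The idea behind a seeded, disjoint three-way split can be sketched as follows. This is a minimal illustration operating on document identifiers, not the implementation of `split_brat_standoff`; the helper `split_ids` and its rounding behaviour are assumptions.

```python
import random


def split_ids(doc_ids, train_size, test_size, valid_size, random_seed=None):
    """Randomly partition document IDs into train/test/valid (hypothetical helper)."""
    if abs(train_size + test_size + valid_size - 1.0) > 1e-9:
        raise ValueError('partition sizes must sum to 1')
    ids = sorted(doc_ids)  # sort first so the seed alone determines the result
    random.Random(random_seed).shuffle(ids)
    n_train = int(len(ids) * train_size)
    n_test = int(len(ids) * test_size)
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    valid = ids[n_train + n_test:]  # remainder, absorbs any rounding
    return train, test, valid


train, test, valid = split_ids([f'doc{i}' for i in range(10)], 0.6, 0.3, 0.1, random_seed=42)
print(len(train), len(test), len(valid))  # 6 3 1
```

Slicing a single shuffled list guarantees that the three partitions are disjoint and together cover every document.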

## Corpus Statistics

Get corpus stats, including the number of sentences, tokens, and annotations for a given corpus. Uses `NLTK` to perform word and sentence tokenization. _Example usage_:

In [None]:
# example use
corpus_dir = PATH_TO_SCAI

get_corpus_stats(corpus_dir)

## Harmonize SSC with multiple GSC

A collection of functions which facilitate the 'harmonization' of an SSC with multiple GSC. Examples of their use are given below.

`harmonize_ssc_gsc` takes a list of GSC and removes any document from a given SSC that appears in at least one of them. _Example usage_:

In [None]:
GSC = [PATH_TO_AZDC] # a list containing paths to the GSC
SSC = PATH_TO_CALBC_DISO # path to the SSC

harmonize_ssc_gsc(GSC, SSC)
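Before removing anything, it can help to inspect which documents overlap. The sketch below assumes that documents are matched by the base name of their `.txt` file (for example, a PubMed ID); how `harmonize_ssc_gsc` actually matches documents is not shown here, and `overlapping_doc_ids` is a hypothetical, read-only helper.

```python
from pathlib import Path


def overlapping_doc_ids(gsc_dirs, ssc_dir):
    """Return base names of SSC documents that also occur in any GSC.

    Hypothetical helper; assumes documents are identified by file base name.
    """
    ssc_ids = {p.stem for p in Path(ssc_dir).glob('*.txt')}
    gsc_ids = set()
    for gsc_dir in gsc_dirs:
        gsc_ids |= {p.stem for p in Path(gsc_dir).glob('*.txt')}
    # documents present in the SSC and in at least one GSC
    return ssc_ids & gsc_ids
```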

### Extract disjunction of annotated entity sets

_Motivation_: Given a single SSC and one or more GSC, we want to return the disjunction between the annotated entities of the SSC and those of all GSC (_that is, the entities which are present and annotated in the SSC, and present but not annotated in one or more of the GSC_).

_To use_: pass the path to the SSC and a list of paths to the GSC to `disjoint_annotations(SSC, GSC)`. Returns the disjunction of the entity annotation sets.

_Example usage_: get the disjoint annotations between the SSC CALBC corpus and all corresponding GSC. Returns a set.

In [None]:
# use a separate name so the function itself is not overwritten when the cell is re-run
disjoint_anns = disjoint_annotations(SSC, GSC)

Print each of these disjoint annotations, separated by newlines.

In [None]:
# normalize case and whitespace, then deduplicate and sort
disjoint_anns = sorted({ann.lower().strip() for ann in disjoint_anns})
print('\n'.join(disjoint_anns))
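The definition above can be illustrated on toy data. This is a minimal sketch under strong assumptions: entities are compared case-insensitively, and "present in the GSC" is approximated by a naive substring search over the GSC text. The helper `disjoint_entity_sketch` is hypothetical and is not the notebook's implementation.

```python
def disjoint_entity_sketch(ssc_annotated, gsc_annotated, gsc_text):
    """Entities annotated in the SSC that occur in the GSC text without being annotated there.

    Hypothetical helper; uses naive, case-insensitive substring matching.
    """
    gsc_annotated = {e.lower() for e in gsc_annotated}
    gsc_text = gsc_text.lower()
    return {e for e in ssc_annotated
            if e.lower() in gsc_text and e.lower() not in gsc_annotated}


result = disjoint_entity_sketch(
    ssc_annotated={'patients', 'aspirin', 'ibuprofen'},
    gsc_annotated={'aspirin'},
    gsc_text='Patients received aspirin daily.',
)
print(result)  # {'patients'}
```

Here `patients` is flagged because the SSC annotates it while the GSC mentions it without annotating it; `ibuprofen` is not flagged because it never appears in the GSC text.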

Remove these disjoint entities from an SSC (note: this requires that they are placed in a blacklist `.txt` file, with each entity on its own line).

In [None]:
blacklist = CALBC_CHED_BLACKLIST
rm_disjoint_annotations(PATH_TO_CALBC_CHED, blacklist)

## Evaluating a Model

We can perform an error analysis by looking at the sets of false positives (FP), false negatives (FN), and true positives (TP) of a model. In our case, we want to compare performance before and after transfer learning. We make the following assumptions:

- predictions on the corpus from the baseline model are stored at: `path/to/corpus/deploy_test_baseline`
- predictions on the corpus from the transfer learning model are stored at: `path/to/corpus/deploy_test_tl`
- gold standard labels for the corpus are stored at: `path/to/corpus/test`

Here is a simple example for printing the number of FNs, FPs and TPs for the baseline and transfer learning methods (along with the intersection of these sets):

In [None]:
import os

# example
corpus_dir = PATH_TO_S800

path_to_pred_baseline = os.path.join(corpus_dir, 'deploy_test_baseline')
path_to_pred_tl = os.path.join(corpus_dir, 'deploy_test_tl')
path_to_gold = os.path.join(corpus_dir, 'test')

print_corpus_eval(path_to_pred_baseline, path_to_pred_tl, path_to_gold)
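The three sets follow directly from set operations on predicted and gold annotations. The sketch below assumes exact matching on `(start, end, text)` tuples, which may differ from the matching criterion used by `print_corpus_eval`; the helpers `confusion_sets` and `precision_recall_f1` are hypothetical.

```python
def confusion_sets(predicted, gold):
    """Return (FN, FP, TP) sets under exact matching (hypothetical helper)."""
    predicted, gold = set(predicted), set(gold)
    tp = predicted & gold   # predicted and correct
    fp = predicted - gold   # predicted but not in the gold standard
    fn = gold - predicted   # in the gold standard but missed
    return fn, fp, tp


def precision_recall_f1(fn, fp, tp):
    """Compute precision, recall, and F1 from the three sets."""
    precision = len(tp) / (len(tp) + len(fp)) if (tp or fp) else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if (tp or fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# made-up annotations as (start, end, text) tuples
gold = {(0, 5, 'human'), (10, 15, 'mouse'), (20, 23, 'rat')}
pred = {(0, 5, 'human'), (10, 15, 'mouse'), (30, 33, 'cat')}

fn, fp, tp = confusion_sets(pred, gold)
print(len(fn), len(fp), len(tp))  # 1 1 2
```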

Here is a more complicated use case. We accumulate the FNs, FPs, and TPs across all corpora of a given entity type, for both the baseline and transfer learning methods.

In [None]:
# need to set this variable
path_to_corpora = PATH_TO_PRGE_CORPORA

# accumulators
FN_baseline_acc, FP_baseline_acc, TP_baseline_acc = set(), set(), set()
FN_tl_acc, FP_tl_acc, TP_tl_acc = set(), set(), set()

for corpus in path_to_corpora:
    # get paths to baseline, tl predictions and gold labels
    path_to_pred_baseline = os.path.join(corpus, 'deploy_test_baseline')
    path_to_pred_tl = os.path.join(corpus, 'deploy_test_tl')
    path_to_gold = os.path.join(corpus, 'test')
    
    # get sets of FN, FP, and TP for baseline and transfer learning methods
    FN_baseline, FP_baseline, TP_baseline = get_FN_FP_TP_single_corpus(path_to_pred_baseline, path_to_gold)
    FN_tl, FP_tl, TP_tl = get_FN_FP_TP_single_corpus(path_to_pred_tl, path_to_gold)
    
    # add corpus name to each annotation, to prevent loss of identical annotations across corpora
    FN_baseline = set([corpus + x for x in FN_baseline])
    FP_baseline = set([corpus + x for x in FP_baseline])
    TP_baseline = set([corpus + x for x in TP_baseline])
    
    FN_tl = set([corpus + x for x in FN_tl])
    FP_tl = set([corpus + x for x in FP_tl])
    TP_tl = set([corpus + x for x in TP_tl])
    
    # accumulate FNs, FPs and TPs across corpora
    FN_baseline_acc = FN_baseline_acc | FN_baseline
    FP_baseline_acc = FP_baseline_acc | FP_baseline
    TP_baseline_acc = TP_baseline_acc | TP_baseline
    
    FN_tl_acc = FN_tl_acc | FN_tl
    FP_tl_acc = FP_tl_acc | FP_tl
    TP_tl_acc = TP_tl_acc | TP_tl

And then we can look for patterns. First look at the average length of elements in each set:

In [None]:
print('BASELINE\n' + '-'*50)
print('FNs \n\tCount: {}\n\tAvg length: {}'.format(len(FN_baseline_acc), 
                                                   average_len_of_iterable([x.split('\t')[1] for x in FN_baseline_acc])))
print('FPs \n\tCount: {}\n\tAvg length: {}'.format(len(FP_baseline_acc), 
                                                   average_len_of_iterable([x.split('\t')[1] for x in FP_baseline_acc])))
print('TPs \n\tCount: {}\n\tAvg length: {}'.format(len(TP_baseline_acc), 
                                                   average_len_of_iterable([x.split('\t')[1] for x in TP_baseline_acc])))
print()

print('TRANSFER LEARNING\n' + '-'*50)
print('FNs \n\tCount: {}\n\tAvg length: {}'.format(len(FN_tl_acc), 
                                                   average_len_of_iterable([x.split('\t')[1] for x in FN_tl_acc])))
print('FPs \n\tCount: {}\n\tAvg length: {}'.format(len(FP_tl_acc), 
                                                   average_len_of_iterable([x.split('\t')[1] for x in FP_tl_acc])))
print('TPs \n\tCount: {}\n\tAvg length: {}'.format(len(TP_tl_acc), 
                                                   average_len_of_iterable([x.split('\t')[1] for x in TP_tl_acc])))
print()

Then at the intersection of sets:

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

print('FN\n' + '-'*50)
pp.pprint(get_top_n_intersection(FN_baseline_acc, FN_tl_acc))
print('FP\n' + '-'*50)
pp.pprint(get_top_n_intersection(FP_baseline_acc, FP_tl_acc))
print('TP\n' + '-'*50)
pp.pprint(get_top_n_intersection(TP_baseline_acc, TP_tl_acc))

Then the relative complements:

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

fn_b_minus_tl, fn_tl_minus_bl = get_top_n_difference(FN_baseline_acc, FN_tl_acc)
fp_b_minus_tl, fp_tl_minus_bl = get_top_n_difference(FP_baseline_acc, FP_tl_acc)
tp_b_minus_tl, tp_tl_minus_bl = get_top_n_difference(TP_baseline_acc, TP_tl_acc)

# note: the backslash is escaped ('\\') so Python does not treat '\ ' as an invalid escape sequence
print('Baseline \\ Transfer learning: FN\n' + '-'*50)
pp.pprint(fn_b_minus_tl)
print()
print('Baseline \\ Transfer learning: FP\n' + '-'*50)
pp.pprint(fp_b_minus_tl)
print()
print('Baseline \\ Transfer learning: TP\n' + '-'*50)
pp.pprint(tp_b_minus_tl)
print()

print('Transfer learning \\ Baseline: FN\n' + '-'*50)
pp.pprint(fn_tl_minus_bl)
print()
print('Transfer learning \\ Baseline: FP\n' + '-'*50)
pp.pprint(fp_tl_minus_bl)
print()
print('Transfer learning \\ Baseline: TP\n' + '-'*50)
pp.pprint(tp_tl_minus_bl)
print()
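As a reading aid for the relative complements, here is a small sketch on made-up false-negative sets. For FNs, `baseline \ transfer learning` contains the errors that transfer learning fixed, while `transfer learning \ baseline` contains the errors it introduced. The entity names are invented for illustration.

```python
# made-up false negatives for each model
fn_baseline = {'p53', 'BRCA1', 'insulin'}
fn_tl = {'BRCA1', 'TNF-alpha'}

still_missed = fn_baseline & fn_tl       # missed by both models
fixed_by_tl = fn_baseline - fn_tl        # missed by baseline only
introduced_by_tl = fn_tl - fn_baseline   # missed by transfer learning only

print(sorted(still_missed))      # ['BRCA1']
print(sorted(fixed_by_tl))       # ['insulin', 'p53']
print(sorted(introduced_by_tl))  # ['TNF-alpha']
```

The same reading applies to FPs. For TPs the interpretation flips: `transfer learning \ baseline` lists the entities that only the transfer learning model recovered.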