# MedCATtrainer Annotations
A notebook to demonstrate:
- How to download (aka export) annotations from the trainer. These downloads can also be used as an export / transfer / backup option for MedCATtrainer.
- The MedCATtrainer downloaded annotations schema. Exports with and without text have the same format, apart from the source text.
- How to re-upload exported annotations into 'new' projects. This could be for recovery, or for importing into a new Trainer deployment.

## Downloading Annotations
This section covers API-driven downloading. Downloads are also accessible from the admin page at \<hostname\>:\<port\>/admin/

In [4]:
import requests
import json
import pandas as pd
from pprint import pprint

In [None]:
URL = 'http://localhost:8001' # Should be set to your running deployment, IP / PORT if not running on localhost:8001

In [106]:
# API access is via a username / password. On login, the API auth endpoint returns an auth token that must be sent with all subsequent requests.
payload = {"username": "admin", "password": "admin"}
# use the URL set above rather than a hardcoded host
token = requests.post(f'{URL}/api/api-token-auth/', json=payload).json()['token']
headers = {
    'Authorization': f'Token {token}',
}
headers

{'Authorization': 'Token cc3e60dd2cc4231f7f74d1f30d35ce31d3154f7c'}
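
If the login fails, the response body has no `token` key and the cell above raises a bare `KeyError`. A minimal sketch of a more explicit check follows. The helper name `auth_headers_from_body` and the example body are hypothetical; only the `token` key and the `Token <value>` header format come from the cell above.

```python
import json


def auth_headers_from_body(body_text):
    """Build the Authorization header from the raw JSON body of the token endpoint."""
    body = json.loads(body_text)
    token = body.get("token")
    if not token:
        # No "token" key: report what the server sent back instead of a bare KeyError.
        raise ValueError(f"no token in auth response: {body}")
    return {"Authorization": f"Token {token}"}


# Hypothetical response body, shaped like the one parsed in the cell above.
example_headers = auth_headers_from_body('{"token": "abc123"}')
print(example_headers)
```

With a live deployment this could be called as `auth_headers_from_body(requests.post(f'{URL}/api/api-token-auth/', json=payload).text)`.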

In [118]:
# get project IDs to download
resp = json.loads(requests.get(f'{URL}/api/project-annotate-entities/', headers=headers).text)['results']
pprint([{'id': r['id'], 'name': r['name']} for r in resp])

[{'id': 68,
  'name': 'Top level Concept Annos (Diseases / Symptoms / Findings) (Clone)'},
 {'id': 69,
  'name': 'Chevron test - UMLS (Diseases / Symptoms / Findings) (Clone)'},
 {'id': 70,
  'name': 'Example Annotation Project - UMLS (Diseases / Symptoms / Findings) '
          '(Clone)'},
 {'id': 71, 'name': 'Example Annotation Project - SNOMED CT All (Clone)'}]


In [119]:
# projects to download
projects_to_download = ','.join(str(r['id']) for r in resp) 

In [120]:
# further parameters available here are:
# - with_text: Boolean: to download the annotations with the source document text. This will automatically include the doc_name. Default: False
# - with_doc_name: Boolean: if with_text is False, but you still want to include doc names. Default: False

resp = json.loads(requests.get(f'{URL}/api/download-annos/?project_ids={projects_to_download}&with_text=True', headers=headers).text)
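
The query string above can also be assembled from its parts, which makes the two optional flags easier to toggle. This is a sketch only: `build_download_url` is a hypothetical helper, and it assumes the flags are passed as the literal string `True`, as in the request above.

```python
from urllib.parse import urlencode


def build_download_url(base_url, project_ids, with_text=False, with_doc_name=False):
    """Return the download-annos URL for the given project ids and flags."""
    params = {"project_ids": ",".join(str(i) for i in project_ids)}
    # Only send flags that are switched on; both default to False on the server side.
    if with_text:
        params["with_text"] = "True"
    if with_doc_name:
        params["with_doc_name"] = "True"
    # safe=',' keeps the comma-separated id list readable instead of encoding it as %2C.
    return f"{base_url}/api/download-annos/?{urlencode(params, safe=',')}"


download_url = build_download_url("http://localhost:8001", [68, 69], with_text=True)
print(download_url)
```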

In [121]:
# dump the output to a file.
with open('trainer_export.json', 'w') as f:
    json.dump(resp, f)

## Processing MedCATTrainer Annotations

In [1]:
import pandas as pd
import json

In [2]:
# Load the annotations downloaded - as described: https://github.com/CogStack/MedCATtrainer/blob/master/README.md#download-annos
with open('example_data/MedCAT_Export_With_Text_2020-05-22_10_34_09.json') as f:
    projs = json.load(f)['projects']

In [3]:
# Number of annotation projects downloaded
print(f'Projects annotated:{len(projs)}')

Projects annotated:2


In [4]:
# select first project
proj = projs[0]
# project level cui / tui filters are top level dict keys
proj.keys()

dict_keys(['name', 'id', 'cuis', 'tuis', 'documents'])

In [5]:
# Annotations are found inside each document.
print(f'# of Documents: {len(proj["documents"])}')
print(f'# of Annotations: {sum([len(d["annotations"]) for d in proj["documents"]])}')

# Annotations that have been marked by a human annotator
print(f'# Validated Annotations: {len([a for d in proj["documents"] for a in d["annotations"] if a["validated"] == True])}')

# Annotations that have been marked correct - (blue) 
print(f'# Correct Annotations: {len([a for d in proj["documents"] for a in d["annotations"] if a["correct"] == True])}')

# Annotations that have been marked incorrect - (red)
print(f'# Incorrect Annotations: {len([a for d in proj["documents"] for a in d["annotations"] if a["deleted"] == True])}')

# Annotations that have been marked terminated - (dark red)
print(f'# Terminated Annotations: {len([a for d in proj["documents"] for a in d["annotations"] if a["killed"] == True])}')

# Annotations that have been marked alternative - (turquoise)
print(f'# Alternative Annotations: {len([a for d in proj["documents"] for a in d["annotations"] if a["alternative"] == True])}')

# Annotations that have been manually created via right-click - 'Add Annotation', these will also be 'correct' == True
print(f'# Manually Created Annotations: {len([a for d in proj["documents"] for a in d["annotations"] if a["manually_created"] == True])}')

# of Documents: 2
# of Annotations: 47
# Validated Annotations: 47
# Correct Annotations: 32
# Incorrect Annotations: 15
# Terminated Annotations: 0
# Alternative Annotations: 0
# Manually Created Annotations: 0
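
The per-flag counts above can also be collected in a single pass. The sketch below assumes the project layout shown earlier (`documents`, each with `annotations`); `count_flags` and the toy project are hypothetical and not part of the export.

```python
from collections import Counter

# Boolean fields carried by each exported annotation.
FLAGS = ("validated", "correct", "deleted", "killed", "alternative", "manually_created")


def count_flags(project):
    """Count how many annotations in a project have each flag set to True."""
    counts = Counter()
    for doc in project["documents"]:
        for anno in doc["annotations"]:
            for flag in FLAGS:
                if anno.get(flag):
                    counts[flag] += 1
    return counts


# Toy project for illustration only.
toy_project = {
    "documents": [
        {
            "annotations": [
                {"validated": True, "correct": True, "deleted": False},
                {"validated": True, "correct": False, "deleted": True},
                {"validated": True, "correct": True, "manually_created": True},
            ]
        }
    ]
}
flag_counts = count_flags(toy_project)
print(dict(flag_counts))
```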


### Meta Annotations 
Each Meta Annotation will have the names of the task and associated values you've previously selected.
In this case we have: 'Negation' and 'Skip'

In [6]:
# Select annotations that are correct and have the meta annotations Negation - No, Skip - Yes

In [7]:
proj['documents'][1]['annotations'][2]['meta_anns']

{'Negation': {'name': 'Negation',
  'value': 'No',
  'acc': 1.0,
  'validated': True},
 'Skip': {'name': 'Skip', 'value': 'Yes', 'acc': 1.0, 'validated': True}}

In [8]:
annos = []
for doc in proj['documents']:
    for a in doc['annotations']:
        meta_anns = a['meta_anns']
        if a['correct'] == True and len(meta_anns) != 0:
            # meta_anns is a dict keyed by meta annotation task name; each value is a dict describing that meta annotation. Key order is not necessarily consistent
            negation = meta_anns['Negation']
            skip = meta_anns['Skip']
            if negation['value'] == 'No' and skip['value'] == 'Yes':
                # pull out the doc_name, the text span value, and the concept
                annos.append({'doc_name': doc['name'], 'anno_value': a['value'], 'cui': a['cui']})
# make DataFrame
df = pd.DataFrame(annos)
df.head(5)

Unnamed: 0,doc_name,anno_value,cui
0,Psych Text 1,psychopathology,C0004936
1,Psych Text 1,constipated,C0009806
2,Psych Text 1,depression,C0011570
3,Psych Text 1,depression,C0011570
4,Psych Text 1,fatigued,C0015672
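
Instead of filtering on fixed task names, each annotation can be flattened into one row with a column per meta annotation task. This is a sketch under the schema shown above; `meta_anno_rows`, the toy project and its document name are hypothetical.

```python
def meta_anno_rows(project):
    """Return one dict per annotation, with a key for each meta annotation task."""
    rows = []
    for doc in project["documents"]:
        for anno in doc["annotations"]:
            row = {"doc_name": doc["name"], "anno_value": anno["value"], "cui": anno["cui"]}
            # meta_anns maps task name -> {'name', 'value', 'acc', 'validated'}
            for task, meta in anno["meta_anns"].items():
                row[task] = meta["value"]
            rows.append(row)
    return rows


# Toy project for illustration only.
toy_project = {
    "documents": [
        {
            "name": "Toy Doc",
            "annotations": [
                {
                    "value": "depression",
                    "cui": "C0011570",
                    "meta_anns": {
                        "Negation": {"name": "Negation", "value": "No", "acc": 1.0, "validated": True},
                        "Skip": {"name": "Skip", "value": "Yes", "acc": 1.0, "validated": True},
                    },
                },
                {"value": "fatigued", "cui": "C0015672", "meta_anns": {}},
            ],
        }
    ]
}
rows = meta_anno_rows(toy_project)
print(rows[0])
```

The resulting list can be passed straight to `pd.DataFrame(rows)`; annotations without a given meta annotation end up with a missing value in that column.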


### Comparing a Second (or More) set of Annotations
Often we'll dual annotate projects and compute metrics to develop a gold standard.
- We'll compute metrics such as [Inter Annotator Agreement (IIA)](https://en.wikipedia.org/wiki/Inter-rater_reliability) and [Cohen's Kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa).
- Metrics can be output for each concept for the concept recognition+linking tasks.
- For tasks with only a handful of concept filters we can compute the meta annotation task agreement, but often we will not have enough annotations for any meaningful result. Instead we can group all meta annotations together to compute scores.

In [14]:
from sklearn.metrics import cohen_kappa_score

In [15]:
proj['documents'][0]['annotations'][0]

{'id': 46102,
 'user': 'admin',
 'cui': 'C1656589',
 'value': 'oppositional defiant disorder',
 'start': 9593,
 'end': 9622,
 'validated': True,
 'correct': False,
 'deleted': True,
 'alternative': False,
 'killed': False,
 'last_modified': '2020-05-22 10:33:23.830587+00:00',
 'manually_created': False,
 'acc': 0.392611944253081,
 'meta_anns': {}}

In [16]:
def anno_state(anno):
    if anno['deleted']:
        return 'del'
    if anno['alternative']:
        return 'alt'
    if anno['killed']:
        return 'kil'
    if anno['manually_created']:
        return 'man'
    return 'cor'
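
A quick check of `anno_state` on the example annotation printed above. The function is repeated so the snippet runs on its own, and only the fields it reads are copied over. Note the order of the checks: an annotation that is both manually created and correct is reported as `'man'`, and anything with no flag set falls through to `'cor'`.

```python
def anno_state(anno):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    if anno['deleted']:
        return 'del'
    if anno['alternative']:
        return 'alt'
    if anno['killed']:
        return 'kil'
    if anno['manually_created']:
        return 'man'
    return 'cor'


# Flags copied from the annotation shown above (id 46102).
example_anno = {
    'correct': False,
    'deleted': True,
    'alternative': False,
    'killed': False,
    'manually_created': False,
}
state = anno_state(example_anno)
print(state)

# Hypothetical manually created annotation: reported as 'man', not 'cor'.
manual_anno = dict(example_anno, correct=True, deleted=False, manually_created=True)
print(anno_state(manual_anno))
```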

In [17]:
# Concept Recognition + Linking Agreement per CUI across 2 projects

In [18]:
# only take documents completed by both
shared_docs = set([d['id'] for d in projs[0]['documents']]) & set([d['id'] for d in projs[1]['documents']])
projs[0]['documents'] = [d for d in projs[0]['documents'] if d['id'] in shared_docs]
projs[1]['documents'] = [d for d in projs[1]['documents'] if d['id'] in shared_docs]

In [19]:
# project 1 annos
proj1_annos_cuis = {f'{d["id"]}:{a["start"]}': a['cui'] for d in projs[0]['documents'] for a in d['annotations']}
proj1_annos_states = {f'{d["id"]}:{a["start"]}': anno_state(a) for d in projs[0]['documents'] for a in d['annotations']}
# project 2 annos
proj2_annos_cuis = {f'{d["id"]}:{a["start"]}': a['cui'] for d in projs[1]['documents'] for a in d['annotations']}
proj2_annos_states = {f'{d["id"]}:{a["start"]}': anno_state(a) for d in projs[1]['documents'] for a in d['annotations']}

In [20]:
all_cuis = set(proj1_annos_cuis.values()) | set(proj2_annos_cuis.values())

In [21]:
cui_ck = {}
for cui in all_cuis:
    cui_tuples = []
    p1 = {k:v for k,v in proj1_annos_cuis.items() if v == cui}
    p2 = {k:v for k,v in proj2_annos_cuis.items() if v == cui}
    for anno_key in set(p1.keys()) | set(p2.keys()):
        cui_tuples.append((proj1_annos_states.get(anno_key, 'na'), proj2_annos_states.get(anno_key, 'na')))
    cui_ck[cui] = cui_tuples

#### IIA Per CUI

In [22]:
iia_per_cui = {cui: (len([i for i in v if i[0] == i[1]]) / len(v)) * 100 for cui, v in cui_ck.items()}
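
The agreement figure above is the percentage of label pairs on which both annotators agree. A small worked example on toy pairs follows; `percent_agreement` and the pairs themselves are hypothetical.

```python
def percent_agreement(pairs):
    """Percentage of (annotator_1, annotator_2) label pairs that match."""
    if not pairs:
        return float("nan")
    return 100 * sum(a == b for a, b in pairs) / len(pairs)


# Toy state pairs for one CUI: two agreements, one disagreement,
# and one span that only the first annotator marked ('na' for the second).
toy_pairs = [("cor", "cor"), ("del", "cor"), ("cor", "cor"), ("cor", "na")]
agreement = percent_agreement(toy_pairs)
print(agreement)  # 50.0
```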

#### Cohen's Kappa Per CUI
Note: for CUIs where only one label occurs, Cohen's kappa is undefined and the score can be `nan`.

In [26]:
cohens_kappa_per_cui = {k: cohen_kappa_score([i[0] for i in v], [i[1] for i in v]) for k,v in cui_ck.items()}

  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)


### Meta Annotation
- Group all annos together for each task and compute IIA, CK

In [24]:
# project 1 meta annos
proj1_meta_annos_neg = {f'{d["id"]}:{a["start"]}': a['meta_anns'].get('Negation', {'value': 'na'})['value'] for d in projs[0]['documents'] for a in d['annotations']}
proj1_meta_annos_skip = {f'{d["id"]}:{a["start"]}': a['meta_anns'].get('Skip', {'value': 'na'})['value'] for d in projs[0]['documents'] for a in d['annotations']}
# project 2 meta annos
proj2_meta_annos_neg = {f'{d["id"]}:{a["start"]}': a['meta_anns'].get('Negation', {'value': 'na'})['value'] for d in projs[1]['documents'] for a in d['annotations']}
proj2_meta_annos_skip = {f'{d["id"]}:{a["start"]}': a['meta_anns'].get('Skip', {'value': 'na'})['value'] for d in projs[1]['documents'] for a in d['annotations']}

In [138]:
# remove na examples, these would be incorrect or terminated examples that have no meta anno value.
def remove_na(meta_annos_dict):
    return {k:v for k,v in meta_annos_dict.items() if v != 'na'}
proj1_meta_annos_neg = remove_na(proj1_meta_annos_neg)
proj1_meta_annos_skip = remove_na(proj1_meta_annos_skip)
proj2_meta_annos_neg = remove_na(proj2_meta_annos_neg)
proj2_meta_annos_skip = remove_na(proj2_meta_annos_skip)

In [154]:
# Take meta annos from each project and combine across projects. There are two options:
# - A stricter measure: default to 'na' if there is no matching meta anno in the 'other' project. To use this one, swap '&' (intersection) for '|' (union).
# - A fairer measure: drop instances where there was no meta anno in the other project. We use this one below.
neg_annos = []
for anno_key in set(proj1_meta_annos_neg.keys()) & set(proj2_meta_annos_neg.keys()):
    neg_annos.append((proj1_meta_annos_neg.get(anno_key, 'na'), proj2_meta_annos_neg.get(anno_key, 'na')))

skip_annos = []
for anno_key in set(proj1_meta_annos_skip.keys()) & set(proj2_meta_annos_skip.keys()):
    skip_annos.append(((proj1_meta_annos_skip.get(anno_key, 'na')), proj2_meta_annos_skip.get(anno_key, 'na')))
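
The two pairing strategies described in the comments above can be put side by side. This is a sketch: `pair_labels` and the toy label dicts are hypothetical, with keys in the same `doc_id:start` form used above.

```python
def pair_labels(labels_a, labels_b, strict=False):
    """Pair up two annotators' labels by annotation key.

    strict=False keeps only keys present in both dicts (intersection).
    strict=True keeps every key and fills the missing side with 'na' (union).
    """
    if strict:
        keys = set(labels_a) | set(labels_b)
    else:
        keys = set(labels_a) & set(labels_b)
    # sorted() only makes the output order reproducible.
    return [(labels_a.get(k, "na"), labels_b.get(k, "na")) for k in sorted(keys)]


# Toy labels for illustration only.
toy_a = {"1:10": "No", "1:42": "Yes"}
toy_b = {"1:10": "No", "2:5": "No"}
fair_pairs = pair_labels(toy_a, toy_b)
strict_pairs = pair_labels(toy_a, toy_b, strict=True)
print(fair_pairs)
print(strict_pairs)
```

The stricter variant penalises spans that only one annotator labelled, so it generally gives lower agreement than the intersection used in this notebook.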

In [155]:
iia_neg = (len([a for a in neg_annos if a[0] == a[1]]) / len(neg_annos)) * 100
print('iia neg:', iia_neg)
iia_skip = (len([a for a in skip_annos if a[0] == a[1]]) / len(skip_annos)) * 100
print('iia skip:', iia_skip)

iia neg: 100.0
iia skip: 100.0


In [156]:
ck_neg = cohen_kappa_score([v[0] for v in neg_annos], [v[1] for v in neg_annos])
print("cohen's kappa neg:", ck_neg)
ck_skip = cohen_kappa_score([v[0] for v in skip_annos], [v[1] for v in skip_annos])
print("cohen's kappa skip:", ck_skip)

cohen's kappa neg: nan
cohen's kappa skip: nan


  k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)


We get `nan` here because only a single value occurs in the intersection for each task, so Cohen's kappa is undefined. We can still report 100% IIA though!
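
To see why a single label leaves the score undefined: unweighted Cohen's kappa is `(p_o - p_e) / (1 - p_e)`, where `p_o` is the observed agreement and `p_e` the agreement expected by chance. If both annotators only ever use one label, `p_e` is 1 and the expression becomes 0 / 0. The plain-Python sketch below is for illustration only and is not scikit-learn's implementation; `cohen_kappa` is a hypothetical name.

```python
import math
from collections import Counter


def cohen_kappa(pairs):
    """Unweighted Cohen's kappa for a list of (annotator_1, annotator_2) labels."""
    n = len(pairs)
    if n == 0:
        return math.nan
    # Observed agreement: fraction of pairs with identical labels.
    observed = sum(a == b for a, b in pairs) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    expected = sum(first[label] * second[label] for label in first) / (n * n)
    if expected == 1:
        # Both annotators used a single identical label: 0 / 0, so kappa is undefined.
        return math.nan
    return (observed - expected) / (1 - expected)


mixed = cohen_kappa([("a", "a"), ("a", "b"), ("b", "b"), ("b", "b")])
single = cohen_kappa([("No", "No"), ("No", "No")])
print(mixed)   # 0.5
print(single)  # nan
```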

## Uploading Annotations
This is useful if annotations have been exported and need to be systematically modified, or if a new instance of Trainer is to be used for further annotation.

This uses the previously exported trainer download.

In [122]:
with open('trainer_export.json') as f:
    project_data = json.load(f)

In [123]:
# filter out project data that is empty

In [124]:
project_data['projects'] = [p for p in project_data['projects'] if len(p['documents']) > 0 ] 
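
Before uploading, it can help to check what the export actually contains. A minimal sketch, assuming the export layout shown earlier (`projects`, each with `name` and `documents`, each document with `annotations`); `summarise_export` and the toy export are hypothetical.

```python
def summarise_export(export):
    """Return (project name, number of documents, number of annotations) per project."""
    summary = []
    for project in export["projects"]:
        n_annos = sum(len(doc["annotations"]) for doc in project["documents"])
        summary.append((project["name"], len(project["documents"]), n_annos))
    return summary


# Toy export for illustration only.
toy_export = {
    "projects": [
        {"name": "Project A", "documents": [{"annotations": [{}, {}]}, {"annotations": [{}]}]},
        {"name": "Project B", "documents": []},
    ]
}
for name, n_docs, n_annos in summarise_export(toy_export):
    print(f"{name}: {n_docs} documents, {n_annos} annotations")
```

Projects that report zero documents are the ones the filter above removes.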

In [125]:
# current projects
resp = json.loads(requests.get(f'{URL}/api/project-annotate-entities/', headers=headers).text)['results']
pprint([{'id': r['id'], 'name': r['name']} for r in resp])

[{'id': 68,
  'name': 'Top level Concept Annos (Diseases / Symptoms / Findings) (Clone)'},
 {'id': 69,
  'name': 'Chevron test - UMLS (Diseases / Symptoms / Findings) (Clone)'},
 {'id': 70,
  'name': 'Example Annotation Project - UMLS (Diseases / Symptoms / Findings) '
          '(Clone)'},
 {'id': 71, 'name': 'Example Annotation Project - SNOMED CT All (Clone)'}]


In [126]:
# upload previously exported projects
resp = requests.post(f'{URL}/api/upload-deployment/', json=project_data).text

In [127]:
print(resp)

"successfully uploaded"


In [128]:
# to show the newly uploaded projects
resp = json.loads(requests.get(f'{URL}/api/project-annotate-entities/', headers=headers).text)['results']
pprint([{'id': r['id'], 'name': r['name']} for r in resp])

[{'id': 68,
  'name': 'Top level Concept Annos (Diseases / Symptoms / Findings) (Clone)'},
 {'id': 69,
  'name': 'Chevron test - UMLS (Diseases / Symptoms / Findings) (Clone)'},
 {'id': 70,
  'name': 'Example Annotation Project - UMLS (Diseases / Symptoms / Findings) '
          '(Clone)'},
 {'id': 71, 'name': 'Example Annotation Project - SNOMED CT All (Clone)'},
 {'id': 81,
  'name': 'Chevron test - UMLS (Diseases / Symptoms / Findings) (Clone) '
          'IMPORTED'},
 {'id': 82,
  'name': 'Example Annotation Project - UMLS (Diseases / Symptoms / Findings) '
          '(Clone) IMPORTED'}]
