# AWS Comprehend Medical with Python

**Very important**: 

- Do not run the cells. They cost money and may overwrite saved data.
- Comprehend Medical only accepts note strings shorter than 20000 characters. When preparing a dataframe `df` with a notes column `note`, filter out longer notes with:
`df = df[df.note.str.len() < 20000]`

- Comprehend Medical is very expensive. The cost for one entity is \$0.01. On average, there is one entity per 100 characters, so 1 M characters yield about 10,000 entities and cost about \$100.
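
Before sending anything to the service, it can help to put a number on the bill. The sketch below only restates the arithmetic from the bullets above; the rate and the entity density are the figures quoted in this notebook, not values read from AWS, and `estimate_cost_usd` and the sample notes are made up for illustration.

```python
def estimate_cost_usd(n_chars, chars_per_entity=100, usd_per_entity=0.01):
    """Rough cost estimate from the figures quoted in this notebook (assumptions)."""
    return n_chars / chars_per_entity * usd_per_entity

MAX_CHARS = 20000  # length limit quoted in this notebook

# Made-up notes; the second one is too long and is dropped.
notes = ["short note", "x" * 25000, "another note"]
eligible = [n for n in notes if len(n) < MAX_CHARS]
total_chars = sum(len(n) for n in eligible)

print(len(eligible), "notes,", total_chars, "characters")
print(f"estimated cost: ${estimate_cost_usd(total_chars):.4f}")
print(f"1 M characters: ${estimate_cost_usd(1_000_000):.2f}")
```

Running the estimate on the total length of a notes column before any API call gives a quick sanity check on the budget.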

### Tutorial
https://docs.aws.amazon.com/comprehend/latest/dg/get-started-api-med.html

#### boto3 documentation
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehendmedical.html

## A quick example
Comprehend Medical returns a dictionary containing the detected medical entities plus metadata about the request sent to Amazon. Only the medical entities are useful for text analysis.

In [None]:
import boto3
client = boto3.client(service_name='comprehendmedical', region_name='us-east-1')
result = client.detect_entities_v2(Text= 'cerealx 84 mg daily')

In [61]:
result['Entities']

[{'Id': 0,
  'BeginOffset': 0,
  'EndOffset': 7,
  'Score': 0.8877691626548767,
  'Text': 'cerealx',
  'Category': 'MEDICATION',
  'Type': 'BRAND_NAME',
  'Traits': [],
  'Attributes': [{'Type': 'DOSAGE',
    'Score': 0.9337134957313538,
    'RelationshipScore': 0.9995118379592896,
    'Id': 1,
    'BeginOffset': 8,
    'EndOffset': 13,
    'Text': '84 mg',
    'Traits': []},
   {'Type': 'FREQUENCY',
    'Score': 0.990627646446228,
    'RelationshipScore': 0.9987651109695435,
    'Id': 2,
    'BeginOffset': 14,
    'EndOffset': 19,
    'Text': 'daily',
    'Traits': []}]}]
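
The nested `Attributes` list is what ties a dosage or frequency to its medication. A minimal sketch of flattening that structure into one row per entity, using a trimmed copy of the output above so that no API call is needed; `flatten_entities` is a hypothetical helper, not part of boto3.

```python
# Entity copied from the output above, trimmed to the fields used here.
entities = [{
    "Text": "cerealx", "Category": "MEDICATION", "Type": "BRAND_NAME",
    "Score": 0.8877691626548767,
    "Attributes": [
        {"Type": "DOSAGE", "Text": "84 mg", "RelationshipScore": 0.9995118379592896},
        {"Type": "FREQUENCY", "Text": "daily", "RelationshipScore": 0.9987651109695435},
    ],
}]

def flatten_entities(entities):
    """Return one dict per entity, with its attributes keyed by attribute type."""
    rows = []
    for ent in entities:
        row = {"text": ent["Text"], "category": ent["Category"], "type": ent["Type"]}
        # .get guards against entities that carry no "Attributes" key (assumption).
        for attr in ent.get("Attributes", []):
            row[attr["Type"].lower()] = attr["Text"]
        rows.append(row)
    return rows

rows = flatten_entities(entities)
print(rows)
```

If an entity has two attributes of the same type, the later one overwrites the earlier one in this sketch; collecting them in a list would avoid that.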

## Example with one medical note
We will extract medical entities from the first note in the mtsample dataset.

In [None]:
note = "SUBJECTIVE: This 23-year-old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over-the-counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.MEDICATIONS: Her only medication currently is Ortho Tri-Cyclen and the Allegra.ALLERGIES: She has no known medicine allergies.OBJECTIVE:Vitals: Weight was 130 pounds and blood pressure 124/78.HEENT: Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Only clear drainage was seen. TMs were clear.Neck: Supple without adenopathy.Lungs: Clear.ASSESSMENT: Allergic rhinitis.PLAN:1. She will try Zyrtec instead of Allegra again. Another option will be to use loratadine. She does not think she has prescription coverage so that might be cheaper.2. Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well."
mt = client.detect_entities_v2(Text = note)

### Get a bag of words from the medical entities of one medical note

In [4]:
def get_bow(me):
    # me: list of medical entities returned by Comprehend Medical, i.e. result["Entities"]
    bow = []
    for dic in me:
        # an entity is negated if any of its traits is named "NEGATION"
        negated = any(trait["Name"] == "NEGATION" for trait in dic["Traits"])
        # join multi-word entities with "-" and append "-1" to negated ones
        bow.append(dic["Text"].replace(" ", "-") + ("-1" if negated else ""))
    return bow

In [58]:
print(get_bow(mt["Entities"]))

['23', 'allergies', 'Seattle', 'Claritin', 'Zyrtec', 'lose-effectiveness', 'Allegra', 'summer', 'asthma', 'Ortho-Tri-Cyclen', 'Allegra', 'Vitals', 'Weight', 'blood-pressure', 'HEENT', 'throat', 'erythematous', 'exudate-1', 'Nasal-mucosa', 'erythematous', 'swollen', 'clear-drainage', 'TMs', 'TMs-were-clear', 'Neck', 'Supple', 'adenopathy-1', 'Lungs', 'Lungs:-Clear', 'Allergic-rhinitis', 'Zyrtec', 'Allegra', 'loratadine', 'Nasonex']
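
The list above keeps repeated tokens, which is what a bag of words needs. As a sketch of how it might be consumed, the block below counts the tokens with `collections.Counter` and picks out the negated ones; the token list is copied from the output above, so nothing is sent to AWS.

```python
from collections import Counter

# Tokens copied from the get_bow output above.
bow = ['23', 'allergies', 'Seattle', 'Claritin', 'Zyrtec', 'lose-effectiveness',
       'Allegra', 'summer', 'asthma', 'Ortho-Tri-Cyclen', 'Allegra', 'Vitals',
       'Weight', 'blood-pressure', 'HEENT', 'throat', 'erythematous', 'exudate-1',
       'Nasal-mucosa', 'erythematous', 'swollen', 'clear-drainage', 'TMs',
       'TMs-were-clear', 'Neck', 'Supple', 'adenopathy-1', 'Lungs', 'Lungs:-Clear',
       'Allergic-rhinitis', 'Zyrtec', 'Allegra', 'loratadine', 'Nasonex']

counts = Counter(bow)
print(counts.most_common(3))

# Tokens carrying the "-1" negation suffix added by get_bow.
negated = [tok for tok in counts if tok.endswith("-1")]
print(negated)
```

One caveat of the suffix convention: an entity whose own text ends in " 1" (for example a hypothetical "type 1") would also become a token ending in "-1", so the suffix alone cannot always distinguish negation.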


## Try 10 mtsample notes
Always consider the cost when using Comprehend Medical.

In [None]:
import pandas as pd
dat = pd.read_csv("../mtsample_scraped.csv")

In [None]:
notes = list(dat.medical_transcription)
nn = notes[0:10]

In [None]:
# do not run this cell, save money
mes = []
for nt in nn:
    me = client.detect_entities_v2(Text = nt)
    me = me["Entities"]
    mes.append(me)
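
The loop above can be dry-run for free by passing the detection call in as a parameter. This is a sketch under assumptions: `detect_all` and `fake_detect` are hypothetical names, and the fake only mimics the `{"Entities": [...]}` shape of the real response.

```python
def detect_all(notes, detect, max_chars=20000):
    """Call `detect` once per note; notes at or over max_chars get None instead."""
    results = []
    for note in notes:
        if len(note) >= max_chars:
            results.append(None)  # skipped: too long for the service
            continue
        results.append(detect(note)["Entities"])
    return results

# Stand-in for client.detect_entities_v2 so the sketch runs offline and costs nothing.
def fake_detect(text):
    return {"Entities": [{"Text": word, "Traits": []} for word in text.split()]}

out = detect_all(["chest pain", "x" * 30000], fake_detect)
print(out)
```

With the real client, the call would look like `detect_all(nn, lambda t: client.detect_entities_v2(Text=t))`. Keeping a `None` placeholder for skipped notes keeps the results aligned with the input list.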

In [None]:
# save the entities, they cost real money
import json
with open("comprehend_first_10_cases.txt", "w") as f:
    json.dump(mes, f)

In [None]:
# read the saved file
with open("comprehend_first_10_cases.txt", "r") as f:
    try_10 = json.load(f)

In [59]:
# get the bag of words
bows = []
for me in try_10:
    bows.append(get_bow(me))

print(bows[0:2])

[['23', 'allergies', 'Seattle', 'Claritin', 'Zyrtec', 'lose-effectiveness', 'Allegra', 'summer', 'asthma', 'Ortho-Tri-Cyclen', 'Allegra', 'Vitals', 'Weight', 'blood-pressure', 'HEENT', 'throat', 'erythematous', 'exudate-1', 'Nasal-mucosa', 'erythematous', 'swollen', 'clear-drainage', 'TMs', 'TMs-were-clear', 'Neck', 'Supple', 'adenopathy-1', 'Lungs', 'Lungs:-Clear', 'Allergic-rhinitis', 'Zyrtec', 'Allegra', 'loratadine', 'Nasonex'], ['34', 'recommendation-of-Emergency-Room', 'medical-records', 'allergic-reaction', 'ABC-Medical-Center', 'perioral-swelling', '05/03/2008', 'ABC-Medical-Center', 'XYZ-Medical-Center', 'renal-failure', 'dialysis', 'allergy-reaction', 'Keflex-1', 'skin', 'skin-cellulitis', 'shunt-infection', 'anaphylactic-1', 'angioedema-reactions-1', 'atenolol', 'blood-pressure-control', 'corticosteroid', 'corticosteroid-therapy', 'antihistamine', 'antihistamine-therapy', 'urticaria', 'renal-failure', 'hypertension', 'renal-failure', 'dialysis', 'hypertension', 'PermCath-ins

## Download Comprehend Medical data for gastroenterology and neurology cases
Start from the very beginning.

In [1]:
import pandas as pd
import numpy as np
import json
from tqdm import tqdm
dat = pd.read_csv("../mtsample_scraped.csv")

In [13]:
# save the selected data for future reproducible work
dat_gas_neu = dat.query('sample_type in ["Gastroenterology", "Neurology"]')
dat_gas_neu = dat_gas_neu[dat_gas_neu.medical_transcription.str.len() < 20000].copy()  # drop notes of 20000 characters or more; .copy() avoids a SettingWithCopyWarning on the next line
dat_gas_neu["id"] = range(1, dat_gas_neu.shape[0] + 1)
selected = dat_gas_neu[["id", "sample_type", "medical_transcription"]]
selected.columns = ["id", "sample_type", "medical_note"]
selected.to_csv("mtsample_gastroenterology_neurology.csv", index=False)

In [10]:
selected

| | id | sample_type | medical_note |
|---|---|---|---|
| 1328 | 1 | Gastroenterology | PREOPERATIVE DIAGNOSIS: Abdominal wall abscess... |
| 1329 | 2 | Gastroenterology | PREOPERATIVE DIAGNOSES: 1. Congenital chylous ... |
| 1330 | 3 | Gastroenterology | CHIEF COMPLAINT: Abdominal pain.HISTORY OF PRE... |
| 1331 | 4 | Gastroenterology | PREOPERATIVE DIAGNOSIS: Recurrent re-infected ... |
| 1332 | 5 | Gastroenterology | CHIEF COMPLAINT: Nausea.PRESENT ILLNESS: The p... |
| ... | ... | ... | ... |
| 2259 | 447 | Neurology | TIME SEEN: 0734 hours and 1034 hours.TOTAL REC... |
| 2260 | 448 | Neurology | DATE OF EXAMINATION: Start: 12/29/2008 at 1859... |
| 2261 | 449 | Neurology | EEG during wakefulness demonstrates background... |
| 2262 | 450 | Neurology | PROCEDURE: EEG during wakefulness demonstrates... |


In [65]:
dat_gas_neu = pd.read_csv("mtsample_gastroenterology_neurology.csv")
gas_neu_notes = list(dat_gas_neu.medical_note)
dat_gas_neu.shape

(451, 6)

In [None]:
# Do not run, very expensive
# download and save result
mes = []
for nt in tqdm(gas_neu_notes):
    me = client.detect_entities_v2(Text = nt)
    me = me["Entities"]
    mes.append(me)

In [15]:
# save the entities, they cost real money
import json
with open("comprehend_medical_neurology_gastroenterology.txt", "w") as f:
    json.dump(mes, f)

NameError: name 'mes' is not defined
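
The `NameError` above is more than a nuisance: `open(..., "w")` truncates the file before `json.dump` is reached, so rerunning the save cell in a fresh kernel can leave an empty file in place of paid-for results. One possible guard, sketched here with a hypothetical helper `save_json_once` and made-up data, is to open the file in exclusive-creation mode `"x"`, which fails if the file already exists.

```python
import json
import os
import tempfile

def save_json_once(obj, path):
    """Write obj as JSON, refusing to touch a file that already exists."""
    # Mode "x" raises FileExistsError instead of truncating an existing file.
    with open(path, "x") as f:
        json.dump(obj, f)

# Demonstration in a temporary directory with made-up data.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "entities_demo.txt")

save_json_once([["asthma", "exudate-1"]], path)
try:
    save_json_once([], path)  # second write is rejected
    overwritten = True
except FileExistsError:
    overwritten = False

with open(path) as f:
    restored = json.load(f)
print(overwritten, restored)
```

Because the object is passed as an argument, an undefined variable raises its `NameError` before the file is ever opened.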

In [2]:
# load comprehend_medical
with open("comprehend_medical_neurology_gastroenterology.txt", "r") as f:
    mes = json.load(f)
len(mes)

451

In [10]:
bows = []
for me in mes:
    bows.append(" ".join(get_bow(me)))
    
bows_amazon_gas_neu = pd.DataFrame({
    "id" : range(1, len(bows) + 1),
    "amazon_me" : bows
})

bows_amazon_gas_neu.to_csv("amazon_medical_entities.csv", index=False)