# Augmenting Document Splits/Chunks with Healthcare NLP for RAG Applications

In [0]:
from johnsnowlabs import nlp, medical

spark



## Splitting with Medical Document Splitter

The `InternalDocumentSplitter` annotator splits large documents into smaller ones. Its `setSplitMode` method determines how the documents are split.

If `splitMode` is `recursive`, it takes the separators in order and splits any subtext that exceeds the chunk length, optionally overlapping consecutive chunks.
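
As a rough illustration of the recursive mode described above, the following pure-Python sketch tries the separators in order and only recurses into pieces that are still longer than the chunk size. It is not the library's implementation: the function name `recursive_split`, the default separator list, and the omission of chunk overlap are simplifying assumptions made here.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters (no overlap)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring parts while they still fit.
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(part) <= chunk_size:
                current = part
            else:
                # The part alone is too long: retry with the next separator.
                chunks.extend(recursive_split(part, chunk_size, rest))
                current = ""
    if current:
        chunks.append(current)
    return chunks
```

Coarse separators (blank lines) are preferred, and finer ones (single newlines, spaces, single characters) are used only for the pieces that still do not fit.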

Additionally, you can set:
- custom patterns with `setSplitPatterns`
- whether patterns should be interpreted as regex with `setPatternsAreRegex`
- whether to keep the separators with `setKeepSeparators`
- whether to trim whitespace with `setTrimWhitespace`
- whether to explode the splits into individual rows with `setExplodeSplits`

**Parameters**:

- `chunkSize`: Size of each chunk of text. This param is applicable only for "recursive" splitMode.
- `chunkOverlap`: Length of the overlap between text chunks, by default `0`. This param is applicable only for `recursive` splitMode.
- `splitPatterns`: Patterns to split the document.
- `patternsAreRegex`: Whether to interpret the split patterns as regular expressions, by default `True`.
- `keepSeparators`: Whether to keep the separators in the final result, by default `True`. This param is applicable only for "recursive" splitMode.
- `explodeSplits`: Whether to explode split chunks to separate rows, by default `False`.
- `trimWhitespace`: Whether to trim whitespaces of extracted chunks, by default `True`.
- `splitMode`: The split mode to determine how text should be segmented. Default: 'regex'. It should be one of the following values:
  - "char": Split text based on individual characters.
  - "token": Split text based on tokens. You should supply tokens from inputCols.
  - "sentence": Split text based on sentences. You should supply sentences from inputCols.
  - "recursive": Split text recursively using a specific algorithm.
  - "regex": Split text based on a regular expression pattern.
- `sentenceAwareness`: Whether to respect sentence boundaries when splitting, where possible. Default: `False`.
  - If true, a split can end before `maxLength` is reached.
  - If true, you should supply sentences from inputCols.
  - This param is not applicable for `regex` and `recursive` splitMode.
- `maxLength`: The maximum length allowed for splitting. The unit depends on the split mode:
  - "char": Maximum length is measured in characters. Default: `512`
  - "token": Maximum length is measured in tokens. Default: `128`
  - "sentence": Maximum length is measured in sentences. Default: `8`
- `customBoundsStrategy`: The custom bounds strategy for text splitting using regular expressions. This param is applicable only for `regex` splitMode.
- `caseSensitive`: Whether to use case sensitive when matching regex, by default `False`. This param is applicable only for `regex` splitMode.
- `metaDataFields`: Names of the input columns whose values should be added to the metadata of the split documents.
- `enableSentenceIncrement`: Whether the sentence index should be incremented in the metadata of the annotator. When set to true, the annotator increments the sentence index in the metadata for each split document. Default: `False`.

In [0]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/healthcare-nlp/data/mt_samples.csv

In [0]:
import pandas as pd

df = pd.read_csv("mt_samples.csv")
df.head()



Unnamed: 0,text
0,Sample Type / Medical Specialty:\nHematology -...
1,Sample Type / Medical Specialty:\nHematology -...
2,Sample Type / Medical Specialty:\nHematology -...
3,Sample Type / Medical Specialty:\nHematology -...
4,Sample Type / Medical Specialty:\nHematology -...


In [0]:
note = df.loc[0,'text']
print(note)

Sample Type / Medical Specialty:
Hematology - Oncology
Sample Name:
Discharge Summary - Mesothelioma - 1
Description:
Mesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.
(Medical Transcription Sample Report)
PRINCIPAL DIAGNOSIS:
Mesothelioma.
SECONDARY DIAGNOSES:
Pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.
PROCEDURES
1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.
2. On August 20, 2007, thoracentesis.
3. On August 31, 2007, Port-A-Cath placement.
HISTORY AND PHYSICAL:
The patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week. She has had right-sided chest pain radiating to her back with fever starting yesterday. She has a history of pericarditis and pericardectomy in May 2006 and developed cough with right-sided chest pain, and went to an urgent care cen

In [0]:
import uuid
import numpy as np

df['document_id'] = [str(uuid.uuid4()) for _ in range(len(df))]

df['patient_id'] = ["pt_{}".format(np.random.randint(5, 30)) for _ in range(len(df))]
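
Because `uuid.uuid4()` and `np.random.randint` are unseeded here, the IDs above change on every run. If reproducible IDs are useful, one option is to derive them from a seeded generator. The sketch below uses only the standard library; the helper name `make_ids` and the seed value are illustrative assumptions, not part of the original notebook.

```python
import random
import uuid

def make_ids(n_rows, seed=42, low=5, high=30):
    """Return (document_ids, patient_ids) that are stable for a given seed."""
    rng = random.Random(seed)
    # Build version-4 UUIDs from seeded random bits instead of uuid.uuid4().
    document_ids = [
        str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(n_rows)
    ]
    # Mirror np.random.randint(5, 30): the upper bound is exclusive.
    patient_ids = ["pt_{}".format(rng.randrange(low, high)) for _ in range(n_rows)]
    return document_ids, patient_ids
```

The two lists could then be assigned to `df['document_id']` and `df['patient_id']` in place of the unseeded versions.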

In [0]:
df.head()

Unnamed: 0,text,document_id,patient_id
0,Sample Type / Medical Specialty:\nHematology -...,2fc10874-3f71-448b-94b1-f4ea02618f1a,pt_9
1,Sample Type / Medical Specialty:\nHematology -...,4811cdb3-eb26-4313-914d-fb8c7867e3a1,pt_19
2,Sample Type / Medical Specialty:\nHematology -...,c152183b-8c4f-4475-b3e2-d21f6ffa2259,pt_22
3,Sample Type / Medical Specialty:\nHematology -...,b4e1f0ac-7ec9-4943-9efa-196f661fc116,pt_24
4,Sample Type / Medical Specialty:\nHematology -...,e3dc12c2-57c4-4df1-a3b1-e2ec17bdbd16,pt_7


In [0]:
import pandas as pd

spark_df = spark.createDataFrame(df)

spark_df.write.mode("overwrite").parquet('dbfs:/mtsamples_clinical_records.parquet')


In [0]:
#spark_df = spark.read.parquet("dbfs:/mtsamples_clinical_records.parquet")
#spark_df.show(5)

### Split by LangChain

In [0]:
from langchain.document_loaders import PySparkDataFrameLoader

loader = PySparkDataFrameLoader(spark, spark_df, page_content_column="text")

documents = loader.load()

In [0]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)

split_texts = text_splitter.split_documents(documents)
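
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts fixed-size character windows in which each window repeats the tail of the previous one. This is deliberately not LangChain's algorithm, which also tries to break on separators; `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # The current window already reaches the end of the text.
            break
    return chunks
```

With `chunk_size=1000` and `chunk_overlap=50`, each window would start 950 characters after the previous one.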

In [0]:
split_texts[0]

Out[11]: Document(page_content='Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1\nDescription:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.\n(Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS:\nMesothelioma.\nSECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.\nPROCEDURES\n1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.\n2. On August 20, 2007, thoracentesis.\n3. On August 31, 2007, Port-A-Cath placement.\nHISTORY AND PHYSICAL:', metadata={'document_id': '2fc10874-3f71-448b-94b1-f4ea02618f1a', 'patient_id': 'pt_9'})

### Split by Medical Splitter

#### Split by Section Headers

In [0]:
document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = nlp.SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en","clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl_slim", "en", "clinical/models")\
      .setInputCols(["sentence","token", "word_embeddings"])\
      .setOutputCol("ner")

ner_converter = medical.NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Header"])

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter])

section_model = pipeline.fit(spark_df)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
ner_jsl_slim download started this may take some time.
[ | ][OK!]


In [0]:
section_model_lp = nlp.LightPipeline(section_model)

In [0]:
headers = section_model_lp.annotate(note)['ner_chunk']
headers

Out[14]: ['PRINCIPAL DIAGNOSIS:',
 'SECONDARY DIAGNOSES:',
 'PROCEDURES',
 'HISTORY AND PHYSICAL:',
 'PAST MEDICAL HISTORY',
 'FAMILY HISTORY:',
 'SOCIAL HISTORY:',
 'MEDICATIONS',
 'REVIEW OF SYSTEMS:',
 'PHYSICAL EXAMINATION\nVITAL SIGNS:',
 'GENERAL:',
 'LABORATORY DATA:',
 'HOSPITAL COURSE:']

In [0]:
from pyspark.sql import functions as F

result = section_model.transform(spark_df)

headers_df = result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).cache()

headers_df.show(truncate=False)

+--------------------------------------------------------------+---------+----------+
|chunk                                                         |ner_label|confidence|
+--------------------------------------------------------------+---------+----------+
|PRINCIPAL DIAGNOSIS:                                          |Header   |0.5713667 |
|SECONDARY DIAGNOSES:                                          |Header   |0.6755    |
|PROCEDURES                                                    |Header   |0.9934    |
|HISTORY AND PHYSICAL:                                         |Header   |0.66112494|
|PAST MEDICAL HISTORY                                          |Header   |0.69296664|
|FAMILY HISTORY:                                               |Header   |0.7933    |
|SOCIAL HISTORY:                                               |Header   |0.70523334|
|MEDICATIONS                                                   |Header   |0.9037    |
|REVIEW OF SYSTEMS:                                   

In [0]:
unique_headers_df = headers_df.select('chunk').distinct()

headers = [row['chunk'] for row in unique_headers_df.collect()]

headers

Out[16]: ['PAST MEDICAL HISTORY',
 'LABORATORY DATA:',
 'PRINCIPAL DIAGNOSIS:',
 'PROCEDURES',
 'HISTORY AND PHYSICAL:',
 'FAMILY HISTORY:',
 'GENERAL:',
 'SOCIAL HISTORY:',
 'POSTOPERATIVE DIAGNOSIS:',
 'MEDICATIONS',
 'PROCEDURE:',
 'REVIEW OF SYSTEMS:',
 '(Medical Transcription Sample Report)\nPREOPERATIVE DIAGNOSIS:',
 'SECONDARY DIAGNOSES:',
 'HOSPITAL COURSE:',
 'PHYSICAL EXAMINATION\nVITAL SIGNS:',
 'TITLE OF OPERATION:',
 'PAST SURGICAL HISTORY',
 'PROCEDURE IN DETAIL:',
 'SPECIMEN:',
 'URINE OUTPUT:',
 'ALLERGIES:',
 'EBL:',
 'FLUIDS:',
 'FINDINGS:',
 'IMAGINING DATA:',
 'MEDICATIONS:',
 'HISTORY OF PRESENT ILLNESS:',
 'INDICATIONS FOR PROCEDURE:',
 'COMPLICATIONS:',
 'ASSESSMENT:',
 'HISTORY OF PRESENTING ILLNESS:',
 'GEN:',
 'IMPRESSION:',
 'PAST MEDICAL HISTORY:',
 'ASSESSMENT/PLAN:',
 'PAST MEDICAL AND SURGICAL HISTORY:',
 'PHYSICAL EXAMINATION:\nGENERAL:',
 'PHYSICAL EXAM:\nVITALS:',
 'LABORATORY FINDINGS:',
 'CURRENT MEDICATIONS:',
 'PLAN:',
 '(Medical Transcription Samp
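
The collected headers are used as split patterns later in this notebook, and the parameter list above states that patterns are interpreted as regular expressions by default. Characters such as `(` and `)` in entries like `(Medical Transcription Sample Report)\nPREOPERATIVE DIAGNOSIS:` may therefore be read as regex syntax. Whether this matters in practice depends on the splitter, so treat the following as an optional precaution: it escapes each header and puts longer headers first, so that `PAST MEDICAL HISTORY:` is tried before `PAST MEDICAL HISTORY`. The helper name `to_literal_patterns` is an assumption, and Python's escaping rules may not match the regex engine used by the annotator exactly.

```python
import re

def to_literal_patterns(headers):
    """Deduplicate headers, order longest first, and escape regex metacharacters."""
    unique = sorted(set(headers), key=lambda h: (-len(h), h))
    return [re.escape(h) for h in unique]
```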

In [0]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

document_splitter = medical.DocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setExplodeSplits(True)\
    .setSplitPatterns(headers)\
    .setCaseSensitive(True) \
    .setCustomBoundsStrategy("prepend")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

splitter_model = pipeline.fit(spark_df)
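
The cell above splits on the header patterns, and with `setCustomBoundsStrategy("prepend")` each header stays at the start of the split it introduces, as the printed splits in the next cells show. A pure-Python approximation of that behaviour, using `re` rather than the annotator, might look like the sketch below. `split_keep_headers` is a hypothetical name, and the logic is an illustration rather than the library's implementation.

```python
import re

def split_keep_headers(text, headers):
    """Split text at each header occurrence, keeping the header with its section."""
    # Longest headers first so overlapping alternatives prefer the longer match.
    ordered = sorted(set(headers), key=len, reverse=True)
    if not ordered:
        return [text.strip()] if text.strip() else []
    pattern = re.compile("|".join(re.escape(h) for h in ordered))  # case-sensitive
    starts = [m.start() for m in pattern.finditer(text)]
    # Always open a split at offset 0 so text before the first header is kept.
    bounds = sorted(set([0] + starts)) + [len(text)]
    pieces = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [p for p in pieces if p]
```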

In [0]:
splitter_model_lp = nlp.LightPipeline(splitter_model)

In [0]:
for split in splitter_model_lp.annotate(note)['splits']:
    print(split, '\n---------------------------')

Sample Type / Medical Specialty:
Hematology - Oncology
Sample Name:
Discharge Summary - Mesothelioma - 1 
---------------------------
Description:
Mesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.
(Medical Transcription Sample Report) 
---------------------------
PRINCIPAL DIAGNOSIS:
Mesothelioma. 
---------------------------
SECONDARY DIAGNOSES:
Pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis. 
---------------------------
PROCEDURES
1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.
2. On August 20, 2007, thoracentesis.
3. On August 31, 2007, Port-A-Cath placement. 
---------------------------
HISTORY AND PHYSICAL:
The patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week. She has had right-sided chest pain radiating to her back with fever starting yesterday.

#### Split by certain token length

In [0]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document", "token")\
    .setOutputCol("splits")\
    .setSplitMode("token")\
    .setMaxLength(50)\
    .setExplodeSplits(True)

token_splitter = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    document_splitter
])

pipeline_token_splitter = token_splitter.fit(spark_df)
token_splitter_model_lp = nlp.LightPipeline(pipeline_token_splitter)

In [0]:
token_splitter_model_lp.annotate(note)['splits']

Out[21]: ['Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1\nDescription:\nMesothelioma, pleural effusion,',
 'atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.\n(Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS',
 ':\nMesothelioma.\nSECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis',
 '.\nPROCEDURES\n1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.\n2. On August',
 '20, 2007, thoracentesis.\n3. On August 31, 2007, Port-A-Cath placement.\nHISTORY AND PHYSICAL:\nThe patient is a',
 '41-year-old Vietnamese female with a nonproductive cough that started last week. She has had right-sided chest pain radiating to her back with fever starting',
 'yesterday. She has a history of pericarditis and pericardectomy in May 2006 and developed cough with r
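
The token mode above caps each split at 50 tokens regardless of sentence or section boundaries, which is why some splits start mid-phrase (for example with `:` or `.`). The idea can be sketched in plain Python as grouping a token list into consecutive windows. Here whitespace tokenization stands in for the Spark NLP `Tokenizer`, so the boundaries will not match the output above exactly, and `split_by_token_count` is a hypothetical helper.

```python
def split_by_token_count(text, max_tokens):
    """Group whitespace-separated tokens into windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```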

## Assign section header for each split

In [0]:
spark_df.show()

+--------------------+--------------------+----------+
|                text|         document_id|patient_id|
+--------------------+--------------------+----------+
|Sample Type / Med...|2fc10874-3f71-448...|      pt_9|
|Sample Type / Med...|4811cdb3-eb26-431...|     pt_19|
|Sample Type / Med...|c152183b-8c4f-447...|     pt_22|
|Sample Type / Med...|b4e1f0ac-7ec9-494...|     pt_24|
|Sample Type / Med...|e3dc12c2-57c4-4df...|      pt_7|
|Sample Type / Med...|846cb9bd-fffc-44a...|      pt_7|
|Sample Type / Med...|2f3f6163-e1e5-486...|     pt_12|
|Sample Type / Med...|40de77d5-79b7-424...|     pt_14|
|Sample Type / Med...|45e753b0-ca09-4b8...|     pt_17|
|Sample Type / Med...|8e50d2ac-cef6-46f...|     pt_21|
|Sample Type / Med...|daeae18c-1260-45d...|      pt_5|
|Sample Type / Med...|ea7b76f0-5135-4d6...|      pt_8|
|Sample Type / Med...|678f9883-8502-418...|      pt_7|
|Sample Type / Med...|2573205f-d300-480...|     pt_21|
|Sample Type / Med...|c6c686f3-93ed-457...|     pt_16|
|Sample Ty

In [0]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

document_splitter = medical.DocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setExplodeSplits(True)\
    .setSplitPatterns(headers)\
    .setCaseSensitive(True) \
    .setCustomBoundsStrategy("prepend")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["splits"])\
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification\
    .pretrained('bert_sequence_classifier_clinical_sections', 'en', 'clinical/models')\
    .setInputCols(["splits", "token"])\
    .setOutputCol("section")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter,
    tokenizer,
    sequenceClassifier
])

result = pipeline.fit(spark_df).transform(spark_df)

bert_sequence_classifier_clinical_sections download started this may take some time.
[ | ][OK!]


In [0]:
result.cache()
result.show()

+--------------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+
|                text|         document_id|patient_id|            document|              splits|               token|             section|
+--------------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+
|Sample Type / Med...|2fc10874-3f71-448...|      pt_9|[{document, 0, 54...|[{document, 0, 10...|[{token, 0, 5, Sa...|[{category, 0, 10...|
|Sample Type / Med...|2fc10874-3f71-448...|      pt_9|[{document, 0, 54...|[{document, 105, ...|[{token, 105, 115...|[{category, 105, ...|
|Sample Type / Med...|2fc10874-3f71-448...|      pt_9|[{document, 0, 54...|[{document, 284, ...|[{token, 284, 292...|[{category, 284, ...|
|Sample Type / Med...|2fc10874-3f71-448...|      pt_9|[{document, 0, 54...|[{document, 319, ...|[{token, 319, 327...|[{category, 319, ...|
|Sample Type / Med...|2fc10

In [0]:
result.select('splits.result','section.result').show(truncate=100)

+----------------------------------------------------------------------------------------------------+--------------------------------+
|                                                                                              result|                          result|
+----------------------------------------------------------------------------------------------------+--------------------------------+
|[Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesot...|         [Discharge Information]|
|[Description:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal r...|[Complications and Risk Factors]|
|                                                               [PRINCIPAL DIAGNOSIS:\nMesothelioma.]|[Diagnostic and Laboratory Data]|
|[SECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux,...|                    [Procedures]|
|[PROCEDURES\n1. On August 24, 2007, decorticati

In [0]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

In [0]:
result.select('splits.result','section.result').limit(30).toPandas().sample(10)

Unnamed: 0,result,result.1
15,[FAMILY HISTORY:\nNotable for heart disease. She had three brothers that died of complications from open heart surgery. Her parents and brothers all had hypertension. Her younger brother died at the age of 18 of infection from a butcher's shop. He was cutting Argentinean beef and contracted an infection and died within 24 hours. She has one brother that is living who has angina and a sister who is 84 with dementia. She has two adult sons who are in good health.],[History]
10,[Description:\nNewly diagnosed cholangiocarcinoma. The patient is noted to have an increase in her liver function tests on routine blood work. Ultrasound of the abdomen showed gallbladder sludge and gallbladder findings consistent with adenomyomatosis.\n(Medical Transcription Sample Report)\nREASON FOR CONSULTATION:\nNewly diagnosed cholangiocarcinoma.],[Consultation and Referral]
3,[OPERATION PERFORMED:\nLeft neck dissection.],[Procedures]
18,[PHYSICAL EXAM:\nVITALS: BP: 108/60. HEART RATE: 80. TEMP: 98.5. Weight: 75 kg.],[Patient Information]
20,"[LABORATORY STUDIES:\nSodium 141, glucose 111, total bilirubin 2.3, alkaline phosphatase 941, AST 161, and ALT 220. White blood cell count 4.3, hemoglobin 11.6, hematocrit 35, and platelets 156,000. Total bilirubin from August 25, 2010 was 1.6, alkaline phosphatase 735, AST 123, ALT 184, CA99 is 109. Bile duct brushings are notable for atypical cell clusters present, highly suspicious for carcinoma.]",[Diagnostic and Laboratory Data]
27,"[SUMMARY:\nThe patient was brought to the OR in satisfactory condition and placed supine on the OR table. Underwent general anesthesia along with Marcaine in the nasal tip areas for planned excision. The area was injected, after sterile prep and drape, with Marcaine 0.25% with 1:200,000 adrenaline.\nThe specimen was sent to pathology. Margins were still positive at the inferior 6 o'clock ***** margin and this was resubmitted accordingly. Final margins were clear.\nClosure consisted of undermining circumferentially. Advancement closure with dog ear removal distally and proximally was accomplished without difficulty. Closure with interrupted 5-0 Monocryl running 7-0 nylon followed by Xeroform gauze, light pressure dressing, and Steri-Strips.\nThe patient is discharged on minocycline and Darvocet-N 100.]",[Diagnostic and Laboratory Data]
21,"[ASSESSMENT/PLAN:\nThis is a very pleasant 77-year-old female who has findings suspicious for a cholangiocarcinoma. The patient was referred to our office to discuss this diagnosis. I spent greater than an hour with the patient and her husband discussing this potential diagnosis, reviewing the anatomy and answering questions. She is yet to have a surgical consultation, and we discussed the difficulty that we sometimes have with patients meeting surgical criteria to manage cholangiocarcinoma. The patient also had questions about the Medical University and possibly seeking a second opinion. She will contact our office after her surgical consultation if she needs assistance with obtaining a second opinion. We also talked about our clinical research program here. Currently, we do have a Phase II Study for advanced gallbladder carcinoma or cholangiocarcinoma for patients that are unresectable. We will go ahead and provide her with a consent form so that she can look that over and it will give her some more information about the malignancy and treatment approaches. We will schedule her for followup in three weeks. We will also schedule her for PET/CT scan for staging.\nKeywords:\nhematology - oncology, liver function tests, gallbladder, sludge, adenomyomatosis, intrahepatic ductal dilatation, bile duct, ercp, mrcp, cholangiopancreatography, gastroenterology, common bile duct, oropharynx, cholangiocarcinoma,]",[Consultation and Referral]
11,"[HISTORY OF PRESENT ILLNESS:\nThe patient is a very pleasant 77-year-old female who is noted to have an increase in her liver function tests on routine blood work in December 2009. Ultrasound of the abdomen showed gallbladder sludge and gallbladder findings consistent with adenomyomatosis. Common bile duct was noted to be 10 mm in size on that ultrasound. She then underwent a CT scan of the abdomen in July 2010, which showed intrahepatic ductal dilatation with the common bile duct size being 12.7 mm. She then underwent an MRI MRCP, which was notable for stricture of the distal common bile duct. She was then referred to gastroenterology and underwent an ERCP. On August 24, 2010, she underwent the endoscopic retrograde cholangiopancreatography. She was noted to have a stricturing mass of the mid-to-proximal common bile duct consistent with cholangiocarcinoma. A temporary biliary stent was placed across the biliary stricture. Blood work was obtained during the hospitalization. She was also noted to have an elevated CA99. She comes in to clinic today for initial Medical Oncology consultation. After she sees me this morning, she has a follow-up consultation with a surgeon.]",[History]
28,"[NOTE:\nThe 2.6 mm loupe magnification was utilized throughout the procedure. No complications noted with excellent and all clear margins at the termination. An advancement closure technique was utilized.\nKeywords:\nhematology - oncology, basal cell carcinoma, closure, steri-strips, xeroform gauze, excision, light pressure dressing, loupe magnification, nasal tip, basal carcinoma, basal cell, cell carcinoma, biopsy, basal, carcinoma, nasal,]",[History]
24,"[POSTOPERATIVE DIAGNOSIS:\nBasal cell carcinoma, nasal tip, previous positive biopsy.]",[Procedures]


### Summarize notes

In [0]:
from sparknlp.pretrained import PretrainedPipeline

summarizer = PretrainedPipeline("summarizer_clinical_jsl_augmented_pipeline", "en", "clinical/models")


summarizer_clinical_jsl_augmented_pipeline download started this may take some time.
Approx size to download 885.7 MB
[ | ][OK!]


In [0]:
summary = summarizer.annotate(note)
summary

Out[29]: {'document': ['Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1\nDescription:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.\n(Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS:\nMesothelioma.\nSECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.\nPROCEDURES\n1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.\n2. On August 20, 2007, thoracentesis.\n3. On August 31, 2007, Port-A-Cath placement.\nHISTORY AND PHYSICAL:\nThe patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week. She has had right-sided chest pain radiating to her back with fever starting yesterday. She has a history of pericarditis and pericardectomy in May 2006 and developed cough with right-sided ch

In [0]:
summary['summary']

Out[30]: ['The patient is a 41-year-old Vietnamese female with a history of mesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and a history of deep venous thrombosis. She was admitted for a right-sided pleural effusion for thoracentesis and was started on prophylaxis for DVT with Lovenox. She was readmitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. She was started on chemotherapy on September 1, 2007 with cisplatin 75 mg/centimeter squared equaling 109 mg IV piggyback over 2 hours on September 1, 2007, and Alimta 500 mg/centimeter squared equaling 730 mg IV piggyback over 10 minutes. She was discharged the following day after discontinuing IV fluid and IV. She was instructed to follow up with Dr. XYZ in the office to check her INR on Tuesday.']

### Split augmentation using Healthcare NLP  

In [0]:
sample_split_df = result.select('patient_id','document_id','splits.result','section.result').limit(30).toPandas()
sample_split_df.head()

Unnamed: 0,patient_id,document_id,result,result.1
0,pt_9,2fc10874-3f71-448b-94b1-f4ea02618f1a,[Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1],[Discharge Information]
1,pt_9,2fc10874-3f71-448b-94b1-f4ea02618f1a,"[Description:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.\n(Medical Transcription Sample Report)]",[Complications and Risk Factors]
2,pt_9,2fc10874-3f71-448b-94b1-f4ea02618f1a,[PRINCIPAL DIAGNOSIS:\nMesothelioma.],[Diagnostic and Laboratory Data]
3,pt_9,2fc10874-3f71-448b-94b1-f4ea02618f1a,"[SECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.]",[Procedures]
4,pt_9,2fc10874-3f71-448b-94b1-f4ea02618f1a,"[PROCEDURES\n1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.\n2. On August 20, 2007, thoracentesis.\n3. On August 31, 2007, Port-A-Cath placement.]",[Procedures]
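
Each row above holds a split and its predicted section as single-element lists. For retrieval it is often convenient to flatten these into one record per chunk, with the IDs and the section carried as metadata. The sketch below assumes rows shaped like the table above. The dict keys `text` and `section` are names chosen here (both pandas columns are called `result`), `page_content`/`metadata` mirror the LangChain `Document` shown earlier, and `to_chunk_records` is a hypothetical helper.

```python
def to_chunk_records(rows):
    """Flatten split rows into chunk records with patient, document, and section metadata."""
    records = []
    for row in rows:
        text = row["text"][0] if row["text"] else ""
        section = row["section"][0] if row["section"] else None
        if not text:
            # Skip rows without any split text.
            continue
        records.append({
            "page_content": text,
            "metadata": {
                "patient_id": row["patient_id"],
                "document_id": row["document_id"],
                "section": section,
            },
        })
    return records
```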


In [0]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("word_embeddings")

# to get PROBLEM entities
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("clinical_ner")

clinical_ner_chunk = medical.NerConverter()\
    .setInputCols("sentence","token","clinical_ner")\
    .setOutputCol("clinical_ner_chunk")\
    .setWhiteList(["PROBLEM"])

# to get the entities of the ner_jsl_enriched model (no whitelist is applied)
jsl_ner = medical.NerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_chunk = medical.NerConverter()\
    .setInputCols("sentence","token","jsl_ner")\
    .setOutputCol("jsl_ner_chunk")

# to get DRUG entities
posology_ner = medical.NerModel.pretrained("ner_drugs_large", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("drugs_ner")

posology_ner_chunk = medical.NerConverter()\
    .setInputCols("sentence","token","drugs_ner")\
    .setOutputCol("drugs_ner_chunk")\
    .setWhiteList(["DRUG"])

drug_mapper = medical.ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models") \
    .setInputCols("drugs_ner_chunk")\
    .setOutputCol("drug_action")\
    .setRel("action")

# merge the chunks into a single ner_chunk
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("clinical_ner_chunk","drugs_ner_chunk")\
    .setOutputCol("merged_ner_chunk")\
    .setMergeOverlapping(False)

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "merged_ner_chunk", "word_embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","merged_ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])

chunk2doc = nlp.Chunk2Doc()\
    .setInputCols("assertion_filtered")\
    .setOutputCol("doc_final_chunk")

sbiobert_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["doc_final_chunk"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)

# filter PROBLEM entity embeddings
router_sentence_icd10 = medical.Router() \
    .setInputCols("sbert_embeddings") \
    .setFilterFieldsElements(["PROBLEM"]) \
    .setOutputCol("problem_embeddings")

# filter DRUG entity embeddings
router_sentence_rxnorm = medical.Router() \
    .setInputCols("sbert_embeddings") \
    .setFilterFieldsElements(["DRUG"]) \
    .setOutputCol("drug_embeddings")

# use problem_embeddings only
icd_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_generalised","en", "clinical/models") \
    .setInputCols(["problem_embeddings"]) \
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")

# use drug_embeddings only
rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
    .setInputCols(["drug_embeddings"]) \
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")


pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        clinical_ner_chunk,
        jsl_ner,
        jsl_ner_chunk,
        posology_ner,
        posology_ner_chunk,
        drug_mapper,
        chunk_merger,
        clinical_assertion,
        assertion_filterer,
        chunk2doc,
        sbiobert_embeddings,
        router_sentence_icd10,
        router_sentence_rxnorm,
        icd_resolver,
        rxnorm_resolver
])

empty_data = spark.createDataFrame([['']]).toDF("text")
model = pipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
ner_clinical download started this may take some time.
[ | ][OK!]
ner_jsl_enriched download started this may take some time.
[ | ][OK!]
ner_drugs_large download started this may take some time.
[ | ][OK!]
drug_action_treatment_mapper download started this may take some time.
[ | ][OK!]
assertion_dl download started this may take some time.
[ | ][OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[ | ][OK!]
sbiobertresolve_icd10cm_generalised download started this may take some time.
[ | ][OK!]
sbiobertresolve_rxnorm_augmented download started this may take some time.
[ | ][OK!]
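
The two `Router` stages in the pipeline above send PROBLEM embeddings to the ICD-10-CM resolver and DRUG embeddings to the RxNorm resolver. A minimal pure-Python sketch of that idea follows. It assumes the routing decision is made on the `entity` metadata field, and it uses plain dicts as stand-ins for annotations; `route` is a hypothetical helper, not part of the library.

```python
# Stand-in annotations: plain dicts, not real Spark NLP Annotation objects.
annotations = [
    {"result": "Pericarditis", "metadata": {"entity": "PROBLEM"}},
    {"result": "Atrial fibrillation", "metadata": {"entity": "PROBLEM"}},
    {"result": "thrombolytic", "metadata": {"entity": "DRUG"}},
]


def route(annotations, allowed_values, field="entity"):
    """Keep only annotations whose metadata[field] is in allowed_values."""
    allowed = set(allowed_values)
    return [a for a in annotations if a["metadata"].get(field) in allowed]


# One filtered stream per downstream resolver.
problem_annotations = route(annotations, ["PROBLEM"])
drug_annotations = route(annotations, ["DRUG"])

print([a["result"] for a in problem_annotations])
print([a["result"] for a in drug_annotations])
```

Filtering before resolution means each resolver only sees the mentions it was trained for.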


In [0]:

light_model = nlp.LightPipeline(model)

In [0]:
result = light_model.fullAnnotate(note)

In [0]:
result[0].keys()

Out[35]: dict_keys(['assertion_filtered', 'drug_action', 'document', 'word_embeddings', 'doc_final_chunk', 'jsl_ner_chunk', 'drugs_ner_chunk', 'assertion', 'drugs_ner', 'icd10cm_code', 'jsl_ner', 'clinical_ner', 'token', 'rxnorm_code', 'merged_ner_chunk', 'drug_embeddings', 'sbert_embeddings', 'clinical_ner_chunk', 'problem_embeddings', 'sentence'])

In [0]:
result[0]['drugs_ner_chunk']

Out[36]: [Annotation(chunk, 1163, 1174, thrombolytic, {'chunk': '0', 'confidence': '0.9072', 'ner_source': 'drugs_ner_chunk', 'entity': 'DRUG', 'sentence': '15'}, []),
 Annotation(chunk, 1609, 1621, Coumadin 1 mg, {'chunk': '1', 'confidence': '0.8162667', 'ner_source': 'drugs_ner_chunk', 'entity': 'DRUG', 'sentence': '28'}, []),
 Annotation(chunk, 1696, 1716, Amiodarone 100 mg p.o, {'chunk': '2', 'confidence': '0.88705003', 'ner_source': 'drugs_ner_chunk', 'entity': 'DRUG', 'sentence': '30'}, []),
 Annotation(chunk, 2770, 2777, Coumadin, {'chunk': '3', 'confidence': '0.9415', 'ner_source': 'drugs_ner_chunk', 'entity': 'DRUG', 'sentence': '53'}, []),
 Annotation(chunk, 4436, 4447, chemotherapy, {'chunk': '4', 'confidence': '0.9991', 'ner_source': 'drugs_ner_chunk', 'entity': 'DRUG', 'sentence': '67'}, []),
 Annotation(chunk, 4475, 4527, cisplatin 75 mg/centimeter squared equaling 109 mg IV, {'chunk': '5', 'confidence': '0.78065', 'ner_source': 'drugs_ner_chunk', 'entity': 'DRUG', 'sente

In [0]:
result[0]['clinical_ner_chunk']

Out[37]: [Annotation(chunk, 88, 99, Mesothelioma, {'chunk': '0', 'confidence': '0.973', 'ner_source': 'clinical_ner_chunk', 'entity': 'PROBLEM', 'sentence': '0'}, []),
 Annotation(chunk, 118, 129, Mesothelioma, {'chunk': '1', 'confidence': '0.9993', 'ner_source': 'clinical_ner_chunk', 'entity': 'PROBLEM', 'sentence': '0'}, []),
 Annotation(chunk, 132, 147, pleural effusion, {'chunk': '2', 'confidence': '0.99609995', 'ner_source': 'clinical_ner_chunk', 'entity': 'PROBLEM', 'sentence': '0'}, []),
 Annotation(chunk, 150, 168, atrial fibrillation, {'chunk': '3', 'confidence': '0.99815', 'ner_source': 'clinical_ner_chunk', 'entity': 'PROBLEM', 'sentence': '0'}, []),
 Annotation(chunk, 171, 176, anemia, {'chunk': '4', 'confidence': '0.9992', 'ner_source': 'clinical_ner_chunk', 'entity': 'PROBLEM', 'sentence': '0'}, []),
 Annotation(chunk, 179, 185, ascites, {'chunk': '5', 'confidence': '0.9996', 'ner_source': 'clinical_ner_chunk', 'entity': 'PROBLEM', 'sentence': '0'}, []),
 Annotation(chunk

In [0]:
[(x.result, x.begin, x.end, x.metadata['entity'], x.metadata['confidence']) for x in result[0]['jsl_ner_chunk']]

Out[38]: [('Discharge', 68, 76, 'Admission_Discharge', '0.9993'),
 ('Mesothelioma', 88, 99, 'Oncological', '0.9353'),
 ('Description', 105, 115, 'Section_Header', '0.9998'),
 ('Mesothelioma', 118, 129, 'Oncological', '0.9906'),
 ('pleural effusion', 132, 147, 'Disease_Syndrome_Disorder', '0.78805'),
 ('atrial fibrillation', 150, 168, 'Heart_Disease', '0.97745'),
 ('anemia', 171, 176, 'Disease_Syndrome_Disorder', '0.9965'),
 ('ascites', 179, 185, 'Disease_Syndrome_Disorder', '0.9642'),
 ('esophageal reflux', 188, 204, 'Disease_Syndrome_Disorder', '0.74625003'),
 ('deep venous thrombosis',
  222,
  243,
  'Disease_Syndrome_Disorder',
  '0.45533335'),
 ('PRINCIPAL DIAGNOSIS', 284, 302, 'Section_Header', '0.91955'),
 ('Mesothelioma', 305, 316, 'Oncological', '0.9835'),
 ('SECONDARY DIAGNOSES', 319, 337, 'Section_Header', '0.9634'),
 ('Pleural effusion', 340, 355, 'Disease_Syndrome_Disorder', '0.8196'),
 ('atrial fibrillation', 358, 376, 'Heart_Disease', '0.98144996'),
 ('anemia', 379, 384,

In [0]:
cols = [
    'entities_jsl_ner_chunk',
    'entities_jsl_ner_chunk_begin',
    'entities_jsl_ner_chunk_end',
    'entities_jsl_ner_chunk_origin_sentence',
    'entities_jsl_ner_chunk_class',
    'entities_jsl_ner_chunk_confidence'
]
df_clinical = nlp.nlu.to_pretty_df(model, note, positions=True, output_level='chunk')[cols].dropna()
df_clinical.head(20)



Unnamed: 0,entities_jsl_ner_chunk,entities_jsl_ner_chunk_begin,entities_jsl_ner_chunk_end,entities_jsl_ner_chunk_origin_sentence,entities_jsl_ner_chunk_class,entities_jsl_ner_chunk_confidence
0,Discharge,68,76,0,Admission_Discharge,0.9993
0,Mesothelioma,88,99,0,Oncological,0.9353
0,Description,105,115,0,Section_Header,0.9998
0,Mesothelioma,118,129,0,Oncological,0.9906
0,pleural effusion,132,147,0,Disease_Syndrome_Disorder,0.78805
0,atrial fibrillation,150,168,0,Heart_Disease,0.97745
0,anemia,171,176,0,Disease_Syndrome_Disorder,0.9965
0,ascites,179,185,0,Disease_Syndrome_Disorder,0.9642
0,esophageal reflux,188,204,0,Disease_Syndrome_Disorder,0.74625003
0,deep venous thrombosis,222,243,0,Disease_Syndrome_Disorder,0.45533335


In [0]:
split = sample_split_df.iloc[7, 2][0]
print(split)

FAMILY HISTORY:
No family history of coronary artery disease, CVA, diabetes, CHF or MI. The patient has one family member, a sister, with history of cancer.


In [0]:

df_clinical = nlp.nlu.to_pretty_df(model, split, positions=True, output_level='chunk')[cols].dropna()
df_clinical.head(20)



Unnamed: 0,entities_jsl_ner_chunk,entities_jsl_ner_chunk_begin,entities_jsl_ner_chunk_end,entities_jsl_ner_chunk_origin_sentence,entities_jsl_ner_chunk_class,entities_jsl_ner_chunk_confidence
0,FAMILY HISTORY,0,13,0,Family_History_Header,0.9994
0,coronary artery disease,37,59,0,Heart_Disease,0.65173334
0,CVA,62,64,0,Cerebrovascular_Disease,0.9995
0,diabetes,67,74,0,Diabetes,0.9939
0,CHF,77,79,0,Heart_Disease,0.9964
0,MI,84,85,0,Heart_Disease,0.9861
0,sister,125,130,1,Gender,0.9577
0,cancer,149,154,1,Oncological,0.9815
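
One way to turn a table like the one above into split-level metadata is to group the entity mentions by class. The sketch below copies the rows from that table into plain tuples; `entities_by_class` and the `0.9` confidence cutoff are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import defaultdict

# Rows copied from the table above:
# (chunk, begin, end, origin_sentence, class, confidence)
rows = [
    ("FAMILY HISTORY", 0, 13, 0, "Family_History_Header", 0.9994),
    ("coronary artery disease", 37, 59, 0, "Heart_Disease", 0.65173334),
    ("CVA", 62, 64, 0, "Cerebrovascular_Disease", 0.9995),
    ("diabetes", 67, 74, 0, "Diabetes", 0.9939),
    ("CHF", 77, 79, 0, "Heart_Disease", 0.9964),
    ("MI", 84, 85, 0, "Heart_Disease", 0.9861),
    ("sister", 125, 130, 1, "Gender", 0.9577),
    ("cancer", 149, 154, 1, "Oncological", 0.9815),
]


def entities_by_class(rows, min_confidence=0.0):
    """Group entity mentions by class, dropping low-confidence mentions."""
    grouped = defaultdict(list)
    for chunk, _begin, _end, _sentence, label, confidence in rows:
        if confidence >= min_confidence:
            grouped[label].append(chunk)
    return dict(grouped)


split_metadata = entities_by_class(rows, min_confidence=0.9)
print(split_metadata)
```

Grouping by class says nothing about negation. This split reads "No family history of ...", so an assertion status check would still be needed before treating these mentions as present conditions.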


In [0]:
split = sample_split_df.iloc[6, 2][0]
print(split)

PAST MEDICAL HISTORY
1. Pericardectomy.
2. Pericarditis.
2. Atrial fibrillation.
4. RNCA with intracranial thrombolytic treatment.
5 PTA of MCA.
6. Mesenteric venous thrombosis.
7. Pericardial window.
8. Cholecystectomy.
9. Left thoracentesis.


In [0]:
split_result = light_model.fullAnnotate(split)

In [0]:
split_result[0]['clinical_ner_chunk']

Out[69]: [Annotation(chunk, 43, 54, Pericarditis, {'chunk': '0', 'confidence': '0.9861', 'ner_source': 'clinical_ner_chunk', 'entity': 'PROBLEM', 'sentence': '2'}, []),
 Annotation(chunk, 60, 78, Atrial fibrillation, {'chunk': '1', 'confidence': '0.9803', 'ner_source': 'clinical_ner_chunk', 'entity': 'PROBLEM', 'sentence': '3'}, []),
 Annotation(chunk, 148, 175, Mesenteric venous thrombosis, {'chunk': '2', 'confidence': '0.9354334', 'ner_source': 'clinical_ner_chunk', 'entity': 'PROBLEM', 'sentence': '6'}, [])]

In [0]:
split_result[0]['icd10cm_code'][0]  # first resolved annotation, with full metadata

Out[70]: Annotation(entity, 43, 54, I31, {'chunk': '0', 'all_k_results': 'I31:::I30:::B33:::T46:::I32:::I09', 'all_k_distances': '0.0000:::4.8160:::5.9290:::6.6619:::6.9814:::7.2547', 'confidence': '0.9865', 'all_k_cosine_distances': '0.0000:::0.0393:::0.0593:::0.0724:::0.0822:::0.0868', 'all_k_resolutions': 'pericarditis:::infectious pericarditis:::viral pericarditis:::drug-induced pericarditis:::parasitic pericarditis:::rheumatic pericarditis', 'target_text': 'Pericarditis', 'all_k_aux_labels': 'no_aux_label_found:::no_aux_label_found:::no_aux_label_found:::no_aux_label_found:::no_aux_label_found:::no_aux_label_found', 'token': 'Pericarditis', 'resolved_text': 'pericarditis', 'all_k_confidences': '0.9865:::0.0080:::0.0026:::0.0013:::0.0009:::0.0007', 'distance': '0.0000', 'sentence': '2'}, [])

In [0]:
split_result[0]['icd10cm_code'][0].metadata['target_text']

Out[71]: 'Pericarditis'

In [0]:
[(x.metadata['target_text'], x.result, x.metadata['resolved_text'], x.begin, x.end, x.metadata['confidence']) for x in split_result[0]['icd10cm_code']]

Out[72]: [('Pericarditis', 'I31', 'pericarditis', 43, 54, '0.9865'),
 ('Atrial fibrillation', 'I48', 'atrial fibrillation', 60, 78, '0.9960'),
 ('Mesenteric venous thrombosis',
  'K55',
  'mesenteric vein thrombosis',
  148,
  175,
  '0.8964')]
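
Each resolver annotation also carries the runner-up candidates in its `all_k_*` metadata fields, joined with `:::`. A small sketch of unpacking them into ranked tuples is shown below. The metadata values are copied from the Pericarditis annotation printed earlier; `ranked_candidates` is a hypothetical helper.

```python
def ranked_candidates(metadata, sep=":::"):
    """Unpack the ':::'-joined all_k_* fields into (code, resolution, confidence) tuples."""
    codes = metadata["all_k_results"].split(sep)
    names = metadata["all_k_resolutions"].split(sep)
    confidences = [float(c) for c in metadata["all_k_confidences"].split(sep)]
    return list(zip(codes, names, confidences))


# Values copied from the Pericarditis annotation shown above.
metadata = {
    "all_k_results": "I31:::I30:::B33:::T46:::I32:::I09",
    "all_k_resolutions": (
        "pericarditis:::infectious pericarditis:::viral pericarditis:::"
        "drug-induced pericarditis:::parasitic pericarditis:::rheumatic pericarditis"
    ),
    "all_k_confidences": "0.9865:::0.0080:::0.0026:::0.0013:::0.0009:::0.0007",
}

candidates = ranked_candidates(metadata)
for code, name, confidence in candidates[:3]:
    print(code, name, confidence)
```

Keeping the top few candidates rather than only the best code can help when the retrieved split is later matched against queries that use a more specific term.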

In [0]:
[(x.metadata['target_text'], x.result, x.metadata['resolved_text'], x.begin, x.end, x.metadata['confidence']) for x in split_result[0]['rxnorm_code']]

Out[73]: [('thrombolytic', '1243768', 'thrombin Topical Spray', 107, 118, '0.1422')]
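
The RxNorm match above has a confidence of only 0.1422 ("thrombolytic" resolved to "thrombin Topical Spray"), while the ICD-10-CM matches are all above 0.89. A simple guard is to separate confident resolutions from doubtful ones before attaching codes to a split. The sketch below uses the tuples printed in the two cells above; `keep_confident` and the `0.5` threshold are assumptions chosen for illustration.

```python
def keep_confident(resolutions, threshold=0.5):
    """Split resolution tuples into (kept, dropped) by their last field, the confidence."""
    kept, dropped = [], []
    for row in resolutions:
        confidence = float(row[-1])  # confidences arrive as strings
        (kept if confidence >= threshold else dropped).append(row)
    return kept, dropped


# (target_text, code, resolved_text, begin, end, confidence), copied from the outputs above.
resolutions = [
    ("Pericarditis", "I31", "pericarditis", 43, 54, "0.9865"),
    ("Atrial fibrillation", "I48", "atrial fibrillation", 60, 78, "0.9960"),
    ("Mesenteric venous thrombosis", "K55", "mesenteric vein thrombosis", 148, 175, "0.8964"),
    ("thrombolytic", "1243768", "thrombin Topical Spray", 107, 118, "0.1422"),
]

kept, dropped = keep_confident(resolutions, threshold=0.5)
print([row[1] for row in kept])
print([row[0] for row in dropped])
```

The right threshold depends on the resolver and the use case, so it is worth checking dropped rows by hand before fixing a value.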

In [0]:
[x for x in result[0]['drug_action'] if x.result!='NONE']

Out[74]: [Annotation(labeled_dependency, 2770, 2777, anticoagulant, {'chunk': '3', '__trained__': 'Coumadin', 'relation': 'action', 'all_k_distances': '0.0:::0.0', '__distance_function__': 'levenshtein', 'confidence': '0.9415', 'all_k_resolutions': 'anticoagulant:::', 'target_text': 'Coumadin', 'ner_source': 'drugs_ner_chunk', 'ops': '0.0', 'all_relations': '', 'entity': 'Coumadin', 'resolved_text': 'anticoagulant', 'distance': '0.0', 'sentence': '53', '__relation_name__': 'action'}, []),
 Annotation(labeled_dependency, 2770, 2777, cerebrovascular accident, {'chunk': '3', '__trained__': 'Coumadin', 'relation': 'treatment', 'all_k_distances': '0.0:::0.0', '__distance_function__': 'levenshtein', 'confidence': '0.9415', 'all_k_resolutions': 'cerebrovascular accident:::pulmonary embolism:::heart attack:::af:::embolization', 'target_text': 'Coumadin', 'ner_source': 'drugs_ner_chunk', 'ops': '0.0', 'all_relations': 'pulmonary embolism:::heart attack:::af:::embolization', 'entity': 'Coumadin'

In [0]:
[(x.result, x.begin, x.end, x.metadata['entity'], x.metadata['confidence']) for x in result[0]['drug_action'] if x.result!='NONE' and x.metadata['relation']=='action']

Out[75]: [('anticoagulant', 2770, 2777, 'Coumadin', '0.9415'),
 ('anticoagulant', 4863, 4869, 'heparin', '0.9948'),
 ('anti-abstinence', 4917, 4922, 'Zofran', '0.996'),
 ('antiallergic', 4925, 4933, 'Phenergan', '0.997'),
 ('anticoagulant', 4936, 4943, 'Coumadin', '0.9987')]

In [0]:
[(x.result, x.begin, x.end, x.metadata['entity'], x.metadata['confidence']) for x in result[0]['drug_action'] if x.result!='NONE' and x.metadata['relation']=='treatment']

Out[76]: [('cerebrovascular accident', 2770, 2777, 'Coumadin', '0.9415'),
 ('pulmonary embolism', 4863, 4869, 'heparin', '0.9948'),
 ('burping', 4917, 4922, 'Zofran', '0.996'),
 ('anaphylaxis', 4925, 4933, 'Phenergan', '0.997'),
 ('cerebrovascular accident', 4936, 4943, 'Coumadin', '0.9987')]
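
The action and treatment lists above describe the same drug mentions, so they can be joined on the character span to get one record per mention. The sketch below uses the tuples printed in the two cells above; `merge_drug_relations` is a hypothetical helper.

```python
# (result, begin, end, drug, confidence), copied from the two outputs above.
actions = [
    ("anticoagulant", 2770, 2777, "Coumadin", "0.9415"),
    ("anticoagulant", 4863, 4869, "heparin", "0.9948"),
    ("anti-abstinence", 4917, 4922, "Zofran", "0.996"),
    ("antiallergic", 4925, 4933, "Phenergan", "0.997"),
    ("anticoagulant", 4936, 4943, "Coumadin", "0.9987"),
]
treatments = [
    ("cerebrovascular accident", 2770, 2777, "Coumadin", "0.9415"),
    ("pulmonary embolism", 4863, 4869, "heparin", "0.9948"),
    ("burping", 4917, 4922, "Zofran", "0.996"),
    ("anaphylaxis", 4925, 4933, "Phenergan", "0.997"),
    ("cerebrovascular accident", 4936, 4943, "Coumadin", "0.9987"),
]


def merge_drug_relations(actions, treatments):
    """Join action and treatment rows on (begin, end) into one dict per drug mention."""
    merged = {}
    for relation, rows in (("action", actions), ("treatment", treatments)):
        for value, begin, end, drug, _confidence in rows:
            entry = merged.setdefault(
                (begin, end), {"drug": drug, "begin": begin, "end": end}
            )
            entry[relation] = value
    # Return mentions in document order.
    return [merged[key] for key in sorted(merged)]


drug_records = merge_drug_relations(actions, treatments)
for record in drug_records:
    print(record)
```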

In [0]:
txt = "A 28-year-old female with a history of type-2 diabetes mellitus diagnosed eight years ago takes 500 mg metformin 3 times per day."

txt_result = light_model.fullAnnotate(txt)

In [0]:
[(x.result, x.begin, x.end, x.metadata['entity'], x.metadata['confidence']) for x in txt_result[0]['drug_action'] if x.result!='NONE' and x.metadata['relation']=='treatment']

Out[78]: []

In [0]:
txt_result[0]['drugs_ner_chunk']

Out[54]: [Annotation(chunk, 103, 111, metformin, {'chunk': '0', 'confidence': '0.9991', 'ner_source': 'drugs_ner_chunk', 'entity': 'DRUG', 'sentence': '0'}, [])]

In [0]:
#### The same approach extends to further augmentations (other NER models, mappers, and resolvers).
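
To close the loop for RAG, the extracted information can be stored next to each split as metadata. The sketch below builds such a record for the PAST MEDICAL HISTORY split, using the text and ICD-10-CM results printed earlier. `augment_split`, the metadata key names, and the `0.5` cutoff are assumptions for illustration; the record layout should follow whatever the chosen vector store expects.

```python
import json

# Split text and ICD-10-CM rows copied from the outputs earlier in this notebook.
split_text = """PAST MEDICAL HISTORY
1. Pericardectomy.
2. Pericarditis.
2. Atrial fibrillation.
4. RNCA with intracranial thrombolytic treatment.
5 PTA of MCA.
6. Mesenteric venous thrombosis.
7. Pericardial window.
8. Cholecystectomy.
9. Left thoracentesis."""

icd_rows = [
    ("Pericarditis", "I31", "pericarditis", 43, 54, "0.9865"),
    ("Atrial fibrillation", "I48", "atrial fibrillation", 60, 78, "0.9960"),
    ("Mesenteric venous thrombosis", "K55", "mesenteric vein thrombosis", 148, 175, "0.8964"),
]


def augment_split(text, icd_rows, min_confidence=0.5):
    """Attach confident problem mentions and their ICD-10-CM codes to a split."""
    confident = [row for row in icd_rows if float(row[-1]) >= min_confidence]
    return {
        "text": text,
        "metadata": {
            "problems": [row[0] for row in confident],
            "icd10cm_codes": [row[1] for row in confident],
        },
    }


record = augment_split(split_text, icd_rows)
print(json.dumps(record["metadata"], indent=2))
```

Storing codes as metadata allows filtering on them at retrieval time, in addition to similarity search over the text.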