<a href="https://colab.research.google.com/github/Hung304-WBLEM/Lung-NLP/blob/main/Stanza_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Install Stanza NLP package**

In [None]:
pip install stanza

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stanza
  Downloading stanza-1.4.0-py3-none-any.whl (574 kB)
Collecting transformers
  Downloading transformers-4.19.4-py3-none-any.whl (4.2 MB)
Collecting emoji
  Downloading emoji-1.7.0.tar.gz (175 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.

#**First example: Load the NLP biomedical model and run the Named Entity Recognition (NER) task on a sample sentence**

In [None]:
import stanza

You can download and initialize the biomedical Named Entity Recognition (NER) models by passing a dict to the `processors` argument. The following example downloads and initializes a pipeline with the **`MIMIC`** syntactic analysis models and the **`radiology`** clinical NER model.

In [None]:
# download and initialize a mimic pipeline with a radiology NER model
stanza.download('en', package='mimic', processors={'ner': 'radiology'})
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'radiology'})

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-14 21:41:11 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package   |
-------------------------------
| tokenize        | mimic     |
| pos             | mimic     |
| lemma           | mimic     |
| depparse        | mimic     |
| ner             | radiology |
| pretrain        | mimic     |
| forward_charlm  | mimic     |
| backward_charlm | mimic     |

2022-06-14 21:41:11 INFO: File exists: /root/stanza_resources/en/tokenize/mimic.pt
2022-06-14 21:41:11 INFO: File exists: /root/stanza_resources/en/pos/mimic.pt
2022-06-14 21:41:11 INFO: File exists: /root/stanza_resources/en/lemma/mimic.pt
2022-06-14 21:41:11 INFO: File exists: /root/stanza_resources/en/depparse/mimic.pt
2022-06-14 21:41:11 INFO: File exists: /root/stanza_resources/en/ner/radiology.pt
2022-06-14 21:41:11 INFO: File exists: /root/stanza_resources/en/pretrain/mimic.pt
2022-06-14 21:41:12 INFO: File exists: /root/stanza_resources/en/forward_charlm/mimic.pt
2022-06

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-14 21:41:13 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | mimic     |
| pos       | mimic     |
| lemma     | mimic     |
| depparse  | mimic     |
| ner       | radiology |

2022-06-14 21:41:13 INFO: Use device: gpu
2022-06-14 21:41:13 INFO: Loading: tokenize
2022-06-14 21:41:13 INFO: Loading: pos
2022-06-14 21:41:13 INFO: Loading: lemma
2022-06-14 21:41:13 INFO: Loading: depparse
2022-06-14 21:41:13 INFO: Loading: ner
2022-06-14 21:41:14 INFO: Done loading processors!


Now we test the model on a sample sentence: "The patient had a sore throat and was treated with Cepacol lozenges." We print every recognized entity by looping over `doc.entities`, a Python list of the entities detected in the input sentence. Each element `ent` is an object with four attributes: `text`, `type`, `start_char`, and `end_char`.

These four attributes are:

1) text: the text of the entity as it appears in the input

2) type: the NER type that the nlp model assigned to the entity

3) start_char: index of the first character of the entity in the input sentence

4) end_char: index just past the last character of the entity in the input sentence (the end is exclusive)

In [None]:
# annotate clinical text
doc = nlp('The patient had a sore throat and was treated with Cepacol lozenges.')
# print out all entities
for ent in doc.entities:
    print(ent) # try printing out the ent object
    print(ent.text, ent.type) # try printing out 2 attributes: text and type

{
  "text": "sore throat",
  "type": "ANATOMY",
  "start_char": 18,
  "end_char": 29
}
sore throat ANATOMY
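
The offsets can be checked without the model: slicing the input string with the `start_char` and `end_char` values printed above should recover the entity text. The sketch below hard-codes the two offsets from that output instead of calling Stanza, and the variable names are only illustrative.

```python
# Input sentence and the offsets reported for the entity in the output above.
text = 'The patient had a sore throat and was treated with Cepacol lozenges.'
start_char, end_char = 18, 29

# end_char is exclusive, so an ordinary Python slice recovers the entity text.
entity_text = text[start_char:end_char]
print(entity_text)
```

With a live pipeline, the same slice would be `text[ent.start_char:ent.end_char]` for each `ent` in `doc.entities`.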


In [None]:
# Check the datatype of each variable above
print(type(doc))
print(type(doc.entities))
print(type(ent))

<class 'stanza.models.common.doc.Document'>
<class 'list'>
<class 'stanza.models.common.doc.Span'>


#**Second example: From document to sentences, and from sentences to words. How to get all the detected entities?**

Load a Stanza NLP model

In [None]:
import stanza

nlp = stanza.Pipeline('en') # Here we try a simple English model

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/tokenize/combined.pt:   0%|    …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/pos/combined.pt:   0%|         …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/lemma/combined.pt:   0%|       …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/depparse/combined.pt:   0%|    …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/sentiment/sstplus.pt:   0%|    …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/constituency/wsj.pt:   0%|     …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/ner/ontonotes.pt:   0%|        …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/pretrain/combined.pt:   0%|    …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/forward_charlm/1billion.pt:   0…

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/backward_charlm/1billion.pt:   …

2022-06-14 22:02:07 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-06-14 22:02:07 INFO: Use device: gpu
2022-06-14 22:02:07 INFO: Loading: tokenize
2022-06-14 22:02:07 INFO: Loading: pos
2022-06-14 22:02:07 INFO: Loading: lemma
2022-06-14 22:02:07 INFO: Loading: depparse
2022-06-14 22:02:07 INFO: Loading: sentiment
2022-06-14 22:02:08 INFO: Loading: constituency
2022-06-14 22:02:08 INFO: Loading: ner
2022-06-14 22:02:09 INFO: Done loading processors!


Run the model on a sample text that contains several sentences

In [None]:
# The input holds several sentences, so we call it `text`
# (the loops below use `sentence` for each detected sentence).
text = ('Obama was born in Honolulu, Hawaii. '
        'After graduating from Columbia University in 1983, '
        'he worked as a community organizer in Chicago. '
        'In 1988, he enrolled in Harvard Law School, '
        'where he was the first black president of the Harvard Law Review.')
doc = nlp(text)

Loop through all the sentences and print out all the detected entities of each sentence

In [None]:
for sentence in doc.sentences:
    print('Entities:', sentence.ents)     

Entities: [{
  "text": "Obama",
  "type": "PERSON",
  "start_char": 0,
  "end_char": 5
}, {
  "text": "Honolulu",
  "type": "GPE",
  "start_char": 18,
  "end_char": 26
}, {
  "text": "Hawaii",
  "type": "GPE",
  "start_char": 28,
  "end_char": 34
}]
Entities: [{
  "text": "Columbia University",
  "type": "ORG",
  "start_char": 58,
  "end_char": 77
}, {
  "text": "1983",
  "type": "DATE",
  "start_char": 81,
  "end_char": 85
}, {
  "text": "Chicago",
  "type": "GPE",
  "start_char": 125,
  "end_char": 132
}]
Entities: [{
  "text": "1988",
  "type": "DATE",
  "start_char": 137,
  "end_char": 141
}, {
  "text": "Harvard Law School",
  "type": "ORG",
  "start_char": 158,
  "end_char": 176
}, {
  "text": "first",
  "type": "ORDINAL",
  "start_char": 195,
  "end_char": 200
}, {
  "text": "the Harvard Law Review",
  "type": "ORG",
  "start_char": 220,
  "end_char": 242
}]
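
Note that the offsets in this output are measured from the start of the whole input, not from the start of each sentence: `Columbia University`, in the second sentence, starts at character 58. A small sketch that checks this by slicing the input with a few offsets copied from the output above; the text is rebuilt here with implicit string concatenation, and `spans` and `recovered` are just illustrative names.

```python
# Same text as in the cell above, written with implicit string concatenation.
text = (
    'Obama was born in Honolulu, Hawaii. '
    'After graduating from Columbia University in 1983, '
    'he worked as a community organizer in Chicago. '
    'In 1988, he enrolled in Harvard Law School, '
    'where he was the first black president of the Harvard Law Review.'
)

# (start_char, end_char) pairs copied from the entity output above.
spans = [(0, 5), (58, 77), (125, 132), (220, 242)]

# Slice the full input, not the individual sentences.
recovered = [text[start:end] for start, end in spans]
print(recovered)
```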


The outer loop goes through the sentences, and the inner loop goes through the words of each sentence.

Each word has several attributes (e.g. 'id', 'text', 'lemma', 'feats', ...). Here, we print the whole word object and then three of its attributes: the text, the lemma, and the part-of-speech tag.

In [None]:
for sentence in doc.sentences:
    for word in sentence.words:
        print(word) # the whole word object
        print(word.text, word.lemma, word.pos) # text, lemma, and part-of-speech tag
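
The word records in the output of this cell also contain `head` and `deprel` fields, which encode the dependency parse: `head` is the `id` of the word's syntactic head within the same sentence, and 0 marks the root. The following is a minimal sketch that resolves each head id to its word. It uses plain dicts with values copied from the first five records of that output instead of live `Word` objects, and keeps only the fields it needs.

```python
# id, text, head and deprel of the first five words, copied from the cell output.
words = [
    {'id': 1, 'text': 'Obama', 'head': 3, 'deprel': 'nsubj:pass'},
    {'id': 2, 'text': 'was', 'head': 3, 'deprel': 'aux:pass'},
    {'id': 3, 'text': 'born', 'head': 0, 'deprel': 'root'},
    {'id': 4, 'text': 'in', 'head': 5, 'deprel': 'case'},
    {'id': 5, 'text': 'Honolulu', 'head': 3, 'deprel': 'obl'},
]

# Word ids are 1-based, so look heads up by id; head 0 means the word is the root.
text_by_id = {w['id']: w['text'] for w in words}
edges = []
for w in words:
    head_text = text_by_id[w['head']] if w['head'] > 0 else 'ROOT'
    edges.append((w['text'], w['deprel'], head_text))

for edge in edges:
    print(edge)
```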

{
  "id": 1,
  "text": "Obama",
  "lemma": "Obama",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 3,
  "deprel": "nsubj:pass",
  "start_char": 0,
  "end_char": 5
}
Obama Obama PROPN
{
  "id": 2,
  "text": "was",
  "lemma": "be",
  "upos": "AUX",
  "xpos": "VBD",
  "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
  "head": 3,
  "deprel": "aux:pass",
  "start_char": 6,
  "end_char": 9
}
was be AUX
{
  "id": 3,
  "text": "born",
  "lemma": "bear",
  "upos": "VERB",
  "xpos": "VBN",
  "feats": "Tense=Past|VerbForm=Part|Voice=Pass",
  "head": 0,
  "deprel": "root",
  "start_char": 10,
  "end_char": 14
}
born bear VERB
{
  "id": 4,
  "text": "in",
  "lemma": "in",
  "upos": "ADP",
  "xpos": "IN",
  "head": 5,
  "deprel": "case",
  "start_char": 15,
  "end_char": 17
}
in in ADP
{
  "id": 5,
  "text": "Honolulu",
  "lemma": "Honolulu",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 3,
  "deprel": "obl",
  "start_char": 18,
  "

#**Third example: Try running on real reports from one Excel file**

In [None]:
import stanza

# As in the first example, download and initialize
# a mimic pipeline with a radiology NER model
stanza.download('en', package='mimic', processors={'ner': 'radiology'})
model = stanza.Pipeline('en', package='mimic', processors={'ner': 'radiology'})

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-14 22:18:53 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package   |
-------------------------------
| tokenize        | mimic     |
| pos             | mimic     |
| lemma           | mimic     |
| depparse        | mimic     |
| ner             | radiology |
| pretrain        | mimic     |
| forward_charlm  | mimic     |
| backward_charlm | mimic     |

2022-06-14 22:18:53 INFO: File exists: /root/stanza_resources/en/tokenize/mimic.pt
2022-06-14 22:18:53 INFO: File exists: /root/stanza_resources/en/pos/mimic.pt
2022-06-14 22:18:53 INFO: File exists: /root/stanza_resources/en/lemma/mimic.pt
2022-06-14 22:18:54 INFO: File exists: /root/stanza_resources/en/depparse/mimic.pt
2022-06-14 22:18:54 INFO: File exists: /root/stanza_resources/en/ner/radiology.pt
2022-06-14 22:18:54 INFO: File exists: /root/stanza_resources/en/pretrain/mimic.pt
2022-06-14 22:18:54 INFO: File exists: /root/stanza_resources/en/forward_charlm/mimic.pt
2022-06

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-06-14 22:18:55 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | mimic     |
| pos       | mimic     |
| lemma     | mimic     |
| depparse  | mimic     |
| ner       | radiology |

2022-06-14 22:18:55 INFO: Use device: gpu
2022-06-14 22:18:55 INFO: Loading: tokenize
2022-06-14 22:18:55 INFO: Loading: pos
2022-06-14 22:18:55 INFO: Loading: lemma
2022-06-14 22:18:55 INFO: Loading: depparse
2022-06-14 22:18:56 INFO: Loading: ner
2022-06-14 22:18:56 INFO: Done loading processors!


Read the Excel file using pandas

In [None]:
import pandas as pd
df = pd.read_excel('https://github.com/Hung304-WBLEM/Lung-NLP/blob/main/AllLungCases.04.22.22.xlsx?raw=true') # You can download the file by clicking on this link

After loading the model and the data, we run the model on the reports in the Excel file above.

Each row of the table is a single case with attributes such as patient ID, study date, and report. In this example, we focus only on the report.

The code below loops over the rows of the table and extracts the report of each case from the column 'CT Reports'.

It then runs the model on the extracted report and prints the detected entities. The `break` at the end of the loop stops after the first row, so only the first report is processed here.
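
One practical caveat: pandas reads an empty spreadsheet cell as `NaN`, which is a float and not a string, so it may be worth checking each value before handing it to the pipeline. The following is a minimal sketch of such a check. `is_usable_report` is a hypothetical helper, not part of Stanza or pandas, and the sample report string is made up for illustration.

```python
def is_usable_report(value):
    """Return True only for non-empty strings (rejects NaN, None, and blank text)."""
    return isinstance(value, str) and value.strip() != ''


# Quick checks with the kinds of values a spreadsheet column may contain.
checks = [
    is_usable_report('No pleural effusion or pneumothorax.'),  # made-up report text
    is_usable_report(float('nan')),  # what pandas yields for an empty cell
    is_usable_report(None),
    is_usable_report('   '),
]
print(checks)
```

Inside the loop over the rows, a case could then be skipped with `continue` whenever the check fails.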

In [None]:
for index, row in df.iterrows():
    report = row['CT Reports']  # report text of this case

    doc = model(report)  # run the pipeline on the report

    for ent in doc.entities:
        print(ent.text, ent.type)

    break  # stop after the first report

CHEST ANATOMY
THORAX ANATOMY
Hemoptysis OBSERVATION
chest ANATOMY
tissue OBSERVATION
superior ANATOMY_MODIFIER
segment ANATOMY_MODIFIER
bronchus ANATOMY
right lower lobe ANATOMY
nodular OBSERVATION_MODIFIER
linear OBSERVATION_MODIFIER
opacity OBSERVATION
medial ANATOMY_MODIFIER
right ANATOMY_MODIFIER
major fissure ANATOMY
right lower lobe ANATOMY
base ANATOMY_MODIFIER
consolidation OBSERVATION
no UNCERTAINTY
pleural ANATOMY
effusion OBSERVATION
pneumothorax OBSERVATION
atherosclerotic OBSERVATION_MODIFIER
calcification OBSERVATION
thoracic aorta ANATOMY
coronary 
arteries ANATOMY
heart ANATOMY
not UNCERTAINTY
enlarged OBSERVATION
no UNCERTAINTY
pericardial ANATOMY
fluid OBSERVATION
thickening OBSERVATION
no UNCERTAINTY
mediastinal ANATOMY
hilar ANATOMY
adenopathy OBSERVATION
esophagus ANATOMY
unremarkable OBSERVATION
bones ANATOMY
mild OBSERVATION_MODIFIER
degenerative OBSERVATION
overlying ANATOMY_MODIFIER
soft tissues ANATOMY
unremarkable OBSERVATION
no UNCERTAINTY
axillary ANATOMY
l
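
To summarize a report, the printed (text, type) pairs can be tallied by type. The sketch below hard-codes the first six pairs from the output above as tuples so that it runs without the model. With a live pipeline, the same loop would read `ent.text` and `ent.type` from `doc.entities`. The names `pairs`, `type_counts`, and `texts_by_type` are illustrative.

```python
from collections import Counter, defaultdict

# First six (text, type) pairs copied from the output above.
pairs = [
    ('CHEST', 'ANATOMY'),
    ('THORAX', 'ANATOMY'),
    ('Hemoptysis', 'OBSERVATION'),
    ('chest', 'ANATOMY'),
    ('tissue', 'OBSERVATION'),
    ('superior', 'ANATOMY_MODIFIER'),
]

# How many entities of each type were found.
type_counts = Counter(ent_type for _, ent_type in pairs)

# Which entity texts belong to each type, in order of appearance.
texts_by_type = defaultdict(list)
for ent_text, ent_type in pairs:
    texts_by_type[ent_type].append(ent_text)

print(type_counts.most_common())
print(dict(texts_by_type))
```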