<a href="https://colab.research.google.com/github/IyadSultan/AI_pediatric_oncology/blob/main/02_Hugging_Face_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face Transformers

---

# Introduction

In this session, we will explore how to use **Hugging Face Transformers** to perform **Named Entity Recognition (NER)** on medical text.  
This tutorial is designed for **pediatric oncologists** and **healthcare professionals** with minimal coding experience, and it builds on a previous workshop where we covered **Python basics** and the **OpenAI API**.  
Here, we'll shift focus to **open-source transformer models** that run in our environment (**Google Colab**) without requiring an external API.

---

## What is NER?

**Named Entity Recognition (NER)** is a technique to automatically identify and classify key terms (entities) in text into categories like:
- Diseases
- Medications
- Symptoms

**Example**:  
In a clinical note, an NER model might automatically highlight a disease name like **"acute lymphoblastic leukemia"** or a medication like **"vincristine."**

---

## By the End of This Tutorial, You Will Be Able To:

- **Set up and install** the Hugging Face Transformers library in Colab.
- **Load a pre-trained medical NER model** from Hugging Face (no training required).
- **Tokenize and run the model** on synthetic pediatric oncology notes to extract entities.
- **Build a simple interactive UI** with Streamlit to input text and highlight extracted entities.
- **Understand real-world applications** of these tools:
  - Classification
  - Named Entity Recognition (NER)
  - Embeddings
- **Collaborate effectively** with data scientists to prototype AI solutions in clinical settings.

---

## Let's Get Started!

---


# 1. Setting Up the Colab Environment
First, we'll ensure the required libraries are installed. We need the Transformers library (from Hugging Face) and Streamlit for the UI part. In a Colab notebook, you can install packages using pip. Run the following cell to install Transformers and Streamlit:

In [None]:
!pip install transformers streamlit

Running pip install transformers will fetch the Hugging Face Transformers library, which provides access to state-of-the-art Transformer models for NLP tasks (including NER). Similarly, streamlit will be installed for building our web interface. After this installation, we can import the libraries in Python.

# 2. Importing the Transformers Library

With the libraries installed, we import the necessary classes and functions from Transformers. We'll use Hugging Face's high-level pipeline API as well as some specific classes for tokenization and modeling:

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

**pipeline** is an easy-to-use function that abstracts away much of the complexity of running models. It gives a simple interface for various tasks like NER, text generation, etc. (We'll use it for NER in this tutorial.)

As the Hugging Face docs note: “The pipelines are a great and easy way to use models for inference... offering a simple API dedicated to several tasks, including Named Entity Recognition.” In other words, pipelines let us apply a model in one line of code without deep knowledge of the model’s internals.

**AutoTokenizer** and **AutoModelForTokenClassification** are classes that automatically load the appropriate tokenizer and model architecture for a given model name. We use these to get the components for our NER model.

# 3. Loading a Pretrained Medical NER Model

Hugging Face hosts thousands of pretrained models on their Hub. We will use a general biomedical NER model called d4data/biomedical-ner-all as an example. This model was trained to recognize a wide range of biomedical entities (it can identify 107 different types of entities in medical text). These include categories like medical conditions, medications, procedures, and demographic info, making it suitable for clinical notes. The great thing is that we don’t need to train anything ourselves: we can load this pretrained model directly.

Let's load the model and tokenizer, then wrap them in a pipeline for NER:

In [None]:
model_name = "d4data/biomedical-ner-all"

# Load tokenizer and model from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline with the model, using an aggregation strategy to group tokens into entities
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="first"  # 'first' labels each word by its first sub-token, then groups adjacent tokens into entities
)


Device set to use cpu


When this code runs, it will download the model weights and tokenizer for **"d4data/biomedical-ner-all"** from Hugging Face.

Under the hood:
The **tokenizer** is responsible for preprocessing text (splitting text into tokens and converting them to numeric IDs the model understands).

The model is a transformer (DistilBERT base, in this case) fine-tuned for token classification (NER task).

The pipeline's **aggregation_strategy="first"** setting merges tokens that belong to the same entity, giving us whole entity spans instead of raw token-by-token output. With "first", each word takes the label of its first sub-token, so a single word is never split across two labels.
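
To make the tokenizer's job concrete, here is a toy sketch of greedy longest-match sub-word splitting in the WordPiece style used by BERT-family models. The function name `wordpiece_split`, the mini-vocabulary, and the IDs are all invented for illustration; the real tokenizer ships its own much larger vocabulary, and its actual splits may differ.

```python
def wordpiece_split(word, vocab):
    """Greedily split one word into the longest sub-word pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # '##' marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes 'unknown'
        pieces.append(match)
        start = end
    return pieces

# Hypothetical mini-vocabulary mapping each piece to a numeric ID
toy_vocab = {"[UNK]": 0, "fever": 1, "leuk": 2, "##emia": 3,
             "vin": 4, "##cris": 5, "##tine": 6}

for word in ["fever", "leukemia", "vincristine", "pallor"]:
    pieces = wordpiece_split(word, toy_vocab)
    ids = [toy_vocab[p] for p in pieces]  # the numeric IDs the model would see
    print(word, "->", pieces, ids)
```

Rare words are broken into smaller known pieces rather than discarded, which is why a model can still handle drug names it has seldom seen.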

In [None]:
ner_pipeline = pipeline("ner", model=model_name, aggregation_strategy="first")  # try "simple", "average", "max" or "first" as an aggregation_strategy


Device set to use cpu


This shorter form loads the model and tokenizer internally by name. We showed the longer form above for clarity.
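
The "first", "average", and "max" strategies differ in how they settle on one label when a word has been split into several sub-tokens. The sketch below is a simplified illustration of that idea based on how the Transformers documentation describes each strategy; it is not the library's own code, and the function `word_label` and the probabilities are made up.

```python
def word_label(subtoken_scores, strategy):
    """Pick one label for a word from per-sub-token {label: probability} dicts."""
    if strategy == "first":
        # Use only the first sub-token's prediction
        scores = subtoken_scores[0]
        return max(scores, key=scores.get)
    if strategy == "average":
        # Average each label's probability over all sub-tokens, then take the best
        labels = subtoken_scores[0].keys()
        mean = {lab: sum(s[lab] for s in subtoken_scores) / len(subtoken_scores)
                for lab in labels}
        return max(mean, key=mean.get)
    if strategy == "max":
        # Use the prediction of the single most confident sub-token
        best = max(subtoken_scores, key=lambda s: max(s.values()))
        return max(best, key=best.get)
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical scores for one word split into three sub-tokens ("O" = not an entity)
scores = [
    {"Medication": 0.40, "O": 0.60},
    {"Medication": 0.90, "O": 0.10},
    {"Medication": 0.80, "O": 0.20},
]
for strategy in ("first", "average", "max"):
    print(strategy, "->", word_label(scores, strategy))
```

In this invented case "first" misses the medication because the first sub-token is uncertain, while "average" and "max" recover it, which is one reason to try more than one setting on your own notes.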

# 4. Tokenizing and Running NER on Synthetic Text
Now that we have our **ner_pipeline**, let's test it on some example medical text. We'll create a synthetic clinical note inspired by pediatric oncology. It's important to use fake but realistic examples: no real patient data is used, but the content resembles what a doctor might write.

Example Clinical Note:

"*Patient is a 7-year-old boy with acute lymphoblastic leukemia (ALL) who presents with a two-week history of fever, bone pain, and fatigue. On exam, noted pallor and bruising. Plan is to start induction chemotherapy with vincristine, prednisone, and L-asparaginase.*"

This text contains various medical entities (age, disease name, symptoms, physical findings, treatment plan, drug names). Let's see if the model can identify these. We will run the pipeline on this text and examine the output:

In [None]:
# Synthetic pediatric oncology note
text = ("Patient is a 7-year-old boy with acute lymphoblastic leukemia (ALL) who presents with "
        "a two-week history of fever, bone pain, and fatigue. On exam, noted pallor and bruising. "
        "Plan is to start induction chemotherapy with vincristine, prednisone, and L-asparaginase.")

# Use the NER pipeline on the text
entities = ner_pipeline(text)

print("Entities found:", len(entities))
for ent in entities:
    print(ent)


Entities found: 11
{'entity_group': 'Age', 'score': np.float32(0.99662447), 'word': '7 - year - old', 'start': 13, 'end': 23}
{'entity_group': 'Sex', 'score': np.float32(0.9985592), 'word': 'boy', 'start': 24, 'end': 27}
{'entity_group': 'Detailed_description', 'score': np.float32(0.99693465), 'word': 'acute', 'start': 33, 'end': 38}
{'entity_group': 'Disease_disorder', 'score': np.float32(0.9993123), 'word': 'lymphoblastic leukemia', 'start': 39, 'end': 61}
{'entity_group': 'Duration', 'score': np.float32(0.9950433), 'word': 'two - week', 'start': 88, 'end': 96}
{'entity_group': 'Biological_structure', 'score': np.float32(0.90964204), 'word': 'bone', 'start': 115, 'end': 119}
{'entity_group': 'Sign_symptom', 'score': np.float32(0.99826247), 'word': 'pallor', 'start': 154, 'end': 160}
{'entity_group': 'Medication', 'score': np.float32(0.9997482), 'word': 'chemotherapy', 'start': 202, 'end': 214}
{'entity_group': 'Medication', 'score': np.float32(0.56305647), 'word': 'vincristine', 'sta

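We can see the model found entities like "7-year-old" (Age), "boy" (Sex), "lymphoblastic leukemia" (Disease_disorder), "two-week" (Duration), "pallor" (Sign_symptom), and "chemotherapy" and "vincristine" (both Medication). The result is not perfect: "acute" is tagged separately as a Detailed_description rather than as part of the disease name, only "bone" from "bone pain" is picked up (as a Biological_structure), "fever" and "fatigue" do not appear in the output shown, and "vincristine" receives a low confidence score (about 0.56). Still, without any manual coding of rules, the model identified key medical terms in the text and classified them.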
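
Because the pipeline returns a plain list of dictionaries, ordinary Python is enough to filter or group the results. The sketch below copies the `entity_group`, `score`, and `word` fields from a few of the entries printed above (the other fields are left out) and applies a hypothetical confidence cut-off of 0.9; a real project would need to validate its own threshold.

```python
from collections import defaultdict

# Fields copied from the printed output above (start/end omitted for brevity)
entities = [
    {"entity_group": "Age", "score": 0.99662447, "word": "7 - year - old"},
    {"entity_group": "Sex", "score": 0.9985592, "word": "boy"},
    {"entity_group": "Disease_disorder", "score": 0.9993123, "word": "lymphoblastic leukemia"},
    {"entity_group": "Sign_symptom", "score": 0.99826247, "word": "pallor"},
    {"entity_group": "Medication", "score": 0.9997482, "word": "chemotherapy"},
    {"entity_group": "Medication", "score": 0.56305647, "word": "vincristine"},
]

MIN_SCORE = 0.9  # hypothetical cut-off, chosen only for illustration

# Keep confident predictions; set the rest aside for human review
confident = [e for e in entities if e["score"] >= MIN_SCORE]
low_confidence = [e["word"] for e in entities if e["score"] < MIN_SCORE]

# Group the confident entities by category
by_category = defaultdict(list)
for e in confident:
    by_category[e["entity_group"]].append(e["word"])

print(dict(by_category))
print("Needs review:", low_confidence)
```

Setting low-confidence items aside for review, rather than silently dropping them, keeps a clinician in the loop for the cases where the model is least sure.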

**How does this work?**
The pipeline handled everything:
- It tokenized the input, splitting the text into word and sub-word tokens (for example, "lymphoblastic leukemia" becomes several tokens). You can see the tokens by running tokenizer.tokenize(text) if curious.
- The tokens were fed into the model, a neural network that outputs a predicted label for each token (like B-Disease_disorder and I-Disease_disorder for the beginning and inside of a disease name).
- The pipeline then aggregated those token-level labels into whole entity spans (that's why we got a single dictionary covering "lymphoblastic leukemia" as one entity, instead of separate tokens). This makes the output easier to interpret.

Hugging Face provides the model’s predictions in that convenient list-of-entities format, which we can now use for downstream purposes, such as highlighting these entities in the original text.
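
The aggregation step can be pictured with a small stand-alone sketch. It assumes word-level tokens and hand-written B-/I- labels chosen to mirror the output above; the helper `group_bio` is hypothetical and much simpler than what the pipeline does internally.

```python
def group_bio(tokens, labels):
    """Merge B-/I- tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_words = [], None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_words.append(token)  # continue the open entity
            continue
        if current_type is not None:
            spans.append((current_type, " ".join(current_words)))  # close it
        if prefix in ("B", "I"):
            current_type, current_words = entity_type, [token]  # open a new one
        else:
            current_type, current_words = None, []  # "O": outside any entity
    if current_type is not None:
        spans.append((current_type, " ".join(current_words)))
    return spans

# Hand-written labels for illustration only
tokens = ["boy", "with", "acute", "lymphoblastic", "leukemia"]
labels = ["B-Sex", "O", "B-Detailed_description",
          "B-Disease_disorder", "I-Disease_disorder"]

print(group_bio(tokens, labels))
```

The two tokens tagged B-Disease_disorder and I-Disease_disorder come back as one span, while "acute" stays separate because it carries a different label.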



# 5. Building a Simple Streamlit UI for NER
To make our demo interactive, let's build a small web interface using Streamlit. Streamlit allows us to create a user interface (with text boxes, buttons, etc.) for our Python code easily. Participants can paste in a clinical note and click a button to see the NER results with highlights, which is more engaging than just printing raw output.

We'll create a Streamlit app that does the following:
- Provides a text area for inputting (or editing) a clinical note.
- When a button is clicked, runs the ner_pipeline on the input text.
- Displays the input text with the recognized entities highlighted (e.g., with a background color).

In Colab, we will write the Streamlit app to a Python file (app.py) and then run it. Use the %%writefile magic to create the file:

In [None]:
%%writefile app.py
import streamlit as st
from transformers import pipeline

# Load the same NER pipeline inside the app (this will use the model we downloaded)
ner_pipeline = pipeline("ner", model="d4data/biomedical-ner-all", aggregation_strategy="simple")

# Streamlit UI layout
st.title("Clinical NER Demo")
st.markdown("Enter a synthetic clinical note and **extract entities** using a pretrained Transformer model:")

# A text area for input
default_text = ("Patient is a 7-year-old boy with acute lymphoblastic leukemia (ALL) who presents with "
                "a two-week history of fever, bone pain, and fatigue. On exam, noted pallor and bruising. "
                "Plan is to start induction chemotherapy with vincristine, prednisone, and L-asparaginase.")
user_input = st.text_area("Clinical Note", default_text, height=150)

# Button to run NER
if st.button("Extract Entities"):
    # Run the NER pipeline on the input text
    entities = ner_pipeline(user_input)
    # Highlight the entities in the text by wrapping them with HTML <mark> tag
    highlighted_text = user_input
    # Insert the highlight tags in reverse order of indices (to not mess up positions as we insert)
    for ent in sorted(entities, key=lambda x: x['start'], reverse=True):
        start, end = ent['start'], ent['end']
        highlighted_text = (highlighted_text[:start]
                             + f"<mark>{highlighted_text[start:end]}</mark>"
                             + highlighted_text[end:])
    # Display the highlighted text. 'unsafe_allow_html=True' lets us render the <mark> tags.
    st.write("**Extracted Entities Highlighted:**")
    st.markdown(highlighted_text, unsafe_allow_html=True)


Writing app.py


---

## Breaking Down the Code

### App Setup
- **Imports**:  
  We import **Streamlit** and the **Transformers pipeline**.

- **Model Initialization**:  
  Inside the Streamlit script, we instantiate `ner_pipeline` again.  
  - (The Streamlit app runs as a separate process from the notebook, so it loads its own copy of the model; the weights download only if they are not already cached.)

- **Interface Elements**:
  - **Title and Description**:  
    Defined using `st.title()` and `st.markdown()` for instructions.
  - **Text Input**:  
    `st.text_area()` provides a multi-line input box, pre-filled with an example note.  
    (The height is set to 150 pixels for better visibility.)

- **Button for Entity Extraction**:
  - **Trigger**:  
    `st.button("Extract Entities")` activates the following steps:
    - Run the NER pipeline on the user's text input.
    - Process the list of entities, inserting HTML highlight tags around the identified spans.
    - **Important Detail**:  
      Entities are sorted by their **start index in reverse order** to avoid disrupting earlier character positions when inserting HTML tags.
    - Display the modified, highlighted text using `st.markdown(unsafe_allow_html=True)`.
    - Add a label "**Extracted Entities Highlighted:**" for clarity.
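
The reason for inserting tags from the end of the text backwards is easier to see in isolation. This stand-alone sketch uses the character offsets of the first two entities from the earlier output; the helper `highlight` and its `reverse` switch are illustrative additions, not part of the app.

```python
def highlight(text, entities, reverse=True):
    """Wrap each entity's character span in <mark> tags."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=reverse):
        start, end = ent["start"], ent["end"]
        text = text[:start] + "<mark>" + text[start:end] + "</mark>" + text[end:]
    return text

note = "Patient is a 7-year-old boy with acute lymphoblastic leukemia"
spans = [
    {"start": 13, "end": 23},  # "7-year-old"
    {"start": 24, "end": 27},  # "boy"
]

# Right to left: earlier offsets stay valid after each insertion
print(highlight(note, spans))

# Left to right: the first insertion shifts the text, so the second span lands in the wrong place
print(highlight(note, spans, reverse=False))
```

Each inserted tag lengthens the string, so any offset to its right no longer points at the original characters; working from the last entity to the first avoids having to recompute positions.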

---

### What the App Does
- Takes input text and highlights identified entities in **yellow**.
- Example:  
  If "acute lymphoblastic leukemia" is recognized, it will be highlighted immediately after clicking **Extract Entities**.

---

## Running the Streamlit App in Colab

### Challenge
- Colab doesn’t natively display Streamlit apps (since Streamlit runs a separate web server).

### Solution: Using LocalTunnel
The next cells will:
- **Install LocalTunnel**:  
  A utility to expose localhost ports publicly.
- **Launch the Streamlit App**:  
  Run it (`app.py`) in the background.
- **Tunnel Port 8501**:  
  (Streamlit’s default port) to generate a public URL.

---


In [None]:
!npm install -q localtunnel
!streamlit run app.py & npx localtunnel --port 8501
# The tunnel page asks for a password; the curl cell below prints it (run it before this cell, which keeps running while the app is live)


In [None]:
!curl https://loca.lt/mytunnelpassword


---

## Launching a Streamlit App with LocalTunnel

### Steps Overview
- **Install LocalTunnel**:  
  The first line installs LocalTunnel using Node’s package manager (`npm`).  
  - The `-q` flag quiets the output.

- **Run the App**:  
  The second line runs the Streamlit app (`app.py`).  
  - The `&` symbol makes it run in the background.  
  - Then, `npx localtunnel --port 8501` starts the tunneling.

- **Access the URL**:  
  After a moment, you should see an output URL (usually ending with `.loca.lt`).  
  This URL tunnels into the Colab environment on port 8501.

---

### How to Use It
👉 **Click the URL that appears** (e.g., `https://warm-mouse-1234.loca.lt`).  
This will open a new page displaying your Streamlit app.

- You should see the title **"Clinical NER Demo"** and some pre-filled example text.
- Hit the **"Extract Entities"** button.
- After a second, the entities in the text will be highlighted!

---

### Experimenting with the App
- Try **editing the text** to simulate different clinical scenarios.
- Change symptoms, diseases, or add new sentences.
- Click the **"Extract Entities"** button again to see how the model responds.

---

### Troubleshooting
- If the LocalTunnel URL **doesn’t show or connect**, you may need to **rerun the cell**.  
- Occasionally, the tunnel fails to establish on the first attempt — this is normal.

---

### Real-World Context
Streamlit makes it easy to build prototype apps.  
In a real-world setting, you or your data science collaborators could deploy similar apps to:
- Let clinicians test models on their own examples.
- Quickly gather feedback and iterate on model improvements.

---


# Conclusion: Real-World Applications and Next Steps

---

# Using NLP for Clinical Text: A Hands-On Tutorial

### What We Covered
- **Installation and Setup**:  
  We installed the Transformers library, loaded a pretrained model, and built a mini-application that identifies medical entities in text, all within a short session.

- **Focus on Non-Generative AI**:  
  We demonstrated how models that **extract or classify** information (rather than generating new text) can be highly useful in healthcare.

---

### Real-World Clinical Applications

#### Information Extraction (Named Entity Recognition - NER)
- **Practical Example**: Automatically scanning pathology reports or clinical notes to find mentions of diseases, medications, allergies, or symptoms.
- **Benefits**:
  - Populating fields in an electronic health record.
  - Alerting physicians to key findings in long narratives.
  - Saving time on manual data entry.

#### Clinical Text Classification
- **Practical Example**: Categorizing patient notes or messages by urgency or by topic (e.g., cancer vs. no cancer in radiology reports).
- **In Pediatric Oncology**:  
  Classifying notes based on disease type, complications, or routine check-ups.
- **Benefits**:
  - Organizing large volumes of documents (e.g., discharge summaries, emails).
  - Supporting faster and more informed decision-making.

#### Embeddings and Semantic Search
- **Concept**:  
  Transformer models can create embeddings — numerical representations capturing the meaning of clinical text.
- **Practical Uses**:
  - Find similar past cases or research articles by meaning, not just keywords.
  - Enable case-based reasoning and smarter literature searches.
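
As a rough illustration of searching "by meaning", the sketch below ranks notes by cosine similarity between embedding vectors. The three-dimensional vectors are made up for readability; real embeddings come from a model and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for two stored notes
notes = {
    "febrile child with ALL and limb pain": [0.8, 0.2, 0.3],
    "routine follow-up, no complaints": [0.1, 0.9, 0.1],
}

# Made-up embedding for the query "fever and bone pain in a child with leukemia"
query = [0.9, 0.1, 0.2]

# Rank stored notes from most to least similar to the query
ranked = sorted(notes, key=lambda n: cosine_similarity(query, notes[n]), reverse=True)
print(ranked)
```

The query shares few exact words with the top-ranked note, yet their vectors point in a similar direction, which is the behavior that keyword search cannot provide.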

#### Other Non-Generative Applications
- **Predictive Models**: Predict outcomes like 30-day readmission risk based on discharge summaries.
- **Clustering**: Group notes to reveal hidden patterns (e.g., symptom clusters).
- **Entity Linking**: Map extracted terms to standardized codes like ICD or medical ontologies.

---

### Collaboration Is Key
- **Clinician Expertise**:  
  You know what problems need solving and can validate model outputs for clinical relevance.

- **Rapid Prototyping**:  
  In this session, we went from idea to working demo in one notebook — illustrating the power of quick iteration in clinical environments.

- **Interdisciplinary Teams**:  
  Many hospitals are forming clinician-data scientist teams to build AI prototypes for clinical documentation, decision support, and more.

---

### Next Steps
- **Fine-tuning**:  
  Try adapting a model to your institution's jargon with help from a data scientist.

- **Explore Other NLP Tasks**:  
  Look into medical question answering or summarizing long clinical reports.

- **Model Evaluation**:  
  Learn how to assess model performance rigorously (accuracy, errors) to meet healthcare standards.
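
One common starting point for evaluation is to compare the model's entities with a clinician's annotations and compute precision, recall, and F1. The sketch below uses strict exact-match scoring on (label, text) pairs; the "gold" annotations are hypothetical, and real evaluations often also give partial credit for overlapping spans.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match scores: an entity counts only if label and text both agree."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical clinician annotations for the example note
gold = [
    ("Disease_disorder", "acute lymphoblastic leukemia"),
    ("Sign_symptom", "fever"),
    ("Sign_symptom", "pallor"),
    ("Medication", "vincristine"),
]

# A subset of what the model returned earlier
predicted = [
    ("Disease_disorder", "lymphoblastic leukemia"),
    ("Sign_symptom", "pallor"),
    ("Medication", "vincristine"),
    ("Medication", "chemotherapy"),
]

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision asks how many of the model's entities were right; recall asks how many of the clinician's entities the model found. Under exact matching, "lymphoblastic leukemia" without "acute" counts as a miss, which shows how much the scoring rule matters.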

---

### Final Thoughts
Modern NLP tools are accessible even for those new to coding. With practice, you can start prototyping AI solutions that solve real-world clinical challenges, particularly in pediatric oncology and beyond.

**Happy experimenting!**

---



Let us now modify the code above so that it not only highlights the text identified by NER but also color-codes it by category.

In [None]:
%%writefile app.py
# Color-code each highlighted entity by category and show a color legend at the bottom
import streamlit as st
from transformers import pipeline

# Load the same NER pipeline inside the app
ner_pipeline = pipeline("ner", model="d4data/biomedical-ner-all", aggregation_strategy="simple")

# Streamlit UI layout
st.title("Clinical NER Demo with Color Coding")
st.markdown("Enter a synthetic clinical note and **extract entities** using a pretrained Transformer model:")

# A text area for input
default_text = ("Patient is a 7-year-old boy with acute lymphoblastic leukemia (ALL) who presents with "
                "a two-week history of fever, bone pain, and fatigue. On exam, noted pallor and bruising. "
                "Plan is to start induction chemotherapy with vincristine, prednisone, and L-asparaginase.")
user_input = st.text_area("Clinical Note", default_text, height=150)

# Button to run NER
if st.button("Extract Entities"):
    # Run the NER pipeline on the input text
    entities = ner_pipeline(user_input)

    # Color mapping keyed by the model's own label names (as printed in the NER output earlier)
    color_map = {
        "Age": "lightblue",
        "Sex": "plum",
        "Disease_disorder": "lightcoral",
        "Sign_symptom": "lightgreen",
        "Duration": "lightsalmon",
        "Medication": "lightgoldenrodyellow",
        "Biological_structure": "lightcyan",
        "Detailed_description": "lightgray",
        # Add more categories and colors as needed
    }

    highlighted_text = user_input
    for ent in sorted(entities, key=lambda x: x['start'], reverse=True):
        start, end = ent['start'], ent['end']
        entity_type = ent['entity_group']
        color = color_map.get(entity_type, "yellow")  # Default to yellow if category not in map
        highlighted_text = (highlighted_text[:start] +
                             f"<mark style='background-color:{color}'>{highlighted_text[start:end]}</mark>" +
                             highlighted_text[end:])

    st.write("**Extracted Entities Highlighted:**")
    st.markdown(highlighted_text, unsafe_allow_html=True)

    # Display color coding legend
    st.write("**Color Coding Legend:**")
    for entity_type, color in color_map.items():
        st.markdown(f"<span style='background-color:{color}; padding: 0.3em 0.5em; border-radius: 0.2em;'>{entity_type}</span>", unsafe_allow_html=True)


Overwriting app.py
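
A hand-written color map only covers the categories you thought of in advance. One possible alternative, sketched here as an assumption rather than the notebook's method, is to build the map from whatever categories the model actually returns and to escape the note text before rendering it as HTML. The names `PALETTE`, `build_color_map`, and `render` are hypothetical.

```python
import html

PALETTE = ["lightblue", "lightcoral", "lightgreen", "lightsalmon", "lightcyan", "plum"]

def build_color_map(entities, palette=PALETTE):
    """Give every category found in the results a color, cycling if there are many."""
    categories = sorted({e["entity_group"] for e in entities})
    return {cat: palette[i % len(palette)] for i, cat in enumerate(categories)}

def render(text, entities, color_map):
    """Build HTML left to right, escaping the note text and coloring each span."""
    parts, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip a span that overlaps the previous one
        parts.append(html.escape(text[cursor:ent["start"]]))
        color = color_map[ent["entity_group"]]
        span = html.escape(text[ent["start"]:ent["end"]])
        parts.append(f"<mark style='background-color:{color}'>{span}</mark>")
        cursor = ent["end"]
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)

note = "Patient is a 7-year-old boy"
found = [
    {"entity_group": "Age", "start": 13, "end": 23},
    {"entity_group": "Sex", "start": 24, "end": 27},
]
colors = build_color_map(found)
print(colors)
print(render(note, found, colors))
```

In the Streamlit app, the string returned by `render` could be passed to `st.markdown(..., unsafe_allow_html=True)`, and the legend loop could iterate over `colors` so that it lists only categories present in the note.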


In [None]:
!curl https://loca.lt/mytunnelpassword

In [None]:
!streamlit run app.py & npx localtunnel --port 8501