<a href="https://colab.research.google.com/github/IyadSultan/educational/blob/main/Hugging_Face_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face Transformers

# Introduction
In this session, we will explore how to use Hugging Face Transformers to perform Named Entity Recognition (NER) on medical text. This tutorial is designed for pediatric oncologists and healthcare professionals with minimal coding experience, and it builds on a previous workshop where we covered Python basics and the OpenAI API. Here, we'll shift focus to open-source transformer models that run in our environment (Google Colab) without requiring an external API.

What is NER?

It's a technique to automatically identify and classify key terms in text (entities) into categories like diseases, medications, symptoms, etc. For example, in a clinical note, an NER model might highlight a disease name or a drug.

By the end of this tutorial, you'll know how to:

- Set up and install the Hugging Face Transformers library in Colab.
- Load a pre-trained medical NER model from Hugging Face (no training required).
- Tokenize and run the model on synthetic pediatric oncology notes to extract entities.
- Build a simple interactive UI with Streamlit to input text and see highlighted entities.
- Understand real-world applications of such tools (classification, NER, embeddings) in clinical settings, and how clinicians can collaborate with data scientists to prototype AI solutions.


Let's get started!

# 1. Setting Up the Colab Environment
First, we'll ensure the required libraries are installed. We need the Transformers library (from Hugging Face) and Streamlit for the UI part. In a Colab notebook, you can install packages using pip. Run the following cell to install Transformers and Streamlit:

In [None]:
!pip install transformers streamlit

Collecting streamlit
  Downloading streamlit-1.44.1-py3-none-any.whl.metadata (8.9 kB)
Collecting watchdog<7,>=2.1.5 (from streamlit)
  Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl.metadata (44 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.44.1-py3-none-any.whl (9.8 MB)
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl (79 kB)
Inst

Running pip install transformers will fetch the Hugging Face Transformers library, which provides access to state-of-the-art Transformer models for NLP tasks (including NER). Similarly, streamlit will be installed for building our web interface. After this installation, we can import the libraries in Python.
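To confirm that the installation worked, you can ask Python which versions it sees. This is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is our own and is not part of Transformers or Streamlit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report what is available in the current environment
for name in ("transformers", "streamlit"):
    print(name, installed_version(name) or "not installed")
```

If either line prints "not installed", rerun the pip cell before continuing.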

# 2. Importing the Transformers Library

With the libraries installed, we import the necessary classes and functions from Transformers. We'll use Hugging Face's high-level pipeline API as well as some specific classes for tokenization and modeling:

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

**pipeline** is an easy-to-use function that abstracts away much of the complexity of running models. It provides a simple interface for various tasks such as NER and text generation (we'll use it for NER in this tutorial).

As the Hugging Face docs note: “The pipelines are a great and easy way to use models for inference... offering a simple API dedicated to several tasks, including Named Entity Recognition.” In other words, pipelines let us apply a model in one line of code without deep knowledge of the model’s internals.

**AutoTokenizer** and **AutoModelForTokenClassification** are classes that automatically load the appropriate tokenizer and model architecture for a given model name. We use these to get the components for our NER model.

# 3. Loading a Pretrained Medical NER Model

Hugging Face hosts thousands of pretrained models on their Hub. We will use a general biomedical NER model called d4data/biomedical-ner-all as an example. This model was trained to recognize a wide range of biomedical entities (it can identify 107 different types of entities in medical text!). These include categories like medical conditions, medications, procedures, demographic info, etc., making it suitable for clinical notes. The great thing is we don’t need to train anything ourselves – we can load this pretrained model directly.

Let's load the model and tokenizer, then wrap them in a pipeline for NER:

In [None]:
model_name = "d4data/biomedical-ner-all"

# Load tokenizer and model from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline with the model, using an aggregation strategy to group tokens into entities
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="first"  # each word takes the label of its first sub-word token; adjacent words with the same label are grouped into one entity
)


Device set to use cpu


When this code runs, it will download the model weights and tokenizer for **"d4data/biomedical-ner-all"** from Hugging Face.

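Under the hood:
- The **tokenizer** is responsible for preprocessing text (splitting text into tokens and converting them to numeric IDs the model understands).
- The **model** is a transformer (DistilBERT base, in this case) fine-tuned for token classification (the NER task).
- The pipeline's **aggregation_strategy** controls how token-level predictions are merged, giving us whole entity spans instead of raw token-by-token output. We use "first" here: each word takes the label of its first sub-word token, so a word is never split between two labels. With "simple", contiguous tokens with the same label are grouped, but a single word can occasionally end up split across entities.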
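To make the merging step concrete, here is a toy sketch of how token-level B-/I- labels can be grouped into entity spans. It is a simplified illustration, not the pipeline's actual implementation: the tokens and tags are hand-written, the function name group_bio_tags is our own, and scores and character offsets are left out.

```python
def group_bio_tags(tokens, tags):
    """Merge (token, tag) pairs labeled with the B-/I-/O scheme into entity spans."""
    entities = []
    current = None  # the entity currently being built, if any

    for token, tag in zip(tokens, tags):
        if tag == "O":  # "outside": not part of any entity
            if current:
                entities.append(current)
            current = None
            continue

        prefix, label = tag.split("-", 1)
        if prefix == "I" and current and current["entity_group"] == label:
            # Continuation of the entity that is already open
            current["word"] += " " + token
        else:
            # A B- tag (or a stray I- tag) starts a new entity
            if current:
                entities.append(current)
            current = {"entity_group": label, "word": token}

    if current:
        entities.append(current)
    return entities


# Hand-written example tokens and tags (for illustration only)
tokens = ["boy", "with", "lymphoblastic", "leukemia", "and", "pallor"]
tags = ["B-Sex", "O", "B-Disease_disorder", "I-Disease_disorder", "O", "B-Sign_symptom"]

for entity in group_bio_tags(tokens, tags):
    print(entity)
```

The two tokens tagged B-Disease_disorder and I-Disease_disorder come out as a single "lymphoblastic leukemia" entity, which is the kind of grouping the aggregation strategy performs for us.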

In [None]:
ner_pipeline = pipeline("ner", model=model_name, aggregation_strategy="first")  # try "simple", "average", "max" or "first" as an aggregation_strategy


Device set to use cpu


This shorter form passes the model name directly, so the pipeline loads the model and tokenizer internally. We showed the longer form above for clarity.

# 4. Tokenizing and Running NER on Synthetic Text
Now that we have our **ner_pipeline**, let's test it on some example medical text. We'll create a synthetic clinical note inspired by pediatric oncology. It's important to use fake but realistic examples: no real patient data is used, but the content resembles what a doctor might write.

Example Clinical Note:

"*Patient is a 7-year-old boy with acute lymphoblastic leukemia (ALL) who presents with a two-week history of fever, bone pain, and fatigue. On exam, noted pallor and bruising. Plan is to start induction chemotherapy with vincristine, prednisone, and L-asparaginase.*"

This text contains various medical entities (age, disease name, symptoms, physical findings, treatment plan, drug names). Let's see if the model can identify these. We will run the pipeline on this text and examine the output:

In [None]:
# Synthetic pediatric oncology note
text = ("Patient is a 7-year-old boy with acute lymphoblastic leukemia (ALL) who presents with "
        "a two-week history of fever, bone pain, and fatigue. On exam, noted pallor and bruising. "
        "Plan is to start induction chemotherapy with vincristine, prednisone, and L-asparaginase.")

# Use the NER pipeline on the text
entities = ner_pipeline(text)

print("Entities found:", len(entities))
for ent in entities:
    print(ent)


Entities found: 11
{'entity_group': 'Age', 'score': np.float32(0.99662447), 'word': '7 - year - old', 'start': 13, 'end': 23}
{'entity_group': 'Sex', 'score': np.float32(0.9985592), 'word': 'boy', 'start': 24, 'end': 27}
{'entity_group': 'Detailed_description', 'score': np.float32(0.99693465), 'word': 'acute', 'start': 33, 'end': 38}
{'entity_group': 'Disease_disorder', 'score': np.float32(0.9993123), 'word': 'lymphoblastic leukemia', 'start': 39, 'end': 61}
{'entity_group': 'Duration', 'score': np.float32(0.9950433), 'word': 'two - week', 'start': 88, 'end': 96}
{'entity_group': 'Biological_structure', 'score': np.float32(0.90964204), 'word': 'bone', 'start': 115, 'end': 119}
{'entity_group': 'Sign_symptom', 'score': np.float32(0.99826247), 'word': 'pallor', 'start': 154, 'end': 160}
{'entity_group': 'Medication', 'score': np.float32(0.9997482), 'word': 'chemotherapy', 'start': 202, 'end': 214}
{'entity_group': 'Medication', 'score': np.float32(0.56305647), 'word': 'vincristine', 'sta

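In the portion of the output shown above, the model tagged "7-year-old" as Age, "boy" as Sex, "lymphoblastic leukemia" as Disease_disorder (with "acute" labeled separately as Detailed_description), "two-week" as Duration, "pallor" as Sign_symptom, and "chemotherapy" and "vincristine" as Medication. The result is useful but not perfect: "bone" was tagged as a Biological_structure rather than "bone pain" being tagged as a symptom, "fever", "fatigue", and "bruising" do not appear among the entities shown, and "vincristine" received a much lower confidence score (about 0.56) than the other entities. Even so, without any manually coded rules, the model identified and classified many of the key medical terms in the text.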

**How does this work?**
The pipeline handled everything:
- It tokenized the input, splitting the text into word and sub-word pieces (for example, "acute", "lymphoblastic", "leukemia"; rarer words may be broken into several smaller pieces). You can see the tokens by running tokenizer.tokenize(text) if curious.
- The tokens were fed into the model, a neural network that outputs a predicted label for each token (such as B-Disease_disorder and I-Disease_disorder for the beginning and inside of a disease name).
- The pipeline then aggregated those token-level labels into whole entity spans. That is why we got a single dictionary covering "lymphoblastic leukemia" as one entity instead of separate tokens, which makes the output easier to interpret.

Hugging Face provides the model’s predictions in that convenient list-of-entities format, which we can now use for downstream purposes – such as highlighting these entities in the original text.
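Because each entity is a plain dictionary, ordinary Python is enough to post-process the results. The sketch below keeps only entities above a confidence threshold and recovers their exact wording from the original text using the start and end offsets. The first three entries are copied (with rounded scores) from the printed output; the offsets for "vincristine" are recomputed from the text because the printed output is cut off, and the 0.8 threshold and the function name confident_entities are arbitrary choices for illustration.

```python
text = ("Patient is a 7-year-old boy with acute lymphoblastic leukemia (ALL) who presents with "
        "a two-week history of fever, bone pain, and fatigue. On exam, noted pallor and bruising. "
        "Plan is to start induction chemotherapy with vincristine, prednisone, and L-asparaginase.")

# A few entities in the same format the pipeline returns (scores rounded)
sample_entities = [
    {"entity_group": "Age", "score": 0.9966, "start": 13, "end": 23},
    {"entity_group": "Disease_disorder", "score": 0.9993, "start": 39, "end": 61},
    {"entity_group": "Medication", "score": 0.9997, "start": 202, "end": 214},
]
# Offsets for "vincristine" are recomputed here, not copied from the output
v_start = text.find("vincristine")
sample_entities.append({"entity_group": "Medication", "score": 0.5631,
                        "start": v_start, "end": v_start + len("vincristine")})


def confident_entities(entities, source_text, min_score=0.8):
    """Return (label, exact source text, score) for entities at or above min_score."""
    return [
        (ent["entity_group"], source_text[ent["start"]:ent["end"]], float(ent["score"]))
        for ent in entities
        if ent["score"] >= min_score
    ]


for label, span, score in confident_entities(sample_entities, text):
    print(f"{label:<18}{span:<26}{score:.2f}")
```

Slicing the original text with start and end gives "7-year-old" rather than the tokenizer's spaced-out "7 - year - old", which is usually what you want to show to a reader.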



# 5. Building a Simple Streamlit UI for NER
To make our demo interactive, let's build a small web interface using Streamlit. Streamlit allows us to create a user interface (with text boxes, buttons, etc.) for our Python code easily. Participants can paste in a clinical note and click a button to see the NER results with highlights, which is more engaging than just printing raw output.

We'll create a Streamlit app that does the following:
- Provides a text area for inputting (or editing) a clinical note.
- When a button is clicked, runs the ner_pipeline on the input text.
- Displays the input text with the recognized entities highlighted (e.g., with a background color).

In Colab, we will write the Streamlit app to a Python file (app.py) and then run it. Use the %%writefile magic to create the file:

In [None]:
%%writefile app.py
import streamlit as st
from transformers import pipeline

# Load the same NER pipeline inside the app (this will use the model we downloaded)
ner_pipeline = pipeline("ner", model="d4data/biomedical-ner-all", aggregation_strategy="simple")

# Streamlit UI layout
st.title("Clinical NER Demo")
st.markdown("Enter a synthetic clinical note and **extract entities** using a pretrained Transformer model:")

# A text area for input
default_text = ("Patient is a 7-year-old boy with acute lymphoblastic leukemia (ALL) who presents with "
                "a two-week history of fever, bone pain, and fatigue. On exam, noted pallor and bruising. "
                "Plan is to start induction chemotherapy with vincristine, prednisone, and L-asparaginase.")
user_input = st.text_area("Clinical Note", default_text, height=150)

# Button to run NER
if st.button("Extract Entities"):
    # Run the NER pipeline on the input text
    entities = ner_pipeline(user_input)
    # Highlight the entities in the text by wrapping them with HTML <mark> tag
    highlighted_text = user_input
    # Insert the highlight tags in reverse order of indices (to not mess up positions as we insert)
    for ent in sorted(entities, key=lambda x: x['start'], reverse=True):
        start, end = ent['start'], ent['end']
        highlighted_text = (highlighted_text[:start]
                             + f"<mark>{highlighted_text[start:end]}</mark>"
                             + highlighted_text[end:])
    # Display the highlighted text. 'unsafe_allow_html=True' lets us render the <mark> tags.
    st.write("**Extracted Entities Highlighted:**")
    st.markdown(highlighted_text, unsafe_allow_html=True)


Writing app.py


Let’s break down the code above:
- We import Streamlit and our Transformers pipeline. Inside the Streamlit script, we instantiate ner_pipeline again. (When running inside the app, it needs its own copy of the model. This will use the same model name, downloading it if not already cached.)
- We define the app title and a description using st.title and st.markdown for some instructions.
- We use st.text_area to provide a multi-line text input. We even pre-fill it with our example note so users can see an example. They can edit this or replace with their own example. (The height is set to 150 pixels just to make it a bit larger.)
- We have an st.button("Extract Entities"). When the button is clicked, the code under the if block runs:
  - It calls the NER pipeline on whatever text is in user_input.
  - It then goes through the list of entities and inserts HTML `<mark>` tags around each entity span in the text. By default, the `<mark>` tag highlights text with a yellow background.
  - We sort the entities by their start index in reverse order because if we insert tags from the end of the string towards the beginning, we don't disturb the character positions of entities that come earlier in the text.
  - Finally, we display the modified highlighted_text using st.markdown with unsafe_allow_html=True (this flag is needed to render raw HTML in Streamlit, in this case to apply the highlight). We also label it with a bold caption, "Extracted Entities Highlighted:", for clarity.

That’s it! This simple app will take the input text and show you the same text with identified entities highlighted in yellow. For instance, "lymphoblastic leukemia" should be highlighted as soon as you hit "Extract Entities," confirming the model spotted it as an entity.
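The reverse-order insertion can be tried on its own, outside Streamlit. This is a minimal sketch with a made-up three-word note and hand-written offsets; the function name highlight is ours.

```python
def highlight(text, entities):
    """Wrap each entity span in <mark>...</mark>, inserting from the end backwards."""
    result = text
    # Working right to left keeps the offsets of earlier entities valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        start, end = ent["start"], ent["end"]
        result = result[:start] + "<mark>" + result[start:end] + "</mark>" + result[end:]
    return result


note = "boy with leukemia"
spans = [{"start": 0, "end": 3}, {"start": 9, "end": 17}]
print(highlight(note, spans))
# prints: <mark>boy</mark> with <mark>leukemia</mark>
```

If the tags were inserted left to right instead, the 13 characters added around "boy" would shift "leukemia" to the right, and its original offsets (9 to 17) would no longer point at the right characters.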


**Running the Streamlit App in Colab**

Now that we have **app.py**, we need to run the Streamlit server and make it accessible. Colab doesn’t show Streamlit apps by default (since it’s basically a separate web server). However, we can use a tool called LocalTunnel to get a public URL for our app.

We will ask Colab to generate the code to:
- Install LocalTunnel (a utility to expose localhost ports to the web),
- Launch the Streamlit app in the background, and
- Create a tunnel to port 8501 (Streamlit’s default port) so we can access it.

In [None]:
!npm install -q localtunnel
!streamlit run app.py & npx localtunnel --port 8501
# you need to run the next cell to get the tunnel password


up to date, audited 23 packages in 701ms

3 packages are looking for funding
  run `npm fund` for details

2 high severity vulnerabilities

To address all issues (including breaking changes), run:
  npm audit fix --force

Run `npm audit` for details.

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.48.156.56:8501

your url is: https://grumpy-bags-behave.loca.lt
2025-04-25 11:36:12.341357: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already bee

In [None]:
!curl https://loca.lt/mytunnelpassword


34.48.156.56

This will do a few things:

The first line installs localtunnel via Node’s package manager (npm). The -q flag just quiets the output.

The second line runs the Streamlit app (app.py). The & symbol makes it run in the background, and then npx localtunnel --port 8501 starts the tunnel. After a moment, you should see a URL in the cell output (ending in .loca.lt). This URL is an external address that tunnels into the Colab environment on port 8501.

👉 Click the URL that appears (it will look similar to https://warm-mouse-1234.loca.lt or some other random words). This will open a new page. If the page asks for a tunnel password, enter the IP address printed by the curl cell above (it matches the External URL address in the Streamlit output). You should then see your Streamlit app with the title "Clinical NER Demo" and the example text pre-filled. Hit the "Extract Entities" button, and after a second, the text will display with highlighted entities!

Try editing the text to other scenarios (e.g., change the symptoms or disease, or add a new sentence) and click the button again to see how the model performs.

*If the localtunnel URL doesn’t show or connect on the first try, you might need to rerun the cell. Occasionally the tunnel fails to establish on the first attempt.*

Streamlit makes it easy to build such prototype apps. In a real setting, you or your data science collaborators could deploy similar apps to allow clinicians to test out models on their own examples.

# Conclusion: Real-World Applications and Next Steps

In this tutorial, we demonstrated how even those with minimal coding experience can leverage powerful NLP models for clinical text. We loaded a pretrained Transformers model and built a mini application that identifies medical entities in text – all within a short session. This kind of non-generative AI model (one that extracts or classifies information rather than generating free-form text) can be very useful in healthcare. Here are some real-world clinical challenges where these tools can help:
- Information Extraction (NER in practice): Just as we saw, NER can pull out critical details from unstructured text. For example, it can automatically scan pathology reports or clinical notes to find mentions of diseases, medications, allergies, or symptoms. This could help populate fields in an electronic health record or alert a physician to key findings in a long narrative. NER algorithms have been used to extract patient names, medical conditions, prescription names, and more from clinical text, saving time on manual data entry.
- Clinical Text Classification: Another common task is classification. For instance, categorizing patient notes or messages by urgency or by category (e.g., classifying radiology reports as showing cancer vs. no cancer, or triaging emails into appointment requests, medication queries, etc.). In pediatric oncology, you might classify notes based on disease type or complication versus routine check-up. Machine learning models (like fine-tuned transformers or simpler algorithms with embeddings) can learn to assign labels to texts. For example, entire documents like discharge summaries can be classified into categories (diagnoses, specialties, outcomes) to help organize and triage information. This helps healthcare professionals make decisions faster by organizing data.
- Embeddings and Semantic Search: Beyond direct classification, transformer models can convert clinical text into embeddings (numerical vector representations) that capture the meaning. These embeddings enable semantic similarity comparisons. What’s the use? Imagine you have a complex patient case – you could find similar cases from the literature or past records by searching via embeddings. For example, you could encode a patient’s symptoms and lab findings, and quickly retrieve past cases or relevant research papers that are similar in semantic content (even if they don’t share exact keywords). This is a powerful way to do case-based reasoning or literature search. Researchers have shown that embedding models and vector databases can successfully be used to encode and classify medical text without training new task-specific models. In practice, this could mean quicker access to relevant information for decision support.
- Other Applications: There are many other non-generative NLP applications in healthcare. Prediction models can be built on text (e.g., predicting 30-day readmission risk from a discharge summary – which is essentially a classification/regression task on text). Clustering of notes via unsupervised learning could reveal patterns (maybe grouping patients by similar symptom profiles). Entity linking could be used to map extracted terms to standardized codes (like linking "ALL" to a specific ICD code or ontology term). All these tasks do not generate new text, but rather analyze or organize existing text, providing decision support while keeping the human in the loop.
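To give a feel for how embedding-based search works, here is a toy sketch of ranking past cases by cosine similarity to a query. The three-dimensional vectors are invented for illustration; a real embedding model would produce vectors with hundreds of dimensions, and the case descriptions are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three hypothetical past cases
case_vectors = {
    "fever and bone pain in a child with leukemia": [0.9, 0.1, 0.2],
    "routine well-child visit": [0.1, 0.9, 0.1],
    "febrile neutropenia after chemotherapy": [0.8, 0.2, 0.4],
}
# Made-up embedding of the new case we want to look up
query_vector = [0.85, 0.15, 0.25]

# Rank past cases from most to least similar to the query
ranked = sorted(
    case_vectors,
    key=lambda case: cosine_similarity(query_vector, case_vectors[case]),
    reverse=True,
)
for case in ranked:
    print(f"{cosine_similarity(query_vector, case_vectors[case]):.3f}  {case}")
```

The routine visit ends up last because its vector points in a different direction from the query, even though no keywords were compared at any point.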

Finally, it's worth emphasizing the importance of collaboration. As a clinician, you bring the domain knowledge (you know what problems need solving, and you can interpret whether model output is useful or clinically valid). By partnering with data scientists or informaticians, you can build solutions that address those problems. Tools like the ones we explored enable rapid prototyping – in a single notebook, we went from an idea to a working demo. This is incredibly powerful in a clinical setting where iterating quickly on ideas can lead to impactful tools. In fact, many hospitals and clinics are now forming interdisciplinary teams to develop AI-driven prototypes for tasks like clinical documentation assistance and decision support, allowing clinicians to test and give feedback early in development.

*Next Steps: If this session piqued your interest, you might explore further:*
- Try fine-tuning a model like this on your own dataset (with a data scientist’s help) to better adapt to your hospital’s jargon.
- Explore other Hugging Face Transformers tasks: e.g., question answering (on medical FAQs), or text summarization (summarizing a long report).
- Learn about how to evaluate these models’ performance on clinical data (accuracy, errors, etc.) and ensure they meet healthcare standards for reliability.
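As a starting point for evaluation, one common approach is to compare the model's entity spans against spans marked by a clinician and compute precision, recall, and F1. The sketch below uses exact matching on (start, end, label); the "gold" annotations are hypothetical, written only to illustrate the calculation, and the function name span_scores is ours.

```python
def span_scores(predicted, gold):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical clinician annotations for part of the example note
gold_spans = [(13, 23, "Age"), (108, 113, "Sign_symptom"), (154, 160, "Sign_symptom")]
# Three spans in the style of the model output (offsets and labels as tuples)
predicted_spans = [(13, 23, "Age"), (115, 119, "Biological_structure"), (154, 160, "Sign_symptom")]

precision, recall, f1 = span_scores(predicted_spans, gold_spans)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a prediction that covers only part of an annotated span counts as a miss, so in practice teams often report a more lenient overlap-based score alongside it.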

We hope this tutorial showed that modern NLP tools are within reach even if you’re new to coding. With a bit of practice, you can begin to prototype AI solutions that address everyday challenges in pediatric oncology and beyond. Happy experimenting!

Let us try to modify the code above: not only highlight the text identified by NER, but also color-code it according to its category.

In [None]:

%%writefile app.py
import streamlit as st
from transformers import pipeline

# Load the same NER pipeline inside the app
ner_pipeline = pipeline("ner", model="d4data/biomedical-ner-all", aggregation_strategy="simple")

# Streamlit UI layout
st.title("Clinical NER Demo with Color Coding")
st.markdown("Enter a synthetic clinical note and **extract entities** using a pretrained Transformer model:")

# A text area for input
default_text = ("Patient is a 7-year-old boy with acute lymphoblastic leukemia (ALL) who presents with "
                "a two-week history of fever, bone pain, and fatigue. On exam, noted pallor and bruising. "
                "Plan is to start induction chemotherapy with vincristine, prednisone, and L-asparaginase.")
user_input = st.text_area("Clinical Note", default_text, height=150)

# Button to run NER
if st.button("Extract Entities"):
    # Run the NER pipeline on the input text
    entities = ner_pipeline(user_input)

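    # Color mapping for entity categories. The keys must match the model's
    # entity_group labels exactly (see the labels printed in the notebook output).
    color_map = {
        "Age": "lightblue",
        "Sex": "plum",
        "Disease_disorder": "lightcoral",
        "Sign_symptom": "lightgreen",
        "Medication": "lightgoldenrodyellow",
        "Duration": "lightsalmon",
        "Biological_structure": "lightcyan",
        "Detailed_description": "lightgray",
        # Add more categories and colors as needed
    }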

    highlighted_text = user_input
    for ent in sorted(entities, key=lambda x: x['start'], reverse=True):
        start, end = ent['start'], ent['end']
        entity_type = ent['entity_group']
        color = color_map.get(entity_type, "yellow")  # Default to yellow if category not in map
        highlighted_text = (highlighted_text[:start] +
                             f"<mark style='background-color:{color}'>{highlighted_text[start:end]}</mark>" +
                             highlighted_text[end:])

    st.write("**Extracted Entities Highlighted:**")
    st.markdown(highlighted_text, unsafe_allow_html=True)

    # Display color coding legend
    st.write("**Color Coding Legend:**")
    for entity_type, color in color_map.items():
        st.markdown(f"<span style='background-color:{color}; padding: 0.3em 0.5em; border-radius: 0.2em;'>{entity_type}</span>", unsafe_allow_html=True)


Overwriting app.py
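A hand-written color map only covers the labels listed in it; any other label falls back to yellow. As an alternative, the map can be built from whatever labels the model actually returns. This is a sketch under our own assumptions: the palette and the function name build_color_map are illustrative choices, not part of Streamlit or Transformers.

```python
from itertools import cycle

# Any list of CSS color names works here; these are arbitrary light shades
PALETTE = ["lightblue", "lightcoral", "lightgreen", "lightsalmon",
           "lightgoldenrodyellow", "lightcyan"]


def build_color_map(entities, palette=PALETTE):
    """Assign one color per distinct entity_group, in order of first appearance."""
    color_map = {}
    colors = cycle(palette)  # start over if there are more labels than colors
    for ent in entities:
        label = ent["entity_group"]
        if label not in color_map:
            color_map[label] = next(colors)
    return color_map


example_entities = [
    {"entity_group": "Age"},
    {"entity_group": "Medication"},
    {"entity_group": "Age"},
]
print(build_color_map(example_entities))
```

In the app, calling build_color_map(entities) right after running the pipeline would also keep the legend limited to categories that appear in the current note.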


In [None]:
!curl https://loca.lt/mytunnelpassword

In [None]:
!streamlit run app.py & npx localtunnel --port 8501