<a href="https://colab.research.google.com/github/MoLue/wft_digital_medicine/blob/main/dm_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook Overview: NLP x Transformers
This notebook explores the progression from traditional Natural Language Processing (NLP) pipelines to modern Transformer-based models. You’ll start by learning the fundamental steps of an NLP pipeline, such as tokenization, Part-of-Speech tagging, and Named Entity Recognition (NER), using SpaCy. Then, you’ll transition to Transformer models, like BERT and GPT, to tackle tasks such as text generation and advanced NER. Both approaches will be applied to real-world examples, including analyzing medical documents. Finally, we’ll compare the strengths and limitations of SpaCy and Transformers, highlighting when to use each. By the end, you’ll have practical skills in both traditional NLP and modern Transformers, understanding how they complement each other in different contexts.

# NLP Pipeline

## **Introduction to Natural Language Processing (NLP)**

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language. NLP bridges the gap between human communication and machine understanding, making it possible to extract insights, automate processes, and interact with text and speech data effectively.

### **Applications of NLP**
NLP has a wide range of real-world applications, including:
- **Sentiment Analysis**: Determining the sentiment (positive, negative, neutral) in customer reviews or social media posts.
- **Named Entity Recognition (NER)**: Identifying entities like names, dates, or locations in unstructured text.
- **Machine Translation**: Translating text from one language to another, e.g., Google Translate.
- **Text Summarization**: Generating concise summaries of lengthy documents.
- **Chatbots and Virtual Assistants**: Powering conversational systems like Siri or Alexa.
- **Medical Text Analysis**: Extracting clinical information, such as diagnoses or prescribed medications, from medical notes.

---

## **What is an NLP Pipeline?**

An NLP pipeline is a step-by-step process to transform raw text into meaningful data that machines can analyze. Each step in the pipeline performs a specific task to preprocess, analyze, or extract information from text. The pipeline is modular, meaning you can adjust or add steps depending on the complexity of the task.

### **Steps in a Typical NLP Pipeline**
Here’s an overview of the common steps involved in an NLP pipeline:

1. **Tokenization**
   - The process of breaking down text into smaller units called tokens, such as words or subwords.
   

2. **Sentence Segmentation**
   - Dividing text into individual sentences to enable sentence-level analysis.

3. **Part-of-Speech (PoS) Tagging**
   - Assigning grammatical roles (e.g., noun, verb, adjective) to each token.


4. **Stop Word Removal**
   - Filtering out commonly used words (like "the", "and", "is") that carry little meaning for the analysis.

5. **Lemmatization**
   - Reducing words to their base or dictionary form (lemma).  


6. **Dependency Parsing**
   - Analyzing the grammatical structure of sentences to identify relationships between words.  

7. **Named Entity Recognition (NER)**
   - Identifying and categorizing entities such as names, dates, organizations, or medications.


8. **Coreference Resolution**
   - Resolving references within text to identify when different expressions refer to the same entity.


---

### **Why Do We Need an NLP Pipeline?**

Text data in its raw form is unstructured and challenging for machines to process. An NLP pipeline:
- **Transforms raw text into structured data** that can be analyzed or used in machine learning models.
- **Reduces noise** by filtering irrelevant information, such as stop words.
- **Extracts meaningful patterns** from text, such as grammatical relationships, key entities, or sentiments.
- **Provides flexibility** to customize the pipeline based on the specific task (e.g., sentiment analysis, summarization).
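The modular idea can be sketched in a few lines of plain Python. This is a toy illustration, not how spaCy is implemented: the step functions, the `TOY_STOP_WORDS` set, and `run_pipeline` are hypothetical names introduced here, and the tokenization is deliberately crude.

```python
import re
from typing import Callable, List

# Each step takes a list of tokens and returns a new list of tokens.
Step = Callable[[List[str]], List[str]]

# Toy stop-word list for illustration only (not spaCy's list).
TOY_STOP_WORDS = {"the", "and", "is", "was", "a", "for"}

def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]

def remove_stop_words(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in TOY_STOP_WORDS]

def run_pipeline(text: str, steps: List[Step]) -> List[str]:
    # Crude tokenization: runs of letters or digits.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Apply the steps in order; adding or dropping a step changes the pipeline.
    for step in steps:
        tokens = step(tokens)
    return tokens

result = run_pipeline(
    "The patient was prescribed Aspirin for pain.",
    [lowercase, remove_stop_words],
)
print(result)
```

Because the steps are passed in as a list, a task that needs stop words can simply leave `remove_stop_words` out.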

## Getting Started

In [None]:
! pip install spacy
! pip install coreferee
! pip install nltk


In [None]:

! python -m spacy download en_core_web_sm


In [None]:
! python -m coreferee install en

Spacy

In [None]:
import spacy
from spacy import displacy

# PorterStemmer is used later to compare stemming with lemmatization
from nltk.stem.porter import PorterStemmer

# Load the small English pipeline downloaded above (en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

example = "In the U.K., Joe Biden and Angela Merkel talked about the current economic situation. They both think the inflation rate will not drop in the near future!"

example_doc = nlp(example)
example_doc

## Tokenization
What is Tokenization?
Tokenization is the first step in any NLP pipeline. It involves breaking down a text into smaller units, called tokens. Tokens can be words, punctuation marks, or numbers.

Why is Tokenization important?
Tokenization allows us to preprocess and analyze text in manageable pieces. It’s foundational for further processing like Part-of-Speech (PoS) tagging, Named Entity Recognition (NER), and dependency parsing. Modern NLP tools like SpaCy handle tokenization efficiently, even for edge cases such as abbreviations or contractions.
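To see why edge cases matter, here is a small sketch of two naive tokenizers built only with the Python standard library. It is an illustration of the problem, not of spaCy's tokenizer; the shortened example sentence is an assumption made for brevity.

```python
import re

text = "In the U.K., Joe Biden and Angela Merkel talked."

# Naive approach 1: split on whitespace; punctuation stays glued to words.
whitespace_tokens = text.split()

# Naive approach 2: words and single punctuation marks become separate tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

The first approach keeps "U.K.," as one token including the comma, and the second breaks the abbreviation into "U", ".", "K", ".". Neither matches what a reader would consider the right tokens, which is why dedicated tokenizers carry rules for abbreviations.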

In [None]:
for token in example_doc:
    print(token.text)

## Sentence Segmentation
What is Sentence Segmentation?
Sentence segmentation is the process of splitting a document into sentences.


Why is Sentence Segmentation important?
Segmenting a text into sentences makes it easier to analyze the structure and meaning of a document. In medical text, this helps separate observations, instructions, and findings into discrete, analyzable units. SpaCy performs sentence segmentation automatically as part of its pipeline.
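A rule as simple as "split after a period" is not enough, as the following standard-library sketch suggests. The example text is an assumption chosen to contain a title abbreviation; the rule is deliberately naive and is not what spaCy does.

```python
import re

text = "Dr. Smith diagnosed John Doe with reflux. He prescribed a daily dose."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The rule splits after "Dr." and returns three pieces instead of two sentences, which is the kind of error a trained or rule-aware segmenter is meant to avoid.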

In [None]:
for sent in example_doc.sents:
    print(sent.text)

## Part-of-Speech Tagging
Part-of-Speech tagging assigns a grammatical role to each token in a sentence, such as noun, verb, or adjective.

Why is PoS Tagging important?
PoS tagging helps understand the grammatical structure of a sentence and can be used to extract patterns or relationships, such as identifying actions (verbs) or key entities (nouns). It’s also a precursor to tasks like dependency parsing.
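Once tokens carry tags, extracting actions or key entities is a simple filter. The sketch below uses hand-written (token, tag) pairs as a stand-in for tagger output; the pairs and the helper `tokens_with_tag` are assumptions for illustration.

```python
# Hand-written (token, tag) pairs for illustration; these are NOT model output.
tagged = [
    ("The", "DET"), ("doctor", "NOUN"), ("prescribed", "VERB"),
    ("a", "DET"), ("daily", "ADJ"), ("dose", "NOUN"), (".", "PUNCT"),
]

def tokens_with_tag(pairs, wanted):
    """Return the tokens whose coarse PoS tag equals `wanted`."""
    return [tok for tok, tag in pairs if tag == wanted]

nouns = tokens_with_tag(tagged, "NOUN")
verbs = tokens_with_tag(tagged, "VERB")
print("Nouns:", nouns)
print("Verbs:", verbs)
```

With spaCy, equivalent pairs could be built from a processed document with `[(t.text, t.pos_) for t in example_doc]`.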

In [None]:
for token in example_doc:
    print (f"{token.text : <15}{token.pos_}")

## Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form, called a lemma. For example:
- "running" → "run"
- "patients" → "patient"

Why is Lemmatization important?
Lemmatization ensures consistency in text analysis by normalizing words to their base forms. This is particularly useful when analyzing medical notes, where inflected forms of the same word (e.g., "diagnosed", "diagnosing") should be treated as the same concept. Unlike stemming, which strips suffixes by rule and can produce non-words, lemmatization returns valid dictionary forms.
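The difference between stemming and lemmatization can be illustrated with a toy comparison. Both functions below are simplifications written for this sketch: `crude_stem` is not the Porter algorithm, and `TOY_LEMMAS` is a four-entry stand-in for a real lexicon.

```python
# Toy lemma dictionary for illustration; real lemmatizers use far larger resources.
TOY_LEMMAS = {
    "studies": "study",
    "patients": "patient",
    "running": "run",
    "diagnosed": "diagnose",
}

def crude_stem(word: str) -> str:
    """Strip the first matching suffix; a deliberately simplistic stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up in the toy dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

# Columns: word, crude stem, toy lemma
for word in ["studies", "patients", "running", "diagnosed"]:
    print(f"{word:<12}{crude_stem(word):<12}{toy_lemmatize(word)}")
```

The stems "studi", "runn", and "diagnos" are not dictionary words, while the lemma lookup returns "study", "run", and "diagnose".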

In [None]:
stemmer = PorterStemmer()
# Columns: token text, spaCy lemma, Porter stem
for token in example_doc:
    print(f"{token.text : <15}{token.lemma_ : <15}{stemmer.stem(token.text)}")

## Stop Words
Stop words are common words such as "the", "is", "and", or "in" that usually carry little meaning on their own. Removing stop words can help focus on the more meaningful parts of a text.

Why is Stop Word Removal important?
Filtering out stop words reduces noise in text data and can improve the performance of downstream tasks. However, in specialized domains like medicine, some stop words (e.g., "with", "of") may carry significant meaning and should be retained. SpaCy allows customization of the stop word list to suit your needs.
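A minimal sketch of such a customization, using plain Python sets rather than spaCy's own list. The word lists and the example phrase are assumptions chosen for illustration.

```python
# Toy general-purpose stop-word list (illustrative; not spaCy's list).
GENERAL_STOP_WORDS = {"the", "is", "and", "in", "with", "of", "was", "for"}

# Words we assume matter in clinical text and therefore keep.
DOMAIN_KEEP = {"with", "of"}

medical_stop_words = GENERAL_STOP_WORDS - DOMAIN_KEEP

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in `stop_words`."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The patient presented with shortness of breath".split()
general_result = remove_stop_words(tokens, GENERAL_STOP_WORDS)
medical_result = remove_stop_words(tokens, medical_stop_words)
print(general_result)
print(medical_result)
```

With the general list, "shortness of breath" is reduced to "shortness breath"; the adjusted list keeps the phrase intact.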

In [None]:
SW = list(nlp.Defaults.stop_words)
print('First 20 stop-words: ', SW[:20])
print('Number of stop-words: ', len(SW))

## Dependency Parsing
Dependency parsing analyzes the grammatical structure of a sentence and identifies relationships between words.

Why is Dependency Parsing important?
Dependency parsing helps in understanding how different parts of a sentence are connected. This is critical for extracting meaningful relationships, such as determining who performed an action or what an action was performed on.
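To make "who performed an action" concrete, the sketch below works on a hand-written parse stored as (token, dependency label, head) triples. The parse is an assumption written for this example, not parser output, and `find_dependents` is a hypothetical helper.

```python
# Hand-written parse of "The doctor prescribed omeprazole" for illustration.
# Each entry is (token, dependency label, head token). Not model output.
parse = [
    ("The", "det", "doctor"),
    ("doctor", "nsubj", "prescribed"),
    ("prescribed", "ROOT", "prescribed"),
    ("omeprazole", "dobj", "prescribed"),
]

def find_dependents(triples, head, label):
    """Return tokens attached to `head` with the given dependency label."""
    return [tok for tok, dep, h in triples if h == head and dep == label]

subject = find_dependents(parse, "prescribed", "nsubj")
obj = find_dependents(parse, "prescribed", "dobj")
print("Who prescribed?", subject)
print("What was prescribed?", obj)
```

With spaCy, similar triples could be read from each token as `(token.text, token.dep_, token.head.text)`.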

In [None]:
sents = list(example_doc.sents)
sent = sents[0]

for token in sent:
    print (f"{token.text : <15}{token.dep_ : <15}{spacy.explain(token.dep_)}")

displacy.render(sent, style="dep", jupyter=True)

## Named Entity Recognition
Named Entity Recognition (NER) identifies and categorizes specific entities in a text, such as names of medications, diseases, or dosages.


Why is NER important?
In medical contexts, NER helps extract structured information from unstructured text, such as identifying medications, symptoms, or procedures in clinical notes. SpaCy and its extensions (like SciSpacy) are particularly effective for this task.
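As a contrast to model-based NER, here is a minimal rule-based sketch using a tiny word list and a regular expression. The `MEDICATIONS` list, the labels, and the example note are assumptions for illustration; this is not how spaCy or SciSpacy recognize entities.

```python
import re

# Tiny illustrative gazetteer; a real system would use a curated drug vocabulary.
MEDICATIONS = ["pantoprazole", "aspirin"]

MED_PATTERN = re.compile(r"\b(" + "|".join(MEDICATIONS) + r")\b", re.IGNORECASE)
DOSE_PATTERN = re.compile(r"\b\d+\s?mg\b", re.IGNORECASE)

def find_entities(text):
    """Return (start, end, label, surface text) tuples sorted by position."""
    entities = [
        (m.start(), m.end(), "MEDICATION", m.group())
        for m in MED_PATTERN.finditer(text)
    ]
    entities += [
        (m.start(), m.end(), "DOSAGE", m.group())
        for m in DOSE_PATTERN.finditer(text)
    ]
    return sorted(entities)

note = "Pantoprazole 40mg daily was recommended. Aspirin was given for pain relief."
found = find_entities(note)
for start, end, label, surface in found:
    print(f"{surface:<15}{label:<12}{start}-{end}")
```

A word list like this only finds exact spellings, so a misspelled drug name in a clinical note would be missed; that is one reason statistical models are attractive for this task.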

In [None]:
for ent in example_doc.ents:
    print (f"{ent.text : <20}{ent.label_ : <15}{spacy.explain(ent.label_)}")

displacy.render(example_doc, style="ent", jupyter=True)

## Coreference Resolution
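Coreference resolution identifies when different words or phrases in a text refer to the same entity. For example, in the example text used in this notebook, "They" in the second sentence refers to Joe Biden and Angela Merkel from the first.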

Why is Coreference Resolution important?
In medical documents, resolving coreferences ensures that all references to a patient, medication, or symptom are correctly attributed.
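What a resolved coreference looks like can be sketched without any model. The annotation below is hand-written and hypothetical (a coreference model would predict it), and `resolve` is a helper introduced only for this illustration.

```python
tokens = ["Joe", "Biden", "and", "Angela", "Merkel", "talked", ".",
          "They", "both", "agree", "."]

# Hypothetical annotation: pronoun index -> list of (start, end) mention spans.
# Hand-written for illustration; a coreference model would predict this.
coref = {7: [(0, 2), (3, 5)]}

def resolve(tokens, coref):
    """Replace each annotated pronoun with the text of the mentions it refers to."""
    resolved = []
    for i, tok in enumerate(tokens):
        if i in coref:
            mentions = [" ".join(tokens[s:e]) for s, e in coref[i]]
            resolved.append(" and ".join(mentions))
        else:
            resolved.append(tok)
    return resolved

resolved_tokens = resolve(tokens, coref)
print(" ".join(resolved_tokens))
```

After substitution, the second sentence names both people explicitly, which is what makes downstream attribution possible.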

In [None]:
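import coreferee  # importing the package makes the 'coreferee' component available

# Add coreference resolution to the existing spaCy pipeline and re-process the text
nlp.add_pipe('coreferee')
example_doc = nlp(example)

# Print all coreference chains, then resolve token 16 ("They") to its referents
example_doc._.coref_chains.print()
example_doc._.coref_chains.resolve(example_doc[16])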

# **Exploring Self-Hosted Large Language Models (LLMs)**

Now that we’ve explored traditional NLP modules, it’s time to dive into the capabilities of self-hosted Large Language Models (LLMs). These powerful models excel at handling complex language tasks, offering advanced solutions for generating text, recognizing entities, and answering questions based on a given context.

In this section, we will use LLMs to perform tasks such as text generation, Named Entity Recognition (NER), and Question Answering (QA). Hosting these models locally gives us greater control over their functionality and allows customization for specific domains, such as medical text analysis.

## Getting Started
In this section, we explore two approaches to working with Transformer-based models:

- **Direct Model Loading**: Use AutoTokenizer and AutoModelForCausalLM to load models manually. This approach provides flexibility and control over how models are used and configured.
- **Pipeline Helper**: Use the pipeline API as a high-level interface for common NLP tasks like text generation, NER, or QA. This method simplifies implementation and is ideal for quick experimentation.

These approaches allow us to tailor model usage to different levels of complexity and customization.

The pipeline abstracts much of the complexity involved in tokenization, model interaction, and post-processing, allowing us to focus on solving real-world problems with minimal setup.

An important difference from classical NLP pipelines is end-to-end processing: a Transformer pipeline takes raw text as input and outputs task-specific results (e.g., text generation, question answering, sentiment analysis). Tokenization still happens internally, but you do not have to run explicit intermediate steps like tokenization or feature extraction yourself.

In [13]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForTokenClassification
# Use a pipeline as a high-level helper
from transformers import pipeline

## Text Generation with Transformers
This example demonstrates how to generate text using a pre-trained language model with the Hugging Face pipeline API. By initializing a text-generation pipeline, we can easily create coherent continuations based on a given prompt.

- **Prompt**: A starting text is defined to guide the model in generating relevant content.
- **Parameters**: max_length caps the total length in tokens (prompt plus generated text), and num_return_sequences specifies how many variations to generate.
- **Output**: The generated text continues the input prompt, showcasing the model's ability to create context-aware responses.

**Question**: What can you observe if you run it multiple times?

In [None]:
# Initialize a text generation pipeline
text_generator = pipeline("text-generation")

# Define a starting prompt (triple quotes allow multi-line strings)
prompt = """
The patient was diagnosed with gastroesophageal reflux disease (GERD).
The doctor prescribed a treatment plan including
"""

# Generate text using a pre-trained model with a specified token limit
output = text_generator(prompt, max_length=150, num_return_sequences=1)

# Print the generated text
print("Generated Text:")
print(output[0]['generated_text'])

## Named Entity Recognition
This example demonstrates how to perform Named Entity Recognition (NER) using a pre-trained model and tokenizer. The pipeline API simplifies the process by providing a high-level interface for token classification tasks.

Model and Tokenizer: A pre-trained model fine-tuned for NER is loaded, capable of identifying entities such as persons, locations, and organizations in text.

Pipeline Setup: The pipeline is configured for NER, combining the model and tokenizer for seamless inference.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

# Create a pipeline for token classification
# (named ner_pipeline so it does not overwrite the spaCy `nlp` object from earlier cells)
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

# You can also try the text we used with the spaCy pipeline
text = "Dr. Smith diagnosed John Doe with reflux in Heidelberg."
# text = "In the U.K., Joe Biden and Angela Merkel talked about the current economic situation. They both think the inflation rate will not drop in the near future!"

# Perform NER and print one dictionary per recognized token
entities = ner_pipeline(text)
for entity in entities:
    print(entity)
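The pipeline above reports one dictionary per token, so multi-token names and word pieces appear as separate entries. The sketch below shows one way to merge them into entity spans. The `predictions` list is hypothetical data shaped like the printed dictionaries (only the `index`, `word`, and `entity` keys are used), and `merge_entities` is a helper written for this illustration.

```python
# Hypothetical token-level predictions shaped like the dicts the pipeline prints;
# the values are made up for illustration and are not real model output.
predictions = [
    {"index": 2, "word": "Smith", "entity": "I-PER"},
    {"index": 4, "word": "John", "entity": "I-PER"},
    {"index": 5, "word": "Doe", "entity": "I-PER"},
    {"index": 9, "word": "Heidel", "entity": "I-LOC"},
    {"index": 10, "word": "##berg", "entity": "I-LOC"},
]

def merge_entities(preds):
    """Merge tokens that are adjacent and share an entity type into one span."""
    merged = []
    for p in preds:
        label = p["entity"].split("-")[-1]  # "I-PER" -> "PER"
        piece = p["word"]
        last = merged[-1] if merged else None
        if last and last["label"] == label and last["end_index"] + 1 == p["index"]:
            # Word pieces starting with "##" attach without a space.
            last["text"] += piece[2:] if piece.startswith("##") else " " + piece
            last["end_index"] = p["index"]
        else:
            merged.append({"text": piece, "label": label, "end_index": p["index"]})
    return [(m["text"], m["label"]) for m in merged]

spans = merge_entities(predictions)
print(spans)
```

The Hugging Face pipeline can also do this grouping itself through its `aggregation_strategy` argument (for example `aggregation_strategy="simple"`), which is usually preferable to hand-written merging.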

## Question Answering with Transformers

This code showcases how to use a pre-trained Transformer model for Question Answering (QA). By providing a context (e.g., physician notes) and a natural language question, the model extracts relevant information directly from the text.

**Experimentation Encouraged**

Try Different Models: The pipeline allows you to swap models for QA tasks easily, offering flexibility to explore which model performs best for your data.

Refine Your Prompts: Experiment with the phrasing of your questions to see how the model responds. Different prompts can yield varying levels of specificity or relevance in the answers.

In [None]:
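# Initialize the Question Answering pipeline
# -> Try out different models!
# - deepset/bert-base-cased-squad2
# - allenai/longformer-base-4096
# - distilbert/distilbert-base-cased-distilled-squad
# Note: checkpoints that were not fine-tuned for QA (such as the base BioBERT model below)
# load with a randomly initialized QA layer, so compare their answers with a SQuAD-tuned model.
qa_pipeline = pipeline("question-answering", model="dmis-lab/biobert-base-cased-v1.1")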


# Combine physician letters
text1 = """
The patient was prescribed Omprazle 20mg daily for acid reflux and heartburn. Aspirin was also given for pain relief.
"""
text2 = """
Pantoprazole was recommended for heartburn.
"""
text3 = """
Omaprazole 40mg daily.
"""
context = text1 + " " + text2 + " " + text3

# -> Define your question and improve the prompt. Experiment a bit!
question = "Which medications for reflux are included?"

# Get the answer from the model
result = qa_pipeline(question=question, context=context)

# Interpret and display the result
answer = result['answer']
print(f"Answer: {answer}")
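Extractive QA returns a span copied from the context rather than newly written text. The sketch below illustrates this with a hypothetical result dictionary that uses the keys the question-answering pipeline returns (`score`, `start`, `end`, `answer`); the values and the 0.5 threshold are made up for illustration.

```python
context = "Pantoprazole was recommended for heartburn."

# Hypothetical result with the keys the question-answering pipeline returns;
# the values are invented for illustration, not real model output.
result = {"score": 0.87, "start": 0, "end": 12, "answer": "Pantoprazole"}

# The answer is the slice of the context between the character offsets.
span = context[result["start"]:result["end"]]
print(span)

# An assumed confidence threshold; a suitable value depends on model and data.
MIN_SCORE = 0.5
is_confident = result["score"] >= MIN_SCORE
print("Confident:", is_confident)
```

Because a single call returns one span by default, a question that expects several medications may get only one of them; asking the question against each letter separately is one way to work around this.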

# Discussion

- Traditional NLP pipelines often require manual customization for specific domains (e.g., medicine, law). What challenges could this pose when working with highly specialized texts?
- In what situations might rule-based or manually fine-tuned systems still be preferable to more automated approaches like Transformers?
- Do you think there are tasks where a modular NLP pipeline might still outperform an integrated Transformer model?