<a href="https://colab.research.google.com/github/DrDavidL/learning-dhds/blob/main/Part_6_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Part 6: Large Language Models in Medicine**

**Jonathan Theros, BS**  
[Jonathan.theros@northwestern.edu](mailto:Jonathan.theros@northwestern.edu)  

**Alan Soetikno, BS**  
[alan.soetikno@northwestern.edu](mailto:alan.soetikno@northwestern.edu)  

**Jay Manadan**

**David Liebovitz, MD**  
[davidl@northwestern.edu](mailto:davidl@northwestern.edu)  

*Northwestern University Feinberg School of Medicine, Chicago, IL*

[GitHub Study Guide](https://github.com/DrDavidL/learning-dhds)

___


Medicine is more accustomed to the machine learning applications outlined in the previous Parts, because they are typically built and trained for very specific tasks and are easily compared using test characteristics such as receiver operating characteristic (ROC) curves and confusion matrices. LLMs, by contrast, are extremely versatile, with a wide variety of out-of-the-box use cases. Let’s explore the considerations around large language models in more detail.

---





# 1. How GPT Models Work

Let's explore how GPT models work. Their individual parts are somewhat complicated, yet the overall process is surprisingly similar to how we think in medicine. Here's a step-by-step breakdown:

1. **Input Encoding**: Just like how clinicians break down a patient’s story into key data points, GPT begins by slicing input text into tokens—think words or meaningful fragments. These are converted into numerical vectors (embeddings) that encode both meaning and context—similar to how we might tag diagnoses in EMRs, but much richer.

2. **Attention Mechanism**: Imagine assembling a differential diagnosis: you focus more on chest pain or dyspnea when evaluating for cardiac issues. GPT does the same: its attention mechanism assigns weights to tokens, highlighting the most relevant parts of the input.

3. **Masked Self-Attention**: When generating text, GPT behaves like a clinician reasoning in real-time—you only use information available so far. This mechanism ensures each predicted token only considers what’s been previously generated, reinforcing a logical, causal flow.

4. **Feed-Forward Networks**: Just as patient data flows through multiple layers of clinical evaluation (history, exams, consults), GPT passes embedding vectors through successive neural layers. Each layer refines the representation, capturing deeper nuances before producing the final output.

5. **Decoding Outputs**: GPT builds its text one token at a time, each choice informed by the cumulative context—akin to writing an assessment and plan, step by logical step.
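To make steps 2 and 3 concrete, here is a minimal sketch of masked (causal) attention weights in plain Python. It is an illustration under simplifying assumptions, not GPT's actual implementation: the two-dimensional vectors are made up, and the same vectors serve as both queries and keys, whereas a real model learns separate projections for each.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(queries, keys):
    """Scaled dot-product attention weights with a causal mask.

    Row i holds the weights token i places on tokens 0..i.
    Later tokens get weight 0 because they are masked out.
    """
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        # Score only the tokens seen so far (positions 0..i).
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights.append(softmax(scores) + [0.0] * (len(keys) - i - 1))
    return weights

# Hypothetical vectors for three tokens: "chest", "pain", "radiating"
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attn = causal_attention_weights(toy_vectors, toy_vectors)
for row in attn:
    print([round(w, 2) for w in row])
```

Each printed row sums to 1, and the zeros to the right of the diagonal are the mask at work: a token never attends to text that comes after it.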

# 2. Deeper GPT Concepts

Now, let's look into a few more relevant concepts:

1. **Pre-training and Fine-tuning**
   - Pre-training is like our basic sciences years - GPT learns general patterns from massive amounts of text
   - Fine-tuning is like our specialty training - refining the model for specialty tasks—like medical summaries or bedside documentation.
   - This explains why GPTs can handle both everyday conversation and complex medical terminology

2. **Position Embeddings**
   - We rely on the sequence of events—did chest pain come before eating? GPT does too: each token gets a position vector to preserve order, making contextual sense possible.
   - This is crucial for maintaining logical flow in responses

3. **Temperature and Sampling Controls**  
    Analogous to clinical vs creative thinking modes:
   - Low temperature: protocol-driven, predictable outputs.
   - Higher temperature: creative, hypothesis-generating responses—like brainstorming rare causes.
   - We can adjust this based on whether we need safe, predictable outputs or more innovative thinking

4. **Context Window Limitations**
   - GPT has a “working memory” limit—like a clinician referencing only recent notes. The context window defines how much prior text it can attend to before older data is forgotten.
   - Understanding this helps us use GPT more effectively, just like knowing when to look back at old charts
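The context window idea can be sketched in a few lines. This is a simplified illustration, assuming whitespace-separated words stand in for tokens and that the oldest text is simply dropped; real systems use subword tokens and may handle overflow differently. The function name `fit_to_context_window` is hypothetical.

```python
def fit_to_context_window(words, max_tokens):
    """Keep only the most recent items that fit in the window."""
    if max_tokens <= 0:
        return []
    return words[-max_tokens:]

note = (
    "Patient admitted Monday with chest pain troponin negative "
    "discharged Tuesday returns today with dyspnea"
)
note_words = note.split()  # crude whitespace "tokens", for illustration only
recent_words = fit_to_context_window(note_words, max_tokens=5)
print(recent_words)
```

With a window of five, the earlier admission details fall out of view, which is why long charts sometimes need to be summarized before being passed to a model.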

Ready to see these concepts in action? The following exercises will help bring these ideas to life!


## **2. Encoding: How GPT Understands Our Medical Text**

### From Words to Numbers: Tokenization and Embedding

Before GPT can process a medical note, it turns words into numbers, making them machine-readable. This is analogous to how we structure data in EMRs.

1. Tokenization: The input text is split into smaller “tokens” (words or subwords). For illustration:
   - "A 45-year-old male with chest pain." → ["A", " 45", "-year", "-old", " male", " with", " chest", " pain", "."]
2. ID mapping: Each token maps to a unique integer ID in GPT’s vocabulary (much like how we use ICD-10 codes!).
3. Embedding: Finally, each ID is converted into a rich numerical representation (an embedding) that captures meaning and context.

Let's see this in action with a typical patient presentation! (Be patient - this will take a minute to download and run the GPT model! Here, you are actually running this full GPT model on your Colab computer!)


In [None]:
import numpy as np
from transformers import GPT2Tokenizer, GPT2Model

# Load our tools
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Let's use a familiar clinical scenario
text = "A 45-year-old male with high blood pressure, diabetes, and heart failure presents with chest pain radiating to the left arm, shortness of breath, nausea, and sweating, starting 2 hours ago."

# Convert to tokens
tokens = tokenizer(text, return_tensors="pt")

print("\n=== Let's see how GPT breaks down our clinical note ===\n")
print("1. First, it converts our text into token IDs:")
print(tokens["input_ids"])

print("\n2. Here's what those numbers actually represent (our tokens):")
decoded_tokens = tokenizer.convert_ids_to_tokens(tokens["input_ids"].squeeze().tolist())
print(decoded_tokens)

# Create embeddings
embeddings = model(**tokens).last_hidden_state
print("\n3. Finally, it creates rich numerical representations (embeddings):")
print(f"Shape of embeddings: {embeddings.shape}")
print("(Each token gets a 768-dimensional vector to capture its full meaning!)")

# Verify we can reconstruct our original text
print("\n4. We can reconstruct our original note from the tokens:")
decoded_text = tokenizer.decode(tokens["input_ids"].squeeze())
print(decoded_text)

# Look at embedding variance to show different words have different patterns
embedding_variance = np.var(embeddings.detach().numpy(), axis=-1)
print("\n5. Different words have different patterns in their representations:")
print("Variance in embeddings:", embedding_variance)

### What's happening behind the scenes above?
- `tokenizer_config.json`: The settings for how text is split
- `vocab.json`: The dictionary of words/subwords GPT knows
- `merges.txt`: Rules for combining subwords (like combining "dia" and "betes")
- `model.safetensors`: The model's learned patterns
- `config.json`: The model's architecture blueprint
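To give a feel for what merge rules do, here is a toy sketch of byte-pair-style merging. The rules in `toy_merges` are hypothetical and are not GPT-2's actual `merges.txt` entries. The loop also simplifies the real algorithm by applying each rule once, in order.

```python
def apply_merges(symbols, merges):
    """Apply merge rules in priority order, joining adjacent matching pairs."""
    symbols = list(symbols)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            is_pair = (
                i + 1 < len(symbols)
                and symbols[i] == left
                and symbols[i + 1] == right
            )
            if is_pair:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules, listed from highest to lowest priority
toy_merges = [
    ("d", "i"), ("di", "a"), ("b", "e"),
    ("e", "s"), ("t", "es"), ("be", "tes"),
]
pieces = apply_merges("diabetes", toy_merges)
print(pieces)
```

Starting from single characters, the rules build up the subwords "dia" and "betes". One more rule joining those two pieces would turn the word into a single token.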

Want to experiment with different clinical scenarios? Try modifying the text about the patient in the cell above!

## **3. Visualizing How GPT "Thinks" About Medical Terms**
Just as clinicians group related symptoms logically, GPT organizes medical terms based on similarity. By reducing embeddings to two dimensions, we can visualize these relationships.

In [None]:
# Get our tokens and their numerical representations
decoded_tokens = tokenizer.convert_ids_to_tokens(tokens["input_ids"].squeeze().tolist())
embeddings_np = embeddings.squeeze().detach().numpy()

# Clean up our tokens (remove technical markers and keep meaningful medical terms)
cleaned_tokens = []
cleaned_embeddings = []

for token, embedding in zip(decoded_tokens, embeddings_np):
    cleaned_token = token.lstrip("Ġ")  # Remove technical prefixes
    if cleaned_token.isalnum():  # Keep only clear, readable terms
        cleaned_tokens.append(cleaned_token)
        cleaned_embeddings.append(embedding)

# Convert to array for processing
cleaned_embeddings = np.array(cleaned_embeddings)

# Convert high-dimensional embeddings to 2D for visualization
# (Like converting a complex patient case into a clear summary)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(cleaned_embeddings)

# Create our visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1],
           c='lightblue', s=100, alpha=0.6)  # Make points more visible

# Add labels with a white background for better readability
for i, token in enumerate(cleaned_tokens):
    plt.annotate(token,
                (reduced_embeddings[i, 0], reduced_embeddings[i, 1]),
                bbox=dict(facecolor='white', edgecolor='none', alpha=0.7),
                fontsize=11)

plt.title("How GPT Organizes Medical Terms in Our Case", fontsize=14, pad=20)
plt.xlabel("First Principal Component", fontsize=12)
plt.ylabel("Second Principal Component", fontsize=12)

# Add a note explaining what we're looking at
plt.figtext(0.02, -0.05,
            "Note: Terms that appear closer together are more related in GPT's 'understanding'\n" +
            "Just as we group related symptoms, GPT learns to cluster related medical concepts.",
            fontsize=10, ha='left')

plt.tight_layout()
plt.show()

# The message below is a fixed example of groupings seen in one run of this plot.
# It is not computed from your data, so compare it against your own figure.
print("\nExample patterns to look for in how GPT groups medical terms: sweating, chest, and pain, and also heart and nausea!")


### What Are We Looking At Above?
- Each point represents a medical term from our case
- Terms that appear closer together are more related in GPT's "understanding"
- It’s a simplified 2D snapshot of deeper, higher-dimensional meanings

Try analyzing the visualization:
1. Do you see any symptom clusters that make clinical sense?
2. Are there any surprising relationships in how GPT groups terms?
3. How does this compare to how we mentally organize symptoms during differential diagnosis?

Want to try different patient presentations to see how GPT organizes other medical concepts? Go back 2 cells earlier, change the text sentence, and re-run both cells! You can also try non-medical content.
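A 2D plot can distort distances, so it can help to measure closeness directly. The sketch below ranks term pairs by cosine similarity, a common measure of how closely two vectors point in the same direction. The three-dimensional vectors are made up for illustration, and the helper names are hypothetical.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_pairs(vectors_by_term):
    """Return (similarity, term1, term2) tuples, most similar first."""
    pairs = [
        (cosine_similarity(v1, v2), t1, t2)
        for (t1, v1), (t2, v2) in combinations(vectors_by_term.items(), 2)
    ]
    return sorted(pairs, reverse=True)

# Made-up vectors; in the notebook, the real 768-dimensional ones could be used
toy_term_vectors = {
    "chest": [0.9, 0.1, 0.0],
    "pain": [0.8, 0.2, 0.1],
    "nausea": [0.1, 0.9, 0.2],
}
ranked = rank_pairs(toy_term_vectors)
for score, term1, term2 in ranked:
    print(f"{term1} / {term2}: {score:.2f}")
```

To try this on the case above, the terms in `cleaned_tokens` and the rows of `cleaned_embeddings` could be paired up and passed to `rank_pairs`. Repeated terms would need distinct keys.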


### Example: What Does This Mean in Practice?
Imagine typing into ChatGPT: "A patient presents with chest pain and shortness of breath." Here's what happens:

1. **Tokenization**: The text is broken into smaller parts, called tokens. For example, "chest" becomes one token, while "pain" is another.
2. **Mapping to IDs**: Each token is mapped to a unique integer ID that the model understands (for example, a number like 5827 for "chest"; the actual ID depends on the tokenizer's vocabulary).
3. **Embedding**: These IDs are converted into high-dimensional vectors (e.g., 768-dimensional), capturing semantic meaning and relationships between words.
4. **Processing**: The embeddings are passed through the model to generate predictions about the next tokens or text.

This is why the tokenized input, token IDs, and embeddings are foundational steps in generating meaningful responses in GPT models.


# Understanding GPT's Attention Mechanism in Medical Context

Just as clinicians focus on the most relevant symptoms during a diagnosis, GPT uses an "attention mechanism" to decide which parts of the input deserve the most focus. Let’s visualize this concept using a sample clinical sentence.

**What we'll do:**
- Tokenize a medical text sample (breaking it into meaningful pieces)
- Process it through the GPT model
- Visualize how the model "pays attention" to different words
- Interpret relationships GPT draws between terms

In [None]:
# Import necessary libraries
from transformers import GPT2Tokenizer, GPT2Model
import torch
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Initialize the GPT-2 model and tokenizer with explicit attention implementation
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained(
    "gpt2",
    output_attentions=True,
    attn_implementation="eager"  # Explicitly specify attention implementation
)

# Set device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Input a medical case description
text = "A patient complains of chest pain and shortness of breath."
tokens = tokenizer(text, return_tensors="pt").to(device)

try:
    # Process the text through the model (similar to analyzing symptoms)
    with torch.no_grad():
        outputs = model(**tokens, output_attentions=True)

    # Extract attention patterns
    attentions = outputs.attentions

    # Choose which layer and attention head to visualize
    layer, head = 0, 0  # First layer, first attention head

    # Get the attention matrix and move to CPU for visualization
    attention_matrix = attentions[layer][0, head].cpu().numpy()

    # Convert model's internal tokens to readable text
    tokens_list = tokenizer.convert_ids_to_tokens(tokens["input_ids"].squeeze().tolist())
    cleaned_tokens = [token.lstrip("Ġ") for token in tokens_list]

    # Create an attention heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(
        attention_matrix,
        xticklabels=cleaned_tokens,
        yticklabels=cleaned_tokens,
        cmap="Blues",
        cbar=True,
        square=True
    )
    plt.title(f"Attention Patterns in Medical Text Analysis (Layer {layer + 1}, Head {head + 1})")
    plt.xlabel("Tokens Being Attended To (keys)")
    plt.ylabel("Tokens Doing the Attending (queries)")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

    # Add interpretation helper
    print("\nInterpretation Guide:")
    print("- Darker colors indicate stronger attention between tokens")
    print("- Look for patterns between related medical terms")
    print("- Notice how the model connects symptoms with each other")

    # Optional: Print attention statistics
    print("\nAttention Statistics:")
    print(f"Maximum attention value: {attention_matrix.max():.4f}")
    print(f"Average attention value: {attention_matrix.mean():.4f}")

except Exception as e:
    print(f"An error occurred: {str(e)}")
    print("Please ensure you have the latest version of transformers installed:")
    print("!pip install --upgrade transformers")

To give an example of how to interpret this attention map, look at the "patient" column (not the row). The darkest squares are at "chest" and "complains," so this first attention layer is noting a relationship between these terms. Let's look at one more layer. In the code above, find the line `layer, head = 0, 0` and replace it with the lines copied below. Then move the `#` from one of the code lines to the other and re-run the cell to generate an attention map for the other layer!



```
# Visualize attention for a specific layer and head
layer, head = 0, 0  # Choose the first layer and first head
# layer, head = 11, 0  # For the last layer
```





# A bit more about the layers and heads

## Core Concepts: Layers, Heads & Context
Think of GPT like a medical team analyzing a complex case:
- **Layers** process information with increasing sophistication:
  - Early layers → symptom recognition
  - Deep layers → complex clinical reasoning
  - Each layer extracts more meaning from the input
- **Attention Heads**: Each head offers a different view, like consulting multiple specialists.
- **Context Building**: GPT understands a word based on what surrounds it—“discharge” means different things in “hospital discharge” vs. “wound discharge.”
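As a rough sketch of how heads divide the work, the GPT-2 model used in this notebook has a hidden size of 768 and 12 attention heads, so each head works on a 64-number slice. The example below shows only that split. In the real model the split happens after learned query, key, and value projections, which are omitted here, and the function name is hypothetical.

```python
def split_into_heads(vector, num_heads):
    """Divide one hidden vector into equal slices, one per attention head."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

# A placeholder 768-number vector standing in for one token's representation
hidden_vector = [float(i) for i in range(768)]
head_slices = split_into_heads(hidden_vector, num_heads=12)
print(len(head_slices), "heads, each seeing", len(head_slices[0]), "numbers")
```

Each head computes its own attention pattern over its slice, and the results are joined back together before the next layer.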

## Key Questions in Medical AI

### 1. "How do larger models improve understanding?"
Modern models (which go well beyond the 12 layers of the GPT-2 model used here):
- Process more context (like considering longer medical histories)
- Capture more complex relationships between symptoms
- Better understand rare conditions and subtle interactions
- BUT: Require more resources and time

### 2. "What makes these models effective?"
Three key features:
- **Layer-by-Layer Processing:** Each layer extracts deeper meaning
  - Like moving from symptoms → diagnosis → treatment plan
- **Parallel Analysis:** Multiple attention heads work simultaneously
  - Like specialists collaborating on a complex case
- **Context Integration:** Previous information influences interpretation
  - Similar to how prior medical history affects diagnosis

### 3. "When do we need bigger vs. smaller models?"
Choose based on:
- Task complexity (routine vs. complex cases)
- Resource availability
- Speed requirements
- Accuracy needs

## Quick Takeaway
Modern GPT models are powerful because they can extract deep meaning from medical text (or any text...) through multiple layers of analysis. While larger models can understand more complexity, choose the right size for your specific medical application.

# Understanding How GPT Generates Medical Responses

## How Does GPT "Think" Through Medical Cases?
Just as clinicians process information systematically, GPT analyzes prompts and generates responses step by step.

### The Response Generation Process 🔄
1. **Input Analysis** (like reviewing patient information)
   - GPT “reads” the full prompt and encodes context using attention

2. **Token Prediction** (like clinical reasoning)
   - Predicts next word/token based on context
   - Uses probability distributions to choose options
   - Builds response piece by piece, like constructing a clinical assessment

3. **Generation Controls**
   - `temperature`: Controls randomness. Lower = focused, Higher = creative
     - Lower (0.3-0.5): More focused, standard responses
       - Best for: Guidelines, protocols, standard care paths
     - Higher (0.7-0.9): More diverse, exploratory responses
       - Best for: Brainstorming, complex cases, generating differentials
   - `top_p`: Keeps only the most probable next-token options, stopping once their combined probability reaches `top_p`
     - Limits which possibilities the model considers (e.g., `top_p=0.9` drops the unlikely tail)
     - Like setting a threshold for clinical relevance
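The two controls above can be sketched with a handful of made-up scores. This is a minimal illustration of temperature scaling and top-p (nucleus) filtering, not the exact code inside `transformers`. The candidate words and their scores (`toy_logits`) are invented for the example.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Divide scores by the temperature, then convert to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of top options whose total reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized

# Invented next-word candidates and raw scores
candidates = ["retinopathy", "nephropathy", "neuropathy", "pizza"]
toy_logits = [2.0, 1.5, 1.0, -3.0]

cool = apply_temperature(toy_logits, 0.3)  # sharper: top choice dominates
warm = apply_temperature(toy_logits, 0.9)  # flatter: more variety
nucleus = top_p_filter(warm, top_p=0.9)

rng = random.Random(0)  # fixed seed so reruns match
choice = rng.choices(list(nucleus), weights=list(nucleus.values()))[0]
print("Low temperature:", [round(p, 3) for p in cool])
print("High temperature:", [round(p, 3) for p in warm])
print("Sampled:", candidates[choice])
```

Lowering the temperature concentrates probability on the top candidate, while the top-p filter removes the implausible tail ("pizza") before a word is sampled.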

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Initialize our "medical AI assistant"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Medical scenarios for demonstration
medical_prompts = [
    "The first-line treatment for high blood pressure includes",
    "The differential diagnosis for chest pain includes",
    "Potential complications of diabetes mellitus include"
]

def generate_medical_response(prompt, temp=0.7):
    """
    Generate medical responses with controlled variety
    - Low temp (0.3): Standard, protocol-based responses
    - High temp (0.7): More comprehensive, varied responses
    """
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)

    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=50,
        temperature=temp,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        num_return_sequences=1
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Demonstrate different response styles
print("Standard Protocol Response (temperature=0.3):")
print(generate_medical_response(medical_prompts[2], temp=0.3))
print("\nComprehensive Exploration (temperature=0.7):")
print(generate_medical_response(medical_prompts[2], temp=0.7))

### 💡 Key Points for Medical Students

(Beyond the fact that GPT-2 isn't very smart!)

1. **Foundation of AI Responses**
- Models use pre-trained medical and general knowledge
- Process information systematically, like clinical reasoning
- Build responses incrementally, considering context

2. **Temperature Controls Response Style**
- Low Temperature (0.3-0.5):
  - Generates focused, standard responses
  - Ideal for protocol-based answers
  - Example: Standard treatment guidelines
- High Temperature (0.7-0.9):
  - Generates more varied, comprehensive responses
  - Better for exploring possibilities
  - Example: Building complex differential diagnoses

3. **Practical Applications**
- Use low temperature for:
  - Standard treatment protocols
  - Clinical guidelines
  - Routine care pathways
- Use high temperature for:
  - Brainstorming sessions
  - Complex case discussions
  - Exploring rare possibilities

4. **Important Limitations**
- Pre-training knowledge may be outdated
- Cannot replace clinical judgment
- Requires verification against current guidelines

5. **Best Practices**
- Match temperature to your needs:
  - Guidelines → Low temperature
  - Brainstorming → High temperature
- Always verify generated content
- Use as a supplement to clinical knowledge

Remember: AI is a tool to enhance, not replace, clinical thinking. The temperature setting helps you control how focused or exploratory the responses will be!


## I think we are ready for a smarter model here!

### **Open-Source vs. Proprietary Models**

**Accessibility and Cost:**  
- **Open-Source Models**: These models are freely available, allowing users to download, modify, and deploy them locally without licensing fees. This democratizes access to advanced AI capabilities, especially for those with limited budgets. As we will see, some of these powerful models can be run with less than 2 GB of storage.  
- **Proprietary Models**: Access typically requires subscription fees or pay-per-use arrangements. Users interact with these models via APIs, with limited insight into the underlying architecture.

**Customization and Control:**  
- **Open-Source Models**: Users can tailor the model's architecture and training data to specific needs, fostering innovation and adaptability.  
- **Proprietary Models**: Customization is often restricted, with users dependent on the provider for updates and feature enhancements.

**Transparency and Security:**  
- **Open-Source Models**: The open nature allows for thorough examination of code and data, promoting transparency and enabling users to identify and address potential biases or vulnerabilities.  
- **Proprietary Models**: The closed nature can obscure potential biases or security issues, as the internal workings are not open to public scrutiny.

**Performance and Resources:**  
- **Open-Source Models**: While offering flexibility, they sometimes require significant local computational resources for training and deployment, which can be a barrier for some users.  
- **Proprietary Models**: Often trained on extensive datasets using substantial computational power, they can deliver superior performance, especially in complex tasks.

**Data Privacy:**  
- **Open-Source Models**: Deploying models locally ensures that sensitive data remains in-house, enhancing privacy and compliance with data protection regulations, which is especially important in medicine.  
- **Proprietary Models**: Data shared with these models may be processed externally, raising concerns about data privacy and security.


### **LLM Pitfalls**

It is important to consider where large language models can lead clinicians and students awry (Omiye et al). Each model has its own performance characteristics that vary based on task.

1. **Inaccuracy and Inconsistency**  
   - LLMs can generate factually incorrect or inconsistent information, which is particularly concerning in high-stakes medical contexts. For example, ChatGPT showed a consistency rate of 85% in answering pathology board examination-style questions but still produced factual inaccuracies and interpretive errors (Koga). Similarly, LLMs often struggle with summarizing medical evidence accurately, leading to factually inconsistent summaries (Tang et al).  

2. **Misinformation and Obsolete Data**  
   - LLMs can provide outdated or misleading information. In cancer care, for instance, ChatGPT demonstrated a significant error rate and a tendency to provide obsolete data, necessitating expert-driven verification to avoid misinformation (Iannantuono et al). This issue is compounded by the models' inability to distinguish between real and fake information (Deng et al).  

3. **Algorithmic Bias and Inequity**  
   - The integration of LLMs in medical education and practice can perpetuate existing biases present in the training data, leading to biased outcomes and inequities in healthcare delivery (Abd-alrazaq et al). This is a critical concern as biased algorithms can exacerbate disparities in medical treatment and outcomes.  

4. **Overreliance and Plagiarism**  
   - There is a risk of overreliance on LLMs, which can lead to reduced critical thinking and problem-solving skills among medical professionals and students. Additionally, the use of LLMs can facilitate plagiarism, as students might use these models to generate content without proper attribution (Abd-alrazaq et al).  

5. **Privacy and Ethical Concerns**  
   - The use of LLMs raises significant privacy and ethical issues, including concerns about patient consent, data privacy, and legal liability. These models often lack transparency, making it difficult to understand how they generate responses, which complicates their ethical use in clinical settings (Safranek et al, Borkowski et al).  

6. **Lack of Contextual Understanding**  
   - LLMs are limited in their ability to integrate contextual and external information, comprehend sensory and nonverbal cues, and cultivate rapport and interpersonal interactions, which are crucial in medical education and patient care (Safranek et al).

While LLMs hold promise for transforming medical education and practice, their current limitations—such as inaccuracy, misinformation, bias, overreliance, privacy concerns, and lack of contextual understanding—underscore the need for cautious and responsible integration. Continuous model refinement, human oversight, and adherence to ethical guidelines are essential to mitigate these pitfalls and harness the full potential of LLMs in medicine (Karabacak et al).

# Enough background!
## Running an Open Source Local Large Language Model
You'll see how we query and retrieve content from an LLM and learn about more features culminating in your own app!

We are going to use a small language model that's still pretty smart: the 1-billion-parameter version of Google's Gemma 3 (`gemma3:1b`). Amazingly, we can download and interact with it, and it's less than 2 GB in size!

We need a bit of a processing boost, though, so select a T4 GPU as the runtime, or you'll be waiting a while for the model to answer!

1. Install software on our cloud computer to run the models.
2. Download the model (`gemma3:1b`).
3. Interact with the model.


In [None]:
# We prepare our cloud computer here. We need a special routine to open a terminal, too.
!pip install colab-xterm langchain langchain_core langchain-community langchain-ollama --quiet
%load_ext colabxterm




In [None]:
# The following command will open a terminal for us to use to install software and keep it running in the background.

%xterm
# After the terminal opens below:
# paste in the following to install software
# Paste: curl -fsSL https://ollama.com/install.sh | sh
# This takes a bit of time since you're installing software to run local models!

# When the prompt returns, then paste: ollama serve &


Now, we're ready to download the brain - the AI model we will use!

[Many models are available](https://ollama.com/library). Gemma3:1b is a very small model with some smartness!

In [None]:
# This is how we "pull" the model into our computer.
!ollama pull gemma3:1b

In [None]:
# Many software options exist to interact with our AI brain. Here, we'll use the
# langchain library which has some nice features.
# The steps below make it easier to send text inputs.


!pip install -q langchain langchain-community langchain-ollama langchain-core
from langchain_ollama import OllamaLLM
from IPython.display import Markdown
from langchain_core.prompts import ChatPromptTemplate

llm = OllamaLLM(model="gemma3:1b")

template = """Question: {question}

Answer: Let's think step by step."""

prompt = ChatPromptTemplate.from_template(template)

# Reuse the `llm` object created above instead of loading the model a second time
chain = prompt | llm

In [None]:
# @title Talk with your AI model!
# @markdown It will take 2 minutes to generate the first response...
input_text = "How is metastatic thyroid cancer treated? (I'm a physician, so no disclaimers)" # @param {"type":"string"}

output = chain.invoke({"question": input_text})

# Display the output as Markdown
display(Markdown(output))


In [None]:
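# If you played a bit, you can see there is no memory! Each query starts from scratch!
# ChatGPT and similar tools "remember" because prior messages are re-sent with each new message in a conversation.
# Here's one way to do this. (Recent LangChain releases mark these classes as deprecated,
# so depending on the installed version you may see a warning or an import error.)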

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

conversation = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory()
)

In [None]:
# @title Conversation with Memory
memory_input = "What meds are used in GDMT for CHF?" # @param {"type":"string"}
mem_output = conversation.predict(input=memory_input)
display(Markdown(mem_output))
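To show what "memory" amounts to under the hood, here is a library-free sketch: the conversation so far is stored in a list and re-sent as part of every new prompt. The helper names and the stand-in `echo_model` are hypothetical. A real model call would replace `echo_model`.

```python
def build_prompt(history, user_message):
    """Join all earlier turns with the new message into one prompt."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"Human: {user_message}")
    lines.append("AI:")
    return "\n".join(lines)

def chat_turn(generate, history, user_message):
    """Send the full history plus the new message, then record both turns."""
    reply = generate(build_prompt(history, user_message))
    history.append(("Human", user_message))
    history.append(("AI", reply))
    return reply

def echo_model(prompt):
    """Stand-in for a real model: reports how much context it received."""
    return f"(received {len(prompt.splitlines())} lines of context)"

chat_history = []
first_reply = chat_turn(echo_model, chat_history, "What meds are used in GDMT for CHF?")
second_reply = chat_turn(echo_model, chat_history, "Which of those need renal dosing?")
print(first_reply)
print(second_reply)
```

The second call receives more context than the first because the earlier exchange travels with it. In the notebook, `llm.invoke` could be passed in place of `echo_model`. Note that a growing history eventually runs into the context window limit.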

### **Common Ways to Improve LLMs: A Note on Fine-Tuning and Retrieval-Augmented Generation**

**Fine-Tuning** involves adapting a pre-trained large language model (LLM) to perform specific tasks by training it further on domain-specific data. For instance, models like *ClinicalGPT* have been fine-tuned using diverse medical datasets, including patient records and clinical notes, to improve their performance in clinical scenarios. This process enables the model to generate more accurate and relevant medical information, aligning its outputs with the nuances of medical language and practice.

**Retrieval-Augmented Generation (RAG)** enhances the capabilities of LLMs by integrating them with external knowledge sources. In this approach, the model retrieves pertinent information from a database or knowledge base and combines it with its generative abilities to produce more accurate and contextually relevant responses. This is particularly beneficial in medicine, where up-to-date and precise information is crucial. For example, the *Bailicai* framework employs RAG to improve LLM performance in medical applications, effectively reducing issues like hallucinations by grounding responses in reliable external data (Bailicai). This can dramatically improve accuracy, with an absolute increase of up to ~40% (Woo et al).
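The retrieve-then-generate pattern can be sketched without any special libraries. The example below is a deliberately simple stand-in: it scores documents by shared words, whereas production RAG systems typically use embedding similarity and a vector database. The documents, helper names, and prompt wording are all hypothetical placeholders.

```python
import re

def word_set(text):
    """Lowercase a text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Return the top_k documents sharing the most words with the question."""
    question_words = word_set(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & word_set(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_rag_prompt(question, documents, top_k=1):
    """Place the retrieved text ahead of the question for the model to use."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder "knowledge base" entries, invented for this example
knowledge_base = [
    "Heart failure clinic hours are Monday and Thursday.",
    "Diabetes education classes are held in room 12.",
    "The chest pain pathway document is stored on the intranet.",
]
rag_question = "When is the heart failure clinic open?"
rag_prompt = build_rag_prompt(rag_question, knowledge_base)
print(rag_prompt)
```

The resulting prompt could then be sent to a model (for example with `llm.invoke(rag_prompt)` in this notebook), so the answer is grounded in the retrieved text rather than in whatever the model memorized during training.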

---

### **Potential Uses for Secure, Locally Run Large Language Models in Healthcare**

- **Clinical Documentation and Workflow Optimization**: LLMs can automate administrative tasks, such as clinical documentation and prior authorization, reducing the burden on healthcare professionals and improving workflow efficiency (Tripathi et al, Denecke et al).  
- **Patient Education and Engagement**: These models can enhance patient education by providing personalized information and improving patient engagement, acting as health assistants (Tripathi et al, Denecke et al).  
- **Diagnostic Assistance and Treatment Recommendations**: LLMs can assist in diagnostics and offer treatment recommendations, supporting clinical decision-making processes. Caution is necessary, as there are multiple pitfalls with unrefined LLMs (Zou et al).  
- **Data Handling and Extraction**: LLMs can efficiently handle and extract data from electronic health records, improving the quality of healthcare services and outcomes (Denecke et al, Zou et al).  
- **Information Retrieval**: They can be used for retrieving relevant medical information, aiding in research and clinical inquiries (Chen, Al Nazi et al).



### **References and Further Reading**

1. **Abd-Alrazaq, A., AlSaad, R., Alhuwail, D., Ahmed, A., Healy, P. M., Latifi, S., Aziz, S., Damseh, R., Alabed Alrazak, S., & Sheikh, J.** (2023). *Large language models in medical education: Opportunities, challenges, and future directions.* JMIR Medical Education, 9, e48291. [https://doi.org/10.2196/48291](https://doi.org/10.2196/48291)

2. **Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J., & Daneshjou, R.** (2024). *Large language models in medicine: The potentials and pitfalls: A narrative review.* Annals of Internal Medicine, 177(2), 210–220. [https://doi.org/10.7326/M23-2772](https://doi.org/10.7326/M23-2772)

3. **Tang, L., Sun, Z., Idnay, B., Nestor, J. G., Soroush, A., Elias, P. A., Xu, Z., Ding, Y., Durrett, G., Rousseau, J. F., Weng, C., & Peng, Y.** (2023). *Evaluating large language models on medical evidence summarization.* NPJ Digital Medicine, 6(1), 158. [https://doi.org/10.1038/s41746-023-00896-7](https://doi.org/10.1038/s41746-023-00896-7)

4. **Koga, S.** (2023). *Exploring the pitfalls of large language models: Inconsistency and inaccuracy in answering pathology board examination-style questions.* Pathology International, 73(12), 618–620. [https://doi.org/10.1111/pin.13382](https://doi.org/10.1111/pin.13382)

5. **Iannantuono, G. M., Bracken-Clarke, D., Floudas, C. S., Roselli, M., Gulley, J. L., & Karzai, F.** (2023). *Applications of large language models in cancer care: Current evidence and future perspectives.* Frontiers in Oncology, 13, 1268915. [https://doi.org/10.3389/fonc.2023.1268915](https://doi.org/10.3389/fonc.2023.1268915)

6. **Deng, J., Zubair, A., Park, Y. J., Affan, E., & Zuo, Q. K.** (2024). *The use of large language models in medicine: Proceeding with caution.* Current Medical Research and Opinion, 40(2), 151–153. [https://doi.org/10.1080/03007995.2023.2295411](https://doi.org/10.1080/03007995.2023.2295411)

7. **Safranek, C. W., Sidamon-Eristoff, A. E., Gilson, A., & Chartash, D.** (2023). *The role of large language models in medical education: Applications and implications.* JMIR Medical Education, 9, e50945. [https://doi.org/10.2196/50945](https://doi.org/10.2196/50945)

8. **Borkowski, A. A., Jakey, C. E., Mastorides, S. M., Kraus, A. L., Vidyarthi, G., Viswanadhan, N., & Lezama, J. L.** (2023). *Applications of ChatGPT and large language models in medicine and health care: Benefits and pitfalls.* Federal Practitioner, 40(6), 170–173. [https://doi.org/10.12788/fp.0386](https://doi.org/10.12788/fp.0386)

9. **Karabacak, M., & Margetis, K.** (2023). *Embracing large language models for medical applications: Opportunities and challenges.* Cureus, 15(5), e39305. [https://doi.org/10.7759/cureus.39305](https://doi.org/10.7759/cureus.39305)

10. **Wang, G., Yang, G., Du, Z., Fan, L., & Li, X.** (2023). *ClinicalGPT: Large language models finetuned with diverse medical data and comprehensive evaluation.* arXiv. [https://arxiv.org/abs/2306.09968](https://arxiv.org/abs/2306.09968)

11. **Long, C., Liu, Y., Ouyang, C., & Yu, Y.** (2024). *Bailicai: A domain-optimized retrieval-augmented generation framework for medical applications.* arXiv. [https://arxiv.org/abs/2407.21055](https://arxiv.org/abs/2407.21055)

12. **Woo, J. J., Yang, A. J., Olsen, R. J., Hasan, S. S., Nawabi, D. H., Nwachukwu, B. U., Williams, R. J., & Ramkumar, P. N.** (2024). *Custom large language models improve accuracy: Comparing retrieval augmented generation and artificial intelligence agents to non-custom models for evidence-based medicine.* Arthroscopy. [https://doi.org/10.1016/j.arthro.2024.10.042](https://doi.org/10.1016/j.arthro.2024.10.042)

13. **Tripathi, S., Sukumaran, R., & Cook, T.** (2024). *Efficient healthcare with large language models: Optimizing clinical workflow and enhancing patient care.* Journal of the American Medical Informatics Association. [https://doi.org/10.1093/jamia/ocad258](https://doi.org/10.1093/jamia/ocad258)

14. **Denecke, K., May, R., & Romero, O.** (2024). *Potential of large language models in health care: Delphi study.* Journal of Medical Internet Research, 26, e52399. [https://doi.org/10.2196/52399](https://doi.org/10.2196/52399)

15. **Zou, S., & He, J.** (2023). *Large language models in healthcare: A review.* In 2023 7th International Symposium on Computer Science and Intelligent Control (ISCSIC) (pp. 141–145). [https://doi.org/10.1109/ISCSIC60498.2023.00038](https://doi.org/10.1109/ISCSIC60498.2023.00038)

16. **Chen, S.** (2024). *Potential applications and safety of large language models in healthcare.* Interdisciplinary Humanities and Communication Studies. [https://doi.org/10.61173/f578jp05](https://doi.org/10.61173/f578jp05)

17. **Al Nazi, Z., & Peng, W.** (2024). *Large language models in healthcare and medical domain: A review.* arXiv. [https://doi.org/10.48550/arXiv.2401.06775](https://doi.org/10.48550/arXiv.2401.06775)
