# Chapter 4: Case Studies on Hallucinations in LLM Applications

In this chapter, we will explore real-world applications of Large Language Models (LLMs) where hallucinations could present significant challenges. Case studies in fields such as healthcare, finance, and legal applications illustrate why hallucination mitigation is essential in high-stakes environments.

---

## Learning Objectives

By the end of this chapter, you will be able to:
- Understand the potential impacts of hallucinations in specific fields.
- Identify why certain applications are more susceptible to hallucination risks.
- Implement techniques that reduce hallucinations in practical use cases.

---

## 1. Healthcare Case Study: Medical Advice Assistant

LLMs can assist in healthcare by answering medical questions, suggesting treatments, or summarising patient notes. However, hallucinations can lead to incorrect medical advice, which could be dangerous for patients.

---

### Example: Observing Hallucinations in a Healthcare Scenario

Let's simulate a medical advice assistant using GPT-4 and LLaMA. We’ll use ambiguous prompts to see if hallucinations occur in medical contexts.

### Required Libraries

Ensure the necessary libraries are installed before proceeding.


In [None]:
!pip install openai transformers


---

### Code Implementation: Simulating a Medical Advice Assistant


In [None]:
import openai
from transformers import AutoModelForCausalLM, AutoTokenizer



In [None]:
# Set up OpenAI API for GPT-4
openai.api_key = "YOUR_OPENAI_API_KEY"  # Replace with your OpenAI API key

def gpt4_medical_response(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response['choices'][0]['message']['content']


In [None]:
# Set up LLaMA model for medical prompts
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")
llama_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

def llama_medical_response(prompt):
    inputs = llama_tokenizer(prompt, return_tensors="pt")
    outputs = llama_model.generate(inputs["input_ids"], max_length=50, num_return_sequences=1)
    return llama_tokenizer.decode(outputs[0], skip_special_tokens=True)



In [None]:
# Define ambiguous medical prompts
prompt = "What are the best treatments for chronic pain?"
followup_prompt = "Is it safe to take aspirin if you have diabetes?"



In [None]:
# Test responses with GPT-4
print("GPT-4 Response to Medical Prompt:")
print(gpt4_medical_response(prompt))
print("\nGPT-4 Response to Follow-Up Medical Prompt:")
print(gpt4_medical_response(followup_prompt))

# Test responses with LLaMA
print("\nLLaMA Response to Medical Prompt:")
print(llama_medical_response(prompt))
print("\nLLaMA Response to Follow-Up Medical Prompt:")
print(llama_medical_response(followup_prompt))


---

## Observations

After running the code above, observe the following:

1. **Medical Accuracy**: Did the model provide safe, accurate medical information, or did it produce any potentially unsafe advice?
2. **Pattern Recognition**: Notice if the model fills in information that sounds plausible but may not be factually correct.
3. **Model Comparison**: Compare how GPT-4 and LLaMA handle ambiguous medical prompts. Does one model show a higher tendency to hallucinate?

---

## 2. Financial Case Study: Investment Advice Assistant

In finance, LLMs can be used to provide investment insights or summarise market trends. Hallucinations in this field could lead to inaccurate advice, potentially resulting in financial losses.

---

### Example: Simulating Financial Advice

In this example, we’ll test how models respond to prompts about investment advice. Financial advice requires factual and reliable information, as incorrect advice could be costly.

---

### Code Implementation: Simulating Financial Advice


In [None]:
# Define financial prompts
financial_prompt = "What are the best stocks to invest in right now?"
risk_assessment_prompt = "Is investing in tech stocks safe during a recession?"


In [None]:
# Test responses with GPT-4
print("GPT-4 Response to Financial Prompt:")
print(gpt4_medical_response(financial_prompt))
print("\nGPT-4 Response to Risk Assessment Prompt:")
print(gpt4_medical_response(risk_assessment_prompt))

# Test responses with LLaMA
print("\nLLaMA Response to Financial Prompt:")
print(llama_medical_response(financial_prompt))
print("\nLLaMA Response to Risk Assessment Prompt:")
print(llama_medical_response(risk_assessment_prompt))


---

## Observations

For each response, observe:

1. **Accuracy in Financial Advice**: Does the model provide realistic, well-researched investment advice, or does it fabricate information?
2. **Safety of Advice**: Note if the model’s advice would lead to risky financial decisions without proper disclaimers.
3. **Comparative Analysis**: Assess which model is more prone to generating unsubstantiated or risky financial advice.

---

## 3. Legal Case Study: Legal Assistance Chatbot

In legal applications, LLMs might help users understand legal terms or offer guidance on common questions. However, hallucinations can lead to inaccurate legal interpretations, which could mislead users and cause legal complications.

---

### Example: Simulating a Legal Assistance Chatbot

In this case study, we’ll test how models respond to legal questions and if they provide accurate or speculative answers.

---

### Code Implementation: Simulating Legal Assistance


In [None]:
# Define legal prompts
legal_prompt = "Can I get a divorce without a lawyer?"
followup_legal_prompt = "What are the tax implications of divorce settlements?"

# Test responses with GPT-4
print("GPT-4 Response to Legal Prompt:")
print(gpt4_medical_response(legal_prompt))
print("\nGPT-4 Response to Follow-Up Legal Prompt:")
print(gpt4_medical_response(followup_legal_prompt))

# Test responses with LLaMA
print("\nLLaMA Response to Legal Prompt:")
print(llama_medical_response(legal_prompt))
print("\nLLaMA Response to Follow-Up Legal Prompt:")
print(llama_medical_response(followup_legal_prompt))


---

## Observations

For each legal prompt, observe:

1. **Legal Accuracy**: Did the model provide accurate information, or did it produce any speculative or misleading responses?
2. **Adherence to Jurisdiction**: Check if the model provides advice that could vary by jurisdiction, making it potentially inaccurate for certain users.
3. **Comparison Across Models**: Analyse if one model is more prone to hallucinating legal information or if both exhibit similar tendencies.

---

## Exercise: Testing Model Hallucinations in Real-World Scenarios

### Instructions

1. **Create New Prompts for Each Case Study**: Write your own prompts for healthcare, finance, and legal fields. For example:
   - Healthcare: "What is the best diet for managing hypertension?"
   - Finance: "Should I invest in real estate or cryptocurrency?"
   - Legal: "Do I need to report foreign income on my taxes?"

2. **Test with Both Models**: Use GPT-4 and LLaMA to respond to these prompts.

3. **Record Your Observations**: For each response, note if the model hallucinated information, provided accurate advice, or produced vague answers.


In [None]:
# Example custom prompts
custom_prompts = [
    "What is the best diet for managing hypertension?",
    "Should I invest in real estate or cryptocurrency?",
    "Do I need to report foreign income on my taxes?"
]

print("Observations with GPT-4:")
for prompt in custom_prompts:
    print(f"\nPrompt: {prompt}")
    print("GPT-4 Response:", gpt4_medical_response(prompt))

print("\nObservations with LLaMA:")
for prompt in custom_prompts:
    print(f"\nPrompt: {prompt}")
    print("LLaMA Response:", llama_medical_response(prompt))


---

## Summary

In this chapter, we examined real-world case studies where hallucinations in LLMs can have significant consequences:

1. **Healthcare**: Hallucinations can lead to unsafe medical advice.
2. **Finance**: Inaccurate investment advice could lead to financial losses.
3. **Legal**: Incorrect legal guidance can mislead users and cause legal issues.

These case studies highlight the importance of mitigating hallucinations in high-stakes applications, where reliable and accurate information is critical.
