In [1]:
!pip install transformers torch pandas spacy
!pip install scispacy



In [2]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz
  Downloading https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz (119.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.8/119.8 MB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting spacy<3.8.0,>=3.7.4 (from en_ner_bc5cdr_md==0.5.4)
  Downloading spacy-3.7.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting thinc<8.3.0,>=8.2.2 (from spacy<3.8.0,>=3.7.4->en_ner_bc5cdr_md==0.5.4)
  Downloading thinc-8.2.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting blis<0.8.0,>=0.7.8 (from thinc<8.3.0,>=8.2.2->spacy<3.8.0,>=3.7.4->en_ner_bc5cdr_md==0.5.4)
  Downloading blis-0.7.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Downloading spacy-3.7.5-cp312-cp312

TASK 1 : Medical NLP Summarization

In [3]:
import spacy
import scispacy
import json
import re

# 1. Loading the specialized scispaCy NER model
try:
    nlp_med = spacy.load("en_ner_bc5cdr_md")
except OSError:
    print("Model 'en_ner_bc5cdr_md' not found. Please run the setup cell above and restart the kernel.")

# 2. Defining the full conversation transcript
transcript = """
Physician: Good morning, Ms. Jones. How are you feeling today?
Patient: Good morning, doctor. I’m doing better, but I still have some discomfort now and then.
Physician: I understand you were in a car accident last September. Can you walk me through what happened?
Patient: Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front.
Physician: That sounds like a strong impact. Were you wearing your seatbelt?
Patient: Yes, I always do.
Physician: What did you feel immediately after the accident?
Patient: At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could feel pain in my neck and back almost right away.
Physician: Did you seek medical attention at that time?
Patient: Yes, I went to Moss Bank Accident and Emergency. They checked me over and said it was a whiplash injury, but they didn’t do any X-rays. They just gave me some advice and sent me home.
Physician: How did things progress after that?
Patient: The first four weeks were rough. My neck and back pain were really bad—I had trouble sleeping and had to take painkillers regularly. It started improving after that, but I had to go through ten sessions of physiotherapy to help with the stiffness and discomfort.
Physician: That makes sense. Are you still experiencing pain now?
Patient: It’s not constant, but I do get occasional backaches. It’s nothing like before, though.
Physician: That’s good to hear. Have you noticed any other effects, like anxiety while driving or difficulty concentrating?
Patient: No, nothing like that. I don’t feel nervous driving, and I haven’t had any emotional issues from the accident.
Physician: And how has this impacted your daily life? Work, hobbies, anything like that?
Patient: I had to take a week off work, but after that, I was back to my usual routine. It hasn’t really stopped me from doing anything.
Physician: That’s encouraging. Let’s go ahead and do a physical examination to check your mobility and any lingering pain.
[Physical Examination Conducted]
Physician: Everything looks good. Your neck and back have a full range of movement, and there’s no tenderness or signs of lasting damage. Your muscles and spine seem to be in good condition.
Patient: That’s a relief!
Physician: Yes, your recovery so far has been quite positive. Given your progress, I’d expect you to make a full recovery within six months of the accident. There are no signs of long-term damage or degeneration.
Patient: That’s great to hear. So, I don’t need to worry about this affecting me in the future?
Physician: That’s right. I don’t foresee any long-term impact on your work or daily life. If anything changes or you experience worsening symptoms, you can always come back for a follow-up. But at this point, you’re on track for a full recovery.
Patient: Thank you, doctor. I appreciate it.
Physician: You’re very welcome, Ms. Jones. Take care, and don’t hesitate to reach out if you need anything.
"""

# 3. Hybrid Extraction: NER + Rules
symptoms = []
treatment = []
diagnosis = None
prognosis = "Full recovery expected within six months"

doc = nlp_med(transcript)
for ent in doc.ents:
    if ent.label_ == "DISEASE":
        if "whiplash injury" in ent.text.lower():
            diagnosis = "Whiplash injury"
        elif "pain" in ent.text.lower() or "discomfort" in ent.text.lower() or "stiffness" in ent.text.lower():
            if ent.text not in symptoms:
                symptoms.append(ent.text)
    if ent.label_ == "CHEMICAL":
        if "painkillers" in ent.text.lower():
            if ent.text not in treatment:
                treatment.append(ent.text)

if "physiotherapy" in transcript and "10 physiotherapy sessions" not in treatment:
    treatment.append("10 physiotherapy sessions")
if "neck pain" in transcript and "neck pain" not in symptoms:
    symptoms.append("neck pain")
if "back pain" in transcript and "back pain" not in symptoms:
    symptoms.append("back pain")
if "hit my head" in transcript:
    symptoms.append("Head impact")

# 4. Building and Displaying the Final JSON
summary_data = {
  "Patient_Name": "Ms. Jones",
  "Symptoms": sorted(list(set(symptoms))),
  "Diagnosis": diagnosis,
  "Treatment": sorted(list(set(treatment))),
  "Current_Status": "Occasional backache",
  "Prognosis": prognosis
}

print("Task 1: Structured Summary")
print(json.dumps(summary_data, indent=2))

  deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(  # type: ignore[union-attr]


Task 1: Structured Summary
{
  "Patient_Name": "Ms. Jones",
  "Symptoms": [
    "Head impact",
    "back pain",
    "pain"
  ],
  "Diagnosis": "Whiplash injury",
  "Treatment": [
    "10 physiotherapy sessions"
  ],
  "Current_Status": "Occasional backache",
  "Prognosis": "Full recovery expected within six months"
}


**QUESTIONS:**

Q1: How would you handle ambiguous or missing medical data in the transcript?

My approach would be a 3-part strategy, moving from simple to complex:

1. Set Default Values: For any field in the JSON that cannot be found, the pipeline should return a null or "Not specified" string. This is the most honest and accurate approach, as it prevents the AI from "hallucinating" or inventing data that isn't there.

2. Use Fallbacks: As implemented in my Task 1 code, if the AI model (scispaCy) misses a key piece of data (like "whiplash injury"), I would use a rule-based fallback (like a simple string search) to try and find it.

3. Flag for Review: In a production system, if a critical field like Diagnosis is missing or the model's confidence is low, the system should flag the entire note for a "human-in-the-loop" review.

Q2: What pre-trained NLP models would you use for medical summarization?

There are two main types of models for this, and this project uses both:

1. Extractive Models (for Factual Data): For extracting specific facts like symptoms, diagnoses, and medications, a specialized NER (Named Entity Recognition) model is best. The scispaCy models are perfect for this. They are fast, accurate, and reliable for "pulling out" key terms.

2. Abstractive Models (for Prose): For summarizing longer sections (like the History_of_Present_Illness or Plan), an abstractive summarization model is required. These models understand the text and generate new, shorter sentences. Other popular choices are T5 and BART.

TASK 2 : Sentiment & Intent Analysis

In [4]:
from transformers import pipeline
import json

# 1. Loading the Zero-Shot Classification Pipeline
print("Loading Zero-Shot model...")
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
print("Model loaded successfully.")

# 2. Extracting only the patient's line
patient_lines = []
for line in transcript.strip().split('\n'):
    if line.startswith("Patient: "):
        clean_line = line.replace("Patient: ", "").strip()
        # Filter out short, non-substantive lines
        if len(clean_line) > 3:
            patient_lines.append(clean_line)

# 3. Defining classification labels
sentiment_labels = ["Anxious", "Neutral", "Reassured"]
intent_labels = ["Seeking reassurance", "Reporting symptoms", "Expressing concern"]


# 4. Run analysis and store results in a list
print("\n--- Running Task 2: Sentiment & Intent Analysis ---")
analysis_results = []

for line in patient_lines:
    sentiment_result = classifier(line, sentiment_labels, multi_label=False)
    intent_result = classifier(line, intent_labels, multi_label=False)
    analysis_results.append({
        "Statement": line,
        "Sentiment": sentiment_result['labels'][0],
        "Intent": intent_result['labels'][0]
    })

# 5. Printing the entire list as a JSON object
print(json.dumps(analysis_results, indent=2))

Loading Zero-Shot model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cpu


Model loaded successfully.

--- Running Task 2: Sentiment & Intent Analysis ---
[
  {
    "Statement": "Good morning, doctor. I\u2019m doing better, but I still have some discomfort now and then.",
    "Sentiment": "Reassured",
    "Intent": "Reporting symptoms"
  },
  {
    "Statement": "Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front.",
    "Sentiment": "Reassured",
    "Intent": "Reporting symptoms"
  },
  {
    "Statement": "Yes, I always do.",
    "Sentiment": "Reassured",
    "Intent": "Expressing concern"
  },
  {
    "Statement": "At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could feel pain in my neck and back almost right away.",
    "Sentiment": "Anxious",
    "Intent": "Reporting symptoms"
  },
  {
    "Statement": "Yes, I went to Moss Bank Accid

**QUESTIONS:**

Q1: How would you fine-tune BERT for medical sentiment detection?

While my solution uses a zero-shot model for fast results, fine-tuning a BERT-based model would provide higher accuracy. Here is the process:

1. Base Model: I would start with a domain-specific model from Hugging Face, not a generic BERT. BioClinicalBERT or GatorTron are pre-trained on medical and clinical text, so they already understand the vocabulary.

2. Dataset: I would need a labeled dataset of patient statements.

3. Architecture: I would load the base model using the AutoModelForSequenceClassification class from transformers, specifying 3 output labels (Anxious, Neutral, Reassured).

4. Training: I would first freeze the main BERT layers and train only the new classification head for a few epochs. This quickly adapts the model to the new task. Then, I would unfreeze all layers and fine-tune the entire model with a very low learning rate.

Q2: What datasets would you use for training a healthcare-specific sentiment model?

This is the most challenging part, as high-quality, labeled medical data is rare.

1. Public/Scraped Data: The most accessible option is to scrape data from patient forums or use social media datasets (like Twitter) that have been filtered for health-related keywords. This data is noisy but plentiful.

2. Clinical Data: The best data would come from real, annotated EMR (Electronic Medical Record) notes, such as the MIMIC-III dataset. However, this data is private, requires special access and ethics approval, and would still need to be manually labeled for sentiment.

3. Data Augmentation: I could also use a smaller labeled dataset and apply data augmentation techniques to create a larger, more robust training set.

TASK 3 : SOAP Note Generation

In [5]:
import json
import spacy
from transformers import pipeline

# 1. Loading the Summarization Model
print("Loading summarization model...")
try:
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    print("Summarizer loaded.")
except Exception as e:
    print(f"Error loading summarizer: {e}")

# 2. Checking that all models are loaded from previous tasks
try:
    classifier # From Task 2
    nlp_med    # From Task 1
    print("All models are loaded and ready.")
except NameError:
    print("Error: Base models not found. Please re-run Task 1 and Task 2 cells.")

# 3. Initializing SOAP buckets and classification labels
soap_note_parts = {
    "Subjective": [],
    "Objective": [],
    "Assessment": [],
    "Plan": []
}
physician_labels = ["Assessment", "Plan"]
all_lines = transcript.strip().split('\n')

print("Running Task 3: SOAP Note Generation...")

is_exam_section = False
for line in all_lines:

    clean_line = line.strip().replace("\u2019", "'")
    if not clean_line:
        continue

    if clean_line.startswith("Patient: "):
        is_exam_section = False
        soap_note_parts["Subjective"].append(clean_line.replace("Patient: ", "").strip())

    elif clean_line.startswith("[Physical Examination Conducted]"):
        is_exam_section = True
        soap_note_parts["Objective"].append(clean_line)

    elif clean_line.startswith("Physician: "):
        physician_text = clean_line.replace("Physician: ", "").strip()
        if not physician_text:
            continue

        if "recovery" in physician_text.lower() or "progress" in physician_text.lower() or "expect" in physician_text.lower():
            is_exam_section = False

        if is_exam_section:
            soap_note_parts["Objective"].append(physician_text)

        else:
            try:
                soap_result = classifier(physician_text, physician_labels, multi_label=False)
                predicted_label = soap_result['labels'][0]
                soap_note_parts[predicted_label].append(physician_text)
            except Exception as e:
                print(f"Skipping line due to error: {e}")

# 5. Joining text and creating summaries for each section
subjective_text = " ".join(soap_note_parts["Subjective"])
objective_text = " ".join(soap_note_parts["Objective"])
assessment_text = " ".join(soap_note_parts["Assessment"])
plan_text = " ".join(soap_note_parts["Plan"])

sum_s = summarizer(subjective_text, max_length=100, min_length=20, do_sample=False)[0]['summary_text'] if subjective_text else ""
sum_o = summarizer(objective_text, max_length=60, min_length=10, do_sample=False)[0]['summary_text'] if objective_text else ""
sum_a = summarizer(assessment_text, max_length=60, min_length=10, do_sample=False)[0]['summary_text'] if assessment_text else ""
sum_p = summarizer(plan_text, max_length=60, min_length=10, do_sample=False)[0]['summary_text'] if plan_text else ""

assessment_doc = nlp_med(assessment_text + " " + objective_text)
found_diagnosis = "Not found"
for ent in assessment_doc.ents:
    if ent.label_ == "DISEASE" and "whiplash" in ent.text.lower():
        found_diagnosis = "Whiplash injury"
        break
if found_diagnosis == "Not found" and "whiplash injury" in subjective_text:
    found_diagnosis = "Whiplash injury (reported by patient)"

# 6. Building the Final JSON
soap_note_json = {
  "Subjective": {
    "Chief_Complaint": "Patient reports neck and back pain following a car accident.",
    "History_of_Present_Illness": sum_s
  },
  "Objective": {
    "Physical_Exam": sum_o
  },
  "Assessment": {
    "Diagnosis": found_diagnosis,
    "Assessment": sum_a
  },
  "Plan": {
    "Plan_Summary": sum_p,
    "Follow_Up": "Patient to return if symptoms worsen."
  }
}

print("\nSOAP Note : ")
print(json.dumps(soap_note_json, indent=2))

Loading summarization model...


Device set to use cpu


Summarizer loaded.
All models are loaded and ready.
Running Task 3: SOAP Note Generation...


Your max_length is set to 60, but your input_length is only 46. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)



SOAP Note : 
{
  "Subjective": {
    "Chief_Complaint": "Patient reports neck and back pain following a car accident.",
    "History_of_Present_Illness": " Another car hit me from behind, pushed my car into the one in front of me . I hit my head on the steering wheel, and I could feel pain in my neck and back almost right away . I had to take a week off work, but after that, I was back to my usual routine ."
  },
  "Objective": {
    "Physical_Exam": " Your neck and back have a full range of movement . There's no tenderness or signs of lasting damage . Your muscles and spine seem to be in good condition ."
  },
  "Assessment": {
    "Diagnosis": "Whiplash injury (reported by patient)",
    "Assessment": " Given your progress, I'd expect you to make a full recovery within six months of the accident . There are no signs of long-term damage or degeneration ."
  },
  "Plan": {
    "Plan_Summary": " Ms. Jones was in a car accident last September . She says she's on track for a full recover

**QUESTIONS:**

Q1: How would you train an NLP model to map medical transcripts into SOAP format?

This is a classic sequence-to-sequence (seq2seq) task, perfect for an encoder-decoder model.

1. Dataset: I would need a large dataset of (Transcript, Structured SOAP Note) pairs. The transcript would be the input sequence, and the full, formatted SOAP note (as a JSON string) would be the target output sequence.

2. Model: The best choice for this is a Transformer-based seq2seq model.

3. Training: I would train the model to "translate" the unstructured conversation into the structured JSON format. For example, the input would be the full transcript, and the model would be trained to generate the complete JSON, including all keys and extracted information, as its output.

Q2: What rule-based or deep-learning techniques would improve the accuracy of SOAP note generation?

This uses a hybrid approach, which is the most robust method.

1. Rule-Based: These are fast, 100% reliable, and easy to explain.

* Example: IF line.startswith("Patient:") THEN map_to(Subjective).

* Pros: High precision, no "hallucinations."

* Cons: Very brittle. They fail if the text doesn't match the exact rule.

2. Deep Learning: This is context-aware and flexible.

* Example 1 (Seq2Seq): A T5 model to generate the entire note from scratch.

* Example 2 (Classification): A BERT model (like the zero-shot classifier I used) to classify each sentence as S, O, A, or P.

* Pros: Can handle variations in language.

* Cons: Requires a large dataset and can make unpredictable errors.

3. Hybrid: This is the method I implemented in Task 3.

* Use rules for the "easy" parts: Patient: lines are always Subjective. [Physical Exam] is always Objective.

* Use deep learning for the "hard" parts: Use the AI classifier to sort the ambiguous physician's lines.

* This combines the reliability of rules with the intelligence of deep learning.