# Project 02: Medical Text Classification — Scope & Data

## 🎯 Concept Primer
Text classification maps each medical Q&A text to one of a fixed set of categories. This project represents the text with embeddings and classifies it with a transformer model.

**Labels:** Multi-class classification  
**Success Metric:** Macro-F1 (equal weight per class)  
**Expected pipeline:** Tokenization → Vocab → Encoding → Padding → Model
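
The first four pipeline stages can be illustrated with a minimal pure-Python sketch. It assumes whitespace tokenization and a toy vocabulary; the helper names (`tokenize`, `build_vocab`, `encode`, `pad`) and the `<pad>`/`<unk>` tokens are illustrative choices, not part of this project's code. A transformer tokenizer would normally handle these steps with subword units instead.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocab(texts):
    """Assign an integer id to every token seen in the training texts."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids for padding and unknown tokens
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Map tokens to ids, falling back to <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]


def pad(ids, max_len, pad_id=0):
    """Truncate or right-pad a sequence so every example has max_len ids."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


texts = ["What causes Glaucoma ?", "What is (are) Glaucoma ?"]
vocab = build_vocab(texts)
encoded = encode(texts[0], vocab)
padded = pad(encoded, max_len=6)
unseen = encode("What treats Glaucoma ?", vocab)
print(padded)
print(unseen)
```

Fixed-length id sequences like `padded` are what an embedding layer consumes; the unseen word "treats" maps to the `<unk>` id.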

## 📋 Objectives
1. Define label space and categories
2. Load medical text dataset
3. Set success metrics (Macro-F1 primary)
4. Document ethical considerations

## ✅ Acceptance Criteria
- [ ] Label categories defined
- [ ] Dataset loaded and inspected
- [ ] Text length statistics computed
- [ ] Macro-F1 chosen as primary metric
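
To make the metric choice concrete, here is a small sketch that computes Macro-F1 and weighted F1 by hand on toy labels. The helper names and the toy predictions are assumptions for illustration only; in practice, `sklearn.metrics.f1_score` with `average="macro"` or `average="weighted"` computes the same quantities.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} over every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true examples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy imbalanced case: the model always predicts the majority class.
y_true = ["Oncology"] * 8 + ["Dermatology"] * 2
y_pred = ["Oncology"] * 10

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro-F1: {macro:.3f}, weighted-F1: {weighted:.3f}")
```

On this toy example the weighted score stays high even though the minority class is never predicted, while the macro score drops sharply.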

## 🔧 Setup

In [1]:
# TODO 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 📊 Load Dataset

### TODO 2: Load medical text data

**Location:** `../../../datasets/medquad.csv`  
**Expected:** Text columns (`question`, `answer`) plus a column that labels can be derived from (`focus_area`)

In [4]:
# TODO 2: Load data
df = pd.read_csv('../../../datasets/medquad.csv')
print(df.head())
print(f"Shape: {df.shape}")

                                 question  \
0                What is (are) Glaucoma ?   
1                  What causes Glaucoma ?   
2     What are the symptoms of Glaucoma ?   
3  What are the treatments for Glaucoma ?   
4                What is (are) Glaucoma ?   

                                              answer           source  \
0  Glaucoma is a group of diseases that can damag...  NIHSeniorHealth   
1  Nearly 2.7 million people have glaucoma, a lea...  NIHSeniorHealth   
2  Symptoms of Glaucoma  Glaucoma can develop in ...  NIHSeniorHealth   
3  Although open-angle glaucoma cannot be cured, ...  NIHSeniorHealth   
4  Glaucoma is a group of diseases that can damag...  NIHSeniorHealth   

  focus_area  
0   Glaucoma  
1   Glaucoma  
2   Glaucoma  
3   Glaucoma  
4   Glaucoma  
Shape: (16412, 4)
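
The acceptance criteria call for text length statistics. A minimal sketch of one way to compute them, assuming whitespace-separated word counts are an adequate proxy for length. `text_length_stats` is a hypothetical helper, and `sample` is toy data (questions copied from the output above, answers are placeholders); pass `df` instead to profile the real dataset.

```python
import pandas as pd


def text_length_stats(frame, columns=("question", "answer")):
    """Summarize word counts per text column (count, mean, percentiles, max)."""
    stats = {}
    for col in columns:
        # Missing values become empty strings so they count as zero words.
        word_counts = frame[col].fillna("").astype(str).str.split().str.len()
        stats[col] = word_counts.describe(percentiles=[0.5, 0.95])
    return pd.DataFrame(stats)


# Toy stand-in for the real DataFrame; the answers are placeholder text.
sample = pd.DataFrame({
    "question": [
        "What is (are) Glaucoma ?",
        "What causes Glaucoma ?",
        "What are the symptoms of Glaucoma ?",
    ],
    "answer": ["placeholder answer text", "another placeholder answer here", None],
})

length_stats = text_length_stats(sample)
print(length_stats)
```

The 95th percentile is a common starting point for choosing a maximum sequence length, since padding every example to the single longest text wastes computation.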


## 🎯 Define Label Space

### TODO 3: Document categories

**Expected:** List of medical categories (e.g., Cardiology, Dermatology, etc.)

In [6]:
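from transformers import pipeline

# Candidate medical specialties used as the label space.
CATEGORIES = [
    "Oncology", "Cardiology", "Neurology", "Ophthalmology",
    "Endocrinology", "Pulmonology", "Gastroenterology",
    "Nephrology", "Dermatology", "Psychiatry", "Infectious Disease",
]

# Zero-shot classifier: scores each candidate label against the input text.
clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Classify a single focus_area value as a smoke test.
res = clf(
    "Breast cancer",              # focus_area text
    candidate_labels=CATEGORIES,
    multi_label=False,            # scores are normalized across the labels
)
print(res["labels"][0])  # labels are sorted by score, highest first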

Device set to use mps:0


Oncology
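
The cell above labels a single string. One possible way to extend it to the whole dataset is to classify each unique `focus_area` value once and map the result back onto every row. This is a sketch under assumptions: `assign_categories` and the `category` column are hypothetical names, and `stub_classify` is a hand-written stand-in so the example runs without downloading a model.

```python
import pandas as pd


def assign_categories(frame, classify, column="focus_area"):
    """Label each unique value once, then map the labels back onto every row."""
    unique_values = frame[column].dropna().unique()
    mapping = {value: classify(value) for value in unique_values}
    labeled = frame.copy()
    labeled["category"] = labeled[column].map(mapping)  # missing values stay NaN
    return labeled


# Stand-in classifier (hand-written lookup) that records how often it is called.
calls = []
STUB_LABELS = {"Glaucoma": "Ophthalmology", "Breast cancer": "Oncology"}


def stub_classify(text):
    calls.append(text)
    return STUB_LABELS.get(text, "Unknown")


toy = pd.DataFrame({"focus_area": ["Glaucoma", "Glaucoma", "Breast cancer", None]})
labeled = assign_categories(toy, stub_classify)
print(labeled)

# With the zero-shot pipeline from the cell above, the callable could be:
# classify = lambda text: clf(text, candidate_labels=CATEGORIES)["labels"][0]
```

Classifying unique values rather than every row keeps the number of model calls small when many rows share a focus area. Zero-shot labels are model predictions, so a manual spot check of the mapping is advisable before treating them as ground truth.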


**Categories:**

*List your categories here*

## 🤔 Reflection
1. How many categories are there? Are the classes balanced?
2. Why use Macro-F1 rather than weighted F1?
3. What ethical concerns arise in medical text classification?

**Your reflection:**

*Write here*

## 📌 Summary
✅ Dataset loaded  
✅ Labels defined  
✅ Metrics chosen

**Next:** `02_load_clean_tokenize.ipynb`