# 🏥 Enhanced Medical NER Training - Production-Ready
## Advanced Named Entity Recognition for French Medical Documents

**Objectives:**
- 🎯 Train a high-performance medical NER model
- 🇫🇷 Optimized for French medical terminology
- 📊 100+ diverse training samples across 12 entity types
- 🚀 CamemBERT-based (a French language model built on the RoBERTa architecture)
- 📈 Advanced evaluation with per-entity metrics
- 💾 Production deployment package

**Entity Types:**
`DISEASE` | `MEDICATION` | `SYMPTOM` | `DOSAGE` | `DATE` | `PROCEDURE` | `ANATOMY` | `TEST` | `LAB_VALUE` | `AGE` | `GENDER` | `FREQUENCY`

---
**⚙️ Recommended Colab Settings:**
- Runtime → Change runtime type → GPU (Tesla T4)
- Edit → Notebook settings → GPU hardware accelerator

**📦 Model:** CamemBERT-ner (CamemBERT-based French NER model, fine-tuned here for medical entities)  
**⏱️ Training Time:** ~15-20 minutes on Tesla T4  
**🎓 Perfect for:** Academic projects, medical document processing, research

## 🔧 Step 1: Environment Setup & GPU Verification

In [None]:
import torch
import sys
from datetime import datetime

print("="*70)
print("🔍 ENVIRONMENT VERIFICATION")
print("="*70)

# Python version
print(f"\n🐍 Python Version: {sys.version.split()[0]}")

# PyTorch version
print(f"🔥 PyTorch Version: {torch.__version__}")

# GPU detection
if torch.cuda.is_available():
    print(f"\n✅ GPU AVAILABLE!")
    print(f"   ├─ Device: {torch.cuda.get_device_name(0)}")
    print(f"   ├─ Compute Capability: {torch.cuda.get_device_capability(0)}")
    print(f"   ├─ Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"   └─ CUDA Version: {torch.version.cuda}")
    device = torch.device('cuda')

    # Memory check
    torch.cuda.empty_cache()
    print(f"\n💾 GPU Memory Status:")
    print(f"   ├─ Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"   └─ Reserved: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
else:
    print(f"\n⚠️  NO GPU DETECTED!")
    print(f"   └─ Training will be extremely slow. Enable GPU in Runtime settings!")
    device = torch.device('cpu')

print(f"\n🎯 Selected Device: {device}")
print(f"⏰ Start Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*70)

## 📦 Step 2: Install Dependencies

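Install pinned versions of the Hugging Face stack (`transformers`, `datasets`, `accelerate`) together with `seqeval` for entity-level NER metrics, plus `scikit-learn` and `matplotlib` for data splitting and plots.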

In [None]:
# Install pinned versions quietly (-q). The cell output is left uncaptured
# so that the summary printed below is actually displayed.
!pip install -q transformers==4.37.0 datasets==2.16.1 accelerate==0.26.1 seqeval==1.2.2 scikit-learn matplotlib

# Show installation summary
print("✅ Installation Complete!")
print("\n📦 Installed Packages:")
print("   ├─ transformers: 4.37.0 (Hugging Face)")
print("   ├─ datasets: 2.16.1 (Data processing)")
print("   ├─ accelerate: 0.26.1 (Training optimization)")
print("   ├─ seqeval: 1.2.2 (NER metrics)")
print("   ├─ scikit-learn (ML utilities)")
print("   └─ matplotlib (Visualization)")
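
Because the summary above prints the pinned versions as fixed strings, it can be worth confirming what is actually installed in the runtime. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Versions pinned in the install cell above
pinned = {"transformers": "4.37.0", "datasets": "2.16.1",
          "accelerate": "0.26.1", "seqeval": "1.2.2"}

for name, actual in installed_versions(list(pinned)).items():
    status = "OK" if actual == pinned[name] else "MISMATCH"
    print(f"{name}: expected {pinned[name]}, found {actual} [{status}]")
```

A `MISMATCH` line usually means the runtime needs a restart after installation, or that a preinstalled version took precedence.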

## 🔄 Step 3: Import Libraries & Set Random Seeds

In [None]:
# Core libraries
import torch
import torch.nn as nn
import numpy as np
import json
import random
import os
import shutil
from collections import Counter
from typing import Dict, List, Tuple

# Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback
)

# Data processing
from datasets import Dataset
from sklearn.model_selection import train_test_split

# Metrics
from seqeval.metrics import (
    classification_report as seqeval_report,
    f1_score,
    precision_score,
    recall_score
)

# Visualization
import matplotlib.pyplot as plt

# Set random seeds for reproducibility
def set_seed(seed=42):
    """Set random seeds for reproducibility"""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

set_seed(42)
print("✅ Libraries imported successfully!")
print("✅ Random seeds set for reproducibility (seed=42)")
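
A quick way to check that seeding behaves as intended is to reseed and compare the values drawn. The sketch below uses only the standard-library `random` module, and the helper `draw_numbers` is hypothetical. The same check can be applied to the NumPy and PyTorch generators seeded by `set_seed`, with the caveat that some GPU operations may remain nondeterministic even with the cuDNN flags set.

```python
import random
from typing import List


def draw_numbers(seed: int, n: int = 5) -> List[float]:
    """Reseed the global `random` generator, then draw n floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw_numbers(42)
second = draw_numbers(42)
print("Same seed gives same draws:", first == second)
```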

## 🏷️ Step 4: Define Enhanced Entity Labels (12 Types)

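The 12 entity types are encoded in BIO format: each type gets a `B-` (beginning) and an `I-` (inside) tag, and `O` marks tokens outside any entity, for 25 labels in total.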

In [None]:
# BIO tagging format (Beginning, Inside, Outside)
# 12 entity types for comprehensive medical NER
labels = [
    'O',  # Outside any entity
    'B-DISEASE', 'I-DISEASE',           # Diseases, conditions
    'B-MEDICATION', 'I-MEDICATION',     # Drugs, medicines
    'B-SYMPTOM', 'I-SYMPTOM',           # Symptoms, signs
    'B-DOSAGE', 'I-DOSAGE',             # Medication dosages (amounts, units)
    'B-DATE', 'I-DATE',                 # Dates, temporal expressions
    'B-PROCEDURE', 'I-PROCEDURE',       # Medical procedures, surgeries
    'B-ANATOMY', 'I-ANATOMY',           # Body parts, organs
    'B-TEST', 'I-TEST',                 # Lab tests, imaging
    'B-LAB_VALUE', 'I-LAB_VALUE',       # Lab result values
    'B-AGE', 'I-AGE',                   # Patient age
    'B-GENDER', 'I-GENDER',             # Patient gender
    'B-FREQUENCY', 'I-FREQUENCY'        # Medication frequency
]

label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for idx, label in enumerate(labels)}

# Entity categories for reporting
entity_types = ['DISEASE', 'MEDICATION', 'SYMPTOM', 'DOSAGE', 'DATE', 
                'PROCEDURE', 'ANATOMY', 'TEST', 'LAB_VALUE', 'AGE', 
                'GENDER', 'FREQUENCY']

print("="*70)
print("🏷️  ENHANCED LABEL SCHEMA")
print("="*70)
print(f"\n📊 Total labels: {len(labels)} (BIO format)")
print(f"📊 Entity types: {len(entity_types)}")
print(f"\n🎯 Entity Categories:")
for i, entity in enumerate(entity_types, 1):
    print(f"   {i:2d}. {entity}")

print(f"\n✅ Label mappings created!")
print(f"   ├─ label2id: {len(label2id)} mappings")
print(f"   └─ id2label: {len(id2label)} mappings")
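
To make the BIO scheme concrete, the sketch below tags a short sentence and maps the tags to ids. It assumes entity annotations are given as token-level spans `(start, end_exclusive, entity_type)`. That span format, the helper `spans_to_bio`, and the example sentence are illustrative assumptions rather than the notebook's actual data format. The label list is rebuilt in the same order as the schema above.

```python
from typing import List, Tuple

# Assumed span format: (start_token, end_token_exclusive, entity_type)
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Turn token-level entity spans into one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, entity in spans:
        tags[start] = f"B-{entity}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity}"       # continuation tokens
    return tags


# Rebuild the label list in the same order as the schema above
entity_types = ["DISEASE", "MEDICATION", "SYMPTOM", "DOSAGE", "DATE",
                "PROCEDURE", "ANATOMY", "TEST", "LAB_VALUE", "AGE",
                "GENDER", "FREQUENCY"]
labels = ["O"] + [f"{prefix}-{e}" for e in entity_types for prefix in ("B", "I")]
label2id = {label: idx for idx, label in enumerate(labels)}

# Illustrative sentence (not taken from the training data)
tokens = ["Le", "patient", "prend", "du", "paracétamol", "500", "mg"]
spans = [(4, 5, "MEDICATION"), (5, 7, "DOSAGE")]

tags = spans_to_bio(len(tokens), spans)
tag_ids = [label2id[t] for t in tags]
for token, tag, tag_id in zip(tokens, tags, tag_ids):
    print(f"{token:12s} {tag:14s} {tag_id}")
```

Here "500 mg" spans two tokens, so it receives `B-DOSAGE` followed by `I-DOSAGE`, while the single-token "paracétamol" only needs `B-MEDICATION`.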

## 📝 Step 5: Create Comprehensive Training Dataset (100+ Samples)

High-quality annotated French medical texts with diverse clinical scenarios.