# Few-shot Classification of SDoH

## 0. Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import os
from pathlib import Path
import sys
from IPython.display import display, HTML

# Add the project root to the Python path to import the modules
project_root = Path().absolute().parent
sys.path.append(str(project_root))

In [4]:
# Define SDOH categories
EXPECTED_SDOH = [
    'EmploymentStatus', 'Housing', 'Transportation', 'ParentalStatus',
    'RelationshipStatus', 'SocialSupport', 'SubstanceUse',
    'FinancialSituation', 'EducationLevel', 'FoodInsecurity',
    'NoSDOH'
]

GUEVARA_SDOH = [
    'EMPLOYMENT', 'HOUSING', 'PARENT', 'RELATIONSHIP',
    'SUPPORT', 'TRANSPORTATION'
]

# Load cleaned data
brc_referrals_cleaned = pd.read_csv("../data/processed/referrals_cleaned.csv")

## 1. Few-shot classification of SDoH

### 1.1 Loading the models

In [5]:
import torch
import transformers

# Use shared cache
os.environ['HF_HOME'] = '/data/resource/huggingface'
os.environ['TRANSFORMERS_OFFLINE'] = '1'  # Force offline mode

# What models are available
cache_dir = "/data/resource/huggingface/hub"
available_models = []

if os.path.exists(cache_dir):
    for item in os.listdir(cache_dir):
        if item.startswith("models--"):
            # Convert models--org--name to org/name format
            model_name = item.replace("models--", "").replace("--", "/")
            available_models.append(model_name)

print("Available cached models:")
for model in sorted(available_models):
    print(f"  {model}")



Available cached models:
  CohereForAI/aya-23-35B
  CohereForAI/aya-23-8B
  CohereForAI/aya-vision-8b
  HuggingFaceTB/SmolLM-135M-Instruct
  LLaMAX/LLaMAX3-8B-Alpaca
  Qwen/Qwen2.5-1.5B
  Qwen/Qwen2.5-3B
  Qwen/Qwen2.5-72B-Instruct
  Qwen/Qwen2.5-7B
  Qwen/Qwen2.5-7B-Instruct
  Qwen/Qwen2.5-7B-instruct
  Qwen/Qwen2.5-VL-7B-Instruct
  Qwen/Qwen3-0.6B
  Qwen/Qwen3-8B
  Unbabel/wmt20-comet-qe-da
  Unbabel/wmt22-comet-da
  bert-base-uncased
  bert-large-uncased
  cardiffnlp/twitter-roberta-base-sentiment
  clairebarale/refugee_cases_ner
  deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
  deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
  deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  facebook/nllb-200-3.3B
  facebook/nllb-200-distilled-1.3B
  facebook/nllb-200-distilled-600M
  google/gemma-3-1b-it
  google/gemma-3-27b-it
  google/gemma-3-27b-it-qat-q4_0-gguf
  gpt2
  gpt2-medium
  gpt2-xl
  hfl/chinese-bert-wwm
  hfl/chines

In [6]:
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Transformers version: 4.52.3
PyTorch version: 2.6.0
CUDA available: True


In [7]:
# Load one of the instruction-tuned models
# Qwen/Qwen2.5-7B-Instruct
# meta-llama/Llama-3.1-8B-Instruct
# microsoft/Phi-4-mini-instruct
# mistralai/Mistral-7B-Instruct-v0.3

from utils.model_helpers import load_instruction_model

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer, model = None, None

tokenizer, model = load_instruction_model(model_name)

Loading meta-llama/Llama-3.1-8B-Instruct...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

✓ meta-llama/Llama-3.1-8B-Instruct loaded successfully!


### 1.2 Extraction from one note

In [8]:
# Load a specific note: Case Reference = CAS-467812
sample_note = brc_referrals_cleaned[brc_referrals_cleaned['Case Reference'] == 'CAS-467812'].iloc[0]['Referral Notes (depersonalised)']

#### Defining the classification task

I define a classification task similar to that of Guevara et al. (2024), and add 4 of the 21 SDoH categories used in Keloth et al. (2025).

**Task**: Multi-label sentence-level classification

**SDoH**: 10 SDoH categories

Guevara et al. (2024) & Keloth et al. (2025):
- Employment status 
- Housing issues 
- Transportation issues 
- Parental status
- Relationship status
- Social support

Keloth et al. (2025):
- Substance use
- Financial issues
- Education level 
- Food insecurity

I define two levels of classification, similarly to the papers cited above:

1. Level 1: *Any SDoH mentions*. The presence of language describing an SDoH category as defined above, regardless of the attribute. 
2. Level 2: *Adverse SDoH mentions*. The presence or absence of language describing an SDoH category with an attribute that could create an additional social work or resource support need for patients:
    - Employment status: unemployed, underemployed, disability
    - Housing issues: financial status, undomiciled, other
    - Transportation issues: distance, resources, other
    - Parental status: having a child under 18 years old
    - Relationship status: widowed, divorced, single
    - Social support: absence of social support
    - Substance use: alcohol abuse, drug use, smoking
    - Financial issues: poverty, debt, inability to pay bills, benefit dependency
    - Education level: low education, illiteracy, lack of qualifications
    - Food insecurity: hunger, inability to afford food, reliance on food banks, poor nutrition
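
To make the Level 2 definitions easier to reuse, the adverse attributes can be kept in a small lookup table. The sketch below is illustrative only: `ADVERSE_ATTRIBUTES` and `format_level2_guidelines` are hypothetical names that do not come from the project's `utils` modules, and the keys assume the label spelling used in `EXPECTED_SDOH`.

```python
# Hypothetical lookup table (not part of the project's utils).
# Keys follow the label names in EXPECTED_SDOH; values are copied
# from the Level 2 definitions above.
ADVERSE_ATTRIBUTES = {
    "EmploymentStatus": ["unemployed", "underemployed", "disability"],
    "Housing": ["financial status", "undomiciled", "other"],
    "Transportation": ["distance", "resources", "other"],
    "ParentalStatus": ["having a child under 18 years old"],
    "RelationshipStatus": ["widowed", "divorced", "single"],
    "SocialSupport": ["absence of social support"],
    "SubstanceUse": ["alcohol abuse", "drug use", "smoking"],
    "FinancialSituation": ["poverty", "debt", "inability to pay bills",
                           "benefit dependency"],
    "EducationLevel": ["low education", "illiteracy",
                       "lack of qualifications"],
    "FoodInsecurity": ["hunger", "inability to afford food",
                       "reliance on food banks", "poor nutrition"],
}


def format_level2_guidelines(attributes):
    """Render one '- Label: attr1, attr2' line per SDoH category."""
    return "\n".join(
        f"- {label}: {', '.join(values)}"
        for label, values in attributes.items()
    )


print(format_level2_guidelines(ADVERSE_ATTRIBUTES))
```

A table like this could feed a Level 2 prompt or an annotation guideline, so that the definitions live in one place.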

In [9]:
from utils.prompt_creation_helpers import create_automated_prompt

prompt_example_basic = create_automated_prompt("This is a sentence", tokenizer=tokenizer, prompt_type="five_shot_basic", level=1)
print("=" * 50)
print("Example Prompt (Five Shot Basic):")
print("=" * 50)
print(prompt_example_basic)

prompt_example_detailed = create_automated_prompt("This is a sentence", tokenizer=tokenizer, prompt_type="five_shot_detailed", level=1)
print("=" * 50)
print("Example Prompt (Five Shot Detailed):")
print("=" * 50)
print(prompt_example_detailed)

prompt_example_basic_lvl2 = create_automated_prompt("This is a sentence", tokenizer=tokenizer, prompt_type="five_shot_basic", level=2)
print("=" * 50)
print("Example Prompt (Five Shot Basic - Level 2):")
print("=" * 50)
print(prompt_example_basic_lvl2)

Using chat template for meta-llama/llama-3.1-8b-instruct
Example Prompt (Five Shot Basic):
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are analyzing a referral note sentence to identify mentions of Social Determinants of Health (SDoH).

Given a sentence, output all SDoH factors that can be inferred from that sentence from the following list: 
EmploymentStatus, Housing, Transportation, ParentalStatus, RelationshipStatus, SocialSupport, SubstanceUse, FinancialSituation, EducationLevel, FoodInsecurity. 

If the sentence does NOT mention any of the above categories, output <LIST>NoSDoH</LIST>.

Your response must be a comma-separated list of SDoH factors embedded with <LIST> and </LIST>.

**STRICT RULES**: 
- DO NOT generate any other text, explanations, or new SDoH labels.
- A sentence CAN be labeled with one or more SDoH factors.
- Your response must ONLY contain the <LIST>...</LIST> format.
- Do not cont

In [None]:
from utils.SDoH_classification_helpers import SDoHExtractor

# Initialize the SDoH extractor
extractor_lvl1 = SDoHExtractor(
    model=model,
    tokenizer=tokenizer,
    prompt_type="five_shot_detailed",
    level=1,
    debug=True,
)

# Extract SDoH factors
results_lvl1 = extractor_lvl1.extract_from_note(sample_note)
results_lvl1_df = extractor_lvl1.results_to_dataframe(results_lvl1, note_id="sample")

print("\nExtracted SDoH Factors (Level 1):")
display(results_lvl1_df)

In [None]:
# Some debugging
print("Prompt: \n")
print(results_lvl1['sentences'][1]['debug']['prompt'])

print("Raw response: \n")
print(results_lvl1['sentences'][1]['debug']['raw_response'])

In [None]:
# Initialize the SDoH extractor (level 2)
extractor_lvl2 = SDoHExtractor(
    model=model,
    tokenizer=tokenizer,
    prompt_type="five_shot_basic",
    level=2,
    debug=True,
)

# Extract SDoH factors
results_lvl2 = extractor_lvl2.extract_from_note(sample_note)
results_lvl2_df = extractor_lvl2.results_to_dataframe(results_lvl2, note_id="sample")

print("\nExtracted SDoH Factors (Level 2):")
display(results_lvl2_df)

### 1.3 Extracting from multiple notes (batch processing) and evaluating few-shot extraction

The following script processes a batch of notes using the same SDoH extractor that was applied to a single note above. It exposes several options:

- model name
- prompt type
- level of classification (1 for mention of SDoH, 2 for adverse vs. protective mention)
- batch size and start index


To run it, enter the following command in the terminal, after activating the conda environment and adjusting the options:

```console
python scripts/batch_process_notes.py --model_name "meta-llama/Llama-3.1-8B-Instruct" \
                                 --prompt_type "five_shot_basic" \
                                 --level 1 \
                                 --batch_size 10 \
                                 --start_index 0
```

We can also evaluate models on the annotated dataset, using a separate script:

```console
python scripts/evaluate_on_annotations.py --model_name "meta-llama/Llama-3.1-8B-Instruct" \
                                  --prompt_type "five_shot_basic" \
                                  --level 1 \
                                  --annotation_data "data/raw/BRC-Data/annotated_BRC_referrals.csv" \
                                  --sample_size 5
```

The evaluation system has two main components:

- **Main evaluation script** (`evaluate_on_annotations.py`): the orchestrator that runs everything;
- **Supporting utilities** (`utils/evaluation_helpers_lvl1.py`): these handle specific tasks such as model loading, SDoH extraction, and metric calculation.

The evaluation script does the following for each sentence:

1. **Extract**: Run the sentence through the SDoHExtractor
2. **Format**: Convert the model's list output to a comparable string
3. **Record**: Store both human and model labels plus metadata
4. **Structure**: Build a DataFrame ready for multi-label metrics calculation
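
The parsing behind the Extract and Format steps is not shown in this notebook, since it happens inside `SDoHExtractor`. The sketch below shows one way a raw `<LIST>...</LIST>` response could be turned into a comparable string. `parse_list_response` and `to_comparison_string` are hypothetical helpers, and falling back to `NoSDoH` when no valid label is found is an assumption, not necessarily what the project code does.

```python
import re

# Labels the prompt allows (taken from the prompt text above).
VALID_LABELS = {
    "EmploymentStatus", "Housing", "Transportation", "ParentalStatus",
    "RelationshipStatus", "SocialSupport", "SubstanceUse",
    "FinancialSituation", "EducationLevel", "FoodInsecurity",
}


def parse_list_response(raw_response, valid_labels=VALID_LABELS):
    """Extract known labels from a '<LIST>a, b</LIST>' response."""
    match = re.search(r"<LIST>(.*?)</LIST>", raw_response, flags=re.DOTALL)
    if match is None:
        return ["NoSDoH"]  # assumption: treat malformed output as NoSDoH
    candidates = [item.strip() for item in match.group(1).split(",")]
    factors = [label for label in candidates if label in valid_labels]
    return factors or ["NoSDoH"]


def to_comparison_string(factors):
    """Sort labels so that ordering differences do not count as errors."""
    real = sorted(label for label in factors if label != "NoSDoH")
    return ", ".join(real) if real else "NoSDoH"


raw = "<LIST>Transportation, FinancialSituation, SocialSupport</LIST>"
factors = parse_list_response(raw)
print(factors)
print(to_comparison_string(factors))
```

Sorting before joining means a prediction in a different order from the human label is not treated as a mismatch by a plain string comparison.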

Demonstration:

In [None]:
from utils.SDoH_classification_helpers import SDoHExtractor
from utils.model_helpers import load_instruction_model

# Load annotated data (first few rows for demo)
annotated_df = pd.read_csv("../data/raw/BRC-Data/annotated_BRC_referrals.csv")
sample_df = annotated_df.head(5).copy()  # Just 5 sentences for demo (copy so the rename below stays local)
sample_df.columns = ['CAS', 'Sentence', 'Label', 'Adverse', 'Comments']  # Standardise column names

print("Sample annotated data:")
print(sample_df[['Sentence', 'Label']])

# Load model
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer, model = load_instruction_model(model_name)

# Create extractor
extractor = SDoHExtractor(
    model=model,
    tokenizer=tokenizer,
    prompt_type="five_shot_basic",
    level=1,
    debug=True 
)

# Process each sentence and build results
results = []
for idx, row in sample_df.iterrows():
    sentence = str(row['Sentence']).strip()
    
    # Extract SDoH from sentence
    extraction_result = extractor.extract_from_sentence(sentence)
    factors = extraction_result["sdoh_factors"]
    
    # Convert to comparison format
    model_prediction = ", ".join(sorted(factors)) if factors and factors != ["NoSDoH"] else "NoSDoH"
    
    # Create result record (same structure as the evaluation script)
    result = {
        'cas': row['CAS'],
        'sentence_number': idx + 1,
        'original_sentence': sentence,
        'original_label': row['Label'],
        'model_prediction': model_prediction,
        'model_factors_list': factors,
        'model_has_sdoh': factors != ["NoSDoH"] and bool(factors),
        'num_model_factors': len(factors) if factors != ["NoSDoH"] else 0,
    }
    
    results.append(result)
    
    # Show what happened
    print(f"\n--- Sentence {idx + 1} ---")
    print(f"Text: {sentence[:60]}...")
    print(f"Human labeled: {row['Label']}")
    print(f"Model predicted: {model_prediction}")
    if extraction_result.get("debug"):
        print(f"Raw model response: {extraction_result['debug']['raw_response']}")

# Convert to DataFrame (ready for metrics calculation)
results_df = pd.DataFrame(results)

Sample annotated data:
                                            Sentence  \
0  She needs help with food , toiletry and some cash   
1  Mr PERSON was having support of a friend who h...   
2  Isolated , housing concern impacting MH SU pre...   
3      Equipment delivery to ensure safer discharge.   
4  XXXX shopping FBG Patient is no longer driving...   

                                Label  
0  FoodInsecurity, FinancialSituation  
1                       SocialSupport  
2              SocialSupport, Housing  
3                              NoSDoH  
4                      Transportation  
Loading meta-llama/Llama-3.1-8B-Instruct...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


✓ meta-llama/Llama-3.1-8B-Instruct loaded successfully!
Using chat template for meta-llama/llama-3.1-8b-instruct


--- Sentence 1 ---
Text: She needs help with food , toiletry and some cash...
Human labeled: FoodInsecurity, FinancialSituation
Model predicted: FinancialSituation, FoodInsecurity
Raw model response: <LIST>FinancialSituation, FoodInsecurity</LIST>
Using chat template for meta-llama/llama-3.1-8b-instruct


--- Sentence 2 ---
Text: Mr PERSON was having support of a friend who had a car accid...
Human labeled: SocialSupport
Model predicted: SocialSupport, Transportation
Raw model response: <LIST>SocialSupport, Transportation</LIST>
Using chat template for meta-llama/llama-3.1-8b-instruct


--- Sentence 3 ---
Text: Isolated , housing concern impacting MH SU previously suppor...
Human labeled: SocialSupport, Housing
Model predicted: Housing, SocialSupport, SubstanceUse
Raw model response: <LIST>Housing, SocialSupport, SubstanceUse</LIST>
Using chat template for meta-llama/llama-3.1-8b-instruct


--- Sentence 4 ---
Text: Equipment delivery to ensure safer discharge....
Human labeled: NoSDoH
Model predicted: NoSDoH
Raw model response: <LIST>NoSDoH</LIST>
Using chat template for meta-llama/llama-3.1-8b-instruct

--- Sentence 5 ---
Text: XXXX shopping FBG Patient is no longer driving , therefore n...
Human labeled: Transportation
Model predicted: FinancialSituation, SocialSupport, Transportation
Raw model response: <LIST>Transportation, FinancialSituation, SocialSupport</LIST>

Results DataFrame:
                       original_label  \
0  FoodInsecurity, FinancialSituation   
1                       SocialSupport   
2              SocialSupport, Housing   
3                              NoSDoH   
4                      Transportation   

                                    model_prediction  model_has_sdoh  
0                 FinancialSituation, FoodInsecurity            True  
1                      SocialSupport, Transportation            True  
2               Housing, SocialS

In [7]:
results_df

| | cas | sentence_number | original_sentence | original_label | model_prediction | model_factors_list | model_has_sdoh | num_model_factors |
|---|---|---|---|---|---|---|---|---|
| 0 | CAS-548353 | 1 | She needs help with food , toiletry and some cash | FoodInsecurity, FinancialSituation | FinancialSituation, FoodInsecurity | [FinancialSituation, FoodInsecurity] | True | 2 |
| 1 | CAS-548411 | 2 | Mr PERSON was having support of a friend who h... | SocialSupport | SocialSupport, Transportation | [SocialSupport, Transportation] | True | 2 |
| 2 | CAS-548427 | 3 | Isolated , housing concern impacting MH SU pre... | SocialSupport, Housing | Housing, SocialSupport, SubstanceUse | [Housing, SocialSupport, SubstanceUse] | True | 3 |
| 3 | CAS-548530 | 4 | Equipment delivery to ensure safer discharge. | NoSDoH | NoSDoH | [NoSDoH] | False | 0 |
| 4 | CAS-548590 | 5 | XXXX shopping FBG Patient is no longer driving... | Transportation | FinancialSituation, SocialSupport, Transportation | [Transportation, FinancialSituation, SocialSup... | True | 3 |


Once the annotated sentences have been classified, the `calculate_multilabel_metrics` function from `utils/evaluation_helpers.py` performs three main steps:

1. Label parsing & preparation
2. Binary matrix conversion

    Example: If the labels are ["Housing", "Employment", "Social Support"]:
    - ["Housing", "Employment"] becomes [1, 1, 0]
    - ["Housing"] becomes [1, 0, 0]
    - ["NoSDoH"] becomes [0, 0, 0]

3. Multi-label metrics calculation

    The function calculates four types of metrics:
    - Example-based: How well does the model predict the exact set of labels for each sentence?
    - Label-based: How well does the model perform on each individual label?
    - Per-label: Performance breakdown for each SDoH factor
    - Statistics: Overall dataset characteristics
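
The first two steps (label parsing and binary matrix conversion) can be sketched in a few lines of plain Python. This is an illustration rather than the code in `utils/evaluation_helpers.py`: `parse_labels` and `to_binary_row` are hypothetical helpers, and the three-label vocabulary is the toy one from the example above.

```python
# Toy label vocabulary from the example above (not the full SDoH list).
LABELS = ["Housing", "Employment", "Social Support"]


def parse_labels(label_string):
    """Split a comma-separated label string into a set; NoSDoH -> empty set."""
    parts = [part.strip() for part in str(label_string).split(",")]
    return {part for part in parts if part and part != "NoSDoH"}


def to_binary_row(label_set, labels=LABELS):
    """One 0/1 indicator per label, in the fixed order given by `labels`."""
    return [1 if label in label_set else 0 for label in labels]


annotations = ["Housing, Employment", "Housing", "NoSDoH"]
matrix = [to_binary_row(parse_labels(a)) for a in annotations]
print(matrix)  # [[1, 1, 0], [1, 0, 0], [0, 0, 0]]
```

Keeping the label order fixed is what makes the human and model matrices comparable column by column.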

I can now dive deeper into the metrics used for multi-label classification.

1. **Example-Based Metrics (averaged across sentences)**. These look at how well the model predicts the complete set of labels for each sentence:

    - Exact Match Ratio (Subset Accuracy): Proportion of sentences for which the model's predicted label set exactly matches the human annotation. For example, ["Housing", "Employment"] vs. ["Housing", "Employment"] scores 1.0 (✓), while ["Housing"] vs. ["Housing", "Employment"] scores 0.0 (✗).
    - Hamming Loss: Fraction of individual label assignments that are wrong (lower is better), counted over every sentence-label pair. For example, with 3 possible labels and 1 wrong assignment for a sentence, the Hamming loss for that sentence is 1/3 ≈ 0.33.
    - Additional metrics: Example-based Precision/Recall/F1; Jaccard Index
2. **Label-Based Metrics (averaged across labels)**. These treat each SDoH factor as a separate binary classification problem:

    - Macro-averaged (treats all labels equally): compute precision/recall/F1 for each label separately, then average them, so rare labels get the same weight as common ones.
    - Micro-averaged (weighted by frequency): pool all true positives, false positives, and false negatives across labels before computing the metrics, so common labels have more influence.

3. **Per-Label Performance**. Individual precision/recall/F1 for each SDoH factor.
4. **Dataset Statistics**

    - Label Cardinality: Average number of labels per sentence
    - Label Density: Cardinality divided by total possible labels
    - Coverage: How many different labels appear at least once
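
To make these definitions concrete, the sketch below computes several of them in plain Python on the five demo sentences shown earlier. It is a hand-rolled illustration, not the implementation in `calculate_multilabel_metrics`. The function names are hypothetical, F1 for a label with no true or predicted instances is set to 0, and two empty label sets are scored as a perfect Jaccard match; both conventions are assumptions.

```python
LABELS = [
    "EmploymentStatus", "Housing", "Transportation", "ParentalStatus",
    "RelationshipStatus", "SocialSupport", "SubstanceUse",
    "FinancialSituation", "EducationLevel", "FoodInsecurity",
]

# Human and model label sets for the five demo sentences (NoSDoH = empty set).
y_true = [{"FoodInsecurity", "FinancialSituation"}, {"SocialSupport"},
          {"SocialSupport", "Housing"}, set(), {"Transportation"}]
y_pred = [{"FinancialSituation", "FoodInsecurity"},
          {"SocialSupport", "Transportation"},
          {"Housing", "SocialSupport", "SubstanceUse"}, set(),
          {"FinancialSituation", "SocialSupport", "Transportation"}]


def exact_match_ratio(true_sets, pred_sets):
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)


def hamming_loss(true_sets, pred_sets, labels):
    # Symmetric difference = labels assigned by exactly one side.
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * len(labels))


def jaccard_index(true_sets, pred_sets):
    # Assumption: two empty sets count as a perfect match.
    scores = [len(t & p) / len(t | p) if (t | p) else 1.0
              for t, p in zip(true_sets, pred_sets)]
    return sum(scores) / len(scores)


def counts(true_sets, pred_sets, label):
    """True positives, false positives, false negatives for one label."""
    pairs = list(zip(true_sets, pred_sets))
    tp = sum(label in t and label in p for t, p in pairs)
    fp = sum(label not in t and label in p for t, p in pairs)
    fn = sum(label in t and label not in p for t, p in pairs)
    return tp, fp, fn


def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumption: undefined F1 -> 0


def macro_f1(true_sets, pred_sets, labels):
    per_label = [f1_score(*counts(true_sets, pred_sets, l)) for l in labels]
    return sum(per_label) / len(labels)


def micro_f1(true_sets, pred_sets, labels):
    per_label = [counts(true_sets, pred_sets, l) for l in labels]
    tp, fp, fn = (sum(c[i] for c in per_label) for i in range(3))
    return f1_score(tp, fp, fn)


cardinality = sum(len(t) for t in y_true) / len(y_true)
density = cardinality / len(LABELS)
coverage = len(set().union(*y_true))

print("Exact match:", round(exact_match_ratio(y_true, y_pred), 3))     # 0.4
print("Hamming loss:", round(hamming_loss(y_true, y_pred, LABELS), 3))  # 0.08
print("Jaccard:", round(jaccard_index(y_true, y_pred), 3))              # 0.7
print("Macro F1:", round(macro_f1(y_true, y_pred, LABELS), 3))          # 0.413
print("Micro F1:", round(micro_f1(y_true, y_pred, LABELS), 3))          # 0.75
print("Cardinality, density, coverage:",
      round(cardinality, 3), round(density, 3), coverage)
```

On this tiny sample the macro F1 is much lower than the micro F1 because labels that never occur contribute a score of 0 under the convention chosen here, which illustrates how sensitive macro averaging is to rare or absent labels.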