# 06 - Push to Hugging Face Hub

## Goal

Publish the model to the Hugging Face Hub with a model card, optionally push a small sample dataset, and then configure an inference widget.


In [7]:
# === TODO (you code this) ===
# Goal: Import libraries for Hugging Face Hub interaction.
# Hints:
# 1) os, Path, huggingface_hub (login, HfApi, create_repo, upload_folder)
# Acceptance:
# - All imports successful

# TODO: import libraries
import os
from pathlib import Path
from huggingface_hub import HfApi, login, create_repo, upload_folder
from dotenv import load_dotenv



## Login & Setup

Set `HUGGINGFACE_API_KEY` in your environment or in a `.env` file. The cell below loads `.env` with `load_dotenv()` and reads the token with `os.getenv`.


In [None]:
# === TODO (you code this) ===
# Goal: Authenticate with Hugging Face Hub.
# Hints:
# 1) Get HF token from environment variable
# 2) Call login() if token available
# 3) Define repo names
# Acceptance:
# - Logged in or message shown
# - MODEL_REPO and DATASET_REPO defined

# TODO: login and define repo names
load_dotenv()
hf_token = os.getenv("HUGGINGFACE_API_KEY")

MODEL_REPO = "Tuminha/dental-evidence-triage"
DATASET_REPO = "Tuminha/dental-evidence-dataset"

if hf_token:
    login(token=hf_token)
    print("Logged in to Hugging Face Hub")
else:
    print("No token provided, skipping login")





Logged in to Hugging Face Hub
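
Token variable names tend to differ between machines and tutorials. As a minimal sketch, assuming the token may be stored under any of a few names, a lookup helper could try them in order. `find_hf_token` and the candidate list are hypothetical and not part of `huggingface_hub`.

```python
import os
from typing import Mapping, Optional, Sequence

# Candidate names are an assumption for illustration; adjust to your own .env.
TOKEN_ENV_NAMES = ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN", "HUGGINGFACE_API_KEY")


def find_hf_token(
    env: Mapping[str, str] = os.environ,
    names: Sequence[str] = TOKEN_ENV_NAMES,
) -> Optional[str]:
    """Return the first non-empty token found under the given names, else None."""
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

After `load_dotenv()`, `hf_token = find_hf_token()` could stand in for the single `os.getenv` call.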


## Prepare Model Card

Load the template and fill in your actual metrics from notebook 05.


In [14]:
# === TODO (you code this) ===
# Goal: Prepare model card with actual metrics.
# Hints:
# 1) Load MODEL_CARD_TEMPLATE.md
# 2) Replace [TBD] placeholders with your metrics from notebook 05
# 3) Write to ../artifacts/model/best/README.md
# Acceptance:
# - README.md created with filled metrics

# Load template
template_path = Path("../MODEL_CARD_TEMPLATE.md")
with open(template_path, "r", encoding="utf-8") as file:  # template contains non-ASCII characters
    template = file.read()

# Metrics from Notebook 05 (Test Set Evaluation)
# Aggregate metrics
micro_f1 = 0.8917
macro_f1 = 0.7397
micro_precision = 0.8966
micro_recall = 0.8868
macro_precision = 0.8201
macro_recall = 0.7596

# Per-label metrics (from classification_report in Notebook 05)
per_label_metrics = {
    'SystematicReview': {'precision': 0.81, 'recall': 0.93, 'f1': 0.87, 'support': 1326},
    'MetaAnalysis': {'precision': 0.77, 'recall': 0.97, 'f1': 0.86, 'support': 601},
    'RCT': {'precision': 0.70, 'recall': 0.92, 'f1': 0.80, 'support': 1046},
    'ClinicalTrial': {'precision': 0.64, 'recall': 0.28, 'f1': 0.39, 'support': 103},
    'Cohort': {'precision': 0.69, 'recall': 0.89, 'f1': 0.78, 'support': 1768},
    'CaseControl': {'precision': 0.89, 'recall': 0.04, 'f1': 0.08, 'support': 1513},
    'CaseReport': {'precision': 0.95, 'recall': 0.89, 'f1': 0.92, 'support': 1409},
    'InVitro': {'precision': 0.93, 'recall': 0.93, 'f1': 0.93, 'support': 2183},
    'Animal': {'precision': 0.86, 'recall': 0.79, 'f1': 0.82, 'support': 1651},
    'Human': {'precision': 0.95, 'recall': 0.96, 'f1': 0.96, 'support': 16489}
}

# Replace aggregate metrics
template = template.replace("[TARGET: ≥0.75]", f"{micro_f1:.4f}")
template = template.replace("[Expected lower due to imbalance]", f"{macro_f1:.4f}")
template = template.replace("**Micro-Precision** | [TBD]", f"**Micro-Precision** | {micro_precision:.4f}")
template = template.replace("**Micro-Recall** | [TBD]", f"**Micro-Recall** | {micro_recall:.4f}")
template = template.replace("**Macro-Precision** | [TBD]", f"**Macro-Precision** | {macro_precision:.4f}")
template = template.replace("**Macro-Recall** | [TBD]", f"**Macro-Recall** | {macro_recall:.4f}")

# Replace per-label table
per_label_table = "| Label | Precision | Recall | F1 | Support |\n"
per_label_table += "|-------|-----------|--------|-----|---------|\n"
for label in ['SystematicReview', 'MetaAnalysis', 'RCT', 'ClinicalTrial', 'Cohort', 
              'CaseControl', 'CaseReport', 'InVitro', 'Animal', 'Human']:
    metrics = per_label_metrics[label]
    per_label_table += f"| {label} | {metrics['precision']:.2f} | {metrics['recall']:.2f} | {metrics['f1']:.2f} | {metrics['support']} |\n"

# Find and replace the per-label table section
import re
# Match the entire table including header, separator, and all rows
# Pattern matches from "| Label | Precision..." through the separator line to the last "| Human |..."
pattern = r'\| Label \| Precision \| Recall \| F1 \| Support \|\n\|-+\|.*?\| Human \| 0\.XX \| 0\.XX \| 0\.XX \| XXX \|'
replacement = per_label_table.strip()
template = re.sub(pattern, replacement, template, flags=re.DOTALL)

# Fallback: if regex didn't match, use string replacement
if "| Human | 0.XX" in template:
    # Find the table start (header) and end (last row)
    start_marker = "| Label | Precision | Recall | F1 | Support |"
    end_marker = "| Human | 0.XX | 0.XX | 0.XX | XXX |"
    start_idx = template.find(start_marker)
    if start_idx != -1:
        # Find the end of the table (after the last row)
        end_idx = template.find(end_marker, start_idx)
        if end_idx != -1:
            end_idx = template.find("\n", end_idx) + 1  # Include the newline after the last row
            # Replace the entire table section
            template = template[:start_idx] + per_label_table.strip() + "\n" + template[end_idx:]

# Update training data section with actual numbers from README
template = template.replace("~50,000–100,000 (varies by query scope)", "64,981 labeled articles (from 76,165 total)")
template = template.replace("≤2021 (~60-70%)", "≤2021 (29,926 articles, 46.3%)")
template = template.replace("2022-2023 (~15-20%)", "2022-2023 (16,057 articles, 24.8%)")
template = template.replace("≥2024 (~15-20%)", "≥2024 (18,666 articles, 28.9%)")

# Update hyperparameters with actual values from training
template = template.replace("**Learning Rate:** 5e-5", "**Learning Rate:** 2e-5")
template = template.replace("**Epochs:** 3–4", "**Epochs:** 3")

# Update hardware specification
template = template.replace("- **Hardware:** [Specify: e.g., 1x NVIDIA T4, 16GB RAM]", 
                            "- **Hardware:** Apple Silicon (MPS - Metal Performance Shaders) on macOS")

# Add YAML front matter for Hugging Face Hub
yaml_front_matter = """---
library_name: transformers
license: mit
tags:
- multi-label-classification
- dental
- medical
- distilbert
- text-classification
- evidence-based-medicine
- systematic-review
pipeline_tag: text-classification
datasets:
- pubmed
metrics:
- f1
- precision
- recall
model-index:
- name: dental-evidence-triage
  results:
  - task:
      type: text-classification
      name: Multi-label Text Classification
    dataset:
      name: PubMed Dental Abstracts
      type: pubmed
    metrics:
    - type: f1
      value: 0.8917
      name: Micro-F1
    - type: f1
      value: 0.7397
      name: Macro-F1
    - type: precision
      value: 0.8966
      name: Micro-Precision
    - type: recall
      value: 0.8868
      name: Micro-Recall
base_model: distilbert-base-uncased
---

"""

# Prepend YAML front matter to template
template_with_yaml = yaml_front_matter + template

# Write to model directory
output_path = Path("../artifacts/model/best/README.md")
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w", encoding="utf-8") as file:  # card contains non-ASCII characters
    file.write(template_with_yaml)

print(f"✅ Model card created at {output_path}")
print("   - YAML metadata added")
print(f"   - Micro-F1: {micro_f1:.4f}")
print(f"   - Macro-F1: {macro_f1:.4f}")
print("   - Per-label metrics filled for all 10 labels")





✅ Model card created at ../artifacts/model/best/README.md
   - YAML metadata added
   - Micro-F1: 0.8917
   - Macro-F1: 0.7397
   - Per-label metrics filled for all 10 labels
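
`str.replace` returns the text unchanged when the placeholder is absent, so a renamed placeholder in the template would go unnoticed. The following is a minimal sketch, not the notebook's actual method, of a helper that reports placeholders it could not find and flags leftover markers. `fill_placeholders` and `leftover_markers` are hypothetical names; the `[TBD]` and `0.XX` markers are taken from the cell above.

```python
from typing import Dict, List, Sequence, Tuple


def fill_placeholders(text: str, replacements: Dict[str, str]) -> Tuple[str, List[str]]:
    """Apply each replacement and collect the placeholders that were not found."""
    missing = []
    for placeholder, value in replacements.items():
        if placeholder in text:
            text = text.replace(placeholder, value)
        else:
            missing.append(placeholder)
    return text, missing


def leftover_markers(text: str, markers: Sequence[str] = ("[TBD]", "0.XX")) -> List[str]:
    """Return the unfilled markers that are still present in the text."""
    return [marker for marker in markers if marker in text]
```

A non-empty `missing` list or a non-empty result from `leftover_markers` would be a reason to inspect the template before writing `README.md`.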


## Push Model to Hub


In [15]:
# === TODO (you code this) ===
# Goal: Push model folder to Hugging Face Hub.
# Hints:
# 1) Create HfApi instance
# 2) Create repo (with exist_ok=True)
# 3) Upload folder from ../artifacts/model/best
# Acceptance:
# - Model uploaded successfully
# - Accessible at huggingface.co/Tuminha/dental-evidence-triage

# TODO: push model
hf_api = HfApi()

# Create repository (if it doesn't exist)
hf_api.create_repo(MODEL_REPO, exist_ok=True)

# Upload the model folder
# Note: arguments are passed by keyword, so their order does not matter
folder_path = Path("../artifacts/model/best")
hf_api.upload_folder(
    folder_path=str(folder_path),
    repo_id=MODEL_REPO,
    repo_type="model"
)

print("✅ Model uploaded successfully!")
print(f"   View at: https://huggingface.co/{MODEL_REPO}")


✅ Model uploaded successfully!
   View at: https://huggingface.co/Tuminha/dental-evidence-triage
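
Before uploading, it may help to confirm that the folder holds what the Hub needs to load the model. This is a minimal sketch under the assumption that the folder was written by `save_pretrained` and so contains `config.json` plus one weights file; the file names and the `missing_model_files` helper are illustrative and should be adjusted to the actual contents of `../artifacts/model/best`.

```python
from pathlib import Path
from typing import List

# Assumed file names, based on what `save_pretrained` commonly writes.
REQUIRED_FILES = ("config.json", "README.md")
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def missing_model_files(folder: Path) -> List[str]:
    """List required files absent from the folder; either weights format is accepted."""
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
    if not any((folder / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing
```

Calling `missing_model_files(Path("../artifacts/model/best"))` and stopping when the list is non-empty avoids pushing an incomplete repository.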


## (Optional) Push Sample Dataset

Push a small sample (2-5k rows) for reproducibility.


In [16]:
# === TODO (you code this) ===
# Goal: (Optional) Push sample dataset for reproducibility.
# Hints:
# 1) Load train.parquet, sample 2000 rows
# 2) Create text column, keep key fields
# 3) Save locally, then upload to dataset repo
# Acceptance:
# - Sample dataset available on HF

# TODO: (optional) push sample dataset
import pandas as pd

# Load train.parquet from correct path
path_train_parquet = Path("../data/processed/train.parquet")
train_df = pd.read_parquet(path_train_parquet)

# Sample 2000 rows (with random seed for reproducibility)
sample_df = train_df.sample(n=2000, random_state=42)

# Create text column (title + abstract)
sample_df['text'] = sample_df['title'] + ' ' + sample_df['abstract']
sample_df['text'] = sample_df['text'].str[:2000]  # Truncate to 2000 chars (same as training)

# Keep key fields: pmid, title, abstract, text, labels, year
sample_df = sample_df[['pmid', 'title', 'abstract', 'text', 'labels', 'year']]

# Save locally to a temp location
path_sample_parquet = Path("../artifacts/sample_dataset.parquet")
path_sample_parquet.parent.mkdir(parents=True, exist_ok=True)
sample_df.to_parquet(path_sample_parquet)

print(f"✅ Sample dataset prepared: {len(sample_df)} rows")
print(f"   Saved to: {path_sample_parquet}")

# Create dataset repository (if it doesn't exist)
hf_api.create_repo(DATASET_REPO, repo_type="dataset", exist_ok=True)

# Upload sample dataset to Hugging Face Hub
hf_api.upload_file(
    path_or_fileobj=str(path_sample_parquet),
    repo_id=DATASET_REPO,
    path_in_repo="sample.parquet",
    repo_type="dataset"
)

print(f"✅ Sample dataset uploaded to {DATASET_REPO}")
print(f"   View at: https://huggingface.co/datasets/{DATASET_REPO}")




✅ Sample dataset prepared: 2000 rows
   Saved to: ../artifacts/sample_dataset.parquet


✅ Sample dataset uploaded to Tuminha/dental-evidence-dataset
   View at: https://huggingface.co/datasets/Tuminha/dental-evidence-dataset
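
In pandas, concatenating a string with a missing value yields a missing value, so a row without a title or abstract would end up with an empty `text` field. The sketch below is a variant of the text construction, not the cell's exact behaviour: it treats non-string values as empty and strips surrounding whitespace. `build_text` is a hypothetical helper; the 2000-character limit comes from the cell above.

```python
from typing import Any


def build_text(title: Any, abstract: Any, max_chars: int = 2000) -> str:
    """Join title and abstract with a space, skipping missing values, then truncate."""
    parts = [
        part.strip()
        for part in (title, abstract)
        if isinstance(part, str) and part.strip()  # NaN and None are skipped
    ]
    return " ".join(parts)[:max_chars]
```

It could be applied row by row, for example over `zip(sample_df['title'], sample_df['abstract'])`.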


## Configure Inference Widget

1. Go to https://huggingface.co/Tuminha/dental-evidence-triage
2. Settings → Model Card → Add example inputs
3. Example:

```
Title: Effect of chlorhexidine on dental implants: a randomized controlled trial. 
Abstract: This study evaluated the efficacy of chlorhexidine mouthrinse in preventing peri-implantitis. Sixty patients with dental implants were randomly assigned...
```

Expected labels: `[RCT, Human]`
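
Example inputs can also be declared through a `widget` field in the model card's YAML metadata. The following is a minimal sketch, assuming plain ASCII example text, that renders such a block from Python; `widget_yaml` is a hypothetical helper, and the rendered block would need to be placed inside the `---` front matter of `README.md`.

```python
import json
from typing import Iterable, Tuple


def widget_yaml(examples: Iterable[Tuple[str, str]]) -> str:
    """Render (example_title, text) pairs as a `widget:` YAML block."""
    lines = ["widget:"]
    for example_title, text in examples:
        # JSON string quoting doubles as YAML double-quoted syntax for ASCII text.
        lines.append(f"- text: {json.dumps(text)}")
        lines.append(f"  example_title: {json.dumps(example_title)}")
    return "\n".join(lines) + "\n"
```

For instance, `widget_yaml([("RCT example", "Title: ... Abstract: ...")])` gives a block that could be appended to the existing front matter string before the closing `---`.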

## Recommendations

- **Enable the inference widget** in model settings
- **Add label list and examples** to the model card
- **Test the widget** with 3-5 diverse abstracts
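
When comparing widget output with expected labels, it helps to recall how multi-label scores are usually decoded: each logit passes through a sigmoid independently, and labels above a threshold are kept. This is a minimal sketch under two assumptions that should be checked against the trained model: the label order matches the per-label table in the model card, and the threshold is 0.5. `decode_labels` is a hypothetical helper.

```python
import math
from typing import List, Sequence

# Assumed to match the model's label order; verify against the saved config.
LABELS = ["SystematicReview", "MetaAnalysis", "RCT", "ClinicalTrial", "Cohort",
          "CaseControl", "CaseReport", "InVitro", "Animal", "Human"]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode_labels(logits: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Keep every label whose sigmoid probability meets the threshold."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]
```

Because each label is scored independently, an abstract can receive several labels at once, or none.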

## 🧘 Reflection Log

**What did you learn in this session?**
- 

**What challenges did you encounter?**
- 

**How will this improve Periospot AI?**
- 
