## Hugging Face secret loading

Steps to keep your Hugging Face token local and usable in this notebook:
1. Put your token in the project root `.env` (git-ignored) as `HUGGINGFACE_TOKEN=...` (or set the env var in your shell before starting the notebook).
2. Run the install cell below once to ensure `python-dotenv` is available for loading `.env` files locally.
3. Run the load cell to pull the token from the environment; if the variable is missing it falls back to the Colab secret `hugging_face`, and it errors if neither is available.
4. Use `hf_token` for any Hugging Face calls (e.g., pass to `huggingface_hub.login`).
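
As a rough illustration of step 3, the lookup can be isolated into a small function that checks the variable names in order and fails loudly when none is set. `resolve_hf_token` and `TOKEN_VARS` are hypothetical names introduced here for illustration; the notebook's own load cell inlines this logic and adds a Colab fallback.

```python
import os
from typing import Mapping, Optional

# Variable names checked in order; these are the two names this notebook reads.
TOKEN_VARS = ("HUGGINGFACE_TOKEN", "HUGGINGFACE_HUB_TOKEN")


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty token found in `env` (hypothetical helper).

    `env` defaults to os.environ; passing a plain dict makes the lookup testable.
    """
    env = os.environ if env is None else env
    for name in TOKEN_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        f"Set one of {', '.join(TOKEN_VARS)} in your .env or shell before running training."
    )
```

The result can then be passed on, for example `huggingface_hub.login(token=resolve_hf_token())`, once `load_dotenv` has populated the environment.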


In [1]:
# Install PyTorch and other libraries
%pip install "torch>=2.4.0" tensorboard torchvision

# Install a Transformers release with Gemma support
%pip install "transformers>=4.51.3"

# Install Hugging Face libraries
%pip install --upgrade \
  "datasets==3.3.2" \
  "accelerate==1.4.0" \
  "evaluate==0.4.3" \
  "bitsandbytes==0.45.3" \
  "trl==0.15.2" \
  "peft==0.14.0" \
  "pillow==11.1.0" \
  protobuf \
  sentencepiece

Collecting datasets==3.3.2
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting accelerate==1.4.0
  Downloading accelerate-1.4.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate==0.4.3
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting bitsandbytes==0.45.3
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting trl==0.15.2
  Downloading trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting peft==0.14.0
  Downloading peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Collecting pillow==11.1.0
  Downloading pillow-11.1.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (9.1 kB)
Collecting protobuf
  Downloading protobuf-6.33.1-cp39-abi3-manylinux2014_x86_64.whl.metadata (593 bytes)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets==3.3.2)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)

In [2]:
# Install lightweight helper to load .env files locally (no effect if already installed)
%pip install -q python-dotenv

In [3]:
import os
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables from the project-level .env file if present
env_path = Path.cwd().parent / '.env'
load_dotenv(env_path)

hf_token = os.getenv('HUGGINGFACE_TOKEN') or os.getenv('HUGGINGFACE_HUB_TOKEN')

if hf_token:
    print('Hugging Face token loaded from environment.')
else:
    try:
        # Fallback for Google Colab: read the secret named 'hugging_face'
        from google.colab import userdata
    except ImportError as exc:
        # Not running in Colab and no token in the environment: fail with a clear message
        raise RuntimeError(
            'Set HUGGINGFACE_TOKEN or HUGGINGFACE_HUB_TOKEN in your .env or shell before running training.'
        ) from exc

    from huggingface_hub import login

    hf_token = userdata.get('hugging_face')
    login(hf_token)  # Log in to the Hugging Face Hub
    print('Hugging Face token loaded from Colab secrets.')

TimeoutException: Requesting secret hugging_face timed out. Secrets can only be fetched when running from the Colab UI.

In [2]:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig



## Dataset utilities

These cells load ROCO samples with configurable prompts, optionally subset the dataset, 
preview a sample (by index), and extract images for vision models.
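
The two less obvious steps here, deterministic subsetting and image extraction, might look roughly like the sketch below. Both functions are hypothetical stand-ins: the actual `load_roco_samples` and `process_vision_info` live in `notebook_utils` and are not shown, so their behavior may differ. The sketch also assumes chat-style messages whose `content` is a list of parts such as `{"type": "image", "image": ...}`.

```python
import random
from typing import Any, Dict, List, Optional, Sequence


def select_indices(
    total: int,
    sample_size: Optional[int] = None,
    sample_indices: Optional[Sequence[int]] = None,
    seed: int = 123,
) -> List[int]:
    """Pick row indices: explicit indices win, else a seeded random subset."""
    if sample_indices is not None:
        # Keep only indices that exist in the split
        return [i for i in sample_indices if 0 <= i < total]
    if sample_size is None or sample_size >= total:
        return list(range(total))
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return sorted(rng.sample(range(total), sample_size))


def extract_images(messages: Sequence[Dict[str, Any]]) -> List[Any]:
    """Collect image objects from chat-style messages (assumed format)."""
    images: List[Any] = []
    for message in messages:
        content = message.get("content", [])
        if not isinstance(content, list):
            continue  # plain-string content carries no image parts
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image" and "image" in part:
                images.append(part["image"])
    return images
```

Using a local `random.Random(seed)` keeps the subset reproducible across runs without affecting other code that relies on the global random state.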


In [6]:
from pathlib import Path
import sys
from importlib import reload

# Make the project src folder importable from this notebook
project_root = Path.cwd().parent
sys.path.append(str(project_root / "src"))

import notebook_utils.data as nud
reload(nud)
from notebook_utils import (
    PromptConfig,
    load_roco_samples,
    preview_sample,
    process_vision_info,
)


In [7]:
# Configure prompts for the experiment (edit as needed)
prompt_config = PromptConfig(
    system_message="You are a radiologist who can understand the medical scan of images",
    user_prompt=(
        "Create a description based on the provided image and return the description of the "
        "image with details of the scan and what's the ability. The description should be SEO "
        "optimized and in medical terms."
    ),
)

# Load a subset of the dataset; change sample_size or sample_indices to control what is loaded
formatted_samples = load_roco_samples(
    sample_size=100,              # set to None to load the full split
    sample_indices=None,          # e.g., [0, 10, 42] to pick specific rows
    seed=123,                     # deterministic shuffling when using sample_size
    prompt_config=prompt_config,
    token=hf_token,               # pass HF token if the dataset is gated/private
)
print(f"Loaded {len(formatted_samples)} formatted samples.")


TypeError: must be called with a dataclass type or instance

In [9]:
# Load dataset from the hub
from datasets import load_dataset
dataset = load_dataset("eltorio/ROCOv2-radiology", split="train")

TypeError: must be called with a dataclass type or instance