<a href="https://colab.research.google.com/github/Algocrat/slm-dragon-labs/blob/main/lab3_fresh_data_loading_tokenization_revised_fixed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3 – Data Loading and Tokenization
**Part 3 of the 7-Lab Hands-On SLM Training Series**

This notebook downloads the `ncbi/Open-Patients` dataset, performs basic cleaning and sanity checks, detects the text field automatically, and prepares tokenized chunks for causal language modeling (CLM). It saves a tokenized dataset to disk for use in Lab 4.

### Note on dataset access and licensing
The `ncbi/Open-Patients` dataset is publicly available on the Hugging Face Hub under CC-BY-SA 4.0. No authentication is required to download. Please provide attribution if you reuse the data, and ensure your use complies with the license and any applicable privacy rules.

## Step 0. Install dependencies (if needed)

In [1]:
!pip -q install --upgrade datasets transformers sentencepiece pyarrow tqdm > /dev/null
import importlib
for m in ["datasets", "transformers", "sentencepiece", "pyarrow", "tqdm"]:
    importlib.import_module(m)
print("Dependencies OK")

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
Dependencies OK
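
The pip message above reports that the upgrade installed pyarrow 21.0.0, which the preinstalled `cudf-cu12` and `pylibcudf-cu12` packages do not accept. None of the cells in this notebook import cuDF, so the conflict appears harmless here, though that is an inference from the cells shown rather than something the log states. A minimal sketch for recording which versions were actually installed, using only the standard library (the helper name `installed_versions` is hypothetical):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the packages this lab installs in Step 0
for name, ver in installed_versions(
    ["datasets", "transformers", "sentencepiece", "pyarrow", "tqdm"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

Logging these versions alongside the saved dataset makes it easier to reproduce the run later.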


## Step 1. Download dataset: `ncbi/Open-Patients`

In [2]:
from datasets import load_dataset

# Public dataset under CC-BY-SA 4.0; no authentication required
dataset = load_dataset("ncbi/Open-Patients")
print(dataset)
print("Example record:")
first_split = list(dataset.keys())[0]
print(dataset[first_split][0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



DatasetDict({
    train: Dataset({
        features: ['_id', 'description'],
        num_rows: 180142
    })
})
Example record:
{'_id': 'trec-cds-2014-1', 'description': 'A 58-year-old African-American woman presents to the ER with episodic pressing/burning anterior chest pain that began two days earlier for the first time in her life. The pain started while she was walking, radiates to the back, and is accompanied by nausea, diaphoresis and mild dyspnea, but is not increased on inspiration. The latest episode of pain ended half an hour prior to her arrival. She is known to have hypertension and obesity. She denies smoking, diabetes, hypercholesterolemia, or a family history of heart disease. She currently takes no medications. Physical examination is normal. The EKG shows nonspecific changes.'}


## Step 1.1 Clean and sanity check

In [3]:
import re, unicodedata
from datasets import DatasetDict
import numpy as np

TEXT_FIELD_CANDIDATES = ["text", "content", "description", "body", "note"]
sample_split = list(dataset.keys())[0]
sample_item = dataset[sample_split][0]
text_field = None
for k in TEXT_FIELD_CANDIDATES:
    if k in sample_item and isinstance(sample_item[k], str):
        text_field = k
        break
if text_field is None:
    for k, v in sample_item.items():
        if isinstance(v, str):
            text_field = k
            break
print("Using text field:", text_field)

KEEP_ASCII_ONLY = False
MIN_LEN_CHARS = 10
MAX_LEN_CHARS = 50000

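def basic_clean(s: str) -> str:
    # Treat missing values as empty text instead of the literal string "None"
    if s is None:
        return ""
    if not isinstance(s, str):
        s = str(s)
    # Trim the ends and collapse every whitespace run to a single space
    s = s.strip()
    s = re.sub(r"\s+", " ", s)
    if KEEP_ASCII_ONLY:
        # NFKD splits accented characters so the base letter survives ASCII encoding
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    return s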

def map_clean(example):
    t = example.get(text_field, "")
    t = basic_clean(t)
    example[text_field] = t
    example["len_chars"] = len(t)
    return example

def is_valid(example):
    ln = example["len_chars"]
    return (ln >= MIN_LEN_CHARS) and (ln <= MAX_LEN_CHARS)

cleaned = DatasetDict()
for split in dataset.keys():
    cleaned_split = dataset[split].map(map_clean, desc=f"Cleaning {split}")
    cleaned_split = cleaned_split.filter(is_valid, desc=f"Filtering {split}")
    cleaned[split] = cleaned_split

def dedupe_exact(ds, key):
    seen = set()
    idxs = []
    for i, s in enumerate(ds[key]):
        if s not in seen:
            idxs.append(i)
            seen.add(s)
    return ds.select(idxs)

for split in list(cleaned.keys()):
    before = len(cleaned[split])
    cleaned[split] = dedupe_exact(cleaned[split], text_field)
    after = len(cleaned[split])
    if after != before:
        print(f"Deduped {split}: {before} -> {after}")

for split in cleaned.keys():
    arr = np.array(cleaned[split]["len_chars"])
    if arr.size:
        print(f"{split}: n={arr.size} mean={arr.mean():.1f} p50={np.percentile(arr,50):.0f} "
              f"p90={np.percentile(arr,90):.0f} p99={np.percentile(arr,99):.0f}")

Using text field: description


Deduped train: 180140 -> 180138
train: n=180138 mean=2614.0 p50=2353 p90=4633 p99=8081
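
To make the effect of the cleaning cell concrete, here is a small standalone sketch that applies the same three stages (whitespace normalization with optional ASCII folding, length filtering, exact dedupe) to a plain Python list. The function names, the toy strings, and the small `min_len` are illustrative assumptions; the notebook itself runs these stages through `Dataset.map`, `Dataset.filter`, and `Dataset.select`.

```python
import re
import unicodedata


def clean_text(s, ascii_only=False):
    """Collapse whitespace runs; optionally fold to ASCII via NFKD."""
    s = re.sub(r"\s+", " ", s.strip())
    if ascii_only:
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    return s


def clean_filter_dedupe(texts, min_len=5, max_len=50000, ascii_only=False):
    """Clean each text, drop out-of-range lengths, keep first occurrences only."""
    seen = set()
    kept = []
    for raw in texts:
        t = clean_text(raw, ascii_only=ascii_only)
        if not (min_len <= len(t) <= max_len):
            continue  # too short or too long
        if t in seen:
            continue  # exact duplicate after cleaning
        seen.add(t)
        kept.append(t)
    return kept


toy = [
    "  Chest   pain\n\tfor two days. ",
    "Chest pain for two days.",          # duplicate once whitespace is normalized
    "ok",                                # too short, filtered out
    "Caf\u00e9 au lait spots noted.",    # contains a non-ASCII character
]

print(clean_filter_dedupe(toy))
print(clean_filter_dedupe(toy, ascii_only=True))
```

With ASCII folding enabled, characters that have no ASCII decomposition are dropped outright, which is one reason the notebook leaves `KEEP_ASCII_ONLY` off by default.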


## Step 2. Initialize tokenizer

In [4]:
from transformers import AutoTokenizer

TOKENIZER_MODEL = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_MODEL, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

sample = cleaned['train'][0][text_field] if 'train' in cleaned else list(cleaned.values())[0][0][text_field]
encoded = tokenizer(sample, truncation=True, max_length=128)
print("Tokenized sample IDs (first 20):", encoded['input_ids'][:20])

Tokenized sample IDs (first 20): [1, 24449, 8118, 3358, 369, 3125, 989, 2202, 5585, 354, 272, 907, 727, 297, 559, 1411, 28723, 415, 3358, 2774]


## Step 3. Tokenize dataset and chunk for CLM

In [5]:
from functools import partial

# Chunk length in tokens, used when grouping into fixed-length blocks
SEQ_LEN = 1024

def tokenize_function(examples, text_key):
    return tokenizer(examples[text_key], truncation=False)

# Remove all original columns so only token arrays remain
remove_cols = cleaned['train'].column_names
tokenized = cleaned.map(
    partial(tokenize_function, text_key=text_field),
    batched=True,
    remove_columns=remove_cols,
    desc='Tokenizing',
)

# Sanity check: ensure tokenized has only token-array columns
print(tokenized)
batch = tokenized['train'][:2]
for k, v in batch.items():
    print(k, type(v), type(v[0]))


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 180138
    })
})
input_ids <class 'list'> <class 'list'>
attention_mask <class 'list'> <class 'list'>
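
The tokenized rows still have variable length. For CLM they are concatenated into one stream and sliced into `SEQ_LEN`-token blocks. A toy walk-through of that idea in plain Python follows, with made-up token IDs and a block size of 4. The `eos_id` parameter is an optional extra that the notebook does not use: its tokenized sample starts with ID 1, which looks like a beginning-of-sequence marker that already separates documents, so appending an end-of-sequence ID is shown only as an assumption some training setups make.

```python
from itertools import chain


def concat_and_chunk(docs, seq_len, eos_id=None):
    """Concatenate token lists and cut the stream into full seq_len blocks.

    Tokens left over after the last full block are dropped.
    """
    if eos_id is not None:
        # Optional: mark the end of every document before concatenating
        docs = [doc + [eos_id] for doc in docs]
    stream = list(chain.from_iterable(docs))
    usable = (len(stream) // seq_len) * seq_len
    return [stream[i:i + seq_len] for i in range(0, usable, seq_len)]


# Toy token IDs; 1 stands in for the BOS marker
docs = [[1, 10, 11], [1, 20, 21, 22, 23], [1, 30]]

print(concat_and_chunk(docs, seq_len=4))
# [[1, 10, 11, 1], [20, 21, 22, 23]]
print(concat_and_chunk(docs, seq_len=4, eos_id=2))
# [[1, 10, 11, 2], [1, 20, 21, 22], [23, 2, 1, 30]]
```

Blocks can span document boundaries, and the tail that does not fill a block is discarded. Because `map(..., batched=True)` processes rows in batches (1,000 by default), that tail is dropped once per batch rather than once per dataset.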


## Step 4. Group into fixed-length chunks, then save and preview

In [6]:
from itertools import chain

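def group_texts(examples):
    # Keep only list-of-lists columns (here: input_ids and attention_mask)
    valid_keys = [k for k, v in examples.items() if isinstance(v, list) and v and isinstance(v[0], list)]
    # Concatenate every sequence in the batch into one long stream per column
    concatenated = {k: list(chain.from_iterable(examples[k])) for k in valid_keys}
    # Round down to a multiple of SEQ_LEN; the leftover tail of each batch is dropped
    total_length = len(concatenated['input_ids'])
    total_length = (total_length // SEQ_LEN) * SEQ_LEN
    # Slice each stream into SEQ_LEN-sized chunks
    result = {}
    for k, t in concatenated.items():
        result[k] = [t[i:i+SEQ_LEN] for i in range(0, total_length, SEQ_LEN)]
    # For CLM the labels mirror the inputs; Hugging Face causal LM models shift them internally
    result['labels'] = list(result['input_ids'])
    return result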

lm_datasets = tokenized.map(group_texts, batched=True, desc='Grouping into fixed-length chunks')
print(lm_datasets)


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 121736
    })
})
