Assignment 1

Dataset Overview
Data Source
	•Dataset name: Wikipedia English Dump
	•Provider: Hugging Face Datasets
	•Configuration: wikimedia/wikipedia / 20231101.en
	•Access mode: Streaming (streaming=True)

The dataset is accessed in streaming mode to avoid downloading the full corpus locally and to reduce memory usage during preprocessing.

Python version

In [1]:
import sys
print(sys.version)

3.13.9 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 19:09:58) [MSC v.1929 64 bit (AMD64)]


Install dependencies

In [2]:
!pip install -U datasets transformers

Collecting datasets
  Downloading datasets-4.5.0-py3-none-any.whl.metadata (19 kB)
Collecting transformers
  Downloading transformers-4.57.6-py3-none-any.whl.metadata (43 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.6.0-cp313-cp313-win_amd64.whl.metadata (13 kB)
Collecting multiprocess<0.70.19 (from datasets)
  Downloading multiprocess-0.70.18-py313-none-any.whl.metadata (7.2 kB)
Collecting huggingface-hub<2.0,>=0.25.0 (from datasets)
  Downloading huggingface_hub-1.3.2-py3-none-any.whl.metadata (13 kB)
Collecting hf-xet<2.0.0,>=1.2.0 (from huggingface-hub<2.0,>=0.25.0->datasets)
  Downloading hf_xet-1.2.0-cp37-abi3-win_amd64.whl.metadata (5.0 kB)
Collecting typer-slim (from huggingface-hub<2.0,>=0.25.0->datasets)
  Downloading typer_slim-0.21.1-py3-none-any.whl.metadata (16 kB)
Collecting huggingface-hub<2.0,>=0.25.0 (from datasets)
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downlo

Load dataset with streaming

This uses Hugging Face Datasets in streaming mode, so the full corpus is neither downloaded to local disk nor loaded into RAM.

Inspect one sample

Verify the sample structure and confirm which field contains the raw article text.

In [5]:
from datasets import load_dataset

ds = load_dataset(
    "wikimedia/wikipedia",
    "20231101.en",
    split="train",
    streaming=True
)

sample = next(iter(ds))
print(type(sample))
print(sample.keys())
print(sample["text"][:300])

Resolving data files:   0%|          | 0/41 [00:00<?, ?it/s]

<class 'dict'>
dict_keys(['id', 'url', 'title', 'text'])
Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state 


Define clean_text

Cleaning strategy implemented:
	•return None for missing (None) input
	•collapse runs of whitespace into a single space and strip() the ends
	•lowercase
	•drop very short documents (fewer than min_words words) by returning None

In [6]:
import re

def clean_text(text, min_words=50):
    if text is None:
        return None

    text = re.sub(r"\s+", " ", text).strip()

    text = text.lower()

    if len(text.split()) < min_words:
        return None

    return text
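
As a quick illustration of the cleaning rules, the sketch below runs the same logic on a few made-up strings. The inputs are hypothetical, and min_words is lowered to 3 only so the examples stay short.

```python
import re

def clean_text(text, min_words=50):
    # Same logic as the cell above, repeated so this snippet runs on its own.
    if text is None:
        return None
    text = re.sub(r"\s+", " ", text).strip().lower()
    if len(text.split()) < min_words:
        return None
    return text

# Hypothetical inputs (not taken from the dataset).
messy = "  Hello\n\nWorld,\tthis   is   Wikipedia  "
result_ok = clean_text(messy, min_words=3)            # whitespace collapsed, lowercased
result_short = clean_text("Too short", min_words=3)   # 2 words < 3, so filtered out
result_none = clean_text(None)                        # missing text is passed through as None

print(repr(result_ok))
print(result_short, result_none)
```

Because filtered documents and missing inputs both come back as None, callers only need a single `if not cleaned` check.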

Quick sanity check for cleaning

This iterates over the streaming dataset and prints the first 3 cleaned outputs to confirm cleaning/filtering works.

In [8]:
cleaned_count = 0

for sample in ds:
    cleaned = clean_text(sample["text"])
    if cleaned:
        print(cleaned[:300])
        cleaned_count += 1
    if cleaned_count >= 3:
        break

anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. anarchism advocates for the replacement of the state 
albedo (; ) is the fraction of sunlight that is diffusely reflected by a body. it is measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation). surface albedo is defined as the ratio of radiosity
a, or a, is the first letter and the first vowel of the latin alphabet, used in the modern english alphabet, the alphabets of other western european languages and others worldwide. its name in english is a (pronounced ), plural aes. it is similar in shape to the ancient greek letter alpha, from whic


Define is_duplicate using MD5 hashes

This maintains a seen_hashes set and marks a text as a duplicate if its MD5 hash is already in the set; otherwise the hash is recorded and the text is kept.

In [9]:
import hashlib

def is_duplicate(text, seen_hashes):
    h = hashlib.md5(text.encode("utf-8")).hexdigest()
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False
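
Hash-based deduplication only catches exact matches, which is one reason to run it after cleaning. The sketch below uses made-up strings to show that a repeated string is flagged while a variant with extra whitespace is not.

```python
import hashlib

def is_duplicate(text, seen_hashes):
    # Same logic as the cell above, repeated so this snippet runs on its own.
    h = hashlib.md5(text.encode("utf-8")).hexdigest()
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

seen = set()
# Hypothetical documents: the third repeats the first exactly,
# the fourth differs only by an extra space.
docs = ["alpha beta", "gamma delta", "alpha beta", "alpha  beta"]
flags = [is_duplicate(d, seen) for d in docs]

print(flags)
print(len(seen), "distinct hashes stored")
```

The last string is treated as new because its bytes differ; after clean_text collapses whitespace, it would hash the same as the first.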

Sanity check for dedup + cleaning

Stream through dataset → clean → dedup → print first 3 unique cleaned samples.

In [10]:
seen_hashes = set()
kept = 0

for sample in ds:
    cleaned = clean_text(sample["text"])
    if not cleaned:
        continue

    if is_duplicate(cleaned, seen_hashes):
        continue

    print(cleaned[:200])
    kept += 1

    if kept >= 3:
        break

anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typi
albedo (; ) is the fraction of sunlight that is diffusely reflected by a body. it is measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding 
a, or a, is the first letter and the first vowel of the latin alphabet, used in the modern english alphabet, the alphabets of other western european languages and others worldwide. its name in english


Initialize tokenizer

Load the GPT-2 tokenizer and set the padding token to EOS (a common quick fix, since GPT-2 has no pad token by default).

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Define chunk_token_ids

This yields non-overlapping blocks of exactly block_size tokens; any trailing tokens that do not fill a complete block are dropped.

In [12]:
def chunk_token_ids(token_ids, block_size=512):
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        yield token_ids[i:i + block_size]
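
A small worked example of the chunking behavior, using a made-up list of ten integers in place of real token ids:

```python
def chunk_token_ids(token_ids, block_size=512):
    # Same logic as the cell above, repeated so this snippet runs on its own.
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        yield token_ids[i:i + block_size]

token_ids = list(range(10))  # stand-in for real token ids
blocks = list(chunk_token_ids(token_ids, block_size=4))

# Tokens that did not fit into a full block are not yielded.
dropped = len(token_ids) - sum(len(b) for b in blocks)

# A sequence shorter than block_size produces no blocks at all.
short = list(chunk_token_ids([1, 2, 3], block_size=4))

print(blocks)
print("dropped tokens:", dropped)
print(short)
```

With block_size=4, ten tokens give two full blocks and the last two tokens are discarded.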


Define iter_token_blocks

Pipeline implemented inside one iterator:
	1.clean_text(sample["text"])
	2.dedup by MD5 (local seen_hashes)
	3.tokenize (add_special_tokens=False)
	4.chunk into fixed-size blocks
	5.stop after max_blocks

In [13]:
def iter_token_blocks(ds, tokenizer, block_size=512, max_blocks=10):
    seen_hashes = set()  # local to this call, so every pass over ds starts fresh
    produced = 0

    for sample in ds:
        cleaned = clean_text(sample["text"])
        if not cleaned:
            continue

        # reuse is_duplicate (defined above) instead of repeating the MD5 check
        if is_duplicate(cleaned, seen_hashes):
            continue

        # no special tokens are added, so blocks contain only article text
        ids = tokenizer(cleaned, add_special_tokens=False)["input_ids"]

        # chunking: the tail of each article shorter than block_size is dropped
        for block in chunk_token_ids(ids, block_size=block_size):
            yield block
            produced += 1
            if produced >= max_blocks:
                return
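
The iterator can be exercised without network access by feeding it an in-memory list of dicts. The sketch below is only an illustration: ToyTokenizer and toy_ds are hypothetical stand-ins (one integer id per word), not the GPT-2 tokenizer or the Wikipedia stream, and the helpers are compact copies of the ones defined above.

```python
import hashlib
import re

def clean_text(text, min_words=50):
    if text is None:
        return None
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text if len(text.split()) >= min_words else None

def chunk_token_ids(token_ids, block_size=512):
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        yield token_ids[i:i + block_size]

def iter_token_blocks(ds, tokenizer, block_size=512, max_blocks=10):
    seen_hashes, produced = set(), 0
    for sample in ds:
        cleaned = clean_text(sample["text"])
        if not cleaned:
            continue
        h = hashlib.md5(cleaned.encode("utf-8")).hexdigest()
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        ids = tokenizer(cleaned, add_special_tokens=False)["input_ids"]
        for block in chunk_token_ids(ids, block_size=block_size):
            yield block
            produced += 1
            if produced >= max_blocks:
                return

class ToyTokenizer:
    """Hypothetical stand-in: assigns one integer id per whitespace-separated word."""
    def __init__(self):
        self.vocab = {}

    def __call__(self, text, add_special_tokens=False):
        ids = [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]
        return {"input_ids": ids}

doc_a = " ".join(f"w{i}" for i in range(60))          # 60 words
toy_ds = [
    {"text": doc_a},
    {"text": "  ".join(doc_a.split()).upper()},       # same text after cleaning: duplicate
    {"text": "only a few words here"},                # below min_words: filtered
    {"text": " ".join(f"x{i}" for i in range(55))},   # 55 words
]

blocks = list(iter_token_blocks(toy_ds, ToyTokenizer(), block_size=25, max_blocks=3))
for i, b in enumerate(blocks):
    print(i, len(b), b[:5])
```

The first document contributes two blocks of 25 tokens (10 tokens dropped), the duplicate and the short document contribute none, and the max_blocks=3 limit stops the iterator after the first block of the last document.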


Smoke test the iterator + decode preview

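Generate a few blocks (block_size=128, max_blocks=3) and print:
	•block length
	•first token ids
	•decoded preview

The tokenizer may warn that a sequence is longer than GPT-2's 1024-token maximum. This is expected here: whole articles are tokenized first and then split into blocks of block_size tokens, so the full sequence is never passed to a model.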

In [14]:
blocks = list(iter_token_blocks(ds, tokenizer, block_size=128, max_blocks=3))

for i, b in enumerate(blocks):
    print(f"Block {i} length:", len(b))
    print("First 30 token ids:", b[:30])
    print("Decoded preview:", tokenizer.decode(b[:80]))
    print("-" * 60)


Token indices sequence length is longer than the specified maximum sequence length for this model (8524 > 1024). Running this sequence through the model will result in indexing errors


Block 0 length: 128
First 30 token ids: [272, 998, 1042, 318, 257, 1964, 8876, 290, 3356, 326, 318, 17988, 286, 477, 655, 6637, 329, 4934, 290, 12932, 284, 35531, 262, 6712, 340, 3667, 5529, 13114, 32000, 290]
Decoded preview: anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. anarchism advocates for the replacement of the state with stateless societies and voluntary free associations. as a historically left-wing movement, this reading of anarchism is placed on the farthest left of the political
------------------------------------------------------------
Block 1 length: 128
First 30 token ids: [3812, 4934, 635, 8278, 13, 3584, 20675, 286, 26177, 4213, 389, 1043, 477, 3690, 2106, 11, 3660, 41661, 9349, 422, 262, 35957, 13, 1141, 262, 6846, 2063, 286, 262, 678]
Decoded preview:  toward authorit

Install/import torch

In [15]:
!pip install -U torch

Collecting torch
  Downloading torch-2.9.1-cp313-cp313-win_amd64.whl.metadata (30 kB)
Downloading torch-2.9.1-cp313-cp313-win_amd64.whl (110.9 MB)

In [16]:
import torch
print(torch.__version__)

2.9.1+cpu


TokenBlockDataset as IterableDataset

This wraps iter_token_blocks and yields tensors of dtype torch.long.
It also supports max_blocks control for bounded sampling.

In [17]:
import torch
from torch.utils.data import IterableDataset, DataLoader

class TokenBlockDataset(IterableDataset):
    def __init__(self, hf_ds, tokenizer, block_size=128, max_blocks=None):
        self.hf_ds = hf_ds
        self.tokenizer = tokenizer
        self.block_size = block_size
        self.max_blocks = max_blocks

    def __iter__(self):
        # iter_token_blocks already stops after max_blocks, so no second counter is needed.
        # float("inf") stands in for "no limit" because the helper compares with >=.
        limit = float("inf") if self.max_blocks is None else self.max_blocks
        for block in iter_token_blocks(
            self.hf_ds,
            self.tokenizer,
            block_size=self.block_size,
            max_blocks=limit
        ):
            yield torch.tensor(block, dtype=torch.long)

Collation (pad to same length)

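Batches are padded to the max seq_len in that batch. Because every block produced by chunk_token_ids already has exactly block_size tokens, padding is a no-op in this pipeline; it only matters if variable-length sequences are introduced later.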

In [18]:
def collate_pad(batch):
    # batch: List[Tensor(seq_len)]
    return torch.nn.utils.rnn.pad_sequence(
        batch,
        batch_first=True,
        padding_value=tokenizer.pad_token_id
    )

Build DataLoader and sample a few batches

Construct dataset with block_size=512, max_blocks=2000, then take first 5 batches.

In [21]:
dataset_pt = TokenBlockDataset(ds, tokenizer, block_size=512, max_blocks=2000)

loader = DataLoader(
    dataset_pt,
    batch_size=8,
    collate_fn=collate_pad,
    num_workers=0 
)
sample_batches = []
for i, batch in enumerate(loader):
    print("batch", i, "shape:", batch.shape)
    sample_batches.append(batch)
    if i >= 4:
        break

batch 0 shape: torch.Size([8, 512])
batch 1 shape: torch.Size([8, 512])
batch 2 shape: torch.Size([8, 512])
batch 3 shape: torch.Size([8, 512])
batch 4 shape: torch.Size([8, 512])
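
As a rough size check, the arithmetic below works out how many tokens the five collected batches hold and how many full batches the max_blocks=2000 cap would allow. The numbers are taken from the cell above; the variable names are just for illustration.

```python
batch_size = 8
block_size = 512
num_batches = 5      # batches collected before the break
max_blocks = 2000    # cap passed to TokenBlockDataset

tokens_per_batch = batch_size * block_size          # tokens in one [8, 512] batch
tokens_collected = num_batches * tokens_per_batch   # tokens held in sample_batches
batches_available = max_blocks // batch_size        # full batches before the cap is hit

print(tokens_per_batch, tokens_collected, batches_available)
```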


Save sampled batches to disk

In [22]:
out_path = "sample_dataset.pt"
torch.save(sample_batches, out_path)
print("Saved:", out_path)

Saved: sample_dataset.pt
