Customizations Made

1. Changed dataset from `wikitext-2-raw-v1` to `ag_news` for real-world text variety.  
2. Added `tokenizer.pad_token = tokenizer.eos_token` to handle padding safely.  
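3. Fixed the `ArrowInvalid` error by passing `remove_columns` to `.map()`, which drops the old columns whose row count no longer matches the grouped output; `batch_size=None` processes the whole dataset as a single batch.  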
4. Reduced `block_size` from 512 to **128** for efficient small-batch processing.  
5. Added validation check for sequence length (128), created DataLoader, and saved processed dataset.  


In [2]:
!pip install --upgrade pip
!pip install torch transformers datasets tqdm

Collecting transformers
  Using cached transformers-4.57.1-py3-none-any.whl.metadata (43 kB)
Collecting datasets
  Using cached datasets-4.2.0-py3-none-any.whl.metadata (18 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.35.3-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Using cached tokenizers-0.22.1-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.8 kB)
Collecting httpx<1.0.0 (from datasets)
  Using cached httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting aiohttp!=4.0.0a0,!=4.0.0a1 (from fsspec[http]<=2025.9.0,>=2023.1.0->datasets)
  Using cached aiohttp-3.13.1-cp313-cp313-macosx_11_0_arm64.whl.metadata (8.1 kB)
Using cached transformers-4.57.1-py3-none-any.whl (12.0 MB)
Using cached huggingface_hub-0.35.3-py3-none-any.whl (564 kB)
Using cached tokenizers-0.22.1-cp39-abi3-macosx_11_0_arm64.whl (2.9 MB)
Using cached datasets-4.2.0-py3-none-any.whl (506 kB)
Using cached httpx-0.28.1-py

In [3]:
import datasets, transformers, torch, tqdm
print("✅ All libraries imported successfully!")


  from .autonotebook import tqdm as notebook_tqdm


✅ All libraries imported successfully!


In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch


In [5]:
# Load AG News dataset (subset for demonstration)
dataset = load_dataset("ag_news", split="train[:5000]")
print(f"✅ Loaded {len(dataset)} samples.")
print(dataset[0])


Generating train split: 100%|██████████| 120000/120000 [00:00<00:00, 1480403.55 examples/s]
Generating test split: 100%|██████████| 7600/7600 [00:00<00:00, 2250385.49 examples/s]

✅ Loaded 5000 samples.
{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}





In [6]:
# Use GPT-2 tokenizer for consistency with LLM architectures
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Tokenize each text; truncation=True caps length at the tokenizer's model_max_length (1024 for GPT-2)
    return tokenizer(examples["text"], truncation=True)

tokenized_ds = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
print("✅ Tokenization complete.")
tokenized_ds[0]


Map: 100%|██████████| 5000/5000 [00:00<00:00, 19853.42 examples/s]

✅ Tokenization complete.





{'label': 2,
 'input_ids': [22401,
  520,
  13,
  15682,
  30358,
  5157,
  20008,
  262,
  2619,
  357,
  12637,
  8,
  8428,
  532,
  10073,
  12,
  7255,
  364,
  11,
  5007,
  3530,
  338,
  45215,
  59,
  3903,
  286,
  14764,
  12,
  948,
  77,
  873,
  11,
  389,
  4379,
  4077,
  757,
  13],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}
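
The record above has no padding yet, because each text was tokenized at its natural length. As a minimal sketch of what reusing EOS as the pad token implies once sequences are padded to a common length, the pure-Python helper below right-pads token-id lists and builds the matching attention masks. `pad_batch` is a hypothetical helper written for illustration, not part of `transformers`; the id 50256 is GPT-2's `<|endoftext|>` token.

```python
GPT2_EOS_ID = 50256  # GPT-2's <|endoftext|> id, reused here as the pad id

def pad_batch(sequences, pad_id=GPT2_EOS_ID):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[22401, 520, 13], [15682, 30358]])
print(batch["input_ids"])       # [[22401, 520, 13], [15682, 30358, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

Because the pad id is the same as the EOS id, the attention mask is the only thing that tells padding apart from a genuine end-of-text token.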

In [10]:
block_size = 128

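from itertools import chain

def group_texts(examples):
    # Flatten all tokenized texts into one token stream (linear time, unlike sum(..., []))
    concatenated = list(chain.from_iterable(examples["input_ids"]))
    # Drop the trailing remainder so every block has exactly block_size tokens
    total_length = (len(concatenated) // block_size) * block_size
    input_blocks = [concatenated[i:i + block_size] for i in range(0, total_length, block_size)]
    # Labels mirror the inputs; Hugging Face causal LM models shift them internally
    return {"input_ids": input_blocks, "labels": [block[:] for block in input_blocks]}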

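# ✅ Final working version (resolves ArrowInvalid)
lm_dataset = tokenized_ds.map(
    group_texts,
    batched=True,
    batch_size=None,  # treat the whole dataset as one batch, so only one trailing remainder is dropped
    remove_columns=tokenized_ds.column_names,  # critical: output row count differs from input, so old columns must go
    load_from_cache_file=False,
)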

print(f"✅ Created {len(lm_dataset)} grouped sequences.")



Map: 100%|██████████| 5000/5000 [00:02<00:00, 1973.90 examples/s]

✅ Created 2155 grouped sequences.
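
To see why the row count changes (5000 texts in, 2155 blocks out), here is a minimal pure-Python sketch of the same concatenate-and-chunk idea on toy data. `group_into_blocks` and `toy_rows` are hypothetical names used only for this illustration, and the block size of 4 is chosen to keep the output readable.

```python
from itertools import chain

def group_into_blocks(token_lists, block_size):
    """Concatenate token lists and split into full blocks, dropping the remainder."""
    stream = list(chain.from_iterable(token_lists))
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

toy_rows = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]  # 3 input rows, 10 tokens
blocks = group_into_blocks(toy_rows, block_size=4)

print(blocks)                      # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(len(toy_rows), len(blocks))  # 3 2
```

Three rows go in and two come out, with tokens 9 and 10 dropped as the remainder. Any column kept from the input (such as `label` or `attention_mask`) would still have three rows, and that length mismatch is what `remove_columns` avoids.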





In [11]:
# ✅ Sanity check: ensure every grouped sequence has length = 128
lengths = [len(x["input_ids"]) for x in lm_dataset]
unique_lengths = set(lengths)
print(f"✅ Unique sequence lengths: {unique_lengths}")
print(f"Total sequences checked: {len(lengths)}")


✅ Unique sequence lengths: {128}
Total sequences checked: 2155
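
The counts reported above also pin down how many tokens survived grouping. The short calculation below is a sketch that uses only the numbers printed in this notebook (2155 blocks, block size 128, 5000 samples); the variable names are illustrative.

```python
block_size = 128
num_blocks = 2155    # grouped sequences reported above
num_samples = 5000   # texts loaded from the train split

tokens_kept = num_blocks * block_size
# With a single batch, the dropped remainder is at most block_size - 1 tokens
max_total_tokens = tokens_kept + block_size - 1
# Lower bound on the average tokenized length per text
min_avg_tokens = tokens_kept / num_samples

print(tokens_kept)               # 275840
print(max_total_tokens)          # 275967
print(round(min_avg_tokens, 1))  # 55.2
```

So the 5000 texts produced between 275,840 and 275,967 tokens in total, or roughly 55 tokens per text.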


In [12]:
from torch.utils.data import DataLoader

def collate_fn(batch):
    input_ids = torch.tensor([x["input_ids"] for x in batch], dtype=torch.long)
    labels = input_ids.clone()
    return {"input_ids": input_ids, "labels": labels}

train_loader = DataLoader(lm_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)

# Inspect one batch
for batch in train_loader:
    print("✅ Batch tensor shape:", batch["input_ids"].shape)
    break


✅ Batch tensor shape: torch.Size([8, 128])
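
Only the first batch is inspected above. Since 2155 is not a multiple of 8, the final batch will be smaller; the sketch below works out the batch count and the size of that last batch from the numbers in this notebook.

```python
import math

num_sequences = 2155  # grouped sequences in lm_dataset
batch_size = 8

num_batches = math.ceil(num_sequences / batch_size)
last_batch_size = num_sequences - (num_batches - 1) * batch_size

print(num_batches)      # 270
print(last_batch_size)  # 3
```

If a training loop assumes a fixed batch dimension, passing `drop_last=True` to `DataLoader` discards that short final batch, leaving 269 full batches per epoch.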


In [13]:
import os
os.makedirs("data", exist_ok=True)
lm_dataset.save_to_disk("data/ag_news_lm_tokenized")
print("💾 Saved dataset to data/ag_news_lm_tokenized/")


Saving the dataset (1/1 shards): 100%|██████████| 2155/2155 [00:00<00:00, 201420.06 examples/s]

💾 Saved dataset to data/ag_news_lm_tokenized/
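
A later session can reload the saved dataset with `load_from_disk`, the `datasets` counterpart to `save_to_disk`. The wrapper below is a minimal sketch: `load_saved_dataset` is a hypothetical helper, and it assumes the path used in the cell above.

```python
import os

def load_saved_dataset(path="data/ag_news_lm_tokenized"):
    """Reload a dataset written by save_to_disk, or return None if the path is missing."""
    if not os.path.isdir(path):
        return None
    # Imported lazily so the check above works even before datasets is installed
    from datasets import load_from_disk
    return load_from_disk(path)

reloaded = load_saved_dataset()
if reloaded is not None:
    print(reloaded)
else:
    print("No saved dataset found; run the save cell first.")
```

Returning `None` for a missing directory keeps the failure mode explicit instead of raising from inside the library call.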



