# Homework 1

You are provided with a list of technical reports about large language models (LLMs) trained by different companies. Your assignment is to convert these PDFs into a training dataset that could be used to train an LLM.

1. Write code to download, extract, and combine all the PDFs into one text file:
   - Download all the provided PDF files.
   - Extract the text from them.
   - Combine everything into a single text file.
2. Clean the text file, e.g., fix hyphenation across line breaks, remove headers/footers/page numbers, drop figure/table captions, and trim “References”. For each cleaning step, provide comments explaining why you chose to apply it.
3. Tokenization:
   - Implement a regex-based tokenizer and build your own vocabulary.
   - Tokenize the same text using Byte Pair Encoding (BPE) from the tiktoken library.
   - Compare the two methods and comment on the differences (e.g., handling of unknown words, vocabulary size, subword splitting).
4. Your own dataset and dataloader:
   - Prepare your own dataset of input–target sequences (using a sliding window approach).
   - Implement a PyTorch DataLoader to batch the dataset for training.
5. Statistical analysis:
   - Report total documents, total tokens, and average tokens per document.
   - Compare before vs. after cleaning: how much text was removed.

You should prepare a single Python notebook with your code and answers. Make sure the notebook runs without errors before submitting.


# 1. Download, extract and combine all PDF files into one text file

## a. Download PDFs from URLs

In [4]:
import requests
import os

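def download_pdf(url, filename):
    """Download one PDF to `filename`; report a failure instead of raising."""
    # Several URLs in the list below carry a leading space, so strip before requesting.
    url = url.strip()
    response = requests.get(url, timeout=60)
    if response.status_code == 200:
        with open(filename, 'wb') as f:
            f.write(response.content)
    else:
        print(f"Failed to download {url} (HTTP {response.status_code})")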

# List of (title, url) for the PDFs to download
pdf_links = [
    ("GPT-3", "https://arxiv.org/pdf/2005.14165.pdf"),
    ("GPT-4", "https://arxiv.org/pdf/2303.08774.pdf"),
    ("PaLM", "https://arxiv.org/pdf/2204.02311.pdf"),
    ("PaLM2", "https://arxiv.org/pdf/2305.10403.pdf"),
    ("Gemini 1.0", "https://arxiv.org/pdf/2312.11805.pdf"),
    ("Gemini 1.5 (2024)", " https://arxiv.org/pdf/2403.05530.pdf"),
    ("Gemma (2024)", " https://arxiv.org/pdf/2403.08295.pdf"),
    ("Gemma 2 (2024)", " https://arxiv.org/pdf/2408.00118.pdf"),
    ("Gemma 3", " https://arxiv.org/pdf/2503.19786.pdf"),
    ("CodeGemma (2024)", " https://arxiv.org/pdf/2406.11409.pdf"),
    ("RecurrentGemma (2024)", " https://arxiv.org/pdf/2404.07839.pdf"),
    ("LLaMA (2023)", " https://arxiv.org/pdf/2302.13971.pdf"),
    ("Llama 2 (2023)", " https://arxiv.org/pdf/2307.09288.pdf"),
    ("Llama 3 (2024)", " https://arxiv.org/pdf/2407.21783.pdf"),
    # Mistral
    ("Mistral 7B (2023)", " https://arxiv.org/pdf/2310.06825.pdf"),
    ("Mixtral of Experts 8x7B (2024)", " https://arxiv.org/pdf/2401.04088.pdf"),
    # NVIDIA
    ("Nemotron-4 340B Technical Report (2024)", " https://arxiv.org/pdf/2406.11704.pdf"),
    ("NVLM 1.0 (2024)", " https://arxiv.org/pdf/2409.11402.pdf"),
    # Alibaba / Qwen series
    ("Qwen2 Technical Report (2024)", " https://arxiv.org/pdf/2407.10671.pdf"),
    ("Qwen2-VL (2024)", " https://arxiv.org/pdf/2409.12191.pdf"),
    ("Qwen2-Audio (2024)", " https://arxiv.org/pdf/2407.10759.pdf"),
    ("Qwen2.5 Technical Report (2024)", " https://arxiv.org/pdf/2412.15115.pdf"),
    ("Qwen2.5-VL Technical Report (2025)", " https://arxiv.org/pdf/2502.13923.pdf"),
    ("Qwen2.5-Omni Technical Report (2025)", " https://arxiv.org/pdf/2503.20215.pdf"),
    ("Qwen3 Technical Report (2025)", " https://arxiv.org/pdf/2505.09388.pdf"),
    # DeepSeek series
    ("DeepSeek-V2 (2024)", " https://arxiv.org/pdf/2405.04434.pdf"),
    ("DeepSeek-V3 Technical Report (2024)", " https://arxiv.org/pdf/2412.19437.pdf"),
    ("DeepSeek-R1 (2025)", " https://arxiv.org/pdf/2501.12948.pdf"),
    ("DeepSeek-Coder (2024)", " https://arxiv.org/pdf/2401.14196.pdf"),
    # ZhipuAI
    ("GLM-130B (2022)", " https://arxiv.org/pdf/2210.02414.pdf"),
    # Shanghai AI Lab
    ("InternLM2 Technical Report (2024)", " https://arxiv.org/pdf/2403.17297.pdf"),
    ("InternVL 2.5 (2024)", " https://arxiv.org/pdf/2412.05271.pdf"),
    # Microsoft
    ("Phi-3 Technical Report (2024)", " https://arxiv.org/pdf/2404.14219.pdf"),
    ("Phi-3 Safety Post-Training (2024)", " https://arxiv.org/pdf/2407.13833.pdf"),
    # AI21
    ("Jamba: Hybrid Transformer–Mamba (2024)", " https://arxiv.org/pdf/2403.19887.pdf"),
    # Huawei
    ("PanGu-Σ (2023)", " https://arxiv.org/pdf/2303.10845.pdf"),
    # 01.AI
    ("Yi: Open Foundation Models (2024)", " https://arxiv.org/pdf/2403.04652.pdf")
]



for file_name, url in pdf_links:
    download_pdf(url, f'{file_name}.pdf')
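
Several titles in `pdf_links` contain characters such as `:` or `Σ`, which are not valid or not portable in file names on every operating system. The following is a minimal sketch of one way to derive safer names. The helper `safe_filename` is hypothetical (it is not part of the original notebook), and dropping non-ASCII characters is an assumption that could cause name collisions for very similar titles.

```python
import re
import unicodedata

def safe_filename(title, ext=".pdf"):
    """Hypothetical helper: turn a report title into a portable file name."""
    # Decompose accented characters, then drop anything that has no ASCII
    # equivalent (for example the sigma in "PanGu-Σ" or an en dash).
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of characters outside [A-Za-z0-9._-] with one underscore.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_title).strip("_")
    return f"{stem}{ext}"

print(safe_filename("Jamba: Hybrid Transformer–Mamba (2024)"))
print(safe_filename("Gemini 1.5 (2024)"))
```

If this helper were adopted, the same function would have to be used when building `pdf_list` for the extraction step, so that both cells agree on the file names.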


## b. & c. Extract the text and combine into one txt file

In [None]:
from PyPDF2 import PdfReader

def pdfs_to_text(pdf_files, output_txt):
    with open(output_txt, 'w', encoding='utf-8') as outfile:
        for pdf_file in pdf_files:
            reader = PdfReader(pdf_file)
            # Extract text from each page
            for page in reader.pages:
                text = page.extract_text()
                if text:
                    outfile.write(text)
                    outfile.write('\n\n')  
    print(f'Text merged into: {output_txt}')

# Extract text from downloaded PDFs and combine into one text file
pdf_list = [f'{file_name}.pdf' for file_name, url in pdf_links]
pdfs_to_text(pdf_list, 'merged_output.txt')


# 2. Clean the text file, e.g., fix hyphenation across line breaks, remove headers/footers/page numbers, drop figure/table captions, trim “References”, etc.

In [None]:
with open("merged_output.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(raw_text[:1000])  # Display the first 1000 characters of the raw text
print(len(raw_text))    # Total number of characters in the raw text

In [None]:
import re

# This step is needed because PDFs often break words at line ends with hyphens, creating fragmented words.
def clean_step1_hyphenation(text):
    """Fix hyphenated words across line breaks"""

    # Fix hyphenation across line breaks: "word-\npart" → "wordpart".
    # \s already matches \r and \n, so "\r\n" and repeated line breaks are covered too.
    # Caveat: a genuine compound split at a line end ("pre-\ntraining") also loses its hyphen.
    text = re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)

    # Fix hyphenation followed by two or more whitespace characters: "word-  part" → "wordpart"
    text = re.sub(r'(\w+)-\s{2,}(\w+)', r'\1\2', text)

    return text

# Headers/footers contain metadata that doesn't contribute to main content and can confuse language models.
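def remove_common_headers_footers(text):
    """Remove common headers, footers, and page numbers from academic PDF text."""
    # Remove page numbers: lines that contain only digits. Matching line by line with
    # re.MULTILINE also catches consecutive number-only lines, which "\n\d+\n" misses
    # because two adjacent matches would have to share a newline.
    text = re.sub(r"^[ \t]*\d+[ \t]*$\n?", "", text, flags=re.MULTILINE)

    # Remove conference or publication headers that include 'Proceedings' anywhere in the line
    text = re.sub(r"^.*Proceedings.*$\n?", "", text, flags=re.MULTILINE)

    # Remove headers/footers referencing 'arXiv'.
    # Caveat: any body line that mentions arXiv is dropped as well.
    text = re.sub(r"^.*arXiv.*$\n?", "", text, flags=re.MULTILINE)

    return text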

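# Figure and table captions contain formatting artifacts and incomplete sentences that negatively impact training
def clean_step3_figures_tables(text):
    """Remove figure and table captions and references"""

    # Remove figure captions, which may span several lines: match from "Figure N:"
    # up to a blank line, a line starting with a capital letter, or a numbered heading.
    # (?i:...) limits case-insensitivity to the keyword; a global re.IGNORECASE would
    # make [A-Z] match lowercase letters too. re.DOTALL lets ".*?" cross line breaks.
    text = re.sub(r'(?i:Figure)\s+\d+:.*?(?=\n\n|\n[A-Z]|\n\d+\.)', '', text,
                  flags=re.DOTALL)

    # Remove table captions (same approach)
    text = re.sub(r'(?i:Table)\s+\d+:.*?(?=\n\n|\n[A-Z]|\n\d+\.)', '', text,
                  flags=re.DOTALL)

    # Remove parenthesized figure/table references in running text, e.g. "(Figure 3)"
    text = re.sub(r'\(Figure \d+.*?\)', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\(Table \d+.*?\)', '', text, flags=re.IGNORECASE)

    return text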

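# References are structured data, not natural language.
def remove_numbered_references(text):
    """Remove bibliography entries of the form "[12] Author. Title. ..."."""
    # Match an entry only when "[number]" starts a line, so inline citations such as
    # "as shown in [12]" are left alone. Without this anchor the pattern deletes
    # everything from the first inline citation onward.
    # An entry may span several lines and ends at the next line-initial "[number]",
    # at a blank line, or at the end of the text.
    return re.sub(
        r'^\[\d+\][\s\S]*?(?=^\[\d+\]|\n\s*\n|\Z)',
        '',
        text,
        flags=re.MULTILINE
    )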

def clean_step5_whitespace(text):
    """Normalize whitespace and formatting"""
    
    # Replace tabs and runs of spaces with a single space
    text = re.sub(r'[ \t]+', ' ', text)

    # Remove trailing/leading whitespace from each line (done before collapsing
    # newlines, so that whitespace-only lines count as blank lines)
    lines = [line.strip() for line in text.split('\n')]
    text = '\n'.join(lines)

    # Fix multiple newlines (keep paragraph breaks as double newlines)
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Remove empty lines at start and end of document
    text = text.strip()
    
    return text
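
To make the effect of the hyphenation step concrete, here is a small standalone sketch that applies the same line-break pattern to a toy string. The function name `dehyphenate` and the sample sentence are illustrative assumptions, not part of the pipeline above.

```python
import re

def dehyphenate(text):
    """Join a word that was split by a hyphen at a line break."""
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)

sample = "Models keep learn-\ning from pre-\ntraining data."
print(dehyphenate(sample))
# prints: Models keep learning from pretraining data.
```

The second word shows the trade-off: a genuine compound that happens to break at a line end ("pre-training") is merged as well, because the pattern cannot tell a soft hyphen from a real one.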

In [52]:
def clean_text_complete(raw_text):
    """Complete cleaning pipeline - all steps in sequence"""
    
    text = clean_step1_hyphenation(raw_text)
    text = remove_common_headers_footers(text)
    text = clean_step3_figures_tables(text)
    text = remove_numbered_references(text)
    text = clean_step5_whitespace(text)
    
    return text


In [42]:
# Apply the complete cleaning pipeline
cleaned_text = clean_text_complete(raw_text)

In [None]:
# I noticed \n connecting pairs of words e.g. "language\nmodels".
# Note: replacing every newline with a space also removes paragraph breaks.
cleaned_text = re.sub(r'\n+', ' ', cleaned_text)
print(cleaned_text[:1000])  # Display the first 1000 characters of the cleaned text
print(len(cleaned_text))    # Total number of characters in the cleaned text

### Saving the cleaned text

In [56]:
output_file = "merged_cleaned_output.txt"
with open(output_file, "w", encoding="utf-8") as f:
    f.write(cleaned_text)

print(f"Cleaned text saved to {output_file}")

Cleaned text saved to merged_cleaned_output.txt


In [47]:
input_file = "merged_output.txt"
output_file = "cleaned_2.txt"

with open(input_file, "r", encoding="utf-8") as f:
    text = f.read()

# 1. Fix hyphenation across line breaks
#PDFs often split words across lines with a hyphen at the end ("learn-\ning").
#We merge them back into a single word ("learning").
text = re.sub(r"-\n", "", text)


# 2. Normalize line breaks
#Sometimes OCR/PDF extraction leaves many unnecessary newlines mid-sentence.
#Replace multiple line breaks with a single space where appropriate.
text = re.sub(r"\n(?=[a-z])", " ", text)  # lowercase continuation
text = re.sub(r"\n+", "\n", text)         # collapse multiple blank line


# 3. Remove common headers/footers and page numbers
#Academic PDFs repeat headers (like "Proceedings of ...") and page numbers.
#This regex removes isolated numbers or phrases at line edges.
text = re.sub(r"\n\d+\n", "\n", text)                     # page numbers
text = re.sub(r"\n.*Proceedings.*\n", "\n", text)         # example conference headers
text = re.sub(r"\n.*arXiv.*\n", "\n", text)               # arXiv headers/footers


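# 4. Drop figure/table captions
#Captions usually start with "Figure X:" or "Table X:" and distract from the text.
#Caveat: the patterns are not anchored to the start of a line, so an in-text mention
#such as "as shown in Figure 3, ..." also loses the rest of its line.
text = re.sub(r"Figure\s+\d+.*\n", "", text, flags=re.IGNORECASE)
text = re.sub(r"Table\s+\d+.*\n", "", text, flags=re.IGNORECASE)


# 5. Trim the References section
#Most research papers end with "References". We usually don’t want those.
#Caveat: merged_output.txt concatenates every paper, and maxsplit=1 with [0] keeps only
#the text before the first "References" heading. Everything after it, including all
#later papers, is discarded. Trimming each document separately would avoid this.
text = re.split(r"\n\s*References\s*\n", text, maxsplit=1, flags=re.IGNORECASE)[0]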


# 6. Remove extra spaces
#Final polish: collapse multiple spaces, strip edges.
text = re.sub(r"[ \t]+", " ", text)
text = re.sub(r"\n\s+\n", "\n\n", text)  # clean up spaced blank lines
text = text.strip()

with open(output_file, "w", encoding="utf-8") as f:
    f.write(text)

print(f"Cleaned text saved to {output_file}")

Cleaned text saved to cleaned_2.txt
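
Because the merged file holds all reports back to back, trimming "References" once keeps only the text in front of the first such heading. A minimal sketch of per-document trimming is shown here, assuming the text of each PDF is kept as a separate string before merging. The names `trim_references` and `trim_all` and the toy documents are hypothetical.

```python
import re

# Assumed heading pattern: a line that contains only "References" (any case).
REFERENCES_HEADING = re.compile(
    r"^\s*References\s*$", flags=re.IGNORECASE | re.MULTILINE
)

def trim_references(doc_text):
    """Return the part of one document that precedes its first References heading."""
    match = REFERENCES_HEADING.search(doc_text)
    return doc_text[:match.start()] if match else doc_text

def trim_all(docs):
    """Trim each document separately, then join the results with a blank line."""
    return "\n\n".join(trim_references(doc).rstrip() for doc in docs)

docs = [
    "Paper A body.\nReferences\n[1] Some citation.",
    "Paper B body.\nREFERENCES\n[1] Another citation.",
]
print(trim_all(docs))
```

One consequence of this design is that appendices placed after the reference list are dropped together with it; whether that is acceptable depends on how much appendix text the dataset should keep.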


In [48]:
len(text)  # Total number of characters in the cleaned text

186382

# Tokenization

### Implementing regex based tokenizer and building the vocabulary

In [57]:
class SimpleTokenizer:
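    def __init__(self, vocab):
        # Make sure the unknown-word token has an ID: encode() falls back to it, and a
        # vocabulary built with sorted(set(...)) does not contain it.
        vocab = list(vocab)
        if "<|unk|>" not in vocab:
            vocab.append("<|unk|>")
        self.tokens2ids = {token: idx for idx, token in enumerate(vocab)}
        self.ids2tokens = {idx: token for idx, token in enumerate(vocab)}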

    # encode function turns text into token IDs
    def encode(self, text):
        text2tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        text2tokens = [
            item.strip() for item in text2tokens if item.strip()
        ]

        text2tokens = [
            item if item in self.tokens2ids
            else "<|unk|>" for item in text2tokens
            ]

        ids = [self.tokens2ids[s] for s in text2tokens]
        return ids

    # decode function turns token IDs back into text
    def decode(self, ids):
        text = " ".join([self.ids2tokens[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [None]:
with open("cleaned_2.txt", "r", encoding="utf-8") as f:
    cleaned_text = f.read()

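# Split the cleaned text (not raw_text) so the vocabulary matches the text that is encoded later
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', cleaned_text)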
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])
print('Total Number of Tokens in "cleaned_2.txt" are ', len(preprocessed))

# Building the vocabulary
vocab = sorted(set(preprocessed))
vocab_size = len(vocab)

print(vocab_size)
print(vocab[:20])

['Language', 'Models', 'are', 'Few-Shot', 'Learners', 'Tom', 'B', '.', 'Brown\x03Benjamin', 'Mann\x03Nick', 'Ryder\x03Melanie', 'Subbiah\x03', 'Jared', 'KaplanyPrafulla', 'Dhariwal', 'Arvind', 'Neelakantan', 'Pranav', 'Shyam', 'Girish', 'Sastry', 'Amanda', 'Askell', 'Sandhini', 'Agarwal', 'Ariel', 'Herbert-Voss', 'Gretchen', 'Krueger', 'Tom']
Total Number of Tokens in "cleaned_2.txt" are  39398
5496
['\x00', '\x001', '\x002', '\x03Equal', '\x10mpotriva', '\x10n', '\x10ntr-un', '\x12', '\x13', '\x18', '\x180', '\x18200word', '\x1833%', '\x1838years', '\x18500word', '\x1886%', '\x1888%', '\x18ei', '\x18i', '\x1950']
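
The vocabulary printed above begins with entries such as `'\x00'` and `'\x03Equal'`, and the token list contains fused names like `'Brown\x03Benjamin'`. These are control characters left over from PDF extraction. A minimal sketch for removing them before tokenization follows; the helper name `strip_control_chars` is hypothetical, and replacing each control character with a space (rather than deleting it) is an assumption chosen so that fused words are separated.

```python
import re

# C0 control characters and DEL, except tab (\x09), newline (\x0a), and carriage return (\x0d).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def strip_control_chars(text, replacement=" "):
    """Replace leftover control characters with `replacement`."""
    return CONTROL_CHARS.sub(replacement, text)

sample = "Brown\x03Benjamin Mann\x03Nick"
print(strip_control_chars(sample).split())
# prints: ['Brown', 'Benjamin', 'Mann', 'Nick']
```

Running such a step before building the vocabulary would change the token counts reported in this notebook, so the recorded numbers apply only to the text as cleaned above.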


In [64]:
# Encoding and decoding text with our regex based tokenizer and vocabulary
regextokenizer = SimpleTokenizer(vocab)

regex_token_integers = regextokenizer.encode(text)
print(regex_token_integers)

[1119, 1202, 1972, 912, 1127, 1608, 668, 54, 720, 1166, 1443, 1541, 1058, 1081, 847, 656, 1237, 1345, 1499, 965, 1468, 629, 661, 1466, 618, 651, 1001, 978, 1094, 1608, 999, 1426, 760, 613, 1402, 830, 1143, 54, 1727, 1060, 1704, 776, 1696, 769, 1003, 1171, 756, 882, 1501, 1173, 1135, 1475, 976, 695, 757, 1051, 774, 769, 696, 1463, 1175, 622, 1401, 1019, 1550, 831, 633, 1276, 600, 1410, 5283, 3195, 2581, 4837, 3094, 3904, 3627, 1226, 4921, 1927, 2102, 2189, 4149, 3904, 1789, 3478, 2469, 3888, 4956, 3022, 2189, 5484, 3904, 1789, 4721, 4918, 54, 1682, 5078, 4919, 3315, 1970, 35, 4982, 3682, 4780, 4418, 4920, 5484, 2542, 3888, 4988, 3941, 4940, 3888, 4988, 3888, 2885, 54, 728, 2446, 35, 3263, 2204, 3112, 4052, 1789, 3814, 3476, 4918, 3073, 3915, 1789, 2991, 2885, 3941, 3073, 4647, 3392, 5319, 4699, 5251, 2518, 1226, 4894, 4780, 3480, 4815, 5002, 2709, 54, 1002, 5219, 4623, 4962, 4522, 5126, 3476, 3735, 3161, 3313, 4919, 35, 2993, 4053, 35, 4700, 2870, 4308, 2356, 5270, 4189, 4760, 5485, 196

In [61]:
# Decode the IDs produced by the regex tokenizer (not the BPE IDs in `integers`)
print(regextokenizer.decode(regex_token_integers))

Language Models are Few-Shot Learners Tom B. BrownBenjamin MannNick RyderMelanie Subbiah Jared KaplanyPrafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever Dario Amodei OpenAI Abstract Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by ﬁne-tuning on a speciﬁc task. While typically task-agnostic in architecture, this method still requires task-speciﬁc ﬁne-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to 

### BPE tokenization

In [63]:
import tiktoken

bpe_tokenizer = tiktoken.get_encoding("gpt2")
integers = bpe_tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

strings = bpe_tokenizer.decode(integers)

[32065, 32329, 389, 20463, 12, 28512, 8010, 2741, 198, 13787, 347, 13, 4373, 191, 11696, 13337, 20291, 191, 23609, 40238, 191, 21102, 34166, 3834, 65, 9520, 191, 198, 41, 1144, 11611, 489, 1092, 47, 430, 12853, 64, 20529, 2743, 16783, 943, 50172, 3169, 417, 461, 415, 272, 1736, 272, 615, 36838, 321, 23837, 680, 311, 459, 563, 198, 5840, 5282, 1081, 17164, 3837, 71, 5362, 2449, 283, 16783, 33364, 28648, 12, 53, 793, 35877, 6607, 13685, 518, 1362, 4186, 6752, 394, 272, 198, 30003, 261, 5932, 1215, 414, 64, 371, 1047, 71, 7806, 337, 13, 1168, 15702, 1754, 19627, 18027, 3779, 45535, 10633, 198, 38025, 367, 35270, 2940, 12555, 7651, 21984, 1754, 24787, 385, 89, 25659, 5404, 4746, 12723, 198, 11696, 13337, 25774, 3619, 11264, 12803, 4312, 1008, 198, 16305, 5108, 392, 1836, 38285, 5325, 3841, 49804, 64, 311, 5500, 365, 332, 360, 4982, 1703, 1098, 72, 198, 11505, 20185, 198, 23839, 198, 26446, 670, 468, 9555, 8904, 8810, 319, 867, 399, 19930, 8861, 290, 31747, 416, 662, 12, 34409, 319, 257, 15

### Comparison of regex and bpe tokenizer outputs

In [81]:
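# Total tokens produced by each tokenizer, and the number of distinct token IDs that
# actually occur in this text. For BPE the latter is not the vocabulary size: the GPT-2
# encoding defines 50,257 tokens, of which only a subset appears here.
total_regex_tokens = len(regex_token_integers)
regex_vocab_size = len(set(regex_token_integers))
total_bpe_tokens = len(integers)
bpe_vocab_size = len(set(integers))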


comparison = {
    "regex_total_tokens": total_regex_tokens,
    "regex_vocab_size": regex_vocab_size,
    "bpe_total_tokens": total_bpe_tokens,
    "bpe_vocab_size": bpe_vocab_size
}
comparison

{'regex_total_tokens': 39398,
 'regex_vocab_size': 5496,
 'bpe_total_tokens': 49674,
 'bpe_vocab_size': 5852}

# Key Differences between Regex Tokenizer and BPE Tokenizer

### 1. Handling of Unknown Words

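- **Regex Tokenizer:** Maps every out-of-vocabulary word (e.g., "bioinformatics" if it never occurred in the source text) to a single `<|unk|>` token, so the original word cannot be recovered when decoding. No unknowns occur above only because the vocabulary was built from the same text that was encoded.

- **BPE Tokenizer:** Breaks unseen words into known subword pieces, down to single bytes if necessary, so no unknown token is needed and decoding restores the original text.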

### 2. Vocabulary Size vs Coverage

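- **Regex:** The vocabulary grows with the corpus (5,496 entries for this text) and still misses every word that was not in the text it was built from.

- **BPE:** The GPT-2 vocabulary is fixed at 50,257 tokens regardless of the corpus; only 5,852 distinct tokens occur in this text. Coverage comes from composing subwords rather than from adding entries.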

### 3. Subword Splitting Capability

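- **Regex:** No subword splitting; a word such as "bioinformatics" is treated as one atomic unit.

- **BPE:** Splits rare words into frequent subword pieces. The splits follow merge statistics learned from the tokenizer's training data, so they often, but not always, coincide with morphemes.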

### 4. Efficiency Trade-offs

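- **Regex:** Fewer tokens for the same text (39,398 vs. 49,674 here) and a very cheap encoding step.

- **BPE:** More tokens, but every piece maps to a vocabulary entry, which makes the encoding robust to rare words, typos, and extraction noise.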

### 5. Generalization

- **Regex:** Poor; words not seen while building the vocabulary all collapse to `<|unk|>`.

- **BPE:** Strong; byte-level BPE can represent any string through combinations of subwords and, ultimately, single bytes.
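
To illustrate how BPE arrives at its subword pieces, the sketch below trains a toy character-level BPE on five words and then segments a word it has never seen. This is a simplified illustration under stated assumptions, not the GPT-2 tokenizer: tiktoken's `gpt2` encoding works on bytes, applies a pre-tokenization regex, and ships with merges learned on a large corpus. The function names and the training words are made up for the example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Start from single characters and count how often each word occurs.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new rule to every word before counting again.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def segment(word, merges):
    """Split an arbitrary word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

training_words = ["token", "tokens", "tokenizer", "tokenize", "broken"]
merges = learn_merges(training_words, num_merges=4)
print(merges)
print(segment("tokenization", merges))  # a word that was not in the training list
```

The unseen word is never mapped to an unknown token: parts that match learned merges come out as multi-character pieces, and the rest falls back to single characters. The pieces reflect pair frequencies in the training words, which is why they need not line up with morphemes.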


# Dataset and Dataloader

In [74]:
from torch.utils.data import Dataset, DataLoader
import torch

class CustomDataset(Dataset):
    def __init__(self, txt, tokenizer, context_size, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > context_size, "Number of tokens must be at least context_size + 1"

        # Use a sliding window to chunk the text into overlapping sequences of context_size
        for i in range(0, len(token_ids) - context_size, stride):
            input_chunk = token_ids[i:i + context_size]
            target_chunk = token_ids[i + 1: i + context_size + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
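
The sliding-window loop in `CustomDataset` can be examined in isolation with plain Python lists. The sketch below mirrors the same index arithmetic on eight toy token IDs; the function name `sliding_windows` and the toy IDs are illustrative assumptions.

```python
def sliding_windows(token_ids, context_size, stride):
    """Build (input, target) pairs where each target is the input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - context_size, stride):
        inputs = token_ids[i:i + context_size]
        targets = token_ids[i + 1:i + context_size + 1]
        pairs.append((inputs, targets))
    return pairs

ids = list(range(10, 18))  # eight toy token IDs: 10, 11, ..., 17
for inputs, targets in sliding_windows(ids, context_size=4, stride=2):
    print(inputs, "->", targets)
# prints:
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [12, 13, 14, 15] -> [13, 14, 15, 16]
```

With `stride` smaller than `context_size` the windows overlap, which yields more training pairs from the same text at the cost of repeated tokens. Tokens at the very end (17 here) can be left out when the remaining length does not fill another window.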

In [75]:
def create_dataloader(txt, batch_size=4, context_size=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = CustomDataset(txt, tokenizer, context_size, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [None]:
# Build the loader from the cleaned text, so training data reflects the cleaning steps above
dataloader = create_dataloader(
    cleaned_text, batch_size=2, context_size=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Statistical Analysis

In [None]:
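# Corpus-level removal (character-level proxy): share of raw characters dropped by cleaning.
# `cleaned_text` was reloaded from cleaned_2.txt above, the same text both tokenizers encoded.
removed_char_percentage = 100 * (len(raw_text) - len(cleaned_text)) / len(raw_text)

# Caveat: the averages divide corpus-wide token totals by the number of reports;
# they are not computed per document.
summary = {
    "total_docs": len(pdf_links),
    "regex_total_tokens": len(regex_token_integers),
    "bpe_total_tokens": len(integers),
    "avg_tokens_per_doc_regex": len(regex_token_integers) / len(pdf_links),
    "avg_tokens_per_doc_bpe": len(integers) / len(pdf_links),
    "removed_char_percentage": removed_char_percentage,
}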

summary

{'total_docs': 37,
 'regex_total_tokens': 39398,
 'bpe_total_tokens': 49674,
 'avg_tokens_per_doc_regex': 1064.8108108108108,
 'avg_tokens_per_doc_bpe': 1342.5405405405406}
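
The assignment also asks for a before/after comparison and for per-document averages. A minimal sketch of both computations is given below, assuming the raw and cleaned text of each report is available as a separate string. The helper names `removal_stats` and `per_doc_token_stats` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def removal_stats(raw_text, cleaned_text):
    """Character-level summary of how much text cleaning removed."""
    removed = len(raw_text) - len(cleaned_text)
    percent = 100 * removed / len(raw_text) if raw_text else 0.0
    return {
        "raw_chars": len(raw_text),
        "cleaned_chars": len(cleaned_text),
        "removed_chars": removed,
        "removed_percent": percent,
    }

def per_doc_token_stats(docs, tokenize):
    """Token totals and the per-document average for a list of document strings."""
    counts = [len(tokenize(doc)) for doc in docs]
    average = sum(counts) / len(counts) if counts else 0.0
    return {
        "total_docs": len(counts),
        "total_tokens": sum(counts),
        "avg_tokens_per_doc": average,
    }

# Toy inputs; str.split is a placeholder for regextokenizer.encode or bpe_tokenizer.encode.
print(removal_stats("abcdefghij", "abcdefg"))
print(per_doc_token_stats(["a b c", "d e"], str.split))
```

Computing the counts per document, rather than dividing one corpus total by the number of reports, also makes it visible when a cleaning step has removed most of a document.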