# Homework 1

You are provided with a list of technical reports about large language models (LLMs) trained by different companies. Your assignment is to convert these PDFs into a dataset that could be used to train an LLM.

1. Download, extract, and combine all the PDFs into one text file.
   - Write code to download all the provided PDF files,
   - extract text from them, and
   - combine everything into a single text file.
2. Clean the text file, e.g., fix hyphenation across line breaks, remove headers/footers/page numbers, drop figure/table captions, trim “References”, etc.
   - For each cleaning step, provide comments explaining why you chose to apply it.
3. Tokenization
   - Implement a regex-based tokenizer and build your own vocabulary.
   - Tokenize the same text using Byte Pair Encoding (BPE) from the tiktoken library.
   - Compare the two methods and comment on the differences (e.g., handling of unknown words, vocabulary size, subword splitting).
4. Your own dataset and dataloader
   - Prepare your own dataset of input–target sequences (using a sliding window approach).
   - Implement a PyTorch DataLoader to batch the dataset for training.
5. Statistical analysis
   - Total documents, total tokens, average tokens per document.
   - Compare before vs. after cleaning: how much text was removed.

You should prepare a single Python notebook with your code and answers. Make sure the notebook runs without errors before submitting.


# 1. Download, extract and combine all PDF files into one text file

## a. Download pdfs from urls

In [2]:
import requests
import os

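def download_pdf(url, filename):
    # Some URLs in the list below carry a leading space; strip it before requesting.
    # The timeout keeps a stalled connection from hanging the notebook.
    response = requests.get(url.strip(), timeout=60)
    if response.status_code == 200:
        with open(filename, 'wb') as f:
            f.write(response.content)
    else:
        print(f"Failed to download {url} (HTTP {response.status_code})")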

# List of (title, url) for the PDFs to download
pdf_links = [
    ("GPT-3", "https://arxiv.org/pdf/2005.14165.pdf"),
    ("GPT-4", "https://arxiv.org/pdf/2303.08774.pdf"),
    ("PaLM", "https://arxiv.org/pdf/2204.02311.pdf"),
    ("PaLM2", "https://arxiv.org/pdf/2305.10403.pdf"),
    ("Gemini 1.0", "https://arxiv.org/pdf/2312.11805.pdf"),
    ("Gemini 1.5 (2024)", " https://arxiv.org/pdf/2403.05530.pdf"),
    ("Gemma (2024)", " https://arxiv.org/pdf/2403.08295.pdf"),
    ("Gemma 2 (2024)", " https://arxiv.org/pdf/2408.00118.pdf"),
    ("Gemma 3", " https://arxiv.org/pdf/2503.19786.pdf"),
    ("CodeGemma (2024)", " https://arxiv.org/pdf/2406.11409.pdf"),
    ("RecurrentGemma (2024)", " https://arxiv.org/pdf/2404.07839.pdf"),
    ("LLaMA (2023)", " https://arxiv.org/pdf/2302.13971.pdf"),
    ("Llama 2 (2023)", " https://arxiv.org/pdf/2307.09288.pdf"),
    ("Llama 3 (2024)", " https://arxiv.org/pdf/2407.21783.pdf"),
    # Mistral
    ("Mistral 7B (2023)", " https://arxiv.org/pdf/2310.06825.pdf"),
    ("Mixtral of Experts 8x7B (2024)", " https://arxiv.org/pdf/2401.04088.pdf"),
    # NVIDIA
    ("Nemotron-4 340B Technical Report (2024)", " https://arxiv.org/pdf/2406.11704.pdf"),
    ("NVLM 1.0 (2024)", " https://arxiv.org/pdf/2409.11402.pdf"),
    # Alibaba / Qwen series
    ("Qwen2 Technical Report (2024)", " https://arxiv.org/pdf/2407.10671.pdf"),
    ("Qwen2-VL (2024)", " https://arxiv.org/pdf/2409.12191.pdf"),
    ("Qwen2-Audio (2024)", " https://arxiv.org/pdf/2407.10759.pdf"),
    ("Qwen2.5 Technical Report (2024)", " https://arxiv.org/pdf/2412.15115.pdf"),
    ("Qwen2.5-VL Technical Report (2025)", " https://arxiv.org/pdf/2502.13923.pdf"),
    ("Qwen2.5-Omni Technical Report (2025)", " https://arxiv.org/pdf/2503.20215.pdf"),
    ("Qwen3 Technical Report (2025)", " https://arxiv.org/pdf/2505.09388.pdf"),
    # DeepSeek series
    ("DeepSeek-V2 (2024)", " https://arxiv.org/pdf/2405.04434.pdf"),
    ("DeepSeek-V3 Technical Report (2024)", " https://arxiv.org/pdf/2412.19437.pdf"),
    ("DeepSeek-R1 (2025)", " https://arxiv.org/pdf/2501.12948.pdf"),
    ("DeepSeek-Coder (2024)", " https://arxiv.org/pdf/2401.14196.pdf"),
    # ZhipuAI
    ("GLM-130B (2022)", " https://arxiv.org/pdf/2210.02414.pdf"),
    # Shanghai AI Lab
    ("InternLM2 Technical Report (2024)", " https://arxiv.org/pdf/2403.17297.pdf"),
    ("InternVL 2.5 (2024)", " https://arxiv.org/pdf/2412.05271.pdf"),
    # Microsoft
    ("Phi-3 Technical Report (2024)", " https://arxiv.org/pdf/2404.14219.pdf"),
    ("Phi-3 Safety Post-Training (2024)", " https://arxiv.org/pdf/2407.13833.pdf"),
    # AI21
    ("Jamba: Hybrid Transformer–Mamba (2024)", " https://arxiv.org/pdf/2403.19887.pdf"),
    # Huawei
    ("PanGu-Σ (2023)", " https://arxiv.org/pdf/2303.10845.pdf"),
    # 01.AI
    ("Yi: Open Foundation Models (2024)", " https://arxiv.org/pdf/2403.04652.pdf")
]



# for file_name, url in pdf_links:
#     download_pdf(url, f'{file_name}.pdf')
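
Several titles in `pdf_links` contain characters such as `:` that are not valid in filenames on every platform, and most URLs carry a leading space. A minimal sketch of normalizing both before downloading follows; `safe_filename` and `plan_downloads` are hypothetical helpers, not part of the original notebook.

```python
import re

def safe_filename(title: str) -> str:
    """Replace every run of characters outside [A-Za-z0-9._-] with one underscore."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_")

def plan_downloads(links):
    """Return (pdf_path, clean_url) pairs for a list of (title, url) tuples."""
    return [(f"{safe_filename(title)}.pdf", url.strip()) for title, url in links]

# Two entries copied from pdf_links above
sample_links = [
    ("Gemma 3", " https://arxiv.org/pdf/2503.19786.pdf"),
    ("Jamba: Hybrid Transformer–Mamba (2024)", " https://arxiv.org/pdf/2403.19887.pdf"),
]
for path, url in plan_downloads(sample_links):
    print(path, url)
```

If names are sanitized this way, the extraction step has to build its file list with the same helper so that both cells agree on the paths.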


## b. & c. Extract the text and combine into one txt file

In [4]:
from PyPDF2 import PdfReader

def pdfs_to_text(pdf_files, output_txt):
    with open(output_txt, 'w', encoding='utf-8') as outfile:
        for pdf_file in pdf_files:
            reader = PdfReader(pdf_file)
            # Extract text from each page, insert \f after each page
            for page in reader.pages:
                text = page.extract_text()
                if text:
                    outfile.write(text)
                    outfile.write('\f')  # Insert form feed to separate pages
    print(f'Text merged into: {output_txt}')

# Extract text from every downloaded PDF and merge it into one file
pdf_list = [f'{file_name}.pdf' for file_name, _ in pdf_links]
pdfs_to_text(pdf_list, 'merged_output_new.txt')



Text merged into: merged_output_new.txt
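
Because all papers end up in one file, per-document steps (trimming each paper's references, counting documents for the statistics) need a way to find paper boundaries later. One option, sketched here under the assumption that a marker string such as `<|endofdoc|>` never occurs in the papers, is to write a separator between documents. `DOC_SEP`, `merge_documents`, and `split_documents` are illustrative names, not part of the notebook.

```python
DOC_SEP = "\n<|endofdoc|>\n"  # hypothetical marker, assumed absent from the PDFs

def merge_documents(docs):
    """Join per-document page lists: pages by form feed, documents by DOC_SEP."""
    return DOC_SEP.join("\f".join(pages) for pages in docs)

def split_documents(merged):
    """Invert merge_documents: return a list of page lists, one per document."""
    return [doc.split("\f") for doc in merged.split(DOC_SEP)]

# Toy input: two documents with two pages and one page respectively
docs = [["paper A, page 1", "paper A, page 2"], ["paper B, page 1"]]
merged = merge_documents(docs)
print(len(split_documents(merged)), "documents")
```

The round trip only holds while the marker and the form feed do not appear inside page text, which is why the marker choice is flagged as an assumption.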


# 2. Clean the text file, e.g., fix hyphenation across line breaks, remove headers/footers/page numbers, drop figure/table captions, trim “References”, etc.

In [14]:
with open("merged_output.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [15]:
raw_text[:10000]  # Display the first 10,000 characters of the raw text

'Language Models are Few-Shot Learners\nTom B. Brown\x03Benjamin Mann\x03Nick Ryder\x03Melanie Subbiah\x03\nJared KaplanyPrafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry\nAmanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan\nRewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter\nChristopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray\nBenjamin Chess Jack Clark Christopher Berner\nSam McCandlish Alec Radford Ilya Sutskever Dario Amodei\nOpenAI\nAbstract\nRecent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training\non a large corpus of text followed by ﬁne-tuning on a speciﬁc task. While typically task-agnostic\nin architecture, this method still requires task-speciﬁc ﬁne-tuning datasets of thousands or tens of\nthousands of examples. By contrast, humans can generally perform a new language task from only\na few examples or from simple instructions – something which current NLP systems

In [27]:
len(raw_text)  # Total number of characters in the raw text

4790145

In [None]:
import regex
from collections import Counter
from typing import List

def split_pages(text: str) -> List[str]:
    return text.split("\f")

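# Normalizes line breaks and spacing so the later regex steps see consistent input.
def normalize_whitespace(text: str) -> str:
    # Normalize all newlines to \n, replace NBSP, collapse tabs and vertical tabs.
    # Form feeds (\x0c, i.e. "\f") are kept: they mark page boundaries, and
    # split_pages() needs them inside remove_headers_footers().
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\u00A0", " ")
    return regex.sub(r"[\t\x0b]+", " ", text)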

# Headers/footers contain metadata that doesn't contribute to main content and can confuse language models.
def remove_headers_footers(text: str, threshold: float = 0.6) -> str:
    pages = split_pages(text)
    first = [p.strip().splitlines()[0] for p in pages if p.strip()]
    last  = [p.strip().splitlines()[-1] for p in pages if p.strip()]
    hc = {ln for ln,c in Counter(first).items() if c >= threshold*len(pages)}
    fc = {ln for ln,c in Counter(last).items()  if c >= threshold*len(pages)}
    cleaned = []
    for p in pages:
        lines = [ln for ln in p.splitlines() if ln.strip()]
        if lines and lines[0] in hc: lines.pop(0)
        if lines and lines[-1] in fc: lines.pop(-1)
        if lines and regex.fullmatch(r"\d+", lines[-1]): lines.pop(-1)
        cleaned.append("\n".join(lines))
    return "\n\n".join(cleaned)

# PDFs often break words at line ends with a hyphen, leaving fragments; rejoin them unless either half is all caps (e.g. an acronym).
def fix_hyphenation(text: str) -> str:
    return regex.sub(
        r"(\p{L}+)-\n(\p{L}+)",
        lambda m: m.group(1) + m.group(2) if not (m.group(1).isupper() or m.group(2).isupper()) else m.group(0),
        text
    )

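# Extracted PDF text breaks lines at the column width, not at sentence ends,
# so a line is merged into the previous one unless that line ends a sentence.
def join_broken_lines(text: str) -> str:
    lines, out = text.split("\n"), []
    for ln in lines:
        ln = ln.rstrip()
        # out[-1] must be non-empty so blank lines (paragraph breaks) survive.
        if out and out[-1] and ln and not regex.search(r"[\.!?\):;]$", out[-1]):
            out[-1] += " " + ln.lstrip()
        else:
            out.append(ln)
    return "\n".join(out)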

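# Captions are sentence fragments detached from the body text, so lines that start with one are dropped.
def drop_figure_table_captions(text: str) -> str:
    return "\n".join(
        ln for ln in text.splitlines()
        if not regex.match(r"^\s*(Figure|Fig\.|Table)\s+\d+", ln)
    )

# Bibliography entries are citation metadata rather than prose; cut at the last "References" heading.
def trim_references(text: str) -> str:
    m = list(regex.finditer(r"(?im)^References\s*$", text))
    return text[:m[-1].start()] if m else text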

def clean_document_text(raw_text: str, header_threshold: float = 0.6) -> str:
    text = normalize_whitespace(raw_text)
    text = remove_headers_footers(text, header_threshold)
    text = fix_hyphenation(text)
    text = drop_figure_table_captions(text)
    text = join_broken_lines(text)
    text = trim_references(text)
    return regex.sub(r"\n{3,}", "\n\n", text).strip()


In [None]:
import re


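# PDFs break long words at line ends with a hyphen; rejoining them avoids fragment tokens such as "learn" and "ing".
def clean_step1_hyphenation(text):
    """Fix hyphenated words across line breaks"""

    # Fix hyphenation across line breaks: "word-\npart" → "wordpart"
    # (true compounds split at the hyphen, e.g. "pre-\ntraining", are merged too)
    text = re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)

    # Fix hyphenation followed by two or more spaces: "word-  part" → "wordpart"
    text = re.sub(r'(\w+)-\s{2,}(\w+)', r'\1\2', text)

    # Handle lone carriage-return line endings (\r), which the first pattern does not match
    text = re.sub(r'(\w+)-\s*[\r\n]+\s*(\w+)', r'\1\2', text)

    return text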
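
Joining every `word-\npart` pair also merges genuine compounds that happened to break at their hyphen. A common heuristic, sketched below as an alternative and not as the notebook's method, is to join only when the joined form occurs unbroken elsewhere in the text and to keep the hyphen otherwise. `dehyphenate` is a hypothetical helper.

```python
import re
from collections import Counter

def dehyphenate(text: str) -> str:
    """Join 'word-\\npart' only when the joined form occurs elsewhere in the text."""
    # Every alphabetic word that appears unbroken in the text, lowercased.
    known = Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))

    def repl(m):
        left, right = m.group(1), m.group(2)
        if (left + right).lower() in known:
            return left + right       # line-break hyphen: join the halves
        return f"{left}-{right}"      # likely a real compound: keep the hyphen

    return re.sub(r"([A-Za-z]+)-\n([A-Za-z]+)", repl, text)

sample = "Models keep learning. They learn-\ning from pre-\ntraining data."
print(dehyphenate(sample))
```

Here "learning" appears elsewhere in the sample, so the broken "learn-\ning" is joined, while "pre-\ntraining" becomes "pre-training" because "pretraining" never occurs. On a small corpus the check misses rare words, so a word list could serve as a fallback.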


In [None]:

# Page numbers and running headers repeat on every page and interrupt sentences that continue across pages.
def remove_common_headers_footers(text):
    """Remove common headers, footers, and page numbers from academic PDF text."""
    # Remove isolated page numbers (lines that contain only digits)
    text = re.sub(r"\n\d+\n", "\n", text)

    # Remove conference or publication headers that include 'Proceedings' anywhere in the line
    text = re.sub(r"\n.*Proceedings.*\n", "\n", text)

    # Remove any line mentioning 'arXiv' (headers/footers, but also body lines that cite arXiv)
    text = re.sub(r"\n.*arXiv.*\n", "\n", text)

    return text
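
One subtlety of the `\n\d+\n` pattern: each match consumes the newline that the next digits-only line would need, so the second of two consecutive number lines survives. The sketch below compares it with a line-anchored variant using `re.MULTILINE`; `drop_page_number_lines` is a hypothetical name, and the sample string is made up for illustration.

```python
import re

page = "end of page one\n12\n13\nstart of next page"

# Unanchored pattern: the "\n" that closes "12" is consumed by the first match,
# so "13" has no leading "\n" left to match against and is kept.
unanchored = re.sub(r"\n\d+\n", "\n", page)
print(repr(unanchored))

def drop_page_number_lines(text: str) -> str:
    """Remove every line that holds only digits (plus optional spaces or tabs)."""
    # Under re.MULTILINE, ^ and $ match at each line boundary, so every
    # digits-only line is removed independently of its neighbours.
    return re.sub(r"^[ \t]*\d+[ \t]*$\n?", "", text, flags=re.MULTILINE)

print(repr(drop_page_number_lines(page)))
```

Both variants also delete digits-only lines that belong to tables, so the step is safest after table content has been handled.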


In [49]:
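# Figure and table captions contain formatting artifacts and incomplete sentences that negatively impact training
def clean_step3_figures_tables(text):
    """Remove figure and table captions and references"""

    # A caption runs until a blank line, a line starting with an uppercase letter,
    # a numbered heading, or the end of the text. Only the keyword is matched
    # case-insensitively: a global IGNORECASE would also let [A-Z] match lowercase
    # letters, and re.DOTALL is needed for captions that wrap over several lines.
    caption_end = r'(?=\n\n|\n[A-Z]|\n\d+\.|\Z)'

    # Remove figure captions
    text = re.sub(r'(?i:Figure)\s+\d+:.*?' + caption_end, '', text, flags=re.DOTALL)

    # Remove table captions
    text = re.sub(r'(?i:Table)\s+\d+:.*?' + caption_end, '', text, flags=re.DOTALL)

    # Remove parenthetical figure/table references in text, e.g. "(Figure 3)"
    text = re.sub(r'\(Figure \d+.*?\)', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\(Table \d+.*?\)', '', text, flags=re.IGNORECASE)

    return text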


In [50]:
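# References are structured data, not natural language.
def remove_numbered_references(text):
    # Match a bibliography entry: a line starting with [number] plus up to four
    # continuation lines, stopping at the next [number] line or a blank line.
    # The match is anchored to line starts because an unanchored
    # r'\[\d+\][\s\S]*?(?=\[\d+\]|$)' also fires on inline citations such as
    # "as shown in [12]" and then removes everything from the first citation
    # to the end of the text.
    # Heuristic: a wrapped body line that happens to start with "[12]" is still removed.
    return re.sub(
        r'^\[\d+\][^\n]*(?:\n(?!\[\d+\]|\s*$)[^\n]*){0,4}\n?',
        '',
        text,
        flags=re.MULTILINE,
    )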
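
Reference removal is easier to reason about per paper than on the merged file, since each paper has its own "References" heading. A minimal sketch, assuming the documents are available as separate strings (the notebook itself works on one merged string); `REF_HEADING` and `trim_references_per_doc` are hypothetical names.

```python
import re

# A heading line that contains only "References" or "Bibliography"
REF_HEADING = re.compile(r"^\s*(References|Bibliography)\s*$", re.IGNORECASE | re.MULTILINE)

def trim_references_per_doc(docs):
    """Cut each document at its last reference heading; leave other documents unchanged."""
    trimmed = []
    for doc in docs:
        matches = list(REF_HEADING.finditer(doc))
        trimmed.append(doc[:matches[-1].start()].rstrip() if matches else doc)
    return trimmed

docs = [
    "Paper A body cites [1].\nReferences\n[1] Some entry.",
    "Paper B body without a bibliography heading.",
]
print(trim_references_per_doc(docs))
```

Inline citations such as `[1]` stay in the body text. Appendices that follow the reference list are cut as well, which may or may not be desirable for a training corpus.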


In [51]:
# Multiple spaces and inconsistent line breaks create noise that makes tokenization less effective.
def clean_step5_whitespace(text):
    """Normalize whitespace and formatting"""
    
    # Replace multiple spaces with single space
    text = re.sub(r' {2,}', ' ', text)
    
    # Replace tabs with spaces
    text = re.sub(r'\t', ' ', text)
    
    # Fix multiple newlines (keep paragraph breaks as double newlines)
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Remove trailing/leading whitespace from each line
    lines = [line.strip() for line in text.split('\n')]
    text = '\n'.join(lines)
    
    # Remove empty lines at start and end of document
    text = text.strip()
    
    return text


In [52]:
def clean_text_complete(raw_text):
    """Complete cleaning pipeline - all steps in sequence"""
    
    text = clean_step1_hyphenation(raw_text)
    text = remove_common_headers_footers(text)
    text = clean_step3_figures_tables(text)
    text = remove_numbered_references(text)
    text = clean_step5_whitespace(text)
    
    return text


In [42]:
# Apply the complete cleaning pipeline
cleaned_text = clean_text_complete(raw_text)

In [43]:
len(cleaned_text)  # Total number of characters in the cleaned text

236089

In [44]:
cleaned_text[:10000]  # Display the first 10,000 characters of the cleaned text

'Language Models are Few-Shot Learners\nTom B. Brown\x03Benjamin Mann\x03Nick Ryder\x03Melanie Subbiah\x03\nJared KaplanyPrafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry\nAmanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan\nRewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter\nChristopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray\nBenjamin Chess Jack Clark Christopher Berner\nSam McCandlish Alec Radford Ilya Sutskever Dario Amodei\nOpenAI\nAbstract\nRecent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training\non a large corpus of text followed by ﬁne-tuning on a speciﬁc task. While typically task-agnostic\nin architecture, this method still requires task-speciﬁc ﬁne-tuning datasets of thousands or tens of\nthousands of examples. By contrast, humans can generally perform a new language task from only\na few examples or from simple instructions – something which current NLP systems

In [53]:
# I noticed \n connecting pairs of words e.g. "language\nmodels"
cleaned_text = re.sub(r'\n+', ' ', cleaned_text)

In [54]:
cleaned_text[:10000]  # Display the first 10,000 characters of the cleaned text

'Language Models are Few-Shot Learners Tom B. Brown\x03Benjamin Mann\x03Nick Ryder\x03Melanie Subbiah\x03 Jared KaplanyPrafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever Dario Amodei OpenAI Abstract Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by ﬁne-tuning on a speciﬁc task. While typically task-agnostic in architecture, this method still requires task-speciﬁc ﬁne-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely

In [55]:
len(cleaned_text)  # Total number of characters in the cleaned text

236013
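
The character counts above can be turned into the before/after comparison the assignment asks for. A small sketch follows; `removal_stats` is a hypothetical helper, and it counts whitespace-separated tokens as a rough stand-in for tokenizer output.

```python
def removal_stats(raw: str, cleaned: str) -> dict:
    """Character and whitespace-token counts before and after cleaning."""
    raw_chars, clean_chars = len(raw), len(cleaned)
    return {
        "chars_before": raw_chars,
        "chars_after": clean_chars,
        "chars_removed_pct": 100 * (raw_chars - clean_chars) / raw_chars if raw_chars else 0.0,
        "tokens_before": len(raw.split()),
        "tokens_after": len(cleaned.split()),
    }

# Toy example: a caption line and a page number are removed
stats = removal_stats("Figure 1: caption\nBody text here.\n12\n", "Body text here.")
print(stats)
```

In the notebook this would be called as `removal_stats(raw_text, cleaned_text)`. With the lengths reported above (4,790,145 characters before, 236,013 after), roughly 95% of the characters were removed, a share large enough that it is worth checking how much each cleaning step removes on its own.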

### Saving the cleaned text

In [56]:
output_file = "merged_cleaned_output.txt"
with open(output_file, "w", encoding="utf-8") as f:
    f.write(cleaned_text)

print(f"Cleaned text saved to {output_file}")

Cleaned text saved to merged_cleaned_output.txt


In [47]:
input_file = "merged_output.txt"
output_file = "cleaned_2.txt"

with open(input_file, "r", encoding="utf-8") as f:
    text = f.read()

# 1. Fix hyphenation across line breaks
# PDFs often split words across lines with a hyphen at the end ("learn-\ning").
# We merge them back into a single word ("learning").
text = re.sub(r"-\n", "", text)


# 2. Normalize line breaks
# Sometimes OCR/PDF extraction leaves many unnecessary newlines mid-sentence.
# A newline followed by a lowercase letter is treated as a continuation and replaced by a space.
text = re.sub(r"\n(?=[a-z])", " ", text)  # lowercase continuation
text = re.sub(r"\n+", "\n", text)         # collapse runs of newlines (blank lines)


# 3. Remove common headers/footers and page numbers
# Academic PDFs repeat headers (like "Proceedings of ...") and page numbers.
# These patterns remove digits-only lines and any line containing the given keyword.
text = re.sub(r"\n\d+\n", "\n", text)                     # page numbers
text = re.sub(r"\n.*Proceedings.*\n", "\n", text)         # example conference headers
text = re.sub(r"\n.*arXiv.*\n", "\n", text)               # arXiv headers/footers


# 4. Drop figure/table captions
# Captions usually start with "Figure X:" or "Table X:" and distract from the text.
text = re.sub(r"Figure\s+\d+.*\n", "", text, flags=re.IGNORECASE)
text = re.sub(r"Table\s+\d+.*\n", "", text, flags=re.IGNORECASE)


# 5. Trim the References section
# Most research papers end with "References". We usually don’t want those.
# Note: maxsplit=1 with [0] keeps only the text before the first "References" heading.
# On a merged multi-paper file this drops every later paper, so apply this step
# to each paper before merging if all papers should be kept.
text = re.split(r"\n\s*References\s*\n", text, maxsplit=1, flags=re.IGNORECASE)[0]


# 6. Remove extra spaces
# Final polish: collapse multiple spaces, strip edges.
text = re.sub(r"[ \t]+", " ", text)
text = re.sub(r"\n\s+\n", "\n\n", text)  # clean up spaced blank lines
text = text.strip()

with open(output_file, "w", encoding="utf-8") as f:
    f.write(text)

print(f"Cleaned text saved to {output_file}")

Cleaned text saved to cleaned_2.txt


In [48]:
len(text)  # Total number of characters in the cleaned text

186382

# Tokenization

### Implementing a regex-based tokenizer and building the vocabulary

In [57]:
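class SimpleTokenizer:
    def __init__(self, vocab):
        # encode() maps unknown words to "<|unk|>", so that token needs an ID;
        # without it, any out-of-vocabulary word raises a KeyError.
        # Appending it at the end leaves the IDs of existing tokens unchanged.
        vocab = list(vocab)
        if "<|unk|>" not in vocab:
            vocab.append("<|unk|>")
        self.tokens2ids = {token: idx for idx, token in enumerate(vocab)}
        self.ids2tokens = {idx: token for token, idx in self.tokens2ids.items()}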

    # encode function turns text into token IDs
    def encode(self, text):
        text2tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        text2tokens = [
            item.strip() for item in text2tokens if item.strip()
        ]

        text2tokens = [
            item if item in self.tokens2ids
            else "<|unk|>" for item in text2tokens
            ]

        ids = [self.tokens2ids[s] for s in text2tokens]
        return ids

    # decode function turns token IDs back into text
    def decode(self, ids):
        text = " ".join([self.ids2tokens[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
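
Mapping out-of-vocabulary words to `<|unk|>` only works if that token has an ID in the vocabulary. A self-contained sketch of the same splitting rule with the special token appended follows; `tokenize`, `build_vocab`, and the module-level `encode` are simplified stand-ins for illustration, not the class above.

```python
import re

SPLIT = r'([,.:;?_!"()\']|--|\s)'  # same pattern as in SimpleTokenizer.encode

def tokenize(text):
    """Split on punctuation and whitespace, dropping empty pieces."""
    return [t.strip() for t in re.split(SPLIT, text) if t.strip()]

def build_vocab(text):
    """Sorted unique tokens, with the unknown-word token appended last."""
    tokens = sorted(set(tokenize(text))) + ["<|unk|>"]
    return {tok: i for i, tok in enumerate(tokens)}

def encode(text, vocab):
    """Map each token to its ID, falling back to the ID of <|unk|>."""
    return [vocab.get(tok, vocab["<|unk|>"]) for tok in tokenize(text)]

vocab = build_vocab("Language models are few-shot learners.")
ids = encode("Language models are zero-shot learners.", vocab)
print(ids)
```

The word "zero-shot" is not in the toy vocabulary, so it receives the last ID. A BPE tokenizer would instead split it into known subword pieces, which is the main difference to discuss in the comparison.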

In [58]:
with open("cleaned_2.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])
print('Total Number of Tokens in "cleaned_2.txt" are ', len(preprocessed))

# Building the vocabulary
vocab = sorted(set(preprocessed))
vocab_size = len(vocab)

print(vocab_size)
print(vocab[:20])

['Language', 'Models', 'are', 'Few-Shot', 'Learners', 'Tom', 'B', '.', 'Brown\x03Benjamin', 'Mann\x03Nick', 'Ryder\x03Melanie', 'Subbiah\x03', 'Jared', 'KaplanyPrafulla', 'Dhariwal', 'Arvind', 'Neelakantan', 'Pranav', 'Shyam', 'Girish', 'Sastry', 'Amanda', 'Askell', 'Sandhini', 'Agarwal', 'Ariel', 'Herbert-Voss', 'Gretchen', 'Krueger', 'Tom']
Total Number of Tokens in "cleaned_2.txt" are  39398
5496
['\x00', '\x001', '\x002', '\x03Equal', '\x10mpotriva', '\x10n', '\x10ntr-un', '\x12', '\x13', '\x18', '\x180', '\x18200word', '\x1833%', '\x1838years', '\x18500word', '\x1886%', '\x1888%', '\x18ei', '\x18i', '\x1950']
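
The first vocabulary entries are control characters (`\x00`, `\x18`, ...) left over from PDF extraction, and the text still contains ligatures such as "ﬁ". A hedged sketch of a normalization pass that could run before tokenization follows; `normalize_characters` is a hypothetical helper, and NFKC also rewrites other compatibility characters (for example superscript digits), which may or may not be wanted.

```python
import unicodedata

def normalize_characters(text: str) -> str:
    """Expand compatibility characters (e.g. the 'ﬁ' ligature) and blank out control codes."""
    text = unicodedata.normalize("NFKC", text)
    keep = {"\n", "\t", "\f"}  # layout characters the cleaning steps rely on
    # Replace every other control character (Unicode category Cc) with a space,
    # so words glued together by one, like "Brown\x03Benjamin", are separated.
    return "".join(
        ch if ch in keep or unicodedata.category(ch) != "Cc" else " "
        for ch in text
    )

print(repr(normalize_characters("ﬁne-tuning by Brown\x03Benjamin\n")))
```

Applied before the vocabulary is built, this would merge "ﬁne-tuning" and "fine-tuning" into one entry and keep control-character tokens out of the vocabulary.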


In [60]:
# Encoding and decoding text with our regex based tokenizer and vocabulary
regextokenizer = SimpleTokenizer(vocab)

integers = regextokenizer.encode(raw_text)  # raw_text holds cleaned_2.txt, the text the vocabulary was built from
print(integers)

[1119, 1202, 1972, 912, 1127, 1608, 668, 54, 720, 1166, 1443, 1541, 1058, 1081, 847, 656, 1237, 1345, 1499, 965, 1468, 629, 661, 1466, 618, 651, 1001, 978, 1094, 1608, 999, 1426, 760, 613, 1402, 830, 1143, 54, 1727, 1060, 1704, 776, 1696, 769, 1003, 1171, 756, 882, 1501, 1173, 1135, 1475, 976, 695, 757, 1051, 774, 769, 696, 1463, 1175, 622, 1401, 1019, 1550, 831, 633, 1276, 600, 1410, 5283, 3195, 2581, 4837, 3094, 3904, 3627, 1226, 4921, 1927, 2102, 2189, 4149, 3904, 1789, 3478, 2469, 3888, 4956, 3022, 2189, 5484, 3904, 1789, 4721, 4918, 54, 1682, 5078, 4919, 3315, 1970, 35, 4982, 3682, 4780, 4418, 4920, 5484, 2542, 3888, 4988, 3941, 4940, 3888, 4988, 3888, 2885, 54, 728, 2446, 35, 3263, 2204, 3112, 4052, 1789, 3814, 3476, 4918, 3073, 3915, 1789, 2991, 2885, 3941, 3073, 4647, 3392, 5319, 4699, 5251, 2518, 1226, 4894, 4780, 3480, 4815, 5002, 2709, 54, 1002, 5219, 4623, 4962, 4522, 5126, 3476, 3735, 3161, 3313, 4919, 35, 2993, 4053, 35, 4700, 2870, 4308, 2356, 5270, 4189, 4760, 5485, 196

In [61]:
print(regextokenizer.decode(integers))

Language Models are Few-Shot Learners Tom B. BrownBenjamin MannNick RyderMelanie Subbiah Jared KaplanyPrafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever Dario Amodei OpenAI Abstract Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by ﬁne-tuning on a speciﬁc task. While typically task-agnostic in architecture, this method still requires task-speciﬁc ﬁne-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to 