# Homework 1

You are provided with a list of technical reports about large language models (LLMs) trained by different companies. Your assignment is to convert these PDFs into a training dataset that could be used to train an LLM. <br>
Write a code to download, extract, and combine all the PDFs into one text file.
Write code to download all the provided PDF files,
extract text from them, and
Combine everything into a single text file. <br>
Clean the text file, e.g., fix hyphenation across line breaks, remove headers/footers/page numbers, drop figure/table captions, trim “References” etc.. 
For each cleaning step, provide comments explaining why you chose to apply it. <br>
Tokenization
Implement a regex-based tokenizer and build your own vocabulary.
Tokenize the same text using Byte Pair Encoding (BPE) from the tiktoken library.
Compare the two methods and comment on the differences (e.g., handling of unknown words, vocabulary size, subword splitting). <br>
Your own dataset and dataloader
Prepare your own dataset of input–target sequences (using a sliding window approach).
Implement a PyTorch DataLoader to batch the dataset for training. <br>
Statistical Analysis
Total documents, total tokens, average tokens per doc.
Compare before vs after cleaning: how much text was removed. <br>

You should prepare a single python notebook with your code and answers. Make sure your notebook is working without any errors before submitting.


# 1. Download, extract and combine all PDF files into one text file

## a. Download pdfs from urls

In [8]:
import os
import requests

def download_pdf(url, filename, folder="downloads"):
    # Ensure the target folder exists
    os.makedirs(folder, exist_ok=True)
    
    # Build the full path
    filepath = os.path.join(folder, filename)
    
    # Download and save
    response = requests.get(url)
    if response.status_code == 200:
        with open(filepath, 'wb') as f:
            f.write(response.content)
        print(f"Saved: {filepath}")
    else:
        print(f"Failed to download {url} (Status code: {response.status_code})")


# List of (title, url) for the PDFs to download
pdf_links = [
    ("GPT-3", "https://arxiv.org/pdf/2005.14165.pdf"),
    ("GPT-4", "https://arxiv.org/pdf/2303.08774.pdf"),
    ("PaLM", "https://arxiv.org/pdf/2204.02311.pdf"),
    ("PaLM2", "https://arxiv.org/pdf/2305.10403.pdf"),
    ("Gemini 1.0", "https://arxiv.org/pdf/2312.11805.pdf"),
    ("Gemini 1.5 (2024)", " https://arxiv.org/pdf/2403.05530.pdf"),
    ("Gemma (2024)", " https://arxiv.org/pdf/2403.08295.pdf"),
    ("Gemma 2 (2024)", " https://arxiv.org/pdf/2408.00118.pdf"),
    ("Gemma 3", " https://arxiv.org/pdf/2503.19786.pdf"),
    ("CodeGemma (2024)", " https://arxiv.org/pdf/2406.11409.pdf"),
    ("RecurrentGemma (2024)", " https://arxiv.org/pdf/2404.07839.pdf"),
    ("LLaMA (2023)", " https://arxiv.org/pdf/2302.13971.pdf"),
    ("Llama 2 (2023)", " https://arxiv.org/pdf/2307.09288.pdf"),
    ("Llama 3 (2024)", " https://arxiv.org/pdf/2407.21783.pdf"),
    # Mistral
    ("Mistral 7B (2023)", " https://arxiv.org/pdf/2310.06825.pdf"),
    ("Mixtral of Experts 8x7B (2024)", " https://arxiv.org/pdf/2401.04088.pdf"),
    # NVIDIA
    ("Nemotron-4 340B Technical Report (2024)", " https://arxiv.org/pdf/2406.11704.pdf"),
    ("NVLM 1.0 (2024)", " https://arxiv.org/pdf/2409.11402.pdf"),
    # Alibaba / Qwen series
    ("Qwen2 Technical Report (2024)", " https://arxiv.org/pdf/2407.10671.pdf"),
    ("Qwen2-VL (2024)", " https://arxiv.org/pdf/2409.12191.pdf"),
    ("Qwen2-Audio (2024)", " https://arxiv.org/pdf/2407.10759.pdf"),
    ("Qwen2.5 Technical Report (2024)", " https://arxiv.org/pdf/2412.15115.pdf"),
    ("Qwen2.5-VL Technical Report (2025)", " https://arxiv.org/pdf/2502.13923.pdf"),
    ("Qwen2.5-Omni Technical Report (2025)", " https://arxiv.org/pdf/2503.20215.pdf"),
    ("Qwen3 Technical Report (2025)", " https://arxiv.org/pdf/2505.09388.pdf"),
    # DeepSeek series
    ("DeepSeek-V2 (2024)", " https://arxiv.org/pdf/2405.04434.pdf"),
    ("DeepSeek-V3 Technical Report (2024)", " https://arxiv.org/pdf/2412.19437.pdf"),
    ("DeepSeek-R1 (2025)", " https://arxiv.org/pdf/2501.12948.pdf"),
    ("DeepSeek-Coder (2024)", " https://arxiv.org/pdf/2401.14196.pdf"),
    # ZhipuAI
    ("GLM-130B (2022)", " https://arxiv.org/pdf/2210.02414.pdf"),
    # Shanghai AI Lab
    ("InternLM2 Technical Report (2024)", " https://arxiv.org/pdf/2403.17297.pdf"),
    ("InternVL 2.5 (2024)", " https://arxiv.org/pdf/2412.05271.pdf"),
    # Microsoft
    ("Phi-3 Technical Report (2024)", " https://arxiv.org/pdf/2404.14219.pdf"),
    ("Phi-3 Safety Post-Training (2024)", " https://arxiv.org/pdf/2407.13833.pdf"),
    # AI21
    ("Jamba Hybrid Transformer–Mamba (2024)", " https://arxiv.org/pdf/2403.19887.pdf"),
    # Huawei
    ("PanGu-Σ (2023)", " https://arxiv.org/pdf/2303.10845.pdf"),
    # 01.AI
    ("Yi Open Foundation Models (2024)", " https://arxiv.org/pdf/2403.04652.pdf")
]



# for file_name, url in pdf_links:
#     download_pdf(url, f'{file_name}.pdf', folder="pdfs")


## b. & c. Extract the text and combine into one txt file

In [4]:
!pip install PyPDF2



In [14]:
import re
from PyPDF2 import PdfReader

def pdfs_to_text_with_delimiters(pdf_files, output_txt):
    """
    Extract text from PDFs and add clear delimiters between papers.
    """
    with open(output_txt, 'w', encoding='utf-8') as outfile:
        for i, pdf_file in enumerate(pdf_files):
            # Add paper delimiter (skip for first paper)
            if i > 0:
                outfile.write('\n' + '='*80 + '\n')
                outfile.write(f'PAPER_DELIMITER_{i}\n')
                outfile.write('='*80 + '\n\n')
            
            reader = PdfReader(pdf_file)
            
            # Extract text from each page
            for page_num, page in enumerate(reader.pages):
                text = page.extract_text()
                if text:
                    outfile.write(f"\n[PAGE_{page_num+1}]\n" + text + '\f')
            
    print(f'Text with delimiters merged into: {output_txt}')


# 3. Clean the text file, e.g., fix hyphenation across line breaks, remove headers/footers/page numbers, drop figure/table captions, trim “References” etc..

In [15]:


def clean_text_comprehensive(input_file, output_file):
    """
    Comprehensive text cleaning pipeline for PDF-extracted text.
    """
    
    # Read the input text
    with open(input_file, 'r', encoding='utf-8') as f:
        text = f.read()
    
    original_length = len(text)
    
    # Step 1: Fix hyphenation across line breaks
    # Why: PDFs often break words across lines with hyphens, creating artificial word splits
    text = re.sub(r'([a-zA-Z])-\s*\n\s*([a-zA-Z])', r'\1\2', text)
    
    # Step 2: Remove page markers (our debug markers)
    # Why: These are artifacts from our extraction process, not content
    text = re.sub(r'\[PAGE_\d+\]', '', text)
    
    # Step 3: Remove form feed characters
    # Why: These are page break markers that don't add semantic value
    text = text.replace('\f', ' ')
    
    # Step 4: Remove common headers and footers
    # Why: Headers/footers are repetitive metadata that don't contribute to learning
    header_patterns = [
        r'^\d+\s*$',  # Page numbers on separate lines
        r'^Page \d+ of \d+\s*$',  # "Page X of Y"
        r'^arXiv:\d+\.\d+v\d+.*$',  # arXiv identifiers
        r'^Preprint.*$',
        r'^Draft.*$',
        r'^Submitted to.*$',
        r'^Published in.*$',
        r'^\w+\s+et al\.\s*$',  # Author names as headers
    ]
    
    footer_patterns = [
        r'^\d+\s*$',  # Standalone page numbers
        r'^©.*\d{4}.*$',  # Copyright notices
        r'^Manuscript.*$',
        r'^Confidential.*$',
    ]
    
    # Remove headers and footers line by line
    lines = text.split('\n')
    cleaned_lines = []
    
    for line in lines:
        line_stripped = line.strip()
        is_header_footer = False
        
        # Check against header patterns
        for pattern in header_patterns + footer_patterns:
            if re.match(pattern, line_stripped, re.IGNORECASE):
                is_header_footer = True
                break
        
        # Also remove very short lines that are likely page numbers
        if len(line_stripped) <= 2 and line_stripped.isdigit():
            is_header_footer = True
            
        if not is_header_footer:
            cleaned_lines.append(line)
    
    text = '\n'.join(cleaned_lines)
    
    # Step 5: Remove figure and table captions
    # Why: Captions often reference figures/tables not present in text, adding noise
    caption_patterns = [
        r'Figure \d+[:\.].*?(?=\n\n|\n[A-Z]|\n\d|\Z)',
        r'Fig\. \d+[:\.].*?(?=\n\n|\n[A-Z]|\n\d|\Z)',
        r'Table \d+[:\.].*?(?=\n\n|\n[A-Z]|\n\d|\Z)',
        r'Algorithm \d+[:\.].*?(?=\n\n|\n[A-Z]|\n\d|\Z)',
        r'Listing \d+[:\.].*?(?=\n\n|\n[A-Z]|\n\d|\Z)',
    ]
    
    for pattern in caption_patterns:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Step 6: Trim references sections
    # Why: Reference lists are formatted metadata, not natural language content
    papers = text.split('='*80)
    cleaned_papers = []
    
    for paper in papers:
        if not paper.strip():
            continue
            
        # Look for references section
        ref_patterns = [
            r'\n\s*References\s*\n.*',
            r'\n\s*REFERENCES\s*\n.*',
            r'\n\s*Bibliography\s*\n.*',
            r'\n\s*BIBLIOGRAPHY\s*\n.*',
        ]
        
        paper_cleaned = paper
        for pattern in ref_patterns:
            # Remove everything from "References" to end of paper
            match = re.search(pattern, paper_cleaned, re.DOTALL | re.IGNORECASE)
            if match:
                paper_cleaned = paper_cleaned[:match.start()]
                break
        
        cleaned_papers.append(paper_cleaned)
    
    text = ('='*80).join(cleaned_papers)
    
    # Step 7: Remove acknowledgments
    # Why: Acknowledgments are social metadata, not technical content
    ack_patterns = [
        r'\n\s*Acknowledgments?\s*\n.*?(?=\n\s*[A-Z][a-z]+\s*\n|\Z)',
        r'\n\s*ACKNOWLEDGMENTS?\s*\n.*?(?=\n\s*[A-Z][a-z]+\s*\n|\Z)',
    ]
    
    for pattern in ack_patterns:
        text = re.sub(pattern, '', text, flags=re.DOTALL | re.IGNORECASE)
    
    # Step 8: Clean up excessive whitespace
    # Why: Multiple consecutive spaces/newlines don't add information and waste tokens
    text = re.sub(r' +', ' ', text)  # Multiple spaces to single space
    text = re.sub(r'\n\s*\n\s*\n+', '\n\n', text)  # Multiple newlines to double newline
    
    # Remove leading/trailing whitespace from lines
    lines = [line.strip() for line in text.split('\n')]
    text = '\n'.join(lines)
    text = text.strip()  # Remove empty lines at start and end
    
    # Write cleaned text
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(text)
    
    print(f"Cleaning completed!")
    print(f"Original: {original_length:,} chars → Final: {len(text):,} chars")
    print(f"Removed: {original_length - len(text):,} chars ({((original_length - len(text))/original_length)*100:.1f}%)")
    
    return text

pdf_list = [f'pdfs\{file_name}.pdf' for file_name, url in pdf_links]

# Step 1: Extract PDFs with delimiters
pdfs_to_text_with_delimiters(pdf_list, 'raw_extracted_text.txt')

# Step 2: Clean the text
cleaned_text = clean_text_comprehensive('raw_extracted_text.txt', 'cleaned_text.txt')

print("Process complete! Check 'cleaned_text.txt' for the final output.")

Text with delimiters merged into: raw_extracted_text.txt
Cleaning completed!
Original: 4,811,565 chars → Final: 2,791,367 chars
Removed: 2,020,198 chars (42.0%)
Process complete! Check 'cleaned_text.txt' for the final output.


# 5. Tokenization

### Implementing regex based tokenizer and building the vocabulary

In [7]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.tokens2ids = {token:id for id,token in enumerate(vocab)}
        self.ids2tokens = {id:token for id,token in enumerate(vocab)}

    # encode function turns text into token IDs
    def encode(self, text):
        text2tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        text2tokens = [
            item.strip() for item in text2tokens if item.strip()
        ]

        text2tokens = [
            item if item in self.tokens2ids
            else "<|unk|>" for item in text2tokens
            ]

        ids = [self.tokens2ids[s] for s in text2tokens]
        return ids

    # decode function turns token IDs back into text
    def decode(self, ids):
        text = " ".join([self.ids2tokens[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [8]:
cleaned_text =""

In [9]:
with open("cleaned_text.txt", "r", encoding="utf-8") as f:
    cleaned_text = f.read()

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', cleaned_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])
print('Total Number of Tokens in "cleaned_2.txt" are ', len(preprocessed))

# Building the vocabulary
vocab = sorted(set(preprocessed))
vocab_size = len(vocab)

print("Vocabulary size:",vocab_size)
print(vocab[:20])

['Language', 'Models', 'are', 'Few-Shot', 'Learners', 'Tom', 'B', '.', 'Brown\x03Benjamin', 'Mann\x03Nick', 'Ryder\x03Melanie', 'Subbiah\x03', 'Jared', 'KaplanyPrafulla', 'Dhariwal', 'Arvind', 'Neelakantan', 'Pranav', 'Shyam', 'Girish', 'Sastry', 'Amanda', 'Askell', 'Sandhini', 'Agarwal', 'Ariel', 'Herbert-Voss', 'Gretchen', 'Krueger', 'Tom']
Total Number of Tokens in "cleaned_2.txt" are  555770
Vocabulary size: 33938
['\x00', '\x001', '\x0013%', '\x002', '\x00erce-looking', '\x00roEht', '\x01', '\x01Distributed', '\x01Natural', '\x01xV', '\x0210\x005', '\x023\x02m\x02t', '\x02PUE', '\x02smaller', '\x03\x06d', '\x03Davinci', '\x03Equal', '\x03l', '\x03t\x00l', '\x06']


In [10]:
# Encoding and decoding text with our regex based tokenizer and vocabulary
regextokenizer = SimpleTokenizer(vocab)

regex_token_integers = regextokenizer.encode(cleaned_text)
print(regex_token_integers[:100])

[10689, 11825, 18428, 8340, 10764, 15618, 5968, 441, 6418, 11376, 14068, 15018, 10018, 10270, 7650, 5826, 12179, 13087, 14656, 8912, 14289, 5639, 5852, 14264, 5495, 5796, 9326, 9028, 10470, 15618, 9317, 13913, 6837, 5461, 13654, 7391, 11090, 441, 16962, 10042, 16563, 6934, 16495, 6886, 9343, 11416, 6807, 8064, 14672, 11449, 10913, 14387, 9015, 6187, 6828, 9975, 6913, 6886, 6196, 14249, 11514, 5562, 13619, 9595, 15090, 7400, 5657, 12486, 5398, 13747, 31809, 23154, 20855, 29683, 22771, 26146, 24963, 12038, 30001, 18131, 18894, 19211, 27124, 26146, 17597, 24472, 20381, 26060, 30128, 22532, 19211, 33543, 26146, 17597, 29286, 29989, 441, 16425, 30819, 29991]


In [11]:
print(regextokenizer.decode(regex_token_integers[:100]))

Language Models are Few-Shot Learners Tom B. BrownBenjamin MannNick RyderMelanie Subbiah Jared KaplanyPrafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever Dario Amodei OpenAI Abstract Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by ﬁne-tuning on a speciﬁc task. While typically task-agnostic


### BPE tokenization

In [13]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.12.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.12.0-cp310-cp310-manylinux_2_28_x86_64.whl (1.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m11.3 MB/s[0m  [33m0:00:00[0m
[?25hInstalling collected packages: tiktoken
Successfully installed tiktoken-0.12.0


In [14]:
import tiktoken

bpe_tokenizer = tiktoken.get_encoding("gpt2")
integers = bpe_tokenizer.encode(cleaned_text, allowed_special={"<|endoftext|>"})

print(integers[:100])

strings = bpe_tokenizer.decode(integers)

print(strings[:100])

[32065, 32329, 389, 20463, 12, 28512, 8010, 2741, 198, 13787, 347, 13, 4373, 191, 11696, 13337, 20291, 191, 23609, 40238, 191, 21102, 34166, 3834, 65, 9520, 191, 198, 41, 1144, 11611, 489, 1092, 47, 430, 12853, 64, 20529, 2743, 16783, 943, 50172, 3169, 417, 461, 415, 272, 1736, 272, 615, 36838, 321, 23837, 680, 311, 459, 563, 198, 5840, 5282, 1081, 17164, 3837, 71, 5362, 2449, 283, 16783, 33364, 28648, 12, 53, 793, 35877, 6607, 13685, 518, 1362, 4186, 6752, 394, 272, 198, 30003, 261, 5932, 1215, 414, 64, 371, 1047, 71, 7806, 337, 13, 1168, 15702, 1754, 19627, 18027]
Language Models are Few-Shot Learners
Tom B. BrownBenjamin MannNick RyderMelanie Subbiah
Jared K


### Comparison of regex and bpe tokenizer outputs

In [15]:
from typing import List

total_regex_tokens = len(regex_token_integers)
regex_vocab_size = len(set(regex_token_integers))
total_bpe_tokens = len(integers)
bpe_vocab_size = len(set(integers))


comparison = {
    "regex_total_tokens": total_regex_tokens,
    "regex_vocab_size": regex_vocab_size,
    "bpe_total_tokens": total_bpe_tokens,
    "bpe_vocab_size": bpe_vocab_size
}
comparison

{'regex_total_tokens': 555770,
 'regex_vocab_size': 33938,
 'bpe_total_tokens': 777055,
 'bpe_vocab_size': 20646}

# Key Differences between Regex Tokenizer and BPE Tokenizer

### 1. Handling of Unknown Words

- **Regex Tokenizer:** Maps unknown words (like "bioinformatics", "neologisms") to a single `<unk>` token, causing information loss.

- **BPE Tokenizer:** Breaks unknown words into meaningful subword components, preserving all information.

### 2. Vocabulary Size vs Coverage

- **Regex:** Larger vocabularies needed to cover diverse text, but still fails on rare words.

- **BPE:** Smaller, more efficient vocabularies with better coverage through subword composition.

### 3. Subword Splitting Capability

- **Regex:** No subword splitting - treats "bioinformatics" as atomic unit.

- **BPE:** Intelligent splitting into morphologically meaningful parts like "bio-in-form-a-tics".

### 4. Efficiency Trade-offs

- **Regex:** Fewer tokens per sentence, computationally efficient.

- **BPE:** More tokens but better semantic representation and robustness.

### 5. Generalization

- **Regex:** Poor - cannot handle words not seen during vocabulary building.

- **BPE:** Excellent - can represent any word through subword combinations.


# 7. Dataset and Dataloader

In [16]:
from torch.utils.data import Dataset, DataLoader
import torch

class CustomDataset(Dataset):
    def __init__(self, txt, tokenizer, context_size, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > context_size, "Number of tokenized inputs must at least be equal to context_size+1"

        # Use a sliding window to chunk the book into overlapping sequences of context_size
        for i in range(0, len(token_ids) - context_size, stride):
            input_chunk = token_ids[i:i + context_size]
            target_chunk = token_ids[i + 1: i + context_size + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [17]:
def create_dataloader(txt, batch_size=4, context_size=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = CustomDataset(txt, tokenizer, context_size, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [18]:
dataloader = create_dataloader(
    cleaned_text, batch_size=2, context_size=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[32065, 32329,   389, 20463],
        [32329,   389, 20463,    12]]), tensor([[32329,   389, 20463,    12],
        [  389, 20463,    12, 28512]])]


# 9. Statistical Analysis

In [19]:
import pandas as pd

# Per-doc removals (character-level proxy)


summary = {
    "total_docs": len(pdf_links),
    "regex_total_tokens": int(len(regex_token_integers)),
    "bpe_total_tokens": int(len(integers)),
    "avg_tokens_per_doc_regex": float((len(regex_token_integers) / len(pdf_links))),
    "avg_tokens_per_doc_bpe": float((len(integers) / len(pdf_links))),
    "removed_char_percentage_weighted": float(),
}

summary

{'total_docs': 37,
 'regex_total_tokens': 555770,
 'bpe_total_tokens': 777055,
 'avg_tokens_per_doc_regex': 15020.81081081081,
 'avg_tokens_per_doc_bpe': 21001.486486486487,
 'removed_char_percentage_weighted': 0.0}