# Milestone 2.1: LaTeX to Hierarchy Processing Pipeline

### Objectives:
1. **LaTeX expansion**: Recursively expand all `\input` and `\include` commands across versions
2. **Text preprocessing**: Remove comments, normalize formatting, clean boilerplate
3. **Reference extraction**: Parse `.bib` files and `\bibitem` blocks, deduplicate globally
4. **Hierarchy construction**: Build tree structure (sections → subsections → sentences → equations)
5. **Element normalization**: Deduplicate elements across versions using fingerprinting
6. **Output generation**: Create `hierarchy.json`, `refs.bib`, and copy metadata files

### Prerequisites:
- Raw LaTeX source files in `23127130/` directory
- Each paper has multiple versions in `tex/` subfolder
- BibTeX files (`.bib`) available for reference extraction

### Pipeline overview:
```
Raw LaTeX → Version Discovery → LaTeX Expansion → Preprocessing 
→ Reference Extraction → Hierarchy Building → Deduplication 
→ Output (hierarchy.json, refs.bib, metadata.json, references.json)
```

In [None]:
import hashlib
import json
import logging
import re
import uuid
import warnings
from copy import deepcopy
from pathlib import Path

import bibtexparser
from bibtexparser.bparser import BibTexParser
from tqdm.auto import tqdm

# Suppress all warnings
warnings.filterwarnings('ignore')

# Suppress bibtexparser logging messages
logging.getLogger('bibtexparser').setLevel(logging.ERROR)

# 1. Converting raw LaTeX to a hierarchy

## 1.1 Preparing folders for the pipeline

* **Notice**: Set `DATA_DIR` to the folder that holds the raw data and `OUTPUT_DIR` to the folder where the preprocessed data should be written.

In [22]:
DATA_DIR = Path("23127130") 
OUTPUT_DIR = Path("output") 
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Collect all the paper folders that need to be processed
list_papers = []
if DATA_DIR.exists():
    for folder in DATA_DIR.iterdir():
        if folder.is_dir():
            list_papers.append(folder)
else:
    print("Data directory does not exist.")
    
print(f"Total papers to process: {len(list_papers)}")

Total papers to process: 15000


## 1.2 Building a pipeline for each paper's folder

### Step 1: From a paper's folder, return the list of its version directories

In [23]:
# Return the version folders found under <paper folder>/tex
def get_all_papers(arXiv_folder: Path) -> list:
    tex_root = arXiv_folder / "tex"

    # A paper without a tex/ folder has no versions to process
    if not tex_root.is_dir():
        return []

    # Sort so that versions are visited in a deterministic order
    return sorted(p for p in tex_root.iterdir() if p.is_dir())


## Step 2: From a version's directory, build the full LaTeX source that would be compiled to PDF

In [24]:
INPUT_RE = re.compile(r'\\(input|include)\{([^}]+)\}')

def find_main_tex(tex_dir: Path) -> Path:
    tex_files = sorted(
        p for p in tex_dir.rglob("*")
        if p.is_file() and p.suffix.lower() == ".tex"
    )

    if not tex_files:
        return None

    # Priority 1: \documentclass
    for f in tex_files:
        try:
            if "\\documentclass" in f.read_text(encoding="utf-8", errors="ignore"):
                return f
        except Exception:
            continue

    # Priority 2: filename contains main
    for f in tex_files:
        if "main" in f.name.lower():
            return f

    return tex_files[0]

def expand_tex(file: Path, visited=None) -> str:
    """
    Recursively inline all \\input / \\include
    """
    if visited is None:
        visited = set()

    if file in visited:
        return ""

    visited.add(file)

    try:
        text = file.read_text(encoding="utf-8", errors="ignore")
    except Exception:
        return ""

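    def replacer(match):
        name = match.group(2).strip()
        child = file.parent / name
        # Append ".tex" only when it is missing; with_suffix() would turn
        # a name such as "sec.1" into "sec.tex".
        if child.suffix.lower() != ".tex":
            child = child.with_name(child.name + ".tex")

        if child.exists():
            return expand_tex(child, visited)
        else:
            return ""

    # One pass is enough: each replacement is already fully expanded,
    # and missing or already-visited files are replaced by "".
    text = INPUT_RE.sub(replacer, text)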

    return text
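
The recursive expansion can be tried in isolation on a throwaway directory. The sketch below is a simplified variant of the idea, not the notebook's exact function: `inline_inputs`, `INPUT_PATTERN`, and the file names are hypothetical, and it assumes included paths are relative to the including file.

```python
import re
import tempfile
from pathlib import Path

INPUT_PATTERN = re.compile(r'\\(?:input|include)\{([^}]+)\}')

def inline_inputs(path: Path, active: frozenset = frozenset()) -> str:
    """Return the text of `path` with every \\input/\\include replaced by its target."""
    path = path.resolve()
    if path in active:  # this file is already being expanded: break the cycle
        return ""
    text = path.read_text(encoding="utf-8", errors="ignore")

    def replace(match):
        child = path.parent / match.group(1).strip()
        if child.suffix != ".tex":  # append, so "sec.1" becomes "sec.1.tex"
            child = child.with_name(child.name + ".tex")
        # Missing files are dropped silently (an assumption of this sketch)
        return inline_inputs(child, active | {path}) if child.is_file() else ""

    return INPUT_PATTERN.sub(replace, text)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "main.tex").write_text("A \\input{intro} B \\input{missing} C", encoding="utf-8")
    # intro.tex includes main.tex again to exercise the cycle guard
    (root / "intro.tex").write_text("[intro \\input{main}]", encoding="utf-8")
    expanded = inline_inputs(root / "main.tex")

print(expanded)
```

Here `active` only tracks the files on the current include chain, so a file that is included twice from different places is expanded both times, while a genuine cycle is cut off.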


## Step 3: Shallow cleaning before parsing into a hierarchy

In [25]:
def preprocess_latex(tex):
    # Cleans and normalizes LaTeX text for hierarchy parsing.

    # Step 1: Remove comments (everything from an unescaped % to the end of the line)
    lines = []
    for line in tex.splitlines():
        if "%" in line:
            # ignore escaped \%
            line = re.sub(r'(?<!\\)%.*', '', line)
        lines.append(line)
    tex = "\n".join(lines)
    
    # Step 2: Remove preamble but keep \begin{document}
    if "\\begin{document}" in tex:
        before, after = tex.split("\\begin{document}", 1)
        tex = "\\begin{document}\n" + after
    
    # Step 3: Remove abstract environment
    tex = re.sub(
        r'\\begin\{abstract\}.*?\\end\{abstract\}',
        '',
        tex,
        flags=re.S
    )
    
    # Step 4: Normalize line endings
    tex = tex.replace('\r\n', '\n')
    
    # Step 5: Remove title/author blocks
    # Delete everything from \title{ up to the first section or abstract
    tex = re.sub(
        r'\\title\s*\{[\s\S]*?(?=\\section\*?\{|\n\\section\*?\{|\n\\begin\{abstract\})',
        '',
        tex,
        flags=re.MULTILINE
    )
    
    # Step 6: Remove boilerplate commands
    boilerplate_patterns = [
        r'\\maketitle',
        r'\\IEEEpeerreviewmaketitle',
        r'\\IEEEtitleabstractindextext',
        r'\\IEEEdisplaynontitleabstractindextext',
        r'\\ACMmaketitle',
    ]
    for p in boilerplate_patterns:
        tex = re.sub(p, '', tex)
    
    # Step 7: Remove include/input commands (already expanded)
    tex = re.sub(r'\\(input|include)\{[^}]+\}', '', tex)
    
    # Step 8: Remove labels
    tex = re.sub(r'\\label\{[^}]+\}', '', tex)
    
    # Step 9: Normalize references to [REF] and [EQ]
    tex = re.sub(r'\\eqref\{[^}]+\}', '[EQ]', tex)
    tex = re.sub(r'\\ref\{[^}]+\}', '[REF]', tex)
    
    # Step 10: Normalize citations to [CITE:key]
    tex = re.sub(r'\\cite\{([^}]+)\}', r'[CITE:\1]', tex)
    
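    # Step 11: Remove formatting commands without semantic meaning
    # Bare commands get a (?![A-Za-z]) guard so that, e.g., \small does not
    # also strip the prefix of \smallskip.
    bare_commands = [
        'centering', 'raggedright', 'raggedleft',
        'small', 'footnotesize', 'scriptsize', 'normalsize',
        'midrule', 'toprule', 'bottomrule',
    ]
    formatting_patterns = [r'\\' + cmd + r'(?![A-Za-z])' for cmd in bare_commands]
    formatting_patterns += [
        r'\\vspace\{[^}]+\}',
        r'\\hspace\{[^}]+\}',
    ]
    for p in formatting_patterns:
        tex = re.sub(p, '', tex)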
    
    # Step 12: Remove float options like [h], [t], [b], [p]
    tex = re.sub(r'\[(h|t|b|p|!)+\]', '', tex)
    
    # Step 13: Unwrap text formatting commands (keep content, remove command)
    text_commands = ['textbf', 'textit', 'emph', 'underline', 'texttt']
    for cmd in text_commands:
        tex = re.sub(r'\\' + cmd + r'\{([^}]*)\}', r'\1', tex)
        
    # Step 14: Normalize inline math to $ ... $
    tex = re.sub(r'\\\((.*?)\\\)', r'$\1$', tex)
    tex = re.sub(r'\$\s*(.*?)\s*\$', r'$\1$', tex)
    
    # Step 15: Normalize block math to \begin{equation} ... \end{equation}
    # Convert $$ ... $$ to equation
    tex = re.sub(
        r'\$\$(.*?)\$\$',
        r'\\begin{equation}\1\\end{equation}',
        tex,
        flags=re.S
    )
    # Convert \[ ... \] to equation
    tex = re.sub(
        r'\\\[(.*?)\\\]',
        r'\\begin{equation}\1\\end{equation}',
        tex,
        flags=re.S
    )
    # Convert align/gather/multline to equation
    tex = re.sub(
        r'\\begin\{(align|gather|multline)\}(.*?)\\end\{\1\}',
        r'\\begin{equation}\2\\end{equation}',
        tex,
        flags=re.S
    )
    
    # Step 16: Normalize whitespace
    tex = re.sub(r'[ \t]+', ' ', tex)
    tex = re.sub(r'\n\s*\n+', '\n\n', tex)
    
    return tex.strip()
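
Two of the cleaning steps are easy to check on a small string. The following is a minimal sketch using the same regular expressions as Steps 1, 9, and 10; the helper names `strip_comments` and `normalize_refs` and the sample text are made up for illustration.

```python
import re

def strip_comments(tex: str) -> str:
    """Drop everything from an unescaped % to the end of each line."""
    return "\n".join(re.sub(r'(?<!\\)%.*', '', line) for line in tex.splitlines())

def normalize_refs(tex: str) -> str:
    """Replace \\eqref, \\ref and \\cite with plain-text placeholders."""
    tex = re.sub(r'\\eqref\{[^}]+\}', '[EQ]', tex)
    tex = re.sub(r'\\ref\{[^}]+\}', '[REF]', tex)
    return re.sub(r'\\cite\{([^}]+)\}', r'[CITE:\1]', tex)

sample = (
    "Accuracy rises by 5\\% % TODO: double-check this number\n"
    "See \\eqref{eq:1}, Fig. \\ref{fig:a} and \\cite{lee2020}."
)
cleaned = normalize_refs(strip_comments(sample))
print(cleaned)
```

The escaped `\%` survives while the trailing comment is dropped. One caveat of the lookbehind: a `%` that follows a LaTeX line break (`\\%`) is also treated as escaped, so such a comment would be kept.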

## Step 4: Extracting references, deduplicating them, and converting them to a BibTeX file

In [26]:
# regex patterns
BIBITEM_BLOCK_RE = re.compile(
    r'\\begin\{thebibliography\}.*?\\end\{thebibliography\}',
    re.S
)

BIBITEM_RE = re.compile(
    # The optional [label] covers natbib-style \bibitem[Author(2020)]{key}
    r'\\bibitem(?:\[[^\]]*\])?\{([^}]+)\}\s*(.*?)(?=\\bibitem|\\end\{thebibliography\})',
    re.S
)

BIBLIOGRAPHY_RE = re.compile(
    r'\\bibliography\{([^}]+)\}'
)

BIBSTYLE_RE = re.compile(
    r'\\bibliographystyle\{[^}]+\}'
)

# Normalization & Fingerprint
def normalize_text(s):
    if not s:
        return ""
    s = s.lower()
    s = re.sub(r'\{|\}', '', s)
    s = re.sub(r'\s+', ' ', s)
    s = re.sub(r'[^\w\s]', '', s)
    return s.strip()


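def reference_fingerprint(ref):
    title = normalize_text(ref.get("title", ""))
    year = normalize_text(ref.get("year", ""))
    authors = normalize_text(" ".join(ref.get("authors", [])))
    raw = f"{title}|{authors}|{year}"
    if not (title or authors or year):
        # \bibitem entries carry no parsed fields; without a fallback they
        # would all hash to the same value and be merged into one reference.
        raw = "key|" + "|".join(sorted(ref.get("source_keys", [])))
    return hashlib.sha1(raw.encode()).hexdigest()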


# Parse the bibitem
def parse_bibitem_block(tex, version_id):
    refs = []

    block_match = BIBITEM_BLOCK_RE.search(tex)
    if not block_match:
        return refs

    block = block_match.group(0)

    for key, body in BIBITEM_RE.findall(block):
        refs.append({
            "type": "misc",
            "title": "",
            "authors": [],
            "year": "",
            "journal": "",
            "doi": "",
            "source_keys": [key.strip()],
            "sources": ["bibitem"],
            "versions": [version_id]
        })

    return refs
# Parse the .bib file

def safe_str(val):
    if val is None:
        return ""
    if isinstance(val, str):
        return val
    # BibDataStringExpression or others
    return str(val)

MONTH_MAP = {
    "jan": "January", "january": "January",
    "feb": "February", "february": "February",
    "mar": "March", "march": "March",
    "apr": "April", "april": "April",
    "may": "May",
    "jun": "June", "june": "June",
    "jul": "July", "july": "July",
    "aug": "August", "august": "August",
    "sep": "September", "september": "September",
    "oct": "October", "october": "October",
    "nov": "November", "november": "November",
    "dec": "December", "december": "December",
}

def parse_bib_file(bib_path, version_id):
    if not bib_path.exists():
        return []

    parser = BibTexParser(
        common_strings=False,
        interpolate_strings=False
    )

    text = bib_path.read_text(encoding="utf-8", errors="ignore")
    bib_db = bibtexparser.loads(text, parser=parser)

    refs = []
    for entry in bib_db.entries:
        month_raw = safe_str(entry.get("month")).strip().lower()
        month = MONTH_MAP.get(month_raw, month_raw)

        refs.append({
            "type": safe_str(entry.get("ENTRYTYPE", "misc")),
            "title": safe_str(entry.get("title")),
            "authors": safe_str(entry.get("author"))
                        .replace("\n", " ")
                        .split(" and ")
                        if entry.get("author") else [],
            "year": safe_str(entry.get("year")),
            "journal": safe_str(entry.get("journal") or entry.get("booktitle")),
            "month": month,
            "doi": safe_str(entry.get("doi")),
            "source_keys": [safe_str(entry.get("ID"))],
            "sources": ["bib"],
            "versions": [version_id]
        })

    return refs



# Deduplicate references (global)
def deduplicate_references(refs):
    merged = {}

    for ref in refs:
        fp = reference_fingerprint(ref)

        if fp not in merged:
            ref["id"] = f"ref_{fp[:10]}"
            merged[fp] = ref
        else:
            base = merged[fp]

            for field in ["title", "year", "journal", "doi"]:
                if not base.get(field) and ref.get(field):
                    base[field] = ref[field]

            base["authors"] = sorted(set(base["authors"]) | set(ref["authors"]))
            base["source_keys"] = sorted(set(base["source_keys"]) | set(ref["source_keys"]))
            base["sources"] = sorted(set(base["sources"]) | set(ref["sources"]))
            base["versions"] = sorted(set(base["versions"]) | set(ref["versions"]))

    return list(merged.values())

# Remove references from LaTeX
def remove_references_from_tex(tex):
    tex = BIBITEM_BLOCK_RE.sub("", tex)
    tex = BIBLIOGRAPHY_RE.sub("", tex)
    tex = BIBSTYLE_RE.sub("", tex)

    tex = re.sub(r'[ \t]+$', '', tex, flags=re.M)
    tex = re.sub(r'\n\s*\n+', '\n\n', tex)
    return tex.strip()

# Process one LaTeX version
def process_references(tex, tex_dir, version_id):
    refs = []

    refs.extend(parse_bibitem_block(tex, version_id))

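    for bib_arg in BIBLIOGRAPHY_RE.findall(tex):
        # \bibliography{a,b} may name several comma-separated databases
        for bibname in bib_arg.split(","):
            bibname = bibname.strip()
            if not bibname:
                continue
            bib_path = tex_dir / bibname
            if bib_path.suffix.lower() != ".bib":
                bib_path = bib_path.with_name(bib_path.name + ".bib")
            refs.extend(parse_bib_file(bib_path, version_id))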

    cleaned_tex = remove_references_from_tex(tex)
    return cleaned_tex, refs

# Convert references → BIBTEX
def refs_to_bibtex(refs):
    entries = []

    for ref in refs:
        key = ref["id"]
        entry_type = ref.get("type", "misc")

        fields = []
        if ref.get("title"):
            fields.append(f"  title = {{{ref['title']}}}")
        if ref.get("authors"):
            fields.append(f"  author = {{{' and '.join(ref['authors'])}}}")
        if ref.get("year"):
            fields.append(f"  year = {{{ref['year']}}}")
        if ref.get("journal"):
            fields.append(f"  journal = {{{ref['journal']}}}")
        if ref.get("doi"):
            fields.append(f"  doi = {{{ref['doi']}}}")

        entry = f"@{entry_type}{{{key},\n" + ",\n".join(fields) + "\n}"
        entries.append(entry)

    return "\n\n".join(entries)



# Main pipeline: multi-version processing
def process_latex(versions: dict, out_dir: Path):
    """
    versions: {version_folder (Path): preprocessed LaTeX text}
    Writes the deduplicated references to out_dir / "refs.bib" and returns
    {version_folder: LaTeX text with the bibliography removed}.
    """
    out_dir.mkdir(parents=True, exist_ok=True)

    all_refs = []
    cleaned_tex_map = {}

    # ---- extract per version ----
    for paper_folder, tex in versions.items():
        cleaned_tex, refs = process_references(tex, Path(paper_folder), paper_folder.name)
        cleaned_tex_map[paper_folder] = cleaned_tex
        all_refs.extend(refs)

    canonical_refs = deduplicate_references(all_refs)

    # ---- output ----
    # Save refs.bib into the directory supplied by the caller
    with open(out_dir / "refs.bib", "w", encoding="utf-8") as f:
        f.write(refs_to_bibtex(canonical_refs))
    return cleaned_tex_map
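
The fingerprint-based merge can be illustrated on two entries that differ only in braces, case, punctuation, and citation key. This is a reduced sketch with hypothetical helper names (`norm`, `fingerprint`, `merge`) and placeholder reference data; it keeps only the fields needed to show the idea.

```python
import hashlib
import re

def norm(s: str) -> str:
    """Lowercase, drop braces and punctuation, collapse whitespace."""
    s = re.sub(r'[{}]', '', s.lower())
    s = re.sub(r'[^\w\s]', '', s)
    return re.sub(r'\s+', ' ', s).strip()

def fingerprint(ref: dict) -> str:
    raw = "|".join([norm(ref["title"]), norm(" ".join(ref["authors"])), norm(ref["year"])])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

def merge(refs: list) -> list:
    merged = {}
    for ref in refs:
        fp = fingerprint(ref)
        if fp not in merged:
            merged[fp] = {**ref, "id": f"ref_{fp[:10]}"}
        else:
            # Same work seen again: union the keys and versions it came from
            base = merged[fp]
            base["source_keys"] = sorted(set(base["source_keys"]) | set(ref["source_keys"]))
            base["versions"] = sorted(set(base["versions"]) | set(ref["versions"]))
    return list(merged.values())

# Placeholder entries, not real references
refs = [
    {"title": "{A} Sample Title", "authors": ["A. Author", "B. Writer"], "year": "2020",
     "source_keys": ["author2020"], "versions": ["v1"]},
    {"title": "A sample title.", "authors": ["A. Author", "B. Writer"], "year": "2020",
     "source_keys": ["Author20"], "versions": ["v2"]},
]
unique = merge(refs)
print(len(unique), unique[0]["source_keys"], unique[0]["versions"])
```

A property worth keeping in mind with this kind of key: entries whose title, authors, and year are all empty hash to the same value, so they collapse into a single entry unless some other field is mixed into the fingerprint.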


## Step 5: Building the hierarchy for each version of the paper

In [27]:
HIERARCHY_LEVELS = {
    "document": {"level": 0, "atomic": False, "include": True, "signals": [], "unwrap": False},

    "section": {
        "level": 1, "atomic": False, "include": True,
        "signals": [r"\\section\*?\{[^}]*\}"],
        "unwrap": False  # Keep full \section{...}
    },
    "subsection": {
        "level": 2, "atomic": False, "include": True,
        "signals": [r"\\subsection\*?\{[^}]*\}"],
        "unwrap": False  # Keep full \subsection{...}
    },
    "subsubsection": {
        "level": 3, "atomic": False, "include": True,
        "signals": [r"\\subsubsection\*?\{[^}]*\}"],
        "unwrap": False  # Keep full \subsubsection{...}
    },
    "paragraph_explicit": {
        "level": 4, "atomic": False, "include": True,
        "signals": [r"\\paragraph\*?\{[^}]*\}"],
        "unwrap": False  # Keep full \paragraph{...}
    },

    "figure": {
        "level": 5, "atomic": True, "include": True,
        "signals": [r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}"],
        "unwrap": True  # Extract content inside figure
    },
    "block_formula": {
        "level": 5, "atomic": True, "include": True,
        "signals": [
            r"\\begin\{equation\}(.*?)\\end\{equation\}",
            r"\\begin\{align\*?\}(.*?)\\end\{align\*?\}",
            r"\\begin\{dcases\}(.*?)\\end\{dcases\}",
            r"\$\$(.*?)\$\$",
            r"\\\[(.*?)\\\]"
        ],
        "unwrap": True  # Extract content inside math
    },

    "sentence": {
        "level": 6,
        "atomic": True,
        "include": True,
        "signals": [r"[^.!?]+(?:[.!?]|$)"],
        "unwrap": False
    }
}


def uid():
    return str(uuid.uuid4())

def clean_whitespace(text):
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()

def strip_document_env(text):
    return re.sub(r"\\begin\{document\}|\\end\{document\}", "", text)


def extract_tokens(text):
    tokens = []
    occupied = [False] * len(text)

    levels = sorted(
        HIERARCHY_LEVELS.items(),
        key=lambda x: x[1]["level"]
    )

    for name, cfg in levels:
        if not cfg["signals"]:
            continue

        for sig in cfg["signals"]:
            for m in re.finditer(sig, text, re.S):
                if any(occupied[m.start():m.end()]):
                    continue

                for i in range(m.start(), m.end()):
                    occupied[i] = True

                # Decide whether to unwrap based on configuration
                if cfg.get("unwrap", False) and m.lastindex:
                    # Extract content from capture group
                    content = m.group(1)
                else:
                    # Keep full match
                    content = m.group(0)
                
                content = clean_whitespace(content)

                tokens.append({
                    "id": uid(),
                    "type": name,
                    "level": cfg["level"],
                    "atomic": cfg["atomic"],
                    "start": m.start(),
                    "end": m.end(),
                    "content": content
                })

    tokens.sort(key=lambda x: x["start"])
    return tokens

def emit_sentences(parent, text):
    # Extracts sentence nodes from plain text.
    for s in re.finditer(HIERARCHY_LEVELS["sentence"]["signals"][0], text):
        content = clean_whitespace(s.group(0))
        if not content:
            # Whitespace-only gaps (e.g. blank lines before a heading) are not sentences
            continue
        parent["children"].append({
            "id": uid(),
            "type": "sentence",
            "content": content,
            "children": []
        })

def build_tree(full_text):
    text = clean_whitespace(strip_document_env(full_text))

    root = {
        "id": uid(),
        "type": "document",
        "content": "",
        "children": []
    }

    stack = [root]
    tokens = extract_tokens(text)
    cursor = 0

    for tok in tokens:
        if tok["start"] > cursor:
            gap = text[cursor:tok["start"]]
            if not re.search(r"\\begin\{", gap):
                emit_sentences(stack[-1], gap)

        cursor = tok["end"]

        node = {
            "id": tok["id"],
            "type": tok["type"],
            "content": tok["content"],
            "children": []
        }

        while (
            len(stack) > 1 and
            HIERARCHY_LEVELS[stack[-1]["type"]]["level"] >= tok["level"]
        ):
            stack.pop()

        stack[-1]["children"].append(node)

        if not tok["atomic"]:
            stack.append(node)

    tail = text[cursor:]
    if tail.strip() and not re.search(r"\\begin\{", tail):
        emit_sentences(stack[-1], tail)

    return root

def build_hierarchy(paper_dict: dict):
    hierarchies = {}
    for paper_folder, text in paper_dict.items():
        hierarchies[paper_folder.name] = build_tree(text)
    return hierarchies
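
The core of the tree construction is a stack of open containers: a new node closes every container at the same or a deeper level, then attaches to whatever remains on top. The sketch below isolates that step with a reduced level table and hand-written tokens; `LEVELS`, `CONTAINERS`, `build`, and `outline` are illustrative names, and token extraction is assumed to have happened already.

```python
LEVELS = {"document": 0, "section": 1, "subsection": 2, "sentence": 6}
CONTAINERS = {"document", "section", "subsection"}  # non-atomic node types

def build(tokens):
    root = {"type": "document", "content": "", "children": []}
    stack = [root]
    for kind, content in tokens:
        node = {"type": kind, "content": content, "children": []}
        # Close every open container at the same or a deeper level
        while len(stack) > 1 and LEVELS[stack[-1]["type"]] >= LEVELS[kind]:
            stack.pop()
        stack[-1]["children"].append(node)
        if kind in CONTAINERS:
            stack.append(node)
    return root

def outline(node, depth=0):
    """Render the tree as indented 'type: content' lines."""
    lines = ["  " * depth + f"{node['type']}: {node['content']}"]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines

tokens = [
    ("section", "Intro"),
    ("sentence", "First."),
    ("subsection", "Background"),
    ("sentence", "Second."),
    ("section", "Method"),
    ("sentence", "Third."),
]
tree = build(tokens)
print("\n".join(outline(tree)))
```

In this example the second `section` token pops both the open subsection and the first section, so "Method" becomes a sibling of "Intro" directly under the document root.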



### Step 6: Normalize text and deduplicate elements across versions

In [28]:
def normalize_node_content(text: str) -> str:
    """Standardizes text content for comparison."""
    text = text.lower()
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()

def traverse_tree(node, fn):
    """Runs a function on every node while traversing the tree."""
    fn(node)
    for c in node.get("children", []):
        traverse_tree(c, fn)

def node_fingerprint(node):
    """Creates a content-based fingerprint for a node from its type and content."""
    key = f"{node['type']}|{normalize_node_content(node.get('content', ''))}"
    return hashlib.sha1(key.encode()).hexdigest()

def collect_elements_and_hierarchy(hierarchies: dict):
    """Extracts unique elements and parent-child links for each version."""

    # Get paper ID from first version
    first_version = next(iter(hierarchies))
    paper_id = first_version.split('v')[0].replace('.', '-')
    
    # Maps for deduplication
    canonical_id_by_fp = {}  # fingerprint -> canonical_id
    elements = {}  # canonical_id -> content
    hierarchy = {}  # version_id -> {child_id: parent_id}
    
    for version_id, tree in hierarchies.items():
        # Initialize hierarchy for this version
        hierarchy[version_id] = {}
        
        def collect(node, parent_id=None, is_root=False):
            # Skip adding root document node to elements
            if is_root and node.get('type') == 'document':
                # Process children of root without adding root itself
                for child in node.get("children", []):
                    collect(child, parent_id=None, is_root=False)
                return
            
            # Generate fingerprint and canonical ID
            fp = node_fingerprint(node)
            
            if fp not in canonical_id_by_fp:
                # First time seeing this content
                canonical_id = f"{paper_id}_{fp[:12]}"
                canonical_id_by_fp[fp] = canonical_id
                elements[canonical_id] = normalize_node_content(node.get('content', ''))
            else:
                canonical_id = canonical_id_by_fp[fp]
            
            # Record parent-child relationship for this version
            if parent_id is not None:
                hierarchy[version_id][canonical_id] = parent_id
            
            # Process children
            for child in node.get("children", []):
                collect(child, canonical_id, is_root=False)
        
        collect(tree, is_root=True)
    
    return elements, hierarchy

def finalize_hierarchy_json(hierarchies: dict):
    """Formats hierarchy trees into a final elements and relationships JSON."""

    hierarchies = deepcopy(hierarchies)
    
    elements, hierarchy = collect_elements_and_hierarchy(hierarchies)
    
    return {
        "elements": elements,
        "hierarchy": hierarchy
    }

def save_hierarchy_json(final_structure: dict, path: Path, paper_id: str):
    """Saves the hierarchy structure to hierarchy.json inside `path`."""

    # Ensure the output directory exists
    path.mkdir(parents=True, exist_ok=True)

    # Save the hierarchy to hierarchy.json
    output_file = path / "hierarchy.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(final_structure, f, indent=2, ensure_ascii=False)
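
The effect of fingerprinting across versions can be seen on two tiny trees. This is a simplified sketch: `fingerprint`, `flatten`, `sentence`, and `document` are hypothetical helpers, whitespace is collapsed more aggressively than in the notebook's normalization, and IDs are bare hash prefixes without a paper ID.

```python
import hashlib

def fingerprint(node_type: str, content: str) -> str:
    """Hash the node type plus case- and whitespace-normalized content."""
    key = f"{node_type}|{' '.join(content.lower().split())}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

def flatten(trees: dict):
    """Return (elements, hierarchy) for a {version: tree} mapping."""
    elements, hierarchy = {}, {}

    def walk(node, parent_id, links):
        node_id = fingerprint(node["type"], node["content"])
        elements.setdefault(node_id, node["content"])  # keep the first-seen text
        if parent_id is not None:
            links[node_id] = parent_id
        for child in node["children"]:
            walk(child, node_id, links)

    for version, tree in trees.items():
        links = hierarchy.setdefault(version, {})
        for top in tree["children"]:  # the document root itself is not stored
            walk(top, None, links)
    return elements, hierarchy

def sentence(text):
    return {"type": "sentence", "content": text, "children": []}

def document(*sentences):
    section = {"type": "section", "content": "\\section{Intro}", "children": list(sentences)}
    return {"type": "document", "content": "", "children": [section]}

trees = {
    "v1": document(sentence("Hello world."), sentence("Old text.")),
    "v2": document(sentence("Hello   World."), sentence("New text.")),
}
elements, hierarchy = flatten(trees)
shared = set(hierarchy["v1"]) & set(hierarchy["v2"])
print(len(elements), "unique elements;", len(shared), "child shared by both versions")
```

The section heading and the "Hello world." sentence are stored once even though they occur in both versions, while each version keeps its own child-to-parent map.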


## Step 7: Main Pipeline - Process Single Paper

This function orchestrates the complete processing pipeline for a single paper:

### Pipeline Stages:

1. **Setup & Validation**
   - Extract paper ID from folder name
   - Create output directory structure
   - Copy metadata and references files

2. **Version Discovery**
   - Find all LaTeX versions for the paper
   - Locate main .tex file for each version
   - Expand all `\input` and `\include` commands

3. **Text Processing**
   - Preprocess LaTeX (remove comments, normalize formatting)
   - Extract and deduplicate references
   - Clean LaTeX content for hierarchy parsing

4. **Hierarchy Construction**
   - Build tree structure for each version
   - Extract sections, subsections, figures, equations, sentences
   - Normalize content for deduplication

5. **Output Generation**
   - Merge hierarchies across versions
   - Generate canonical element IDs
   - The paper's output folder ends up with four files:
     - `hierarchy.json` - Document structure with parent-child relationships
     - `refs.bib` - Deduplicated references in BibTeX format
     - `metadata.json` - Paper metadata (copied from source)
     - `references.json` - Candidate references for matching (copied from source)

In [29]:
def process_paper(target_paper: Path):
    paper_id = target_paper.name

    print(f"Processing paper: {paper_id}...")
    all_papers = get_all_papers(target_paper)
    # Ensure the paper's output directory exists
    paper_out_dir = OUTPUT_DIR / paper_id
    paper_out_dir.mkdir(parents=True, exist_ok=True)
        
    # Copy metadata.json
    meta_path = target_paper / "metadata.json"
    if meta_path.exists():
        metadata = json.loads(meta_path.read_text(encoding="utf-8", errors="ignore"))
        output_meta_path = paper_out_dir / "metadata.json"
        with open(output_meta_path, "w", encoding="utf-8") as f:
            json.dump(metadata, f, indent=2, ensure_ascii=False)

    # Copy references.json
    ref_path = target_paper / "references.json"
    if ref_path.exists():
        ref = json.loads(ref_path.read_text(encoding="utf-8", errors="ignore"))
        output_ref_path = paper_out_dir / "references.json"
        with open(output_ref_path, "w", encoding="utf-8") as f:
            json.dump(ref, f, indent=2, ensure_ascii=False)
    
    if not all_papers:
        print(f"No versions found for paper {paper_id}")
        return
    
    paper_dict = {}
    for paper_folder in all_papers:
        main_tex = find_main_tex(paper_folder)
        if main_tex is None:
            continue

        full_text = expand_tex(main_tex)
        if full_text:
            paper_dict[paper_folder] = full_text
    
    if not paper_dict:
        print(f"No valid LaTeX found for paper {paper_id}")
        return
            
    # Preprocess LaTeX
    for paper_folder, text in paper_dict.items():
        paper_dict[paper_folder] = preprocess_latex(text)

    # Process references and clean LaTeX
    paper_dict = process_latex(paper_dict, OUTPUT_DIR / paper_id)
    
    # Build hierarchies
    hierarchies = build_hierarchy(paper_dict)
    
    # Create final merged JSON structure and save
    final_structure = finalize_hierarchy_json(hierarchies)
    save_hierarchy_json(final_structure, OUTPUT_DIR / paper_id, paper_id)



    print(f"Paper {paper_id} processed successfully")


## 1.3 Batch Processing Execution

This section controls the batch processing of all papers in the dataset.

### Configuration Options:

- **`start_index`**: Index of the first paper to process (0-based)
  - Set to `0` to start from the beginning
  - Change this to resume processing from a specific paper

- **`end_index`**: Index where processing stops (exclusive)
  - Set to `len(list_papers)` to process all remaining papers
  - Can be set to a specific number to process a subset

### Processing Loop:

The `tqdm` progress bar provides:
- Real-time progress tracking
- Estimated time remaining
- Papers processed per second

### Example Usage:

```python
# Process all papers
start_index = 0
end_index = len(list_papers)

# Process first 100 papers only
start_index = 0
end_index = 100

# Resume from paper 500
start_index = 500
end_index = len(list_papers)
```

### Output:

For each successfully processed paper, the following files are created in `output/{paper_id}/`:
- `hierarchy.json` - Document structure
- `refs.bib` - Deduplicated references
- `metadata.json` - Paper metadata
- `references.json` - Candidate references
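
When consuming `hierarchy.json`, it is often more convenient to walk from parents to children. The sketch below assumes the file has the two top-level keys written by `finalize_hierarchy_json` (`elements` and `hierarchy`, the latter mapping each version to a child-to-parent dictionary). The structure is built in memory here with made-up IDs; with a real file it would come from `json.load`.

```python
from collections import defaultdict

# Hypothetical structure in the assumed shape; the IDs are made up.
structure = {
    "elements": {
        "sec_a": "\\section{introduction}",
        "sen_1": "first sentence.",
        "sen_2": "second sentence.",
    },
    "hierarchy": {
        "v1": {"sen_1": "sec_a", "sen_2": "sec_a"},
        "v2": {"sen_1": "sec_a"},
    },
}

def children_by_parent(structure: dict, version: str) -> dict:
    """Invert the child -> parent map of one version into parent -> [children]."""
    children = defaultdict(list)
    for child_id, parent_id in structure["hierarchy"][version].items():
        children[parent_id].append(child_id)
    return dict(children)

for version in structure["hierarchy"]:
    for parent_id, child_ids in children_by_parent(structure, version).items():
        print(version, structure["elements"][parent_id], "->", child_ids)
```

Elements directly under the document root have no parent entry, so they appear in `elements` and as parents in this view, but never as children.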

In [30]:
# Select the slice of papers to process: from start_index up to (not including) end_index
start_index = 0  # Change this to the desired starting index
end_index = len(list_papers)  # Process until the last paper

list_papers = list_papers[start_index:end_index]

for paper in tqdm(list_papers, desc="Processing papers", unit="paper"):
    process_paper(paper)

print("Finished processing all papers")

Processing papers:   0%|          | 0/15000 [00:00<?, ?paper/s]

Processing paper: 2303-07856...
Paper 2303-07856 processed successfully
Processing paper: 2303-07857...
Paper 2303-07857 processed successfully
Processing paper: 2303-07858...
Paper 2303-07858 processed successfully
Processing paper: 2303-07859...
Paper 2303-07859 processed successfully
Processing paper: 2303-07860...
No valid LaTeX found for paper 2303-07860
Processing paper: 2303-07861...
Paper 2303-07861 processed successfully
Processing paper: 2303-07862...
No versions found for paper 2303-07862
Processing paper: 2303-07863...
Paper 2303-07861 processed successfully
Processing paper: 2303-07862...
No versions found for paper 2