# Test Script for Parser Module

This notebook tests the refactored code in `src/parser`.
We will test:
1. `find_root_tex_file`: Locating the main LaTeX file.
2. `LatexFlattener`: Merging LaTeX files into one.
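3. `LatexStructureBuilder`: Parsing the structure of the LaTeX document.
4. `LatexContentProcessor`: Splitting content into sentences and cleaning figures.
5. `ContentDeduplicator` (from `src/processing`): Merging the processed versions into one hierarchy.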
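
To make the first two steps concrete, here is a minimal sketch of what root-file lookup and flattening could look like. It is not the implementation in `src/parser`: the helper names `guess_root_tex` and `inline_inputs` are hypothetical, the root file is assumed to be the first `.tex` file containing `\documentclass`, and `\input`/`\include` paths are assumed to be relative to the root file's folder. Commented-out `\input` lines are not handled.

```python
import re
from pathlib import Path

# Hypothetical helpers for illustration only; NOT the functions in src/parser.

def guess_root_tex(folder):
    """Return the first .tex file (sorted by path) containing \\documentclass, or None."""
    for path in sorted(Path(folder).rglob("*.tex")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if "\\documentclass" in text:
            return path
    return None

# Matches \input{name} and \include{name}; the captured group is the file name.
INPUT_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")

def inline_inputs(path, base_dir=None, depth=0, max_depth=20):
    """Recursively replace \\input{...}/\\include{...} with the referenced file's text."""
    path = Path(path)
    base_dir = Path(base_dir) if base_dir else path.parent
    text = path.read_text(encoding="utf-8", errors="ignore")
    if depth >= max_depth:
        return text  # stop expanding to avoid runaway recursion on cyclic inputs

    def expand(match):
        target = base_dir / match.group(1).strip()
        if target.suffix != ".tex":
            target = target.with_name(target.name + ".tex")
        if not target.is_file():
            return match.group(0)  # leave unresolved references untouched
        return inline_inputs(target, base_dir, depth + 1, max_depth)

    return INPUT_RE.sub(expand, text)
```

Calling `inline_inputs(guess_root_tex(folder))` on a version folder would return one string with the included files merged in, which is the kind of input a structure builder needs.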

In [1]:
import sys
import os
import json

# Add the project root to sys.path to import src
# Adjust this path depending on where you run the notebook
project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '23127011'))
if project_root not in sys.path:
    sys.path.append(project_root)

print(f"Added to path: {project_root}")

from src.parser import find_root_tex_file, LatexFlattener, LatexStructureBuilder, LatexContentProcessor
from src.processing import ContentDeduplicator

Added to path: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\23127011


In [2]:
# Configuration
DATA_RAW_PATH = os.path.abspath(os.path.join(os.getcwd(), '..', 'data_raw'))

print(f"Data raw path: {DATA_RAW_PATH}")

if not os.path.exists(DATA_RAW_PATH):
    print("❌ Warning: Data raw path does not exist. Please check the path.")
else:
    print("✅ Data raw path found.")

Data raw path: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\data_raw
✅ Data raw path found.
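
The processing cell that follows looks for each version under `data_raw/<paper_base_id>/tex/<version>`. A small pre-check like the sketch below can report missing folders before the pipeline starts. The helper name `missing_version_dirs` is hypothetical and not part of the project.

```python
from pathlib import Path

def missing_version_dirs(data_raw, paper_id, versions):
    """Return the versions whose folder data_raw/<paper_id>/tex/<version> is absent."""
    tex_dir = Path(data_raw) / paper_id / "tex"
    return [v for v in versions if not (tex_dir / v).is_dir()]
```

For example, `missing_version_dirs(DATA_RAW_PATH, '2403-00531', ['2403-00531v1', '2403-00531v2'])` returns an empty list when both version folders exist.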


In [3]:
# Suppose we have a list of versions that need to be merged
# Make sure the data_raw folder contains exactly these folders
paper_base_id = '2403-00531'
versions_to_process = ['2403-00531v1', '2403-00531v2']

deduplicator = ContentDeduplicator()

for ver in versions_to_process:
    ver_path = os.path.join(DATA_RAW_PATH, paper_base_id, 'tex', ver)
    
    if not os.path.exists(ver_path):
        print(f"⚠️ Skipping {ver}, path not found: {ver_path}")
        continue

    print(f"\n🚀 Start processing {ver}...")

    # 1. Find the root file
    root_file = find_root_tex_file(ver_path)
    if not root_file:
        print(f"❌ Root file not found for {ver}")
        continue
        
    # 2. Flatten LaTeX
    print(f"   - Flattening...")
    flattener = LatexFlattener(root_file, paper_id=paper_base_id, version=ver)
    flat_content = flattener.flatten()
    
    # 3. Build Structure Tree
    print(f"   - Building Structure...")
    builder = LatexStructureBuilder(flat_content['content'], paper_base_id, ver)  # Use the base ID and the version
    root_node = builder.build_coarse_tree()
    
    # 4. Process Content (Split Sentences, Clean Figures...)
    # This step is IMPORTANT: it creates the leaf nodes (sentences) used for deduplication
    print(f"   - Processing Content (Sentences/Figures)...")
    processor = LatexContentProcessor(paper_base_id, ver)
    processor.process_tree(root_node)
    
    # 5. Add to Deduplicator
    print(f"   - Merging into global hierarchy...")
    deduplicator.process_version(ver, root_node)

# --- EXPORT RESULTS ---

final_output = deduplicator.get_final_json()
output_file = 'hierarchy.json'

with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(final_output, f, indent=2, ensure_ascii=False)

print(f"\n✅ DONE! Exported {len(final_output['elements'])} unique elements to '{output_file}'.")
print(f"   - Hierarchy contains versions: {list(final_output['hierarchy'].keys())}")


🚀 Start processing 2403-00531v1...
   - Flattening...
📝 Initializing LatexFlattener for Paper: 2403-00531, Version: 2403-00531v1
   - Building Structure...
   - Processing Content (Sentences/Figures)...
🔍 Processing preamble to extract Title, Authors, Abstract...
   - Merging into global hierarchy...
🔄 Processing hierarchy for version: 1 (from 2403-00531v1)

🚀 Start processing 2403-00531v2...
   - Flattening...
📝 Initializing LatexFlattener for Paper: 2403-00531, Version: 2403-00531v2
   - Building Structure...
   - Processing Content (Sentences/Figures)...
🔍 Processing preamble to extract Title, Authors, Abstract...
   - Merging into global hierarchy...
🔄 Processing hierarchy for version: 2 (from 2403-00531v2)

✅ DONE! Exported 532 unique elements to 'hierarchy.json'.
   - Hierarchy contains versions: ['1', '2']
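
As a quick sanity check on the exported file, the sketch below reloads `hierarchy.json` and reports the same two figures printed above. It assumes only what the export cell already relies on: a top-level `elements` entry that supports `len()` and a `hierarchy` mapping keyed by version. The function names are hypothetical.

```python
import json

def summarize_hierarchy(data):
    """Return the element count and the sorted version keys of an exported hierarchy dict."""
    return {
        "element_count": len(data["elements"]),
        "versions": sorted(data["hierarchy"].keys()),
    }

def load_and_summarize(path="hierarchy.json"):
    """Read the exported JSON file and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_hierarchy(json.load(f))
```

Running `load_and_summarize()` in the notebook's working directory should agree with the counts reported by the export cell; a mismatch would suggest the file was overwritten or truncated.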
