# Test Script for Parser Module

This notebook tests the refactored code in `src/parser`.
We will test:
1. `find_root_tex_file`: Locating the main LaTeX file.
2. `LatexFlattener`: Merging LaTeX files into one.
3. `LatexStructureBuilder`: Parsing the structure of the LaTeX document.

In [1]:
import sys
import os

# Add the project root to sys.path to import src
# Adjust this path depending on where you run the notebook
project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '23127011'))
if project_root not in sys.path:
    sys.path.append(project_root)

print(f"Added to path: {project_root}")

from src.parser import find_root_tex_file, LatexFlattener, LatexStructureBuilder

Added to path: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\23127011


In [2]:
# Configuration
DATA_RAW_PATH = os.path.abspath(os.path.join(os.getcwd(), '..', 'data_raw'))

print(f"Data raw path: {DATA_RAW_PATH}")

if not os.path.exists(DATA_RAW_PATH):
    print("❌ Warning: Data raw path does not exist. Please check the path.")
else:
    print("✅ Data raw path found.")

Data raw path: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\data_raw
✅ Data raw path found.


In [3]:
paper_to_test = '2403-00531'
version_to_test = '2403-00531v2'
path_to_version = os.path.join(DATA_RAW_PATH, paper_to_test, 'tex', version_to_test)

print(path_to_version)

d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\data_raw\2403-00531\tex\2403-00531v2


In [4]:
root_file = find_root_tex_file(path_to_version)
print(f"Root TeX file: {root_file}")

Root TeX file: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\data_raw\2403-00531\tex\2403-00531v2\apssamp.tex
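
The internals of `find_root_tex_file` are not shown in this notebook. As a rough illustration of how a root file could be located, the sketch below picks the first `.tex` file that contains an uncommented `\documentclass`. The function name `guess_root_tex_file` and the heuristic itself are assumptions for illustration, not the implementation in `src/parser`.

```python
import os
import re
import tempfile

# Matches \documentclass at the start of a line, so lines commented out with '%' are skipped.
DOCUMENTCLASS_RE = re.compile(r"^\s*\\documentclass", re.MULTILINE)


def guess_root_tex_file(version_dir):
    """Hypothetical heuristic: return the first .tex file (sorted by name)
    that declares \\documentclass, or None if no file does."""
    for name in sorted(os.listdir(version_dir)):
        if not name.lower().endswith(".tex"):
            continue
        path = os.path.join(version_dir, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            if DOCUMENTCLASS_RE.search(f.read()):
                return path
    return None


# Tiny demonstration on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "intro.tex"), "w", encoding="utf-8") as f:
        f.write("\\section{Introduction}\n")
    with open(os.path.join(tmp, "main.tex"), "w", encoding="utf-8") as f:
        f.write("% \\documentclass{old}\n\\documentclass{article}\n\\input{intro}\n")
    guessed_name = os.path.basename(guess_root_tex_file(tmp))
    print(guessed_name)
```

A real project may need extra tie-breakers (for example, preferring files that also contain `\begin{document}`) when several files declare a document class.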


In [5]:
# Constructing the flattener only initializes it; flatten() is called in the next cell.
flattener = LatexFlattener(root_file, paper_to_test, version_to_test)
print("Flattener initialized.")

📝 Initializing LatexFlattener for Paper: 2403-00531, Version: 2403-00531v2
Flattener initialized.


In [6]:
res = flattener.flatten()

In [7]:
print(res)

{'paper_id': '2403-00531', 'version': '2403-00531v2', 'root_file_path': 'd:\\Coding\\School\\Y3-K1\\Intro2DS\\DS - LAB 2\\Milestone2_Project\\data_raw\\2403-00531\\tex\\2403-00531v2\\apssamp.tex', 'metadata': {'total_length': 72185, 'merged_count': 1, 'merged_files': ['apssamp.tex'], 'missing_files': []}, 'content': "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\documentclass[\n reprint,\n\n\n\n\n\n\n\n\n\n\n amsmath,amssymb,\n aps,\npra,\n\n\n\n\nfloatfix,\n]{revtex4-2}\n\n\\usepackage{graphicx}\n\n\\usepackage{csvsimple}\n\\usepackage{dcolumn}\n\\usepackage{array} \n\n\n\\usepackage{bm}\n\\usepackage[dvipsnames]{xcolor}\n\n\\usepackage{subcaption}\n\\usepackage{multirow}\n\\usepackage{times}\n\\usepackage{placeins}\n\\usepackage{hyperref}\n\\usepackage{booktabs}\n\\bibliographystyle{apsrev4-1}\n\n\\begin{document}\n\\hbadness=99999\n\n\\preprint{APS/123-QED}\n\n\\title{Phenomenology of renormalization group improved gravity from the kinematics of SPARC galaxies.}\n\n\n\\author{Esha Bhati
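
The result above reports `merged_files` and `missing_files` alongside the merged `content`. To make that bookkeeping concrete, here is a minimal sketch of flattening by recursively inlining `\input{...}` commands. It works on an in-memory dictionary of made-up files; the function `flatten_tex` is hypothetical and is not the `LatexFlattener` implementation, which may also handle `\include`, comments, and other cases.

```python
import re

# Matches \input{name}; the captured group is the referenced file name.
INPUT_RE = re.compile(r"\\input\{([^}]+)\}")


def flatten_tex(name, files, merged=None, missing=None):
    """Hypothetical sketch: recursively replace \\input{x} with the content of x(.tex).

    `files` maps file names to their text. Returns (content, merged, missing).
    """
    merged = [] if merged is None else merged
    missing = [] if missing is None else missing
    merged.append(name)

    def inline(match):
        target = match.group(1).strip()
        if not target.endswith(".tex"):
            target += ".tex"
        if target not in files:
            missing.append(target)
            return ""  # drop references that cannot be resolved
        if target in merged:
            return ""  # guard against circular or repeated includes
        return flatten_tex(target, files, merged, missing)[0]

    content = INPUT_RE.sub(inline, files[name])
    return content, merged, missing


# Made-up example files, for illustration only.
files = {
    "main.tex": "\\begin{document}\n\\input{intro}\n\\input{appendix}\n\\end{document}\n",
    "intro.tex": "\\section{Introduction}\nHello.",
}
flat, merged_files, missing_files = flatten_tex("main.tex", files)
print(merged_files, missing_files)
```

In this toy run `intro.tex` is merged into the root, while `appendix.tex` is recorded as missing because it is not in the dictionary.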

In [8]:
builder = LatexStructureBuilder(res['content'], paper_to_test, version_to_test)
root = builder.build_coarse_tree()

# A tree is a dictionary whose 'children' entry is a list of subtrees, each also a dictionary.
# The builder class stores only the paper id and version, not the detailed content,
# so after calling build_coarse_tree() we must keep the `root` variable for further processing.
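
Since each node is a plain dictionary with a `children` list, the tree can be traversed with a short recursive generator. The sketch below builds a toy tree and prints an outline similar in spirit to `print_tree`. Only the `children` key is taken from this notebook; the `type` and `title` keys are assumed names for illustration and may differ from what `LatexStructureBuilder` produces.

```python
def walk(node, depth=0):
    """Yield (depth, node) pairs in depth-first, pre-order."""
    yield depth, node
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


# Toy tree; the 'type' and 'title' keys are hypothetical.
toy_root = {"type": "DOCUMENT", "title": "Root Document", "children": [
    {"type": "SECTION", "title": "Introduction", "children": []},
    {"type": "SECTION", "title": "The models", "children": [
        {"type": "SUBSECTION", "title": "RGGR gravity model", "children": []},
    ]},
]}

# Indent each entry by two spaces per level of depth.
outline = [f"{'  ' * d}- [{n['type']}] {n['title']}" for d, n in walk(toy_root)]
print("\n".join(outline))
```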

In [9]:
# print(root)
builder.print_tree(root)

- [DOCUMENT] Root Document (ID: 2403-005)
  Content Preview: 



















\documentclass[
 reprint,




...
  - [SECTION] Introduction (ID: 2403-005)
    Content Preview:  
\label{sec:intro}

The proposal to modify the ac...
  - [SECTION] The models (ID: 2403-005)
    Content Preview: 
\label{sec:rggr}
...
    - [SUBSECTION] RGGR gravity model (ID: 2403-005)
      Content Preview: 

Renormalization Group correction to General Rela...
    - [SUBSECTION] NFW dark matter (ID: 2403-005)
      Content Preview: 
We also compare the RGGR model with an alternativ...
  - [SECTION] Galaxy catlogue (ID: 2403-005)
    Content Preview: 
\label{sec:gal}

Spitzer Photometry for Accurate ...
  - [SECTION] Methodology (ID: 2403-005)
    Content Preview: 
\label{sec:method}
Using the RC data for the qual...
  - [SECTION] Results (ID: 2403-005)
    Content Preview: 
\label{sec:res}
In the following, we analyze the ...
    - [SUBSECTION] Fit to the observed Rotation Curve (ID: 2403-005)
      Conte

In [10]:
from src.parser import LatexContentProcessor

In [11]:
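def count_nodes(node):
    """Return the number of nodes in the subtree rooted at `node`, including `node` itself."""
    count = 1  # count the current node
    for child in node.get('children', []):
        count += count_nodes(child)
    return count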

before = count_nodes(root)

In [12]:
content_processor = LatexContentProcessor(paper_to_test, version_to_test)
content_processor.process_tree(root)

🔍 Processing the preamble to extract Title, Authors, Abstract...


In [13]:
after = count_nodes(root)

In [14]:
print(before, after)

18 441


In [15]:
builder.print_tree_to_file(root, 'output_tree2.txt')
builder.print_tree_to_file(root, 'output_tree2.json')

✅ Saved tree structure to: output_tree2.txt
   - Total nodes: 441
   - Total edges: 440
✅ Saved tree structure to: output_tree2.json
   - Total nodes: 441
   - Total edges: 440
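
The reported counts (441 nodes, 440 edges) match the general property that a tree with N nodes has N - 1 edges, since every node except the root has exactly one parent. A minimal sketch that counts both in one pass over a toy tree follows; the helper `count_nodes_and_edges` is hypothetical and is not part of `src/parser`.

```python
def count_nodes_and_edges(node):
    """Return (nodes, edges) for the subtree rooted at `node`."""
    nodes, edges = 1, 0
    for child in node.get("children", []):
        child_nodes, child_edges = count_nodes_and_edges(child)
        nodes += child_nodes
        edges += child_edges + 1  # +1 for the edge from node to this child
    return nodes, edges


# Toy tree with one root, two children, and one grandchild.
toy_tree = {"children": [
    {"children": []},
    {"children": [{"children": []}]},
]}
n_nodes, n_edges = count_nodes_and_edges(toy_tree)
print(n_nodes, n_edges)
```

If the two numbers ever differ by something other than one, the structure is no longer a single tree (for example, a node was attached to two parents or a subtree was detached).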


In [16]:
# Note: the LaTeX content processor only handles the content of text blocks that have already been split out.

In [17]:
cleaned_content = builder.export_cleaned_paper(root)

In [18]:
with open('cleaned_content.txt', 'w', encoding='utf-8') as f:
    f.write(cleaned_content)

In [19]:
test_markdown = builder.export_to_markdown(root)
with open('exported_paper.md', 'w', encoding='utf-8') as f:
    f.write(test_markdown)