# Batch Processing for IntroDS Milestone 2
This notebook automates the pipeline for all papers found in `data_raw`.
It performs:
1.  **Multi-file Gathering:** Finds the root `.tex` file and flattens the multi-file structure (`\input`, `\include`) into a single string [cite: 11-14].
2.  **Hierarchy Construction:** Parses LaTeX into a tree structure [cite: 15-20].
3.  **Standardization:** Cleans content and splits sentences [cite: 27-30].
4.  **Content Deduplication:** Merges content across versions [cite: 36-37].
5.  **Output Generation:** Saves `hierarchy.json` in the required folder structure [cite: 89-96].
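
To make step 4 more concrete, here is a minimal sketch of one way content could be merged across versions: each piece of text is keyed by a hash of its content, and every version that contains it is recorded on the same entry. This is an illustration under assumptions, not the implementation of `ContentDeduplicator`; the function `merge_versions` and the `text`/`versions` fields are hypothetical names.

```python
import hashlib

def merge_versions(versions):
    """Merge {version: [text, ...]} into one store keyed by content hash.

    Hypothetical sketch: the real ContentDeduplicator may work differently.
    """
    store = {}
    for version, texts in versions.items():
        for text in texts:
            cleaned = text.strip()
            # Identical text always produces the same key, whatever the version
            key = hashlib.sha1(cleaned.encode("utf-8")).hexdigest()[:12]
            entry = store.setdefault(key, {"text": cleaned, "versions": []})
            if version not in entry["versions"]:
                entry["versions"].append(version)
    return store

merged = merge_versions({
    "v1": ["Intro sentence.", "Old result."],
    "v2": ["Intro sentence.", "New result."],
})
print(len(merged))  # 3 unique elements: the shared sentence is stored once
```

Keying on content rather than on position means an unchanged sentence is stored once even if it moves between versions.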


In [2]:

# %%
import sys
import os
import json
import shutil
from tqdm import tqdm  # Progress-bar library (if missing: pip install tqdm)

# --- SETUP PATHS ---
# Adjust project_root if necessary
project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '23127011'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Import the project's own modules
from src.parser import (
    find_root_tex_file, 
    LatexFlattener, 
    LatexStructureBuilder, 
    LatexContentProcessor
)
from src.processing import ContentDeduplicator
# , ReferenceExtractor
# Note: import ReferenceExtractor if you also want to export refs.bib


In [3]:

# Define Data Paths
DATA_RAW_PATH = os.path.abspath(os.path.join(os.getcwd(), '..', 'data_raw'))
DATA_OUTPUT_PATH = os.path.abspath(os.path.join(os.getcwd(), '..', 'data_output'))

print(f"📂 Input: {DATA_RAW_PATH}")
print(f"📂 Output: {DATA_OUTPUT_PATH}")


📂 Input: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\data_raw
📂 Output: d:\Coding\School\Y3-K1\Intro2DS\DS - LAB 2\Milestone2_Project\data_output


## Helper Functions

Helper functions for listing subdirectories and sorting versions.

In [4]:
def get_subdirs(path):
    """Returns a list of immediate subdirectories in a path."""
    if not os.path.exists(path):
        return []
    return [d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]

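def sort_versions(version_list):
    """Sorts versions numerically: v1, v2, v10... instead of v1, v10, v2."""
    try:
        # Sort key: the integer after the last 'v'; names without a 'v' sort first
        return sorted(version_list, key=lambda x: int(x.split('v')[-1]) if 'v' in x else 0)
    except ValueError:
        # A suffix after 'v' was not an integer: fall back to plain string order
        return sorted(version_list)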
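
As a quick illustration of why numeric ordering matters, the sketch below sorts folder names shaped like the ones in this dataset. The helper `version_number` is a hypothetical name, and it uses a regular expression rather than `split('v')`, so it is a variant of the idea and not the function above.

```python
import re

def version_number(name: str) -> int:
    """Return the integer after the trailing 'v', or 0 if there is none."""
    match = re.search(r"v(\d+)$", name)
    return int(match.group(1)) if match else 0

names = ["2403-00530v10", "2403-00530v2", "2403-00530v1"]

# Plain string sorting compares character by character, so 'v10' lands before 'v2'
print(sorted(names))

# Numeric sorting on the extracted version number gives v1, v2, v10
ordered = sorted(names, key=version_number)
print(ordered)
```

Processing versions in this order matters because the deduplicator receives them one at a time, oldest first.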


## Main Processing Pipeline

The main loop iterates over every paper and each of its versions.


In [5]:
# Create the root output folder if it does not exist yet
os.makedirs(DATA_OUTPUT_PATH, exist_ok=True)

# Get the list of papers (paper IDs)
papers = get_subdirs(DATA_RAW_PATH)
print(f"Found {len(papers)} papers to process: {papers}")


Found 11 papers to process: ['2403-00530', '2403-00531', '2403-00532', '2403-00533', '2403-00534', '2403-00535', '2403-00536', '2403-00537', '2403-00538', '2403-00539', '2403-00540']


In [6]:

# Iterate over each paper
for paper_id in tqdm(papers, desc="Processing Papers"):
    print(f"\n📘 Processing Paper: {paper_id}")

    # 1. Set up the output folder for this paper
    # Structure: data_output/<paper_id>/
    paper_output_dir = os.path.join(DATA_OUTPUT_PATH, paper_id)
    os.makedirs(paper_output_dir, exist_ok=True)

    # 2. Set up the deduplicator (each paper gets its own instance to merge its versions)
    deduplicator = ContentDeduplicator()

    # 3. Find the versions in the tex/ folder
    tex_source_dir = os.path.join(DATA_RAW_PATH, paper_id, 'tex')
    if not os.path.exists(tex_source_dir):
        print(f"   ⚠️ 'tex' folder not found for {paper_id}. Skipping.")
        continue

    versions = sort_versions(get_subdirs(tex_source_dir))

    if not versions:
        print(f"   ⚠️ No versions found in {tex_source_dir}. Skipping.")
        continue

    # --- PROCESS EACH VERSION ---
    for ver in versions:
        ver_path = os.path.join(tex_source_dir, ver)
        print(f"   👉 Version: {ver}")
        
        try:
            # A. Find the root file [cite: 13]
            root_file = find_root_tex_file(ver_path)
            if not root_file:
                print(f"      ❌ Root file not found in {ver}. Skipping.")
                continue

            # B. Flatten LaTeX [cite: 11-12]
            # Merge every \input / \include file into a single string
            flattener = LatexFlattener(root_file, paper_id=paper_id, version=ver)
            flat_result = flattener.flatten()
            flat_content_str = flat_result['content']  # The flattened content string

            # C. Build Structure Tree [cite: 15-20]
            # Parse into a chapter/section/subsection tree
            builder = LatexStructureBuilder(flat_content_str, paper_id, ver)
            root_node = builder.build_coarse_tree()  # Or .parse(), depending on the method name in your version

            # D. Process Content (Clean & Split Sentences) [cite: 23-24, 28-30]
            # Split sentences; handle math formulas and images
            processor = LatexContentProcessor(paper_id, ver)
            processor.process_tree(root_node)

            # E. Add to Deduplicator [cite: 36-37]
            # Merge into the shared store and resolve duplicated content
            deduplicator.process_version(ver, root_node)

            # (Optional) F. Extract References [cite: 31]
            # If ReferenceExtractor is integrated, call it here to collect refs for refs.bib
            # ref_extractor = ReferenceExtractor()
            # refs = ref_extractor.extract_from_content(flat_content_str)
            # ... deduplicate references ...

        except Exception as e:
            print(f"      ❌ Error processing {ver}: {e}")
            # import traceback
            # traceback.print_exc()

    # --- EXPORT RESULTS FOR THIS PAPER ---

    # 1. Export hierarchy.json [cite: 106-133]
    final_json = deduplicator.get_final_json()
    hierarchy_path = os.path.join(paper_output_dir, 'hierarchy.json')

    with open(hierarchy_path, 'w', encoding='utf-8') as f:
        json.dump(final_json, f, indent=2, ensure_ascii=False)

    print(f"   ✅ Saved hierarchy.json ({len(final_json['elements'])} unique elements)")

    # 2. Copy metadata.json and references.json (if present in the raw data) [cite: 98-104]
    # The parser code does not generate metadata, so copy it from the source (usually already fetched by the scraper)
    for meta_file in ['metadata.json', 'references.json']:
        src_meta = os.path.join(DATA_RAW_PATH, paper_id, meta_file)
        dst_meta = os.path.join(paper_output_dir, meta_file)
        if os.path.exists(src_meta):
            shutil.copy2(src_meta, dst_meta)
            print(f"   ✅ Copied {meta_file}")
        else:
            print(f"   ⚠️ Missing {meta_file} in source.")

print("\n🎉 ALL DONE! Check 'data_output' folder.")

Processing Papers:   0%|          | 0/11 [00:00<?, ?it/s]


📘 Processing Paper: 2403-00530
   👉 Version: 2403-00530v1
📝 Initializing LatexFlattener for Paper: 2403-00530, Version: 2403-00530v1
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 1 (from 2403-00530v1)
   👉 Version: 2403-00530v2
📝 Initializing LatexFlattener for Paper: 2403-00530, Version: 2403-00530v2
🔍 Processing Preamble to extract Title, Authors, Abstract...


Processing Papers:   9%|▉         | 1/11 [00:00<00:01,  5.55it/s]

🔄 Processing hierarchy for version: 2 (from 2403-00530v2)
   ✅ Saved hierarchy.json (360 unique elements)
   ✅ Copied metadata.json
   ✅ Copied references.json

📘 Processing Paper: 2403-00531
   👉 Version: 2403-00531v1
📝 Initializing LatexFlattener for Paper: 2403-00531, Version: 2403-00531v1
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 1 (from 2403-00531v1)


Processing Papers:  27%|██▋       | 3/11 [00:00<00:00,  8.97it/s]

   👉 Version: 2403-00531v2
📝 Initializing LatexFlattener for Paper: 2403-00531, Version: 2403-00531v2
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 2 (from 2403-00531v2)
   ✅ Saved hierarchy.json (532 unique elements)
   ✅ Copied metadata.json
   ✅ Copied references.json

📘 Processing Paper: 2403-00532
   👉 Version: 2403-00532v1
📝 Initializing LatexFlattener for Paper: 2403-00532, Version: 2403-00532v1
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 1 (from 2403-00532v1)
   👉 Version: 2403-00532v2
📝 Initializing LatexFlattener for Paper: 2403-00532, Version: 2403-00532v2
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 2 (from 2403-00532v2)
   ✅ Saved hierarchy.json (1191 unique elements)
   ✅ Copied metadata.json
   ✅ Copied references.json

📘 Processing Paper: 2403

Processing Papers:  36%|███▋      | 4/11 [00:00<00:00,  7.39it/s]

📝 Initializing LatexFlattener for Paper: 2403-00533, Version: 2403-00533v1
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 1 (from 2403-00533v1)
   👉 Version: 2403-00533v2
📝 Initializing LatexFlattener for Paper: 2403-00533, Version: 2403-00533v2
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 2 (from 2403-00533v2)
   ✅ Saved hierarchy.json (361 unique elements)
   ✅ Copied metadata.json
   ✅ Copied references.json

📘 Processing Paper: 2403-00534
   👉 Version: 2403-00534v1
📝 Initializing LatexFlattener for Paper: 2403-00534, Version: 2403-00534v1
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 1 (from 2403-00534v1)
   ✅ Saved hierarchy.json (281 unique elements)
   ✅ Copied metadata.json
   ✅ Copied references.json

📘 Processing Paper: 2403-00535
   👉 Version: 2403-00

Processing Papers:  55%|█████▍    | 6/11 [00:00<00:00,  8.75it/s]

🔄 Processing hierarchy for version: 1 (from 2403-00535v1)
   ✅ Saved hierarchy.json (892 unique elements)
   ✅ Copied metadata.json
   ✅ Copied references.json

📘 Processing Paper: 2403-00536
   👉 Version: 2403-00536v1


Processing Papers:  82%|████████▏ | 9/11 [00:01<00:00,  7.53it/s]

📝 Initializing LatexFlattener for Paper: 2403-00536, Version: 2403-00536v1
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 1 (from 2403-00536v1)
   ✅ Saved hierarchy.json (1307 unique elements)
   ✅ Copied metadata.json
   ✅ Copied references.json

📘 Processing Paper: 2403-00537
   👉 Version: 2403-00537v1
📝 Initializing LatexFlattener for Paper: 2403-00537, Version: 2403-00537v1
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 1 (from 2403-00537v1)
   👉 Version: 2403-00537v2
📝 Initializing LatexFlattener for Paper: 2403-00537, Version: 2403-00537v2
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 2 (from 2403-00537v2)
   ✅ Saved hierarchy.json (198 unique elements)
   ✅ Copied metadata.json
   ✅ Copied references.json

📘 Processing Paper: 2403-00538
   👉 Version: 2403-0

Processing Papers: 100%|██████████| 11/11 [00:01<00:00,  7.91it/s]

📝 Initializing LatexFlattener for Paper: 2403-00540, Version: 2403-00540v1
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 1 (from 2403-00540v1)
   👉 Version: 2403-00540v2
📝 Initializing LatexFlattener for Paper: 2403-00540, Version: 2403-00540v2
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 2 (from 2403-00540v2)
   👉 Version: 2403-00540v3
📝 Initializing LatexFlattener for Paper: 2403-00540, Version: 2403-00540v3
🔍 Processing Preamble to extract Title, Authors, Abstract...
🔄 Processing hierarchy for version: 3 (from 2403-00540v3)
   ✅ Saved hierarchy.json (523 unique elements)
   ✅ Copied metadata.json
   ✅ Copied references.json

🎉 ALL DONE! Check 'data_output' folder.
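
Because the progress bar interleaves with the log and cuts some lines short, it can help to check the output folder directly rather than rely on the printed messages. The sketch below counts the entries under the `elements` key of each `hierarchy.json`. The function name `summarize_outputs` is hypothetical, and it assumes only that `elements` is a list or dict, which is all the `len(final_json['elements'])` call in the loop requires.

```python
import json
import os

def summarize_outputs(output_root):
    """Map each paper folder to its element count in hierarchy.json (None if the file is missing)."""
    summary = {}
    for paper_id in sorted(os.listdir(output_root)):
        paper_dir = os.path.join(output_root, paper_id)
        if not os.path.isdir(paper_dir):
            continue  # Ignore stray files in the output root
        hierarchy_path = os.path.join(paper_dir, "hierarchy.json")
        if not os.path.exists(hierarchy_path):
            summary[paper_id] = None  # Paper folder exists but nothing was exported
            continue
        with open(hierarchy_path, encoding="utf-8") as f:
            data = json.load(f)
        summary[paper_id] = len(data.get("elements", []))
    return summary
```

In this notebook it would be called as `summarize_outputs(DATA_OUTPUT_PATH)`; a `None` value flags a paper whose export did not complete.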



