# 🎯 Strategy 1: Direct LlamaParse Approach

**Philosophy**: Use LlamaParse's built-in AI to convert PDFs directly into structured academic content.

## Optimization Areas:
- System prompt engineering for academic papers
- Result type selection (text vs markdown)
- API parameter tuning
- Post-processing logic
- Error handling improvements

## Available Papers:
- `30YearsResearchGate.pdf`
- `SchenkBekkerSchmitt2025PrecRes.pdf`

In [1]:
# Install required packages
!pip3 install llama_parse

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.


In [2]:
from llama_parse import LlamaParse
import os



In [None]:
# API Keys
api_key = "llx-"
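
A key hardcoded in a cell is easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported in the shell beforehand: `load_api_key` is a hypothetical helper, not part of LlamaParse, and `LLAMA_CLOUD_API_KEY` is the environment variable name that LlamaParse itself looks for when no key is passed.

```python
import os


def load_api_key(env_var="LLAMA_CLOUD_API_KEY"):
    """Return the API key stored in env_var, or raise if it is missing."""
    # Hypothetical helper: reads the key from the environment instead of the notebook
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before parsing.")
    return key
```

With this in place the cell above could become `api_key = load_api_key()`, so the notebook never stores the key itself.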

In [4]:
# ⚡ OPTIMIZATION AREA 1: Custom System Prompt for Academic Papers
# Try different prompts that emphasize specific academic elements
parsing_instruction_research = """
The provided document is a research paper. I want you to parse it systematically while preserving the academic structure and important content.

IMPORTANT PARSING GUIDELINES:
1. Preserve academic sections: abstract, introduction, methodology, results, discussion, conclusion, references
2. Maintain figure captions and table data as separate sections
3. Extract section numbers (e.g., "2.1", "3.2") when available
4. Preserve mathematical formulas, equations, and citations
5. Filter out headers/footers but keep page numbers
6. Maintain academic formatting and language
7. Break content at logical academic boundaries (sections, subsections, paragraphs)

The output should maintain the scholarly structure of the original document.
"""

# 🔧 TRY: Alternative prompt variations:

# Option A: More specific academic focus
# parsing_instruction_v2 = """
# You are an expert academic document parser specializing in research papers.
# Focus on:
# 1. Precise section identification (abstract, intro, methods, results, discussion, conclusion)
# 2. Mathematical equation preservation
# 3. Citation and reference handling
# 4. Figure/table metadata extraction
# Break into granular chunks maintaining academic integrity.
# """

# Option B: Structure-first approach
# parsing_instruction_v3 = """
# Parse this research paper with emphasis on structural hierarchy:
# 1. Identify main sections and subsections
# 2. Preserve numbering systems (1.1, 1.2, etc.)
# 3. Separate visual elements (figures, tables, equations)
# 4. Maintain citation context and reference links
# Create logical, hierarchical chunks that preserve academic flow.
# """
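
Trying the prompt variations by hand means editing and re-running cells for each one. The sketch below loops over named prompts and writes each result to its own file. `run_prompt_variants` is a hypothetical helper. It assumes a processing function with the signature `(input_file, output_file, custom_prompt)`, which is what `process_paper_with_llamaparse` further down in this notebook provides.

```python
import os


def run_prompt_variants(process_fn, input_file, output_dir, variants):
    """Run process_fn once per prompt and return {variant_name: output_path}.

    process_fn is expected to accept (input_file, output_file, custom_prompt=...).
    variants maps a short name to the prompt text to try.
    """
    # Use the PDF's base name so outputs from different papers do not collide
    stem = os.path.splitext(os.path.basename(input_file))[0]
    outputs = {}
    for name, prompt in variants.items():
        output_file = os.path.join(output_dir, f"{stem}__{name}.md")
        process_fn(input_file, output_file, custom_prompt=prompt)
        outputs[name] = output_file
    return outputs
```

After uncommenting Option A and Option B, a call could look like `run_prompt_variants(process_paper_with_llamaparse, research_paper_1, output_dir, {"baseline": parsing_instruction_research, "v2": parsing_instruction_v2, "v3": parsing_instruction_v3})`. Each variant triggers a separate parsing job, so the number of API calls grows with the number of prompts.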

In [5]:
# Research paper file paths
research_paper_1 = "/Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/30YearsResearchGate.pdf"
research_paper_2 = "/Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/SchenkBekkerSchmitt2025PrecRes.pdf"

# Output directory
output_dir = "/Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/input_papers/"
os.makedirs(output_dir, exist_ok=True)
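
The absolute paths above only work on one machine. A possible sketch that derives the same three locations from a single base directory: `resolve_paper_paths` and the `TWIGA_DATA_DIR` variable name are both hypothetical, and the `papers` / `input_papers` layout is copied from the paths in the cell above.

```python
import os
from pathlib import Path


def resolve_paper_paths(data_dir=None):
    """Build input and output paths under one configurable data directory."""
    # TWIGA_DATA_DIR is a hypothetical variable name; the fallback is ./data
    base = Path(data_dir or os.environ.get("TWIGA_DATA_DIR", "data"))
    papers_dir = base / "papers"
    return {
        "paper_1": papers_dir / "30YearsResearchGate.pdf",
        "paper_2": papers_dir / "SchenkBekkerSchmitt2025PrecRes.pdf",
        "output_dir": base / "input_papers",
    }
```

The values are `Path` objects; wrapping them in `str()` keeps them compatible with code that expects plain string paths.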

In [6]:
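def process_paper_with_llamaparse(input_file, output_file, custom_prompt=None):
    """Parse a research paper with LlamaParse and write the text to output_file.

    Falls back to parsing_instruction_research when custom_prompt is not given.
    Returns the list of parsed documents.
    """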
    
    prompt = custom_prompt or parsing_instruction_research
    
    # ⚡ OPTIMIZATION AREA 2: LlamaParse Parameters
    # Experiment with different result_type, system_prompt combinations
    parser = LlamaParse(
        api_key=api_key,
        result_type="markdown",  # 🔧 TRY: "text", "markdown", or other formats
        system_prompt=prompt,  # 🔧 TRY: Different prompt variations
        verbose=True,
        # 🔧 TRY: Add other parameters like language="en", num_workers=4, etc.
        # language="en",
        # num_workers=4,
        # split_by_page=False,
    )
    
    print(f"Parsing research paper: {input_file}")
    parsed_document = parser.load_data(input_file)
    
    # ⚡ OPTIMIZATION AREA 3: Post-processing Enhancement
    # Add custom logic to improve parsing results
    with open(output_file, 'w', encoding='utf-8') as f:
        for doc in parsed_document:
            # 🔧 TRY: Add filtering, formatting, or enhancement logic here
            content = doc.text
            
            # Example post-processing options:
            # content = clean_academic_text(content)
            # content = enhance_section_headers(content)
            # content = fix_citations(content)
            
            f.write(content + '\n')
    
    print(f"Parsed content saved to: {output_file}")
    return parsed_document
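
The post-processing hooks named in the comments above (`clean_academic_text`, `enhance_section_headers`, `fix_citations`) are placeholders and are not defined anywhere in this notebook. As a starting point, here is a rough sketch of what `clean_academic_text` might do. The three cleanup rules are assumptions about common PDF extraction noise, not something LlamaParse output is known to need.

```python
import re


def clean_academic_text(text):
    """Light cleanup of parsed text; the rules are illustrative, not exhaustive."""
    # Strip trailing spaces and tabs on every line
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Rejoin lowercase words hyphenated across a line break: "implemen-\ntation"
    text = re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)
    # Collapse runs of three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Normalize to exactly one trailing newline
    return text.strip() + "\n"
```

The hyphen rule also merges genuine compounds that happen to break at a line end (for example `state-of-` followed by `the-art`), so it is worth spot-checking the output before relying on it.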

In [7]:
# Process first research paper
output_file_1 = os.path.join(output_dir, "strategy1_paper1_llamaparse.md")
parsed_doc_1 = process_paper_with_llamaparse(research_paper_1, output_file_1)

Parsing research paper: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/30YearsResearchGate.pdf
Started parsing the file under job_id e6be10a1-ec8a-4ed4-94a0-4dda0cae4bd8
Parsed content saved to: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/input_papers/strategy1_paper1_llamaparse.md


In [8]:
# Process second research paper
output_file_2 = os.path.join(output_dir, "strategy1_paper2_llamaparse.md")
parsed_doc_2 = process_paper_with_llamaparse(research_paper_2, output_file_2)

Parsing research paper: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/SchenkBekkerSchmitt2025PrecRes.pdf
Started parsing the file under job_id a7bcb89e-e3ac-43b8-8640-46b7f39ce032
Parsed content saved to: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/input_papers/strategy1_paper2_llamaparse.md
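
Both runs above succeeded, but the calls have no protection against a failed or empty parse, which is the "Error handling improvements" item in the optimization list. A minimal retry sketch follows. `parse_with_retries` is a hypothetical helper, and treating an empty result as a failure is an assumption about how parse errors surface.

```python
import time


def parse_with_retries(parse_fn, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call parse_fn until it returns a non-empty result or attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = parse_fn()
            if result:
                return result
            # Assumption: an empty result means the parse did not succeed
            last_error = RuntimeError("parser returned no documents")
        except Exception as exc:  # broad on purpose: failure modes are not known here
            last_error = exc
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"parsing failed after {attempts} attempts") from last_error
```

It wraps any zero-argument callable, for example `parse_with_retries(lambda: process_paper_with_llamaparse(research_paper_1, output_file_1))`. The `sleep` parameter exists only so the delay can be replaced in tests.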


In [9]:
# ⚡ OPTIMIZATION AREA 4: Quality Assessment and Comparison
def analyze_parsing_results(output_file, paper_name):
    """Analyze the quality of parsing results"""
    
    with open(output_file, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Basic metrics
    total_length = len(content)
    line_count = len(content.split('\n'))
    
    # Academic structure detection: counts keyword mentions anywhere in the
    # text (case-insensitive), not only in section headings
    lowered = content.lower()
    sections = {
        'abstract': lowered.count('abstract'),
        'introduction': lowered.count('introduction'),
        'methodology': lowered.count('methodology') + lowered.count('methods'),
        'results': lowered.count('results'),
        'discussion': lowered.count('discussion'),
        'conclusion': lowered.count('conclusion'),
        'references': lowered.count('references') + lowered.count('bibliography')
    }
    
    # Figure and table detection (mentions, not distinct figures or tables)
    figures = lowered.count('figure')
    tables = lowered.count('table')
    
    # Citations (rough upper bound: counts every '(' and '[' in the text,
    # including ones that are not citation markers)
    citations = content.count('(') + content.count('[')
    
    print(f"\n📊 ANALYSIS RESULTS FOR {paper_name}:")
    print(f"Total content length: {total_length:,} characters")
    print(f"Total lines: {line_count:,}")
    print(f"Academic sections found: {sections}")
    print(f"Figures mentioned: {figures}")
    print(f"Tables mentioned: {tables}")
    print(f"Potential citations: {citations}")
    
    return {
        'length': total_length,
        'lines': line_count,
        'sections': sections,
        'figures': figures,
        'tables': tables,
        'citations': citations
    }

# Analyze both papers
results_1 = analyze_parsing_results(output_file_1, "Paper 1 (30YearsResearchGate)")
results_2 = analyze_parsing_results(output_file_2, "Paper 2 (SchenkBekkerSchmitt2025)")


📊 ANALYSIS RESULTS FOR Paper 1 (30YearsResearchGate):
Total content length: 51,889 characters
Total lines: 714
Academic sections found: {'abstract': 24, 'introduction': 25, 'methodology': 30, 'results': 27, 'discussion': 29, 'conclusion': 30, 'references': 26}
Figures mentioned: 46
Tables mentioned: 26
Potential citations: 193

📊 ANALYSIS RESULTS FOR Paper 2 (SchenkBekkerSchmitt2025):
Total content length: 69,657 characters
Total lines: 785
Academic sections found: {'abstract': 22, 'introduction': 22, 'methodology': 27, 'results': 27, 'discussion': 19, 'conclusion': 18, 'references': 21}
Figures mentioned: 43
Tables mentioned: 42
Potential citations: 341
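
The section counts above come from `str.count`, so they include every mention of a keyword in the body text. That is why 'abstract' shows up 24 times for Paper 1. A stricter sketch counts keywords only in markdown heading lines and matches citations by pattern rather than by bracket. `structural_metrics` and both citation regexes are illustrative assumptions; they cover simple numeric and single author-year forms only.

```python
import re

# "method" also matches "methods" and "methodology"
SECTION_KEYWORDS = ("abstract", "introduction", "method", "results",
                    "discussion", "conclusion", "references")

# A markdown heading: 1-6 '#' characters, spaces or tabs, then the title
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.*)$", re.MULTILINE)
# Numeric citations such as [3], [3, 7] or [3-5]
NUMERIC_CITE_RE = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")
# Single author-year citations such as (Smith, 2020) or (Smith et al., 2020)
AUTHOR_YEAR_RE = re.compile(r"\([A-Z][^()\d]*,?\s(?:19|20)\d{2}[a-z]?\)")


def structural_metrics(markdown_text):
    """Count section keywords in headings only, plus pattern-matched citations."""
    headings = [h.strip().lower() for h in HEADING_RE.findall(markdown_text)]
    sections = {
        key: sum(1 for h in headings if key in h) for key in SECTION_KEYWORDS
    }
    return {
        "headings": len(headings),
        "sections": sections,
        "numeric_citations": len(NUMERIC_CITE_RE.findall(markdown_text)),
        "author_year_citations": len(AUTHOR_YEAR_RE.findall(markdown_text)),
    }
```

This only makes sense for `result_type="markdown"`, since plain-text output has no `#` headings to match. Grouped citations such as `(Smith, 2020; Jones, 2021)` are not counted by the author-year pattern.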


In [10]:
# 🎯 YOUR OPTIMIZATION WORKSPACE
# Implement your Strategy 1 optimizations here

print("🚀 Strategy 1 Optimization Workspace")
print("\nOptimization Areas for LlamaParse:")
print("☐ Custom academic prompts")
print("☐ Result type optimization (text/markdown/json)")
print("☐ Parameter tuning (language, workers, etc.)")
print("☐ Post-processing enhancements")
print("☐ Quality validation")

# TODO: Add your optimized LlamaParse implementation here

# Example optimization template:
'''
# Your optimized implementation:
optimized_prompt = """
YOUR_CUSTOM_ACADEMIC_PROMPT_HERE
"""

optimized_parser = LlamaParse(
    api_key=api_key,
    result_type="YOUR_OPTIMAL_TYPE",
    system_prompt=optimized_prompt,
    language="en",
    num_workers=4,
    # Add other optimized parameters
)

# Process both papers with optimizations
# Add quality assessment and comparison
'''

🚀 Strategy 1 Optimization Workspace

Optimization Areas for LlamaParse:
☐ Custom academic prompts
☐ Result type optimization (text/markdown/json)
☐ Parameter tuning (language, workers, etc.)
☐ Post-processing enhancements
☐ Quality validation



In [11]:
# 📊 STRATEGY 1 RESULTS SUMMARY
print("🏆 STRATEGY 1: LLAMAPARSE DIRECT RESULTS")
print("=" * 50)

print("\nPaper 1 Results:")
print(f"  Content Length: {results_1['length']:,} chars")
print(f"  Academic Sections: {sum(results_1['sections'].values())}")
print(f"  Figures/Tables: {results_1['figures'] + results_1['tables']}")

print("\nPaper 2 Results:")
print(f"  Content Length: {results_2['length']:,} chars")
print(f"  Academic Sections: {sum(results_2['sections'].values())}")
print(f"  Figures/Tables: {results_2['figures'] + results_2['tables']}")

print("\n🎯 Next Steps:")
print("1. Review the generated markdown files")
print("2. Implement your optimizations above")
print("3. Compare results with other strategies")
print("4. Document your improvements")

🏆 STRATEGY 1: LLAMAPARSE DIRECT RESULTS

Paper 1 Results:
  Content Length: 51,889 chars
  Academic Sections: 191
  Figures/Tables: 72

Paper 2 Results:
  Content Length: 69,657 chars
  Academic Sections: 156
  Figures/Tables: 85

🎯 Next Steps:
1. Review the generated markdown files
2. Implement your optimizations above
3. Compare results with other strategies
4. Document your improvements
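
For step 3, comparing against other strategies is easier if each notebook saves its metrics in a common format. One possible sketch, assuming every strategy produces dictionaries shaped like `results_1` and `results_2`: `save_metrics`, `load_all_metrics` and the `*_metrics.json` naming are hypothetical conventions, not part of the challenge setup.

```python
import json
from pathlib import Path


def save_metrics(metrics_by_paper, strategy_name, out_dir):
    """Write one JSON file per strategy so runs can be compared later."""
    out_path = Path(out_dir) / f"{strategy_name}_metrics.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(metrics_by_paper, f, indent=2, sort_keys=True)
    return out_path


def load_all_metrics(out_dir):
    """Return {strategy_name: metrics} for every *_metrics.json in out_dir."""
    loaded = {}
    for path in sorted(Path(out_dir).glob("*_metrics.json")):
        # Strip the shared suffix to recover the strategy name
        strategy = path.name[: -len("_metrics.json")]
        with open(path, encoding="utf-8") as f:
            loaded[strategy] = json.load(f)
    return loaded
```

In this notebook the call would be `save_metrics({"paper_1": results_1, "paper_2": results_2}, "strategy1", output_dir)`; a later comparison notebook can then read every strategy back with `load_all_metrics(output_dir)`.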
