# 🎯 Strategy 1: Direct LlamaParse Approach

**Philosophy**: Leverage LlamaParse's sophisticated built-in AI to directly convert PDFs into structured academic content.

## Optimization Areas:
- System prompt engineering for academic papers
- Result type selection (text vs markdown)
- API parameter tuning
- Post-processing logic
- Error handling improvements

## Available Papers:
- `Mobile-Based_Deep_Learning_Models_for_Banana_Disease.pdf`
- `Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf`
- `Practical Machine Learning_25_05_04_14_32_34.pdf`

In [19]:
# Install required packages
!pip3 install llama_parse python-dotenv

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.


In [20]:
from llama_parse import LlamaParse
import os
from dotenv import load_dotenv
load_dotenv()

True

In [21]:
# API Key from environment (LlamaParse)
api_key = os.getenv("LLAMAPARSE_API_KEY")
if not api_key:
    raise ValueError("LLAMAPARSE_API_KEY not found in environment. Please set it in your .env file.")

In [22]:
# ⚡ OPTIMIZATION AREA 1: Custom System Prompt for Academic Papers
# Try different prompts that emphasize specific academic elements
parsing_instruction_research = """
The provided document is a research paper. I want you to parse it systematically while preserving the academic structure and important content.

IMPORTANT PARSING GUIDELINES:
1. Preserve academic sections: abstract, introduction, methodology, results, discussion, conclusion, references
2. Maintain figure captions and table data as separate sections
3. Extract section numbers (e.g., "2.1", "3.2") when available
4. Preserve mathematical formulas, equations, and citations
5. Filter out headers/footers but keep page numbers
6. Maintain academic formatting and language
7. Break content at logical academic boundaries (sections, subsections, paragraphs)

The output should maintain the scholarly structure of the original document.
"""

# 🔧 TRY: Alternative prompt variations:

# Option A: More specific academic focus
# parsing_instruction_v2 = """
# You are an expert academic document parser specializing in research papers.
# Focus on:
# 1. Precise section identification (abstract, intro, methods, results, discussion, conclusion)
# 2. Mathematical equation preservation
# 3. Citation and reference handling
# 4. Figure/table metadata extraction
# Break into granular chunks maintaining academic integrity.
# """

# Option B: Structure-first approach
# parsing_instruction_v3 = """
# Parse this research paper with emphasis on structural hierarchy:
# 1. Identify main sections and subsections
# 2. Preserve numbering systems (1.1, 1.2, etc.)
# 3. Separate visual elements (figures, tables, equations)
# 4. Maintain citation context and reference links
# Create logical, hierarchical chunks that preserve academic flow.
# """

In [23]:
import glob

# List all PDF files in the papers directory
papers_dir = "/Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/"
paper_files = sorted(glob.glob(os.path.join(papers_dir, "*.pdf")))

# Output directory
output_dir = "/Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/input_papers/"
os.makedirs(output_dir, exist_ok=True)

# Map input PDF to output markdown filename
paper_outputs = [
    (
        paper,
        os.path.join(
            output_dir,
            f"strategy1_{os.path.splitext(os.path.basename(paper))[0].replace(' ', '_').replace('-', '_').lower()}_llamaparse.md"
        )
    )
    for paper in paper_files
]
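
The filename mapping above chains `replace` calls for spaces and hyphens only. A minimal sketch of a more general variant follows, assuming a hypothetical helper named `output_name_for` (not part of the original notebook). It also collapses runs of separators and other punctuation, so its output can differ from the original mapping when a filename contains consecutive separators.

```python
import os
import re


def output_name_for(pdf_path: str, strategy: str = "strategy1", suffix: str = "llamaparse") -> str:
    """Build a lowercase, underscore-only markdown filename for a PDF path.

    Hypothetical helper for illustration; the names and defaults are assumptions.
    """
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    # Collapse every run of non-alphanumeric characters into a single underscore.
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", stem).strip("_").lower()
    return f"{strategy}_{slug}_{suffix}.md"
```

It could replace the f-string in the list comprehension, for example `os.path.join(output_dir, output_name_for(paper))`.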

In [24]:
from llama_parse import ResultType

def process_paper_with_llamaparse(input_file, output_file, custom_prompt=None):
    """Process a research paper using LlamaParse"""
    
    prompt = custom_prompt or parsing_instruction_research
    
    # ⚡ OPTIMIZATION AREA 2: LlamaParse Parameters
    # Experiment with different result_type and system_prompt combinations
    parser = LlamaParse(
        api_key=str(api_key),
        result_type=ResultType.MD,  # Markdown output; ResultType.TXT returns plain text instead
        system_prompt=prompt,  # 🔧 TRY: Different prompt variations
        verbose=True,
        # 🔧 TRY: Add other parameters like language="en", num_workers=4, etc.
        # language="en",
        # num_workers=4,
        # split_by_page=False,
    )
    
    print(f"Parsing research paper: {input_file}")
    parsed_document = parser.load_data(input_file)
    
    # ⚡ OPTIMIZATION AREA 3: Post-processing Enhancement
    # Add custom logic to improve parsing results
    with open(output_file, 'w', encoding='utf-8') as f:
        for doc in parsed_document:
            # 🔧 TRY: Add filtering, formatting, or enhancement logic here
            content = doc.text
            
            # Example post-processing options:
            # content = clean_academic_text(content)
            # content = enhance_section_headers(content)
            # content = fix_citations(content)
            
            f.write(content + '\n')
    
    print(f"Parsed content saved to: {output_file}")
    return parsed_document
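
The commented-out `clean_academic_text` call above is a placeholder with no definition in the notebook. One possible shape for it is sketched below, under the assumption that the main defects are words hyphenated across line breaks, trailing whitespace, and runs of blank lines. This is a hypothetical helper, not something LlamaParse provides, and the de-hyphenation step will also join genuine compounds that happen to break at a hyphen (for example `well-` / `known`).

```python
import re


def clean_academic_text(content: str) -> str:
    """Hypothetical cleanup pass for parsed markdown (not part of LlamaParse)."""
    # Re-join words hyphenated across a line break: "implemen-\ntation" -> "implementation".
    content = re.sub(r"(\w)-\n(\w)", r"\1\2", content)
    # Strip trailing spaces and tabs at the end of every line.
    content = re.sub(r"[ \t]+$", "", content, flags=re.MULTILINE)
    # Collapse three or more consecutive newlines into a single blank line.
    content = re.sub(r"\n{3,}", "\n\n", content)
    # Normalize to exactly one trailing newline.
    return content.strip() + "\n"
```

With this defined, the line `content = clean_academic_text(content)` inside the write loop can be uncommented.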

In [25]:
# Process all research papers
for paper_path, output_file in paper_outputs:
    print(f"\n---\nProcessing: {os.path.basename(paper_path)}")
    try:
        parsed_doc = process_paper_with_llamaparse(paper_path, output_file)
    except Exception as e:
        print(f"Error processing {paper_path}: {e}")


---
Processing: Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf
Parsing research paper: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf
Started parsing the file under job_id ad75dfd0-1ed2-42b1-af15-5510a403515c
Parsed content saved to: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/input_papers/strategy1_examining_the_awareness_of_mobile_money_users_on_s_llamaparse.md

---
Processing: Mobile-Based_Deep_Learning_Models_for_Banana_Disease.pdf
Parsing research paper: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/Mobile-Based_Deep_Learning_Models_for_Banana_Disease.pdf
Parsed content saved to: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/input_papers/strategy1_examining_the_awareness_of_mobile_money_users_on_s_llamaparse.md

---
Processing: Mobile-Based_Deep_Learning_Models_for_B
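
The processing loop above logs an error and moves on, so a transient failure costs a whole paper. As one take on the "error handling improvements" item, here is a minimal retry sketch with exponential backoff. The helper `with_retries` and its parameters are assumptions for illustration. It retries on any exception, which also repeats non-transient failures such as an invalid API key, so narrowing the caught exception types is advisable once the failure modes are known.

```python
import time


def with_retries(func, *args, attempts=3, base_delay=2.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on any exception with exponential backoff.

    Hypothetical helper; `sleep` is injectable so the backoff can be tested without waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Delay doubles after each failure: base_delay, 2x, 4x, ...
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

Inside the loop, the call would become `with_retries(process_paper_with_llamaparse, paper_path, output_file)`, with the existing `try`/`except` kept around it to report papers that still fail.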

In [26]:
# ⚡ OPTIMIZATION AREA 4: Quality Assessment and Comparison
def analyze_parsing_results(output_file, paper_name):
    """Analyze the quality of parsing results"""
    
    with open(output_file, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Basic metrics
    total_length = len(content)
    line_count = len(content.split('\n'))
    
    # Academic structure detection: counts keyword mentions anywhere in the text,
    # not section headings, so totals run well above the number of actual sections
    lowered = content.lower()
    sections = {
        'abstract': lowered.count('abstract'),
        'introduction': lowered.count('introduction'),
        'methodology': lowered.count('methodology') + lowered.count('methods'),
        'results': lowered.count('results'),
        'discussion': lowered.count('discussion'),
        'conclusion': lowered.count('conclusion'),
        'references': lowered.count('references') + lowered.count('bibliography')
    }
    
    # Figure and table detection (keyword mentions, not distinct figures or tables)
    figures = lowered.count('figure')
    tables = lowered.count('table')
    
    # Citations (rough upper bound: counts every opening parenthesis or bracket)
    citations = content.count('(') + content.count('[')
    
    print(f"\n📊 ANALYSIS RESULTS FOR {paper_name}:")
    print(f"Total content length: {total_length:,} characters")
    print(f"Total lines: {line_count:,}")
    print(f"Academic sections found: {sections}")
    print(f"Figures mentioned: {figures}")
    print(f"Tables mentioned: {tables}")
    print(f"Potential citations: {citations}")
    
    return {
        'length': total_length,
        'lines': line_count,
        'sections': sections,
        'figures': figures,
        'tables': tables,
        'citations': citations
    }

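# Analyze one paper.
# NOTE: `output_file` is the variable left over from the processing loop above,
# i.e. the last paper in sorted order; confirm it matches the paper name passed here.
results = analyze_parsing_results(output_file, "Mobile-Based Deep Learning Models for Banana Disease")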


📊 ANALYSIS RESULTS FOR Mobile-Based Deep Learning Models for Banana Disease:
Total content length: 543,588 characters
Total lines: 8,225
Academic sections found: {'abstract': 88, 'introduction': 95, 'methodology': 126, 'results': 133, 'discussion': 83, 'conclusion': 95, 'references': 104}
Figures mentioned: 294
Tables mentioned: 256
Potential citations: 2455
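
The section counts above come from keyword mentions in the body text, which is why a single word such as "results" is counted 133 times. A hedged alternative is to inspect only markdown heading lines. The sketch below assumes the parsed output marks sections with `#`-style headings, which depends on the parser and prompt; `SECTION_KEYWORDS` and `find_section_headings` are hypothetical names, not part of the notebook.

```python
import re

# Keywords that identify each academic section (assumed list, mirroring the analysis above).
SECTION_KEYWORDS = {
    "abstract": ("abstract",),
    "introduction": ("introduction",),
    "methodology": ("methodology", "methods"),
    "results": ("results",),
    "discussion": ("discussion",),
    "conclusion": ("conclusion",),
    "references": ("references", "bibliography"),
}


def find_section_headings(markdown: str) -> dict:
    """Map each section name to the markdown heading titles that mention it."""
    found = {name: [] for name in SECTION_KEYWORDS}
    for line in markdown.splitlines():
        # Only lines that start with 1-6 '#' characters followed by whitespace are headings.
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if not match:
            continue
        title = match.group(1).strip()
        lowered = title.lower()
        for name, keywords in SECTION_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                found[name].append(title)
    return found
```

A section whose list comes back empty is a candidate for manual review, since either the paper lacks that section or the parser did not emit it as a heading.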


In [27]:
# 📊 STRATEGY 1 RESULTS SUMMARY
print("🏆 STRATEGY 1: LLAMAPARSE DIRECT RESULTS")
print("=" * 50)

print("\nPaper Results:")
print(f"  Content Length: {results['length']:,} chars")
print(f"  Academic Sections: {sum(results['sections'].values())}")
print(f"  Figures/Tables: {results['figures'] + results['tables']}")

print("\n🎯 Next Steps:")
print("1. Review the generated markdown file")
print("2. Implement your optimizations above")
print("3. Compare results with other strategies")
print("4. Document your improvements")

🏆 STRATEGY 1: LLAMAPARSE DIRECT RESULTS

Paper Results:
  Content Length: 543,588 chars
  Academic Sections: 724
  Figures/Tables: 550

🎯 Next Steps:
1. Review the generated markdown file
2. Implement your optimizations above
3. Compare results with other strategies
4. Document your improvements
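
For the "compare results" step, the summary above covers a single paper. A small sketch for lining up several runs side by side follows, assuming each entry is a metrics dictionary shaped like the return value of `analyze_parsing_results`. The function name `format_comparison` and the column layout are illustrative choices, not part of the notebook.

```python
def format_comparison(results_by_name: dict) -> str:
    """Render per-run metrics (shaped like analyze_parsing_results output) as a text table."""
    header = f"{'Run':<30} {'Chars':>10} {'Lines':>8} {'Figures':>8} {'Tables':>8}"
    rows = [header, "-" * len(header)]
    for name, metrics in results_by_name.items():
        # Truncate long run names so the columns stay aligned.
        rows.append(
            f"{name[:30]:<30} {metrics['length']:>10,} {metrics['lines']:>8,} "
            f"{metrics['figures']:>8} {metrics['tables']:>8}"
        )
    return "\n".join(rows)
```

For example, `print(format_comparison({"strategy1": results}))` prints one row for the run analyzed above; further strategies or papers can be added as extra keys.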
