<a href="https://colab.research.google.com/github/Fatima1510/twiga-indabax-challenge/blob/main/strategy1_llamaparse_direct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎯 Strategy 1: Direct LlamaParse Approach

**Philosophy**: Use LlamaParse's built-in parsing model to convert PDFs directly into structured Markdown, with no intermediate extraction step.

## Optimization Areas:
- System prompt engineering for academic papers
- Result type selection (text vs markdown)
- API parameter tuning
- Post-processing logic
- Error handling improvements

## Available Papers:
- `Mobile-Based_Deep_Learning_Models_for_Banana_Disease.pdf`
- `Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf`
- `Practical Machine Learning_25_05_04_14_32_34.pdf`

In [1]:
# Install required packages
!pip3 install llama_parse python-dotenv

Collecting llama_parse
  Downloading llama_parse-0.6.54-py3-none-any.whl.metadata (6.6 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Collecting llama-cloud-services>=0.6.54 (from llama_parse)
  Downloading llama_cloud_services-0.6.54-py3-none-any.whl.metadata (3.6 kB)
Collecting llama-cloud==0.1.35 (from llama-cloud-services>=0.6.54->llama_parse)
  Downloading llama_cloud-0.1.35-py3-none-any.whl.metadata (1.2 kB)
Collecting llama-index-core>=0.12.0 (from llama-cloud-services>=0.6.54->llama_parse)
  Downloading llama_index_core-0.13.0-py3-none-any.whl.metadata (2.5 kB)
Collecting aiosqlite (from llama-index-core>=0.12.0->llama-cloud-services>=0.6.54->llama_parse)
  Downloading aiosqlite-0.21.0-py3-none-any.whl.metadata (4.3 kB)
Collecting banks<3,>=2.2.0 (from llama-index-core>=0.12.0->llama-cloud-services>=0.6.54->llama_parse)
  Downloading banks-2.2.0-py3-none-any.whl.metadata (12 kB)
Collecting dataclasses-json (from llama-index-core

In [4]:
from llama_parse import LlamaParse
import os
from dotenv import load_dotenv

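# Create a .env file with your LlamaParse API key.
# Replace the placeholder with your own key; never commit a real key to the repository.
with open(".env", "w") as f:
    f.write("LLAMAPARSE_API_KEY=your_api_key_here\n")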

load_dotenv()

True

In [5]:
# API Key from environment (LlamaParse)
api_key = os.getenv("LLAMAPARSE_API_KEY")
if not api_key:
    raise ValueError("LLAMAPARSE_API_KEY not found in environment. Please set it in your .env file.")
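
Writing the key into `.env` from inside the notebook leaves it in the saved file. As an alternative, the sketch below reads the key from the environment and only prompts when it is missing. The helper name `load_api_key` and its `ask` parameter are illustrative assumptions, not part of LlamaParse or python-dotenv.

```python
import os
from getpass import getpass


def load_api_key(env_var="LLAMAPARSE_API_KEY", ask=getpass):
    """Return the API key from the environment, prompting only if it is absent.

    `ask` is injectable so the function can be exercised without a terminal.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so the key never appears in cell output.
        key = ask(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty. Set it in your environment or .env file.")
    # Cache the key for the rest of the session.
    os.environ[env_var] = key
    return key
```

Calling `api_key = load_api_key()` would then replace the `os.getenv` check above.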

In [None]:
# ⚡ OPTIMIZATION AREA 1: Custom System Prompt for Academic Papers
# Try different prompts that emphasize specific academic elements
parsing_instruction_research = """
The provided document is a research paper. I want you to parse it systematically while preserving the academic structure and important content.

IMPORTANT PARSING GUIDELINES:
1. Preserve academic sections: abstract, introduction, methodology, results, discussion, conclusion, references
2. Maintain figure captions and table data as separate sections
3. Extract section numbers (e.g., "2.1", "3.2") when available
4. Preserve mathematical formulas, equations, and citations
5. Filter out headers/footers but keep page numbers
6. Maintain academic formatting and language
7. Break content at logical academic boundaries (sections, subsections, paragraphs)

The output should maintain the scholarly structure of the original document.
"""

# 🔧 TRY: Alternative prompt variations:

# Option A: More specific academic focus
# parsing_instruction_v2 = """
# You are an expert academic document parser specializing in research papers.
# Focus on:
# 1. Precise section identification (abstract, intro, methods, results, discussion, conclusion)
# 2. Mathematical equation preservation
# 3. Citation and reference handling
# 4. Figure/table metadata extraction
# Break into granular chunks maintaining academic integrity.
# """

# Option B: Structure-first approach
# parsing_instruction_v3 = """
# Parse this research paper with emphasis on structural hierarchy:
# 1. Identify main sections and subsections
# 2. Preserve numbering systems (1.1, 1.2, etc.)
# 3. Separate visual elements (figures, tables, equations)
# 4. Maintain citation context and reference links
# Create logical, hierarchical chunks that preserve academic flow.
# """
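
Rather than commenting prompt variants in and out, one option is to keep them in a dictionary and select one by name. This is a minimal sketch: the dictionary name, its keys, and `select_prompt` are hypothetical; the prompt texts are the two variants from the cell above.

```python
# Hypothetical registry of prompt variants; keys are illustrative labels.
PROMPT_VARIANTS = {
    "academic_focus": """
You are an expert academic document parser specializing in research papers.
Focus on:
1. Precise section identification (abstract, intro, methods, results, discussion, conclusion)
2. Mathematical equation preservation
3. Citation and reference handling
4. Figure/table metadata extraction
Break into granular chunks maintaining academic integrity.
""",
    "structure_first": """
Parse this research paper with emphasis on structural hierarchy:
1. Identify main sections and subsections
2. Preserve numbering systems (1.1, 1.2, etc.)
3. Separate visual elements (figures, tables, equations)
4. Maintain citation context and reference links
Create logical, hierarchical chunks that preserve academic flow.
""",
}


def select_prompt(name, default):
    """Return the named prompt variant, or `default` if the name is unknown."""
    return PROMPT_VARIANTS.get(name, default)
```

For example, `select_prompt("structure_first", parsing_instruction_research)` returns Option B and falls back to the baseline prompt for any unknown name.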

In [8]:
import glob
from google.colab import files

# Uploading PDFs
print("Please upload your PDFs...")
uploaded = files.upload()

# Saving PDFs into paper directory
papers_dir = "/content/papers"
os.makedirs(papers_dir, exist_ok=True)

for filename in uploaded.keys():
    os.rename(filename, os.path.join(papers_dir, filename))

# Find all uploaded PDFs
paper_files = sorted(glob.glob(os.path.join(papers_dir, "*.pdf")))

# Set up output directory for parsed Markdown
output_dir = "/content/input_papers"
os.makedirs(output_dir, exist_ok=True)

# Create mapping of PDF → output Markdown
paper_outputs = [
    (
        paper,
        os.path.join(
            output_dir,
            f"strategy1_{os.path.splitext(os.path.basename(paper))[0].replace(' ', '_').replace('-', '_').lower()}_llamaparse.md"
        )
    )
    for paper in paper_files
]

print("PDF → Output Markdown Mapping:")
for pdf, md in paper_outputs:
    print(f"{pdf}  -->  {md}")

Please upload your PDFs...


Saving Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf to Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf
Saving AIandMLPaper.pdf to AIandMLPaper.pdf
PDF → Output Markdown Mapping:
/content/papers/AIandMLPaper.pdf  -->  /content/input_papers/strategy1_aiandmlpaper_llamaparse.md
/content/papers/Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf  -->  /content/input_papers/strategy1_examining_the_awareness_of_mobile_money_users_on_s_llamaparse.md
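
The output-name expression in the list comprehension is dense, so it may be easier to test as a small function. The sketch below reproduces the same naming rule; the function name `output_markdown_path` is an assumption, and `PurePosixPath` is used because Colab paths are POSIX-style.

```python
from pathlib import PurePosixPath


def output_markdown_path(pdf_path, output_dir="/content/input_papers"):
    """Map a PDF path to its Markdown output path using the notebook's naming rule."""
    stem = PurePosixPath(pdf_path).stem  # file name without directory or extension
    slug = stem.replace(" ", "_").replace("-", "_").lower()
    return str(PurePosixPath(output_dir) / f"strategy1_{slug}_llamaparse.md")
```

With this helper, the mapping could be written as `[(paper, output_markdown_path(paper)) for paper in paper_files]`.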


In [9]:
from llama_parse import ResultType

def process_paper_with_llamaparse(input_file, output_file, custom_prompt=None):
    """Process a research paper using LlamaParse and save to Markdown."""

    prompt = custom_prompt or parsing_instruction_research

    # ⚡ OPTIMIZATION AREA 2: LlamaParse Parameters
    # Experiment with different result_type, system_prompt combinations
    parser = LlamaParse(
        api_key=str(api_key),
        result_type=ResultType.MD,  # Markdown output; 🔧 TRY: ResultType.TXT for plain text
        system_prompt=prompt,  # 🔧 TRY: Different prompt variations
        verbose=True,
        # 🔧 TRY: Add other parameters like language="en", num_workers=4, etc.
        # language="en",
        # num_workers=4,
        # split_by_page=False,
    )

    print(f"Parsing research paper: {input_file}")
    parsed_document = parser.load_data(input_file)

    # ⚡ OPTIMIZATION AREA 3: Post-processing Enhancement
    # Add custom logic to improve parsing results
    with open(output_file, 'w', encoding='utf-8') as f:
        for doc in parsed_document:
            # 🔧 TRY: Add filtering, formatting, or enhancement logic here
            content = doc.text

            # Example post-processing options:
            # content = clean_academic_text(content)
            # content = enhance_section_headers(content)
            # content = fix_citations(content)

            f.write(content + '\n')

    print(f"Parsed content saved to: {output_file}")
    return parsed_document
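
The post-processing hook above names `clean_academic_text` without defining it. A minimal sketch of what such a function might do is shown below. It is an assumption about a reasonable first pass, and it only normalises whitespace so that no wording from the paper is altered.

```python
import re


def clean_academic_text(content):
    """Hypothetical cleanup pass: normalise whitespace, leave the text itself untouched."""
    # Remove trailing spaces and tabs from every line.
    lines = [line.rstrip() for line in content.splitlines()]
    text = "\n".join(lines)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # End the document with exactly one newline.
    return text.strip() + "\n"
```

Uncommenting `content = clean_academic_text(content)` in the loop would apply it to each parsed document before writing.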

In [10]:
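# Process all research papers with the default prompt.
# The third argument is custom_prompt, so the API key must not be passed here;
# the parser reads api_key inside the function.
for pdf_path, md_path in paper_outputs:
    process_paper_with_llamaparse(pdf_path, md_path)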

Parsing research paper: /content/papers/AIandMLPaper.pdf
Started parsing the file under job_id 0b901100-9dcd-4381-ab79-361d420cbd00
Parsed content saved to: /content/input_papers/strategy1_aiandmlpaper_llamaparse.md
Parsing research paper: /content/papers/Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf
Started parsing the file under job_id 9c110601-fb11-49f8-8b29-74e6cbbc5e4c
Parsed content saved to: /content/input_papers/strategy1_examining_the_awareness_of_mobile_money_users_on_s_llamaparse.md
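
Error handling is listed as an optimization area, but the loop above stops at the first failed parse. The sketch below is one possible retry wrapper using only the standard library. The name `call_with_retries` and its parameters are assumptions, and it catches `Exception` broadly because the exception types the parser may raise are not assumed here.

```python
import time


def call_with_retries(func, *args, attempts=3, delay_seconds=2.0, **kwargs):
    """Call func(*args, **kwargs), retrying with linear backoff on any exception."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as error:  # broad on purpose: parser exception types are not assumed
            last_error = error
            print(f"Attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                # Wait longer after each failure: delay, 2 * delay, ...
                time.sleep(delay_seconds * attempt)
    raise RuntimeError(f"All {attempts} attempts failed") from last_error
```

In the processing loop this could be used as `call_with_retries(process_paper_with_llamaparse, pdf_path, md_path)`, optionally inside a `try`/`except RuntimeError` so that one failing paper does not block the rest.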


In [12]:
# ⚡ OPTIMIZATION AREA 4: Quality Assessment and Comparison
def analyze_parsing_results(output_file, paper_name):
    """Analyze the quality of parsing results"""

    with open(output_file, 'r', encoding='utf-8') as f:
        content = f.read()

    # Basic metrics
    total_length = len(content)
    line_count = len(content.split('\n'))

    # Academic structure detection (keyword frequency: counts every occurrence
    # of each word, in body text as well as in headings)
    sections = {
        'abstract': content.lower().count('abstract'),
        'introduction': content.lower().count('introduction'),
        'methodology': content.lower().count('methodology') + content.lower().count('methods'),
        'results': content.lower().count('results'),
        'discussion': content.lower().count('discussion'),
        'conclusion': content.lower().count('conclusion'),
        'references': content.lower().count('references') + content.lower().count('bibliography')
    }

    # Figure and table detection
    figures = content.lower().count('figure')
    tables = content.lower().count('table')

    # Citations (rough upper bound: counts every '(' and '[', including
    # parenthetical remarks, formulas, and Markdown links)
    citations = content.count('(') + content.count('[')

    print(f"\n📊 ANALYSIS RESULTS FOR {paper_name}:")
    print(f"Total content length: {total_length:,} characters")
    print(f"Total lines: {line_count:,}")
    print(f"Academic sections found: {sections}")
    print(f"Figures mentioned: {figures}")
    print(f"Tables mentioned: {tables}")
    print(f"Potential citations: {citations}")

    return {
        'length': total_length,
        'lines': line_count,
        'sections': sections,
        'figures': figures,
        'tables': tables,
        'citations': citations
    }

# Analyze each paper
all_results = {}
for pdf_path, md_path in paper_outputs:
    paper_name = os.path.splitext(os.path.basename(pdf_path))[0]
    all_results[paper_name] = analyze_parsing_results(md_path, paper_name)


📊 ANALYSIS RESULTS FOR AIandMLPaper:
Total content length: 40,422 characters
Total lines: 322
Academic sections found: {'abstract': 1, 'introduction': 2, 'methodology': 3, 'results': 2, 'discussion': 1, 'conclusion': 1, 'references': 5}
Figures mentioned: 0
Tables mentioned: 18
Potential citations: 59

📊 ANALYSIS RESULTS FOR Examining_the_Awareness_of_Mobile_Money_Users_on_S:
Total content length: 20,772 characters
Total lines: 201
Academic sections found: {'abstract': 1, 'introduction': 1, 'methodology': 2, 'results': 3, 'discussion': 1, 'conclusion': 2, 'references': 5}
Figures mentioned: 0
Tables mentioned: 3
Potential citations: 27
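
Counting every `(` and `[` overestimates citations. A somewhat tighter estimate can be sketched with regular expressions that match only common citation shapes. The two patterns below are assumptions about the citation styles in these papers (numeric brackets and author-year parentheses) and will miss other styles.

```python
import re

# Numeric citations such as [3], [1, 2] or [4-6].
NUMERIC_CITATION = re.compile(r"\[\d+(?:\s*[-,–]\s*\d+)*\]")
# Author-year citations such as (Smith, 2020) or (Smith et al., 2020).
AUTHOR_YEAR_CITATION = re.compile(r"\([A-Z][^()\d]*,?\s(?:19|20)\d{2}[a-z]?\)")


def count_citations(content):
    """Count substrings that look like numeric or author-year citations."""
    numeric = len(NUMERIC_CITATION.findall(content))
    author_year = len(AUTHOR_YEAR_CITATION.findall(content))
    return numeric + author_year
```

Parenthetical remarks such as "(see Figure 1)" and Markdown links such as `[text](url)` are not counted by either pattern.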


In [17]:
# 📊 STRATEGY 1 RESULTS SUMMARY
print("🏆 STRATEGY 1: LLAMAPARSE DIRECT RESULTS")
print("=" * 50)

for pdf_path, md_path in paper_outputs:
    # Load parsed content for summary
    with open(md_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Simple metrics (you can add more advanced parsing if needed)
    content_length = len(content)
    academic_sections = content.count("#")  # counts every '#' character ('##' adds 2), so this is an upper bound on header lines
    figures = content.lower().count("figure")
    tables = content.lower().count("table")

    print(f"\nResults for {os.path.basename(pdf_path)}:")
    print(f"  Content Length: {content_length:,}  chars")
    print(f"  Academic Sections (Markdown headers): {academic_sections}")
    print(f"  Figures/Tables Mentions: {figures + tables}")
    print(f"  Output Markdown: {md_path}")

print("\n🎯 Next Steps:")
print("1. Review the generated markdown file in /content/input_papers/")
print("2. Implement your optimizations above")
print("3. Compare results with other strategies")
print("4. Document your improvements")

🏆 STRATEGY 1: LLAMAPARSE DIRECT RESULTS
==================================================

Results for AIandMLPaper.pdf:
  Content Length: 40,422  chars
  Academic Sections (Markdown headers): 43
  Figures/Tables Mentions: 18
  Output Markdown: /content/input_papers/strategy1_aiandmlpaper_llamaparse.md

Results for Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf:
  Content Length: 20,772  chars
  Academic Sections (Markdown headers): 41
  Figures/Tables Mentions: 3
  Output Markdown: /content/input_papers/strategy1_examining_the_awareness_of_mobile_money_users_on_s_llamaparse.md

🎯 Next Steps:
1. Review the generated markdown file in /content/input_papers/
2. Implement your optimizations above
3. Compare results with other strategies
4. Document your improvements
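
The header metric in the summary counts `#` characters, not header lines. A sketch of a per-level header count is given below. The function name is an assumption, and the pattern does not skip `#` comment lines inside fenced code blocks, so it can still overcount on papers that include code listings.

```python
import re
from collections import Counter

# A Markdown ATX header: 1-6 '#' at the start of a line, then spaces/tabs, then text.
HEADER_LINE = re.compile(r"^(#{1,6})[ \t]+\S", re.MULTILINE)


def count_markdown_headers(content):
    """Return a Counter mapping header level (1-6) to the number of header lines."""
    return Counter(len(match.group(1)) for match in HEADER_LINE.finditer(content))
```

`sum(count_markdown_headers(content).values())` gives the total number of header lines and could replace `content.count("#")` in the summary loop.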
