# 🎯 Strategy 2: Raw PDF + AI Structuring

**Philosophy**: Extract raw text with pypdf, then use a large language model (here Llama 3.3 70B via Together) to structure and parse it, keeping full control over extraction, prompting, and chunking.

## Optimization Areas:
- Text extraction quality improvement
- AI model selection
- Prompt engineering for academic structure
- Chunking strategy optimization
- Page range selection logic

## Available Papers:
- `30YearsResearchGate.pdf`
- `SchenkBekkerSchmitt2025PrecRes.pdf`

In [1]:
# Install required packages
!pip3 install together pypdf pydantic

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.


In [2]:
from pypdf import PdfReader
from together import Together
from pydantic import BaseModel, Field
from typing import Optional
import json
import re
import os



In [None]:
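# API Keys: prefer the TOGETHER_API_KEY environment variable over a hard-coded value
tog_api = os.environ.get("TOGETHER_API_KEY", "xxx")
together = Together(api_key=tog_api)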

In [4]:
# ⚡ OPTIMIZATION AREA: Data Model Enhancement
class ResearchChunk(BaseModel):
    section_type: str = Field(description="Type of section: abstract, introduction, methodology, results, discussion, conclusion, references, etc.")
    section_number: Optional[str] = Field(default=None, description="Section number if available (e.g., '2.1', '3.2')")
    page_number: int = Field(description="Page number of the chunk")
    content: str = Field(description="Parsed content of the chunk")
    is_figure_caption: bool = Field(default=False, description="Whether this chunk is a figure caption")
    is_table: bool = Field(default=False, description="Whether this chunk contains table data")
    
    # 🔧 TRY: Add these optional fields for enhanced academic parsing:
    confidence_score: Optional[float] = Field(default=None, description="AI confidence in parsing accuracy")
    has_citations: bool = Field(default=False, description="Whether chunk contains citations")
    has_equations: bool = Field(default=False, description="Whether chunk contains mathematical equations")
    keywords: Optional[list[str]] = Field(default=None, description="Key academic terms found in chunk")
    subsection_title: Optional[str] = Field(default=None, description="Subsection title if identifiable")

class ResearchPaper(BaseModel):
    title: Optional[str] = Field(default=None, description="Title of the research paper if identifiable")
    authors: Optional[str] = Field(default=None, description="Authors of the paper if identifiable")
    chunks: list[ResearchChunk] = Field(description="List of chunks that build the research paper")
    
    # 🔧 TRY: Add these optional fields for enhanced paper metadata:
    abstract: Optional[str] = Field(default=None, description="Paper abstract if identifiable")
    publication_year: Optional[int] = Field(default=None, description="Publication year if found")
    journal: Optional[str] = Field(default=None, description="Journal or venue if identifiable")
    doi: Optional[str] = Field(default=None, description="DOI if found in paper")
    total_pages: Optional[int] = Field(default=None, description="Total number of pages processed")
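
Because `section_type` is a free-form `str`, the model can return variants of the same label (for example `figure caption` and `figure_caption`). A minimal sketch of a normalizer follows; the function name and the alias table are illustrative assumptions, not part of the notebook, and labels outside the alias table are kept in snake_case rather than discarded.

```python
import re

# Hypothetical alias table: maps common variants onto the labels used in the prompt.
SECTION_ALIASES = {
    "intro": "introduction",
    "method": "methodology",
    "methods": "methodology",
    "conclusions": "conclusion",
    "bibliography": "references",
}

def normalize_section_type(label: str) -> str:
    """Lower-case a section label, convert spaces/hyphens to underscores, apply aliases."""
    key = re.sub(r"[\s\-]+", "_", label.strip().lower())
    return SECTION_ALIASES.get(key, key)

print(normalize_section_type("Figure Caption"))
print(normalize_section_type(" Methods "))
```

Applying this to each chunk before counting section types would keep variants of one label from being reported as separate sections.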

In [5]:
# Research paper file paths
research_paper_1 = "/Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/30YearsResearchGate.pdf"
research_paper_2 = "/Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/SchenkBekkerSchmitt2025PrecRes.pdf"

# Output directory
output_dir = "/Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/input_papers/"
os.makedirs(output_dir, exist_ok=True)

print(f"Paper 1: {research_paper_1}")
print(f"Paper 2: {research_paper_2}")

Paper 1: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/30YearsResearchGate.pdf
Paper 2: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/SchenkBekkerSchmitt2025PrecRes.pdf


In [6]:
# ⚡ OPTIMIZATION AREA 1: Text Extraction Enhancement
def get_research_paper_text(pdf_path: str, start_page: int = 1, end_page: Optional[int] = None):
    """Extract text from research paper with page range"""
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)
    
    if end_page is None:
        end_page = total_pages
    
    print(f"Total pages in PDF: {total_pages}")
    print(f"Extracting pages {start_page} to {min(end_page, total_pages)}")
    
    text = ""
    for page_num in range(start_page, min(end_page + 1, total_pages + 1)):
        page_text = reader.pages[page_num-1].extract_text()
        
        # 🔧 TRY: Add text cleaning, normalization, or enhancement here
        # Examples:
        # page_text = page_text.replace('\n\n', '\n')  # Normalize line breaks
        # page_text = re.sub(r'\s+', ' ', page_text)   # Clean whitespace
        # page_text = clean_academic_text(page_text)   # Custom cleaning function
        
        text += f"\n--- PAGE {page_num} ---\n" + page_text
        
    return text

def analyze_paper_characteristics(pdf_path: str):
    """Analyze paper to optimize processing strategy"""
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)
    
    # Sample first page to analyze content
    first_page = reader.pages[0].extract_text()
    
    characteristics = {
        'total_pages': total_pages,
        'has_abstract': 'abstract' in first_page.lower(),
        'has_figures': 'figure' in first_page.lower(),
        'has_tables': 'table' in first_page.lower(),
        'complexity': 'high' if total_pages > 15 else 'medium' if total_pages > 8 else 'low'
    }
    
    return characteristics

# Analyze both papers
paper1_analysis = analyze_paper_characteristics(research_paper_1)
paper2_analysis = analyze_paper_characteristics(research_paper_2)
print(f"Paper 1 characteristics: {paper1_analysis}")
print(f"Paper 2 characteristics: {paper2_analysis}")

Paper 1 characteristics: {'total_pages': 21, 'has_abstract': False, 'has_figures': False, 'has_tables': False, 'complexity': 'high'}
Paper 2 characteristics: {'total_pages': 15, 'has_abstract': False, 'has_figures': False, 'has_tables': False, 'complexity': 'medium'}
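
The extraction cell mentions a custom `clean_academic_text` function without defining it. One possible version is sketched below, assuming the main problems are words hyphenated across line breaks and hard-wrapped lines. It is a heuristic: a genuine compound split at a line end (such as `well-` / `known`) would be merged incorrectly.

```python
import re

def clean_academic_text(page_text: str) -> str:
    """Heuristic cleanup for one page of extracted text."""
    # Rejoin words hyphenated across a line break: "dis-\ntance" -> "distance"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", page_text)
    # Treat blank lines as paragraph breaks and keep them
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse line wraps and repeated spaces to single spaces
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)

sample = "Open and dis-\ntance learning\nis growing.\n\n\nSecond   para."
print(clean_academic_text(sample))
```

Unlike the blanket `re.sub(r'\s+', ' ', page_text)` suggested in the comments, this keeps paragraph boundaries, which the parsing prompt relies on for chunking.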


In [7]:
# ⚡ OPTIMIZATION AREA 2: Advanced Prompt Engineering
research_parsing_instruction = """
The provided document is a research paper. I want you to parse it systematically while preserving the academic structure and important content.

IMPORTANT PARSING GUIDELINES:
1. Break into MULTIPLE chunks at logical boundaries (paragraphs, sections, subsections)
2. Identify section types: abstract, introduction, methodology, results, discussion, conclusion, references
3. Preserve figure captions and table data as separate chunks
4. Extract section numbers (e.g., "2.1", "3.2") when available
5. Maintain academic formatting and citations
6. Filter out headers/footers but keep page numbers
7. Preserve mathematical formulas and equations

For each chunk:
- Identify the section type (abstract, introduction, etc.)
- Extract section numbers if available
- Identify the page number
- Mark if it's a figure caption or table
- Extract the content preserving academic language

You MUST create multiple chunks - do not combine different sections or paragraphs into one chunk.
"""

# 🔧 TRY: Alternative prompt variations:

# Option A: More specific academic focus
research_parsing_instruction_v2 = """
You are an expert academic document parser specializing in research papers.
Focus on:
1. Precise section identification (abstract, intro, methods, results, discussion, conclusion)
2. Mathematical equation preservation
3. Citation and reference handling
4. Figure/table metadata extraction
Break into granular chunks maintaining academic integrity.
"""

# Option B: Quality-focused approach
research_parsing_instruction_v3 = """
Extract maximum value from this research paper by:
1. Identifying key insights and findings
2. Preserving methodological details
3. Maintaining result data and analysis
4. Capturing conclusions and implications
Focus on content quality and academic completeness.
"""

In [8]:
def robust_json_parse(ai_response, schema_class):
    """
    Robust JSON parsing with multiple fallback strategies
    """
    # Check if response is valid
    if not ai_response or not ai_response.choices or len(ai_response.choices) == 0:
        raise Exception("Invalid AI response: no choices available")
    
    raw_content = ai_response.choices[0].message.content
    
    # Check if content is valid
    if not raw_content:
        raise Exception("Invalid AI response: no content available")
    
    # Strategy 1: Direct parsing
    try:
        return schema_class.model_validate_json(raw_content)
    except Exception as e1:
        print(f"Strategy 1 failed: {e1}")
    
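    # Strategy 2: Clean common JSON issues
    try:
        # Keep only the outermost JSON object (drops code fences or text around it)
        start, end = raw_content.find('{'), raw_content.rfind('}')
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in response")
        cleaned = raw_content[start:end + 1]
        
        # Remove trailing commas before a closing brace or bracket
        cleaned = re.sub(r',\s*([}\]])', r'\1', cleaned)
        
        # strict=False accepts raw newlines and tabs inside JSON strings
        parsed = json.loads(cleaned, strict=False)
        return schema_class.model_validate(parsed)
    except Exception as e2:
        print(f"Strategy 2 failed: {e2}")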
    
    # Strategy 3: Re-request with stricter format
    try:
        print("🔧 Requesting AI to reformat response...")
        reformat_response = together.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": """Fix this JSON by:
1. Properly escaping all quotes in content strings
2. Removing any trailing commas
3. Ensuring valid JSON format
4. Keeping all original data intact
Return only the corrected JSON.""",
                },
                {
                    "role": "user",
                    "content": f"Fix this JSON response:\n{raw_content[:8000]}",  # Limit length
                },
            ],
            model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
            temperature=0.0,
            stream=False,
        )
        
        # Check if reformat response is valid
        if not reformat_response or not reformat_response.choices or len(reformat_response.choices) == 0:
            raise Exception("Invalid reformat response: no choices available")
        
        reformat_content = reformat_response.choices[0].message.content
        if not reformat_content:
            raise Exception("Invalid reformat response: no content available")
        
        return schema_class.model_validate_json(reformat_content)
    except Exception as e3:
        print(f"Strategy 3 failed: {e3}")
    
    # Strategy 4: Manual fallback
    print("🔧 Using manual fallback parsing...")
    if schema_class == ResearchPaper:
        return ResearchPaper(
            title="Research Paper (Manual Fallback)",
            authors="Unknown (parsing error)",
            chunks=[
                ResearchChunk(
                    section_type="other",
                    page_number=1,
                    content="Failed to parse content due to JSON formatting issues. Please try again with different parameters.",
                    is_figure_caption=False,
                    is_table=False
                )
            ]
        )
    
    raise Exception("All parsing strategies failed")
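
When a response is cut off by an output-token limit, the JSON usually ends in the middle of a chunk object, and every complete chunk before that point is still usable. The sketch below recovers those objects with `json.JSONDecoder.raw_decode`. The function name `salvage_chunks` is hypothetical, and the code assumes the first occurrence of `"chunks"` in the string is the array key.

```python
import json

def salvage_chunks(raw: str) -> list[dict]:
    """Return every complete object in the "chunks" array of a possibly truncated JSON string."""
    key_pos = raw.find('"chunks"')
    if key_pos == -1:
        return []
    pos = raw.find('[', key_pos)
    if pos == -1:
        return []
    pos += 1  # step past the opening bracket
    decoder = json.JSONDecoder()
    chunks = []
    while pos < len(raw):
        # Skip whitespace and the commas between array items
        while pos < len(raw) and raw[pos] in ' \t\r\n,':
            pos += 1
        if pos >= len(raw) or raw[pos] != '{':
            break
        try:
            obj, pos = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            break  # the trailing object is incomplete
        chunks.append(obj)
    return chunks

truncated = '{"title": "T", "chunks": [{"content": "a", "page_number": 1}, {"content": "b", "page_number": 2}, {"content": "c", "pa'
print(len(salvage_chunks(truncated)))
```

The recovered dictionaries could then be validated one at a time with `ResearchChunk.model_validate`, so a single bad chunk does not force the placeholder fallback.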

In [9]:
def process_research_paper(pdf_path: str, output_filename: str, start_page: int = 1, end_page: Optional[int] = None, custom_prompt: Optional[str] = None):
    """Process a research paper and save chunks"""
    print(f"Processing: {pdf_path}")
    
    # 🔧 TRY: Add preprocessing steps here
    # Examples:
    # - Validate PDF integrity
    # - Analyze paper characteristics
    # - Choose optimal extraction method
    
    # Extract text
    text = get_research_paper_text(pdf_path, start_page, end_page)
    
    # 🔧 TRY: Add text cleaning/enhancement before AI processing
    # text = preprocess_academic_text(text)
    
    # Use custom prompt or default
    prompt = custom_prompt or research_parsing_instruction
    
    # ⚡ OPTIMIZATION AREA 3: AI Model and Parameters
    # Parse with AI
    extract = together.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": f"""{prompt}
                
CRITICAL JSON FORMATTING RULES:
1. Escape all quotes in content with \\"
2. Replace newlines in content with \\n
3. Ensure valid JSON structure
4. No trailing commas
5. Keep content strings under 1000 characters each

REQUIREMENTS:
1. Create AT LEAST 15-30 chunks from the provided text
2. Each paragraph should be its own chunk
3. Each section/subsection should be separate chunks
4. Identify section types: abstract, introduction, methodology, results, discussion, conclusion
5. Extract section numbers (e.g., "2.1", "3.2") when available
6. Mark figure captions and tables separately
7. Preserve academic language and citations

For each chunk:
- section_type: abstract, introduction, methodology, results, discussion, conclusion, references, figure_caption, table, other
- section_number: extract from text (e.g., "2.1", "3.2") or null
- page_number: extract from text or estimate
- content: the actual text content (PROPERLY ESCAPED and UNDER 1000 chars)
- is_figure_caption: true if this is a figure caption
- is_table: true if this contains table data

CRITICAL: Do NOT merge paragraphs or sections. Each distinct paragraph = one chunk.""",
            },
            {
                "role": "user",
                # Only the first 12,000 characters are sent; text beyond that is silently dropped
                "content": f"Break this research paper content into many separate chunks:\n\n{text[:12000]}",
            },
        ],
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # 🔧 TRY: Different models
        response_format={"type": "json_object", "schema": ResearchPaper.model_json_schema()},
        temperature=0.1,  # 🔧 TRY: Experiment with temperature for consistency vs creativity
        max_tokens=4000,  # 🔧 TRY: Adjust based on paper complexity
        stream=False,
    )
    
    # Process results with robust parsing
    try:
        data = robust_json_parse(extract, ResearchPaper)
        print("✅ Successfully parsed research paper data!")
    except Exception as e:
        print(f"❌ Error processing results: {e}")
        # Create fallback data
        data = ResearchPaper(
            title="Research Paper (Parsing Error)",
            authors="Unknown (parsing error)",
            chunks=[]
        )
    
    # 🔧 TRY: Add post-processing validation and enhancement here
    # Examples:
    # - Validate chunk quality
    # - Merge similar chunks
    # - Enhance metadata
    # data = enhance_research_data(data)
    
    # Save to file
    output_path = os.path.join(output_dir, output_filename)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(f"# {data.title or 'Research Paper'}\n\n")
        f.write(f"**Authors:** {data.authors or 'Not identified'}\n\n")
        f.write(f"**Total Chunks:** {len(data.chunks)}\n\n")
        f.write("---\n\n")
        
        for i, chunk in enumerate(data.chunks):
            f.write(f"## Chunk {i+1}\n\n")
            f.write(f"- **Section Type:** {chunk.section_type}\n")
            f.write(f"- **Section Number:** {chunk.section_number or 'N/A'}\n")
            f.write(f"- **Page:** {chunk.page_number}\n")
            f.write(f"- **Figure Caption:** {chunk.is_figure_caption}\n")
            f.write(f"- **Table:** {chunk.is_table}\n\n")
            f.write(f"**Content:**\n{chunk.content}\n\n")
            f.write("---\n\n")
    
    print(f"Saved {len(data.chunks)} chunks to: {output_path}")
    return data
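
`process_research_paper` sends only `text[:12000]` to the model, so for a 12 to 15 page extraction most pages never reach it. One way around this is to split the text on the `--- PAGE n ---` markers written by `get_research_paper_text` and pack whole pages into batches under a character budget, then call the model once per batch. The helper name and the 12,000-character default below are assumptions for illustration.

```python
import re

# Zero-width split point in front of each page marker written by get_research_paper_text
PAGE_MARKER = re.compile(r"(?=\n--- PAGE \d+ ---\n)")

def split_into_page_batches(text: str, max_chars: int = 12000) -> list[str]:
    """Group whole pages into batches that stay under max_chars where possible."""
    pages = [p for p in PAGE_MARKER.split(text) if p.strip()]
    batches, current = [], ""
    for page in pages:
        # Start a new batch when adding this page would exceed the budget
        if current and len(current) + len(page) > max_chars:
            batches.append(current)
            current = ""
        current += page
    if current:
        batches.append(current)
    return batches

demo = "".join(f"\n--- PAGE {n} ---\n" + "x" * 50 for n in range(1, 5))
for batch in split_into_page_batches(demo, max_chars=150):
    print(len(batch))
```

A single page longer than `max_chars` still forms its own oversized batch, so a very dense page would need a further split. The chunks returned for each batch can be concatenated into one `ResearchPaper`.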

In [10]:
# Process first research paper
# ⚡ OPTIMIZATION AREA: Smart Page Range Selection
# Adapt page range based on paper analysis
start_page_1 = 1
end_page_1 = 12 if paper1_analysis['complexity'] == 'high' else paper1_analysis['total_pages']

print(f"Processing Paper 1 with pages {start_page_1}-{end_page_1}")
research_data_1 = process_research_paper(
    research_paper_1, 
    "strategy2_paper1_raw_ai.md", 
    start_page_1, 
    end_page_1
)

Processing Paper 1 with pages 1-12
Processing: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/30YearsResearchGate.pdf
Total pages in PDF: 21
Extracting pages 1 to 12
✅ Successfully parsed research paper data!
Saved 15 chunks to: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/input_papers/strategy2_paper1_raw_ai.md


In [11]:
# Process second research paper
start_page_2 = 1
end_page_2 = 12 if paper2_analysis['complexity'] == 'high' else paper2_analysis['total_pages']

print(f"Processing Paper 2 with pages {start_page_2}-{end_page_2}")
research_data_2 = process_research_paper(
    research_paper_2, 
    "strategy2_paper2_raw_ai.md", 
    start_page_2, 
    end_page_2
)

Processing Paper 2 with pages 1-15
Processing: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/papers/SchenkBekkerSchmitt2025PrecRes.pdf
Total pages in PDF: 15
Extracting pages 1 to 15
✅ Successfully parsed research paper data!
Saved 33 chunks to: /Users/fredygerman/Personal/builds/exp/twiga-challenge-1/data/input_papers/strategy2_paper2_raw_ai.md


In [12]:
# ⚡ OPTIMIZATION AREA 4: Quality Assessment and Analysis
def analyze_strategy2_results(research_data, paper_name):
    """Analyze the quality of Strategy 2 results"""
    
    if not research_data.chunks:
        print(f"❌ No chunks found for {paper_name}")
        return {}
    
    # Quality metrics
    section_types = {}
    figure_count = 0
    table_count = 0
    total_content_length = 0
    citation_count = 0
    
    for chunk in research_data.chunks:
        # Count section types
        section_types[chunk.section_type] = section_types.get(chunk.section_type, 0) + 1
        
        # Count special elements
        if chunk.is_figure_caption:
            figure_count += 1
        if chunk.is_table:
            table_count += 1
        
        # Track content metrics
        total_content_length += len(chunk.content)
        
        # Rough proxy for citations: counts every '(' and '[', so non-citation brackets are included
        citation_count += chunk.content.count('(') + chunk.content.count('[')
    
    avg_chunk_length = total_content_length // len(research_data.chunks) if research_data.chunks else 0
    
    print(f"\n📊 STRATEGY 2 ANALYSIS FOR {paper_name}:")
    print(f"Paper Title: {research_data.title}")
    print(f"Authors: {research_data.authors}")
    print(f"Total chunks: {len(research_data.chunks)}")
    print(f"Section types: {section_types}")
    print(f"Figures detected: {figure_count}")
    print(f"Tables detected: {table_count}")
    print(f"Average chunk length: {avg_chunk_length} characters")
    print(f"Potential citations: {citation_count}")
    
    return {
        'chunks': len(research_data.chunks),
        'sections': section_types,
        'figures': figure_count,
        'tables': table_count,
        'avg_length': avg_chunk_length,
        'citations': citation_count
    }

# Analyze both papers
results_1 = analyze_strategy2_results(research_data_1, "Paper 1 (30YearsResearchGate)")
results_2 = analyze_strategy2_results(research_data_2, "Paper 2 (SchenkBekkerSchmitt2025)")


📊 STRATEGY 2 ANALYSIS FOR Paper 1 (30YearsResearchGate):
Paper Title: 30 Years of Unlocking Sustainable Educational Opportunities through Open and Distance Learning
Authors: Martha Jacob Kabate, Upendo Nombo
Total chunks: 15
Section types: {'abstract': 1, 'introduction': 2, 'concept and definition': 1, 'background of the chapter': 3, 'literature review': 6, 'figure caption': 1, 'other': 1}
Figures detected: 1
Tables detected: 0
Average chunk length: 150 characters
Potential citations: 9

📊 STRATEGY 2 ANALYSIS FOR Paper 2 (SchenkBekkerSchmitt2025):
Paper Title: Chunk 1
Authors: 
Total chunks: 33
Section types: {'abstract': 1, 'introduction': 28, 'geological setting': 4}
Figures detected: 0
Tables detected: 0
Average chunk length: 213 characters
Potential citations: 40
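
The "Potential citations" figure counts every opening parenthesis and bracket, so figure references and ordinary parentheticals inflate it. A tighter estimate can be sketched with two regular expressions, assuming the papers use either author-year citations in parentheses or numeric citations in square brackets. The patterns are heuristics, and a parenthesis holding several references counts once.

```python
import re

# Parenthetical containing a four-digit year, e.g. "(Smith, 2020)" or "(Lee et al., 2019; Kim, 2021)"
AUTHOR_YEAR = re.compile(r"\([^()]*\b(?:19|20)\d{2}[a-z]?\b[^()]*\)")
# Bracketed numbers, e.g. "[3]", "[1, 2]", "[4-6]"
NUMERIC = re.compile(r"\[\d+(?:\s*[,\-–]\s*\d+)*\]")

def count_citations(content: str) -> int:
    """Count citation groups that match either heuristic pattern."""
    return len(AUTHOR_YEAR.findall(content)) + len(NUMERIC.findall(content))

sample = "As shown (Smith, 2020) and [3], see (Fig. 2) and f(x)."
print(count_citations(sample))
```

On the sample sentence the bracket-counting approach reports four, while only the author-year and numeric references match here.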


In [13]:
# 🎯 YOUR OPTIMIZATION WORKSPACE
# Implement your Strategy 2 optimizations here

print("🚀 Strategy 2 Optimization Workspace")
print("\nOptimization Areas for Raw PDF + AI:")
print("☐ Text extraction enhancement (different PDF libraries)")
print("☐ AI model selection and parameters")
print("☐ Prompt engineering for better structure")
print("☐ Adaptive page range selection")
print("☐ Post-processing and validation")

# TODO: Add your optimized Raw PDF + AI implementation here

# Example optimization template:
'''
# Your optimized implementation:

def your_optimized_text_extraction(pdf_path):
    # Try different PDF libraries:
    # - pdfplumber for better table/layout handling
    # - PyMuPDF (fitz) for complex layouts
    # - Custom preprocessing for academic papers
    return extracted_text

def your_optimized_ai_processing(text):
    # Optimize AI parameters:
    extract = together.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "YOUR_OPTIMIZED_PROMPT",
            },
            {"role": "user", "content": text},
        ],
        model="YOUR_CHOSEN_MODEL",  # Try different models
        temperature=YOUR_TEMP,  # Optimize for consistency vs creativity
        max_tokens=YOUR_TOKENS,  # Based on paper complexity
        top_p=YOUR_TOP_P,  # Response diversity
        stream=False,
    )
    return extract

# Process both papers with your optimizations
'''

🚀 Strategy 2 Optimization Workspace

Optimization Areas for Raw PDF + AI:
☐ Text extraction enhancement (different PDF libraries)
☐ AI model selection and parameters
☐ Prompt engineering for better structure
☐ Adaptive page range selection
☐ Post-processing and validation



In [14]:
# 📊 STRATEGY 2 RESULTS SUMMARY
print("🏆 STRATEGY 2: RAW PDF + AI RESULTS")
print("=" * 50)

if results_1:
    print("\nPaper 1 Results:")
    print(f"  Total Chunks: {results_1['chunks']}")
    print(f"  Section Types: {len(results_1['sections'])}")
    print(f"  Figures/Tables: {results_1['figures'] + results_1['tables']}")
    print(f"  Avg Chunk Length: {results_1['avg_length']} chars")

if results_2:
    print("\nPaper 2 Results:")
    print(f"  Total Chunks: {results_2['chunks']}")
    print(f"  Section Types: {len(results_2['sections'])}")
    print(f"  Figures/Tables: {results_2['figures'] + results_2['tables']}")
    print(f"  Avg Chunk Length: {results_2['avg_length']} chars")

print("\n🎯 Next Steps:")
print("1. Review the generated chunk files")
print("2. Implement your optimizations above")
print("3. Test different PDF extraction methods")
print("4. Compare with other strategies")
print("5. Document your improvements")

🏆 STRATEGY 2: RAW PDF + AI RESULTS

Paper 1 Results:
  Total Chunks: 15
  Section Types: 7
  Figures/Tables: 1
  Avg Chunk Length: 150 chars

Paper 2 Results:
  Total Chunks: 33
  Section Types: 3
  Figures/Tables: 0
  Avg Chunk Length: 213 chars

🎯 Next Steps:
1. Review the generated chunk files
2. Implement your optimizations above
3. Test different PDF extraction methods
4. Compare with other strategies
5. Document your improvements
