# Tutorial 2: Research Findings Extraction

This notebook demonstrates how to extract structured data from academic papers and research documents.

**What you'll learn:**
- Using `purpose="findings"` for research papers
- Detection modes: strict, moderate, extended
- Understanding extracted fields (estimates, p-values, methodologies)
- Analyzing extraction results

## 1. Setup

In [None]:
import os
import pandas as pd

# Set your API key (replace the placeholder; never commit a real key)
os.environ["GEMINI_API_KEY"] = "your-api-key-here"

from structify import Pipeline
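
Hardcoding a key in a notebook makes it easy to commit by accident. A minimal sketch of an alternative, assuming the key was exported in the shell that launched Jupyter. `require_api_key` is a hypothetical helper written for this tutorial, not part of structify:

```python
import os


def require_api_key(name="GEMINI_API_KEY", env=None):
    """Return the API key stored under `name`, or raise if it is missing.

    Hypothetical helper for this tutorial; not part of structify.
    `env` defaults to os.environ and can be replaced with a dict for testing.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat the tutorial placeholder the same as a missing key
    if not key or key == "your-api-key-here":
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


# Example with an explicit mapping instead of the real environment
key = require_api_key(env={"GEMINI_API_KEY": "example-key"})
print(f"Key loaded ({len(key)} characters)")
```

Calling `require_api_key()` with no arguments reads from the real environment and fails early with a clear message when the variable is absent.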

## 2. Research Findings Extraction

The `purpose="findings"` mode is optimized for academic papers and research documents. It focuses on extracting:

- **Estimates & Coefficients**: Treatment effects, regression coefficients, effect sizes
- **Statistical Measures**: Standard errors, confidence intervals, p-values
- **Methodology**: DID, IV, RDD, RCT, OLS, Fixed Effects
- **Context**: Sample size, time period, geographic scope

In [None]:
# Create a pipeline optimized for research findings
pipeline = Pipeline(
    purpose="findings",  # Optimized for academic papers
    seed=42,             # Reproducible results
)

# Process your research papers
results = pipeline.fit_transform("path/to/research_papers/")

print(f"Extracted {len(results)} findings from research papers")
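
Extracted statistical fields can be cross-checked against each other. The sketch below recomputes a 95% confidence interval and a two-sided p-value from an estimate and its standard error. It assumes a normal approximation (a z-test of the estimate against zero), which may not match the test a given paper actually reports, and the input values are hypothetical:

```python
import math


def normal_approx_check(estimate, std_error, z_crit=1.96):
    """Return (ci_low, ci_high, p_value) under a normal approximation.

    Assumes a two-sided z-test of estimate == 0; papers using t-tests,
    clustered errors, or bootstraps may report different numbers.
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    z = estimate / std_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return estimate - z_crit * std_error, estimate + z_crit * std_error, p_value


# Hypothetical extracted values: estimate 0.12 with standard error 0.05
low, high, p = normal_approx_check(0.12, 0.05)
print(f"95% CI: [{low:.3f}, {high:.3f}], p = {p:.4f}")
```

A large gap between the recomputed and extracted values is a hint to re-read the source table, not proof of an extraction error.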

## 3. Understanding Detection Modes

Detection modes control how many fields are discovered:

| Mode | Fields Detected | Best For |
|------|-----------|----------|
| `strict` | 5-7 | Simple papers, quick extraction |
| `moderate` | 7-12 | Most use cases (default) |
| `extended` | 12-20 | Complex papers with many variables |

In [None]:
# Strict mode - fewer fields, faster extraction
pipeline_strict = Pipeline(
    purpose="findings",
    detection_mode="strict",  # Only essential fields
)

# Moderate mode (default) - balanced
pipeline_moderate = Pipeline(
    purpose="findings",
    detection_mode="moderate",
)

# Extended mode - more fields for complex papers
pipeline_extended = Pipeline(
    purpose="findings",
    detection_mode="extended",  # More detailed extraction
)
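
One way to choose a mode is to estimate how many fields the papers are likely to need and pick the smallest mode that covers that count. The sketch below encodes the ranges from the table above. `suggest_detection_mode` is a hypothetical helper for illustration, not part of structify:

```python
# Field-count ranges copied from the detection-mode table: mode -> (low, high)
MODE_RANGES = {
    "strict": (5, 7),
    "moderate": (7, 12),
    "extended": (12, 20),
}


def suggest_detection_mode(expected_fields):
    """Pick the smallest mode whose upper bound covers `expected_fields`.

    Hypothetical helper for illustration; not part of structify.
    """
    # Dicts keep insertion order, so modes are checked from smallest to largest
    for mode, (_, high) in MODE_RANGES.items():
        if expected_fields <= high:
            return mode
    # More fields than any mode lists: fall back to the largest mode
    return "extended"


for n in (4, 10, 18):
    print(n, suggest_detection_mode(n))
```

The returned string can then be passed as `detection_mode` when constructing the pipeline.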

## 4. Example: Extracting Economic Research Findings

In [None]:
# Create pipeline for economics research
pipeline = Pipeline(
    purpose="findings",
    detection_mode="moderate",
    deduplicate=True,
    enable_checkpoints=True,
    seed=42,
)

# Extract findings
results = pipeline.fit_transform("economics_papers/")

# View extracted data
results.head()

## 5. Typical Fields Extracted

The exact fields depend on the papers processed. Print the detected schema to see each field's name, type, description, and options:

In [None]:
# View the detected schema
print("Detected Schema for Research Findings:")
print("=" * 50)

for field in pipeline.schema.fields:
    req = "(required)" if field.required else ""
    print(f"\n{field.name} [{field.type.value}] {req}")
    print(f"  {field.description}")
    
    if field.options:
        print(f"  Options: {', '.join(field.options[:5])}{'...' if len(field.options) > 5 else ''}")

## 6. Analyzing Extraction Results

In [None]:
# Basic statistics
print(f"Total findings extracted: {len(results)}")
print(f"Columns: {list(results.columns)}")

In [None]:
# Analyze methodology distribution (if methodology field exists)
if 'methodology' in results.columns:
    print("\nMethodology Distribution:")
    print(results['methodology'].value_counts())

In [None]:
# Analyze significance levels (if present)
if 'significance_level' in results.columns:
    print("\nStatistical Significance:")
    print(results['significance_level'].value_counts())

In [None]:
# View estimate values (if present)
if 'estimate_value' in results.columns:
    print("\nEstimate Value Statistics:")
    print(results['estimate_value'].describe())

## 7. Filtering Results

In [None]:
# Filter for statistically significant results only
if 'significance_level' in results.columns:
    significant = results[results['significance_level'].isin(['p<0.01', 'p<0.05'])]
    print(f"Significant findings (p<0.05): {len(significant)}")
    print(significant.head())  # a bare expression inside `if` is not displayed

In [None]:
# Filter by methodology
if 'methodology' in results.columns:
    did_results = results[results['methodology'] == 'DID']
    print(f"Difference-in-Differences findings: {len(did_results)}")
    print(did_results.head())  # a bare expression inside `if` is not displayed
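
Exact string matching on significance labels is brittle if the labels vary in spacing or case. A minimal sketch of a more tolerant filter, assuming labels look roughly like `p<0.05`. The label formats and the `p_threshold` helper are assumptions, not something structify guarantees:

```python
import math
import re

# Matches 'p<0.05', 'p < 0.05', 'P<=.10', and similar (assumed label formats)
_P_LABEL = re.compile(r"p\s*<=?\s*(0?\.\d+)", re.IGNORECASE)


def p_threshold(label):
    """Parse labels such as 'p<0.05' into 0.05; return NaN when nothing matches."""
    match = _P_LABEL.search(str(label))
    return float(match.group(1)) if match else math.nan


labels = ["p<0.01", "p < 0.05", "P<.10", "not significant"]
thresholds = [p_threshold(label) for label in labels]

# NaN compares as False, so unparseable labels are dropped by the filter
significant_labels = [lab for lab, t in zip(labels, thresholds) if t <= 0.05]
print(significant_labels)
```

On a DataFrame, the same function could be applied with `results['significance_level'].map(p_threshold) <= 0.05` to build a boolean mask.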

## 8. Mandatory Fields

For research findings, these fields are always included:

- **unit**: Unit of measurement (%, log, coefficient, percentage points, elasticity)
- **notes**: One sentence explaining the finding and context

In [None]:
# View the notes field - contains important context
if 'notes' in results.columns:
    print("Sample notes from extracted findings:")
    for i, note in enumerate(results['notes'].head(5)):
        print(f"\n{i+1}. {note}")

In [None]:
# View units used
if 'unit' in results.columns:
    print("\nUnits of measurement:")
    print(results['unit'].value_counts())
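
Estimates reported in different units are not directly comparable, so pooled statistics over `estimate_value` can mislead. The sketch below summarizes estimates per unit. The data is hypothetical, and only the column names follow the ones used in this notebook:

```python
import pandas as pd

# Hypothetical findings table; column names follow the ones used in this notebook
sample = pd.DataFrame(
    {
        "estimate_value": [0.12, 0.08, 2.5, 3.5, -0.4],
        "unit": ["log", "log", "percentage points", "percentage points", "elasticity"],
    }
)

# Summarize estimates separately for each unit instead of pooling them
by_unit = sample.groupby("unit")["estimate_value"].agg(["count", "mean"])
print(by_unit)
```

The same `groupby` call works on `results` when both the `unit` and `estimate_value` columns are present.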

## 9. Saving Schema for Reuse

Save the detected schema to skip detection next time:

In [None]:
# Save the schema
pipeline.save_schema("research_schema.json")
print("Schema saved! Use it later with: Pipeline(schema='research_schema.json')")

## 10. Export Results

In [None]:
# Save to CSV
results.to_csv("research_findings.csv", index=False)

# Save to Excel (plain export, no styling; requires the openpyxl package)
results.to_excel("research_findings.xlsx", index=False)

# Save to JSON for further processing
results.to_json("research_findings.json", orient="records", indent=2)

print("Results exported successfully!")
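
With `orient="records"`, the JSON file is a list of row objects, so downstream tools can read it without pandas. A short sketch using hypothetical records and a temporary directory in place of the real export:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records in the shape produced by orient="records": a list of row dicts
records = [
    {"methodology": "DID", "estimate_value": 0.12, "unit": "log"},
    {"methodology": "IV", "estimate_value": 2.5, "unit": "percentage points"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "research_findings.json"
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")

    # Read the file back using only the standard library
    loaded = json.loads(path.read_text(encoding="utf-8"))

did_rows = [row for row in loaded if row["methodology"] == "DID"]
print(len(loaded), did_rows[0]["estimate_value"])
```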

## Summary

In this tutorial, you learned:
- ✅ Using `purpose="findings"` for research papers
- ✅ Detection modes (strict, moderate, extended)
- ✅ Understanding extracted fields
- ✅ Analyzing and filtering results
- ✅ Mandatory fields (unit, notes)
- ✅ Saving schemas for reuse

**Next:** Tutorial 3 - Policy Document Extraction