# OpenExtract - PDF Data Extraction for Accountants

Extract structured data from tax forms, 401(k) documents, invoices, and more!

**No coding required** - just run these cells in order:
1. Setup (install dependencies)
2. Upload your PDF
3. View available templates
4. Extract data
5. View results as a table
6. Download as CSV

---

## Step 1: Setup

Run this cell first to install dependencies and load OpenExtract.

In [None]:
# Install dependencies
!pip install -q pdfplumber pandas

# Clone OpenExtract repository
!rm -rf openextract  # Remove if exists from previous run
!git clone -q https://github.com/ModelUser123/CJCPAs-OpenExtract.git openextract

# Add to Python path
import sys
sys.path.insert(0, 'openextract/src')

# Import OpenExtract
from openextract import Extractor

# Create extractor instance
extractor = Extractor(templates_dir='openextract/templates')

print("\n" + "="*50)
print("  OpenExtract Ready!")
print("="*50)

## Step 2: Upload Your PDF

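Run this cell and select your PDF file to upload. If you select several files, only the first one is used in Steps 4-6; to process multiple PDFs, use the batch cell under Advanced.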

In [None]:
from google.colab import files

print("Select a PDF file to upload...")
uploaded = files.upload()

if uploaded:
    pdf_file = list(uploaded.keys())[0]
    print(f"\n Uploaded: {pdf_file}")
    print(f" Size: {len(uploaded[pdf_file]):,} bytes")
else:
    print(" No file uploaded")
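
The upload cell accepts any file type, so a quick sanity check before extraction can save a confusing error later. The sketch below is an optional helper, not part of OpenExtract: `looks_like_pdf` is a hypothetical name, and it only tests for the `%PDF-` marker that PDF files carry near the start (within the first 1024 bytes, which is what common readers tolerate).

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes carry the PDF header marker near the start."""
    # PDF files begin with "%PDF-"; a few tools prepend stray bytes,
    # so search the first 1024 bytes rather than only position 0.
    return b"%PDF-" in data[:1024]
```

In this notebook it could be called as `looks_like_pdf(uploaded[pdf_file])`, since `files.upload()` returns the file contents as bytes keyed by filename. A `True` result only means the header is present; it does not guarantee the PDF contains extractable text.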

## Step 3: View Available Templates

Run this cell to see all templates you can use.

In [None]:
extractor.list_templates()

## Step 4: Extract Data

Change the `template` variable below to match your document type, then run the cell.

**Common templates:**
- `form-5500` - DOL/IRS Form 5500 (large plans)
- `form-5500-sf` - Form 5500-SF (small plans)
- `1099-misc` - IRS 1099-MISC
- `1099-nec` - IRS 1099-NEC
- `1099-int` - IRS 1099-INT
- `k1-1065` - Schedule K-1 (Form 1065)
- `generic-invoice` - Invoices
- `generic-bank-statement` - Bank statements

In [None]:
# === CHANGE THIS TO MATCH YOUR DOCUMENT ===
template = "form-5500"  # Change me!
# ==========================================

print(f"Extracting data using template: {template}")
print("Please wait...\n")

try:
    results = extractor.extract(pdf_file, template=template)
    
    print(" Extraction complete!\n")
    print("=" * 50)
    print("EXTRACTED DATA")
    print("=" * 50)
    
    # Display results nicely
    for col in results.columns:
        value = results[col].iloc[0]
        if value is not None and str(value) != 'nan':
            print(f"  {col}: {value}")
    
    print("\n" + "=" * 50)
    
except Exception as e:
    print(f" Error: {e}")
    print("\nTip: Make sure you uploaded a PDF and selected the right template.")
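
A mistyped template name is an easy mistake to make in the cell above. As a minimal sketch, the helper below suggests the closest name from the "Common templates" list using Python's standard `difflib` module. `COMMON_TEMPLATES` and `suggest_template` are hypothetical names introduced here for illustration; the list covers only the templates named in this notebook, not everything `extractor.list_templates()` may report.

```python
import difflib

# Names copied from the "Common templates" list in this step.
# The repository may ship more; this list is only illustrative.
COMMON_TEMPLATES = [
    "form-5500", "form-5500-sf", "1099-misc", "1099-nec", "1099-int",
    "k1-1065", "generic-invoice", "generic-bank-statement",
]

def suggest_template(name, known=COMMON_TEMPLATES):
    """Return the name if it is known, else the closest known name, else None."""
    cleaned = name.strip().lower()
    if cleaned in known:
        return cleaned
    # get_close_matches returns candidates sorted best-first
    matches = difflib.get_close_matches(cleaned, known, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

For example, `suggest_template("form5500")` returns `"form-5500"`, which you could then assign to `template` before re-running the extraction cell.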

## Step 5: View Results as Table

Run this cell to see your data as a table, with one row per extracted field.

In [None]:
# Display as a nice table
from IPython.display import display

if 'results' in dir() and results is not None:
    display(results.T.rename(columns={0: 'Value'}))
else:
    print(" No results yet. Run Step 4 first.")

## Step 6: Download as CSV

Run this cell to download your extracted data as a CSV file.

In [None]:
if 'results' in dir() and results is not None:
    # Create filename based on original PDF
    output_filename = pdf_file.rsplit('.', 1)[0] + '_extracted.csv'
    
    # Save to CSV
    results.to_csv(output_filename, index=False)
    
    # Download
    files.download(output_filename)
    
    print(f" Downloaded: {output_filename}")
else:
    print(" No results to download. Run Step 4 first.")

---

## Advanced: Batch Processing

Need to process multiple PDFs? Run this cell and select all of them in the upload dialog.

In [None]:
# Imports repeated here so this cell also works if Steps 2 and 5 were skipped
from google.colab import files
from IPython.display import display

# Upload multiple PDFs
print("Select multiple PDF files to upload...")
batch_uploaded = files.upload()

# Keep only the PDFs; any other uploaded files are ignored
pdf_files = [f for f in batch_uploaded if f.lower().endswith('.pdf')]

if pdf_files:
    print(f"\n Uploaded {len(pdf_files)} PDF files")

    # Process all PDFs
    template = "form-5500"  # Change to your template

    all_results = extractor.extract_batch(pdf_files, template=template)

    print(f"\n Extracted data from {len(all_results)} files")
    display(all_results)

    # Download combined results
    all_results.to_csv('batch_extracted.csv', index=False)
    files.download('batch_extracted.csv')
else:
    print(" No PDF files uploaded")
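
This notebook does not say whether `extract_batch` skips or stops on an unreadable file. If one bad PDF interrupts your batch, a per-file loop is one way around it. The sketch below is an assumption-laden alternative, not OpenExtract's own method: `extract_each` is a hypothetical helper that takes any extraction function and records failures instead of stopping.

```python
def extract_each(pdf_files, extract_fn):
    """Run extract_fn on each file and return (results, failures)."""
    results = {}   # filename -> whatever extract_fn returned
    failures = {}  # filename -> error message
    for name in pdf_files:
        try:
            results[name] = extract_fn(name)
        except Exception as exc:  # one bad PDF should not stop the batch
            failures[name] = str(exc)
    return results, failures
```

In this notebook you could call it as `ok, failed = extract_each(pdf_files, lambda f: extractor.extract(f, template=template))`, reusing the same `extractor.extract` call as Step 4. Assuming each result is a pandas DataFrame, as Step 4 treats it, `pd.concat(list(ok.values()), ignore_index=True)` would combine them into one table, and `failed` lists the files to re-check by hand.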

---

## Need Help?

- **GitHub Issues**: Report bugs or request features
- **Contribute**: Add new templates via Pull Request
- **Templates**: Check the `/templates` folder of the repository (cloned to `openextract/templates` in Step 1) for all available templates

Made with love for accountants everywhere!