## MetaBeeAI LLM pipeline tutorial

Walk through the steps of the MetaBeeAI LLM pipeline.

1. Split the PDF into overlapping 2-page PDFs
2. Process the PDFs using Vision Agentic Document Analysis
3. Extract the text from the PDFs (semi-structured JSON format)
4. Extract the structured data from the text (structured JSON and CSV format)

The pipeline starts from a directory of PDFs, each in its own subfolder named with a three-digit number, e.g.:

```
data/papers/
├── 001/
│   └── main.pdf
└── 002/
    └── main.pdf
```
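Before running the pipeline, it can help to confirm that the input directory matches this layout. The following is a minimal sketch, not part of the package: `find_paper_folders` is a hypothetical helper, and it assumes folder names are exactly three digits and that any `.pdf` file in the folder counts.

```python
import re
from pathlib import Path

# Assumption: paper folders are named with exactly three digits ("001", "002", ...).
FOLDER_PATTERN = re.compile(r"\d{3}")


def find_paper_folders(papers_dir):
    """Return the sorted names of three-digit subfolders that contain at least one PDF."""
    found = []
    for entry in Path(papers_dir).iterdir():
        if entry.is_dir() and FOLDER_PATTERN.fullmatch(entry.name):
            # Accept any PDF name here; the pipeline itself may expect a specific one.
            if any(entry.glob("*.pdf")):
                found.append(entry.name)
    return sorted(found)
```

Calling `find_paper_folders("data/papers")` on the layout above would list the folders the pipeline is expected to pick up.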







## Setup

Step 1: create a virtual environment

```
python -m venv metabeeai_llm
source metabeeai_llm/bin/activate
```

Step 2: install dependencies

```
pip install -r requirements.txt
```

Step 3: create a `.env` file with your API keys

```
OPENAI_API_KEY=your_openai_api_key
LANDING_AI_API_KEY=your_landing_ai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
```
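A quick check for missing keys can catch configuration problems before any API call is made. This is a minimal sketch under two assumptions: the values from `.env` have already been loaded into the process environment, and all three keys are needed (individual steps may only need one). `missing_api_keys` is a hypothetical helper, not a function of the package.

```python
import os

# Key names taken from the .env example above; whether every step needs all
# three is an assumption.
REQUIRED_KEYS = ("OPENAI_API_KEY", "LANDING_AI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env=None):
    """Return the required key names that are unset or blank in `env`."""
    if env is None:
        env = os.environ
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

An empty list from `missing_api_keys()` means every listed key has a non-blank value.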

Step 4: install the package

```
pip install -e . # in development mode
```

or:

```
pip install .
```




## Pipeline

## 1. Split the PDF into overlapping 2-page PDFs using `split_pdf.py`

The script will:
- Look for subfolders in the specified directory
- For each subfolder, look for a PDF named `{subfolder}_main.pdf`
- Create a `pages` subdirectory if it doesn't exist
- Split the PDF into overlapping 2-page PDFs named `main_p01-02.pdf`, `main_p02-03.pdf`, etc.
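The overlapping windows and output names described above can be sketched as follows. This only illustrates the naming scheme and is not the code in `split_pdf.py`; `overlapping_windows` is a hypothetical helper, and returning no windows for a single-page PDF is an assumption.

```python
def overlapping_windows(n_pages, stem="main"):
    """Return (first_page, last_page, filename) for each overlapping 2-page window."""
    windows = []
    # Pages are 1-indexed; the window starting at page k covers pages k and k + 1.
    # A PDF with fewer than two pages yields no windows (an assumption).
    for first in range(1, n_pages):
        last = first + 1
        windows.append((first, last, f"{stem}_p{first:02d}-{last:02d}.pdf"))
    return windows
```

For a 4-page PDF this gives three windows (pages 1-2, 2-3, and 3-4), so every page boundary appears inside at least one 2-page file.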

Note that this script requires exactly one argument: the directory containing your paper subfolders.

The directory should have this structure:

```
papers_directory/
    001/
        001_main.pdf
    002/
        002_main.pdf
    ...
```

Basic usage (requires the directory argument):
```
python -m metabeeai_llm.split_pdf data/papers
```

Or with a full path:
```
python -m metabeeai_llm.split_pdf /path/to/papers/directory
```

The same from Python:

```
from metabeeai_llm.split_pdf import split_pdfs

# Basic usage
split_pdfs("data/papers")

# Or with full path
split_pdfs("/path/to/papers/directory")
```

## 2. Run Vision Agentic Document Analysis

Vision Agentic Document Analysis uses computer vision to extract text from the PDFs. It outputs a JSON file for each PDF, with chunks of text (or image descriptions) labeled with unique IDs and chunk types (e.g., "text", "figure", "header", etc.).

Parameters:
- `--dir` / `papers_dir`: directory containing paper folders (default: "data/papers")
- `--start` / `start_folder`: optional folder number to start processing from (e.g., "059")

The script will:
- Look for PDF files in the "pages" subdirectory of each paper folder
- Process each PDF using the Landing AI Vision Agentic Document Analysis API
- Save the results as JSON files alongside the PDFs
- Create a timestamped log file in the papers directory
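The effect of the `--start` / `start_folder` parameter can be pictured with a short sketch. It assumes the start folder itself is included and that folder names are zero-padded so that string order matches numeric order; `folders_from` is a hypothetical helper, not the package's implementation.

```python
def folders_from(folder_names, start_folder=None):
    """Return sorted folder names, skipping those that sort before `start_folder`."""
    ordered = sorted(folder_names)
    if start_folder is None:
        return ordered
    # Zero-padded three-digit names sort the same way as their numeric values.
    return [name for name in ordered if name >= start_folder]
```

This is useful for resuming a run that stopped partway through a large collection without reprocessing earlier folders.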

Note: make sure your `.env` file sets `LANDING_AI_API_KEY`.

The required directory structure:

```
papers_directory/
    001/
        pages/
            main_p01-02.pdf
            main_p02-03.pdf
            ...
    002/
        pages/
            ...
```

Basic usage (uses the default data/papers directory):
```
python -m metabeeai_llm.va_process_papers
```
With a specific directory:
```
python -m metabeeai_llm.va_process_papers --dir data/papers
```
Start from a specific folder number:
```
python -m metabeeai_llm.va_process_papers --dir data/papers --start 059
```

The same from Python:

```
from metabeeai_llm.va_process_papers import process_papers

# Basic usage (uses the default papers directory)
process_papers()

# With specific directory
process_papers(papers_dir="data/papers")

# Start from a specific folder number
process_papers(papers_dir="data/papers", start_folder="059")
```

## 3. Run first LLM pipeline with `llm_pipeline.py` 

(using `json_multistage_qa.py`)

This pipeline extracts initial structured data from the text chunks. It aims to capture information about the bees and pesticides, as well as any secondary stressors. It compiles a list of endpoints found in the text chunks and attempts to extract the sample size, mean (central tendency), and standard deviation for each, where available. The pipeline also extracts text summaries of the discussion, including the impact of the paper's findings, recommendations for future research, and any limitations of the study.

First, the output JSON files from Vision Agentic Document Analysis are merged into a single JSON file.

Then, the LLM pipeline is run to extract answers from the literature.

It uses `questions.yaml` to determine the questions and to format the output JSON file.

Parameters:
- `--dir` / `base_dir`: directory containing paper folders (default: "../data/papers")
- `--start` / `start_paper`: first paper number to process (default: 1)
- `--end` / `end_paper`: last paper number to process (default: 999)
- `--overwrite` / `overwrite_merged`: flag to overwrite existing `merged.json` files (default: False)

The script will process the papers in the specified range and create or update merged.json files in each paper's directory.
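The paper range maps onto the three-digit folder names used throughout this tutorial. A minimal sketch of that mapping, assuming the range is inclusive at both ends (`paper_folder_names` is a hypothetical helper, not part of `llm_pipeline.py`):

```python
def paper_folder_names(start_paper=1, end_paper=999):
    """Return zero-padded folder names for the inclusive range start_paper..end_paper."""
    if start_paper > end_paper:
        raise ValueError("start_paper must not exceed end_paper")
    # Format each number as three digits, e.g. 7 -> "007".
    return [f"{number:03d}" for number in range(start_paper, end_paper + 1)]
```

With the defaults this covers every folder from `001` to `999`; folders in that range that do not exist on disk would simply have nothing to process.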

Basic usage (uses defaults)
```
python -m metabeeai_llm.llm_pipeline
```

With specific options

```
python -m metabeeai_llm.llm_pipeline \
    --dir data/papers \
    --start 1 \
    --end 10 \
    --overwrite
```


In [None]:
from metabeeai_llm.llm_pipeline import process_papers

# Basic usage (uses defaults)
process_papers(
    base_dir="../data/papers",
    start_paper=1,
    end_paper=999,
    overwrite_merged=False
)

# With specific options
process_papers(
    base_dir="data/papers",
    start_paper=1,
    end_paper=10,
    overwrite_merged=True
)

## 4. Quick check of chunk IDs to make sure they are unique

The script will:
- Check all JSON files (except merged.json) in each pages/ subfolder
- Look for duplicate chunk_ids
- Print any duplicates found
- Save a detailed log file with results to either:
    - `data/logs/chunk_id_check_[timestamp].json` (for relative paths)
    - `[papers_dir]/logs/chunk_id_check_[timestamp].json` (for absolute paths)

The log file will contain a summary of how many subfolders were checked and how many had duplicates, along with detailed information about any duplicates found.
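The core of a duplicate check like this can be sketched in a few lines. The sketch assumes the chunks have already been loaded from the JSON files into a list of dicts, each with a `chunk_id` key; the exact layout of the JSON files is not shown here, and `find_duplicate_chunk_ids` is a hypothetical helper rather than the function in `unique_chunk_id.py`.

```python
from collections import Counter


def find_duplicate_chunk_ids(chunks):
    """Return {chunk_id: count} for every chunk_id that appears more than once."""
    # Assumption: each chunk is a dict with a "chunk_id" key.
    counts = Counter(chunk["chunk_id"] for chunk in chunks)
    return {chunk_id: n for chunk_id, n in counts.items() if n > 1}
```

An empty result means all chunk IDs in the checked set are unique.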

Basic usage (uses the default data/papers directory):
```
python -m metabeeai_llm.unique_chunk_id
```

Specify a custom directory:
```
python -m metabeeai_llm.unique_chunk_id --dir path/to/papers
```

The same from Python:

```
from metabeeai_llm.unique_chunk_id import check_chunk_ids_in_pages_dir

# Basic usage (uses default data/papers directory)
check_chunk_ids_in_pages_dir()

# Specify a custom directory
check_chunk_ids_in_pages_dir(papers_dir="path/to/papers")
```

## 5. Run second LLM pipeline with `process_llm_output.py`

This pipeline extracts structured data from the JSON files. It uses the `schema_config.yaml` file to determine which chunks to extract and how to structure the data.

It has two options: 

1. Extract pesticides and bee data
2. Extract endpoints, means, standard deviations, and sample sizes


Basic usage (process both pesticides and endpoints):
```
python -m metabeeai_llm.process_llm_output --start 1 --end 10
```
Process only pesticides:
```
python -m metabeeai_llm.process_llm_output --start 1 --end 10 --pesticides
```
Process only endpoints:
```
python -m metabeeai_llm.process_llm_output --start 1 --end 10 --endpoints
```
Process both explicitly:
```
python -m metabeeai_llm.process_llm_output --start 1 --end 10 --all
```
With a custom config file:
```
python -m metabeeai_llm.process_llm_output --start 1 --end 10 --config path/to/schema_config.yaml
```

The same from Python:

```
from metabeeai_llm.process_llm_output import process_papers, save_data, flatten_pesticide_data

# Process papers
pesticide_results, endpoint_results = process_papers(
    start_folder=1,
    end_folder=10,
    config_file="../schema_config.yaml",  # optional
    process_pesticides=True,              # optional
    process_endpoints=True                # optional
)

# Save the results
if pesticide_results:
    save_data(pesticide_results, "pesticides", flatten_func=flatten_pesticide_data)
if endpoint_results:
    save_data(endpoint_results, "endpoints")
```
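To illustrate how nested results can end up as CSV rows, here is a generic sketch of flattening. It is not the package's `flatten_pesticide_data`: the helper names, the dotted column names, and the example field names are all hypothetical, and the real output schema is determined by `schema_config.yaml`.

```python
import csv
import io


def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def records_to_csv(records):
    """Serialise a list of (possibly nested) dicts to CSV text."""
    rows = [flatten_record(record) for record in records]
    # Collect column names in first-seen order so the header is stable.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing columns are written as empty cells
    return buffer.getvalue()
```

Records with different keys share one header, so papers that lack a given field simply leave that column empty.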