# Drinks Menu Processing Notebook
This notebook processes drinks menus from PDF files using the following steps:
1. Extract text from the PDF files (`extract_text_hybrid`).
2. Generate structured JSON data for each menu using an LLM API (`query_model_to_json`).
3. Save the generated JSON outputs to a specified target directory.

## Dependencies
Before proceeding, make sure you have done the following:
- Install the dependencies listed in `requirements.txt`.
- Set your Gemini API key as an environment variable named `GEMINI_API_KEY`. You can get a free API key here (https://aistudio.google.com/apikey).
- Install Tesseract OCR by following the installation instructions here (https://tesseract-ocr.github.io/tessdoc/Installation.html).
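
The last two prerequisites can be checked with a quick preflight before running the cells below. This is only a minimal sketch: `check_prerequisites` is a hypothetical helper that is not part of this repository, and it assumes the Tesseract executable is named `tesseract` and is discoverable on the `PATH`.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper for illustration. `env` and `which` are injectable
    so the check can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []
    # The notebook expects the API key in GEMINI_API_KEY.
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    # Assumption: the Tesseract executable is called `tesseract`.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    return problems


for problem in check_prerequisites():
    print(f"Missing prerequisite: {problem}")
```

If nothing is printed, both checks passed. The check does not verify that the API key is valid or that the packages in `requirements.txt` are installed.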

In [4]:
import os
import json
from pathlib import Path

from pdf_to_text import extract_text_hybrid
from query_llm import query_model_to_json

# Directory paths
pdf_directory = "./example_pdfs"  # Directory containing the input PDFs
output_directory = "./output_jsons"  # Directory to save the JSON outputs

# Ensure the output directory exists
os.makedirs(output_directory, exist_ok=True)

In [5]:
def process_pdf_to_json(pdf_path):
    """
    Processes a single PDF file:
    1. Extracts text using `extract_text_hybrid`.
    2. Converts the text into structured JSON using `query_model_to_json`.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        dict: The structured JSON data for the menu.
    """
    # Extract text from the PDF
    print(f"Extracting text from: {pdf_path}")
    extracted_text = extract_text_hybrid(pdf_path)

    # Query the LLM to generate JSON
    print("Querying the LLM for JSON generation...")
    menu_json = query_model_to_json(extracted_text)
    
    return menu_json

### Processing All PDFs
This section processes all PDFs in the specified directory and saves their JSON outputs.

In [6]:
# Keep only PDF files; any other files in the directory are skipped
pdf_files = [Path(p) for p in os.listdir(pdf_directory) if p.lower().endswith(".pdf")]
for pdf_file in pdf_files:
    # Set paths for loading and saving
    pdf_path = os.path.join(pdf_directory, pdf_file)
    json_path = os.path.join(output_directory, pdf_file.with_suffix(".json"))

    try:
        # Process pdf
        menu_json = process_pdf_to_json(pdf_path)

        # Save JSON (the with-block closes the file automatically)
        with open(json_path, "w") as outfile:
            json.dump(menu_json, outfile, indent=2)
        
        print(f"Processed and saved: {json_path}")

    except Exception as e:
        print(f"Error processing {pdf_file}: {e}")

Extracting text from: ./example_pdfs/beer_menu.pdf
Querying the LLM for JSON generation...
Processed and saved: ./output_jsons/beer_menu.json
Extracting text from: ./example_pdfs/wine_list.pdf
Querying the LLM for JSON generation...
Processed and saved: ./output_jsons/wine_list.json
Extracting text from: ./example_pdfs/bacardi_menu.pdf
Querying the LLM for JSON generation...
Processed and saved: ./output_jsons/bacardi_menu.json


### Next Steps
- Review the generated JSON files to ensure the output meets the requirements.
- Test the notebook with additional PDFs to verify robustness.
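
As a starting point for the review step, the sketch below loads every file in the output directory and reports whether it parses and contains any data. `review_outputs` is a hypothetical helper, not part of this repository, and it deliberately makes no assumptions about the menu schema; checking the actual fields still needs a manual or schema-based review.

```python
import json
from pathlib import Path


def review_outputs(output_directory):
    """Map each JSON file name to "ok", "empty", or a parse-error message.

    Hypothetical helper for illustration; it only checks that each file
    parses and is non-empty, not that it matches any particular menu schema.
    """
    report = {}
    for json_path in sorted(Path(output_directory).glob("*.json")):
        try:
            with open(json_path, encoding="utf-8") as infile:
                data = json.load(infile)
        except json.JSONDecodeError as e:
            report[json_path.name] = f"invalid JSON: {e}"
            continue
        # None, {}, [] and "" all count as empty output.
        report[json_path.name] = "ok" if data else "empty"
    return report


for name, status in review_outputs("./output_jsons").items():
    print(f"{name}: {status}")
```

Files reported as `empty` or `invalid JSON` are good candidates for re-running or for inspecting the extracted text.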