# Table of contents
1. [Introduction](#introduction)
2. [Gemini API](#gemini-api)
3. [PDF to Text](#pdf-to-text)
4. [Save as JSON](#save-as-json)

## Introduction
This notebook uses the Gemini API to extract text from PDF files and saves each response as a JSON file. The JSON files are then used to annotate the data.

## Gemini API
The Gemini API reads the PDF files and returns their content as text. It offers a free tier, and the available models are listed [here](https://ai.google.dev/gemini-api/docs/models/gemini).

In [1]:
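import os

import google.generativeai as genai
from dotenv import load_dotenv

# Load variables from a local .env file into the environment
load_dotenv()

# API key for the Gemini API, read from the GCP_KEY environment variable
GCP_KEY = os.getenv("GCP_KEY")

genai.configure(api_key=GCP_KEY)
model = genai.GenerativeModel("gemini-1.5-flash")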
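
`os.getenv` returns `None` when the variable is missing, so a missing key only surfaces later as an API error. A minimal sketch of an earlier check is shown here; `require_env` is a hypothetical helper, not part of the notebook or of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust to wherever the key is actually stored
        raise RuntimeError(f"Environment variable {name!r} is not set; add it to your .env file.")
    return value
```

With this helper, the key lookup in the cell above could be written as `GCP_KEY = require_env("GCP_KEY")`.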

## PDF to Text
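Each PDF file is base64-encoded and sent to the Gemini API together with a text prompt. The prompt in this cell asks for a summary of the document, and the API returns the result as text.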

In [2]:
import base64

# Function to read and base64 encode the PDF
def encode_pdf(file_path):
    with open(file_path, "rb") as pdf_file:
        pdf_data = pdf_file.read()
    return base64.standard_b64encode(pdf_data).decode("utf-8")


# PDF file paths
pdf_files = [
    "D:/github/OCR/Invoice-635286.pdf",
    "D:/github/OCR/Invoice-640419.pdf",
    "D:/github/OCR/wordpress-pdf-invoice-plugin-sample.pdf"
]

# Map each PDF path to its API response (filled in by the loop below)
responses = {pdf_file: None for pdf_file in pdf_files}

# Summarize each PDF
for pdf_file in pdf_files:
    encoded_pdf = encode_pdf(pdf_file)
    prompt = "Summarize this document"

    # Send the PDF as an inline part, followed by the text prompt
    response = model.generate_content(
        [{"mime_type": "application/pdf", "data": encoded_pdf}, prompt]
    )
    responses[pdf_file] = response

    print(f"Summary for {pdf_file}:")
    print(response.text)
    print("=" * 80)

Summary for D:/github/OCR/Invoice-635286.pdf:
This is an invoice from IT Works to ACME Systems Inc.  The invoice number is 635286 and the date is 2017-12-06.  ACME Systems Inc. is located at Somewhere Road 59, Bucharest, Romania.  The invoice includes one item: Concierge Services, with a quantity of 1, a price per unit of 226351 EUR, and a total cost of 226351 EUR.  The subtotal is 188626 EUR, the tax is 37725.2 EUR, and the total amount due is 226351 EUR.  Payment is due within 20 days of the invoice date.
Summary for D:/github/OCR/Invoice-640419.pdf:
This is an invoice from Clipboard Papers to ACME Systems Inc. for concierge services.  The invoice number is 640419 and the date is June 18, 2017.  One unit of concierge services was provided at a cost of 187,124 RON.  Including tax (37,424.8 RON), the total amount due is 224,549 RON.  Payment is due within seven days of receipt.
Summary for D:/github/OCR/wordpress-pdf-invoice-plugin-sample.pdf:
This is an invoice from Sliced Invoices (D
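
The cell above asks for a summary. To get the document text itself, the same request shape can carry a different prompt. The sketch below builds that request as plain data so it can be inspected without calling the API; `build_pdf_request` and `EXTRACT_PROMPT` are hypothetical names, and the prompt wording is only an assumption about what might work well.

```python
import base64

# Hypothetical prompt asking for the raw text instead of a summary
EXTRACT_PROMPT = "Extract all text from this document verbatim."


def build_pdf_request(pdf_bytes: bytes, prompt: str = EXTRACT_PROMPT) -> list:
    """Build the [pdf_part, prompt] list in the same shape the cell above passes to the model."""
    # A well-formed PDF begins with the marker "%PDF-"; reject anything else early
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("Input does not look like a PDF file")
    encoded = base64.standard_b64encode(pdf_bytes).decode("utf-8")
    return [{"mime_type": "application/pdf", "data": encoded}, prompt]
```

The returned list could then be passed to `model.generate_content(...)` in place of the inline list used in the loop.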

## Save as JSON
Each full API response is converted to a dictionary and saved as a JSON file next to its source PDF. The JSON files are used to annotate the data.

In [3]:
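# Save each response as JSON next to its source PDF
import json
from pathlib import Path

for pdf_file, response in responses.items():
    response_json = response.to_dict()
    # Replace the .pdf extension with .json
    json_path = Path(pdf_file).with_suffix(".json")
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(response_json, json_file, indent=4)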
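
For annotation, the generated text usually needs to be pulled back out of the saved files. The sketch below assumes the saved dictionary nests the text under `candidates` → `content` → `parts` → `text`; that layout is an assumption about the output of `to_dict()` and should be checked against one of the actual JSON files. `extract_text` is a hypothetical helper.

```python
import json


def extract_text(response_dict: dict) -> str:
    """Join the text parts of the first candidate in a saved response dictionary."""
    candidates = response_dict.get("candidates", [])
    if not candidates:
        return ""
    parts = candidates[0].get("content", {}).get("parts", [])
    return "".join(part.get("text", "") for part in parts)


# Hand-written stand-in for a saved file, not real API output
sample = json.loads('{"candidates": [{"content": {"parts": [{"text": "example text"}]}}]}')
print(extract_text(sample))
```

For a real file, `sample` would instead come from `json.load` on one of the saved `.json` paths.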