# Setup

## Environment

1. Create a virtual environment or similar (this was built with Python 3.10, but 3.11 should work too), and install the packages listed in `requirements.txt`:
    ```bash
    pip install -r requirements.txt
    ```
2. Set up Google Cloud Application Default Credentials (see [this doc](https://cloud.google.com/docs/authentication/provide-credentials-adc)).
3. Copy the `.env.template` file to `.env` and set keys and other information as indicated.

## LLM and Data Objects

Load the `.env` file into the Python environment:

In [None]:
from dotenv import load_dotenv
load_dotenv(override=True)  

Initialize VertexAI:

In [None]:
import os
import vertexai
vertexai.init(
    project=os.environ.get("GOOGLE_PROJECT_NAME"),
    location=os.environ.get("GOOGLE_LOCATION",'us-east1'),
)
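Because `os.environ.get` returns `None` for an unset variable, a missing `GOOGLE_PROJECT_NAME` would not be flagged at this point. A minimal sketch of an up-front check follows. The helper name `missing_env_vars` is hypothetical, and treating only `GOOGLE_PROJECT_NAME` as required is an assumption based on it being the one variable read here without a default.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

# Assumption: only GOOGLE_PROJECT_NAME is treated as required, because the
# other GOOGLE_* variables in this notebook are read with fallback defaults.
REQUIRED_VARS = ["GOOGLE_PROJECT_NAME"]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv` would make a configuration gap visible before any cloud call is attempted.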

Initialize LangChain VertexAI components:

In [None]:
from langchain_google_vertexai import VertexAI
from langchain_google_vertexai import ChatVertexAI
from langchain_google_vertexai import VertexAIEmbeddings
from vertexai.generative_models import GenerativeModel

llmModel = VertexAI(model_name=os.environ.get('GOOGLE_LLM','gemini-1.5-flash'))
chatModel = ChatVertexAI(model=os.environ.get('GOOGLE_LLM','gemini-1.5-flash'))
embedModel = VertexAIEmbeddings(model_name=os.environ.get('GOOGLE_EMBED_MODEL','multimodalembedding')) 
genModel = GenerativeModel(model_name=os.environ.get('GOOGLE_LLM','gemini-1.5-flash'))

Initialize Cassio (Astra DB):

In [None]:
import cassio
cassio.init(auto=True)

And establish the graph store:

In [None]:
from ragstack_langchain.graph_store import CassandraGraphStore     

SITE_PREFIX="travel_docs"
graph_store = CassandraGraphStore(
    embedModel,
    node_table=f"{SITE_PREFIX}_nodes",
    edge_table=f"{SITE_PREFIX}_edges")

# Create LangChain `Document`s

The example `Tourbook.pdf` is fairly complex in structure, both digitally and visually. 

## Text

Several parsing tools, including Unstructured and Adobe ExtractAPI, were tried on the `Tourbook.pdf` file using both file-structure and OCR techniques, to no avail. The Vertex LLM was able to parse the file with a fairly generic prompt, but it exited early because it determined that it was repeating existing content, and it cited the URL of that content!

Because this notebook aims to demonstrate multi-modal embedding and retrieval rather than parsing, the information was parsed manually and put into the file `Tourbook.json`. The file contains the first 24 pages, minus the cover page, the table of contents, and a map on page 3.
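The loading code in this section reads three fields from each entry: `page_content`, `metadata.h1`, and `metadata.page_number`. The sketch below shows that shape together with a small check that could run before any documents are built. The sample text and heading are made up for illustration, and `validate_entry` is a hypothetical helper, not part of the notebook.

```python
import json

# Hypothetical sample in the shape the notebook's loader reads; the real
# Tourbook.json content is not reproduced here.
sample_json = """
[
  {
    "page_content": "Example paragraph text for one page.",
    "metadata": {"h1": "Example Section", "page_number": 1}
  }
]
"""

def validate_entry(entry):
    """Return a list of problems found in one entry (empty if it looks fine)."""
    problems = []
    if not isinstance(entry.get("page_content"), str):
        problems.append("page_content must be a string")
    metadata = entry.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("metadata must be an object")
        return problems
    for key in ("h1", "page_number"):
        if key not in metadata:
            problems.append(f"metadata.{key} is missing")
    return problems

entries = json.loads(sample_json)
for index, entry in enumerate(entries):
    for problem in validate_entry(entry):
        print(f"entry {index}: {problem}")
```

A check like this matters for hand-built data, where a missing `page_number` would otherwise surface only as a `KeyError` partway through loading.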

In [None]:
from langchain_core.documents import Document
from ragstack_knowledge_store.link_tag import BidirLinkTag
import json

with open('Tourbook.json', 'r') as file:
    text_data = json.load(file)

text_documents = []
h1_dict = {}

for entry in text_data:
    h1 = entry['metadata']['h1']
    # Record the H1 heading for each page; the image documents reuse it later.
    h1_dict[entry['metadata']['page_number']] = h1
    # Link this document to and from its H1 heading.
    link_h1 = BidirLinkTag(kind="h1", tag=h1)
    entry['metadata']['link_tags'] = [link_h1]
    doc = Document(page_content=entry['page_content'], metadata=entry['metadata'])
    text_documents.append(doc)

Note the `metadata.link_tags` list: each document links to and from its H1 heading, which corresponds to the section. As a result, every document in a section is linked to the other documents in that section.
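To picture the effect of a bidirectional tag, documents that carry the same heading can be thought of as one group in which every member is reachable from every other. The sketch below only illustrates that grouping with plain dictionaries. It does not use or reproduce how the graph store resolves links, and the sample ids and headings are made up.

```python
from collections import defaultdict

# Hypothetical documents as (id, h1 heading) pairs. Text and image entries
# share headings the way the notebook's Documents share an "h1" link tag.
sample_docs = [
    ("text-1", "Section A"),
    ("image-1", "Section A"),
    ("text-2", "Section B"),
]

def group_by_tag(docs):
    """Map each heading to the ids of the documents tagged with it."""
    groups = defaultdict(list)
    for doc_id, h1 in docs:
        groups[h1].append(doc_id)
    return dict(groups)

def linked_to(doc_id, docs):
    """Return the ids that share a heading with `doc_id`, excluding itself."""
    for members in group_by_tag(docs).values():
        if doc_id in members:
            return [other for other in members if other != doc_id]
    return []

print(linked_to("text-1", sample_docs))
```

In this toy data, the text and image under "Section A" reach each other, while the lone document under "Section B" has no neighbors.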

## Images

For images, we use `PyMuPDF` to extract each image from the document, `base64`-encode it, and create a `Document` that references the appropriate H1 heading for the page.

In [None]:
import pymupdf
import base64

doc = pymupdf.open('Tourbook.pdf')
image_documents = []

# page_index is zero-based: these indices cover PDF pages 3 and 5-23,
# which carry page numbers 1 and 3-21 (page_index - 1).
pages_to_process = [2] + list(range(4, 23))

for page_index in pages_to_process:
    page = doc[page_index]
    image_list = page.get_images()
    adjusted_page_number = page_index - 1

    # Iterate over the images on the page
    for image_index, img in enumerate(image_list, start=1):
        xref = img[0]
        pix = pymupdf.Pixmap(doc, xref)
        # More than 3 color components (e.g. CMYK): convert to RGB before writing PNG.
        if pix.n - pix.alpha > 3:
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

        base64_image = base64.b64encode(pix.tobytes(output="png")).decode('utf-8')

        if adjusted_page_number % 2 == 0:  # If it's even
            page_spread = f"{adjusted_page_number}-{adjusted_page_number+1}"
        else:  # If it's odd
            page_spread = f"{adjusted_page_number-1}-{adjusted_page_number}"

        h1 = h1_dict[adjusted_page_number]
        link_h1 = BidirLinkTag(kind="h1", tag=h1)
        doc_metadata = {
            "mime_type": "image/png",
            "mime_encoding": "base64",
            "page_number": adjusted_page_number,
            "page_spread": page_spread,
            "image_index": image_index,
            "h1": h1,
            "link_tags": [link_h1],
        }
        # In theory, embed_image() in langchain_google_vertexai calls ImageBytesLoader.load_bytes,
        # which can take a base64 string. That did not work here, but wrapping it in a data URI does.
        image_doc = Document(page_content=f"data:image/png;base64,{base64_image}", metadata=doc_metadata)
        image_documents.append(image_doc)
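The spread arithmetic and the data-URI wrapping in the loop above can be pulled into small functions and checked without a PDF. This is a sketch that follows the same convention as the loop, where an even page number starts a spread. `page_spread` and `to_data_uri` are hypothetical helper names.

```python
import base64

def page_spread(page_number):
    """Return the two-page spread containing `page_number`, e.g. "4-5"."""
    first = page_number if page_number % 2 == 0 else page_number - 1
    return f"{first}-{first + 1}"

def to_data_uri(image_bytes, mime_type="image/png"):
    """Wrap raw image bytes in a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

print(page_spread(3))
print(to_data_uri(b"abc"))
```

Keeping these as functions makes edge cases easy to inspect, such as page 1 mapping to the spread "0-1".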

# Load Knowledge Store

In [None]:
docs = []
for doc in text_documents + image_documents:
    docs.append(doc)

    if len(docs) >= 50:
        print("saving batch")
        graph_store.add_documents(docs)
        docs.clear()

if docs:
    print("saving batch")
    graph_store.add_documents(docs)
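The batching loop above can also be expressed as a small generator, which keeps the final partial batch from needing its own branch. This is a sketch only. `batched` is a hypothetical helper written here because the notebook targets Python 3.10, while `itertools.batched` arrives in Python 3.12.

```python
def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Emit whatever is left over after the last full batch.
        yield batch

for batch in batched(range(7), 3):
    print(f"saving batch of {len(batch)}")
```

With such a helper, the loading cell would reduce to calling `graph_store.add_documents(batch)` for each `batch` in `batched(text_documents + image_documents, 50)`.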