### Test Notebook for running Cortex AI with local code

To run this notebook:

1. Install Python 3.12.4. At the time of writing, that is the latest version for which `snowflake-connector-python` ships a prebuilt wheel, as seen in this post (https://stackoverflow.com/questions/79135647/pip-install-snowflake-connector-python-fails-building-wheels). Later versions require Microsoft Visual C++ 14.0 to build the package, which I didn't test.

2. Create a virtual environment and install the packages listed in `requirements.txt`.

3. Store your Snowflake credentials in Windows Credential Manager (follow this post: https://medium.com/@aarhar/password-management-in-python-keyring-and-credential-manager-29fa4ccc919e).

4. Run a query to test your connection. (Currently it's not working, but the "basics" are there.)
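
The version constraint in step 1 can be checked before installing anything. This is a minimal sketch, assuming Python 3.12 is the cutoff described above; `is_supported_python` and `MAX_SUPPORTED` are hypothetical names, not part of any library.

```python
import sys

# Assumption: 3.12 is the newest minor version with a prebuilt wheel (see step 1).
MAX_SUPPORTED = (3, 12)


def is_supported_python(version=None, max_supported=MAX_SUPPORTED):
    """Return True if the (major, minor) version is at most max_supported."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) <= tuple(max_supported)


if not is_supported_python():
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} may "
        "need a C++ build toolchain to install snowflake-connector-python."
    )
```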

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import keyring
import os 
import snowflake.connector as sf_connector
import fitz 
import pdfplumber
from langchain.text_splitter import RecursiveCharacterTextSplitter
from collections import defaultdict
from PIL import Image
from io import BytesIO

In [3]:
account_identifier = keyring.get_password('NC_Snowflake_Trial_Account_Name', 'account_identifier')
user_name = "EMILHALDAN5468402"
password = keyring.get_password('NC_Snowflake_Trial_User_Password', user_name)

print("Account Identifier: ", account_identifier)
print("User Name: ", user_name)
# print("Password: ", password)


Account Identifier:  EDBHJWL-MFB05236
User Name:  EMILHALDAN5468402


## Setup in Snowflake

1. Create the database structure in Python first, and create the database tables in Snowflake only once the processing is done. The PDF files are chaotic and need a lot of cleaning and processing before they fit a fixed schema.

2. 


## The intended database structure is as follows:

- **manuals** (Stores metadata about each manual)  
  - `manual_id` (Unique ID for each manual)  
  - `doc_name` (File name of the source document)  
  - `title` (Title of the manual)  
  - `version` (Version or revision number)  
  - `language` (Language code, e.g., 'en', 'de')  
  - `source_path` (Original PDF file path or S3 URL)   

- **sections** (Defines logical sections and subsections within each manual)  
  - `section_id` (Unique ID for the section)  
  - `manual_id` (Foreign key referencing `manuals`)  
  - `title` (Title or heading of the section)  
  - `order_num` (Numerical order of the section in the manual)  
  - `parent_section_id` (Optional FK for nested subsections)  

- **chunks** (Text chunks derived from section content for LLMs or search)  
  - `chunk_id` (Unique ID for the chunk)  
  - `section_id` (Foreign key referencing `sections`)  
  - `chunk_text` (The text content of the chunk)  
  - `chunk_order` (Order of the chunk within the section)  
  - `embedding` (Optional vector for semantic search or embeddings)  

- **images** (Stores references to images extracted from the manual)  
  - `image_id` (Unique ID for the image)  
  - `manual_id` (Foreign key referencing `manuals`)  
  - `section_id` (Foreign key referencing `sections`)  
  - `image_path` (S3 or web-accessible path to the image)  
  - `caption` (Optional caption or alt text)  
  - `order_num` (Display order within the section)  
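
The structure above can be tried out locally before any Snowflake tables exist. The sketch below uses Python's built-in `sqlite3` module as a stand-in. The SQLite column types, the nullable `section_id` on `images`, and the `create_schema` helper are assumptions for illustration, not the final Snowflake DDL.

```python
import sqlite3

# Assumption: SQLite types stand in for the eventual Snowflake column types.
SCHEMA = """
CREATE TABLE manuals (
    manual_id   INTEGER PRIMARY KEY,
    doc_name    TEXT,
    title       TEXT,
    version     TEXT,
    language    TEXT,
    source_path TEXT
);
CREATE TABLE sections (
    section_id        INTEGER PRIMARY KEY,
    manual_id         INTEGER NOT NULL REFERENCES manuals(manual_id),
    title             TEXT,
    order_num         INTEGER,
    parent_section_id INTEGER REFERENCES sections(section_id)
);
CREATE TABLE chunks (
    chunk_id    INTEGER PRIMARY KEY,
    section_id  INTEGER NOT NULL REFERENCES sections(section_id),
    chunk_text  TEXT,
    chunk_order INTEGER,
    embedding   BLOB
);
CREATE TABLE images (
    image_id   INTEGER PRIMARY KEY,
    manual_id  INTEGER NOT NULL REFERENCES manuals(manual_id),
    section_id INTEGER REFERENCES sections(section_id),
    image_path TEXT,
    caption    TEXT,
    order_num  INTEGER
);
"""


def create_schema(conn):
    """Create the four tables and turn on foreign-key enforcement."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)
```

With foreign keys enforced, inserting a section that points at a missing manual fails immediately, which catches ordering mistakes in the processing code early.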


In [None]:
# Column names follow the intended database structure described above
manuals_df = pd.DataFrame(columns=['manual_id', 'doc_name', 'title', 'version', 'language', 'source_path'])
sections_df = pd.DataFrame(columns=['section_id', 'manual_id', 'title', 'order_num', 'parent_section_id'])
chunks_df = pd.DataFrame(columns=['chunk_id', 'section_id', 'chunk_text', 'chunk_order', 'embedding'])
images_df = pd.DataFrame(columns=['image_id', 'manual_id', 'section_id', 'image_path', 'caption', 'order_num'])

In [9]:
pdf_files_path = "./Washer_Manuals"

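for filename in os.listdir(pdf_files_path):
    # Match the extension case-insensitively so ".PDF" files are included
    if filename.lower().endswith(".pdf"):
        file_path = os.path.join(pdf_files_path, filename)
        print(file_path)
        # do something with the file

# Note: file_path keeps the last PDF found and is reused by the cells below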

./Washer_Manuals\k714wm14 service manual.pdf
./Washer_Manuals\mmo_87050793_1630397705_64_10689.pdf
./Washer_Manuals\technical-manual-w11663204-revb.pdf
./Washer_Manuals\WAK20160IN.pdf
./Washer_Manuals\WAN28258GB.pdf
./Washer_Manuals\WAN28282GC.pdf
./Washer_Manuals\Washing machine Top-loader C series.pdf
./Washer_Manuals\WAT24168IN.pdf
./Washer_Manuals\WAV28KH3GB.pdf
./Washer_Manuals\WFL2050.pdf
./Washer_Manuals\WGA1340SIN.pdf
./Washer_Manuals\WGA1420SIN.pdf
./Washer_Manuals\WGE03408GB.pdf
./Washer_Manuals\WGG254Z0GB.pdf


In [None]:
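## Extracting metadata from the PDF files

def extract_manual_metadata(file_path):
    # Open the document only long enough to read its metadata dictionary
    with fitz.open(file_path) as doc:
        metadata = doc.metadata or {}
    return {
        # Fall back to the file name; os.path.basename uses the platform's path separators
        "title": metadata.get("title") or os.path.basename(file_path),
        # The PDF modification date stands in for a version when present
        "version": metadata.get("modDate") or "v1",
        "language": "en",  # Default, or use NLP detection
        "source_path": file_path
    }

extract_manual_metadata(file_path)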


{'title': 'User manual and installation instructions WGG254Z0GB  WGG254ZSGB | Bosch',
 'version': "D:20240116140125+01'00",
 'language': 'en',
 'source_path': './Washer_Manuals\\WGG254Z0GB.pdf'}
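
The `version` value above is a raw PDF date string. If a sortable timestamp is more useful, it can be parsed with the standard library. This is a sketch under the assumption that the strings follow the common `D:YYYYMMDDHHmmSS+HH'mm` layout; `parse_pdf_date` is a hypothetical helper, not a PyMuPDF function.

```python
import re
from datetime import datetime, timedelta, timezone

# Date parts after the year are optional; the offset may be "Z", "+HH'mm", or "-HH'mm".
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string into a datetime; return None if it does not match."""
    match = _PDF_DATE.match(value or "")
    if not match:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    tzinfo = None
    if sign == "Z":
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tzinfo = timezone(-offset if sign == "-" else offset)
    try:
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range parts such as month 13
        return None


parsed = parse_pdf_date("D:20240116140125+01'00")
print(parsed.isoformat())
```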

In [37]:
## Exploring the PDF files and their structure

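def explore_pdf_fonts_llm_ready(file_path: str, sort_by: str = "size", examples_included: int = 8, verbose: int = 0) -> str:
    """
    Group the text spans of a PDF by (font size, font name) and summarize each group.

    Args:
        file_path (str): Local path to the PDF file.
        sort_by (str, optional): "size" sorts groups by descending font size; any other value sorts by font name. Defaults to "size".
        examples_included (int, optional): Maximum number of example spans listed per font group. Defaults to 8.
        verbose (int, optional): 0, 1, or 2. 1 prints each font group and its examples; 2 also prints page progress, which helps confirm that the whole document was read. Defaults to 0.

    Returns:
        str: One block per font group (size, font, occurrence count, example spans), separated by blank lines.
    """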

    doc = fitz.open(file_path)
    font_data = defaultdict(list)

    for p_idx,page in enumerate(doc):
        if verbose > 1:
            # used to show the pages processed to compare with the actual document
            print(f"Processing page {p_idx + 1}/{len(doc)}...")
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:
            for line in b.get("lines", []):
                for span in line.get("spans", []):
                    text = span["text"].strip()
                    if text:
                        font_key = (round(span["size"], 2), span["font"])
                        font_data[font_key].append({
                            "text": text,
                            "size": round(span["size"], 2),
                            "font": span["font"],
                            "page": p_idx + 1,
                            "bbox": span.get("bbox", None),
                            "position": (round(span["origin"][0], 1), round(span["origin"][1], 1))
                        })

    # Sort by font size or font name
    sorted_fonts = sorted(font_data.items(), key=lambda x: -x[0][0] if sort_by == "size" else x[0][1])

    # Build LLM-friendly string
    lines = []
    for (size, font), entries in sorted_fonts:
        line_of_text = f"Font Group: size={size}, font='{font}' (occurrences={len(entries)})"
        lines.append(line_of_text)
        if verbose > 0:
            print("\n"+line_of_text)
        for entry in entries[:examples_included]:  # Show up to examples_included examples per group
            line_of_text = f"  [Page {entry['page']}] (x={entry['position'][0]}, y={entry['position'][1]}) → {entry['text']}"
            lines.append(line_of_text)
            if verbose > 0:
                print(f"    [Page {entry['page']}] (x={entry['position'][0]}, y={entry['position'][1]}) → {entry['text']}")
        lines.append("")  # blank line between font groups

    return "\n".join(lines)

explore_pdf_fonts_llm_ready(file_path, sort_by="size", examples_included=5, verbose = 1)



Font Group: size=28.0, font='BoschSans-Regular' (occurrences=1)
    [Page 1] (x=31.5, y=279.2) → Washing machine

Font Group: size=20.0, font='BoschSans-Bold' (occurrences=4)
    [Page 52] (x=112.0, y=103.0) → Thank you for buying a
    [Page 52] (x=112.0, y=123.7) → Bosch Home Appliance!
    [Page 52] (x=112.0, y=327.6) → Looking for help?
    [Page 52] (x=112.0, y=348.3) → You'll find it here.

Font Group: size=15.37, font='ArialUnicodeMS' (occurrences=1)
    [Page 19] (x=293.7, y=426.2) → 4

Font Group: size=14.51, font='ArialUnicodeMS' (occurrences=13)
    [Page 18] (x=38.6, y=341.7) → 1
    [Page 18] (x=38.6, y=277.9) → 2
    [Page 18] (x=136.7, y=161.8) → 4
    [Page 18] (x=260.5, y=162.2) → 6
    [Page 18] (x=78.1, y=161.8) → 3

Font Group: size=14.45, font='ArialUnicodeMS' (occurrences=9)
    [Page 19] (x=98.3, y=183.6) → 2
    [Page 19] (x=185.1, y=107.5) → 3
    [Page 19] (x=75.8, y=183.6) → 1
    [Page 19] (x=185.1, y=89.6) → 4
    [Page 32] (x=354.3, y=164.5) → 1

Font Gro
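
One possible use of these font groups is to guess which sizes are section headings. The sketch below is a rough heuristic, not a tested method: it assumes the most frequent size is body text and treats larger sizes as heading levels. `rank_heading_sizes` is a hypothetical helper and the example groups are made-up numbers. Figure callouts, like the single digits in the ArialUnicodeMS groups above, would be misclassified and need extra filtering.

```python
def rank_heading_sizes(font_groups, max_levels=3):
    """Map font sizes larger than the body size to heading levels (1 = largest).

    font_groups: iterable of (size, font_name, occurrences) tuples.
    The body size is taken to be the size with the most occurrences overall.
    """
    totals = {}
    for size, _font, occurrences in font_groups:
        totals[size] = totals.get(size, 0) + occurrences
    if not totals:
        return {}
    body_size = max(totals, key=totals.get)
    larger = sorted((s for s in totals if s > body_size), reverse=True)
    return {size: level for level, size in enumerate(larger[:max_levels], start=1)}


# Hypothetical groups for illustration only
example_groups = [
    (28.0, "TitleFont", 1),
    (20.0, "HeadingFont", 4),
    (9.0, "BodyFont", 500),
    (7.0, "FootnoteFont", 40),
]
levels = rank_heading_sizes(example_groups)
print(levels)
```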



In [41]:
## Extracting overlapping text chunks from the PDF files

def extract_text_chunks(file_path, chunk_size=512, chunk_overlap=200):
    with pdfplumber.open(file_path) as pdf:
        full_text = "\n".join([page.extract_text() or "" for page in pdf.pages])

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_text(full_text)
    return [{"chunk_order": i, "chunk_text": c} for i, c in enumerate(chunks)]

extract_text_chunks(file_path, 512, 128)[:5]  # Show first 5 chunks

[{'chunk_order': 0,
  'chunk_text': 'Register\nyour\nb M o ge s y n c t B e h f o r w - e s h e c d o h e b m v n e e i n o c . e w e c f o o i a t m n s n : d /\nwelcome\nWashing machine\nWGG254Z0GB\nWGG254ZSGB\n[en] User manual and installation\ninstructions\nen\nFurther information and explanations are\navailable online:\nTable of contents\n1Safety........................................... 4 8Buttons...................................... 22\n1.1 General information................... 4\n9Programmes.............................. 24'},
 {'chunk_order': 1,
  'chunk_text': '1.1 General information................... 4\n9Programmes.............................. 24\n1.2 Intended use.............................. 4\n1.3 Restriction on user group.......... 4 10 Accessories............................. 28\n1.4 Safe installation......................... 5\n1.5 Safe use.................................... 7 11 Laundry.................................... 28\n11.1 Preparing the laundry.........
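
The repeated line at the start of chunk 1 comes from `chunk_overlap`. To show the overlap effect in isolation, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works, since that class prefers to split at separators such as paragraph breaks and newlines and treats `chunk_size` as an upper bound. `sliding_window_chunks` is a hypothetical stand-in.

```python
def sliding_window_chunks(text, chunk_size=512, chunk_overlap=128):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return [{"chunk_order": i, "chunk_text": c} for i, c in enumerate(chunks)]


demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print([c["chunk_text"] for c in demo])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same text.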

In [52]:
# Extract embedded images into Washer_Images/<manual name>/

def extract_images(file_path, output_dir):
    doc = fitz.open(file_path)
    images = []

    file_name = os.path.splitext(os.path.basename(file_path))[0]
    specific_output_dir = os.path.join(output_dir, file_name)
    os.makedirs(specific_output_dir, exist_ok=True)

    for i, page in enumerate(doc):
        img_list = page.get_images(full=True)
        for j, img in enumerate(img_list):
            xref = img[0]
            base_image = fitz.Pixmap(doc, xref)

            print(f"Page {i+1}, Img {j+1}, Size: {base_image.width}x{base_image.height}, n={base_image.n}")

            # Convert CMYK (with or without alpha) to RGB, since PNG cannot store CMYK
            if base_image.n - base_image.alpha >= 4:
                base_image = fitz.Pixmap(fitz.csRGB, base_image)

            # Grayscale and RGB pixmaps, with or without alpha, can be saved as PNG
            img_path = os.path.join(specific_output_dir, f"img_p{i+1}_{j+1}.png")
            base_image.save(img_path)
            images.append({
                "page": i + 1,
                "image_path": img_path,
                "order_num": j
            })

            base_image = None  # Free memory

    return images

extract_images(file_path, "Washer_Images")

Page 1, Img 1, Size: 1x1, n=3
Page 5, Img 1, Size: 115x100, n=3
Page 6, Img 1, Size: 115x100, n=3
Page 7, Img 1, Size: 115x100, n=3
Page 8, Img 1, Size: 115x100, n=3
Page 9, Img 1, Size: 115x100, n=3
Page 12, Img 1, Size: 115x100, n=3
Page 33, Img 1, Size: 115x100, n=3
Page 35, Img 1, Size: 115x100, n=3
Page 37, Img 1, Size: 115x100, n=3
Page 47, Img 1, Size: 115x100, n=3
Page 52, Img 1, Size: 1x1, n=3


[{'page': 1,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p1_1.png',
  'order_num': 0},
 {'page': 5,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p5_1.png',
  'order_num': 0},
 {'page': 6,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p6_1.png',
  'order_num': 0},
 {'page': 7,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p7_1.png',
  'order_num': 0},
 {'page': 8,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p8_1.png',
  'order_num': 0},
 {'page': 9,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p9_1.png',
  'order_num': 0},
 {'page': 12,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p12_1.png',
  'order_num': 0},
 {'page': 33,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p33_1.png',
  'order_num': 0},
 {'page': 35,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p35_1.png',
  'order_num': 0},
 {'page': 37,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p37_1.png',
  'order_num': 0},
 {'page': 47,
  'image_path': 'Washer_Images\\WGG254Z0GB\\img_p47_1.png',
  'order
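
The log above includes 1x1 images on pages 1 and 52, which are unlikely to be useful illustrations. A minimal sketch of a size filter follows. It assumes each record also carries `width` and `height` keys, which `extract_images` does not return as written, and the 16-pixel thresholds are arbitrary. `drop_tiny_images` is a hypothetical helper.

```python
def drop_tiny_images(images, min_width=16, min_height=16):
    """Keep only records whose width and height reach the given minimums."""
    return [
        img for img in images
        if img["width"] >= min_width and img["height"] >= min_height
    ]


# Two records shaped like the log lines above (page 1 is 1x1, page 5 is 115x100)
records = [
    {"page": 1, "width": 1, "height": 1},
    {"page": 5, "width": 115, "height": 100},
]
kept = drop_tiny_images(records)
print([img["page"] for img in kept])
```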

In [None]:

connection_parameters = {
    "account_identifier": account_identifier,
    "user": user_name,
    "password": password,
    "role": "ACCOUNTADMIN",
    "warehouse": "COMPUTE_WH",
    "database": "CYBERSYN",
    "schema": "PUBLIC"
}

# Connect to Snowflake
conn = sf_connector.connect(
    user=connection_parameters['user'],
    password=connection_parameters['password'],
    account=connection_parameters['account_identifier'],
    warehouse=connection_parameters['warehouse'],
    database=connection_parameters['database'],
    schema=connection_parameters['schema'],
    role=connection_parameters['role']
)

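# Test the connection
cursor = conn.cursor()
try:
    cursor.execute("""
        SELECT * 
        FROM company_metadata;
    """)
    # fetch_pandas_all() needs the connector installed with its pandas extra
    df = cursor.fetch_pandas_all()
finally:
    # Release the cursor and the connection even if the query fails
    cursor.close()
    conn.close()

df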
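
`keyring.get_password` returns `None` when no matching credential is stored, so one cheap check before connecting is to confirm that no parameter is empty. This is a sketch only, not a diagnosis of the failing connection. `find_missing_parameters` is a hypothetical helper and the example values are placeholders.

```python
def find_missing_parameters(params, required=("account", "user", "password")):
    """Return the required keys whose values are missing, None, or empty."""
    return [key for key in required if not params.get(key)]


# Placeholder values for illustration only
example_params = {"account": "my_account", "user": "my_user", "password": None}
missing = find_missing_parameters(example_params)
if missing:
    print(f"Missing connection parameters: {', '.join(missing)}")
```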