### Test Notebook for running Cortex AI with local code

To run this notebook:

1. Install Python 3.12.4. According to this post (https://stackoverflow.com/questions/79135647/pip-install-snowflake-connector-python-fails-building-wheels), it is the latest version for which `snowflake-connector-python` ships a prebuilt wheel. Later versions demand an installation of Visual Studio C++ 14.0 to build the package, which I didn't test.

2. Create a virtual environment and install the packages listed in requirements.txt.

3. Create Snowflake credentials in Windows Credential Manager (follow this post: https://medium.com/@aarhar/password-management-in-python-keyring-and-credential-manager-29fa4ccc919e).

4. Run a query to test your connection (currently it's not working, but the "basics" are there).

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import keyring
import os 
import snowflake.connector as sf_connector
import fitz 
import pdfplumber
from langchain.text_splitter import RecursiveCharacterTextSplitter
from collections import defaultdict
from PIL import Image
from io import BytesIO
import layoutparser as lp
import cv2
import pymupdf4llm
import pathlib

## TLDR: This notebook

The idea of this notebook is to create NLP and image-oriented functions for extracting and structuring information from a PDF.

It currently contains the following components:
- A setup/config block of code intended to connect to Snowflake databases.
- A suggested database structure for storing the information extracted from the PDF.
- `extract_manual_metadata()`: Extracts metadata from a PDF file using PyMuPDF.
- `explore_pdf_fonts_llm_ready()`: Builds a text summary of the fonts and font sizes used in a PDF file, along with examples of each. The intended use is to feed this string to a RAG model to help it understand the structure of the PDF, and perhaps enable it to label sections of the text as "title", "subtitle", "paragraph", "warnings", etc.
- `extract_text_chunks()`: Extracts all the text from a PDF file using pdfplumber and splits it into a list of overlapping string chunks. Could potentially be replaced by CORTEX from Snowflake for more accurate results.
- `extract_images_from_pdf()`: Extracts all the images from a PDF using a sequence of functions.
    1. Each page of the PDF is rendered as an image using `render_pdf_to_images()`.
    2. Each page is then processed by `detect_image_regions()`, which uses classic computer vision techniques:
        2.1 `cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)` converts the image to grayscale.
        2.2 `cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)` applies an inverted binary threshold to the grayscale image: pixels brighter than 240 (the near-white page background) are set to 0 (black), and all other pixels are set to 255 (white). Images and other dark content therefore show up as white regions on a black background.
        2.3 `cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)` finds the outer contours of these white regions. Each contour is reduced to a bounding box, and only boxes wider and taller than 50 pixels are kept, which skips small blocks such as individual characters.
    3. Using the coordinates of the remaining regions, which are presumed to be images, `crop_regions_from_image()` crops each region out of the original page image. The crops are saved in a folder named after the PDF file, and the function returns a list of metadata for the saved images.
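
To make the thresholding step concrete, here is a minimal, dependency-free sketch of what an inverted binary threshold does to a tiny grayscale "page". It is not the notebook's implementation: `threshold_binary_inv` and `bounding_box` are hypothetical pure-Python stand-ins for `cv2.threshold(..., cv2.THRESH_BINARY_INV)` and `cv2.boundingRect`, and the pixel values are made up.

```python
def threshold_binary_inv(gray, thresh=240, maxval=255):
    """Stand-in for cv2.THRESH_BINARY_INV: pixels above `thresh` become 0,
    all other pixels become `maxval`."""
    return [[0 if px > thresh else maxval for px in row] for row in gray]


def bounding_box(mask):
    """Return (x1, y1, x2, y2) enclosing all non-zero pixels, or None if the
    mask is empty. x2 and y2 are exclusive, like x + w and y + h."""
    coords = [(x, y) for y, row in enumerate(mask) for x, px in enumerate(row) if px]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# A 6x5 "page": 255 is white background, darker values stand in for an image.
page = [
    [255, 255, 255, 255, 255, 255],
    [255, 120,  90, 200, 255, 255],
    [255,  60,  30, 180, 255, 255],
    [255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
]

mask = threshold_binary_inv(page)
box = bounding_box(mask)
print(mask[1])  # the background is 0, the "image" pixels are 255
print(box)      # region that would be cropped from the page
```

The background ends up black and the darker content white, which is why the contour step operates on the white regions.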


### Notes and considerations: 
- The image extraction process intentionally avoids modern object segmentation techniques like YOLO or Mask R-CNN, because I found that these methods have a lot of dependencies which I couldn't make compatible with Python 3.12.4. It might be possible, but I believe the current contour-based method is possibly more effective, and orders of magnitude faster, than a deep learning model. This has only been tested on one PDF, so more testing is needed before relying on it.


### Import issues 

- PyMuPDF and fitz have proved slightly frustrating. The best solution I have found so far is to run the following after installing requirements.txt:

```shell
pip uninstall fitz
pip install --upgrade --force-reinstall pymupdf
```

In [None]:
account_identifier = keyring.get_password('NC_Snowflake_Trial_Account_Name', 'account_identifier')
user_name = "JESPEREDSTROM"
password = keyring.get_password('NC_Snowflake_Trial_User_Password', user_name)

print("Account Identifier: ", account_identifier)
print("User Name: ", user_name)
# print("Password: ", password)


Account Identifier:  EDBHJWL-MFB05236
User Name:  EMILHALDAN5468402


In [7]:

connection_parameters = {
    "account_identifier": account_identifier,
    "user": user_name,
    "password": password,
    "role": "ACCOUNTADMIN",
    "warehouse": "COMPUTE_WH",
    "database": "CYBERSYN",
    "schema": "PUBLIC"
}

# Connect to Snowflake
conn = sf_connector.connect(
    user=connection_parameters['user'],
    password=connection_parameters['password'],
    account=connection_parameters['account_identifier'],
    warehouse=connection_parameters['warehouse'],
    database=connection_parameters['database'],
    schema=connection_parameters['schema'],
    role=connection_parameters['role']
)

# Test the connection
cursor = conn.cursor()
cursor.execute("""
    SELECT * 
    FROM company_metadata;
""")

df = cursor.fetch_pandas_all()
df.head(3)

Unnamed: 0,CYBERSYN_COMPANY_ID,COMPANY_NAME,PERMID_SECURITY_ID,PRIMARY_TICKER,SECURITY_NAME,ASSET_CLASS,PRIMARY_EXCHANGE_CODE,PRIMARY_EXCHANGE_NAME,SECURITY_STATUS,GLOBAL_TICKERS,EXCHANGE_CODE,PERMID_QUOTE_ID
0,07e85915f5b12be5f0f3fe67276ffc08,"CELSIUS HOLDINGS, INC.",8589948642,CELH,CELSIUS HOLDINGS ORD SHS,Ordinary Shares,NAS,NASDAQ CAPITAL MARKET,Active,"[\n ""CELH""\n]","[\n ""NAS""\n]","[\n ""25727435733""\n]"
1,0ac135dfdf2b133215109ccc15754a62,CHURCH & DWIGHT CO INC /DE/,8590943794,CHD,CHURCH AND DWIGHT ORD SHS,Ordinary Shares,NYS,NEW YORK STOCK EXCHANGE,Active,"[\n ""0R13"",\n ""0R13l"",\n ""CHD"",\n ""CHD*"",\...","[\n ""BCO"",\n ""LSE"",\n ""MCX"",\n ""MXQ"",\n ""...","[\n ""21550585218"",\n ""21634298419"",\n ""2167..."
2,1443e2413b8a7cb8126ec02fb9096da6,GENERAL MILLS INC,8590933269,GIS,GENERAL MILLS ORD SHS,Ordinary Shares,NYS,NEW YORK STOCK EXCHANGE,Active,"[\n ""0R1X"",\n ""0R1Xl"",\n ""GIS"",\n ""GIS*"",\...","[\n ""BCO"",\n ""BIV"",\n ""LSE"",\n ""MCX"",\n ""...","[\n ""21550585154"",\n ""21634298003"",\n ""2165..."


## The intended database structure is as follows:

- **manuals** (Stores metadata about each manual)  
  - `manual_id` (Unique ID for each manual)  
  - `doc_name` (File name of the source document)  
  - `title` (Title of the manual)  
  - `version` (Version or revision number)  
  - `language` (Language code, e.g., 'en', 'de')  
  - `source_path` (Original PDF file path or S3 URL)   

- **sections** (Defines logical sections and subsections within each manual)  
  - `section_id` (Unique ID for the section)  
  - `manual_id` (Foreign key referencing `manuals`)  
  - `title` (Title or heading of the section)  
  - `order_num` (Numerical order of the section in the manual)  
  - `parent_section_id` (Optional FK for nested subsections)  

- **chunks** (Text chunks derived from section content for LLMs or search) (Perhaps create multiple chunk tables for different chunk sizes)  
  - `chunk_id` (Unique ID for the chunk)  
  - `section_id` (Foreign key referencing `sections`)  
  - `chunk_text` (The text content of the chunk)  
  - `chunk_order` (Order of the chunk within the section)  
  - `embedding` (Vector for semantic search or embeddings)  

- **images** (Stores references to images extracted from the manual)  
  - `image_id` (Unique ID for the image)  
  - `manual_id` (Foreign key referencing `manuals`)  
  - `section_id` (Foreign key referencing `sections`)  
  - `image_path` (S3 or web-accessible path to the image)  
  - `caption` (Optional caption or alt text)  
  - `order_num` (Display order within the section)  
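
The structure above can be sketched as a relational schema. The example below uses Python's built-in `sqlite3` with an in-memory database purely as a local stand-in for Snowflake. The column types, the `demo_conn` name, and the sample rows are assumptions for illustration, and the embedding is stored as a plain BLOB placeholder.

```python
import sqlite3

# Hypothetical DDL mirroring the four tables described above.
SCHEMA = """
CREATE TABLE manuals (
    manual_id   INTEGER PRIMARY KEY,
    doc_name    TEXT,
    title       TEXT,
    version     TEXT,
    language    TEXT,
    source_path TEXT
);
CREATE TABLE sections (
    section_id        INTEGER PRIMARY KEY,
    manual_id         INTEGER NOT NULL REFERENCES manuals(manual_id),
    title             TEXT,
    order_num         INTEGER,
    parent_section_id INTEGER REFERENCES sections(section_id)
);
CREATE TABLE chunks (
    chunk_id    INTEGER PRIMARY KEY,
    section_id  INTEGER NOT NULL REFERENCES sections(section_id),
    chunk_text  TEXT,
    chunk_order INTEGER,
    embedding   BLOB
);
CREATE TABLE images (
    image_id   INTEGER PRIMARY KEY,
    manual_id  INTEGER NOT NULL REFERENCES manuals(manual_id),
    section_id INTEGER REFERENCES sections(section_id),
    image_path TEXT,
    caption    TEXT,
    order_num  INTEGER
);
"""

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
demo_conn.executescript(SCHEMA)

# Illustrative rows: one manual, one section, one chunk.
demo_conn.execute(
    "INSERT INTO manuals (manual_id, doc_name, title, language) "
    "VALUES (1, 'WGG254Z0GB.pdf', 'Washing machine', 'en')"
)
demo_conn.execute(
    "INSERT INTO sections (section_id, manual_id, title, order_num) "
    "VALUES (1, 1, 'Safety', 1)"
)
demo_conn.execute(
    "INSERT INTO chunks (chunk_id, section_id, chunk_text, chunk_order) "
    "VALUES (1, 1, 'General information', 0)"
)

# Walk from a chunk back to its section and manual through the foreign keys.
rows = demo_conn.execute(
    "SELECT m.doc_name, s.title, c.chunk_text "
    "FROM chunks c "
    "JOIN sections s ON s.section_id = c.section_id "
    "JOIN manuals m ON m.manual_id = s.manual_id"
).fetchall()
print(rows)
```

Keeping `manual_id` on `images` as well as `section_id` means an image can still be tied to its manual when no section has been detected for it.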


In [8]:
manuals_df = pd.DataFrame(columns=['Manual_ID', 'Doc_Name', 'Title', 'Version', 'Language', 'Source_Path'])
sections_df = pd.DataFrame(columns=['Section_ID', 'Manual_ID', 'Doc_Name', 'Order_Num', 'Parent_Section_ID'])
chunks_df = pd.DataFrame(columns=['Chunk_ID', 'Section_ID', 'Manual_ID', 'Order_Num', 'Chunk_Embedding_Vector'])
images_df = pd.DataFrame(columns=['Image_ID', 'Section_ID', 'Manual_ID', 'Order_Num', 'Image_Description' ,'Image_Path'])

In [9]:
pdf_files_path = "./Washer_Manuals"

for filename in os.listdir(pdf_files_path):
    if filename.endswith(".pdf"):
        file_path = os.path.join(pdf_files_path, filename)
        print(file_path)
        # do something with the file
    else:
        continue

./Washer_Manuals\k714wm14 service manual.pdf
./Washer_Manuals\mmo_87050793_1630397705_64_10689.pdf
./Washer_Manuals\technical-manual-w11663204-revb.pdf
./Washer_Manuals\WAK20160IN.pdf
./Washer_Manuals\WAN28258GB.pdf
./Washer_Manuals\WAN28282GC.pdf
./Washer_Manuals\Washing machine Top-loader C series.pdf
./Washer_Manuals\WAT24168IN.pdf
./Washer_Manuals\WAV28KH3GB.pdf
./Washer_Manuals\WFL2050.pdf
./Washer_Manuals\WGA1340SIN.pdf
./Washer_Manuals\WGA1420SIN.pdf
./Washer_Manuals\WGE03408GB.pdf
./Washer_Manuals\WGG254Z0GB.pdf
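
The listing above mixes `/` and `\` in the same path, because `os.path.join` uses the Windows separator. A minimal sketch of extracting the bare manual name regardless of which separator is used, assuming file names never contain a backslash themselves (`manual_stem` is a hypothetical helper, not part of the notebook):

```python
from pathlib import PureWindowsPath


def manual_stem(path: str) -> str:
    """Return the file name without directory or extension.

    PureWindowsPath treats both '/' and '\\' as separators, so the result
    is the same on Windows and on POSIX systems.
    """
    return PureWindowsPath(path).stem


print(manual_stem("./Washer_Manuals\\WGG254Z0GB.pdf"))
print(manual_stem("Washer_Manuals/k714wm14 service manual.pdf"))
```

A helper like this avoids relying on `split("/")`, which leaves the folder name attached when the path was built with backslashes.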


In [10]:
## Extracting metadata from the PDF files

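def extract_manual_metadata(file_path):
    # Open the PDF only long enough to read its metadata dictionary
    with fitz.open(file_path) as doc:
        metadata = doc.metadata or {}
    return {
        # Fall back to the file name; os.path.basename uses the platform's path separators
        "title": metadata.get("title") or os.path.basename(file_path),
        # The PDF modification date is used as a stand-in for a version number
        "version": metadata.get("modDate") or "v1",
        "language": "en",  # Default, or use NLP detection
        "source_path": file_path
    }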

extract_manual_metadata(file_path)


{'title': 'User manual and installation instructions WGG254Z0GB  WGG254ZSGB | Bosch',
 'version': "D:20240116140125+01'00",
 'language': 'en',
 'source_path': './Washer_Manuals\\WGG254Z0GB.pdf'}
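
The `version` value above is a raw PDF date string. If it is kept as a version stand-in, it may be easier to store as a real timestamp. A minimal sketch, assuming the standard PDF date layout `D:YYYYMMDDHHmmSS` followed by an optional UTC offset; `parse_pdf_date` is a hypothetical helper, not part of the notebook:

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z, +HH'mm' or -HH'mm' offset
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(\d{2})?'?(\d{2})?'?)?"
)


def parse_pdf_date(value):
    """Parse a PDF date string into a datetime, or return None if it does not match."""
    m = _PDF_DATE.match(value or "")
    if not m:
        return None
    parts = m.groups()
    defaults = (1, 1, 1, 0, 0, 0)  # used for any omitted date or time field
    year, month, day, hour, minute, second = (
        int(p) if p is not None else d for p, d in zip(parts[:6], defaults)
    )
    sign, tz_hours, tz_minutes = parts[6:]
    if sign in ("+", "-"):
        offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
        tz = timezone(-offset if sign == "-" else offset)
    elif sign == "Z":
        tz = timezone.utc
    else:
        tz = None  # no offset given: return a naive datetime
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


parsed = parse_pdf_date("D:20240116140125+01'00")
print(parsed.isoformat())
```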

In [11]:
## Exploring the PDF files and their structure

def explore_pdf_fonts_llm_ready(file_path : str, sort_by="size", examples_included:int = 8, verbose:int=0 ) -> str:
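    """
    Summarize the fonts and font sizes used in a PDF as an LLM-friendly string.

    Args:
        file_path (str): Local path to the PDF file.
        sort_by (str, optional): "size" sorts groups by descending font size; any other value sorts by font name. Defaults to "size".
        examples_included (int, optional): Maximum number of example text spans listed per font group. Defaults to 8.
        verbose (int, optional): 0, 1 or 2. 1 also prints each font group and its examples; 2 additionally prints page progress. Defaults to 0.

    Returns:
        str: One "Font Group" header line per (size, font) pair, followed by its example spans.
    """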

    doc = fitz.open(file_path)
    font_data = defaultdict(list)

    for p_idx,page in enumerate(doc):
        if verbose > 1:
            # used to show the pages processed to compare with the actual document
            print(f"Processing page {p_idx + 1}/{len(doc)}...")
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:
            for line in b.get("lines", []):
                for span in line.get("spans", []):
                    text = span["text"].strip()
                    if text:
                        font_key = (round(span["size"], 2), span["font"])
                        font_data[font_key].append({
                            "text": text,
                            "size": round(span["size"], 2),
                            "font": span["font"],
                            "page": p_idx + 1,
                            "bbox": span.get("bbox", None),
                            "position": (round(span["origin"][0], 1), round(span["origin"][1], 1))
                        })

    # Sort by font size or font name
    sorted_fonts = sorted(font_data.items(), key=lambda x: -x[0][0] if sort_by == "size" else x[0][1])

    # Build LLM-friendly string
    lines = []
    for (size, font), entries in sorted_fonts:
        line_of_text = f"Font Group: size={size}, font='{font}' (occurrences={len(entries)})"
        lines.append(line_of_text)
        if verbose > 0:
            print("\n"+line_of_text)
        for entry in entries[:examples_included]:  # Show up to `examples_included` examples per group
            line_of_text = f"  [Page {entry['page']}] (x={entry['position'][0]}, y={entry['position'][1]}) → {entry['text']}"
            lines.append(line_of_text)
            if verbose > 0:
                print(f"    [Page {entry['page']}] (x={entry['position'][0]}, y={entry['position'][1]}) → {entry['text']}")
        lines.append("")  # blank line between font groups

    return "\n".join(lines)

explore_pdf_fonts_llm_ready(file_path, sort_by="size", examples_included=5, verbose = 1)



Font Group: size=28.0, font='BoschSans-Regular' (occurrences=1)
    [Page 1] (x=31.5, y=279.2) → Washing machine

Font Group: size=20.0, font='BoschSans-Bold' (occurrences=4)
    [Page 52] (x=112.0, y=103.0) → Thank you for buying a
    [Page 52] (x=112.0, y=123.7) → Bosch Home Appliance!
    [Page 52] (x=112.0, y=327.6) → Looking for help?
    [Page 52] (x=112.0, y=348.3) → You'll find it here.

Font Group: size=15.37, font='ArialUnicodeMS' (occurrences=1)
    [Page 19] (x=293.7, y=426.2) → 4

Font Group: size=14.51, font='ArialUnicodeMS' (occurrences=13)
    [Page 18] (x=38.6, y=341.7) → 1
    [Page 18] (x=38.6, y=277.9) → 2
    [Page 18] (x=136.7, y=161.8) → 4
    [Page 18] (x=260.5, y=162.2) → 6
    [Page 18] (x=78.1, y=161.8) → 3

Font Group: size=14.45, font='ArialUnicodeMS' (occurrences=9)
    [Page 19] (x=98.3, y=183.6) → 2
    [Page 19] (x=185.1, y=107.5) → 3
    [Page 19] (x=75.8, y=183.6) → 1
    [Page 19] (x=185.1, y=89.6) → 4
    [Page 32] (x=354.3, y=164.5) → 1

Font Gro



In [12]:
## Extracting text chunks from the PDF files

def extract_text_chunks(file_path, chunk_size=512, chunk_overlap=200):
    with pdfplumber.open(file_path) as pdf:
        full_text = "\n".join([page.extract_text() or "" for page in pdf.pages])

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_text(full_text)
    return [{"chunk_order": i, "chunk_text": c} for i, c in enumerate(chunks)]

extract_text_chunks(file_path, 512, 128)[:5]  # Show first 5 chunks

[{'chunk_order': 0,
  'chunk_text': 'Register\nyour\nb M o ge s y n c t B e h f o r w - e s h e c d o h e b m v n e e i n o c . e w e c f o o i a t m n s n : d /\nwelcome\nWashing machine\nWGG254Z0GB\nWGG254ZSGB\n[en] User manual and installation\ninstructions\nen\nFurther information and explanations are\navailable online:\nTable of contents\n1Safety........................................... 4 8Buttons...................................... 22\n1.1 General information................... 4\n9Programmes.............................. 24'},
 {'chunk_order': 1,
  'chunk_text': '1.1 General information................... 4\n9Programmes.............................. 24\n1.2 Intended use.............................. 4\n1.3 Restriction on user group.......... 4 10 Accessories............................. 28\n1.4 Safe installation......................... 5\n1.5 Safe use.................................... 7 11 Laundry.................................... 28\n11.1 Preparing the laundry.........
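
The two chunks above share the lines "1.1 General information" and "9Programmes", which is the effect of `chunk_overlap`. The sketch below illustrates the overlap idea with a plain fixed-size sliding window. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to split on separators such as newlines. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size=512, chunk_overlap=128):
    """Split `text` into fixed-size chunks where each chunk repeats the last
    `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return [{"chunk_order": i, "chunk_text": c} for i, c in enumerate(chunks)]


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
for chunk in demo_chunks:
    print(chunk)
```

A larger overlap keeps more context across chunk boundaries at the cost of storing and embedding more duplicated text.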

# pymupdf4llm

In [None]:

class text_chunker():
    # Take in file name
    def __init__(self, pdf):
        self.pdf_name = pdf
        self.newpath = "Washer_Images/Washer_Manuals/pymupdf4llm/" + self.pdf_name
        # make dir
        if not os.path.exists(self.newpath):
            os.makedirs(self.newpath)
        

    def extract_text_and_images(self):
        """Converts the PDF to Markdown and saves its images in 'self.newpath'
        """
        # to_markdown returns the Markdown text; keep it so it can be printed and reused
        self.text = pymupdf4llm.to_markdown(doc=f"Washer_Manuals/{self.pdf_name}", write_images=True, image_path= self.newpath)
        # self.text = pathlib.Path("output.md").write_bytes(self.md_text.encode())
        print(self.text)

    def extract_page_cunks(self):
        # Testing something with text and markdown. Does not work!
        # With page_chunks=True, to_markdown returns a list with one dict per page
        self.document = pymupdf4llm.to_markdown(doc=f"Washer_Manuals/{self.pdf_name}", page_chunks=True)
        
        for index, page in enumerate(self.document):
            
            print(page["text"])
            print("Encoded text: ", pathlib.Path("output.md").write_bytes(page["text"].encode()))
            print(self.document[index])
            # NOTE: write_bytes returns the number of bytes written, so this replaces the page text with an integer
            self.document[index]["text"] =  pathlib.Path("output.md").write_bytes(page["text"].encode())
        # self.text = pathlib.Path("output.md").write_bytes(self.md_text.encode())
        print(self.document[11]["text"])
        

    def process(self, pdf_text: str):
        
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 1512, # Adjust this as you see fit
            chunk_overlap  = 256, # Lets consecutive chunks overlap, which helps keep them contextual
            length_function = len
        )
    
        chunks = text_splitter.split_text(pdf_text)
        self.df = pd.DataFrame(chunks, columns=['chunks'])
        
        
        yield from self.df.itertuples(index=False, name=None)

In [80]:
cunker_object = text_chunker(pdf="k714wm14 service manual.pdf")
cunker_object.extract_page_cunks()

# �������������������������� ���������������
## � ������������ ���������������


-----


Encoded text:  226
{'metadata': {'format': 'PDF 1.5', 'title': '维修手册_SERVICE MANUAL-SICILY GLORY(6.0,7.0KG)', 'author': '张晓良', 'subject': '', 'keywords': '', 'creator': 'CorelDRAW X5', 'producer': 'Acrobat Distiller 9.0.0 (Windows)', 'creationDate': "D:20140618154138+08'00'", 'modDate': "D:20140618154138+08'00'", 'trapped': '', 'encryption': None, 'file_path': 'Washer_Manuals/k714wm14 service manual.pdf', 'page_count': 65, 'page': 1}, 'toc_items': [], 'tables': [], 'images': [{'number': 0, 'bbox': Rect(174.83399963378906, 331.0339660644531, 426.0059814453125, 657.5659790039062), 'transform': (251.1719970703125, 0.0, -0.0, 326.5320129394531, 174.83399963378906, 331.0339660644531), 'width': 697, 'height': 906, 'colorspace': 1, 'cs-name': 'Indexed(131,ICCBased(RGB,sRGB IEC61966-2.1))', 'xres': 96, 'yres': 96, 'bpc': 8, 'size': 23922, 'has-mask': False}], 'graphics': [], 'text': '# ��������������������

In [81]:
# cunker_object.extract_page_cunks()

cunker_object.document[2]["text"]


10699

In [13]:
def render_pdf_to_images(pdf_path, zoom=2.0):
    doc = fitz.open(pdf_path)
    images = []
    for i, page in enumerate(doc):
        mat = fitz.Matrix(zoom, zoom)
        pix = page.get_pixmap(matrix=mat)
        img_data = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        images.append({
            "page_number": i + 1,
            "image": img_data
        })
    return images


def detect_image_regions(page_image):
    image = np.array(page_image)
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    _, thresh = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)

    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    regions = []
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        if w > 50 and h > 50:  # Skip tiny blocks (Maybe reconsider)
            regions.append([x, y, x + w, y + h])
    return regions


def crop_regions_from_image(page_image, regions, output_dir, page_num):
    os.makedirs(output_dir, exist_ok=True)
    saved_images = []

    for i, coords in enumerate(regions):
        x1, y1, x2, y2 = map(int, coords)
        cropped = page_image.crop((x1, y1, x2, y2))
        save_path = os.path.join(output_dir, f"page_{page_num}_img_{i+1}.png")
        cropped.save(save_path)
        saved_images.append({
            "page": page_num,
            "image_path": save_path,
            "coords": (x1, y1, x2, y2)
        })
    return saved_images


# This is the main function to extract images from the PDF
def extract_images_from_pdf(pdf_path:str, output_dir: str, verbose:int =0):
    rendered_pages = render_pdf_to_images(pdf_path)
    all_extracted = []

    for page_idx,page in enumerate(rendered_pages):
        page_num = page["page_number"]
        image = page["image"]
        if verbose > 0:
            print(f"Processing page {page_num}...")

        regions = detect_image_regions(image)
        if verbose > 0:
            print(f"Found {len(regions)} image regions on page {page_num}")

        if not regions:
            if verbose > 0:
                print(f"No image regions found on page {page_num}")
            continue
        
        # Creates an image directory for each PDF file
        image_output_dir = os.path.join(output_dir, pdf_path.split("/")[-1].replace(".pdf", ""))
        os.makedirs(image_output_dir, exist_ok=True)

        extracted = crop_regions_from_image(
            image, regions, output_dir=image_output_dir, page_num=page_num
        )
        all_extracted.extend(extracted)
    return all_extracted

extract_images_from_pdf(file_path, output_dir="Washer_Images", verbose = 1)

Processing page 1...
Found 1 image regions on page 1
Processing page 2...
Found 1 image regions on page 2
Processing page 3...
Found 0 image regions on page 3
No image regions found on page 3
Processing page 4...
Found 1 image regions on page 4
Processing page 5...
Found 0 image regions on page 5
No image regions found on page 5
Processing page 6...
Found 0 image regions on page 6
No image regions found on page 6
Processing page 7...
Found 0 image regions on page 7
No image regions found on page 7
Processing page 8...
Found 0 image regions on page 8
No image regions found on page 8
Processing page 9...
Found 0 image regions on page 9
No image regions found on page 9
Processing page 10...
Found 0 image regions on page 10
No image regions found on page 10
Processing page 11...
Found 0 image regions on page 11
No image regions found on page 11
Processing page 12...
Found 0 image regions on page 12
No image regions found on page 12
Processing page 13...
Found 7 image regions on page 13
Pro

[{'page': 1,
  'image_path': 'Washer_Images\\Washer_Manuals\\WGG254Z0GB\\page_1_img_1.png',
  'coords': (0, 0, 839, 1191)},
 {'page': 2,
  'image_path': 'Washer_Images\\Washer_Manuals\\WGG254Z0GB\\page_2_img_1.png',
  'coords': (656, 87, 778, 209)},
 {'page': 4,
  'image_path': 'Washer_Images\\Washer_Manuals\\WGG254Z0GB\\page_4_img_1.png',
  'coords': (56, 232, 142, 289)},
 {'page': 13,
  'image_path': 'Washer_Images\\Washer_Manuals\\WGG254Z0GB\\page_13_img_1.png',
  'coords': (144, 853, 224, 969)},
 {'page': 13,
  'image_path': 'Washer_Images\\Washer_Manuals\\WGG254Z0GB\\page_13_img_2.png',
  'coords': (62, 853, 142, 969)},
 {'page': 13,
  'image_path': 'Washer_Images\\Washer_Manuals\\WGG254Z0GB\\page_13_img_3.png',
  'coords': (464, 686, 788, 920)},
 {'page': 13,
  'image_path': 'Washer_Images\\Washer_Manuals\\WGG254Z0GB\\page_13_img_4.png',
  'coords': (62, 657, 224, 773)},
 {'page': 13,
  'image_path': 'Washer_Images\\Washer_Manuals\\WGG254Z0GB\\page_13_img_5.png',
  'coords': (62,