## This notebook tests whether LLMs can be used to extract information from PDFs

This approach was tested because classical string manipulation and data engineering proved more challenging than anticipated. Possible reasons include:
- The specific PDF document used for testing
- Limited experience with the task (Emil Haldan)

Tests on a single document show that the larger models yield much better results. So far the best output came from llama3.1-70b, which appears to be the "best" model that Snowflake offers in our current region, "Azure West Europe - (Netherlands)".

The function `extract_TOC(text: str, model: str)` currently takes a single text chunk of up to 8192 characters and returns the query result, whose first cell (`llm_output[0][0]`) is a string containing a JSON structure of the table of contents (cell 13).
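
Because the reply is a plain string that merely contains JSON, it usually has to be parsed before it can be used. The following is a minimal sketch of one way to do that, assuming the model may wrap the JSON in explanatory text. The helper name `parse_toc_json` and the sample reply are hypothetical and not part of the notebook.

```python
import json
import re


def parse_toc_json(llm_text: str):
    """Return the first JSON object or array embedded in an LLM reply, or None."""
    decoder = json.JSONDecoder()
    # Try every position where a JSON object or array could start.
    for match in re.finditer(r"[\[{]", llm_text):
        try:
            value, _ = decoder.raw_decode(llm_text, match.start())
            return value
        except json.JSONDecodeError:
            # Not valid JSON from this position; try the next candidate.
            continue
    return None


# Hypothetical reply, shaped like the format requested in the prompt.
reply = (
    "Here is the table of contents: "
    '[{"Section": "Safety", "Section Number": "1", "Page": 4}] '
    "Let me know if you need anything else."
)
toc = parse_toc_json(reply)
print(toc[0]["Section"])
```

In the notebook this would be called as `parse_toc_json(llm_output[0][0])`. Returning `None` instead of raising makes it easy to count how often a model fails to produce parseable JSON.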

Processing time per document: 61.44 seconds
Warehouse specs: Small, 2 clusters.

In [None]:
CREATE DATABASE IF NOT EXISTS WASHING_MACHINE_MANUALS;
CREATE SCHEMA IF NOT EXISTS WASHING_MACHINE_MANUALS.PUBLIC;
USE DATABASE WASHING_MACHINE_MANUALS;
USE SCHEMA PUBLIC;

In [None]:
-- Creating stage to dump PDF documents into
create or replace stage docs ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE') DIRECTORY = ( ENABLE = true );

In [None]:
-- Uploading the documents to the @docs stage directly (DO THIS MANUALLY)
-- Check that the files were uploaded
LS @docs;

In [None]:
CREATE OR REPLACE TABLE DOCUMENTS (
    DOCUMENT_ID INT AUTOINCREMENT PRIMARY KEY,
    RELATIVE_PATH STRING NOT NULL,
    FILE_URL STRING,
    SIZE NUMBER,
    STAGE_NAME STRING DEFAULT '@docs',
    CREATED_AT TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

INSERT INTO DOCUMENTS (RELATIVE_PATH, FILE_URL, SIZE)
SELECT 
    RELATIVE_PATH,
    FILE_URL,
    SIZE
FROM DIRECTORY(@docs);

In [None]:
SELECT * 
FROM DOCUMENTS;

In [None]:
-- Scale up!
-- ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = '4X-Large'; -- Had no noticeable effect on the run time. Snowflake's Cortex documentation notes that larger warehouses do not speed up LLM functions, which may explain this.

In [None]:

-- Creates the table for storing the text chunks (no embedding column is defined yet)
CREATE OR REPLACE TABLE CHUNKS (
    CHUNK_ID INT AUTOINCREMENT PRIMARY KEY,
    DOCUMENT_ID INT NOT NULL,
    CHUNK_INDEX INT,
    CHUNK STRING NOT NULL,
    CREATED_AT TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
    CONSTRAINT fk_document
        FOREIGN KEY (DOCUMENT_ID)
        REFERENCES DOCUMENTS(DOCUMENT_ID)
);


-- Creates a temp table with parsed text (1 row for each document, with a super long string of raw text of the document)
CREATE OR REPLACE TEMP TABLE parsed_text_table AS
SELECT 
  relative_path,
  size,
  file_url,
  BUILD_SCOPED_FILE_URL(@docs, relative_path) AS scoped_file_url,
  TO_VARCHAR(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@docs, relative_path, {'mode': 'LAYOUT'})) AS full_text
FROM DIRECTORY(@docs);


-- Use the temporary table to fill the CHUNKS table with the chunked document text
INSERT INTO CHUNKS (DOCUMENT_ID, CHUNK_INDEX, CHUNK)
SELECT 
    d.DOCUMENT_ID,
    chunk_data.index AS CHUNK_INDEX,
    chunk_data.value::STRING AS CHUNK
FROM parsed_text_table p
JOIN DOCUMENTS d ON p.RELATIVE_PATH = d.RELATIVE_PATH
JOIN LATERAL FLATTEN(
    INPUT => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
        p.full_text,
        'none',     -- format: 'none' or 'markdown'
        8192,       -- chunk size in characters
        256         -- overlap in characters
    )
) AS chunk_data
WHERE p.full_text IS NOT NULL;

SELECT * 
FROM CHUNKS 
LIMIT 10;

In [None]:
SELECT * FROM CHUNKS;

### This section focuses on classifying the sections of the document using a sequence of LLM calls and logic

In [None]:
import pandas as pd
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
import time

session = Session.builder.getOrCreate()

df_chunks = session.table("CHUNKS").to_pandas()
df_chunks.head()

In [None]:
def extract_TOC(text: str, model: str) -> list:
    """Ask a Cortex model to extract the table of contents from `text`.

    Returns the collected rows; the model's reply is at result[0][0].
    """
    prompt = (
    """
    I will provide a long string of text that most likely contains a table of contents, 
    although it may also include additional body text from a document. Your task is to carefully 
    extract only the table of contents and structure it as a JSON object in the following 
    format:
    {
      "Section": "<section name>",
      "Section Number": "<section number>",
      "Page": <page number>
    }

    Guidelines:
        - Ignore any text that is not part of the table of contents.
        - Ensure that sub-sections are nested appropriately under their parent section.
        - If a section has no sub-sections, return "Sub sections": [].
        - Page numbers should be extracted as integers, if possible.
        - Be tolerant of inconsistencies in formatting, spacing, or punctuation (e.g. dashes, colons, ellipses).
        - Do not include duplicate or repeated sections.
        - You should only consider items which are part of the table of contents, nothing before, nothing after.
        - "Section" must consist of words
        - "Section Number" must be represented as an integer or float - E.G: 1, 2, 5.3, 1.4, etc.
        - "Page" must be an integer.

    """
    f"Text:\n{text}"
    )
    start_time = time.time()
    # Bind the model name and prompt as parameters so that quotes or `$$`
    # inside the document text cannot break the SQL statement.
    result = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE(?, ?)", params=[model, prompt]
    ).collect()
    print(f"Runtime in seconds: {time.time() - start_time:.4f}")

    return result

llm_output = extract_TOC(df_chunks.loc[0,"CHUNK"], model = 'snowflake-arctic')
llm_output[0][0]

In [None]:
llm_output = extract_TOC(df_chunks.loc[0,"CHUNK"], model = 'llama3.1-8b')
llm_output[0][0]

In [None]:
llm_output = extract_TOC(df_chunks.loc[0,"CHUNK"], model = 'llama3.1-70b')
llm_output[0][0]

In [None]:
llm_output = extract_TOC(df_chunks.loc[0,"CHUNK"], model = 'mistral-large2')
llm_output[0][0]

In [None]:
llm_output = extract_TOC(df_chunks.loc[0,"CHUNK"], model = 'mixtral-8x7b')
llm_output[0][0]


### The best results appear to come from llama3.1-70b.

Potential improvements:
- The chunk size, currently 8192 characters (which was estimated based on a 2-page TOC)
- The text prompt, which could include more instructions or be cleaned up.
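
To judge whether a change to the prompt or chunk size actually helps, it may be useful to check each extracted entry against the field rules stated in the prompt. The following is a rough sketch under that assumption. The function `check_toc_entry` and the sample entries are hypothetical, and the rules are only one reading of the prompt's guidelines.

```python
import re


def check_toc_entry(entry: dict) -> list:
    """Return a list of rule violations for one TOC entry (empty list means valid)."""
    problems = []

    # "Section" must consist of words: require at least one letter.
    section = entry.get("Section")
    if not isinstance(section, str) or not any(ch.isalpha() for ch in section):
        problems.append("Section must contain words")

    # "Section Number" must look like an integer or float, e.g. 1 or 5.3.
    number = str(entry.get("Section Number", ""))
    if not re.fullmatch(r"\d+(\.\d+)?", number):
        problems.append("Section Number must look like 1 or 5.3")

    # "Page" must be an integer (bool is excluded because it subclasses int).
    page = entry.get("Page")
    if not isinstance(page, int) or isinstance(page, bool):
        problems.append("Page must be an integer")

    return problems


# Hypothetical entries: the first follows the rules, the second breaks all three.
good = {"Section": "Safety", "Section Number": "1", "Page": 4}
bad = {"Section": "", "Section Number": "A", "Page": "4"}
for entry in (good, bad):
    print(check_toc_entry(entry))
```

Counting violations per model or per prompt variant gives a simple, repeatable number to compare, instead of judging the outputs by eye.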

In [None]:
# Testing it on another document

llm_output = extract_TOC(df_chunks.loc[10,"CHUNK"], model = 'llama3.1-70b')
llm_output[0][0]