# An example of using a layered approach to interacting with LLMs 

This is my example of how you can get multiple completions, use a separate LLM call to evaluate them, and use another call to check the result against relevant legislation to protect against breaches.
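
A minimal sketch of that layered flow, under stated assumptions: `layered_answer`, `LayeredResult` and the three callables are hypothetical names introduced here for illustration, not code from this project. In practice each callable would wrap its own LLM call; here they are plain functions so the control flow is easy to follow.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LayeredResult:
    best_answer: Optional[str]  # highest-scoring compliant candidate, if any
    score: float                # evaluator score for that candidate
    rejected: List[str]         # candidates turned down by the compliance layer


def layered_answer(
    question: str,
    generate: Callable[[str, int], List[str]],  # layer 1: produce n candidates
    evaluate: Callable[[str, str], float],      # layer 2: score one candidate
    is_compliant: Callable[[str], bool],        # layer 3: legislation check
    n: int = 3,
) -> LayeredResult:
    candidates = generate(question, n)
    # Score every candidate once, then rank best-first.
    scored = [(evaluate(question, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    rejected: List[str] = []
    for score, candidate in scored:
        # Return the best-scoring candidate that also passes the compliance check.
        if is_compliant(candidate):
            return LayeredResult(candidate, score, rejected)
        rejected.append(candidate)
    # No candidate passed the compliance layer.
    return LayeredResult(None, 0.0, rejected)
```

Keeping the three layers as separate callables means each one can use its own prompt, model, or retrieval index without changing the others.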

This workbook sets up the Redis database and loads some PDFs and Australia's Insurance Contracts Act 1984.

The building blocks are taken from OpenAI's Cookbook - Chatbot Kickstarter, with the logic and handling of multiple conversations introduced in this project.

It is laid out in these sections:
- **Setup:** 
    - Initiate variables and source the data
- **Lay the foundations:**
    - Set up the vector database to accept vectors and data
    - Load the dataset, chunk the data up for embedding and store in the vector database


In [1]:
##These will ensure that any modules are reloaded when code blocks are run. 
##Helpful in development in case modules are edited in parallel
%load_ext autoreload
%autoreload 2

## Setup Environment


First we'll set up our libraries and environment variables.

In [2]:
import redis
import openai
import os
import requests
import numpy as np
import pandas as pd
from typing import Iterator
import tiktoken
import fitz  # PyMuPDF - the original cookbook used textract, but it refused to work for me, so I substituted this
from numpy import array, average
from database import get_redis_connection

# Set our default models and chunking size
from config import COMPLETIONS_MODEL, EMBEDDINGS_MODEL, CHAT_MODEL, TEXT_EMBEDDING_CHUNK_SIZE, VECTOR_FIELD_NAME

# Openai API Key
from apikey import API_KEY
openai.api_key = API_KEY

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)  # unclosed-socket warnings are ResourceWarning
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [3]:
pd.set_option('display.max_colwidth', 0)

## Pull datafiles in from the file system 

In this example I wanted two separate indexes: one for product disclosure statements (PDSs) and one for a contract of insurance.

I am using RACQ's PDSs, which are publicly available at www.racq.com.au/insurance/insurance-disclosure-documents

For the contract, I am using a desensitised contract with dummy data.

In [4]:

PDS_dir = os.path.join(os.curdir,'PDS')
PDS_files = sorted([x for x in os.listdir(PDS_dir) if 'DS_Store' not in x])
print("PDS Files:", PDS_files)

contract_dir = os.path.join(os.curdir,'Reg')
contract_files = sorted([x for x in os.listdir(contract_dir) if 'DS_Store' not in x])
print("Contract Files:", contract_files)


PDS Files: ['Household Insurance PDS RHHB20822.pdf', 'Household_KFSBLD_20220401.pdf', 'Household_KFSCTS_20220401.pdf', 'Household_SPDS_RHHB90523_20230529_v54.pdf']
Contract Files: ['Contract.pdf']


## Laying the foundations

### Storage

I'm using Redis in line with the original cookbook walk-through. Refer to that for more detail on how to set up Redis.

Initiate a Redis connection and create a Hierarchical Navigable Small World (HNSW) index for semantic search.

In [5]:
# Setup Redis

from redis import Redis
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField,
    NumericField
)
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)

redis_client = get_redis_connection()

In [6]:
# Constants
VECTOR_DIM = 1536                             # length of the embedding vectors
#VECTOR_NUMBER = len(data)                 # initial number of vectors
PDS_PREFIX = "PDS"                            # prefix for the PDS document keys
CONTRACT_PREFIX = "CONTRACT"
DISTANCE_METRIC = "COSINE"                # distance metric for the vectors (ex. COSINE, IP, L2)
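
For reference, the `COSINE` metric ranks results by cosine distance, which is one minus the cosine similarity of two vectors. The sketch below computes it in plain Python purely as an illustration; `cosine_distance` is a hypothetical helper, and Redis performs this calculation internally.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity: 0 for same direction, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

A smaller distance means the two texts are closer in meaning, so search results are sorted in ascending order of this score.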

In [7]:
# Create search index

# Vector field name (this reassigns the VECTOR_FIELD_NAME imported from config above)
VECTOR_FIELD_NAME = 'content_vector'

# Define RediSearch fields for each of the columns in the dataset
# This is where you should add any additional metadata you want to capture
filename = TextField("filename")
text_chunk = TextField("text_chunk")
file_chunk_index = NumericField("file_chunk_index")

# define RediSearch vector fields to use HNSW index

text_embedding = VectorField(VECTOR_FIELD_NAME,
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC
    }
)
# Add all our field objects to a list to be created as an index
fields = [filename,text_chunk,file_chunk_index,text_embedding]
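
To make the schema concrete, here is a hedged sketch of what a single record for this index might look like. The helper names and the key scheme are assumptions for illustration only; the actual loading is done by `handle_file_string` in `transformers.py`, which may build its keys differently. What is certain is that the key must start with the index prefix to be picked up, and the vector must be stored as raw float32 bytes to match `"TYPE": "FLOAT32"`.

```python
import struct
from typing import Dict, List, Tuple, Union


def to_float32_bytes(vector: List[float]) -> bytes:
    """Pack a list of floats as little-endian float32 bytes (4 bytes each)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def build_hash_record(
    prefix: str,
    filename: str,
    chunk_index: int,
    text_chunk: str,
    embedding: List[float],
    vector_field: str = "content_vector",
) -> Tuple[str, Dict[str, Union[str, int, bytes]]]:
    # Hypothetical key scheme: "<prefix>:<filename>-!<chunk index>".
    key = f"{prefix}:{filename}-!{chunk_index}"
    # Field names mirror the TextField/NumericField/VectorField definitions.
    mapping = {
        "filename": filename,
        "text_chunk": text_chunk,
        "file_chunk_index": chunk_index,
        vector_field: to_float32_bytes(embedding),
    }
    return key, mapping
```

A record built this way could be written with `redis_client.hset(key, mapping=mapping)`.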

In [8]:
redis_client.ping()

True

In [9]:
# Define the two indexes being used
PDS_INDEX_NAME = "pds-index"
CONTRACT_INDEX_NAME = "contract-index" 


# Drop the PDS index if it already exists
try:
    redis_client.ft(PDS_INDEX_NAME).info()
    print(f"Index {PDS_INDEX_NAME} already exists. Dropping...")
    redis_client.ft(PDS_INDEX_NAME).dropindex()
except Exception as e:
    print(e)

# Drop the CONTRACT index if it already exists
try:
    redis_client.ft(CONTRACT_INDEX_NAME).info()
    print(f"Index {CONTRACT_INDEX_NAME} already exists. Dropping...")
    redis_client.ft(CONTRACT_INDEX_NAME).dropindex()
except Exception as e:
    print(e)

# Create the PDS RediSearch Index
print(f'Creating Index {PDS_INDEX_NAME}')
redis_client.ft(PDS_INDEX_NAME).create_index(
    fields=fields,
    definition=IndexDefinition(prefix=[PDS_PREFIX], index_type=IndexType.HASH)
)

# Create the CONTRACT RediSearch Index
print(f'Creating Index {CONTRACT_INDEX_NAME}')
redis_client.ft(CONTRACT_INDEX_NAME).create_index(
    fields=fields,
    definition=IndexDefinition(prefix=[CONTRACT_PREFIX], index_type=IndexType.HASH)
)



Index pds-index already exists. Dropping...
Index contract-index already exists. Dropping...
Creating Index pds-index
Creating Index contract-index


b'OK'

### Ingestion

We'll load our PDFs and do the following:
- Initiate our tokenizer
- Run a processing pipeline to:
    - Mine the text from each PDF
    - Split them into chunks and embed them
    - Store them in Redis
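
The chunking step can be sketched as follows. This is a simplified fixed-size variant written for illustration, not the logic in `transformers.py`; `chunk_tokens` and its `overlap` parameter are hypothetical. It works on any sequence of token ids, such as the output of the tokenizer's `encode` method.

```python
from typing import Iterator, List, Sequence


def chunk_tokens(
    tokens: Sequence[int], chunk_size: int, overlap: int = 0
) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most chunk_size tokens.

    Neighbouring chunks share `overlap` tokens so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    start = 0
    while start < len(tokens):
        yield list(tokens[start:start + chunk_size])
        # Stop once this chunk has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += step
```

Each chunk of token ids would then be decoded back to text with the tokenizer's `decode` method before being embedded and stored.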

In [10]:
# The transformers.py file contains all of the transforming functions, including ones to chunk, embed and load data
# For more details, check the file and work through each function individually
from transformers import handle_file_string

In [14]:
%%time
# This step takes less than 1 minute

# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

# Process each PDS file and prepare for embedding
for pdf_file in PDS_files:
    
    pdf_path = os.path.join(PDS_dir,pdf_file)
    print(pdf_path)
    
    # Extract the raw text from each PDF using PyMuPDF (fitz)
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    # Initialize an empty string to store the extracted text
    text = ''
    
    # Loop through each page and extract text
    for page_number in range(pdf_document.page_count):
        page = pdf_document.load_page(page_number)
        text += page.get_text()

    #print((pdf_file,text[:50]),tokenizer,redis_client,VECTOR_FIELD_NAME,PDS_INDEX_NAME)

    # Chunk each document, embed the contents and load to Redis
    handle_file_string((pdf_file,text),tokenizer,redis_client,VECTOR_FIELD_NAME,PDS_INDEX_NAME,PDS_PREFIX)

# Process each contract file and prepare for embedding
for pdf_file in contract_files:
    
    pdf_path = os.path.join(contract_dir,pdf_file)
    print(pdf_path)
    
    # Extract the raw text from each PDF using PyMuPDF (fitz)
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    # Initialize an empty string to store the extracted text
    text = ''
    
    # Loop through each page and extract text
    for page_number in range(pdf_document.page_count):
        page = pdf_document.load_page(page_number)
        text += page.get_text()

    ##print(text) 
    
    # Chunk each document, embed the contents and load to Redis
    handle_file_string((pdf_file,text),tokenizer,redis_client,VECTOR_FIELD_NAME,CONTRACT_INDEX_NAME,CONTRACT_PREFIX)

.\PDS\Household Insurance PDS RHHB20822.pdf
load_vector started with vector: content_vector and index: pds-index and input list size: 94
Pipeline executed
.\PDS\Household_KFSBLD_20220401.pdf
load_vector started with vector: content_vector and index: pds-index and input list size: 5
Pipeline executed
.\PDS\Household_KFSCTS_20220401.pdf
load_vector started with vector: content_vector and index: pds-index and input list size: 5
Pipeline executed
.\PDS\Household_SPDS_RHHB90523_20230529_v54.pdf
load_vector started with vector: content_vector and index: pds-index and input list size: 5
Pipeline executed
.\Reg\Contract.pdf
0036330027 
MR JOHN DOE 
5 ROADY ROAD 
BRISBANE QLD 4000 
It’s time to renew your insurance. 
What next? 
1. 
When we renew your policy, we will continue to deduct the direct debit amount from your nominated 
account or card until you contact us to cancel or change that arrangement. Please refer to the enclosed 
Direct Debit Confirmation Certificate for full details. If you

In [13]:
# Check that our docs have been inserted
# print(PDS_INDEX_NAME,redis_client.ft(PDS_INDEX_NAME).info()['num_docs'])
# print(CONTRACT_INDEX_NAME,redis_client.ft(CONTRACT_INDEX_NAME).info()['num_docs'])
try:
    pds_index_stats = redis_client.ft(PDS_INDEX_NAME).info()
    pds_records_count = pds_index_stats["num_docs"]
    print(f"Number of records in index {PDS_INDEX_NAME}: {pds_records_count}")
except redis.exceptions.ResponseError as e:
    # Handle any errors that may occur when retrieving index stats
    print(f"Error getting index stats for {PDS_INDEX_NAME}: {e}")

try:
    contract_index_stats = redis_client.ft(CONTRACT_INDEX_NAME).info()
    contract_records_count = contract_index_stats["num_docs"]
    print(f"Number of records in index {CONTRACT_INDEX_NAME}: {contract_records_count}")
except redis.exceptions.ResponseError as e:
    # Handle any errors that may occur when retrieving index stats
    print(f"Error getting index stats for {CONTRACT_INDEX_NAME}: {e}")

Number of records in index pds-index: 109
Number of records in index contract-index: 16
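
With both indexes populated, a semantic search is a K-nearest-neighbour (KNN) query against the vector field. The sketch below only builds the query string, so it runs without a Redis server; `build_knn_query`, `vec_param` and `vector_score` are illustrative names chosen here, and the commented lines show how the string would typically be passed to redis-py.

```python
def build_knn_query(
    k: int,
    vector_field: str = "content_vector",
    param_name: str = "vec_param",
    score_alias: str = "vector_score",
) -> str:
    """Build a RediSearch KNN query string returning the k nearest chunks."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # "*" matches all documents; the KNN clause ranks them by vector distance.
    return f"*=>[KNN {k} @{vector_field} ${param_name} AS {score_alias}]"


# Typical usage with redis-py (needs a running Redis and an embedded question,
# packed as float32 bytes in the hypothetical variable question_bytes):
#
#   from redis.commands.search.query import Query
#   query = (
#       Query(build_knn_query(3))
#       .sort_by("vector_score")
#       .return_fields("filename", "text_chunk", "vector_score")
#       .dialect(2)
#   )
#   results = redis_client.ft("pds-index").search(
#       query, query_params={"vec_param": question_bytes}
#   )
```

Running the same query against `contract-index` instead would retrieve chunks from the contract rather than the PDS documents.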
