## Llama-index with custom LLM
Previously, we used llama-index (which relies on OpenAI to create word embeddings by default) with a custom LLM and created a vector index by manually creating and loading Documents from the dataset. In this notebook, I test several llama-index data loaders and compare their results. Since our data contains both text and tables, we need to either load the whole document (from a PDF or HTML page) or partition the text and tabular data and build a separate index for each. For more information, see https://gpt-index.readthedocs.io/en/latest/how_to/index_structs/composability.html, and for a demo, see https://github.com/jerryjliu/llama_index/blob/main/examples/composable_indices/ComposableIndices.ipynb.

We can also use other supported transformer models to create the embeddings. For more information, see https://gpt-index.readthedocs.io/en/latest/how_to/customization/embeddings.html#custom-embeddings. I tested custom embeddings in a separate notebook.

We can also use llama-index with a LangChain agent, which can handle more complex data structures. For more information, see https://gpt-index.readthedocs.io/en/latest/how_to/integrations/using_with_langchain.html, and for usage, see https://github.com/jerryjliu/llama_index/blob/main/examples/chatbot/Chatbot_SEC.ipynb.

What we are developing here is a generative question-answering model, built with the llama-index indexing library and the BLOOM text-generation model.
- https://wandb.ai/mostafaibrahim17/ml-articles/reports/The-Answer-Key-Unlocking-the-Potential-of-Question-Answering-With-NLP--VmlldzozNTcxMDE3

In [4]:
!pip install -U pip
!pip install -U llama_index
!pip install -U transformers
!pip install pandas
!pip install numpy
!pip install torch torchvision torchaudio
!pip install -U langchain==0.0.142



In [22]:
# Import necessary packages
# from llama_index import GPTSimpleVectorIndex, Document, SimpleDirectoryReader
import pandas as pd, numpy as np
import os, openai
from transformers import GPT2Tokenizer, GPT2LMHeadModel


os.environ['OPENAI_API_KEY'] = 'your openai key here'
openai.api_key = os.getenv("OPENAI_API_KEY")

#### Preparing the Dataset

In [6]:
df = pd.read_json("./regItems.json")
df = df.replace(to_replace="", value=np.nan).dropna(axis=0) # remove null values
# df['paragraphText'] = df['paragraphText'].str.replace("OLD SECTION.*", "", regex=True) # remove any dirty words
# df['paragraphText'] = df['paragraphText'].str.replace(r"[a-zA-Z]\d\w+", ". ", regex=True)
# df['paragraphText'] = df['paragraphText'].str.lower()
df

| | _id | chapter | article | title | paragraphText |
|---|---|---|---|---|---|
| 2 | {'$oid': '6417c99c2d49ca2fefed951e'} | Chapter 17.5. Lead and Copper | Article 8. Lead Service Line Requirements for ... | § 64688. Lead Service Line Replacement. | (a) A system shall replace lead service lines ... |
| 3 | {'$oid': '6417c99c2d49ca2fefed951f'} | Chapter 17.5. Lead and Copper | Article 7. Public Education Program for Lead A... | § 64687. Lead Public Education Program Content... | (a) Each system with a lead action level excee... |
| 4 | {'$oid': '6417c99c2d49ca2fefed9520'} | Chapter 17.5. Lead and Copper | Article 8. Lead Service Line Requirements for ... | § 64689. Lead Service Line Sampling. | (a) Each lead service line sample shall be one... |
| 5 | {'$oid': '6417c99d2d49ca2fefed9521'} | Chapter 17.5. Lead and Copper | Article 6. Source Water Requirements for Actio... | § 64686. Requirements Subsequent to the Depart... | (a) If the Department determines that source w... |
| 6 | {'$oid': '6417c99e2d49ca2fefed9522'} | Chapter 15.5. Disinfectant Residuals, Disinfec... | Article 6. Reporting and Recordkeeping Require... | § 64537.6. Disinfection Byproduct Precursors a... | (a) Systems required to meet the enhanced coag... |
| ... | ... | ... | ... | ... | ... |
| 759 | {'$oid': '6417cac62d49ca2fefed9813'} | Chapter 17. Surface Water Treatment | Article 2. Treatment Technique Requirements, W... | § 64653. Filtration. | (a) All approved surface water utilized by a s... |
| 760 | {'$oid': '6417cac62d49ca2fefed9814'} | Chapter 17. Surface Water Treatment | Article 2. Treatment Technique Requirements, W... | § 64652. Treatment Technique Requirements and ... | (a) A supplier using an approved surface water... |
| 761 | {'$oid': '6417cac62d49ca2fefed9815'} | Chapter 17. Surface Water Treatment | Article 3. Monitoring Requirements | § 64655. Filtration Monitoring. | (a) To determine compliance with the performan... |
| 762 | {'$oid': '6417cac72d49ca2fefed9816'} | Chapter 17. Surface Water Treatment | Article 3. Monitoring Requirements | § 64654.8. Source, Raw, Settled, and Recycled ... | (a) A supplier shall comply with the source mo... |


In [8]:
data = df['paragraphText'].tolist()
# Prepare your data and convert it into a format that Llama Index can understand
# This example assumes that your data is a list of strings
formatted_data = [{'text': doc} for doc in data]
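
The commented-out cleanup steps in the dataset cell can be tried on a plain string before applying them to `df['paragraphText']`. The following is a minimal sketch using only the standard `re` module; the `clean_paragraph` helper and the sample string are hypothetical, and the final `strip()` is an extra step that the pandas version does not perform.

```python
import re

def clean_paragraph(text: str) -> str:
    """Hypothetical helper mirroring the commented-out pandas cleanup steps."""
    # Drop everything from an "OLD SECTION" marker to the end of the line.
    text = re.sub(r"OLD SECTION.*", "", text)
    # Replace tokens that look like a letter followed by digits and word characters.
    text = re.sub(r"[a-zA-Z]\d\w+", ". ", text)
    # Lowercase, then trim leftover whitespace (assumed extra step).
    return text.lower().strip()

# Made-up sample string for illustration, not a row from regItems.json.
sample = "(a) A system shall replace lead service lines. OLD SECTION 64688 repealed"
cleaned = clean_paragraph(sample)
print(cleaned)
```

Testing the patterns on a few strings first makes it easier to see whether they remove more text than intended.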

- https://github.com/jerryjliu/llama_index/blob/main/examples/test_wiki/TestNYC_Embeddings.ipynb
- https://gpt-index.readthedocs.io/en/latest/how_to/custom_llms.html#example-using-a-custom-llm-model

#### LLM customization with LlamaIndex

Note that we need to use the prompt helper to customize the prompt sizes, since every model has a slightly different context length.

Note that you may have to adjust the internal prompts to get good performance. Even then, you should be using a sufficiently large LLM to ensure it’s capable of handling the complex queries that LlamaIndex uses internally, so your mileage may vary.
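
To make the context-length constraint concrete, here is a rough sketch of the token budget that is left for retrieved context once space is reserved for the output and the prompt template. This is a simplification for illustration, not how `PromptHelper` is implemented; `context_budget` and the 100-token template size are assumptions.

```python
def context_budget(max_input_size: int, num_output: int, prompt_tokens: int) -> int:
    """Tokens left for retrieved context after reserving output and prompt template."""
    remaining = max_input_size - num_output - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt and output reservation exceed the context window")
    return remaining

# 2048 and 512 are the values used for the prompt helper in this notebook;
# the 100-token prompt template size is an assumed figure for illustration.
budget = context_budget(max_input_size=2048, num_output=512, prompt_tokens=100)
print(budget)  # 1436
```

A model with a smaller context window leaves less room for retrieved text, which is why these sizes need to be set per model.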

A list of all default internal prompts is available here: 
- https://github.com/jerryjliu/llama_index/blob/main/gpt_index/prompts/default_prompts.py

Chat-specific prompts are listed here: 
- https://github.com/jerryjliu/llama_index/blob/main/gpt_index/prompts/chat_prompts.py


In [12]:
from langchain.llms.base import LLM
from llama_index import LLMPredictor, GPTSimpleVectorIndex, PromptHelper, ServiceContext
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from IPython.display import Markdown
from typing import Optional, List, Mapping, Any

# define prompt helper
# set maximum input size
max_input_size = 2048
# set number of output tokens
num_output = 512
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

In [13]:
# 2. Load the BigScience BLOOM model and tokenizer
model_name = "bigscience/bloom-560m" # "bigscience/bloomz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# the checkpoint's own config is loaded automatically
model = AutoModelForCausalLM.from_pretrained(model_name)

In [14]:
class CustomLLM(LLM):
    # 3. Create the text-generation pipeline used to answer questions
    # (output length is set per call via max_new_tokens in _call)
    pipeline = pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        # device=0, # GPU device number
        do_sample=True,
        top_p=0.95,
        top_k=50,
        temperature=0.7
    )

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        prompt_length = len(prompt)
        response = self.pipeline(prompt, max_new_tokens=num_output)[0]["generated_text"]

        # only return newly generated tokens
        return response[prompt_length:]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        # model_name is the module-level checkpoint name defined above
        return {"name_of_model": model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"
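
`_call` slices the pipeline output by the prompt's character length, which relies on the text-generation pipeline echoing the prompt at the start of `generated_text`. A minimal sketch of that slicing, with a guard for the case where the prompt is not echoed verbatim, might look like this; `strip_prompt` and `fake_pipeline` are hypothetical names, and `fake_pipeline` is only a stand-in for the real model.

```python
from typing import Callable, Dict, List

def strip_prompt(prompt: str, generate: Callable[[str], List[Dict[str, str]]]) -> str:
    """Return only the continuation, assuming the output starts with the prompt."""
    generated = generate(prompt)[0]["generated_text"]
    if generated.startswith(prompt):
        return generated[len(prompt):]
    # Fall back to the full text if the prompt was not echoed verbatim.
    return generated

def fake_pipeline(prompt: str) -> List[Dict[str, str]]:
    """Hypothetical stand-in for a text-generation pipeline that echoes the prompt."""
    return [{"generated_text": prompt + " generated continuation"}]

answer = strip_prompt("What does DLR stand for?", fake_pipeline)
print(answer)
```

The guard matters when a tokenizer round-trip changes whitespace or special characters, in which case slicing by length alone could cut into the answer.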

In [15]:
#define our llm
llm_predictor = LLMPredictor(llm=CustomLLM())
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

#### Load the dataset using data loaders from llama-index

In [16]:
from llama_index import download_loader, Document
from pathlib import Path
# Load and index the dataset using Llama-index

# store all versions of documents and vector indices
allDocs = []
indices = {}

#### Vector index by creating Documents manually

In [17]:
%%script false --no-raise-error
# original data loader after cleaning in pd
pdDocs = [Document(d) for d in data]
allDocs.append(pdDocs)

index = GPTSimpleVectorIndex.from_documents(pdDocs, service_context=service_context)
index.save_to_disk('./indices/pdIndex.json')
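
Conceptually, a vector index embeds each document, embeds the query, and returns the most similar documents as context for the LLM. The sketch below illustrates that idea with a toy bag-of-words embedding and cosine similarity. It is not llama-index's implementation; `embed`, `cosine`, and `top_k` are hypothetical names, and the sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (real indices use dense vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query: str, docs: List[str], k: int = 1) -> List[Tuple[float, str]]:
    """Rank documents by similarity to the query and keep the best k."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Illustrative sentences only, not rows from the dataset.
docs = [
    "A system shall replace lead service lines.",
    "Each lead service line sample shall be one liter in volume.",
    "Filtration monitoring determines compliance.",
]
best = top_k("how large is a lead service line sample", docs, k=1)
print(best[0][1])
```

The retrieved text is what gets inserted into the prompt, so the quality of the chunks produced by each data loader directly affects the answers compared later.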

#### Vector index using the simple directory data loader
This data loader can read PDF files.

In [18]:
%%script false --no-raise-error
# using simple directory data loader
SimpleDirectoryReader = download_loader("SimpleDirectoryReader")
simpleDirLoader = SimpleDirectoryReader(input_dir="./data/pdf", recursive=True, exclude_hidden=True)
simpleDocs = simpleDirLoader.load_data()
allDocs.append(simpleDocs)

index = GPTSimpleVectorIndex.from_documents(simpleDocs, service_context=service_context)
index.save_to_disk('./indices/simpleIndex.json')

In [None]:
# !git clone https://github.com/facebookresearch/detectron2.git
# !pip install -e detectron2
# !pip install unstructured[local-inference]

In [None]:
# using unstructured data loader
# UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)
# unstructuredLoader = UnstructuredReader()
# unstructuredDocs = unstructuredLoader.load_data(file=Path(f'./data/pdf/calregs.pdf'), split_documents=False)
# allDocs.append(unstructuredDocs)

In [19]:
queries = ["from Table 64423-A, if monthly population served is 1000, what is the range of service connections and minimum number of samples per month?", 
          "what is the maximum contaminant level of aluminum that public water system shall comply?",
          "if monthly population served is 500, what is the range of service connections and minimum number of samples per month?",
          "if monthly population served is 4000, what is the minimum number of samples per month?",
          "what will the PWS need to do if there is a violation in lead concentration?",
          "How do we determine how many samples a PWS will need to take for lead?",
          "What is the difference between a level 1 and 2 assessments?",
          "What is a level 2 assessment?",
          "What is a level 1 assessment?",
          "What is DLR?",
          "What does DLR stand for?",
          "what is the exact meaning of the water treatment facility?",
          "What are some requirements an applicant should have before taking the T2 operator exam?",
          "what does the term State Board stands for?",
          "what does awwa abbreviation mean?",
          "what is AWWA?", "what is cross-connection?"]

In [20]:
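# load indices (no service_context is passed here, so queries presumably use llama-index's default LLM rather than CustomLLM)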
pdIndex = GPTSimpleVectorIndex.load_from_disk('./indices/pdIndex.json')
simpleIndex = GPTSimpleVectorIndex.load_from_disk('./indices/simpleIndex.json')

indices.update({"pdIndex": pdIndex})
indices.update({"simpleIndex": simpleIndex})

In [23]:
pdResponse = pdIndex.query("What does DLR stand for?")
simpleResponse = simpleIndex.query("What does DLR stand for?")
display(Markdown(f"<b>{pdResponse}</b>"))
display(Markdown(f"<b>{simpleResponse}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 98 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 7 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3598 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 7 tokens


<b>
DLR stands for "Detection Limit for Reporting".</b>

<b>
DLR stands for Detection Limit for Purposes of Reporting.</b>

In [24]:
for query in queries[0: 8]:
    print(f"query: {query}")
    for name, index in indices.items():
        print(f"{name}'s token usage:")
        response = index.query(query)
        print(f"{name} response:")
        display(Markdown(f"<b>{response}</b><br />"))
    print("----------------------------------------")
    print("----------------------------------------")

query: from Table 64423-A, if monthly population served is 1000, what is the range of service connections and minimum number of samples per month?
pdIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1925 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 30 tokens


pdIndex response:


<b>
The range of service connections for a monthly population served of 1000 is 401 to 890, and the minimum number of samples per month is 2.</b><br />

simpleIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3670 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 30 tokens


simpleIndex response:


<b>
The range of service connections is fewer than 10,000 and the minimum number of samples per month is 3.</b><br />

----------------------------------------
----------------------------------------
query: what is the maximum contaminant level of aluminum that public water system shall comply?
pdIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 245 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 16 tokens


pdIndex response:


<b>
The maximum contaminant level of aluminum that public water systems shall comply with is 1 mg/L.</b><br />

simpleIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3737 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 16 tokens


simpleIndex response:


<b>
There is no maximum contaminant level of aluminum specified in the context information.</b><br />

----------------------------------------
----------------------------------------
query: if monthly population served is 500, what is the range of service connections and minimum number of samples per month?
pdIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1917 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 22 tokens


pdIndex response:


<b>
The range of service connections for a monthly population served of 500 is 401 to 890, and the minimum number of samples per month is 2.</b><br />

simpleIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3664 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 22 tokens


simpleIndex response:


<b>
The range of service connections is 34,301 to 46,400 and the minimum number of samples per month is 100.</b><br />

----------------------------------------
----------------------------------------
query: if monthly population served is 4000, what is the minimum number of samples per month?
pdIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1884 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 17 tokens


pdIndex response:


<b>
30</b><br />

simpleIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3704 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 17 tokens


simpleIndex response:


<b>
The minimum number of samples per month for a public water system serving 4,000 persons is four. This is based on the information provided in §64423.1(a), which states that a public water system must designate each sample as routine, repeat, replacement, or "other" and have each sample analyzed for total coliforms.</b><br />

----------------------------------------
----------------------------------------
query: what will the PWS need to do if there is a violation in lead concentration?
pdIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 385 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 17 tokens


pdIndex response:


<b>
If there is a violation in lead concentration, the PWS will need to collect one lead and copper source water sample from each entry point to the distribution system that is representative of the source or combined sources and is collected after any treatment, if treatment is applied before distribution. They will also need to submit a written recommendation to the Department for the installation and operation of a source water treatment (ion exchange, reverse osmosis, lime softening, or coagulation/filtration) or demonstrate that source water treatment is not needed to minimize lead and copper levels at users' taps. Finally, they will need to submit any additional information requested by the Department to aid in its determination of whether source water treatment is necessary to minimize lead and copper levels in water delivered to users' taps.</b><br />

simpleIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3678 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 17 tokens


simpleIndex response:


<b>
If a water test indicates that the drinking water drawn from a tap in the PWS's home contains lead above 15 ppb, then the PWS will need to take the following precautions: let the water run from the tap before using it for drinking or cooking any time the water in a tap has not been used for several hours; use cold water for drinking, cooking, and preparing baby formula; and consider using a filter certified to remove lead. The PWS may also need to replace the portion of each lead service line that they own if the line contributes lead concentrations of 15 ppb or more after they have completed the comprehensive treatment program.</b><br />

----------------------------------------
----------------------------------------
query: How do we determine how many samples a PWS will need to take for lead?
pdIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 609 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 17 tokens


pdIndex response:


<b>
The number of samples a PWS will need to take for lead will depend on the system's 90th percentile levels for lead and copper, the difference between the 90th percentile tap sampling lead level and the highest source water monitoring result, and the source water lead levels. If the system has 90th percentile levels that do not exceed 0.005 mg/L for lead and 0.65 mg/L for copper for two consecutive periods, it may reduce the sampling to once every three years at the reduced number of sites. If the system does not meet the criteria in paragraph (1), after two consecutive periods with no action level exceedance, the frequency may be reduced to annually at the reduced number of sites, if the system receives written approval from the Department. If the system demonstrates for two consecutive periods that the difference between the 90th percentile tap sampling lead level and the highest source water monitoring result for each period is less than the reporting level for purposes of reporting (DLR), or that the source water lead levels are below the method detection level of 0.001 mg/L and the 90th percentile lead level is equal to or less than the DLR for each period, the system shall conduct tap sampling once every three years.</b><br />

simpleIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3731 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 17 tokens


simpleIndex response:


<b>
The number of samples a public water system (PWS) will need to take for lead will be determined by the requirements of §64689. Lead Service Line Sampling. This section states that each lead service line sample shall be one liter in volume and have stood motionless in the lead service line for at least six hours, but not more than twelve. The number of samples will depend on the size of the lead service line and the number of taps that need to be sampled.</b><br />

----------------------------------------
----------------------------------------
query: What is the difference between a level 1 and 2 assessments?
pdIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 221 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 12 tokens


pdIndex response:


<b>
The difference between a Level 1 and Level 2 assessment is that a Level 2 assessment provides a more detailed examination of the system than a Level 1 assessment. This includes a more comprehensive investigation and review of available information, additional internal and external resources, and other relevant practices to identify the possible presence of sanitary defects, defects in distribution system coliform monitoring practices, and (when possible) the likely reason that the system triggered the assessment.</b><br />

simpleIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3807 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 12 tokens


simpleIndex response:


<b>
A Level 1 assessment is conducted to identify the possible presence of sanitary defects and defects in distribution system coliform monitoring practices. It includes a review and identification of the minimum elements in subparagraphs (A) through (E) and shall describe sanitary defects detected (and if applicable, may note no sanitary defects were detected), corrective actions completed, and a proposed timetable for any corrective actions not already completed. 

A Level 2 assessment is conducted to identify the possible presence of sanitary defects and defects in distribution system coliform monitoring practices. It includes a review and identification of the minimum elements in subsections (a)(2)(A) through (E) to identify the possible presence of sanitary defects and defects in distribution system coliform monitoring practices. It must also describe sanitary defects detected (and if applicable, may note no sanitary defects were detected), corrective actions completed, and a proposed timetable for any corrective actions not already completed.</b><br />

----------------------------------------
----------------------------------------
query: What is a level 2 assessment?
pdIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 207 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 7 tokens


pdIndex response:


<b>
A Level 2 assessment is an evaluation that provides a more detailed examination of a system than a Level 1 assessment. It involves a comprehensive investigation and review of available information, additional internal and external resources, and other relevant practices to identify the possible presence of sanitary defects, defects in distribution system coliform monitoring practices, and (when possible) the likely reason that the system triggered the assessment.</b><br />

simpleIndex's token usage:


INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3627 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 7 tokens


simpleIndex response:


<b>
A Level 2 assessment is an assessment conducted to identify potential problems in water treatment or distribution. It is used to identify problems and take corrective actions to address any issues that are found.</b><br />

----------------------------------------
----------------------------------------
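
The token-usage log lines above can be summarized to compare how much LLM context each index consumes per query. The following is a small sketch using only the standard library; the counts are copied from the `Total LLM token usage` lines for the eight queries in the loop, while `parse_llm_tokens` and `summary` are hypothetical names.

```python
import re
from statistics import mean
from typing import List

LOG_PATTERN = re.compile(r"Total LLM token usage: (\d+) tokens")

def parse_llm_tokens(log_text: str) -> List[int]:
    """Extract LLM token counts from llama-index token counter log lines."""
    return [int(n) for n in LOG_PATTERN.findall(log_text)]

# Two log lines from the single "What does DLR stand for?" query, as a parsing check.
sample_log = (
    "INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 98 tokens\n"
    "INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3598 tokens\n"
)
print(parse_llm_tokens(sample_log))

# LLM token counts per index, in query order, copied from the loop output.
llm_tokens = {
    "pdIndex": [1925, 245, 1917, 1884, 385, 609, 221, 207],
    "simpleIndex": [3670, 3737, 3664, 3704, 3678, 3731, 3807, 3627],
}
summary = {name: mean(counts) for name, counts in llm_tokens.items()}
for name, avg in summary.items():
    print(f"{name}: mean LLM tokens per query = {avg:.1f}")
```

On these eight queries, the index built with the simple directory loader uses roughly four times as many LLM tokens per query as the index built from manually created Documents.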
