This notebook builds a generative question-answering model using the llama-index indexing library and the BLOOM text-generation model.
- https://wandb.ai/mostafaibrahim17/ml-articles/reports/The-Answer-Key-Unlocking-the-Potential-of-Question-Answering-With-NLP--VmlldzozNTcxMDE3

In [1]:
!pip install -U llama_index
!pip install -U transformers
!pip install pandas
!pip install numpy
!pip install torch torchvision torchaudio
!pip install -U langchain

Collecting llama_index
  Downloading llama_index-0.5.2.tar.gz (161 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: llama_index
  Building wheel for llama_index (setup.py) ... done
  Created wheel for llama_index: filename=llama_index-0.5.2-py3-none-any.whl size=243234 sha256=7d77c7bdbd5e63ded89da2213d1c2a99a6befc044a85e2bec982cce71a20e113
  Stored in directory: /home/studio-lab-user/.cache/pip/wheels/59/c2/96/557e4ddce303dc8df195782e3f56d0fd74c176f9716e4c9d11
Successfully built llama_index
Installing collected packages: llama_index
  Attempting uninstall: llama_index
    Found existing installation: llama-index 0.4.36
    Uninstalling llama-index-0.4.36:
      Successfully uninstalled llama-index-0.4.36
Successfully installed llama_index-0.5.2
Collecting transformers
  Downloading transformers-4.27.4-py3-no

In [3]:
# Import necessary packages
# from llama_index import GPTSimpleVectorIndex, Document, SimpleDirectoryReader
import pandas as pd, numpy as np
import os, openai
from transformers import GPT2Tokenizer, GPT2LMHeadModel


# Supply your own OpenAI API key here; it is intentionally left blank in this notebook
os.environ['OPENAI_API_KEY'] = ''
openai.api_key = os.getenv("OPENAI_API_KEY")

In [4]:
df = pd.read_json("../data/regItems.json")
df = df.replace(to_replace="", value=np.nan).dropna(axis=0)  # treat empty strings as missing, then drop those rows
df['paragraphText'] = df['paragraphText'].str.replace(r"OLD SECTION.*", "", regex=True)  # strip "OLD SECTION ..." text up to the end of the line
df['paragraphText'] = df['paragraphText'].str.replace(r"[a-zA-Z]\d\w+", ". ", regex=True)  # replace letter+digit tokens with a sentence break
df['paragraphText'] = df['paragraphText'].str.lower()  # normalize case
df

Unnamed: 0,_id,chapter,article,title,paragraphText
2,{'$oid': '6417c99c2d49ca2fefed951e'},Chapter 17.5. Lead and Copper,Article 8. Lead Service Line Requirements for ...,§ 64688. Lead Service Line Replacement.,(a) a system shall replace lead service lines ...
3,{'$oid': '6417c99c2d49ca2fefed951f'},Chapter 17.5. Lead and Copper,Article 7. Public Education Program for Lead A...,§ 64687. Lead Public Education Program Content...,(a) each system with a lead action level excee...
4,{'$oid': '6417c99c2d49ca2fefed9520'},Chapter 17.5. Lead and Copper,Article 8. Lead Service Line Requirements for ...,§ 64689. Lead Service Line Sampling.,(a) each lead service line sample shall be one...
5,{'$oid': '6417c99d2d49ca2fefed9521'},Chapter 17.5. Lead and Copper,Article 6. Source Water Requirements for Actio...,§ 64686. Requirements Subsequent to the Depart...,(a) if the department determines that source w...
6,{'$oid': '6417c99e2d49ca2fefed9522'},"Chapter 15.5. Disinfectant Residuals, Disinfec...",Article 6. Reporting and Recordkeeping Require...,§ 64537.6. Disinfection Byproduct Precursors a...,(a) systems required to meet the enhanced coag...
...,...,...,...,...,...
759,{'$oid': '6417cac62d49ca2fefed9813'},Chapter 17. Surface Water Treatment,"Article 2. Treatment Technique Requirements, W...",§ 64653. Filtration.,(a) all approved surface water utilized by a s...
760,{'$oid': '6417cac62d49ca2fefed9814'},Chapter 17. Surface Water Treatment,"Article 2. Treatment Technique Requirements, W...",§ 64652. Treatment Technique Requirements and ...,(a) a supplier using an approved surface water...
761,{'$oid': '6417cac62d49ca2fefed9815'},Chapter 17. Surface Water Treatment,Article 3. Monitoring Requirements,§ 64655. Filtration Monitoring.,(a) to determine compliance with the performan...
762,{'$oid': '6417cac72d49ca2fefed9816'},Chapter 17. Surface Water Treatment,Article 3. Monitoring Requirements,"§ 64654.8. Source, Raw, Settled, and Recycled ...",(a) a supplier shall comply with the source mo...
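
The cleaning step above can be tried on a single string before running it on the whole DataFrame. The following is a minimal sketch using only the standard `re` module; the sample sentence is made up for illustration and is not taken from `regItems.json`, and the letter class is spelled `[a-zA-Z]`.

```python
import re

# Hypothetical sample text, invented for this sketch.
sample = "See note a12b here. OLD SECTION 64688 (repealed)"

# Step 1: drop everything from "OLD SECTION" to the end of the line.
step1 = re.sub(r"OLD SECTION.*", "", sample)

# Step 2: replace tokens made of a letter, a digit, then more word characters.
step2 = re.sub(r"[a-zA-Z]\d\w+", ". ", step1)

# Step 3: lowercase, as the notebook does.
cleaned = step2.lower()
print(repr(cleaned))
```

A class written as `[A-z]` would also match the punctuation characters that sit between `Z` and `a` in ASCII (such as `_` and `^`), which is why the sketch uses `[a-zA-Z]`.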


In [5]:
data = df['paragraphText'].tolist()
# Prepare your data and convert it into a format that Llama Index can understand
# This example assumes that your data is a list of strings
formatted_data = [{'text': doc} for doc in data]

- https://github.com/jerryjliu/llama_index/blob/main/examples/test_wiki/TestNYC_Embeddings.ipynb
- https://gpt-index.readthedocs.io/en/latest/how_to/custom_llms.html#example-using-a-custom-llm-model

#### LLM customization with LlamaIndex

We use the prompt helper to set the prompt sizes, since every model has a slightly different context length.

You may also have to adjust the internal prompts to get good performance. Even then, the LLM needs to be large enough to handle the complex queries that LlamaIndex issues internally, so your mileage may vary.
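
The prompt-size constraint mentioned above comes down to simple token arithmetic: whatever the model's window is, the space reserved for the answer and for the prompt template is not available for retrieved text. The sketch below illustrates that budget only; `context_budget` and the 100-token overhead are hypothetical and this is not LlamaIndex's implementation.

```python
def context_budget(max_input_size: int, num_output: int, prompt_overhead: int) -> int:
    """Tokens left for retrieved text once the answer and the prompt template are reserved."""
    budget = max_input_size - num_output - prompt_overhead
    if budget <= 0:
        raise ValueError("no room left for context; lower num_output or shorten the prompt")
    return budget


# A 2048-token window with 525 tokens reserved for the answer (the values used
# in this notebook), plus an assumed 100 tokens for the question and template.
print(context_budget(2048, 525, 100))  # 1423
```

With a small window, a large `num_output` quickly crowds out the retrieved paragraphs, so the two settings are best tuned together.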

A list of all default internal prompts is available here: 
- https://github.com/jerryjliu/llama_index/blob/main/gpt_index/prompts/default_prompts.py

Chat-specific prompts are listed here: 
- https://github.com/jerryjliu/llama_index/blob/main/gpt_index/prompts/chat_prompts.py


In [9]:
from langchain.llms.base import LLM
from llama_index import LLMPredictor, GPTSimpleVectorIndex, Document, PromptHelper, ServiceContext
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from IPython.display import Markdown, display
from typing import Optional, List, Mapping, Any

# define prompt helper
# set maximum input size
max_input_size = 2048
# set number of output tokens
num_output = 525
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

In [10]:
# Load the BigScience BLOOM model and tokenizer
model_name = "bigscience/bloom-560m" # "bigscience/bloomz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# BLOOM is a causal LM, so no T5 config is needed; the checkpoint's own config is used
model = AutoModelForCausalLM.from_pretrained(model_name)


In [11]:
class CustomLLM(LLM):
    # Name reported by _identifying_params; reuses the checkpoint name loaded above
    model_name: str = model_name

    # Create the text-generation pipeline used to answer questions
    pipeline = pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        # device=0, # GPU device number
        # output length is capped per call via max_new_tokens in _call
        do_sample=True,
        top_p=0.95,
        top_k=50,
        temperature=0.7
    )

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        prompt_length = len(prompt)
        response = self.pipeline(prompt, max_new_tokens=num_output)[0]["generated_text"]

        # the pipeline returns prompt + continuation; only return the newly generated text
        return response[prompt_length:]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"
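
The `_call` method above relies on the text-generation pipeline returning the prompt followed by the continuation, and slices the prompt off by character length. Here is a minimal sketch of that slicing with a stand-in function; `fake_pipeline` and `generate_only_new_text` are hypothetical names and no model is loaded.

```python
from typing import Dict, List


def fake_pipeline(prompt: str) -> List[Dict[str, str]]:
    """Stand-in for a text-generation pipeline: echoes the prompt, then continues it."""
    return [{"generated_text": prompt + " 42."}]


def generate_only_new_text(prompt: str) -> str:
    """Return just the continuation by dropping the echoed prompt prefix."""
    full_text = fake_pipeline(prompt)[0]["generated_text"]
    return full_text[len(prompt):]


print(repr(generate_only_new_text("Q: What is 6 x 7?\nA:")))  # ' 42.'
```

The slice is only correct when the output starts with the prompt verbatim. The Hugging Face text-generation pipeline also accepts `return_full_text=False`, which may be a simpler way to get the same result.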

In [12]:
# Wrap the custom LLM so LlamaIndex can use it
llm_predictor = LLMPredictor(llm=CustomLLM())
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

In [13]:
# Build a vector index over the cleaned paragraphs with LlamaIndex
index_path = "./index"


documents = [Document(d) for d in data]

# index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

index.save_to_disk('index.json')

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 198292 tokens
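
The embedding token usage logged above reflects that the index stores one embedding per text chunk. Conceptually, a simple vector index answers a query by embedding the question and picking the chunks whose embeddings are closest to it. The sketch below shows that idea with cosine similarity over toy 3-dimensional vectors; it is an illustration under those assumptions, not LlamaIndex's code, and `cosine` and `top_k` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 1) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
best = top_k(query, docs, k=2)
print(best)
```

The text of the selected chunks is then placed into the prompt that is sent to the LLM, which is why the prompt-size settings matter.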


In [14]:
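# Assumption worth checking: loading without service_context may fall back to LlamaIndex's default LLM
# rather than CustomLLM; passing service_context=service_context here would query with the custom model.
new_index = GPTSimpleVectorIndex.load_from_disk('index.json')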

In [15]:
# Query the index (configure logging at DEBUG level for more detailed output)
response = new_index.query("what is cross-connection?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 331 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 6 tokens


<b>
Cross-connection is a connection between a potable water system and a non-potable water system, such as a recycled water system, that could allow contaminated water to enter the potable water system.</b>

The bot's response depends on how the question is phrased. See below for an example.

In [16]:
response = new_index.query("what is AWWA?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 347 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 5 tokens


<b>
AWWA stands for the American Water Works Association. It is a professional organization that works to improve the quality and availability of drinking water.</b>

In [17]:
response = new_index.query("what does AWWA abbreviation mean?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 332 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 8 tokens


<b>
AWWA stands for American Water Works Association.</b>

In [18]:
response = new_index.query("what does awwa abbreviation mean?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 333 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 9 tokens


<b>
AWWA stands for American Water Works Association.</b>
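
The AWWA queries above differ only in phrasing, and comparing such variants side by side is easier with a small helper. The sketch below is one way to do it; `compare_phrasings` is a hypothetical helper, and `stub_query` stands in for the real index so the example runs without a model.

```python
from typing import Callable, Dict, Iterable


def compare_phrasings(query_fn: Callable[[str], object], questions: Iterable[str]) -> Dict[str, str]:
    """Run each phrasing through query_fn and collect the answers as plain strings."""
    return {question: str(query_fn(question)).strip() for question in questions}


def stub_query(question: str) -> str:
    """Stub used only for this sketch; a real run would call the index instead."""
    return f"(answer to: {question})"


results = compare_phrasings(
    stub_query,
    ["what is AWWA?", "what does AWWA abbreviation mean?"],
)
for question, answer in results.items():
    print(f"{question!r} -> {answer}")
```

In the notebook, passing `new_index.query` as `query_fn` should work the same way, since the responses are already formatted as strings for display.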

In [None]:
response = new_index.query("what does the term State Board stands for?")
display(Markdown(f"<b>{response}</b>"))

<b>
The term "State Board" stands for the State Water Resources Control Board.</b>

In [19]:
response = new_index.query("Give me a list of conditions on which the water system be removed from service.")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 350 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 16 tokens


<b>
1. The water system fails to meet the safety and quality standards set by the state board.
2. The water system fails to comply with the regulations and requirements set by the state board.
3. The water system fails to maintain the necessary infrastructure and equipment to provide safe and reliable service.
4. The water system fails to provide adequate customer service.
5. The water system fails to provide timely and accurate billing information.
6. The water system fails to provide timely and accurate maintenance and repair services.
7. The water system fails to provide timely and accurate water testing and monitoring services.
8. The water system fails to provide timely and accurate reporting of water quality data.
9. The water system fails to provide timely and accurate reporting of water usage data.
10. The water system fails to provide timely and accurate reporting of water system operations.
11. The water system fails to provide timely and accurate reporting of water system maintenance and repairs.
12. The water system fails to provide timely and accurate reporting of water system upgrades and improvements.
13. The water system fails to provide timely and accurate reporting of water system compliance with applicable laws and regulations.
14. The water system fails to provide timely and accurate reporting of water</b>

In [28]:
response = new_index.query("I am not sure what the fecal coliform level is and can you explain what that is and how it can affect water system?")
display(Markdown(f"<b>{response}</b>"))

<b>
Fecal coliforms are a type of bacteria that are found in the intestines of warm-blooded animals, including humans. They are used as an indicator of water contamination by sewage or animal waste. High levels of fecal coliforms in a water system can indicate that the water is contaminated with disease-causing organisms, such as E. coli, which can cause serious illnesses.</b>

In [27]:
response = new_index.query("What are some requirements an applicant should have before taking the T2 operator exam?")
display(Markdown(f"<b>{response}</b>"))

<b>
An applicant should have passed a Grade T1 operator examination within the three years prior to submitting the application for certification. They should also have completed at least one year of operator experience working as a certified T2 operator for a T2 facility or higher, or a facility that, prior to January 1, 2001, would have met the criteria for classification as a T2 facility or higher pursuant to Section 64413.1.</b>

In [30]:
response = new_index.query("who is water user?")
display(Markdown(f"<b>{response}</b>"))

<b>
A water user is any individual or entity that uses water from a water system for human consumption. This includes residential customers, businesses, and other organizations.</b>

In [20]:
response = new_index.query("can you explain what the water treatment facility and what functions it serves?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 276 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 14 tokens


<b>
A water treatment facility is a system of structures, equipment, and processes that are used to treat or condition a water supply. This treatment affects the physical, chemical, and bacteriological quality of the water, making it safe for public use. The facility may include processes such as filtration, sedimentation, disinfection, and other methods to remove contaminants and make the water safe for consumption. The facility may also include processes to adjust the pH, hardness, and other characteristics of the water. Facilities that only disinfect the water and do not require Giardia or virus reduction are not considered water treatment facilities.</b>

In [22]:
response = new_index.query("what is the exact meaning of the water treatment facility?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 221 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 11 tokens


<b>
A water treatment facility is a group of structures, equipment, and processes that are used to improve the physical, chemical, or bacteriological quality of water that is distributed or offered to the public for domestic use by a public water system. Facilities that only disinfect water and are under the control of a certified distribution operator are not included as water treatment facilities.</b>

In [23]:
response = new_index.query("What does DLR stand for?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 98 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 7 tokens


<b>
DLR stands for "Detection Limit for Reporting".</b>

In [24]:
response = new_index.query("What is DLR?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 123 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 5 tokens


<b>
DLR stands for "Detection Limit for Reporting" and is the designated minimum level at or above which any analytical finding of a contaminant in drinking water must be reported to the department.</b>

In [25]:
response = new_index.query("What is a level 1 assessment?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 121 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 7 tokens


<b>
A level 1 assessment is an evaluation to determine if there are any sanitary defects, defects in the distribution system coliform monitoring practices, and (when possible) the cause of the system triggering the assessment.</b>

In [26]:
response = new_index.query("What is a level 2 assessment?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 207 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 7 tokens


<b>
A level 2 assessment is an evaluation that provides a more detailed examination of a system than a level 1 assessment. It involves a comprehensive investigation and review of available information, additional internal and external resources, and other relevant practices to identify the possible presence of sanitary defects, defects in distribution system coliform monitoring practices, and (when possible) the likely reason that the system triggered the assessment.</b>

In [27]:
response = new_index.query("What is the difference between a level 1 and 2 assessments?")
display(Markdown(f"<b>{response}</b>"))

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 220 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 12 tokens


<b>
The difference between a level 1 and 2 assessment is that a level 2 assessment provides a more detailed examination of the system than a level 1 assessment. This includes a more comprehensive investigation and review of available information, additional internal and external resources, and other relevant practices to identify the possible presence of sanitary defects, defects in distribution system coliform monitoring practices, and (when possible) the likely reason that the system triggered the assessment.</b>