# Document Segmentation and RAG


This notebook contains the steps and code to answer questions using the flan-ul2 foundation model from watsonx.ai and LangChain. The document is segmented into meaningful chunks, and multiple files are generated based on the subtitles of the data for this use case. This approach eliminates the need to specify a chunk size and chunk overlap. The Intelligent Document Processing technique of Watson Discovery was used to split the document, and a query was used to retrieve all the resulting documents. These documents are ingested into ChromaDB, and the rest of the query process remains the same. This approach is expected to improve retrieval metrics because the chunks are more contextually appropriate.

## Contents
This notebook contains the following:
1. Setup of required libraries and modules
2. Data loading (note that multiple files are downloaded)
3. Accessing the LLM from Watson Machine Learning (WML)
4. Answering a question using the RAG approach

## Install the dependencies

Before starting this step, ensure that the Watson Machine Learning service has been created and associated with this project. It might take a few minutes to install all the dependencies.

In [1]:
!pip install "ibm-watson-machine-learning>=1.0.320" 
!pip install "pydantic>=1.10.0" 
!pip install langchain 
!pip install huggingface
!pip install huggingface-hub
!pip install sentence-transformers
!pip install chromadb
!pip install wget

Collecting pydantic>=1.10.0
  Downloading pydantic-2.4.2-py3-none-any.whl (395 kB)
Collecting annotated-types>=0.4.0
  Downloading annotated_types-0.6.0-py3-none-any.whl (12 kB)
Collecting typing-extensions>=4.6.1
  Downloading typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Collecting pydantic-core==2.10.1
  Downloading pydantic_core-2.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Installing collected packages: typing-extensions, annotated-types, pydantic-core, pydantic
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.5.0
    Uninstalling typing_extensions-4.5.0:
      Successfully uninstalled typing_extensions-4.5.0
Successfully installed annotated-types-

Installing collected packages: typing-inspect, sniffio, marshmallow, jsonpointer, jsonpatch, dataclasses-json, anyio, langsmith, langchain
Successfully installed anyio-3.7.1 dataclasses-json-0.6.1 jsonpatch-1.33 jsonpointer-2.4 langchain-0.0.319 langsmith-0.0.49 marshmallow-3.20.1 sniffio-1.3.0 typing-inspect-0.9.0
Collecting huggingface
  Downloading huggingface-0.0.1-py3-none-any.whl (2.5 kB)
Installing collected packages: huggingface
Successfully installed huggingface-0.0.1
Collecting huggingface-hub
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting filelock
  Downloading filelock-3.12.4-py3-none-any.whl (11 kB)
Collecting fsspec>=2023.5.0
  Downloading fsspec-2023.9.2-py3-none-any.whl (173 kB)
Installing collected 

  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125940 sha256=35b8f265b02087794bcd1a0cfcf2a4dab16b6285fb8559ebf8b6c0ba8f480982
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence-transformers
Installing collected packages: safetensors, regex, nltk, huggingface-hub, tokenizers, transformers, sentence-transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.18.0
    Uninstalling huggingface-hub-0.18.0:
      Successfully uninstalled huggingface-hub-0.18.0
Successfully installed huggingface-hub-0.17.3 nltk-3.8.1 regex-2023.10.3 safetensors-0.4.0 sentence-transformers-2.2.2 tokenizers-0.14.1 transformers-4.34.1
Collecting chromadb
  Downloading chromadb-0.4.14-py3-none-any.whl (448 kB)

Collecting uvloop!=0.15.0,!=0.15.1,>=0.14.0
  Downloading uvloop-0.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)
Collecting websockets>=10.4
  Downloading websockets-11.0.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (129 kB)
Collecting humanfriendly>=9.1
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Collecting mpmath>=0.19
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)

## watsonx.ai API Connection

Provide the IBM Cloud IAM API key to access the foundation models through the WML endpoint.

In [1]:
import os, getpass
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": getpass.getpass("Please enter your WML api key (hit enter): ")
}

Please enter your WML api key (hit enter): ········


## Project ID definition

The foundation models need a project ID for execution and for tracking capacity unit hours (CUH) usage.

In [2]:
try:
    project_id = os.environ["PROJECT_ID"]
   
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

The cell below downloads multiple files from a Git repository. These files were produced by splitting the source document with Watson Discovery; a query was then used to retrieve all the resulting documents, and each one was saved individually. Only four files are used here to illustrate the approach.

In [7]:

import wget

url1 = 'https://raw.github.com/ravisrirangam/chunking_techniques/main/data/code_of_conduct.txt'
wget.download(url1, out='code_of_conduct.txt')
url2 = 'https://raw.github.com/ravisrirangam/chunking_techniques/main/data/internet_email_policy.txt'
wget.download(url2, out='internet_email_policy.txt')
url3 = 'https://raw.github.com/ravisrirangam/chunking_techniques/main/data/mobile_phone_policy.txt'
wget.download(url3, out='mobile_phone_policy.txt')
url4 = 'https://raw.github.com/ravisrirangam/chunking_techniques/main/data/recruitment_policy.txt'
wget.download(url4, out='recruitment_policy.txt')
print('files downloaded')

files downloaded
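
The files downloaded above were segmented by Watson Discovery, which is not reproduced in this notebook. As a rough stand-in only, the idea of splitting on subtitles instead of a fixed chunk size could be sketched in plain Python as follows. The heading heuristic (`is_heading`), the helper names, and the second section of the sample text are all assumptions made for illustration; they are not how Watson Discovery detects sections.

```python
import re

def is_heading(line, max_words=6):
    """Heuristic (an assumption): a short line with no sentence-ending punctuation."""
    stripped = line.strip()
    if not stripped or stripped[-1] in ".:;,?!":
        return False
    return len(stripped.split()) <= max_words

def split_by_heading(text):
    """Return {heading: body} with one entry per detected section."""
    sections = {}
    current = None
    for line in text.splitlines():
        if is_heading(line):
            current = line.strip()
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return {title: "\n".join(body) for title, body in sections.items()}

def to_filename(heading):
    """Turn a heading such as 'Mobile Phone Policy' into 'mobile_phone_policy.txt'."""
    return re.sub(r"\W+", "_", heading.strip().lower()).strip("_") + ".txt"

# Made-up sample text; the second section is invented for this sketch.
sample = """Mobile Phone Policy
Mobile devices are primarily intended for work-related tasks.
Limited personal usage is allowed.

Smoking Policy
Smoking is permitted only in designated areas.
"""

for heading, body in split_by_heading(sample).items():
    print(to_filename(heading), "->", len(body), "characters")
```

Each section becomes its own file-sized chunk, so the chunk boundaries follow the structure of the document and no chunk size or overlap has to be chosen.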


## Data Loading 

The downloaded files are encoded using the default Hugging Face embedding model (LangChain's `HuggingFaceEmbeddings` default, `sentence-transformers/all-mpnet-base-v2`) and ingested into ChromaDB.

In [8]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()

In [12]:
from langchain.document_loaders import TextLoader
from langchain.vectorstores import Chroma

filenames = ['code_of_conduct.txt', 'internet_email_policy.txt', 'mobile_phone_policy.txt', 'recruitment_policy.txt']
for filename in filenames:
    # TextLoader returns a list holding one Document for the whole file
    loader = TextLoader(filename)
    document = loader.load()
    # Embed the document and add it to the Chroma vector store
    Chroma.from_documents(document, embeddings)
    print(filename, ' ingested')

code_of_conduct.txt  ingested
internet_email_policy.txt  ingested
mobile_phone_policy.txt  ingested
recruitment_policy.txt  ingested
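
Each file above is ingested as a single document, with no text splitter involved. For contrast, a conventional character-based splitter needs a chunk size and an overlap, and it cuts wherever the window happens to end. The function below is a minimal sketch of that idea in plain Python; `fixed_size_chunks` is a hypothetical helper, not LangChain's splitter.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

text = "Acceptable Use: Mobile devices are primarily intended for work-related tasks."
for chunk in fixed_size_chunks(text, chunk_size=30, chunk_overlap=10):
    # Chunks start and end mid-word because the window ignores sentence boundaries.
    print(repr(chunk))
```

The section-per-file approach used in this notebook avoids tuning these two parameters, at the cost of relying on the upstream segmentation being good.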


One more policy file is added in this step to obtain a reference to the ChromaDB store (`docsearch`) that is used for retrieval. This reference could also have been captured in the earlier ingestion step.

In [17]:
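url = 'https://raw.github.com/ravisrirangam/chunking_techniques/main/data/smoking_policy.txt'
wget.download(url, out='smoking_policy.txt')
print('file downloaded')
# Load the file that was just downloaded (not the last `filename` left over from the earlier loop)
loader = TextLoader('smoking_policy.txt')
document = loader.load()
# Keep the returned store: `docsearch` is the handle used for retrieval later
docsearch = Chroma.from_documents(document, embeddings)
print('Smoking policy ingested')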

file downloaded
Smoking policy ingested


## flan-ul2 creation

The code below does the following:
1. Gets the model_id
2. Creates the parameters for the model
3. Initializes the model
4. Wraps the model for use with LangChain

In [18]:
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

model_id = ModelTypes.FLAN_UL2

The decoding method is set to "greedy" to get deterministic output. You can change it to "sample" and add the temperature, top_k, and top_p parameters.

In [19]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 130,
    GenParams.MAX_NEW_TOKENS: 200
}
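
To illustrate how greedy decoding differs from sampling, here is a toy sketch in plain Python over a four-word vocabulary. It is not the watsonx.ai implementation; the vocabulary, the logits, and the helper names (`greedy_pick`, `sample_pick`) are made up, and the filtering shown is only one common way that temperature, top_k, and top_p are applied.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Always return the index of the highest-scoring token (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Sample an index after temperature scaling and top_k / top_p filtering."""
    probs = softmax(logits, temperature)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for i in ranked:  # keep the smallest prefix whose probability mass reaches top_p
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical vocabulary and next-token scores.
vocab = ["policy", "phone", "email", "smoking"]
logits = [2.0, 1.5, 0.3, -1.0]

print("greedy:", vocab[greedy_pick(logits)])
rng = random.Random(0)
picks = [sample_pick(logits, temperature=0.7, top_k=3, top_p=0.9, rng=rng) for _ in range(5)]
print("sample:", [vocab[i] for i in picks])
```

Greedy decoding returns the same token on every run, while sampling can return any token that survives the filters, which is why greedy is the simpler choice when reproducible answers matter.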

In [20]:
from ibm_watson_machine_learning.foundation_models import Model

model = Model(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=project_id
)

In [21]:
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM

flan_ul2_llm = WatsonxLLM(model=model)

## Get Answer to a question on one policy using RAG

The code below shows the retrieval step from ChromaDB. The query does not need to contain the keyword "policy". Retrieval quality may be improved by trying a different embedding model, such as one of the all-MiniLM* models. The code is for illustrative purposes only; the output can be improved with PromptTemplate and other advanced LangChain classes.

In [26]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=flan_ul2_llm, chain_type="stuff", retriever=docsearch.as_retriever())
query = "code of conduct"
qa.run(query)

"Yes, it outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built on integrity, respect, and accountability. Integrity: We hold ourselves to the highest ethical standards. This means acting honestly and transparently in all our interactions, whether with colleagues, clients, or the broader community. We respect and protect sensitive information, and we avoid conflicts of interest. Respect: We embrace diversity and value each individual's contributions. Discrimination, harassment, or any form of disrespectful behavior is unacceptable. We create an inclusive environment where differences are celebrated and everyone is treated with dignity and courtesy. Accountability: We take responsibility for our actions and decisions. We follow all relevant laws and regulations, and we strive to continuously improve our practices. We report any potential violations of this code and support the inves
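
The "stuff" chain type used above places all retrieved chunks into a single prompt before calling the LLM. The sketch below shows that idea with plain string formatting. The prompt wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's built-in template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then ask the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Two sentences taken from the mobile phone policy text, used as stand-in retrieved chunks.
retrieved = [
    "Acceptable Use: Mobile devices are primarily intended for work-related tasks.",
    "Security: Safeguard your mobile device and access credentials.",
]
print(build_stuff_prompt("What is the acceptable use of mobile devices?", retrieved))
```

Because every chunk goes into one prompt, this pattern works best when the retrieved chunks are few and short enough to fit the model's context window.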

## Query on ChromaDB

When a native similarity query for "mobile policy" is run on ChromaDB, four chunks are retrieved (`similarity_search` returns four results by default). The first two chunks match the query and are the correct ones. The next two are not relevant to the query but were returned because of some semantic overlap. The LLM picked up the correct chunks and generated a summary of their content.
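
A similarity search returns the k nearest chunks whether or not they are truly relevant, which is why unrelated chunks can appear at the end of the result list. The toy sketch below ranks made-up three-dimensional vectors by cosine similarity. The vectors and helper names are assumptions for illustration, and the distance metric actually used by the vector store may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=4):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
doc_vecs = {
    "mobile_phone_policy": [0.9, 0.1, 0.0],
    "internet_email_policy": [0.6, 0.5, 0.1],
    "code_of_conduct": [0.2, 0.8, 0.3],
    "recruitment_policy": [0.1, 0.3, 0.9],
    "smoking_policy": [0.0, 0.2, 0.8],
}
query_vec = [1.0, 0.0, 0.0]

for name, score in top_k(query_vec, doc_vecs, k=4):
    print(f"{score:.2f}  {name}")
```

The lower-ranked entries still make the top four despite weak scores, so a score threshold or a smaller k is worth considering when only closely matching chunks should reach the LLM. The notebook's actual query against ChromaDB follows.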

In [27]:
query = "mobile policy"
docs = docsearch.similarity_search(query)
print(len(docs))
for doc in docs:
    print(doc.page_content)
    print('\n\n')


4
Mobile Phone Policy
The Mobile Phone Policy sets forth the standards and expectations governing the appropriate and responsible usage of mobile devices in the organization. The purpose of this policy is to ensure that employees utilize mobile phones in a manner consistent with company values and legal compliance.
Acceptable Use: Mobile devices are primarily intended for work-related tasks. Limited personal usage is allowed, provided it does not disrupt work obligations.
Security: Safeguard your mobile device and access credentials. Exercise caution when downloading apps or clicking links from unfamiliar sources. Promptly report security concerns or suspicious activities related to your mobile device.
Confidentiality: Avoid transmitting sensitive company information via unsecured messaging apps or emails. Be discreet when discussing company matters in public spaces.
Cost Management: Keep personal phone usage separate from company accounts and reimburse the company for any personal cha