# Document Search with LangChain

This example shows how to use the Python [LangChain](https://python.langchain.com/docs/get_started/introduction) library to run a text-generation request on open-source LLMs and embedding models using the OpenAI SDK, then augment that request using the text stored in a collection of local PDF documents.

### <u>Requirements</u>
1. As you will accessing the LLMs and embedding models through Vector AI Engineering's Kaleidoscope Service (Vector Inference + Autoscaling), you will need to request a KScope API Key:

   Run the following command (replace ```<user_id>``` and ```<password>```) from **within the cluster** to obtain the API Key. The ```access_token``` in the output is your KScope API Key.
  ```bash
  curl -X POST -d "grant_type=password" -d "username=<user_id>" -d "password=<password>" https://kscope.vectorinstitute.ai/token
  ```
2. After obtaining the `.env` configurations, make sure to create the ```.kscope.env``` file in your home directory (```/h/<user_id>```) and set the following env variables:
- For local models through Kaleidoscope (KScope):
    ```bash
    export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"
    export OPENAI_API_KEY=<kscope_api_key>
    ```
- For OpenAI models:
   ```bash
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export OPENAI_API_KEY=<openai_api_key>
   ```
3. (Optional) Upload some pdf files into the `source_documents` subfolder under this notebook. We have already provided some sample pdfs, but feel free to replace these with your own.

## Set up the RAG workflow environment

#### Import libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import os
import requests
import sys

from pathlib import Path

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

#### Load config files

In [3]:
# Add root folder of the rag_bootcamp repo to PYTHONPATH
current_dir = Path().resolve()
parent_dir = current_dir.parent
sys.path.insert(0, str(parent_dir))

from utils.load_secrets import load_env_file
load_env_file()

In [4]:
GENERATOR_BASE_URL = os.environ.get("OPENAI_BASE_URL")

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

#### Set up some helper functions

In [5]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

#### Make sure other necessary items are in place

In [6]:
# Look for the source_documents folder and make sure there is at least 1 pdf file here
contains_pdf = False
directory_path = "./source_documents"
if not os.path.exists(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
for filename in os.listdir(directory_path):
    contains_pdf = True if ".pdf" in filename else contains_pdf
if not contains_pdf:
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

#### Choose LLM and embedding model

In [7]:
GENERATOR_MODEL_NAME = "Meta-Llama-3.1-8B-Instruct"
EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"

## Start with a basic generation request without RAG augmentation

Let's start by asking Llama-3.1 a difficult, domain-specific question we don't expect it to have an answer to. A simple question like "*What is the capital of France?*" is not a good question here, because that's world knowledge that we expect the LLM to know.

Instead, we want to ask it a question that is domain-specific and it won't know the answer to. A good example would be an obscure detail buried deep within a company's annual report. For example:

*How many Vector scholarships in AI were awarded in 2022?*

In [8]:
query = "How many Vector scholarships in AI were awarded in 2022?"

## Now send the query to the open source model using KScope

In [9]:
llm = ChatOpenAI(
    model=GENERATOR_MODEL_NAME,
    temperature=0,
    max_tokens=None,
    base_url=GENERATOR_BASE_URL,
    api_key=OPENAI_API_KEY
)
message = [
    ("human", query),
]
try:
    result = llm.invoke(message)
    print(f"Result: \n\n{result.content}")
except Exception as err:
    if "Error code: 503" in err.message:
        print(f"The model {GENERATOR_MODEL_NAME} is not ready yet.")
    else:
        raise

The model Meta-Llama-3.1-8B-Instruct is not ready yet.


Without additional information, Llama-3.1 is unable to answer the question correctly. **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, we do have that information available in Vector's 2021-22 Annual Report, which is available in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from `source_documents`

Start by reading in all the PDF files from `source_documents`, break them up into smaller digestible chunks, then encode them as vector embeddings.

In [10]:

from langchain.document_loaders import TextLoader

In [11]:
ls /projects/RAG2/scotia-2/Datasets-Scotia-2/IBIS

'11114CA Wheat Farming in Canada Industry Report.pdf'
'11115CA Corn Farming in Canada Industry Report.pdf'
'33639CA Auto Parts Manufacturing in Canada Industry Report.pdf'
'44111CA New Car Dealers in Canada Industry Report.pdf'
'48412CA Long-Distance Freight Trucking in Canada Industry Report.pdf'
'48422CA Local Specialized Freight Trucking in Canada Industry Report.pdf'
'48423CA Long-Distance Specialized Freight Trucking in Canada Industry Report.pdf'


In [12]:
model_kwargs = {'device': 'cuda', 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print(f"Setting up the embeddings model...")
embeddings = HuggingFaceEmbeddings(
    model_name=   EMBEDDING_MODEL_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Setting up the embeddings model...


In [161]:
%%time
# Load the IBIS pdfs
#directory_path = "./source_documents"
directory_path = "/projects/RAG2/scotia-2/Datasets-Scotia-2/IBIS"
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source documents: {len(docs)}")



Number of source documents: 280
CPU times: user 10.5 s, sys: 63.4 ms, total: 10.6 s
Wall time: 10.6 s


## Process PDFs (Optional)

In [162]:
import nltk
from nltk.corpus import words
nltk.download('words')

[nltk_data] Downloading package words to /h/ws_ikharchuk/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [163]:
import re

In [164]:
def merge_adjacent_words3(word_list):
    english_words = set(words.words())
    i = 0
    while i < len(word_list) - 2:
        combined_word = word_list[i] + word_list[i + 1]+ word_list[i + 1]
        if (combined_word.lower() in english_words) |(combined_word.lower().strip('s').strip('es').strip('ed') in english_words):
            word_list[i] = combined_word
            del word_list[i + 1]
            del word_list[i + 1]
        else:
            i += 1
    return word_list

In [165]:
def merge_adjacent_words2(word_list):
    english_words = set(words.words())
    new_list = []
    i = 0
    while i < len(word_list) - 1:
        combined_word = word_list[i] + word_list[i + 1]
        if (combined_word.lower() in english_words) |(combined_word.lower().strip('s').strip('es').strip('ed') in english_words):
            word_list[i] = combined_word
            del word_list[i + 1]
        else:
            i += 1
    return word_list

In [166]:
len (docs)

280

In [167]:
def process_string(test):
    test =test.replace('.', ' ').replace('?', ' ').replace('\n', ' ')
    words_list = test.split()

    words_list = [x.replace('•', ' ')  for x in words_list]

    words_list =[re.sub(r'[^A-Za-z0-9\s]', '', x) for x in words_list ]
    return ' '.join(merge_adjacent_words2(merge_adjacent_words3(words_list)))
process_string(test)

'What are innovations in industry pr oduc ts and servic esLow Genetically modified seeds could become available  Gene tical ly modified crops like corn so ybeans and cotton have become c ommer cially available in recent decades Genetic modifications have helped tooee tivity while improving resistance to certain pests and diseases  Ther eareare c ommer cially produced genetically modified strains off t Farmerss ve been reluctant touse genetically modified seeds for wheat since it is directly consumed by people unlike so ybeans and most corn pr oduc ts There is significant upstream innovation  Changes too equipment fertilizers and other chemicalss the l argest source of innovation for farmerss stof these innovations promise too ease yiel ds improve quality and save farmerss  If a wheat farmer does not have adequate equipmen tandand terial sit will be difficult for them too compete Capital costs have steadily incr eased for farmerss recent years alongside growing input costs Key Success F

In [168]:
%%time
for doc in docs[:280]:
    doc.page_content = process_string(doc.page_content)


CPU times: user 1min 17s, sys: 846 ms, total: 1min 18s
Wall time: 1min 18s


In [169]:
docs[10].page_content

'Profit Margin Total profit margin annual change from 20 11  2029 Profit Margin 6pp4pp2pp0pp2pp4pp 2012 2014 2016 2018 2020 2022 202 4 Source IBIS WorldTotal Profit 2 6bn 192411 0 Profit Margin 15 9 19244 0 pp Profit per Business 279 8k Current Performanc e201924 Revenue C AGR 4 7 Whats driving current industry perf ormanc eAA owing price of wheat has driven industry growth  Indus try perf ormanc eis closely tied too world price of wheat Farmerss charge a premium for their harvest when wheat pric esarar high Thew orld price of wheat has grown a taa AGR off 3 through the end off 4 spiking in double digit sin 2020  2021 and 2022  Suppl y largelydede the price of wheat Pric essur ge whensese weather or insect infestations cut crop yiel dsAA vere drought in 2020 and 2021 made wheat moree ce and boosted pric es  Revenue reached apeak in 2021 and declined in 2022 but has remained around 2021 levels in 2023 and 202 4 This hasss from year toyear volatility in price as it fell over 2023 in reac

### Split the documents into smaller chunks

In [170]:
%%time

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1250, chunk_overlap=32)
chunks = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

Number of text chunks: 566
CPU times: user 179 ms, sys: 11.8 ms, total: 191 ms
Wall time: 187 ms


In [171]:
# Adding Reuters data

In [172]:

file_paths = ['/projects/RAG2/scotia-2/Datasets-Scotia-2/Agriculture_txt/agri_ca_co.csv', 
              '/projects/RAG2/scotia-2/Datasets-Scotia-2/Transport_txt/transport_CA.csv', 
              '/projects/RAG2/scotia-2/Datasets-Scotia-2/Auto_txt/auto_ca.csv', 
             ]
def load_txt_file(file_path):
    # Create a TextLoader instance

    loader = TextLoader(file_path)

    # Load the document

    document = loader.load()
    chunks2 =text_splitter.split_documents(document)
    print(f"Number of text chunks: {len(chunks2)}")
    return chunks2

In [173]:
%%time
for file_path in file_paths:
    chunks2 = load_txt_file(file_path)
    chunks= chunks +chunks2

Number of text chunks: 1212
Number of text chunks: 927
Number of text chunks: 2065
CPU times: user 140 ms, sys: 15.9 ms, total: 156 ms
Wall time: 160 ms


#### Define the embeddings model

## Retrieval: Make the document chunks available via a retriever

The retriever will identify the document chunks that most closely match our original query. (This takes about 1-2 minutes)

In [174]:
%%time
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

CPU times: user 1min 38s, sys: 280 ms, total: 1min 38s
Wall time: 1min 33s


In [184]:
industries  = ["auto dealer", "wheat farming", 'trucking']
#industry  = "wheat farming"
industry ="auto dealer"
query = f"Where are most {industry} companies are located in Canada?"
query1  = f' Who are the main payers in canadian {industry} companies' # bad

query2 = f'What is the profit margin of Canadian {industry}'

#query = f'Who are the main palyers in Canadian  {industry}?' # bad
query3  = f' Which countries are competitors for canadian industry internationally?' # bad

query4 = f'''Please provide a summary of news  grouping the most important event for the {industry} into trends. Is there is anything could be highlighted regional trends happening in Alberta, BC, and Ontatio?How profit margin of the {industry}  companies has changed in 2024. What was the main reasons?'''
 
query5 =f''' What is the level of {industry}  consolidation in the Canadian sector, and what are the primary drivers behind this trend? '''
 
query6 =f''' What are the primary factors contributing to the supply demand imbalance  in the {industry}  , and what strategies are {industry}  companies employing to address this issue? '''

query7 = f'''How have fluctuating costs impacted the profitability of Canadian {industry}  companies over the past five years, and what strategies have they employed to mitigate this volatility? '''
 
query8 = f'''To what extent has the adoption of new technologies impacted operational efficiency and cost structures within the Canadian {industry}   companies? What are new  technology opportunities in the sector'''

query9 = f'''How significant is the competition from alternative providers for Canadian {industry}  and how are these companies adapting to this competitive landscape. What is the substitution risk.'''
 
query10 = f'''What are the key regulatory and policy challenges facing the Canadian {industry}  companies(e.g., hours of service regulations, environmental regulations, safety standards), and how are these regulations impacting industry operations and profitability?''' 
 

query11 = f'''What is the level of government support (subsidies, grants, incentives) available to the Canadian {industry}  , and are any changes expected? '''


In [185]:
query6 =f''' What are the innovations in  {industry} '''


In [186]:
queries  = [query1, query2, query3, query4, query5, query6, query7, query8, query9, query10, query11 ]

In [187]:
%%time
retrieved_docs = retriever.invoke(query6)

CPU times: user 38.4 ms, sys: 204 µs, total: 38.6 ms
Wall time: 35 ms


Let's see what results it found. Important to note, these results are in the order the retriever thought were the best matches.

In [188]:
pretty_print_docs(retrieved_docs)

Document 1:

muchneeded stability into the aut omobil e manuf acturing supply chain This stability will spur downstream consumer demand giving auto partss ers moree tooee equipment manuf acturing contracts from aut omak ers Innovation will create new opportunit esfor auto partss acturers  Automak ers have added many advanced features to vehicl es incl uding filtrations ystems too pathogens and allergens alongside new airbags that cushion the head better on impact Numerous r ecalls for airbags over the past decade have spurreded latters innovations These innovations have generally created new growth opportunities for auto partss acturers which will persist through the outlook period  The increasing pr ominenc eof electric and autonomous v ehicl es will also create demand fora host of niche y advanced auto partss trend will eventually lead too specialization and fr agmen tation among partss acturers  Even sov ehicl eslackee t systems Ass grow moree ar demand fortheee t segment will falte

## Now send the query to the RAG pipeline

In [182]:
%%time
rag_pipeline = RetrievalQA.from_llm(llm=llm, retriever=retriever)
query =query6 +' Please answer in 2 sentences maximum. If answer is not available, answer NA ' 
result = rag_pipeline.invoke(input=query +' Please answer in 2 sentences maximum. If answer is not available, answer NA ' )
print(f"Result: \n\n{result['result'].replace ('According to the provided context, ', '')}")

Result: 

Low. Innovations in wheat farming include the potential availability of genetically modified seeds, which could improve yield and resistance to pests and diseases, as well as upstream innovations in equipment, fertilizers, and other chemicals that can ease yields, improve quality, and save farmers costs.
CPU times: user 62.1 ms, sys: 7.97 ms, total: 70 ms
Wall time: 2.61 s


# Loop

In [183]:
%%time
for i, query in enumerate (queries):
    query = query + ' Please answer in 2 sentences maximum. If answer is not available, answer NA ' 
    retrieved_docs = retriever.invoke(query)
    rag_pipeline = RetrievalQA.from_llm(llm=llm, retriever=retriever)
    result = rag_pipeline.invoke(input=query)
    print(f"Result_{i+1}: \n\n{result['result'].replace ('According to the provided context, ', '').replace ('Based on the provided context, ', '')}")
    print ('-'*120)
    
    

Result_1: 

the main players in the Canadian wheat farming industry are not explicitly mentioned, but it is mentioned that no single company accounts for more than 5% of the industry market share. Therefore, the answer is NA.
------------------------------------------------------------------------------------------------------------------------
Result_2: 

The profit margin of Canadian wheat farming is 15.9% as of 2024. This is based on data from IBIS World, which reports that the profit margin has remained relatively stable over the past few years.
------------------------------------------------------------------------------------------------------------------------
Result_3: 

According to the context, countries that are competitors for the Canadian auto parts industry internationally include the United States, Japan, Germany, and South Korea, which have extensive automotive and manufacturing sectors and are major buyers of auto parts. Additionally, countries like Mexico also pose a