# Document Search with LangChain

This example shows how to use the Python [LangChain](https://python.langchain.com/docs/get_started/introduction) library to run a text-generation request on open-source LLMs and embedding models using the OpenAI SDK, then augment that request using the text stored in a collection of local PDF documents.

### <u>Requirements</u>
1. As you will accessing the LLMs and embedding models through Vector AI Engineering's Kaleidoscope Service (Vector Inference + Autoscaling), you will need to request a KScope API Key:

   Run the following command (replace ```<user_id>``` and ```<password>```) from **within the cluster** to obtain the API Key. The ```access_token``` in the output is your KScope API Key.
  ```bash
  curl -X POST -d "grant_type=password" -d "username=<user_id>" -d "password=<password>" https://kscope.vectorinstitute.ai/token
  ```
2. After obtaining the `.env` configurations, make sure to create the ```.kscope.env``` file in your home directory (```/h/<user_id>```) and set the following env variables:
- For local models through Kaleidoscope (KScope):
    ```bash
    export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"
    export OPENAI_API_KEY=<kscope_api_key>
    ```
- For OpenAI models:
   ```bash
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export OPENAI_API_KEY=<openai_api_key>
   ```
3. (Optional) Upload some pdf files into the `source_documents` subfolder under this notebook. We have already provided some sample pdfs, but feel free to replace these with your own.

## Set up the RAG workflow environment

#### Import libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import os
import requests
import sys

from pathlib import Path

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

#### Load config files

In [3]:
# Add root folder of the rag_bootcamp repo to PYTHONPATH
current_dir = Path().resolve()
parent_dir = current_dir.parent
sys.path.insert(0, str(parent_dir))

from utils.load_secrets import load_env_file
load_env_file()

In [4]:
GENERATOR_BASE_URL = os.environ.get("OPENAI_BASE_URL")

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

#### Set up some helper functions

In [5]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

#### Make sure other necessary items are in place

In [6]:
# Look for the source_documents folder and make sure there is at least 1 pdf file here
contains_pdf = False
directory_path = "./source_documents"
if not os.path.exists(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
for filename in os.listdir(directory_path):
    contains_pdf = True if ".pdf" in filename else contains_pdf
if not contains_pdf:
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

#### Choose LLM and embedding model

In [219]:
#GENERATOR_MODEL_NAME = "Meta-Llama-3.1-8B-Instruct"
GENERATOR_MODEL_NAME = 'DeepSeek-R1-Distill-Qwen-1.5B'
EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"


## Start with a basic generation request without RAG augmentation

Let's start by asking Llama-3.1 a difficult, domain-specific question we don't expect it to have an answer to. A simple question like "*What is the capital of France?*" is not a good question here, because that's world knowledge that we expect the LLM to know.

Instead, we want to ask it a question that is domain-specific and it won't know the answer to. A good example would be an obscure detail buried deep within a company's annual report. For example:

*How many Vector scholarships in AI were awarded in 2022?*

In [220]:
query = "How many Vector scholarships in AI were awarded in 2022?"

## Now send the query to the open source model using KScope

In [221]:
llm = ChatOpenAI(
    model=GENERATOR_MODEL_NAME,
    temperature=0,
    max_tokens=None,
    base_url=GENERATOR_BASE_URL,
    api_key=OPENAI_API_KEY
)
message = [
    ("human", query),
]
try:
    result = llm.invoke(message)
    print(f"Result: \n\n{result.content}")
except Exception as err:
    if "Error code: 503" in err.message:
        print(f"The model {GENERATOR_MODEL_NAME} is not ready yet.")
    else:
        raise

Result: 

<think>
Okay, so I need to figure out how many Vector scholarships in AI were awarded in 2022. I'm not exactly sure where to start, but I know Vector is a company that offers various kinds of scholarships, including AI-related ones. Maybe I can look up their website or some official reports from them. 

First, I should check their official website. I remember they have a section for scholarships, so I'll go there. Let me search for Vector scholarships in AI. Hmm, they have a section for AI-related programs, so that's probably where I can find the information. 

Once I find the scholarship details, I'll need to filter them by the year 2022. I can probably use a search function or go through the list and pick out the ones from that year. But wait, I'm not sure if they have a specific search feature or if I need to manually go through each entry. Maybe they have a table or a list that I can sort by year. 

I also wonder if Vector has multiple categories of scholarships, like AI,

Without additional information, Llama-3.1 is unable to answer the question correctly. **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, we do have that information available in Vector's 2021-22 Annual Report, which is available in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from `source_documents`

Start by reading in all the PDF files from `source_documents`, break them up into smaller digestible chunks, then encode them as vector embeddings.

In [10]:

from langchain.document_loaders import TextLoader

In [11]:
ls /projects/RAG2/scotia-2/Datasets-Scotia-2/IBIS

'11114CA Wheat Farming in Canada Industry Report.pdf'
'11115CA Corn Farming in Canada Industry Report.pdf'
'33639CA Auto Parts Manufacturing in Canada Industry Report.pdf'
'44111CA New Car Dealers in Canada Industry Report.pdf'
'48412CA Long-Distance Freight Trucking in Canada Industry Report.pdf'
'48422CA Local Specialized Freight Trucking in Canada Industry Report.pdf'
'48423CA Long-Distance Specialized Freight Trucking in Canada Industry Report.pdf'


In [12]:
model_kwargs = {'device': 'cuda', 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print(f"Setting up the embeddings model...")
embeddings = HuggingFaceEmbeddings(
    model_name=   EMBEDDING_MODEL_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Setting up the embeddings model...


In [161]:
%%time
# Load the IBIS pdfs
#directory_path = "./source_documents"
directory_path = "/projects/RAG2/scotia-2/Datasets-Scotia-2/IBIS"
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source documents: {len(docs)}")



Number of source documents: 280
CPU times: user 10.5 s, sys: 63.4 ms, total: 10.6 s
Wall time: 10.6 s


## Process PDFs (Optional)

In [162]:
import nltk
from nltk.corpus import words
nltk.download('words')

[nltk_data] Downloading package words to /h/ws_ikharchuk/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [163]:
import re

In [164]:
def merge_adjacent_words3(word_list):
    english_words = set(words.words())
    i = 0
    while i < len(word_list) - 2:
        combined_word = word_list[i] + word_list[i + 1]+ word_list[i + 1]
        if (combined_word.lower() in english_words) |(combined_word.lower().strip('s').strip('es').strip('ed') in english_words):
            word_list[i] = combined_word
            del word_list[i + 1]
            del word_list[i + 1]
        else:
            i += 1
    return word_list

In [258]:
def merge_adjacent_words2(word_list):
    english_words = set(words.words())
    new_list = []
    i = 0
    while i < len(word_list) - 1:
        combined_word = word_list[i] + word_list[i + 1]
        if (combined_word.lower() in english_words) |(combined_word.lower().strip('s').strip('es').strip('ed') in english_words):
            word_list[i] = combined_word
            del word_list[i + 1]
        else:
            i += 1
    return word_list

In [264]:
def merge_adjacent_words2(word_list):
    english_words = set(words.words())
    new_list = []
    i = 0
    while i < len(word_list) - 1:
        combined_word = word_list[i] + word_list[i + 1]
        if (combined_word.lower() in english_words) |(combined_word.lower().strip('s').strip('es').strip('ed').strip('ing') in english_words):
            new_list.append (combined_word)
            i += 2           
        else:
            new_list.append (word_list[i])
            i += 1
    return new_list

In [265]:
def process_string(test):
    test =test.replace('.', ' ').replace('?', ' ').replace('\n', ' ')
    words_list = test.split()

    words_list = [x.replace('•', ' ')  for x in words_list]

    words_list =[re.sub(r'[^A-Za-z0-9\s]', '', x) for x in words_list ]
    return ' '.join(merge_adjacent_words2(merge_adjacent_words3(words_list)))
#process_string(test)

In [267]:
%%time
for doc in docs[:]:
    doc.page_content = process_string(doc.page_content)


CPU times: user 1min 17s, sys: 800 ms, total: 1min 18s
Wall time: 1min 18s


In [268]:
#docs[10].page_content

### Split the documents into smaller chunks

In [269]:
%%time

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1250, chunk_overlap=32)
chunks = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

Number of text chunks: 560
CPU times: user 194 ms, sys: 8.01 ms, total: 202 ms
Wall time: 200 ms


In [270]:
# Adding Reuters data

In [271]:

file_paths = ['/projects/RAG2/scotia-2/Datasets-Scotia-2/Agriculture_txt/agri_ca_co.csv', 
              '/projects/RAG2/scotia-2/Datasets-Scotia-2/Transport_txt/transport_CA.csv', 
              '/projects/RAG2/scotia-2/Datasets-Scotia-2/Auto_txt/auto_ca.csv', 
             ]
def load_txt_file(file_path):
    # Create a TextLoader instance

    loader = TextLoader(file_path)

    # Load the document

    document = loader.load()
    chunks2 =text_splitter.split_documents(document)
    print(f"Number of text chunks: {len(chunks2)}")
    return chunks2

In [272]:
%%time
for file_path in file_paths:
    chunks2 = load_txt_file(file_path)
    chunks= chunks +chunks2

Number of text chunks: 1212
Number of text chunks: 927
Number of text chunks: 2065
CPU times: user 143 ms, sys: 148 µs, total: 144 ms
Wall time: 152 ms


#### Define the embeddings model

## Retrieval: Make the document chunks available via a retriever

The retriever will identify the document chunks that most closely match our original query. (This takes about 1-2 minutes)

In [273]:
%%time
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

CPU times: user 1min 39s, sys: 276 ms, total: 1min 40s
Wall time: 1min 34s


In [274]:
industries  = ["auto dealer", "wheat farming", 'trucking']
#industry  = "wheat farming"
industry ="auto dealer"
query = f"Where are most {industry} companies are located in Canada?"
query1  = f' Who are the main payers in canadian {industry} companies' # bad

query2 = f'What is the profit margin of Canadian {industry}'

#query = f'Who are the main palyers in Canadian  {industry}?' # bad
query3  = f' Which countries are competitors for canadian industry internationally?' # bad

query4 = f'''Please provide a summary of news  grouping the most important event for the {industry} into trends. Is there is anything could be highlighted regional trends happening in Alberta, BC, and Ontatio?How profit margin of the {industry}  companies has changed in 2024. What was the main reasons?'''
 
query5 =f''' What is the level of {industry}  consolidation in the Canadian sector, and what are the primary drivers behind this trend? '''
 
query6 =f''' What are the primary factors contributing to the supply demand imbalance  in the {industry}  , and what strategies are {industry}  companies employing to address this issue? '''

query7 = f'''How have fluctuating costs impacted the profitability of Canadian {industry}  companies over the past five years, and what strategies have they employed to mitigate this volatility? '''
 
query8 = f'''To what extent has the adoption of new technologies impacted operational efficiency and cost structures within the Canadian {industry}   companies? What are new  technology opportunities in the sector'''

query9 = f'''How significant is the competition from alternative providers for Canadian {industry}  and how are these companies adapting to this competitive landscape. What is the substitution risk.'''
 
query10 = f'''What are the key regulatory and policy challenges facing the Canadian {industry}  companies(e.g., hours of service regulations, environmental regulations, safety standards), and how are these regulations impacting industry operations and profitability?''' 
 

query11 = f'''What is the level of government support (subsidies, grants, incentives) available to the Canadian {industry}  , and are any changes expected? '''


In [278]:
query6 =f''' What are the innovations in  {industry} '''


In [279]:
queries  = [query1, query2, query3, query4, query5, query6, query7, query8, query9, query10, query11 ]

In [280]:
%%time
retrieved_docs = retriever.invoke(query4)

pretty_print_docs(retrieved_docs)

Document 1:

Macro-economic factors observed in recent quarters, including a softening
Canadian economy, inflated vehicle prices and interest rate hikes of recent
years, are expected to continue to be headwinds in the near term. Alongside
these challenges, elevated national inventory of new light vehicles from key
brands in our dealership operations, and a constrained supply of quality,
affordable used vehicles are anticipated to create operating conditions in the
third quarter of 2024 similar to those in the second quarter.

In response to these challenging market conditions, AutoCanada is intensifying
its focus on enhancing its core dealership operations and accelerating
strategic initiatives aimed at improving profitability, reducing leverage, and
adapting to the evolving market landscape.
----------------------------------------------------------------------------------------------------
Document 2:

While the foregoing economic, political and other factors are part of the
general 

## Now send the query to the RAG pipeline

In [286]:
result

'The auto dealer industry is innovating through the adoption of electric and hybrid vehicles, offering a diverse range of car types, leveraging digital marketing and online sales, and adapting to evolving consumer preferences.'

In [287]:
%%time
rag_pipeline = RetrievalQA.from_llm(llm=llm, retriever=retriever)
query =query6 +' Please answer in 2 sentences maximum. If answer is not available, answer NA ' 
result = rag_pipeline.invoke(input=query +' Please answer in 2 sentences maximum. If answer is not available, answer NA ' )
result = result['result'].split ('\n')[-1]
print(f"Result: \n\n{result.replace ('According to the provided context, ', '')}")

Result: 

The auto dealer industry is innovating through the adoption of electric and hybrid vehicles, offering a diverse range of car types, leveraging digital marketing strategies, and adapting to evolving consumer preferences.
CPU times: user 43 ms, sys: 11.9 ms, total: 54.9 ms
Wall time: 4.18 s


# Loop

In [288]:
%%time
for i, query in enumerate (queries):
    query = query + ' Please answer in 2 sentences maximum. If answer is not available, answer NA ' 
    retrieved_docs = retriever.invoke(query)
    rag_pipeline = RetrievalQA.from_llm(llm=llm, retriever=retriever)
    result = rag_pipeline.invoke(input=query)
    result = result['result'].split ('\n')[-1]
    print(f"Result_{i+1}: \n\n{result.replace ('According to the provided context, ', '').replace ('Based on the provided context, ', '')}")
    print ('-'*120)
    
    

Result_1: 

The main payers in Canadian auto dealer companies are AutoCanada, the largest dealer group, and its franchise dealerships across provinces and states.
------------------------------------------------------------------------------------------------------------------------
Result_2: 

The profit margin of Canadian auto dealers is not directly provided in the context, but based on AutoCanada's data, it might be around 1.47%.
------------------------------------------------------------------------------------------------------------------------
Result_3: 

The Canadian industry is a major competitor in the global auto partss market, with countries like the United States, Canada, China, Japan, and Indonesia also being significant players. These countries are known for their strong auto partss, wheat export destinations, and economic conditions that may influence their competitiveness.
-----------------------------------------------------------------------------------------------

# Setting up having two vector stores


In [238]:
#PDF 

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1250, chunk_overlap=32)
chunks5 = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

Number of text chunks: 566


In [239]:
%%time
vectorstore_pdf = FAISS.from_documents(chunks5, embeddings)
retriever_pdf = vectorstore_pdf.as_retriever(search_kwargs={"k": 3})

CPU times: user 11.6 s, sys: 31 ms, total: 11.6 s
Wall time: 11 s


In [289]:
# News
chunks6=[]

for file_path in file_paths:
    chunks7 = load_txt_file(file_path)
    chunks6= chunks6 +chunks7

Number of text chunks: 1212
Number of text chunks: 927
Number of text chunks: 2065


In [291]:
%%time
# news
vectorstore_news = FAISS.from_documents(chunks6, embeddings)
retriever_news = vectorstore_news.as_retriever(search_kwargs={"k": 3})

CPU times: user 1min 27s, sys: 254 ms, total: 1min 27s
Wall time: 1min 22s


 ## Querying

In [292]:
%%time
retrieved_docs1 = retriever_news.invoke(query4)
print ('from News')
pretty_print_docs(retrieved_docs1)

from News
Document 1:

Macro-economic factors observed in recent quarters, including a softening
Canadian economy, inflated vehicle prices and interest rate hikes of recent
years, are expected to continue to be headwinds in the near term. Alongside
these challenges, elevated national inventory of new light vehicles from key
brands in our dealership operations, and a constrained supply of quality,
affordable used vehicles are anticipated to create operating conditions in the
third quarter of 2024 similar to those in the second quarter.

In response to these challenging market conditions, AutoCanada is intensifying
its focus on enhancing its core dealership operations and accelerating
strategic initiatives aimed at improving profitability, reducing leverage, and
adapting to the evolving market landscape.
----------------------------------------------------------------------------------------------------
Document 2:

While the foregoing economic, political and other factors are part of th

In [293]:
%%time
retrieved_docs2 = retriever_pdf.invoke(query4)
print ('from PDF')
pretty_print_docs(retrieved_docs2)

from PDF
Document 1:

Whats impac ting Ne wRoads A utomo tive Groups perf ormanc e NewRoads A utomo tive Group pur chased Ne wmark et Honda now called Ne wRoads Honda  NewRoad A utomo tive announc edthe acquisition too chase Ne wRoads too itss too oss York Region Also this isee ted to strengthen the c ompan ys roster of servic esand v ehicl esYou can view and do wnload moree ydede on my ibisworld com Retail Trade In Canada   44 111CA New Car Deal ers in Canada 25 www ibis world com November 202 4
----------------------------------------------------------------------------------------------------
Document 2:

Profit Margin Total profit margin annual change from 20 11  2029 Profit Margin 2pp1pp0pp1pp2pp3pp 2012 2014 2016 2018 2020 2022 202 4 Source IBIS WorldTotal Profit 3 8bn 19240 8 Profit Margin 2 0 19240 2 pp Profit per Business 810 8k Current Performanc e201924 Revenue C AGR 2 8 Whats driving current industry perf ormanc eNew car deal ers have endur edv olatile conditions  The pande

In [294]:
retrieved_docs=retrieved_docs1 +retrieved_docs2

In [295]:
%%time
rag_pipeline = RetrievalQA.from_llm(llm=llm, retriever=retriever)
query =query6 +' Please answer in 2 sentences maximum. If answer is not available, answer NA ' 
result = rag_pipeline.invoke(input=query +' Please answer in 2 sentences maximum. If answer is not available, answer NA ' )
print(f"Result: \n\n{result['result'].replace ('According to the provided context, ', '')}")

Result: 

<think>
Okay, so I need to figure out the answer to the user's question about the innovations in the auto dealer industry. Let me start by reading through the provided context carefully to understand the key points.

First, the context mentions that the automotive supply chain is crucial, and there's a shortage of semiconductors, which is a limiting factor. This shortage has made it difficult for manufacturers in Canada to produce new cars. The user is probably interested in how the industry is adapting to these challenges.

Looking further, there's information about different types of car dealerships, such as sedans, SUVs, hybrids, and pickup trucks. These dealerships are successful because they offer a variety of options, which gives buyers multiple choices. This variety can provide significant pricing power, which is a key point in the context about consumer power.

Another section talks about the impact of increased new car sales and consumer confidence. It mentions that 

In [246]:
result['result'].split('\n')[-1]

'Auto dealers are innovating through product enhancement, leveraging consumer preferences with used cars, and offering diverse deal erships to attract a broader audience.'

In [230]:
# previous 
print ('Innovations in the auto dealer industry include electric vehicles (EVs), hybrid vehicles, and advanced driver-assistance systems (ADAS) that enhance safety and connectivity. Additionally, auto dealerships have adopted innovative marketing techniques, such as social media and online transactions, to reach customers and increase revenue.')

Innovations in the auto dealer industry include electric vehicles (EVs), hybrid vehicles, and advanced driver-assistance systems (ADAS) that enhance safety and connectivity. Additionally, auto dealerships have adopted innovative marketing techniques, such as social media and online transactions, to reach customers and increase revenue.


In [None]:
`