### LangChain: Q&A over Documents and Evaluation Metrics

In [3]:
## Import necessary libraries
import os
from dotenv import load_dotenv,find_dotenv
_ = load_dotenv(find_dotenv(".env"))

In [4]:
## Import Langchain Classes
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain_openai.embeddings import OpenAIEmbeddings
from IPython.display import display, Markdown

### RetrievalQA
* This chain first runs a retrieval step to fetch relevant documents, then passes those documents to an LLM to generate a response.

### DocArrayInMemorySearch
* DocArrayInMemorySearch is a document index provided by DocArray that stores documents in memory. It is a good starting point for small datasets, where you may not want to launch a database server. Note that the cells below use FAISS as the vector store instead.


In [5]:
csv_file_path = "../Data/Client.csv"
csv_loader = CSVLoader(file_path=csv_file_path)

In [6]:
embedding_model = OpenAIEmbeddings(
  api_key=os.getenv("OPENAI_API_KEY"),
  model="text-embedding-3-small"
)
indexed_vectoreStore = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embedding_model
).from_loaders([csv_loader])

In [7]:
query = (
    "I want to know the addresses of all the clients named john. "
    "Structure the response into a table in markdown and summarize the content of the generated table."
)

In [8]:
base_llm_model = ChatOpenAI(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY")
)
response = indexed_vectoreStore.query(
    llm=base_llm_model,
    question=query
    )
display(Markdown(response))

Here is the information about clients named "john" structured in a table:

| id_client | Nom  | Prenom | Adresse           |
|-----------|------|--------|-------------------|
| 61        | john | johnson| 234 Main St, New York |
| 51        | john | Smith  | 234 Elm St, New York |

Summary:
The table lists the addresses of all clients named "john." There are two clients with the first name "john," one residing at 234 Main St, New York and the other at 234 Elm St, New York.

### Step By Step Solution

In [9]:
from langchain_community.document_loaders import CSVLoader
csv_loader = CSVLoader(file_path="../Data/Client.csv")

loaded_csv_document = csv_loader.load()
## Display the first 5 documents
loaded_csv_document[:5]

[Document(page_content='id_client: 41\nNom: John\nPrenom: Doe\nAdresse: 123 Main St, New York\nTelephone: 5551234567', metadata={'source': '../Data/Client.csv', 'row': 0}),
 Document(page_content='id_client: 42\nNom: Jane\nPrenom: Smith\nAdresse: 456 Elm St, Los Angeles\nTelephone: 5559876543', metadata={'source': '../Data/Client.csv', 'row': 1}),
 Document(page_content='id_client: 43\nNom: Michael\nPrenom: Johnson\nAdresse: \nTelephone: 5551112222', metadata={'source': '../Data/Client.csv', 'row': 2}),
 Document(page_content='id_client: 44\nNom: Emily\nPrenom: Williams\nAdresse: 321 Maple Ln, Houston\nTelephone: ', metadata={'source': '../Data/Client.csv', 'row': 3}),
 Document(page_content='id_client: 45\nNom: Christopher\nPrenom: \nAdresse: 654 Cedar Rd, Phoenix\nTelephone: 5555556666', metadata={'source': '../Data/Client.csv', 'row': 4})]
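
The output above suggests that each CSV row becomes one document whose text is a list of `column: value` lines, with the row number kept in the metadata. The following is a minimal sketch of that idea using only the standard library. It is not the `CSVLoader` implementation; the function name `rows_to_documents`, the `in-memory.csv` source label, and the exact layout of the CSV file are assumptions made for illustration (the two sample rows are taken from the output above).

```python
import csv
import io

def rows_to_documents(csv_text, source="in-memory.csv"):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, mirroring the format shown above
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({
            "page_content": page_content,
            "metadata": {"source": source, "row": row_index},
        })
    return documents

# Assumed CSV layout, reconstructed from the loader output above
sample_csv = (
    "id_client,Nom,Prenom,Adresse,Telephone\n"
    '41,John,Doe,"123 Main St, New York",5551234567\n'
    "43,Michael,Johnson,,5551112222\n"
)

docs = rows_to_documents(sample_csv)
print(docs[0]["page_content"])
print(docs[1]["metadata"])
```

Empty cells are kept as an empty value after the colon, which is why a row with a missing address still produces an `Adresse: ` line.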

In [10]:
## Example of how the embedding model is applied to a query (first 5 dimensions only)
embedding_model.embed_query("Hello langchain Community")[:5]

[0.017455831170082092,
 -0.02263569086790085,
 -0.003403961891308427,
 0.0127999447286129,
 0.005284655373543501]

In [11]:
vectoreStoreDatabase = FAISS.from_documents(
    loaded_csv_document,
    embedding_model
)
## Results based on similarity search
fetched_info = vectoreStoreDatabase.similarity_search("give me the adresse of all client named john")
fetched_info

[Document(page_content='id_client: 61\nNom: john\nPrenom: johnson\nAdresse: 234 Main St, New York\nTelephone: 5551234567', metadata={'source': '../Data/Client.csv', 'row': 20}),
 Document(page_content='id_client: 41\nNom: John\nPrenom: Doe\nAdresse: 123 Main St, New York\nTelephone: 5551234567', metadata={'source': '../Data/Client.csv', 'row': 0}),
 Document(page_content='id_client: 51\nNom: john\nPrenom: Smith\nAdresse: 234 Elm St, New York\nTelephone: 5551234567', metadata={'source': '../Data/Client.csv', 'row': 10}),
 Document(page_content='id_client: 42\nNom: Jane\nPrenom: Smith\nAdresse: 456 Elm St, Los Angeles\nTelephone: 5559876543', metadata={'source': '../Data/Client.csv', 'row': 1})]
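
Note that the search returns the four nearest documents, including a "Jane Smith" row that does not match the name in the question: the vector store ranks by embedding similarity and leaves the final filtering to the LLM. The ranking step can be sketched in plain Python as follows. This is only an illustration, assuming Euclidean distance and a default of `k=4` (consistent with the four results above); the 3-dimensional vectors and the names `toy_index` and `toy_query` are hypothetical stand-ins for real embeddings, and the actual FAISS index may be configured differently.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_search(query_vector, indexed, k=4):
    """Return the k texts whose vectors are closest to query_vector."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]

# Hypothetical 3-dimensional vectors standing in for real embeddings
toy_index = [
    ("client john johnson", [0.9, 0.1, 0.0]),
    ("client jane smith", [0.2, 0.8, 0.1]),
    ("client john smith", [0.8, 0.3, 0.0]),
    ("client emily williams", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

top_matches = similarity_search(toy_query, toy_index, k=2)
print(top_matches)
```

With `k=2` only the two closest entries are returned; raising `k` always returns more neighbours, relevant or not, which is the behaviour visible in the real output above.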

In [12]:
## Construct a retriever interface from the vector store database
retriever = vectoreStoreDatabase.as_retriever()

In [13]:
## Using Stuff Document Methodology
question_answer_stuff = RetrievalQA.from_chain_type(
    llm=base_llm_model,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

### Stuffing Method
* The stuff method is the simplest way to supply the retrieved documents as context: all of the retrieved text is injected into a single prompt that is passed to the language model (LLM).
* Pros: it makes a single call to the LLM, and the LLM has access to all of the data at once.
* Cons: LLMs have a limited context length, so this will not work for large documents or many documents, because the resulting prompt exceeds the context length.
* Alternative methods: `map_reduce`, `refine`, `map_rerank`
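
To make the trade-off concrete, here is a minimal sketch of what "stuffing" amounts to: joining the retrieved texts and placing them in one prompt. It is not LangChain's implementation. The function `build_stuff_prompt`, the prompt wording, and the `max_chars` limit (a crude stand-in for a token-based context limit) are all assumptions for illustration; the document texts are abbreviated from the retrieval results earlier in this notebook.

```python
def build_stuff_prompt(question, documents, max_chars=None):
    """Concatenate all documents into a single prompt for one LLM call."""
    # Documents are separated by a blank line
    context = "\n\n".join(documents)
    # Hypothetical prompt wording, not the template LangChain uses
    prompt = (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    # Stand-in for the context-length limit: fail when the prompt is too large
    if max_chars is not None and len(prompt) > max_chars:
        raise ValueError(
            f"Prompt has {len(prompt)} characters, limit is {max_chars}"
        )
    return prompt

retrieved_texts = [
    "id_client: 61\nNom: john\nPrenom: johnson\nAdresse: 234 Main St, New York",
    "id_client: 51\nNom: john\nPrenom: Smith\nAdresse: 234 Elm St, New York",
]

prompt = build_stuff_prompt("Where do the clients named john live?", retrieved_texts)
print(prompt)
```

Calling the same function with a small `max_chars` raises a `ValueError`, which mirrors the main drawback listed above: once the combined documents no longer fit, a different strategy is needed.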

In [14]:
response = question_answer_stuff.run(query)
display(Markdown(response))

> Entering new RetrievalQA chain...

> Finished chain.


Sure! Here is a table containing the addresses of all clients named "john":

| id_client | Nom  | Prenom    | Adresse         |
|-----------|------|-----------|-----------------|
| 61        | john | johnson   | 234 Main St, New York |
| 51        | john | Smith     | 234 Elm St, New York  |

Summary:
The table lists the addresses of all clients named "john". There are two clients named "john": one with the last name "johnson" living at "234 Main St, New York" and another with the last name "Smith" living at "234 Elm St, New York".

### Evaluation
Evaluating the outputs of Large Language Models (LLMs) is essential for anyone looking to ship robust LLM applications, yet it remains a challenging task for many. Whether you are:
* refining a model’s accuracy through fine-tuning, or
* enhancing the contextual relevancy of a Retrieval-Augmented Generation (RAG) system,

understanding how to develop and choose the appropriate set of LLM evaluation metrics for your use case is essential to building a reliable LLM evaluation pipeline.
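
As a point of comparison for the LLM-graded evaluation used later in this notebook, the sketch below shows two simple string-based metrics: exact match and token-level F1. These are generic baselines rather than anything LangChain-specific; the helper names and the tokenization rule are assumptions, and the `prediction` string is a made-up example answer.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_match(prediction, reference):
    """True when both texts contain the same tokens in the same order."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty texts agree; one empty text scores zero
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "234 Elm St, New York"
prediction = "The address is 234 Elm St, New York."  # hypothetical model answer

print(exact_match(prediction, reference))
print(round(token_f1(prediction, reference), 2))
```

A verbose but correct answer fails exact match and only gets partial credit from token F1, which is one reason an LLM-based grader can be a better fit for free-form answers.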

### Hard-coded examples for evaluation

In [36]:
examples = [
    {
        "query":"Does any of the client named john live in the 234 Main St, New York adresse",
        "answer":"Yes"
    },
    {
        "query":"Give me the adresse of a client named john smith",
        "answer":"234 Elm St, New York"
    }
]

### LLM-generated examples (automating the hard-coded examples)

In [None]:
### LLM Chain for generating examples for question answering.
from langchain.evaluation.qa import QAGenerateChain

qa_examples_from_llm = QAGenerateChain.from_llm(
    llm=base_llm_model
    )
generated_qa_examples_from_llm = qa_examples_from_llm.apply_and_parse(
    [{"doc":t} for t in loaded_csv_document[:5]]
)

In [37]:
## Keep only the generated query/answer pair from each result
formatted_generated_qa_examples_from_llm = [
    qa_gen["qa_pairs"] for qa_gen in generated_qa_examples_from_llm
]

formatted_generated_qa_examples_from_llm += examples
formatted_generated_qa_examples_from_llm

[{'query': 'What is the full address of the client with id_client 41?',
  'answer': '123 Main St, New York'},
 {'query': 'What is the full name and address of the client with ID 42?',
  'answer': 'The full name is Jane Smith, and the address is 456 Elm St, Los Angeles.'},
 {'query': 'What is the ID number and telephone number of the client named Michael Johnson?',
  'answer': 'The ID number of the client named Michael Johnson is 43, and his telephone number is 5551112222.'},
 {'query': 'What is the full address of the client with the ID 44?',
  'answer': '321 Maple Ln, Houston'},
 {'query': 'What is the address and telephone number for the client with the id_client 45?',
  'answer': 'The address is 654 Cedar Rd, Phoenix, and the telephone number is 5555556666.'},
 {'query': 'Does any of the client named john live in the 234 Main St, New York adresse',
  'answer': 'Yes'},
 {'query': 'Give me the adresse of a client named john smith',
  'answer': '234 Elm St, New York'}]

In [38]:
question_answer_stuff.run(examples[0]["query"])

> Entering new RetrievalQA chain...

> Finished chain.


'Yes, the client named "john" with the last name "johnson" lives at 234 Main St, New York.'

### Trace/debug the steps the chain performs to answer the query

In [45]:
import langchain
langchain.debug = True
question_answer_stuff.run(formatted_generated_qa_examples_from_llm[0]["query"])

[chain/start] [chain:RetrievalQA] Entering Chain run with input:
{
  "query": "What is the full address of the client with id_client 41?"
}
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "What is the full address of the client with id_client 41?",
  "context": "id_client: 41\nNom: John\nPrenom: Doe\nAdresse: 123 Main St, New York\nTelephone: 5551234567\n\nid_client: 45\nNom: Christopher\nPrenom: \nAdresse: 654 Cedar Rd, Phoenix\nTelephone: 5555556666\n\nid_client: 43\nNom: Michael\nPrenom: Johnson\nAdresse: \nTelephone: 5551112222\n\nid_client: 51\nNom: john\nPrenom: Smith\nAdresse: 234 Elm St, New York\nTelephone: 5551234567"
}
[llm/start] [chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain > llm:Chat

'The full address of the client with id_client 41 is 123 Main St, New York.'

#### Evaluate all question/answer pairs in one pass using an LLM

In [47]:
langchain.debug = False
from langchain.evaluation.qa import  QAEvalChain
predictions = question_answer_stuff.apply(formatted_generated_qa_examples_from_llm)

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.


In [51]:
eval_chains = QAEvalChain.from_llm(base_llm_model)
graded_answers = eval_chains.evaluate(formatted_generated_qa_examples_from_llm, predictions)

## Print each question with its reference answer, the chain's prediction, and the LLM-assigned grade
for index, (prediction, grade) in enumerate(zip(predictions, graded_answers)):
    print(f"Query_Answer Example : {index}")
    print(f"Question : {prediction['query']}")
    print(f"Real answer of the current question : {prediction['answer']}")
    print(f"Predicted Answer from the LLM chain : {prediction['result']}")
    print(f"Predicted Grade of the result : {grade['results']}")
    print("-------------------------------------")

Query_Answer Example : 0
Question : What is the full address of the client with id_client 41?
Real answer of the current question : 123 Main St, New York
Predicted Answer from the LLM chain : The full address of the client with id_client 41 is 123 Main St, New York.
Predicted Grade of the result : CORRECT
-------------------------------------
Query_Answer Example : 1
Question : What is the full name and address of the client with ID 42?
Real answer of the current question : The full name is Jane Smith, and the address is 456 Elm St, Los Angeles.
Predicted Answer from the LLM chain : The full name of the client with ID 42 is Jane Smith, and the address is 456 Elm St, Los Angeles.
Predicted Grade of the result : CORRECT
-------------------------------------
Query_Answer Example : 2
Question : What is the ID number and telephone number of the client named Michael Johnson?
Real answer of the current question : The ID number of the client named Michael Johnson is 43, and his telephone numbe
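
Once every example has a grade, the per-example results can be collapsed into a single accuracy figure. The sketch below assumes each graded item is a dict whose `"results"` value is exactly `CORRECT` or `INCORRECT`, as in the printed output above; the helper `accuracy_from_grades` and the `sample_grades` list are hypothetical and do not reproduce the actual results of this run.

```python
def accuracy_from_grades(graded, key="results"):
    """Fraction of graded examples whose grade is exactly CORRECT."""
    if not graded:
        return 0.0
    # Compare the whole string: "INCORRECT" contains "CORRECT" as a substring
    correct = sum(1 for item in graded if item[key].strip().upper() == "CORRECT")
    return correct / len(graded)

# Hypothetical grades for illustration only
sample_grades = [
    {"results": "CORRECT"},
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
]

accuracy = accuracy_from_grades(sample_grades)
print(f"Accuracy: {accuracy:.2f}")
```

If the grader returns extra text around the label, the comparison would need to be adapted; this sketch deliberately stays strict so that unexpected grades are not counted as correct.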

#### LangChain evaluation platform
For more flexible monitoring and automation, LangChain offers an evaluation platform called LangSmith (previously called the LangChain Evaluation Platform), which can be accessed at https://smith.langchain.com/