### LangChain: Q&A over Documents

In [1]:
## Import necessary libraries
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(".env"))

In [11]:
## Import LangChain classes
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
## Document loaders now live in langchain_community; the langchain.document_loaders path is deprecated
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain_openai.embeddings import OpenAIEmbeddings
from IPython.display import display, Markdown

### RetrievalQA
* This chain first does a retrieval step to fetch relevant documents, then passes those documents into an LLM to generate a response.
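
The two steps described above can be sketched in plain Python. This is only an illustration of the control flow, not LangChain's implementation: `answer_with_retrieval`, `toy_retrieve`, and `toy_generate` are hypothetical names, and the "LLM" is a stub so the cell runs without an API key.

```python
from typing import Callable, List


def answer_with_retrieval(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Fetch relevant documents first, then hand them to the model with the question."""
    documents = retrieve(question)
    context = "\n\n".join(documents)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)


# Hypothetical stand-ins so the sketch runs offline.
corpus = [
    "Nom: John, Adresse: 123 Main St, New York",
    "Nom: Jane, Adresse: 456 Elm St, Los Angeles",
]


def toy_retrieve(question: str) -> List[str]:
    # Keep documents that share at least one lowercase word with the question.
    words = set(question.lower().split())
    return [
        doc for doc in corpus
        if words & set(doc.lower().replace(",", " ").split())
    ]


def toy_generate(prompt: str) -> str:
    # A real LLM call would go here; the stub only reports the prompt size.
    return f"received {len(prompt)} characters of prompt"


retrieved = toy_retrieve("where does john live")
answer = answer_with_retrieval("where does john live", toy_retrieve, toy_generate)
```

Passing the retriever and the generator in as callables mirrors the idea that either half can be swapped independently.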
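### DocArrayInMemorySearch
* DocArrayInMemorySearch is a document index provided by DocArray that stores documents in memory. It is a good starting point for small datasets, where you may not want to launch a database server.
* This notebook uses FAISS as the vector store instead. The index built here also lives in memory, so no database server is needed.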
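
To make "stores documents in memory" concrete, here is a minimal sketch of an in-memory vector index. It is not how DocArray or FAISS are implemented; `InMemoryVectorIndex` is a hypothetical class, the vectors are made up, and the search is a brute-force cosine-similarity scan.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class InMemoryVectorIndex:
    """Keeps (vector, text) pairs in a Python list; nothing leaves the process."""

    def __init__(self) -> None:
        self._entries: List[Tuple[List[float], str]] = []

    def add(self, vector: List[float], text: str) -> None:
        self._entries.append((vector, text))

    def search(self, query: List[float], k: int = 1) -> List[str]:
        # Brute force: score every entry and keep the k most similar.
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query, entry[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


index = InMemoryVectorIndex()
index.add([1.0, 0.0], "row about John")  # made-up 2-D vectors, not real embeddings
index.add([0.0, 1.0], "row about Lisa")
best = index.search([0.9, 0.1], k=1)
```

A linear scan like this is fine for a few thousand rows; dedicated libraries exist because it stops scaling beyond that.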


In [23]:
csv_file_path = "../Data/Client.csv"
csv_loader = CSVLoader(file_path=csv_file_path)

In [13]:
embedding_model = OpenAIEmbeddings(
  api_key=os.getenv("OPENAI_API_KEY"),
  model="text-embedding-3-small"
)
indexed_vectoreStore = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embedding_model
).from_loaders([csv_loader])

In [40]:
query = (
    "I want to know the addresses of all the clients named john. "
    "Structure the response as a table in Markdown and summarize the content of the generated table."
)

In [17]:
base_llm_model = ChatOpenAI(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY")
)
response = indexed_vectoreStore.query(
    llm=base_llm_model,
    question=query
    )
display(Markdown(response))

Certainly! Here is the information structured into a table in markdown format:

| id_client | Nom  | Prenom | Adresse              |
|-----------|------|--------|----------------------|
| 60        | Lisa | Smith  | 901 Elm St, Phoenix  |
| 50        | Lisa | Taylor |                      |
| 70        | Lisa | Johnson| 901 Elm St, Phoenix  |

**Summary:**
- There are three clients named Lisa.
- Out of these three, two clients have the same address: 901 Elm St, Phoenix.
- One client named Lisa Taylor does not have an address listed.

### Step-by-Step Solution

In [31]:
from langchain_community.document_loaders import CSVLoader
csv_loader = CSVLoader(file_path="../Data/Client.csv")

## Each CSV row becomes one Document
loaded_csv_document = csv_loader.load()
## Display the first 5 documents
loaded_csv_document[:5]

[Document(page_content='id_client: 41\nNom: John\nPrenom: Doe\nAdresse: 123 Main St, New York\nTelephone: 5551234567', metadata={'source': '../Data/Client.csv', 'row': 0}),
 Document(page_content='id_client: 42\nNom: Jane\nPrenom: Smith\nAdresse: 456 Elm St, Los Angeles\nTelephone: 5559876543', metadata={'source': '../Data/Client.csv', 'row': 1}),
 Document(page_content='id_client: 43\nNom: Michael\nPrenom: Johnson\nAdresse: \nTelephone: 5551112222', metadata={'source': '../Data/Client.csv', 'row': 2}),
 Document(page_content='id_client: 44\nNom: Emily\nPrenom: Williams\nAdresse: 321 Maple Ln, Houston\nTelephone: ', metadata={'source': '../Data/Client.csv', 'row': 3}),
 Document(page_content='id_client: 45\nNom: Christopher\nPrenom: \nAdresse: 654 Cedar Rd, Phoenix\nTelephone: 5555556666', metadata={'source': '../Data/Client.csv', 'row': 4})]
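
The output above suggests that each CSV row is turned into one document whose text is a list of `column: value` lines, with the row number kept as metadata. The sketch below mimics that shape with the standard `csv` module. It is an approximation of the visible output, not the loader's actual code, and the CSV text is reconstructed from two of the rows shown.

```python
import csv
import io

# Two rows shaped like the output above; the raw CSV text is a reconstruction.
csv_text = (
    "id_client,Nom,Prenom,Adresse,Telephone\n"
    '41,John,Doe,"123 Main St, New York",5551234567\n'
    "43,Michael,Johnson,,5551112222\n"
)

documents = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    # One document per row: each column becomes a "name: value" line.
    page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
    metadata = {"source": "in-memory example", "row": row_number}
    documents.append({"page_content": page_content, "metadata": metadata})

print(documents[0]["page_content"])
```

Note that an empty cell still produces a line (`Adresse: ` with nothing after it), which is why missing values remain visible to the model.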

In [32]:
## Example of how the embedding model turns text into a vector (first 5 values shown)
embedding_model.embed_query("Hello langchain Community")[:5]

[0.017455831170082092,
 -0.02263569086790085,
 -0.003403961891308427,
 0.0127999447286129,
 0.005284655373543501]
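
An embedding call takes text in and returns a fixed-length list of floats. The toy function below reproduces only that shape so it can run offline. `toy_embed` is a hypothetical stand-in that hashes words into buckets; unlike a trained embedding model, it captures nothing about meaning.

```python
import hashlib
import math
from typing import List


def toy_embed(text: str, dimensions: int = 8) -> List[float]:
    """Hash each word into one of `dimensions` buckets, then scale to unit length."""
    counts = [0.0] * dimensions
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        counts[digest[0] % dimensions] += 1.0
    length = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / length for c in counts]


vector_a = toy_embed("Hello langchain Community")
vector_b = toy_embed("hello LANGCHAIN community")
```

Whatever the input length, the output always has `dimensions` entries, which is what lets a vector store compare any two texts.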

In [35]:
vectoreStoreDatabase = FAISS.from_documents(
    loaded_csv_document,
    embedding_model
)
## Results returned by a similarity search over the vector store
fetched_info = vectoreStoreDatabase.similarity_search("give me the adresse of all client named john")
fetched_info

[Document(page_content='id_client: 61\nNom: john\nPrenom: johnson\nAdresse: 234 Main St, New York\nTelephone: 5551234567', metadata={'source': '../Data/Client.csv', 'row': 20}),
 Document(page_content='id_client: 41\nNom: John\nPrenom: Doe\nAdresse: 123 Main St, New York\nTelephone: 5551234567', metadata={'source': '../Data/Client.csv', 'row': 0}),
 Document(page_content='id_client: 51\nNom: john\nPrenom: Smith\nAdresse: 234 Elm St, New York\nTelephone: 5551234567', metadata={'source': '../Data/Client.csv', 'row': 10}),
 Document(page_content='id_client: 42\nNom: Jane\nPrenom: Smith\nAdresse: 456 Elm St, Los Angeles\nTelephone: 5559876543', metadata={'source': '../Data/Client.csv', 'row': 1})]
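
The result above contains four documents, and the last one (Jane Smith) is not a John at all. That is expected from a nearest-neighbour search: it returns the k closest vectors whether or not they truly match. The sketch below illustrates this with made-up 2-D vectors. The default of `k=4` is an assumption chosen to match the four results shown, and Euclidean distance is likewise an assumption; `top_k_by_distance` is a hypothetical helper.

```python
import math
from typing import List, Tuple


def top_k_by_distance(
    query: List[float],
    entries: List[Tuple[str, List[float]]],
    k: int = 4,
) -> List[str]:
    """Return the labels of the k entries closest to the query."""
    ranked = sorted(entries, key=lambda entry: math.dist(query, entry[1]))
    return [label for label, _ in ranked[:k]]


# Made-up vectors standing in for row embeddings.
entries = [
    ("john johnson", [0.9, 0.1]),
    ("John Doe", [0.8, 0.2]),
    ("john Smith", [0.7, 0.2]),
    ("Jane Smith", [0.4, 0.6]),
    ("Michael Johnson", [0.1, 0.9]),
]
results = top_k_by_distance([1.0, 0.0], entries)
```

Because the search always fills k slots, the downstream LLM has to decide which retrieved rows actually answer the question.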

In [38]:
## Construct a retriever interface from the vector store
retriever = vectoreStoreDatabase.as_retriever()

In [39]:
## Use the "stuff" chain type: all retrieved documents are placed into a single prompt
question_answer_stuff = RetrievalQA.from_chain_type(
    llm=base_llm_model,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)
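
As a rough picture of what the "stuff" chain type does, the sketch below concatenates every retrieved document into one context block and appends the question. The prompt wording and the `build_stuff_prompt` helper are hypothetical; LangChain's actual template differs.

```python
from typing import List


def build_stuff_prompt(question: str, documents: List[str]) -> str:
    """Concatenate every retrieved document into one context block, then add the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "id_client: 61\nNom: john\nAdresse: 234 Main St, New York",
    "id_client: 51\nNom: john\nAdresse: 234 Elm St, New York",
]
prompt = build_stuff_prompt("What are the addresses of clients named john?", retrieved)
print(prompt)
```

This approach needs only one LLM call, but the prompt grows with every retrieved document, so it suits small result sets that fit within the model's context window.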

In [41]:
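## Chain.run is deprecated; invoke returns a dict whose "result" key holds the answer
response = question_answer_stuff.invoke({"query": query})["result"]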
display(Markdown(response))

> Entering new RetrievalQA chain...

> Finished chain.

Here are the addresses of all clients named "john":

| id_client | Nom  | Prenom | Adresse         |
|-----------|------|--------|-----------------|
| 61        | john | johnson| 234 Main St, New York |
| 51        | john | Smith  | 234 Elm St, New York  |

Summary:
The table lists the addresses of all clients named "john." There are two clients with this name, one residing at 234 Main St, New York, and the other at 234 Elm St, New York.