

# Exploring LangChain and LLM APIs for Building a RAG with Chroma Vector Store

This notebook explores how LangChain and LLM APIs can be combined to build a Retrieval-Augmented Generation (RAG) system.  
Specifically, it uses the Chroma vector store to index and retrieve relevant passages from a dataset of 100 LLM papers, provided in PDF format.  
The goal is to show how LangChain's tools, such as document loaders, text splitters, retrievers, and LLM wrappers, fit together to process a large collection of research papers and answer questions about them.


# Table of Contents

1. Introduction
2. Environment Setup
    - Loading Environment Variables
3. Libraries and Modules
4. Document Processing
    - Extracting Text from PDFs
    - Preparing Documents from PDFs
5. Vector Store Setup
    - Creating or Loading Vector Store
6. Retrieval and Query Processing
    - Retrieving Relevant Documents
    - Generating Answers using LLM


This code imports various modules and components from LangChain and other libraries to build a Retrieval-Augmented Generation (RAG) system:

- **LangChain Hub**: For pulling shared artifacts such as prompt templates (used later to fetch `rlm/rag-prompt`).
- **RecursiveCharacterTextSplitter**: For splitting text into smaller chunks to facilitate processing.
- **Chroma**: For storing and querying embeddings using a vector database.
- **StrOutputParser**: For parsing string-based outputs from language models.
- **RunnablePassthrough**: For creating pass-through components in a pipeline.
- **ChatOpenAI**: For interacting with OpenAI's GPT models for chat-based generation.
- **HuggingFaceEmbeddings**: For generating embeddings using Hugging Face models.
- **os**: For interacting with the operating system, including environment variables and paths.
- **Document**: For representing and manipulating document objects in LangChain.
- **pdfplumber**: For extracting text from PDF files.
- **RetrievalQA**: For building a Retrieval-based Question Answering chain using retrievers and models.
- **dotenv**: For loading environment variables from a `.env` file.


In [53]:
# For pulling shared prompt templates from the LangChain Hub.
from langchain import hub

# For splitting text into smaller, manageable chunks for processing.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# For storing and querying embeddings using Chroma vector database.
from langchain_community.vectorstores import Chroma

# For parsing string-based outputs from language model responses.
from langchain_core.output_parsers import StrOutputParser

# For creating pass-through components in a pipeline (no modifications applied).
from langchain_core.runnables import RunnablePassthrough

# For interacting with OpenAI's GPT models for chat-based generation.
from langchain_openai import ChatOpenAI

# For generating embeddings using Hugging Face models.
from langchain_huggingface import HuggingFaceEmbeddings

# For interacting with the operating system (e.g., environment variables, paths).
import os

# For representing and manipulating document objects in LangChain.
from langchain.docstore.document import Document

# For extracting text content from PDF files.
import pdfplumber

# For creating a Retrieval-based Question Answering chain using retrievers and models.
from langchain.chains import RetrievalQA

# For loading environment variables from a `.env` file.
from dotenv import load_dotenv

# Execute environment variable loading immediately.
load_dotenv()

True

This code sets the environment variables `LANGCHAIN_TRACING_V2`, `LANGCHAIN_ENDPOINT`, `LANGCHAIN_API_KEY`, and `OPENAI_API_KEY` by retrieving their values from the `.env` file or the system environment.

- **LANGCHAIN_TRACING_V2**: Enables LangSmith tracing of LangChain runs, which helps with debugging.
- **LANGCHAIN_ENDPOINT**: Specifies the LangSmith endpoint that traces are sent to.
- **LANGCHAIN_API_KEY**: Provides the LangSmith API key used for tracing.
- **OPENAI_API_KEY**: Provides the API key for OpenAI services.

In [54]:
# Re-export the tracing and API settings loaded from the .env file or system environment.
# os.environ only accepts strings, so a missing variable is reported instead of assigned
# (assigning None would raise a TypeError).
for name in ('LANGCHAIN_TRACING_V2', 'LANGCHAIN_ENDPOINT', 'LANGCHAIN_API_KEY', 'OPENAI_API_KEY'):
    value = os.getenv(name)
    if value is None:
        print(f"Warning: {name} is not set")
    else:
        os.environ[name] = value

This code defines three functions:

- **`extract_text_from_pdf(pdf_path)`**: Extracts text from a PDF file located at `pdf_path` using `pdfplumber`. It processes each page of the PDF and concatenates the extracted text into a single string.
  
- **`prepare_documents_from_pdfs(pdf_paths)`**: Processes a list of PDF file paths (`pdf_paths`) to extract their text content using `extract_text_from_pdf`. It creates a list of `Document` objects, each containing the extracted text and metadata about the source (PDF file path).

- **`process_llm_response(llm_response)`**: Processes the response from a language model (LLM). It prints the main result from the LLM response and lists the sources of the information, which are the metadata of the documents used in the response.

In [55]:
# Function to extract text from PDFs
def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
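            # extract_text() can return None for pages without a text layer in some pdfplumber versions
            text += (page.extract_text() or "") + "\n"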
    return text

# Function to prepare documents from a list of PDF paths
def prepare_documents_from_pdfs(pdf_paths):
    documents = []
    for pdf_path in pdf_paths:
        text = extract_text_from_pdf(pdf_path)
        if text.strip():  # Skip PDFs that yielded only whitespace (e.g. scanned images)
            documents.append(Document(page_content=text, metadata={"source": pdf_path}))
    return documents

# Function which prints the main result from an LLM response and lists the sources of the information from the response's source documents.
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
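
To see the response shape that `process_llm_response` expects without calling an LLM, the following minimal sketch feeds it a hand-built dictionary. `FakeDoc` and the sample values are hypothetical stand-ins, not LangChain objects; only the `result` and `source_documents` keys and the `metadata['source']` field are taken from the function above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def process_llm_response(llm_response):
    # Same logic as the notebook function: print the answer, then each source.
    print(llm_response["result"])
    print("\n\nSources:")
    for source in llm_response["source_documents"]:
        print(source.metadata["source"])


# Made-up response that mirrors the keys the function reads.
sample_response = {
    "result": "Example answer.",
    "source_documents": [
        FakeDoc("chunk text", {"source": "./LLM_papers/example_a.pdf"}),
        FakeDoc("chunk text", {"source": "./LLM_papers/example_b.pdf"}),
    ],
}

process_llm_response(sample_response)
```

Any object that exposes a `metadata` dictionary with a `source` key works here, which is why the function can be exercised without a vector store.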


This code performs the following steps:

- **Specify PDF directory**: Defines the directory containing PDF files (`./LLM_papers`) and generates a list of PDF file paths (`pdf_paths`).

- **Set up persistence and embeddings**: Defines the path for the vector store's persistence directory (`./chroma_langchain_pdf_db`) and initializes Hugging Face embeddings using a pre-trained model (`sentence-transformers/all-mpnet-base-v2`).

- **Load or create vector store**: Checks if a vector store already exists in the specified directory. If it exists, it loads the existing vector store. If not, it creates a new one by:
  - Extracting text from the specified PDFs.
  - Splitting the extracted text into manageable chunks using the `RecursiveCharacterTextSplitter`.
  - Initializing and creating a new `Chroma` vector store using the processed documents and embeddings.

- **Use the retriever**: After creating or loading the vector store, it sets up the `retriever` to query the vector store for relevant information.
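
Conceptually, querying the vector store means embedding the question and returning the chunks whose embeddings are closest to it. The sketch below illustrates that idea with brute-force cosine similarity over made-up 3-dimensional vectors. It is a simplification under stated assumptions: `cosine_similarity`, `top_k`, and the toy vectors are hypothetical, and a real store such as Chroma uses its own index and configured distance metric rather than this loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings", made up for illustration.
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.1]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)  # [0, 1]
```

The retriever plays the role of `top_k`: it hides the embedding and ranking steps behind a single call that maps a question to a short list of chunks.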

In [56]:
# Specify the directory containing your PDF files
pdf_directory = "./LLM_papers"
pdf_paths = [os.path.join(pdf_directory, f) for f in os.listdir(pdf_directory) if f.endswith(".pdf")]

# Path to persist directory
persist_directory = "./chroma_langchain_pdf_db"
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Check if vectorstore exists, otherwise create it
if os.path.exists(persist_directory) and os.listdir(persist_directory):
    print("Loading existing vector store...")
    vectorstore = Chroma(
        embedding_function=embeddings,
        persist_directory=persist_directory,
    )
else:
    print("No existing vector store found. Creating a new one...")

    # Extract and prepare documents
    print("Extracting text from PDFs...")
    documents = prepare_documents_from_pdfs(pdf_paths)

    # Split documents into manageable chunks
    print("Splitting documents into chunks...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(documents)

    # Embed the chunks and build the vector store
    print("Embedding chunks and building the vector store...")
    vectorstore = Chroma.from_documents(
        documents=splits,
        embedding=embeddings,
        persist_directory=persist_directory,
    )
    print("Vector store created and saved locally.")

print("Vector store returned Successfully.")

# Use the retriever
retriever = vectorstore.as_retriever()

Loading existing vector store...
Vector store returned Successfully.
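
The splitting step in the cell above uses `chunk_size=1000` and `chunk_overlap=200`. To make the effect of the overlap concrete, here is a simplified sketch that cuts a string into fixed-size character windows. `split_with_overlap` is a hypothetical helper, not the library's algorithm: `RecursiveCharacterTextSplitter` first tries to break on separators such as paragraph and line breaks, so its real chunks vary in length.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.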


#### RETRIEVAL and GENERATION ####

This code demonstrates how to combine a retriever and a language model (LLM) to generate answers based on retrieved documents. Here's the breakdown:

1. **Prompt**: 
   - The prompt template is pulled from the `rlm/rag-prompt` using LangChain's hub.

2. **LLM**: 
   - A `ChatOpenAI` model (`gpt-3.5-turbo`) is used to generate answers, with a temperature of 0 to make the output as repeatable as possible.

3. **Post-processing**: 
   - The `format_docs` function is defined to format the documents retrieved by the retriever into a single string, joining their contents with blank lines.

4. **Chain**: 
   - A LangChain pipeline (`rag_chain`) is created using a combination of components:
     - The retriever fetches relevant documents.
     - The `format_docs` function formats these documents.
     - The formatted documents and the question are fed into the prompt.
     - The LLM generates an answer.
     - The `StrOutputParser` parses the final output.

5. **Question**: 
   - The `rag_chain` is invoked with a question: "When training pre-trained Language Models, how is Matrix Decomposition utilized?" The pipeline retrieves relevant documents, formats them, and generates the answer.

This setup implements retrieval-augmented generation (RAG) as a chain of operations, so the language model answers from the retrieved passages rather than from its training data alone.

In [59]:

# Prompt
prompt = hub.pull("rlm/rag-prompt")

# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Question
rag_chain.invoke("When training pre trained Language Models, how is Matrix Decomposition utilized?")

'Matrix Decomposition is utilized in training pre-trained Language Models for model compression. Singular Value Decomposition (SVD) is a popular strategy for compressing large matrices into smaller ones, approximating learned matrices with fewer parameters. Fisher information is introduced to weigh the importance of parameters in matrix factorization to align the objective with task accuracy.'
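
The dictionary at the head of `rag_chain` sends the same input question down two branches: one retrieves and formats context, the other passes the question through unchanged. The plain-Python analogue below traces that data flow. `fake_retriever`, `fake_prompt`, `fake_llm`, and `run_chain` are hypothetical stubs written for illustration; no LangChain code is involved and no model is called.

```python
def fake_retriever(question):
    # Hypothetical stub: a real retriever would query the vector store.
    return ["First retrieved chunk.", "Second retrieved chunk."]


def format_docs(chunks):
    # Mirrors the notebook's format_docs, but on plain strings.
    return "\n\n".join(chunks)


def fake_prompt(inputs):
    # Hypothetical template: fills context and question into one string.
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    # Hypothetical stub: reports the prompt length instead of calling a model.
    return f"(answer based on a {len(prompt_text)}-character prompt)"


def run_chain(question):
    # Both branches receive the same question.
    inputs = {
        "context": format_docs(fake_retriever(question)),  # retriever | format_docs
        "question": question,                               # RunnablePassthrough()
    }
    # Prompt, then model; the returned string plays the output parser's role.
    return fake_llm(fake_prompt(inputs))


answer = run_chain("How is matrix decomposition used?")
print(answer)
```

Reading the chain as nested function calls makes it easier to see why the prompt template needs exactly two variables, `context` and `question`.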

This code performs the following tasks:

- **Retrieves relevant documents**: It uses a `RetrievalQA` chain that takes a query, retrieves relevant documents using the `retriever`, and generates an answer with the provided language model (`llm`).
  
- **Generates an answer**: The `qa_chain` is responsible for processing the query and generating an answer by combining the retrieved documents with the LLM's capabilities.

- **Prints answer and sources**: The function `process_llm_response` is called to print the generated answer (`llm_response['result']`) and the sources used to generate it, by listing the `source` metadata of each source document.

In [58]:
# Build a RetrievalQA chain that takes a query, retrieves relevant documents
# with the retriever, generates an answer with the LLM, and returns both the
# answer and the source documents used to produce it.

qa_chain = RetrievalQA.from_chain_type(llm=llm, 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)

query = "When training pre trained Language Models, how is Matrix Decomposition utilized?"
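# Calling the chain object directly is deprecated in recent LangChain releases; use invoke().
llm_response = qa_chain.invoke({"query": query})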
process_llm_response(llm_response)

Matrix decomposition is utilized in training pre-trained language models for model compression. Specifically, techniques like Singular Value Decomposition (SVD) are used to factorize large matrices into smaller matrices, reducing the number of parameters in the model. This compression strategy helps approximate the learned matrix with fewer parameters, ultimately reducing the size of the model while retaining important information.


Sources:
./LLM_papers\2020.aacl-main.88.pdf
./LLM_papers\2207.00112.pdf
./LLM_papers\2211.09718.pdf
./LLM_papers\2109.06243.pdf
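
Because the retriever returns chunks rather than whole papers, two chunks from the same PDF would print the same path twice (this did not happen in the run above). A small sketch of one way to collapse such repeats while keeping the ranking order follows; `unique_sources` and the sample paths are hypothetical and not part of the notebook.

```python
def unique_sources(source_paths):
    """Drop repeated paths while keeping first-seen order."""
    # dict preserves insertion order, so the best-ranked occurrence stays first.
    return list(dict.fromkeys(source_paths))


# Made-up paths: two of the three chunks come from the same file.
retrieved = [
    "./LLM_papers/paper_a.pdf",
    "./LLM_papers/paper_b.pdf",
    "./LLM_papers/paper_a.pdf",
]
print(unique_sources(retrieved))
# ['./LLM_papers/paper_a.pdf', './LLM_papers/paper_b.pdf']
```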
