## Simple RAG (Retrieval-Augmented Generation) System

### Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

### Key Components

- PDF processing and text extraction
- Text chunking for manageable processing
- Vector store creation using <a href="https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/">FAISS</a> and OpenAI embeddings
- Retriever setup for querying the processed documents
- Evaluation of the RAG system

### Method Details

#### Document Preprocessing

- The PDF is loaded using `PyPDFLoader`.
- The text is split into chunks using `RecursiveCharacterTextSplitter` with a specified chunk size and overlap.
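
To make the roles of chunk size and overlap concrete, here is a minimal sketch of a fixed-size character window with overlap. The function name `split_with_overlap` is hypothetical, and this is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on natural separators such as paragraphs and lines, which this sketch does not attempt.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows of at most chunk_size that overlap by chunk_overlap.

    Hypothetical helper for illustration only; not the LangChain implementation.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_with_overlap(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.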
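
#### Text Cleaning

A custom function `replace_t_with_space` (from `helper_functions`) can be applied to clean the text chunks; judging by its name, it replaces tab characters with spaces to address formatting issues in the PDF. In this notebook the call is commented out inside `encode_pdf`, so the chunks are embedded without cleaning.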
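
The source of `replace_t_with_space` is not shown in this notebook. A minimal sketch of what such a cleaning step might look like, assuming it simply swaps tab characters for spaces in each chunk, is given here. The `Doc` class and the function name `replace_tabs_with_spaces_sketch` are hypothetical stand-ins, not the actual helper or the LangChain `Document` class.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document chunk; only page_content is modelled."""
    page_content: str


def replace_tabs_with_spaces_sketch(docs: list[Doc]) -> list[Doc]:
    """Replace every tab character in each chunk's text with a single space (assumed behaviour)."""
    for doc in docs:
        doc.page_content = doc.page_content.replace("\t", " ")
    return docs


docs = [Doc("Skills:\tPython\tSQL"), Doc("no tabs here")]
cleaned = replace_tabs_with_spaces_sketch(docs)
print([d.page_content for d in cleaned])
```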

#### Vector Store Creation

- OpenAI embeddings are used to create vector representations of the text chunks.
- A FAISS vector store is created from these embeddings for efficient similarity search.

#### Retriever Setup

A retriever is configured to fetch the top `k` most relevant chunks for a given query; this notebook sets `k` to 20.
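
Conceptually, top-k retrieval scores every chunk vector against the query vector and keeps the best k. The sketch below illustrates the idea with brute-force cosine similarity in plain Python. It is an illustration under assumptions: the function names are hypothetical, the vectors are toy 2-D values rather than real embeddings, and the similarity measure FAISS actually uses depends on how the index is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings" for four chunks and one query
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.05]
nearest = top_k(query_vec, chunk_vecs, k=2)
print(nearest)
```

A larger `k` passes more context to the language model at the cost of more tokens and a higher chance of including irrelevant chunks.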

#### Encoding Function

The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.

#### Key Features

- Modular Design: The encoding process is encapsulated in a single function for easy reuse.
- Configurable Chunking: Allows adjustment of chunk size and overlap.
- Efficient Retrieval: Uses FAISS for fast similarity search.
- Evaluation: Includes a function to evaluate the RAG system's performance.

#### Usage Example

The code includes a test query asking for the list of companies in the document. This demonstrates how to use the retriever to fetch relevant context from the processed document.

#### Evaluation

The system includes an `evaluate_rag` function to assess the performance of the retriever. Its definition is not shown in this notebook, but the Metrics Summary in its output reports three metrics: Correctness (GEval), Faithfulness, and Contextual Relevancy.
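
The Metrics Summary printed in the Evaluation section of this notebook lists a score and a threshold for each metric. A minimal sketch of how those pairs can be read, assuming a metric passes when its score is at least its threshold (which is consistent with the ✅/❌ marks in that output), is shown here. The `MetricResult` class is hypothetical and not part of the evaluation library.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical container for one line of the Metrics Summary."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: a metric passes when score >= threshold
        return self.score >= self.threshold


# Scores and thresholds copied from the Metrics Summary output in this notebook
results = [
    MetricResult("Correctness (GEval)", score=1.0, threshold=0.5),
    MetricResult("Faithfulness", score=1.0, threshold=0.7),
    MetricResult("Contextual Relevancy", score=0.0, threshold=1.0),
]

failed = [r.name for r in results if not r.passed]
for r in results:
    print(f"{r.name}: {'pass' if r.passed else 'fail'}")
```

Note that a threshold of 1.0 for Contextual Relevancy means only a perfect score passes, so this metric is the strictest of the three.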

#### Benefits of this Approach

- Scalability: Can handle large documents by processing them in chunks.
- Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
- Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
- Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.

#### Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.

In [1]:
from helper_functions import *
from evaluate_rag import *




### Read Doc

In [2]:
# Path to the PDF document to encode
path = "data/ManideepResume.pdf"

### Encode

In [3]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

In [4]:
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF document into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk, in characters.
        chunk_overlap: The amount of overlap between consecutive chunks, in characters.

    Returns:
        A FAISS vector store containing the encoded document content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    # The cleaning step is disabled here; uncomment the second line to apply it
    cleaned_texts = texts
    # cleaned_texts = replace_t_with_space(texts)

    # Create embeddings and vector store
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore

In [5]:
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

### Create Retriever

In [33]:
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 20})

### Test retriever

In [34]:
test_query = "What is the list of companies?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)


### Evaluation

In [8]:
evaluate_rag(chunks_query_retriever)

Answering the question from the retrieved context...
Answering the question from the retrieved context...
Answering the question from the retrieved context...


Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 3 test case(s) in parallel: |██████████|100% (3/3) [Time Taken: 00:15,  5.13s/test case]



Metrics Summary

  - ✅ Correctness (GEval) (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output matches the expected output exactly., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4, reason: None, error: None)
  - ❌ Contextual Relevancy (score: 0.0, threshold: 1.0, strict: False, evaluation model: gpt-4, reason: The score is 0.00 because the context focuses on professional experience and job roles, but does not provide any information about a 'profile name'., error: None)

For test case:

  - input: What is the name of the profile
  - actual output: Manideep Bangaru
  - expected output: Manideep Bangaru
  - context: None
  - retrieval context: ['ManideepBangaru\nbmd994@gmail.c om+917416228028\nOBJECTIVE:SeekingachallengingcareerpositioninanorganizationwhereIcanusemytechnicalskills&creativitytomakeasignificantcontributiontowardsgrowth&developmen toforganizationalongwithmypersonalgrowth\nProf




### Test Run

In [35]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4o")

In [36]:
# "stuff" places all retrieved chunks into a single prompt for the LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=chunks_query_retriever,
)

In [39]:
response = qa_chain.run(query="Today's date is 29th September, 2024. You are an interview resume analyzer in an IT company. What is the total experience of the candidate?")

In [40]:
from IPython.display import Markdown
Markdown(response)

The candidate's total experience can be calculated by summing up the duration of their professional engagements as listed in their resume.

1. **Gameopedia Data Solutions Pvt Ltd, Hyderabad**: Sep '22 to Present (Sep '22 to Sep '24) = 2 years
2. **Accenture, Hyderabad**: Jun '21 to Sep '22 = 1 year and 3 months
3. **Cognizant Technology Solutions, Hyderabad**: Jan '19 to Jun '21 = 2 years and 6 months
4. **Affine Analytics Pvt Ltd, Bangalore**: Jun '18 to Dec '18 = 6 months
5. **Nielsen India Pvt Ltd, Bangalore**: Oct '16 to Jun '18 = 1 year and 8 months
6. **Deloitte, Hyderabad**: May '16 to Aug '16 = 4 months

Summing these up:

- 2 years (Gameopedia)
- 1 year and 3 months (Accenture)
- 2 years and 6 months (Cognizant)
- 6 months (Affine)
- 1 year and 8 months (Nielsen)
- 4 months (Deloitte)

Converting all into months for easier summation:
- 2 years = 24 months
- 1 year and 3 months = 15 months
- 2 years and 6 months = 30 months
- 6 months = 6 months
- 1 year and 8 months = 20 months
- 4 months = 4 months

Total = 24 + 15 + 30 + 6 + 20 + 4 = 99 months

Converting back to years:
- 99 months ≈ 8 years and 3 months

Therefore, the candidate has approximately 8 years and 3 months of total professional experience.
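
Because this total comes from a language model, it is worth recomputing independently. The sketch below recalculates the durations from the date ranges listed in the response, assuming each range is measured as the whole-month difference between its start and end month (the end month is not counted as an extra full month). The `months_between` helper and the `roles` table are illustrative, not part of the notebook.

```python
def months_between(start: tuple[int, int], end: tuple[int, int]) -> int:
    """Whole-month difference between two (year, month) pairs."""
    (start_year, start_month), (end_year, end_month) = start, end
    return (end_year - start_year) * 12 + (end_month - start_month)


# (year, month) ranges taken from the response above; "Present" is Sep 2024 per the prompt
roles = {
    "Gameopedia": ((2022, 9), (2024, 9)),
    "Accenture": ((2021, 6), (2022, 9)),
    "Cognizant": ((2019, 1), (2021, 6)),
    "Affine": ((2018, 6), (2018, 12)),
    "Nielsen": ((2016, 10), (2018, 6)),
    "Deloitte": ((2016, 5), (2016, 8)),
}

per_role = {name: months_between(start, end) for name, (start, end) in roles.items()}
total_months = sum(per_role.values())
years, months = divmod(total_months, 12)

for name, duration in per_role.items():
    print(f"{name}: {duration} months")
print(f"Total: {total_months} months = {years} years and {months} months")
```

Under this counting convention, Cognizant comes to 29 months and Deloitte to 3, giving 97 months (8 years and 1 month) rather than 99. The difference comes only from whether the boundary month of a range is counted, so the model's figure is best read as approximate.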