## Simple RAG (Retrieval-Augmented Generation) System

### Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

### Key Components

- PDF processing and text extraction
- Text chunking for manageable processing
- Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings
- Retriever setup for querying the processed documents
- Evaluation of the RAG system

### Method Details

#### Document Preprocessing

- The PDF is loaded using PyPDFLoader.
- The text is split into chunks using RecursiveCharacterTextSplitter with a specified chunk size and overlap.
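
To make chunk size and overlap concrete, here is a minimal sketch of fixed-window chunking. It is a simplified illustration, not the algorithm RecursiveCharacterTextSplitter uses (that splitter also prefers to break on separators such as paragraph and line boundaries), and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Small numbers so the overlap is easy to see.
demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

With `chunk_size=1000` and `chunk_overlap=200`, as used later in this notebook, each window would advance by 800 characters under this simplified scheme.
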

#### Text Cleaning

A custom function replace_t_with_space is applied to clean the text chunks. Judging by its name, it replaces tab characters with spaces, a common artifact of PDF text extraction.
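
The cleaning step could look roughly like the sketch below. The real replace_t_with_space lives in helper_functions and is not shown here, so this is an assumption based on its name; the version below works on plain strings and is called `replace_tabs_with_spaces` to make clear that it is hypothetical.

```python
def replace_tabs_with_spaces(chunks: list[str]) -> list[str]:
    """Return a copy of the chunks with every tab replaced by a single space."""
    return [chunk.replace("\t", " ") for chunk in chunks]


# Example input with tabs, as they might appear after PDF extraction.
raw_chunks = ["Skills:\tPython\tSQL", "No tabs here"]
cleaned_chunks = replace_tabs_with_spaces(raw_chunks)
print(cleaned_chunks)
```
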

#### Vector Store Creation

- OpenAI embeddings are used to create vector representations of the text chunks.
- A FAISS vector store is created from these embeddings for efficient similarity search.

#### Retriever Setup

A retriever is configured to fetch the top k most relevant chunks for a given query; the code below sets k to 15.
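
The idea behind embedding chunks and retrieving the top k can be sketched without any external services. The example below is a toy: a bag-of-words count vector stands in for OpenAI embeddings, and a sorted cosine-similarity scan stands in for FAISS. Neither the real embedding model nor FAISS's index or distance metric is modeled, and all names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "data engineer at acme",
    "bachelor degree in electronics",
    "senior data engineer at globex building pipelines",
]
results = top_k("data engineer", toy_chunks, k=2)
print(results)
```

A larger k returns more context to the language model at the cost of including less relevant chunks.
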

#### Encoding Function

The encode_pdf function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.
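
One plausible shape for such a function is sketched below, using the LangChain classes named above. The actual encode_pdf in helper_functions is not shown in this notebook, so the function name `encode_pdf_sketch`, the default arguments, the import paths (which vary between LangChain versions), and the inline tab cleaning are all assumptions. Running it would require the relevant packages and an OpenAI API key.

```python
def encode_pdf_sketch(path: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Load a PDF, chunk it, clean the chunks, and index them in FAISS."""
    # Imported lazily so the sketch can be defined without these packages
    # installed; the module paths are assumptions about the LangChain version.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # 1. Load: one Document per PDF page.
    documents = PyPDFLoader(path).load()

    # 2. Chunk: split pages into overlapping pieces.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(documents)

    # 3. Clean: assumed behavior of replace_t_with_space (tabs to spaces).
    for chunk in chunks:
        chunk.page_content = chunk.page_content.replace("\t", " ")

    # 4. Encode: embed each chunk and build the FAISS index.
    return FAISS.from_documents(chunks, OpenAIEmbeddings())
```
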

#### Key Features

- Modular Design: The encoding process is encapsulated in a single function for easy reuse.
- Configurable Chunking: Allows adjustment of chunk size and overlap.
- Efficient Retrieval: Uses FAISS for fast similarity search.
- Evaluation: Includes a function to evaluate the RAG system's performance.

#### Usage Example

The code includes a test query, "What are the list of companies ?", which demonstrates how to use the retriever to fetch relevant context from the processed document.

#### Evaluation

The system includes an evaluate_rag function to assess the performance of the retriever. Its implementation is not shown in this notebook, but the output below reports three metrics: Correctness (GEval), Faithfulness, and Contextual Relevancy.

#### Benefits of this Approach

- Scalability: Can handle large documents by processing them in chunks.
- Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
- Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
- Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.

#### Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.

### Read Doc

In [4]:
# Path to the PDF document to index
path = "data/ManideepResume.pdf"

### Encode

In [5]:
from helper_functions import encode_pdf

chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

### Create Retriever

In [9]:
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 15})

### Test retriever

In [10]:
from helper_functions import retrieve_context_per_question

test_query = "What are the list of companies ?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
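
The helper retrieve_context_per_question is imported from helper_functions and its body is not shown. A minimal sketch of what such a helper might do is given below, assuming the retriever exposes LangChain's `invoke` method and returns objects with a `page_content` attribute; the name `retrieve_context_sketch` is hypothetical.

```python
from typing import Any


def retrieve_context_sketch(question: str, retriever: Any) -> list[str]:
    """Fetch the documents relevant to the question and return their text."""
    docs = retriever.invoke(question)  # assumed LangChain retriever interface
    return [doc.page_content for doc in docs]
```

Under these assumptions it could be called as `retrieve_context_sketch(test_query, chunks_query_retriever)`, returning up to 15 chunk texts for the retriever configured above.
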

### Evaluation

In [11]:
from evaluate_rag import evaluate_rag

evaluate_rag(chunks_query_retriever)

Answering the question from the retrieved context...
Answering the question from the retrieved context...
Answering the question from the retrieved context...


Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 3 test case(s) in parallel: |██████████|100% (3/3) [Time Taken: 00:24,  8.22s/test case]



Metrics Summary

  - ❌ Correctness (GEval) (score: 0.23833602776858015, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output contains detailed information not present in the expected output, which only mentions 'Engineering'., error: None)
  - ✅ Faithfulness (score: 0.75, threshold: 0.7, strict: False, evaluation model: gpt-4, reason: None, error: None)
  - ❌ Contextual Relevancy (score: 0.16666666666666666, threshold: 1.0, strict: False, evaluation model: gpt-4, reason: The score is 0.17 because the context provided mainly discusses professional roles, responsibilities, and job experiences but does not address or provide any information about the 'highest qualification'., error: None)

For test case:

  - input: What is the highest qualification ?
  - actual output: The highest qualification is a Bachelor of Technology in Electronics and Communications Engineering (ECE) from Jawaharlal Nehru Technological University, Hyderabad, India, secured in July 20
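
The pass and fail marks in the summary above appear to follow from a simple comparison: a metric passes when its score is at least its threshold. The sketch below re-applies that rule to the reported numbers. The rule itself is an assumption about how the evaluation library decides, and the variable names are hypothetical.

```python
# (score, threshold) pairs copied from the metrics summary above.
reported = {
    "Correctness (GEval)": (0.23833602776858015, 0.5),
    "Faithfulness": (0.75, 0.7),
    "Contextual Relevancy": (0.16666666666666666, 1.0),
}

# Assumed rule: pass when score >= threshold.
verdicts = {name: score >= threshold for name, (score, threshold) in reported.items()}

for name, ok in verdicts.items():
    print(f"{name}: {'pass' if ok else 'fail'}")
```

Note that Contextual Relevancy uses a threshold of 1.0, so any retrieved chunk judged irrelevant to the question fails the metric, which becomes more likely as the number of retrieved chunks grows.
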




### Test Run

In [12]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4o")

In [13]:
# "stuff" places all retrieved chunks into a single prompt for the LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=chunks_query_retriever,
)

In [17]:
response = qa_chain.invoke(input = "Today date is 29th September, 2024. You are a inerview resume analyzer in an IT company, What is the total experience of the candidate ?")

In [19]:
response

{'query': 'Today date is 29th September, 2024. You are a inerview resume analyzer in an IT company, What is the total experience of the candidate ?',
 'result': "Based on the provided resume, the candidate's professional experience is as follows:\n\n1. **Deloitte, Hyderabad**: May 2016 to August 2016 (Contract) - 4 months\n2. **Nielsen India Pvt. Ltd, Bangalore**: October 2016 to June 2018 - 1 year and 9 months\n3. **Affine Analytics Pvt. Ltd, Bangalore**: June 2018 to December 2018 - 7 months\n4. **Cognizant Technology Solutions, Hyderabad**: January 2019 to June 2021 - 2 years and 6 months\n5. **Accenture, Hyderabad**: June 2021 to September 2022 - 1 year and 4 months\n6. **Gameopedia Data Solutions Pvt Ltd, Hyderabad**: September 2022 to Present (September 2024) - 2 years\n\nNow, summing up the duration of all the experiences:\n\n- Deloitte: 4 months\n- Nielsen: 1 year and 9 months\n- Affine Analytics: 7 months\n- Cognizant: 2 years and 6 months\n- Accenture: 1 year and 4 months\n- 

In [24]:
from IPython.display import Markdown
Markdown(response['result'])

Based on the provided resume, the candidate's professional experience is as follows:

1. **Deloitte, Hyderabad**: May 2016 to August 2016 (Contract) - 4 months
2. **Nielsen India Pvt. Ltd, Bangalore**: October 2016 to June 2018 - 1 year and 9 months
3. **Affine Analytics Pvt. Ltd, Bangalore**: June 2018 to December 2018 - 7 months
4. **Cognizant Technology Solutions, Hyderabad**: January 2019 to June 2021 - 2 years and 6 months
5. **Accenture, Hyderabad**: June 2021 to September 2022 - 1 year and 4 months
6. **Gameopedia Data Solutions Pvt Ltd, Hyderabad**: September 2022 to Present (September 2024) - 2 years

Now, summing up the duration of all the experiences:

- Deloitte: 4 months
- Nielsen: 1 year and 9 months
- Affine Analytics: 7 months
- Cognizant: 2 years and 6 months
- Accenture: 1 year and 4 months
- Gameopedia: 2 years

Total experience:
= 4 months + 1 year 9 months + 7 months + 2 years 6 months + 1 year 4 months + 2 years
= (4 + 9 + 7 + 6 + 4) months + (1 + 2 + 1 + 2) years
= 30 months + 6 years
= 2 years 6 months + 6 years
= 8 years and 6 months

Therefore, the total experience of the candidate as of 29th September 2024 is **8 years and 6 months**.
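
Because language models can slip on arithmetic, it is worth re-adding the figures. The sketch below sums the per-role durations exactly as the model stated them. It checks only the addition, not whether each duration is right: the model counts both the start and end month of every role, so boundary months shared by consecutive roles (June 2018, June 2021, September 2022) are counted twice.

```python
# (role, years, months) as stated in the model's answer above.
stated_durations = [
    ("Deloitte", 0, 4),
    ("Nielsen", 1, 9),
    ("Affine Analytics", 0, 7),
    ("Cognizant", 2, 6),
    ("Accenture", 1, 4),
    ("Gameopedia", 2, 0),
]

# Convert everything to months, add, then split back into years and months.
total_months = sum(years * 12 + months for _, years, months in stated_durations)
total_years, remaining_months = divmod(total_months, 12)

print(f"{total_years} years and {remaining_months} months")
```

The sum matches the model's stated total of 8 years and 6 months.
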