<a href="https://colab.research.google.com/github/Schofi/Rag_tech/blob/main/all_rag_techniques/simple_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_rag.ipynb)

# Simple RAG (Retrieval-Augmented Generation) System

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## Key Components

1. PDF processing and text extraction
2. Text chunking for manageable processing
3. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings
4. Retriever setup for querying the processed documents
5. Evaluation of the RAG system

## Method Details

### Document Preprocessing

1. The PDF is loaded using PyPDFLoader.
2. The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.
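
The effect of chunk size and overlap can be illustrated with a simplified sliding-window splitter. This is only a sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this version ignores. The function name `split_with_overlap` is hypothetical and is not part of the notebook or of LangChain.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; separators are ignored entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last 4 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.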

### Text Cleaning

A custom helper function, `replace_t_with_space`, is applied to the text chunks. It replaces tab characters with spaces, which removes a common artifact of PDF text extraction.
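
A minimal sketch of this kind of cleaning, assuming the helper replaces tab characters (`\t`) with spaces as its name suggests. The `Chunk` class is a hypothetical stand-in for LangChain's document objects, and `replace_tabs_with_spaces` is an illustrative name, not the helper from the repository.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk with a text payload."""
    page_content: str


def replace_tabs_with_spaces(chunks: list[Chunk]) -> list[Chunk]:
    """Replace every tab character in each chunk's text with a single space."""
    for chunk in chunks:
        chunk.page_content = chunk.page_content.replace("\t", " ")
    return chunks


cleaned = replace_tabs_with_spaces([Chunk("Climate\tchange"), Chunk("no tabs here")])
print([c.page_content for c in cleaned])
# ['Climate change', 'no tabs here']
```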

### Vector Store Creation

1. OpenAI embeddings are used to create vector representations of the text chunks.
2. A FAISS vector store is created from these embeddings for efficient similarity search.

### Retriever Setup

1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.
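
To make the vector store and top-k retrieval steps concrete, here is a toy in-memory version that needs no API key. It is a sketch under stated assumptions: the word-count "embedding" and the `ToyVectorStore` class are hypothetical simplifications. The real notebook uses OpenAI embeddings, and FAISS performs the nearest-neighbour search far more efficiently.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    return [w.strip(".,?!").lower() for w in text.split()]


class ToyVectorStore:
    """Hypothetical store: exhaustive L2 search over normalized word-count vectors."""

    def __init__(self, texts: list[str]):
        self.texts = texts
        # Vocabulary: every distinct token in the corpus, in a fixed order
        self.vocab = sorted({tok for t in texts for tok in tokenize(t)})
        self.vectors = [self.embed(t) for t in texts]

    def embed(self, text: str) -> list[float]:
        """Map text to a unit-length vector of word counts over the vocabulary."""
        counts = Counter(tokenize(text))
        vec = [float(counts[w]) for w in self.vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero
        return [x / norm for x in vec]

    def similarity_search(self, query: str, k: int = 2) -> list[str]:
        """Return the k stored texts closest to the query in L2 distance."""
        q = self.embed(query)
        scored = sorted((math.dist(q, v), i) for i, v in enumerate(self.vectors))
        return [self.texts[i] for _, i in scored[:k]]


texts = [
    "Greenhouse gases from burning fossil fuels trap heat.",
    "Deforestation reduces the planet's capacity to absorb carbon.",
    "The recipe calls for two cups of flour.",
]
store = ToyVectorStore(texts)
results = store.similarity_search("fossil fuels and carbon", k=2)
print(results)
```

With `k=2`, the two climate-related sentences are returned and the unrelated one is left out, which mirrors what the retriever does with real embeddings.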

### Encoding Function

The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.

## Key Features

1. Modular Design: The encoding process is encapsulated in a single function for easy reuse.
2. Configurable Chunking: Allows adjustment of chunk size and overlap.
3. Efficient Retrieval: Uses FAISS for fast similarity search.
4. Evaluation: Includes a function to evaluate the RAG system's performance.

## Usage Example

The code includes a test query: "What is the main cause of climate change?". This demonstrates how to use the retriever to fetch relevant context from the processed document.

## Evaluation

The system includes an `evaluate_rag` function to assess the performance of the retriever, though the specific metrics used are not detailed in the provided code.
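
Because the metrics are not spelled out here, the following is only a generic illustration of how a retriever can be scored, not a description of what `evaluate_rag` does. It computes a hit rate: the fraction of questions for which at least one retrieved chunk contains an expected keyword. All names and data below are hypothetical.

```python
from typing import Callable


def hit_rate(
    retrieve: Callable[[str], list[str]],
    labelled_questions: list[tuple[str, str]],
) -> float:
    """Fraction of questions whose retrieved chunks mention the expected keyword."""
    if not labelled_questions:
        return 0.0
    hits = 0
    for question, expected_keyword in labelled_questions:
        retrieved = retrieve(question)
        if any(expected_keyword.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(labelled_questions)


def fake_retrieve(question: str) -> list[str]:
    """Toy retriever that always returns the same two chunks (illustration only)."""
    return ["Fossil fuels release carbon dioxide.", "Ice sheets are melting."]


score = hit_rate(
    fake_retrieve,
    [
        ("What causes warming?", "carbon dioxide"),
        ("What about oceans?", "acidification"),
    ],
)
print(score)
# 0.5
```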

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.

# Package Installation and Imports

The cell below installs the packages required to run this notebook.


In [1]:
print("🔧 Installing RAG system dependencies for Google Colab...")

# Install all dependency packages in one batch
packages = [
    "python-dotenv",
    "langchain",
    "langchain-community",
    "langchain-openai",
    "langchain-core",
    "faiss-cpu",
    "pypdf",
    "pymupdf",
    "rank-bm25",
    "openai",
    "pydantic",
    "tiktoken",
    "numpy",
    "requests",
    "tqdm"
]

for package in packages:
    print(f"📦 Installing {package}...")
    !pip install {package} -q

print("✅ All dependency packages installed!")

# Verify that the key packages were installed successfully
print("\n🔍 Verifying installation...")

try:
    from langchain_community.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings
    from rank_bm25 import BM25Okapi
    import fitz
    import openai
    print("✅ All key modules imported successfully!")
except ImportError as e:
    print(f"❌ Import failed: {e}")
    print("Please check that the installation is complete")

print("\n🎉 RAG system environment is ready!")

🔧 Installing RAG system dependencies for Google Colab...
📦 Installing python-dotenv...
📦 Installing langchain...
📦 Installing langchain-community...
📦 Installing langchain-openai...
📦 Installing langchain-core...
📦 Installing faiss-cpu...
📦 Installing pypdf...
📦 Installing pymupdf...
📦 Installing rank-bm25...
📦 Installing openai...
📦 Installing pydantic...
📦 Installing tiktoken...
📦 Installing numpy...
📦 Installing requests...
📦 Installing tqdm...
✅ All dependency packages installed!

🔍 Verifying installation...
✅ All key modules imported successfully!

🎉 RAG system environment is ready!


In [24]:
# Install deepeval and related evaluation libraries
!pip install deepeval
!pip install ragas
!pip install trulens-eval



In [6]:
# Clone the repository to access helper functions and evaluation modules
# (skipped if the directory already exists, so the cell can be re-run safely)
![ -d RAG_TECHNIQUES ] || git clone https://github.com/NirDiamant/RAG_TECHNIQUES.git
import sys
sys.path.append('RAG_TECHNIQUES')
# If you need to run with the latest data
# !cp -r RAG_TECHNIQUES/data .


In [8]:
import os
import sys
from dotenv import load_dotenv


# Load environment variables from a .env file
load_dotenv()

# Prompt for the OpenAI API key only if it is not already set (comment out if not using OpenAI)
if not os.getenv('OPENAI_API_KEY'):
    os.environ["OPENAI_API_KEY"] = input("Please enter your OpenAI API key: ")

# Original path append replaced for Colab compatibility
from helper_functions import *
from evaluation.evalute_rag import *


ValueError: Invalid model. Available GPT models: gpt-3.5-turbo, gpt-3.5-turbo-0125, gpt-3.5-turbo-1106, gpt-4-0125-preview, gpt-4-1106-preview, gpt-4-turbo, gpt-4-turbo-2024-04-09, gpt-4-turbo-preview, gpt-4o, gpt-4o-2024-05-13, gpt-4o-2024-08-06, gpt-4o-2024-11-20, gpt-4o-mini, gpt-4o-mini-2024-07-18, gpt-4-32k, gpt-4-32k-0613, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4.5-preview, o1, o1-preview, o1-2024-12-17, o1-preview-2024-09-12, o1-mini, o1-mini-2024-09-12, o3-mini, o3-mini-2025-01-31, o4-mini, gpt-4.5-preview-2025-02-27

### Read Docs

In [None]:
# Download required data files
import os
os.makedirs('data', exist_ok=True)

# Download the PDF document used in this notebook
!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf


In [None]:
path = "data/Understanding_Climate_Change.pdf"

### Encode document

In [None]:
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings (Tested with OpenAI and Amazon Bedrock)
    embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)
    #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)

    # Create vector store
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore

In [None]:
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

### Create retriever

In [None]:
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})

### Test retriever

In [None]:
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

### Evaluate results

In [None]:
# Note: this currently works with OpenAI only
evaluate_rag(chunks_query_retriever)