# RAG Implementation

## Repository Setup and Dependency Installation

### Subtask:
Set up the repository and ensure all Python dependencies are installed from `requirements.txt`.


In [None]:
import os

repo_path = '/content/mcp-local-llm'

if not os.path.exists(repo_path):
    print(f"Cloning repository into {repo_path}...")
    !git clone https://github.com/GoogleCloudPlatform/mcp-local-llm.git {repo_path}
    print("Repository cloned successfully.")
# Remember the starting directory before any later os.chdir call
original_cwd = os.getcwd()

print(f"Repository path set to: {repo_path}")

In [None]:
print("Installing tesseract-ocr and language pack...")
!apt-get update
!apt-get install -y tesseract-ocr tesseract-ocr-eng
print("tesseract-ocr installed successfully.")

In [None]:
import os

os.chdir(repo_path)
print(f"Changed current working directory to: {os.getcwd()}")

In [None]:
requirements_path = 'requirements.txt'

libraries_to_add = ['PyPDF2', 'Pillow', 'pytesseract', 'reportlab']

existing_requirements = []
if os.path.exists(requirements_path):
    with open(requirements_path, 'r') as f:
        existing_requirements = [line.strip() for line in f if line.strip()]

new_libraries_to_write = []
for lib in libraries_to_add:
    if lib not in existing_requirements:
        new_libraries_to_write.append(lib)

if new_libraries_to_write:
    with open(requirements_path, 'a') as f:
        for lib in new_libraries_to_write:
            f.write(f"{lib}\n")
    print(f"Appended new libraries to {requirements_path}: {', '.join(new_libraries_to_write)}")
else:
    print(f"All specified libraries are already in {requirements_path}.")
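The membership test above compares whole lines, so a pinned entry such as `Pillow==10.0.0` would not match `Pillow`, and the library would be appended a second time. A minimal sketch of a name-based comparison follows. The helper names `requirement_name` and `missing_libraries` are hypothetical and not part of the repository. The parsing is deliberately simple and does not cover every form a requirements line can take.

```python
import re

def requirement_name(line: str) -> str:
    """Return the normalized package name from a requirements.txt line."""
    # Drop any inline comment, then cut at the first specifier, extra, marker, or space.
    line = line.split('#', 1)[0].strip()
    name = re.split(r'[<>=!~\[;\s@]', line, maxsplit=1)[0]
    # Treat '-', '_' and '.' as equivalent and ignore case (PEP 503 style normalization).
    return re.sub(r'[-_.]+', '-', name).lower()

def missing_libraries(wanted, existing_lines):
    """Return the entries of `wanted` whose package name is not already listed."""
    present = {requirement_name(line) for line in existing_lines}
    present.discard('')  # blank or comment-only lines produce an empty name
    return [lib for lib in wanted if requirement_name(lib) not in present]
```

With this in place, `new_libraries_to_write` could be computed as `missing_libraries(libraries_to_add, existing_requirements)`.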

In [None]:
print("Installing Python dependencies from requirements.txt...")
!pip install -r requirements.txt
print("Python dependencies installed successfully.")

## Process Documents in data/raw and Build Artifacts

Perform core data processing by running `doc_parser.py` to extract text from various document types, `ingest.py` to chunk the processed documents and update metadata, and `embed.py` to generate embeddings and build a FAISS index.
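The cells in this section and the next reference several variables that are not defined anywhere in this notebook. The sketch below shows one way they might be set. Every value is an assumption for illustration: the `scripts` directory, the metadata and index file names, the chunking parameters, and the embedding model name are hypothetical and should be replaced with the values the repository actually expects.

```python
import os

# All values below are assumptions for illustration; adjust them to the real repository layout.
repo_path = '/content/mcp-local-llm'                     # same path as in the setup section
scripts_dir = os.path.join(repo_path, 'scripts')         # hypothetical location of the scripts
data_raw_path = os.path.join(repo_path, 'data', 'raw')   # the data/raw folder named in the heading

doc_parser_script_path = os.path.join(scripts_dir, 'doc_parser.py')
chunk_data_script = os.path.join(scripts_dir, 'ingest.py')
embed_data_script = os.path.join(scripts_dir, 'embed.py')

# Hypothetical output file names
output_metadata_path = os.path.join(repo_path, 'data', 'metadata.json')
faiss_index_path = os.path.join(repo_path, 'data', 'index.faiss')

# Hypothetical chunking parameters; the unit depends on how ingest.py interprets them
chunk_size = 500
chunk_overlap = 50

# Hypothetical choice of embedding model
model_name_for_embedding = 'sentence-transformers/all-MiniLM-L6-v2'
```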


In [None]:
# Requires data_raw_path and doc_parser_script_path to be defined in an earlier cell.
print(f"Executing doc_parser.py to process documents in {data_raw_path}...")
!python "{doc_parser_script_path}" "{data_raw_path}"
print("doc_parser.py execution complete.")

In [None]:
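# Requires chunk_data_script, chunk_size, chunk_overlap, output_metadata_path and data_raw_path to be defined in an earlier cell.
print(f"Executing ingest.py to chunk documents and update metadata in {output_metadata_path}...")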
!python "{chunk_data_script}" --chunk_size {chunk_size} --chunk_overlap {chunk_overlap} --output_metadata_path "{output_metadata_path}" --data_raw_path "{data_raw_path}"
print("ingest.py execution complete.")

In [None]:
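# Requires embed_data_script, model_name_for_embedding, output_metadata_path, faiss_index_path and data_raw_path to be defined in an earlier cell.
print(f"Executing embed.py to generate embeddings and build FAISS index in {faiss_index_path}...")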
!python "{embed_data_script}" --model_name "{model_name_for_embedding}" --output_metadata_path "{output_metadata_path}" --faiss_index_path "{faiss_index_path}" --data_raw_path "{data_raw_path}"
print("embed.py execution complete.")
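A `!python ...` line does not stop the notebook when the script fails, so the "execution complete" messages above print even after an error. A minimal sketch of a stricter alternative using `subprocess.run` with `check=True` follows. The helper name `run_step` is hypothetical.

```python
import subprocess
import sys

def run_step(args):
    """Run one pipeline step; raise CalledProcessError if it exits with a non-zero status."""
    print("Running:", " ".join(args))
    return subprocess.run(args, check=True)

# Example call (assumes the path variables are already defined):
# run_step([sys.executable, doc_parser_script_path, data_raw_path])
```

Using `sys.executable` keeps the step in the same Python environment as the notebook kernel.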

## Initialize RAG System (using rag.py)

Import and set up the `answer` function from the `rag.py` script. This configures the RAG system to use the specified LLM within `rag.py` for generating responses.


In [None]:
import sys
import os

# Add the scripts directory to the Python path.
# Requires scripts_dir (the folder containing rag.py) to be defined in an earlier cell.
sys.path.insert(0, scripts_dir)

print(f"Added {scripts_dir} to sys.path: {scripts_dir in sys.path}")

# Import the answer function
from rag import answer

print("Imported 'answer' function from rag.py")

# Call the answer function once with a dummy query to initialize the RAG system and load the LLM
print("Initializing RAG system with a dummy query...")
dummy_response = answer(query="Who Made You")
print(dummy_response)
print("RAG system initialized.")
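If `rag.py` is not where the notebook expects it, the import above fails with a bare `ModuleNotFoundError`. The sketch below wraps the same steps in a function that checks for the file first. The function name `load_answer` is hypothetical. It assumes only what the cell above already assumes, namely that `rag.py` exposes a function called `answer`.

```python
import importlib
import os
import sys

def load_answer(scripts_dir):
    """Import rag.py from scripts_dir and return its answer function."""
    rag_file = os.path.join(scripts_dir, 'rag.py')
    if not os.path.isfile(rag_file):
        raise FileNotFoundError(f"rag.py not found in {scripts_dir}")
    # Put the scripts directory first on sys.path so 'rag' resolves to this file.
    if scripts_dir not in sys.path:
        sys.path.insert(0, scripts_dir)
    # Make sure the import system notices files created after startup.
    importlib.invalidate_caches()
    module = importlib.import_module('rag')
    return module.answer
```

A call such as `answer = load_answer(scripts_dir)` would then replace the `sys.path.insert` and `from rag import answer` lines.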

## Interactive RAG Query Session

The cell below runs an interactive query loop: type a question to get an answer from the RAG system, or type `quit` to exit. Answers draw on the documents processed in the earlier steps.


In [None]:
print("Starting interactive RAG query session. Type 'quit' to exit.")

while True:
    query = input("\nEnter your query (or type 'quit' to exit): ")
    # Ignore surrounding whitespace and case when checking for the exit command
    if query.strip().lower() == 'quit':
        print("Exiting RAG query session.")
        break

    # Call the answer function with the user's query
    try:
        response = answer(query=query)
        print("\n--- RAG Response ---")
        print(response)
        print("--------------------")
    except Exception as e:
        print(f"An error occurred during RAG query: {e}")
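The loop above is tied to `input` and `print`, which makes it hard to exercise without a live session. The sketch below shows the same logic as a function with injectable input and output callables. The name `run_query_loop` and its parameters are hypothetical. Skipping empty queries and stopping on end of input are additions not present in the original cell.

```python
def run_query_loop(answer_fn, input_fn=input, output_fn=print):
    """Read queries until 'quit' or end of input; return the number of queries answered."""
    answered = 0
    while True:
        try:
            query = input_fn("\nEnter your query (or type 'quit' to exit): ").strip()
        except EOFError:
            break  # input stream closed, treat it like 'quit'
        if query.lower() == 'quit':
            break
        if not query:
            continue  # skip empty input instead of sending it to the RAG system
        try:
            output_fn(answer_fn(query=query))
            answered += 1
        except Exception as e:
            # Report the failure and keep the session alive
            output_fn(f"An error occurred during RAG query: {e}")
    return answered
```

In the notebook, `run_query_loop(answer)` would behave like the cell above, while a test can pass scripted inputs in place of `input`.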