# RAG Assignment

## Problem Statement
This notebook implements a Retrieval-Augmented Generation (RAG) pipeline to answer questions based on a provided company policy document (`knowledge_base.txt`).

## 1. Setup and Installation
Since this is running in Colab, we first need to install the required libraries.

In [None]:
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu sentence-transformers

## 2. Prepare Dataset
We create the `knowledge_base.txt` file directly in this environment.

In [None]:
knowledge_base_content = """Remote Work Policy - Acme Corp
Effective Date: January 1, 2024

1. Purpose
The purpose of this Remote Work Policy is to outline the guidelines and expectations for employees working remotely. Acme Corp recognizes the benefits of remote work in promoting work-life balance and productivity.

2. Eligibility
Full-time employees who have completed their probationary period are eligible to apply for remote work. Roles that require physical presence (e.g., hardware maintenance, front-desk reception) are not eligible.

3. Work Hours & Availability
Remote employees must be available during core business hours (10:00 AM - 4:00 PM EST). Employees are expected to maintain the same level of productivity and responsiveness as they would in the office.

4. Equipment & Security
Acme Corp will provide a company laptop and necessary software. Employees must ensure their home Wi-Fi network is secure and password-protected. Use of public Wi-Fi for handling sensitive company data is strictly prohibited unless a VPN is used.

5. Communication
Employees should use Slack for asynchronous communication and Zoom for meetings. Weekly check-ins with managers are mandatory.

6. Expense Reimbursement
Acme Corp will reimburse up to $50/month for internet expenses. Home office furniture or electricity costs are not reimbursable.

7. Termination of Remote Work
Acme Corp reserves the right to terminate remote work agreements at any time if performance standards are not met or business needs change."""

with open('knowledge_base.txt', 'w', encoding='utf-8') as f:
    f.write(knowledge_base_content)

print("Created 'knowledge_base.txt' successfully.")

Created 'knowledge_base.txt' successfully.


## 3. RAG Pipeline Implementation

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# The text splitters live in their own package in newer LangChain releases;
# try that first and fall back to the legacy import path.
try:
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:
    from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load Data
loader = TextLoader('knowledge_base.txt')
documents = loader.load()
print(f"Loaded {len(documents)} document(s).")

# 2. Chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")

# 3. Embeddings (Using langchain-huggingface)
print("Initializing Embedding Model...")
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 4. Vector Store
print("Creating Vector Store...")
vector_store = FAISS.from_documents(chunks, embedding_model)
print("Vector store created successfully.")

Loaded 1 document(s).
Split into 4 chunks.
Initializing Embedding Model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Creating Vector Store...
Vector store created successfully.
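
The chunk count above depends on how paragraphs are packed into the 500-character budget. The following is a simplified sketch of that idea, not LangChain's actual implementation: it splits on a single separator and ignores `chunk_overlap` and the fallback separators. The helper name `pack_paragraphs` and the sample text are hypothetical.

```python
def pack_paragraphs(text: str, chunk_size: int, sep: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole (no further splitting here).
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it starts a new chunk
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Three 30-character paragraphs with a 70-character budget
sample = "A" * 30 + "\n\n" + "B" * 30 + "\n\n" + "C" * 30
chunks = pack_paragraphs(sample, chunk_size=70)
for i, chunk in enumerate(chunks, start=1):
    print(i, len(chunk))
```

Under this scheme, short neighbouring sections that together fit in the budget end up in one chunk, so a single retrieved chunk can cover more than one policy section.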


## 4. Retrieval & Generation (Mock LLM)
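This cell simulates the generation step: it retrieves the two most similar chunks for each query and returns a templated answer instead of calling a real LLM.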

In [None]:
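class MockLLM:
    """Stand-in for an LLM: retrieves chunks and returns a templated answer."""

    def __init__(self, vector_store):
        self.vector_store = vector_store

    def answer_question(self, query):
        # Retrieve the two chunks most similar to the query
        docs = self.vector_store.similarity_search(query, k=2)
        if not docs:
            return "No relevant context found.", docs

        # Mock generation: preview the first 100 characters of the top chunk
        response = f"""
        [Generated Answer]
        Based on the policy:
        - Found in: {docs[0].page_content[:100]}...
        (This is a mock, replace with OpenAI for real generation)
        """
        return response, docs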

rag_system = MockLLM(vector_store)

test_queries = [
    "What is the eligibility for remote work?",
    "Does the company pay for internet?",
    "What are the core work hours?"
]

print("--- Test Results ---")
for query in test_queries:
    print(f"\nQuery: {query}")
    answer, source_docs = rag_system.answer_question(query)
    print(f"Answer: {answer}")

--- Test Results ---

Query: What is the eligibility for remote work?
Answer: 
        [Generated Answer]
        Based on the policy:
        - Found in: 2. Eligibility
Full-time employees who have completed their probationary period are eligible to appl...
        (This is a mock, replace with OpenAI for real generation)
        

Query: Does the company pay for internet?
Answer: 
        [Generated Answer]
        Based on the policy:
        - Found in: 6. Expense Reimbursement
Acme Corp will reimburse up to $50/month for internet expenses. Home office...
        (This is a mock, replace with OpenAI for real generation)
        

Query: What are the core work hours?
Answer: 
        [Generated Answer]
        Based on the policy:
        - Found in: 2. Eligibility
Full-time employees who have completed their probationary period are eligible to appl...
        (This is a mock, replace with OpenAI for real generation)
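
The mock previews only the first 100 characters of the top chunk, so a chunk that spans several policy sections shows only its first heading. This may be why the core-hours query displays `2. Eligibility`. When the mock is replaced with a real model, the full text of every retrieved chunk would typically be passed in a prompt. The sketch below shows one possible way to assemble such a prompt. `build_prompt` and the prompt wording are assumptions, and the hard-coded `contexts` list stands in for `[d.page_content for d in docs]`.

```python
def build_prompt(query: str, contexts: list[str]) -> str:
    """Assemble a grounded prompt from the query and the retrieved chunk texts."""
    # Number each chunk so the model (and the reader) can refer back to a source
    numbered = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Hard-coded stand-ins for the retrieved chunks (excerpts from the policy text)
contexts = [
    "3. Work Hours & Availability\nRemote employees must be available during "
    "core business hours (10:00 AM - 4:00 PM EST).",
    "6. Expense Reimbursement\nAcme Corp will reimburse up to $50/month for "
    "internet expenses.",
]
prompt = build_prompt("What are the core work hours?", contexts)
print(prompt)
```

The resulting string can be sent to whichever chat or completion model replaces `MockLLM`. The retrieval step stays unchanged.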
        


# RAG Assignment

## Project Overview
This version of the RAG Assignment is built for Google Colab. The notebook creates the dataset and installs its dependencies itself, so no local setup is required.

## Files
- **`rag_assignment_colab.ipynb`**: The standalone Jupyter Notebook containing the complete implementation (Data creation, RAG pipeline, Mock LLM).

## Instructions to Run
1. **Open Google Colab**: Go to [https://colab.research.google.com/](https://colab.research.google.com/).
2. **Upload Notebook**:
   - Click **File > Upload notebook**.
   - Select the `rag_assignment_colab.ipynb` file from this folder.
3. **Run All Cells**:
   - Once opened, click **Runtime > Run all** in the top menu.
   - The notebook will automatically:
     - Install necessary libraries (`langchain`, `faiss-cpu`, etc.).
     - Create the sample `knowledge_base.txt` file.
     - Execute the RAG pipeline and show test results.

## Key Features
- **Zero Configuration**: No need to manually install Python or manage virtual environments.
- **Self-Contained**: The mock dataset is generated via code, so you don't need to upload external text files manually.
- **GPU Ready**: While this assignment runs fine on CPU, it is compatible with Colab's GPU runtime for faster embedding generation if needed.
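
As a hedged illustration of the GPU point, the sketch below picks a device at runtime and falls back to CPU when PyTorch or a GPU is unavailable. `pick_device` is a hypothetical helper that is not part of the notebook. The commented-out lines show how the result could be passed to the embedding model through `model_kwargs`.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # installed as a dependency of sentence-transformers
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()

# In the notebook, the device could then be handed to the embedding model:
# embedding_model = HuggingFaceEmbeddings(
#     model_name="sentence-transformers/all-MiniLM-L6-v2",
#     model_kwargs={"device": device},
# )
print(f"Embeddings would run on: {device}")
```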

## Dependencies
The notebook automatically installs the following via `pip`:
- `langchain`
- `langchain-community`
- `langchain-huggingface`
- `faiss-cpu`
- `sentence-transformers`