In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

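def rag_system(kb, query, top_k=3):
    """Return the top_k KB sentences most similar to `query`, plus a templated response.

    kb is a list of dicts with "category" (str) and "content" (list of sentences).
    """
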
    sentences = []
    categories = []

    # Flatten KB into sentence-level chunks
    for item in kb:
        category = item["category"]
        for sentence in item["content"]:
            sentences.append(sentence)
            categories.append(category)

    # Vectorize the sentences with TF-IDF (sparse term weights used in place of learned embeddings)
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(sentences)

    # Transform the query
    query_vec = vectorizer.transform([query])

    # Compute similarity
    sims = cosine_similarity(query_vec, vectors).flatten()

    # Retrieve top-k sentences (argsort is ascending, so take the tail and reverse it)
    top_indices = sims.argsort()[-top_k:][::-1]
    retrieved_sentences = [sentences[i] for i in top_indices]
    retrieved_categories = [categories[i] for i in top_indices]

    # GENERATION: join the retrieved sentences under a fixed prefix (no language model is called)
    context_text = " ".join(retrieved_sentences)
    generated_response = "Based on the retrieved documentation: " + context_text

    return {
        "query": query,
        "retrieved_context": retrieved_sentences,
        "categories": retrieved_categories,
        "response": generated_response
    }

# RAG System Function Explanation

The `rag_system` function implements a basic **Retrieval-Augmented Generation (RAG)** pipeline using TF-IDF. It retrieves the sentences in a knowledge base (KB) that are most similar to a given query and assembles them into a templated response. No language model is involved, so the "generation" step is plain string concatenation.

---

## Function Purpose

- **Retrieve**: Find the top-k most relevant sentences from the KB for a user query.
- **Augment**: Join the retrieved sentences into a single context string.
- **Generate**: Build the final response string by prepending a fixed phrase to that context.

---

## Inputs

1. `kb` (`list`): Knowledge base, where each item is a dictionary with:
   - `"category"`: The category of the document or content.
   - `"content"`: A list of sentences describing instructions, safety protocols, or troubleshooting guides.
2. `query` (`str`): The user question to answer.
3. `top_k` (`int`, optional): Number of top relevant sentences to retrieve. Default is `3`.
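
For illustration, a KB in this shape might look like the sketch below. The categories and sentences are invented placeholders, not the contents of the actual knowledge base, and the loop mirrors the flattening step inside `rag_system`.

```python
# Hypothetical KB: the categories and sentences are placeholders for illustration only.
sample_kb = [
    {
        "category": "handling",
        "content": [
            "Handle fragile items with a soft gripper.",
            "Lower the gripper speed before lifting glass.",
        ],
    },
    {
        "category": "safety",
        "content": ["Stop the robot if an obstacle appears."],
    },
]

# Flatten into parallel lists, one entry per sentence.
sentences, categories = [], []
for item in sample_kb:
    for sentence in item["content"]:
        sentences.append(sentence)
        categories.append(item["category"])

print(len(sentences))  # 3
print(categories)      # ['handling', 'handling', 'safety']
```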

---

## Process

1. **Flatten KB**:  
   Convert all sentences in the KB into a single list while keeping track of their categories.

2. **TF-IDF Vectorization**:  
   Transform all sentences into numerical vectors using `TfidfVectorizer`.  
   This acts as the **embedding step**, although TF-IDF vectors are sparse term weights, so matching is lexical rather than semantic.

3. **Query Transformation**:  
   Convert the user query into a TF-IDF vector using the same vectorizer.

4. **Similarity Computation**:  
   Compute cosine similarity between the query vector and all sentence vectors.

5. **Retrieve Top-K Sentences**:  
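   Select the `top_k` sentences with the highest similarity scores, in descending order. Sentences with a score of zero can still be returned when fewer than `top_k` sentences share a term with the query.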

6. **Generate Response**:  
   Join the retrieved sentences with spaces and prepend `"Based on the retrieved documentation: "` to form the final response.
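
The retrieval steps (2 to 5) can be traced on a toy corpus. This is a minimal sketch with made-up sentences; the variable names follow `rag_system`, and `top_k = 2` is chosen only to keep the output short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sentences standing in for the flattened KB.
sentences = [
    "Handle fragile items with a soft gripper.",
    "Charge the battery every night.",
    "Stop the robot if an obstacle appears.",
]
query = "How should the robot handle fragile items?"
top_k = 2

# Steps 2-3: fit TF-IDF on the sentences, then map the query into the same vector space.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(sentences)
query_vec = vectorizer.transform([query])

# Step 4: one cosine similarity score per sentence.
sims = cosine_similarity(query_vec, vectors).flatten()

# Step 5: argsort is ascending, so take the last top_k indices and reverse them.
top_indices = sims.argsort()[-top_k:][::-1]

for i in top_indices:
    print(round(float(sims[i]), 3), sentences[i])
```

Query words that are absent from the fitted vocabulary (here "how" and "should") are ignored by `transform`, so they contribute nothing to the score.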

---

## Outputs

The function returns a dictionary with:

- `"query"`: Original query string.
- `"retrieved_context"`: List of the top-k retrieved sentences.
- `"categories"`: List of the corresponding categories for the retrieved sentences.
- `"response"`: Final generated response string combining the retrieved sentences.
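
Because `"retrieved_context"` and `"categories"` are parallel lists, they can be paired with `zip`. The dictionary below is a hand-written stand-in with the documented keys, not real output from the function.

```python
# Hand-written stand-in for a rag_system result; the values are illustrative only.
result = {
    "query": "How should the robot handle fragile items?",
    "retrieved_context": [
        "Handle fragile items with a soft gripper.",
        "Stop the robot if an obstacle appears.",
    ],
    "categories": ["handling", "safety"],
    "response": (
        "Based on the retrieved documentation: "
        "Handle fragile items with a soft gripper. "
        "Stop the robot if an obstacle appears."
    ),
}

# Pair each retrieved sentence with the category it came from.
lines = [
    f"[{category}] {sentence}"
    for sentence, category in zip(result["retrieved_context"], result["categories"])
]
print("\n".join(lines))
```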

---

## Example Usage

```python
# Assuming kb is already loaded
result = rag_system(kb, "How should the robot handle fragile items?")
print(result["response"])