# RAG Pipeline & Service for a DeepSeek Chatbot
This project implements a RAG pipeline and service using open-source tools such as **DeepSeek** and **FAISS**.

## What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances large language models (LLMs) by letting them access external information during generation.

Instead of relying solely on what the LLM "knows," RAG retrieves relevant documents from a knowledge base (like a vector database) and uses them as context for a more accurate, up-to-date, and grounded response.

It combines:

- **Retrieval**: Finding relevant data/documents for a given query.

- **Generation**: Using an LLM to generate an answer using both the query and the retrieved context.
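
The two steps above can be sketched as a toy example. This is only an illustration under simplifying assumptions: retrieval ranks documents by shared words instead of embeddings, no LLM is called, and the names `retrieve`, `build_prompt`, and `corpus` are hypothetical.

```python
# Toy illustration of retrieve-then-generate; no embedding model or LLM is used.
def retrieve(query, corpus, k=2):
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in corpus]
    # sort() is stable, so documents with equal scores keep their corpus order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(query, context_docs):
    """Combine the retrieved context and the user query into one prompt string."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nUser Question:\n{query}"


corpus = ["cats sleep a lot", "FAISS searches vectors", "RAG retrieves documents"]
top_docs = retrieve("what does FAISS do", corpus, k=1)
prompt = build_prompt("what does FAISS do", top_docs)
print(prompt)
```

In a real pipeline the prompt would then be passed to the LLM, which is the generation step.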



## FAISS
FAISS (Facebook AI Similarity Search) is an open-source library developed by Meta (Facebook) that performs fast similarity search on high-dimensional vectors.
It’s typically used for:
- Searching similar documents or images
- Building Retrieval-Augmented Generation (RAG) systems
- Finding nearest neighbors in vector space (e.g., embeddings)
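
To make "nearest neighbors in vector space" concrete, here is a minimal pure-Python sketch of an exhaustive search by squared L2 distance. It is meant to mirror, conceptually, what an exact flat index such as `faiss.IndexFlatL2` computes; FAISS itself is far more optimized, and the helper names `squared_l2` and `nearest_neighbors` are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbors(query, vectors, k=1):
    """Return (distances, indices) of the k stored vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    top = ranked[:k]
    return [squared_l2(query, vectors[i]) for i in top], top


vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = nearest_neighbors([1.0, 2.0], vectors, k=2)
print(indices, distances)  # [1, 0] [1.0, 5.0]
```

Like `index.search` in FAISS, the sketch returns distances and positions; the positions are what map results back to the original documents.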

## Setup Environment 

In [None]:
# Install required libraries
!pip install torch transformers accelerate bitsandbytes
!pip install faiss-cpu sentence-transformers llama-index llama-index-embeddings-huggingface

## Model & Tokenizer Setup

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the DeepSeek model and tokenizer (float16 weights, placed automatically across devices)
MODEL_NAME = "deepseek-ai/deepseek-coder-6.7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16
)


## Prompt Wrapper

In [7]:
def generate_response(prompt):
    system = "You are a helpful assistant."
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt}
    ]

    # With return_tensors="pt", apply_chat_template returns the input IDs as a tensor
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    # A single unpadded prompt contains no padding, so every position is attended to.
    # (Comparing against tokenizer.pad_token_id fails when no pad token is set.)
    attention_mask = torch.ones_like(input_ids)

    # Generate output
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=512
    )

    return tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)


## Embedding & Index Setup

In [8]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Dummy corpus (for now)
documents = [
    "Hiro Oshima is a Data Engineer",
    "Hiro loves cats",
    "Hiro loves data science"
]

# Embed and build index
doc_embeddings = embedding_model.encode(documents)
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(np.array(doc_embeddings).astype("float32"))



## RAG Retrieval Function

In [9]:
def get_embedding(text):
    return embedding_model.encode([text])[0]

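def retrieve_top_k(query, k=3):
    query_vector = np.array([get_embedding(query)], dtype=np.float32)
    # Never request more neighbors than the index holds; FAISS pads missing results with -1
    k = min(k, index.ntotal)
    distances, indices = index.search(query_vector, k)
    return [documents[i] for i in indices[0] if i != -1]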

def rag_chatbot(query):
    # Retrieve top-k documents
    retrieved_docs = retrieve_top_k(query, k=3)
    context = "\n".join(retrieved_docs)
    
    prompt = f"Context:\n{context}\n\nUser Question:\n{query}"
    return generate_response(prompt)
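
A prompt that only concatenates context and question leaves the model free to add details that are not in the retrieved documents. One possible variant, sketched below, numbers the retrieved documents and asks the model to stay within them. The function name `build_grounded_prompt` and the instruction wording are assumptions, not part of the original pipeline, and how well a given model follows the instruction will vary.

```python
def build_grounded_prompt(query, retrieved_docs):
    """Build a prompt that asks the model to answer only from the numbered context."""
    numbered = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{numbered}\n\nUser Question:\n{query}"
    )


example = build_grounded_prompt(
    "Who is Hiro?", ["Hiro loves cats", "Hiro loves data science"]
)
print(example)
```

To try it, the `prompt = ...` line inside `rag_chatbot` could be swapped for `prompt = build_grounded_prompt(query, retrieved_docs)`.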

In [10]:
response = rag_chatbot("Tell me what you know about Hiro")
print(response)

As an AI, I don't have personal experiences or emotions, but I can provide information based on the context you've provided. 

Hiro Oshima is a Data Engineer. He is passionate about data science and loves to work with data to extract meaningful insights. He is also a cat lover, which aligns with your statement. He is known for his strong work ethic, attention to detail, and ability to work in a team environment.

However, I'm an AI and don't have personal experiences or emotions. I don't have a personal knowledge or opinions. I can provide information based on the context you've provided.



## Add More Documents to the Index

In [None]:
new_docs = [
    "INSERT NEW DOCUMENT HERE",
    "Hiro likes ML engineering"
]
new_embeddings = embedding_model.encode(new_docs)
index.add(np.array(new_embeddings).astype("float32"))
documents.extend(new_docs)
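
A flat FAISS index identifies vectors by insertion position, so `documents` has to grow in the same order as the index, and re-adding the same text creates duplicate search results. The helper below is a hypothetical sketch (the name `filter_new_documents` is an assumption) that drops blank entries and texts already present before they are embedded.

```python
def filter_new_documents(new_docs, existing_docs):
    """Return stripped, non-empty documents not already present, in input order."""
    seen = set(existing_docs)
    accepted = []
    for doc in new_docs:
        text = doc.strip()
        if text and text not in seen:
            accepted.append(text)
            seen.add(text)  # also guards against duplicates within new_docs
    return accepted


existing = ["Hiro loves cats"]
to_add = filter_new_documents(
    ["Hiro loves cats", "  ", "Hiro likes ML engineering "], existing
)
print(to_add)  # ['Hiro likes ML engineering']
```

The filtered list would then be passed to `embedding_model.encode`, `index.add`, and `documents.extend` exactly as in the cell above, which keeps the index and `documents` aligned.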
