# RAG Chatbot using reviews.csv
This notebook builds a simple Retrieval-Augmented Generation (RAG) style chatbot over the `reviews.csv` dataset.

It uses:
- pandas for loading data
- scikit-learn TF-IDF for vectorizing text (a sparse, keyword-based stand-in for learned embeddings)
- cosine similarity for retrieval
- a simple response generator (can be replaced with an LLM later)


In [None]:

# Install dependencies (uncomment if running in Colab/Jupyter fresh environment)
# !pip install pandas scikit-learn


In [None]:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


## Load Dataset

In [None]:

# Path to your uploaded file
csv_path = "reviews.csv"

df = pd.read_csv(csv_path)

print("Columns:", df.columns.tolist())
print("Number of rows:", len(df))
df.head()


## Prepare Text Data
The cell below detects the text columns in the dataset and uses the first one as the retrieval corpus.

In [None]:

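# Detect text columns automatically (object or string dtype)
text_columns = [
    col for col in df.columns
    if pd.api.types.is_object_dtype(df[col]) or pd.api.types.is_string_dtype(df[col])
]
if not text_columns:
    raise ValueError("No text columns found in the dataset.")

print("Text columns detected:", text_columns)

# Choose the first text column by default; set text_col manually if it is not the review text
text_col = text_columns[0]
# Replace missing values with "" so they do not become the literal string "nan"
documents = df[text_col].fillna("").astype(str).tolist()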

print("Using column:", text_col)
print("Sample document:", documents[0][:200])
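
The cell above uses only the first text column. If the reviews are spread over several columns, they can be joined into one document per row instead. The following is a minimal sketch: `build_documents` is a hypothetical helper, and the `title` and `body` column names are assumptions that may not match `reviews.csv`.

```python
import pandas as pd

def build_documents(frame, columns):
    """Join the given text columns row by row into one string per row."""
    combined = (
        frame[columns]
        .fillna("")             # missing values become empty strings, not "nan"
        .astype(str)
        .agg(" ".join, axis=1)  # concatenate the columns of each row
        .str.strip()            # drop the stray space left by an empty column
    )
    return combined.tolist()

# Toy frame with hypothetical column names; reviews.csv may differ
toy = pd.DataFrame({
    "title": ["Great", None],
    "body": ["Works well", "Broke fast"],
})
print(build_documents(toy, ["title", "body"]))
```

To use it in this notebook, `documents = build_documents(df, text_columns)` would replace the single-column assignment.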


## Create Embeddings using TF-IDF

In [None]:

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
doc_vectors = vectorizer.fit_transform(documents)

print("Document vectors shape:", doc_vectors.shape)
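
To make the weighting less of a black box, here is a rough pure-Python sketch of TF-IDF with the smoothed IDF formula and L2 normalisation that scikit-learn documents as its defaults. The `tfidf_vectors` helper is hypothetical, and it tokenizes by splitting on whitespace with no stop-word removal, so its numbers will not match `TfidfVectorizer` exactly.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalisation."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale each vector to unit length; an empty document stays all zeros
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

vocab, vecs = tfidf_vectors(["good battery", "bad battery"])
for term, weight in zip(vocab, vecs[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "battery" appears in every review and so gets a lower weight than "good", which appears in only one.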


## Retrieval Function

In [None]:

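def retrieve(query, top_k=3):
    """Return the top_k documents most similar to the query, best match first."""
    # Project the query into the same TF-IDF space as the documents
    query_vec = vectorizer.transform([query])
    similarities = cosine_similarity(query_vec, doc_vectors)[0]
    # Sort ascending, reverse to descending, keep the first top_k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]

    return [
        {
            "index": int(idx),
            "score": float(similarities[idx]),
            "text": documents[idx],
        }
        for idx in top_indices
    ]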
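
The ranking step can also be written out by hand to show what it does: score every document against the query by cosine similarity, then keep the highest scores. This is an illustrative sketch on plain Python lists. The `cosine` and `top_k` helpers are hypothetical names, not part of scikit-learn.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar documents, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    best = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in best]

# Three toy document vectors and a query pointing along the first axis
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for idx, score in top_k([1.0, 0.0], doc_vecs, k=2):
    print(idx, round(score, 3))
```

A score of 0 means the query and the document share no weighted terms, so such hits are not really matches even when they appear in the top `k`.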


## Simple Answer Generator
This is a placeholder for an LLM. It formats the retrieved documents into a response without generating any new text.
You can later replace it with a call to an LLM from OpenAI, Groq, or any other provider.

In [None]:

def generate_answer(query, retrieved_docs):
    answer = f"Question: {query}\n\nBased on the dataset, here are the most relevant entries:\n\n"
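    for i, doc in enumerate(retrieved_docs, 1):
        text = doc['text']
        # Truncate long entries and mark the cut with an ellipsis
        snippet = (text[:300] + "...") if len(text) > 300 else text
        answer += f"{i}. (score={doc['score']:.3f}) {snippet}\n\n"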
    answer += "You can plug this context into an LLM for a more natural answer."
    return answer
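
When an LLM is plugged in, the retrieved documents usually go into the prompt as context. The sketch below only builds that prompt string and calls no provider API. `build_prompt`, its wording, and the `max_chars` parameter are all assumptions made for illustration.

```python
def build_prompt(query, retrieved_docs, max_chars=300):
    """Format retrieved documents and the question into a single prompt string."""
    context_lines = []
    for i, doc in enumerate(retrieved_docs, 1):
        # Cap each review so the prompt stays short
        snippet = doc["text"][:max_chars]
        context_lines.append(f"[{i}] {snippet}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the reviews below. "
        "If they do not contain the answer, say so.\n\n"
        f"Reviews:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Toy retrieval results in the same shape that retrieve() returns
sample_docs = [
    {"index": 0, "score": 0.52, "text": "Lasted two years without issues."},
    {"index": 7, "score": 0.31, "text": "The hinge broke after a month."},
]
print(build_prompt("Is it durable?", sample_docs))
```

The returned string would then be sent to whichever LLM client is chosen, in place of the formatted answer from `generate_answer`.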


## Chat Function

In [None]:

def chat(query, top_k=3):
    retrieved = retrieve(query, top_k=top_k)
    response = generate_answer(query, retrieved)
    return response
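
One gap in the function above is that it always returns `top_k` entries, even when none of them shares a term with the query. A possible variation filters out low-scoring hits first. In this sketch, `chat_with_fallback`, the `min_score` threshold, and the fallback message are assumptions. The retrieval and generation functions are passed in so the example runs on its own.

```python
def chat_with_fallback(query, retrieve_fn, generate_fn, top_k=3, min_score=0.0):
    """Answer from retrieved documents, or return a fallback when nothing matches."""
    # Keep only hits whose similarity is strictly above the threshold
    hits = [d for d in retrieve_fn(query, top_k=top_k) if d["score"] > min_score]
    if not hits:
        return "No reviews matched that question. Try different keywords."
    return generate_fn(query, hits)

# Stand-ins for retrieve() and generate_answer(), used only for this demo
def fake_retrieve(query, top_k=3):
    score = 0.4 if "battery" in query else 0.0
    return [{"index": 0, "score": score, "text": "Battery lasts all day."}][:top_k]

def fake_generate(query, docs):
    return f"{len(docs)} relevant review(s) for: {query}"

print(chat_with_fallback("How is the battery?", fake_retrieve, fake_generate))
print(chat_with_fallback("How is the screen?", fake_retrieve, fake_generate))
```

In the notebook, the call would be `chat_with_fallback(query, retrieve, generate_answer)`.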


## Try It Out

In [None]:

query = "What do people think about this product?"
print(chat(query, top_k=3))


## Interactive Loop (Optional)

In [None]:

while True:
    q = input("Ask a question (or type 'exit'): ")
    if q.strip().lower() == "exit":
        break
    print(chat(q, top_k=3))
    print("-" * 80)


## Next Improvements
- Replace TF-IDF with sentence-transformer embeddings
- Use FAISS or Chroma for vector search
- Plug in an LLM (OpenAI / Groq / HuggingFace) for natural answers
- Add a UI with Gradio or Streamlit