# RAG-powered Chatbot Assignment: Document Indexing and Retrieval


## Objective
This assignment gives you hands-on experience building a Retrieval-Augmented Generation (RAG) chatbot. You will focus on the core components of RAG: indexing documents and retrieving them efficiently to answer user queries from a given corpus.

## Part 1: Setup and Environment (10 Marks)

1.  **Environment Setup:**
    * Create a new Python virtual environment for this project.
    * Install all necessary libraries, for example: `transformers` with `torch` or `tensorflow`; one of `faiss-cpu`, `annoy`, or `chromadb`; `nltk` or `spacy`; one of `langchain` or `llama-index`; `scikit-learn`; and `unstructured` or `pypdf` for document parsing.
    * Provide a `requirements.txt` file listing all dependencies.

2.  **Dataset Acquisition:**
    * Choose a suitable document corpus. This could be:
        * A collection of Wikipedia articles on a specific topic (e.g., "Artificial Intelligence", "Climate Change").
        * A set of research papers (e.g., from arXiv).
        * A collection of documentation files for a software project.
        * **Minimum Requirement:** The corpus should contain at least **50 distinct documents** and a total of at least **10,000 words**.
    * Describe your chosen dataset, its source, and its characteristics.
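
The minimum corpus requirement above can be checked programmatically before any indexing work starts. The following is a minimal sketch, assuming the corpus is already loaded as a list of strings; the names `corpus_stats` and `meets_minimum` are hypothetical, and words are counted by simple whitespace splitting.

```python
def corpus_stats(documents):
    """Return (number of distinct non-empty documents, total word count)."""
    # A set drops exact duplicates, so only distinct documents are counted.
    distinct = {doc.strip() for doc in documents if doc.strip()}
    total_words = sum(len(doc.split()) for doc in distinct)
    return len(distinct), total_words


def meets_minimum(documents, min_docs=50, min_words=10_000):
    """Check the corpus against the assignment's minimum size requirement."""
    n_docs, n_words = corpus_stats(documents)
    return n_docs >= min_docs and n_words >= min_words
```

Reporting the two numbers from `corpus_stats` is also a convenient starting point for the dataset description.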

In [None]:
# Your code for environment setup description and dataset acquisition details here.

## Part 2: Document Preprocessing and Chunking (20 Marks)

1.  **Document Loading and Parsing:**
    * Implement a robust method to load documents from your chosen corpus. Handle various file formats if applicable (e.g., PDF, TXT, Markdown).
    * Extract plain text content from each document.

2.  **Text Preprocessing:**
    * Apply the text preprocessing steps your pipeline needs (e.g., lowercasing, punctuation removal, stop word removal, stemming/lemmatization) and justify your choices.

3.  **Document Chunking:**
    * Divide the processed documents into smaller, meaningful chunks. Experiment with different chunking strategies:
        * **Fixed-size chunks:** e.g., 200 words with an overlap of 50 words.
        * **Sentence-based chunks:** Divide based on sentence boundaries.
        * **Semantic chunks:** (Optional, for extra credit) Use more advanced methods to ensure chunks retain semantic coherence.
    * Justify your chosen chunking strategy and the parameters used. Explain why this strategy is suitable for RAG.
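
The preprocessing and chunking steps above can be sketched with the standard library alone. This is an illustrative sketch, not a reference solution: `preprocess`, `fixed_size_chunks`, `sentence_chunks`, and the tiny `STOP_WORDS` set are hypothetical names, the sentence splitter is a naive regular expression (a real pipeline would more likely use `nltk` or `spacy`), and the defaults mirror the 200-word, 50-word-overlap example.

```python
import re

# Tiny illustrative stop-word list; a real one would come from nltk or spacy.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def preprocess(text, lowercase=True, strip_punctuation=True, remove_stop_words=False):
    """Apply optional normalization steps and collapse whitespace."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


def fixed_size_chunks(text, chunk_size=200, overlap=50):
    """Split text into chunks of chunk_size words sharing overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def sentence_chunks(text, max_words=200):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Two design points are worth weighing in your justification. Punctuation removal destroys the sentence boundaries that `sentence_chunks` relies on, so sentence-based chunking has to run before it. Sentence-embedding models are also typically trained on natural text, so heavy normalization such as stemming or stop word removal may hurt rather than help dense retrieval.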

In [None]:
# Your code for document loading, parsing, preprocessing, and chunking here.
# Include explanations and justifications for your choices.

## Part 3: Document Indexing and Vector Store (30 Marks)

1.  **Text Embedding:**
    * Choose a pre-trained model to generate embeddings for your text chunks, e.g., Sentence-BERT models such as `all-MiniLM-L6-v2` or `all-mpnet-base-v2`, or other BERT- or RoBERTa-based encoders.
    * Describe the chosen model and justify why it's suitable for your task.

2.  **Vector Store Implementation:**
    * Build a vector store (or knowledge base) to efficiently store and retrieve the embeddings.
    * You can use libraries like:
        * `FAISS` (Facebook AI Similarity Search) for efficient similarity search.
        * `Annoy` (Approximate Nearest Neighbors Oh Yeah).
        * `ChromaDB`, which can run locally, or `Pinecone`, a managed cloud service (use it only if you are comfortable with cloud-based solutions; a local store is preferred for this assignment).
    * Index all your document chunks and their corresponding embeddings into the chosen vector store.
    * Explain your choice of vector store and how it works.

3.  **Persistence (Optional, for extra credit):**
    * Implement a way to save and load your pre-built index to avoid re-indexing every time.
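
To make the indexing and persistence steps above concrete, here is a minimal brute-force vector store with save and load. It is a teaching sketch under stated assumptions: `InMemoryVectorStore` is a hypothetical class, not the API of FAISS, Annoy, or ChromaDB, embeddings are plain lists of floats, and the index is persisted as JSON. It scans every vector on each search, which is fine for a few thousand chunks but is exactly the cost that approximate nearest-neighbor libraries are designed to avoid.

```python
import json
import math
from pathlib import Path


class InMemoryVectorStore:
    """Brute-force cosine-similarity store (illustrative stand-in for a real library)."""

    def __init__(self):
        self.chunks = []
        self.vectors = []  # unit-length vectors, so a dot product equals cosine similarity

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vector]

    def add(self, chunk, embedding):
        """Index one chunk together with its embedding."""
        self.chunks.append(chunk)
        self.vectors.append(self._normalize(embedding))

    def search(self, query_embedding, k=3):
        """Return the k best (score, chunk) pairs, highest score first."""
        query = self._normalize(query_embedding)
        scored = [
            (sum(q * v for q, v in zip(query, vector)), chunk)
            for vector, chunk in zip(self.vectors, self.chunks)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    def save(self, path):
        """Persist chunks and vectors so the corpus need not be re-embedded."""
        payload = {"chunks": self.chunks, "vectors": self.vectors}
        Path(path).write_text(json.dumps(payload), encoding="utf-8")

    @classmethod
    def load(cls, path):
        """Rebuild a store from a file written by save()."""
        payload = json.loads(Path(path).read_text(encoding="utf-8"))
        store = cls()
        store.chunks = payload["chunks"]
        store.vectors = payload["vectors"]
        return store
```

If you use FAISS instead, it ships its own `faiss.write_index` and `faiss.read_index` functions for the index itself; the chunk texts still need to be saved separately.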

In [None]:
# Your code for text embedding and vector store implementation here.
# Include explanations and justifications for your choices.

## Part 4: Retrieval Module (25 Marks)

1.  **Query Embedding:**
    * Implement a function to convert a user query into an embedding using the *same* embedding model used for document chunks.

2.  **Similarity Search:**
    * Implement a function to perform a similarity search (e.g., cosine similarity) in your vector store using the query embedding.
    * Retrieve the top `k` most relevant document chunks based on their similarity scores. Experiment with different values of `k` (e.g., 3, 5, 10) and justify your chosen `k`.
    * Display the retrieved chunks and their similarity scores for a few sample queries.

3.  **Retrieval Evaluation (Qualitative):**
    * Formulate at least 5 diverse sample queries related to your document corpus.
    * For each query, analyze the retrieved chunks. Do they seem relevant? Are there any irrelevant chunks? What are the potential issues?
    * Discuss how you might improve the retrieval quality.
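
The query-embedding and top-`k` steps above can be sketched as follows. The key point it illustrates is that the query and the chunks pass through the same embedding function. Note the loud assumption: `toy_embed` is a hashed bag-of-words stand-in used only so the example runs without downloading a model; it is not a semantic embedding, and in your notebook it would be replaced by your chosen embedding model. `retrieve` and `show_results` are likewise hypothetical names.

```python
import heapq
import math
import zlib


def toy_embed(text, dim=64):
    """Stand-in for a real embedding model: hashed bag-of-words counts."""
    vector = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vector[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vector


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=3, embed=toy_embed):
    """Return the k highest-scoring (score, chunk) pairs, best first."""
    query_vector = embed(query)  # same embed function as for the chunks
    scored = ((cosine_similarity(query_vector, embed(chunk)), chunk) for chunk in chunks)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def show_results(query, chunks, ks=(3, 5, 10)):
    """Print retrieved chunks and scores for several values of k."""
    for k in ks:
        print(f"query={query!r} k={k}")
        for score, chunk in retrieve(query, chunks, k=k):
            print(f"  {score:.3f}  {chunk}")
```

Re-embedding every chunk on each call keeps the sketch short; a real retrieval module would embed the chunks once at indexing time and only embed the query here.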

In [None]:
# Your code for query embedding, similarity search, and qualitative retrieval evaluation here.
# Include your sample queries and analysis.

## Part 5: Chatbot Integration (Basic RAG) (15 Marks)

1.  **LLM Integration:**
    * Choose a suitable pre-trained Large Language Model (LLM) for text generation (e.g., `gpt2`, `distilgpt2`, or another small open-source model you can run locally). For demonstration, you can use a Hugging Face pipeline for text generation.
    * **Note:** You are *not* required to fine-tune an LLM. The focus is on RAG.

2.  **RAG Pipeline:**
    * Build a simple RAG pipeline:
        * User provides a query.
        * Retrieve relevant document chunks using your retrieval module.
        * Construct a prompt for the LLM that includes the user query and the retrieved context.
        * Generate a response using the LLM based on the prompt.

3.  **Demonstration:**
    * Provide at least 3 examples of user queries and the chatbot's generated responses. Demonstrate how the retrieved information is used to answer the questions.
    * Discuss the strengths and limitations of your basic RAG chatbot.
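
The prompt-construction step of the RAG pipeline above can be sketched as a plain function. This is one possible layout, not a required template: `build_prompt`, the instruction wording, and the `max_context_chars` budget are all assumptions made for illustration. The budget matters because small models such as `gpt2` have short context windows, so not every retrieved chunk may fit.

```python
def build_prompt(query, retrieved_chunks, max_context_chars=2000):
    """Combine the user query and retrieved chunks into one prompt string."""
    context_parts, used = [], 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_context_chars:
            break  # stop before the context exceeds the character budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )
```

The returned string is what you would pass to the text-generation model. Numbering the chunks makes it easier to show, in your demonstration, which retrieved passage an answer drew on.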

In [None]:
# Your code for basic RAG pipeline implementation and demonstration here.
# Include your sample queries and chatbot responses.

## Part 6: Reflection and Future Work (Bonus - 5 Marks)

1.  **Challenges Faced:**
    * Describe any challenges you encountered during this assignment and how you overcame them.

2.  **Potential Improvements:**
    * Suggest ways to improve the RAG pipeline. Consider aspects like:
        * Advanced chunking strategies.
        * Re-ranking retrieved documents.
        * Handling multi-turn conversations.
        * Integrating more sophisticated LLMs.
        * Quantitative evaluation metrics for retrieval and generation.

3.  **Learnings:**
    * Summarize your key learnings from building this RAG-powered chatbot.
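
Of the potential improvements listed above, quantitative retrieval evaluation is straightforward to prototype. The sketch below computes recall@k and mean reciprocal rank (MRR), assuming you have hand-labeled, for each test query, the IDs of the chunks that are actually relevant. The function names and the chunk-ID convention are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant.intersection(retrieved_ids[:k]))
    return hits / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

Tracking these numbers while varying `k`, the chunk size, or the embedding model turns the qualitative analysis from Part 4 into a comparison you can tabulate.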

## Submission Guidelines

* Submit this Jupyter Notebook (.ipynb file) with all cells executed and outputs visible.
* Ensure your code is well-commented and easy to understand.
* Provide a `requirements.txt` file.
* Include a brief `README.md` file (optional but recommended) explaining how to run your code and any specific instructions.
* Make sure your notebook runs without errors in a fresh environment created from your `requirements.txt`.