# Assignment: Creating an Embedding Store with FAISS and Sentence Transformers

## Objective
This assignment focuses on building an efficient embedding store using FAISS (Facebook AI Similarity Search) and Sentence Transformers. You will learn to generate sentence embeddings and index them for fast similarity search, a core component of NLP applications such as semantic search, recommendation systems, and retrieval-augmented generation (RAG) pipelines.

## Part 1: Environment Setup and Data Preparation (20 Marks)

1.  **Environment Setup:**
    * Create a new Python virtual environment for this assignment.
    * Install the necessary libraries: `torch`, `transformers`, `sentence-transformers`, `faiss-cpu`, `numpy`, `scikit-learn` (add `pandas` if you plan to load your data into a DataFrame).
    * Provide a `requirements.txt` file listing all dependencies.

2.  **Dataset Acquisition:**
    * Choose a text dataset for this assignment. Recommended options:
        * A collection of movie plot summaries (e.g., from Kaggle or public domain sources).
        * A subset of a news article dataset (e.g., AG News).
        * A collection of scientific paper abstracts.
    * **Minimum Requirement:** The dataset should contain at least **500 distinct text documents/sentences/paragraphs**.
    * Load your chosen dataset into a suitable data structure (e.g., a list of strings or a Pandas DataFrame).
    * Describe your chosen dataset, its source, and its characteristics (e.g., average sentence length, topic diversity).
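
One way to back the dataset description with numbers is to compute a few summary statistics from the loaded texts. The sketch below assumes the dataset is already a plain list of strings; `describe_corpus` and `sample_docs` are illustrative names, not part of any library, and the toy corpus is far smaller than the 500-document minimum.

```python
from statistics import mean


def describe_corpus(documents):
    """Return basic statistics for a list of text documents."""
    # Drop empty strings and exact duplicates while preserving order.
    distinct = list(dict.fromkeys(doc.strip() for doc in documents if doc.strip()))
    word_counts = [len(doc.split()) for doc in distinct]
    return {
        "num_documents": len(documents),
        "num_distinct": len(distinct),
        "avg_words": mean(word_counts) if word_counts else 0.0,
        "min_words": min(word_counts, default=0),
        "max_words": max(word_counts, default=0),
    }


# Hypothetical toy corpus; a real submission needs at least 500 distinct texts.
sample_docs = [
    "A crew travels to Mars.",
    "A detective hunts a thief in Paris.",
    "A crew travels to Mars.",
]
stats = describe_corpus(sample_docs)
print(stats)
```

Comparing `num_distinct` against the 500-document requirement is a quick check that duplicates in the raw data have not inflated the count.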

In [None]:
# Your code for environment setup description and dataset acquisition here.
# Include comments and explanations.

## Part 2: Sentence Embedding Generation (30 Marks)

1.  **Load Sentence Transformer Model:**
    * Load a pre-trained Sentence Transformer model. Recommended models:
        * `'all-MiniLM-L6-v2'` (good balance of performance and speed)
        * `'all-mpnet-base-v2'` (higher performance)
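        * `'distilbert-base-nli-stsb-mean-tokens'` (an older model, marked as deprecated on its model card; useful mainly as a baseline for comparison)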
    * Justify your choice of model.

2.  **Generate Embeddings:**
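    * Generate an embedding for each text document/sentence/paragraph in your dataset using the loaded Sentence Transformer model. The model's `encode` method accepts a list of texts and processes it in batches, so an explicit Python loop is not required.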
    * Store the generated embeddings in a NumPy array. Ensure the data type is `float32` for FAISS compatibility.

3.  **Inspect Embeddings:**
    * Print the shape of the resulting embeddings array (e.g., `(num_samples, embedding_dimension)`).
    * Display a few sample embeddings (e.g., the first 5 embeddings) to observe their structure.
    * Discuss the dimensionality of the embeddings and its implications.
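
The embedding steps above might be sketched as follows. This is a minimal outline, assuming the `sentence-transformers` package is installed and `all-MiniLM-L6-v2` is the chosen model; `encode_documents` and `as_faiss_matrix` are hypothetical helper names introduced only for illustration.

```python
import numpy as np


def as_faiss_matrix(embeddings):
    """Return a C-contiguous float32 2-D array, the layout FAISS expects."""
    matrix = np.ascontiguousarray(embeddings, dtype=np.float32)
    if matrix.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {matrix.shape}")
    return matrix


def encode_documents(documents, model_name="all-MiniLM-L6-v2", batch_size=32):
    """Encode texts into a matrix of shape (num_samples, embedding_dimension)."""
    # Imported here so as_faiss_matrix stays usable without the model library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    embeddings = model.encode(
        documents,
        batch_size=batch_size,
        convert_to_numpy=True,
        show_progress_bar=False,
    )
    return as_faiss_matrix(embeddings)
```

After `embeddings = encode_documents(documents)`, printing `embeddings.shape` and `embeddings[:5]` covers the inspection step. With `all-MiniLM-L6-v2` the second dimension is 384; `all-mpnet-base-v2` produces 768-dimensional vectors, which doubles the memory needed per stored document.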

In [None]:
# Your code for loading the Sentence Transformer model and generating embeddings here.
# Include model justification and embedding inspection.

## Part 3: FAISS Index Creation and Population (30 Marks)

1.  **Choose FAISS Index Type:**
    * Select an appropriate FAISS index type for your needs. For this assignment, start with a simple index:
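        * `faiss.IndexFlatL2` (exact search by L2 distance; FAISS reports squared L2 distances, so smaller values mean closer matches. Good for small datasets and for understanding the basics.)
        * Alternatively, `faiss.IndexFlatIP` (exact inner-product search; on L2-normalized embeddings the inner product equals cosine similarity, so larger values mean closer matches.)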
    * Briefly explain your chosen index type and its characteristics.

2.  **Initialize and Add Embeddings to Index:**
    * Initialize the FAISS index with the correct embedding dimension.
    * Add your generated embeddings to the FAISS index.
    * Report the total number of vectors added to the index.

3.  **Index Persistence (Optional, for extra credit):**
    * Implement saving and loading the FAISS index to/from disk using `faiss.write_index` and `faiss.read_index`.
    * Demonstrate saving and then loading the index.
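
The index steps above could look roughly like this. It is a sketch under stated assumptions, not a reference solution: it assumes `faiss-cpu` is installed, and `prepare_vectors`, `build_flat_index`, `save_and_reload`, and the file name `embeddings.index` are hypothetical names chosen for the example.

```python
import numpy as np


def prepare_vectors(embeddings):
    """Copy embeddings into a C-contiguous float32 matrix.

    Copying matters because in-place normalization would otherwise
    modify the caller's array.
    """
    matrix = np.array(embeddings, dtype=np.float32, order="C")
    if matrix.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {matrix.shape}")
    return matrix


def build_flat_index(embeddings, use_inner_product=False):
    """Build an exact (brute-force) FAISS index and add all vectors to it."""
    import faiss  # imported lazily; provided by the faiss-cpu package

    matrix = prepare_vectors(embeddings)
    dimension = matrix.shape[1]
    if use_inner_product:
        # Normalize in place so inner product equals cosine similarity.
        faiss.normalize_L2(matrix)
        index = faiss.IndexFlatIP(dimension)
    else:
        index = faiss.IndexFlatL2(dimension)
    index.add(matrix)
    print(f"Vectors in index: {index.ntotal}")
    return index


def save_and_reload(index, path="embeddings.index"):
    """Write the index to disk, then read it back as a new object."""
    import faiss

    faiss.write_index(index, path)
    return faiss.read_index(path)
```

Comparing `index.ntotal` before and after `save_and_reload` is a simple way to demonstrate that persistence worked. Flat indexes need no training step, which is why `add` can be called immediately after construction.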

In [None]:
# Your code for FAISS index creation and population here.
# Include explanations for your chosen index type.

## Part 4: Similarity Search and Evaluation (20 Marks)

1.  **Formulate Query:**
    * Choose at least **3 distinct text queries** that are relevant to your dataset. The queries should *not* appear verbatim in your indexed documents, but they should be semantically related to them.
        * For example, if your dataset is movie plots, a query could be: "A film about space exploration and finding new life."

2.  **Generate Query Embeddings:**
    * Generate embeddings for each of your query sentences using the *same* Sentence Transformer model used for your document embeddings.
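    * Ensure the query embeddings are `float32` and have the correct shape (e.g., `(1, embedding_dimension)` for a single query). If you indexed normalized embeddings with `IndexFlatIP`, normalize the query embeddings the same way.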

3.  **Perform Similarity Search:**
    * Use the `search` method of your FAISS index to find the top `k` most similar documents for each query.
    * Experiment with different values of `k` (e.g., `k=3`, `k=5`, `k=10`) and observe the results.

4.  **Display Results and Analysis:**
    * For each query and selected `k` value:
        * Print the query.
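        * Print the distances/similarity scores of the retrieved documents, and state how to read them (with `IndexFlatL2`, smaller is closer; with `IndexFlatIP`, larger is closer).
        * Print the original text of each retrieved document, looked up in your dataset by the indices that `search` returns.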
    * Qualitatively analyze the retrieved results. Are the top `k` documents relevant to your query? Are there any surprising or irrelevant results? Discuss potential reasons for good/bad retrieval.
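
The query and search steps above can be sketched as below. The helper names (`as_query_matrix`, `format_results`, `search_and_report`) are hypothetical, and the sketch assumes `index` is a FAISS index built over `documents` and `model` is the same Sentence Transformer used to embed them.

```python
import numpy as np


def as_query_matrix(query_embedding):
    """Turn one embedding of shape (d,) into a float32 matrix of shape (1, d)."""
    matrix = np.atleast_2d(np.asarray(query_embedding, dtype=np.float32))
    return np.ascontiguousarray(matrix)


def format_results(query, scores, ids, documents):
    """Build printable lines for one query from a row of FAISS search output."""
    lines = [f"Query: {query}"]
    for rank, (score, doc_id) in enumerate(zip(scores, ids), start=1):
        if doc_id == -1:  # FAISS pads with -1 when fewer than k results exist
            continue
        lines.append(f"{rank}. score={score:.4f} | {documents[doc_id]}")
    return lines


def search_and_report(index, model, query, documents, k=5):
    """Embed a query, search the index, and return formatted result lines."""
    query_matrix = as_query_matrix(model.encode(query))
    # search returns two (num_queries, k) arrays: scores and document ids.
    scores, ids = index.search(query_matrix, k)
    return format_results(query, scores[0], ids[0], documents)
```

Looping over the queries and over `k` in `(3, 5, 10)` and printing `"\n".join(search_and_report(...))` gives the output needed for the qualitative analysis. The score column is a squared L2 distance or an inner product, depending on the index type chosen in Part 3.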

In [None]:
# Your code for query formulation, embedding generation, similarity search, and result analysis here.
# Include your queries, retrieved documents, and qualitative analysis.

## Part 5: Reflection and Future Work (Bonus - 5 Marks)

1.  **Challenges Faced:**
    * Describe any challenges you encountered during this assignment and how you overcame them.

2.  **Potential Improvements:**
    * Suggest ways to improve the embedding store or the similarity search process. Consider:
        * Using different Sentence Transformer models.
        * Exploring approximate FAISS index types (e.g., `IndexIVFFlat`, `IndexHNSWFlat`) for larger datasets.
        * Preprocessing techniques for text data.
        * Re-ranking retrieved results.

3.  **Learnings:**
    * Summarize your key learnings from building this embedding store with FAISS and Sentence Transformers.
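
For the approximate index types mentioned under Potential Improvements, an inverted-file index is one possible starting point. The sketch below assumes `faiss-cpu` is installed; `choose_nlist` and `build_ivf_index` are hypothetical names, and the square-root rule for the number of clusters is a common rule of thumb rather than a FAISS requirement.

```python
import math

import numpy as np


def choose_nlist(num_vectors):
    """Pick roughly sqrt(N) clusters, with a minimum of 1 (a heuristic)."""
    return max(1, math.isqrt(num_vectors))


def build_ivf_index(embeddings, nprobe=8):
    """Build an IndexIVFFlat: cluster the vectors, then search only some clusters."""
    import faiss  # imported lazily; provided by the faiss-cpu package

    matrix = np.ascontiguousarray(embeddings, dtype=np.float32)
    num_vectors, dimension = matrix.shape
    nlist = choose_nlist(num_vectors)

    quantizer = faiss.IndexFlatL2(dimension)  # assigns vectors to clusters
    index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
    index.train(matrix)  # IVF indexes must be trained before vectors are added
    index.add(matrix)
    # nprobe sets how many clusters are scanned per query (speed vs. recall).
    index.nprobe = min(nprobe, nlist)
    return index
```

On a dataset of only a few hundred documents the flat index is likely to be just as fast, and FAISS may warn that there are too few training points per cluster, so the benefit of this index type shows mainly at larger scale.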

## Submission Guidelines

* Submit this Jupyter Notebook (.ipynb file) with all cells executed and outputs visible.
* Ensure your code is well-commented and easy to understand.
* Provide a `requirements.txt` file.
* Include a brief `README.md` file (optional but recommended) explaining how to run your code and any specific instructions.
* Make sure your notebook runs without errors in the specified environment.