# Assignment: Implement a Grounding Score Checker and Display Results with Logs
## Objective
This assignment focuses on implementing a **grounding score checker** for an LLM-powered application. You will learn to evaluate how well an LLM's generated response is supported by the provided context (its grounding), a crucial step in ensuring the factual accuracy and trustworthiness of retrieval-augmented generation (RAG) systems. You will also add logging so that the scores and the information behind them can be inspected.

## Part 1: Setup and Basic RAG Pipeline (30 Marks)

1.  **Environment Setup:**
    * Create a new Python virtual environment.
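    * Install the necessary libraries: `transformers`, `torch` or `tensorflow`, `sentence-transformers`, `scikit-learn`, `numpy`, `nltk` (if you use it for sentence segmentation in Part 2), and `openai` (or your preferred LLM client library). The `logging` module is part of the Python standard library and does not need to be installed.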
    * Provide a `requirements.txt` file.

2.  **Document Corpus:**
    * Create a small, focused document corpus (e.g., 5-10 paragraphs about a specific topic, like a company's history, a product's features, or a specific scientific concept).
    * Ensure some sentences within the corpus are distinct and can be used to test grounding.

3.  **Basic RAG Pipeline Implementation:**
    * Implement a simplified RAG pipeline as functions:
        * `generate_embeddings(text_list)`: Takes a list of strings and returns their embeddings using a `sentence-transformers` model (e.g., `all-MiniLM-L6-v2`).
        * `retrieve_documents(query, document_embeddings, documents, k=3)`: Takes a query, the document embeddings, and the original documents, and returns the top `k` document chunks (or full documents if chunks are not used) ranked by embedding similarity to the query.
        * `generate_response(query, context)`: Takes a user query and the retrieved context, and generates a response using a small LLM (e.g., `distilgpt2` from Hugging Face, or make a call to a larger model like OpenAI's `gpt-3.5-turbo` if you have an API key).
    * Demonstrate that your RAG pipeline can generate a response for a sample query, using the retrieved context.
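
The retrieval step above can be sketched as follows. This is a minimal, dependency-free illustration rather than a reference solution: `toy_embed` is a hypothetical bag-of-words stand-in for a `sentence-transformers` model, the `embed` parameter is an addition not present in the signature above, and the three documents are invented sample data. In your own pipeline the embeddings would come from `generate_embeddings`.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_documents(query, document_embeddings, documents, k=3, embed=toy_embed):
    """Return the k documents most similar to the query, best match first."""
    query_embedding = embed(query)
    scored = [
        (cosine(query_embedding, emb), doc)
        for emb, doc in zip(document_embeddings, documents)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Invented sample corpus, for illustration only.
documents = [
    "The company was founded in 1998 in Oslo.",
    "The flagship product is a solar powered lamp.",
    "The lamp battery lasts twelve hours.",
]
document_embeddings = [toy_embed(d) for d in documents]
top_docs = retrieve_documents(
    "Where was the company founded", document_embeddings, documents, k=1
)
print(top_docs)
```

Ranking by cosine similarity keeps the retrieval logic independent of the embedding source, so swapping the toy embedding for dense vectors only requires a similarity function that works on those vectors.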

In [None]:
# Your code for environment setup, document corpus, and basic RAG pipeline here.
# Include a demonstration of the RAG pipeline with a sample query.

## Part 2: Grounding Score Implementation (40 Marks)

1.  **Sentence Segmentation:**
    * Implement a function `split_into_sentences(text)` that takes a string and splits it into individual sentences. You can use `nltk.sent_tokenize` (remember to download the `punkt` tokenizer data first; newer NLTK releases may ask for `punkt_tab` instead) or a rule-based approach.

2.  **Semantic Similarity for Grounding:**
    * Implement a function `calculate_grounding_score(response_sentence, context_sentences, embedding_model, threshold=0.7)`:
        * This function should take an individual sentence from the LLM's response, a list of sentences from the retrieved context, the embedding model, and a similarity threshold.
        * It should generate embeddings for the `response_sentence` and all `context_sentences`.
        * Calculate the cosine similarity between the `response_sentence` embedding and *each* `context_sentence` embedding.
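        * If the maximum similarity between the `response_sentence` and any `context_sentence` exceeds the `threshold`, consider that response sentence "grounded" and return `True` (or the max similarity score).
        * Otherwise, return `False` (or 0).
        * Whichever return convention you choose, keep the maximum similarity available to the caller, since Option B in step 3 and the per-sentence logging in Part 3 both need it.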

3.  **Overall Grounding Score:**
    * Create a function `get_overall_grounding(llm_response, retrieved_context, embedding_model, threshold=0.7)`:
        * This function will orchestrate the grounding check.
        * It should split the `llm_response` into individual sentences.
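        * It should also split the `retrieved_context` into sentences, since `calculate_grounding_score` expects a list of context sentences.
        * For each sentence in the `llm_response`, call `calculate_grounding_score` against those context sentences.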
        * Calculate an overall grounding score. This could be:
            * **Option A (Recommended):** The percentage of sentences in the `llm_response` that are grounded.
            * **Option B:** The average of the maximum similarity scores for each response sentence.
        * Return the overall grounding score and, optionally, a list of (response_sentence, is_grounded) tuples for detailed logging.

4.  **Test Grounding Score:**
    * Call your `get_overall_grounding` function with your RAG pipeline's output (LLM response and retrieved context) for at least **3 different queries**.
    * Choose one query where you expect good grounding and one where you might expect poor grounding (e.g., by asking something outside the provided context, making the LLM hallucinate).
    * Print the overall grounding score for each test case.
    * Justify your chosen `threshold` value and how it impacts the score.
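
The steps in this part can be sketched end to end as below. Treat it as a starting point under stated assumptions, not the expected solution: `toy_embed` is a hypothetical word-count stand-in for the embedding model, the `embed` callable replaces the `embedding_model` argument, `retrieved_context` is assumed to be a single string, and the detail tuples carry the maximum similarity in addition to the `(response_sentence, is_grounded)` pair described above. With `sentence-transformers`, `embedding_model.encode` would take the place of `embed`, and the similarity function would need to operate on dense vectors.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_into_sentences(text):
    """Rule-based split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def calculate_grounding_score(response_sentence, context_sentences, embed, threshold=0.7):
    """Return (is_grounded, max_similarity) for one response sentence."""
    response_embedding = embed(response_sentence)
    similarities = [cosine(response_embedding, embed(s)) for s in context_sentences]
    max_similarity = max(similarities, default=0.0)  # 0.0 when there is no context
    return max_similarity > threshold, max_similarity


def get_overall_grounding(llm_response, retrieved_context, embed, threshold=0.7):
    """Option A: fraction of response sentences whose max similarity exceeds the threshold."""
    context_sentences = split_into_sentences(retrieved_context)
    details = []
    for sentence in split_into_sentences(llm_response):
        is_grounded, max_similarity = calculate_grounding_score(
            sentence, context_sentences, embed, threshold
        )
        details.append((sentence, is_grounded, max_similarity))
    grounded_count = sum(1 for _, is_grounded, _ in details if is_grounded)
    score = grounded_count / len(details) if details else 0.0
    return score, details


# Invented sample data: the first response sentence repeats the context,
# the second has no support in it.
context = "The lamp battery lasts twelve hours. The lamp is solar powered."
response = "The lamp battery lasts twelve hours. It was invented on Mars."
score, details = get_overall_grounding(response, context, toy_embed)
print(score)
for sentence, is_grounded, max_similarity in details:
    print(is_grounded, round(max_similarity, 3), sentence)
```

Returning both the boolean and the raw similarity keeps Option A and Option B computable from the same `details` list, and it makes the effect of changing `threshold` easy to inspect.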

In [None]:
# Your code for sentence segmentation, grounding score functions, and overall grounding calculation.
# Include tests with different queries and discussions of the results and threshold choice.

## Part 3: Logging and Display (30 Marks)

1.  **Configure Logging:**
    * Configure Python's `logging` module to output messages to the console.
    * Use at least two logging levels (e.g., `INFO` for summaries such as the response and overall score, `DEBUG` for per-sentence details and full context).
    * Include timestamps in your log format.

2.  **Integrate Logging into RAG and Grounding:**
    * Modify your RAG pipeline and grounding functions to emit log messages at key stages:
        * When a query is received.
        * The retrieved documents/context (e.g., `logging.debug` for the full text).
        * The LLM-generated response (`logging.info`).
        * For each response sentence, log whether it was grounded and its max similarity score (`logging.debug`).
        * The overall grounding score (`logging.info`).
        * Any warnings or errors (e.g., if LLM fails, or no context is found).

3.  **Demonstrate Logging:**
    * Run your complete RAG pipeline with grounding score calculation for at least **3 sample queries**.
    * Ensure the log output clearly shows:
        * The query.
        * The LLM response.
        * The overall grounding score.
        * (If DEBUG level is enabled) details for each response sentence grounding and the retrieved context.
    * Copy and paste the log output for one detailed example into your notebook.

4.  **Analysis:**
    * Based on your log outputs and scores, discuss instances where the LLM might have "hallucinated" (generated ungrounded information) or where the grounding was strong.
    * Explain the importance of logging for observability in LLM applications, especially for debugging and validating grounding.
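
One possible shape for the logging configuration and the grounding log messages is sketched below. The logger name `grounding_demo`, the helper `log_grounding_results`, and the `(sentence, is_grounded, max_similarity)` tuple layout are illustrative assumptions, and the sample values are invented; adapt them to whatever your grounding functions actually return.

```python
import logging

# Console logging with timestamps; DEBUG shows per-sentence details as well.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("grounding_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)


def log_grounding_results(query, response, overall_score, sentence_details):
    """Log one pipeline run: INFO for the summary, DEBUG for each sentence."""
    logger.info("Query received: %s", query)
    logger.info("LLM response: %s", response)
    for sentence, is_grounded, max_similarity in sentence_details:
        logger.debug(
            "grounded=%s max_sim=%.3f sentence=%r",
            is_grounded, max_similarity, sentence,
        )
    logger.info("Overall grounding score: %.2f", overall_score)
    if overall_score == 0.0:
        logger.warning("No response sentence was grounded in the retrieved context.")


# Invented sample values, for illustration only.
log_grounding_results(
    query="How long does the battery last?",
    response="The battery lasts twelve hours. It was invented on Mars.",
    overall_score=0.5,
    sentence_details=[
        ("The battery lasts twelve hours.", True, 0.91),
        ("It was invented on Mars.", False, 0.12),
    ],
)
```

If no output appears in a notebook, the root logger may already have a handler attached, in which case `basicConfig` does nothing; on Python 3.8 or later, passing `force=True` to `logging.basicConfig` replaces the existing handlers.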

In [None]:
# Your code for logging configuration and integration.
# Run the pipeline with logging.
# Copy and paste a sample log output.
# Your analysis of grounding and logging importance.

## Part 4: Reflection and Future Work (Bonus - 5 Marks)

1.  **Limitations of Semantic Similarity for Grounding:**
    * What are the limitations of using only semantic similarity to check grounding? (e.g., does high similarity always mean factual correctness?)
    * Suggest other methods or metrics that could enhance grounding evaluation (e.g., fact-checking, entailment models).

2.  **Integrating with Tracing Tools:**
    * If you've used tools like Langfuse or MLflow in previous assignments, how would you integrate this grounding score checker with such tracing/observability platforms?

3.  **Practical Applications:**
    * Where else can this grounding score checker be useful in real-world scenarios beyond RAG (e.g., content moderation, data validation)?

## Submission Guidelines

* Submit this Jupyter Notebook (.ipynb file) with all cells executed and outputs visible.
* Ensure your code is well-commented and easy to understand.
* Provide a `requirements.txt` file listing all dependencies.
* Clearly present all requested code, outputs, and discussions.
* Make sure your notebook runs without errors in a fresh virtual environment built from your `requirements.txt`.