# Assignment: Evaluate LLM Outputs using LangSmith Eval Chains

## Objective
This assignment focuses on using LangSmith's evaluation capabilities together with LangChain's built-in evaluators (such as `qa`, `context_qa`, and `criteria`) to automatically assess the quality of LLM-generated outputs. You will learn to set up a LangSmith project, run a custom dataset through an LLM chain, and apply several evaluators to gain insight into your model's performance.

## Part 1: Environment Setup and LangSmith Project (20 Marks)

1.  **Environment Setup:**
    * Create a new Python virtual environment.
    * Install the necessary libraries: `langchain`, `langsmith`, `openai` (or your preferred LLM client library, plus its LangChain integration package such as `langchain-openai`), `python-dotenv`, `datasets`.
    * Provide a `requirements.txt` file listing these dependencies.

2.  **LangSmith Project Setup:**
    * If you don't have a LangSmith account, create one at [https://smith.langchain.com/](https://smith.langchain.com/).
    * Create a new LangSmith project for this assignment (e.g., "LLM_Eval_Assignment").
    * Obtain your LangSmith API Key.
    * Create a `.env` file in your project root and add your LangSmith API key (`LANGCHAIN_API_KEY`) and `LANGCHAIN_TRACING_V2=true`. Traces are not recorded unless this flag is set.
    * Set your project name (`LANGCHAIN_PROJECT`) in the `.env` file.
    * Enable tracing by loading these variables before running any chain, either by calling `load_dotenv()` from `python-dotenv` or by setting them directly through `os.environ`.
    * Confirm that LangSmith tracing is enabled by running a trivial LangChain expression and checking your LangSmith project for a new trace.
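The setup steps above can be sanity-checked with a short script. The following is a minimal sketch, not a required implementation: `missing_tracing_vars` and `run_smoke_trace` are hypothetical helper names, and the smoke test assumes `python-dotenv` and `langchain-core` (installed alongside `langchain`) are available.

```python
import os

# Variables named in the setup steps above.
REQUIRED_VARS = ("LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT")


def missing_tracing_vars(env=None):
    """Return the tracing-related variables that are unset or not usable."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    # The tracing flag must be the literal string "true".
    if env.get("LANGCHAIN_TRACING_V2") != "true":
        missing.append("LANGCHAIN_TRACING_V2")
    return missing


def run_smoke_trace():
    """Invoke a trivial runnable; with tracing on, it should show up as a trace."""
    # Imported lazily so the check above works before packages are installed.
    from dotenv import load_dotenv
    from langchain_core.runnables import RunnableLambda

    load_dotenv()  # reads a .env file into os.environ
    problems = missing_tracing_vars()
    if problems:
        raise RuntimeError(f"Tracing is not configured, check: {problems}")
    return RunnableLambda(lambda x: x + 1).invoke(1)
```

Calling `run_smoke_trace()` in a notebook cell and then refreshing the project page in LangSmith is one way to confirm that a new trace arrived.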

In [None]:
# Your code for environment setup, requirements.txt, and LangSmith project initialization here.
# Demonstrate a small LangChain expression to confirm tracing is active.

## Part 2: Custom Dataset and LLM Chain (30 Marks)

1.  **Custom Dataset Creation:**
    * Create a small dataset for your evaluation. This dataset should consist of input-output pairs suitable for an LLM to generate responses.
    * **Recommended Dataset Type:** Question-Answering (QA) pairs or simple text generation prompts.
    * **Minimum Requirement:** At least **10 examples** (10-15 is a good target). Each example should include:
        * `question` (str): The input query for the LLM.
        * `answer` (str): The expected or ground-truth answer (for comparison).
        * (Optional but Recommended) `context` (str): Relevant context that the LLM *should* use to answer the question, simulating a RAG setup.
    * Store this dataset as a list of dictionaries.
    * Upload this dataset to LangSmith with the `langsmith` `Client`, for example by calling `client.create_dataset` followed by `client.create_examples`.
    * Take a screenshot of your uploaded dataset in the LangSmith UI.
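As a rough illustration of the dataset step, the sketch below stores examples as a list of dictionaries and uploads them with the `langsmith` client. The two rows are placeholders only, `split_fields` and `upload_dataset` are hypothetical helper names, and the dataset name is an assumption. The upload uses `Client.create_dataset` and `Client.create_examples` and needs a valid API key.

```python
# Illustrative rows only; replace them with your own 10 or more examples.
EXAMPLES = [
    {
        "question": "At what temperature does water boil at sea level?",
        "answer": "100 degrees Celsius",
        "context": "At standard atmospheric pressure, water boils at 100 degrees Celsius.",
    },
    {
        "question": "How many days are in a leap year?",
        "answer": "366",
        "context": "A leap year has 366 days, one more than a common year.",
    },
]


def split_fields(rows):
    """Split rows into parallel inputs/outputs lists, one dict per example."""
    inputs = [{"question": r["question"], "context": r.get("context", "")} for r in rows]
    outputs = [{"answer": r["answer"]} for r in rows]
    return inputs, outputs


def upload_dataset(rows, dataset_name="llm-eval-assignment-qa"):
    """Create the dataset in LangSmith and attach the examples to it."""
    from langsmith import Client  # lazy import: only needed when uploading

    client = Client()  # picks up the API key from the environment
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="QA pairs for the evaluation assignment",
    )
    inputs, outputs = split_fields(rows)
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
    return dataset
```

Keeping `context` among the inputs rather than the outputs means the chain can read it at generation time, which matches the RAG-style setup described above.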

2.  **LLM Chain Implementation:**
    * Choose an LLM provider (e.g., OpenAI, Anthropic, Google Generative AI) and set up your API key in the `.env` file.
    * Create a simple LangChain chain that takes a `question` (and optionally `context`) and generates an `answer`.
        * A basic `ChatPromptTemplate` piped into a chat model (e.g., `ChatOpenAI`) is sufficient.
        * If you included `context` in your dataset, ensure your prompt incorporates it for a RAG-like behavior.
    * Demonstrate your LLM chain by passing a sample input and printing the generated response.
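One possible shape for such a chain is sketched below. It assumes the `langchain-openai` integration package and an `OPENAI_API_KEY` in the environment. The prompt wording, the `build_chain` helper, and the model name `gpt-4o-mini` are illustrative choices, not requirements of the assignment.

```python
SYSTEM_PROMPT = (
    "Answer the question using only the provided context. "
    "If the context is insufficient, say so."
)
HUMAN_PROMPT = "Context:\n{context}\n\nQuestion: {question}"


def build_chain(model_name="gpt-4o-mini"):
    """Build a prompt -> chat model -> string pipeline."""
    # Lazy imports so the prompt strings can be inspected without the packages.
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI

    prompt = ChatPromptTemplate.from_messages(
        [("system", SYSTEM_PROMPT), ("human", HUMAN_PROMPT)]
    )
    llm = ChatOpenAI(model=model_name, temperature=0)
    # StrOutputParser turns the chat message into a plain string answer.
    return prompt | llm | StrOutputParser()


# Example usage (makes a real API call):
# chain = build_chain()
# print(chain.invoke({"question": "How many days are in a leap year?",
#                     "context": "A leap year has 366 days."}))
```

Setting `temperature=0` is a design choice that makes repeated evaluation runs easier to compare. Another provider's chat model can be swapped in without changing the rest of the pipeline.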

In [None]:
# Your code for creating the custom dataset and uploading it to LangSmith.
# Screenshot of the uploaded dataset in LangSmith UI.
# Your code for implementing the LLM chain and demonstrating it with a sample input.

## Part 3: Running Evaluations with Eval Chains (40 Marks)

1.  **Select Evaluators:**
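    * Load the built-in evaluators from `langchain.evaluation` (e.g., via `load_evaluator`), or wrap them for LangSmith with `LangChainStringEvaluator` from `langsmith.evaluation`.
    * Select at least **three** distinct evaluators from the following list (or similar ones if newer options are available):
        * `qa` (Correctness against the ground-truth `answer`)
        * `context_qa` (Correctness judged against the supplied `context`, a proxy for Faithfulness/Groundedness - requires `context`)
        * `criteria` with `coherence` (Coherence)
        * `criteria` with `relevance` (Answer Relevance - requires `question` and the generated answer)
        * `criteria` with `conciseness` (Conciseness)
        * `labeled_score_string` (for custom scoring against a reference, if you want to define your own rubric).
        * Readability (e.g., Flesch-Kincaid) has no built-in evaluator in `langchain.evaluation`; measuring it requires a custom evaluator.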
    * Briefly explain what each chosen evaluator measures.
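A hedged sketch of how evaluators might be assembled for LangSmith follows. It assumes the `LangChainStringEvaluator` wrapper from `langsmith.evaluation` and the evaluator names `qa`, `context_qa`, and `criteria` from `langchain.evaluation`. It also assumes the target function returns its answer under an `"output"` key and that dataset examples carry `question`, `context`, and `answer` fields. The `prepare_*` and `build_evaluators` names are hypothetical.

```python
def prepare_qa(run, example):
    """Map a run/example pair onto the fields a string evaluator expects."""
    return {
        "prediction": run.outputs["output"],      # what the chain produced
        "reference": example.outputs["answer"],   # ground-truth answer
        "input": example.inputs["question"],
    }


def prepare_context_qa(run, example):
    """Same mapping, but the reference is the context rather than the answer."""
    return {
        "prediction": run.outputs["output"],
        "reference": example.inputs["context"],
        "input": example.inputs["question"],
    }


def build_evaluators():
    """Return three LLM-graded evaluators (each call uses a grader model)."""
    from langsmith.evaluation import LangChainStringEvaluator

    return [
        LangChainStringEvaluator("qa", prepare_data=prepare_qa),
        LangChainStringEvaluator("context_qa", prepare_data=prepare_context_qa),
        LangChainStringEvaluator(
            "criteria",
            config={"criteria": "conciseness"},
            prepare_data=prepare_qa,
        ),
    ]
```

Explicit `prepare_data` functions are used here because the examples have more than one input key, so the wrapper cannot be relied on to guess which field is the question.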

2.  **Configure and Run Evaluation:**
    * Use `evaluate` from `langsmith.evaluation`.
    * Pass your LLM chain (wrapped as a target function if needed), the name of your uploaded dataset, and a list of the chosen evaluators.
    * Set `max_concurrency` (e.g., 5) to speed up evaluation when the chain or the evaluators call an API-based LLM.
    * Run the evaluation.
    * Capture the experiment name from the returned results object (e.g., `results.experiment_name`) so you can locate the run in the UI.
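The run step could look roughly like the sketch below. It assumes the `evaluate` function from `langsmith.evaluation` with its `data`, `evaluators`, `experiment_prefix`, and `max_concurrency` parameters, so check the signature against your installed SDK version. `make_target`, `run_experiment`, the dataset name, and the experiment prefix are hypothetical.

```python
def make_target(chain):
    """Wrap a chain so it accepts an example's inputs and returns an outputs dict."""

    def target(inputs: dict) -> dict:
        answer = chain.invoke(
            {"question": inputs["question"], "context": inputs.get("context", "")}
        )
        # The "output" key is what the evaluators are expected to read.
        return {"output": answer}

    return target


def run_experiment(chain, evaluators, dataset_name="llm-eval-assignment-qa"):
    """Run the chain over the dataset and return the experiment name."""
    from langsmith.evaluation import evaluate  # lazy import: needs langsmith

    results = evaluate(
        make_target(chain),
        data=dataset_name,            # name of the dataset uploaded earlier
        evaluators=evaluators,
        experiment_prefix="eval-assignment",
        max_concurrency=5,            # parallel calls to the chain and graders
    )
    return results.experiment_name
```

Wrapping the chain in a plain function keeps the mapping between dataset fields and chain inputs explicit, which helps when the dataset schema and the prompt variables differ.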

3.  **Analyze Results in LangSmith UI:**
    * Navigate to your LangSmith project and find the evaluation run (experiment) you launched in the previous step, using the identifier you captured.
    * Explore the various tabs:
        * **Dataset & Runs:** View the overall scores.
        * **Examples:** Drill down into individual examples to see the LLM's response, the ground truth, and the scores from each evaluator.
        * **Traces:** For each example, inspect the full trace including the LLM call and how evaluators were invoked.
    * Take screenshots of:
        * The overall evaluation summary table showing scores for all evaluators.
        * The detailed view of at least two individual examples, showing the generated output and the scores from the evaluators.

4.  **Discussion of Results:**
    * Based on the evaluation scores and your qualitative observations from the individual examples, discuss:
        * Which aspects of your LLM's output did each evaluator highlight (strengths and weaknesses)?
        * How useful are these automated evaluators for understanding LLM performance? What are their limitations?
        * Suggest ways to improve the LLM's performance based on the evaluation results.

In [None]:
# Your code for selecting evaluators, configuring, and running the evaluation.
# Include explanations of each evaluator.
# Screenshots of LangSmith UI with evaluation summary and detailed example views.
# Your discussion of the evaluation results.

## Part 4: Reflection and Future Work (Bonus - 10 Marks)

1.  **Custom Evaluators:**
    * Beyond built-in evaluators, how would you design a custom evaluator for a specific aspect of your LLM's performance not covered by existing ones (e.g., style, tone, specific factual check)? Describe the logic and potential implementation.
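As a starting point for the custom-evaluator question, LangSmith accepts plain functions that take a run and an example and return a dictionary with a `key` and a `score`. The sketch below is one simple factual check, assuming the chain's answer is stored under `"output"` and the ground truth under `"answer"`. The name `contains_reference` is hypothetical.

```python
def contains_reference(run, example):
    """Score 1 if the ground-truth answer appears in the output, else 0.

    The match is a case-insensitive substring test, so it rewards answers
    that state the expected fact but cannot recognise paraphrases.
    """
    prediction = (run.outputs or {}).get("output", "")
    reference = (example.outputs or {}).get("answer", "")
    hit = bool(reference) and reference.lower() in prediction.lower()
    return {"key": "contains_reference", "score": int(hit)}
```

A function like this can be passed in the same evaluators list as the built-in ones. Its main limitation, worth discussing in your answer, is that a correct paraphrase scores 0.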

2.  **Experiment Tracking:**
    * How does LangSmith facilitate experiment tracking and comparison of different LLM chains or prompts?

3.  **Real-world Application:**
    * In a production LLM application, how would you integrate continuous evaluation using tools like LangSmith to monitor performance drift or regression over time?

## Submission Guidelines

* Submit this Jupyter Notebook (.ipynb file) with all cells executed and outputs visible.
* Ensure your code is well-commented and easy to understand.
* Provide a `requirements.txt` file listing all dependencies.
* Include all requested screenshots directly in the notebook or as clearly referenced image files.
* Make sure your notebook runs without errors, assuming the LangSmith API key and LLM API key are correctly set up in the `.env` file.