# DepoIndex: Validation Notebook

This notebook validates the performance of the `DepoIndex` AI script. The validation process is divided into two main parts:

1.  **Qualitative Analysis (Chain-of-Thought):** We will randomly select 10 topics generated by the script and ask the LLM to explain its reasoning for identifying them. This helps us understand *why* the model is making its decisions.
2.  **Quantitative Analysis (Accuracy Check):** We will compare the script's output against a manually created "ground truth" dataset to calculate a precise accuracy score, ensuring it meets the project's target of ≥ 95%.

## 1. Setup and Configuration
**Note:** Enter an API key in the first code cell before running the notebook.

First, we'll import the necessary libraries, configure the Gemini API, and set the paths to our data files.

In [None]:
import json
import os
import random
import time
from pypdf import PdfReader
import google.generativeai as genai
from IPython.display import display, Markdown

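# --- Configuration ---
# IMPORTANT: Set GOOGLE_API_KEY in your environment, or replace the empty
# string below with your key. setdefault() keeps a key that is already set.
os.environ.setdefault("GOOGLE_API_KEY", "")
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])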

PDF_FILE_PATH = 'DepostionForPersisYu_LinkPDF.pdf'
GENERATED_TOPICS_PATH = 'topics.json' # The output from your main script
GROUND_TRUTH_PATH = 'ground_truth.json' # The file you will create manually


### Load Data

We'll load the topics generated by our script and the ground truth data. If the ground truth file doesn't exist yet, a three-item placeholder is used instead so the accuracy check can still run.

In [2]:
try:
    with open(GENERATED_TOPICS_PATH, 'r') as f:
        generated_topics = json.load(f)
    print(f"Successfully loaded {len(generated_topics)} topics from '{GENERATED_TOPICS_PATH}'")
except FileNotFoundError:
    print(f"Error: '{GENERATED_TOPICS_PATH}' not found. Please run the main script first.")
    generated_topics = []

try:
    with open(GROUND_TRUTH_PATH, 'r') as f:
        ground_truth_topics = json.load(f)
    print(f"Successfully loaded {len(ground_truth_topics)} topics from '{GROUND_TRUTH_PATH}'")
except FileNotFoundError:
    print(f"Warning: '{GROUND_TRUTH_PATH}' not found. Using placeholder data for accuracy check.")
    # This is a placeholder. You should create this file with at least 50 manually verified topics.
    ground_truth_topics = [
        {"topic": "Appearances", "page_start": 3, "line_start": 1},
        {"topic": "Scope of Testimony", "page_start": 9, "line_start": 2},
        {"topic": "Student Loan Debt Cancellation", "page_start": 13, "line_start": 2}
    ]

Successfully loaded 25 topics from 'topics.json'
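
Both files are expected to hold a list of objects with `topic`, `page_start`, and `line_start` fields, as in the placeholder above. Since the ground truth file is written by hand, a quick schema check before scoring may catch typos early. The following is a minimal sketch; `find_invalid_entries` and `REQUIRED_FIELDS` are hypothetical names, not part of `DepoIndex`, and the `sample` list is deliberately broken for illustration.

```python
# Hypothetical helper: checks hand-written entries against the schema
# used by the placeholder data (topic: str, page_start: int, line_start: int).
REQUIRED_FIELDS = {"topic": str, "page_start": int, "line_start": int}

def find_invalid_entries(entries):
    """Return (index, reason) pairs for entries that do not fit the schema."""
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append((i, "entry is not an object"))
            continue
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in entry:
                problems.append((i, f"missing '{field}'"))
            elif isinstance(entry[field], bool) or not isinstance(entry[field], expected_type):
                # bool is excluded explicitly because it is a subclass of int
                problems.append((i, f"'{field}' should be {expected_type.__name__}"))
    return problems

# Illustrative data only: the second and third entries contain mistakes.
sample = [
    {"topic": "Appearances", "page_start": 3, "line_start": 1},
    {"topic": "Scope of Testimony", "page_start": "9", "line_start": 2},
    {"topic": "Student Loan Debt Cancellation", "page_start": 13},
]
for index, reason in find_invalid_entries(sample):
    print(f"Entry {index}: {reason}")
```

Running a check like this on `ground_truth_topics` right after loading would surface a quoted page number or a missing field before it causes a `KeyError` or `TypeError` in the accuracy cell.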


## 2. Qualitative Analysis: Chain-of-Thought Validation

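Here, we randomly select 10 topics (or all of them, if fewer than 10 were generated) and ask the LLM to justify each one, using the full text of the page where the topic starts as context.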

In [3]:
def get_page_text(pdf_path, page_number):
    """Extracts the full text from a specific page of the PDF."""
    try:
        reader = PdfReader(pdf_path)
        # Page numbers in pypdf are 0-indexed
        page = reader.pages[page_number - 1]
        return page.extract_text()
    except Exception as e:
        return f"Error extracting text: {e}"

def get_cot_reasoning(topic_data, page_text):
    """Prompts the LLM for a Chain-of-Thought explanation."""
    model = genai.GenerativeModel("gemini-2.0-flash")
    
    prompt = f"""
    An AI model identified the following topic:
    - Topic: "{topic_data['topic']}"
    - Location: Page {topic_data['page_start']}, Line {topic_data['line_start']}

    Here is the full text of that page:
    ---
    {page_text}
    ---

    Please act as an expert analyst. Explain step-by-step why this topic identification is correct.
    Point to specific keywords or phrases in the text that support this conclusion.
    Provide your answer in clear, concise Markdown.
    """
    
    try:
        response = model.generate_content(prompt)
        return response.text
    except Exception as e:
        return f"Error generating reasoning: {e}"

# Select 10 random samples for validation
if len(generated_topics) >= 10:
    validation_samples = random.sample(generated_topics, 10)
else:
    validation_samples = generated_topics # Use all if less than 10

print(f"Performing Chain-of-Thought validation on {len(validation_samples)} samples...\n")

for i, sample in enumerate(validation_samples):
    page_num = sample['page_start']
    
    display(Markdown(f"### Validation Sample {i+1}/{len(validation_samples)}"))
    display(Markdown(
        f"**Topic:** `{sample['topic']}`\n\n"
        f"**Location:** Page `{page_num}`, Line `{sample['line_start']}`"
    ))
    
    # Get context and reasoning
    original_text = get_page_text(PDF_FILE_PATH, page_num)
    reasoning = get_cot_reasoning(sample, original_text)
    
    display(Markdown("#### LLM Reasoning:"))
    display(Markdown(reasoning))
    
    display(Markdown("---_---")) # Separator
    time.sleep(2) # To respect API rate limits

Performing Chain-of-Thought validation on 10 samples...



### Validation Sample 1/10

**Topic:** `Vervent Defendants' Involvement in ITT's Misrepresentations`

**Location:** Page `35`, Line `11`

#### LLM Reasoning:

Here's a step-by-step analysis of why the AI's topic identification is correct:

1.  **Location Confirmation:** The AI identifies the location as "Page 35, Line 11." Line 11 begins with "And with respect to misrepresenting the..." which directly leads into the discussion of ITT's misrepresentations.

2.  **Keyword Identification:** The AI has accurately identified keywords related to the topic.
    *   **"Misrepresenting"**: This word appears multiple times (lines 1, 2, 3, 11, 18, 20) establishing the central theme of misrepresentation.
    *   **"ITT Education" / "Attending ITT"**: Mentions of "ITT" and "ITT education" (lines 12, 19) clearly specify the institution in question.
    *   **"Vervent defendants"**: Appears in lines 14, 20, 22 directly linking the misrepresentations with the Vervent defendants.

3.  **Contextual Analysis:** The question posed by Mr. Purcell in lines 14-15, "are you aware of any of the Vervent defendants being involved in that practice by ITT?" directly inquires about Vervent's involvement in ITT's misrepresentations. Lines 19-21 restate this question, further solidifying the topic. The answer provided in lines 22-25, acknowledging Vervent's potential awareness of the misrepresentations, is also crucial to the topic.

4.  **Topic Summary:** The page focuses on whether Vervent defendants were involved in or aware of ITT's misrepresentations regarding the benefits of its education. This aligns perfectly with the AI's identified topic: "Vervent Defendants' Involvement in ITT's Misrepresentations."

**Conclusion:** The AI model correctly identified the topic on page 35, line 11 because the text explicitly discusses ITT's misrepresentations and directly questions the involvement of the Vervent defendants in those practices. The keywords and context strongly support this identification.


---_---

### Validation Sample 2/10

**Topic:** `2012 CFPB Civil Investigative Demand and Subsequent Complaints`

**Location:** Page `67`, Line `1`

#### LLM Reasoning:

The AI model's identification of the topic "2012 CFPB Civil Investigative Demand and Subsequent Complaints" on page 67, line 1 is accurate. Here's why:

*   **Explicit Mention of Key Elements:** The text directly mentions:
    *   "Consumer Financial Protection Bureau Civil Investigative Demand" (lines 2-3): This is the core event.
    *   "complaint" (line 16): This indicates subsequent legal actions.
    *   Lines 13-14 asks about the results of the CID
    *   Lines 16-18 states that a complaint was ultimately filed by the CFPB as a result of the Civil Investigative Demand.

*   **Direct Questioning:** The lines following the initial mention revolve around the consequences and outcomes of the Civil Investigative Demand (lines 13-14). The responses (lines 15-25) explicitly discuss complaints filed by the CFPB.

*   **Elaboration on Complaints:** The witness elaborates on two complaints, one against ITT and another related to the PEAKS Trust (lines 24-25), directly addressing the "Subsequent Complaints" part of the identified topic.

In summary, the AI correctly identified the topic because the text explicitly states the "2012 CFPB Civil Investigative Demand" and then immediately delves into the related complaints that arose from it. The direct questions and answers further solidify the connection and justify the topic identification.


---_---

### Validation Sample 3/10

**Topic:** `Expert Report Handling and Pace of Testimony`

**Location:** Page `10`, Line `1`

#### LLM Reasoning:

The AI's topic identification of "Expert Report Handling and Pace of Testimony" for page 10, line 1 is accurate because the text contains direct references and discussions related to both elements:

**Expert Report Handling:**

*   **Line 3:** "I assume that's your expert report?" - This question explicitly mentions the existence and relevance of an "expert report."
*   **Line 4:** "Yes, I have my expert report in front of me." - Confirms the expert is using their report.
*   **Line 5:** "I'm going to mark that as Exhibit 1" - This denotes the formal handling and identification of the expert report as evidence within the deposition.

**Pace of Testimony:**

*   **Line 9:** "talk a little slower, especially when you're reading." - Directly addresses the pace of the witness's testimony.
*   **Line 11-14:** "When we start reading, sometimes we read so fast it makes it hard...it's a little harder to follow when -- when you talk really fast" - Explains the reasoning for adjusting the pace of speech.
*   **Line 16-19:** "from time to time, I might ask you to slow down a bit...just to make sure the record's easy to follow." - Reinforces the importance of controlled speaking pace for clarity and accurate record-keeping.
*   **Line 21:** "I understand" - Indicates the expert's acknowledgment and agreement to adjust their speaking pace.


---_---

### Validation Sample 4/10

**Topic:** `Deposition of Persis Yu`

**Location:** Page `1`, Line `16`

#### LLM Reasoning:

Okay, here's a step-by-step analysis of why the AI model correctly identified the topic as "Deposition of Persis Yu" at Page 1, Line 16:

**Step 1: Identifying the Location**

*   The AI correctly pinpoints "Page 1, Line 16" as the focal point.

**Step 2: Analyzing the Content at the Specified Location**

*   Line 16 reads: "DEPOSITION OF PERSIS YU".

**Step 3: Breaking Down the Key Phrases**

*   **"DEPOSITION OF"**: This phrase clearly indicates the nature of the proceeding being documented. A deposition is a legal process involving sworn testimony taken out of court.
*   **"PERSIS YU"**: This is a proper noun, almost certainly identifying the individual whose deposition is being taken. Therefore, this identifies the deponent.

**Step 4: Synthesizing the Information**

*   Combining these elements, the text explicitly states that this document pertains to the deposition (legal testimony) of a person named Persis Yu.

**Conclusion:**

The AI model's identification of "Deposition of Persis Yu" as the topic at the given location is demonstrably correct. The phrase itself directly states the nature of the document and the individual involved, making it the most accurate and concise topic representation.


---_---

### Validation Sample 5/10

**Topic:** `Specific Advocacy Efforts and Policy Impacts`

**Location:** Page `14`, Line `19`

#### LLM Reasoning:

The AI model's identification of "Specific Advocacy Efforts and Policy Impacts" at Page 14, Line 19 is accurate. Here's why:

1.  **Location Confirmation:** Line 19 starts with "advocate for President Biden...". This phrase aligns directly with advocacy efforts.

2.  **Keywords indicating Advocacy:** The word "advocate" itself is a primary keyword. Further supporting this is the phrase "recommendations that I made" (line 22) also clearly refers to actions meant to influence policy.

3.  **Keywords indicating Policy Impacts:** The text explicitly states "cancel up to $20,000 in student loan debt" (line 19), followed by "He did make that policy decision" (line 20). This demonstrates a specific policy outcome resulting from advocacy. Also, the mention of "incorporated into the most recent draft" (line 23) regarding income-driven repayment highlights the impact of the speaker's recommendations on Department of Education policy.

In summary, the presence of keywords like "advocate," "recommendations," and phrases indicating policy decisions directly supports the topic identification of "Specific Advocacy Efforts and Policy Impacts." The text not only describes the advocacy actions but also points to tangible policy changes resulting from those actions.


---_---

### Validation Sample 6/10

**Topic:** `Loan Disclosure Timeframes`

**Location:** Page `59`, Line `14`

#### LLM Reasoning:

The AI's identification of the topic "Loan Disclosure Timeframes" on Page 59, Line 14 is accurate. Here's a step-by-step breakdown:

1. **Initial Question:** Line 14 starts with "And is there some sort of grace period for how long they get to receive the disclosure?" This question explicitly introduces the concept of a time limit or "grace period" associated with receiving a loan disclosure.

2. **Direct Inquiry about Timing:** Line 15 continues the question: "long they get to receive the disclosure?" This reinforces the focus on the timeframe within which the disclosure must be received. The question isn't just about *whether* a disclosure is required, but *when* it's required.

3. **Witness Confirmation and Vagueness:** Line 16 begins the answer: "There's -- I believe that there is a time period..." This confirms the existence of a time period, validating the question's premise.

4. **Further Elaboration on Duration:** Line 18-19 expands on the previous confirmation: "...of how much time they are -- that is supposed to last between the various disclosures." The phrase "how much time" directly addresses the *duration* of the timeframe related to disclosures.

5. **Attempt to Quantify the Timeframe:** Lines 20-21 ("Is it -- can you give me a range? Is it three weeks or three months? Or do you know?") directly ask for a specific range of time, indicating the core interest is pinpointing the exact or approximate duration of the disclosure timeframe.

In conclusion, the conversation following Line 14 explicitly revolves around the timeframe for receiving loan disclosures, making "Loan Disclosure Timeframes" a perfectly accurate topic identification. The phrases "grace period," "how long they get to receive the disclosure," "time period," "how much time," "three weeks," and "three months" all support this conclusion.


---_---

### Validation Sample 7/10

**Topic:** `Timeliness and Validity of Final Disclosures`

**Location:** Page `56`, Line `6`

#### LLM Reasoning:

The AI model's topic identification of "Timeliness and Validity of Final Disclosures" at Page 56, Line 6 is accurate. Here's a step-by-step explanation:

1.  **Identifying the Core Elements:** The topic contains two key elements:
    *   "Final Disclosures": This refers to the specific documents or information that are the subject of the discussion.
    *   "Timeliness and Validity": This implies the discussion centers on whether those "Final Disclosures" were delivered promptly and whether their absence impacts the legality or enforceability of something (in this case, loans).

2.  **Locating Supporting Phrases:**
    *   **Line 6:**  "The final disclosures." This immediately introduces the central subject matter.
    *   **Line 7-8:** "...the final disclosures were sent to the PEAKS borrowers?" This directly investigates whether the disclosures were even *delivered* to the borrowers, addressing a preliminary element of timeliness and subsequent validity.
    *   **Line 12-15:** "...if these disclosures were not made, the loans would be invalid and the borrowers would have the right to cancel the loans; is that correct?"  This clearly links the *absence* of the disclosures (lack of timeliness/ delivery) with the *validity* of the loans, specifically introducing the concept of cancellation rights.
    *   **Line 19-21:** "How long after a borrower has failed to receive the final disclosures would they have the right to cancel the loan?" This question explicitly addresses both the "timeliness" (how long after *failing* to receive) and the consequence in terms of "right to cancel" (relating to the *validity* of the loan agreement).
    *   **Line 22-25:** "But if a -- if a borrower never receives the final disclosure -- so the -- the right to cancel the loan begins when the consumer receives the final disclosure.  And so if the borrower never receives..." This further confirms the link between *non-receipt* of the disclosures and the *right to cancel*. The timing of the cancellation right being explicitly tied to receipt.

3.  **Connecting Elements to Topic:**  The questions and answers revolve around whether the borrowers received the "final disclosures," and *when* (or if) they received them. Furthermore, it establishes a direct causal link between the *absence* of those disclosures and the *invalidity* of the loan, evidenced by the borrowers' "right to cancel." This demonstrates that the discussion centrally concerns the timeliness of the disclosures and the legal validity consequences stemming from their (potential) absence.

Therefore, the AI model is correct in identifying the topic as "Timeliness and Validity of Final Disclosures" at Page 56, Line 6.


---_---

### Validation Sample 8/10

**Topic:** `ITT Student Graduation Statistics and Senate HELP Committee Report`

**Location:** Page `30`, Line `1`

#### LLM Reasoning:

The AI's topic identification is correct because multiple textual elements on page 30 directly relate to "ITT Student Graduation Statistics and Senate HELP Committee Report." Here's a breakdown:

*   **"ITT Students" and "Degrees":** Lines 11-13 explicitly mention "ITT students" and the percentage that ended up "getting degrees," which directly relates to student graduation statistics. This frames the initial question as being about the success rate of ITT students.

*   **"Senate HELP Committee":** Lines 16, 21-22, and 24 mention the "Senate HELP Committee," which stands for "Health, Education, Labor, and Pension Committee." This indicates the involvement of a specific senate committee.

*   **"Report":** Lines 14, 24-25 refer to a "report." Line 16 specifically mentions that the "Senate HELP Committee looked specifically at ITT" and "looked at the retention rates at ITT." This strongly implies that the report being discussed contains information about ITT's retention rates, which are directly related to graduation statistics.

*   **"Retention Rates":** Line 17 mentions that the Senate HELP Committee looked at "retention rates at ITT." This confirms that the Committee looked into how many students stayed in the program, which is related to graduation rates.

*   **Date Context:** Line 25 states, "...released a report in 2012...". While this date isn't directly part of the topic, it gives important context to the report being discussed, clarifying the specific timeframe.

In summary, the presence of keywords like "ITT students," "degrees," "Senate HELP Committee," "report," and "retention rates" all directly support the AI's identification of the topic as being related to ITT student graduation statistics and a Senate HELP Committee report. The conversation focuses on data relevant to the success of ITT students, as analyzed and presented in the specified report.


---_---

### Validation Sample 9/10

**Topic:** `Consequences of Loan Cancellation`

**Location:** Page `57`, Line `24`

#### LLM Reasoning:

Okay, let's analyze why the AI model's identification of the topic "Consequences of Loan Cancellation" at Page 57, Line 24 is correct.

Here's a breakdown:

1. **Explicit Mention of "Cancel the Loan":**  The core phrase "cancel the loan" (lines 5, 10, 16, 22) is repeated multiple times within the dialogue. This immediately establishes the central theme of the conversation as being related to loan cancellation.

2. **Question about the *Mechanism* of Cancellation:** The questioner is explicitly asking *what happens* when a loan is cancelled (lines 5-10, 16-22).  This implies a focus on the *effects* or *consequences* of the action.

3. **Specific Consequences Being Explored:**  The bulk of the text focuses on describing the specific consequences that would result from cancelling the loan.  The questioner proposes a scenario:
    *   Loan is canceled
    *   Funds are returned to the lender (from student/school)
    *   Obligations under the loan agreement cease to exist.

4.  **"Faulty Loan":** Line 24 directly introduces the idea of a "faulty loan" due to disclosure failures, leading to potential cancellation and related consequences.

5. **Confirmation:** The answer in line 23 "That sounds right." confirms the stated consequences.

**Conclusion:**

The AI model is correct in identifying "Consequences of Loan Cancellation" as the topic. The page is primarily concerned with defining and understanding the ramifications of cancelling a loan due to a failure to provide required disclosures. The conversation explicitly asks and attempts to confirm the sequential events (consequences) that would occur. The mention of a "faulty loan" reinforces the relationship between disclosure failures and potential cancellation/consequences.


---_---

### Validation Sample 10/10

**Topic:** `Witness's Professional Background and Experience with Criminal Law`

**Location:** Page `24`, Line `14`

#### LLM Reasoning:

The AI model's identification of the topic "Witness's Professional Background and Experience with Criminal Law" at Page 24, Line 14 is accurate and well-supported by the text. Here's a breakdown:

1.  **Line 14 initiates the line of questioning:** The question, "Have you ever had a job related to criminal law?" (Line 14-15), explicitly introduces the topic of the witness's experience within the realm of criminal law.

2.  **Directly Addresses Criminal Law:** The phrasing uses the key term "criminal law" directly, establishing this as the subject matter of the interrogation. The questions that follow (lines 18, 22, 23) all serve to probe the *extent* and *nature* of this experience.

3.  **Focus on Experience and Professional Background:** The subsequent questions and answers (Lines 16-25) delve into the witness's past positions and activities, directly assessing their professional background as it pertains to criminal law. These activities include:
    *   Internship at the District Attorney's Office (Line 16-17)
    *   Work with survivors of domestic violence (Line 20-21)
    *   Volunteering at a teen court (Line 24-25)

These specific examples demonstrate different aspects of experience relevant to the identified topic. The AI correctly identifies the start of this line of questioning and understands that the overall aim is to understand the experience and background of the witness as it relates to Criminal Law.


---_---
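
The validation loop above waits a fixed two seconds between requests. If a request is still rejected, one common alternative is to retry it with exponentially growing delays. The sketch below shows the idea with a generic wrapper; `call_with_backoff` is a hypothetical helper, and it catches every `Exception` only to stay library-agnostic. To use it here, it would have to wrap the `model.generate_content` call directly, because `get_cot_reasoning` already catches exceptions and returns them as text.

```python
import time

def call_with_backoff(func, *args, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 2**attempt seconds and retry.

    The last failure is re-raised once max_attempts calls have been made.
    The sleep parameter exists so the waiting can be replaced in tests.
    """
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception:  # assumption: real code would catch only rate-limit errors
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Demonstration with a stand-in function that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

waits = []  # records the requested delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

In the demonstration the delays are recorded rather than slept, which makes the schedule (2 seconds, then 4 seconds) easy to inspect.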

## 3. Quantitative Analysis: Accuracy Check

Now, we'll compare the generated topics against our manually created ground truth set to calculate an accuracy score. The score is the percentage of ground truth topics that have a matching generated topic.

A generated topic counts as a "match" for a ground truth topic if all of the following hold:
1. The page number is identical.
2. The line number is within a small tolerance (±3 lines by default).
3. The topic name is reasonably similar (at least one word from the ground truth name appears in the generated topic name).
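
The third criterion can be implemented in more than one way. A plain substring test is lenient: a short word such as "of" is found inside "Professional", so unrelated names can pass. The sketch below is one possible stricter variant, not the check used in this notebook. It compares whole words after dropping a small stop-word list and falls back to `difflib.SequenceMatcher` from the standard library. The names `content_words` and `names_similar`, the stop-word list, and the 0.6 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Assumed stop-word list; extend as needed for real topic names.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "with"}

def content_words(name):
    """Lowercase the name, strip surrounding punctuation, drop stop words."""
    words = (w.strip(".,:;'\"()") for w in name.lower().split())
    return {w for w in words if w and w not in STOP_WORDS}

def names_similar(generated_name, truth_name, threshold=0.6):
    """True if the names share a content word or are close as whole strings."""
    if content_words(generated_name) & content_words(truth_name):
        return True
    ratio = SequenceMatcher(None, generated_name.lower(), truth_name.lower()).ratio()
    return ratio >= threshold

# "Testimony" is shared, so these names are treated as similar.
print(names_similar("Expert Report Handling and Pace of Testimony", "Scope of Testimony"))
# No shared content word and little character overlap.
print(names_similar("Loan Disclosure Timeframes", "Scope of Testimony"))
# Hypothetical generated name: no exact shared word, but the strings are close.
print(names_similar("Appearance of Counsel", "Appearances"))
```

The string-ratio fallback lets near-identical wordings ("Appearance" versus "Appearances") still count, while the stop-word filter keeps filler words from producing matches on their own.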

In [4]:
def calculate_accuracy(generated, ground_truth, line_tolerance=3):
    """Returns the percentage of ground truth topics matched by a generated topic."""
    correct_matches = 0
    
    # Use a copy of the ground truth to avoid matching the same item twice
    gt_copy = list(ground_truth)
    
    for gen_topic in generated:
        best_match = None
        for gt_topic in gt_copy:
            # Check for page and line number proximity
            if gen_topic['page_start'] == gt_topic['page_start'] and \
               abs(gen_topic['line_start'] - gt_topic['line_start']) <= line_tolerance:
                
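                # Simple check for topic name similarity: true if any ground truth
                # word occurs as a substring of the generated name, so short words
                # such as "of" match easily. Fuzzy string matching would be stricter.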
                gt_keywords = gt_topic['topic'].lower().split()
                if any(keyword in gen_topic['topic'].lower() for keyword in gt_keywords):
                    best_match = gt_topic
                    break # Found a good match
        
        if best_match:
            correct_matches += 1
            gt_copy.remove(best_match) # Remove from list to prevent re-matching
            
    total_gt = len(ground_truth)
    if total_gt == 0:
        return 0.0
    
    accuracy = (correct_matches / total_gt) * 100
    return accuracy

if ground_truth_topics:
    accuracy_score = calculate_accuracy(generated_topics, ground_truth_topics)
    
    display(Markdown(f"## Accuracy Score: `{accuracy_score:.2f}%`"))
    
    if accuracy_score >= 95:
        display(Markdown("**Result: The model meets the target accuracy of ≥ 95%.**"))
    else:
        display(Markdown("**Result: The model does not currently meet the target accuracy.**"))
else:
    display(Markdown("Could not calculate accuracy because no ground truth data was loaded."))

## Accuracy Score: `66.67%`

**Result: The model does not currently meet the target accuracy.**
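
Because `calculate_accuracy` divides the number of matches by the size of the ground truth set, the score says how many verified topics were found but nothing about generated topics that have no ground truth counterpart. Reporting the two directions separately can make this visible. The sketch below uses a hypothetical `precision_recall` helper, and the counts passed to it are illustrative rather than taken from this run.

```python
def precision_recall(num_matches, num_generated, num_ground_truth):
    """Return (precision, recall) as percentages; 0.0 when a denominator is zero.

    precision: share of generated topics that matched a ground truth topic
    recall:    share of ground truth topics that were matched (the quantity
               calculate_accuracy reports)
    """
    precision = 100 * num_matches / num_generated if num_generated else 0.0
    recall = 100 * num_matches / num_ground_truth if num_ground_truth else 0.0
    return precision, recall

# Illustrative counts only: 2 matches, 25 generated topics, 3 ground truth topics.
precision, recall = precision_recall(2, 25, 3)
print(f"precision={precision:.2f}%  recall={recall:.2f}%")
```

Precision is only meaningful when the ground truth covers the whole transcript. With a small sample of verified topics, most generated topics will have nothing to match against, and a low precision then reflects the sample size rather than the tool.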

## 4. Conclusion

This notebook validated the `DepoIndex` tool's output in two ways. In the Chain-of-Thought analysis, the LLM's explanation for each of the 10 sampled topics cited specific lines and phrases from the page text. In the quantitative check, the tool scored 66.67% against the ground truth set, which is below the project's target of ≥ 95%.