# Tales RAG and PowerPoint Agent Evaluation

This notebook demonstrates how to evaluate the RAG agent and PowerPoint generation capabilities of the Tales system.

In [None]:
from tales.evaluation import RAGEvaluator
from tales.db_handler import ChromaDBHandler
from tales.config import DB_PATH

## Check Available Documents

First, let's check what documents are available in our vector store.

In [2]:
# Initialize the ChromaDB handler
db_handler = ChromaDBHandler(persist_directory=DB_PATH)

# Get stored documents
stored_docs = db_handler.get_stored_documents()
print(f"Found {len(stored_docs)} documents in the vector store:")
for doc in stored_docs:
    print(f" - {doc}")

Database Initialized Successfully...
Found 2 documents in the vector store:
 - data/bugra.pdf
 - data/How Much Information.pdf


## Single Query Evaluation

Let's evaluate the RAG agent on a single query.

In [None]:
# Initialize the RAG evaluator
rag_evaluator = RAGEvaluator()

# Define a query to evaluate
query = "What are the 3 hypothesis mentioned?"

# Run the evaluation
rag_metrics, messages = rag_evaluator.evaluate_rag_query(query)

# Print the metrics
print("=== RAG Evaluation Results ===")
print(f"Response Time: {rag_metrics.response_time:.2f} seconds")
print(f"Context Relevance: {rag_metrics.context_relevance_score:.1f}/10")
print(f"Answer Correctness: {rag_metrics.answer_correctness_score:.1f}/10")
print(f"Answer Completeness: {rag_metrics.answer_completeness_score:.1f}/10")
print(f"Hallucination Score: {rag_metrics.hallucination_score:.1f}/10 (lower is better)")
print(f"Documents Retrieved: {rag_metrics.num_documents_retrieved}")

Database Initialized Successfully...
Analyzing query...
Retrieving documents...
Retrieving documents...
Generating answer...
Generating answer...
Reflecting on answer...
Reflecting on answer...


Gemini produced an empty response. Continuing with empty message
Feedback: 


Query:  three hypotheses
Response:  Based on the context provided, the three hypotheses mentioned are:

1.  **H1:** Trust is lower if expectations are violated.
2.  **H2:** Changes in interface transparency affect trust depending on whether expectations are violated.
3.  **H3:** If expectations are violated, procedural transparency increases trust, but additional information about outcomes erodes this trust.
Context:  ['present study addresses a number of these shortcomings by\ntesting the effects of transparency in a natural and high-stakes\nenvironment. Additionally, the current experiment compares\nbetween three levels of transparency (low, medium, and high)\nand evaluates the moderating role of expectation violation, the\nextent to which the system output matches user expectations.\nA MOTIVATING ANECDOTE\nA true story inspired this research and informed the study de-\nsign and hypotheses. In a large, in-person HCI class, some\nstudents noticed that they received lower homework grad

AttributeError: 'RAGMetrics' object has no attribute 'num_research_iterations'
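
The `AttributeError` above suggests that some code path still reads a `num_research_iterations` field that `RAGMetrics` does not define. One way to keep reporting code from drifting out of sync with the metrics class is to iterate over whatever fields the object actually exposes. The sketch below assumes `RAGMetrics` is a dataclass, which this notebook does not confirm; `format_metrics` and `DemoMetrics` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, fields, is_dataclass


def format_metrics(metrics) -> list:
    """Return one 'name: value' line per dataclass field, without hardcoding names."""
    if not is_dataclass(metrics):
        raise TypeError("format_metrics expects a dataclass instance")
    lines = []
    for field in fields(metrics):
        value = getattr(metrics, field.name)
        # Floats are shown with two decimals; everything else is printed as-is.
        text = f"{value:.2f}" if isinstance(value, float) else str(value)
        lines.append(f"{field.name}: {text}")
    return lines


# Hypothetical stand-in for RAGMetrics, used only to make the sketch runnable.
@dataclass
class DemoMetrics:
    response_time: float
    num_documents_retrieved: int


for line in format_metrics(DemoMetrics(response_time=1.234, num_documents_retrieved=4)):
    print(line)
```

Because the field names come from the object itself, renaming or removing a metric cannot raise an `AttributeError` in the reporting step.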

## View the RAG Response

Let's look at the response the RAG agent provided.

In [4]:
# Print the last message (the response)
from IPython.display import Markdown

response = next((msg.content for msg in reversed(messages) if hasattr(msg, 'content')), "No response found")
display(Markdown(f"**Query:** {query}\n\n**Response:**\n{response}"))

**Query:** What are the 3 hypothesis mentioned?

**Response:**
The three hypotheses mentioned in the text are:

*   **H1:** Trust is lower if expectations are violated.
*   **H2:** Changes in interface transparency affect trust depending on whether expectations are violated.
*   **H3:** If expectations are violated, procedural transparency increases trust, but additional information about outcomes erodes this trust.

## PowerPoint Generation Evaluation

Now, let's evaluate the PowerPoint generation capabilities.

In [None]:
import time
from pathlib import Path
from dataclasses import dataclass, asdict
from pptx import Presentation

# Create PowerPointEvaluator class since it doesn't exist in the evaluation module
class PowerPointEvaluator:
    """Evaluator for PowerPoint generation."""
    
    def __init__(self):
        """Initialize PowerPoint evaluator."""
        # Initialize evaluation LLM (same as RAG evaluator)
        from tales.config import llm
        self.eval_llm = llm
    
    def evaluate_ppt_generation(self, query: str, context_docs=None):
        """Generate and evaluate a PowerPoint presentation.
        
        Args:
            query: User query/topic for presentation
            context_docs: Optional context documents
            
        Returns:
            PPTMetrics object with evaluation scores
        """
        from tales.ppt_agent import ppt_agent
        from langchain_core.messages import HumanMessage
        import asyncio
        
        # Prepare context
        context = ""
        if context_docs:
            for doc in context_docs:
                context += doc.page_content + "\n\n"
        
        # Create message with query and context
        message = HumanMessage(content=f"Create a presentation about: {query}\n\nContext: {context}")
        
        # Start timing
        start_time = time.time()
        
        # Run the PowerPoint agent
        asyncio.run(ppt_agent([message]))
        
        # Calculate time
        end_time = time.time()
        generation_time = end_time - start_time
        
        # Load the generated presentation for evaluation
        pres_path = Path("C:/Users/lukas/Documents/Projects/tales/presentation.pptx")
        if pres_path.exists():
            pres = Presentation(pres_path)
            
            # Count slides
            slides_count = len(pres.slides)
            
            # Calculate average content per slide
            total_content = 0
            for slide in pres.slides:
                for shape in slide.shapes:
                    if hasattr(shape, "text"):
                        total_content += len(shape.text)
            
            avg_content = total_content / max(slides_count, 1)
            
            # Evaluate the presentation
            content_coverage = self._evaluate_content_coverage(query, pres_path)
            design_quality = self._evaluate_design_quality(pres_path)
            organization = self._evaluate_organization(pres_path)
            
            return PPTMetrics(
                generation_time=generation_time,
                slides_count=slides_count,
                avg_content_per_slide=avg_content,
                content_coverage_score=content_coverage,
                design_quality_score=design_quality,
                organization_score=organization
            )
        else:
            # Return default metrics if presentation doesn't exist
            return PPTMetrics(
                generation_time=generation_time,
                slides_count=0,
                avg_content_per_slide=0.0,
                content_coverage_score=0.0,
                design_quality_score=0.0,
                organization_score=0.0
            )
    
    def _evaluate_content_coverage(self, query: str, pres_path: Path) -> float:
        """Evaluate how well the presentation covers the query topic.
        
        Args:
            query: The original query/topic
            pres_path: Path to the presentation file
            
        Returns:
            Score from 0-10 on content coverage
        """
        from langchain_core.messages import HumanMessage
        
        # Extract all text from the presentation
        pres = Presentation(pres_path)
        all_text = ""
        
        for slide in pres.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    all_text += shape.text + "\n\n"
        
        # Build the evaluation prompt; slide text is truncated to 3000 characters
        eval_prompt = [
            HumanMessage(content=f"""You are an expert evaluator for presentation content.

For the following presentation topic, evaluate how well the presentation content covers the topic on a scale from 0 to 10, where:
- 0: Does not cover the topic at all
- 5: Partially covers the topic but misses important aspects
- 10: Comprehensively covers the topic

Topic: "{query}"

Presentation Content:
{all_text[:3000]}

Provide your rating as a single number between 0-10 without explanation or other text.
""")
        ]
        
        try:
            response = self.eval_llm.invoke(eval_prompt).content
            # Parse the reply as a number and clamp it to the 0-10 range
            score = float(response.strip())
            return min(max(score, 0.0), 10.0)
        except Exception:
            # Fall back to a neutral score if the LLM call fails or the reply is not a number
            return 5.0
    
    def _evaluate_design_quality(self, pres_path: Path) -> float:
        """Evaluate the design quality of the presentation.
        
        Args:
            pres_path: Path to the presentation file
            
        Returns:
            Score from 0-10 on design quality
        """
        # For simplicity, we'll return a default score
        # In a real implementation, this would analyze layouts, colors, etc.
        return 7.5  # Default reasonable score
    
    def _evaluate_organization(self, pres_path: Path) -> float:
        """Evaluate the organization of the presentation.
        
        Args:
            pres_path: Path to the presentation file
            
        Returns:
            Score from 0-10 on organization
        """
        # For simplicity, we'll return a default score
        # In a real implementation, this would analyze structure, flow, etc.
        return 7.0  # Default reasonable score

# Import the PPTMetrics dataclass from evaluation module
from tales.evaluation import PPTMetrics
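
`_evaluate_content_coverage` calls `float()` on the raw model reply, so any extra text in the reply (for example `Score: 8/10`) falls through to the default of 5.0. A more tolerant parser might look like the following sketch. `parse_score` is a hypothetical helper, not part of `tales`, and it simply takes the first number it finds, so a reply that restates the scale before the rating would be misread.

```python
import re


def parse_score(response: str, default: float = 5.0) -> float:
    """Extract the first number from an LLM reply and clamp it to the 0-10 range."""
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    if match is None:
        # No number at all: fall back to the neutral default.
        return default
    return min(max(float(match.group()), 0.0), 10.0)


print(parse_score("8"))
print(parse_score("Score: 7.5/10"))
print(parse_score("no rating given"))
```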

## Batch Evaluation

Let's run a batch evaluation on multiple queries.
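
A minimal sketch of such a batch loop follows. It assumes both metrics objects are dataclasses so that `asdict` can flatten them; the function name `run_batch_evaluation` and the `StubMetrics` class are hypothetical and exist only so the example runs without the `tales` package.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List


def run_batch_evaluation(
    queries: List[str],
    evaluate_rag: Callable[[str], Any],
    evaluate_ppt: Callable[[str], Any],
) -> List[Dict[str, Any]]:
    """Evaluate each query and collect the metrics as plain dictionaries."""
    results = []
    for query in queries:
        results.append({
            "query": query,
            # asdict() converts a dataclass instance into a dict of its fields.
            "rag_metrics": asdict(evaluate_rag(query)),
            "ppt_metrics": asdict(evaluate_ppt(query)),
        })
    return results


# Hypothetical stand-in metrics so the sketch runs on its own.
@dataclass
class StubMetrics:
    score: float


results = run_batch_evaluation(
    ["first query", "second query"],
    evaluate_rag=lambda q: StubMetrics(score=float(len(q))),
    evaluate_ppt=lambda q: StubMetrics(score=0.0),
)
print(results[0])
```

In this notebook, `evaluate_rag` could be `lambda q: rag_evaluator.evaluate_rag_query(q)[0]`, since that call returns a `(metrics, messages)` pair, and `evaluate_ppt` could be `ppt_evaluator.evaluate_ppt_generation`. Each record then has the `query`, `rag_metrics`, and `ppt_metrics` keys that the visualization cell reads.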

## Visualization of Results

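Let's create a simple visualization of the evaluation results. The cell below expects `results` to be a list of dictionaries, one per query, each with `query`, `rag_metrics`, and `ppt_metrics` keys.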

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Create a DataFrame from the results
df_rag = pd.DataFrame([
    {
        'Query': r['query'],
        'Context Relevance': r['rag_metrics']['context_relevance_score'],
        'Answer Correctness': r['rag_metrics']['answer_correctness_score'],
        'Answer Completeness': r['rag_metrics']['answer_completeness_score'],
        'Hallucination Score': r['rag_metrics']['hallucination_score']
    } for r in results
])

df_ppt = pd.DataFrame([
    {
        'Query': r['query'],
        'Slides': r['ppt_metrics']['slides_count'],
        'Content Coverage': r['ppt_metrics']['content_coverage_score'],
        'Design Quality': r['ppt_metrics']['design_quality_score'],
        'Organization': r['ppt_metrics']['organization_score']
    } for r in results
])

# Plot RAG metrics
plt.figure(figsize=(12, 6))
x = np.arange(len(df_rag))
width = 0.2

plt.bar(x - 1.5*width, df_rag['Context Relevance'], width, label='Context Relevance')
plt.bar(x - 0.5*width, df_rag['Answer Correctness'], width, label='Answer Correctness')
plt.bar(x + 0.5*width, df_rag['Answer Completeness'], width, label='Answer Completeness')
plt.bar(x + 1.5*width, df_rag['Hallucination Score'], width, label='Hallucination Score')

plt.xlabel('Queries')
plt.ylabel('Score (0-10)')
plt.title('RAG Agent Evaluation Metrics')
plt.xticks(x, [f"Query {i+1}" for i in range(len(df_rag))])
plt.legend()
plt.ylim(0, 10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Plot PPT metrics
plt.figure(figsize=(12, 6))
x = np.arange(len(df_ppt))
width = 0.2

plt.bar(x - width, df_ppt['Content Coverage'], width, label='Content Coverage')
plt.bar(x, df_ppt['Design Quality'], width, label='Design Quality')
plt.bar(x + width, df_ppt['Organization'], width, label='Organization')

plt.xlabel('Queries')
plt.ylabel('Score (0-10)')
plt.title('PowerPoint Generation Evaluation Metrics')
plt.xticks(x, [f"Query {i+1}" for i in range(len(df_ppt))])
plt.legend()
plt.ylim(0, 10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

In [None]:
# Initialize the PowerPoint evaluator
ppt_evaluator = PowerPointEvaluator()

# Define a query to evaluate
ppt_query = "What are the main hypotheses in How Much Information?"

# Get context for the presentation
context_docs = db_handler.get_documents_from_query(ppt_query, k=2)

# Run the evaluation
print("Generating PowerPoint presentation. This may take a minute...")
ppt_metrics = ppt_evaluator.evaluate_ppt_generation(ppt_query, context_docs)

# Print the metrics
print("\n=== PowerPoint Evaluation Results ===")
print(f"Generation Time: {ppt_metrics.generation_time:.2f} seconds")
print(f"Slides Count: {ppt_metrics.slides_count}")
print(f"Avg. Content per Slide: {ppt_metrics.avg_content_per_slide:.1f} characters")
print(f"Content Coverage: {ppt_metrics.content_coverage_score:.1f}/10")
print(f"Design Quality: {ppt_metrics.design_quality_score:.1f}/10")
print(f"Organization: {ppt_metrics.organization_score:.1f}/10")

# Report the presentation file's size and location, if it was generated
from pathlib import Path

pres_path = Path("C:/Users/lukas/Documents/Projects/tales/presentation.pptx")
if pres_path.exists():
    file_size = pres_path.stat().st_size / 1024  # KB
    print(f"\nPresentation file size: {file_size:.1f} KB")
    print(f"Presentation saved at: {pres_path.as_posix()}")

In [None]:
# Visualize PowerPoint metrics
import matplotlib.pyplot as plt
import numpy as np

# Create a simple bar chart for PowerPoint metrics
metrics = [
    ppt_metrics.content_coverage_score,
    ppt_metrics.design_quality_score,
    ppt_metrics.organization_score
]
metric_labels = ['Content Coverage', 'Design Quality', 'Organization']

plt.figure(figsize=(10, 6))
bars = plt.bar(metric_labels, metrics, color=['#3498db', '#2ecc71', '#e74c3c'])

# Add value labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
            f'{height:.1f}',
            ha='center', va='bottom', fontweight='bold')

plt.ylim(0, 10)  # Set y-axis from 0 to 10
plt.axhline(y=5, color='gray', linestyle='--', alpha=0.5)  # Add a reference line at 5
plt.title('PowerPoint Generation Evaluation Metrics', fontsize=16)
plt.ylabel('Score (0-10)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.3)

# Add a note about slide count
plt.figtext(0.5, 0.01, f'Presentation contains {ppt_metrics.slides_count} slides | Generated in {ppt_metrics.generation_time:.1f} seconds',
          ha='center', fontsize=10, style='italic')

plt.tight_layout()
plt.show()

## Conclusion

This notebook demonstrated how to evaluate both:

1. RAG (Retrieval-Augmented Generation) capabilities - evaluating the quality of information retrieval and response generation
2. PowerPoint generation capabilities - evaluating the quality of presentations generated from the same information

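These evaluations show how well the system delivers information both as text and as slides. Note that the design quality and organization scores are fixed placeholder values (7.5 and 7.0), so only the content coverage score reflects an actual LLM judgment.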

To extend this evaluation:
- Try different types of queries
- Test with different documents in the vector store
- Compare different PowerPoint styling approaches
- Implement batch evaluation across multiple queries
- Save evaluation results to track improvements over time
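
For the last item, one lightweight option is to append each run to a JSON Lines file and read the history back later. This is a sketch under assumptions rather than part of `tales`: `append_result`, `load_results`, the file name, and the demo values are all hypothetical.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_result(log_path: Path, record: dict) -> None:
    """Append one evaluation record to a JSON Lines file, stamped with the current UTC time."""
    entry = {"timestamp": datetime.now(timezone.utc).isoformat(), **record}
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def load_results(log_path: Path) -> list:
    """Read every record back from the JSON Lines file."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demonstration with a temporary file and made-up values.
demo_path = Path(tempfile.mkdtemp()) / "eval_log.jsonl"
append_result(demo_path, {"query": "example", "content_coverage_score": 6.0})
append_result(demo_path, {"query": "example", "content_coverage_score": 7.0})
history = load_results(demo_path)
print(len(history))
```

An append-only log keeps earlier runs intact, so scores for the same query can be compared across changes to the agent.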