# SQL for AI Projects

## Introduction

**Generative AI**

In this Jupyter notebook, we'll quickly set up the DuckDB database, get familiar with the Google Colab environment, and then dive into the GenAI exercises for the SQL for AI Projects course.

### Practice Exercises

1. Perform GenAI metrics analysis to assess AI agent performance
2. Implement statistical adjustments for multi-variate A/B testing

### Database Setup

First things first: let's load our Python libraries and set up access to our database.

Don't worry if you're not familiar with Python. We only need to run the very first cell to initialize our SQL instance, and there will be clear instructions whenever there are non-SQL components.


### Getting Started

To execute a cell in this notebook, click the play button to the left of that cell. To run everything at once, click the `Run all` button at the top of the notebook, just below the menu toolbar.

The cell below downloads a DuckDB database file and connects to it within this notebook's temporary environment.

The same cell also produces a few outputs, including:

* A count of the rows in each table in the database.
* An interactive entity relationship diagram for our database. This will help us visualize all of the database tables and their relevant primary and foreign keys.

In [None]:
# Initial setup steps
# ====================

# These pip install commands are required for Google Colab notebook environment
!pip install --upgrade --quiet duckdb==1.3.1
!pip install --quiet duckdb-engine==0.17.0
!pip install --quiet jupysql==0.11.1

# We also need to set up Git LFS for large file downloads
# This helps us to download large files stored on GitHub
!apt-get install git-lfs -y
!git lfs install

# Clone GitHub repo into a "data" folder
!git clone https://github.com/LinkedInLearning/real-world-data-and-AI-challenges-with-SQL-3813163.git data

# Change directory into "data" so that Git LFS can download the database object
%cd data
!git lfs pull

# Then we need to change directory back up so all our paths are correct!
%cd ..

# Time to import all our Python packages
import duckdb
import textwrap
import pandas as pd
from IPython.display import HTML, display

# Load the jupysql extension to enable us to run SQL code in code cells
%load_ext sql

# We can now set some basic Pandas settings for rendering SQL outputs
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# This is a convenience function to print long strings into multiple lines
# You'll see this in action later on in our tutorial!
def wrap_print(text):
    print(textwrap.fill(text, width=80))

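# This is some boilerplate code to help us format printed output with wrapping
# Wrapped in display() because a bare expression is only rendered when it is the last line of a cell
display(HTML("""
<style>
.output pre {
    white-space: pre-wrap;
    word-break: break-word;
}
</style>
"""))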

# Connecting to DuckDB
# ====================

# Setup the SQL connection
connection = duckdb.connect("data/data.db")
%sql connection

# Run a few test queries to confirm the connection works
tables = connection.execute("SHOW TABLES").fetchall()
table_names = [table[0] for table in tables]

preview_counts_list = []
for table_name in table_names:
    try:
        preview_counts_list.append(
            connection.execute(f"""
                SELECT '{table_name}' AS table_name,
                    COUNT(*) AS record_count
                FROM {table_name}""").fetchdf()
        )
    except Exception as e:
        print(f"❌ Could not preview table {table_name}: {e}")
        

print("✅ Database is now ready!")

print("\n📋 Show count of rows from each table in the database:")

# Combine all dataframes in preview_counts_list
preview_counts_df = pd.concat(preview_counts_list, ignore_index=True)

display(preview_counts_df)

display(HTML('''
<iframe width="100%" height="600" src='https://dbdiagram.io/e/685279b3f039ec6d36c0c7e9/68527d19f039ec6d36c1813e'> </iframe>
'''
))

# How to Run SQL Queries

Let's quickly see how we can run SQL code in our Jupyter notebook.

In our Colab environment we can run single or multi-line queries. We can also easily save the output of SQL queries as a local Pandas DataFrame object and even run subsequent SQL queries which can interact with these same DataFrame objects.

## Single Line SQL Query

We can use our notebook magic `%sql` at the start of a notebook cell to run a single line of SQL to query our database.

Let's take a look at the first 5 rows from the `locations` table:

In [None]:
%sql SELECT * FROM locations LIMIT 5;

## Multi-Line SQL Query

We can also run multi-line SQL queries by using a different notebook magic, `%%sql`, which uses two percent signs.

We'll apply a filter on our `locations` table and return 2 columns.

In [None]:
%%sql
SELECT
  location_name,
  description
FROM locations
WHERE location_id = 1;

## Saving SQL Outputs

By using the `<<` operator, we can assign the result of a SQL query (returned as a Pandas DataFrame) to a named Python variable in the notebook’s scope.

### Single Line Assignment

We can specify the name of the output variable directly after the `%sql` or `%%sql` magic command.

In [None]:
%sql single_magic_df << SELECT * FROM locations LIMIT 5;

We can now reference the Python variable directly as a Pandas DataFrame.

In [None]:
# Python notebook scope
single_magic_df

We can also use this same variable as a table reference within a DuckDB `SELECT` statement.

In [None]:
%sql SELECT * FROM single_magic_df;

### Multi-line Assignment

This assignment using `<<` also works with the `%%sql` (multi-line) magic command.

In [None]:
%%sql multi_magic_df <<
SELECT
  location_name,
  description
FROM locations
WHERE location_id = 1;

In [None]:
# display the dataframe
multi_magic_df

When referencing the Python variable within DuckDB, we can also use it inside a multi-line SQL query using the `%%sql` magic command.

In [None]:
%%sql
SELECT *
FROM multi_magic_df;

# 1. GenAI Metrics

In this section, we’re going to explore how our LLMs and AI agents are actually performing out in the wild — using simulated interaction data from our Explore California use case.

There are quite a few questions we’ll answer, but don’t worry — the SQL is straightforward, and each query helps us uncover something useful.

---

To make sense of all this, we’ll organize our metrics into six key themes:

- **🧮 Usage & Token Metrics**  
  How many tokens are being used? Which agents are the most verbose? This helps us understand efficiency and cost.

- **⚡ Latency & Reliability**  
  Are agents responding quickly and reliably? We’ll look at average and p95 latency, plus error and retry rates.

- **🌟 Feedback & Hallucinations**  
  How do users feel about the responses? Are hallucinations dragging down ratings? We’ll dig into the human feedback.

- **📚 Retrieval & Context Behavior**  
  If we’re using document retrieval (RAG), how much context is pulled in — and does it help or hurt?

- **👥 Session Patterns & Engagement**  
  How do users interact over a session? Which visits are super active, and what does that tell us?

- **💰 Cost & Efficiency**  
  Finally, we’ll estimate how much each model is costing us and how many tokens we’re getting per dollar.

> Together, these give us a full 360° view of how our agents are doing — from quality and responsiveness to economics and user trust.

---

Before we dive in, let’s take a quick peek at the `interactions` table — the source of all our insights!


In [None]:
%sql SELECT * FROM interactions LIMIT 5;

## 1.1 Token Usage

These queries help us understand how much content each agent is generating (or using). Token usage is key when it comes to tracking cost and prompt efficiency.

In [None]:
%%sql
-- 1. What is the average total tokens per agent?
SELECT
  agent_name,
  AVG(prompt_tokens + completion_tokens) AS avg_total_tokens
FROM interactions
GROUP BY agent_name;

In [None]:
%%sql
-- 2. What is the average number of completion tokens per agent?
SELECT
  agent_name,
  AVG(completion_tokens) AS avg_completion_tokens
FROM interactions
GROUP BY agent_name;

In [None]:
%%sql
-- 3. Return the top 5 visits with the most total token usage
SELECT
  visit_id,
  SUM(prompt_tokens + completion_tokens) AS total_tokens
FROM interactions
GROUP BY visit_id
ORDER BY total_tokens DESC
LIMIT 5;

## 1.2 Latency and Reliability

Let’s look at latency and error rates to see how responsive and reliable each agent is. The 95th percentile latency helps us spot outliers that might frustrate users.
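
To see why a high percentile is more informative than the average here, the sketch below computes a continuous percentile in plain Python using the usual linear-interpolation definition that `PERCENTILE_CONT` follows. The helper `percentile_cont` and the latency values are hypothetical and only for illustration; they are not taken from the `interactions` table.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile via linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Fractional, zero-based position of the requested percentile
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Hypothetical latencies in milliseconds: one slow outlier among fast responses
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 127, 2500]

average = sum(latencies_ms) / len(latencies_ms)
p95 = percentile_cont(latencies_ms, 0.95)
print(f"average = {average:.1f} ms, p95 = {p95:.1f} ms")
```

In this toy sample a single slow response pulls the p95 far above the average, which is the kind of tail behavior the p95 query is meant to surface.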

In [None]:
%%sql
-- 4. What is the average latency for each agent?
SELECT
  agent_name,
  AVG(latency_ms) AS avg_latency
FROM interactions
GROUP BY agent_name;

In [None]:
%%sql
-- 5. What is the 95th percentile latency across all interactions?
SELECT
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_latency
FROM interactions;

In [None]:
%%sql
-- 6. What is the error rate split by agent?
SELECT agent_name,
       AVG(CASE WHEN error_flag THEN 1 ELSE 0 END) AS error_rate
FROM interactions
GROUP BY agent_name;

## 1.3 Feedback & Hallucinations

Now we’ll dive into user feedback and see how it correlates with hallucinations and errors. This gives us a human lens on quality.

In [None]:
%%sql
-- 7. Avg feedback rating per agent
SELECT agent_name, AVG(feedback_rating) AS avg_feedback
FROM interactions
GROUP BY agent_name;

In [None]:
%%sql
-- 8. Feedback vs hallucination
SELECT hallucination_flag, AVG(feedback_rating) AS avg_feedback
FROM interactions
GROUP BY hallucination_flag;

In [None]:
%%sql
-- 9. Count of low-feedback interactions (≤2) with errors
SELECT COUNT(*) AS low_feedback_errors
FROM interactions
WHERE feedback_rating <= 2 AND error_flag = TRUE;

In [None]:
%%sql
-- 10. Top visits by retry count
SELECT visit_id, SUM(retry_count) AS total_retries
FROM interactions
GROUP BY visit_id
ORDER BY total_retries DESC
LIMIT 10;

In [None]:
%%sql
-- 11. Avg retries per agent
SELECT agent_name, AVG(retry_count) AS avg_retries
FROM interactions
GROUP BY agent_name;


## 1.4 Retrieval & Context Behavior

If you’re using RAG (Retrieval-Augmented Generation), these metrics help you understand how many documents are being retrieved and whether more context leads to slower or more hallucinated responses.

In [None]:
%%sql
-- 12. Avg docs retrieved per agent
SELECT agent_name, AVG(documents_retrieved) AS avg_docs
FROM interactions
GROUP BY agent_name;

In [None]:
%%sql
-- 13. Context tokens vs latency
SELECT context_tokens, AVG(latency_ms) AS avg_latency
FROM interactions
GROUP BY context_tokens
ORDER BY context_tokens;

In [None]:
%%sql
-- 14. Hallucination rate by docs retrieved
SELECT documents_retrieved,
       AVG(CASE WHEN hallucination_flag THEN 1 ELSE 0 END) AS hallucination_rate
FROM interactions
GROUP BY documents_retrieved;

## 1.5 Session Patterns & Engagement

How many interactions do users have per visit? Are there certain sessions with a lot of back-and-forth?

We'll also dive into deeper metrics like how efficient completions are compared to prompts, how fast tokens are being generated, and how feedback levels vary.

In [None]:
%%sql
-- 15. Avg interactions per visit
SELECT AVG(interaction_count) FROM (
  SELECT visit_id, COUNT(*) AS interaction_count
  FROM interactions
  GROUP BY visit_id
);

In [None]:
%%sql
-- 16. Top 10 most active visits
SELECT visit_id, COUNT(*) AS interaction_count
FROM interactions
GROUP BY visit_id
ORDER BY interaction_count DESC
LIMIT 10;

In [None]:
%%sql
-- 17. Hallucination rate by agent
SELECT agent_name,
       AVG(CASE WHEN hallucination_flag THEN 1 ELSE 0 END) AS hallucination_rate
FROM interactions
GROUP BY agent_name;

In [None]:
%%sql
-- 18. Completion-to-prompt ratio
SELECT agent_name,
       AVG(CAST(completion_tokens AS FLOAT) / NULLIF(prompt_tokens, 0)) AS completion_to_prompt_ratio
FROM interactions
GROUP BY agent_name;


In [None]:
%%sql
-- 19. Token throughput (tokens per ms)
SELECT agent_name,
       AVG((prompt_tokens + completion_tokens) * 1.0 / latency_ms) AS tokens_per_ms
FROM interactions
GROUP BY agent_name;


In [None]:
%%sql
-- 20. Interaction count by feedback level
SELECT feedback_rating, COUNT(*) AS count
FROM interactions
GROUP BY feedback_rating
ORDER BY feedback_rating;

## 1.6 Cost & Efficiency

Finally, let’s look at the cost of each model and agent. We’ll also calculate how many tokens we’re getting per dollar — a handy metric for cost-efficiency.

Assume the following cost structure for the three models used in Explore California:

| Model  | Price per 1K Tokens (USD) |
|--------|---------------------------|
| Kimi   | $0.002                    |
| GPT    | $0.030                    |
| Gemini | $0.010                    |
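
As a quick sanity check on the arithmetic behind the cost queries, here is a minimal Python sketch using the prices from the table above. The function names and the example token counts are hypothetical; only the three prices come from this notebook.

```python
# Prices per 1,000 tokens in USD, copied from the table above
PRICE_PER_1K_TOKENS = {"kimi": 0.002, "gpt": 0.030, "gemini": 0.010}


def interaction_cost_usd(model_name, prompt_tokens, completion_tokens):
    """Cost of one interaction: total tokens, scaled to thousands, times the model price."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000.0 * PRICE_PER_1K_TOKENS[model_name]


def tokens_per_dollar(model_name):
    """How many tokens one US dollar buys at the model's flat per-token price."""
    return 1000.0 / PRICE_PER_1K_TOKENS[model_name]


# Hypothetical interaction: 1,200 prompt tokens and 300 completion tokens
for model in PRICE_PER_1K_TOKENS:
    cost = interaction_cost_usd(model, 1200, 300)
    print(f"{model}: ${cost:.4f} per interaction, "
          f"{tokens_per_dollar(model):,.0f} tokens per dollar")
```

Because each model has a single flat price here, tokens per dollar for an agent depends only on the mix of models that agent used, which is what the final query in this section aggregates.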

In [None]:
%%sql
-- 21. Total cost estimate
SELECT
  SUM((prompt_tokens + completion_tokens) / 1000.0 *
      CASE model_name
          WHEN 'kimi' THEN 0.002
          WHEN 'gpt' THEN 0.03
          WHEN 'gemini' THEN 0.01
      END) AS total_cost_usd
FROM interactions;

In [None]:
%%sql
-- 22. Cost breakdown by model
SELECT model_name,
  SUM((prompt_tokens + completion_tokens) / 1000.0 *
      CASE model_name
          WHEN 'kimi' THEN 0.002
          WHEN 'gpt' THEN 0.03
          WHEN 'gemini' THEN 0.01
      END) AS cost_usd
FROM interactions
GROUP BY model_name;

In [None]:
%%sql
-- 23. Cost per agent
SELECT agent_name,
  SUM((prompt_tokens + completion_tokens) / 1000.0 *
      CASE model_name
          WHEN 'kimi' THEN 0.002
          WHEN 'gpt' THEN 0.03
          WHEN 'gemini' THEN 0.01
      END) AS cost_usd
FROM interactions
GROUP BY agent_name;

In [None]:
%%sql
-- 24. Tokens per dollar (efficiency)
SELECT agent_name,
       SUM(prompt_tokens + completion_tokens) AS total_tokens,
       SUM((prompt_tokens + completion_tokens) / 1000.0 *
           CASE model_name
               WHEN 'kimi' THEN 0.002
               WHEN 'gpt' THEN 0.03
               WHEN 'gemini' THEN 0.01
           END) AS cost_usd,
       SUM(prompt_tokens + completion_tokens) /
           SUM((prompt_tokens + completion_tokens) / 1000.0 *
               CASE model_name
                   WHEN 'kimi' THEN 0.002
                   WHEN 'gpt' THEN 0.03
                   WHEN 'gemini' THEN 0.01
               END) AS tokens_per_dollar
FROM interactions
GROUP BY agent_name;


# 2. Experimentation Analysis

Let's now shift gears to analyze our AI agent experiments.

We'll first need to combine our `features`, `visits`, `sales` and `products` data to make sure we are seeing a complete picture of our conversion journey.

In [None]:
%%sql
SELECT
    visits.visit_timestamp,
    visits.visit_id,
    visits.user_id,

    -- Flag which Agent was active for this visit or 'Control' otherwise
    COALESCE(features.feature, 'Control') AS experiment_group,
    

    -- Flag whether a sale occurred during this visit
    CASE WHEN sales.sale_id IS NOT NULL THEN 1 ELSE 0 END AS sale_flag,

    -- Capture sale amount; default to 0 if no product linked
    COALESCE(products.price_usd, 0) AS sale_amount

  FROM visits
  LEFT JOIN features 
    ON visits.visit_id = features.visit_id
  LEFT JOIN sales 
    ON visits.visit_id = sales.visit_id
  LEFT JOIN products 
    ON sales.product_id = products.product_id
  WHERE visits.visit_timestamp BETWEEN DATE '2026-07-01' AND DATE '2026-12-31'

## 2.1 A/B Test Framework

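We can run this A/B test just like we've done before, but since we're testing 3 groups at the same time (two agents and a control), we need to be more careful about false positives.

To do that, we'll use the Bonferroni adjustment, which divides the significance level by the number of comparisons. With two comparisons (each agent vs. control) and an overall significance level of 0.05, each individual test uses 0.025, which raises the two-sided critical z-value from 1.960 to 2.241.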
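
The critical value of `2.241` used in the query below can be reproduced with Python's standard library. This is a minimal sketch that assumes a two-sided test, an overall significance level of 0.05, and two comparisons (Agent A vs. Control and Agent B vs. Control); the helper name `bonferroni_critical_z` is hypothetical.

```python
from statistics import NormalDist


def bonferroni_critical_z(family_alpha, num_comparisons):
    """Two-sided critical z-value after splitting alpha evenly across comparisons."""
    per_test_alpha = family_alpha / num_comparisons
    # Two-sided test: half of the per-test alpha sits in each tail
    return NormalDist().inv_cdf(1 - per_test_alpha / 2)


unadjusted_z = bonferroni_critical_z(0.05, 1)  # a single comparison
adjusted_z = bonferroni_critical_z(0.05, 2)    # two comparisons against control
print(f"unadjusted z = {unadjusted_z:.3f}, Bonferroni-adjusted z = {adjusted_z:.3f}")
```

Adding more treatment groups would increase `num_comparisons` and push the critical value higher still, so the hard-coded constant in the SQL would need to be recomputed.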

In [None]:
%%sql

-- ------------------------------------------------------
-- STEP 1: Build base dataset with visit-level outcomes
-- ------------------------------------------------------
WITH cte_base AS (
  SELECT
    visits.visit_timestamp,
    visits.visit_id,
    visits.user_id,
    COALESCE(features.feature, 'Control') AS experiment_group,  -- Assign agent or control group
    CASE WHEN sales.sale_id IS NOT NULL THEN 1 ELSE 0 END AS sale_flag,  -- Conversion flag
    COALESCE(products.price_usd, 0) AS sale_amount  -- Revenue per visit
  FROM visits
  LEFT JOIN features ON visits.visit_id = features.visit_id
  LEFT JOIN sales ON visits.visit_id = sales.visit_id
  LEFT JOIN products ON sales.product_id = products.product_id
  WHERE visits.visit_timestamp BETWEEN DATE '2026-07-01' AND DATE '2026-12-31'
),

-- ------------------------------------------------------
-- STEP 2: Aggregate metrics for the control group
-- ------------------------------------------------------
cte_control AS (
  SELECT
    COUNT(DISTINCT visit_id) AS control_visit_count,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END) AS control_sales_count,
    SUM(sale_amount) AS control_sales_amount,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END)::FLOAT / COUNT(DISTINCT visit_id) AS control_conversion_rate
  FROM cte_base
  WHERE experiment_group = 'Control'
),

-- ------------------------------------------------------
-- STEP 3: Aggregate metrics for Agent A
-- ------------------------------------------------------
cte_agent_a AS (
  SELECT
    COUNT(DISTINCT visit_id) AS agent_a_visit_count,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END) AS agent_a_sales_count,
    SUM(sale_amount) AS agent_a_sales_amount,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END)::FLOAT / COUNT(DISTINCT visit_id) AS agent_a_conversion_rate
  FROM cte_base
  WHERE experiment_group = 'Agent A'
),

-- ------------------------------------------------------
-- STEP 4: Aggregate metrics for Agent B
-- ------------------------------------------------------
cte_agent_b AS (
  SELECT
    COUNT(DISTINCT visit_id) AS agent_b_visit_count,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END) AS agent_b_sales_count,
    SUM(sale_amount) AS agent_b_sales_amount,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END)::FLOAT / COUNT(DISTINCT visit_id) AS agent_b_conversion_rate
  FROM cte_base
  WHERE experiment_group = 'Agent B'
),

-- ------------------------------------------------------
-- STEP 5: Combine group-level stats into a single row
-- ------------------------------------------------------
cte_combined AS (
  SELECT *
  FROM cte_control, cte_agent_a, cte_agent_b
),

-- ------------------------------------------------------
-- STEP 6: Calculate uplifts, confidence intervals, and standard errors
-- ------------------------------------------------------
cte_stats AS (
  SELECT *,
    
    -- Absolute uplift vs. control
    agent_a_conversion_rate - control_conversion_rate AS agent_a_absolute_uplift,
    agent_b_conversion_rate - control_conversion_rate AS agent_b_absolute_uplift,

    -- Bonferroni-adjusted confidence intervals (z = 2.241)
    agent_a_conversion_rate - 2.241 * SQRT((agent_a_conversion_rate * (1 - agent_a_conversion_rate)) / agent_a_visit_count) AS agent_a_ci_lower,
    agent_a_conversion_rate + 2.241 * SQRT((agent_a_conversion_rate * (1 - agent_a_conversion_rate)) / agent_a_visit_count) AS agent_a_ci_upper,

    agent_b_conversion_rate - 2.241 * SQRT((agent_b_conversion_rate * (1 - agent_b_conversion_rate)) / agent_b_visit_count) AS agent_b_ci_lower,
    agent_b_conversion_rate + 2.241 * SQRT((agent_b_conversion_rate * (1 - agent_b_conversion_rate)) / agent_b_visit_count) AS agent_b_ci_upper,

    control_conversion_rate - 2.241 * SQRT((control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count) AS control_ci_lower,
    control_conversion_rate + 2.241 * SQRT((control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count) AS control_ci_upper,
    
    -- Standard error for uplift comparisons
    SQRT(
      (agent_a_conversion_rate * (1 - agent_a_conversion_rate)) / agent_a_visit_count +
      (control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count
    ) AS agent_a_control_se,

    SQRT(
      (agent_b_conversion_rate * (1 - agent_b_conversion_rate)) / agent_b_visit_count +
      (control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count
    ) AS agent_b_control_se
  FROM cte_combined
),

-- ------------------------------------------------------
-- STEP 7: Compute z-scores and apply Bonferroni significance threshold
-- ------------------------------------------------------
cte_zscore AS (
  SELECT *,
    agent_a_absolute_uplift / agent_a_control_se AS agent_a_z_score,
    agent_b_absolute_uplift / agent_b_control_se AS agent_b_z_score,

    -- Mark as significant if z exceeds Bonferroni-adjusted critical value
    CASE WHEN agent_a_absolute_uplift / agent_a_control_se >= 2.241402727604947
         THEN 'Significant' ELSE 'Not Significant' END AS agent_a_significance,

    CASE WHEN agent_b_absolute_uplift / agent_b_control_se >= 2.241402727604947
         THEN 'Significant' ELSE 'Not Significant' END AS agent_b_significance
  FROM cte_stats
)

-- ------------------------------------------------------
-- STEP 8: Final output with all summary metrics
-- ------------------------------------------------------
SELECT
  -- Agent A
  agent_a_visit_count, agent_a_sales_count, agent_a_conversion_rate,
  agent_a_absolute_uplift, agent_a_ci_lower, agent_a_ci_upper,
  agent_a_z_score, agent_a_significance,

  -- Agent B
  agent_b_visit_count, agent_b_sales_count, agent_b_conversion_rate,
  agent_b_absolute_uplift, agent_b_ci_lower, agent_b_ci_upper,
  agent_b_z_score, agent_b_significance,

  -- Control Group
  control_visit_count, control_sales_count, control_conversion_rate,
  control_ci_lower, control_ci_upper
FROM cte_zscore;


## 2.2 Experimentation Insights

Here is an example report we can generate using our calculated metrics from our A/B test framework.

---

### 📊 Experiment Results Summary

Our A/B test evaluated the performance of **Agent A** and **Agent B** compared to a **control group** to understand their impact on conversion rates.

#### ✅ Statistical Significance

- **Agent A**
  - **Z-score**: `44.19`
  - **Result**: **Significant** at the 95% confidence level (Bonferroni-adjusted)

- **Agent B**
  - **Z-score**: `25.91`
  - **Result**: **Significant** at the 95% confidence level (Bonferroni-adjusted)

This indicates **strong evidence** that both agents outperformed the control group in conversion rate.
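
As a cross-check, the z-scores can be recomputed from the visit and conversion counts in the Conversion Performance table, using the same unpooled standard error as the SQL framework. This is a minimal sketch; the helper name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(conversions_a, visits_a, conversions_b, visits_b):
    """z-score for the difference in conversion rate, using unpooled standard errors."""
    rate_a = conversions_a / visits_a
    rate_b = conversions_b / visits_b
    standard_error = math.sqrt(
        rate_a * (1 - rate_a) / visits_a + rate_b * (1 - rate_b) / visits_b
    )
    return (rate_a - rate_b) / standard_error


# (conversions, visits) for each group, taken from the Conversion Performance table
control = (968, 25334)
agent_a = (6769, 57320)
agent_b = (1941, 18674)

z_agent_a = two_proportion_z(*agent_a, *control)
z_agent_b = two_proportion_z(*agent_b, *control)
print(f"Agent A z = {z_agent_a:.2f}, Agent B z = {z_agent_b:.2f}")
```

Both values should land close to the z-scores quoted above and well beyond the Bonferroni-adjusted critical value of 2.241.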

---

#### 🎯 Conversion Performance

| Metric                     | Control Group  | Agent A          | Agent B          |
|---------------------------|----------------|------------------|------------------|
| Number of Visits           | 25,334         | 57,320           | 18,674           |
| Number of Conversions      | 968            | 6,769            | 1,941            |
| Conversion Rate            | 3.82%          | 11.81%           | 10.39%           |
| 95% CI (Conversion Rate)   | [3.55%, 4.09%] | [11.51%, 12.11%] | [9.89%, 10.89%]  |

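- **Agent A uplift over control**: **+7.99 percentage points**
  - 95% CI for uplift (Bonferroni-adjusted): [7.58, 8.39] percentage points
- **Agent B uplift over control**: **+6.57 percentage points**
  - 95% CI for uplift (Bonferroni-adjusted): [6.00, 7.14] percentage points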

---

#### 📌 Conclusion

Both Agent A and Agent B produced **statistically significant improvements** in conversion rates compared to the control group.  
Agent A performed the strongest, with nearly **8 percentage points of uplift**, suggesting it’s the most promising option for deployment to drive higher conversions.


## 2.3 Analyzing Revenue Impact with ANOVA

While conversion rate is a powerful metric for understanding user behavior, it's a **binary outcome** — either a user converts or not.  
Because of this, it's **not well-suited for ANOVA**, which assumes **continuous and normally distributed residuals**.

Instead, we’ll use **ANOVA** to compare the **`sale_amount`** (revenue per visit) across different model variants, giving us a way to test whether **any model produces significantly higher revenue**.

---

### 🧪 Revenue-Based Comparison Across Agent Variants

In this step, we’ll analyze whether **different models** used within Agent A and Agent B groups result in different **average sale amounts**.

This helps us identify which specific models not only convert users, but **generate higher-value transactions**.

---

### 🧹 Preparing the Input Data

To run ANOVA and follow-up pairwise comparisons (e.g. Tukey’s HSD), we need a dataset with:

- One row per **visit**
- A column for the **model variant** (e.g., `'Agent A - kimi'`, `'Agent B - gpt'`, `'Control'`)
- A column for the **continuous outcome** (`sale_amount`)

Example:

| visit_id | experiment_group  | sale_amount |
|----------|-------------------|-------------|
| 001      | Agent A - kimi    | 199.00      |
| 002      | Control           | 0.00        |
| 003      | Agent B - gpt     | 329.00      |
| ...      | ...               | ...         |

---

### 📊 Example ANOVA Output

| Source         | df   | Sum of Squares | Mean Square | F-Statistic | p-value |
|----------------|------|----------------|-------------|-------------|---------|
| Between Groups | 4    | 122340.5       | 30585.1     | 8.40        | < 0.001 |
| Within Groups  | 1000 | 3639180.2      | 3639.2      |             |         |
| Total          | 1004 | 3761520.7      |             |             |         |

This tells us that **at least one group** differs significantly in average revenue.
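
To make the columns of an ANOVA table concrete, here is a minimal plain-Python sketch of a one-way ANOVA: each mean square is a sum of squares divided by its degrees of freedom, and the F-statistic is the ratio of the two mean squares. The function `one_way_anova` and the toy sale amounts are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, ss_between, ss_within, f_statistic)."""
    all_values = [value for values in groups.values() for value in values]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # Variation of the group mean around the grand mean, weighted by group size
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        # Variation of individual observations around their own group mean
        ss_within += sum((value - group_mean) ** 2 for value in values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, ss_between, ss_within, f_statistic


# Hypothetical sale amounts per visit: three groups of three visits each
toy_groups = {
    "Control": [1.0, 2.0, 3.0],
    "Agent A": [5.0, 6.0, 7.0],
    "Agent B": [2.0, 3.0, 4.0],
}

df_between, df_within, ss_between, ss_within, f_statistic = one_way_anova(toy_groups)
print(f"df = ({df_between}, {df_within}), F = {f_statistic:.2f}")
```

A large F means the group means are spread out much more than the noise within each group would explain; the p-value then comes from the F distribution with those two degrees of freedom.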

---

### 🔍 Example Tukey's HSD Output

| Group 1        | Group 2        | Mean Diff | p-adj  | Lower  | Upper  | Reject |
|----------------|----------------|-----------|--------|--------|--------|--------|
| Agent A - kimi | Control         | 12.5      | 0.001  | 5.4    | 19.6   | True   |
| Agent B - gpt  | Agent A - kimi | -4.2      | 0.042  | -8.3   | -0.1   | True   |
| ...            | ...             | ...       | ...    | ...    | ...    | ...    |

A value of `True` in the `Reject` column means we reject the null hypothesis of equal means for that pair: the two groups have significantly different average sale amounts at the chosen significance level.

---

By using ANOVA + Tukey's HSD, we go beyond just asking *“Does it convert?”*  
We start to answer: *“Which models drive the most valuable customers?”*


## 2.5 SQL Implementation: Preparing Inputs for ANOVA & Tukey's HSD

Before we can run our ANOVA and Tukey’s HSD tests, we need to generate a **visit-level dataset** that includes:

- The **model name** used (from the `interactions` table)
- The **experiment group** (e.g., Agent A, Agent B, or Control)
- The **sale amount** (continuous revenue outcome per visit)

We can generate this using a variation of our previous `cte_base` query by joining the `interactions` table and adding a composite label like:

```sql
COALESCE(features.feature || ' - ' || interactions.model_name, 'Control') AS experiment_group
```

This allows us to differentiate between individual model variants within each agent (e.g., `'Agent A - kimi'`, `'Agent B - gpt'`, etc.).

---

### ✅ Example Query Output (Truncated)

| visit_id | experiment_group  | model_name | sale_amount |
|----------|-------------------|------------|-------------|
| v001     | Agent A - kimi    | kimi       | 289.00      |
| v002     | Control           | NULL       | 0.00        |
| v003     | Agent B - gpt     | gpt        | 412.00      |
| ...      | ...               | ...        | ...         |

---

This dataset will serve as the **input into your Python-based statistical test**, where each row is a single observation.  
We'll now move into Python to run the ANOVA and post-hoc pairwise tests.
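
One thing worth checking before running the tests: joining `interactions` on `visit_id` returns one row per interaction, so a visit with several interactions would appear more than once. The sketch below illustrates this with Python's built-in `sqlite3` as a stand-in for DuckDB. The three toy tables and their rows are hypothetical, and whether the real data contains multiple interactions per visit is an assumption to verify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (visit_id INTEGER);
    CREATE TABLE features (visit_id INTEGER, feature TEXT);
    CREATE TABLE interactions (visit_id INTEGER, model_name TEXT);

    -- Visit 1 saw Agent A and had two interactions; visit 2 is a control visit
    INSERT INTO visits VALUES (1), (2);
    INSERT INTO features VALUES (1, 'Agent A');
    INSERT INTO interactions VALUES (1, 'kimi'), (1, 'kimi');
""")

rows = conn.execute("""
    SELECT
      visits.visit_id,
      COALESCE(features.feature || ' - ' || interactions.model_name, 'Control') AS experiment_group
    FROM visits
    LEFT JOIN features ON visits.visit_id = features.visit_id
    LEFT JOIN interactions ON visits.visit_id = interactions.visit_id
    ORDER BY visits.visit_id
""").fetchall()
conn.close()

for row in rows:
    print(row)

# Count how many output rows each visit produced
rows_per_visit = {}
for visit_id, _ in rows:
    rows_per_visit[visit_id] = rows_per_visit.get(visit_id, 0) + 1
print(rows_per_visit)
```

If the real dataset shows more than one row for some visits, those visits carry extra weight in the ANOVA, so it may be worth deduplicating to one row per visit first.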

In [None]:
%%sql anova_inputs_df <<
  SELECT
    visits.visit_id,
    COALESCE(features.feature || ' - ' || interactions.model_name, 'Control') AS experiment_group,
    COALESCE(products.price_usd, 0) AS sale_amount
  FROM visits
  LEFT JOIN features ON visits.visit_id = features.visit_id
  LEFT JOIN sales ON visits.visit_id = sales.visit_id
  LEFT JOIN products ON sales.product_id = products.product_id
  LEFT JOIN interactions ON visits.visit_id = interactions.visit_id
  WHERE visits.visit_timestamp BETWEEN DATE '2026-07-01' AND DATE '2026-12-31'

In [None]:
anova_inputs_df

In [None]:
from statsmodels.formula.api import ols
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Fit ANOVA model
anova_model = ols('sale_amount ~ C(experiment_group)', data=anova_inputs_df).fit()
anova_results = sm.stats.anova_lm(anova_model, typ=2)
print(anova_results)

# Perform Tukey's HSD
tukey_results = pairwise_tukeyhsd(
    endog=anova_inputs_df['sale_amount'],
    groups=anova_inputs_df['experiment_group'],
    alpha=0.05
)
print(tukey_results.summary())

## 2.6 Experimental Analysis

### ANOVA and Tukey's HSD Summary

We conducted a one-way **ANOVA** to compare the mean `sale_amount` across multiple experiment groups. The results showed a **statistically significant difference** between at least one pair of groups:

- **F-statistic**: `214.32`  
- **p-value**: `< 1e-273`  
- ✅ **Conclusion**: Reject the null hypothesis — not all groups have the same mean sale amount.

---

### 🔍 Post-hoc Comparison (Tukey's HSD)

To identify **which specific groups differ**, we ran a Tukey’s HSD test. Here's what we found:

#### 🔹 No Significant Difference:
- Among **Agent A** model variants (`gemini`, `gpt`, `kimi`) — these models performed **similarly** in terms of sale amount.
- Among **Agent B** variants — also **no significant differences** between `gemini`, `gpt`, and `kimi`.

#### 🔸 Significant Differences:
- **All Agent A models significantly outperformed** their **Agent B counterparts**.
- **All AI agents (A & B)** significantly outperformed the **Control** group (by over $148 per visit on average).

---

### 📌 Conclusion

This analysis confirms that while different model variants **within the same agent group** perform similarly, the **agent group itself** (Agent A vs. Agent B vs. Control) plays a major role in driving higher revenue. Both Agent A and B lead to **statistically significant increases in sale amount**, with **Agent A performing the best overall**.
