# 💘 The Stolen Valentine Budget

<img src="https://raw.githubusercontent.com/Haross/sql_notebook/main/assignments/The_Stolen_Valentine_Budget/stolen_valentine_banner.png" width="60%">

## 🕵️ A Corporate SQL Investigation - Introduction

Lunaris Systems approved a modest Valentine’s Day budget for its **Madrid office**.

The plan was simple: **a small internal celebration** to thank employees for their work.


The budget was approved. Expense reports were submitted. Everything appeared routine…

Until Finance reviewed the February numbers.

⚠️ The Valentine budget appears to have been quietly drained.

Several expense reports contain purchases that raise questions.  
Audit suspects that one employee may have disguised personal spending as legitimate office expenses.


#### 🎯 Your mission

> Use SQL to investigate the data and identify the prime suspect.

This will not be a step-by-step recipe.

You will need to:

- Explore the company’s expense data
- Test investigative hypotheses
- Distinguish real clues from misleading ones
- Apply precise filtering and aggregation
- Produce a final, defensible SQL report

Your final submission must clearly identify the employee you believe is responsible, and the SQL logic that justifies that conclusion.

Be precise. Be rigorous. Finance will audit your audit.

## **SQL Environment Setup (do not edit)**

In [88]:
# @title
%%capture
!mkdir -p notebook_lib
!wget -q -O notebook_lib/sql_runner.py \
  https://raw.githubusercontent.com/Haross/sql_notebook/8021f5c05b7d973b8db549a1398a3c9a5c7829d5/notebook_lib/sql_runner.py
!wget -q -O notebook_lib/validators.py \
  https://raw.githubusercontent.com/Haross/sql_notebook/7baff2c6485cdf641cabcdb55d92a51317cd18b9/notebook_lib/validators.py

!wget -q -O data.db \
  https://raw.githubusercontent.com/Haross/sql_notebook/main/assignments/The_Stolen_Valentine_Budget/data_A.db

from notebook_lib.sql_runner import make_sql_runner
from notebook_lib.validators import make_df_validator_nospoilers, check_process_rules

import sqlite3
import pandas as pd
from pathlib import Path


In [89]:
# @title
DB_FILE = 'data.db'

conn = sqlite3.connect(DB_FILE)
print(f"Database ready ✅ ({DB_FILE})")


Database ready ✅ (data.db)


## 🗂️ The Database

The investigation involves five tables:

1. **employees**: Contains information about Lunaris employees.
2. **vendors**: Businesses where purchases were made.
3. **expense_reports**: Expense reports submitted by employees.
4. **expenses**: Individual expense line items inside each report.
5. **valentine_budget**: The officially approved Valentine budget for February (by office).

No single table contains the full story.

To solve the case, you will need to:
- Follow the money across multiple tables
- Understand how reports connect to employees
- Connect purchases to vendors
- Compare spending against the approved budget

Careful joins will matter.
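
As a rough illustration of following one expense across several tables, the sketch below builds a tiny in-memory SQLite database and joins it end to end. The key columns (`report_id`, `vendor_id`, `amount`) and every row are assumptions invented for this example; the ER diagram shows the real names.

```python
import sqlite3

# Toy database: column names and rows are hypothetical, not the real schema.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, first_name TEXT, office TEXT);
CREATE TABLE vendors (vendor_id INTEGER PRIMARY KEY, vendor_name TEXT);
CREATE TABLE expense_reports (report_id INTEGER PRIMARY KEY, employee_id INTEGER, status TEXT);
CREATE TABLE expenses (expense_id INTEGER PRIMARY KEY, report_id INTEGER, vendor_id INTEGER, amount REAL);

INSERT INTO employees VALUES (1, 'Ana', 'Madrid'), (2, 'Ben', 'Lisbon');
INSERT INTO vendors VALUES (10, 'Paper Co'), (11, 'Cafe Sol');
INSERT INTO expense_reports VALUES (100, 1, 'Approved'), (101, 2, 'Pending');
INSERT INTO expenses VALUES (1000, 100, 10, 25.0), (1001, 100, 11, 12.5), (1002, 101, 11, 40.0);
""")

# Each JOIN ... ON adds one hop: expense -> report -> employee, and expense -> vendor.
join_rows = demo.execute("""
SELECT e.first_name, v.vendor_name, x.amount
FROM expenses AS x
JOIN expense_reports AS r ON x.report_id = r.report_id
JOIN employees AS e ON r.employee_id = e.employee_id
JOIN vendors AS v ON x.vendor_id = v.vendor_id
ORDER BY x.expense_id
""").fetchall()

for row in join_rows:
    print(row)
```

Every expense row survives the chain here because each foreign key has a match. If a key were missing or mismatched, the row would silently drop out of the result, which is why the join conditions deserve a second look.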

In [90]:
# @title ER Diagram
%%html
<img id="er-img" style="width:75%; max-width:100%; height:auto;"
     data-light="https://raw.githubusercontent.com/Haross/sql_notebook/main/assignments/The_Stolen_Valentine_Budget/ER_stolen_valentine_budget.png"
     data-dark="https://raw.githubusercontent.com/Haross/sql_notebook/main/assignments/The_Stolen_Valentine_Budget/ER_stolen_valentine_budget_black.png"
     alt="ER diagram">

<script>
  const img = document.getElementById("er-img");

  function isDarkTheme() {
    // Colab sets html[theme=dark] on the top document
    const themeAttr = document.documentElement.getAttribute("theme");
    if (themeAttr) return themeAttr === "dark";

    // fallback: OS/browser preference
    return window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)").matches;
  }

  function updateImage() {
    img.src = isDarkTheme() ? img.dataset.dark : img.dataset.light;
  }

  updateImage();

  // React to Colab theme toggles (attribute changes)
  new MutationObserver(updateImage).observe(document.documentElement, {
    attributes: true,
    attributeFilter: ["theme"]
  });

  // React to OS/browser theme changes (fallback)
  if (window.matchMedia) {
    const mq = window.matchMedia("(prefers-color-scheme: dark)");
    mq.addEventListener?.("change", updateImage);
    mq.addListener?.(updateImage); // older browsers
  }
</script>

## ⚖️ Investigation Rules

To keep the investigation fair, you may use **only the SQL concepts covered in class so far**:

- `SELECT`, `FROM`, `WHERE`
- `DISTINCT`
- `AND`, `OR`
- `ORDER BY`
- `JOIN ... ON`
- Aggregate functions: `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`
- `GROUP BY`
- `HAVING`

You may **not** use SQL features that have not been covered in class, such as subqueries or window functions.

This case is designed to be solved using **clean relational reasoning**, not advanced syntax.

If your logic is correct, you do not need more tools.

> Overly complex queries will not receive additional credit.  
> Clarity and correctness matter more than cleverness.
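
A minimal sketch of how the permitted clauses fit together, run against a made-up `purchases` table (the table, its columns, and its values are hypothetical and unrelated to the case data). It shows `WHERE` filtering rows before grouping and `HAVING` filtering groups after aggregation.

```python
import sqlite3

# Toy table: hypothetical columns and values, unrelated to the case data.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE purchases (buyer TEXT, amount REAL);
INSERT INTO purchases VALUES
  ('Ana', 10.0), ('Ana', 30.0),
  ('Ben', 4.0),  ('Ben', 18.0),
  ('Cleo', 50.0), ('Cleo', 3.0);
""")

# WHERE drops individual rows first; HAVING then drops whole groups.
grouped = demo.execute("""
SELECT buyer, COUNT(*) AS n_purchases, SUM(amount) AS total
FROM purchases
WHERE amount >= 5
GROUP BY buyer
HAVING SUM(amount) > 20
ORDER BY total DESC
""").fetchall()

print(grouped)
```

Ben still has one qualifying row after the `WHERE` step, but his group total falls below the `HAVING` threshold, so he does not appear in the output.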



## 🎯 The Goal of This Investigation

Your task is to determine which employee is responsible for the suspicious Valentine spending.

By the end of this notebook, you will submit one final SQL query that supports your conclusion.

Assume Finance will review your report.

Your conclusion must be supported by clear evidence, not assumptions.






## 📘 How This Investigation Is Structured

This notebook is divided into three sections:

### Section 1 - Evidence Room

You will explore the database and understand how the tables relate to each other.

Some questions in this section require you to compute values using SQL and submit the **result** on Canvas. These are marked with:

> 📝 Submit result on Canvas

Other queries are exploratory and intended to help you understand the data.  
While optional, they are strongly recommended.

### Section 2 - Hypothesis Testing

You will test investigative hypotheses and analyze patterns in the data.

Some questions will be marked:

> 📝 Submit query on Canvas

These require you to submit the SQL query itself.

Other prompts are exploratory and guide your reasoning toward the final investigation.

### Section 3 - Discover the Prime Suspect

You will write one final SQL query that identifies the prime suspect.

This is the most important part of the assignment.

Your submission must include:
- The final SQL query
- The name of the employee you identified
- A brief explanation of your reasoning

Assume Finance will challenge your conclusion.
Your explanation must withstand scrutiny.


# 🔎 Section 1 - The Evidence Room

<img src="https://raw.githubusercontent.com/Haross/sql_notebook/main/assignments/The_Stolen_Valentine_Budget/evidence_room_banner.png" width="50%">

In this section, your objective is to:

> Understand the structure of the data and what information is available.

Resist the temptation to jump to conclusions.

Do **not** try to identify the suspect yet.

First, analyze how the expense system works:
- How employees submit reports  
- How expenses are recorded  
- How vendors are linked  
- How budgets are defined  

Strong investigations begin with structural understanding.

In [91]:
# @title 1.1 - The Valentine Budget

make_sql_runner(
    conn,
    runner_id="ex1_1",
    description_md="""
## 🧾 1.1 The Valentine Budget

Let’s begin with the official budget for the case.

**Question:** What budget was approved for the **Madrid** office in **February 2026**?

To answer, inspect the `valentine_budget` table and locate the row that matches:

- office = Madrid
- budget_month = 2026-02

This establishes the scope of the investigation.

""",
    )




In [92]:
# @title 1.2 - Who Works at Lunaris?

make_sql_runner(
    conn,
    runner_id="ex1_2",
    description_md="""
## 👥 1.2 Who Works at Lunaris?

Before analyzing expenses, we need to understand the people involved.

Explore the `employees` table.

As you investigate, answer the following:

- How many distinct offices are represented? 📝 Submit result on Canvas (a number)
- Are all employees located in Madrid?
- Could office location affect the scope of the Valentine budget?

Avoid jumping to conclusions; focus on understanding the structure of the workforce first.

""",
    )




In [93]:
# @title 1.3 - Where Are Purchases Made?

make_sql_runner(
    conn,
    runner_id="ex1_3",
    description_md="""
## 🏢 1.3 Where Are Purchases Made?

Explore the `vendors` table to understand:

- What vendors exist
- How vendors are categorized
- What types of purchases are possible

As part of your investigation:

- Identify the different vendor types available.
- Consider which types might be typical for office use.
- Consider which types *could* raise questions in a Valentine budget context.

Do not draw conclusions yet; simply map the landscape of possible spending.

""",
    )




In [94]:
# @title 1.4 - How Do Expense Reports Work?

make_sql_runner(
    conn,
    runner_id="ex1_4",
    description_md="""
## 🧾 1.4 How Do Expense Reports Work?

Expense reports group multiple expense line items.

Explore the `expense_reports` table carefully.

As you investigate, consider:

- How many distinct report statuses exist? 📝 Submit result on Canvas (a number)
- What does each status represent?
- Do all reports necessarily impact the company budget?
- Should all reports be treated equally in a financial investigation?

Understanding report status is critical before analyzing spending.

""",
    )




In [105]:
# @title 1.5 - What Do Individual Expenses Look Like?

make_sql_runner(
    conn,
    runner_id="ex1_5",
    description_md="""
## 💳 1.5 What Do Individual Expenses Look Like?

The real details live in the `expenses` table. Explore it carefully. Pay attention to:

- The different expense categories
- The date range of recorded expenses
- Whether any receipts are missing (`receipt_id` IS NULL)

Some of these details may become important later.

---
#### 📝 Submit on Canvas

1. What is the earliest `expense_date` in the table?
   (Submit the date exactly as stored.)

2. How many expenses have a missing receipt?
   (Count rows where `receipt_id` IS NULL.)
""",
    )




## End of Section 1 - Evidence Review

At this stage, you should clearly understand:

- How the five tables relate to each other
- What information each table contains
- Which columns may be relevant for filtering
- Which values appear meaningful in the context of February spending

You are **not** solving the case yet.

You are building structural understanding.

Strong investigations begin with context, not conclusions.
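
If the structure of a table is still unclear, SQLite can describe itself. The sketch below is one possible way to list tables and columns, assuming a stand-in in-memory database with a hypothetical `sample_expenses` table so that it runs on its own. It is meant for orientation only, not for submitted queries.

```python
import sqlite3

# Stand-in database so the sketch runs on its own; the table is hypothetical.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE sample_expenses (expense_id INTEGER, receipt_id TEXT, amount REAL)"
)

# sqlite_master lists the objects stored in a SQLite database.
table_names = [
    name
    for (name,) in demo.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
column_names = [row[1] for row in demo.execute("PRAGMA table_info(sample_expenses)")]

print(table_names)
print(column_names)
```

Running the same two statements against the notebook's `conn` instead of `demo` should list the case tables and their columns.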

# 🧠 Section 2 - Testing the Hypotheses

<img src="https://raw.githubusercontent.com/Haross/sql_notebook/main/assignments/The_Stolen_Valentine_Budget/hypothesis_banner.png" width="50%">

Finance and Audit provided several possible leads.

Not all of them are reliable.

Your job in this section is to **test each hypothesis using SQL**.

Some clues may appear suspicious at first glance.  
Others may seem harmless, until examined closely.

As investigators, you must rely on evidence, not intuition.

Every claim must be supported by data.


In [106]:
# @title Hypothesis 1 - Only Approved Reports Affect the Budget

make_sql_runner(
    conn,
    runner_id="hypothesis_1",
    description_md="""
## 🧾 Hypothesis 1: Only Approved Reports Affect the Budget

Audit suggests that only **approved expense reports** actually impacted Lunaris’ budget.

Before accepting this claim, investigate how reports are distributed across different statuses.

### 🔎 Suggested exploration

To evaluate this hypothesis, use SQL to explore these questions:

* How many reports exist per `status`
* Whether multiple status values appear in the system
* Whether some statuses represent reports that should **not** be counted as budget-impacting

Think carefully:

- Should *Pending* or *Rejected* reports be treated the same way as *Approved*?
- If an expense “affects the budget,” does that imply the company has already reimbursed it, or only that it was submitted?
- Which statuses most plausibly represent expenses that were actually paid?

For now, don’t lock in a rule; just document what you observe.

""",
    )




In [107]:
# @title Hypothesis 2 - The Theft Happened on Valentine’s Day (February 14)

make_sql_runner(
    conn,
    runner_id="hypothesis_2",
    description_md="""
## 💘 Hypothesis 2: The Theft Happened on Valentine’s Day (February 14)

A rumor suggests the suspicious activity occurred exactly on **February 14**.

Before trusting that claim, examine whether spending is concentrated on a single day or distributed across the month.

### 🔎 Suggested exploration

To evaluate this hypothesis, use SQL to explore:

- The full range of dates present in the `expenses` table
- How many expenses occur per day in February
- Whether **total spending** peaks on a specific date (not just the number of transactions)

Avoid filtering prematurely to February 14.

Let the data speak first.
""",
    )




In [117]:
# @title üìù Submit This Query ‚Äî Spending by Date (Daily totals)

make_sql_runner(
    conn,
    runner_id="hypothesis_submit_2",
    description_md="""
## Is spending concentrated on a single day?

Write a SQL query that returns, for each date:

- The expense date (attribute expense_date)
- The number of expenses as **expenses_count**
- The total spending amount as **total_amount**

Apply the following conditions:

- Office = Madrid
- Report status = 'Approved'

Your query must:

- Combine the necessary tables
- Group by expense_date
- Order by expense_date (ascending)

Do not filter to a single day.
Return all qualifying dates.
""",
    )




In [115]:
# @title Hypothesis 3 - Certain Categories Are Suspicious

make_sql_runner(
    conn,
    runner_id="hypothesis_3",
    description_md="""
## 🌹 Hypothesis 3: Certain Categories Are Suspicious

Audit suspects the theft may be disguised under certain expense categories, but they don’t know which ones.

Your goal here is not to accuse anyone yet, but to identify **which categories deserve closer scrutiny**.

To evaluate this hypothesis, explore spending patterns by `category` in the Valentine-budget context.

Consider examining:

- Total spending per `category`
- Which categories account for the largest share of spending in the relevant period
- Which categories seem typical for office operations vs. potentially personal

You may want to rank categories by total spending.

After analyzing your results, reflect:

- Which categories appear unusually high?
- Which seem clearly legitimate?
- Which categories would you investigate next and why?

You will use this reasoning later.
""",
    )




In [116]:
# @title Hypothesis 4 - Missing Receipts May Indicate Suspicion

make_sql_runner(
    conn,
    runner_id="hypothesis_4",
    description_md="""
## 🧾 Hypothesis 4: Missing Receipts May Indicate Suspicion

Audit flagged that some expenses lack receipt documentation.

Missing receipts can indicate weak documentation or potentially something more serious.

Your goal here is to evaluate whether missing receipts appear random or patterned.

### 🔎 Suggested exploration

To evaluate this hypothesis, consider:

* How many expenses have `receipt_id IS NULL`
* Whether missing receipts appear randomly
* Whether they cluster around specific vendors or categories

Consider whether missing documentation, in the Valentine-budget context, suggests anything meaningful.

Avoid jumping to conclusions.
Look for patterns.

""",
    )




In [119]:
# @title üìù Submit This Query ‚Äî Missing Receipts by Vendor

make_sql_runner(
    conn,
    runner_id="hypothesis_submit_3",
    description_md="""
## Do missing receipts cluster around specific vendors?

To determine whether missing receipts cluster around specific vendors, write a SQL query that returns:

- `vendor_name`
- The number of expenses with missing receipts as `missing_receipts_count`
- The total amount of expenses with missing receipts as `missing_receipts_total_amount`

Only include expenses where `receipt_id IS NULL`

Apply the following conditions:

- Office = Madrid
- Report status = 'Approved'

Your query must:

- Combine the necessary tables
- Group by `vendor_name`
- Order by `missing_receipts_count` (descending), then by `missing_receipts_total_amount` (descending)

Do not apply any additional filters.
Return all qualifying vendors.
""",
    )




## End of Section 2 - Evidence Synthesis

At this stage, you should have:

- Examined how report statuses affect financial interpretation
- Tested whether February 14 alone explains the spending pattern
- Identified which categories dominate spending
- Investigated whether missing receipts cluster around specific vendors

Now comes the critical step:

> Which filters are logically necessary to identify the true suspect?

The final investigation will require precision.

Over-filtering may hide the suspect.  
Under-filtering may implicate the wrong employee.

Proceed carefully: your conclusion must be defensible.
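
To see how a single filter can change who ends up at the top of a ranking, consider the sketch below. It uses an invented `claims` table (names, statuses, and amounts are hypothetical and unrelated to the case data) and compares the top total with and without a status filter.

```python
import sqlite3

# Toy data: names, statuses and amounts are invented for illustration only.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE claims (claimant TEXT, status TEXT, amount REAL);
INSERT INTO claims VALUES
  ('Ana', 'Approved', 40.0),
  ('Ana', 'Approved', 35.0),
  ('Ben', 'Approved', 20.0),
  ('Ben', 'Rejected', 90.0);
""")


def top_claimant(where_clause):
    """Return the (claimant, total) row with the largest total under the filter."""
    # The clause is a fixed string written in this sketch, never user input.
    query = f"""
        SELECT claimant, SUM(amount) AS total
        FROM claims
        {where_clause}
        GROUP BY claimant
        ORDER BY total DESC
    """
    return demo.execute(query).fetchone()


unfiltered = top_claimant("")                               # counts every row
approved_only = top_claimant("WHERE status = 'Approved'")   # approved rows only

print(unfiltered, approved_only)
```

Without the filter, Ben leads because of a rejected claim; with it, Ana does. Neither answer is automatically right: which one is defensible depends on what the question is actually asking.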


# ⚖️ Section 3 - The Final Case Report

<img src="https://raw.githubusercontent.com/Haross/sql_notebook/main/assignments/The_Stolen_Valentine_Budget/final_case_banner.png" width="50%">

You have tested the hypotheses.  
You have examined the evidence.

You have analyzed:

- Report statuses
- Daily spending patterns
- Category-level totals
- Receipt irregularities

Now Audit requires a formal conclusion.

This is no longer exploration.

It is your responsibility to present a clear, defensible finding.


## Audit Request

Audit requires a clear and defensible SQL report identifying the prime suspect in the Valentine budget case.

Your task:

> Write one SQL query that identifies the employee whose February spending patterns indicate misuse of the Valentine budget.

You must determine:

- Which filters are logically necessary  
- Which clues were misleading  
- Which expense categories are relevant  
- Which records should be excluded  
- Which aggregation rules meaningfully identify suspicious behavior  

Your query must rely only on data-driven logic.  
Do **not** hard-code employee names or add arbitrary conditions to shape the output.

---

### 📝 Submission Requirements (Canvas)

You must submit:

1. Your final SQL query  
2. The employee(s) identified by your query  
3. A short justification explaining:
   - Why your filters are logically necessary  
   - Why alternative filters would be incorrect  
   - Why your conclusion is defensible  

If your query returns multiple employees, you must explain why.  
If it returns none, you must explain why.  

Your grade will be based on the strength of your reasoning.


In [102]:
# @title 3.1 Discover the Prime Suspect

make_sql_runner(
    conn,
    runner_id="3_1",
    description_md="""
## 🔍 3.1 Discover the Prime Suspect

You will now produce the final audit report.

Your final query must return **exactly** these columns:

- `employee_id`
- `first_name`
- `last_name`
- `suspicious_expenses_count`
- `total_suspicious_amount`
- `distinct_vendors`
- `avg_expense_amount`

Sort the results so the most suspicious employee appears first.

This is the report that will be sent to senior management.

Choose your filters carefully and make sure you can defend every decision.

""",
    )


