# SQL for AI Projects

## Introduction

**Supervised Machine Learning**

In this Jupyter notebook, we'll quickly set up the DuckDB database, get you familiar with the Google Colab environment, and then dive into the supervised machine learning practice exercises for the SQL for AI Projects course!

### Practice Exercises

1. Create a multi-class classification labeled dataset for product recommendation
2. Create a binary classification labeled dataset for purchase prediction
3. Implement a statistical power analysis and A/B test framework

### Database Setup

First things first: let's load our Python libraries and set up access to our database.

Don't worry if you're not familiar with Python. We only need to run the very first cell to initialize our SQL instance, and there will be clear instructions whenever there are non-SQL components.


### Getting Started

To execute a cell in this notebook, click the play button to the left of the cell, or hit the `Run all` button at the top of the notebook, just below the menu toolbar.

The cell below downloads a DuckDB database and connects to it within this notebook's temporary environment.

The same cell also produces a few outputs, including the following:

* A count of rows for each table in the database.
* An interactive entity relationship diagram for our database. This will help us visualize all of the database tables and their relevant primary and foreign keys.

In [None]:
# Initial setup steps
# ====================

# These pip install commands are required for Google Colab notebook environment
!pip install --upgrade --quiet duckdb==1.3.1
!pip install --quiet duckdb-engine==0.17.0
!pip install --quiet jupysql==0.11.1
!pip install --quiet pyperclip==1.9.0

# We also need to set up Git LFS for large file downloads
# This helps us download large files stored on GitHub
!apt-get install git-lfs -y
!git lfs install

# Clone GitHub repo into a "data" folder
!git clone https://github.com/LinkedInLearning/real-world-data-and-AI-challenges-with-SQL-3813163.git data

# Change directory into "data" so we can download the database object
%cd data
!git lfs pull

# Then we need to change directory back up so all our paths are correct!
%cd ..

# Time to import all our Python packages
import duckdb
import textwrap
import pandas as pd
import pyperclip
from IPython.display import HTML, display

# Load the jupysql extension to enable us to run SQL code in code cells
%load_ext sql

# We can now set some basic Pandas settings for rendering SQL outputs
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# This is a convenience function to print long strings into multiple lines
# You'll see this in action later on in our tutorial!
def wrap_print(text):
    print(textwrap.fill(text, width=80))

# This is some boilerplate code to help us format printed output with wrapping
# display() is needed here: only the last expression in a cell renders on its own
display(HTML("""
<style>
.output pre {
    white-space: pre-wrap;
    word-break: break-word;
}
</style>
"""))

# Connecting to DuckDB
# ====================

# Setup the SQL connection
connection = duckdb.connect("data/data.db")
%sql connection

# Run a few test queries using both connections
tables = connection.execute("SHOW TABLES").fetchall()
table_names = [table[0] for table in tables]

preview_counts_list = []
for table_name in table_names:
    try:
        preview_counts_list.append(
            connection.execute(f"""
                SELECT '{table_name}' AS table_name,
                    COUNT(*) AS record_count
                FROM {table_name}""").fetchdf()
        )
    except Exception as e:
        print(f"❌ Could not preview table {table_name}: {e}")
        

print("✅ Database is now ready!")

print("\n📋 Show count of rows from each table in the database:")

# Combine all dataframes in preview_counts_list
preview_counts_df = pd.concat(preview_counts_list, ignore_index=True)

display(preview_counts_df)

display(HTML('''
<iframe width="100%" height="600" src='https://dbdiagram.io/e/685279b3f039ec6d36c0c7e9/68527d19f039ec6d36c1813e'> </iframe>
'''
))

# How to Run SQL Queries

Let's quickly see how we can run SQL code in our Jupyter notebook.

In our Colab environment we can run single or multi-line queries. We can also easily save the output of SQL queries as a local Pandas DataFrame object and even run subsequent SQL queries which can interact with these same DataFrame objects.

## Single Line SQL Query

We can use our notebook magic `%sql` at the start of a notebook cell to run a single line of SQL to query our database.

Let's take a look at the first 5 rows from the `locations` table:

In [None]:
%sql SELECT * FROM locations LIMIT 5;

## Multi-Line SQL Query

We can also run multi-line SQL queries by using a different notebook magic, `%%sql`, which has two percent signs.

We'll apply a filter on our `locations` table and return 2 columns.

In [None]:
%%sql
SELECT
  location_name,
  description
FROM locations
WHERE location_id = 1;

## Saving SQL Outputs

By using the `<<` operator, we can assign the result of a SQL query (returned as a Pandas DataFrame) to a named Python variable in the notebook’s scope.

### Single Line Assignment

We can specify the name of the output variable directly after the `%sql` or `%%sql` magic command.

In [None]:
%sql single_magic_df << SELECT * FROM locations LIMIT 5;

We can now reference the Python variable directly as a Pandas DataFrame.

In [None]:
# Python notebook scope
single_magic_df

We can also use this same variable as a table reference within a DuckDB `SELECT` statement.

In [None]:
%sql SELECT * FROM single_magic_df;

### Multi-line Assignment

This assignment using `<<` also works with the `%%sql` (multi-line) magic command.

In [None]:
%%sql multi_magic_df <<
SELECT
  location_name,
  description
FROM locations
WHERE location_id = 1;

In [None]:
# display the dataframe
multi_magic_df

When referencing the Python variable within DuckDB, we can also use it inside a multi-line SQL query using the `%%sql` magic command.

In [None]:
%%sql
SELECT *
FROM multi_magic_df;

# 1. Multi-Class Classification

Let's start our supervised machine learning challenge by exploring the data we'll be using for our product recommendation problem.

Our goal is to generate a table with the following columns:

* user_id
* product_name
* attributes

However, we'll soon find out that we need to perform some extra data transformations to get our dataset ready for machine learning!

## 1.1 Data Exploration

Let's explore the following datasets we'll use to generate our labeled dataset:

* sales
* users
* attributes
* products

In [None]:
%sql SELECT * FROM sales LIMIT 5;

In [None]:
%sql SELECT * FROM users LIMIT 5;

In [None]:
%sql SELECT * FROM attributes LIMIT 5;

In [None]:
%sql SELECT * FROM products LIMIT 5;

## 1.2 Data Transformation

Our next step is to create a **user-product interaction dataset** enriched with user behavior and preference attributes. It's designed to support downstream modeling tasks such as **product recommendation**, **user segmentation**, or **classification**.


### 1.2.1 Step-by-Step Breakdown

1. **Join Users, Sales, and Products**  
   Combines `users`, `sales`, and `products` tables to get a complete view of each user's purchase history — mapping every user to the products they’ve bought during the **2024–2025** timeframe.

2. **Expand User Attributes**  
   Uses `UNNEST(users.attributes)` to break out multi-valued attributes into individual rows. This allows us to later pivot each attribute as a binary indicator.

3. **Join Attribute Descriptions**  
   Links each `attribute_id` to its descriptive `attribute_name` by joining the `attributes` lookup table.

4. **Pivot Attributes into One-Hot Columns**  
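   Performs a `PIVOT` to transform attribute rows into **one-hot encoded** columns. Each attribute becomes its own column, using `COUNT(*)` as the aggregator: `0` means the user lacks that attribute and a non-zero count means the user has it. The count can exceed `1` when a user bought the same product more than once in the window, so cap it at `1` if a strictly binary indicator is required.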
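
The four steps above can be sketched in plain Python to make the row-to-column reshaping concrete. This is only an illustration, not the notebook's SQL: the sample rows, the attribute IDs, and the `one_hot_pivot` helper are all hypothetical stand-ins for the joined `users`, `sales`, `products`, and `attributes` tables.

```python
from collections import defaultdict

# Hypothetical rows standing in for the joined users/sales/products data
purchases = [
    {"user_id": 1, "product_name": "Tent", "attributes": [10, 20]},
    {"user_id": 2, "product_name": "Map", "attributes": [20]},
]
# Hypothetical lookup standing in for the attributes table
attribute_names = {10: "Loves_Hiking", 20: "Budget_Planner"}


def one_hot_pivot(rows, lookup):
    """Unnest attribute IDs, map them to names, and pivot into count columns."""
    columns = sorted(name.lower() for name in lookup.values())
    counts = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in rows:
        key = (row["user_id"], row["product_name"])
        for attribute_id in row["attributes"]:    # UNNEST step
            name = lookup[attribute_id].lower()   # lookup join plus LOWER
            counts[key][name] += 1                # COUNT(*) aggregator
    return [
        {"user_id": user_id, "product_name": product, **flags}
        for (user_id, product), flags in counts.items()
    ]


wide_rows = one_hot_pivot(purchases, attribute_names)
for row in wide_rows:
    print(row)
```

Because the sketch counts rows just like `COUNT(*)`, a repeated purchase of the same product would push a column above `1`, which is worth checking before treating the columns as strict 0/1 flags.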

### 1.2.2 Final Output

The result is a **wide-format dataset** where each row corresponds to a `(user_id, product_name)` pair, and columns represent the presence of various user traits — such as:

- `loves_hiking`
- `budget_planner`
- `frequent_subscriber`
- `travels_with_family`
- `young_professional`
- *(and many more...)*

This format is ideal for feeding into machine learning models that predict which products a user is likely to engage with — based on their **past purchases** and **attribute profile**.

### 1.2.3 Python Helper Function

The code below generates a sorted list of unique attribute names from our `attributes` table and formats them as a string suitable for our following SQL `PIVOT` transformation — with each value quoted and on its own line. It then copies the result to your clipboard using the `pyperclip` package, so you can easily paste it into a manual PIVOT statement in DuckDB.

In [None]:
%sql unique_attributes_list << SELECT DISTINCT LOWER(attribute_name) AS attribute_name FROM attributes ORDER BY 1;

# This creates the single quoted attribute names we'll use in our SQL query
columns_list = "(\n'" + "',\n'".join(unique_attributes_list["attribute_name"]) + "'\n)"

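# We use this pyperclip package to copy the output directly to our clipboard
# Then we can copy paste this directly into our SQL query below!
try:
    pyperclip.copy(columns_list)
except pyperclip.PyperclipException:
    # Some hosted environments have no clipboard; print the list to copy manually
    print(columns_list)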

# Print out first few records from our unique_attributes_list
print(unique_attributes_list[:10])
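
To see what the helper produces without touching the database, the same string expression can be run on a short list. The three names in `sample_names` are just a hypothetical subset of the attribute names; the notebook reads the full list from the `attributes` table.

```python
# Hypothetical subset of attribute names (the notebook queries the real list)
sample_names = ["bird_watcher", "budget_planner", "coupon_user"]

# Same expression as the helper cell: quote each name, one per line, in parentheses
sample_columns_list = "(\n'" + "',\n'".join(sample_names) + "'\n)"
print(sample_columns_list)
```

The result is a parenthesized, comma-separated list of single-quoted names, which is the shape the `IN (...)` clause of a `PIVOT` statement expects.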

## 1.3 SQL Implementation

In [None]:
%%sql product_recommendation_df <<

# ------------------------------------------------------
# 1. Join users, sales, and products to get purchase history
# ------------------------------------------------------
WITH cte_base AS (
  SELECT
    users.user_id,
    products.product_name,

    # Expand array of user attributes into individual rows
    UNNEST(users.attributes) AS attribute_id

  FROM users

  # Join to sales to track which user bought which product
  INNER JOIN sales
    ON users.user_id = sales.user_id

  # Join to product catalog to get product name
  INNER JOIN products
    ON sales.product_id = products.product_id

  # Filter to purchases within the experiment window
  WHERE sales.transaction_date BETWEEN DATE '2024-01-01' AND DATE '2025-12-31'
),

# ------------------------------------------------------
# 2. Join attributes table to get human-readable attribute names
# ------------------------------------------------------
cte_user_attributes AS (
  SELECT
    base.user_id,
    base.product_name,

    # Convert attribute names to lowercase for consistency
    LOWER(attributes.attribute_name) AS attribute_name

  FROM cte_base AS base

  # Join to attribute dictionary table
  INNER JOIN attributes
    ON base.attribute_id = attributes.attribute_id
)

# ------------------------------------------------------
# 3. Pivot user attributes into one-hot encoded columns
# ------------------------------------------------------
SELECT * FROM cte_user_attributes
PIVOT (
  # Each column indicates presence of an attribute for a user-product pair
  COUNT(*)
  FOR attribute_name IN (

    # List of user attributes to pivot into individual columns
    'big_box_store_visitor',
    'bird_watcher',
    'books_last_minute_trips',
    'bookstore_browser',
    'brand_comparison_shopper',
    'brand_loyal_shopper',
    'budget_planner',
    'bulk_buyer',
    'buy_now_pay_later_user',
    'buys_weekend_deals',
    'car_sharer',
    'cashless_shopper',
    'college_student',
    'commutes_by_bike',
    'commutes_by_car',
    'commutes_by_transit',
    'connects_to_public_wi_fi',
    'convenience_focused',
    'coupon_user',
    'digital_nomad',
    'discount_seeker',
    'downloads_offline_maps',
    'early_adopter',
    'early_riser',
    'eats_out_often',
    'eco-conscious_consumer',
    'eco_friendly',
    'engages_with_newsletters',
    'enjoys_quiet_places',
    'family_with_kids',
    'fitness_focused',
    'frequent_coffee_buyer',
    'frequent_pharmacy_visitor',
    'frequent_returns',
    'frequent_subscriber',
    'frequent_takeout_customer',
    'fuel_cost_sensitive',
    'gift_card_giver',
    'goes_solo',
    'grocery_delivery_user',
    'heavy_phone_user',
    'high_end_grocery_buyer',
    'history_buff',
    'holiday_sale_shopper',
    'home_cook',
    'home_improvement_spender',
    'impulse_buyer',
    'international_tourist',
    'last_minute_buyer',
    'likes_camping',
    'likes_luxury_trips',
    'likes_museums',
    'local_market_shopper',
    'loves_hiking',
    'loyalty_program_member',
    'mobile_grocery_app_user',
    'monthly_budget_adjuster',
    'mountain_explorer',
    'multichannel_shopper',
    'nature_lover',
    'night_owl',
    'organic_food_buyer',
    'owns_electric_vehicle',
    'pays_with_mobile_apps',
    'pet_friendly_traveler',
    'photography_enthusiast',
    'plans_trips_in_advance',
    'posts_on_instagram',
    'prefers_chain_stores',
    'prefers_contactless',
    'prefers_delivery_over_pickup',
    'prefers_gift_experiences',
    'prefers_local_brands',
    'prefers_weekday_travel',
    'price_comparer',
    'rents_cars_for_travel',
    'retired_traveler',
    'rewards_points_user',
    'road_trip_fan',
    'rural_resident',
    'shares_travel_on_social_media',
    'shops_online_frequently',
    'smart_home_user',
    'solo_female_traveler',
    'spiritual_traveler',
    'stationery_and_supplies_shopper',
    'streams_music_outdoors',
    'subscribes_to_meal_kits',
    'subscription_box_user',
    'suburban_resident',
    'sunset_seeker',
    'takes_group_tours',
    'takes_lots_of_selfies',
    'tech_gadget_buyer',
    'tracks_steps_on_phone',
    'travel_blogger',
    'travels_on_a_budget',
    'travels_with_family',
    'travels_with_partner',
    'urban_resident',
    'uses_google_maps',
    'uses_smart_watch',
    'uses_translation_apps',
    'uses_travel_apps',
    'walks_to_work',
    'waterfall_chaser',
    'weekly_grocery_shopper',
    'wildlife_spotter',
    'young_professional'
  )
);

## 1.4 Inspecting Outputs

In [None]:
%sql SELECT * FROM product_recommendation_df LIMIT 5;

# 2. Binary Classification

We’re creating a dataset to predict whether a user will make a **purchase in the next 30 days**.

This window can be adjusted — shorter windows may lack signal, while longer ones reduce actionability.

---

## 2.1 Labeled Dataset and Time Splits

To avoid data leakage, we use **out-of-time validation** where future time periods are held out for evaluation:

- **Training**: 2024  
- **Validation**: Jan–Jun 2025  
- **Test**: Jul–Dec 2025

---

## 2.2 Target Labeling Logic

Here’s how we assign the binary target:

- Users who purchased in 2024 → `label = 1`, with a `label_date` randomly 1–30 days before the purchase.
- All other users → randomly assign a `label_date` between Jan 2024 and Dec 2025.
- Look ahead 30 days from each `label_date`:
  - If a purchase is found → `label = 1`  
  - Otherwise → `label = 0`

This simulates real prediction windows across all sets.
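
The lookahead rule can be sketched as a small Python function. This is a minimal illustration, assuming a purchase counts only if it falls strictly after the `label_date` and no more than 30 days later; the `label_for` helper and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def label_for(label_date, purchase_dates, window_days=30):
    """Return 1 if any purchase falls 1 to window_days days after label_date."""
    window_end = label_date + timedelta(days=window_days)
    return int(any(label_date < d <= window_end for d in purchase_dates))


label_date = date(2025, 3, 1)
print(label_for(label_date, [date(2025, 3, 15)]))  # purchase inside the window
print(label_for(label_date, [date(2025, 2, 20)]))  # purchase before the label date
print(label_for(label_date, [date(2025, 4, 15)]))  # purchase after the window
```

The lower bound matters: without it, a purchase made before the `label_date` would also satisfy a "within 30 days" check and be counted as a positive.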

---

### 2.2.1 Edge Case Handling

We **exclude** records where:

- `label_date` doesn’t allow for a full 30-day lookahead in the same period
- Examples: December 2024 (training), June 2025 (validation), December 2025 (test)

This avoids future data leakage during training.
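
One way to express the split assignment together with the edge-case exclusions is a lookup over date ranges. The sketch below assumes month-level cutoffs taken from the lists above (the final month of each period is dropped); `PERIODS` and `assign_period` are hypothetical names.

```python
from datetime import date

# Each split keeps only label dates that leave room for a full 30-day lookahead
PERIODS = [
    ("train", date(2024, 1, 1), date(2024, 11, 30)),
    ("validation", date(2025, 1, 1), date(2025, 5, 31)),
    ("test", date(2025, 7, 1), date(2025, 11, 30)),
]


def assign_period(label_date):
    """Return the split name for a label date, or None if it is excluded."""
    for name, start, end in PERIODS:
        if start <= label_date <= end:
            return name
    return None


print(assign_period(date(2024, 6, 15)))   # falls in the training range
print(assign_period(date(2024, 12, 10)))  # excluded: December 2024
```

Records that come back as `None` are the ones dropped from the final dataset.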

---

## 2.3 SQL Logic Summary

The SQL query builds this labeled dataset using the following steps:

**Step-by-Step Summary of CTEs**

- **`cte_sale_events`**  
  Joins users with their sales and flags transactions in **2024** as part of the training period.

- **`cte_train_positive_records`**  
  For users with 2024 sales, assigns a `label_date` randomly up to 30 days before the purchase,  
  with `label = 1` and `period = 'train'`.

- **`cte_other_users`**  
  Selects all users **not already labeled as training positives** and randomly assigns them a  
  `label_date` between Jan 2024 and Dec 2025.

- **`cte_other_records`**  
  For each `label_date`, checks if a purchase occurs in the **next 30 days** to assign a  
  `label` of `1` or `0`, and tags each record with a **train/validation/test** period based on the date.

- **`cte_combined`**  
  Combines:
  - All **valid training positives** (excluding Dec 2024), and  
  - All **other labeled records** with valid periods  
  into a single dataset ready for modeling.

> ⚠️ **Note on Randomness in DuckDB**  
> The `RANDOM()` function in DuckDB is **not reproducible by default**, meaning it generates different values on each run.  
> For consistent sampling (e.g. in experiments or production pipelines), consider exporting to Python and applying a fixed random seed there.
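
As a rough sketch of the Python route mentioned in the note, label dates can be drawn with a seeded generator from the standard library. The `sample_label_dates` helper, the seed value, and the user IDs are hypothetical; only the date range comes from the labeling logic above.

```python
import random
from datetime import date, timedelta


def sample_label_dates(user_ids, start, end, seed=42):
    """Assign each user a random label date in [start, end], reproducibly."""
    rng = random.Random(seed)  # fixed seed gives the same draws on every run
    span_days = (end - start).days
    return {
        user_id: start + timedelta(days=rng.randint(0, span_days))
        for user_id in user_ids
    }


# Hypothetical user IDs; the date range matches the labeling window
first_run = sample_label_dates([101, 102, 103], date(2024, 1, 1), date(2025, 12, 31))
second_run = sample_label_dates([101, 102, 103], date(2024, 1, 1), date(2025, 12, 31))
print(first_run == second_run)  # True: identical seeds give identical dates
```

The resulting dates could then be loaded back into the notebook as a DataFrame and joined in SQL, in place of the `RANDOM()` expressions.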

---



In [None]:
%%sql labeled_dataset_df <<
# ------------------------------------------------------
# 1. Join users with sales and flag whether it's part of the training window
# ------------------------------------------------------
WITH cte_sale_events AS (
  SELECT
    users.user_id,
    sales.transaction_date,

    # Flag sales in 2024 as eligible for training
    CASE 
      WHEN transaction_date BETWEEN DATE '2024-01-01' AND DATE '2024-12-31' THEN 1
      ELSE 0
    END AS training_flag

  FROM users
  INNER JOIN sales ON users.user_id = sales.user_id

  # Use 2024–2025 to allow for label windows that look 30 days into the future
  WHERE sales.transaction_date BETWEEN DATE '2024-01-01' AND DATE '2025-12-31'
),

# ------------------------------------------------------
# 2. Create training positive examples (users who made a purchase in 2024)
# ------------------------------------------------------
cte_train_positive_records AS (
  SELECT
    user_id,

    # Randomly assign a label_date 1 to 30 days before the sale
    # FLOOR keeps the offset within 1 to 30 (a plain CAST rounds and can yield 31)
    transaction_date - INTERVAL (CAST(FLOOR(1 + RANDOM() * 30) AS INTEGER)) DAYS AS label_date,

    1 AS label,
    'train' AS period
  FROM cte_sale_events
  WHERE training_flag = 1
),

# ------------------------------------------------------
# 3. Select users with no training purchases and randomly assign label dates
# ------------------------------------------------------
cte_other_users AS (
  SELECT
    user_id,

    # Assign random label dates between Jan 1 2024 and Dec 31 2025
    # That range spans 731 days, so the offset runs from 0 to 730
    DATE '2024-01-01' + INTERVAL (CAST(FLOOR(RANDOM() * 731) AS INTEGER)) DAYS AS label_date
  FROM users

  # Exclude users who already have a labeled positive record
  WHERE NOT EXISTS (
    SELECT 1
    FROM cte_train_positive_records AS train
    WHERE users.user_id = train.user_id
  )
),

# ------------------------------------------------------
# 4. For the rest of the users, label them based on whether a sale happens within 30 days
# ------------------------------------------------------
cte_other_records AS (
  SELECT
    users.user_id,
    users.label_date,

    # Label = 1 if any sale occurs 1 to 30 days after the label date
    # The lower bound stops purchases made before label_date from counting
    MAX(
      CASE
        WHEN DATE_DIFF('DAY', users.label_date, sales.transaction_date) BETWEEN 1 AND 30 THEN 1
        ELSE 0
      END
    ) AS label,

    # Assign split period based on label_date
    CASE
      WHEN users.label_date BETWEEN DATE '2024-01-01' AND DATE '2024-11-30' THEN 'train'
      WHEN users.label_date BETWEEN DATE '2025-01-01' AND DATE '2025-05-31' THEN 'validation'
      WHEN users.label_date BETWEEN DATE '2025-07-01' AND DATE '2025-11-30' THEN 'test'
      ELSE NULL
    END AS period

  FROM cte_other_users AS users
  LEFT JOIN cte_sale_events AS sales
    ON users.user_id = sales.user_id

  # Collapse to one record per user and label date
  GROUP BY users.user_id, users.label_date
),

# ------------------------------------------------------
# 5. Combine positive and negative samples for training, validation, and test
# ------------------------------------------------------
cte_combined AS (
  SELECT * 
  FROM cte_train_positive_records 
  WHERE label_date <= DATE '2024-11-30'   # Ensure train cutoff

  UNION ALL

  SELECT * 
  FROM cte_other_records 
  WHERE period IS NOT NULL                # Only keep records with a defined period
)

# ------------------------------------------------------
# 6. Final labeled dataset for supervised learning
# ------------------------------------------------------
SELECT
  user_id,
  label_date,
  label,
  period
FROM cte_combined;


### 2.3.1 Verify Training Labels

Let's now analyze the binary target labels (`label = 1` or `0`) and verify how they are distributed across the different dataset splits:

- **`train`**
- **`validation`**
- **`test`**

For each split (`period`), we'll report the following metrics:

- The total number of labeled records (`record_count`)
- The proportion of positive labels (`positive_rate`), calculated as:  
  `SUM(label) / COUNT(*)`

This helps validate that the dataset is balanced appropriately across time-based partitions and that the label distribution is consistent and reasonable for training and evaluation.
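
The same two metrics can be computed in a few lines of Python, which may help when checking the SQL output by hand. The `records` list and the `label_distribution` helper are hypothetical; in the notebook the rows come from `labeled_dataset_df`.

```python
from collections import defaultdict

# Hypothetical labeled records standing in for labeled_dataset_df
records = [
    {"period": "train", "label": 1},
    {"period": "train", "label": 0},
    {"period": "train", "label": 0},
    {"period": "train", "label": 1},
    {"period": "validation", "label": 0},
    {"period": "validation", "label": 1},
    {"period": "test", "label": 0},
]


def label_distribution(rows):
    """Return record_count and positive_rate (SUM(label) / COUNT(*)) per period."""
    totals = defaultdict(lambda: {"record_count": 0, "positives": 0})
    for row in rows:
        totals[row["period"]]["record_count"] += 1
        totals[row["period"]]["positives"] += row["label"]
    return {
        period: {
            "record_count": t["record_count"],
            "positive_rate": t["positives"] / t["record_count"],
        }
        for period, t in totals.items()
    }


distribution = label_distribution(records)
for period, metrics in distribution.items():
    print(period, metrics)
```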


In [None]:
%%sql
# ------------------------------------------------------
# 1. Analyze label distribution across dataset splits
# ------------------------------------------------------
SELECT
  period,                            # Dataset split: 'train', 'validation', 'test'
  
  COUNT(*) AS record_count,          # Total number of labeled records in each period

  # Positive label rate = proportion of label = 1
  SUM(label) / COUNT(*) AS positive_rate

FROM labeled_dataset_df

# ------------------------------------------------------
# 2. Group by period to view distribution breakdown
# ------------------------------------------------------
GROUP BY period;

## 2.4 Feature Engineering Overview

To train a meaningful model, we combine two types of features:

- **User attributes** (like preferences and behaviors)
- **Recent activity**, such as how many times a user visited in the past 30 days

Even though user attributes could change over time, we treat them as static for simplicity in this case study.

### 2.4.1 SQL Implementation

**Step-by-Step Breakdown of CTEs**

1. `cte_user_attributes_base`
   - Joins each labeled record with user profile data.
   - Expands each user’s list of attribute IDs into individual rows using `UNNEST`.

2. `cte_user_attributes`
   - Maps each attribute ID to a readable attribute name (e.g. `eco_friendly`, `coupon_user`).
   - Assigns a value of `1` to indicate the presence of each attribute (for one-hot encoding).

3. `cte_user_visits`
   - Calculates the number of visits each user made **in the 30 days before their label date**.
   - Adds this as a numeric feature called `visits_last_30_days`.

4. `cte_combined_features`
   - Combines the one-hot user attributes and the 30-day visit counts into a single table (long format).

5. Final `SELECT` with `PIVOT`
   - Transforms the long-format table into wide-format:
     - Each unique `attribute_name` becomes its own column.
     - `MAX(attribute_value)` ensures correct numeric values are retained (e.g. visit counts or binary presence).

---

**Final Output**:  
A machine learning–ready dataset where each row represents a `(user_id, label_date)` pair, with:
- One-hot encoded user attributes  
- A numeric feature for recent visit activity  
- The binary label (target) for classification
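
The recent-activity feature can be sketched in Python as a windowed count. This is a minimal illustration, assuming the window covers the 30 days up to, but not including, the label date so that no activity from the label date onward leaks into the feature; the `visits_last_30_days` helper and the timestamps are hypothetical.

```python
from datetime import date, datetime, time, timedelta


def visits_last_30_days(label_date, visit_timestamps, window_days=30):
    """Count visits in the window_days before label_date (label_date excluded)."""
    window_end = datetime.combine(label_date, time.min)
    window_start = window_end - timedelta(days=window_days)
    return sum(window_start <= ts < window_end for ts in visit_timestamps)


# Hypothetical visit timestamps for a single user
visits = [
    datetime(2025, 2, 28, 12, 0),  # too early: outside the window
    datetime(2025, 3, 1, 0, 0),    # first instant of the window
    datetime(2025, 3, 30, 23, 0),  # day before the label date
    datetime(2025, 3, 31, 9, 0),   # on the label date: excluded
]
visit_count = visits_last_30_days(date(2025, 3, 31), visits)
print(visit_count)  # 2
```

Using a half-open interval avoids the ambiguity of `BETWEEN` on timestamps, where an end bound at midnight silently drops the rest of that day.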


In [None]:
%%sql labeled_binary_classification_dataset_df <<
# ------------------------------------------------------
# 1. Expand user attributes from labeled dataset
# ------------------------------------------------------
WITH cte_user_attributes_base AS (
  SELECT
    base.user_id,
    base.label,
    base.label_date,
    base.period,
    
    # Unnest each user’s attributes into individual rows
    UNNEST(users.attributes) AS attribute_id
  FROM labeled_dataset_df AS base
  INNER JOIN users ON base.user_id = users.user_id
),

# ------------------------------------------------------
# 2. Map attribute IDs to readable attribute names (one-hot style)
# ------------------------------------------------------
cte_user_attributes AS (
  SELECT
    base.user_id,
    base.label,
    base.label_date,
    base.period,
    LOWER(attributes.attribute_name) AS attribute_name,
    1 AS attribute_value  # Explicitly encode presence of attribute
  FROM cte_user_attributes_base AS base
  INNER JOIN attributes ON base.attribute_id = attributes.attribute_id
),

# ------------------------------------------------------
# 3. Compute visit count feature within 30 days before label_date
# ------------------------------------------------------
cte_user_visits AS (
  SELECT
    base.user_id,
    base.label,
    base.label_date,
    base.period,
    'visits_last_30_days' AS attribute_name,

    # Count visits in the 30 days before label_date (label_date itself excluded)
    # COUNT over a visits column returns 0 for users with no matching visits
    COUNT(visits.visit_id) AS attribute_value
  FROM labeled_dataset_df AS base
  LEFT JOIN visits
    ON base.user_id = visits.user_id
    AND visits.visit_timestamp >= base.label_date - INTERVAL 30 DAYS
    AND visits.visit_timestamp < base.label_date
  GROUP BY 1,2,3,4,5
),

# ------------------------------------------------------
# 4. Union behavioral and demographic features
# ------------------------------------------------------
cte_combined_features AS (
  SELECT * FROM cte_user_attributes
  UNION ALL
  SELECT * FROM cte_user_visits
)

# ------------------------------------------------------
# 5. Pivot attributes into wide-format one-hot encoded dataset
# ------------------------------------------------------
SELECT *
FROM cte_combined_features

PIVOT (
  # Use MAX to preserve numeric values (e.g. visit count) or presence (1)
  MAX(attribute_value)
  FOR attribute_name IN (
    'visits_last_30_days',
    'big_box_store_visitor',
    'bird_watcher',
    'books_last_minute_trips',
    'bookstore_browser',
    'brand_comparison_shopper',
    'brand_loyal_shopper',
    'budget_planner',
    'bulk_buyer',
    'buy_now_pay_later_user',
    'buys_weekend_deals',
    'car_sharer',
    'cashless_shopper',
    'college_student',
    'commutes_by_bike',
    'commutes_by_car',
    'commutes_by_transit',
    'connects_to_public_wi_fi',
    'convenience_focused',
    'coupon_user',
    'digital_nomad',
    'discount_seeker',
    'downloads_offline_maps',
    'early_adopter',
    'early_riser',
    'eats_out_often',
    'eco-conscious_consumer',
    'eco_friendly',
    'engages_with_newsletters',
    'enjoys_quiet_places',
    'family_with_kids',
    'fitness_focused',
    'frequent_coffee_buyer',
    'frequent_pharmacy_visitor',
    'frequent_returns',
    'frequent_subscriber',
    'frequent_takeout_customer',
    'fuel_cost_sensitive',
    'gift_card_giver',
    'goes_solo',
    'grocery_delivery_user',
    'heavy_phone_user',
    'high_end_grocery_buyer',
    'history_buff',
    'holiday_sale_shopper',
    'home_cook',
    'home_improvement_spender',
    'impulse_buyer',
    'international_tourist',
    'last_minute_buyer',
    'likes_camping',
    'likes_luxury_trips',
    'likes_museums',
    'local_market_shopper',
    'loves_hiking',
    'loyalty_program_member',
    'mobile_grocery_app_user',
    'monthly_budget_adjuster',
    'mountain_explorer',
    'multichannel_shopper',
    'nature_lover',
    'night_owl',
    'organic_food_buyer',
    'owns_electric_vehicle',
    'pays_with_mobile_apps',
    'pet_friendly_traveler',
    'photography_enthusiast',
    'plans_trips_in_advance',
    'posts_on_instagram',
    'prefers_chain_stores',
    'prefers_contactless',
    'prefers_delivery_over_pickup',
    'prefers_gift_experiences',
    'prefers_local_brands',
    'prefers_weekday_travel',
    'price_comparer',
    'rents_cars_for_travel',
    'retired_traveler',
    'rewards_points_user',
    'road_trip_fan',
    'rural_resident',
    'shares_travel_on_social_media',
    'shops_online_frequently',
    'smart_home_user',
    'solo_female_traveler',
    'spiritual_traveler',
    'stationery_and_supplies_shopper',
    'streams_music_outdoors',
    'subscribes_to_meal_kits',
    'subscription_box_user',
    'suburban_resident',
    'sunset_seeker',
    'takes_group_tours',
    'takes_lots_of_selfies',
    'tech_gadget_buyer',
    'tracks_steps_on_phone',
    'travel_blogger',
    'travels_on_a_budget',
    'travels_with_family',
    'travels_with_partner',
    'urban_resident',
    'uses_google_maps',
    'uses_smart_watch',
    'uses_translation_apps',
    'uses_travel_apps',
    'walks_to_work',
    'waterfall_chaser',
    'weekly_grocery_shopper',
    'wildlife_spotter',
    'young_professional'
  )
);


In [None]:
labeled_binary_classification_dataset_df.head()

# 3. Statistical Frameworks

In this section, we revisit A/B testing and introduce **statistical power analysis** — a key **pre-experiment** step used to estimate whether an experiment is likely to detect a meaningful effect.

**Key Concepts:**
- **Statistical power** is the probability of correctly detecting a true effect (i.e., rejecting the null hypothesis when the alternative is true).
- This is commonly set to **80%**, meaning a **20% chance of Type II error** is acceptable.
- Power depends on:
  - Expected number of observations (traffic volume)
  - Duration of the experiment
  - Anticipated conversion uplift

While our dataset already contains **post-experiment** data (so power analysis isn’t strictly necessary), understanding power analysis is critical for **experiment planning**.
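
To make the link between traffic, uplift, and power concrete, here is a minimal sketch of an approximate power calculation for a two-sided two-proportion z-test. It relies on the normal approximation with an unpooled standard error, which is a simplification; the `two_proportion_power` helper and the example inputs are hypothetical and are not taken from the notebook's SQL.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_power(p_control, p_target, n_control, n_target, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two conversion rates."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    standard_error = sqrt(
        p_control * (1 - p_control) / n_control
        + p_target * (1 - p_target) / n_target
    )
    z_effect = abs(p_target - p_control) / standard_error
    # Probability that the test statistic lands in either rejection region
    return std_normal.cdf(z_effect - z_crit) + std_normal.cdf(-z_effect - z_crit)


# Hypothetical plan: 9% baseline, 9.9% target rate, 8,000 visits per group
print(round(two_proportion_power(0.09, 0.099, 8_000, 8_000), 3))
# Quadrupling the traffic raises the power for the same effect
print(round(two_proportion_power(0.09, 0.099, 32_000, 32_000), 3))
```

When the two rates are equal, the function returns `alpha`, which is the false positive rate the test was designed for.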

## 3.1 Baseline Analysis

To ground this concept in real data, we’ll begin by examining **site traffic and conversion rates** from 2025 — a baseline period without experiments. This helps us estimate what’s feasible in future tests.

In [None]:
%%sql
# ------------------------------------------------------
# 1. Aggregate visit and sales metrics by calendar month
# ------------------------------------------------------
SELECT
  # Truncate visit timestamps to month for aggregation
  DATE_TRUNC('MON', visits.visit_timestamp) AS visit_month,

  # Count of unique visits per month
  COUNT(DISTINCT visits.visit_id) AS visits_count,

  # Count of unique sales per month
  COUNT(DISTINCT sales.sale_id) AS sales_count,

  # Monthly conversion rate = sales / visits
  sales_count / visits_count AS conversion_rate

FROM visits

# ------------------------------------------------------
# 2. Join sales to determine if each visit resulted in a purchase
# ------------------------------------------------------
LEFT JOIN sales
  ON visits.visit_id = sales.visit_id

# ------------------------------------------------------
# 3. Filter for the calendar year 2025
# ------------------------------------------------------
# A half-open range keeps visits made during the day on Dec 31
WHERE visits.visit_timestamp >= DATE '2025-01-01'
  AND visits.visit_timestamp < DATE '2026-01-01'

# ------------------------------------------------------
# 4. Group by month and order chronologically
# ------------------------------------------------------
GROUP BY 1
ORDER BY 1;


## 3.2 Experiment Period Analysis

We observe that monthly traffic averages around **16,000–17,000 visits**, with conversion rates rising from **~6.5% during off-season** to nearly **9% in peak months** (April to August).

If we plan to run our experiment during **April to June 2026**, we can use these **higher-performing months** as our baseline for power analysis.

By averaging traffic and conversion metrics from this period, we can estimate the volume needed to detect different levels of conversion uplift and ensure our experiment is statistically valid.

In [None]:
%%sql
# ------------------------------------------------------
# 1. Compute monthly visit and sales metrics for April–June 2025
# ------------------------------------------------------
WITH cte_base AS (
  SELECT
    # Truncate visit timestamps to month
    DATE_TRUNC('MON', visits.visit_timestamp) AS visit_month,

    # Count of unique visits per month
    COUNT(DISTINCT visits.visit_id) AS visits_count,

    # Count of unique sales per month
    COUNT(DISTINCT sales.sale_id) AS sales_count,

    # Conversion rate = sales / visits
    sales_count / visits_count AS conversion_rate

  FROM visits
  LEFT JOIN sales ON visits.visit_id = sales.visit_id

  # Filter to Q2 2025 only
  WHERE visits.visit_timestamp BETWEEN DATE '2025-04-01' AND DATE '2025-06-30'

  GROUP BY 1
)

# ------------------------------------------------------
# 2. Compute average metrics across the 3-month baseline period
# ------------------------------------------------------
SELECT
  AVG(visits_count) AS visits,
  AVG(sales_count) AS sales,
  AVG(conversion_rate) AS conversion_rate
FROM cte_base;


## 3.3 Statistical Power Analysis

To understand the impact of test design on statistical power, we simulate a few different rollout scenarios:

**Traffic Allocation Scenarios:**
- **Scenario 1 – Split Test**: 50% target vs 50% control
- **Scenario 2 – All-In Test**: 90% target vs 10% control
- **Scenario 3 – Canary Test**: 10% target vs 90% control

Each scenario represents a different strategy for deploying AI-powered features to users.

**Expected Uplift Assumptions:**
- **Conservative**: 5% relative uplift on a 9% conversion rate (9.00% to 9.45%)
- **Moderate**: 10% relative uplift (9.00% to 9.90%)
- **Optimistic**: 15% relative uplift (9.00% to 10.35%)

Using these inputs, we simulate power calculations in SQL to estimate how likely each design is to reach statistical significance.  
We then filter results to highlight only those scenarios with **≥ 50% statistical power**, which corresponds to a **≤ 50% chance of a Type II error** (missing a real effect). This is a lenient floor that can be acceptable in practice, although 80% power is the conventional target.
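
Before running the simulation, it can help to sanity-check the scenarios with a closed-form approximation. The sketch below is not part of the course query: it assumes a normal approximation, a one-tailed test at the 5% level, and the 4,000 weekly visits and 12-week duration used later in this section. The function name `approx_power` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(base_rate, relative_uplift, n_target, n_control, alpha=0.05):
    """Approximate one-tailed power for a difference in two proportions."""
    target_rate = base_rate * (1 + relative_uplift)
    # Standard error of the difference, using the assumed true rates
    se = sqrt(
        target_rate * (1 - target_rate) / n_target
        + base_rate * (1 - base_rate) / n_control
    )
    z_crit = NormalDist().inv_cdf(1 - alpha)  # roughly 1.645 for alpha = 0.05
    return NormalDist().cdf((target_rate - base_rate) / se - z_crit)


weekly_visits = 4000  # assumed weekly traffic
weeks = 12            # assumed experiment duration

for label, target_share in [("Split", 0.5), ("All-In", 0.9), ("Canary", 0.1)]:
    n_target = round(weekly_visits * weeks * target_share)
    n_control = weekly_visits * weeks - n_target
    for uplift in (0.05, 0.10, 0.15):
        power = approx_power(0.09, uplift, n_target, n_control)
        print(f"{label:7s} uplift={uplift:.0%} power={power:.2f}")
```

The approximation should land close to the simulated power values, so a large gap between the two is a hint that something is off in the simulation setup.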


## 3.4 SQL Implementation

This query performs a **statistical power analysis** using Monte Carlo simulation to evaluate different A/B test setups. It estimates how likely each test scenario is to detect a meaningful uplift in conversion rates.

---

### 3.4.1 Query Goals

Estimate **statistical power** (i.e., the probability of detecting a true effect) for various combinations of:

- **Conversion uplifts**: 5%, 10%, 15%
- **Target/control traffic splits**: 90/10, 50/50, 10/90
- **Experiment durations**: 4, 8, or 12 weeks

The analysis assumes a baseline conversion rate of **9%** and **4,000 weekly visits**.
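
As a rough illustration of the scenario grid, the following Python sketch enumerates the same combinations and derives the group sizes. It mirrors what the `params` CTE is described as doing; the variable names are illustrative and not taken from the query.

```python
from itertools import product

WEEKLY_VISITS = 4000             # assumed weekly traffic
uplifts = [0.05, 0.10, 0.15]     # relative conversion uplifts
target_shares = [0.9, 0.5, 0.1]  # share of traffic sent to the target group
durations_weeks = [4, 8, 12]     # experiment durations

scenarios = []
for uplift, share, weeks in product(uplifts, target_shares, durations_weeks):
    total_visits = WEEKLY_VISITS * weeks
    n_target = round(total_visits * share)
    scenarios.append({
        "uplift": uplift,
        "target_share": share,
        "weeks": weeks,
        "n_target": n_target,
        "n_control": total_visits - n_target,
    })

print(len(scenarios))  # 3 uplifts x 3 splits x 3 durations = 27 scenarios
print(scenarios[0])
```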

---

### 3.4.2 Step-by-Step Breakdown

1. **`params` – Define test scenarios**  
   Generates all combinations of uplift, traffic split, and duration. Calculates the expected number of users in target and control groups.

2. **`cte_simulated_target_trials` – Simulate 1,000 experiments (target group)**  
   For each scenario, simulates whether each user in the **target group** converts using a higher conversion rate (e.g., 9% * 1.10). Each simulation consists of randomly generated outcomes per user.

3. **`cte_target_results` – Aggregate target simulations**  
   Computes the conversion rate in each simulation by averaging user-level outcomes.

4. **`cte_simulated_control_trials` – Simulate 1,000 experiments (control group)**  
   Repeats the same process as the target group but uses the baseline conversion rate (9%).

5. **`cte_control_results` – Aggregate control simulations**  
   Calculates the conversion rate for the control group in each simulation.

6. **`cte_joint_trials` – Compare target vs control**  
   For each simulated trial:
   - Computes the **difference** in conversion rates
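   - Calculates the **standard error** of the difference from the assumed target and control rates (named `pooled_standard_error` in the query, although the two variances are summed rather than pooled)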
   - Derives the **z-score** to test for a statistically significant difference

7. **Final SELECT – Estimate statistical power**  
   For each scenario:
   - Counts how many simulations produced a **z-score > 1.645** (significant at the 95% confidence level, one-tailed)
   - Computes the **estimated power** as the proportion of significant results
   - Filters to show only test scenarios with **power ≥ 50%**
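
The steps above can be sketched for a single scenario in plain Python. This is a simplified stand-in for the SQL, not a translation of it: it uses 300 replications instead of 1,000 to keep the runtime short, a fixed seed for repeatability, and a hypothetical helper name `simulate_power`.

```python
import random
from math import sqrt


def simulate_power(base_rate, relative_uplift, n_target, n_control,
                   n_sims=300, z_crit=1.645, seed=42):
    """Estimate power as the share of simulated experiments that reject the null."""
    rng = random.Random(seed)
    target_rate = base_rate * (1 + relative_uplift)
    # Standard error from the assumed true rates, as in the steps above
    se = sqrt(
        target_rate * (1 - target_rate) / n_target
        + base_rate * (1 - base_rate) / n_control
    )
    rejections = 0
    for _ in range(n_sims):
        # One Bernoulli draw per simulated user in each group
        target_conversions = sum(rng.random() < target_rate for _ in range(n_target))
        control_conversions = sum(rng.random() < base_rate for _ in range(n_control))
        difference = target_conversions / n_target - control_conversions / n_control
        if difference / se > z_crit:
            rejections += 1
    return rejections / n_sims


# 50/50 split over 4 weeks at 4,000 weekly visits, 15% relative uplift
power = simulate_power(0.09, 0.15, n_target=8000, n_control=8000)
print(f"estimated power: {power:.2f}")
```

Because the estimate is itself random, expect it to move by a few percentage points between seeds at this number of replications.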

---

### 3.4.3 SQL Summary
This SQL-based simulation helps you understand:
- How sample size, test duration, and expected uplift affect your test’s ability to detect real effects
- Which experimental designs are likely to be successful before launching a live A/B test

It’s a practical way to make **data-informed decisions** when planning AI product rollouts or feature experiments.

In [None]:
%%sql
# ------------------------------------------------------
# 1. Define simulation parameters and label each scenario
# ------------------------------------------------------
WITH params AS (
  SELECT 
    0.09::DOUBLE AS conversion_rate,                         # Baseline conversion rate
    conversion_uplift.unnest AS conversion_uplift,           # Uplift scenarios: 5%, 10%, 15%
    target_ratio.unnest AS target_ratio,                     # Target group allocation: 90%, 50%, 10%
    duration_weeks.unnest AS duration_weeks,                 # Experiment duration: 4, 8, 12 weeks
    1 - target_ratio.unnest AS control_ratio,
    4000 AS weekly_volume,
    
    # Add readable labels based on traffic split
    CASE 
      WHEN target_ratio.unnest = 0.5 THEN 'Split Test (50% Target)'
      WHEN target_ratio.unnest = 0.9 THEN 'All-In Test (90% Target)'
      WHEN target_ratio.unnest = 0.1 THEN 'Canary Test (10% Target)'
      ELSE 'Custom Split'
    END AS test_label,
    
    # Label for uplift magnitude
    CASE 
      WHEN conversion_uplift.unnest = 1.05 THEN 'Conservative Uplift (5%)'
      WHEN conversion_uplift.unnest = 1.10 THEN 'Moderate Uplift (10%)'
      WHEN conversion_uplift.unnest = 1.15 THEN 'Optimistic Uplift (15%)'
      ELSE 'Custom Uplift'
    END AS uplift_label,
    
    ROW_NUMBER() OVER (ORDER BY conversion_uplift, target_ratio, duration_weeks) AS trial_id
  FROM UNNEST([1.05, 1.1, 1.15]) AS conversion_uplift
  CROSS JOIN UNNEST([0.9, 0.5, 0.1]) AS target_ratio
  CROSS JOIN UNNEST([4, 8, 12]) AS duration_weeks
),

# ------------------------------------------------------
# 2. Simulate 1,000 experiments under the alternative hypothesis (uplift applied)
# ------------------------------------------------------
cte_simulated_target_trials AS (
  SELECT
    trial_id,
    sample_id.unnest AS sample_id,                           # Simulated experiment ID
    conversion_rate * conversion_uplift AS rate,             # Increased conversion rate
    weekly_volume,
    duration_weeks,
    weekly_volume * duration_weeks * target_ratio AS volume, # Target group volume
    UNNEST(RANGE(1, ROUND(volume)::BIGINT + 1)) AS draw_id,  # Simulate each user (RANGE end is exclusive)
    RANDOM() AS random_variable                              # Random number to simulate outcome
  FROM params
  CROSS JOIN UNNEST(RANGE(1, 1001)) AS sample_id             # 1000 replications per scenario
),

# ------------------------------------------------------
# 3. Aggregate target results: compute simulated conversion rates
# ------------------------------------------------------
cte_target_results AS (
  SELECT
    trial_id,
    sample_id,
    rate,
    volume,
    SUM(CASE WHEN random_variable < rate THEN 1 ELSE 0 END)::FLOAT / volume AS probability
  FROM cte_simulated_target_trials
  GROUP BY 1,2,3,4
),

# ------------------------------------------------------
# 4. Simulate control group (null hypothesis) for the same 1,000 trials
# ------------------------------------------------------
cte_simulated_control_trials AS (
  SELECT
    trial_id,
    sample_id.unnest AS sample_id,
    conversion_rate AS rate,                                 # Baseline conversion rate (no uplift)
    weekly_volume,
    duration_weeks,
    weekly_volume * duration_weeks * control_ratio AS volume,
    UNNEST(RANGE(1, ROUND(volume)::BIGINT + 1)) AS draw_id,  # RANGE end is exclusive
    RANDOM() AS random_variable
  FROM params
  CROSS JOIN UNNEST(RANGE(1, 1001)) AS sample_id             # 1000 replications per scenario
),

# ------------------------------------------------------
# 5. Aggregate control results
# ------------------------------------------------------
cte_control_results AS (
  SELECT
    trial_id,
    sample_id,
    rate,
    volume,
    SUM(CASE WHEN random_variable < rate THEN 1 ELSE 0 END)::FLOAT / volume AS probability
  FROM cte_simulated_control_trials
  GROUP BY 1,2,3,4
),

# ------------------------------------------------------
# 6. Join target and control simulations to compute z-scores
# ------------------------------------------------------
cte_joint_trials AS (
  SELECT
    target.trial_id,
    target.sample_id,
    target.volume AS target_volume,
    control.volume AS control_volume,
    target.probability AS target_probability,
    control.probability AS control_probability,
    target.probability - control.probability AS difference,
    # Standard error for difference of proportions (unpooled, uses the assumed true rates)
    SQRT(
      (target.rate * (1 - target.rate)) / target.volume +
      (control.rate * (1 - control.rate)) / control.volume
    ) AS pooled_standard_error,
    # Z-score for hypothesis test
    difference::FLOAT / pooled_standard_error AS z_score
  FROM cte_target_results AS target
  INNER JOIN cte_control_results AS control
    ON target.trial_id = control.trial_id
    AND target.sample_id = control.sample_id
)

# ------------------------------------------------------
# 7. Final Output: Include test label and power estimate
# ------------------------------------------------------
SELECT
  params.trial_id,
  params.test_label,
  params.uplift_label,
  params.conversion_rate,
  params.conversion_uplift,
  params.target_ratio,
  params.duration_weeks,
  trials.target_volume,
  trials.control_volume,
  SUM(CASE WHEN trials.z_score > 1.645 THEN 1 ELSE 0 END) AS null_rejected,
  null_rejected::FLOAT / COUNT(trials.*) AS estimated_power
FROM cte_joint_trials AS trials
INNER JOIN params
  ON trials.trial_id = params.trial_id
GROUP BY 1,2,3,4,5,6,7,8,9
HAVING estimated_power >= 0.5
ORDER BY estimated_power DESC;

## 3.5 Experiment Evaluation

In this section we will analyze our A/B test results using a framework similar to the one we used for the natural language processing challenge.

The following SQL pipeline evaluates the effectiveness of a new **product recommendation** feature using visit-level data from Q2 2026.

It compares **target group users** (feature enabled) to **control group users** (feature disabled) using standard A/B testing techniques.

---

### 3.5.1 Step-by-Step Breakdown

1. **Join Event Data into a Unified Table**  
   Combines visit logs with optional feature flags, sales transactions, and product pricing to build a clean analysis-ready dataset.

2. **Aggregate Control Group Metrics**  
   Calculates key metrics (visit count, sales count, total revenue, and conversion rate) for users who **did not** see the feature.

3. **Aggregate Target Group Metrics**  
   Performs the same aggregation for users who **did** see the feature.

4. **Compare Groups Side-by-Side**  
   Cross-joins the control and target summaries to make them available for side-by-side analysis in a single row.

5. **Calculate Uplift and Confidence Intervals**  
   Computes the **absolute uplift** in conversion rate between target and control groups, and calculates **95% confidence intervals** for each group's conversion rate using standard error formulas.

6. **Run Z-Test and Estimate Business Impact**  
   - Computes a **z-score** and flags results as "Significant" if uplift is statistically valid at the 95% confidence level  
   - Estimates **incremental conversions** and **incremental revenue** gained from enabling the feature  
   - Projects a **baseline revenue** for comparison by estimating what the target group would have earned at the control group's conversion rate

7. **Final Output**  
   Returns all key experiment metrics — conversion rates, uplift, statistical confidence, and estimated business value — in a single summary view.

---

This code helps us understand whether the feature rollout had a **meaningful and statistically significant impact** on user behavior and revenue, using well-established experimental design principles.
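
The statistical core of steps 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration under the same formulas (unpooled standard error, one-tailed z-test at 1.645, 95% intervals at 1.96); the function name `evaluate_ab_test` is hypothetical, and the counts in the usage example are the ones reported in the results summary in section 3.6.

```python
from math import sqrt


def evaluate_ab_test(control_visits, control_sales, target_visits, target_sales,
                     z_test=1.645, z_ci=1.96):
    """Return uplift, confidence interval, and z-test result for two groups."""
    control_rate = control_sales / control_visits
    target_rate = target_sales / target_visits
    uplift = target_rate - control_rate
    # Standard error of the difference in conversion rates
    uplift_se = sqrt(
        target_rate * (1 - target_rate) / target_visits
        + control_rate * (1 - control_rate) / control_visits
    )
    z_score = uplift / uplift_se
    return {
        "control_rate": control_rate,
        "target_rate": target_rate,
        "absolute_uplift": uplift,
        "uplift_ci": (uplift - z_ci * uplift_se, uplift + z_ci * uplift_se),
        "z_score": z_score,
        "significant": z_score >= z_test,
    }


result = evaluate_ab_test(
    control_visits=25425, control_sales=1780,
    target_visits=25424, target_sales=4140,
)
print(f"z = {result['z_score']:.2f}, uplift = {result['absolute_uplift']:.4f}")
```

Keeping the test threshold (`z_test`) separate from the interval multiplier (`z_ci`) makes it explicit that the significance test is one-tailed while the confidence intervals are two-sided.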


In [None]:
%%sql experiment_results_df <<

# ------------------------------------------------------
# 1. Join visits with feature flags, sales, and product data
# ------------------------------------------------------
WITH cte_base AS (
  SELECT
    visits.visit_timestamp,
    visits.visit_id,
    visits.user_id,

    # Flag whether the feature was active for this visit
    CASE WHEN features.feature IS NOT NULL THEN 1 ELSE 0 END AS feature_flag,

    # Flag whether a sale occurred during this visit
    CASE WHEN sales.sale_id IS NOT NULL THEN 1 ELSE 0 END AS sale_flag,

    # Capture sale amount; default to 0 if no product linked
    COALESCE(products.price_usd, 0) AS sale_amount

  FROM visits
  LEFT JOIN features 
    ON visits.visit_id = features.visit_id
  LEFT JOIN sales 
    ON visits.visit_id = sales.visit_id
  LEFT JOIN products 
    ON sales.product_id = products.product_id
  WHERE visits.visit_timestamp BETWEEN DATE '2026-04-01' AND DATE '2026-06-30'
),

# ------------------------------------------------------
# 2. Aggregate control group metrics (feature_flag = 0)
# ------------------------------------------------------
cte_control AS (
  SELECT
    COUNT(DISTINCT visit_id) AS control_visit_count,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END) AS control_sales_count,
    SUM(sale_amount) AS control_sales_amount,
    control_sales_count / control_visit_count AS control_conversion_rate
  FROM cte_base
  WHERE feature_flag = 0
),

# ------------------------------------------------------
# 3. Aggregate target group metrics (feature_flag = 1)
# ------------------------------------------------------
cte_target AS (
  SELECT
    COUNT(DISTINCT visit_id) AS target_visit_count,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END) AS target_sales_count,
    SUM(sale_amount) AS target_sales_amount,
    target_sales_count / target_visit_count AS target_conversion_rate
  FROM cte_base
  WHERE feature_flag = 1
),

# ------------------------------------------------------
# 4. Combine control and target group metrics
# ------------------------------------------------------
cte_combined AS (
  SELECT
    control.*,
    target.*
  FROM cte_target AS target
  CROSS JOIN cte_control AS control
),

# ------------------------------------------------------
# 5. Calculate uplift and confidence intervals
# ------------------------------------------------------
cte_stats AS (
  SELECT
    *,
    target_conversion_rate - control_conversion_rate AS absolute_uplift,

    # 95% Confidence Interval for target group
    target_conversion_rate - 1.96 * SQRT((target_conversion_rate * (1 - target_conversion_rate)) / target_visit_count) AS target_ci_lower,
    target_conversion_rate + 1.96 * SQRT((target_conversion_rate * (1 - target_conversion_rate)) / target_visit_count) AS target_ci_upper,

    # 95% Confidence Interval for control group
    control_conversion_rate - 1.96 * SQRT((control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count) AS control_ci_lower,
    control_conversion_rate + 1.96 * SQRT((control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count) AS control_ci_upper,

    # Standard error for difference in conversion rates
    SQRT(
      (target_conversion_rate * (1 - target_conversion_rate)) / target_visit_count +
      (control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count
    ) AS uplift_se
  FROM cte_combined
),

# ------------------------------------------------------
# 6. Compute z-score and estimate impact metrics
# ------------------------------------------------------
cte_zscore AS (
  SELECT
    *,
    absolute_uplift / uplift_se AS z_score,

    # One-tailed 95% significance test
    CASE 
      WHEN absolute_uplift / uplift_se >= 1.645 THEN 'Significant'
      ELSE 'Not Significant'
    END AS test_result,

    # Confidence interval for uplift
    absolute_uplift - 1.96 * uplift_se AS uplift_ci_lower,
    absolute_uplift + 1.96 * uplift_se AS uplift_ci_upper,

    # Estimated number of incremental conversions
    target_visit_count * absolute_uplift AS incremental_sales_count,

    # Projected baseline revenue if uplift had not occurred
    control_conversion_rate * target_visit_count * 
      (control_sales_amount * 1.0 / NULLIF(control_sales_count, 0)) AS expected_sales_amount_without_uplift,

    # Raw revenue difference between target and control groups (assumes similarly sized groups)
    target_sales_amount - control_sales_amount AS incremental_sales_amount
  FROM cte_stats
)

# ------------------------------------------------------
# 7. Final output: summarized experiment evaluation
# ------------------------------------------------------
SELECT
  # Statistical Test Results
  z_score,
  test_result,

  # Uplift and Confidence Intervals
  absolute_uplift,
  uplift_ci_lower,
  uplift_ci_upper,
  incremental_sales_count,
  incremental_sales_amount,

  # Target Group Metrics
  target_visit_count,
  target_sales_count,
  target_conversion_rate,
  target_ci_lower,
  target_ci_upper,

  # Control Group Metrics
  control_visit_count,
  control_sales_count,
  control_conversion_rate,
  control_ci_lower,
  control_ci_upper
FROM cte_zscore;

In [None]:
experiment_results_df

## 3.6 Experimentation Insights

Here is an example report we can generate using our calculated metrics from our A/B test framework.

---

### 📊 Experiment Results Summary

Our A/B test aimed to evaluate whether the new product recommendations feature (enabled in the **target** group) led to improved conversion and revenue performance compared to the control group during Q2 2026.

---

#### ✅ Statistical Significance

- **Z-score**: `32.98`  
- **Result**: **Significant** at the 95% confidence level (one-tailed test)

This indicates **extremely strong evidence** that the target group outperformed the control group in conversion rate.

---

#### 🎯 Conversion Performance

| Metric                     | Control Group  | Target Group     |
|----------------------------|----------------|------------------|
| Number of Visits           | 25,425         | 25,424           |
| Number of Conversions      | 1,780          | 4,140            |
| Conversion Rate            | 7.00%          | 16.28%           |
| 95% CI (Conversion Rate)   | [6.69%, 7.31%] | [15.83%, 16.74%] |

- **Absolute uplift in conversion rate**: **+9.28 percentage points**  
- **95% Confidence Interval for uplift**: [8.73, 9.83] percentage points

---

#### 💰 Revenue Impact

- **Estimated incremental conversions**: `~2,360` additional sales  
- **Incremental sales amount**: **$5,945,660**

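This is the difference in total revenue between the target and control groups. Because the two groups are almost identical in size, it approximates the **extra revenue** the target group earned compared with what it would have earned had it performed like the control group.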

---

#### 📌 Conclusion

The experiment showed a **statistically significant and substantial uplift** in both conversion rate and revenue. These results suggest that enabling the new feature during this peak period had a **highly positive business impact**, and may warrant broader rollout.
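
As a small worked check of the impact figures in the report above, the sketch below recomputes the incremental conversions from the reported counts and shows how a counterfactual baseline revenue could be derived. The per-group revenue totals are not listed in the report, so the two revenue inputs here are placeholder values for illustration only, and `incremental_impact` is a hypothetical helper.

```python
def incremental_impact(control_visits, control_sales, control_revenue,
                       target_visits, target_sales, target_revenue):
    """Estimate incremental conversions and revenue against a control baseline."""
    control_rate = control_sales / control_visits
    target_rate = target_sales / target_visits
    # Extra conversions in the target group relative to the control rate
    incremental_sales = target_visits * (target_rate - control_rate)
    # What the target group would have earned at the control group's
    # conversion rate and average revenue per sale
    avg_control_sale = control_revenue / control_sales
    baseline_revenue = control_rate * target_visits * avg_control_sale
    return {
        "incremental_sales": incremental_sales,
        "baseline_revenue": baseline_revenue,
        "incremental_revenue": target_revenue - baseline_revenue,
    }


impact = incremental_impact(
    control_visits=25425, control_sales=1780,
    control_revenue=1_000_000.0,   # placeholder, not from the report
    target_visits=25424, target_sales=4140,
    target_revenue=2_000_000.0,    # placeholder, not from the report
)
print(f"incremental conversions: {impact['incremental_sales']:.0f}")
```

Comparing against a projected baseline rather than the raw control total matters most when the two groups differ in size, as in the 90/10 and 10/90 designs discussed earlier.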
