## Naive Bayes: Limes and Lemons Shop

In [1]:
import sqlite3
import pandas as pd

In [2]:
# Function to read and execute SQL model files
def execute_sql_file(filepath, connection):
    with open(filepath, 'r') as file:
        sql_script = file.read()
    connection.executescript(sql_script)
    print(f"Executed {filepath}")

# Creating a connection to SQLite for demo purposes
conn = sqlite3.connect(':memory:')  # Using an in-memory database for simplicity
cursor = conn.cursor()

In [3]:
# Load input data
execute_sql_file('../data/input/transactions.sql', conn)
execute_sql_file('../data/input/product_stock_outs.sql', conn)
execute_sql_file('../data/input/substitution_groups.sql', conn)
execute_sql_file('../data/input/transaction_spine.sql', conn)
execute_sql_file('../data/input/transaction_outcome.sql', conn)
execute_sql_file('../data/input/transaction_availability.sql', conn)

# Load naive bayes model
execute_sql_file('../data/naive_bayes/priors.sql', conn)
execute_sql_file('../data/naive_bayes/likelihoods.sql', conn)
execute_sql_file('../data/naive_bayes/posteriors.sql', conn)

# Load output data
execute_sql_file('../data/output/transactions_corrected.sql', conn)


Executed ../data/input/transactions.sql
Executed ../data/input/product_stock_outs.sql
Executed ../data/input/substitution_groups.sql
Executed ../data/input/transaction_spine.sql
Executed ../data/input/transaction_outcome.sql
Executed ../data/input/transaction_availability.sql
Executed ../data/naive_bayes/priors.sql
Executed ../data/naive_bayes/likelihoods.sql
Executed ../data/naive_bayes/posteriors.sql
Executed ../data/output/transactions_corrected.sql


In [5]:
df_transactions = pd.read_sql_query('SELECT * FROM transactions', conn)
df_transactions.head()

Unnamed: 0,transaction_id,sales_date_time,product_name,product_id,quantity_sold
0,1,2024-10-01 13:15:00,lime,11,2
1,2,2024-10-01 13:20:00,lemon,12,1
2,3,2024-10-01 14:50:00,lime,11,3
3,4,2024-10-01 14:55:00,lemon,12,1
4,5,2024-10-01 15:00:00,lemon,12,3


In [112]:
# Count the number of transactions per product
df_transactions.groupby('product_name').size().reset_index(name='nr_transactions')

Unnamed: 0,product_name,nr_transactions
0,lemon,23
1,lime,14


At first glance, customers appear to strongly prefer lemons: 23 of the 37 transactions are lemon purchases. But let's check whether this is really the case.

In [133]:
# Find transactions made when there were stock outs
df_out_of_stock = pd.read_sql_query("""
    SELECT DISTINCT 
        transaction_id,  
        CASE WHEN product_id_outcome = 11 THEN 'lime' ELSE 'lemon' END AS product_purchased,
        CASE WHEN product_id_available = 11 THEN 'lime' ELSE 'lemon' END AS product_stock_out,
        is_available
    FROM transaction_outcome
    INNER JOIN transaction_availability USING (transaction_id)
    WHERE is_available = 0
""", conn)

df_out_of_stock


Unnamed: 0,transaction_id,product_purchased,product_stock_out,is_available
0,19,lemon,lime,0
1,20,lemon,lime,0
2,21,lemon,lime,0
3,22,lemon,lime,0
4,23,lemon,lime,0
5,24,lemon,lime,0
6,26,lemon,lime,0
7,27,lemon,lime,0


We observed 8 lemon transactions while limes were out of stock.

Since we expect limes and lemons to be substitutes for each other, let's see whether customers bought lemons as a substitute for limes.


We can use Naive Bayes to calculate the posterior probability that a customer purchases lemons given that limes are out of stock:

$$ P(Lemons | Limes \, Out \, of \, Stock) = \frac{P(Limes \, Out \, of \, Stock | Lemons) \cdot P(Lemons)}{P(Limes \, Out \, of \, Stock)} $$
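The denominator is the normalization constant. As a sketch, assuming lemons and limes are the only two products a customer can choose between (which matches the two products in this dataset), it can be expanded with the law of total probability:

$$ P(Limes \, Out \, of \, Stock) = P(Limes \, Out \, of \, Stock | Lemons) \cdot P(Lemons) + P(Limes \, Out \, of \, Stock | Limes) \cdot P(Limes) $$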

Let's start by finding the priors and likelihoods.




In [120]:
df_priors = pd.read_sql_query("""
SELECT 
    CASE WHEN product_id_outcome = 11 THEN 'lime' ELSE 'lemon' END AS product_name,
    count as n_times_purchased,
    total as n_total_purchased,
    prior
FROM priors
""", conn)

df_priors.head()

Unnamed: 0,product_name,n_times_purchased,n_total_purchased,prior
0,lime,14,37,0.378378
1,lemon,23,37,0.621622


For example, the prior for limes is the number of lime transactions divided by the total number of transactions:

$$ P(Limes) = \frac{14}{37} = 0.378378 $$
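The same calculation can be sketched in plain SQLite. This is an illustration only: the table `transactions_demo` and the names `conn_demo` and `priors_demo` are hypothetical, and the query is not necessarily what `priors.sql` contains. The toy table holds only the 14 lime and 23 lemon purchases counted above.

```python
import sqlite3

# Toy stand-in for the transactions table: 14 lime (id 11) and 23 lemon (id 12)
# purchases, matching the counts shown above. Only product_id is modelled.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE transactions_demo (product_id INTEGER)")
conn_demo.executemany(
    "INSERT INTO transactions_demo (product_id) VALUES (?)",
    [(11,)] * 14 + [(12,)] * 23,
)

# Prior = purchases of a product / all purchases
rows = conn_demo.execute("""
    SELECT
        product_id,
        COUNT(*) AS n_times_purchased,
        (SELECT COUNT(*) FROM transactions_demo) AS n_total_purchased,
        CAST(COUNT(*) AS REAL) / (SELECT COUNT(*) FROM transactions_demo) AS prior
    FROM transactions_demo
    GROUP BY product_id
    ORDER BY product_id
""").fetchall()

# Map product_id -> prior for easy lookup
priors_demo = {product_id: prior for product_id, _, _, prior in rows}
for row in rows:
    print(row)
```

The priors sum to 1 because every transaction in the toy table is either a lime or a lemon purchase.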

In [18]:
df_likelihoods = pd.read_sql_query("""
WITH outcome_availability AS ( 
    SELECT DISTINCT 
        substitution_group_id,
        transaction_id,
        product_id_outcome,
        product_id_available,
        is_available
    FROM transaction_outcome
    INNER JOIN transaction_availability USING (transaction_id, substitution_group_id)
)

SELECT
    CASE WHEN product_id_outcome = 11 THEN 'lime' ELSE 'lemon' END AS product_purchased,
    CASE WHEN product_id_available = 11 THEN 'lime' ELSE 'lemon' END AS product_stock_out,
    is_available,
    COUNT(*) as n_occurrences_of_stock_out_status,
    SUM(COUNT(*)) OVER (PARTITION BY product_id_outcome, product_id_available) AS total,
    -- Use the same partition as `total`; substitution_group_id is not in the GROUP BY
    CAST(COUNT(*) AS REAL) / CAST(SUM(COUNT(*)) OVER (PARTITION BY product_id_outcome, product_id_available) AS REAL) AS likelihood
FROM outcome_availability
WHERE product_id_outcome = 12 AND product_id_available = 11
GROUP BY 
    product_id_outcome,
    product_id_available,
    is_available
""", conn)

df_likelihoods

Unnamed: 0,product_purchased,product_stock_out,is_available,n_occurrences_of_stock_out_status,total,likelihood
0,lemon,lime,0,8,23,0.347826
1,lemon,lime,1,15,23,0.652174


The likelihood is the probability of observing the evidence given the hypothesis. Here, it is the probability that limes are out of stock given that a customer purchases lemons.

To estimate it, we count two things:

1. The number of lemon purchases made while limes were out of stock: 8

2. The total number of lemon purchases: 23

$$ P(Limes \, Out \, of \, Stock | Lemons) = \frac{8}{23} = 0.3478 $$
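The counting step can be sketched in a few lines of plain Python. The list below is a hypothetical reconstruction from the counts in the table above (8 lemon purchases with limes out of stock, 15 with limes in stock), not the raw data, and the variable names are illustrative.

```python
from collections import Counter

# Lime availability (1 = in stock, 0 = out of stock) at the time of each of
# the 23 lemon purchases, rebuilt from the counts in the table above.
lime_available_at_lemon_purchase = [0] * 8 + [1] * 15

status_counts = Counter(lime_available_at_lemon_purchase)
n_lemon_purchases = sum(status_counts.values())

# Likelihood of each availability status, conditional on a lemon purchase
likelihoods_demo = {
    status: count / n_lemon_purchases for status, count in status_counts.items()
}

print(f"P(limes out of stock | lemons) = {likelihoods_demo[0]:.4f}")
print(f"P(limes in stock | lemons)     = {likelihoods_demo[1]:.4f}")
```

The two likelihoods sum to 1 because limes are either in stock or out of stock at every lemon purchase.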

Now that we have calculated the priors and likelihoods, we have all the ingredients to calculate the posterior probability.

In [24]:
# TODO: go more in depth on the calculation and the normalization constant

df_posteriors = pd.read_sql_query("""
SELECT 
    transaction_id,
    posterior,
    posterior_base,
    substitution_correction_ratio
FROM posteriors
WHERE posterior > posterior_base
""", conn)
df_posteriors

Unnamed: 0,transaction_id,posterior,posterior_base,substitution_correction_ratio
0,19,1.0,0.545,0.545
1,20,1.0,0.545,0.545
2,21,1.0,0.545,0.545
3,22,1.0,0.545,0.545
4,23,1.0,0.545,0.545
5,24,1.0,0.545,0.545
6,25,1.0,0.545,0.545
7,26,1.0,0.545,0.545


$$ P(Lemons | Limes \, Out \, of \, Stock) = \frac{P(Limes \, Out \, of \, Stock | Lemons) \cdot P(Lemons)}{P(Limes \, Out \, of \, Stock)} $$

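Substitute the likelihood (0.3478), the prior for lemons (0.6216), and the evidence $P(Limes \, Out \, of \, Stock) = \frac{8}{37} = 0.2162$, since 8 of the 37 transactions took place while limes were out of stock: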

$$ P(Lemons | Limes \, Out \, of \, Stock) = \frac{0.3478 \cdot 0.6216}{0.2162} $$

Because every one of those 8 transactions was a lemon purchase, this evaluates to:

$$ P(Lemons | Limes \, Out \, of \, Stock) = 1 $$
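The full calculation, including the normalization constant, can be sketched as follows. This is a minimal illustration using the counts from the tables above, not the contents of `posteriors.sql`. The likelihood for limes is an assumption: the stock-out table lists no lime purchases while limes were out of stock, so 0 / 14 is used. All variable names are hypothetical.

```python
# Priors taken from the priors table above.
priors_demo = {"lemon": 23 / 37, "lime": 14 / 37}

# P(limes out of stock | product purchased).
# The lime value is an assumption: no lime purchases appear in the
# stock-out table, so 0 of the 14 lime purchases is used here.
likelihood_lime_out_of_stock = {"lemon": 8 / 23, "lime": 0 / 14}

# Numerator of Bayes' rule for each product: likelihood * prior
unnormalized = {
    product: likelihood_lime_out_of_stock[product] * priors_demo[product]
    for product in priors_demo
}

# Normalization constant P(limes out of stock), by the law of total probability
evidence = sum(unnormalized.values())

# Posterior = numerator / normalization constant
posteriors_demo = {
    product: value / evidence for product, value in unnormalized.items()
}

print(f"P(limes out of stock) = {evidence:.4f}")
for product, posterior in posteriors_demo.items():
    print(f"P({product} | limes out of stock) = {posterior:.4f}")
```

Dividing by the evidence guarantees that the posteriors over both products sum to 1, which is why the lemon posterior reaches 1 when the lime numerator is 0.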