# Practice Session 04: Basket analysis

Author: <font color="blue">Bernat Quintilla Castellón</font>

E-mail: <font color="blue">bernat.quintilla01@estudiant.upf.edu</font>

Date: <font color="blue">20/10/2023</font>

In [25]:
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
import csv
import gzip
from apyori import apriori

# 1. Playing with apyori

In [26]:
# LEAVE AS-IS

def print_apyori_output (association_results, info=False, info_key=False):
    for relation_record in association_results:
        itemset = list(relation_record.items)
        
        # Consider only itemsets with at least two elements
        if len(itemset) > 1: 
        
            print("Rules involving itemset %s" % itemset)
            support = relation_record.support

            for rules in relation_record.ordered_statistics:
                antecedent = list(rules.items_base)
                consequent = list(rules.items_add)
                
                if info_key:
                    antecedent = [info.loc[x][info_key] for x in antecedent]
                    consequent = [info.loc[x][info_key] for x in consequent]
                
                confidence = rules.confidence
                lift = rules.lift

                print("%s => %s (support=%.4f, confidence=%.2f, lift=%.2f)" %
                      (antecedent, consequent, support, confidence, lift))
            print()

In [27]:
#21 different transactions
transactions = [
    ['milk', 'bread', 'eggs', 'cheese'],
    ['bread', 'eggs', 'cheese'],
    ['milk', 'eggs', 'cheese'],
    ['milk', 'bread', 'cheese'],
    ['milk', 'bread', 'eggs'],
    ['bread', 'cheese'],
    ['milk', 'eggs'],
    ['milk', 'bread'],
    ['bread', 'eggs'],
    ['milk', 'cheese'],
    ['eggs'],
    ['bread', 'butter'],
    ['milk', 'butter'],
    ['butter'],
    ['milk', 'yogurt'],
    ['yogurt', 'eggs'],
    ['milk', 'yogurt', 'eggs'],
    ['yogurt', 'cheese'],
    ['milk', 'yogurt', 'cheese'],
    ['yogurt', 'bread'],
    ['milk', 'yogurt', 'bread']
]

results = list(apriori(transactions, min_support=0.1, min_confidence=0.5, min_lift=1.0))

print_apyori_output(results)

Rules involving itemset ['cheese', 'bread']
['cheese'] => ['bread'] (support=0.1905, confidence=0.50, lift=1.05)

Rules involving itemset ['cheese', 'milk']
['cheese'] => ['milk'] (support=0.2381, confidence=0.62, lift=1.09)

Rules involving itemset ['yogurt', 'milk']
['yogurt'] => ['milk'] (support=0.1905, confidence=0.57, lift=1.00)



**Rule of itemset ['cheese', 'bread']: ['cheese'] => ['bread']**

```Support```:

Support['cheese', 'bread'] = Number of transactions with ['cheese', 'bread'] / Total number of transactions

Support['cheese', 'bread'] = 4 / 21 = 0.1905

```Confidence```:

Confidence['cheese'] => ['bread'] = Number of transactions with ['cheese', 'bread'] / Number of transactions with ['cheese']

Confidence['cheese'] => ['bread'] = 4 / 8 = 0.50

```Lift```:

Lift['cheese'] => ['bread'] = Confidence['cheese'] => ['bread'] / Support['bread']

Support['bread'] = 10 / 21 = 0.4762

Lift['cheese'] => ['bread'] = 0.50 / 0.4762 = 1.05

**Rule of itemset ['cheese', 'milk']: ['cheese'] => ['milk']**

```Support```:

Support['cheese', 'milk'] = Number of transactions with ['cheese', 'milk'] / Total number of transactions

Support['cheese', 'milk'] = 5 / 21 = 0.2381

```Confidence```:

Confidence['cheese'] => ['milk'] = Number of transactions with ['cheese', 'milk'] / Number of transactions with ['cheese']

Confidence['cheese'] => ['milk'] = 5 / 8 = 0.625 (printed as 0.62)

```Lift```:

Lift['cheese'] => ['milk'] = Confidence['cheese'] => ['milk'] / Support['milk']

Support['milk'] = 12 / 21 = 0.5714

Lift['cheese'] => ['milk'] = 0.625 / 0.5714 = 1.09

**Rule of itemset ['yogurt', 'milk']: ['yogurt'] => ['milk']**

```Support```:

Support['yogurt', 'milk'] = Number of transactions with ['yogurt', 'milk'] / Total number of transactions

Support['yogurt', 'milk'] = 4 / 21 = 0.1905

```Confidence```:

Confidence['yogurt'] => ['milk'] = Number of transactions with ['yogurt', 'milk'] / Number of transactions with ['yogurt']

Confidence['yogurt'] => ['milk'] = 4 / 7 = 0.5714 (printed as 0.57)

```Lift```:

Lift['yogurt'] => ['milk'] = Confidence['yogurt'] => ['milk'] / Support['milk']

Lift['yogurt'] => ['milk'] = 0.5714 / 0.5714 = 1.00
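As a cross-check, the three hand calculations above can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper names `support`, `confidence` and `lift` are introduced here for illustration and are not part of apyori, and the transaction list is copied from the cell in section 1.

```python
# Same 21 toy transactions as in section 1
transactions = [
    ['milk', 'bread', 'eggs', 'cheese'], ['bread', 'eggs', 'cheese'],
    ['milk', 'eggs', 'cheese'], ['milk', 'bread', 'cheese'],
    ['milk', 'bread', 'eggs'], ['bread', 'cheese'], ['milk', 'eggs'],
    ['milk', 'bread'], ['bread', 'eggs'], ['milk', 'cheese'], ['eggs'],
    ['bread', 'butter'], ['milk', 'butter'], ['butter'], ['milk', 'yogurt'],
    ['yogurt', 'eggs'], ['milk', 'yogurt', 'eggs'], ['yogurt', 'cheese'],
    ['milk', 'yogurt', 'cheese'], ['yogurt', 'bread'],
    ['milk', 'yogurt', 'bread'],
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset` (a list of items)."""
    wanted = set(itemset)
    return sum(wanted <= set(basket) for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of the joint itemset divided by the support of the antecedent."""
    joint = list(antecedent) + list(consequent)
    return support(joint, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

# Print the three metrics for the rules discussed above
for a, c in [(['cheese'], ['bread']), (['cheese'], ['milk']), (['yogurt'], ['milk'])]:
    print(a, "=>", c,
          "support=%.4f" % support(a + c, transactions),
          "confidence=%.4f" % confidence(a, c, transactions),
          "lift=%.4f" % lift(a, c, transactions))
```

Items are passed as lists on purpose: calling `set('cheese')` on a bare string would split it into characters.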

# 2. Load and prepare the shopping baskets

In [28]:
# LEAVE AS-IS

# File names
INPUT_PRODUCTS = "instacart-products.csv"
INPUT_TRANSACTIONS = "instacart-transactions.csv.gz"

# Read into a dataframe
products = pd.read_csv(INPUT_PRODUCTS, delimiter=",")

# Set product_id as index, and drop column aisle_id
products = products.set_index('product_id').drop(columns=['aisle_id'])

products.head(100)

| product_id | product_name | department_id |
|---|---|---|
| 1 | Chocolate Sandwich Cookies | 19 |
| 2 | All-Seasons Salt | 13 |
| 3 | Robust Golden Unsweetened Oolong Tea | 7 |
| 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 1 |
| 5 | Green Chile Anytime Sauce | 13 |
| ... | ... | ... |
| 96 | Sprinklez Confetti Fun Organic Toppings | 13 |
| 97 | Organic Chamomile Lemon Tea | 7 |
| 98 | 2% Yellow American Cheese | 16 |
| 99 | Local Living Butter Lettuce | 4 |


## 2.1. Select by department

In [29]:
# LEAVE AS-IS

DEPT_BAKERY = 3
DEPT_VEGGIES = 4
DEPT_ALCOHOL = 5
DEPT_WORLD = 6
DEPT_DRINKS = 7
DEPT_PETS = 8
DEPT_PHARMACY = 11
DEPT_CLEANING = 17
DEPT_BABIES = 18

In [30]:
ids_product = list(range(50)) #Select product_id from 0 to 49

def select_from_departments(products, product_ids, department_ids):
    selected_products = []

    for product_id in product_ids:
        if product_id in products.index: # Skip IDs that are not in the products dataset
            department_id = products.loc[product_id].department_id
            if department_id in department_ids:
                selected_products.append(product_id) # Keep the product if it belongs to a selected department

    return selected_products

selected_products = select_from_departments(products, ids_product, [DEPT_WORLD, DEPT_PETS])
print(selected_products)

[21, 26, 47]
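The loop above does one `products.loc` lookup per product ID. A vectorized variant is sketched below as a possible alternative, not as the method used in this report. The function name `select_from_departments_vectorized` and the `toy_products` frame are made up for illustration, and the sketch assumes the `products` index holds unique product IDs.

```python
import pandas as pd

def select_from_departments_vectorized(products, product_ids, department_ids):
    """Return the product IDs (in input order) whose department is in department_ids."""
    ids = pd.Index(product_ids)
    # Drop IDs that do not exist in the products dataset
    known = ids[ids.isin(products.index)]
    # Boolean Series: True where the product's department is one of the selected ones
    in_dept = products.loc[known, "department_id"].isin(department_ids)
    return known[in_dept.to_numpy()].tolist()

# Hypothetical toy data, only to show the call
toy_products = pd.DataFrame(
    {"product_name": ["A", "B", "C", "D"], "department_id": [19, 6, 8, 6]},
    index=pd.Index([1, 2, 3, 4], name="product_id"),
)
toy_selected = select_from_departments_vectorized(toy_products, [0, 2, 3, 4, 99], [6, 8])
print(toy_selected)
```

IDs 0 and 99 are missing from the toy frame and are silently skipped, mirroring the `if product_id in products.index` check in the loop version.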


In [31]:
#Id products test
id_products_test1 = [22, 26, 45, 54, 57, 71, 111, 112]
id_products_test2 = [12, 32, 45, 63, 71, 77, 85, 90, 110]
id_products_test3 = [8, 14, 23, 26, 33, 38, 46, 50, 55, 63, 73, 87, 92, 104]
#list of depts test
list_dept_test1 = [DEPT_BAKERY, DEPT_CLEANING]
list_dept_test2 = [DEPT_VEGGIES, DEPT_ALCOHOL, DEPT_DRINKS]
list_dept_test3 = [DEPT_PHARMACY, DEPT_BABIES, DEPT_WORLD]

def print_test_selected_products(products, id_products, list_dept):
    selected_products = select_from_departments(products, id_products, list_dept) # Apply the function to get the selected products
    print("Input products:")
    for id_prod in id_products: # Print every input product
        department_id = products.loc[id_prod].department_id # Look up department_id and product_name by product_id
        product_name = products.loc[id_prod].product_name
        print(id_prod," ",product_name," (dept ",department_id,")")
    print("\nSelected products:")
    for prod in selected_products: # Print the selected products in the same format
        dept_id = products.loc[prod].department_id
        prod_name = products.loc[prod].product_name
        print(prod," ",prod_name," (dept ",dept_id,")")
    print("\n")
    
print_test_selected_products(products, id_products_test1, list_dept_test1)
print_test_selected_products(products, id_products_test2, list_dept_test2)
print_test_selected_products(products, id_products_test3, list_dept_test3)

Input products:
22   Fresh Breath Oral Rinse Mild Mint  (dept  11 )
26   Fancy Feast Trout Feast Flaked Wet Cat Food  (dept  8 )
45   European Cucumber  (dept  4 )
54   24/7 Performance Cat Litter  (dept  8 )
57   Flat Toothpicks  (dept  17 )
71   Ultra 7 Inch Polypropylene Traditional Plates  (dept  17 )
111   Fabric Softener, Geranium Scent  (dept  17 )
112   Hot Tomatillo Salsa  (dept  13 )

Selected products:
57   Flat Toothpicks  (dept  17 )
71   Ultra 7 Inch Polypropylene Traditional Plates  (dept  17 )
111   Fabric Softener, Geranium Scent  (dept  17 )


Input products:
12   Chocolate Fudge Layer Cake  (dept  1 )
32   Nacho Cheese White Bean Chips  (dept  19 )
45   European Cucumber  (dept  4 )
63   Banana & Sweet Potato Organic Teething Wafers  (dept  18 )
71   Ultra 7 Inch Polypropylene Traditional Plates  (dept  17 )
77   Coconut Chocolate Chip Energy Bar  (dept  19 )
85   Soppressata Piccante  (dept  20 )
90   Smorz Cereal  (dept  14 )
110   Uncured Turkey Bologna  (dept  21

## 2.2. Read and filter transactions

In [32]:
# Wrapped in a function so that it can be reused in section 2.4.
def read_transactions(transactions_read, transactions_stored, transactions, products, INPUT_TRANSACTIONS, depts):
    # Open a compressed file
    with gzip.open(INPUT_TRANSACTIONS, "rt") as inputfile:

        # Create a CSV reader
        reader = csv.reader(inputfile, delimiter=",")

        # Iterate through the CSV file
        for row in reader:

            # Convert to integers
            items = [int(x) for x in row]
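            # Product IDs in this basket that belong to the selected departments
            selected_items = select_from_departments(products, items, depts)
            if selected_items: # At least one item comes from the selected departments
                transactions.append(items) # Note: the whole basket is stored, not only the selected items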
                transactions_stored += 1

            transactions_read += 1

            if transactions_stored >= 5000:#Check if we've stored 5000 transactions
                break

            if transactions_read % 1000 == 0:#Print progress every 1000 transactions
                print(f"Transactions Read: {transactions_read}, Transactions Stored: {transactions_stored}")

    print("Finished reading and storing transactions.")
    
transactions_read = 0
transactions_stored = 0
transactions = []
read_transactions(transactions_read, transactions_stored, transactions, products, INPUT_TRANSACTIONS, [DEPT_CLEANING])    

Transactions Read: 1000, Transactions Stored: 158
Transactions Read: 2000, Transactions Stored: 311
Transactions Read: 3000, Transactions Stored: 460
Transactions Read: 4000, Transactions Stored: 598
Transactions Read: 5000, Transactions Stored: 745
Transactions Read: 6000, Transactions Stored: 902
Transactions Read: 7000, Transactions Stored: 1067
Transactions Read: 8000, Transactions Stored: 1206
Transactions Read: 9000, Transactions Stored: 1373
Transactions Read: 10000, Transactions Stored: 1515
Transactions Read: 11000, Transactions Stored: 1670
Transactions Read: 12000, Transactions Stored: 1807
Transactions Read: 13000, Transactions Stored: 1951
Transactions Read: 14000, Transactions Stored: 2102
Transactions Read: 15000, Transactions Stored: 2245
Transactions Read: 16000, Transactions Stored: 2384
Transactions Read: 17000, Transactions Stored: 2543
Transactions Read: 18000, Transactions Stored: 2692
Transactions Read: 19000, Transactions Stored: 2840
Transactions Read: 20000, T
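The filtering step can also be written so that it returns its results instead of updating arguments, with department membership tested against a precomputed set of product IDs. The following is a sketch under assumptions, not the code that produced the output above: `read_filtered_transactions`, `allowed_ids` and `max_stored` are names introduced here. It keeps the same behaviour as `read_transactions`: a basket is stored whole as soon as it contains at least one allowed item.

```python
import csv
import gzip

def read_filtered_transactions(path, allowed_ids, max_stored=5000):
    """Read a gzipped CSV of baskets and return (transactions, n_read).

    A basket is kept, with all of its items, if at least one item is in
    `allowed_ids`. Reading stops once `max_stored` baskets have been kept.
    """
    transactions = []
    n_read = 0
    with gzip.open(path, "rt", newline="") as inputfile:
        for row in csv.reader(inputfile, delimiter=","):
            n_read += 1
            items = [int(x) for x in row]
            # Set membership is a constant-time check per item
            if any(item in allowed_ids for item in items):
                transactions.append(items)
                if len(transactions) >= max_stored:
                    break
    return transactions, n_read
```

With the `products` frame from section 2, the set could be built once as `set(products.index[products["department_id"].isin([DEPT_CLEANING])])` and passed as `allowed_ids`.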

## 2.3. Extract association rules and comment on them (DEPT_CLEANING)

In [12]:
results = list(apriori(transactions, min_support=0.003, min_confidence=0.5, min_lift=1.0)) #I put min_support=0.003 to have a reasonable number of rules
print_apyori_output(results, products, 'product_name')

Rules involving itemset [19604, 16797]
['Medium Scarlet Raspberries'] => ['Strawberries'] (support=0.0030, confidence=0.56, lift=12.35)

Rules involving itemset [13176, 5876, 27966]
['Organic Lemon', 'Organic Raspberries'] => ['Bag of Organic Bananas'] (support=0.0030, confidence=0.71, lift=6.24)

Rules involving itemset [13176, 47209, 8021]
['Organic Hass Avocado', '100% Recycled Paper Towels'] => ['Bag of Organic Bananas'] (support=0.0042, confidence=0.52, lift=4.59)

Rules involving itemset [13176, 21137, 27966]
['Organic Strawberries', 'Organic Raspberries'] => ['Bag of Organic Bananas'] (support=0.0058, confidence=0.60, lift=5.28)

Rules involving itemset [13176, 21137, 39275]
['Organic Strawberries', 'Organic Blueberries'] => ['Bag of Organic Bananas'] (support=0.0030, confidence=0.50, lift=4.37)

Rules involving itemset [13176, 39275, 27966]
['Organic Blueberries', 'Organic Raspberries'] => ['Bag of Organic Bananas'] (support=0.0030, confidence=0.63, lift=5.46)

Rules involving 

I would recommend that the shopping app offer bundle deals or discounts for items that appear together in these association rules. For instance, the rule ['Organic Lemon', 'Organic Raspberries'] => ['Bag of Organic Bananas'] (confidence 0.71, lift 6.24) indicates that customers who buy Organic Lemon and Organic Raspberries often also purchase Bag of Organic Bananas, so the app could offer a discount when these three items are bought together.

Another recommendation: since certain items are frequently bought together, the app should also promote the related categories. For example, because 'Organic Strawberries', 'Organic Raspberries', and 'Bag of Organic Bananas' are frequently bought together, the app could suggest the "Organic Fruits" category to customers who add one of them to their basket.
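One possible way to shortlist bundle candidates is to rank the rules by lift and optionally require a minimum support. The sketch below uses only the six rules visible in the output above, copied as plain tuples. The function name `rank_bundle_candidates` and the thresholds are illustrative assumptions, not part of apyori.

```python
# (antecedent, consequent, support, confidence, lift), copied from the output above
visible_rules = [
    (["Medium Scarlet Raspberries"], ["Strawberries"], 0.0030, 0.56, 12.35),
    (["Organic Lemon", "Organic Raspberries"], ["Bag of Organic Bananas"], 0.0030, 0.71, 6.24),
    (["Organic Hass Avocado", "100% Recycled Paper Towels"], ["Bag of Organic Bananas"], 0.0042, 0.52, 4.59),
    (["Organic Strawberries", "Organic Raspberries"], ["Bag of Organic Bananas"], 0.0058, 0.60, 5.28),
    (["Organic Strawberries", "Organic Blueberries"], ["Bag of Organic Bananas"], 0.0030, 0.50, 4.37),
    (["Organic Blueberries", "Organic Raspberries"], ["Bag of Organic Bananas"], 0.0030, 0.63, 5.46),
]

def rank_bundle_candidates(rules, min_support=0.0, top_n=3):
    """Keep rules with support >= min_support and return the top_n by lift."""
    eligible = [rule for rule in rules if rule[2] >= min_support]
    return sorted(eligible, key=lambda rule: rule[4], reverse=True)[:top_n]

# Example: the three visible rules with the highest lift
for antecedent, consequent, sup, conf, lft in rank_bundle_candidates(visible_rules):
    print(antecedent, "=>", consequent, "support=%.4f lift=%.2f" % (sup, lft))
```

Raising `min_support` trades the rarest, highest-lift pairs for rules backed by more baskets, which may be a safer basis for a discount.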

## 2.4. Extract association rules and comment on them (other departments)

In [13]:
#In this cell we generate transactions with set of departments DEPT_PETS, DEPT_PHARMACY, DEPT_BABIES using function read_transactions
transactions24_read = 0
transactions24_stored = 0
transactions24 = []
departments = [DEPT_PETS, DEPT_PHARMACY, DEPT_BABIES]
read_transactions(transactions24_read, transactions24_stored, transactions24, products, INPUT_TRANSACTIONS, departments)    

Transactions Read: 1000, Transactions Stored: 163
Transactions Read: 2000, Transactions Stored: 339
Transactions Read: 3000, Transactions Stored: 507
Transactions Read: 4000, Transactions Stored: 668
Transactions Read: 5000, Transactions Stored: 837
Transactions Read: 6000, Transactions Stored: 994
Transactions Read: 7000, Transactions Stored: 1161
Transactions Read: 8000, Transactions Stored: 1306
Transactions Read: 9000, Transactions Stored: 1480
Transactions Read: 10000, Transactions Stored: 1678
Transactions Read: 11000, Transactions Stored: 1855
Transactions Read: 12000, Transactions Stored: 2003
Transactions Read: 13000, Transactions Stored: 2176
Transactions Read: 14000, Transactions Stored: 2353
Transactions Read: 15000, Transactions Stored: 2524
Transactions Read: 16000, Transactions Stored: 2693
Transactions Read: 17000, Transactions Stored: 2837
Transactions Read: 18000, Transactions Stored: 2999
Transactions Read: 19000, Transactions Stored: 3137
Transactions Read: 20000, T

In [14]:
#Now we obtain the association rules from the read transactions
results24 = list(apriori(transactions24, min_support=0.0032, min_confidence=0.5, min_lift=1.0))
print_apyori_output(results24, products, 'product_name')

Rules involving itemset [19604, 16797]
['Medium Scarlet Raspberries'] => ['Strawberries'] (support=0.0032, confidence=0.50, lift=10.55)

Rules involving itemset [32018, 38141]
['Fiber & Protein Organic Pears, Raspberries, Butternut Squash & Carrots Snack'] => ['Organic Fiber & Protein Pear Blueberry & Spinach Baby Food'] (support=0.0032, confidence=0.76, lift=108.84)

Rules involving itemset [32018, 45495]
['Pear Kiwi & Kale Baby Food'] => ['Organic Fiber & Protein Pear Blueberry & Spinach Baby Food'] (support=0.0032, confidence=0.70, lift=99.38)

Rules involving itemset [13176, 27966, 21903]
['Organic Raspberries', 'Organic Baby Spinach'] => ['Bag of Organic Bananas'] (support=0.0046, confidence=0.55, lift=3.78)

Rules involving itemset [13176, 22825, 47209]
["Organic D'Anjou Pears", 'Organic Hass Avocado'] => ['Bag of Organic Bananas'] (support=0.0032, confidence=0.50, lift=3.45)

Rules involving itemset [13176, 27966, 30391]
['Organic Raspberries', 'Organic Cucumber'] => ['Bag of Or

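Several rules have high confidence and lift, indicating strong associations between items. For example, Rule 2 (itemset [32018, 38141]) has a confidence of 76% and a lift of 108.84: a basket containing 'Fiber & Protein Organic Pears, Raspberries, Butternut Squash & Carrots Snack' is about 109 times as likely as an average basket to also contain 'Organic Fiber & Protein Pear Blueberry & Spinach Baby Food'. Its support is only 0.0032, however, so the rule is backed by a small number of baskets.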
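A lift this large mostly reflects how rare the consequent is. Since lift = confidence / support(consequent), the support of the consequent can be recovered from the two printed values. The sketch below does that arithmetic for Rules 2 and 3; the helper name `implied_consequent_support` is introduced here for illustration, and the inputs are the rounded figures from the output, so the results are approximate.

```python
def implied_consequent_support(confidence, lift):
    """Support of the consequent implied by lift = confidence / support(consequent)."""
    return confidence / lift

# Rounded values printed for Rule 2 and Rule 3, which share the same consequent
rule2 = implied_consequent_support(0.76, 108.84)
rule3 = implied_consequent_support(0.70, 99.38)
print("Rule 2 implies support(consequent) = %.4f" % rule2)
print("Rule 3 implies support(consequent) = %.4f" % rule3)
```

Both rules point to the same consequent appearing in roughly 0.7% of the stored baskets, which is why even a moderate confidence translates into a lift around 100.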

The rules also highlight associations between specific products. For instance, Rule 1 suggests that customers purchasing 'Medium Scarlet Raspberries' tend to also buy 'Strawberries' (confidence 0.50, lift 10.55).

Some rules involve items from different departments, indicating potential cross-department purchasing behavior. This can be valuable information for marketing and product placement strategies.

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>