# Practice Session 04: Basket analysis

Association rule mining techniques are useful to analyze datasets consisting of transactions, in which each transaction is a collection of items.

We will use a well-known dataset named [Instacart](https://www.kaggle.com/c/instacart-market-basket-analysis) containing more than 3 million orders of products through a grocery shopping app. You can find it in the `instacart/` directory of the practicum data files.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Author: <font color="blue">Luca Franceschi</font>

E-mail: <font color="blue">luca.franceschi01@estudiant.upf.edu</font>

Date: <font color="blue">?/10/2024</font>

In [1]:
import numpy as np  
import matplotlib.pyplot as plt
import pandas as pd  
import csv
import gzip
                     
from apyori import apriori

## 0. The Apriori Algorithm in a nutshell

There are three major components of Apriori algorithm, which we describe below using as an example the case where transactions are purchase histories.

**Support**: the number of transactions containing a particular item divided by total number of transactions:

   *Support(A) = (Transactions containing (A))/(Total Transactions)*

**Confidence**: normally indicates the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by total number of transactions where A is bought:

   *Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)*

**Lift**: the increase in the ratio of sale of B when A is sold. Lift(A –> B) can be calculated by dividing Confidence(A -> B) by Support(B):

   *Lift(A→B) = (Confidence (A→B))/(Support (B))*
   
A Lift of 1 means there is no association between products A and B. Lift greater than 1.0 means products A and B are more likely to be bought together. Lift less than 1.0 indicates two products are unlikely to be bought together.

The Apriori algorithm first finds itemsets having the desired level of support, and then within those itemsets tries to derive rules having the desired confidence and lift.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 1. Playing with apyori

The [apyori library](https://pypi.org/project/apyori/) is an implementation of the Apriori algorithm. Its typical usage is to receive a list of transactions and then print the association rules it found.

To use this library, we pass a list in which each element represents a transaction, for instance:

```python
transactions = [
    ['Barbie', 'Nomadland', 'Everything,   Everywhere, all at Once'],
    ['Barbie', 'Everything, Everywhere, all at Once'],
    ['Nomadland', 'Everything, Everywhere, all at Once'],
    ['Barbie', 'Nomadland', 'Soul', 'Everything, Everywhere, all at Once'],
    ['Encanto', 'Nomadland'],
    ['Barbie', 'Nomadland', 'Mad Max: Furiosa', 'Everything, Everywhere, all at Once'],
    ['Barbie', 'Mad Max: Furiosa'],
    ['Oppenheimer', 'Spiderman'],
    ['Spiderman', 'Top Gun: Maverick', 'Mad Max: Furiosa'],
    ['Spiderman', 'Top Gun: Maverick', 'Suicide Squad', 'Max Max: Furiosa'],
    ['Suicide Squad', 'Top Gun: Maverick', 'Mad Max: Furiosa'],
    ['Tenet', 'Everything, Everywhere, all at Once'],
    ['Encanto', 'Soul'],
    ['Soul', 'Spiderman'],
    ['Encanto', 'Soul', 'Inside Out'],
    ['Encanto', 'Inside Out'],
    ['Inside Out', 'Spiderman'],
    ['Nomadland', 'Soul'],
    
]
results = list(apriori(transactions, min_support=0.1, min_confidence=0.9, min_lift=1.0))
print_apyori_output(results)

```

The function below, which you can leave as-is, prints the output of the apyori library in a readable format. Use it to print the results of your association rules mining:

```python
print_apyori_output(results)
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [2]:
# LEAVE AS-IS

def print_apyori_output (association_results, info=False, info_key=False):
    for relation_record in association_results:
        itemset = list(relation_record.items)
        
        # Consider only itemsets of two elements
        if len(itemset) > 1: 
        
            print("Rules involving itemset %s" % itemset)
            support = relation_record.support

            for rules in relation_record.ordered_statistics:
                antecedent = list(rules.items_base)
                consequent = list(rules.items_add)
                
                if info_key:
                    antecedent = [info.loc[x][info_key] for x in antecedent]
                    consequent = [info.loc[x][info_key] for x in consequent]
                
                confidence = rules.confidence
                lift = rules.lift

                print("%s => %s (support=%.4f, confidence=%.2f, lift=%.2f)" %
                      (antecedent, consequent, support, confidence, lift))
            print()

Next, invent your own set of transactions. Be **creative** but **tasteful,** and think of a list of transactions that **makes sense,** e.g., involving food, music, books, products, places, apps, or other items. 

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your own example of transactions (at least 20 transactions). Execute the apriori algorithm, in which you should obtain at least <strong>two</strong> rules of the form ['A', 'B'] => ['C'], i.e., at least two rules having a 2-itemset in the antecedent and a 1-itemset in the consequent. Modify the transactions until you obtain such rules.</font>

In [3]:
transactions = [
    ['Margarita', 'Pepperoni', 'Veggie Supreme', 'Spicy Sausage'],
    ['BBQ Chicken', 'Meat Lovers'],
    ['Pesto', 'Four-Cheese', 'Sicilian', 'Pepperoni'],
    ['White Pizza', 'Buffalo Chicken', 'Hawaiian', 'Pesto', 'BBQ Chicken'],
    ['Sicilian', 'Veggie Supreme', 'Pepperoni'],
    ['Margarita', 'Spicy Sausage', 'Pesto', 'Buffalo Chicken'],
    ['Meat Lovers', 'White Pizza', 'BBQ Chicken', 'Pepperoni'],
    ['Sicilian', 'Four-Cheese', 'White Pizza', 'Pepperoni', 'Margarita'],
    ['Hawaiian', 'Spicy Sausage', 'Four-Cheese'],
    ['BBQ Chicken', 'Sicilian', 'Pepperoni'],
    ['Veggie Supreme', 'Meat Lovers', 'Four-Cheese'],
    ['White Pizza', 'Four-Cheese'],
    ['Pepperoni', 'BBQ Chicken', 'White Pizza', 'Hawaiian'],
    ['Margarita', 'Veggie Supreme', 'Buffalo Chicken', 'Pesto'],
    ['Spicy Sausage', 'Meat Lovers', 'Four-Cheese'],
    ['Veggie Supreme', 'Sicilian', 'Buffalo Chicken'],
    ['Pesto', 'Hawaiian'],
    ['Four-Cheese', 'Buffalo Chicken', 'Sicilian', 'White Pizza'],
    ['Pepperoni', 'Hawaiian'],
    ['Spicy Sausage', 'Sicilian', 'Veggie Supreme', 'Meat Lovers'],
]

results = list(apriori(transactions, min_support=0.1, min_confidence=0.9, min_lift=1.0))
print_apyori_output(results)

Rules involving itemset ['BBQ Chicken', 'White Pizza', 'Hawaiian']
['BBQ Chicken', 'Hawaiian'] => ['White Pizza'] (support=0.1000, confidence=1.00, lift=3.33)
['White Pizza', 'Hawaiian'] => ['BBQ Chicken'] (support=0.1000, confidence=1.00, lift=4.00)

Rules involving itemset ['Margarita', 'Pesto', 'Buffalo Chicken']
['Margarita', 'Buffalo Chicken'] => ['Pesto'] (support=0.1000, confidence=1.00, lift=4.00)
['Margarita', 'Pesto'] => ['Buffalo Chicken'] (support=0.1000, confidence=1.00, lift=4.00)

Rules involving itemset ['Four-Cheese', 'Pepperoni', 'Sicilian']
['Four-Cheese', 'Pepperoni'] => ['Sicilian'] (support=0.1000, confidence=1.00, lift=2.86)

Rules involving itemset ['Four-Cheese', 'White Pizza', 'Sicilian']
['White Pizza', 'Sicilian'] => ['Four-Cheese'] (support=0.1000, confidence=1.00, lift=2.86)



<font size="+1" color="red">Replace this cell with a markdown cell containing (1) a printout of the rules you have obtained, and (2) for each of those rules, indicate clearly how the support, confidence, and lift is calculated. Do not merely repeat the formula: indicate how each number is computed based on the transactions you provided, as if you were trying to verify that the numbers are correct.</font>

We first have to construct the table containing the 1-itemsets along with their count and support (count / len(transactions)).

| Pizza Type          | Count | Support |
| ------------------- | ----- | ------- |
| {Margarita}         | 4     | 0.2     |
| {Pepperoni}         | 8     | 0.4     |
| {Veggie Supreme}    | 6     | 0.3     |
| {Spicy Sausage}     | 5     | 0.25    |
| {BBQ Chicken}       | 5     | 0.25    |
| {Meat Lovers}       | 5     | 0.25    |
| {Pesto}             | 5     | 0.25    |
| {Four-Cheese}       | 7     | 0.35    |
| {Sicilian}          | 7     | 0.35    |
| {White Pizza}       | 6     | 0.3     |
| {Buffalo Chicken}   | 5     | 0.25    |
| {Hawaiian}          | 5     | 0.25    |

Now we have to discard the ones that have less than 0.1 support (none). After that, construct the same table but with 2-itemsets in a similar way (note that there are no entries with count=0, I skipped them):

I am starting to believe that this exercise was supposed to have like 3 elements but I chose 12... It is getting out of hand... In my defense I will say that for 20 transactions only 3 elements seemed too little. The little support of 0.1 does not help, too.

| Pizza Type                        | Count | Support |
| --------------------------------- | ----- | ------- |
| {Margarita, Pepperoni}            | 2     | 0.10    |
| {Margarita, Veggie Supreme}       | 2     | 0.10    |
| {Margarita, Spicy Sausage}        | 2     | 0.10    |
| {Pepperoni, Veggie Supreme}       | 2     | 0.10    |
| {Pepperoni, Spicy Sausage}        | 1     | 0.05    |
| {Veggie Supreme, Spicy Sausage}   | 2     | 0.10    |
| {BBQ Chicken, Meat Lovers}        | 2     | 0.10    |
| {Pepperoni, Pesto}                | 1     | 0.05    |
| {Pepperoni, Four-Cheese}          | 2     | 0.10    |
| {Pepperoni, Sicilian}             | 4     | 0.20    |
| {Pesto, Four-Cheese}              | 1     | 0.05    |
| {Pesto, Sicilian}                 | 1     | 0.05    |
| {Four-Cheese, Sicilian}           | 3     | 0.15    |
| {BBQ Chicken, Pesto}              | 1     | 0.05    |
| {BBQ Chicken, White Pizza}        | 3     | 0.15    |
| {BBQ Chicken, Buffalo Chicken}    | 1     | 0.05    |
| {BBQ Chicken, Hawaiian}           | 2     | 0.10    |
| {Pesto, White Pizza}              | 1     | 0.05    |
| {Pesto, Buffalo Chicken}          | 3     | 0.15    |
| {Pesto, Hawaiian}                 | 2     | 0.10    |
| {White Pizza, Buffalo Chicken}    | 2     | 0.10    |
| {White Pizza, Hawaiian}           | 2     | 0.10    |
| {Buffalo Chicken, Hawaiian}       | 1     | 0.05    |
| {Veggie Supreme, Sicilian}        | 3     | 0.15    |
| {Margarita, Pesto}                | 2     | 0.10    |
| {Margarita, Buffalo Chicken}      | 2     | 0.10    |
| {Spicy Sausage, Pesto}            | 1     | 0.05    |
| {Spicy Sausage, Buffalo Chicken}  | 1     | 0.05    |
| {Pepperoni, BBQ Chicken}          | 3     | 0.15    |
| {Pepperoni, Meat Lovers}          | 1     | 0.05    |
| {Pepperoni, White Pizza}          | 3     | 0.15    |
| {Meat Lovers, White Pizza}        | 1     | 0.05    |
| {Margarita, Four-Cheese}          | 1     | 0.05    |
| {Margarita, Sicilian}             | 1     | 0.05    |
| {Margarita, White Pizza}          | 1     | 0.05    |
| {Four-Cheese, White Pizza}        | 3     | 0.15    |
| {Sicilian, White Pizza}           | 2     | 0.10    |
| {Spicy Sausage, Four-Cheese}      | 2     | 0.10    |
| {Spicy Sausage, Hawaiian}         | 1     | 0.05    |
| {Four-Cheese, Hawaiian}           | 1     | 0.05    |
| {BBQ Chicken, Sicilian}           | 1     | 0.05    |
| {Veggie Supreme, Meat Lovers}     | 2     | 0.10    |
| {Veggie Supreme, Four-Cheese}     | 1     | 0.05    |
| {Meat Lovers, Four-Cheese}        | 2     | 0.10    |
| {Pepperoni, Hawaiian}             | 2     | 0.10    |
| {Veggie Supreme, Pesto}           | 1     | 0.05    |
| {Veggie Supreme, Buffalo Chicken} | 2     | 0.10    |
| {Spicy Sausage, Meat Lovers}      | 2     | 0.10    |
| {Sicilian, Buffalo Chicken}       | 2     | 0.10    |
| {Four-Cheese, Buffalo Chicken}    | 1     | 0.05    |
| {Spicy Sausage, Sicilian}         | 1     | 0.05    |
| {Meat Lovers, Sicilian}           | 1     | 0.05    |

Similarly, now we have to discard the ones that have less than 0.1 support, below the simplified table.

| Pizza Type                        | Count | Support |
| --------------------------------- | ----- | ------- |
| {Margarita, Pepperoni}            | 2     | 0.10    |
| {Margarita, Veggie Supreme}       | 2     | 0.10    |
| {Margarita, Spicy Sausage}        | 2     | 0.10    |
| {Pepperoni, Veggie Supreme}       | 2     | 0.10    |
| {Veggie Supreme, Spicy Sausage}   | 2     | 0.10    |
| {BBQ Chicken, Meat Lovers}        | 2     | 0.10    |
| {Pepperoni, Four-Cheese}          | 2     | 0.10    |
| {Pepperoni, Sicilian}             | 4     | 0.20    |
| {Four-Cheese, Sicilian}           | 3     | 0.15    |
| {BBQ Chicken, White Pizza}        | 3     | 0.15    |
| {BBQ Chicken, Hawaiian}           | 2     | 0.10    |
| {Pesto, Buffalo Chicken}          | 3     | 0.15    |
| {Pesto, Hawaiian}                 | 2     | 0.10    |
| {White Pizza, Buffalo Chicken}    | 2     | 0.10    |
| {White Pizza, Hawaiian}           | 2     | 0.10    |
| {Veggie Supreme, Sicilian}        | 3     | 0.15    |
| {Margarita, Pesto}                | 2     | 0.10    |
| {Margarita, Buffalo Chicken}      | 2     | 0.10    |
| {Pepperoni, BBQ Chicken}          | 3     | 0.15    |
| {Pepperoni, White Pizza}          | 3     | 0.15    |
| {Four-Cheese, White Pizza}        | 3     | 0.15    |
| {Sicilian, White Pizza}           | 2     | 0.10    |
| {Spicy Sausage, Four-Cheese}      | 2     | 0.10    |
| {Veggie Supreme, Meat Lovers}     | 2     | 0.10    |
| {Meat Lovers, Four-Cheese}        | 2     | 0.10    |
| {Pepperoni, Hawaiian}             | 2     | 0.10    |
| {Veggie Supreme, Buffalo Chicken} | 2     | 0.10    |
| {Spicy Sausage, Meat Lovers}      | 2     | 0.10    |
| {Sicilian, Buffalo Chicken}       | 2     | 0.10    |

Build a table and simplify again,  for 3-itemsets

| Pizza Type                            | Count | Support |
| ------------------------------------- | ----- | ------- |
| {Pepperoni, Four-Cheese, Sicilian}    | 2     | 0.10    |
| {BBQ Chicken, White Pizza, Hawaiian}  | 2     | 0.10    |
| {Margarita, Pesto, Buffalo Chicken}   | 2     | 0.10    |
| {Pepperoni, BBQ Chicken, White Pizza} | 2     | 0.10    |
| {Four-Cheese, Sicilian, White Pizza}  | 2     | 0.10    |

Now calculate the confidence and lift for all the rules remaining in the simplified 3-itemsets. The confidence is the 3-itemset support divided by the antecedent support, and the lift is the 3-itemset support divided by the multiplication of the antecedent and consequent support.

| Pizza Type                               | 3-itemset Support | Antecedent support | Consequent support | Confidence  | Lift |
| ---------------------------------------- | ----------------- | ------------------ | ------------------ | ----------- | ---- |
| {Pepperoni, Four-Cheese} → {Sicilian}    | 0.10              | 0.10               | 0.35               | 1.00        | 2.86 |
| {Pepperoni, Sicilian} → {Four-Cheese}    | 0.10              | 0.20               | 0.35               | 0.50        | 1.43 |
| {Four-Cheese, Sicilian} → {Pepperoni}    | 0.10              | 0.15               | 0.40               | 0.67        | 1.67 |
| {BBQ Chicken, White Pizza} → {Hawaiian}  | 0.10              | 0.15               | 0.25               | 0.67        | 2.67 |
| {BBQ Chicken, Hawaiian} → {White Pizza}  | 0.10              | 0.10               | 0.30               | 1.00        | 3.33 |
| {White Pizza, Hawaiian} → {BBQ Chicken}  | 0.10              | 0.10               | 0.25               | 1.00        | 4.00 |
| {Margarita, Pesto} → {Buffalo Chicken}   | 0.10              | 0.10               | 0.25               | 1.00        | 4.00 |
| {Margarita, Buffalo Chicken} → {Pesto}   | 0.10              | 0.10               | 0.25               | 1.00        | 4.00 |
| {Pesto, Buffalo Chicken} → {Margarita}   | 0.10              | 0.15               | 0.20               | 0.67        | 3.33 |
| {Pepperoni, BBQ Chicken} → {White Pizza} | 0.10              | 0.15               | 0.30               | 0.67        | 2.22 |
| {Pepperoni, White Pizza} → {BBQ Chicken} | 0.10              | 0.15               | 0.25               | 0.67        | 2.67 |
| {BBQ Chicken, White Pizza} → {Pepperoni} | 0.10              | 0.15               | 0.40               | 0.67        | 1.67 |
| {Four-Cheese, Sicilian} → {White Pizza}  | 0.10              | 0.15               | 0.30               | 0.67        | 2.22 |
| {Four-Cheese, White Pizza} → {Sicilian}  | 0.10              | 0.15               | 0.35               | 0.67        | 1.90 |
| {Sicilian, White Pizza} → {Four-Cheese}  | 0.10              | 0.10               | 0.35               | 1.00        | 2.86 |

We can now simplify the table removing all the entries that have confidence smaller than 0.9 or lift smaller than 1.0:

| Pizza Type                               | 3-itemset Support | Antecedent support | Consequent support | Confidence  | Lift |
| ---------------------------------------- | ----------------- | ------------------ | ------------------ | ----------- | ---- |
| {Pepperoni, Four-Cheese} → {Sicilian}    | 0.10              | 0.10               | 0.35               | 1.00        | 2.86 |
| {BBQ Chicken, Hawaiian} → {White Pizza}  | 0.10              | 0.10               | 0.30               | 1.00        | 3.33 |
| {White Pizza, Hawaiian} → {BBQ Chicken}  | 0.10              | 0.10               | 0.25               | 1.00        | 4.00 |
| {Margarita, Pesto} → {Buffalo Chicken}   | 0.10              | 0.10               | 0.25               | 1.00        | 4.00 |
| {Margarita, Buffalo Chicken} → {Pesto}   | 0.10              | 0.10               | 0.25               | 1.00        | 4.00 |
| {Sicilian, White Pizza} → {Four-Cheese}  | 0.10              | 0.10               | 0.35               | 1.00        | 2.86 |

# 2. Load and prepare the shopping baskets

The following code, which you should leave as-is, loads the information about products into a dataframe indexed by product id.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [4]:
# LEAVE AS-IS

# File names
INPUT_PRODUCTS = "instacart-products.csv"
INPUT_TRANSACTIONS = "instacart-transactions.csv.gz"

# Read into a dataframe
products = pd.read_csv(INPUT_PRODUCTS, delimiter=",")

# Set product_id as index, and drop column aisle_id
products = products.set_index('product_id').drop(columns=['aisle_id'])

products.head(100)

Unnamed: 0_level_0,product_name,department_id
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Chocolate Sandwich Cookies,19
2,All-Seasons Salt,13
3,Robust Golden Unsweetened Oolong Tea,7
4,Smart Ones Classic Favorites Mini Rigatoni Wit...,1
5,Green Chile Anytime Sauce,13
...,...,...
96,Sprinklez Confetti Fun Organic Toppings,13
97,Organic Chamomile Lemon Tea,7
98,2% Yellow American Cheese,16
99,Local Living Butter Lettuce,4


## 2.1. Select by department

As this file is large and complex, we will focus on one or two departments and try to get some conclusions about the products in those departments. The following cell, which you should leave as-is, list some department names.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [5]:
# LEAVE AS-IS

DEPT_BAKERY = 3
DEPT_VEGGIES = 4
DEPT_ALCOHOL = 5
DEPT_WORLD = 6
DEPT_DRINKS = 7
DEPT_PETS = 8
DEPT_PHARMACY = 11
DEPT_CLEANING = 17
DEPT_BABIES = 18

Write code that can select a list of products from a set of departments. Do this with a function named `select_from_departments` that takes as input:

* A dataframe containing product information, which will be the `products` dataframe we just loaded.
* A list of product ids
* A list of department ids

It should return a list containing only the product ids that belong to one of the listed departments. This may return an empty list if no product belongs to any of the specified departments.

Given that the products dataframe is indexed by *product_id*, if you want to obtain the *department_id* of product *product_id*, use:

```python
products.loc[product_id].department_id
```

Note that *product_id* must be an integer.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code for *select_from_departments*.</font>

In [74]:
def select_from_departments(df:pd.DataFrame, prod_IDs:list, dep_IDs:list=None):
    prod_IDs = np.array(prod_IDs) - 1
    df2 = df.iloc[prod_IDs]
    if dep_IDs != None:
        return df2[df2.department_id.isin(dep_IDs)]
    return df2

Test your function by passing it a list of products and ensuring it selects only the products in the 1-2 departments you have selected. To obtain test cases you can open the products file with a spreadsheet program.

Each test case should print:

* The product name and department id of each item in the input list
* The product name and department id of each item in the output list

For instance, suppose a test case is `[22, 26, 45, 54, 57, 71, 111, 112]` and we select products from DEPT_BAKERY and DEPT_CLEANING, a test case run should print something similar to this:

```
Test case:
[22, 26, 45, 54, 57, 71, 111, 112]

Input products:
22 Fresh Breath Oral Rinse Mild Mint (dept 11)
26 Fancy Feast Trout Feast Flaked Wet Cat Food (dept 8)
45 European Cucumber (dept 4)
54 24/7 Performance Cat Litter (dept 8)
57 Flat Toothpicks (dept 17)
71 Ultra 7 Inch Polypropylene Traditional Plates (dept 17)
111 Fabric Softener, Geranium Scent (dept 17)
112 Hot Tomatillo Salsa (dept 13)

Selected products:
57 Flat Toothpicks (dept 17)
71 Ultra 7 Inch Polypropylene Traditional Plates (dept 17)
111 Fabric Softener, Geranium Scent (dept 17)
```

Do not replicate code that can be easily factored in a function in your answer.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to test your function with three different test cases. Each test case is a list of items and a list of 1, 2, or 3 departments.</font>

In [82]:
def show_output(df:pd.DataFrame):
    for i, row in df.iterrows():
        print(f'(id={i:4}) {row.product_name} (dept {row.department_id})')

def test_select(df:pd.DataFrame, prod_IDs:list, dep_IDs:list):
    print(f'Test case: \nProducts: {prod_IDs}\nDepartments:{dep_IDs}')
    
    print(f'\nInput products:')
    input_prods = select_from_departments(df, prod_IDs)
    show_output(input_prods)
    
    print(f'\nOutput products:')
    output_prods = select_from_departments(df, prod_IDs, dep_IDs)
    show_output(output_prods)

Test case: 
Products: [22, 26, 45, 54, 57, 71, 111, 112]
Departments:[3, 17]

Input products:
(id=  22) Fresh Breath Oral Rinse Mild Mint (dept 11)
(id=  26) Fancy Feast Trout Feast Flaked Wet Cat Food (dept 8)
(id=  45) European Cucumber (dept 4)
(id=  54) 24/7 Performance Cat Litter (dept 8)
(id=  57) Flat Toothpicks (dept 17)
(id=  71) Ultra 7 Inch Polypropylene Traditional Plates (dept 17)
(id= 111) Fabric Softener, Geranium Scent (dept 17)
(id= 112) Hot Tomatillo Salsa (dept 13)

Output products:
(id=  57) Flat Toothpicks (dept 17)
(id=  71) Ultra 7 Inch Polypropylene Traditional Plates (dept 17)
(id= 111) Fabric Softener, Geranium Scent (dept 17)


In [107]:
tests = [
    [[22, 26, 45, 54, 57, 71, 111, 112], [DEPT_BAKERY, DEPT_CLEANING]],
    [[2158, 5474, 6632, 5828, 4794, 7129, 3125, 1685], [DEPT_VEGGIES]],
    [[786, 7049, 6068, 4458, 1150, 902, 7349, 2028], [DEPT_ALCOHOL, DEPT_DRINKS, DEPT_BABIES]]
]

for t in tests:
    print('======================================================')
    test_select(products, t[0], t[1])

Test case: 
Products: [22, 26, 45, 54, 57, 71, 111, 112]
Departments:[3, 17]

Input products:
(id=  22) Fresh Breath Oral Rinse Mild Mint (dept 11)
(id=  26) Fancy Feast Trout Feast Flaked Wet Cat Food (dept 8)
(id=  45) European Cucumber (dept 4)
(id=  54) 24/7 Performance Cat Litter (dept 8)
(id=  57) Flat Toothpicks (dept 17)
(id=  71) Ultra 7 Inch Polypropylene Traditional Plates (dept 17)
(id= 111) Fabric Softener, Geranium Scent (dept 17)
(id= 112) Hot Tomatillo Salsa (dept 13)

Output products:
(id=  57) Flat Toothpicks (dept 17)
(id=  71) Ultra 7 Inch Polypropylene Traditional Plates (dept 17)
(id= 111) Fabric Softener, Geranium Scent (dept 17)
Test case: 
Products: [2158, 5474, 6632, 5828, 4794, 7129, 3125, 1685]
Departments:[4]

Input products:
(id=2158) #2 Cone White Coffee Filters (dept 7)
(id=5474) Crackers, Puffed, Lightly Salted Corn (dept 19)
(id=6632) Brown Rice Salmon Avocado Roll (dept 20)
(id=5828) Powder Fresh Roll-On Antiperspirant Deodorant (dept 11)
(id=4794) Pu

## 2.2. Read and filter transactions

The transactions file is a compressed file containing one row per transaction. Each transaction is a comma-separated list of *product_id*. The following code iterates through this file:

```python
# Open a compressed file
with gzip.open(INPUT_TRANSACTIONS, "rt") as inputfile:
    
    # Create a CSV reader
    reader = csv.reader(inputfile, delimiter=",")
    
    # Iterate through the CSV file
    for row in reader:
        
        # Convert to integers
        items = [int(x) for x in row]
```

Read the transactions, filtering the items by department. Stop reading (`break`) after you have stored 5000 transactions into an array named `transactions`. Every 1000 transactions read, print the number of transactions read and the number of transactions stored.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to read transactions, keeping only items in DEPT_PHARMACY. Remember to stop after storing 5000 of the transactions read.</font>

In [136]:
def extract_transactions(filename, dept_IDs):
    transactions = []

    # Open a compressed file
    with gzip.open(filename, "rt") as inputfile:
        
        # Create a CSV reader
        reader = csv.reader(inputfile, delimiter=",")
        
        # Iterate through the CSV file
        for row in reader:
            
            # Convert to integers
            items = [int(x) for x in row]
            
            index = select_from_departments(products, items, dept_IDs).index.to_list()
            if len(index) != 0:
                transactions.append(index)

            if len(transactions)>5000:
                break
    
    return transactions

In [137]:
transactions = extract_transactions(INPUT_TRANSACTIONS, [DEPT_PHARMACY])

## 2.3. Extract association rules and comment on them

You are now ready to run the association rules mining algorithm over the selected transactions:

```python
results = list(apriori(transactions, min_support=..., min_confidence=..., min_lift=...))
print_apyori_output(results, products, 'product_name')
```

*Tip:* if you set `min_support` to a very small value, your notebook will probably hang.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to extract association rules from the read transactions.</font>

In [138]:
results = list(apriori(transactions, min_support=0.0002, min_confidence=0.9, min_lift=1.0))
print_apyori_output(results, products, 'product_name')

Rules involving itemset [5584, 4898]
['Vitamin C 250 mg 60 Gummies'] => ['Vitamin D3 Gummies, 1000 IU, Great Wild Berry Taste!'] (support=0.0004, confidence=1.00, lift=2500.50)
['Vitamin D3 Gummies, 1000 IU, Great Wild Berry Taste!'] => ['Vitamin C 250 mg 60 Gummies'] (support=0.0004, confidence=1.00, lift=2500.50)

Rules involving itemset [23425, 5019]
['Nourish & Moisturize Shampoo'] => ['Nourish+ Moisturize Conditioner'] (support=0.0006, confidence=1.00, lift=1250.25)

Rules involving itemset [11007, 5663]
['Chocolate Energy Supplement'] => ['Chocolate Calming Supplement'] (support=0.0006, confidence=1.00, lift=1250.25)

Rules involving itemset [10979, 6876]
['Sheer Blonde Highlight Activating Conditioner'] => ['Sheer Blonde Highlight Activating Brightening Shampoo'] (support=0.0004, confidence=1.00, lift=1250.25)

Rules involving itemset [13899, 9951]
['Outlast Long Lasting Mint Mouthwash'] => ['Mint Glide Floss Picks'] (support=0.0004, confidence=1.00, lift=555.67)

Rules involvin

<font size="+1" color="red">Replace this cell with a brief commentary on what you would recommend to the shopping app considering the extracted association rules.</font>

# TODO

## 2.4. Extract association rules and comment on them (other departments)

<font size="+1" color="red">Replace this cell with code to select a different set of departments (at least two, not DEPT_PHARMACY) and extract transactions again. Avoid replicating code when possible.</font>

In [146]:
transactions = extract_transactions(INPUT_TRANSACTIONS, [DEPT_CLEANING, DEPT_PETS])
results = list(apriori(transactions, min_support=0.0006, min_confidence=0.9, min_lift=1.0))
print_apyori_output(results, products, 'product_name')

Rules involving itemset [31747, 32748]
['Classic Cod Sole & Shrimp Feast Cat Food'] => ['Classic Salmon & Shrimp Feast Cat Food'] (support=0.0010, confidence=1.00, lift=833.50)

Rules involving itemset [31747, 24508, 32748]
['Classic Salmon & Shrimp Feast Cat Food', 'Classic Ocean Whitefish & Tuna Feast Cat Food'] => ['Classic Cod Sole & Shrimp Feast Cat Food'] (support=0.0008, confidence=1.00, lift=1000.20)
['Classic Ocean Whitefish & Tuna Feast Cat Food', 'Classic Cod Sole & Shrimp Feast Cat Food'] => ['Classic Salmon & Shrimp Feast Cat Food'] (support=0.0008, confidence=1.00, lift=833.50)



<font size="+1" color="red">Replace this cell with your commentary on the obtained rules.</font>

# DELIVER (individually)

Remember to read the section on "delivering your code" in the [course evaluation guidelines](https://github.com/chatox/data-mining-course/blob/master/upf/upf-evaluation.md).

Deliver a zip file containing:

* This notebook

## Extra points available

For more learning and extra points, copy the function `print_apyori_output` to `print_apyori_output_diff_dept` and modify it to filter the obtained association rules so that you print only the ones involving products in different departments.

To be precise, this means rules in which there is at least a product in the *consequence* that belongs to a department that none of the products in the *antecedent* belongs to. Experiment with different combinations of departments, and try to discover interesting groups of products in different departments that are related to each other.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: experiments on cross-department association rules</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>