## Market basket analysis at the grocery outlet

### Introduction

**Market basket analysis** tells us which products tend to be purchased together and which are most amenable to promotion. This information is actionable: it can suggest new store layouts, determine which articles to put on special, indicate when to issue coupons, and so on. When these data can be tied to individual customers through a loyalty card or website registration, they become even more valuable. The application of **association rules** to market basket analysis is a classic of data mining. 

In this example, a Chicago-based marketing analyst focusing on the retail industry explores different approaches for modeling consumer behavior using data on **point-of-sale transactions** in small stores of the Chicago metropolitan area. She starts with a market basket analysis of data from a typical local grocery outlet, where she intends to identify **joint occurrence** of products in shopping baskets.

### The data set

The `groceries` data set covers one month of point-of-sale **transaction data**. It contains 9,835 transactions, with the items aggregated into 169 categories. The data come as a **transaction/item matrix**: an entry equal to 1 at the intersection of row `i` and column `j` indicates that transaction `i` includes item `j`.

I start by loading the data in the usual way.

In [1]:
import pandas as pd
groceries = pd.read_csv('https://raw.githubusercontent.com/iese-bad/' +
    'DataSci/master/Data/groceries.csv')
groceries.shape

(9835, 169)

In [2]:
print(groceries.iloc[:10, :6])

   frankfurter  sausage  liver_loaf  ham  meat  finished_products
0            0        0           0    0     0                  0
1            0        0           0    0     0                  0
2            0        0           0    0     0                  0
3            0        0           0    0     0                  0
4            0        0           0    0     0                  0
5            0        0           0    0     0                  0
6            0        0           0    0     0                  0
7            0        0           0    0     0                  0
8            0        0           0    0     0                  0
9            0        0           0    0     0                  0


Note the **sparsity** of the data: only 43,367 of the 1,662,115 entries of this matrix are nonzero (2.6%). This is therefore an inefficient way of storing the data, although it keeps this example simple.

In [3]:
groceries.shape[0]*groceries.shape[1]

1662115

In [4]:
groceries.sum().sum()

43367
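
To make the sparsity point concrete, here is a minimal sketch on a made-up 4 x 3 matrix (the `toy` data frame and its item names are hypothetical, not part of the `groceries` file). It computes the share of nonzero entries and shows a more compact representation, with one list of item names per transaction.

```python
import pandas as pd

# Hypothetical 0/1 transaction/item matrix (4 transactions, 3 items)
toy = pd.DataFrame(
    {'milk': [1, 0, 1, 1], 'bread': [0, 0, 1, 0], 'beer': [0, 1, 0, 0]}
)

# Share of nonzero entries: number of ones divided by number of cells
density = toy.values.sum() / toy.size

# Compact alternative: keep only the names of the items in each basket
baskets = [list(toy.columns[row == 1]) for row in toy.values]

print(density)
print(baskets)
```

The list-of-baskets form stores one entry per purchased item, so its size grows with the number of ones rather than with the number of cells.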

The analysis of this example uses functions taken from the Python package `mlxtend`, developed as a complement to scikit-learn. The analysis comes in two steps: (a) extracting the most frequent itemsets, and (b) selecting association rules by support and confidence. These two steps are not always separated in data science software, but they are in this package.

### Mining itemsets

For the frequent itemsets, we load the function `apriori`.

In [5]:
from mlxtend.frequent_patterns import apriori

What counts as a frequent itemset depends on the particular data set. In this example, after some exploration (not reported), I set the **minimum support** to 0.01, which leaves enough itemsets to examine.

In [6]:
frequent_itemsets = apriori(groceries, min_support=0.01, use_colnames=True)
frequent_itemsets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 2 columns):
support     333 non-null float64
itemsets    333 non-null object
dtypes: float64(1), object(1)
memory usage: 5.3+ KB


The function `apriori` returns a data frame with two columns, the support and the itemset. 

In [7]:
print(frequent_itemsets.head())

    support       itemsets
0  0.058973  (frankfurter)
1  0.093950      (sausage)
2  0.026029          (ham)
3  0.025826         (meat)
4  0.042908      (chicken)
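
The support of an itemset is the proportion of transactions that contain all of its items. As a rough check of what the `support` column means, the sketch below computes it by hand with pandas on a made-up matrix (the `toy` data frame and the helper `support` are hypothetical names introduced for illustration).

```python
import pandas as pd

# Hypothetical 0/1 transaction/item matrix
toy = pd.DataFrame(
    {'milk': [1, 0, 1, 1], 'bread': [0, 0, 1, 0], 'beer': [0, 1, 0, 0]}
)

def support(df, items):
    """Proportion of rows in which every item of `items` is present."""
    return df[list(items)].all(axis=1).mean()

# For a single item, the support is just the column mean
single = support(toy, ['milk'])
column_mean = toy['milk'].mean()

# For a pair, both columns must be 1 in the same row
pair = support(toy, ['milk', 'bread'])

print(single, column_mean, pair)
```

With a minimum support of 0.01 and 9,835 transactions, an itemset has to appear in at least 99 baskets to be retained.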


The entries of the column `itemsets` are frozen sets. A frozen set is an immutable version of a Python set: while elements can be added to or removed from a set at any time, the contents of a frozen set stay fixed after creation.

In [8]:
frequent_itemsets.itemsets[0]

frozenset({'frankfurter'})
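
The difference between the two types can be seen in a short sketch using plain Python (the variable names are mine). A set accepts new elements, a frozen set does not, and, being immutable, a frozen set can be used as a dictionary key.

```python
# A regular set can grow after creation
basket = {'frankfurter'}
basket.add('sausage')

# A frozen set has no method for adding elements
frozen = frozenset({'frankfurter'})
try:
    frozen.add('sausage')
    mutated = True
except AttributeError:
    mutated = False

# Immutability makes frozen sets hashable, so they can index a dictionary
support_by_itemset = {frozen: 0.058973}

print(basket, mutated, support_by_itemset[frozenset(['frankfurter'])])
```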

I add the length of the itemsets, which will allow me to filter them by length and get a clearer picture. The method `apply` applies a function, term by term, to a column of a data frame. Here, I use the function `len`, which returns the number of elements of a set. `apply` is typically used in combination with **lambda functions**, that is, functions which are defined on the fly, not named, and forgotten after execution.

In [9]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
print(frequent_itemsets.head())

    support       itemsets  length
0  0.058973  (frankfurter)       1
1  0.093950      (sausage)       1
2  0.026029          (ham)       1
3  0.025826         (meat)       1
4  0.042908      (chicken)       1
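
As a side note, when the lambda only wraps an existing function, that function can be passed to `apply` directly. A minimal sketch on a made-up series of frozen sets (the `itemsets` series below is hypothetical):

```python
import pandas as pd

# Hypothetical column of itemsets
itemsets = pd.Series([frozenset({'milk'}), frozenset({'milk', 'bread'})])

# Lambda version, as in the cell above
with_lambda = itemsets.apply(lambda x: len(x))

# Equivalent version passing the built-in function itself
with_len = itemsets.apply(len)

print(with_lambda.tolist(), with_len.tolist())
```

The lambda form becomes necessary once the function needs extra logic, for instance `lambda x: len(x) > 1`.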


I first pick the itemsets of size 1. They are not needed for describing the association rules, but they help to understand the concepts. The method `sort_values` sorts the rows of a data frame by a column. Setting `ascending=0` (equivalent to `ascending=False`) sorts in descending order.

In [10]:
item1 = frequent_itemsets[frequent_itemsets['length'] == 1]
print(item1.sort_values('support', ascending=0)[:20])

     support                 itemsets  length
18  0.255516             (whole_milk)       1
16  0.193493       (other_vegetables)       1
40  0.183935             (rolls_buns)       1
60  0.174377                   (soda)       1
23  0.139502                 (yogurt)       1
59  0.110524          (bottled_water)       1
13  0.108998        (root_vegetables)       1
9   0.104931         (tropical_fruit)       1
87  0.098526          (shopping_bags)       1
1   0.093950                (sausage)       1
43  0.088968                 (pastry)       1
8   0.082766           (citrus_fruit)       1
63  0.080529           (bottled_beer)       1
84  0.079817             (newspapers)       1
64  0.077682            (canned_beer)       1
10  0.075648              (pip_fruit)       1
62  0.072293  (fruit_vegetable_juice)       1
24  0.071683     (whipped_sour_cream)       1
42  0.064870            (brown_bread)       1
39  0.063447          (domestic_eggs)       1
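
For reference, the same kind of top-n listing can be obtained in two ways in pandas. The sketch below uses a made-up table (`toy_itemsets` is a hypothetical name) and compares `sort_values` with `ascending=False` against the method `nlargest`.

```python
import pandas as pd

# Hypothetical table of itemsets with their supports
toy_itemsets = pd.DataFrame({
    'support': [0.10, 0.30, 0.20],
    'itemsets': [frozenset({'ham'}), frozenset({'milk'}), frozenset({'soda'})],
})

# Descending sort, then keep the first two rows
by_flag = toy_itemsets.sort_values('support', ascending=False).head(2)

# Same selection with nlargest
by_nlargest = toy_itemsets.nlargest(2, 'support')

print(by_flag)
print(by_nlargest)
```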


Setting the length to 2, I pick a second part of the collection of frequent itemsets.

In [11]:
item2 = frequent_itemsets[frequent_itemsets['length'] == 2]
print(item2.sort_values('support', ascending=0)[:20])

      support                             itemsets  length
184  0.074835       (other_vegetables, whole_milk)       2
223  0.056634             (rolls_buns, whole_milk)       2
216  0.056024                 (yogurt, whole_milk)       2
166  0.048907        (whole_milk, root_vegetables)       2
165  0.047382  (other_vegetables, root_vegetables)       2
189  0.043416           (yogurt, other_vegetables)       2
194  0.042603       (rolls_buns, other_vegetables)       2
140  0.042298         (tropical_fruit, whole_milk)       2
232  0.040061                   (whole_milk, soda)       2
273  0.038332                   (rolls_buns, soda)       2
139  0.035892   (tropical_fruit, other_vegetables)       2
231  0.034367          (bottled_water, whole_milk)       2
253  0.034367                 (yogurt, rolls_buns)       2
226  0.033249                 (whole_milk, pastry)       2
202  0.032740             (other_vegetables, soda)       2
217  0.032232     (whipped_sour_cream, whole_milk)       2

Finally, setting the length to 3, I get a third list.

In [12]:
item3 = frequent_itemsets[frequent_itemsets['length'] == 3]
print(item3.sort_values('support', ascending=0)[:20])

      support                                           itemsets  length
313  0.023183    (other_vegetables, whole_milk, root_vegetables)       3
319  0.022267             (yogurt, other_vegetables, whole_milk)       3
322  0.017895         (rolls_buns, other_vegetables, whole_milk)       3
308  0.017082     (tropical_fruit, other_vegetables, whole_milk)       3
331  0.015557                   (yogurt, rolls_buns, whole_milk)       3
310  0.015150               (yogurt, tropical_fruit, whole_milk)       3
320  0.014642  (whipped_sour_cream, other_vegetables, whole_m...       3
316  0.014540              (yogurt, whole_milk, root_vegetables)       3
325  0.013930               (other_vegetables, whole_milk, soda)       3
312  0.013523          (other_vegetables, whole_milk, pip_fruit)       3
304  0.013015       (citrus_fruit, other_vegetables, whole_milk)       3
314  0.012913        (yogurt, other_vegetables, root_vegetables)       3
317  0.012710          (rolls_buns, whole_milk, roo

For mining association rules, I use the function `association_rules`. I use the **confidence** for selecting the most relevant rules, setting the threshold to 0.4. The confidence of a rule is the proportion of the transactions containing the antecedent that also contain the consequent. You may find in the literature examples with much higher thresholds, but we cannot be so strict in this case.

In [13]:
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.4)

Finally, I arrange things so my presentation of the rules looks nicer.

In [14]:
rules = rules.sort_values('confidence', ascending=0)[0:20]
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

                                  antecedents         consequents   support  \
27            (citrus_fruit, root_vegetables)  (other_vegetables)  0.010371   
31          (tropical_fruit, root_vegetables)  (other_vegetables)  0.012303   
59                             (yogurt, curd)        (whole_milk)  0.010066   
46                 (other_vegetables, butter)        (whole_milk)  0.011490   
32          (tropical_fruit, root_vegetables)        (whole_milk)  0.011998   
44                  (yogurt, root_vegetables)        (whole_milk)  0.014540   
51          (domestic_eggs, other_vegetables)        (whole_milk)  0.012303   
60               (yogurt, whipped_sour_cream)        (whole_milk)  0.010880   
45              (rolls_buns, root_vegetables)        (whole_milk)  0.012710   
38              (other_vegetables, pip_fruit)        (whole_milk)  0.013523   
36                   (yogurt, tropical_fruit)        (whole_milk)  0.015150   
48                 (yogurt, other_vegetables)       