# Table of Contents
- [The store_data dataset](#The-store-data-dataset)
- [1. Generating frequent patterns with the apriori algorithm](#1.-Generating-frequent-patterns-apriori)
- [2. Generating frequent patterns with the FP-growth algorithm](#2.-Generating-frequent-patterns-fp-growth)
- [3. Association rules generation and evaluation](#3.-Association-rules-generation-and-evaluation)
- [4. Exercise](#4.-Exercise)

In [1]:
import os
import pandas as pd
import numpy as np

# The store_data dataset

In [2]:
df = pd.read_csv('dataset/store_data.csv', header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,butter,light mayo,fresh bread,,,,,,,,,,,,,,,,,
7497,burgers,frozen vegetables,eggs,french fries,magazines,green tea,,,,,,,,,,,,,,
7498,chicken,,,,,,,,,,,,,,,,,,,
7499,escalope,green tea,,,,,,,,,,,,,,,,,,


Because pandas loads the file into a rectangular table, the number of columns equals the number of products in the longest transaction; thus **there is at least one transaction with 20 products**. Shorter transactions are padded with `NaN`.
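As a quick check of this claim, the sketch below builds a small DataFrame from transactions of different lengths and compares the column count with the longest transaction. The toy rows and the variable names (`rows`, `toy`, `items_per_transaction`) are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy transactions of different lengths (hypothetical data, not store_data.csv)
rows = [["shrimp", "almonds", "avocado"], ["chutney"], ["turkey", "avocado"]]
toy = pd.DataFrame(rows)  # shorter rows are padded with missing values

# Number of real (non-missing) items in each transaction
items_per_transaction = toy.notna().sum(axis=1)

n_columns = toy.shape[1]
longest = int(items_per_transaction.max())
print(n_columns, longest)
```

On the raw DataFrame loaded in cell `In [2]`, the same expression `df.notna().sum(axis=1).max()` should match the 20 columns shown above.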

In [3]:
df.values

array([['shrimp', 'almonds', 'avocado', ..., 'frozen smoothie',
        'spinach', 'olive oil'],
       ['burgers', 'meatballs', 'eggs', ..., nan, nan, nan],
       ['chutney', nan, nan, ..., nan, nan, nan],
       ...,
       ['chicken', nan, nan, ..., nan, nan, nan],
       ['escalope', 'green tea', nan, ..., nan, nan, nan],
       ['eggs', 'frozen smoothie', 'yogurt cake', ..., nan, nan, nan]],
      shape=(7501, 20), dtype=object)

Getting to know your data is an important step. As we can see, pandas pads the shorter transactions with `nan` values. These values must be filtered out, so a **preprocessing step** is required.

In [4]:
# Keep only real items; pd.notna() skips the NaN padding in each row
data = [[val for val in row if pd.notna(val)] for row in df.values]
data

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers', 'meatballs', 'eggs'],
 ['chutney'],
 ['turkey', 'avocado'],
 ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea'],
 ['low fat yogurt'],
 ['whole wheat pasta', 'french fries'],
 ['soup', 'light cream', 'shallot'],
 ['frozen vegetables', 'spaghetti', 'green tea'],
 ['french fries'],
 ['eggs', 'pet food'],
 ['cookies'],
 ['turkey', 'burgers', 'mineral water', 'eggs', 'cooking oil'],
 ['spaghetti', 'champagne', 'cookies'],
 ['mineral water', 'salmon'],
 ['mineral water'],
 ['shrimp',
  'chocolate',
  'chicken',
  'honey',
  'oil',
  'cooking oil',
  'low fat yogurt'],
 ['turkey', 'eggs'],
 ['turkey',
  'fresh tuna',
  'tomatoes',
  'spagh

## 1. Generating frequent patterns with the apriori algorithm

As in the previous hands-on session, in this notebook we will use `mlxtend` ([machine learning extension](http://rasbt.github.io/mlxtend/)), one of the third-party libraries that implement the most popular frequent pattern mining algorithms.

**Encoded format**

The `TransactionEncoder` converts lists of items into the transaction format required for frequent itemset mining: it transforms the input dataset into a one-hot encoded NumPy boolean array, with one row per transaction and one column per unique item.
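To make the encoded format concrete, here is a minimal plain-Python sketch of one-hot encoding on three made-up transactions. It only illustrates the idea and is not how `mlxtend` implements the encoder; the names `transactions`, `columns` and `encoded` are hypothetical.

```python
# Toy transactions (hypothetical data)
transactions = [["milk", "bread"], ["bread"], ["eggs", "milk"]]

# One column per unique item, in sorted order
columns = sorted({item for t in transactions for item in t})

# One boolean row per transaction: True where the item is in the basket
encoded = [[item in t for item in columns] for t in transactions]

for row in encoded:
    print(dict(zip(columns, row)))
```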

In [5]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7497,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7498,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7499,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


After encoding the transactions, `te.columns_` holds the complete list of unique products.

In [6]:
te.columns_

[' asparagus',
 'almonds',
 'antioxydant juice',
 'asparagus',
 'avocado',
 'babies food',
 'bacon',
 'barbecue sauce',
 'black tea',
 'blueberries',
 'body spray',
 'bramble',
 'brownies',
 'bug spray',
 'burger sauce',
 'burgers',
 'butter',
 'cake',
 'candy bars',
 'carrots',
 'cauliflower',
 'cereals',
 'champagne',
 'chicken',
 'chili',
 'chocolate',
 'chocolate bread',
 'chutney',
 'cider',
 'clothes accessories',
 'cookies',
 'cooking oil',
 'corn',
 'cottage cheese',
 'cream',
 'dessert wine',
 'eggplant',
 'eggs',
 'energy bar',
 'energy drink',
 'escalope',
 'extra dark chocolate',
 'flax seed',
 'french fries',
 'french wine',
 'fresh bread',
 'fresh tuna',
 'fromage blanc',
 'frozen smoothie',
 'frozen vegetables',
 'gluten free bar',
 'grated cheese',
 'green beans',
 'green grapes',
 'green tea',
 'ground beef',
 'gums',
 'ham',
 'hand protein bar',
 'herb & pepper',
 'honey',
 'hot dogs',
 'ketchup',
 'light cream',
 'light mayo',
 'low fat yogurt',
 'magazines',
 'mashe
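The list above contains both `' asparagus'` (with a leading space) and `'asparagus'`, so the same product is counted under two different names. A possible fix, sketched here on made-up rows, is to strip surrounding whitespace from every item before encoding. Note that applying it to `data` would merge those two columns and therefore slightly change the outputs shown in this notebook.

```python
# Toy transactions with stray whitespace (hypothetical data)
raw = [[" asparagus", "eggs"], ["asparagus"], ["eggs ", "milk"]]

# Strip leading/trailing spaces so equal products share one name
cleaned = [[item.strip() for item in row] for row in raw]

unique_items = sorted({item for row in cleaned for item in row})
print(unique_items)
```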

Now we extract the itemsets whose support is at least MinSup (e.g., MinSup = 0.05), i.e. those that appear in at least 5% of the transactions:

In [7]:
from mlxtend.frequent_patterns import apriori
apriori?

Signature:
apriori(
    df,
    min_support=0.5,
    use_colnames=False,
    max_len=None,
    verbose=0,
    low_memory=False,
)
Docstring:
Get frequent itemsets from a one-hot DataFrame

Parameters
-----------
df : pandas DataFrame
  pandas DataFrame the encoded format. Also supports
  DataFrames with sparse data; for more info, please
  see (https://pandas.pydata.org/pandas-docs/stable/
       user_guide/sparse.html#sparse-data-structures)

  Please note that the old pandas SparseDataFrame format
  is no longer supported in mlxtend >= 0.17.2.

  The allowed values are either 0/1 or True/False.
  For example,

```
         Apple  Bananas   Beer  Chicken   Milk   Rice
    0     True    False   True     True  False   True
    1     True    False   True    False  False   True
    2     True    False   True    False  False  False
    3     True     True  False    False  False  False
   

In [8]:
freq_itemset = apriori(df, 
                       min_support = 0.05, 
                       use_colnames = True,
                       verbose = True)

Processing 6 combinations | Sampling itemset size 3 2


In [9]:
freq_itemset

Unnamed: 0,support,itemsets
0,0.087188,(burgers)
1,0.081056,(cake)
2,0.059992,(chicken)
3,0.163845,(chocolate)
4,0.080389,(cookies)
5,0.05106,(cooking oil)
6,0.179709,(eggs)
7,0.079323,(escalope)
8,0.170911,(french fries)
9,0.063325,(frozen smoothie)


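We can measure the execution time of this algorithm (and of any other piece of code) with the IPython `%timeit` magic command. Here `-n 100` runs the statement 100 times per measurement and `-r 10` repeats the measurement 10 times. Note that this cell uses `min_support = 0.1`.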

In [10]:
%timeit -n 100 -r 10 apriori(df, min_support = 0.1, use_colnames = True, verbose = False)

1.53 ms ± 40.8 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)


We can take advantage of pandas to analyze and filter the results efficiently. For example, we can add a new column to the DataFrame of frequent itemsets that stores the length of each itemset.

In [11]:
freq_itemset['length'] = freq_itemset['itemsets'].apply(len)
freq_itemset

Unnamed: 0,support,itemsets,length
0,0.087188,(burgers),1
1,0.081056,(cake),1
2,0.059992,(chicken),1
3,0.163845,(chocolate),1
4,0.080389,(cookies),1
5,0.05106,(cooking oil),1
6,0.179709,(eggs),1
7,0.079323,(escalope),1
8,0.170911,(french fries),1
9,0.063325,(frozen smoothie),1


We can then filter the results by any desired criterion (e.g., select only the k-itemsets with k >= 2):

In [12]:
freq_itemset[freq_itemset['length'] >= 2]

Unnamed: 0,support,itemsets,length
25,0.05266,"(mineral water, chocolate)",2
26,0.050927,"(mineral water, eggs)",2
27,0.059725,"(spaghetti, mineral water)",2


## 2. Generating frequent patterns with the FP-growth algorithm

Unlike the Apriori algorithm, which follows the generate-and-test approach, the FP-growth algorithm takes a different approach. It compresses the dataset into a compact structure known as an FP-tree and directly extracts frequent itemsets from it.

In [13]:
from mlxtend.frequent_patterns import fpgrowth
freq_itemset = fpgrowth(df, min_support=0.05, use_colnames=True)
freq_itemset

Unnamed: 0,support,itemsets
0,0.238368,(mineral water)
1,0.132116,(green tea)
2,0.076523,(low fat yogurt)
3,0.071457,(shrimp)
4,0.065858,(olive oil)
5,0.063325,(frozen smoothie)
6,0.179709,(eggs)
7,0.087188,(burgers)
8,0.062525,(turkey)
9,0.129583,(milk)


In [14]:
%timeit -n 100 -r 10 fpgrowth(df, min_support=0.1, use_colnames=True)

29.1 ms ± 1.06 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)


Note: as with `apriori`, the **max_len** parameter limits the maximum length (number of items) of the itemsets that are generated.

In [15]:
fpgrowth?

Signature:
fpgrowth(
    df,
    min_support=0.5,
    null_values=False,
    use_colnames=False,
    max_len=None,
    verbose=0,
)
Docstring:
Get frequent itemsets from a one-hot DataFrame

Parameters
-----------
df : pandas DataFrame
  pandas DataFrame the encoded format. Also supports
  DataFrames with sparse data; for more info, please
  see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures.

  Please note that the old pandas SparseDataFrame format
  is no longer supported in mlxtend >= 0.17.2.

  The allowed values are either 0/1 or True/False.
  For example,

```
       Apple  Bananas   Beer  Chicken   Milk   Rice
    0   True    False   True     True  False   True
    1   True    False   True    False  False   True
    2   True    False   True    False  False  False
    3   True     True  False    False  False  False
    4  False    Fals

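Since FP-growth eliminates the need to explicitly generate candidate sets, it can be significantly faster than the Apriori algorithm, especially on large datasets or with low support thresholds. It is not always faster, though: in the timings above (`min_support = 0.1`), Apriori took about 1.53 ms per loop versus 29.1 ms for FP-growth, most likely because very few candidates have to be tested at that threshold. FP-growth may also be more memory-intensive (the FP-tree may not fit in memory).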
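The explicit candidate generation mentioned here can be sketched in a few lines of plain Python. The function below is a simplified, unoptimized illustration of the level-wise generate-and-test idea on made-up transactions. It is not the `mlxtend` implementation, and the names `apriori_sketch` and `toy` are hypothetical.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Level-wise generate-and-test; returns {frozenset: support}."""
    tx_sets = [frozenset(t) for t in transactions]
    n = len(tx_sets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(itemset <= t for t in tx_sets) / n

    # Level 1: frequent single items
    items = {item for t in tx_sets for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Generate: join frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Test: count the support of the surviving candidates
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Toy transactions (hypothetical data)
toy = [["bread", "milk"], ["bread", "eggs"], ["bread", "milk", "eggs"], ["milk"]]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Every level requires another pass over the transactions to count candidate supports. FP-growth avoids these repeated passes by mining the FP-tree instead.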

## 3. Association rules generation and evaluation

An association rule is an implication expression of the form $X \rightarrow Y$, where $X$ and $Y$ are disjoint itemsets.

Association rules can be generated as follows:
- for each frequent itemset $l$, generate all nonempty proper subsets of $l$
- for every nonempty proper subset $s$ of $l$, output the rule "$s \rightarrow (l-s)$" if $\frac{\text{support}(l)}{\text{support}(s)} \geq \text{min\_conf}$

As the rules are generated from frequent itemsets, each one automatically satisfies the *minimum support*.
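The two steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the `mlxtend` implementation: `generate_rules` is a hypothetical helper, the supports are copied from the frequent-itemset outputs earlier in this notebook, and the threshold test uses `>=` (switch to `>` if a strict comparison is wanted).

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent maps frozenset -> support; returns (antecedent, consequent, confidence)."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for r in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), r):
                s = frozenset(subset)
                # Every subset of a frequent itemset is frequent, so it is in the dict
                conf = sup / frequent[s]
                if conf >= min_conf:
                    rules.append((s, itemset - s, conf))
    return rules

# Supports taken from the frequent-itemset outputs shown earlier
frequent = {
    frozenset({"mineral water"}): 0.238368,
    frozenset({"eggs"}): 0.179709,
    frozenset({"mineral water", "eggs"}): 0.050927,
}

rules = generate_rules(frequent, min_conf=0.25)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(conf, 3))
```

With `min_conf=0.25` only the rule with `eggs` as antecedent survives, because the confidence of the reverse rule is about 0.21.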

In the following we generate association rules from the frequent itemsets.

In [16]:
from mlxtend.frequent_patterns import association_rules
association_rules?

Signature:
association_rules(
    df: pandas.core.frame.DataFrame,
    num_itemsets: Optional[int] = 1,
    df_orig: Optional[pandas.core.frame.DataFrame] = None,
    null_values=False,
    metric='confidence',
    min_threshold=0.8,
    support_only=False,
    return_metrics: list = ['antecedent support', 'consequent support', 'support', 'confidence', 'lift', 'representativity', 'leverage', 'conviction', 'zhangs_metric', 'jaccard', 'certainty', 'kulczynski'],
) -> pandas.core.frame.DataFrame
Docstring:
Generates a DataFrame of association rules including the
metrics 'score', 'confidence', and 'lift'

Parameters
-----------
df : pandas DataFrame
  pandas DataFrame of frequent itemsets
  with columns ['support', 'itemsets']

df_orig : pandas DataFrame (defau

In [17]:
freq_itemset

Unnamed: 0,support,itemsets
0,0.238368,(mineral water)
1,0.132116,(green tea)
2,0.076523,(low fat yogurt)
3,0.071457,(shrimp)
4,0.065858,(olive oil)
5,0.063325,(frozen smoothie)
6,0.179709,(eggs)
7,0.087188,(burgers)
8,0.062525,(turkey)
9,0.129583,(milk)


In [18]:
association_rules(freq_itemset, 
                  metric = "confidence", 
                  min_threshold = 0.1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(mineral water),(eggs),0.238368,0.179709,0.050927,0.213647,1.188845,1.0,0.00809,1.043158,0.208562,0.138707,0.041372,0.248515
1,(eggs),(mineral water),0.179709,0.238368,0.050927,0.283383,1.188845,1.0,0.00809,1.062815,0.193648,0.138707,0.059103,0.248515
2,(spaghetti),(mineral water),0.17411,0.238368,0.059725,0.343032,1.439085,1.0,0.018223,1.159314,0.369437,0.169312,0.137421,0.296796
3,(mineral water),(spaghetti),0.238368,0.17411,0.059725,0.250559,1.439085,1.0,0.018223,1.102008,0.400606,0.169312,0.092566,0.296796
4,(mineral water),(chocolate),0.238368,0.163845,0.05266,0.220917,1.348332,1.0,0.013604,1.073256,0.339197,0.150648,0.068256,0.271158
5,(chocolate),(mineral water),0.163845,0.238368,0.05266,0.3214,1.348332,1.0,0.013604,1.122357,0.308965,0.150648,0.109018,0.271158


## 4. Exercise

Given the *transactions_data.csv* dataset, perform Market Basket Analysis. Start by generating frequent itemsets using the Apriori or FP-growth algorithm. Then, use these frequent itemsets to generate association rules.