In [0]:
!pip install mlxtend



We need an additional library, mlxtend, to do association rule mining.

https://rasbt.github.io/mlxtend/

Citation:

Raschka, Sebastian. Mlxtend. Apr. 2016, doi:10.5281/zenodo.594432. http://dx.doi.org/10.5281/zenodo.594432.

This library follows the same conventions as scikit-learn (sklearn) and was built to take pandas DataFrames as input.


In [0]:
# libraries you will need for following through this notebook.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import csv

# allow us to view more rows at a time
pd.options.display.max_rows = 999

# the functions we need from mlxtend are here
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules


In [0]:
# The following code uploads a file to the Google Colab
# (colab.research.google.com) environment.

# library for uploading files
from google.colab import files 

def upload_files():
    # initiates the upload - follow the dialogues that appear
    uploaded = files.upload()

    # verify the upload
    for fn in uploaded.keys():
        print('User uploaded file "{name}" with length {length} bytes'.format(
            name=fn, length=len(uploaded[fn])))

    # uploaded files need to be written to file to interact with them
    # as part of a file system
    for filename in uploaded.keys():
        with open(filename, 'wb') as f:
            f.write(uploaded[filename])

In [0]:
# upload the groceries.csv file
upload_files()

Saving groceries.csv to groceries.csv
User uploaded file "groceries.csv" with length 500843 bytes


The groceries.csv file is taken from arules, the R package for association rule mining:

https://cran.r-project.org/web/packages/arules/arules.pdf

Below is code for loading the file. The result is a list of lists. Each inner list represents one transaction, and each element of a transaction is an item. Transactions vary in length.

In [0]:
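# read every row of the CSV; each row becomes one transaction (a list of items)
# the with-block closes the file once reading is done
with open("groceries.csv", "r", newline="") as f:
    lines = list(csv.reader(f))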

In [0]:
# run to check out what the data looks like
lines

[['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups'],
 ['tropical fruit', 'yogurt', 'coffee'],
 ['whole milk'],
 ['pip fruit', 'yogurt', 'cream cheese ', 'meat spreads'],
 ['other vegetables',
  'whole milk',
  'condensed milk',
  'long life bakery product'],
 ['whole milk', 'butter', 'yogurt', 'rice', 'abrasive cleaner'],
 ['rolls/buns'],
 ['other vegetables',
  'UHT-milk',
  'rolls/buns',
  'bottled beer',
  'liquor (appetizer)'],
 ['pot plants'],
 ['whole milk', 'cereals'],
 ['tropical fruit',
  'other vegetables',
  'white bread',
  'bottled water',
  'chocolate'],
 ['citrus fruit',
  'tropical fruit',
  'whole milk',
  'butter',
  'curd',
  'yogurt',
  'flour',
  'bottled water',
  'dishes'],
 ['beef'],
 ['frankfurter', 'rolls/buns', 'soda'],
 ['chicken', 'tropical fruit'],
 ['butter', 'sugar', 'fruit/vegetable juice', 'newspapers'],
 ['fruit/vegetable juice'],
 ['packaged fruit/vegetables'],
 ['chocolate'],
 ['specialty bar'],
 ['other vegetables'],
 ['butter milk

The first step is to transform the data into a form that the algorithm can work with. We need a one-hot (dummy) representation: one row per transaction, one column per item, and True wherever the transaction contains the item. mlxtend offers a TransactionEncoder that makes it fairly easy to turn this data into a pandas DataFrame.

https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.preprocessing/#transactionencoder
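
To make the one-hot representation concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not mlxtend's implementation, and the toy transactions are made up for the example. Columns are sorted alphabetically, which appears to match the column order in the DataFrame output further down.

```python
# Hypothetical toy transactions (not taken from groceries.csv)
toy_transactions = [
    ["milk", "bread"],
    ["bread"],
    ["milk", "eggs"],
]

# "fit": learn which items exist, in a fixed (sorted) column order
columns = sorted({item for transaction in toy_transactions for item in transaction})

# "transform": one row per transaction, one boolean per known item
one_hot = [[item in transaction for item in columns] for transaction in toy_transactions]

print(columns)
for row in one_hot:
    print(row)
```

Each row has the same length regardless of how many items the original transaction contained, which is what lets the result be loaded into a DataFrame.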

In [0]:
# init the encoder object
encoder = TransactionEncoder()

# fit learns what items exist
# transform turns the transactions into a one-hot numpy array (must be fit first)
# fit_transform fits and then transforms
lines_array = encoder.fit_transform(lines)

# the resulting array can be used to build a DataFrame. The column names
# are remembered by the encoder and can be accessed by the columns_ attribute
groceries_df = pd.DataFrame(lines_array, columns=encoder.columns_)

In [0]:
# run any EDA you want here
groceries_df.head()

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


The algorithm itself runs through two function calls. The first, apriori, finds the frequent item sets, i.e. the sets of items whose support (the fraction of transactions that contain every item in the set) is at least the given minimum.

Tips for choosing the support limit: look at the frequencies of single items. If they are very low (less than 1%), you will need a very low support limit. If they are above 1%, a limit close to 1% might be appropriate. I tend to start with a high limit and work down, which gradually enlarges the candidate set. If you start with a very low limit, you run the risk of having your program run for a very long time.
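
The single-item frequencies mentioned in the tips can be checked before running apriori. The sketch below is a minimal illustration on made-up toy transactions, using only the standard library. With the notebook's one-hot DataFrame, `groceries_df.mean()` should give the same per-item fractions, since the mean of a boolean column is the share of True values.

```python
from collections import Counter

# Hypothetical toy transactions (not taken from groceries.csv)
toy_transactions = [
    ["whole milk", "yogurt"],
    ["whole milk", "butter"],
    ["yogurt"],
    ["whole milk", "yogurt", "butter"],
]

# count each item at most once per transaction
item_counts = Counter(item for t in toy_transactions for item in set(t))

# support of a single item = fraction of transactions that contain it
n_transactions = len(toy_transactions)
item_support = {item: count / n_transactions for item, count in item_counts.items()}

# list items from most to least frequent
for item, support in sorted(item_support.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item}: {support:.2f}")
```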

The function:

https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/#apriori

In [0]:
# min_support is a number between 0 and 1: the minimum support an item set
# needs in order to be returned
# use_colnames=True reports item sets by column name rather than column index
apriori(groceries_df, min_support=0.01, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.033452,(UHT-milk)
1,0.017692,(baking powder)
2,0.052466,(beef)
3,0.033249,(berries)
4,0.026029,(beverages)
5,0.080529,(bottled beer)
6,0.110524,(bottled water)
7,0.06487,(brown bread)
8,0.055414,(butter)
9,0.027961,(butter milk)


I ended up selecting a value of 0.01 for min_support. Try different values and see what gets returned. Be aware that very small values can lead to a long run time.
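
To see why lower thresholds return more item sets (and take longer), here is a brute-force sketch on made-up toy data. It checks every combination of items, so it is not the Apriori algorithm, which prunes candidates and scales far better. The function name `frequent_itemsets` and the `max_len` parameter are hypothetical and exist only for this illustration.

```python
from itertools import combinations

# Hypothetical toy transactions, stored as sets for fast subset checks
toy_transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]

def frequent_itemsets(transactions, min_support, max_len=3):
    """Brute force: test every item combination up to max_len items."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            # support = fraction of transactions containing the whole candidate
            support = sum(set(candidate) <= t for t in transactions) / n
            if support >= min_support:
                result[candidate] = support
    return result

# lowering the threshold lets more item sets through
for threshold in (0.75, 0.5, 0.25):
    found = frequent_itemsets(toy_transactions, threshold)
    print(threshold, len(found), sorted(found))
```

The number of possible combinations grows very quickly with the number of items, which is the reason a very low min_support can make the real run slow.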

In [0]:
# let's capture that output for use in the next step
grocery_candidate_support_set = apriori(groceries_df, min_support=0.01, use_colnames=True)

The next step is to select the item sets that can be converted into rules while staying above a minimum confidence:

https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/#association_rules

This function returns a dataframe that we can use to look at the rules. I like to capture the output and sort by lift to see what rules are potentially the most meaningful.

Try playing with the values and see how the resulting rules change.

In [0]:
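# keep only rules whose confidence is at least 0.5
# (confidence is the default metric; it is spelled out here for clarity)
rules_df = association_rules(grocery_candidate_support_set, metric="confidence", min_threshold=0.5)
rules_df.sort_values("lift", ascending=False)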

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1,"(citrus fruit, root vegetables)",(other vegetables),0.017692,0.193493,0.010371,0.586207,3.029608,0.006948,1.949059
6,"(tropical fruit, root vegetables)",(other vegetables),0.021047,0.193493,0.012303,0.584541,3.020999,0.008231,1.941244
5,"(root vegetables, rolls/buns)",(other vegetables),0.024301,0.193493,0.012201,0.502092,2.59489,0.007499,1.619792
7,"(yogurt, root vegetables)",(other vegetables),0.025826,0.193493,0.012913,0.5,2.584078,0.007916,1.613015
2,"(yogurt, curd)",(whole milk),0.017285,0.255516,0.010066,0.582353,2.279125,0.005649,1.782567
0,"(other vegetables, butter)",(whole milk),0.020031,0.255516,0.01149,0.573604,2.244885,0.006371,1.745992
11,"(tropical fruit, root vegetables)",(whole milk),0.021047,0.255516,0.011998,0.570048,2.230969,0.00662,1.731553
12,"(yogurt, root vegetables)",(whole milk),0.025826,0.255516,0.01454,0.562992,2.203354,0.007941,1.703594
3,"(domestic eggs, other vegetables)",(whole milk),0.022267,0.255516,0.012303,0.552511,2.162336,0.006613,1.663694
14,"(yogurt, whipped/sour cream)",(whole milk),0.020742,0.255516,0.01088,0.52451,2.052747,0.00558,1.565719
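
The metric columns in this table can be recomputed from the three support values using the standard definitions. The sketch below does so for the top rule, (citrus fruit, root vegetables) -> (other vegetables), with the rounded supports copied from the table, so the results agree with the table only up to rounding.

```python
# Supports copied (rounded) from the first row of the rules table
antecedent_support = 0.017692   # P(citrus fruit and root vegetables)
consequent_support = 0.193493   # P(other vegetables)
rule_support = 0.010371         # P(all three items together)

# confidence: how often the consequent appears when the antecedent does
confidence = rule_support / antecedent_support

# lift: confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support

# leverage: observed joint support minus the support expected under independence
leverage = rule_support - antecedent_support * consequent_support

# conviction: ratio of the expected to the observed rate of the rule "failing"
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.6f} conviction={conviction:.4f}")
```

A lift of about 3 means that customers who bought citrus fruit and root vegetables were roughly three times as likely to buy other vegetables as a customer picked at random.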


Bonus: How would you use this information? Are there any changes you could make to stores?