# Practical B - Association Rule Mining - Groceries DataSet

## Introduction

In this practical we will perform a market basket analysis of transactional data from a small grocery store (9835 transactions, 169 products).

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
sns.set_context("paper")

from itertools import combinations, groupby
from collections import Counter
import sys

## Import the data

The data used here was adapted from the Groceries dataset in the arules R package, and has been cleaned and simplified already.

Note:
 
 * In a large grocery store (think Instacart), there is a huge variety of items. There might be five brands of milk, a dozen different types of laundry detergent, and three brands of coffee.

 * If we can assume that the retailer is not terribly concerned with finding rules that apply only to a specific brand of milk or detergent, we could remove all brand names and merge products. This reduces the number of groceries to a more manageable size, using broad categories such as chicken, frozen meals, margarine, and soda.
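
As a rough illustration of the merging step described in the note, the sketch below maps branded product names to broad categories. The product names and the `brand_to_category` mapping are made up for this example; the Groceries data used here has already been simplified in this way.

```python
# Hypothetical branded product names mapped to broad categories (illustrative only)
brand_to_category = {
    "BrandA whole milk 1L": "whole milk",
    "BrandB whole milk 2L": "whole milk",
    "BrandC cola 330ml": "soda",
    "BrandD lemonade 1L": "soda",
}

# Two made-up baskets using the branded names
raw_transactions = [
    ["BrandA whole milk 1L", "BrandC cola 330ml"],
    ["BrandB whole milk 2L", "BrandA whole milk 1L"],
]

# Replace each product by its category and drop duplicates within a basket;
# products missing from the mapping are kept unchanged
merged = [sorted({brand_to_category.get(item, item) for item in basket})
          for basket in raw_transactions]
print(merged)
```

Note that the second basket collapses to a single item, since both of its products fall into the same category.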

In [None]:
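# download remote source if local copy is not available
import requests, os
url = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/groceries.csv"
dataFile = "data/groceries.csv"
if not os.path.isfile(dataFile):
    # create the data/ directory first, otherwise open() fails on a fresh checkout
    os.makedirs(os.path.dirname(dataFile), exist_ok=True)
    r = requests.get(url)
    r.raise_for_status()  # stop here rather than saving an error page as the CSV
    with open(dataFile, 'wb') as f:
        f.write(r.content)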

In [None]:
# print out the first 5 rows of groceries.csv
for line in open(dataFile).read().split("\n")[:5]:
    print(line)

These lines indicate five separate grocery store transactions. The first transaction
included four items: citrus fruit, semi-finished bread, margarine, and ready soups.
In comparison, the third transaction included only one item, whole milk.

In [None]:
# read in csv and convert to list of lists
# Note the [:-1] to drop the last empty line in the CSV file

transactions = [line.split(',') for line in open(dataFile).read().split("\n")][:-1]

In [None]:
# print out the first 5 transactions (to compare with first 5 lines of file)
from pprint import pprint
pprint(transactions[:5])

### Transcode transactions into a Data frame

In [None]:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions,sparse=False)

df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

## Exploratory Data Analysis 

In [None]:
print("Number of transactions = {:,}".format(len(transactions)))
print("Number of products = {:,}".format(len(te.columns_)))
print("Number of (non-unique) items sold = {:,}".format(te_ary.sum()))

# density = fraction of cells in the one-hot matrix that are True
print("Density of transaction database = {:.3%}".format(te_ary.sum()/te_ary.size))
print("Average number of items per transaction = {:.4}".format(te_ary.sum()/len(transactions)))

### Size of Transaction

In [None]:
# sum to get the number of True values along each row (each transaction)
a = df.sum(axis=1)
a.head()

In [None]:
# Plot the number of transactions of each size, ordered by transaction size
a.value_counts().sort_index().plot.bar()
plt.title("Distribution of Transaction Size")
plt.xlabel("Number of items")
plt.ylabel("Frequency")
plt.show()

In [None]:
# or just output data   
print(a.value_counts())

In [None]:
# We can generate a set of statistics about the size of transactions. 
a.describe()

So from above tables and bar plot we see:

 * A total of 2,159 transactions contained only a single item, while one transaction had 32 items. 
 * The first quartile and median purchase size are 2 and 3 items respectively, implying that 25 percent of transactions contained two or fewer items and about half contained three or fewer items. 
 * The mean of 4.409 matches the value we calculated earlier.


## Association Rule Analysis

### Frequent Itemset Generation

In [None]:
from mlxtend.frequent_patterns import apriori
a = apriori(df, min_support=0.1,use_colnames=True).sort_values(by='support',ascending=False)
a

#### Visualizing item support – item frequency plots

In [None]:
names = [next(iter(n)) for n in a["itemsets"]]
a.plot(kind='bar', title ="Support for most frequent products")
plt.xticks(range(len(names)), names, rotation=20)
plt.show()

In [None]:
a = apriori(df, min_support=0.05, max_len=1, use_colnames=True) \
    .sort_values(by='support',ascending=False)
    
names = [next(iter(n)) for n in a["itemsets"]]
a.plot.barh(title ="Support for most frequent products")
plt.yticks(range(len(names)), names)
plt.gca().invert_yaxis()
plt.show()

### Visualization of the sparse matrix for the first $k$ transactions

In [None]:
k = 100
sns.set_style("white")
plt.figure(figsize=(10,5)) 
plt.imshow(1-te_ary[0:k], interpolation='none', cmap='gray')
plt.xlabel("Items (Columns)")
plt.ylabel("Transactions (Rows)")
plt.title("Visualization of first %d transactions" % k)
plt.show()

The above diagram depicts a matrix with $k$ rows and 169 columns, indicating the $k$ transactions and 169 possible items we requested. Cells in the matrix are filled with black for transactions (rows) where the item (column) was purchased.

A few columns seem fairly heavily populated, indicating some very popular items at the store, but overall, the distribution of dots seems fairly random. Nothing else of note stands out.

This visualization can be a useful tool for exploring the data. For one, it may help with the identification of potential data issues. Columns that are filled all the way down could indicate items that are purchased in every transaction, a problem that could arise, perhaps, if a retailer's name or identification number was inadvertently included in the transaction dataset.
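
The same check can be done numerically instead of by eye. The sketch below uses a small made-up one-hot matrix (`toy`, with hypothetical column names) in which one column is True in every row. With the notebook's data, `te_ary` and `te.columns_` would take the place of `toy` and `toy_columns`.

```python
import numpy as np

# Toy one-hot matrix: rows are transactions, columns are items.
# The last column is True in every row, mimicking a stray store identifier.
toy = np.array([
    [True, False, True],
    [False, True, True],
    [False, False, True],
])
toy_columns = ["whole milk", "soda", "store_id_123"]  # hypothetical names

# A column that is True in every row is "filled all the way down"
always_present = [name for name, full in zip(toy_columns, toy.all(axis=0)) if full]
print(always_present)
```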

Additionally, patterns in the diagram may help reveal interesting segments of transactions or items, particularly if the data is sorted in interesting ways. For example, if the transactions are sorted by date, patterns in the black dots could reveal seasonal effects in the number or types of items people purchase. 

Perhaps around Christmas or Hanukkah, toys are more common; around Halloween, perhaps candy becomes popular. This type of visualization could be especially powerful if the items were also sorted into categories. In most cases, however, the plot will look fairly random, like static on an old CRT television screen which is not tuned to a channel. 

Keep in mind that this visualization will not be as useful for extremely large transaction databases because the cells will be too small to discern. Still, by combining it with sampling, you can view the sparse matrix for a randomly sampled set of transactions. 
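
A minimal sketch of the sampling idea, assuming a random stand-in matrix (`toy_ary`) in place of `te_ary`; the sizes and the fill rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sample is reproducible

# Stand-in for te_ary: 1,000 transactions x 20 items, roughly 10% of cells True
toy_ary = rng.random((1000, 20)) < 0.1

k = 100
# Pick k distinct row indices, then sort them to keep the original row order
sample_idx = np.sort(rng.choice(toy_ary.shape[0], size=k, replace=False))
sample = toy_ary[sample_idx]
print(sample.shape)
```

The resulting `sample` array could then be passed to `plt.imshow` in the same way as `te_ary[0:k]` in the plotting cell.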

In [None]:
frequent_itemsets = apriori(df, min_support=0.1,use_colnames=True)
frequent_itemsets

### Rule Generation

In [None]:
from mlxtend.frequent_patterns import association_rules

# first attempt: relatively high thresholds (support 0.1, confidence 0.8)
frequent_itemsets = apriori(df, min_support=0.1,use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.8)
rules.head()

So we generated zero rules.

If you think about it, this outcome should not have been terribly surprising. With a minimum support of 0.1, an itemset must appear in at least 0.1 * 9,835 = 983.5 transactions to be considered at all. Only eight items appear this frequently in our data, and no combination of two or more items does, so there is nothing from which to build a rule.

One way to approach the problem of setting support is to think about the minimum number of transactions you would need before you would consider a pattern interesting. For instance, you could argue that if an item is purchased twice a day (about 60 times in a month of data) then it may be worth taking a look at. From there, it is possible to calculate the support level needed to find only rules matching at least that many transactions. Since 60 out of 9,835 is approximately 0.006, we'll try setting the support there first.
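
The arithmetic can be sketched in a few lines. The figure of 60 purchases is the rule of thumb from the paragraph above (twice a day, assuming the data covers about a month), not something derived from the data.

```python
import math

n_transactions = 9835   # number of transactions in the Groceries data
min_count = 60          # "twice a day" over roughly a month (assumption from the text)

# Convert a minimum transaction count into a minimum support threshold
min_support = min_count / n_transactions
print(round(min_support, 4))

# The reverse direction: how many transactions a support of 0.1 requires
needed_at_0_1 = math.ceil(0.1 * n_transactions)
print(needed_at_0_1)
```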

Setting the minimum confidence involves a tricky balance. On one hand, if confidence is too low, then we might be overwhelmed with a large number of unreliable rules—such as dozens of rules indicating items commonly purchased with batteries. How would we know where to target our advertising budget then? 

On the other hand, if we set confidence too high, then we will be limited to rules that are obvious or inevitable—like the fact that a smoke detector is always purchased in combination with batteries. In this case, moving the smoke detectors closer to the batteries is unlikely to generate additional revenue, since the two items were already almost always purchased together.

We'll start with a confidence threshold of 0.25, which means that in order to be included in the results, the rule has to be correct at least 25 percent of the time. This will eliminate the most unreliable rules while allowing some room for us to modify behavior with targeted promotions.

In [None]:
from mlxtend.frequent_patterns import association_rules

frequent_itemsets = apriori(df, min_support=0.006,use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.25)
print ("Generated {:,} rules".format(len(rules)))
rules.head()

In [None]:
# get stats on objective measures
rules[["support","confidence", "lift"]].describe()

Above, we see summary statistics for the rule quality measures: support, confidence, and lift. Support and confidence should not be very surprising, since we used these as selection criteria for the rules. However, we might be alarmed if most or all of the rules were very near the minimum thresholds—not the case here.
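
To make the three measures concrete, here is a small hand computation on made-up baskets using the standard definitions of support, confidence, and lift. The baskets and the helper `support` are illustrative only and are not part of mlxtend.

```python
# Toy baskets (made up for illustration)
baskets = [
    {"whole milk", "yogurt"},
    {"whole milk", "yogurt", "soda"},
    {"whole milk"},
    {"soda"},
    {"yogurt"},
]
n = len(baskets)

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= b for b in baskets) / n

# Rule under study: {yogurt} -> {whole milk}
antecedent, consequent = {"yogurt"}, {"whole milk"}

supp = support(antecedent | consequent)   # share of baskets with both sides
conf = supp / support(antecedent)         # how often the rule holds when it applies
lift = conf / support(consequent)         # confidence relative to the consequent's base rate
print(supp, conf, lift)
```

A lift above 1 suggests the two sides occur together more often than would be expected if they were independent.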

In [None]:
# order rules by lift
rules.sort_values(by='lift',ascending=False).head()

TODO: Take the result of learning association rules and divide them into three categories:

 * Actionable
 * Trivial
 * Inexplicable

In [None]:
for col in rules.columns: 
    print(col) 

### Taking subsets of association rules (by rule length)

In [None]:
# add new column storing the rule length
rules["rule_len"] = rules.apply(lambda row: len(row["antecedents"])+len(row["consequents"]), axis=1)

In [None]:
# get stats on rules grouped by rule length
rules[["rule_len","support", "lift"]].groupby("rule_len").agg(['mean', 'count']).reset_index()

In [None]:
# restrict analysis to rules of length 4 and order rules by lift
rules[rules["rule_len"]==4].sort_values(by='lift',ascending=False).head()

### Taking subsets of association rules (by content)

In [None]:
# restrict analysis to transactions involving berries
df2 = df[df["berries"]==True]

In [None]:
frequent_itemsets = apriori(df2, min_support=0.006,use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.25)
rules.head()

In [None]:
# flag rules whose antecedent contains berries, then show only those rules
rr = rules["antecedents"].apply(lambda items: "berries" in items)
rules[rr].head()

In [None]:
a = apriori(df, min_support=0.003,use_colnames=True)
a["itemset_len"] = a.apply(lambda row: len(row["itemsets"]), axis=1)
a.groupby("itemset_len").size().reset_index(name="count")

### Appendix - Effect of minimum support on the number of frequent itemsets

In [None]:
# effect of min_support on the distribution of frequent itemset sizes
sns.set_style("darkgrid")
for minsupport in [0.06, 0.006, 0.003, 0.002, 0.001]:
    a = apriori(df, min_support=minsupport,use_colnames=True)
    a["itemset_len"] = a["itemsets"].apply(len)
    d = a.groupby("itemset_len").size().reset_index(name="count")
    # plot against the actual itemset length and label each line with its own threshold
    plt.plot(d["itemset_len"], d["count"], label="minsupp = %s" % minsupport)
plt.xticks(range(7))
plt.legend()
plt.xlabel("Size of itemset")
plt.ylabel("Number of Frequent itemsets")
plt.show()