# 3. Aggregation and Pruning
## Performing aggregation
---
After completing minor consulting jobs for a library and an ebook seller, you've finally received your first big market basket analysis project: advising an online novelty gifts retailer on cross-promotions. Since the retailer has never previously hired a data scientist, it would like you to start the project by exploring its transaction data. It has asked you to aggregate all sign items in the dataset into a single category and compute the support for that category.

In [1]:
# import packages
import pandas as pd
import numpy as np

# import dataset
retail = pd.read_csv('../Datasets/online_retail.csv')
retail.head()

Unnamed: 0,InvoiceNo,StockCode,Description
0,562583,35637A,IVORY STRING CURTAIN WITH POLE
1,562583,35638A,PINK AND BLACK STRING CURTAIN
2,562583,84927F,PSYCHEDELIC TILE HOOK
3,562583,22425,ENAMEL COLANDER CREAM
4,562583,16008,SMALL FOLDING SCISSOR(POINTED EDGE)


In [2]:
# drop null values
retail = retail.dropna()

In [3]:
# group by Invoice number
retail_transactions = retail.groupby('InvoiceNo')\
                            .Description.unique().reset_index()

In [4]:
retail_transactions.head()

Unnamed: 0,InvoiceNo,Description
0,549687,"[DOORMAT RED RETROSPOT, DOORMAT WELCOME SUNRIS..."
1,550644,"[SET OF 6 SPICE TINS PANTRY DESIGN, PANTRY WAS..."
2,552695,"[BIRTHDAY PARTY CORDON BARRIER TAPE, ICE CREAM..."
3,553857,"[VINTAGE GLASS T-LIGHT HOLDER, SET/6 BEAD COAS..."
4,557499,"[PARTY CHARMS 50 PIECES, PACK OF 6 LARGE FRUIT..."


In [5]:
# number of transactions
len(retail_transactions)

9353

In [6]:
# create a list of transactions, one list of item descriptions per invoice
transactions = [items.tolist() for items in retail_transactions.Description]

In [7]:
# list of items in the first transaction
transactions[0]

['DOORMAT RED RETROSPOT',
 'DOORMAT WELCOME SUNRISE',
 'DOORMAT MULTICOLOUR STRIPE',
 'PACK OF 72 SKULL CAKE CASES',
 'PACK OF 60 PINK PAISLEY CAKE CASES',
 'PACK OF 60 MUSHROOM CAKE CASES',
 'PACK OF 72 RETROSPOT CAKE CASES',
 '72 SWEETHEART FAIRY CAKE CASES',
 '60 TEATIME FAIRY CAKE CASES',
 'SET OF 36 PAISLEY FLOWER DOILIES',
 'SET OF 36 MUSHROOM PAPER DOILIES',
 'SET OF 72 SKULL PAPER  DOILIES',
 'SET/10 BLUE POLKADOT PARTY CANDLES',
 'SET/10 PINK POLKADOT PARTY CANDLES',
 'SET/10 IVORY POLKADOT PARTY CANDLES',
 'SET/10 RED POLKADOT PARTY CANDLES']

In [8]:
# Import the transaction encoder function from mlxtend
from mlxtend.preprocessing import TransactionEncoder

# Instantiate transaction encoder and identify unique items
encoder = TransactionEncoder().fit(transactions)

# One-hot encode transactions
onehot = encoder.transform(transactions)

# Convert one-hot encoded data to DataFrame
onehot = pd.DataFrame(onehot, columns = encoder.columns_)

# Check the shape of the one-hot encoded transaction dataset
onehot.shape

(9353, 3460)

In [9]:
onehot.sample(10)

Unnamed: 0,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TRELLIS COAT RACK,...,wet boxes,wet pallet,wet rusty,wet?,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804
1809,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1600,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3708,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3445,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
171,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1193,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
677,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6717,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2125,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6504,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [10]:
# Select the column headers for sign items
sign_headers = [i for i in onehot.columns if i.lower().find('sign')>=0]

# Select columns of sign items
sign_columns = onehot[sign_headers]

# Perform aggregation of sign items into sign category
signs = sign_columns.sum(axis = 1) >= 1.0

# Print support for signs
print('Share of Signs: %.2f' % signs.mean())

Share of Signs: 0.46
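
To make the aggregation step concrete, here is a minimal sketch on a made-up list of baskets. The item names and the `category_support` helper are illustrative assumptions, not part of the retailer's data. A transaction counts toward a category if it contains at least one item whose name includes the keyword, and the support is the share of such transactions.

```python
# Toy transactions (hypothetical data, for illustration only)
toy_transactions = [
    ['METAL SIGN HOME', 'PAPER BAG'],
    ['WOODEN SIGN', 'BATHROOM METAL SIGN'],
    ['GIFT BOX'],
    ['CANDLE HOLDER', 'PAPER BAG'],
]

def category_support(transactions, keyword):
    """Share of transactions containing at least one item matching keyword."""
    hits = [any(keyword in item.lower() for item in basket)
            for basket in transactions]
    return sum(hits) / len(hits)

sign_support = category_support(toy_transactions, 'sign')
print('Share of signs: %.2f' % sign_support)
```

Note that the second basket holds two sign items but is counted only once, which mirrors the `>= 1.0` check in the cell above.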


## Defining an aggregation function
---
Surprised by the high share of transactions that contain sign items, the retailer decides that it makes sense to do further aggregation for different categories to explore the data better. This seems trivial to you, but the retailer has not previously been able to perform even a basic descriptive analysis of its transactions and items.

The retailer asks you to perform aggregation for the candles, bags, and boxes categories. To simplify the task, you decide to write a function. It will take a string that contains an item's category.

In [11]:
def aggregate(item):
    # Select the column headers that contain the category keyword
    item_headers = [i for i in onehot.columns if item in i.lower()]

    # Select the columns for the matching items
    item_columns = onehot[item_headers]

    # Return category of aggregated items
    return item_columns.sum(axis = 1) >= 1.0

# Aggregate items for the bags, boxes, and candles categories
bags = aggregate('bag')
boxes = aggregate('box')
candles = aggregate('candle')

# Print support for bags, boxes, and candles
print('Share of bags: %.2f' % bags.mean())
print('Share of boxes: %.2f' % boxes.mean())
print('Share of candles: %.2f' % candles.mean())

Share of bags: 0.39
Share of boxes: 0.38
Share of candles: 0.19


## Pruning and Apriori
---
We introduced the Apriori algorithm, which uses the Apriori principle to prune itemsets. The principle states that all subsets of a frequent itemset are frequent. Equivalently, every superset of an infrequent itemset is infrequent: if we find an infrequent itemset, which we'll call {X}, then {X, Y} must also be infrequent, so we may eliminate it without computing its support.
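
The principle can be checked on a small example. The sketch below uses a made-up set of five baskets and a hypothetical `support` helper. It verifies that adding an item never raises support, then builds candidate pairs only from the frequent single items. It illustrates the pruning idea only and is not the mlxtend implementation.

```python
from itertools import combinations

# Toy transactions (hypothetical data, for illustration only)
toy_transactions = [
    {'bag', 'box', 'sign'},
    {'bag', 'sign'},
    {'box', 'sign'},
    {'sign'},
    {'candle'},
]

def support(itemset, transactions):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

items = sorted(set().union(*toy_transactions))

# Adding an item to an itemset can never raise its support
for pair in combinations(items, 2):
    for single in pair:
        assert support(pair, toy_transactions) <= support([single], toy_transactions)

# Keep only frequent single items, then form candidate pairs from them
min_support = 0.4
frequent_singles = [i for i in items if support([i], toy_transactions) >= min_support]
candidate_pairs = list(combinations(frequent_singles, 2))

print(frequent_singles)
print(len(candidate_pairs), 'candidate pairs instead of',
      len(list(combinations(items, 2))))
```

Because {candle} falls below the threshold, no pair containing candle is ever evaluated.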

In [12]:
# create aggregated dataframe
aggregated = pd.concat([bags, boxes, candles, signs], axis=1)\
               .rename(columns = {0:'bag', 1:'box', 2:'candle', 3:'sign'})

In [13]:
aggregated

Unnamed: 0,bag,box,candle,sign
0,False,False,True,False
1,False,True,False,True
2,True,True,False,True
3,True,False,True,False
4,False,False,False,False
...,...,...,...,...
9348,False,False,False,False
9349,False,False,False,False
9350,False,False,False,False
9351,False,True,False,False


## Identifying frequent itemsets with Apriori
---
The aggregation exercise you performed for the online retailer proved helpful. It offered a starting point for understanding which categories of items appear frequently in transactions. The retailer now wants to explore the individual items themselves to find out which are frequent.

In this exercise, you'll apply the Apriori algorithm and prune the itemsets using a minimum support value and a maximum itemset length. Note that the cell below runs on the <code>aggregated</code> category DataFrame built above rather than on the item-level <code>onehot</code> data.

In [14]:
# Import apriori from mlxtend
from mlxtend.frequent_patterns import apriori

# Compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(aggregated, 
                            min_support = 0.006, 
                            max_len = 4, 
                            use_colnames = True)

# Print a preview of the frequent itemsets
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.385865,(bag)
1,0.382551,(box)
2,0.189244,(candle)
3,0.458569,(sign)
4,0.232332,"(box, bag)"
5,0.108414,"(candle, bag)"
6,0.280979,"(sign, bag)"
7,0.132578,"(box, candle)"
8,0.271571,"(box, sign)"
9,0.130119,"(sign, candle)"


## Selecting a support threshold
---
The manager of the online gift store looks at the results you provided from the previous exercise and commends you for the good work. She does, however, raise an issue: all of the itemsets you identified contain only one item. She asks whether it would be possible to use a less restrictive rule and to generate more itemsets, possibly including those with multiple items.

After agreeing to do this, you think about what might explain the lack of itemsets with more than one item. It can't be the <code>max_len</code> parameter, since that was set to four. You decide it must be support and decide to test two different values, each time checking how many additional itemsets are generated.

In [15]:
# Compute frequent itemsets using a support of 0.2 and length of 3
frequent_itemsets_1 = apriori(aggregated, min_support = 0.2, 
                            max_len = 3, use_colnames = True)

# Compute frequent itemsets using a support of 0.1 and length of 3
frequent_itemsets_2 = apriori(aggregated, min_support = 0.1, 
                            max_len= 3, use_colnames = True)

# Print the number of frequent itemsets
print(len(frequent_itemsets_1), len(frequent_itemsets_2))

6 12
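
The effect of the threshold can be reproduced by hand on a small example. The following sketch counts itemsets by brute-force enumeration on made-up baskets. The `brute_force_itemsets` helper is hypothetical and deliberately skips Apriori's pruning, so it is only practical for a handful of items.

```python
from itertools import combinations

# Toy transactions (hypothetical data, for illustration only)
toy_transactions = [
    {'bag', 'box', 'sign'},
    {'bag', 'sign'},
    {'box', 'sign'},
    {'sign', 'candle'},
    {'bag', 'box'},
]

def brute_force_itemsets(transactions, min_support, max_len):
    """Return {itemset: support} for every itemset that meets min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, max_len + 1):
        for candidate in combinations(items, k):
            s = sum(set(candidate) <= basket for basket in transactions) / n
            if s >= min_support:
                result[candidate] = s
    return result

high = brute_force_itemsets(toy_transactions, min_support=0.5, max_len=3)
low = brute_force_itemsets(toy_transactions, min_support=0.3, max_len=3)

# Lowering the threshold admits the two-item itemsets
print(len(high), len(low))
```

With the stricter threshold only single items survive. Relaxing it lets the pairs through, which is the same pattern the manager observed.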


In [16]:
frequent_itemsets_1

Unnamed: 0,support,itemsets
0,0.385865,(bag)
1,0.382551,(box)
2,0.458569,(sign)
3,0.232332,"(box, bag)"
4,0.280979,"(sign, bag)"
5,0.271571,"(box, sign)"


In [17]:
frequent_itemsets_2

Unnamed: 0,support,itemsets
0,0.385865,(bag)
1,0.382551,(box)
2,0.189244,(candle)
3,0.458569,(sign)
4,0.232332,"(box, bag)"
5,0.108414,"(candle, bag)"
6,0.280979,"(sign, bag)"
7,0.132578,"(box, candle)"
8,0.271571,"(box, sign)"
9,0.130119,"(sign, candle)"


## Generating association rules
---
In the final exercise of the previous section, you computed itemsets for the novelty gift store owner using the Apriori algorithm. You told the store owner that relaxing support from 0.005 to 0.003 increased the number of itemsets from 9 to 12914. Relaxing it again to 0.001 increased the number to 113952. Satisfied with the descriptive work you've done, the store manager asks you to identify some association rules from those two sets of frequent itemsets you computed.

In [18]:
# Import the association rule function from mlxtend
from mlxtend.frequent_patterns import association_rules

# Compute all association rules for frequent_itemsets_1
rules_1 = association_rules(frequent_itemsets_1,
                            metric = "support",
                            min_threshold = 0.1)

# Compute all association rules for frequent_itemsets_2
rules_2 = association_rules(frequent_itemsets_2,
                            metric = "support",
                            min_threshold = 0.15)

# Display the rules generated from frequent_itemsets_1
rules_1

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(box),(bag),0.382551,0.385865,0.232332,0.607323,1.573923,0.084719,1.563967
1,(bag),(box),0.385865,0.382551,0.232332,0.602106,1.573923,0.084719,1.551792
2,(sign),(bag),0.458569,0.385865,0.280979,0.61273,1.587937,0.104033,1.585805
3,(bag),(sign),0.385865,0.458569,0.280979,0.72818,1.587937,0.104033,1.991868
4,(box),(sign),0.382551,0.458569,0.271571,0.709894,1.548062,0.096144,1.866318
5,(sign),(box),0.458569,0.382551,0.271571,0.592213,1.548062,0.096144,1.514144


In [19]:
rules_2

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(box),(bag),0.382551,0.385865,0.232332,0.607323,1.573923,0.084719,1.563967
1,(bag),(box),0.385865,0.382551,0.232332,0.602106,1.573923,0.084719,1.551792
2,(sign),(bag),0.458569,0.385865,0.280979,0.61273,1.587937,0.104033,1.585805
3,(bag),(sign),0.385865,0.458569,0.280979,0.72818,1.587937,0.104033,1.991868
4,(box),(sign),0.382551,0.458569,0.271571,0.709894,1.548062,0.096144,1.866318
5,(sign),(box),0.458569,0.382551,0.271571,0.592213,1.548062,0.096144,1.514144
6,"(box, sign)",(bag),0.271571,0.385865,0.190527,0.701575,1.818185,0.085737,2.057918
7,"(bag, sign)",(box),0.280979,0.382551,0.190527,0.678082,1.772527,0.083038,1.918033
8,"(box, bag)",(sign),0.232332,0.458569,0.190527,0.820064,1.78831,0.083987,3.009025
9,(sign),"(box, bag)",0.458569,0.232332,0.190527,0.415481,1.78831,0.083987,1.313334


## Pruning with lift
---
Once again, you report back to the novelty gift store manager. This time, you tell her that you identified no rules when you used a higher support threshold for the Apriori algorithm and only two rules when you used a lower threshold. She commends you for the good work, but asks you to consider using another metric to reduce the two rules to one.

You remember that lift had a simple interpretation: values greater than 1 indicate that items co-occur more than we would expect if they were independently distributed across transactions. You decide to use lift, since that message will be simple to convey.
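
As a quick check of that interpretation, lift can be recomputed by hand as the joint support divided by the product of the individual supports. The sketch below uses a hypothetical `lift` helper and the box and bag supports copied from the `rules_1` output above.

```python
def lift(support_ab, support_a, support_b):
    """Observed joint support relative to the support expected under independence."""
    return support_ab / (support_a * support_b)

# Values for the box -> bag rule, copied from the rules_1 output
box_bag_lift = lift(0.232332, 0.382551, 0.385865)
print('lift(box -> bag) = %.3f' % box_bag_lift)

# Two items that each appear in half of the baskets and co-occur in a quarter
# of them behave as if independent, so lift equals 1
print(lift(0.25, 0.5, 0.5))
```

The formula is symmetric in the two items, so swapping antecedent and consequent leaves lift unchanged.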

In [20]:
# Compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(aggregated, min_support = 0.001, 
                            max_len = 2, use_colnames = True)

# Compute all association rules for frequent_itemsets
rules = association_rules(frequent_itemsets,
                          metric = "lift",
                          min_threshold = 1.5)

# Display the frequent itemsets
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.385865,(bag)
1,0.382551,(box)
2,0.189244,(candle)
3,0.458569,(sign)
4,0.232332,"(box, bag)"
5,0.108414,"(candle, bag)"
6,0.280979,"(sign, bag)"
7,0.132578,"(box, candle)"
8,0.271571,"(box, sign)"
9,0.130119,"(sign, candle)"


In [21]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(box),(bag),0.382551,0.385865,0.232332,0.607323,1.573923,0.084719,1.563967
1,(bag),(box),0.385865,0.382551,0.232332,0.602106,1.573923,0.084719,1.551792
2,(sign),(bag),0.458569,0.385865,0.280979,0.61273,1.587937,0.104033,1.585805
3,(bag),(sign),0.385865,0.458569,0.280979,0.72818,1.587937,0.104033,1.991868
4,(box),(candle),0.382551,0.189244,0.132578,0.346562,1.831298,0.060182,1.240755
5,(candle),(box),0.189244,0.382551,0.132578,0.700565,1.831298,0.060182,2.062046
6,(box),(sign),0.382551,0.458569,0.271571,0.709894,1.548062,0.096144,1.866318
7,(sign),(box),0.458569,0.382551,0.271571,0.592213,1.548062,0.096144,1.514144


## Pruning with confidence
---
Once again, you've come up short: you found multiple useful rules, but can't narrow it down to one. Even worse, the two rules you found used the same itemset, but just swapped the antecedents and consequents. You decide to see whether pruning by another metric might allow you to narrow things down to a single association rule.

What would be the right metric? Support is identical for all rules generated from the same itemset, and lift is identical for a pair of rules that merely swap the antecedent and consequent, so you decide to use confidence instead, which depends on the direction of the rule.
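
The asymmetry is easy to see numerically, since confidence divides the joint support by the support of the antecedent only. The sketch below uses a hypothetical `confidence` helper with the box and bag supports copied from the rule tables earlier in this section.

```python
def confidence(support_ab, support_antecedent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    return support_ab / support_antecedent

# Supports copied from the rule tables earlier in this section
support_box = 0.382551
support_bag = 0.385865
support_box_bag = 0.232332

conf_box_to_bag = confidence(support_box_bag, support_box)
conf_bag_to_box = confidence(support_box_bag, support_bag)

print('confidence(box -> bag) = %.3f' % conf_box_to_bag)
print('confidence(bag -> box) = %.3f' % conf_bag_to_box)
```

The two directions share the same numerator but have different denominators, so a confidence threshold can keep one rule and drop its mirror image.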

In [22]:
# Compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(aggregated, min_support = 0.001, 
                            max_len=2, use_colnames = True)

# Compute all association rules using confidence
rules = association_rules(frequent_itemsets,
                          metric = "confidence",
                          min_threshold = 0.5)

# Print association rules
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(box),(bag),0.382551,0.385865,0.232332,0.607323,1.573923,0.084719,1.563967
1,(bag),(box),0.385865,0.382551,0.232332,0.602106,1.573923,0.084719,1.551792
2,(candle),(bag),0.189244,0.385865,0.108414,0.572881,1.484666,0.035392,1.437855
3,(sign),(bag),0.458569,0.385865,0.280979,0.61273,1.587937,0.104033,1.585805
4,(bag),(sign),0.385865,0.458569,0.280979,0.72818,1.587937,0.104033,1.991868


## Aggregation and filtering
---
We helped a gift store manager arrange the sections in her physical retail location according to association rules. The layout of the store forced us to group sections into two pairs of product types. After applying advanced filtering techniques, we proposed the floor layout below.

![image](https://assets.datacamp.com/production/repositories/5654/datasets/aea954a43d3541f900e9922f6e12cf0154a2820f/product_pairing_1_34F.png)

The store manager is now asking you to generate another floorplan proposal, but with a different criterion: each pair of sections should contain one high support product and one low support product.

In [23]:
# Apply the apriori algorithm with a minimum support of 0.0001
frequent_itemsets = apriori(aggregated, min_support = 0.0001, use_colnames = True)

# Generate the initial set of rules using a minimum support of 0.0001
rules = association_rules(frequent_itemsets, 
                          metric = "support", min_threshold = 0.0001)

# Set minimum antecedent support to 0.35
rules = rules[rules['antecedent support'] > 0.35]

# Set maximum consequent support to 0.35
rules = rules[rules['consequent support'] < 0.35]

# Print the remaining rules
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
3,(bag),(candle),0.385865,0.189244,0.108414,0.280964,1.484666,0.035392,1.12756
6,(box),(candle),0.382551,0.189244,0.132578,0.346562,1.831298,0.060182,1.240755
10,(sign),(candle),0.458569,0.189244,0.130119,0.283749,1.499382,0.043337,1.131944
16,(box),"(bag, candle)",0.382551,0.108414,0.089811,0.234768,2.165469,0.048337,1.165118
17,(bag),"(box, candle)",0.385865,0.132578,0.089811,0.232751,1.755584,0.038654,1.130562
21,(sign),"(box, bag)",0.458569,0.232332,0.190527,0.415481,1.78831,0.083987,1.313334
22,(box),"(bag, sign)",0.382551,0.280979,0.190527,0.498044,1.772527,0.083038,1.432436
23,(bag),"(box, sign)",0.385865,0.271571,0.190527,0.493766,1.818185,0.085737,1.438917
28,(bag),"(sign, candle)",0.385865,0.130119,0.091201,0.236354,1.816446,0.040992,1.139115
29,(sign),"(bag, candle)",0.458569,0.108414,0.091201,0.198881,1.83445,0.041485,1.112925


## Applying Zhang's rule
---
We learned that Zhang's rule is a continuous measure of association between two items that takes values in the [-1,+1] interval. A -1 value indicates a perfectly negative association and a +1 value indicates a perfectly positive association. In this exercise, you'll determine whether Zhang's rule can be used to refine a set of rules a gift store is currently using to promote products.

Note that the frequent itemsets have been computed for you and are available as frequent_itemsets. Additionally, association_rules() has been imported from mlxtend, and zhangs_rule() is defined in cell [24] below. You will start by re-computing the original set of rules. After that, you will apply Zhang's metric to select only those rules with a high and positive association.
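
Before applying the metric to a whole DataFrame, it can help to evaluate it for single rules. The scalar `zhang` helper below is a hypothetical sketch that mirrors the numerator and denominator used in this section. The box and candle supports are copied from the rule tables above, and the remaining inputs are made-up boundary cases.

```python
def zhang(support_ab, support_a, support_b):
    """Zhang's metric for a rule A -> B, computed from three supports."""
    numerator = support_ab - support_a * support_b
    denominator = max(support_ab * (1 - support_a),
                      support_a * (support_b - support_ab))
    return numerator / denominator

# box -> candle, using supports from the rule tables in this section
print('zhang(box -> candle) = %.2f' % zhang(0.132578, 0.382551, 0.189244))

# Made-up boundary cases
print(zhang(0.5, 0.5, 0.5))   # items always appear together
print(zhang(0.25, 0.5, 0.5))  # items behave as if independent
print(zhang(0.0, 0.5, 0.5))   # items never appear together
```

The three boundary cases correspond to the top, the midpoint, and the bottom of the [-1,+1] interval.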

In [24]:
# Define a function to compute Zhang's metric
def zhangs_rule(rules):
    # Compute the numerator and denominator of Zhang's metric
    numerator = rules['support'] - rules['antecedent support']*rules['consequent support']
    denominator = pd.concat([rules['support']*(1-rules['antecedent support']),
                             rules['antecedent support']*(rules['consequent support']-rules['support'])], axis=1).max(axis=1)

    # Return Zhang's metric
    return numerator / denominator

In [25]:
# Generate the initial set of rules using a minimum lift of 1.00
rules = association_rules(frequent_itemsets, metric = "lift", min_threshold = 1)

# Keep rules with antecedent support above 0.005
rules = rules[rules['antecedent support'] > 0.005]

# Keep rules with consequent support above 0.005
rules = rules[rules['consequent support'] > 0.005]

# Compute Zhang's rule
rules['zhang'] = zhangs_rule(rules)

# Set the lower bound for Zhang's rule to 0.8
rules = rules[rules['zhang'] > 0.8]
print(rules[['antecedents', 'consequents']])

    antecedents          consequents
16        (box)        (bag, candle)
21       (sign)           (box, bag)
29       (sign)        (bag, candle)
34        (box)       (candle, sign)
43   (box, bag)       (sign, candle)
44  (box, sign)        (bag, candle)
47        (box)  (sign, bag, candle)
48        (bag)  (sign, box, candle)
49       (sign)   (box, candle, bag)


## Advanced filtering with multiple metrics
---
Earlier, we used data from an online novelty gift store to find antecedents that could be used to promote a targeted consequent. Since the set of potential rules was large, we had to rely on the Apriori algorithm and multi-metric filtering to narrow it down. In this exercise, we'll examine the full set of rules and find a useful one, rather than targeting a particular antecedent.
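
The idea of multi-metric filtering can be sketched without any library: a rule survives only if it passes every threshold at once. The example below uses four rules copied from the `rules_2` output earlier in this section. The two thresholds are illustrative assumptions chosen for this sketch, not the values used in the exercise.

```python
# Rules 6 to 9 copied from the rules_2 output earlier in this section
rules = [
    {'antecedents': ('box', 'sign'), 'consequents': ('bag',),
     'confidence': 0.701575, 'lift': 1.818185},
    {'antecedents': ('bag', 'sign'), 'consequents': ('box',),
     'confidence': 0.678082, 'lift': 1.772527},
    {'antecedents': ('box', 'bag'), 'consequents': ('sign',),
     'confidence': 0.820064, 'lift': 1.788310},
    {'antecedents': ('sign',), 'consequents': ('box', 'bag'),
     'confidence': 0.415481, 'lift': 1.788310},
]

# Illustrative thresholds (assumptions for this sketch)
min_confidence = 0.60
min_lift = 1.78

# A rule is kept only if it clears both thresholds
filtered = [r for r in rules
            if r['confidence'] > min_confidence and r['lift'] > min_lift]

for r in filtered:
    print(r['antecedents'], '->', r['consequents'])
```

Each condition removes a different rule here, which is why combining metrics narrows the set faster than any single threshold.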

In [26]:
# Apply the Apriori algorithm with a minimum support threshold of 0.001
frequent_itemsets = apriori(aggregated, min_support = 0.001, use_colnames = True)

# Recover association rules using a minimum support threshold of 0.001
rules = association_rules(frequent_itemsets, metric = 'support', min_threshold = 0.001)

# Apply a 0.002 antecedent support threshold, 0.01 consequent support threshold,
# 0.60 confidence threshold, and 2.50 lift threshold
filtered_rules = rules[(rules['antecedent support'] > 0.002) &
                       (rules['consequent support'] > 0.01) &
                       (rules['confidence'] > 0.60) &
                       (rules['lift'] > 2.50)]

# Print remaining rules
print(filtered_rules[['antecedents','consequents']])

       antecedents  consequents
41   (bag, candle)  (box, sign)
42  (sign, candle)   (box, bag)
