# Affinity Analysis for 3 Items

## Use the Affinity Analysis Example and the Affinity Dataset to complete Parts A to G below.

### Part A: Write down a statement that best describes the support of rules containing three items at a time. For example:  If a person buys cheese and apples, then they are likely to purchase bananas. 

Express your answer as P(X, Y, Z) = P(X|Y,Z)P(Y|Z)P(Z) 
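
As an illustration of the support written as a joint probability, the sketch below estimates P(X, Y, Z) as the fraction of transactions that contain all three items and checks it against the chain-rule factorisation. The transactions and the helper name `support` are made up for this example and are not taken from the Affinity Dataset. Note that the code in Part C reports support as a raw count rather than a fraction.

```python
import math

# Hypothetical toy transactions (not from the Affinity Dataset).
transactions = [
    {"cheese", "apples", "bananas"},
    {"cheese", "apples"},
    {"apples", "bananas"},
    {"cheese", "apples", "bananas", "bread"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)

# X = cheese, Y = apples, Z = bananas
p_z = support({"bananas"}, transactions)
p_yz = support({"apples", "bananas"}, transactions)
p_xyz = support({"cheese", "apples", "bananas"}, transactions)

# Chain rule: P(X, Y, Z) = P(X | Y, Z) * P(Y | Z) * P(Z)
chain = (p_xyz / p_yz) * (p_yz / p_z) * p_z

print(p_xyz, math.isclose(chain, p_xyz))
```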

### Part B: Write down a statement that best describes the confidence of rules containing three items at a time, i.e. given X and Y, a transaction also contains Z. For example: If a person buys cheese and apples, then they are likely to purchase bananas.

Express your answer as P(Z|X,Y) = P(X,Y,Z)/P(X,Y)
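
A rule's confidence is conventionally the conditional probability of the conclusion given the premises: of the transactions that contain X and Y, the share that also contain Z. The sketch below computes this on made-up transactions; the data and the helper name `count_containing` are assumptions for illustration only.

```python
import math

# Hypothetical toy transactions (not from the Affinity Dataset).
transactions = [
    {"cheese", "apples", "bananas"},
    {"cheese", "apples"},
    {"apples", "bananas"},
    {"cheese", "apples", "bananas", "bread"},
    {"bread"},
]

def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)

premises = {"cheese", "apples"}   # X and Y
conclusion = "bananas"            # Z

n_premises = count_containing(premises, transactions)
n_rule = count_containing(premises | {conclusion}, transactions)

# Confidence = count(X, Y, Z) / count(X, Y); guard against an unseen premise pair.
confidence_xy_z = n_rule / n_premises if n_premises else 0.0
print(n_rule, n_premises, confidence_xy_z)
```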

### Part C: Revise the code developed in the Affinity Example to calculate the support and confidence of rules containing three items at a time. For example:  If a person buys cheese and apples, then they are likely to purchase bananas. 

#### Load the dataset 

In [3]:
import numpy as np
import pandas as pd # read xls
dataset_filename = "ReducedData.xlsx"
X = pd.read_excel(dataset_filename)
n_samples, n_features = X.shape 
print("This dataset has {0} samples and {1} features".format(n_samples, n_features))

This dataset has 2897 samples and 12 features


#### List the features

In [4]:
# The names of the features, for your reference.
features = list(X.columns)
print(features)

['household_key', 'BASKET_ID', 'DAY', 'PRODUCT_ID', 'QUANTITY', 'SALES_VALUE', 'STORE_ID', 'RETAIL_DISC', 'TRANS_TIME', 'WEEK_NO', 'COUPON_DISC', 'COUPON_MATCH_DISC']


#### Print transactions 

Print the first five rows of the dataset to get a sense of what the dataset looks like. The result shows which items were bought in the first five transactions listed.

In [5]:
print(X.head())

   household_key    BASKET_ID  DAY  PRODUCT_ID  QUANTITY  SALES_VALUE  \
0           1228  29046618323  157       20000         1         3.49   
1            358  30707611686  247       20000         1         3.49   
2           1675  30760265177  250       20000         1         0.99   
3           1420  30591251330  238       20000         1         1.54   
4            486  30636771192  242       20000         2         1.98   

   STORE_ID  RETAIL_DISC  TRANS_TIME  WEEK_NO  COUPON_DISC  COUPON_MATCH_DISC  
0      3313         0.00        2213       23          0.0                0.0  
1      3266         0.00        1211       36          0.0                0.0  
2      3235         0.00         936       36          0.0                0.0  
3      3297         0.00        1342       35          0.0                0.0  
4      3217        -0.52        1411       35          0.0                0.0  


#### Compute the Support and Confidence

Compute the Support and Confidence of the rule "If a person buys cheese and apples, then they are likely to purchase bananas." 

In [6]:
# Get a list with all the products
products = []
for index, sale in X.iterrows():
    products.append(int(sale['PRODUCT_ID']))
products = list(set(products))

# We create a dictionary holding which products each household bought
whoBought = {}
for index, sale in X.iterrows():
    if int(sale['household_key']) in whoBought:
        whoBought[int(sale['household_key'])].append(int(sale['PRODUCT_ID'])) 
    else:
        whoBought[int(sale['household_key'])] = [int(sale['PRODUCT_ID'])]

# Remove duplicate products within each household
for key in whoBought.keys():
    whoBought[key] = list(set(whoBought[key]))

# Count how many distinct products each household bought
boughtCount = {key:len(whoBought[key]) for key in whoBought.keys()}

for key in whoBought.keys():
    print("household_{0} bought products {1}".format(key, whoBought[key]))


household_1536 bought products [30000, 70000, 90000, 100000, 120000]
household_1409 bought products [20000, 40000, 60000, 80000, 100000, 120000, 160000, 180000, 200000, 50000, 30000, 70000, 90000, 110000, 130000, 150000, 170000, 190000]
household_2 bought products [30000, 170000, 60000, 100000]
household_2181 bought products [120000]
household_2182 bought products [20000, 40000, 60000, 80000, 100000, 120000, 140000, 160000, 200000, 50000, 30000, 70000, 90000, 110000, 130000, 150000, 190000]
household_1722 bought products [40000, 200000, 80000, 120000, 190000, 70000]
household_1032 bought products [20000, 40000, 60000, 80000, 100000, 120000, 140000, 160000, 200000, 50000, 30000, 70000, 90000, 110000, 130000, 150000, 170000, 190000]
household_364 bought products [20000, 40000, 60000, 80000, 100000, 120000, 140000, 50000, 30000, 70000, 90000, 110000, 130000, 190000]
household_1675 bought products [20000, 40000, 60000, 80000, 100000, 120000, 140000, 160000, 180000, 200000, 50000, 30000, 70
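
The household-to-products mapping can also be built in a single pass with sets, which removes duplicates as it goes and makes the later `in` checks cheaper than list lookups. This is only a sketch: the small frame `sales` is a made-up stand-in that reuses the two column names from the dataset, and `who_bought` is a hypothetical name chosen to avoid clashing with `whoBought` above.

```python
from collections import defaultdict

import pandas as pd

# Made-up stand-in for the real data; only the column names match the dataset.
sales = pd.DataFrame({
    "household_key": [1, 1, 2, 2, 2, 3],
    "PRODUCT_ID": [20000, 40000, 20000, 20000, 60000, 40000],
})

# One pass over the two columns; a set per household drops repeat purchases.
who_bought = defaultdict(set)
for household, product in zip(sales["household_key"], sales["PRODUCT_ID"]):
    who_bought[int(household)].add(int(product))

for household, bought in who_bought.items():
    print("household_{0} bought products {1}".format(household, sorted(bought)))
```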

In [15]:
# Compute the rule for products 20000, 40000, 60000
c2, c4, c6 = (0,0,0)
for h in whoBought.keys():
    if 20000 in whoBought[h]:
        c2 += 1
    if 40000 in whoBought[h]:
        c4 += 1
    if 60000 in whoBought[h]:
        c6 += 1
        
rule_valid = 0
rule_invalid = 0
for h in whoBought.keys():
    if 20000 in whoBought[h]: # this person bought Cheese
        if 40000 in whoBought[h]: # this person bought Cheese and Apples
            if 60000 in whoBought[h]: # this person bought Cheese and Apples AND Bananas
                rule_valid +=1
            else:
                rule_invalid +=1
                
print("{0} cases of the rule being valid were discovered".format(rule_valid))
print("{0} cases of the rule being invalid were discovered".format(rule_invalid))

13 cases of the rule being valid were discovered
13 cases of the rule being invalid were discovered


In [19]:
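support = rule_valid  # The support is the number of households for which the rule holds.
# Each ratio below divides by the number of households that bought one single product,
# so it is the share of that product's buyers who bought all three products.
confidence2 = rule_valid / c2
confidence4 = rule_valid / c4
confidence6 = rule_valid / c6
print(confidence2, confidence4, confidence6)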

0.4642857142857143 0.34210526315789475 0.8125
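
For the rule as stated (20000 and 40000 together imply 60000), the confidence from Part B uses the households that bought both premises as the denominator, which is `rule_valid + rule_invalid` in the cell above. A minimal sketch, using the counts printed earlier (13 valid, 13 invalid); the function name `rule_confidence` is hypothetical.

```python
def rule_confidence(valid, invalid):
    """Share of premise-matching households for which the conclusion also holds."""
    total = valid + invalid
    # Return 0.0 when no household matched the premises, to avoid dividing by zero.
    return valid / total if total else 0.0

# Counts reported above for premises 20000 and 40000 with conclusion 60000.
rule_conf = rule_confidence(13, 13)
print("{0:.3f}".format(rule_conf))
```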


#### Compute these statistics for all rules in the dataset

In [24]:
from collections import defaultdict
# Now compute for all possible rules
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

# Compute the support for each product combination
for premiseA in products:
    for premiseB in products:
        if premiseB == premiseA: continue
        for conclusion in products:
            if conclusion == premiseA or conclusion == premiseB: continue
            for h in whoBought.keys():
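                if premiseA in whoBought[h] and premiseB in whoBought[h]:
                    # Note: this line runs once per candidate conclusion, so each
                    # premise pair is counted once for every conclusion tried and the
                    # confidences printed below are scaled down by that factor.
                    num_occurences[premiseA, premiseB] += 1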
                    if conclusion in whoBought[h]:
                        valid_rules[(premiseA, premiseB, conclusion)] += 1
                    else:
                        invalid_rules[(premiseA, premiseB, conclusion)] += 1


support = valid_rules
confidence = defaultdict(float)
for premiseA, premiseB, conclusion in valid_rules.keys():
    confidence[(premiseA, premiseB, conclusion)] = valid_rules[(premiseA, premiseB, conclusion)] / num_occurences[premiseA, premiseB]
    
for premiseA, premiseB, conclusion in confidence:
    print("Rule: If a person buys {0} and {1} they will also buy {2}".format(premiseA, premiseB, conclusion))
    print(" - Confidence: {0:.3f}".format(confidence[(premiseA, premiseB, conclusion)]))
    print(" - Support: {0}".format(support[(premiseA, premiseB, conclusion)]))
    print("")

Rule: If a person buys 120000 and 60000 they will also buy 150000
 - Confidence: 0.029
 - Support: 11

Rule: If a person buys 190000 and 100000 they will also buy 110000
 - Confidence: 0.059
 - Support: 17

Rule: If a person buys 170000 and 20000 they will also buy 100000
 - Confidence: 0.048
 - Support: 9

Rule: If a person buys 110000 and 120000 they will also buy 50000
 - Confidence: 0.056
 - Support: 20

Rule: If a person buys 90000 and 50000 they will also buy 80000
 - Confidence: 0.055
 - Support: 28

Rule: If a person buys 150000 and 80000 they will also buy 30000
 - Confidence: 0.055
 - Support: 14

Rule: If a person buys 40000 and 190000 they will also buy 110000
 - Confidence: 0.056
 - Support: 19

Rule: If a person buys 140000 and 200000 they will also buy 20000
 - Confidence: 0.059
 - Support: 10

Rule: If a person buys 110000 and 180000 they will also buy 130000
 - Confidence: 0.059
 - Support: 9

Rule: If a person buys 170000 and 190000 they will also buy 180000
 - Confid
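
One possible way to obtain confidences that match the definition in Part B is to count each premise pair once per household and to treat (A, B) and (B, A) as the same pair. The sketch below does this with `itertools.combinations` on a small made-up mapping; `who_bought`, `pair_counts`, `rule_counts` and `rule_conf` are hypothetical names, and the numbers are not results from the dataset.

```python
from collections import defaultdict
from itertools import combinations

# Made-up household -> products mapping, shaped like whoBought above.
who_bought = {
    1: {20000, 40000, 60000},
    2: {20000, 40000},
    3: {20000, 40000, 60000, 80000},
    4: {40000, 60000},
}

pair_counts = defaultdict(int)   # households that bought both premises
rule_counts = defaultdict(int)   # households that bought premises and conclusion

for bought in who_bought.values():
    # sorted() gives each unordered pair one canonical key (a < b).
    for a, b in combinations(sorted(bought), 2):
        pair_counts[(a, b)] += 1          # counted once per household
        for c in bought:
            if c != a and c != b:
                rule_counts[(a, b, c)] += 1

# Confidence = households satisfying the rule / households matching the premises.
rule_conf = {
    rule: n / pair_counts[rule[:2]] for rule, n in rule_counts.items()
}

for (a, b, c), value in sorted(rule_conf.items()):
    print("If {0} and {1} then {2}: confidence {3:.3f}, support {4}".format(
        a, b, c, value, rule_counts[(a, b, c)]))
```

Iterating over the products each household actually bought, rather than over every product triple, also avoids the four nested loops used above.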

#### Create a function that will print out the rules in a readable format

In [25]:
def print_rule(premiseA, premiseB, conclusion):
    print("Rule: If a person buys {0} and {1} they will also buy {2}".format(premiseA, premiseB, conclusion))
    print(" - Confidence: {0:.3f}".format(confidence[(premiseA, premiseB, conclusion)]))
    print(" - Support: {0}".format(support[(premiseA, premiseB, conclusion)]))
    print("")

#### Test the function

Call the print_rule function to report the support and confidence statistics for the rule: If a person buys cheese and apples, then they are likely to purchase bananas.

In [26]:
print_rule(20000, 40000, 60000)

Rule: If a person buys 20000 and 40000 they will also buy 60000
 - Confidence: 0.050
 - Support: 22



### Part D: Sort the rules derived in Part C according to support and print the top 5

In [29]:
from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)

for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premiseA, premiseB, conclusion) = sorted_support[index][0]
    print_rule(premiseA, premiseB, conclusion)

Rule #1
Rule: If a person buys 40000 and 70000 they will also buy 80000
 - Confidence: 0.054
 - Support: 31

Rule #2
Rule: If a person buys 70000 and 80000 they will also buy 40000
 - Confidence: 0.055
 - Support: 31

Rule #3
Rule: If a person buys 40000 and 80000 they will also buy 70000
 - Confidence: 0.059
 - Support: 31

Rule #4
Rule: If a person buys 80000 and 40000 they will also buy 70000
 - Confidence: 0.059
 - Support: 31

Rule #5
Rule: If a person buys 80000 and 70000 they will also buy 40000
 - Confidence: 0.055
 - Support: 31
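
Rules #3 and #4 above are the same rule with the two premises swapped, so a top-5 list can fill up with mirrored duplicates. A small sketch of one way to collapse them before sorting, keyed on an unordered premise pair; the dictionary `support_counts` reuses three entries from the output above, and the other names are hypothetical.

```python
# Three entries copied from the output above; two differ only in premise order.
support_counts = {
    (40000, 80000, 70000): 31,
    (80000, 40000, 70000): 31,
    (40000, 70000, 80000): 31,
}

# A frozenset ignores premise order, so mirrored rules share one key.
unique_rules = {}
for (premise_a, premise_b, conclusion), count in support_counts.items():
    unique_rules[(frozenset((premise_a, premise_b)), conclusion)] = count

ranked_unique = sorted(unique_rules.items(), key=lambda item: item[1], reverse=True)
for (premises, conclusion), count in ranked_unique:
    print("If {0} then {1}: support {2}".format(sorted(premises), conclusion, count))
```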



### Part E: Sort the rules derived in Part C according to confidence and print the top 5 

In [33]:
from operator import itemgetter
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premiseA, premiseB, conclusion) = sorted_confidence[index][0]
    print_rule(premiseA, premiseB, conclusion)

Rule #1
Rule: If a person buys 190000 and 100000 they will also buy 110000
 - Confidence: 0.059
 - Support: 17

Rule #2
Rule: If a person buys 140000 and 200000 they will also buy 20000
 - Confidence: 0.059
 - Support: 10

Rule #3
Rule: If a person buys 110000 and 180000 they will also buy 130000
 - Confidence: 0.059
 - Support: 9

Rule #4
Rule: If a person buys 140000 and 130000 they will also buy 80000
 - Confidence: 0.059
 - Support: 11

Rule #5
Rule: If a person buys 130000 and 40000 they will also buy 70000
 - Confidence: 0.059
 - Support: 21
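
Many rules above tie on confidence while their supports differ widely, so it can help to drop rules below a minimum support and break confidence ties by support. The sketch below shows the idea on invented values; `min_support`, `toy_support` and `toy_confidence` are assumptions for illustration, and the threshold of 5 is arbitrary.

```python
# Invented rule statistics keyed by (premiseA, premiseB, conclusion).
toy_support = {
    ("A", "B", "C"): 31,
    ("A", "C", "B"): 9,
    ("B", "C", "A"): 21,
    ("A", "B", "D"): 2,
}
toy_confidence = {
    ("A", "B", "C"): 0.80,
    ("A", "C", "B"): 0.80,
    ("B", "C", "A"): 0.60,
    ("A", "B", "D"): 1.00,
}

min_support = 5  # arbitrary threshold chosen for this example

# Keep well-supported rules, then rank by confidence and break ties by support.
ranked = sorted(
    (rule for rule in toy_confidence if toy_support[rule] >= min_support),
    key=lambda rule: (toy_confidence[rule], toy_support[rule]),
    reverse=True,
)

for rule in ranked:
    print(rule, toy_confidence[rule], toy_support[rule])
```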



### Part F: What recommendations would you make to the manager of the store where this data has come from, based on this analysis, and why? 

### Part G: Prepare the data in the transaction dataset, and then analyse it in a similar way (as above). Use the results to make recommendations to the manager of the store where the data was collected. 