# Exercise 5

Today, we will implement two techniques that are part of so-called shopping basket analysis: frequent itemset mining with the Apriori algorithm, and association rule learning. They will help us better understand how customer data is processed to extract insights about shopping habits.


#### Notes about external libraries
You can check your implementation of the Apriori algorithm and of the association rules using MLxtend, a data mining library. The library is not shipped with Anaconda by default. To install MLxtend, run

```bash
pip install mlxtend  
```

Or directly using Anaconda

```bash
conda install -c conda-forge mlxtend 
```

Note that the installation of MLxtend is not mandatory, as we will provide the expected results in pre-rendered cells.

## Exercise 5.1

In the first part of this exercise, we will put the Apriori algorithm into practice. In particular, we will extract frequent itemsets from a list of transactions coming from a grocery store. You will have to complete the function `get_support(...)`.

In [24]:
import operator
import numpy as np

"""
Format the transaction dataset.
Expect a list of transaction in the format:
[[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], ...]
"""
def preprocess(dataset):
    unique_items = set()
    for transaction in dataset:
        for item in transaction:
            unique_items.add(item)
       
    # Converting to frozensets to use itemsets as dict key
    unique_items = [frozenset([i]) for i in list(unique_items)]
    
    return unique_items,list(map(set,dataset))


"""
Generate candidates of length n+1 from a list of items, each of length n.

Example:
[{1}, {2}, {5}]          -> [{1, 2}, {1, 5}, {2, 5}]
[{2, 3}, {2, 5}, {3, 5}] -> [{2, 3, 5}]
"""
def generate_candidates(Lk):
    output = []

    # We generate candidates of the target size k
    k=len(Lk[0])+1
    
    for i in range(len(Lk)):
        for j in range(i+1, len(Lk)): 
            # Sort before slicing: the iteration order of a set is arbitrary,
            # so the prefix must be taken from the sorted items
            L1 = sorted(Lk[i])[:k-2]
            L2 = sorted(Lk[j])[:k-2]

            # Merge sets if first k-2 elements are equal
            # For k=2 the prefixes are empty, so all possible pairs are generated
            if L1==L2: 
                output.append(Lk[i] | Lk[j])

    return output


"""
Print the results of the apriori algorithm
"""
def print_support(support,max_display=10,min_items=1):
    print('support\t itemset')
    print('-'*30)
    filt_support = {k:v for k,v in support.items() if len(k)>=min_items}
    for s,sup in sorted(filt_support.items(), key=operator.itemgetter(1),reverse=True)[:max_display]:
        print("%.2f" % sup,'\t',set(s))
        
def print_support_mx(df,max_display=10,min_items=1):
    print('support\t itemset')
    print('-'*30)
    lenrow = df['itemsets'].apply(lambda x: len(x))
    df  = df[lenrow>=min_items]
    df  = df.sort_values('support',ascending=False).iloc[:max_display]
    for i,row in df.iterrows():
        print("%.2f" % float(row['support']),'\t',set(row['itemsets']))
        

"""
Run the apriori algorithm

dataset     : list of transactions
min_support : minimum support. Itemsets with support below this threshold
              will be pruned.
L           : frequent itemsets. L[0] : [{'a'}, {'b'}]; L[1]: [{'a', 'b'}, {'c', 'd'}]
supportData : dictionary mapping itemsets to their corresponding support.
"""
def apriori(dataset, min_support = 0.5):
    unique_items,dataset = preprocess(dataset)
    L1, supportData      = get_support(dataset, unique_items, min_support) # L1 = {'a'} {'b'} {'c'}
    L = [L1]
    k = 0
    
    while True:
        Ck       = generate_candidates(L[k]) # Join frequent itemsets into larger candidates: {'a'}, {'b'} => {'a', 'b'}
        Lk, supK = get_support(dataset, Ck, min_support)
        
        # Do any of the candidate itemsets reach the minimum support?
        if len(Lk)>0:
            supportData.update(supK) # add the support of the newly evaluated itemsets
            L.append(Lk) 
            k += 1
        else:
            break # stop: no larger itemset reaches the minimum support
            
    return L, supportData

### TODO

The [Apriori Algorithm](https://en.wikipedia.org/wiki/Apriori_algorithm) identifies frequent combinations of items by extending them to larger and larger itemsets (see the generate_candidates function) as long as they appear sufficiently often in a list of transactions.
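
As a minimal sketch of the join step mentioned above (not the notebook's own `generate_candidates`; the helper name `join_candidates` is made up for illustration), two itemsets of size $k$ are merged only when their first $k-1$ items agree once sorted. The inputs are the two examples from the `generate_candidates` docstring.

```python
def join_candidates(itemsets):
    """Join k-itemsets that share their first k-1 items (in sorted order)."""
    candidates = []
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            a, b = sorted(itemsets[i]), sorted(itemsets[j])
            # Compare everything except the last item of each sorted itemset
            if a[:-1] == b[:-1]:
                candidates.append(itemsets[i] | itemsets[j])
    return candidates


L1 = [frozenset([1]), frozenset([2]), frozenset([5])]
L2 = [frozenset([2, 3]), frozenset([2, 5]), frozenset([3, 5])]

print(join_candidates(L1))  # all pairs, since the prefixes are empty
print(join_candidates(L2))  # only {2, 3} and {2, 5} share the prefix [2]
```

Sorting before comparing prefixes matters because sets have no guaranteed iteration order.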

Compute the support of every candidate itemset contained in Ck, given the full list of transactions. The functions that compute the candidate itemsets are already provided. The support of an itemset $X$ with respect to the list of transactions $T$ is defined as the proportion of transactions $t$ in the dataset that contain the itemset $X$. Support can be computed using the following formula

$$\mathrm{supp}(X) = \frac{|\{t \in T; X \subseteq t\}|}{|T|}$$  

After computing the support for each itemset, prune the ones that do not reach the specified minimum support.
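
To make the formula concrete, here is a small sketch on a toy dataset. The first three transactions come from the `preprocess` docstring; the fourth one and the helper name `supp` are assumptions added for illustration.

```python
# Toy transactions; the last one is an assumption, not part of the notebook
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


print(supp({2, 5}, transactions))     # contained in 3 of 4 transactions
print(supp({4}, transactions))        # contained in 1 of 4 transactions
print(supp({2, 3, 5}, transactions))  # contained in 2 of 4 transactions
```

With `min_support = 0.5`, the itemset `{4}` would be pruned while `{2, 5}` and `{2, 3, 5}` would be kept.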

In [2]:
"""
Compute support for each provided itemset by counting the number of
its occurrences in the original dataset of transactions.

dataset      : list of transactions, preprocessed using 'preprocess()'
Ck           : list of itemsets to compute support for. 
min_support  : minimum support. Itemsets with support below this threshold
               will be pruned.
              
output       : list of remaining itemsets, after the pruning step. 
support_dict : dictionary containing the support value for each itemset.
"""
def get_support(dataset, Ck, min_support):
    
    # This dictionary should contain the number of appearances of each itemset in the dataset.
    # Itemsets in Ck are represented as frozensets and can directly be used as dictionary keys.
    support_count = {}
    
    for transaction in dataset:
        for candidate in Ck:
            if candidate.issubset(transaction):
                if candidate in support_count:
                    support_count[candidate] += 1
                else:
                    support_count[candidate] = 1
    
    output = []
    support_dict = {}
    for key in support_count:
        # Calculate fraction of presence of itemset over all transactions
        support = support_count[key] / float(len(dataset))
        
        if support >= min_support:
            output.insert(0,key)
            
        support_dict[key] = support
    return output, support_dict

### Run

In [25]:
with open('groceries.csv') as f:
    dataset = [line.strip().split(',') for line in f]

L,support = apriori(dataset,min_support=0.01)
print_support(support,10,min_items=2)

support	 itemset
------------------------------
0.07 	 {'other vegetables', 'whole milk'}
0.06 	 {'rolls/buns', 'whole milk'}
0.06 	 {'yogurt', 'whole milk'}
0.05 	 {'root vegetables', 'whole milk'}
0.05 	 {'other vegetables', 'root vegetables'}
0.04 	 {'yogurt', 'other vegetables'}
0.04 	 {'rolls/buns', 'other vegetables'}
0.04 	 {'tropical fruit', 'whole milk'}
0.04 	 {'soda', 'whole milk'}
0.04 	 {'rolls/buns', 'soda'}


### Check

You can check the results of your implementation using MLxtend. Just run the cell below:

In [7]:
import pandas as pd
from mlxtend.frequent_patterns import apriori as mx_apriori

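# Series/DataFrame.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent.
# astype(bool) turns the per-transaction item counts into the True/False table MLxtend expects.
df_dummy = pd.get_dummies(pd.Series(dataset).apply(pd.Series).stack()).groupby(level=0).sum().astype(bool)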
frequent_itemsets = mx_apriori(df_dummy, min_support=0.01, use_colnames=True)
print_support_mx(frequent_itemsets,10,min_items=2)

support	 itemset
------------------------------
0.07 	 {'whole milk', 'other vegetables'}
0.06 	 {'rolls/buns', 'whole milk'}
0.06 	 {'yogurt', 'whole milk'}
0.05 	 {'root vegetables', 'whole milk'}
0.05 	 {'root vegetables', 'other vegetables'}
0.04 	 {'yogurt', 'other vegetables'}
0.04 	 {'rolls/buns', 'other vegetables'}
0.04 	 {'tropical fruit', 'whole milk'}
0.04 	 {'soda', 'whole milk'}
0.04 	 {'rolls/buns', 'soda'}
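
The check cell above first reshapes the transactions into a one-hot table (one row per transaction, one column per item) before handing them to MLxtend. The following standard-library sketch shows the same idea on a made-up toy basket list; the item names and variable names are assumptions for illustration only.

```python
# Hypothetical toy transactions, not taken from groceries.csv
transactions = [["milk", "bread"], ["bread"], ["milk", "eggs", "bread"]]

# One column per distinct item, in a fixed (sorted) order
items = sorted({item for t in transactions for item in t})

# One boolean row per transaction: True if the item was bought
rows = [[item in set(t) for item in items] for t in transactions]

# The mean of a column is the support of the corresponding single item
column_support = {
    item: sum(row[i] for row in rows) / len(rows)
    for i, item in enumerate(items)
}

print(items)
for row in rows:
    print(row)
print(column_support)
```

In this representation, the support of a larger itemset is the fraction of rows in which all of its columns are `True`.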


## Exercise 5.2

The associations captured by frequent itemsets are not necessarily symmetric. Therefore, in the second part, we will use [association rule learning](https://en.wikipedia.org/wiki/Association_rule_learning) to better understand the directionality of our computed frequent itemsets. In other words, we will infer whether the purchase of one item generally implies the purchase of another.

In [45]:
"""
L              : itemsets
supportData    : dictionary storing itemsets support
min_confidence : rules with a confidence under this threshold should be pruned
"""
def generate_rules(L, supportData, min_confidence=0.7):  
    # Rules to be computed
    rules = []
    
    # Iterate over itemsets of length 2..N
    for i in range(1, len(L)):
        
        # Iterate over each frequent itemset
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
#             print("H1: {}".format(H1))
            # If the itemset contains more than 2 elements
            # recursively generate candidates 
            if (i+1 > 2):
                rules_from_consequent(freqSet, H1, supportData, rules, min_confidence)
                compute_confidence(freqSet, H1, supportData, rules, min_confidence)
            # If the itemset contains 2 or fewer elements
            # compute rule confidence
            else:
                compute_confidence(freqSet, H1, supportData, rules, min_confidence)

    return rules   

"""
freqSet        : frequent itemset
H              : candidate elements to create a rule
supportData    : dictionary storing itemsets support
rules          : array to store rules
min_confidence : rules with a confidence under this threshold should be pruned
"""
def rules_from_consequent(freqSet, H, supportData, rules, min_confidence=0.7):
    m = len(H[0])
#     print("H: {}".format(H))
    if (len(freqSet) > (m + 1)): 

        # create new candidates of size n+1 i.e. merge individual sets
        Hmp1 = generate_candidates(H)
        Hmp1 = compute_confidence(freqSet, Hmp1, supportData, rules, min_confidence)
        
        if (len(Hmp1) > 1):    #need at least two sets to merge
            rules_from_consequent(freqSet, Hmp1, supportData, rules, min_confidence)
            
"""
Print the resulting rules
"""
def print_rules(rules,max_display=10):
    print('confidence\t rule')
    print('-'*30)
    for a,b,conf in sorted(rules, key=lambda x: x[2],reverse=True)[:max_display]:
        print("%.2f" % conf,'\t',set(a),'->',set(b))

def print_rules_mx(df,max_display=10):
    print('confidence\t rule')
    print('-'*30)
    df  = df.sort_values('confidence',ascending=False).iloc[:max_display]
    for i,row in df.iterrows():
        # Recent MLxtend releases name this column 'antecedents';
        # older releases used the misspelling 'antecedants'
        print("%.2f" % float(row['confidence']),'\t',set(row['antecedents']),'->',set(row['consequents']))

### TODO

You will have to complete the function `compute_confidence(...)`, which computes the confidence of a set of candidate rules H and prunes the rules whose confidence is below the specified threshold. Compute the confidence of each rule using the following formula:

$$\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$$
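
As a small worked sketch of this formula, assume a support dictionary keyed by frozensets, as in `supportData`. The items `a` and `b`, their support values, and the helper name `confidence` are made up for illustration.

```python
# Hypothetical supports, keyed by frozenset like supportData
supports = {
    frozenset({"a"}): 0.5,
    frozenset({"b"}): 0.25,
    frozenset({"a", "b"}): 0.2,
}


def confidence(X, Y, supports):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    X, Y = frozenset(X), frozenset(Y)
    return supports[X | Y] / supports[X]


print(confidence({"a"}, {"b"}, supports))  # 0.2 / 0.5
print(confidence({"b"}, {"a"}, supports))  # 0.2 / 0.25
```

Both rules share the same numerator, but they are divided by different antecedent supports, which is why a rule and its reverse can have very different confidence.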


In [47]:
"""
Compute confidence for a given set of rules and their respective support

freqSet        : frequent itemset with N elements
H              : list of candidate elements Y1, Y2... that are part of the frequent itemset
supportData    : dictionary storing itemsets support
rules          : array to store rules
min_confidence : rules with a confidence under this threshold should be pruned
"""
def compute_confidence(freqSet, H, supportData, rules, min_confidence=0.7):
    prunedH = [] 
    
    for Y in H:
        # Compute X which is the frequent itemset minus the considered Y
        X           = freqSet - Y
        
#         print("freqSet: {}; X : {}; Y: {}".format(freqSet, X, Y))
        
        # Compute support for both terms
        support_XuY = supportData[freqSet]
        support_X   = supportData[X]
        
        
        # Compute confidence
        conf        = support_XuY / support_X
        
        if conf >= min_confidence: 
            rules.append((X, Y, conf))
            prunedH.append(Y)
#     print()
    return prunedH

### Run

In [40]:
rules=generate_rules(L,support, min_confidence=0.1)
print_rules(rules,10)


### Check

You can check the results of your implementation using MLxtend. Just run the cell below (you will have to run the checking code of Exercise 5.1 first).

In [10]:
from mlxtend.frequent_patterns import association_rules as mx_association_rules

rules_mx = mx_association_rules(frequent_itemsets, metric="confidence", min_threshold=0.1)
print_rules_mx(rules_mx,max_display=10)

confidence	 rule
------------------------------
0.59 	 {'root vegetables', 'citrus fruit'} -> {'other vegetables'}
0.58 	 {'tropical fruit', 'root vegetables'} -> {'other vegetables'}
0.58 	 {'yogurt', 'curd'} -> {'whole milk'}
0.57 	 {'butter', 'other vegetables'} -> {'whole milk'}
0.57 	 {'tropical fruit', 'root vegetables'} -> {'whole milk'}
0.56 	 {'root vegetables', 'yogurt'} -> {'whole milk'}
0.55 	 {'domestic eggs', 'other vegetables'} -> {'whole milk'}
0.52 	 {'yogurt', 'whipped/sour cream'} -> {'whole milk'}
0.52 	 {'rolls/buns', 'root vegetables'} -> {'whole milk'}
0.52 	 {'pip fruit', 'other vegetables'} -> {'whole milk'}


## EPFL Twitter Data

Now that we have a working implementation, we will apply the Apriori algorithm to a dataset that you should know pretty well by now: EPFL Twitter data. In this scenario, tweets are treated as transactions and words as items. Let's see what kind of frequent associations we can discover.

The code below cleans the tweets and puts them in the same format as the transactions of the previous exercise. Run the cells and generate the results for both algorithms. What can you observe from the association rules results? Briefly explain.
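
To illustrate how a tweet becomes a transaction, here is a simplified standard-library sketch. It is not the notebook's pipeline: it skips stemming, and the stop-word set, the example tweets, and the helper name `to_transaction` are all made up for illustration.

```python
import string

# Tiny hypothetical stop-word list; the notebook uses NLTK's lists instead
STOPWORDS = {"the", "a", "at", "is", "of"}


def to_transaction(tweet):
    """Lowercase, strip punctuation, drop stop words, keep distinct words."""
    cleaned = "".join(ch for ch in tweet.lower() if ch not in string.punctuation)
    return sorted({word for word in cleaned.split() if word not in STOPWORDS})


tweets = ["New robot at EPFL!", "The EPFL robot is new."]
transactions = [to_transaction(t) for t in tweets]
print(transactions)
```

Both example tweets reduce to the same set of words, so as transactions they are indistinguishable: word order and repetition are discarded.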

In [42]:
# Loading of libraries and documents

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')

# Tokenize, stem a document
stemmer = PorterStemmer()
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    return " ".join([stemmer.stem(word.lower()) for word in tokens])

# Remove stop words
# Build the stop-word set once instead of reloading the lists for every word
stop_words = (set(stopwords.words('english')) |
              set(stopwords.words('german')) |
              set(stopwords.words('french')))

def clean_voc(documents):
    cleaned = []
    for tweet in documents:
        new_tweet = []
        tweet = tokenize(tweet).split()
        for word in tweet:
            if word not in stop_words:
                if word=="epflen":
                    word = "epfl"
                new_tweet.append(word)
        if len(new_tweet)>0:
            cleaned.append(new_tweet)
    return cleaned

# Read a list of documents from a file. Each line in a file is a document
with open("epfldocs.txt") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = clean_voc(original_documents)

[nltk_data] Downloading package stopwords to /Users/yawen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/yawen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [43]:
L,support = apriori(documents,min_support = 0.01)
print_support(support,20,min_items=2)

support	 itemset
------------------------------
0.08 	 {'epfl', 'via'}
0.06 	 {'epfl', '’'}
0.05 	 {'epfl', 'new'}
0.05 	 {'amp', 'epfl'}
0.05 	 {'research', 'epfl'}
0.04 	 {'epfl', 'lausann'}
0.04 	 {'epfl', 'vdtech'}
0.04 	 {'epfl', 'switzerland'}
0.04 	 {'robot', 'epfl'}
0.03 	 {'epfl', 'day'}
0.03 	 {'epfl', 'swiss'}
0.03 	 {'via', 'vdtech'}
0.03 	 {'epfl', 'via', 'vdtech'}
0.03 	 {'innov', 'epfl'}
0.03 	 {'epfl', 'scienc'}
0.03 	 {'epfl', 'student'}
0.03 	 {'epfl', 'first'}
0.03 	 {'epfl', 'work'}
0.02 	 {'epfl', 'technolog'}
0.02 	 {'learn', 'epfl'}


In [48]:
rules=generate_rules(L,support, min_confidence=0.1)
print_rules(rules,20)

confidence	 rule
------------------------------
1.00 	 {'scientist'} -> {'epfl'}
1.00 	 {'perovskit'} -> {'epfl'}
1.00 	 {'epflcampu'} -> {'epfl'}
1.00 	 {'technolog'} -> {'epfl'}
1.00 	 {'»'} -> {'epfl'}
1.00 	 {'»'} -> {'«'}
1.00 	 {'«'} -> {'»'}
1.00 	 {'«'} -> {'epfl'}
1.00 	 {'improv'} -> {'epfl'}
1.00 	 {'next'} -> {'epfl'}
1.00 	 {'present'} -> {'epfl'}
1.00 	 {'eth'} -> {'epfl'}
1.00 	 {'learn'} -> {'epfl'}
1.00 	 {'show'} -> {'epfl'}
1.00 	 {'drone'} -> {'epfl'}
1.00 	 {'model'} -> {'epfl'}
1.00 	 {'particip'} -> {'epfl'}
1.00 	 {'mooc'} -> {'epfl'}
1.00 	 {'brain'} -> {'epfl'}
1.00 	 {'»'} -> {'«', 'epfl'}


## Exercise 5.3: Pen and Paper

You are given the following accident and weather data. Each line corresponds to one event:

1. car_accident rain lightning wind clouds fire
2. fire clouds rain lightning wind
3. car_accident fire wind
4. clouds rain wind
5. lightning fire rain clouds  
6. clouds wind car_accident  
7. rain lightning clouds fire  
8. lightning fire car_accident

(a) You would like to know the most likely cause of the car accidents. Which association rules do you need to look at? Compute the support and confidence values for these rules. Based on these values, which is the most likely cause of the car accidents?

(b) Find all the association rules with a minimum support of 0.6 and a minimum confidence of 1.0 (certainty). Follow the Apriori algorithm.
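
Once you have worked through part (a) by hand, a short script along the following lines can be used to check your numbers. It is only a sketch: the helper names `supp` and `conf` are chosen here for illustration, and the events are copied from the list above.

```python
events = [
    {"car_accident", "rain", "lightning", "wind", "clouds", "fire"},
    {"fire", "clouds", "rain", "lightning", "wind"},
    {"car_accident", "fire", "wind"},
    {"clouds", "rain", "wind"},
    {"lightning", "fire", "rain", "clouds"},
    {"clouds", "wind", "car_accident"},
    {"rain", "lightning", "clouds", "fire"},
    {"lightning", "fire", "car_accident"},
]


def supp(itemset):
    """Fraction of events that contain every item of `itemset`."""
    return sum(set(itemset) <= e for e in events) / len(events)


def conf(X, Y):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    return supp(set(X) | set(Y)) / supp(X)


# Candidate causes: every item other than car_accident
causes = sorted(set().union(*events) - {"car_accident"})
for cause in causes:
    rule_supp = supp({cause, "car_accident"})
    rule_conf = conf({cause}, {"car_accident"})
    print(f"{{{cause}}} -> {{car_accident}}: support={rule_supp:.2f}, confidence={rule_conf:.2f}")
```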