# Algorithms for Data Science -- Laboratory 1
Author: Pablo Mollá Chárlez

## Finding Frequent Itemsets

The objective of this lab is to implement and analyze the A-Priori algorithm for mining frequent itemsets and their associated rules. This lab requires Python and Jupyter, along with the NumPy and tqdm packages.

1. We first load the required libraries and the file containing the baskets, then keep only the baskets with more than 10 items to obtain a smaller dataset. The support threshold $s$ is set in step 3.

In [19]:
import sys
import numpy as np
import urllib.request
import itertools
import tqdm

file_location = 'https://phparis.net/slides/algo_4_ds/week1/groceries.csv' #you can change this to a local file on your computer

#creating in-memory data structure
infile = urllib.request.urlopen(file_location)
baskets = []
for line in infile:
  line = str(line).strip().split(',')[1:-1]
  baskets.append([x for x in line if x!=''])
print("Number of baskets: %d"%len(baskets))
print("Set of Baskets:", baskets)
print('Basket 1:',baskets[0])

print(max([len(x) for x in baskets]))

# Smaller version: keep only the baskets with more than 10 items
baskets = [x for x in baskets if len(x) > 10]
len(baskets)

Number of baskets: 9835
Set of Baskets: [['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups'], ['tropical fruit', 'yogurt', 'coffee'], ['whole milk'], ['pip fruit', 'yogurt', 'cream cheese', 'meat spreads'], ['other vegetables', 'whole milk', 'condensed milk', 'long life bakery product'], ['whole milk', 'butter', 'yogurt', 'rice', 'abrasive cleaner'], ['rolls/buns'], ['other vegetables', 'UHT-milk', 'rolls/buns', 'bottled beer', 'liquor (appetizer)'], ['potted plants'], ['whole milk', 'cereals'], ['tropical fruit', 'other vegetables', 'white bread', 'bottled water', 'chocolate'], ['citrus fruit', 'tropical fruit', 'whole milk', 'butter', 'curd', 'yogurt', 'flour', 'bottled water', 'dishes'], ['beef'], ['frankfurter', 'rolls/buns', 'soda'], ['chicken', 'tropical fruit'], ['butter', 'sugar', 'fruit/vegetable juice', 'newspapers'], ['fruit/vegetable juice'], ['packaged fruit/vegetables'], ['chocolate'], ['specialty bar'], ['other vegetables'], ['butter milk', 'pastry'], ['w

650
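
The loading loop above calls `str(line)` on a bytes object, which yields the `b'...\n'` representation; the `[1:-1]` slice then discards the first and last comma-separated fields, which carry that prefix and suffix. As a hedged alternative, the sketch below decodes each line explicitly. The helper name `parse_basket_line`, its `skip_leading` parameter, and the sample lines are assumptions made for illustration; the actual column layout of `groceries.csv` should be checked before using it.

```python
def parse_basket_line(raw_line, skip_leading=1):
    """Turn one raw CSV line (bytes or str) into a list of item names.

    skip_leading is an assumption: the number of leading columns that
    are not items (for example a per-basket item count).
    """
    if isinstance(raw_line, bytes):
        raw_line = raw_line.decode("utf-8")  # decode instead of str(bytes)
    fields = raw_line.strip().split(",")[skip_leading:]
    # Drop empty padding fields and surrounding whitespace
    return [field.strip() for field in fields if field.strip()]


# Made-up sample lines for illustration, not taken from groceries.csv
sample_lines = [b"3,whole milk,butter,yogurt,,\n", b"1,rolls/buns,,,\n"]
sample_baskets = [parse_basket_line(line) for line in sample_lines]
print(sample_baskets)
```

Decoding first means no field ever contains the `b'` prefix or an escaped newline, so nothing has to be sliced off the end of the line.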

2. Next, we count how many times each item occurs in the dataset (Pass 1). These counts are used in the subsequent steps. We keep three data structures: _items_, a list keeping the original names of the items; _item_id_, a dictionary mapping each item name to its index in _items_; and _count_, a list which keeps, for each unique item, the number of times it has been encountered.

In [20]:
items = []
item_id = {}
count = []
idx = 0
for b in baskets: #pass over the file
  for i in b:
    if i in item_id:
      count[item_id[i]] = count[item_id[i]]+1
    else:
      count.append(1) #add a new count of 1
      items.append(i) #add the string of items
      item_id[i] = idx
      idx += 1

print("Number of unique items: %d"%len(items))
print("Unique items and their counts:")
print("Items:", items)
print("Number of Items:", count)
print("Dict:", item_id)


Number of unique items: 164
Unique items and their counts:
Items: ['tropical fruit', 'root vegetables', 'other vegetables', 'frozen dessert', 'rolls/buns', 'flour', 'sweet spreads', 'salty snack', 'waffles', 'candy', 'bathroom cleaner', 'whole milk', 'yogurt', 'domestic eggs', 'brown bread', 'pastry', 'sugar', 'cereals', 'coffee', 'soda', 'frankfurter', 'sausage', 'citrus fruit', 'pip fruit', 'hard cheese', 'cream cheese', 'rice', 'canned fruit', 'misc. beverages', 'fruit/vegetable juice', 'hygiene articles', 'ham', 'butter', 'curd', 'whipped/sour cream', 'soft cheese', 'sliced cheese', 'frozen meals', 'frozen fish', 'white bread', 'bottled beer', 'beef', 'butter milk', 'beverages', 'dish cleaner', 'cookware', 'meat', 'dessert', 'tea', 'bottled water', 'salt', 'canned vegetables', 'canned fish', 'skin care', 'napkins', 'hamburger meat', 'grapes', 'potato products', 'chocolate', 'newspapers', 'onions', 'soups', 'margarine', 'baking powder', 'semi-finished bread', 'chicken', 'pasta', 'mu
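
Pass 1 can also be sketched with `collections.Counter` from the standard library. The example below is a minimal illustration on the eight toy baskets that also appear in the commented example of step 4; the names `toy_baskets`, `s_toy`, and `frequent_items` are hypothetical and not part of the lab code.

```python
from collections import Counter

# Toy baskets (same as the commented example in step 4)
toy_baskets = [
    ["m", "c", "b"], ["m", "p", "j"], ["m", "b"], ["c", "j"],
    ["m", "p", "b"], ["m", "c", "b", "j"], ["c", "b", "j"], ["b", "j"],
]

# Pass 1: count, for each item, the number of baskets that contain it.
# set(basket) makes sure a repeated item is counted once per basket.
item_counts = Counter(item for basket in toy_baskets for item in set(basket))

s_toy = 3  # hypothetical support threshold for the toy data
frequent_items = sorted(item for item, n in item_counts.items() if n >= s_toy)

print(dict(item_counts))
print(frequent_items)
```

Unlike the cell above, this version keys the counts by item name rather than by integer id, which is simpler to read but uses more memory on large datasets.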

3. Now we count frequent pairs (having support at least _s_) in a single pass over the baskets (Pass 2). For each basket, we increment the count of a pair only if both of its items are themselves frequent, i.e. each has a value of at least _s_ in _count_. A pair that fails this test cannot be frequent, so it is never stored.


In [21]:
#defining a support threshold
s = 5

counts_pair = {} #hash table, where a pair is the key

pairs_counted = 0
all_pairs = 0

for b in baskets:
  b_int = sorted([item_id[x] for x in b]) #use ids and sort
  for pair in itertools.combinations(b_int,2):
    all_pairs += 1
    if count[pair[0]]>=s and count[pair[1]]>=s: #only care for candidate pairs
      pairs_counted += 1
      if pair in counts_pair:
        counts_pair[pair] = counts_pair[pair]+1
      else:
        counts_pair[pair] = 1

pairs_list = []

for pair in counts_pair.keys():
  if counts_pair[pair]>=s: pairs_list.append(pair)

print ("All pairs %d, counted pairs: %d"%(all_pairs,pairs_counted))
print ("Frequent pairs: %d (s=%d)"%(len(pairs_list),s))
for pair in pairs_list:
  print ("\t %s, %s"%(items[pair[0]],items[pair[1]]))


    

All pairs 60073, counted pairs: 59465
Frequent pairs: 3061 (s=5)
	 tropical fruit, root vegetables
	 tropical fruit, other vegetables
	 tropical fruit, frozen dessert
	 tropical fruit, rolls/buns
	 tropical fruit, flour
	 tropical fruit, sweet spreads
	 tropical fruit, salty snack
	 tropical fruit, waffles
	 tropical fruit, candy
	 root vegetables, other vegetables
	 root vegetables, frozen dessert
	 root vegetables, rolls/buns
	 root vegetables, flour
	 root vegetables, sweet spreads
	 root vegetables, salty snack
	 root vegetables, waffles
	 root vegetables, candy
	 root vegetables, bathroom cleaner
	 other vegetables, frozen dessert
	 other vegetables, rolls/buns
	 other vegetables, flour
	 other vegetables, sweet spreads
	 other vegetables, salty snack
	 other vegetables, waffles
	 other vegetables, candy
	 other vegetables, bathroom cleaner
	 frozen dessert, rolls/buns
	 frozen dessert, candy
	 rolls/buns, flour
	 rolls/buns, sweet spreads
	 rolls/buns, salty snack
	 rolls/buns, w
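
To make the two passes easy to check by hand, here is a small sketch of the same idea on the eight toy baskets from the commented example in step 4. It is not the lab's implementation: the names are hypothetical, and it filters each basket down to its frequent items before enumerating pairs, instead of testing every pair afterwards.

```python
import itertools
from collections import Counter

toy_baskets = [
    ["m", "c", "b"], ["m", "p", "j"], ["m", "b"], ["c", "j"],
    ["m", "p", "b"], ["m", "c", "b", "j"], ["c", "b", "j"], ["b", "j"],
]
s_toy = 3  # hypothetical support threshold

# Pass 1: frequent single items
item_counts = Counter(item for basket in toy_baskets for item in set(basket))
frequent_items = {item for item, n in item_counts.items() if n >= s_toy}

# Pass 2: count only pairs made of two frequent items
pair_counts = Counter()
for basket in toy_baskets:
    kept = sorted(set(basket) & frequent_items)  # drop infrequent items first
    pair_counts.update(itertools.combinations(kept, 2))

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= s_toy}
print(frequent_pairs)
```

With `s_toy = 3`, item `p` is discarded in Pass 1, so no pair containing `p` is ever generated or stored in Pass 2.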

4. Implement the A-Priori algorithm to count __all__ the frequent itemsets. Use the implementations in steps 2 and 3 as a base to compute the $C_k$ and $L_k$ sets (refer to the lecture notes for indications on how to implement them). Show all the frequent itemsets that are not pairs or singletons. Play with the parameters and explain how the number of frequent itemsets changes.

In [31]:
# YOUR CODE HERE

# Small example to check and understand the algorithm
#baskets = [['m','c','b'], ['m','p','j'], ['m','b'], ['c','j'], ['m','p','b'], ['m','c','b','j'], ['c','b','j'], ['b','j']]
#count = [5, 4, 6, 2, 5]
#item_id = {'m': 0, 'c': 1, 'b': 2, 'p': 3, 'j':4}
#items = ['m','c','b','p','j']

# Auxiliary Function
def items_count_itemsid_calculator(baskets):
  items = []
  item_id = {}
  count = []
  idx = 0
  for b in baskets:
    for i in b:
      if i in item_id:
        count[item_id[i]] = count[item_id[i]]+1
      else:
        count.append(1) #add a new count of 1
        items.append(i) #add the string of items
        item_id[i] = idx
        idx += 1
  return items, count, item_id
  
# Counts every k-combination whose items are all individually frequent and returns
# the truly frequent k-itemsets (L_k) with their supports. This filter is looser than
# the full A-Priori candidate set C_k, which also requires every (k-1)-subset to be frequent.
def frequent_itemsets(baskets, k, s):
  # Computing items, counts and items_id dict
  items, count, item_id = items_count_itemsid_calculator(baskets)

  # Hash table, where a combination is the key
  counts_combination = {}

  for basket in baskets:
    basket_ids = sorted([item_id[item] for item in basket]) # Use ids and sort
    # Combinations of k items
    for combination in itertools.combinations(basket_ids, k):
      # Only count combinations whose items are all frequent on their own
      if all(count[index] >= s for index in combination):
        counts_combination[combination] = counts_combination.get(combination, 0) + 1

  # Keep the combinations that reach the support threshold
  counts_combination_filtered = {key: value for key, value in counts_combination.items() if value >= s}

  # If you want the original item names instead of indices
  #truly_frequent_itemsets = [[items[index] for index in combination] for combination in counts_combination_filtered]

  return counts_combination_filtered

# k=3 and s=2
frequent_itemsets_3_2 = frequent_itemsets(baskets, 3, 2)
print(frequent_itemsets_3_2)

# Auxiliary Function
def support_calculator(baskets, itemset_ids, k, s):
  sorted_itemset_ids = tuple(sorted(itemset_ids))
  # Compute the frequent k-itemsets once and look the itemset up
  # (returns None when the itemset is not frequent)
  return frequent_itemsets(baskets, k, s).get(sorted_itemset_ids)

#print(support_calculator(baskets, (0, 3), 2, 3))


{(0, 1, 2): 70, (0, 1, 3): 7, (0, 1, 4): 31, (0, 1, 5): 6, (0, 1, 6): 3, (0, 1, 7): 10, (0, 1, 8): 11, (0, 1, 9): 8, (0, 1, 10): 3, (0, 2, 3): 11, (0, 2, 4): 44, (0, 2, 5): 15, (0, 2, 6): 3, (0, 2, 7): 22, (0, 2, 8): 17, (0, 2, 9): 14, (0, 2, 10): 3, (0, 3, 4): 4, (0, 3, 5): 2, (0, 3, 8): 3, (0, 3, 9): 2, (0, 4, 5): 7, (0, 4, 6): 3, (0, 4, 7): 5, (0, 4, 8): 9, (0, 4, 9): 11, (0, 5, 7): 4, (0, 5, 8): 3, (0, 5, 9): 2, (0, 6, 7): 2, (0, 6, 8): 2, (0, 7, 8): 4, (0, 7, 9): 2, (0, 8, 9): 7, (0, 8, 10): 2, (1, 2, 3): 12, (1, 2, 4): 59, (1, 2, 5): 14, (1, 2, 6): 7, (1, 2, 7): 17, (1, 2, 8): 21, (1, 2, 9): 16, (1, 2, 10): 5, (1, 3, 4): 4, (1, 3, 5): 2, (1, 3, 8): 2, (1, 4, 5): 8, (1, 4, 6): 4, (1, 4, 7): 6, (1, 4, 8): 14, (1, 4, 9): 9, (1, 4, 10): 2, (1, 5, 7): 3, (1, 5, 8): 5, (1, 5, 9): 2, (1, 6, 7): 3, (1, 6, 8): 2, (1, 7, 8): 3, (1, 7, 9): 2, (1, 8, 9): 6, (1, 8, 10): 3, (1, 9, 10): 2, (2, 3, 4): 7, (2, 3, 5): 2, (2, 3, 7): 2, (2, 3, 8): 3, (2, 3, 9): 3, (2, 4, 5): 11, (2, 4, 6): 4, (2, 4, 
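
The cell above filters $k$-combinations by their single items only. For comparison, the following is a minimal level-wise sketch in which each candidate set $C_k$ is built from $L_{k-1}$ and a candidate is kept only if all of its $(k-1)$-subsets are frequent. The function name `apriori_levels` and the toy data are assumptions for illustration; candidate generation here simply enumerates combinations of the items still present in $L_{k-1}$, which is easy to read but not tuned for large datasets.

```python
import itertools
from collections import Counter

def apriori_levels(baskets, s):
    """Return {k: {itemset (sorted tuple): support}} for each non-empty L_k."""
    basket_sets = [frozenset(b) for b in baskets]
    counts = Counter(item for b in basket_sets for item in b)
    current = {(item,): n for item, n in counts.items() if n >= s}  # L_1
    levels = {}
    k = 1
    while current:
        levels[k] = current
        k += 1
        # C_k: k-combinations of surviving items whose (k-1)-subsets are all in L_{k-1}
        items = sorted({item for itemset in current for item in itemset})
        candidates = [
            cand for cand in itertools.combinations(items, k)
            if all(sub in current for sub in itertools.combinations(cand, k - 1))
        ]
        # One pass over the baskets to count the candidates
        cand_counts = Counter()
        for b in basket_sets:
            for cand in candidates:
                if b.issuperset(cand):
                    cand_counts[cand] += 1
        current = {cand: n for cand, n in cand_counts.items() if n >= s}  # L_k
    return levels

toy_baskets = [
    ["m", "c", "b"], ["m", "p", "j"], ["m", "b"], ["c", "j"],
    ["m", "p", "b"], ["m", "c", "b", "j"], ["c", "b", "j"], ["b", "j"],
]
levels_s3 = apriori_levels(toy_baskets, 3)
levels_s2 = apriori_levels(toy_baskets, 2)
print(sorted(levels_s3))   # sizes k that have frequent itemsets at s = 3
print(levels_s2.get(3))    # frequent triples at s = 2
```

On this toy data, lowering the threshold from 3 to 2 makes frequent triples appear, which illustrates how the number of frequent itemsets grows as $s$ decreases.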

5. __TASK__ Generate all the association rules of support at least $s$ and confidence at least $c$ (parameter to be set below).
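
For reference, assuming the usual definitions, let $\mathrm{supp}(I)$ denote the number of baskets that contain every item of the itemset $I$. The confidence of a rule $I \to J$, with $I$ and $J$ disjoint and non-empty, is then

$$\mathrm{conf}(I \to J) = \frac{\mathrm{supp}(I \cup J)}{\mathrm{supp}(I)}$$

A rule is reported when $\mathrm{supp}(I \cup J) \geq s$ and $\mathrm{conf}(I \to J) \geq c$. The notation $\mathrm{supp}$ and $\mathrm{conf}$ is introduced here for convenience; it corresponds to the ratio computed in the code cell of this step.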

In [29]:
c = 0.75

# YOUR CODE HERE

# frequent_itemset_ids must be a frequent itemset (a tuple of item ids); otherwise the rules are meaningless
def association_rules_calculator(baskets, frequent_itemset_ids, k, s, c):

  # Variables
  truly_association_rules = []

  # Back to numbers
  items, count, item_id = items_count_itemsid_calculator(baskets)
  
  # Support of the whole itemset, computed once outside the loops
  itemset_support = support_calculator(baskets, frequent_itemset_ids, len(frequent_itemset_ids), s)

  # Computing supports & confidence, using every non-empty proper subset as antecedent
  for i in range(1,len(frequent_itemset_ids)):
      for combination in itertools.combinations(frequent_itemset_ids, i):
          items_left = tuple([x for x in frequent_itemset_ids if x not in combination])
          #print(frequent_itemset_ids, '|', combination, ' -> ', items_left)
          confidence = itemset_support / support_calculator(baskets, combination, len(combination), s)

          if confidence >= c:
            aux1 = []
            for item_index in combination:
              aux1.append(items[item_index])
            aux2 = []
            for item_index in items_left:
              aux2.append(items[item_index])
            truly_association_rules.append(f' {aux1} -> {aux2} ')
  
  return truly_association_rules

#print(association_rules_calculator(baskets, (2,4), 2, 3, c))

def apriori_algorithm(baskets, k, s, c):
  association_rules_list = []
  frequent_itemset_list = frequent_itemsets(baskets, k, s)
  for itemset in tqdm.tqdm(frequent_itemset_list):
    #print("Computing Association Rules")
    association_rules_list.append(association_rules_calculator(baskets, itemset, k, s, c))
  
  # Printing Information: Frequent Itemsets and Association Rules
  print(f'Frequent Itemsets of size {k} with support {s}:')
  for item in frequent_itemset_list:
    print('   ', item)

  print(f'\nAssociation Rules with confidence c = {c}:')
  for rule in association_rules_list:
    if rule != []:
      print('   ', rule)

apriori_algorithm(baskets, 3, 30, 0.70)

100%|██████████| 227/227 [02:22<00:00,  1.59it/s]

Frequent Itemsets of size 3 with support 30:
    (0, 1, 2)
    (0, 1, 4)
    (0, 2, 4)
    (1, 2, 4)
    (0, 1, 11)
    (0, 1, 12)
    (0, 11, 12)
    (0, 11, 13)
    (0, 11, 15)
    (0, 11, 19)
    (1, 11, 12)
    (1, 11, 13)
    (1, 11, 19)
    (11, 12, 13)
    (11, 12, 14)
    (11, 12, 15)
    (11, 12, 19)
    (11, 13, 19)
    (11, 15, 19)
    (0, 11, 21)
    (0, 11, 22)
    (0, 11, 23)
    (0, 11, 29)
    (11, 13, 22)
    (11, 21, 22)
    (11, 21, 23)
    (11, 23, 29)
    (0, 1, 23)
    (0, 2, 11)
    (0, 2, 12)
    (0, 2, 13)
    (0, 2, 23)
    (0, 2, 29)
    (0, 2, 32)
    (0, 2, 33)
    (0, 2, 34)
    (0, 11, 32)
    (0, 11, 33)
    (0, 11, 34)
    (0, 11, 39)
    (0, 12, 23)
    (0, 12, 29)
    (0, 12, 32)
    (0, 12, 34)
    (1, 2, 11)
    (1, 2, 12)
    (1, 2, 13)
    (1, 2, 23)
    (1, 2, 29)
    (1, 2, 32)
    (1, 2, 33)
    (1, 2, 34)
    (1, 2, 36)
    (1, 11, 20)
    (1, 11, 23)
    (1, 11, 29)
    (1, 11, 32)
    (1, 11, 33)
    (1, 11, 34)
    (1, 12, 23)
    (1, 12, 3
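
The run above takes over two minutes largely because supports are recomputed from the baskets for every rule. One possible alternative, sketched below under the assumption that the supports of all frequent itemsets of every size have already been collected in a single dictionary, is to derive the rules by lookup only. The function `rules_from_supports` and the dictionary `toy_supports` are hypothetical; the supports listed are those of the eight toy baskets from the commented example in step 4 at threshold 3.

```python
import itertools

def rules_from_supports(supports, c):
    """supports maps each frequent itemset (sorted tuple) to its support.

    Returns a list of (antecedent, consequent, confidence) with confidence >= c.
    """
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in itertools.combinations(itemset, r):
                consequent = tuple(x for x in itemset if x not in antecedent)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds
                confidence = supp / supports[antecedent]
                if confidence >= c:
                    rules.append((antecedent, consequent, confidence))
    return rules

# Supports of the toy baskets at threshold 3 (singletons and pairs)
toy_supports = {
    ("b",): 6, ("c",): 4, ("j",): 5, ("m",): 5,
    ("b", "c"): 3, ("b", "j"): 3, ("b", "m"): 4, ("c", "j"): 3,
}
toy_rules = rules_from_supports(toy_supports, c=0.75)
for antecedent, consequent, confidence in toy_rules:
    print(antecedent, "->", consequent, round(confidence, 2))
```

Because antecedents are generated with `itertools.combinations` from a sorted tuple, they are themselves sorted tuples and can be used directly as dictionary keys.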


