# Finding frequent itemsets with the Apriori algorithm

In this notebook we will see a simple implementation of the Apriori algorithm using only Python data types (*list, dictionary, tuple*).

We assume that:
- the item identifiers are integers;
- the input file contains a set of baskets, one basket per line;
- the items within a basket are separated by commas (",").

An example is:
```text
1,2,3,4
1,2,5
4,6,8,10,22,16
```

The other required input is the **support** threshold, expressed as a fraction of the total number of baskets (since we do not know in advance how many baskets our dataset will contain).
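
To make the fractional support concrete, the sketch below converts it into the minimum number of baskets an itemset must appear in. The helper name `min_support_count` is hypothetical and is not used elsewhere in the notebook; it only illustrates the arithmetic.

```python
import math

def min_support_count(support_fraction, n_baskets):
    """Smallest number of baskets an itemset must appear in to be frequent."""
    # Counts are integers, so "count >= fraction * n" is the same as
    # "count >= ceil(fraction * n)".
    return math.ceil(support_fraction * n_baskets)

print(min_support_count(0.5, 3))  # 2
print(min_support_count(0.5, 8))  # 4
```

With a support of 0.5 and the three example baskets above, an item would need to appear in at least two baskets to be considered frequent.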


## Loading the data


We first define the function to load the data.

We put each line in a list, so the whole dataset will be a list of lists.

In [None]:
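def load_data(filename):
    """Read one basket per line and return a list of lists of integer item ids."""
    input_lines = []
    # The context manager closes the file even if parsing fails.
    with open(filename, 'r') as f:
        for line in f.read().splitlines():
            input_lines.append([int(x) for x in line.split(',')])
    return input_lines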

The input file containing the dataset is called "1-baskets.txt".

Run the following cell if you are using Colab and want to mount your Google Drive as the data repository.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

To read a file from the mounted drive, set the path as follows:
```python
input_file = "/content/drive/My Drive/..."
```

Let's load our dataset and print the first 5 lines (baskets).

In [None]:
#input_file = "[PATH]/1-baskets.txt"
input_file = "./1-baskets.txt"

dataset = load_data(input_file)

for elem in dataset[:5]:
    print(elem)

## First pass

In the first pass, we find the frequent single items.

We use a dictionary to keep track of the items and their counts: this is the complete set of candidate items, called **C1**. We then filter the dictionary to remove the items that are not sufficiently frequent, which gives the frequent items **L1**.

In [None]:
def createL1(dataset, support):
    freq_items = {}
    for basket in dataset:
        for item in basket:
            if item in freq_items:
                freq_items[item] += 1
            else:
                freq_items[item] = 1

    # remove non-frequent items; len(dataset) returns the number of baskets
    min_count = support*len(dataset)
    delete = [item for item in freq_items if freq_items[item] < min_count]
    for item in delete:
        del freq_items[item]
    
    return freq_items
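
As a point of comparison, the same counting and filtering can be sketched with `collections.Counter` from the standard library. The function name `create_l1_counter` and the `sample` baskets are hypothetical, introduced only for this illustration; the notebook keeps using `createL1`.

```python
from collections import Counter

def create_l1_counter(dataset, support):
    """Count every item (C1), then keep only the frequent ones (L1)."""
    counts = Counter(item for basket in dataset for item in basket)
    min_count = support * len(dataset)
    return {item: n for item, n in counts.items() if n >= min_count}

# Hypothetical toy dataset: item 1 occurs 3 times, item 2 twice, 3 and 4 once.
sample = [[1, 2, 3], [1, 2], [1, 4]]
print(create_l1_counter(sample, 0.5))  # {1: 3, 2: 2}
```

With three baskets and a support of 0.5, the threshold is 1.5, so only items 1 and 2 survive.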

We are now ready to obtain the frequent items: we show the dictionary (i.e., the count for each item), and then we use only the keys and put them in a list.

In [None]:
support = 0.5
freq_item = createL1(dataset, support)
print("The frequent items with their counts are:\n", freq_item)

# sorted() accepts the dictionary directly and returns its keys as a sorted list
freq_item_keys = sorted(freq_item)
print("\n\nThe frequent items (L1) are:", freq_item_keys)


## Naive algorithm 

Suppose we skip the first pass and look for the frequent pairs directly, in a single pass. Let's compute how much memory this requires (we will compare the value with the second pass of the Apriori algorithm).

In order to count how many times a pair occurs, we use a dictionary whose key is the pair. Since dictionary keys must be hashable, which rules out lists, each pair is represented as a **tuple**.
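
A minimal sketch of why the tuple key should be built from a sorted basket: without a canonical order, the same two items could be counted under two different keys. The `baskets` below are hypothetical.

```python
pair_counts = {}

# Hypothetical baskets listing the same two items in different orders.
baskets = [[2, 1], [1, 2]]

for basket in baskets:
    items = sorted(basket)        # canonical order: smaller identifier first
    pair = (items[0], items[1])   # tuples are hashable, lists are not
    pair_counts[pair] = pair_counts.get(pair, 0) + 1

print(pair_counts)  # {(1, 2): 2}
```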

In [None]:
import sys  # used in computing the byte size

def createL2_naive(dataset, support):
    freq_itemset = {}
    for basket in dataset:
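        # sorted() returns a new list, so keep the result; a canonical order
        # makes (1, 2) and (2, 1) map to the same key
        basket = sorted(basket)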
        len_basket = len(basket)
        for i in range(len_basket):
            for j in range(i+1, len_basket):
                pair_tuple = (basket[i],basket[j])
                if pair_tuple in freq_itemset:
                    freq_itemset[pair_tuple] +=1
                else:
                    freq_itemset[pair_tuple] =1
    # At this point, freq_itemset contains all the pairs:
    # this is the maximum space used by this method.
    # Note: sys.getsizeof is shallow, it measures the dictionary's own
    # table and not the key tuples or the counts it refers to.
    byte_size = sys.getsizeof(freq_itemset)

    # remove non-frequent pairs; len(dataset) returns the number of baskets
    min_count = support*len(dataset)
    delete = [pair for pair in freq_itemset if freq_itemset[pair] < min_count]
    for pair in delete:
        del freq_itemset[pair]
    
    return freq_itemset, byte_size

We can obtain the frequent pairs (we show the dictionary, i.e., the count for each pair), along with the maximum memory used.

In [None]:
freq_pair_naive, naive_size = createL2_naive(dataset, support)
print("The frequent pairs with their counts are:\n", freq_pair_naive)
print("The memory used with the naive approach is (bytes): ", naive_size)
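
Because `sys.getsizeof` only reports the size of the dictionary object itself, the figure above understates the real footprint. The sketch below adds the sizes of the key tuples and the counts; the helper name `approx_dict_size` is hypothetical, and the result is still an approximation (CPython shares small integers between objects, and the integers stored inside each tuple are not counted).

```python
import sys

def approx_dict_size(pair_counts):
    """Rough size in bytes: the dict itself plus every key tuple and count."""
    total = sys.getsizeof(pair_counts)
    for pair, count in pair_counts.items():
        total += sys.getsizeof(pair) + sys.getsizeof(count)
    return total

# Hypothetical dictionary of pair counts.
example = {(1, 2): 3, (1, 4): 2}
print(approx_dict_size(example) > sys.getsizeof(example))  # True
```

Either measure can be used for the comparison with the second pass, as long as the same one is applied to both dictionaries.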


## Second pass

In the second pass, for each basket, we first remove the non-frequent items, then we build all the pairs. In addition to the dataset, we need to pass the L1 list (a list of item identifiers, without their counts).

In order to count how many times a pair occurs, we use a dictionary whose key is the pair. Since dictionary keys must be hashable, which rules out lists, each pair is represented as a **tuple**.

With this pass, we directly compute **L2**.
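
The per-basket step can be sketched compactly with `itertools.combinations`. This is an illustrative alternative under the assumption that L1 is available as a set; the helper name `candidate_pairs` and the sample values are hypothetical.

```python
from itertools import combinations

def candidate_pairs(basket, frequent_items):
    """Drop the non-frequent items, sort what is left, and build all pairs."""
    kept = sorted(item for item in basket if item in frequent_items)
    return list(combinations(kept, 2))

# Hypothetical L1, stored as a set for fast membership tests.
frequent = {1, 2, 4}
print(candidate_pairs([4, 7, 1, 2], frequent))  # [(1, 2), (1, 4), (2, 4)]
```

Item 7 is not frequent, so no pair containing it is ever created, which is where the memory saving over the naive approach comes from.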

In [None]:
def createL2(dataset, L1, support):
    freq_itemset = {}
    for basket in dataset:
        # remove non freq. items
        filtered_basket = []
        for item in basket:
            if item in L1:
                filtered_basket.append(item)
        # generate couples
        # sort in place so that each pair is always built in the same order
        filtered_basket.sort()
        len_basket = len(filtered_basket)
        for i in range(len_basket):
            for j in range(i+1, len_basket):
                pair_tuple = (filtered_basket[i],filtered_basket[j])
                if pair_tuple in freq_itemset:
                    freq_itemset[pair_tuple] +=1
                else:
                    freq_itemset[pair_tuple] =1
    # freq_itemset contains all the pairs built from frequent items:
    # this is the maximum space used by this method
    # (sys.getsizeof is shallow: keys and counts are not included)
    byte_size = sys.getsizeof(freq_itemset)

    # remove non-frequent pairs;
    # len(dataset) returns the number of baskets
    min_count = support*len(dataset)
    delete = [pair for pair in freq_itemset if freq_itemset[pair] < min_count]
    for pair in delete:
        del freq_itemset[pair]
    
    return freq_itemset, byte_size

We are now ready to obtain the frequent pairs (we show the dictionary, i.e., the count for each pair).

In [None]:
freq_pair, apriori_size = createL2(dataset, freq_item_keys, support)
print("The frequent pairs with their counts are:\n", freq_pair)
print("The memory used with the Apriori approach is (bytes): ", apriori_size)


## Third pass

Now it is possible to find the frequent triples, which is left as an exercise.

Candidate triples (**C3**) can be built using L1 and L2 as input (only the keys, not the whole dictionary).

```python
def createL3(dataset, L2, L1, support):
    freq_itemset = {}
    for basket in dataset:
        # remove non freq. items
        filtered_basket = []
        for item in basket:
            if item in L1:
                filtered_basket.append(item)
        # generate triples, but only if the 
        # possible couples are frequent 
        ...         
    # remove non frequent itemset; 
    ...
    return freq_itemset
```
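
As a hint for the pruning step mentioned in the skeleton, the check "all the pairs inside a triple are frequent" can be sketched as follows. The helper name `all_pairs_frequent` and the sample pairs are hypothetical, and the sketch assumes that the triple is sorted so that its pairs match the sorted tuples used as keys in L2.

```python
from itertools import combinations

def all_pairs_frequent(triple, frequent_pairs):
    """True if each of the three pairs contained in the triple is in L2."""
    return all(pair in frequent_pairs for pair in combinations(triple, 2))

# Hypothetical keys of L2.
frequent_pairs = {(1, 2), (1, 3), (2, 3), (1, 4)}
print(all_pairs_frequent((1, 2, 3), frequent_pairs))  # True
print(all_pairs_frequent((1, 2, 4), frequent_pairs))  # False: (2, 4) is missing
```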

### Question  Q1
<div class="alert alert-info">
Using the skeleton provided above, implement the function createL3().
</div>

In [None]:
# your code here

### Question  Q2
<div class="alert alert-info">
Find the frequent triples and the amount of memory used to compute them.
</div>

In [None]:
# your code here