# Notebook 8: Representing Itemsets
***

In this notebook we'll have some practice representing itemset data as a matrix, as a triangular array, and as a list of triples. We will also take a look at how we can use a hash function and hash table to represent items in our inventory as we are discovering them, similar to how we would learn new items in the inventory as customers make new purchases.

We'll need some nice packages for this notebook, so let's load them.

In [3]:
import numpy as np 
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

<br>

### Exercise 1:  A Most Triangular of Matrices

Continuing to work with the PDQ data from last time, let's load up the inventory and the 8 baskets' worth of buying history.

In [4]:
inventory = ["apple", "banana", "candy", "fancy feast", "grape soda", "ice cream"]

baskets = {0 : set(["apple", "banana", "candy", "fancy feast"]),
           1 : set(["apple", "banana", "grape soda"]),
           2 : set(["banana", "ice cream"]),
           3 : set(["apple", "candy", "ice cream"]),
           4 : set(["apple", "fancy feast", "banana", "ice cream"]),
           5 : set(["apple", "banana", "candy", "ice cream"]),
           6 : set(["candy", "ice cream", "banana"]),
           7 : set(["banana", "fancy feast", "ice cream"])}

Write a `for` loop (or maybe some nested loops) to create and fill in an upper-triangular matrix to represent the pair itemset counts for this data set. One example is done for you.

In [15]:
n = len(inventory)
U = np.zeros((n, n))

# what items are we counting up?
item1 = "banana"
item2 = "fancy feast"
irow = inventory.index(item1)
icol = inventory.index(item2)

# compute and fill in the (banana, fancy feast) element
count = np.sum([set([item1, item2]) <= baskets[k] for k in range(len(baskets))])
U[irow,icol] = count

# fill in the rest of U here!
# TODO
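
The worked example above relies on `<=` between two Python sets being the subset test. As a minimal sketch (using only two of the baskets listed above, with the variable names `pair`, `basket_0`, and `basket_2` chosen here for illustration), the check behaves like this:

```python
# Toy check of the subset operator on two of the baskets listed above.
pair = {"banana", "fancy feast"}
basket_0 = {"apple", "banana", "candy", "fancy feast"}
basket_2 = {"banana", "ice cream"}

contains_0 = pair <= basket_0  # True: both items are in basket 0
contains_2 = pair <= basket_2  # False: "fancy feast" is missing from basket 2
print(contains_0, contains_2)

# Booleans count as 1 and 0 when summed, which is what the np.sum call relies on.
pair_count = sum([contains_0, contains_2])
print(pair_count)
```

Summing the booleans over all baskets, rather than just these two, gives the pair's count.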

### Exercise 2: Pruning the Nonsense

The triangular matrix is good and all, but there are so many 0s! Necessarily more than half of the elements of the array are things we don't need to store (0s below the main diagonal, and on the diagonal). So we are motivated to try to trim down this wasteful representation of our data.

Instead, let's represent it as a *triangular array*. This is a 1-dimensional object with clever indexing to store the elements in row $i$ and column $j$ of the upper-triangular matrix:

$a[k]$ = count for the pair $(i,j)$, where $0 \leq i < j \leq n-1$ ($n$ is the number of items), and 
$k = i \cdot \left(n - \frac{i+1}{2}\right) + j - i -1,$

where the above equation has been modified from the presentation in the lecture slides to fit with Python's 0-based indexing. Note that this is stored in a row-major fashion, so we know what to expect from seeing the full upper-triangular counts matrix above.
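
One way to convince yourself the formula is right is to try it on a small case. The sketch below assumes a hypothetical helper `tri_index` and an illustrative size `n_items = 4` (neither is part of the notebook), enumerates the pairs in row-major order, and prints the index the formula assigns to each:

```python
from itertools import combinations

def tri_index(i, j, n):
    """0-based position of the pair (i, j), with i < j, in the triangular array."""
    # i * (n - (i + 1) / 2) in integer arithmetic: i * (2n - i - 1) is always even
    return i * (2 * n - i - 1) // 2 + j - i - 1

n_items = 4  # small illustrative value, not the PDQ inventory size
pairs = list(combinations(range(n_items), 2))  # row-major: (0, 1), (0, 2), ...
indices = [tri_index(i, j, n_items) for i, j in pairs]
print(pairs)
print(indices)
```

If the formula matches the row-major layout, the printed indices should simply count up from 0 with no gaps or repeats.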

#### Exercise 2.5: A quick pit-stop

If we use a triangular matrix to count pairs, and $n$, the number of items, is 20, what pair’s count is in $a[100]$? Assume indexing begins at 1.

**Solution:** 


#### Back to the triangular array times!

Now then. Store the matrix $U$ from Exercise 1 as a triangular array. Check a few elements to make sure things are working properly. First, it might be useful to consider how many elements we expect in the resulting triangular array. The longest row stores $n-1$ elements and the shortest stores just 1... maybe there is a fond memory from Calculus 1 or Discrete that could be useful here...?

In [None]:
# SOLUTION:

# need to sum up 1+2+...+(n-1).
nt = ###Your code Here
print("We expect {:0.0f} elements".format(nt))

In [28]:
# SOLUTION:

# first, defining a helper function to get the Triangular Array Indices
def tai(i,j,n):
    k = ###Your code here
    return int(k)

# now nested for loops over the triangular matrix
a = [0]*nt
k = 0
for i in range(n):
    ###Your code here

In [None]:
print(U)
print(a)  # should be the upper-triangular entries of U!

### Exercise 3: Array Triple Threat

Another handy representation of our itemset counts data is as an array of triples $(i,j,c)$, where $c$ is the count for the pair $(i,j)$. Try to code up the itemset count data as an array of triples (indexed starting at 0). *Hint: go green! and recycle almost all of the code from the triangular array above.*
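
To make the target format concrete, here is a minimal sketch on a made-up 3-item counts matrix. The names `U_toy`, `n_toy`, and `toy_trips` and the numbers themselves are hypothetical, chosen only to show the shape of the output:

```python
# Hypothetical 3-item upper-triangular counts matrix (made-up numbers).
U_toy = [[0, 2, 0],
         [0, 0, 5],
         [0, 0, 0]]
n_toy = len(U_toy)

# One (i, j, c) triple per pair with i < j, visited in row-major order.
toy_trips = [(i, j, U_toy[i][j])
             for i in range(n_toy)
             for j in range(i + 1, n_toy)]
print(toy_trips)
```

This sketch keeps pairs whose count is 0; whether to store or skip them is a choice the exercise leaves open.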

In [2]:
# SOLUTION:

###Your code
print(trips)


### Exercise 4: Eggs with a side of Hash Brow-I mean- Functions

Suppose we have upped our convenience store game and now sell the following products. (Note: yes, there may be repeated items in this inventory.)

In [53]:
inventory = ["puppies", "better candy", "cookie jar ice cream", "pizza bagels",
             "warm slippers on a cold day", "coffee", "better candy", "mashed potatoes"]

In practical applications, using methods like `inventory.index("mashed potatoes")` won't be a very efficient way to get the integer index for the counts matrix elements corresponding to mashed potatoes, because `index` scans the list from the start on every call and the inventory (and with it the counts matrix) will be huge.

Instead, as we sequentially read new items from our list of inventory, we can use a **hash function** to convert the items into integers. Those integers, however, will almost certainly never be the sequential numbers $0, 1, \ldots, n-1$ (where $n$ is the number of items in our inventory). Rather, we use the hash value as a slot number in our hash table, and the table entry at that slot tells us the integer index assigned to the item.

Let's use the hash function that sums the ASCII values for each letter in an item's name, and takes the remainder when that sum is divided by a nice large prime number. Pick a nice prime number from [this list](https://www.mathsisfun.com/numbers/prime-numbers-to-10k.html). For the sake of example, we'll use $p=37$.

**Reminder:** `ord("a")` returns the ASCII value corresponding to the character "a", for example.

In [201]:
def hashfcn(itemname, p):
    # sum up the ASCII values in the string itemname
    #TODO
    # mod down by the prime p
    #TODO
    return 0 # TODO - return the hash value

To hash the item "puppies", for example, we would do:

In [203]:
p = 37
hv = hashfcn("puppies", p)
print(hv)

34
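
As a check on the printed value, and assuming the hash is the plain sum of `ord` values reduced modulo $p$ as described above, the arithmetic for "puppies" (characters p, u, p, p, i, e, s) works out as:

$$112 + 117 + 112 + 112 + 105 + 101 + 115 = 774, \qquad 774 \bmod 37 = 774 - 20 \cdot 37 = 34.$$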


And we need a table where we can look up the element at slot 34 to find the integer corresponding to "puppies". Since this is the first item, it should be assigned index 0 in our triangular matrix/array/list of triples. The second item to be hashed and stored should be assigned index 1, and so on. We will store these items and their indices as tuples (item, index), so that if there are collisions, we can resolve them by storing a list of (item, index) tuples at that hash value and searching through it to check whether the item we are hashing is already at that spot.

We start by initializing our lookup table. 

In [207]:
lookup = [False]*p  # False marks an empty slot; 0 is avoided as a fill value because it is a valid index

Read the inventory list element-by-element, and store the (item, index) tuples in the lookup table. Yes, we could certainly use a dictionary, but let's just pretend we have a more primitive language. For fun!

The code stencil below assumes no collisions. It will be up to you in your homework to generalize this code to resolve collisions, so consider playing around with what you might do in that case here.

In [None]:
# need to keep a running count of how many items we've seen
cnt = 0

for item in inventory:
    # hash the item
    #TODO
    if not lookup[hv]:
        # if the slot is free, put the (item, index) pair there
        #TODO
        # and increment the index counter
        #TODO
    else:
        # if the slot is taken, check if the element there is the one we want
        # this gets a list of all the item names stored there
        items = [lookup[hv][k][0] for k in range(len(lookup[hv]))]
        if item in items:
            # let the user know you've seen this item before
            print("We have seen item [{}] before.".format(item))
        else:
            # for now, just let the user know there is a collision to resolve
            print("Collision at hash value {}".format(hv))
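
The stencil above expects each occupied slot to hold a list of (item, index) tuples. As a rough illustration of that layout, the sketch below builds a tiny table by hand and searches one slot. The table `toy_lookup`, the helper `find_index`, and the slot numbers are all hypothetical; the slots are made up and are not values produced by the hash function described earlier.

```python
# Hand-built toy table with 5 slots; slot numbers are invented for illustration.
toy_lookup = [False] * 5
toy_lookup[3] = [("coffee", 0)]
toy_lookup[1] = [("puppies", 1), ("pizza bagels", 2)]  # two items sharing one slot

def find_index(table, slot, item):
    """Return the index stored for item at the given slot, or None if it is absent."""
    if not table[slot]:
        return None  # empty slot: nothing has been stored here
    for name, idx in table[slot]:
        if name == item:
            return idx
    return None  # slot is occupied, but by other items

print(find_index(toy_lookup, 1, "pizza bagels"))
print(find_index(toy_lookup, 3, "puppies"))
```

In the real table, the slot argument would come from hashing the item name rather than being supplied by hand.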

#### Using our hash table: a tail of pizza bagels and puppies

Suppose for the pair (puppies, pizza bagels) we need to set the itemset count equal to 16. Use the `lookup` table and `hashfcn` to determine the row $i$ and column $j$ of the upper-triangular counts matrix that corresponds to this itemset pair.