# Lab 2: Introduction to Itemset Mining in Python

In this lab, we will use a Groceries Dataset available on Kaggle. A full description of the dataset is available at https://www.kaggle.com/heeraldedhia/groceries-dataset. It is composed of 38,765 rows of items purchased at the grocery store and contains the following columns:
  * `Member_number`: the ID of the member who purchased the item
  * `Date`: the date on which the transaction occurred
  * `itemDescription`: the item that was purchased


We will perform Market Basket Analysis on the data by applying the **Apriori Algorithm**. When doing so, it is important to keep in mind that this is a very small, limited dataset, so we must be careful about how we interpret the significance of any of our findings.

In [1]:
#%pip install pandas
import pandas as pd 
#%pip install numpy
import numpy as np
#%pip install seaborn
import seaborn as sns
%pip install apyori
from apyori import apriori



### 1. Data Exploration

Let's begin by loading the dataset and performing some preliminary data exploration.

In [None]:
# load the dataset
from google.colab import files
uploaded = files.upload()

groceries = pd.read_csv('Groceries_dataset.csv')
print("Number of rows:", groceries.shape[0])
print("Number of columns:", groceries.shape[1])
groceries.sample(5)


We see that we have 38,765 rows and three columns: the 'Member_number' (the ID of the customer), 'Date' (the date of the purchase) and 'itemDescription' (the product purchased). Therefore, each row represents an item that was purchased by a customer on a given date. If multiple items were purchased by the customer on that date, we will have multiple rows for that combination of 'Member_number' and 'Date'.

In [None]:
# number of unique customers
print("Number of unique customers:", groceries.Member_number.nunique())
# number of unique items
print("Number of unique items:", groceries.itemDescription.nunique())

In [None]:
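# number of unique dates
print("Number of unique dates:", groceries.Date.nunique())
# range of dates: parse the strings first, since min()/max() on raw strings
# compare text rather than dates (assumes day-first dates such as 21-07-2015)
dates = pd.to_datetime(groceries.Date, dayfirst=True)
print("Date Range: {} - {}".format(dates.min().date(), dates.max().date()))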

In [None]:
# number of unique customer-date combinations
print("Number of unique customer-date combinations:", 
      len(groceries.drop_duplicates(['Member_number', 'Date'])))

In [None]:
# basket-size statistics (one basket = one customer on one date)
basket_sizes = groceries.groupby(['Member_number', 'Date']).size()
avg_basket_size = np.round(basket_sizes.mean(), 2)
min_basket_size = basket_sizes.min()
max_basket_size = basket_sizes.max()
print("Average Basket Size: {} items".format(avg_basket_size))
print("Minimum Basket Size: {} items".format(min_basket_size))
print("Maximum Basket Size: {} items".format(max_basket_size))

# plot distribution of basket sizes
# (sns.distplot is deprecated; sns.histplot is its replacement for histograms)
sizes = basket_sizes.rename('Basket Size')
sns.histplot(sizes, discrete=True).set_title('Distribution of Basket Sizes');

In [None]:
# what are the 10 most frequent items in the dataset? how many times were they bought?
groceries.itemDescription.value_counts()[:10]

### 2. Association Rules

Now that we have explored the nature of our data, we can start mining it for insights. We will start with association rules. The goal of association rules is to predict which other items customers are likely to buy, given what they have already bought or have in their basket. As we saw in lecture, an example of an association rule is that people who bought {x, y, z} also tend to buy {v, w}. We want to find all (interesting) rules X --> Y that meet a minimum support and a minimum confidence. As a review:

  * `Support`: probability that a transaction contains both X and Y
  * `Confidence`: conditional probability that a transaction containing X also contains Y, P(Y|X)
  * `Interest`: difference between the confidence of the rule X --> Y and the fraction of baskets that contain Y
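
The three definitions above can be written as a few lines of plain Python. This is a minimal sketch, not the method used later in the lab: the baskets are invented for illustration, and the helper names `support`, `confidence` and `interest` are assumptions of this example.

```python
# Toy baskets, invented for illustration (not taken from the Groceries data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(X, Y, baskets):
    """Estimate of P(Y | X): support of X and Y together, divided by support of X."""
    return support(set(X) | set(Y), baskets) / support(X, baskets)

def interest(X, Y, baskets):
    """Confidence of X --> Y minus the fraction of baskets that contain Y."""
    return confidence(X, Y, baskets) - support(Y, baskets)

print(support({"bread", "milk"}, baskets))       # 0.5 (2 of 4 baskets)
print(confidence({"bread"}, {"milk"}, baskets))  # 2 of the 3 bread baskets
print(interest({"bread"}, {"milk"}, baskets))    # negative: milk is more common overall
```

A negative interest, as in the last line, means that knowing X is in the basket makes Y slightly less likely than its overall frequency would suggest.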
  
Using a small subset of the data, we will first calculate these concepts by hand. Then, we will apply the Apriori Algorithm to the entire dataset.

But first, we must do some data manipulation to get the data in the right format. We will need to group the data by 'Member_number' and 'Date' so that we have a set of the items for each transaction (i.e., all items purchased by the customer on that date).
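
To make the grouping step concrete, here is a minimal sketch of the same idea without pandas. The rows are hypothetical, laid out like the three columns of the dataset; each distinct (member, date) pair becomes one basket.

```python
from collections import defaultdict

# Hypothetical rows in the (Member_number, Date, itemDescription) layout.
rows = [
    (1000, "15-03-2015", "sausage"),
    (1000, "15-03-2015", "whole milk"),
    (1000, "24-06-2014", "pastry"),
    (1001, "15-03-2015", "whole milk"),
    (1000, "15-03-2015", "yogurt"),
]

# One set of items per (member, date) pair.
baskets = defaultdict(set)
for member, date, item in rows:
    baskets[(member, date)].add(item)

toy_transactions = list(baskets.values())
print(len(toy_transactions))  # 3
```

Using a set per basket means an item bought twice in the same transaction is counted once, which is what support counting needs.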

In [None]:
# get data in proper format
transactions = [set(group.itemDescription) for _, group in groceries.groupby(['Member_number', 'Date'])]
transactions[0] # example transaction

In [None]:
# we will use a subset of 4 transactions for this example
# (the same indices are used for printing and for building the subset)
subset_indices = [0, 6, 14, 58]
transaction_subset = [transactions[i] for i in subset_indices]

for number, basket in enumerate(transaction_subset, start=1):
    print("Transaction {}:".format(number), basket)

In [None]:
# we can also represent this data as a dataframe/matrix where 1 = item present and 0 = item absent

transactions_subset_df = pd.DataFrame({'semi-finished bread': [1,0,0,0],
                   'whole milk': [1,1,0,1],
                   'yogurt': [1,0,0,1],
                   'sausage': [1,1,1,1],
                   'rolls/buns': [0,1,1,0],
                   'bottled beer': [0,0,0,1] })

transactions_subset_df

In [None]:
# calculate support for individual items: which are equal to or greater than minsup = 50%?

item_counter = {}
for product in transactions_subset_df.columns:
    item_counter[product] = sum(transactions_subset_df[product]>0)
    
item_counter

Looking at the frequency of each individual item, we see that 'sausage' (4 of 4 baskets) and 'whole milk' (3 of 4) exceed the 50% minsup threshold, while 'yogurt' and 'rolls/buns' (2 of 4 each) meet it exactly. Let's see which pairs of items reach the threshold as well.

In [None]:
# calculate support for item pairs: which are equal to or greater than minsup = 50%?

# source: https://dzenanhamzic.com/2017/01/19/market-basket-analysis-mining-frequent-pairs-in-python/

# take data matrix from dataframe
transaction_matrix = transactions_subset_df.to_numpy()
# get number of rows and columns
rows, columns = transaction_matrix.shape
# init new matrix
frequent_items_matrix = np.zeros((columns, columns))
# compare every product with every other
for this_column in range(0, columns-1):
    for next_column in range(this_column + 1, columns):
        # multiply product pair vectors
        product_vector = transaction_matrix[:,this_column] * transaction_matrix[:,next_column]
        # check the number of pair occurrences in baskets
        count_matches = sum((product_vector)>0)
        # save values to new matrix
        frequent_items_matrix[this_column,next_column] = count_matches

frequent_items_df = pd.DataFrame(frequent_items_matrix, columns = transactions_subset_df.columns.values, index = transactions_subset_df.columns.values)
frequent_items_df

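Looking at the results, we see that the only itemset that strictly exceeds our 50% minsup threshold is {whole milk, sausage}, which appears in 3 of the 4 baskets. Three other pairs ({whole milk, yogurt}, {yogurt, sausage} and {sausage, rolls/buns}) sit exactly at 50%; we focus on {whole milk, sausage} here. Now let's calculate the confidence of its rules to see if they surpass our 50% minconf threshold as well.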

Association rule 1: sausage -> whole milk

Confidence = Pr(whole milk | sausage) = 3/4 = 75%

Interest = conf(sausage -> whole milk) - Pr(whole milk) = 75% - 75% = 0% (definitely not interesting)


Association rule 2: whole milk -> sausage

Confidence = Pr(sausage | whole milk) = 3/3 = 100%

Interest = conf(whole milk -> sausage) - Pr(sausage) = 100% - 100% = 0% (also definitely not interesting)
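
The two hand calculations can be checked in a few lines. This is a sketch only: the four baskets are typed in from the 0/1 table above, and the helper `count` is a name introduced for this example.

```python
# The four example baskets, copied from the 0/1 table above.
subset = [
    {"semi-finished bread", "whole milk", "yogurt", "sausage"},
    {"whole milk", "sausage", "rolls/buns"},
    {"sausage", "rolls/buns"},
    {"whole milk", "yogurt", "sausage", "bottled beer"},
]
n = len(subset)

def count(*items):
    """Number of baskets that contain all of the given items."""
    return sum(set(items) <= basket for basket in subset)

for x, y in [("sausage", "whole milk"), ("whole milk", "sausage")]:
    conf = count(x, y) / count(x)      # Pr(y | x)
    interest = conf - count(y) / n     # conf(x -> y) - Pr(y)
    print(f"{x} -> {y}: confidence={conf:.0%}, interest={interest:.0%}")
```

Both rules come out with an interest of 0%: sausage is in every basket, so knowing that a basket contains whole milk tells us nothing new about sausage, and vice versa.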


This serves as an important lesson that not all high-confidence association rules are interesting! Next, let's apply the Apriori algorithm to our entire dataset to see if we can find more interesting insights.

### 2.1 Association Rules using the Apriori Algorithm 

With these concepts in mind, we can specify the min_support and min_confidence when using the Apriori Algorithm. Also, we can specify the value for min_lift, a measure of interestingness. Keep in mind that we will want to try out different values of min_support, min_confidence and min_lift depending on the task and problem at hand - it will require a bit of trial and error.
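
For reference, the lift of a rule X -> Y is its confidence divided by the fraction of baskets that contain Y, which is the same as support(X and Y) / (support(X) * support(Y)). A lift of 1 means X and Y occur together about as often as they would if they were independent. The sketch below computes it on the four example baskets from the previous section; the helper names are assumptions of this example, not part of apyori.

```python
# The four example baskets from the hand calculation above.
subset = [
    {"semi-finished bread", "whole milk", "yogurt", "sausage"},
    {"whole milk", "sausage", "rolls/buns"},
    {"sausage", "rolls/buns"},
    {"whole milk", "yogurt", "sausage", "bottled beer"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in subset) / len(subset)

def lift(X, Y):
    """support(X and Y) / (support(X) * support(Y))."""
    return support(X | Y) / (support(X) * support(Y))

print(lift({"sausage"}, {"whole milk"}))   # 1.0: same conclusion as interest = 0%
print(lift({"yogurt"}, {"bottled beer"}))  # 2.0, but it rests on a single basket
```

The second value shows why min_support matters alongside min_lift: with so few baskets, a high lift can come from one co-occurrence.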

In [None]:
association_results = list(apriori(transactions, min_support = 0.003, min_confidence = 0.05, min_lift=1.0))
association_results = filter(lambda x: len(x.items) > 1, association_results) # filtering to rules with at least 2 items

# source: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/
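for record in association_results:
    # read the rule direction from the ordered statistic itself, since the
    # iteration order of the frozenset record.items is arbitrary
    rule = record.ordered_statistics[0]
    print("Rule: " + ", ".join(rule.items_base) + " -> " + ", ".join(rule.items_add))
    print("Support: " + str(record.support))
    print("Confidence: " + str(rule.confidence))
    print("Lift: " + str(rule.lift))
    print("=====================================")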

Using the given criteria, we get a set of 8 association rules. We can see how some may be interesting. For example, we may predict that those who are buying bottled beer will also buy sausages since they could be planning a barbecue/party.

### 3. Frequent Itemsets

Next, let's experiment with Finding Frequent Itemsets. We want to find items that are frequently purchased together in our groceries data. To do so, we must set a value for min_support, the minimum support needed for the itemset to be considered frequent. As a reminder, support refers to the fraction of baskets that contain the itemset. Therefore, an itemset will be considered frequent if its support value is greater than or equal to the min_support threshold we set.
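
To show what the search does under the hood, here is a minimal sketch of a level-wise, Apriori-style search in plain Python. It is not apyori's implementation; the function name `frequent_itemsets` is an assumption of this example, and the data are the four example baskets from Section 2.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {frozenset: support}."""
    n = len(transactions)

    def support(candidate):
        return sum(candidate <= t for t in transactions) / n

    # Level 1: frequent single items.
    current = {}
    for item in {item for t in transactions for item in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = {}
    k = 1
    while current:
        result.update(current)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count: keep the candidates that reach the threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return result

subset = [
    {"semi-finished bread", "whole milk", "yogurt", "sausage"},
    {"whole milk", "sausage", "rolls/buns"},
    {"sausage", "rolls/buns"},
    {"whole milk", "yogurt", "sausage", "bottled beer"},
]
result = frequent_itemsets(subset, min_support=0.5)
for items, s in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(items), s)
```

The prune step is the key idea: an itemset can only be frequent if all of its subsets are frequent, so most larger candidates are discarded without counting them.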

In [None]:
itemsets = list(apriori(transactions, min_support=0.05)) # setting the minimum support value to 0.05
frequent_itemsets = []

for item in itemsets:
    frequent_itemsets.append((item.items, item.support))
    
sorted_freq_itemsets = sorted(frequent_itemsets, key= lambda t:t[1], reverse=True)

pd.DataFrame(sorted_freq_itemsets, columns=['Items', 'Support'])

We see that all of our itemsets include only 1 item. This is not very interesting to us since we want to know which items customers tend to buy together. Let's lower the value of min_support so that we get results with 2 or more items, and then filter the itemsets to those with at least 2 items in them.

In [None]:
itemsets = list(apriori(transactions, min_support=0.007))
itemsets = filter(lambda x: len(x.items) > 1, itemsets)

frequent_itemsets = []
for item in itemsets:
    frequent_itemsets.append((item.items, item.support))
    
sorted_freq_itemsets = sorted(frequent_itemsets, key= lambda t:t[1], reverse=True)

pd.DataFrame(sorted_freq_itemsets, columns=['Items', 'Support'])

### 4. Evaluation of Frequent Itemsets

There are a variety of metrics available for evaluating frequent itemsets. Applying evaluation metrics is an important task since not all frequent itemsets will be meaningful. One example is Jaccard Similarity, which we will employ here. The entirety of the task is the following:
  * Split the dataset into seasons (Fall, Winter, Spring, Summer) and select 2 of your choice
  * Identify the 25 most frequent itemsets (with at least 2 items in each itemset) for each season
  * Compute the Jaccard Similarity between them
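
As a reminder, the Jaccard Similarity of two sets A and B is the size of their intersection divided by the size of their union. Below is a minimal sketch on made-up data: the function name `jaccard_similarity` and the two small "season" collections are hypothetical, and returning 1.0 for two empty sets is a convention chosen for this example.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two collections of hashable elements."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical top itemsets for two seasons, invented for illustration.
season_a = {
    frozenset({"whole milk", "rolls/buns"}),
    frozenset({"whole milk", "yogurt"}),
    frozenset({"soda", "sausage"}),
}
season_b = {
    frozenset({"whole milk", "rolls/buns"}),
    frozenset({"soda", "sausage"}),
    frozenset({"whole milk", "soda"}),
}

print(jaccard_similarity(season_a, season_b))  # 0.5 (2 shared itemsets, 4 distinct)
```

Storing each itemset as a `frozenset` makes it hashable, so whole itemsets can be compared as elements of the two collections.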

In [None]:
### YOUR CODE: divide the groceries dataframe into 4 dataframes, one for each season
### then, select 2 of them to use for the remainder of this analysis

### you can use the following breakdown of months into seasons:
# Winter = December (12), January (01), February (02)
# Spring = March (03), April (04), May (05)
# Summer = June (06), July (07), August (08)
# Fall = September (09), October (10), November (11)

In [None]:
### YOUR CODE: transform the data into the proper format (hint: see code above)
def measure_Jaccard_distance(document1, document2):
    ### YOUR CODE: Jaccard distance = 1 - Jaccard similarity
    raise NotImplementedError

In [None]:
### YOUR CODE: find the top 25 most frequent itemsets (with at least 2 items) for each season
### sort them by their value for 'support' to identify the top 25 itemsets
### HINT: adjust the value of min_support, as necessary

In [None]:
### YOUR CODE: compute the Jaccard Similarity between the two sets of the top 25 most frequent itemsets