# CS5228 Assignment 1b - EDA, Data Cleaning, Association Rules (50 Points)

Hello everyone, this assignment notebook covers the topics of Exploratory Data Analysis, Data Cleaning, and Association Rules. There are some code-completion tasks and question-answering tasks in this answer sheet. For code-completion tasks, please write your answer (i.e., your lines of code) between the lines marked "your code starts here" and "your code ends here". The space between these two lines does not reflect the required or expected number of lines of code :). For answers in plain text, you can refer to [this Markdown guide](https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd) to customize the layout (although it shouldn't be needed).

**Important:** 
* Remember to save this Jupyter notebook as A1a_YourNameInLumiNUS_YourNUSNETID.ipynb
* Please upload only this notebook directly to LumiNUS (no other files, not as zipped archive)
* Submission deadline is September 19th, 11.59 pm (together with A1a)

Please also add your NUSNET ID and student ID in the code cell below. This is just to make the identification of your notebook doubly sure.

In [1]:
student_id = 'A0236597M'
nusnet_id = 'e0744016'

Here is an overview of the tasks to be solved and the points associated with each task. The notebook can appear very long and verbose, but note that many parts provide additional explanations, documentation, or discussion. The code and markdown cells you are supposed to complete are well marked, but you can use the overview below to double-check that you covered everything.


* **1. Exploratory Data Analysis (EDA) & Data Cleaning (25 Points)**
    * 1.1 Cleaning a Real-World Dataset (12 Points)
    * 1.2 Basic Facts about the Dataset (8 Points)
    * 1.3 Discussion (5 Points)
* **2 Association Rule Mining - Apriori Algorithm (25 Points)**
    * 2.1 Generate & Prune (k+1)-itemsets (5 Points)
    * 2.2 Generate Frequent Itemsets with Apriori Algorithm (5 Points)
    * 2.3 Association Rule Mining over Real-World Dataset (COVID-19) (9 Points)
        * 2.3a) (3 Points)
        * 2.3b) (3 Points)
        * 2.3c) (3 Points)
    * 2.4 Comparison with Different Use Case (Market Basket Analysis) (6 Points)
        * 2.4a) (4 Points)
        * 2.4b) (2 Points)

## Setting up the Notebook

In [2]:
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
# This will automatically reload external modules such as src/utils.py every time you make changes and save the file
%load_ext autoreload
%autoreload 2

In [3]:
%matplotlib inline

The following statements import all the packages needed to complete the notebook. Note that this notebook provides you with a series of auxiliary methods to help you solve the programming tasks. The purpose and the output of each auxiliary method is pretty straightforward, but we will provide examples throughout the notebook.

**Important:** For this notebook, you need to install the [`efficient-apriori`](https://pypi.org/project/efficient-apriori/) Python package. This is required for the task covering Association Rules.

In [4]:
import numpy as np
import pandas as pd

from efficient_apriori import apriori
from src.utils import unique_items, powerset, support, confidence, merge_itemsets, generate_association_rules, show_top_rules

## 1. Exploratory Data Analysis (EDA) & Data Cleaning (25 Points)

### 1.1 Cleaning a Real-World Dataset (12 Points)

Assume that you have been tasked to build a regression model to predict the resale prices of HDB flats in Singapore. To this end, you get a dataset containing information about 20,000 past resale transactions, including documentation with the following information about the attributes:

* **transaction_id**: Unique ID of the transactions. A 6-digit integer number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation of the transaction.
* **town**: The town in which the flat is located, e.g., "bukit merah", "woodlands".
* **flat_type**: Type of flat, e.g., "3 room", "4 room", "executive".
* **street_name**: Name of street.
* **storey_range**: Description of which storeys the flat is located on.
* **floor_area_sqm**: Living area of the flat in square meters.
* **flat_model**: Model of the flat, e.g., "model a", "improved".
* **lease_commence_date**: Year when the lease of the flat commenced, e.g., 1999, 1984.
* **resale_price**: Resale price of the flat in Singapore dollars.

Let's have a first look at the data

In [5]:
df_hdb = pd.read_csv('data/a1-resale-flat-prices.csv')

df_hdb.head()

Unnamed: 0,transaction_id,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price
0,545066,sengkang,4 room,201C,compassvale dr,07 to 09,90.0,model a,2001,378000.0
1,539856,punggol,3 room,664A,punggol dr,04 to 06,65.0,model a,2016,342888.0
2,535785,bukit panjang,3 room,251,bangkit rd,07 to 09,73.0,model a,1988,252000.0
3,537535,punggol,4 room,213C,punggol walk,04 to 06,93.0,model a,2015,435000.0
4,531556,yishun,3 room,629,yishun st 61,10 to 12,73.0,model a,1988,280000.0


**Perform EDA on the HDB Resale Price dataset and perform appropriate preprocessing steps to clean the data!**
The preprocessing steps for cleaning the data may include
* the removal of "dirty records" (i.e., records that do not adhere to the data description) and
* the modification of records

In the following, **identify at least 5 issues** with the dataset that would negatively affect any subsequent analysis, and clean the data accordingly.

**Important:**

* Recall from the lecture that data cleaning often involves making certain decisions. As such, you might come up with different steps than other students. This is OK as long as you can reasonably justify your steps.
* Perform the data cleaning on a copy of the original dataset `df_hdb_cleaned`; see code cell below. Later tasks will work on the original dataset `df_hdb` to ensure that the results are consistent and do not depend on your choice of data preprocessing.
* The goal is to preserve as many of the records as possible! So only remove records as part of your data cleaning if it's really necessary (this also means that you should not remove any attributes!). There might be different valid cases, so don't forget to briefly justify your decision.
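
A few quick `pandas` checks often surface such issues early. The sketch below runs on a small invented frame (`toy_hdb`, with made-up rows that only mimic some of the documented columns), so the values and the specific checks are illustrative assumptions, not the expected solution:

```python
import pandas as pd

# Hypothetical toy frame; the rows are invented for illustration only
toy_hdb = pd.DataFrame({
    'transaction_id': ['545066', 'C539856', '535785', '545066'],
    'town': ['sengkang', 'Punggol', 'yishun', 'sengkang'],
    'floor_area_sqm': [90.0, 65.0, None, 90.0],
    'resale_price': [378000.0, 342888.0, 252000.0, 378000.0],
})

# Missing values per column
missing_per_column = toy_hdb.isna().sum()

# Fully duplicated rows (the first occurrence is not counted)
n_duplicates = int(toy_hdb.duplicated().sum())

# Cancelled transactions: IDs starting with the letter 'C'
n_cancelled = int(toy_hdb['transaction_id'].astype(str).str.startswith('C').sum())

# Inconsistent spelling: town names that are not all lowercase
n_not_lowercase = int((toy_hdb['town'] != toy_hdb['town'].str.lower()).sum())

print(missing_per_column)
print(n_duplicates, n_cancelled, n_not_lowercase)
```

Each check is vectorized, so no explicit loop over the rows is needed.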

Please provide your answer below. It should list the different issues you have identified and briefly discuss which data cleaning steps you can and/or need to perform to address those issues.

**Your answer:**

1. remove the rows which include `NaN`
2. remove the column `transaction_id`
3. change `storey_range` into arrays: 07 to 09 ➡️ [7,8,9]; 04 to 06 ➡️ [4,5,6]
4. remove the rows where `lease_commence_date` is greater than 2021
5. rescale the data of `resale_price`: `resale_price = resale_price/10000`


Use the code cell below to actually implement your steps for cleaning the data. The results should back up your answer above. Feel free to split the cell into multiple code cells to improve organization (not a must, though).

**Important:** Avoid using loops in the parts of the code you have to complete -- `pandas` is your best friend here :). If you use loops but the result is correct, there will be some minor deduction of points.

In [6]:
# We first create a copy of the dataset and use this one to clean the data.
df_hdb_cleaned = df_hdb.copy()

#########################################################################################
### Your code starts here ###############################################################
# 1. remove NaN
df_hdb_cleaned =  df_hdb_cleaned.dropna()

# 2. remove the column transaction_id
df_hdb_cleaned = df_hdb_cleaned.drop(['transaction_id'], axis=1)

# 3. change storey_range into arrays
def update_storey_range(storey_range):
    # Parse e.g. '07 to 09' into the integer array [7, 8, 9]
    bounds = [int(part.strip()) for part in storey_range.split('to')]
    return np.arange(bounds[0], bounds[1] + 1)

df_hdb_cleaned['storey_range'] = df_hdb_cleaned['storey_range'].apply(update_storey_range)

# 4. remove the rows where lease_commence_date is 2021 or later (the filter keeps only years before 2021)
df_hdb_cleaned = df_hdb_cleaned[df_hdb_cleaned['lease_commence_date']<2021]

# 5. rescale the resale_price to units of 10,000 SGD, rounded to 2 decimals
df_hdb_cleaned['resale_price'] = (df_hdb_cleaned['resale_price']/10000).round(2)

### Your code ends here #################################################################
#########################################################################################

print('After preprocessing, There are now {} entries.'.format(df_hdb_cleaned.shape[0]))

After preprocessing, There are now 16724 entries.
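
As an aside to step 3, the `storey_range` strings could also be parsed without `apply`. The following is only a sketch under the assumption that every value has the form `NN to NN`; the column names `storey_low`, `storey_high`, and `storey_mid` are hypothetical and the sample values are taken from the preview above:

```python
import pandas as pd

# Sample values in the 'NN to NN' format seen in the data preview
storey_range = pd.Series(['07 to 09', '04 to 06', '10 to 12'])

# Extract both bounds with one vectorized regular-expression call
bounds = storey_range.str.extract(r'(\d+)\s*to\s*(\d+)').astype(int)
bounds.columns = ['storey_low', 'storey_high']  # hypothetical column names

# A single numeric feature, e.g. the midpoint of the range
bounds['storey_mid'] = (bounds['storey_low'] + bounds['storey_high']) / 2

print(bounds)
```

Numeric bounds or a midpoint are usually easier for a regression model to consume than an array stored in each cell.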


### 1.2 Basic Facts about a Real-World Dataset (8 Points)

This task is about getting basic insights into a given dataset. For this, we use the "Online Retail II" dataset. It contains all the transactions occurring for a UK-based and registered, non-store online retail. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers (Source: https://archive.ics.uci.edu/ml/machine-learning-databases/00502/)

Attribute Information:

* **Invoice**: Invoice number. A 6-digit integral number uniquely assigned to each transaction. 
* **StockCode**: Product (item) code. A 5-digit integral number uniquely assigned to each distinct product. 
* **Description**: Product (item) name. A string with letters being UPPERCASE.
* **Quantity**: The quantities of each product (item) per transaction. A simple integer value.
* **InvoiceDate**: Invoice date and time. A string representing the day and time when a transaction was generated.
* **Price**: Unit price. A numeric value representing the product price per unit in sterling.
* **CustomerID**: Customer number. A 5-digit integral number uniquely assigned to each customer.

**Important:** This is not the raw dataset from the source linked above, but has already been cleaned! Also, each (Invoice,StockCode) value pair is unique -- that is, each transaction contains the same StockCode at most once.

In [7]:
df_retail = pd.read_csv('data/a1-online-retail-cleaned.csv')

df_retail.head(10)

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,CustomerID
0,489435,22350,CAT BOWL,12,12/01/09 07:46 AM,2.55,13085
1,489435,22349,"DOG BOWL , CHASING BALL DESIGN",12,12/01/09 07:46 AM,3.75,13085
2,489435,22195,HEART MEASURING SPOONS LARGE,24,12/01/09 07:46 AM,1.65,13085
3,489435,22353,LUNCHBOX WITH CUTLERY FAIRY CAKES,12,12/01/09 07:46 AM,2.55,13085
4,489440,22350,CAT BOWL,8,12/01/09 09:43 AM,2.55,18087
5,489440,22349,"DOG BOWL , CHASING BALL DESIGN",8,12/01/09 09:43 AM,3.75,18087
6,489448,20827,GOLD APERITIF GLASS,48,12/01/09 10:18 AM,2.12,15413
7,489448,20825,GOLD WINE GLASS,48,12/01/09 10:18 AM,3.39,15413
8,489448,20823,GOLD WINE GOBLET,48,12/01/09 10:18 AM,4.25,15413
9,489448,20826,SILVER APERITIF GLASS,48,12/01/09 10:18 AM,2.12,15413


**Complete the table below by answering the 8 given questions!** Use the code cell below the table to actually implement your steps that enabled you to answer the questions. Read all questions carefully to make sure you provide the correct answer (Hint: If the 10 example entries shown above are not sufficient, you can always open `data/a1-online-retail-cleaned.csv` in Excel, text editor, etc. if you want to have a closer look at the data).

This is a markdown cell. Please fill in your answers for (1)~(8).

| No. | Question                                                                                                   | Answer       |
|-----|------------------------------------------------------------------------------------------------------------|--------------|
| 1)  | What are the dates of the **first transaction** and the **last transaction**?                                                                              | 12/01/09 07:46 AM, 12/09/10 07:28 PM |
| 2)  | What are the **unique number of items** and **the unique number of customers**?                                                                                | 2263, 2181 |
| 3)  | How many transactions included at **least 10** items with **StockCode 84839**?                                                                                       | 8 |
| 4)  | Which transaction contains the **highest number of individual items** (give the Invoice number and the number of individual items)?                                                                        | 502269, 10000 |
| 5)  | What was the **most expensive transaction** (give the Invoice number and the total price)?                                                                                    | 530715, 15818.4 |
| 6)  | How many transactions sold **at least one CAKESTAND**?                                                         | 447 |
| 7)  | Which customer has made the **most transactions** and how many (give the Customer ID and the number of transactions)?                                                        | 14646, 422 |
| 8)  | Which item has **sold the most** across all transactions (give the StockCode and the number of sold items)? | 22423, 407 |

**Important:** Avoid using loops in the parts of the code you have to complete; again, check out the built-in methods that `pandas` provides. If you use loops but the result is correct, there will be some minor deduction of points.

In [8]:
#########################################################################################
### Your code starts here ###############################################################
# 1)
first_transaction_date = df_retail['InvoiceDate'][0]
last_transaction_date = df_retail['InvoiceDate'][47142]
print('first_transaction_date:', first_transaction_date)
print('last_transaction_date:', last_transaction_date)
# 2)
unique_num_items = len(df_retail['StockCode'].unique())
unique_num_customers = len(df_retail['CustomerID'].unique())
print('unique_num_items:', unique_num_items)
print('unique_num_customers:', unique_num_customers)
# 3)
transactions_with_84839 = len(df_retail[(df_retail['StockCode']==84839) & (df_retail['Quantity']>=10)])
print('transactions_with_84839:', transactions_with_84839)
# 4)
max_quantity = df_retail['Quantity'].max()
Invoice = df_retail[df_retail['Quantity']==max_quantity]['Invoice']
print('max_quantity:', max_quantity)
print('Invoice:', Invoice)
# 5)
df_retail['Cost'] = df_retail['Price']*df_retail['Quantity']
total_price = df_retail['Cost'].max()
Invoice = df_retail[df_retail['Cost']==total_price]['Invoice']
print('Invoice:', Invoice)
print('total_price:', total_price)
# 6)
number_cakestand = len(df_retail[df_retail['Description'].str.contains('CAKESTAND')])
print('number_cakestand:', number_cakestand)
# 7)
df_retail['CustomerID'] = df_retail['CustomerID'].astype(str)
most_freq_customer_id = df_retail['CustomerID'].describe()['top']
most_freq_customer_counts = df_retail['CustomerID'].describe()['freq']
print('most_freq_customer_id:', most_freq_customer_id)
print('most_freq_customer_counts:', most_freq_customer_counts)
# 8)
most_freq_item_sold = df_retail['Description'].describe()['top']
stockcode_most_freq_item_sold = df_retail[df_retail['Description']==most_freq_item_sold]['StockCode']
num_most_freq_item_sold = df_retail['Description'].describe()['freq']
print('stockcode_most_freq_item_sold:', stockcode_most_freq_item_sold)
print('num_most_freq_item_sold:', num_most_freq_item_sold)
### Your code ends here #################################################################
#########################################################################################

first_transaction_date: 12/01/09 07:46 AM
last_transaction_date: 12/09/10 07:28 PM
unique_num_items: 2263
unique_num_customers: 2181
transactions_with_84839: 8
max_quantity: 10000
Invoice: 6413    502269
6414    502269
6415    502269
6416    502269
Name: Invoice, dtype: int64
Invoice: 37445    530715
Name: Invoice, dtype: int64
total_price: 15818.4
number_cakestand: 447
most_freq_customer_id: 14646
most_freq_customer_counts: 422
stockcode_most_freq_item_sold: 5659     22423
5707     22423
5739     22423
5782     22423
5996     22423
         ...  
46359    22423
46593    22423
46834    22423
46905    22423
46937    22423
Name: StockCode, Length: 407, dtype: int64
num_most_freq_item_sold: 407
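
Several of the questions above refer to whole transactions (invoices) rather than single rows. As a hedged illustration of loop-free aggregation, the sketch below uses `groupby` on a small invented frame (`toy_retail`; all values are made up) to compute per-invoice, per-customer, and per-item totals:

```python
import pandas as pd

# Hypothetical toy data with the same column names as df_retail
toy_retail = pd.DataFrame({
    'Invoice':    [1, 1, 2, 2, 2, 3],
    'StockCode':  [10, 11, 10, 12, 13, 11],
    'Quantity':   [5, 5, 2, 2, 2, 8],
    'Price':      [1.0, 2.0, 1.0, 3.0, 4.0, 1.5],
    'CustomerID': [100, 100, 200, 200, 200, 100],
})
toy_retail['Cost'] = toy_retail['Quantity'] * toy_retail['Price']

# Aggregate all rows of an invoice before looking for the maximum
per_invoice = toy_retail.groupby('Invoice').agg(
    n_items=('Quantity', 'sum'),
    total_cost=('Cost', 'sum'),
)
largest_invoice = per_invoice['n_items'].idxmax()
priciest_invoice = per_invoice['total_cost'].idxmax()

# Number of distinct invoices per customer, and units sold per item
invoices_per_customer = toy_retail.groupby('CustomerID')['Invoice'].nunique()
units_per_item = toy_retail.groupby('StockCode')['Quantity'].sum()

print(per_invoice)
print(largest_invoice, priciest_invoice)
```

Counting rows per customer or per item gives a different number than counting distinct invoices or summing quantities, so it is worth checking which of these a question asks for.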


### 1.3 Discussion (5 Points)

In 1.1 we cleaned the dataset of HDB Resale Prices. Now we can use it for any subsequent analysis such as building a regression model to predict the resale prices based on the attributes of flats, but also other analyses such as clustering. Performing a specific analysis (e.g., training a regression model as covered in future lectures) is not part of this assignment! However, any analysis benefits from a very good understanding of the data. This may include the relevance and/or importance of attributes, the types of attributes (e.g., nominal, ordinal, interval, ratio), the need for converting attributes (e.g., via encoding), the need for normalization, important information not captured by the dataset, etc.

**Inspect the HDB Resale Price dataset to get a better understanding and briefly discuss your findings!** Your findings may cover the issues outlined above (attribute relevance/importance, attribute types, further preprocessing steps, etc.) or anything else you deem relevant.

**Your Answer:**

1. `floor_area_sqm` and `resale_price` are positively correlated.
2. The field `lease_commence_date` can be further processed as the age of the house instead of the year the house was built.
3. First clarify the meaning of the `flat_type` values `executive` and `multi generation`, and then convert them into the form `** room`. Then convert `** room` or `** - room` into integers, keeping only the number of rooms.
4. Maybe the `storey_range` can be reclassified into three categories: {'low', 'middle', 'high'}.
5. The fields `floor_area_sqm` and `resale_price` can be normalized.

In [74]:
df_hdb.corr()

Unnamed: 0,floor_area_sqm,lease_commence_date,resale_price
floor_area_sqm,1.0,0.053226,0.628217
lease_commence_date,0.053226,1.0,0.23537
resale_price,0.628217,0.23537,1.0
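
Two of the ideas from the answer above (deriving an age from `lease_commence_date` and normalizing a numeric field) can be sketched as follows. The reference year and the new column names `flat_age` and `floor_area_scaled` are assumptions made for illustration; the three sample rows are taken from the data preview in 1.1:

```python
import pandas as pd

toy = pd.DataFrame({
    'lease_commence_date': [2001, 2016, 1988],
    'floor_area_sqm': [90.0, 65.0, 73.0],
})

# Assumption: ages are measured relative to a fixed reference year
REFERENCE_YEAR = 2021
toy['flat_age'] = REFERENCE_YEAR - toy['lease_commence_date']

# Min-max normalization to the interval [0, 1]
area = toy['floor_area_sqm']
toy['floor_area_scaled'] = (area - area.min()) / (area.max() - area.min())

print(toy)
```

Min-max scaling is only one option; standardization (subtracting the mean and dividing by the standard deviation) would be an equally reasonable choice here.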


------------------------------------------------------------------------------------

## 2 Association Rule Mining - Apriori Algorithm (25 Points)

Your task is to implement the Apriori Algorithm for finding Association Rules. In more detail, we focus on the **Apriori Algorithm for finding Frequent Itemsets** -- once we have the Frequent Itemsets, we use a naive approach to derive the Association Rules. We will provide a small method for this later.

#### Toy Dataset

The following dataset with 5 transactions and 6 different items is directly taken from the lecture slides. This should make it easier to test your implementation. The format is a list of tuples, where each tuple represents the set of items of an individual transaction. This format can also be used as input for the `efficient-apriori` package.

In [23]:
transactions_demo = [
    ('bread', 'yogurt'),
    ('bread', 'milk', 'cereal', 'eggs'),
    ('yogurt', 'milk', 'cereal', 'cheese'),
    ('bread', 'yogurt', 'milk', 'cereal'),
    ('bread', 'yogurt', 'milk', 'cheese')
]

#### Auxiliary Methods

We want you to focus on the Apriori algorithm, so we provide you with a set of auxiliary functions. Feel free to look at their implementation in the file `src/utils.py`.

The method `unique_items()` returns all the unique items across all transactions.

In [10]:
unique_items(transactions_demo)

{'bread', 'cereal', 'cheese', 'eggs', 'milk', 'yogurt'}

The method `support()` calculates and returns the support for a given itemset and set of transactions.

In [11]:
support(transactions_demo, ('bread','milk'))

0.6

The method `confidence()` calculates and returns the confidence for a given association rule and set of transactions. An association rule is represented by a 2-tuple, where the first element represents itemset X and the second element represents itemset Y (i.e., $X \Rightarrow Y$).

In [12]:
confidence(transactions_demo, (('bread',), ('milk',)))

0.75
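
To make the two numbers above concrete: support is the fraction of transactions containing all items of an itemset, and the confidence of $X \Rightarrow Y$ is the support of $X \cup Y$ divided by the support of $X$. The sketch below recomputes both on the toy dataset; `support_sketch()` and `confidence_sketch()` are hypothetical stand-ins and not necessarily how `src/utils.py` implements them:

```python
transactions_demo = [
    ('bread', 'yogurt'),
    ('bread', 'milk', 'cereal', 'eggs'),
    ('yogurt', 'milk', 'cereal', 'cheese'),
    ('bread', 'yogurt', 'milk', 'cereal'),
    ('bread', 'yogurt', 'milk', 'cheese'),
]

def support_sketch(transactions, itemset):
    # Fraction of transactions that contain every item of the itemset
    items = set(itemset)
    hits = sum(1 for t in transactions if items.issubset(t))
    return hits / len(transactions)

def confidence_sketch(transactions, rule):
    # conf(X => Y) = supp(X u Y) / supp(X)
    lhs, rhs = rule
    return (support_sketch(transactions, set(lhs) | set(rhs))
            / support_sketch(transactions, lhs))

supp_bread_milk = support_sketch(transactions_demo, ('bread', 'milk'))
conf_bread_milk = confidence_sketch(transactions_demo, (('bread',), ('milk',)))
print(supp_bread_milk, conf_bread_milk)
```

Here 3 of the 5 transactions contain both bread and milk, and 4 contain bread, which gives the support of 0.6 and the confidence of 0.75 shown above.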

The method `generate_association_rules()` calculates and returns all possible association rules given an itemset. The result is a list of association rules, each association rule represented as 2-tuple (see above).

In [13]:
generate_association_rules(('bread', 'milk', 'cereal'))

[(('bread',), ('cereal', 'milk')),
 (('cereal',), ('bread', 'milk')),
 (('milk',), ('bread', 'cereal')),
 (('bread', 'cereal'), ('milk',)),
 (('bread', 'milk'), ('cereal',)),
 (('cereal', 'milk'), ('bread',))]
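
One possible way to enumerate these rules is to pick every non-empty proper subset of the itemset as left-hand side and use the remaining items as right-hand side. The function `rules_sketch()` below is a hypothetical illustration of that idea, not the provided implementation:

```python
from itertools import combinations

def rules_sketch(itemset):
    # Sort the items so that every side of a rule is a sorted tuple
    items = sorted(itemset)
    rules = []
    # Left-hand sides of size 1 .. n-1; the right-hand side is the rest
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            rhs = tuple(i for i in items if i not in lhs)
            rules.append((lhs, rhs))
    return rules

demo_rules = rules_sketch(('bread', 'milk', 'cereal'))
for rule in demo_rules:
    print(rule)
```

An itemset with $n$ items yields $2^n - 2$ rules, i.e., 6 rules for the 3-itemset above.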

The method `merge_itemsets()` merges two given itemsets into one itemset.

In [14]:
merge_itemsets(('bread', 'milk'), ('bread', 'eggs'))

('bread', 'eggs', 'milk')

For your implementation, you can make use of these auxiliary methods wherever you see fit. This is, of course, strongly recommended, as it makes the programming task much easier. So, let's get started.

### 2.1 Generate & Prune (k+1) Candidate Itemsets

Let's assume we have found all Frequent Itemsets of size $k-1$. The next step is now to find all Candidate Itemsets of size $k$. In the lecture we introduced two methods for this. For this assignment, we focus on the $\mathbf{F_{k-1} \times F_{k-1}}$ method -- that is, we use the Frequent Itemsets from the last step to calculate the Candidate Itemsets for the current step.

Recall that we also can (and should) **prune** any Candidate Itemsets that cannot possibly be Frequent Itemsets based on the information we already have. In other words, the Candidate Itemsets of size $k$ should only contain the itemsets for which we actually need to calculate the support.

**Implement `generate_kplus1_itemsets()` to calculate the Candidate Itemsets of size $k$ given the Frequent Itemsets of size $k-1$!** Note that we walked in detail through an example of this process in the lecture. Below is a code cell that reflects this example to test your implementation.

In [15]:
def generate_kplus1_itemsets(k_itemsets):
    
    # Just as fail-safe, return an empty set if the k_itemset is None or empty
    if k_itemsets is None or len(k_itemsets) == 0:
        return set()
    
    # It's convenient to actually have the value for k (e.g., k=3 for 3-itemsets)
    # The code just looks a bit odd since we cannot get an element from a set using indexing
    k = len(next(iter(k_itemsets)))

    # Initialize as set to avoid duplicates
    kplus1_itemsets = set()
    
    for itemset1 in k_itemsets:
        for itemset2 in k_itemsets:
            
            ######################################################################
            ### Your code starts here ############################################
            
            # get the merge itemset of itemset1 and itemset2
            itemset = merge_itemsets(itemset1, itemset2)

            # only keep itemset of length k+1
            if len(itemset) == k+1:
                # get subsets of itemset of length k
                itemset_subset_k = list()
                for i in range(len(itemset)):
                    temp = list(itemset)
                    temp.pop(i)
                    itemset_subset_k.append(temp)
                
                # pruning, delete itemset if any subset of itemset of length k is not in k_itemsets
                item_flag = True
                for item in itemset_subset_k:
                    if tuple(item) not in k_itemsets:
                        item_flag = False

                if item_flag:
                    kplus1_itemsets.add(itemset)
            
            ### Your code ends here ##############################################
            ######################################################################
            
            pass # Just there so the empty loop does not throw an error
    
    return kplus1_itemsets

The example below is directly taken from the lecture slides. As such the output should match that of the example as shown on the slides. The input is a set of Frequent Itemsets of size 2, and the output is a set with all Candidate Itemsets of size 3.

In [20]:
kplus1_itemsets = generate_kplus1_itemsets(
    {('bread', 'cereal'), ('bread', 'milk'), ('bread', 'yogurt'), 
     ('cereal', 'milk'), ('cereal', 'yogurt'), ('milk', 'yogurt')}
)


for itemset in kplus1_itemsets:
    print(itemset)

('bread', 'milk', 'yogurt')
('bread', 'cereal', 'yogurt')
('cereal', 'milk', 'yogurt')
('bread', 'cereal', 'milk')
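
The generate-and-prune step can also be written compactly with `itertools.combinations`. The sketch below is a self-contained alternative that does not use `src/utils.py`; the name `candidates_sketch()` is hypothetical, and it assumes that itemsets are represented as sorted tuples:

```python
from itertools import combinations

def candidates_sketch(frequent_k):
    # Normalize to sorted tuples so that membership tests are reliable
    frequent_k = {tuple(sorted(s)) for s in frequent_k}
    if not frequent_k:
        return set()
    k = len(next(iter(frequent_k)))

    candidates = set()
    # Generate: merge every unordered pair of frequent k-itemsets
    for a, b in combinations(frequent_k, 2):
        merged = tuple(sorted(set(a) | set(b)))
        if len(merged) != k + 1:
            continue
        # Prune: every k-subset of the candidate must itself be frequent
        if all(sub in frequent_k for sub in combinations(merged, k)):
            candidates.add(merged)
    return candidates

full = {('bread', 'cereal'), ('bread', 'milk'), ('bread', 'yogurt'),
        ('cereal', 'milk'), ('cereal', 'yogurt'), ('milk', 'yogurt')}
candidates_full = candidates_sketch(full)

# Without ('cereal', 'yogurt'), two of the four candidates are pruned
candidates_pruned = candidates_sketch(full - {('cereal', 'yogurt')})
print(sorted(candidates_full))
print(sorted(candidates_pruned))
```

Iterating over unordered pairs avoids merging each pair twice, and `all(...)` stops checking subsets as soon as one infrequent subset is found.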


### 2.2 Generate Frequent Itemsets with Apriori Algorithm

The method `generate_kplus1_itemsets()` covered the "Generate" and "Prune" step of the Apriori Algorithm for finding Frequent Itemsets. Now only the "Calculate" and "Filter" step is missing. However, with `generate_kplus1_itemsets()` in place and together with the auxiliary methods we provide (see above), putting the Apriori Algorithm together should be pretty straightforward.

**Implement `frequent_itemsets_apriori()` to find all Frequent Itemsets given a set of transactions and a minimum support of `min_support`!** Again, below is a code cell that reflects this example to test your implementation.

In [28]:
def frequent_itemsets_apriori(transactions, min_support):
    # The 1-itemsets are just all unique items across all transactions
    one_itemsets = unique_items(transactions)
    ############################################################################################
    ### Your code starts here ##################################################################
    
    # Calculate frequent 1-itemsets -- using the auxiliary methods provided, this can be a one-liner :)
    frequent_1_itemsets = None
    
    frequent_1_itemsets = set([(item,) for item in one_itemsets if support(transactions,(item,))>=min_support])
    
    ### Your code ends here ####################################################################
    ############################################################################################
    
    # Initialize dictionary with all current frequent itemsets for each size k
    # Example: { 1: {(a), (b), (c)}, 2: {(a, c), ...} }
    frequent_itemsets = { 1: frequent_1_itemsets }
    
    for k in range(1, len(one_itemsets)+1):

        frequent_kplus1_itemsets = set()
        
        ########################################################################################
        ### Your code starts here ##############################################################
        
        kplus1_itemsets = generate_kplus1_itemsets(frequent_itemsets[k])
        frequent_kplus1_itemsets = set([item for item in kplus1_itemsets if support(transactions,item) >= min_support])

        ### Your code ends here ################################################################
        ########################################################################################
                
        frequent_itemsets[k+1] = frequent_kplus1_itemsets    

    # Merge the dictionary of itemsets to a single set and return it
    # Example: {1: {(a), (b), (c)}, 2: (a, c)} => {(a), (b), (c), (a, c)}
    return set.union(*[ itemsets for k, itemsets in frequent_itemsets.items() ])

Again, you can check your implementation using the example from the lecture slides.

In [29]:
frequent_itemsets = frequent_itemsets_apriori(transactions_demo, 0.6)
for itemset in frequent_itemsets:
    print(itemset)

('cereal',)
('milk', 'yogurt')
('milk',)
('cereal', 'milk')
('bread', 'yogurt')
('bread',)
('yogurt',)
('bread', 'milk')
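
The level-wise search can stop as soon as one level yields no frequent itemsets, because no larger itemset can be frequent after that. The following self-contained sketch shows such an early-terminating loop; `apriori_sketch()` and `support_of()` are hypothetical names, and the candidate generation is inlined rather than taken from the implementation above:

```python
from itertools import combinations

transactions_demo = [
    ('bread', 'yogurt'),
    ('bread', 'milk', 'cereal', 'eggs'),
    ('yogurt', 'milk', 'cereal', 'cheese'),
    ('bread', 'yogurt', 'milk', 'cereal'),
    ('bread', 'yogurt', 'milk', 'cheese'),
]

def support_of(transactions, itemset):
    items = set(itemset)
    return sum(items.issubset(t) for t in transactions) / len(transactions)

def apriori_sketch(transactions, min_support):
    all_items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items
    level = {(i,) for i in all_items
             if support_of(transactions, (i,)) >= min_support}
    frequent = set()
    # Stop as soon as a level is empty
    while level:
        frequent |= level
        k = len(next(iter(level)))
        candidates = set()
        for a, b in combinations(level, 2):
            merged = tuple(sorted(set(a) | set(b)))
            if len(merged) == k + 1 and all(
                    sub in level for sub in combinations(merged, k)):
                candidates.add(merged)
        level = {c for c in candidates
                 if support_of(transactions, c) >= min_support}
    return frequent

frequent_demo = apriori_sketch(transactions_demo, 0.6)
print(sorted(frequent_demo, key=lambda s: (len(s), s)))
```

On the toy dataset the loop ends after the third level: the only surviving 3-itemset candidate is `('bread', 'milk', 'yogurt')`, which appears in just 2 of the 5 transactions.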


#### From Frequent Itemsets to Association Rules

Your implementation so far gives you the Frequent Itemsets in a list of transactions using the Apriori method. This step is typically the most time-consuming one in Association Rule Mining. However, we still have to do the second step and find all Association Rules given the Frequent Itemsets. We saw in the lecture that this can also be done in an efficient manner using the Apriori method to avoid checking all rules.

Since this step is typically less computationally expensive, we simply do it the naive way -- that is, we go over all Frequent Itemsets and check, for each Frequent Itemset, which of the Association Rules that can be generated from it have a sufficiently high confidence. With all the auxiliary methods we provide, this becomes trivial to implement, so we simply give you the method `find_association_rules()` below. Note how it uses your implementation of `frequent_itemsets_apriori()`.

In [None]:
def find_association_rules(transactions, min_support, min_confidence):
    # Initialize empty list of association rules
    association_rules = []
    
    # Find and loop over all frequent itemsets
    for itemset in frequent_itemsets_apriori(transactions, min_support):
        if len(itemset) == 1:
            continue

        # Find and loop over all association rules that can be generated from the itemset
        for r in generate_association_rules(itemset):
            # Check if the association rule fulfils the confidence requirement
            if confidence(transactions, r) >= min_confidence:
                association_rules.append(r)
                
    # Return final list of association rules
    return association_rules

find_association_rules(transactions_demo, 0.6, 1.0)

[(('cereal',), ('milk',))]

If everything is correct, for the values `min_support=0.6` and `min_confidence=1.0` used above, the only Association Rule that should be returned is $\{cereal\}\Rightarrow \{milk\}$ (in Python represented as a tuple of 2 tuples, left-hand side and right-hand side).

#### Comparison with `efficient-apriori` package

You can run the `apriori` method of the `efficient-apriori` package over the demo data to check if your implementation is correct. Try different values for the parameters `min_support` and `min_confidence` and compare the results. Note that the order of the returned association rules might differ between your implementation and the `efficient-apriori` one.

In [None]:
_, rules = apriori(transactions_demo, min_support=0.6, min_confidence=1.0, max_length=4)

for r in rules:
    print('Rule [{} => {}] (support: {}, confidence: {}, lift: {})'.format(r.lhs, r.rhs, r.support, r.confidence, r.lift))


Rule [('cereal',) => ('milk',)] (support: 0.6, confidence: 1.0, lift: 1.25)
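
The lift reported above relates the confidence of a rule to the support of its right-hand side: $lift(X \Rightarrow Y) = conf(X \Rightarrow Y) / supp(Y)$. As a quick cross-check, this sketch recomputes the three metrics for $\{cereal\}\Rightarrow \{milk\}$ directly from the toy dataset (the helper `count()` is a hypothetical name):

```python
transactions_demo = [
    ('bread', 'yogurt'),
    ('bread', 'milk', 'cereal', 'eggs'),
    ('yogurt', 'milk', 'cereal', 'cheese'),
    ('bread', 'yogurt', 'milk', 'cereal'),
    ('bread', 'yogurt', 'milk', 'cheese'),
]

def count(itemset):
    # Number of transactions that contain all items of the itemset
    items = set(itemset)
    return sum(1 for t in transactions_demo if items.issubset(t))

n = len(transactions_demo)
supp_rule = count(('cereal', 'milk')) / n   # support of X u Y
supp_lhs = count(('cereal',)) / n           # support of X
supp_rhs = count(('milk',)) / n             # support of Y

conf_rule = supp_rule / supp_lhs
lift_rule = conf_rule / supp_rhs

print(round(supp_rule, 3), round(conf_rule, 3), round(lift_rule, 3))
```

A lift above 1 means that the left-hand side makes the right-hand side more likely than its base rate; here milk appears in 80% of all transactions but in 100% of those containing cereal.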


The `efficient-apriori` package provides, of course, a much more efficient and convenient implementation (e.g., it keeps track of all the metrics for each rule). This is why we use this package for finding Association Rules in a real-world dataset below. Still, at its core, `efficient-apriori` implements the same underlying Apriori method to find Frequent Itemsets (and also to find the Association Rules). If you're interested, at the end, you can compare the runtimes of `efficient-apriori` and your implementation. Just don't be too disappointed :).

### 2.3 Association Rule Mining over Real-World Datasets (COVID-19) (9 Points)

In this task, we use the [Coronavirus Disease 2019 (COVID-19) Clinical Data Repository](https://covidclinicaldata.org/) to find Association Rules that might tell us which symptoms are most indicative of a COVID-19 infection. We already downloaded, cleaned, and prepared the dataset for you, so you can use it to mine Association Rules.

The dataset file `data/a1-covid-symptoms-result.csv` contains over 710k transactions. Each transaction is a set of $0..n$ symptoms and $1$ test result label ("POSITIVE" and "NEGATIVE"). For example a line in the file can look like `runny_nose sore_throat fatigue POSITIVE`. Note that a line might also be just `NEGATIVE` in case a person was tested without any symptoms. Feel free to take a look at the file -- looking at the raw data is always a good first step when it comes to data mining.

#### Loading the Data

Since we have transactional data and not tabular-like data, using `pandas` does not really help. We therefore simply read the dataset file line by line to generate our list of transactions. Note that we generate 2 lists of transactions:
* `transactions_covid_all` contains all 710k+ transactions in the dataset
* `transactions_covid_pos` contains all 11k+ transactions with a "POSITIVE" test result label

In [31]:
transactions_covid_all = []
transactions_covid_pos = []

with open('data/a1-covid-symptoms-result.csv') as file:
    for line in file:
        line = line.strip()
        
        if 'POSITIVE' in line:
            transactions_covid_pos.append(line.split(' '))
        
        transactions_covid_all.append(line.split(' '))

print('Number of transactions overall: {}'.format(len(transactions_covid_all)))
print('Number of "POSITIVE" transactions: {}'.format(len(transactions_covid_pos)))

Number of transactions overall: 710350
Number of "POSITIVE" transactions: 11060


Compared to traditional Market Basket Analysis, where all items in a transaction are of the same type (e.g., products in a supermarket), a transaction in the COVID data contains both symptoms and a test result label. This means that we might find important Association Rules such as $(\text{runny_nose} \Rightarrow \text{fever})$. These are perfectly valid rules, but in the following, we are interested only in rules where the right-hand side is either "POSITIVE" or "NEGATIVE".

To make this easy for you, we provide a method `show_top_rules()`, which computes the Association Rules using the `efficient-apriori` package, but (a) filters the rules w.r.t. the right-hand side, (b) sorts the rules w.r.t. the specified metric, and (c) shows only the top-k rules.
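
The three steps (a)-(c) can be illustrated without any library. The snippet below is only a minimal sketch and not the actual `show_top_rules()` helper shipped with this notebook: the `Rule` tuple and the function name `top_rules` are hypothetical stand-ins, chosen to mirror the fields (`lhs`, `rhs`, `support`, `confidence`, `lift`) that are printed for `efficient-apriori` rules elsewhere in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for a mined rule; it carries the same fields that are
# printed for efficient-apriori rules elsewhere in this notebook.
Rule = namedtuple('Rule', ['lhs', 'rhs', 'support', 'confidence', 'lift'])


def top_rules(rules, rhs, k=5, sort='lift'):
    """Keep rules whose right-hand side is exactly (rhs,), sort them, return the top k."""
    # (a) filter w.r.t. the right-hand side
    matching = [r for r in rules if r.rhs == (rhs,)]
    # (b) sort w.r.t. the chosen metric, best first
    matching.sort(key=lambda r: getattr(r, sort), reverse=True)
    # (c) keep only the top-k rules
    return matching[:k]


example_rules = [
    Rule(('loss_of_smell',), ('POSITIVE',), 0.002, 0.224, 14.389),
    Rule(('runny_nose',), ('fever',), 0.010, 0.300, 2.000),  # made-up values
    Rule(('loss_of_smell', 'loss_of_taste'), ('POSITIVE',), 0.001, 0.296, 19.016),
]

for r in top_rules(example_rules, rhs='POSITIVE', k=5):
    print(r.lhs, '=>', r.rhs, 'lift:', r.lift)
```

The rule with `fever` on the right-hand side is dropped by the filter, and the two remaining rules are listed by descending lift.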

**Run the following 4 code cells and interpret the results below!** All 4 code cells find Association Rules using the `efficient-apriori` package encapsulated in the auxiliary method `show_top_rules()` for convenience. Note how they differ with respect to the parameters including the used dataset and the restriction of the right-hand side of the resulting rules!

In [73]:
%%time
# Run A
show_top_rules(transactions_covid_all, min_support=0.001, min_confidence=0.2, k=5, sort='lift', rhs='POSITIVE')

=== Total Number of Rules: 8861 | Number of rules with matching RHS: 4 ===
('loss_of_smell', 'loss_of_taste') => ('POSITIVE',): supp: 0.001, conf: 0.296, lift: 19.016
('loss_of_smell',) => ('POSITIVE',): supp: 0.002, conf: 0.224, lift: 14.389
('cough', 'fever', 'headache') => ('POSITIVE',): supp: 0.001, conf: 0.205, lift: 13.149
('loss_of_taste',) => ('POSITIVE',): supp: 0.002, conf: 0.200, lift: 12.866

CPU times: user 12.1 s, sys: 89.6 ms, total: 12.2 s
Wall time: 12.2 s


In [None]:
%%time
# Run B
show_top_rules(transactions_covid_pos, min_support=0.15, min_confidence=0.8, k=5, sort='lift', rhs='POSITIVE')

=== Total Number of Rules: 6 | Number of rules with matching RHS: 6 ===
('fatigue',) => ('POSITIVE',): supp: 0.195, conf: 1.000, lift: 1.000
('fever',) => ('POSITIVE',): supp: 0.200, conf: 1.000, lift: 1.000
('muscle_sore',) => ('POSITIVE',): supp: 0.193, conf: 1.000, lift: 1.000
('headache',) => ('POSITIVE',): supp: 0.240, conf: 1.000, lift: 1.000
('cough',) => ('POSITIVE',): supp: 0.323, conf: 1.000, lift: 1.000
('sore_throat',) => ('POSITIVE',): supp: 0.174, conf: 1.000, lift: 1.000

CPU times: user 46.5 ms, sys: 1.82 ms, total: 48.3 ms
Wall time: 47.6 ms


In [None]:
%%time
# Run C
show_top_rules(transactions_covid_all, min_support=0.04, min_confidence=0.8, k=5, sort='lift', rhs='NEGATIVE')

=== Total Number of Rules: 6 | Number of rules with matching RHS: 6 ===
('sore_throat',) => ('NEGATIVE',): supp: 0.076, conf: 0.966, lift: 0.981
('fatigue',) => ('NEGATIVE',): supp: 0.079, conf: 0.963, lift: 0.978
('runny_nose',) => ('NEGATIVE',): supp: 0.044, conf: 0.961, lift: 0.976
('headache',) => ('NEGATIVE',): supp: 0.068, conf: 0.948, lift: 0.963
('muscle_sore',) => ('NEGATIVE',): supp: 0.042, conf: 0.933, lift: 0.948
('cough',) => ('NEGATIVE',): supp: 0.070, conf: 0.933, lift: 0.947

CPU times: user 1.86 s, sys: 51.7 ms, total: 1.91 s
Wall time: 1.92 s


In [None]:
%%time
# Run D
show_top_rules(transactions_covid_all, min_support=0.001, min_confidence=0.8, k=5, sort='lift', rhs='NEGATIVE')

=== Total Number of Rules: 646 | Number of rules with matching RHS: 345 ===
('sore_throat',) => ('NEGATIVE',): supp: 0.076, conf: 0.966, lift: 0.981
('fatigue',) => ('NEGATIVE',): supp: 0.079, conf: 0.963, lift: 0.978
('sob',) => ('NEGATIVE',): supp: 0.035, conf: 0.962, lift: 0.977
('runny_nose',) => ('NEGATIVE',): supp: 0.044, conf: 0.961, lift: 0.976
('fatigue', 'sob') => ('NEGATIVE',): supp: 0.017, conf: 0.961, lift: 0.976
('runny_nose', 'sore_throat') => ('NEGATIVE',): supp: 0.017, conf: 0.958, lift: 0.973

CPU times: user 12.8 s, sys: 124 ms, total: 13 s
Wall time: 13.1 s


**2.3a) Discuss your observations! (3 Points)** You must have noticed numerous differences between the 4 runs A-D. List at least 3 differences you have found. You may want to consider the elapsed time and the quality of the results. Briefly explain your observations!

**Your Answer:**

Observations:
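1. Both `min_support` and `min_confidence` influence the number of resulting rules. The smaller the `min_support`, the more rules are found (e.g., 646 rules in Run D vs. 6 rules in Run C).
2. Comparing Runs A and D, which use the same `min_support` but a different `min_confidence`, there is little difference in CPU time. Comparing Runs B and C, which use the same `min_confidence` but a different `min_support`, there is a big difference in CPU time: the larger `min_support` takes less CPU time (note that Run B also uses the much smaller dataset `transactions_covid_pos`).
3. In Runs B and C, where `min_support` is at least 0.04, the total number of rules is 6 in both cases.
4. In Run A, the lift of the rules with `POSITIVE` on the right-hand side is far greater than 1, while the lift of the rules with `NEGATIVE` on the right-hand side (Runs C and D) is slightly smaller than 1. In Run B, the lift is exactly 1 for all rules.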

**2.3b) Interpret the results! (3 Points)** Runs A and B return association rules with symptoms on the left-hand side and a POSITIVE test result on the right-hand side. As such, both runs find rules that show which (combinations of) symptoms are most indicative of a positive test result. However, the results of Runs A and B are rather different. Explain the differences and discuss which result provides more reliable insights!

**Your Answer:**

The results differ because of the dataset used and the choice of the parameters `min_support` and `min_confidence`. The `min_support` of Run A is much smaller than that of Run B; therefore, the total number of rules in Run A is significantly larger than in Run B.  
I think the result of Run A is more reliable. Run B only uses transactions with a POSITIVE label, so every rule trivially has a confidence of 1.0 and a lift of 1.0. In addition, its `min_support` of 0.15 is so large that only rules with a single symptom on the left-hand side meet the condition. Although the rules of Run B have a higher support and confidence, a lift of 1 means they do not tell us anything about how indicative a symptom is. In contrast, the lift values in Run A are far above 1, with the highest lift being 19.016.
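
Why Run B can only produce a lift of 1 follows from the definition $\text{lift}(X \Rightarrow Y) = \text{confidence}(X \Rightarrow Y) / \text{support}(Y)$. The sketch below illustrates this on made-up toy transactions (not the real data); the helper names `support`, `confidence`, and `lift` are chosen for illustration only. The last lines recompute the lift of the top rule of Run A from the numbers printed above, assuming the share of POSITIVE transactions is 11060 / 710350.

```python
def support(transactions, items):
    """Fraction of transactions that contain all of the given items."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(lhs and rhs together) divided by support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """confidence(lhs => rhs) divided by support(rhs)."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Made-up toy data in which every transaction is labeled POSITIVE,
# mimicking the structure of transactions_covid_pos.
toy_pos = [
    ['cough', 'POSITIVE'],
    ['cough', 'fever', 'POSITIVE'],
    ['headache', 'POSITIVE'],
    ['POSITIVE'],
]

print(confidence(toy_pos, ['cough'], ['POSITIVE']))  # 1.0
print(lift(toy_pos, ['cough'], ['POSITIVE']))        # 1.0

# Sanity check against Run A, using only numbers reported in this notebook.
p_positive = 11060 / 710350      # share of POSITIVE transactions
lift_run_a = 0.296 / p_positive  # reported (rounded) confidence of the top rule
print(round(lift_run_a, 2))
```

Because `support(['POSITIVE'])` is 1.0 in a dataset that contains only POSITIVE transactions, the lift collapses to the confidence, which is itself 1.0.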

**2.3c) Discuss effects of input parameters! (3 Points)** From your observation, what are the effects of increasing/reducing `min_support` and `min_confidence`? In which cases (i.e., runs A-B) can we optimize the parameters for performance without losing any quality in the results? Support your answer with evidence. You can perform more runs with different parameter settings, if needed.

**Your Answer:**

The table below shows the results of the rounds I performed.

| Round | min_support | min_confidence | total rules | matching rules | CPU time |
|-------|-------------|----------------|-------------|----------------|----------|
| 1     | 0.001       | 0.02           | 18359       | 35             | 12.9 s   |
| 2     | 0.04        | 0.02           | 12          | 0              | 1.97 s   |
| 3     | 0.15        | 0.02           | 0           | 0              | 362 ms   |
| 4     | 0.001       | 0.2            | 8861        | 4              | 12.8 s   |
| 5     | 0.04        | 0.2            | 6           | 0              | 1.9 s    |
| 6     | 0.15        | 0.2            | 0           | 0              | 353 ms   |
| 7     | 0.001       | 0.8            | 646         | 0              | 12.1 s   |
| 8     | 0.04        | 0.8            | 6           | 0              | 2.08 s   |
| 9     | 0.15        | 0.8            | 0           | 0              | 340 ms   |

1. Comparing Round 1 with Rounds 2 and 3, Round 4 with Rounds 5 and 6, or Round 7 with Rounds 8 and 9: when `min_confidence` is the same, increasing `min_support` reduces the total number of rules and reduces the computation time.
2. Increasing `min_confidence` reduces the total number of rules only if some rules have a confidence between the old and the new threshold (e.g., 18359, 8861, and 646 rules for `min_support=0.001`); otherwise, the number of rules stays the same (e.g., 6 rules for both `min_confidence=0.2` and `min_confidence=0.8` at `min_support=0.04`).
3. The CPU time is mainly affected by the value of `min_support`; `min_confidence` has hardly any effect on it.
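
The effect of `min_support` can be reproduced on a tiny scale. The sketch below uses made-up toy transactions and a brute-force counter (the function name `count_frequent_itemsets` is hypothetical and this is not how `efficient-apriori` works internally); it only counts frequent itemsets, not rules, but it shows how quickly the number of itemsets to deal with shrinks as the threshold rises.

```python
from itertools import combinations


def count_frequent_itemsets(transactions, min_support):
    """Brute-force count of itemsets whose support is >= min_support."""
    items = sorted({item for t in transactions for item in t})
    sets = [set(t) for t in transactions]
    n_frequent = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            c = set(candidate)
            # Support = fraction of transactions containing the candidate
            supp = sum(c <= s for s in sets) / len(sets)
            if supp >= min_support:
                n_frequent += 1
    return n_frequent


# Made-up toy transactions in the same format as the COVID data.
toy = [
    ['cough', 'fever', 'NEGATIVE'],
    ['cough', 'NEGATIVE'],
    ['fever', 'headache', 'POSITIVE'],
    ['cough', 'fever', 'headache', 'POSITIVE'],
    ['NEGATIVE'],
]

for min_support in (0.2, 0.4, 0.6):
    print(min_support, count_frequent_itemsets(toy, min_support))
```

Fewer frequent itemsets mean fewer candidates in the next Apriori level and fewer rules to evaluate, which is consistent with the CPU times in the table.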

### 2.4 Comparison with Different Use Case (Market Basket Analysis) (6 Points)

The COVID-19 dataset contains 710k+ transactions. Now let's make a basic comparison with a different dataset. The [Online Retail II](https://archive.ics.uci.edu/ml/machine-learning-databases/00502/) dataset contains all the transactions occurring for a UK-based and registered, non-store online retail. The company mainly sells unique all-occasion giftware. Many customers of the company are wholesalers.

We have already downloaded and prepared the dataset for you so that you can find Association Rules. The file `data/a1-retail-transactions.csv` contains ~19.8k transactions; each line represents one transaction. Each transaction is a set of product codes; for this task, we don't need to know what the actual products are (but if you're curious, you can check with the original dataset linked above).

Let's read the dataset file with retail transactions.

In [68]:
transactions_retail = []

with open('data/a1-retail-transactions.csv') as file:
    for line in file:
        line = line.strip()
        
        transactions_retail.append(line.split(' '))

print('Number of retail transactions: {}'.format(len(transactions_retail)))
print('Example transactions: {}'.format(transactions_retail[0]))

print(len(unique_items(transactions_retail)))

Number of retail transactions: 19853
Example transactions: ['85048', '79323P', '79323W', '22041', '21232', '22064', '21871', '21523']
4094


Now let's use `efficient-apriori` to find interesting Association Rules.

In [72]:
%%time 

_, rules_retail = apriori(transactions_retail, min_support=0.01, min_confidence=0.2)

print('Overall number of rules: {}'.format(len(rules_retail)))

# Sort the rules w.r.t. lift, highest lift first
rules_retail = sorted(rules_retail, key=lambda rule: rule.lift, reverse=True)

# Print the top-5 rules w.r.t. lift
for r in rules_retail[:5]:
    print('Rule [{} => {}] (support: {:.4f}, confidence: {:.4f}, lift: {:.4f})'.format(r.lhs, r.rhs, r.support, r.confidence, r.lift))


Overall number of rules: 638
Rule [('22748',) => ('22745',)] (support: 0.0112, confidence: 0.7957, lift: 63.4418)
Rule [('22745',) => ('22748',)] (support: 0.0112, confidence: 0.8916, lift: 63.4418)
Rule [('22301',) => ('22300',)] (support: 0.0100, confidence: 0.7425, lift: 56.0517)
Rule [('22300',) => ('22301',)] (support: 0.0100, confidence: 0.7567, lift: 56.0517)
Rule [('22699',) => ('22697',)] (support: 0.0108, confidence: 0.7329, lift: 55.7464)
CPU times: user 3min 14s, sys: 885 ms, total: 3min 15s
Wall time: 3min 16s


**2.4a) Discuss your observations! (4 Points)** If you compare with the Runs A-D in 2.3, you should observe several differences when extracting Association Rules from this dataset with retail transactions. List your observations and briefly provide an explanation for each observation. You may want to consider the size of the dataset, the runtimes of `efficient-apriori`, the values of the different metrics for the top rules, etc.


**Your Answer:**

1. The retail dataset is much smaller than the COVID-19 dataset (19,853 vs. 710,350 transactions, i.e., about 36 times fewer), yet the running time is about 16 times longer than for Run A (3 min 16 s vs. 12.2 s). The retail dataset contains 4,094 unique items, which leads to a very large number of candidate itemsets; the runtime of Apriori depends on the number of candidate itemsets and not only on the number of transactions.
2. The top rules of the retail dataset all have a single item on each side, while the matching rules of the COVID-19 dataset include rules with two or three items on the left-hand side. I think this is reasonable. When several symptoms appear together, we can infer that the COVID-19 result is more likely to be positive. When shopping, however, the probability of buying two specific items together is relatively high, while it is less common to buy the same combination of multiple items together.
3. Compared to Run A in 2.3, the confidence and lift of the top retail rules are both significantly higher than those of the COVID-19 rules. I think this is because a positive COVID-19 result is a rare event with a low probability, while the two products in each of the top retail rules are very often bought together. Therefore, the confidence and lift are higher in the retail dataset.
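
To get a feeling for the numbers behind the first observation, the following sketch computes an upper bound on the number of 2-itemsets for the retail dataset and the ratio of the two wall times reported above. It is only a rough illustration: Apriori combines only frequent single items, so the real number of candidate pairs is lower than this bound.

```python
import math

n_items_retail = 4094  # unique items reported for the retail dataset above

# Upper bound on the number of 2-itemsets Apriori may have to consider;
# the real number is lower because only frequent single items are combined.
max_pairs = math.comb(n_items_retail, 2)
print(max_pairs)

# Ratio of the wall times reported above (retail run vs. Run A).
runtime_ratio = (3 * 60 + 16) / 12.2
print(round(runtime_ratio, 1))
```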

**2.4b) Perform a Complexity Analysis of the Brute-Force Approach! (2 Points)** We know from the lecture that the brute-force implementation for Frequent Itemset Generation has to check $2^d-1$ itemsets, with $d$ being the number of unique items. Suppose we can count $2^{36}$ itemsets per second. How many unique items (approx.) may a transaction dataset contain so that we will be able to complete the counting before the sun burns out (the sun has another $5\cdot 10^9$ years to burn)? And what does it mean for our COVID-19 and our Retail Dataset?


**Your Answer:**

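From the equation `2^(d)-1 = 5*10^(9)*365*24*60*60*2^(36)`, we get that `d` is approximately equal to `93`. The retail dataset contains 4094 unique items, far more than 93, which means that checking all itemsets using brute force would take practically forever. It is a completely impossible task! For the COVID-19 dataset, the brute-force approach would only be feasible if the number of unique symptoms and labels stays well below this limit.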
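
The estimate can be checked with exact integer arithmetic. This is a small sketch under the assumptions of the task (a year of 365 days and $2^{36}$ itemsets counted per second); the variable names are chosen for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
years_left = 5 * 10**9
itemsets_per_second = 2**36

# Total number of itemsets that can be counted before the sun burns out.
budget = years_left * SECONDS_PER_YEAR * itemsets_per_second

# Largest d with 2**d - 1 <= budget, i.e. 2**d <= budget + 1.
d_max = (budget + 1).bit_length() - 1
print(d_max)  # 93
```

Every additional unique item doubles the number of itemsets, so a dataset with 94 unique items would already exceed the budget.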

------------------------------------------------------------------------------------

### Comparing Your Implementation with `efficient-apriori` (just for fun!)

So far, you have run and tested your implementation for finding Association Rules (with a focus on the Frequent Itemsets) on the toy dataset from the lecture. Now let's run it on the COVID dataset. However, we'd better use only a 10% sample for this :)

In [None]:
transactions_covid_sample = []

with open('data/a1-covid-symptoms-result-sample.csv') as file:
    for line in file:
        line = line.strip()
        transactions_covid_sample.append(line.split(' '))

print('Number of transactions overall: {}'.format(len(transactions_covid_sample)))

Let's find all Association Rules using your implementation and with the same parameters as for Run A (see above). Depending on your machine, this may take a couple of minutes.

In [None]:
%%time
association_rules = find_association_rules(transactions_covid_sample, 0.001, 0.2)

relevant_rules = [ r for r in association_rules if r[1][0] == 'POSITIVE' ]

for r in relevant_rules:
    print(r)

Hopefully, your implementation returns the same 4 Association Rules as `efficient-apriori` for Run A :). Apart from that, you cannot fail to observe the difference in performance. However, performance and optimization were not the focus here (this includes the rather naive implementation of the auxiliary methods we provided). This comparison should help you to appreciate the complexity of the task of Association Rule Mining over large datasets.