## Finding Association Rules on Grocery Dataset
In this Notebook, we will implement an algorithm for mining association rules from a dataset. We will first test our algorithm on a small synthetic (artificial) dataset, and then use it to find association rules in a larger dataset: the [grocery dataset](https://www.kaggle.com/irfanasrullah/groceries).

Our Notebooks in CSMODEL are designed to be guided learning activities. To use them, simply go through the cells from top to bottom, following the directions along the way. If you find any unclear parts or mistakes in the Notebooks, email me at thomas.tiam-lee@dlsu.edu.ph.

## Import
Import **numpy**, **pandas**, and **matplotlib**.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Synthetic Dataset
Before we use a more complicated dataset, we will first test our algorithm using a synthetic (artificial) dataset created using random numbers. The dataset contains 20 distinct items. There are 300 observations representing the baskets in the market-basket model. Each observation (basket) contains between 1 and 8 items.

Let's first create the synthetic dataset using the [`choice`](https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html) function of `numpy`. You may check the documentation of the function for further information. We set a fixed seed so that the synthetic dataset contains the same values every time the cell is run.

In [None]:
np.random.seed(1)
baskets = [np.sort(np.random.choice(20, size=(np.random.randint(1, 9)), replace=False)) for i in range(300)]

Let's display the contents of the synthetic dataset. It should list 300 baskets with their contents.

In [None]:
for i, basket in enumerate(baskets):
    print('Basket', i, basket)

So far, our dataset is represented as a list of lists. Instead of using this representation, we will convert our dataset to a matrix stored as a `pandas` `DataFrame`. The `DataFrame` will contain 300 rows (one per observation in the dataset) and 20 columns (one per distinct item in the dataset). The value in row `x` and column `y` is 1 if item `y` is in observation (basket) `x`; otherwise, it is 0.

In [None]:
syn_df = pd.DataFrame([[0 for _ in range(20)] for _ in range(300)], columns=[i for i in range(20)])

for i, basket in enumerate(baskets):
    syn_df.iloc[i, basket] = 1

Let's check the `DataFrame` representing the synthetic dataset here. In row `0`, the `DataFrame` should contain the value `1` in columns `3`, `10`, `14`, `15`, `17`, and `18`. All other columns in row `0` should contain the value `0`. You may check the other values against the list-of-lists representation that we displayed earlier.

In [None]:
print(syn_df)
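To see the same encoding on a scale that is easy to verify by eye, here is a minimal sketch with three made-up baskets over five items. The names `toy_baskets` and `toy_df` are hypothetical and are not used elsewhere in this Notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 baskets drawn from 5 items (0 to 4).
toy_baskets = [np.array([0, 2]), np.array([1, 2, 4]), np.array([3])]

# Start with an all-zero matrix: one row per basket, one column per item.
toy_df = pd.DataFrame(0, index=range(len(toy_baskets)), columns=range(5))

# Set the cell to 1 for every item that appears in each basket.
for i, basket in enumerate(toy_baskets):
    toy_df.iloc[i, basket] = 1

print(toy_df)
```

Row `1` of `toy_df` should contain `1` in columns `1`, `2`, and `4`, matching the second toy basket.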

## Rule Miner
Open the `rule_miner.py` file. Some of the functions in the `RuleMiner` class are not yet implemented. We will implement the missing parts of this class.

Import the `RuleMiner` class.

In [None]:
from rule_miner import RuleMiner

Instantiate a `RuleMiner` object with `support_t` equal to `10` and `confidence_t` equal to `0.6`. The field `support_t` represents the support threshold, while the field `confidence_t` represents the confidence threshold.

In [None]:
rule_miner = RuleMiner(10, 0.6)

Open the `rule_miner.py` file and complete the `get_support()` function. This function returns the support of an itemset, which is the number of baskets that contain every item in the itemset.
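To illustrate the definition (this is not the expected solution for `rule_miner.py`), the sketch below counts support on a small made-up 0/1 matrix. The helper name `count_support` and the toy data are assumptions introduced only for this example.

```python
import pandas as pd

# Hypothetical toy matrix: 4 baskets (rows) over 3 items (columns 0 to 2).
toy_df = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]],
    columns=[0, 1, 2],
)

def count_support(df, itemset):
    """Count the rows (baskets) in which every item of `itemset` is 1."""
    # Select the itemset's columns, then keep rows where all of them equal 1.
    contains_all = (df[list(itemset)] == 1).all(axis=1)
    return int(contains_all.sum())

print(count_support(toy_df, [0]))        # baskets 0, 1, and 2 contain item 0
print(count_support(toy_df, [0, 1]))     # baskets 0 and 2 contain both items
print(count_support(toy_df, [0, 1, 2]))  # only basket 2 contains all three
```

Note that support can only stay the same or shrink as more items are added to the itemset.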

Implement the `get_support()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In [None]:
print(rule_miner.get_support(syn_df, [0]))
print(rule_miner.get_support(syn_df, [0, 1]))
print(rule_miner.get_support(syn_df, [0, 1, 2]))

**Question:** What is the support of the itemset `{0}`? 
- *Write your answer here.*

**Question:** What is the support of the itemset `{0, 1}`? 
- *Write your answer here.*

**Question:** What is the support of the itemset `{0, 1, 2}`? 
- *Write your answer here.*

Open the `rule_miner.py` file again and complete the `get_frequent_itemsets()` function. This function returns a list of the frequent itemsets in the dataset. An itemset is frequent if its support is greater than or equal to the support threshold.
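For intuition, here is a brute-force sketch that checks every possible itemset on a made-up toy matrix. It is only practical for a handful of items and is not necessarily how `get_frequent_itemsets()` is meant to be written; the name `brute_force_frequent_itemsets` and the toy data are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy matrix: 4 baskets (rows) over 3 items (columns 0 to 2).
toy_df = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]],
    columns=[0, 1, 2],
)

def brute_force_frequent_itemsets(df, support_t):
    """Return every itemset whose support is at least `support_t`."""
    items = list(df.columns)
    frequent = []
    # Try all itemsets of size 1, then size 2, and so on.
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of rows where all candidate columns are 1.
            support = int((df[list(candidate)] == 1).all(axis=1).sum())
            if support >= support_t:
                frequent.append(list(candidate))
    return frequent

print(brute_force_frequent_itemsets(toy_df, 2))
```

With a support threshold of `2`, the itemsets `{1, 2}` and `{0, 1, 2}` are excluded because each appears in only one toy basket.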

Implement the `get_frequent_itemsets()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In [None]:
frequent_itemsets = rule_miner.get_frequent_itemsets(syn_df)
print(frequent_itemsets)

**Question:** List all the frequent itemsets in the dataset, given the support threshold `10`.
- *Write your answer here.*

Using the `get_rules()` function in `rule_miner.py`, let us list all the possible rules for all frequent itemsets in our dataset. The `get_rules()` function returns a list of rules produced from an itemset.

In [None]:
for itemset in frequent_itemsets:
    print(rule_miner.get_rules(itemset))
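To get a feel for what "all possible rules" of an itemset means, the sketch below splits an itemset into every non-empty left-hand side and the remaining right-hand side. Each rule is written as a pair of lists, `[left, right]`. The helper `enumerate_rules` is hypothetical, and the order or exact format of the rules returned by `get_rules()` may differ.

```python
from itertools import combinations

def enumerate_rules(itemset):
    """List every split of `itemset` into a non-empty left and right side."""
    rules = []
    # The left-hand side can have 1 up to len(itemset) - 1 items.
    for size in range(1, len(itemset)):
        for lhs in combinations(itemset, size):
            # Whatever is not on the left goes to the right-hand side.
            rhs = [item for item in itemset if item not in lhs]
            rules.append([list(lhs), rhs])
    return rules

for rule in enumerate_rules([1, 2, 3]):
    print(rule)
```

An itemset with three items yields six candidate rules, while an itemset with a single item yields none, since both sides of a rule must be non-empty.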

After getting all the possible rules from our frequent itemsets, we should check whether the confidence of each rule is greater than or equal to the confidence threshold that we set.

To do this, open the `rule_miner.py` file again and complete the `get_confidence()` function. This function returns the confidence of a rule. Suppose that the rule is `{1, 2} -> {3}`. The confidence of this rule is the support of `{1, 2, 3}` divided by the support of `{1, 2}`. In this code, we represent a rule as a list containing 2 lists: the first list is the left-hand side of the rule (`{1, 2}` in our example), and the second list is the right-hand side of the rule (`{3}` in our example).
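As a worked illustration of the formula (not the expected solution for `rule_miner.py`), the sketch below computes confidence on a made-up toy matrix. The helpers `count_support` and `rule_confidence` and the toy data are assumptions introduced only for this example.

```python
import pandas as pd

# Hypothetical toy matrix: 4 baskets (rows) over 3 items (columns 0 to 2).
toy_df = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]],
    columns=[0, 1, 2],
)

def count_support(df, itemset):
    """Count the rows (baskets) in which every item of `itemset` is 1."""
    return int((df[list(itemset)] == 1).all(axis=1).sum())

def rule_confidence(df, rule):
    """Confidence of [lhs, rhs] = support(lhs and rhs together) / support(lhs)."""
    lhs, rhs = rule
    # This sketch assumes the left-hand side appears in at least one basket;
    # otherwise the division below would fail.
    return count_support(df, lhs + rhs) / count_support(df, lhs)

# {0} -> {1}: 2 baskets contain both items, 3 baskets contain item 0.
print(rule_confidence(toy_df, [[0], [1]]))
# {0, 1} -> {2}: 1 basket contains all three items, 2 baskets contain {0, 1}.
print(rule_confidence(toy_df, [[0, 1], [2]]))
```

Confidence is not symmetric: in this toy data, `{2} -> {0}` has confidence 1.0 while `{0} -> {2}` has confidence 2/3.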

Implement the `get_confidence()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In [None]:
print(rule_miner.get_confidence(syn_df, [[1, 2], [3]]))
print(rule_miner.get_confidence(syn_df, [[4, 5], [6]]))
print(rule_miner.get_confidence(syn_df, [[7, 8], [9]]))

**Question:** What is the confidence of the rule `{1, 2} -> {3}`? 
- *Write your answer here.*

**Question:** What is the confidence of the rule `{4, 5} -> {6}`? 
- *Write your answer here.*

**Question:** What is the confidence of the rule `{7, 8} -> {9}`? 
- *Write your answer here.*

We have now completed all the helper functions needed by our rule miner. The only function left to implement is the `get_association_rules()` function, which combines these functions.

Open the `rule_miner.py` file again and complete the `get_association_rules()` function. This function returns a list of association rules with support greater than or equal to the support threshold `support_t` and confidence greater than or equal to the confidence threshold `confidence_t`.
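The sketch below shows one way the two thresholds can work together on a made-up toy matrix: keep only itemsets that meet the support threshold, then keep only the rules from those itemsets that meet the confidence threshold. It uses brute-force enumeration for brevity, and the names `support` and `mine_rules` are hypothetical; the structure expected in `rule_miner.py` may be different.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy matrix: 4 baskets (rows) over 3 items (columns 0 to 2).
toy_df = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]],
    columns=[0, 1, 2],
)

def support(df, itemset):
    """Count the rows (baskets) in which every item of `itemset` is 1."""
    return int((df[list(itemset)] == 1).all(axis=1).sum())

def mine_rules(df, support_t, confidence_t):
    """Return rules [lhs, rhs] that pass both thresholds (support_t >= 1)."""
    items = list(df.columns)
    rules = []
    # A rule needs at least two items: one on each side.
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            itemset_support = support(df, itemset)
            if itemset_support < support_t:
                continue  # not a frequent itemset
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    rhs = [item for item in itemset if item not in lhs]
                    confidence = itemset_support / support(df, lhs)
                    if confidence >= confidence_t:
                        rules.append([list(lhs), rhs])
    return rules

print(mine_rules(toy_df, 2, 0.7))
```

With a support threshold of `2` and a confidence threshold of `0.7`, only `{2} -> {0}` survives in this toy data: the other candidate rules from the frequent itemsets `{0, 1}` and `{0, 2}` have confidence 2/3.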

Implement the `get_association_rules()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

With `support_t` equal to `10`, and `confidence_t` equal to `0.6`, let's get the association rules from this dataset.

In [None]:
rules = rule_miner.get_association_rules(syn_df)
print(rules)

**Question:** What association rule(s) did we derive from the dataset?
- *Write your answer here.*

## Grocery Dataset
For this Notebook, we will work on the grocery dataset. This dataset contains 9835 rows, which represent transactions by customers shopping for groceries. The dataset contains 169 unique items.

The dataset is provided to you as a `.csv` file. `.csv` means comma-separated values. You can open the file in Notepad to see how it is exactly formatted.

If you view the `.csv` file in Excel, you can see that our dataset contains a list of items bought by a customer for each single transaction, represented in rows.

In [None]:
temp_df = pd.read_csv("groceries.csv", header=None)

Let's convert the items, represented as strings, to integers. To do this, let's create a dictionary that will contain the mapping for each item string to its corresponding integer. The dictionary should contain 169 unique strings, with integer mapping from 0 to 168.

In [None]:
values = temp_df.values.ravel()
values = [value for value in pd.unique(values) if not pd.isnull(value)]

value_dict = {}
for i, value in enumerate(values):
    value_dict[value] = i
    
print(value_dict)

So far, the `DataFrame` representation of the transactions contains 9835 rows, wherein each row contains the strings naming the items bought in one transaction. We want to convert this representation to a list of lists, with the corresponding integers as values instead of the strings.

In [None]:
temp_df = temp_df.stack().map(value_dict).unstack()

baskets = []
for i in range(temp_df.shape[0]):
    # Keep only the non-missing entries of the row, as sorted integer item IDs.
    basket = np.sort([int(x) for x in temp_df.iloc[i].values.tolist() if pd.notnull(x)])
    baskets.append(basket)

Let's display the contents of the dataset. It should list 9835 baskets with their contents.

In [None]:
for i, basket in enumerate(baskets):
    print('Basket', i, basket)

So far, our dataset is represented as a list of lists. Instead of using this representation, we will convert our dataset to a matrix stored as a `pandas` `DataFrame`. The `DataFrame` will contain 9835 rows (one per observation in the dataset) and 169 columns (one per distinct item in the dataset). The value in row `x` and column `y` is 1 if item `y` is in observation (basket) `x`; otherwise, it is 0.

In [None]:
grocery_df = pd.DataFrame([[0 for _ in range(169)] for _ in range(9835)], columns=values)

for i, basket in enumerate(baskets):
    grocery_df.iloc[i, basket] = 1

Let's check the `DataFrame` representing the dataset here. In row `0`, the `DataFrame` should contain the value `1` in columns `citrus fruit`, `semi-finished bread`, `margarine`, and `ready soups`. All other columns in row `0` should contain the value `0`. You may check the other values against the list-of-lists representation that we displayed earlier.

In [None]:
print(grocery_df)

Instantiate a `RuleMiner` object with `support_t` equal to `85` and `confidence_t` equal to `0.6`. The field `support_t` represents the support threshold, while the field `confidence_t` represents the confidence threshold.

In [None]:
rule_miner = RuleMiner(85, 0.6)

With `support_t` equal to `85`, and `confidence_t` equal to `0.6`, let's get the association rules from this dataset.

In [None]:
rules = rule_miner.get_association_rules(grocery_df)
print(rules)

**Question:** What association rule(s) did we derive from the dataset?
- *Write your answer here.*