# Exercise 14: Association Analysis for *FoodForAll*

In this exercise, we will look at transaction data from a supermarket.

The grocery store *FoodForAll* has trouble displaying its products optimally in the store and wants to increase sales to its customers. To help with this, *FoodForAll* has given you a dataset containing transaction data: what customers bought during each visit to the store.

### Load the data into a matrix

We will use a package called `mlxtend` for this exercise. If you wish, you can read more about `mlxtend` [here](https://rasbt.github.io/mlxtend/), but this is not necessary for completing the exercise.

First, let's load the relevant modules we will use in this exercise:

In [None]:
import csv
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import matplotlib
%matplotlib inline
from client.api.notebook import Notebook
ok = Notebook('ex14.ok')

The dataset we want to use is available in a local `.csv` file and to load this we need to use the following code:

In [None]:
with open("groceries.csv") as groceries_file:
    dataset = list(csv.reader(groceries_file))
dataset

If you are unsure what this `.csv` file looks like in its raw format, you can check its contents in a regular text editor, or by opening it from the Jupyter dashboard, to get an idea of the data we will handle.

Now we will use `mlxtend` to read all the items into a sparse matrix.

Every distinct product in the dataset gets its own column: if there are 1000 unique items in the dataset, the matrix has 1000 columns. Each row represents one shopping cart, with an ID used as the index. Each cell records whether that product was part of that shopping cart (`True`) or not (`False`); the quantity purchased is not stored.

Below is an illustrative example of what such a matrix can look like with 3 items and 3 customers, where 1 stands for `True` and 0 for `False`. Do you see the difference between the `.csv` file and this matrix?


| &nbsp;     | Product 1 | Product 2 | Product 3 |
|:----------:|:---------:|:---------:|:---------:|
| Customer 1 |     0     |     0     |     1     |
| Customer 2 |     1     |     1     |     0     |
| Customer 3 |     0     |     1     |     0     |
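
As a rough illustration of how item lists turn into a matrix like the one above, here is a minimal pure-Python sketch. The three baskets and the names `baskets`, `columns` and `matrix` are made up to mirror the table; in the exercise itself `TransactionEncoder` does this work.

```python
# Three hypothetical baskets that mirror the example table
baskets = [
    ["Product 3"],
    ["Product 1", "Product 2"],
    ["Product 2"],
]

# One column per distinct item, in sorted order
columns = sorted({item for basket in baskets for item in basket})

# One row per basket: True where the basket contains the column's item
matrix = [[item in basket for item in columns] for basket in baskets]

for customer, row in enumerate(matrix, start=1):
    print(f"Customer {customer}:", [int(cell) for cell in row])
```

Note that a basket containing the same item twice would still produce a single `True` in that column.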


A sparse matrix is more memory-efficient than keeping each shopping cart record in its full format.

If we had kept every shopping cart as a full list of item names, we would have had to keep the entire data in memory, repeating the names of items that appear more than once. The downside is that we get a matrix in which many cells contain just zero.

The `mlxtend` package lets us create the sparse matrix by fitting a `TransactionEncoder` to the input data records and then transforming them. We can load the result into a Pandas `DataFrame` to inspect it more easily:

In [None]:
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
groceries = pd.DataFrame(te_ary, columns=te.columns_)
groceries.head()

### Summarize and inspect the transactions

Now that you have loaded the dataset that *FoodForAll* has provided, we need to familiarize ourselves with the transactions before generating the association rules.

Use the functions you learned previously to describe the groceries dataset by replacing the ellipsis `...` in the next code cell with your own code:

In [None]:
groceries.describe()

In [None]:
groceries.shape

*Hint: Run the following cell to get the number of `True` and `False` values in the sparse matrix. This will help you calculate the density.*

In [None]:
groceries.stack().value_counts()
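
To make the idea of density concrete, here is a minimal sketch on a made-up 3×3 frame. It assumes that density means the number of `True` cells divided by the total number of cells; `toy`, `num_true`, `num_cells` and `toy_density` are hypothetical names and the data is not the exercise data.

```python
import pandas as pd

# Hypothetical 3x3 one-hot frame, not the exercise data
toy = pd.DataFrame(
    [[False, False, True],
     [True, True, False],
     [False, True, False]],
    columns=["Product 1", "Product 2", "Product 3"],
)

num_true = int(toy.to_numpy().sum())  # cells that are True
num_cells = toy.size                  # rows * columns
toy_density = num_true / num_cells

print(round(toy_density, 5))
```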

Sometimes it may be desirable to see specific transactions. To do so, use Pandas indexing with `[ ... ]`. If you select the entire dataset, you will get all the transactions.

In [None]:
groceries[groceries.sum(axis=1) == 1].shape

In [None]:
groceries.sum(axis=1).max()

In [None]:
groceries[5:10]

Note that although the index shows that it has returned transactions 5-9, you actually got transactions 6-10 (`DataFrame`s are indexed starting with row 0).

It is important, however, to understand that the numbers in the output do not show the transaction number; they are an auto-generated ID given by Pandas that simply reflects the row number.

You can also count the number of `True` values for different items:

In [None]:
groceries.sum().sort_values(ascending=False)[:5]

In [None]:
groceries["soda"].value_counts()

In [None]:
groceries[groceries.sum(axis=1) == 1].shape

Your task now is to summarize what we just learned from the output of the code above. Try to work out the answers to the following questions, run each cell that contains your answer, and validate your answers by running the `ok.grade()` cell after Q14.7. Make sure you really try to figure out the answers yourself before you check whether they are correct.

**Q14.1.** How many product items are in the dataset?

In [None]:
num_product_items_in_dataset = ...

**Q14.2.** How many transactions are there in the dataset?

In [None]:
num_transactions_in_dataset = ...

**Q14.3.** What is the density of the dataset? Provide your answer to 5 decimal places.

In [None]:
dataset_density = ...

**Q14.4.** What are the five most common items in the dataset? Provide your answer as a list of strings, for example `["potato", "köttbullar"]`.

In [None]:
most_common_5_items = [ ... ]

**Q14.5.** How many of the transactions contain soda?

In [None]:
num_transactions_containing_soda = ...

**Q14.6.** How many transactions contain only 1 item?

In [None]:
num_transactions_containing_1_item = ...

**Q14.7.** How many items are in the transaction with the most items?

In [None]:
max_num_product_items_in_a_transcation = ...

In [None]:
# run this cell to check questions 1 to 7. Make sure to run the code cells that contain your answers
_ = ok.grade('q31to37')

### Frequency of items

If we want to see how many transactions contain a particular item relative to the total number of transactions (expressed as a percentage), we can define a function `item_frequency()`.

In [None]:
def item_frequency(dataset):
    return dataset.sum() / len(dataset) * 100

item_frequency(groceries)

We can also restrict the input `DataFrame` to specific columns by providing the column name within square brackets `[ ... ]`. For example, for just `meat spreads` we would write:

In [None]:
item_frequency(groceries['meat spreads'])

We can also restrict the input `DataFrame` to specific rows by providing a row range within the square brackets. For example, for just the first three rows (rows 0 to 2, since the end of the range is excluded) we would write:

In [None]:
item_frequency(groceries[0:3])

If you want to look at a specific set of products to compare their frequency, you can provide a list of column names to the `groceries` indexer:

In [None]:
item_frequency(groceries[["whole milk", "butter", "rice"]])

This may be interesting, but it is often more useful to see only the goods that occur above a certain frequency. For this, we can keep only the items whose support exceeds a given minimum. We can define a new function that takes the minimum support as a parameter:

In [None]:
def item_frequency_plot(dataset, support):
    frequencies = dataset.sum() / len(dataset)
    freq = frequencies[frequencies > support]
    return freq * 100

We can additionally now visualize the results in a bar chart. Fill in the missing parameters to the new function with the input dataset and a minimum support of 0.125:

In [None]:
_ = item_frequency_plot(groceries, 0.125).plot.bar()

**Q14.8.** What does your graph look like? What is the most frequently purchased item according to the transaction data?

*Edit this cell to type your answer here*

## Extracting Association Rules

*FoodForAll* is now getting impatient and says that you have only produced things that they already knew and could easily look up in their databases. You promised that you would contribute new knowledge about their customers.

Because you do not want to test their patience any further, you want to avoid presenting unnecessary rules. There are three different types of rules: *trivial*, *unexplainable* and *actionable*.

*Trivial* rules are ones that you can easily predict, for example that milk and cereal are often bought together.

*Unexplainable* rules have no apparent explanation. For example, customers who buy diapers might often also buy a hammer, with no obvious reason why that would be so.

*Actionable* rules lead to insight that we can act on. Examples of what we can do are:

1. Price one item low while making the other a little more expensive.
1. Make sure that customers have to walk past the relevant goods.
1. Find alternative ways to market the goods.
1. Put the goods closer together.

To make it easier to talk about rules, we introduce a common notation.

We write a rule as $Antecedent \rightarrow Consequent$.

Example:

$Toys, wrapping paper \rightarrow Batteries$

This is read as: if you buy toys and wrapping paper, you are also likely to buy batteries.

### Measuring Association Rules

There are three different ways to measure association rules. They let us evaluate how much weight we should attach to a specific rule.

#### Support

Support is the fraction of transactions that contain a certain set of items. The more often the items occur together in the input dataset, the greater the support.

```
t1: Beef, Carrot, Milk
t2: Steak, Cheese
t3: Cheese, Cereal
t4: Steak, Carrot, Cheese
t5: Steak, Carrot, Butter, Cheese, Milk
t6: Carrot, Butter, Milk
t7: Carrot, Milk, Butter
```

For example,

$$Support(Carrot, Butter, Milk) = \frac{3}{7} = 0.43$$

because the combination of these three items appears 3 times in the input of 7 transactions.
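
The calculation above can be sketched in a few lines of plain Python. The helper `support()` is a hypothetical name introduced for illustration and is not part of `mlxtend`.

```python
# The seven example transactions t1 to t7
transactions = [
    {"Beef", "Carrot", "Milk"},
    {"Steak", "Cheese"},
    {"Cheese", "Cereal"},
    {"Steak", "Carrot", "Cheese"},
    {"Steak", "Carrot", "Butter", "Cheese", "Milk"},
    {"Carrot", "Butter", "Milk"},
    {"Carrot", "Milk", "Butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

print(round(support({"Carrot", "Butter", "Milk"}, transactions), 2))
```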

#### Confidence

Confidence tells us how often the right side of a rule holds when the left side does. If the rule $Beef, Chicken \rightarrow Apple$ has a confidence of 33%, then among the shopping carts that contain both beef and chicken, 33% also contain apples.

Confidence is calculated by dividing the support of all the items in the rule by the support of the left side. For example, for the rule $Butter \rightarrow Milk, Chicken$:

$$Confidence(Butter \rightarrow Milk, Chicken) = \frac{Support (Butter \land Milk \land Chicken)}{Support (Butter)}$$
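
As a small sketch, confidence can be computed by counting transactions directly, which gives the same result as dividing the two supports. The helper `confidence()` is a hypothetical name, the example rules are chosen only because their items occur in the seven transactions from the Support section, and the function assumes the antecedent occurs at least once.

```python
# The seven example transactions t1 to t7 from the Support section
transactions = [
    {"Beef", "Carrot", "Milk"},
    {"Steak", "Cheese"},
    {"Cheese", "Cereal"},
    {"Steak", "Carrot", "Cheese"},
    {"Steak", "Carrot", "Butter", "Cheese", "Milk"},
    {"Carrot", "Butter", "Milk"},
    {"Carrot", "Milk", "Butter"},
]

def confidence(antecedent, consequent, transactions):
    """Share of the transactions containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

print(confidence({"Carrot"}, {"Milk"}, transactions))
print(confidence({"Butter"}, {"Carrot", "Milk"}, transactions))
```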

#### Lift

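Lift tells us how good a rule is after taking into account how common the right side of the rule is. If the items on the right side are already common, a high confidence alone does not tell us anything valuable, because those items would often end up in the shopping cart anyway.

If the lift is $>1$, the items occur together more often than we would expect if they were bought independently, so the rule is better than guessing. If the lift is close to $1$, the rule is pretty much as good as guessing, and if it is $<1$, the items occur together less often than expected.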

For example, suppose that chicken appears in 5 of 7 transactions, milk in 4 of 7, and both together in 4 of 7:

$$Lift(Chicken \rightarrow Milk) = \frac{Support (Chicken \land Milk)}{Support(Chicken) \times Support (Milk)} = \frac{(4 / 7)}{(5 / 7) \times (4 / 7)} = 1.4$$

This implies that $Chicken \rightarrow Milk$ might be a good rule as $1.4 > 1$. However, if we increase the support for milk to $6 \div 7$, so that milk is bought in more transactions overall:

$$Lift(Chicken \rightarrow Milk) = \frac{Support (Chicken \land Milk)}{Support(Chicken) \times Support (Milk)} = \frac{(4 / 7)}{(5 / 7) \times (6 / 7)} = 0.933$$

This now implies that $Chicken \rightarrow Milk$ might be a bad rule as $0.933 < 1$.
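
The two lift calculations above can be checked with exact fractions. This is only a sketch: `lift()` is a hypothetical helper, and the supports are the ones used in the formulas above.

```python
from fractions import Fraction

def lift(support_both, support_antecedent, support_consequent):
    """Joint support divided by the support expected under independence."""
    return support_both / (support_antecedent * support_consequent)

# Supports taken from the two worked examples
first = lift(Fraction(4, 7), Fraction(5, 7), Fraction(4, 7))
second = lift(Fraction(4, 7), Fraction(5, 7), Fraction(6, 7))

print(first, float(first))
print(second, round(float(second), 3))
```

Using `Fraction` avoids floating-point rounding, so the results can be compared with the hand calculation exactly.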

### Perform association analysis with Python

Now let's find associations between items in the dataset using minimum thresholds on support and confidence.

First we create a dataset of frequent item sets using the `apriori()` function, which calculates the support (item frequency) in a similar way to what we did at the beginning of this exercise. The function also includes combinations of items in the calculation, and additionally allows us to filter on the minimum support. In this case we set the minimum support to 0.5%:

In [None]:
frequent_itemsets = apriori(groceries, min_support=0.005, use_colnames=True)
frequent_itemsets
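
To give a feel for what a frequent item set search does, here is a simplified level-wise sketch in plain Python. It is not how `mlxtend` implements `apriori()`; the function `naive_frequent_itemsets()` and the toy data (the seven transactions from the Support section) are assumptions made for illustration, and the sketch keeps item sets whose support is at least the minimum.

```python
from itertools import combinations

# The seven example transactions from the Support section
transactions = [
    {"Beef", "Carrot", "Milk"},
    {"Steak", "Cheese"},
    {"Cheese", "Cereal"},
    {"Steak", "Carrot", "Cheese"},
    {"Steak", "Carrot", "Butter", "Cheese", "Milk"},
    {"Carrot", "Butter", "Milk"},
    {"Carrot", "Milk", "Butter"},
]

def naive_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all item sets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item
    candidates = [frozenset([item]) for item in sorted(set().union(*baskets))]
    frequent = {}
    size = 1
    while candidates:
        # Keep the candidates that are frequent enough
        level = {}
        for candidate in candidates:
            share = sum(candidate <= basket for basket in baskets) / n
            if share >= min_support:
                level[candidate] = share
        frequent.update(level)
        # Build larger candidates by joining frequent sets of the current size
        size += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune candidates that have an infrequent subset
        candidates = [c for c in joined
                      if all(frozenset(sub) in level for sub in combinations(c, size - 1))]
    return frequent

toy_itemsets = naive_frequent_itemsets(transactions, min_support=0.4)
for itemset, share in sorted(toy_itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(share, 2))
```

The pruning step relies on the fact that an item set can only be frequent if all of its subsets are frequent, which keeps the number of candidates manageable.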

As you can see, this does not generate rules so we now need to generate them. If you want to decide what values you want to consider, `frequent_itemsets` is a `DataFrame` so you can filter on it before you provide it to the `association_rules()` function.

Note that you can explore the rules generated filtered on different metrics (`support`, `confidence` and `lift`) by specifying the metric and minimum threshold to the function, for example as follows:

In [None]:
grocery_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.1)
grocery_rules
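
Conceptually, rule generation splits each frequent item set into an antecedent and a consequent and keeps the splits that pass the threshold. The following plain-Python sketch illustrates this under stated assumptions: `generate_rules()` is a hypothetical helper (not the `mlxtend` implementation), and the supports are hard-coded from the seven example transactions in the Support section.

```python
from itertools import combinations

# Supports of some frequent item sets from the seven example transactions
supports = {
    frozenset({"Carrot"}): 5 / 7,
    frozenset({"Milk"}): 4 / 7,
    frozenset({"Butter"}): 3 / 7,
    frozenset({"Carrot", "Milk"}): 4 / 7,
    frozenset({"Carrot", "Butter"}): 3 / 7,
    frozenset({"Butter", "Milk"}): 3 / 7,
    frozenset({"Carrot", "Butter", "Milk"}): 3 / 7,
}

def generate_rules(supports, min_confidence):
    """List (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, support in supports.items():
        # Every non-empty proper subset can serve as an antecedent
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                confidence = support / supports[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / supports[consequent]
                    rules.append((antecedent, consequent, support, confidence, lift))
    return rules

toy_rules = generate_rules(supports, min_confidence=0.9)
for antecedent, consequent, support, confidence, lift in toy_rules:
    print(sorted(antecedent), "->", sorted(consequent),
          round(confidence, 2), round(lift, 2))
```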

We can add a column that holds the size (number of items contained) of the antecedents:

In [None]:
grocery_rules["num_antecedents"] = grocery_rules["antecedents"].apply(lambda x: len(x))
grocery_rules

Use the `describe()` function on your grocery rules dataset to get a summary:

In [None]:
grocery_rules.describe()

We can filter out the rules by filtering on the `num_antecedents` column we created:

In [None]:
grocery_rules_3_items = grocery_rules[grocery_rules.num_antecedents >= 3]
grocery_rules_3_items

**Q14.9.** How many rules have three items?

In [None]:
num_rules_with_three_items = ...

In [None]:
_ = ok.grade('q39')

Let's filter further on various rule measures:

In [None]:
grocery_rules[(grocery_rules.num_antecedents >= 3) 
              & (grocery_rules.confidence > 0.6)
              & (grocery_rules.support > 0.0005)]

**Q14.10.** What makes a rule interesting? Is there an interesting rule found in our rules with three antecedents that is worth investigating further?

*Edit this cell to type your answer here*

Sometimes we want to know the rules that contain a certain product. For example, *FoodForAll* has had trouble knowing what to put next to citrus fruits. First, we select the rules whose antecedents contain citrus fruit.

In [None]:
citrus_rules = grocery_rules[grocery_rules.antecedents.apply(str).str.contains("citrus fruit")]
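
Note that converting each antecedent to a string and using `str.contains` matches substrings, so it would also match any item whose name merely contains the text `citrus fruit`. As a hedged alternative, the sketch below tests set membership directly. The `toy_rules` table is made up for illustration; the real table comes from `association_rules()`.

```python
import pandas as pd

# Hypothetical toy rules table with frozenset columns
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"citrus fruit"}),
        frozenset({"whole milk"}),
        frozenset({"citrus fruit", "yogurt"}),
    ],
    "consequents": [
        frozenset({"whole milk"}),
        frozenset({"citrus fruit"}),
        frozenset({"whole milk"}),
    ],
})

# True for rows whose antecedent set contains exactly this item
has_citrus = toy_rules["antecedents"].apply(lambda items: "citrus fruit" in items)
toy_citrus_rules = toy_rules[has_citrus]

print(toy_citrus_rules)
```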

Now you can explore the association rules for just citrus fruit:

In [None]:
citrus_rules[(citrus_rules.num_antecedents == 1) 
              & (citrus_rules.confidence > 0.1)
              & (citrus_rules.support > 0.01)]

**Q14.11.** What products do you recommend *FoodForAll* to put next to citrus fruit? Explain your answer.

*Edit this cell to type your answer here*

## Conclusion

*FoodForAll* thanks you for your help and is grateful that, thanks to your work, it now sells much better.

Because you did such a good job, you have also learned that there are some further aspects that *FoodForAll* wants to know:

**Q14.12.** *FoodForAll* can now see which customers have made which transactions. What further possibilities can such data provide?

*Edit this cell to type your answer here*

**Q14.13.** A customer makes a purchase where he buys a candle together with 20 cans of beer, and *FoodForAll* then wonders how this will affect the analysis. You can say with confidence that it will not affect the analysis. Why can you say that?

*Edit this cell to type your answer here*

---
When you're finished with exercise 14, ask a TA or the lecturer to discuss your observations.

If you are running this notebook using Binder, choose **Save and Checkpoint** from the **File** menu, **rename** your notebook to add a hyphen and your initials to the notebook name e.g. `Ex14_Association_analysis_for_FoodForAll-DJ`, then choose **Download as Notebook** and save it to your computer or USB stick.

If you are running this notebook on your own machine, choose **Save and Checkpoint** from the **File** menu, choose **Make a copy** from the **File** menu, then **rename** your notebook to add a hyphen and your initials to the notebook name e.g. rename from `Ex14_Association_analysis_for_FoodForAll-Copy1` to `Ex14_Association_analysis_for_FoodForAll-DJ`.