# Practice Session 04: Basket analysis

Association rule mining techniques are useful for finding common patterns of items in large data sets. One specific application, called **market basket analysis**, is useful for online shops: if we know that items A and B are frequently bought together, we can design new actions to increase profit, such as:

- A and B can be placed together, so that a customer who buys one of the products does not have to go far to buy the other.
- People who buy one of the products can be targeted through an advertisement campaign to buy the other.
- Collective discounts can be offered on these products if the customer buys both of them.
- Both A and B can be packaged together.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Author: <font color="blue">Your name here</font>

E-mail: <font color="blue">Your e-mail here</font>

Date: <font color="blue">The current date here</font>

In [None]:
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
from apyori import apriori

If the apyori library is not already installed on your laptop, you can install it with: `pip install apyori`

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 0. The Apriori Algorithm in a nutshell

There are three major components of the Apriori algorithm, which we describe below using as an example the case where transactions are purchase histories.

**Support**: the number of transactions containing a particular item divided by the total number of transactions:

   *Support(A) = (Transactions containing (A))/(Total Transactions)*

**Confidence**: the likelihood that item B is also bought if item A is bought. It is calculated as the number of transactions where A and B are bought together, divided by the total number of transactions where A is bought:

   *Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)*

**Lift**: the increase in the ratio of sales of B when A is sold. Lift(A→B) is calculated by dividing Confidence(A→B) by Support(B):

   *Lift(A→B) = (Confidence (A→B))/(Support (B))*
   
A lift of 1 means there is no association between products A and B. A lift greater than 1.0 means A and B are bought together more often than would be expected if they were independent. A lift less than 1.0 means they are bought together less often than expected.
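To make the three definitions concrete, here is a minimal sketch that computes them by hand in plain Python. The five transactions and the helper names (`support`, `confidence`, `lift`) are made up for illustration and are not part of apyori.

```python
# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    matching = sum(1 for t in transactions if itemset <= t)
    return matching / len(transactions)

def confidence(antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

print("support(bread) = %.2f" % support({"bread"}))
print("confidence(bread -> milk) = %.2f" % confidence({"bread"}, {"milk"}))
print("lift(bread -> milk) = %.2f" % lift({"bread"}, {"milk"}))
```

In this toy data, bread appears in 4 of 5 transactions and bread together with milk in 3 of 5, so the confidence of bread→milk is 3/4. Milk alone appears in 4 of 5 transactions, so the lift is slightly below 1.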

The Apriori algorithm first finds itemsets having the desired level of support, and then within those itemsets tries to derive rules having the desired confidence and lift.
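The two phases described above can be sketched in a few lines of plain Python. This is a simplified illustration, not apyori's implementation: the function names `frequent_itemsets` and `derive_rules`, the toy data, and the thresholds are all assumptions, and candidates are generated naively by merging pairs of frequent itemsets.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Phase 1: level-wise search for itemsets with enough support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Start from single items, then grow by one item per level.
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while level:
        kept = {s: supp(s) for s in level}
        kept = {s: v for s, v in kept.items() if v >= min_support}
        frequent.update(kept)
        # Next candidates: unions of two kept itemsets that add exactly one item.
        level = {a | b for a in kept for b in kept if len(a | b) == len(a) + 1}
    return frequent

def derive_rules(frequent, min_confidence, min_lift):
    """Phase 2: split each frequent itemset into antecedent => consequent."""
    rules = []
    for itemset, supp_ab in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                consequent = itemset - antecedent
                # Subsets of a frequent itemset are frequent, so both lookups succeed.
                conf = supp_ab / frequent[antecedent]
                lift = conf / frequent[consequent]
                if conf >= min_confidence and lift >= min_lift:
                    rules.append((set(antecedent), set(consequent), supp_ab, conf, lift))
    return rules

# Hypothetical toy transactions and thresholds
toy = [
    ["beer", "chips"],
    ["beer", "chips", "olives"],
    ["chips", "olives"],
    ["beer", "chips"],
    ["nuts"],
]
frequent = frequent_itemsets(toy, min_support=0.4)
rules = derive_rules(frequent, min_confidence=0.7, min_lift=1.0)
for antecedent, consequent, s, c, l in rules:
    print("%s => %s (support=%.2f, confidence=%.2f, lift=%.2f)"
          % (sorted(antecedent), sorted(consequent), s, c, l))
```

Pruning by support first keeps the second phase cheap: rules are only derived from itemsets that already passed the support threshold.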

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 1. Playing with apyori

The [apyori library](https://pypi.org/project/apyori/) is an implementation of the Apriori algorithm. Its typical usage is to receive a list of transactions and then print the association rules it found.

To use this library, we pass a list in which each element represents a transaction, for instance:

```python
transactions = [
    ['beer', 'chips', 'nuts', 'olives'],
    ['beer', 'chips', 'olives'],
    ['chips', 'nuts'],
    ['chips', 'olives'],
    ['beer', 'nuts'],
    ['chips'],
    ['nuts', 'olives'],
    ['beer', 'nuts'],
    ['beer', 'chips', 'olives'],
    ['beer', 'nuts', 'olives'],
]
results = list(apriori(transactions, min_support=0.2, min_confidence=0.75, min_lift=1.0))
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your own example of transactions (at least 20 transactions) and execution of the apriori algorithm, in which you should obtain at least ONE and at most THREE rules.</font>

The function below, which you can leave as-is, prints the output of the apyori library in a readable format. Use it to print the results of your association rules mining.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# Leave this code as-is

def print_apyori_output (association_results):
    for relation_record in association_results:
        itemset = list(relation_record.items)
        
        # Consider only itemsets with at least two elements
        if len(itemset) > 1: 
        
            print("Rules involving itemset %s" % itemset)
            support = relation_record.support

            for rules in relation_record.ordered_statistics:
                antecedent = list(rules.items_base)
                consequent = list(rules.items_add)
                confidence = rules.confidence
                lift = rules.lift

                print("%s => %s (support=%.2f, confidence=%.2f, lift=%.2f)" %
                      (antecedent, consequent, support, confidence, lift))
            print()

<font size="+1" color="red">Replace this cell with (1) a printout of the rules you have obtained, and (2) for each of those rules, a clear indication of how the support, confidence, and lift are calculated. Do not merely repeat the formula: indicate how each number is computed based on the transactions you provided, as if you were trying to verify that the numbers are correct.</font>

# 2. Load and prepare the services purchased dataset

Next we will use a dataset contained in `services_purchased.csv` with 1000 customers that purchased up to 8 different services from a portfolio of a Big Internet Player. The portfolio includes:

- *WEBHOSTING*: Web hosting
- *OFFICESUITE*: Office suite that includes email and office tools such as documents, spreadsheets, and presentations
- *SECURITY*: Security solutions to protect against cyber-attacks
- *CLOUD_IAAS*: Cloud sub-product: infrastructure as a service
- *CLOUD_PAAS*: Cloud sub-product: platform as a service
- *CONTENTMGM*: Content management solution such as WordPress, Joomla!, Drupal, etc.
- *CHATBOT*: Chatbot for customer care
- *ADVERTISING*: Advertising

Each record (row) corresponds to a company and each column represents one of the products from the portfolio and can take the value 1 if the product was purchased or 0 if it was not.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
INPUT_FILENAME = "services_purchased.csv"

In [None]:
dataset = pd.read_csv(INPUT_FILENAME, sep=",")
dataset.head()

Next, show how many customers have purchased each service. Your code should display this information formatted as follows:

```
   WEBHOSTING: 274
  OFFICESUITE: 176
     SECURITY: 608
   CLOUD_IAAS: 67
   CLOUD_PAAS: 6
   CONTENTMGM: 152
      CHATBOT: 0
  ADVERTISING: 9
```
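One way to get this right-aligned layout is a width specifier in the format string. The sketch below uses a made-up dictionary `counts` rather than the real dataset; the width of 13 is read off the expected output above.

```python
# Hypothetical counts, not taken from services_purchased.csv
counts = {"APPLES": 12, "BANANAS": 7, "KIWI": 0}

# "%13s" pads each label on the left up to 13 characters
lines = ["%13s: %d" % (name, count) for name, count in counts.items()]

for line in lines:
    print(line)
```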

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to print how many customers have requested each service.</font>

<font size="+1" color="red">Replace this cell with code to remove the ID_customer column, which we do not need. Print the number of columns remaining; there should be 8.</font>

<font size="+1" color="red">Replace this cell with code to remove all customers that have not purchased any service. Print the number of rows remaining; there should be 753.</font>

Now, you need to create a variable named `transactions` containing the dataset as a list of transactions. Remember each transaction is a list.

The first five elements of this `transactions` variable should be:

```python
[
  ['SECURITY'],
  ['OFFICESUITE', 'SECURITY'],
  ['WEBHOSTING', 'SECURITY', 'CONTENTMGM'],
  ['SECURITY'],
  ['WEBHOSTING', 'OFFICESUITE', 'SECURITY', 'CONTENTMGM'],
  ...
]
```

You can iterate through the rows of a dataframe `df` with `for recordnum, record in df.iterrows()`  and through its columns with `for column in df.columns`.
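As a rough illustration of this iteration pattern, the sketch below turns a tiny 0/1 dataframe into a list of lists. The dataframe `toy` and its column names are invented for the example; your own code should work on the real dataset.

```python
import pandas as pd

# Hypothetical 0/1 purchase matrix with made-up column names
toy = pd.DataFrame({
    "TEA":    [1, 0, 1],
    "COFFEE": [0, 1, 1],
    "SUGAR":  [0, 0, 1],
})

toy_transactions = []
for recordnum, record in toy.iterrows():
    # Keep the names of the columns whose value is 1 in this row
    purchased = [column for column in toy.columns if record[column] == 1]
    toy_transactions.append(purchased)

print(toy_transactions)
```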

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create the "transactions" list from the dataset.</font>

# 3. Run the Apriori algorithm

Execute the Apriori algorithm using [apyori.apriori](https://pypi.org/project/apyori/) **twice**, with different minimum values for support, confidence, and lift. **Remember to set the `min_lift` parameter to a value strictly greater than 1.0.**

```python
results = apriori(transactions, min_support= ... , min_confidence= ... , min_lift= ... )
print_apyori_output(results)
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to produce and print high-support rules. It should return more than 1 and fewer than 5 rules.</font>

<font size="+1" color="red">Replace this cell with a brief commentary on the rules that you have found.</font>

<font size="+1" color="red">Replace this cell with code to produce and print high-lift rules. It should return more than 1 and fewer than 5 rules.</font>

<font size="+1" color="red">Replace this cell with a brief commentary on the rules that you have found.</font>

<font size="+1" color="red">Replace this cell with (1) a description of the customers that purchase the content management product (CONTENTMGM), and (2) a description of the customers that purchase the web hosting (WEBHOSTING) product. You may need to do additional runs of Apriori to obtain the rules you will need for this characterization.</font>

<font size="+1" color="red">Replace this cell with your conclusions. What would be your top three recommendations for this service provider? A recommendation could be "if a customer buys X, you should recommend them to buy Y, because ...". Remember to clearly justify your recommendations based on the results from association rules mining.</font>

# DELIVER (individually)

Remember to read the section on "delivering your code" in the [course evaluation guidelines](https://github.com/chatox/data-mining-course/blob/master/upf/upf-evaluation.md).

Deliver a zip file containing:

* This notebook

## Extra points available

For more learning and extra points, perform association rules mining on this [bakery dataset](https://github.com/viktree/curly-octo-chainsaw). There is a nice [notebook](https://github.com/viktree/curly-octo-chainsaw/blob/master/Bakery%20Transactions.ipynb) describing how to load this data; feel free to copy-paste the data loading and cleaning parts from that notebook. Convert the data to the format that apyori expects, run the association rules mining, and write your conclusions briefly.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: experiments on the bakery dataset</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>