# Practice Session 04: Basket analysis

Association rule mining finds common patterns of items in large data sets. One specific application, **market basket analysis**, is useful for online shops: if we know that items A and B are frequently bought together, we can design actions to increase profit, such as:

- A and B can be placed together so that a customer who buys one of the products does not have to go far to buy the other.
- People who buy one of the products can be targeted through an advertisement campaign to buy the other.
- Collective discounts can be offered on these products if the customer buys both of them.
- Both A and B can be packaged together.

# 0. Preliminaries

## 0.1. Dataset

In this practice we are using a dataset contained in `dataset_associationrules.csv` with 1000 customers that purchased up to 8 different services from a portfolio of a Big Internet Player. The portfolio includes:

- Web hosting
- Office suite that includes email and office tools such as documents, spreadsheets and presentations
- Security solutions to protect against cyber-attacks
- Cloud sub-product: infrastructure as a service
- Cloud sub-product: platform as a service
- Content Management such as WordPress, Joomla!, Drupal, etc.
- Chatbot for customer care
- Advertising

Each record (row) corresponds to a customer company. Each product column takes the value 1 if that product was purchased and 0 if it was not.

## 0.2. Imports

In [1]:
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
from apyori import apriori

## 0.3. Load the data

Open the CSV file with separator "," and assign it to a DataFrame variable (use `read_csv` from the pandas library).

In [2]:
dataset=pd.read_csv("Datasets/dataset_associationrules.csv", sep=",")
dataset.head().to_excel("Tables/head.xlsx")
dataset.head()

Unnamed: 0,ID_customer,WEBHOSTING,OFFICESUITE,SECURITY,CLOUD_IAAS,CLOUD_PAAS,CONTENTMGM,CHATBOT,ADVERTISING
0,0,0,0,1,0,0,0,0,0
1,1,0,1,1,0,0,0,0,0
2,2,1,0,1,0,0,1,0,0
3,3,0,0,1,0,0,0,0,0
4,4,1,1,1,0,0,1,0,0


## 0.4. The Apriori Algorithm in a nutshell
The Apriori algorithm relies on three main measures:

- Support: the overall popularity of an item, calculated as the number of transactions containing that item divided by the total number of transactions. For an item A:<br>


<center> **Support(A) = (Transactions containing (A))/(Total Transactions)** </center>

- Confidence: the likelihood that item B is also bought when item A is bought. It is calculated as the number of transactions where A and B are bought together, divided by the number of transactions where A is bought:<br>

<center>**Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)**</center>


- Lift: Lift(A→B) measures how much more likely B is to be sold when A is sold, compared with how often B is sold overall. It is calculated by dividing Confidence(A→B) by Support(B):<br>

<center>**Lift(A→B) = (Confidence (A→B))/(Support (B))**</center>

<UL> A lift of 1 means there is no association between products A and B. A lift greater than 1 means A and B are bought together more often than would be expected if they were independent. A lift below 1 means they are bought together less often than expected.</UL>
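
The three formulas above can be checked by hand on a tiny example. The following is a minimal sketch, assuming five made-up baskets over hypothetical items A, B and C; the helper name `support` is ours and is not part of apyori.

```python
# Five hypothetical baskets, for illustration only
transactions = [
    {'A', 'B'},
    {'A', 'B', 'C'},
    {'A'},
    {'B', 'C'},
    {'C'},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

support_a = support({'A'})         # A appears in 3 of 5 baskets
support_b = support({'B'})         # B appears in 3 of 5 baskets
support_ab = support({'A', 'B'})   # A and B appear together in 2 of 5 baskets

# Confidence(A -> B) = transactions containing both / transactions containing A
confidence_ab = support_ab / support_a
# Lift(A -> B) = Confidence(A -> B) / Support(B)
lift_ab = confidence_ab / support_b

print(f"Support(A) = {support_a:.2f}")
print(f"Confidence(A -> B) = {confidence_ab:.3f}")
print(f"Lift(A -> B) = {lift_ab:.3f}")
```

In these made-up baskets the lift comes out slightly above 1, so A and B appear together a little more often than independence would suggest.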

# 1. Apriori algorithm

## 1.1. Exploratory data analysis

[**REPORT**] Display the head of the dataset (first 5 records)

In [3]:
dataset.head()

Unnamed: 0,ID_customer,WEBHOSTING,OFFICESUITE,SECURITY,CLOUD_IAAS,CLOUD_PAAS,CONTENTMGM,CHATBOT,ADVERTISING
0,0,0,0,1,0,0,0,0,0
1,1,0,1,1,0,0,0,0,0
2,2,1,0,1,0,0,1,0,0
3,3,0,0,1,0,0,0,0,0
4,4,1,1,1,0,0,1,0,0


[**REPORT**] Evaluate the dimension of the dataset and the type of the given variables (float, string, integer, etc.).

In [4]:
print('The dataset has the following size:', dataset.shape)
print('The types are the following:')
dataset.info()  # info() prints its summary directly and returns None

The dataset has the following size: (1000, 9)
The types are the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
ID_customer    1000 non-null int64
WEBHOSTING     1000 non-null int64
OFFICESUITE    1000 non-null int64
SECURITY       1000 non-null int64
CLOUD_IAAS     1000 non-null int64
CLOUD_PAAS     1000 non-null int64
CONTENTMGM     1000 non-null int64
CHATBOT        1000 non-null int64
ADVERTISING    1000 non-null int64
dtypes: int64(9)
memory usage: 70.4 KB


Several algorithms have been developed for association rule mining; Apriori is one of them. In this practice we focus on the Apriori algorithm and apply it to our dataset.

We will use an existing Apriori implementation from the [apyori library](https://pypi.org/project/apyori/) to find out which products are commonly sold together.

*Note: If the apyori library is not already installed on your laptop, you can install it with: `pip install apyori`*

## 1.2. Data preparation

The **Apriori** library we are going to use requires the dataset to be a list of lists, where each inner list holds the products sold to one customer.
However, our dataset is a pandas DataFrame where each row represents a customer and each column takes the value 1 if the product was sold to that customer or 0 if it was not. Therefore, we need to (1) replace each "1" with the name of the product and (2) convert the DataFrame into a list of lists.
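
The two steps just described can be tried out on a handful of rows first. The following is a minimal sketch in plain Python, assuming the five rows shown by `dataset.head()` in section 0.3 with the ID column left out; the names `products`, `head_rows` and `baskets` are ours.

```python
# Product columns and the five rows printed by dataset.head() (ID column omitted)
products = ['WEBHOSTING', 'OFFICESUITE', 'SECURITY', 'CLOUD_IAAS',
            'CLOUD_PAAS', 'CONTENTMGM', 'CHATBOT', 'ADVERTISING']
head_rows = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 0, 0],
]

# Keep the product name wherever the flag is 1, one list per customer
baskets = [[name for name, flag in zip(products, row) if flag == 1]
           for row in head_rows]
# Drop customers with no purchases, since an empty basket carries no items
baskets = [basket for basket in baskets if basket]

for basket in baskets:
    print(basket)
```

With the real DataFrame, the column names and rows would come from `dataset.columns` and `dataset.values.tolist()`.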

[**CODE**] Replace "1"s by product names

In [5]:
for column_name in list(dataset.columns):
    dataset[column_name] = dataset[column_name].replace(1, column_name)

[**CODE**] The **Apriori** algorithm does not need the customer identifier. Remove the **ID_customer** column

In [6]:
#MY OUTPUT
dataset.drop(['ID_customer'], axis=1, inplace=True)
dataset.head()

Unnamed: 0,WEBHOSTING,OFFICESUITE,SECURITY,CLOUD_IAAS,CLOUD_PAAS,CONTENTMGM,CHATBOT,ADVERTISING
0,0,0,SECURITY,0,0,0,0,0
1,0,OFFICESUITE,SECURITY,0,0,0,0,0
2,WEBHOSTING,0,SECURITY,0,0,CONTENTMGM,0,0
3,0,0,SECURITY,0,0,0,0,0
4,WEBHOSTING,OFFICESUITE,SECURITY,0,0,CONTENTMGM,0,0


At this point, your dataset should look like this:

In [7]:
#ORIGINAL OUTPUT
dataset.head()

Unnamed: 0,WEBHOSTING,OFFICESUITE,SECURITY,CLOUD_IAAS,CLOUD_PAAS,CONTENTMGM,CHATBOT,ADVERTISING
0,0,0,SECURITY,0,0,0,0,0
1,0,OFFICESUITE,SECURITY,0,0,0,0,0
2,WEBHOSTING,0,SECURITY,0,0,CONTENTMGM,0,0
3,0,0,SECURITY,0,0,0,0,0
4,WEBHOSTING,OFFICESUITE,SECURITY,0,0,CONTENTMGM,0,0


[**CODE**] Convert the DataFrame into a list of lists and store it in a `records` list

In [8]:
#Remove rows with all zeros before passing to list
dataset = dataset[(dataset.T != 0).any()]
records = dataset.values.tolist()

[**CODE**] Remove all "0"s and store the result in the `records_final` list

In [9]:
# Build new lists without the 0 placeholders instead of mutating `records` in place
records_final = [[item for item in record if item != 0] for record in records]

print('Length of the final records:', len(records_final))

#Extra: remove items bought alone
records_final_filtered = []
for record in records_final:
    if len(record)>1:
        records_final_filtered.append(record)

print('Length of the final records filtered:', len(records_final_filtered))

Length of the final records: 753
Length of the final records filtered: 380
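
Note that the filtered list changes the denominator of every support value, so results computed on `records_final_filtered` would not be comparable with those computed on `records_final`, which is the list the cells below pass to `apriori`. The following is a small arithmetic sketch, using the two lengths printed above and, as an example, the support that apyori reports later in this notebook for the pair SECURITY and CONTENTMGM; the variable names are ours.

```python
n_all = 753        # baskets in records_final
n_filtered = 380   # baskets in records_final_filtered (more than one item)

# Support of {SECURITY, CONTENTMGM} as reported by apyori on records_final
pair_support_all = 0.18326693227091634

# Recover the number of baskets that contain both products
pair_count = round(pair_support_all * n_all)

# Every such basket has at least two items, so the filter removes none of them;
# only the denominator changes
pair_support_filtered = pair_count / n_filtered

print("Baskets with both products:", pair_count)
print("Support on all baskets:     ", round(pair_support_all, 4))
print("Support on filtered baskets:", round(pair_support_filtered, 4))
```

The same pair of products therefore shows roughly twice the support once single-item baskets are dropped, even though the number of joint purchases is unchanged.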


Now, everything is ready to execute the `Apriori` function.

## 1.3. Algorithm execution and evaluation

[**REPORT**] Execute the apriori algorithm using [apyori.apriori](https://pypi.org/project/apyori/) **3 times** with different thresholds for support, confidence, lift and length (`min_support`, `min_confidence`, `min_lift` and `max_length`). For each iteration:
- Indicate the number of association rules
- Create a table with the main relevant association rules and justify the results. Explain their characteristics, i.e. support, confidence and lift
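
To see what the support and length thresholds control, here is a brute-force sketch that counts every itemset up to a maximum length and keeps those that reach a minimum support. It is neither the Apriori procedure itself (which avoids counting itemsets whose subsets are already infrequent) nor apyori's code; the function name `frequent_itemsets` is hypothetical, and the five baskets are the rows shown by `dataset.head()` written as lists of product names.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_length):
    """Return {itemset tuple: support} for itemsets of size 1..max_length."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    items = sorted({item for basket in baskets for item in basket})
    result = {}
    for size in range(1, max_length + 1):
        for candidate in combinations(items, size):
            # Count the baskets that contain every item of the candidate
            count = sum(1 for basket in baskets if set(candidate) <= basket)
            support = count / n
            if support >= min_support:
                result[candidate] = support
    return result

# The five head rows expressed as baskets (illustration only)
baskets = [
    ['SECURITY'],
    ['OFFICESUITE', 'SECURITY'],
    ['WEBHOSTING', 'SECURITY', 'CONTENTMGM'],
    ['SECURITY'],
    ['WEBHOSTING', 'OFFICESUITE', 'SECURITY', 'CONTENTMGM'],
]

loose = frequent_itemsets(baskets, min_support=0.4, max_length=2)
strict = frequent_itemsets(baskets, min_support=0.5, max_length=2)
print(len(loose), "itemsets with min_support=0.4")
print(len(strict), "itemsets with min_support=0.5")
```

Raising the minimum support shrinks the result quickly, which is why the three runs requested above can return very different numbers of rules.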

The function `association_result_list` makes the association rule results easier to read.

In [10]:
association_results = list(apriori(records_final))
print(association_results)

[RelationRecord(items=frozenset({'CONTENTMGM'}), support=0.20185922974767595, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'CONTENTMGM'}), confidence=0.20185922974767595, lift=1.0)]), RelationRecord(items=frozenset({'OFFICESUITE'}), support=0.2337317397078353, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'OFFICESUITE'}), confidence=0.2337317397078353, lift=1.0)]), RelationRecord(items=frozenset({'SECURITY'}), support=0.8074369189907038, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'SECURITY'}), confidence=0.8074369189907038, lift=1.0)]), RelationRecord(items=frozenset({'WEBHOSTING'}), support=0.36387782204515273, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'WEBHOSTING'}), confidence=0.36387782204515273, lift=1.0)]), RelationRecord(items=frozenset({'SECURITY', 'CONTENTMGM'}), support=0.18326693227091634, ordered_statistics=[OrderedStatistic

In [11]:
#MODIFIED
def association_result_list(association_results, show_text=False):
    """Return a DataFrame with one row per itemset, built from its first rule."""
    columns = ['Item_A', 'Item_B', 'Support', 'Confidence', 'Lift']
    rows = []
    for item in association_results:
        # Only the first OrderedStatistic of each RelationRecord is reported
        first_rule = item.ordered_statistics[0]
        item_origin = list(first_rule.items_base)
        item_destin = list(first_rule.items_add)
        # Skip entries with an empty antecedent, e.g. single-item itemsets
        if not item_origin:
            continue
        if show_text:
            print("Rule: " + str(item_origin) + " -> " + str(item_destin))
            print("Support: " + str(item.support))
            print("Confidence: " + str(first_rule.confidence))
            print("Lift: " + str(first_rule.lift))
            print("=====================================")
        rows.append([str(item_origin), str(item_destin), item.support,
                     first_rule.confidence, first_rule.lift])
    # Passing the column names keeps them available even when no rule is found
    return pd.DataFrame(rows, columns=columns)
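
The helper above reports only the first `OrderedStatistic` of each itemset. If every rule is of interest, a variant could loop over all of them. The following is a sketch under stated assumptions: the two namedtuples are stand-ins that mirror the field names visible in the printed output above (they are not the apyori classes), and `all_rules` and the example record are hypothetical.

```python
from collections import namedtuple

# Stand-ins mirroring the field names in the printed output; NOT apyori's classes
OrderedStatistic = namedtuple('OrderedStatistic',
                              ['items_base', 'items_add', 'confidence', 'lift'])
RelationRecord = namedtuple('RelationRecord',
                            ['items', 'support', 'ordered_statistics'])

def all_rules(records):
    """Return one dict per rule, keeping every statistic with a non-empty antecedent."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            if not stat.items_base:
                continue  # no antecedent, so not an A -> B rule
            rows.append({
                'Item_A': sorted(stat.items_base),
                'Item_B': sorted(stat.items_add),
                'Support': record.support,
                'Confidence': stat.confidence,
                'Lift': stat.lift,
            })
    return rows

# Made-up record carrying two rules for the same itemset
example = [RelationRecord(
    items=frozenset({'A', 'B'}),
    support=0.4,
    ordered_statistics=[
        OrderedStatistic(frozenset({'A'}), frozenset({'B'}), 0.8, 1.6),
        OrderedStatistic(frozenset({'B'}), frozenset({'A'}), 0.8, 1.6),
    ],
)]

rules = all_rules(example)
for row in rules:
    print(row)
```

The returned list of dicts can be passed to `pd.DataFrame(...)` to obtain a table with the same columns as the helper's output.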

Your output should look similar to this one, but numbers may vary depending on the lift and confidence parameters that you provide.

In [12]:
#Example 1
association_results = list(apriori(records_final, min_support = 0.01, min_confidence = 0.5, min_lift = 1, max_length=2))
pretty_table = association_result_list(association_results, show_text = False).sort_values('Lift', ascending=False) 
display(pretty_table)
pretty_table.to_excel('Tables/example1.xlsx')

Unnamed: 0,Item_A,Item_B,Support,Confidence,Lift
0,['CONTENTMGM'],['SECURITY'],0.183267,0.907895,1.124416
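
As a sanity check, the row above can be reproduced from the supports that apyori printed earlier for CONTENTMGM, SECURITY and the pair, using the definitions from section 0.4. A short sketch (the variable names are ours):

```python
# Supports as printed in the raw apriori output
support_contentmgm = 0.20185922974767595
support_security = 0.8074369189907038
support_pair = 0.18326693227091634   # {SECURITY, CONTENTMGM}

# Confidence(CONTENTMGM -> SECURITY) = Support(pair) / Support(CONTENTMGM)
confidence = support_pair / support_contentmgm
# Lift(CONTENTMGM -> SECURITY) = Confidence / Support(SECURITY)
lift = confidence / support_security

print(f"Confidence = {confidence:.6f}")
print(f"Lift       = {lift:.6f}")
```

A confidence of about 0.91 looks high, but SECURITY is present in about 81% of the baskets anyway, which is why the lift is only about 1.12.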


In [13]:
#Example 2
association_results = list(apriori(records_final, min_support = 0.005, min_confidence = 0.0, min_lift = 1.5, max_length=4))
pretty_table = association_result_list(association_results, show_text = False).sort_values('Lift', ascending=False) 
display(pretty_table)
pretty_table.to_excel('Tables/example2.xlsx')

Unnamed: 0,Item_A,Item_B,Support,Confidence,Lift
1,['CLOUD_PAAS'],['OFFICESUITE'],0.00664,0.833333,3.565341
2,"['CLOUD_IAAS', 'CONTENTMGM']",['OFFICESUITE'],0.011952,0.818182,3.500517
6,"['SECURITY', 'CLOUD_IAAS', 'CONTENTMGM']",['OFFICESUITE'],0.007968,0.75,3.208807
3,"['SECURITY', 'CLOUD_IAAS']",['OFFICESUITE'],0.030544,0.479167,2.050071
0,['CLOUD_IAAS'],['OFFICESUITE'],0.039841,0.447761,1.915706
4,"['CLOUD_IAAS', 'WEBHOSTING']",['OFFICESUITE'],0.010624,0.444444,1.901515
8,"['SECURITY', 'WEBHOSTING', 'OFFICESUITE']",['CONTENTMGM'],0.017264,0.371429,1.840038
7,"['SECURITY', 'CLOUD_IAAS', 'WEBHOSTING']",['OFFICESUITE'],0.007968,0.4,1.711364
5,"['WEBHOSTING', 'OFFICESUITE']",['CONTENTMGM'],0.01992,0.306122,1.516515


In [14]:
#Example 3
association_results = list(apriori(records_final, min_support = 0.02, min_confidence = 0.5, min_lift = 1, max_length=3))
pretty_table = association_result_list(association_results, show_text = False).sort_values('Lift', ascending=False) 
display(pretty_table)
pretty_table.to_excel('Tables/example3.xlsx')

Unnamed: 0,Item_A,Item_B,Support,Confidence,Lift
2,"['WEBHOSTING', 'CONTENTMGM']",['SECURITY'],0.067729,0.927273,1.148415
0,['CONTENTMGM'],['SECURITY'],0.183267,0.907895,1.124416
1,"['OFFICESUITE', 'CONTENTMGM']",['SECURITY'],0.047809,0.818182,1.013307


[**REPORT**] Considering the previous results:

- As a Data Scientist, what is your main recommendation to increase sales for the Big Internet Player? Explain why.
- When a customer purchases **CLOUD_PAAS**, which other product do they tend to buy as well? Why?
- Describe the type of customer that purchases the **OFFICESUITE** product.
- Indicate two products that do **NOT** tend to appear together. Why?

In [15]:
#CLOUD_PAAS
association_results = list(apriori(records_final, min_support = 0.001, min_confidence = 0.0, min_lift = 1, max_length=2))
pretty_table = association_result_list(association_results, show_text = False).sort_values('Lift', ascending=False) 
filtered = pretty_table.loc[pretty_table['Item_A'] == "['CLOUD_PAAS']"]
display(filtered)
filtered.to_excel('Tables/cloud_paas.xlsx')

Unnamed: 0,Item_A,Item_B,Support,Confidence,Lift
5,['CLOUD_PAAS'],['OFFICESUITE'],0.00664,0.833333,3.565341
4,['CLOUD_PAAS'],['CONTENTMGM'],0.003984,0.5,2.476974


In [16]:
#OFFICESUITE
association_results = list(apriori(records_final, min_support = 0.001, min_confidence = 0.0, min_lift = 0.0, max_length=4))
pretty_table = association_result_list(association_results, show_text = False).sort_values('Lift', ascending=False) 
filtered = pretty_table.loc[pretty_table['Item_A'] == "['OFFICESUITE']"]
display(filtered)
filtered.to_excel('Tables/officesuite.xlsx')

Unnamed: 0,Item_A,Item_B,Support,Confidence,Lift
15,['OFFICESUITE'],['SECURITY'],0.158035,0.676136,0.837386
16,['OFFICESUITE'],['WEBHOSTING'],0.065073,0.278409,0.765117


In [17]:
#LESS LIFT
association_results = list(apriori(records_final, min_support = 0.001, min_confidence = 0.0, min_lift = 0.0, max_length=2))
pretty_table = association_result_list(association_results, show_text = False).sort_values('Lift', ascending=True)
display(pretty_table.head())
pretty_table.head().to_excel('Tables/lift.xlsx')

Unnamed: 0,Item_A,Item_B,Support,Confidence,Lift
2,['ADVERTISING'],['OFFICESUITE'],0.001328,0.111111,0.475379
3,['ADVERTISING'],['SECURITY'],0.00664,0.555556,0.688048
8,['CLOUD_IAAS'],['WEBHOSTING'],0.023904,0.268657,0.738316
16,['OFFICESUITE'],['WEBHOSTING'],0.065073,0.278409,0.765117
5,['CLOUD_IAAS'],['CONTENTMGM'],0.014608,0.164179,0.813335


# 2. Deliver

Deliver:

* A zip file containing your notebook (.ipynb file) with all the [**CODE**] parts implemented.
* A 4-page PDF report including all parts of this notebook marked with "[**REPORT**]"

The report should end with the following statement: **I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.**