# LAB3

In [1]:
from tqdm import tqdm

## Association rules from frequent itemsets

1. First, you need to load the dataset into memory, using the csv module. Make sure you identify all valid rows. Also consider that rows having an InvoiceNo that starts with C should be discarded, as they indicate that the invoice is about a cancelled purchase.

Classic loading (csv module)

In [2]:
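import csv

online_retail_csv = []

with open('online_retail.csv') as f:
    for row in csv.reader(f):
        # Skip empty rows and cancelled invoices (InvoiceNo starting with 'C').
        # The header row is kept as the first element of the list.
        if row and not row[0].startswith('C'):
            online_retail_csv.append(row)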
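
The same filter can be keyed on column names instead of positions. The following is a minimal sketch, assuming the file has a header row with an `InvoiceNo` column; the in-memory sample (including the `StockCode` and `Description` values) and the helper name `load_valid_rows` are made up for illustration and only stand in for `online_retail.csv`.

```python
import csv
import io

# Hypothetical sample standing in for online_retail.csv (values are made up)
SAMPLE = """InvoiceNo,StockCode,Description
536365,A1,WHITE MUG
C536379,D,Discount
536366,B2,RED LAMP
,C3,ORPHAN ROW
"""

def load_valid_rows(f):
    """Return the rows whose InvoiceNo is present and not a cancellation."""
    rows = []
    for row in csv.DictReader(f):
        invoice = (row.get("InvoiceNo") or "").strip()
        # Drop rows without an invoice number and cancelled invoices ('C' prefix)
        if invoice and not invoice.startswith("C"):
            rows.append(row)
    return rows

valid_rows = load_valid_rows(io.StringIO(SAMPLE))
print([r["InvoiceNo"] for r in valid_rows])
```

With `csv.DictReader` the header is consumed automatically, so it never ends up among the data rows.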

Loading with pandas

In [3]:
import pandas as pd

online_retail_df = pd.read_csv('online_retail.csv')
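# na=False treats a missing InvoiceNo as "not cancelled" instead of propagating NaN into the mask
online_retail_df = online_retail_df[~(online_retail_df['InvoiceNo'].str.startswith('C', na=False))]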

2. Now that you have a dataset of items, you should aggregate it at an “invoice” level. For each invoice (identified by InvoiceNo) there can be multiple items (from multiple rows) in the dataset. For each invoice, you should build a list of all items belonging to it. 

Classic processing

In [4]:
# online_retail_csv_grouped = []

# invoices_set = set([row[0] for row in online_retail_csv])

# for invoice in tqdm(invoices_set):
#     online_retail_csv_grouped.append([invoice, [row[2] for row in online_retail_csv[1:] if row[0] == invoice]])
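
The commented-out loop rescans the whole dataset once per invoice, so its cost grows with (number of invoices) × (number of rows). A single pass with a dictionary produces the same grouping. The sketch below assumes rows laid out like those of `online_retail_csv` (invoice number in position 0, description in position 2) and uses made-up sample rows.

```python
from collections import defaultdict

# Made-up rows in the same layout as online_retail_csv: [InvoiceNo, StockCode, Description]
rows = [
    ["536365", "A1", "WHITE MUG"],
    ["536365", "B2", "RED LAMP"],
    ["536366", "A1", "WHITE MUG"],
]

items_by_invoice = defaultdict(list)
for row in rows:
    # One pass: append each description to the list of its invoice
    items_by_invoice[row[0]].append(row[2])

# Same [invoice, [items...]] shape as online_retail_csv_grouped
grouped = [[invoice, items] for invoice, items in items_by_invoice.items()]
print(grouped)
```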

Processing with pandas

In [5]:
online_retail_df_grouped = online_retail_df.filter(['InvoiceNo', 'Description'])
# Collect all descriptions of each invoice into a single list
online_retail_df_grouped = online_retail_df_grouped.groupby('InvoiceNo').agg({'Description': list})
online_retail_df_grouped = online_retail_df_grouped.reset_index()

3. You should now have a list (one for each invoice) of lists (each list containing the items bought for that invoice). Now, we need to convert this into a matrix form. Of the many possible formats, we will use the one expected by the Mlxtend library, which is as follows. Given an ordered list of M possible items (in this case, all possible products that can be bought), and given N itemsets (in this case, invoices), we should build a matrix of N rows and M columns. The element at the ith row and jth column should be 1 if the ith itemset (invoice) contains the jth item (product), 0 otherwise. 

Baseline (plain Python)

In [6]:
# all_items = set()

# for row in online_retail_csv_grouped:
#     all_items = all_items.union(set(row[1]))

In [7]:
# pa_matrix = [] 

# all_items = list(all_items)

# for row in online_retail_csv_grouped:
#     pa_matrix.append([1 if col in row[1] else 0 for col in all_items])

In [8]:
# df = pd.DataFrame(data=pa_matrix, columns=all_items)
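
The N × M presence matrix described above can be illustrated on a tiny example. This is a minimal sketch with made-up invoices; it fixes the column order by sorting the items and converts each invoice to a set so that membership tests take constant time on average.

```python
# Made-up invoices: each inner list holds the items of one invoice
invoices = [
    ["WHITE MUG", "RED LAMP"],
    ["WHITE MUG"],
    ["BLUE PEN", "RED LAMP"],
]

# Fix a column order once so that every row uses the same layout
all_items = sorted({item for invoice in invoices for item in invoice})

pa_matrix = []
for invoice in invoices:
    present = set(invoice)  # set lookup instead of scanning the list each time
    pa_matrix.append([1 if item in present else 0 for item in all_items])

print(all_items)
for row in pa_matrix:
    print(row)
```

Each row corresponds to one invoice and each column to one product, in the order given by `all_items`.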

mlxtend

In [9]:
from mlxtend.preprocessing import TransactionEncoder

In [22]:
# Cast every item to str so that missing descriptions (NaN floats) do not mix with strings
descr_list = online_retail_df_grouped['Description'].apply(lambda x: [str(item) for item in x]).tolist()

# for i, row in enumerate(descr_list):
#     descr_list[i] = [str(item) for item in row]

te = TransactionEncoder()
online_retail_df_dummies = te\
    .fit(descr_list)\
    .transform(descr_list)


online_retail_df_dummies = pd.DataFrame(data=online_retail_df_dummies, columns=te.columns_)

4. With the DataFrame built in the previous exercise (df in the text, online_retail_df_dummies in this notebook), you can now use the fpgrowth function, which is described in detail in the official documentation. The first argument is the previously built DataFrame. The second is the minimum support (minsup), i.e. the minimum fraction of the entire dataset in which an itemset must appear for it to be considered “frequent”. Try different values of minsup, such as 0.5, 0.1, 0.05, 0.02, 0.01.
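
To make the role of minsup concrete, here is a brute-force sketch on made-up transactions. It is not FP-growth: it enumerates every candidate itemset, which only works for a handful of items. The names `support` and `frequent_itemsets` are hypothetical helpers, not mlxtend functions.

```python
from itertools import combinations

# Made-up transactions, one set of items per invoice
transactions = [
    {"mug", "lamp"},
    {"mug", "pen"},
    {"mug", "lamp", "pen"},
    {"lamp"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, minsup):
    """Brute-force enumeration; only viable for tiny examples."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= minsup:
                result[frozenset(candidate)] = s
    return result

# Lowering minsup can only add itemsets, never remove them
for minsup in (0.75, 0.5, 0.25):
    print(minsup, len(frequent_itemsets(transactions, minsup)))
```

As in the experiment on the real dataset, the number of frequent itemsets grows as minsup decreases.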

In [11]:
from mlxtend.frequent_patterns import fpgrowth

In [28]:
minsups = [.5,.1,.05,.02,.01]
for minsup in tqdm(minsups):
    fi = fpgrowth(online_retail_df_dummies, minsup)
    print(len(fi))

0
1
23
303
1472


5. Consider the itemsets extracted for minsup = 0.02. How many items are contained? Which ones would you consider the most useful?

In [29]:
fi = fpgrowth(online_retail_df_dummies, .02)
print(len(fi))

303


Extract the itemsets with at least 2 items

In [31]:
fi[fi['itemsets'].map(len) >= 2]

| | support | itemsets |
|---|---|---|
| 246 | 0.021392 | (1824, 1823) |
| 247 | 0.029007 | (161, 165) |
| 248 | 0.021302 | (164, 165) |
| 249 | 0.024429 | (3969, 3965) |
| 250 | 0.022435 | (2836, 3903) |
| 251 | 0.026242 | (1857, 2045) |
| 252 | 0.037391 | (1857, 1855) |
| 253 | 0.023341 | (1870, 1855) |
| 254 | 0.032814 | (1857, 1870) |
| 255 | 0.026514 | (1857, 1845) |
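
The itemsets are reported as column indices because `fpgrowth` was called without `use_colnames=True`; passing that flag makes mlxtend return the column names (here, product descriptions) instead. Indices can also be translated afterwards. A minimal sketch, where `columns` is a made-up stand-in for `online_retail_df_dummies.columns`, `itemsets_by_index` for the `itemsets` column, and `to_names` is a hypothetical helper:

```python
# Hypothetical stand-ins: real code would use online_retail_df_dummies.columns
columns = ["BLUE PEN", "RED LAMP", "WHITE MUG"]
itemsets_by_index = [frozenset({0, 2}), frozenset({1})]

def to_names(itemset, columns):
    """Translate a set of column indices into the corresponding column names."""
    return frozenset(columns[i] for i in itemset)

named = [to_names(s, columns) for s in itemsets_by_index]
print(named)
```

Readable names make it much easier to judge which itemsets are actually useful.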


6. Use the value returned by fpgrowth to extract the relevant association rules.


In [34]:
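M = online_retail_df_dummies.values
# Support of each item: fraction of invoices (rows) that contain it
support_2656 = len(M[M[:, 2656] == 1])/len(M)
support_1599 = len(M[M[:, 1599] == 1])/len(M)
# Joint support: fraction of invoices that contain both items
support_both = len(M[(M[:, 2656] == 1) & (M[:, 1599] == 1)])/len(M)
# confidence(A => B) = support(A and B) / support(A)
print(f"Confidence 2656 => 1599: {support_both / support_2656}")
print(f"Confidence 1599 => 2656: {support_both / support_1599}")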

Confidence 2656 => 1599: 0.0
Confidence 1599 => 2656: 0.0
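
A confidence of 0.0 in both directions means that the two items never appear in the same invoice, so their joint support is zero. The sketch below reproduces the same computation on made-up transactions, including one pair that never co-occurs; `support` and `confidence` are hypothetical helper names.

```python
# Made-up transactions, one set of items per invoice
transactions = [
    {"mug", "lamp"},
    {"mug", "pen"},
    {"mug", "lamp", "pen"},
    {"lamp"},
    {"tea"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support(A and B together) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Confidence is not symmetric: the two directions generally differ
print("pen => mug:", confidence({"pen"}, {"mug"}, transactions))
print("mug => pen:", confidence({"mug"}, {"pen"}, transactions))
# "tea" and "mug" never share a transaction, so the confidence is zero
print("tea => mug:", confidence({"tea"}, {"mug"}, transactions))
```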


7. Extract the association rules from the frequent itemsets extracted with minsup = 0.01. The documentation for association_rules() is available in the official documentation. You can use the confidence as the metric to identify the rules, with a minimum threshold of 0.85 (feel free to vary these values and observe how the results change).

In [39]:
from mlxtend.frequent_patterns import association_rules

fi = fpgrowth(online_retail_df_dummies, .01)

ar = association_rules(fi, metric='confidence', min_threshold=.85)

In [40]:
ar

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (3546, 722) | (2860) | 0.017177 | 0.046864 | 0.014775 | 0.860158 | 18.354481 | 0.01397 | 6.815824 |
| 1 | (3546, 722, 723) | (2860) | 0.012282 | 0.046864 | 0.011104 | 0.904059 | 19.291256 | 0.010528 | 9.934613 |
| 2 | (3546, 722, 3980) | (2860) | 0.01192 | 0.046864 | 0.010968 | 0.920152 | 19.634657 | 0.010409 | 11.936898 |
| 3 | (3546, 723, 3980) | (2860) | 0.013733 | 0.046864 | 0.011784 | 0.858086 | 18.310257 | 0.01114 | 6.716286 |
| 4 | (1868, 1870, 1855) | (1857) | 0.014005 | 0.094815 | 0.012146 | 0.867314 | 9.147426 | 0.010819 | 6.822003 |
| 5 | (3337, 3291) | (3338) | 0.013416 | 0.023885 | 0.012011 | 0.89527 | 37.482435 | 0.01169 | 9.320323 |
| 6 | (1730) | (1731) | 0.010877 | 0.010741 | 0.010016 | 0.920833 | 85.726864 | 0.009899 | 12.495897 |
| 7 | (1731) | (1730) | 0.010741 | 0.010877 | 0.010016 | 0.932489 | 85.726864 | 0.009899 | 14.651378 |
| 8 | (722, 723, 3980) | (2860) | 0.013234 | 0.046864 | 0.011376 | 0.859589 | 18.342333 | 0.010756 | 6.78819 |
| 9 | (3560, 1857) | (1091) | 0.012781 | 0.032088 | 0.011285 | 0.882979 | 27.517009 | 0.010875 | 8.271244 |
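
The derived columns of the table can be recomputed from the three support values. The sketch below does so for rule 6, using the standard definitions of confidence, lift, leverage and conviction. The inputs are the rounded values shown in the table, so the results are expected to match the table only approximately.

```python
# Values copied from rule 6 in the table above (rounded to 6 decimals)
antecedent_support = 0.010877
consequent_support = 0.010741
rule_support = 0.010016

# confidence(A => B) = support(A and B) / support(A)
confidence = rule_support / antecedent_support
# lift = confidence / support(B); values far above 1 indicate strong positive association
lift = confidence / consequent_support
# leverage = support(A and B) - support(A) * support(B)
leverage = rule_support - antecedent_support * consequent_support
# conviction = (1 - support(B)) / (1 - confidence)
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.2f} "
      f"leverage={leverage:.6f} conviction={conviction:.2f}")
```

Rules 6 and 7 involve the same pair of items in opposite directions, which is why they share the same support, lift and leverage but differ in confidence and conviction.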


## Apriori implementation