If you are using Google Colab, run the cell below to mount your Google Drive in the Colab instance. Adjust the path if your files are stored in a different folder of your Google Drive.

If you are not using Google Colab, the import fails and the cell does nothing, so you can safely run it anyway.

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd 'drive/My Drive/Colab Notebooks/08_Association'
except ImportError:
    # not running on Google Colab: nothing to mount
    pass

# Association
## Frequent Itemsets & Association Rules
- Frequent Itemset
    - Support count: frequency of an itemset (number of transactions that contain it)
    - Support: relative frequency of an itemset (w.r.t. all transactions)
- Association Rule 𝑋 → 𝑌
    - Support: support of the itemset 𝑋 ∪ 𝑌
    - Confidence: relative frequency of 𝑋 ∪ 𝑌 w.r.t. 𝑋
        - “If a transaction contains 𝑋, in x% of the cases it also contains 𝑌”
    - Lift: confidence of rule 𝑋 → 𝑌 divided by support of consequent 𝑌
        - \> 1: 𝑋 and 𝑌 are positively correlated
        - < 1: 𝑋 and 𝑌 are negatively correlated
        - = 1: 𝑋 and 𝑌 are independent
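
The definitions above can be checked by hand on a few transactions. The following is a minimal sketch in plain Python; the transactions and the helper names (`support_count`, `support`, `confidence`, `lift`) are made up for illustration and are not part of any library used later.

```python
# Toy transactions (hypothetical data, not the ShoppingBaskets file).
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "sleeve"},
    {"laptop", "sleeve"},
    {"printer", "toner"},
    {"mouse"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Support count relative to the number of all transactions."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Share of the transactions containing X that also contain Y."""
    return support_count(x | y, transactions) / support_count(x, transactions)


def lift(x, y, transactions):
    """Confidence of X -> Y divided by the support of the consequent Y."""
    return confidence(x, y, transactions) / support(y, transactions)


x, y = {"laptop"}, {"mouse"}
rule_support = support(x | y, transactions)        # 2 of 5 transactions
rule_confidence = confidence(x, y, transactions)   # 2 of the 3 laptop transactions
rule_lift = lift(x, y, transactions)               # slightly above 1
print(rule_support, rule_confidence, rule_lift)
```

Here the lift is a little larger than 1, so in this toy data buying a laptop makes a mouse slightly more likely than it is overall.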

## Python Library for Frequent Itemsets & Association Rules

Scikit-learn does not include algorithms for frequent itemset generation and association rules. In this exercise, we will use [the implementations from the Orange library](https://orange3-associate.readthedocs.io/en/latest/scripting.html).

This package offers you three functions:
- [```frequent_itemsets()```](https://orange3-associate.readthedocs.io/en/latest/scripting.html#fpgrowth.frequent_itemsets): Generates frequent itemsets from a dataset
- [```association_rules()```](https://orange3-associate.readthedocs.io/en/latest/scripting.html#fpgrowth.association_rules): Generates association rules from frequent itemsets
- [```rules_stats()```](https://orange3-associate.readthedocs.io/en/latest/scripting.html#fpgrowth.rules_stats): Calculates additional statistics for association rules from frequent itemsets
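
To make concrete what "generating frequent itemsets" produces, here is a brute-force sketch in plain Python. It is not how the library works (its module name indicates FP-growth, which is far more efficient); the function `brute_force_frequent_itemsets` and the toy baskets are hypothetical and only illustrate the kind of result: a mapping from itemsets of item indices to support counts.

```python
from itertools import combinations


def brute_force_frequent_itemsets(transactions, min_support):
    """Return {frozenset(items): support_count} for every itemset whose
    relative support is at least `min_support`.

    Tries all item combinations, so it is only usable for tiny examples.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[frozenset(candidate)] = count
                found_any = True
        if not found_any:
            # No frequent itemset of this size means none of a larger size
            # can exist, because every subset of a frequent itemset is frequent.
            break
    return result


# Items are encoded as integer indices, one set per transaction.
baskets = [{0, 1}, {0, 1, 2}, {0, 2}, {1}]
frequent = brute_force_frequent_itemsets(baskets, 0.5)
for itemset, count in frequent.items():
    print(sorted(itemset), count)
```

The early exit relies on the fact that support can only shrink when items are added to an itemset.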

In [3]:
import Orange

#%pip install -q -U Orange3-Associate

In [1]:

import pandas as pd
shopping = pd.read_excel('ShoppingBaskets.xls')
shopping_data = shopping.drop('BasketNo', axis=1)
shopping_data.head()


Unnamed: 0,ThinkPad X220,Asus EeePC,HP Laserjet P2055,2 GB DDR3 RAM,8 GB DDR3 RAM,Lenovo Tablet Sleeve,Netbook-Schutzhülle,HP CE50 Toner,LT Laser Maus,LT Minimaus
0,1,0,0,0,1,1,0,0,0,1
1,0,1,0,1,0,0,1,0,1,0
2,1,0,1,0,0,1,0,1,1,0
3,0,1,0,1,0,0,1,0,0,1
4,0,1,1,1,0,0,0,1,0,0


In [4]:
# I made some adjustments here
import pandas as pd
import numpy as np

x = np.array([1,2,34])

y = 1 + x
y

array([ 2,  3, 35])

### Frequent Itemsets

In [2]:
from orangecontrib.associate.fpgrowth import frequent_itemsets

# calculate the frequent itemsets (minimum support: 20% of all transactions)
itemsets = dict(frequent_itemsets(shopping_data.values, 0.20))

# store the results in a dataframe
n_transactions = len(shopping_data.index)
rows = []
for itemset, support_count in itemsets.items():
    # item indices refer to column positions, so map them back to the column names
    item_names = ",".join(shopping_data.columns[item_index] for item_index in itemset)
    rows.append((len(itemset), support_count, support_count / n_transactions, item_names))

item_set_table = pd.DataFrame(rows, columns=["size", "support count", "support", "items"])
item_set_table.sort_values('support', ascending=False)


We can filter the results using conditions on the dataframe:

In [13]:
display(item_set_table[ item_set_table['items'].str.contains('ThinkPad X220') ])

Unnamed: 0,size,support count,support,items
0,1,4,0.4,ThinkPad X220
3,2,2,0.2,"ThinkPad X220 ,HP Laserjet P2055"
7,2,2,0.2,"ThinkPad X220 ,8 GB DDR3 RAM"
9,2,3,0.3,"ThinkPad X220 ,Lenovo Tablet Sleeve"
11,3,2,0.2,"ThinkPad X220 ,HP Laserjet P2055,Lenovo Tablet..."
17,2,2,0.2,"ThinkPad X220 ,HP CE50 Toner"
19,3,2,0.2,"ThinkPad X220 ,HP Laserjet P2055,HP CE50 Toner"
21,3,2,0.2,"ThinkPad X220 ,Lenovo Tablet Sleeve,HP CE50 Toner"
23,4,2,0.2,"ThinkPad X220 ,HP Laserjet P2055,Lenovo Tablet..."
25,2,2,0.2,"ThinkPad X220 ,LT Laser Maus"
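
Several conditions can be combined in the same way. The sketch below uses a made-up miniature table (`demo`, with illustrative values) rather than the real `item_set_table`, so it runs on its own.

```python
import pandas as pd

# Hypothetical miniature version of an itemset table (values are illustrative).
demo = pd.DataFrame({
    "size": [1, 2, 2, 3],
    "support": [0.4, 0.2, 0.3, 0.2],
    "items": ["A", "A,B", "A,C", "A,B,C"],
})

# Combine conditions with & (and) or | (or); comparisons need their own parentheses.
# regex=False treats the search text literally instead of as a regular expression.
pairs_with_b = demo[(demo["size"] == 2) & demo["items"].str.contains("B", regex=False)]

# query() is an alternative for conditions on plain column names.
well_supported = demo.query("support >= 0.3")

print(pairs_with_b)
print(well_supported)
```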


### Association rules

In [5]:
from orangecontrib.associate.fpgrowth import association_rules, rules_stats

# calculate association rules from the itemsets (minimum confidence: 0.70)
rules = association_rules(itemsets, 0.70)

# calculate statistics about the rules and store them in a dataframe
# note: 'Support' is the absolute support count of the rule, not a fraction
rows = []
for premise, conclusion, sup, conf, cov, strength, lift, leverage in rules_stats(rules, itemsets, len(shopping_data)):
    premise_names = ",".join(shopping_data.columns[item_index] for item_index in premise)
    conclusion_names = ",".join(shopping_data.columns[item_index] for item_index in conclusion)
    rows.append((premise_names, conclusion_names, sup, conf, cov, strength, lift, leverage))

pd.DataFrame(rows, columns=['Premise', 'Conclusion', 'Support', 'Confidence', 'Coverage', 'Strength', 'Lift', 'Leverage'])

Unnamed: 0,Premise,Conclusion,Support,Confidence,Coverage,Strength,Lift,Leverage
0,"HP Laserjet P2055,Lenovo Tablet Sleeve,HP CE50...",ThinkPad X220,2,1.00,0.2,2.00,2.500000,0.12
1,"ThinkPad X220 ,Lenovo Tablet Sleeve,HP CE50 Toner",HP Laserjet P2055,2,1.00,0.2,1.50,3.333333,0.14
2,"Lenovo Tablet Sleeve,HP CE50 Toner","ThinkPad X220 ,HP Laserjet P2055",2,1.00,0.2,1.00,5.000000,0.16
3,"ThinkPad X220 ,HP Laserjet P2055,HP CE50 Toner",Lenovo Tablet Sleeve,2,1.00,0.2,2.00,2.500000,0.12
4,"ThinkPad X220 ,HP CE50 Toner","HP Laserjet P2055,Lenovo Tablet Sleeve",2,1.00,0.2,1.00,5.000000,0.16
...,...,...,...,...,...,...,...,...
60,2 GB DDR3 RAM,Netbook-Schutzhülle,4,0.80,0.5,0.80,2.000000,0.20
61,HP CE50 Toner,HP Laserjet P2055,3,1.00,0.3,1.00,3.333333,0.21
62,HP Laserjet P2055,HP CE50 Toner,3,1.00,0.3,1.00,3.333333,0.21
63,LT Minimaus,Asus EeePC,4,0.80,0.5,1.20,1.333333,0.10
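
The columns of this output are related to each other, which can be verified for a single row. The sketch below recomputes confidence, lift and leverage for the rule "HP CE50 Toner → HP Laserjet P2055" (row 61) from its other columns. It assumes 10 transactions in total (inferred from a support count of 4 corresponding to a support of 0.4 in the itemset table), that coverage is the support of the premise, and that strength is the support of the conclusion divided by the support of the premise; these readings are consistent with the numbers shown but are assumptions here.

```python
# Values read off the output for the rule "HP CE50 Toner -> HP Laserjet P2055".
n_transactions = 10   # assumption: inferred from count 4 <-> support 0.4
rule_count = 3        # "Support" column (an absolute count)
coverage = 0.3        # assumed meaning: support of the premise
strength = 1.0        # assumed meaning: support(conclusion) / support(premise)

support_xy = rule_count / n_transactions
support_x = coverage
support_y = strength * support_x

confidence = support_xy / support_x               # share of premise baskets with the conclusion
lift = confidence / support_y                     # confidence relative to the conclusion's support
leverage = support_xy - support_x * support_y     # observed minus expected joint support

print(round(confidence, 2), round(lift, 6), round(leverage, 2))
```

A lift well above 1 and a positive leverage both indicate that toner and printer appear together more often than independence would suggest.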


### Preprocessing in pandas

We now look at some more options for data preprocessing using pandas dataframes.

In [6]:
from scipy.io import arff
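# loadarff accepts a file name directly, so no file handle is left open
adult_arff_data, adult_arff_meta = arff.loadarff('adult-dataset-tweaked.arff')
adult = pd.DataFrame(adult_arff_data)
# nominal attributes are loaded as bytes; decode them to regular strings
# (pandas >= 2.1 offers DataFrame.map as the replacement for the deprecated applymap)
adult = adult.applymap(lambda x: x.decode('utf8') if hasattr(x, 'decode') else x)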
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25.0,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,X,0.0,0.0,40.0,United-States,X
1,25.0,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K
2,38.0,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K
3,28.0,Local-gov,336951.0,Assoc-acdm,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K
4,44.0,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K


To merge several categorical values into a single one, we can use the ```replace()``` method:

In [7]:
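# assign the result back instead of calling replace(..., inplace=True) on the selected column,
# which newer pandas versions warn about and which may leave `adult` unchanged
adult['education'] = adult['education'].replace(['Bachelors', 'Masters', 'Assoc-acdm', 'Prof-school', 'Assoc-voc', 'Doctorate'], 'Other-Grad')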
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25.0,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,X,0.0,0.0,40.0,United-States,X
1,25.0,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K
2,38.0,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K
3,28.0,Local-gov,336951.0,Other-Grad,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K
4,44.0,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K
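
The behaviour of `replace()` is easiest to see on a small Series. This is a minimal sketch on made-up values; the label `No-HS-degree` is invented for the example and does not occur in the dataset.

```python
import pandas as pd

# Hypothetical sample, not read from the adult dataset file.
education = pd.Series(["Bachelors", "HS-grad", "Masters", "11th", "Doctorate"])

# A list of old values plus one new value merges them into a single category.
merged = education.replace(["Bachelors", "Masters", "Doctorate"], "Other-Grad")

# A dict maps old values to new values, which allows several merges in one call.
merged = merged.replace({"11th": "No-HS-degree"})

print(merged.tolist())
```

Values that are not listed, such as `HS-grad`, are left untouched, and the original Series is not modified because the result is assigned to a new name.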


If we don't want to specify all values individually, we can also replace all values that satisfy a condition using the ```loc[]``` accessor:

In [8]:
adult.loc[ adult['native-country'] != 'United-States', 'native-country'] = 'Non-US'
adult.sort_values(by='native-country').head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
25728,28.0,Private,398220.0,5th-6th,3.0,Never-married,Craft-repair,Other-relative,White,Male,0.0,0.0,40.0,Non-US,<=50K
29341,24.0,Private,385540.0,HS-grad,9.0,Married-civ-spouse,Sales,Husband,White,Male,0.0,0.0,40.0,Non-US,<=50K
10991,38.0,Private,43311.0,5th-6th,3.0,Married-civ-spouse,Other-service,Wife,White,Female,0.0,0.0,40.0,Non-US,<=50K
44338,63.0,Private,158199.0,1st-4th,2.0,Widowed,Machine-op-inspct,Unmarried,White,Female,0.0,0.0,44.0,Non-US,<=50K
38921,34.0,Private,340917.0,Other-Grad,15.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,2829.0,0.0,50.0,Non-US,<=50K
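
The same pattern works with any boolean mask. A minimal sketch on a made-up frame (`people` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sample with two of the dataset's column names.
people = pd.DataFrame({
    "native-country": ["United-States", "Mexico", "Germany", "United-States"],
    "hours-per-week": [40.0, 60.0, 20.0, 45.0],
})

# The boolean mask selects the rows, the second argument the column to overwrite.
is_foreign = people["native-country"] != "United-States"
people.loc[is_foreign, "native-country"] = "Non-US"

# Masks can be combined with & and |, e.g. to cap values for one group only.
long_hours_abroad = is_foreign & (people["hours-per-week"] > 50)
people.loc[long_hours_abroad, "hours-per-week"] = 50.0

print(people)
```

Computing the mask before the assignment matters here: after the replacement, the original country names are no longer available for further conditions.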


Besides scikit-learn's `KBinsDiscretizer`, we can also discretize numeric values with the pandas [```cut()``` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html), which assigns each value to one of the intervals defined by a list of bin edges.

In [9]:
# bins are right-closed by default: (0, 20] -> low, (20, 65] -> middle, (65, 100] -> high
adult['age'] = pd.cut(adult['age'], [0, 20, 65, 100], labels=['low', 'middle', 'high'])
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,middle,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,X,0.0,0.0,40.0,United-States,X
1,middle,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K
2,middle,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K
3,middle,Local-gov,336951.0,Other-Grad,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K
4,middle,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K
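
Which bin a value on a boundary falls into depends on the `right` parameter of `cut()`. A minimal sketch with made-up ages that sit on or next to the bin edges:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([17, 20, 21, 65, 66, 100])
edges = [0, 20, 65, 100]
names = ["low", "middle", "high"]

# Default (right=True): intervals are (0, 20], (20, 65], (65, 100].
right_closed = pd.cut(ages, edges, labels=names)

# right=False: intervals are [0, 20), [20, 65), [65, 100).
left_closed = pd.cut(ages, edges, labels=names, right=False)

print(pd.DataFrame({"age": ages, "right_closed": right_closed, "left_closed": left_closed}))
```

Values outside all intervals become missing values: with `right=False` the age 100 is not covered by `[65, 100)`, so it is worth checking the result for NaN after binning.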
