If you are using Google Colab, run the cell below to mount your Google Drive in your Colab instance. Adjust the path in the cell if your files are stored elsewhere in your Google Drive.

If you are not using Google Colab, the cell does nothing: the import fails and the resulting error is ignored.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd 'drive/My Drive/Colab Notebooks/08_Association'
except ImportError:
    # not running on Google Colab, so there is nothing to mount
    pass

# Association
## Frequent Itemsets & Association Rules
- Frequent Itemset
    - Support count: frequency of an itemset
    - Support: relative frequency of an itemset (w.r.t. all transactions)
- Association Rule 𝑋→𝑌
    - Support: support of the itemset 𝑋 ∪ 𝑌
    - Confidence: relative frequency of 𝑋 ∪ 𝑌 w.r.t. 𝑋
        - “If an itemset contains 𝑋, in x% of the cases it also contains 𝑌”
    - Lift: confidence of rule 𝑋→𝑌 divided by support of consequent 𝑌
        - \>1: 𝑋 and 𝑌 are positively correlated
        - <1: 𝑋 and 𝑌 are negatively correlated
        - =1: 𝑋 and 𝑌 are independent
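
To make the definitions above concrete, here is a minimal sketch that computes support, confidence, and lift for a single rule directly from a list of transactions. The transactions and the helper function names are made up for illustration; they are not part of any library.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Support count relative to the number of all transactions."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Relative frequency of X ∪ Y among the transactions that contain X."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

def lift(X, Y, transactions):
    """Confidence of X → Y divided by the support of the consequent Y."""
    return confidence(X, Y, transactions) / support(Y, transactions)

X, Y = {"bread"}, {"milk"}
rule_support = support(X | Y, transactions)
rule_confidence = confidence(X, Y, transactions)
rule_lift = lift(X, Y, transactions)
print(rule_support, rule_confidence, rule_lift)
```

In this toy data, three of the five transactions contain both items (support 0.6), three of the four transactions with bread also contain milk (confidence 0.75), and the lift is below 1 because milk on its own already appears in 80% of all transactions.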

## Python Library for Frequent Itemsets & Association Rules

Scikit-learn does not include algorithms for frequent itemset generation and association rules. In this exercise, we will use [the implementations from the Orange library](https://orange3-associate.readthedocs.io/en/latest/scripting.html).

This package offers you three functions:
- [```frequent_itemsets()```](https://orange3-associate.readthedocs.io/en/latest/scripting.html#fpgrowth.frequent_itemsets): Generates frequent itemsets from a dataset
- [```association_rules()```](https://orange3-associate.readthedocs.io/en/latest/scripting.html#fpgrowth.association_rules): Generates association rules from frequent itemsets
- [```rules_stats()```](https://orange3-associate.readthedocs.io/en/latest/scripting.html#fpgrowth.rules_stats): Calculates additional statistics for association rules from frequent itemsets

In [None]:
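# uncomment the next line to install the Orange association add-on
#%pip install -q -U Orange3-Associate
import pandas as pd

# load the baskets and drop the basket id, so that only the item columns remain
shopping = pd.read_excel('ShoppingBaskets.xls')
shopping_data = shopping.drop('BasketNo', axis=1)
shopping_data.head()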

### Frequent Itemsets

In [None]:
from orangecontrib.associate.fpgrowth import frequent_itemsets

# calculate the frequent itemsets with a minimum support of 20%
# (the result maps each itemset, a frozenset of column indices, to its support count)
itemsets = dict(frequent_itemsets(shopping_data.values, 0.20))

# store the results in a dataframe
rows = []
for itemset, support_count in itemsets.items():
    domain_names = ",".join([shopping_data.columns[item_index] for item_index in itemset])
    rows.append((len(itemset), support_count, support_count / len(shopping_data.index), domain_names))

item_set_table = pd.DataFrame(rows, columns=["size", "support count", "support", "items"])
item_set_table.sort_values('support', ascending=False)
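
To see what kind of result the cell above works with, the following sketch enumerates frequent itemsets by brute force on a tiny 0/1 basket matrix. It is not the FP-growth algorithm used by Orange and is only practical for a handful of items; the function name `brute_force_itemsets` and the data are hypothetical. The returned dictionary has the same shape as `itemsets` above: item indices mapped to support counts.

```python
from itertools import combinations

# Hypothetical 0/1 basket matrix: rows are transactions, columns are items
baskets = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
]

def brute_force_itemsets(baskets, min_support):
    """Return {frozenset of column indices: support count} for all frequent itemsets."""
    n_rows, n_cols = len(baskets), len(baskets[0])
    min_count = min_support * n_rows
    result = {}
    for size in range(1, n_cols + 1):
        found_any = False
        for cols in combinations(range(n_cols), size):
            # count the baskets that contain every item of the candidate itemset
            count = sum(1 for row in baskets if all(row[c] for c in cols))
            if count >= min_count:
                result[frozenset(cols)] = count
                found_any = True
        if not found_any:
            break  # no frequent itemset of this size, so none of a larger size either
    return result

toy_itemsets = brute_force_itemsets(baskets, 0.3)
for itemset, count in toy_itemsets.items():
    print(sorted(itemset), count, count / len(baskets))
```

The early `break` relies on the fact that every subset of a frequent itemset is itself frequent, which is the same property that efficient algorithms exploit to avoid enumerating all combinations.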

We can filter the results using conditions on the dataframe:

In [None]:
display(item_set_table[item_set_table['items'].str.contains('ThinkPad X220')])
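
Several conditions can also be combined in one filter. The sketch below uses a small, made-up table with the same column names as `item_set_table`; the values are hypothetical.

```python
import pandas as pd

# Hypothetical itemset table, for illustration only
toy_table = pd.DataFrame({
    "size": [1, 1, 2, 2],
    "support": [0.8, 0.4, 0.6, 0.2],
    "items": ["A", "B", "A,B", "A,C"],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
selected = toy_table[(toy_table["size"] == 2) & (toy_table["support"] >= 0.5)]
print(selected)

# str.contains() interprets its argument as a regular expression by default;
# regex=False searches for the literal text instead
with_c = toy_table[toy_table["items"].str.contains("C", regex=False)]
print(with_c)
```

Passing `regex=False` is worth considering when item names contain characters such as `+` or `(`, which have a special meaning in regular expressions.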

### Association rules

In [None]:
from orangecontrib.associate.fpgrowth import association_rules, rules_stats

# calculate association rules with a minimum confidence of 70% from the itemsets
rules = association_rules(itemsets, 0.70)

# calculate statistics about the rules and store them in a dataframe
rows = []
for premise, conclusion, sup, conf, cov, strength, lift, leverage in rules_stats(rules, itemsets, len(shopping_data)):
    premise_names = ",".join([shopping_data.columns[item_index] for item_index in premise])
    conclusion_names = ",".join([shopping_data.columns[item_index] for item_index in conclusion])
    rows.append((premise_names, conclusion_names, sup, conf, cov, strength, lift, leverage))

pd.DataFrame(rows, columns=['Premise', 'Conclusion', 'Support', 'Confidence', 'Coverage', 'Strength', 'Lift', 'Leverage'])
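
The sketch below illustrates how rules and their confidence can be derived from a dictionary of frequent itemsets alone, without going back to the transactions. It is a simplified illustration under our own assumptions, not a description of how Orange implements `association_rules()`; the function `derive_rules` and the itemset counts are hypothetical.

```python
from itertools import combinations

# Hypothetical frequent itemsets: {frozenset of item indices: support count}
toy_itemsets = {
    frozenset({0}): 4,
    frozenset({1}): 4,
    frozenset({2}): 2,
    frozenset({0, 1}): 3,
    frozenset({0, 2}): 2,
}
n_transactions = 5

def derive_rules(itemsets, min_confidence):
    """Yield (premise, conclusion, support count, confidence) for each qualifying rule."""
    for itemset, count in itemsets.items():
        # split every itemset with at least two items into premise and conclusion
        for size in range(1, len(itemset)):
            for premise in map(frozenset, combinations(sorted(itemset), size)):
                # the premise is a subset of a frequent itemset, so it is in the dict
                confidence = count / itemsets[premise]
                if confidence >= min_confidence:
                    yield premise, itemset - premise, count, confidence

toy_rules = list(derive_rules(toy_itemsets, 0.7))
for premise, conclusion, count, confidence in toy_rules:
    # lift: confidence divided by the support of the conclusion
    lift = confidence / (toy_itemsets[conclusion] / n_transactions)
    print(set(premise), "->", set(conclusion), count, round(confidence, 2), round(lift, 2))
```

With a minimum confidence of 0.7, the itemset {0, 1} yields rules in both directions (confidence 0.75 each), while {0, 2} only yields the rule 2 → 0, because only two of the four transactions containing item 0 also contain item 2.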

### Preprocessing in pandas

We now look at some more options for data preprocessing using pandas dataframes.

In [None]:
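from scipy.io import arff

# load the ARFF file; the with-block closes the file again after reading
with open('adult-dataset-tweaked.arff', 'r') as arff_file:
    adult_arff_data, adult_arff_meta = arff.loadarff(arff_file)
adult = pd.DataFrame(adult_arff_data)

# nominal attributes are loaded as bytes, so decode them to regular strings
# (applymap() is deprecated since pandas 2.1 in favor of DataFrame.map())
adult = adult.applymap(lambda x: x.decode('utf8') if hasattr(x, 'decode') else x)
adult.head()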

To merge several categorical values, we can use the ```replace()``` function:

In [None]:
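# assign the result back to the column: calling replace(..., inplace=True) on a selected
# column is not guaranteed to modify the dataframe in recent pandas versions
adult['education'] = adult['education'].replace(['Bachelors', 'Masters', 'Assoc-acdm', 'Prof-school', 'Assoc-voc', 'Doctorate'], 'Other-Grad')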
adult.head()
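
As a small variation, `replace()` also accepts a dictionary that maps old values to new ones, which is handy when values should be merged into more than one group. The column values below are a hypothetical toy sample, and the group name `School` is made up for this sketch.

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "11th", "Doctorate"]})

# Map old categories to merged categories; values not listed stay unchanged
merge_map = {
    "Bachelors": "Other-Grad",
    "Masters": "Other-Grad",
    "Doctorate": "Other-Grad",
    "11th": "School",
}
toy["education"] = toy["education"].replace(merge_map)

merged_counts = toy["education"].value_counts()
print(merged_counts)
```

Checking `value_counts()` after the replacement is a quick way to verify that no category was misspelled and therefore left unmerged.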

If we don't want to specify all values individually, we can also replace all values that satisfy a condition using the ```loc[]``` accessor:

In [None]:
adult.loc[adult['native-country'] != 'United-States', 'native-country'] = 'Non-US'
adult.sort_values(by='native-country').head()
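
The condition inside ```loc[]``` is simply a boolean Series, which can be stored in a variable and inspected before it is used for the assignment. A minimal sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({"native-country": ["United-States", "Mexico", "Germany", "United-States"]})

# Boolean mask: True for every row whose country is not the United States
is_foreign = toy["native-country"] != "United-States"
print(is_foreign.tolist())

# Select rows with the mask and the target column by name, then assign the new value
toy.loc[is_foreign, "native-country"] = "Non-US"
countries = toy["native-country"].tolist()
print(countries)
```

Selecting the rows and the column in a single ```loc[]``` call makes sure the assignment changes the dataframe itself and not a temporary copy.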

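In addition to scikit-learn's ```KBinsDiscretizer```, we can also discretize numeric values with the pandas [```cut()``` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html). By default, each bin includes its right edge, so the bins below are (0, 20], (20, 65], and (65, 100].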

In [None]:
adult['age'] = pd.cut(adult['age'], [0, 20, 65, 100], labels=['low', 'middle', 'high'])
adult.head()
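
The following sketch shows how ```cut()``` treats values that lie exactly on a bin edge or outside all bins, using the same edges as above. The ages are hypothetical values chosen to hit the boundaries.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([17, 20, 21, 65, 66, 90])

# With the default right=True, each bin includes its right edge: (0, 20], (20, 65], (65, 100]
binned = pd.cut(ages, [0, 20, 65, 100], labels=["low", "middle", "high"])
age_labels = binned.astype(str).tolist()
print(age_labels)

# Values outside all bins (including the open left edge 0) become missing values
outside = pd.cut(pd.Series([0, 101]), [0, 20, 65, 100])
all_missing = bool(outside.isna().all())
print(all_missing)
```

An age of exactly 20 therefore ends up in the lowest bin and an age of 65 in the middle one. If the lowest edge should be included as well, ```cut()``` offers the `include_lowest=True` parameter.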