# Data Mining Practical Works

# TP3 - Pattern Mining: Association Analysis

Association mining can be used in problems where decisions should be based on your customers' habits, for example by finding products that are frequently bought together.


# Load and inspect dataset

The dataset contains 541909 rows; each row records one item line of a customer transaction (invoice).

**Step 1:** Load the dataset (data.csv). Output the first five rows to inspect the data content.

In [1]:

import pandas as pd
data=pd.read_csv('data.csv')
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


**Step 2:** Check some statistics using the function `.describe()`.

In [2]:
data.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0
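
In the summary above, the `count` of `CustomerID` (406829) is lower than the number of rows (541909), and the minimum `Quantity` and `UnitPrice` are negative. The following is a minimal sketch of how such points could be checked; the `toy` frame and its values are made up for illustration and only stand in for `data`.

```python
import pandas as pd

# Toy frame standing in for data.csv (values are invented for illustration)
toy = pd.DataFrame({
    "Quantity": [6, -2, 8, 1],
    "UnitPrice": [2.55, 3.39, 0.0, 4.13],
    "CustomerID": [17850.0, None, 17850.0, None],
})

# describe() only reports the non-missing count; isna().sum() counts the gaps
missing_per_column = toy.isna().sum()

# Number of rows whose quantity is zero or negative
non_positive_quantity = int((toy["Quantity"] <= 0).sum())

print(missing_per_column)
print(non_positive_quantity)
```

The same two expressions can be applied to `data` directly.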


# Data Preprocessing

**Step 3:** Data cleaning: remove the extra spaces in the column `Description` using the function `.str.strip()`.

In [3]:
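# .str.strip() returns a new Series, so assign it back to keep the cleaned values
data['Description']=data['Description'].str.strip()
data['Description']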

0          WHITE HANGING HEART T-LIGHT HOLDER
1                         WHITE METAL LANTERN
2              CREAM CUPID HEARTS COAT HANGER
3         KNITTED UNION FLAG HOT WATER BOTTLE
4              RED WOOLLY HOTTIE WHITE HEART.
                         ...                 
541904            PACK OF 20 SPACEBOY NAPKINS
541905            CHILDREN'S APRON DOLLY GIRL
541906           CHILDRENS CUTLERY DOLLY GIRL
541907        CHILDRENS CUTLERY CIRCUS PARADE
541908           BAKING SET 9 PIECE RETROSPOT
Name: Description, Length: 541909, dtype: object
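
`.str.strip()` returns a new Series rather than modifying the column in place, so the cleaned values are only kept if the result is assigned back. A small sketch on a made-up `toy` frame (not the real dataset) shows the difference:

```python
import pandas as pd

# Toy descriptions with leading and trailing spaces (invented for illustration)
toy = pd.DataFrame({"Description": ["  WHITE METAL LANTERN", "POSTAGE ", " RED HEART "]})

# Calling .str.strip() without assignment leaves the column untouched
_ = toy["Description"].str.strip()
unchanged = toy["Description"].tolist()

# Assigning the result back stores the cleaned values
toy["Description"] = toy["Description"].str.strip()
cleaned = toy["Description"].tolist()

print(unchanged)
print(cleaned)
```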

**Step 4:** Drop the rows that have no invoice number using the function `df.dropna`, called with three parameters: `axis=0`, `subset=['InvoiceNo']`, `inplace=True`.

In [4]:
data.dropna(axis=0,subset=['InvoiceNo'],inplace=True)

**Step 5:** Convert the `InvoiceNo` column values to strings using the function `astype('str')`.

In [6]:
data['InvoiceNo']=data['InvoiceNo'].astype('str')
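
Steps 4 and 5 can be traced on a small example. This is only a sketch: the `toy` frame is invented, and its invoice numbers merely imitate the format seen in the dataset (including a `C`-prefixed one).

```python
import pandas as pd

# Toy frame with one missing invoice number (values invented for illustration)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", None, "C536379"],
    "Description": ["WHITE METAL LANTERN", "POSTAGE", "RED HEART"],
})

# Step 4: drop the rows (axis=0) whose InvoiceNo is missing, modifying toy in place
toy.dropna(axis=0, subset=["InvoiceNo"], inplace=True)

# Step 5: make sure every remaining invoice number is a string
toy["InvoiceNo"] = toy["InvoiceNo"].astype("str")

print(toy)
```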

# Restructure the data: One Hot Encode

**Step 6:** Now consolidate the items into one transaction per row, with each product one hot encoded. A one hot encoding represents a categorical variable as binary vectors.

A one hot encoding allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.

The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column.

For example, the sequence red, red, green could be represented with the integer encoding 0, 0, 1 and with the one hot encoding [1,0], [1,0], [0,1].
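
The red/green example can be reproduced with pandas. This is a minimal sketch; using `pd.factorize` and `pd.get_dummies` is an illustrative choice here, not the approach applied later in this step.

```python
import pandas as pd

colors = pd.Series(["red", "red", "green"])

# Integer encoding: codes are assigned in order of first appearance
codes, categories = pd.factorize(colors)

# One hot encoding: one column per category, reordered to match the integer codes
one_hot = pd.get_dummies(colors).astype(int)[list(categories)]

print(codes.tolist())
print(one_hot.values.tolist())
```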

One hot encoding can be done in several ways:

- Manual one hot encoding: build a `basket` by grouping on `InvoiceNo` and `Description` and summing the quantities, then apply `unstack()`, `reset_index()`, and `fillna(0)`, and set `InvoiceNo` as the index.
- One hot encoding with scikit-learn: `LabelEncoder` and `OneHotEncoder` (`sklearn.preprocessing`).
- `TransactionEncoder` (`mlxtend.preprocessing`).
- `OnehotTransactions` (`mlxtend.preprocessing`, older releases only; superseded by `TransactionEncoder`).

In this step, apply the manual approach.

In [37]:
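# Keep the French invoices, sum the quantities per (invoice, product), then pivot products into columns
basket=(data[data['Country']=="France"]
        .groupby(['InvoiceNo','Description'])['Quantity'].sum()
        .unstack().reset_index().fillna(0)
        .set_index('InvoiceNo'))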
basket

Description,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TRELLIS COAT RACK,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
InvoiceNo,,,,,,,,,,,,,,,,,,,,,
536370,0.0,0.0,0.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C579532,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
C579562,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
C580161,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
C580263,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
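
The manual approach is easier to follow on a few rows. The sketch below applies the same chain of calls to a made-up `toy` frame (product names and quantities are invented), so each intermediate effect can be checked by hand.

```python
import pandas as pd

# Two toy invoices; invoice "1" lists LANTERN twice (values invented for illustration)
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "1", "2", "2"],
    "Description": ["LANTERN", "POSTAGE", "LANTERN", "POSTAGE", "NAPKINS"],
    "Quantity": [6, 1, 2, 1, 12],
})

toy_basket = (
    toy.groupby(["InvoiceNo", "Description"])["Quantity"]
    .sum()                      # total quantity per (invoice, product)
    .unstack()                  # products become columns, missing pairs become NaN
    .reset_index()
    .fillna(0)                  # a product absent from an invoice counts as 0
    .set_index("InvoiceNo")     # one row per invoice
)

print(toy_basket)
```

Each row is one invoice and each column one product, which is the layout the encoding in the next steps relies on.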


To view the list of countries in the dataset, use `data.Country.unique()`. In this practical, the one hot encoding is applied to `France`.

In [38]:
data.Country.unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

**Step 7:** Now convert all positive quantities to 1 and all other values to 0. Create a function named `encode_units` that returns 1 for positive numbers and 0 otherwise.

In [39]:
def encode_units(x):
    # 1 if the product appears in the invoice (positive quantity), 0 otherwise
    return 1 if x>0 else 0

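**Step 8:** To obtain the binary, structured data (`basket_sets`), apply `encode_units` to every cell of `basket` using the function `applymap`. In pandas 2.1 and later, `applymap` is deprecated in favor of `DataFrame.map`, which is called the same way.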

In [40]:
basket_sets=basket.applymap(encode_units)
basket_sets

Description,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TRELLIS COAT RACK,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
InvoiceNo,,,,,,,,,,,,,,,,,,,,,
536370,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536852,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536974,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537065,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C579532,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
C579562,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
C580161,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
C580263,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
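
As a possible alternative to applying a Python function cell by cell, the same 0/1 table can be obtained with a vectorized comparison. The sketch below uses a made-up `toy_basket`; it is one way to do it, not the method required by this practical.

```python
import pandas as pd

# Toy basket with a positive, a zero and a negative quantity (values invented)
toy_basket = pd.DataFrame(
    {"LANTERN": [8.0, 0.0, -1.0], "POSTAGE": [1.0, 1.0, 0.0]},
    index=pd.Index(["1", "2", "C3"], name="InvoiceNo"),
)

# The comparison yields booleans; astype(int) turns True/False into 1/0
toy_sets = (toy_basket > 0).astype(int)

print(toy_sets)
```

Negative quantities map to 0, which matches the rule "1 for positive numbers and 0 otherwise".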


# Generation of frequent itemsets


**Step 9:** Now that the data is structured, generate the frequent itemsets with the Apriori algorithm; import `apriori` from `mlxtend.frequent_patterns`.

Use a `min_support` of 7% (`min_support=0.07`) to get many frequent patterns.

Print out the frequent itemsets.

In [32]:
from mlxtend.frequent_patterns import apriori
frequent_items=apriori(basket_sets,min_support=0.07, use_colnames=True)
frequent_items

Unnamed: 0,support,itemsets
0,0.077944,(6 RIBBONS RUSTIC CHARM)
1,0.076285,(JUMBO BAG WOODLAND ANIMALS)
2,0.087894,(PLASTERS IN TIN CIRCUS PARADE )
3,0.08126,(PLASTERS IN TIN SPACEBOY)
4,0.104478,(PLASTERS IN TIN WOODLAND ANIMALS)
5,0.620232,(POSTAGE)
6,0.072968,(RED TOADSTOOL LED NIGHT LIGHT)
7,0.104478,(REGENCY CAKESTAND 3 TIER)
8,0.119403,(ROUND SNACK BOXES SET OF 4 FRUITS )
9,0.185738,(ROUND SNACK BOXES SET OF4 WOODLAND )
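
To make the `support` column and the role of `min_support` concrete, the sketch below counts support by brute force over four made-up transactions. It enumerates every candidate of size 1 and 2 instead of pruning them the way Apriori does, and the helper name `support` and the threshold of 0.5 are illustrative assumptions.

```python
from itertools import combinations

# Four toy transactions (item names invented for illustration)
transactions = [
    {"POSTAGE", "LANTERN"},
    {"POSTAGE", "NAPKINS"},
    {"POSTAGE", "LANTERN", "NAPKINS"},
    {"LANTERN"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

min_support = 0.5
items = sorted(set().union(*transactions))

# Brute force: test every itemset of size 1 and 2 against the threshold
frequent = {}
for size in (1, 2):
    for candidate in combinations(items, size):
        s = support(candidate, transactions)
        if s >= min_support:
            frequent[candidate] = s

for itemset, s in frequent.items():
    print(itemset, s)
```

Apriori reaches the same frequent itemsets with fewer candidates, because an itemset can only be frequent if all of its subsets are frequent.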


# Generation of Rules

**Step 10:** Generate the rules with their corresponding `support`, `confidence` and `lift`.

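- Support: the relative frequency with which the rule shows up, that is, the fraction of transactions that contain both the antecedent and the consequent.
- Confidence: a measure of the reliability of the rule, that is, the fraction of transactions containing the antecedent that also contain the consequent.
- Lift: the ratio of the observed support to the support expected if the antecedent and the consequent were independent. A lift above 1 means the two occur together more often than expected.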
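
The three measures can be written as formulas. The notation is an assumption introduced here for illustration: $A$ is the antecedent, $C$ the consequent, $T$ the set of transactions and $N$ their number.

```latex
\begin{aligned}
\operatorname{support}(A \Rightarrow C) &= \frac{\lvert \{\, t \in T : A \cup C \subseteq t \,\} \rvert}{N} \\
\operatorname{confidence}(A \Rightarrow C) &= \frac{\operatorname{support}(A \Rightarrow C)}{\operatorname{support}(A)} \\
\operatorname{lift}(A \Rightarrow C) &= \frac{\operatorname{confidence}(A \Rightarrow C)}{\operatorname{support}(C)}
\end{aligned}
```

Here $\operatorname{support}(A)$ and $\operatorname{support}(C)$ denote the fraction of transactions containing $A$ and $C$ respectively.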

Import `association_rules` from `mlxtend.frequent_patterns`. Its inputs are the frequent itemsets (`frequent_items`), `metric='lift'`, and the minimum threshold `min_threshold=1`.

Output the rules.

In [33]:
from mlxtend.frequent_patterns import association_rules

rules=association_rules(frequent_items,metric='lift',min_threshold=1)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(POSTAGE),(PLASTERS IN TIN CIRCUS PARADE ),0.620232,0.087894,0.076285,0.122995,1.399354,0.021771,1.040023
1,(PLASTERS IN TIN CIRCUS PARADE ),(POSTAGE),0.087894,0.620232,0.076285,0.867925,1.399354,0.021771,2.875385
2,(POSTAGE),(PLASTERS IN TIN SPACEBOY),0.620232,0.08126,0.076285,0.122995,1.513587,0.025885,1.047587
3,(PLASTERS IN TIN SPACEBOY),(POSTAGE),0.08126,0.620232,0.076285,0.938776,1.513587,0.025885,6.202875
4,(POSTAGE),(PLASTERS IN TIN WOODLAND ANIMALS),0.620232,0.104478,0.089552,0.144385,1.381971,0.024752,1.046642
5,(PLASTERS IN TIN WOODLAND ANIMALS),(POSTAGE),0.104478,0.620232,0.089552,0.857143,1.381971,0.024752,2.658375
6,(POSTAGE),(REGENCY CAKESTAND 3 TIER),0.620232,0.104478,0.091211,0.147059,1.407563,0.02641,1.049923
7,(REGENCY CAKESTAND 3 TIER),(POSTAGE),0.104478,0.620232,0.091211,0.873016,1.407563,0.02641,2.990672
8,(POSTAGE),(ROUND SNACK BOXES SET OF 4 FRUITS ),0.620232,0.119403,0.114428,0.184492,1.54512,0.04037,1.079814
9,(ROUND SNACK BOXES SET OF 4 FRUITS ),(POSTAGE),0.119403,0.620232,0.114428,0.958333,1.54512,0.04037,9.114428
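
The `confidence` and `lift` columns can be recomputed by hand from the three support columns. The sketch below does this for rule 0 of the table, with the values copied from that row; the variable names are illustrative.

```python
# Values copied from rule 0: (POSTAGE) -> (PLASTERS IN TIN CIRCUS PARADE )
antecedent_support = 0.620232
consequent_support = 0.087894
rule_support = 0.076285

# Confidence: share of antecedent transactions that also contain the consequent
confidence = rule_support / antecedent_support

# Lift: confidence relative to how common the consequent is overall
lift = confidence / consequent_support

print(round(confidence, 4), round(lift, 3))
```

Up to rounding of the inputs, the results agree with the `confidence` and `lift` values shown for rule 0.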


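Output terms and their meaning:
- Antecedents: an antecedent is an item (or itemset) found within the data; it is the "if" part of the rule.
- Consequents: a consequent is an item (or itemset) found in combination with the antecedent; it is the "then" part of the rule.
- Leverage: the difference between the observed support of the rule and the support expected if the antecedent and the consequent were independent.
- Lift and leverage behave differently:
   - lift may find very strong associations for less frequent items,
   - leverage tends to prioritize items with higher frequencies/support in the dataset.
- Conviction: for a rule X → Y, conviction compares how often X would occur without Y if the two were independent with how often X actually occurs without Y. It is thus a measure of the strength of a rule with respect to the complement of the consequent; higher values indicate a more reliable rule.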
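
Leverage and conviction can also be checked against the rules table. The sketch below uses the usual definitions, leverage = support(rule) − support(antecedent) × support(consequent) and conviction = (1 − support(consequent)) / (1 − confidence), on rules 0 and 1; the helper name `leverage_and_conviction` is hypothetical.

```python
def leverage_and_conviction(antecedent_support, consequent_support, rule_support):
    """Hypothetical helper: recompute leverage and conviction from the support columns."""
    confidence = rule_support / antecedent_support
    leverage = rule_support - antecedent_support * consequent_support
    conviction = (1 - consequent_support) / (1 - confidence)
    return leverage, conviction

# Rule 0: (POSTAGE) -> (PLASTERS IN TIN CIRCUS PARADE ), values copied from the table
lev0, conv0 = leverage_and_conviction(0.620232, 0.087894, 0.076285)

# Rule 1: the same two itemsets with antecedent and consequent swapped
lev1, conv1 = leverage_and_conviction(0.087894, 0.620232, 0.076285)

print(round(lev0, 4), round(conv0, 2))
print(round(lev1, 4), round(conv1, 2))
```

Leverage is identical in both directions, while conviction is much higher for rule 1, whose confidence is also higher.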

**Step 11:** Output the rules with `lift` greater than or equal to 4 and `confidence` greater than or equal to 0.8.

In [35]:
rules[(rules['lift']>=4) & (rules['confidence']>=0.8)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
15,(ROUND SNACK BOXES SET OF 4 FRUITS ),(ROUND SNACK BOXES SET OF4 WOODLAND ),0.119403,0.185738,0.099502,0.833333,4.486607,0.077325,4.885572
16,"(POSTAGE, ROUND SNACK BOXES SET OF 4 FRUITS )",(ROUND SNACK BOXES SET OF4 WOODLAND ),0.114428,0.185738,0.094527,0.826087,4.447593,0.073274,4.682007
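
The filter combines two boolean masks with `&`; each comparison needs its own parentheses because `&` binds more tightly than `>=`. A small sketch on a made-up `toy_rules` frame (values invented for illustration):

```python
import pandas as pd

# Toy rules table with only the two columns used by the filter
toy_rules = pd.DataFrame({
    "lift": [1.4, 4.5, 6.2, 4.4],
    "confidence": [0.87, 0.83, 0.50, 0.79],
})

# Keep the rows that satisfy both conditions
strong = toy_rules[(toy_rules["lift"] >= 4) & (toy_rules["confidence"] >= 0.8)]

# Equivalent selection written with DataFrame.query
strong_query = toy_rules.query("lift >= 4 and confidence >= 0.8")

print(strong)
```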


**Step 12:** Now apply the same process to different countries, e.g., `Germany`. What do you observe?
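
To repeat the preprocessing for another country, Steps 6 to 8 could be wrapped in a function. The sketch below is one possible way to do it: the name `basket_sets_for_country` is hypothetical, the `toy` frame is invented, and `unstack(fill_value=0)` with a vectorized comparison is used as a shorter stand-in for the chain and `encode_units` shown earlier.

```python
import pandas as pd

def basket_sets_for_country(data, country):
    """Hypothetical helper: one row per invoice, one 0/1 column per product."""
    quantities = (
        data[data["Country"] == country]
        .groupby(["InvoiceNo", "Description"])["Quantity"]
        .sum()
        .unstack(fill_value=0)   # products absent from an invoice get quantity 0
    )
    # 1 where the product was bought (positive quantity), 0 otherwise
    return (quantities > 0).astype(int)

# Toy data covering two countries (values invented for illustration)
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3"],
    "Description": ["LANTERN", "POSTAGE", "POSTAGE", "NAPKINS"],
    "Quantity": [6, 1, 1, 12],
    "Country": ["France", "France", "Germany", "Germany"],
})

france = basket_sets_for_country(toy, "France")
germany = basket_sets_for_country(toy, "Germany")

print(germany)
```

The returned frame can then be passed to `apriori` and `association_rules` as in Steps 9 and 10.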