# Q2: Online Retail Association Rules

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns

In [2]:
df = pd.read_csv("C:/Users/LIMI/Desktop/OnlineRetail.csv",encoding='unicode_escape')
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [4]:
print(df.isnull().sum())

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64


Because Apriori works on binary (1/0) data, rows with missing values are dropped first, and the empty cells of the basket matrix are later filled with 0.
Quantity must also be greater than 0.

# Association Rule Learning with Apriori

In [5]:
# First, remove rows with missing values.
df.dropna(axis=0, inplace=True)

# Strip leading and trailing spaces from the descriptions.
df["Description"] = df["Description"].str.strip(" ")

# Exclude invoices whose number contains 'C', because they represent cancellations.
df = df[~df["InvoiceNo"].str.contains("C", na=False)] 

# Keep only rows whose StockCode is purely numeric
# (codes containing letters, such as 85123A, are dropped).
df = df[df["StockCode"].apply(lambda x: str(x).isnumeric())]

# Keep only rows with Quantity greater than 0.
df = df[df["Quantity"] > 0]

Since the apriori function works on 1/0 values, we fill the missing entries of the basket with 0 and convert every positive quantity to 1.

Next, considering the accuracy of Apriori, we select 'France' as the target country for the analysis. We also sum the Quantity values for each (InvoiceNo, Description) pair.

In [6]:
basket = (df[df["Country"] == "France"]
              .groupby(["InvoiceNo", "Description"])["Quantity"]
              .sum().unstack().fillna(0)
              # 1 if the item appears on the invoice, else 0
              # (same result as applymap, which is deprecated in recent pandas)
              .gt(0).astype(int))
basket.iloc[0:5, 0:5]

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND
536370,0,0,0,0,0
536852,0,0,0,0,0
536974,0,0,0,0,0
537065,0,0,0,0,0
537463,0,0,0,0,0
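
To make the encoding step explicit, here is a minimal sketch of the same idea in plain Python, without pandas. The invoice lines, item names and the `toy_basket` variable are hypothetical and only illustrate the logic: sum the quantity per (invoice, item) pair, treat missing pairs as 0, and map every positive total to 1.

```python
from collections import defaultdict

# Hypothetical invoice lines: (invoice_no, description, quantity)
lines = [
    ("A1", "PEN", 2),
    ("A1", "PEN", 3),       # the same item listed twice on one invoice
    ("A1", "BALLOONS", 1),
    ("A2", "BALLOONS", 4),
]

# Sum the quantity per (invoice, item) pair, like groupby(...).sum()
totals = defaultdict(int)
for invoice, item, qty in lines:
    totals[(invoice, item)] += qty

invoices = sorted({inv for inv, _, _ in lines})
items = sorted({item for _, item, _ in lines})

# One row per invoice, one column per item; a missing pair counts as 0
toy_basket = {
    inv: {item: 1 if totals.get((inv, item), 0) > 0 else 0 for item in items}
    for inv in invoices
}
print(toy_basket)
```

Invoice A1 gets a 1 for both items even though PEN was listed twice, because only the presence of an item matters to Apriori, not the quantity.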


In [7]:
#%pip install mlxtend
#pip install --upgrade mlxtend

In [8]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [9]:
# Convert the basket dataframe to bool (True or False).
basket = basket.astype(bool)

# Use Apriori to find the Frequent Itemsets. 
frequent_itemsets = apriori(basket, min_support=0.02, use_colnames=True)

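# Generate the association rules from frequent_itemsets.
# num_itemsets is documented as the number of transactions in the original
# input data, so we pass the number of invoices in the basket.
rules = association_rules(frequent_itemsets, num_itemsets=len(basket),
                          metric="support", min_threshold=0.01)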

basket.iloc[0:5, 0:5]

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND
536370,False,False,False,False,False
536852,False,False,False,False,False
536974,False,False,False,False,False
537065,False,False,False,False,False
537463,False,False,False,False,False
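
As a rough illustration of what `min_support` means, the sketch below counts itemset support by brute force on a few hypothetical transactions. It is not how mlxtend implements Apriori (the real algorithm prunes candidates that contain an infrequent subset); the function name `frequent_itemsets_bruteforce` and the data are assumptions made for this example.

```python
from itertools import combinations

# Hypothetical transactions, each a set of purchased items
transactions = [
    {"PEN", "BALLOONS"},
    {"PEN", "BALLOONS", "CARDS"},
    {"PEN"},
    {"CARDS"},
]

def frequent_itemsets_bruteforce(transactions, min_support, max_len=2):
    """Return {itemset: support} for all itemsets that reach min_support.

    Brute force for illustration only: every candidate is counted,
    whereas Apriori skips candidates with an infrequent subset.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            # Count the transactions that contain every item of the candidate
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result

toy_itemsets = frequent_itemsets_bruteforce(transactions, min_support=0.5)
for itemset, support in toy_itemsets.items():
    print(sorted(itemset), support)
```

With `min_support=0.5`, the pair PEN and BALLOONS is kept (2 of 4 transactions), while PEN and CARDS is discarded (1 of 4).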


In [10]:
# Remove mirrored duplicates, i.e. rules whose antecedents and consequents are swapped (A -> B and B -> A)
rules["rule_set"] = rules.apply(
    lambda row: frozenset(row["antecedents"]).union(row["consequents"]),
    axis=1
)
rules = rules.drop_duplicates(subset=["rule_set"]).drop(columns=["rule_set"])

# Inspect the first rules
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(36 PENCILS TUBE WOODLAND),(36 PENCILS TUBE RED RETROSPOT),0.023936,0.047872,0.021277,0.888889,18.567901,1.0,0.020131,8.569149,0.969346,0.421053,0.883302,0.666667
2,(36 PENCILS TUBE RED RETROSPOT),(PLASTERS IN TIN WOODLAND ANIMALS),0.047872,0.178191,0.021277,0.444444,2.494196,1.0,0.012746,1.479255,0.62919,0.103896,0.323984,0.281924
4,(4 TRADITIONAL SPINNING TOPS),(MINI PAINT SET VINTAGE),0.074468,0.109043,0.029255,0.392857,3.602787,1.0,0.021135,1.467459,0.780564,0.189655,0.31855,0.330575
6,(4 TRADITIONAL SPINNING TOPS),(SET/6 RED SPOTTY PAPER CUPS),0.074468,0.143617,0.023936,0.321429,2.238095,1.0,0.013241,1.262038,0.597701,0.123288,0.207631,0.244048
8,(4 TRADITIONAL SPINNING TOPS),(SET/6 RED SPOTTY PAPER PLATES),0.074468,0.132979,0.023936,0.321429,2.417143,1.0,0.014033,1.277716,0.633461,0.130435,0.217353,0.250714
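
The de-duplication in cell 10 relies on the fact that the union of antecedents and consequents is the same for a rule and its mirror. A minimal sketch in plain Python, with hypothetical rules `toy_rules`, shows which rules survive when only the first rule per combined item set is kept:

```python
# Hypothetical rules as (antecedents, consequents) pairs
toy_rules = [
    (frozenset({"A"}), frozenset({"B"})),
    (frozenset({"B"}), frozenset({"A"})),        # mirror of the first rule
    (frozenset({"A"}), frozenset({"C"})),
    (frozenset({"A", "B"}), frozenset({"C"})),
    (frozenset({"C"}), frozenset({"A", "B"})),   # mirror of the previous rule
    (frozenset({"A"}), frozenset({"B", "C"})),   # different split of {A, B, C}
]

seen = set()
kept = []
for antecedents, consequents in toy_rules:
    key = antecedents | consequents   # identical for A -> B and B -> A
    if key not in seen:               # keep only the first rule per item set
        seen.add(key)
        kept.append((antecedents, consequents))

for antecedents, consequents in kept:
    print(sorted(antecedents), "->", sorted(consequents))
```

Note that the key ignores direction, so the confidence of a dropped reverse rule is no longer shown, and a different split of the same item set (the last toy rule) is dropped as well.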


### Parameters Interpretation

antecedents: The "if" part of a rule, indicating which items were purchased.

consequents: The "then" part of a rule, indicating which items are likely to be purchased together with the antecedents.

support: The probability that the antecedent and the consequent of a rule occur together in a transaction. It indicates how widely the rule applies, with higher values meaning that the rule covers more transactions. For example, if the items occur together in 200 out of 1000 transactions, the support is 20%.

confidence: The probability that the consequent occurs given that the antecedent occurs.

lift: How much more often the antecedent and the consequent occur together than would be expected if they were independent (confidence divided by the consequent support). If lift > 1, the items are positively associated, which means that purchasing the antecedents increases the likelihood of purchasing the consequents.
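
The definitions above can be checked by hand. The sketch below recomputes confidence and lift for row 0 of the rules table from its support columns; the helper name `rule_metrics` is an assumption made for this example, and the inputs are the rounded values printed in the table, so the results only match to a few decimals.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Return (confidence, lift) computed from the three support values."""
    confidence = support / antecedent_support   # P(consequent | antecedent)
    lift = confidence / consequent_support      # ratio to the independent case
    return confidence, lift

# Support example from the text: 200 joint occurrences in 1000 transactions
example_support = 200 / 1000

# Row 0 of the rules table:
# (36 PENCILS TUBE WOODLAND) -> (36 PENCILS TUBE RED RETROSPOT)
confidence, lift = rule_metrics(
    antecedent_support=0.023936,
    consequent_support=0.047872,
    support=0.021277,
)
print(example_support)                        # 0.2
print(round(confidence, 4), round(lift, 2))   # 0.8889 18.57
```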


# What Do We Get?

### Here we mainly focus on the 'support', 'confidence' and 'lift' metrics

If people buy "36 PENCILS TUBE RED RETROSPOT", there is a 44.44% chance that they will also buy "PLASTERS IN TIN WOODLAND ANIMALS" (confidence 0.444444, lift 2.49). The same 44.44% holds for "36 PENCILS TUBE WOODLAND": the table lists this pair in the other direction (88.89% of the buyers of "36 PENCILS TUBE WOODLAND" also buy "36 PENCILS TUBE RED RETROSPOT"), and the reverse confidence follows from the table as 0.021277 / 0.047872 ≈ 0.4444. Both lifts are above 1, so buying "36 PENCILS TUBE RED RETROSPOT" greatly increases the likelihood of purchasing "36 PENCILS TUBE WOODLAND" (lift 18.57) and also increases the likelihood of purchasing "PLASTERS IN TIN WOODLAND ANIMALS" (lift 2.49).

If people buy "4 TRADITIONAL SPINNING TOPS", there is a 39.29% chance that they will also buy "MINI PAINT SET VINTAGE". The lift of 3.60 is well above 1, so buying "4 TRADITIONAL SPINNING TOPS" increases the likelihood of purchasing "MINI PAINT SET VINTAGE".

The remaining rules can be read in the same way.