## Data: Online Retail II

The Online Retail II dataset contains all transactions of a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware, and many of its customers are wholesalers.

The dataset can be accessed at the following link: https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci/

Attribute Information:

- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
- UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.

## Business Understanding

Market Basket Analysis is a data mining technique employed to discover relationships and patterns within large datasets, particularly in the context of market analysis. By identifying frequently co-occurring items in transactions, businesses can gain valuable insights into customer behavior, optimize product placement, and enhance overall marketing strategies.

## Objectives

1. **Association Rule Discovery**:
    Identify associations and correlations among products or items in a dataset. Discover rules that indicate the likelihood of certain items being bought together.
2. **Cross-Selling Opportunities**:
    Uncover opportunities for cross-selling by understanding which products are frequently purchased together.
3. **Promotion Planning**:
    Optimize promotional campaigns by identifying items that are frequently bought together. Design effective promotions and discounts to incentivize the purchase of complementary products.
4. **Optimizing Product Layout**:
    Arrange products in-store or online in a way that encourages the purchase of related items, creating a more convenient and satisfying shopping experience.

## Key Metrics

- **Support** 
    - Support measures how frequently the items in an association rule appear together, expressed as the share of all transactions that contain them.
- **Confidence** 
    - Confidence measures how strong an association rule is. 
    - That is to say, in market basket analysis terms, how likely is a second product to be present in the basket if the first is.
    - Confidence(A→B) = Support(A∪B)/Support(A)×100%
    - Confidence(B→A) = Support(A∪B)/Support(B)×100%
- **Lift**
    - Lift measures how much more likely two items are to be bought together than would be expected if they were bought independently of each other.
    - Lift(A→B) = Support(A∪B) / (Support(A) × Support(B))
    - If Lift = 1, it means there is no association between A and B.
    - If Lift > 1, it indicates that A and B are more likely to be bought together than randomly.
    - If Lift < 1, it suggests that A and B are less likely to be bought together than randomly.
- **Leverage**
    - Leverage measures the difference between the observed frequency of A and B occurring together and the frequency that would be expected if A and B were statistically independent.
    - Leverage(A→B)=Support(A∪B)−(Support(A)×Support(B))
    - Positive leverage indicates that the items appear together more frequently than expected by chance.
    - Zero leverage means the items occur together exactly as expected based on their individual supports.
    - Negative leverage implies the items co-occur less frequently than expected.
    
- **Conviction**
    - Conviction compares how often A would occur without B if the two were independent with how often A actually occurs without B.
    - Conviction(A→B) = (1 − Support(B)) / (1 − Confidence(A→B))
    - Conviction(B→A) = (1 − Support(A)) / (1 − Confidence(B→A))
    - If Conviction = 1, it means that A and B are independent of each other.
    - If Conviction > 1, the rule A→B fails less often than it would under independence; the higher the value, the more strongly B depends on A.
    - If Conviction < 1, it indicates a negative association between A and B.
- **Zhang’s metric**
    - Zhang’s metric captures negative associations as well as positive ones. It can show, for instance, whether buying A makes a customer less likely to buy B.
    - It ranges from -1 to 1: values below 0 indicate a negative association (dissociation), and values above 0 indicate a positive association.
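
The definitions above can be checked with a few lines of plain Python. This is a minimal sketch, not part of the original notebook: the function name `rule_metrics` is illustrative, and the three input supports are copied from the RED HANGING HEART T-LIGHT HOLDER → WHITE HANGING HEART T-LIGHT HOLDER rule reported in the rules table later in this notebook. The Zhang's metric formula used here is the one documented by mlxtend.

```python
def rule_metrics(support_a: float, support_b: float, support_ab: float) -> dict:
    """Compute association-rule metrics for A -> B from three support values."""
    confidence = support_ab / support_a
    lift = support_ab / (support_a * support_b)
    leverage = support_ab - support_a * support_b
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = float("inf") if confidence == 1 else (1 - support_b) / (1 - confidence)
    # Zhang's metric rescales leverage into the range [-1, 1].
    denominator = max(support_ab * (1 - support_a), support_a * (support_b - support_ab))
    zhangs_metric = leverage / denominator if denominator else 0.0
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhangs_metric,
    }


# Supports of A, B and of A and B together, taken from the rules table.
metrics = rule_metrics(support_a=0.042863, support_b=0.132973, support_ab=0.030199)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Up to rounding of the inputs, the printed values should agree with the `confidence`, `lift`, `leverage`, `conviction` and `zhangs_metric` columns of that rule.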

## Import Modules

In [1]:
import os
import re

import pandas as pd
import numpy as np

import plotly.express as px
import networkx as nx

import json
import copy

In [2]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

In [3]:
from networks.RulesGraphManager import RulesGraphManager as RGM
from networks.ProductNetwork import ProductNetwork
from networks.CrossSellingProducts import CrossSellingProducts

from grouper.NxGrouper import NxGrouper
from charts.HeatmapXTab import HeatmapCrosstab

from echarts.EgraphForce import EgraphForce
from echarts.EgraphStandard import EgraphStandard
from echarts.JupyterEcharts import JupyterEcharts

In [4]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)

## Data Preparation

### Load Dataset

In [5]:
pathname = os.path.join("F:\\Data\\datas", "online_retail_II.csv")
df = pd.read_csv(pathname, usecols=['Invoice', 'Description'])

In [6]:
df.shape

(1067371, 2)

### Data Cleaning

#### Drop missing values

In [7]:
df.isna().sum()

Invoice           0
Description    4382
dtype: int64

In [8]:
df = df.dropna()

#### Trim space in description column

In [9]:
df['Description'] = df['Description'].str.strip()
df['Description'] = df['Description'].replace(r'\s{2,}', ' ', regex=True)

#### Drop duplicate rows

There are two possible reasons for duplicate rows. 
1. The cashier might have scanned each product individually instead of scanning once and entering the quantity, resulting in multiple entries for each product on the invoice. 
2. The cashier might have accidentally scanned the same product twice. 

- **Decision process**: 
    - The first scenario is more common, so we assume that the duplicate rows are not mistakes.
    - However, the Apriori algorithm treats each item in a transaction as a binary occurrence: it records whether an item is present, not how many were bought. The duplicates therefore carry no extra information, so we decide to drop them.
<br>
- **Action**: We drop duplicate rows
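
As a small illustration of this decision, the sketch below uses an invented four-row frame (the invoice numbers and product names are made up) to show that a repeated invoice/product pair collapses to a single row, and that `drop_duplicates()` gives the same result as the boolean-mask form used in the cells that follow.

```python
import pandas as pd

# Toy data: "MUG" was scanned twice on invoice 1001.
toy = pd.DataFrame({
    "Invoice": ["1001", "1001", "1001", "1002"],
    "Description": ["MUG", "MUG", "PLATE", "MUG"],
})

# Keep the first occurrence of every (Invoice, Description) pair.
deduplicated = toy.drop_duplicates()

# Equivalent to the mask-based form used in this notebook.
same_as_mask = toy.loc[~toy.duplicated()]

print(deduplicated)
print(deduplicated.equals(same_as_mask))
```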

In [10]:
df.loc[df.duplicated()].shape

(46678, 2)

In [11]:
df = df.loc[~df.duplicated()]

In [12]:
df.shape

(1016311, 2)

#### Drop cancellation invoice rows

In [13]:
df[df['Invoice'].str.startswith('C')].shape

(18901, 2)

In [14]:
df = df[~df['Invoice'].str.startswith('C')]

In [15]:
df.shape

(997410, 2)

### Data Encoding

In [67]:
dfs = df.copy()
dfs['Description'] = dfs['Description'].transform(lambda x: [x])

In [68]:
dfs = dfs.groupby('Invoice')['Description'].sum() # use this, because we already dropped duplicate rows
# dfs = dfs.groupby('Invoice')['Description'].sum().map(lambda x: list(set(x)))
# df.groupby('Invoice')['Description'].unique() # this code is slow

In [69]:
encoder = TransactionEncoder()
df_encoder = encoder.fit(dfs).transform(dfs)

In [103]:
df_encoder[:5]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [71]:
one_hot_transactions = pd.DataFrame(df_encoder, columns=encoder.columns_)

In [72]:
one_hot_transactions.head()

| | \*Boombox Ipod Classic | \*USB Office Glitter Lamp | \*USB Office Mirror Ball | 10 COLOUR SPACEBOY PEN | 11 PC CERAMIC TEA SET POLKADOT | 12 ASS ZINC CHRISTMAS DECORATIONS | 12 COLOURED PARTY BALLOONS | 12 DAISY PEGS IN WOOD BOX | 12 EGG HOUSE PAINTED WOOD | 12 HANGING EGGS HAND PAINTED | ... | wrongly coded 20713 | wrongly coded 23343 | wrongly coded-23343 | wrongly marked | wrongly marked 23343 | wrongly marked carton 22804 | wrongly marked. 23343 in box | wrongly sold (22719) barcode | wrongly sold as sets | wrongly sold sets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
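
To make the shape of this encoding concrete, here is a dependency-free sketch of the same idea on three invented baskets. It is not mlxtend's implementation, only an illustration of what a one-hot transaction matrix contains: one alphabetically sorted column per distinct item and one boolean row per invoice.

```python
# Toy baskets; the product names are invented for illustration.
transactions = [
    ["MUG", "PLATE"],
    ["MUG"],
    ["PLATE", "SPOON", "MUG"],
]

# One column per distinct item, sorted alphabetically.
columns = sorted({item for basket in transactions for item in basket})

# One boolean row per transaction: True where the item is in the basket.
one_hot = [[item in set(basket) for item in columns] for basket in transactions]

print(columns)
for row in one_hot:
    print(row)
```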


## Data Mining (Rules Extraction)

The following parameters are configured for the algorithm:

- Maximum Combination Length
    - We set the maximum combination length to 2 items.
    - This choice is made to focus on pairs of items, allowing for a more targeted analysis of co-occurrences.

- Minimum Co-Occurrence Support Threshold
    - A minimum co-occurrence support threshold of 1% is established to filter out infrequent itemsets. 
    - This ensures that only associations with a significant presence in the dataset are considered.
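
With a maximum length of 2, the search reduces to counting single items and pairs. The sketch below shows that counting step on four invented baskets with an illustrative threshold of 50%. It is a simplified stand-in written for this explanation, not the mlxtend `apriori` implementation, and all names in it are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy baskets; the product names and the threshold are invented.
transactions = [
    {"MUG", "PLATE"},
    {"MUG", "PLATE", "SPOON"},
    {"MUG"},
    {"PLATE", "SPOON"},
]
min_support = 0.5
n = len(transactions)

# Pass 1: keep single items that meet the support threshold.
item_counts = Counter(item for basket in transactions for item in basket)
frequent_items = {item for item, count in item_counts.items() if count / n >= min_support}

# Pass 2: count only pairs built from frequent items (the Apriori pruning idea).
pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(basket & frequent_items), 2)
)
frequent_pairs = {
    pair: count / n for pair, count in pair_counts.items() if count / n >= min_support
}

print(frequent_pairs)
```

The pair `("MUG", "SPOON")` appears in only one of four baskets, so it falls below the threshold and is filtered out, which is the same effect the 1% threshold has on rare pairs in the real data.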

In [74]:
frequent_itemsets = apriori(one_hot_transactions, min_support= 0.01, use_colnames=True, max_len = 2, low_memory=True)
rules = association_rules(frequent_itemsets, metric="support", min_threshold = 0.01)

In [75]:
rules['antecedents'] = rules['antecedents'].map(lambda x: ''.join(list(x)))
rules['consequents'] = rules['consequents'].map(lambda x: ''.join(list(x)))

In [76]:
rules.shape

(642, 10)

In [77]:
rules.sort_values('support', ascending=False).head(5)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 566 | RED HANGING HEART T-LIGHT HOLDER | WHITE HANGING HEART T-LIGHT HOLDER | 0.042863 | 0.132973 | 0.030199 | 0.704545 | 5.298414 | 0.024499 | 2.934553 | 0.847595 |
| 567 | WHITE HANGING HEART T-LIGHT HOLDER | RED HANGING HEART T-LIGHT HOLDER | 0.132973 | 0.042863 | 0.030199 | 0.227106 | 5.298414 | 0.024499 | 1.238381 | 0.935685 |
| 637 | WOODEN PICTURE FRAME WHITE FINISH | WOODEN FRAME ANTIQUE WHITE | 0.048854 | 0.050145 | 0.02713 | 0.555334 | 11.074584 | 0.024681 | 2.136109 | 0.956429 |
| 636 | WOODEN FRAME ANTIQUE WHITE | WOODEN PICTURE FRAME WHITE FINISH | 0.050145 | 0.048854 | 0.02713 | 0.541039 | 11.074584 | 0.024681 | 2.072391 | 0.957728 |
| 297 | JUMBO BAG RED RETROSPOT | JUMBO STORAGE BAG SUKI | 0.079711 | 0.056745 | 0.026692 | 0.334861 | 5.90117 | 0.022169 | 1.418132 | 0.902479 |


## Network Visualization with Echarts

### Network Profile

We profile the network to get a quick summary of our product network.

In [78]:
myRGM = RGM(rules, 'antecedents', 'consequents')
df_nodes_profile, df_edges_profile = myRGM.get_graph_features()

In [79]:
df_nodes_profile = NxGrouper.greedy_modularity_communities(df_nodes_profile, df_edges_profile, min_member=4)
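
`NxGrouper` is a project-local helper whose code is not shown here. Judging only by its name, it may wrap networkx's `greedy_modularity_communities`; that is an assumption. Under that assumption, the sketch below shows what the underlying networkx call does on an invented graph of two product triangles joined by a single edge.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two tightly connected triangles linked by one bridge edge (C-D).
graph = nx.Graph()
graph.add_edges_from([
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
])

# Greedily merge communities while modularity keeps increasing.
communities = greedy_modularity_communities(graph)
groups = sorted(sorted(members) for members in communities)

print(groups)
```

Each triangle ends up in its own community, which is the kind of grouping used to colour the product network.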

In [80]:
force_profile = EgraphForce(
    df_edges_profile, 
    df_nodes_profile, 
    col_source='antecedents', 
    col_target='consequents', 
    col_name='nodes',
)
force_profile_option = force_profile.get_option()

In [82]:
# JupyterEcharts.show(force_profile_option)

![Description](images/profile_network.png)

### Product Network

After we know our network profile, we can select and assess individual product networks.

In [83]:
MyPN = ProductNetwork(rules)
df_bfs, rules_bfs = MyPN.get_bfs_rules(['RED HANGING HEART T-LIGHT HOLDER'], 'support', 0, 3, 5)
df_nodes_pn, df_edges_pn = MyPN.get_graph_features(df_bfs, rules_bfs, strict_rules=True)

In [84]:
df_nodes_pn['label'] = df_nodes_pn['rank'].map(lambda x : {"show": True, "position": "right", "formatter": f"{x}"})

In [85]:
force_pn = EgraphForce(
    df_edges_pn, 
    df_nodes_pn,
    col_category='depth',
    col_source='antecedents', 
    col_target='consequents',
    col_name='nodes',
)
force_pn_option = force_pn.get_option(show_legend=True)

In [None]:
# JupyterEcharts.show(force_pn_option)

![Description](images/product_network.png)

### Cross Selling Products

Using this technique, we can bundle our products that are frequently purchased together.

In [86]:
MyCSP = CrossSellingProducts(rules)
cross_selling_rules = MyCSP.get_cross_selling_products(max_support_ratio_diff=2.0, min_confidence=0.35)

In [87]:
df_nodes_csp, df_edges_csp = MyCSP.get_graph_features(cross_selling_rules)
df_nodes_csp = NxGrouper.greedy_modularity_communities(df_nodes_csp, df_edges_csp)

df_bundle_products = df_nodes_csp[~df_nodes_csp['category'].str.contains('Others')]

In [88]:
force_csp = EgraphForce(
    df_edges_csp, df_nodes_csp,
    col_source='antecedents',
    col_target='consequents',
    col_name='nodes',
)
force_csp_option = force_csp.get_option()

In [None]:
# JupyterEcharts.show(force_csp_option)

![Description](images/cross_selling_products.png)

## Product Placement with Plotly Heatmap

We can display our products side by side using a Plotly heatmap.

In [89]:
MyHM = HeatmapCrosstab(rules)

In [90]:
tabular = MyHM.get_tabular_data(
    ['WHITE HANGING HEART T-LIGHT HOLDER', 'RED HANGING HEART T-LIGHT HOLDER'], 
    'support',
    max_col=10
)

In [91]:
fig_heatmap = MyHM.plot_heatmap(tabular)

In [None]:
# fig_heatmap.show()

In [None]:
# fig_heatmap.write_image("images/heatmap_xtab.png")

![Description](images/heatmap_xtab.png)

## Summary Report

### Top 5 Lift

In [93]:
rules.sort_values('lift', ascending=False).iloc[::2].head(5).reset_index(drop=True)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | POPPY'S PLAYHOUSE LIVINGROOM | POPPY'S PLAYHOUSE BEDROOM | 0.013736 | 0.017121 | 0.011227 | 0.817376 | 47.741495 | 0.010992 | 5.381979 | 0.992689 |
| 1 | POPPY'S PLAYHOUSE LIVINGROOM | POPPY'S PLAYHOUSE KITCHEN | 0.013736 | 0.018071 | 0.011544 | 0.840426 | 46.507699 | 0.011296 | 6.153424 | 0.992126 |
| 2 | POPPY'S PLAYHOUSE BEDROOM | POPPY'S PLAYHOUSE KITCHEN | 0.017121 | 0.018071 | 0.013565 | 0.792319 | 43.845546 | 0.013256 | 4.728057 | 0.994214 |
| 3 | WOODEN TREE CHRISTMAS SCANDINAVIAN | WOODEN STAR CHRISTMAS SCANDINAVIAN | 0.013395 | 0.018655 | 0.010594 | 0.790909 | 42.396238 | 0.010344 | 4.693388 | 0.989669 |
| 4 | SMALL MARSHMALLOWS PINK BOWL | SMALL DOLLY MIX DESIGN ORANGE BOWL | 0.014953 | 0.017851 | 0.01113 | 0.7443 | 41.693982 | 0.010863 | 3.841014 | 0.990832 |
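
Lift is symmetric, so every pair appears twice in `rules` (A→B and B→A). The `.iloc[::2]` step above keeps one of each, but only if the two mirrored rows land next to each other after sorting. A possible alternative that does not depend on row order is to build an order-independent key per pair. This is a sketch on an invented frame, with hypothetical names, not the code used in the notebook.

```python
import pandas as pd

# Toy rules: each pair appears in both directions, and the rows are interleaved.
toy_rules = pd.DataFrame({
    "antecedents": ["MUG", "SPOON", "PLATE", "FORK"],
    "consequents": ["PLATE", "FORK", "MUG", "SPOON"],
    "lift": [5.0, 3.0, 5.0, 3.0],
})

# Same key for A -> B and B -> A: the two names sorted and joined.
pair_key = pd.Series(
    [" | ".join(sorted(pair)) for pair in zip(toy_rules["antecedents"], toy_rules["consequents"])],
    index=toy_rules.index,
)

# Keep the first rule of every pair, then rank by lift.
unique_pairs = (
    toy_rules.loc[~pair_key.duplicated()]
    .sort_values("lift", ascending=False)
    .reset_index(drop=True)
)

print(unique_pairs)
```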


### Bundle Products

In [95]:
cross_selling_rules.sort_values('confidence_mean', ascending=False).head()

| | antecedents | consequents | antecedent support | consequent support | antecedent confidence | consequent confidence | support | support_ratio_diff | confidence_mean |
|---|---|---|---|---|---|---|---|---|---|
| 0 | POPPY'S PLAYHOUSE BEDROOM | POPPY'S PLAYHOUSE KITCHEN | 0.017121 | 0.018071 | 0.792319 | 0.750674 | 0.013565 | 1.055477 | 0.77 |
| 1 | POPPY'S PLAYHOUSE KITCHEN | POPPY'S PLAYHOUSE BEDROOM | 0.018071 | 0.017121 | 0.750674 | 0.792319 | 0.013565 | 1.055477 | 0.77 |
| 2 | GREEN REGENCY TEACUP AND SAUCER | ROSES REGENCY TEACUP AND SAUCER | 0.032707 | 0.034364 | 0.762472 | 0.725726 | 0.024939 | 1.050633 | 0.74 |
| 4 | PINK REGENCY TEACUP AND SAUCER | GREEN REGENCY TEACUP AND SAUCER | 0.024963 | 0.032707 | 0.834146 | 0.636634 | 0.020823 | 1.310244 | 0.74 |
| 5 | GREEN REGENCY TEACUP AND SAUCER | PINK REGENCY TEACUP AND SAUCER | 0.032707 | 0.024963 | 0.636634 | 0.834146 | 0.020823 | 1.310244 | 0.74 |
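
The `CrossSellingProducts` class is not shown in this notebook, so how it derives `support_ratio_diff` and `confidence_mean` is not documented here. The values in the table are consistent with the larger support divided by the smaller one, and with the mean of the two directional confidences rounded to two decimals. The sketch below encodes that reading; it is an inference from the reported numbers, and the function name `pair_summary` is hypothetical.

```python
def pair_summary(support_a: float, support_b: float, support_ab: float) -> dict:
    """Guess at the cross-selling columns, inferred from the table values."""
    confidence_ab = support_ab / support_a  # Confidence(A -> B)
    confidence_ba = support_ab / support_b  # Confidence(B -> A)
    # Assumption: ratio of the larger individual support to the smaller one.
    support_ratio_diff = max(support_a, support_b) / min(support_a, support_b)
    # Assumption: mean of both confidences, rounded to two decimals.
    confidence_mean = round((confidence_ab + confidence_ba) / 2, 2)
    return {
        "antecedent confidence": confidence_ab,
        "consequent confidence": confidence_ba,
        "support_ratio_diff": support_ratio_diff,
        "confidence_mean": confidence_mean,
    }


# Supports of POPPY'S PLAYHOUSE BEDROOM, KITCHEN, and both together.
summary = pair_summary(support_a=0.017121, support_b=0.018071, support_ab=0.013565)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Read this way, a `max_support_ratio_diff` of 2.0 would keep only pairs whose products sell at broadly similar rates, and `min_confidence` would require the association to hold in both directions.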


In [104]:
df_bundle_products.groupby('category').agg(
    products=('nodes', 'unique'), 
    support_mean=('support', 'mean'), 
    n=('nodes', 'size')
).sort_values('support_mean', ascending=False)

| category | products | support_mean | n |
|---|---|---|---|
| LUNCH BAG BLACK SKULL. | [LUNCH BAG BLACK SKULL., LUNCH BAG SUKI DESIGN, LUNCH BAG CARS BLUE, LUNCH BAG SPACEBOY DESIGN, LUNCH BAG WOODLAND, LUNCH BAG RED RETROSPOT, LUNCH BAG PINK POLKADOT, LUNCH BAG APPLE DESIGN] | 0.045582 | 8 |
| JUMBO STORAGE BAG SUKI | [JUMBO STORAGE BAG SUKI, JUMBO BAG RED RETROSPOT, JUMBO SHOPPER VINTAGE RED PAISLEY, JUMBO BAG STRAWBERRY, JUMBO BAG BAROQUE BLACK WHITE, JUMBO BAG PINK VINTAGE PAISLEY, JUMBO BAG WOODLAND ANIMALS, JUMBO BAG OWLS, JUMBO BAG SPACEBOY DESIGN, JUMBO STORAGE BAG SKULLS, JUMBO BAG SCANDINAVIAN PAISLEY, JUMBO BAG TOYS] | 0.041373 | 12 |
| COOK WITH WINE METAL SIGN | [COOK WITH WINE METAL SIGN, GIN + TONIC DIET METAL SIGN, HAND OVER THE CHOCOLATE SIGN, PLEASE ONE PERSON METAL SIGN] | 0.038382 | 4 |
| 60 TEATIME FAIRY CAKE CASES | [60 TEATIME FAIRY CAKE CASES, PACK OF 60 DINOSAUR CAKE CASES, 72 SWEETHEART FAIRY CAKE CASES, PACK OF 60 PINK PAISLEY CAKE CASES, PACK OF 72 SKULL CAKE CASES, PACK OF 60 SPACEBOY CAKE CASES, PACK OF 72 RETRO SPOT CAKE CASES] | 0.037843 | 7 |
| CHOCOLATE HOT WATER BOTTLE | [CHOCOLATE HOT WATER BOTTLE, HOT WATER BOTTLE I AM SO POORLY, HOT WATER BOTTLE TEA AND SYMPATHY, SCOTTIE DOG HOT WATER BOTTLE] | 0.036555 | 4 |
| PAPER CHAIN KIT VINTAGE CHRISTMAS | [PAPER CHAIN KIT VINTAGE CHRISTMAS, SET OF 20 VINTAGE CHRISTMAS NAPKINS, 60 CAKE CASES VINTAGE CHRISTMAS, PAPER CHAIN KIT 50'S CHRISTMAS] | 0.033913 | 4 |
| RED RETROSPOT CHARLOTTE BAG | [RED RETROSPOT CHARLOTTE BAG, CHARLOTTE BAG SUKI DESIGN, STRAWBERRY CHARLOTTE BAG, WOODLAND CHARLOTTE BAG, CHARLOTTE BAG PINK POLKADOT] | 0.031787 | 5 |
| PLASTERS IN TIN SPACEBOY | [PLASTERS IN TIN SPACEBOY, PLASTERS IN TIN WOODLAND ANIMALS, PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TIN VINTAGE PAISLEY, PLASTERS IN TIN STRONGMAN, PLASTERS IN TIN SKULLS] | 0.028182 | 6 |
| DOTCOM POSTAGE | [DOTCOM POSTAGE, SMALL HEART MEASURING SPOONS, LARGE HEART MEASURING SPOONS, SUKI SHOULDER BAG, RECYCLING BAG RETROSPOT] | 0.027749 | 5 |
| ALARM CLOCK BAKELIKE GREEN | [ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKELIKE RED, ALARM CLOCK BAKELIKE PINK, ALARM CLOCK BAKELIKE IVORY] | 0.026491 | 4 |


### Product Layout

In [97]:
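# regex=False so the trailing '.' is matched literally rather than as a regex wildcard
lunch_bag_bundles = df_bundle_products[df_bundle_products['category'].str.contains('LUNCH BAG BLACK SKULL.', regex=False)]['nodes'].values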

In [98]:
tabular = MyHM.get_tabular_data(
    lunch_bag_bundles, 
    'support',
    max_col=10,
    personal_placement=False
)

In [100]:
# MyHM.plot_heatmap(tabular)