## Data: Online Retail II

The Online Retail II data set contains all the transactions of a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.

The dataset can be accessed at the following link: https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci/

Attribute Information:

- InvoiceNo: Invoice number (column `Invoice` in the CSV). Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
- UnitPrice: Unit price (column `Price` in the CSV). Numeric. Product price per unit in sterling (£).
- CustomerID: Customer number (column `Customer ID` in the CSV). Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.

## Business Understanding

Market Basket Analysis is a data mining technique employed to discover relationships and patterns within large datasets, particularly in the context of market analysis. By identifying frequently co-occurring items in transactions, businesses can gain valuable insights into customer behavior, optimize product placement, and enhance overall marketing strategies.

## Objectives

1. **Association Rule Discovery**:
    Identify associations and correlations among products or items in a dataset. Discover rules that indicate the likelihood of certain items being bought together.
2. **Cross-Selling Opportunities**:
    Uncover opportunities for cross-selling by understanding which products are frequently purchased together.
3. **Promotion Planning**:
    Optimize promotional campaigns by identifying items that are frequently bought together. Design effective promotions and discounts to incentivize the purchase of complementary products.
4. **Optimizing Product Layout**:
    Arrange products in-store or online in a way that encourages the purchase of related items, creating a more convenient and satisfying shopping experience.

## Key Metrics

- **Support** 
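    - Support measures how frequently an itemset appears in the dataset: the share of all transactions that contain it. For a rule A→B, this is Support(A∪B), the share of transactions that contain both A and B.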
- **Confidence** 
    - Confidence measures how strong an association rule is.
    - In market basket terms: how likely the second product is to be in the basket, given that the first one is.
    - Confidence(A→B) = Support(A∪B) / Support(A)
    - Confidence(B→A) = Support(A∪B) / Support(B)
- **Lift**
    - Lift measures how much more often A and B are bought together than would be expected if they were bought independently of each other.
    - Lift(A→B) = Support(A∪B) / (Support(A) × Support(B))
    - If Lift = 1, it means there is no association between A and B.
    - If Lift > 1, it indicates that A and B are more likely to be bought together than randomly.
    - If Lift < 1, it suggests that A and B are less likely to be bought together than randomly.
- **Leverage**
    - Leverage measures the difference between the observed frequency of A and B occurring together and the frequency that would be expected if A and B were statistically independent.
    - Leverage(A→B)=Support(A∪B)−(Support(A)×Support(B))
    - Positive leverage indicates that the items appear together more frequently than expected by chance.
    - Zero leverage means the items occur together exactly as expected based on their individual supports.
    - Negative leverage implies the items co-occur less frequently than expected.
    
- **Conviction**
    - Conviction compares how often A would occur without B if the two were independent with how often A actually occurs without B, that is, how often the rule A→B fails.
    - Conviction(A→B) = (1 − Support(B)) / (1 − Confidence(A→B))
    - Conviction(B→A) = (1 − Support(A)) / (1 − Confidence(B→A))
    - If Conviction = 1, it means that A and B are independent of each other.
    - If Conviction > 1, it suggests that the presence of A increases the likelihood of B; the higher the value, the stronger the association.
    - If Conviction < 1, it indicates a negative association between A and B.
- **Zhang’s metric**
    - A measure that captures negative associations as well as positive ones. It can tell us, for instance, whether buying A makes a customer less likely to buy B.
    - It ranges from -1 to 1: values below 0 indicate a negative association (dissociation), values above 0 indicate a positive association, and -1 and 1 are the extremes.
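
The formulas above can be checked with a few lines of plain Python. The sketch below is illustrative only: the function name `rule_metrics` is made up for this example, and Zhang's metric is written in its leverage-based form, which is assumed here to match what the rule-mining library reports.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Compute the metrics of the rule A -> B from three supports.

    All inputs are fractions of transactions in [0, 1]; support_ab is the
    share of transactions that contain both A and B.
    """
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = (
        float("inf") if confidence == 1 else (1 - support_b) / (1 - confidence)
    )
    # Zhang's metric: leverage rescaled so that it falls between -1 and 1.
    denominator = max(
        support_ab * (1 - support_a),
        support_a * (support_b - support_ab),
    )
    zhang = leverage / denominator if denominator else 0.0
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }


# Supports of WHITE HANGING HEART T-LIGHT HOLDER (A) and
# RED HANGING HEART T-LIGHT HOLDER (B), taken from the rules table.
metrics = rule_metrics(0.132255, 0.044130, 0.031197)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these three supports, the results agree with the corresponding row of the rules table in the Data Mining section to about four decimal places.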

## Import Modules

In [1]:
import os
import re

import pandas as pd
import numpy as np

import plotly.express as px
import networkx as nx

import json
import copy

In [2]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

In [3]:
from networks.RulesGraphManager import RulesGraphManager as RGM
from networks.ProductNetwork import ProductNetwork
from networks.CrossSellingProducts import CrossSellingProducts

from grouper.NxGrouper import NxGrouper
from charts.HeatmapXTab import HeatmapCrosstab

from echarts.EgraphForce import EgraphForce
from echarts.EgraphStandard import EgraphStandard
from echarts.JupyterEcharts import JupyterEcharts

In [4]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)

## Data Preparation

### Load Dataset

In [5]:
pathname = os.path.join("F:\\Data\\datas", "online_retail_II.csv")
df = pd.read_csv(
    pathname, 
    dtype = {'Customer ID': str, 'Invoice': str},
    parse_dates = ['InvoiceDate']
)

In [6]:
df.shape

(1067371, 8)

### Data Cleaning

#### Drop missing values

In [7]:
df.isna().sum()

Invoice             0
StockCode           0
Description      4382
Quantity            0
InvoiceDate         0
Price               0
Customer ID    243007
Country             0
dtype: int64

In [8]:
df = df.dropna()

#### Ignore duplicate rows

There are two possible reasons for duplicate rows:
1. The cashier scanned each unit of a product individually instead of scanning once and entering the quantity, producing several identical lines on the invoice.
2. The cashier accidentally scanned the same product twice.

**Decision process**: The first scenario is more common, so we assume that duplicate rows are not mistakes. <br>
**Action**: We keep the duplicates, since we cannot verify which scenario applies. They do not affect the rules mined below, because each product is counted at most once per invoice when the baskets are encoded.

In [None]:
df.loc[df.duplicated()].shape

#### Drop rows whose StockCode contains "TEST"

In [9]:
df = df[~df['StockCode'].str.contains('TEST')]

#### Drop rows with price less than or equal to zero

In [10]:
idx_price_less0 = df.loc[df['Price'] <= 0].index
df = df.drop(index=idx_price_less0)

#### Drop cancellation invoice rows

In [11]:
df = df[~df['Invoice'].str.startswith('C')]

#### Trim whitespace in the Description column

In [12]:
df['Description'] = df['Description'].str.strip()
df['Description'] = df['Description'].replace(r'\s{2,}', ' ', regex=True)

### Data Selection

In [13]:
df_basket = df[['Description', 'Invoice']]

### Data Encoding

In [14]:
def encoding_data(df, col_item_id, col_order_id):
    # Collect the distinct items of each order into one list per transaction
    transactions = (
        df.groupby(col_order_id)[col_item_id]
        .agg(lambda items: list(set(items)))
        .tolist()
    )

    # One-hot encode: one row per transaction, one boolean column per item
    encoder = TransactionEncoder()
    one_hot_transactions = pd.DataFrame(
        encoder.fit(transactions).transform(transactions),
        columns=encoder.columns_
    )

    return one_hot_transactions

In [15]:
df_transactions = encoding_data(df_basket, 'Description', 'Invoice')
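
To make the shape of the encoded data concrete, here is a minimal sketch in plain Python, without pandas or mlxtend. The invoice numbers and product names are invented for illustration; the point is that each invoice becomes one row of booleans, one per distinct product, and that a product listed twice on an invoice still counts once.

```python
from collections import defaultdict

# Toy invoice lines as (invoice, description) pairs; values are illustrative.
rows = [
    ("1001", "MUG"),
    ("1001", "MUG"),        # duplicate line on the same invoice
    ("1001", "TEA TOWEL"),
    ("1002", "TEA TOWEL"),
    ("1003", "MUG"),
    ("1003", "CANDLE"),
]

# Group items by invoice; a set keeps each item once per basket.
baskets = defaultdict(set)
for invoice, description in rows:
    baskets[invoice].add(description)

# One boolean column per distinct item, in sorted order.
items = sorted({item for basket in baskets.values() for item in basket})
one_hot = {
    invoice: [item in basket for item in items]
    for invoice, basket in sorted(baskets.items())
}

print(items)
for invoice, flags in one_hot.items():
    print(invoice, flags)
```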

## Data Mining (Rules Extraction)

The following parameters are configured for the algorithm:

- Maximum Combination Length
    - We set the maximum combination length to 2 items.
    - This choice is made to focus on pairs of items, allowing for a more targeted analysis of co-occurrences.

- Minimum Co-Occurrence Support Threshold
    - A minimum co-occurrence support threshold of 1% is established to filter out infrequent itemsets. 
    - This ensures that only associations with a significant presence in the dataset are considered.

In [17]:
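# Mine single items and pairs that appear in at least 1% of invoices
frequent_itemsets = apriori(df_transactions, min_support=0.01, use_colnames=True, max_len=2, low_memory=True)
# Turn the itemsets into rules; the only filter is a rule support of at least 1%
rules = association_rules(frequent_itemsets, metric="support", min_threshold=0.01)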
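
For intuition about what these two parameters do, the following is a simplified sketch of frequent-pair mining in plain Python. It is not the library's implementation; the function name `frequent_pairs` and the toy baskets are assumptions made for this example. It applies the Apriori idea that a pair can only be frequent if both of its items are frequent on their own.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {(item_a, item_b): support} for pairs meeting min_support."""
    n = len(baskets)

    # Pass 1: keep only the items that are frequent on their own.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {
        item for item, count in item_counts.items() if count / n >= min_support
    }

    # Pass 2: count pairs built from frequent items only (Apriori pruning).
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


# Toy baskets; with min_support=0.5 a pair must occur in 2 of 4 baskets.
toy_baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
pairs = frequent_pairs(toy_baskets, min_support=0.5)
print(pairs)
```

In the toy data, the pair (B, C) occurs in only one of four baskets and is dropped, while (A, B) and (A, C) each reach the 50% threshold.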

In [18]:
rules['antecedents'] = rules['antecedents'].map(lambda x: ''.join(list(x)))
rules['consequents'] = rules['consequents'].map(lambda x: ''.join(list(x)))

In [19]:
rules.shape

(354, 10)

In [20]:
rules.sort_values('support', ascending=False)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 308 | WHITE HANGING HEART T-LIGHT HOLDER | RED HANGING HEART T-LIGHT HOLDER | 0.132255 | 0.044130 | 0.031197 | 0.235884 | 5.345205 | 0.025360 | 1.250949 | 0.936815 |
| 309 | RED HANGING HEART T-LIGHT HOLDER | WHITE HANGING HEART T-LIGHT HOLDER | 0.044130 | 0.132255 | 0.031197 | 0.706928 | 5.345205 | 0.025360 | 2.960863 | 0.850447 |
| 351 | WOODEN PICTURE FRAME WHITE FINISH | WOODEN FRAME ANTIQUE WHITE | 0.044861 | 0.047972 | 0.026868 | 0.598914 | 12.484645 | 0.024716 | 2.373628 | 0.963107 |
| 350 | WOODEN FRAME ANTIQUE WHITE | WOODEN PICTURE FRAME WHITE FINISH | 0.047972 | 0.044861 | 0.026868 | 0.560068 | 12.484645 | 0.024716 | 2.171106 | 0.966255 |
| 247 | LUNCH BAG SPACEBOY DESIGN | LUNCH BAG SUKI DESIGN | 0.050705 | 0.053356 | 0.024027 | 0.473853 | 8.880894 | 0.021321 | 1.799199 | 0.934797 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88 | GIN + TONIC DIET METAL SIGN | NO SINGING METAL SIGN | 0.037474 | 0.026245 | 0.010065 | 0.268592 | 10.233911 | 0.009082 | 1.331343 | 0.937414 |
| 136 | JUMBO BAG OWLS | JUMBO BAG RED RETROSPOT | 0.024892 | 0.070673 | 0.010065 | 0.404348 | 5.721398 | 0.008306 | 1.560184 | 0.846284 |
| 137 | JUMBO BAG RED RETROSPOT | JUMBO BAG OWLS | 0.070673 | 0.024892 | 0.010065 | 0.142420 | 5.721398 | 0.008306 | 1.137045 | 0.887973 |
| 287 | PLASTERS IN TIN SPACEBOY | PLASTERS IN TIN SKULLS | 0.031278 | 0.024487 | 0.010038 | 0.320934 | 13.106529 | 0.009272 | 1.436552 | 0.953526 |


## Network Visualization with Echarts

### Network Profile

We profile the network to get a quick overview of how the products are connected.

In [21]:
myRGM = RGM(rules, 'antecedents', 'consequents')
df_nodes, df_edges = myRGM.get_graph_features()

In [22]:
df_nodes_profile = NxGrouper.greedy_modularity_communities(df_nodes, df_edges, 4)

In [23]:
force = EgraphForce(
    df_edges, 
    df_nodes_profile, 
    col_source='antecedents', 
    col_target='consequents', 
    col_name='nodes',
)
profile_force_option = force.get_option()

In [None]:
# JupyterEcharts.show(profile_force_option)

![Description](images/profile_network.png)

### Product Network

Once we know the network profile, we can select individual products and inspect their networks.

In [24]:
MyPN = ProductNetwork(rules)
df_bfs, rules_bfs = MyPN.get_bfs_rules(['RED HANGING HEART T-LIGHT HOLDER'], 'support', 0, 3, 5)
df_nodes, df_edges = MyPN.get_graph_features(df_bfs, rules_bfs, strict_rules=True)
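
The `ProductNetwork` class is project code that is not shown here, so its exact behaviour and the meaning of its positional arguments are not documented in this section. Judging only by the name `get_bfs_rules` and the `depth` column used below, it presumably walks the rules breadth-first from a seed product. The following is a generic sketch of that idea, with a hypothetical function `bfs_rules` and toy product names; it should not be read as the project's implementation.

```python
from collections import deque


def bfs_rules(rules, seeds, max_depth):
    """Return {product: depth} for products reachable from the seeds.

    rules is an iterable of (antecedent, consequent) pairs; depth is the
    number of rules followed from the nearest seed product.
    """
    # Build an adjacency list: antecedent -> list of consequents.
    neighbours = {}
    for antecedent, consequent in rules:
        neighbours.setdefault(antecedent, []).append(consequent)

    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        product = queue.popleft()
        if depth[product] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in neighbours.get(product, []):
            if nxt not in depth:  # first visit gives the shortest depth
                depth[nxt] = depth[product] + 1
                queue.append(nxt)
    return depth


# Toy rules between placeholder products A..E (not taken from the data set).
toy_rules = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("D", "E")]
reachable = bfs_rules(toy_rules, seeds=["A"], max_depth=2)
print(reachable)
```

Starting from A with a depth limit of 2, the walk reaches B at depth 1 and C at depth 2, and stops before D and E.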

In [25]:
df_nodes['label'] = df_nodes['rank'].map(lambda x : {"show": True, "position": "right", "formatter": f"{x}"})

In [26]:
product_force = EgraphForce(
    df_edges, 
    df_nodes,
    col_category='depth',
    col_source='antecedents', 
    col_target='consequents',
    col_name='nodes',
)
product_force_option = product_force.get_option(show_legend=True)

In [None]:
# JupyterEcharts.show(product_force_option)

![Description](images/product_network.png)

### Cross Selling Products

Using this technique, we can bundle products that are frequently purchased together.

In [27]:
MyCSP = CrossSellingProducts(rules)
cross_selling_rules = MyCSP.get_cross_selling_products(max_support_ratio_diff=2.0, min_confidence=0.35)

In [28]:
df_nodes, df_edges = MyCSP.get_graph_features(cross_selling_rules)
df_nodes = NxGrouper.greedy_modularity_communities(df_nodes, df_edges)

In [29]:
csp_force = EgraphForce(
    df_edges, df_nodes,
    col_source='antecedents',
    col_target='consequents',
    col_name='nodes',
)
csp_force_option = csp_force.get_option()

In [None]:
# JupyterEcharts.show(csp_force_option)

![Description](images/cross_selling_products.png)

## Product Placement with Plotly Heatmap

We can display related products side by side using a Plotly heatmap.

In [30]:
MyHM = HeatmapCrosstab(rules)

In [31]:
tabular = MyHM.get_tabular_data(
    ['WHITE HANGING HEART T-LIGHT HOLDER', 'RED HANGING HEART T-LIGHT HOLDER'], 
    'support',
    max_col=10
)

In [32]:
fig = MyHM.plot_heatmap(tabular)

In [None]:
# fig.show()

In [None]:
# fig.write_image("images/heatmap_xtab.png")

![Description](images/heatmap_xtab.png)