# Finding Commonly Ordered Groups of Products in an Online Gift Store Using Unsupervised Learning

## Background
Association rule mining is the process of discovering useful association rules in data. These rules are typically expressed as "if-then" conditions (e.g., "if a customer purchases X, then they are likely to purchase Y as well"). They can be used to identify groups of products that are frequently ordered together, such as cake and candles, notebooks and pens, or bread and strawberry jam. The information gained from these rules can serve a variety of purposes, including:

- Increasing order picking efficiency in a warehouse by placing commonly ordered items close to each other.
- Providing customers with relevant recommendations based on their purchase history.
- Offering strategic discounts or bundled deals based on frequently paired items.
- Optimizing physical store product placement to increase cross-selling and improve the customer experience.

In this project, I used the Apriori algorithm to find the groups of items that are frequently ordered together, and computed metrics that measure the strength of each resulting rule. The data comes from the Online Retail dataset, which can be found at https://archive.ics.uci.edu/ml/datasets/Online+Retail#. It contains the sales of a UK-based online retail store that sells unique all-occasion gifts.

## The Apriori Algorithm

The **Apriori algorithm** is a method for efficiently finding frequently ordered groups of products (frequent itemsets), from which association rules are then derived. It works iteratively: it discards products or groups of products that are not ordered often enough to be relevant, and builds larger candidate groups only from the smaller groups that survived. Whether an item or group is kept depends on its **support**, defined as the share of orders that include it (the mlxtend implementation used below expresses support as a fraction between 0 and 1).
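To make the pruning idea concrete, here is a minimal pure-Python sketch of the level-by-level search. It is an illustration only and not the mlxtend implementation used later; the function name `frequent_itemsets` and the toy orders are assumptions made up for this example.

```python
from itertools import combinations


def frequent_itemsets(orders, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    orders = [frozenset(order) for order in orders]
    n_orders = len(orders)

    def support(itemset):
        # Share of orders that contain every item of the itemset
        return sum(itemset <= order for order in orders) / n_orders

    # Level 1: keep only the single items that are ordered often enough
    items = {item for order in orders for item in order}
    current = {frozenset([item]): support(frozenset([item])) for item in items}
    current = {s: sup for s, sup in current.items() if sup >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into size-k candidates
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets are frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: sup for c, sup in current.items() if sup >= min_support}
        result.update(current)
        k += 1
    return result


# Hypothetical toy orders, not taken from the Online Retail dataset
toy_orders = [
    {"cake", "candles"},
    {"cake", "candles", "pens"},
    {"notebooks", "pens"},
    {"cake", "notebooks"},
]
found = frequent_itemsets(toy_orders, min_support=0.5)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a minimum support of 0.5, every single item survives, but the only pair that appears in at least half of the toy orders is cake and candles, so no larger group is ever considered.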

The strength of an association rule can be evaluated in various ways, but two of the most important measures are:

- **Confidence**: The share of orders containing the antecedent that also contain the consequent. For example, a statement like "Product X leads to buying Product Y with 40% confidence" means that 40% of orders that include X also contain Y.

- **Lift**: The ratio between how often Y appears in orders containing X and how often Y appears in all orders. Suppose Y is in 10% of all orders and in 40% of orders containing X; in that case, the lift of the rule is 0.4 / 0.1 = 4. A lift above 1 indicates that X and Y are ordered together more often than would be expected if they were independent.
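The two metrics can be checked on a toy example that reproduces the numbers above. This is a minimal sketch: the 20 orders and the helper functions `support`, `confidence` and `lift` are assumptions written for illustration, not part of the analysis code.

```python
# 20 hypothetical orders: Y is in 2 of them (10%), X is in 5 of them,
# and 2 of the 5 orders containing X also contain Y (40%)
orders = [{"X", "Y"}] * 2 + [{"X"}] * 3 + [{"Z"}] * 15


def support(itemset, orders):
    # Share of orders that contain every item of the itemset
    return sum(itemset <= order for order in orders) / len(orders)


def confidence(antecedent, consequent, orders):
    # Share of the orders containing the antecedent that also contain the consequent
    return support(antecedent | consequent, orders) / support(antecedent, orders)


def lift(antecedent, consequent, orders):
    # Confidence relative to how common the consequent is in all orders
    return confidence(antecedent, consequent, orders) / support(consequent, orders)


print(support({"Y"}, orders))
print(confidence({"X"}, {"Y"}, orders))
print(lift({"X"}, {"Y"}, orders))
```

Up to floating-point rounding, the three printed values correspond to the 10% support, 40% confidence and lift of 4 from the example.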

## The code

### Getting and preparing the data

First, we load the data and reshape it into a binary matrix with one row per invoice and one column per product, where each cell indicates whether the product was included in that order. This is the input format required by the mlxtend implementation of the Apriori algorithm.
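As a small illustration of the target layout, the sketch below builds such a matrix from a made-up six-row table. It uses `pd.crosstab` as a simple alternative for the toy case, which is not the call used in the notebook cells; the invoice numbers and product names are invented for this example.

```python
import pandas as pd

# Hypothetical order lines: one row per product line of an invoice
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 3, 3, 3],
    "Description": ["CAKE", "CANDLES", "CAKE", "PENS", "CAKE", "CAKE"],
})

# Count how often each product appears per invoice, then reduce the counts to True/False
matrix = pd.crosstab(toy["InvoiceNo"], toy["Description"]).astype(bool)
print(matrix)
```

Invoice 3 lists CAKE twice, but the matrix only records that the product was present, which is all the algorithm needs.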

In [1]:
import pandas as pd

In [2]:
# Read the data
raw_df = pd.read_excel("online_retail.xlsx")
raw_df

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France


In [3]:
# Get the binary matrix
raw_df = raw_df[['InvoiceNo', 'Description']].copy() # Keep only the necessary columns (as an independent copy)
raw_df['Description'] = raw_df['Description'].astype(str) # Convert all values to strings (any missing description becomes the string 'nan')
# Binary matrix: one row per invoice, one column per product, True if the product is in the invoice
df = raw_df.pivot_table(index='InvoiceNo', columns='Description', aggfunc=lambda x: 1, fill_value=0).astype(bool)
df

InvoiceNo,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
536365,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536366,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536367,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536368,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
536369,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C581484,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
C581490,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
C581499,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
C581568,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### Getting the most common itemsets

Next, we find the frequent itemsets: the items and groups of items that appear in at least 1.8% of orders (`min_support=0.018`). These itemsets are the basis for the association rules.

In [4]:
from mlxtend.frequent_patterns import apriori

In [5]:
# Get the common itemsets
common_itemsets = apriori(df, min_support=0.018, use_colnames=True)
common_itemsets # Print the common itemsets

,support,itemsets
0,0.018224,(3 STRIPEY MICE FELTCRAFT)
1,0.018764,(4 TRADITIONAL SPINNING TOPS)
2,0.037104,(6 RIBBONS RUSTIC CHARM)
3,0.024015,(60 CAKE CASES VINTAGE CHRISTMAS)
4,0.032278,(60 TEATIME FAIRY CAKE CASES)
...,...,...
265,0.018919,"(STRAWBERRY CHARLOTTE BAG, RED RETROSPOT CHARL..."
266,0.019575,"(WOODLAND CHARLOTTE BAG, RED RETROSPOT CHARLOT..."
267,0.020734,"(ROSES REGENCY TEACUP AND SAUCER , REGENCY CAK..."
268,0.021042,"(WOODEN PICTURE FRAME WHITE FINISH, WOODEN FRA..."


### Getting the association rules

In [6]:
from mlxtend.frequent_patterns import association_rules

In [7]:
# Get the association rules
rules = association_rules(common_itemsets, metric='lift', min_threshold=1) # Keep only rules with lift >= 1, i.e. items ordered together at least as often as expected under independence
rules.to_csv('rules.csv', index=False) # Save rules to a csv file

### Final results

In [8]:
# Print the 10 association rules with the highest confidence
pd.set_option('display.max_colwidth', 100) # Increase max. column width to make all values readable
rules.sort_values(['confidence'], ascending=False).head(10).loc[:,['antecedents', 'consequents', 'confidence', 'lift']]

,antecedents,consequents,confidence,lift
94,"(ROSES REGENCY TEACUP AND SAUCER , PINK REGENCY TEACUP AND SAUCER)",(GREEN REGENCY TEACUP AND SAUCER),0.894137,21.909313
96,"(PINK REGENCY TEACUP AND SAUCER, GREEN REGENCY TEACUP AND SAUCER)",(ROSES REGENCY TEACUP AND SAUCER ),0.852484,19.713703
14,(PINK REGENCY TEACUP AND SAUCER),(GREEN REGENCY TEACUP AND SAUCER),0.803995,19.70054
83,(PINK REGENCY TEACUP AND SAUCER),(ROSES REGENCY TEACUP AND SAUCER ),0.766542,17.72628
19,(GREEN REGENCY TEACUP AND SAUCER),(ROSES REGENCY TEACUP AND SAUCER ),0.741722,17.152318
12,(GARDENERS KNEELING PAD CUP OF TEA ),(GARDENERS KNEELING PAD KEEP CALM ),0.717647,20.115865
95,"(ROSES REGENCY TEACUP AND SAUCER , GREEN REGENCY TEACUP AND SAUCER)",(PINK REGENCY TEACUP AND SAUCER),0.700255,22.642456
18,(ROSES REGENCY TEACUP AND SAUCER ),(GREEN REGENCY TEACUP AND SAUCER),0.7,17.152318
4,(CHARLOTTE BAG PINK POLKADOT),(RED RETROSPOT CHARLOTTE BAG),0.692105,17.07193
98,(PINK REGENCY TEACUP AND SAUCER),"(ROSES REGENCY TEACUP AND SAUCER , GREEN REGENCY TEACUP AND SAUCER)",0.685393,22.642456


For example, two statements can be made from the first rule printed above, based on its confidence and lift:

1. Of the orders that contain both a Roses Regency Teacup and Saucer and a Pink Regency Teacup and Saucer, 89.4% also contain a Green Regency Teacup and Saucer.
2. An order that contains a Roses Regency Teacup and Saucer and a Pink Regency Teacup and Saucer is about 21.9 times as likely to contain a Green Regency Teacup and Saucer as an order chosen at random.
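Because lift is confidence divided by the support of the consequent, the two numbers of a rule also reveal how common the consequent is overall. The short sketch below recovers that baseline from two printed rules that share the same consequent (rules 94 and 14 in the table above); the variable names are chosen for this example only.

```python
# Confidence and lift copied from the table above; both rules have
# (GREEN REGENCY TEACUP AND SAUCER) as their consequent
rule_94 = {"confidence": 0.894137, "lift": 21.909313}
rule_14 = {"confidence": 0.803995, "lift": 19.70054}

# lift = confidence / support(consequent)  =>  support(consequent) = confidence / lift
baseline_from_rule_94 = rule_94["confidence"] / rule_94["lift"]
baseline_from_rule_14 = rule_14["confidence"] / rule_14["lift"]

print(round(baseline_from_rule_94, 4), round(baseline_from_rule_14, 4))
```

Both rules imply that roughly 4% of all orders contain the green teacup, which is the "average likelihood" against which the 89.4% confidence is compared.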

In [9]:
# Print the 10 association rules with the highest lift
rules.sort_values(['lift'], ascending=False).head(10).loc[:,['antecedents', 'consequents', 'confidence', 'lift']]

,antecedents,consequents,confidence,lift
95,"(ROSES REGENCY TEACUP AND SAUCER , GREEN REGENCY TEACUP AND SAUCER)",(PINK REGENCY TEACUP AND SAUCER),0.700255,22.642456
98,(PINK REGENCY TEACUP AND SAUCER),"(ROSES REGENCY TEACUP AND SAUCER , GREEN REGENCY TEACUP AND SAUCER)",0.685393,22.642456
99,(GREEN REGENCY TEACUP AND SAUCER),"(ROSES REGENCY TEACUP AND SAUCER , PINK REGENCY TEACUP AND SAUCER)",0.519395,21.909313
94,"(ROSES REGENCY TEACUP AND SAUCER , PINK REGENCY TEACUP AND SAUCER)",(GREEN REGENCY TEACUP AND SAUCER),0.894137,21.909313
12,(GARDENERS KNEELING PAD CUP OF TEA ),(GARDENERS KNEELING PAD KEEP CALM ),0.717647,20.115865
13,(GARDENERS KNEELING PAD KEEP CALM ),(GARDENERS KNEELING PAD CUP OF TEA ),0.594156,20.115865
96,"(PINK REGENCY TEACUP AND SAUCER, GREEN REGENCY TEACUP AND SAUCER)",(ROSES REGENCY TEACUP AND SAUCER ),0.852484,19.713703
97,(ROSES REGENCY TEACUP AND SAUCER ),"(PINK REGENCY TEACUP AND SAUCER, GREEN REGENCY TEACUP AND SAUCER)",0.490179,19.713703
15,(GREEN REGENCY TEACUP AND SAUCER),(PINK REGENCY TEACUP AND SAUCER),0.609272,19.70054
14,(PINK REGENCY TEACUP AND SAUCER),(GREEN REGENCY TEACUP AND SAUCER),0.803995,19.70054
