# Market Basket Analysis on Online Retail Data

## Apriori Algorithm and Association Rules

### 1. Overview:
- Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items.
- It works by looking for combinations of items that occur together frequently in transactions.
- To put it another way, it allows retailers to identify relationships between the items that people buy.
- Association rules are widely used to analyze retail basket or transaction data. They aim to identify strong rules in that data using measures of interestingness such as support, confidence, and lift.
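
To make the three measures concrete, here is a minimal plain-Python sketch that computes them on a handful of made-up transactions. The item names and the helper functions `support`, `confidence`, and `lift` are illustrative assumptions; they are not part of the retail data or of mlxtend.

```python
# Toy transactions (hypothetical, for illustration only)
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

s = support({"bread", "butter"}, transactions)       # 2 of 4 transactions
c = confidence({"bread"}, {"butter"}, transactions)  # 2 of the 3 bread transactions
l = lift({"bread"}, {"butter"}, transactions)        # confidence / support(butter)
print(s, c, l)
```

A lift above 1, as in this toy case, means the two items appear together more often than they would if they were bought independently.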

In [2]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
import datetime as dt

# Runtime Configuration Parameters for Matplotlib
plt.rcParams['font.family'] = 'Verdana'
plt.style.use('ggplot')

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Apriori Algorithm and Association Rules
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [4]:
# Read data
retail = pd.read_csv('/kaggle/input/cleanretaildata/CleanRetailData.csv')
retail.head()


Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Date,Time,Hour,Time of Day,Month,Week of the Year,Day of Week,Sales Revenue
0,0,539993,22386,JUMBO BAG PINK POLKADOT,10,2011-01-04 10:00:00,1.95,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,19.5
1,1,539993,21499,BLUE POLKADOT WRAP,25,2011-01-04 10:00:00,0.42,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,10.5
2,2,539993,21498,RED RETROSPOT WRAP,25,2011-01-04 10:00:00,0.42,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,10.5
3,3,539993,22379,RECYCLING BAG RETROSPOT,5,2011-01-04 10:00:00,2.1,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,10.5
4,4,539993,20718,RED RETROSPOT SHOPPER BAG,10,2011-01-04 10:00:00,1.25,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,12.5


In [5]:
# List of all countries
country_list = list(dict(retail['Country'].value_counts()).keys())
print('List of all countries:', country_list)

List of all countries: ['United Kingdom', 'Germany', 'France', 'EIRE', 'Spain', 'Netherlands', 'Switzerland', 'Belgium', 'Portugal', 'Australia', 'Norway', 'Channel Islands', 'Italy', 'Finland', 'Cyprus', 'Sweden', 'Austria', 'Denmark', 'Poland', 'Israel', 'Hong Kong', 'Japan', 'Singapore', 'USA', 'Iceland', 'Canada', 'Greece', 'Malta', 'United Arab Emirates', 'European Community', 'RSA', 'Lebanon', 'Brazil', 'Czech Republic', 'Bahrain', 'Saudi Arabia']


In [6]:
# Function that filters the data frame based on country name
def choose_country(country="all", data=retail):
    if country == "all":
        return data
    # Filter first, then reset the index on the resulting frame
    # (avoids modifying a slice of the original data in place)
    return data[data["Country"] == country].reset_index(drop=True)

### 2. United Kingdom data

- Most of the transactions come from the UK, so we limit the analysis to UK transactions only

In [7]:
uk_retail = choose_country("United Kingdom")
uk_retail.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Date,Time,Hour,Time of Day,Month,Week of the Year,Day of Week,Sales Revenue
0,0,539993,22386,JUMBO BAG PINK POLKADOT,10,2011-01-04 10:00:00,1.95,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,19.5
1,1,539993,21499,BLUE POLKADOT WRAP,25,2011-01-04 10:00:00,0.42,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,10.5
2,2,539993,21498,RED RETROSPOT WRAP,25,2011-01-04 10:00:00,0.42,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,10.5
3,3,539993,22379,RECYCLING BAG RETROSPOT,5,2011-01-04 10:00:00,2.1,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,10.5
4,4,539993,20718,RED RETROSPOT SHOPPER BAG,10,2011-01-04 10:00:00,1.25,13313,United Kingdom,2011-01-04,10:00:00,10,Morning,January,1,Tuesday,12.5


#### 2.1 Create basket

- We now create the basket data, which contains the quantity of each item bought per transaction (InvoiceNo)
- This dataframe is essentially the ‘basket’ that each customer carries to the cashier in our shop
- It shows how many units of a particular item were bought in each transaction (InvoiceNo)
- If the number is 0, that transaction did not include the item
- Any other value (12, for instance) means that 12 units of the item were bought in that transaction

In [8]:
# Select Quantity before summing so that non-numeric columns are never aggregated
basket_uk = uk_retail.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')
basket_uk.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 DAISY PEGS IN WOOD BOX,12 EGG HOUSE PAINTED WOOD,12 HANGING EGGS HAND PAINTED,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
539993,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
540001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
540002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
540003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
540004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
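
To show what the grouping and pivoting step computes, here is a small plain-Python sketch on made-up invoice lines. The invoice numbers, item names, and quantities are hypothetical; the notebook itself does this with pandas on the real data.

```python
from collections import defaultdict

# Hypothetical invoice lines: (InvoiceNo, Description, Quantity)
lines = [
    (1, "BAG", 10),
    (1, "WRAP", 25),
    (2, "BAG", 3),
    (2, "BAG", 2),   # same item twice on one invoice: quantities add up
    (2, "MUG", 1),
]

# Sum quantities per (invoice, item), like a groupby followed by a sum
totals = defaultdict(int)
for invoice, item, qty in lines:
    totals[(invoice, item)] += qty

# Pivot to one row per invoice and one column per item, filling gaps with 0
items = sorted({item for _, item, _ in lines})
invoices = sorted({invoice for invoice, _, _ in lines})
basket = {
    inv: {item: totals.get((inv, item), 0) for item in items}
    for inv in invoices
}

for inv, row in basket.items():
    print(inv, row)
```

Each row is one transaction and each column one product, which is the same layout as the basket table above.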


#### 2.2 Encode

- In market basket analysis, the quantity of each item bought is not important
- What matters is whether an item was bought or not, because we only want to know how buying some items is associated with buying others
- So we encode the basket data as binary values that show whether an item was bought (1) or not (0)

In [9]:
# Encode: 1 if the item was bought (quantity >= 1), otherwise 0
def encoder(x):
    return 1 if x >= 1 else 0

In [11]:
# Apply to the dataframe
basket_uk_encoded = basket_uk.applymap(encoder)
basket_uk_encoded.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 DAISY PEGS IN WOOD BOX,12 EGG HOUSE PAINTED WOOD,12 HANGING EGGS HAND PAINTED,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
539993,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540003,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540004,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### 2.3 Filter transactions with more than 1 item

- Market basket analysis uncovers associations between 2 or more items bought together in historical data, so transactions containing a single item carry no association information

In [12]:
# Filter out transactions that have fewer than 2 items
basket_uk_encoded_filtered = basket_uk_encoded[(basket_uk_encoded > 0).sum(axis=1) >= 2]  # Row-wise count of items bought must be >= 2
basket_uk_encoded_filtered.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 DAISY PEGS IN WOOD BOX,12 EGG HOUSE PAINTED WOOD,12 HANGING EGGS HAND PAINTED,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
539993,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540003,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540005,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
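
The encoding and filtering steps can be sketched together in plain Python on a tiny, made-up basket. The invoice numbers and item names below are hypothetical, and the dictionary layout is only a stand-in for the dataframe.

```python
# Hypothetical quantity basket: invoice -> {item: quantity}
basket = {
    101: {"BAG": 10, "MUG": 0, "WRAP": 25},
    102: {"BAG": 5, "MUG": 0, "WRAP": 0},
    103: {"BAG": 0, "MUG": 2, "WRAP": 1},
}

# Encode: 1 if the item was bought at all, otherwise 0
encoded = {
    inv: {item: int(qty > 0) for item, qty in row.items()}
    for inv, row in basket.items()
}

# Keep only invoices that contain at least 2 distinct items
filtered = {inv: row for inv, row in encoded.items() if sum(row.values()) >= 2}

print(sorted(filtered))
```

Invoice 102 is dropped because it contains a single product, so it cannot support any rule between two items.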


### 3. Apriori Algorithm

- The Apriori algorithm is used to find the frequently bought itemsets in the dataset
- When applying the Apriori algorithm, we define what counts as frequent by setting a minimum support value
- Here we define a frequent itemset as one that appears in at least 2% of all transactions, so we set the minimum support to 0.02

In [13]:
frequent_itemsets = apriori(basket_uk_encoded_filtered, min_support=0.02, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.022008,(3 HOOK PHOTO SHELF ANTIQUE WHITE)
1,0.025884,(3 STRIPEY MICE FELTCRAFT)
2,0.023124,(4 TRADITIONAL SPINNING TOPS)
3,0.051701,(6 RIBBONS RUSTIC CHARM)
4,0.022467,(60 CAKE CASES DOLLY GIRL DESIGN)
...,...,...
476,0.023387,"(JUMBO SHOPPER VINTAGE RED PAISLEY, JUMBO BAG ..."
477,0.025489,"(JUMBO STORAGE BAG SUKI, JUMBO BAG PINK POLKAD..."
478,0.024307,"(JUMBO SHOPPER VINTAGE RED PAISLEY, JUMBO BAG ..."
479,0.021679,"(LUNCH BAG PINK POLKADOT, LUNCH BAG BLACK SKU..."
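
For intuition about what a minimum support threshold does, here is a minimal level-wise search in plain Python. It is a sketch of the general Apriori idea on made-up transactions, not mlxtend's implementation; the function name `frequent_itemsets_sketch` and the toy items are assumptions for illustration.

```python
from itertools import combinations

def frequent_itemsets_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset that meets min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    all_items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in all_items if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Candidates of size k are unions of frequent (k-1)-itemsets ...
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # ... whose (k-1)-subsets are all frequent (the Apriori pruning step)
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result

toy_transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"milk"},
]
found = frequent_itemsets_sketch(toy_transactions, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step relies on the fact that an itemset can only be frequent if all of its subsets are frequent, which keeps the number of candidates small.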


### 4. Association Rule Mining

- After applying the Apriori algorithm and finding the frequent itemsets, we can now generate the association rules
- From the association rules, we can learn which items are likely to sell well together

In [14]:
# Create Association Rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Sort values based on lift
rules = rules.sort_values("lift",ascending=False).reset_index(drop= True)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(WOODEN HEART CHRISTMAS SCANDINAVIAN),(WOODEN STAR CHRISTMAS SCANDINAVIAN),0.031599,0.029497,0.023190,0.733888,24.880265,0.022258,3.646969,0.991126
1,(WOODEN STAR CHRISTMAS SCANDINAVIAN),(WOODEN HEART CHRISTMAS SCANDINAVIAN),0.029497,0.031599,0.023190,0.786192,24.880265,0.022258,4.529292,0.988979
2,(SMALL MARSHMALLOWS PINK BOWL),(SMALL DOLLY MIX DESIGN ORANGE BOWL),0.027657,0.032519,0.021613,0.781473,24.031469,0.020714,4.427278,0.985648
3,(SMALL DOLLY MIX DESIGN ORANGE BOWL),(SMALL MARSHMALLOWS PINK BOWL),0.032519,0.027657,0.021613,0.664646,24.031469,0.020714,2.899456,0.990601
4,"(ROSES REGENCY TEACUP AND SAUCER, GREEN REGENC...",(PINK REGENCY TEACUP AND SAUCER),0.042898,0.045658,0.032190,0.750383,16.435004,0.030232,3.823224,0.981248
...,...,...,...,...,...,...,...,...,...,...
311,(PARTY BUNTING),(REGENCY CAKESTAND 3 TIER),0.100775,0.099199,0.022927,0.227510,2.293479,0.012931,1.166101,0.627186
312,(WHITE HANGING HEART T-LIGHT HOLDER),(PARTY BUNTING),0.126330,0.100775,0.023913,0.189288,1.878315,0.011182,1.109179,0.535223
313,(PARTY BUNTING),(WHITE HANGING HEART T-LIGHT HOLDER),0.100775,0.126330,0.023913,0.237288,1.878315,0.011182,1.145478,0.520012
314,(WHITE HANGING HEART T-LIGHT HOLDER),(JUMBO BAG RED RETROSPOT),0.126330,0.118841,0.026278,0.208008,1.750306,0.011265,1.112586,0.490656


**Observation:**

**Lift:**
- WOODEN HEART CHRISTMAS SCANDINAVIAN and WOODEN STAR CHRISTMAS SCANDINAVIAN have the strongest association with each other, since the rules between these two items have the highest “lift” value
- **The higher the lift value, the stronger the association between the items**; a lift above 1 means the items appear together more often than they would if they were bought independently

**Support:**
- The support of the rule between WOODEN HEART CHRISTMAS SCANDINAVIAN and WOODEN STAR CHRISTMAS SCANDINAVIAN is 0.02319, which means these 2 items were sold together in 2.319% of all transactions

**Confidence:**
- Confidence gives us more information because, unlike support and lift, it depends on the direction of the rule: it is the support of the rule divided by the support of the antecedent
- The first rule (row 0, WOODEN HEART CHRISTMAS SCANDINAVIAN → WOODEN STAR CHRISTMAS SCANDINAVIAN) has a confidence of 0.734, while the second rule (row 1, the reverse direction) has a confidence of 0.786
- The second rule has the higher confidence because its antecedent support (0.029497) is lower than its consequent support (0.031599)
- So we prefer the second rule, which is WOODEN STAR CHRISTMAS SCANDINAVIAN → WOODEN HEART CHRISTMAS SCANDINAVIAN
- In more detail, a customer who buys WOODEN STAR CHRISTMAS SCANDINAVIAN is somewhat more likely to also buy WOODEN HEART CHRISTMAS SCANDINAVIAN than the other way around. Note that the rule describes items appearing in the same transaction, not the order in which they were bought
- This is valuable information, because it tells us which products to put discounts on. We could offer a discount on WOODEN HEART CHRISTMAS SCANDINAVIAN if a customer buys WOODEN STAR CHRISTMAS SCANDINAVIAN.
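
As a quick check, the metrics for row 0 of the rules table can be recomputed from its three support values using the standard definitions of confidence, lift, leverage, and conviction. The variable names are ours; the numbers are copied from the table.

```python
# Values copied from row 0 of the rules table
antecedent_support = 0.031599   # WOODEN HEART CHRISTMAS SCANDINAVIAN
consequent_support = 0.029497   # WOODEN STAR CHRISTMAS SCANDINAVIAN
support = 0.023190              # both items in the same transaction

confidence = support / antecedent_support
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

# The reverse rule (STAR -> HEART) divides by the other item's support instead
reverse_confidence = support / consequent_support

print(confidence, lift, leverage, conviction, reverse_confidence)
```

Up to rounding of the inputs, the results agree with the confidence, lift, leverage, and conviction columns of rows 0 and 1. Both directions share the same support and lift; only the denominator of the confidence changes.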

In [15]:
rules.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 316 entries, 0 to 315
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   antecedents         316 non-null    object 
 1   consequents         316 non-null    object 
 2   antecedent support  316 non-null    float64
 3   consequent support  316 non-null    float64
 4   support             316 non-null    float64
 5   confidence          316 non-null    float64
 6   lift                316 non-null    float64
 7   leverage            316 non-null    float64
 8   conviction          316 non-null    float64
 9   zhangs_metric       316 non-null    float64
dtypes: float64(8), object(2)
memory usage: 24.8+ KB


In [16]:
# Save rules dataframe for next steps
rules.to_csv('Rules.csv')

### 5. Product Recommendation based on Association Rules

- The product recommendation part of this project makes use of the association rules that were uncovered in the Market Basket Analysis section
- Product recommendation is one of the practical uses of Market Basket Analysis: we can recommend to customers the products that appear in the same itemsets as the products they already have
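
The lookup idea can be sketched on a few made-up rules before applying it to the real rules table. The tuple layout, the item names, and the function `recommend` are hypothetical and only mirror the approach used in this section: pick the highest-lift rule whose antecedents contain the customer's items.

```python
# Hypothetical rules as (antecedents, consequents, lift)
toy_rules = [
    (frozenset({"STAR"}), frozenset({"HEART"}), 24.9),
    (frozenset({"HEART"}), frozenset({"STAR"}), 24.9),
    (frozenset({"STAR", "BELL"}), frozenset({"HEART"}), 30.0),
    (frozenset({"BAG"}), frozenset({"WRAP"}), 3.2),
]

def recommend(items, rules):
    """Return the other items of the highest-lift rule whose antecedents contain `items`."""
    items = frozenset(items)
    matches = [rule for rule in rules if items <= rule[0]]
    if not matches:
        return []
    antecedents, consequents, _ = max(matches, key=lambda rule: rule[2])
    # Recommend the consequents plus any antecedent items the customer lacks
    return sorted((consequents | antecedents) - items)

print(recommend({"STAR"}, toy_rules))
print(recommend({"MUG"}, toy_rules))
```

Products that never appear in any rule's antecedents get an empty list, which matches the "no recommendations" case seen in the output further down.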

In [17]:
# List of all products
product_catalog = list(uk_retail['Description'].unique())
print(f'There are {len(product_catalog)} unique products in the UK marketplace.')

There are 3893 unique products in the UK marketplace.


In [18]:
def remove_from_list(y, item_to_search):
    newlist = list()
    for i in y:
        if i not in item_to_search:
            newlist.append(i)
    return newlist

In [19]:
def search_list(item_to_search, list_to_search = rules['antecedents']):
    # Print the query, then scan every rule whose antecedents contain all queried items
    print(item_to_search)
    max_lift = 0
    item_to_recommend = ''
    for i, item in enumerate(list_to_search):
        if set(item_to_search).issubset(item):
            # Keep the matching rule with the highest lift
            if rules['lift'][i] > max_lift:
                max_lift = rules['lift'][i]
                y = list(rules['antecedents'][i])
                x = remove_from_list(y, item_to_search)
                # Recommend the consequents plus any antecedent items not already queried
                item_to_recommend = list(rules['consequents'][i]) + x
    
    if item_to_recommend == '':
        item_to_recommend = []
        print(f"Oops! No product recommendations available right now!: {item_to_recommend}")
    else:
        print(f"People who bought this also bought: {item_to_recommend}")
    return item_to_search, item_to_recommend

In [20]:
dict_to_store = {}
for i in range(len(product_catalog)):
    key, value = search_list([product_catalog[i]])
    dict_to_store[key[0]] = value

['JUMBO BAG PINK POLKADOT']
People who bought this also bought: ['JUMBO BAG WOODLAND ANIMALS', 'JUMBO BAG RED RETROSPOT']
['BLUE POLKADOT WRAP']
Oops! No product recommendations available right now!: []
['RED RETROSPOT WRAP']
Oops! No product recommendations available right now!: []
['RECYCLING BAG RETROSPOT']
People who bought this also bought: ['DOTCOM POSTAGE']
['RED RETROSPOT SHOPPER BAG']
Oops! No product recommendations available right now!: []
['JUMBO BAG RED RETROSPOT']
People who bought this also bought: ['JUMBO BAG PINK POLKADOT', 'JUMBO BAG WOODLAND ANIMALS']
['RED RETROSPOT CHILDRENS UMBRELLA']
Oops! No product recommendations available right now!: []
['JAM MAKING SET PRINTED']
People who bought this also bought: ['JAM MAKING SET WITH JARS']
['RECIPE BOX RETROSPOT']
Oops! No product recommendations available right now!: []
['CHILDRENS APRON APPLES DESIGN']
Oops! No product recommendations available right now!: []
['PEG BAG APPLES DESIGN']
Oops! No product recommendations av

In [21]:
dict_to_store['JAM MAKING SET PRINTED']

['JAM MAKING SET WITH JARS']

In [22]:
import json

# Write the recommendations dictionary to a JSON file;
# the with-block closes the file automatically
with open("item_sets.json", "w") as f:
    json.dump(dict_to_store, f)


In [23]:
# Opening JSON file
with open('item_sets.json') as json_file:
    data = json.load(json_file)

In [24]:
for a in data['JUMBO BAG PINK POLKADOT']:
    print(a)

JUMBO BAG WOODLAND ANIMALS
JUMBO BAG RED RETROSPOT
