# Market Basket Analysis

Market Basket Analysis (MBA), also known as Affinity Analysis, is a data mining technique for identifying products that customers frequently purchase together. It analyzes the transactional data of a store or website to find items that are commonly bought together or in sequence. The output of MBA is a set of association rules, each of which gives the likelihood that a product is purchased given that another product has been purchased. MBA is used in retail, e-commerce, and other industries to inform pricing decisions, inventory management, and marketing strategies.

This project is a practical application of the Apriori algorithm, which mines frequent itemsets from transaction data and derives association rules from them. The resulting rules can be used to recommend products based on the products already present in the user’s cart.
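
To make the idea concrete, here is a minimal pure-Python sketch of the level-wise search that Apriori performs: count single items, keep the frequent ones, then build larger candidates only from itemsets that were frequent at the previous level. It is an illustration on made-up transactions, not the mlxtend implementation used in this project; the function name `frequent_itemsets` and the toy baskets are assumptions introduced for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset that meets min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: sup for s, sup in current.items() if sup >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: sup for c, sup in current.items() if sup >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Made-up baskets, for illustration only
toy_transactions = [
    {"plates", "cups", "napkins"},
    {"plates", "cups"},
    {"plates", "napkins"},
    {"cups", "napkins"},
    {"plates", "cups", "napkins"},
]
toy_frequent = frequent_itemsets(toy_transactions, min_support=0.6)
for itemset, sup in sorted(toy_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With these five baskets and a threshold of 0.6, the three single items and the three pairs qualify, while the full triple (support 0.4) does not.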

Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Importing the dataset

Link to dataset: http://archive.ics.uci.edu/ml/datasets/Online+Retail

In [6]:
online_retail = pd.read_csv(r"C:\Users\lolad\Desktop\Online Retail.csv", encoding = 'unicode_escape')
online_retail

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/2010 08:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,01/12/2010 08:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01/12/2010 08:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01/12/2010 08:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01/12/2010 08:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,09/12/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,09/12/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,09/12/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,09/12/2011 12:50,4.15,12680.0,France


Data preparation

In [7]:
#removing leading and trailing whitespace from the Description column
online_retail["Description"] = online_retail["Description"].str.strip()
#dropping rows with a null InvoiceNo, then casting the column to str
#(dropping first keeps NaN from becoming the literal string "nan";
# the result is assigned back because astype returns a new object)
online_retail = online_retail.dropna(subset = ["InvoiceNo"]).astype({"InvoiceNo": "str"})
#removing cancelled transactions (invoice numbers containing "C")
online_retail = online_retail[~online_retail["InvoiceNo"].str.contains("C")]
#dropping POSTAGE rows, which are shipping charges rather than products
online_retail = online_retail[online_retail["Description"] != "POSTAGE"]

In [97]:
online_retail["Country"].value_counts()

United Kingdom          487570
Germany                   8668
France                    8108
EIRE                      7894
Spain                     2423
Netherlands               2326
Switzerland               1936
Belgium                   1935
Portugal                  1471
Australia                 1184
Norway                    1052
Channel Islands            748
Italy                      741
Finland                    648
Cyprus                     613
Unspecified                446
Sweden                     429
Austria                    384
Denmark                    367
Poland                     325
Japan                      321
Israel                     295
Hong Kong                  282
Singapore                  222
Iceland                    182
USA                        179
Canada                     150
Greece                     142
Malta                      109
United Arab Emirates        67
RSA                         58
European Community          57
Lebanon 

Splitting the data by country of transaction (France, Spain, EIRE, Netherlands, Germany)

In [8]:
#France basket
france_basket = online_retail[online_retail["Country"] == "France"].groupby(["InvoiceNo", "Description"])["Quantity"].sum().unstack().reset_index().fillna(0).set_index("InvoiceNo")
#Spain basket
spain_basket = online_retail[online_retail["Country"] == "Spain"].groupby(["InvoiceNo", "Description"])["Quantity"].sum().unstack().reset_index().fillna(0).set_index("InvoiceNo")
#Ireland basket
EIRE_basket = online_retail[online_retail["Country"] == "EIRE"].groupby(["InvoiceNo", "Description"])["Quantity"].sum().unstack().reset_index().fillna(0).set_index("InvoiceNo")
#Netherlands basket
Netherlands_basket = online_retail[online_retail["Country"] == "Netherlands"].groupby(["InvoiceNo", "Description"])["Quantity"].sum().unstack().reset_index().fillna(0).set_index("InvoiceNo")
#Germany basket
Germany_basket = online_retail[online_retail["Country"] == "Germany"].groupby(["InvoiceNo", "Description"])["Quantity"].sum().unstack().reset_index().fillna(0).set_index("InvoiceNo")
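
The five basket statements above follow the same pipeline, so they could be folded into one helper. A possible sketch, assuming the cleaned frame has the `Country`, `InvoiceNo`, `Description` and `Quantity` columns shown earlier; `build_basket` and the toy frame are hypothetical names used only for illustration:

```python
import pandas as pd


def build_basket(df, country):
    """Pivot one country's line items into an invoice-by-product quantity table."""
    subset = df[df["Country"] == country]
    return (
        subset.groupby(["InvoiceNo", "Description"])["Quantity"]
        .sum()
        .unstack(fill_value=0)  # products missing from an invoice get quantity 0
    )


# Made-up rows standing in for the cleaned online_retail frame
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3"],
    "Description": ["A", "B", "A", "C"],
    "Quantity": [2, 1, 5, 4],
    "Country": ["France", "France", "France", "Spain"],
})
france_toy = build_basket(toy, "France")
print(france_toy)
```

After `unstack`, `InvoiceNo` is already the index, so the `reset_index()` / `set_index("InvoiceNo")` round trip is not needed in this version.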

Defining the one-hot encoding function, which converts the quantities into the True/False format that the mlxtend functions expect

In [9]:
def convert_numbers(x):
    #True if the item appears in the invoice (summed quantity of at least 1), else False
    return x >= 1

#encoding the datasets
#(applymap is deprecated since pandas 2.1, where DataFrame.map replaces it)
france_basket = france_basket.applymap(convert_numbers)
spain_basket = spain_basket.applymap(convert_numbers)
EIRE_basket = EIRE_basket.applymap(convert_numbers)
Netherlands_basket = Netherlands_basket.applymap(convert_numbers)
Germany_basket = Germany_basket.applymap(convert_numbers)
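
Because the encoding rule is a single threshold, it can also be written as one vectorized comparison instead of a per-cell function call. A small sketch on a made-up basket (the frame below is illustrative and not taken from the dataset):

```python
import pandas as pd

# Made-up invoice-by-product quantities, including a zero and a negative value
toy_basket = pd.DataFrame(
    {"A": [2.0, 5.0, 0.0], "B": [1.0, 0.0, -3.0]},
    index=pd.Index(["1", "2", "3"], name="InvoiceNo"),
)

# True where the summed quantity is at least 1, False otherwise
toy_encoded = toy_basket >= 1
print(toy_encoded)
```

The result is a boolean DataFrame with the same index and columns as the input.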


Training the model

In [11]:
#building model for France
frequent_itemset= apriori(france_basket, min_support = 0.05, use_colnames= True)
#collecting the inferred rules in a dataframe
france_rules = association_rules(frequent_itemset, metric = "lift", min_threshold= 1)
france_rules = france_rules.sort_values(["confidence", "lift"], ascending = [False, False])
france_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
82,"(SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...",(SET/6 RED SPOTTY PAPER PLATES),0.103359,0.129199,0.100775,0.975000,7.546500,0.087421,34.832041
80,"(SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...",(SET/6 RED SPOTTY PAPER CUPS),0.103359,0.139535,0.100775,0.975000,6.987500,0.086353,34.418605
66,(SET/6 RED SPOTTY PAPER PLATES),(SET/6 RED SPOTTY PAPER CUPS),0.129199,0.139535,0.124031,0.960000,6.880000,0.106003,21.511628
9,(CHILDRENS CUTLERY SPACEBOY),(CHILDRENS CUTLERY DOLLY GIRL),0.069767,0.072351,0.064599,0.925926,12.797619,0.059552,12.523256
39,(PACK OF 6 SKULL PAPER PLATES),(PACK OF 6 SKULL PAPER CUPS),0.056848,0.064599,0.051680,0.909091,14.072727,0.048007,10.289406
...,...,...,...,...,...,...,...,...,...
48,(PLASTERS IN TIN CIRCUS PARADE),(RED TOADSTOOL LED NIGHT LIGHT),0.170543,0.183463,0.051680,0.303030,1.651729,0.020391,1.171554
55,(RED TOADSTOOL LED NIGHT LIGHT),(PLASTERS IN TIN WOODLAND ANIMALS),0.183463,0.173127,0.054264,0.295775,1.708430,0.022501,1.174160
57,(RED TOADSTOOL LED NIGHT LIGHT),(RABBIT NIGHT LIGHT),0.183463,0.191214,0.054264,0.295775,1.546821,0.019183,1.148475
56,(RABBIT NIGHT LIGHT),(RED TOADSTOOL LED NIGHT LIGHT),0.191214,0.183463,0.054264,0.283784,1.546821,0.019183,1.140071
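
The metric columns in the table above are tied together by a few simple formulas: confidence is the rule support divided by the antecedent support, lift is confidence divided by the consequent support, leverage is the rule support minus the product of the two individual supports, and conviction is (1 - consequent support) / (1 - confidence). As a sanity check, the sketch below recomputes them for the plates to cups rule (row 66) from its three support values. `rule_metrics` is a helper name made up for this check, and small differences from the table come from the rounded inputs.

```python
import math


def rule_metrics(antecedent_support, consequent_support, support):
    """Recompute the derived rule metrics from the three support values."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite when the rule never fails (confidence of 1)
    if confidence >= 1:
        conviction = math.inf
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Support values copied from row 66:
# (SET/6 RED SPOTTY PAPER PLATES) -> (SET/6 RED SPOTTY PAPER CUPS)
row_66 = rule_metrics(
    antecedent_support=0.129199,
    consequent_support=0.139535,
    support=0.124031,
)
for name, value in row_66.items():
    print(f"{name}: {value:.4f}")
```

Read this way, a lift of about 6.9 says the cups appear roughly seven times more often in baskets that contain the plates than in baskets overall.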


In [13]:
#building model for Spain
spain_frequent_itemset = apriori(spain_basket, min_support = 0.05, use_colnames= True)
#collecting the inferred rules in a dataframe
spain_rules = association_rules(spain_frequent_itemset, metric = "lift", min_threshold= 1)
spain_rules = spain_rules.sort_values(["confidence", "lift"], ascending = [False, False])
spain_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
104,"(LUNCH BAG PINK POLKADOT, LUNCH BAG CARS BLUE)",(LUNCH BAG BLACK SKULL.),0.056818,0.056818,0.056818,1.000000,17.600000,0.053590,inf
109,(LUNCH BAG BLACK SKULL.),"(LUNCH BAG PINK POLKADOT, LUNCH BAG CARS BLUE)",0.056818,0.056818,0.056818,1.000000,17.600000,0.053590,inf
38,(PINK REGENCY TEACUP AND SAUCER),(GREEN REGENCY TEACUP AND SAUCER),0.056818,0.068182,0.056818,1.000000,14.666667,0.052944,inf
55,(LUNCH BAG BLACK SKULL.),(LUNCH BAG CARS BLUE),0.056818,0.068182,0.056818,1.000000,14.666667,0.052944,inf
98,"(PINK REGENCY TEACUP AND SAUCER, ROSES REGENCY...",(GREEN REGENCY TEACUP AND SAUCER),0.056818,0.068182,0.056818,1.000000,14.666667,0.052944,inf
...,...,...,...,...,...,...,...,...,...
33,(REGENCY CAKESTAND 3 TIER),(DANISH ROSE DECORATIVE PLATE),0.250000,0.056818,0.056818,0.227273,4.000000,0.042614,1.220588
76,(REGENCY CAKESTAND 3 TIER),(RED RETROSPOT CAKE STAND),0.250000,0.090909,0.056818,0.227273,2.500000,0.034091,1.176471
31,(REGENCY CAKESTAND 3 TIER),(CLASSIC METAL BIRDCAGE PLANT HOLDER),0.250000,0.113636,0.056818,0.227273,2.000000,0.028409,1.147059
48,(REGENCY CAKESTAND 3 TIER),(JAM MAKING SET WITH JARS),0.250000,0.159091,0.056818,0.227273,1.428571,0.017045,1.088235


In [14]:
#building model for Ireland (EIRE)
EIRE_frequent_patterns = apriori(EIRE_basket, min_support = 0.05, use_colnames= True)
#collecting the inferred rules in a dataframe
EIRE_rules = association_rules(EIRE_frequent_patterns, metric = "lift", min_threshold= 1)
EIRE_rules = EIRE_rules.sort_values(["confidence", "lift"], ascending  = [False, False])
EIRE_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
194,"(REGENCY CAKESTAND 3 TIER, REGENCY TEA PLATE P...",(REGENCY TEA PLATE GREEN),0.055556,0.079861,0.055556,1.000000,12.521739,0.051119,inf
184,"(REGENCY TEAPOT ROSES, REGENCY CAKESTAND 3 TIER)",(REGENCY SUGAR BOWL GREEN),0.052083,0.086806,0.052083,1.000000,11.520000,0.047562,inf
117,"(REGENCY CAKESTAND 3 TIER, GREEN REGENCY TEACU...",(ROSES REGENCY TEACUP AND SAUCER),0.086806,0.166667,0.086806,1.000000,6.000000,0.072338,inf
129,"(REGENCY SUGAR BOWL GREEN, GREEN REGENCY TEACU...",(ROSES REGENCY TEACUP AND SAUCER),0.062500,0.166667,0.062500,1.000000,6.000000,0.052083,inf
141,"(REGENCY TEAPOT ROSES, GREEN REGENCY TEACUP AN...",(ROSES REGENCY TEACUP AND SAUCER),0.052083,0.166667,0.052083,1.000000,6.000000,0.043403,inf
...,...,...,...,...,...,...,...,...,...
113,(REGENCY CAKESTAND 3 TIER),"(REGENCY TEA PLATE GREEN, GREEN REGENCY TEACUP...",0.246528,0.055556,0.052083,0.211268,3.802817,0.038387,1.197421
187,(REGENCY CAKESTAND 3 TIER),"(REGENCY SUGAR BOWL GREEN, REGENCY TEAPOT ROSES)",0.246528,0.062500,0.052083,0.211268,3.380282,0.036675,1.188616
45,(REGENCY CAKESTAND 3 TIER),(REGENCY TEAPOT ROSES),0.246528,0.065972,0.052083,0.211268,3.202372,0.035819,1.184214
179,(REGENCY CAKESTAND 3 TIER),"(ROSES REGENCY TEACUP AND SAUCER, REGENCY MILK...",0.246528,0.065972,0.052083,0.211268,3.202372,0.035819,1.184214


In [15]:
#building model for the Netherlands
Netherlands_frequent_itemsets = apriori(Netherlands_basket, min_support= 0.05, use_colnames= True)
#collecting the inferred rules in a dataframe
Netherlands_rules = association_rules(Netherlands_frequent_itemsets, metric = "lift", min_threshold= 1)
Netherlands_rules = Netherlands_rules.sort_values(["confidence", "lift"], ascending = [False, False])
Netherlands_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
650,(FOLDING BUTTERFLY MIRROR RED),(FOLDING BUTTERFLY MIRROR HOT PINK),0.053191,0.053191,0.053191,1.000000,18.800000,0.050362,inf
651,(FOLDING BUTTERFLY MIRROR HOT PINK),(FOLDING BUTTERFLY MIRROR RED),0.053191,0.053191,0.053191,1.000000,18.800000,0.050362,inf
1350,"(CARD DOLLY GIRL, FOOD CONTAINER SET 3 LOVE HE...",(10 COLOUR SPACEBOY PEN),0.053191,0.053191,0.053191,1.000000,18.800000,0.050362,inf
1355,(10 COLOUR SPACEBOY PEN),"(CARD DOLLY GIRL, FOOD CONTAINER SET 3 LOVE HE...",0.053191,0.053191,0.053191,1.000000,18.800000,0.050362,inf
1380,"(CARD DOLLY GIRL, STRAWBERRY LUNCH BOX WITH CU...",(10 COLOUR SPACEBOY PEN),0.053191,0.053191,0.053191,1.000000,18.800000,0.050362,inf
...,...,...,...,...,...,...,...,...,...
221,(SPACEBOY LUNCH BOX),(CARD GINGHAM ROSE),0.297872,0.095745,0.053191,0.178571,1.865079,0.024672,1.100833
2763,(SPACEBOY LUNCH BOX),"(CARD DOLLY GIRL, ROUND SNACK BOXES SET OF4 WO...",0.297872,0.095745,0.053191,0.178571,1.865079,0.024672,1.100833
2787,(SPACEBOY LUNCH BOX),"(CARD DOLLY GIRL, SPACEBOY BIRTHDAY CARD)",0.297872,0.095745,0.053191,0.178571,1.865079,0.024672,1.100833
204,(SPACEBOY LUNCH BOX),(CARD DOLLY GIRL),0.297872,0.127660,0.053191,0.178571,1.398810,0.015165,1.061980


In [16]:
#building model for Germany
Germany_frequent_itemsets = apriori(Germany_basket, min_support= 0.05, use_colnames= True)
#collecting the inferred rules in a dataframe
Germany_rules = association_rules(Germany_frequent_itemsets, metric = "lift", min_threshold= 1)
Germany_rules = Germany_rules.sort_values(["confidence", "lift"], ascending = [False, False])
Germany_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
11,(RED RETROSPOT CHARLOTTE BAG),(WOODLAND CHARLOTTE BAG),0.071269,0.129176,0.060134,0.84375,6.531789,0.050927,5.573274
12,(ROUND SNACK BOXES SET OF 4 FRUITS),(ROUND SNACK BOXES SET OF4 WOODLAND),0.160356,0.249443,0.13363,0.833333,3.340774,0.09363,4.503341
14,(SPACEBOY LUNCH BOX),(ROUND SNACK BOXES SET OF4 WOODLAND),0.104677,0.249443,0.071269,0.680851,2.729483,0.045159,2.351745
1,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN WOODLAND ANIMALS),0.11804,0.140312,0.069042,0.584906,4.168613,0.05248,2.071067
6,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN WOODLAND ANIMALS),0.109131,0.140312,0.062361,0.571429,4.072562,0.047048,2.005939
8,(PLASTERS IN TIN WOODLAND ANIMALS),(ROUND SNACK BOXES SET OF4 WOODLAND),0.140312,0.249443,0.075724,0.539683,2.163549,0.040724,1.63052
13,(ROUND SNACK BOXES SET OF4 WOODLAND),(ROUND SNACK BOXES SET OF 4 FRUITS),0.249443,0.160356,0.13363,0.535714,3.340774,0.09363,1.808463
17,(WOODLAND CHARLOTTE BAG),(ROUND SNACK BOXES SET OF4 WOODLAND),0.129176,0.249443,0.064588,0.5,2.004464,0.032366,1.501114
0,(PLASTERS IN TIN WOODLAND ANIMALS),(PLASTERS IN TIN CIRCUS PARADE),0.140312,0.11804,0.069042,0.492063,4.168613,0.05248,1.736359
5,(PLASTERS IN TIN CIRCUS PARADE),(ROUND SNACK BOXES SET OF4 WOODLAND),0.11804,0.249443,0.057906,0.490566,1.966644,0.028462,1.473315
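
The rules tables can be turned into cart-based recommendations by looking up the rules whose antecedent is already fully in the cart. The following is one possible sketch rather than a finished recommender: `recommend` and its `top_n` parameter are hypothetical, and the small rules frame is rebuilt by hand from three rows of the Germany output so the example runs without mlxtend. It assumes the `antecedents` and `consequents` columns hold frozensets, as they do in the frames returned by `association_rules`.

```python
import pandas as pd


def recommend(rules, cart, top_n=3):
    """Suggest items whose rule antecedents are all present in the cart."""
    cart = frozenset(cart)
    # Keep only rules that the current cart fully triggers
    applicable = rules[rules["antecedents"].apply(lambda a: a <= cart)]
    # Strongest rules first, matching the sort order used in the cells above
    applicable = applicable.sort_values(["confidence", "lift"], ascending=False)
    suggestions = []
    for consequent in applicable["consequents"]:
        for item in sorted(consequent):
            if item not in cart and item not in suggestions:
                suggestions.append(item)
    return suggestions[:top_n]


# Three rows copied from the Germany rules table (rows 11, 14 and 17)
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"RED RETROSPOT CHARLOTTE BAG"}),
        frozenset({"SPACEBOY LUNCH BOX"}),
        frozenset({"WOODLAND CHARLOTTE BAG"}),
    ],
    "consequents": [
        frozenset({"WOODLAND CHARLOTTE BAG"}),
        frozenset({"ROUND SNACK BOXES SET OF4 WOODLAND"}),
        frozenset({"ROUND SNACK BOXES SET OF4 WOODLAND"}),
    ],
    "confidence": [0.84375, 0.680851, 0.5],
    "lift": [6.531789, 2.729483, 2.004464],
})

print(recommend(toy_rules, ["SPACEBOY LUNCH BOX"]))
print(recommend(toy_rules, ["RED RETROSPOT CHARLOTTE BAG"]))
```

A cart that triggers no rule simply yields an empty list, so a caller would need its own fallback (for example, best sellers) in that case.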
