# Market Basket Analysis
Market Basket Analysis (MBA) is used in the retail industry to identify which products customers buy together. Retailers can use these insights to cross-sell products, improve customer experience, increase impulse purchases, and offer promotions on associated products.

For in-store retailers, this means optimizing the store layout for product placement and display. For example, customers who buy spaghetti may also buy tomato soup cans, so the store can display the two close to each other.

For online retailers, this means personalizing suggestions to customers in categories such as *"These products may also interest you"* and *"Customers who bought this also bought..."*. For example, customers who bought an iPhone 15 also bought a USB-C charger.

## Performing an MBA
To conduct an MBA, customer transactions data containing order invoices, order items, and other relevant information is collected.

Association rules algorithms are used to identify items that often appear together in an order, and calculate the likelihood of an item being purchased given the purchase of another.
First, we use the apriori algorithm to find the sets of items that are frequently purchased together, which requires a minimum threshold for support. **Support** is the fraction of all orders in which an itemset occurs. For example, if there were 1000 orders, spaghetti was bought 10 times, tomato cans were bought 8 times, and spaghetti and tomato cans were bought together 6 times, the support of the spaghetti and tomato cans itemset is $6/1000=0.006$. This metric shows how important an itemset is in the data. There is no one-size-fits-all threshold, so it should be chosen to suit the dataset.
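
To make the support threshold concrete, here is a minimal sketch that counts pair support by brute force on a small made-up set of orders. It is not the apriori algorithm itself (apriori prunes candidate itemsets level by level). The `toy_orders` data and the `pair_support` helper are hypothetical and only illustrate how support is computed and thresholded.

```python
from collections import Counter
from itertools import combinations

# hypothetical toy orders, one set of products per order
toy_orders = [
    {"spaghetti", "tomato cans", "milk"},
    {"spaghetti", "tomato cans"},
    {"milk", "eggs"},
    {"spaghetti", "eggs"},
    {"tomato cans", "milk"},
]

def pair_support(orders):
    """Return the support of every product pair that appears in at least one order."""
    counts = Counter()
    for order in orders:
        # sorted() gives each pair a canonical (alphabetical) key
        counts.update(combinations(sorted(order), 2))
    return {pair: n / len(orders) for pair, n in counts.items()}

supports = pair_support(toy_orders)
min_support = 2 / len(toy_orders)  # keep pairs bought together at least twice
frequent_pairs = {pair: s for pair, s in supports.items() if s >= min_support}
print(frequent_pairs)
```

Only the pairs that reach the threshold survive, which is the same filtering role `min_support` plays in the analysis further down.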

After finding itemsets, we use association rules to estimate how likely the items in these itemsets are to be bought together. Key terms and metrics here include:
1. Confidence: This is the conditional probability that the consequent is purchased given that the antecedent is purchased. That is, the confidence that a customer buying spaghetti will buy tomato cans is the share of spaghetti orders that also contain tomato cans ($6/10 = 0.6$), and the confidence that a customer buying tomato cans will buy spaghetti is $6/8 = 0.75$. This can be used to increase profitability by placing high-margin items close to high-confidence items.
2. Antecedent: The item (or items) that a rule starts from. In the rule "customers who buy spaghetti also buy tomato cans", the antecedent is spaghetti.
3. Consequent: The item (or items) expected to be bought along with the antecedent. In the same rule, the consequent is tomato cans.
4. Lift: This is the ratio of the confidence to the support of the consequent, that is, how much more likely the consequent is to be bought when the antecedent is bought than it is in general. A lift greater than 1 indicates a positive association: the items are bought together more often than would be expected if they were independent.
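
The spaghetti and tomato cans example can be checked numerically. The sketch below assumes the usual definitions (the ones `mlxtend` reports): confidence is the pair's support divided by the antecedent's support, and lift is confidence divided by the consequent's support. The variable names are illustrative.

```python
# counts from the spaghetti / tomato cans example
total_orders = 1000
spaghetti_orders = 10   # orders containing the antecedent
tomato_orders = 8       # orders containing the consequent
both_orders = 6         # orders containing both items

support = both_orders / total_orders                # 0.006
confidence = both_orders / spaghetti_orders         # 0.6
reverse_confidence = both_orders / tomato_orders    # 0.75
# lift = confidence / support(consequent), written with counts to avoid rounding noise
lift = (both_orders * total_orders) / (spaghetti_orders * tomato_orders)  # 75.0

print(support, confidence, reverse_confidence, lift)
```

A lift of 75 says that, in this made-up example, a spaghetti buyer is 75 times more likely to buy tomato cans than a randomly chosen customer.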

### In-store Case Study
Emily was the new store manager at a physical grocery store (we'll call them SunnyVille). In her first couple of months familiarizing herself with the new role, she couldn't help but notice a slight disconnect. Products that naturally complemented each other were scattered across the store. A customer buying eggs, for example, had to walk to the other end of the store if they also wanted milk. She also noticed from the inventory that some products just didn't sell fast enough.

From these observations, she saw potential to enhance the shopping experience for customers by transforming the store layout to create "mini-environments" where related items harmonised with each other. Slow-moving items, strategically placed and paired with promotions, could also find their way into customers' baskets and reduce the inventory stock.

#### Modules
These are the Python modules/packages used in performing the analysis.

In [1]:
# import packages
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori
import pandas as pd

# set random seed for reproducibility
random_seed = 42

#### Data Loading
To protect confidential information, open source data on grocery store orders downloaded from [Kaggle](https://www.kaggle.com/datasets/rupakroy/market-basket-optimization) is used here.

In [2]:
# load data
orders = pd.read_csv('store_data.csv', header=None, sep='\t')

#### Data Inspection and Cleaning
The data is inspected, cleaned, and preprocessed to handle missing data and to transform it into a format suitable for analysis.

In [3]:
# inspect data
orders.head()

Unnamed: 0,0
0,"shrimp,almonds,avocado,vegetables mix,green gr..."
1,"burgers,meatballs,eggs"
2,chutney
3,"turkey,avocado"
4,"mineral water,milk,energy bar,whole wheat rice..."


In [4]:
# each row represents a unique order
# set the column name
orders.columns = ['product']
# create a column with the index to represent the order number
orders['id'] = orders.index + 1

In [5]:
# create a list from the products listed in each row
orders['product'] = orders['product'].str.split(',')
# put each order product in its own row
orders = orders.explode('product', ignore_index=True)

In [6]:
# strip any whitespace
orders['product'] = orders['product'].str.strip()
# remove null or empty products from the data
orders = orders.dropna(subset=['product'])
orders = orders.loc[orders['product'] != '', :]
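
As a quick illustration of the reshaping and cleaning steps, here is a sketch on a made-up three-row frame. The `toy` data is hypothetical and stands in for the raw orders file.

```python
import pandas as pd

# hypothetical raw data: one comma-separated string per order
toy = pd.DataFrame({"product": ["eggs, milk", "bread,", "milk"]})
toy["id"] = toy.index + 1

# split each order into a list, then give every product its own row
toy["product"] = toy["product"].str.split(",")
toy = toy.explode("product", ignore_index=True)

# trim whitespace, then drop missing and empty product names
toy["product"] = toy["product"].str.strip()
toy = toy.dropna(subset=["product"])
toy = toy.loc[toy["product"] != ""].reset_index(drop=True)

print(toy)
```

The trailing comma in the second order produces an empty product after the split, which the final filter removes.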

In [7]:
# inspect data
orders.head()

Unnamed: 0,product,id
0,shrimp,1
1,almonds,1
2,avocado,1
3,vegetables mix,1
4,green grapes,1


#### Data Transformation
The orders need to be transformed into a one-hot encoded format in which each column represents a product, each row represents an order, and each cell indicates whether or not the product is present in that order.

In [8]:
# quantify the items in the orders
orders = orders.groupby(['id','product'], as_index=False).size()

In [9]:
# transform orders data to one-hot encoded format
market_basket = pd.crosstab(orders['id'], orders['product'])

In [10]:
# define function to convert counts to boolean
def encoding(x):
    return x > 0

In [11]:
# convert from binary to boolean
market_basket = market_basket.map(encoding)
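
For reference, the same transformation on a tiny made-up example. This sketch assumes a long-format frame like the one built above (the `toy` data is hypothetical) and uses a direct `> 0` comparison as an alternative to the element-wise function.

```python
import pandas as pd

# hypothetical long-format orders: one row per (order, product)
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 3],
    "product": ["eggs", "milk", "bread", "eggs", "milk", "milk"],
})

# crosstab counts occurrences; comparing with 0 turns counts into booleans
toy_basket = pd.crosstab(toy["id"], toy["product"]) > 0
print(toy_basket)
```

Order 3 contains milk twice, but the boolean basket only records that milk is present.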

#### Associations
Here, we find the frequent itemsets and create the association rules. We set the minimum support so that only itemsets bought together at least 30 times are kept. We also set a lift threshold of 1 so that the rules keep only positively associated items (see the definition of lift above).

In [12]:
# calculate minimum support threshold
frequent_itemset = 30
total_purchases = market_basket.index.nunique()
min_support = frequent_itemset/total_purchases

In [13]:
# create itemsets
frequent_items = apriori(market_basket, min_support=min_support, max_len=2, use_colnames=True)
# define association rules and unpack antecedents and consequents
rules = association_rules(frequent_items, metric='lift', min_threshold=1).sort_values('confidence', ascending=False)
rules = rules.explode('antecedents').explode('consequents')
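
The rules returned by `association_rules` store antecedents and consequents as frozensets. The sketch below uses a hand-built frame shaped like that output (the values are hypothetical) to show what the two `explode` calls do when each set holds a single item, as is the case with `max_len=2`.

```python
import pandas as pd

# hypothetical rules frame shaped like the output of association_rules
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"spaghetti"}), frozenset({"milk"})],
    "consequents": [frozenset({"tomato cans"}), frozenset({"eggs"})],
    "confidence": [0.60, 0.25],
})

# with single-item sets, explode replaces each frozenset by the item it holds
flat = toy_rules.explode("antecedents").explode("consequents")
print(flat)
```

With larger itemsets, `explode` would instead produce one row per item, so the row count would grow.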

#### Enhancing shopping experience
Here, we find the high-confidence rules. SunnyVille can use them to optimize its product placement strategy by displaying high-confidence consequents next to their antecedents.

In [14]:
# set higher confidence itemsets to be itemsets with confidence greater than the mean confidence value
higher_confidence_items = rules.loc[rules['confidence'] > rules['confidence'].mean(), :]
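
The cut-off used here is the mean confidence. If a stricter cut-off is wanted, a percentile is one option. The sketch below compares the two on made-up confidence values. The `confidence` series is hypothetical, and the 75th percentile is only an example choice.

```python
import pandas as pd

# hypothetical confidence values for eight rules
confidence = pd.Series([0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.50, 0.80])

# rules above the mean confidence
above_mean = confidence[confidence > confidence.mean()]
# rules above the 75th percentile, a stricter cut-off
above_p75 = confidence[confidence > confidence.quantile(0.75)]

print(len(above_mean), len(above_p75))
```

On this toy data the percentile cut-off keeps fewer rules than the mean, so it yields a shorter, higher-confidence shortlist.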

In [15]:
# inspect
higher_confidence_items.sample(5, random_state=random_seed)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
1051,whole wheat rice,spaghetti,0.058526,0.17411,0.014131,0.241458,1.386811,0.003942,1.088786,0.29626
589,fresh tuna,frozen vegetables,0.022264,0.095321,0.004133,0.185629,1.947414,0.002011,1.110893,0.497576
614,frozen smoothie,milk,0.063325,0.129583,0.014265,0.225263,1.738373,0.006059,1.123501,0.453465
473,energy bar,frozen vegetables,0.027063,0.095321,0.004133,0.152709,1.60206,0.001553,1.067732,0.386257
791,whole wheat rice,ground beef,0.058526,0.098254,0.008132,0.138952,1.41422,0.002382,1.047266,0.311104


#### Addressing the challenge of slow moving items
Here, we identify slow-moving items and find the products that customers frequently buy together with them. SunnyVille can then display these items next to each other to encourage sales of the slow-moving items.

In [16]:
# calculate the quantity of products sold
items_move = orders.groupby('product').agg({'size':'sum'})

In [17]:
# set slow moving items as products whose sold units are at or below the average quantity of products sold
slower_moving_items = items_move[items_move['size']<= items_move['size'].mean()].index.unique()

In [18]:
# filter the association results to associations whose consequents are part of the slow moving items
rules_slower_moving_items = rules.loc[(rules['consequents'].isin(slower_moving_items))]
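
The two filtering steps above can be sketched on made-up numbers. The product names, unit counts, and rules below are hypothetical. The logic mirrors the cells above: flag products at or below the mean units sold, then keep rules whose consequent is one of them.

```python
import pandas as pd

# hypothetical units sold per product
units = pd.Series({"milk": 120, "eggs": 90, "tomato juice": 20, "chutney": 10}, name="size")
# products selling at or below the average count as slow moving
slow_movers = units[units <= units.mean()].index

# hypothetical rules with one antecedent and one consequent per row
toy_rules = pd.DataFrame({
    "antecedents": ["fresh bread", "eggs", "milk"],
    "consequents": ["tomato juice", "milk", "chutney"],
    "confidence": [0.10, 0.30, 0.05],
})
# keep rules that would pull a slow mover into the basket
slow_rules = toy_rules[toy_rules["consequents"].isin(slow_movers)]
print(slow_rules)
```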

In [19]:
# inspect results
rules_slower_moving_items.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
976,olive oil,whole wheat pasta,0.065858,0.029463,0.007999,0.121457,4.12241,0.006059,1.104713,0.810823
584,fresh bread,tomato juice,0.043061,0.030396,0.004266,0.099071,3.259356,0.002957,1.076227,0.724384
590,honey,fresh tuna,0.04746,0.022264,0.003999,0.08427,3.78507,0.002943,1.067712,0.772466
402,frozen smoothie,cottage cheese,0.063325,0.031862,0.004933,0.077895,2.444721,0.002915,1.049921,0.630908
911,milk,whole wheat pasta,0.129583,0.029463,0.009865,0.076132,2.583999,0.006047,1.050515,0.704263


##### Addendum
Here, we see that the confidence values for these rules are low, meaning customers who buy the antecedents only occasionally buy the slow-moving consequents. To encourage purchases, SunnyVille can additionally offer discounts and promotions on the sales price of the consequents (which are the slow-moving items) when customers buy the antecedents.

### Online Store Case Study
Alex is the sales lead at an online tech products and merch store (we'll call them TechToys). He wants to run a sales campaign to boost sales by identifying cross-selling opportunities through personalized recommendations to customers and offering enticing package deals.

#### Data Loading
To protect confidential information, open source data on ecommerce sales downloaded from [Kaggle](https://www.kaggle.com/datasets/rishikumarrajvansh/marketing-insights-for-e-commerce-company/data) is used here.

In [20]:
# load data
orders = pd.read_csv('MarketingData/Online_Sales.csv')

#### Data Inspection

In [21]:
# inspect data
orders.head()

Unnamed: 0,CustomerID,Transaction_ID,Transaction_Date,Product_SKU,Product_Description,Product_Category,Quantity,Avg_Price,Delivery_Charges,Coupon_Status
0,17850,16679,1/1/2019,GGOENEBJ079499,Nest Learning Thermostat 3rd Gen-USA - Stainle...,Nest-USA,1,153.71,6.5,Used
1,17850,16680,1/1/2019,GGOENEBJ079499,Nest Learning Thermostat 3rd Gen-USA - Stainle...,Nest-USA,1,153.71,6.5,Used
2,17850,16681,1/1/2019,GGOEGFKQ020399,Google Laptop and Cell Phone Stickers,Office,1,2.05,6.5,Used
3,17850,16682,1/1/2019,GGOEGAAB010516,Google Men's 100% Cotton Short Sleeve Hero Tee...,Apparel,5,17.53,6.5,Not Used
4,17850,16682,1/1/2019,GGOEGBJL013999,Google Canvas Tote Natural/Navy,Bags,1,16.5,6.5,Used


#### Data Transformation
The orders need to be transformed into a one-hot encoded format in which each column represents a product, each row represents an order, and each cell indicates whether or not the product is present in that order.

In [22]:
# transform orders data to one-hot encoded format
market_basket = pd.crosstab(orders['Transaction_ID'], orders['Product_Description'])

In [23]:
# define function to convert counts to boolean
def encoding(x):
    return x > 0

In [24]:
# convert from binary to boolean
market_basket = market_basket.map(encoding)

#### Associations
Here, we find the frequent itemsets and create the association rules. We set the minimum support so that only itemsets bought together at least 30 times are kept. We also set a lift threshold of 1 so that the rules keep only positively associated items (see the definition of lift above).

In [25]:
# calculate minimum support threshold
frequent_itemset = 30
total_purchases = orders['Transaction_ID'].nunique()
min_support = frequent_itemset/total_purchases

In [26]:
# create itemsets
frequent_items = apriori(market_basket, min_support=min_support, use_colnames=True)
# define association rules and unpack consequents
rules = association_rules(frequent_items, metric='lift', min_threshold=1).sort_values('confidence', ascending=False)
rules = rules.explode('consequents')

#### Cross selling experiences
Here, we find the high-confidence rules. TechToys can use them to cross-sell by recommending high-confidence consequents to customers who buy the antecedents.

In [27]:
# set higher confidence itemsets to be itemsets with confidence greater than the mean confidence value
higher_confidence_items = rules.loc[rules['confidence'] > rules['confidence'].mean(), :]

In [28]:
# inspect
higher_confidence_items.sample(5, random_state=random_seed)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
61,(Ballpoint LED Light Pen),Google Sunglasses,0.011173,0.027692,0.001596,0.142857,5.158707,0.001287,1.134359,0.815262
67,(Ballpoint Stick Pen 4 Pack),Google Laptop and Cell Phone Stickers,0.004429,0.032162,0.001317,0.297297,9.24388,0.001174,1.377309,0.895788
95,(Four Color Retractable Pen),Google Laptop and Cell Phone Stickers,0.012928,0.032162,0.002354,0.182099,5.662006,0.001938,1.183319,0.834169
289,(Nest Protect Smoke + CO White Battery Alarm-USA),Nest Learning Thermostat 3rd Gen-USA - Stainle...,0.054307,0.140098,0.009018,0.166054,1.185272,0.00141,1.031125,0.165288
20,(8 pc Android Sticker Sheet),Google Doodle Decal,0.010295,0.012769,0.002115,0.205426,16.088094,0.001983,1.242467,0.947598


#### Package deals
Sales can be boosted by offering package deals on associated high-revenue items.

In [29]:
# create a price list for all products using the most recent prices (assumes rows are in chronological order)
price_list = orders.groupby('Product_Description', as_index=False).agg({'Avg_Price':'last'})

In [30]:
# find prices of antecedents and consequents
antecedent_prices = higher_confidence_items['antecedents'].apply(lambda x: [price_list.loc[price_list['Product_Description'] == elem,
                                                                             'Avg_Price'].iloc[0] for elem in x])
consequent_prices = higher_confidence_items['consequents'].apply(lambda x: [price_list.loc[price_list['Product_Description'] == x,
                                                                             'Avg_Price'].iloc[0]])
total_prices = antecedent_prices + consequent_prices
# work on an explicit copy so the new column is not set on a slice of rules
higher_confidence_items = higher_confidence_items.copy()
# price of package deal
higher_confidence_items['price'] = total_prices.apply(sum)

In [31]:
# set higher revenue itemsets to be itemsets with price greater than the mean price
higher_rev_items = higher_confidence_items.loc[higher_confidence_items['price'] > higher_confidence_items['price'].mean(), :]\
    .sort_values('price', ascending=False)

In [32]:
higher_rev_items

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,price
279,(Nest Cam IQ Outdoor - USA (Preorder)),Nest Secure Alarm System Starter Pack - USA,0.005906,0.019872,0.001716,0.290541,14.620957,0.001598,1.381514,0.937139,634.74
273,(Nest Cam IQ Outdoor - USA (Preorder)),Nest Cam IQ - USA,0.005906,0.023902,0.001436,0.243243,10.176826,0.001295,1.289844,0.907094,481.84
285,(Nest Detect - USA),Nest Secure Alarm System Starter Pack - USA,0.002953,0.019872,0.001756,0.594595,29.921958,0.001697,2.41765,0.969442,405.69
317,"(Nest Cam Indoor Security Camera - USA, Nest L...",Nest Cam Outdoor Security Camera - USA,0.009098,0.132796,0.003392,0.372807,2.807367,0.002184,1.382675,0.649705,394.48
316,"(Nest Cam Outdoor Security Camera - USA, Nest ...",Nest Cam Indoor Security Camera - USA,0.012011,0.128886,0.003392,0.282392,2.19103,0.001844,1.213914,0.550202,394.48
328,(Nest Learning Thermostat 3rd Gen-USA - Stainl...,Nest Cam Indoor Security Camera - USA,0.009018,0.128886,0.001476,0.163717,1.27025,0.000314,1.04165,0.214689,353.7
327,"(Nest Cam Indoor Security Camera - USA, Nest P...",Nest Learning Thermostat 3rd Gen-USA - Stainle...,0.005985,0.140098,0.001476,0.246667,1.76067,0.000638,1.141463,0.434636,353.7
333,"(Nest Cam Outdoor Security Camera - USA, Nest ...",Nest Learning Thermostat 3rd Gen-USA - Stainle...,0.007661,0.140098,0.001836,0.239583,1.71011,0.000762,1.13083,0.418448,353.7
332,"(Nest Cam Outdoor Security Camera - USA, Nest ...",Nest Protect Smoke + CO White Battery Alarm-USA,0.012011,0.054307,0.001836,0.152824,2.814049,0.001183,1.116288,0.652477,353.7
334,(Nest Learning Thermostat 3rd Gen-USA - Stainl...,Nest Cam Outdoor Security Camera - USA,0.009018,0.132796,0.001836,0.20354,1.532726,0.000638,1.088823,0.35073,353.7
