In my Capstone project, I analyse Instacart online grocery baskets. The main question is how retailers can optimize their product offerings by understanding customer behavior. By predicting whether a customer has a healthy product basket based on product names, and by clustering customers based on their baskets, the project aims to provide insights into how retailers should approach the next best offer. The goal is to increase retailers' profits by better understanding and targeting their customers' needs and preferences.


__Please note: this is notebook 4 of 5.__


In this notebook, I performed a market basket analysis to identify which products are frequently purchased together. To achieve this, I utilized the Apriori algorithm and association rules. By analyzing transactional data, I was able to identify strong relationships between certain products and generate rules that can be used to make recommendations or optimize product placement in stores or on websites. The insights gained from this analysis can be highly beneficial for retailers looking to improve customer experience, increase sales, and develop more effective marketing strategies.

In [17]:
import pandas as pd

In [28]:
import session_info
session_info.show()

In [18]:
df = pd.read_csv("/Users/evgenijkucukov/Desktop/Brainstation/Capstone/order_prodact_name_final")
df

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,aisle,department,food,healthy,healthy_product_share,healthy_basket
0,1,49302,1,1,Bulgarian Yogurt,120,16,112108,train,4,4,10,9.0,yogurt,dairy eggs,1,1,1.000000,1
1,1,11109,2,1,Organic 4% Milk Fat Whole Milk Cottage Cheese,108,16,112108,train,4,4,10,9.0,other creams cheeses,dairy eggs,1,1,1.000000,1
2,1,10246,3,0,Organic Celery Hearts,83,4,112108,train,4,4,10,9.0,fresh vegetables,produce,1,1,1.000000,1
3,1,49683,4,0,Cucumber Kirby,83,4,112108,train,4,4,10,9.0,fresh vegetables,produce,1,1,1.000000,1
4,1,43633,5,1,Lightly Smoked Sardines in Olive Oil,95,15,112108,train,4,4,10,9.0,canned meat seafood,canned goods,1,1,1.000000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1384612,3421063,14233,3,1,Natural Artesian Water,115,7,169679,train,30,0,10,4.0,water seltzer sparkling water,beverages,1,1,0.500000,0
1384613,3421063,35548,4,1,Twice Baked Potatoes,13,20,169679,train,30,0,10,4.0,prepared meals,deli,1,0,0.500000,0
1384614,3421070,35951,1,1,Organic Unsweetened Almond Milk,91,16,139822,train,15,6,10,8.0,soy lactosefree,dairy eggs,1,1,0.666667,0
1384615,3421070,16953,2,1,Creamy Peanut Butter,88,13,139822,train,15,6,10,8.0,spreads,pantry,1,0,0.666667,0


In [19]:
# Keep only the columns needed to build baskets (grouping by order_id happens in the next cell)
basket_df = df.loc[:,['order_id','product_name']]
basket_df

Unnamed: 0,order_id,product_name
0,1,Bulgarian Yogurt
1,1,Organic 4% Milk Fat Whole Milk Cottage Cheese
2,1,Organic Celery Hearts
3,1,Cucumber Kirby
4,1,Lightly Smoked Sardines in Olive Oil
...,...,...
1384612,3421063,Natural Artesian Water
1384613,3421063,Twice Baked Potatoes
1384614,3421070,Organic Unsweetened Almond Milk
1384615,3421070,Creamy Peanut Butter


In [20]:
basket_series = basket_df.groupby('order_id')['product_name'].apply(list)

In [21]:
from mlxtend.preprocessing import TransactionEncoder

# Transform our basket series into a transaction matrix
te = TransactionEncoder()
transaction_matrix = te.fit_transform(basket_series, sparse=True)

# Convert to dataframe
transaction_df = pd.DataFrame.sparse.from_spmatrix(transaction_matrix, 
                                                  columns = te.columns_)
transaction_df.head()

Unnamed: 0,#2 Coffee Filters,#2 Cone White Coffee Filters,#2 Mechanical Pencils,#4 Natural Brown Coffee Filters,& Go! Hazelnut Spread + Pretzel Sticks,+Energy Black Cherry Vegetable & Fruit Juice,0 Calorie Acai Raspberry Water Beverage,0 Calorie Fuji Apple Pear Water Beverage,0 Calorie Strawberry Dragonfruit Water Beverage,0% Fat Black Cherry Greek Yogurt y,...,with Sweet Cinnamon Bunches Cereal,with Xylitol Cinnamon 18 Sticks Sugar Free Gum,with Xylitol Island Berry Lime 18 Sticks Sugar Free Gum,with Xylitol Minty Sweet Twist 18 Sticks Sugar Free Gum,with Xylitol Original Flavor 18 Sticks Sugar Free Gum,with Xylitol Unwrapped Original Flavor 50 Sticks Sugar Free Gum,with Xylitol Unwrapped Spearmint 50 Sticks Sugar Free Gum,with Xylitol Watermelon Twist 18 Sticks Sugar Free Gum,with a Splash of Mango Coconut Water,with a Splash of Pineapple Coconut Water
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
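
To make the shape of the transaction matrix concrete, here is a minimal pure-Python sketch of one-hot encoding a few baskets. The baskets are made up for illustration, and this is only a conceptual stand-in for what the encoder produces, not mlxtend's implementation.

```python
# Toy baskets standing in for basket_series (hypothetical data).
baskets = [
    ["Limes", "Large Lemon"],
    ["Banana", "Limes"],
    ["Banana"],
]

# One column per distinct product; sorted here to get a stable column order.
columns = sorted({item for basket in baskets for item in basket})

# One row per basket, True where the product appears in that basket.
matrix = [[item in set(basket) for item in columns] for basket in baskets]

print(columns)
for row in matrix:
    print(row)
```

With roughly 1.4 million order lines and tens of thousands of products, almost every cell is empty, which is why the notebook keeps the matrix sparse.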


Let's create a dataframe of frequently purchased itemsets. To achieve this, I only consider itemsets of up to three products that are present in at least 1% of orders. By filtering out less frequently purchased items, we can focus on the most relevant products and simplify our analysis.

In [22]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

freq_itemsets = apriori(transaction_df, min_support=0.01, use_colnames=True, max_len = 3)
freq_itemsets.sort_values('support', ascending=False)

Unnamed: 0,support,itemsets
5,0.142719,(Banana)
4,0.117980,(Bag of Organic Bananas)
73,0.083028,(Organic Strawberries)
38,0.074568,(Organic Baby Spinach)
29,0.062000,(Large Lemon)
...,...,...
115,0.010281,"(Large Lemon, Organic Avocado)"
72,0.010228,(Organic Sticks Low Moisture Part Skim Mozzare...
102,0.010175,(Whole Milk)
109,0.010144,"(Limes, Banana)"
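
The support threshold can be illustrated on a handful of made-up baskets. The sketch below counts every itemset by brute force and keeps those whose support reaches the threshold; it does not implement Apriori's candidate pruning, and the function name `frequent_itemsets` and the toy data are my own assumptions.

```python
from collections import Counter
from itertools import combinations

# Toy baskets (hypothetical data), one set of products per order.
baskets = [
    {"Limes", "Large Lemon", "Banana"},
    {"Limes", "Large Lemon"},
    {"Banana", "Organic Strawberries"},
    {"Banana"},
]

def frequent_itemsets(baskets, min_support, max_len=2):
    """Return {itemset: support} for itemsets with support >= min_support."""
    n_orders = len(baskets)
    counts = Counter()
    for basket in baskets:
        # Count every itemset of size 1..max_len contained in this basket.
        for size in range(1, max_len + 1):
            for combo in combinations(sorted(basket), size):
                counts[frozenset(combo)] += 1
    return {
        itemset: count / n_orders
        for itemset, count in counts.items()
        if count / n_orders >= min_support
    }

frequent = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

On real data this brute-force count would be far too slow, which is the reason for using Apriori: it only extends itemsets whose subsets are already frequent.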


Let's identify which products are bought together.

In [23]:
rules = association_rules(freq_itemsets, metric="lift", min_threshold=1)
rules.sort_values('lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
21,(Limes),(Large Lemon),0.04598,0.062,0.012156,0.264379,4.264159,0.009305,1.275113
20,(Large Lemon),(Limes),0.062,0.04598,0.012156,0.196066,4.264159,0.009305,1.18669
30,(Organic Raspberries),(Organic Strawberries),0.042268,0.083028,0.012728,0.301118,3.62671,0.009218,1.312056
31,(Organic Strawberries),(Organic Raspberries),0.083028,0.042268,0.012728,0.153295,3.62671,0.009218,1.131128
22,(Large Lemon),(Organic Avocado),0.062,0.056467,0.010281,0.165827,2.936692,0.00678,1.131099
23,(Organic Avocado),(Large Lemon),0.056467,0.062,0.010281,0.182076,2.936692,0.00678,1.146805
2,(Organic Hass Avocado),(Bag of Organic Bananas),0.055583,0.11798,0.018444,0.331825,2.81256,0.011886,1.320044
3,(Bag of Organic Bananas),(Organic Hass Avocado),0.11798,0.055583,0.018444,0.156331,2.81256,0.011886,1.119416
4,(Organic Raspberries),(Bag of Organic Bananas),0.042268,0.11798,0.013566,0.320952,2.7204,0.008579,1.298907
5,(Bag of Organic Bananas),(Organic Raspberries),0.11798,0.042268,0.013566,0.114987,2.7204,0.008579,1.082167


In [25]:
# Convert the frozensets to plain lists so they are easier to inspect and filter
rules['antecedents'] = rules['antecedents'].apply(list)
rules['consequents'] = rules['consequents'].apply(list)

Let's create a recommendation based on the rules above.

In [27]:
import numpy as np

# Input basket
mybasket = ['Large Lemon', 'Organic Raspberries', 'Organic Strawberries', 'Large Lemon']

# Metric used to rank the rules
metric = 'lift'

def product_recs(basket, rule, metric):
    """Recommend a product based on one randomly chosen item of the basket."""

    # Randomly select an item from the basket
    random_item = np.random.choice(basket, 1)[0]
    print(random_item)

    # Find rules where the item is in the antecedent
    rule_filter = rule['antecedents'].apply(lambda x: random_item in x)

    # Filter the dataframe using rule_filter and sort by the selected metric, strongest rules first
    filtered_df = rule[rule_filter].sort_values(by=metric, ascending=False)

    # Randomly return the consequent of one of the top 20 rules from the filtered dataframe
    reco = filtered_df.head(20).sample(1)['consequents']

    return reco

product_recs(mybasket, rules, metric)

Large Lemon


8    [Banana]
Name: consequents, dtype: object
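
The function above picks both the basket item and the rule at random. As an alternative sketch (my own variation, not part of the original analysis), a deterministic version could look at every item in the basket, skip products the customer already has, and rank the remaining consequents by lift. The rules below are copied from the table above as plain tuples; `top_recs` and `toy_rules` are hypothetical names.

```python
# A few rules from the table above as (antecedent, consequent, lift) tuples.
toy_rules = [
    ("Limes", "Large Lemon", 4.264159),
    ("Large Lemon", "Limes", 4.264159),
    ("Organic Raspberries", "Organic Strawberries", 3.62671),
    ("Organic Strawberries", "Organic Raspberries", 3.62671),
    ("Large Lemon", "Organic Avocado", 2.936692),
    ("Organic Raspberries", "Bag of Organic Bananas", 2.7204),
]

def top_recs(basket, rules, n=3):
    """Return up to n products not yet in the basket, ranked by best lift."""
    in_basket = set(basket)
    best_lift = {}
    for antecedent, consequent, lift in rules:
        # A rule applies when its antecedent is in the basket
        # and its consequent is something the customer does not have yet.
        if antecedent in in_basket and consequent not in in_basket:
            best_lift[consequent] = max(lift, best_lift.get(consequent, 0.0))
    return sorted(best_lift, key=best_lift.get, reverse=True)[:n]

recs = top_recs(
    ["Large Lemon", "Organic Raspberries", "Organic Strawberries"], toy_rules
)
print(recs)
```

Excluding items already in the basket avoids recommending, for example, strawberries to someone who has just added them.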

As a result of the market basket analysis, I obtained a set of association rules. Taking the first 'lime-lemon' rule as an example, we can see that limes and lemons occur together in 1.2% of orders. A confidence of 0.26 means that lemons are also purchased in 26% of the orders that contain limes. Moreover, a lift of 4.26 means that limes and lemons appear together in an order 4.26 times more often than would be expected if they were chosen independently of each other.

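A leverage of 0.009 means that limes and lemons appear together in about 0.9 percentage points more orders than expected under independence (1.2% observed versus roughly 0.3% expected). The relationship is positive, but small in absolute terms because each item is present in only a few percent of orders. A conviction of 1.27 indicates a moderate dependence of lemon purchases on lime purchases: the rule "lime implies lemon" would be wrong about 1.27 times as often if the two items were independent. Higher conviction values indicate that the consequent depends more strongly on the antecedent.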
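
For reference, the four metrics can be recomputed by hand from the three support values in the lime-lemon row, using the standard definitions of confidence, lift, leverage and conviction. The variable names are mine, and small differences in the last digits come from the rounding of the table values.

```python
# Support values taken from the (Limes) -> (Large Lemon) row of the rules table.
support_lime = 0.04598    # antecedent support
support_lemon = 0.062     # consequent support
support_both = 0.012156   # support of the pair

# Share of lime orders that also contain lemons.
confidence = support_both / support_lime

# How much more often the pair occurs than under independence.
lift = confidence / support_lemon

# Observed co-occurrence minus the co-occurrence expected under independence.
leverage = support_both - support_lime * support_lemon

# Ratio of expected to observed frequency of "limes without lemons".
conviction = (1 - support_lemon) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.6f} conviction={conviction:.4f}")
```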

In summary, the market basket analysis revealed some interesting and meaningful patterns. However, it should be noted that the analysis was in effect limited to fruit and vegetables: only the most commonly purchased items clear the 1% support threshold, and these are mostly produce. As a result, the association rules generated pertain only to these categories. Nonetheless, the insights gained from this analysis can still be valuable for optimizing product offerings and improving the customer experience in the produce section of the store or on the website.