<a href="https://www.kaggle.com/code/sayan15/market-basket-analysis?scriptVersionId=159192639" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Market Basket Analysis**


## **Introduction**
Market Basket Analysis serves as a valuable tool for businesses aiming to refine their product offerings, boost cross-selling opportunities, and enhance marketing strategies. Its application can result in increased revenue, heightened customer satisfaction, and overall business success.

Let's see how to put this into practice.
The steps we will follow are:


1.   Data Loading
2.   Data Cleaning
3.   Exploratory Data Analysis
4.   Algorithm Selection and Model Training
5.   Results
6.   Summary



To begin, we need a dataset in which to look for purchasing patterns.

(Dataset - https://statso.io/market-basket-analysis-case-study/)

## **Data Loading**

Once we have downloaded our data, it's time to load it and explore it.

In [1]:
# Importing Libraries
import pandas as pd
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
pio.templates.default = "plotly_white"
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning, module="ipykernel")


data=pd.read_csv('/kaggle/input/market-basket-dataset/market_basket_dataset.csv')
print(data.head())

   BillNo  Itemname  Quantity  Price  CustomerID
0    1000    Apples         5   8.30       52299
1    1000    Butter         4   6.06       11752
2    1000      Eggs         4   2.66       16415
3    1000  Potatoes         4   8.10       22889
4    1004   Oranges         2   7.26       52255


The data consists of five fields: `BillNo`, `Itemname`, `Quantity`, `Price`, and `CustomerID`. Before analyzing them, let's check for null values and fix any we find.

## **Data Cleaning**

In [2]:
data.isnull().sum()

BillNo        0
Itemname      0
Quantity      0
Price         0
CustomerID    0
dtype: int64

There are no null values, so we can proceed.

## **Exploratory Data Analysis**

Let's look at the basic statistics of each attribute. Only `Quantity` and `Price` are of interest here, since the other numerical attributes are just sequential numbers or assigned IDs.

In [3]:
# The describe() method is used for calculating some statistical data like percentile, mean and std of the numerical values of the Series or DataFrame.
data[['Quantity','Price']].describe()

         Quantity       Price
count  500.000000  500.000000
mean     2.978000    5.617660
std      1.426038    2.572919
min      1.000000    1.040000
25%      2.000000    3.570000
50%      3.000000    5.430000
75%      4.000000    7.920000
max      5.000000    9.940000


Let's look at the distribution of items, that is, the number of times each item has been billed.

In [4]:
# Create a histogram
fig = px.histogram(data, x='Itemname', title='Item Distribution')

# Add y-axis name
fig.update_layout(
    yaxis=dict(title='Item count'),
)

# Show the plot
fig.show()

We can see that Bananas appear on the most bills. But let's dive deeper and see which items sell the most by quantity (for now, the top 10).

In [5]:
# Calculate item popularity
item_popularity=data.groupby('Itemname')['Quantity'].sum().sort_values(ascending=False)

top_n = 10
fig = go.Figure()
fig.add_trace(go.Bar(x=item_popularity.index[:top_n], y=item_popularity.values[:top_n],
                     text=item_popularity.values[:top_n], textposition='auto',
                     marker=dict(color='skyblue')))
fig.update_layout(title=f'Top {top_n} Most Popular Items',
                  xaxis_title='Item Name', yaxis_title='Total Quantity Sold')
fig.show()

Banana is clearly the most popular item by both order count and quantity sold.
Now let's look at customer behaviour: the average quantity purchased and the total spending per customer.

In [6]:
# Calculate average quantity and spending per customer
customer_behavior = data.groupby('CustomerID').agg({'Quantity': 'mean', 'Price': 'sum'}).reset_index()

# Create a DataFrame to display the values
table_data = pd.DataFrame({
    'CustomerID': customer_behavior['CustomerID'],
    'Average Quantity': customer_behavior['Quantity'],
    'Total Spending': customer_behavior['Price']
})


# Create a Plotly table
fig = go.Figure(data=[go.Table(
    header=dict(values=table_data.columns, fill=dict(color='#f2f2f2'), align='left', font=dict(size=14, color='black', family='Arial, sans-serif')),
    cells=dict(values=[table_data[col] for col in table_data.columns], align='left', font=dict(size=12, color='black', family='Arial, sans-serif'), height=30),
)])

# Customize the layout to fix headers
fig.update_layout(
    height=300,  # Set the overall height of the plot
    margin=dict(l=0, r=0, b=0, t=0),  # Adjust margins
)
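
The aggregation above sums `Price` per customer. If `Price` is a unit price rather than a line total (an assumption; the dataset description does not say), spending would instead be `Quantity * Price`. A minimal sketch on a few made-up rows, where `LineTotal` is a column name introduced only for illustration:

```python
import pandas as pd

# Hypothetical rows in the same shape as the dataset (values are made up).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Quantity": [2, 3, 4],
    "Price": [1.50, 2.00, 0.50],
})

# Assumption: Price is a unit price, so a line's cost is Quantity * Price.
toy["LineTotal"] = toy["Quantity"] * toy["Price"]

# Average quantity per line and total spending per customer.
spending = toy.groupby("CustomerID").agg(
    avg_quantity=("Quantity", "mean"),
    total_spending=("LineTotal", "sum"),
).reset_index()

print(spending)
```

If `Price` already holds the line total, the original `'Price': 'sum'` aggregation is the right one and this step is unnecessary.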

## **Algorithm Selection and Training**

From the raw data alone, it is difficult to see which items customers buy together, and the associations between items are still unclear.

Let's apply an algorithm to uncover them. The Apriori algorithm is one of the most widely used algorithms for finding associations between items and itemsets. Before applying it, let's look at a basic definition.



The **Apriori algorithm** is a classic algorithm in data mining and machine learning used for association rule mining. Association rule mining aims to discover interesting relationships or patterns within large datasets. Specifically, Apriori is designed to identify **Frequent Itemsets** and generate **Association Rules** based on the concept of **Support**.



1.   **Frequent Itemsets** - Sets of items that appear frequently enough, based on a chosen support threshold.
2.   **Association Rules** - Express relationships like "if you buy X, you're likely to buy Y." Measured by metrics like confidence and lift.
3.   **Support** - A measure of how often a group of items (itemset) appears in the dataset.
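
To make these definitions concrete, here is a small sketch that computes support, confidence and lift by hand. The five baskets are made up for illustration and the helper functions are hypothetical names, not part of any library:

```python
# Five hypothetical baskets (made-up data, not from the dataset).
baskets = [
    {"Bread", "Apples", "Milk"},
    {"Bread", "Apples"},
    {"Bread", "Butter"},
    {"Apples", "Milk"},
    {"Milk", "Eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

pair_support = support({"Bread", "Apples"}, baskets)       # 2 of 5 baskets
rule_confidence = confidence({"Bread"}, {"Apples"}, baskets)  # 2 of the 3 Bread baskets
rule_lift = lift({"Bread"}, {"Apples"}, baskets)

print(pair_support, rule_confidence, rule_lift)
```

In this toy data the lift is slightly above 1, so Bread and Apples appear together a little more often than they would if purchased independently.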




In [7]:
from mlxtend.frequent_patterns import apriori, association_rules

# Group items by BillNo and create a list of items for each bill
basket = data.groupby('BillNo')['Itemname'].apply(list).reset_index()

# Encode items as binary variables using one-hot encoding
basket_encoded = basket['Itemname'].str.join('|').str.get_dummies('|')

# Find frequent itemsets using the Apriori algorithm with a low support threshold (1%)
frequent_itemsets = apriori(basket_encoded, min_support=0.01, use_colnames=True)

# Generate association rules, keeping those with a lift of at least 0.5
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=0.5)



DataFrames with non-bool types result in worse computational performance and their support might be discontinued in the future. Please use a DataFrame with bool type
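
The message above appears because `str.get_dummies` produces 0/1 integers rather than booleans. In the notebook's own pipeline, `basket_encoded.astype(bool)` should be enough to address it. As an alternative, the sketch below builds a boolean basket matrix directly with `pd.crosstab`; the rows are made up, and the sketch stops short of calling mlxtend:

```python
import pandas as pd

# Hypothetical bill lines in the same shape as the dataset (values are made up).
toy = pd.DataFrame({
    "BillNo": [1000, 1000, 1000, 1004, 1004],
    "Itemname": ["Apples", "Butter", "Apples", "Oranges", "Butter"],
})

# Count how often each item appears on each bill, then reduce to True/False.
# A repeated item on the same bill (Apples on bill 1000) still maps to True.
basket_bool = pd.crosstab(toy["BillNo"], toy["Itemname"]) > 0

print(basket_bool)
print(basket_bool.dtypes)
```

A frame like `basket_bool` has one row per bill and one boolean column per item, which is the input shape the warning asks for.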



## **Results**

Let's look at the associations and try to understand them.

In [8]:
# Display association rules

print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))


  antecedents consequents   support  confidence      lift
0     (Bread)    (Apples)  0.045752    0.304348  1.862609
1    (Apples)     (Bread)  0.045752    0.280000  1.862609
2    (Butter)    (Apples)  0.026144    0.160000  0.979200
3    (Apples)    (Butter)  0.026144    0.160000  0.979200
4    (Cereal)    (Apples)  0.019608    0.096774  0.592258
5    (Apples)    (Cereal)  0.019608    0.120000  0.592258
6    (Cheese)    (Apples)  0.039216    0.214286  1.311429
7    (Apples)    (Cheese)  0.039216    0.240000  1.311429
8    (Apples)   (Chicken)  0.032680    0.200000  1.530000
9   (Chicken)    (Apples)  0.032680    0.250000  1.530000


Here are our results. Let's go through each piece of information, using the first association rule (row 0) as an example.



*   **Antecedent** - The items that form the starting point, or “if” part, of the association rule. In the first rule, Bread is the antecedent, meaning "if Bread is bought".
*   **Consequent** - The items that tend to be purchased along with the antecedent, or the “then” part of the association rule. In the first rule, Apples is the consequent, meaning "if Bread is bought, then Apples are likely to be bought too".
*   **Support** - Support measures how frequently a particular combination of items (antecedents and consequents together) appears in the dataset. It is the proportion of transactions in which the items are bought together. In the first row, support is about 4.6%, meaning that Bread and Apples were bought together in about 4.6% of bills (baskets).
*   **Confidence** - Confidence quantifies the likelihood of the consequent being purchased when the antecedent is already in the basket. In other words, it estimates the probability of buying the consequent given that the antecedent is bought. In the first rule, if Bread is already in the basket, there is about a 30% chance that Apples are bought too. The reverse rule in row 1, with Apples as the antecedent, has a confidence of 28%.
*   **Lift** - Lift measures the degree of association between the antecedent and consequent while accounting for the baseline purchase probability of the consequent. A lift greater than 1 indicates a positive association, meaning the items are bought together more often than would be expected if they were independent. A value less than 1 indicates a negative association. For example, the first rule has a lift of approximately 1.86, suggesting a positive association between Bread and Apples.
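
As a sanity check, the reported lift can be reproduced from the other columns. The sketch below uses only the values printed in rows 0 and 1 of the rules table, together with the standard relations confidence(X → Y) = support(X, Y) / support(X) and lift(X → Y) = confidence(X → Y) / support(Y):

```python
# Values copied from rows 0 and 1 of the rules table above.
support_both = 0.045752          # support of {Bread, Apples}
conf_bread_to_apples = 0.304348  # row 0
conf_apples_to_bread = 0.280000  # row 1
reported_lift = 1.862609

# Each item's own support can be recovered from the pair support
# and the confidence of the rule that has that item as antecedent.
support_bread = support_both / conf_bread_to_apples
support_apples = support_both / conf_apples_to_bread

# Lift is the same in both directions, up to rounding of the printed values.
lift_bread_to_apples = conf_bread_to_apples / support_apples
lift_apples_to_bread = conf_apples_to_bread / support_bread

print(round(support_bread, 3), round(support_apples, 3))
print(round(lift_bread_to_apples, 2), round(lift_apples_to_bread, 2))
```

Both directions give a lift of about 1.86, matching the table, which is why rows 0 and 1 share the same support and lift but differ in confidence.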


So placing Bread and Apples near each other may help increase sales. Similarly, you can interpret each association rule and derive a strategy for placing the items.




Let's find the **top** product associations.

In [9]:
# Display Top 50 association rules
rules=rules.sort_values(by=['confidence', 'lift'], ascending=False)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(50))

                             antecedents                        consequents  \
13485          (Butter, Yogurt, Oranges)                      (Eggs, Bread)   
13492                      (Eggs, Bread)          (Butter, Yogurt, Oranges)   
15715             (Eggs, Onions, Apples)             (Pasta, Milk, Chicken)   
15716            (Onions, Pasta, Apples)              (Eggs, Milk, Chicken)   
15717          (Onions, Apples, Chicken)                (Eggs, Pasta, Milk)   
15719              (Eggs, Onions, Pasta)            (Milk, Apples, Chicken)   
15723              (Onions, Pasta, Milk)            (Eggs, Apples, Chicken)   
15724            (Onions, Milk, Chicken)              (Eggs, Pasta, Apples)   
15725              (Eggs, Pasta, Apples)            (Onions, Milk, Chicken)   
15726            (Eggs, Apples, Chicken)              (Onions, Pasta, Milk)   
15730            (Milk, Apples, Chicken)              (Eggs, Onions, Pasta)   
15732                (Eggs, Pasta, Milk)          (O

From this we can conclude some top associations.

1. If Apples and Juice are in the basket, there is a high chance the person will also buy Yogurt, Sugar, Potatoes and Coffee.
2. If Eggs and Bread are in the basket, there is a high chance the person will also buy Yogurt, Oranges and Butter.
3. If Tea, Coffee and Milk are in the basket, there is a high chance the person will also buy Pasta, Apples and Yogurt.
4. Similarly, (Sugar, Tea, Milk) are associated with (Pasta, Apples, Coffee).
5. Similarly, if Chicken and Milk are in the basket, there is a high chance the person will buy Eggs too.
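
One caveat when ranking by confidence: a rule built on a rare itemset can reach a confidence of 1.0 while being supported by only a handful of bills. A hedged sketch of filtering before ranking is shown below; the rows and the `min_support` threshold are hypothetical, and only the column names follow the `rules` frame printed earlier:

```python
import pandas as pd

# Made-up rules in the same column layout as the mlxtend output.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"Eggs", "Bread"}), frozenset({"Bread"}), frozenset({"Cereal"})],
    "consequents": [frozenset({"Butter"}), frozenset({"Apples"}), frozenset({"Apples"})],
    "support": [0.013, 0.046, 0.020],
    "confidence": [1.00, 0.30, 0.10],
    "lift": [6.00, 1.86, 0.59],
})

min_support = 0.03  # hypothetical threshold; tune it to the dataset size

# Keep rules that are both reasonably frequent and positively associated,
# then rank the survivors by lift and confidence.
strong = toy_rules[(toy_rules["support"] >= min_support) & (toy_rules["lift"] > 1)]
strong = strong.sort_values(by=["lift", "confidence"], ascending=False)

print(strong[["antecedents", "consequents", "support", "confidence", "lift"]])
```

With these toy values only the Bread and Apples rule survives: the first rule has perfect confidence but too little support, and the third has a lift below 1.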

## **Summary**
Market Basket Analysis is a valuable tool for businesses seeking to optimize their product offerings, increase cross-selling opportunities, and improve marketing strategies. It can lead to higher revenue, enhanced customer satisfaction, and overall business success.