In [1]:
from google.cloud import bigquery
from google.oauth2 import service_account
import time
import matplotlib.pyplot as plt
import seaborn as sns
from google.cloud import bigquery_storage
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from itertools import combinations, islice
from collections import defaultdict
from tqdm import tqdm

# Explanation of the Apriori Algorithm and Association Rule Mining

## Apriori Algorithm

The Apriori algorithm is a popular algorithm used for association rule mining in data mining and machine learning. It aims to discover relationships or associations between items in a dataset. The algorithm follows a "bottom-up" approach and uses a breadth-first search strategy to find frequent itemsets.

Here's a high-level overview of how the Apriori algorithm works:

1. Support: The algorithm begins by calculating the support of each item in the dataset. Support represents the frequency of occurrence of an item in the dataset.

2. Frequent Itemsets: Next, the algorithm generates a set of frequent itemsets based on a minimum support threshold set by the user. Frequent itemsets are subsets of items that meet the minimum support threshold.

3. Candidate Generation: The algorithm uses the frequent itemsets from the previous step to generate a new set of candidate itemsets. These candidate itemsets are formed by joining the frequent itemsets with themselves.

4. Pruning: The algorithm prunes the candidate itemsets that contain subsets that are not frequent, because every subset of a frequent itemset must itself be frequent. This step helps reduce the search space and improve the efficiency of the algorithm.

5. Repeat: Steps 3 and 4 are repeated iteratively, with the surviving candidates checked against the minimum support threshold each round, until no new frequent itemsets can be generated.
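
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration on a hypothetical toy dataset, not the code used later in this notebook; the function name `apriori` and the `toy` baskets are assumptions made for the example, and the support counting is deliberately naive.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Steps 1-2: support of single items, filtered by the threshold.
    items = {item for b in baskets for item in b}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Step 3: join frequent (k-1)-itemsets to form k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Step 4: prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Keep only the candidates that meet the support threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1  # Step 5: repeat with larger itemsets.
    return frequent


# Hypothetical toy baskets, for illustration only.
toy = [
    ["bread", "milk", "butter"],
    ["bread", "milk"],
    ["bread", "butter"],
    ["milk", "butter"],
    ["bread", "milk", "butter"],
]
result = apriori(toy, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a threshold of 0.6, the three single items and the three pairs are frequent, while the three-item set (support 0.4) is dropped.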

## Association Rules Mining

Association rules mining is the process of discovering interesting relationships or associations among items in a dataset. It is based on the concept of conditional probability and is often used in market basket analysis, where the goal is to find associations between products frequently purchased together.

Association rules are typically represented in the form of "if-then" statements. For example, if a customer buys product A, then there is a high probability that they will also buy product B.

## General Example

If a customer buys bread and milk, then there is a high probability that they will also buy butter.
This rule implies that customers who purchase bread and milk are likely to purchase butter as well. This insight can be valuable for the grocery store in terms of product placement, promotions, and recommendations to customers.

## Support

Support is a measure that indicates the frequency or popularity of a combination of items in a dataset. It quantifies how often a particular combination of items appears together in the dataset. Support is calculated as the ratio of the number of transactions containing the combination of items to the total number of transactions in the dataset.  

For example, if we have a dataset of 100 transactions, and the combination of items A and B appears together in 30 transactions, then the support for the combination (A, B) would be 30/100 = 0.3 or 30%.
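
The calculation can be reproduced with a small helper. This is only a sketch: the `support` function and the 100 toy transactions are made up to match the numbers in the example above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# 100 hypothetical transactions; A and B appear together in 30 of them.
transactions = [["A", "B"]] * 30 + [["A"]] * 20 + [["B"]] * 10 + [["C"]] * 40

support_ab = support(transactions, ["A", "B"])
print(support_ab)
```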

## Confidence (from A to B)

Confidence measures the reliability or strength of an association rule. Specifically, it quantifies how often an item B is purchased when item A is purchased. Confidence is calculated as the ratio of the number of transactions containing both items A and B to the number of transactions containing item A.  

For example, if item A appears in 50 transactions, and out of those 50 transactions, item B also appears in 30 transactions, then the confidence from A to B would be 30/50 = 0.6 or 60%.
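
A short sketch of the same calculation, using hypothetical toy transactions in which A appears 50 times and A and B appear together 30 times. The `confidence` helper is an illustration, not part of the analysis code below. Note that confidence is directional: the value from A to B generally differs from the value from B to A.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_a


# Hypothetical data: A in 50 transactions, B in 40, both together in 30.
transactions = [["A", "B"]] * 30 + [["A"]] * 20 + [["B"]] * 10 + [["C"]] * 40

conf_a_to_b = confidence(transactions, ["A"], ["B"])  # 30 / 50
conf_b_to_a = confidence(transactions, ["B"], ["A"])  # 30 / 40
print(conf_a_to_b, conf_b_to_a)
```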

## Lift

Lift is a measure that assesses the strength of the association between two items (A and B) beyond what would be expected by chance. It compares the observed probability of items A and B appearing together to the expected probability of their co-occurrence if they were statistically independent. Lift is calculated as the ratio of the support for the combination of items (A, B) to the product of the individual supports of items A and B.
A lift value greater than 1 indicates a positive association or correlation between the items. The higher the lift value, the stronger the association.  

For example, if the support for the combination (A, B) is 0.3, and the individual supports for items A and B are 0.5 and 0.4 respectively, then the lift would be 0.3 / (0.5 * 0.4) = 1.5.
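
The same arithmetic as a small function. This is a sketch with an assumed helper name `lift`; it also shows that lift equals the confidence from A to B divided by the support of B, which can be a convenient way to read it.

```python
import math


def lift(support_ab, support_a, support_b):
    """Observed co-occurrence relative to what independence would predict."""
    return support_ab / (support_a * support_b)


value = lift(0.3, 0.5, 0.4)

# Equivalent view: confidence(A -> B) / support(B).
confidence_a_to_b = 0.3 / 0.5
same_value = confidence_a_to_b / 0.4

print(round(value, 6), math.isclose(value, same_value))
```

Because of floating-point rounding, the result is approximately 1.5 rather than exactly 1.5, so comparisons are best done with a tolerance.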

# Load data from BigQuery

In [3]:
def bq_connector():
    key_path = "C:/Users/HEN1/Projects/Instacart_Market_Basket_Analysis/keys/plucky-mile-327121-255163f80b63.json"
    credentials = service_account.Credentials.from_service_account_file(
        key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )

    bqclient = bigquery.Client(credentials=credentials, project=credentials.project_id,)
    bqstorageclient = bigquery_storage.BigQueryReadClient(credentials=credentials)
    return bqclient, bqstorageclient

def bq_full_table_df(bqclient, bqstorageclient, table_name):
    sql_query = f"SELECT * FROM instacart.{table_name}"
    query_job = bqclient.query(sql_query)
    time.sleep(30)
    count =0 
    while query_job.state !='DONE':
        print("NOT DONE")
        if query_job.state =='PENDING':
            print(f"job from {table_name} is pending")
            break
        if query_job.state =='RUNNING':
            print(f"job from {table_name} is running")
            print(query_job.result())
            time.sleep(60)
            query_job.reload()
            time.sleep(10)
            count += 1
            if count>3:
                break
        else:
            print("may meet an error")
            break
    if query_job.state == 'DONE':
        print(f"successfully finished getting data from {table_name} table")
        df = query_job.to_dataframe(bqstorage_client=bqstorageclient, 
                                    progress_bar_type='tqdm_notebook',)
        print("successfully transferred to df")
        time.sleep(3)
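    else:
        # Fail loudly here; otherwise `return df` would raise UnboundLocalError.
        raise RuntimeError(f"query for {table_name} did not finish (state: {query_job.state})")

    return df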
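
As a possible simplification, the manual sleep-and-poll loop can be replaced by `query_job.result()`, which blocks until the job finishes and raises if the query failed. The sketch below is an alternative under that assumption, not the function used in this notebook; the name `fetch_table_df` and the `dataset` parameter are hypothetical.

```python
def fetch_table_df(bqclient, bqstorageclient, table_name, dataset="instacart"):
    """Run SELECT * on one table and return the result as a DataFrame."""
    sql_query = f"SELECT * FROM {dataset}.{table_name}"
    query_job = bqclient.query(sql_query)
    # result() waits for completion and raises on failure,
    # so no time.sleep() or state checks are needed.
    query_job.result()
    return query_job.to_dataframe(bqstorage_client=bqstorageclient)
```

It would be called the same way as the original, for example `fetch_table_df(bqclient, bqstorageclient, 'aisles')`.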

In [4]:
bqclient, bqstorageclient = bq_connector()

In [4]:
aisles = bq_full_table_df(bqclient, bqstorageclient, 'aisles')
time.sleep(3)

successfully finished getting data from aisles table


successfully transferred to df


In [5]:
departments = bq_full_table_df(bqclient, bqstorageclient, 'departments')
time.sleep(60)

successfully finished getting data from departments table


successfully transferred to df


In [6]:
orders = bq_full_table_df(bqclient, bqstorageclient, 'orders')
time.sleep(60)

NOT DONE
job from orders is running
<google.cloud.bigquery.table.RowIterator object at 0x0000025CC6B07100>
successfully finished getting data from orders table


successfully transferred to df


In [12]:
products = bq_full_table_df(bqclient, bqstorageclient, 'products')
time.sleep(60)

successfully finished getting data from products table


successfully transferred to df


In [5]:
order_products_prior = bq_full_table_df(bqclient, bqstorageclient, 'order_products_prior')
time.sleep(60)

NOT DONE
job from order_products_prior is running
<google.cloud.bigquery.table.RowIterator object at 0x000001D0BEE14640>
successfully finished getting data from order_products_prior table


successfully transferred to df


In [9]:
order_products_train = bq_full_table_df(bqclient, bqstorageclient, 'order_products_train')

NOT DONE
job from order_products_train is running
<google.cloud.bigquery.table.RowIterator object at 0x0000025CC6AE9220>
successfully finished getting data from order_products_train table


successfully transferred to df


# Extract data from dataset

In [17]:
order_products_prior[['order_id', 'product_id']].head(20)

Unnamed: 0,order_id,product_id
0,1091226,27735
1,1679699,1100
2,3172993,31404
3,490071,6187
4,400235,4809
5,583860,24184
6,3092797,37935
7,155898,2245
8,241827,33754
9,1495286,39275


In [6]:
transactions = order_products_prior.groupby('order_id')['product_id'].apply(list).tolist()
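
To make the shape of `transactions` concrete, here is a pandas-free sketch of the same grouping on a few made-up `(order_id, product_id)` rows. The variable names and values are hypothetical; the result is one list of product ids per order, with orders in ascending `order_id` order as `groupby` produces by default.

```python
from collections import defaultdict

# Hypothetical (order_id, product_id) rows, for illustration only.
rows = [(1, 10), (1, 20), (2, 10), (1, 30), (2, 40)]

baskets = defaultdict(list)
for order_id, product_id in rows:
    baskets[order_id].append(product_id)

# One list of product ids per order, ordered by order_id.
toy_transactions = [baskets[order_id] for order_id in sorted(baskets)]
print(toy_transactions)
```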


In [7]:
def generate_itemsets(transactions):
    itemsets = []

    # Generate all itemsets of length 2 within each transaction.
    # combinations() keeps the order of items in the cart, so (A, B) and
    # (B, A) are recorded as two different pairs.
    for transaction in tqdm(transactions, desc="Generating Itemsets"):
        itemsets.extend(combinations(transaction, 2))

    # Create a DataFrame from itemsets
    itemsets_df = pd.DataFrame(itemsets, columns=['itemA_id', 'itemB_id'])
    itemsets_df['count'] = 1

    return itemsets_df


def analyze_itemsets(itemsets_df, transactions):
    num_transactions = len(transactions)

    # Calculate supportAB for each itemset
    itemsets_df_groupby = itemsets_df.groupby(['itemA_id', 'itemB_id'], as_index=False)['count'].sum()
    itemsets_df_groupby['supportAB'] = itemsets_df_groupby['count'] / num_transactions
    print("Finished calculate supportAB")
    itemsets_df_groupby.drop(['count'], axis=1, inplace=True)

    # Sort itemsets based on supportAB in descending order
    itemsets_df_groupby.sort_values('supportAB', ascending=False, inplace=True)

    # Create a DataFrame to store the item counts for single items
    single_item_counts = pd.Series(transactions).explode().value_counts().reset_index()
    single_item_counts.columns = ['item_id', 'count']

    # Calculate supportA and supportB for the top itemsets
    output = pd.merge(itemsets_df_groupby, single_item_counts, left_on='itemA_id', right_on='item_id', how='left')
    output['supportA'] = output['count'] / num_transactions
    output.drop(['item_id', 'count'], axis=1, inplace=True)

    output = pd.merge(output, single_item_counts, left_on='itemB_id', right_on='item_id', how='left')
    output['supportB'] = output['count'] / num_transactions
    output.drop(['item_id', 'count'], axis=1, inplace=True)

    # Calculate confidenceAtoB, confidenceBtoA, and lift for the top itemsets
    output['confidenceAtoB'] = output['supportAB'] / output['supportA']
    output['confidenceBtoA'] = output['supportAB'] / output['supportB']
    output['lift'] = output['supportAB'] / (output['supportA'] * output['supportB'])

    return output
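
Because `combinations` follows the order of items within each cart, the pair (A, B) and the pair (B, A) are counted separately, and the co-occurrence count of a product pair is split across two rows. A possible alternative, sketched here with a hypothetical helper `count_pairs` and made-up baskets, is to sort each basket first so that every unordered pair has a single canonical key. Counting with a `Counter` also avoids materializing one DataFrame row per pair occurrence.

```python
from collections import Counter
from itertools import combinations


def count_pairs(transactions):
    """Count unordered product pairs across all baskets."""
    pair_counts = Counter()
    for basket in transactions:
        # sorted(set(...)) gives each pair one canonical order
        # and ignores duplicate items within a basket.
        pair_counts.update(combinations(sorted(set(basket)), 2))
    return pair_counts


# Hypothetical baskets: the same two products in both cart orders.
toy_transactions = [[13176, 47209], [47209, 13176], [13176, 21137, 47209]]
pair_counts = count_pairs(toy_transactions)
print(pair_counts.most_common(3))
```

Dividing each count by the number of transactions would then give one support value per unordered pair.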


In [8]:
# Generate itemsets
itemsets_df = generate_itemsets(transactions)

Generating Itemsets: 100%|██████████| 3214874/3214874 [00:24<00:00, 129648.85it/s]


In [10]:

# Compute support, confidence, and lift for every itemset pair and store the results in a DataFrame
top_itemsets_df = analyze_itemsets(itemsets_df, transactions)

Finished calculate supportAB


In [19]:
top_itemsets_df = top_itemsets_df.merge(products[['product_id', 'product_name']], left_on='itemA_id', right_on='product_id', how='left').rename(columns={'product_name': 'productA_name'})
top_itemsets_df = top_itemsets_df.merge(products[['product_id', 'product_name']], left_on='itemB_id', right_on='product_id', how='left').rename(columns={'product_name': 'productB_name'})


In [22]:
top_itemsets_df = top_itemsets_df.drop(columns=['product_id_x', 'product_id_y'])

In [43]:
pd.set_option('display.max_rows', 200)

In [23]:
top_itemsets_df.head(100)

Unnamed: 0,itemA_id,itemB_id,supportAB,supportA,supportB,confidenceAtoB,confidenceBtoA,lift,productA_name,productB_name
0,13176,47209,0.009736,0.11803,0.066436,0.082488,0.146547,1.241609,Bag of Organic Bananas,Organic Hass Avocado
1,47209,13176,0.009655,0.066436,0.11803,0.145334,0.081805,1.231335,Organic Hass Avocado,Bag of Organic Bananas
2,13176,21137,0.009602,0.11803,0.082331,0.081352,0.116626,0.988111,Bag of Organic Bananas,Organic Strawberries
3,21137,13176,0.009568,0.082331,0.11803,0.116211,0.081062,0.98459,Organic Strawberries,Bag of Organic Bananas
4,24852,21137,0.008838,0.146993,0.082331,0.060123,0.107344,0.730261,Banana,Organic Strawberries
5,21137,24852,0.00863,0.082331,0.146993,0.10482,0.058709,0.713092,Organic Strawberries,Banana
6,24852,47766,0.008329,0.146993,0.054999,0.056663,0.151441,1.030256,Banana,Organic Avocado
7,47766,24852,0.00828,0.054999,0.146993,0.150542,0.056327,1.024139,Organic Avocado,Banana
8,24852,21903,0.008035,0.146993,0.075251,0.054663,0.106779,0.726418,Banana,Organic Baby Spinach
9,21903,24852,0.007951,0.075251,0.146993,0.105667,0.054094,0.718854,Organic Baby Spinach,Banana
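
As a sanity check, the confidence and lift columns can be recomputed from the three support columns. The sketch below uses the rounded values printed in row 0 of the table above, so the results agree with the table only up to rounding.

```python
# Rounded values copied from row 0 (Bag of Organic Bananas -> Organic Hass Avocado).
support_ab, support_a, support_b = 0.009736, 0.11803, 0.066436

confidence_a_to_b = support_ab / support_a
confidence_b_to_a = support_ab / support_b
lift = support_ab / (support_a * support_b)

print(round(confidence_a_to_b, 4), round(confidence_b_to_a, 4), round(lift, 3))
```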


In [44]:
top_itemsets_df[(top_itemsets_df['productA_name']=='Organic Strawberries') & (top_itemsets_df['supportAB'] > 0.0005)].sort_values(by='lift', ascending=False)

Unnamed: 0,itemA_id,itemB_id,supportAB,supportA,supportB,confidenceAtoB,confidenceBtoA,lift,productA_name,productB_name
1854,21137,19706,0.000757,0.082331,0.004075,0.009192,0.185711,2.25567,Organic Strawberries,Organic Nectarine
835,21137,38159,0.001182,0.082331,0.006626,0.014357,0.178395,2.166813,Organic Strawberries,Organic Yellow Peaches
1129,21137,38777,0.001004,0.082331,0.006764,0.012196,0.148448,1.803068,Organic Strawberries,Organic Green Seedless Grapes
2292,21137,26790,0.000665,0.082331,0.005017,0.008081,0.132618,1.610799,Organic Strawberries,Organic AppleApple
1754,21137,26940,0.000781,0.082331,0.005999,0.009487,0.130198,1.581403,Organic Strawberries,Organic Large Green Asparagus
1749,21137,17600,0.000784,0.082331,0.0061,0.009521,0.128499,1.560769,Organic Strawberries,"YoKids Squeezers Organic Low-Fat Yogurt, Straw..."
285,21137,39928,0.001997,0.082331,0.015597,0.024259,0.128059,1.55542,Organic Strawberries,Organic Kiwi
554,21137,43122,0.001482,0.082331,0.011729,0.018003,0.126372,1.534936,Organic Strawberries,Organic Bartlett Pear
25,21137,27966,0.005299,0.082331,0.042632,0.06436,0.124291,1.509659,Organic Strawberries,Organic Raspberries
3153,21137,11782,0.000559,0.082331,0.004547,0.006789,0.122931,1.493131,Organic Strawberries,Bing Cherries


In [35]:
top_itemsets_df[top_itemsets_df['lift'].between(0.99, 1.01, inclusive='neither')].head()


Unnamed: 0,itemA_id,itemB_id,supportAB,supportA,supportB,confidenceAtoB,confidenceBtoA,lift,productA_name,productB_name
12,24852,16797,0.006566,0.146993,0.044466,0.044669,0.147666,1.004576,Banana,Strawberries
65,21903,47626,0.003585,0.075251,0.047485,0.04764,0.075496,1.003263,Organic Baby Spinach,Large Lemon
152,13176,44359,0.002488,0.11803,0.02118,0.021075,0.117446,0.995054,Bag of Organic Bananas,Organic Small Bunch Celery
208,24852,25890,0.002282,0.146993,0.01564,0.015524,0.1459,0.992562,Banana,Boneless Skinless Chicken Breasts
228,22035,13176,0.002192,0.018562,0.11803,0.118071,0.018569,1.000351,Organic Whole String Cheese,Bag of Organic Bananas


In [124]:
top_itemsets_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 55137021 entries, 0 to 55137020
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   itemA_id        int64  
 1   itemB_id        int64  
 2   supportAB       float64
 3   supportA        float64
 4   supportB        float64
 5   confidenceAtoB  float64
 6   confidenceBtoA  float64
 7   lift            float64
dtypes: float64(6), int64(2)
memory usage: 3.7 GB


In [24]:
top_itemsets_df.supportAB.max()

0.009735995874177339

# Insights And Findings

1. The association rule between "Bag of Organic Bananas" and "Organic Hass Avocado" has the highest support in the dataset (0.009736), with a confidence of 0.082488, so this is the pair most often purchased together. The lift value (1.241609) is above 1, which indicates a positive association: an order containing "Bag of Organic Bananas" is about 24% more likely to contain "Organic Hass Avocado" than it would be if the two items were independent. This is a fairly intuitive finding, because shoppers who choose organic bananas are plausibly health-conscious and likely to buy other organic produce such as avocados.

2. The association rule from "Organic Strawberries" to "Banana" has a relatively high support (0.008630) and confidence (0.104820), so the two items often appear in the same order. However, the lift value (0.713092) is below 1: "Banana" is about 29% less likely to be in an order that contains "Organic Strawberries" than independence would predict. The high support mostly reflects how popular each item is on its own. The table filtered on "Organic Strawberries" points in the same direction: its highest-lift partners are other organic fruits such as nectarines, yellow peaches, and green seedless grapes, which suggests that shoppers who buy organic strawberries lean toward organic produce rather than conventional groceries.

3. The association rule from "Banana" to "Strawberries" has a moderate support (0.006566) and confidence (0.044669). Its lift value (1.004576) is very close to 1, which means that having "Banana" in an order has almost no effect on the likelihood of "Strawberries" being in it. The same holds for other popular itemsets: "Organic Baby Spinach" and "Large Lemon", "Bag of Organic Bananas" and "Organic Small Bunch Celery", "Banana" and "Boneless Skinless Chicken Breasts", and "Organic Whole String Cheese" and "Bag of Organic Bananas". The items in each of these pairs are approximately independent of each other. There is no obvious common-sense link between them; they appear together mainly because each is popular on its own.

4. The association rule between "Organic Hass Avocado" and "Organic Raspberries" has a relatively low support (0.004014) and confidence (0.060417), so the two items are not purchased together very often. Even so, the lift value (1.417158) indicates that "Organic Raspberries" is about 41.72% more likely to be purchased when "Organic Hass Avocado" is in the order than it would be under independence. This finding could be used to boost sales, for example by creating a "BUY X GET Y" discount for these two products.
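
The percentages quoted in these findings come from reading lift as a relative change against independence. A minimal sketch of that conversion, using a hypothetical helper and the lift values cited above:

```python
def lift_to_percent_change(lift):
    """Percent change in co-occurrence relative to independence (lift = 1)."""
    return (lift - 1.0) * 100.0


# Lift values quoted in the findings above.
quoted_lifts = {
    "Bag of Organic Bananas / Organic Hass Avocado": 1.241609,
    "Organic Strawberries / Banana": 0.713092,
    "Banana / Strawberries": 1.004576,
    "Organic Hass Avocado / Organic Raspberries": 1.417158,
}

changes = {pair: lift_to_percent_change(v) for pair, v in quoted_lifts.items()}
for pair, change in changes.items():
    print(f"{pair}: {change:+.2f}%")
```

A positive value means the pair co-occurs more often than independence would predict, a negative value means less often, and a value near zero means the items are roughly independent.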

# Limitations

1. Because the dataset is large, this notebook only covers itemsets with 2 items. Relationships among larger itemsets remain to be explored.

2. The Apriori algorithm focuses on identifying frequent itemsets but does not consider the order in which items appear within a transaction. It may overlook important sequential patterns or temporal dependencies present in the data.

3. Association rules generated by the Apriori algorithm provide information about co-occurrence patterns but say nothing about causality. Measures such as lift quantify how strongly two items are associated, but they do not capture the more complex interactions and dependencies that may exist between items.