# ***Grocery List Apriori Algorithm***
---

## Purpose:
The purpose of this jupyter notebook is to take a grocery list (provided here by the kind folks at Kaggle: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset ) and create an Apriori Algorithm from scratch in order to create association rules for grocery items. In layman's terms:
1) If a customer goes to the store, how likely are they to buy milk and bread **(Support)?**
2) If they buy milk, how confident are we that they will also buy bread **(Confidence)?**
3) When controlled for the fact that almost everyone buys milk at the store, does milk actually increase the likelihood that someone will buy bread **(lift)?**

These association rules can be applied to purchasing patterns (market basket analysis), netflix / youtube / spotify / etc watching and listening patterns (content recommendation), fraud detection, healthcare diagnosis, inventory management, and so much more! There are prebuilt Apriori packages for Python, but in order to test and improve my programming, math, and reasoning skills - I'm building this one from scratch.

## The Math Behind Apriori
---

There are three important metrics with the apriori algorithm. I will explain the use case, as well as how to calculate them so that we have a strict understanding of what we need to program.
1) **Support:** How likely a set of items is to appear on a receipt. This is the probability of milk and bread together. It can be calculated by: (#times milk and bread appears on receipts together / total receipts)
2) **Confidence:** How likely a person is to buy item B if they buy item A. It can be calculated by: support / P item B (#times bread appears on receipts / total receipts).
3) **Lift:** When you control for how likely it is for someone to buy milk, is there still a correlation between someone buying milk and then buying bread, would they be just as likely to buy bread without milk, and does buying milk have a negative correlation - meaning does someone buying milk make them LESS likely to buy bread? This can be calculated by: Confidence / P of item A (#times milk appears on receipts)

When it gets time to choose what items to showcase, I'll explain these numbers in greater detail.

# The Code
---

In [4]:
# Importing packages and creating global variables.
import itertools
from collections import defaultdict
import pandas as pd
GROCERIES = pd.read_csv("Groceries_dataset.csv")
display(GROCERIES.head())
grocery_lists = defaultdict(int)
member_numbers = []
item_combos = []
combo_counts = defaultdict(int)
item_counts = defaultdict(int)
ASSOCIATION_RULES = {}

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk


In [5]:
def create_member_list():
    """Creates a list of member numbers in order to see what they buy at the store to determine relationships."""
    for member_num in GROCERIES['Member_number']:
        if member_num not in member_numbers:
            member_numbers.append(member_num)

In [6]:
def create_grocery_list():
    """Uses member numbers to sort the items they bought into lists."""
    for i in member_numbers:
        grocery_lists.update({i: GROCERIES[GROCERIES['Member_number'] == i]['itemDescription'].values})

In [7]:
def create_combos():
    """Creates every possible item combo that could be made from member lists. This also counts the number times an item appears in a list. We only count the item once per list, even if the member bought 6 gallons of milk in that time period."""
    for list in grocery_lists:
        for item in grocery_lists[list]:
            if item in grocery_lists[list]:
                item_counts[item] += 1
        item_combos.append(set(itertools.combinations(grocery_lists[list], 2)))

In [8]:
def count_combos():
    """ Counts how often the items are paired together. Makes sure it counts a,b and b,a because combinations doesn't account for flipping the order."""
    for items in item_combos:
        for item in items:
            combo_counts[(item[0], item[1])] += 1
            combo_counts[(item[1], item[0])] += 1

### Important Math Ideas
---
Coming back to the 3 values we talked about earlier, here is a bit more information about what they mean:
1) ### **Support:**
- **Very Low Support (Under 0.1%):** Should only be used when determining rare but critical associations. Most commonly used in healthcare or fraud detection, when the consequences of not noticing the relationship could be dire.
- **Low to Medium Support (0.1% - 5%):** Looking at interesting patterns. Most commonly used in Market Basket Analysis.
- **High Support > 5%:** Captures extremely common associations.
- Keep in mind that we have about 3800 members in that list. a 0.1% support would only be us trying to make a decision off 3.8 receipts, which isn't a good idea. This is why I chose a support of 1% and higher.
2) ### **Confidence:**
- **Low Confidence (<50%):** Too weak of a confidence to be accepted in most contexts.
- **Medium Confidence (50-70%):** Indicates strong associations. Often most used in market basket analysis.
- **High Confidence (>70%):** Strong associations, especially if there is high lift and support.
- **Very High Confidence (>90%):** Be careful! This could be from errors or a rule that is very ridiculous. (This is how I caught that I was double and trip counting items, because the confidence of someone buying milk on one trip to the store followed buying milk on a subsequent visit is extremely high).
3) ### **Lift:**
- **Lift = 1:** A and B are independent of each other (meaning there isn't an association)
- **Lift > 1: A** and B are positively correlated. This means that A increases the likelihood of B)
- **Lift <1: A** and B are Negatively Correlated. (A decreases the likelihood of B).
- All 3 numbers are interesting! For this, the goal was to find if buying A will increase B, so I will use a lift of > 1.

In [9]:
def create_association_rules():
    """Creates the support, confidence, and lift values for each group of items to determine which ones are most likely to be purchased
    together."""
    for a,b in combo_counts:
        support = round(combo_counts[a,b] / len(grocery_lists), 3)
        confidence = round(support / (item_counts[a] / len(grocery_lists)), 3)
        lift = round(confidence / (item_counts[b] / len(grocery_lists)), 3)
        if support >= 0.01 and confidence >= 0.5 and lift >= 1:
            ASSOCIATION_RULES[a,b]= {'support':support, 'confidence':confidence, 'lift':lift}

In [10]:
def association_rule_csv():
    association_rules_df = pd.DataFrame(ASSOCIATION_RULES).transpose()
    display(association_rules_df)
    association_rules_df.to_csv('ASSOCIATION_RULES.csv')

In [11]:
create_member_list()
create_grocery_list()
create_combos()
count_combos()
create_association_rules()
association_rule_csv()

Unnamed: 0,Unnamed: 1,support,confidence,lift
pot plants,other vegetables,0.016,0.529,1.086
liquor,whole milk,0.019,0.719,1.12
mustard,whole milk,0.017,0.72,1.122
liquor (appetizer),soda,0.01,0.582,1.498
UHT-milk,other vegetables,0.044,0.531,1.091
semi-finished bread,other vegetables,0.019,0.522,1.072
soft cheese,other vegetables,0.02,0.513,1.054
specialty chocolate,other vegetables,0.031,0.503,1.033
specialty bar,other vegetables,0.027,0.501,1.029
cat food,other vegetables,0.026,0.573,1.177


The most interesting associations are: People who buy milk also buy the following: mustard (72% confidence), liquor (71.9%), zwieback(71.5%), condensed milk (67.6%). It seems that there appears to be associations between other vegetables and rolls/buns and different items as well. These associations can help us create either recommendations on mobile ordering platforms, or coupons to encourage increased buying habits.

## Conclusion
---
Thank you so much for exploring the Apriori Algorithm with me. It's use cases are so interesting, from healthcare, to market basket analysis, to fraud detection - this is one I definitely will add to my toolbelt. (Albeit, I'll use the prebuilt algorithm next time to save myself 4 or so hours). If you found this interesting, feel free to connect with me on https://www.linkedin.com/in/pskavs1775/ or read some of my blogs found here: https://medium.com/@pskavs