# Market Basket

**Market Basket Analysis** is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

Association rules are widely used to analyze retail basket or transaction data. The goal is to identify strong rules in that data using measures of interestingness such as support, confidence, and lift.

**Apriori** is an algorithm for frequent itemset mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent itemsets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
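
The level-wise idea described above can be sketched in plain Python. This is a minimal illustration under my own assumptions, not the implementation used by any of the libraries tried later in this notebook; the function name `apriori_sketch` and the toy baskets are made up for the example.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: keep the individual items that are frequent on their own.
    items = {item for basket in baskets for item in basket}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent


# Hypothetical toy data, only for illustration.
toy_baskets = [
    ["milk", "butter", "bread"],
    ["milk", "butter"],
    ["milk", "bread"],
    ["butter", "bread"],
    ["milk", "butter", "bread", "eggs"],
]
frequent_itemsets = apriori_sketch(toy_baskets, min_support=0.6)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

With a 0.6 threshold, the three single items and the three pairs among milk, butter, and bread survive, while eggs and the three-item set are dropped. The prune step is what keeps the search manageable: a larger itemset is only counted if all of its smaller subsets were already frequent.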

**An example of Association Rules**   
Assume there are 100 customers   
10 of them bought milk, 8 bought butter and 6 bought both of them.   
Rule: bought milk => bought butter   
support = P(Milk & Butter) = 6/100 = 0.06   
confidence = support/P(Milk) = 0.06/0.10 = 0.60   
lift = confidence/P(Butter) = 0.60/0.08 = 7.5

Note: this example is extremely small. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
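
To check the milk and butter numbers, the 100-customer example can be rebuilt as raw baskets and measured directly. This is a small sketch; the helper names `support`, `confidence`, and `lift` are my own, and the split into "only milk", "only butter", and "neither" follows from the counts given in the example. For the rule milk => butter, confidence divides by P(Milk) and lift then divides by P(Butter).

```python
def support(transactions, items):
    """Proportion of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) for the rule antecedent => consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the baseline popularity of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# 100 customers: 6 bought both, 4 only milk, 2 only butter, 88 neither.
customers = (
    [["milk", "butter"]] * 6
    + [["milk"]] * 4
    + [["butter"]] * 2
    + [[]] * 88
)

print(round(support(customers, ["milk", "butter"]), 2))       # 0.06
print(round(confidence(customers, ["milk"], ["butter"]), 2))  # 0.6
print(round(lift(customers, ["milk"], ["butter"]), 2))        # 7.5
```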

**Some important terms**:   
**Support**: How popular an itemset is, measured as the proportion of transactions in which the itemset appears.   
**Confidence**: How likely item Y is to be purchased when item X is purchased, written {X -> Y}. It is measured as the proportion of transactions containing item X in which item Y also appears.   
**Lift**: How likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. A lift above 1 means Y is more likely to be bought when X is bought; a lift below 1 means it is less likely.

Source: https://www.kaggle.com/heeraldedhia/groceries-dataset




### Why?

I found this open-source dataset on Kaggle. It caught my attention because I'm interested in working for HEB (a Texas-based grocery company). It looked like a way to gain some experience with an ML algorithm that is common in this industry but not covered in the curriculum.

# Plan



In [None]:
# find more info on what the Apriori algorithm is, what its inputs are, and how to use and evaluate it
# exobrain: https://www.geeksforgeeks.org/apriori-algorithm/?ref=lbp
# https://www.geeksforgeeks.org/frequent-item-set-in-data-set-association-rule-mining/?ref=lbp
# https://analyticsindiamag.com/beginners-guide-to-understanding-apriori-algorithm-with-implementation-in-python/
# above is python 2.7 only, did not work

In [None]:
# pip install apriori did not work: the package installed, but importing it raised an error
# installed mlxtend instead and will try that

# Acquire / Prep

In [1]:
import pandas as pd
import numpy as np
import efficient_apriori as apriori
# pip install apriori gave a python 2.7 package; a search on the error message turned up efficient-apriori, which supports python 3

In [8]:
df = pd.read_csv('Market_Basket_Optimisation.csv', header=None)
# store_data.csv is for python 2.7 example and does not work

In [9]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [13]:
df.fillna(0, inplace=True)

In [14]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,chutney,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,turkey,avocado,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,mineral water,milk,energy bar,whole wheat rice,green tea,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
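# TransactionEncoder expects a list of transactions (each a list of item names), not a DataFrame,
# so collect each row's items and skip the 0 / NaN placeholders used for empty cells
encodedf = [[str(item) for item in row if pd.notna(item) and item != 0] for row in df.values]

In [None]:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(encodedf).transform(encodedf)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)
df_encoded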
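
To make the encoding step less of a black box, here is a plain-Python sketch of the shape of table a transaction encoder produces: one column per distinct item and one True/False row per basket. It is an illustration only, not mlxtend's code; `one_hot_encode` and the toy baskets (taken from the first rows shown by `df.head()`) are my own choices.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        # True where the basket contains the column's item, False otherwise.
        rows.append([item in present for item in columns])
    return columns, rows


toy_transactions = [
    ["burgers", "meatballs", "eggs"],
    ["chutney"],
    ["turkey", "avocado"],
]
columns, rows = one_hot_encode(toy_transactions)
print(columns)
for row in rows:
    print(row)
```

Passing `rows` and `columns` to `pd.DataFrame(rows, columns=columns)` would give a boolean frame laid out like `df_encoded`.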

In [None]:
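from mlxtend.frequent_patterns import apriori as mlxtend_apriori

# apriori needs the one-hot encoded frame, not the raw item table;
# min_support=0.6 would require an itemset to appear in 60% of baskets, so use the much lower threshold chosen later in this notebook
mlxtend_apriori(df_encoded, min_support=0.0045, use_colnames=True)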

In [None]:
# store_data.csv is a dataset where each row is a transaction and lists the items
# walking through example using info at: 
# https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/
# need better understanding of how to manipulate other datasets into this format for the algorithm
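
On the question of reshaping other datasets: a common case is "long" data with one purchased item per row, which has to be grouped into one list per basket. Below is a rough sketch in plain Python. The row layout (member number, date, item description) and the values are made up for illustration, and treating each (member, date) pair as one basket is an assumption that may not suit every dataset.

```python
from collections import defaultdict

# Hypothetical long-format rows: (member_number, date, item_description).
long_rows = [
    (1, "2015-07-21", "tropical fruit"),
    (2, "2015-01-05", "whole milk"),
    (1, "2015-07-21", "rolls/buns"),
    (2, "2015-01-05", "yogurt"),
    (1, "2015-09-30", "whole milk"),
]

baskets = defaultdict(list)
for member, date, item in long_rows:
    # One basket per (member, date) pair, i.e. one shopping trip.
    baskets[(member, date)].append(item)

# List of lists, the same shape as the per-row transactions in store_data.csv.
records_from_long = list(baskets.values())
print(records_from_long)
```

With pandas, the same grouping could be written as a `groupby` on the member and date columns followed by collecting the item column into lists.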

In [None]:
# apriori requires a list of lists and not a pandas dataframe
# skip the 0 / NaN placeholders so they are not treated as items
records = []
for row in df.values:
    records.append([str(item) for item in row if pd.notna(item) and item != 0])

In [None]:
records

In [None]:
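# remove any leftover placeholder strings ('0' or 'nan') and drop empty transactions
no_null = [[item for item in record if item not in ('0', 'nan')] for record in records]
no_null = [record for record in no_null if record]
no_null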

In [None]:
# Suppose we only want rules for items that are purchased at least 5 times a day,
# or 7 x 5 = 35 times in one week, since our dataset is for a one-week time period. The support for those
# items is 35/7500 ≈ 0.0047, rounded down here to 0.0045. The minimum confidence for the rules is 20% or 0.2.
# The walkthrough linked above also sets a minimum lift of 3 and min_length of 2 (at least two
# products per rule); those arguments are not passed in the calls below.

# association_rules = apriori.apriori(records, min_support=0.0045, min_confidence=0.2)
# association_results = list(association_rules)
# print(association_results[0])
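
The threshold arithmetic can be written out so it is easy to redo for a different purchase rate. A small sketch: the variable names are mine, and the row count of 7501 is taken from the loop bound used earlier in this notebook.

```python
n_transactions = 7501    # assumed row count, from the range(0, 7501) loop above
purchases_per_day = 5
days = 7                 # the dataset is described as covering one week

# Minimum number of baskets an item must appear in, and the matching support.
min_count = purchases_per_day * days
min_support = min_count / n_transactions
print(min_count, round(min_support, 4))  # 35 0.0047
```

Using 0.0045 instead of 0.0047 is slightly more permissive, so a few itemsets with just under 35 occurrences can pass.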

In [None]:
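import efficient_apriori

# efficient_apriori.apriori returns a tuple: (frequent itemsets, rules)
itemsets, rules = efficient_apriori.apriori(no_null, min_support=0.0045, min_confidence=0.2)
print(rules)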

In [None]:
# Print out every rule with 2 items on the left hand side,
# 1 item on the right hand side, sorted by lift
rules_rhs = filter(lambda rule: len(rule.lhs) == 2 and len(rule.rhs) == 1, rules)
for rule in sorted(rules_rhs, key=lambda rule: rule.lift):
    print(rule)  # Prints the rule and its confidence, support, lift, ...