<a href="https://colab.research.google.com/github/KelvinLam05/Market-Basket-Analysis-with-Apriori/blob/main/market_basket_analysis_using_FP_growth_algorithm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

Market basket analysis, also known as affinity analysis, is a data mining technique that retailers use to better understand consumer purchasing patterns. It analyzes which items are frequently purchased together, allowing retailers to identify associations between items. Used appropriately, it can give retailers a competitive advantage: the information helps them not only understand consumer behavior but also influence it.

**Dataset information**

The dataset belongs to "The Bread Basket", a retail bakery located in Edinburgh. It has 20,507 entries (one row per item purchased) covering over 9,000 transactions, and 5 columns.


**Load the packages**

In [198]:
# Importing libraries
import pandas as pd

**Load the data**

In [199]:
# Load the data
dataset = pd.read_csv('/content/bakery_transactions.csv')

In [200]:
# Change the data frame's column names to lower case
dataset.columns = dataset.columns.str.lower()

In [201]:
# Convert every string value in the data frame to lower case
# Note: pandas 2.1+ deprecates applymap in favor of DataFrame.map
dataset = dataset.applymap(lambda s: s.lower() if isinstance(s, str) else s)

In [202]:
# Examine the data
dataset.head()

| | transaction | item | date_time | period_day | weekday_weekend |
|---|---|---|---|---|---|
| 0 | 1 | bread | 30-10-2016 09:58 | morning | weekend |
| 1 | 2 | scandinavian | 30-10-2016 10:05 | morning | weekend |
| 2 | 2 | scandinavian | 30-10-2016 10:05 | morning | weekend |
| 3 | 3 | hot chocolate | 30-10-2016 10:07 | morning | weekend |
| 4 | 3 | jam | 30-10-2016 10:07 | morning | weekend |


In [203]:
# Overview of all variables, their datatypes
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20507 entries, 0 to 20506
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   transaction      20507 non-null  int64 
 1   item             20507 non-null  object
 2   date_time        20507 non-null  object
 3   period_day       20507 non-null  object
 4   weekday_weekend  20507 non-null  object
dtypes: int64(1), object(4)
memory usage: 801.2+ KB


**Check for missing values**

Before moving on, we check whether there are any null values to impute. As the output below shows, no column has missing values, so no imputation is needed.

In [204]:
dataset.isnull().sum()

transaction        0
item               0
date_time          0
period_day         0
weekday_weekend    0
dtype: int64

**Getting the list of transactions**

Once the dataset is loaded, we need the list of items in each transaction. These lists serve as the input from which the association rules are generated.

In [205]:
# Group the rows by transaction and collect each transaction's items into a list
transaction_list = dataset.groupby(['transaction', 'date_time'])['item'].apply(list)

In [206]:
transaction_list.head()

transaction  date_time       
1            30-10-2016 09:58                          [bread]
2            30-10-2016 10:05     [scandinavian, scandinavian]
3            30-10-2016 10:07    [hot chocolate, jam, cookies]
4            30-10-2016 10:08                         [muffin]
5            30-10-2016 10:13          [coffee, pastry, bread]
Name: item, dtype: object

In [207]:
# Convert the series of item lists into a plain Python list of lists
df = transaction_list.values.tolist()

In [208]:
df[:5]

[['bread'],
 ['scandinavian', 'scandinavian'],
 ['hot chocolate', 'jam', 'cookies'],
 ['muffin'],
 ['coffee', 'pastry', 'bread']]
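
The grouping step can be illustrated without pandas. The following is a minimal sketch, assuming a handful of hypothetical `(transaction, item)` rows modelled on the first rows of the dataset; the names `rows`, `baskets_by_id`, and `baskets` are illustrative and do not appear in the notebook.

```python
from collections import defaultdict

# Hypothetical rows in (transaction, item) form, modelled on the
# first few rows of the bakery data.
rows = [
    (1, "bread"),
    (2, "scandinavian"),
    (2, "scandinavian"),
    (3, "hot chocolate"),
    (3, "jam"),
    (3, "cookies"),
]

# Collect the items of each transaction, preserving row order.
baskets_by_id = defaultdict(list)
for transaction_id, item in rows:
    baskets_by_id[transaction_id].append(item)

# Drop the transaction ids to obtain a list of lists.
baskets = [baskets_by_id[t] for t in sorted(baskets_by_id)]
print(baskets)
```

Each inner list holds one transaction's items, which is the shape the encoding step expects.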

**One-hot encoding transaction data**

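Using a TransactionEncoder object, we can transform this dataset into an array format suitable for typical machine learning APIs. Via the fit method, the TransactionEncoder learns the unique labels in the dataset, and via the transform method, it transforms the input dataset (a Python list of lists) into a one-hot encoded NumPy boolean array. Because each column only records whether an item is present, repeated items in a transaction (such as the two scandinavian entries in transaction 2) collapse into a single True.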

In [209]:
from mlxtend.preprocessing import TransactionEncoder

In [210]:
# Instantiate TransactionEncoder and identify unique items
encoder = TransactionEncoder().fit(df)

In [211]:
# One-hot encode transactions
onehot = encoder.transform(df)

In [212]:
# Convert one-hot encoded data to data frame
transf_df = pd.DataFrame(onehot, columns = encoder.columns_)

In [213]:
transf_df.head()

Unnamed: 0,adjustment,afternoon with the baker,alfajores,argentina night,art tray,bacon,baguette,bakewell,bare popcorn,basket,bowl nic pitt,bread,bread pudding,brioche and salami,brownie,cake,caramel bites,cherry me dried fruit,chicken sand,chicken stew,chimichurri oil,chocolates,christmas common,coffee,coffee granules,coke,cookies,crepes,crisps,drinking chocolate spoons,duck egg,dulce de leche,eggs,ella's kitchen pouches,empanadas,extra salami or feta,fairy doors,farm house,focaccia,frittata,...,lemon and coconut,medialuna,mighty protein,mineral water,mortimer,muesli,muffin,my-5 fruit shoot,nomad bag,olum & polenta,panatone,pastry,pick and mix bowls,pintxos,polenta,postcard,raspberry shortbread sandwich,raw bars,salad,sandwich,scandinavian,scone,siblings,smoothies,soup,spanish brunch,spread,tacos/fajita,tartine,tea,the bart,the nomad,tiffin,toast,truffles,tshirt,valentine's card,vegan feast,vegan mincepie,victorian sponge
0,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
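
To show what the fit and transform steps do conceptually, here is a plain-Python sketch on three toy baskets. It is not mlxtend's implementation, only the same idea written by hand; the names `baskets`, `columns`, and `onehot_rows` are assumptions made for this example.

```python
# Toy baskets (hypothetical, modelled on the first transactions).
baskets = [
    ["bread"],
    ["scandinavian", "scandinavian"],
    ["coffee", "pastry", "bread"],
]

# "fit": learn the sorted list of unique item labels.
columns = sorted({item for basket in baskets for item in basket})

# "transform": one boolean per label, True if the item is in the basket.
# A repeated item still yields a single True.
onehot_rows = [
    [label in set(basket) for label in columns]
    for basket in baskets
]

print(columns)
for row in onehot_rows:
    print(row)
```

The second basket contains "scandinavian" twice, yet its row has exactly one True, which is why purchase quantities play no role in the analysis that follows.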


**Definition**

Association rule analysis is a technique for uncovering how items are associated with each other. There are three common ways to measure association.

**Measure 1: Support.** This says how popular an itemset is, as measured by the proportion of transactions in which an itemset appears. 

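**Measure 2: Confidence.** This says how likely item Y is to be purchased when item X is purchased, expressed as {X → Y}. It is measured as the proportion of transactions containing X that also contain Y.

**Measure 3: Lift.** This says how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. It is the confidence of {X → Y} divided by the support of Y. A lift of 1 means X and Y appear together about as often as expected if they were independent. A lift greater than 1 means Y is more likely to be bought when X is bought; a lift less than 1 means Y is less likely to be bought when X is bought.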
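
The three measures can be computed by hand on a small example. The sketch below assumes five invented baskets and hypothetical helper functions `support`, `confidence`, and `lift`; it is meant only to make the definitions concrete, not to replace the library calls used later.

```python
# Five hypothetical baskets, invented purely for illustration.
baskets = [
    {"coffee", "cake"},
    {"coffee", "bread"},
    {"coffee", "cake", "bread"},
    {"bread"},
    {"coffee"},
]

def support(itemset, baskets):
    """Proportion of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(x, y, baskets):
    """support(X and Y together) / support(X), for the rule {X -> Y}."""
    return support(set(x) | set(y), baskets) / support(x, baskets)

def lift(x, y, baskets):
    """confidence(X -> Y) / support(Y)."""
    return confidence(x, y, baskets) / support(y, baskets)

print(support({"cake"}, baskets))                 # 2 of 5 baskets
print(confidence({"cake"}, {"coffee"}, baskets))  # every cake basket has coffee
print(lift({"cake"}, {"coffee"}, baskets))
```

In this toy data every basket with cake also contains coffee, so the confidence of {cake → coffee} is 1.0, while coffee alone appears in 4 of 5 baskets, giving a lift of 1.25.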

**Run the FP-growth algorithm**

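We will find the frequent itemsets using the FP-growth algorithm, keeping only itemsets that appear in at least 5% of transactions. The association rules are then derived from these itemsets.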


In [214]:
from mlxtend.frequent_patterns import fpgrowth

In [215]:
# Compute frequent itemsets using the FP-growth algorithm (minimum support of 5%)
frequent_itemsets = fpgrowth(transf_df, min_support = 0.05, use_colnames = True)

In [216]:
frequent_itemsets.sort_values('support', ascending = False)

| | support | itemsets |
|---|---|---|
| 3 | 0.478394 | (coffee) |
| 0 | 0.327205 | (bread) |
| 6 | 0.142631 | (tea) |
| 7 | 0.103856 | (cake) |
| 9 | 0.090016 | (bread, coffee) |
| 4 | 0.086107 | (pastry) |
| 8 | 0.071844 | (sandwich) |
| 5 | 0.061807 | (medialuna) |
| 1 | 0.05832 | (hot chocolate) |
| 10 | 0.054728 | (cake, coffee) |
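
To make the `min_support` threshold concrete, the sketch below finds frequent itemsets by brute force on five invented baskets. This is not FP-growth, which avoids enumerating candidates by compressing the transactions into a prefix tree; it only illustrates what "support at or above the threshold" means. The baskets and the threshold of 0.4 are assumptions chosen for the example.

```python
from itertools import combinations

# Five hypothetical baskets, invented purely for illustration.
baskets = [
    {"coffee", "cake"},
    {"coffee", "bread"},
    {"coffee", "cake", "bread"},
    {"bread"},
    {"coffee"},
]
min_support = 0.4  # hypothetical threshold for this toy data

items = sorted(set().union(*baskets))

# Brute force: test every single item and every pair of items.
frequent = {}
for size in (1, 2):
    for candidate in combinations(items, size):
        share = sum(set(candidate) <= basket for basket in baskets) / len(baskets)
        if share >= min_support:
            frequent[candidate] = share

for itemset, share in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, share)
```

The pair (bread, cake) occurs in only one of the five baskets, so it falls below the threshold and is dropped, just as rarely co-purchased bakery items are excluded from the table of frequent itemsets.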


**Examining the frequent itemsets**

In [217]:
# Record the number of items in each frequent itemset
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

| | support | itemsets | length |
|---|---|---|---|
| 0 | 0.327205 | (bread) | 1 |
| 1 | 0.05832 | (hot chocolate) | 1 |
| 2 | 0.054411 | (cookies) | 1 |
| 3 | 0.478394 | (coffee) | 1 |
| 4 | 0.086107 | (pastry) | 1 |
| 5 | 0.061807 | (medialuna) | 1 |
| 6 | 0.142631 | (tea) | 1 |
| 7 | 0.103856 | (cake) | 1 |
| 8 | 0.071844 | (sandwich) | 1 |
| 9 | 0.090016 | (bread, coffee) | 2 |


Calling value_counts() on the length column shows how many frequent itemsets contain a single item and how many contain multiple items: here, nine single items and two pairs.

In [218]:
frequent_itemsets['length'].value_counts()

1    9
2    2
Name: length, dtype: int64

**Calculate association rules**

In [219]:
from mlxtend.frequent_patterns import association_rules

In [220]:
# Compute all association rules from frequent_itemsets with a lift of at least 1.0
rules = association_rules(frequent_itemsets, metric = 'lift', min_threshold = 1.0)

In [221]:
rules.head()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (cake) | (coffee) | 0.103856 | 0.478394 | 0.054728 | 0.526958 | 1.101515 | 0.005044 | 1.102664 |
| 1 | (coffee) | (cake) | 0.478394 | 0.103856 | 0.054728 | 0.114399 | 1.101515 | 0.005044 | 1.011905 |


The {cake → coffee} rule has the highest confidence at 52.7%. However, coffee appears in 47.8% of all transactions (see frequent_itemsets), so a high confidence could simply reflect how popular coffee is rather than a real association with cake. Lift controls for this: at 1.10 it is greater than 1.0, meaning that customers who buy cake are about 10% more likely to buy coffee than customers overall. This suggests the association is real, although modest; lift on its own is not a test of statistical significance.
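
The metrics reported for {cake → coffee} can be reproduced from the three support values in the frequent itemsets table. The sketch below is a worked check that uses the rounded figures from the output, so small differences in the last decimal places are expected; the variable names are illustrative, and the leverage and conviction formulas are the standard definitions of those measures.

```python
# Support values copied from the frequent itemsets output (rounded).
support_cake = 0.103856
support_coffee = 0.478394
support_both = 0.054728  # support of (cake, coffee)

# Confidence: share of cake transactions that also contain coffee.
confidence = support_both / support_cake

# Lift: confidence relative to coffee's overall popularity.
lift = confidence / support_coffee

# Leverage: observed joint support minus the support expected
# if cake and coffee were independent.
leverage = support_both - support_cake * support_coffee

# Conviction: (1 - support of consequent) / (1 - confidence).
conviction = (1 - support_coffee) / (1 - confidence)

print(round(confidence, 3), round(lift, 3),
      round(leverage, 4), round(conviction, 3))
```

The results agree with the rules table to within rounding, which confirms that every column is derived from just these three support values.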



When two items X and Y are bought together more often than expected, several steps can be taken to increase profit. For instance:

* Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other.

* Promotional discounts could be applied to just one out of the two items.

* Advertisements on X could be targeted at buyers who purchase Y.

* X and Y could be combined into a new product, such as having Y in flavors of X.