# Apply association rule mining to the current dataset. For example: a customer who purchased peanut butter and jelly together is also likely to have purchased bread

In [1]:
# import required lib
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar 
# Frequent-itemset mining and rule generation (mlxtend)
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth

import warnings
warnings.filterwarnings("ignore")

# Load data from all the sources

In [2]:
#Read the data
orders_all = pd.read_csv("perf_test_orderdata/orders_all.csv")
orders_times = pd.read_csv("perf_test_orderdata/orders_times.csv")

In [3]:
# Left-join the two CSVs on the shared admin_reference column
merge = pd.merge(orders_all,orders_times,how='left',on='admin_reference')

In [4]:
#load product data collected from API
# products = pd.read_csv("products.csv")

# Prepare Data

In [5]:
# Drop the columns that are mostly null
orders = merge.drop(['completed_at_x','customer_company','bill_state_name','ship_state_name','ship_company','subsite_store','campaign_code','bill_company'],axis=1)

In [6]:
# Same drop as above, but keep campaign_code
campaign_code = merge.drop(['completed_at_x','customer_company','bill_state_name','ship_state_name','ship_company','subsite_store','bill_company'],axis=1)

In [7]:
# Prepare the data
def prep_data(data):
    # Split the timestamp into date and time
    new = data["completed_at_y"].str.split(" ", n=1, expand=True)
    data['Date'] = new[0]
    data['Time'] = new[1]
    # Split the date into year, month and day
    new = data["Date"].str.split("-", n=2, expand=True)
    data['Year'] = new[0]
    data['Month'] = new[1]
    data['Day'] = new[2]
    # Drop rows with null values; copy so later assignments write to an independent frame
    data = data.dropna().copy()
    # Convert the month number to its abbreviated name (e.g. 5 -> "May")
    data['Month'] = data['Month'].astype(int).apply(lambda x: calendar.month_abbr[x])
    # Split the time into hour and minute
    new = data["Time"].str.split(":", n=1, expand=True)
    data['Hour'] = new[0]
    data['Minute'] = new[1]

    # Parse the date, then derive the weekday name from it
    data['Date'] = pd.to_datetime(data['Date'])
    data['DayofWeek'] = data['Date'].dt.day_name()
    return data

In [8]:
orders = prep_data(orders)
campaign_code = prep_data(campaign_code)

In [9]:
#update values in orders table
# orders['group_name'] = None
# orders.update(products)

In [10]:
orders = orders.drop([ 'state', 'shipment_state',
       'currency', 'bill_zipcode',
       'ship_city', 'ship_zipcode', 'ship_country_iso_name'],axis=1)

In [11]:
orders.head(2)

Unnamed: 0,admin_reference,payment_state,total,bill_city,bill_country_iso_name,product_name,quantity,sku,completed_at_y,Date,Time,Year,Month,Day,Hour,Minute,DayofWeek
1,O160651894,paid,97.21,Hafrsfjord,NO,AROMA Svartvinbärstoppar 900g,1,WEB7098,2018-05-31 09:08,2018-05-31,09:08,2018,May,31,9,8,Thursday
2,O160651894,paid,97.21,Hafrsfjord,NO,AROMA HALLON & LAKRITSBÅTAR 900G 2:a sor,1,WEB7080,2018-05-31 09:08,2018-05-31,09:08,2018,May,31,9,8,Thursday


In [12]:
# products = products.drop(['ID', 'product_id', 'ean', 'is_master', 'weight',
#        'reference', 'source_owner', 'source_id', 'best_price', 'stock_available_qty', 'sku_api',
#        'group_description', 'group_properties','current_price'],axis=1)

In [13]:
orders.groupby(by = ['admin_reference','product_name']).sum().drop(['total','quantity'],axis=1)

admin_reference,product_name
O000000802,3 brett Fanta Lemon 33 cl
O000001094,Choco/krokant - 1 kg
O000001094,Chocolate Liqueurs 250g
O000001094,LAKRITSI LEMON - Hel låda - 30 st
O000001094,Lutti Krokodiler 1 kg
...,...
O999997046,Bamsemums Stor - 250g
O999997046,Brynild Myke Lakrisbåter 425 g
O999997046,Brynild Pulverpadder Original 120g
O999997046,Lutti Smiley Fizz 90g


## Association rules
Generate strong and weak rule sets from the data (a rule-based system).

1. Terms:
    1. Support: the fraction of transactions in the dataset that contain the itemset.
    2. Confidence: for a rule X -> Y, the fraction of transactions containing X that also contain Y, i.e. support(X and Y) / support(X).
    3. Lift: the ratio of the rule's confidence to its expected confidence if X and Y were independent, i.e. confidence(X -> Y) / support(Y). A lift above 1 means X and Y occur together more often than chance would predict.
    4. Conviction: the ratio of the expected frequency of X appearing without Y if X and Y were independent to the observed frequency of X appearing without Y, i.e. (1 - support(Y)) / (1 - confidence(X -> Y)).

2. Apriori: (https://www.youtube.com/watch?v=h_l3b2CIQ_o)
    1. Steps:
        1. Compute the support of the itemsets in the transactional database, and choose the minimum support and minimum confidence.
        2. Keep every itemset whose support is at least the minimum support.
        3. From these frequent itemsets, find all rules whose confidence is at least the minimum confidence.
        4. Sort the rules in decreasing order of lift.
    2. Pros:
        1. The algorithm is easy to understand.
        2. The join and prune steps are easy to implement, even on large datasets.
    3. Cons:
        1. Apriori is slow compared to other algorithms.
        2. Overall performance suffers because it scans the database multiple times.
        3. The time and space complexity of Apriori is O(2^D), which is very high. Here D is the horizontal width of the database, i.e. the number of distinct items.

3. FP-Growth (https://www.youtube.com/watch?v=ToswH_dA7KU)
    1. Steps:
        1. Scan the database once and find the frequent 1-itemsets (single-item patterns).
        2. Sort the frequent items in descending order of frequency to form the f-list.
        3. Scan the database again and construct the FP-tree.
        4. Construct the conditional FP-trees in reverse f-list order and generate the frequent itemsets from them.
    2. Pros:
        1. The algorithm scans the database only twice, whereas Apriori scans the transactions on every iteration.
        2. No candidate pairing of items is done, which makes it faster.
        3. The database is stored in memory in a compact form.
        4. It is efficient and scalable for mining both long and short frequent patterns.
    3. Cons:
        1. The FP-tree is more cumbersome and difficult to build than Apriori's candidate sets.
        2. When the database is large, the FP-tree may not fit in main memory.
        3. The FP-tree can be expensive to build.
#### Not Implemented
4. ECLAT (Equivalence Class Transformation) https://www.youtube.com/watch?v=oBiq8cMkTCU
    Eclat uses a depth-first search to discover frequent itemsets, whereas Apriori uses a breadth-first search. It represents the data vertically (each item maps to the set of transactions that contain it), unlike Apriori, which represents the data horizontally. This vertical layout makes Eclat faster than Apriori, so Eclat is often described as a more efficient and scalable approach to association rule learning.
    
    1. Pros:
        1. Because Eclat uses a depth-first search, it consumes less memory than Apriori.
        2. Eclat is generally faster than Apriori.
        3. Eclat does not repeatedly scan the data to compute individual support values.
        4. Eclat is better suited to small and medium datasets, whereas Apriori is used for large datasets.
        5. Eclat scans the currently generated dataset, unlike Apriori, which scans the original dataset.
    2. Cons:
        1. Intersecting long TID sets requires more memory and processing time.
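
The four terms defined above can be made concrete with a small worked example. The sketch below uses a hypothetical list of toy transactions (not the orders data), and the helper names `support` and `rule_metrics` are illustrative only; the notebook itself relies on mlxtend for these numbers.

```python
# Toy transactions (hypothetical data, not taken from the orders dataset).
transactions = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "bread"},
    {"jelly", "bread"},
    {"peanut butter", "jelly", "bread", "milk"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, lift and conviction for antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    sup_a = support(a, transactions)
    sup_c = support(c, transactions)
    sup_ac = support(a | c, transactions)
    confidence = sup_ac / sup_a          # assumes the antecedent occurs at least once
    lift = confidence / sup_c
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = float("inf") if confidence == 1 else (1 - sup_c) / (1 - confidence)
    return {"support": sup_ac, "confidence": confidence,
            "lift": lift, "conviction": conviction}

pbj_to_bread = rule_metrics({"peanut butter", "jelly"}, {"bread"}, transactions)
bread_to_jelly = rule_metrics({"bread"}, {"jelly"}, transactions)
print(pbj_to_bread)
print(bread_to_jelly)
```

In this toy data every basket with peanut butter and jelly also contains bread, so that rule has confidence 1 and infinite conviction, the same pattern as the `inf` entries that can appear in a rules table.
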
![image.png](attachment:image.png)
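
Eclat is not implemented in this notebook, but its vertical layout and depth-first TID-set intersection can be sketched in a few lines. This is a minimal illustration on hypothetical toy data; the function name `eclat` and its `min_support_count` parameter are assumptions for the example, not part of any library used here.

```python
from collections import defaultdict

def eclat(transactions, min_support_count):
    """Return {frozenset(itemset): support_count} via depth-first TID-set intersection."""
    # Vertical layout: item -> set of transaction ids (TIDs) that contain it.
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # `candidates` holds (item, tidset) pairs that are frequent together with `prefix`.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Intersect with the remaining candidates to grow the itemset depth-first.
            next_candidates = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support_count:
                    next_candidates.append((other, shared))
            if next_candidates:
                extend(itemset, next_candidates)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support_count),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent

# Hypothetical toy transactions.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
eclat_result = eclat(toy, min_support_count=3)
for itemset, count in eclat_result.items():
    print(sorted(itemset), count)
```

Support is read directly from the size of each intersected TID set, so the transactions are never rescanned after the vertical layout is built.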

### Apriori
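Before running mlxtend's `apriori` on the real basket, the level-wise idea (count, keep frequent itemsets, join, prune) can be illustrated on toy data. The sketch below is a simplified teaching version under assumed names (`apriori_sketch`, `toy_transactions`); it is not how mlxtend implements the algorithm and it stops at frequent itemsets without generating rules.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    # Level 1 candidates: every distinct item on its own.
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Count each candidate with one pass over the transactions.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: merge pairs of frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-item subsets are frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

# Hypothetical toy transactions.
toy_transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
apriori_result = apriori_sketch(toy_transactions, min_support=0.5)
for itemset, sup in apriori_result.items():
    print(sorted(itemset), sup)
```

Each pass of the `while` loop rescans every transaction, which is the repeated-scan cost listed among Apriori's drawbacks.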

In [14]:
# Strip leading and trailing whitespace from the product name
orders['product_name'] = orders['product_name'].str.strip()

In [15]:
#drop empty admin_reference
orders.dropna(axis=0, subset=['admin_reference'], inplace=True)

In [16]:
# Data cleaning: drop rows whose admin_reference contains 'C'
orders = orders[~orders['admin_reference'].str.contains('C')]
orders

Unnamed: 0,admin_reference,payment_state,total,bill_city,bill_country_iso_name,product_name,quantity,sku,completed_at_y,Date,Time,Year,Month,Day,Hour,Minute,DayofWeek
1,O160651894,paid,97.21,Hafrsfjord,NO,AROMA Svartvinbärstoppar 900g,1,WEB7098,2018-05-31 09:08,2018-05-31,09:08,2018,May,31,09,08,Thursday
2,O160651894,paid,97.21,Hafrsfjord,NO,AROMA HALLON & LAKRITSBÅTAR 900G 2:a sor,1,WEB7080,2018-05-31 09:08,2018-05-31,09:08,2018,May,31,09,08,Thursday
3,O082676927,paid,435.00,Vrigstad,SE,FREIA 43G MANDELSTANG x 30 st,1,KLIPP66654,2018-06-12 11:44,2018-06-12,11:44,2018,Jun,12,11,44,Tuesday
4,O082676927,paid,435.00,Vrigstad,SE,"DEVILS JORDGUBB/LAKRITS - 1,5 kg",1,GRA669,2018-06-12 11:44,2018-06-12,11:44,2018,Jun,12,11,44,Tuesday
5,O082676927,paid,435.00,Vrigstad,SE,Jordgubbsmattor Sockrade - 1 kg,1,KLIPP41192,2018-06-12 11:44,2018-06-12,11:44,2018,Jun,12,11,44,Tuesday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886928,O944466975,paid,2666.97,Støren,NO,Kinder Maxi 36 st,1,FER505,2020-09-28 15:45,2020-09-28,15:45,2020,Sep,28,15,45,Monday
886929,O944466975,paid,2666.97,Støren,NO,M&M Peanut 1 kg,1,WEB275810,2020-09-28 15:45,2020-09-28,15:45,2020,Sep,28,15,45,Monday
886930,O944466975,paid,2666.97,Støren,NO,Snickers 50g x 32 st,1,MAS280972,2020-09-28 15:45,2020-09-28,15:45,2020,Sep,28,15,45,Monday
886931,O944466975,paid,2666.97,Støren,NO,WEB-AFTER EIGHT 400G,1,WEB35256,2020-09-28 15:45,2020-09-28,15:45,2020,Sep,28,15,45,Monday


In [17]:
# Build the basket: one row per order, one column per product, values = total quantity
basket = orders.groupby(['admin_reference', 'product_name'])['quantity'].sum().unstack().reset_index().fillna(0).set_index('admin_reference')

In [18]:
basket

product_name,"""3DTavla """"Home""""""","""Chupa Chups """"Want U"""" magnet""",1 Flak FantaMezzo & 1 Flak CocaCola 33cl,1 brett Coca-Cola Zero,1 brett Coca-Cola och 1 Brett 7-UP,1 brett Dr Pepper + 1 brett Pepsi Max,1 brett Fanta Exotic + 2 brett Fanta org,1 brett Pepsi Max + 1 After Eight 400g,1 brett Sprite + 2 brett Coca-Cola,1 brett Sprite Zero - PANTFRITT,...,ÄT - Twist 145 g 1 aug,ÄT- Dipmix Vitl &Gurka 24g - 16st,ÄT- Rocher 4-pack - 16st 15 juni,ÄT- Rocher 4-pack - 16st 29juni,Ägglåda för 6st ägg,Äggskallar skum - 1 kg,"Äggskallar skum - 2,8 kg",Äppel Salmiak Öra 1 kg påse,Äppel ananas Öra 1 kg påse,"Äpple/kanelkola 1,3 kg"
admin_reference,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
O000000802,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
O000001094,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
O000003524,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
O000010704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
O000014086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
O999988295,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
O999988411,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
O999990596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
O999995529,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
# There are a lot of zeros in the data, but we also need to make sure any positive value is converted to 1
# and anything else is set to 0. This step completes the one-hot encoding of the data.
def encode_units(x):
    # 1 if the product appears in the order, otherwise 0
    return 1 if x >= 1 else 0

basket_sets = basket.applymap(encode_units)

In [20]:
basket_sets

product_name,"""3DTavla """"Home""""""","""Chupa Chups """"Want U"""" magnet""",1 Flak FantaMezzo & 1 Flak CocaCola 33cl,1 brett Coca-Cola Zero,1 brett Coca-Cola och 1 Brett 7-UP,1 brett Dr Pepper + 1 brett Pepsi Max,1 brett Fanta Exotic + 2 brett Fanta org,1 brett Pepsi Max + 1 After Eight 400g,1 brett Sprite + 2 brett Coca-Cola,1 brett Sprite Zero - PANTFRITT,...,ÄT - Twist 145 g 1 aug,ÄT- Dipmix Vitl &Gurka 24g - 16st,ÄT- Rocher 4-pack - 16st 15 juni,ÄT- Rocher 4-pack - 16st 29juni,Ägglåda för 6st ägg,Äggskallar skum - 1 kg,"Äggskallar skum - 2,8 kg",Äppel Salmiak Öra 1 kg påse,Äppel ananas Öra 1 kg påse,"Äpple/kanelkola 1,3 kg"
admin_reference,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
O000000802,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
O000001094,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
O000003524,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
O000010704,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
O000014086,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
O999988295,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
O999988411,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
O999990596,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
O999995529,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# Frequent itemsets: keep itemsets that appear in at least 0.3% of orders
frequent_itemsets = apriori(basket_sets, min_support=0.003, use_colnames=True)

In [22]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.004385,(1 st Mars Celebrations Kalender 215 g)
1,0.010164,(20-pack Mixade stycksaker)
2,0.019179,(3 BRETT PEPSI MAX)
3,0.007154,(3 Brett Coca-Cola - 394 kr ink. frakt)
4,0.006238,(3 Brett Pepsi Max + Milka 100g)
...,...,...
172,0.005676,"(Norsk Pant 2-kr 24st, Coca-Cola 24 st - Max 1..."
173,0.003105,"(Norsk Pant 2-kr 24st, Pepsi Max - 24st - MAKS..."
174,0.003044,"(Svensk Pant 1 kr 24, Pepsi Max 33 cl x 24 st)"
175,0.003392,"(Royal Crown Energy 25 cl x 24 st, Svensk Pant..."


In [23]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets.head(4)

Unnamed: 0,support,itemsets,length
0,0.004385,(1 st Mars Celebrations Kalender 215 g),1
1,0.010164,(20-pack Mixade stycksaker),1
2,0.019179,(3 BRETT PEPSI MAX),1
3,0.007154,(3 Brett Coca-Cola - 394 kr ink. frakt),1


In [24]:
frequent_itemsets['length'].value_counts()

1    172
2      5
Name: length, dtype: int64

In [25]:
# Generate association rules; a lift threshold of 0.1 filters out almost nothing
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.1)
rules.sort_values(by='confidence')

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
4,(Svensk Pant 1 kr 24),(Pepsi Max 33 cl x 24 st),0.078301,0.003063,0.003044,0.03888,12.691636,0.002804,1.037265
7,(Svensk Pant 1 kr 24),(Royal Crown Energy 25 cl x 24 st),0.078301,0.003392,0.003392,0.043319,12.771257,0.003126,1.041735
8,(Svensk Pant 1 kr 24),(SVENSK PANT 1 KR 12-pack),0.078301,0.016047,0.003461,0.044197,2.754287,0.002204,1.029452
9,(SVENSK PANT 1 KR 12-pack),(Svensk Pant 1 kr 24),0.016047,0.078301,0.003461,0.215663,2.754287,0.002204,1.175131
1,(Coca-Cola 24 st - Max 1 per order),(Norsk Pant 2-kr 24st),0.019626,0.008919,0.005676,0.289218,32.426804,0.005501,1.394352
3,(Pepsi Max - 24st - MAKS 1 PER ORDER),(Norsk Pant 2-kr 24st),0.009454,0.008919,0.003105,0.328485,36.829412,0.003021,1.475888
2,(Norsk Pant 2-kr 24st),(Pepsi Max - 24st - MAKS 1 PER ORDER),0.008919,0.009454,0.003105,0.34818,36.829412,0.003021,1.519662
0,(Norsk Pant 2-kr 24st),(Coca-Cola 24 st - Max 1 per order),0.008919,0.019626,0.005676,0.636403,32.426804,0.005501,2.696318
5,(Pepsi Max 33 cl x 24 st),(Svensk Pant 1 kr 24),0.003063,0.078301,0.003044,0.993766,12.691636,0.002804,147.840547
6,(Royal Crown Energy 25 cl x 24 st),(Svensk Pant 1 kr 24),0.003392,0.078301,0.003392,1.0,12.771257,0.003126,inf

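As a sanity check, the derived columns of a rule can be recomputed from its three support columns. The sketch below copies the rounded values of rule 0, (Norsk Pant 2-kr 24st) -> (Coca-Cola 24 st - Max 1 per order), from the table above; because the inputs are rounded to six decimals, the results only agree with the table to a few decimal places.

```python
# Values copied from rule 0 of the rules table (rounded as displayed).
antecedent_support = 0.008919   # support(Norsk Pant 2-kr 24st)
consequent_support = 0.019626   # support(Coca-Cola 24 st - Max 1 per order)
rule_support = 0.005676         # support of both items together

confidence = rule_support / antecedent_support
lift = confidence / consequent_support
leverage = rule_support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.2f} "
      f"leverage={leverage:.6f} conviction={conviction:.3f}")
```

The recomputed values should land close to the table's 0.636403, 32.426804, 0.005501 and 2.696318 for this rule.
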

### FP-Growth
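The first three FP-Growth steps (count single items, order them into an f-list, build the FP-tree in a second scan) can be sketched on toy data before calling mlxtend's `fpgrowth`. The `FPNode` class and `build_fp_tree` function are hypothetical names for this illustration, the data is made up, and the mining of conditional trees (step 4) is deliberately left out.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # Scan 1: count single items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}
    # F-list: frequent items in descending frequency (ties broken alphabetically).
    f_list = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: pos for pos, item in enumerate(f_list)}
    # Scan 2: insert each transaction's frequent items in f-list order,
    # sharing prefixes with earlier transactions.
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, parent=node)
            child.count += 1
            node = child
    return root, f_list

# Hypothetical toy transactions.
toy_orders = [["bread", "milk"], ["bread", "jam"], ["bread", "milk", "jam"], ["milk"]]
root, f_list = build_fp_tree(toy_orders, min_support_count=2)
print(f_list)
print({item: child.count for item, child in root.children.items()})
```

Transactions that share a prefix in f-list order reuse the same path, which is why the tree is a compact in-memory representation of the database.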

In [26]:
#without column names
fpgrowth(basket_sets, min_support=0.003)

Unnamed: 0,support,itemsets
0,0.008159,(3098)
1,0.003774,(2894)
2,0.025718,(2391)
3,0.007494,(2126)
4,0.019626,(1112)
...,...,...
172,0.005676,"(1112, 3878)"
173,0.003105,"(4198, 3878)"
174,0.003392,"(5342, 4623)"
175,0.003461,"(4945, 5342)"


In [27]:
#with column names
frequent_itemsets = fpgrowth(basket_sets, min_support=0.003, use_colnames=True)

In [28]:
# Generate association rules; a lift threshold of 0.1 filters out almost nothing
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.1)
rules.sort_values(by='confidence')

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
8,(Svensk Pant 1 kr 24),(Pepsi Max 33 cl x 24 st),0.078301,0.003063,0.003044,0.03888,12.691636,0.002804,1.037265
5,(Svensk Pant 1 kr 24),(Royal Crown Energy 25 cl x 24 st),0.078301,0.003392,0.003392,0.043319,12.771257,0.003126,1.041735
6,(Svensk Pant 1 kr 24),(SVENSK PANT 1 KR 12-pack),0.078301,0.016047,0.003461,0.044197,2.754287,0.002204,1.029452
7,(SVENSK PANT 1 KR 12-pack),(Svensk Pant 1 kr 24),0.016047,0.078301,0.003461,0.215663,2.754287,0.002204,1.175131
1,(Coca-Cola 24 st - Max 1 per order),(Norsk Pant 2-kr 24st),0.019626,0.008919,0.005676,0.289218,32.426804,0.005501,1.394352
3,(Pepsi Max - 24st - MAKS 1 PER ORDER),(Norsk Pant 2-kr 24st),0.009454,0.008919,0.003105,0.328485,36.829412,0.003021,1.475888
2,(Norsk Pant 2-kr 24st),(Pepsi Max - 24st - MAKS 1 PER ORDER),0.008919,0.009454,0.003105,0.34818,36.829412,0.003021,1.519662
0,(Norsk Pant 2-kr 24st),(Coca-Cola 24 st - Max 1 per order),0.008919,0.019626,0.005676,0.636403,32.426804,0.005501,2.696318
9,(Pepsi Max 33 cl x 24 st),(Svensk Pant 1 kr 24),0.003063,0.078301,0.003044,0.993766,12.691636,0.002804,147.840547
4,(Royal Crown Energy 25 cl x 24 st),(Svensk Pant 1 kr 24),0.003392,0.078301,0.003392,1.0,12.771257,0.003126,inf


## Observations

Antecedents are the left-hand side of a rule (A) and consequents are the right-hand side (B): if someone buys A, they often also buy B.

1. Apriori and FP-Growth produce the same results.
2. 10 rules are generated, but the minimum support used is low (0.003).