In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder

!pip install mlxtend==0.23.1


# Association Rule for Store Dataset

In this case study, we explore how association rules can be used to analyze items that are usually purchased together.

For background on the Apriori algorithm and association rules, refer to the mlxtend user guide:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a retail store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [2]:
# Load the dataset and show the first five transactions
url = "https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of products that have been purchased


In [8]:
# Flatten the dataset into a 1D array
df2 = np.ravel(df)

# Get the set of unique elements in the dataset
set2 = set(df2)

# Desired item order (np.nan stands for the empty cells of shorter transactions)
dataset = ['Bagel', 'Wine', 'Cheese', 'Milk', 'Diaper', 'Meat', 'Eggs', 'Bread', 'Pencil', np.nan]

# Arrange the elements in the desired order
items = [item for item in dataset if item in set2]

print(items)

['Bagel', 'Wine', 'Cheese', 'Milk', 'Diaper', 'Meat', 'Eggs', 'Bread', 'Pencil', nan]
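
The `nan` entry appears in the item list because shorter transactions are padded with empty cells. As a minimal sketch, assuming a small hand-built frame (`sample` is a hypothetical stand-in that mirrors the five rows shown by `df.head()`, not the full dataset), the distinct products can be collected without the missing values like this:

```python
import pandas as pd

# Hypothetical stand-in for df: the five transactions shown by df.head()
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"],
    ["Bread", "Cheese", "Meat", "Diaper", "Wine", "Milk", "Pencil"],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", None, None],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", None, None],
    ["Meat", "Pencil", "Wine", None, None, None, None],
])

# Flatten to one long Series, drop the padding cells, keep the distinct names
values = pd.Series(sample.to_numpy().ravel())
products = sorted(values.dropna().unique())
print(products)
```

Because only five rows are used, Bagel does not appear here; on the full dataset the same two lines would include it.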


## Preprocess Data

In this step, we transform the dataset into a one-hot encoding: one column per product, holding 1 if the transaction contains that product and 0 otherwise.

In [17]:
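itemset = set(items)
# Encode each transaction as a dict that maps every item to 1 (present) or 0 (absent)
encodedValue = []
for index, row in df.iterrows():
    rowset = set(row)
    labels = {}
    uncommons = list(itemset - rowset)            # items missing from this transaction
    commons = list(itemset.intersection(rowset))  # items present in this transaction
    for i in uncommons:
        labels[i] = 0
    for j in commons:
        labels[j] = 1
    encodedValue.append(labels)

# Show the encoding of the last transaction only
print(labels)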

{'Cheese': 0, 'Diaper': 0, 'Pencil': 0, 'Milk': 0, 'Meat': 1, 'Wine': 1, 'Bread': 1, 'Bagel': 1, 'Eggs': 1, nan: 1}


In [18]:
# Create a new dataframe from the encoded features
encodeddf = pd.DataFrame(encodedValue)
# Show the new dataframe
encodeddf.head()

Unnamed: 0,Bagel,NaN,Milk,Meat,Bread,Cheese,Diaper,Eggs,Wine,Pencil
0,0,0,0,1,1,1,1,1,1,1
1,0,0,1,1,1,1,1,0,1,1
2,0,1,1,1,0,1,0,1,1,0
3,0,1,1,1,0,1,0,1,1,0
4,0,1,0,1,0,0,0,0,1,1
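
The encoding above can also be built so that empty cells never become a column. The sketch below is one possible alternative, not the notebook's method; `sample` is again a hypothetical five-row stand-in for `df`, and `transactions`, `products`, and `onehot` are names introduced only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df: the five transactions shown by df.head()
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"],
    ["Bread", "Cheese", "Meat", "Diaper", "Wine", "Milk", "Pencil"],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", None, None],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", None, None],
    ["Meat", "Pencil", "Wine", None, None, None, None],
])

# One set of purchased products per transaction; dropna() removes the padding
transactions = [set(row.dropna()) for _, row in sample.iterrows()]
products = sorted(set().union(*transactions))

# True where the transaction contains the product, False otherwise
onehot = pd.DataFrame(
    [[p in t for p in products] for t in transactions],
    columns=products,
)
print(onehot.astype(int))
```

mlxtend's `apriori` accepts either 0/1 or True/False values, so a boolean frame like this can be passed to it directly.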


In [20]:
# The encoded dataframe contains a column for empty (NaN) cells, which we want to drop.
# Caution: columns[2] selects by position, and re-running this cell drops one more column each time.
encodeddf = encodeddf.drop(encodeddf.columns[2], axis=1)
encodeddf.head()

Unnamed: 0,Bagel,NaN,Bread,Cheese,Diaper,Eggs,Wine,Pencil
0,0,0,1,1,1,1,1,1
1,0,0,1,1,1,0,1,1
2,0,1,0,1,0,1,1,0
3,0,1,0,1,0,1,1,0
4,0,1,0,0,0,0,1,1


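Because the encoded dataframe contains a column for empty (NaN) cells, that column should be dropped, either by its label or by selecting every column except it. Note that `encodeddf.columns[2]` selects by position: in the output above the NaN column is still present while Milk and Meat are gone, so NaN continues to appear in the itemsets and rules that follow.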
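
A label-based drop avoids depending on column order. The following is a minimal sketch on a hypothetical two-row miniature of `encodeddf` (the frame `encoded` and the name `cleaned` are illustrative assumptions); it keeps every column whose label is not missing:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of encodeddf, including a NaN-labelled column
encoded = pd.DataFrame([
    {"Bagel": 0, np.nan: 0, "Milk": 0, "Meat": 1},
    {"Bagel": 0, np.nan: 1, "Milk": 1, "Meat": 1},
])

# Boolean mask over the column labels: True for real products, False for NaN
cleaned = encoded.loc[:, encoded.columns.notna()]
print(list(cleaned.columns))
```

Unlike a positional drop, this selection gives the same result no matter how many times the cell is run.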

## Apriori Algorithm

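We will use the Apriori algorithm to find the frequently purchased products and product combinations.
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.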

In [21]:
# Set the minimum support threshold used to select frequent itemsets
from mlxtend.frequent_patterns import apriori, association_rules
freqpurchase = apriori(encodeddf, min_support=0.2, use_colnames=True)
freqpurchase.head(33)




Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.869841,(nan)
2,0.504762,(Bread)
3,0.501587,(Cheese)
4,0.406349,(Diaper)
5,0.438095,(Eggs)
6,0.438095,(Wine)
7,0.361905,(Pencil)
8,0.336508,"(Bagel, nan)"
9,0.279365,"(Bread, Bagel)"
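
To make the `support` column concrete, the sketch below computes support directly from its definition on five hypothetical transactions (the rows shown by `df.head()`). It is a brute-force illustration, not how mlxtend implements Apriori, and the helper `support` and the threshold of 0.6 are assumptions chosen for this tiny sample.

```python
from itertools import combinations

# Hypothetical sample: the five transactions shown by df.head()
transactions = [
    {"Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"},
    {"Bread", "Cheese", "Meat", "Diaper", "Wine", "Milk", "Pencil"},
    {"Cheese", "Meat", "Eggs", "Milk", "Wine"},
    {"Cheese", "Meat", "Eggs", "Milk", "Wine"},
    {"Meat", "Pencil", "Wine"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Enumerate all pairs and keep those that meet the threshold
min_support = 0.6
items = sorted(set().union(*transactions))
frequent_pairs = {
    pair: support(pair, transactions)
    for pair in combinations(items, 2)
    if support(pair, transactions) >= min_support
}
print(frequent_pairs)
```

Apriori reaches the same kind of result without testing every combination, because an itemset can only be frequent if all of its subsets are frequent.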


Then we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6.

In [22]:
assRules = association_rules(freqpurchase, metric="confidence", min_threshold=0.6)
assRules.head(14)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Bagel),(nan),0.425397,0.869841,0.336508,0.791045,0.909413,-0.03352,0.622902,-0.147743
1,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265,0.402687
2,(Bread),(nan),0.504762,0.869841,0.396825,0.786164,0.903801,-0.042237,0.608683,-0.176903
3,(Cheese),(nan),0.501587,0.869841,0.393651,0.78481,0.902245,-0.042651,0.604855,-0.178565
4,(Diaper),(nan),0.406349,0.869841,0.31746,0.78125,0.898152,-0.035999,0.595011,-0.160381
5,(Eggs),(nan),0.438095,0.869841,0.336508,0.768116,0.883053,-0.044565,0.56131,-0.190735
6,(Wine),(nan),0.438095,0.869841,0.31746,0.724638,0.833069,-0.063613,0.472682,-0.262869
7,(Pencil),(nan),0.361905,0.869841,0.266667,0.736842,0.8471,-0.048133,0.494603,-0.220499
8,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203,0.469167
9,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754,0.330409
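
Many of the rules above have a lift below 1, so it can help to keep only the positively associated ones. The sketch below rebuilds four rows of the table by hand (the frame `rules` is a hypothetical copy, with the string "nan" standing in for the missing-value label) and filters and sorts them with plain pandas:

```python
import pandas as pd

# Four rows copied from the rules table above; "nan" is a stand-in label
rules = pd.DataFrame({
    "antecedents": [frozenset({"Bagel"}), frozenset({"Bagel"}),
                    frozenset({"Eggs"}), frozenset({"Wine"})],
    "consequents": [frozenset({"nan"}), frozenset({"Bread"}),
                    frozenset({"Cheese"}), frozenset({"Cheese"})],
    "confidence": [0.791045, 0.656716, 0.681159, 0.615942],
    "lift": [0.909413, 1.301042, 1.358008, 1.227986],
})

# Keep rules whose items co-occur more often than expected under independence
strong = rules[rules["lift"] > 1].sort_values("lift", ascending=False)
print(strong)
```

On the real `assRules` frame the same filter would apply unchanged, since it has the same `lift` column.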


Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__, and __conviction__, and an interpretation of the case above (please use a text section).


Below is a brief explanation of the terms used in association analysis:

1. Antecedent Support: The fraction of transactions that contain the antecedent (the first item or itemset in the association rule).

2. Consequent Support: The fraction of transactions that contain the consequent (the second item or itemset in the association rule).

3. Support: The fraction of transactions that contain the antecedent and the consequent together, i.e. how often the item combination appears in the dataset.

4. Confidence: The probability that the consequent appears in a transaction given that the antecedent is already in it, computed as support divided by antecedent support.

5. Lift: Measures how strong the relationship between the items is compared with what would be expected by chance, computed as confidence divided by consequent support. A value above 1 indicates a positive association, a value of 1 indicates independence, and a value below 1 indicates a negative association.

6. Leverage: Measures the difference between how often the items appear together and how often they would appear together if they were independent, computed as support minus the product of antecedent support and consequent support.

7. Conviction: Measures the strength of the association rule, computed as (1 - consequent support) / (1 - confidence). The higher the value, the more strongly the consequent depends on the antecedent.
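
The definitions above can be checked against the rules table. The sketch below recomputes the metrics for the rule (Bagel) -> (Bread) from its three support values; the function name `rule_metrics` is an assumption for illustration, and the formula used for the `zhangs_metric` column is one that reproduces the values in the table rather than something stated in this notebook.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Recompute rule metrics from the three support values of a rule."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is undefined when confidence is exactly 1; treated as infinite here
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    # Zhang's metric: leverage scaled into the range [-1, 1]
    denominator = max(
        support * (1 - antecedent_support),
        antecedent_support * (consequent_support - support),
    )
    zhang = leverage / denominator if denominator else 0.0
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }

# Values taken from the (Bagel) -> (Bread) row of the table
m = rule_metrics(0.425397, 0.504762, 0.279365)
for name, value in m.items():
    print(f"{name}: {value:.6f}")
```

The printed values agree with the table row to within rounding of the inputs.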

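Interpretation for this case:
For products such as Milk, Bread, and so on, these measures help us understand the relationships between products that are often bought together, for example how likely Milk is to be bought when Bread has already been bought, or how strong the association between those products is. In the table above, the rule (Bagel) -> (Bread) has support 0.279, confidence 0.657, and lift 1.30: about 28% of all transactions contain both items, about 66% of the transactions with Bagel also contain Bread, and that is roughly 1.3 times the rate expected if the two were independent. The rules whose consequent is (nan) all have a lift below 1 and reflect the empty-cell column rather than a real product.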