<a href="https://colab.research.google.com/github/IvanIndargo/Datamining_exersice/blob/main/%5BQuestion%5D_Exercise_Week11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder

!pip install mlxtend==0.23.1



# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [2]:
# Load the dataset and show the first five transactions
url = 'https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv'
df = pd.read_csv(url)

# Display the first 5 rows of the dataset
df.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of products that have been purchased


In [3]:
products = set()
for col in df.columns:
    products.update(df[col].unique())
products

{'Bagel',
 'Bread',
 'Cheese',
 'Diaper',
 'Eggs',
 'Meat',
 'Milk',
 'Pencil',
 'Wine',
 nan}
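The `nan` entry appears because shorter transactions are padded with missing values. A minimal sketch of collecting the products while skipping that padding, using a hypothetical three-row sample rather than the real dataset (the names `sample` and `sample_products` are illustrative):

```python
import pandas as pd

# Hypothetical sample shaped like the retail dataset; None marks padding
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", None],
    ["Meat", None, None],
])

# Flatten all cells and keep only the ones that are not missing
sample_products = {value for value in sample.to_numpy().ravel() if pd.notna(value)}
print(sorted(sample_products))
```

Filtering at this stage is one option; the notebook instead keeps the missing values and drops the resulting column after encoding.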

## Preprocess Data

In this step, we transform the dataset into a one-hot encoding: one column per product, with a 1 where the transaction contains that product and a 0 otherwise.

In [5]:
# Create the item set from the products (recomputed so this cell runs on its own)
products = set()
for col in df.columns:
    products.update(df[col].unique())

encoded_transactions = []
for _, row in df.iterrows():
    transaction_dict = {product: (1 if product in row.values else 0) for product in products}
    encoded_transactions.append(transaction_dict)

# Display the result for the first transaction
encoded_transactions[0]

{nan: 0,
 'Eggs': 1,
 'Bread': 1,
 'Diaper': 1,
 'Milk': 0,
 'Meat': 1,
 'Wine': 1,
 'Pencil': 1,
 'Cheese': 1,
 'Bagel': 0}
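The same encoding can also be built as a table in one pass. The sketch below is an illustration on a hypothetical three-row sample, not the notebook's method: it turns each row into a set of purchased products and then tests membership per product. The names `sample`, `baskets`, and `one_hot` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample shaped like the retail dataset; None marks padding
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", None],
    ["Meat", None, None],
])

# Turn each row into the set of products actually purchased
baskets = [{item for item in row if pd.notna(item)} for row in sample.to_numpy()]

# One column per product, True where the basket contains it
items = sorted(set().union(*baskets))
one_hot = pd.DataFrame(
    [[item in basket for item in items] for basket in baskets],
    columns=items,
)
print(one_hot)
```

Because missing values are filtered out while building the baskets, no placeholder column is created. The result is a boolean table, which mlxtend's `apriori` accepts alongside 0/1 values.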

In [7]:
# Replace missing values with the placeholder string 'NaN'
df_with_nan = df.fillna('NaN')

# Flatten the data, keeping the 'NaN' placeholder as a category
flattened_data = df_with_nan.values.flatten()

# Step 1: Fit a OneHotEncoder to collect the item categories, including the 'NaN' placeholder
# sparse_output=False returns a dense array (this parameter replaced 'sparse' in newer scikit-learn)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # Unknown values are ignored at transform time
encoded_data = encoder.fit_transform(flattened_data.reshape(-1, 1))

# Step 2: Create the one-hot encoded DataFrame
encoded_df = pd.DataFrame(encoded_data, columns=encoder.categories_[0])

# Step 3: Create an all-zero DataFrame that will mark whether each item was bought
product_data = pd.DataFrame(0, index=df.index, columns=encoder.categories_[0])

# Loop through each row and mark the products it contains, including the 'NaN' placeholder
for i, row in df_with_nan.iterrows():
    for product in row:
        product_data.loc[i, product] = 1

# Display the final DataFrame
product_data.head()

Unnamed: 0,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,NaN,Pencil,Wine
0,0,1,1,1,1,1,0,0,1,1
1,0,1,1,1,0,1,1,0,1,1
2,0,0,1,0,1,1,1,1,0,1
3,0,0,1,0,1,1,1,1,0,1
4,0,0,0,0,0,1,0,1,1,1


In [8]:
# The encoded DataFrame contains a column for empty entries, so drop the 'NaN' column (selecting columns by index would also work).
if 'NaN' in product_data.columns:
    product_data.drop(columns=['NaN'], inplace=True)

product_data.head()

Unnamed: 0,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Pencil,Wine
0,0,1,1,1,1,1,0,1,1
1,0,1,1,1,0,1,1,1,1
2,0,0,1,0,1,1,1,0,1
3,0,0,1,0,1,1,1,0,1
4,0,0,0,0,0,1,0,1,1


Because the encoded DataFrame contains a column for empty entries, we drop the NaN column; selecting every column except that one would work as well.

## Apriori Algorithm

We will use the Apriori algorithm to determine the frequently purchased products.
For this case study, we will use min_support=0.2.

In [16]:
# Set the threshold value used in the support calculation
from mlxtend.frequent_patterns import apriori, association_rules
min_support = 0.2
frequent_itemsets = apriori(product_data, min_support=min_support, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.504762,(Bread)
2,0.501587,(Cheese)
3,0.406349,(Diaper)
4,0.438095,(Eggs)
5,0.47619,(Meat)
6,0.501587,(Milk)
7,0.361905,(Pencil)
8,0.438095,(Wine)
9,0.279365,"(Bagel, Bread)"
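To make the support threshold concrete, here is a rough sketch that computes support by hand on five hypothetical transactions. It enumerates every single item and pair by brute force, so it is not the pruned search that Apriori performs; the transactions, the `support` helper, and the 0.4 threshold are all assumptions chosen for this small example.

```python
from itertools import combinations

# Hypothetical transactions, not taken from the retail dataset
transactions = [
    {"Bread", "Cheese", "Meat"},
    {"Bread", "Cheese"},
    {"Cheese", "Meat"},
    {"Bread", "Milk"},
    {"Cheese", "Meat", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

min_support = 0.4
items = sorted(set().union(*transactions))

# Brute force: check every itemset of size 1 and 2 against the threshold
frequent = {}
for size in (1, 2):
    for combo in combinations(items, size):
        value = support(set(combo), transactions)
        if value >= min_support:
            frequent[combo] = value

for itemset, value in frequent.items():
    print(itemset, value)
```

Apriori reaches the same kind of result more efficiently by only extending itemsets that are already frequent, since any superset of an infrequent itemset must also be infrequent.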


Then we will generate association rules from the frequent itemsets based on confidence, with a threshold of 0.6.

In [17]:
confidence_threshold = 0.6
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=confidence_threshold)
rules.drop(columns=['zhangs_metric'], inplace=True)
rules



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
1,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
2,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
3,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
4,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
5,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
6,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754
7,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624
8,"(Cheese, Eggs)",(Meat),0.298413,0.47619,0.215873,0.723404,1.519149,0.073772,1.893773
9,"(Cheese, Meat)",(Eggs),0.32381,0.438095,0.215873,0.666667,1.521739,0.074014,1.685714
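A rules table is often easier to read once it is ranked. The sketch below copies three rows from the table above into a small DataFrame and sorts them by lift; the itemsets are written as plain tuples here for brevity, and the 0.65 confidence cutoff is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Three rows copied from the rules table above (rules 0, 8 and 9)
sample_rules = pd.DataFrame({
    "antecedents": [("Bagel",), ("Cheese", "Eggs"), ("Cheese", "Meat")],
    "consequents": [("Bread",), ("Meat",), ("Eggs",)],
    "confidence": [0.656716, 0.723404, 0.666667],
    "lift": [1.301042, 1.519149, 1.521739],
})

# Rank by lift, then keep rules at or above the illustrative confidence cutoff
ranked = sample_rules.sort_values("lift", ascending=False)
strong = ranked[ranked["confidence"] >= 0.65]
print(strong)
```

The same `sort_values` call should work on the full `rules` DataFrame, since it only relies on the `lift` and `confidence` columns.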


Explain __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__, and __conviction__, and interpret the results from the case above (please use a text section)

#Support = measures how frequently a rule occurs in the dataset, that is, how often its items appear together. It is the proportion of transactions that contain both the antecedent and the consequent, which helps surface common patterns. A higher support means the rule applies to a larger share of the dataset, making it potentially more valuable for analysis.

#confidence = measures how likely the consequent (B) is to appear when the antecedent (A) is present, that is, the proportion of transactions containing A that also contain B. Confidence indicates how reliable the rule is and helps assess its predictive strength. A rule with higher confidence held more consistently in the data, which makes it useful for applications such as product recommendation.


#lift = evaluates the strength of an association by comparing the rule's confidence with the confidence expected if the items were independent. A lift greater than 1 indicates a positive correlation, meaning that the occurrence of A increases the likelihood of B. Lift helps determine whether the relationship between items is meaningful.

#leverage = measures the difference between the observed frequency of the rule and the frequency expected if the antecedent and the consequent were independent. It helps identify rules that occur more often than chance would suggest; a value of 0 indicates independence. Leverage is useful for finding rules whose items co-occur noticeably more often than expected.

#conviction = compares how often A would be expected to appear without B if the two were independent with how often the rule actually fails (A appears without B). It therefore accounts for the cases where the rule does not hold, giving another view of its reliability. A larger conviction indicates a stronger rule; a value of 1 indicates independence.

#antecedent support = the proportion of transactions in the dataset that contain the antecedent, the item or itemset on the left-hand side of the association rule.

#consequent support = the proportion of transactions in the dataset that contain the consequent, the item or itemset on the right-hand side of the association rule.
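As a rough check on these definitions, the sketch below recomputes the metrics for rule 0 in the rules table, (Bagel) → (Bread), from its three support values. The variable names are illustrative, and the inputs are the rounded figures shown in the table, so the results match the table only up to rounding.

```python
# Support values copied from rule 0 of the rules table: (Bagel) -> (Bread)
antecedent_support = 0.425397  # share of transactions containing Bagel
consequent_support = 0.504762  # share of transactions containing Bread
support = 0.279365             # share of transactions containing both

# Confidence: how often Bread appears among transactions that contain Bagel
confidence = support / antecedent_support

# Lift: confidence relative to the baseline frequency of Bread
lift = confidence / consequent_support

# Leverage: observed co-occurrence minus the co-occurrence expected under independence
leverage = support - antecedent_support * consequent_support

# Conviction: expected rate of "Bagel without Bread" under independence,
# divided by the observed rate at which the rule fails
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.4f}")
```

Read this way, roughly 66% of the transactions that contain a bagel also contain bread, which is about 1.3 times the rate expected if the two products were unrelated.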

#Sources:
#https://herovired.com/learning-hub/topics/association-rules-in-data-mining/
#https://jurnal.amikom.ac.id/index.php/infos/article/download/561/235