# **Association Rules**

### **Import required packages**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')



### **Data Preprocessing:**

In [None]:
# Read the dataset
data = pd.read_excel('Online retail.xlsx')
# Display the data
data

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
...,...
7495,"butter,light mayo,fresh bread"
7496,"burgers,frozen vegetables,eggs,french fries,ma..."
7497,chicken
7498,"escalope,green tea"


In [None]:
# Rename the single column to 'Products'
data.columns = ['Products']

In [None]:
data.head()

Unnamed: 0,Products
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt


In [None]:
data.tail()

Unnamed: 0,Products
7495,"butter,light mayo,fresh bread"
7496,"burgers,frozen vegetables,eggs,french fries,ma..."
7497,chicken
7498,"escalope,green tea"
7499,"eggs,frozen smoothie,yogurt cake,low fat yogurt"


In [None]:
data.shape

(7500, 1)

In [None]:
data.describe()

Unnamed: 0,Products
count,7500
unique,5175
top,cookies
freq,223


In [None]:
#Check for null values
data.isnull()

Unnamed: 0,Products
0,False
1,False
2,False
3,False
4,False
...,...
7495,False
7496,False
7497,False
7498,False


In [None]:
# Count the null values in each column
print(data.isnull().sum())

Products    0
dtype: int64


The dataset contains no null values.

In [None]:
# Drop duplicate rows; .copy() avoids a SettingWithCopyWarning when columns are added later
df = data.drop_duplicates().copy()

In [None]:
df.shape

(5175, 1)

In [None]:
# Add a CustomerID column so that each remaining row (basket) has a unique identifier
df['CustomerID'] = range(1, len(df) + 1)

In [None]:
cols = ['CustomerID'] + [col for col in df.columns if col != 'CustomerID']
df = df[cols]

In [None]:
df.head()

Unnamed: 0,CustomerID,Products
0,1,"burgers,meatballs,eggs"
1,2,chutney
2,3,"turkey,avocado"
3,4,"mineral water,milk,energy bar,whole wheat rice..."
4,5,low fat yogurt


### **Split the Products into Individual Items:**

In [None]:
# Split the 'Products' column into separate products
df = df.assign(Products=df['Products'].str.split(',')).explode('Products')

# Remove any leading/trailing whitespace from product names
df['Products'] = df['Products'].str.strip()

### **Create the Binary Matrix:**

In [None]:
# Pivot the dataset to create a binary matrix
basket = pd.crosstab(df['CustomerID'], df['Products'])

# Convert to binary (1 if item was purchased, 0 otherwise)
# A vectorized comparison replaces DataFrame.applymap, which is deprecated in recent pandas versions
basket = (basket > 0).astype(int)

# Display the binary matrix
basket.head()


CustomerID,almonds,antioxydant juice,asparagus,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,body spray,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
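
To make the split and pivot steps above concrete, here is a minimal pandas-free sketch of the same transformation on three made-up rows. The names `raw_rows`, `baskets`, `items`, and `matrix` are hypothetical and exist only for this illustration; the notebook itself uses `str.split`, `explode`, and `pd.crosstab`.

```python
# Hypothetical rows in the same comma-separated format as the 'Products' column.
raw_rows = ["turkey,avocado", "chutney", "avocado, milk ,turkey"]

# Step 1: split each row on commas and strip stray whitespace from every item.
baskets = [[item.strip() for item in row.split(",")] for row in raw_rows]

# Step 2: collect the distinct items, sorted so the column order is stable.
items = sorted({item for basket in baskets for item in basket})

# Step 3: build one row per basket, with 1 if the item is present and 0 otherwise.
matrix = [[1 if item in basket else 0 for item in items] for basket in baskets]

print(items)
for row in matrix:
    print(row)
```

Each inner list of `matrix` plays the role of one row of `basket`, and `items` plays the role of its column labels.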


In [None]:
# Number of baskets containing each item (higher counts mean more popular items)
count = basket.sum()
count

Products,0
almonds,151
antioxydant juice,57
asparagus,35
avocado,237
babies food,31
...,...
whole wheat pasta,210
whole wheat rice,403
yams,78
yogurt cake,170


### **Association Rule Mining:**

In [None]:
from mlxtend.frequent_patterns import apriori, association_rules

# Apply the Apriori algorithm to find frequent itemsets
frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)

# Generate the association rules
# num_itemsets should be the number of transactions in the input data, not the number of frequent itemsets
rules = association_rules(frequent_itemsets, num_itemsets=len(basket), metric="lift", min_threshold=1)

# Display the generated rules
print(rules.head())


       antecedents      consequents  antecedent support  consequent support  \
0        (almonds)  (mineral water)            0.029179            0.299710   
1  (mineral water)        (almonds)            0.299710            0.029179   
2        (avocado)      (chocolate)            0.045797            0.205217   
3      (chocolate)        (avocado)            0.205217            0.045797   
4        (avocado)   (french fries)            0.045797            0.192657   

    support  confidence      lift  representativity  leverage  conviction  \
0  0.010821    0.370861  1.237399               1.0  0.002076    1.113092   
1  0.010821    0.036106  1.237399               1.0  0.002076    1.007186   
2  0.010242    0.223629  1.089716               1.0  0.000843    1.023715   
3  0.010242    0.049906  1.089716               1.0  0.000843    1.004325   
4  0.011594    0.253165  1.314069               1.0  0.002771    1.081019   

   zhangs_metric   jaccard  certainty  kulczynski  
0       0.

In [None]:
# Sort the rules by lift (descending) to surface the strongest associations
rules = rules.sort_values(by='lift', ascending=False)

# Display the top rules
rules.head()


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
425,(whole wheat pasta),(olive oil),0.04058,0.087536,0.011014,0.271429,3.100757,1.0,0.007462,1.252401,0.706154,0.094059,0.201534,0.198628
424,(olive oil),(whole wheat pasta),0.087536,0.04058,0.011014,0.125828,3.100757,1.0,0.007462,1.097519,0.742493,0.094059,0.088854,0.198628
797,(soup),"(milk, mineral water)",0.070918,0.067826,0.012367,0.174387,2.571089,1.0,0.007557,1.129069,0.657703,0.097859,0.114314,0.178362
792,"(milk, mineral water)",(soup),0.067826,0.070918,0.012367,0.182336,2.571089,1.0,0.007557,1.136264,0.655521,0.097859,0.119923,0.178362
298,(herb & pepper),(ground beef),0.066473,0.135845,0.022802,0.343023,2.5251,1.0,0.013772,1.31535,0.646983,0.127018,0.239746,0.255438


### **Insights:**

1. **olive oil -> whole wheat pasta**  
 **Support: 0.011014**   
This rule applies to about 1.1% of the transactions in the dataset.  
 **Confidence: 0.125828**  
When "olive oil" is purchased, there's a 12.6% chance that "whole wheat pasta" will also be purchased.  
 **Lift: 3.100757**
This lift value suggests that customers who buy "olive oil" are about 3.1 times more likely to buy "whole wheat pasta" than customers in general.

2. **whole wheat pasta -> olive oil**  
  **Support: 0.011014**  
This rule applies to the same 1.1% of transactions as the previous rule.  
  **Confidence: 0.271429**  
When "whole wheat pasta" is purchased, there's a 27.1% chance that "olive oil" will also be purchased.  
  **Lift: 3.100757**  
The lift is the same as in the first rule, again showing a strong association.

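The second rule is the more useful of the two for prediction: 27.1% of baskets containing "whole wheat pasta" also contain "olive oil", compared with 12.6% in the opposite direction. However, support is low for both rules (about 1.1% of transactions), so they rest on relatively few baskets.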
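
As a sanity check, the confidence and lift values quoted above can be recomputed from the supports in the rules table. This is a minimal sketch that copies the rounded values by hand, so the results agree with the table only up to rounding; the variable names are chosen for this illustration.

```python
# Rounded values copied from the rules table (rows 425 and 424).
support_both = 0.011014    # support of {olive oil, whole wheat pasta}
support_pasta = 0.040580   # support of {whole wheat pasta}
support_oil = 0.087536     # support of {olive oil}

# Confidence depends on which item is the antecedent.
conf_pasta_to_oil = support_both / support_pasta
conf_oil_to_pasta = support_both / support_oil

# Lift divides confidence by the consequent's support, so both directions agree.
lift_pasta_to_oil = conf_pasta_to_oil / support_oil
lift_oil_to_pasta = conf_oil_to_pasta / support_pasta

print(f"confidence(pasta -> oil) = {conf_pasta_to_oil:.3f}")
print(f"confidence(oil -> pasta) = {conf_oil_to_pasta:.3f}")
print(f"lift(pasta -> oil) = {lift_pasta_to_oil:.2f}")
print(f"lift(oil -> pasta) = {lift_oil_to_pasta:.2f}")
```

The two confidences differ because they divide the same joint support by different antecedent supports, while the lift is identical in both directions.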

### **Interview Questions:**

**1. What is lift and why is it important in association rules?**

Lift measures how much more likely the consequent (the item on the right side of the rule) is to be purchased when the antecedent (the item on the left side of the rule) is purchased, compared with how likely it is to be purchased in general.

$$\text{Lift}(\text{Antecedent} \rightarrow \text{Consequent}) = \frac{\text{Confidence}(\text{Antecedent} \rightarrow \text{Consequent})}{\text{Support}(\text{Consequent})}$$
Lift > 1: The items are positively associated (buying one increases the chance of buying the other).     
Lift = 1: No association (the items are independent of each other).  
Lift < 1: Negative association (buying one decreases the chance of buying the other).  
**Importance:** Lift helps to identify meaningful and non-trivial relationships between items, filtering out rules that might be common but not necessarily interesting. This makes it a valuable metric for uncovering insights that can inform marketing strategies, such as product bundling or targeted promotions.
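
The lift formula can also be written in terms of supports alone, which explains why a rule and its reverse always share the same lift. A short derivation, writing $X$ for the antecedent and $Y$ for the consequent and using the standard definition of confidence as $\text{Support}(X \cup Y) / \text{Support}(X)$:

$$
\text{Lift}(X \rightarrow Y)
= \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)}
= \frac{\text{Support}(X \cup Y)}{\text{Support}(X)\,\text{Support}(Y)}
= \text{Lift}(Y \rightarrow X)
$$

The middle expression is symmetric in $X$ and $Y$, and it equals 1 exactly when $\text{Support}(X \cup Y) = \text{Support}(X)\,\text{Support}(Y)$, that is, when the two itemsets occur independently.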

**2. What are support and confidence, and how do you calculate them?**

Support is a measure of how frequently an itemset appears in the dataset. It represents the proportion of transactions in which a particular itemset occurs.

Calculation of support for an itemset $X$:

$$\text{Support}(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}$$

Confidence is a measure of the reliability of an association rule. It indicates the probability that the consequent is purchased given that the antecedent is purchased.

Calculation of confidence for a rule $X \rightarrow Y$:

$$\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

where $X \cup Y$ is the itemset containing both $X$ (the antecedent) and $Y$ (the consequent).
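
The support and confidence definitions above can be computed by hand on a tiny example. The sketch below uses five made-up transactions (not taken from the dataset), and the helper names `support` and `confidence` are hypothetical, written only for this illustration.

```python
# Hypothetical toy data: each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    combined = set(antecedent) | set(consequent)
    return support(combined, transactions) / support(antecedent, transactions)

support_bread = support({"bread"}, transactions)                    # 3 of 5 transactions
support_bread_milk = support({"bread", "milk"}, transactions)       # 2 of 5 transactions
conf_bread_to_milk = confidence({"bread"}, {"milk"}, transactions)  # (2/5) / (3/5)

print(support_bread, support_bread_milk, round(conf_bread_to_milk, 3))
```

Here "bread" appears in three of the five transactions and "bread" together with "milk" in two, so two out of every three bread baskets also contain milk.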





**3. What are some limitations or challenges of association rule mining?**

**Association rule mining has several limitations:**

1. **Too Many Rules:** Large datasets can produce an overwhelming number of rules, many of which may be trivial or irrelevant.
2. **Complex Interpretation:** Understanding and applying the generated rules can be challenging, especially with complex data.
3. **Rare Items Ignored:** Rules involving rare items might be missed due to low support.
4. **Threshold Selection:** Choosing the right support, confidence, and lift thresholds is crucial but difficult, affecting the quality of the results.
5. **Lack of Sequence/Timestamps:** The method doesn't account for the order or timing of purchases.
6. **Binary Data Limitation:** Often requires simplifying data into binary form, ignoring quantities or other nuances.
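
The threshold-selection and rare-item limitations can be seen on a small example. The following is a minimal sketch on ten made-up baskets in which "saffron" and "rice" are deliberately rare; the helper `frequent_items` is hypothetical and only checks single items, whereas Apriori applies the same cutoff to larger itemsets as well.

```python
from collections import Counter

# Hypothetical toy baskets; "saffron" and "rice" each appear only once.
transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk"},
    {"bread"},
    {"eggs", "milk"},
    {"saffron", "rice"},
]

# Count how many baskets contain each item.
item_counts = Counter(item for t in transactions for item in t)
n_transactions = len(transactions)

def frequent_items(min_support):
    """Items whose support meets the threshold, sorted by name."""
    return sorted(
        item for item, count in item_counts.items()
        if count / n_transactions >= min_support
    )

for threshold in (0.05, 0.2, 0.5):
    print(threshold, frequent_items(threshold))
```

Raising the threshold from 0.05 to 0.2 already drops "saffron" and "rice", so no rule involving them can be found, even though they were always bought together in this toy data.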