# Association Rules

Data preprocessing

In [1]:
# reading the excel file
import pandas as pd
file_path = 'Online retail.xlsx'
df = pd.read_excel(file_path, header=None)

In [2]:
df.head()

Unnamed: 0,0
0,"shrimp,almonds,avocado,vegetables mix,green gr..."
1,"burgers,meatballs,eggs"
2,chutney
3,"turkey,avocado"
4,"mineral water,milk,energy bar,whole wheat rice..."


In [3]:
df.isnull().sum()

0    0
dtype: int64

In [6]:
# converting the data into one hot encoded format
from mlxtend.preprocessing import TransactionEncoder
transactions = df[0].str.split(',').tolist()
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)

df_transformed = pd.DataFrame(te_ary, columns=te.columns_)


In [7]:
df_transformed.head(5)

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
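The separate `asparagus` and `asparagus.1` columns in the output above suggest that two item labels may differ only by stray whitespace. This is an assumption, since the raw file is not shown here. A minimal sketch of normalizing labels before encoding, using a hypothetical helper `clean_transactions` and made-up rows:

```python
def clean_transactions(raw_rows):
    """Split comma-separated rows into item lists, trimming whitespace.

    Hypothetical helper for illustration; not part of pandas or mlxtend.
    """
    cleaned = []
    for row in raw_rows:
        # Strip each label and drop empty strings left by trailing commas.
        items = [item.strip() for item in row.split(",") if item.strip()]
        # Remove duplicates within a basket while keeping first-seen order.
        cleaned.append(list(dict.fromkeys(items)))
    return cleaned


# Made-up rows: " asparagus" and "asparagus" should end up as one label.
rows = ["burgers, asparagus,eggs", "asparagus,eggs,", "turkey,avocado"]
transactions = clean_transactions(rows)
unique_items = sorted({item for basket in transactions for item in basket})
print(unique_items)
```

The resulting list of lists could be passed to `TransactionEncoder` in place of `df[0].str.split(',').tolist()`.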


Association Rule Mining

In [11]:
from mlxtend.frequent_patterns import apriori, association_rules

# Apriori algorithm: keep itemsets that appear in at least 5% of transactions
min_support = 0.05  # minimum support threshold
frequent_itemsets = apriori(df_transformed, min_support=min_support, use_colnames=True)

print(frequent_itemsets)


     support                    itemsets
0   0.087188                   (burgers)
1   0.081056                      (cake)
2   0.059992                   (chicken)
3   0.163845                 (chocolate)
4   0.080389                   (cookies)
5   0.051060               (cooking oil)
6   0.179709                      (eggs)
7   0.079323                  (escalope)
8   0.170911              (french fries)
9   0.063325           (frozen smoothie)
10  0.095321         (frozen vegetables)
11  0.052393             (grated cheese)
12  0.132116                 (green tea)
13  0.098254               (ground beef)
14  0.076523            (low fat yogurt)
15  0.129583                      (milk)
16  0.238368             (mineral water)
17  0.065858                 (olive oil)
18  0.095054                  (pancakes)
19  0.071457                    (shrimp)
20  0.050527                      (soup)
21  0.174110                 (spaghetti)
22  0.068391                  (tomatoes)
23  0.062525    
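To make the role of `min_support` concrete, the following is a minimal sketch of a level-wise frequent-itemset search in plain Python. It is not how mlxtend implements `apriori`. The function name `frequent_itemsets_bruteforce` and the toy baskets are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets_bruteforce(transactions, min_support, max_len=2):
    """Return {itemset: support} for itemsets meeting min_support.

    Illustrative level-wise search; real implementations are far
    more efficient.
    """
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    result = {}
    # Level 1 candidates: every distinct item.
    candidates = {frozenset([item]) for b in baskets for item in b}
    size = 1
    while candidates and size <= max_len:
        level = {}
        for cand in candidates:
            # Support = fraction of baskets containing every item in cand.
            support = sum(cand <= b for b in baskets) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Next level: join surviving itemsets into sets one item larger.
        size += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Apriori property: keep a candidate only if all of its
        # (size - 1)-item subsets were themselves frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return result


# Made-up baskets for illustration only.
toy = [
    ["mineral water", "eggs"],
    ["mineral water", "spaghetti"],
    ["mineral water", "eggs", "spaghetti"],
    ["chocolate"],
]
found = frequent_itemsets_bruteforce(toy, min_support=0.5)
for itemset, support in sorted(found.items(), key=lambda kv: -kv[1]):
    print(f"{support:.2f}  {sorted(itemset)}")
```

With a threshold of 0.5, `chocolate` (support 0.25) is dropped at the first level, so no pair containing it is ever counted.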

In [12]:
# extracting association rules
min_confidence = 0.1  #minimum confidence threshold
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
print(rules)


       antecedents      consequents  antecedent support  consequent support  \
0  (mineral water)      (chocolate)            0.238368            0.163845   
1      (chocolate)  (mineral water)            0.163845            0.238368   
2  (mineral water)           (eggs)            0.238368            0.179709   
3           (eggs)  (mineral water)            0.179709            0.238368   
4  (mineral water)      (spaghetti)            0.238368            0.174110   
5      (spaghetti)  (mineral water)            0.174110            0.238368   

    support  confidence      lift  leverage  conviction  zhangs_metric  
0  0.052660    0.220917  1.348332  0.013604    1.073256       0.339197  
1  0.052660    0.321400  1.348332  0.013604    1.122357       0.308965  
2  0.050927    0.213647  1.188845  0.008090    1.043158       0.208562  
3  0.050927    0.283383  1.188845  0.008090    1.062815       0.193648  
4  0.059725    0.250559  1.439085  0.018223    1.102008       0.400606  
5  0.059

In [14]:
# Filter rules by lift: lift >= 1 keeps pairs that co-occur at least as often as expected under independence
min_lift = 1.0  # minimum lift threshold
rules = rules[rules['lift'] >= min_lift]

print(rules)


       antecedents      consequents  antecedent support  consequent support  \
0  (mineral water)      (chocolate)            0.238368            0.163845   
1      (chocolate)  (mineral water)            0.163845            0.238368   
2  (mineral water)           (eggs)            0.238368            0.179709   
3           (eggs)  (mineral water)            0.179709            0.238368   
4  (mineral water)      (spaghetti)            0.238368            0.174110   
5      (spaghetti)  (mineral water)            0.174110            0.238368   

    support  confidence      lift  leverage  conviction  zhangs_metric  
0  0.052660    0.220917  1.348332  0.013604    1.073256       0.339197  
1  0.052660    0.321400  1.348332  0.013604    1.122357       0.308965  
2  0.050927    0.213647  1.188845  0.008090    1.043158       0.208562  
3  0.050927    0.283383  1.188845  0.008090    1.062815       0.193648  
4  0.059725    0.250559  1.439085  0.018223    1.102008       0.400606  
5  0.059

Analysis and Interpretation

1. Rule 1: mineral water => chocolate
    Confidence: 0.2209
    Lift: 1.35
    Customers who buy mineral water also buy chocolate 22.09% of the time. This is 1.35 times as often as would be expected if the two items were bought independently.

2. Rule 2: chocolate => mineral water
    Confidence: 0.3214
    Lift: 1.35
    Customers who buy chocolate also buy mineral water 32.14% of the time. This is 1.35 times as often as would be expected if the two items were bought independently. Lift is symmetric, so it matches Rule 1; only the confidence differs.

3. Rule 3: mineral water => eggs
    Confidence: 0.2136
    Lift: 1.19
    Customers who buy mineral water also buy eggs 21.36% of the time. This is 1.19 times as often as would be expected if the two items were bought independently.

4. Rule 4: eggs => mineral water
    Confidence: 0.2834
    Lift: 1.19
    Customers who buy eggs also buy mineral water 28.34% of the time. This is 1.19 times as often as would be expected if the two items were bought independently. It is the reverse direction of Rule 3 and has the same lift.

5. Rule 5: mineral water => spaghetti
    Confidence: 0.2506
    Lift: 1.44
    Customers who buy mineral water also buy spaghetti 25.06% of the time. This is 1.44 times as often as would be expected if the two items were bought independently.

6. Rule 6: spaghetti => mineral water
    Confidence: 0.3430
    Lift: 1.44
    Customers who buy spaghetti also buy mineral water 34.30% of the time. This is 1.44 times as often as would be expected if the two items were bought independently. It is the reverse direction of Rule 5 and has the same lift.
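The confidence and lift values quoted above can be cross-checked from the supports printed earlier. The sketch below recomputes them in plain Python. The helper name `rule_metrics` is an assumption, and because the inputs are rounded to six decimals the results agree with the mlxtend output only to about four decimal places.

```python
# Supports copied from the printed apriori / association_rules output.
item_support = {
    "mineral water": 0.238368,
    "chocolate": 0.163845,
    "eggs": 0.179709,
    "spaghetti": 0.174110,
}
pair_support = {
    ("mineral water", "chocolate"): 0.052660,
    ("mineral water", "eggs"): 0.050927,
    ("mineral water", "spaghetti"): 0.059725,
}


def rule_metrics(antecedent, consequent):
    """Return (confidence, lift) for the rule antecedent => consequent."""
    key = (antecedent, consequent)
    if key not in pair_support:
        key = (consequent, antecedent)  # pair support is order-independent
    joint = pair_support[key]
    confidence = joint / item_support[antecedent]
    lift = confidence / item_support[consequent]
    return confidence, lift


for a, b in pair_support:
    for ante, cons in ((a, b), (b, a)):
        conf, lift = rule_metrics(ante, cons)
        print(f"{ante} => {cons}: confidence={conf:.4f}, lift={lift:.2f}")
```

Both directions of a pair share the same lift because the formula is symmetric in antecedent and consequent, while confidence depends on which item is the antecedent.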

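Customer Purchasing Behavior Insights

Mineral Water as a Central Product:

Mineral water appears in all six rules, paired with chocolate, eggs, and spaghetti. It also has the highest support among the items listed (0.2384), so part of this centrality reflects its overall popularity.

Chocolate and Mineral Water:

Chocolate and mineral water are bought together in 5.27% of transactions, with a lift of 1.35.

Eggs and Mineral Water:

This is the weakest of the three associations (lift 1.19, versus 1.35 for chocolate and 1.44 for spaghetti), but the lift is still above 1.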

# Interview questions

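1.	What is lift and why is it important in association rules?

Lift measures how much more often the antecedent and consequent occur together than would be expected if they were independent.
lift(A -> B) = confidence(A -> B) / support(B) = support(A U B) / (support(A) * support(B))
A lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates that the items tend not to be bought together.
Importance:
Identifies strong associations: highlights item relationships that exceed what item popularity alone would produce.
Filters meaningful rules: goes beyond support and confidence, since a rule can have high confidence simply because its consequent is popular.
Guides marketing: informs promotions and product placements.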

2.	What are support and confidence? How do you calculate them?

Support: measures how often an itemset appears in the dataset.
support(A) = number of transactions containing A / total number of transactions

Confidence: measures how often the consequent appears in transactions that contain the antecedent.
confidence(A -> B) = support(A U B) / support(A)
Here support(A U B) is the share of transactions that contain every item of both A and B.
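The two formulas can be checked by hand on a small example. The sketch below uses plain Python; the function names `support` and `confidence` and the five baskets are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent U consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Five made-up baskets.
baskets = [
    ["mineral water", "eggs"],
    ["mineral water", "spaghetti"],
    ["mineral water"],
    ["eggs"],
    ["chocolate", "mineral water"],
]

print(f"support(mineral water)       = {support(baskets, ['mineral water']):.2f}")
print(f"support(mineral water, eggs) = {support(baskets, ['mineral water', 'eggs']):.2f}")
print(f"confidence(mineral water -> eggs) = {confidence(baskets, ['mineral water'], ['eggs']):.2f}")
print(f"confidence(eggs -> mineral water) = {confidence(baskets, ['eggs'], ['mineral water']):.2f}")
```

Mineral water appears in 4 of 5 baskets and together with eggs in 1 of 5, so the confidence of mineral water -> eggs is 1/4, while the reverse rule has confidence 1/2 because eggs appear in only 2 baskets.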

3.	What are some limitations or challenges of association rule mining?

High Dimensionality:

With a large number of items, the number of possible rules to consider becomes vast, making it difficult to identify relevant and actionable rules.

Need for Interpretation:

Association rules describe co-occurrence and do not imply causation. Interpreting the discovered rules requires domain knowledge and careful consideration of contextual factors.

Choosing Appropriate Metrics:

Selecting suitable metrics such as support, confidence, and lift is subjective and depends on the specific application and domain. Choosing inappropriate metrics can lead to the discovery of irrelevant or misleading rules.
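To illustrate the high-dimensionality point, the number of possible rules over d items (non-empty, disjoint antecedent and consequent) is 3^d - 2^(d+1) + 1. The sketch below checks this closed form against a direct count for small d. The function names are hypothetical.

```python
from itertools import combinations


def count_rules_bruteforce(d):
    """Count rules X -> Y with X and Y non-empty and disjoint, over d items."""
    total = 0
    for k in range(2, d + 1):
        for _itemset in combinations(range(d), k):
            # A k-item set splits into 2**k - 2 (antecedent, consequent) pairs:
            # every subset can be the antecedent except the empty and full set.
            total += 2 ** k - 2
    return total


def count_rules_formula(d):
    """Closed form for the same count."""
    return 3 ** d - 2 ** (d + 1) + 1


for d in (2, 3, 5, 10):
    print(d, count_rules_bruteforce(d), count_rules_formula(d))
```

Even at d = 10 there are already tens of thousands of candidate rules, which is why support and confidence thresholds are applied before rules are inspected.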