## ASSOCIATION RULES

The objective of this assignment is to introduce students to association rule mining, with a focus on market basket analysis, and to provide hands-on experience.

#### Dataset:
Apply association rule mining to the Online retail dataset.

#### Data Preprocessing:
Pre-process the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.

#### Association Rule Mining:
•	Implement the Apriori algorithm using a tool such as Python, with libraries such as pandas and mlxtend.

•	Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.

•	Set appropriate thresholds for support, confidence, and lift to extract meaningful rules.

#### Analysis and Interpretation:
•	Analyse the generated rules to identify interesting patterns and relationships between the products.

•	Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.


In [3]:
!pip install mlxtend



In [4]:
# Load Libraries
import numpy as np
import pandas as pd

In [5]:
# Load dataset
retail = pd.read_excel("Online retail.xlsx", header = None)
retail



Unnamed: 0,0
0,"shrimp,almonds,avocado,vegetables mix,green gr..."
1,"burgers,meatballs,eggs"
2,chutney
3,"turkey,avocado"
4,"mineral water,milk,energy bar,whole wheat rice..."
...,...
7496,"butter,light mayo,fresh bread"
7497,"burgers,frozen vegetables,eggs,french fries,ma..."
7498,chicken
7499,"escalope,green tea"


#### Data Pre-Processing

In [6]:
# Convert each transaction to a list of items
transactions = retail[0].apply(lambda x: x.split(','))
transactions

0       [shrimp, almonds, avocado, vegetables mix, gre...
1                              [burgers, meatballs, eggs]
2                                               [chutney]
3                                       [turkey, avocado]
4       [mineral water, milk, energy bar, whole wheat ...
                              ...                        
7496                    [butter, light mayo, fresh bread]
7497    [burgers, frozen vegetables, eggs, french frie...
7498                                            [chicken]
7499                                [escalope, green tea]
7500    [eggs, frozen smoothie, yogurt cake, low fat y...
Name: 0, Length: 7501, dtype: object

In [7]:
# One-hot encode the transaction data
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

In [8]:
df_encoded

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7497,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7498,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7499,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
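The encoded columns above list `asparagus` twice (the second as `asparagus.1`), which suggests that two labels differ only by stray whitespace. A minimal sketch of normalising labels before encoding follows; `parse_transaction` and the sample strings are hypothetical and not part of the original notebook.

```python
def parse_transaction(raw):
    """Split a comma-separated basket and strip whitespace from each label."""
    items = (item.strip() for item in raw.split(","))
    # dict.fromkeys drops repeated labels while keeping first-seen order
    return list(dict.fromkeys(item for item in items if item))


# Hypothetical rows; the second one carries a leading space on purpose
sample = ["shrimp,almonds,avocado", " asparagus,eggs", "asparagus,asparagus ,"]
cleaned = [parse_transaction(row) for row in sample]
print(cleaned)
```

In the notebook this could stand in for the lambda, for example `retail[0].apply(parse_transaction)`. Doing so would likely change the column count and the figures reported further down.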


#### Exploratory Data Analysis

In [9]:
df_encoded.shape

(7501, 120)

In [10]:
# Check for duplicate rows (identical baskets)
duplicates = df_encoded[df_encoded.duplicated()]
print("Duplicated Rows: ")
duplicates

Duplicated Rows: 


Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
34,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
42,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
60,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
64,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
65,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7491,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7492,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7495,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7498,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [11]:
# Remove duplicate rows
df_encoded = df_encoded.drop_duplicates()

# Confirm that no duplicates remain
duplicates = df_encoded[df_encoded.duplicated()]
duplicates

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini


In [12]:
df_encoded.shape

(5154, 120)
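The row count drops from 7501 to 5154 because identical baskets are removed. Support is a proportion of transactions, so this step shifts every support value computed afterwards. The toy frame below uses hypothetical data to illustrate the size of that effect.

```python
import pandas as pd

# Hypothetical baskets: the first three rows are identical purchases
toy = pd.DataFrame(
    {"eggs": [True, True, True, False], "milk": [False, False, False, True]}
)

support_all = toy.mean()                      # support using every transaction
support_dedup = toy.drop_duplicates().mean()  # support after dropping repeats

print(support_all["eggs"], support_dedup["eggs"])
```

Whether repeated baskets count as separate purchases is a modelling decision; the notebook follows the assignment brief and removes them.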

In [13]:
# Check for rows with all null values
all_null_rows = df_encoded.isnull().all(axis=1)
if all_null_rows.any():
    print("DataFrame contains rows with all null values.")
else:
    print("DataFrame does not contain rows with all null values.")

DataFrame does not contain rows with all null values.
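Because `df_encoded` holds only booleans, a null check on it cannot find anything. Two checks that may be more informative are missing values in the raw column and encoded rows with no items at all. The sketch below uses hypothetical stand-ins for `retail[0]` and `df_encoded`.

```python
import pandas as pd

# Hypothetical stand-ins for retail[0] and df_encoded
raw = pd.Series(["eggs,milk", None, "chicken"])
encoded = pd.DataFrame({"eggs": [True, False], "milk": [True, False]})

missing_raw = int(raw.isna().sum())             # raw baskets with no value
empty_rows = int((~encoded.any(axis=1)).sum())  # encoded rows with no True flag

print(missing_raw, empty_rows)
```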


### APRIORI ALGORITHM

In [71]:
from mlxtend.frequent_patterns import apriori

# Generate frequent itemsets with a minimum support threshold
frequent_itemsets = apriori(df_encoded, 
                            min_support=0.03, 
                            use_colnames=True, 
                            max_len=4)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.046178,(avocado)
1,0.045208,(brownies)
2,0.114280,(burgers)
3,0.041327,(butter)
4,0.103609,(cake)
...,...,...
89,0.034730,"(mineral water, tomatoes)"
90,0.032596,"(olive oil, spaghetti)"
91,0.036282,"(pancakes, spaghetti)"
92,0.030462,"(shrimp, spaghetti)"
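To illustrate what `apriori` computes, here is a minimal level-wise sketch in plain Python. It is not mlxtend's implementation: `apriori_sketch` and the toy baskets are hypothetical, and the function favours readability over speed.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support, max_len=2):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)

    def keep_frequent(candidates):
        # Support = share of baskets that contain every item of the candidate
        supports = {c: sum(c <= b for b in baskets) / n for c in candidates}
        return {c: s for c, s in supports.items() if s >= min_support}

    # Level 1: single items
    current = keep_frequent({frozenset([i]) for b in baskets for i in b})
    frequent = dict(current)

    k = 2
    while current and k <= max_len:
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items
        joined = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune: a candidate survives only if all its (k-1)-subsets are frequent
        candidates = {
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical baskets
toy = [
    ["ground beef", "spaghetti", "olive oil"],
    ["ground beef", "spaghetti"],
    ["spaghetti", "olive oil"],
    ["mineral water"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, support in result.items():
    print(sorted(itemset), support)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if every subset of it is frequent, so infrequent items never seed larger candidates.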


In [72]:
# Generate association rules, keeping those with a lift of at least 1.2
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(eggs),(burgers),0.207994,0.11428,0.036282,0.17444,1.526427,0.012513,1.072872,0.435445
1,(burgers),(eggs),0.11428,0.207994,0.036282,0.317487,1.526427,0.012513,1.160427,0.389373
2,(mineral water),(cake),0.299961,0.103609,0.037641,0.125485,1.211143,0.006562,1.025015,0.249034
3,(cake),(mineral water),0.103609,0.299961,0.037641,0.363296,1.211143,0.006562,1.099473,0.194484
4,(mineral water),(chicken),0.299961,0.084012,0.032596,0.108668,1.29347,0.007396,1.027661,0.324105
5,(chicken),(mineral water),0.084012,0.299961,0.032596,0.387991,1.29347,0.007396,1.143837,0.247695
6,(frozen vegetables),(chocolate),0.130384,0.203725,0.033178,0.254464,1.249056,0.006616,1.068057,0.229291
7,(chocolate),(frozen vegetables),0.203725,0.130384,0.033178,0.162857,1.249056,0.006616,1.03879,0.25041
8,(ground beef),(chocolate),0.136399,0.203725,0.033372,0.244666,1.200959,0.005584,1.054202,0.193761
9,(chocolate),(ground beef),0.203725,0.136399,0.033372,0.16381,1.200959,0.005584,1.03278,0.210144


In [73]:
# Keep rules that meet the minimum support, confidence, and lift thresholds
filtered_rules = rules[(rules['support'] >= 0.03) & 
                       (rules['confidence'] >= 0.3) & 
                       (rules['lift'] >= 1.2)]
filtered_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
1,(burgers),(eggs),0.11428,0.207994,0.036282,0.317487,1.526427,0.012513,1.160427,0.389373
3,(cake),(mineral water),0.103609,0.299961,0.037641,0.363296,1.211143,0.006562,1.099473,0.194484
5,(chicken),(mineral water),0.084012,0.299961,0.032596,0.387991,1.29347,0.007396,1.143837,0.247695
16,(frozen vegetables),(mineral water),0.130384,0.299961,0.05064,0.388393,1.29481,0.01153,1.144589,0.261824
18,(frozen vegetables),(spaghetti),0.130384,0.230113,0.039193,0.300595,1.306297,0.00919,1.100775,0.269633
23,(ground beef),(mineral water),0.136399,0.299961,0.058983,0.432432,1.441628,0.018069,1.233402,0.354724
24,(ground beef),(spaghetti),0.136399,0.230113,0.056073,0.411095,1.786497,0.024686,1.307321,0.509779
27,(milk),(mineral water),0.170353,0.299961,0.067908,0.398633,1.328949,0.016809,1.16408,0.298351
30,(olive oil),(mineral water),0.088087,0.299961,0.038805,0.440529,1.468619,0.012382,1.25125,0.349911
33,(pancakes),(mineral water),0.12534,0.299961,0.048894,0.390093,1.300478,0.011297,1.147779,0.264162


In [74]:
# Keep only the columns needed for interpretation
result = filtered_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].copy()
result

Unnamed: 0,antecedents,consequents,support,confidence,lift
1,(burgers),(eggs),0.036282,0.317487,1.526427
3,(cake),(mineral water),0.037641,0.363296,1.211143
5,(chicken),(mineral water),0.032596,0.387991,1.29347
16,(frozen vegetables),(mineral water),0.05064,0.388393,1.29481
18,(frozen vegetables),(spaghetti),0.039193,0.300595,1.306297
23,(ground beef),(mineral water),0.058983,0.432432,1.441628
24,(ground beef),(spaghetti),0.056073,0.411095,1.786497
27,(milk),(mineral water),0.067908,0.398633,1.328949
30,(olive oil),(mineral water),0.038805,0.440529,1.468619
33,(pancakes),(mineral water),0.048894,0.390093,1.300478


In [75]:
# Sort the rules by lift, strongest association first
top_rules = result.sort_values(by='lift', ascending=False)
top_rules

Unnamed: 0,antecedents,consequents,support,confidence,lift
24,(ground beef),(spaghetti),0.056073,0.411095,1.786497
40,(olive oil),(spaghetti),0.032596,0.370044,1.6081
35,(soup),(mineral water),0.033566,0.47139,1.571502
1,(burgers),(eggs),0.036282,0.317487,1.526427
30,(olive oil),(mineral water),0.038805,0.440529,1.468619
23,(ground beef),(mineral water),0.058983,0.432432,1.441628
46,(tomatoes),(spaghetti),0.030074,0.32563,1.415091
44,(shrimp),(spaghetti),0.030462,0.306641,1.332568
27,(milk),(mineral water),0.067908,0.398633,1.328949
18,(frozen vegetables),(spaghetti),0.039193,0.300595,1.306297


## Analysis & Interpretation

**Support:** Proportion of transactions that contain the itemset.
This indicates how popular an itemset is.
Higher support indicates more frequent occurrence.

**Confidence:** Probability that if a customer buys [antecedents], they also buy [consequent].
This measures the reliability of the rule.
Higher confidence shows stronger association.

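**Lift:** Measures how much more likely [consequent] is purchased when [antecedent] is purchased, compared with what would be expected if the two were independent.
Lift > 1: Positive association between [antecedents] and [consequent]. Lift = 1: No association. Lift < 1: Negative association.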
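As a worked check of these definitions, the metrics for one rule can be recomputed from its three support values. The numbers below are copied from the (ground beef) → (spaghetti) row of the rules table; the leverage formula is the one mlxtend documents, support of both minus the product of the individual supports.

```python
# Values copied from the rule (ground beef) -> (spaghetti) in the rules table
support_a = 0.136399    # antecedent support
support_b = 0.230113    # consequent support
support_ab = 0.056073   # support of both items together

confidence = support_ab / support_a            # P(B | A)
lift = confidence / support_b                  # P(B | A) / P(B)
leverage = support_ab - support_a * support_b  # observed minus expected support

print(round(confidence, 4), round(lift, 4), round(leverage, 4))
```

The results agree with the table's `confidence`, `lift`, and `leverage` columns up to rounding of the inputs.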

### Insights into Customer Purchasing Behavior
**🛒 Product Pairing Trends**

*Spaghetti* frequently appears as a consequent, especially with:

*Ground beef (Lift: 1.79), Olive oil (Lift: 1.61), Tomatoes (Lift: 1.42)*.

This combination suggests customers are buying ingredients for pasta dishes.

**🥗 Meal and Dietary Patterns**

*Frozen vegetables* and *milk* appear alongside *mineral water*, while *shrimp* pairs with *spaghetti*. These baskets may reflect:

•	Lighter, health-conscious meals.

•	Bulk shopping of daily-use items.

**💧 Mineral Water is a Common Consequent**

Mineral water appears as the consequent for *soup, milk, cake,* and *pancakes*, which reflects how often it is bought overall (its consequent support is about 0.30).

***Lift values > 1.2 suggest positive association, meaning mineral water is part of many basket types (meals, snacks, or breakfast).***

**🍰 Occasional and Treat-Based Purchases**

*Cake* and *pancakes* appear with *mineral water* — ***potentially indicating party, breakfast, or casual shopping habits.***
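The claims about spaghetti and mineral water can be checked by counting how often each item appears as a consequent. The sketch below copies the ten rows displayed in the sorted rules table into plain tuples; in the notebook the `consequents` column holds frozensets, so this flat layout is an assumption made for illustration.

```python
from collections import Counter

# (antecedent, consequent, lift) copied from the sorted rules table
top_rules_rows = [
    ("ground beef", "spaghetti", 1.786497),
    ("olive oil", "spaghetti", 1.6081),
    ("soup", "mineral water", 1.571502),
    ("burgers", "eggs", 1.526427),
    ("olive oil", "mineral water", 1.468619),
    ("ground beef", "mineral water", 1.441628),
    ("tomatoes", "spaghetti", 1.415091),
    ("shrimp", "spaghetti", 1.332568),
    ("milk", "mineral water", 1.328949),
    ("frozen vegetables", "spaghetti", 1.306297),
]

# Count how many of the displayed rules end in each consequent
consequent_counts = Counter(consequent for _, consequent, _ in top_rules_rows)
print(consequent_counts.most_common())
```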

**1. Product Bundling Opportunities:**

Retailers could create combo offers like:

*"Buy ground beef + spaghetti + olive oil" = pasta night pack.*

*"Frozen vegetables + mineral water" = healthy living pack.*

**2. Targeted Promotions:**

*Suggest spaghetti to customers buying olive oil or tomatoes using lift-based recommendations.*

**3. Shelf Placement Strategies:**

*Place associated items (e.g., ground beef and spaghetti) closer together to encourage cross-selling.*
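A lift-based recommendation of the kind described in point 2 might look like the sketch below. `RULES` holds a few rows copied from the sorted rules table, and `recommend` is a hypothetical helper, not something defined in the notebook.

```python
# (antecedent, consequent, lift) for a few rules from the sorted table
RULES = [
    ("ground beef", "spaghetti", 1.786497),
    ("olive oil", "spaghetti", 1.6081),
    ("olive oil", "mineral water", 1.468619),
    ("tomatoes", "spaghetti", 1.415091),
]


def recommend(basket, rules=RULES, top_n=1):
    """Suggest items not yet in the basket, ranked by the best matching lift."""
    best = {}
    for antecedent, consequent, lift in rules:
        if antecedent in basket and consequent not in basket:
            best[consequent] = max(lift, best.get(consequent, 0.0))
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:top_n]


olive_oil_suggestions = recommend({"olive oil"}, top_n=2)
already_has_pasta = recommend({"tomatoes", "spaghetti"})
print(olive_oil_suggestions, already_has_pasta)
```

Items already in the basket are skipped, so a customer holding both tomatoes and spaghetti receives no suggestion from these four rules.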

## Interview Questions

**1.	What is lift and why is it important in Association rules?**

Lift measures the strength of an association rule relative to what would be expected if the antecedent and consequent were independent: Lift (A → B) = Confidence (A → B) / Support (B).

Lift helps to identify useful, non-trivial rules. It flags rules whose high confidence merely reflects a consequent that is purchased frequently anyway.
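A small numeric contrast shows why confidence alone can mislead. Both cases below use hypothetical support values, and `confidence_and_lift` is an illustrative helper.

```python
def confidence_and_lift(support_a, support_b, support_ab):
    """Return (confidence, lift) for the rule A -> B from three support values."""
    confidence = support_ab / support_a
    return confidence, confidence / support_b


# Hypothetical case 1: B is in 90% of baskets and A adds no information about B
conf_popular, lift_popular = confidence_and_lift(0.20, 0.90, 0.18)

# Hypothetical case 2: B is in 20% of baskets but in 60% of baskets containing A
conf_niche, lift_niche = confidence_and_lift(0.10, 0.20, 0.06)

print(round(conf_popular, 2), round(lift_popular, 2))
print(round(conf_niche, 2), round(lift_niche, 2))
```

The first rule has the higher confidence yet a lift of about 1, meaning A and B are effectively independent. The second has lower confidence but a lift of about 3.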

**2. What is Support and Confidence? How do you calculate them?**

**SUPPORT**

**Definition:** *Proportion of transactions that contain both A and B.* It measures how frequently the rule occurs in the dataset.

**Formula:**

Support (A → B) = Transactions with both A and B / Total Transactions.


**CONFIDENCE**

**Definition:** *Proportion of transactions containing A that also contain B.* It measures the reliability of the rule: how likely B is to be bought when A is bought.

**Formula:**

Confidence (A → B) = Support (A ∪ B) / Support (A)
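Both formulas reduce to counting baskets. The sketch below applies them to five hypothetical baskets, with A = burgers and B = eggs.

```python
# Hypothetical baskets used only to illustrate the two formulas
baskets = [
    {"burgers", "eggs"},
    {"burgers", "eggs", "milk"},
    {"burgers"},
    {"eggs"},
    {"milk"},
]

a, b = "burgers", "eggs"
n_total = len(baskets)
n_a = sum(a in basket for basket in baskets)                   # baskets with A
n_ab = sum(a in basket and b in basket for basket in baskets)  # baskets with A and B

support = n_ab / n_total   # Support (A -> B)
confidence = n_ab / n_a    # equals Support (A ∪ B) / Support (A)

print(support, round(confidence, 3))
```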


**3. What are some limitations or challenges of Association Rules Mining?**

*1. Large number of rules:*
   
It may generate thousands of rules, many of which are irrelevant or redundant. Therefore, it needs filtering (e.g., by lift, confidence, etc.).

*2. Sparsity of data:*

In real-world datasets, many combinations are rare or missing, which reduces support.

*3. High computational cost:*

Mining all possible item combinations is computationally expensive for large datasets.

*4. Interpretability:*

Rules can be difficult to interpret or act on without domain knowledge.

*5. No temporal/causal insight:*

Association rules only show correlation, not causation. They do not consider the order in which items are purchased.
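The first and third limitations can be made concrete by counting candidate itemsets. The sketch below assumes the 120 item columns reported by `df_encoded.shape` earlier in the notebook.

```python
from math import comb

n_items = 120  # number of item columns in the encoded matrix

# Number of distinct candidate itemsets of each size k
candidates = {k: comb(n_items, k) for k in range(1, 5)}
for k, count in candidates.items():
    print(f"{k}-itemsets: {count:,}")

# All non-empty itemsets: far too many to enumerate exhaustively
total_itemsets = 2 ** n_items - 1
print(f"all itemsets: {total_itemsets:.3e}")
```

This growth is why Apriori prunes by minimum support and why the notebook caps itemset size with `max_len`.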