In [None]:
!pip install mlxtend



In [None]:
import mlxtend

In [None]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

In [None]:
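df_path = '/content/Online retail.xlsx'
# Note: with the default header handling, the first row of the sheet becomes the
# column name (visible in the outputs below), so that transaction is not counted as data.
# Passing header=None to pd.read_excel would keep it as a regular row.
retail = pd.read_excel(df_path)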



In [None]:
retail.head()

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt


In [None]:
retail.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 1 columns):
 #   Column                                                                                                                                                                                                                           Non-Null Count  Dtype 
---  ------                                                                                                                                                                                                                           --------------  ----- 
 0   shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil  7500 non-null   object
dtypes: object(1)
memory usage: 58.7+ KB


In [None]:
retail.describe()

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
count,7500
unique,5175
top,cookies
freq,223


In [None]:
# Splitting each transaction into a list of items
retail['Transaction'] = retail.iloc[:, 0].apply(lambda x: x.split(','))


In [None]:
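# Removing duplicate transactions, if any.
# Lists are unhashable, so deduplicating on the 'Transaction' column can raise a TypeError;
# comparing the original comma-separated string column selects the same rows.
df = retail.drop_duplicates(subset=[retail.columns[0]])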

In [None]:
# Checking for missing values inside each transaction's item list
missing_values = retail['Transaction'].apply(lambda x: any(pd.isnull(x)))

In [None]:
# Displaying the first few rows after preprocessing
df_head_preprocessed = df.head()
missing_values_summary = retail['Transaction'].isnull().sum()

In [None]:

df_head_preprocessed, missing_values_summary

(  shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil  \
 0                             burgers,meatballs,eggs                                                                                                                                                                                
 1                                            chutney                                                                                                                                                                                
 2                                     turkey,avocado                                                                                                                                                                                
 3  mineral water,milk,energy bar,whole wheat rice...                           

In [None]:
# Transform the dataset into the appropriate format
# Named `te` rather than `re` to avoid shadowing Python's standard `re` module
te = TransactionEncoder()
te_ary = te.fit(df['Transaction']).transform(df['Transaction'])
df_transformed = pd.DataFrame(te_ary, columns=te.columns_)
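
To make the encoding step concrete, here is a minimal pure-Python sketch of the kind of boolean table a one-hot transaction encoding produces. It is not mlxtend's implementation, and the three toy baskets are made up for illustration rather than taken from the retail file.

```python
# Toy baskets (hypothetical data, not from the retail file).
transactions = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]

# Sorted vocabulary of every distinct item; this fixes the column order.
columns = sorted({item for basket in transactions for item in basket})

# One row per basket, one boolean per item: True if the basket contains the item.
encoded = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in encoded:
    print(row)
```

Each row can then be read as "which of the known items does this basket contain", which is the input shape the frequent-itemset step expects.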

In [None]:
# Generate frequent itemsets with a minimum support threshold
frequent_itemset = apriori(df_transformed, min_support=0.01, use_colnames=True)
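
For intuition about what a minimum support threshold does, the following is a small pure-Python sketch of a level-wise (Apriori-style) search. It is an illustration under simplifying assumptions, not the code mlxtend runs; the function name `frequent_itemsets` and the toy baskets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Toy baskets (hypothetical data).
toy_transactions = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
    ["milk"],
]
found = frequent_itemsets(toy_transactions, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: an itemset is only counted if all of its smaller subsets already passed the threshold.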

In [None]:
# Generate association rules with a minimum confidence threshold
rules = association_rules(frequent_itemset, metric="confidence", min_threshold=0.5)

In [None]:
# Sorting the rules by lift
rules = rules.sort_values(by=['lift'], ascending=False)

In [None]:
top_rules = rules.head()

frequent_itemset.head(), top_rules

(    support             itemsets
 0  0.029179            (almonds)
 1  0.011014  (antioxydant juice)
 2  0.045797            (avocado)
 3  0.012560              (bacon)
 4  0.015459     (barbecue sauce),
                         antecedents      consequents  antecedent support  \
 4  (frozen vegetables, ground beef)      (spaghetti)            0.024541   
 8                      (milk, soup)  (mineral water)            0.021449   
 3  (frozen vegetables, ground beef)  (mineral water)            0.024541   
 9                 (spaghetti, soup)  (mineral water)            0.020676   
 6           (pancakes, ground beef)  (mineral water)            0.020870   
 
    consequent support   support  confidence      lift  representativity  \
 4            0.229565  0.012560    0.511811  2.229480               1.0   
 8            0.299710  0.012367    0.576577  1.923781               1.0   
 3            0.299710  0.013333    0.543307  1.812775               1.0   
 9            0.299710  0.0

What is lift and why is it important in Association rules?

->

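Lift measures how much more likely item Y is to be purchased when item X is purchased, compared with how often Y is purchased overall (that is, compared with what would be expected if X and Y were independent).

     Lift(X⇒Y) = Support(X∪Y) / (Support(X) × Support(Y))

  or

     Lift(X⇒Y) = Confidence(X⇒Y) / Support(Y)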

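Why is Lift Important?

* It helps filter out trivial rules that have high confidence just because the consequent item is common.

* It helps ensure you're finding genuinely interesting associations.

* Lift > 1 means X and Y occur together more often than expected under independence, lift = 1 means no association, and lift < 1 means they occur together less often than expected. Rules with high lift are the ones most likely to reveal customer behavior patterns.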
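
As a quick check of the two lift formulas, the sketch below recomputes lift for the top rule, (frozen vegetables, ground beef) ⇒ (spaghetti), from the rounded support values printed in the rules table above. The helper name `lift` is just for illustration; because the inputs are rounded to six decimals, the result only approximately matches the 2.229480 reported by mlxtend.

```python
import math


def lift(support_xy, support_x, support_y):
    """Lift(X => Y) = Support(X ∪ Y) / (Support(X) * Support(Y))."""
    return support_xy / (support_x * support_y)


# Rounded values copied from the rules table above.
support_xy = 0.012560  # support of the whole rule
support_x = 0.024541   # antecedent support
support_y = 0.229565   # consequent support

confidence = support_xy / support_x
lift_direct = lift(support_xy, support_x, support_y)
lift_via_confidence = confidence / support_y

# Both formulas should agree up to floating-point error.
print(math.isclose(lift_direct, lift_via_confidence))
print(round(confidence, 4), round(lift_direct, 3))
```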

What are support and confidence, and how do you calculate them?

->

Support:
  * Support measures how frequently an itemset appears in the dataset.
  * It is calculated as the number of transactions containing the itemset divided by the total number of transactions.

        Support(X)= Number of transactions containing X / Total number of transactions


Example:
    Let’s say you have 1,000 transactions in a supermarket.

    100 of them contain Milk

    → Support(Milk) = 100 / 1000 = 0.10 (or 10%)


Confidence:
  * Confidence measures how often Y is bought when X is bought; it is the conditional probability P(Y | X).
  * It is calculated as the number of transactions containing both X and Y divided by the number of transactions containing X.

          Confidence(X⇒Y) = Support(X∪Y) / Support(X)

  
Example:
Support({Milk, Butter}) = 50 / 1000 = 0.05

Support({Milk}) = 100 / 1000 = 0.10

Confidence(Milk⇒Butter) = Support({Milk, Butter}) / Support({Milk}) = 0.05 / 0.10 = 0.5
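
The same arithmetic can be written out in a few lines of Python. This is only a sketch of the worked example above, using its counts (1,000 transactions, 100 with Milk, 50 with both Milk and Butter); the variable names are illustrative.

```python
import math

total_transactions = 1000
milk_count = 100             # transactions containing Milk
milk_and_butter_count = 50   # transactions containing both Milk and Butter

# Support: share of all transactions that contain the itemset.
support_milk = milk_count / total_transactions
support_milk_butter = milk_and_butter_count / total_transactions

# Confidence(Milk => Butter): share of Milk transactions that also contain Butter.
confidence_milk_to_butter = support_milk_butter / support_milk

print(support_milk, support_milk_butter, confidence_milk_to_butter)
```

Note that confidence can equally be computed from the raw counts (50 / 100), since the total number of transactions cancels out.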

What are some limitations or challenges of Association Rule Mining?

->

* Combinatorial Explosion:
  As the number of items increases, the number of possible itemsets grows exponentially (n distinct items give 2^n - 1 non-empty itemsets).


* Too Many Rules Generated:
  Even with modest datasets, you can end up with thousands of rules, many of which are redundant, trivial, or not actionable.


* Lack of Causality:
  Association rules only tell you that items appear together, not that one causes the other.


* Ignores Temporal or Sequential Information:
  It considers neither the order in which items were added to a basket nor the order of transactions over time.


* Uniform Treatment of All Items:
  Standard rule mining has no built-in way to weight items by price or profit margin, and a single support threshold applies to every item regardless of its popularity.


* Sensitivity to Support/Confidence Thresholds:
  Setting the thresholds too high may miss important rules.
  Setting them too low may generate too many irrelevant or noisy rules.
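
To give a feel for the combinatorial explosion, the sketch below counts how many non-empty itemsets and how many candidate rules exist for a given number of distinct items. The item counts are arbitrary illustrative values, not measured from the retail data, and the helper names are hypothetical.

```python
def count_itemsets(n_items):
    """Number of non-empty itemsets over n_items distinct items: 2^n - 1."""
    return 2 ** n_items - 1


def count_rules(n_items):
    """Number of rules X => Y with X and Y non-empty and disjoint: 3^n - 2^(n+1) + 1.

    Each item goes to X, to Y, or to neither (3^n ways); subtract the cases
    where X is empty or Y is empty, adding back the case where both are empty.
    """
    return 3 ** n_items - 2 ** (n_items + 1) + 1


for n in (5, 10, 20, 50):
    print(n, count_itemsets(n), count_rules(n))
```

This is why minimum support and confidence thresholds are needed in practice: they prune the search long before all of these candidates are examined.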