## Association Rules

In [20]:
#Data Preprocessing for Market Basket Analysis
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [21]:
#Load dataset
file_path="Online retail.xlsx"
df=pd.read_excel(file_path,header=None)

In [22]:
print("Initial shape:",df.shape)
print(df.info())

Initial shape: (7501, 1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       7501 non-null   object
dtypes: object(1)
memory usage: 58.7+ KB
None


In [23]:
#Rename column for clarity
df.columns=["Items"]

#Drop empty rows (if any)
df=df.dropna()

#Split the string of items into Python lists
transactions=df["Items"].apply(lambda x: [i.strip().lower() for i in str(x).split(",")])

#Preview first 5 transactions
print("First 5 transactions:")
print(transactions.head())

First 5 transactions:
0    [shrimp, almonds, avocado, vegetables mix, gre...
1                           [burgers, meatballs, eggs]
2                                            [chutney]
3                                    [turkey, avocado]
4    [mineral water, milk, energy bar, whole wheat ...
Name: Items, dtype: object


In [24]:
#1. Transform transaction list into one-hot encoded DataFrame
te=TransactionEncoder()
te_array=te.fit(transactions).transform(transactions)
basket=pd.DataFrame(te_array,columns=te.columns_)

print("Basket shape:", basket.shape)
print(basket.head())

Basket shape: (7501, 119)
   almonds  antioxydant juice  asparagus  avocado  babies food  bacon  \
0     True               True      False     True        False  False   
1    False              False      False    False        False  False   
2    False              False      False    False        False  False   
3    False              False      False     True        False  False   
4    False              False      False    False        False  False   

   barbecue sauce  black tea  blueberries  body spray  ...  turkey  \
0           False      False        False       False  ...   False   
1           False      False        False       False  ...   False   
2           False      False        False       False  ...   False   
3           False      False        False       False  ...    True   
4           False      False        False       False  ...   False   

   vegetables mix  water spray  white wine  whole weat flour  \
0            True        False       False        

In [25]:
#2.Apply Apriori algorithm (minimum support = 0.01 → 1%)
frequent_itemsets=apriori(basket,min_support=0.01,use_colnames=True)

print("\nFrequent itemsets:")
print(frequent_itemsets.sort_values("support",ascending=False).head(10))


Frequent itemsets:
     support             itemsets
46  0.238368      (mineral water)
19  0.179709               (eggs)
63  0.174110          (spaghetti)
24  0.170911       (french fries)
13  0.163845          (chocolate)
32  0.132116          (green tea)
45  0.129583               (milk)
33  0.098254        (ground beef)
30  0.095321  (frozen vegetables)
53  0.095054           (pancakes)


In [26]:
#3. Generate association rules (confidence >= 0.3, lift >= 1.2)
rules=association_rules(frequent_itemsets,metric="lift",min_threshold=1.2)
rules=rules[(rules["confidence"]>=0.3)&(rules["lift"]>=1.2)]
print(rules[["antecedents","consequents","support","confidence","lift"]].head(10))

       antecedents      consequents   support  confidence      lift
0        (avocado)  (mineral water)  0.011598    0.348000  1.459926
5        (burgers)           (eggs)  0.028796    0.330275  1.837830
32          (cake)  (mineral water)  0.027463    0.338816  1.421397
39       (cereals)  (mineral water)  0.010265    0.398964  1.673729
50       (chicken)  (mineral water)  0.022797    0.380000  1.594172
70     (chocolate)  (mineral water)  0.052660    0.321400  1.348332
92   (cooking oil)  (mineral water)  0.020131    0.394256  1.653978
94   (cooking oil)      (spaghetti)  0.015865    0.310705  1.784531
107       (turkey)           (eggs)  0.019464    0.311301  1.732245
116  (fresh bread)  (mineral water)  0.013332    0.309598  1.298820


## Task 3: Analysis and Interpretation

### Key Metrics
- **Support**  
  - Indicates how frequently an itemset appears in transactions.  
  - Example: `{milk, bread}` with support = 0.08 → appears in 8% of all baskets.  

- **Confidence**  
  - Shows the likelihood of buying the consequent when the antecedent is bought.  
  - Example: `{eggs} → {bread}` with confidence = 0.6 → whenever eggs are bought, 60% of the time bread is also purchased.  

- **Lift**  
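  - Compares how often the items occur together with how often they would if they were bought independently.  
  - Lift > 1 → positive association (the larger the value, the stronger the association).  
  - Lift ≈ 1 → the items are roughly independent.  
  - Lift < 1 → negative association (the items tend not to appear together).  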
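
The three metrics above can be computed by hand on a small example. The sketch below is only an illustration: the five baskets are made-up data (not taken from the dataset), and the helper names `support`, `confidence`, and `lift` are hypothetical, not mlxtend functions.

```python
# Toy baskets (hypothetical data, not from the notebook's dataset)
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"eggs", "bread"},
    {"milk"},
    {"eggs", "bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(A -> B) / support(B)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

sup = support({"eggs", "bread"}, transactions)
conf = confidence({"eggs"}, {"bread"}, transactions)
lft = lift({"eggs"}, {"bread"}, transactions)
print(f"support={sup:.2f}, confidence={conf:.2f}, lift={lft:.2f}")
```

In this toy data every basket with eggs also contains bread, so the confidence is 1.0, while the lift is only 1.25 because bread is already in 80% of baskets.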

### Example Insights (Hypothetical)
- `{mineral water} -> {spaghetti}`  
  - Lift = 2.1, Confidence = 0.45  
  - Customers who buy mineral water are about **twice as likely** to buy spaghetti as the average customer.  

- `{eggs, milk} -> {bread}`  
  - Support = 0.06  
  - Classic grocery basket: eggs + milk + bread often appear together.  

- `{frozen smoothie} -> {green tea}`  
  - Confidence = 0.38, Lift = 1.5  
  - Suggests a possible **healthy lifestyle** purchasing trend.  

### Business Interpretation
- **Product Bundling** → Promote high-lift pairs together (for example, milk + bread).  
- **Cross-Selling** → Suggest related items during checkout (for example, eggs -> bread).  
- **Inventory Planning** → Frequently paired items should be stocked together.  
- **Customer Segmentation** → Healthy vs indulgent buyers can be identified and targeted differently.  
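
As a rough sketch of how bundling or cross-selling candidates might be shortlisted, the rules printed by the notebook can be ranked by lift. The values below are copied from the printed rules table; the cut-off `top_n = 3` and the plain-tuple representation are assumptions made for illustration only.

```python
# (antecedent, consequent, support, confidence, lift), copied from the printed rules table
rules = [
    ("avocado", "mineral water", 0.011598, 0.348000, 1.459926),
    ("burgers", "eggs", 0.028796, 0.330275, 1.837830),
    ("cake", "mineral water", 0.027463, 0.338816, 1.421397),
    ("cereals", "mineral water", 0.010265, 0.398964, 1.673729),
    ("chicken", "mineral water", 0.022797, 0.380000, 1.594172),
    ("chocolate", "mineral water", 0.052660, 0.321400, 1.348332),
    ("cooking oil", "mineral water", 0.020131, 0.394256, 1.653978),
    ("cooking oil", "spaghetti", 0.015865, 0.310705, 1.784531),
    ("turkey", "eggs", 0.019464, 0.311301, 1.732245),
    ("fresh bread", "mineral water", 0.013332, 0.309598, 1.298820),
]

top_n = 3  # hypothetical cut-off for the shortlist

# Sort by lift (index 4), highest first, and keep the top few rules
by_lift = sorted(rules, key=lambda r: r[4], reverse=True)[:top_n]

for antecedent, consequent, sup, conf, lft in by_lift:
    print(f"{antecedent} -> {consequent}: lift={lft:.2f}, confidence={conf:.2f}, support={sup:.3f}")
```

Ranking by lift rather than confidence avoids favouring rules whose consequent is simply a very popular item, such as mineral water.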


### Interview Questions

##### 1. What is Lift and why is it important in Association Rules?
- **Lift** tells us how much more likely two items are bought together compared to them being bought independently.  
- If **Lift > 1** → items are positively related.  
- If **Lift = 1** → no special relationship (just by chance).  
- If **Lift < 1** → items are negatively related.  
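
Lift matters because confidence alone can look high simply because the consequent is a popular item; lift corrects for that baseline and shows whether the rule adds real information.  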
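
One way to check this relationship is to recompute a lift value from numbers the notebook already printed. The sketch below uses the confidence of `avocado -> mineral water` from the rules table and the support of `mineral water` from the frequent itemsets output; the variable names are illustrative.

```python
# Values copied from the notebook output
confidence_avocado_to_water = 0.348000  # confidence of (avocado) -> (mineral water)
support_mineral_water = 0.238368        # support of (mineral water)

# lift(A -> B) = confidence(A -> B) / support(B)
lift_avocado_to_water = confidence_avocado_to_water / support_mineral_water
print(f"lift = {lift_avocado_to_water:.4f}")
```

The result agrees with the lift of 1.459926 reported in the rules table, up to rounding of the printed inputs.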

##### 2. What is Support and Confidence? How do you calculate them?
- **Support**: How often an item or rule appears in the dataset.  
  - Formula:  
    Support(A->B)=(Transactions containing A and B) / (Total transactions) 

- **Confidence**: How often B is bought when A is bought.  
  - Formula:  
    Confidence(A->B) = (Transactions containing A and B) / (Transactions containing A)

Example: If 100 people bought bread, and 60 of them also bought butter:  
- Support(bread->butter) = 60 / Total transactions  
- Confidence(bread->butter) = 60 / 100 = 0.6 (60%)  
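
The same arithmetic can be written out in a few lines. The total of 1,000 transactions is an assumption added here so that support has a concrete value; the example above does not specify it.

```python
bread_transactions = 100       # transactions containing bread (from the example)
bread_and_butter = 60          # transactions containing both bread and butter
total_transactions = 1000      # ASSUMPTION: not given in the example

# Support: share of all transactions that contain both items
support_bread_butter = bread_and_butter / total_transactions

# Confidence: share of bread transactions that also contain butter
confidence_bread_butter = bread_and_butter / bread_transactions

print(f"support={support_bread_butter:.2f}, confidence={confidence_bread_butter:.2f}")
```

Under this assumption the support is 0.06 and the confidence is 0.6.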


##### 3. What are some limitations or challenges of Association Rules Mining?
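- **Too many rules**: Low thresholds can generate thousands of rules, which makes it hard to pick out the useful ones.  
- **Uninteresting rules**: A rule can have high support yet carry no real business meaning, for example when it only reflects a very popular item.  
- **Rare item problem**: Rare but important items fall below the minimum support and get ignored.  
- **Computationally expensive**: The number of candidate itemsets grows quickly with the number of items, so large datasets need a lot of time and memory.  
- **No causation**: Rules show co-occurrence patterns; they do not mean that buying one item causes the purchase of the other.  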
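
To give a sense of the computational cost, the sketch below counts how many candidate itemsets exist for the 119 distinct items in this dataset (the second number in the printed basket shape). These are upper bounds on the search space only; Apriori prunes most candidates using the minimum support threshold, so it never evaluates all of them.

```python
from math import comb

n_items = 119  # distinct items, from "Basket shape: (7501, 119)"

pairs = comb(n_items, 2)         # possible 2-item itemsets
triples = comb(n_items, 3)       # possible 3-item itemsets
all_itemsets = 2 ** n_items - 1  # every non-empty itemset

print(f"pairs:   {pairs:,}")
print(f"triples: {triples:,}")
print(f"all non-empty itemsets: about {all_itemsets:.2e}")
```

Even at this modest catalogue size there are 7,021 possible pairs and 273,819 possible triples, which is why the minimum support threshold has such a large effect on run time.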
