<div style="width: 100%; text-align: center;">
    <h2>Unsupervised Learning - Association Rule Learning</h2>
</div>

`Association Rule Learning` is a type of unsupervised learning used to find interesting relationships (associations) or patterns among a set of items in large datasets. It’s commonly used in market basket analysis, where the goal is to discover rules that describe how products are purchased together.
### 🔑 Key Concepts:

**Itemset**: A collection of one or more items.

**Support**: Fraction of transactions in the dataset that contain the itemset.

$$
\text{Support}(A) = \frac{\text{Transactions containing A}}{\text{Total transactions}}
$$

---

**Confidence**: Likelihood that item B is bought when item A is bought.

$$
\text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}
$$

---

**Lift**: Measures how much more often A and B occur together than expected if they were independent.

$$
\text{Lift}(A \Rightarrow B) = \frac{\text{Confidence}(A \Rightarrow B)}{\text{Support}(B)}
$$

**Interpretation of Lift**:
- **Lift > 1**: Positive association  
- **Lift = 1**: No association  
- **Lift < 1**: Negative association
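
As a quick illustration of the three formulas, the sketch below computes them on a handful of made-up baskets. The `transactions` list and the helper names `support`, `confidence`, and `lift` are assumptions chosen for this example; they are not part of the dataset or of `mlxtend`.

```python
# Hypothetical toy baskets, used only to illustrate the three formulas.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(a, b):
    """Support(A ∪ B) / Support(A)."""
    return support(set(a) | set(b)) / support(a)


def lift(a, b):
    """Confidence(A ⇒ B) / Support(B)."""
    return confidence(a, b) / support(b)


print("support(bread)         =", round(support({"bread"}), 4))
print("confidence(bread=>milk) =", round(confidence({"bread"}, {"milk"}), 4))
print("lift(bread=>milk)       =", round(lift({"bread"}, {"milk"}), 4))
```

In this toy data bread and milk each appear in 4 of 5 baskets and together in 3, so the rule bread ⇒ milk has confidence 0.75 and a lift just below 1, which reads as a slightly negative association.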


## FP-Growth Algorithm:
FP-Growth (Frequent Pattern Growth) is a widely used algorithm for frequent itemset mining: finding sets of items that often occur together in a large collection of transactions, such as shopping carts.

It was designed to be faster and more efficient than older methods like the Apriori algorithm.

---
### Why FP-Growth?
- Apriori generates lots of candidate itemsets and scans the database many times.
- FP-Growth avoids both costs:
    - It compresses the database into a prefix-tree structure called an FP-Tree, using only two passes over the data.
    - It then mines frequent patterns directly from that tree, without generating candidate itemsets.

In [24]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import json
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

### Information about the dataset is available at the link below
- https://www.kaggle.com/datasets/samps74/e-commerce-customer-behavior-dataset/data

## 1. Data Collection

In [2]:
# read the csv 
df = pd.read_csv('https://github.com/Namachivayam2001/Public_Datasets/raw/main/E-commerce.csv')

## 2. Data Inspection

In [3]:
df.head()

|   | Customer ID | Age | Gender | Location | Annual Income | Purchase History | Browsing History | Product Reviews | Time on Site |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 25 | Female | City D | 45000 | [{"Date": "2022-03-05", "Category": "Clothing"... | [{"Timestamp": "2022-03-10T14:30:00Z"}, {"Time... | Great pair of jeans, very comfortable. Rating:... | 32.5 |
| 1 | 1001 | 28 | Female | City D | 52000 | [{"Product Category": "Clothing", "Purchase Da... | [{"Product Category": "Home & Garden", "Timest... | Great customer service! | 123.45 |
| 2 | 1001 | 28 | Female | City D | 65000 | [{"Product Category": "Electronics", "Purchase... | [{"Product Category": "Clothing", "Timestamp":... | Great electronics. The sound quality is excell... | 125.6 |
| 3 | 1001 | 45 | Female | City D | 70000 | {'Purchase Date': '2022-08-15', 'Product Categ... | {'Timestamp': '2022-09-03 14:30:00'} | {"Product 1": {"Rating": 4, "Review": "Great e... | 327.6 |
| 4 | 1002 | 34 | Male | City E | 45000 | {'Purchase Date': '2022-07-25', 'Product Categ... | {'Timestamp': '2022-08-10 17:15:00'} | {"Product 1": {"Rating": 3, "Review": "Good pr... | 214.9 |


In [4]:
df['Purchase History'][1]

'[{"Product Category": "Clothing", "Purchase Date": "2022-05-15", "Price": 34.56}, {"Product Category": "Electronics", "Purchase Date": "2022-06-02", "Price": 150.99}]'

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Customer ID       50 non-null     int64  
 1   Age               50 non-null     int64  
 2   Gender            50 non-null     object 
 3   Location          50 non-null     object 
 4   Annual Income     50 non-null     int64  
 5   Purchase History  50 non-null     object 
 6   Browsing History  50 non-null     object 
 7   Product Reviews   50 non-null     object 
 8   Time on Site      50 non-null     float64
dtypes: float64(1), int64(3), object(5)
memory usage: 3.6+ KB


## Purchase History is stored as a string, so we need a column holding the list of purchased product categories before running `FP-Growth`

In [6]:
def extract_product_category(history):
    """Return the product categories listed in one purchase-history string.

    Returns None when the string cannot be parsed into a list of records.
    """
    try:
        history_list = json.loads(history)
    except (json.JSONDecodeError, TypeError):
        # Malformed row: wrap in brackets and replace single quotes with double quotes
        try:
            history_list = json.loads('[' + history.replace("'", '"') + ']')
        except Exception:
            return None
    if isinstance(history_list, list):
        return [item['Product Category'] for item in history_list
                if isinstance(item, dict) and 'Product Category' in item]
    return None
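
To see why the fallback branch is needed, the sketch below parses two strings: the well-formed value shown for `df['Purchase History'][1]`, and a made-up row that uses single quotes and has no enclosing brackets. The `malformed` string and the helper `parse_history` are hypothetical and only mimic the shape of the problem rows; they are not copied from the dataset.

```python
import json

# Well-formed JSON, copied from the df['Purchase History'][1] output.
well_formed = (
    '[{"Product Category": "Clothing", "Purchase Date": "2022-05-15", "Price": 34.56}, '
    '{"Product Category": "Electronics", "Purchase Date": "2022-06-02", "Price": 150.99}]'
)

# Hypothetical malformed row: single quotes and no enclosing brackets.
malformed = (
    "{'Purchase Date': '2023-01-01', 'Product Category': 'Electronics'}, "
    "{'Purchase Date': '2023-01-02', 'Product Category': 'Clothing'}"
)


def parse_history(raw):
    """Try strict JSON first, then the bracket-and-quote repair."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads("[" + raw.replace("'", '"') + "]")


print([record["Product Category"] for record in parse_history(well_formed)])
print([record["Product Category"] for record in parse_history(malformed)])
```

The quote replacement is a blunt repair: a value containing an apostrophe would break it. Parsing such rows with `ast.literal_eval` from the standard library may be a more robust alternative, though that is a different technique from the one used in this notebook.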

In [7]:
df['Product Categories'] = df['Purchase History'].apply(extract_product_category)

In [8]:
df[['Customer ID', 'Purchase History']].duplicated().sum()

np.int64(0)

In [9]:
final_df = df[['Product Categories']]
final_df.head()

|   | Product Categories |
|---|---|
| 0 | [] |
| 1 | [Clothing, Electronics] |
| 2 | [Electronics, Home & Garden] |
| 3 | [Electronics, Clothing, Home & Garden, Electro... |
| 4 | [Clothing, Home & Garden, Electronics] |


In [10]:
pd.DataFrame(final_df['Product Categories'].value_counts())

| Product Categories | count |
|---|---|
| [Clothing, Home & Garden] | 13 |
| [Electronics, Clothing, Home & Garden] | 9 |
| [Electronics, Home & Garden] | 8 |
| [Electronics, Clothing] | 5 |
| [Clothing, Electronics] | 4 |
| [Home & Garden, Electronics] | 4 |
| [Electronics] | 2 |
| [] | 1 |
| [Electronics, Clothing, Home & Garden, Electronics] | 1 |
| [Clothing, Home & Garden, Electronics] | 1 |


---
## 🛠 Step 0: Understand the Dataset
You have 50 transactions.  
Items: `Clothing`, `Electronics`, `Home & Garden`

---
## 🛠 Step 1: Frequency Count (Support Count)
First, let's count how many times each item appears across all transactions:

| Item            | Count |
|-----------------|-------|
| Clothing        | 35    |
| Electronics     | 35    |
| Home & Garden   | 37    |

---
## 🛠 Step 2: Define Minimum Support
Suppose the minimum support is 0.4 (40%).
With 50 transactions:
- Min support count = 50 × 0.4 = 20

All three items have a support count above 20, so all three are frequent.
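
Steps 1 and 2 can be sketched in a few lines. The basket multiplicities in `basket_counts` are an assumption: they are read off the FP-Tree in Step 4 rather than loaded from the CSV, and the variable names are illustrative.

```python
from collections import Counter

# Basket patterns and how often each occurs.
# Assumption: these multiplicities are read off the FP-Tree in Step 4,
# not loaded from the CSV.
basket_counts = {
    ("Home & Garden", "Clothing"): 13,
    ("Home & Garden", "Clothing", "Electronics"): 12,
    ("Home & Garden", "Electronics"): 12,
    ("Clothing", "Electronics"): 9,
    ("Electronics",): 2,
    ("Clothing",): 1,
    (): 1,  # one transaction with no recognised category
}

n_transactions = sum(basket_counts.values())

# Step 1: count how many transactions contain each item.
item_counts = Counter()
for basket, n in basket_counts.items():
    for item in basket:
        item_counts[item] += n

# Step 2: turn the relative threshold into a count and filter.
min_support = 0.4
min_count = min_support * n_transactions
frequent_items = {item: c for item, c in item_counts.items() if c >= min_count}

print("transactions:", n_transactions)
print("item counts:", dict(item_counts))
print("frequent items:", sorted(frequent_items))
```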

---
## 🛠 Step 3: Sort Items in Each Transaction
Before inserting into the FP-Tree, sort the items in each transaction by frequency (descending):

**Home & Garden (37) > Clothing (35) > Electronics (35)**

Clothing and Electronics are tied at 35; this walkthrough places Clothing first. Thus, the order within each transaction is:

| Item            | Count |
|-----------------|-------|
| Home & Garden   | 37    |
| Clothing        | 35    |
| Electronics     | 35    |
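
A minimal sketch of this sorting step follows. The helper name `sort_transaction` is hypothetical, and breaking the Clothing/Electronics tie alphabetically is an assumption that happens to reproduce the order used in this walkthrough.

```python
# Item counts from Step 1.
item_counts = {"Home & Garden": 37, "Clothing": 35, "Electronics": 35}


def sort_transaction(transaction):
    """Drop duplicates and unknown items, then order by descending count.

    Ties are broken alphabetically (an assumption for this sketch).
    """
    unique_items = {item for item in transaction if item in item_counts}
    return sorted(unique_items, key=lambda item: (-item_counts[item], item))


# Row 3 of the dataset lists Electronics twice; the sorted form keeps it once.
print(sort_transaction(["Electronics", "Clothing", "Home & Garden", "Electronics"]))
print(sort_transaction(["Electronics", "Clothing"]))
```

Removing duplicates matters because an FP-Tree path stores each item at most once per transaction.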

---
## 🛠 Step 4: Build FP-Tree
⚡ Insert each transaction (after sorting) into a tree. (Shared prefixes are merged.)

```mermaid
graph TD
    None --> C1["Clothing (10)"]
    None --> H["Home & Garden (37)"]
    None --> E1["Electronics (2)"]
    
    C1 --> F1["Electronics (9)"]
    
    H --> F2["Electronics (12)"]
    H --> C2["Clothing (25)"]
    
    C2 --> E2["Electronics (12)"]
```
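
The tree above can be reproduced with a small prefix-tree sketch. The `FPNode` class and the `insert` and `show` helpers are illustrative names, not `mlxtend` internals, and the basket multiplicities are assumed values chosen to match the node counts in the diagram.

```python
class FPNode:
    """One FP-Tree node: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def insert(root, sorted_items, n=1):
    """Walk (or extend) the path for `sorted_items`, adding `n` to each node."""
    node = root
    for item in sorted_items:
        if item not in node.children:
            node.children[item] = FPNode(item)
        node = node.children[item]
        node.count += n


def show(node, depth=0):
    """Print the tree with two spaces of indentation per level."""
    for child in node.children.values():
        print("  " * depth + f"{child.item} ({child.count})")
        show(child, depth + 1)


# Sorted basket patterns with multiplicities (assumed; chosen to match the diagram).
basket_counts = {
    ("Home & Garden", "Clothing"): 13,
    ("Home & Garden", "Clothing", "Electronics"): 12,
    ("Home & Garden", "Electronics"): 12,
    ("Clothing", "Electronics"): 9,
    ("Electronics",): 2,
    ("Clothing",): 1,
    (): 1,  # the empty transaction adds no nodes
}

root = FPNode()
for basket, n in basket_counts.items():
    insert(root, basket, n)

show(root)
```

Because transactions that share a prefix reuse the same nodes, 50 transactions collapse into 7 nodes below the root.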

---
## 🛠 Step 5: Mining Frequent Patterns
### 1. Mine **Electronics**

There are 4 **Electronics** nodes:
- **E1**: Electronics (9) under Clothing
- **E2**: Electronics (12) under Home & Garden
- **E3**: Electronics (12) under Home & Garden → Clothing
- **E4**: Electronics (2) directly under root

👉 **Conditional Pattern Bases for Electronics**:

| Prefix Path               | Count |
|:--------------------------:|:-----:|
| Clothing                   | 9     |
| Home & Garden              | 12    |
| Home & Garden → Clothing   | 12    |
| None (root)                | 2     |

✅ **Total support for Electronics = (9 + 12 + 12 + 2)/50 = 35/50 = 0.7**  
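Itemsets containing Electronics (only those with a count of at least 20 meet the minimum support from Step 2):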

| Itemset | Count | Support (%) |
|:-------------------------------|:-----:|:-----------:|
| {Electronics} | 35 | 70.0% |
| {Clothing, Electronics} | 21 | 42.0% |
| {Home & Garden, Electronics} | 24 | 48.0% |
| {Home & Garden, Clothing, Electronics} | 12 | 24.0% |
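
The counts in this table follow from the conditional pattern base: every subset of a prefix path co-occurs with Electronics as often as that path does. The sketch below enumerates those subsets directly, which is a shortcut that works for short paths; a full FP-Growth implementation would instead build a conditional FP-Tree and prune infrequent items. Variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Conditional pattern base for Electronics, copied from the prefix-path table.
pattern_base = [
    (("Clothing",), 9),
    (("Home & Garden",), 12),
    (("Home & Garden", "Clothing"), 12),
    ((), 2),  # Electronics directly under the root
]

n_transactions = 50

itemset_counts = Counter()
for prefix, n in pattern_base:
    # Every subset of the prefix occurs together with Electronics n times.
    for size in range(len(prefix) + 1):
        for subset in combinations(prefix, size):
            itemset_counts[frozenset(subset) | {"Electronics"}] += n

for itemset, n in sorted(itemset_counts.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), n, f"{n / n_transactions:.1%}")
```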

---
### 2. Mine **Clothing**

There are 2 **Clothing** nodes:
- **C1**: Clothing (10) under root
- **C2**: Clothing (25) under Home & Garden (37)

👉 **Conditional Pattern Bases for Clothing**:

| Prefix Path    | Count |
|:--------------:|:-----:|
| None (root)    | 10    |
| Home & Garden  | 25    |

✅ **Total support for Clothing = (10 + 25)/50 = 35/50 = 0.7**  
Itemsets:

| Itemset | Count | Support (%) |
|:-------------------------------|:-----:|:-----------:|
| {Clothing} | 35 | 70.0% |
| {Home & Garden, Clothing} | 25 | 50.0% |

---
### 3. Mine **Home & Garden**

- **H**: Home & Garden (37) directly under root.

✅ **Support for Home & Garden = 37/50 = 0.74**

---

## 🛠 Manual Result (Frequent Patterns)

| Frequent Itemset | Count | Support (%) |
|:----------------------------------:|:-----:|:-----------:|
| {Home & Garden} | 37 | 74.0% |
| {Clothing} | 35 | 70.0% |
| {Electronics} | 35 | 70.0% |
| {Home & Garden, Clothing} | 25 | 50.0% |
| {Home & Garden, Electronics} | 24 | 48.0% |
| {Clothing, Electronics} | 21 | 42.0% |

{Home & Garden, Clothing, Electronics} appears in only 12 transactions (24.0%), which is below the minimum support count of 20 from Step 2, so it is not a frequent itemset.
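
One way to double-check the manual counts against the Step 2 threshold is a brute-force pass over every possible itemset. The sketch assumes the basket multiplicities implied by the FP-Tree in Step 4 (they are not loaded from the CSV), and `count_itemset` is a hypothetical helper.

```python
from itertools import combinations

# Assumed basket multiplicities, read off the FP-Tree in Step 4.
basket_counts = {
    ("Home & Garden", "Clothing"): 13,
    ("Home & Garden", "Clothing", "Electronics"): 12,
    ("Home & Garden", "Electronics"): 12,
    ("Clothing", "Electronics"): 9,
    ("Electronics",): 2,
    ("Clothing",): 1,
    (): 1,
}
items = ["Home & Garden", "Clothing", "Electronics"]
min_count = 20  # 0.4 x 50 transactions, from Step 2


def count_itemset(itemset):
    """Number of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(n for basket, n in basket_counts.items() if wanted <= set(basket))


all_counts = {
    combo: count_itemset(combo)
    for size in range(1, len(items) + 1)
    for combo in combinations(items, size)
}
frequent = {combo: n for combo, n in all_counts.items() if n >= min_count}

for combo, n in all_counts.items():
    status = "frequent" if n >= min_count else "below threshold"
    print(combo, n, status)
```

With a minimum count of 20, the three single items and the three pairs pass, while the three-item set (count 12) does not. Brute force is only practical here because there are three items; the number of candidate itemsets doubles with each additional item.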

---




In [11]:
# Create TransactionEncoder object
te = TransactionEncoder()

In [12]:
# Preprocessing: one-hot encode the transactions into the boolean matrix that fpgrowth expects
te_ary = te.fit(final_df['Product Categories']).transform(final_df['Product Categories'])
te_df = pd.DataFrame(te_ary, columns=te.columns_)

In [13]:
te_df.head()

|   | Clothing | Electronics | Home & Garden |
|---|---|---|---|
| 0 | False | False | False |
| 1 | True | True | False |
| 2 | False | True | True |
| 3 | True | True | True |
| 4 | True | True | True |


In [21]:
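# Apply FP-Growth. min_support=0.5 is stricter than the 0.4 used in the manual walkthrough,
# so only itemsets found in at least 25 of the 50 transactions are returned.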
frequent_itemsets = fpgrowth(te_df, min_support=0.5, use_colnames=True)

In [22]:
frequent_itemsets.sort_values(by='support', ascending=False)

|   | support | itemsets |
|---|---|---|
| 2 | 0.74 | (Home & Garden) |
| 0 | 0.7 | (Electronics) |
| 1 | 0.7 | (Clothing) |
| 3 | 0.5 | (Home & Garden, Clothing) |


## 🔥 Code to generate association rules

In [25]:
# Generate association rules, keeping those with confidence >= 0.5
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

In [26]:
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values(by='confidence', ascending=False)

|   | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 1 | (Clothing) | (Home & Garden) | 0.5 | 0.714286 | 0.965251 |
| 0 | (Home & Garden) | (Clothing) | 0.5 | 0.675676 | 0.965251 |
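
The confidence and lift values in this output can be checked by hand from the supports reported by FP-Growth, using the formulas from the Key Concepts section. The `rule_metrics` helper below is a hypothetical name for this check, not an `mlxtend` function.

```python
# Supports taken from the FP-Growth output (min_support=0.5 run).
support = {
    frozenset({"Home & Garden"}): 0.74,
    frozenset({"Clothing"}): 0.70,
    frozenset({"Home & Garden", "Clothing"}): 0.50,
}


def rule_metrics(antecedent, consequent):
    """Return (confidence, lift) for the rule antecedent => consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    confidence = support[a | c] / support[a]
    lift = confidence / support[c]
    return confidence, lift


rules_to_check = [
    ({"Clothing"}, {"Home & Garden"}),
    ({"Home & Garden"}, {"Clothing"}),
]
for antecedent, consequent in rules_to_check:
    conf, lift = rule_metrics(antecedent, consequent)
    print(sorted(antecedent), "=>", sorted(consequent), round(conf, 6), round(lift, 6))
```

Both rules have a lift slightly below 1: Clothing and Home & Garden appear together in 50% of transactions, a little less than the 0.70 × 0.74 ≈ 51.8% expected if they were independent. By the interpretation given earlier, that is a weak negative association, so neither rule is a strong cross-selling signal on its own.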
