<div style="width: 100%; text-align: center;">
    <h2>Unsupervised Learning - Association Rule Learning</h2>
</div>

`Association Rule Learning` is a type of unsupervised learning that finds relationships (associations) or co-occurrence patterns among items in large datasets. It is commonly used in market basket analysis, where the goal is to discover rules that describe which products are purchased together.

### 🔑 Key Concepts:

**Itemset**: A collection of one or more items.

**Support**: Fraction of transactions in the dataset that contain the itemset.

$$
\text{Support}(A) = \frac{\text{Transactions containing A}}{\text{Total transactions}}
$$

---

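**Confidence**: Likelihood that B is bought when A is bought. Here A ∪ B denotes the combined itemset, so Support(A ∪ B) is the fraction of transactions that contain both A and B.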

$$
\text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}
$$

---

**Lift**: Measures how much more often A and B occur together than expected if they were independent.

$$
\text{Lift}(A \Rightarrow B) = \frac{\text{Confidence}(A \Rightarrow B)}{\text{Support}(B)}
$$

**Interpretation of Lift**:
- **Lift > 1**: Positive association  
- **Lift = 1**: No association  
- **Lift < 1**: Negative association
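
The three metrics above can be sketched in plain Python. This is a minimal illustration on a hypothetical toy basket list (not the dataset used later); the function names `support`, `confidence`, and `lift` are chosen here for readability and are not library functions.

```python
# Toy transactions (hypothetical data, used only to illustrate the formulas)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support(A ∪ B) / Support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence(A ⇒ B) / Support(B)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

pair_support = support({"bread", "milk"}, transactions)        # 2 of 4 transactions
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 0.5 / 0.75
rule_lift = lift({"bread"}, {"milk"}, transactions)              # (0.5 / 0.75) / 0.75
print(pair_support, rule_confidence, rule_lift)
```

In this toy data the lift of bread ⇒ milk comes out below 1, which by the interpretation above indicates a slightly negative association.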


---
# Apriori Algorithm
The Apriori Algorithm is a classic algorithm used in Association Rule Learning to find frequent itemsets in a dataset and generate association rules. It's widely used for market basket analysis, where we look for combinations of items that frequently occur together in transactions.
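
To make the level-wise idea concrete, here is a minimal sketch of Apriori in plain Python. It is not the `mlxtend` implementation used below; the function name `apriori_sketch` and the toy transactions are assumptions made for illustration. The key property it relies on is that every subset of a frequent itemset must itself be frequent, so larger candidates are built only from smaller frequent itemsets.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing every item of `itemset`
        return sum(itemset <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([i]) for i in items})
    frequent = dict(current)

    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy transactions
transactions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
frequent = apriori_sketch(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

With this toy input, all single items and all pairs are frequent, while the triple {A, B, C} is counted but falls below the threshold.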

In [1]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import json
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

### Dataset information is available at the link below
- https://www.kaggle.com/datasets/samps74/e-commerce-customer-behavior-dataset/data

## 1. Data Collection

In [2]:
# read the csv 
df = pd.read_csv('https://github.com/Namachivayam2001/Public_Datasets/raw/main/E-commerce.csv')

## 2. Data Inspection

In [3]:
df.head()

,Customer ID,Age,Gender,Location,Annual Income,Purchase History,Browsing History,Product Reviews,Time on Site
0,1001,25,Female,City D,45000,"[{""Date"": ""2022-03-05"", ""Category"": ""Clothing""...","[{""Timestamp"": ""2022-03-10T14:30:00Z""}, {""Time...","Great pair of jeans, very comfortable. Rating:...",32.5
1,1001,28,Female,City D,52000,"[{""Product Category"": ""Clothing"", ""Purchase Da...","[{""Product Category"": ""Home & Garden"", ""Timest...",Great customer service!,123.45
2,1001,28,Female,City D,65000,"[{""Product Category"": ""Electronics"", ""Purchase...","[{""Product Category"": ""Clothing"", ""Timestamp"":...",Great electronics. The sound quality is excell...,125.6
3,1001,45,Female,City D,70000,"{'Purchase Date': '2022-08-15', 'Product Categ...",{'Timestamp': '2022-09-03 14:30:00'},"{""Product 1"": {""Rating"": 4, ""Review"": ""Great e...",327.6
4,1002,34,Male,City E,45000,"{'Purchase Date': '2022-07-25', 'Product Categ...",{'Timestamp': '2022-08-10 17:15:00'},"{""Product 1"": {""Rating"": 3, ""Review"": ""Good pr...",214.9


In [4]:
df['Purchase History'][1]

'[{"Product Category": "Clothing", "Purchase Date": "2022-05-15", "Price": 34.56}, {"Product Category": "Electronics", "Purchase Date": "2022-06-02", "Price": 150.99}]'

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Customer ID       50 non-null     int64  
 1   Age               50 non-null     int64  
 2   Gender            50 non-null     object 
 3   Location          50 non-null     object 
 4   Annual Income     50 non-null     int64  
 5   Purchase History  50 non-null     object 
 6   Browsing History  50 non-null     object 
 7   Product Reviews   50 non-null     object 
 8   Time on Site      50 non-null     float64
dtypes: float64(1), int64(3), object(5)
memory usage: 3.6+ KB


In [6]:
df['Customer ID'].nunique()

13

## Purchase History is stored as a string, so we need a column holding the list of purchased product categories to run the `Apriori Algorithm`
`json.loads()` only parses valid JSON, which requires double-quoted strings. Some rows use single quotes (') instead of double quotes (") and are not wrapped in a list, so parsing them raises an error. For those rows we enclose the string in [] and replace single quotes (') with double quotes (").
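
The failure and the fix can be seen on a single string. The record below is a hypothetical example shaped like the single-quoted rows shown in `df.head()`; the actual rows are truncated in the output above, so their exact content is an assumption.

```python
import json

# Hypothetical record: single-quoted, comma-separated dicts with no enclosing list
raw = (
    "{'Purchase Date': '2022-08-15', 'Product Category': 'Electronics'}, "
    "{'Purchase Date': '2022-09-01', 'Product Category': 'Clothing'}"
)

try:
    json.loads(raw)
    parsed_directly = True
except json.JSONDecodeError:
    # Single quotes are not valid JSON string delimiters
    parsed_directly = False

# Wrap in [] to form a JSON array and swap the quote characters
fixed = "[" + raw.replace("'", '"') + "]"
records = json.loads(fixed)
categories = [record["Product Category"] for record in records]
print(parsed_directly, categories)
```

Note that a blanket quote replacement would corrupt any value that itself contains an apostrophe; for Python-style literals, `ast.literal_eval` from the standard library may be a more robust alternative.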


In [7]:
def extract_product_category(history):
    """Return the list of 'Product Category' values in a purchase-history string.

    Entries without a 'Product Category' key are skipped (row 0 uses 'Category'
    instead, so it yields an empty list). Returns None if parsing fails.
    """
    try:
        history_list = json.loads(history)
        if isinstance(history_list, list):
            return [item.get('Product Category', '') for item in history_list if 'Product Category' in item]
    except (json.JSONDecodeError, TypeError):
        # Fix malformed string: wrap in brackets and replace single quotes with double quotes
        try:
            fixed_history = '[' + history + ']'
            fixed_history = fixed_history.replace("'", '"')
            history_list = json.loads(fixed_history)
            if isinstance(history_list, list):
                return [item.get('Product Category', '') for item in history_list if 'Product Category' in item]
        except Exception:
            return None
    # Parsed successfully, but the result was not a list
    return None

In [8]:
df['Product Categories'] = df['Purchase History'].apply(extract_product_category)

In [9]:
df[['Customer ID', 'Purchase History']].duplicated().sum()

np.int64(0)

In [10]:
final_df = df[['Product Categories']]
final_df.head()

,Product Categories
0,[]
1,"[Clothing, Electronics]"
2,"[Electronics, Home & Garden]"
3,"[Electronics, Clothing, Home & Garden, Electro..."
4,"[Clothing, Home & Garden, Electronics]"


In [11]:
pd.DataFrame(final_df['Product Categories'].value_counts())

Product Categories,count
"[Clothing, Home & Garden]",13
"[Electronics, Clothing, Home & Garden]",9
"[Electronics, Home & Garden]",8
"[Electronics, Clothing]",5
"[Clothing, Electronics]",4
"[Home & Garden, Electronics]",4
[Electronics],2
[],1
"[Electronics, Clothing, Home & Garden, Electronics]",1
"[Clothing, Home & Garden, Electronics]",1


## ✅ Apply the Apriori Algorithm Manually (Short Form)
### 🔹 Step 1: Total Transactions
All 50 rows are treated as transactions, including the one with an empty category list. So,
Total transactions (N) = 50

---
### 🔹 Step 2: Count 1-itemsets (Support Threshold = 0.5)
We count the number of transactions each item appears in:

| Item            | Count | Support (Count / 50) |
|-----------------|-------|----------------------|
| **Clothing**     | 35    | 0.70 ✅              |
| **Electronics**  | 35    | 0.70 ✅              |
| **Home & Garden**| 37    | 0.74 ✅              |

✅ Since our minimum support threshold is 0.5, all 1-itemsets pass.

---

### 🔹 Step 3: Count 2-itemsets
We now count how many transactions contain each pair:

| Itemset                        | Count | Support |
|--------------------------------|-------|---------|
| (Clothing, Electronics)        | 21    | 0.42 ❌ |
| (Clothing, Home & Garden)      | 25    | 0.50 ✅ |
| (Electronics, Home & Garden)   | 24    | 0.48 ❌ |

Only one 2-itemset passes the support threshold of 0.5:

- (Clothing, Home & Garden)

---

### 🔹 Step 4: Count 3-itemsets
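Apriori builds 3-itemset candidates only from frequent 2-itemsets, and a candidate survives only if all of its 2-item subsets are frequent. Since (Clothing, Electronics) and (Electronics, Home & Garden) failed in Step 3, the only possible 3-itemset is pruned without counting. Its count is shown here for reference:

| Itemset                                   | Count | Support |
|-------------------------------------------|-------|---------|
| (Clothing, Electronics, Home & Garden)    | 12    | 0.24 ❌ |

❌ No 3-itemset passes the 0.5 threshold.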

---

### ✅ Final Manual Output

All counted itemsets, sorted by support (only those with support of at least 50% are frequent):

| Itemset | Count | Support (%) |
|:----------------------------------:|:-----:|:-----------:|
| {Home & Garden} | 37 | 74.0% |
| {Clothing} | 35 | 70.0% |
| {Electronics} | 35 | 70.0% |
| {Home & Garden, Clothing} | 25 | 50.0% |
| {Home & Garden, Electronics} | 24 | 48.0% |
| {Clothing, Electronics} | 21 | 42.0% |
| {Home & Garden, Clothing, Electronics} | 12 | 24.0% |

#### 📌 Frequent Itemsets (min support = 0.5):

#### 1-itemsets:
- {Clothing}, {Electronics}, {Home & Garden}

#### 2-itemsets:
- {Clothing, Home & Garden}
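
The manual counting in Steps 2 to 4 can be reproduced with a short brute-force sketch. It counts every k-item combination without Apriori's pruning, and it runs on a small hypothetical sample rather than the full 50 rows; the helper name `itemset_counts` is an assumption. Each transaction is de-duplicated first, because a row such as `[Electronics, Clothing, Home & Garden, Electronics]` should contribute to the Electronics count only once.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample; the last row repeats an item, as one row in the dataset does
transactions = [
    ["Clothing", "Home & Garden"],
    ["Electronics", "Clothing", "Home & Garden"],
    ["Electronics", "Home & Garden"],
    ["Electronics", "Clothing", "Home & Garden", "Electronics"],
]

def itemset_counts(transactions, k):
    """Count the transactions that contain each k-item combination."""
    counts = Counter()
    for t in transactions:
        unique_items = sorted(set(t))  # a repeated item counts once per transaction
        counts.update(combinations(unique_items, k))
    return counts

n = len(transactions)
for k in (1, 2, 3):
    for itemset, count in itemset_counts(transactions, k).most_common():
        print(k, itemset, count, count / n)
```

Applying the same function to the full `Product Categories` column should give the counts in the tables above, assuming every row parsed into a list.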


In [12]:
# Create TransactionEncoder object
te = TransactionEncoder()

In [13]:
# Preprocessing: Convert dataset into a format suitable for apriori
te_ary = te.fit(final_df['Product Categories']).transform(final_df['Product Categories'])
te_df = pd.DataFrame(te_ary, columns=te.columns_)

In [14]:
te_df.head()

,Clothing,Electronics,Home & Garden
0,False,False,False
1,True,True,False
2,False,True,True
3,True,True,True
4,True,True,True
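
Conceptually, the encoding step turns each transaction into one boolean row with a column per distinct item. The sketch below is a rough plain-Python equivalent for illustration, not `mlxtend`'s implementation; the function name `one_hot_encode` is an assumption.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item labels and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[col in items for col in columns] for items in map(set, transactions)]
    return columns, rows

# First three rows of `Product Categories` as shown earlier
sample = [[], ["Clothing", "Electronics"], ["Electronics", "Home & Garden"]]
columns, rows = one_hot_encode(sample)
print(columns)
for row in rows:
    print(row)
```

The empty transaction becomes a row of all `False` values, which is why it still counts toward the total of 50 when supports are computed.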


In [20]:
# Running Apriori algorithm to find frequent itemsets with a minimum support of 0.5
frequent_itemsets = apriori(te_df, min_support=0.5, use_colnames=True)

In [21]:
frequent_itemsets.sort_values(by='support', ascending=False)

,support,itemsets
2,0.74,(Home & Garden)
0,0.7,(Clothing)
1,0.7,(Electronics)
3,0.5,"(Home & Garden, Clothing)"


## 🔥 Code to generate association rules

In [22]:
# Generate association rules, keeping those with confidence >= 0.5
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

In [23]:
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values(by='confidence', ascending=False)

,antecedents,consequents,support,confidence,lift
1,(Clothing),(Home & Garden),0.5,0.714286,0.965251
0,(Home & Garden),(Clothing),0.5,0.675676,0.965251
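
As a sanity check, the values in the rules table can be recomputed from the manual counts (N = 50, Clothing = 35, Home & Garden = 37, both = 25). This sketch uses the symmetric form of lift, Support(A ∪ B) / (Support(A) × Support(B)), which is algebraically the same as the definition given earlier; the variable names are illustrative.

```python
# Counts taken from the manual tables earlier in this notebook
n = 50
count_clothing = 35
count_home_garden = 37
count_both = 25

support_both = count_both / n
conf_clothing_to_home = count_both / count_clothing      # Clothing => Home & Garden
conf_home_to_clothing = count_both / count_home_garden   # Home & Garden => Clothing
# Lift is symmetric, so both rules share the same value
rule_lift = support_both / ((count_clothing / n) * (count_home_garden / n))

print(round(support_both, 6))
print(round(conf_clothing_to_home, 6))
print(round(conf_home_to_clothing, 6))
print(round(rule_lift, 6))
```

Both rules share a lift of about 0.965, slightly below 1, so under the interpretation given earlier the two categories are marginally less likely to appear together than they would be if purchased independently.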
