

# 🛒 **Apriori Algorithm**

👉 **Apriori** is an algorithm used to **find frequent itemsets** in a dataset and generate **association rules**.

It helps answer: **"If a customer buys item A, how likely are they to also buy item B?"**

---

# 🧠 **What You Need to Know**

### 📌 **1. Purpose**

* Used in **Market Basket Analysis**

* Helps in **product recommendations**



### 🔁 **2. How It Works**

1. ✅ **Find Frequent Itemsets**:

   Items that appear together frequently (based on a **minimum support**).

2. 🔄 **Generate Association Rules**:

   Rules like `A ➡️ B` with strong **confidence** and **lift**.
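To make the first step concrete, here is a minimal pure-Python sketch of the level-wise search Apriori performs. It is an illustration only, not how `mlxtend` implements it: the function name `apriori_sketch` and the toy baskets are made up for this example.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy data (not from the Instacart dataset).
toy = [
    ["Milk", "Bread", "Butter"],
    ["Milk", "Bread"],
    ["Milk", "Eggs"],
    ["Bread", "Butter"],
    ["Milk", "Bread", "Eggs"],
]
result = apriori_sketch(toy, min_support=0.4)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what gives Apriori its name: an itemset can only be frequent if all of its subsets are frequent, so most large candidates are discarded without counting them.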



### 📊 **3. Key Terms**

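* **Support (%)** 📈: Fraction of all transactions that contain the itemset (for a rule `A ➡️ B`: both A and B)

* **Confidence (%)** 🔒: Of the transactions that contain A, the fraction that also contain B

* **Lift (>1)** 🚀: Confidence divided by the support of B. A value above 1 means B is bought more often together with A than its overall frequency would predict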
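The three terms above can be computed directly from a list of baskets. The sketch below uses hypothetical helper names (`support`, `confidence`, `lift`) and made-up baskets with items `A`, `B`, `C`; it only illustrates the definitions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """confidence(A -> B) / support(B)."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical baskets for illustration.
baskets = [{"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"C"}]

print(support(baskets, {"A", "B"}))                 # 2 of 5 baskets
print(round(confidence(baskets, {"A"}, {"B"}), 3))  # 2 of the 3 baskets with A
print(round(lift(baskets, {"A"}, {"B"}), 3))
```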



### ⚙️ **4. Parameters to Set**

* `min_support`

* `min_confidence`

* `min_lift`



### ✅ **5. Example**

🛍️ If 60% of transactions contain both Milk & Bread (support = 60%)
and in 80% of Milk-buying transactions, Bread is also bought (confidence = 80%)
then Milk ➡️ Bread is a **strong rule**.
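Support and confidence alone do not fix the lift of this rule, because lift also depends on how common Bread is overall, which the example does not state. The sketch below tries a few hypothetical Bread supports (assumed values, chosen only for illustration) to show how the same 80% confidence can give a lift above, equal to, or below 1.

```python
support_milk_and_bread = 0.60    # from the example
confidence_milk_to_bread = 0.80  # from the example

# Implied by the two numbers above: confidence = support(A, B) / support(A).
support_milk = support_milk_and_bread / confidence_milk_to_bread


def lift(confidence, consequent_support):
    """Lift of a rule, given its confidence and the support of the consequent."""
    return confidence / consequent_support


print("support(Milk) =", round(support_milk, 2))

# Hypothetical overall Bread supports; each must be at least 0.60.
for support_bread in (0.65, 0.80, 0.95):
    print(support_bread, round(lift(confidence_milk_to_bread, support_bread), 3))
```

If almost every basket contains Bread, an 80% confidence tells us little, which is why `min_lift` is checked alongside `min_confidence`.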



### 💡 **6. Use Cases**

* Market basket analysis
* Recommender systems
* Web usage mining





In [17]:
# ! pip install mlxtend

## 📦 **Real-World Project Dataset (Large)**

### 🛍️ **1. Instacart Market Basket Dataset**

This is one of the most popular datasets for association rule mining.

* **📈 Size**: \~3 million orders 💥
* **🧾 Data**: Customer orders, products, aisles, departments
* **🛒 Usage**: Identify what products are bought together

🔗 **Download from Kaggle**:
[https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis](https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis)

### 📂 Dataset Includes:

* `orders.csv`: order IDs, user IDs, order timing
* `order_products__prior.csv`: product IDs in each order (used for training)
* `products.csv`: product names
* `aisles.csv`: aisle names
* `departments.csv`: department names

---

## 💡 **Project Idea Using Apriori**

### 🧠 **Goal**:

Discover frequent product combinations and generate **association rules** like:

> "If someone buys **Organic Strawberries**, they also buy **Bananas** with high confidence."

---

## 🚀 **Project Workflow**

1. ✅ Load the dataset (use `order_products__prior.csv`)
2. 🔁 Convert orders into lists of products (per transaction)
3. 🧾 One-hot encode the transactions
4. 📊 Apply Apriori to find frequent itemsets
5. 🔗 Generate association rules using confidence/lift
6. 📈 Visualize top rules with bar plots or network graphs



In [18]:
# LOADING DATASET:
import pandas as pd
ORDERS=pd.read_csv(r"C:\Users\Nagesh Agrawal\OneDrive\Desktop\6_MACHINE LEARNING\1_DATASETS\association rules\order_products__prior.csv")
PRODUCTS=pd.read_csv(r"C:\Users\Nagesh Agrawal\OneDrive\Desktop\6_MACHINE LEARNING\1_DATASETS\association rules\products.csv")

In [19]:
ORDERS  # product_id is the foreign key

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0
...,...,...,...,...
32434484,3421083,39678,6,1
32434485,3421083,11352,7,0
32434486,3421083,4600,8,0
32434487,3421083,24852,9,1


In [20]:
PRODUCTS  # product_id is the primary key

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13
...,...,...,...,...
49683,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5
49684,49685,En Croute Roast Hazelnut Cranberry,42,1
49685,49686,Artisan Baguette,112,3
49686,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8


In [21]:
PRODUCTS["product_id"].value_counts()  # 49688 rows and 49688 unique product_ids, therefore product_id is a primary key.

product_id
1        1
33142    1
33120    1
33121    1
33122    1
        ..
16566    1
16567    1
16568    1
16569    1
49688    1
Name: count, Length: 49688, dtype: int64

In [22]:
ORDERS["product_id"].value_counts()  # 49677 unique products in 32434489 rows; 11 of the 49688 products never appear.
# product_id repeats here and refers to PRODUCTS, therefore it is a foreign key in the orders table.

product_id
24852    472565
13176    379450
21137    264683
21903    241921
47209    213584
          ...  
14756         1
20264         1
31254         1
13397         1
23624         1
Name: count, Length: 49677, dtype: int64


### 🔑 Primary Key:

* A **Primary Key** uniquely identifies each row in a table.
* It **cannot have NULL values**.
* It must be **unique** for each row.


### 🌐 Foreign Key:

* A **Foreign Key** is a column that creates a **relationship** between two tables.
* It refers to the **primary key** in another table.
* It can have **duplicate values** and can be **NULL** (unless restricted).

### Quick Comparison:

| Feature      | Primary Key    | Foreign Key      |
| ------------ | -------------- | ---------------- |
| Uniqueness   | Must be unique | Can be duplicate |
| NULL Allowed | No             | Yes (optional)   |
| Purpose      | Identify row   | Link tables      |
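Instead of reading the key properties off `value_counts()`, they can also be checked directly. This is a small sketch on made-up stand-ins for the two tables (the lowercase `products` and `orders` frames and their values are hypothetical); the same expressions should apply to `PRODUCTS` and `ORDERS`.

```python
import pandas as pd

# Toy stand-ins for the PRODUCTS and ORDERS tables (values are made up).
products = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "product_name": ["Milk", "Bread", "Eggs", "Salt"],
})
orders = pd.DataFrame({
    "order_id": [10, 10, 11, 12],
    "product_id": [1, 2, 1, 3],
})

# Primary-key check: no NULLs and no duplicates.
is_primary_key = products["product_id"].notna().all() and products["product_id"].is_unique

# Foreign-key check: every product_id in orders exists in products.
is_valid_foreign_key = orders["product_id"].isin(products["product_id"]).all()

# Products that never appear in any order.
never_ordered = products.loc[
    ~products["product_id"].isin(orders["product_id"]), "product_name"
].tolist()

print(is_primary_key, is_valid_foreign_key, never_ordered)
```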



In [23]:
DATA=pd.merge(ORDERS,PRODUCTS, on="product_id", how="left")

In [24]:
DATA.head(10)  # 4 + 4 - 1 = 7 columns: product_id is shared by both tables and appears once.

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
0,2,33120,1,1,Organic Egg Whites,86,16
1,2,28985,2,1,Michigan Organic Kale,83,4
2,2,9327,3,0,Garlic Powder,104,13
3,2,45918,4,1,Coconut Butter,19,13
4,2,30035,5,0,Natural Sweetener,17,13
5,2,17794,6,1,Carrots,83,4
6,2,40141,7,1,Original Unflavored Gelatine Mix,105,13
7,2,1819,8,1,All Natural No Stir Creamy Almond Butter,88,13
8,2,43668,9,0,Classic Blend Cole Slaw,123,4
9,3,33754,1,1,Total 2% with Strawberry Lowfat Greek Strained...,120,16


| `how`     | Description                                             |
| --------- | ------------------------------------------------------- |
| `'inner'` | Only matching rows from both DataFrames                 |
| `'left'`  | All rows from the left, matches from the right (if any) |
| `'right'` | All rows from the right, matches from the left (if any) |
| `'outer'` | All rows from both, fill missing with `NaN`             |
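The effect of each `how` value is easiest to see on a tiny example. The frames below are made up for illustration; `indicator=True` is a standard `pd.merge` option that adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical toy tables: product 5 has no name, product 3 was never ordered.
left = pd.DataFrame({"product_id": [1, 2, 5], "order_id": [10, 10, 11]})
right = pd.DataFrame({"product_id": [1, 2, 3], "product_name": ["Milk", "Bread", "Eggs"]})

# Number of rows produced by each join type.
row_counts = {
    how: len(pd.merge(left, right, on="product_id", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(row_counts)

# With a left join, rows without a match keep NaN in the right-hand columns.
checked = pd.merge(left, right, on="product_id", how="left", indicator=True)
unmatched = int((checked["_merge"] == "left_only").sum())
print(checked)
print("left rows without a match:", unmatched)
```

A left join suits the merge above because every order line should be kept, whether or not a product name is found for it.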


#### 🔹 Create Transaction Data (Basket Format)

In [25]:
# Group by order_id and list product names.
BASKET = DATA.groupby('order_id')['product_name'].apply(list)
BASKET  # 3214874 unique order_ids.
# Each order_id maps to the list of product names in that order.

order_id
2          [Organic Egg Whites, Michigan Organic Kale, Ga...
3          [Total 2% with Strawberry Lowfat Greek Straine...
4          [Plain Pre-Sliced Bagels, Honey/Lemon Cough Dr...
5          [Bag of Organic Bananas, Just Crisp, Parmesan,...
6          [Cleanse, Dryer Sheets Geranium Scent, Clean D...
                                 ...                        
3421079                                      [Moisture Soap]
3421080    [Organic Whole Milk, Vanilla Bean Ice Cream, O...
3421081    [Hint of Lime Flavored Tortilla Chips, Classic...
3421082    [Fresh 99% Lean Ground Turkey, Original Spray,...
3421083    [Freeze Dried Mango Slices, Purple Carrot & bl...
Name: product_name, Length: 3214874, dtype: object

In [26]:
type(BASKET)
# BASKET is a pandas Series with the 3214874 order_ids as index and lists of product names as values.

pandas.core.series.Series

#### 🔹 One-Hot Encode the Basket

In [27]:
'''from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_data = te.fit_transform(BASKET.tolist())
DF = pd.DataFrame(te_data, columns=te.columns_)
DF'''


We get a `MemoryError` because the dense one-hot matrix is **too large to fit into memory**:
over **3.2 million transactions** × **49,677 unique items** ≈ 1.6 × 10¹¹ boolean cells, which at 1 byte per cell is about **160 GB (\~149 GiB)**.
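The size estimate above can be checked with plain arithmetic. The sketch assumes one column per distinct product and 1 byte per boolean cell (the size of a NumPy `bool`); it reports the result in both decimal GB and binary GiB.

```python
n_transactions = 3_214_874  # length of BASKET
n_items = 49_677            # unique product_id values in ORDERS

cells = n_transactions * n_items
bytes_dense = cells * 1     # assumption: 1 byte per boolean cell

gb = bytes_dense / 1e9      # decimal gigabytes
gib = bytes_dense / 2**30   # binary gibibytes

print(f"{cells:,} cells")
print(f"{gb:.1f} GB, {gib:.1f} GiB")
```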


In [28]:
# Take the first 50 transactions
subset_basket = BASKET[:50]

In [None]:
# Next 50 transactions (positions 50 to 99), kept as a second sample
subset_of_data = BASKET[50:100]
#subset_of_data.to_csv(r"C:\Users\Nagesh Agrawal\OneDrive\Desktop\6_MACHINE LEARNING\1_DATASETS\association rules\subset_basket.csv", index=False)

In [29]:
from mlxtend.preprocessing import TransactionEncoder
TE=TransactionEncoder()
TE_DATA=TE.fit(subset_basket).transform(subset_basket)
TE_DATA=pd.DataFrame(TE_DATA, columns=TE.columns_)
TE_DATA

Unnamed: 0,1% Lowfat Milk,100% Apple Juice Original,100% Cranberry Juice,100% Juice No Added Sugar Orange Tangerine,100% Juice No Sugar Added Apple,100% Pure Rosemary,100% Recycled Paper Towels,100% Whole Wheat Bread,13 Gallon Kitchen Drawstring Trash Bags,2% Milk,...,White Cheddar Semisoft Cheese,White Corn,Whole Organic Omega 3 Milk,Whole White Mushrooms,Wint-O-Green,Yellow Onions,Yellow Straightneck Squash,Yo Baby Organic Whole Milk Banana Mango Yogurt,"YoKids Squeezers Organic Low-Fat Yogurt, Strawberry",Yuba Tofu Skin
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,True,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
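For intuition, this is roughly what one-hot encoding a basket list means, written in plain Python. It is a sketch of the idea, not `TransactionEncoder`'s actual code; the function name `one_hot` and the toy baskets are hypothetical.

```python
def one_hot(transactions):
    """Return (columns, rows): one sorted column per item, one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[item in set(t) for item in columns] for t in transactions]
    return columns, rows


# Hypothetical toy baskets.
toy = [["Banana", "Carrots"], ["Banana"], ["Asparagus", "Carrots"]]
columns, rows = one_hot(toy)

print(columns)
for row in rows:
    print(row)
```

Every transaction gets one cell per item, present or not, which is why the table is mostly `False` and why the full dataset does not fit in memory as a dense matrix.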


In [30]:
from mlxtend.frequent_patterns import apriori
FREQUENT_ITEMSETS = apriori(TE_DATA, min_support=0.03, use_colnames=True)
FREQUENT_ITEMSETS
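# Each row is an itemset (single items included) whose support is at least min_support.
# With 50 transactions, min_support=0.03 means the itemset appears in at least 2 orders (2/50 = 0.04).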

Unnamed: 0,support,itemsets
0,0.04,(2% Reduced Fat Milk)
1,0.04,(Air Chilled Organic Boneless Skinless Chicken...
2,0.04,(Asparagus)
3,0.12,(Bag of Organic Bananas)
4,0.14,(Banana)
5,0.04,(Carrots)
6,0.04,(Cream Cheese)
7,0.04,(Cucumber Kirby)
8,0.04,(Extra Virgin Olive Oil)
9,0.04,(Green Beans)


In [31]:
from mlxtend.frequent_patterns import association_rules
RULES = association_rules(FREQUENT_ITEMSETS, metric="lift", min_threshold=1)
RULES

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Bag of Organic Bananas),(Organic Raspberries),0.12,0.06,0.04,0.333333,5.555556,1.0,0.0328,1.41,0.931818,0.285714,0.29078,0.5
1,(Organic Raspberries),(Bag of Organic Bananas),0.06,0.12,0.04,0.666667,5.555556,1.0,0.0328,2.64,0.87234,0.285714,0.621212,0.5
2,(Bag of Organic Bananas),(Organic Tomato Cluster),0.12,0.04,0.04,0.333333,8.333333,1.0,0.0352,1.44,1.0,0.333333,0.305556,0.666667
3,(Organic Tomato Cluster),(Bag of Organic Bananas),0.04,0.12,0.04,1.0,8.333333,1.0,0.0352,inf,0.916667,0.333333,1.0,0.666667
4,(Banana),(Nilla Wafers),0.14,0.04,0.04,0.285714,7.142857,1.0,0.0344,1.344,1.0,0.285714,0.255952,0.642857
5,(Nilla Wafers),(Banana),0.04,0.14,0.04,1.0,7.142857,1.0,0.0344,inf,0.895833,0.285714,1.0,0.642857
6,(Organic Avocado),(Banana),0.1,0.14,0.04,0.4,2.857143,1.0,0.026,1.433333,0.722222,0.2,0.302326,0.342857
7,(Banana),(Organic Avocado),0.14,0.1,0.04,0.285714,2.857143,1.0,0.026,1.26,0.755814,0.2,0.206349,0.342857
8,(Michigan Organic Kale),(Carrots),0.04,0.04,0.04,1.0,25.0,1.0,0.0384,inf,1.0,1.0,1.0,1.0
9,(Carrots),(Michigan Organic Kale),0.04,0.04,0.04,1.0,25.0,1.0,0.0384,inf,1.0,1.0,1.0,1.0
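The metric columns can be reproduced by hand from the three support values. The sketch below recomputes row 0 (`Bag of Organic Bananas` ➡️ `Organic Raspberries`) using the standard definitions of confidence, lift, leverage and conviction; the input numbers are copied from the table above.

```python
antecedent_support = 0.12  # Bag of Organic Bananas
consequent_support = 0.06  # Organic Raspberries
support = 0.04             # both in the same order

confidence = support / antecedent_support
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 6), round(lift, 6), round(leverage, 4), round(conviction, 2))
```

Because this subset has only 50 orders, a support of 0.04 corresponds to just 2 orders, so rules with confidence 1.0 and very high lift here rest on very little evidence and should be rechecked on a larger sample.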
