# Apriori Algorithm - Association Rule Learning

## What is Association Rule Learning?

Association Rule Learning is a machine learning technique that identifies frequent patterns and relationships between different items in large datasets. It's particularly useful in market basket analysis, where we want to discover which products are frequently bought together.

## What is the Apriori Algorithm?

The Apriori algorithm is one of the most popular algorithms for association rule learning. It works by:

1. **Finding frequent itemsets**: Identifying groups of items that appear together frequently
2. **Generating association rules**: Creating rules that show relationships between items
3. **Filtering by metrics**: Using support, confidence, and lift to find meaningful rules
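
The first step above can be sketched in plain Python. This is a minimal illustration of the level-wise search and the pruning idea behind Apriori (an itemset can only be frequent if all of its subsets are frequent), not the implementation used by any particular library. The function name `frequent_itemsets` and the toy baskets are assumptions made for the example; "support" here means the fraction of transactions that contain an itemset.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset (frozenset): support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    items = {item for basket in baskets for item in basket}
    current = {frozenset([item]) for item in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy baskets, for illustration only
toy_transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
result = frequent_itemsets(toy_transactions, min_support=0.5)
for itemset, value in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

With these baskets the three-item candidate is never counted, because one of its pairs (butter and milk) is already below the threshold. That early pruning is what keeps the search manageable.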

## Key Concepts:

- **Support**: How frequently an itemset appears in the dataset
- **Confidence**: How often rule A → B is true when A is present
- **Lift**: How much more likely B is to be bought when A is bought (compared to B's general popularity)
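
To make the three definitions concrete, here is a small sketch that computes them by hand on a made-up set of five baskets. The helper names (`support`, `confidence`, `lift`) and the data are assumptions for illustration; they are not part of the apyori library.

```python
# Hypothetical toy baskets, for illustration only
toy_baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(lhs, rhs, baskets):
    """Estimate of P(rhs | lhs): support of both sides divided by support of lhs."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)


def lift(lhs, rhs, baskets):
    """Confidence of lhs -> rhs divided by the overall support of rhs."""
    return confidence(lhs, rhs, baskets) / support(rhs, baskets)


# Rule: butter -> bread
s = support({"butter", "bread"}, toy_baskets)       # 2 of 5 baskets
c = confidence({"butter"}, {"bread"}, toy_baskets)  # every butter basket also has bread
l = lift({"butter"}, {"bread"}, toy_baskets)        # confidence relative to bread's support of 4/5
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

Here the confidence is as high as it can be, yet the lift is only modestly above 1, because bread is in most baskets anyway. This is why the later steps filter on lift as well as confidence.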

## Business Application:

In this example, we'll analyze market basket data to discover:
- Which products are frequently bought together
- Recommendations for product placement
- Cross-selling opportunities
- Customer behavior patterns

## Step 1: Importing the Required Libraries

Before we start, we need to import the necessary libraries:

- **numpy**: For numerical operations and array handling
- **matplotlib.pyplot**: For creating visualizations (imported by convention; this example does not plot anything)
- **pandas**: For data manipulation and analysis
- **apyori**: The library that implements the Apriori algorithm

**Note**: The apyori library is not installed by default, so we'll need to install it first.

### Installing the Apyori Library

The apyori library contains the implementation of the Apriori algorithm. We install it using pip:

In [0]:
!pip install apyori

### Importing Standard Libraries

Now let's import our standard data science libraries:

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Step 2: Data Preprocessing

In this step, we'll load and prepare our market basket data for the Apriori algorithm.

### Understanding Market Basket Data

Market basket data typically consists of transactions where each row represents a shopping basket. In our CSV file, a column is an item slot rather than a specific product: each row lists the products in that basket from left to right, and unused slots are left empty. Our dataset contains 7,501 transactions with up to 20 different products per transaction.

### Data Structure Transformation

The Apriori algorithm expects data in a specific format:
- Each transaction should be a list of items
- We need to convert our CSV data into a list of transactions
- Each transaction will contain the products bought together
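
As a rough illustration of the target format, the sketch below parses a few ragged CSV rows into a list of item lists using only the standard library. The three rows in `raw_csv` are a hypothetical excerpt written for this example, and `sample_transactions` is an illustrative name. The notebook itself loads the real file with pandas.

```python
import csv
import io

# Hypothetical excerpt in the same layout as the real file:
# one basket per line, a different number of items on each line
raw_csv = "shrimp,almonds,avocado\nburgers,meatballs,eggs\nchutney,,\n"

sample_transactions = [
    [item.strip() for item in row if item.strip()]  # drop empty item slots
    for row in csv.reader(io.StringIO(raw_csv))
    if row                                          # skip completely blank lines
]

for basket in sample_transactions:
    print(basket)
```

Each inner list holds only the items that were actually bought, which is the shape the Apriori implementation expects.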

Let's load and transform our data:

In [0]:
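# header=None: the file has no header row, so its first line is a transaction too
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header=None)
# Naive conversion: every row becomes a fixed-width list of strings.
# Empty cells come through as the string 'nan'; the conversion cell further down filters them out.
transactions = []
for i in range(len(dataset)):
    transactions.append([str(dataset.values[i, j]) for j in range(dataset.shape[1])])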

In [None]:
# Let's first examine our dataset structure
print("Dataset shape:", dataset.shape)
print("\nFirst few rows of the dataset:")
print(dataset.head())
print("\nSample transaction (first row):")
print([item for item in dataset.iloc[0] if pd.notna(item)])

### Converting Data to Transaction Format

Now let's convert our data into the format required by the Apriori algorithm. We need to:
1. Remove empty/null values from each transaction
2. Convert each row into a list of items
3. Store all transactions in a list

In [None]:
# Convert the data into transaction format
transactions = []
for row in dataset.to_numpy():
    # Keep only the non-null items from the current row
    transaction = [str(item) for item in row if pd.notna(item)]
    if transaction:  # Only add non-empty transactions
        transactions.append(transaction)

print(f"Total number of transactions: {len(transactions)}")
print("Sample transactions:")
for i, transaction in enumerate(transactions[:3], start=1):
    print(f"Transaction {i}: {transaction}")

print(f"\nAverage items per transaction: {sum(len(t) for t in transactions) / len(transactions):.2f}")

## Step 3: Training the Apriori Algorithm

Now we'll apply the Apriori algorithm to find association rules in our market basket data.

### Understanding the Parameters

The Apriori algorithm uses several important parameters:

1. **min_support**: The minimum support threshold
   - Support = (Number of transactions containing the itemset) / (Total transactions)
   - We set it to 0.003, meaning an itemset must appear in at least 0.3% of transactions
   - For 7,501 transactions, this means at least ~23 transactions

2. **min_confidence**: The minimum confidence threshold
   - Confidence = P(B|A) = (Support of A and B) / (Support of A)
   - We set it to 0.2, meaning if someone buys A, there's at least a 20% chance they'll buy B

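3. **min_lift**: The minimum lift threshold
   - Lift = Confidence / (Support of B)
   - Lift > 1 means A and B appear together more often than they would if they were bought independently
   - We set it to 3, meaning B must be at least 3 times more likely to be bought when A is bought

4. **min_length & max_length**: The size of itemsets to consider
   - We focus on pairs (length = 2) to find simple "A → B" relationships
   - `max_length=2` is what caps the itemset size. apyori does not appear to act on `min_length`, so in practice single items are excluded by `min_lift`: a rule with an empty left-hand side has a lift of 1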
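
The formulas above can be checked against one of the rules this notebook reports further down ('light cream' → 'chicken'). The sketch below only rearranges those formulas; the support, confidence and lift values are copied from the raw output, and the variable names are chosen for this example.

```python
import math

n_transactions = 7501
min_support = 0.003

# Smallest whole number of transactions that satisfies the support threshold
min_count = math.ceil(min_support * n_transactions)

# Values reported for the rule 'light cream' -> 'chicken'
support_ab = 0.004532728969470737
confidence_ab = 0.29059829059829057
lift_ab = 4.84395061728395

# Rearranging Confidence = Support(A and B) / Support(A)
support_a = support_ab / confidence_ab
# Rearranging Lift = Confidence / Support(B)
support_b = confidence_ab / lift_ab

count_ab = round(support_ab * n_transactions)  # baskets with both items
count_a = round(support_a * n_transactions)    # baskets with light cream
count_b = round(support_b * n_transactions)    # baskets with chicken

print(min_count, count_ab, count_a, count_b)
```

Read this way, the rule rests on 34 baskets that contain both items, out of 117 with light cream and 450 with chicken. That is comfortably above the 23-basket minimum, but still a small sample.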

Let's apply the algorithm:

In [0]:
from apyori import apriori
rules = apriori(transactions = transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2, max_length = 2)

### Applying the Apriori Algorithm

The cell below repeats the same call with one argument per line and a comment on each threshold. It reuses the `apriori` function imported above:

In [None]:
print("Setting up the Apriori rule generator for", len(transactions), "transactions...")

# apriori() returns a lazy generator: the rules are only computed when it is
# consumed, for example by list(rules) in the next step
rules = apriori(
    transactions=transactions,
    min_support=0.003,      # Item(s) must appear in at least 0.3% of transactions
    min_confidence=0.2,     # 20% confidence threshold
    min_lift=3,             # Lift of at least 3
    min_length=2,           # Consider pairs of items
    max_length=2            # Only pairs, not larger groups
)

print("Rule generator created; the mining itself runs when the results are listed.")

## Step 4: Analyzing and Visualizing the Results

The Apriori algorithm has found association rules in our data. Now we need to extract and interpret these results.

### Understanding the Raw Output

The apriori function returns a generator object containing RelationRecord objects. Each record contains:
- **Items**: The itemset (products that appear together)
- **Support**: How frequently this itemset appears
- **Ordered Statistics**: Rules with confidence and lift values
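
The sketch below mimics that structure with `collections.namedtuple` to show how the fields nest. The two classes here are stand-ins built from the field names visible in the printed output, on the assumption that apyori's own records behave like named tuples (which is why index-based access such as `result[2][0][3]` works in the tutorial's first `inspect` function).

```python
from collections import namedtuple

# Stand-in record types (assumption: same field names and order as in the printed output)
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)

# First record from the raw output, rebuilt by hand
record = RelationRecord(
    items=frozenset({"chicken", "light cream"}),
    support=0.004532728969470737,
    ordered_statistics=[
        OrderedStatistic(
            items_base=frozenset({"light cream"}),
            items_add=frozenset({"chicken"}),
            confidence=0.29059829059829057,
            lift=4.84395061728395,
        )
    ],
)

first_rule = record.ordered_statistics[0]
(lhs,) = first_rule.items_base  # unpack the single antecedent item
(rhs,) = first_rule.items_add   # unpack the single consequent item

# Attribute access and positional access reach the same values
same_support = record.support == record[1]
same_lift = first_rule.lift == record[2][0][3]
print(f"{lhs} -> {rhs}", same_support, same_lift)
```

Attribute access (`record.support`) is easier to read than positional indexing and does not break silently if the field order is misremembered.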

### Displaying the Raw Results

The generator can only be iterated once, so let's first convert the results to a list that we can examine as often as we like:

In [0]:
results = list(rules)

In [None]:
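# `rules` is a generator and was already consumed by list(rules) above.
# Calling list(rules) a second time would return an empty list, so we reuse `results`.
print(f"Number of association rules found: {len(results)}")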

if len(results) > 0:
    print("\nFirst rule structure:")
    print("Rule:", results[0])
    print("\nLet's examine the components:")
    print("Items:", results[0].items)
    print("Support:", results[0].support)
    print("Ordered Statistics:", results[0].ordered_statistics)
else:
    print("No rules found with the current parameters. You might need to lower the thresholds.")

In [0]:
# Display the raw results
if len(results) > 0:
    print("Raw results from Apriori algorithm:")
    for i, result in enumerate(results[:5]):  # Show first 5 results
        print(f"\nRule {i+1}:")
        print(result)
else:
    print("No association rules found with current parameters.")
    print("This might happen if the thresholds are too strict for the dataset.")
    print("Consider lowering min_support, min_confidence, or min_lift values.")

[RelationRecord(items=frozenset({'chicken', 'light cream'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]),
 RelationRecord(items=frozenset({'mushroom cream sauce', 'escalope'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]),
 RelationRecord(items=frozenset({'pasta', 'escalope'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)]),
 RelationRecord(items=frozenset({'honey', 'fromage blanc'}), support=0.003332888948140248, ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}), items_add=frozenset({'honey'}), confidence=0

### Creating a Structured DataFrame

The raw output is difficult to read. Let's create a function to extract the important information and organize it into a clear pandas DataFrame.

#### What Each Column Means:
- **Left Hand Side (LHS)**: The "if" part of the rule (antecedent)
- **Right Hand Side (RHS)**: The "then" part of the rule (consequent)
- **Support**: How often both items appear together
- **Confidence**: How often the rule is correct (when LHS is bought, RHS is also bought)
- **Lift**: How much better this rule is compared to random chance

In [0]:
def inspect(results):
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))
resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

In [None]:
# Enhanced function to inspect results using the records' named fields
def inspect(results):
    """
    Extract association rule components from apyori results.
    Returns a list of (LHS, RHS, Support, Confidence, Lift) tuples, one per rule.
    """
    rows = []
    for result in results:
        # One itemset can yield several rules; each of them shares the itemset's support
        for rule_stat in result.ordered_statistics:
            rows.append((
                ", ".join(sorted(rule_stat.items_base)),  # empty string if there is no antecedent
                ", ".join(sorted(rule_stat.items_add)),
                result.support,
                rule_stat.confidence,
                rule_stat.lift,
            ))
    return rows

# Create the DataFrame
if len(results) > 0:
    resultsinDataFrame = pd.DataFrame(
        inspect(results), 
        columns=['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift']
    )
    print(f"Successfully created DataFrame with {len(resultsinDataFrame)} association rules")
else:
    print("No results to convert to DataFrame")

### Displaying All Association Rules

Let's examine all the association rules discovered by our algorithm. Each row represents a rule like "If customers buy X, then they're likely to also buy Y".

In [0]:
resultsinDataFrame

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |


In [None]:
# Display the results with better formatting
if 'resultsinDataFrame' in locals() and not resultsinDataFrame.empty:
    print("Association Rules Found:")
    print("=" * 80)
    
    # Format the display for better readability
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', 30)
    
    display_df = resultsinDataFrame.copy()
    display_df['Support'] = display_df['Support'].round(4)
    display_df['Confidence'] = display_df['Confidence'].round(4)
    display_df['Lift'] = display_df['Lift'].round(2)
    
    print(display_df)
    
    print(f"\nSummary:")
    print(f"Total rules found: {len(resultsinDataFrame)}")
    print(f"Average confidence: {resultsinDataFrame['Confidence'].mean():.3f}")
    print(f"Average lift: {resultsinDataFrame['Lift'].mean():.2f}")
else:
    print("No association rules found or DataFrame not created.")

### Top Association Rules by Lift

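Lift is often the most informative metric because it compares how often two items are actually bought together with how often that would happen if they were bought independently. Confidence alone can look high simply because the right-hand item is popular. Let's look at the top rules sorted by lift value.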

#### Interpreting Lift Values:
- **Lift = 1**: The rule performs no better than random chance
- **Lift > 1**: Positive correlation - items appear together more often than expected
- **Lift < 1**: Negative correlation - items appear together less often than expected
- **Higher lift values**: Stronger associations
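
One property worth keeping in mind when reading lift: it is symmetric, while confidence is not. The sketch below shows this with made-up counts. The function name `rule_metrics` and the numbers are assumptions for illustration.

```python
def rule_metrics(count_lhs, count_rhs, count_both, n_transactions):
    """Return (confidence, lift) for the rule lhs -> rhs from raw basket counts."""
    confidence = count_both / count_lhs
    lift = confidence / (count_rhs / n_transactions)
    return confidence, lift


# Hypothetical counts: 1000 baskets, 100 contain A, 400 contain B, 60 contain both
n = 1000
conf_ab, lift_ab = rule_metrics(100, 400, 60, n)  # rule A -> B
conf_ba, lift_ba = rule_metrics(400, 100, 60, n)  # rule B -> A

print(f"A -> B: confidence={conf_ab:.2f}, lift={lift_ab:.2f}")
print(f"B -> A: confidence={conf_ba:.2f}, lift={lift_ba:.2f}")
```

Both directions share the same lift, so lift says that two items are associated but not which one should be treated as the "if" side. The confidence values help decide which direction is more useful as a recommendation.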

Let's examine the top 10 rules with the highest lift (only 9 rules passed our thresholds, so all of them are shown):

In [0]:
resultsinDataFrame.nlargest(n = 10, columns = 'Lift')

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |


In [None]:
# Display top rules by lift with detailed analysis
if 'resultsinDataFrame' in locals() and not resultsinDataFrame.empty:
    print("TOP 10 ASSOCIATION RULES BY LIFT")
    print("=" * 50)
    
    top_rules = resultsinDataFrame.nlargest(n=10, columns='Lift')
    
    # Format for better display
    for idx, (index, rule) in enumerate(top_rules.iterrows(), 1):
        print(f"\n{idx}. Rule: '{rule['Left Hand Side']}' → '{rule['Right Hand Side']}'")
        print(f"   Support: {rule['Support']:.4f} ({rule['Support']*len(transactions):.0f} transactions)")
        print(f"   Confidence: {rule['Confidence']:.4f} ({rule['Confidence']*100:.1f}%)")
        print(f"   Lift: {rule['Lift']:.2f} ({rule['Lift']:.1f}x the co-occurrence expected under independence)")
        
        # Business interpretation
        if rule['Lift'] > 3:
            strength = "Very Strong"
        elif rule['Lift'] > 2:
            strength = "Strong"
        else:
            strength = "Moderate"
            
        print(f"   Interpretation: {strength} association - customers who buy")
        print(f"   '{rule['Left Hand Side']}' are {rule['Lift']:.1f}x more likely to also buy '{rule['Right Hand Side']}'")
    
    print(f"\nDataFrame sorted by Lift (top 10):")
    display(top_rules)
    
else:
    print("No rules available to sort by lift.")

## Step 5: Business Insights and Recommendations

Based on our association rule analysis, we can derive several actionable business insights:

### Key Findings:

1. **Product Placement**: Items with high lift values should be placed near each other in the store
2. **Cross-selling Opportunities**: Train sales staff to recommend the "Right Hand Side" items when customers show interest in "Left Hand Side" items
3. **Bundle Promotions**: Create product bundles based on the strongest associations
4. **Inventory Management**: Ensure adequate stock of both items in strong association pairs
5. **Marketing Campaigns**: Design targeted campaigns promoting complementary products

### Parameter Sensitivity:

Our current parameters were:
- **min_support = 0.003**: Itemsets must appear in at least 0.3% of transactions
- **min_confidence = 0.2**: Rules must be correct at least 20% of the time  
- **min_lift = 3**: Associations must be 3x better than random chance
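
To get a feel for how strongly the number of results depends on a threshold, the sketch below counts how many item pairs survive different `min_support` values on a tiny made-up dataset. It uses only the standard library. `count_frequent_pairs` and the baskets are assumptions for illustration, and the real dataset will behave differently in its details.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, for illustration only
toy_baskets = [
    ["pasta", "tomato sauce", "olive oil"],
    ["pasta", "tomato sauce"],
    ["pasta", "olive oil"],
    ["milk", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs", "milk"],
    ["pasta", "milk"],
    ["bread"],
]


def count_frequent_pairs(baskets, min_support):
    """Count the item pairs whose support is at least min_support."""
    n = len(baskets)
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(set(basket)), 2)
    )
    return sum(1 for count in pair_counts.values() if count / n >= min_support)


# Sweep the threshold and record how many pairs remain
sweep = {threshold: count_frequent_pairs(toy_baskets, threshold)
         for threshold in (0.1, 0.25, 0.5)}
for threshold, n_pairs in sweep.items():
    print(f"min_support={threshold}: {n_pairs} frequent pairs")
```

Even on eight baskets the count drops quickly as the threshold rises. On real data it is worth running a similar sweep before settling on one set of parameters.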

### Next Steps:

1. **Experiment with Parameters**: Try different thresholds to find more or fewer rules
2. **Seasonal Analysis**: Run this analysis on different time periods to find seasonal patterns
3. **Customer Segmentation**: Apply association rules to different customer segments
4. **Implementation**: Use these insights to optimize store layout and marketing strategies

### Technical Notes:

- The Apriori algorithm makes one pass over the transactions for each itemset size, and the number of candidate itemsets can grow exponentially with the number of unique items
- For larger datasets, consider using more efficient algorithms like FP-Growth
- Regular reanalysis is recommended as customer behavior patterns change over time