# Market Basket Analysis with Apriori Algorithm

## Overview
This notebook runs a market basket analysis with the Apriori algorithm. It covers data preprocessing, association rule mining, RFM-based customer segmentation, and interactive visualizations.

## Key Features
- **Data Loading & Cleaning**: Robust data preprocessing pipeline
- **Exploratory Data Analysis**: Transaction pattern analysis
- **Association Rule Mining**: Apriori algorithm implementation with mlxtend
- **Customer Segmentation**: RFM analysis with K-means clustering
- **Interactive Visualizations**: Plotly-based charts and network graphs

---

## 1. Library Imports and Setup

Essential libraries for data analysis, visualization, and machine learning operations.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
from datetime import datetime
from mlxtend.frequent_patterns import apriori, association_rules
import plotly.graph_objects as go
import plotly.express as px
import networkx as nx
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Set display options for better output readability
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', '{:.2f}'.format)
warnings.filterwarnings('ignore')

# Create a logger function to track each step
def log_step(step_name):
    print(f"\n{'=' * 80}")
    print(f"STEP: {step_name}")
    print(f"{'=' * 80}")

## 2. Data Loading

Load and examine the transaction dataset structure.

In [2]:
# =========================
# STEP 1: DATA LOADING
# =========================
log_step("DATA LOADING")

def load_data(file_path, sep=';', dtype_dict=None):
    """
    Load data from a file with specified parameters.
    
    Parameters:
    -----------
    file_path : str
        Path to the data file
    sep : str, default ';'
        Delimiter to use
    dtype_dict : dict, default None
        Dictionary of column data types
        
    Returns:
    --------
    pandas.DataFrame
        Loaded DataFrame
    """
    if dtype_dict is None:
        dtype_dict = {'BillNo': str}
        
    try:
        df = pd.read_csv(file_path, sep=sep, dtype=dtype_dict)
        print(f"✓ Successfully loaded data from {file_path}")
        print(f"✓ Dataset shape: {df.shape}")
        return df
    except Exception as e:
        print(f"✗ Error loading data: {e}")
        return None

# Load the dataset
purchase_df = load_data('Assignment-1_Data.csv')


STEP: DATA LOADING
✗ Error loading data: [Errno 2] No such file or directory: 'Assignment-1_Data.csv'


## 3. Data Exploration

Initial examination of dataset characteristics and quality.

In [3]:
# =========================
# STEP 2: INITIAL DATA EXPLORATION
# =========================
log_step("INITIAL DATA EXPLORATION")

def explore_data(df):
    """
    Perform initial exploration of the dataset.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame to explore
    """
    # Display basic information
    print("\n📊 Basic DataFrame Information:")
    print(df.info())
    
    # Check for missing values
    print("\n🔍 Missing Values Check:")
    missing_values = df.isnull().sum()
    print(missing_values[missing_values > 0] if missing_values.sum() > 0 else "No missing values found!")
    
    # Show unique counts for categorical columns
    print("\n🔢 Unique Value Counts:")
    print(df.nunique())
    
    # Display a sample of the data
    print("\n📋 Data Sample:")
    print(df.head())
    
    # Generate descriptive statistics
    print("\n📈 Descriptive Statistics:")
    print(df.describe().round(2))
    
    # Check for duplicates
    print(f"\n🔄 Duplicate Rows: {df.duplicated().sum()}")
    
    return

# Explore the loaded dataset
explore_data(purchase_df)


STEP: INITIAL DATA EXPLORATION

📊 Basic DataFrame Information:


AttributeError: 'NoneType' object has no attribute 'info'

## 4. Data Cleaning

Preprocessing pipeline to handle missing values, data types, and invalid records.
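
As a minimal sketch of the two type conversions in the cell below, the snippet parses one day-first timestamp and one decimal-comma price with the standard library only. The sample values are made up for illustration; only the format string `%d.%m.%Y %H:%M` and the comma-to-dot replacement are taken from the notebook.

```python
from datetime import datetime

# Hypothetical raw values in the layout the cleaning step expects
raw_date = "01.12.2010 08:26"
raw_price = "2,55"

# Day-first timestamp, same format string as in clean_data
parsed_date = datetime.strptime(raw_date, "%d.%m.%Y %H:%M")

# Decimal comma -> decimal point before casting to float
parsed_price = float(raw_price.replace(",", "."))

print(parsed_date.isoformat())  # 2010-12-01T08:26:00
print(parsed_price)             # 2.55
```

A value that does not match the format string raises `ValueError` here, which is why the notebook wraps the equivalent pandas calls in `try`/`except`.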

In [None]:
# =========================
# STEP 3: DATA CLEANING
# =========================
log_step("DATA CLEANING")

def clean_data(df):
    """
    Clean the dataset by handling missing values, converting data types,
    and filtering out records with invalid values.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame to clean
    
    Returns:
    --------
    pandas.DataFrame
        Cleaned DataFrame
    """
    # Create a copy to avoid modifying the original
    clean_df = df.copy()
    
    # Track the original row count
    original_rows = len(clean_df)
    print(f"Original row count: {original_rows}")
    
    # Convert date column to datetime
    try:
        clean_df['Date'] = pd.to_datetime(clean_df['Date'], format='%d.%m.%Y %H:%M')
        print("✓ Converted 'Date' column to datetime")
    except Exception as e:
        print(f"✗ Error converting dates: {e}")
    
    # Convert price column to float
    try:
        clean_df['Price'] = clean_df['Price'].str.replace(',', '.').astype(float)
        print("✓ Converted 'Price' column to float")
    except Exception as e:
        print(f"✗ Error converting prices: {e}")
    
    # Drop rows with missing values first: astype(str) can turn NaN into
    # the string 'nan', which dropna() would no longer treat as missing
    clean_df = clean_df.dropna()
    print(f"✓ Dropped rows with missing values. Remaining: {len(clean_df)}")
    
    # Convert CustomerID to string if it exists
    if 'CustomerID' in clean_df.columns:
        clean_df['CustomerID'] = clean_df['CustomerID'].astype(str)
        print("✓ Converted 'CustomerID' column to string")
    
    # Filter out rows with non-positive quantity or price
    clean_df = clean_df[clean_df['Quantity'] > 0]
    clean_df = clean_df[clean_df['Price'] > 0]
    print(f"✓ Filtered out rows with non-positive quantity or price. Remaining: {len(clean_df)}")
    
    # Calculate and report the percentage of data retained
    retained_pct = (len(clean_df) / original_rows) * 100
    print(f"✓ Retained {retained_pct:.2f}% of original data after cleaning")
    
    return clean_df

# Clean the dataset
cleaned_purchase_df = clean_data(purchase_df)

## 5. Exploratory Data Analysis

Analyzing transaction patterns, customer behavior, and product popularity.

In [None]:
# =========================
# STEP 4: EXPLORATORY DATA ANALYSIS
# =========================
log_step("EXPLORATORY DATA ANALYSIS")

def perform_eda(df):
    """
    Perform exploratory data analysis on the cleaned dataset.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Cleaned DataFrame to analyze
    """
    # Temporal analysis
    if 'Date' in df.columns:
        # Extract date components
        df['Year'] = df['Date'].dt.year
        df['Month'] = df['Date'].dt.month
        df['Day'] = df['Date'].dt.day
        df['DayOfWeek'] = df['Date'].dt.dayofweek
        df['Hour'] = df['Date'].dt.hour
        
        print("\n📅 Transaction distribution by month:")
        monthly_counts = df.groupby('Month').size()
        print(monthly_counts)
        
        print("\n📅 Transaction distribution by day of week:")
        day_names = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
                     4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
        # Label each count with its weekday name so names and counts print together
        day_counts = df.groupby('DayOfWeek').size().rename(index=day_names)
        print(day_counts)
    
    # Country analysis if available
    if 'Country' in df.columns:
        print("\n🌍 Transaction distribution by country:")
        country_counts = df['Country'].value_counts()
        print(country_counts.head(10))
    
    # Product analysis
    print("\n🛒 Top 10 most popular products:")
    top_products = df.groupby('Itemname').size().sort_values(ascending=False).head(10)
    print(top_products)
    
    # Customer analysis
    if 'CustomerID' in df.columns:
        print("\n👤 Customer purchase frequency:")
        customer_purchases = df.groupby('CustomerID').size().describe().round(2)
        print(customer_purchases)
    
    # Transaction value analysis
    df['TransactionValue'] = df['Quantity'] * df['Price']
    print("\n💰 Transaction value statistics:")
    transaction_stats = df.groupby('BillNo')['TransactionValue'].sum().describe().round(2)
    print(transaction_stats)
    
    return df

# Perform EDA on the cleaned dataset
df_with_eda = perform_eda(cleaned_purchase_df)

## 6. Transaction Basket Creation

Converting transaction data into market basket format for association rule mining.
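
To make the target format concrete, here is a minimal sketch of one-hot encoding in plain Python: one row per basket, one column per distinct item. The three baskets and their item names are hypothetical and are not taken from the dataset.

```python
# Hypothetical baskets: one list of item names per bill
baskets = [
    ["BREAD", "MILK"],
    ["BREAD", "BUTTER", "MILK"],
    ["BUTTER"],
]

# Column order of the matrix: every distinct item, sorted
items = sorted({item for basket in baskets for item in basket})

# One row per basket, True where the basket contains the item
one_hot = [[item in basket for item in items] for basket in baskets]

print(items)
for row in one_hot:
    print(row)
```

The cell below builds the same kind of matrix with `pd.crosstab`, using the transaction index as the row key.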

In [None]:
# =========================
# STEP 5: BASKET TRANSFORMATION
# =========================
log_step("BASKET TRANSFORMATION")

def create_transaction_basket(df):
    """
    Transform the transaction data into a format suitable for market basket analysis.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Cleaned DataFrame with transaction data
    
    Returns:
    --------
    pandas.DataFrame
        Transaction data in basket format
    pandas.DataFrame
        One-hot encoded transaction data
    """
    # Create transactions by grouping items by bill number
    print("Creating transaction baskets...")
    # Collect items as lists: joining and re-splitting on ', ' would break
    # item names that themselves contain ', '
    transaction = df.groupby(['BillNo', 'Date'])['Itemname'].apply(list).reset_index()
    print(f"✓ Created {len(transaction)} transaction baskets")
    
    # Display a sample of the transactions
    print("\nSample transactions:")
    print(transaction.head())
    
    # Drop unnecessary columns
    transaction = transaction.drop(columns=['BillNo', 'Date'])
    
    # Expand each basket into one row per item (the index identifies the basket)
    transaction = transaction.explode('Itemname')
    
    # Create a crosstab (binary matrix) of items per transaction
    print("\nCreating one-hot encoded transaction matrix...")
    basket_encoded = pd.crosstab(index=transaction.index, columns=transaction['Itemname'])
    
    # Convert counts to a boolean matrix (mlxtend's apriori warns on non-bool input)
    basket_encoded = basket_encoded.astype(bool)
    
    print(f"✓ Created one-hot encoded matrix with {basket_encoded.shape[0]} transactions and {basket_encoded.shape[1]} unique items")
    
    # Output sample of the binary matrix
    print("\nSample of one-hot encoded basket matrix:")
    print(basket_encoded.iloc[:5, :5])
    
    return transaction, basket_encoded

# Create transaction basket and one-hot encoded matrix
transaction_df, basket_matrix = create_transaction_basket(cleaned_purchase_df)

## 7. Apriori Algorithm Implementation

Mining frequent itemsets and generating association rules using the Apriori algorithm.
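
The three metrics used to filter rules in this section can be computed by hand on a toy example. The sketch below uses the standard definitions of support, confidence, and lift; the four baskets and the rule BREAD → MILK are hypothetical.

```python
# Hypothetical baskets, each a set of item names
baskets = [
    {"BREAD", "MILK"},
    {"BREAD", "BUTTER", "MILK"},
    {"BREAD", "BUTTER"},
    {"MILK"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

antecedent, consequent = {"BREAD"}, {"MILK"}

rule_support = support(antecedent | consequent)   # P(A and C)
confidence = rule_support / support(antecedent)   # P(C | A)
lift = confidence / support(consequent)           # P(C | A) / P(C)

print(f"support={rule_support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.50 confidence=0.67 lift=0.89
```

In this toy data the rule clears a 0.5 confidence threshold but has lift below 1, meaning MILK is slightly less likely in baskets with BREAD than overall. That is the case a lift threshold is meant to catch.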

In [None]:
# =========================
# STEP 6: APRIORI ALGORITHM IMPLEMENTATION
# =========================
log_step("APRIORI ALGORITHM IMPLEMENTATION")

def apply_apriori(basket_matrix, min_support=0.01, min_confidence=0.5, min_lift=1.0):
    """
    Apply the Apriori algorithm to generate frequent itemsets and association rules.
    
    Parameters:
    -----------
    basket_matrix : pandas.DataFrame
        One-hot encoded transaction data
    min_support : float, default 0.01
        Minimum support threshold for frequent itemsets
    min_confidence : float, default 0.5
        Minimum confidence threshold for association rules
    min_lift : float, default 1.0
        Minimum lift threshold for association rules
    
    Returns:
    --------
    pandas.DataFrame
        Frequent itemsets
    pandas.DataFrame
        Association rules
    """
    # Generate frequent itemsets
    print(f"Generating frequent itemsets with minimum support = {min_support}...")
    try:
        frequent_itemsets = apriori(basket_matrix, min_support=min_support, use_colnames=True)
        print(f"✓ Found {len(frequent_itemsets)} frequent itemsets")
        
        # Display a sample of frequent itemsets
        if not frequent_itemsets.empty:
            print("\nSample frequent itemsets:")
            print(frequent_itemsets.sort_values('support', ascending=False).head())
        else:
            print("No frequent itemsets found with the specified parameters.")
            return pd.DataFrame(), pd.DataFrame()
        
        # Generate association rules
        print(f"\nGenerating association rules with minimum confidence = {min_confidence}...")
        # Keep rules whose confidence is at least min_confidence
        rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
        
        # Keep rules whose lift is at least min_lift, as documented above
        rules = rules[rules['lift'] >= min_lift]
        
        print(f"✓ Found {len(rules)} association rules")
        
        # Display a sample of association rules
        if not rules.empty:
            # Convert frozensets to lists for better readability
            sample_rules = rules.sort_values('lift', ascending=False).head().copy()
            sample_rules['antecedents'] = sample_rules['antecedents'].apply(lambda x: list(x))
            sample_rules['consequents'] = sample_rules['consequents'].apply(lambda x: list(x))
            print("\nSample association rules (sorted by lift):")
            print(sample_rules)
        else:
            print("No association rules found with the specified parameters.")
        
        return frequent_itemsets, rules
    
    except Exception as e:
        print(f"✗ Error applying Apriori algorithm: {e}")
        import traceback
        traceback.print_exc()
        return pd.DataFrame(), pd.DataFrame()

# Apply Apriori algorithm to the basket matrix
frequent_itemsets, rules = apply_apriori(basket_matrix, min_support=0.01, min_confidence=0.5, min_lift=1.0)

## 8. Association Rules Visualization

Interactive visualizations for exploring discovered patterns and relationships.

In [None]:
# =========================
# STEP 7: VISUALIZATION OF ASSOCIATION RULES
# =========================
log_step("VISUALIZATION OF ASSOCIATION RULES")

def visualize_association_rules(rules_df, itemsets_df=None):
    """
    Create visualizations for association rules analysis.
    
    Parameters:
    -----------
    rules_df : pandas.DataFrame
        Association rules DataFrame
    itemsets_df : pandas.DataFrame, optional
        Frequent itemsets DataFrame
    """
    if rules_df.empty:
        print("❌ No association rules available for visualization")
        return
    
    print(f"Creating visualizations for {len(rules_df)} association rules...")
    
    # 1. Scatter plot of Support vs Confidence
    print("\n1. Support vs Confidence Visualization")
    
    # Create a copy of rules and convert frozensets to strings for visualization
    plot_rules = rules_df.copy()
    plot_rules['antecedents_str'] = plot_rules['antecedents'].apply(lambda x: ', '.join(list(x)))
    plot_rules['consequents_str'] = plot_rules['consequents'].apply(lambda x: ', '.join(list(x)))
    
    # Create a scatter plot using plotly
    fig = go.Figure()
    
    # Add scatter trace
    scatter = go.Scatter(
        x=plot_rules['support'],
        y=plot_rules['confidence'],
        mode='markers',
        marker=dict(
            size=plot_rules['lift'] * 2,
            color=plot_rules['lift'],
            colorscale='Viridis',
            colorbar=dict(title='Lift'),
            showscale=True
        ),
        text=[
            f"Antecedents: {a}<br>"
            f"Consequents: {c}<br>"
            f"Support: {support:.3f}<br>"
            f"Confidence: {confidence:.3f}<br>"
            f"Lift: {lift:.3f}"
            for a, c, support, confidence, lift in zip(
                plot_rules['antecedents_str'], 
                plot_rules['consequents_str'],
                plot_rules['support'], 
                plot_rules['confidence'], 
                plot_rules['lift']
            )
        ],
        hoverinfo='text'
    )
    
    fig.add_trace(scatter)
    
    # Update layout
    fig.update_layout(
        title='Association Rules: Support vs Confidence',
        xaxis_title='Support',
        yaxis_title='Confidence',
        template='plotly_white'
    )
    
    fig.show()  # render the figure; without this call it is built but never displayed
    print("✓ Created Support vs Confidence scatter plot")
    
    # 2. Network Graph of Top Association Rules
    print("\n2. Network Graph Visualization")
    
    # Select top rules by lift for the network graph (to avoid overcrowding)
    top_rules = plot_rules.sort_values('lift', ascending=False).head(20)
    
    # Create a directed graph
    G = nx.DiGraph()
    
    # Add edges with attributes
    for _, row in top_rules.iterrows():
        antecedent = row['antecedents_str']
        consequent = row['consequents_str']
        G.add_edge(
            antecedent, 
            consequent, 
            weight=row['lift'],
            support=row['support'],
            confidence=row['confidence']
        )
    
    # Get node positions
    pos = nx.spring_layout(G, k=0.5, iterations=50)
    
    # Create edge traces
    edge_traces = []
    for edge in G.edges(data=True):
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        
        # Width based on lift
        width = edge[2]['weight'] * 0.5
        
        edge_trace = go.Scatter(
            x=[x0, x1, None],
            y=[y0, y1, None],
            line=dict(width=width, color='rgba(150,150,150,0.7)'),
            hoverinfo='none',
            mode='lines'
        )
        
        edge_traces.append(edge_trace)
    
    # Create node trace
    node_x = []
    node_y = []
    node_text = []
    
    for node in G.nodes():
        x, y = pos[node]
        node_x.append(x)
        node_y.append(y)
        node_text.append(node)
    
    node_trace = go.Scatter(
        x=node_x,
        y=node_y,
        mode='markers+text',
        marker=dict(
            size=15,
            color='skyblue',
            line=dict(width=1, color='darkslategray')
        ),
        text=node_text,
        textposition='top center',
        hoverinfo='text'
    )
    
    # Create the figure
    fig_network = go.Figure(
        data=edge_traces + [node_trace],
        layout=go.Layout(
            title='Network of Top Association Rules by Lift',
            showlegend=False,
            hovermode='closest',
            margin=dict(b=20, l=5, r=5, t=40),
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
        )
    )
    
    fig_network.show()  # render the figure; without this call it is built but never displayed
    print("✓ Created network graph of top association rules")
    
    # 3. Heatmap of item co-occurrence (if itemsets are provided)
    if itemsets_df is not None and not itemsets_df.empty:
        print("\n3. Item Co-occurrence Heatmap")
        
        # Collect (item, support) for every single-item itemset
        top_items = [
            (next(iter(itemset)), support)
            for itemset, support in zip(itemsets_df['itemsets'], itemsets_df['support'])
            if len(itemset) == 1
        ]
        
        # Sort by support and take top 15
        top_items = sorted(top_items, key=lambda x: x[1], reverse=True)[:15]
        top_item_names = [item[0] for item in top_items]
        
        # Create co-occurrence matrix
        cooc_matrix = np.zeros((len(top_item_names), len(top_item_names)))
        
        # Fill co-occurrence matrix
        for i, item1 in enumerate(top_item_names):
            for j, item2 in enumerate(top_item_names):
                if i == j:
                    # Diagonal - use the item's own support
                    cooc_matrix[i, j] = next(item[1] for item in top_items if item[0] == item1)
                else:
                    # Off-diagonal - find rules containing both items
                    pair_rules = rules_df[
                        rules_df['antecedents'].apply(lambda x: item1 in x and item2 in x) |
                        rules_df['consequents'].apply(lambda x: item1 in x and item2 in x) |
                        (rules_df['antecedents'].apply(lambda x: item1 in x) & 
                         rules_df['consequents'].apply(lambda x: item2 in x)) |
                        (rules_df['antecedents'].apply(lambda x: item2 in x) & 
                         rules_df['consequents'].apply(lambda x: item1 in x))
                    ]
                    
                    if not pair_rules.empty:
                        cooc_matrix[i, j] = pair_rules['support'].max()
        
        # Create heatmap
        fig_heatmap = go.Figure(data=go.Heatmap(
            z=cooc_matrix,
            x=top_item_names,
            y=top_item_names,
            colorscale='Viridis',
            hoverongaps=False
        ))
        
        fig_heatmap.update_layout(
            title='Item Co-occurrence Heatmap (based on support)',
            xaxis=dict(tickangle=-45),
            yaxis=dict(tickangle=0)
        )
        
        fig_heatmap.show()  # render the figure; without this call it is built but never displayed
        print("✓ Created item co-occurrence heatmap")
    
    print("\n✓ All visualizations created successfully")

# Visualize the generated association rules
if 'rules' in locals() and not rules.empty:
    visualize_association_rules(rules, frequent_itemsets)

## 9. Country-Specific Analysis

Regional market basket analysis to understand geographical patterns.

In [None]:
# =========================
# STEP 8: COUNTRY-SPECIFIC ANALYSIS
# =========================
log_step("COUNTRY-SPECIFIC ANALYSIS")

def analyze_by_country(df, country_name):
    """
    Perform country-specific market basket analysis.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Cleaned DataFrame with transaction data
    country_name : str
        Name of the country to analyze
    
    Returns:
    --------
    pandas.DataFrame
        Country-specific frequent itemsets
    pandas.DataFrame
        Country-specific association rules
    """
    # Check if Country column exists
    if 'Country' not in df.columns:
        print("❌ Country column not available in dataset")
        return None, None
    
    # Filter by country
    country_df = df[df['Country'] == country_name]
    
    if len(country_df) == 0:
        print(f"❌ No transactions found for country: {country_name}")
        return None, None
    
    print(f"Analyzing transactions from {country_name}...")
    print(f"✓ Found {len(country_df)} line items from {country_name}")
    
    # Create transaction basket for the country
    country_transaction, country_basket = create_transaction_basket(country_df)
    
    # Apply Apriori algorithm to the country-specific basket
    country_itemsets, country_rules = apply_apriori(country_basket, min_support=0.01, min_confidence=0.5)
    
    # Visualize country-specific rules
    if not country_rules.empty:
        print(f"\nVisualizing association rules for {country_name}...")
        visualize_association_rules(country_rules, country_itemsets)
    
    return country_itemsets, country_rules

# Perform country-specific analysis (e.g., for United Kingdom)
if 'Country' in cleaned_purchase_df.columns:
    uk_itemsets, uk_rules = analyze_by_country(cleaned_purchase_df, 'United Kingdom')

## 10. Customer Segmentation with RFM Analysis

Understanding customer behavior through Recency, Frequency, and Monetary analysis.
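
As a small worked example of the three metrics, the sketch below computes recency, frequency, and monetary value for two customers in plain Python. The rows are hypothetical. As in the cell below, recency is measured against the latest date in the data and frequency counts distinct bills.

```python
from datetime import datetime

# Hypothetical line items: (customer, bill number, timestamp, line value)
rows = [
    ("C1", "B1", datetime(2011, 1, 5), 20.0),
    ("C1", "B1", datetime(2011, 1, 5), 5.0),
    ("C1", "B2", datetime(2011, 3, 1), 10.0),
    ("C2", "B3", datetime(2011, 2, 1), 40.0),
]

# Reference date: the latest timestamp in the data
reference = max(ts for _, _, ts, _ in rows)

rfm = {}
for customer, bill, ts, value in rows:
    entry = rfm.setdefault(customer, {"last": ts, "bills": set(), "monetary": 0.0})
    entry["last"] = max(entry["last"], ts)
    entry["bills"].add(bill)
    entry["monetary"] += value

summary = {
    customer: (
        (reference - e["last"]).days,  # recency in days
        len(e["bills"]),               # frequency: distinct bills
        e["monetary"],                 # monetary: total spend
    )
    for customer, e in rfm.items()
}
print(summary)  # {'C1': (0, 2, 35.0), 'C2': (28, 1, 40.0)}
```

Counting distinct bills rather than rows matters: C1 has three line items but only two purchases.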

In [None]:
# =========================
# STEP 9: RFM ANALYSIS
# =========================
log_step("RFM ANALYSIS")

def perform_rfm_analysis(df):
    """
    Perform RFM (Recency, Frequency, Monetary) analysis on the transaction data.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Cleaned DataFrame with transaction data
    
    Returns:
    --------
    pandas.DataFrame
        RFM scores by customer
    """
    # Check if necessary columns exist
    required_cols = ['CustomerID', 'Date', 'BillNo']
    for col in required_cols:
        if col not in df.columns:
            print(f"❌ {col} column not available for RFM analysis")
            return None
    
    print("Computing RFM metrics...")
    
    # Add monetary value if not already present
    if 'TransactionValue' not in df.columns:
        df['TransactionValue'] = df['Quantity'] * df['Price']
    
    # Define today's date for recency calculation (use the most recent date in the dataset)
    today_date = df['Date'].max()
    print(f"Reference date for recency calculation: {today_date}")
    
    # Group by CustomerID to calculate RFM metrics
    rfm = df.groupby('CustomerID').agg({
        'Date': lambda x: (today_date - x.max()).days,   # Recency
        'BillNo': 'nunique',                             # Frequency
        'TransactionValue': 'sum'                         # Monetary
    })
    
    # Rename columns for clarity
    rfm.columns = ['recency', 'frequency', 'monetary']
    
    # Reset index
    rfm = rfm.reset_index()
    
    print("✓ RFM metrics calculated successfully")
    print(f"✓ Number of unique customers analyzed: {len(rfm)}")
    
    # Display summary statistics of RFM metrics
    print("\nRFM metrics summary:")
    print(rfm[['recency', 'frequency', 'monetary']].describe().round(2))
    
    return rfm

# Perform RFM analysis
rfm_data = perform_rfm_analysis(cleaned_purchase_df)

## 11. K-Means Clustering on RFM Data

Customer segmentation with K-means clustering on standardized RFM metrics.
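
K-means groups points by Euclidean distance, so a feature measured in thousands (monetary) would otherwise outweigh one measured in tens (recency). The sketch below shows the z-score transform that `StandardScaler` applies per column, written with the standard library. The four customers are hypothetical.

```python
from statistics import fmean, pstdev

# Hypothetical RFM columns for four customers
recency = [2.0, 10.0, 40.0, 90.0]          # days
monetary = [5000.0, 1200.0, 300.0, 50.0]   # currency units

def standardize(values):
    """Z-score using the population standard deviation (ddof=0)."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

recency_z = standardize(recency)
monetary_z = standardize(monetary)

# After scaling, both columns have mean 0 and unit variance
print([round(z, 2) for z in recency_z])
print([round(z, 2) for z in monetary_z])
```

After this step both columns contribute on the same scale to the distance that K-means minimizes.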

In [None]:
# =========================
# STEP 10: CUSTOMER SEGMENTATION
# =========================
log_step("CUSTOMER SEGMENTATION")

def perform_customer_segmentation(rfm_data, n_clusters=3):
    """
    Perform customer segmentation using K-means clustering on RFM metrics.
    
    Parameters:
    -----------
    rfm_data : pandas.DataFrame
        RFM metrics by customer
    n_clusters : int, default 3
        Number of clusters to create
    
    Returns:
    --------
    pandas.DataFrame
        RFM data with cluster assignments
    pandas.DataFrame
        Cluster profiles
    """
    if rfm_data is None:
        print("❌ No RFM data available for segmentation")
        return None, None
    
    # Cap extreme outliers before scaling (optional but recommended)
    rfm_clean = rfm_data.copy()
    for col in ['recency', 'frequency', 'monetary']:
        p01 = rfm_clean[col].quantile(0.01)  # 1st percentile
        p99 = rfm_clean[col].quantile(0.99)  # 99th percentile
        spread = p99 - p01                   # 1st-99th percentile range, not the IQR
        lower_bound = p01 - (1.5 * spread)
        upper_bound = p99 + (1.5 * spread)
        
        # clip() caps both tails in one step and avoids writing float bounds
        # into integer columns through .loc
        rfm_clean[col] = rfm_clean[col].clip(lower=lower_bound, upper=upper_bound)
    
    # Normalize the RFM metrics
    print("Normalizing RFM metrics...")
    scaler = StandardScaler()
    rfm_scaled = scaler.fit_transform(rfm_clean[['recency', 'frequency', 'monetary']])
    
    # Perform K-means clustering
    print(f"Performing K-means clustering with {n_clusters} clusters...")
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    rfm_clean['cluster'] = kmeans.fit_predict(rfm_scaled)
    
    # Transfer cluster assignments back to original data
    rfm_data['cluster'] = rfm_clean['cluster']
    
    print("✓ Clustering completed successfully")
    
    # Generate cluster profiles
    cluster_profile = rfm_data.groupby('cluster').agg({
        'recency': 'mean',
        'frequency': 'mean',
        'monetary': 'mean',
        'CustomerID': 'count'  # Count of customers in each cluster
    }).rename(columns={'CustomerID': 'n_customers'})
    
    # Display cluster profiles
    print("\nCluster profiles:")
    print(cluster_profile.round(2))
    
    # Interpret clusters
    print("\nCluster Interpretation:")
    for cluster in range(n_clusters):
        profile = cluster_profile.loc[cluster]
        # The row Series is float-typed, so cast the count back to int for display
        print(f"\nCluster {cluster} ({int(profile['n_customers'])} customers):")
        
        # Recency interpretation (lower is better)
        if profile['recency'] <= cluster_profile['recency'].quantile(0.33):
            recency_desc = "Recent customers"
        elif profile['recency'] <= cluster_profile['recency'].quantile(0.67):
            recency_desc = "Moderately recent customers"
        else:
            recency_desc = "Less recent customers"
        
        # Frequency interpretation (higher is better)
        if profile['frequency'] >= cluster_profile['frequency'].quantile(0.67):
            frequency_desc = "Frequent shoppers"
        elif profile['frequency'] >= cluster_profile['frequency'].quantile(0.33):
            frequency_desc = "Moderate shoppers"
        else:
            frequency_desc = "Infrequent shoppers"
        
        # Monetary interpretation (higher is better)
        if profile['monetary'] >= cluster_profile['monetary'].quantile(0.67):
            monetary_desc = "High spenders"
        elif profile['monetary'] >= cluster_profile['monetary'].quantile(0.33):
            monetary_desc = "Medium spenders"
        else:
            monetary_desc = "Low spenders"
        
        print(f"  • {recency_desc}, {frequency_desc}, {monetary_desc}")
        print(f"  • Avg. Days Since Last Purchase: {profile['recency']:.1f}")
        print(f"  • Avg. Purchase Frequency: {profile['frequency']:.1f}")
        print(f"  • Avg. Monetary Value: ${profile['monetary']:.2f}")
    
    # Create visualizations of the clusters
    print("\nCreating customer segmentation visualizations...")
    
    # 1. 3D Scatter Plot of RFM Clusters
    fig = go.Figure()
    
    # Add a scatter trace for each cluster
    for cluster_id in sorted(rfm_data['cluster'].unique()):
        cluster_data = rfm_data[rfm_data['cluster'] == cluster_id]
        
        # Choose cluster colors
        cluster_colors = {0: 'rgb(31, 119, 180)', 1: 'rgb(255, 127, 14)', 2: 'rgb(44, 160, 44)', 
                          3: 'rgb(214, 39, 40)', 4: 'rgb(148, 103, 189)'}
        
        fig.add_trace(go.Scatter3d(
            x=cluster_data['recency'],
            y=cluster_data['frequency'],
            z=cluster_data['monetary'],
            mode='markers',
            marker=dict(
                size=5,
                color=cluster_colors.get(cluster_id, f'rgb({(cluster_id*50) % 255}, {(cluster_id*80) % 255}, {(cluster_id*120) % 255})'),
                opacity=0.7
            ),
            name=f'Cluster {cluster_id}'
        ))
    
    # Update layout
    fig.update_layout(
        title='3D Customer Segmentation based on RFM',
        scene=dict(
            xaxis_title='Recency (days)',
            yaxis_title='Frequency (purchases)',
            zaxis_title='Monetary (value)'
        ),
        legend=dict(
            x=0,
            y=1,
            traceorder="normal",
            font=dict(
                family="sans-serif",
                size=12,
                color="black"
            )
        )
    )
    
    fig.show()  # render the figure; without this call it is built but never displayed
    print("✓ Created 3D visualization of customer segments")
    
    # 2. RFM Parallel Coordinates Plot
    # Normalize the data for better visualization
    rfm_viz = rfm_data.copy()
    
    # Invert recency so that higher is better (like frequency and monetary)
    rfm_viz['recency_inv'] = rfm_viz['recency'].max() - rfm_viz['recency']
    
    # Create parallel coordinates plot
    fig_parallel = go.Figure(data=
        go.Parcoords(
            line=dict(
                color=rfm_viz['cluster'],
                colorscale='Viridis',
                showscale=True
            ),
            dimensions=list([
                dict(range=[rfm_viz['recency_inv'].min(), rfm_viz['recency_inv'].max()],
                     label='Recency (Inverted)', 
                     values=rfm_viz['recency_inv']),
                dict(range=[rfm_viz['frequency'].min(), rfm_viz['frequency'].max()],
                     label='Frequency', 
                     values=rfm_viz['frequency']),
                dict(range=[rfm_viz['monetary'].min(), rfm_viz['monetary'].max()],
                     label='Monetary', 
                     values=rfm_viz['monetary'])
            ])
        )
    )
    
    fig_parallel.update_layout(
        title='Parallel Coordinates Plot of RFM Segments'
    )
    
    fig_parallel.show()  # render the figure; without this call it is built but never displayed
    print("✓ Created parallel coordinates plot of RFM metrics")
    
    # 3. Radar Chart for Cluster Profiles
    # Prepare data for radar chart
    radar_data = cluster_profile[['recency', 'frequency', 'monetary']].copy()
    
    # Invert recency so that higher values are better (farther from the center of the radar)
    radar_data['recency'] = radar_data['recency'].max() - radar_data['recency']
    
    # Normalize each metric to 0-1 scale for radar chart
    for col in radar_data.columns:
        if radar_data[col].max() > 0:
            radar_data[col] = radar_data[col] / radar_data[col].max()
    
    # Create radar chart
    categories = ['Recency', 'Frequency', 'Monetary']
    fig_radar = go.Figure()
    
    for cluster_id in sorted(radar_data.index):
        fig_radar.add_trace(go.Scatterpolar(
            r=radar_data.loc[cluster_id].values.tolist(),
            theta=categories,
            fill='toself',
            name=f'Cluster {cluster_id}'
        ))
    
    fig_radar.update_layout(
        polar=dict(
            radialaxis=dict(
                visible=True,
                range=[0, 1]
            )
        ),
        title='Radar Chart of Cluster Profiles'
    )
    
    fig_radar.show()  # render the figure; without this call it is built but never displayed
    print("✓ Created radar chart of cluster profiles")
    
    return rfm_data, cluster_profile

# Perform customer segmentation using K-means clustering
if rfm_data is not None:
    segmented_rfm, cluster_profiles = perform_customer_segmentation(rfm_data, n_clusters=3)

## 12. Summary and Business Recommendations

Key insights and actionable recommendations from the analysis.

In [None]:
# =========================
# STEP 11: SUMMARY AND RECOMMENDATIONS
# =========================
log_step("SUMMARY AND RECOMMENDATIONS")

def summarize_analysis():
    """
    Summarize the market basket analysis and provide recommendations.
    """
    print("📊 MARKET BASKET ANALYSIS SUMMARY 📊")
    print("\n1. Data Quality Assessment:")
    print("   • Initial dataset size and quality checked")
    print("   • Missing values and outliers handled")
    print("   • Data type conversions performed")
    
    print("\n2. Transaction Analysis:")
    print("   • Transaction patterns explored across time and geography")
    print("   • Popular products and purchasing patterns identified")
    
    print("\n3. Association Rules:")
    print("   • Frequent itemsets generated using Apriori algorithm")
    print("   • Association rules discovered based on support and confidence thresholds")
    
    print("\n4. Customer Segmentation:")
    print("   • RFM analysis performed to understand customer behavior")
    print("   • Customer segments identified via clustering")
    
    print("\n🚀 ACTIONABLE RECOMMENDATIONS 🚀")
    print("\n1. Product Placement and Bundling:")
    print("   • Arrange frequently co-purchased items together on shelves")
    print("   • Create product bundles based on high-confidence association rules")
    
    print("\n2. Targeted Marketing:")
    print("   • Design personalized promotions for each customer segment")
    print("   • Cross-sell products based on association rules")
    
    print("\n3. Inventory Management:")
    print("   • Optimize inventory based on frequent itemsets")
    print("   • Ensure complementary products are always in stock together")
    
    print("\n4. Website/Store Layout:")
    print("   • Optimize navigation to place associated products near each other")
    print("   • Implement 'Frequently Bought Together' recommendations")
    
    print("\n5. Next Steps:")
    print("   • Continuously update the analysis with new transaction data")
    print("   • Conduct A/B testing to validate recommendations")
    print("   • Integrate findings with other business metrics")

# Generate summary and recommendations
summarize_analysis()