# Transaction Data Cleaning, Preprocessing, and Association Rule Mining

## Overview
This notebook provides a complete solution for:
1. **Loading messy transaction data**
2. **Cleaning and preprocessing** multiple data quality issues
3. **Performing association rule mining** to discover patterns
4. **Visualizing and interpreting results**

---

## STEP 1: Import Required Libraries

**Explanation:** We need these libraries for data manipulation, analysis, and association rule mining.

**Libraries used:**
- `pandas`: Data manipulation and analysis
- `numpy`: Numerical operations
- `datetime`: Date parsing and standardization
- `mlxtend`: Association rule mining (Apriori algorithm)
- `re`: Regular expressions for text cleaning

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import warnings
warnings.filterwarnings('ignore')

---
## STEP 2: Load the Raw Data

**Explanation:** Load the CSV file to examine the original messy data with all its issues.

**What to look for:**
- Missing values
- Duplicate transactions
- Inconsistent formats
- Data type issues
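
Because malformed rows can be dropped silently at load time, it may help to count them before cleaning. The following is a minimal sketch, assuming pandas 1.4 or later (where `on_bad_lines` accepts a callable with `engine='python'`). The in-memory CSV and the names `sample_csv`, `record_bad_line`, and `skipped_lines` are hypothetical and are not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the real file.
# The second data row has one field too many.
sample_csv = io.StringIO(
    "transaction_id,movie_title,total_amount\n"
    "TXN001,Dune: Part Two,59.50\n"
    "TXN002,The Batman,27.75,EXTRA\n"
    "TXN003,Oppenheimer,61.00\n"
)

skipped_lines = []

def record_bad_line(fields):
    """Remember the malformed row, then return None so pandas drops it."""
    skipped_lines.append(fields)
    return None

# A callable for on_bad_lines is only supported by the Python parsing engine.
df_sample = pd.read_csv(sample_csv, on_bad_lines=record_bad_line, engine="python")

print(f"Rows loaded: {len(df_sample)}")
print(f"Rows skipped: {len(skipped_lines)}")
```

Keeping the skipped rows in a list makes it possible to inspect or repair them later instead of losing them.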

In [74]:
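# Load the data.
# NOTE: on_bad_lines='skip' silently drops malformed rows, so the row count
# printed below may be smaller than the number of lines in the file.
df_raw = pd.read_csv('unclean_movie_transactions_data.csv', on_bad_lines='skip')

print("✓ Data loaded successfully!")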
print(f"✓ Total rows: {len(df_raw)}")
print(f"✓ Total columns: {len(df_raw.columns)}")
print("\n" + "="*100)

✓ Data loaded successfully!
✓ Total rows: 64
✓ Total columns: 11



In [75]:
# Display the complete raw dataset
print("COMPLETE RAW DATA:")
print("="*100)
print(df_raw.to_string())

COMPLETE RAW DATA:
   transaction_id         date      customer_name                     movie_title ticket_type  ticket_quantity  ticket_price                   concession_items  concession_total payment_method  total_amount
0          TXN001   2024-01-15         John Smith                  Dune: Part Two       ADULT                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
1          TXN002   01/15/2024      Sarah Johnson                  Dune: Part Two       adult                2          15.5                 popcorn;soda;candy             28.50    Credit Card         59.50
2          TXN003          NaN      Michael Brown                      The Batman      Senior                1          12.0                   Popcorn, , ,Soda             15.75           Cash           NaN
3          TXN004   2024-01-16        Emily Davis                     Oppenheimer     STUDENT                3          13.0               Nachos;Pretzel

In [76]:
# Show data types and missing values
print("DATA TYPES:")
print("="*100)
print(df_raw.dtypes)
print("\n" + "="*100)
print("MISSING VALUES SUMMARY:")
print("="*100)
print(df_raw.isnull().sum())

DATA TYPES:
transaction_id       object
date                 object
customer_name        object
movie_title          object
ticket_type          object
ticket_quantity       int64
ticket_price        float64
concession_items     object
concession_total    float64
payment_method       object
total_amount        float64
dtype: object

MISSING VALUES SUMMARY:
transaction_id      0
date                7
customer_name       0
movie_title         1
ticket_type         0
ticket_quantity     0
ticket_price        8
concession_items    2
concession_total    0
payment_method      1
total_amount        9
dtype: int64


---
## STEP 3: Remove Duplicate Transactions

**Explanation:** Duplicate transaction IDs indicate the same transaction recorded multiple times. This can skew our analysis by:
- Inflating sales numbers
- Creating false patterns in association rules
- Distorting customer behavior analysis

**Solution:** Keep only the first occurrence of each transaction ID.
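
Keeping the first occurrence is only safe when the repeated rows are exact copies. A minimal sketch, on hypothetical data (`TXN900` and its values are invented for illustration), of how one might separate exact repeats from conflicting records before dropping anything:

```python
import pandas as pd

# Hypothetical rows: TXN001 is an exact repeat, TXN900 repeats with a different total.
df_demo = pd.DataFrame({
    "transaction_id": ["TXN001", "TXN001", "TXN900", "TXN900", "TXN002"],
    "customer_name": ["John Smith", "John Smith", "A. Example", "A. Example", "Sarah Johnson"],
    "total_amount": [59.5, 59.5, 40.0, 45.0, 59.5],
})

# All rows whose transaction_id appears more than once
repeated = df_demo[df_demo.duplicated(subset=["transaction_id"], keep=False)]

# Distinct full rows per ID: 1 means exact copies, more than 1 means conflicting records
distinct_rows_per_id = repeated.drop_duplicates().groupby("transaction_id").size()
conflicting_ids = distinct_rows_per_id[distinct_rows_per_id > 1].index.tolist()

print("Conflicting IDs needing manual review:", conflicting_ids)

# Same de-duplication rule as the notebook: keep the first row per ID
deduped = df_demo.drop_duplicates(subset=["transaction_id"], keep="first")
print(f"Rows after de-duplication: {len(deduped)}")
```

If `conflicting_ids` is empty, `keep='first'` loses no information; otherwise the conflicting rows deserve a closer look.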

In [77]:
# Identify duplicates
print("IDENTIFYING DUPLICATE TRANSACTIONS:")
print("="*100)

duplicates = df_raw[df_raw.duplicated(subset=['transaction_id'], keep=False)]
print(f"Found {len(duplicates)} duplicate records across {duplicates['transaction_id'].nunique()} transaction IDs:")
print("\n")
print(duplicates[['transaction_id', 'date', 'customer_name', 'movie_title', 'total_amount']].to_string())

IDENTIFYING DUPLICATE TRANSACTIONS:
Found 10 duplicate records across 5 transaction IDs:


   transaction_id        date   customer_name                   movie_title  total_amount
0          TXN001  2024-01-15      John Smith                Dune: Part Two          59.5
3          TXN004  2024-01-16     Emily Davis                   Oppenheimer          61.0
4          TXN001  2024-01-15      John Smith                Dune: Part Two          59.5
11         TXN004  2024-01-16     Emily Davis                   Oppenheimer          61.0
16         TXN015  2024/01/20  Jessica Martin              American Fiction          78.5
23         TXN015  2024/01/20  Jessica Martin              American Fiction          78.5
33         TXN031   26-1-2024    Thomas Baker  The Hunger Games: The Ballad          88.5
40         TXN031   26-1-2024    Thomas Baker  The Hunger Games: The Ballad          88.5
44         TXN042  2024-01-31     Gary Rogers   Taylor Swift: The Eras Tour         101.0
63       

In [78]:
# Remove duplicates
df_step3 = df_raw.drop_duplicates(subset=['transaction_id'], keep='first')

print("DUPLICATE REMOVAL RESULTS:")
print("="*100)
print(f"✓ Removed {len(df_raw) - len(df_step3)} duplicate transactions")
print(f"✓ Rows before: {len(df_raw)}")
print(f"✓ Rows after: {len(df_step3)}")

DUPLICATE REMOVAL RESULTS:
✓ Removed 5 duplicate transactions
✓ Rows before: 64
✓ Rows after: 59


In [79]:
# Display complete data after removing duplicates
print("COMPLETE DATA AFTER REMOVING DUPLICATES:")
print("="*100)
print(df_step3.to_string())

COMPLETE DATA AFTER REMOVING DUPLICATES:
   transaction_id         date      customer_name                     movie_title ticket_type  ticket_quantity  ticket_price                   concession_items  concession_total payment_method  total_amount
0          TXN001   2024-01-15         John Smith                  Dune: Part Two       ADULT                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
1          TXN002   01/15/2024      Sarah Johnson                  Dune: Part Two       adult                2          15.5                 popcorn;soda;candy             28.50    Credit Card         59.50
2          TXN003          NaN      Michael Brown                      The Batman      Senior                1          12.0                   Popcorn, , ,Soda             15.75           Cash           NaN
3          TXN004   2024-01-16        Emily Davis                     Oppenheimer     STUDENT                3          13.0       

---
## STEP 4: Standardize Date Formats

**Explanation:** Dates come in many different formats:
- `2024-01-15` (ISO format)
- `01/15/2024` (US format)
- `16-01-2024` (European format)
- `Feb 1 2024` (Text format)
- And many more...

**Problem:** Inconsistent date formats make it impossible to:
- Sort transactions chronologically
- Analyze trends over time
- Filter by date range

**Solution:** Parse all formats and convert to standard format: `YYYY-MM-DD`
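
One caveat when trying several formats in sequence: the first format that matches wins, so the order of the list decides how ambiguous day/month values are read. A small sketch using only the standard library (the helper `parse_with` and the two format lists are hypothetical, for illustration only):

```python
from datetime import datetime

def parse_with(date_str, formats):
    """Return the first successful parse as YYYY-MM-DD, or None if nothing matches."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

us_first = ["%m/%d/%Y", "%d/%m/%Y"]  # month/day tried first
eu_first = ["%d/%m/%Y", "%m/%d/%Y"]  # day/month tried first

ambiguous = "3/2/2024"
print(parse_with(ambiguous, us_first))     # 2024-03-02
print(parse_with(ambiguous, eu_first))     # 2024-02-03

# Unambiguous values parse the same way regardless of order
print(parse_with("18/01/2024", us_first))  # 2024-01-18
```

When the source system's convention is known, putting its format first (or parsing each source separately) avoids silently swapping day and month.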

In [80]:
# Show original date formats
print("ORIGINAL DATE FORMATS (Sample):")
print("="*100)
print(df_step3[['transaction_id', 'date']].head(20).to_string())

ORIGINAL DATE FORMATS (Sample):
   transaction_id        date
0          TXN001  2024-01-15
1          TXN002  01/15/2024
2          TXN003         NaN
3          TXN004  2024-01-16
5          TXN005  16-01-2024
6          TXN006  2024/01/17
7          TXN007  01-17-2024
8          TXN008  2024-01-17
9          TXN009         NaN
10         TXN010  2024-01-18
12         TXN011  18/01/2024
13         TXN012  2024-01-19
14         TXN013  19-01-2024
15         TXN014         NaN
16         TXN015  2024/01/20
17         TXN016  20-01-2024
18         TXN017  2024-01-20
19         TXN018  01/21/2024
20         TXN019   2024/1/22
21         TXN020  2024-01-22


In [81]:
# Define date parsing function
def parse_date(date_str):
    """
    Parse multiple date formats and return a standardized YYYY-MM-DD string,
    or None if the value is missing or matches none of the formats.
    """
    if pd.isna(date_str) or date_str == '' or str(date_str).lower() in ['n/a', 'null', 'none']:
        return None
    
    date_str = str(date_str).strip()
    
    # Formats are tried in order and the first match wins, so an ambiguous
    # value such as 3/2/2024 is read as month/day (2024-03-02).
    # There is no '%m-%d-%Y' entry, so a value like 01-17-2024 matches
    # nothing and returns None.
    date_formats = [
        '%Y-%m-%d',      # 2024-01-15, 2024-1-28
        '%m/%d/%Y',      # 01/15/2024
        '%d-%m-%Y',      # 16-01-2024, 23-01-2024, 26-1-2024
        '%Y/%m/%d',      # 2024/01/17, 2024/1/22
        '%d/%m/%Y',      # 17/01/2024, 23/1/2024
        '%Y.%m.%d',      # 2024.01.27
        '%d/%m/%y',      # 27/01/24
        '%m-%d-%y',      # 01-28-24
        '%d-%m-%y',      # day-month with a two-digit year
        '%b %d %Y',      # Feb 1 2024
        '%d-%b-%Y',      # 04-Feb-2024
    ]
    
    for fmt in date_formats:
        try:
            return pd.to_datetime(date_str, format=fmt).strftime('%Y-%m-%d')
        except (ValueError, TypeError):
            # This format did not match; try the next one
            continue
    
    return None

print("✓ Date parsing function defined")

✓ Date parsing function defined


In [82]:
# Apply date standardization
df_step4 = df_step3.copy()
df_step4['date'] = df_step4['date'].apply(parse_date)
df_step4['date'] = pd.to_datetime(df_step4['date'], errors='coerce')

print("DATE STANDARDIZATION RESULTS:")
print("="*100)
print(f"✓ Dates standardized to YYYY-MM-DD format")
print(f"✓ Valid dates: {df_step4['date'].notna().sum()}")
print(f"✓ Missing dates: {df_step4['date'].isna().sum()}")

DATE STANDARDIZATION RESULTS:
✓ Dates standardized to YYYY-MM-DD format
✓ Valid dates: 50
✓ Missing dates: 9


In [83]:
# Display complete data with standardized dates
print("COMPLETE DATA AFTER DATE STANDARDIZATION:")
print("="*100)
print(df_step4.to_string())

COMPLETE DATA AFTER DATE STANDARDIZATION:
   transaction_id       date      customer_name                     movie_title ticket_type  ticket_quantity  ticket_price                   concession_items  concession_total payment_method  total_amount
0          TXN001 2024-01-15         John Smith                  Dune: Part Two       ADULT                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
1          TXN002 2024-01-15      Sarah Johnson                  Dune: Part Two       adult                2          15.5                 popcorn;soda;candy             28.50    Credit Card         59.50
2          TXN003        NaT      Michael Brown                      The Batman      Senior                1          12.0                   Popcorn, , ,Soda             15.75           Cash           NaN
3          TXN004 2024-01-16        Emily Davis                     Oppenheimer     STUDENT                3          13.0               N

---
## STEP 5: Standardize Text Capitalization

**Explanation:** Different capitalizations represent the same value:
- Ticket types: `ADULT`, `adult`, `Adult` → all mean the same thing
- Movie titles: `THE HOLDOVERS`, `The Holdovers` → same movie
- Payment methods: `Credit Card`, `credit card`, `CREDIT CARD` → same method

**Problem:** This causes:
- Incorrect counting of categories
- Duplicate entries in analysis
- Incorrect association rules

**Solution:** Convert all text to consistent Title Case format.
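
Title Case merges the variants listed above, but Python's `str.title()` capitalizes after every non-letter, which can mangle titles containing apostrophes. A short sketch of the behaviour and one possible alternative, `string.capwords`; the title `ocean's eleven` is a hypothetical example and does not come from the dataset:

```python
import string

# The capitalization variants from the explanation collapse to one value
variants = ["THE HOLDOVERS", "the holdovers", "The Holdovers"]
normalized = {v.strip().title() for v in variants}
print(normalized)  # {'The Holdovers'}

# str.title() also capitalizes the letter after an apostrophe
tricky = "ocean's eleven"
print(tricky.title())           # Ocean'S Eleven

# string.capwords only capitalizes after whitespace
print(string.capwords(tricky))  # Ocean's Eleven
```

The trade-off is that `string.capwords` does not capitalize after hyphens and collapses repeated spaces, so either choice is worth spot-checking against the actual movie titles.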

In [84]:
df_step5 = df_step4.copy()

# Show before standardization
print("BEFORE TEXT STANDARDIZATION:")
print("="*100)
print("\nTicket Type Distribution:")
print(df_step5['ticket_type'].value_counts())
print("\nPayment Method Distribution:")
print(df_step5['payment_method'].value_counts())
print(f"\nUnique Movies: {df_step5['movie_title'].nunique()}")

BEFORE TEXT STANDARDIZATION:

Ticket Type Distribution:
ticket_type
ADULT      11
adult      11
Adult       8
Senior      7
CHILD       6
Student     5
Child       3
student     3
SENIOR      2
child       2
STUDENT     1
Name: count, dtype: int64

Payment Method Distribution:
payment_method
Cash           17
Credit Card    15
Debit Card     14
credit card     4
cash            3
Debit card      2
CREDIT CARD     1
debit card      1
CASH            1
Name: count, dtype: int64

Unique Movies: 56


In [85]:
# Standardize text fields
df_step5['ticket_type'] = df_step5['ticket_type'].str.strip().str.title()
df_step5['movie_title'] = df_step5['movie_title'].str.strip().str.title()
df_step5['payment_method'] = df_step5['payment_method'].str.strip().str.title()

print("AFTER TEXT STANDARDIZATION:")
print("="*100)
print("\nTicket Type Distribution:")
print(df_step5['ticket_type'].value_counts())
print("\nPayment Method Distribution:")
print(df_step5['payment_method'].value_counts())
print(f"\nUnique Movies: {df_step5['movie_title'].nunique()}")
print("\n✓ All text fields standardized to Title Case")

AFTER TEXT STANDARDIZATION:

Ticket Type Distribution:
ticket_type
Adult      30
Child      11
Senior      9
Student     9
Name: count, dtype: int64

Payment Method Distribution:
payment_method
Cash           21
Credit Card    20
Debit Card     17
Name: count, dtype: int64

Unique Movies: 41

✓ All text fields standardized to Title Case


In [86]:
# Display complete data after text standardization
print("COMPLETE DATA AFTER TEXT STANDARDIZATION:")
print("="*100)
print(df_step5.to_string())

COMPLETE DATA AFTER TEXT STANDARDIZATION:
   transaction_id       date      customer_name                     movie_title ticket_type  ticket_quantity  ticket_price                   concession_items  concession_total payment_method  total_amount
0          TXN001 2024-01-15         John Smith                  Dune: Part Two       Adult                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
1          TXN002 2024-01-15      Sarah Johnson                  Dune: Part Two       Adult                2          15.5                 popcorn;soda;candy             28.50    Credit Card         59.50
2          TXN003        NaT      Michael Brown                      The Batman      Senior                1          12.0                   Popcorn, , ,Soda             15.75           Cash           NaN
3          TXN004 2024-01-16        Emily Davis                     Oppenheimer     Student                3          13.0               N

---
## STEP 6: Clean Concession Items

**Explanation:** Concession items have multiple problems:
1. **Mixed separators:** Some use commas (`,`), others use semicolons (`;`)
2. **Blank items:** Empty strings between separators like `Popcorn, , ,Soda`
3. **Inconsistent capitalization:** `Popcorn` vs `popcorn`

**Problem:** These issues prevent accurate analysis of:
- Which items are purchased together
- Item popularity
- Association rules between items

**Solution:**
- Unify separators to commas
- Remove blank/empty items
- Standardize capitalization to Title Case
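
The three fixes above can also be expressed with a single regular-expression split. This is a sketch of an equivalent approach rather than the notebook's own function; `split_items` is a hypothetical helper that returns a list, which is the shape a basket encoder typically expects.

```python
import re

def split_items(raw):
    """Split on commas or semicolons, drop blank entries, and Title Case each item."""
    # Treat None and float NaN (NaN != NaN) as "no items"
    if raw is None or (isinstance(raw, float) and raw != raw):
        return []
    parts = re.split(r"[;,]", str(raw))
    return [p.strip().title() for p in parts if p.strip()]

examples = [
    "Popcorn, , ,Soda",
    "popcorn;soda;candy",
    ",Popcorn,Soda,Candy",
    "Pretzel;Water; ",
]
for raw in examples:
    print(repr(raw), "->", split_items(raw))
```

Joining the result with `','.join(...)` gives the same comma-separated string format used in the rest of the notebook.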

In [87]:
# Show examples of messy concession items
print("CONCESSION ITEMS BEFORE CLEANING (Examples):")
print("="*100)
print(df_step5[['transaction_id', 'concession_items']].head(20).to_string())

CONCESSION ITEMS BEFORE CLEANING (Examples):
   transaction_id            concession_items
0          TXN001          Popcorn,Soda,Candy
1          TXN002          popcorn;soda;candy
2          TXN003            Popcorn, , ,Soda
3          TXN004        Nachos;Pretzel;Water
5          TXN005  Hot Dog;Popcorn;Soda;Candy
6          TXN006         Popcorn,, , , ,Soda
7          TXN007               Candy;Popcorn
8          TXN008                         NaN
9          TXN009               Water,Pretzel
10         TXN010     Popcorn;Soda; ; ;Nachos
12         TXN011         Candy,Popcorn,Soda,
13         TXN012                Hot Dog,Soda
14         TXN013     Popcorn; ; ;Candy;Water
15         TXN014                 Nachos,Soda
16         TXN015         ,Popcorn,Soda,Candy
17         TXN016             Pretzel;Water; 
18         TXN017              pretzel,water,
19         TXN018       Popcorn,, ,Soda,Candy
20         TXN019          Candy;Popcorn;Soda
21         TXN020        Hot Dog;,N

In [88]:
# Define cleaning function
def clean_concession_items(items_str):
    """
    Clean concession items:
    - Handle both comma and semicolon separators
    - Remove blank/empty items
    - Standardize capitalization
    - Return comma-separated string
    """
    if pd.isna(items_str) or items_str == '':
        return ''
    
    items_str = str(items_str)
    
    # Replace semicolons with commas for consistent splitting
    items_str = items_str.replace(';', ',')
    
    # Split by comma
    items = items_str.split(',')
    
    # Clean each item: strip whitespace, remove empty items, title case
    cleaned_items = []
    for item in items:
        item = item.strip()
        if item:  # Only keep non-empty items
            cleaned_items.append(item.title())
    
    # Return comma-separated string
    return ','.join(cleaned_items)

print("✓ Concession item cleaning function defined")

✓ Concession item cleaning function defined


In [89]:
# Apply cleaning
df_step6 = df_step5.copy()
df_step6['concession_items'] = df_step6['concession_items'].apply(clean_concession_items)

print("CONCESSION ITEMS AFTER CLEANING:")
print("="*100)
print(df_step6[['transaction_id', 'concession_items']].head(20).to_string())
print("\n✓ Concession items cleaned:")
print("  - Mixed separators (';' and ',') unified to ','")
print("  - Blank items removed")
print("  - Capitalization standardized to Title Case")

CONCESSION ITEMS AFTER CLEANING:
   transaction_id            concession_items
0          TXN001          Popcorn,Soda,Candy
1          TXN002          Popcorn,Soda,Candy
2          TXN003                Popcorn,Soda
3          TXN004        Nachos,Pretzel,Water
5          TXN005  Hot Dog,Popcorn,Soda,Candy
6          TXN006                Popcorn,Soda
7          TXN007               Candy,Popcorn
8          TXN008                            
9          TXN009               Water,Pretzel
10         TXN010         Popcorn,Soda,Nachos
12         TXN011          Candy,Popcorn,Soda
13         TXN012                Hot Dog,Soda
14         TXN013         Popcorn,Candy,Water
15         TXN014                 Nachos,Soda
16         TXN015          Popcorn,Soda,Candy
17         TXN016               Pretzel,Water
18         TXN017               Pretzel,Water
19         TXN018          Popcorn,Soda,Candy
20         TXN019          Candy,Popcorn,Soda
21         TXN020         Hot Dog,Nachos,Soda



In [90]:
# Display complete data after concession items cleaning
print("COMPLETE DATA AFTER CONCESSION ITEMS CLEANING:")
print("="*100)
print(df_step6.to_string())

COMPLETE DATA AFTER CONCESSION ITEMS CLEANING:
   transaction_id       date      customer_name                     movie_title ticket_type  ticket_quantity  ticket_price                   concession_items  concession_total payment_method  total_amount
0          TXN001 2024-01-15         John Smith                  Dune: Part Two       Adult                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
1          TXN002 2024-01-15      Sarah Johnson                  Dune: Part Two       Adult                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
2          TXN003        NaT      Michael Brown                      The Batman      Senior                1          12.0                       Popcorn,Soda             15.75           Cash           NaN
3          TXN004 2024-01-16        Emily Davis                     Oppenheimer     Student                3          13.0           

---
## STEP 7: Handle Missing Numeric Values

**Explanation:** Some transactions have missing prices or totals:
- Missing `ticket_price`: Can't calculate revenue per ticket type
- Missing `total_amount`: Can't calculate total sales

**Strategy:**
1. **For missing ticket_price:** Use the average price for that ticket type
2. **For missing total_amount:** Calculate it as:
   - `total_amount = (ticket_quantity × ticket_price) + concession_total`

**Why this works:** Ticket prices are typically fixed by type (Adult, Child, Senior, Student), so the mean of the known prices for a type recovers that type's listed price.
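
A vectorized sketch of the two-step strategy above, on a small hypothetical frame (`df_demo` and its values are illustrative). Prices are filled first, so totals are then computed from the already-filled prices; a missing `concession_total` is treated as 0, which is an assumption.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "ticket_type": ["Adult", "Adult", "Child", "Child"],
    "ticket_quantity": [2, 2, 3, 1],
    "ticket_price": [15.5, np.nan, 10.0, np.nan],
    "concession_total": [28.5, 16.5, 29.25, np.nan],
    "total_amount": [59.5, np.nan, np.nan, np.nan],
})

# 1. Fill missing prices with the mean price of the same ticket type
type_mean = df_demo.groupby("ticket_type")["ticket_price"].transform("mean")
df_demo["ticket_price"] = df_demo["ticket_price"].fillna(type_mean.round(2))

# 2. Compute totals from the filled prices; missing concessions count as 0
computed_total = (
    df_demo["ticket_quantity"] * df_demo["ticket_price"]
    + df_demo["concession_total"].fillna(0)
).round(2)
df_demo["total_amount"] = df_demo["total_amount"].fillna(computed_total)

print(df_demo)
```

Because each step operates on whole columns, the second step always sees the prices filled by the first, and existing totals are left untouched.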

In [91]:
# Analyze missing values
print("MISSING NUMERIC VALUES ANALYSIS:")
print("="*100)
print(f"Missing ticket_price: {df_step6['ticket_price'].isna().sum()}")
print(f"Missing total_amount: {df_step6['total_amount'].isna().sum()}")

# Show transactions with missing values
missing_values = df_step6[df_step6['ticket_price'].isna() | df_step6['total_amount'].isna()]
print(f"\nTransactions with missing numeric values: {len(missing_values)}")
print("\n")
print(missing_values[['transaction_id', 'ticket_type', 'ticket_quantity', 'ticket_price', 
                       'concession_total', 'total_amount']].to_string())

MISSING NUMERIC VALUES ANALYSIS:
Missing ticket_price: 8
Missing total_amount: 9

Transactions with missing numeric values: 10


   transaction_id ticket_type  ticket_quantity  ticket_price  concession_total  total_amount
2          TXN003      Senior                1          12.0             15.75           NaN
5          TXN005       Adult                4           NaN             45.25        107.25
13         TXN012       Adult                2           NaN             16.50           NaN
22         TXN021       Adult                1           NaN             13.25           NaN
25         TXN023       Adult                2          15.5             29.00           NaN
31         TXN029       Adult                2           NaN             19.50           NaN
41         TXN039       Adult                2           NaN             36.50           NaN
48         TXN046       Child                3           NaN             29.25           NaN
55         TXN053       Adult     

In [92]:
# Calculate average prices by ticket type
ticket_type_avg_prices = df_step6.groupby('ticket_type')['ticket_price'].mean()

print("AVERAGE PRICES BY TICKET TYPE:")
print("="*100)
print(ticket_type_avg_prices)
print("\nThese averages will be used to fill missing ticket prices")

AVERAGE PRICES BY TICKET TYPE:
ticket_type
Adult      15.5
Child      10.0
Senior     12.0
Student    13.0
Name: ticket_price, dtype: float64

These averages will be used to fill missing ticket prices


In [93]:
# Fill missing values
df_step7 = df_step6.copy()

for idx, row in df_step7.iterrows():
    # If ticket_price is missing, fill with average for that ticket type
    if pd.isna(row['ticket_price']):
        ticket_type = row['ticket_type']
        if ticket_type in ticket_type_avg_prices.index:
            df_step7.at[idx, 'ticket_price'] = round(ticket_type_avg_prices[ticket_type], 2)
    
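    # If total_amount is missing, calculate it.
    # NOTE: `row` is a snapshot taken before the fill above, so row['ticket_price']
    # is still NaN when the price was filled in this same iteration. Those totals
    # stay NaN here and are recalculated in the "Fix remaining missing values" cell.
    if pd.isna(row['total_amount']):
        ticket_total = row['ticket_quantity'] * row['ticket_price']
        concession_total = row['concession_total'] if not pd.isna(row['concession_total']) else 0
        df_step7.at[idx, 'total_amount'] = round(ticket_total + concession_total, 2)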

print("MISSING VALUES HANDLING RESULTS:")
print("="*100)
print("✓ Missing ticket prices filled with ticket type averages")
print("✓ Missing total amounts calculated from components")
print(f"\nVerification:")
print(f"  Missing ticket_price: {df_step7['ticket_price'].isna().sum()}")
print(f"  Missing total_amount: {df_step7['total_amount'].isna().sum()}")

MISSING VALUES HANDLING RESULTS:
✓ Missing ticket prices filled with ticket type averages
✓ Missing total amounts calculated from components

Verification:
  Missing ticket_price: 0
  Missing total_amount: 7


In [94]:
# Display complete data after handling missing values
print("COMPLETE DATA AFTER HANDLING MISSING VALUES:")
print("="*100)
print(df_step7.to_string())

COMPLETE DATA AFTER HANDLING MISSING VALUES:
   transaction_id       date      customer_name                     movie_title ticket_type  ticket_quantity  ticket_price                   concession_items  concession_total payment_method  total_amount
0          TXN001 2024-01-15         John Smith                  Dune: Part Two       Adult                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
1          TXN002 2024-01-15      Sarah Johnson                  Dune: Part Two       Adult                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
2          TXN003        NaT      Michael Brown                      The Batman      Senior                1          12.0                       Popcorn,Soda             15.75           Cash         27.75
3          TXN004 2024-01-16        Emily Davis                     Oppenheimer     Student                3          13.0             

---
## STEP 8: Final Cleaned Data Summary

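**Explanation:** This is the dataset after Steps 3 to 7:
- ✓ No duplicates
- ✓ Standardized dates (values that are missing or could not be parsed are `NaT`)
- ✓ Consistent text capitalization
- ✓ Clean concession items
- ✓ No missing `ticket_price` values

Some `total_amount`, `date`, `movie_title`, and `payment_method` values are still missing at this point. The cells below report them and fill them in, after which the data is ready for association rule mining.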
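
Rather than relying on a checklist, the claims can be verified directly against the DataFrame. A minimal sketch, assuming a frame with a `transaction_id` column; `quality_report` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def quality_report(df, id_col="transaction_id"):
    """Return simple counts that the summary bullets can be checked against."""
    missing = df.isna().sum()
    return {
        "rows": len(df),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "missing_by_column": {col: int(n) for col, n in missing.items()},
        "total_missing": int(missing.sum()),
    }

# Hypothetical frame with one repeated ID, one missing date, and one missing total
df_demo = pd.DataFrame({
    "transaction_id": ["TXN001", "TXN002", "TXN002"],
    "date": pd.to_datetime(["2024-01-15", None, "2024-01-16"]),
    "total_amount": [59.5, None, 27.75],
})

report = quality_report(df_demo)
print(report)
```

Calling `quality_report(df_cleaned)` before and after the final fill step would show exactly which columns still contain gaps.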

In [95]:
df_cleaned = df_step7.copy()

print("COMPLETE FINAL CLEANED DATASET:")
print("="*100)
print(df_cleaned.to_string())

COMPLETE FINAL CLEANED DATASET:
   transaction_id       date      customer_name                     movie_title ticket_type  ticket_quantity  ticket_price                   concession_items  concession_total payment_method  total_amount
0          TXN001 2024-01-15         John Smith                  Dune: Part Two       Adult                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
1          TXN002 2024-01-15      Sarah Johnson                  Dune: Part Two       Adult                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
2          TXN003        NaT      Michael Brown                      The Batman      Senior                1          12.0                       Popcorn,Soda             15.75           Cash         27.75
3          TXN004 2024-01-16        Emily Davis                     Oppenheimer     Student                3          13.0               Nachos,Pret

In [96]:
# Summary statistics
print("CLEANING SUMMARY:")
print("="*100)
print(f"✓ Original rows: {len(df_raw)}")
print(f"✓ Final rows: {len(df_cleaned)}")
print(f"✓ Duplicates removed: {len(df_raw) - len(df_step3)}")
print(f"✓ Dates standardized: {df_cleaned['date'].notna().sum()}")
print(f"✓ Missing dates: {df_cleaned['date'].isna().sum()}")
print(f"✓ Text fields standardized: ticket_type, movie_title, payment_method")
print(f"✓ Concession items cleaned and standardized")
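# NOTE: "0 missing" is hardcoded, not computed. It holds for ticket_price only;
# total_amount still contains NaN values at this point (see the next cell).
print(f"✓ Missing numeric values handled: 0 missing")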

print("\nDATA QUALITY METRICS:")
print("="*100)
print(f"Unique transactions: {df_cleaned['transaction_id'].nunique()}")
print(f"Unique customers: {df_cleaned['customer_name'].nunique()}")
print(f"Unique movies: {df_cleaned['movie_title'].nunique()}")
print(f"Date range: {df_cleaned['date'].min()} to {df_cleaned['date'].max()}")
print(f"Total revenue: ${df_cleaned['total_amount'].sum():.2f}")

# Save cleaned data
output_file = 'cleaned_transactions.csv'
df_cleaned.to_csv(output_file, index=False)
print(f"\n✓ Cleaned data saved to: {output_file}")

CLEANING SUMMARY:
✓ Original rows: 64
✓ Final rows: 59
✓ Duplicates removed: 5
✓ Dates standardized: 50
✓ Missing dates: 9
✓ Text fields standardized: ticket_type, movie_title, payment_method
✓ Concession items cleaned and standardized
✓ Missing numeric values handled: 0 missing

DATA QUALITY METRICS:
Unique transactions: 59
Unique customers: 59
Unique movies: 41
Date range: 2024-01-15 00:00:00 to 2024-03-02 00:00:00
Total revenue: $2828.75

✓ Cleaned data saved to: cleaned_transactions.csv


In [97]:
# Check for remaining missing values
print("REMAINING MISSING VALUES ANALYSIS:")
print("="*100)

missing_mask = df_cleaned.isnull().any(axis=1)
rows_with_missing = df_cleaned[missing_mask]

print(f"Total rows with ANY missing values: {len(rows_with_missing)}")
print(f"\nMissing values by column:")
print(df_cleaned.isnull().sum())

if len(rows_with_missing) > 0:
    print("\n" + "="*100)
    print("DETAILED VIEW OF ROWS WITH MISSING VALUES:")
    print("="*100)
    print(rows_with_missing[['transaction_id', 'date', 'ticket_price', 'concession_total', 'total_amount']].to_string())


REMAINING MISSING VALUES ANALYSIS:
Total rows with ANY missing values: 18

Missing values by column:
transaction_id      0
date                9
customer_name       0
movie_title         1
ticket_type         0
ticket_quantity     0
ticket_price        0
concession_items    0
concession_total    0
payment_method      1
total_amount        7
dtype: int64

DETAILED VIEW OF ROWS WITH MISSING VALUES:
   transaction_id       date  ticket_price  concession_total  total_amount
2          TXN003        NaT          12.0             15.75         27.75
7          TXN007        NaT          10.0             12.50         32.50
9          TXN009        NaT          12.0              8.50         20.50
13         TXN012 2024-01-19          15.5             16.50           NaN
15         TXN014        NaT          15.5             18.25         33.75
19         TXN018 2024-01-21          12.0             21.50         33.50
22         TXN021 2024-01-22          15.5             13.25           NaN


In [98]:
# Fix remaining missing values
print("HANDLING REMAINING MISSING VALUES:")
print("="*100)

df_cleaned_final = df_cleaned.copy()

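# 1. Fill missing dates with the median date
#    (an imputation: the filled dates are placeholders, not recorded values)
valid_dates = df_cleaned_final['date'].dropna()
if len(valid_dates) > 0:
    median_date = valid_dates.median()
    print(f"Using median date {median_date.date()} for {df_cleaned_final['date'].isna().sum()} missing dates")
    # Assign the result back; inplace fillna on a selected column is deprecated in recent pandas
    df_cleaned_final['date'] = df_cleaned_final['date'].fillna(median_date)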

# 2. Fix missing total_amount values - recalculate from components
for idx, row in df_cleaned_final.iterrows():
    if pd.isna(row['total_amount']):
        ticket_total = row['ticket_quantity'] * row['ticket_price']
        concession_total = row['concession_total'] if not pd.isna(row['concession_total']) else 0
        calculated_total = round(ticket_total + concession_total, 2)
        df_cleaned_final.at[idx, 'total_amount'] = calculated_total
        print(f"  TXN: {row['transaction_id']}, Calculated total_amount: ${calculated_total}")

# 3. Fill missing movie_title with "Unknown"
missing_movie_count = df_cleaned_final['movie_title'].isna().sum()
if missing_movie_count > 0:
    print(f"Filled {missing_movie_count} missing movie_title with 'Unknown'")
    df_cleaned_final['movie_title'] = df_cleaned_final['movie_title'].fillna('Unknown')

# 4. Fill missing payment_method with "Unknown"
missing_payment_count = df_cleaned_final['payment_method'].isna().sum()
if missing_payment_count > 0:
    print(f"Filled {missing_payment_count} missing payment_method with 'Unknown'")
    df_cleaned_final['payment_method'] = df_cleaned_final['payment_method'].fillna('Unknown')

# 5. Fill missing concession_total with 0
missing_concession_count = df_cleaned_final['concession_total'].isna().sum()
if missing_concession_count > 0:
    print(f"Filled {missing_concession_count} missing concession_total with 0")
    df_cleaned_final['concession_total'] = df_cleaned_final['concession_total'].fillna(0)

print("\n" + "="*100)
print("VERIFICATION - REMAINING MISSING VALUES:")
print("="*100)
print(df_cleaned_final.isnull().sum())
print(f"\n✓ Total missing values remaining: {df_cleaned_final.isnull().sum().sum()}")


HANDLING REMAINING MISSING VALUES:
Using median date 2024-01-26 for 9 missing dates
  TXN: TXN012, Calculated total_amount: $47.5
  TXN: TXN021, Calculated total_amount: $28.75
  TXN: TXN029, Calculated total_amount: $50.5
  TXN: TXN039, Calculated total_amount: $67.5
  TXN: TXN046, Calculated total_amount: $59.25
  TXN: TXN053, Calculated total_amount: $19.0
  TXN: TXN059, Calculated total_amount: $71.0
Filled 1 missing movie_title with 'Unknown'
Filled 1 missing payment_method with 'Unknown'

VERIFICATION - REMAINING MISSING VALUES:
transaction_id      0
date                0
customer_name       0
movie_title         0
ticket_type         0
ticket_quantity     0
ticket_price        0
concession_items    0
concession_total    0
payment_method      0
total_amount        0
dtype: int64

✓ Total missing values remaining: 0
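
The `total_amount` recalculation above walks the rows with `iterrows()`. For larger datasets the same rule (tickets plus concessions, treating a missing `concession_total` as 0) can be applied to whole columns at once. A minimal sketch on a made-up three-row frame, not the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; column names follow the notebook
df = pd.DataFrame({
    "ticket_quantity": [2, 1, 3],
    "ticket_price": [15.5, 12.0, 13.0],
    "concession_total": [28.5, np.nan, 10.0],
    "total_amount": [59.5, np.nan, np.nan],
})

# Recompute a total for every row, treating missing concessions as 0
recomputed = (
    df["ticket_quantity"] * df["ticket_price"]
    + df["concession_total"].fillna(0)
).round(2)

# Use the recomputed value only where total_amount is missing
df["total_amount"] = df["total_amount"].fillna(recomputed)
print(df)
```

Existing totals are left untouched because `fillna` only writes into the missing positions.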


In [99]:
# Update df_cleaned and resave
df_cleaned = df_cleaned_final.copy()

# Save the completely cleaned data
output_file = 'cleaned_transactions.csv'
df_cleaned.to_csv(output_file, index=False)
print(f"✓ Completely cleaned data (0 missing values) saved to: {output_file}")

# Show final summary
print("\nFINAL DATA QUALITY METRICS:")
print("="*100)
print(f"Total rows: {len(df_cleaned)}")
print(f"Unique transactions: {df_cleaned['transaction_id'].nunique()}")
print(f"Unique customers: {df_cleaned['customer_name'].nunique()}")
print(f"Unique movies: {df_cleaned['movie_title'].nunique()}")
print(f"Date range: {df_cleaned['date'].min().date()} to {df_cleaned['date'].max().date()}")
print(f"Total revenue: ${df_cleaned['total_amount'].sum():.2f}")
print(f"\n✓ DATA IS NOW 100% CLEAN - READY FOR ANALYSIS")


✓ Completely cleaned data (0 missing values) saved to: cleaned_transactions.csv

FINAL DATA QUALITY METRICS:
Total rows: 59
Unique transactions: 59
Unique customers: 59
Unique movies: 42
Date range: 2024-01-15 to 2024-03-02
Total revenue: $3172.25

✓ DATA IS NOW 100% CLEAN - READY FOR ANALYSIS


In [100]:
# FINAL VERIFICATION - Show the completely cleaned dataset
print("COMPLETE FINAL CLEANED DATASET (100% Clean):")
print("="*100)
print(df_cleaned.to_string())
print("\n" + "="*100)
print(f"Total missing values in df_cleaned: {df_cleaned.isnull().sum().sum()}")


COMPLETE FINAL CLEANED DATASET (100% Clean):
   transaction_id       date      customer_name                     movie_title ticket_type  ticket_quantity  ticket_price                   concession_items  concession_total payment_method  total_amount
0          TXN001 2024-01-15         John Smith                  Dune: Part Two       Adult                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
1          TXN002 2024-01-15      Sarah Johnson                  Dune: Part Two       Adult                2          15.5                 Popcorn,Soda,Candy             28.50    Credit Card         59.50
2          TXN003 2024-01-26      Michael Brown                      The Batman      Senior                1          12.0                       Popcorn,Soda             15.75           Cash         27.75
3          TXN004 2024-01-16        Emily Davis                     Oppenheimer     Student                3          13.0             

---
# ASSOCIATION RULE MINING

## What is Association Rule Mining?

Association rule mining discovers **relationships between items** that are frequently purchased together.

**Example Rule:** `{Popcorn} → {Soda}`
- **Meaning:** "If a customer buys Popcorn, they are likely to also buy Soda"

**Key Metrics:**
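1. **Support:** The fraction of all transactions that contain the itemset
2. **Confidence:** Of the transactions that contain the antecedent, the fraction that also contain the consequent
3. **Lift:** How much more often the items appear together than they would if they were independent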

---

## STEP 9: Prepare Data for Association Rule Mining

**Explanation:** Association rule mining requires a specific data format:
- Each row = one transaction
- Each column = one item (binary: bought/not bought)
- This is called a **transaction-item matrix** or **one-hot encoding**

**What we'll include as "items":**
- Movie titles (prefixed with "Movie:")
- Ticket types (prefixed with "Ticket:")
- Concession items (prefixed with "Concession:")
- Payment methods (prefixed with "Payment:")
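
To make the target format concrete, here is a minimal pure-Python sketch of the encoding on three made-up transactions (the item lists are illustrative, not taken from the dataset). It builds one sorted column per distinct item and one boolean row per transaction, which is the same shape of result that `TransactionEncoder` produces in the cells that follow.

```python
# Toy transactions (hypothetical data for illustration only)
toy_transactions = [
    ["Ticket:Adult", "Concession:Popcorn", "Concession:Soda"],
    ["Ticket:Child", "Concession:Popcorn"],
    ["Ticket:Adult", "Concession:Soda"],
]

# One column per distinct item, in sorted order
columns = sorted({item for t in toy_transactions for item in t})

# One row per transaction: True if the item is present, else False
matrix = [[col in set(t) for col in columns] for t in toy_transactions]

print(columns)
for row in matrix:
    print(row)
```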

In [101]:
# Create list of transactions
transactions = []
transaction_ids = []

for idx, row in df_cleaned.iterrows():
    items = []
    
    # Add movie title as an item
    if pd.notna(row['movie_title']) and row['movie_title']:
        items.append(f"Movie:{row['movie_title']}")
    
    # Add ticket type as an item
    if pd.notna(row['ticket_type']) and row['ticket_type']:
        items.append(f"Ticket:{row['ticket_type']}")
    
    # Add concession items
    if pd.notna(row['concession_items']) and row['concession_items']:
        concession_list = row['concession_items'].split(',')
        for item in concession_list:
            item = item.strip()
            if item:
                items.append(f"Concession:{item}")
    
    # Add payment method
    if pd.notna(row['payment_method']) and row['payment_method']:
        items.append(f"Payment:{row['payment_method']}")
    
    if items:  # Only add non-empty transactions
        transactions.append(items)
        transaction_ids.append(row['transaction_id'])

print(f"TRANSACTION LIST CREATED:")
print("="*100)
print(f"✓ Created {len(transactions)} transaction records for analysis")
print("\nSample Transactions (First 10):")
for i in range(min(10, len(transactions))):
    print(f"{transaction_ids[i]}: {transactions[i]}")

TRANSACTION LIST CREATED:
✓ Created 59 transaction records for analysis

Sample Transactions (First 10):
TXN001: ['Movie:Dune: Part Two', 'Ticket:Adult', 'Concession:Popcorn', 'Concession:Soda', 'Concession:Candy', 'Payment:Credit Card']
TXN002: ['Movie:Dune: Part Two', 'Ticket:Adult', 'Concession:Popcorn', 'Concession:Soda', 'Concession:Candy', 'Payment:Credit Card']
TXN003: ['Movie:The Batman', 'Ticket:Senior', 'Concession:Popcorn', 'Concession:Soda', 'Payment:Cash']
TXN004: ['Movie:Oppenheimer', 'Ticket:Student', 'Concession:Nachos', 'Concession:Pretzel', 'Concession:Water', 'Payment:Debit Card']
TXN005: ['Movie:Barbie', 'Ticket:Adult', 'Concession:Hot Dog', 'Concession:Popcorn', 'Concession:Soda', 'Concession:Candy', 'Payment:Credit Card']
TXN006: ['Movie:Poor Things', 'Ticket:Adult', 'Concession:Popcorn', 'Concession:Soda', 'Payment:Cash']
TXN007: ['Movie:Killers Of The Flower Moon', 'Ticket:Child', 'Concession:Candy', 'Concession:Popcorn', 'Payment:Credit Card']
TXN008: ['Movie:T

In [102]:
# Convert to one-hot encoded DataFrame
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df_encoded = pd.DataFrame(te_array, columns=te.columns_)

print("ONE-HOT ENCODED TRANSACTION MATRIX:")
print("="*100)
print(f"Matrix shape: {df_encoded.shape[0]} transactions × {df_encoded.shape[1]} items")
print("\nFirst 10 transactions, First 15 items:")
print(df_encoded.iloc[:10, :15].to_string())

ONE-HOT ENCODED TRANSACTION MATRIX:
Matrix shape: 59 transactions × 57 items

First 10 transactions, First 15 items:
   Concession:Candy  Concession:Hot Dog  Concession:Nachos  Concession:Popcorn  Concession:Pretzel  Concession:Soda  Concession:Water  Movie:All Of Us Strangers  Movie:American Fiction  Movie:Anyone But You  Movie:Aquaman And The Lost Kingdom  Movie:Barbie  Movie:Dune: Part Two  Movie:Ferrari  Movie:Freelance
0              True               False              False                True               False             True             False                      False                   False                 False                               False         False                  True          False            False
1              True               False              False                True               False             True             False                      False                   False                 False                               False         False     

In [103]:
# Show complete encoded matrix
print("COMPLETE ONE-HOT ENCODED MATRIX:")
print("="*100)
print(df_encoded.to_string())

COMPLETE ONE-HOT ENCODED MATRIX:
    Concession:Candy  Concession:Hot Dog  Concession:Nachos  Concession:Popcorn  Concession:Pretzel  Concession:Soda  Concession:Water  Movie:All Of Us Strangers  Movie:American Fiction  Movie:Anyone But You  Movie:Aquaman And The Lost Kingdom  Movie:Barbie  Movie:Dune: Part Two  Movie:Ferrari  Movie:Freelance  Movie:Godzilla Minus One  Movie:Killers Of The Flower Moon  Movie:Leo  Movie:Maestro  Movie:Mean Girls  Movie:Memory  Movie:Migration  Movie:Napoleon  Movie:Next Goal Wins  Movie:Night Swim  Movie:Nyad  Movie:Oppenheimer  Movie:Poor Things  Movie:Renaissance: A Film By Beyoncé  Movie:Ricky Stanicky  Movie:Saltburn  Movie:Silent Night  Movie:Society Of The Snow  Movie:Taylor Swift: The Eras Tour  Movie:The Batman  Movie:The Beekeeper  Movie:The Boy And The Heron  Movie:The Bricklayer  Movie:The Color Purple  Movie:The Holdovers  Movie:The Hunger Games: The Ballad  Movie:The Iron Claw  Movie:The Marvels  Movie:The Shift  Movie:The Zone Of Interest 

In [104]:
# Item frequency analysis
item_counts = df_encoded.sum().sort_values(ascending=False)

print("ITEM FREQUENCY ANALYSIS:")
print("="*100)
print("How many times each item appears in transactions:\n")
print(item_counts.to_string())
print(f"\n✓ Data prepared for association rule mining")

ITEM FREQUENCY ANALYSIS:
How many times each item appears in transactions:

Concession:Soda                         40
Concession:Popcorn                      36
Concession:Candy                        31
Ticket:Adult                            30
Payment:Cash                            21
Payment:Credit Card                     20
Payment:Debit Card                      17
Concession:Nachos                       15
Concession:Hot Dog                      13
Concession:Water                        12
Ticket:Child                            11
Ticket:Senior                            9
Ticket:Student                           9
Concession:Pretzel                       8
Movie:The Holdovers                      2
Movie:Taylor Swift: The Eras Tour        2
Movie:Silent Night                       2
Movie:Wish                               2
Movie:Ricky Stanicky                     2
Movie:Dune: Part Two                     2
Movie:Wonka                              2
Movie:Aquaman And The

---
## STEP 10: Apply Apriori Algorithm (Find Frequent Itemsets)

**Explanation:** The Apriori algorithm identifies **itemsets** (combinations of items) that appear frequently together.

**Support Metric:**
- **Definition:** The proportion of transactions containing the itemset
- **Formula:** `Support(A) = (# of transactions containing A) / (total # of transactions)`
- **Example:** If Popcorn appears in 30 out of 50 transactions:
  - Support = 30/50 = 0.6 (60%)

**Minimum Support Threshold:**
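- We set a threshold (e.g., 0.1 = 10%)
- Only itemsets appearing in ≥10% of transactions are kept; with 59 transactions, that means at least 6 transactions
- This filters out rare itemsets, which are backed by too few transactions to support reliable patterns
- At this threshold no individual movie title qualifies, because each title appears in at most 2 transactions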
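
As a rough illustration of how the threshold works, the sketch below converts a minimum support into a minimum transaction count and then counts itemsets by brute force on a few made-up transactions. It is not the Apriori algorithm itself (which prunes candidates level by level), and the helper names `min_count` and `frequent_itemsets_bruteforce` are hypothetical; it only shows what "support ≥ threshold" selects.

```python
import math
from itertools import combinations

def min_count(min_support, n_transactions):
    """Smallest number of transactions that satisfies the support threshold."""
    # Rounding guards against floating-point noise such as 0.1 * 3 = 0.30000000000000004
    return math.ceil(round(min_support * n_transactions, 9))

def frequent_itemsets_bruteforce(transactions, min_support, max_len=3):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*sets))
    result = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            # Count the transactions that contain every item of the candidate
            count = sum(candidate <= s for s in sets)
            if count / n >= min_support:
                result[candidate] = count / n
    return result

# Toy transactions (hypothetical data for illustration only)
toy = [
    ["Popcorn", "Soda"],
    ["Popcorn", "Soda", "Candy"],
    ["Popcorn", "Candy"],
    ["Soda"],
]
found = frequent_itemsets_bruteforce(toy, min_support=0.5)
for itemset, support in sorted(found.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)

print(min_count(0.1, 59))
```

For the 59 transactions in this notebook, `min_count(0.1, 59)` gives 6, which matches the lowest supports in the outputs (0.101695 = 6/59).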

In [105]:
# Apply Apriori algorithm
min_support = 0.1  # Itemsets must appear in at least 10% of transactions

print("APPLYING APRIORI ALGORITHM:")
print("="*100)
print(f"Minimum support threshold: {min_support} ({min_support*100}%)")
print("This means itemsets must appear in at least 10% of transactions\n")

frequent_itemsets = apriori(df_encoded, min_support=min_support, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)

print(f"✓ Found {len(frequent_itemsets)} frequent itemsets")

APPLYING APRIORI ALGORITHM:
Minimum support threshold: 0.1 (10.0%)
This means itemsets must appear in at least 10% of transactions

✓ Found 93 frequent itemsets


In [106]:
# Itemset size distribution
print("FREQUENT ITEMSETS BY SIZE:")
print("="*100)
print(frequent_itemsets.groupby('length').size())

FREQUENT ITEMSETS BY SIZE:
length
1    14
2    37
3    29
4    11
5     2
dtype: int64


In [107]:
# Show all frequent itemsets sorted by support
print("ALL FREQUENT ITEMSETS (Sorted by Support):")
print("="*100)
frequent_itemsets_sorted = frequent_itemsets.sort_values('support', ascending=False)
print(frequent_itemsets_sorted.to_string())

ALL FREQUENT ITEMSETS (Sorted by Support):
     support                                                                                    itemsets  length
5   0.677966                                                                           (Concession:Soda)       1
3   0.610169                                                                        (Concession:Popcorn)       1
0   0.525424                                                                          (Concession:Candy)       1
10  0.508475                                                                              (Ticket:Adult)       1
29  0.474576                                                       (Concession:Popcorn, Concession:Soda)       2
14  0.457627                                                      (Concession:Popcorn, Concession:Candy)       2
15  0.423729                                                         (Concession:Candy, Concession:Soda)       2
41  0.372881                                         

In [108]:
# Single items analysis
print("SINGLE ITEMS (Length=1):")
print("="*100)
single_items = frequent_itemsets[frequent_itemsets['length'] == 1].sort_values('support', ascending=False)
print(single_items.to_string())
print(f"\nThese are the most popular individual items")

SINGLE ITEMS (Length=1):
     support               itemsets  length
5   0.677966      (Concession:Soda)       1
3   0.610169   (Concession:Popcorn)       1
0   0.525424     (Concession:Candy)       1
10  0.508475         (Ticket:Adult)       1
7   0.355932         (Payment:Cash)       1
8   0.338983  (Payment:Credit Card)       1
9   0.288136   (Payment:Debit Card)       1
2   0.254237    (Concession:Nachos)       1
1   0.220339   (Concession:Hot Dog)       1
6   0.203390     (Concession:Water)       1
11  0.186441         (Ticket:Child)       1
12  0.152542        (Ticket:Senior)       1
13  0.152542       (Ticket:Student)       1
4   0.135593   (Concession:Pretzel)       1

These are the most popular individual items


In [109]:
# Item pairs analysis
print("ITEM PAIRS (Length=2):")
print("="*100)
pairs = frequent_itemsets[frequent_itemsets['length'] == 2].sort_values('support', ascending=False)
print(pairs.to_string())
print(f"\nThese pairs of items are frequently purchased together")

ITEM PAIRS (Length=2):
     support                                   itemsets  length
29  0.474576      (Concession:Popcorn, Concession:Soda)       2
14  0.457627     (Concession:Popcorn, Concession:Candy)       2
15  0.423729        (Concession:Candy, Concession:Soda)       2
41  0.372881            (Ticket:Adult, Concession:Soda)       2
39  0.322034     (Payment:Credit Card, Concession:Soda)       2
33  0.288136         (Concession:Popcorn, Ticket:Adult)       2
49  0.271186        (Payment:Credit Card, Ticket:Adult)       2
19  0.237288           (Concession:Candy, Ticket:Adult)       2
30  0.237288         (Payment:Cash, Concession:Popcorn)       2
26  0.220339       (Concession:Soda, Concession:Nachos)       2
31  0.220339  (Payment:Credit Card, Concession:Popcorn)       2
23  0.203390      (Concession:Soda, Concession:Hot Dog)       2
17  0.203390    (Payment:Credit Card, Concession:Candy)       2
16  0.203390           (Payment:Cash, Concession:Candy)       2
38  0.186441     

In [110]:
# Item triplets analysis
print("ITEM TRIPLETS (Length=3):")
print("="*100)
triplets = frequent_itemsets[frequent_itemsets['length'] == 3].sort_values('support', ascending=False)
if len(triplets) > 0:
    print(triplets.to_string())
    print(f"\nThese triplets of items are frequently purchased together")
else:
    print("No triplets found with minimum support threshold")
    print("This is normal - larger itemsets are less frequent")

ITEM TRIPLETS (Length=3):
     support                                                     itemsets  length
51  0.355932      (Concession:Popcorn, Concession:Candy, Concession:Soda)       3
79  0.271186         (Payment:Credit Card, Ticket:Adult, Concession:Soda)       3
73  0.237288          (Concession:Popcorn, Ticket:Adult, Concession:Soda)       3
71  0.203390   (Payment:Credit Card, Concession:Popcorn, Concession:Soda)       3
55  0.203390         (Concession:Popcorn, Concession:Candy, Ticket:Adult)       3
59  0.203390            (Concession:Candy, Concession:Soda, Ticket:Adult)       3
56  0.186441         (Concession:Popcorn, Concession:Candy, Ticket:Child)       3
58  0.186441     (Payment:Credit Card, Concession:Candy, Concession:Soda)       3
68  0.169492           (Ticket:Adult, Concession:Soda, Concession:Nachos)       3
52  0.169492         (Payment:Cash, Concession:Popcorn, Concession:Candy)       3
76  0.169492      (Payment:Credit Card, Concession:Popcorn, Ticket:Adult

---
## STEP 11: Generate Association Rules

**Explanation:** Association rules show **directional relationships**: "If A, then B"

**Rule Format:** `{Antecedent} → {Consequent}`
- **Antecedent:** The "if" part (what the customer already has)
- **Consequent:** The "then" part (what they're likely to buy)

**Three Key Metrics:**

1. **SUPPORT:** Frequency of the complete itemset
   - `Support(A,B) = (# of transactions with both A and B) / (total # of transactions)`
   - Higher support = more common combination
   - Range: 0 to 1

2. **CONFIDENCE:** Probability of consequent given antecedent
   - `Confidence(A→B) = Support(A,B) / Support(A)`
   - Example: If 80% of Popcorn buyers also buy Soda, Confidence = 0.8
   - Range: 0 to 1

3. **LIFT:** How much more likely the consequent is when the antecedent is present, compared with its overall rate
   - `Lift(A→B) = Confidence(A→B) / Support(B)`
   - **Lift > 1:** Positive correlation (A and B go together)
   - **Lift = 1:** No correlation (independent)
   - **Lift < 1:** Negative correlation (A and B don't go together)
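
As a worked check of these formulas, the sketch below recomputes the metrics for `{Popcorn} → {Soda}` from counts that appear in this notebook's outputs: 59 transactions, Popcorn in 36, Soda in 40, and a pair support of 0.474576, which corresponds to 28 transactions. The function name `rule_metrics` is illustrative.

```python
def rule_metrics(n_total, n_a, n_b, n_ab):
    """Support, confidence and lift for the rule A -> B from raw counts."""
    support_ab = n_ab / n_total          # P(A and B)
    confidence = n_ab / n_a              # P(B | A) = Support(A,B) / Support(A)
    lift = confidence / (n_b / n_total)  # P(B | A) / P(B)
    return support_ab, confidence, lift

# Counts taken from the item frequency and frequent itemset outputs above
support, confidence, lift = rule_metrics(n_total=59, n_a=36, n_b=40, n_ab=28)
print(f"support={support:.6f} confidence={confidence:.6f} lift={lift:.6f}")
```

These values agree with the `Concession:Popcorn → Concession:Soda` row in the rules table (confidence 0.777778, lift 1.147222). A lift of about 1.15 means Soda appears in Popcorn transactions only about 15% more often than in transactions overall.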

In [111]:
# Generate association rules
min_threshold = 0.3  # Minimum confidence threshold

print("GENERATING ASSOCIATION RULES:")
print("="*100)
print(f"Minimum confidence threshold: {min_threshold} ({min_threshold*100}%)")
print("Rules must have at least 30% confidence to be included\n")

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_threshold)

if len(rules) > 0:
    # Add readable columns
    rules['antecedents_str'] = rules['antecedents'].apply(', '.join)
    rules['consequents_str'] = rules['consequents'].apply(', '.join)
    rules['rule'] = rules['antecedents_str'] + ' → ' + rules['consequents_str']
    
    print(f"✓ Generated {len(rules)} association rules\n")
else:
    print("No rules found with the minimum confidence threshold.")
    print("Try lowering the min_threshold value.")

GENERATING ASSOCIATION RULES:
Minimum confidence threshold: 0.3 (30.0%)
Rules must have at least 30% confidence to be included

✓ Generated 359 association rules
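
For context on that count, here is a small sketch of how many candidate rules the frequent itemsets can yield. It assumes each itemset of size k ≥ 2 is split into every non-empty antecedent with the remaining items as the consequent, giving 2^k − 2 candidates per itemset. That is how `association_rules` is understood to enumerate rules, an assumption worth checking against the mlxtend documentation. The size counts come from the "FREQUENT ITEMSETS BY SIZE" output, and the helper name `candidate_rules` is hypothetical.

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield (antecedent, consequent) for every non-trivial split of an itemset."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

# A 3-item set has 2**3 - 2 = 6 candidate rules
example = list(candidate_rules({"Popcorn", "Soda", "Candy"}))
print(len(example))

# Number of frequent itemsets of each size, from the output above
size_counts = {2: 37, 3: 29, 4: 11, 5: 2}
total_candidates = sum(n * (2**k - 2) for k, n in size_counts.items())
print(total_candidates)
```

Under that assumption there are 462 candidate rules, so the 0.3 confidence threshold removes a little over a fifth of them and leaves the 359 reported.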



In [112]:
# Display all rules sorted by LIFT
if len(rules) > 0:
    print("ALL ASSOCIATION RULES - SORTED BY LIFT:")
    print("="*100)
    print("Lift shows how much more likely items are purchased together vs independently")
    print("Higher lift = stronger association\n")
    rules_by_lift = rules.sort_values('lift', ascending=False)
    print(rules_by_lift[['rule', 'support', 'confidence', 'lift']].to_string())

ALL ASSOCIATION RULES - SORTED BY LIFT:
Lift shows how much more likely items are purchased together vs independently
Higher lift = stronger association

                                                                                          rule   support  confidence      lift
335         Ticket:Child → Payment:Cash, Concession:Popcorn, Concession:Candy, Concession:Soda  0.101695    0.545455  5.363636
314         Payment:Cash, Concession:Popcorn, Concession:Candy, Concession:Soda → Ticket:Child  0.101695    1.000000  5.363636
186                                  Concession:Pretzel → Concession:Water, Payment:Debit Card  0.101695    0.750000  4.916667
187                                  Concession:Water → Concession:Pretzel, Payment:Debit Card  0.101695    0.500000  4.916667
184                                  Concession:Pretzel, Payment:Debit Card → Concession:Water  0.101695    1.000000  4.916667
185                                  Concession:Water, Payment:Debit Card → Concessi

In [113]:
# Display all rules sorted by CONFIDENCE
if len(rules) > 0:
    print("ALL ASSOCIATION RULES - SORTED BY CONFIDENCE:")
    print("="*100)
    print("Confidence shows how often the rule is true")
    print("Higher confidence = more reliable prediction\n")
    rules_by_conf = rules.sort_values('confidence', ascending=False)
    print(rules_by_conf[['rule', 'support', 'confidence', 'lift']].to_string())

ALL ASSOCIATION RULES - SORTED BY CONFIDENCE:
Confidence shows how often the rule is true
Higher confidence = more reliable prediction

                                                                                          rule   support  confidence      lift
260                             Payment:Cash, Concession:Soda, Ticket:Child → Concession:Candy  0.101695    1.000000  1.903226
317         Payment:Cash, Concession:Popcorn, Concession:Soda, Ticket:Child → Concession:Candy  0.101695    1.000000  1.903226
138                                   Payment:Credit Card, Concession:Nachos → Concession:Soda  0.152542    1.000000  1.475000
315         Payment:Cash, Concession:Candy, Concession:Soda, Ticket:Child → Concession:Popcorn  0.101695    1.000000  1.638889
314         Payment:Cash, Concession:Popcorn, Concession:Candy, Concession:Soda → Ticket:Child  0.101695    1.000000  5.363636
37                                                           Ticket:Child → Concession:Popcorn  0.1864

In [114]:
# Display all rules sorted by SUPPORT
if len(rules) > 0:
    print("ALL ASSOCIATION RULES - SORTED BY SUPPORT:")
    print("="*100)
    print("Support shows how frequently the combination appears")
    print("Higher support = more common pattern\n")
    rules_by_supp = rules.sort_values('support', ascending=False)
    print(rules_by_supp[['rule', 'support', 'confidence', 'lift']].to_string())

ALL ASSOCIATION RULES - SORTED BY SUPPORT:
Support shows how frequently the combination appears
Higher support = more common pattern

                                                                                          rule   support  confidence      lift
28                                                        Concession:Soda → Concession:Popcorn  0.474576    0.700000  1.147222
27                                                        Concession:Popcorn → Concession:Soda  0.474576    0.777778  1.147222
1                                                        Concession:Candy → Concession:Popcorn  0.457627    0.870968  1.427419
0                                                        Concession:Popcorn → Concession:Candy  0.457627    0.750000  1.427419
3                                                           Concession:Soda → Concession:Candy  0.423729    0.625000  1.189516
2                                                           Concession:Candy → Concession:Soda  0.423729

In [115]:
# Detailed interpretation of top rules
if len(rules) > 0:
    print("DETAILED RULE INTERPRETATION (Top 10 by Lift):")
    print("="*100)
    
    for i, (idx, row) in enumerate(rules_by_lift.head(10).iterrows(), 1):
        print(f"\n{'='*100}")
        print(f"RULE #{i}: {row['rule']}")
        print(f"{'='*100}")
        print(f"Support:    {row['support']:.3f} ({row['support']*100:.1f}% of transactions)")
        print(f"Confidence: {row['confidence']:.3f} ({row['confidence']*100:.1f}%)")
        print(f"Lift:       {row['lift']:.3f}")
        print(f"\nInterpretation:")
        print(f"  • When customers buy {row['antecedents_str']},")
        print(f"    {row['confidence']*100:.1f}% of them also buy {row['consequents_str']}")
        print(f"  • That is {row['lift']:.2f}x the rate at which {row['consequents_str']} appears across all transactions")
        if row['lift'] > 1.5:
            print(f"  • STRONG positive correlation - excellent cross-selling opportunity!")
        elif row['lift'] > 1.2:
            print(f"  • MODERATE positive correlation - good cross-selling potential")
        elif row['lift'] > 1:
            print(f"  • WEAK positive correlation")
        else:
            print(f"  • No positive correlation (lift ≤ 1)")

DETAILED RULE INTERPRETATION (Top 10 by Lift):

RULE #1: Ticket:Child → Payment:Cash, Concession:Popcorn, Concession:Candy, Concession:Soda
Support:    0.102 (10.2% of transactions)
Confidence: 0.545 (54.5%)
Lift:       5.364

Interpretation:
  • When customers buy Ticket:Child,
    54.5% of them also buy Payment:Cash, Concession:Popcorn, Concession:Candy, Concession:Soda
  • That is 5.36x the rate at which Payment:Cash, Concession:Popcorn, Concession:Candy, Concession:Soda appears across all transactions
  • STRONG positive correlation - excellent cross-selling opportunity!

RULE #2: Payment:Cash, Concession:Popcorn, Concession:Candy, Concession:Soda → Ticket:Child
Support:    0.102 (10.2% of transactions)
Confidence: 1.000 (100.0%)
Lift:       5.364

Interpretation:
  • When customers buy Payment:Cash, Concession:Popcorn, Concession:Candy, Concession:Soda,
    100.0% of them also buy Ticket:Child
  • That is 5.36x the rate at which Ticket:Child appears across all transactions
  • STRONG positive correlation - excellent cro

In [116]:
# Save rules to CSV
if len(rules) > 0:
    rules_output = 'association_rules.csv'
    rules[['rule', 'antecedents_str', 'consequents_str', 'support', 'confidence', 'lift']].to_csv(rules_output, index=False)
    print(f"\n✓ Association rules saved to: {rules_output}")


✓ Association rules saved to: association_rules.csv


---
## STEP 12: Business Insights and Recommendations

**Explanation:** Now we translate the statistical findings into actionable business strategies.
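
One caveat for the cell below: `str.contains('Concession')` keeps any rule that has at least one concession item on each side, so rules that also involve `Payment:` or `Ticket:` items pass the filter. If only concession-to-concession rules are wanted, a stricter test can check every item's prefix. A minimal sketch, with the hypothetical helper `only_concessions` applied to plain frozensets rather than the `rules` DataFrame:

```python
def only_concessions(itemset, prefix="Concession:"):
    """True if every item in the itemset carries the concession prefix."""
    return all(item.startswith(prefix) for item in itemset)

# Two rules copied from the outputs above, as (antecedent, consequent) pairs
example_rules = [
    (frozenset({"Concession:Pretzel"}),
     frozenset({"Concession:Water", "Payment:Debit Card"})),
    (frozenset({"Concession:Popcorn"}),
     frozenset({"Concession:Soda"})),
]

# Keep a rule only if both sides consist purely of concession items
pure = [
    (a, c) for a, c in example_rules
    if only_concessions(a) and only_concessions(c)
]
for a, c in pure:
    print(sorted(a), "->", sorted(c))
```

With the `rules` DataFrame, the same helper could be applied to the `antecedents` and `consequents` columns via `.apply(only_concessions)`.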

In [117]:
if len(rules) > 0:
    print("BUSINESS INSIGHTS AND RECOMMENDATIONS:")
    print("="*100)
    
    # Focus on concession-related rules
    concession_rules = rules[
        rules['antecedents_str'].str.contains('Concession') & 
        rules['consequents_str'].str.contains('Concession')
    ].sort_values('lift', ascending=False)
    
    if len(concession_rules) > 0:
        print("\nKEY CONCESSION ITEM COMBINATIONS:")
        print("-" * 100)
        for i, (idx, row) in enumerate(concession_rules.head(5).iterrows(), 1):
            print(f"\n{i}. {row['rule']}")
            print(f"   Confidence: {row['confidence']*100:.1f}% | Lift: {row['lift']:.2f}x | Support: {row['support']*100:.1f}%")
    
    print("\n" + "="*100)
    print("ACTIONABLE RECOMMENDATIONS:")
    print("="*100)
    
    print("\n1. BUNDLE PROMOTIONS:")
    print("   • Create combo deals based on high-confidence rules")
    print("   • Example: 'Popcorn + Soda Combo' at 10% discount")
    print("   • Focus on items with lift > 1.5 for strongest effect")
    
    print("\n2. CROSS-SELLING STRATEGY:")
    print("   • Train concession staff to suggest items based on purchase")
    print("   • Example: 'Would you like Candy with your Popcorn?'")
    print("   • Use high-confidence rules (>70%) for suggestions")
    
    print("\n3. PRODUCT PLACEMENT:")
    print("   • Place frequently co-purchased items near each other")
    print("   • Example: Display Soda coolers next to Popcorn station")
    print("   • Create visual prompts for complementary items")
    
    print("\n4. INVENTORY MANAGEMENT:")
    print("   • Ensure adequate stock of items with high support")
    print("   • Monitor stock levels of commonly paired items")
    print("   • Predict demand based on association patterns")
    
    print("\n5. PERSONALIZED MARKETING:")
    print("   • Send targeted promotions based on past purchases")
    print("   • Example: Email Soda discount to frequent Popcorn buyers")
    print("   • Use customer transaction history for recommendations")
    
    print("\n6. MENU DESIGN:")
    print("   • Group related items together on menu boards")
    print("   • Highlight popular combinations")
    print("   • Use visual cues to suggest pairings")

print("\n" + "="*100)
print("ANALYSIS COMPLETE!")
print("="*100)
print("\nGenerated Files:")
print("1. cleaned_transactions.csv - Cleaned and preprocessed data")
print("2. association_rules.csv - All discovered association rules")

BUSINESS INSIGHTS AND RECOMMENDATIONS:

KEY CONCESSION ITEM COMBINATIONS:
----------------------------------------------------------------------------------------------------

1. Concession:Pretzel → Concession:Water, Payment:Debit Card
   Confidence: 75.0% | Lift: 4.92x | Support: 10.2%

2. Concession:Pretzel, Payment:Debit Card → Concession:Water
   Confidence: 100.0% | Lift: 4.92x | Support: 10.2%

3. Concession:Water, Payment:Debit Card → Concession:Pretzel
   Confidence: 66.7% | Lift: 4.92x | Support: 10.2%

4. Concession:Water → Concession:Pretzel, Payment:Debit Card
   Confidence: 50.0% | Lift: 4.92x | Support: 10.2%

5. Payment:Cash, Concession:Popcorn, Concession:Candy → Concession:Soda, Ticket:Child
   Confidence: 60.0% | Lift: 4.43x | Support: 10.2%

ACTIONABLE RECOMMENDATIONS:

1. BUNDLE PROMOTIONS:
   • Create combo deals based on high-confidence rules
   • Example: 'Popcorn + Soda Combo' at 10% discount
   • Focus on items with lift > 1.5 for strongest effect

2. CROSS-SE