# **Learning Apriori Analysis Principles**

This notebook is designed as an educational tool to understand and apply the **Apriori Algorithm** for frequent itemset mining and association rule generation. The principles of Apriori analysis are demonstrated through practical implementation and step-by-step explanations.

---

## **Purpose of the Notebook**
1. **Understand Apriori Analysis**:
   - Learn the fundamentals of **frequent itemset mining** and **association rule generation**.
   - Explore the concepts of **support**, **confidence**, and **lift**, which are key metrics in association rule mining.

2. **Practical Application**:
   - Implement the Apriori algorithm to mine frequent itemsets from datasets.
   - Generate actionable association rules from the frequent itemsets.

3. **Educational Focus**:
   - Gain hands-on experience by working with structured datasets.
   - Learn how to preprocess data, apply the Apriori algorithm, and interpret results.

---

## **What You Will Learn**
1. **Key Concepts in Apriori Analysis**:
   - **Frequent Itemsets**: Identify patterns that occur frequently in transactional data.
   - **Support**: Measure how often an itemset appears in the dataset.
   - **Confidence**: Calculate the likelihood of a consequent given an antecedent.
   - **Lift**: Understand the strength of the rule relative to random chance.

2. **Step-by-Step Implementation**:
   - **Dataset Preprocessing**: Transform raw data into a format suitable for Apriori analysis.
   - **Frequent Itemset Mining**: Use the Apriori algorithm to find itemsets that meet a minimum support threshold.
   - **Association Rule Generation**: Extract meaningful rules from frequent itemsets based on confidence and lift thresholds.

3. **Practical Scenarios**:
   - Analyze transactional datasets such as game moves, traffic accidents, or retail sales.
   - Understand how to categorize numerical and categorical data for effective pattern mining.

---

## **Tools and Libraries Used**
- **Python 3.6+**
- **pandas**: For data manipulation and preprocessing.
- **mlxtend**: For implementing the Apriori algorithm and generating association rules.

---

## **Structure of the Notebook**
1. **Introduction to Apriori Analysis**:
   - Overview of frequent itemsets and association rules.

2. **Dataset Preprocessing**:
   - Transform raw data into transactions suitable for Apriori analysis.

3. **Frequent Itemset Mining**:
   - Apply the Apriori algorithm to identify frequent itemsets.

4. **Association Rule Generation**:
   - Generate rules based on user-defined thresholds for confidence and lift.

5. **Insights and Applications**:
   - Discuss the results and their potential applications in real-world scenarios.

---

## **Who Should Use This Notebook?**
- **Students and Learners**:
   - Ideal for those new to data mining and association rule generation.
   - Understand the theoretical and practical aspects of Apriori analysis.

- **Data Analysts**:
   - Learn to apply the Apriori algorithm to discover actionable patterns in data.

- **Instructors and Educators**:
   - Use this notebook as a teaching tool for frequent itemset mining concepts.

---

# **Apriori Analysis on Chess Dataset**

This repository performs **Apriori Analysis** on a **chess dataset** provided in CSV format with the following columns:

| **Columns**                                                                                     |
|-------------------------------------------------------------------------------------------------|
| `id`, `rated`, `created_at`, `last_move_at`, `turns`, `victory_status`, `winner`, `increment_code` |
| `white_id`, `white_rating`, `black_id`, `black_rating`, `moves`, `opening_eco`, `opening_name`, `opening_ply` |

The dataset explores attributes like **player ratings**, **game outcomes**, **opening strategies**, and more. Using the Apriori algorithm, frequent patterns and associations between these attributes are identified.

Due to the structured nature of the dataset and potential diversity of transactions, the analysis may go beyond **Candidate 2 (C2)** depending on the size of the dataset and support thresholds. This project highlights the application of Apriori analysis to chess games, extracting actionable insights like popular openings, common ratings, and their correlation with victories.

This project serves as a valuable resource for understanding chess game patterns and leveraging data mining techniques for strategic insights.

# **Formulas Used in Apriori Analysis**

## **1. Candidate Itemset Generation**
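After the join and prune steps, every candidate \( n \)-itemset has \( n \) subsets of size \( n-1 \), all of them frequent. The number of candidate \( n \)-itemsets is therefore bounded above by the number of ways to choose \( n \) of the frequent \((n-1)\)-itemsets:


**|Cn| ≤ |L(n-1)|! / (n! * (|L(n-1)| - n)!)**

Where:
- |Cn|: Number of candidate n-itemsets.
- |L(n-1)|: Number of frequent (n-1)-itemsets.

For n = 2 the bound is met exactly, because every pair of frequent items forms a candidate.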
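
As a quick check of the n = 2 case, the sketch below enumerates the pairs formed from a small, made-up **L1** and compares their number with the binomial coefficient. The item names are hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
from itertools import combinations
from math import comb

# Hypothetical frequent 1-itemsets (names are illustrative only)
l1_items = ["A", "B", "C", "D"]

# Every unordered pair of frequent items is a candidate 2-itemset
c2_candidates = list(combinations(l1_items, 2))

print(len(c2_candidates))      # number of generated pairs
print(comb(len(l1_items), 2))  # binomial coefficient C(|L1|, 2)
```

Both lines print the same number, which is why the formula is exact for 2-itemsets.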

---

## **2. Pruning Step**
Based on the **Apriori Property**, an itemset is pruned if any of its subsets is not frequent:

**If any subset of X is not frequent, then X is not frequent.**

- X: Candidate itemset.

---

## **3. Support**
Support measures the frequency of an itemset appearing in transactions:

**Support(X) = (Number of Transactions Containing X) / (Total Number of Transactions)**

---

## **4. Confidence**
Confidence measures the strength of an association rule:

**Confidence(X => Y) = Support(X ∪ Y) / Support(X)**

- X: Antecedent (if-part of the rule).
- Y: Consequent (then-part of the rule).

---

## **5. Lift**
Lift measures the strength of a rule compared to random chance:

**Lift(X => Y) = Confidence(X => Y) / Support(Y)**


Where:
- Lift > 1: Positive correlation between X and Y.
- Lift = 1: No correlation between X and Y.
- Lift < 1: Negative correlation between X and Y.
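
The three metrics can be computed by hand on a tiny example. The sketch below is a minimal illustration and not the notebook's own code; the four toy transactions and the helper names `support`, `confidence`, and `lift` are assumptions made for this example.

```python
# Toy transactions (made up for illustration)
toy_transactions = [
    {"Winner_white", "Victory Status_resign"},
    {"Winner_white", "Victory Status_mate"},
    {"Winner_black", "Victory Status_resign"},
    {"Winner_white", "Victory Status_resign"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Support(X ∪ Y) / Support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence(X => Y) / Support(Y)."""
    return confidence(x, y, transactions) / support(y, transactions)

x, y = {"Victory Status_resign"}, {"Winner_white"}
print(support(x | y, toy_transactions))   # 2 of 4 transactions -> 0.5
print(confidence(x, y, toy_transactions)) # 0.5 / 0.75
print(lift(x, y, toy_transactions))       # confidence / 0.75
```

In this toy data the lift comes out below 1, so resignation and a white win co-occur slightly less often than independence would predict.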

---

## **Summary Table of Metrics**

| **Metric**       | **Formula**                                       | **Purpose**                               |
|-------------------|---------------------------------------------------|-------------------------------------------|
| **Support**       | \( \text{Support}(X) = \frac{\text{Transactions Containing } X}{\text{Total Transactions}} \) | Measures frequency of an itemset          |
| **Confidence**    | \( \text{Confidence}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} \) | Measures strength of an association rule |
| **Lift**          | \( \text{Lift}(X \Rightarrow Y) = \frac{\text{Confidence}(X \Rightarrow Y)}{\text{Support}(Y)} \) | Measures correlation strength            |
| **Candidate Size**| \( \lvert C_n \rvert \le \binom{\lvert L_{n-1} \rvert}{n} \) | Upper bound on the number of candidate itemsets at step \( n \) (exact for \( n = 2 \)) |

# **Step 1: Generate C1 (Candidate Set with Support Counts)**

## **Objective:**
The goal of this step is to generate **C1**, a candidate set containing each individual item from the dataset along with its support count. This serves as the foundation for further frequent itemset mining using the Apriori algorithm.

---

## **How the Code Works:**

### **1. Load Dataset**
- The dataset (`chess.csv`) is loaded using `pandas.read_csv`.
- It contains attributes like `turns`, `victory_status`, `winner`, `white_rating`, `black_rating`, and more.

### **2. Apply Threshold-Based Categorization**
- Numerical attributes like `white_rating`, `black_rating`, and `turns` are categorized into ranges (e.g., `Rating_1200-1400`, `Turns_20-40`).
- Other attributes like `winner` and `victory_status` are treated as categorical.
- This helps convert the raw data into discrete transactions suitable for Apriori analysis.

#### **Example**:
For a row:
- `white_rating = 1400`, `black_rating = 1300`, `turns = 35`, `winner = "white"`.
- After categorization:
  - `White Rating_1200-1400`, `Black Rating_1200-1400`, `Turns_20-40`, `Winner_white`.
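
The example row can be reproduced with a small standard-library sketch. This is not the notebook's implementation (which uses `pd.cut`); the helper `bin_label` is a hypothetical name, and it assumes the same right-closed bins, so a rating of exactly 1400 falls into `1200-1400`.

```python
from bisect import bisect_left

def bin_label(value, edges, labels):
    """Return the label of the right-closed bin that contains `value`.

    With edges [e1, e2, ...], bin 0 holds value <= e1,
    bin 1 holds e1 < value <= e2, and so on.
    """
    return labels[bisect_left(edges, value)]

rating_edges = [1200, 1400, 1600, 1800, 2000]
rating_labels = ["<1200", "1200-1400", "1400-1600", "1600-1800", "1800-2000", "2000+"]
turn_edges = [20, 40, 60]
turn_labels = ["<20", "20-40", "40-60", "60+"]

# The row from the example above
row = {"white_rating": 1400, "black_rating": 1300, "turns": 35, "winner": "white"}

transaction = [
    f"White Rating_{bin_label(row['white_rating'], rating_edges, rating_labels)}",
    f"Black Rating_{bin_label(row['black_rating'], rating_edges, rating_labels)}",
    f"Turns_{bin_label(row['turns'], turn_edges, turn_labels)}",
    f"Winner_{row['winner']}",
]
print(transaction)
```

Because the bins are closed on the right, boundary values belong to the lower range, so the `<1200` label also covers a rating of exactly 1200.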

### **3. Generate Transactions**
- Each row in the dataset is converted into a transaction, representing a set of categorized attributes.

### **4. Generate C1**
- The code iterates over every transaction and counts how many times each item appears across the whole dataset.
- The result is a dictionary (`C1`) where:
  - **Key**: Item (e.g., `"Turns_20-40"`, `"Winner_white"`)
  - **Value**: Support count (number of transactions containing that item).
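
The same counting can be sketched with `collections.Counter`. This is an alternative formulation for illustration, not the notebook's `generate_c1`; the three toy transactions are made up.

```python
from collections import Counter

# Toy transactions (made up for illustration)
toy_transactions = [
    ["Turns_20-40", "Winner_white"],
    ["Turns_20-40", "Winner_black"],
    ["Turns_60+", "Winner_white"],
]

# set(t) guards against counting an item twice within one transaction,
# so each value is the number of transactions containing the item
c1_counts = Counter(item for t in toy_transactions for item in set(t))

for item, count in c1_counts.most_common():
    print(item, count)
```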

### **5. Display Results**
- The candidate set (`C1`) is converted into a tabular format using `pandas.DataFrame`.
- It is displayed as a table for easy visualization.

---

## **Output:**
- A table displaying each item and its support count. The rows below only illustrate the layout; the counts are placeholders, not results from the dataset.

| **Item**                | **Support Count** |
|--------------------------|-------------------|
| `White Rating_1200-1400` | 25                |
| `Winner_white`           | 30                |
| `Turns_20-40`            | 18                |

This provides an initial understanding of the most frequent individual attributes in the dataset, setting the stage for further analysis.

In [124]:
import pandas as pd

# Load the dataset
data = pd.read_csv("Data/chess.csv")

# Define thresholds or categorization rules for relevant columns
thresholds = {
    "turns": [20, 40, 60],  # Define ranges for turns
    "white_rating": [1200, 1400, 1600, 1800, 2000],  # Define ranges for white player rating
    "black_rating": [1200, 1400, 1600, 1800, 2000],  # Define ranges for black player rating
    "opening_ply": [5, 10, 15],  # Define ranges for opening ply
}

# Define functions for categorization
def discretize(value, bins, labels):
    """
    Discretizes a continuous value into categories based on defined bins and labels.
    """
    return pd.cut([value], bins=bins, labels=labels, right=True, include_lowest=True)[0]

def categorize(row, thresholds):
    """
    Categorizes each row into transaction items based on thresholds.
    """
    transaction = []
    
    # Discretize turns
    transaction.append(f"Turns_{discretize(row['turns'], bins=[0] + thresholds['turns'] + [float('inf')], labels=['<20', '20-40', '40-60', '60+'])}")
    
    # Discretize ratings
    transaction.append(f"White Rating_{discretize(row['white_rating'], bins=[0] + thresholds['white_rating'] + [float('inf')], labels=['<1200', '1200-1400', '1400-1600', '1600-1800', '1800-2000', '2000+'])}")
    transaction.append(f"Black Rating_{discretize(row['black_rating'], bins=[0] + thresholds['black_rating'] + [float('inf')], labels=['<1200', '1200-1400', '1400-1600', '1600-1800', '1800-2000', '2000+'])}")
    
    # Discretize opening ply
    transaction.append(f"Opening Ply_{discretize(row['opening_ply'], bins=[0] + thresholds['opening_ply'] + [float('inf')], labels=['<=5', '6-10', '11-15', '>15'])}")
    
    # Add categorical attributes
    transaction.append(f"Victory Status_{row['victory_status']}")
    transaction.append(f"Winner_{row['winner']}")
    return transaction

# Create transactions
transactions = data.apply(lambda row: categorize(row, thresholds), axis=1).tolist()

# Generate C1: Count support for each individual item
def generate_c1(transactions):
    """
    Generate C1, a candidate set of individual items with their support counts.

    Parameters:
        transactions (list): List of transactions, where each transaction is a list of items.

    Returns:
        dict: Dictionary of items and their support counts.
    """
    c1 = {}
    for transaction in transactions:
        for item in transaction:
            c1[item] = c1.get(item, 0) + 1
    return c1

# Generate C1
c1 = generate_c1(transactions)

# Convert C1 to a DataFrame for better visualization
c1_df = pd.DataFrame({
    "Item": list(c1.keys()),
    "Support Count": list(c1.values())
})

# Display C1
from IPython.display import display, HTML
print("C1 (Candidate Set with Support Counts):")
display(HTML(c1_df.to_html(index=False)))


C1 (Candidate Set with Support Counts):


Item,Support Count
Turns_<20,1795
White Rating_1400-1600,5775
Black Rating_<1200,1704
Opening Ply_<=5,13560
Victory Status_outoftime,1680
Winner_white,10001
White Rating_1200-1400,3490
Black Rating_1200-1400,3569
Victory Status_resign,11147
Winner_black,9107


# **Generating L from C: The Pruning Step in Apriori Analysis**

## **What is Pruning in Apriori?**
Pruning is the process of eliminating infrequent itemsets from the candidate set (**C**) to generate the frequent itemset (**L**). This step ensures that only itemsets meeting the minimum support threshold are carried forward in the Apriori analysis.

---

## **How It Works**
1. **Candidate Set (C)**:
   - Generated in each iteration of Apriori, containing potential itemsets of increasing size (e.g., **C1**, **C2**, **C3**).
   
2. **Frequent Itemset (L)**:
   - Derived from **C** by filtering itemsets based on the **minimum support threshold**.
   - Itemsets with support counts below the threshold are discarded.

3. **Relation to the Apriori Property**:
   - **Apriori Property**:
     - "If an itemset is not frequent, then none of its supersets can be frequent."
   - Pruning leverages this property to reduce the search space by focusing only on frequent itemsets.
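
The Apriori property can be turned into a simple subset check. The sketch below is a minimal illustration under assumed names: `has_infrequent_subset` and the toy set `frequent_2` do not appear in the notebook.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if any (k-1)-subset of `candidate` is missing from `frequent_prev`."""
    k = len(candidate)
    return any(
        frozenset(subset) not in frequent_prev
        for subset in combinations(candidate, k - 1)
    )

# Hypothetical frequent 2-itemsets
frequent_2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}

print(has_infrequent_subset(("A", "B", "C"), frequent_2))  # False: all three pairs are frequent
print(has_infrequent_subset(("A", "B", "D"), frequent_2))  # True: ("A", "D") is not frequent
```

A candidate for which the check returns `True` can be discarded without scanning the transactions at all.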

---

## **Why is it Called Pruning?**
- **Pruning** refers to removing unnecessary parts (infrequent itemsets) and focusing on the meaningful core (frequent itemsets).
- By eliminating infrequent itemsets, pruning ensures computational efficiency and avoids generating irrelevant candidates in subsequent iterations.

---

## **Example**

### **C1 → L1**:
- **Candidate Set C1**:
    ```plaintext
    {Item1: 5, Item2: 10, Item3: 2}
    ```
- **Support Threshold**: **3**
- **Frequent Set L1**:
    ```plaintext
    {Item1: 5, Item2: 10}
    ```

### **C2 → L2**:
- **Candidate Set C2** (built only from items in **L1**; `Item3` was pruned, so no candidate contains it):
    ```plaintext
    {Item1 & Item2: 4}
    ```
- **Support Threshold**: **3**
- **Frequent Set L2**:
    ```plaintext
    {Item1 & Item2: 4}
    ```
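
The filtering in this example can be sketched as a single dictionary comprehension. The helper name `filter_by_support` is an assumption made for this illustration; the counts are the ones from the **C1** example above, plus the count for the one pair made of surviving items.

```python
def filter_by_support(candidates, min_support):
    """Keep only the itemsets whose support count reaches the threshold."""
    return {
        itemset: count
        for itemset, count in candidates.items()
        if count >= min_support
    }

c1_example = {"Item1": 5, "Item2": 10, "Item3": 2}
l1_example = filter_by_support(c1_example, min_support=3)
print(l1_example)  # Item3 is dropped

# Only pairs made of surviving items are worth counting at the next level
c2_example = {("Item1", "Item2"): 4}
l2_example = filter_by_support(c2_example, min_support=3)
print(l2_example)
```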

---

## **Why is Pruning Important?**
1. **Focuses on Relevant Patterns**:
   - Eliminates infrequent itemsets that cannot contribute to meaningful rules or insights.

2. **Reduces Computational Complexity**:
   - Limits the size of subsequent candidate sets by removing unnecessary combinations.

3. **Adheres to Apriori Property**:
   - Ensures that only frequent itemsets are expanded in subsequent iterations.

---

## **Key Takeaways**
- Generating **L** from **C** is the **pruning step** in Apriori.
- Pruning reduces the search space, improves efficiency, and adheres to the Apriori property.
- It is a critical step in identifying meaningful patterns while maintaining computational feasibility.


## Step 2: Generate L1 (Filtered Itemset with Support Counts ≥ Minimum Support)

### Objective:
Filter the candidate set (`C1`) by removing items with support counts below a specified minimum support threshold. The resulting filtered itemset is called `L1`.

---

### **How the Code Works**

1. **Input**:
   - Takes the `C1` candidate set generated in Step 1, which contains items and their support counts.
   - A predefined minimum support threshold (`min_support`) is set. For example:
     ```python
     min_support = 3000
     ```

2. **Filtering Logic**:
   - Each item in `C1` is compared against the `min_support` threshold.
   - If an item's support count is greater than or equal to `min_support`, it is retained in the filtered set `L1`.
   - Items that do not meet this condition are removed.

3. **Output**:
   - The filtered set `L1`, containing items and their support counts, is stored as a dictionary.
   - This dictionary is converted to a `pandas.DataFrame` for tabular display.

4. **Visualization**:
   - The resulting `L1` is displayed as an HTML table for better readability in a Jupyter Notebook environment.

---

In [132]:
# Define the minimum support threshold
min_support = 3000  # Adjust based on dataset size and requirements

# Generate L1: Filter C1 to retain only items meeting the minimum support threshold
def generate_l1(c1, min_support):
    """
    Generate L1, the frequent itemsets of size 1, from C1.

    Parameters:
        c1 (dict): Dictionary of items and their support counts (C1).
        min_support (int): Minimum support threshold.

    Returns:
        dict: Dictionary of frequent items (L1) and their support counts.
    """
    l1 = {item: support for item, support in c1.items() if support >= min_support}
    return l1

# Generate L1 from C1
l1 = generate_l1(c1, min_support)

# Convert L1 to a DataFrame for better visualization
l1_df = pd.DataFrame({
    "Item": list(l1.keys()),
    "Support Count": list(l1.values())
})

# Display L1
if not l1_df.empty:
    print("L1 (Frequent 1-itemsets with Support Counts):")
    from IPython.display import display, HTML
    display(HTML(l1_df.to_html(index=False)))
else:
    print("L1 is empty. No items met the minimum support threshold.")


L1 (Frequent 1-itemsets with Support Counts):


Item,Support Count
White Rating_1400-1600,5775
Opening Ply_<=5,13560
Winner_white,10001
White Rating_1200-1400,3490
Black Rating_1200-1400,3569
Victory Status_resign,11147
Winner_black,9107
Turns_60+,8653
Black Rating_1400-1600,5772
Victory Status_mate,6325


# Generating Candidate 2-itemsets (C2) and Support Counts

## Objective:
To generate all possible 2-itemsets (**C2**) from frequent 1-itemsets (**L1**) and calculate their support counts based on transaction data.

---

## How the Code Works:

### **1. Generate Candidate 2-itemsets (C2)**
The function `generate_c2`:
- Takes **L1** (frequent 1-itemsets) as input.
- Creates all possible pairs of items from **L1** as candidate 2-itemsets.
- Ensures each itemset is a valid 2-itemset.

#### Example:
If **L1** contains:
`{"Item A": 5, "Item B": 7, "Item C": 6}`

The resulting **C2** will include:
`("Item A", "Item B"), ("Item A", "Item C"), ("Item B", "Item C")`

### **2. Count Support for C2**
The function `count_support_c2`:
- Iterates through transactions to count how many times each candidate 2-itemset in **C2** appears.
- Outputs a dictionary where:
  - **Key**: Candidate 2-itemset.
  - **Value**: Support count (number of transactions containing the itemset).

#### Example:
For transactions:
`{"Item A", "Item B"}, {"Item A", "Item C"}, {"Item B", "Item C"}`

And **C2**:
`("Item A", "Item B"), ("Item A", "Item C"), ("Item B", "Item C")`

The resulting support counts are:
`("Item A", "Item B"): 1, ("Item A", "Item C"): 1, ("Item B", "Item C"): 1`
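
The two examples above can be reproduced in a few lines. This is a compact sketch using `itertools.combinations` rather than the nested loops of `generate_c2`; the variable names are assumptions made for the illustration.

```python
from itertools import combinations

# Frequent 1-itemsets and transactions from the examples above
l1_example = {"Item A": 5, "Item B": 7, "Item C": 6}
example_transactions = [
    {"Item A", "Item B"},
    {"Item A", "Item C"},
    {"Item B", "Item C"},
]

# All unordered pairs of frequent items, in sorted order
c2_example = list(combinations(sorted(l1_example), 2))

# Count the transactions that contain both items of each pair
support_example = {
    pair: sum(set(pair) <= t for t in example_transactions)
    for pair in c2_example
}
print(support_example)
```

Each pair appears in exactly one of the three transactions, matching the counts listed above.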


---

### **3. Include All Candidates (Including Zero Support)**
This cell does not prune candidates by checking whether their subsets are frequent (the `is_frequent` helper is defined but never applied):
- This code includes all candidate 2-itemsets from **C2**.
- Support counts are calculated for all itemsets, even if their support is zero.

---

### **4. Display Results**
- Converts the support counts into a DataFrame for easier visualization.
- Displays the results as an HTML table.

#### Example Output:
| Itemset                  | Support Count |
|--------------------------|---------------|
| Item A & Item B          | 1             |
| Item A & Item C          | 1             |
| Item B & Item C          | 1             |

---

## Purpose:
This step ensures that all candidate 2-itemsets are generated and evaluated, providing a complete view of their support counts, regardless of whether they meet a specific support threshold.

---

## Use Cases:
- Preparing for further pruning of **C2** to generate **L2** based on a minimum support threshold.
- Debugging and analyzing the distribution of 2-itemsets in the dataset.


In [133]:
# Generate Candidate 2-itemsets (C2) using L1
def generate_c2(l1_items):
    c2 = []
    l1_list = list(l1_items.keys())
    for i in range(len(l1_list)):
        for j in range(i + 1, len(l1_list)):
            candidate = tuple(sorted(set([l1_list[i], l1_list[j]])))  # Create 2-itemsets
            if len(candidate) == 2:  # Ensure it is a 2-itemset
                c2.append(candidate)
    return c2

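# Check if all (k-1)-subsets of an itemset are frequent (helper; not applied in this cell)
from itertools import combinations

def is_frequent(itemset, l1):
    subsets = combinations(itemset, len(itemset) - 1)
    # L1 keys are plain strings, so unwrap 1-item subsets before the lookup
    return all((s[0] if len(s) == 1 else tuple(sorted(s))) in l1 for s in subsets)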

# Generate C2 from L1
c2 = generate_c2(l1)

# Filter C2 to retain all candidates regardless of frequent subsets
filtered_c2 = c2  # Include all candidates

# Count support for C2 itemsets in the dataset
def count_support_c2(c2, transactions):
    support_counts = {itemset: 0 for itemset in c2}
    for transaction in transactions:
        for itemset in c2:
            if set(itemset).issubset(transaction):
                support_counts[itemset] += 1
    return support_counts

# Count support for C2
c2_support = count_support_c2(filtered_c2, transactions)

# Convert C2 with support counts to DataFrame for display
import pandas as pd
from IPython.display import display, HTML

c2_df = pd.DataFrame({
    "Itemset": [' & '.join(itemset) for itemset in c2_support.keys()],
    "Support Count": list(c2_support.values())
})

# Display C2
print("C2 (Candidate 2-itemsets with Support Counts):")
display(HTML(c2_df.to_html(index=False)))

C2 (Candidate 2-itemsets with Support Counts):


Itemset,Support Count
Opening Ply_<=5 & White Rating_1400-1600,4090
White Rating_1400-1600 & Winner_white,2874
White Rating_1200-1400 & White Rating_1400-1600,0
Black Rating_1200-1400 & White Rating_1400-1600,1025
Victory Status_resign & White Rating_1400-1600,3077
White Rating_1400-1600 & Winner_black,2649
Turns_60+ & White Rating_1400-1600,2385
Black Rating_1400-1600 & White Rating_1400-1600,2908
Victory Status_mate & White Rating_1400-1600,1989
Turns_20-40 & White Rating_1400-1600,1230


# Generating L2 (Frequent 2-itemsets) from C2

## Objective:
To filter candidate 2-itemsets (**C2**) based on a minimum support threshold to generate **L2**, the set of frequent 2-itemsets.

---

## How the Code Works:

### **1. Define Minimum Support Threshold**
- A minimum support threshold (`min_support`) is defined, below which itemsets are considered infrequent and are removed.

#### Example:
If `min_support = 2`, any 2-itemset with a support count less than 2 will be excluded.

---

### **2. Filter C2 to Generate L2**
The code:
- Iterates through the support counts of all candidates in **C2**.
- Retains only those 2-itemsets where the support count is greater than or equal to `min_support`.
- Creates **L2**, a dictionary of frequent 2-itemsets and their support counts.

#### Example:
For **C2**:
`("Item A", "Item B"): 1, ("Item A", "Item C"): 2, ("Item B", "Item C"): 3`

With `min_support = 2`, the resulting **L2** will be:
`("Item A", "Item C"): 2, ("Item B", "Item C"): 3`

---

### **3. Convert to DataFrame**
The filtered **L2** is converted into a pandas `DataFrame` for better visualization, with columns:
- **Itemset**: The frequent 2-itemset (e.g., `"Item A & Item B"`).
- **Support Count**: The number of transactions containing the itemset.

---

### **4. Display Results**
The results are displayed in an HTML table if **L2** is not empty. For the example above, the table would be:

| Itemset                  | Support Count |
|--------------------------|---------------|
| Item A & Item C          | 2             |
| Item B & Item C          | 3             |

In [135]:
# Filter C2 to generate L2 based on minimum support threshold
min_support = 3000  # Set minimum support threshold

# Generate L2 by filtering C2 candidates with support >= min_support
l2 = {itemset: support for itemset, support in c2_support.items() if support >= min_support}

# Convert L2 to DataFrame for better visualization
l2_df = pd.DataFrame({
    "Itemset": [' & '.join(itemset) for itemset in l2.keys()],
    "Support Count": list(l2.values())
})

# Display L2
if not l2_df.empty:
    print("L2 (Frequent 2-itemsets with Support Counts):")
    from IPython.display import display, HTML
    display(HTML(l2_df.to_html(index=False)))
else:
    print("L2 is empty. No 2-itemsets met the minimum support threshold.")


L2 (Frequent 2-itemsets with Support Counts):


Itemset,Support Count
Opening Ply_<=5 & White Rating_1400-1600,4090
Victory Status_resign & White Rating_1400-1600,3077
Opening Ply_<=5 & Winner_white,6731
Opening Ply_<=5 & Victory Status_resign,7237
Opening Ply_<=5 & Winner_black,6208
Opening Ply_<=5 & Turns_60+,5848
Black Rating_1400-1600 & Opening Ply_<=5,4070
Opening Ply_<=5 & Victory Status_mate,4542
Opening Ply_<=5 & Turns_40-60,3568
Victory Status_resign & Winner_white,5844


# **Step 3: Generate C3 (Candidate 3-itemsets)**

## **Objective**
To generate **C3**, a set of candidate 3-itemsets, from the frequent 2-itemsets (**L2**). This step extends the Apriori analysis by identifying potential itemsets of size 3 that satisfy the Apriori property.

---

## **How the Code Works**

### **1. Candidate Generation**
- **Input**: Frequent 2-itemsets (**L2**).
- **Join Step**:
  - Combine two 2-itemsets if the last element of the first itemset matches the first element of the second itemset.
  - Example: `("I1", "I2")` and `("I2", "I3")` join to form `("I1", "I2", "I3")`.
- **Output**: A list of all possible candidate 3-itemsets (**C3**).
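
For comparison, the textbook Apriori join pairs itemsets that share their first k-2 items and then prunes any candidate with an infrequent subset. The sketch below illustrates that formulation and is not the join used in this notebook; `apriori_gen` and the toy **L2** are assumptions made for the example, and itemsets are assumed to be sorted tuples.

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Build candidate k-itemsets from frequent (k-1)-itemsets (sorted tuples)."""
    prev = sorted(frequent_prev)
    prev_set = set(prev)
    candidates = []
    for a, b in combinations(prev, 2):
        # Join: identical first k-2 items, different last item
        if a[:-1] == b[:-1]:
            candidate = a + (b[-1],)
            # Prune: every (k-1)-subset must itself be frequent
            if all(sub in prev_set
                   for sub in combinations(candidate, len(candidate) - 1)):
                candidates.append(candidate)
    return candidates

# Hypothetical frequent 2-itemsets
l2_example = [("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I4")]
print(apriori_gen(l2_example))  # [('I1', 'I2', 'I3')]
```

Here `("I2", "I3", "I4")` is joined but then pruned because `("I3", "I4")` is not frequent, which saves one support count over the transactions.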

### **2. Support Counting**
- For each candidate 3-itemset:
  - Check how many transactions contain all three items in the itemset.
  - Count the occurrences and store the support count in a dictionary.

### **3. Display Results**
- Convert the candidate 3-itemsets and their support counts into a table.
- Display the table for further filtering or analysis.

---

## **Example**

### **Input (L2):**
```plaintext
{
    ("Winner_white", "Turns_20-40"): 4,
    ("Winner_white", "Opening Ply_<=5"): 6,
    ("Turns_20-40", "Opening Ply_<=5"): 3
}
```

### **Output (C3)**:
| Itemset                                         | Support Count |
|-------------------------------------------------|---------------|
| Winner_white & Turns_20-40 & Opening Ply_<=5    | 3             |


In [136]:
from itertools import combinations

# Function to generate C3 (Candidate 3-itemsets) from L2
def generate_c3(l2_items):
    c3 = []
    l2_list = list(l2_items.keys())
    for i in range(len(l2_list)):
        for j in range(i + 1, len(l2_list)):
            if l2_list[i][1] == l2_list[j][0]:  # Join condition: last item of the first pair equals first item of the second
                candidate = tuple(sorted(set(l2_list[i]) | set(l2_list[j])))  # Merge sets
                if len(candidate) == 3:  # Ensure it is a 3-itemset
                    c3.append(candidate)
    return c3

# Count support for C3 itemsets in the transactions
def count_support_c3(c3, transactions):
    support_counts = {itemset: 0 for itemset in c3}
    for transaction in transactions:
        for itemset in c3:
            if set(itemset).issubset(transaction):
                support_counts[itemset] += 1
    return support_counts

# Generate C3 and count support
c3 = generate_c3(l2)
c3_support = count_support_c3(c3, transactions)

# Convert results to a DataFrame
import pandas as pd
from IPython.display import display, HTML

c3_df = pd.DataFrame({
    "Itemset": [' & '.join(itemset) for itemset in c3],
    "Support Count": [c3_support.get(itemset, 0) for itemset in c3]
})

# Display C3
if not c3_df.empty:
    print("C3 (Candidate 3-itemsets with Support Counts):")
    display(HTML(c3_df.to_html(index=False)))
else:
    print("C3 is empty. No 3-itemsets were generated.")

C3 (Candidate 3-itemsets with Support Counts):


Itemset,Support Count
Opening Ply_<=5 & Victory Status_resign & Winner_white,3781
Opening Ply_<=5 & Victory Status_resign & Winner_black,3456
Opening Ply_<=5 & Turns_60+ & Winner_white,2725
Opening Ply_<=5 & Turns_60+ & Victory Status_resign,2548
Opening Ply_<=5 & Turns_60+ & Winner_black,2691
Opening Ply_<=5 & Turns_60+ & Victory Status_mate,2176
Black Rating_1400-1600 & Opening Ply_<=5 & Victory Status_mate,1426
Black Rating_1400-1600 & Opening Ply_<=5 & Turns_40-60,1099
Opening Ply_<=5 & Victory Status_mate & Winner_white,2368
Opening Ply_<=5 & Turns_40-60 & Victory Status_resign,2013


# **Step 4: Generate L3 (Frequent 3-itemsets)**

## **Objective**
To generate **L3**, a set of frequent 3-itemsets, by filtering **C3** (candidate 3-itemsets) using a defined minimum support threshold. This step narrows down the significant itemsets for deeper analysis.

---

## **How the Code Works**

### **1. Input**
- **C3**: Candidate 3-itemsets generated from L2.
- **Support Counts**: Support counts for each 3-itemset in C3.
- **Minimum Support Threshold**: A threshold value to filter frequent itemsets.

### **2. Filtering Logic**
- For each itemset in C3:
  - If the support count meets or exceeds the `min_support`, the itemset is included in **L3**.
  - Itemsets with support counts below the threshold are discarded.

### **3. Output**
- **L3**: A dictionary of frequent 3-itemsets and their support counts.
- The result is displayed as a table for easy visualization.

---

## **Example**

### **Input (C3)**:
```plaintext
Itemset                                            Support Count
Winner_white & Turns_20-40 & Opening Ply_<=5       2580
Winner_black & Turns_40-60 & Opening Ply_6-10      1500
```

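### **Minimum Support Threshold**:
`min_support = 2000`

### **Output (L3)**:
Only the first itemset is kept, because 2580 ≥ 2000; the second is discarded, because 1500 < 2000.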


In [140]:
# Define the minimum support threshold
min_support = 2000  # Adjust based on dataset size and requirements

# Generate L3: Filter C3 to retain only itemsets meeting the minimum support threshold
def generate_l3(c3_support, min_support):
    """
    Generate L3, the frequent itemsets of size 3, from C3.

    Parameters:
        c3_support (dict): Dictionary of candidate 3-itemsets (C3) and their support counts.
        min_support (int): Minimum support threshold.

    Returns:
        dict: Dictionary of frequent 3-itemsets (L3) and their support counts.
    """
    l3 = {itemset: support for itemset, support in c3_support.items() if support >= min_support}
    return l3

# Generate L3 from C3
l3 = generate_l3(c3_support, min_support)

# Convert L3 to a DataFrame for better visualization
import pandas as pd
from IPython.display import display, HTML

l3_df = pd.DataFrame({
    "Itemset": [' & '.join(itemset) for itemset in l3.keys()],
    "Support Count": list(l3.values())
})

# Display L3
if not l3_df.empty:
    print("L3 (Frequent 3-itemsets with Support Counts):")
    display(HTML(l3_df.to_html(index=False)))
else:
    print("L3 is empty. No 3-itemsets met the minimum support threshold.")

L3 (Frequent 3-itemsets with Support Counts):


Itemset,Support Count
Opening Ply_<=5 & Victory Status_resign & Winner_white,3781
Opening Ply_<=5 & Victory Status_resign & Winner_black,3456
Opening Ply_<=5 & Turns_60+ & Winner_white,2725
Opening Ply_<=5 & Turns_60+ & Victory Status_resign,2548
Opening Ply_<=5 & Turns_60+ & Winner_black,2691
Opening Ply_<=5 & Turns_60+ & Victory Status_mate,2176
Opening Ply_<=5 & Victory Status_mate & Winner_white,2368
Opening Ply_<=5 & Turns_40-60 & Victory Status_resign,2013


# **Step 5: Generate C4 (Candidate 4-itemsets)**

## **Objective**
To generate **C4**, a set of candidate 4-itemsets, by joining frequent 3-itemsets (**L3**) and counting their support in the dataset. This step extends the Apriori analysis by identifying potential 4-itemsets.

---

## **How the Code Works**

### **1. Input**
- **L3**: Frequent 3-itemsets with their support counts.
- **Transactions**: A list of transactions for support counting.

### **2. Candidate Generation**
- **Join Step**:
  - Combine two 3-itemsets if the last two elements of the first match the first two elements of the second.
  - Example: `("I1", "I2", "I3")` and `("I2", "I3", "I4")` join to form `("I1", "I2", "I3", "I4")`.
- **Output**: A list of all possible candidate 4-itemsets (**C4**).
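
The join condition described above can be tested in isolation. The helper `tail_head_join` below is a hypothetical name that mirrors the comparison made inside `generate_c4`; it assumes both inputs are sorted tuples.

```python
def tail_head_join(a, b):
    """Join two sorted 3-itemsets when the last two items of `a`
    equal the first two items of `b`; return None otherwise."""
    if a[1:] == b[:-1]:
        return a + (b[-1],)
    return None

# Overlap on ("I2", "I3"): the pair joins into a 4-itemset
print(tail_head_join(("I1", "I2", "I3"), ("I2", "I3", "I4")))

# Shared prefix ("I1", "I2") only: this condition does not join the pair
print(tail_head_join(("I1", "I2", "I3"), ("I1", "I2", "I4")))
```

The second call returns `None`, which shows that two itemsets beginning with the same item never satisfy this particular condition.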

### **3. Support Counting**
- For each candidate 4-itemset:
  - Check how many transactions contain all four items.
  - Count the occurrences and store the support count in a dictionary.

### **4. Display Results**
- Convert the candidate 4-itemsets and their support counts into a table.
- Display the table for further filtering or analysis.

---

## **Example**

### **Input (L3)**:
```plaintext
Itemset                                              Support Count
Winner_white & Turns_20-40 & Opening Ply_<=5         2580
Turns_20-40 & Opening Ply_<=5 & Victory Status_mate  2300
```

### **Output (C4)**:
```plaintext
Itemset                                                              Support Count
Winner_white & Turns_20-40 & Opening Ply_<=5 & Victory Status_mate   1900
```


In [141]:
from itertools import combinations

# Function to generate C4 (Candidate 4-itemsets) using L3
def generate_c4(l3_items):
    c4 = []
    l3_list = list(l3_items.keys())
    for i in range(len(l3_list)):
        for j in range(i + 1, len(l3_list)):
            if l3_list[i][1:] == l3_list[j][:-1]:  # Join condition: last two items of itemset i equal the first two items of itemset j
                candidate = tuple(sorted(set(l3_list[i]) | set(l3_list[j])))  # Merge sets
                if len(candidate) == 4:  # Ensure it is a 4-itemset
                    c4.append(candidate)
    return c4

# Count support for C4 itemsets in the transactions
def count_support_c4(c4, transactions):
    support_counts = {itemset: 0 for itemset in c4}
    for transaction in transactions:
        for itemset in c4:
            if set(itemset).issubset(transaction):
                support_counts[itemset] += 1
    return support_counts

# Generate C4 from L3 and count support
c4 = generate_c4(l3)
c4_support = count_support_c4(c4, transactions)

# Convert results to a DataFrame
import pandas as pd
c4_df = pd.DataFrame({
    "Itemset": [' & '.join(itemset) for itemset in c4],
    "Support Count": [c4_support.get(itemset, 0) for itemset in c4]
})

# Display C4
if not c4_df.empty:
    print("C4 (Candidate 4-itemsets with Support Counts):")
    display(HTML(c4_df.to_html(index=False)))
else:
    print("C4 is empty. No 4-itemsets were generated.")

C4 is empty. No 4-itemsets were generated.


# **Conclusion: Empty C4 Indicates End of Apriori Analysis**

## **Why Is C4 Empty?**
1. **No Joinable Pairs in L3**:
   - The output reports that no 4-itemsets were *generated*, so the search ended at the join step, before any support counting. Every itemset in L3, as displayed, begins with `Opening Ply_<=5`, so the last two items of one itemset never match the first two items of another.

2. **Support Declines with Larger Itemsets**:
   - As the size of itemsets increases, their occurrence across transactions becomes less frequent. This is expected, as combinations of more items are less likely to co-occur in diverse datasets.

3. **High Support Threshold**:
   - A higher minimum support threshold filters out more itemsets at every level, leaving fewer frequent 3-itemsets to join. Lowering the threshold could potentially yield 4-itemsets, but it may also introduce less meaningful patterns.

4. **Dataset Characteristics**:
   - Even with a sizable dataset (20,060 transactions in this case), sparse or non-overlapping item combinations can lead to empty higher-order candidate sets.
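
The join condition used in `generate_c4` can be checked directly against the frequent 3-itemsets listed in the L3 output. The sketch below assumes the itemsets are stored as tuples in the order shown in that table; the variable names are illustrative.

```python
# Frequent 3-itemsets copied from the L3 output (assumed tuple order as displayed)
l3_displayed = [
    ("Opening Ply_<=5", "Victory Status_resign", "Winner_white"),
    ("Opening Ply_<=5", "Victory Status_resign", "Winner_black"),
    ("Opening Ply_<=5", "Turns_60+", "Winner_white"),
    ("Opening Ply_<=5", "Turns_60+", "Victory Status_resign"),
    ("Opening Ply_<=5", "Turns_60+", "Winner_black"),
    ("Opening Ply_<=5", "Turns_60+", "Victory Status_mate"),
    ("Opening Ply_<=5", "Victory Status_mate", "Winner_white"),
    ("Opening Ply_<=5", "Turns_40-60", "Victory Status_resign"),
]

# Same condition as generate_c4: last two items of a equal the first two items of b
joinable_pairs = [
    (a, b)
    for i, a in enumerate(l3_displayed)
    for b in l3_displayed[i + 1:]
    if a[1:] == b[:-1]
]
print(len(joinable_pairs))  # 0
```

No pair qualifies because the first item of every itemset is `Opening Ply_<=5`, while the second item never is.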

---

## **Significance of an Empty C4**
- **Final Step**:
   - Since no candidate 4-itemsets were produced, the Apriori process stops at **L3**.
   - The frequent 3-itemsets (L3) are the largest frequent patterns found.

- **Focus**:
   - Analyze **L1**, **L2**, and **L3** for meaningful insights and actionable patterns.
   - Use these frequent itemsets to generate association rules and uncover relationships.

---

## **Insights from Frequent 3-Itemsets**
The following frequent 3-itemsets were identified from **L3** with their support counts:

| **Itemset**                                           | **Support Count** |
|-------------------------------------------------------|-------------------|
| Opening Ply_<=5 & Victory Status_resign & Winner_white | 3781              |
| Opening Ply_<=5 & Victory Status_resign & Winner_black | 3456              |
| Opening Ply_<=5 & Turns_60+ & Winner_white            | 2725              |
| Opening Ply_<=5 & Turns_60+ & Victory Status_resign   | 2548              |
| Opening Ply_<=5 & Turns_60+ & Winner_black            | 2691              |
| Opening Ply_<=5 & Turns_60+ & Victory Status_mate     | 2176              |
| Opening Ply_<=5 & Victory Status_mate & Winner_white  | 2368              |
| Opening Ply_<=5 & Turns_40-60 & Victory Status_resign | 2013              |

### **Key Observations**
1. **Opening Ply_<=5 Dominates**:
   - All frequent 3-itemsets involve `Opening Ply_<=5`, which shows that short openings (five plies or fewer) are very common in this dataset. High support alone does not establish a correlation with the other attributes; confidence and lift are needed for that.

2. **Victory Status Resign**:
   - The itemsets containing `Victory Status_resign` have notably high support counts:
     - For `Winner_white`, the support is **3781**.
     - For `Winner_black`, the support is **3456**.
   - This indicates that resignation is a common outcome, regardless of the winner.

3. **Turns_60+ Correlations**:
   - Itemsets involving games with `Turns_60+` appear frequently with:
     - `Winner_white`: **2725 occurrences**.
     - `Winner_black`: **2691 occurrences**.
     - `Victory Status_resign`: **2548 occurrences**.
   - This suggests that longer games often end with resignations and victories for either side.

4. **Victory Status Mate**:
   - The presence of `Victory Status_mate` correlates with `Opening Ply_<=5` and other attributes:
     - `Winner_white`: **2368 occurrences**.
     - `Turns_60+`: **2176 occurrences**.
   - These patterns may indicate successful mate strategies in longer games with shorter openings.

5. **Turns_40-60**:
   - The frequent 3-itemset `Opening Ply_<=5 & Turns_40-60 & Victory Status_resign` occurs **2013 times**, suggesting that medium-length games also often end in resignation.

---

## **Conclusion**
The Apriori analysis highlights significant patterns in the chess dataset:
- **Opening Ply_<=5**: Short openings appear in every frequent 3-itemset.
- **Victory Status_resign**: Resignation is a common conclusion and accounts for the two highest support counts (3781 and 3456).
- **Turns_60+**: Longer games frequently co-occur with resignation and with wins for either player.

Since **C4 is empty**, the analysis stops here. The identified 3-itemsets (**L3**) are the most complex frequent patterns, providing actionable insights for chess game analysis.


In [None]:
from itertools import combinations


def generate_custom_association_rules(frequent_itemsets, min_confidence=0.7):
    """
    Generate association rules from frequent itemsets.
    
    Parameters:
    - frequent_itemsets (pd.DataFrame): DataFrame with 'itemsets' (frozenset) and 'support'
      (fraction of transactions). It should contain every level (L1, L2, L3), because the
      confidence and lift of a rule need the support of the itemset's subsets.
    - min_confidence (float): Minimum confidence threshold for rules.
    
    Returns:
    - rules (list): List of association rules with metrics.
    """
    rules = []

    # Build the support lookup once instead of filtering the DataFrame for every rule
    support_lookup = dict(zip(frequent_itemsets['itemsets'], frequent_itemsets['support']))

    for itemset, itemset_support in support_lookup.items():
        # Generate all possible antecedent-consequent pairs
        for i in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, i)):
                consequent = itemset - antecedent
                antecedent_support = support_lookup.get(antecedent)
                consequent_support = support_lookup.get(consequent)

                # Skip rules whose parts are missing from the input
                # (for example, when only a single level was passed in)
                if antecedent_support is None or consequent_support is None:
                    continue

                # Calculate confidence
                confidence = itemset_support / antecedent_support

                # Check if confidence meets the threshold
                if confidence >= min_confidence:
                    rules.append({
                        'antecedent': antecedent,
                        'consequent': consequent,
                        'support': itemset_support,
                        'confidence': confidence,
                        'lift': confidence / consequent_support
                    })
    return rules


# l1, l2, and l3 must already be defined as DataFrames with 'itemsets' (frozenset)
# and 'support' (fraction of transactions) columns before this cell runs.
# Example of the expected structure:
# L1 = pd.DataFrame({'itemsets': [frozenset(['A']), frozenset(['B'])], 'support': [0.6, 0.4]})
# L2 = pd.DataFrame({'itemsets': [frozenset(['A', 'B']), frozenset(['B', 'C'])], 'support': [0.3, 0.2]})
# L3 = pd.DataFrame({'itemsets': [frozenset(['A', 'B', 'C'])], 'support': [0.1]})

# Custom association rule function (already defined above)

# Store L1, L2, L3 in a list
levels = [l1, l2, l3]

# Confidence and lift need the support of subsets stored in lower levels,
# so combine all non-empty levels before generating rules
all_frequent = pd.concat([level for level in levels if not level.empty], ignore_index=True)
all_rules = generate_custom_association_rules(all_frequent, min_confidence=0.7)

# Print results grouped by the size of the itemset each rule came from
# (L1 is skipped: a single item cannot be split into antecedent and consequent)
for size in range(2, len(levels) + 1):
    print(f"\nAssociation Rules for L{size}:")
    for rule in all_rules:
        if len(rule['antecedent']) + len(rule['consequent']) == size:
            print(f"Antecedent: {rule['antecedent']}, Consequent: {rule['consequent']}, "
                  f"Support: {rule['support']:.2f}, Confidence: {rule['confidence']:.2f}, Lift: {rule['lift']:.2f}")


NameError: name 'l1' is not defined

# **Association Rules Generation for Precomputed Levels (L1, L2, L3)**

This section generates **association rules** from precomputed frequent itemsets (`L1`, `L2`, `L3`) using a custom Python function. These levels represent frequent itemsets of size 1 (`L1`), size 2 (`L2`), and size 3 (`L3`).

---

## **Workflow**

### **1. Input: Precomputed Levels**
- **L1**: Frequent itemsets of size 1.
- **L2**: Frequent itemsets of size 2.
- **L3**: Frequent itemsets of size 3.

Each level is a **pandas DataFrame** with the following structure:
- `itemsets`: A `frozenset` of items.
- `support`: The support value of the itemset.
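
Earlier steps in this notebook keep support counts in dictionaries keyed by item tuples, so a conversion is needed to reach this structure. A minimal sketch of one way to do it follows; the helper name `counts_to_frame` and the toy counts are assumptions, not part of the notebook.

```python
import pandas as pd


def counts_to_frame(support_counts, n_transactions):
    """Convert {item tuple: count} into a DataFrame of frozensets and fractional support."""
    return pd.DataFrame({
        "itemsets": [frozenset(itemset) for itemset in support_counts],
        "support": [count / n_transactions for count in support_counts.values()],
    })


# Toy counts over 10 hypothetical transactions (illustrative only)
toy_counts = {("A",): 6, ("B",): 4, ("A", "B"): 3}
toy_frame = counts_to_frame(toy_counts, n_transactions=10)
print(toy_frame)
```

Dividing by the number of transactions turns raw counts into fractions, which is what the lift calculation expects.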

### **2. Custom Association Rule Function**
- A custom function, `generate_custom_association_rules`, is used to generate rules.
- For each level, it computes:
  - **Antecedents**: Subsets of the itemset (items on the "if" side).
  - **Consequents**: Remaining items in the itemset (items on the "then" side).
  - **Confidence**: The conditional probability of the consequent given the antecedent.
  - **Lift**: The confidence of the rule divided by the support of the consequent. A value above 1 means the antecedent and consequent occur together more often than expected if they were independent.
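
As a worked example of these metrics, the sketch below computes confidence and lift for the rule `A -> B`, using the toy support values from the commented example in the code cell (`A` = 0.6, `B` = 0.4, `A, B` = 0.3). The variable names are illustrative.

```python
# Toy supports expressed as fractions of all transactions
support = {
    frozenset({"A"}): 0.6,
    frozenset({"B"}): 0.4,
    frozenset({"A", "B"}): 0.3,
}

antecedent = frozenset({"A"})
consequent = frozenset({"B"})

# confidence(A -> B) = support(A and B) / support(A)
confidence = support[antecedent | consequent] / support[antecedent]

# lift(A -> B) = confidence(A -> B) / support(B)
lift = confidence / support[consequent]

print(f"confidence = {confidence:.2f}, lift = {lift:.2f}")  # confidence = 0.50, lift = 1.25
```

With a confidence of 0.50, this rule would be dropped by the default `min_confidence=0.7`, even though its lift is above 1.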

### **3. Process**
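- For each level (`L1`, `L2`, `L3`):
  1. Pass the precomputed frequent itemsets to the association rule function.
  2. Keep the rules whose confidence is at least the specified threshold (`min_confidence`).
- Itemsets in `L1` yield no rules, because a single item cannot be split into an antecedent and a consequent. Rules from `L2` and `L3` need the support of their subsets, which are stored in the lower levels.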

---