# Apriori Algorithm Analysis on `connect.dat` Dataset

This document outlines the steps for applying the **Apriori algorithm** to the `connect.dat` dataset. The goal is to discover **frequent itemsets** and generate **association rules** from the transaction data.

## Steps Involved

1. **Load the Dataset**
2. **Convert Transactions to One-Hot Encoded Format**
3. **Apply the Apriori Algorithm**
4. **Generate Association Rules**
5. **Output the Results**

---

### Step 1: Load the Dataset

The first step is to **load the dataset** (the `connect.dat` file). Each line of the file is one transaction, written as a whitespace-separated list of integer item IDs. The transactions are read into a list of lists, which the later steps process.
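As a minimal sketch of this parsing step, the snippet below reads transactions from an in-memory sample instead of the real file. The function name `parse_transactions` and the sample lines are hypothetical and chosen only for illustration; they are not taken from `connect.dat`.

```python
import io


def parse_transactions(lines):
    """Parse one transaction per line; items are whitespace-separated integers."""
    transactions = []
    for line in lines:
        tokens = line.split()
        if tokens:  # ignore blank lines
            transactions.append([int(token) for token in tokens])
    return transactions


# Hypothetical sample standing in for the real file (note the blank line).
sample = io.StringIO("1 4 7\n2 4\n\n1 2 4 7\n")
transactions = parse_transactions(sample)
print(transactions)  # [[1, 4, 7], [2, 4], [1, 2, 4, 7]]
```

Because the function accepts any iterable of lines, an open file handle can be passed in the same way as the `StringIO` object.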

---

### Step 2: Convert Transactions to One-Hot Encoded Format

Once the transactions are loaded, the next task is to **convert them into a one-hot encoded format**. This transformation will represent each transaction as a binary vector where:
- Each vector element corresponds to an item in the dataset (based on all unique items).
- A value of `1` indicates that the item is present in the transaction.
- A value of `0` indicates that the item is absent.

This step is required by the `mlxtend` implementation of Apriori used below, which expects a boolean (or 0/1) table where each row is a transaction and each column is an item.
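The encoding can be illustrated on a toy example with plain Python lists. This is only a sketch: the helper `one_hot_encode` and the toy transactions are assumptions made for illustration, not part of the analysis code.

```python
def one_hot_encode(transactions):
    """Return (columns, rows) where rows[i][j] is 1 if columns[j] is in transaction i."""
    # The sorted list of unique items fixes the column order.
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        rows.append([1 if item in present else 0 for item in columns])
    return columns, rows


# Hypothetical toy transactions.
toy_transactions = [[1, 4, 7], [2, 4], [1, 2, 4, 7]]
columns, matrix = one_hot_encode(toy_transactions)

print(columns)  # [1, 2, 4, 7]
for row in matrix:
    print(row)
```

Each row has one entry per unique item, so the matrix width equals the number of distinct items in the dataset, not the length of any single transaction.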

---

### Step 3: Apply the Apriori Algorithm

With the data one-hot encoded, we can apply the **Apriori algorithm**. It finds **frequent itemsets**: groups of items that appear together in at least a specified fraction of the transactions, called the **minimum support**.

- The algorithm starts by finding individual frequent items.
- It then iterates to find pairs, triples, and larger itemsets, checking whether they meet the **minimum support** threshold.
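- Candidates are pruned at each level using the Apriori property: every subset of a frequent itemset must itself be frequent, so any candidate containing an infrequent subset is discarded without counting its support.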
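The level-wise search described above can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the `mlxtend` implementation: the function name `apriori_sketch` and the toy transactions are hypothetical, and support is recounted by scanning every transaction, which is too slow for a large dataset.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    tx_sets = [frozenset(t) for t in transactions]
    n = len(tx_sets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in tx_sets) / n

    # Level 1: frequent individual items.
    current = {}
    for item in {item for t in tx_sets for item in t}:
        single = frozenset([item])
        s = support(single)
        if s >= min_support:
            current[single] = s

    frequent = dict(current)
    k = 2
    while current:
        previous = set(current)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates.
        current = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy transactions.
toy = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
frequent = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On this toy data all three single items and all three pairs are frequent, while the triple `{1, 2, 3}` appears in only 2 of 5 transactions and falls below the 0.5 threshold, so the search stops at level 3.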

---

### Step 4: Generate Association Rules

After the frequent itemsets are identified, the next step is to **generate association rules** from these itemsets. Association rules provide insights into how the occurrence of one item (or set of items) in a transaction can imply the occurrence of other items.

- **Confidence** is used as the metric for generating these rules. It is defined as the probability that a transaction containing the antecedent (left-hand side) also contains the consequent (right-hand side) of the rule.
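- Only rules that meet the **minimum confidence** threshold are kept. Confidence alone does not establish statistical significance: when the consequent is very common, confidence is high even if the antecedent adds no information, so it helps to check a metric such as **lift** as well.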
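As a small worked example of the confidence definition, the sketch below computes it directly from transaction counts. The helper names `count_containing` and `confidence` and the toy transactions are hypothetical and used only for illustration.

```python
def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from transaction counts."""
    both = set(antecedent) | set(consequent)
    return count_containing(both, transactions) / count_containing(antecedent, transactions)


# Hypothetical toy transactions.
toy = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]

# Item 1 appears in 4 transactions, and 3 of those also contain item 2.
print(confidence({1}, {2}, toy))  # 0.75
```

Confidence is not symmetric in general: swapping the antecedent and consequent changes the denominator, so each direction of a rule has to be evaluated separately.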

---

### Step 5: Output the Results

Finally, the results are displayed, including:
- **Frequent itemsets**: The sets of items that frequently appear together in the dataset.
- **Association rules**: The rules that show relationships between different items in the dataset.

These results can be analyzed to understand which items co-occur in the `connect.dat` transactions, as a starting point for further analysis.

---

## Summary of Steps

1. **Load Transactions**: Read the dataset and store the transactions in a structured format.
2. **One-Hot Encoding**: Convert transactions into a one-hot encoded format where each column represents an item.
3. **Apriori Algorithm**: Apply the Apriori algorithm to find frequent itemsets based on the **minimum support** threshold.
4. **Association Rules**: Generate association rules using the frequent itemsets, filtered by the **minimum confidence** threshold.
5. **Display Results**: Show the frequent itemsets and association rules found during the analysis.

---

This approach applies the Apriori algorithm to the `connect.dat` dataset to surface items that frequently occur together and the rules that relate them.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Step 1: Load the Data
def load_dat_file(filename):
    """Read the .dat file and convert it to a list of transactions."""
    transactions = []
    with open(filename, 'r') as file:
        for line in file:
            items = line.split()
            if not items:
                continue  # Skip blank lines so they don't become empty transactions
            transactions.append([int(item) for item in items])
    return transactions

# Step 2: Convert Transactions to a One-Hot Encoded DataFrame
def transactions_to_dataframe(transactions):
    """Convert the list of transactions to a one-hot encoded DataFrame."""
    # Get all unique items
    unique_items = set(item for transaction in transactions for item in transaction)
    
    # Fix the column order once so every row uses the same item positions
    columns = sorted(unique_items)

    # Build each row as a list of booleans, then construct the DataFrame in one
    # call; per-cell .loc assignment is very slow on a large transaction file
    rows = []
    for transaction in transactions:
        present = set(transaction)
        rows.append([item in present for item in columns])

    # mlxtend expects boolean values
    return pd.DataFrame(rows, columns=columns, dtype=bool)

# Step 3: Apply Apriori and Generate Association Rules
def apply_apriori(data, min_support, min_confidence):
    """Apply the Apriori algorithm and generate association rules."""
    # Apply Apriori
    frequent_itemsets = apriori(data, min_support=min_support, use_colnames=True)
    
    # Generate Association Rules
    num_itemsets = len(data)  # Calculate the total number of transactions
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence, num_itemsets=num_itemsets)
    return frequent_itemsets, rules

# Main Execution
filename = "Data/connect.dat"  # Replace with your actual .dat file path
transactions = load_dat_file(filename)
encoded_data = transactions_to_dataframe(transactions)

# Parameters
min_support = 0.95  # Minimum support threshold
min_confidence = 0.6  # Minimum confidence threshold

# Apply Apriori
frequent_itemsets, rules = apply_apriori(encoded_data, min_support, min_confidence)

# Display Results
print("Frequent Itemsets:")
print(frequent_itemsets)

print("\nAssociation Rules:")
print(rules)

Frequent Itemsets:
       support                                 itemsets
0     0.966073                                     (16)
1     0.992347                                     (19)
2     0.965170                                     (34)
3     0.992377                                     (37)
4     0.964267                                     (52)
...        ...                                      ...
2196  0.955460     (37, 106, 75, 109, 55, 91, 124, 127)
2197  0.952011     (37, 106, 109, 55, 88, 91, 124, 127)
2198  0.954986     (37, 106, 75, 109, 88, 91, 124, 127)
2199  0.955016     (106, 75, 109, 55, 88, 91, 124, 127)
2200  0.950916  (37, 106, 75, 109, 19, 55, 88, 91, 127)

[2201 rows x 2 columns]

Association Rules:
      antecedents                          consequents  antecedent support  \
0            (16)                                 (19)            0.966073   
1            (19)                                 (16)            0.992347   
2            (16)             