# **Pakistan Traffic Accidents Data Analysis**

This script performs **Apriori Analysis** on the Pakistan Traffic Accidents dataset using Python's **`mlxtend`** library. The workflow involves preprocessing the dataset, mining frequent itemsets, and generating association rules.

---

## **Workflow**

### **1. Dataset Preprocessing**
- The dataset is read as a CSV file with the following structure:
  - Columns: `Area`, `Year`, `Total number of accidents`, `Fatal Accidents`, `Non-Fatal Accidents`, `Killed`, `Injured`, `Total number of vehicles involved`.
  - Each row represents annual accident data for a specific area.
- **Preprocessing Steps**:
  1. **Convert categorical data into items**: Directly add columns like `Area` and `Year` as items (e.g., `Area_Lahore`, `Year_2022`).
  2. **Categorize numerical data**: Convert each numerical column into a `High` or `Low` item by comparing it against a fixed threshold (e.g., `Total number of accidents` becomes `TotalAccidents_High` when the value exceeds 100, otherwise `TotalAccidents_Low`).
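
The two preprocessing steps can be sketched on a single record as follows. This is a minimal illustration, not the notebook's own function: the helper `row_to_items`, the `THRESHOLDS` mapping, and the example values are all hypothetical, and only two of the numerical columns are shown.

```python
# Hypothetical cutoffs for illustration; choose values that suit the real data.
THRESHOLDS = {
    "Total number of accidents": ("TotalAccidents", 100),
    "Killed": ("Killed", 5),
}

def row_to_items(row, thresholds=THRESHOLDS):
    """Turn one record (a dict) into a list of item strings."""
    # Categorical columns become items directly.
    items = [f"Area_{row['Area']}", f"Year_{row['Year']}"]
    # Numerical columns are binned into High/Low against a fixed cutoff.
    for column, (label, cutoff) in thresholds.items():
        level = "High" if row[column] > cutoff else "Low"
        items.append(f"{label}_{level}")
    return items

# Made-up record, not taken from the dataset.
example_row = {
    "Area": "Lahore",
    "Year": 2022,
    "Total number of accidents": 250,
    "Killed": 3,
}
example_items = row_to_items(example_row)
print(example_items)
```

If every row falls on the same side of a cutoff, the resulting item appears in all transactions and carries no information, so it may be worth checking the column distributions before fixing the thresholds.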

### **2. One-Hot Encoding**
- The transactions are converted into a **one-hot encoded DataFrame**:
  - Each column corresponds to an item (e.g., `TotalAccidents_High`).
  - Each row represents a transaction, with `True` or `False` indicating the presence or absence of an item.
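
To make the encoding concrete, here is a small pure-Python sketch of what a one-hot transaction table looks like. The notebook itself uses `TransactionEncoder` from `mlxtend`; the `one_hot` helper and the toy transactions below are hypothetical stand-ins that only mirror the idea.

```python
def one_hot(transactions):
    """Return (columns, rows), where rows[i][j] is True if item j is in transaction i."""
    # One column per distinct item, in sorted order for a stable layout.
    columns = sorted({item for t in transactions for item in t})
    rows = [[item in set(t) for item in columns] for t in transactions]
    return columns, rows

# Toy transactions, not taken from the dataset.
toy_transactions = [
    ["Area_Lahore", "TotalAccidents_High"],
    ["Area_Multan", "TotalAccidents_High"],
]
columns, rows = one_hot(toy_transactions)
print(columns)
for row in rows:
    print(row)
```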

### **3. Frequent Itemset Mining**
- The **Apriori algorithm** from `mlxtend` is applied to identify itemsets whose support, the fraction of transactions that contain them, is at least a specified minimum (`min_support`).
- **Example**:
  - If `min_support = 0.3`, only itemsets that appear in at least 30% of transactions are considered frequent.
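
The effect of `min_support` can be illustrated with a brute-force sketch that computes the support of every candidate itemset and keeps those at or above the threshold. This is not the Apriori algorithm itself (it does no candidate pruning) and not what `mlxtend` runs internally; the function names and toy transactions are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def frequent_itemsets_bruteforce(transactions, min_support, max_len=2):
    """Enumerate all itemsets up to max_len and keep the frequent ones."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result

# Toy transactions, not taken from the dataset.
toy_transactions = [
    ["TotalAccidents_High", "Killed_High"],
    ["TotalAccidents_High", "Killed_High"],
    ["TotalAccidents_High", "Killed_Low"],
    ["TotalAccidents_Low", "Killed_Low"],
]
frequent = frequent_itemsets_bruteforce(toy_transactions, min_support=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

Here `TotalAccidents_Low` appears in only one of four transactions (support 0.25), so it is dropped at `min_support=0.5`.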

### **4. Association Rules Generation**
- **Association rules** are generated from frequent itemsets using metrics like:
  - **Support**: Proportion of transactions containing the itemset.
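  - **Confidence**: Estimated probability that the consequent is present given that the antecedent is present, i.e. the support of antecedent and consequent together divided by the support of the antecedent.
  - **Lift**: Ratio of the rule's confidence to the support of the consequent. A value above 1 means the antecedent makes the consequent more likely than it is overall; a value of 1 means the two are independent.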
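
As a worked example, the three metrics can be computed by hand for a single rule. The sketch below uses hypothetical toy transactions and a hypothetical `rule_metrics` helper; `mlxtend`'s `association_rules` reports the same quantities among its output columns.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return both, confidence, lift

# Toy transactions, not taken from the dataset.
toy_transactions = [
    ["TotalAccidents_High", "Killed_High"],
    ["TotalAccidents_High", "Killed_High"],
    ["TotalAccidents_High", "Killed_Low"],
    ["TotalAccidents_Low", "Killed_Low"],
]
rule_support, rule_confidence, rule_lift = rule_metrics(
    ["Killed_High"], ["TotalAccidents_High"], toy_transactions
)
print(rule_support, rule_confidence, round(rule_lift, 3))
```

For this toy rule the support is 0.5, the confidence is 1.0, and the lift is 1.0 / 0.75, roughly 1.333, so `Killed_High` raises the chance of `TotalAccidents_High` above its baseline of 0.75.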

---

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Preprocess the Accident Dataset
def preprocess_accident_data(filename, nrows=None):
    """
    Preprocess the Pakistan Traffic Accident dataset into transactions.
    """
    # Load the dataset
    data = pd.read_csv(filename, nrows=nrows)

    # Transform each row into a transaction
    transactions = []
    for _, row in data.iterrows():
        transaction = []

        # Add categorical attributes
        transaction.append(f"Area_{row['Area']}")
        transaction.append(f"Year_{row['Year']}")

        # Categorize numerical attributes
        transaction.append(f"TotalAccidents_{'High' if row['Total number of accidents'] > 100 else 'Low'}")
        transaction.append(f"FatalAccidents_{'High' if row['Fatal Accidents'] > 10 else 'Low'}")
        transaction.append(f"NonFatalAccidents_{'High' if row['Non-Fatal Accidents'] > 50 else 'Low'}")
        transaction.append(f"Killed_{'High' if row['Killed'] > 5 else 'Low'}")
        transaction.append(f"Injured_{'High' if row['Injured'] > 10 else 'Low'}")
        transaction.append(f"VehiclesInvolved_{'High' if row['Total number of vehicles involved'] > 10 else 'Low'}")

        transactions.append(transaction)
    
    return transactions

# Load and preprocess the dataset
filename = "Data/pak-traffic-accidents-annual.csv"
transactions = preprocess_accident_data(filename)

# Convert transactions to one-hot encoded DataFrame
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)

# Apply Apriori
min_support = 0.80  # Set minimum support threshold
frequent_itemsets = apriori(df, min_support=min_support, use_colnames=True)

# Some mlxtend releases require num_itemsets in association_rules;
# it is the number of transactions in the input data, not the largest itemset size
num_itemsets = len(df)

# Display frequent itemsets
print("Frequent Itemsets:")
print(frequent_itemsets)

# Generate association rules
min_confidence = 0.3  # Set minimum confidence threshold
rules = association_rules(frequent_itemsets, num_itemsets=num_itemsets, metric="confidence", min_threshold=min_confidence)

# Display association rules
print("\nAssociation Rules:")
print(rules)


Frequent Itemsets:
    support                                           itemsets
0       1.0                              (FatalAccidents_High)
1       1.0                                     (Injured_High)
2       1.0                                      (Killed_High)
3       1.0                           (NonFatalAccidents_High)
4       1.0                              (TotalAccidents_High)
..      ...                                                ...
58      1.0  (FatalAccidents_High, Injured_High, Killed_Hig...
59      1.0  (FatalAccidents_High, Injured_High, NonFatalAc...
60      1.0  (FatalAccidents_High, NonFatalAccidents_High, ...
61      1.0  (Injured_High, NonFatalAccidents_High, Killed_...
62      1.0  (FatalAccidents_High, VehiclesInvolved_High, N...

[63 rows x 2 columns]

Association Rules:
                  antecedents  \
0       (FatalAccidents_High)   
1              (Injured_High)   
2               (Killed_High)   
3       (FatalAccidents_High)   
4       (FatalAcc