# Tutorial 1: From ERP Data to an Event Log (Procure-to-Pay)

This tutorial loads relational ERP data for a procure-to-pay (P2P) process, transforms it into an event log, discovers a process model, and computes key performance indicators (KPIs).

**Reproducibility Note**: This notebook uses deterministic data and algorithms, ensuring identical results on every run.

## 1. Import necessary libraries

In [None]:
import pandas as pd
from erp_processminer.io_erp import loaders, mappings
from erp_processminer.discovery import directly_follows
from erp_processminer.visualization import graphs
from erp_processminer.statistics.performance import calculate_cycle_times
from erp_processminer.statistics.variants import get_variants

## 2. Create Sample ERP Data

For this tutorial, we'll create some sample data in memory. In a real-world scenario, you would load this data from CSV files or a database.

In [None]:
po_data = [
    ["PO-001", "2023-01-10", "Vendor A", "User 1"],
    ["PO-002", "2023-01-11", "Vendor B", "User 2"],
    ["PO-003", "2023-01-12", "Vendor A", "User 1"],
]
po_df = pd.DataFrame(po_data, columns=["PO_NUMBER", "CREATION_DATE", "VENDOR", "CREATED_BY"])
po_df['CREATION_DATE'] = pd.to_datetime(po_df['CREATION_DATE'])

gr_data = [
    ["GR-101", "PO-001", "2023-01-15", 100, 1],
    ["GR-102", "PO-002", "2023-01-18", 200, 1],
    ["GR-103", "PO-003", "2023-01-17", 50, 1],
]
gr_df = pd.DataFrame(gr_data, columns=["GR_NUMBER", "PO_NUMBER", "RECEIPT_DATE", "QUANTITY", "ITEM_NUMBER"])
gr_df['RECEIPT_DATE'] = pd.to_datetime(gr_df['RECEIPT_DATE'])

inv_data = [
    ["INV-201", "PO-001", "2023-01-20", 1000.0, "Paid"],
    ["INV-202", "PO-002", "2023-01-22", 2000.0, "Paid"],
    ["INV-203", "PO-003", "2023-01-25", 500.0, "Paid"],
]
inv_df = pd.DataFrame(inv_data, columns=["INVOICE_NUMBER", "PO_NUMBER", "INVOICE_DATE", "AMOUNT", "STATUS"])
inv_df['INVOICE_DATE'] = pd.to_datetime(inv_df['INVOICE_DATE'])

display(po_df.head())
display(gr_df.head())
display(inv_df.head())
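
As noted above, real data would usually come from CSV files or a database rather than in-memory lists. The toolkit's `loaders` module is imported earlier, but its API is not shown in this tutorial, so the sketch below uses plain pandas instead. The in-memory text buffer stands in for a file path, and the name `po_from_csv` is only illustrative.

```python
import io

import pandas as pd

# Stand-in for a CSV export; in practice this would be a path to a file.
po_csv = io.StringIO(
    "PO_NUMBER,CREATION_DATE,VENDOR,CREATED_BY\n"
    "PO-001,2023-01-10,Vendor A,User 1\n"
    "PO-002,2023-01-11,Vendor B,User 2\n"
)

# parse_dates converts the column to datetime64 while reading,
# which replaces the separate pd.to_datetime call used above.
po_from_csv = pd.read_csv(po_csv, parse_dates=["CREATION_DATE"])
print(po_from_csv.dtypes)
```

Parsing dates at load time keeps the timestamp columns consistent before any mapping is applied.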

## 3. Define the ERP-to-EventLog Mapping

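The mapping configuration is a dictionary that tells the toolkit how to construct an event log. We need to specify:
- `case_id`: The column that identifies the process instance (here, the purchase order number).
- `tables`: A dictionary where each key is a table name and each value defines how to extract events from that table.

Each table entry in this example sets `entity_id`, `activity`, and `timestamp`. The activity is a static name, written with inner single quotes, and the timestamp is the name of the column that holds the event time.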

In [None]:
p2p_mapping_config = {
    "case_id": "PO_NUMBER",
    "tables": {
        "purchase_orders": {
            "entity_id": "PO_NUMBER",
            "activity": "'Create Purchase Order'", # Static activity name
            "timestamp": "CREATION_DATE"
        },
        "goods_receipts": {
            "entity_id": "PO_NUMBER",
            "activity": "'Receive Goods'",
            "timestamp": "RECEIPT_DATE"
        },
        "invoices": {
            "entity_id": "PO_NUMBER",
            "activity": "'Receive Invoice'",
            "timestamp": "INVOICE_DATE"
        }
    }
}
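
To make the idea behind such a mapping concrete, here is a minimal pandas-only sketch of what applying it could look like: one event row per source record, with the case ID, a static activity name, and a timestamp, all merged and sorted. The function `sketch_apply_mapping` is hypothetical and is not the toolkit's implementation. Passing the tables as a dictionary keyed by name is also an assumption of this sketch.

```python
import pandas as pd


def sketch_apply_mapping(tables, config):
    """Hypothetical illustration only; not part of erp_processminer."""
    frames = []
    for name, spec in config["tables"].items():
        df = tables[name]
        frames.append(pd.DataFrame({
            "case_id": df[config["case_id"]],
            # A value like "'Receive Goods'" is treated as a static name
            "activity": spec["activity"].strip("'"),
            "timestamp": pd.to_datetime(df[spec["timestamp"]]),
        }))
    events = pd.concat(frames, ignore_index=True)
    # Order events within each case by time
    return events.sort_values(["case_id", "timestamp"]).reset_index(drop=True)


tables = {
    "purchase_orders": pd.DataFrame(
        {"PO_NUMBER": ["PO-001"], "CREATION_DATE": ["2023-01-10"]}
    ),
    "goods_receipts": pd.DataFrame(
        {"PO_NUMBER": ["PO-001"], "RECEIPT_DATE": ["2023-01-15"]}
    ),
}
config = {
    "case_id": "PO_NUMBER",
    "tables": {
        "purchase_orders": {
            "activity": "'Create Purchase Order'",
            "timestamp": "CREATION_DATE",
        },
        "goods_receipts": {
            "activity": "'Receive Goods'",
            "timestamp": "RECEIPT_DATE",
        },
    },
}

flat_events = sketch_apply_mapping(tables, config)
print(flat_events)
```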

## 4. Apply the Mapping to Create an Event Log

In [None]:
event_log = mappings.apply_mapping([po_df, gr_df, inv_df], p2p_mapping_config)

print(f"Successfully created an event log with {len(event_log.traces)} traces.")

# Print the first trace for inspection
print("\nFirst trace events:")
for event in event_log.traces[0]:
    print(event)

## 5. Discover and Visualize a Directly-Follows Graph (DFG)

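Now that we have an event log, we can use a discovery algorithm to learn a process model. A DFG is one of the simplest process models: each node is an activity, and each edge A -> B counts how often B directly follows A within a trace.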

In [None]:
dfg, start_activities, end_activities = directly_follows.discover_dfg(event_log)

print("Start activities:", start_activities)
print("End activities:", end_activities)
print("\nDFG edges with frequencies:")
for edge, data in dfg.get_edges().items():
    print(f"  {edge[0]} -> {edge[1]}: {data['frequency']} occurrences")

# The visualize_dfg function saves the graph to a file and returns the graphviz object
try:
    g = graphs.visualize_dfg(dfg, start_activities, end_activities, output_file='p2p_tutorial_dfg')
    print("\nDFG visualization saved to 'p2p_tutorial_dfg.png'")
    display(g)
except Exception as e:
    print(f"\nVisualization skipped (is Graphviz installed?): {e}")
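
The directly-follows relation itself can be computed with the standard library alone. The sketch below is an illustration of the technique, not the toolkit's `discover_dfg` implementation. The traces are written out by hand in the timestamp order the sample data yields, and all variable names are illustrative.

```python
from collections import Counter

# Activity sequences for the three sample purchase orders, ordered by time
traces = [
    ["Create Purchase Order", "Receive Goods", "Receive Invoice"],
    ["Create Purchase Order", "Receive Goods", "Receive Invoice"],
    ["Create Purchase Order", "Receive Goods", "Receive Invoice"],
]

edge_counts = Counter()
start_counts = Counter()
end_counts = Counter()

for trace in traces:
    start_counts[trace[0]] += 1
    end_counts[trace[-1]] += 1
    # Pair each activity with its immediate successor
    for source, target in zip(trace, trace[1:]):
        edge_counts[(source, target)] += 1

for (source, target), frequency in edge_counts.items():
    print(f"{source} -> {target}: {frequency}")
```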

## 6. KPI Analysis

Now let's compute key performance indicators (KPIs) for the P2P process.

In [None]:
# Calculate cycle times for each trace
cycle_times = calculate_cycle_times(event_log)

print("=== Cycle Time Analysis ===")
print(f"\nNumber of cases: {len(cycle_times)}")

# Convert to days for readability
cycle_days = [ct.total_seconds() / 86400 for ct in cycle_times.values()]
print(f"Average cycle time: {sum(cycle_days) / len(cycle_days):.1f} days")
print(f"Min cycle time: {min(cycle_days):.1f} days")
print(f"Max cycle time: {max(cycle_days):.1f} days")

print("\nCycle time per case:")
for case_id, ct in cycle_times.items():
    days = ct.total_seconds() / 86400
    print(f"  {case_id}: {days:.1f} days")
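
As a cross-check, cycle times can be reproduced with plain pandas. The sketch below assumes that a case's cycle time is the span between its first and last event. That definition is an assumption here, since the internals of `calculate_cycle_times` are not shown. The timestamps are copied from the sample data above.

```python
import pandas as pd

events = pd.DataFrame({
    "case_id": ["PO-001"] * 3 + ["PO-002"] * 3 + ["PO-003"] * 3,
    "timestamp": pd.to_datetime([
        "2023-01-10", "2023-01-15", "2023-01-20",  # PO-001
        "2023-01-11", "2023-01-18", "2023-01-22",  # PO-002
        "2023-01-12", "2023-01-17", "2023-01-25",  # PO-003
    ]),
})

# First and last event per case
span = events.groupby("case_id")["timestamp"].agg(["min", "max"])

# Assumed definition: cycle time = last event minus first event
cycle = span["max"] - span["min"]
cycle_in_days = cycle.dt.total_seconds() / 86400

print(cycle_in_days)
print(f"Average: {cycle_in_days.mean():.1f} days")
```

Under that definition, the sample cases take 10, 11, and 13 days from purchase order creation to invoice receipt.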

In [None]:
# Analyze process variants
variants = get_variants(event_log)

print("=== Variant Analysis ===")
print(f"\nNumber of unique variants: {len(variants)}")

print("\nVariants by frequency:")
for variant, count in sorted(variants.items(), key=lambda x: x[1], reverse=True):
    print(f"  {variant}: {count} cases ({count/len(event_log.traces)*100:.1f}%)")
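
A variant is simply a distinct sequence of activities, so counting variants reduces to counting identical tuples. The sketch below illustrates that idea with the standard library and is not the toolkit's `get_variants` implementation. The case `PO-X` is hypothetical and is added only so that a second variant appears; it is not part of the tutorial's sample data.

```python
from collections import Counter

traces_by_case = {
    "PO-001": ["Create Purchase Order", "Receive Goods", "Receive Invoice"],
    "PO-002": ["Create Purchase Order", "Receive Goods", "Receive Invoice"],
    "PO-003": ["Create Purchase Order", "Receive Goods", "Receive Invoice"],
    # Hypothetical case: the invoice arrives before the goods
    "PO-X": ["Create Purchase Order", "Receive Invoice", "Receive Goods"],
}

# Tuples are hashable, so each distinct sequence becomes one Counter key
variant_counts = Counter(tuple(trace) for trace in traces_by_case.values())

for variant, count in variant_counts.most_common():
    share = count / len(traces_by_case) * 100
    print(f"{' -> '.join(variant)}: {count} cases ({share:.1f}%)")
```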

## 7. Summary

In this tutorial, we demonstrated a complete P2P analysis workflow:

1. **ERP Data Loading**: Created sample purchase order, goods receipt, and invoice data
2. **Mapping Configuration**: Defined a declarative mapping to transform relational data to events
3. **Event Log Creation**: Applied the mapping to generate a structured event log
4. **Process Discovery**: Discovered a Directly-Follows Graph showing the process flow
5. **KPI Analysis**: Computed cycle times and variant frequencies

This workflow is **fully reproducible**: running the notebook again produces identical results.