# SAP ERP Data Integration for Order Fulfillment & Delivery Analytics

This notebook demonstrates an end-to-end ETL process:

**Data Extraction:**  
Extract data from SAP ERP Excel sheets (KNA1, LFA1, VBAK, VBAP, LIKP, LIPS, VTTK, VTTP).

**Data Transformation:**  
Build a relational data model with these tables:
- **customers:** Data from KNA1.
- **sap_customers:** Duplicate of customers (optional).
- **orders:** Header-level order data from VBAK.
- **order_items:** Item-level order details from VBAP.
- **shipments:** Delivery header data from LIKP.
- **shipment_items:** Delivery item data from LIPS.
- **carriers:** Carrier information from LFA1.
- **delivery_status:** Merged shipment status info from VTTK and VTTP.
- **delivery_analytics:** Aggregated delivery performance metrics.

**Data Validation:**  
Checks cover primary key uniqueness and referential integrity. Date columns are parsed with `errors='coerce'`, so malformed dates become `NaT` rather than raising an error.

**Data Load:**  
Load the data into a MySQL database (table names in lower-case).

**Prerequisites:**  
Ensure you have installed: `pandas`, `sqlalchemy`, `pymysql`, `openpyxl`.

In [1]:
import pandas as pd
from sqlalchemy import create_engine
import numpy as np

# Define the Excel file path (update the path if needed)
excel_file = 'SAP-DataSet.xlsx'

# Load the Excel file and list available sheet names
xls = pd.ExcelFile(excel_file)
print("Available sheets:", xls.sheet_names)

Available sheets: ['KNA1', 'LFA1', 'VBAK', 'VBAP', 'LIKP', 'LIPS', 'VTTK', 'VTTP']
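
The sheet listing above can also be checked programmatically before any sheet is read. The following is a minimal sketch; `REQUIRED_SHEETS` and `find_missing_sheets` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Sheets this notebook expects to find in the workbook
REQUIRED_SHEETS = ['KNA1', 'LFA1', 'VBAK', 'VBAP', 'LIKP', 'LIPS', 'VTTK', 'VTTP']

def find_missing_sheets(available, required=REQUIRED_SHEETS):
    """Return the required sheet names that are absent, in their original order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]

# Example: a workbook that lacks the two shipment sheets
missing = find_missing_sheets(['KNA1', 'LFA1', 'VBAK', 'VBAP', 'LIKP', 'LIPS'])
print(missing)  # ['VTTK', 'VTTP']
```

In the notebook itself, this could be called as `find_missing_sheets(xls.sheet_names)`, stopping early if the result is not empty.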


## Data Extraction
Load each relevant sheet into a DataFrame. The sheets and key columns are:
- **KNA1:** Customer master data
- **LFA1:** Vendor data (used here as Carriers)
- **VBAK:** Sales Order Header
- **VBAP:** Sales Order Items
- **LIKP:** Delivery Header
- **LIPS:** Delivery Items
- **VTTK:** Shipment Header
- **VTTP:** Shipment Items

In [2]:
df_kna1 = pd.read_excel(excel_file, sheet_name='KNA1')
df_lfa1 = pd.read_excel(excel_file, sheet_name='LFA1')
df_vbak = pd.read_excel(excel_file, sheet_name='VBAK')
df_vbap = pd.read_excel(excel_file, sheet_name='VBAP')
df_likp = pd.read_excel(excel_file, sheet_name='LIKP')
df_lips = pd.read_excel(excel_file, sheet_name='LIPS')
df_vttk = pd.read_excel(excel_file, sheet_name='VTTK')
df_vttp = pd.read_excel(excel_file, sheet_name='VTTP')

## Data Transformation

### 1. Customers and SAP_Customers Tables
Use the KNA1 sheet to build the customers table. The sap_customers table is an optional copy of it.

In [3]:
customers_df = df_kna1.copy()
customers_df.rename(columns={
    'Customer ID': 'customer_id',
    'Customer Name': 'customer_name',
    'Country': 'country',
    'Region': 'region',
    'City': 'city',
    'Postal Code': 'postal_code',
    'Street Address': 'street_address',
    'Phone Number': 'phone_number',
    'Email Address': 'email_address',
    'Language': 'language',
    'Tax Number': 'tax_number',
    'Customer Group': 'customer_group',
    'Sales Organization': 'sales_organization',
    'Distribution Channel': 'distribution_channel',
    'Division': 'division'
}, inplace=True)
sap_customers_df = customers_df.copy()  # Optional duplicate
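
One caveat with the rename step: by default, `DataFrame.rename` silently ignores mapping keys that do not match any column, so a changed header in the Excel sheet would go unnoticed. A small sketch of the stricter `errors='raise'` option follows. The toy frame and its `Customer No` header are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for a KNA1 sheet whose header was changed upstream
toy_kna1 = pd.DataFrame({'Customer No': [1001], 'Customer Name': ['Acme']})
mapping = {'Customer ID': 'customer_id', 'Customer Name': 'customer_name'}

# Default behaviour: the unmatched 'Customer ID' key is skipped without warning
lenient = toy_kna1.rename(columns=mapping)
print(list(lenient.columns))  # ['Customer No', 'customer_name']

# Strict behaviour: pandas raises KeyError when a mapping key is not a column
try:
    toy_kna1.rename(columns=mapping, errors='raise')
    strict_failed = False
except KeyError as exc:
    strict_failed = True
    print("Rename failed:", exc)
```

Passing `errors='raise'` to the rename calls in this notebook would surface header mismatches at transformation time instead of later, during validation or load.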

### 2. Orders and Order_Items Tables
**Orders:** Derived from VBAK (order header).  
**Order_Items:** Derived from VBAP (order item details).

In [4]:
orders_df = df_vbak.copy()
orders_df.rename(columns={
    'Sales Document': 'order_id',
    'Order Date': 'order_date',
    'Customer ID': 'customer_id',
    'Order Type': 'order_type',
    'Sales Organization': 'sales_organization',
    'Distribution Channel': 'distribution_channel',
    'Division': 'division',
    'Order Status': 'order_status'
}, inplace=True)
orders_df['order_date'] = pd.to_datetime(orders_df['order_date'], errors='coerce')

order_items_df = df_vbap.copy()
order_items_df.rename(columns={
    'Sales Document': 'order_id',
    'Item Number': 'item_number',
    'Material Number': 'product_id',
    'Quantity': 'order_quantity',
    'Net Price': 'unit_price',
    'Item Status': 'item_status',
    'Delivery Date': 'delivery_date'
}, inplace=True)
order_items_df['delivery_date'] = pd.to_datetime(order_items_df['delivery_date'], errors='coerce')
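
Because `errors='coerce'` turns unparseable values into `NaT` without reporting them, it can be useful to count how many values were lost. The helper below is a sketch under that assumption; `count_unparseable_dates` is a hypothetical name, and the sample series is invented.

```python
import pandas as pd

def count_unparseable_dates(raw):
    """Count values that were present in `raw` but became NaT after coercion."""
    parsed = pd.to_datetime(raw, errors='coerce')
    # A value is "lost" only if it was non-null before parsing
    return int((parsed.isna() & raw.notna()).sum())

sample = pd.Series(['2024-01-15', 'not a date', None, '2024-02-01'])
bad = count_unparseable_dates(sample)
print(bad)  # 1 (the None entry was already missing, so it is not counted)
```

Running this on the raw column before overwriting it (for example `df_vbak['Order Date']`) would show whether any dates were silently dropped.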


### 3. Shipments and Shipment_Items Tables
**Shipments:** Based on LIKP (delivery header).  
**Shipment_Items:** Based on LIPS (delivery items).

In [5]:
shipments_df = df_likp.copy()
shipments_df.rename(columns={
    'Delivery Number': 'shipment_id',
    'Delivery Date': 'shipment_date',
    'Sales Document': 'order_id',
    'Shipping Point': 'shipping_point',
    'Shipping Type': 'shipping_type',
    'Delivery Status': 'delivery_status',
    'Shipping Status': 'shipping_status',
    'Route': 'route',
    'Delivery Priority': 'delivery_priority',
    'Customer ID': 'customer_id'
}, inplace=True)
shipments_df['shipment_date'] = pd.to_datetime(shipments_df['shipment_date'], errors='coerce')

shipment_items_df = df_lips.copy()
shipment_items_df.rename(columns={
    'Delivery Number': 'shipment_id',
    'Item Number': 'item_number',
    'Material Number': 'product_id',
    'Delivered Quantity': 'shipped_quantity',
    'Net Price': 'unit_price',
    'Delivery Status': 'delivery_status',
    'Customer ID': 'customer_id',
    'Sales Document': 'order_id',
    'Sales Item': 'sales_item',
    'Delivery Date': 'delivery_date'
}, inplace=True)
shipment_items_df['delivery_date'] = pd.to_datetime(shipment_items_df['delivery_date'], errors='coerce')

### 4. Carriers Table
Use the LFA1 sheet as carriers.

In [6]:
carriers_df = df_lfa1.copy()
carriers_df.rename(columns={
    'Vendor Number': 'carrier_id',
    'Vendor Name': 'carrier_name',
    'Country': 'country',
    'Region': 'region',
    'City': 'city',
    'Postal Code': 'postal_code',
    'Street Address': 'street_address',
    'Phone Number': 'phone_number',
    'Email Address': 'email_address',
    'Language': 'language',
    'Tax Number': 'tax_number',
    'Payment Terms': 'payment_terms'
}, inplace=True)

### 5. Delivery_Status Table
Merge VTTK (shipment header) and VTTP (shipment items) to build the delivery_status table.
Because both sheets have common columns (e.g. "Shipment Date"), suffixes are applied.
We use the header values for key fields by renaming:
  - "Shipment Date_header" → "shipment_date"
  - "Sales Document_header" → "order_id"
  - "Delivery Number_header" → "shipment_id"
  - "Customer ID_header" → "customer_id"

In [7]:
delivery_status_df = pd.merge(df_vttk, df_vttp, on='Shipment Number', how='inner', 
                              suffixes=('_header', '_item'))
# Rename columns using header values
delivery_status_df.rename(columns={
    'Shipment Number': 'shipment_number',
    'Shipment Date_header': 'shipment_date',
    'Shipment Status': 'shipment_status',
    'Carrier': 'carrier',
    'Sales Document_header': 'order_id',
    'Delivery Number_header': 'shipment_id',
    'Customer ID_header': 'customer_id'
}, inplace=True)
delivery_status_df['shipment_date'] = pd.to_datetime(delivery_status_df['shipment_date'], errors='coerce')
delivery_status_df.drop_duplicates(inplace=True)
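
To make the suffix behaviour concrete, here is a toy merge with invented values. It assumes only that both frames share a `Shipment Date` column in addition to the join key. Columns that appear in just one frame keep their original names.

```python
import pandas as pd

# One shipment header ...
vttk_toy = pd.DataFrame({
    'Shipment Number': [2001001],
    'Shipment Date': ['2024-01-10'],
    'Carrier': ['C1'],
})
# ... with two shipment items
vttp_toy = pd.DataFrame({
    'Shipment Number': [2001001, 2001001],
    'Shipment Date': ['2024-01-10', '2024-01-10'],
    'Delivery Number': [80001, 80002],
})

merged = pd.merge(vttk_toy, vttp_toy, on='Shipment Number', how='inner',
                  suffixes=('_header', '_item'))
print(list(merged.columns))
# ['Shipment Number', 'Shipment Date_header', 'Carrier',
#  'Shipment Date_item', 'Delivery Number']
print(len(merged))  # 2
```

Note that the header row is repeated once per item, so `Shipment Number` alone is not unique in the merged result unless every shipment has exactly one item.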


### 6. Delivery_Analytics Table
Compute performance metrics by joining orders and shipments.
For each order, calculate:
  - **delivery_time:** Difference in days between order_date and earliest shipment_date.
  - **on_time:** Flag if delivery_time is within a threshold (e.g. ≤ 3 days).
  - **delay_reason:** 'Delayed' if not on time, otherwise blank. Orders with no shipment have no delivery_time and are therefore also flagged 'Delayed'.

In [8]:
orders_shipments = pd.merge(orders_df[['order_id','order_date','customer_id']], 
                            shipments_df[['shipment_id','order_id','shipment_date']], 
                            on='order_id', how='left')
agg_shipments = orders_shipments.groupby('order_id').agg({'shipment_date': 'min'}).reset_index()
delivery_analytics_df = pd.merge(orders_df[['order_id','order_date','customer_id']], 
                                 agg_shipments, on='order_id', how='left')
delivery_analytics_df['delivery_time'] = (delivery_analytics_df['shipment_date'] - delivery_analytics_df['order_date']).dt.days
threshold = 3  # maximum number of days that still counts as on time
# The comparison is False for NaN delivery_time (orders without a shipment)
delivery_analytics_df['on_time'] = delivery_analytics_df['delivery_time'] <= threshold
delivery_analytics_df['delay_reason'] = np.where(delivery_analytics_df['on_time'], '', 'Delayed')
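
If unshipped orders should be reported separately from late ones, a three-way status is one possible alternative to the boolean flag. The sketch below uses invented toy data, and the `status` column and its labels are assumptions, not part of the notebook's data model.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'order_id': [1, 2, 3],
    'order_date': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-01']),
    'shipment_date': pd.to_datetime(['2024-01-03', '2024-01-08', None]),
})
toy['delivery_time'] = (toy['shipment_date'] - toy['order_date']).dt.days

threshold = 3
has_shipment = toy['delivery_time'].notna()

# Conditions are evaluated in order; the first match wins
toy['status'] = np.select(
    [~has_shipment, toy['delivery_time'] <= threshold],
    ['Not shipped', 'On time'],
    default='Delayed',
)
print(toy['status'].tolist())  # ['On time', 'Delayed', 'Not shipped']
```

This keeps orders that have not shipped yet out of the delayed count, which may or may not match the intended business definition.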

## Data Validation Checks
Validate primary key uniqueness and foreign key consistency using helper functions.

In [9]:
def check_primary_key_uniqueness(df, key_columns, table_name):
    duplicates = df.duplicated(subset=key_columns)
    if duplicates.any():
        print(f"WARNING: Duplicates in {table_name} for key columns {key_columns}:")
        print(df[duplicates][key_columns])
    else:
        print(f"Primary key check passed for {table_name}.")

def check_foreign_key(child_df, child_key, parent_df, parent_key, table_name):
    missing = set(child_df[child_key].dropna()) - set(parent_df[parent_key].dropna())
    if missing:
        print(f"WARNING: In {table_name}, the following {child_key} values are missing in parent table: {missing}")
    else:
        print(f"Foreign key check passed for {table_name} ({child_key}).")
        
# delivery_status_df.drop_duplicates(subset=['shipment_number'], inplace=True)

# Primary key validations
check_primary_key_uniqueness(customers_df, ['customer_id'], 'customers')
check_primary_key_uniqueness(orders_df, ['order_id'], 'orders')
check_primary_key_uniqueness(order_items_df, ['order_id', 'item_number'], 'order_items')
check_primary_key_uniqueness(shipments_df, ['shipment_id'], 'shipments')
check_primary_key_uniqueness(shipment_items_df, ['shipment_id', 'item_number'], 'shipment_items')
check_primary_key_uniqueness(carriers_df, ['carrier_id'], 'carriers')
check_primary_key_uniqueness(delivery_status_df, ['shipment_number'], 'delivery_status')
check_primary_key_uniqueness(delivery_analytics_df, ['order_id'], 'delivery_analytics')

# Foreign key validations
check_foreign_key(orders_df, 'customer_id', customers_df, 'customer_id', 'orders')
check_foreign_key(order_items_df, 'order_id', orders_df, 'order_id', 'order_items')
check_foreign_key(shipments_df, 'order_id', orders_df, 'order_id', 'shipments')
check_foreign_key(shipment_items_df, 'order_id', orders_df, 'order_id', 'shipment_items')
check_foreign_key(delivery_status_df, 'order_id', orders_df, 'order_id', 'delivery_status')
check_foreign_key(delivery_analytics_df, 'customer_id', customers_df, 'customer_id', 'delivery_analytics')

Primary key check passed for customers.
Primary key check passed for orders.
Primary key check passed for order_items.
Primary key check passed for shipments.
Primary key check passed for shipment_items.
Primary key check passed for carriers.
WARNING: Duplicates in delivery_status for key columns ['shipment_number']:
   shipment_number
1          2001001
Primary key check passed for delivery_analytics.
Foreign key check passed for orders (customer_id).
Foreign key check passed for order_items (order_id).
Foreign key check passed for shipments (order_id).
Foreign key check passed for shipment_items (order_id).
Foreign key check passed for delivery_status (order_id).
Foreign key check passed for delivery_analytics (customer_id).
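
The foreign key helper above reports missing key values as a set. When the offending rows themselves are needed, a left merge with `indicator=True` is one way to retrieve them. This is a sketch with a hypothetical helper name (`find_orphans`) and toy data. Unlike `check_foreign_key`, it does not drop null keys first, so a child row with a null key is reported as an orphan whenever the parent has no null key.

```python
import pandas as pd

def find_orphans(child_df, parent_df, key):
    """Return the child rows whose `key` value has no match in parent_df."""
    parent_keys = parent_df[[key]].drop_duplicates()
    flagged = child_df.merge(parent_keys, on=key, how='left', indicator=True)
    # '_merge' is 'left_only' for rows that found no parent
    return flagged.loc[flagged['_merge'] == 'left_only'].drop(columns='_merge')

orders_toy = pd.DataFrame({'order_id': [10, 11]})
items_toy = pd.DataFrame({'order_id': [10, 11, 12], 'item_number': [1, 1, 1]})

orphans = find_orphans(items_toy, orders_toy, 'order_id')
print(orphans['order_id'].tolist())  # [12]
```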


## Load Data into MySQL
Create a MySQL engine and load each DataFrame into corresponding tables.
Table names are set to lower-case.

In [10]:
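# Connection settings for a local MySQL instance (replace with your own; do not commit real credentials)
username = 'root'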
password = '12345'
host = 'localhost'
port = '3306'
database = 'case1'
engine = create_engine(f'mysql+pymysql://{username}:{password}@{host}:{port}/{database}')

# Load tables to MySQL (table names in lower-case)
customers_df.to_sql('customers', con=engine, if_exists='replace', index=False)
sap_customers_df.to_sql('sap_customers', con=engine, if_exists='replace', index=False)
orders_df.to_sql('orders', con=engine, if_exists='replace', index=False)
order_items_df.to_sql('order_items', con=engine, if_exists='replace', index=False)
shipments_df.to_sql('shipments', con=engine, if_exists='replace', index=False)
shipment_items_df.to_sql('shipment_items', con=engine, if_exists='replace', index=False)
carriers_df.to_sql('carriers', con=engine, if_exists='replace', index=False)
delivery_status_df.to_sql('delivery_status', con=engine, if_exists='replace', index=False)
delivery_analytics_df.to_sql('delivery_analytics', con=engine, if_exists='replace', index=False)

print("Data loaded to MySQL database successfully.")

Data loaded to MySQL database successfully.
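
The success message only confirms that `to_sql` did not raise. A simple follow-up check is to read the row count back and compare it with the DataFrame. The sketch below uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server. `loaded_row_count` is a hypothetical helper, and the toy table is invented.

```python
import sqlite3
import pandas as pd

def loaded_row_count(conn, table_name):
    """Read back the number of rows stored in `table_name`."""
    # table_name is interpolated into the SQL text, so pass trusted names only
    return conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]

conn = sqlite3.connect(':memory:')  # stand-in for the MySQL engine
toy_orders = pd.DataFrame({'order_id': [1, 2, 3],
                           'order_status': ['Open', 'Closed', 'Open']})
toy_orders.to_sql('orders', con=conn, if_exists='replace', index=False)

count = loaded_row_count(conn, 'orders')
print(count == len(toy_orders))  # True
```

Against the MySQL engine, the same comparison could be made with `pd.read_sql` and a `SELECT COUNT(*)` query per table.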


## Conclusion
This notebook has:
- Extracted SAP ERP data from an Excel file.
- Transformed the data into a SQL data model for order fulfillment and delivery analytics.
- Validated the data for primary key uniqueness and referential integrity.
- Loaded the data into a MySQL database with lower-case table names.

The resulting data warehouse is now ready for further reporting and analytics.