# PSDDP Data Ingestion & Normalization

**Dataset**: Payment Systems Data – Daily (PSDDP)  
**Publisher**: Reserve Bank of India (RBI)

This notebook ingests raw PSDDP Excel files downloaded from the RBI website
and normalizes them into a single, canonical daily dataset suitable for
time-series analysis and anomaly detection.

This notebook is intentionally limited to **data engineering only**.
No analytics or modeling is performed here.


## Data Source & Provenance

- **Dataset**: Payment Systems Data – Daily (PSDDP)
- **Frequency**: Daily
- **Granularity**: Aggregated system-level settlements
- **Coverage**: June 2020 – December 2025
- **Source URL**:
  https://rbidocs.rbi.org.in/rdocs/content/docs/PSDDP04062020.xlsx

The dataset is publicly available and published by the RBI.
It contains no individual transaction-level information.


## Why Normalization Is Required

The PSDDP dataset exhibits **multiple structural changes over time** due to
evolving RBI reporting standards and phased introduction of instruments.

Specifically, four distinct reporting regimes are observed:

1. **Jun 2020 – Oct 2020**
   - No card-based payment data reported
   - Only core retail and interbank systems available

2. **Nov 2020 – Apr 2021**
   - Card transactions reported as merged POS + E-commerce totals
   - Financial market instruments not yet included

3. **May 2021 – Sep 2021**
   - Card transactions still reported in merged form
   - Financial market instruments (G-Sec, Forex, Rupee Derivatives) introduced

4. **Oct 2021 onwards**
   - POS and E-commerce transactions reported separately
   - All instruments reported consistently
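
The regime boundaries listed above can be expressed as a small lookup. The following is a minimal sketch in plain Python; the regime labels and the helper name `regime_for_date` are illustrative and do not appear elsewhere in this notebook. Dates before June 2020 are outside the dataset's coverage, so the helper would simply label them as the first regime.

```python
from datetime import date

# Last day of each regime, taken from the list above.
# The string labels are hypothetical names chosen for this sketch.
REGIME_ENDS = [
    (date(2020, 10, 31), "no_cards"),
    (date(2021, 4, 30), "merged_no_markets"),
    (date(2021, 9, 30), "merged_with_markets"),
]


def regime_for_date(d: date) -> str:
    """Return the reporting-regime label that applies to a calendar date."""
    for end, label in REGIME_ENDS:
        if d <= end:
            return label
    # Anything after 30 Sep 2021 falls in the separate POS/E-commerce regime
    return "separate_pos_ecom"


print(regime_for_date(date(2020, 6, 1)))    # no_cards
print(regime_for_date(date(2021, 2, 15)))   # merged_no_markets
print(regime_for_date(date(2021, 10, 1)))   # separate_pos_ecom
```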

Additionally:
- Some instruments appear only in selected periods
- Special placeholders such as `"H"` are used for withheld values
- Missing values are often structural, not random

To enable consistent longitudinal analysis, all PSDDP files are normalized
into a single canonical schema.
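
Because the `"H"` placeholder shares columns with numeric values, a cell-level parser has to keep it distinct from ordinary missing data. The sketch below shows one way to do this in plain Python, without pandas. The helper name `parse_cell` is hypothetical, and treating every non-numeric string other than `"H"` as missing is an assumption of this sketch.

```python
import math


def parse_cell(x):
    """Return "H" for the placeholder, a float for numeric input, else NaN."""
    if isinstance(x, str):
        text = x.strip()
        if text.lower() == "h":
            return "H"              # preserve the placeholder as-is
        try:
            return float(text)      # numeric text such as "12.5"
        except ValueError:
            return math.nan         # any other text is treated as missing
    if isinstance(x, (int, float)) and not isinstance(x, bool):
        return float(x)
    return math.nan                 # None and other types are treated as missing


cells = ["h", " H ", "12.5", 7, None, "n/a"]
print([parse_cell(c) for c in cells])
# ['H', 'H', 12.5, 7.0, nan, nan]
```

Keeping `"H"` as a string means the affected columns hold mixed types, so downstream analysis has to decide explicitly how to treat it before doing arithmetic.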

In [9]:
import pandas as pd

## Canonical Output Schema

The final dataset enforces a fixed column schema across all reporting periods.

- Volume fields are in **lakhs**
- Value fields are in **₹ crores**
- Combined card metrics are explicitly derived for continuity
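
Since volumes and values use different units, any per-transaction figure needs an explicit conversion. The sketch below assumes the standard definitions (1 lakh = 10^5, 1 crore = 10^7); the helper names are hypothetical and the input numbers are made up for illustration.

```python
LAKH = 100_000        # 1 lakh  = 10^5 transactions
CRORE = 10_000_000    # 1 crore = 10^7 rupees


def volume_to_count(vol_lakhs: float) -> float:
    """Convert a *_vol field (lakhs) to a number of transactions."""
    return vol_lakhs * LAKH


def value_to_rupees(val_crores: float) -> float:
    """Convert a *_val field (crores of rupees) to rupees."""
    return val_crores * CRORE


def average_ticket_size(vol_lakhs: float, val_crores: float) -> float:
    """Average rupees per transaction for one instrument on one day."""
    return value_to_rupees(val_crores) / volume_to_count(vol_lakhs)


# Made-up inputs: 2 lakh transactions worth 3 crore rupees in total
print(average_ticket_size(2.0, 3.0))  # 150.0
```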

In [10]:
# CONSTANT FINAL SCHEMA
NON_COMBINED_COLS = [
    "date",
    "rtgs_vol", "rtgs_val",
    "neft_vol", "neft_val",
    "aeps_vol", "aeps_val",
    "upi_vol", "upi_val",
    "imps_vol", "imps_val",
    "nach_credit_vol", "nach_credit_val",
    "nach_debit_vol", "nach_debit_val",
    "netc_vol", "netc_val",
    "bbps_vol", "bbps_val",
    "cts_vol", "cts_val",
    "credit_pos_vol", "credit_pos_val",
    "credit_ecom_vol", "credit_ecom_val",
    "debit_pos_vol", "debit_pos_val",
    "debit_ecom_vol", "debit_ecom_val",
    "ppi_pos_vol", "ppi_pos_val",
    "ppi_ecom_vol", "ppi_ecom_val",
    "nfs_vol", "nfs_val",
    "aeps_bc_vol", "aeps_bc_val",
    "gov_sec_vol", "gov_sec_val",
    "forex_vol", "forex_val",
    "rupee_der_vol", "rupee_der_val"
]

COMBINED_COLS = [
    "credit_combined_vol", "credit_combined_val",
    "debit_combined_vol", "debit_combined_val",
    "ppi_combined_vol", "ppi_combined_val"
]

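# Insert the combined card columns directly after the last separate card
# column ("ppi_ecom_val"), i.e. just before "nfs_vol" (index 33)
_CARD_END = NON_COMBINED_COLS.index("nfs_vol")
FINAL_COLS = NON_COMBINED_COLS[:_CARD_END] + COMBINED_COLS + NON_COMBINED_COLS[_CARD_END:]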


## Normalization Functions

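Normalization strategies are implemented per reporting regime:

- No card data period: `normalize_no_cards`
- Merged POS/E-commerce reporting, with or without market instruments: `normalize_merged` (controlled by the `has_markets` flag)
- Fully separate POS/E-commerce reporting: handled inline in `process_file`, since the raw layout already matches `NON_COMBINED_COLS`

Each strategy maps the raw data to the canonical schema by column position.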
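
The core idea is the same in every regime: label the raw cells by position using that regime's column list, then emit every canonical column and leave unreported ones empty. Below is a minimal, pandas-free sketch of that idea on a toy schema. The helper `map_to_schema`, the shortened column lists, and the sample values are all hypothetical.

```python
def map_to_schema(row, regime_cols, final_cols):
    """Label the first len(regime_cols) cells by position, then return a dict
    with every canonical column, using None where the regime has no data."""
    labelled = dict(zip(regime_cols, row[:len(regime_cols)]))
    return {col: labelled.get(col) for col in final_cols}


# Toy canonical schema (a small subset, for illustration only)
final_cols = ["date", "rtgs_vol", "credit_pos_vol", "credit_combined_vol", "nfs_vol"]

# Columns actually present, in order, during a "no cards" style regime
regime_cols = ["date", "rtgs_vol", "nfs_vol"]

# Raw row with trailing padding cells that should be ignored
raw_row = ["2020-06-01", 4.85, 1.0, None, None]

print(map_to_schema(raw_row, regime_cols, final_cols))
# {'date': '2020-06-01', 'rtgs_vol': 4.85, 'credit_pos_vol': None,
#  'credit_combined_vol': None, 'nfs_vol': 1.0}
```

Positional mapping is only safe if the raw sheet's column order matches the regime's column list exactly, which is why each regime carries its own list.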


In [11]:
def normalize_no_cards(df):
    """
    Normalizes PSDDP data for the period where no card
    instruments were reported.

    Strategy:
    - Assign known non-card columns
    - Reindex to final schema
    - Fill missing fields with NA
    """

    cols = [
        "date",
        "rtgs_vol", "rtgs_val",
        "neft_vol", "neft_val",
        "aeps_vol", "aeps_val",
        "upi_vol", "upi_val",
        "imps_vol", "imps_val",
        "nach_credit_vol", "nach_credit_val",
        "nach_debit_vol", "nach_debit_val",
        "netc_vol", "netc_val",
        "bbps_vol", "bbps_val",
        "cts_vol", "cts_val",
        "nfs_vol", "nfs_val",
        "aeps_bc_vol", "aeps_bc_val"
    ]

    df = df.iloc[:, :len(cols)].copy()
    df.columns = cols

    return df.reindex(columns=FINAL_COLS, fill_value=pd.NA)


In [12]:
def normalize_merged(df, has_markets):
    """
    Normalizes PSDDP data for the period where POS and
    E-commerce card transactions were reported in merged form.

    Parameters
    ----------
    has_markets : bool
        Indicates whether market instruments (G-Sec, Forex,
        Rupee Derivatives) are present in the file.
    """

    cols = [
        "date",
        "rtgs_vol", "rtgs_val",
        "neft_vol", "neft_val",
        "aeps_vol", "aeps_val",
        "upi_vol", "upi_val",
        "imps_vol", "imps_val",
        "nach_credit_vol", "nach_credit_val",
        "nach_debit_vol", "nach_debit_val",
        "netc_vol", "netc_val",
        "bbps_vol", "bbps_val",
        "cts_vol", "cts_val",
        "credit_vol", "credit_val",
        "debit_vol", "debit_val",
        "ppi_vol", "ppi_val",
        "nfs_vol", "nfs_val",
        "aeps_bc_vol", "aeps_bc_val"
    ]

    if has_markets:
        cols += [
            "gov_sec_vol", "gov_sec_val",
            "forex_vol", "forex_val",
            "rupee_der_vol", "rupee_der_val"
        ]

    df = df.iloc[:, :len(cols)].copy()
    df.columns = cols

    # Map merged card values → combined fields
    df["credit_combined_vol"] = df["credit_vol"]
    df["credit_combined_val"] = df["credit_val"]
    df["debit_combined_vol"] = df["debit_vol"]
    df["debit_combined_val"] = df["debit_val"]
    df["ppi_combined_vol"] = df["ppi_vol"]
    df["ppi_combined_val"] = df["ppi_val"]

    # POS and Ecom not separately reported
    for c in ["credit", "debit", "ppi"]:
        df[f"{c}_pos_vol"] = pd.NA
        df[f"{c}_pos_val"] = pd.NA
        df[f"{c}_ecom_vol"] = pd.NA
        df[f"{c}_ecom_val"] = pd.NA

    # Drop temporary merged columns
    df.drop(
        columns=["credit_vol", "credit_val",
                 "debit_vol", "debit_val",
                 "ppi_vol", "ppi_val"],
        inplace=True
    )

    # Ensure market columns exist
    for col in [
        "gov_sec_vol", "gov_sec_val",
        "forex_vol", "forex_val",
        "rupee_der_vol", "rupee_der_val"
    ]:
        if col not in df.columns:
            df[col] = pd.NA

    return df.reindex(columns=FINAL_COLS, fill_value=pd.NA)


## PSDDP File Processing Logic

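Each Excel sheet is processed independently:

1. Metadata rows at the top of the sheet are skipped
2. Summary rows (Total / Notes) are removed
3. Date column is parsed; rows without a valid date are dropped
4. Reporting period is inferred from the latest date in the sheet
5. Appropriate normalization strategy is applied
6. `"H"` placeholders are preserved; all other values are coerced to numeric
7. Combined card metrics are derived for the separate POS/E-commerce period

This assumes each sheet falls entirely within one reporting regime.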
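
The row-cleaning steps can be illustrated without pandas. The sketch below filters summary rows, parses the date cell, and finds the latest date, which is what drives the choice of regime. The helper names and the two date formats are assumptions of this sketch; the formats used in the actual RBI sheets may differ.

```python
from datetime import datetime


def is_summary_row(first_cell) -> bool:
    """True for rows whose first cell mentions 'total' or 'note'."""
    text = str(first_cell).lower()
    return "total" in text or "note" in text


def parse_date(cell, formats=("%Y-%m-%d", "%d-%m-%Y")):
    """Try each assumed format in turn; return None if none of them match."""
    if isinstance(cell, datetime):
        return cell
    for fmt in formats:
        try:
            return datetime.strptime(str(cell).strip(), fmt)
        except ValueError:
            continue
    return None


def clean_rows(rows):
    """Drop summary rows and rows whose first cell is not a parseable date."""
    cleaned = []
    for row in rows:
        if not row or is_summary_row(row[0]):
            continue
        parsed = parse_date(row[0])
        if parsed is None:
            continue
        cleaned.append([parsed] + list(row[1:]))
    return cleaned


# Made-up rows for illustration
rows = [
    ["2020-06-01", 1.0],
    ["Total", 5.0],
    ["Note: provisional data", None],
    ["junk", 2.0],
    ["02-06-2020", 3.0],
]
cleaned = clean_rows(rows)
latest = max(r[0] for r in cleaned)
print(len(cleaned), latest.date())  # 2 2020-06-02
```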


In [13]:
def process_file(file_path, output_path="psddp_clean.csv"):
    """
    Process RBI PSDDP Excel file and normalize it into a canonical schema.

    Handles:
    - Jun 2020 – Oct 2020: No card data
    - Nov 2020 – Apr 2021: Merged POS + Ecom (no market data)
    - May 2021 – Sep 2021: Merged POS + Ecom (with market data)
    - Oct 2021 onwards: Separate POS and Ecom
    """

    dfs = []

    with pd.ExcelFile(file_path) as xls:
        for sheet in xls.sheet_names:

            # Read raw sheet (skip RBI metadata rows)
            df = pd.read_excel(xls, sheet, header=None, skiprows=6)
            df.dropna(how="all", inplace=True)

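            # Clean and parse date column
            df.rename(columns={0: "date"}, inplace=True)
            # Drop summary rows; .copy() avoids SettingWithCopyWarning on later assignments
            df = df[~df["date"].astype(str).str.contains("total|note", case=False, na=False)].copy()
            # format="mixed" requires pandas >= 2.0
            df["date"] = pd.to_datetime(df["date"], errors="coerce", format="mixed")
            df = df[df["date"].notna()].copy()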

            if df.empty:
                continue

            # Pad columns defensively (RBI sheets vary)
            for i in range(df.shape[1], 55):
                df[i] = pd.NA

            max_date = df["date"].max()
            is_separate = False

            # Route based on reporting regime
            if max_date <= pd.Timestamp("2020-10-31"):
                df = normalize_no_cards(df)

            elif max_date <= pd.Timestamp("2021-04-30"):
                df = normalize_merged(df, has_markets=False)

            elif max_date <= pd.Timestamp("2021-09-30"):
                df = normalize_merged(df, has_markets=True)

            else:
                # POS and E-commerce reported separately
                df = df.iloc[:, :len(NON_COMBINED_COLS)].copy()
                df.columns = NON_COMBINED_COLS
                is_separate = True

            # Convert numeric columns with RBI placeholder handling
            current_cols = [c for c in df.columns if c != "date"]
            for col in current_cols:
                df[col] = df[col].apply(
                    lambda x: "H" if isinstance(x, str) and x.strip().lower() == "h"
                    else pd.to_numeric(x, errors="coerce")
                )

            # Compute combined POS + E-commerce metrics (Oct 2021+)
            if is_separate:

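                def to_num_for_sum(s):
                    # Treat "H" and missing values as 0 so that a value
                    # reported on only one side still contributes to the sum
                    return s.replace("H", 0).fillna(0)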

                for c in ["credit", "debit", "ppi"]:

                    # Volume
                    pos_vol = df[f"{c}_pos_vol"]
                    ecom_vol = df[f"{c}_ecom_vol"]
                    vol_sum = to_num_for_sum(pos_vol) + to_num_for_sum(ecom_vol)
                    mask_h = (pos_vol == "H") & (ecom_vol == "H")
                    df[f"{c}_combined_vol"] = vol_sum.where(~mask_h, "H")

                    # Value
                    pos_val = df[f"{c}_pos_val"]
                    ecom_val = df[f"{c}_ecom_val"]
                    val_sum = to_num_for_sum(pos_val) + to_num_for_sum(ecom_val)
                    mask_h = (pos_val == "H") & (ecom_val == "H")
                    df[f"{c}_combined_val"] = val_sum.where(~mask_h, "H")

                df = df.reindex(columns=FINAL_COLS, fill_value=pd.NA)

            # Final numeric coercion for newly created columns
            for col in df.columns:
                if col != "date":
                    df[col] = df[col].apply(
                        lambda x: "H" if isinstance(x, str) and x.strip().lower() == "h"
                        else pd.to_numeric(x, errors="coerce")
                    )

            dfs.append(df)

    # Concatenate all sheets and write output
    if not dfs:
        raise ValueError(f"No usable data rows found in {file_path}")
    out = pd.concat(dfs, ignore_index=True)
    out.to_csv(output_path, index=False)

    print(f"CSV written: {output_path}")
    print(f"Rows: {len(out)}")


In [14]:

# RUN PIPELINE
process_file(
    file_path="../data/raw/psddp/PSDDP04062020.xlsx",
    output_path="../data/processed/psddp_clean.csv"
)


CSV written: ../data/processed/psddp_clean.csv
Rows: 2030


## Output Validation: Cross-Period Sanity Checks

As a manual verification step, one representative row is inspected from
each known reporting regime.

This ensures that:
- Expected columns are populated or blank as per the reporting period
- Combined vs separate card fields behave correctly
- Market instruments appear only in valid periods

This check is intended for **structural validation**, not analytics.
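
The same structural expectations can also be checked programmatically. The sketch below is a plain-Python illustration, not part of the notebook's pipeline: it derives which columns should be blank for a given date from the regime boundaries described earlier and reports any that are populated. The helper names are hypothetical, and rows are assumed to be dicts keyed by the canonical column names with ISO-formatted date strings.

```python
from datetime import date

CARD_PREFIXES = ("credit", "debit", "ppi")
MARKET_PREFIXES = ("gov_sec", "forex", "rupee_der")
METRICS = ("vol", "val")


def _is_blank(v) -> bool:
    """Blank means None, empty string, or float NaN."""
    return v is None or v == "" or (isinstance(v, float) and v != v)


def expected_blank_columns(d: date):
    """Columns that should be empty for a row dated d."""
    cols = []
    if d <= date(2021, 4, 30):    # market instruments not yet reported
        cols += [f"{p}_{m}" for p in MARKET_PREFIXES for m in METRICS]
    if d <= date(2021, 9, 30):    # POS / E-commerce not yet split
        cols += [f"{p}_{k}_{m}" for p in CARD_PREFIXES
                 for k in ("pos", "ecom") for m in METRICS]
    if d <= date(2020, 10, 31):   # no card data at all
        cols += [f"{p}_combined_{m}" for p in CARD_PREFIXES for m in METRICS]
    return cols


def structural_violations(row: dict):
    """Return the columns that are populated but should be blank."""
    d = date.fromisoformat(row["date"][:10])
    return [c for c in expected_blank_columns(d) if not _is_blank(row.get(c))]


# Made-up rows for illustration
ok_row = {"date": "2020-06-01", "rtgs_vol": "4.85", "credit_pos_vol": ""}
bad_row = {"date": "2020-06-01", "credit_combined_vol": "1.0"}
print(structural_violations(ok_row))   # []
print(structural_violations(bad_row))  # ['credit_combined_vol']
```

Rows read with `csv.DictReader` from the normalized CSV could be passed to `structural_violations` one at a time, which would extend the check from one sample row per regime to every row.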


In [15]:
import pandas as pd

# SHOW ONE REPRESENTATIVE ROW PER REPORTING PERIOD
def show_one_row_per_period(csv_path):
    """
    Displays one sample row from each PSDDP reporting regime
    to manually validate structural correctness of the
    normalized dataset.
    """

    # Display all columns without truncation
    pd.set_option("display.max_columns", None)
    pd.set_option("display.width", None)
    pd.set_option("display.max_colwidth", None)

    # Load normalized PSDDP output
    df = pd.read_csv(csv_path, parse_dates=["date"])

    # Representative dates chosen from each known regime
    periods = {
        "Jun 2020 – Oct 2020 (No Cards)": "2020-06-01",
        "Nov 2020 – Apr 2021 (Merged Cards, No Markets)": "2020-11-01",
        "May 2021 – Sep 2021 (Merged Cards + Markets)": "2021-05-01",
        "Oct 2021 – Present (Separate POS/Ecom)": "2021-10-01"
    }

    for period_name, date_str in periods.items():
        date = pd.Timestamp(date_str)
        period_row = df[df["date"] == date]

        print(f"\nPeriod: {period_name}")
        print(f"Sample Date: {date_str}")

        if period_row.empty:
            print("⚠️ No data found for this date.")
        else:
            print(period_row.iloc[0])

        print("-" * 120)


In [16]:
# RUN VALIDATION CHECK
show_one_row_per_period("../data/processed/psddp_clean.csv")



Period: Jun 2020 – Oct 2020 (No Cards)
Sample Date: 2020-06-01
date                   2020-06-01 00:00:00
rtgs_vol                              4.85
rtgs_val                         436996.69
neft_vol                            172.11
neft_val                         104275.13
aeps_vol                           0.43618
aeps_val                          7.682205
upi_vol                           476.9671
upi_val                       10413.108975
imps_vol                          76.80648
imps_val                       9072.554678
nach_credit_vol                      84.86
nach_credit_val                    6326.17
nach_debit_vol                       20.12
nach_debit_val                     2084.74
netc_vol                          23.82175
netc_val                         44.367367
bbps_vol                              5.43
bbps_val                             95.36
cts_vol                            17.5486
cts_val                         15056.2426
credit_pos_vol                   

## Scope Note

This notebook is intentionally restricted to **data ingestion and
normalization only**.

All exploratory analysis, feature engineering, and anomaly detection
are performed in subsequent notebooks.
