# DSTA-0000--bsr-design-sprint--munge-bsis-test-data

## Context

This work is part of the Breast Screening Reporting team's July 2025 KC63 design sprint.
This notebook adds random noise to the numeric values in BSIS test data before we send the data to FDP.
The test data comes from the BSIS Confluence space.

We load four datasets from the `notebooks/data/inputs` folder, modify three of them (the ones with numeric values), and write outputs to `notebooks/data/outputs`.

Datasets:
1. `bso_gpp_20231123.csv`
   1. BSO to GPP mapping.
   2. This file is passed through unmodified; only the file name changes.
2. `kc63_bso_20231123_FromBSS.csv`, `kc63_gpp_20231123_FromBSS.csv`, `kc63_utla_20231123_FromBSS.csv`
   1. These three files contain the same line-item data, grouped in three different ways.
   2. For each file we modify the numeric values as described in `perturb_numeric_values_where_possible` (in the `_01_munge_bsis_test_data` module).
      1. The aim is to add noise to test data numbers before we send data to FDP.
      2. This is precautionary since no PII is in this data.
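
The real perturbation logic lives in `_01_munge_bsis_test_data` and is not reproduced in this notebook. As a rough illustration only, the sketch below shows one way a "random number below a threshold, smear above it" rule could work. The function name `perturb_value`, its parameters, and the exact arithmetic are assumptions made for this example, not the module's actual behaviour.

```python
import random


def perturb_value(value, below_threshold=10, smear_factor=0.5, rng=random):
    """Hypothetical stand-in for utils.perturb_numeric_values_where_possible."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        # Not an integer (for example a label or a blank cell): pass through unchanged.
        return value
    if number < below_threshold:
        # Small counts: replace with a random integer in [0, below_threshold).
        return str(rng.randrange(below_threshold))
    # Larger counts: shift by a random amount within +/- smear_factor of the value.
    spread = int(number * smear_factor)
    return str(number + rng.randint(-spread, spread))
```

Under these assumptions, `perturb_value("abc")` returns `"abc"` unchanged, while `perturb_value("100")` returns a string holding an integer between 50 and 150.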

## Setup

In [None]:
%load_ext autoreload
%autoreload 2
%load_ext jupyter_black

In [None]:
import csv
import logging
from pathlib import Path

import numpy as np
import pandas as pd

import _01_munge_bsis_test_data as utils

utils.set_up_logging()
logger = logging.getLogger(__name__)

In [None]:
DATA_FOLDER__INPUTS = Path("data/inputs")
DATA_FOLDER__OUTPUTS = Path("data/outputs")
DEFAULT_DATA_TYPE = str

COL__KC63_OUTPUT_BY_RUN_ID = "kc63_output_by_run_id"
COL__LINE_NUMBER = "col_b"
COL__AGE_BUCKET_LABEL = "col_c"

## Load KC63 test datasets

In [None]:
raw__bso_gpp = pd.read_csv(DATA_FOLDER__INPUTS / "bso_gpp_20231123.csv", dtype=DEFAULT_DATA_TYPE)
raw__bso = pd.read_csv(DATA_FOLDER__INPUTS / "kc63_bso_20231123_FromBSS.csv", dtype=DEFAULT_DATA_TYPE)
raw__gpp = pd.read_csv(DATA_FOLDER__INPUTS / "kc63_gpp_20231123_FromBSS.csv", dtype=DEFAULT_DATA_TYPE)
raw__utla = pd.read_csv(DATA_FOLDER__INPUTS / "kc63_utla_20231123_FromBSS.csv", dtype=DEFAULT_DATA_TYPE)

## Munge

### Perturb numeric values

In [None]:
%%time

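# The BSO-to-GPP mapping has no numeric values to perturb, so it is passed through as-is.
bso_gpp = raw__bso_gpp


def perturb_dataframe_numeric_values_where_possible(df):
    """Return a copy of `df` with noise added to every column except the identifier columns."""
    # Identifier columns (run ID, line number, age bucket label) must stay untouched.
    cols__perturbation = df.columns.drop([COL__KC63_OUTPUT_BY_RUN_ID, COL__LINE_NUMBER, COL__AGE_BUCKET_LABEL])
    df_out = df.copy()
    # Element-wise `DataFrame.map` needs pandas >= 2.1 (it was `applymap` in earlier versions).
    df_out[cols__perturbation] = df_out[cols__perturbation].map(
        lambda x: utils.perturb_numeric_values_where_possible(
            x, apply_random_number_below_threshold=10, above_threshold_smear_factor=0.5
        )
    )
    return df_out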


bso = perturb_dataframe_numeric_values_where_possible(raw__bso)
gpp = perturb_dataframe_numeric_values_where_possible(raw__gpp)
utla = perturb_dataframe_numeric_values_where_possible(raw__utla)

In [None]:
top_n = 3
display(raw__bso.head(top_n))
display(bso.head(top_n))

In [None]:
utils.xcheck__perturbation_worked_as_expected(bso, raw__bso)
utils.xcheck__perturbation_worked_as_expected(gpp, raw__gpp)
utils.xcheck__perturbation_worked_as_expected(utla, raw__utla)

### Export

In [None]:
bso_gpp.to_csv(DATA_FOLDER__OUTPUTS / "2023-11-23--bso--gpp.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)
utils.exporter(bso, DATA_FOLDER__OUTPUTS, "2023-11-23--bso--perturbed.csv")
utils.exporter(gpp, DATA_FOLDER__OUTPUTS, "2023-11-23--gpp--perturbed.csv")
utils.exporter(utla, DATA_FOLDER__OUTPUTS, "2023-11-23--utla--perturbed.csv")

### Generate and export data representing previous years

In [None]:
# Specify the list of dates to generate output for
dates = [
    "2022-11-23",
    "2021-11-23",
    "2020-11-23",
    "2019-11-23",
    "2018-11-23"
]

In [None]:
for date in dates:
    # Perturb the same raw inputs afresh for each date, then export under that date's file names.
    bso_perturbed = perturb_dataframe_numeric_values_where_possible(raw__bso)
    gpp_perturbed = perturb_dataframe_numeric_values_where_possible(raw__gpp)
    utla_perturbed = perturb_dataframe_numeric_values_where_possible(raw__utla)

    # The BSO-to-GPP mapping is exported unmodified for every date.
    bso_gpp.to_csv(DATA_FOLDER__OUTPUTS / f"{date}--bso--gpp.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)
    utils.exporter(bso_perturbed, DATA_FOLDER__OUTPUTS, f"{date}--bso--perturbed.csv")
    utils.exporter(gpp_perturbed, DATA_FOLDER__OUTPUTS, f"{date}--gpp--perturbed.csv")
    utils.exporter(utla_perturbed, DATA_FOLDER__OUTPUTS, f"{date}--utla--perturbed.csv")