
# Parameters

**Project** CCU056

**Description** This notebook defines a set of parameters, which is loaded in each notebook in the data curation pipeline, so that helper functions and parameters are consistently available.

**Author(s)** Tom Bolton, Fionna Chalmers, Anna Stevenson (Health Data Science Team, BHF Data Science Centre)

**Reviewers** ⚠ UNREVIEWED

**Acknowledgements** Based on CCU004_01-D01-parameters, CCU003_05-D01-parameters and CCU002_07

**Notes** This pipeline had an initial production date of 2023-08-15 (`pipeline_production_date` == `2023-08-15`), and the `archived_on` date used for each dataset corresponds to the latest (most recent) batch of data on or before the production date. If the pipeline and all the notebooks that follow need to be updated and rerun, rerun this notebook directly (before it is called by subsequent notebooks) with `pipeline_production_date` updated and `run_all_toggle` switched to True. Afterwards, reset `run_all_toggle` to False so that subsequent notebooks that call this notebook do not have to rerun the `archived_on` section. Rerunning this notebook with the updated `pipeline_production_date` updates the `archived_on` date used for each dataset and saves these dates for reference in the collaboration database.

**Versions** 
<br>Version 6 as at '2024-03-07' - study end date also changed from 2022-08-01 to 2023-12-01
<br>Version 5 as at '2023-11-28' - which will include November provisioning - NICOR datasets still hard-coded as below
<br>Version 4 as at '2023-09-15' - issues with HES resolved but NACSA still hard-coded - also hard-coding TAVI back to 2023-03-31 as the August batch has no surgery dates
<br>Version 3 as at '2023-08-24' - hard coded dates for HES APC, HES APC OTR, NACSA - as recent versions of these have data quality issues
<br>Version 2 as at '2023-08-15'
<br>Version 1 as at '2023-07-04'

**Data Output** 
- **`ccu056_parameters_df_datasets`**: table of `archived_on` dates for each dataset that can be used consistently throughout the pipeline 

# 0. Setup

In [0]:
run_all_toggle = False

In [0]:
spark.conf.set('spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation', 'true')

# 1. Libraries

In [0]:
import pyspark.sql.functions as f
import pandas as pd
import re
import datetime

# 2. Helpers

In [0]:
%run "/Repos/shds/common/functions"

In [0]:
%run "/Repos/shds/Fionna/help_functions"

# 3. Custom Functions

In [0]:
# Function to extract the batch corresponding to the pre-defined archived_on date - will be used in subsequent notebooks.
# It also compares the number of rows expected (found when running the parameters notebook in full) against the
# number of rows observed (found when extracting the data from the archive in a subsequent notebook).
# This alerts us if the number of rows changes in the archive tables, which the data wranglers control.

from pyspark.sql import DataFrame
def extract_batch_from_archive(_df_datasets: pd.DataFrame, _dataset: str) -> DataFrame:
  
  # get row from _df_datasets (a Pandas dataframe) corresponding to the specified dataset
  _row = _df_datasets[_df_datasets['dataset'] == _dataset]
  
  # check one row only
  assert _row.shape[0] != 0, f"dataset = {_dataset} not found in _df_datasets (datasets = {_df_datasets['dataset'].tolist()})"
  assert _row.shape[0] == 1, f"dataset = {_dataset} has >1 row in _df_datasets"
  
  # create path and extract archived on
  _row = _row.iloc[0]
  _path = _row['database'] + '.' + _row['table']  
  _archived_on = _row['archived_on']
  _n_rows_expected = _row['n']  
  print(_path + ' (archived_on = ' + _archived_on + ', n_rows_expected = ' + _n_rows_expected + ')')
  
  # check path exists # commented out for runtime
#   _tmp_exists = spark.sql(f"SHOW TABLES FROM {_row['database']}")\
#     .where(f.col('tableName') == _row['table'])\
#     .count()
#   assert _tmp_exists == 1, f"path = {_path} not found"

  # extract batch
  _tmp = spark.table(_path)\
    .where(f.col('archived_on') == _archived_on)  
  
  # check number of records returned
  _n_rows_observed = _tmp.count()
  print(f'  n_rows_observed = {_n_rows_observed:,}')
  assert _n_rows_observed > 0, f"_n_rows_observed == 0"
  assert f'{_n_rows_observed:,}' == _n_rows_expected, f"_n_rows_observed != _n_rows_expected ({_n_rows_observed:,} != {_n_rows_expected})"

  # return dataframe
  return _tmp
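
The row-count check in the function above compares strings rather than integers: `n` is stored with thousands separators (see `f.format_number` in section 4.2.1), so the observed count is formatted the same way before the comparison. The following is a minimal pure-Python sketch of that comparison only; `check_row_count` is a hypothetical name used for illustration and is not part of the pipeline.

```python
def check_row_count(n_rows_observed: int, n_rows_expected: str) -> bool:
    """Return True when the observed count matches the stored, comma-formatted count.

    Hypothetical helper for illustration only; the pipeline performs this
    check inline with assert statements.
    """
    # the stored count looks like '1,234,567', so format the observed integer
    # with thousands separators before comparing the two strings
    return f'{n_rows_observed:,}' == n_rows_expected


print(check_row_count(1234567, '1,234,567'))  # batch unchanged since the parameters run
print(check_row_count(1234566, '1,234,567'))  # batch has changed in the archive
```

Comparing formatted strings avoids parsing the stored value back to an integer, at the cost of depending on both sides using the same separator convention.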

# 4. Paths and Variables

## 4.1 Set Project Specific Variables

In [0]:
# Please set and check the variables below

# -----------------------------------------------------------------------------
# Pipeline production date
# -----------------------------------------------------------------------------
# date on which the pipeline was produced; the archived_on date for each dataset is selected relative to this date
pipeline_production_date = '2024-03-07'


# -----------------------------------------------------------------------------
# Databases
# -----------------------------------------------------------------------------
db = 'dars_nic_391419_j3w9t'
dbc = f'{db}_collab'
dsa = f'dsa_391419_j3w9t_collab'

# -----------------------------------------------------------------------------
# Project
# -----------------------------------------------------------------------------
proj = 'ccu056'


# -----------------------------------------------------------------------------
# Dates
# -----------------------------------------------------------------------------
study_start_date = '2000-01-01'
study_end_date   = '2023-12-01' # changed from 2022-08-01 in Version 6 (see Versions above)
cohort = 'c01'

# -----------------------------------------------------------------------------
# Datasets
# -----------------------------------------------------------------------------
# data frame of datasets
datasets = [
  # -----------------------------------------------------------------------------
  # Datasets requested by the project
  # -----------------------------------------------------------------------------  
    ['gdppr',         dbc, f'gdppr_{db}_archive',              'NHS_NUMBER_DEID',                'DATE']  
  , ['hes_apc',       dbc, f'hes_apc_all_years_archive',       'PERSON_ID_DEID',                 'EPISTART']
  , ['hes_apc_otr',       dbc, f'hes_apc_otr_all_years_archive', 'PERSON_ID_DEID',                 '']
  , ['deaths',        dbc, f'deaths_{db}_archive',             'DEC_CONF_NHS_NUMBER_CLEAN_DEID', 'REG_DATE_OF_DEATH']
  , ['nacsa',         dbc, f'nicor_acs_combined_{db}_archive', 'PERSON_ID_DEID',                 'DATE_AND_TIME_OF_OPERATION']
  , ['tavi',          dbc, f'nicor_tavi_{db}_archive',         'PERSON_ID_DEID',                 '7_01_DATE_AND_TIME_OF_OPERATION']
 
  
  # -----------------------------------------------------------------------------
  # Additional datasets needed for the data curation pipeline for this project
  # -----------------------------------------------------------------------------
  , ['hes_ae',        dbc, f'hes_ae_all_years_archive',       'PERSON_ID_DEID',                 'ARRIVALDATE']
  , ['hes_op',        dbc, f'hes_op_all_years_archive',       'PERSON_ID_DEID',                 'APPTDATE']
  , ['hes_cc',        dbc, f'hes_cc_all_years_archive',       'PERSON_ID_DEID',                 'CCSTARTDATE'] 
  , ['chess',         dbc, f'chess_{db}_archive',             'PERSON_ID_DEID',                 'InfectionSwabDate']
  
  # -----------------------------------------------------------------------------
  # Datasets not required for this project
  # -----------------------------------------------------------------------------
#   , ['pmeds',         dbc, f'primary_care_meds_{db}_archive', 'Person_ID_DEID',                 'ProcessingPeriodDate']           
#   , ['sgss',          dbc, f'sgss_{db}_archive',              'PERSON_ID_DEID',                 'Specimen_Date']
#   , ['sus',           dbc, f'sus_{db}_archive',               'NHS_NUMBER_DEID',                'EPISODE_START_DATE'] 
#   , ['icnarc',        dbc, f'icnarc_{db}_archive',            'NHS_NUMBER_DEID',               'Date_of_admission_to_your_unit']  
#   , ['ssnap',         dbc, f'ssnap_{db}_archive',             'PERSON_ID_DEID',                'S1ONSETDATETIME'] 
#   , ['minap',         dbc, f'minap_{db}_archive',             'NHS_NUMBER_DEID',               'ARRIVAL_AT_HOSPITAL'] 
#   , ['nhfa',          dbc, f'nhfa_{db}_archive',              '1_03_NHS_NUMBER_DEID',          '2_00_DATE_OF_VISIT'] 
#   , ['nvra',          dbc, f'nvra_{db}_archive',              'NHS_NUMBER_DEID',               'DATE'] 
#   , ['vacc',          dbc, f'vaccine_status_{db}_archive',    'PERSON_ID_DEID',                'DATE_AND_TIME']  
]

tmp_df_datasets = pd.DataFrame(datasets, columns=['dataset', 'database', 'table', 'id', 'date']).reset_index()

if(run_all_toggle):
  print('tmp_df_datasets:\n', tmp_df_datasets.to_string())


## 4.2 Datasets Archived States

### 4.2.1 Create

In [0]:
# for each dataset in tmp_df_datasets, 
#   tabulate all archived_on dates (for information)
#   find the latest (most recent) archived_on date before the pipeline_production_date
#   create a table containing a row with the latest archived_on date and count of the number of records for each dataset
  
# this will not run each time the Parameters notebook is run in another notebook - will only run if the toggle is switched to True
if(run_all_toggle):

  latest_archived_on = []
  lsoa_1st = []
  for index, row in tmp_df_datasets.iterrows():
    # initial  
    dataset = row['dataset']
    path = row['database'] + '.' + row['table']
    print(index, dataset, path); print()

    # point to table
    tmpd = spark.table(path)

    # tabulate all archived_on dates
    tmpt = tab(tmpd, 'archived_on')
    
    # extract latest (most recent) archived_on date before the pipeline_production_date
    tmpa = (
      tmpd
      .groupBy('archived_on')
      .agg(f.count(f.lit(1)).alias('n'))
      .withColumn('n', f.format_number('n', 0))
      .where(f.col('archived_on') <= pipeline_production_date)
      .orderBy(f.desc('archived_on'))
      .limit(1)
      .withColumn('dataset', f.lit(dataset))
      .select('dataset', 'archived_on', 'n')
    )
    
    # extract closest archived_on date that comes after study_start_date
    if(dataset=="gdppr"):
      tmpb = (
        tmpd
        .groupBy('archived_on')
        .agg(f.count(f.lit(1)).alias('n'))
        .withColumn('n', f.format_number('n', 0))
        .where(f.col('archived_on') >= study_start_date)
        .orderBy(f.asc('archived_on'))
        .limit(1)
        .withColumn('dataset', f.lit(dataset))
        .select('dataset', 'archived_on', 'n')
      )
      
      if(index == 0): lsoa_1st = tmpb
      else: lsoa_1st = lsoa_1st.unionByName(tmpb)
    
    # append results
    if(index == 0): latest_archived_on = tmpa
    else: latest_archived_on = latest_archived_on.unionByName(tmpa)
    print()
    


  # check
  print('Latest (most recent) archived_on date before pipeline_production_date')
  print(latest_archived_on.toPandas().to_string())
  print('\nClosest (1st) GDPPR archived_on date following study_start_date')
  print(lsoa_1st.toPandas().to_string())
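
The selection rule used in the loop above (the latest `archived_on` date on or before `pipeline_production_date`) can be sketched without Spark. This is an illustration only: the batch dates are made up and `latest_batch_on_or_before` is a hypothetical name. It relies on ISO `YYYY-MM-DD` strings sorting in chronological order, which is also why the Spark filter can compare `archived_on` against a date string.

```python
from typing import List, Optional


def latest_batch_on_or_before(archived_on_dates: List[str], cutoff: str) -> Optional[str]:
    """Return the most recent ISO date that is <= cutoff, or None if there is none."""
    # ISO 'YYYY-MM-DD' strings compare in chronological order
    eligible = [d for d in archived_on_dates if d <= cutoff]
    return max(eligible) if eligible else None


# made-up batch dates, for illustration only
batches = ['2023-03-31', '2023-04-27', '2023-06-21', '2023-08-31']
print(latest_batch_on_or_before(batches, '2023-08-15'))
```

Note that a batch archived exactly on the cutoff date is included, matching the `<=` comparison in the cell above.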

### 4.2.2 Check

In [0]:
# this will not run each time the Parameters notebook is run in another notebook - will only run if the toggle is switched to True
if(run_all_toggle):
  # check
  display(latest_archived_on)

### 4.2.3 Prepare

In [0]:
# prepare the tables to be saved

# this will not run each time the Parameters notebook is run in another notebook - will only run if the toggle is switched to True
if(run_all_toggle):
  
  # merge the datasets dataframe with the latest_archived_on
  tmp_df_datasets_sp = spark.createDataFrame(tmp_df_datasets) 
  parameters_df_datasets = merge(tmp_df_datasets_sp, latest_archived_on, ['dataset'], validate='1:1', assert_results=['both'], indicator=0).orderBy('index'); print()
  
  # check  
  print(parameters_df_datasets.toPandas().to_string())


### 4.2.4 Save

#### Adhoc

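Recent batches of HES APC, HES APC OTR and NACSA have had missing data.

For the time being, some `archived_on` dates are hard-coded here so that HES APC and HES APC OTR are taken as at the same `archived_on` date and the best available version of NACSA is used. As of Version 4 the HES issues are resolved (the HES overrides below are commented out), while NACSA and TAVI remain hard-coded.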

In [0]:
# hes_apc_all = spark.table(f'dars_nic_391419_j3w9t_collab.hes_apc_all_years_archive')
# display(hes_apc_all.select("archived_on").groupBy("archived_on").count())

In [0]:
# hes_apc_otr_all = spark.table(f'dars_nic_391419_j3w9t_collab.hes_apc_otr_all_years_archive')
# display(hes_apc_otr_all.select("archived_on").groupBy("archived_on").count())

In [0]:
# nacsa = spark.table(f'dars_nic_391419_j3w9t_collab.nicor_acs_combined_dars_nic_391419_j3w9t_archive')
# display(nacsa.select("archived_on").groupBy("archived_on").count())

In [0]:
# tavi = spark.table(f'dars_nic_391419_j3w9t_collab.nicor_tavi_dars_nic_391419_j3w9t_archive')
# display(tavi.select("archived_on").groupBy("archived_on").count())

HES APC and HES APC OTR will use "2023-04-27" (currently commented out) <br>
NICOR NACSA and TAVI will use "2023-03-31"

In [0]:
# hes_apc_count = (
#     spark.table(f'dars_nic_391419_j3w9t_collab.hes_apc_all_years_archive')
#                 .filter(f.col("archived_on")=="2023-04-27").count()
#     )

# hes_apc_otr_count = (
#     spark.table(f'dars_nic_391419_j3w9t_collab.hes_apc_otr_all_years_archive')
#                 .filter(f.col("archived_on")=="2023-04-27").count()
#     )

# row counts for the hard-coded batches - only needed when the toggle is switched to True (used in the next cell)
if(run_all_toggle):
    nacsa_count = (
        spark.table(f'dars_nic_391419_j3w9t_collab.nicor_acs_combined_dars_nic_391419_j3w9t_archive')
        .filter(f.col("archived_on")=="2023-03-31").count()
    )

    tavi_count = (
        spark.table(f'dars_nic_391419_j3w9t_collab.nicor_tavi_dars_nic_391419_j3w9t_archive')
        .filter(f.col("archived_on")=="2023-03-31").count()
    )

In [0]:
if(run_all_toggle):
    parameters_df_datasets = (
        parameters_df_datasets
        # .withColumn("archived_on", f.when(f.col("dataset")=="hes_apc","2023-04-27").otherwise(f.col("archived_on")))
        # .withColumn("archived_on", f.when(f.col("dataset")=="hes_apc_otr","2023-04-27").otherwise(f.col("archived_on")))
        .withColumn("archived_on", f.when(f.col("dataset")=="nacsa","2023-03-31").otherwise(f.col("archived_on")))
        .withColumn("archived_on", f.when(f.col("dataset")=="tavi","2023-03-31").otherwise(f.col("archived_on")))

        # .withColumn("n", f.when(f.col("dataset")=="hes_apc",hes_apc_count).otherwise(f.col("n")))
        # .withColumn("n", f.when(f.col("dataset")=="hes_apc_otr",hes_apc_otr_count).otherwise(f.col("n")))
        .withColumn("n", f.when(f.col("dataset")=="nacsa",nacsa_count).otherwise(f.col("n")))
        .withColumn("n", f.when(f.col("dataset")=="tavi",tavi_count).otherwise(f.col("n")))
        
        .withColumn("n", f.when(f.col("dataset").isin([
            #'hes_apc','hes_apc_otr',
            'nacsa','tavi']), f.format_number(f.col("n").cast("int"),0)).otherwise(f.col('n')))
        )
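
The hard-coded overrides above amount to replacing `archived_on` for a few named datasets and leaving the rest untouched. Below is a plain-Python sketch of the same idea, using made-up input rows and a hypothetical `apply_overrides` helper. In the notebook itself the row count `n` is also replaced for each overridden batch, which this sketch does not cover.

```python
from typing import Dict, List

# override dates taken from the notes above
OVERRIDES = {'nacsa': '2023-03-31', 'tavi': '2023-03-31'}


def apply_overrides(rows: List[Dict[str, str]], overrides: Dict[str, str]) -> List[Dict[str, str]]:
    """Return copies of rows with archived_on replaced for overridden datasets."""
    return [
        # keep the existing archived_on unless the dataset has an override
        {**row, 'archived_on': overrides.get(row['dataset'], row['archived_on'])}
        for row in rows
    ]


# made-up archived_on values, for illustration only
rows = [
    {'dataset': 'gdppr', 'archived_on': '2024-02-01'},
    {'dataset': 'nacsa', 'archived_on': '2024-02-01'},
]
for row in apply_overrides(rows, OVERRIDES):
    print(row['dataset'], row['archived_on'])
```

Keeping the overrides in a single mapping would make it easier to see at a glance which datasets deviate from the latest batch, although the chained `withColumn` calls above achieve the same result.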

In [0]:
# save the parameters_df_datasets table
# which we can simply import below when not running all and calling this notebook in subsequent notebooks

# this will not run each time the Parameters notebook is run in another notebook - will only run if the toggle is switched to True
if(run_all_toggle):
  save_table(df=parameters_df_datasets, out_name=f'{proj}_parameters_df_datasets', save_previous=True, data_base=dsa)

### 4.2.5 Import

In [0]:
# import the parameters_df_datasets table 
# convert to a Pandas dataframe and transform archived_on to a string (to conform to the input that the extract_batch_from_archive function is expecting)

spark.sql(f'REFRESH TABLE {dsa}.{proj}_parameters_df_datasets')
parameters_df_datasets = (
  spark.table(f'{dsa}.{proj}_parameters_df_datasets')
  .orderBy('index')
  .toPandas()
)
parameters_df_datasets['archived_on'] = parameters_df_datasets['archived_on'].astype(str)
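
The string conversion above matters because `extract_batch_from_archive` builds its log message by string concatenation. The sketch below shows the failure that the conversion avoids, assuming `archived_on` arrives from the saved table as a `datetime.date` object (an assumption; the actual type depends on how the table was saved).

```python
import datetime

# assumed type of archived_on before the astype(str) conversion
archived_on = datetime.date(2023, 3, 31)

try:
    # concatenating a date object to a string raises TypeError
    message = 'archived_on = ' + archived_on
except TypeError:
    # converting to a string first gives the ISO 'YYYY-MM-DD' form
    message = 'archived_on = ' + str(archived_on)

print(message)
```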

In [0]:
display(parameters_df_datasets)

## 4.3 Curated Data Paths

In [0]:
# -----------------------------------------------------------------------------
# These are paths to data tables curated in subsequent notebooks that may be
# needed in notebooks that follow the one in which they were curated
# -----------------------------------------------------------------------------

# note: the below is largely listed in order of appearance within the pipeline:

# temporary path for TAVI, as the live version is being used instead of an archived version for now
path_tavi = f'{db}.nicor_tavi_{db}'

# reference tables
path_ref_bhf_phenotypes  = 'bhf_cvd_covid_uk_byod.bhf_covid_uk_phenotypes_20210127'
path_ref_geog            = 'dss_corporate.ons_chd_geo_listings'
path_ref_imd             = 'dss_corporate.english_indices_of_dep_v02'
path_ref_gp_refset       = 'dss_corporate.gpdata_snomed_refset_full'
path_ref_gdppr_refset    = 'dss_corporate.gdppr_cluster_refset'
path_ref_icd10           = 'dss_corporate.icd10_group_chapter_v01'
path_ref_opcs4           = 'dss_corporate.opcs_codes_v02'
# path_ref_map_ctv3_snomed = 'dss_corporate.read_codes_map_ctv3_to_snomed'
# path_ref_ethnic_hes      = 'dss_corporate.hesf_ethnicity'
# path_ref_ethnic_gdppr    = 'dss_corporate.gdppr_ethnicity'

# curated tables
path_cur_hes_apc_long      = f'{dsa}.{proj}_cur_hes_apc_all_years_archive_long'
path_cur_hes_apc_op_long   = f'{dsa}.{proj}_cur_hes_apc_all_years_archive_op_long'
path_cur_deaths_long       = f'{dsa}.{proj}_cur_deaths_{db}_archive_long'
path_cur_deaths_sing       = f'{dsa}.{proj}_cur_deaths_{db}_archive_sing'
path_cur_lsoa_region       = f'{dsa}.{proj}_cur_lsoa_region_lookup'
path_cur_lsoa_imd          = f'{dsa}.{proj}_cur_lsoa_imd_lookup'
path_cur_lsoa              = f'{dsa}.{proj}_lsoa'

# path_cur_vacc_first        = f'{dsa}.{proj}_cur_vacc_first'
# path_cur_covid             = f'{dsa}.{proj}_cur_covid'

# temporary tables
path_tmp_skinny_unassembled             = f'{dsa}.{proj}_tmp_kpc_harmonised'
path_tmp_skinny_assembled               = f'{dsa}.{proj}_tmp_kpc_selected'
path_tmp_skinny                         = f'{dsa}.{proj}_tmp_skinny'

path_tmp_quality_assurance_hx_1st_wide  = f'{dsa}.{proj}_tmp_quality_assurance_hx_1st_wide'
path_tmp_quality_assurance_hx_1st       = f'{dsa}.{proj}_tmp_quality_assurance_hx_1st'
path_tmp_quality_assurance_qax          = f'{dsa}.{proj}_tmp_quality_assurance_qax'
path_tmp_quality_assurance              = f'{dsa}.{proj}_tmp_quality_assurance'

path_tmp_inc_exc_cohort                 = f'{dsa}.{proj}_tmp_inc_exc_cohort'
path_tmp_inc_exc_flow                   = f'{dsa}.{proj}_tmp_inc_exc_flow'

path_tmp_hx_af_hyp_cohort               = f'{dsa}.{proj}_tmp_hx_af_hyp_cohort'
path_tmp_hx_af_hyp_gdppr                = f'{dsa}.{proj}_tmp_hx_af_hyp_gdppr'
path_tmp_hx_af_hyp_hes_apc              = f'{dsa}.{proj}_tmp_hx_af_hyp_hes_apc'
path_tmp_hx_af_hyp                      = f'{dsa}.{proj}_tmp_hx_af_hyp'

path_tmp_hx_nonfatal                    = f'{dsa}.{proj}_tmp_hx_nonfatal'

path_tmp_inc_exc_2_cohort                 = f'{dsa}.{proj}_tmp_inc_exc_2_cohort'
path_tmp_inc_exc_2_flow                   = f'{dsa}.{proj}_tmp_inc_exc_2_flow'

# path_tmp_covariates_hes_apc             = f'{dsa}.{proj}_tmp_covariates_hes_apc'
# path_tmp_covariates_pmeds               = f'{dsa}.{proj}_tmp_covariates_pmeds'
# path_tmp_covariates_lsoa                = f'{dsa}.{proj}_tmp_covariates_lsoa'
# path_tmp_covariates_lsoa_2              = f'{dsa}.{proj}_tmp_covariates_lsoa_2'
# path_tmp_covariates_lsoa_3              = f'{dsa}.{proj}_tmp_covariates_lsoa_3'
# path_tmp_covariates_n_consultations     = f'{dsa}.{proj}_tmp_covariates_n_consultations'
# path_tmp_covariates_unique_bnf_chapters = f'{dsa}.{proj}_tmp_covariates_unique_bnf_chapters'
# path_tmp_covariates_hx_out_1st_wide     = f'{dsa}.{proj}_tmp_covariates_hx_out_1st_wide'
# path_tmp_covariates_hx_com_1st_wide     = f'{dsa}.{proj}_tmp_covariates_hx_com_1st_wide'

# out tables
path_out_codelist_quality_assurance      = f'{dsa}.{proj}_out_codelist_quality_assurance'
path_out_codelist_cvd                    = f'{dsa}.{proj}_out_codelist_cvd'
# path_out_codelist_comorbidity            = f'{dsa}.{proj}_out_codelist_comorbidity'
path_out_codelist_covid                  = f'{dsa}.{proj}_out_codelist_covid'
path_out_codelist_covariates             = f'{dsa}.{proj}_out_codelist_covariates'
path_out_codelist_covariates_markers     = f'{dsa}.{proj}_out_codelist_covariates_markers'
path_out_codelist_outcomes               = f'{dsa}.{proj}_out_codelist_outcomes'

path_out_cohort                          = f'{dsa}.{proj}_out_cohort'

# path_out_covariates                 = f'{dsa}.{proj}_out_covariates'
path_out_exposures                    = f'{dsa}.{proj}_out_exposures'
path_out_outcomes                     = f'{dsa}.{proj}_out_outcomes'
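
All project tables above follow one naming pattern: the `dsa` database, then the `proj` prefix, then a table suffix. The sketch below shows how such a path resolves, using the values set in section 4.1. `project_path` is a hypothetical helper for illustration; the notebook writes each f-string out explicitly.

```python
# values as set in section 4.1
dsa = 'dsa_391419_j3w9t_collab'
proj = 'ccu056'


def project_path(suffix: str) -> str:
    """Build '<database>.<project>_<suffix>' for a project table (hypothetical helper)."""
    return f'{dsa}.{proj}_{suffix}'


print(project_path('out_cohort'))
print(project_path('tmp_skinny'))
```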