# Curated Data - LSOA
 
**Description** This notebook creates the LSOA-based covariates: LSOA, and the region and Index of Multiple Deprivation (IMD) derived from it.

*Previous versions of this notebook used LSOA from only the first and latest GDPPR batches. The LSOA was taken from the first archived version of GDPPR (which has, up until this point, typically been the closest batch to baseline); where a person was not included in this first batch, their LSOA was extracted from the latest version of GDPPR. This was sufficient for persons who had only one LSOA across their full history of records, but it overlooks people who changed practice between the first and last batch used. For example, if a person was not included in batch 1 and also had multiple LSOAs across different GDPPR versions, the latest LSOA may not be the most appropriate one to use: an earlier batch may hold a different LSOA with a REPORTING_PERIOD_END_DATE closer to baseline. This updated version therefore employs a **staggered derivation**, in which the LSOA is obtained as at the date closest to baseline.*
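
The difference between taking the latest LSOA and taking the one closest to baseline can be illustrated with a minimal plain-Python sketch. The rows, the baseline date and the `closest_to_baseline` helper are all hypothetical and are not part of this pipeline, which works on Spark tables; ties between equally close dates are ignored here.

```python
from datetime import date

# Hypothetical LSOA history rows: (PERSON_ID, REPORTING_PERIOD_END_DATE, LSOA)
history = [
    ("p1", date(2020, 5, 18), "E01000001"),
    ("p1", date(2021, 6, 30), "E01000002"),
    ("p2", date(2020, 11, 30), "E01000003"),  # p2 is absent from the first batch
    ("p2", date(2022, 3, 31), "E01000004"),   # the latest LSOA for p2
]


def closest_to_baseline(rows, baseline):
    """Return {PERSON_ID: (RPED, LSOA)} for the row nearest in time to baseline."""
    best = {}
    for person_id, rped, lsoa in rows:
        gap = abs((rped - baseline).days)
        # keep the row with the smallest absolute gap seen so far
        if person_id not in best or gap < best[person_id][0]:
            best[person_id] = (gap, rped, lsoa)
    return {pid: (rped, lsoa) for pid, (gap, rped, lsoa) in best.items()}


baseline = date(2020, 11, 1)  # assumed study start date
selected = closest_to_baseline(history, baseline)
```

With these invented rows, a "latest batch" rule would assign `E01000004` to `p2`, whereas the closest-to-baseline rule selects `E01000003`.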


<br>**Full LSOA history**
<br>Derivation of an LSOA histories table. This table holds a person's **full LSOA history** as it exists **across all archived versions** of GDPPR. By accessing all archived versions of GDPPR (the full GDPPR table), a full history of a person's `LSOA` by `REPORTING_PERIOD_END_DATE` is captured. With access to every `LSOA` and its `REPORTING_PERIOD_END_DATE`, the `LSOA` can be verified throughout the full history rather than only at the latest archived version. This allows the most appropriate `LSOA` to be chosen for a given project study start date in part 2.<br>
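
As a rough illustration of what a full history table contains, the sketch below pools rows from several archived versions, drops rows with a missing person or LSOA, excludes versions archived after a production date, and keeps distinct combinations. It is plain Python on invented data; the `full_history` helper and the dates are assumptions for illustration only.

```python
from datetime import date

# Hypothetical rows pooled from archived versions:
# (archived_on, PERSON_ID, REPORTING_PERIOD_END_DATE, LSOA)
archive_rows = [
    (date(2021, 1, 1), "p1", date(2020, 5, 18), "E01000001"),
    (date(2022, 1, 1), "p1", date(2020, 5, 18), "E01000001"),  # repeated in a later version
    (date(2022, 1, 1), "p1", date(2021, 6, 30), "E01000002"),
    (date(2022, 1, 1), "p2", date(2021, 6, 30), None),         # missing LSOA, dropped
    (date(2023, 1, 1), "p1", date(2022, 6, 30), "E01000009"),  # archived too late, dropped
]


def full_history(rows, production_date):
    """Distinct (PERSON_ID, RPED, LSOA) rows from versions archived on or before production_date."""
    distinct_rows = {
        (person_id, rped, lsoa)
        for archived_on, person_id, rped, lsoa in rows
        if archived_on <= production_date
        and person_id is not None
        and lsoa is not None
    }
    return sorted(distinct_rows)  # ordered by person, then date, then LSOA


history = full_history(archive_rows, production_date=date(2022, 6, 1))
```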


**Please note** that the earliest `REPORTING_PERIOD_END_DATE` in GDPPR is 2020-05-18. If a study end date falls before this date, no LSOA will be found in GDPPR unless a selection group that looks beyond the study end date is applied.


 
<br>**Authors** Tom Bolton, Fionna Chalmers, Anna Stevenson (Health Data Science Team, BHF Data Science Centre)

**Reviewers** ⚠ UNREVIEWED

**Acknowledgements** Based on previous work for CCU003_05, CCU018_01 (Tom Bolton, John Nolan), earlier CCU002 sub-projects and subsequently CCU002_07-D7a-covariates_LSOA

**Notes**

**Data output**
- **`CCU056_gdppr_lsoa_rped`** : Full `LSOA` and `REPORTING_PERIOD_END_DATE` history across all versions of GDPPR for all individuals in the population
- **`CCU056_gdppr_lsoa_rped_collapsed`** : Collapsed `LSOA` and `REPORTING_PERIOD_END_DATE` such that the `REPORTING_PERIOD_END_DATE` closest to baseline is chosen for LSOA. This is one row per person except where a person has conflicting LSOAs at the closest-to-baseline `REPORTING_PERIOD_END_DATE`: these persons are included, with more than one row each, and can be queried using `lsoa_conflict` == 1. Persons with no LSOA conflict have `lsoa_conflict` == 0.
- **`CCU056_gdppr_lsoa_rped_conflicts_full`** : For those who have a conflict, as described above, their full `LSOA` and `REPORTING_PERIOD_END_DATE` history with region and IMD has been compiled for review.
- **`CCU056_lsoa`** : Collapsed `LSOA` and `REPORTING_PERIOD_END_DATE` (one row per person) such that the closest to baseline `REPORTING_PERIOD_END_DATE` is chosen for LSOA. Unlike `CCU056_gdppr_lsoa_rped_collapsed`, this dataset is strictly one row per person and conflicts here have been nulled. Note that an LSOA conflict does not imply that there will be a Region/IMD Decile/IMD Quintile conflict and in these cases these will be carried forward despite the LSOA conflict. See 'Final' section for more detail.
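
The relationship between the collapsed table and the final table can be sketched in plain Python as below. This is an illustration under stated assumptions, not the notebook's implementation: the rows, the baseline date and the `collapse` and `finalise` helpers are hypothetical, two equally distant dates are resolved arbitrarily, and the carrying forward of region and IMD is not modelled.

```python
from collections import defaultdict
from datetime import date

# Hypothetical history rows: (PERSON_ID, REPORTING_PERIOD_END_DATE, LSOA)
history = [
    ("p1", date(2020, 11, 30), "E01000001"),
    ("p1", date(2021, 6, 30), "E01000002"),
    ("p2", date(2020, 11, 30), "E01000003"),
    ("p2", date(2020, 11, 30), "E01000004"),  # same date, different LSOA: a conflict
]
baseline = date(2020, 11, 1)  # assumed study start date


def collapse(rows, baseline):
    """Keep each person's rows at the RPED closest to baseline and flag conflicts."""
    by_person = defaultdict(list)
    for person_id, rped, lsoa in rows:
        by_person[person_id].append((rped, lsoa))
    collapsed = []
    for person_id, records in by_person.items():
        closest = min(records, key=lambda r: abs((r[0] - baseline).days))[0]
        lsoas = sorted({lsoa for rped, lsoa in records if rped == closest})
        lsoa_conflict = int(len(lsoas) > 1)  # 1 when several LSOAs share the closest date
        collapsed.extend((person_id, closest, lsoa, lsoa_conflict) for lsoa in lsoas)
    return collapsed


def finalise(collapsed):
    """Strictly one row per person, with conflicting LSOAs set to None."""
    final = {}
    for person_id, rped, lsoa, lsoa_conflict in collapsed:
        final[person_id] = (rped, None if lsoa_conflict else lsoa)
    return final


collapsed = collapse(history, baseline)
final = finalise(collapsed)
```

Here `collapsed` holds one row for `p1` and two rows for `p2` (both with `lsoa_conflict` equal to 1), while `final` holds exactly one entry per person with the LSOA for `p2` nulled.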

# 0. Setup

In [0]:
spark.sql('CLEAR CACHE')
spark.conf.set('spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation', 'true')

In [0]:
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql import Window

from functools import reduce

import databricks.koalas as ks
import pandas as pd
import numpy as np

import re
import io
import datetime

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import dates as mdates
import seaborn as sns

print("Matplotlib version: ", matplotlib.__version__)
print("Seaborn version: ", sns.__version__)
_datetimenow = datetime.datetime.now() # .strftime("%Y%m%d")
print(f"_datetimenow:  {_datetimenow}")

In [0]:
%run "/Repos/shds/common/functions"

## 0.1 Custom Functions

In [0]:
def lsoa_full_history(gdppr, pipeline_production_date):
  """Distinct PERSON_ID, REPORTING_PERIOD_END_DATE and LSOA rows across all archived versions of GDPPR."""
  
  lsoa_rped_full = (
  # all archived versions of gdppr
  gdppr
    # filter on archived_on before select() drops the column, so that no version later than the pipeline production date is used
    .where(f.col('archived_on') <= pipeline_production_date)
    .select(f.col('NHS_NUMBER_DEID').alias('PERSON_ID'), f.col('REPORTING_PERIOD_END_DATE'), 'LSOA')
    .where((f.col("LSOA").isNotNull()) & (f.col("PERSON_ID").isNotNull()))
    .distinct()
    .orderBy(f.col("PERSON_ID"),f.col("REPORTING_PERIOD_END_DATE"))
  )
  
  return lsoa_rped_full

# 1. Parameters

In [0]:
%run "./CCU056-01-parameters"

# 2. Data

In [0]:
# -----------------------------------------------------------------------------
# GDPPR Paths, Dates & Data
# -----------------------------------------------------------------------------
parameters_df_gdppr = parameters_df_datasets.loc[parameters_df_datasets['dataset'] == 'gdppr']
gdppr_path = (parameters_df_gdppr['database'].values[0] + '.' + parameters_df_gdppr['table'].values[0])
gdppr_latest_archived_on = parameters_df_gdppr['archived_on'].values[0]  # positional access, consistent with the lines above

gdppr = spark.table(gdppr_path)

In [0]:
hes_apc = extract_batch_from_archive(parameters_df_datasets, 'hes_apc')
hes_ae  = extract_batch_from_archive(parameters_df_datasets, 'hes_ae')
hes_op  = extract_batch_from_archive(parameters_df_datasets, 'hes_op')

acs   = extract_batch_from_archive(parameters_df_datasets, 'nacsa')
tavi = extract_batch_from_archive(parameters_df_datasets, 'tavi')

# 3. Create Unassembled

## 3.1 GDPPR - Full LSOA history

In [0]:
# compile all distinct LSOA and REPORTING_PERIOD_END_DATEs for each individual for all GDPPR archived_on versions
# RPEDs are not restricted to those before study_start_date: a project's selection criteria may look slightly beyond baseline, and in most cases an LSOA dated after study_start_date has to be used because RPEDs only start in 2020-05

lsoa_rped_full = lsoa_full_history(gdppr, pipeline_production_date)
save_table(df=lsoa_rped_full, out_name=f'{proj}_gdppr_lsoa_rped', save_previous=True)

In [0]:
# read back in data
lsoa_rped_full = (
  spark.table(f'{dsa}.{proj}_gdppr_lsoa_rped')
  .orderBy("PERSON_ID","REPORTING_PERIOD_END_DATE","LSOA")
)

display(lsoa_rped_full)

## 3.2 HES & Audits
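
The harmonisation in this section maps each source's own date and LSOA columns onto a common `PERSON_ID`, `RECORD_SOURCE`, `RECORD_DATE`, `LSOA` layout before the sources are stacked. A minimal plain-Python sketch of that idea is given below. The column names are the ones used in this section; the input rows and the `harmonise` helper are hypothetical, and the actual processing is done on Spark DataFrames.

```python
# (date column, LSOA column) per source, as used in this section
SOURCE_COLUMNS = {
    "hes_apc": ("EPISTART", "LSOA01"),
    "hes_ae": ("ARRIVALDATE", "LSOA11"),
    "hes_op": ("APPTDATE", "LSOA11"),
    "nacsa": ("DATE_AND_TIME_OF_OPERATION", "LSOA_OF_RESIDENCE"),
    "tavi": ("7_01_DATE_AND_TIME_OF_OPERATION", "LSOA_OF_RESIDENCE"),
}


def harmonise(source, rows):
    """Map one source's rows onto the common layout, dropping rows with a missing person or LSOA."""
    date_col, lsoa_col = SOURCE_COLUMNS[source]
    return [
        {
            "PERSON_ID": row["PERSON_ID_DEID"],
            "RECORD_SOURCE": source,
            "RECORD_DATE": row[date_col],
            "LSOA": row[lsoa_col],
        }
        for row in rows
        if row["PERSON_ID_DEID"] is not None and row[lsoa_col] is not None
    ]


# Hypothetical input rows, invented for illustration
hes_apc_rows = [
    {"PERSON_ID_DEID": "p1", "EPISTART": "2019-03-02", "LSOA01": "E01000001"},
    {"PERSON_ID_DEID": "p2", "EPISTART": "2019-04-10", "LSOA01": None},  # dropped
]
tavi_rows = [
    {
        "PERSON_ID_DEID": "p1",
        "7_01_DATE_AND_TIME_OF_OPERATION": "2021-01-05 09:00:00",
        "LSOA_OF_RESIDENCE": "E01000002",
    },
]

harmonised = harmonise("hes_apc", hes_apc_rows) + harmonise("tavi", tavi_rows)
```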

In [0]:
def lsoa_harmonise(hes_apc, hes_ae, hes_op, acs, tavi, pipeline_production_date):

  # ------------------------------------------------------------------------------------
  # _hes_apc
  # ------------------------------------------------------------------------------------
  _hes_apc = (
    hes_apc
    .select('archived_on'
            , f.col('PERSON_ID_DEID').alias('PERSON_ID')
            , f.col('EPISTART').alias('RECORD_DATE')
            , f.col('LSOA01').alias('LSOA')
            )
    .distinct()
    .withColumn('RECORD_SOURCE', f.lit('hes_apc'))
  )

  # ------------------------------------------------------------------------------------
  # _hes_ae  
  # ------------------------------------------------------------------------------------
  _hes_ae = (
    hes_ae
    .select('archived_on'
            , f.col('PERSON_ID_DEID').alias('PERSON_ID')
            , f.col('ARRIVALDATE').alias('RECORD_DATE')
            , f.col('LSOA11').alias('LSOA')
            )
    .distinct()
    .withColumn('RECORD_SOURCE', f.lit('hes_ae'))
  )
    
  # ------------------------------------------------------------------------------------
  # _hes_op
  # ------------------------------------------------------------------------------------
  _hes_op = (
    hes_op
    .select('archived_on'
            , f.col('PERSON_ID_DEID').alias('PERSON_ID')
            , f.col('APPTDATE').alias('RECORD_DATE')
            , f.col('LSOA11').alias('LSOA')
            )
    .distinct()
    .withColumn('RECORD_SOURCE', f.lit('hes_op'))
  )

  # ------------------------------------------------------------------------------------
  # Audits
  # ------------------------------------------------------------------------------------
  
  _acs = (acs
        .select('archived_on'
                , f.col('PERSON_ID_DEID').alias('PERSON_ID') 
                , f.col('DATE_AND_TIME_OF_OPERATION').alias('RECORD_DATE')
                , f.col('LSOA_OF_RESIDENCE').alias('LSOA')
                )
        .distinct()
        .withColumn('RECORD_SOURCE', f.lit('nacsa'))
        )
  
  _tavi = (tavi
        .select('archived_on'
                , f.col('PERSON_ID_DEID').alias('PERSON_ID') 
                , f.col('7_01_DATE_AND_TIME_OF_OPERATION').alias('RECORD_DATE')
                , f.col('LSOA_OF_RESIDENCE').alias('LSOA')
                )
        .distinct()
        .withColumn('RECORD_SOURCE', f.lit('tavi'))
        )
  
  # ------------------------------------------------------------------------------------    
  # _harmonised
  # ------------------------------------------------------------------------------------
  # union all
  _harmonised = (_hes_apc
                  .unionByName(_hes_ae)
                  .unionByName(_hes_op)
                  .unionByName(_acs)
                  .unionByName(_tavi)
                  .select('archived_on', 'PERSON_ID', 'RECORD_SOURCE', 'RECORD_DATE', 
                          'LSOA')
                  .where((f.col("LSOA").isNotNull()) & (f.col("PERSON_ID").isNotNull()))
                  .where(f.col('archived_on') <= pipeline_production_date) # ensure not working with a version > pipeline production date
                  .orderBy(f.col("PERSON_ID"),f.col("RECORD_DATE"))
                  )
     
  return _harmonised

In [0]:
lsoa_unassembled = lsoa_harmonise(hes_apc, hes_ae, hes_op, acs, tavi, pipeline_production_date)

In [0]:
display(lsoa_unassembled)

## 3.3 Combine
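
The combine step stacks the GDPPR history with the HES and audit records, reduces timestamps to dates, and adds `LSOA_1`, the first character of the LSOA code (for example `E` for codes in England and `W` for codes in Wales). The sketch below shows the same shape of operation in plain Python on invented rows; the `to_date` helper and the data are assumptions for illustration only.

```python
from datetime import date, datetime


def to_date(value):
    """Reduce a datetime to a date; plain dates pass through unchanged."""
    # datetime is a subclass of date, so it must be checked explicitly
    return value.date() if isinstance(value, datetime) else value


# Hypothetical GDPPR history rows
gdppr_rows = [
    {"PERSON_ID": "p1", "REPORTING_PERIOD_END_DATE": date(2020, 5, 18), "LSOA": "E01000001"},
]
# Hypothetical harmonised HES/audit rows
other_rows = [
    {
        "PERSON_ID": "p1",
        "RECORD_SOURCE": "hes_apc",
        "RECORD_DATE": datetime(2019, 3, 2, 14, 30),
        "LSOA": "W01000005",
    },
]

combined = [
    {
        "PERSON_ID": row["PERSON_ID"],
        "RECORD_SOURCE": "gdppr",
        "RECORD_DATE": row["REPORTING_PERIOD_END_DATE"],
        "LSOA": row["LSOA"],
    }
    for row in gdppr_rows
] + [{**row, "RECORD_DATE": to_date(row["RECORD_DATE"])} for row in other_rows]

for row in combined:
    row["LSOA_1"] = row["LSOA"][:1]  # first character of the LSOA code
```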

In [0]:
all_lsoa_unassembled = (
  lsoa_rped_full
  .withColumn('RECORD_SOURCE', f.lit("gdppr"))
  .withColumnRenamed("REPORTING_PERIOD_END_DATE", "RECORD_DATE")
  # HES and audit RECORD_DATE values are converted with to_date before the union
  .unionByName(lsoa_unassembled.drop('archived_on').withColumn("RECORD_DATE", f.to_date(f.col("RECORD_DATE"))))
  .withColumn('LSOA_1', f.substring(f.col('LSOA'), 1, 1))  # first character of the LSOA code
)

save_table(df=all_lsoa_unassembled, out_name=f'{proj}_tmp_all_cases_lsoa_unassembled', save_previous=True, data_base=dsa)

In [0]:
all_lsoa_unassembled = spark.table(f'{dsa}.{proj}_tmp_all_cases_lsoa_unassembled')

In [0]:
display(all_lsoa_unassembled)