# Curated Data - LSOA
 
**Description** This notebook creates the collapsed (selected) version of LSOA.

<br>The collapsed version has one row per person, holding the most appropriate LSOA for the study.

For this project we select the LSOA **as close as possible to the operation date**, regardless of datasource.

In case of ties, we prioritise datasources as follows: HES APC, HES Outpatients and HES A&E (equal priority), GDPPR, GDPPR SNOMED, NACSA/TAVI.

Remaining LSOA conflicts (different `LSOA` values which have the same `RECORD_DATE` and the same data source) will be highlighted.

**Please note** that for GDPPR the earliest `REPORTING_PERIOD_END_DATE` is 2020-05-18, so GDPPR will not be a prevalent source here.


 
<br>**Authors** Tom Bolton, Fionna Chalmers, Anna Stevenson (Health Data Science Team, BHF Data Science Centre)

**Reviewers** ⚠ UNREVIEWED

**Acknowledgements** Based on previous work for CCU003_05, CCU018_01 (Tom Bolton, John Nolan), earlier CCU002 sub-projects and subsequently CCU002_07-D7a-covariates_LSOA

**Notes**

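**Data output** `{proj}_tmp_all_cases_lsoa_selected` (saved to the `dsa` database; one row per person with `LSOA`, `_date_LSOA`, `_source_LSOA` and `_tie_LSOA`)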



# 0. Setup

In [0]:
spark.sql('CLEAR CACHE')
spark.conf.set('spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation', 'true')

In [0]:
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql import Window

from functools import reduce

import databricks.koalas as ks
import pandas as pd
import numpy as np

import re
import io
import datetime

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import dates as mdates
import seaborn as sns

print("Matplotlib version: ", matplotlib.__version__)
print("Seaborn version: ", sns.__version__)
_datetimenow = datetime.datetime.now() # .strftime("%Y%m%d")
print(f"_datetimenow:  {_datetimenow}")

In [0]:
%run "/Repos/shds/common/functions"

## 0.1 Custom Functions

In [0]:
def lsoa_full_history(gdppr, pipeline_production_date):
  """Distinct PERSON_ID / REPORTING_PERIOD_END_DATE / LSOA rows from all archived versions of gdppr."""

  lsoa_rped_full = (
  # all archived versions of gdppr
  gdppr
    # filter on archived_on before the select, because the select drops that column
    .where(f.col('archived_on') <= pipeline_production_date) # ensure not working with a version > pipeline production date
    .select(f.col('NHS_NUMBER_DEID').alias('PERSON_ID'), f.col('REPORTING_PERIOD_END_DATE'), 'LSOA')
    .where((f.col("LSOA").isNotNull()) & (f.col("PERSON_ID").isNotNull()))
    .distinct()
    .orderBy(f.col("PERSON_ID"), f.col("REPORTING_PERIOD_END_DATE"))
  )

  return lsoa_rped_full

# 1. Parameters

In [0]:
%run "./CCU056-01-parameters"

# 2. Data

In [0]:
# Main Cohort - needed for operation dates for LSOA selection
main_cohort = spark.table(f'{dsa}.ccu056_tmp_main_cohort_final')

In [0]:
all_lsoa_unassembled = spark.table(f'{dsa}.{proj}_tmp_all_cases_lsoa_unassembled')

# -----------------------------------------------------------------------------
# LSOA Curated Data
# -----------------------------------------------------------------------------
lsoa_region = spark.table(path_cur_lsoa_region)
lsoa_imd    = spark.table(path_cur_lsoa_imd)

# Prepare
lsoa_region = (lsoa_region.select(f.col('lsoa_code').alias('LSOA'), 'lsoa_name', f.col('region_name').alias('region')))

In [0]:
# display(all_lsoa_unassembled.filter(f.col("RECORD_SOURCE")=="hes_apc"))

In [0]:
# display(spark.table(f'{dbc}.hes_apc_all_years_archive').filter(f.col("PERSON_ID_DEID")=="TIHQ7YGMZ4ATP77"))

# 3. Add Region & IMD

In [0]:

all_lsoa_unassembled = (
all_lsoa_unassembled
  .join(lsoa_region, on=["LSOA"], how="left")
  .withColumn('region',
              f.when(f.col('LSOA_1') == 'W', 'Wales')
              .when(f.col('LSOA_1') == 'S', 'Scotland')
              .otherwise(f.col('region'))
             )
  .join(lsoa_imd, on=["LSOA"], how="left")
  
)
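The region override in the cell above can be illustrated in plain Python, without Spark. This is a minimal sketch, assuming `LSOA_1` holds the first character of the LSOA code (the notebook does not show how `LSOA_1` is built). The function `derive_region`, the lookup dictionary and the LSOA codes are hypothetical stand-ins used only for illustration.

```python
from typing import Dict, Optional


def derive_region(lsoa: Optional[str], region_lookup: Dict[str, str]) -> Optional[str]:
    """Return the region for an LSOA code, mirroring the when/otherwise chain."""
    if lsoa is None:
        return None
    prefix = lsoa[0]  # assumption: LSOA_1 is the first character of LSOA
    if prefix == "W":
        return "Wales"
    if prefix == "S":
        return "Scotland"
    # otherwise fall back to the joined lookup value (None if the join found nothing)
    return region_lookup.get(lsoa)


# Illustrative lookup, standing in for the curated lsoa_region table
region_lookup = {"E01000001": "London"}

welsh = derive_region("W01000001", region_lookup)
english = derive_region("E01000001", region_lookup)
unmatched = derive_region("E09999999", region_lookup)
```

Codes that start with neither `W` nor `S` and have no match in the lookup keep a null region, as they would after the left join.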

In [0]:
# display(all_lsoa_unassembled)

# 4. Select LSOA

The final LSOA version deals with conflicts as follows:

**A.** If there is an LSOA conflict, resulting in a Region conflict, resulting in an IMD Deciles conflict resulting in an IMD Quintiles conflict then `LSOA`, `LSOA_1`, `lsoa_name`, `region`, `IMD_2019_DECILES`, `IMD_2019_QUINTILES` will **all** be nulled.

<br>**B.** If there is an LSOA conflict, resulting in a Region conflict, resulting in an IMD Deciles conflict but this time does not result in an IMD Quintiles conflict (e.g. IMD Decile moves from 1 to 2 but this means the person remains in IMD Quintile 1) then `LSOA`, `LSOA_1`, `lsoa_name`, `region`, `IMD_2019_DECILES` will be nulled whilst `IMD_2019_QUINTILES` will be carried forward.

<br>**C.** If there is an LSOA conflict resulting in a Region conflict but no IMD Deciles (and thus no IMD Quintiles) conflict then `LSOA`, `LSOA_1`, `lsoa_name`, `region` will be nulled whilst `IMD_2019_DECILES`, `IMD_2019_QUINTILES` are carried forward.

<br>**D.** If there is an LSOA conflict that does not result in a Region conflict but does result in an IMD Deciles (and thus IMD Quintiles) conflict then `LSOA`, `LSOA_1`, `lsoa_name`, `IMD_2019_DECILES`, `IMD_2019_QUINTILES` will be nulled whilst `region` is carried forward.

<br>**E.** If there is an LSOA conflict that does not result in a Region conflict but does result in an IMD Deciles conflict, and this time does not result in an IMD Quintiles conflict (e.g. IMD Decile moves from 1 to 2 but this means the person remains in IMD Quintile 1) then `LSOA`, `LSOA_1`, `lsoa_name`, `IMD_2019_DECILES` will be nulled whilst `region` and `IMD_2019_QUINTILES` will be carried forward.

<br>**F.** If there is an LSOA conflict that does **not** result in a Region conflict or an IMD Deciles (and thus IMD Quintiles) conflict then only `LSOA`, `LSOA_1`, `lsoa_name` will be nulled whilst `region`, `IMD_2019_DECILES`, `IMD_2019_QUINTILES` are carried forward.

This ensures that each person has exactly one row.
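Rules A to F follow one principle: the LSOA-level fields are nulled whenever the tied rows disagree on `LSOA`, and each derived field is carried forward only if every tied row agrees on it. The plain-Python sketch below illustrates that principle for one person. It is not the notebook's implementation; `collapse_tied_rows` and the example values are hypothetical.

```python
from typing import Any, Dict, List

LSOA_FIELDS = ("LSOA", "LSOA_1", "lsoa_name")
DERIVED_FIELDS = ("region", "IMD_2019_DECILES", "IMD_2019_QUINTILES")


def collapse_tied_rows(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Collapse the tied candidate rows of one person into a single row."""
    out: Dict[str, Any] = {}

    # An LSOA conflict nulls all three LSOA-level fields together
    lsoa_conflict = len({row["LSOA"] for row in rows}) > 1
    for field in LSOA_FIELDS:
        out[field] = None if lsoa_conflict else rows[0][field]

    # Each derived field is judged on its own: keep it only if all rows agree
    for field in DERIVED_FIELDS:
        distinct_values = {row[field] for row in rows}
        out[field] = rows[0][field] if len(distinct_values) == 1 else None

    return out


# Case E: same region, deciles 1 and 2, both in quintile 1 (illustrative values)
tied_rows = [
    {"LSOA": "E01000001", "LSOA_1": "E", "lsoa_name": "Area A", "region": "London",
     "IMD_2019_DECILES": 1, "IMD_2019_QUINTILES": 1},
    {"LSOA": "E01000002", "LSOA_1": "E", "lsoa_name": "Area B", "region": "London",
     "IMD_2019_DECILES": 2, "IMD_2019_QUINTILES": 1},
]
collapsed = collapse_tied_rows(tied_rows)
```

In this example `LSOA`, `LSOA_1`, `lsoa_name` and `IMD_2019_DECILES` come back as `None`, while `region` and `IMD_2019_QUINTILES` are carried forward, matching case E.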

In [0]:
# in case of ties when choosing the record closest to the operation date, apply data source priority groups
all_lsoa_unassembled = (all_lsoa_unassembled
        .withColumn('RECORD_SOURCE_group_final',
                    f.when(f.col('RECORD_SOURCE') == 'nacsa', 5)
                    .when(f.col('RECORD_SOURCE') == 'tavi', 5)
                    .when(f.col('RECORD_SOURCE') == 'gdppr', 3)
                    .when(f.col('RECORD_SOURCE') == 'gdppr_snomed', 4)
                    .when(f.col('RECORD_SOURCE') == 'hes_apc', 1)
                    .when(f.col('RECORD_SOURCE') == 'hes_op', 2)
                    .when(f.col('RECORD_SOURCE') == 'hes_ae', 2)
                    )
)

In [0]:
# Join on Operation Dates and find date difference
all_lsoa_unassembled = (main_cohort.join(all_lsoa_unassembled,on="PERSON_ID",how="left"))

all_lsoa_unassembled = (all_lsoa_unassembled
.withColumn('OPERATION_DATE',f.date_format(f.col("OPERATION_DATE"), "yyyy-MM-dd"))
.withColumn('DATE_DIFF', f.abs(f.datediff(f.col("RECORD_DATE"), f.col("OPERATION_DATE"))))
)
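`DATE_DIFF` is an absolute difference in days, so a record five days before the operation is treated as equally close as one five days after it. A plain-Python sketch of the same calculation, with a hypothetical helper `date_diff_days` and made-up dates:

```python
from datetime import date
from typing import Optional


def date_diff_days(record_date: Optional[date], operation_date: Optional[date]) -> Optional[int]:
    """Absolute number of days between a record and the operation date."""
    if record_date is None or operation_date is None:
        # a missing date yields a missing difference, as with Spark's datediff
        return None
    return abs((record_date - operation_date).days)


operation = date(2021, 3, 6)
before = date_diff_days(date(2021, 3, 1), operation)   # record 5 days before
after = date_diff_days(date(2021, 3, 11), operation)   # record 5 days after
missing = date_diff_days(None, operation)              # person with no LSOA record
```

People in the main cohort without any LSOA record keep a null `DATE_DIFF` after the left join.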

In [0]:
display(all_lsoa_unassembled.groupBy('RECORD_SOURCE').count())

In [0]:
# define windows for row numbers
_win_rownum_LSOA = (
    Window
    .partitionBy('PERSON_ID')
    .orderBy(['DATE_DIFF', 'RECORD_SOURCE_group_final'])) #prioritising DATE_DIFF first then datasource after

all_lsoa_unassembled = (all_lsoa_unassembled
    .withColumn('_rownum_LSOA', f.row_number().over(_win_rownum_LSOA))
    )
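The ordering used by the window can be sketched in plain Python for one person's records: sort by `DATE_DIFF` first, then by source priority group. The priority numbers are taken from the `RECORD_SOURCE_group_final` cell above; `select_first_row` and the example records are hypothetical.

```python
from typing import Any, Dict, List

# Priority groups as assigned in RECORD_SOURCE_group_final (lower wins)
SOURCE_PRIORITY = {
    "hes_apc": 1, "hes_op": 2, "hes_ae": 2,
    "gdppr": 3, "gdppr_snomed": 4, "nacsa": 5, "tavi": 5,
}


def select_first_row(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Pick the record with the smallest DATE_DIFF, then the best source priority."""
    return min(
        records,
        key=lambda rec: (rec["DATE_DIFF"], SOURCE_PRIORITY[rec["RECORD_SOURCE"]]),
    )


records = [
    {"RECORD_SOURCE": "gdppr", "DATE_DIFF": 5, "LSOA": "E01000001"},
    {"RECORD_SOURCE": "hes_apc", "DATE_DIFF": 5, "LSOA": "E01000002"},
    {"RECORD_SOURCE": "nacsa", "DATE_DIFF": 4, "LSOA": "E01000003"},
]

closest = select_first_row(records)          # nacsa: closest date beats source priority
tie_broken = select_first_row(records[:2])   # hes_apc: same DATE_DIFF, better priority
```

When two records share both `DATE_DIFF` and priority group, Spark's `row_number` picks one of them arbitrarily, which is why the notebook also computes a separate tie flag.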

In [0]:
# display(all_lsoa_unassembled.orderBy("PERSON_ID","DATE_DIFF"))

In [0]:
varlist = ['LSOA']
  
for ind, var in enumerate(varlist):
    record_source = 'RECORD_SOURCE_group_final'
    # define window for tied records
    _win_tie = (Window
      .partitionBy('PERSON_ID')
      .orderBy('DATE_DIFF', record_source)
      )
      
    # count distinct values of var (including null) within tied records
    _tie = (
      all_lsoa_unassembled
      .withColumn(f'_tie_{var}', f.dense_rank().over(_win_tie))
      .where(f.col(f'_tie_{var}') == 1)
      .groupBy('PERSON_ID')
      .agg(
        f.countDistinct(f.col(f'{var}')).alias(f'_n_distinct_{var}')
        , f.countDistinct(f.when(f.col(f'{var}').isNull(), 1)).alias(f'_null_{var}')
      )
      .withColumn(f'_tie_{var}', f.when((f.col(f'_n_distinct_{var}') + f.col(f'_null_{var}')) > 1, 1).otherwise(0))
      .select('PERSON_ID', f'_tie_{var}'))
  
    if ind == 0:
        _tmp_ties = _tie
    else:
        _tmp_ties = _tmp_ties.join(_tie, on=['PERSON_ID'], how='outer')



# take information from the first row identified above
_tmp_selected = {}
for var in varlist:
    _tmp = (
      all_lsoa_unassembled
      .select('PERSON_ID', 'RECORD_DATE', 'RECORD_SOURCE', f'{var}', f'_rownum_{var}')
      .where(f.col(f'_rownum_{var}') == 1)
      .withColumnRenamed('RECORD_DATE', f'_date_{var}')
      .withColumnRenamed('RECORD_SOURCE', f'_source_{var}')
      .select('PERSON_ID', f'{var}', f'_date_{var}', f'_source_{var}'))
    _tmp_selected[f'{var}'] = _tmp

_selected = (
    _tmp_selected['LSOA']
    .join(_tmp_ties, on=['PERSON_ID'], how='outer')
    .select('PERSON_ID', 'LSOA'
            , '_date_LSOA', '_source_LSOA', '_tie_LSOA'))
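The `_tie_LSOA` flag can be read as follows: among the records that share the best (`DATE_DIFF`, priority group) pair, count the distinct non-null `LSOA` values, add one if any of them is null, and flag a tie when the total exceeds one. A plain-Python sketch for one person, assuming `DATE_DIFF` is non-null; `tie_flag` and the `PRIORITY` key (standing in for `RECORD_SOURCE_group_final`) are hypothetical names.

```python
from typing import Any, Dict, List


def tie_flag(records: List[Dict[str, Any]]) -> int:
    """Return 1 if the top-ranked records disagree on LSOA, else 0."""
    # Equivalent of dense_rank() == 1: all records sharing the best ordering key
    best_key = min((rec["DATE_DIFF"], rec["PRIORITY"]) for rec in records)
    top = [rec for rec in records if (rec["DATE_DIFF"], rec["PRIORITY"]) == best_key]

    # countDistinct ignores nulls, so nulls are counted separately
    n_distinct = len({rec["LSOA"] for rec in top if rec["LSOA"] is not None})
    has_null = int(any(rec["LSOA"] is None for rec in top))

    return 1 if n_distinct + has_null > 1 else 0


conflicting = [
    {"DATE_DIFF": 0, "PRIORITY": 1, "LSOA": "E01000001"},
    {"DATE_DIFF": 0, "PRIORITY": 1, "LSOA": "E01000002"},
]
agreeing = [
    {"DATE_DIFF": 0, "PRIORITY": 1, "LSOA": "E01000001"},
    {"DATE_DIFF": 0, "PRIORITY": 1, "LSOA": "E01000001"},
    {"DATE_DIFF": 3, "PRIORITY": 1, "LSOA": "E01000002"},  # lower-ranked, ignored
]
with_null = [
    {"DATE_DIFF": 0, "PRIORITY": 1, "LSOA": "E01000001"},
    {"DATE_DIFF": 0, "PRIORITY": 1, "LSOA": None},
]
```

A differing `LSOA` on a lower-ranked record does not raise the flag; only disagreement among the top-ranked records does.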

In [0]:
display(_selected)

In [0]:
save_table(df=_selected, out_name=f'{proj}_tmp_all_cases_lsoa_selected', save_previous=True, data_base=dsa)

In [0]:
# count ties
display(_selected.groupBy("_tie_LSOA").count())