# UK Biobank RAP Cohort Extraction Tutorial


| Attrition Table |
|:--------|
| 1. Total UK Biobank participants | 
| 2. First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) |
| 3. First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement  | 
| 4. First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old |
| 5. First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old with hospital care data span from index date |
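
The cumulative logic behind an attrition table like the one above can be sketched in plain Python. This is only an illustration: the records, the key names (`enrol`, `first_ascvd`, `lpa`, `age_at_index`) and the strict "after enrolment" comparison are assumptions, and the hospital-data step (row 5) is left out.

```python
from datetime import date

# Hypothetical participant records; key names are illustrative only.
participants = [
    {"eid": 1, "enrol": date(2008, 5, 1), "first_ascvd": date(2012, 3, 1), "lpa": 35.0, "age_at_index": 61},
    {"eid": 2, "enrol": date(2009, 1, 10), "first_ascvd": date(2007, 6, 1), "lpa": 12.0, "age_at_index": 55},
    {"eid": 3, "enrol": date(2010, 2, 2), "first_ascvd": date(2015, 9, 9), "lpa": None, "age_at_index": 70},
    {"eid": 4, "enrol": date(2007, 7, 7), "first_ascvd": None, "lpa": 80.0, "age_at_index": None},
]

# Each step is (label, predicate). Predicates are applied cumulatively,
# so a later step only sees records that passed every earlier step.
steps = [
    ("Total participants", lambda p: True),
    ("Incident ASCVD", lambda p: p["first_ascvd"] is not None and p["first_ascvd"] > p["enrol"]),
    ("Valid Lp(a)", lambda p: p["lpa"] is not None),
    ("Aged >= 40 at index", lambda p: p["age_at_index"] >= 40),
]

def attrition(records, steps):
    """Return (label, remaining count) after each cumulative filter."""
    rows = []
    remaining = list(records)
    for label, keep in steps:
        remaining = [r for r in remaining if keep(r)]
        rows.append((label, len(remaining)))
    return rows

attrition_rows = attrition(participants, steps)
for label, n in attrition_rows:
    print(f"{label}: {n}")
```

Keeping each criterion as a separate predicate makes it easy to reorder or add steps without recomputing the counts by hand.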


# UK Biobank Data on the Research Analysis Platform
### How the Data is Organized
#### EIDs and Data-fields

UK Biobank contains data collected from approximately 500,000 volunteer participants.  Within an access application, each participant is identified by a unique, 7-digit number, or EID. 
Note that each access application receives a different set of randomized EIDs, unique to the application. This EID randomization process - also known as "pseudonymization" - is managed by UK Biobank and is automatically applied to the data by the Research Analysis Platform. 
All data in the UK Biobank resource are organized into data-fields. Your access application is approved for a precise subset of those data-fields.

The [UK Biobank Showcase](https://biobank.ctsu.ox.ac.uk/showcase/) provides an in-depth look into the types of data stored in the UK Biobank, how it's collected, and how it's organized.  You can find more information about data-fields, broken down by type, on the [UK Biobank Field Listing page](https://biobank.ctsu.ox.ac.uk/crystal/list.cgi).

### Project Data
When you create a project on the UK Biobank Research Analysis Platform, the system dispenses the data  corresponding to the data-fields listed in the access application associated with the project.

#### Database and Dataset
Tabular data-fields and linked health data are stored in a SQL database. 

In [1]:
# Import packages needed to access Spark SQL Database in JupyterLab
import pyspark
import dxpy
import dxdata
from pyspark.sql import functions as f
from pyspark.sql import Window
from pyspark.sql.types import *
import itertools
import plotly.express as px
import pandas as pd
pd.DataFrame.iteritems = pd.DataFrame.items

#### Instantiating the Spark Context
Having created your notebook in the project, you can populate your first cells as shown below. It is good practice to instantiate your Spark context at the very beginning of your analyses.

In [2]:
# Spark initialization (Done only once; do not rerun this cell unless you select Kernel -> Restart kernel).
# Default initialization, kept for reference:
#sc = pyspark.SparkContext()
#spark = pyspark.sql.SparkSession(sc)

# Raise the Kryo serialization buffer limit (the value is read as MiB when no unit is given)
config = pyspark.SparkConf().setAll([('spark.kryoserializer.buffer.max', '2000')])
sc = pyspark.SparkContext(conf=config)
spark = pyspark.sql.SparkSession(sc)

The SQL database is located on the root folder of your project, and is typically named in accord with this pattern:

app\<APPLICATION-ID\>\_\<CREATION-TIME\> (e.g. app12345_20210101123456)

To improve the reproducibility of your notebooks, and ensure they are portable across projects, it is better not to hardcode any database or dataset names. Instead, you can use the following code to automatically discover the database and dataset:

In [3]:
# Automatically discover dispensed database name and dataset id
dispensed_database_name = dxpy.find_one_data_object(classname="database", name="app*", folder="/", name_mode="glob", describe=True)["describe"]["name"]
dispensed_dataset_id = dxpy.find_one_data_object(typename="Dataset", name="app*.dataset", folder="/", name_mode="glob")["id"]

In [4]:
# This will give the name of the Database found in the root project folder
dispensed_database_name

'app59456_20240126144654'

In [5]:
# Corresponding Dataset ID
dispensed_dataset_id

'record-GfkzVP8JZy7YFybPJ3xqGg6B'
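
If you want to check that a discovered database name follows the app\<APPLICATION-ID\>\_\<CREATION-TIME\> pattern, a small parser can help. The sketch below is not part of the platform tooling; `parse_database_name` is a hypothetical helper, and it assumes the 14-digit suffix is a timestamp in `YYYYMMDDhhmmss` form.

```python
import re
from datetime import datetime

def parse_database_name(name):
    """Split 'app<ID>_<TIME>' into (application id, creation datetime).

    Assumption: <TIME> is 14 digits in YYYYMMDDhhmmss order.
    """
    match = re.fullmatch(r"app(\d+)_(\d{14})", name)
    if match is None:
        raise ValueError(f"unexpected database name: {name!r}")
    application_id = int(match.group(1))
    created = datetime.strptime(match.group(2), "%Y%m%d%H%M%S")
    return application_id, created

app_id, created = parse_database_name("app59456_20240126144654")
print(app_id, created.isoformat())
```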

#### Accessing the Database Directly Using SQL
To evaluate SQL, you can use the spark.sql("...") function, which returns a Spark DataFrame. 

In [6]:
# List the tables in the database:
spark.sql("USE " + dispensed_database_name)
spark.sql("SHOW TABLES").show(150, truncate=False)

+-----------------------+---------------------------------------+-----------+
|namespace              |tableName                              |isTemporary|
+-----------------------+---------------------------------------+-----------+
|app59456_20240126144654|allele_23146                           |false      |
|app59456_20240126144654|allele_23148                           |false      |
|app59456_20240126144654|allele_23157                           |false      |
|app59456_20240126144654|annotation_23146                       |false      |
|app59456_20240126144654|annotation_23148                       |false      |
|app59456_20240126144654|annotation_23157                       |false      |
|app59456_20240126144654|assay_eid_map_23146                    |false      |
|app59456_20240126144654|assay_eid_map_23148                    |false      |
|app59456_20240126144654|assay_eid_map_23157                    |false      |
|app59456_20240126144654|covid19_result_england                 

***

# Important Tables in the Database
### <u>Participant Tables</u>
From the command above, you can see some of the different participant tables i.e. **participant_0001, .., participant_0020**.

These tables contain the main UK Biobank participant data. Each participant is represented as one row, and each data-field is represented as one or more columns. For scalability reasons, the data-fields are horizontally split across multiple tables, starting from table participant_0001 (which contains the first few hundred columns for all participants), followed by participant_0002 (which contains the next few hundred columns), etc. The exact number of tables depends on how many data-fields your application is approved for. 

### <u>Inpatient Hospitalization Tables</u>
Tables beginning with **hesin** contain information about record-level inpatient data.

The linked HES data consists of seven interrelated database tables: **hesin, hesin_diag, hesin_oper, hesin_critical, hesin_psych, hesin_maternity** and **hesin_delivery**. These are explained in detail in the overview of the **Inpatient data (Resource 138483)** and the **Hospital Inpatient Data Dictionary (Resource 141140)** below.

**Inpatient data (Resource 138483)**

`wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/docs/HospitalEpisodeStatistics.pdf`

**Hospital Inpatient Data Dictionary (Resource 141140)**

`wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/docs/HESDataDic.xlsx`

### <u>National Death Registries Tables</u>
The death and death_cause tables from the death registry include the date of death and the 
primary and contributory causes of death, coded using the ICD-10 system. Some records 
also have free-text cause of death information from the death certificate.

**Resource 115559:**

This is a guide detailing how UK Biobank handles linking data from National Death Registries with its own database:

`wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/docs/DeathLinkage.pdf`

### <u>Primary Care Tables</u>

These tables contain record-level GP data for approximately 45% of the UK Biobank cohort **(Category 3001)**. The data consist of three tables: **gp_clinical**, **gp_scripts** and **gp_registrations**.

The resource below provides look-ups for the clinical coding systems used in the primary care data, and maps between the different coding systems used to generate the First occurrence data-fields. The data are provided in an Excel workbook; further information on the versions and sources of these data is given in an included PDF document.

`wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/primarycare_codings.zip`

#### Database Columns
For the main UK Biobank participant tables, the column-naming convention is generally as follows:

p\<FIELD-ID\>_i\<INSTANCE-ID\>_a\<ARRAY-ID\>

However, the following additional rules apply:
- If a field is not instanced, the _i\<INSTANCE-ID\> piece is skipped altogether.
- If a field is not arrayed, the _a\<ARRAY-ID\> piece is skipped altogether.
- If a field is arrayed due to being multi-select, the field is converted into a single column of type "embedded array", and the _a\<ARRAY-ID\> piece is skipped altogether.
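
The naming rules above can be expressed as a short helper. This is a minimal sketch; `column_name` is a hypothetical function written for illustration and is not part of `dxdata`.

```python
def column_name(field_id, instance=None, array=None):
    """Build a participant-table column name from its parts.

    Pass instance=None for fields that are not instanced, and array=None
    for fields that are not arrayed (or that are multi-select).
    """
    name = f"p{field_id}"
    if instance is not None:
        name += f"_i{instance}"
    if array is not None:
        name += f"_a{array}"
    return name

print(column_name(31))           # not instanced, not arrayed
print(column_name(53, 0))        # instanced only
print(column_name(4080, 0, 1))   # instanced and arrayed
```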

#### Accessing the Dataset Using Python
The dataset combines the low-level database structure with metadata from the UK Biobank Showcase. Database tables are exposed as virtual entities, and database columns are exposed as fields. **The split participant tables are all combined into a single entity called participant**.
You can load the dataset as follows:

In [7]:
# Access dataset
dataset = dxdata.load_dataset(id=dispensed_dataset_id)

#### Dataset 'entities' are virtual tables linked to one another.
The main entity is 'participant' and corresponds to most phenotype fields. Additional entities correspond to linked health care data. Entities starting with 'hesin' are for hospital records; entities starting with 'gp' are for GP records, etc.

In [8]:
# List all entities available
dataset.entities

[<Entity "participant">,
 <Entity "covid19_result_england">,
 <Entity "covid19_result_scotland">,
 <Entity "covid19_result_wales">,
 <Entity "gp_clinical">,
 <Entity "gp_scripts">,
 <Entity "gp_registrations">,
 <Entity "hesin">,
 <Entity "hesin_diag">,
 <Entity "hesin_oper">,
 <Entity "hesin_critical">,
 <Entity "hesin_maternity">,
 <Entity "hesin_delivery">,
 <Entity "hesin_psych">,
 <Entity "death">,
 <Entity "death_cause">,
 <Entity "omop_death">,
 <Entity "omop_device_exposure">,
 <Entity "omop_note">,
 <Entity "omop_observation">,
 <Entity "omop_drug_exposure">,
 <Entity "omop_observation_period">,
 <Entity "omop_person">,
 <Entity "omop_procedure_occurrence">,
 <Entity "omop_specimen">,
 <Entity "omop_visit_detail">,
 <Entity "omop_visit_occurrence">,
 <Entity "omop_dose_era">,
 <Entity "omop_drug_era">,
 <Entity "omop_condition_era">,
 <Entity "omop_condition_occurrence">,
 <Entity "omop_measurement">,
 <Entity "olink_instance_0">,
 <Entity "olink_instance_2">,
 <Entity "olink_

In [9]:
#Accessing dataset entities of interest:
participant = dataset['participant']
hesin = dataset['hesin']
hesin_diag = dataset['hesin_diag']
hesin_oper = dataset['hesin_oper']
hesin_critical = dataset['hesin_critical']
death = dataset['death']
death_cause = dataset['death_cause']
gp_clinical = dataset['gp_clinical'] # (HY)
gp_scripts = dataset['gp_scripts']
gp_registrations = dataset['gp_registrations']

# Finding entity data-fields
To fetch participant fields, you must first make a list of the field names of interest.
Use a combination of the UKB Showcase search (https://biobank.ndph.ox.ac.uk/showcase/search.cgi) and the find_field(), fields_by_title_keyword(), field_names_by_title_keyword() and field_titles_by_title_keyword() functions to find the fields of interest.

The fields_by_title_keyword() and field_names_by_title_keyword() functions are from the DNAnexus OpenBio GitHub repository, which can be accessed here: https://github.com/dnanexus/OpenBio/blob/master/UKB_notebooks/ukb-rap-pheno-basic.ipynb


In [10]:
#participant.find_field(title='p33')

In [11]:
#hesin_diag.find_field(title='33')

#### Looking up fields, given UKB showcase field id
If you know the field id but you are not sure if it is instanced or arrayed, and want to grab all instances/arrays (if any), use these:

In [12]:
# Returns all field objects for a given UKB showcase field id
def fields_for_id(field_id):
    from distutils.version import LooseVersion
    field_id = str(field_id)
    fields = participant.find_fields(name_regex=r'^p{}(_i\d+)?(_a\d+)?$'.format(field_id))
    return sorted(fields, key=lambda f: LooseVersion(f.name))

# Returns all field names for a given UKB showcase field id

def field_names_for_id(field_id):
    return [f.name for f in fields_for_id(field_id)]

In [13]:
# Example:
# Participant sex
# field_names_for_id('31')
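
To see what the name regex and the version-style sort in `fields_for_id` do, you can try the same logic on a plain list of column names. The sketch below is an illustration only: the column list is made up, and `natural_key` is a hypothetical stand-in for `LooseVersion`, which may be useful because `distutils` was removed from the standard library in Python 3.12.

```python
import re

def natural_key(name):
    """Sort key that compares digit runs numerically, so _i10 sorts after _i2."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]

def names_for_id(field_id, all_names):
    """Return the names matching p<field_id>, with optional _i<N> and _a<N> parts."""
    pattern = re.compile(r"^p{}(_i\d+)?(_a\d+)?$".format(field_id))
    return sorted((n for n in all_names if pattern.match(n)), key=natural_key)

# Made-up column names for illustration
columns = ["p4080_i10_a0", "p4080_i2_a1", "p4080_i2_a0", "p40800_i0", "p31", "p4080_i0_a0"]
matched = names_for_id(4080, columns)
print(matched)
```

Note that `p40800_i0` is not matched: the regex is anchored, so a field id that merely starts with the same digits is excluded.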

In [14]:
# Returns all field objects for a given title keyword
def fields_by_title_keyword(keyword):
    from distutils.version import LooseVersion
    fields = list(participant.find_fields(lambda f: keyword.lower() in f.title.lower()))
    return sorted(fields, key=lambda f: LooseVersion(f.name))

# Returns all field names for a given title keyword

def field_names_by_title_keyword(keyword):
    return [f.name for f in fields_by_title_keyword(keyword)]

# Returns all field titles for a given title keyword

def field_titles_by_title_keyword(keyword):
    return [f.title for f in fields_by_title_keyword(keyword)]

In [15]:
# Example:
# field_titles_by_title_keyword('Systolic blood pressure, automated reading')

In [16]:
#field_names_by_title_keyword('Year of birth')

# Accessing Variables of Interest in UK Biobank Database

### Baseline characteristics (Category 100094)
This category contains data on some general characteristics of participants that were known before arrival at the Assessment Centre, and includes date of birth, sex and an index of deprivation (based on the participant's postcode), all obtained from local NHS Primary Care Trust registries, and the name of the recruitment centre. This information could be amended by the participant upon arrival at the Assessment Centre.

- **Age at index** will be determined by using the field '33: Date of birth' within Baseline Characteristics, and subtracting this from the date of the index ASCVD diagnosis.

- **Sex** will be determined from field '31: Sex' within Baseline Characteristics. 

In [17]:
base_characteristics_field_names = {'eid': 'eid', # participant ID
                                    #'p33', # Date of birth - not available? This data-field will only be made available to researchers in exceptional circumstances. Wherever possible, researchers should use Month of Birth (Field 52) and Year of Birth (Field 34) instead.
                                    'p52': 'month_of_birth', # Month of birth
                                    'p34': 'year_of_birth', # Year of birth
                                    #'p21022', # Age at recruitment
                                    'p31': 'sex', # Sex
                                    'p50_i0': 'height_0', # height
                                    'p50_i1': 'height_1',
                                    'p50_i2': 'height_2', 
                                    'p50_i3': 'height_3',
                                    'p3160_i0': 'weight_0', # weight 
                                    'p3160_i1': 'weight_1', 
                                    'p3160_i2': 'weight_2', 
                                    'p3160_i3': 'weight_3',
                                    'p21001_i0': 'bmi_0', # bmi
                                    'p21001_i1': 'bmi_1', 
                                    'p21001_i2': 'bmi_2', 
                                    'p21001_i3': 'bmi_3'
                                    
                                   }

### Assessment centre ⏵ Recruitment ⏵ Reception (Category 100024)
This category contains information about participant arrival at the Assessment Centre and the locations from which they were recruited.

Variables of interest within Reception:
- **Date of attending assessment centre** will be detected using field '53 Date of attending assessment centre'

Note: Date of attending assessment centre can be used as UK Biobank enrolment date

In [18]:
# (HY)
reception_field = {'p53_i0': 'date_of_ac_0', # Date of attending assessment centre instance 1
                   'p53_i1': 'date_of_ac_1', # Date of attending assessment centre instance 2
                   'p53_i2': 'date_of_ac_2', # Date of attending assessment centre instance 3
                   'p53_i3': 'date_of_ac_3' # Date of attending assessment centre instance 4
                   }

### Blood Biochemistry (Category 17518)
A range of key biochemistry markers were measured in the blood sample collected at recruitment (for all 500,000 participants) and at repeat assessment approx. 5 years later (for 20,000 participants). The biomarkers were selected for analysis because they represent established risk factors for disease, are established diagnostic measures, or characterised phenotypes not otherwise well assessed and are feasible to measure at scale.

Variables of interest within Blood Biochemistry:
- **Lp(a) (nmol/L)** will be detected using field '30790 Lipoprotein A'
- **LDL-C (mmol/L)** will be detected using field '30780 LDL direct'
- **hs-CRP (mg/L)** will be detected using field '30710 C-reactive protein'

In [19]:
blood_biochemistry_field_names = {'p30790_i0': 'lpa_0', # (HY) Lipoprotein A instance 1
                                  'p30790_i1': 'lpa_1', # (HY) Lipoprotein A instance 2
                                  'p30796_i0': 'lpa_0_reportability',
                                  'p30690_i0': 'tc_0', # Total cholesterol instance 1
                                  'p30690_i1': 'tc_1', # Total cholesterol instance 2
                                  'p30760_i0': 'hdl_0', # HDL instance 1
                                  'p30760_i1': 'hdl_1', # HDL instance 2
                                  'p30780_i0': 'ldl_0', # LDL direct instance 1
                                  'p30780_i1': 'ldl_1',  # LDL direct instance 2
                                  'p30710_i0': 'crp_0', # C-reactive protein instance 1
                                  'p30710_i1': 'crp_1', # C-reactive protein instance 2
                                  
                                  }

### Ethnicity (Category 100065)
This category contains data from the touchscreen questionnaire on ethnic background, and years spent in the UK, where applicable.

Although further granularity is available for ethnicity, due to small numbers of patients within certain ethnicities only the parent categories of White, Mixed, Asian or Asian British, Black or Black British, or Missing/Unknown will be used.

**Variables of interest within Ethnicity:**
- **Ethnicity** will be determined using the field '21000 Ethnic background'

In [20]:
ethnicity_field_names = {'p21000_i0': 'ethnic_0',
                         'p21000_i1': 'ethnic_1',
                         'p21000_i2': 'ethnic_2',
                         'p21000_i3': 'ethnic_3'} # Ethnicity instances 1-4

### Smoking (Category 100058)
This category contains data from the touchscreen questionnaire on smoking habits (duration of smoking, number of cigarettes/day, age started smoking; among former smokers the amount smoked, time since cessation, ease of cessation, reason for cessation); and exposure to environmental tobacco smoke.

Categories within smoking status are: Never, Previous, Current and Prefer not to answer. These will be reported as Never, Previous, Current and Missing/Unknown.  

These values are from the date of the UK Biobank baseline assessment which is different to the index date.  

Variables of interest within Smoking:
- **Smoking status** will be determined using the field '20116 Smoking status'


In [21]:
smoking_field_names = {'p20116_i0': 'smoking_0',
                       'p20116_i1': 'smoking_1',
                       'p20116_i2': 'smoking_2',
                       'p20116_i3': 'smoking_3'} # Smoking status instances 1-4
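
The recoding described above (Never, Previous and Current kept as-is, everything else reported as Missing/Unknown) could be sketched as follows. This assumes the column holds decoded text labels rather than integer codes; `recode_smoking` is a hypothetical helper, not the notebook's own implementation.

```python
# Categories that are reported unchanged
SMOKING_CATEGORIES = {"Never", "Previous", "Current"}

def recode_smoking(value):
    """Map 'Prefer not to answer', missing values and anything unexpected to 'Missing/Unknown'."""
    return value if value in SMOKING_CATEGORIES else "Missing/Unknown"

for raw in ["Never", "Previous", "Current", "Prefer not to answer", None]:
    print(raw, "->", recode_smoking(raw))
```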

### Blood Pressure (Category 100011)
This category contains information on blood pressure measurements and pulse rate using the Omron Digital blood pressure monitor. Two blood pressure measurements were performed on each individual, using automated (the default option) or manual devices.

Measurement of blood pressure is from the UK Biobank baseline assessment which is different to the index date.  

**Variables of interest within Blood Pressure:**
- **Systolic blood pressure (mmHg)** will be taken from the field '4080 Systolic blood pressure, automated reading'
- **Diastolic blood pressure (mmHg)** will be taken from the field '4079 Diastolic blood pressure, automated reading'

In [22]:
blood_pressure_field_names = {'p4079_i0_a0': 'bp_diastolic_i0a0', # (HY) Diastolic blood pressure instance 1 array 1
                              'p4079_i0_a1': 'bp_diastolic_i0a1', # (HY) Diastolic blood pressure instance 1 array 2
                              'p4079_i1_a0': 'bp_diastolic_i1a0', # (HY) Diastolic blood pressure instance 2 array 1
                              'p4079_i1_a1': 'bp_diastolic_i1a1', # (HY) Diastolic blood pressure instance 2 array 2
                              'p4079_i2_a0': 'bp_diastolic_i2a0', # (HY) Diastolic blood pressure instance 3 array 1
                              'p4079_i2_a1': 'bp_diastolic_i2a1', # (HY) Diastolic blood pressure instance 3 array 2
                              'p4079_i3_a0': 'bp_diastolic_i3a0', # (HY) Diastolic blood pressure instance 4 array 1
                              'p4079_i3_a1': 'bp_diastolic_i3a1', # (HY) Diastolic blood pressure instance 4 array 2
                              'p4080_i0_a0': 'bp_systolic_i0a0',  # Systolic blood pressure instance 1 array 1
                              'p4080_i0_a1': 'bp_systolic_i0a1',
                              'p4080_i1_a0': 'bp_systolic_i1a0',
                              'p4080_i1_a1': 'bp_systolic_i1a1',
                              'p4080_i2_a0': 'bp_systolic_i2a0',
                              'p4080_i2_a1': 'bp_systolic_i2a1',
                              'p4080_i3_a0': 'bp_systolic_i3a0',
                              'p4080_i3_a1': 'bp_systolic_i3a1'} # Systolic blood pressure instance 4 array 2

### First occurrences (Category 1712)
This category contains data showing the 'first occurrence' of any code mapped to 3-character ICD-10.

The data-fields have been generated by mapping the following sources to 3-character ICD-10 codes:
- Read code information in the Primary Care data (Category 3000)
- ICD-9 and ICD-10 codes in the Hospital inpatient data (Category 2000)
- ICD-10 codes in Death Register records (Field 40001, Field 40002)
- Self-reported medical condition codes (Field 20002) reported at the baseline or subsequent UK Biobank assessment centre visit

For each code two data-fields are available:
- the date the code was first recorded across any of the sources listed above
- the source where the code was first recorded, and information on whether the code was recorded in at least one other source subsequently

The data-fields are grouped by ICD-10 chapter in sub-categories 2401-2417.

Details of the mapping process, construction of these variables and caveats related to their use can be found in Resource 593.

**Variables of interest within First occurrences:**
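- Genitourinary system disorders: Presence of **chronic kidney disease** at index will be determined using the date of first occurrence of ICD-10 code N18 from field '132032 Date N18 first reported (chronic renal failure)'.
- Endocrine, nutritional and metabolic diseases: **Diabetes** will be identified using the dates of first occurrence of ICD-10 codes E10-E14 (fields 130706, 130708, 130710, 130712 and 130714). Other candidate sources noted for diabetes are field '2976 Age diabetes diagnosed', self-reported conditions (field 20002) and insulin use at the assessment date.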


In [23]:
fo_field_names = {# 'p2976_i0': 'dm_age_i0',
                  # 'p2976_i1': 'dm_age_i1',
                  # 'p2976_i2': 'dm_age_i2',
                  # 'p2976_i3': 'dm_age_i3', # Age diabetes diagnosed instances 1-4

                  # Health-related outcomes ⏵ First occurrences ⏵ Endocrine, nutritional and metabolic diseases:
                  'p130706': 'dm_e10_date_first', # Date E10 first reported (insulin-dependent diabetes mellitus)
                  # 'p130707': 'dm_e10_source', # Source of report of E10 (insulin-dependent diabetes mellitus)
                  'p130708': 'dm_e11_date_first', # Date E11 first reported (non-insulin-dependent diabetes mellitus)
                  # 'p130709': 'dm_e11_source', # Source of report of E11 (non-insulin-dependent diabetes mellitus)
                  'p130710': 'dm_e12_date_first', # Date E12 first reported (malnutrition-related diabetes mellitus)
                  # 'p130711': 'dm_e12_source', # Source of report of E12 (malnutrition-related diabetes mellitus)
                  'p130712': 'dm_e13_date_first', # Date E13 first reported (other specified diabetes mellitus)
                  # 'p130713': 'dm_e13_source', # Source of report of E13 (other specified diabetes mellitus)
                  'p130714': 'dm_e14_date_first', # Date E14 first reported (unspecified diabetes mellitus)
                  # 'p130715': 'dm_e14_source', # Source of report of E14 (unspecified diabetes mellitus)
                  # 'p130716': 'dm_e15_date_first', # Date E15 first reported (nondiabetic hypoglycaemic coma)
                  # 'p130717': 'dm_e15_source', # Source of report of E15 (nondiabetic hypoglycaemic coma)

                   # (HY) below codes added
                  # 'p2966_i0': 'hbp_age_i0',
                  # 'p2966_i1': 'hbp_age_i1',
                  # 'p2966_i2': 'hbp_age_i2',
                  # 'p2966_i3': 'hbp_age_i3', # Age high blood pressure diagnosed instances 1-4

                  'p131286': 'hbp_i10_date_first', # Date I10 first reported (essential (primary) hypertension)
                  # 'p131287': 'hbp_i10_source', # Source of report of I10 (essential (primary) hypertension)
                  'p131288': 'hbp_i11_date_first', # Date I11 first reported (hypertensive heart disease)
                  # 'p131289': 'hbp_i11_source', # Source of report of I11 (hypertensive heart disease)
                  'p131290': 'hbp_i12_date_first', # Date I12 first reported (hypertensive renal disease)
                  # 'p131291': 'hbp_i12_source', # Source of report of I12 (hypertensive renal disease)
                  'p131292': 'hbp_i13_date_first', # Date I13 first reported (hypertensive heart and renal disease)
                  # 'p131293': 'hbp_i13_source', # Source of report of I13 (hypertensive heart and renal disease)
                  'p131294': 'hbp_i15_date_first', # Date I15 first reported (secondary hypertension)
                  # 'p131295': 'hbp_i15_source', # Source of report of I15 (secondary hypertension)

                  'p132032': 'crf_n18_date_first', # Date N18 first reported (chronic renal failure)
                  # 'p132033': 'crf_n18_source' # Source of report of N18 (chronic renal failure)
                    }
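
Once the first-occurrence dates are retrieved, deciding whether a condition was present at index comes down to a date comparison. The sketch below shows one possible rule, under the assumption that "present at index" means the earliest recorded date falls on or before the index date; `present_at_index` is a hypothetical helper and the dates are made up.

```python
from datetime import date

def present_at_index(first_reported_dates, index_date):
    """True if any non-missing first-reported date is on or before the index date."""
    known = [d for d in first_reported_dates if d is not None]
    return bool(known) and min(known) <= index_date

index_date = date(2012, 3, 1)

# Example: E10 never reported, E11 first reported before the index date
print(present_at_index([None, date(2010, 6, 15)], index_date))
# Example: only reported after the index date
print(present_at_index([date(2015, 1, 1)], index_date))
```

Passing several dates at once lets you treat a group of related codes (for example E10 to E14) as a single condition.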

In [24]:
# Join all field name lists/dictionaries
# field_names = list(itertools.chain(base_characteristics_field_names,
#                                   reception_field, # (HY)
#                                   blood_biochemistry_field_names,
#                                   ethnicity_field_names,
#                                   smoking_field_names,
#                                   blood_pressure_field_names,
#                                   fo_field_names
#                                  ))

field_names = {**base_characteristics_field_names,
               **reception_field, # (HY)
               **blood_biochemistry_field_names,
               **ethnicity_field_names,
               **smoking_field_names,
               **blood_pressure_field_names,
               **fo_field_names}
field_names

{'eid': 'eid',
 'p52': 'month_of_birth',
 'p34': 'year_of_birth',
 'p31': 'sex',
 'p50_i0': 'height_0',
 'p50_i1': 'height_1',
 'p50_i2': 'height_2',
 'p50_i3': 'height_3',
 'p3160_i0': 'weight_0',
 'p3160_i1': 'weight_1',
 'p3160_i2': 'weight_2',
 'p3160_i3': 'weight_3',
 'p21001_i0': 'bmi_0',
 'p21001_i1': 'bmi_1',
 'p21001_i2': 'bmi_2',
 'p21001_i3': 'bmi_3',
 'p53_i0': 'date_of_ac_0',
 'p53_i1': 'date_of_ac_1',
 'p53_i2': 'date_of_ac_2',
 'p53_i3': 'date_of_ac_3',
 'p30790_i0': 'lpa_0',
 'p30790_i1': 'lpa_1',
 'p30796_i0': 'lpa_0_reportability',
 'p30690_i0': 'tc_0',
 'p30690_i1': 'tc_1',
 'p30760_i0': 'hdl_0',
 'p30760_i1': 'hdl_1',
 'p30780_i0': 'ldl_0',
 'p30780_i1': 'ldl_1',
 'p30710_i0': 'crp_0',
 'p30710_i1': 'crp_1',
 'p21000_i0': 'ethnic_0',
 'p21000_i1': 'ethnic_1',
 'p21000_i2': 'ethnic_2',
 'p21000_i3': 'ethnic_3',
 'p20116_i0': 'smoking_0',
 'p20116_i1': 'smoking_1',
 'p20116_i2': 'smoking_2',
 'p20116_i3': 'smoking_3',
 'p4079_i0_a0': 'bp_diastolic_i0a0',
 'p4079_i0_a1

### Grabbing fields into a Spark DataFrame
The participant.retrieve_fields() function can be used to construct a Spark DataFrame of the given fields.

By default, this retrieves data as encoded by UK Biobank. For example, field p31 (participant sex) will be returned as an integer column with values of 0 and 1. To receive decoded values, supply the coding_values='replace' argument.

In [25]:
df = participant.retrieve_fields(names=field_names.keys(),
                                 engine=dxdata.connect(),
                                 coding_values='replace',
                                 column_aliases=field_names)

df = df.withColumn('studyend',f.lit('2022-10-31'))

df = df.withColumn('studyend', f.to_date(f.col('studyend'), 'yyyy-MM-dd'))

In [26]:
df.show(3)
df.printSchema()
df.count() # (HY) Count number of rows/patients

+-------+--------------+-------------+------+--------+--------+--------+--------+--------+--------+--------+--------+-------+-----+-----+-----+------------+------------+------------+------------+-----+-----+--------------------+-----+----+-----+-----+-----+-----+-----+-----+--------+--------+--------+--------+---------+---------+---------+---------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+------------------+------------------+------------------+----------+
|    eid|month_of_birth|year_of_birth|   sex|height_0|height_1|height_2|height_3|weight_0|weight_1|weight_2|weight_3|  bmi_0|bmi_1|bmi_2|bmi_3|date_of_ac

502230

In [27]:
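# Date of birth: only month (field 52) and year (field 34) of birth are retrieved,
# so the day of birth is imputed as the 15th.
# Convert the decoded month name (e.g. 'July') to its month number
df = df.withColumn('month_of_birth_n', f.from_unixtime(f.unix_timestamp(f.col('month_of_birth'),'MMMM'),'MM'))
df = df.withColumn('month_of_birth_n', f.col('month_of_birth_n').cast('int'))
df = df.withColumn('day_of_birth', f.lit(15))
datecols=['year_of_birth','month_of_birth_n','day_of_birth']
# Build a 'year-month-day' string and cast it to a date
# (the cast alone does the conversion, so no to_date format string is needed)
df = df.withColumn("date_of_birth", f.concat_ws("-", *datecols).cast("date"))
df = df.drop('month_of_birth_n','day_of_birth')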
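
The same imputation, and the age-at-index calculation it feeds into, can be illustrated without Spark. This is a minimal sketch assuming English month names and age counted in completed years; `impute_date_of_birth` and `age_in_years` are hypothetical helpers and the dates are made up.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
MONTH_NUMBER = {name: number for number, name in enumerate(MONTHS, start=1)}

def impute_date_of_birth(year, month_name, day=15):
    """Build a date of birth from year and month name, imputing the day."""
    return date(int(year), MONTH_NUMBER[month_name], day)

def age_in_years(born, on):
    """Completed years between two dates."""
    before_birthday = (on.month, on.day) < (born.month, born.day)
    return on.year - born.year - before_birthday

dob = impute_date_of_birth(1950, "July")
print(dob, age_in_years(dob, date(2012, 3, 1)))
```

Imputing the 15th keeps the maximum error in the date of birth to roughly half a month in either direction.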

In [28]:
#Ethnicity
ethnic_mapping = pd.read_csv("/mnt/project/Users/yonghu4/Lpa_EB/pgm/ethnic_mapping.csv", dtype=str, keep_default_na=False)
ethnic_mapping = spark.createDataFrame(ethnic_mapping)
ethnic_mapping.show(truncate=False)
ethnic_mapping.printSchema()

+--------------------------+----------------------+
|ethnic                    |ethnic_final          |
+--------------------------+----------------------+
|Any other Asian background|Asian or Asian British|
|Indian                    |Asian or Asian British|
|Pakistani                 |Asian or Asian British|
|Chinese                   |Asian or Asian British|
|Bangladeshi               |Asian or Asian British|
|Asian or Asian British    |Asian or Asian British|
|Black or Black British    |Black or Black British|
|Any other Black background|Black or Black British|
|African                   |Black or Black British|
|Caribbean                 |Black or Black British|
|Mixed                     |Mixed                 |
|White and Black African   |Mixed                 |
|White and Black Caribbean |Mixed                 |
|Any other mixed background|Mixed                 |
|White and Asian           |Mixed                 |
|Other ethnic group        |Unknown               |
|null       

In [29]:
# Take the first non-null ethnicity across instances 0-3
df = df.withColumn('ethnic', f.coalesce(f.col('ethnic_0'), f.col('ethnic_1'), f.col('ethnic_2'), f.col('ethnic_3')))

# Map to the parent category; unmatched or missing values become 'Unknown'
df = df.join(ethnic_mapping,'ethnic','left')
df = df.withColumn('ethnic_final', f.coalesce(f.col('ethnic_final'), f.lit('Unknown')))
df = df.drop('ethnic')
df = df.withColumnRenamed('ethnic_final','ethnic')

df = df.drop('ethnic_0','ethnic_1','ethnic_2','ethnic_3')
df.select('eid','ethnic').show(10)

+-------+------+
|    eid|ethnic|
+-------+------+
|3319388| White|
|4544491| White|
|4760583| White|
|5376532| White|
|1045957| White|
|1484808| White|
|3128280| White|
|3962502| White|
|5967127| White|
|1442780| White|
+-------+------+
only showing top 10 rows
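
The first-non-null rule and the roll-up to top-level groups can also be checked outside Spark. The sketch below is a plain-Python illustration, not part of the pipeline: `first_non_null` and `top_level_group` are hypothetical helper names, and the small dictionary reproduces only a few rows of the `ethnic_mapping` table shown above.

```python
from typing import Optional, Sequence

# A few rows copied from the ethnic_mapping table shown above (not the full table).
ETHNIC_MAPPING = {
    "Indian": "Asian or Asian British",
    "African": "Black or Black British",
    "White and Asian": "Mixed",
    "Other ethnic group": "Unknown",
}


def first_non_null(values: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first value that is not None, or None if all are missing."""
    for value in values:
        if value is not None:
            return value
    return None


def top_level_group(instances: Sequence[Optional[str]]) -> str:
    """Pick the earliest recorded ethnicity and roll it up.

    Missing or unmapped values fall back to 'Unknown', as in the Spark code.
    """
    ethnic = first_non_null(instances)
    return ETHNIC_MAPPING.get(ethnic, "Unknown")


# Instance 0 is missing, so instance 1 is used.
print(top_level_group([None, "Indian", None, None]))
# No instance recorded at all.
print(top_level_group([None, None, None, None]))
```

Keeping the rule in a small pure function like this makes it easy to test edge cases (all instances missing, unmapped labels) before running the join on the full cohort.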



# Returned data 2321 (Reassessed Lp(a))

In [30]:
df.select('eid','lpa_0','lpa_0_reportability').filter(f.col('lpa_0') < 10).show(10)
df.select('eid','lpa_0','lpa_0_reportability').filter(f.col('lpa_0').isNull()).show(truncate=False)
df.select('eid','lpa_0','lpa_0_reportability').filter(f.col('lpa_0_reportability') == f.lit('Not reportable at assay (too high)')).show(truncate=False)
df.select('lpa_0_reportability').distinct().show(truncate=False)

+-------+-----+--------------------+
|    eid|lpa_0| lpa_0_reportability|
+-------+-----+--------------------+
|1000094| 8.04|Reportable at ass...|
|1000220| 6.44|Reportable at ass...|
|1000356| 9.74|Reportable at ass...|
|1000898|  6.1|Reportable at ass...|
|1001030|  9.6|Reportable at ass...|
|1001309|  4.3|Reportable at ass...|
|1001740|  9.3|Reportable at ass...|
|1001771|  5.4|Reportable at ass...|
|1002470|  5.4|Reportable at ass...|
|1002508| 8.29|Reportable at ass...|
+-------+-----+--------------------+
only showing top 10 rows

+-------+-----+----------------------------------+
|eid    |lpa_0|lpa_0_reportability               |
+-------+-----+----------------------------------+
|1002368|null |Not reportable at assay (too low) |
|1000467|null |Not reportable at assay (too low) |
|1000784|null |Not reportable at assay (too high)|
|1000910|null |Not reportable at assay (too low) |
|1000952|null |Not reportable at assay (too low) |
|1001128|null |Not reportable at assay (too high

In [31]:
return2321_lpa = pd.read_csv("/mnt/project/returned_datasets/2321/lpa_values_outofrange.csv", dtype=str, keep_default_na=False)
return2321_lpa = return2321_lpa.where((pd.notnull(return2321_lpa)), None)
return2321_eidbridge = pd.read_fwf("/mnt/project/returned_datasets/2321/ukb59456bridge50016.txt", dtype=str, keep_default_na=False, header=None, names=['eid', 'eid50016'])
return2321_eidbridge = return2321_eidbridge.where((pd.notnull(return2321_eidbridge)), None)

In [32]:
return2321_lpa

Unnamed: 0,eid,LPA_oval_b,LPA_oval_r
0,4832806,2.6,
1,5429370,3.7,
2,3214032,57.2,
3,2106982,3.02,
4,1554511,12.49,
...,...,...,...
460544,5303954,19.0,
460545,3151503,101.2,
460546,2075690,59.9,
460547,1538360,2.61,


In [33]:
return2321_eidbridge

Unnamed: 0,eid,eid50016
0,1000012,2763664
1,1000029,1702778
2,1000031,4547999
3,1000047,4266808
4,1000050,1370458
...,...,...
502486,6025013,5153348
502487,6025028,3713210
502488,6025035,5361929
502489,6025044,1258063


In [34]:
df.printSchema()

root
 |-- eid: string (nullable = true)
 |-- month_of_birth: string (nullable = true)
 |-- year_of_birth: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- height_0: double (nullable = true)
 |-- height_1: double (nullable = true)
 |-- height_2: double (nullable = true)
 |-- height_3: double (nullable = true)
 |-- weight_0: double (nullable = true)
 |-- weight_1: double (nullable = true)
 |-- weight_2: double (nullable = true)
 |-- weight_3: double (nullable = true)
 |-- bmi_0: double (nullable = true)
 |-- bmi_1: double (nullable = true)
 |-- bmi_2: double (nullable = true)
 |-- bmi_3: double (nullable = true)
 |-- date_of_ac_0: date (nullable = true)
 |-- date_of_ac_1: date (nullable = true)
 |-- date_of_ac_2: date (nullable = true)
 |-- date_of_ac_3: date (nullable = true)
 |-- lpa_0: double (nullable = true)
 |-- lpa_1: double (nullable = true)
 |-- lpa_0_reportability: string (nullable = true)
 |-- tc_0: double (nullable = true)
 |-- tc_1: double (nullable = true)
 |-

In [35]:
return2321_lpa_s = spark.createDataFrame(return2321_lpa)
return2321_lpa_s.show(3)
return2321_lpa_s.printSchema()
# Convert empty strings to null first, then cast the measurements to double.
return2321_lpa_s = return2321_lpa_s.withColumn('LPA_oval_b', f.when((f.col('LPA_oval_b') == ''), f.lit(None)).otherwise(f.col('LPA_oval_b')))
return2321_lpa_s = return2321_lpa_s.withColumn('LPA_oval_r', f.when((f.col('LPA_oval_r') == ''), f.lit(None)).otherwise(f.col('LPA_oval_r')))
return2321_lpa_s = return2321_lpa_s.withColumn('LPA_oval_b', f.col('LPA_oval_b').cast(DoubleType()))
return2321_lpa_s = return2321_lpa_s.withColumn('LPA_oval_r', f.col('LPA_oval_r').cast(DoubleType()))
return2321_lpa_s.show(5)
return2321_lpa_s.printSchema()

+-------+----------+----------+
|    eid|LPA_oval_b|LPA_oval_r|
+-------+----------+----------+
|4832806|       2.6|          |
|5429370|       3.7|          |
|3214032|      57.2|          |
+-------+----------+----------+
only showing top 3 rows

root
 |-- eid: string (nullable = true)
 |-- LPA_oval_b: string (nullable = true)
 |-- LPA_oval_r: string (nullable = true)

+-------+----------+----------+
|    eid|LPA_oval_b|LPA_oval_r|
+-------+----------+----------+
|4832806|       2.6|      null|
|5429370|       3.7|      null|
|3214032|      57.2|      null|
|2106982|      3.02|      null|
|1554511|     12.49|      null|
+-------+----------+----------+
only showing top 5 rows

root
 |-- eid: string (nullable = true)
 |-- LPA_oval_b: double (nullable = true)
 |-- LPA_oval_r: double (nullable = true)



In [36]:
return2321_lpa_s = return2321_lpa_s.withColumn('lpacheck', f.when((f.col('LPA_oval_b').isNotNull()) & (f.col('LPA_oval_r').isNotNull()), f.abs(f.col('LPA_oval_b') - f.col('LPA_oval_r'))).otherwise(f.lit(None)))
return2321_lpa_s.filter(f.col('lpacheck').isNotNull()).select(f.min(f.col('lpacheck')),f.max(f.col('lpacheck'))).show()
return2321_lpa_s.filter(f.col('lpacheck') > 100).show()
return2321_lpa_sb = return2321_lpa_s.filter(f.col('LPA_oval_b').isNotNull()).count()
return2321_lpa_sr = return2321_lpa_s.filter(f.col('LPA_oval_r').isNotNull()).count()
return2321_lpa_sbr = return2321_lpa_s.filter(f.col('lpacheck').isNotNull()).count()
return2321_lpa_sb189 = return2321_lpa_s.filter(f.col('LPA_oval_b') > 189).count()
dflpa = df.filter(f.col('lpa_0').isNotNull()).count()
print(f'Number of records in return2321_lpa_s with a baseline measurement: {return2321_lpa_sb}')
print(f'Number of records in return2321_lpa_s with a repeat measurement: {return2321_lpa_sr}')
print(f'Number of records in return2321_lpa_s with both measurements: {return2321_lpa_sbr}')
print(f'Number of records in return2321_lpa_s with a baseline measurement > 189: {return2321_lpa_sb189}')
print(f'Number of records in df with an Lp(a) measurement at enrolment: {dflpa}')

+-------------+------------------+
|min(lpacheck)|     max(lpacheck)|
+-------------+------------------+
|          0.0|334.59999999999997|
+-------------+------------------+

+-------+----------+----------+------------------+
|    eid|LPA_oval_b|LPA_oval_r|          lpacheck|
+-------+----------+----------+------------------+
|1888360|    299.23|     427.0|127.76999999999998|
|3922954|     128.0|    231.51|103.50999999999999|
|3793740|     434.3|     768.9|334.59999999999997|
|4609559|      6.34|     167.5|            161.16|
|4399652|    459.49|    286.58|172.91000000000003|
|1596552|       3.5|     249.6|             246.1|
|4686877|    260.36|       4.1|            256.26|
|1767618|    139.39|       5.5|            133.89|
|3284572|     343.2|     551.7|208.50000000000006|
|2938647|     178.0|    290.07|            112.07|
|4364311|     169.8|     309.9|140.09999999999997|
|2557059|     236.9|     369.5|             132.6|
|4331397|    115.36|      4.45|            110.91|
|1269895

In [37]:
return2321_eidbridge_s = spark.createDataFrame(return2321_eidbridge)
return2321_eidbridge_s.printSchema()

root
 |-- eid: string (nullable = true)
 |-- eid50016: string (nullable = true)



In [38]:
#QC to make sure no duplicated eid
return2321_lpa_scount = return2321_lpa_s.count()
return2321_lpa_scountuniq = return2321_lpa_s.select('eid').distinct().count()

print(f'Row count for return2321_lpa_s: {return2321_lpa_scount}')
print(f'Unique patient count for return2321_lpa_s: {return2321_lpa_scountuniq}')

Row count for return2321_lpa_s: 460549
Unique patient count for return2321_lpa_s: 460549


In [39]:
return2321_lpa_s = return2321_lpa_s.withColumnRenamed('eid', 'eid50016')

In [40]:
return2321_lpa_s2 = return2321_lpa_s.join(return2321_eidbridge_s,'eid50016','inner')
return2321_lpa_s2.count()

460541
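
The inner join returns 460541 rows, eight fewer than the 460549 rows in `return2321_lpa_s`, so a small number of returned records have no match in the bridge file. A minimal plain-Python sketch of how such unmatched IDs could be listed is given below; the IDs are made up and `unmatched_ids` is a hypothetical helper. In Spark, the same check could be expressed as a `left_anti` join between the two DataFrames.

```python
def unmatched_ids(returned_ids, bridge):
    """Return the IDs from the returned dataset that have no entry in the bridge mapping."""
    return sorted(i for i in returned_ids if i not in bridge)


# Toy data (invented): keys are IDs under the other application (eid50016),
# values are the corresponding eids in this project.
bridge = {"2000001": "1000012", "2000002": "1000029"}
returned_ids = ["2000001", "2000002", "2000003"]

# IDs that an inner join on the bridge would silently drop.
print(unmatched_ids(returned_ids, bridge))
```

Listing the dropped IDs, rather than only comparing counts, shows whether the loss is expected (for example, withdrawn participants) or points to a key-formatting problem.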

In [41]:
return2321_lpa_s2.show(5)

+--------+----------+----------+--------+-------+
|eid50016|LPA_oval_b|LPA_oval_r|lpacheck|    eid|
+--------+----------+----------+--------+-------+
| 1000045|    135.68|      null|    null|4118113|
| 1000107|      6.53|      null|    null|1883480|
| 1000140|       7.6|      null|    null|2327181|
| 1000235|       2.9|      null|    null|4953762|
| 1000268|     14.25|      null|    null|1954067|
+--------+----------+----------+--------+-------+
only showing top 5 rows



In [42]:
dflpa0 = df.select('eid','lpa_0','lpa_0_reportability')
dflpa0_r2321 = dflpa0.join(return2321_lpa_s2, 'eid', 'left')
dflpa0_r2321.filter((f.col('lpa_0').isNotNull()) & (f.col('LPA_oval_b').isNotNull())).show(5)
dflpa0_r2321.filter((f.col('lpa_0').isNull()) & (f.col('LPA_oval_b').isNotNull())).show(5)
dflpa0_r2321.filter((f.col('lpa_0') == 189) & (f.col('LPA_oval_b') > 189)).show()

+-------+-----+--------------------+--------+----------+----------+--------+
|    eid|lpa_0| lpa_0_reportability|eid50016|LPA_oval_b|LPA_oval_r|lpacheck|
+-------+-----+--------------------+--------+----------+----------+--------+
|1000047| 9.45|Reportable at ass...| 4266808|      9.45|      null|    null|
|1000050|  6.2|Reportable at ass...| 1370458|       6.2|      null|    null|
|1000068|15.16|Reportable at ass...| 1823257|     15.16|      null|    null|
|1000122|29.95|Reportable at ass...| 5239918|     29.95|      null|    null|
|1000214|142.9|Reportable at ass...| 3106735|     142.9|      null|    null|
+-------+-----+--------------------+--------+----------+----------+--------+
only showing top 5 rows

+-------+-----+--------------------+--------+----------+----------+--------+
|    eid|lpa_0| lpa_0_reportability|eid50016|LPA_oval_b|LPA_oval_r|lpacheck|
+-------+-----+--------------------+--------+----------+----------+--------+
|1000467| null|Not reportable at...| 3613689|      

In [43]:
dflpa0_r2321 = dflpa0_r2321.withColumn('lpa_baseline', f.when(f.col('lpa_0').isNotNull(), f.col('lpa_0'))
                                                        .when((f.col('lpa_0').isNull()) & (f.col('lpa_0_reportability') == f.lit('Not reportable at assay (too high)')), f.col('LPA_oval_b'))
                                                        .when((f.col('lpa_0').isNull()) & (f.col('lpa_0_reportability') == f.lit('Not reportable at assay (too low)')), f.col('LPA_oval_b'))
                                                        .otherwise(f.lit(None)))

dflpa0_r2321.filter((f.col('lpa_0').isNull()) & (f.col('LPA_oval_b').isNotNull())).show(5)
dflpa0_r2321.filter((f.col('lpa_0') < 3.8)).show(5)
dflpa0_r2321.filter((f.col('LPA_oval_b') < 3.8)).show(5)
dflpa0_r2321.filter((f.col('LPA_oval_b') > 189)).show(5)

dflpa0_r2321_c1 = dflpa0_r2321.filter((f.col('lpa_0').isNull()) & (f.col('LPA_oval_b').isNotNull())).count()
dflpa0_r2321_c2 = dflpa0_r2321.filter((f.col('lpa_0').isNotNull()) & (f.col('LPA_oval_b') == f.col('lpa_0'))).count()
dflpa0_r2321_c3_8 = dflpa0_r2321.filter((f.col('lpa_baseline').isNotNull()) & (f.col('lpa_baseline') < 3.8)).count()
dflpa0_r2321_c189 = dflpa0_r2321.filter((f.col('lpa_baseline').isNotNull()) & (f.col('lpa_baseline') > 189)).count()

print(f'Patient count lpa_0 is null, LPA_oval_b is not null: {dflpa0_r2321_c1}')
print(f'Patient count lpa_0 is not null, lpa_0 = LPA_oval_b: {dflpa0_r2321_c2}')
print(f'Patient count lpa_baseline < 3.8: {dflpa0_r2321_c3_8}')
print(f'Patient count lpa_baseline > 189: {dflpa0_r2321_c189}')

+-------+-----+--------------------+--------+----------+----------+--------+------------+
|    eid|lpa_0| lpa_0_reportability|eid50016|LPA_oval_b|LPA_oval_r|lpacheck|lpa_baseline|
+-------+-----+--------------------+--------+----------+----------+--------+------------+
|1000467| null|Not reportable at...| 3613689|       2.6|      null|    null|         2.6|
|1000784| null|Not reportable at...| 3051330|     190.1|      null|    null|       190.1|
|1000910| null|Not reportable at...| 1379642|      1.49|      null|    null|        1.49|
|1000952| null|Not reportable at...| 1306474|       1.8|      null|    null|         1.8|
|1001128| null|Not reportable at...| 3752095|     306.4|      null|    null|       306.4|
+-------+-----+--------------------+--------+----------+----------+--------+------------+
only showing top 5 rows

+---+-----+-------------------+--------+----------+----------+--------+------------+
|eid|lpa_0|lpa_0_reportability|eid50016|LPA_oval_b|LPA_oval_r|lpacheck|lpa_basel

In [44]:
dflpa0_r2321.filter(f.col('lpa_baseline').isNotNull()).count()

454915

In [45]:
dflpa0_r2321.filter(f.col('lpa_baseline').isNull()).count()

47315
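
The rule behind `lpa_baseline` can be stated compactly: keep the main-dataset value when it is present; otherwise, if the assay flag marks the value as out of range, fall back to the value from returned dataset 2321. The plain-Python sketch below restates that rule for a single participant. The function name is hypothetical, and the example values are taken from rows shown in the outputs above.

```python
from typing import Optional

# Reportability flags for which the returned-dataset value is used.
OUT_OF_RANGE_FLAGS = {
    "Not reportable at assay (too high)",
    "Not reportable at assay (too low)",
}


def lpa_baseline(
    lpa_0: Optional[float],
    reportability: Optional[str],
    lpa_oval_b: Optional[float],
) -> Optional[float]:
    """Combine the main-dataset Lp(a) value with the returned out-of-range value."""
    if lpa_0 is not None:
        return lpa_0
    if reportability in OUT_OF_RANGE_FLAGS:
        return lpa_oval_b
    return None


# In-range value: the main-dataset measurement is kept.
print(lpa_baseline(9.45, "Reportable at assay", 9.45))
# Out-of-range value: the returned measurement fills the gap.
print(lpa_baseline(None, "Not reportable at assay (too high)", 190.1))
```

Participants with no main-dataset value and no out-of-range flag stay missing, which is why `lpa_baseline` is still null for part of the cohort.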

In [46]:
dflpa0_r2321 = dflpa0_r2321.select('eid', 'lpa_baseline').withColumnRenamed('lpa_baseline','lpa')
df = df.join(dflpa0_r2321,'eid','left')

In [47]:
df = df.withColumn('lpa_threshold', f.when((f.col('lpa') < 65), f.lit('<65 nmol/L'))
                                     .when((f.col('lpa') >= 65) & (f.col('lpa') < 150), f.lit('>=65 - <150 nmol/L'))
                                     .when((f.col('lpa') >= 150) & (f.col('lpa') < 175), f.lit('>=150 - <175 nmol/L'))
                                     .when((f.col('lpa') >= 175) & (f.col('lpa') < 190), f.lit('>=175 - <190 nmol/L'))
                                     .when((f.col('lpa') >= 190) & (f.col('lpa') < 225), f.lit('>=190 - <225 nmol/L'))
                                     .when((f.col('lpa') >= 225) & (f.col('lpa') < 250), f.lit('>=225 - <250 nmol/L'))
                                     .when((f.col('lpa') >= 250), f.lit('>=250 nmol/L'))
                                     .otherwise(f.lit('Missing')))
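
The bands are half-open intervals (lower bound inclusive, upper bound exclusive), which is easy to get wrong at the cut points. As a sanity check, the same banding can be sketched in plain Python with `bisect`; `lpa_band`, `CUTS` and `LABELS` are hypothetical names, while the cut points and labels are copied from the Spark expression.

```python
from bisect import bisect_right
from typing import Optional

# Cut points (nmol/L) and labels copied from the Spark expression.
CUTS = [65, 150, 175, 190, 225, 250]
LABELS = [
    "<65 nmol/L",
    ">=65 - <150 nmol/L",
    ">=150 - <175 nmol/L",
    ">=175 - <190 nmol/L",
    ">=190 - <225 nmol/L",
    ">=225 - <250 nmol/L",
    ">=250 nmol/L",
]


def lpa_band(lpa: Optional[float]) -> str:
    """Return the threshold label for an Lp(a) value; None maps to 'Missing'."""
    if lpa is None:
        return "Missing"
    # bisect_right counts the cut points <= lpa, so a value equal to a cut
    # point lands in the band that starts at that cut point.
    return LABELS[bisect_right(CUTS, lpa)]


for value in (64.9, 65, 189.9, 190, 250, None):
    print(value, "->", lpa_band(value))
```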

In [48]:
df = df.drop('lpa_0','lpa_1','lpa_0_reportability')
df.printSchema()

root
 |-- eid: string (nullable = true)
 |-- month_of_birth: string (nullable = true)
 |-- year_of_birth: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- height_0: double (nullable = true)
 |-- height_1: double (nullable = true)
 |-- height_2: double (nullable = true)
 |-- height_3: double (nullable = true)
 |-- weight_0: double (nullable = true)
 |-- weight_1: double (nullable = true)
 |-- weight_2: double (nullable = true)
 |-- weight_3: double (nullable = true)
 |-- bmi_0: double (nullable = true)
 |-- bmi_1: double (nullable = true)
 |-- bmi_2: double (nullable = true)
 |-- bmi_3: double (nullable = true)
 |-- date_of_ac_0: date (nullable = true)
 |-- date_of_ac_1: date (nullable = true)
 |-- date_of_ac_2: date (nullable = true)
 |-- date_of_ac_3: date (nullable = true)
 |-- tc_0: double (nullable = true)
 |-- tc_1: double (nullable = true)
 |-- hdl_0: double (nullable = true)
 |-- hdl_1: double (nullable = true)
 |-- ldl_0: double (nullable = true)
 |-- ldl_1: doubl

In [49]:
# Reference: https://academic.oup.com/eurjpc/article/28/18/1991/5918819
## The minimum reported concentration of Lp(a) was 3.8 nmol/L and the maximum was 189 nmol/L.
## Participants with levels below the lower limit were coded as having an Lp(a) concentration of 2.88 nmol/L,
## and levels above the upper limit were taken from returned dataset 2321.

#dflpa0_r2321 = dflpa0_r2321.withColumn('lpa_baseline', f.when((f.col('lpa_baseline').isNotNull()) & (f.col('lpa_baseline') < 3.8), 2.88).otherwise(f.col('lpa_baseline')))

#dflpa0_r2321.filter(f.col('lpa_baseline') < 3.8).show()

# Accessing variables in Hospital Inpatient data

**HESIN** – This is the master table. It provides information on inpatient episodes of 
care for England, Wales and Scotland (currently excluding maternity inpatient 
episodes for Scotland), including details of admission and discharge and the type of 
episode. Where applicable, it also records how an episode fits into a hospital spell.

In [50]:
# HESIN 
hesin_field_names = ['eid', 
                     'ins_index', 
                     'dsource', 
                     'source', 
                     'epistart', 
                     'epiend', 
                     'epidur', 
                     'bedyear', 
                     'epistat', 
                     'epitype', 
                     'epiorder', 
                     'spell_index', 
                     'spell_seq', 
                     'spelbgin', 
                     'spelend', 
                     'speldur', 
                     'pctcode', 
                     'gpprpct', 
                     'category', 
                     'elecdate', 
                     'elecdur', 
                     'admidate', 
                     'admimeth_uni', 
                     'admimeth', 
                     'admisorc_uni', 
                     'admisorc', 
                     'firstreg', 
                     'classpat_uni', 
                     'classpat', 
                     'intmanag_uni', 
                     'intmanag', 
                     'mainspef_uni', 
                     'mainspef', 
                     'tretspef_uni', 
                     'tretspef', 
                     'operstat', 
                     'disdate', 
                     'dismeth_uni', 
                     'dismeth', 
                     'disdest_uni', 
                     'disdest', 
                     'carersi'
                    ]

**HESIN_DIAG** – Diagnosis codes (ICD-9 or ICD-10) relating to the inpatient episode of 
care for England, Wales and Scotland (but currently excluding maternity episodes for 
Scotland). 

In [51]:
# HESIN DIAG
hesin_diag_field_names = ['eid', # This identifier is the same encoded id used in the main dataset.
                          'ins_index', # A numerical index which together with the eid uniquely identifies the corresponding record in the main HESIN table.
                          'arr_index', # A numerical index which together with the eid & ins_index uniquely identifies this record, i.e. eid, ins_index & arr_index together form a primary key for this table.
                          'level', # 1 Primary/main diagnosis, 2 Secondary diagnosis, 3 External cause
                          'diag_icd9',
                          'diag_icd9_nb',
                          'diag_icd10', # Diagnoses are coded according to the International Classification of Diseases version-10 (ICD 10).
                          'diag_icd10_nb'
                         ]

**HESIN_OPER** – Operations and procedural codes (OPCS-3 or OPCS-4) relating to the 
inpatient episode of care (but currently excluding maternity episodes for Scotland).

In [52]:
# HESIN OPER
hesin_oper_field_names = ['eid', # This identifier is the same encoded id used in the main dataset.
                          'ins_index', # A numerical index which together with the eid uniquely identifies the corresponding record in the main HESIN table.
                          'arr_index', # A numerical index which together with the eid & ins_index uniquely identifies this record, i.e. eid, ins_index & arr_index together form a primary key for this table.
                          'level', # 1 Main operation, 2 Secondary operation
                          'opdate', # Date of operation
                          'oper3',
                          'oper3_nb',
                          'oper4', # Operative procedures are coded according to the Office of Population Censuses and Surveys Classification of Interventions and Procedures, version 4 (OPCS-4).
                          'oper4_nb'
                         ]

**HESIN_CRITICAL** – A child table of HESIN containing further information about those hospital episodes that required treatment in a critical care unit. For example, it gives the number of days of (basic & advanced) cardiac and respiratory support received by a patient. Links back to HESIN via the eid and ins_index fields. 
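
HESIN_DIAG, HESIN_OPER and HESIN_CRITICAL are all child tables that link back to HESIN through the composite key (`eid`, `ins_index`). The plain-Python sketch below illustrates that relationship on a few invented records; `attach_children` is a hypothetical helper, and in Spark the equivalent would be a join on `['eid', 'ins_index']`.

```python
from collections import defaultdict

# Toy records (invented values) shaped like HESIN and HESIN_DIAG rows.
hesin = [
    {"eid": "1", "ins_index": 0, "epistart": "2010-01-05"},
    {"eid": "1", "ins_index": 1, "epistart": "2012-03-09"},
    {"eid": "2", "ins_index": 0, "epistart": "2011-07-21"},
]
hesin_diag = [
    {"eid": "1", "ins_index": 0, "arr_index": 0, "level": 1, "diag_icd10": "I251"},
    {"eid": "1", "ins_index": 0, "arr_index": 1, "level": 2, "diag_icd10": "I500"},
    {"eid": "1", "ins_index": 1, "arr_index": 0, "level": 1, "diag_icd10": "I24"},
]


def attach_children(parents, children):
    """Attach child rows to their parent episode via the (eid, ins_index) key."""
    by_key = defaultdict(list)
    for child in children:
        by_key[(child["eid"], child["ins_index"])].append(child)
    return [
        {**parent, "children": by_key.get((parent["eid"], parent["ins_index"]), [])}
        for parent in parents
    ]


linked = attach_children(hesin, hesin_diag)
for episode in linked:
    codes = [c["diag_icd10"] for c in episode["children"]]
    print(episode["eid"], episode["ins_index"], codes)
```

Joining on `eid` alone would pair every diagnosis with every episode of the same participant, so both key columns are needed.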

In [53]:
hesin_critical_field_names = ['eid',
                              'ins_index',
                              'arr_index',
                              'dsource',
                              'source',
                              'ccstartdate',
                              'ccadmitype',
                              'ccadmisorc',
                              'ccsorcloc',
                              'ccdisdate',
                              'ccdisrdydate',
                              'ccdisstat',
                              'ccdisdest',
                              'ccdisloc',
                              'ccapcrel',
                              'bressupdays',
                              'aressupdays',
                              'bcardsupdays',
                              'acardsupdays',
                              'rensupdays',
                              'neurosupdays',
                              'gisupdays',
                              'dermsupdays',
                              'liversupdays',
                              'orgsupmax',
                              'cclev2days',
                              'cclev3days',
                              'ccunitfun',
                              'unitbedconfig'
                             ]

In [54]:
# Retrieve the selected fields as Spark DataFrames
hesin_df = hesin.retrieve_fields(names=hesin_field_names, engine=dxdata.connect())
hesin_diag_df = hesin_diag.retrieve_fields(names=hesin_diag_field_names, engine=dxdata.connect())
hesin_oper_df = hesin_oper.retrieve_fields(names=hesin_oper_field_names, engine=dxdata.connect())
hesin_critical_df = hesin_critical.retrieve_fields(names=hesin_critical_field_names, engine=dxdata.connect())

In [55]:
hesin_df_rownum = hesin_df.count()
hesin_df_eidnum = hesin_df.select('eid').distinct().count()
print(f'Number of records in hesin: {hesin_df_rownum}')
print(f'Number of patients in hesin: {hesin_df_eidnum}')

Number of records in hesin: 4238180
Number of patients in hesin: 449078


In [56]:
hesin_df.select('eid','dsource','epistart','admidate','epiend','disdate').filter((f.col('epistart').isNull()) & (f.col('dsource') == 'HES')).distinct().show(5)

+-------+-------+--------+----------+----------+----------+
|    eid|dsource|epistart|  admidate|    epiend|   disdate|
+-------+-------+--------+----------+----------+----------+
|4877445|    HES|    null|2008-04-07|2008-04-11|2008-04-11|
|1054255|    HES|    null|      null|      null|      null|
|4975309|    HES|    null|      null|      null|      null|
|4720663|    HES|    null|      null|      null|      null|
|5771257|    HES|    null|2008-07-16|2008-07-16|2008-07-16|
+-------+-------+--------+----------+----------+----------+
only showing top 5 rows



In [57]:
# Impute missing episode start and discharge dates,
# then identify the censoring date for each data source (dsource).

# epistart_im: episode start, falling back to the admission date.
hesin_df = hesin_df.withColumn('epistart_im', f.coalesce(f.col('epistart'), f.col('admidate')))

# disdate_im: discharge date, falling back to episode end, episode start, then admission date.
hesin_df = hesin_df.withColumn('disdate_im', f.coalesce(f.col('disdate'), f.col('epiend'), f.col('epistart'), f.col('admidate')))

hesin_df = hesin_df.withColumn('censordate', f.when((f.col('dsource') == 'HES'), f.lit('2022-10-31'))
                                              .when((f.col('dsource') == 'SMR'), f.lit('2022-08-31'))
                                              .when((f.col('dsource') == 'PEDW'), f.lit('2022-05-31'))
                                              .otherwise(f.lit(None)))

hesin_df = hesin_df.withColumn('censordate', f.to_date(f.col('censordate'), 'yyyy-MM-dd'))

In [58]:
hesin_df.select('eid','dsource','censordate').show(5)
hesin_df.select('eid','dsource','censordate').printSchema()

+-------+-------+----------+
|    eid|dsource|censordate|
+-------+-------+----------+
|5245449|    HES|2022-10-31|
|5792320|    HES|2022-10-31|
|1490947|    HES|2022-10-31|
|2037719|    HES|2022-10-31|
|4595537|    HES|2022-10-31|
+-------+-------+----------+
only showing top 5 rows

root
 |-- eid: string (nullable = true)
 |-- dsource: string (nullable = true)
 |-- censordate: date (nullable = true)
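
The imputation amounts to taking the first available date from an ordered list of fallbacks, plus a per-source lookup for the censoring date. A plain-Python sketch of the same logic for a single episode is shown below; `first_present` and `impute_episode` are hypothetical helpers, the censoring dates are copied from the Spark code, and the example row is the first one shown in the `epistart`-is-null output above.

```python
from datetime import date
from typing import Optional

# Censoring date per data source, copied from the Spark code.
CENSOR_DATES = {
    "HES": date(2022, 10, 31),
    "SMR": date(2022, 8, 31),
    "PEDW": date(2022, 5, 31),
}


def first_present(*values: Optional[date]) -> Optional[date]:
    """Return the first date that is not None, or None if all are missing."""
    return next((v for v in values if v is not None), None)


def impute_episode(dsource, epistart, epiend, admidate, disdate):
    """Impute episode start/discharge dates and look up the censoring date."""
    return {
        # Episode start falls back to the admission date.
        "epistart_im": first_present(epistart, admidate),
        # Discharge falls back to episode end, episode start, then admission.
        "disdate_im": first_present(disdate, epiend, epistart, admidate),
        # Unknown sources get no censoring date.
        "censordate": CENSOR_DATES.get(dsource),
    }


# eid 4877445 from the output above: epistart is missing, admidate is present.
row = impute_episode(
    "HES",
    epistart=None,
    epiend=date(2008, 4, 11),
    admidate=date(2008, 4, 7),
    disdate=date(2008, 4, 11),
)
print(row)
```

Episodes with no dates at all, such as several rows in that output, stay missing after imputation and need to be handled separately.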



# Import code list

In [59]:
%%bash
pip install openpyxl



Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.5

In [60]:
from openpyxl import load_workbook

# Load the consolidated code-list workbook.
wb = load_workbook('/mnt/project/Users/yonghu4/Lpa_EB/code/Consolidated codes_Pelacarsen HEOR Study_final_v01.xlsx')

In [61]:
print(wb.sheetnames)

['readme', 'column_dictionary', 'ascvd_group', 'AnyASCVD_final', 'Revascularization_final', 'MI_final', 'IS_final', 'HF_final']


In [62]:
from itertools import islice

# Load the AnyASCVD_final sheet into a pandas DataFrame.
code_ascvd_data = wb["AnyASCVD_final"].values

# The first row of the sheet holds the column names.
code_ascvd_cols = next(code_ascvd_data)[0:]
code_ascvd_data = list(code_ascvd_data)

# Use the first column (group) as the index while also keeping it as a data column.
code_ascvd_idx = [r[0] for r in code_ascvd_data]
code_ascvd_data = (islice(r, 0, None) for r in code_ascvd_data)

code_ascvd = pd.DataFrame(code_ascvd_data, index=code_ascvd_idx, columns=code_ascvd_cols)

In [63]:
code_ascvd

Unnamed: 0,group,coding_system,code,description,used_in_UKB,present_in_SP,group_original,Remark
ascvd_cad_codes,ascvd_cad_codes,ICD-10,I24,Other acute ischaemic heart diseases,Yes,No,CHD/Unstable & stable angina,
ascvd_cad_codes,ascvd_cad_codes,ICD-10,I25,Chronic ischaemic heart disease,Yes,No,CHD/Unstable & stable angina,
ascvd_cad_codes,ascvd_cad_codes,ICD-10,I250,"Atherosclerotic cardiovascular disease, so des...",Yes,No,CHD/Unstable & stable angina,
ascvd_cad_codes,ascvd_cad_codes,ICD-10,I251,Atherosclerotic heart disease,Yes,No,CHD/Unstable & stable angina,
ascvd_cad_codes,ascvd_cad_codes,ICD-10,I253,Aneurysm of heart,Yes,No,CHD/Unstable & stable angina,
...,...,...,...,...,...,...,...,...
ascvd_revascularization_codes,ascvd_revascularization_codes,DRG,550,CORONARY BYP W/O CARD CATH W/O MAJ CV DX,No,Yes,Coronary artery bypass graft surgery,
ascvd_revascularization_codes,ascvd_revascularization_codes,DRG,556,PERC CV PR W NON-DRUG STENT WO MAJ CV DX,No,No,Angioplasty and Stent placement,
ascvd_revascularization_codes,ascvd_revascularization_codes,DRG,557,PERC CV PR W DRUG-EL STENT W MAJ CV DX,No,No,Angioplasty and Stent placement,
ascvd_revascularization_codes,ascvd_revascularization_codes,DRG,558,PERC CV PR W DRUG-EL STENT W/O MAJ CV DX,No,No,Angioplasty and Stent placement,


In [64]:
# Creating the DataFrame
code_ascvd_schema = StructType([StructField("group", StringType(), True)
                               ,StructField("coding_system", StringType(), True)
                               ,StructField("code", StringType(), True)
                               ,StructField("description", StringType(), True)
                               ,StructField("used_in_UKB", StringType(), True)
                               ,StructField("present_in_SP", StringType(), True)
                               ,StructField("group_original", StringType(), True)
                               ,StructField("Remark", StringType(), True)
                               ])
 
code_ascvd = spark.createDataFrame(code_ascvd, schema=code_ascvd_schema)

In [65]:
#Check dataframe and column types
code_ascvd.show()
code_ascvd.printSchema()

+--------------------+-------------+----+--------------------+-----------+-------------+--------------------+------+
|               group|coding_system|code|         description|used_in_UKB|present_in_SP|      group_original|Remark|
+--------------------+-------------+----+--------------------+-----------+-------------+--------------------+------+
|     ascvd_cad_codes|       ICD-10| I24|Other acute ischa...|        Yes|           No|CHD/Unstable & st...|  null|
|     ascvd_cad_codes|       ICD-10| I25|Chronic ischaemic...|        Yes|           No|CHD/Unstable & st...|  null|
|     ascvd_cad_codes|       ICD-10|I250|Atherosclerotic c...|        Yes|           No|CHD/Unstable & st...|  null|
|     ascvd_cad_codes|       ICD-10|I251|Atherosclerotic h...|        Yes|           No|CHD/Unstable & st...|  null|
|     ascvd_cad_codes|       ICD-10|I253|   Aneurysm of heart|        Yes|           No|CHD/Unstable & st...|  null|
|     ascvd_cad_codes|       ICD-10|I254|Coronary artery a...|  

In [66]:
code_ascvd.select('coding_system').distinct().sort('coding_system').show()

+-------------+
|coding_system|
+-------------+
|    CPT/HCPCS|
|          DRG|
|       ICD-10|
|    ICD-10-CM|
|   ICD-10-PCS|
|        ICD-9|
|     ICD-9-CM|
|    ICD-9-PCS|
|       OPCS-3|
|       OPCS-4|
+-------------+



In [67]:
#Extract ascvd code list
code_ascvd_icd09 = code_ascvd.filter(f.col('coding_system') == 'ICD-9').select(f.col('group').alias('group_icd9'),f.col('code').alias('diag_icd9')).distinct()
code_ascvd_icd10 = code_ascvd.filter(f.col('coding_system') == 'ICD-10').select(f.col('group').alias('group_icd10'),f.col('code').alias('diag_icd10')).distinct()
code_ascvd_opcs3 = code_ascvd.filter(f.col('coding_system') == 'OPCS-3').select(f.col('group').alias('group_oper3'),f.col('code').alias('oper3')).distinct()
code_ascvd_opcs4 = code_ascvd.filter(f.col('coding_system') == 'OPCS-4').select(f.col('group').alias('group_oper4'),f.col('code').alias('oper4')).distinct()

code_ascvd_icd09.printSchema()
code_ascvd_icd10.printSchema()
code_ascvd_opcs3.printSchema()
code_ascvd_opcs4.printSchema()

root
 |-- group_icd9: string (nullable = true)
 |-- diag_icd9: string (nullable = true)

root
 |-- group_icd10: string (nullable = true)
 |-- diag_icd10: string (nullable = true)

root
 |-- group_oper3: string (nullable = true)
 |-- oper3: string (nullable = true)

root
 |-- group_oper4: string (nullable = true)
 |-- oper4: string (nullable = true)



In [68]:
code_ascvd_icd09count = code_ascvd_icd09.count()
code_ascvd_icd09count_distinct = code_ascvd_icd09.select('diag_icd9').distinct().count()
code_ascvd_icd10count = code_ascvd_icd10.count()
code_ascvd_icd10count_distinct = code_ascvd_icd10.select('diag_icd10').distinct().count()
code_ascvd_opcs3count = code_ascvd_opcs3.count()
code_ascvd_opcs3count_distinct = code_ascvd_opcs3.select('oper3').distinct().count()
code_ascvd_opcs4count = code_ascvd_opcs4.count()
code_ascvd_opcs4count_distinct = code_ascvd_opcs4.select('oper4').distinct().count()


print(f'Number of rows in code_ascvd_icd09: {code_ascvd_icd09count}')
print(f'Number of unique codes in code_ascvd_icd09: {code_ascvd_icd09count_distinct}')
print(f'Number of rows in code_ascvd_icd10: {code_ascvd_icd10count}')
print(f'Number of unique codes in code_ascvd_icd10: {code_ascvd_icd10count_distinct}')
print(f'Number of rows in code_ascvd_opcs3: {code_ascvd_opcs3count}')
print(f'Number of unique codes in code_ascvd_opcs3: {code_ascvd_opcs3count_distinct}')
print(f'Number of rows in code_ascvd_opcs4: {code_ascvd_opcs4count}')
print(f'Number of unique codes in code_ascvd_opcs4: {code_ascvd_opcs4count_distinct}')

Number of rows in code_ascvd_icd09: 30
Number of unique codes in code_ascvd_icd09: 30
Number of rows in code_ascvd_icd10: 102
Number of unique codes in code_ascvd_icd10: 102
Number of rows in code_ascvd_opcs3: 3
Number of unique codes in code_ascvd_opcs3: 3
Number of rows in code_ascvd_opcs4: 169
Number of unique codes in code_ascvd_opcs4: 169


In [69]:
#load HF_final
code_hf_data = wb["HF_final"]

code_hf_data = code_hf_data.values

code_hf_cols = next(code_hf_data)[0:]

code_hf_data = list(code_hf_data)

code_hf_idx = [r[0] for r in code_hf_data]

code_hf_data = (islice(r, 0, None) for r in code_hf_data)

code_hf =  pd.DataFrame(code_hf_data, index=code_hf_idx, columns=code_hf_cols)

In [70]:
# Creating the hf code DataFrame
code_hf_schema = StructType([StructField("group", StringType(), True)
                               ,StructField("coding_system", StringType(), True)
                               ,StructField("code", StringType(), True)
                               ,StructField("description", StringType(), True)
                               ,StructField("used_in_UKB", StringType(), True)
                               ,StructField("Remark", StringType(), True)
                               ])
 
code_hf = spark.createDataFrame(code_hf, schema=code_hf_schema)
code_hf.show()

+-------------+-------------+------+--------------------+-----------+--------------------+
|        group|coding_system|  code|         description|used_in_UKB|              Remark|
+-------------+-------------+------+--------------------+-----------+--------------------+
|mace_hf_codes|       ICD-10|   I50|      Heart failure |        Yes|                null|
|mace_hf_codes|       ICD-10|  I500|Congestive heart ...|        Yes|                null|
|mace_hf_codes|       ICD-10|  I501|Left ventricular ...|        Yes|                null|
|mace_hf_codes|       ICD-10|  I509|Heart failure, un...|        Yes|                null|
|mace_hf_codes|       ICD-10|  I110|Hypertensive hear...|        Yes|                null|
|mace_hf_codes|       ICD-10|  I130|Hypertensive hear...|        Yes|                null|
|mace_hf_codes|       ICD-10|  I132|Hypertensive hear...|        Yes|                null|
|mace_hf_codes|        ICD-9|   428|       Heart failure|        Yes|Newly added and v...|

In [71]:
#Extract HF code list
code_hf_icd10_diag = code_hf.filter(f.col('coding_system') == 'ICD-10').select(f.col('code').alias('cause_icd10')).distinct()

In [72]:
# Create CVD related ICD-10 code list
code_ascvd_icd10_diag = code_ascvd_icd10.select(f.col('diag_icd10').alias('cause_icd10')).distinct()
code_cvd_icd10 = code_ascvd_icd10_diag.union(code_hf_icd10_diag)

In [73]:
#Primary care read codelist

wbread = load_workbook('/mnt/project/Users/yonghu4/Lpa_EB/code/read2ctv3_ascvd_codelist_hf_draft_tobeQC_Luiz_v2.xlsx')

#load excel
code_ascvd_read = wbread["primarycare_ascvdcode_final"]

code_ascvd_read = code_ascvd_read.values

code_ascvd_r_cols = next(code_ascvd_read)[0:]

code_ascvd_read  = list(code_ascvd_read)

code_ascvd_r_idx = [r[0] for r in code_ascvd_read]

code_ascvd_read = (islice(r, 0, None) for r in code_ascvd_read)

code_ascvd_r =  pd.DataFrame(code_ascvd_read, index=code_ascvd_r_idx, columns=code_ascvd_r_cols)
code_ascvd_r

Unnamed: 0,coding_system,read_code,term_description_list,ascvd_group
read_v2,read_v2,F11x2,Cerebral degeneration due to cerebrovascular d...,ascvd_cerebrovascular_codes
read_v2,read_v2,F21y2,Binswanger's disease|Binswanger's encephalopathy,ascvd_cerebrovascular_codes
read_v2,read_v2,F4236,Amaurosis fugax,ascvd_tia_codes
read_v2,read_v2,Fyu55,[X]Other transient cerebral ischaemic attacks ...,ascvd_tia_codes
read_v2,read_v2,G....,Circulatory system diseases|Cardiovascular sys...,ascvd_nonspecific_codes
...,...,...,...,...
ctv3,ctv3,XaMt8,Transluminal aortic branched stent graft NEC|T...,ascvd_revascularization_codes
ctv3,ctv3,XaMvB,Percutaneous transluminal insertion of stent i...,ascvd_revascularization_codes
ctv3,ctv3,XaMyP,Percutaneous transluminal insertion of stent i...,ascvd_revascularization_codes
ctv3,ctv3,XaQIl,Percutaneous transluminal insertion of stent i...,ascvd_revascularization_codes


In [74]:
# Creating the DataFrame
code_ascvd_r_schema = StructType([StructField("coding_system", StringType(), True) 
                                 ,StructField("read_code", StringType(), True)
                                 ,StructField("term_description_list", StringType(), True)
                                 ,StructField("ascvd_group", StringType(), True) 
                                ])

code_ascvd_r = spark.createDataFrame(code_ascvd_r, schema=code_ascvd_r_schema)

In [75]:
#Check dataframe and column types
code_ascvd_r = code_ascvd_r.withColumnRenamed('ascvd_group', 'group')
code_ascvd_r.show()
code_ascvd_r.printSchema()

+-------------+---------+---------------------+--------------------+
|coding_system|read_code|term_description_list|               group|
+-------------+---------+---------------------+--------------------+
|      read_v2|    F11x2| Cerebral degenera...|ascvd_cerebrovasc...|
|      read_v2|    F21y2| Binswanger's dise...|ascvd_cerebrovasc...|
|      read_v2|    F4236|      Amaurosis fugax|     ascvd_tia_codes|
|      read_v2|    Fyu55| [X]Other transien...|     ascvd_tia_codes|
|      read_v2|    G....| Circulatory syste...|ascvd_nonspecific...|
|      read_v2|    G3...| Ischaemic heart d...|ascvd_nonspecific...|
|      read_v2|    G30..| Acute myocardial ...|      ascvd_mi_codes|
|      read_v2|    G300.| Acute anterolater...|      ascvd_mi_codes|
|      read_v2|    G301.| Other specified a...|      ascvd_mi_codes|
|      read_v2|    G3010| Acute anteroapica...|      ascvd_mi_codes|
|      read_v2|    G3011| Acute anterosepta...|      ascvd_mi_codes|
|      read_v2|    G301z| Anterior

In [76]:
#Extract code list
code_ascvd_readv2 = code_ascvd_r.filter(f.col('coding_system') == 'read_v2').select(f.col('group').alias('group_readv2'),f.col('read_code').alias('read_2')).distinct()
code_ascvd_ctv3 = code_ascvd_r.filter(f.col('coding_system') == 'ctv3').select(f.col('group').alias('group_ctv3'),f.col('read_code').alias('read_3')).distinct()

code_ascvd_readv2.printSchema()
code_ascvd_ctv3.printSchema()

root
 |-- group_readv2: string (nullable = true)
 |-- read_2: string (nullable = true)

root
 |-- group_ctv3: string (nullable = true)
 |-- read_3: string (nullable = true)



In [77]:
# Check read codes
code_ascvd_ctv3.filter(f.col('read_3').contains('322')).show()

+--------------+------+
|    group_ctv3|read_3|
+--------------+------+
|ascvd_mi_codes| 3222.|
|ascvd_mi_codes| 322..|
+--------------+------+



# Accessing variables in Death Register data

**DEATH** - Each record in the DEATH table is uniquely identified by the participant's eid (encoded identifier) and the instance index (ins_index) of the record. Participants usually have only one record in this table, with ins_index=0; however, a small number of participants have more than one death record.
The DEATH table holds the date on which the participant died in the date_of_death field.
It also records the data source (dsource), which is either E/W for England & Wales or SCOT for Scotland.

In [78]:
death_field_names = ['eid',
                     'ins_index',
                     #'dsource',
                     #'source',
                     'date_of_death'
                    ]
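
Because a small number of participants have more than one DEATH record, it can help to check which `eid` values repeat before joining the table to anything else. The following is a minimal plain-Python sketch on made-up rows; the `eid` values and dates are invented, and `find_repeated_eids` and `unique_eid_dates` are hypothetical helpers (not part of `dxdata` or Spark) that only illustrate the logic.

```python
from collections import Counter

# Toy DEATH rows: (eid, ins_index, date_of_death). All values are invented.
death_rows = [
    ("1000001", 0, "2015-03-02"),
    ("1000002", 0, "2018-11-20"),
    ("1000002", 1, "2018-11-20"),  # second record for the same participant
    ("1000003", 0, "2020-01-05"),
]

def find_repeated_eids(rows):
    """Return the eids that appear on more than one DEATH record."""
    counts = Counter(eid for eid, _ins_index, _date in rows)
    return sorted(eid for eid, n in counts.items() if n > 1)

def unique_eid_dates(rows):
    """Drop ins_index and keep the distinct (eid, date_of_death) pairs."""
    return sorted({(eid, date) for eid, _ins_index, date in rows})

print(find_repeated_eids(death_rows))
print(unique_eid_dates(death_rows))
```

Selecting distinct `eid` and `date_of_death` pairs collapses repeated records only when they carry the same date; a participant whose records disagree on the date would still contribute two rows.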

**DEATH_CAUSE** - The corresponding causes of death, coded using ICD-10, are provided in the DEATH_CAUSE table, with the eid and ins_index fields used to link back to the main record in the DEATH table. The arr_index is a sequential index, starting at 0, which labels each separate cause of death. A primary cause of death is assigned level=1 in this table and a contributory cause of death is assigned level=2.

In [79]:
death_cause_field_names = ['eid',
                           'ins_index',
                           'arr_index',
                           'level',
                           'cause_icd10'
                          ]
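
To make the `level` convention concrete, here is a small plain-Python sketch that splits each participant's causes into primary (level 1) and contributory (level 2). The rows are invented and `split_causes` is a hypothetical helper used only for illustration.

```python
from collections import defaultdict

# Toy DEATH_CAUSE rows: (eid, ins_index, arr_index, level, cause_icd10).
# All values are invented.
cause_rows = [
    ("1000001", 0, 0, 1, "I219"),
    ("1000001", 0, 1, 2, "I259"),
    ("1000002", 0, 0, 1, "C19"),
    ("1000002", 0, 1, 2, "K566"),
    ("1000002", 0, 2, 2, "I489"),
]

def split_causes(rows):
    """Group causes per eid into primary (level 1) and contributory (level 2)."""
    out = defaultdict(lambda: {"primary": [], "contributory": []})
    for eid, _ins_index, _arr_index, level, code in rows:
        key = "primary" if level == 1 else "contributory"
        out[eid][key].append(code)
    return dict(out)

causes = split_causes(cause_rows)
print(causes["1000002"])
```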

In [80]:
# Retrieve the selected fields as Spark DataFrames
death_df = death.retrieve_fields(names=death_field_names, engine=dxdata.connect())
death_cause_df = death_cause.retrieve_fields(names=death_cause_field_names, engine=dxdata.connect())

In [81]:
death_cause_df.show(5)

+-------+---------+---------+-----+-----------+
|    eid|ins_index|arr_index|level|cause_icd10|
+-------+---------+---------+-----+-----------+
|5388020|        0|        4|    2|        I38|
|2443426|        0|        1|    2|        F03|
|5403157|        0|        0|    1|       U071|
|4997874|        0|        1|    2|        R58|
|4910041|        0|        0|    1|       I608|
+-------+---------+---------+-----+-----------+
only showing top 5 rows



In [82]:
#cv death
death_cause_df.select('eid','cause_icd10', 'level').filter((f.col('cause_icd10').startswith('I2')) | 
                                                           (f.col('cause_icd10').startswith('I3')) |
                                                           (f.col('cause_icd10').startswith('I4')) |
                                                           (f.col('cause_icd10').startswith('I5')) |
                                                           (f.col('cause_icd10').startswith('I6')) |
                                                           (f.col('cause_icd10').startswith('I7'))).show()

+-------+-----------+-----+
|    eid|cause_icd10|level|
+-------+-----------+-----+
|5388020|        I38|    2|
|4910041|       I608|    1|
|3919128|       I500|    2|
|5474244|       I489|    2|
|4948725|       I420|    1|
|4516879|       I619|    1|
|5704682|        I64|    1|
|1215012|       I251|    2|
|4474764|       I501|    2|
|4603338|       I509|    2|
|3012611|       I259|    1|
|4557357|       I259|    1|
|5115025|       I269|    2|
|2960119|       I219|    1|
|5578257|       I330|    2|
|5082025|       I509|    2|
|4516997|       I259|    2|
|3946056|       I517|    2|
|2694027|       I251|    1|
|5192554|       I219|    1|
+-------+-----------+-----+
only showing top 20 rows
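
The six `startswith` conditions above select ICD-10 codes whose second character is a digit from 2 to 7, that is, the range I20-I79. As a hedged illustration, the same rule can be written as one regular expression; `is_cv_cause` is a hypothetical plain-Python helper, not a function used elsewhere in this notebook.

```python
import re

# ICD-10 codes I20-I79 all start with "I" followed by a digit from 2 to 7.
CV_PATTERN = re.compile(r"^I[2-7]")

def is_cv_cause(code):
    """True if an ICD-10 code (stored without a dot) falls in I20-I79."""
    return code is not None and CV_PATTERN.match(code) is not None

for code in ["I219", "I64", "I10", "I802", "U071", None]:
    print(code, is_cv_cause(code))
```

In PySpark, a pattern like this could be passed to `Column.rlike` in place of the chained conditions, assuming the codes are stored without dots as shown in the output above.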



# Merge death information to df

In [83]:
# Identify CV death
death_cause_eid = death_cause_df.select('eid','cause_icd10','level').groupBy('eid').agg(f.collect_list('cause_icd10').alias('death_icd10'),f.collect_list('level').alias('death_level'))

#'death_cv_primary': primary cause of death, as recorded in the Death Register data, where the corresponding ICD code is between I20-I79
death_cause_cv_df = death_cause_df.select('eid','cause_icd10', 'level').filter((f.col('cause_icd10').startswith('I2')) | 
                                                                               (f.col('cause_icd10').startswith('I3')) |
                                                                               (f.col('cause_icd10').startswith('I4')) |
                                                                               (f.col('cause_icd10').startswith('I5')) |
                                                                               (f.col('cause_icd10').startswith('I6')) |
                                                                               (f.col('cause_icd10').startswith('I7')))

death_cause_cv_eid = death_cause_cv_df.distinct()
death_cause_cv_eid = death_cause_cv_eid.groupBy('eid').agg(f.collect_list('cause_icd10').alias('death_cv_icd10'),f.collect_list('level').alias('death_cv_level'))
death_cause_cv_eid = death_cause_cv_eid.withColumn('death_cv', f.lit(1))
death_cause_cv_eid = death_cause_cv_eid.withColumn('death_cv_primary', f.when((f.array_contains(f.col('death_cv_level'), 1)), f.lit(1)).otherwise(f.lit(0)))

#'death_cv02_primary': primary cause of death, as recorded in the Death Register data, where the corresponding ICD-10 code
## appears in the Excel workbook "Consolidated codes_Pelacarsen HEOR Study_final_v01.xlsx", tabs "AnyASCVD_final" and "HF_final".
death_cause_cv2_df = death_cause_df.filter(f.col('level') == 1)
death_cause_cv2_df = death_cause_cv2_df.select('eid','cause_icd10').join(code_cvd_icd10, 'cause_icd10', 'inner')
death_cause_cv2_df = death_cause_cv2_df.select('eid').distinct()
death_cause_cv2_df = death_cause_cv2_df.withColumn('death_cv02_primary', f.lit(1))

death_df_uniq = death_df.select('eid', 'date_of_death').distinct()
death_df_final = death_df_uniq.join(death_cause_eid,'eid','left')
death_df_final = death_df_final.join(death_cause_cv_eid,'eid','left')
death_df_final = death_df_final.join(death_cause_cv2_df,'eid','left')
death_df_final.show(5, truncate=False)



+-------+-------------+------------------------------+---------------+------------------+--------------+--------+----------------+------------------+
|eid    |date_of_death|death_icd10                   |death_level    |death_cv_icd10    |death_cv_level|death_cv|death_cv_primary|death_cv02_primary|
+-------+-------------+------------------------------+---------------+------------------+--------------+--------+----------------+------------------+
|1582577|2020-10-18   |[I489, J189]                  |[2, 1]         |[I489]            |[2]           |1       |0               |null              |
|3392520|2011-09-23   |[X448, X458, T509, F329, T519]|[1, 2, 2, 2, 2]|null              |null          |null    |null            |null              |
|3706305|2019-03-08   |[C19, K566]                   |[1, 2]         |null              |null          |null    |null            |null              |
|4963901|2010-07-07   |[A419, J189]                  |[2, 1]         |null              |null       

In [84]:
# Check #'death_cv02_primary'
death_df_final.filter(f.col('death_cv02_primary') == 1).show(5, truncate=False)

+-------+-------------+------------------------------+---------------+------------------------+--------------+--------+----------------+------------------+
|eid    |date_of_death|death_icd10                   |death_level    |death_cv_icd10          |death_cv_level|death_cv|death_cv_primary|death_cv02_primary|
+-------+-------------+------------------------------+---------------+------------------------+--------------+--------+----------------+------------------+
|1004416|2022-11-15   |[I259, I219]                  |[2, 1]         |[I219, I259]            |[1, 2]        |1       |1               |1                 |
|1006881|2021-01-18   |[I255, I071, I509]            |[1, 2, 2]      |[I509, I255]            |[2, 1]        |1       |1               |1                 |
|1014289|2014-09-28   |[I509, I269, I802, I219, I251]|[2, 2, 2, 1, 2]|[I251, I509, I269, I219]|[2, 2, 2, 1]  |1       |1               |1                 |
|1019438|2009-12-25   |[I251, I509]                  |[1, 2]    

In [85]:
df = df.join(death_df_final,'eid','left')
df = df.withColumn('death', f.when((f.col('date_of_death').isNull()), f.lit(0)).otherwise(f.lit(1)))
df = df.withColumn('death_cv', f.when((f.col('death_cv').isNull()), f.lit(0)).otherwise(f.col('death_cv')))
df = df.withColumn('death_cv_primary', f.when((f.col('death_cv_primary').isNull()), f.lit(0)).otherwise(f.col('death_cv_primary')))
df = df.withColumn('death_cv02_primary', f.when((f.col('death_cv02_primary').isNull()), f.lit(0)).otherwise(f.col('death_cv02_primary')))


In [86]:
df.printSchema()

root
 |-- eid: string (nullable = true)
 |-- month_of_birth: string (nullable = true)
 |-- year_of_birth: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- height_0: double (nullable = true)
 |-- height_1: double (nullable = true)
 |-- height_2: double (nullable = true)
 |-- height_3: double (nullable = true)
 |-- weight_0: double (nullable = true)
 |-- weight_1: double (nullable = true)
 |-- weight_2: double (nullable = true)
 |-- weight_3: double (nullable = true)
 |-- bmi_0: double (nullable = true)
 |-- bmi_1: double (nullable = true)
 |-- bmi_2: double (nullable = true)
 |-- bmi_3: double (nullable = true)
 |-- date_of_ac_0: date (nullable = true)
 |-- date_of_ac_1: date (nullable = true)
 |-- date_of_ac_2: date (nullable = true)
 |-- date_of_ac_3: date (nullable = true)
 |-- tc_0: double (nullable = true)
 |-- tc_1: double (nullable = true)
 |-- hdl_0: double (nullable = true)
 |-- hdl_1: double (nullable = true)
 |-- ldl_0: double (nullable = true)
 |-- ldl_1: doubl

# Summary of dates

In [87]:
#QC: check the minimum and maximum of every date field

##Assessment dates
df_acdate = df.select(f.min(f.col('date_of_ac_0')).alias('date_of_ac_0_min'),
                      f.max(f.col('date_of_ac_0')).alias('date_of_ac_0_max'),
                      f.min(f.col('date_of_ac_1')).alias('date_of_ac_1_min'),
                      f.max(f.col('date_of_ac_1')).alias('date_of_ac_1_max'),
                      f.min(f.col('date_of_ac_2')).alias('date_of_ac_2_min'),
                      f.max(f.col('date_of_ac_2')).alias('date_of_ac_2_max'),
                      f.min(f.col('date_of_ac_3')).alias('date_of_ac_3_min'),
                      f.max(f.col('date_of_ac_3')).alias('date_of_ac_3_max')
                     )

hesin_dfdate = hesin_df.select(f.min(f.col('epistart')).alias('epistart_min'),
                               f.max(f.col('epistart')).alias('epistart_max'),
                               f.min(f.col('epiend')).alias('epiend_min'),
                               f.max(f.col('epiend')).alias('epiend_max'),
                               f.min(f.col('admidate')).alias('admidate_min'),
                               f.max(f.col('admidate')).alias('admidate_max'),
                               f.min(f.col('disdate')).alias('disdate_min'),
                               f.max(f.col('disdate')).alias('disdate_max'),
                               f.min(f.col('disdate_im')).alias('disdate_im_min'),
                               f.max(f.col('disdate_im')).alias('disdate_im_max')
                               )

hesin_oper_dfdate = hesin_oper_df.select(f.min(f.col('opdate')).alias('opdate_min'),
                                         f.max(f.col('opdate')).alias('opdate_max'))

hesin_critical_dfdate = hesin_critical_df.select(f.min(f.col('ccstartdate')).alias('ccstartdate_min'),
                                                 f.max(f.col('ccstartdate')).alias('ccstartdate_max'),
                                                 f.min(f.col('ccdisdate')).alias('ccdisdate_min'),
                                                 f.max(f.col('ccdisdate')).alias('ccdisdate_max'))


death_dfdate = death_df.select(f.min(f.col('date_of_death')).alias('date_of_death_min'),
                               f.max(f.col('date_of_death')).alias('date_of_death_max'))
df_acdate.show()
hesin_dfdate.show()
hesin_oper_dfdate.show()
hesin_critical_dfdate.show()
death_dfdate.show()

#Where a clinical event or prescription date precedes the participant's date of birth, it has been altered to 01/01/1901.
#Where the date matches the participant's date of birth, it has been altered to 02/02/1902.
#Where the date follows the participant's date of birth but is in the year of their birth, it has been altered to 03/03/1903.
#Where the date was in the future, it has been changed to 07/07/2037, as these are likely to have been entered as a placeholder or other system default.

+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
|date_of_ac_0_min|date_of_ac_0_max|date_of_ac_1_min|date_of_ac_1_max|date_of_ac_2_min|date_of_ac_2_max|date_of_ac_3_min|date_of_ac_3_max|
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
|      2006-03-13|      2010-10-01|      2009-12-12|      2013-06-07|      2014-04-30|      2023-10-12|      2019-05-22|      2023-10-12|
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+

+------------+------------+----------+----------+------------+------------+-----------+-----------+--------------+--------------+
|epistart_min|epistart_max|epiend_min|epiend_max|admidate_min|admidate_max|disdate_min|disdate_max|disdate_im_min|disdate_im_max|
+------------+------------+----------+----------+
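
The comments at the end of the cell above list four placeholder dates that stand in for implausible values. If such dates are present in a column, they will dominate a min/max check. The sketch below is a plain-Python illustration of excluding them before computing the range; `classify_date`, `clean_min_max`, and the sample dates are hypothetical and not part of the notebook's pipeline.

```python
from datetime import date

# Placeholder dates described in the comments above and what each one stands for.
SPECIAL_DATES = {
    date(1901, 1, 1): "before date of birth",
    date(1902, 2, 2): "equal to date of birth",
    date(1903, 3, 3): "after date of birth but in the birth year",
    date(2037, 7, 7): "future date (likely a system default)",
}

def classify_date(d):
    """Return the placeholder meaning, or None for an ordinary date."""
    return SPECIAL_DATES.get(d)

def clean_min_max(dates):
    """Min and max after dropping missing values and placeholder dates."""
    real = [d for d in dates if d is not None and d not in SPECIAL_DATES]
    return (min(real), max(real)) if real else (None, None)

# Invented sample values.
sample = [date(1901, 1, 1), date(2008, 5, 17), None, date(2037, 7, 7), date(2012, 9, 3)]
print(clean_min_max(sample))
```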

## Identify patients with Scotland hospitalization records

In [88]:
hesin_df_source = hesin_df.select('eid','dsource').distinct()

In [89]:
#Identify patients with inpatient records
pat_w_in = hesin_df.select('eid').distinct()
pat_w_in = pat_w_in.withColumn('hesin_record', f.lit(1))

#Identify patients with SMR records
pat_w_smr = hesin_df_source.filter(f.col('dsource') == 'SMR')
pat_w_smr = pat_w_smr.withColumnRenamed('dsource','hesin_smr')

#merge to df
df = df.join(pat_w_in,'eid','left')
df = df.join(pat_w_smr,'eid','left')
df = df.withColumn('hesin_record', f.when((f.col('hesin_record').isNull()), f.lit(0)).otherwise(f.col('hesin_record')))
df = df.withColumn('hesin_smr', f.when((f.col('hesin_record') == 1) & (f.col('hesin_smr').isNull()), f.lit(0))
                                 .when((f.col('hesin_record') != 1) & (f.col('hesin_smr').isNull()), f.lit(None))
                                 .when((f.col('hesin_smr') == 'SMR'), f.lit(1))
                                 .otherwise(f.col('hesin_smr')))

df.select('eid','hesin_record','hesin_smr').filter(f.col('hesin_smr') == 'SMR').show(10)
df.select('eid','hesin_record','hesin_smr').filter((f.col('hesin_smr') == 1) | (f.col('hesin_smr') == 0)).show(10)

+---+------------+---------+
|eid|hesin_record|hesin_smr|
+---+------------+---------+
+---+------------+---------+

+-------+------------+---------+
|    eid|hesin_record|hesin_smr|
+-------+------------+---------+
|1000047|           1|        0|
|1000050|           1|        0|
|1000068|           1|        0|
|1000122|           1|        0|
|1000214|           1|        0|
|1000467|           1|        0|
|1000517|           1|        0|
|1000578|           1|        0|
|1000591|           1|        1|
|1000621|           1|        0|
+-------+------------+---------+
only showing top 10 rows



In [90]:
df.select('eid','hesin_record','hesin_smr').filter(f.col('hesin_smr').isNull()).show(5)
df.select('eid','hesin_record','hesin_smr').filter((f.col('hesin_record') == 0) & (f.col('hesin_smr').isNotNull())).show(5) #should be empty

+-------+------------+---------+
|    eid|hesin_record|hesin_smr|
+-------+------------+---------+
|1000924|           0|     null|
|1000952|           0|     null|
|1001446|           0|     null|
|1002105|           0|     null|
|1002217|           0|     null|
+-------+------------+---------+
only showing top 5 rows

+---+------------+---------+
|eid|hesin_record|hesin_smr|
+---+------------+---------+
+---+------------+---------+
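
The `hesin_smr` column built above has three states: 1 when a participant has at least one SMR record, 0 when they have inpatient records but none from SMR, and null when they have no inpatient records at all. A minimal plain-Python sketch of that rule follows; `smr_flag`, the toy identifiers, and the `HES` source label are assumptions made for illustration.

```python
def smr_flag(sources):
    """Three-state flag from a participant's set of inpatient data sources.

    None -> no inpatient records at all
    1    -> at least one record with dsource == 'SMR'
    0    -> inpatient records exist, but none from SMR
    """
    if not sources:
        return None
    return 1 if "SMR" in sources else 0

# Toy data: participant id -> set of dsource values (all invented).
toy_sources = {
    "A": {"HES"},
    "B": {"HES", "SMR"},
    "C": set(),
}
flags = {pid: smr_flag(src) for pid, src in toy_sources.items()}
print(flags)
```

Keeping null distinct from 0 lets later steps tell "no Scottish records" apart from "no inpatient records to judge from".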



## Identify patients with primary care diagnosis records

In [91]:
gp_field_names = ['eid']
gp_clinical_df = gp_clinical.retrieve_fields(names=gp_field_names, engine=dxdata.connect())
pat_gpc = gp_clinical_df.select('eid').distinct()
pat_gpc = pat_gpc.withColumn('gp_clinical_record',f.lit('Yes'))
df = df.join(pat_gpc,'eid','left')
df = df.withColumn('gp_clinical_record', f.when((f.col('gp_clinical_record') == 'Yes'), f.lit(1)).otherwise(f.lit(0)))

In [92]:
pat_gpc.show(10)

+-------+------------------+
|    eid|gp_clinical_record|
+-------+------------------+
|5870568|               Yes|
|2878492|               Yes|
|5324517|               Yes|
|3233456|               Yes|
|4502499|               Yes|
|5612215|               Yes|
|3782532|               Yes|
|1862417|               Yes|
|5851469|               Yes|
|1236150|               Yes|
+-------+------------------+
only showing top 10 rows



# Cohort Extraction

## Step 1: Total UK Biobank participants with only inpatient records and without primary care data

In [93]:
#df_cohort = df.filter(f.col('lpa_baseline').isNotNull())
df_cohort = df
#Don't filter first
#df_cohort = df.filter((f.col('hesin_record') == 1) & (f.col('gp_clinical_record') == 0))

#Total UK Biobank participants
df_cohort_count = df_cohort.count()
#Total UK Biobank participants with only inpatient records available and without primary care data
df_cohort_a_count = df_cohort.filter((f.col('hesin_record') == 1) & (f.col('gp_clinical_record') == 0)).count()
#Total UK Biobank participants with first assessment date
df_cohort_acd_count = df_cohort.filter(f.col('date_of_ac_0').isNotNull()).count()

In [94]:
print(f'Total UK Biobank participants: {df_cohort_count}')
print(f'Total UK Biobank participants with only inpatient records available and without primary care data: {df_cohort_a_count}')
print(f'Total UK Biobank participants with first assessment date: {df_cohort_acd_count}')

Total UK Biobank participants: 502230
Total UK Biobank participants with only inpatient records available and without primary care data: 242221
Total UK Biobank participants with first assessment date: 502230


## Step 2: Filter patients with ASCVD diagnosis after first visit

In [95]:
#Check whether diag_icd9 is not null

hesin_diag_df.filter(f.col('diag_icd9').isNotNull()).show()

+-------+---------+---------+-----+---------+------------+----------+-------------+
|    eid|ins_index|arr_index|level|diag_icd9|diag_icd9_nb|diag_icd10|diag_icd10_nb|
+-------+---------+---------+-----+---------+------------+----------+-------------+
|3388333|        1|        0|    1|     2429|        null|      null|         null|
|3413178|        1|        1|    2|     V571|        null|      null|         null|
|2373271|        2|        2|    2|     3000|        null|      null|         null|
|4835291|       14|        1|    2|     V138|        null|      null|         null|
|4534238|        1|        0|    1|     4721|        null|      null|         null|
|1467792|        2|        0|    1|     6088|        null|      null|         null|
|1539804|        5|        0|    1|     7889|        null|      null|         null|
|5958214|        2|        0|    1|     4439|        null|      null|         null|
|2357210|        1|        0|    1|     3842|        null|      null|       

In [96]:
#QC hesin_df to make sure each (eid, ins_index) pair appears on exactly one row
hesin_df_count = hesin_df.count()
hesin_df_eidinscount = hesin_df.select('eid','ins_index').distinct().count()

print(f'Number of row for hesin_df_count: {hesin_df_count}')
print(f'Number of unique eid & ins_index for hesin_df_count: {hesin_df_eidinscount}')

Number of row for hesin_df_count: 4238180
Number of unique eid & ins_index for hesin_df_count: 4238180
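
A uniqueness check of this kind compares the total row count with the number of distinct keys; the two agree only when no key repeats. The plain-Python sketch below shows the idea on invented rows, with `key_counts` as a hypothetical helper.

```python
def key_counts(rows):
    """Return (total rows, number of distinct (eid, ins_index) keys)."""
    keys = [(eid, ins_index) for eid, ins_index, *_rest in rows]
    return len(keys), len(set(keys))

# Toy rows: (eid, ins_index, epistart). All values are invented.
unique_rows = [("1", 0, "2007-01-01"), ("1", 1, "2009-06-30"), ("2", 0, "2011-02-14")]
dup_rows = unique_rows + [("1", 1, "2009-07-02")]  # repeats the key ("1", 1)

print(key_counts(unique_rows))
print(key_counts(dup_rows))
```

Counting the selected key columns without de-duplicating them would always return the total row count, so the comparison would pass even when keys repeat.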


In [97]:
#Extract 'eid', 'ins_index', the episode, admission and discharge dates (including the imputed 'epistart_im' and 'disdate_im') and 'spell_index' from hesin_df
#'disdate_im' is 'disdate' imputed with 'epiend'
hesin_dfid = hesin_df.select('eid', 'ins_index', 'epistart', 'epistart_im', 'epiend', 'admidate', 'disdate', 'disdate_im', 'spell_index')

#Combine hesin_dfid, hesin_diag_df and hesin_oper_df
hesin_oper_df = hesin_oper_df.withColumnRenamed('level','oper_level')
hesin_do_df = hesin_diag_df.join(hesin_oper_df, ['eid','ins_index','arr_index'], 'full')
hesin_iddo_df = hesin_dfid.join(hesin_do_df, ['eid','ins_index'], 'full')

#Extract ASCVD diagnosis records
hesin_iddo_df_ascvd = hesin_iddo_df.join(code_ascvd_icd09, 'diag_icd9','left')
hesin_iddo_df_ascvd = hesin_iddo_df_ascvd.join(code_ascvd_icd10, 'diag_icd10','left')
hesin_iddo_df_ascvd = hesin_iddo_df_ascvd.join(code_ascvd_opcs3, 'oper3','left')
hesin_iddo_df_ascvd = hesin_iddo_df_ascvd.join(code_ascvd_opcs4, 'oper4','left')

hesin_iddo_df_ascvd.printSchema()

root
 |-- oper4: string (nullable = true)
 |-- oper3: string (nullable = true)
 |-- diag_icd10: string (nullable = true)
 |-- diag_icd9: string (nullable = true)
 |-- eid: string (nullable = true)
 |-- ins_index: long (nullable = true)
 |-- epistart: date (nullable = true)
 |-- epistart_im: date (nullable = true)
 |-- epiend: date (nullable = true)
 |-- admidate: date (nullable = true)
 |-- disdate: date (nullable = true)
 |-- disdate_im: date (nullable = true)
 |-- spell_index: long (nullable = true)
 |-- arr_index: long (nullable = true)
 |-- level: string (nullable = true)
 |-- diag_icd9_nb: string (nullable = true)
 |-- diag_icd10_nb: string (nullable = true)
 |-- oper_level: string (nullable = true)
 |-- opdate: date (nullable = true)
 |-- oper3_nb: string (nullable = true)
 |-- oper4_nb: string (nullable = true)
 |-- group_icd9: string (nullable = true)
 |-- group_icd10: string (nullable = true)
 |-- group_oper3: string (nullable = true)
 |-- group_oper4: string (nullable = true)

In [98]:
# #QC to check OPCS codes exist in HESIN_DIAG
# #Conclusion: should not use OPCS codes to extract ASCVD diagnosis by matching OPCS codes with ICD codes in hesin_diag table. 

# hesin_diag_df.join(code_ascvd_ops_cr_opcs3, hesin_diag_df.diag_icd9 == code_ascvd_ops_cr_opcs3.oper3,'inner').show()
# hesin_diag_df.join(code_ascvd_ops_anycr_opcs4, hesin_diag_df.diag_icd10 == code_ascvd_ops_anycr_opcs4.oper4,'inner').select('diag_icd10').distinct().show()
# hesin_diag_df.join(code_ascvd_ops_anycr_opcs4, hesin_diag_df.diag_icd10 == code_ascvd_ops_anycr_opcs4.oper4,'inner').select('diag_icd10').distinct().count()

# #OPCS3 codes exist in hesin_diag: number of codes: 2
# #OPCS3,3041,Operations affecting myocardium: coronary endartectomy
# #OPCS3,3043,Operations affecting myocardium: coronary anastomosis or graft

# #ICD9
# #ICD9,3041,Sedative, hypnotic or anxiolytic dependence
# #ICD9,3043,Cannabis dependence

# ################################################################################

# #OPCS4 codes exist in hesin_diag: number of codes: 79
# #OPCS4,K402,Saphenous vein graft replacement of two coronary arteries
# #OPCS4,K412,Autograft replacement of two coronary arteries NEC
# #OPCS4,K509,Unspecified other therapeutic transluminal operations on coronary artery

# #ICD10
# #ICD-10 code: K402 Bilateral inguinal hernia, without obstruction or gangrene
# #ICD-10 code: K412 Bilateral femoral hernia, without obstruction or gangrene
# #ICD-10 code: K509 Crohn's disease, unspecified
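
The QC notes above show that the same string can be a valid code in two coding systems with unrelated meanings (for example K509 in OPCS-4 and ICD-10). A hedged plain-Python sketch of the safe approach is to key the code list by coding system as well as code; `is_ascvd` and `is_ascvd_naive` are hypothetical helpers, and the two entries are copied from the notes above.

```python
# Toy code list keyed by (coding_system, code); descriptions come from the notes above.
codelist = {
    ("OPCS-4", "K402"): "Saphenous vein graft replacement of two coronary arteries",
    ("OPCS-4", "K509"): "Unspecified other therapeutic transluminal operations on coronary artery",
}

def is_ascvd(coding_system, code):
    """Match a code only against entries from the same coding system."""
    return (coding_system, code) in codelist

def is_ascvd_naive(code):
    """Ignore the coding system (the mistake the QC above warns against)."""
    return any(c == code for _system, c in codelist)

# ICD-10 K509 is Crohn's disease, so only the naive check reports a match.
print(is_ascvd("ICD-10", "K509"), is_ascvd_naive("K509"))
```

Joining each code list to the column of its own coding system, as the notebook does with `diag_icd9`, `diag_icd10`, `oper3` and `oper4`, achieves the same separation.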

In [99]:
#Extract ASCVD diagnosis records
hesin_iddo_df_ascvd = hesin_iddo_df_ascvd.filter((f.col('group_icd9').isNotNull()) | (f.col('group_icd10').isNotNull()) | (f.col('group_oper3').isNotNull()) | (f.col('group_oper4').isNotNull()))
#hesin_iddo_df_ascvd = hesin_iddo_df_ascvd.withColumn('group', f.array(f.col('group_icd9'),f.col('group_icd10'),f.col('group_opcs3'),f.col('group_opcs4')))

hesin_iddo_df_ascvd.show(5)

+-----+-----+----------+---------+-------+---------+----------+-----------+----------+----------+----------+----------+-----------+---------+-----+------------+-------------+----------+----------+--------+--------+----------+--------------------+-----------+-----------+
|oper4|oper3|diag_icd10|diag_icd9|    eid|ins_index|  epistart|epistart_im|    epiend|  admidate|   disdate|disdate_im|spell_index|arr_index|level|diag_icd9_nb|diag_icd10_nb|oper_level|    opdate|oper3_nb|oper4_nb|group_icd9|         group_icd10|group_oper3|group_oper4|
+-----+-----+----------+---------+-------+---------+----------+-----------+----------+----------+----------+----------+-----------+---------+-----+------------+-------------+----------+----------+--------+--------+----------+--------------------+-----------+-----------+
| A018| null|      I209|     null|5836332|       28|2006-10-25| 2006-10-25|2006-10-31|2006-10-25|2006-10-31|2006-10-31|         28|        2|    2|        null|         null|         2|20

In [100]:
#Create indexdate
##Use episode start date as index date as described in HESDataDic.xlsx, sheet Diagnosis, row 30
##If episode start date is missing, use admission date (admidate)
##For rows matched on an operation code, use the earlier of opdate and epistart_im
hesin_iddo_df_ascvd = hesin_iddo_df_ascvd.withColumn('indexdate', f.when(((f.col('group_oper3').isNotNull()) | (f.col('group_oper4').isNotNull())) & (f.col('opdate').isNotNull()) & (f.col('opdate') <= f.col('epistart_im')), f.col('opdate'))
                                                                   .when(((f.col('group_oper3').isNotNull()) | (f.col('group_oper4').isNotNull())) & (f.col('opdate').isNotNull()) & (f.col('opdate') > f.col('epistart_im')), f.col('epistart_im'))
                                                                   .when(((f.col('group_oper3').isNotNull()) | (f.col('group_oper4').isNotNull())) & (f.col('opdate').isNull()), f.col('epistart_im'))
                                                                   .when((f.col('group_oper3').isNull()) & (f.col('group_oper4').isNull()) & ((f.col('group_icd9').isNotNull()) | (f.col('group_icd10').isNotNull())), f.col('epistart_im'))
                                                                   .otherwise(f.lit(None)))
hesin_iddo_df_ascvd = hesin_iddo_df_ascvd.drop('epistart_im')
#hesin_iddo_df_ascvd.show(5)
#hesin_iddo_df_ascvd.filter(f.col('indexdate').isNull()).show(5)

##QC indexdate: all rows should have indexdate
hesin_iddo_df_ascvdrow =  hesin_iddo_df_ascvd.count()
hesin_iddo_df_ascvdnullrow = hesin_iddo_df_ascvd.filter(f.col('indexdate').isNull()).count()

print(f'Number of row for hesin_iddo_df_ascvd: {hesin_iddo_df_ascvdrow}')
print(f'Number of row for hesin_iddo_df_ascvd with null index_date: {hesin_iddo_df_ascvdnullrow}')

Number of row for hesin_iddo_df_ascvd: 729572
Number of row for hesin_iddo_df_ascvd with null index_date: 0
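
The index date rule in the cell above can be restated for a single record in plain Python. This is only a sketch: `pick_index_date` and its arguments are hypothetical names, and it assumes `epistart_im` is never missing, which the QC count of 0 null index dates suggests for this run.

```python
from datetime import date
from typing import Optional


def pick_index_date(is_procedure: bool,
                    opdate: Optional[date],
                    epistart_im: date) -> date:
    """Return the index date for one ASCVD record (hypothetical helper)."""
    # Procedure record with a known operation date:
    # take whichever of opdate and epistart_im is earlier.
    if is_procedure and opdate is not None:
        return min(opdate, epistart_im)
    # Diagnosis-only record, or procedure record without an operation date:
    # fall back to the (imputed) episode start date.
    return epistart_im


# Operation recorded before the episode start: opdate wins.
print(pick_index_date(True, date(2006, 10, 20), date(2006, 10, 25)))
# Diagnosis-only record: episode start is used.
print(pick_index_date(False, None, date(2006, 10, 25)))
```

Taking the minimum collapses the first two `when` branches of the Spark expression into one step; the remaining branches all resolve to `epistart_im`.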


In [101]:
#Inspect diagnosis-only records whose original episode start date (epistart) is missing
hesin_iddo_df_ascvd.filter((f.col('group_oper3').isNull()) & (f.col('group_oper4').isNull()) & (f.col('epistart').isNull())).show(5)

+-----+-----+----------+---------+-------+---------+--------+------+----------+----------+----------+-----------+---------+-----+------------+-------------+----------+----------+--------+--------+---------------+-----------+-----------+-----------+----------+
|oper4|oper3|diag_icd10|diag_icd9|    eid|ins_index|epistart|epiend|  admidate|   disdate|disdate_im|spell_index|arr_index|level|diag_icd9_nb|diag_icd10_nb|oper_level|    opdate|oper3_nb|oper4_nb|     group_icd9|group_icd10|group_oper3|group_oper4| indexdate|
+-----+-----+----------+---------+-------+---------+--------+------+----------+----------+----------+-----------+---------+-----+------------+-------------+----------+----------+--------+--------+---------------+-----------+-----------+-----------+----------+
| K263| null|      null|     4140|5631866|        2|    null|  null|1990-03-02|1990-03-12|1990-03-12|          2|        1|    2|        null|         null|         2|      null|    null|    null|ascvd_cad_codes|       n

In [102]:
#Collapse to one row per participant and index date, collecting record-level fields into lists
hesin_iddo_df_ascvd = hesin_iddo_df_ascvd.groupBy('eid','indexdate').agg(f.collect_list('epiend').alias('epiend'),
                                                                         f.collect_list('admidate').alias('admidate'),
                                                                         f.collect_list('disdate_im').alias('disdate_im'),
                                                                         f.collect_list('ins_index').alias('ins_index'),
                                                                         f.collect_list('arr_index').alias('arr_index'),
                                                                         f.collect_list('spell_index').alias('spell_index'),
                                                                         f.collect_list('level').alias('level'),
                                                                         f.collect_list('oper_level').alias('oper_level'),
                                                                         f.collect_list('diag_icd9').alias('diag_icd9'),
                                                                         f.collect_list('diag_icd10').alias('diag_icd10'),
                                                                         f.collect_list('oper3').alias('oper3'),
                                                                         f.collect_list('oper4').alias('oper4'),
                                                                         f.collect_list('group_icd9').alias('group_icd9'),
                                                                         f.collect_list('group_icd10').alias('group_icd10'),
                                                                         f.collect_list('group_oper3').alias('group_oper3'),
                                                                         f.collect_list('group_oper4').alias('group_oper4')
                                                                         )

In [103]:
#Extract primary care ASCVD diagnosis records

gp_clinical_field_namesv2 = ['eid',
                           'event_dt', # Date clinical code was entered
                           'read_2', # Read v2 clinical code for primary care events, such as consultations, diagnoses, history, symptoms, procedures, laboratory tests and administrative information
                           'read_3' # Read v3 (CTV3) clinical code
                          ]

gp_clinical_dfv2 = gp_clinical.retrieve_fields(names=gp_clinical_field_namesv2, engine=dxdata.connect())
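# Drop records carrying UK Biobank special placeholder dates (used where the true date is unknown or was masked) and records with no event date
gp_clinical_df2 = gp_clinical_dfv2.filter(f.col('event_dt') != f.lit('1901-01-01'))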
gp_clinical_df2 = gp_clinical_df2.filter(f.col('event_dt') != f.lit('1902-02-02'))
gp_clinical_df2 = gp_clinical_df2.filter(f.col('event_dt') != f.lit('1903-03-03'))
gp_clinical_df2 = gp_clinical_df2.filter(f.col('event_dt') != f.lit('2037-07-07'))
gp_clinical_df2 = gp_clinical_df2.filter(f.col('event_dt').isNotNull())

gp_clinical_df2.show(10)

+-------+----------+------+------+
|    eid|  event_dt|read_2|read_3|
+-------+----------+------+------+
|5083145|2001-07-11|  null| XE2JU|
|3582056|2005-02-08|  null| XaF6J|
|3074455|2015-05-12|  null| 42N..|
|3597696|2014-03-26|  null| XaFsp|
|1151226|2008-04-16|  null| 2469.|
|1653027|2001-07-11|  null| 6799.|
|2600180|2009-09-08|  null| XE2px|
|1754957|2013-10-17|  null| XE2pb|
|5371429|2014-01-15|  null| X77Wg|
|4182207|2015-06-02|  null| XaEYo|
+-------+----------+------+------+
only showing top 10 rows



In [104]:
#Map Read v2 and CTV3 codes to ASCVD code groups; codes without a match get a null group
gp_clinical_df_ascvd = gp_clinical_df2.join(code_ascvd_readv2, 'read_2','left')
gp_clinical_df_ascvd = gp_clinical_df_ascvd.join(code_ascvd_ctv3, 'read_3','left')
gp_clinical_df_ascvd = gp_clinical_df_ascvd.withColumnRenamed('event_dt','indexdate')

In [105]:
gp_clinical_df_ascvd.printSchema()

root
 |-- read_3: string (nullable = true)
 |-- read_2: string (nullable = true)
 |-- eid: string (nullable = true)
 |-- indexdate: date (nullable = true)
 |-- group_readv2: string (nullable = true)
 |-- group_ctv3: string (nullable = true)



In [106]:
#Extract ASCVD diagnosis records
gp_clinical_df_ascvd = gp_clinical_df_ascvd.filter((f.col('group_readv2').isNotNull()) | (f.col('group_ctv3').isNotNull()))
#gp_clinical_df_ascvd = gp_clinical_df_ascvd.withColumn('group_gp', f.array(f.col('group_readv2'),f.col('group_ctv3')))

gp_clinical_df_ascvd = gp_clinical_df_ascvd.groupBy('eid','indexdate').agg(f.collect_list('read_2').alias('read_2'),
                                                                           f.collect_list('read_3').alias('read_3'),
                                                                           f.collect_list('group_readv2').alias('group_readv2'),
                                                                           f.collect_list('group_ctv3').alias('group_ctv3')
                                                                          )

gp_clinical_df_ascvd.show(5)

+-------+----------+-------+-------+--------------------+--------------------+
|    eid| indexdate| read_2| read_3|        group_readv2|          group_ctv3|
+-------+----------+-------+-------+--------------------+--------------------+
|1000047|2001-07-24|     []|[XE0VK]|                  []|   [ascvd_tia_codes]|
|1000344|2007-05-15|     []|[XE2uV]|                  []|   [ascvd_cad_codes]|
|1000696|2000-04-15|     []|[G33..]|                  []|[ascvd_unstable_a...|
|1000759|2005-05-10|[G30..]|     []|    [ascvd_mi_codes]|                  []|
|1001589|2009-07-29|[G33..]|     []|[ascvd_stable_ang...|                  []|
+-------+----------+-------+-------+--------------------+--------------------+
only showing top 5 rows
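
The empty lists (`[]`) in the output above follow from how `collect_list` works: it gathers the non-null values of each group and skips nulls. The plain-Python sketch below mimics that behaviour on made-up records; `collect_by_key` and the participant IDs are hypothetical and are not part of the notebook.

```python
from collections import defaultdict


def collect_by_key(rows, key_fields, value_fields):
    """Group dict rows by key_fields and collect non-null values into lists."""
    grouped = defaultdict(lambda: {field: [] for field in value_fields})
    for row in rows:
        key = tuple(row[k] for k in key_fields)
        bucket = grouped[key]  # creates an entry of empty lists on first sight
        for field in value_fields:
            if row[field] is not None:  # nulls are skipped, as in collect_list
                bucket[field].append(row[field])
    return dict(grouped)


# Made-up records: one participant coded in Read v2, one in CTV3.
records = [
    {"eid": "P1", "indexdate": "2005-05-10", "read_2": "G30..", "read_3": None},
    {"eid": "P2", "indexdate": "2007-05-15", "read_2": None, "read_3": "XE2uV"},
    {"eid": "P2", "indexdate": "2007-05-15", "read_2": None, "read_3": "G33.."},
]

result = collect_by_key(records, ["eid", "indexdate"], ["read_2", "read_3"])
for key, values in result.items():
    print(key, values)
```

A participant whose records on a given date are all coded in one system therefore ends up with an empty list, not a null, in the other system's column.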



In [107]:
#Merge hesin and gp ASCVD diagnosis records
hesingp_df_ascvd = hesin_iddo_df_ascvd.join(gp_clinical_df_ascvd,['eid','indexdate'],'full')
hesingp_df_ascvd.show(5)
hesingp_df_ascvd.printSchema()

+-------+----------+--------------------+--------------------+--------------------+---------+---------+-----------+------+----------+---------+------------+-----+------------+----------+--------------------+-----------+--------------------+------+-------+------------+-----------------+
|    eid| indexdate|              epiend|            admidate|          disdate_im|ins_index|arr_index|spell_index| level|oper_level|diag_icd9|  diag_icd10|oper3|       oper4|group_icd9|         group_icd10|group_oper3|         group_oper4|read_2| read_3|group_readv2|       group_ctv3|
+-------+----------+--------------------+--------------------+--------------------+---------+---------+-----------+------+----------+---------+------------+-----+------------+----------+--------------------+-----------+--------------------+------+-------+------------+-----------------+
|1000047|2001-07-24|                null|                null|                null|     null|     null|       null|  null|      null|     n

In [108]:
#Identify first ASCVD diagnosis from HESIN and primary care
window = Window.partitionBy('eid').orderBy('indexdate')

hesingp_df_ascvd = hesingp_df_ascvd.withColumn('indexdt_order', f.rank().over(window))
#hesin_iddo_df_ascvd_ori = hesin_iddo_df_ascvd
hesingp_df_ascvd_1 = hesingp_df_ascvd.filter(f.col('indexdt_order') == 1)
#hesin_iddo_df_ascvd_2 = hesin_iddo_df_ascvd_ori.filter(f.col('indexdt_order') == 2)

# ##QC to check whether there are duplicated ASCVD diagnosis on same indexdate
# hesin_iddo_df_ascvd_1_rank1row =  hesin_iddo_df_ascvd_1.count()
# hesin_iddo_df_ascvd_1_rank1rowuniq = hesin_iddo_df_ascvd_1.select('eid').distinct().count()

# print(f'Number of row for hesin_iddo_df_ascvd_1: {hesin_iddo_df_ascvd_1_rank1row}')
# print(f'Number of row for hesin_iddo_df_ascvd_1 with unique eid: {hesin_iddo_df_ascvd_1_rank1rowuniq}')

#If the two row counts above match, proceed to the next cell.
#If not, the duplicated records need to be collected into lists first.
hesingp_df_ascvd_1.printSchema()

root
 |-- eid: string (nullable = true)
 |-- indexdate: date (nullable = true)
 |-- epiend: array (nullable = true)
 |    |-- element: date (containsNull = false)
 |-- admidate: array (nullable = true)
 |    |-- element: date (containsNull = false)
 |-- disdate_im: array (nullable = true)
 |    |-- element: date (containsNull = false)
 |-- ins_index: array (nullable = true)
 |    |-- element: long (containsNull = false)
 |-- arr_index: array (nullable = true)
 |    |-- element: long (containsNull = false)
 |-- spell_index: array (nullable = true)
 |    |-- element: long (containsNull = false)
 |-- level: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- oper_level: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- diag_icd9: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- diag_icd10: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- oper3: array (nullable = true)
 |    |-- 

In [109]:
#QC row count of hesingp_df_ascvd_1
## The number of distinct (eid, indexdate) pairs should equal the total number of rows
hesingp_df_ascvd_1count = hesingp_df_ascvd_1.count()
hesingp_df_ascvd_1countu = hesingp_df_ascvd_1.select('eid','indexdate').distinct().count()

print(f'Number of row for hesingp_df_ascvd_1: {hesingp_df_ascvd_1count}')
print(f'Number of row for hesingp_df_ascvd_1 unique: {hesingp_df_ascvd_1countu}')

Number of row for hesingp_df_ascvd_1: 102782
Number of row for hesingp_df_ascvd_1 unique: 102782
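
The "first diagnosis" step relies on `rank()` ordered by `indexdate` within each `eid`. `rank()` gives tied rows the same rank, so two rows sharing a participant's earliest date would both be kept. Here ties cannot occur because both inputs were already grouped by `eid` and `indexdate`, which the matching counts above confirm. A plain-Python sketch of the same selection, with the hypothetical helper `earliest_rows` and made-up IDs:

```python
def earliest_rows(rows):
    """rows: list of (eid, indexdate) tuples.

    Returns every row that falls on its participant's earliest indexdate,
    which is what filtering on rank == 1 keeps (ties included).
    """
    earliest = {}
    for eid, indexdate in rows:
        if eid not in earliest or indexdate < earliest[eid]:
            earliest[eid] = indexdate
    return [(eid, d) for eid, d in rows if d == earliest[eid]]


rows = [
    ("P1", "2001-07-24"),
    ("P1", "2004-02-11"),
    ("P2", "2007-05-15"),
]
first = earliest_rows(rows)
print(first)
```

ISO-formatted date strings are compared here only for brevity; they sort in chronological order.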


In [110]:
# Create incident ASCVD: ascvd01

ascvd01 = hesingp_df_ascvd_1.withColumnRenamed('indexdate', 'ascvd01_indexdate')
ascvd01 = ascvd01.withColumnRenamed('epiend', 'ascvd01_epiend')
ascvd01 = ascvd01.withColumnRenamed('admidate', 'ascvd01_admidate')
ascvd01 = ascvd01.withColumnRenamed('disdate_im', 'ascvd01_disdate')
ascvd01 = ascvd01.withColumnRenamed('ins_index', 'ascvd01_ins_index')
ascvd01 = ascvd01.withColumnRenamed('arr_index', 'ascvd01_arr_index')
ascvd01 = ascvd01.withColumnRenamed('spell_index', 'ascvd01_spell_index')
ascvd01 = ascvd01.withColumnRenamed('level', 'ascvd01_diag_level')
ascvd01 = ascvd01.withColumnRenamed('oper_level', 'ascvd01_oper_level')
ascvd01 = ascvd01.withColumnRenamed('diag_icd9', 'ascvd01_icd9')
ascvd01 = ascvd01.withColumnRenamed('diag_icd10', 'ascvd01_icd10')
ascvd01 = ascvd01.withColumnRenamed('oper3', 'ascvd01_oper3')
ascvd01 = ascvd01.withColumnRenamed('oper4', 'ascvd01_oper4')
ascvd01 = ascvd01.withColumnRenamed('group_icd9', 'ascvd01_group_icd9')
ascvd01 = ascvd01.withColumnRenamed('group_icd10', 'ascvd01_group_icd10')
ascvd01 = ascvd01.withColumnRenamed('group_oper3', 'ascvd01_group_oper3')
ascvd01 = ascvd01.withColumnRenamed('group_oper4', 'ascvd01_group_oper4')
ascvd01 = ascvd01.withColumnRenamed('read_2', 'ascvd01_read_2')
ascvd01 = ascvd01.withColumnRenamed('read_3', 'ascvd01_read_3')
ascvd01 = ascvd01.withColumnRenamed('group_readv2', 'ascvd01_group_readv2')
ascvd01 = ascvd01.withColumnRenamed('group_ctv3', 'ascvd01_group_ctv3')
#ascvd01 = ascvd01.withColumnRenamed('group_gp', 'ascvd01_group_gp')

# Get cohort's birth date, sex, baseline assessment date (date_of_ac_0), Lp(a) and date of death
cohort_bs = df_cohort.select('eid','date_of_birth', 'sex', 'date_of_ac_0', 'lpa', 'date_of_death')

# Calculate patient age at the ascvd01 index date (whole years)
ascvd01 = ascvd01.join(cohort_bs, 'eid', 'inner')
ascvd01 = ascvd01.withColumn('ascvd01_index_age', f.floor(f.datediff(f.col('ascvd01_indexdate'), f.col('date_of_birth'))/365.25))
# ascvd01 = ascvd01.withColumn('ascvd01_indexdate', f.when((f.col('ascvd01_indexdate') > f.col('date_of_ac_0')) &
#                                                          (f.col('lpa').isNotNull()) &

In [111]:
ascvd01.show(5)
ascvd01.filter(f.col('date_of_death').isNotNull()).show(5)

+-------+-----------------+--------------+----------------+---------------+-----------------+-----------------+-------------------+------------------+------------------+------------+-------------+-------------+-------------+------------------+-------------------+-------------------+-------------------+--------------+--------------+--------------------+------------------+-------------+-------------+------+------------+-----+-------------+-----------------+
|    eid|ascvd01_indexdate|ascvd01_epiend|ascvd01_admidate|ascvd01_disdate|ascvd01_ins_index|ascvd01_arr_index|ascvd01_spell_index|ascvd01_diag_level|ascvd01_oper_level|ascvd01_icd9|ascvd01_icd10|ascvd01_oper3|ascvd01_oper4|ascvd01_group_icd9|ascvd01_group_icd10|ascvd01_group_oper3|ascvd01_group_oper4|ascvd01_read_2|ascvd01_read_3|ascvd01_group_readv2|ascvd01_group_ctv3|indexdt_order|date_of_birth|   sex|date_of_ac_0|  lpa|date_of_death|ascvd01_index_age|
+-------+-----------------+--------------+----------------+---------------+-----

In [112]:
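# ascvd01 follow-up dates and durations (days)
# Each follow-up ends at the earliest of: end of the window, date of death, or the data cut-off (2022-10-31)

## 6 months (180 days)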
ascvd01 = ascvd01.withColumn('ascvd01_followup06m_date_temp', f.date_add(f.col('ascvd01_indexdate'), 180))
ascvd01 = ascvd01.withColumn('ascvd01_followup06m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd01_followup06m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd01_followup06m_date_temp')), f.col('ascvd01_followup06m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd01_followup06m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd01 = ascvd01.drop('ascvd01_followup06m_date_temp')

ascvd01 = ascvd01.withColumn('ascvd01_followup06m_date', f.when((f.col('ascvd01_followup06m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd01_followup06m_date')))


ascvd01 = ascvd01.withColumn('ascvd01_followup06m_dur', f.when((f.col('ascvd01_indexdate').isNotNull()) & (f.col('ascvd01_followup06m_date').isNotNull()), f.datediff(f.col('ascvd01_followup06m_date'),f.col('ascvd01_indexdate')))
                                                         .otherwise(f.lit(None)))

## 12 months
ascvd01 = ascvd01.withColumn('ascvd01_followup12m_date_temp', f.date_add(f.col('ascvd01_indexdate'), 365))
ascvd01 = ascvd01.withColumn('ascvd01_followup12m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd01_followup12m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd01_followup12m_date_temp')), f.col('ascvd01_followup12m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd01_followup12m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd01 = ascvd01.drop('ascvd01_followup12m_date_temp')
ascvd01 = ascvd01.withColumn('ascvd01_followup12m_date', f.when((f.col('ascvd01_followup12m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd01_followup12m_date')))

ascvd01 = ascvd01.withColumn('ascvd01_followup12m_dur', f.when((f.col('ascvd01_indexdate').isNotNull()) & (f.col('ascvd01_followup12m_date').isNotNull()), f.datediff(f.col('ascvd01_followup12m_date'),f.col('ascvd01_indexdate')))
                                                         .otherwise(f.lit(None)))

## 24 months
ascvd01 = ascvd01.withColumn('ascvd01_followup24m_date_temp', f.date_add(f.col('ascvd01_indexdate'), 730))
ascvd01 = ascvd01.withColumn('ascvd01_followup24m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd01_followup24m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd01_followup24m_date_temp')), f.col('ascvd01_followup24m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd01_followup24m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd01 = ascvd01.drop('ascvd01_followup24m_date_temp')
ascvd01 = ascvd01.withColumn('ascvd01_followup24m_date', f.when((f.col('ascvd01_followup24m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd01_followup24m_date')))

ascvd01 = ascvd01.withColumn('ascvd01_followup24m_dur', f.when((f.col('ascvd01_indexdate').isNotNull()) & (f.col('ascvd01_followup24m_date').isNotNull()), f.datediff(f.col('ascvd01_followup24m_date'),f.col('ascvd01_indexdate')))
                                                         .otherwise(f.lit(None)))

## Last follow-up date (date of death or data cut-off, whichever is earlier)
ascvd01 = ascvd01.withColumn('ascvd01_followupXXm_date', f.when((f.col('date_of_death').isNull()), f.lit('2022-10-31'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.lit('2022-10-31')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))
ascvd01 = ascvd01.withColumn('ascvd01_followupXXm_dur', f.when((f.col('ascvd01_indexdate').isNotNull()) & (f.col('ascvd01_followupXXm_date').isNotNull()), f.datediff(f.col('ascvd01_followupXXm_date'),f.col('ascvd01_indexdate')))
                                                         .otherwise(f.lit(None)))
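
Each follow-up block above reduces to one rule: the follow-up end date is the earliest of the window end, the date of death (if any), and the data cut-off. The sketch below expresses that rule in plain Python. `followup_end` and `followup_days` are hypothetical helper names, and the 2022-10-31 cut-off is taken from the cell above.

```python
from datetime import date, timedelta
from typing import Optional

CUTOFF = date(2022, 10, 31)  # data cut-off used in the notebook


def followup_end(index_date: date,
                 window_days: Optional[int],
                 death_date: Optional[date] = None,
                 cutoff: date = CUTOFF) -> date:
    """Earliest of window end, death date and cut-off.

    Pass window_days=None for the open-ended ("XXm") follow-up.
    """
    candidates = [cutoff]
    if window_days is not None:
        candidates.append(index_date + timedelta(days=window_days))
    if death_date is not None:
        candidates.append(death_date)
    return min(candidates)


def followup_days(index_date: date, end_date: date) -> int:
    """Follow-up duration in days, matching datediff(end, index)."""
    return (end_date - index_date).days


end = followup_end(date(2020, 1, 1), 180)
print(end, followup_days(date(2020, 1, 1), end))
```

Written this way, the 6, 12 and 24 month blocks differ only in `window_days` (180, 365, 730), so a single loop over those values could replace the repeated cells.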

In [113]:
ascvd01 = ascvd01.select('eid','ascvd01_indexdate','ascvd01_index_age','ascvd01_epiend','ascvd01_admidate','ascvd01_disdate',
                         'ascvd01_ins_index','ascvd01_arr_index','ascvd01_spell_index',
                         'ascvd01_diag_level','ascvd01_oper_level',
                         'ascvd01_icd9','ascvd01_icd10','ascvd01_oper3','ascvd01_oper4','ascvd01_read_2','ascvd01_read_3',

                         'ascvd01_group_icd9', 'ascvd01_group_icd10', 'ascvd01_group_oper3', 'ascvd01_group_oper4',
                         'ascvd01_group_readv2', 'ascvd01_group_ctv3',
                         'ascvd01_followup06m_date', 'ascvd01_followup06m_dur',
                         'ascvd01_followup12m_date', 'ascvd01_followup12m_dur',
                         'ascvd01_followup24m_date', 'ascvd01_followup24m_dur',
                         'ascvd01_followupXXm_date', 'ascvd01_followupXXm_dur')

In [114]:
# Flag whether the incident ASCVD event is MI, IS or PAD
ascvd01 = ascvd01.withColumn('ascvd01_mi', f.when((f.array_contains(f.col('ascvd01_group_icd9'), 'ascvd_mi_codes')) | (f.array_contains(f.col('ascvd01_group_icd10'), 'ascvd_mi_codes')) | (f.array_contains(f.col('ascvd01_group_readv2'), 'ascvd_mi_codes')) | (f.array_contains(f.col('ascvd01_group_ctv3'), 'ascvd_mi_codes')), f.lit(1))
                                            .otherwise(f.lit(0)))

ascvd01 = ascvd01.withColumn('ascvd01_is', f.when((f.array_contains(f.col('ascvd01_group_icd9'), 'ascvd_is_codes')) | (f.array_contains(f.col('ascvd01_group_icd10'), 'ascvd_is_codes')) | (f.array_contains(f.col('ascvd01_group_readv2'), 'ascvd_is_codes')) | (f.array_contains(f.col('ascvd01_group_ctv3'), 'ascvd_is_codes')), f.lit(1))
                                            .otherwise(f.lit(0)))

ascvd01 = ascvd01.withColumn('ascvd01_pad', f.when((f.array_contains(f.col('ascvd01_group_icd9'), 'ascvd_pad_codes')) | (f.array_contains(f.col('ascvd01_group_icd10'), 'ascvd_pad_codes')) | (f.array_contains(f.col('ascvd01_group_readv2'), 'ascvd_pad_codes')) | (f.array_contains(f.col('ascvd01_group_ctv3'), 'ascvd_pad_codes')), f.lit(1))
                                            .otherwise(f.lit(0)))
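
The three flags above ask the same question of one row: does any of the four diagnosis group lists contain a given label? A plain-Python sketch of that check follows. `has_group` is a hypothetical helper, and a missing list is treated as empty, which gives the same 0/1 result as the `when(...).otherwise(0)` expressions.

```python
GROUP_COLUMNS = (
    "ascvd01_group_icd9",
    "ascvd01_group_icd10",
    "ascvd01_group_readv2",
    "ascvd01_group_ctv3",
)


def has_group(row: dict, label: str, columns=GROUP_COLUMNS) -> int:
    """Return 1 if any group list in the row contains label, else 0."""
    return int(any(label in (row.get(col) or []) for col in columns))


# Made-up row: hospital record coded as MI, no primary care record.
row = {
    "ascvd01_group_icd9": [],
    "ascvd01_group_icd10": ["ascvd_mi_codes"],
    "ascvd01_group_readv2": None,
    "ascvd01_group_ctv3": None,
}
flags = {name: has_group(row, f"ascvd_{name}_codes") for name in ("mi", "is", "pad")}
print(flags)
```

Like the notebook cell, this looks only at the diagnosis groups (ICD-9, ICD-10, Read v2, CTV3) and not at the procedure groups.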

In [115]:
ascvd01.show(5)

+-------+-----------------+-----------------+--------------+----------------+---------------+-----------------+-----------------+-------------------+------------------+------------------+------------+-------------+-------------+-------------+--------------+--------------+------------------+-------------------+-------------------+-------------------+--------------------+------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+----------+----------+-----------+
|    eid|ascvd01_indexdate|ascvd01_index_age|ascvd01_epiend|ascvd01_admidate|ascvd01_disdate|ascvd01_ins_index|ascvd01_arr_index|ascvd01_spell_index|ascvd01_diag_level|ascvd01_oper_level|ascvd01_icd9|ascvd01_icd10|ascvd01_oper3|ascvd01_oper4|ascvd01_read_2|ascvd01_read_3|ascvd01_group_icd9|ascvd01_group_icd10|ascvd01_group_oper3|ascvd01_group_oper4|ascvd01_group_readv2|ascvd0

In [116]:
# Create recent ASCVD: ascvd02
hesingp_df_ascvd_2 = hesingp_df_ascvd.filter((f.array_contains(f.col('group_icd9'), 'ascvd_mi_codes')) | 
                                             (f.array_contains(f.col('group_icd9'), 'ascvd_is_codes')) | 
                                             (f.array_contains(f.col('group_icd9'), 'ascvd_pad_codes')) | 
                                             (f.array_contains(f.col('group_icd10'), 'ascvd_mi_codes')) | 
                                             (f.array_contains(f.col('group_icd10'), 'ascvd_is_codes')) | 
                                             (f.array_contains(f.col('group_icd10'), 'ascvd_pad_codes')) |
                                             (f.array_contains(f.col('group_readv2'), 'ascvd_mi_codes')) | 
                                             (f.array_contains(f.col('group_readv2'), 'ascvd_is_codes')) | 
                                             (f.array_contains(f.col('group_readv2'), 'ascvd_pad_codes')) |
                                             (f.array_contains(f.col('group_ctv3'), 'ascvd_mi_codes')) | 
                                             (f.array_contains(f.col('group_ctv3'), 'ascvd_is_codes')) | 
                                             (f.array_contains(f.col('group_ctv3'), 'ascvd_pad_codes'))
                                             )

hesingp_df_ascvd_2 = hesingp_df_ascvd_2.withColumn('ascvd02_order', f.rank().over(window))
ascvd02 = hesingp_df_ascvd_2.filter(f.col('ascvd02_order') == 1)
ascvd02 = ascvd02.withColumnRenamed('indexdate', 'ascvd02_indexdate')
ascvd02 = ascvd02.select('eid','ascvd02_indexdate').distinct()

# ascvd01_date
ascvd01_date = ascvd01.select('eid','ascvd01_indexdate')

# ascvd02_indexdate must fall within 12 months (0 to 365 days) after ascvd01_indexdate
ascvd02 = ascvd02.join(ascvd01_date, 'eid', 'left')
ascvd02 = ascvd02.withColumn('ascvd01to02_datediff', f.when((f.col('ascvd01_indexdate').isNotNull()) & (f.col('ascvd02_indexdate').isNotNull()), f.datediff(f.col('ascvd02_indexdate'), f.col('ascvd01_indexdate')))
                                                      .otherwise(f.lit(None)))
ascvd02 = ascvd02.filter((f.col('ascvd01to02_datediff') >= 0) & (f.col('ascvd01to02_datediff') <= 365))
ascvd02.show(5)

#QC: check that ascvd01to02_datediff has no negative values
ascvd02.filter(f.col('ascvd01to02_datediff') < 0).show() #should be empty table

# Calculate patient age at the ascvd02 index date (whole years)
ascvd02 = ascvd02.join(cohort_bs, 'eid', 'inner')
ascvd02 = ascvd02.withColumn('ascvd02_index_age', f.floor(f.datediff(f.col('ascvd02_indexdate'), f.col('date_of_birth'))/365.25))

ascvd02raw = ascvd02.select('eid', 'sex', 'ascvd02_indexdate', 'ascvd02_index_age')

ascvd02 = ascvd02.select('eid', 'ascvd02_indexdate', 'ascvd02_index_age', 'date_of_death')

+-------+-----------------+-----------------+--------------------+
|    eid|ascvd02_indexdate|ascvd01_indexdate|ascvd01to02_datediff|
+-------+-----------------+-----------------+--------------------+
|1000759|       2005-05-08|       2005-05-08|                   0|
|1001212|       2020-11-16|       2020-11-16|                   0|
|1001695|       2007-10-08|       2007-10-08|                   0|
|1001864|       2017-12-22|       2017-12-22|                   0|
|1004680|       2008-11-13|       2008-11-13|                   0|
+-------+-----------------+-----------------+--------------------+
only showing top 5 rows

+---+-----------------+-----------------+--------------------+
|eid|ascvd02_indexdate|ascvd01_indexdate|ascvd01to02_datediff|
+---+-----------------+-----------------+--------------------+
+---+-----------------+-----------------+--------------------+



In [117]:
ascvd02count = ascvd02.count()
ascvd02eidcount = ascvd02.select('eid').distinct().count()
print(f'Number of row for ascvd02: {ascvd02count}')
print(f'Number of unique patients in ascvd02: {ascvd02eidcount}')

Number of row for ascvd02: 34451
Number of unique patients in ascvd02: 34451
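
The ascvd02 definition can be summarised for a single participant as: take the first MI, IS or PAD event and keep it only if it falls 0 to 365 days after the first ASCVD event of any kind. The sketch below restates that in plain Python; `recent_major_event` and the example dates are hypothetical.

```python
from datetime import date

MAJOR_GROUPS = {"ascvd_mi_codes", "ascvd_is_codes", "ascvd_pad_codes"}


def recent_major_event(events, max_days=365):
    """events: list of (indexdate, set_of_group_labels) for one participant.

    Returns the date of the first MI/IS/PAD event if it lies within
    max_days after the first ASCVD event of any kind, otherwise None.
    """
    if not events:
        return None
    first_any = min(d for d, _ in events)
    major_dates = [d for d, groups in events if groups & MAJOR_GROUPS]
    if not major_dates:
        return None
    first_major = min(major_dates)
    gap = (first_major - first_any).days
    return first_major if 0 <= gap <= max_days else None


events = [
    (date(2005, 5, 8), {"ascvd_cad_codes"}),
    (date(2005, 9, 1), {"ascvd_mi_codes"}),
]
print(recent_major_event(events))
```

Because the first major event can never precede the first event of any kind, the gap is non-negative by construction, which is why the QC table in the notebook is empty.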


In [118]:
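# ascvd02 follow-up dates and durations (days)
# Each follow-up ends at the earliest of: end of the window, date of death, or the data cut-off (2022-10-31)

## 6 months (180 days)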
ascvd02 = ascvd02.withColumn('ascvd02_followup06m_date_temp', f.date_add(f.col('ascvd02_indexdate'), 180))
ascvd02 = ascvd02.withColumn('ascvd02_followup06m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd02_followup06m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd02_followup06m_date_temp')), f.col('ascvd02_followup06m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd02_followup06m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd02 = ascvd02.drop('ascvd02_followup06m_date_temp')

ascvd02 = ascvd02.withColumn('ascvd02_followup06m_date', f.when((f.col('ascvd02_followup06m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd02_followup06m_date')))


ascvd02 = ascvd02.withColumn('ascvd02_followup06m_dur', f.when((f.col('ascvd02_indexdate').isNotNull()) & (f.col('ascvd02_followup06m_date').isNotNull()), f.datediff(f.col('ascvd02_followup06m_date'),f.col('ascvd02_indexdate')))
                                                         .otherwise(f.lit(None)))

## 12 months
ascvd02 = ascvd02.withColumn('ascvd02_followup12m_date_temp', f.date_add(f.col('ascvd02_indexdate'), 365))
ascvd02 = ascvd02.withColumn('ascvd02_followup12m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd02_followup12m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd02_followup12m_date_temp')), f.col('ascvd02_followup12m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd02_followup12m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd02 = ascvd02.drop('ascvd02_followup12m_date_temp')
ascvd02 = ascvd02.withColumn('ascvd02_followup12m_date', f.when((f.col('ascvd02_followup12m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd02_followup12m_date')))

ascvd02 = ascvd02.withColumn('ascvd02_followup12m_dur', f.when((f.col('ascvd02_indexdate').isNotNull()) & (f.col('ascvd02_followup12m_date').isNotNull()), f.datediff(f.col('ascvd02_followup12m_date'),f.col('ascvd02_indexdate')))
                                                         .otherwise(f.lit(None)))

## 24 months
ascvd02 = ascvd02.withColumn('ascvd02_followup24m_date_temp', f.date_add(f.col('ascvd02_indexdate'), 730))
ascvd02 = ascvd02.withColumn('ascvd02_followup24m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd02_followup24m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd02_followup24m_date_temp')), f.col('ascvd02_followup24m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd02_followup24m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd02 = ascvd02.drop('ascvd02_followup24m_date_temp')
ascvd02 = ascvd02.withColumn('ascvd02_followup24m_date', f.when((f.col('ascvd02_followup24m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd02_followup24m_date')))

ascvd02 = ascvd02.withColumn('ascvd02_followup24m_dur', f.when((f.col('ascvd02_indexdate').isNotNull()) & (f.col('ascvd02_followup24m_date').isNotNull()), f.datediff(f.col('ascvd02_followup24m_date'),f.col('ascvd02_indexdate')))
                                                         .otherwise(f.lit(None)))

## Last follow-up date (date of death or data cut-off, whichever is earlier)
ascvd02 = ascvd02.withColumn('ascvd02_followupXXm_date', f.when((f.col('date_of_death').isNull()), f.lit('2022-10-31'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.lit('2022-10-31')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))
ascvd02 = ascvd02.withColumn('ascvd02_followupXXm_dur', f.when((f.col('ascvd02_indexdate').isNotNull()) & (f.col('ascvd02_followupXXm_date').isNotNull()), f.datediff(f.col('ascvd02_followupXXm_date'),f.col('ascvd02_indexdate')))
                                                         .otherwise(f.lit(None)))

In [119]:
ascvd02.printSchema()

root
 |-- eid: string (nullable = true)
 |-- ascvd02_indexdate: date (nullable = true)
 |-- ascvd02_index_age: long (nullable = true)
 |-- date_of_death: date (nullable = true)
 |-- ascvd02_followup06m_date: string (nullable = true)
 |-- ascvd02_followup06m_dur: integer (nullable = true)
 |-- ascvd02_followup12m_date: string (nullable = true)
 |-- ascvd02_followup12m_dur: integer (nullable = true)
 |-- ascvd02_followup24m_date: string (nullable = true)
 |-- ascvd02_followup24m_dur: integer (nullable = true)
 |-- ascvd02_followupXXm_date: string (nullable = true)
 |-- ascvd02_followupXXm_dur: integer (nullable = true)
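
Each follow-up column above applies one rule: follow-up ends at the earliest of the nominal window end, the date of death, and the 2022-10-31 cut-off. The schema reports the `*_date` columns as `string`, most likely because a date column is combined with the string literal `'2022-10-31'` in the `when`/`otherwise` step. The following is a minimal plain-Python sketch of the row-level rule, intended only for spot-checking values in the output. The names `followup_end` and `CENSOR_DATE` are illustrative and do not appear in the notebook.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

# Assumed administrative cut-off, copied from the literal used in the notebook.
CENSOR_DATE = date(2022, 10, 31)


def followup_end(index_date: date,
                 death: Optional[date],
                 window_days: Optional[int]) -> Tuple[date, int]:
    """Return (follow-up end date, duration in days from index).

    window_days=None corresponds to the open-ended 'XXm' follow-up.
    """
    candidates = [CENSOR_DATE]
    if window_days is not None:
        candidates.append(index_date + timedelta(days=window_days))
    if death is not None:
        candidates.append(death)
    end = min(candidates)  # earliest of window end, death, and cut-off
    return end, (end - index_date).days


# First row of the ascvd02 output: index date 2005-05-08, no death recorded.
print(followup_end(date(2005, 5, 8), None, 180))
print(followup_end(date(2005, 5, 8), None, None))
```

Taking the minimum of the candidate dates gives the same result as the chained `when` conditions, because each branch selects whichever date comes first.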



In [120]:
ascvd02 = ascvd02.drop('date_of_death')
ascvd02.show(10)

+-------+-----------------+-----------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+
|    eid|ascvd02_indexdate|ascvd02_index_age|ascvd02_followup06m_date|ascvd02_followup06m_dur|ascvd02_followup12m_date|ascvd02_followup12m_dur|ascvd02_followup24m_date|ascvd02_followup24m_dur|ascvd02_followupXXm_date|ascvd02_followupXXm_dur|
+-------+-----------------+-----------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+
|1000759|       2005-05-08|               60|              2005-11-04|                    180|              2006-05-08|                    365|              2007-05-08|                    730|              2022-10-31|                   6385|
|1001212|       2020-11-16|     

In [121]:
# Create premature ASCVD cohort (ascvd03): index age < 55 for males, < 65 for females
ascvd03 = hesingp_df_ascvd_2.filter(f.col('ascvd02_order') == 1)
ascvd03 = ascvd03.withColumnRenamed('indexdate', 'ascvd03_indexdate')
ascvd03 = ascvd03.select('eid','ascvd03_indexdate').distinct()

ascvd03 = ascvd03.join(cohort_bs, 'eid', 'inner')
ascvd03 = ascvd03.withColumn('ascvd03_index_age', f.floor(f.datediff(f.col('ascvd03_indexdate'), f.col('date_of_birth'))/365.25))
ascvd03.show(5)
ascvd03_m = ascvd03.filter((f.col('sex') == 'Male') & (f.col('ascvd03_index_age') < 55))
ascvd03_f = ascvd03.filter((f.col('sex') == 'Female') & (f.col('ascvd03_index_age') < 65))
ascvd03 = ascvd03_m.union(ascvd03_f)

ascvd03 = ascvd03.select('eid', 'ascvd03_indexdate', 'ascvd03_index_age', 'date_of_death')
ascvd03.show(5)

+-------+-----------------+-------------+------+------------+-----+-------------+-----------------+
|    eid|ascvd03_indexdate|date_of_birth|   sex|date_of_ac_0|  lpa|date_of_death|ascvd03_index_age|
+-------+-----------------+-------------+------+------------+-----+-------------+-----------------+
|1000759|       2005-05-08|   1944-11-15|Female|  2008-02-19|123.7|         null|               60|
|1001212|       2020-11-16|   1944-09-15|  Male|  2008-03-27|244.9|         null|               76|
|1001695|       2007-10-08|   1941-08-15|Female|  2008-08-05| 20.1|         null|               66|
|1001864|       2017-12-22|   1959-10-15|  Male|  2008-10-13|  9.3|         null|               58|
|1003227|       2014-10-20|   1947-01-15|Female|  2008-04-24|  4.7|         null|               67|
+-------+-----------------+-------------+------+------------+-----+-------------+-----------------+
only showing top 5 rows

+-------+-----------------+-----------------+-------------+
|    eid|ascvd0

In [122]:
ascvd03.filter(f.col('date_of_death').isNotNull()).show(5)

+-------+-----------------+-----------------+-------------+
|    eid|ascvd03_indexdate|ascvd03_index_age|date_of_death|
+-------+-----------------+-----------------+-------------+
|1038796|       2004-05-01|               52|   2019-07-27|
|1076630|       1995-07-26|               52|   2018-02-28|
|1163632|       1999-07-20|               52|   2018-11-04|
|1215279|       1975-07-30|               34|   2018-12-21|
|1226700|       2000-03-15|               43|   2013-03-10|
+-------+-----------------+-----------------+-------------+
only showing top 5 rows
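
The premature-ASCVD filter used to build `ascvd03` combines two steps: an index age computed as whole years (days divided by 365.25, rounded down) and a sex-specific age cut-off. A minimal plain-Python sketch of that logic follows; `index_age`, `is_premature`, and `PREMATURE_AGE_CUTOFF` are hypothetical names introduced here for illustration.

```python
from datetime import date
from math import floor

# Cut-offs taken from the notebook's filter: index age must be strictly below these.
PREMATURE_AGE_CUTOFF = {"Male": 55, "Female": 65}


def index_age(date_of_birth: date, index_date: date) -> int:
    """Age in whole years at index, using the notebook's days / 365.25 convention."""
    return floor((index_date - date_of_birth).days / 365.25)


def is_premature(sex: str, age: int) -> bool:
    """True when the index age is below the sex-specific cut-off."""
    cutoff = PREMATURE_AGE_CUTOFF.get(sex)
    return cutoff is not None and age < cutoff


# Row for eid 1000759 in the output above: born 1944-11-15, index 2005-05-08, Female.
age = index_age(date(1944, 11, 15), date(2005, 5, 8))
print(age, is_premature("Female", age))
```

Participants whose `sex` value is neither `'Male'` nor `'Female'` are excluded in this sketch, which mirrors the effect of the union of the two filtered DataFrames.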



In [123]:
# ascvd03 followup date and days

## 6 months
ascvd03 = ascvd03.withColumn('ascvd03_followup06m_date_temp', f.date_add(f.col('ascvd03_indexdate'), 180))
ascvd03 = ascvd03.withColumn('ascvd03_followup06m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd03_followup06m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd03_followup06m_date_temp')), f.col('ascvd03_followup06m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd03_followup06m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd03 = ascvd03.drop('ascvd03_followup06m_date_temp')

ascvd03 = ascvd03.withColumn('ascvd03_followup06m_date', f.when((f.col('ascvd03_followup06m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd03_followup06m_date')))


ascvd03 = ascvd03.withColumn('ascvd03_followup06m_dur', f.when((f.col('ascvd03_indexdate').isNotNull()) & (f.col('ascvd03_followup06m_date').isNotNull()), f.datediff(f.col('ascvd03_followup06m_date'),f.col('ascvd03_indexdate')))
                                                         .otherwise(f.lit(None)))

## 12 months
ascvd03 = ascvd03.withColumn('ascvd03_followup12m_date_temp', f.date_add(f.col('ascvd03_indexdate'), 365))
ascvd03 = ascvd03.withColumn('ascvd03_followup12m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd03_followup12m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd03_followup12m_date_temp')), f.col('ascvd03_followup12m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd03_followup12m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd03 = ascvd03.drop('ascvd03_followup12m_date_temp')
ascvd03 = ascvd03.withColumn('ascvd03_followup12m_date', f.when((f.col('ascvd03_followup12m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd03_followup12m_date')))

ascvd03 = ascvd03.withColumn('ascvd03_followup12m_dur', f.when((f.col('ascvd03_indexdate').isNotNull()) & (f.col('ascvd03_followup12m_date').isNotNull()), f.datediff(f.col('ascvd03_followup12m_date'),f.col('ascvd03_indexdate')))
                                                         .otherwise(f.lit(None)))

## 24 months
ascvd03 = ascvd03.withColumn('ascvd03_followup24m_date_temp', f.date_add(f.col('ascvd03_indexdate'), 730))
ascvd03 = ascvd03.withColumn('ascvd03_followup24m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd03_followup24m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd03_followup24m_date_temp')), f.col('ascvd03_followup24m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd03_followup24m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd03 = ascvd03.drop('ascvd03_followup24m_date_temp')
ascvd03 = ascvd03.withColumn('ascvd03_followup24m_date', f.when((f.col('ascvd03_followup24m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd03_followup24m_date')))

ascvd03 = ascvd03.withColumn('ascvd03_followup24m_dur', f.when((f.col('ascvd03_indexdate').isNotNull()) & (f.col('ascvd03_followup24m_date').isNotNull()), f.datediff(f.col('ascvd03_followup24m_date'),f.col('ascvd03_indexdate')))
                                                         .otherwise(f.lit(None)))

## Last follow up date
ascvd03 = ascvd03.withColumn('ascvd03_followupXXm_date', f.when((f.col('date_of_death').isNull()), f.lit('2022-10-31'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.lit('2022-10-31')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))
ascvd03 = ascvd03.withColumn('ascvd03_followupXXm_dur', f.when((f.col('ascvd03_indexdate').isNotNull()) & (f.col('ascvd03_followupXXm_date').isNotNull()), f.datediff(f.col('ascvd03_followupXXm_date'),f.col('ascvd03_indexdate')))
                                                         .otherwise(f.lit(None)))

In [124]:
ascvd03.filter(f.col('date_of_death').isNotNull()).show(5)

+-------+-----------------+-----------------+-------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+
|    eid|ascvd03_indexdate|ascvd03_index_age|date_of_death|ascvd03_followup06m_date|ascvd03_followup06m_dur|ascvd03_followup12m_date|ascvd03_followup12m_dur|ascvd03_followup24m_date|ascvd03_followup24m_dur|ascvd03_followupXXm_date|ascvd03_followupXXm_dur|
+-------+-----------------+-----------------+-------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+
|1038796|       2004-05-01|               52|   2019-07-27|              2004-10-28|                    180|              2005-05-01|                    365|              2006-05-01|                    730|              2019-07-27| 

In [125]:
ascvd03.printSchema()

root
 |-- eid: string (nullable = true)
 |-- ascvd03_indexdate: date (nullable = true)
 |-- ascvd03_index_age: long (nullable = true)
 |-- date_of_death: date (nullable = true)
 |-- ascvd03_followup06m_date: string (nullable = true)
 |-- ascvd03_followup06m_dur: integer (nullable = true)
 |-- ascvd03_followup12m_date: string (nullable = true)
 |-- ascvd03_followup12m_dur: integer (nullable = true)
 |-- ascvd03_followup24m_date: string (nullable = true)
 |-- ascvd03_followup24m_dur: integer (nullable = true)
 |-- ascvd03_followupXXm_date: string (nullable = true)
 |-- ascvd03_followupXXm_dur: integer (nullable = true)
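
As in the `ascvd02` schema, the follow-up dates here are reported as `string`, while the `*_dur` columns are still integers. If these columns are exported and handled outside Spark, note that ISO 8601 strings (`YYYY-MM-DD`) sort in chronological order, but date arithmetic needs a parse step first. A small plain-Python illustration, not taken from the notebook (`days_between` is a hypothetical helper):

```python
from datetime import date

iso_dates = ["2019-07-27", "2004-10-28", "2022-10-31", "2005-05-01"]

# Lexicographic order of ISO strings matches chronological order of the parsed dates.
sorted_as_text = sorted(iso_dates)
sorted_as_dates = [d.isoformat() for d in sorted(date.fromisoformat(s) for s in iso_dates)]
print(sorted_as_text == sorted_as_dates)


def days_between(start_iso: str, end_iso: str) -> int:
    """Parse two ISO date strings and return the number of days between them."""
    return (date.fromisoformat(end_iso) - date.fromisoformat(start_iso)).days


# Index date and 6-month follow-up date of eid 1038796 in the output above.
print(days_between("2004-05-01", "2004-10-28"))
```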



In [126]:
ascvd03 = ascvd03.drop('date_of_death')
ascvd03.show(10)

+-------+-----------------+-----------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+
|    eid|ascvd03_indexdate|ascvd03_index_age|ascvd03_followup06m_date|ascvd03_followup06m_dur|ascvd03_followup12m_date|ascvd03_followup12m_dur|ascvd03_followup24m_date|ascvd03_followup24m_dur|ascvd03_followupXXm_date|ascvd03_followupXXm_dur|
+-------+-----------------+-----------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+
|1011033|       1986-06-30|               43|              1986-12-27|                    180|              1987-06-30|                    365|              1988-06-29|                    730|              2022-10-31|                  13272|
|1024163|       2016-01-10|     

In [127]:
# Create recurrent ASCVD: ascvd04
ascvd04_eid = hesingp_df_ascvd_2.filter(f.col('ascvd02_order') == 2).select('eid','indexdate').distinct()
ascvd02_eid = ascvd02.select('eid','ascvd02_indexdate','ascvd02_index_age')
ascvd04 = ascvd02_eid.join(ascvd04_eid, 'eid', 'inner')

ascvd04 = ascvd04.join(cohort_bs, 'eid', 'inner')

# Duration from first MI, IS, PAD to second MI, IS, PAD must be <= 24 months (730 days)
ascvd04 = ascvd04.withColumn('ascvd02torec', f.datediff(f.col('indexdate'),f.col('ascvd02_indexdate')))
ascvd04 = ascvd04.filter(f.col('ascvd02torec') <= 730)
ascvd04.show()
ascvd04.filter(f.col('ascvd02torec') < 0).show(5) #should be empty table
ascvd04.filter(f.col('ascvd02torec') == 0).show(5)

ascvd04 = ascvd04.select('eid', 'ascvd02_indexdate','ascvd02_index_age', 'date_of_death').withColumnRenamed('ascvd02_indexdate', 'ascvd04_indexdate').withColumnRenamed('ascvd02_index_age', 'ascvd04_index_age')
ascvd04.show(5)

+-------+-----------------+-----------------+----------+-------------+------+------------+-----+-------------+------------+
|    eid|ascvd02_indexdate|ascvd02_index_age| indexdate|date_of_birth|   sex|date_of_ac_0|  lpa|date_of_death|ascvd02torec|
+-------+-----------------+-----------------+----------+-------------+------+------------+-----+-------------+------------+
|1000759|       2005-05-08|               60|2005-05-10|   1944-11-15|Female|  2008-02-19|123.7|         null|           2|
|1001695|       2007-10-08|               66|2007-12-05|   1941-08-15|Female|  2008-08-05| 20.1|         null|          58|
|1007180|       2004-03-15|               60|2005-05-19|   1943-08-15|  Male|  2007-10-31| 16.7|         null|         430|
|1007474|       2020-01-23|               57|2020-01-29|   1962-10-15|  Male|  2010-06-29|  5.7|         null|           6|
|1017671|       2016-01-10|               75|2016-01-12|   1940-04-15|Female|  2007-10-18| null|         null|           2|
|1019821
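
The recurrence rule in this cell keeps a participant when the second qualifying event falls within 730 days of the first. A minimal plain-Python sketch of that check follows. The function name `is_recurrent` is illustrative, and the explicit lower bound of 0 days is an added assumption; the notebook only inspects negative gaps and expects none.

```python
from datetime import date

MAX_GAP_DAYS = 730  # 24 months, as used in the notebook's filter


def is_recurrent(first_event: date, second_event: date,
                 max_gap_days: int = MAX_GAP_DAYS) -> bool:
    """True when the second event occurs 0 to max_gap_days days after the first."""
    gap = (second_event - first_event).days
    return 0 <= gap <= max_gap_days


# eid 1007180 in the output above: events on 2004-03-15 and 2005-05-19 (430 days apart).
print(is_recurrent(date(2004, 3, 15), date(2005, 5, 19)))
```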

In [128]:
# ascvd04 followup date and days

## 6 months
ascvd04 = ascvd04.withColumn('ascvd04_followup06m_date_temp', f.date_add(f.col('ascvd04_indexdate'), 180))
ascvd04 = ascvd04.withColumn('ascvd04_followup06m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd04_followup06m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd04_followup06m_date_temp')), f.col('ascvd04_followup06m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd04_followup06m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd04 = ascvd04.drop('ascvd04_followup06m_date_temp')

ascvd04 = ascvd04.withColumn('ascvd04_followup06m_date', f.when((f.col('ascvd04_followup06m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd04_followup06m_date')))


ascvd04 = ascvd04.withColumn('ascvd04_followup06m_dur', f.when((f.col('ascvd04_indexdate').isNotNull()) & (f.col('ascvd04_followup06m_date').isNotNull()), f.datediff(f.col('ascvd04_followup06m_date'),f.col('ascvd04_indexdate')))
                                                         .otherwise(f.lit(None)))

## 12 months
ascvd04 = ascvd04.withColumn('ascvd04_followup12m_date_temp', f.date_add(f.col('ascvd04_indexdate'), 365))
ascvd04 = ascvd04.withColumn('ascvd04_followup12m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd04_followup12m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd04_followup12m_date_temp')), f.col('ascvd04_followup12m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd04_followup12m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd04 = ascvd04.drop('ascvd04_followup12m_date_temp')
ascvd04 = ascvd04.withColumn('ascvd04_followup12m_date', f.when((f.col('ascvd04_followup12m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd04_followup12m_date')))

ascvd04 = ascvd04.withColumn('ascvd04_followup12m_dur', f.when((f.col('ascvd04_indexdate').isNotNull()) & (f.col('ascvd04_followup12m_date').isNotNull()), f.datediff(f.col('ascvd04_followup12m_date'),f.col('ascvd04_indexdate')))
                                                         .otherwise(f.lit(None)))

## 24 months
ascvd04 = ascvd04.withColumn('ascvd04_followup24m_date_temp', f.date_add(f.col('ascvd04_indexdate'), 730))
ascvd04 = ascvd04.withColumn('ascvd04_followup24m_date', f.when((f.col('date_of_death').isNull()), f.col('ascvd04_followup24m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.col('ascvd04_followup24m_date_temp')), f.col('ascvd04_followup24m_date_temp'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.col('ascvd04_followup24m_date_temp')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))

ascvd04 = ascvd04.drop('ascvd04_followup24m_date_temp')
ascvd04 = ascvd04.withColumn('ascvd04_followup24m_date', f.when((f.col('ascvd04_followup24m_date') > f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .otherwise(f.col('ascvd04_followup24m_date')))

ascvd04 = ascvd04.withColumn('ascvd04_followup24m_dur', f.when((f.col('ascvd04_indexdate').isNotNull()) & (f.col('ascvd04_followup24m_date').isNotNull()), f.datediff(f.col('ascvd04_followup24m_date'),f.col('ascvd04_indexdate')))
                                                         .otherwise(f.lit(None)))

## Last follow up date
ascvd04 = ascvd04.withColumn('ascvd04_followupXXm_date', f.when((f.col('date_of_death').isNull()), f.lit('2022-10-31'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') >= f.lit('2022-10-31')), f.lit('2022-10-31'))
                                                          .when((f.col('date_of_death').isNotNull()) & (f.col('date_of_death') < f.lit('2022-10-31')), f.col('date_of_death'))
                                                          .otherwise(f.lit(None)))
ascvd04 = ascvd04.withColumn('ascvd04_followupXXm_dur', f.when((f.col('ascvd04_indexdate').isNotNull()) & (f.col('ascvd04_followupXXm_date').isNotNull()), f.datediff(f.col('ascvd04_followupXXm_date'),f.col('ascvd04_indexdate')))
                                                         .otherwise(f.lit(None)))

In [129]:
ascvd04.printSchema()

root
 |-- eid: string (nullable = true)
 |-- ascvd04_indexdate: date (nullable = true)
 |-- ascvd04_index_age: long (nullable = true)
 |-- date_of_death: date (nullable = true)
 |-- ascvd04_followup06m_date: string (nullable = true)
 |-- ascvd04_followup06m_dur: integer (nullable = true)
 |-- ascvd04_followup12m_date: string (nullable = true)
 |-- ascvd04_followup12m_dur: integer (nullable = true)
 |-- ascvd04_followup24m_date: string (nullable = true)
 |-- ascvd04_followup24m_dur: integer (nullable = true)
 |-- ascvd04_followupXXm_date: string (nullable = true)
 |-- ascvd04_followupXXm_dur: integer (nullable = true)



In [130]:
ascvd04 = ascvd04.drop('date_of_death')
ascvd04.show(10)

+-------+-----------------+-----------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+
|    eid|ascvd04_indexdate|ascvd04_index_age|ascvd04_followup06m_date|ascvd04_followup06m_dur|ascvd04_followup12m_date|ascvd04_followup12m_dur|ascvd04_followup24m_date|ascvd04_followup24m_dur|ascvd04_followupXXm_date|ascvd04_followupXXm_dur|
+-------+-----------------+-----------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+------------------------+-----------------------+
|1000759|       2005-05-08|               60|              2005-11-04|                    180|              2006-05-08|                    365|              2007-05-08|                    730|              2022-10-31|                   6385|
|1001695|       2007-10-08|     

In [131]:
#Merge df_cohort and ascvd
df_cohort_ascvd = df_cohort.join(ascvd01, 'eid', 'left')
df_cohort_ascvd = df_cohort_ascvd.join(ascvd02, 'eid', 'left')
df_cohort_ascvd = df_cohort_ascvd.join(ascvd03, 'eid', 'left')
df_cohort_ascvd = df_cohort_ascvd.join(ascvd04, 'eid', 'left')
df_cohort_ascvd.printSchema()
# #ASCVD patient count
# df_cohort_ascvd_totalcount = df_cohort_ascvd.count()
# df_cohort_ascvd_acddcount = df_cohort_ascvd.filter(f.col('date_of_ac_0').isNotNull()).count()
# df_cohort_ascvd_ascvdcount = df_cohort_ascvd.filter((f.col('date_of_ac_0').isNotNull()) & (f.col('ascvd01_indexdate').isNotNull())).count()

# print(f'Total UK Biobank participants: {df_cohort_ascvd_totalcount}')
# print(f'Total UK Biobank participants with first assessment date: {df_cohort_ascvd_acddcount}')
# print(f'Total UK Biobank participants with first assessment date and ASCVD diagnosis: {df_cohort_ascvd_ascvdcount}')

root
 |-- eid: string (nullable = true)
 |-- month_of_birth: string (nullable = true)
 |-- year_of_birth: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- height_0: double (nullable = true)
 |-- height_1: double (nullable = true)
 |-- height_2: double (nullable = true)
 |-- height_3: double (nullable = true)
 |-- weight_0: double (nullable = true)
 |-- weight_1: double (nullable = true)
 |-- weight_2: double (nullable = true)
 |-- weight_3: double (nullable = true)
 |-- bmi_0: double (nullable = true)
 |-- bmi_1: double (nullable = true)
 |-- bmi_2: double (nullable = true)
 |-- bmi_3: double (nullable = true)
 |-- date_of_ac_0: date (nullable = true)
 |-- date_of_ac_1: date (nullable = true)
 |-- date_of_ac_2: date (nullable = true)
 |-- date_of_ac_3: date (nullable = true)
 |-- tc_0: double (nullable = true)
 |-- tc_1: double (nullable = true)
 |-- hdl_0: double (nullable = true)
 |-- hdl_1: double (nullable = true)
 |-- ldl_0: double (nullable = true)
 |-- ldl_1: doubl

In [132]:
# Identify the last episode start date, 'last_epistart'
# 'last_epistart' will be used to identify hospital care data span from index date (ascvd0X_indexdate)
hesin_epistartmax = hesin_df.groupBy('eid').agg(f.max(f.col('epistart_im')).alias('last_epistart'))

In [133]:
#Get the last record of hesin
windowcd = Window.partitionBy('eid').orderBy(f.col('disdate_im').desc(),f.col('epiend').desc(),f.col('epistart_im').desc(),f.col('admidate').desc())
hesin_lc = hesin_df.withColumn('disdate_im_last', f.rank().over(windowcd))                                                               
hesin_lc = hesin_lc.filter(f.col('disdate_im_last') == 1)

hesin_lc = hesin_lc.select('eid','dsource', 'censordate').distinct()

# Uncomment this line if there are no redundant last records
#hesin_lc = hesin_lc.select('eid','disdate_im_last')

hesin_lc_count = hesin_lc.count()
hesin_lc_eidcount = hesin_lc.select('eid').distinct().count()
hesin_lc_eiddscount = hesin_lc.select('eid','dsource').distinct().count()
print(f'Number of row for hesin_lc_count: {hesin_lc_count}')
print(f'Unique patient count for hesin_lc_eidcount: {hesin_lc_eidcount}')
print(f'Unique patient count and dsource for hesin_lc_eiddscount: {hesin_lc_eiddscount}')

Number of row for hesin_lc_count: 449078
Unique patient count for hesin_lc_eidcount: 449078
Unique patient count and dsource for hesin_lc_eiddscount: 449078
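
`f.rank()` assigns rank 1 to every row tied on the ordering columns, so a participant can retain several "last" rows; the `distinct()` on `eid`, `dsource`, and `censordate` collapses them. The three equal counts above indicate that exactly one row per participant remains. A plain-Python sketch of the same idea on toy records follows. The data is invented for illustration, and the ordering is simplified to `disdate_im` alone.

```python
from collections import defaultdict

# Toy records; field names mirror the notebook, values are made up.
records = [
    {"eid": "A", "disdate_im": "2020-01-05", "dsource": "HES", "censordate": "2022-10-31"},
    {"eid": "A", "disdate_im": "2020-01-05", "dsource": "HES", "censordate": "2022-10-31"},  # tie
    {"eid": "A", "disdate_im": "2015-03-02", "dsource": "HES", "censordate": "2022-10-31"},
    {"eid": "B", "disdate_im": "2019-06-30", "dsource": "PEDW", "censordate": "2022-08-31"},
]


def last_records(rows):
    """Keep every row tied for the latest discharge date, then de-duplicate."""
    by_eid = defaultdict(list)
    for row in rows:
        by_eid[row["eid"]].append(row)

    result = set()
    for eid, group in by_eid.items():
        latest = max(r["disdate_im"] for r in group)  # ISO strings sort chronologically
        for r in group:
            if r["disdate_im"] == latest:
                result.add((eid, r["dsource"], r["censordate"]))
    return result


last = last_records(records)
n_rows = len(last)
n_eids = len({eid for eid, _, _ in last})
print(n_rows, n_eids)  # equal counts mean one row per participant
```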


In [134]:
# last_hesin_dsource: Last inpatient record origin
# last_hesin_censordate: Last inpatient record censor date based on "last_hesin_dsource":
## HES: 2022-10-31;
## PEDW: 2022-08-31;
## SMR: 2022-05-31

hesin_lc = hesin_lc.withColumnRenamed('dsource', 'last_hesin_dsource')
hesin_lc = hesin_lc.withColumnRenamed('censordate', 'last_hesin_censordate')
hesin_lc.show(5)

+-------+------------------+---------------------+
|    eid|last_hesin_dsource|last_hesin_censordate|
+-------+------------------+---------------------+
|1000047|               HES|           2022-10-31|
|1000050|               HES|           2022-10-31|
|1000068|               HES|           2022-10-31|
|1000122|               HES|           2022-10-31|
|1000214|               HES|           2022-10-31|
+-------+------------------+---------------------+
only showing top 5 rows
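
The comment in this cell lists one censor date per inpatient data source. As a hedged illustration of how such a mapping could be used downstream, the sketch below stores the three dates in a lookup table and computes the days from an index date to the source-specific censor date. `CENSOR_DATE_BY_SOURCE`, `censor_date_for`, and `days_to_censor` are hypothetical names, and this is not necessarily how the notebook derives the hospital care data span.

```python
from datetime import date
from typing import Optional

# Censor dates as listed in the cell comment above.
CENSOR_DATE_BY_SOURCE = {
    "HES": date(2022, 10, 31),
    "PEDW": date(2022, 8, 31),
    "SMR": date(2022, 5, 31),
}


def censor_date_for(dsource: str) -> Optional[date]:
    """Return the censor date for a data source, or None if the source is unknown."""
    return CENSOR_DATE_BY_SOURCE.get(dsource)


def days_to_censor(index_date: date, dsource: str) -> Optional[int]:
    """Days from the index date to the source-specific censor date (None if unknown)."""
    censor = censor_date_for(dsource)
    return None if censor is None else (censor - index_date).days


print(days_to_censor(date(2022, 5, 1), "SMR"))
```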



In [135]:
df_cohort_ascvd = df_cohort_ascvd.join(hesin_lc, 'eid', 'left')
df_cohort_ascvd = df_cohort_ascvd.join(hesin_epistartmax, 'eid', 'left')
df_cohort_ascvd.printSchema()

root
 |-- eid: string (nullable = true)
 |-- month_of_birth: string (nullable = true)
 |-- year_of_birth: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- height_0: double (nullable = true)
 |-- height_1: double (nullable = true)
 |-- height_2: double (nullable = true)
 |-- height_3: double (nullable = true)
 |-- weight_0: double (nullable = true)
 |-- weight_1: double (nullable = true)
 |-- weight_2: double (nullable = true)
 |-- weight_3: double (nullable = true)
 |-- bmi_0: double (nullable = true)
 |-- bmi_1: double (nullable = true)
 |-- bmi_2: double (nullable = true)
 |-- bmi_3: double (nullable = true)
 |-- date_of_ac_0: date (nullable = true)
 |-- date_of_ac_1: date (nullable = true)
 |-- date_of_ac_2: date (nullable = true)
 |-- date_of_ac_3: date (nullable = true)
 |-- tc_0: double (nullable = true)
 |-- tc_1: double (nullable = true)
 |-- hdl_0: double (nullable = true)
 |-- hdl_1: double (nullable = true)
 |-- ldl_0: double (nullable = true)
 |-- ldl_1: doubl

In [136]:
#days from visit to index
for x in range(4):
    date_of_ac = 'date_of_ac_num'
    days = 'days_ac2index_num'
    date_of_ac_x = date_of_ac.replace('num', str(x))
    days_x = days.replace('num', str(x))
    df_cohort_ascvd = df_cohort_ascvd.withColumn(days_x, f.when((f.col('ascvd01_indexdate').isNotNull()) & (f.col(date_of_ac_x) < f.col('ascvd01_indexdate')), f.datediff(f.col('ascvd01_indexdate'),f.col(date_of_ac_x))).otherwise(f.lit(None)))
    
    
df_cohort_ascvd.filter(f.col('date_of_ac_0').isNotNull()).select('eid','ascvd01_indexdate','date_of_ac_0','date_of_ac_1','days_ac2index_0','days_ac2index_1','days_ac2index_2','days_ac2index_3').show(5)

+-------+-----------------+------------+------------+---------------+---------------+---------------+---------------+
|    eid|ascvd01_indexdate|date_of_ac_0|date_of_ac_1|days_ac2index_0|days_ac2index_1|days_ac2index_2|days_ac2index_3|
+-------+-----------------+------------+------------+---------------+---------------+---------------+---------------+
|3319388|       2020-11-02|  2006-03-21|        null|           5340|           null|           null|           null|
|5967127|       2010-12-10|  2009-05-01|  2013-02-20|            588|           null|           null|           null|
|1505589|       2008-04-25|  2008-03-14|        null|             42|           null|           null|           null|
|2301343|             null|  2010-06-14|        null|           null|           null|           null|           null|
|3215021|             null|  2008-04-07|        null|           null|           null|           null|           null|
+-------+-----------------+------------+------------+---
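
The loop above fills `days_ac2index_x` only when the assessment visit precedes the index date; visits on or after the index date, and participants without an index date, get `null`. A minimal plain-Python sketch of that rule follows, with `days_visit_to_index` as an illustrative name, checked against rows of the output above.

```python
from datetime import date
from typing import Optional


def days_visit_to_index(visit: Optional[date], index: Optional[date]) -> Optional[int]:
    """Days from an assessment visit to the index date, or None if not pre-index."""
    if visit is None or index is None or visit >= index:
        return None
    return (index - visit).days


# eid 5967127: index 2010-12-10, visits on 2009-05-01 (pre-index) and 2013-02-20 (post-index).
print(days_visit_to_index(date(2009, 5, 1), date(2010, 12, 10)))
print(days_visit_to_index(date(2013, 2, 20), date(2010, 12, 10)))
```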

In [137]:
df_cohort_ascvd.filter((f.col('date_of_ac_0').isNotNull()) & (f.col('date_of_ac_3').isNotNull()) & (f.col('date_of_ac_3') < f.col('ascvd01_indexdate'))).select('eid','ascvd01_indexdate','date_of_ac_0','date_of_ac_1','date_of_ac_2','date_of_ac_3','date_of_ac_1','days_ac2index_0','days_ac2index_1','days_ac2index_2','days_ac2index_3').show(5)

+-------+-----------------+------------+------------+------------+------------+------------+---------------+---------------+---------------+---------------+
|    eid|ascvd01_indexdate|date_of_ac_0|date_of_ac_1|date_of_ac_2|date_of_ac_3|date_of_ac_1|days_ac2index_0|days_ac2index_1|days_ac2index_2|days_ac2index_3|
+-------+-----------------+------------+------------+------------+------------+------------+---------------+---------------+---------------+---------------+
|1444949|       2021-05-08|  2008-05-01|  2013-05-30|  2017-11-07|  2020-01-14|  2013-05-30|           4755|           2900|           1278|            480|
|1539262|       2021-12-12|  2009-06-03|        null|  2017-10-11|  2020-03-11|        null|           4575|           null|           1523|            641|
|1876069|       2021-04-19|  2008-07-05|        null|  2017-05-25|  2019-09-01|        null|           4671|           null|           1425|            596|
|2274120|       2021-07-08|  2008-04-01|        null|  201
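
The baseline functions defined in the next cell select, for each participant, the value recorded at the pre-index assessment visit closest to the index date, skipping instances where the variable is missing. The Spark implementation does this by zipping days-to-index with column names and sorting the resulting array. The plain-Python sketch below expresses the same selection rule; `pick_baseline` is a hypothetical name, and the `'Missing'` fallback follows the string-variable version of the function.

```python
from typing import List, Optional


def pick_baseline(days_to_index: List[Optional[int]],
                  values: list,
                  missing="Missing"):
    """Return the value from the instance with the fewest days before index.

    days_to_index[i] is None when visit i is not pre-index;
    values[i] is None when the variable was not recorded at visit i.
    """
    candidates = [
        (days, i)
        for i, (days, value) in enumerate(zip(days_to_index, values))
        if days is not None and value is not None
    ]
    if not candidates:
        return missing
    _, closest = min(candidates)  # fewest days first; ties go to the lower instance
    return values[closest]


# Days taken from eid 1444949 in the output above; the values are made up.
print(pick_baseline([4755, 2900, 1278, 480], ["a", "b", "c", "d"]))
```

Sorting `(days, instance)` tuples mirrors `array_sort` on the zipped structs, which orders by the first field and then by the second.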

In [138]:
# Function to identify baseline characteristics with 4 instances

## baseline_var_str: string variable 
def baseline_var_4i_str(df, var_prefix):
    var_0 = var_prefix + '_' + str(0)
    var_1 = var_prefix + '_' + str(1)
    var_2 = var_prefix + '_' + str(2)
    var_3 = var_prefix + '_' + str(3)
    
    # Keep days-to-index only for instances where the variable was recorded
    df = df.withColumn('days_var2index_0', f.when(f.col(var_0).isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
    df = df.withColumn('days_var2index_1', f.when(f.col(var_1).isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))
    df = df.withColumn('days_var2index_2', f.when(f.col(var_2).isNotNull(), f.col('days_ac2index_2')).otherwise(f.lit(None)))
    df = df.withColumn('days_var2index_3', f.when(f.col(var_3).isNotNull(), f.col('days_ac2index_3')).otherwise(f.lit(None)))

    #Identify baseline var: var with date closest to index
    
    baseline_var = 'baseline_' + var_prefix
    
    varcols = ['days_var2index_0','days_var2index_1','days_var2index_2','days_var2index_3']
    varcol_string = ', '.join("'{0}'".format(c) for c in varcols)
    
    # Zip each days_var2index_x value with its column name into an array of structs
    # Drop elements whose days_var2index_x value is null
    # Sort the remaining elements by days_var2index_x (ascending)
    # var_final_col: name of the days_var2index_x column with the fewest days before index
    df = df.withColumn('vals', f.array(varcols)) \
    .withColumn('varcols', f.expr('Array(' + varcol_string + ')')) \
    .withColumn('varzipped', f.arrays_zip('vals', 'varcols')) \
    .withColumn('varwithout_nulls', f.expr('filter(varzipped, x -> not x.vals is null)')) \
    .withColumn('varsorted', f.expr('array_sort(varwithout_nulls)')) \
    .withColumn('var_final_col', f.col('varsorted')[0].varcols) \
    .drop('vals', 'varcols', 'varzipped', 'varwithout_nulls', 'varsorted')
    
    df = df.withColumn(baseline_var, f.when((f.col('var_final_col') == f.lit('days_var2index_0')), f.col(var_0))
                                      .when((f.col('var_final_col') == f.lit('days_var2index_1')), f.col(var_1))
                                      .when((f.col('var_final_col') == f.lit('days_var2index_2')), f.col(var_2))
                                      .when((f.col('var_final_col') == f.lit('days_var2index_3')), f.col(var_3))
                                      .otherwise(f.lit('Missing')))
    
    df = df.drop('days_var2index_0','days_var2index_1','days_var2index_2','days_var2index_3','var_final_col')
    
    return df


## baseline_var_num: numerical variable 
def baseline_var_4i_num(df, var_prefix):
    var_0 = var_prefix + '_' + str(0)
    var_1 = var_prefix + '_' + str(1)
    var_2 = var_prefix + '_' + str(2)
    var_3 = var_prefix + '_' + str(3)
    
    # Keep days_ac2index_x only for instances where the variable was recorded
    df = df.withColumn('days_var2index_0', f.when(f.col(var_0).isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
    df = df.withColumn('days_var2index_1', f.when(f.col(var_1).isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))
    df = df.withColumn('days_var2index_2', f.when(f.col(var_2).isNotNull(), f.col('days_ac2index_2')).otherwise(f.lit(None)))
    df = df.withColumn('days_var2index_3', f.when(f.col(var_3).isNotNull(), f.col('days_ac2index_3')).otherwise(f.lit(None)))

    #Identify baseline var: var with date closest to index
    
    baseline_var = 'baseline_' + var_prefix
    
    varcols = ['days_var2index_0','days_var2index_1','days_var2index_2','days_var2index_3']
    varcol_string = ', '.join("'{0}'".format(c) for c in varcols)
    
    # Pair each days_var2index_x value with its column name (zipped array of structs)
    # Remove pairs whose days_var2index_x is null
    # Sort the remaining pairs by days_var2index_x (ascending)
    # var_final_col: name of the days_var2index_x column that sorts first
    df = df.withColumn('vals', f.array(varcols)) \
    .withColumn('varcols', f.expr('Array(' + varcol_string + ')')) \
    .withColumn('varzipped', f.arrays_zip('vals', 'varcols')) \
    .withColumn('varwithout_nulls', f.expr('filter(varzipped, x -> not x.vals is null)')) \
    .withColumn('varsorted', f.expr('array_sort(varwithout_nulls)')) \
    .withColumn('var_final_col', f.col('varsorted')[0].varcols) \
    .drop('vals', 'varcols', 'varzipped', 'varwithout_nulls', 'varsorted')
    
    df = df.withColumn(baseline_var, f.when((f.col('var_final_col') == f.lit('days_var2index_0')), f.col(var_0))
                                      .when((f.col('var_final_col') == f.lit('days_var2index_1')), f.col(var_1))
                                      .when((f.col('var_final_col') == f.lit('days_var2index_2')), f.col(var_2))
                                      .when((f.col('var_final_col') == f.lit('days_var2index_3')), f.col(var_3))
                                      .otherwise(f.lit(None)))
    
    df = df.drop('days_var2index_0','days_var2index_1','days_var2index_2','days_var2index_3','var_final_col')
    
    return df

def baseline_var_2i_num(df, var_prefix):
    var_0 = var_prefix + '_' + str(0)
    var_1 = var_prefix + '_' + str(1)
    
    # Keep days_ac2index_x only for instances where the variable was recorded
    df = df.withColumn('days_var2index_0', f.when(f.col(var_0).isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
    df = df.withColumn('days_var2index_1', f.when(f.col(var_1).isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))

    #Identify baseline var: var with date closest to index
    
    baseline_var = 'baseline_' + var_prefix
    
    varcols = ['days_var2index_0','days_var2index_1']
    varcol_string = ', '.join("'{0}'".format(c) for c in varcols)
    
    # Pair each days_var2index_x value with its column name (zipped array of structs)
    # Remove pairs whose days_var2index_x is null
    # Sort the remaining pairs by days_var2index_x (ascending)
    # var_final_col: name of the days_var2index_x column that sorts first
    df = df.withColumn('vals', f.array(varcols)) \
    .withColumn('varcols', f.expr('Array(' + varcol_string + ')')) \
    .withColumn('varzipped', f.arrays_zip('vals', 'varcols')) \
    .withColumn('varwithout_nulls', f.expr('filter(varzipped, x -> not x.vals is null)')) \
    .withColumn('varsorted', f.expr('array_sort(varwithout_nulls)')) \
    .withColumn('var_final_col', f.col('varsorted')[0].varcols) \
    .drop('vals', 'varcols', 'varzipped', 'varwithout_nulls', 'varsorted')
    
    df = df.withColumn(baseline_var, f.when((f.col('var_final_col') == f.lit('days_var2index_0')), f.col(var_0))
                                      .when((f.col('var_final_col') == f.lit('days_var2index_1')), f.col(var_1))
                                      .otherwise(f.lit(None)))
    
    df = df.drop('days_var2index_0','days_var2index_1','var_final_col')
    
    return df

## baseline_var_num_2arrays: numerical variable (4 instances 2 arrays)
def baseline_var_num_2arrays(df, var_prefix):
    for x in range(4):
        var = var_prefix
        var_x = var + '_' + str(x)
        var_ix = var + '_i' + str(x)
        var_ixa0 = var_ix + 'a0'
        var_ixa1 = var_ix + 'a1'
        
        df = df.withColumn(var_x, f.when((f.col(var_ixa0).isNotNull()) & (f.col(var_ixa1).isNotNull()), (f.col(var_ixa0) + f.col(var_ixa1)) / 2)
                                   .when((f.col(var_ixa0).isNotNull()) & (f.col(var_ixa1).isNull()), f.col(var_ixa0))
                                   .when((f.col(var_ixa0).isNull()) & (f.col(var_ixa1).isNotNull()), f.col(var_ixa1))
                                   .otherwise(f.lit(None)))

    df1 = baseline_var_4i_num(df, var_prefix)
    
    return df1
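
The `baseline_var_*` helpers above share one selection rule: among the instances where the variable is recorded, take the value whose `days_ac2index_x` is smallest, with ties going to the lower instance number. The plain-Python sketch below applies that rule to a single participant. `pick_baseline` and the sample numbers are hypothetical and only illustrate the logic; they are not part of the pipeline.

```python
def pick_baseline(values, days, default=None):
    """Return the recorded value whose days-to-index is smallest.

    values[i] is the measurement at instance i (None if not recorded);
    days[i] is the matching days_ac2index_i (None if unknown).
    """
    # Keep only instances with both a measurement and a day count
    candidates = [(d, i) for i, (v, d) in enumerate(zip(values, days))
                  if v is not None and d is not None]
    if not candidates:
        return default
    # Tuples sort by day count first, then by instance number
    _, instance = min(candidates)
    return values[instance]


# Hypothetical participant: BMI recorded at instances 1 and 2 only
bmi = pick_baseline([None, 27.1, 26.4, None], [900, 400, 35, None])
# No smoking status recorded at any instance: fall back to the string default
smoking = pick_baseline([None, None, None, None], [900, 400, 35, None], default='Missing')
print(bmi, smoking)
```

The `default` argument mirrors the only difference between the string and numeric helpers: the string version falls back to `'Missing'`, the numeric versions to null.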

In [139]:
# smoking: string, 4 instances

df_cohort_ascvd = baseline_var_4i_str(df_cohort_ascvd, 'smoking')

In [140]:
# height: numeric, 4 instances

df_cohort_ascvd = baseline_var_4i_num(df_cohort_ascvd, 'height')

In [141]:
# weight: numeric, 4 instances

df_cohort_ascvd = baseline_var_4i_num(df_cohort_ascvd, 'weight')

In [142]:
# bmi: numeric, 4 instances

df_cohort_ascvd = baseline_var_4i_num(df_cohort_ascvd, 'bmi')

In [143]:
# bp_diastolic & bp_systolic: numeric, 4 instances, 2 arrays
df_cohort_ascvd = baseline_var_num_2arrays(df_cohort_ascvd, 'bp_diastolic')
df_cohort_ascvd = baseline_var_num_2arrays(df_cohort_ascvd, 'bp_systolic')
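
Blood pressure has two array readings per instance, and `baseline_var_num_2arrays` collapses them into one value per instance before the baseline selection runs. A small plain-Python sketch of that collapsing step, using a hypothetical helper name `mean_of_readings` and made-up readings:

```python
def mean_of_readings(a0, a1):
    """Average two readings; fall back to whichever one is present."""
    if a0 is not None and a1 is not None:
        return (a0 + a1) / 2
    return a0 if a0 is not None else a1


print(mean_of_readings(82, 78))      # both readings present: their mean
print(mean_of_readings(None, 78))    # only the second reading present
print(mean_of_readings(None, None))  # nothing recorded at this instance
```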

In [144]:
# ldl: numeric, 2 instances
df_cohort_ascvd = baseline_var_2i_num(df_cohort_ascvd,'ldl')

In [145]:
# hdl: numeric, 2 instances
df_cohort_ascvd = baseline_var_2i_num(df_cohort_ascvd,'hdl')

In [146]:
# tc: numeric, 2 instances
df_cohort_ascvd = baseline_var_2i_num(df_cohort_ascvd,'tc')

In [147]:
# nonhdl: numeric, 2 instances, derived as total cholesterol minus HDL cholesterol
df_cohort_ascvd = df_cohort_ascvd.withColumn('nonhdl_0', f.when((f.col('tc_0').isNotNull()) & (f.col('hdl_0').isNotNull()), f.col('tc_0') - f.col('hdl_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('nonhdl_1', f.when((f.col('tc_1').isNotNull()) & (f.col('hdl_1').isNotNull()), f.col('tc_1') - f.col('hdl_1')).otherwise(f.lit(None)))

df_cohort_ascvd = baseline_var_2i_num(df_cohort_ascvd,'nonhdl')

In [148]:
# crp: numeric, 2 instances
df_cohort_ascvd = baseline_var_2i_num(df_cohort_ascvd,'crp')

In [149]:
df_cohort_ascvd.printSchema()

root
 |-- eid: string (nullable = true)
 |-- month_of_birth: string (nullable = true)
 |-- year_of_birth: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- height_0: double (nullable = true)
 |-- height_1: double (nullable = true)
 |-- height_2: double (nullable = true)
 |-- height_3: double (nullable = true)
 |-- weight_0: double (nullable = true)
 |-- weight_1: double (nullable = true)
 |-- weight_2: double (nullable = true)
 |-- weight_3: double (nullable = true)
 |-- bmi_0: double (nullable = true)
 |-- bmi_1: double (nullable = true)
 |-- bmi_2: double (nullable = true)
 |-- bmi_3: double (nullable = true)
 |-- date_of_ac_0: date (nullable = true)
 |-- date_of_ac_1: date (nullable = true)
 |-- date_of_ac_2: date (nullable = true)
 |-- date_of_ac_3: date (nullable = true)
 |-- tc_0: double (nullable = true)
 |-- tc_1: double (nullable = true)
 |-- hdl_0: double (nullable = true)
 |-- hdl_1: double (nullable = true)
 |-- ldl_0: double (nullable = true)
 |-- ldl_1: doubl

In [150]:
df_cohort_ascvd = df_cohort_ascvd.drop('height_0','height_1','height_2','height_3',
                                       'weight_0','weight_1','weight_2','weight_3',
                                       'bmi_0','bmi_1','bmi_2','bmi_3',
                                       'tc_0','tc_1',
                                       'hdl_0','hdl_1',
                                       'ldl_0','ldl_1',
                                       'nonhdl_0','nonhdl_1',
                                       'crp_0','crp_1',
                                       'smoking_0','smoking_1','smoking_2','smoking_3',
                                       'bp_diastolic_i0a0','bp_diastolic_i0a1','bp_diastolic_i1a0','bp_diastolic_i1a1','bp_diastolic_i2a0','bp_diastolic_i2a1','bp_diastolic_i3a0','bp_diastolic_i3a1',
                                       'bp_systolic_i0a0','bp_systolic_i0a1','bp_systolic_i1a0','bp_systolic_i1a1','bp_systolic_i2a0','bp_systolic_i2a1','bp_systolic_i3a0','bp_systolic_i3a1',
                                       'bp_diastolic_0','bp_diastolic_1','bp_diastolic_2','bp_diastolic_3',
                                       'bp_systolic_0','bp_systolic_1','bp_systolic_2','bp_systolic_3',
                                       'days_ac2index_0','days_ac2index_1','days_ac2index_2','days_ac2index_3')


In [151]:
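#Comorbidity: earliest first-occurrence date per condition (f.least skips nulls), flagged 1 when it is before the index date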
df_cohort_ascvd = df_cohort_ascvd.withColumn('hbp_date_first', f.least(f.col('hbp_i10_date_first'),f.col('hbp_i11_date_first'),f.col('hbp_i12_date_first'),f.col('hbp_i13_date_first'),f.col('hbp_i15_date_first')))
df_cohort_ascvd = df_cohort_ascvd.withColumn('dm_date_first', f.least(f.col('dm_e10_date_first'),f.col('dm_e11_date_first'),f.col('dm_e12_date_first'),f.col('dm_e13_date_first'),f.col('dm_e14_date_first')))

df_cohort_ascvd = df_cohort_ascvd.withColumn('cm_ht', f.when((f.col('hbp_date_first').isNotNull()) & (f.col('hbp_date_first') < f.col('ascvd01_indexdate')), 1).otherwise(0))
df_cohort_ascvd = df_cohort_ascvd.withColumn('cm_dm', f.when((f.col('dm_date_first').isNotNull()) & (f.col('dm_date_first') < f.col('ascvd01_indexdate')), 1).otherwise(0))
df_cohort_ascvd = df_cohort_ascvd.withColumn('cm_ckd', f.when((f.col('crf_n18_date_first').isNotNull()) & (f.col('crf_n18_date_first') < f.col('ascvd01_indexdate')), 1).otherwise(0))
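
The comorbidity flags follow one pattern: take the earliest non-null first-occurrence date across the relevant codes, then flag the participant if that date is strictly before the index date. The plain-Python sketch below restates the rule for one participant; `comorbidity_flag` and the dates are hypothetical.

```python
from datetime import date


def comorbidity_flag(first_dates, index_date):
    """Return 1 if the earliest known diagnosis date precedes the index date, else 0."""
    known = [d for d in first_dates if d is not None]
    if not known or index_date is None:
        return 0
    return 1 if min(known) < index_date else 0


index_date = date(2020, 2, 5)
# Diagnosis recorded under two codes, the earlier one well before index
print(comorbidity_flag([None, date(2009, 10, 9), None, None, date(2002, 3, 1)], index_date))
# Diagnosis recorded on the index date itself does not count as a comorbidity
print(comorbidity_flag([date(2020, 2, 5)], index_date))
```

Note that a missing date and a date on or after the index date both yield 0, so the flag cannot distinguish "no diagnosis" from "diagnosed later".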

In [152]:
df_cohort_ascvd.filter((f.col('ascvd01_indexdate').isNotNull()) & (f.col('cm_dm') == 1)).select('eid','ascvd01_indexdate','dm_date_first','dm_e10_date_first','dm_e11_date_first','dm_e12_date_first','dm_e14_date_first','cm_dm').show(5)

+-------+-----------------+-------------+-----------------+-----------------+-----------------+-----------------+-----+
|    eid|ascvd01_indexdate|dm_date_first|dm_e10_date_first|dm_e11_date_first|dm_e12_date_first|dm_e14_date_first|cm_dm|
+-------+-----------------+-------------+-----------------+-----------------+-----------------+-----------------+-----+
|1003593|       1999-01-22|   1978-11-01|       1999-01-22|       2007-05-19|             null|       1978-11-01|    1|
|1006619|       2022-03-23|   2011-12-21|             null|       2011-12-21|             null|             null|    1|
|1007474|       2020-01-23|   2015-07-21|             null|       2015-07-21|             null|             null|    1|
|1008949|       2020-02-05|   2002-03-01|             null|       2009-10-09|             null|       2002-03-01|    1|
|1010972|       2018-04-30|   2011-12-15|             null|       2011-12-15|             null|             null|    1|
+-------+-----------------+-------------

In [153]:
df_cohort_ascvd = df_cohort_ascvd.drop('hbp_date_first','hbp_i10_date_first','hbp_i11_date_first','hbp_i12_date_first','hbp_i13_date_first','hbp_i15_date_first',
                                       'dm_date_first','dm_e10_date_first','dm_e11_date_first','dm_e12_date_first','dm_e13_date_first','dm_e14_date_first',
                                       'crf_n18_date_first')

In [154]:
# Baseline medications

# GP prescription records
gp_script_field_names = ['eid',
                         'data_provider', # Data provider. Whether the record originates from England (Vision), Scotland, England (TPP) or Wales.
                         'issue_date', # Date prescription was issued. Special coded values are given in Coding 819.
                         'read_2', # Read v2 code. Read version 2 code.
                         'bnf_code', # BNF code. British National Formulary code.
                         'dmd_code', # dm+d code. Dictionary of medicines and devices (dm+d) code.
                         'drug_name', # Drug name.
                         'quantity' # Quantity issued.
                        ]

gp_scripts_df = gp_scripts.retrieve_fields(names=gp_script_field_names, engine=dxdata.connect())

In [155]:
gp_scripts_df = gp_scripts_df.withColumn('drug_name', f.lower(f.col('drug_name')))
read_v2_drugs = pd.read_csv('/mnt/project/Users/yonghu4/Lpa_EB/code/read_v2_drugs_lkp.txt',sep='\t')
read_v2_drugs_s = spark.createDataFrame(read_v2_drugs)
read_v2_drugs_s = read_v2_drugs_s.withColumnRenamed('read_code', 'read_2')
gp_scripts_df = gp_scripts_df.join(read_v2_drugs_s, 'read_2', 'left')
gp_scripts_df.show(5)

+-------+-------+-------------+----------+---------------+---------+--------------------+------------+--------------------+-----------+
| read_2|    eid|data_provider|issue_date|       bnf_code| dmd_code|           drug_name|    quantity|    term_description|status_flag|
+-------+-------+-------------+----------+---------------+---------+--------------------+------------+--------------------+-----------+
|b312.00|5364675|            1|2008-06-09|           null|317972000|furosemide tabs 40mg|      56.000|                null|       null|
|  bkB1.|1584874|            4|2004-10-04|           null|     null|                null|        null|OLMESARTAN MEDOXO...|          0|
|bu23.00|3811703|            1|2004-05-18|           null|319773006|aspirin disp tab ...|      28.000|                null|       null|
|   null|1067608|            3|2007-01-08| 02.09.01.00.00|     null|aspirin 75mg disp...|56 tablet(s)|                null|       null|
|   null|4883548|            2|2016-01-25|020505

In [156]:
gp_scripts_df = gp_scripts_df.withColumn('drug_name2', f.when((f.col('data_provider') != 3) & (f.col('drug_name').isNull()), f.col('term_description')).otherwise(f.col('drug_name')))
gp_scripts_df = gp_scripts_df.drop('drug_name', 'term_description')
gp_scripts_df = gp_scripts_df.withColumnRenamed('drug_name2','drug_name')
gp_scripts_df = gp_scripts_df.drop('bnf_code')
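
The cell above fills a missing `drug_name` from the joined Read v2 `term_description`, except for records where `data_provider` is 3. A plain-Python sketch of that fallback for a single record follows; `resolve_drug_name` is a hypothetical name, and the explicit `None` check on the provider mirrors how a null comparison in Spark falls through to the `otherwise` branch.

```python
def resolve_drug_name(data_provider, drug_name, term_description):
    """Use the Read v2 term when the drug name is missing and the provider is not 3."""
    if data_provider is not None and data_provider != 3 and drug_name is None:
        return term_description
    return drug_name


# Record with only a Read v2 term: the term is used as the drug name
print(resolve_drug_name(4, None, 'OLMESARTAN MEDOXOMIL'))
# Record that already has a drug name: it is kept unchanged
print(resolve_drug_name(1, 'furosemide tabs 40mg', None))
# Provider 3 is never filled from the Read v2 lookup
print(resolve_drug_name(3, None, 'SOME TERM'))
```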


In [157]:
# load drug name list
drugnamelist = pd.read_csv("/mnt/project/Users/yonghu4/Lpa_EB/code/cv_drug_codelist_v00.txt", dtype=str, sep='\t', keep_default_na=False)
# convert 'drugnamelist' to spark dataframe
drugnamelist_s = spark.createDataFrame(drugnamelist)
drugnamelist_s = drugnamelist_s.withColumnRenamed('drug_name3','drug_name')
drugnamelist_s = drugnamelist_s.withColumn('remark', f.when((f.col('remark') == ''), f.lit(None)).otherwise(f.col('remark')))
drugnamelist_s.show(5)

drugnamelist_s.printSchema()


+----------+--------------------+--------+------+
|drug_class|           drug_name|bnf_code|remark|
+----------+--------------------+--------+------+
|     ACEis|*coversyl 2mg tab...|  020505|  null|
|     ACEis|*coversyl 4mg tab...|  020505|  null|
|     ACEis|*coversyl 8mg tab...|  020505|  null|
|     ACEis|*co-zidocapt 50mg...|  020505|  null|
|     ACEis|*gopten 1mg capsules|  020505|  null|
+----------+--------------------+--------+------+
only showing top 5 rows

root
 |-- drug_class: string (nullable = true)
 |-- drug_name: string (nullable = true)
 |-- bnf_code: string (nullable = true)
 |-- remark: string (nullable = true)



In [158]:
# Change drug_name to lower case (names filled from term_description are upper case) and drop records without an issue date

gp_scripts_df.filter(f.col('eid') == '1584874').show(10)
gp_scripts_df = gp_scripts_df.withColumn('drug_name', f.lower(f.col('drug_name')))
gp_scripts_df = gp_scripts_df.filter(f.col('issue_date').isNotNull())
gp_scripts_df.filter(f.col('eid') == '1584874').show(10)

+------+-------+-------------+----------+--------+--------+-----------+--------------------+
|read_2|    eid|data_provider|issue_date|dmd_code|quantity|status_flag|           drug_name|
+------+-------+-------------+----------+--------+--------+-----------+--------------------+
| gda1.|1584874|            4|2006-11-02|    null|    null|          0|OXYBUTYNIN HYDROC...|
| gda1.|1584874|            4|2006-08-09|    null|    null|          0|OXYBUTYNIN HYDROC...|
| me41.|1584874|            4|2006-08-09|    null|    null|          0|CANESTEN 1% cream...|
| bkB1.|1584874|            4|2007-02-26|    null|    null|          0|OLMESARTAN MEDOXO...|
| bkB1.|1584874|            4|2005-06-22|    null|    null|          0|OLMESARTAN MEDOXO...|
| bkB1.|1584874|            4|2013-08-13|    null|    null|          0|OLMESARTAN MEDOXO...|
| bkB1.|1584874|            4|2014-08-07|    null|    null|          0|OLMESARTAN MEDOXO...|
| bkB1.|1584874|            4|2009-01-12|    null|    null|          0

In [159]:
# extract drug names for each drug class from 'drugnamelist_s'
statin = drugnamelist_s.filter(f.col('drug_class') == 'Statin').drop('remark')
acei = drugnamelist_s.filter(f.col('drug_class') == 'ACEis').drop('remark')
arb = drugnamelist_s.filter(f.col('drug_class') == 'ARB').drop('remark')
bb = drugnamelist_s.filter(f.col('drug_class') == 'Beta-blockers').drop('remark')
pcsk9i = drugnamelist_s.filter(f.col('drug_class') == 'PCSK9i').drop('remark')
ezetimibe = drugnamelist_s.filter(f.col('drug_class') == 'Ezetimibe').drop('remark')
fibrates = drugnamelist_s.filter(f.col('drug_class') == 'Fibrates').drop('remark')
bileacid = drugnamelist_s.filter(f.col('drug_class') == 'Bile acid sequestrants').drop('remark')

# extract prescription records for each drug class (exact match on lower-cased drug_name)
gp_scripts_statin_df = gp_scripts_df.join(statin, 'drug_name', 'inner')
gp_scripts_acei_df = gp_scripts_df.join(acei, 'drug_name', 'inner')
gp_scripts_arb_df = gp_scripts_df.join(arb, 'drug_name', 'inner')
gp_scripts_bb_df = gp_scripts_df.join(bb, 'drug_name', 'inner')
gp_scripts_pcsk9i_df = gp_scripts_df.join(pcsk9i, 'drug_name', 'inner')
gp_scripts_ezetimibe_df = gp_scripts_df.join(ezetimibe, 'drug_name', 'inner')
gp_scripts_fibrates_df = gp_scripts_df.join(fibrates, 'drug_name', 'inner')
gp_scripts_bileacid_df = gp_scripts_df.join(bileacid, 'drug_name', 'inner')

In [160]:
window_scripts = Window.partitionBy('eid').orderBy('issue_date')
gp_scripts_statin_df = gp_scripts_statin_df.withColumn('med_statin', f.row_number().over(window_scripts))
gp_scripts_acei_df = gp_scripts_acei_df.withColumn('med_acei', f.row_number().over(window_scripts))
gp_scripts_arb_df = gp_scripts_arb_df.withColumn('med_arb', f.row_number().over(window_scripts))
gp_scripts_bb_df = gp_scripts_bb_df.withColumn('med_bb', f.row_number().over(window_scripts))
gp_scripts_pcsk9i_df = gp_scripts_pcsk9i_df.withColumn('med_pcsk9i', f.row_number().over(window_scripts))
gp_scripts_ezetimibe_df = gp_scripts_ezetimibe_df.withColumn('med_ezetimibe', f.row_number().over(window_scripts))
gp_scripts_fibrates_df = gp_scripts_fibrates_df.withColumn('med_fibrates', f.row_number().over(window_scripts))
gp_scripts_bileacid_df = gp_scripts_bileacid_df.withColumn('med_bileacid', f.row_number().over(window_scripts))

In [161]:
gp_scripts_statin_df = gp_scripts_statin_df.filter(f.col('med_statin') == 1)
gp_scripts_acei_df = gp_scripts_acei_df.filter(f.col('med_acei') == 1)
gp_scripts_arb_df = gp_scripts_arb_df.filter(f.col('med_arb') == 1)
gp_scripts_bb_df = gp_scripts_bb_df.filter(f.col('med_bb') == 1)
gp_scripts_pcsk9i_df = gp_scripts_pcsk9i_df.filter(f.col('med_pcsk9i') == 1)
gp_scripts_ezetimibe_df = gp_scripts_ezetimibe_df.filter(f.col('med_ezetimibe') == 1)
gp_scripts_fibrates_df = gp_scripts_fibrates_df.filter(f.col('med_fibrates') == 1)
gp_scripts_bileacid_df = gp_scripts_bileacid_df.filter(f.col('med_bileacid') == 1)

In [162]:
gp_scripts_statin_df = gp_scripts_statin_df.select('eid','issue_date').withColumnRenamed('issue_date','statin_issue_date')
gp_scripts_acei_df = gp_scripts_acei_df.select('eid','issue_date').withColumnRenamed('issue_date','acei_issue_date')
gp_scripts_arb_df = gp_scripts_arb_df.select('eid','issue_date').withColumnRenamed('issue_date','arb_issue_date')
gp_scripts_bb_df = gp_scripts_bb_df.select('eid','issue_date').withColumnRenamed('issue_date','bb_issue_date')
gp_scripts_pcsk9i_df = gp_scripts_pcsk9i_df.select('eid','issue_date').withColumnRenamed('issue_date','pcsk9i_issue_date')
gp_scripts_ezetimibe_df = gp_scripts_ezetimibe_df.select('eid','issue_date').withColumnRenamed('issue_date','ezetimibe_issue_date')
gp_scripts_fibrates_df = gp_scripts_fibrates_df.select('eid','issue_date').withColumnRenamed('issue_date','fibrates_issue_date')
gp_scripts_bileacid_df = gp_scripts_bileacid_df.select('eid','issue_date').withColumnRenamed('issue_date','bileacid_issue_date')
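
Taken together, the join, `row_number` window, filter and rename steps reduce the prescription table to one row per participant and drug class: the earliest `issue_date` among records whose lower-cased `drug_name` exactly matches the code list. The plain-Python sketch below mirrors that reduction on a few made-up records. `first_issue_dates`, the record tuples and the two-entry lookup are hypothetical, and the lookup assumes each drug name belongs to a single class.

```python
from datetime import date


def first_issue_dates(records, class_by_drug):
    """Map (eid, drug_class) to the earliest issue date among matching records."""
    earliest = {}
    for eid, drug_name, issue_date in records:
        drug_class = class_by_drug.get(drug_name.lower()) if drug_name else None
        if drug_class is None or issue_date is None:
            continue  # drug not in the code list, or record has no date
        key = (eid, drug_class)
        if key not in earliest or issue_date < earliest[key]:
            earliest[key] = issue_date
    return earliest


class_by_drug = {'simvastatin 40mg tablets': 'Statin', 'ramipril 5mg capsules': 'ACEis'}
records = [
    ('A', 'Simvastatin 40mg tablets', date(2012, 3, 1)),
    ('A', 'simvastatin 40mg tablets', date(2010, 6, 15)),
    ('A', 'ramipril 5mg capsules', date(2011, 1, 9)),
    ('B', 'paracetamol 500mg tablets', date(2009, 2, 2)),
]
result = first_issue_dates(records, class_by_drug)
print(result)
```

Because matching is exact, any spelling, strength or formulation missing from the code list is silently dropped, so the completeness of the code list determines the sensitivity of the medication flags.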


In [163]:
cohort_ascvd = df_cohort_ascvd.select('eid', 'ascvd01_indexdate')

gp_scripts_bsmed = cohort_ascvd.join(gp_scripts_statin_df,'eid','left')
gp_scripts_bsmed = gp_scripts_bsmed.join(gp_scripts_acei_df,'eid','left')
gp_scripts_bsmed = gp_scripts_bsmed.join(gp_scripts_arb_df,'eid','left')
gp_scripts_bsmed = gp_scripts_bsmed.join(gp_scripts_bb_df,'eid','left')
gp_scripts_bsmed = gp_scripts_bsmed.join(gp_scripts_pcsk9i_df,'eid','left')
gp_scripts_bsmed = gp_scripts_bsmed.join(gp_scripts_ezetimibe_df,'eid','left')
gp_scripts_bsmed = gp_scripts_bsmed.join(gp_scripts_fibrates_df,'eid','left')
gp_scripts_bsmed = gp_scripts_bsmed.join(gp_scripts_bileacid_df,'eid','left')

In [164]:
# Baseline medication flag per drug class:
#   1    = first recorded prescription is before the index date
#   0    = index date present, but no prescription before it
#   null = no index date
index_date = f.col('ascvd01_indexdate')
for med in ['statin', 'acei', 'arb', 'bb', 'pcsk9i', 'ezetimibe', 'fibrates', 'bileacid']:
    issue_date = f.col(med + '_issue_date')
    gp_scripts_bsmed = gp_scripts_bsmed.withColumn('med_' + med, f.when(index_date.isNotNull() & issue_date.isNotNull() & (index_date > issue_date), f.lit(1))
                                                                  .when(index_date.isNotNull(), f.lit(0))
                                                                  .otherwise(f.lit(None)))

gp_scripts_bsmed.show(5)

+-------+-----------------+-----------------+---------------+--------------+-------------+-----------------+--------------------+-------------------+-------------------+----------+--------+-------+------+----------+-------------+------------+------------+
|    eid|ascvd01_indexdate|statin_issue_date|acei_issue_date|arb_issue_date|bb_issue_date|pcsk9i_issue_date|ezetimibe_issue_date|fibrates_issue_date|bileacid_issue_date|med_statin|med_acei|med_arb|med_bb|med_pcsk9i|med_ezetimibe|med_fibrates|med_bileacid|
+-------+-----------------+-----------------+---------------+--------------+-------------+-----------------+--------------------+-------------------+-------------------+----------+--------+-------+------+----------+-------------+------------+------------+
|3319388|       2020-11-02|             null|           null|          null|         null|             null|                null|               null|               null|         0|       0|      0|     0|         0|            0|   
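
Each `med_*` flag compares the index date with the participant's earliest prescription in that drug class, so a value of 1 means "any prescription before the index date" rather than current use at index. A plain-Python sketch of the three possible outcomes, with a hypothetical helper name and made-up dates:

```python
from datetime import date


def baseline_med_flag(index_date, first_issue_date):
    """1 if first prescribed before index, 0 if not, None when there is no index date."""
    if index_date is None:
        return None
    if first_issue_date is not None and first_issue_date < index_date:
        return 1
    return 0


index_date = date(2018, 4, 30)
print(baseline_med_flag(index_date, date(2011, 12, 15)))  # prescribed before index
print(baseline_med_flag(index_date, date(2019, 1, 1)))    # first prescribed after index
print(baseline_med_flag(index_date, None))                # never prescribed
print(baseline_med_flag(None, date(2011, 12, 15)))        # no index date
```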

In [165]:
gp_scripts_bsmed = gp_scripts_bsmed.drop('ascvd01_indexdate','statin_issue_date','acei_issue_date','arb_issue_date','bb_issue_date','pcsk9i_issue_date','ezetimibe_issue_date','fibrates_issue_date','bileacid_issue_date')
gp_scripts_bsmed.show(5)

+-------+----------+--------+-------+------+----------+-------------+------------+------------+
|    eid|med_statin|med_acei|med_arb|med_bb|med_pcsk9i|med_ezetimibe|med_fibrates|med_bileacid|
+-------+----------+--------+-------+------+----------+-------------+------------+------------+
|3319388|         0|       0|      0|     0|         0|            0|           0|           0|
|5967127|         0|       0|      0|     0|         0|            0|           0|           0|
|1505589|         0|       0|      0|     0|         0|            0|           0|           0|
|2301343|      null|    null|   null|  null|      null|         null|        null|        null|
|3215021|      null|    null|   null|  null|      null|         null|        null|        null|
+-------+----------+--------+-------+------+----------+-------------+------------+------------+
only showing top 5 rows



In [166]:
df_cohort_ascvd = df_cohort_ascvd.join(gp_scripts_bsmed, 'eid', 'left')

In [167]:
#Identify incident cohort
#All participants
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit01', f.lit(1))
#Index date after date_of_ac_0 (enrolment) and on or before 2021-10-31
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit02', f.when((f.col('date_of_ac_0').isNotNull()) & (f.col('ascvd01_indexdate').isNotNull()) & (f.col('ascvd01_indexdate') > f.col('date_of_ac_0')) & (f.col('ascvd01_indexdate') <= f.to_date(f.lit('2021-10-31'))), f.lit(1)).otherwise(f.lit(0)))
#Non-null Lp(a) measurement
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit03', f.when((f.col('lpa').isNotNull()), f.lit(1)).otherwise(f.lit(0)))
#Aged >= 40 years (ascvd01_index_age)
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit04', f.when((f.col('ascvd01_index_age') >= 40), f.lit(1)).otherwise(f.lit(0)))
#Hospital care data extend beyond the index date (last_epistart after index)
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit05', f.when((f.col('last_epistart') > f.col('ascvd01_indexdate')), f.lit(1)).otherwise(f.lit(0)))
#Patients without SMR
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit06', f.when((f.col('hesin_smr') != 1), f.lit(1)).otherwise(f.lit(0)))
#Patients with inpatient record
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit07', f.when((f.col('hesin_record') == 1), f.lit(1)).otherwise(f.lit(0)))
#Patients with primary care data
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit08', f.when((f.col('gp_clinical_record') == 1), f.lit(1)).otherwise(f.lit(0)))
#ascvd01: study cohort, incident ASCVD
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit09', f.when((f.col('ascvd01_indexdate').isNotNull()), f.lit(1)).otherwise(f.lit(0)))
#ascvd02: recent ASCVD
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit10', f.when((f.col('ascvd02_indexdate').isNotNull()), f.lit(1)).otherwise(f.lit(0)))
#ascvd03: premature ASCVD
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit11', f.when((f.col('ascvd03_indexdate').isNotNull()), f.lit(1)).otherwise(f.lit(0)))
#ascvd04: recurrent ASCVD
df_cohort_ascvd = df_cohort_ascvd.withColumn('crit12', f.when((f.col('ascvd04_indexdate').isNotNull()), f.lit(1)).otherwise(f.lit(0)))
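
The `crit` flags can be combined cumulatively to produce counts for an attrition table like the one at the top of this tutorial. How the notebook derives those counts is not shown in this section, so the sketch below is an assumption: it treats `crit01` to `crit05` as sequential filters and counts the rows that survive each step. `attrition_counts` and the sample rows are hypothetical.

```python
def attrition_counts(rows, criteria):
    """Count rows that satisfy every criterion up to and including each step."""
    counts = []
    remaining = list(rows)
    for crit in criteria:
        remaining = [r for r in remaining if r.get(crit) == 1]
        counts.append((crit, len(remaining)))
    return counts


# Four made-up participants, each failing at a different step (or none)
rows = [
    {'crit01': 1, 'crit02': 1, 'crit03': 1, 'crit04': 1, 'crit05': 1},
    {'crit01': 1, 'crit02': 1, 'crit03': 0, 'crit04': 1, 'crit05': 1},
    {'crit01': 1, 'crit02': 0, 'crit03': 1, 'crit04': 1, 'crit05': 0},
    {'crit01': 1, 'crit02': 1, 'crit03': 1, 'crit04': 0, 'crit05': 1},
]
steps = ['crit01', 'crit02', 'crit03', 'crit04', 'crit05']
for crit, n in attrition_counts(rows, steps):
    print(crit, n)
```

Storing each criterion as its own 0/1 column, as the cell above does, keeps the filters independent, so the same flags can be recombined in a different order without recomputing them.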

In [168]:
df_cohort_ascvd.printSchema()

root
 |-- eid: string (nullable = true)
 |-- month_of_birth: string (nullable = true)
 |-- year_of_birth: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- date_of_ac_0: date (nullable = true)
 |-- date_of_ac_1: date (nullable = true)
 |-- date_of_ac_2: date (nullable = true)
 |-- date_of_ac_3: date (nullable = true)
 |-- studyend: date (nullable = true)
 |-- date_of_birth: date (nullable = true)
 |-- ethnic: string (nullable = true)
 |-- lpa: double (nullable = true)
 |-- lpa_threshold: string (nullable = false)
 |-- date_of_death: date (nullable = true)
 |-- death_icd10: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- death_level: array (nullable = true)
 |    |-- element: long (containsNull = false)
 |-- death_cv_icd10: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- death_cv_level: array (nullable = true)
 |    |-- element: long (containsNull = false)
 |-- death_cv: integer (nullable = true)
 |-- death_cv_prim

In [169]:
# Flatten array columns to ';'-delimited strings so they can be written to CSV
array_cols = ['death_icd10', 'death_level', 'death_cv_icd10', 'death_cv_level',
              'ascvd01_epiend', 'ascvd01_admidate', 'ascvd01_disdate',
              'ascvd01_ins_index', 'ascvd01_arr_index', 'ascvd01_spell_index',
              'ascvd01_diag_level', 'ascvd01_oper_level',
              'ascvd01_icd9', 'ascvd01_icd10', 'ascvd01_oper3', 'ascvd01_oper4',
              'ascvd01_read_2', 'ascvd01_read_3',
              'ascvd01_group_icd9', 'ascvd01_group_icd10',
              'ascvd01_group_oper3', 'ascvd01_group_oper4',
              'ascvd01_group_readv2', 'ascvd01_group_ctv3']
for c in array_cols:
    df_cohort_ascvd = df_cohort_ascvd.withColumn(c, f.concat_ws(';', f.col(c)))


df_cohort_ascvd.printSchema()

root
 |-- eid: string (nullable = true)
 |-- month_of_birth: string (nullable = true)
 |-- year_of_birth: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- date_of_ac_0: date (nullable = true)
 |-- date_of_ac_1: date (nullable = true)
 |-- date_of_ac_2: date (nullable = true)
 |-- date_of_ac_3: date (nullable = true)
 |-- studyend: date (nullable = true)
 |-- date_of_birth: date (nullable = true)
 |-- ethnic: string (nullable = true)
 |-- lpa: double (nullable = true)
 |-- lpa_threshold: string (nullable = false)
 |-- date_of_death: date (nullable = true)
 |-- death_icd10: string (nullable = false)
 |-- death_level: string (nullable = false)
 |-- death_cv_icd10: string (nullable = false)
 |-- death_cv_level: string (nullable = false)
 |-- death_cv: integer (nullable = true)
 |-- death_cv_primary: integer (nullable = true)
 |-- death_cv02_primary: integer (nullable = true)
 |-- death: integer (nullable = false)
 |-- hesin_record: integer (nullable = true)
 |-- hesin_smr: st
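
The schema above shows the flattened columns as `string (nullable = false)`. That follows from how `concat_ws` treats missing data: it skips null elements and returns an empty string when the whole array is null. The plain-Python function below imitates that behaviour on a single cell. The name `concat_ws_like` and the sample values are hypothetical, and this is an illustration of the semantics, not Spark code.

```python
from typing import Optional, Sequence


def concat_ws_like(sep: str, values: Optional[Sequence[object]]) -> str:
    """Imitate concat_ws on one array cell: skip nulls, stringify, join."""
    if values is None:
        # A null array becomes an empty string rather than a null
        return ""
    return sep.join(str(v) for v in values if v is not None)


codes = concat_ws_like(";", ["I21", "I25"])   # illustrative ICD-10 codes
levels = concat_ws_like(";", [1, None, 2])    # numeric elements are stringified
empty = concat_ws_like(";", None)
print(repr(codes), repr(levels), repr(empty))
```

One consequence is that a participant with no codes ends up with an empty string, which is worth keeping in mind when the CSV is read back and empty fields need to stay distinguishable from other missing-value markers.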

In [170]:
df_cohort_ascvd = df_cohort_ascvd.drop('month_of_birth','year_of_birth')
df_cohort_ascvd.printSchema()

root
 |-- eid: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- date_of_ac_0: date (nullable = true)
 |-- date_of_ac_1: date (nullable = true)
 |-- date_of_ac_2: date (nullable = true)
 |-- date_of_ac_3: date (nullable = true)
 |-- studyend: date (nullable = true)
 |-- date_of_birth: date (nullable = true)
 |-- ethnic: string (nullable = true)
 |-- lpa: double (nullable = true)
 |-- lpa_threshold: string (nullable = false)
 |-- date_of_death: date (nullable = true)
 |-- death_icd10: string (nullable = false)
 |-- death_level: string (nullable = false)
 |-- death_cv_icd10: string (nullable = false)
 |-- death_cv_level: string (nullable = false)
 |-- death_cv: integer (nullable = true)
 |-- death_cv_primary: integer (nullable = true)
 |-- death_cv02_primary: integer (nullable = true)
 |-- death: integer (nullable = false)
 |-- hesin_record: integer (nullable = true)
 |-- hesin_smr: string (nullable = true)
 |-- gp_clinical_record: integer (nullable = false)
 |-- ascvd01_i

In [171]:
# Saving as CSV file
df_cohort_ascvd.toPandas().to_csv('cohort_b.csv', index=False)

  df[column_name] = series


In [172]:
%%bash
dx upload cohort_b.csv --dest /Users/yonghu4/Lpa_EB/cohort/cohort_b.csv

ID                                file-GqVB5gQJ2V1jf00qzZ670v9G
Class                             file
Project                           project-Gfkv20QJ2V1fyG3JfqJB249Z
Folder                            /Users/yonghu4/Lpa_EB/cohort
Name                              cohort_b.csv
State                             closing
Visibility                        visible
Types                             -
Properties                        -
Tags                              -
Outgoing links                    -
Created                           Thu Sep 12 08:14:22 2024
Created by                        huiyee1
 via the job                      job-GqV8XXQJ2V1V75FPb548Gy8p
Last modified                     Thu Sep 12 08:14:24 2024
Media type                        
archivalState                     "live"
cloudAccount                      "cloudaccount-dnanexus"


In [173]:
#reload
dftest = pd.read_csv("/mnt/project/Users/yonghu4/Lpa_EB/cohort/cohort_b.csv", dtype=str, keep_default_na=False)
dftest = spark.createDataFrame(dftest)

In [174]:
dftest.show(5)


In [None]:
#Days to index for each instance that has a non-null smoking record (copied from days_ac2index_N)
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_smoke2index_0', f.when(f.col('smoking_0').isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_smoke2index_1', f.when(f.col('smoking_1').isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_smoke2index_2', f.when(f.col('smoking_2').isNotNull(), f.col('days_ac2index_2')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_smoke2index_3', f.when(f.col('smoking_3').isNotNull(), f.col('days_ac2index_3')).otherwise(f.lit(None)))

#Final smoking status day closest to index
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_smoke2index_closest', f.least(f.col('days_smoke2index_0'),f.col('days_smoke2index_1'),f.col('days_smoke2index_2'),f.col('days_smoke2index_3')))
df_cohort_ascvd.select('eid','smoking_0','smoking_1','smoking_2','smoking_3','days_smoke2index_closest','days_smoke2index_0','days_smoke2index_1','days_smoke2index_2','days_smoke2index_3').filter(f.col('days_smoke2index_closest').isNull()).show(5)
df_cohort_ascvd.select('eid','smoking_0','smoking_1','smoking_2','smoking_3','days_smoke2index_closest','days_smoke2index_0','days_smoke2index_1','days_smoke2index_2','days_smoke2index_3').filter(f.col('days_smoke2index_closest').isNotNull()).show(5)


In [None]:
smokecols = ['days_smoke2index_0','days_smoke2index_1','days_smoke2index_2','days_smoke2index_3']
smokecol_string = ', '.join("'{0}'".format(c) for c in smokecols)

df_cohort_ascvd = df_cohort_ascvd.withColumn('vals', f.array(smokecols)) \
                                  .withColumn('smokecols', f.expr('Array(' + smokecol_string + ')')) \
                                  .withColumn('smokezipped', f.arrays_zip('vals', 'smokecols')) \
                                  .withColumn('smokewithout_nulls', f.expr('filter(smokezipped, x -> not x.vals is null)')) \
                                  .withColumn('smokesorted', f.expr('array_sort(smokewithout_nulls)')) \
                                  .withColumn('smoking_final_col', f.col('smokesorted')[0].smokecols) \
                                  .drop('vals', 'smokecols', 'smokezipped', 'smokewithout_nulls', 'smokesorted')

In [None]:
df_cohort_ascvd = df_cohort_ascvd.withColumn('baseline_smoking', f.when((f.col('smoking_final_col') == f.lit('days_smoke2index_0')), f.col('smoking_0'))
                                                                  .when((f.col('smoking_final_col') == f.lit('days_smoke2index_1')), f.col('smoking_1'))
                                                                  .when((f.col('smoking_final_col') == f.lit('days_smoke2index_2')), f.col('smoking_2'))
                                                                  .when((f.col('smoking_final_col') == f.lit('days_smoke2index_3')), f.col('smoking_3'))
                                                                  .otherwise(f.lit('missing')))

df_cohort_ascvd.select('eid', 'baseline_smoking', 'days_smoke2index_0', 'days_smoke2index_1', 'days_smoke2index_2', 'days_smoke2index_3', 'smoking_0', 'smoking_1', 'smoking_2', 'smoking_3').show(5)
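
The zip, filter, and sort steps above amount to a simple row-wise rule: among instances that have a recorded value, take the one with the smallest days-to-index, and fall back to `'missing'` when none exists. The plain-Python sketch below restates that rule for one participant. The function name `pick_baseline` and the sample values are hypothetical, and the tie-break towards the lower instance number is an assumption based on the struct sort comparing the column name after the day count.

```python
from typing import List, Optional, Tuple


def pick_baseline(values: List[Optional[str]],
                  days: List[Optional[int]],
                  missing: str = "missing") -> Tuple[str, Optional[int]]:
    """Return (value, instance) for the recorded value with the smallest days-to-index."""
    candidates = [
        (d, i)
        for i, (v, d) in enumerate(zip(values, days))
        if v is not None and d is not None  # drop instances without a usable record
    ]
    if not candidates:
        return missing, None
    _, best = min(candidates)  # ties on days resolve to the lower instance number
    return values[best], best


# Hypothetical participant: smoking_0..3 and days_ac2index_0..3
chosen = pick_baseline([None, "Previous", "Current", None], [3000, 1200, 400, None])
nothing = pick_baseline([None, None, None, None], [3000, 1200, 400, 90])
print(chosen, nothing)
```

Returning the instance number alongside the value makes it easy to check which visit a baseline came from when spot-checking rows.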


In [None]:
#Days to index for each instance that has a non-null height record (copied from days_ac2index_N)
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_height2index_0', f.when(f.col('height_0').isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_height2index_1', f.when(f.col('height_1').isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_height2index_2', f.when(f.col('height_2').isNotNull(), f.col('days_ac2index_2')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_height2index_3', f.when(f.col('height_3').isNotNull(), f.col('days_ac2index_3')).otherwise(f.lit(None)))

#Final height status day closest to index
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_height2index_closest', f.least(f.col('days_height2index_0'),f.col('days_height2index_1'),f.col('days_height2index_2'),f.col('days_height2index_3')))
df_cohort_ascvd.select('eid','height_0','height_1','height_2','height_3','days_height2index_closest','days_height2index_0','days_height2index_1','days_height2index_2','days_height2index_3').filter(f.col('days_height2index_closest').isNull()).show(5)
df_cohort_ascvd.select('eid','height_0','height_1','height_2','height_3','days_height2index_closest','days_height2index_0','days_height2index_1','days_height2index_2','days_height2index_3').filter(f.col('days_height2index_closest').isNotNull()).show(5)

heightcols = ['days_height2index_0','days_height2index_1','days_height2index_2','days_height2index_3']
heightcol_string = ', '.join("'{0}'".format(c) for c in heightcols)

df_cohort_ascvd = df_cohort_ascvd.withColumn('vals', f.array(heightcols)) \
                                  .withColumn('heightcols', f.expr('Array(' + heightcol_string + ')')) \
                                  .withColumn('heightzipped', f.arrays_zip('vals', 'heightcols')) \
                                  .withColumn('heightwithout_nulls', f.expr('filter(heightzipped, x -> not x.vals is null)')) \
                                  .withColumn('heightsorted', f.expr('array_sort(heightwithout_nulls)')) \
                                  .withColumn('height_final_col', f.col('heightsorted')[0].heightcols) \
                                  .drop('vals', 'heightcols', 'heightzipped', 'heightwithout_nulls', 'heightsorted')
  
df_cohort_ascvd = df_cohort_ascvd.withColumn('baseline_height', f.when((f.col('height_final_col') == f.lit('days_height2index_0')), f.col('height_0'))
                                                                  .when((f.col('height_final_col') == f.lit('days_height2index_1')), f.col('height_1'))
                                                                  .when((f.col('height_final_col') == f.lit('days_height2index_2')), f.col('height_2'))
                                                                  .when((f.col('height_final_col') == f.lit('days_height2index_3')), f.col('height_3'))
                                                                  .otherwise(f.lit('missing')))

df_cohort_ascvd.select('eid', 'baseline_height', 'days_height2index_0', 'days_height2index_1', 'days_height2index_2', 'days_height2index_3', 'height_0', 'height_1', 'height_2', 'height_3').show(5)




In [None]:
#Days to index for each instance that has a non-null weight record (copied from days_ac2index_N)
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_weight2index_0', f.when(f.col('weight_0').isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_weight2index_1', f.when(f.col('weight_1').isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_weight2index_2', f.when(f.col('weight_2').isNotNull(), f.col('days_ac2index_2')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_weight2index_3', f.when(f.col('weight_3').isNotNull(), f.col('days_ac2index_3')).otherwise(f.lit(None)))

#Final weight status day closest to index
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_weight2index_closest', f.least(f.col('days_weight2index_0'),f.col('days_weight2index_1'),f.col('days_weight2index_2'),f.col('days_weight2index_3')))
df_cohort_ascvd.select('eid','weight_0','weight_1','weight_2','weight_3','days_weight2index_closest','days_weight2index_0','days_weight2index_1','days_weight2index_2','days_weight2index_3').filter(f.col('days_weight2index_closest').isNull()).show(5)
df_cohort_ascvd.select('eid','weight_0','weight_1','weight_2','weight_3','days_weight2index_closest','days_weight2index_0','days_weight2index_1','days_weight2index_2','days_weight2index_3').filter(f.col('days_weight2index_closest').isNotNull()).show(5)

weightcols = ['days_weight2index_0','days_weight2index_1','days_weight2index_2','days_weight2index_3']
weightcol_string = ', '.join("'{0}'".format(c) for c in weightcols)

df_cohort_ascvd = df_cohort_ascvd.withColumn('vals', f.array(weightcols)) \
                                  .withColumn('weightcols', f.expr('Array(' + weightcol_string + ')')) \
                                  .withColumn('weightzipped', f.arrays_zip('vals', 'weightcols')) \
                                  .withColumn('weightwithout_nulls', f.expr('filter(weightzipped, x -> not x.vals is null)')) \
                                  .withColumn('weightsorted', f.expr('array_sort(weightwithout_nulls)')) \
                                  .withColumn('weight_final_col', f.col('weightsorted')[0].weightcols) \
                                  .drop('vals', 'weightcols', 'weightzipped', 'weightwithout_nulls', 'weightsorted')

df_cohort_ascvd = df_cohort_ascvd.withColumn('baseline_weight', f.when((f.col('weight_final_col') == f.lit('days_weight2index_0')), f.col('weight_0'))
                                                                  .when((f.col('weight_final_col') == f.lit('days_weight2index_1')), f.col('weight_1'))
                                                                  .when((f.col('weight_final_col') == f.lit('days_weight2index_2')), f.col('weight_2'))
                                                                  .when((f.col('weight_final_col') == f.lit('days_weight2index_3')), f.col('weight_3'))
                                                                  .otherwise(f.lit('missing')))

df_cohort_ascvd.select('eid', 'baseline_weight', 'days_weight2index_0', 'days_weight2index_1', 'days_weight2index_2', 'days_weight2index_3', 'weight_0', 'weight_1', 'weight_2', 'weight_3').show(5)



In [None]:
#Days to index for each instance that has a non-null bmi record (copied from days_ac2index_N)
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_bmi2index_0', f.when(f.col('bmi_0').isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_bmi2index_1', f.when(f.col('bmi_1').isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_bmi2index_2', f.when(f.col('bmi_2').isNotNull(), f.col('days_ac2index_2')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_bmi2index_3', f.when(f.col('bmi_3').isNotNull(), f.col('days_ac2index_3')).otherwise(f.lit(None)))

#Final bmi status day closest to index
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_bmi2index_closest', f.least(f.col('days_bmi2index_0'),f.col('days_bmi2index_1'),f.col('days_bmi2index_2'),f.col('days_bmi2index_3')))
df_cohort_ascvd.select('eid','bmi_0','bmi_1','bmi_2','bmi_3','days_bmi2index_closest','days_bmi2index_0','days_bmi2index_1','days_bmi2index_2','days_bmi2index_3').filter(f.col('days_bmi2index_closest').isNull()).show(5)
df_cohort_ascvd.select('eid','bmi_0','bmi_1','bmi_2','bmi_3','days_bmi2index_closest','days_bmi2index_0','days_bmi2index_1','days_bmi2index_2','days_bmi2index_3').filter(f.col('days_bmi2index_closest').isNotNull()).show(5)

bmicols = ['days_bmi2index_0','days_bmi2index_1','days_bmi2index_2','days_bmi2index_3']
bmicol_string = ', '.join("'{0}'".format(c) for c in bmicols)

df_cohort_ascvd = df_cohort_ascvd.withColumn('vals', f.array(bmicols)) \
                                  .withColumn('bmicols', f.expr('Array(' + bmicol_string + ')')) \
                                  .withColumn('bmizipped', f.arrays_zip('vals', 'bmicols')) \
                                  .withColumn('bmiwithout_nulls', f.expr('filter(bmizipped, x -> not x.vals is null)')) \
                                  .withColumn('bmisorted', f.expr('array_sort(bmiwithout_nulls)')) \
                                  .withColumn('bmi_final_col', f.col('bmisorted')[0].bmicols) \
                                  .drop('vals', 'bmicols', 'bmizipped', 'bmiwithout_nulls', 'bmisorted')
  
df_cohort_ascvd = df_cohort_ascvd.withColumn('baseline_bmi', f.when((f.col('bmi_final_col') == f.lit('days_bmi2index_0')), f.col('bmi_0'))
                                                                  .when((f.col('bmi_final_col') == f.lit('days_bmi2index_1')), f.col('bmi_1'))
                                                                  .when((f.col('bmi_final_col') == f.lit('days_bmi2index_2')), f.col('bmi_2'))
                                                                  .when((f.col('bmi_final_col') == f.lit('days_bmi2index_3')), f.col('bmi_3'))
                                                                  .otherwise(f.lit('missing')))

df_cohort_ascvd.select('eid', 'baseline_bmi', 'days_bmi2index_0', 'days_bmi2index_1', 'days_bmi2index_2', 'days_bmi2index_3', 'bmi_0', 'bmi_1', 'bmi_2', 'bmi_3').show(5)


In [None]:
#Days to index for each instance that has a non-null tc (total cholesterol) record (copied from days_ac2index_N)
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_tc2index_0', f.when(f.col('tc_0').isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_tc2index_1', f.when(f.col('tc_1').isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))

#Final tc status day closest to index
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_tc2index_closest', f.least(f.col('days_tc2index_0'),f.col('days_tc2index_1')))
df_cohort_ascvd.select('eid','tc_0','tc_1','days_tc2index_closest','days_tc2index_0','days_tc2index_1').filter(f.col('days_tc2index_closest').isNull()).show(5)
df_cohort_ascvd.select('eid','tc_0','tc_1','days_tc2index_closest','days_tc2index_0','days_tc2index_1').filter(f.col('days_tc2index_closest').isNotNull()).show(5)

tccols = ['days_tc2index_0','days_tc2index_1']
tccol_string = ', '.join("'{0}'".format(c) for c in tccols)

df_cohort_ascvd = df_cohort_ascvd.withColumn('vals', f.array(tccols)) \
                                  .withColumn('tccols', f.expr('Array(' + tccol_string + ')')) \
                                  .withColumn('tczipped', f.arrays_zip('vals', 'tccols')) \
                                  .withColumn('tcwithout_nulls', f.expr('filter(tczipped, x -> not x.vals is null)')) \
                                  .withColumn('tcsorted', f.expr('array_sort(tcwithout_nulls)')) \
                                  .withColumn('tc_final_col', f.col('tcsorted')[0].tccols) \
                                  .drop('vals', 'tccols', 'tczipped', 'tcwithout_nulls', 'tcsorted')
  
df_cohort_ascvd = df_cohort_ascvd.withColumn('baseline_tc', f.when((f.col('tc_final_col') == f.lit('days_tc2index_0')), f.col('tc_0'))
                                                                  .when((f.col('tc_final_col') == f.lit('days_tc2index_1')), f.col('tc_1'))
                                                                  .otherwise(f.lit('missing')))

df_cohort_ascvd.select('eid', 'baseline_tc', 'days_tc2index_0', 'days_tc2index_1', 'tc_0', 'tc_1').show(5)


In [None]:
#Days to index for each instance that has a non-null hdl record (copied from days_ac2index_N)
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_hdl2index_0', f.when(f.col('hdl_0').isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_hdl2index_1', f.when(f.col('hdl_1').isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))

#Final hdl status day closest to index
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_hdl2index_closest', f.least(f.col('days_hdl2index_0'),f.col('days_hdl2index_1')))
df_cohort_ascvd.select('eid','hdl_0','hdl_1','days_hdl2index_closest','days_hdl2index_0','days_hdl2index_1').filter(f.col('days_hdl2index_closest').isNull()).show(5)
df_cohort_ascvd.select('eid','hdl_0','hdl_1','days_hdl2index_closest','days_hdl2index_0','days_hdl2index_1').filter(f.col('days_hdl2index_closest').isNotNull()).show(5)

hdlcols = ['days_hdl2index_0','days_hdl2index_1']
hdlcol_string = ', '.join("'{0}'".format(c) for c in hdlcols)

df_cohort_ascvd = df_cohort_ascvd.withColumn('vals', f.array(hdlcols)) \
                                  .withColumn('hdlcols', f.expr('Array(' + hdlcol_string + ')')) \
                                  .withColumn('hdlzipped', f.arrays_zip('vals', 'hdlcols')) \
                                  .withColumn('hdlwithout_nulls', f.expr('filter(hdlzipped, x -> not x.vals is null)')) \
                                  .withColumn('hdlsorted', f.expr('array_sort(hdlwithout_nulls)')) \
                                  .withColumn('hdl_final_col', f.col('hdlsorted')[0].hdlcols) \
                                  .drop('vals', 'hdlcols', 'hdlzipped', 'hdlwithout_nulls', 'hdlsorted')

df_cohort_ascvd = df_cohort_ascvd.withColumn('baseline_hdl', f.when((f.col('hdl_final_col') == f.lit('days_hdl2index_0')), f.col('hdl_0'))
                                                                  .when((f.col('hdl_final_col') == f.lit('days_hdl2index_1')), f.col('hdl_1'))
                                                                  .otherwise(f.lit('missing')))

df_cohort_ascvd.select('eid', 'baseline_hdl', 'days_hdl2index_0', 'days_hdl2index_1', 'hdl_0', 'hdl_1').show(5)


In [None]:
#Days to index for each instance that has a non-null ldl record (copied from days_ac2index_N)
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_ldl2index_0', f.when(f.col('ldl_0').isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_ldl2index_1', f.when(f.col('ldl_1').isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))

#Final ldl status day closest to index
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_ldl2index_closest', f.least(f.col('days_ldl2index_0'),f.col('days_ldl2index_1')))
df_cohort_ascvd.select('eid','ldl_0','ldl_1','days_ldl2index_closest','days_ldl2index_0','days_ldl2index_1').filter(f.col('days_ldl2index_closest').isNull()).show(5)
df_cohort_ascvd.select('eid','ldl_0','ldl_1','days_ldl2index_closest','days_ldl2index_0','days_ldl2index_1').filter(f.col('days_ldl2index_closest').isNotNull()).show(5)

ldlcols = ['days_ldl2index_0','days_ldl2index_1']
ldlcol_string = ', '.join("'{0}'".format(c) for c in ldlcols)

df_cohort_ascvd = df_cohort_ascvd.withColumn('vals', f.array(ldlcols)) \
                                  .withColumn('ldlcols', f.expr('Array(' + ldlcol_string + ')')) \
                                  .withColumn('ldlzipped', f.arrays_zip('vals', 'ldlcols')) \
                                  .withColumn('ldlwithout_nulls', f.expr('filter(ldlzipped, x -> not x.vals is null)')) \
                                  .withColumn('ldlsorted', f.expr('array_sort(ldlwithout_nulls)')) \
                                  .withColumn('ldl_final_col', f.col('ldlsorted')[0].ldlcols) \
                                  .drop('vals', 'ldlcols', 'ldlzipped', 'ldlwithout_nulls', 'ldlsorted')
  
df_cohort_ascvd = df_cohort_ascvd.withColumn('baseline_ldl', f.when((f.col('ldl_final_col') == f.lit('days_ldl2index_0')), f.col('ldl_0'))
                                                                  .when((f.col('ldl_final_col') == f.lit('days_ldl2index_1')), f.col('ldl_1'))
                                                                  .otherwise(f.lit('missing')))

df_cohort_ascvd.select('eid', 'baseline_ldl', 'days_ldl2index_0', 'days_ldl2index_1', 'ldl_0', 'ldl_1').show(5)


In [None]:
#Days to index for each instance that has a non-null crp record (copied from days_ac2index_N)
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_crp2index_0', f.when(f.col('crp_0').isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_crp2index_1', f.when(f.col('crp_1').isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))

#Final crp status day closest to index
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_crp2index_closest', f.least(f.col('days_crp2index_0'),f.col('days_crp2index_1')))
df_cohort_ascvd.select('eid','crp_0','crp_1','days_crp2index_closest','days_crp2index_0','days_crp2index_1').filter(f.col('days_crp2index_closest').isNull()).show(5)
df_cohort_ascvd.select('eid','crp_0','crp_1','days_crp2index_closest','days_crp2index_0','days_crp2index_1').filter(f.col('days_crp2index_closest').isNotNull()).show(5)

crpcols = ['days_crp2index_0','days_crp2index_1']
crpcol_string = ', '.join("'{0}'".format(c) for c in crpcols)

df_cohort_ascvd = df_cohort_ascvd.withColumn('vals', f.array(crpcols)) \
                                  .withColumn('crpcols', f.expr('Array(' + crpcol_string + ')')) \
                                  .withColumn('crpzipped', f.arrays_zip('vals', 'crpcols')) \
                                  .withColumn('crpwithout_nulls', f.expr('filter(crpzipped, x -> not x.vals is null)')) \
                                  .withColumn('crpsorted', f.expr('array_sort(crpwithout_nulls)')) \
                                  .withColumn('crp_final_col', f.col('crpsorted')[0].crpcols) \
                                  .drop('vals', 'crpcols', 'crpzipped', 'crpwithout_nulls', 'crpsorted')
  
df_cohort_ascvd = df_cohort_ascvd.withColumn('baseline_crp', f.when((f.col('crp_final_col') == f.lit('days_crp2index_0')), f.col('crp_0'))
                                                                  .when((f.col('crp_final_col') == f.lit('days_crp2index_1')), f.col('crp_1'))
                                                                  .otherwise(f.lit('missing')))

df_cohort_ascvd.select('eid', 'baseline_crp', 'days_crp2index_0', 'days_crp2index_1', 'crp_0', 'crp_1').show(5)


In [None]:
#Derive non-HDL cholesterol per instance (total cholesterol minus HDL); null if either value is missing
df_cohort_ascvd = df_cohort_ascvd.withColumn('nonhdl_0', f.when((f.col('tc_0').isNotNull()) & (f.col('hdl_0').isNotNull()), f.col('tc_0') - f.col('hdl_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('nonhdl_1', f.when((f.col('tc_1').isNotNull()) & (f.col('hdl_1').isNotNull()), f.col('tc_1') - f.col('hdl_1')).otherwise(f.lit(None)))
                                                    

df_cohort_ascvd = df_cohort_ascvd.withColumn('days_nonhdl2index_0', f.when(f.col('nonhdl_0').isNotNull(), f.col('days_ac2index_0')).otherwise(f.lit(None)))
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_nonhdl2index_1', f.when(f.col('nonhdl_1').isNotNull(), f.col('days_ac2index_1')).otherwise(f.lit(None)))

#Final nonhdl status day closest to index
df_cohort_ascvd = df_cohort_ascvd.withColumn('days_nonhdl2index_closest', f.least(f.col('days_nonhdl2index_0'),f.col('days_nonhdl2index_1')))
df_cohort_ascvd.select('eid','nonhdl_0','nonhdl_1','days_nonhdl2index_closest','days_nonhdl2index_0','days_nonhdl2index_1').filter(f.col('days_nonhdl2index_closest').isNull()).show(5)
df_cohort_ascvd.select('eid','nonhdl_0','nonhdl_1','days_nonhdl2index_closest','days_nonhdl2index_0','days_nonhdl2index_1').filter(f.col('days_nonhdl2index_closest').isNotNull()).show(5)

nonhdlcols = ['days_nonhdl2index_0','days_nonhdl2index_1']
nonhdlcol_string = ', '.join("'{0}'".format(c) for c in nonhdlcols)

df_cohort_ascvd = df_cohort_ascvd.withColumn('vals', f.array(nonhdlcols)) \
                                  .withColumn('nonhdlcols', f.expr('Array(' + nonhdlcol_string + ')')) \
                                  .withColumn('nonhdlzipped', f.arrays_zip('vals', 'nonhdlcols')) \
                                  .withColumn('nonhdlwithout_nulls', f.expr('filter(nonhdlzipped, x -> not x.vals is null)')) \
                                  .withColumn('nonhdlsorted', f.expr('array_sort(nonhdlwithout_nulls)')) \
                                  .withColumn('nonhdl_final_col', f.col('nonhdlsorted')[0].nonhdlcols) \
                                  .drop('vals', 'nonhdlcols', 'nonhdlzipped', 'nonhdlwithout_nulls', 'nonhdlsorted')
 
df_cohort_ascvd = df_cohort_ascvd.withColumn('baseline_nonhdl', f.when((f.col('nonhdl_final_col') == f.lit('days_nonhdl2index_0')), f.col('nonhdl_0'))
                                                                  .when((f.col('nonhdl_final_col') == f.lit('days_nonhdl2index_1')), f.col('nonhdl_1'))
                                                                  .otherwise(f.lit('missing')))

df_cohort_ascvd.select('eid', 'baseline_nonhdl', 'days_nonhdl2index_0', 'days_nonhdl2index_1', 'nonhdl_0', 'nonhdl_1').show(5)
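
The cells above repeat the same selection for each variable, changing only the variable name and the number of instances. A table-driven version of that rule might look like the plain-Python sketch below, which works on one participant represented as a dict. The `INSTANCES` mapping is read off the cells above, while the function names and the toy row are hypothetical. It shows the logic only and is not a replacement for the Spark code.

```python
from typing import Dict, Optional

# Assumed instance counts per variable, taken from the cells above
INSTANCES = {"smoking": 4, "height": 4, "weight": 4, "bmi": 4,
             "tc": 2, "hdl": 2, "ldl": 2, "crp": 2, "nonhdl": 2}


def non_hdl(tc: Optional[float], hdl: Optional[float]) -> Optional[float]:
    """Total cholesterol minus HDL; None if either input is missing."""
    if tc is None or hdl is None:
        return None
    return tc - hdl


def add_baselines(row: Dict[str, object]) -> Dict[str, object]:
    """Add nonhdl_N and baseline_<var> keys to a copy of one participant's row."""
    out = dict(row)
    for i in range(INSTANCES["nonhdl"]):
        out[f"nonhdl_{i}"] = non_hdl(out.get(f"tc_{i}"), out.get(f"hdl_{i}"))
    for var, n in INSTANCES.items():
        best_days, best_val = None, "missing"
        for i in range(n):
            val = out.get(f"{var}_{i}")
            days = out.get(f"days_ac2index_{i}")
            if val is None or days is None:
                continue  # instance has no usable record
            if best_days is None or days < best_days:
                best_days, best_val = days, val
        out[f"baseline_{var}"] = best_val
    return out


# Hypothetical participant with two visits
toy_row = {"days_ac2index_0": 2500, "days_ac2index_1": 900,
           "bmi_0": 27.1, "bmi_1": 28.4,
           "tc_0": 5.5, "hdl_0": 1.5, "tc_1": None, "hdl_1": 1.25}
result = add_baselines(toy_row)
print({k: v for k, v in result.items() if k.startswith("baseline_")})
```

In this toy row the baseline total cholesterol comes from instance 0 and the baseline HDL from instance 1, because each variable is selected independently. The same can happen in the Spark cells, so a baseline non-HDL value is not necessarily the difference of the two baseline lipid values.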


In [None]:
df_cohort_ascvd = df_cohort_ascvd.withColumn('bp_diastolic_i0', f.when((f.col('bp_diastolic_i0a0').isNotNull()) & (f.col('bp_diastolic_i0a1').isNotNull()), (f.col('bp_diastolic_i0a0') + f.col('bp_diastolic_i0a1')) / 2)
                                                                 .when((f.col('bp_diastolic_i0a0').isNotNull()) & (f.col('bp_diastolic_i0a1').isNull()), f.col('bp_diastolic_i0a0'))
                                                                 .when((f.col('bp_diastolic_i0a0').isNull()) & (f.col('bp_diastolic_i0a1').isNotNull()), f.col('bp_diastolic_i0a1'))
                                                                 .otherwise(f.lit(None)))

# bp_diastolic: average the two readings per instance (instances 0-3)
for x in range(4):
    var = 'bp_diastolic'
    var_ix = var + '_i' + str(x)
    var_ixa0 = var_ix + 'a0'
    var_ixa1 = var_ix + 'a1'
    df_cohort_ascvd = df_cohort_ascvd.withColumn(var_ix, f.when((f.col(var_ixa0).isNotNull()) & (f.col(var_ixa1).isNotNull()), (f.col(var_ixa0) + f.col(var_ixa1)) / 2)
                                                          .when((f.col(var_ixa0).isNotNull()) & (f.col(var_ixa1).isNull()), f.col(var_ixa0))
                                                          .when((f.col(var_ixa0).isNull()) & (f.col(var_ixa1).isNotNull()), f.col(var_ixa1))
                                                          .otherwise(f.lit(None)))

# Field-to-column mapping used for diastolic blood pressure (data-field 4079):
# 'p4079_i0_a0': 'bp_diastolic_i0a0', # (HY) Diastolic blood pressure instance 1 array 1
# 'p4079_i0_a1': 'bp_diastolic_i0a1', # (HY) Diastolic blood pressure instance 1 array 2
# 'p4079_i1_a0': 'bp_diastolic_i1a0', # (HY) Diastolic blood pressure instance 2 array 1
# 'p4079_i1_a1': 'bp_diastolic_i1a1', # (HY) Diastolic blood pressure instance 2 array 2
# 'p4079_i2_a0': 'bp_diastolic_i2a0', # (HY) Diastolic blood pressure instance 3 array 1
# 'p4079_i2_a1': 'bp_diastolic_i2a1', # (HY) Diastolic blood pressure instance 3 array 2
# 'p4079_i3_a0': 'bp_diastolic_i3a0', # (HY) Diastolic blood pressure instance 4 array 1
# 'p4079_i3_a1': 'bp_diastolic_i3a1', # (HY) Diastolic blood pressure instance 4 array 2
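
The blood pressure step averages the two readings of an instance when both exist, falls back to whichever single reading is present, and yields null otherwise. A minimal plain-Python sketch of that rule, useful for checking individual rows (`mean_of_available` is a hypothetical name, not a function from this notebook):

```python
from typing import Optional


def mean_of_available(first: Optional[float],
                      second: Optional[float]) -> Optional[float]:
    """Average the readings that are present; None when both are missing."""
    readings = [r for r in (first, second) if r is not None]
    if not readings:
        return None
    return sum(readings) / len(readings)
```

For example, `mean_of_available(80, 90)` gives the mean of both readings, while `mean_of_available(80, None)` returns the single available reading unchanged.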

In [None]:
# Create CVD related ICD-10 code list
code_ascvd_icd10_diag = code_ascvd_icd10.select(f.col('code').alias('diag_icd10')).distinct()
code_cvd_icd10 = code_ascvd_icd10_diag.union(code_hf_icd10_diag)

# Identify CV death
death_cause_cv_df = death_cause_df.join(code_cvd_icd10, 'diag_icd10','inner')
death_cause_cv_eid = death_cause_cv_df.select('eid').distinct()
death_cause_cv_eid = death_cause_cv_eid.withColumn('death_cv', f.lit(1))
death_cv_df = death_df.join(death_cause_cv_eid,'eid','left')
death_cv_df = death_cv_df.withColumn('death_cv', f.when((f.col('death_cv').isNull()), f.lit(0)).otherwise(f.col('death_cv')))




# death_
# code_cvd_icd10 = 

death_field_names = ['eid',
                     'ins_index',
                     #'dsource',
                     #'source',
                     'date_of_death'
                    ]

death_cause_field_names = ['eid',
                           'ins_index',
                           'arr_index',
                           'level',
                           'cause_icd10'
                          ]


In [None]:
# EIDs for criterion 5: check whether linked hospital care data spans from the date of the initial ASCVD diagnosis/event.
eidc5 = df_cohort_ascvd.select('eid')

#HES data last admidate, epistart, epiend and disdate
hesin_c5 = hesin_df.join(eidc5, 'eid', 'inner')
hesin_c5 = hesin_c5.select('eid', 'dsource', 'admidate', 'epistart', 'epiend', 'disdate','disdate_im','censordate').distinct()

#Get max date for hesin dates
hesin_c5_maxdate = hesin_c5.groupBy('eid').agg(f.max(f.col('admidate')).alias('admidate_last'),f.max(f.col('epistart')).alias('epistart_last'),f.max(f.col('epiend')).alias('epiend_last'),f.max(f.col('disdate')).alias('disdate_last'))

#Get the last record of hesin
windowcd = Window.partitionBy('eid').orderBy(f.col('disdate_im').desc(),f.col('epiend').desc(),f.col('epistart').desc(),f.col('admidate').desc())
hesin_c5_lc = hesin_c5.withColumn('disdate_im_last', f.rank().over(windowcd))                                                               
hesin_c5_lc = hesin_c5_lc.filter(f.col('disdate_im_last') == 1)
hesin_c5_lc_disdatecount = hesin_c5_lc.groupBy('eid').agg(f.count('*').alias('ddcount'))
hesin_c5_lc = hesin_c5_lc.join(hesin_c5_lc_disdatecount, 'eid', 'left')

#Uncomment this code if there are no redundant last records
#hesin_c5_lc = hesin_c5_lc.select('eid','disdate_im_last')

hesin_c5_lc_count = hesin_c5_lc.count()
hesin_c5_lc_eidcount = hesin_c5_lc.select('eid').distinct().count()
hesin_c5_lc_eiddscount = hesin_c5_lc.select('eid','dsource').distinct().count()

print(f'Number of rows for hesin_c5_lc: {hesin_c5_lc_count}')
print(f'Unique patient count for hesin_c5_lc: {hesin_c5_lc_eidcount}')
print(f'Unique patient and dsource count for hesin_c5_lc: {hesin_c5_lc_eiddscount}')

hesin_c5_lc.filter(f.col('ddcount') > 1).show(10)
#Check whether there are redundant last records before proceeding to the next step.
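
`f.rank()` gives tied rows the same rank, so a participant whose latest records share identical dates keeps more than one "last" row. That is what the `ddcount > 1` check above surfaces. The behaviour can be illustrated in plain Python. This is a sketch only: `last_records` is a hypothetical helper, and it assumes the sort keys are non-null and directly comparable (for example ISO date strings).

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar('T')


def last_records(records: Sequence[T], key: Callable[[T], object]) -> List[T]:
    """Return every record tied for the largest key.

    This mirrors keeping rank == 1 after ordering by the key descending:
    ties are all retained rather than reduced to a single row.
    """
    if not records:
        return []
    top = max(key(r) for r in records)
    return [r for r in records if key(r) == top]
```

If exactly one row per participant is required, `f.row_number()` is the usual alternative, at the cost of an arbitrary choice between tied rows.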

In [None]:
hesin_c5_lc_eiddsdi = hesin_c5_lc.select('eid','dsource','censordate').distinct()
hesin_c5_lc_eiddsdicount = hesin_c5_lc_eiddsdi.count()

print(f'Unique patient, dsource and censordate count for hesin_c5_lc: {hesin_c5_lc_eiddsdicount}')

In [None]:
death_df_date = death_df.select('eid','date_of_death').distinct()
hesin_date_final = hesin_c5_maxdate.join(death_df_date, 'eid', 'left')
hesin_date_final = hesin_date_final.join(hesin_c5_lc_eiddsdi, 'eid', 'left')
hesin_date_final = hesin_date_final.withColumn('last_followup_date', f.when((f.col('date_of_death') < f.col('censordate')), f.col('date_of_death')).otherwise(f.col('censordate')))

# hesin_c5_death = hesin_c5_death.withColumn('last_followupdate_all', f.greatest(f.col('admidate_last'), f.col('epistart_last'), f.col('epiend_last'), f.col('disdate_last'), f.col('date_of_death')))
# hesin_c5_death = hesin_c5_death.withColumn('last_followupdate_hes', f.greatest(f.col('admidate_last'), f.col('epistart_last'), f.col('epiend_last'), f.col('disdate_last')))
# hesin_c5_death = hesin_c5_death.withColumn('last_followupdate', f.when((f.col('dsource') == 'HES') & (f.col('date_of_death').isNotNull()) & (f.col('date_of_death') <= f.lit('2022-10-31')), f.col('date_of_death'))
#                                                                  .when((f.col('dsource') == 'SMR') & (f.col('date_of_death').isNotNull()) & (f.col('date_of_death') <= f.lit('2022-08-31')), f.col('date_of_death'))
#                                                                  .when((f.col('dsource') == 'PEDW') & (f.col('date_of_death').isNotNull()) & (f.col('date_of_death') <= f.lit('2022-05-31')), f.col('date_of_death'))
#                                                                  .when((f.col('dsource') == 'HES') & (f.col('date_of_death').isNull()), f.lit('2022-10-31'))
#                                                                  .when((f.col('dsource') == 'SMR') & (f.col('date_of_death').isNull()), f.lit('2022-08-31'))
#                                                                  .when((f.col('dsource') == 'PEDW') & (f.col('date_of_death').isNull()), f.lit('2022-05-31'))
#                                                                  .otherwise(f.lit(None)))
hesin_date_final.show(5)

hesin_date_final_count = hesin_date_final.count()
hesin_date_final_eidcount = hesin_date_final.select('eid').distinct().count()

print(f'Number of rows for hesin_date_final: {hesin_date_final_count}')
print(f'Unique patient count for hesin_date_final: {hesin_date_final_eidcount}')
# hesin_c5_deathqc = hesin_c5_death.filter((f.col('date_of_death') < f.col('admidate_last')) | (f.col('date_of_death') < f.col('epistart_last')) | (f.col('date_of_death') < f.col('epistart_last')) | (f.col('date_of_death') < f.col('epiend_last')) | (f.col('date_of_death') < f.col('disdate_last')))
# hesin_c5_deathqc.show()
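
The `last_followup_date` rule above takes the date of death when it precedes the censoring date and the censoring date in every other case, including when no death is recorded. A small plain-Python sketch of the same decision, with `last_followup_date` as a hypothetical function name chosen to match the column:

```python
from datetime import date
from typing import Optional


def last_followup_date(date_of_death: Optional[date],
                       censordate: Optional[date]) -> Optional[date]:
    """Death date if it falls before the censoring date, else the censoring date."""
    if (date_of_death is not None and censordate is not None
            and date_of_death < censordate):
        return date_of_death
    # No death, death on or after censoring, or missing censoring date
    return censordate
```

Note that a missing `censordate` yields a missing follow-up date even when a death date exists, which matches the `when`/`otherwise` expression in the cell.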

In [None]:
df_cohort_ascvd_i = df_cohort_ascvd.join(hesin_date_final,'eid','left')
#df_cohort_ascvd_i.count()
df_cohort_ascvd_i.printSchema()

In [None]:
#Identify incident cohort
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('crit01', f.lit(1))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('crit02', f.when((f.col('date_of_ac_0').isNotNull()) & (f.col('indexdate').isNotNull()) & (f.col('indexdate') > f.col('date_of_ac_0')) & (f.col('indexdate') <= f.lit('2022-10-31')), f.lit(1)).otherwise(f.lit(0)))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('crit03', f.when((f.col('lpa_baseline').isNotNull()), f.lit(1)).otherwise(f.lit(0)))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('crit04', f.when((f.col('age_index') >= 40), f.lit(1)).otherwise(f.lit(0)))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('crit05', f.when((f.col('epistart_last') > f.col('indexdate')), f.lit(1)).otherwise(f.lit(0)))
#Patients without SMR
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('crit06', f.when((f.col('hesin_smr') != 'Yes'), f.lit(1)).otherwise(f.lit(0)))
#Patients with primary care data
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('crit07', f.when((f.col('gp_diag_record') == 'Yes'), f.lit(1)).otherwise(f.lit(0)))
#Patients with hospital data
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('crit08', f.when((f.col('hesin_record') == 'Yes'), f.lit(1)).otherwise(f.lit(0)))

In [None]:
#Identify incident cohort: subgroup C
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('c_crit01', f.lit(1))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('c_crit02', f.when((f.col('date_of_ac_0').isNotNull()) & (f.col('c_indexdate').isNotNull()) & (f.col('c_indexdate') > f.col('date_of_ac_0')) & (f.col('c_indexdate') <= f.lit('2022-10-31')), f.lit(1)).otherwise(f.lit(0)))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('c_crit03', f.when((f.col('lpa_baseline').isNotNull()), f.lit(1)).otherwise(f.lit(0)))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('c_crit04', f.when((f.col('c_age_index') >= 40), f.lit(1)).otherwise(f.lit(0)))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('c_crit05', f.when((f.col('epistart_last') > f.col('c_indexdate')), f.lit(1)).otherwise(f.lit(0)))


In [None]:
# Patients with outpatient records

## Total UK Biobank participants
count_outpat01 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 1)).count()

## Participants without Scotland admission record
count_outpat02 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 1) & (f.col('crit06') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment
count_outpat03 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 1) & (f.col('crit06') == 1) & (f.col('crit02') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment with valid Lp(a) measurement
count_outpat04 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 1) & (f.col('crit06') == 1) & (f.col('crit02') == 1) & (f.col('crit03') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment with valid Lp(a) measurement aged >= 40 years old
count_outpat05 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 1) & (f.col('crit06') == 1) & (f.col('crit02') == 1) & (f.col('crit03') == 1) & (f.col('crit04') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment with valid Lp(a) measurement aged >= 40 years old with hospital care data span from index date
df_cohort_ascvd_i_a = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 1) & (f.col('crit06') == 1) & (f.col('crit02') == 1) & (f.col('crit03') == 1) & (f.col('crit04') == 1) & (f.col('crit05') == 1))
count_outpat06 = df_cohort_ascvd_i_a.count()

## - ASCVD diagnosis at inpatient
df_cohort_ascvd_i_a_in = df_cohort_ascvd_i_a.filter(f.col('in_or_out') == 'in')
count_outpat06_in = df_cohort_ascvd_i_a_in.count()

## - ASCVD diagnosis at outpatient
df_cohort_ascvd_i_a_out = df_cohort_ascvd_i_a.filter(f.col('in_or_out') == 'out')
count_outpat06_out = df_cohort_ascvd_i_a_out.count()

## - ASCVD diagnosis at both outpatient and inpatient
df_cohort_ascvd_i_a_outin = df_cohort_ascvd_i_a.filter(f.col('in_or_out') == 'out,in')
count_outpat06_outin = df_cohort_ascvd_i_a_outin.count()

print(f'Total UK Biobank participants: {count_outpat01}')
print(f'Participants without Scotland admission record: {count_outpat02}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort): {count_outpat03}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement: {count_outpat04}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old: {count_outpat05}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old with hospital care data span from index date: {count_outpat06}')
print(f'- ASCVD diagnosis at inpatient: {count_outpat06_in}')
print(f'- ASCVD diagnosis at outpatient : {count_outpat06_out}')
print(f'- ASCVD diagnosis at both outpatient and inpatient: {count_outpat06_outin}')
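
Each count in the attrition cell applies one more criterion on top of all previous ones. The cumulative logic can be sketched in plain Python as a loop that narrows the remaining rows step by step. This is an illustration on toy data, not the notebook's method: `attrition_counts` is a hypothetical helper and the rows are plain dictionaries rather than a Spark DataFrame.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def attrition_counts(rows: Iterable[Dict[str, int]],
                     criteria: Sequence[str]) -> List[Tuple[str, int]]:
    """Apply criteria cumulatively and record how many rows survive each step."""
    remaining = list(rows)
    counts = []
    for name in criteria:
        # Keep only rows that pass this criterion as well as all earlier ones
        remaining = [r for r in remaining if r.get(name) == 1]
        counts.append((name, len(remaining)))
    return counts
```

Because every step filters the survivors of the previous step, the counts can never increase from one row of the attrition table to the next, which is a quick sanity check on the printed numbers.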

In [None]:
df_cohort_ascvd_i_a_in.groupBy('group_hesin').count().show(truncate=False)
df_cohort_ascvd_i_a_in_count = df_cohort_ascvd_i_a_in.groupBy('group_hesin').count()

In [None]:
df_cohort_ascvd_i_a_out.groupBy('group_gp').count().show(truncate=False)
df_cohort_ascvd_i_a_out.groupBy('group_hesin').count().show(truncate=False)
df_cohort_ascvd_i_a_out.groupBy('group_gphesin').count().show(truncate=False)
df_cohort_ascvd_i_a.groupBy('group_gphesin').count().show(truncate=False)

df_cohort_ascvd_i_a_out_gpcount = df_cohort_ascvd_i_a_out.groupBy('group_gp').count()
df_cohort_ascvd_i_a_out_hesincount = df_cohort_ascvd_i_a_out.groupBy('group_hesin').count()
df_cohort_ascvd_i_a_out_gphesincount = df_cohort_ascvd_i_a_out.groupBy('group_gphesin').count()
df_cohort_ascvd_i_a_gphesincount = df_cohort_ascvd_i_a.groupBy('group_gphesin').count()

In [None]:
# Patients with inpatient record only

## Total UK Biobank participants
count_inonlypat01 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 0) & (f.col('crit08') == 1)).count()

## Participants without Scotland admission record
count_inonlypat02 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 0) & (f.col('crit08') == 1) & (f.col('crit06') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment
count_inonlypat03 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 0) & (f.col('crit08') == 1) & (f.col('crit06') == 1) & (f.col('crit02') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment with valid Lp(a) measurement
count_inonlypat04 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 0) & (f.col('crit08') == 1) & (f.col('crit06') == 1) & (f.col('crit02') == 1) & (f.col('crit03') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment with valid Lp(a) measurement aged >= 40 years old
count_inonlypat05 = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 0) & (f.col('crit08') == 1) & (f.col('crit06') == 1) & (f.col('crit02') == 1) & (f.col('crit03') == 1) & (f.col('crit04') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment with valid Lp(a) measurement aged >= 40 years old with hospital care data span from index date
df_cohort_ascvd_i_b = df_cohort_ascvd_i.filter((f.col('crit01') == 1) & (f.col('crit07') == 0) & (f.col('crit08') == 1) & (f.col('crit06') == 1) & (f.col('crit02') == 1) & (f.col('crit03') == 1) & (f.col('crit04') == 1) & (f.col('crit05') == 1))
count_inonlypat06 = df_cohort_ascvd_i_b.count()

## - ASCVD diagnosis at inpatient
count_inonlypat06_in = df_cohort_ascvd_i_b.filter(f.col('in_or_out') == 'in').count()

print(f'Total UK Biobank participants: {count_inonlypat01}')
print(f'Participants without Scotland admission record: {count_inonlypat02}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort): {count_inonlypat03}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement: {count_inonlypat04}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old: {count_inonlypat05}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old with hospital care data span from index date: {count_inonlypat06}')
print(f'- ASCVD diagnosis at inpatient: {count_inonlypat06_in}')
# count_inonlypat06_in must be the same as count_inonlypat06

In [None]:
df_cohort_ascvd_i_b.groupBy('group_hesin').count().show(truncate=False)

df_cohort_ascvd_i_b_hesin_count = df_cohort_ascvd_i_b.groupBy('group_hesin').count()

In [None]:
# Patients with inpatient records

## Total UK Biobank participants
count_inpat01 = df_cohort_ascvd_i.filter((f.col('c_crit01') == 1) & (f.col('crit08') == 1)).count()

## Participants without Scotland admission record
count_inpat02 = df_cohort_ascvd_i.filter((f.col('c_crit01') == 1) & (f.col('crit08') == 1) & (f.col('crit06') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment
count_inpat03 = df_cohort_ascvd_i.filter((f.col('c_crit01') == 1) & (f.col('crit08') == 1) & (f.col('crit06') == 1) & (f.col('c_crit02') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment with valid Lp(a) measurement
count_inpat04 = df_cohort_ascvd_i.filter((f.col('c_crit01') == 1) & (f.col('crit08') == 1) & (f.col('crit06') == 1) & (f.col('c_crit02') == 1) & (f.col('c_crit03') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment with valid Lp(a) measurement aged >= 40 years old
count_inpat05 = df_cohort_ascvd_i.filter((f.col('c_crit01') == 1) & (f.col('crit08') == 1) & (f.col('crit06') == 1) & (f.col('c_crit02') == 1) & (f.col('c_crit03') == 1) & (f.col('c_crit04') == 1)).count()

## First ASCVD diagnosis post UK Biobank enrolment with valid Lp(a) measurement aged >= 40 years old with hospital care data span from index date
df_cohort_ascvd_i_c = df_cohort_ascvd_i.filter((f.col('c_crit01') == 1) & (f.col('crit08') == 1) & (f.col('crit06') == 1) & (f.col('c_crit02') == 1) & (f.col('c_crit03') == 1) & (f.col('c_crit04') == 1) & (f.col('c_crit05') == 1))
count_inpat06 = df_cohort_ascvd_i_c.count()

## - ASCVD diagnosis at inpatient
count_inpat06_in = df_cohort_ascvd_i_c.filter(f.col('in_or_out') == 'in').count()

print(f'Total UK Biobank participants: {count_inpat01}')
print(f'Participants without Scotland admission record: {count_inpat02}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort): {count_inpat03}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement: {count_inpat04}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old: {count_inpat05}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old with hospital care data span from index date: {count_inpat06}')
print(f'- ASCVD diagnosis at inpatient: {count_inpat06_in}')
# count_inpat06_in might be the same as count_inpat06

In [None]:
df_cohort_ascvd_i_c.groupBy('c_group_hesin').count().show(truncate=False)
df_cohort_ascvd_i_c_hesin_count = df_cohort_ascvd_i_c.groupBy('c_group_hesin').count()

In [None]:
# Saving as CSV file
df_cohort_ascvd_i_a_in_count.toPandas().to_csv('df_cohort_ascvd_i_a_in_count.csv', index=False)

df_cohort_ascvd_i_a_out_gpcount.toPandas().to_csv('df_cohort_ascvd_i_a_out_gpcount.csv', index=False)
df_cohort_ascvd_i_a_out_hesincount.toPandas().to_csv('df_cohort_ascvd_i_a_out_hesincount.csv', index=False)
df_cohort_ascvd_i_a_out_gphesincount.toPandas().to_csv('df_cohort_ascvd_i_a_out_gphesincount.csv', index=False)
df_cohort_ascvd_i_a_gphesincount.toPandas().to_csv('df_cohort_ascvd_i_a_gphesincount.csv', index=False)

df_cohort_ascvd_i_b_hesin_count.toPandas().to_csv('df_cohort_ascvd_i_b_hesin_count.csv', index=False)

df_cohort_ascvd_i_c_hesin_count.toPandas().to_csv('df_cohort_ascvd_i_c_hesin_count.csv', index=False)

In [None]:
%%bash
dx upload df_cohort_ascvd_i_a_in_count.csv --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i_a_in_count.csv
dx upload df_cohort_ascvd_i_a_out_gpcount.csv --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i_a_out_gpcount.csv
dx upload df_cohort_ascvd_i_a_out_hesincount.csv --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i_a_out_hesincount.csv
dx upload df_cohort_ascvd_i_a_out_gphesincount.csv --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i_a_out_gphesincount.csv
dx upload df_cohort_ascvd_i_a_gphesincount.csv --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i_a_gphesincount.csv
dx upload df_cohort_ascvd_i_b_hesin_count.csv --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i_b_hesin_count.csv
dx upload df_cohort_ascvd_i_c_hesin_count.csv --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i_c_hesin_count.csv

In [None]:
# 1: <65 nmol/L
# 2: >=65 - <150 nmol/L
# 3: >=150 - <175 nmol/L
# 4: >=175 - <190 nmol/L
# 5: >=190 - <225 nmol/L # different from Framework
# 6: >=225 - <250 nmol/L
# 7: >=250 nmol/L
# 'XXX': Lp(a) baseline missing

df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('lpa_threshold', f.when((f.col('lpa_baseline') < 65), f.lit('<65 nmol/L'))
                                                                   .when((f.col('lpa_baseline') >= 65) & (f.col('lpa_baseline') < 150), f.lit('>=65 - <150 nmol/L'))
                                                                   .when((f.col('lpa_baseline') >= 150) & (f.col('lpa_baseline') < 175), f.lit('>=150 - <175 nmol/L'))
                                                                   .when((f.col('lpa_baseline') >= 175) & (f.col('lpa_baseline') < 190), f.lit('>=175 - <190 nmol/L'))
                                                                   .when((f.col('lpa_baseline') >= 190) & (f.col('lpa_baseline') < 225), f.lit('>=190 - <225 nmol/L'))
                                                                   .when((f.col('lpa_baseline') >= 225) & (f.col('lpa_baseline') < 250), f.lit('>=225 - <250 nmol/L'))
                                                                   .when((f.col('lpa_baseline') >= 250), f.lit('>=250 nmol/L'))
                                                                   .otherwise(f.lit('XXX')))
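
The Lp(a) banding above uses half-open intervals: each lower bound is inclusive and each upper bound is exclusive. The same banding can be written as a sorted list of cut points and a lookup, which makes the boundaries easy to verify. A sketch in plain Python, where `LPA_CUTS`, `LPA_LABELS` and `lpa_band` are hypothetical names and the cut points and labels are copied from the cell:

```python
from bisect import bisect_right
from typing import Optional

LPA_CUTS = [65, 150, 175, 190, 225, 250]
LPA_LABELS = [
    '<65 nmol/L',
    '>=65 - <150 nmol/L',
    '>=150 - <175 nmol/L',
    '>=175 - <190 nmol/L',
    '>=190 - <225 nmol/L',
    '>=225 - <250 nmol/L',
    '>=250 nmol/L',
]


def lpa_band(value: Optional[float]) -> str:
    """Map an Lp(a) value in nmol/L to its band label; 'XXX' when missing."""
    if value is None:
        return 'XXX'
    # bisect_right counts the cut points <= value, so a value equal to a
    # cut point lands in the band that starts at that cut point
    return LPA_LABELS[bisect_right(LPA_CUTS, value)]
```

Checking the exact boundary values (65, 150, 175, 190, 225, 250) against this helper is a quick way to confirm that no value falls between bands.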


In [None]:
#Comorbidity
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('hbp_date_first', f.least(f.col('hbp_i10_date_first'),f.col('hbp_i11_date_first'),f.col('hbp_i12_date_first'),f.col('hbp_i13_date_first'),f.col('hbp_i15_date_first')))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('dm_date_first', f.least(f.col('dm_e10_date_first'),f.col('dm_e11_date_first'),f.col('dm_e12_date_first'),f.col('dm_e13_date_first'),f.col('dm_e14_date_first'),f.col('dm_e15_date_first')))

df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('cm_hbp', f.when((f.col('hbp_date_first').isNotNull()) & (f.col('hbp_date_first') < f.col('indexdate')), 1).otherwise(0))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('cm_dm', f.when((f.col('dm_date_first').isNotNull()) & (f.col('dm_date_first') < f.col('indexdate')), 1).otherwise(0))
# Chronic renal failure gets its own flag so it does not overwrite cm_hbp
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('cm_crf', f.when((f.col('crf_n18_date_first').isNotNull()) & (f.col('crf_n18_date_first') < f.col('indexdate')), 1).otherwise(0))

df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('c_cm_hbp', f.when((f.col('hbp_date_first').isNotNull()) & (f.col('hbp_date_first') < f.col('c_indexdate')), 1).otherwise(0))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('c_cm_dm', f.when((f.col('dm_date_first').isNotNull()) & (f.col('dm_date_first') < f.col('c_indexdate')), 1).otherwise(0))
# Chronic renal failure gets its own flag so it does not overwrite c_cm_hbp
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('c_cm_crf', f.when((f.col('crf_n18_date_first').isNotNull()) & (f.col('crf_n18_date_first') < f.col('c_indexdate')), 1).otherwise(0))
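
Each comorbidity flag above is 1 when the earliest qualifying diagnosis date falls strictly before the index date, and 0 otherwise (including when either date is missing). A plain-Python sketch of both steps, with `earliest` and `comorbidity_flag` as hypothetical helper names:

```python
from datetime import date
from typing import Optional


def earliest(*dates: Optional[date]) -> Optional[date]:
    """Earliest non-missing date, or None when all are missing."""
    present = [d for d in dates if d is not None]
    return min(present) if present else None


def comorbidity_flag(first_diagnosis: Optional[date],
                     indexdate: Optional[date]) -> int:
    """1 if the first diagnosis is strictly before the index date, else 0."""
    if first_diagnosis is None or indexdate is None:
        return 0
    return int(first_diagnosis < indexdate)
```

A diagnosis recorded on the index date itself is therefore not counted as a pre-existing comorbidity under this rule.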

In [None]:
# Saving as CSV file
df_cohort_ascvd_i.toPandas().to_csv('df_cohort_ascvd_i.csv', index=False)


In [None]:
%%bash
dx upload df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i_2.csv

In [None]:
#Comment out this code if the above two row counts are the same
#Group by 'eid', 'ins_index', 'arr_index', 'epistart', 'admidate', 'indexdate', etc.
hesin_iddo_df_ascvd_agg = hesin_iddo_df_ascvd.groupBy('eid','indexdate').agg(f.sum('indexdt_order').alias('indexcount'),
                                                                             f.collect_list('admidate').alias('admidate'),
                                                                             f.collect_list('disdate').alias('disdate'),
                                                                             f.collect_list('disdate_im').alias('disdate_im'),
                                                                             f.collect_list('ins_index').alias('ins_index'), 
                                                                             f.collect_list('arr_index').alias('arr_index'),
                                                                             f.collect_list('level').alias('level'),
                                                                             f.collect_list('diag_icd9').alias('diag_icd9'),
                                                                             f.collect_list('diag_icd10').alias('diag_icd10'),
                                                                             f.collect_list('oper_level').alias('oper_level'),
                                                                             f.collect_list('oper3').alias('oper3'),
                                                                             f.collect_list('oper4').alias('oper4'),
                                                                             f.collect_list('group_icd9').alias('group_icd9'),
                                                                             f.collect_list('group_icd10').alias('group_icd10'),
                                                                             f.collect_list('group_opcs3').alias('group_opcs3'),
                                                                             f.collect_list('group_opcs4').alias('group_opcs4'))

hesin_iddo_df_ascvd_2_agg = hesin_iddo_df_ascvd_2.groupBy('eid','indexdate').agg(f.sum('indexdt_order').alias('indexcount_2'),
                                                                                 f.collect_list('admidate').alias('admidate_2'),
                                                                                 f.collect_list('disdate').alias('disdate_2'),
                                                                                 f.collect_list('disdate_im').alias('disdate_im_2'),
                                                                                 f.collect_list('ins_index').alias('ins_index_2'), 
                                                                                 f.collect_list('arr_index').alias('arr_index_2'),
                                                                                 f.collect_list('level').alias('level_2'),
                                                                                 f.collect_list('diag_icd9').alias('diag_icd9_2'),
                                                                                 f.collect_list('diag_icd10').alias('diag_icd10_2'),
                                                                                 f.collect_list('oper_level').alias('oper_level_2'),
                                                                                 f.collect_list('oper3').alias('oper3_2'),
                                                                                 f.collect_list('oper4').alias('oper4_2'),
                                                                                 f.collect_list('group_icd9').alias('group_icd9_2'),
                                                                                 f.collect_list('group_icd10').alias('group_icd10_2'),
                                                                                 f.collect_list('group_opcs3').alias('group_opcs3_2'),
                                                                                 f.collect_list('group_opcs4').alias('group_opcs4_2'))


In [None]:
##QC to check whether there are duplicated ASCVD diagnosis on same indexdate
hesin_iddo_df_ascvd_aggrow =  hesin_iddo_df_ascvd_agg.count()
hesin_iddo_df_ascvd_agguniq = hesin_iddo_df_ascvd_agg.select('eid').distinct().count()

print(f'Number of rows for hesin_iddo_df_ascvd_agg: {hesin_iddo_df_ascvd_aggrow}')
print(f'Number of unique eids in hesin_iddo_df_ascvd_agg: {hesin_iddo_df_ascvd_agguniq}')
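
When the row count and the unique `eid` count above differ, some participants appear on more than one row. A minimal plain-Python sketch of how such repeated identifiers can be listed, assuming the identifiers have been collected into an ordinary list (`duplicated_ids` is a hypothetical helper):

```python
from collections import Counter
from typing import Hashable, Iterable, List


def duplicated_ids(ids: Iterable[Hashable]) -> List[Hashable]:
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)
```

An empty result means the two printed counts should agree.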

In [None]:
hesin_iddo_df_ascvd_agg.show(5)
hesin_iddo_df_ascvd_agg.filter(f.col('indexcount') > 1).show(5)

In [None]:
#Merge df_cohort and hesin_iddo_df_ascvd_agg
df_cohort_ascvd = df_cohort.join(hesin_iddo_df_ascvd_agg, 'eid', 'left')

#ASCVD patient count
df_cohort_ascvd_totalcount = df_cohort_ascvd.count()
df_cohort_ascvd_acddcount = df_cohort_ascvd.filter(f.col('date_of_ac_0').isNotNull()).count()
df_cohort_ascvd_ascvdcount = df_cohort_ascvd.filter((f.col('date_of_ac_0').isNotNull()) & (f.col('indexdate').isNotNull())).count()

print(f'Total UK Biobank participants: {df_cohort_ascvd_totalcount}')
print(f'Total UK Biobank participants with first assessment date: {df_cohort_ascvd_acddcount}')
print(f'Total UK Biobank participants with first assessment date and ASCVD diagnosis: {df_cohort_ascvd_ascvdcount}')

In [None]:
#Male & female patients aged ≥40 years

#date_of_birth: birthdate
# Impute day_of_birth with 15
df_cohort_ascvd = df_cohort_ascvd.withColumn('month_of_birth_n', f.from_unixtime(f.unix_timestamp(f.col('month_of_birth'),'MMMM'),'MM'))
df_cohort_ascvd = df_cohort_ascvd.withColumn('month_of_birth_n', f.col('month_of_birth_n').cast('int'))
df_cohort_ascvd = df_cohort_ascvd.withColumn('day_of_birth', f.lit(15))
datecols=['year_of_birth','month_of_birth_n','day_of_birth']
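# Build a 'yyyy-M-15' string and cast it to a date (the former "MM-dd-yyyy" format argument had no effect on an already-cast date)
df_cohort_ascvd = df_cohort_ascvd.withColumn("date_of_birth", f.concat_ws("-", *datecols).cast("date"))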
df_cohort_ascvd = df_cohort_ascvd.drop('month_of_birth_n','day_of_birth')

# Calculate patient age
df_cohort_ascvd = df_cohort_ascvd.withColumn('age_index', f.when((f.col('indexdate').isNotNull()), f.floor(f.datediff(f.col('indexdate'), f.col('date_of_birth'))/365.25)).otherwise(f.lit(None)))

# Filter for patients aged ≥40 years at indexdate
#df_cohort_ascvd = df_cohort_ascvd.filter(f.col('age_index') >= 40)
#df_cohort_ascvd.count()
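
The age calculation above imputes the day of birth as the 15th of the birth month and divides the day difference to the index date by 365.25. A plain-Python sketch of the same arithmetic, handy for verifying single participants (`age_at_index` is a hypothetical helper, and the month is assumed to be already converted to a number):

```python
import math
from datetime import date


def age_at_index(year_of_birth: int, month_of_birth: int,
                 indexdate: date, imputed_day: int = 15) -> int:
    """Whole years between the imputed birth date and the index date."""
    date_of_birth = date(year_of_birth, month_of_birth, imputed_day)
    return math.floor((indexdate - date_of_birth).days / 365.25)
```

Because the day of birth is imputed, ages within about two weeks of a birthday can be off by one year, which matters for participants close to the 40-year threshold.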

In [None]:
#QC death_df
death_dfrowcount = death_df.count()
death_dfrowcountu = death_df.select('eid').distinct().count()
death_dfrowcountu2 = death_df.select('eid','date_of_death').distinct().count()
print(f'Number of rows for death_df: {death_dfrowcount}')
print(f'Unique patient count for death_df: {death_dfrowcountu}')
print(f'Unique patient count for death_df eid and date_of_death: {death_dfrowcountu2}')

death_df.groupBy('eid').agg(f.count('*').alias('count')).filter(f.col('count') >= 2).show()
death_df.filter(f.col('eid').isin(['2322314','2650383','5991199','3307353'])).show()

In [None]:
hesin_df.select('dsource').distinct().show()

In [None]:
# EIDs for criterion 5: check whether linked hospital care data spans from the date of the initial ASCVD diagnosis/event.
eidc5 = df_cohort_ascvd.select('eid')

#HES data last admidate, epistart, epiend and disdate
hesin_c5 = hesin_df.join(eidc5, 'eid', 'inner')
hesin_c5 = hesin_c5.select('eid', 'dsource', 'admidate', 'epistart', 'epiend', 'disdate','disdate_im','censordate').distinct()

# # Some patients have more than 1 dsource. Do not calculate follow up period first.
# hesin_c5 = hesin_c5.select('eid', 'dsource', 'admidate', 'epistart', 'epiend', 'disdate').distinct()
# hesin_c5.select(f.count(f.col('eid'))).show() #4184747
# hesin_c5.select('eid','dsource').distinct().select(f.count(f.col('eid'))).show() #455203
# hesin_c5.select('eid').distinct().select(f.count(f.col('eid'))).show() #449078
# hesin_c5.select('eid','dsource').distinct().groupBy('eid').agg(f.count('*').alias('rowcount'),f.collect_list(f.col('dsource').alias('dsource_list'))).filter(f.col('rowcount') >= 2).show(5)
# # +-------+--------+-------------------------------------+
# # |    eid|rowcount|collect_list(dsource AS dsource_list)|
# # +-------+--------+-------------------------------------+
# # |1000759|       2|                          [PEDW, HES]|
# # |1018075|       2|                          [HES, PEDW]|
# # |1031619|       2|                           [SMR, HES]|
# # |1034215|       2|                          [PEDW, HES]|
# # |1034264|       2|                           [HES, SMR]|
# # +-------+--------+-------------------------------------+

#Get max date for hesin dates
hesin_c5_maxdate = hesin_c5.groupBy('eid').agg(f.max(f.col('admidate')).alias('admidate_last'),f.max(f.col('epistart')).alias('epistart_last'),f.max(f.col('epiend')).alias('epiend_last'),f.max(f.col('disdate')).alias('disdate_last'))

#Get the last record of hesin
windowcd = Window.partitionBy('eid').orderBy(f.col('disdate_im').desc(),f.col('epiend').desc(),f.col('epistart').desc(),f.col('admidate').desc())
hesin_c5_lc = hesin_c5.withColumn('disdate_im_last', f.rank().over(windowcd))
hesin_c5_lc = hesin_c5_lc.filter(f.col('disdate_im_last') == 1)
hesin_c5_lc_disdatecount = hesin_c5_lc.groupBy('eid').agg(f.count('*').alias('ddcount'))
hesin_c5_lc = hesin_c5_lc.join(hesin_c5_lc_disdatecount, 'eid', 'left')

#Uncomment this code if there are no redundant last records
#hesin_c5_lc = hesin_c5_lc.select('eid','disdate_im_last')

hesin_c5_lc_count = hesin_c5_lc.count()
hesin_c5_lc_eidcount = hesin_c5_lc.select('eid').distinct().count()
hesin_c5_lc_eiddscount = hesin_c5_lc.select('eid','dsource').distinct().count()

print(f'Number of rows in hesin_c5_lc: {hesin_c5_lc_count}')
print(f'Unique patient count in hesin_c5_lc: {hesin_c5_lc_eidcount}')
print(f'Unique eid and dsource combinations in hesin_c5_lc: {hesin_c5_lc_eiddscount}')

hesin_c5_lc.filter(f.col('ddcount') > 1).show(10)
#Check whether there are redundant last records before proceeding to the next step.
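
The cell above keeps rows with rank 1, and a rank can be shared, so one participant may keep several "last" records. The plain-Python sketch below illustrates that behaviour on made-up rows; `last_records` is a hypothetical helper, it uses only two of the four sort columns, and it assumes no null dates (ISO date strings are compared as text).

```python
from collections import defaultdict

def last_records(rows, sort_keys):
    """Return, per eid, every row tied for the latest sort key (rank == 1)."""
    by_eid = defaultdict(list)
    for row in rows:
        by_eid[row['eid']].append(row)
    result = {}
    for eid, group in by_eid.items():
        # The latest key in the group; all rows sharing it are tied at rank 1
        top = max(tuple(r[k] for k in sort_keys) for r in group)
        result[eid] = [r for r in group if tuple(r[k] for k in sort_keys) == top]
    return result

# Made-up rows: eid 1 has the same last dates under two data sources
rows = [
    {'eid': 1, 'dsource': 'HES', 'disdate_im': '2020-01-05', 'epiend': '2020-01-05'},
    {'eid': 1, 'dsource': 'PEDW', 'disdate_im': '2020-01-05', 'epiend': '2020-01-05'},
    {'eid': 1, 'dsource': 'HES', 'disdate_im': '2019-06-01', 'epiend': '2019-06-01'},
    {'eid': 2, 'dsource': 'SMR', 'disdate_im': '2021-03-10', 'epiend': '2021-03-09'},
    {'eid': 2, 'dsource': 'SMR', 'disdate_im': '2021-03-10', 'epiend': '2021-03-10'},
]
last = last_records(rows, ['disdate_im', 'epiend'])
ddcount = {eid: len(group) for eid, group in last.items()}
print(ddcount)
```

A count above 1, as for eid 1 here, corresponds to the `ddcount > 1` rows that the QC step displays.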

In [None]:
hesin_c5_lc_eiddsdi = hesin_c5_lc.select('eid','dsource','censordate').distinct()
hesin_c5_lc_eiddsdicount = hesin_c5_lc_eiddsdi.count()

print(f'Unique eid, dsource and censordate combinations in hesin_c5_lc: {hesin_c5_lc_eiddsdicount}')

In [None]:
death_df_date = death_df.select('eid','date_of_death').distinct()
hesin_date_final = hesin_c5_maxdate.join(death_df_date, 'eid', 'left')
hesin_date_final = hesin_date_final.join(hesin_c5_lc_eiddsdi, 'eid', 'left')
hesin_date_final = hesin_date_final.withColumn('last_followup_date', f.when((f.col('date_of_death') < f.col('censordate')), f.col('date_of_death')).otherwise(f.col('censordate')))

# hesin_c5_death = hesin_c5_death.withColumn('last_followupdate_all', f.greatest(f.col('admidate_last'), f.col('epistart_last'), f.col('epiend_last'), f.col('disdate_last'), f.col('date_of_death')))
# hesin_c5_death = hesin_c5_death.withColumn('last_followupdate_hes', f.greatest(f.col('admidate_last'), f.col('epistart_last'), f.col('epiend_last'), f.col('disdate_last')))
# hesin_c5_death = hesin_c5_death.withColumn('last_followupdate', f.when((f.col('dsource') == 'HES') & (f.col('date_of_death').isNotNull()) & (f.col('date_of_death') <= f.lit('2022-10-31')), f.col('date_of_death'))
#                                                                  .when((f.col('dsource') == 'SMR') & (f.col('date_of_death').isNotNull()) & (f.col('date_of_death') <= f.lit('2022-08-31')), f.col('date_of_death'))
#                                                                  .when((f.col('dsource') == 'PEDW') & (f.col('date_of_death').isNotNull()) & (f.col('date_of_death') <= f.lit('2022-05-31')), f.col('date_of_death'))
#                                                                  .when((f.col('dsource') == 'HES') & (f.col('date_of_death').isNull()), f.lit('2022-10-31'))
#                                                                  .when((f.col('dsource') == 'SMR') & (f.col('date_of_death').isNull()), f.lit('2022-08-31'))
#                                                                  .when((f.col('dsource') == 'PEDW') & (f.col('date_of_death').isNull()), f.lit('2022-05-31'))
#                                                                  .otherwise(f.lit(None)))
hesin_date_final.show(5)

hesin_date_final_count = hesin_date_final.count()
hesin_date_final_eidcount = hesin_date_final.select('eid').distinct().count()

print(f'Number of rows in hesin_date_final: {hesin_date_final_count}')
print(f'Unique patient count in hesin_date_final: {hesin_date_final_eidcount}')
# hesin_c5_deathqc = hesin_c5_death.filter((f.col('date_of_death') < f.col('admidate_last')) | (f.col('date_of_death') < f.col('epistart_last')) | (f.col('date_of_death') < f.col('epistart_last')) | (f.col('date_of_death') < f.col('epiend_last')) | (f.col('date_of_death') < f.col('disdate_last')))
# hesin_c5_deathqc.show()
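
The rule used for `last_followup_date` above is: take the date of death when it falls before the censoring date, otherwise take the censoring date. A minimal plain-Python sketch of that rule, including the cases where one of the dates is missing, is shown below; the function name and the example dates are made up for illustration.

```python
from datetime import date

def last_followup_date(date_of_death, censordate):
    """Date of death if it precedes the censoring date, else the censoring date.

    Mirrors the when/otherwise expression: a missing date of death or a
    missing censoring date both fall through to the censoring date.
    """
    if date_of_death is not None and censordate is not None and date_of_death < censordate:
        return date_of_death
    return censordate

censor = date(2022, 10, 31)  # example censoring date, illustration only
print(last_followup_date(date(2021, 5, 1), censor))  # died before censoring
print(last_followup_date(None, censor))              # no death record
print(last_followup_date(date(2023, 1, 1), censor))  # death after censoring
```

Note that a participant with a death record but no censoring date ends up with no last follow-up date under this rule.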

In [None]:
df_cohort_ascvd_i = df_cohort_ascvd.join(hesin_date_final,'eid','left')
df_cohort_ascvd_i.count()
df_cohort_ascvd_i.printSchema()

In [None]:
#Identify incident cohort
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('crit01', f.lit(1))
# Each criterion extends the previous one, so the attrition steps are nested
cond02 = (f.col('date_of_ac_0').isNotNull()) & (f.col('indexdate').isNotNull()) & (f.col('indexdate') > f.col('date_of_ac_0')) & (f.col('indexdate') <= f.lit('2022-10-31'))
cond03 = cond02 & (f.col('lpa_baseline').isNotNull())
cond04 = cond03 & (f.col('age_index') >= 40)
cond05 = cond04 & (f.col('epistart_last') > f.col('indexdate'))
for crit_name, cond in [('crit02', cond02), ('crit03', cond03), ('crit04', cond04), ('crit05', cond05)]:
    df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn(crit_name, f.when(cond, f.lit(1)).otherwise(f.lit(0)))

patcount_step1 = df_cohort_ascvd_i.filter(f.col('crit01') == 1).count()
patcount_step2 = df_cohort_ascvd_i.filter(f.col('crit02') == 1).count()
patcount_step3 = df_cohort_ascvd_i.filter(f.col('crit03') == 1).count()
patcount_step4 = df_cohort_ascvd_i.filter(f.col('crit04') == 1).count()
patcount_step5 = df_cohort_ascvd_i.filter(f.col('crit05') == 1).count()

print(f'Total UK Biobank participants: {patcount_step1}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort): {patcount_step2}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement: {patcount_step3}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old: {patcount_step4}')
print(f'First ASCVD diagnosis post UK Biobank enrolment (Incident ASCVD cohort) with valid Lp(a) measurement aged >= 40 years old with hospital care data span from index date: {patcount_step5}')
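
Since each criterion is a subset of the one before it, the five counts can be read as an attrition table. The sketch below shows one possible way to add the percentage retained from the previous step; `attrition_table` is a hypothetical helper and the counts are invented, not results from this cohort.

```python
def attrition_table(step_counts):
    """Return (label, count, percent retained from the previous step) rows."""
    rows = []
    previous = None
    for label, count in step_counts:
        # The first step has no predecessor; a zero predecessor is also skipped
        pct = None if not previous else round(100 * count / previous, 1)
        rows.append((label, count, pct))
        previous = count
    return rows

# Invented counts, for illustration only
example = attrition_table([('Step 1', 1000), ('Step 2', 250), ('Step 3', 200)])
for label, count, pct in example:
    print(label, count, '' if pct is None else f'{pct}%')
```

In the notebook, the input would be the step labels paired with `patcount_step1` to `patcount_step5`.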

In [None]:
#The identification period will be 2006 to October 2021 
#Check summary of dates first

df_cohort_ascvd_i.filter(f.col('crit02') == 1).select(f.max(f.col('indexdate'))).show()

In [None]:
df_cohort_ascvd_i.printSchema()

In [None]:
# Final cohort
df_i_final = df_cohort_ascvd_i.filter(f.col('crit05') == 1)
df_i_final.count()

In [None]:
ascvd_i_eid = df_i_final.select('eid')

In [None]:
df_i_final_disdate = df_i_final.select('eid','disdate','disdate_im')
df_i_final_disdate.show(10)

In [None]:
#identify the discharge date for first ASCVD diagnosis
#Use disdate_im_max?
df_i_final_disdate = df_i_final_disdate.withColumn('disdate_im_uniq', f.array_distinct('disdate_im'))
df_i_final_disdate = df_i_final_disdate.withColumn('disdate_im_uniq_size', f.size('disdate_im_uniq'))
df_i_final_disdate = df_i_final_disdate.withColumn('disdate_im_max', f.array_max('disdate_im_uniq'))
df_i_final_disdate.show(10,truncate=False)

In [None]:
df_i_final_disdate.filter(f.col('disdate_im_uniq_size') > 1).show(10,truncate=False)

In [None]:
hesin_iddo_df_ascvd.filter(f.col('eid') == '1004680').show(truncate=False)
hesin_iddo_df_ascvd.filter(f.col('eid') == '1010972').show(truncate=False)

In [None]:
#Use disdate_im_max as discharge date for incident ASCVD diagnosis

df_i_final_disdate_max = df_i_final_disdate.select('eid','disdate_im_max')
df_i_final_disdate_max_count = df_i_final_disdate_max.count()
df_i_final_disdate_max_eidcount = df_i_final_disdate_max.select('eid').distinct().count()

print(f'Number of rows in df_i_final_disdate_max: {df_i_final_disdate_max_count}')
print(f'Unique patient count in df_i_final_disdate_max: {df_i_final_disdate_max_eidcount}')

In [None]:
#add 'disdate_im_max' to df_cohort_ascvd_i

df_cohort_ascvd_i = df_cohort_ascvd_i.join(df_i_final_disdate_max,'eid','left')

In [None]:
#Calculate post discharge follow-up duration
## followup06m_dur
## followup12m_dur
## followup24m_dur

#Cap each follow-up window at the last follow-up date, which already accounts for the date of death

df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('followup06m_date_ori', f.when((f.col('indexdate').isNotNull()) & (f.col('disdate_im_max').isNotNull()), f.date_add(f.col('disdate_im_max'),183))
                                                                           .otherwise(f.lit(None)))

df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('followup06m_date', f.when((f.col('followup06m_date_ori').isNotNull()) & (f.col('followup06m_date_ori') > f.col('last_followup_date')), f.col('last_followup_date'))
                                                                      .when((f.col('followup06m_date_ori').isNotNull()) & (f.col('followup06m_date_ori') <= f.col('last_followup_date')), f.col('followup06m_date_ori'))
                                                                      .otherwise(f.lit(None)))

df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('followup06m_dur', f.when((f.col('followup06m_date_ori').isNotNull()), f.datediff(f.col('followup06m_date'),f.col('disdate_im_max'))).otherwise(f.lit(None)))
                                                 
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('followup12m_date_ori', f.when((f.col('indexdate').isNotNull()) & (f.col('disdate_im_max').isNotNull()), f.date_add(f.col('disdate_im_max'),365))
                                                                           .otherwise(f.lit(None)))

df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('followup12m_date', f.when((f.col('followup12m_date_ori').isNotNull()) & (f.col('followup12m_date_ori') > f.col('last_followup_date')), f.col('last_followup_date'))
                                                                      .when((f.col('followup12m_date_ori').isNotNull()) & (f.col('followup12m_date_ori') <= f.col('last_followup_date')), f.col('followup12m_date_ori'))
                                                                      .otherwise(f.lit(None)))

df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('followup12m_dur', f.when((f.col('followup12m_date_ori').isNotNull()), f.datediff(f.col('followup12m_date'),f.col('disdate_im_max'))).otherwise(f.lit(None)))

df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('followup24m_date_ori', f.when((f.col('indexdate').isNotNull()) & (f.col('disdate_im_max').isNotNull()), f.date_add(f.col('disdate_im_max'),730))
                                                                           .otherwise(f.lit(None)))
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('followup24m_date', f.when((f.col('followup24m_date_ori').isNotNull()) & (f.col('followup24m_date_ori') > f.col('last_followup_date')), f.col('last_followup_date'))
                                                                      .when((f.col('followup24m_date_ori').isNotNull()) & (f.col('followup24m_date_ori') <= f.col('last_followup_date')), f.col('followup24m_date_ori'))
                                                                      .otherwise(f.lit(None)))

# Use the capped followup24m_date (not followup24m_date_ori), consistent with the 6- and 12-month durations
df_cohort_ascvd_i = df_cohort_ascvd_i.withColumn('followup24m_dur', f.when((f.col('followup24m_date_ori').isNotNull()), f.datediff(f.col('followup24m_date'),f.col('disdate_im_max'))).otherwise(f.lit(None)))
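
The follow-up columns follow one pattern: add a fixed number of days (183, 365 or 730) to the discharge date, cap the result at the last follow-up date, and take the difference in days. Below is a minimal plain-Python sketch of that pattern for one participant. The function name `followup_window` and the example dates are hypothetical, and the sketch leaves out the check that `indexdate` is present.

```python
from datetime import date, timedelta

def followup_window(disdate_im_max, last_followup, days):
    """Return (follow-up end date, duration in days) capped at last follow-up.

    Both values are None when the discharge date or the last follow-up
    date is missing, as in the when/otherwise chains above.
    """
    if disdate_im_max is None or last_followup is None:
        return None, None
    planned_end = disdate_im_max + timedelta(days=days)
    end = min(planned_end, last_followup)
    return end, (end - disdate_im_max).days

discharge = date(2020, 1, 1)  # made-up discharge date
full = followup_window(discharge, date(2022, 10, 31), 365)
cut_short = followup_window(discharge, date(2020, 3, 1), 365)
print(full, cut_short)
```

A duration shorter than the planned window, as in the second call, indicates that death or censoring ended follow-up early.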

In [None]:
df_cohort_ascvd_i.select('eid','indexdate','disdate_im_max','followup12m_date_ori','followup12m_date','date_of_death','followup12m_dur').filter(f.col('followup12m_dur') < 365).show(5)

In [None]:
ascvd_i_followup_date_df = df_cohort_ascvd_i.filter(f.col('crit05') == 1).select('eid','indexdate','disdate_im_max','followup06m_date','followup06m_dur','followup12m_date','followup12m_dur','followup24m_date','followup24m_dur')

#disdate_im_max is the day before follow-up starts (follow-up start date - 1)

In [None]:
#extract hesin_df for ascvd i patients
ascvd_i_hesin_df = hesin_df.join(ascvd_i_followup_date_df,'eid','inner')

#Check whether admidate contains nulls
ascvd_i_hesin_df_count = ascvd_i_hesin_df.count()
ascvd_i_hesin_df_admidatenullcount =  ascvd_i_hesin_df.filter(f.col('admidate').isNull()).count()
ascvd_i_hesin_df_admidatenulleidcount =  ascvd_i_hesin_df.filter(f.col('admidate').isNull()).select('eid','ins_index').distinct().count()

print(f'Number of rows in ascvd_i_hesin_df: {ascvd_i_hesin_df_count}')
print(f'Number of rows with null admidate in ascvd_i_hesin_df: {ascvd_i_hesin_df_admidatenullcount}')
print(f'Number of unique eid and ins_index combinations with null admidate in ascvd_i_hesin_df: {ascvd_i_hesin_df_admidatenulleidcount}')

ascvd_i_hesin_df.filter(f.col('admidate').isNull()).show(5)

In [None]:
ascvd_i_hesin_df.filter(f.col('admidate').isNull()).select('dsource').distinct().show()

In [None]:
ascvd_i_hesin_df.printSchema()

In [None]:
ascvd_i_hesin_df.filter((f.col('eid') == '1029473') & (f.col('ins_index') == 20)).select('eid', 'ins_index', 'dsource', 'epistart', 'epiend', 'admidate', 'disdate').show()

In [None]:
#Impute null admidate with epistart
#No need to impute null disdate, already done
ascvd_i_hesin_df = ascvd_i_hesin_df.withColumn('admidate_im', f.when((f.col('admidate').isNull()), f.col('epistart')).otherwise(f.col('admidate')))

ascvd_i_hesin_df_admidate_im_nullcount =  ascvd_i_hesin_df.filter(f.col('admidate_im').isNull()).count()
ascvd_i_hesin_df_disdate_im_nullcount =  ascvd_i_hesin_df.filter(f.col('disdate_im').isNull()).count()

print(f'Number of rows with null admidate_im in ascvd_i_hesin_df: {ascvd_i_hesin_df_admidate_im_nullcount}')
print(f'Number of rows with null disdate_im in ascvd_i_hesin_df: {ascvd_i_hesin_df_disdate_im_nullcount}')

In [None]:
ascvd_i_hesin_df.filter(f.col('admidate_im').isNull()).show()

In [None]:
#Extract hesin main record within follow up period of 12m
ascvd_i_hesin_df_fu12m = ascvd_i_hesin_df.filter((f.col('admidate_im') > f.col('disdate_im_max')) & (f.col('admidate_im') <= f.col('followup12m_date')))

#Extract records from hesin_df, hesin_do_df, hesin_diag_df, hesin_oper_df and hesin_critical_df
ascvd_i_hesin_df_eidins = ascvd_i_hesin_df.select('eid','ins_index')
ascvd_i_hesin_df_fu12m_eidins = ascvd_i_hesin_df_fu12m.select('eid','ins_index')

ascvd_i_hesin_iddo_df_fuall = ascvd_i_hesin_df_eidins.join(hesin_do_df, ['eid','ins_index'], 'inner')
ascvd_i_hesin_iddo_df_fu12m = ascvd_i_hesin_df_fu12m_eidins.join(hesin_do_df, ['eid','ins_index'], 'inner')

ascvd_i_hesin_diag_df_fuall = ascvd_i_hesin_df_eidins.join(hesin_diag_df, ['eid','ins_index'], 'left')
ascvd_i_hesin_diag_df_fu12m = ascvd_i_hesin_df_fu12m_eidins.join(hesin_diag_df, ['eid','ins_index'], 'inner')

ascvd_i_hesin_oper_df_fuall = ascvd_i_hesin_df_eidins.join(hesin_oper_df, ['eid','ins_index'], 'inner')
ascvd_i_hesin_oper_df_fu12m = ascvd_i_hesin_df_fu12m_eidins.join(hesin_oper_df, ['eid','ins_index'], 'inner')

ascvd_i_hesin_critical_df_fuall = ascvd_i_hesin_df_eidins.join(hesin_critical_df, ['eid','ins_index'], 'inner')
ascvd_i_hesin_critical_df_fu12m = ascvd_i_hesin_df_fu12m_eidins.join(hesin_critical_df, ['eid','ins_index'], 'inner')


In [None]:
ascvd_i_hesin_iddo_df_fuall_count = ascvd_i_hesin_iddo_df_fuall.count()
ascvd_i_hesin_iddo_df_fu12m_count = ascvd_i_hesin_iddo_df_fu12m.count()

ascvd_i_hesin_diag_df_fuall_count = ascvd_i_hesin_diag_df_fuall.count()
ascvd_i_hesin_diag_df_fu12m_count = ascvd_i_hesin_diag_df_fu12m.count()

ascvd_i_hesin_oper_df_fuall_count = ascvd_i_hesin_oper_df_fuall.count()
ascvd_i_hesin_oper_df_fu12m_count = ascvd_i_hesin_oper_df_fu12m.count()

ascvd_i_hesin_critical_df_fuall_count = ascvd_i_hesin_critical_df_fuall.count()
ascvd_i_hesin_critical_df_fu12m_count = ascvd_i_hesin_critical_df_fu12m.count()

print(f'Number of rows in ascvd_i_hesin_iddo_df_fuall: {ascvd_i_hesin_iddo_df_fuall_count}')
print(f'Number of rows in ascvd_i_hesin_iddo_df_fu12m: {ascvd_i_hesin_iddo_df_fu12m_count}')
print(f'Number of rows in ascvd_i_hesin_diag_df_fuall: {ascvd_i_hesin_diag_df_fuall_count}')
print(f'Number of rows in ascvd_i_hesin_diag_df_fu12m: {ascvd_i_hesin_diag_df_fu12m_count}')
print(f'Number of rows in ascvd_i_hesin_oper_df_fuall: {ascvd_i_hesin_oper_df_fuall_count}')
print(f'Number of rows in ascvd_i_hesin_oper_df_fu12m: {ascvd_i_hesin_oper_df_fu12m_count}')
print(f'Number of rows in ascvd_i_hesin_critical_df_fuall: {ascvd_i_hesin_critical_df_fuall_count}')
print(f'Number of rows in ascvd_i_hesin_critical_df_fu12m: {ascvd_i_hesin_critical_df_fu12m_count}')

In [None]:
import os
os.getcwd()

In [None]:
#Failed to save parquet file
#write df_cohort_ascvd_i as parquet file
#df_cohort_ascvd_i.write.parquet('/opt/notebooks/df_cohort_ascvd_i.parquet')

In [None]:
df_cohort_ascvd_i.write.parquet('/home/dnanexus/df_cohort_ascvd_i.parquet')

In [None]:
#%%bash
#dx upload df_cohort_ascvd_i.parquet --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i.parquet

In [None]:
#QC: read parquet file for df_cohort_ascvd_i.parquet
#dfpq=spark.read.parquet('df_cohort_ascvd_i.parquet')

In [None]:
#Extract all records for ascvd i cohort
hesin_df_cohort_ascvd_i = hesin_df.join(ascvd_i_eid, 'eid', 'inner')
hesin_diag_df_cohort_ascvd_i = hesin_diag_df.join(ascvd_i_eid, 'eid', 'inner')
hesin_oper_df_cohort_ascvd_i = hesin_oper_df.join(ascvd_i_eid, 'eid', 'inner')
hesin_critical_df_cohort_ascvd_i = hesin_critical_df.join(ascvd_i_eid, 'eid', 'inner')
gp_clinical_df_cohort_ascvd_i = gp_clinical_df.join(ascvd_i_eid, 'eid', 'inner')
gp_scripts_df_cohort_ascvd_i = gp_scripts_df.join(ascvd_i_eid, 'eid', 'inner')
gp_registrations_df_cohort_ascvd_i = gp_registrations_df.join(ascvd_i_eid, 'eid', 'inner')
death_df_cohort_ascvd_i = death_df.join(ascvd_i_eid, 'eid', 'inner')
death_cause_df_cohort_ascvd_i = death_cause_df.join(ascvd_i_eid, 'eid', 'inner')

hesin_iddo_df_ascvd_ori_cohort_ascvd_i = hesin_iddo_df_ascvd_ori.join(ascvd_i_eid, 'eid', 'inner')
hesin_iddo_df_ascvd_rank1_cohort_ascvd_i = hesin_iddo_df_ascvd.join(ascvd_i_eid, 'eid', 'inner')
hesin_iddo_df_ascvd_rank2_cohort_ascvd_i = hesin_iddo_df_ascvd_2.join(ascvd_i_eid, 'eid', 'inner')
#hesin_iddo_df_ascvd_ori
#hesin_iddo_df_ascvd
#hesin_iddo_df_ascvd_2

In [None]:
df_cohort_ascvd_i.printSchema()

In [None]:
# Saving as CSV file
df_cohort_ascvd_i.toPandas().to_csv('df_cohort_ascvd_i.csv', index=False)
hesin_df_cohort_ascvd_i.toPandas().to_csv('hesin_df_cohort_ascvd_i.csv', index=False)
hesin_diag_df_cohort_ascvd_i.toPandas().to_csv('hesin_diag_df_cohort_ascvd_i.csv', index=False)
hesin_oper_df_cohort_ascvd_i.toPandas().to_csv('hesin_oper_df_cohort_ascvd_i.csv', index=False)
hesin_critical_df_cohort_ascvd_i.toPandas().to_csv('hesin_critical_df_cohort_ascvd_i.csv', index=False)
gp_clinical_df_cohort_ascvd_i.toPandas().to_csv('gp_clinical_df_cohort_ascvd_i.csv', index=False)
gp_scripts_df_cohort_ascvd_i.toPandas().to_csv('gp_scripts_df_cohort_ascvd_i.csv', index=False)
gp_registrations_df_cohort_ascvd_i.toPandas().to_csv('gp_registrations_df_cohort_ascvd_i.csv', index=False)
death_df_cohort_ascvd_i.toPandas().to_csv('death_df_cohort_ascvd_i.csv', index=False)
death_cause_df_cohort_ascvd_i.toPandas().to_csv('death_cause_df_cohort_ascvd_i.csv', index=False)
hesin_iddo_df_ascvd_ori_cohort_ascvd_i.toPandas().to_csv('hesin_iddo_df_ascvd_ori_cohort_ascvd_i.csv', index=False)
hesin_iddo_df_ascvd_rank1_cohort_ascvd_i.toPandas().to_csv('hesin_iddo_df_ascvd_rank1_cohort_ascvd_i.csv', index=False)
hesin_iddo_df_ascvd_rank2_cohort_ascvd_i.toPandas().to_csv('hesin_iddo_df_ascvd_rank2_cohort_ascvd_i.csv', index=False)

ascvd_i_hesin_df.toPandas().to_csv('ascvd_i_hesin_df.csv', index=False)
ascvd_i_hesin_df_fu12m.toPandas().to_csv('ascvd_i_hesin_df_fu12m.csv', index=False)

ascvd_i_hesin_iddo_df_fuall.toPandas().to_csv('ascvd_i_hesin_iddo_df_fuall.csv', index=False)
ascvd_i_hesin_iddo_df_fu12m.toPandas().to_csv('ascvd_i_hesin_iddo_df_fu12m.csv', index=False)

ascvd_i_hesin_diag_df_fuall.toPandas().to_csv('ascvd_i_hesin_diag_df_fuall.csv', index=False)
ascvd_i_hesin_diag_df_fu12m.toPandas().to_csv('ascvd_i_hesin_diag_df_fu12m.csv', index=False)

ascvd_i_hesin_oper_df_fuall.toPandas().to_csv('ascvd_i_hesin_oper_df_fuall.csv', index=False)
ascvd_i_hesin_oper_df_fu12m.toPandas().to_csv('ascvd_i_hesin_oper_df_fu12m.csv', index=False)

ascvd_i_hesin_critical_df_fuall.toPandas().to_csv('ascvd_i_hesin_critical_df_fuall.csv', index=False)
ascvd_i_hesin_critical_df_fu12m.toPandas().to_csv('ascvd_i_hesin_critical_df_fu12m.csv', index=False)

In [None]:
%%bash
dx upload df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/df_cohort_ascvd_i.csv
dx upload hesin_df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/hesin_df_cohort_ascvd_i.csv
dx upload hesin_diag_df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/hesin_diag_df_cohort_ascvd_i.csv
dx upload hesin_oper_df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/hesin_oper_df_cohort_ascvd_i.csv
dx upload hesin_critical_df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/hesin_critical_df_cohort_ascvd_i.csv
dx upload gp_clinical_df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/gp_clinical_df_cohort_ascvd_i.csv
dx upload gp_scripts_df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/gp_scripts_df_cohort_ascvd_i.csv
dx upload gp_registrations_df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/gp_registrations_df_cohort_ascvd_i.csv
dx upload death_df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/death_df_cohort_ascvd_i.csv
dx upload death_cause_df_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/death_cause_df_cohort_ascvd_i.csv
dx upload hesin_iddo_df_ascvd_ori_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/hesin_iddo_df_ascvd_ori_cohort_ascvd_i.csv
dx upload hesin_iddo_df_ascvd_rank1_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/hesin_iddo_df_ascvd_rank1_cohort_ascvd_i.csv
dx upload hesin_iddo_df_ascvd_rank2_cohort_ascvd_i.csv --dest /Users/yonghu4/Lpa_EB/cohort/hesin_iddo_df_ascvd_rank2_cohort_ascvd_i.csv

dx upload ascvd_i_hesin_df.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_df.csv
dx upload ascvd_i_hesin_df_fu12m.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_df_fu12m.csv
dx upload ascvd_i_hesin_iddo_df_fuall.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_iddo_df_fuall.csv
dx upload ascvd_i_hesin_iddo_df_fu12m.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_iddo_df_fu12m.csv
dx upload ascvd_i_hesin_diag_df_fuall.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_diag_df_fuall.csv
dx upload ascvd_i_hesin_diag_df_fu12m.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_diag_df_fu12m.csv
dx upload ascvd_i_hesin_oper_df_fuall.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_oper_df_fuall.csv
dx upload ascvd_i_hesin_oper_df_fu12m.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_oper_df_fu12m.csv
dx upload ascvd_i_hesin_critical_df_fuall.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_critical_df_fuall.csv
dx upload ascvd_i_hesin_critical_df_fu12m.csv --dest /Users/yonghu4/Lpa_EB/cohort/ascvd_i_hesin_critical_df_fu12m.csv
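
The upload cell repeats one command per file, with the same file name on both sides. One possible way to avoid copy-paste slips is to generate the command lines from a list of names, as in the sketch below. `build_upload_commands` is a hypothetical helper; it only builds the command strings in the form used above and does not run them.

```python
import shlex

def build_upload_commands(names, dest_dir):
    """Return one `dx upload` command string per CSV file name."""
    commands = []
    for name in names:
        local_file = f'{name}.csv'
        remote_path = f'{dest_dir}/{name}.csv'
        # shlex.quote guards against spaces or shell metacharacters in paths
        commands.append(f'dx upload {shlex.quote(local_file)} --dest {shlex.quote(remote_path)}')
    return commands

# Two of the table names exported above, with the destination folder from the cell above
commands = build_upload_commands(
    ['df_cohort_ascvd_i', 'ascvd_i_hesin_df_fu12m'],
    '/Users/yonghu4/Lpa_EB/cohort',
)
print('\n'.join(commands))
```

The printed lines can be pasted into a `%%bash` cell, or each one could be passed to `subprocess.run` after splitting with `shlex.split`.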

In [None]:
#End