# Project: Performance of phenotype algorithms for the identification of opioid-exposed infants, Andrew D. Wiese et al. Hospital Pediatrics 2024
# Title: Extract estimated gestational age for infant
# Summary: 
## Extract and clean estimated gestational age (EGA), keeping only infants with EGA >=33 weeks or with uncertain EGA (0 weeks)

# Notes:
- After extracting estimated gestational age, the estimated conception date is calculated as: Baby Date of Birth - estimated gestational age + 14 days


##### Algorithm steps:
```
1. Obtain EGA info from OBSERVATION table joining with FACT_RELATIONSHIP and CONCEPT tables
2. Transform EGA string (e.g. 38w 6d) to total days (e.g. 272)
3. For duplicate baby IDs, pick the record with the largest total EGA days
4. Validate results
5. For records with the same total EGA days, pick the record with the earliest observation date
6. Validate results
7. Remove value_as_string column and make distinct records
8. Validate results
9. Identify records with 0 weeks EGA as uncertain
10. For EGA >=33 weeks or uncertain, calculate conception date as:
    - Conception Date = Baby Date of Birth - Length of Gestation + 14 days
    - Length of Gestation = total EGA days, or 330 days when EGA is uncertain
11. Validate results
```

##### Data Dictionaries:

**ega_raw_df**: EGA info from OBSERVATION table   

**ega_df**: EGA data after transforming string to total days

**largest_ega_df**: Picked largest total EGA days for duplicate baby IDs 

**largest_ega_earliest_obs_date_df**: Added filter to pick the earliest observation date for records with the same total EGA days

**ega_distinct_df**: Removed value_as_string column and made distinct records

**gestational_age_uncertain_df**: Records with 0 weeks EGA, treated as uncertain

**ega_w33_or_uncertain_gestation_date**: Calculated conception date for EGA >=33 weeks or uncertain

##### Usage Notes:
```
- EGA data is extracted from the OBSERVATION table joining with FACT_RELATIONSHIP and CONCEPT tables
- Duplicate EGA records per baby are resolved by picking the largest total EGA days
- For records with the same total EGA days, the record with the earliest observation date is picked
- Records with 0 weeks EGA are marked as uncertain
- Conception date is calculated as Baby Date of Birth - Length of Gestation + 14 days
- Final output table contains the calculated conception date for EGA >=33 weeks or uncertain
```

In [0]:
%run "./project_modules"

##### Obtain EGA info from the OBSERVATION table for mom-baby pairs from the FACT_RELATIONSHIP table (observations within 2 days of the birth date), then convert the EGA string to total days for later use (e.g., 38w 6d = 272 total EGA days)

In [0]:
sql = f"""
      select fact_id_1 as mom_person_id, fact_id_2 as baby_person_id, birth_datetime,person_source_value as baby_person_source_value, observation_date, value_as_string, observation_concept_id, concept_name from
     (
      select * from global_temp.mom_baby_step1 a
      left join {obs_table} b
      on a.fact_id_1=b.person_id
      where abs(datediff(observation_datetime,birth_datetime))<=2
     ) as a
     left join {concept_table}
     on observation_concept_id=CONCEPT_ID
     where lower(concept_name) like '%gestation%';
""" 
ega_raw_df = spark.sql(sql) #value_as_string: 38w 6d = 272 total EGA days
cal_ega(ega_raw_df,'value_as_string').createOrReplaceTempView("ega_info")
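
The `cal_ega` helper comes from `project_modules` and is not shown in this notebook. As a rough illustration of the string-to-days conversion only, the sketch below parses the EGA formats that appear in this notebook's examples (`38w 6d`, `38 6/7 weeks`, `39 weeks`, `39`). The function name `parse_ega` and the regular expressions are assumptions made for illustration, not the project's implementation; how `cal_ega` treats unparseable strings is not shown here, so this sketch simply returns `None`.

```python
import re

# Hypothetical helper (NOT the project's cal_ega): parse an EGA string into
# (weeks, days, total_days). Returns None when no pattern matches.
_PATTERNS = [
    # "38w 6d", "38 weeks 6 days"
    re.compile(r"^\s*(\d+)\s*w[a-z]*\s*([0-6])\s*d[a-z]*\s*$", re.I),
    # "38 6/7", "38 6/7 weeks"
    re.compile(r"^\s*(\d+)\s+([0-6])\s*/\s*7(?:\s*w[a-z]*)?\s*$", re.I),
    # "39", "39 weeks"
    re.compile(r"^\s*(\d+)(?:\s*w[a-z]*)?\s*$", re.I),
]

def parse_ega(value):
    if value is None:
        return None
    for pattern in _PATTERNS:
        match = pattern.match(value)
        if match:
            weeks = int(match.group(1))
            # Patterns with a second capture group carry the extra days
            days = int(match.group(2)) if pattern.groups >= 2 else 0
            return weeks, days, weeks * 7 + days
    return None

print(parse_ega("38w 6d"))        # (38, 6, 272)
print(parse_ega("38 6/7 weeks"))  # (38, 6, 272)
print(parse_ega("39"))            # (39, 0, 273)
```

Trying the patterns in a fixed order keeps each format explicit and makes it easy to add another one if a new string format shows up in `value_as_string`.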

##### This table contains duplicate rows per baby. For any duplicates, keep the record with the largest EGA (using the total_ega_days column).

In [0]:
sql="""
    select mom_person_id,birth_datetime,baby_person_id,birth_datetime as baby_dob,observation_date,
    value_as_string,observation_concept_id,concept_name,ega_week as weeks,ega_day as days,total_ega_days from 
    ega_info
    """
ega_df = spark.sql(sql)
ega_df.createOrReplaceTempView("ega_df")

##### Validation

In [0]:
df_inspection("ega_df","all")

In [0]:
sql="""
       select distinct a.*,b.max_total_ega_days from ega_df a
       join
       (select baby_person_id,max(total_ega_days) as max_total_ega_days
       from ega_df
       group by (baby_person_id)
       ) b
       on a.baby_person_id = b.baby_person_id and a.total_ega_days = b.max_total_ega_days;
    """
largest_ega_df = spark.sql(sql)
largest_ega_df.createOrReplaceTempView("largest_ega_df")

##### Validation

In [0]:
df_inspection("largest_ega_df","all")

##### Note: when records have the same length_of_gestation but different observation_date values, pick the earliest observation_date

##### Example:
- 123|1980-09-29 00:00:00|6 238 210|2012-07-17 00:00:00|2012-07-17|39|40 485 048|Estimated fetal gestational age at delivery|39|0|273|273
- 123|1980-09-29 00:00:00|6 238 210|2012-07-17 00:00:00|2012-07-18|39 weeks|40 485 048|Estimated fetal gestational age at delivery|39|0|273|273

In [0]:
sql="""
       select distinct a.* from largest_ega_df a
       join 
       (select baby_person_id,min(observation_date) as min_observation_date
       from largest_ega_df
       group by (baby_person_id)) b
       on a.baby_person_id = b.baby_person_id and a.observation_date = b.min_observation_date;
    """
largest_ega_earliest_obs_date_df = spark.sql(sql)
largest_ega_earliest_obs_date_df.createOrReplaceTempView("largest_ega_earliest_obs_date_df")
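
The two filtering steps above (largest total EGA days, then earliest observation date) can be sketched in plain Python on a few toy rows. The row values and the function name `pick_ega_records` are made up for illustration; the notebook itself performs these steps with the Spark SQL joins shown above.

```python
from datetime import date

def pick_ega_records(rows):
    """Keep, per baby, the rows with the largest total_ega_days, then among
    those the rows with the earliest observation_date (exact ties are kept)."""
    # Step 1: largest total EGA days per baby
    max_days = {}
    for r in rows:
        key = r["baby_person_id"]
        max_days[key] = max(max_days.get(key, r["total_ega_days"]), r["total_ega_days"])
    largest = [r for r in rows if r["total_ega_days"] == max_days[r["baby_person_id"]]]

    # Step 2: earliest observation date among the rows that survived step 1
    min_date = {}
    for r in largest:
        key = r["baby_person_id"]
        min_date[key] = min(min_date.get(key, r["observation_date"]), r["observation_date"])
    return [r for r in largest if r["observation_date"] == min_date[r["baby_person_id"]]]

# Toy rows (hypothetical values)
rows = [
    {"baby_person_id": 1, "observation_date": date(2012, 7, 17), "total_ega_days": 273},
    {"baby_person_id": 1, "observation_date": date(2012, 7, 18), "total_ega_days": 273},
    {"baby_person_id": 1, "observation_date": date(2012, 7, 16), "total_ega_days": 270},
    {"baby_person_id": 2, "observation_date": date(2013, 12, 15), "total_ega_days": 272},
]
picked = pick_ega_records(rows)
for r in picked:
    print(r["baby_person_id"], r["observation_date"], r["total_ega_days"])
```

The order of the two steps matters: the 270-day row has the earliest date for baby 1, but it is dropped in step 1, so the earliest date is taken only among the 273-day rows.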

##### Validation

In [0]:
df_inspection("largest_ega_earliest_obs_date_df","all")

##### Note: some records have the same EGA values but different text in the column 'value_as_string'; remove this column, then make the records distinct

##### Example:
- 333|1985-08-10 00:00:00|5 107 413|2013-12-15 00:00:00|2013-12-15|38 6/7 weeks|40 485 048|Estimated fetal gestational age at delivery|38|6|272|272
- 333|1985-08-10 00:00:00|5 107 413|2013-12-15 00:00:00|2013-12-15|38 6/7|40 485 048|Estimated fetal gestational age at delivery|38|6|272|272

In [0]:
sql="""
    select distinct mom_person_id,baby_person_id,baby_dob,observation_date,weeks,days,max_total_ega_days 
    from largest_ega_earliest_obs_date_df;
    """
ega_distinct_df = spark.sql(sql)
ega_distinct_df.createOrReplaceTempView("ega_distinct_df")
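
To illustrate why dropping `value_as_string` matters, the sketch below uses two toy rows modeled on the example above: they differ only in the free-text value, so they collapse into one record once that column is projected away. Plain Python stands in for the Spark SQL `select distinct`; the IDs and row layout are assumptions for illustration.

```python
from datetime import date

# Two toy rows that differ only in value_as_string (hypothetical IDs)
rows = [
    {"baby_person_id": 1, "observation_date": date(2013, 12, 15),
     "value_as_string": "38 6/7 weeks", "weeks": 38, "days": 6, "max_total_ega_days": 272},
    {"baby_person_id": 1, "observation_date": date(2013, 12, 15),
     "value_as_string": "38 6/7", "weeks": 38, "days": 6, "max_total_ega_days": 272},
]

keep = ["baby_person_id", "observation_date", "weeks", "days", "max_total_ega_days"]

# Project onto the kept columns, then deduplicate while preserving order
seen = set()
distinct_rows = []
for r in rows:
    key = tuple(r[c] for c in keep)
    if key not in seen:
        seen.add(key)
        distinct_rows.append(dict(zip(keep, key)))

print(len(rows), "->", len(distinct_rows))  # 2 -> 1
```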

##### Validation

In [0]:
df_inspection("ega_distinct_df","all")

In [0]:
# Count moms, babies, and records where the EGA is uncertain (weeks = 0)
sql="""
     select count(distinct mom_person_id) as unique_mom,count(distinct baby_person_id) as unique_baby, count(*) as total 
     from ega_distinct_df where weeks = 0;
    """
insp_df = spark.sql(sql)
insp_df.display()

In [0]:
sql="""
     select * from ega_distinct_df where weeks = 0;
    """
gestational_age_uncertain_df = spark.sql(sql)
gestational_age_uncertain_df.createOrReplaceTempView("gestational_age_uncertain_df")

##### Conception Date = (baby date of birth - length of gestation) + 14 days. In the code, 'start_gestation_date' is the conception date and 'estimated_gestation_date' is baby date of birth - length of gestation.
##### Gestational age is counted from the first day of the last menstrual period rather than from conception. For example, if the first day of the last period was March 11, the first day of pregnancy is March 11 (not the day of conception), so 14 days are added to estimate the conception date.

In [0]:
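# Keep records with EGA >= 33 weeks or uncertain EGA (weeks = 0).
# Uncertain records are assigned a length of gestation of 330 days.
sql="""
     select *,
     case when weeks = 0 then 'uncertain' else 'week >= 33' end as if_33w_or_uncertain,
     case when weeks != 0 then max_total_ega_days else  330  end as length_of_gestation
     from 
     (select * from ega_distinct_df where weeks >= 33 or weeks = 0) w33_or_uncertain;
    """

# Imports used below (harmless if already provided by project_modules)
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

ega_w33_or_uncertain_gestation_date1 = spark.sql(sql)

# Cast length_of_gestation to integer before using it in date_sub
ega_w33_or_uncertain_gestation_date1=ega_w33_or_uncertain_gestation_date1.withColumn("length_of_gestation",ega_w33_or_uncertain_gestation_date1["length_of_gestation"].cast(IntegerType()))

# estimated_gestation_date = baby date of birth - length of gestation
ega_w33_or_uncertain_gestation_date=ega_w33_or_uncertain_gestation_date1.withColumn('estimated_gestation_date',
                       F.expr("date_sub(BABY_DOB, length_of_gestation)"))

# start_gestation_date (conception date) = baby date of birth - length of gestation + 14 days
ega_w33_or_uncertain_gestation_date=ega_w33_or_uncertain_gestation_date.withColumn('start_gestation_date',
                       F.expr("date_sub(BABY_DOB, length_of_gestation-14)"))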


ega_w33_or_uncertain_gestation_date.name='ega_w33_or_uncertain_gestation_date'
register_parquet_global_view(ega_w33_or_uncertain_gestation_date)
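
As a worked check of the date arithmetic, the sketch below reproduces the two derived columns in plain Python for one record with a known EGA and one with an uncertain EGA. The function name `gestation_dates` and the sample date of birth are assumptions for illustration; the 330-day value and the `+ 14 days` offset are taken from the cell above. The sketch does not apply the `weeks >= 33` filter.

```python
from datetime import date, timedelta

UNCERTAIN_LENGTH_DAYS = 330  # value the notebook assigns when weeks = 0

def gestation_dates(baby_dob, weeks, total_ega_days):
    """Return (length_of_gestation, estimated_gestation_date, start_gestation_date)."""
    length = total_ega_days if weeks != 0 else UNCERTAIN_LENGTH_DAYS
    # Mirrors date_sub(BABY_DOB, length_of_gestation)
    estimated_gestation_date = baby_dob - timedelta(days=length)
    # Mirrors date_sub(BABY_DOB, length_of_gestation-14):
    # conception date = baby date of birth - length of gestation + 14 days
    start_gestation_date = baby_dob - timedelta(days=length - 14)
    return length, estimated_gestation_date, start_gestation_date

known = gestation_dates(date(2013, 12, 15), weeks=38, total_ega_days=272)
uncertain = gestation_dates(date(2013, 12, 15), weeks=0, total_ega_days=0)
print(known)      # (272, datetime.date(2013, 3, 18), datetime.date(2013, 4, 1))
print(uncertain)  # (330, datetime.date(2013, 1, 19), datetime.date(2013, 2, 2))
```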

##### Validation

In [0]:
df_inspection("global_temp.ega_w33_or_uncertain_gestation_date","all")