# Post SCIT-605 Data Delivery Tests

1. Add the most recent data delivery schema to the list of `arcus_deliveries`.
2. Check for patients dropped between deliveries.
3. Commit the updated `missing_patient_info.csv` table.
4. Switch to Respublica and pull the updated table.
5. Run `python check_data_deliveries.py` to check the data deliveries for any requested patient IDs that are missing.

In [None]:
from google.cloud import bigquery
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)


### 1. Add the most recent data delivery schema to the list of `arcus_deliveries`

In the past, I've had to ask the Arcus team member delivering the data for the schema name specifically.

In the lab, there are two visible SQL schemas: `arcus` and `lab`. Tables delivered from Arcus are placed in the `arcus` schema; tables we generate are placed in the `lab` schema. Previously delivered Arcus data is archived into a schema named `arcus_YYYY_MM_DD`, where `YYYY`, `MM`, and `DD` are the 4-digit year, 2-digit month, and 2-digit day on which the data was originally delivered to the lab.
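
Because the later steps compare each delivery with the one after it, the order of `arcus_deliveries` matters. The sketch below shows one way to check that a schema name follows the `arcus_YYYY_MM_DD` convention and to sort names chronologically. `parse_delivery_date` and `sort_deliveries` are hypothetical helpers written for illustration; they are not part of the lab codebase.

```python
import re
from datetime import date

# Pattern for archived schema names, following the convention described above
SCHEMA_PATTERN = re.compile(r"^arcus_(\d{4})_(\d{2})_(\d{2})$")


def parse_delivery_date(schema_name):
    """Return the delivery date encoded in an archived schema name.

    Hypothetical helper. Raises ValueError if the name does not match
    arcus_YYYY_MM_DD or does not encode a real calendar date.
    """
    match = SCHEMA_PATTERN.match(schema_name)
    if match is None:
        raise ValueError(f"Unexpected schema name: {schema_name!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)  # raises ValueError for e.g. month 13


def sort_deliveries(schema_names):
    """Return the schema names ordered from oldest to newest delivery."""
    return sorted(schema_names, key=parse_delivery_date)


example = ["arcus_2023_05_02", "arcus_2023_04_05"]
print(sort_deliveries(example))  # ['arcus_2023_04_05', 'arcus_2023_05_02']
```

Running a check like this before querying would also catch a mistyped schema name early, instead of failing inside BigQuery.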

In [None]:
arcus_deliveries = [
    "arcus_2023_04_05",
    "arcus_2023_05_02",
    "arcus_2023_10_23",
    "arcus_2023_12_11",
    "arcus_2024_02_13",
    "arcus_2024_07_16",
    "arcus_2025_02_11",
    "arcus_2025_04_16",
]

# search_username = "Coarse Text Search 2025-04-28"

### 2. Check for dropped patients

Patients are sometimes included in one Arcus data delivery but not the next. Reasons for a patient to be "dropped" range from the family withdrawing consent for the patient's data to be used in research to a clerical error that assigned one patient's scan to the wrong patient. We do not expect many of these patients, but we would rather know ahead of time.
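
The core of the check is a set difference between consecutive deliveries. Here is a minimal sketch of that idea in plain Python, using made-up schema names and patient ids. `find_dropped_patients` is a hypothetical helper for illustration only; the actual check in this notebook runs as a BigQuery query and additionally restricts the result to requested sessions.

```python
def find_dropped_patients(patients_by_delivery, deliveries):
    """Map each later delivery to the patient ids missing from it.

    patients_by_delivery: dict of schema name -> set of patient ids
    deliveries: schema names in chronological order
    """
    dropped = {}
    for earlier, later in zip(deliveries, deliveries[1:]):
        # Set difference: in the earlier delivery but absent from the later one
        dropped[later] = patients_by_delivery[earlier] - patients_by_delivery[later]
    return dropped


# Toy data with made-up names and ids, for illustration only
toy = {
    "delivery_a": {"P1", "P2", "P3"},
    "delivery_b": {"P1", "P3", "P4"},
    "delivery_c": {"P1", "P4"},
}
order = ["delivery_a", "delivery_b", "delivery_c"]
result = find_dropped_patients(toy, order)
print(result)  # {'delivery_b': {'P2'}, 'delivery_c': {'P3'}}
```

Note that `P4` first appears in `delivery_b` and is not reported: only patients who disappear from one delivery to the next count as dropped.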

In [None]:
def check_one_delivery(s_delivery_1, s_delivery_2):
    '''
    Compare the contents of the patient and procedure_order tables
    from two delivery schemas
    @param s_delivery_1 str name of the first (earlier) delivery schema
    @param s_delivery_2 str name of the second (later) delivery schema
    @return df_missing_pats dataframe containing patient id,
            procedure order id, request label, and name of the
            schema where they were no longer included
    '''
    # Initialize the client
    client = bigquery.Client()

    # SQL query to get the patients in schema 1 but not schema 2
    query = """with missing as (select 
        pat.pat_id, 
        proc.proc_ord_id,
        '"""+s_delivery_2+"""' as missing_from_table
      from
        """+s_delivery_1+""".patient pat
        join """+s_delivery_1+""".procedure_order proc on proc.pat_id = pat.pat_id
      where
        pat.pat_id not in (
          select
            pat_id
          from
            """+s_delivery_2+""".patient
        )
      ) select
          req.pat_id,
          req.proc_ord_id,
          request_label,
          missing.missing_from_table
        from
          lab.requested_sessions_main_with_metadata req
          join missing on req.pat_id = missing.pat_id
          and req.proc_ord_id = missing.proc_ord_id"""
    
    # Run the query and get the results as a dataframe
    df_missing_pats = client.query(query).to_dataframe()
    
    # Print a summary of how many requested patients were dropped
    n_dropped = len(df_missing_pats[['pat_id', 'missing_from_table']].drop_duplicates())
    print("There are", n_dropped, "requested patients who were in", s_delivery_1, "but were dropped in", s_delivery_2)

    return df_missing_pats

In [None]:
def get_dropped_patient_counts(arcus_deliveries):
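    '''
    For every pair of consecutive deliveries, identify requested patients
    who were included in the earlier delivery but not in the later one
    @param arcus_deliveries list of delivery schema names in chronological order
    @return df_dropped_pats dataframe combining the dropped patient info
            from every pair of deliveries
    '''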
    missing_pat_dfs = []
    for i in range(len(arcus_deliveries)-1):
        missing_pat_dfs.append(check_one_delivery(arcus_deliveries[i], arcus_deliveries[i+1]))

    # Combine the list of missing patient info into a single df
    df_dropped_pats = pd.concat(missing_pat_dfs, axis=0)
    print()
    print("Missing patient ids")
    for _, row in df_dropped_pats[['pat_id', 'request_label']].drop_duplicates().iterrows():
        print(row['pat_id'], row['request_label'])
    return df_dropped_pats

In [None]:
df = get_dropped_patient_counts(arcus_deliveries)

In [None]:
# df.to_csv("./missing_patient_info.csv", index=False)

## Update Projects

In [None]:
import json

# Load the config
cfg = "../queries/config.json"
with open(cfg, "r") as f:
    cohort_lookup = json.load(f)

cohort_list = list(cohort_lookup.keys())
print(cohort_list)

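# The loop below makes sure all reports for all cohorts are indexed in the project table.
# Comment it out to skip this step.
# Warning: it will take time to run, do not panic.
# Note: add_reports_to_project is not defined or imported in this notebook,
# so it must be made available before running this cell.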
for cohort in cohort_list:
    print(cohort)
    add_reports_to_project(cohort)

## Apply the Coarse Text Search Filter
Previously, a coarse text search assigned a grade of 0 to reports containing certain substrings that indicate severe pathology, which kept them out of grader queues. The code for this step still exists in the notebook but has been commented out since the addition of Maryam Denali’s NLP models. Whether to run the coarse text search is still up for discussion.
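
To make the matching rule concrete, here is a minimal Python sketch of the same filter applied to a single report. The substrings are copied from the commented-out query in this notebook. `matches_coarse_search` is a hypothetical helper for illustration, and I am assuming that Python's case-sensitive `in` behaves like the SQL `like` patterns here.

```python
# Substrings copied from the commented-out query in this notebook
COARSE_SEARCH_SUBSTRINGS = [
    "hemotherapy",
    "resect",
    "Resect",
    "raniotomy",
    "raniectomy",
    "urgical cavity",
    "ost surg",
    "ostsurg",
    "ost-surg",
]


def matches_coarse_search(proc_name, narrative_text):
    """Return True if a report would be caught by the coarse text search.

    Hypothetical helper mirroring the `like` conditions in the query:
    the procedure name must contain "BRAIN" and the narrative must
    contain at least one of the substrings above.
    """
    if "BRAIN" not in proc_name:
        return False
    return any(s in narrative_text for s in COARSE_SEARCH_SUBSTRINGS)


# Made-up report text, for illustration only
print(matches_coarse_search("MR BRAIN WO CONTRAST", "Status post craniotomy."))  # True
print(matches_coarse_search("MR BRAIN WO CONTRAST", "Normal study."))  # False
```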

In [None]:
# q = '''insert into
#   lab.grader_table_with_metadata (
#     proc_ord_id,
#     grader_name,
#     grade,
#     grade_category,
#     pat_id,
#     age_in_days,
#     proc_ord_year,
#     proc_name,
#     report_origin_table,
#     project
#   ) with CTE as (
#     select
#       proc.proc_ord_id,
#       "''' + search_username + '''" as grader_name,
#       0 as grade,
#       "Unique" as grade_category,
#       proc.pat_id,
#       proc.proc_ord_age as age_in_days,
#       proc.proc_ord_year,
#       proc.proc_ord_desc as proc_name,
#       "arcus.procedure_order" as report_origin_table,
#       "SLIP" as project
#     from
#       arcus.procedure_order proc
#       inner join arcus.procedure_order_narrative txt on proc.proc_ord_id = txt.proc_ord_id
#     where
#       proc.proc_ord_desc like "%BRAIN%"
#       and (
#         txt.narrative_text like "%hemotherapy%"
#         or txt.narrative_text like "%resect%"
#         or txt.narrative_text like "%Resect%"
#         or txt.narrative_text like "%raniotomy%"
#         or txt.narrative_text like "%raniectomy%"
#         or txt.narrative_text like "%urgical cavity%"
#         or txt.narrative_text like "%ost surg%"
#         or txt.narrative_text like "%ostsurg%"
#         or txt.narrative_text like "%ost-surg%"
#       )
#   )
# select
#   *
# from
#   CTE;'''

# client = bigquery.Client()

# job = client.query(q)
# job.result()