SHC_Core_2023 + LPCH_Core_2023 data updates #399
Just finished the first renaming. Here's the code I used for generating the SQL commands:

```
!bq ls shc_core_2023 | grep '^ shc_' > tables_to_rename.txt
prefix = 'shc_'
with open('tables_to_rename.txt', 'r') as f:
```
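The snippet above is truncated, but the idea is to list the prefixed tables and emit one rename statement per table. A minimal sketch of that generator (the function name `generate_rename_sql` and the `ALTER TABLE ... RENAME TO` output format are my assumptions, not taken from the original snippet):

```python
def generate_rename_sql(tables, prefix='shc_', dataset='shc_core_2023'):
    """Emit BigQuery ALTER TABLE ... RENAME TO statements that strip a prefix.

    Hypothetical sketch: `tables` stands in for the lines read from
    tables_to_rename.txt in the original (truncated) snippet.
    """
    statements = []
    for table in tables:
        table = table.strip()
        if table.startswith(prefix):
            new_name = table[len(prefix):]
            statements.append(
                f'ALTER TABLE `{dataset}.{table}` RENAME TO `{new_name}`;')
    return statements

if __name__ == '__main__':
    for stmt in generate_rename_sql(['shc_encounter', 'shc_demographic']):
        print(stmt)
```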
The second renaming is finished. The code snippet used for this step has been added as Item 2 in https://github.com/HealthRex/CDSS/blob/master/setup/BigQueryDataUpdateGuide.MD
Great. Rather than put all the code in the issue, keep it in the actual code base and just link to the files from here.
That will help us find and collate everything when we need to do this again in the next iteration.
Thanks,
Jonathan
From: Fatemeh Amrollahi
Sent: Friday, February 9, 2024 7:55 PM
Subject: Re: [HealthRex/CDSS] SHC_Core_2023 + LPCH_Core_2023 data updates (Issue #399)
The second renaming is finished. Here is the code snippet that I used.
```python
from google.cloud import bigquery
from google.cloud.bigquery import dbapi
import subprocess
import json


def All_TimeStamp_Columns(table_name, dataset_name='shc_core_2023', project_id='som-nero-phi-jonc101'):
    # Extract all columns of the table that have a DATETIME type
    command = f'bq show --schema --format=prettyjson {project_id}:{dataset_name}.{table_name}'
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    # Command was successful
    if process.returncode == 0:
        datetimescols = []
        schema = json.loads(stdout)
        for field in schema:
            if field['type'] == 'DATETIME':
                datetimescols.append(field['name'])
        return datetimescols
    else:
        print(f'Error running bq command: {stderr.decode("utf-8")}')
        return None


def All_Available_tables(dataset_name='shc_core_2023', project_id='som-nero-phi-jonc101'):
    # Get all the table names available in the dataset
    command = f'bq ls --format=json {project_id}:{dataset_name}'
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    # Command was successful
    if process.returncode == 0:
        table_names = []
        schema = json.loads(stdout)
        for field in schema:
            if field['type'] == 'TABLE':
                table_names.append(field['tableReference']['tableId'])
        return table_names
    else:
        print(f'Error running bq command: {stderr.decode("utf-8")}')
        return None


def Time_to_UTC(columns, table_name, dataset_name='shc_core_2023', project_id='som-nero-phi-jonc101'):
    def Auth(project_id='som-nero-phi-jonc101'):
        client = bigquery.Client(project_id)
        return client

    client = Auth(project_id)
    conn = dbapi.connect(client)
    cursor = conn.cursor()
    new_table_name = table_name
    query = f'CREATE OR REPLACE TABLE `{project_id}.{dataset_name}.{new_table_name}` AS SELECT * '
    # Add a UTC column alongside each DATETIME column
    for col in columns:
        new_col = col + '_utc'
        query += f",\n timestamp(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', timestamp({col})), 'America/Los_Angeles') as {new_col}"
    query += f'\n from `{project_id}.{dataset_name}.{table_name}`;'
    # DDL statement: replaces the table in place, returns no rows
    cursor.execute(query)
    cursor.close()
    conn.close()


if __name__ == '__main__':
    project_id = 'som-nero-phi-jonc101'
    dataset_name = 'shc_core_2023'
    table_names = All_Available_tables(dataset_name, project_id)
    for table in table_names:
        datetimecolumns = All_TimeStamp_Columns(table_name=table, dataset_name=dataset_name, project_id=project_id)
        if datetimecolumns:
            Time_to_UTC(columns=datetimecolumns, table_name=table, dataset_name=dataset_name, project_id=project_id)
```
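The `FORMAT_TIMESTAMP`/`timestamp(...)` pattern in the generated SQL reinterprets each naive local DATETIME as an America/Los_Angeles instant, which BigQuery then stores as UTC. The same conversion can be sketched in plain Python with the standard-library `zoneinfo` (illustration only, not part of the pipeline above):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def local_to_utc(dt_naive, tz='America/Los_Angeles'):
    # Attach the local timezone to the naive datetime, then convert to UTC
    return dt_naive.replace(tzinfo=ZoneInfo(tz)).astimezone(ZoneInfo('UTC'))

# July is PDT (UTC-7), so noon local becomes 19:00 UTC
print(local_to_utc(datetime(2023, 7, 1, 12, 0)))  # 2023-07-01 19:00:00+00:00
```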
[ ] Separate out lpch_core_2023: move these tables out to a separate "lpch_core_2023" dataset/database to match. Migrate over this data from Research IT to our secure compute databases.
[ ] Rename the datasets/tables to remove the shc_ and lpch_ prefixes, so that the naming convention matches prior years (instead just store them in separate shc_core_2023 and lpch_core_2023 databases/datasets)
[ ] Add UTC version of all datetimes
[ ] Extract numerical values from flowsheets
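For the last item, one plausible approach (purely illustrative; the thread does not specify how extraction will be done) is a small regex helper that pulls the first numeric token out of a free-text flowsheet value:

```python
import re

def extract_numeric(value):
    """Pull the first numeric token out of a free-text flowsheet value.

    Hypothetical helper: flowsheet values often mix numbers with units or
    annotations (e.g. '98.6 F', 'BP 120/80'), so this recovers only the
    first number and returns None when no number is present.
    """
    match = re.search(r'-?\d+(?:\.\d+)?', str(value))
    return float(match.group()) if match else None

print(extract_numeric('98.6 F'))    # 98.6
print(extract_numeric('BP 120/80')) # 120.0
print(extract_numeric('refused'))   # None
```

A real pipeline would likely push this into SQL (e.g. `REGEXP_EXTRACT` plus `SAFE_CAST` in BigQuery) rather than row-by-row Python, but the parsing rule is the same.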