## NCBI SRA in the Cloud

In this notebook we use the BigQuery API client to retrieve the SRA `center_name` for each accession in a given target list.

This notebook requires a Google Cloud Platform (GCP) service account. To create one, follow the instructions at https://cloud.google.com/iam/docs/service-accounts-create#creating

and read the associated reference: https://googleapis.dev/python/google-api-core/latest/auth.html#service-accounts

Then, use the following link for instructions on how to create a GCP service account key: https://cloud.google.com/iam/docs/keys-create-delete

You will need to enter the path to the JSON service account key file when prompted below.
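
Before entering the path, it can help to confirm that the file really is a service account key and not some other credential file. The following is a minimal sketch and not part of the original notebook: the helper name `looks_like_service_account_key` is hypothetical, and the check assumes the usual layout of a downloaded JSON key (a `type` of `service_account` plus fields such as `project_id`, `private_key`, and `client_email`).

```python
import json

# Assumption: these fields are present in a downloaded service account key.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def looks_like_service_account_key(path):
    """Return True if `path` holds JSON with the fields expected in a key file."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, ValueError):
        # Missing or unreadable file, or content that is not valid JSON
        return False
    if not isinstance(data, dict):
        return False
    has_fields = all(field in data for field in REQUIRED_FIELDS)
    return has_fields and data.get("type") == "service_account"
```

This only inspects the shape of the file; it does not verify that the key is still valid or that the account has BigQuery permissions.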

In [1]:
!pip show google-cloud-bigquery

Name: google-cloud-bigquery
Version: 3.11.0
Summary: Google BigQuery API client library
Home-page: https://github.com/googleapis/python-bigquery
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: /workspaces/PASS/.venv/lib/python3.10/site-packages
Requires: google-api-core, google-cloud-core, google-resumable-media, grpcio, packaging, proto-plus, protobuf, python-dateutil, requests
Required-by: 


In [2]:
!pip show db-dtypes

Name: db-dtypes
Version: 1.1.1
Summary: Pandas Data Types for SQL systems (BigQuery, Spanner)
Home-page: https://github.com/googleapis/python-db-dtypes-pandas
Author: The db-dtypes Authors
Author-email: googleapis-packages@google.com
License: 
Location: /workspaces/PASS/.venv/lib/python3.10/site-packages
Requires: numpy, packaging, pandas, pyarrow
Required-by: 


In [3]:
import os

from google.oauth2 import service_account

# Get the file path to the GCP key file
key_file = input('Enter the file path to your GCP key file (e.g. C:\\Users\\user\\testkeyfile.json or /home/user/testkeyfile.json) and press Enter: ').strip()

# Load the credentials, or stop here so that later cells do not fail on an undefined `credentials`
if os.path.isfile(key_file):
    credentials = service_account.Credentials.from_service_account_file(key_file)
else:
    raise FileNotFoundError(f'No such key file: {key_file}')

In [4]:
from google.cloud import bigquery
import pandas as pd

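def search_database(target_acc_df):
    # Create a BigQuery client
    client = bigquery.Client(credentials=credentials)

    # Pass the accessions as an array parameter instead of formatting a Python
    # tuple into the SQL text, which breaks for a single accession: ('X',)
    accessions = target_acc_df['Acc'].dropna().astype(str).tolist()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ArrayQueryParameter("accessions", "STRING", accessions)]
    )

    # Create the SQL query
    sql_query = "SELECT acc, center_name FROM `nih-sra-datastore.sra.metadata` WHERE acc IN UNNEST(@accessions)"

    # Run the SQL query
    query_job = client.query(sql_query, job_config=job_config)

    # Get the results as a Pandas DataFrame
    results_df = query_job.to_dataframe()

    # Return the results DataFrame
    return results_df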

# Example usage
target_acc_df = pd.read_csv('./data/input/sample_accessions.csv')

matching_df = search_database(target_acc_df)
matching_df = matching_df.rename(columns={"acc": "Acc", "center_name":"Center_Names"})
matching_df.head(8)


| | Acc | Center_Names |
|---|---|---|
| 0 | ERR3347078 | UMR 1232 |
| 1 | ERR006603 | CU |
| 2 | SRR10738145 | ADDIS ABABA UNIVERSITY |
| 3 | SRR10202400 | CHILDREN'S HOSPITAL OF FUDAN UNIVERSITY |
| 4 | SRR11609212 | CHILDREN HOSPITAL OF GUANGXI ZHUANG AUTONOMOUS... |
| 5 | SRR12147237 | CHINA ACADEMY OF CHINESE MEDICAL SCIENCES |
| 6 | SRR11423752 | CHILDREN'S HOSPITAL OF CHONGQING MEDICAL UNIVE... |
| 7 | SRR11478637 | SOUTH VALLEY UNIVERSITY |
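
The accession list goes from the CSV into the query as-is, so stray whitespace, blank cells, or duplicates all affect what BigQuery is asked to match. Below is a minimal sketch of a clean-up step that could run before the query; it is not part of the original notebook. The helper names `clean_accessions` and `chunked` are hypothetical, and the pattern assumes the targets are run accessions of the form `SRR`, `ERR`, or `DRR` followed by digits, as in the sample output above.

```python
import re

# Assumption: targets are run accessions such as SRR10738145 or ERR006603.
RUN_ACCESSION = re.compile(r"[SED]RR[0-9]+")


def clean_accessions(raw_values):
    """Strip whitespace, drop blank or malformed entries, and de-duplicate
    while preserving the input order."""
    seen = set()
    cleaned = []
    for value in raw_values:
        acc = str(value).strip().upper()
        if RUN_ACCESSION.fullmatch(acc) and acc not in seen:
            seen.add(acc)
            cleaned.append(acc)
    return cleaned


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With these helpers, `clean_accessions(target_acc_df['Acc'])` would give a tidy list, and `chunked(...)` could split a very long list into several smaller queries; the batch size to use is a judgment call and is not specified by the notebook.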


In [5]:
matching_df.to_csv('./data/output/sample_data.csv', encoding='utf-8', index=False)
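
A query of this kind returns rows only for accessions that exist in the metadata table, so any target accession without a match is silently absent from the saved file. The sketch below shows one way to list those accessions; it is an illustration, not part of the original notebook, and the helper name `unmatched_accessions` is hypothetical.

```python
def unmatched_accessions(requested, returned):
    """Return the requested accessions that are absent from `returned`,
    keeping the order in which they were requested."""
    found = set(returned)
    return [acc for acc in requested if acc not in found]
```

In this notebook it could be called as `unmatched_accessions(target_acc_df['Acc'], matching_df['Acc'])`, since iterating over a pandas Series yields its values.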