## NCBI SRA on AWS

In this notebook, we use AWS Glue and Amazon Athena to look up the `center_name` field in the Sequence Read Archive (SRA) metadata for a provided list of accessions.

This notebook requires AWS credentials. If you haven't set up an AWS account yet, you can start by creating one here: [AWS Account Setup](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/).

After setting up your account, you'll need to create an IAM user with the necessary permissions and generate an access key. To do this:

1. Follow the instructions here: [Creating an IAM User in Your AWS Account](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html).
2. Once your IAM user is set up, generate access keys by following: [Managing Access Keys for IAM Users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey).

When you run the notebook, it prompts for the path to the access key CSV file. The file must contain the columns `Access key ID` and `Secret access key`.

### Setting up a Role for AWS Glue

To enable AWS Glue to access your data in AWS services, you need to grant it permissions. This is done by creating an IAM role:

1. Navigate to the IAM console: AWS Services → Security, Identity, & Compliance → IAM.
2. In the IAM dashboard, click on "Roles" and then "Create role".
3. Select "AWS service" as the trusted entity type and choose "Glue" as the service that will use this role. Then, click "Next: Permissions".
4. Attach the necessary policies. For AWS Glue to access data in S3, you can attach the managed policy `AWSGlueServiceRole`. Additionally, grant it permissions to your specific S3 bucket by either selecting an existing policy or creating a new one.
5. Review your role and give it a meaningful name, e.g., `MyGlueServiceRole`, and then create the role.
6. After creating the role, copy the Role ARN. It will have a format like this: `arn:aws:iam::YOUR_ACCOUNT_NUMBER:role/YOUR_ROLE_NAME`.

You'll need to provide the Role ARN when setting up the Glue crawler in this notebook.
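
The console steps above can also be scripted. The following is a minimal sketch, assuming the credentials in use are allowed to call `iam:CreateRole` and `iam:AttachRolePolicy`. The helper names `glue_trust_policy` and `create_glue_role` are hypothetical and not part of this notebook, and the sketch covers only the trust relationship and the managed policy from step 4; any bucket-specific S3 permissions still have to be attached separately.

```python
import json

# AWS-managed policy named in step 4 above.
GLUE_MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy():
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(iam_client, role_name="MyGlueServiceRole"):
    """Create the role, attach the managed Glue policy, and return the role ARN.

    `iam_client` is assumed to be a boto3 IAM client, e.g. boto3.client("iam").
    """
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
        Description="Role assumed by AWS Glue crawlers",
    )
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn=GLUE_MANAGED_POLICY_ARN,
    )
    return response["Role"]["Arn"]
```

Calling `create_glue_role(boto3.client("iam"))` would return the Role ARN described in step 6.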

For data manipulation and querying AWS services, we use the [awswrangler](https://awswrangler.readthedocs.io/en/stable/) library. This library offers a flexible way to handle AWS data sources, making it easier to connect with services like Amazon Athena and AWS Glue.

In [3]:
!pip show awswrangler

Name: awswrangler
Version: 3.3.0
Summary: Pandas on AWS.
Home-page: https://aws-sdk-pandas.readthedocs.io/
Author: Amazon Web Services
Author-email: 
License: Apache-2.0
Location: /home/rstudio/.local/lib/python3.10/site-packages
Requires: boto3, botocore, numpy, packaging, pandas, pyarrow, typing-extensions
Required-by: 


In [4]:
!pip show boto3

Name: boto3
Version: 1.28.43
Summary: The AWS SDK for Python
Home-page: https://github.com/boto/boto3
Author: Amazon Web Services
Author-email: 
License: Apache License 2.0
Location: /home/rstudio/.local/lib/python3.10/site-packages
Requires: botocore, jmespath, s3transfer
Required-by: awswrangler


In [9]:
import awswrangler as wr
import pandas as pd
import boto3
import os

# Prompt the user for the path to the AWS access key CSV file
aws_cred_path = input("Enter the file path to your AWS credentials CSV (e.g. C:\\Users\\user\\aws_credentials.csv, /home/user/aws_credentials.csv, or /workspaces/PASS/SGMC/athena_sra_accessKeys.csv): ").strip()

# Check to make certain the path exists and either get the credentials or return an error message
if os.path.exists(aws_cred_path):
    # Read AWS credentials from CSV
    aws_cred_df = pd.read_csv(aws_cred_path)
    aws_access_key_id = aws_cred_df['Access key ID'].iloc[0]
    aws_secret_access_key = aws_cred_df['Secret access key'].iloc[0]

    # Set up default session using the extracted credentials
    boto3.setup_default_session(
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name='us-east-1'
    )

else:
    # Stop here: every later cell needs a configured default session
    raise FileNotFoundError(f"AWS credentials file not found: {aws_cred_path}")

In [20]:
# AWS setup
wr.config.s3_endpoint_url = "https://s3.us-east-1.amazonaws.com"
wr.config.athena_region_name = "us-east-1"  # Set Athena to use us-east-1 region

# S3 bucket and path for the Sequence Read Archive metadata
bucket = "sra-pub-metadata-us-east-1"
path = f"s3://{bucket}/sra/metadata/"

# Create a Glue crawler to discover the data and create the Athena table
glue = boto3.client('glue', region_name='us-east-1')
crawler_name = "sra-metadata-crawler"

try:
    response = glue.create_crawler(
        Name=crawler_name,
        Role="your_glue_service_role",  # Replace with your Glue service role ARN
        DatabaseName="sra_metadata_db",
        Description="Crawler for SRA metadata",
        Targets={
            "S3Targets": [
                {
                    "Path": path
                },
            ]
        }
    )
except glue.exceptions.AlreadyExistsException:
    print(f"Crawler {crawler_name} already exists.")

# Start the crawler
glue.start_crawler(Name=crawler_name)


Crawler sra-metadata-crawler already exists.


{'ResponseMetadata': {'RequestId': 'bc4dde38-dbd9-4c4c-9171-9a0a92d6d69e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Fri, 08 Sep 2023 17:52:41 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '2',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'bc4dde38-dbd9-4c4c-9171-9a0a92d6d69e'},
  'RetryAttempts': 0}}
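
Crawls run asynchronously, so `start_crawler` returns before the `metadata` table has been created or refreshed in the Glue Data Catalog. Querying Athena immediately after a first crawl may therefore fail. The sketch below polls `get_crawler` until the crawler reports `READY` again. The helper name `wait_for_crawler` and the default polling interval and timeout are assumptions, not part of the original notebook.

```python
import time


def wait_for_crawler(glue_client, crawler_name, poll_seconds=15, timeout_seconds=1800):
    """Poll get_crawler until the crawler is READY; return the last crawl status.

    `glue_client` is assumed to be a boto3 Glue client such as `glue` above.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl is absent if the crawler has never completed a run.
            return crawler.get("LastCrawl", {}).get("Status")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Crawler {crawler_name} still {crawler['State']} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

A call such as `wait_for_crawler(glue, crawler_name)` would typically return `"SUCCEEDED"` once the table is available; any other status suggests checking the crawler's logs before running the query.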

In [23]:
# Function to query the AWS Athena database for SRA metadata
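def search_database(target_acc_df):
    # Build a quoted, comma-separated IN list explicitly.
    # str(tuple(...)) would emit a trailing comma for a single accession, which is invalid SQL.
    acc_list = ", ".join(f"'{acc}'" for acc in target_acc_df["Acc"].astype(str))

    # Create the SQL query using double quotes for identifiers
    sql_query = (
        'SELECT "acc", "center_name" FROM "sra_metadata_db"."metadata" '
        f'WHERE "acc" IN ({acc_list})'
    )

    # Execute the SQL query using Athena
    results_df = wr.athena.read_sql_query(sql_query, database="sra_metadata_db")

    # Return the results DataFrame
    return results_df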
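
Because the query text is assembled from the contents of the input file, it may be worth validating the accessions and splitting long lists before sending anything to Athena, which limits the length of a query string. The following is a sketch under stated assumptions: the input holds only run accessions (prefix `SRR`, `ERR`, or `DRR` followed by digits), and the helper name `build_queries` and the default `chunk_size` are hypothetical.

```python
import re

# Assumption: inputs are run accessions (SRA, ENA, or DDBJ prefix plus digits).
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def build_queries(accessions, chunk_size=1000):
    """Return one SELECT statement per chunk of validated, de-duplicated accessions."""
    cleaned = []
    seen = set()
    for acc in accessions:
        acc = str(acc).strip()
        # Reject anything that is not a plain accession, including stray quotes.
        if not RUN_ACCESSION.fullmatch(acc):
            raise ValueError(f"Not a run accession: {acc!r}")
        if acc not in seen:
            seen.add(acc)
            cleaned.append(acc)

    queries = []
    for start in range(0, len(cleaned), chunk_size):
        in_list = ", ".join(f"'{acc}'" for acc in cleaned[start:start + chunk_size])
        queries.append(
            'SELECT "acc", "center_name" FROM "sra_metadata_db"."metadata" '
            f'WHERE "acc" IN ({in_list})'
        )
    return queries
```

Each returned string could be passed to `wr.athena.read_sql_query(..., database="sra_metadata_db")` and the resulting DataFrames combined with `pd.concat`.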

In [24]:
# Load the sample accession numbers
target_acc_df = pd.read_csv("./data/input/sample_accessions.csv")
matching_df = search_database(target_acc_df)
matching_df = matching_df.rename(columns={"acc": "Acc", "center_name":"Center_Names"})
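
The query returns rows only for accessions present in the metadata table, so any requested accession without a match is silently absent from `matching_df`. A small check like the one below can list them. It is a sketch, and the helper name `missing_accessions` is hypothetical.

```python
def missing_accessions(requested, returned):
    """Return requested accessions absent from the query results, in input order."""
    found = {str(acc) for acc in returned}
    missing = []
    for acc in requested:
        acc = str(acc)
        # Skip accessions that matched, and report each missing one only once.
        if acc not in found and acc not in missing:
            missing.append(acc)
    return missing
```

In this notebook it could be called as `missing_accessions(target_acc_df["Acc"], matching_df["Acc"])`.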

In [25]:
matching_df

|   | Acc | Center_Names |
|---|-----|--------------|
| 0 | SRR11609212 | CHILDREN HOSPITAL OF GUANGXI ZHUANG AUTONOMOUS... |
| 1 | SRR10566897 | AUSTRALIAN INSTITUTE OF TROPICAL HEALTH AND ME... |
| 2 | SRR10581381 | UNIVERSITY OF OTTAWA |
| 3 | SRR11881309 | CALIFORNIA STATE UNIVERSITY, FULLERTON |
| 4 | SRR10018586 | CFSAN |
| 5 | SRR11802022 | INSTITUT PASTEUR DE MONTEVIDEO |
| 6 | SRR11005721 | UNIVERSITE DE MONTREAL - IRBV |
| 7 | SRR015631 | BI |
| 8 | SRR11423752 | CHILDREN'S HOSPITAL OF CHONGQING MEDICAL UNIVE... |
| 9 | SRR10206177 | UNIVERSITY OF SASKATCHEWAN |


In [26]:
# Save the results to a CSV file
matching_df.to_csv('./data/output/sample_data.csv', encoding='utf-8', index=False)