## Downloading Data from S3 and HTTPS
 
#### In this tutorial, we will learn how to:
#### - Download datasets from an HTTPS URL
#### - Access data stored in an AWS S3 bucket

## Let's get started!

## 1. Importing Necessary Libraries


In [1]:
# Uncomment to install the AWS SDK if it is missing:
# %pip install boto3

In [2]:
import pandas as pd
import json
import boto3  # AWS SDK for Python
from io import StringIO  # For handling in-memory CSV data
import requests  # For downloading data from HTTPS


## 2. Downloading Data from S3 URL
 
#### To access data from an S3 bucket, we need:
#### - AWS credentials (access key and secret key)
#### - The bucket name and object key (file name)
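
Hard-coding keys in a notebook makes them easy to leak. boto3 also reads the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, so one option is to check for those before creating the client. The sketch below is a minimal illustration; the helper name `missing_aws_vars` is hypothetical and not part of boto3.

```python
import os

# Standard environment variable names that boto3 looks up by default.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_aws_vars(env=None):
    """Return the names of required AWS credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_aws_vars()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("Credentials found in the environment.")
```

If both variables are set, `boto3.client("s3")` can be called without passing the keys explicitly.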

In [5]:
s3_url = "s3://gaspi-dataportal-20250110044710445000000002/projects/diabetes/diabetes_cohort1.json"

# Extract bucket name and object key from the S3 URL
s3_bucket_name, s3_object_key = s3_url.replace("s3://", "", 1).split("/", 1)

# AWS credentials (replace the placeholders below; never commit real keys)
aws_access_key_id = "INSERT_YOUR_KEY_HERE"  # Replace with your AWS access key
aws_secret_access_key = "INSERT_YOUR_SECRET_ACCESS_KEY_HERE"  # Replace with your AWS secret key

# Initialize the S3 client
s3 = boto3.client(
    "s3",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)

try:
    # Fetch the file from S3
    s3_object = s3.get_object(Bucket=s3_bucket_name, Key=s3_object_key)
    
    # Read the JSON content
    json_data = json.loads(s3_object["Body"].read().decode("utf-8"))
    
    # The payload is a nested dict rather than a flat list of records,
    # so it is kept as-is instead of being passed to pd.DataFrame directly
    
    print("\nData downloaded from S3 successfully!")
    #print(json_data)
except Exception as e:
    print(f"Failed to download data from S3. Error: {e}")


Data downloaded from S3 successfully!
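
The download cell catches every exception with one generic message. When a request fails, boto3 raises `botocore.exceptions.ClientError`, whose `response` dict carries an error code under `["Error"]["Code"]`. The sketch below turns a few common S3 codes into readable hints; the mapping `S3_ERROR_HINTS` and the helper `explain_s3_error` are hypothetical names chosen for illustration, and the list of codes is not exhaustive.

```python
# Hypothetical mapping from S3 error codes to hints; extend as needed.
S3_ERROR_HINTS = {
    "NoSuchBucket": "The bucket name is wrong or the bucket does not exist.",
    "NoSuchKey": "The object key does not exist in the bucket.",
    "AccessDenied": "The credentials lack permission to read this object.",
    "InvalidAccessKeyId": "The access key ID is not recognised by AWS.",
    "SignatureDoesNotMatch": "The secret access key does not match the access key ID.",
}

def explain_s3_error(error_response):
    """Build a short message from a ClientError-style response dict."""
    code = error_response.get("Error", {}).get("Code", "Unknown")
    hint = S3_ERROR_HINTS.get(code, "See the AWS error message for details.")
    return f"{code}: {hint}"

# Usage inside the download cell (botocore is installed with boto3):
#     from botocore.exceptions import ClientError
#     try:
#         s3_object = s3.get_object(Bucket=s3_bucket_name, Key=s3_object_key)
#     except ClientError as e:
#         print(explain_s3_error(e.response))

print(explain_s3_error({"Error": {"Code": "NoSuchKey"}}))
```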


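
Because the payload is nested, it cannot be turned into a table in one step. A possible approach, assuming the payload carries an `individuals` list whose entries contain nested `id`/`label` dicts, is to flatten each individual into one row with dotted column names. The `sample` dict below is a hypothetical, abbreviated stand-in for `json_data`, and `flatten` and `individuals_to_rows` are illustrative helpers.

```python
# Hypothetical, abbreviated stand-in for the downloaded payload.
sample = {
    "datasetId": "DIABETES-1",
    "individuals": [
        {
            "id": "dia-individual-1-1",
            "ethnicity": {"id": "SNOMED:15086000", "label": "African American"},
            "geographicOrigin": {"id": "SNOMED:223621005", "label": "Australia"},
        }
    ],
}

def flatten(record, prefix=""):
    """Flatten nested dicts into one dict with dotted keys; lists are kept as-is."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def individuals_to_rows(payload):
    """Return one flat dict per entry in payload["individuals"] (empty list if absent)."""
    return [flatten(person) for person in payload.get("individuals", [])]

rows = individuals_to_rows(sample)
print(rows[0]["ethnicity.label"])
```

The resulting `rows` list can be passed to `pd.DataFrame(rows)`; `pd.json_normalize(json_data["individuals"])` is a built-in pandas alternative that produces similar dotted column names.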
In [6]:
json_data

{'index': False,
 'datasetId': 'DIABETES-1',
 'dataset': {'id': 'DIABETES-1',
  'createDateTime': '2021-03-21T02:37:00-08:00',
  'dataUseConditions': {'duoDataUse': [{'id': 'DUO:0000042',
     'label': 'general research use',
     'version': '17-07-2016'}]},
  'description': 'Simulation set 1.',
  'externalUrl': 'http://example.org/wiki/Main_Page',
  'info': {'svep_data': 's3://sbeacontestdata/vcf_diabetes_svep.csv'},
  'name': 'Dataset with diabetes data',
  'updateDateTime': '2022-08-05T17:21:00+01:00',
  'version': 'v1.1'},
 'assemblyId': 'GRCH38',
 'vcfLocations': ['s3://sbeacontestdata/vcf_diabetes-1.vcf.gz'],
 'individuals': [{'id': 'dia-individual-1-1',
   'diseases': [{'diseaseCode': {'id': 'SNOMED:73211009',
      'label': 'Diabetes mellitus (disorder)'}}],
   'ethnicity': {'id': 'SNOMED:15086000', 'label': 'African American'},
   'geographicOrigin': {'id': 'SNOMED:223621005', 'label': 'Australia'},
   'interventionsOrProcedures': [{'procedureCode': {'id': 'SNOMED:225970009',
