## HCA: export matrix to Terra as BDBag

Running this notebook creates a compressed BDBag and uploads it to the S3 bucket named by the environment variable `AZUL_S3_BUCKET`. The bag contains a TSV file with the URL of the matrix. A signed URL can then be generated so that developers at the Broad can start working with the bag.

In [None]:
import os, csv, boto3, botocore, sys, tempfile, urllib.request, urllib.error
from bdbag import bdbag_api
from shutil import copy, copyfileobj, rmtree
from uuid import uuid4
from zipfile import ZipFile
from filecmp import dircmp

Refer to the README in case of import errors.

### Create bag and add a TSV file to it that contains a link to the matrix

The TSV file contains only two columns and one row:
* the first column, which I titled _source_, contains the [URL to the portal](https://staging.data.humancellatlas.org/explore/projects)
* the second column, which I titled _content_, contains the [URL to the matrix](https://staging.data.humancellatlas.org/explore/projects?filter=%5B%7B%22facetName%22%3A%22project%22%2C%22terms%22%3A%5B%22staging%2FSmart-seq2%2F2019-01-20T23%3A01%3A06Z%22%5D%7D%5D); the HTTP parameter _filter_ in that URL selects the matrix
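
The cells below expect `matrix.tsv` to exist in the current directory. As a minimal sketch, such a file could be written with the `csv` module. It assumes that the column titles are stored as a header row above the single data row. The helper name `write_matrix_tsv` and its parameters are hypothetical and not part of this notebook.

```python
import csv

def write_matrix_tsv(path, source_url, content_url):
    """Write a two-column, one-row TSV file (hypothetical helper)."""
    with open(path, 'w', newline='') as tsv_file:
        writer = csv.writer(tsv_file, delimiter='\t')
        writer.writerow(['source', 'content'])      # assumed header row with the column titles
        writer.writerow([source_url, content_url])  # the single data row
```

It would be called as `write_matrix_tsv('matrix.tsv', portal_url, matrix_url)`, where the two arguments hold the URLs linked above.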

In [None]:
original_dir_list = os.listdir()
bag_path = tempfile.mkdtemp('_bdbag')
bag = bdbag_api.make_bag(bag_path)
assert os.listdir(os.path.join(bag_path, 'data')) == []

### Copy TSV files from current directory into the data directory of the bag

In [None]:
data_path = os.path.join(bag_path, 'data')
copy('matrix.tsv', data_path)
assert 'matrix.tsv' in os.listdir(data_path)
bag = bdbag_api.make_bag(bag_path, update=True)  # write checksums into respective files
assert bdbag_api.is_bag(bag_path)
bdbag_api.validate_bag(bag_path)
assert bdbag_api.check_payload_consistency(bag)

### Compress bag

In [None]:
arc_path = bdbag_api.archive_bag(bag_path, 'zip')
assert arc_path == bag_path + '.zip'
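
Before uploading, it can be useful to confirm that the archive really contains the payload. The sketch below uses only the standard library. It assumes that the archive stores the bag under a single top-level directory with the payload in its `data` subdirectory, which matches how the bag is unpacked later in this notebook. The helper name `list_bag_payload` is hypothetical.

```python
from zipfile import ZipFile

def list_bag_payload(zip_path):
    """Return payload paths relative to the bag's data directory (hypothetical helper)."""
    payload = []
    with ZipFile(zip_path) as archive:
        for name in archive.namelist():
            parts = name.split('/')
            # Expect <bag name>/data/<payload path>; skip directory entries
            if len(parts) >= 3 and parts[1] == 'data' and not name.endswith('/'):
                payload.append('/'.join(parts[2:]))
    return sorted(payload)
```

A possible check is `assert 'matrix.tsv' in list_bag_payload(arc_path)`.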

### Upload zipped bag to S3

In [None]:
aws_profile = os.getenv('AWS_PROFILE')
bucket_name = os.getenv('AZUL_S3_BUCKET')
key = 'examples/' + str(uuid4()) + '.zip'
if aws_profile is None or bucket_name is None:
    sys.exit("Check env vars - aborting.")
session = boto3.Session(profile_name=aws_profile)
s3 = session.resource('s3')
try:
    s3.meta.client.upload_file(Filename=arc_path,
                               Bucket=bucket_name,
                               Key=key)
except Exception as e:
    print(e)
rmtree(bag_path, ignore_errors=True)
os.remove(arc_path)
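
The environment check in the cell above reports only that something is missing. As a sketch, a small helper could name the missing variables instead. The function `require_env` is hypothetical, and raising an exception rather than calling `sys.exit` is a design choice made for this example, not something the notebook prescribes.

```python
import os

def require_env(*names):
    """Return the values of the given environment variables; raise if any is unset."""
    missing = [name for name in names if os.getenv(name) is None]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return [os.environ[name] for name in names]
```

It could be used as `aws_profile, bucket_name = require_env('AWS_PROFILE', 'AZUL_S3_BUCKET')`.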

### Confirm that bag is in bucket

In [None]:
try:
    s3.Object(bucket_name, key).load()
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print(f'Object {key} not found in S3 bucket {bucket_name}.')
    else:
        raise

### Download compressed bag from bucket (identified by `key`) from S3
The original bag was created in a temporary directory (`bag_path`), uploaded to S3, and then removed locally. Here we download the archive and write it to the _current working directory_ as `bag.zip`.

In [None]:
# `bucket_name` and `key` are already known from the upload step, so the bucket need not be listed
try:
    s3.meta.client.download_file(bucket_name, key, './bag.zip')
except Exception as e:
    print(e)
assert 'bag.zip' in os.listdir()

### Unzip bag and list its content
The extracted directory has the same base name as `bag_path`.

In [None]:
with ZipFile('bag.zip','r') as zip_ref:
    zip_ref.extractall('.')
assert os.path.basename(bag_path) in os.listdir()

### Generate signed URL

In [None]:
aws_region = os.getenv('AWS_DEFAULT_REGION')
session = boto3.session.Session(region_name=aws_region)
s3Client = session.client('s3')
params = {'Bucket': bucket_name, 'Key': key}
expiration_in_secs = 604800  # 7 days, the maximum lifetime with Signature Version 4
url = s3Client.generate_presigned_url('get_object',
                                      Params=params,
                                      ExpiresIn=expiration_in_secs)
print(f'Presigned URL to bag with matrix data: {url}')
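
To see when a signed URL stops working, its query string can be inspected. The sketch below assumes the URL was signed with Signature Version 4, in which case it carries the parameters `X-Amz-Date` (signing time in UTC) and `X-Amz-Expires` (lifetime in seconds). The helper name `presigned_url_expiry` is hypothetical. It returns `None` when those parameters are absent, for example if a different signature version was used.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

def presigned_url_expiry(url):
    """Return the UTC expiry time of a SigV4 presigned URL, or None if it cannot be determined."""
    query = parse_qs(urlparse(url).query)
    if 'X-Amz-Date' not in query or 'X-Amz-Expires' not in query:
        return None
    # X-Amz-Date uses the compact ISO 8601 form, e.g. 20190120T230106Z
    signed_at = datetime.strptime(query['X-Amz-Date'][0], '%Y%m%dT%H%M%SZ')
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query['X-Amz-Expires'][0]))
```

For example, `print(presigned_url_expiry(url))` would show the expiry time of the URL generated above.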

### Download file using signed URL

In [None]:
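# Report whether the bag extracted from S3 is present in the working directory
if os.path.basename(bag_path) not in os.listdir():
    print(f'{os.path.basename(bag_path)} not found in {os.getcwd()}')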

In [None]:
bag_path = os.path.basename(bag_path)
bag_path_original = bag_path + '_original'
# Rename the bag extracted from S3 so that the bag fetched via the signed URL can be extracted next to it
try:
    os.rename(bag_path, bag_path_original)
except FileNotFoundError as e:
    print(e)
try:
    with urllib.request.urlopen(url) as response, open('bag_from_url.zip', 'wb') as out_file:
        copyfileobj(response, out_file)
except urllib.error.HTTPError as err:
    print(err)
    if err.code == 403:
        print('Did the signed URL expire?')
assert 'bag_from_url.zip' in os.listdir()

In [None]:
with ZipFile('bag_from_url.zip','r') as zip_ref:
    zip_ref.extractall('.')

### Compare the original bag with the one downloaded using the signed URL

In [None]:
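def print_diff_files(dcmp):
    """Print and return the files that differ between the two directory trees."""
    diff_files = list(dcmp.diff_files)
    for name in dcmp.diff_files:
        print("diff_file %s found in %s and %s" % (name, dcmp.left,
               dcmp.right))
    for sub_dcmp in dcmp.subdirs.values():
        diff_files += print_diff_files(sub_dcmp)
    return diff_files
dcmp = dircmp(bag_path, bag_path_original)
assert print_diff_files(dcmp) == []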
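
The `diff_files` attribute of `dircmp` covers only files present in both trees. Files that exist in just one of them are listed separately in `left_only` and `right_only`, which the function above does not inspect. As a stricter alternative, the sketch below hashes every file in a tree so that two trees can be compared as dictionaries. The helper name `tree_digests` is hypothetical, and it reads each file fully into memory, which is assumed to be acceptable for a bag this small.

```python
import hashlib
import os

def tree_digests(root):
    """Map each file path relative to root to the SHA-256 digest of its content."""
    digests = {}
    for dir_path, _, file_names in os.walk(root):
        for file_name in file_names:
            full_path = os.path.join(dir_path, file_name)
            with open(full_path, 'rb') as f:
                digests[os.path.relpath(full_path, root)] = hashlib.sha256(f.read()).hexdigest()
    return digests
```

With this helper, the comparison could be written as `assert tree_digests(bag_path) == tree_digests(bag_path_original)`.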

### Clean up local system

In [None]:
bdbag_api.cleanup_bag(bag_path)
_dirs = [x for x in os.listdir() if x.startswith('tmp')] + [x for x in os.listdir() if x.endswith('.zip')]
for _dir in _dirs:
    if os.path.isdir(_dir):
        rmtree(_dir)
    else:
        os.remove(_dir)
current_dir_list = os.listdir()
# os.listdir() returns entries in arbitrary order, so compare sorted lists
assert sorted(original_dir_list) == sorted(current_dir_list)