## Download data from the HCA

First we're going to install the HCA client and log in. The login step is omitted from this notebook, but here's how you do it:

1. Use a computer where `localhost` in your browser resolves to the same machine your terminal is running on (for example, your laptop rather than a remote notebook server).
2. Install the hca client with `pip install hca`
3. Run `hca login dss` and follow the instructions.
4. Go to `~/.config/hca/config.json` and find the string in `application_secrets.client_secrets`.
5. In the notebook environment, run `hca login dss --access-token <token>`, where `<token>` is the string you found in the json file above.
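
Step 4 can also be done from Python instead of opening the file by hand. This is only a minimal sketch: it assumes the config file is plain JSON and that the dotted name in step 4 maps onto nested keys. `lookup_dotted` and `read_config_value` are hypothetical helpers, not part of the `hca` package.

```python
import json
from pathlib import Path


def lookup_dotted(config, dotted_key):
    """Walk a nested dict using a dotted key such as 'a.b.c'."""
    value = config
    for part in dotted_key.split("."):
        value = value[part]  # raises KeyError if the path is missing
    return value


def read_config_value(path, dotted_key):
    """Load a JSON file and return the value stored under dotted_key."""
    with open(Path(path).expanduser()) as f:
        return lookup_dotted(json.load(f), dotted_key)


# Example (path and key taken from step 4 above; assumes step 3 wrote the file):
# token = read_config_value("~/.config/hca/config.json",
#                           "application_secrets.client_secrets")
```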

Now, initialize a client for the DSS.

In [37]:
import hca, hca.dss, json
client = hca.dss.DSSClient()

Suppose we know the sample ID that we want to look up, `63cba020-fbb5-4fe5-82c2-eeb1cd5dd957`. So, we write an Elasticsearch query to find bundles with that ID. The `files.assay_json.sample_id` field can be discovered using the private Kibana instance.
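
If you expect to look up more than one sample, the query body can be built by a small function rather than typed inline. A minimal sketch: `sample_query` is a hypothetical helper, and the field name is simply the one mentioned above.

```python
import json


def sample_query(sample_id, field="files.assay_json.sample_id"):
    """Build an Elasticsearch bool/must/match query body for one sample ID."""
    return {"query": {"bool": {"must": [{"match": {field: sample_id}}]}}}


query = sample_query("63cba020-fbb5-4fe5-82c2-eeb1cd5dd957")
print(json.dumps(query, indent=2))
```

The resulting dict has the same shape as the `es_query` argument passed to `client.post_search` in the next cell.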

In [40]:
results = client.post_search(es_query={"query": {"bool": {"must": [{"match": {"files.assay_json.sample_id": "63cba020-fbb5-4fe5-82c2-eeb1cd5dd957"}}]}}}, replica="aws")
print(json.dumps(results, indent=2))

{
  "es_query": {
    "query": {
      "bool": {
        "must": [
          {
            "match": {
              "files.assay_json.sample_id": "63cba020-fbb5-4fe5-82c2-eeb1cd5dd957"
            }
          }
        ]
      }
    }
  },
  "results": [
    {
      "bundle_fqid": "2d87b5f5-757d-40b2-a2a7-49d2df179cb9.2017-11-21T185012.925568Z",
      "bundle_url": "https://dss.staging.data.humancellatlas.org/v1/bundles/2d87b5f5-757d-40b2-a2a7-49d2df179cb9?version=2017-11-21T185012.925568Z&replica=aws",
      "search_score": 0.074107975
    },
    {
      "bundle_fqid": "0cae8147-8268-46bd-b743-23b4b113147d.2017-10-27T184156.405768Z",
      "bundle_url": "https://dss.staging.data.humancellatlas.org/v1/bundles/0cae8147-8268-46bd-b743-23b4b113147d?version=2017-10-27T184156.405768Z&replica=aws",
      "search_score": 0.074107975
    },
    {
      "bundle_fqid": "a76fda85-8e75-4b8c-9c04-fbbbeed05284.2017-11-17T184633.318126Z",
      "bundle_url": "https://dss.staging.data.humancellatlas.o

Okay, 48 results! There's probably some way to filter these, but I'd rather not go back to Kibana. So let's go ahead and assume we want the first one, and we'll download it.

In [41]:
client.download(results["results"][0]["bundle_fqid"], replica="aws")

SwaggerAPIException: not_found: Cannot find file! (HTTP 404). Details:
Traceback (most recent call last):
  File "/var/task/chalicelib/dss/error.py", line 26, in wrapper
    return func(*args, **kwargs)
  File "/var/task/chalicelib/dss/api/bundles/__init__.py", line 34, in get
    return get_bundle(uuid, Replica[replica], version, directurls)
  File "/var/task/chalicelib/dss/storage/bundles.py", line 132, in get_bundle
    directurls
  File "/var/task/chalicelib/dss/storage/bundles.py", line 46, in get_bundle_from_bucket
    raise DSSException(404, "not_found", "Cannot find file!")
dss.error.DSSException


Oh no. Okay, we want just the bundle UUID, which is the first 36 characters of the fqid; the rest is the version.

In [42]:
client.download(results["results"][0]["bundle_fqid"][:36], replica="aws")

File assay.json: Retrieving...
File assay.json: GET SUCCEEDED. Writing to disk.
File assay.json: GET SUCCEEDED. Stored at 2d87b5f5-757d-40b2-a2a7-49d2df179cb9/assay.json.
File manifest.json: Retrieving...
File manifest.json: GET SUCCEEDED. Writing to disk.
File manifest.json: GET SUCCEEDED. Stored at 2d87b5f5-757d-40b2-a2a7-49d2df179cb9/manifest.json.
File project.json: Retrieving...
File project.json: GET SUCCEEDED. Writing to disk.
File project.json: GET SUCCEEDED. Stored at 2d87b5f5-757d-40b2-a2a7-49d2df179cb9/project.json.
File sample.json: Retrieving...
File sample.json: GET SUCCEEDED. Writing to disk.
File sample.json: GET SUCCEEDED. Stored at 2d87b5f5-757d-40b2-a2a7-49d2df179cb9/sample.json.
File async_copied_file: Retrieving...
File async_copied_file: GET SUCCEEDED. Writing to disk.
File async_copied_file: GET SUCCEEDED. Stored at 2d87b5f5-757d-40b2-a2a7-49d2df179cb9/async_copied_file.


{}
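
Slicing with `[:36]` works, but it hides why 36 is the right number. Judging from the search results above, the fqid looks like `<uuid>.<version>`, so splitting on the first dot is a bit more explicit. This is a sketch under that assumption; `split_fqid` is a hypothetical helper, not something the `hca` client provides.

```python
def split_fqid(fqid):
    """Split '<uuid>.<version>' into (uuid, version)."""
    # partition splits on the first dot only, so a dot inside the
    # version timestamp is left untouched.
    uuid, sep, version = fqid.partition(".")
    if not sep or len(uuid) != 36:
        raise ValueError(f"unexpected fqid format: {fqid!r}")
    return uuid, version


uuid, version = split_fqid(
    "2d87b5f5-757d-40b2-a2a7-49d2df179cb9.2017-11-21T185012.925568Z"
)
print(uuid)     # 2d87b5f5-757d-40b2-a2a7-49d2df179cb9
print(version)  # 2017-11-21T185012.925568Z
```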

Looks good, we've downloaded some files to `2d87b5f5-757d-40b2-a2a7-49d2df179cb9`. Let's see what's in there.

In [44]:
%%bash
ls -l 2d87b5f5-757d-40b2-a2a7-49d2df179cb9

total 65556
-rw-r--r-- 1 jovyan users     2019 Feb 22 22:35 assay.json
-rw-r--r-- 1 jovyan users 67108865 Feb 22 22:36 async_copied_file
-rw-r--r-- 1 jovyan users      555 Feb 22 22:35 manifest.json
-rw-r--r-- 1 jovyan users     1361 Feb 22 22:35 project.json
-rw-r--r-- 1 jovyan users      805 Feb 22 22:35 sample.json


Some JSON files with metadata, and a ~mystery~ `async_copied_file`. No data files, though. How do I get those? Maybe the answer is in one of these JSON files.

In [45]:
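import json
# Use a context manager so the file handle is closed after reading.
with open("2d87b5f5-757d-40b2-a2a7-49d2df179cb9/manifest.json") as f:
    manifest = json.load(f)
manifest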

{'dir': 'http://hgwdev.soe.ucsc.edu/~kent/hca/big_data_files',
 'files': [{'format': '.fastq.gz', 'name': 'pbmc8k_S1_L007_R1_001.fastq.gz'},
  {'format': '.fastq.gz', 'name': 'pbmc8k_S1_L007_R2_001.fastq.gz'},
  {'format': '.fastq.gz', 'name': 'pbmc8k_S1_L008_R1_001.fastq.gz'},
  {'format': '.fastq.gz', 'name': 'pbmc8k_S1_L008_R2_001.fastq.gz'}],
 'version': '1'}

Ah great, Jim Kent has the data. We can download it from him.
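
The manifest lists four files, so it can be handy to build all the URLs at once. A minimal sketch, assuming each URL is just the `dir` value joined with a file's `name`, which matches the URL used in the cell that follows. The dict is copied from the output above, and `manifest_urls` is a hypothetical helper.

```python
manifest = {
    "dir": "http://hgwdev.soe.ucsc.edu/~kent/hca/big_data_files",
    "files": [
        {"format": ".fastq.gz", "name": "pbmc8k_S1_L007_R1_001.fastq.gz"},
        {"format": ".fastq.gz", "name": "pbmc8k_S1_L007_R2_001.fastq.gz"},
        {"format": ".fastq.gz", "name": "pbmc8k_S1_L008_R1_001.fastq.gz"},
        {"format": ".fastq.gz", "name": "pbmc8k_S1_L008_R2_001.fastq.gz"},
    ],
    "version": "1",
}


def manifest_urls(manifest):
    """Return one URL per file listed in the manifest."""
    base = manifest["dir"].rstrip("/")  # avoid a double slash when joining
    return [f"{base}/{entry['name']}" for entry in manifest["files"]]


for url in manifest_urls(manifest):
    print(url)
```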

In [35]:
%%bash
wget http://hgwdev.soe.ucsc.edu/~kent/hca/big_data_files/pbmc8k_S1_L007_R1_001.fastq.gz

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
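
That message is most likely the notebook objecting to the amount of progress output `wget` writes during a multi-gigabyte download, not a failed download; the file does show up in the listing below. Passing `-q` to `wget` should quiet it. Another option is to download from Python without printing progress. A minimal sketch using only the standard library; `fetch` is a hypothetical helper and nothing in it is specific to the HCA.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url, dest_dir="."):
    """Stream url to a file in dest_dir without printing any progress."""
    # Name the local file after the last path component of the URL.
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    return dest


# Example (URL from the cell above; this is a 7.2G download):
# fetch("http://hgwdev.soe.ucsc.edu/~kent/hca/big_data_files/pbmc8k_S1_L007_R1_001.fastq.gz")
```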



In [47]:
%%bash
ls -lrth

total 7.2G
-rw-r--r-- 1 jovyan users 7.2G Feb  5  2017 pbmc8k_S1_L007_R1_001.fastq.gz
-rwxr-xr-x 1 jovyan users 2.9M May 21  2017 jq
drwsrwsr-x 1 jovyan users 4.0K Feb 22 21:39 work
drwxr-sr-x 2 jovyan users 4.0K Feb 22 22:12 2d87b5f5-757d-40b2-a2a7-49d2df179cb9
-rw-r--r-- 1 jovyan users  32K Feb 22 22:43 HCA Download.ipynb


Success!