# Download SmartSeq2 Expression Matrix as an Input to Scanpy

In this notebook, we demonstrate how to obtain a Smartseq2 expression matrix programmatically using the API for the Human Cell Atlas's Data Store (DSS).

Start by importing the HCA's Python API and creating a DSS API client:

In [1]:
import hca.dss
client = hca.dss.DSSClient()

In order to find an expresison matrix to analyze with Scanpy, we need to search for bundles containing a `*.results` file. This file will contain the information needed to assemble the expression matrix.

The DSS uses ElasticSearch to index all of the data, so we must assemble an ElasticSearch query. We want to set two conditions with our query: first, we need the bundle to contain a file named `*.results`; and second, we are interested in more recent expression matrices, so we exclude results older than 7/12/2018. The query looks like this:

In [2]:
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        "manifest.files.name": {
                            # We need a *.results file...
                            "value": "*.results"
                        }
                    }
                },
                {
                    "range": {
                        "manifest.version": {
                            # ...and preferably not too old, either.
                            "gte": "2018-07-12T100000.000000Z"
                        }
                    }
                }
            ]
        }
    }
}

While this query is a bit hard to decipher, we can think about it this way:

* the **query** looks for bundles matching a **boolean** condition: that the two checks in the query **must** be true. 

* The **wildcard** check looks for the `manifest.files.name` field, which contains the bundle manifest, and returns true if the name of a file listed in the bundle manifest ends with `.results`.

* The **range** check returns true if the bundle's `manifest.version` has a value greater than or equal to 7/12/2018. (It is important to note that the DSS uses timestamps for version numbers, which makes it possible to implement the date restriction.)

In short, __this query will find bundles that contain a `.results` file and are newer than July 12, 2018.__

Next, we execute the search with `post_search()`:

In [3]:
# Use output_format='raw' to inculde metadata with search results
bundles = client.post_search(es_query=query, replica='aws', output_format='raw')

The metadata for search results are stored in a list under the `results` key. From there we can access information about the results, including the files list. We can use that to get the name of the `*.results` file we are after, and then download that file from the DSS:

In [4]:
the_first_bundle = bundles['results'][0]
bundle_files = the_first_bundle['metadata']['manifest']['files']

# Print part of a recent analysis bundle with a results file
for f in bundle_files:
    if f['name'].endswith('.results'):
        results_file_uuid, results_file_version = f['uuid'], f['version']
print(results_file_uuid, results_file_version)

ec727ae1-d47a-47a3-8c8e-b42d7a0e8cf4 2019-05-18T173116.989870Z


The UUID of the file uniquely identifies this file in the entire DSS, so we can pass it to the `get_file()` API call to download the file from the data store. We use the API to get the data and write it to a file:

In [5]:
results = client.get_file(replica='aws', uuid=results_file_uuid, version=results_file_version)
open('matrix.results', 'w').write(results.decode("utf-8"))

18865025

Here are the first few lines of the `matrix.results` file we have now obtained:

In [6]:
print(open('matrix.results', 'r').read()[:852])

transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct	posterior_mean_count	posterior_standard_deviation_of_count	pme_TPM	pme_FPKM	IsoPct_from_pme_TPM
ENST00000373020.8	ENSG00000000003.14	2206	2016.44	0.00	0.00	0.00	0.00	0.00	0.00	0.12	0.21	9.99
ENST00000494424.1	ENSG00000000003.14	820	630.44	0.00	0.00	0.00	0.00	0.00	0.00	0.38	0.68	31.95
ENST00000496771.5	ENSG00000000003.14	1025	835.44	0.00	0.00	0.00	0.00	0.00	0.00	0.28	0.51	24.11
ENST00000612152.4	ENSG00000000003.14	3796	3606.44	0.00	0.00	0.00	0.00	0.00	0.00	0.07	0.12	5.59
ENST00000614008.4	ENSG00000000003.14	900	710.44	0.00	0.00	0.00	0.00	0.00	0.00	0.33	0.60	28.36
ENST00000373031.4	ENSG00000000005.5	1339	1149.44	0.00	0.00	0.00	0.00	0.00	0.00	0.21	0.37	23.47
ENST00000485971.1	ENSG00000000005.5	542	352.44	0.00	0.00	0.00	0.00	0.00	0.00	0.67	1.22	76.53
ENST00000371582.8	


Suppose we are only interested in some of these fields, for example the `gene_id`, `TPM`, and `FPKM` columns. We can use Python's `csv` module to read the data from the results file, filter it, and write it to a new file.

We can open a reader and a writer at the same time, and trim columns on each row as they are read in from `matrix.results`.

In [7]:
import csv

# Take the data we want out of the results file and store it into a tsv file
with open('matrix.results', 'r') as infile, open('matrix.tsv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile, delimiter='\t')
    writer = csv.DictWriter(outfile, fieldnames=['gene_id', 'TPM', 'FPKM'], delimiter='\t')
    writer.writeheader()
    for row in reader:
        writer.writerow({'gene_id': row['gene_id'], 'TPM': row['TPM'], 'FPKM': row['FPKM']})

Our new file, `matrix.tsv`, looks something like this:

In [8]:
print("".join(open('matrix.tsv', 'r').readlines()[:8]))

gene_id	TPM	FPKM
ENSG00000000003.14	0.00	0.00
ENSG00000000003.14	0.00	0.00
ENSG00000000003.14	0.00	0.00
ENSG00000000003.14	0.00	0.00
ENSG00000000003.14	0.00	0.00
ENSG00000000005.5	0.00	0.00
ENSG00000000005.5	0.00	0.00



Now that we have a file containing the expression matrix, we can transpose it and read it into scanpy.

In [9]:
import scanpy as sc

adata = sc.read_csv(filename='matrix.tsv', delimiter='\t').transpose()
print(adata)

Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.


AnnData object with n_obs × n_vars = 2 × 200468 


We can use the Scanpy AnnData object to get information, such as particlar observations or the names of all variables:

In [10]:
print(adata.obs_vector(0))
print(list(adata.var_names)[15])

[0. 0.]
ENSG00000000457.13


We can also access the raw data using the `adata.X` attribute:

In [11]:
print("Var 1:")
for i in range(0, 153):
    print('{:<6}'.format('{:.1f}'.format(adata.X[0][i])), end='' if (i + 1) % 17 != 0 else '\n' )

Var 1:
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   6.5   0.0   0.0   0.0   0.0   0.0   167.1 
0.0   3.2   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   


And just to make it easier to see the relevant data in our matrix...

In [12]:
for i in range(len(adata.X[0])-1):
    if adata.X[0][i] != 0:
        print(adata.X[0][i], end=' ')

6.53 167.05 3.25 2.52 3.04 1.0 7.05 6.08 12.84 10.88 3.11 10.06 8.9 12.26 9.91 1.38 11.71 69.56 89.88 1.45 0.53 10.06 64.21 0.85 39.42 21.5 2.81 21.39 2.46 1.37 0.57 41.7 1.05 6.21 3.64 5.22 11.23 48.57 0.22 6.72 8.65 2.24 0.57 0.86 1.27 1.17 2.48 0.91 9.6 168.78 7.53 434.51 279.39 3.62 1.31 2.09 1.75 130.58 1.08 0.66 7.79 5.89 13.77 5.16 3.22 2.04 0.4 1.67 3.34 2.07 2.22 0.39 11.2 4.03 0.65 2.68 1.93 2.8 36.79 1.35 30.25 2.69 11.25 0.51 22.75 1.92 0.86 0.52 49.27 634.34 650.86 149.68 0.67 24.23 1.02 5.38 10.23 0.38 4.57 11.69 23.09 4.82 4.11 8.05 0.63 10.47 5.71 1.76 1.7 1.3 8.48 0.8 11.97 2.45 11.12 1.39 3.75 31.57 0.46 3.08 2.78 1.91 9.58 11.81 14.14 548.93 1033.31 2.02 11.71 1.48 6.2 1.87 2.8 1.62 2.94 4.99 37.99 0.92 8.03 3.22 65.9 6.83 47.61 2.16 0.44 15.74 6.07 1.0 0.42 2.15 0.88 2.23 5.08 1.11 4.08 4.42 2.3 1.11 19.06 0.62 4.89 1.44 9.8 1.55 1.64 6.58 0.81 1.24 2.85 2.9 17.77 25.15 0.49 4.73 0.04 0.78 2.04 2.26 2.43 0.44 1.34 24.36 2.94 6.62 6.65 21.71 20.38 4.72 0.83 10.44 5.5

 198166.31 21.75 7.51 2541.42 4.84 605.37 8388.25 12.2 8.62 25.43 0.3 53.82 8.07 96.63 2.31 1.0 50.13 3.79 4.32 0.85 7.29 1.09 15.12 2.1 12.59 6.91 28.22 1.16 3.59 4.42 7.28 188.85 11.32 2.35 1.92 12.41 10.32 56.05 1.34 76.52 1.89 7.95 6.69 15.35 0.67 1.33 2.59 11.55 5.41 2.39 1.53 4.73 22.64 22.42 26.06 46.36 1.58 1.01 3.02 3.83 2.42 0.43 19.56 3.99 6.86 210.51 2.34 1.72 282.56 1.46 2.59 1.53 4.04 12.69 7.99 27.66 52.37 11.24 1.98 2.68 0.33 1.28 8.25 0.93 21.96 4.28 68.75 7.5 4.3 4.35 1.06 5.53 0.62 2.13 6.49 3.25 3.13 3.6 29231.99 35.75 63.12 2.09 9.78 1.85 0.97 0.68 11.41 0.5 2.07 5.11 6.98 1.17 6.65 2.12 672.91 3.56 4.59 20.12 7.28 1.04 2.34 8.16 19.8 1.48 3.13 3.23 2.03 9.9 19.69 2.8 1.79 0.44 29.76 2.32 1723.78 13.12 9.57 9.99 3.1 12.6 4.16 7.2 6.64 6.96 0.71 14.46 19.83 2.8 0.77 46.54 1.23 8.01 1.34 5.35 12.36 5.23 10.67 0.77 2.51 2.67 14.03 69.29 0.59 4.42 321.17 5.75 4.28 3.59 1.52 4.93 46.54 2.96 10.76 1.61 12.55 38.64 5.97 2.36 1.2 35.04 16.69 11.44 2.71 4.17 63.7 1.73 4.88 

We can now download more results files from other bundles by re-using our query results:

In [13]:
the_second_bundle = bundles['results'][1]
bundle_files = the_second_bundle['metadata']['manifest']['files']
for f in bundle_files:
    if f['name'].endswith('.results'):
        results_file_uuid, results_file_version = f['uuid'], f['version']

With the new results file UUID, we can get the `.results` file itself using a `get_file()` API call:

In [14]:
results2 = client.get_file(replica='aws', uuid=results_file_uuid, version=results_file_version)
open('matrix2.results', 'w').write(results2.decode("utf-8"))

18899409

As before, we extract data from the results file, filter it, and write it to a new file using a csv writer and a csv reader:

In [15]:
with open('matrix2.results', 'r') as infile, open('matrix2.tsv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile, delimiter='\t')
    writer = csv.DictWriter(outfile, fieldnames=['gene_id', 'TPM', 'FPKM'], delimiter='\t')
    for row in reader:
        writer.writerow({'gene_id': row['gene_id'], 'TPM': row['TPM'], 'FPKM': row['FPKM']})

Now create a new Scanpy AnnData object:

In [16]:
adata2 = sc.read_csv( filename='matrix.tsv', delimiter='\t' ).transpose()
print(adata2)

Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.


AnnData object with n_obs × n_vars = 2 × 200468 


As before, we can access the data using the AnnData object's built-in methods or by using the `adata2.X` attribute:

In [17]:
for i in range(0, 153):
    print( '{:<6}'.format('{:.1f}'.format(adata2.X[0][i])), end='' if (i + 1) % 17 != 0 else '\n' )

0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   6.5   0.0   0.0   0.0   0.0   0.0   167.1 
0.0   3.2   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   


Now we have two matrices in Scanpy, aand we are ready to perform our analysis of choice. This concludes the download tutorial.