# Download SmartSeq2 Expression Matrix as an Input to Scanpy

Suppose I want a SmartSeq2 expression matrix that I can analyze with scanpy. How can I find one using the DSS API?

In [16]:
import hca.dss
client = hca.dss.DSSClient()

Well, first things first: We're going to need to search for a `.results` file. This file should contain what we need to put together an expression matrix.

Here's our Elasticsearch query.

In [17]:
query = {
    "query": {
      "bool": {
        "must": [
          {
            "match": {
              "files.file_json.files.content.file_core.file_format": "results" # It needs to have a .results file...
            }
          },
          {
            "range": {
              "manifest.version": {
                "gte": "2018-07-12T100000.000000Z" # ...and preferably not be too old, either.
              }
            }
          }
        ]
      }
    }
  }
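
If you expect to rerun this search with a different file format or cutoff date, the query can be built by a small function. The sketch below is only an illustration: `build_results_query` and its parameter names are hypothetical (they are not part of the DSS client), and with its defaults it simply reproduces the dictionary above.

```python
def build_results_query(file_format="results",
                        min_version="2018-07-12T100000.000000Z"):
    """Build the search query shown above (hypothetical helper, not a DSS API)."""
    return {
        "query": {
            "bool": {
                "must": [
                    # The bundle must contain a file of the requested format...
                    {"match": {
                        "files.file_json.files.content.file_core.file_format": file_format
                    }},
                    # ...and its manifest version must not be older than the cutoff.
                    {"range": {"manifest.version": {"gte": min_version}}},
                ]
            }
        }
    }


# With the defaults, this is the same query as the literal dictionary above.
query = build_results_query()
```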

We can pass this query to the `post_search()` method to find a recent analysis bundle containing a `.results` file. However, **we won't find any analysis bundles under _production_**, where we're currently looking. We need to look under _integration_ instead. To do this, we can follow a few short steps:

> 1) Go to `~/.config/hca/` on your local machine and edit the `config.json` file there.

> 2) Change the path `https://dss.data.humancellatlas.org/v1/swagger.json` to `https://dss.integration.data.humancellatlas.org/v1/swagger.json`.

> 3) Restart your computer.

After completing these steps, we can execute our search with no problems.

* _Reminder: Don't forget to change this path back to its original value if you want to search through production again!_
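
If you switch back and forth often, the edit in step 2 can be scripted. The following is a minimal sketch, not part of the `hca` package: `swap_swagger_url` is a hypothetical helper, and it assumes the URL appears verbatim in `config.json`. It works on the file as plain text, so it makes no assumption about which key holds the URL, and it leaves a `.bak` copy next to the file.

```python
from pathlib import Path

PRODUCTION_URL = "https://dss.data.humancellatlas.org/v1/swagger.json"
INTEGRATION_URL = "https://dss.integration.data.humancellatlas.org/v1/swagger.json"


def swap_swagger_url(config_path, old_url, new_url):
    """Replace old_url with new_url in the config file.

    Returns True if the file was changed, False if old_url was not found.
    """
    path = Path(config_path).expanduser()
    text = path.read_text()
    if old_url not in text:
        return False
    # Keep a backup of the original file before overwriting it.
    path.with_name(path.name + ".bak").write_text(text)
    path.write_text(text.replace(old_url, new_url))
    return True


# Example (not run here): switch to integration, and later back to production.
# swap_swagger_url("~/.config/hca/config.json", PRODUCTION_URL, INTEGRATION_URL)
# swap_swagger_url("~/.config/hca/config.json", INTEGRATION_URL, PRODUCTION_URL)
```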

In [54]:
import json

# Print a recent analysis bundle with a results file.
# Note that to save space, I'm only printing the 'file_json' portion of the bundle.

bundle = client.post_search(es_query=query, replica='aws', output_format='raw')
print(json.dumps(bundle['results'][0]['metadata']['files']['file_json'], indent=4, sort_keys=True))

{
    "describedBy": "https://schema.humancellatlas.org/bundle/1.0.0/file",
    "files": [
        {
            "content": {
                "describedBy": "https://schema.humancellatlas.org/type/file/6.1.1/sequence_file",
                "file_core": {
                    "file_format": "fastq.gz",
                    "file_name": "R1.fastq.gz"
                },
                "lane_index": 1,
                "read_index": "read1",
                "schema_type": "file"
            },
            "hca_ingest": {
                "document_id": "c7a4af9b-feb1-4efb-9cbb-7e53fee2e086",
                "submissionDate": "2018-07-18T16:55:58.974Z"
            }
        },
        {
            "content": {
                "describedBy": "https://schema.humancellatlas.org/type/file/6.1.1/sequence_file",
                "file_core": {
                    "file_format": "fastq.gz",
                    "file_name": "R2.fastq.gz"
                },
                "lane_index": 1,
            

Okay! It looks like we've found a file uuid we can use: `330f62be-05a5-4c64-9b49-d884c3babeb6`. Let's save it locally.

In [19]:
# Get the results file

file = client.get_file(replica='aws', uuid='330f62be-05a5-4c64-9b49-d884c3babeb6')
with open('matrix.results', 'w') as outfile:
    outfile.write(file.decode("utf-8"))
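
The file above was picked out of the `file_json` listing by eye. A minimal sketch for doing that lookup in code follows, assuming documents shaped like the output printed earlier (`files` → `content` → `file_core`). `find_files_by_format` is a hypothetical helper, and `example.results` is a made-up file name used only for the demonstration. Note that it returns file names only; it does not look up the uuid that `get_file()` needs.

```python
def find_files_by_format(file_json, file_format="results"):
    """Return the file_name of every entry whose file_core.file_format matches."""
    names = []
    for entry in file_json.get("files", []):
        core = entry.get("content", {}).get("file_core", {})
        if core.get("file_format") == file_format:
            names.append(core.get("file_name"))
    return names


# Tiny stand-in for the 'file_json' document printed earlier (names are illustrative).
example = {
    "files": [
        {"content": {"file_core": {"file_format": "fastq.gz", "file_name": "R1.fastq.gz"}}},
        {"content": {"file_core": {"file_format": "results", "file_name": "example.results"}}},
    ]
}
print(find_files_by_format(example))
```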
    

Here's what our file, `matrix.results`, looks like. I've truncated the output so it doesn't take up too much room.

In [55]:
with open('matrix.results', 'r') as m:
    print(m.read()[:591])
    print('...')

gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM	posterior_mean_count	posterior_standard_deviation_of_count	pme_TPM	pme_FPKM
ENSG00000000003.14	ENST00000373020.8,ENST00000494424.1,ENST00000496771.5,ENST00000612152.4,ENST00000614008.4	1749.40	1541.80	0.00	0.00	0.00	0.00	0.00	2.02	23.44
ENSG00000000005.5	ENST00000373031.4,ENST00000485971.1	940.50	732.90	0.00	0.00	0.00	0.00	0.00	1.54	17.88
ENSG00000000419.12	ENST00000371582.8,ENST00000371584.8,ENST00000371588.9,ENST00000413082.1,ENST00000466152.5,ENST00000494752.1	977.83	770.24	0.00	0.00	0.00	0.00	0.00	3.32	38.56

...


For our matrix, however, we might only want _some_ of these values. In my case, suppose I only want the `gene_id` and `TPM` values. We can extract these values easily using Python's `csv` module.

In [21]:
import csv

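# Extract gene_id and TPM into a two-column, tab-separated file.
# (Despite the .mtx extension, this is plain TSV, not Matrix Market format.)
# newline='' on both files lets the csv module handle line endings itself.

with open('matrix.results', newline='') as infile, open('matrix.mtx', 'w', newline='') as outfile: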
    reader = csv.DictReader(infile, delimiter='\t')
    writer = csv.DictWriter(outfile, delimiter='\t', fieldnames=['gene_id', 'TPM'])
    for row in reader:
        writer.writerow({'gene_id': row['gene_id'], 'TPM': row['TPM']})
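
If you later want a different value, say `expected_count` instead of `TPM`, the same idea can be wrapped in a function. This is a sketch under my own naming (`extract_columns` is hypothetical); unlike the cell above, it writes plain `\n` line endings and raises an error if a requested column is missing from the header.

```python
import csv


def extract_columns(src_path, dst_path, columns=("gene_id", "TPM")):
    """Copy the chosen tab-separated columns to dst_path, without a header row.

    Returns the number of data rows written.
    """
    with open(src_path, newline="") as infile, \
            open(dst_path, "w", newline="") as outfile:
        reader = csv.DictReader(infile, delimiter="\t")
        # Fail early if the header lacks any requested column.
        missing = [c for c in columns if c not in (reader.fieldnames or [])]
        if missing:
            raise KeyError("columns not found in {}: {}".format(src_path, missing))
        writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
        rows = 0
        for row in reader:
            writer.writerow([row[c] for c in columns])
            rows += 1
    return rows


# Example (assumes matrix.results from the earlier cell exists):
# extract_columns("matrix.results", "counts.tsv", columns=("gene_id", "expected_count"))
```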


Our new file, `matrix.mtx`, looks something like this:

In [28]:
with open('matrix.mtx', 'r') as m:
    print(m.read()[:98])
    print('...')

ENSG00000000003.14	0.00

ENSG00000000005.5	0.00

ENSG00000000419.12	0.00

ENSG00000000457.13	0.00

...


Now that we have a file containing exactly what we want, we can read it into scanpy.

In [23]:
import scanpy as sc  # older scanpy releases exposed this as `scanpy.api`

adata = sc.read_csv(filename='matrix.mtx', delimiter='\t')

But how do we know that everything worked? Let's see what our AnnData object looks like.

In [66]:
values = adata.X.ravel()  # flatten, in case X is stored as an (n_genes, 1) array
for i in range(757):
    print('{:<6}'.format('{:.1f}'.format(values[i])), end='' if i % 18 != 0 else '\n')
print('...')

0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
0.0   0.0   

And just to make it easier to see the relevant data in our matrix...

In [25]:
# Print every nonzero TPM value, including the last entry.
for value in adata.X.ravel():
    if value != 0:
        print(value, end=' ')

309.74 2289.43 1079.51 33.39 71.25 306.59 55.48 262.12 233.33 359.17 579.74 875.4 1571.24 7501.46 114659.92 347.55 76.15 1258.66 42.63 170.67 1384.1 2636.91 8210.16 23152.72 40069.76 15311.09 51841.59 988.92 8740.85 1786.69 13603.32 33274.64 314.7 1665.7 6726.94 12717.9 46112.46 10746.5 2619.5 637.16 21969.11 17856.29 794.83 23246.24 2372.67 78258.91 26619.13 109.81 8582.35 1111.28 1170.91 6920.93 161.06 250.58 171.18 126.3 20162.46 118.0 6422.18 774.58 9770.59 9643.0 3571.98 36718.99 45083.26 1618.13 18952.62 24666.38 1357.03 1071.54 694.25 652.7 1615.82 386.99 5256.2 867.7 17468.06 199.05 841.76 99981.39 15283.49 127.55 858.33 220.5 1144.79 1475.82 9094.49 6.09 35948.45 183.31 12083.88 1119.97 561.17 766.69 147.05 1427.83 3711.39 305.53 279.72 2907.42 105.23 
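
The bare numbers above don't say which genes they belong to. As a rough cross-check that doesn't depend on scanpy, the nonzero values can be paired with their gene IDs straight from the two-column file. This is a sketch: `nonzero_genes` is a hypothetical helper, and it assumes the header-less, tab-separated layout of `matrix.mtx` written earlier.

```python
import csv


def nonzero_genes(path):
    """Return (gene_id, TPM) pairs with a nonzero TPM from a two-column TSV."""
    pairs = []
    with open(path, newline="") as infile:
        for row in csv.reader(infile, delimiter="\t"):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            gene_id, tpm = row[0], float(row[1])
            if tpm != 0:
                pairs.append((gene_id, tpm))
    return pairs


# Example (assumes matrix.mtx from the earlier cell exists):
# top = sorted(nonzero_genes("matrix.mtx"), key=lambda p: p[1], reverse=True)[:10]
# for gene_id, tpm in top:
#     print(gene_id, tpm)
```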

Now that the matrix has been set up in scanpy, you're free to perform whatever analysis piques your interest.