# Explore SRA Download and BigQuery
SRA data is often accessed with [fasterq-dump](https://github.com/ncbi/sra-tools/wiki/HowTo%3A-fasterq-dump) from sra-tools, but that tool downloads only one accession at a time. To download data in batches, we use the ipyrad API and borrow language from an ipyrad cookbook example on downloading data. For more ipyrad example notebooks, see this [GitHub page](https://github.com/dereneaton/ipyrad/blob/master/newdocs/API-analysis/).

### Part 1, download SRA data and align to a reference genome

### Required software
Specify the number of CPUs on your VM

In [None]:
CPU=4

Install mamba

In [None]:
%%bash
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

Install dependencies

In [None]:
# Install the following, since this is a fresh VM.
# Mamba was installed to $HOME/mambaforge in the cell above; see the pangolin notebook for how to add it to your path.
# Conda is preinstalled and also works, but it is much slower and less reliable.
!$HOME/mambaforge/bin/mamba install -y -c bioconda -c conda-forge ipyrad sra-tools vcftools biopython

Import packages

In [None]:
import ipyrad.analysis as ipa
import os

### Fetch info for a published data set by its accession ID
You can find the study ID or individual sample IDs in published papers or by searching NCBI or related databases. ipyrad accepts one or more accession IDs for individual Runs or Studies (SRR or SRP, and similarly ERR or ERP, etc.).
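The accession prefixes mentioned above follow a simple pattern, which the sketch below decodes. The helper `describe_accession` is a hypothetical name introduced here for illustration and is not part of ipyrad; `ERP000001` is a made-up ID used only to show the ENA prefix.

```python
import re

# First letter: S = NCBI SRA, E = EBI ENA, D = DDBJ.
# Third letter: R = run, P = study, X = experiment, S = sample.
ACCESSION_PATTERN = re.compile(r"^([SED])R([RPXS])(\d+)$")

ARCHIVES = {"S": "NCBI SRA", "E": "EBI ENA", "D": "DDBJ"}
LEVELS = {"R": "run", "P": "study", "X": "experiment", "S": "sample"}


def describe_accession(accession):
    """Return (archive, level) for an accession ID, or None if unrecognized."""
    match = ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    archive_code, level_code, _digits = match.groups()
    return ARCHIVES[archive_code], LEVELS[level_code]


print(describe_accession("SRR14086881"))  # a run from this notebook
print(describe_accession("ERP000001"))    # made-up study ID, for illustration
```

A check like this can catch a pasted GenBank ID (for example `NC_045512`) before it is handed to a download tool.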


### Download the data
From an sratools object you can fetch just the run info, or you can download the files as well. Here we call `.run()` to download the data into a designated workdir. The `name_fields` argument controls how the files are named, using name fields from the fetch_runinfo table. The accessions argument here is a manually defined list of ten SRR run IDs.

In [None]:
#manually define list of SRA files
list_of_srrs = ['SRR14086881','SRR14086882','SRR14086883','SRR14086884','SRR14086885','SRR14086886','SRR14086887','SRR14086888','SRR14086889','SRR14086890']
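Because the ten run IDs above are consecutive, the same list can be built programmatically instead of typed by hand. This is a minimal sketch; `accession_range` is a hypothetical helper, and it assumes the numeric part of the IDs needs no zero padding.

```python
def accession_range(prefix, first, last):
    """Build consecutive accession IDs from first to last, inclusive."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


# Same ten runs as the hand-written list
list_of_srrs = accession_range("SRR", 14086881, 14086890)
print(list_of_srrs)
```

This only works when the runs of interest are numbered consecutively; for an arbitrary set of runs, a hand-written list or a list read from a file is still the simplest option.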

In [None]:
# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir="downloaded")

# call download (run) function
sra2.run(auto=True, name_fields=(1,30))

### Check the data files
You can see that the files were named according to the SRR and species name fields in the run info table. The intermediate .sra files were removed and only the fastq files were saved.


In [None]:
#you can also navigate in the menu on the left
! ls -l downloaded

### Align the data to a reference genome
First, download the SARS-CoV-2 reference.

In [None]:
from Bio import Entrez

# Download the reference with the Bio.Entrez toolkit from Biopython
# and save the sequences to a single FASTA file.
Entrez.email = "email@example.com"  # Always tell NCBI who you are
filename = "sarscov2_reference.fasta"
acc_nums = ['NC_045512']
# Open in write mode so that rerunning the cell does not append duplicate records.
with open(filename, "w") as out_handle:
    for acc in acc_nums:
        net_handle = Entrez.efetch(
            db="nucleotide", id=acc, rettype="fasta", retmode="text"
        )
        out_handle.write(net_handle.read())
        net_handle.close()
        print("Saved", acc)
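Before aligning, it can be worth confirming that the FASTA file holds what you expect. The sketch below is one way to do that with only the standard library; `parse_fasta` is a hypothetical helper written for this notebook, not a Biopython function.

```python
import os


def parse_fasta(lines):
    """Return a dict mapping each record ID to its full sequence."""
    chunks = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first word of the header line.
            header = line[1:].split()
            current_id = header[0] if header else ""
            if current_id in chunks:
                raise ValueError(f"duplicate record: {current_id}")
            chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)
    return {record_id: "".join(parts) for record_id, parts in chunks.items()}


fasta_path = "sarscov2_reference.fasta"  # file written by the download cell
if os.path.exists(fasta_path):
    with open(fasta_path) as handle:
        for record_id, sequence in parse_fasta(handle).items():
            print(record_id, len(sequence))
```

For the SARS-CoV-2 reference this should report a single record of about 29.9 kb. A `ValueError` about a duplicate record means the same sequence was written to the file more than once.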

Now run BWA. It will take just a second.

In [None]:
# Index the reference
!bwa index sarscov2_reference.fasta
# Run bwa mem; it writes SAM to stdout, so convert to BAM with samtools view
!bwa mem -t $CPU sarscov2_reference.fasta ./downloaded/SRR14086881_MSHSPSP-RNP263.fastq | samtools view -b -o sars.bam -

Check the quality of the alignment. What percentage of the input reads aligned to the reference?

In [None]:
!samtools flagstat sars.bam
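To answer the question above programmatically, you can pull the totals out of the flagstat text. This is a minimal sketch: `percent_mapped` is a hypothetical helper, and the example text is made up to resemble flagstat output rather than taken from this alignment.

```python
import re


def percent_mapped(flagstat_text):
    """Return the percentage of QC-passed records that mapped, or None."""
    total = re.search(r"^(\d+) \+ \d+ in total", flagstat_text, re.MULTILINE)
    mapped = re.search(r"^(\d+) \+ \d+ mapped \(", flagstat_text, re.MULTILINE)
    if total is None or mapped is None or int(total.group(1)) == 0:
        return None
    return 100.0 * int(mapped.group(1)) / int(total.group(1))


# Made-up flagstat excerpt, for illustration only.
example_output = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
"""
print(percent_mapped(example_output))
```

In a notebook, `lines = !samtools flagstat sars.bam` captures the real output as a list of lines, which can be joined with `"\n".join(lines)` and passed to the function. Note that the total counts alignment records, so secondary and supplementary alignments are included.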

That is it for Part 1! We downloaded SRA data, downloaded a reference from GenBank, and then aligned one of our fastq files to the reference!

### Part 2, Copy VCF to Storage Bucket and then Run a Query with BigQuery

Here we are going to shift gears and use Google BigQuery, a serverless data warehouse (you aren't responsible for managing VMs) for analyzing data using SQL. We could translate that to 'a giant database with very fast SQL query capabilities'. We will download a VCF file, create a bucket, copy our data to the bucket, and then query some example public variant datasets in BigQuery.

First we will download a VCF from this [Galaxy URL](https://usegalaxy.org/u/carlosfarkas/h/snpeffsars-cov-2), which comes from this [paper](https://www.frontiersin.org/articles/10.3389/fmicb.2021.665041/full).

In [None]:
%%bash
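# Quote the URL so the shell does not interpret the '?', and save under the name the next cell expects
curl -L "https://usegalaxy.org/datasets/bbd44e69cb8906b553c8fa023002fca0/display?to_ext=vcf" --output sra-covid.vcf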

In [None]:
%%bash
vcftools --vcf sra-covid.vcf --maf 0.005 --recode --out sra-covid-maf
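The `--maf 0.005` option in the vcftools call keeps only sites whose minor allele frequency is at least 0.005. The sketch below shows that calculation for a single biallelic site. It is an illustration under simplifying assumptions (biallelic sites, missing alleles written as `.`), not vcftools' own implementation, and `minor_allele_frequency` is a hypothetical helper.

```python
import re


def minor_allele_frequency(genotypes):
    """Minor allele frequency at a biallelic site, ignoring missing alleles."""
    counts = {"0": 0, "1": 0}
    for genotype in genotypes:
        # Genotypes look like "0/1" (unphased) or "0|1" (phased).
        for allele in re.split(r"[/|]", genotype):
            if allele in counts:
                counts[allele] += 1
    total = counts["0"] + counts["1"]
    if total == 0:
        return None  # every genotype at this site is missing
    return min(counts.values()) / total


# 199 homozygous-reference samples and one heterozygote: 1 alt allele in 400
rare_site = ["0/0"] * 199 + ["0/1"]
maf = minor_allele_frequency(rare_site)
print(maf, maf >= 0.005)
```

A site like `rare_site` would be dropped at this threshold, while a site with one alternate allele among 200 called alleles would just pass.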

### Create a cloud bucket
The data is currently stored on this notebook instance, which is fine for any analyses here. To use the data elsewhere on GCP, or to access it when the VM is shut down, we need to copy it to a Cloud Storage bucket. If you do not already have a bucket, you can create one with the gsutil command line tool or in the [console](https://cloud.google.com/storage/docs/quickstart-console).

In [None]:
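# Put your own project and a bucket name of your choice here. The bucket name must be globally unique.
# Plain Python variables are used because `!export` runs in a throwaway subshell and does not persist.
# The cells below spell out the bucket name, so edit it there as well if you change it.
PROJECT = 'us-gcp-ame-adv-c01-npd-1'
BUCKET = 'vcf-to-bq2'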

You can do a lot with the gsutil mb command, read more [here](https://cloud.google.com/storage/docs/gsutil/commands/mb).
Note that if you end up with permission issues further down, you may need to skip these cells and create the bucket using the UI. You can also try going to IAM and adding the necessary permissions to the service account.

In [None]:
!gsutil mb -c regional -l us-east4 gs://vcf-to-bq2

In [None]:
!gsutil ls gs://vcf-to-bq2/

Now let's copy the VCF file to the bucket using `gsutil cp`. If you are moving lots of files, add the `-m` flag to copy in parallel, e.g. `gsutil -m cp /ex_dir/* gs://vcf-to-bq2/`, or copy a directory recursively with `gsutil -m cp -r /ex_dir/ gs://vcf-to-bq2/`.

In [None]:
!gsutil cp sra-covid-maf.recode.vcf gs://vcf-to-bq2/

In [None]:
#look at our bucket to make sure the files are organized as expected. You can also just look at the bucket in the UI
!gsutil ls gs://vcf-to-bq2

### Explore VCF queries in BigQuery
The rest of the notebook could be run in the BigQuery UI in the console. Here we query BigQuery datasets via the API, but pasting a block of SQL code into the BigQuery editor window would have the same effect.

In [None]:
# Import the BigQuery API
from google.cloud import bigquery
import pandas

In [None]:
# Designate the client for the API
client = bigquery.Client(location="US")
print("Client created using default project: {}".format(client.project))

Let's explore some cancer genomic data from the [ISB Cancer Gateway in the Cloud](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/progapi/bigqueryGUI/GettingStartedWithGoogleBigQuery.html). That link gives a good tutorial on using BigQuery through the UI. You can also use the sample notebooks in the left-hand navigation menu, which walk through using BigQuery from a notebook in addition to what we do here.

In [None]:
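# Let's see which mutation types are observed most frequently across tumor samples, by gene.
# Note that this query covers all genes; to restrict it to KRAS, add WHERE Hugo_Symbol = 'KRAS' inside temp1.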
query = """
WITH temp1 AS (
   SELECT
     sample_barcode_tumor,
     Hugo_Symbol,
     Variant_Classification,
     Variant_Type,
     SIFT,
     PolyPhen
   FROM `isb-cgc-bq.TCGA_versioned.somatic_mutation_hg38_gdc_r10`
   GROUP BY
     sample_barcode_tumor,
     Hugo_Symbol,
     Variant_Classification,
     Variant_Type,
     SIFT,
     PolyPhen)
SELECT
  COUNT(*) AS num,
  Hugo_Symbol,
  Variant_Classification,
  Variant_Type,
  SIFT,
  PolyPhen
FROM temp1
GROUP BY
  Hugo_Symbol,
  Variant_Classification,
  Variant_Type,
  SIFT,
  PolyPhen
ORDER BY num DESC
"""
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

df = query_job.to_dataframe()
df

Now let's follow these steps to look at the [DeepVariant platinum genomes calls](https://cloud.google.com/life-sciences/docs/tutorials/analyze-variants-advanced).

In [None]:
#first look at total calls per sample
query = """
#standardSQL
SELECT
  call.name AS call_name,
  COUNT(call.name) AS call_count_for_call_set
FROM
  `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
GROUP BY
  call_name
ORDER BY
  call_name"""
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

df = query_job.to_dataframe()
df

People don't actually have ~30 million variants. The counts are this high because the table also stores non-variant (reference-matching) segments, so let's filter to true variants.

In [None]:
query = """
#standardSQL
SELECT
  call.name AS call_name,
  COUNT(call.name) AS call_count_for_call_set
FROM
  `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
  EXISTS (SELECT 1
            FROM UNNEST(v.alternate_bases) AS alt
          WHERE
            alt.alt NOT IN ("<NON_REF>", "<*>"))
GROUP BY
  call_name
ORDER BY
  call_name
"""
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

df = query_job.to_dataframe()
df
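The `EXISTS ... UNNEST` filter in the query above can be read as "keep a row if at least one of its alternate alleles is a real base change". The sketch below mirrors that logic in plain Python on a few made-up rows; the row layout is a simplified assumption for illustration, not the full table schema.

```python
# Placeholder alleles that mark non-variant segments
NON_VARIANT_MARKERS = {"<NON_REF>", "<*>"}


def is_true_variant(row):
    """True if any alternate allele is something other than a placeholder."""
    return any(
        alt["alt"] not in NON_VARIANT_MARKERS for alt in row["alternate_bases"]
    )


# Made-up rows, for illustration only
rows = [
    {"start_position": 100, "alternate_bases": [{"alt": "<*>"}]},
    {"start_position": 200, "alternate_bases": [{"alt": "T"}]},
    {"start_position": 300, "alternate_bases": [{"alt": "G"}, {"alt": "<*>"}]},
    {"start_position": 400, "alternate_bases": []},
]
kept = [row["start_position"] for row in rows if is_true_variant(row)]
print(kept)
```

As with `EXISTS` in SQL, a row with an empty `alternate_bases` array is dropped, because there is nothing in it that could satisfy the condition.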

Now let's keep only calls with at least one alternate allele (genotype > 0) and remove the no-calls (genotype < 0). The number of variants starts to look more reasonable.

In [None]:
query = """
#standardSQL
SELECT
  call.name AS call_name,
  COUNT(call.name) AS call_count_for_call_set
FROM
  `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
  EXISTS (SELECT 1 FROM UNNEST(call.genotype) AS gt WHERE gt > 0)
  AND NOT EXISTS (SELECT 1 FROM UNNEST(call.genotype) AS gt WHERE gt < 0)
GROUP BY
  call_name
ORDER BY
  call_name
"""
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

df = query_job.to_dataframe()
df
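The two genotype conditions in the query above are easy to misread, so here is the same logic in plain Python. This is a sketch on made-up genotypes, with `is_called_variant` as a hypothetical helper; it assumes, as the query does, that 0 is the reference allele, positive values are alternate alleles, and negative values are no-calls.

```python
def is_called_variant(genotype):
    """True if the call carries an alternate allele and has no no-call allele."""
    has_alternate = any(allele > 0 for allele in genotype)  # first EXISTS
    has_no_call = any(allele < 0 for allele in genotype)    # NOT EXISTS
    return has_alternate and not has_no_call


# Made-up genotypes, for illustration only
examples = {
    "het": [0, 1],
    "hom_alt": [1, 1],
    "hom_ref": [0, 0],
    "half_call": [-1, 1],
    "no_call": [-1, -1],
}
for label, genotype in examples.items():
    print(label, is_called_variant(genotype))
```

Only the heterozygous and homozygous-alternate calls pass, which is why the per-sample counts drop so sharply once this filter is applied.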

And that's it! Before you move on to the next thing, try running a few queries in the BigQuery editor within the UI. You can also create a new dataset under your project and explore the public datasets available there.