## Project Environment Setup

We have downloaded metagenome data from NCBI to use for this project.

There's no need to do this yourselves, but take note of this approach for future work with metagenomes from NCBI's Sequence Read Archive (SRA).

## Software Environment Setup

The software needed for this tutorial is listed in the conda environment config file ./coverm_env.yml

It is pre-installed on the JupyterHub. To activate it, type the following in the terminal:

```
conda activate /mnt/storage/envs/coverm/
```

If you haven't done so already, select the 'biopy' kernel in the upper right-hand corner of the screen.

## Data Setup

For this lesson, we are going to use the SAGs that we will be working with all week, **AG-910**, and examine their abundance over time at BATS using metagenomic read recruitment.

The metagenomes we will use are a small subset of the metagenomes reported in [Biller et al., 2018](https://www.nature.com/articles/sdata2018176). These metagenomes are available through NCBI project ID [PRJNA385855](https://www.ncbi.nlm.nih.gov/bioproject?term=PRJNA385855). I've downloaded a metadata sheet covering all SRA metagenomes from this BioProject to: /mnt/storage/data/metagenomes/PRJNA385855_sra_metadata.csv
(If you do not know how to download metadata tables from NCBI, I will be happy to demonstrate; please let me know.)

Let's check this table out, and I'll show you which metagenomes I selected.

In [3]:
import pandas as pd

df = pd.read_csv("/mnt/storage/data/metagenomes/PRJNA385855_sra_metadata.csv", sep = ",")

df.head()

Unnamed: 0,Run,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,BioSampleModel,bottle_id,Bytes,Center Name,...,lat_lon,Library Name,LibraryLayout,LibrarySelection,LibrarySource,Organism,Platform,ReleaseDate,Sample Name,SRA Study
0,SRR6507277,WGS,300,16133582100,PRJNA385855,SAMN08390922,"MIMS.me,MIGS/MIMS/MIMARKS.water",2140200308,6618578156,MIT,...,22.75 N 158 W,S0627,PAIRED,RANDOM,METAGENOMIC,marine metagenome,ILLUMINA,2018-05-01T00:00:00Z,S0627,SRP109831
1,SRR6507278,WGS,300,15874959000,PRJNA385855,SAMN08390923,"MIMS.me,MIGS/MIMS/MIMARKS.water",2160200304,6562862443,MIT,...,22.75 N 158 W,S0628,PAIRED,RANDOM,METAGENOMIC,marine metagenome,ILLUMINA,2018-05-01T00:00:00Z,S0628,SRP109831
2,SRR6507279,WGS,300,15069825300,PRJNA385855,SAMN08390924,"MIMS.me,MIGS/MIMS/MIMARKS.water",1024800503,6265839401,MIT,...,31.66 N 64.16 W,S0629,PAIRED,RANDOM,METAGENOMIC,marine metagenome,ILLUMINA,2018-05-01T00:00:00Z,S0629,SRP109831
3,SRR6507280,WGS,300,25807308000,PRJNA385855,SAMN08390925,"MIMS.me,MIGS/MIMS/MIMARKS.water",1025200510,10523504402,MIT,...,31.66 N 64.16 W,S0630,PAIRED,RANDOM,METAGENOMIC,marine metagenome,ILLUMINA,2018-05-01T00:00:00Z,S0630,SRP109831
4,SRR5720219,WGS,300,6713331000,PRJNA385855,SAMN07137016,"MIMS.me,MIGS/MIMS/MIMARKS.water",1640201117,2811014041,MIT,...,22.75 N 158 W,S0519,PAIRED,RANDOM,METAGENOMIC,marine metagenome,ILLUMINA,2018-05-01T00:00:00Z,S0519,SRP109831


Let's check all of the column headers:

In [4]:
df.columns

Index(['Run', 'Assay Type', 'AvgSpotLen', 'Bases', 'BioProject', 'BioSample',
       'BioSampleModel', 'bottle_id', 'Bytes', 'Center Name',
       'Collection_date', 'Consent', 'cruise_id', 'DATASTORE filetype',
       'DATASTORE provider', 'DATASTORE region', 'Depth', 'env_biome',
       'env_feature', 'env_material', 'Experiment', 'geo_loc_name_country',
       'geo_loc_name_country_continent', 'geo_loc_name', 'Instrument',
       'isolation-source', 'lat_lon', 'Library Name', 'LibraryLayout',
       'LibrarySelection', 'LibrarySource', 'Organism', 'Platform',
       'ReleaseDate', 'Sample Name', 'SRA Study'],
      dtype='object')

This datasheet is useful for more than metagenome selection: it also records sample details, such as collection date, cruise and depth, that we can use in downstream analyses.

Let's get a handle on the metagenomes described within this dataframe:

In [5]:
df[['Run','Assay Type','Collection_date','cruise_id','Depth']].head(20)

Unnamed: 0,Run,Assay Type,Collection_date,cruise_id,Depth
0,SRR6507277,WGS,2009-08-19,HOT214,5m
1,SRR6507278,WGS,2009-11-04,HOT216,100m
2,SRR6507279,WGS,2009-07-14,BATS248,10m
3,SRR6507280,WGS,2009-11-07,BATS252,100m
4,SRR5720219,WGS,2004-10-10,HOT164,5m
5,SRR5720221,WGS,2004-08-15,HOT162,175m
6,SRR5720223,WGS,2004-09-28,HOT163,115m
7,SRR5720224,WGS,2004-09-28,HOT163,175m
8,SRR5720226,WGS,2004-06-15,HOT160,175m
9,SRR5720228,WGS,2004-08-15,HOT162,100m
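
As one example of preparing these columns for downstream analysis, the sketch below converts the text dates and depth labels into proper date and numeric types. It uses three rows copied from the preview above rather than the full table. The new column names `collection_dt` and `depth_m` are my own, and the depth conversion assumes every value ends in 'm'.

```python
import pandas as pd

# Three rows copied from the preview above; the real table has many more columns.
meta = pd.DataFrame({
    "Run": ["SRR6507277", "SRR6507278", "SRR6507279"],
    "Collection_date": ["2009-08-19", "2009-11-04", "2009-07-14"],
    "cruise_id": ["HOT214", "HOT216", "BATS248"],
    "Depth": ["5m", "100m", "10m"],
})

# Parse the date strings so they sort and plot chronologically.
meta["collection_dt"] = pd.to_datetime(meta["Collection_date"])

# Strip the trailing unit and convert depth to a number
# (assumption: every depth label ends in 'm').
meta["depth_m"] = meta["Depth"].str.rstrip("m").astype(float)

print(meta[["Run", "collection_dt", "depth_m"]].sort_values("collection_dt"))
```

With numeric depths, a filter such as `meta["depth_m"] <= 10` becomes possible, which is harder to express on the original text labels.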


As you might have noticed, the code names 'HOT' and 'BATS' appear in the cruise_id column. These are two time-series stations. HOT is in the Pacific, near Hawaii, and BATS is in the Atlantic, near Bermuda.

Our SAG plate is from BATS248. For this small exercise, we will examine the abundance of AG-910 SAGs at the surface at BATS within this metagenome collection, so let's find just those metagenomes collected at BATS from either 1m or 10m depth.

In [8]:
# dataframe of metagenomes of interest: BATS cruises sampled at 1m or 10m,
# sorted by collection date
mgoi = (
    df[df['cruise_id'].str.contains('BATS') & df['Depth'].isin(['10m', '1m'])]
    [['Run', 'Collection_date', 'cruise_id', 'BioSample', 'Depth']]
    .sort_values(by='Collection_date')
)

# going to save this table to file
mgoi.to_csv("./bats_metagenomes_of_interest.csv", index=False)

print("There are", len(mgoi), "metagenomes that we are interested in downloading")
mgoi

There are 21 metagenomes that we are interested in downloading


Unnamed: 0,Run,Collection_date,cruise_id,BioSample,Depth
74,SRR5720233,2003-02-21,BATS173,SAMN07137079,1m
14,SRR5720238,2003-03-22,BATS174,SAMN07137082,1m
119,SRR5720327,2003-04-22,BATS175,SAMN07137064,10m
99,SRR5720283,2003-05-20,BATS176,SAMN07137103,1m
75,SRR5720235,2003-07-15,BATS178,SAMN07137085,10m
38,SRR5720286,2003-08-12,BATS179,SAMN07137088,10m
64,SRR5720332,2003-10-07,BATS181,SAMN07137067,1m
96,SRR5720276,2003-11-04,BATS182,SAMN07137106,1m
90,SRR5720262,2003-12-02,BATS183,SAMN07137109,1m
124,SRR5720338,2004-01-27,BATS184,SAMN07137070,1m
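
To see how the two conditions in the filter above combine, here is a small sketch on four rows taken from the earlier preview. Each condition produces a True/False column (a boolean mask), and `&` keeps only the rows where both are True. The names `toy`, `is_bats` and `is_surface` are mine and are used for illustration only.

```python
import pandas as pd

# Four rows copied from the preview of the metadata table.
toy = pd.DataFrame({
    "Run": ["SRR6507277", "SRR6507278", "SRR6507279", "SRR6507280"],
    "cruise_id": ["HOT214", "HOT216", "BATS248", "BATS252"],
    "Depth": ["5m", "100m", "10m", "100m"],
})

# One boolean mask per condition.
is_bats = toy["cruise_id"].str.contains("BATS")
is_surface = toy["Depth"].isin(["10m", "1m"])

# Show the masks side by side to see which rows pass each test.
print(pd.DataFrame({"Run": toy["Run"], "is_bats": is_bats, "is_surface": is_surface}))

# Keep only rows where both masks are True.
selected = toy[is_bats & is_surface]
print(selected["Run"].tolist())  # → ['SRR6507279']
```

Only SRR6507279 passes both tests: SRR6507280 is a BATS sample but was collected at 100m.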


For ease of access, I'm also going to write a small text file containing just the run IDs of the metagenomes we are interested in, saved in the directory where the files will be downloaded:

In [9]:
with open('/mnt/storage/data/metagenomes/subsampled_metagenomes/metagenomes_to_download.txt', 'w') as oh:
    for run in mgoi['Run']:
        print(run, file = oh)

Next, I'm going to download these metagenomes from NCBI using the fastq-dump tool from the sra-tools package. I'll set two options: skip technical reads, and download only 1,000,000 reads from each metagenome.

Let's check which metagenomes were downloaded, using the Python module 'glob', which finds files using wildcards and returns them as a list.
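
The exact download command isn't shown in this notebook, so the following is only a sketch of what it might look like. It assumes the reads were written as split, gzipped files (which would match the `_1.fastq.gz` file names used later). The helper `build_fastq_dump_cmd` and the `RUN_DOWNLOAD` switch are hypothetical names of mine. Note that `-X` keeps the first N spots of a run rather than drawing a random sample.

```python
import subprocess

def build_fastq_dump_cmd(run_id, outdir, max_reads=1_000_000):
    """Assemble a fastq-dump command for one SRA run (hypothetical helper)."""
    return [
        "fastq-dump",
        "--skip-technical",    # skip technical reads
        "-X", str(max_reads),  # stop after this many spots (the first N, not a random sample)
        "--split-files",       # assumption: paired reads go to <run>_1 and <run>_2 files
        "--gzip",              # assumption: output is compressed to .fastq.gz
        "--outdir", outdir,
        run_id,
    ]

outdir = "/mnt/storage/data/metagenomes/subsampled_metagenomes"
cmd = build_fastq_dump_cmd("SRR5720233", outdir)
print(" ".join(cmd))

# Set to True only where sra-tools is installed and a download is really wanted.
RUN_DOWNLOAD = False
if RUN_DOWNLOAD:
    subprocess.run(cmd, check=True)
```

To cover every metagenome of interest, the same helper could be called once per line of metagenomes_to_download.txt.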

In [13]:
from glob import glob

mgs = glob('/mnt/storage/data/metagenomes/subsampled_metagenomes/*_1.fastq.gz')

mgs

['/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720233_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720307_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720338_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720332_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720257_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720322_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR6507279_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720262_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720238_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720260_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720321_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720286_1.fastq.gz',
 '/mnt/storage/data/metagenomes/subsampled_metagenom

Now I'm going to use what's called a 'list comprehension' to extract the run IDs from this list of files:

In [14]:
runids = [i.split("/")[-1].split("_")[0] for i in mgs]
len(runids)

21
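
If list comprehensions are new to you, the sketch below shows the same extraction written three ways on two of the paths listed above: as an explicit loop, as the list comprehension, and with `pathlib` from the standard library. The variable names are mine; all three give the same result.

```python
from pathlib import Path

# Two of the file paths returned by glob above.
files = [
    "/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720233_1.fastq.gz",
    "/mnt/storage/data/metagenomes/subsampled_metagenomes/SRR5720307_1.fastq.gz",
]

# 1. Explicit loop: take the file name, then the part before the first underscore.
loop_ids = []
for path in files:
    filename = path.split("/")[-1]   # e.g. 'SRR5720233_1.fastq.gz'
    run_id = filename.split("_")[0]  # e.g. 'SRR5720233'
    loop_ids.append(run_id)

# 2. The same logic as a one-line list comprehension.
comp_ids = [path.split("/")[-1].split("_")[0] for path in files]

# 3. Using pathlib to get the file name instead of splitting on '/'.
pathlib_ids = [Path(path).name.split("_")[0] for path in files]

print(loop_ids)  # → ['SRR5720233', 'SRR5720307']
```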

Next, let's flag which of the metagenomes of interest are present in the download directory:

In [15]:
mgoi.loc[~mgoi['Run'].isin(runids),'downloaded'] = 'no'
mgoi['downloaded'] = mgoi['downloaded'].fillna('yes')
mgoi

Unnamed: 0,Run,Collection_date,cruise_id,BioSample,Depth,downloaded
74,SRR5720233,2003-02-21,BATS173,SAMN07137079,1m,yes
14,SRR5720238,2003-03-22,BATS174,SAMN07137082,1m,yes
119,SRR5720327,2003-04-22,BATS175,SAMN07137064,10m,yes
99,SRR5720283,2003-05-20,BATS176,SAMN07137103,1m,yes
75,SRR5720235,2003-07-15,BATS178,SAMN07137085,10m,yes
38,SRR5720286,2003-08-12,BATS179,SAMN07137088,10m,yes
64,SRR5720332,2003-10-07,BATS181,SAMN07137067,1m,yes
96,SRR5720276,2003-11-04,BATS182,SAMN07137106,1m,yes
90,SRR5720262,2003-12-02,BATS183,SAMN07137109,1m,yes
124,SRR5720338,2004-01-27,BATS184,SAMN07137070,1m,yes
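
The two-step flagging above (mark the missing runs 'no', then fill the rest with 'yes') can also be done in one step. This is just an alternative sketch on a toy table, where `toy` and `found` are hypothetical and one run is deliberately left out of `found` to show the 'no' case:

```python
import pandas as pd

# Toy table with three run IDs from the table above.
toy = pd.DataFrame({"Run": ["SRR5720233", "SRR5720238", "SRR5720327"]})

# Hypothetical list of downloaded runs, with SRR5720238 left out on purpose.
found = ["SRR5720233", "SRR5720327"]

# isin() gives True/False per row; map() turns that into 'yes'/'no' labels.
toy["downloaded"] = toy["Run"].isin(found).map({True: "yes", False: "no"})

print(toy)
```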


In [17]:
mgoi.to_csv("/mnt/storage/data/metagenomes/bats_metagenomes_of_interest.csv", index=False)