## Project Environment Setup

We have downloaded metagenome data from NCBI to use for this project.

There's no need to do this yourselves, but take note of this approach for future examination of metagenomes from NCBI's Sequence Read Archive (SRA).

## Software Environment Setup

The software needed for this tutorial is listed in the conda environment config file ./coverm_env.yml

It is pre-installed on the JupyterHub. To activate it in the terminal, type:

```
conda activate coverm
```

To use it in a Jupyter notebook, select the 'coverm' kernel on the upper right-hand side of the screen.

## Data Setup

For this lesson, we are going to use the SAGs that we have been working with all week, AG-910, to examine their abundance over time at BATS using metagenomic read recruitment.  

The metagenomes we will use are a small subset of the metagenomes reported in [this](https://www.nature.com/articles/sdata2018176) publication. These metagenomes are available through NCBI BioProject ID [PRJNA385855](https://www.ncbi.nlm.nih.gov/bioproject?term=PRJNA385855). I've downloaded a metadata sheet with all SRA metagenomes from this BioProject to: ./data/PRJNA385855_sra_metadata.csv

Let's check this table out, and I'll show you which metagenomes I selected.

In [None]:
import pandas as pd

df = pd.read_csv("./data/PRJNA385855_sra_metadata.csv", sep = ",")

df.columns

This datasheet is useful for more than metagenome selection: it also holds information that we can use in downstream analyses.

Let's get a handle on the metagenomes described within this dataframe:

In [None]:
df[['Run','Assay Type','Collection_date','cruise_id','Depth']].head(20)
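
Another way to get a handle on the table is to count metagenomes per station and depth. The sketch below assumes the station name is the leading run of letters in cruise_id, and it uses a toy dataframe with invented rows (the `RUN_*` ids and cruise labels are placeholders, not values from the real sheet):

```python
import pandas as pd

# Toy rows standing in for the real metadata sheet (all values are invented).
toy = pd.DataFrame({
    "Run": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"],
    "cruise_id": ["BATS1", "BATS1", "HOT2", "HOT2"],
    "Depth": ["1m", "10m", "5m", "1m"],
})

# Assumption: the station label is the leading letters of cruise_id.
station = toy["cruise_id"].str.extract(r"^([A-Za-z]+)", expand=False)

# Count metagenomes for every station/depth combination.
summary = pd.crosstab(station, toy["Depth"])
print(summary)
```

On the real table, the same two lines applied to `df` would show at a glance which depths were sampled at each station.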

I see lots of 'HOT' and 'BATS' in the cruise_id column. Those are two time-series stations. HOT (Hawaii Ocean Time-series) is in the Pacific, near Hawaii, and BATS (Bermuda Atlantic Time-series Study) is in the Atlantic, near Bermuda.

Our SAG plate is from BATS248. For this small project, we are interested in the abundance of our SAGs in surface waters at BATS within this metagenome collection, so let's find just those metagenomes collected at BATS from either 1m or 10m depth.

In [None]:
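# dataframe of metagenomes of interest: BATS cruises sampled at 1m or 10m
is_bats = df['cruise_id'].str.contains('BATS', na=False)  # na=False: rows with a missing cruise_id count as non-matches
is_surface = df['Depth'].isin(['10m', '1m'])
cols = ['Run', 'Collection_date', 'cruise_id', 'BioSample', 'Depth']
mgoi = df[is_bats & is_surface][cols].sort_values(by='Collection_date')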

# going to save this table to file
mgoi.to_csv("./data/bats_metagenomes_of_interest.csv", index=False)

print("There are", len(mgoi), "metagenomes that we are interested in downloading")
mgoi
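
A note on the sort: if Collection_date is stored as text, `sort_values` orders it alphabetically, which is only chronological for ISO-style dates (YYYY-MM-DD). I haven't shown the date format of the real sheet here, so the following is just a sketch on invented rows of how one might convert the column to real dates before sorting:

```python
import pandas as pd

# Toy rows with invented run ids and dates.
toy = pd.DataFrame({
    "Run": ["RUN_A", "RUN_B", "RUN_C"],
    "Collection_date": ["2004-10-05", "2003-02-21", "2003-11-14"],
})

# Convert text to datetimes; unparseable entries become NaT instead of raising.
toy["Collection_date"] = pd.to_datetime(toy["Collection_date"], errors="coerce")

# Sorting a datetime column is chronological regardless of the original text format.
toy_sorted = toy.sort_values(by="Collection_date").reset_index(drop=True)
print(toy_sorted)
```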

Just for ease of access, I'm going to write another small text file, containing just the run IDs of the metagenomes we are interested in, into the directory where we will download the files:

In [None]:
with open('/mnt/storage/data/metagenomes/subsampled_metagenomes/metagenomes_to_download.txt', 'w') as oh:
    for run in mgoi['Run']:
        print(run, file = oh)
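
The same one-id-per-line file can also be written by pandas directly. This is a minimal sketch using a toy column and a temporary directory (both stand-ins, so nothing on the shared storage is touched):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for mgoi['Run']; the ids are placeholders.
toy = pd.DataFrame({"Run": ["RUN_A", "RUN_B"]})

# Write to a temporary directory rather than the real download directory.
out_path = os.path.join(tempfile.mkdtemp(), "metagenomes_to_download.txt")

# header=False and index=False leave just one run id per line.
toy["Run"].to_csv(out_path, index=False, header=False)

# Read the file back to confirm its contents.
with open(out_path) as ih:
    written = ih.read().splitlines()
print(written)
```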

Next, I'm going to download these metagenomes from NCBI using the fastq-dump tool from the sra-tools package. I'll skip technical reads and download only the first 1,000,000 spots (read pairs, for paired-end runs) from each metagenome.

To download these metagenomes, I opened a terminal window and ran:

```
conda activate coverm

cd /mnt/storage/data/metagenomes/subsampled_metagenomes/

while read p; do
  fastq-dump --split-files --skip-technical -N 0 -X 1000000 --gzip --readids "$p"
done < metagenomes_to_download.txt
```
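
For reference, here is how the same command could be assembled from Python. This is a sketch, not what I actually ran: `fastq_dump_command` is a hypothetical helper, it only builds and prints the command (a dry run), and the flags are copied from the loop above. As I understand them, `--split-files` writes each read of a spot to its own file (hence the `_1.fastq.gz` names), `-N` and `-X` set the minimum and maximum spot id, `--gzip` compresses the output, and `--readids` appends the read number to each sequence name.

```python
import shlex

def fastq_dump_command(run_id, max_spots=1_000_000):
    """Build the fastq-dump argument list used above (hypothetical helper)."""
    return [
        "fastq-dump",
        "--split-files",      # one output file per read in a spot
        "--skip-technical",   # drop technical reads
        "-N", "0",            # minimum spot id
        "-X", str(max_spots), # maximum spot id
        "--gzip",             # compress output
        "--readids",          # append read number to sequence names
        run_id,
    ]

# Dry run: print the command instead of executing it.
for run_id in ["SRR5720276"]:
    cmd = fastq_dump_command(run_id)
    print(shlex.join(cmd))
```

To actually execute it, one could pass `cmd` to `subprocess.run(cmd)` and inspect the return code, which would make failed downloads easier to record.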

Let's check which metagenomes were downloaded, using the Python module 'glob', which finds files matching a wildcard pattern and returns them as a list.

In [None]:
from glob import glob

mgs = glob('/mnt/storage/data/metagenomes/subsampled_metagenomes/*_1.fastq.gz')

Now I'm going to use what's called a 'list comprehension' to extract the run IDs from this list of files:

In [None]:
runids = [i.split("/")[-1].split("_")[0] for i in mgs]
len(runids)
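
If list comprehensions are new to you, the sketch below spells out the same extraction as an ordinary for loop and checks that both give the same answer. The paths are made-up placeholders, and `os.path.basename` is used in the loop as an alternative to splitting on "/":

```python
import os

# Placeholder paths shaped like the downloaded files (not real run ids).
paths = [
    "/some/dir/SRR0000001_1.fastq.gz",
    "/some/dir/SRR0000002_1.fastq.gz",
]

# Long form: take the file name, then keep the text before the first underscore.
runids_loop = []
for path in paths:
    filename = os.path.basename(path)          # e.g. "SRR0000001_1.fastq.gz"
    runids_loop.append(filename.split("_")[0])  # e.g. "SRR0000001"

# Short form: the list comprehension used above.
runids_comp = [p.split("/")[-1].split("_")[0] for p in paths]

print(runids_loop)
print(runids_loop == runids_comp)
```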

Hmmmm... two SRA runs are missing. I wonder which ones those are.

In [None]:
mgoi.loc[~mgoi['Run'].isin(runids),'downloaded'] = 'no'
mgoi['downloaded'] = mgoi['downloaded'].fillna('yes')
mgoi
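
Two other ways to get at the same answer, sketched on toy data with placeholder ids: a set difference lists the missing runs directly, and `numpy.where` fills the 'downloaded' column in a single step.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for mgoi and runids (placeholder ids).
toy = pd.DataFrame({"Run": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"]})
downloaded_ids = ["RUN_A", "RUN_C"]

# Set difference: runs we asked for minus runs we found on disk.
missing = sorted(set(toy["Run"]) - set(downloaded_ids))
print(missing)

# One-step labelling: 'yes' where the run was found, 'no' otherwise.
toy["downloaded"] = np.where(toy["Run"].isin(downloaded_ids), "yes", "no")
print(toy)
```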

In [None]:
mgoi.to_csv("./data/bats_metagenomes_of_interest.csv", index=False)

The error I see in the terminal says:

```
2022-03-23T19:51:04 fastq-dump.2.8.0 err: no error - error with http open 'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-11/SRR5720276/SRR5720276.1'
2022-03-23T19:51:04 fastq-dump.2.8.0 err: item not found while constructing within virtual database module - the path 'SRR5720276' cannot be opened as database or table
```
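
If the terminal output had been saved, the failed accessions could be pulled out of it with a regular expression. The sketch below runs on the two lines quoted above, which name only one accession; the pattern is my assumption about what these messages look like, based on this excerpt alone:

```python
import re

# The two error lines quoted above, pasted verbatim.
log_text = """\
2022-03-23T19:51:04 fastq-dump.2.8.0 err: no error - error with http open 'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-11/SRR5720276/SRR5720276.1'
2022-03-23T19:51:04 fastq-dump.2.8.0 err: item not found while constructing within virtual database module - the path 'SRR5720276' cannot be opened as database or table
"""

# Assumed pattern: the accession appears in quotes before "cannot be opened".
pattern = r"the path '([SED]RR\d+)' cannot be opened"

# Deduplicate and sort so each failed run is listed once.
failed = sorted(set(re.findall(pattern, log_text)))
print(failed)
```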

For whatever reason, these two metagenomes can't be downloaded. I'm OK with that. For the sake of this lesson, we'll have enough information with the metagenomes we recovered to look into SAG abundance at BATS over time.