# NBIS support project 6668

Shotgun metagenomic sequencing - Three-generations microbiome study

**Description**

This project aims to study the effects of dibutyl phthalate (a plastic-derived contaminant) on the fecal microbiome composition and functional profile in three generations of mice (F0 "exposed mice", and F1 and F2 "offspring"). Shotgun metagenomic sequencing of the DNA samples was done at NGI, Stockholm. The project has been discussed with John Sundh previously.

Specific goals include:

- To profile the taxonomic composition at species level, assess community diversity and compare the relative abundances of taxa between phthalate-treated mice and untreated mice (controls)
- Deep functional characterization of the phthalate-treated and control mice, focusing on antibiotic resistance genes, virulence factors, carbohydrate metabolism, functional redundancy, etc.
- To perform correlation analysis with other findings observed in F0, F1 and F2 mice, including immune, metabolic and liver phenotypes


## Analysis

```python
import pandas as pd
```

### Raw data

Raw data is stored on Uppmax at:

`/proj/snic2020-5-486/dbp_gut_microbiome/DataDelivery_2023-01-10_15-17-23_ngisthlm00104/files/P27457/`

### Sample file

A total of 108 samples were divided into 10 assembly groups:

```python
sample_df = pd.read_csv("../data/sample_list.csv", header=0)
sample_df.sort_values("group", inplace=True)
sample_df.to_csv("../data/sample_list.csv", index=False)
group_df = pd.DataFrame(sample_df.groupby("group").size(), columns=["samples"]).sort_index()
group_df
```

| group | samples |
|-------|---------|
| F0_C  | 12      |
| F0_H  | 12      |
| F0_L  | 10      |
| F1_C  | 8       |
| F1_H  | 8       |
| F1_L  | 6       |
| F2_C  | 16      |
| F2_H  | 16      |
| F2_L  | 17      |
| mock  | 3       |


```python
#for group in sample_df["group"].unique():
    #_df = sample_df.copy()
    # Create one file with only samples from this group (for assembly)
    #group_only = sample_df.loc[sample_df["group"]==group]
    # Create another file where all samples appear to be assigned to this group
    #_df["group"] = [group]*_df.shape[0]
    #group_only.to_csv(f"../data/sample_list.{group}-assemble.csv", sep=",", index=False)
    #_df.to_csv(f"../data/sample_list.{group}-binning.csv", sep=",", index=False)
```
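To make the difference between the two per-group files concrete, here is a minimal sketch of the same split on a toy table. Only the `group` column comes from the notebook; the `sample` column, the sample names and the helper `split_for_group` are made up for illustration.

```python
import pandas as pd

def split_for_group(sample_df: pd.DataFrame, group: str):
    """Return (assemble_df, binning_df) for one assembly group."""
    # Only the samples that belong to this group (for assembly)
    assemble_df = sample_df.loc[sample_df["group"] == group].copy()
    # All samples, relabelled so they appear to belong to this group (for binning)
    binning_df = sample_df.copy()
    binning_df["group"] = group
    return assemble_df, binning_df

# Toy data: hypothetical sample names, two of the real group labels
toy = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "group": ["F0_C", "F0_C", "mock", "mock"],
})
assemble_df, binning_df = split_for_group(toy, "mock")
print(assemble_df)
print(binning_df)
```

The assembly table keeps only the two 'mock' rows, while the binning table keeps all four rows with every `group` value set to 'mock'.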

### Setting up nf-core/mag

I created a conda environment specifically for nf-core/mag using:

```bash
export CONDARC="/proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/.condarc"
mamba env create -f mag-env.yml -p envs/mag
mamba activate envs/mag
# Copy scripts to set and unset environment variables
mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d $CONDA_PREFIX/etc/conda/activate.d
cp src/activate.d/env_vars.sh $CONDA_PREFIX/etc/conda/activate.d/
cp src/deactivate.d/env_vars.sh $CONDA_PREFIX/etc/conda/deactivate.d/
# Re-activate the environment
mamba activate envs/mag
```

#### Troubleshooting

There are some issues with the nf-core/mag pipeline that become apparent when it is used on a large dataset with several co-assemblies. Using the setup specified here, with 108 samples divided into 10 assembly groups and `--binning_map_mode all`, results in 108 x 10 = 1080 bam files. With each bam file ~3-4 G in size, that comes to 3.2-4.2 TB of data.

If assemblies could be created and binned one at a time, the temporary bam files could be cleaned up after each group.
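The storage arithmetic can be sketched as below, using the group sizes from the sample table and the rough 3-4 G per bam file quoted above. The helper `bam_storage_g` is only an illustration, and the "one at a time" figure assumes that the bam files of a finished group are deleted before the next group starts.

```python
# Group sizes as listed in the sample table
group_sizes = {
    "F0_C": 12, "F0_H": 12, "F0_L": 10,
    "F1_C": 8, "F1_H": 8, "F1_L": 6,
    "F2_C": 16, "F2_H": 16, "F2_L": 17,
    "mock": 3,
}
BAM_SIZE_G = (3, 4)  # approximate size range of one bam file, in G

n_samples = sum(group_sizes.values())
n_groups = len(group_sizes)

def bam_storage_g(n_bams, size_range=BAM_SIZE_G):
    """Return (low, high) storage in G needed for n_bams bam files."""
    return tuple(n_bams * size for size in size_range)

# All groups at once: every sample is mapped to every co-assembly
all_at_once = bam_storage_g(n_samples * n_groups)
# One group at a time: only one set of per-sample bam files exists at any point
one_at_a_time = bam_storage_g(n_samples)

print(f"{n_samples} samples x {n_groups} groups = {n_samples * n_groups} bam files")
print(f"all at once:   {all_at_once[0]}-{all_at_once[1]} G")
print(f"one at a time: {one_at_a_time[0]}-{one_at_a_time[1]} G")
```

This gives 3240-4320 G for all groups at once (roughly 3.2-4.2 T when dividing by 1024) against 324-432 G when only one group's bam files exist at a time.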

##### Strategy 1

First I tried to run the full pipeline as a node job after a tip from Phil Ewels. In theory that would allow the pipeline to use local node storage for the work directory, after which only the finished results would be copied to the project folder.

However, this resulted in failed runs with no apparent output in either the Nextflow log or the Slurm logs. There are also obvious caveats to this approach, as the work directory cannot be saved in intermediate states if the pipeline fails (which in my experience **always** happens at least once).

##### Strategy 2

Instead I tried to use group-specific input files, with one sample list per assembly group. For example, the 'mock' sample list would contain all samples, but only the samples in the 'mock' group would have a group assignment. The plan was to run the pipeline 10 times, once for each sample list.

However, it appears the pipeline also tries to create an assembly for the unassigned group, which obviously fails.
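A minimal sketch of what such a group-specific list could look like, again on toy data. The `sample` column, the sample names and the helper `single_group_list` are hypothetical, and leaving the group empty is an assumption about how "unassigned" was encoded.

```python
import pandas as pd

def single_group_list(sample_df: pd.DataFrame, group: str) -> pd.DataFrame:
    """Keep every sample, but blank the group for samples outside `group`."""
    out = sample_df.copy()
    # Series.where keeps values where the condition holds and sets the rest to missing
    out["group"] = out["group"].where(out["group"] == group)
    return out

# Toy data: hypothetical sample names
toy = pd.DataFrame({
    "sample": ["s1", "s2", "s3"],
    "group": ["F0_C", "mock", "mock"],
})
mock_list = single_group_list(toy, "mock")
print(mock_list.to_csv(index=False))
```

The rows with an empty group are the ones the pipeline then seems to treat as a group of their own.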

##### Strategy 3

The next strategy is to first generate the co-assemblies using a params file where `--skip_binning` is set to true. That would run only the QC + Assembly steps, which would require only ~350 G of storage for the QCd reads plus a few additional G for each assembly.

Then, using another params file where all samples are assigned to the already assembled group and `--binning_map_mode all` is specified, maybe the pipeline would use the assembly as-is but map all samples to it?

For example, to run assembly for the 'mock' group:

```bash
nextflow run -c conf/custom.config -params-file conf/mag.assembly.yml nf-core/mag -r 2.3.0 -resume -profile uppmax --project snic2022-5-350 --input data/sample_list.mock-assemble.csv
```

Then to run binning for the same group:

```bash
nextflow run -c conf/custom.config -params-file conf/mag.binning.yml nf-core/mag -r 2.3.0 -resume -profile uppmax --project snic2022-5-350 --input data/sample_list.mock-binning.csv
```
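Repeating this pair of commands for all 10 groups means 20 near-identical command lines. As a sketch, they could be generated rather than typed; every path and option below is copied from the two commands above, the group names come from the sample table, and the helper `mag_command` is hypothetical. The snippet only prints the commands and does not run anything.

```python
import shlex

def mag_command(group: str, step: str) -> str:
    """Build the nf-core/mag command line for one group and one step."""
    if step not in ("assembly", "binning"):
        raise ValueError(f"unknown step: {step}")
    # File naming follows sample_list.{group}-assemble.csv / sample_list.{group}-binning.csv
    suffix = "assemble" if step == "assembly" else "binning"
    args = [
        "nextflow", "run",
        "-c", "conf/custom.config",
        "-params-file", f"conf/mag.{step}.yml",
        "nf-core/mag", "-r", "2.3.0", "-resume",
        "-profile", "uppmax",
        "--project", "snic2022-5-350",
        "--input", f"data/sample_list.{group}-{suffix}.csv",
    ]
    return shlex.join(args)

groups = ["F0_C", "F0_H", "F0_L", "F1_C", "F1_H", "F1_L", "F2_C", "F2_H", "F2_L", "mock"]
commands = [mag_command(g, step) for g in groups for step in ("assembly", "binning")]
print("\n".join(commands))
```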

Update: This doesn't work either because the assembly is re-generated with all the samples.

### Setting up ATLAS

As an alternative I tried [ATLAS](https://github.com/metagenome-atlas/atlas/).

ATLAS is a workflow written in Snakemake that performs QC, assembly, binning and functional annotation of contigs using gene catalogs (genes are clustered with mmseqs and annotated with eggNOG).

To install I ran:

```bash 
export CONDARC="/proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/.condarc"
mamba env create -f atlas-env.yml -p envs/atlas
```
with the following conda environment file:

```yaml
channels:
  - conda-forge
  - bioconda
dependencies:
  - metagenome-atlas=2.15.0
  - pandas
  - cookiecutter
```

As a starting point I copied the `template_config.yaml` file from the atlas package into `conf/atlas-config.yml`:

```bash
cp $CONDA_PREFIX/lib/python3.10/site-packages/atlas/workflow/config/template_config.yaml conf/atlas-config.yml
```
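The path above hardcodes `python3.10`, so it would stop working if the environment were rebuilt with another Python version. One possible way around that, sketched below, is to ask Python where the package lives. The helper `package_file` is hypothetical, and the sketch assumes that `atlas` is a regular package (with an `__init__.py`) laid out as in the path above. It is demonstrated on the standard-library `json` package so that it runs anywhere.

```python
from importlib.util import find_spec
from pathlib import Path

def package_file(package: str, relative_path: str) -> Path:
    """Return the path of a file inside an installed package's directory."""
    spec = find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package not found: {package}")
    # For a regular package, spec.origin points at its __init__.py
    return Path(spec.origin).parent / relative_path

# Demonstration with a standard-library package
print(package_file("json", "decoder.py"))

# Intended use inside the activated atlas environment (untested assumption):
# import shutil
# template = package_file("atlas", "workflow/config/template_config.yaml")
# shutil.copy(template, "conf/atlas-config.yml")
```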

A sample list was created with:

```bash
python src/make_sample_list.py atlas > data/sample_list_atlas.tsv
```

Set up output dir:

```bash
mkdir atlas
cd atlas
ln -s ../data/sample_list_atlas.tsv samples.tsv
```

#### Setting up cluster execution

To use the cluster profile for ATLAS I ran:
```bash
cookiecutter --output-dir ~/.config/snakemake https://github.com/metagenome-atlas/clusterprofile.git
```

and used defaults when prompted.

I then updated the `~/.config/snakemake/cluster/cluster_config.yaml` file to be:

```yaml
__default__:
  #queue: normal
  account: "snic2022-22-722" # <- testing first with NBIS account

rulename:
  queue: long
  account: ""
  time_min:  # min
  threads:
```

#### Atlas QC

```bash
atlas run -w /proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/atlas -c /proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/conf/atlas-config.yml --profile cluster -n qc
```

To test with the mock samples, I changed `samples.tsv` to contain only the `m.c`, `m.c-2` and `m.c-3` samples.
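Restricting the sample table to the mock samples can be sketched with pandas as follows. The toy table stands in for the real `samples.tsv`: it assumes that the first column holds the sample names, and the `group` column and the non-mock sample names are made up. The helper `keep_samples` is hypothetical.

```python
import io
import pandas as pd

MOCK_SAMPLES = ["m.c", "m.c-2", "m.c-3"]

def keep_samples(samples: pd.DataFrame, names) -> pd.DataFrame:
    """Return only the rows whose index (sample name) is in `names`."""
    missing = sorted(set(names) - set(samples.index))
    if missing:
        raise KeyError(f"samples not in table: {missing}")
    return samples.loc[samples.index.isin(names)]

# Toy stand-in for samples.tsv (tab-separated, sample names in the first column)
toy_tsv = "\tgroup\nm.c\tmock\nm.c-2\tmock\nm.c-3\tmock\nx1\tF0_C\nx2\tF0_C\n"
samples = pd.read_csv(io.StringIO(toy_tsv), sep="\t", index_col=0)
mock_only = keep_samples(samples, MOCK_SAMPLES)
print(mock_only)
```

Because `samples.tsv` is a symlink to `../data/sample_list_atlas.tsv` in this setup, writing the filtered table back to that path would overwrite the full sample list. Writing it to a separate file and pointing the symlink at that file avoids this.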