### Necessary files for running Snakemake

The necessary files for running Snakemake are:
- [Snakefile](Snakemake/Snakefile.py)

**Snakefile** is the core code: the file where the individual steps are defined and chained together.
- [Cluster file](Snakemake/cluster.json)

Snakemake can generate the necessary jobs and submit them to the cluster, but it needs some guidance: general settings as well as specific settings for each job. The **cluster.json** file serves this purpose. Running time, number of cores, memory, job name and log location are the parameters set in this JSON file.
- [Config file](Snakemake/config.yaml)

The **config.yaml** file keeps the variables separate from the code, so the Snakefile does not need to change when these variables change. For instance, input files, paths, software and wildcards are specified in the YAML file. This way, the core code (Snakefile) remains unchanged from one project to the next: a new user only needs to replace these variables.
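
As a rough illustration of this separation, the sketch below uses a plain Python dictionary as a stand-in for the parsed config.yaml. The keys (`samples`, `out_dir`), the sample names and the output pattern are hypothetical and not taken from the actual config file. It only shows that changing the configuration changes the target files while the code stays the same.

```python
# Stand-in for the parsed config.yaml; all keys and values are hypothetical.
config = {
    "samples": ["sample_A", "sample_B"],
    "out_dir": "results",
}


def final_targets(cfg):
    """Build one final BAM path per sample from the configuration only."""
    return [f"{cfg['out_dir']}/{sample}.bam" for sample in cfg["samples"]]


print(final_targets(config))
```

Passing a different dictionary (another project) to `final_targets` yields a different set of targets without touching the function.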


### Submit all jobs via Snakemake

Once the steps are defined and chained in a Snakefile and the variables are provided in a config file, it is time to submit all the jobs and let the cluster do the work. Running Snakemake on the cluster is a bit different from running it locally: Snakemake is a workflow manager that distributes the load automatically and submits the jobs using the parameters given in the cluster file.

As the whole pipeline may take several hours, Snakemake needs to run inside a "screen" session. Otherwise, Snakemake would stop as soon as the connection to the cluster closes, leaving the pipeline incomplete.

The Snakefile can be run in screen as follows:

In [None]:
# screen: Ctrl-a + Ctrl-d to detach / screen -r to re-attach
module load python_gpu/3.6.4 # load Python first, otherwise snakemake will not work
snakemake --snakefile Snakefile --jobs 999 --cluster-config cluster.json --cluster "bsub -J {cluster.jobname} -n {cluster.ncore} -W {cluster.jobtime} -o {cluster.logi} -R \"rusage[mem={cluster.memo}]\""
# Runs the Snakefile with the cluster configuration, where job name, time, memory, cores and log destination are defined.
# --jobs 999 is the maximum number of jobs submitted to the cluster at the same time. Increase or reduce it depending on the expected number of jobs.
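
To make the `{cluster.*}` placeholders more concrete, here is a minimal sketch of how the values of a cluster file could end up in the `bsub` call. The JSON content is hypothetical (made-up values, only the key names are taken from the command above), and `submit_command` is an illustrative helper, not Snakemake's own code. It mimics the documented behaviour of Snakemake's cluster configuration, where a rule-specific entry overrides the `__default__` entry.

```python
import json
from types import SimpleNamespace

# Hypothetical cluster.json content; the keys mirror the {cluster.*}
# placeholders of the bsub call, the values are made up.
CLUSTER_JSON = """
{
  "__default__": {"jobname": "snake_job", "ncore": 1, "jobtime": "04:00",
                  "memo": 4000, "logi": "log_folder/default.log"},
  "alignment":   {"jobname": "alignment", "ncore": 12, "jobtime": "24:00",
                  "logi": "log_folder/alignment/alignment.log"}
}
"""

BSUB_TEMPLATE = (
    "bsub -J {cluster.jobname} -n {cluster.ncore} -W {cluster.jobtime} "
    '-o {cluster.logi} -R "rusage[mem={cluster.memo}]"'
)


def submit_command(cluster_cfg, rule):
    """Merge the defaults with the rule-specific entry and fill the template."""
    settings = dict(cluster_cfg.get("__default__", {}))
    settings.update(cluster_cfg.get(rule, {}))
    return BSUB_TEMPLATE.format(cluster=SimpleNamespace(**settings))


cluster_cfg = json.loads(CLUSTER_JSON)
print(submit_command(cluster_cfg, "alignment"))
print(submit_command(cluster_cfg, "fastp"))  # no entry of its own: defaults only
```

In this sketch the `alignment` rule inherits `memo` from the defaults, while a rule without its own entry is submitted with the default settings only.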

### Requirements before running Snakemake

* A reference_path containing the reference genome is necessary for the alignment step. Not only the FASTA (.fa) file is required (as seen in the command) but also the complementary files, such as the index (.fai) and the ones listed below. The files must be named following this pattern so that the Snakefile can find them.

In [1]:
ls /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref* -lrth

-r-xr-x--- 1 avillas hest-hpc-tg 2.7G Jan 13 12:14 /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref.fa
-r-xr-x--- 1 avillas hest-hpc-tg  82K Jan 13 18:49 /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref.fa.fai
-r-xr-x--- 1 avillas hest-hpc-tg 309K Jan 13 18:57 /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref.dict
-r-xr-x--- 1 avillas hest-hpc-tg 365K Jan 13 18:57 /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref.fa.ann
-r-xr-x--- 1 avillas hest-hpc-tg 6.5K Jan 13 18:57 /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref.fa.amb
-r-xr-x--- 1 avillas hest-hpc-tg 321K Jan 13 18:57 /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref.fa.NAMES
-r-xr-x--- 1 avillas hest-hpc-tg 2.6G Jan 13 18:57 /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref.fa.bwt
-r-xr-x--- 1 avillas hest-hpc-tg 658M Jan 13 18:57 /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref.fa.pac
-r-xr-x--- 1 avillas hest-hpc-tg 1.3G Jan 13 18:57 /cluster/work/pausch/temp_scratch/audald/ref/UCD_ref.fa.sa
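
Before submitting, it can help to check that all of these files are in place. The following is a minimal sketch: the suffixes are copied from the listing above, but treating every one of them as mandatory is an assumption, and `missing_reference_files` is a hypothetical helper, not part of the workflow.

```python
from pathlib import Path

# Suffixes copied from the listing above, relative to the prefix "UCD_ref".
# Treating every one of them as mandatory is an assumption.
REFERENCE_SUFFIXES = [
    ".fa", ".fa.fai", ".dict", ".fa.ann", ".fa.amb",
    ".fa.NAMES", ".fa.bwt", ".fa.pac", ".fa.sa",
]


def missing_reference_files(reference_dir, prefix="UCD_ref"):
    """Return the expected reference files that do not exist yet."""
    base = Path(reference_dir)
    expected = [base / f"{prefix}{suffix}" for suffix in REFERENCE_SUFFIXES]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Directory taken from the listing above.
    for path in missing_reference_files("/cluster/work/pausch/temp_scratch/audald/ref"):
        print("missing:", path)
```

An empty result means that every expected file name was found. Note that the dictionary is named `UCD_ref.dict`, without the `.fa` part.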


* All logs are written to rule-specific folders inside log_folder, as described in the cluster JSON file. Create the folder "log_folder" and the following sub-folders, one per rule, before starting the submission. Otherwise, the logs will be lost or sent to your e-mail.

In [2]:
ls /cluster/work/pausch/temp_scratch/audald/best_assembly/log_folder

alignment    fastp	      sambamba_flagstat_1  sambamba_merge  split_fastq
build_index  mark_duplicates  sambamba_flagstat_2  sambamba_sort
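
The folders can also be created in one go. This is a small sketch using the rule names from the listing above. `create_log_folders` is a hypothetical helper, and the base directory is assumed to be the working directory of the pipeline.

```python
from pathlib import Path

# Rule names copied from the listing above.
RULES = [
    "alignment", "build_index", "fastp", "mark_duplicates",
    "sambamba_flagstat_1", "sambamba_flagstat_2",
    "sambamba_merge", "sambamba_sort", "split_fastq",
]


def create_log_folders(base_dir, rules=RULES):
    """Create base_dir/log_folder/<rule> for every rule; existing folders are kept."""
    log_root = Path(base_dir) / "log_folder"
    for rule in rules:
        (log_root / rule).mkdir(parents=True, exist_ok=True)
    return log_root


if __name__ == "__main__":
    # Run from the working directory of the pipeline.
    print(create_log_folders("."))
```

Because of `exist_ok=True`, running it again after adding a rule to the list only creates the missing folder.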


### Tips and lessons learnt

Tips and lessons learnt are continuously being added to the ["How_I_did_stuff" notebook](https://github.com/Audald/ETH_Jupyter/blob/master/Catch_all/How_I_did_stuff.ipynb).

However, some notes I found interesting to document here are:
* expand() should only be used in the rule all
* "\t" had to be replaced by "\\t" in the samblaster call
* "\n" should be added at the end of the commands, before piping, so each command is separated
* Plus signs can be added when piping in the shell from one command to the next
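
The "\t" note comes down to how Python treats backslashes in ordinary string literals, which is what the shell commands in a Snakefile are written in. A minimal sketch, with a hypothetical tab-separated argument (the actual string passed in the workflow is not shown here):

```python
# Hypothetical tab-separated argument; only the escaping matters here.
single = "@RG\tID:rg1\tSM:sample_A"    # Python turns \t into a real tab character
double = "@RG\\tID:rg1\\tSM:sample_A"  # backslash and "t" are kept as two characters
raw = r"@RG\tID:rg1\tSM:sample_A"      # a raw string keeps the backslash as well

print(repr(single))
print(repr(double))
print("\t" in single, "\t" in double, raw == double)
```

With the doubled backslash (or a raw string), the literal characters `\t` reach the shell and it is left to the called tool to interpret them.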

### Potential improvement for the alignment workflow

* Remove intermediate files/folders and rename the final files
* Samples are currently given as a Python list contained in config.yaml. It would be amazing to integrate the R/bash/Python scripts I use for fetching the animals of interest into the workflow.
* Depending on the number of read-groups per sample, the read-group specific files will differ in size. While big files take all the resources, tiny ones underuse them. It would be great to find a solution that optimises resources per sample.
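
One possible direction for the last point, purely as a sketch: derive the requested resources from the size of the input file. The tiers and numbers below are hypothetical and would need tuning on real data. The key names `ncore` and `memo` simply mirror the placeholders used in the `bsub` call.

```python
# Hypothetical tiers: (upper size limit in GB, cores, memory in MB).
TIERS = [
    (1, 2, 2000),
    (10, 6, 4000),
    (float("inf"), 12, 6000),
]


def pick_resources(file_size_gb):
    """Return the resources of the first tier whose limit is not exceeded."""
    for limit_gb, cores, mem_mb in TIERS:
        if file_size_gb <= limit_gb:
            return {"ncore": cores, "memo": mem_mb}
    # Only reached when the comparison never succeeds, e.g. for NaN.
    raise ValueError("file size must be a number")


for size in (0.2, 5, 40):
    print(size, pick_resources(size))
```

How such a function would be wired into the Snakefile or the cluster file is left open here.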