
# 1 - Dependency management

***git branch name:*** dependencies

## Theory [2]

As usual, we will start with a few theoretical questions:

* [0.5] What is Docker, and how does it differ from dependency management systems? From virtual machines?
* [0.5] What are the advantages and disadvantages of using containers over other approaches?
* [0.5] Explain how Docker works: what are Dockerfiles, how are containers created, and how are they run and destroyed?
* [0.25] Name and describe at least one Docker competitor (i.e., a tool based on the same containerization technology).
* [0.25] What is conda? How does it differ from apt, yarn, and others?

## Problem [6.5]

The problem itself is relatively simple. 

Imagine that you developed an excellent RNA-seq analysis pipeline and want to share it with the world. Based on your experience, you are confident that the popularity of the pipeline will be proportional to its ease of use. So, you decide to help your future users by packaging all dependencies into a Conda environment and a Docker container.

Here is the list of tools and their versions that are used in your work:
* [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), v0.11.9
* [STAR](https://github.com/alexdobin/STAR), v2.7.10b
* [samtools](https://github.com/samtools/samtools), v1.16.1
* [picard](https://github.com/broadinstitute/picard), v2.27.5
* [salmon](https://github.com/COMBINE-lab/salmon), commit tag 1.9.0
* [bedtools](https://github.com/arq5x/bedtools2), v2.30.0
* [multiqc](https://github.com/ewels/MultiQC), v1.13



**Anaconda**:

* [1] Install conda, create a new virtual environment, and install all necessary packages. 
* [0.75] You won't be able to install some tools - that's fine. List their names, and explain what should be done to make them conda-friendly ([conda-forge](https://conda-forge.org/docs/maintainer/adding_pkgs.html) channel, [bioconda](https://bioconda.github.io/contributor/workflow.html) channel). 
* [0.25] [Export](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#exporting-the-environment-yml-file) the environment ([example](https://github.com/nf-core/clipseq/blob/master/environment.yml)) to the file and verify that it can be [rebuilt](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file) from the file without problems.


**Docker**:
* [3] Create a Dockerfile for a container with **all** required dependencies. Using conda is not allowed. Don't forget to comment the Dockerfile, and test that all tools are accessible and work inside the container. Hints:
 - If needed, make downloaded/compiled binaries executable using chmod (`chmod a+x BINARY_NAME`).
 - Move all executables to a folder listed in `$PATH` (e.g. `/usr/local/bin`) to make them accessible without specifying the full path.
 - Typical command to run a container interactively (`-it`) and delete it on exit (`--rm`): `docker run --rm -it name:tag`
* [1] Use [hadolint](https://hadolint.github.io/hadolint/) and remove as many reported warnings as possible.
* [0.5] Add relevant [labels](https://docs.docker.com/engine/reference/builder/#label), e.g. maintainer, version, etc. ([hint](https://medium.com/@chamilad/lets-make-your-docker-image-better-than-90-of-existing-ones-8b1e5de950d))

## Extra points [1.5]

You will be awarded extra points for the following:
* [0.5] Using [multi-stage builds](https://docs.docker.com/build/building/multi-stage/) in Docker, e.g. to build STAR and copy only the executable to the final image.

* [0.75] Minimizing the size of the final Docker image. That is, removing all intermediates, unnecessary binaries/caches, etc. Don't forget to compare & report the final size before and after all the optimizations.

* [0.25] Create an extra Dockerfile that starts from [a conda base image](https://hub.docker.com/r/continuumio/anaconda3) and builds everything from your conda environment file. 

Hint: `conda env create --quiet -f environment.yml && conda clean -a` ([example](https://github.com/nf-core/clipseq/blob/master/Dockerfile))


# 2 - Working with remote servers

**git branch name:** jbrowser

## Theory [2]

* [0.4] What are [computer ports](https://www.cloudflare.com/learning/network-layer/what-is-a-computer-port/) at a high level? How many ports are there on a typical computer?
* [0.4] What is the difference between http, https, ssh, and other protocols? In what sense are they similar? Name default ports for several data transfer protocols.
* [0.4] Explain briefly: (1) what an IP is, (2) which IPs are called 'white'/public, and (3) what happens when you enter 'google.com' into the web browser.
* [0.4] What is Nginx? How does it work at a high level? List several alternative web servers.
* [0.4] What is SSH, and what is it typically used for? Explain in detail two ways to authenticate to an SSH server.

## Problem [6.5]

This is a real-life situation that has happened to me several times over the years.

Imagine wrapping up a large bioinformatics project and wanting to share raw data with your colleagues in a friendly and straightforward format. The best option would be to use an online genome browser and host your data remotely, so it is easily accessible by anyone with a valid link. This is exactly what we will be doing here.

*Please consider doing this HW using Linux since setting up the SSH client on Windows is painful, and I won't be able to help you.*

**Remote Server**:
* [2] Create a new virtual machine in the Yandex/Mail/etc cloud (order at least 10GB of free disk space). Generate an SSH key pair and use it to connect to your server.
* [1] Download the latest human genome assembly (GRCh38) from the Ensembl FTP server ([fasta](https://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz), [GFF3](https://ftp.ensembl.org/pub/release-108/gff3/homo_sapiens/Homo_sapiens.GRCh38.108.gff3.gz)). Index the fasta using samtools (`samtools faidx`) and GFF3 using tabix.
* [1] Select and download BED files for three ChIP-seq experiments and one ATAC-seq experiment from ENCODE (use one tissue/cell line). Sort, bgzip, and index them using tabix.
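
tabix expects position-sorted input: records grouped by chromosome and ordered by start coordinate. As a rough illustration of that ordering only (it does not replace the usual sort, bgzip, and tabix commands), here is a minimal sketch in plain Python. The function name `sort_bed_lines` is hypothetical, and dropping header lines is a simplifying assumption.

```python
def sort_bed_lines(lines):
    """Sort BED records by chromosome name, then by numeric start position."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and header lines (comments, track/browser
        # lines) are dropped so that only data records remain.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        records.append(line.split("\t"))
    # Column 1 is the chromosome, column 2 is the 0-based start coordinate.
    records.sort(key=lambda fields: (fields[0], int(fields[1])))
    return ["\t".join(fields) for fields in records]


if __name__ == "__main__":
    demo = ["chr2\t5\t10\n", "chr1\t100\t200\n", "chr1\t20\t30\n"]
    for record in sort_bed_lines(demo):
        print(record)
```

Note that the start column is compared as a number: a plain text sort would place `100` before `20`.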

**JBrowse 2**
* [1] Download and install [JBrowse 2](https://jbrowse.org/jb2/). Create a new jbrowse [repository](https://jbrowse.org/jb2/docs/cli/#jbrowse-create-localpath) in `/mnt/JBrowse/` (or some other folder).
* [0.25] Install nginx and amend its config (`/etc/nginx/nginx.conf`) to contain the following section:
```conf
http {
  # Don't touch other options!
  # ........
  # ........

  # Comment this line(!):
  # include /etc/nginx/sites-enabled/*;

  # Add this:
  server {
    listen 80 default_server;
    index index.html;
    server_name _;

    location /jbrowse/ {
      alias /mnt/JBrowse/;
    }
  }
}
```

* [0.25] Restart nginx (or reload its config) and make sure that you can access the browser using a link like this: `http://64.129.58.13/jbrowse/`. Here, `64.129.58.13` stands for your public IP address.
* [1] Add your files (BED & FASTA & GFF3) to the genome browser and verify that everything works as intended. Don't forget to [index](https://jbrowse.org/jb2/docs/cli/#jbrowse-text-index) the genome annotation so that you can later search by gene names. Provide a [persistent link](https://jbrowse.org/jb2/docs/user_guides/basic_usage/#sharing-sessions) to a JBrowse 2 session with all your BED files and the genome annotation in the report (like [this](https://jbrowse.org/code/jb2/v2.3.1/?session=share-HShsEcnq3i&password=nYzTU)). *I must be able to access it without problems later.*


**Common mistakes**:
* Using the `/home/username` folder for JBrowse. Don't do this: you will have permission issues (403 Forbidden) because, by default, the home folder is accessible only to your user, not to the nginx user or group.
* No trailing `/` in the config (`/jbrowse/`, `/mnt/JBrowse/`), or in the URL (`http://64.129.58.13/jbrowse/`).
* If you have added tracks but they are not showing up in JBrowse, try reloading the page or using a private/incognito window.
* Don't use `sudo` when using the JBrowse CLI: (1) you risk messing up permissions, (2) you don't really need it.
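
The 403 error from the first item usually means that the nginx worker cannot traverse some directory on the way to the JBrowse folder. Below is a minimal diagnostic sketch, assuming the nginx user is neither the owner nor in the group of those directories, so that only the "other" permission bits matter. The helper name `dirs_blocking_others` is hypothetical.

```python
import stat
from pathlib import Path
from typing import List


def dirs_blocking_others(path: str) -> List[str]:
    """Return directories on the way to `path` that lack the 'other' execute bit."""
    target = Path(path).resolve()
    blocking = []
    # Check the target itself and every ancestor up to the filesystem root:
    # a process needs execute (traverse) permission on each of them.
    for directory in [target, *target.parents]:
        mode = directory.stat().st_mode
        if stat.S_ISDIR(mode) and not mode & stat.S_IXOTH:
            blocking.append(str(directory))
    return blocking


if __name__ == "__main__":
    # Check the current directory; on the server, pass the JBrowse folder instead.
    print(dirs_blocking_others("."))
```

On the server, `dirs_blocking_others("/mnt/JBrowse")` returns an empty list when every directory on the path can be traversed by other users. Any directory it does report is a candidate cause of the 403 response.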



## Extra points [1.5]

* [1] Create a Docker container for running JBrowse 2. It should be a self-contained application, listening on the default HTTP port. Users must be able to mount directories with custom configs and access them later without any problems. 

Hint: to specify the config, use the config=PATH query parameter. E.g. `http://64.129.58.13/jbrowse/?config=my_folder%2Fconfig.json` where `my_folder%2Fconfig.json` is the [escaped](https://en.wikipedia.org/wiki/Percent-encoding) path to the config file.
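
As an illustration of the escaping in the hint, here is a minimal sketch that builds such a URL with Python's standard library. The helper name `jbrowse_url` is hypothetical; the host and path are the example values from the hint.

```python
from urllib.parse import quote


def jbrowse_url(host: str, config_path: str) -> str:
    """Build a JBrowse 2 URL that loads a specific config file."""
    # safe="" makes quote() escape "/" as %2F as well; by default it is kept.
    escaped = quote(config_path, safe="")
    return f"http://{host}/jbrowse/?config={escaped}"


print(jbrowse_url("64.129.58.13", "my_folder/config.json"))
# prints: http://64.129.58.13/jbrowse/?config=my_folder%2Fconfig.json
```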

* [0.5] Give an in-depth explanation of the OSI model and how the TCP/IP stack works. Don't copy-paste descriptions from the internet; paraphrase and shorten as much as possible (imagine writing a cheat sheet for yourself).






# 3 - Bioinformatic pipelines

**git branch name:** bpipelines

## Theory [2]

* [0.2] What is a pipeline (in bioinformatics)? Why are they so popular in bioinformatics and not in other areas?
* [0.5] Explain how Snakemake and Nextflow work at a high level, i.e., what are their general paradigms?
* [0.5] Name the most flexible and the least flexible way to organize a pipeline, and list their key advantages and disadvantages.
* [0.8] Read the original [Snakemake](https://doi.org/10.1093/bioinformatics/bts480) and [Nextflow](https://doi.org/10.1038/nbt.3820) papers. What crucial problems did the authors strive to solve?

I also recommend reading [this](https://academic.oup.com/bib/article/18/3/530/2562749) excellent but somewhat outdated review of popular frameworks used to create bioinformatic pipelines.

## Problem [6.5]

It's neither possible nor practical to have in-depth knowledge of all typical biological experiments, especially early in your career.
A good strategy for dealing with unfamiliar experiments is to start with previously validated approaches, aka best practices. Nowadays, this is even easier, thanks to the development of publicly curated automated pipelines.

Our toy experiment for today is ChIP-seq, specifically ChIP for the Myc transcription factor from a human cell line. We will analyze it using the public Nextflow [pipeline](https://github.com/nf-core/chipseq).

* [1.5] Prerequisites: (1) download and install Nextflow, (2) download sequencing data for the first replicate from [this](https://www.encodeproject.org/experiments/ENCSR000EGJ/) ENCODE experiment (2 fastq files: control & treatment), (3) download the GRCh38 sequence (fasta) and annotation (GFF3) for the first chromosome from Ensembl.
* [2] Prepare a design document and a params YAML file for [this](https://nf-co.re/chipseq/2.0.0/parameters) Nextflow pipeline. Explain what parameters you used and why you had to specify them.
* [1] Launch the pipeline and wait for it to finish. Amend resources allocated for each process if needed (see Nextflow [config files](https://www.nextflow.io/docs/latest/config.html) for details).
* [2] Analyze and interpret the generated QC report, and report all major findings. Use [IGV](https://software.broadinstitute.org/software/igv/download) to inspect some of the reported peaks manually.


**Note: you will need a PC (or VM) running Linux with at least 16GB of RAM and around 100GB of disk space to complete this HW.**


## Extra points [1.5]

Several separate tasks:
* [0.5] Create a basic VM, install and configure JBrowse, and make your analysis results accessible online (e.g. bed/bigwig files). Remember to include a persistent link to the JBrowse session in the report.
* [1] Use Nextflow (DSL2) to write a simple pipeline that takes as input a folder with fastq files, runs `fastqc` on each file, and then integrates results using `multiqc`.


# General requirements

**Submissions that fail to follow the requirements below won't be graded.**


## How to submit the homework?

All homework must be submitted as a link to a public git repository (preferably GitHub/GitLab). In addition, you must complete each task in a separate branch with a particular name (see each section).


## What is the reporting format?

Reports should be provided as Markdown README files in the root folder of the git repository. Please use Markdown features to make the report easy to read, e.g., you must put code in code blocks, use headings, etc.

You are strongly advised to write in English, but usage of Russian won't be penalized (except for code comments, see below). 

Reports should be self-contained, i.e. include all code (bash/dockerfiles/etc), explanations, and illustrations (e.g. screenshots). Ideally, to evaluate your assignment and reproduce results, I would only need the README file and nothing else.

You must comment your code and explain what is happening in each code block. Code comments must be written in English.

# Resources

* [Git immersion](http://gitimmersion.com/) - a popular git course