## Project Plan

The **data** I am working with come from the paper "RNA-seq and Tn-seq reveal fitness determinants of vancomycin-resistant *Enterococcus faecium* during growth in human serum" by Xinglin Zhang, Vincent de Maat, Ana M. Guzmán Prieto, Tomasz K. Prajsnar, Jumamurat R. Bayjanov, Mark de Been, Malbert R. C. Rogers, Marc J. M. Bonten, Stéphane Mesnage, Rob J. L. Willems and Willem van Schaik.  
  
The **aim of this project** is to determine which genetic elements are required for the growth of a vancomycin-resistant *Enterococcus faecium* strain in human serum. This matters because *E. faecium* frequently causes bloodstream infections in hospitalized patients. After a *de novo* assembly of the *E. faecium* genome, a differential expression analysis of the transcriptome is needed to compare two conditions: bacteria grown in rich medium and bacteria grown in human serum.  
  
The genome is available from NCBI GenBank (accessions CP014529 to CP014535). The available DNA, RNA and transposon-sequencing (Tn-seq) reads are in FASTQ format.  
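
To keep track of the reference records, the accession range above can be expanded into an explicit list. This is a minimal sketch, assuming the range is contiguous (as the notation suggests); the helper `accession_range` is a hypothetical name and the code does not contact NCBI.

```python
def accession_range(first: str, last: str) -> list[str]:
    """Expand an inclusive range of accessions that share a letter prefix."""
    prefix = first.rstrip("0123456789")      # e.g. "CP"
    width = len(first) - len(prefix)         # number of digits, kept for zero padding
    if not last.startswith(prefix) or len(last) != len(first):
        raise ValueError("accessions must share the same prefix and length")
    start = int(first[len(prefix):])
    stop = int(last[len(prefix):])
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


# Range copied from the project description; assumed to be contiguous.
accessions = accession_range("CP014529", "CP014535")
for accession in accessions:
    print(accession)
```

The resulting list can be written to a text file and used as input for whichever download tool is available on the cluster.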

<!-- <img style="float: left;" src="images/short_data_dir_tree.png" width=175 > -->
![Data directory tree](images/long_data_dir_tree1.png)  

## Workflow

1) Quality control of the long DNA reads (PacBio) for assembly  
    - use **FastQC** to generate a quality control report  
2) If the reads pass QC, move on to step 3  
    - if not, pre-process the reads with **Trimmomatic** and run quality control again  
3) Genome assembly of the long reads (PacBio)  
    - use **Canu** and **SPAdes**, each with several parameter settings  
    
4) Quality control of the short DNA reads (Illumina)  
    - use **FastQC** to generate a quality control report  
5) Pre-processing of the short reads  
    - use **Trimmomatic**, then run quality control again  
6) Mapping the processed Illumina reads to the PacBio assembly  
    - **BWA-MEM** for mapping  
    - **Pilon** to polish the assembly using the mapped reads  
7) Evaluation of the genome assembly  
    - **QUAST**, **BCFtools**, **MUMmerplot**  
    
8) Genome assembly annotation  
    - **Prokka**, **Maker2**  
9) Homology search  
    - **BLASTN**  
10) Synteny  
    - **ACT**  
  
11) Quality control of the RNA reads (Illumina)  
    - use **FastQC** to generate a quality control report  
12) Pre-processing of the RNA reads  
    - use **Trimmomatic**, then run quality control again  
13) RNA mapping  
    - **BWA-MEM**

14) Counting RNA reads
    - **htseq-count**
15) Differential expression
    - **DESeq2**

16) Biological interpretation of results
    - **R**
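
Some of the steps above can be sketched as command lines built in Python. This is only a rough sketch under stated assumptions: all file and directory names are hypothetical placeholders, the `trimmomatic` wrapper script is assumed to be on the `PATH` (some installations use `java -jar trimmomatic.jar` instead), and the trimming settings, the `--stranded=no` choice and the `locus_tag` attribute are illustrative values, not decisions taken from the paper. Nothing is executed here; the functions only return argument lists.

```python
from pathlib import Path


def fastqc_cmd(reads: Path, outdir: Path, threads: int = 1) -> list[str]:
    """Steps 1, 4 and 11: quality control report for one FASTQ file."""
    return ["fastqc", "--outdir", str(outdir), "--threads", str(threads), str(reads)]


def next_step(qc_passed: bool) -> str:
    """Step 2: go on to assembly if QC passed, otherwise trim and re-check."""
    return "assemble" if qc_passed else "trim_and_recheck"


def trimmomatic_pe_cmd(r1: Path, r2: Path, outdir: Path, threads: int = 1) -> list[str]:
    """Steps 5 and 12: paired-end trimming (illustrative settings)."""
    outputs = [str(outdir / name) for name in
               ("trimmed_1P.fq.gz", "trimmed_1U.fq.gz",
                "trimmed_2P.fq.gz", "trimmed_2U.fq.gz")]
    return ["trimmomatic", "PE", "-threads", str(threads), str(r1), str(r2),
            *outputs, "SLIDINGWINDOW:4:15", "MINLEN:36"]


def bwa_mem_cmd(reference: Path, r1: Path, r2: Path, threads: int = 1) -> list[str]:
    """Steps 6 and 13: mapping; the reference must first be indexed with `bwa index`."""
    return ["bwa", "mem", "-t", str(threads), str(reference), str(r1), str(r2)]


def htseq_count_cmd(bam: Path, gff: Path) -> list[str]:
    """Step 14: count reads per annotated feature (attribute name is an assumption)."""
    return ["htseq-count", "--format=bam", "--stranded=no",
            "--type=CDS", "--idattr=locus_tag", str(bam), str(gff)]


# Hypothetical file layout, for illustration only.
r1, r2 = Path("rna/sample_1.fq.gz"), Path("rna/sample_2.fq.gz")
for cmd in (fastqc_cmd(r1, Path("qc")),
            trimmomatic_pe_cmd(r1, r2, Path("trimmed")),
            bwa_mem_cmd(Path("assembly/contigs.fasta"), r1, r2),
            htseq_count_cmd(Path("mapped/sample.bam"), Path("annotation/genome.gff"))):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Note that `bwa mem` writes SAM to standard output, so its output has to be redirected to a file or piped into a BAM converter.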

#### Deadlines to keep  

|Done?|Day| Hours | Prokaryotes|
|------|------|------|------|
|[x]|24/3 | 2 |Seminar|
|[ ]|30/3 |4 |Project planning|
|[ ]|17/4 |4 |Genome Assembly + Genome annotation |
|[ ]|29/4 |4 |Comparative genomics|
|[ ]|8/5 |4| RNA mapping |

#### Estimated analyses and their running time  
  
| Analysis | Software | Running time (for paper I) |
|------|------|------|
| Read pre-processing | Trimmomatic | ~50 min per file (1 core) |
| DNA assembly | SPAdes | short + long reads: ~2 h (1 core) |
|  | Canu | ~4.5 h (1 core) |
| Assembly evaluation | QUAST | < 15 min (1 core) |
|  | MUMmerplot | < 5 min (1 core) |
|  | BCFtools | ~90 min (1 core) |
| Annotation | Prokka | < 5 min (1 core) |
| Alignment | BWA | paired-end reads: ~30 min (1 core) |
|  |  | single-end reads: < 15 min (1 core) |
| Differential expression | HTSeq | paired-end reads: ~2-7 h (1 core) |
|  |  | single-end reads: < 10 min (1 core) |
|  | DESeq2 (R library) | Variable |
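
To get a rough idea of how many core hours to book, the upper bounds from the table above can be added up. This is a back-of-the-envelope sketch: the minutes are the table's upper estimates, DESeq2 is left out because its running time is listed as variable, and the number of runs per step (for example how many files go through Trimmomatic) is a hypothetical input, not a figure from the project.

```python
# Upper-bound single-core estimates in minutes, taken from the table above.
UPPER_BOUND_MIN = {
    "Trimmomatic (per file)": 50,
    "SPAdes": 120,
    "Canu": 270,
    "QUAST": 15,
    "MUMmerplot": 5,
    "BCFtools": 90,
    "Prokka": 5,
    "BWA (paired-end)": 30,
    "BWA (single-end)": 15,
    "HTSeq (paired-end)": 420,
    "HTSeq (single-end)": 10,
}


def total_core_hours(minutes_per_run, runs=None):
    """Sum the estimates, multiplying each step by its number of runs (default 1)."""
    runs = runs or {}
    total_minutes = sum(minutes * runs.get(step, 1)
                        for step, minutes in minutes_per_run.items())
    return total_minutes / 60


# One run of every step.
print(f"{total_core_hours(UPPER_BOUND_MIN):.1f} core hours")

# Hypothetical example: four FASTQ files need trimming.
print(f"{total_core_hours(UPPER_BOUND_MIN, {'Trimmomatic (per file)': 4}):.1f} core hours")
```

Because every entry is an upper bound for a single core, the total is a conservative figure; actual usage depends on the number of samples and on how many parameter settings are tried for each assembler.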