# WGS analysis

In [None]:
conda activate /home/junyuchen/Biosoft/anaconda3/envs/wgs

In [None]:
export data=/home/LDlab/WGS-example/rawdata

In [None]:
export result=/home/LDlab/WGS-example/result

In [None]:
echo $data

In [None]:
echo $result
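
The QC steps below write into subdirectories of `$result`, and FastQC in particular expects its `-o` directory to exist already. A minimal sketch for creating them up front; the fallback to `./result` when `$result` is unset is an assumption made only so the snippet runs on its own:

```shell
# Fall back to a local ./result directory if $result is not set (illustrative assumption).
result="${result:-./result}"

# Create only the directories the QC steps write into.
# Assembler output directories are left alone: MEGAHIT, for example,
# stops if its -o directory already exists.
for sub in fastqc trimmomatic; do
  mkdir -p "$result/$sub"
done

ls "$result"
```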

## QC/Trimming

### FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

`fastqc -t 8 $data/ERR044595_1M_1.fastq.gz $data/ERR044595_1M_2.fastq.gz -o $result/fastqc`

### Trimmomatic

http://www.usadellab.org/cms/index.php?page=trimmomatic

`conda install -c bioconda trimmomatic`

```shell
trimmomatic PE -threads 16 $data/ERR044595_1M_1.fastq.gz $data/ERR044595_1M_2.fastq.gz \
              $result/trimmomatic/ERR044595_1M_1.paired.fastq $result/trimmomatic/ERR044595_1M_1.unpaired.fastq \
              $result/trimmomatic/ERR044595_1M_2.paired.fastq $result/trimmomatic/ERR044595_1M_2.unpaired.fastq \
              ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
```
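
To see how many read pairs survived trimming, it can help to count the records in the input and output files. The helper below is a hypothetical sketch (`count_reads` is not part of Trimmomatic) and assumes standard four-line FASTQ records; the demo file is made up for illustration.

```shell
# Hypothetical helper: count reads in a FASTQ file (plain or gzip-compressed),
# assuming the usual four lines per record.
count_reads() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | awk 'END { print NR / 4 }'
}

# Demo on a tiny made-up FASTQ file; point it at the real files instead, e.g.
# $data/ERR044595_1M_1.fastq.gz and $result/trimmomatic/ERR044595_1M_1.paired.fastq
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$demo_dir/demo.paired.fastq"
count_reads "$demo_dir/demo.paired.fastq"
```

Comparing the count for the raw file with the count for the `.paired.fastq` file gives the fraction of pairs retained.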

## Assembly

According to this [article](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-019-0626-5), SPAdes in its `meta` mode (metaSPAdes) is the better choice of assembler.

[QUAST](http://cab.spbu.ru/software/quast/) can then be used to evaluate the quality of each assembly.

### SPAdes

https://github.com/ablab/spades

`conda install -c bioconda spades`

```shell
metaspades.py -1 $result/trimmomatic/ERR044595_1M_1.paired.fastq -2 $result/trimmomatic/ERR044595_1M_2.paired.fastq -o $result/metaspades
```

### MEGAHIT

https://github.com/voutcn/megahit

`conda install -c bioconda megahit`

```shell
megahit -1 $result/trimmomatic/ERR044595_1M_1.paired.fastq -2 $result/trimmomatic/ERR044595_1M_2.paired.fastq -o $result/megahit
```

### Unicycler

https://github.com/rrwick/Unicycler

`conda install -c bioconda unicycler`

```shell
unicycler -1 $result/trimmomatic/ERR044595_1M_1.paired.fastq -2 $result/trimmomatic/ERR044595_1M_2.paired.fastq -o $result/unicycler
```

### QUAST

http://cab.spbu.ru/software/quast/
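
N50 is one of the metrics QUAST reports. As a rough sanity check before running the full tool, it can be computed directly from a contig FASTA file. The `n50` function below is a hypothetical helper, not part of QUAST, and the demo contigs are made up:

```shell
# Hypothetical helper: print the N50 of a FASTA file, i.e. the length L such
# that contigs of length >= L together cover at least half of the assembly.
n50() {
  # First awk: one length per contig (handles multi-line sequences).
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1" |
    sort -rn |
    # Second awk: walk lengths from longest to shortest until half is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
           for (i = 1; i <= NR; i++) {
             run += lens[i]
             if (run * 2 >= total) { print lens[i]; exit }
           }
         }'
}

# Demo with four made-up contigs of 8, 6, 4 and 2 bases.
demo_dir=$(mktemp -d)
printf '>c1\nACGTACGT\n>c2\nACGTAC\n>c3\nACGT\n>c4\nAC\n' > "$demo_dir/contigs.fa"
n50 "$demo_dir/contigs.fa"
```

The same call works on real assembler output, for example `n50 $result/megahit/final.contigs.fa`.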

### Bandage

http://rrwick.github.io/Bandage/

## Annotation

### Prokka

https://github.com/tseemann/prokka

`prokka --outdir $result/prokka --prefix demo $result/megahit/final.contigs.fa`

### Prodigal

In [None]:
prodigal -i my.genome.fna -o gene.coords.gbk -a protein.translations.faa

```shell
Input/Output Parameters

  -i, --input_file:     Specify input file (default stdin).
  -o, --output_file:    Specify output file (default stdout).
  -a, --protein_file:   Specify protein translations file.
  -d, --mrna_file:      Specify nucleotide sequences file.
  -s, --start_file:     Specify complete starts file.
  -w, --summ_file:      Specify summary statistics file.
  -f, --output_format:  Specify output format.
                          gbk:  Genbank-like format (Default)
                          gff:  GFF format
                          sqn:  Sequin feature table format
                          sco:  Simple coordinate output
  -q, --quiet:          Run quietly (suppress logging output).
```
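
After gene prediction, a quick look at the protein file given to `-a` shows how many genes were called. The sketch below uses a hypothetical helper, `faa_stats`, and a made-up two-protein file; it assumes translations may end with a `*` stop symbol, which is stripped before counting residues.

```shell
# Hypothetical helper: print "<number of proteins><TAB><total residues>"
# for a protein FASTA file, ignoring any '*' stop symbols.
faa_stats() {
  awk '/^>/ { n++; next }
       { gsub(/\*/, ""); aa += length($0) }
       END { printf "%d\t%d\n", n, aa }' "$1"
}

# Demo on a made-up file; use protein.translations.faa from the command above instead.
demo_dir=$(mktemp -d)
printf '>gene_1\nMKV*\n>gene_2\nMA\nLL*\n' > "$demo_dir/demo.faa"
faa_stats "$demo_dir/demo.faa"
```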

### Diamond

http://www.diamondsearch.org/index.php

#### Example

```shell
diamond blastp --db /home/malab/databases_of_malab/nr/nr --query $result/prokka/demo.faa --out $result/demo_nr_annotation.out --evalue 1e-05 --outfmt 6 --max-target-seqs 1 --threads 10
```

### Custom database

`diamond makedb --in aa.fasta -d aa.fasta`

`makeblastdb -in nr.fasta -dbtype prot`

## nf-core/bacass

https://nf-co.re/bacass/1.1.0

`conda activate /home/junyuchen/Biosoft/anaconda3/envs/nf-core`

```shell
nextflow run /home/junyuchen/Lab/WGS/nf-core-bacass-1.1.0/workflow --input /home/junyuchen/Lab/WGS/bacass_short_test.tsv --outdir $result/bacass -profile docker --skip_annotation --skip_kraken2
```

## Appendix

### Diamond arguments

-  `--db/-d <file>`

    Path to the DIAMOND database file.

-  `--query/-q <file>`

    Path to the query input file in FASTA or FASTQ format (may be gzip compressed). If this parameter is omitted, the input will be read from `stdin`.

-  `--out/-o <file>`

    Path to the output file. If this parameter is omitted, the results will be written to the standard output and all other program output will be suppressed.

-  `--evalue/-e #`

    Maximum expected value to report an alignment (default=0.001).
    
-  `--outfmt/-f 6`  

    BLAST tabular format (default). This format can be customized: the `6` may be followed by a space-separated list of keywords, each specifying one field of the output.

-  `--max-target-seqs/-k #`

    The maximum number of target sequences per query to report alignments for (default=25). Setting this to 0 will report all alignments that were found.

-  `--threads/-p #`

    Number of CPU threads. By default, the program will auto-detect and use all available virtual cores on the machine.

### BLAST tabular format

By default, there are 12 preconfigured fields:   
`qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore`

* `qseqid` Query sequence ID

* `sseqid` Subject sequence ID

* `pident` Percentage of identical matches

* `length` Alignment length

* `mismatch` Number of mismatches

* `gapopen` Number of gap openings

* `qstart` Start of alignment in query

* `qend` End of alignment in query

* `sstart` Start of alignment in subject

* `send` End of alignment in subject

* `evalue` Expect value

* `bitscore` Bit score
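
Because the 12 default fields are tab-separated and always appear in this order, hits can be filtered with `awk` by column number (`pident` is column 3, `evalue` is column 11). The following is a minimal sketch; `filter_hits`, the thresholds, and the sample rows are all hypothetical and only illustrate the idea.

```shell
# Hypothetical helper: keep hits with percent identity >= $2 and e-value <= $3
# from a 12-column BLAST tabular file ($1).
filter_hits() {
  awk -F '\t' -v min_id="$2" -v max_e="$3" \
    '$3 + 0 >= min_id + 0 && $11 + 0 <= max_e + 0' "$1"
}

# Demo with three made-up hits; replace with the real DIAMOND output file.
demo_dir=$(mktemp -d)
{
  printf 'q1\tsA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t400\n'
  printf 'q2\tsB\t45.0\t150\t80\t2\t1\t150\t5\t154\t2e-03\t60\n'
  printf 'q3\tsC\t92.1\t180\t14\t0\t1\t180\t1\t180\t3e-40\t350\n'
} > "$demo_dir/hits.tsv"

# Keep hits with at least 90% identity and an e-value of at most 1e-5.
filter_hits "$demo_dir/hits.tsv" 90 1e-5
```

If the output was produced with a customized field list after `--outfmt 6`, the column numbers in the `awk` condition need to be adjusted to match.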

## Resources

https://github.com/SionBayliss/Bio-Courses

https://github.com/BacterialCommunitiesAndPopulation/Wednesday18thMay/blob/master/Assembly_Tutorial.md

https://www.coursera.org/lecture/wgs-bacteria/tutorial-D1f65

https://genohub.com/recommended-sequencing-coverage-by-application/