## Running the GORG-classifier
#### *(We already did this, we just want you to see what we did)*

To get started, we are going to use the metagenomic libraries downloaded for the previous lesson. On the hub, they are located at: 

```
/mnt/storage/data/metagenomes/subsampled_metagenomes/
```

But to run them through the classifier, we need to correct the read names in the forward and reverse reads so that the names are identical between the paired files.  There is a script within the GORG-classifier lesson repository that corrected this for us.  We ran this script on all downloaded metagenomes in the following way:

```
$ cd /mnt/storage/data/metagenomes/subsampled_metagenomes/
$ for FILE in *.fastq.gz; do
    zcat "$FILE" | python ~/repos/Day3PM_gorg-classifier/scripts/update_names.py | gzip > "../subsampled_metagenomes_corrected/$FILE"
done
```

This script removes the final character within the read id, which differed between the forward and reverse files. 
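
To make the correction concrete, here is a minimal sketch of the idea. It is not the actual `update_names.py`. It assumes the read ID is the first space-delimited token of each header line and that headers fall on every fourth line. The function names and the example header are made up for illustration.

```python
def strip_last_id_char(header):
    """Drop the final character of the read ID (assumed: first space-delimited token)."""
    read_id, sep, rest = header.rstrip("\n").partition(" ")
    return read_id[:-1] + sep + rest


def fix_fastq(lines):
    """Yield FASTQ lines, correcting each header (assumed: lines 1, 5, 9, ...)."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield strip_last_id_char(line) if i % 4 == 0 else line


# Hypothetical record: forward and reverse IDs differ only in the last character.
example = ["@SRR000001.7.1 length=4", "ACGT", "+", "IIII"]
print("\n".join(fix_fastq(example)))
```

Passing `sys.stdin` to `fix_fastq` instead of the example list would let a script like this sit between `zcat` and `gzip`, as in the loop above.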

Next, we ran the corrected reads through the GORG-classifier, using the following command:

```
nextflow run BigelowLab/gorg-classifier \
-profile docker \
--seqs "/mnt/storage/data/metagenomes/subsampled_metagenomes_corrected/SRR*_{1,2}.fastq.gz" \
--mode local \
--nodes /mnt/storage/reference_dbs/gorg_classifier/nodes.dmp \
--names /mnt/storage/reference_dbs/gorg_classifier/names.dmp \
--annotations /mnt/storage/reference_dbs/gorg_classifier/GORG_v1.tsv \
--fmi /mnt/storage/reference_dbs/gorg_classifier/GORG_v1_NCBI.fmi \
--cpus 10 \
--outdir /mnt/storage/lessons/day2_pm_classifier/results/
```

This workflow runs on Nextflow, which is a workflow manager. Nextflow manages software dependencies using either Docker or Singularity. Other software we've used this week (e.g. prokka) also provides an option to run via Docker rather than installing the software locally.

### Wow, that's a long command!
What does it all mean?

<u>Run options:</u>
* **nextflow run BigelowLab/gorg-classifier** -- Run a nextflow script called "gorg-classifier" from the GitHub account "BigelowLab"
* *-profile* **docker** -- Tells nextflow to download required dependencies (kaiju, etc) as containers via Docker. Note the single dash: *-profile* is an option of nextflow itself, not of the classifier
* *--cpus* **10** -- Use 10 cores

<u>Inputs and outputs:</u>
* *--seqs* **SRR\*_{1,2}.fastq.gz** -- Input sequences: tells classifier we're giving it paired reads with names beginning with "SRR"
* *--outdir* -- Directory where the results are written
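
The `SRR*_{1,2}.fastq.gz` pattern only works if every sample has both a `_1` and a `_2` file. As a rough illustration of how such a pattern groups files into pairs, the sketch below mimics the grouping with a regular expression. It is not Nextflow's own code, and `pair_reads` and the file names are hypothetical.

```python
import re
from collections import defaultdict

# Mirrors the glob SRR*_{1,2}.fastq.gz: a sample prefix, then _1 or _2.
PAIR_RE = re.compile(r"^(SRR.*)_([12])\.fastq\.gz$")


def pair_reads(filenames):
    """Group file names by sample prefix and keep only complete (forward, reverse) pairs."""
    mates = defaultdict(dict)
    for name in filenames:
        match = PAIR_RE.match(name)
        if match:
            mates[match.group(1)][match.group(2)] = name
    return {
        sample: (found["1"], found["2"])
        for sample, found in sorted(mates.items())
        if len(found) == 2
    }


# Made-up directory listing: SRR000002 is missing its reverse reads.
files = ["SRR000001_1.fastq.gz", "SRR000001_2.fastq.gz", "SRR000002_1.fastq.gz", "notes.txt"]
print(pair_reads(files))
```

Running something like this on a real directory listing (for example, the output of `os.listdir`) is a quick way to spot samples that are missing a mate before starting a long run.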
  
<u>Config files required by Kaiju:</u>
* *--mode* **local** -- Tells nextflow to use local reference database files (see below) 
* *--nodes* **nodes.dmp** -- Describes taxonomic hierarchy of reference taxa (e.g. NCBI taxonomy)
* *--names* **names.dmp** -- Has names for the reference taxa
* *--fmi* **GORG_v1_NCBI.fmi** -- The kaiju index file

<u>Annotations required by Classifier:</u>
* *--annotations* **GORG_v1.tsv**
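
Since local mode depends on all four reference files, it can save time to confirm they exist and are readable before launching the workflow. The sketch below is one way to do that. The paths are copied from the command above, and the helper name `unreadable` is made up for this example.

```python
import os

# Paths copied from the command above; adjust them for your own system.
REFERENCE_FILES = [
    "/mnt/storage/reference_dbs/gorg_classifier/nodes.dmp",
    "/mnt/storage/reference_dbs/gorg_classifier/names.dmp",
    "/mnt/storage/reference_dbs/gorg_classifier/GORG_v1.tsv",
    "/mnt/storage/reference_dbs/gorg_classifier/GORG_v1_NCBI.fmi",
]


def unreadable(paths):
    """Return the paths that are missing or not readable by the current user."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.R_OK))]


problems = unreadable(REFERENCE_FILES)
if problems:
    print("Check these files: " + ", ".join(problems))
else:
    print("All reference files are readable")
```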


### Okay, but why so complicated?

According to the [README](https://github.com/BigelowLab/gorg-classifier), Classifier *could* be run as simply as:
```
nextflow run BigelowLab/gorg-classifier -profile docker \
--seqs '/data/*.fastq'
```

However, I've found that--due to permissions issues--it's necessary to run it in local mode (specifying the names, nodes, annotations, and fmi files).


### After the workshop, can you help me run Classifier on my own system?
Yes!<br>
You can contact me (Greg) at ggavelis@bigelow.org<br>
If you're working on a computer cluster, you may also need your IT admin to install nextflow and singularity.