# Contamination removal

* The Flye assembly is checked for bacterial contamination using Kraken2
* Workflow adapted from Cecilia

**NOTE** from Cecilia: the DB I used was /input/powerPlant/appsdata/kraken2db/Refseq91 (dated Feb 2019). Building the DB takes a long time, so we don't update it often. There is a newer version, RefSeq93, built about two weeks ago. I'd recommend using a server with plenty of RAM to avoid memory problems when loading the DB.

## 1. Input contigs

In [1]:
mkdir 008.Contamination.analysis

In [2]:
WORKDIR=/workspace/hraczw/github/hoki_genomics/008.Contamination.analysis

Create a file named ContigFiles.txt listing the FASTA path of every contig set to analyse, one path per line.

In [3]:
cat 008.Contamination.analysis/ContigFiles.txt

/workspace/hraczw/github/hoki_genomics/005.pilon_correctFLYE/FlyeAssembly_PFR_AGRF_plusP_P1_pilon-2_corrected.fasta


## 2. Kraken

In [5]:
python << EOF

import sys, os

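# WORKDIR is expanded by the shell (unquoted heredoc) before Python runs
filename = '$WORKDIR/ContigFiles.txt'
f = open(filename,'r')

for line in f:
    fileName = line.strip()
    if not fileName:
        continue  # skip blank lines in ContigFiles.txt
    # sample name = file name without directory or extension
    sampleName = fileName.split('/')[-1].split('.')[0]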

    os.system('bsub -J kraken \
               -m wkoppb50 \
               -n 20 \
               -o $WORKDIR/' + sampleName + '.out \
               -e $WORKDIR/' + sampleName + '.err \
               "/workspace/hraaxt/kraken2/kraken2 \
                --output $WORKDIR/' + sampleName + '.kraken2.cut \
                --unclassified-out $WORKDIR/' + sampleName + '.kraken2.unclassified.out \
                --classified-out $WORKDIR/' + sampleName + '.kraken2.classified.out \
                --report $WORKDIR/' + sampleName + '.kraken2.report.txt \
                --use-names \
                --db /input/powerPlant/appsdata/kraken2db/RefSeq93 \
                --threads 20 \
                --memory-mapping \
                ' + fileName + '"')

f.close()

EOF

# > .log/Try2.G3_2_S3.1.log 2>.log/Try2.G3_2_S3.1.err &


Job <295151> is submitted to default queue <lowpriority>.
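The submission above builds one long shell string by concatenation. As a minimal sketch of the same idea, the command can be assembled as an argument list first, which makes it easy to print and inspect before anything is submitted. The function name `build_kraken_bsub` and its parameters are hypothetical, and the `-m` host option is deliberately left out; only the Kraken2 and bsub flags already used above appear here.

```python
import shlex


def build_kraken_bsub(workdir, fasta, db, threads=20, kraken="kraken2"):
    """Return a bsub argument list that runs Kraken2 on one FASTA file.

    Hypothetical helper: nothing is submitted here, the list is only built.
    """
    # Sample name = file name without directory or extension
    sample = fasta.rsplit("/", 1)[-1].split(".")[0]
    prefix = f"{workdir}/{sample}"

    kraken_cmd = [
        kraken,
        "--output", f"{prefix}.kraken2.cut",
        "--unclassified-out", f"{prefix}.kraken2.unclassified.out",
        "--classified-out", f"{prefix}.kraken2.classified.out",
        "--report", f"{prefix}.kraken2.report.txt",
        "--use-names",
        "--db", db,
        "--threads", str(threads),
        "--memory-mapping",
        fasta,
    ]

    # bsub takes the job command as a single string, so quote each argument
    job = " ".join(shlex.quote(arg) for arg in kraken_cmd)
    return [
        "bsub", "-J", "kraken", "-n", str(threads),
        "-o", f"{prefix}.out", "-e", f"{prefix}.err",
        job,
    ]
```

The returned list could then be passed to `subprocess.run(cmd, check=True)` on a host where `bsub` is available, or simply printed for a dry run.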


## 3. Generate report

In [4]:
module load Krona/2.7

In [5]:
# generate krona input files from kraken output .cut file

for i in FlyeAssembly_PFR_AGRF_plusP_P1_pilon-2_corrected
do
   cutFile=$WORKDIR/$i'.kraken2.cut'
   outFile=$WORKDIR/$i'.kraken2.out'
   perl -lane '@a=split /\t/; if ($a[2] =~ /taxid\s+(\d+)/) {print "$a[1]\t$1";}' $cutFile > $outFile
done
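For reference, a Python sketch of what the perl one-liner does: with `--use-names`, the third tab-separated column of the Kraken2 output looks like `Name (taxid N)`, and Krona only needs the sequence ID and the numeric taxid. The function name `kraken_to_krona` is hypothetical.

```python
import re

# Matches the "(taxid 562)" part of the third column
TAXID_RE = re.compile(r"taxid\s+(\d+)")


def kraken_to_krona(lines):
    """Yield (sequence_id, taxid) pairs from Kraken2 --use-names output lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a Kraken2 result line
        match = TAXID_RE.search(fields[2])
        if match:
            yield fields[1], match.group(1)
```

Writing each pair as `seq_id<TAB>taxid` gives the same two-column file that `ktImportTaxonomy` reads in the next cell.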

In [6]:
python << EOF

import sys, os

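# WORKDIR is expanded by the shell (unquoted heredoc) before Python runs
filename = '$WORKDIR/ContigFiles.txt'
f = open(filename,'r')

for line in f:
    if not line.strip():
        continue  # skip blank lines in ContigFiles.txt
    # sample name = file name without directory or extension
    sampleName = line.strip().split('/')[-1].split('.')[0]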
    os.system('bsub -J krona \
               -o $WORKDIR/' + sampleName + '.krona.out \
               -e $WORKDIR/' + sampleName + '.krona.err \
               "ktImportTaxonomy \
               $WORKDIR/' + sampleName + '.kraken2.out \
               -i \
               -o $WORKDIR/' + sampleName + '.kraken2.html \
               -tax /software/bioinformatics/Krona-2.7/taxonomy/"')

f.close()

EOF

Job <301828> is submitted to default queue <lowpriority>.


## 4. Remove contaminant sequences

In [9]:
module load pfr-python3
module list

Currently Loaded Modulefiles:
  1) powerPlant/core     4) git/2.21.0          7) asub/2.2
  2) texlive/20151117    5) perlbrew/0.76       8) Krona/2.7
  3) pandoc/1.19.2       6) perl/5.28.0         9) pfr-python3/3.7.7


In [7]:
ls $WORKDIR/*.bacteria.txt

/workspace/hraczw/github/hoki_genomics/008.Contamination.analysis/FlyeAssembly_PFR_AGRF_plusP_P1_pilon-2_corrected.bacteria.txt
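This notebook does not record how the `.bacteria.txt` list was produced. One possible way, shown here only as a sketch under assumptions, is to take every taxid that sits under "Bacteria" in the Kraken2 report (the last column is indented by taxonomic depth) and keep the contigs assigned to one of those taxids. Both function names are hypothetical, and this is not necessarily the method that was used.

```python
def taxids_under(report_lines, clade_name="Bacteria"):
    """Collect the taxids of clade_name and all its descendants.

    Assumes the standard six-column Kraken2 report: percentage, clade count,
    direct count, rank code, taxid, and a name indented by depth.
    """
    taxids = set()
    clade_depth = None  # indentation of the clade while we are inside it
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        name = fields[5]
        depth = len(name) - len(name.lstrip(" "))
        if clade_depth is not None and depth <= clade_depth:
            clade_depth = None  # indentation dropped back: subtree has ended
        if clade_depth is None and name.strip() == clade_name:
            clade_depth = depth
        if clade_depth is not None:
            taxids.add(fields[4].strip())
    return taxids


def contigs_in_clade(seq_taxid_pairs, taxids):
    """Return the sequence IDs whose assigned taxid is in taxids."""
    return [seq for seq, taxid in seq_taxid_pairs if taxid in taxids]
```

The `(sequence ID, taxid)` pairs can come from the two-column `.kraken2.out` file generated in section 3; writing the resulting IDs one per line would give a list in the shape the filtering step expects.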


In [10]:
python /workspace/hraczw/scriptomics/filter_fasta_by_list_of_headers.py \
/workspace/hraczw/github/hoki_genomics/005.pilon_correctFLYE/FlyeAssembly_PFR_AGRF_plusP_P1_pilon-2_corrected.fasta \
$WORKDIR/FlyeAssembly_PFR_AGRF_plusP_P1_pilon-2_corrected.bacteria.txt > \
/workspace/hraczw/github/hoki_genomics/005.pilon_correctFLYE/FlyeAssembly_PFR_AGRF_plusP_P1_pilon-2_corrected.noBacteria.fasta
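The script `filter_fasta_by_list_of_headers.py` is not shown in this notebook. Judging from the record counts below, it appears to drop the records whose header is in the list. A minimal sketch of that kind of filter, assuming the record ID is the header text up to the first whitespace (the function name `drop_listed_records` is hypothetical):

```python
def drop_listed_records(fasta_lines, unwanted_ids):
    """Yield FASTA lines, skipping every record whose ID is in unwanted_ids."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            # ID = header text after '>' up to the first whitespace
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            keep = record_id not in unwanted_ids
        if keep:
            yield line
```

Because it works line by line, the sketch handles multi-line sequences and never holds the whole assembly in memory.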

In [11]:
grep -c '>' /workspace/hraczw/github/hoki_genomics/005.pilon_correctFLYE/FlyeAssembly_PFR_AGRF_plusP_P1_pilon-2_corrected.fasta

589


In [12]:
grep -c '>' /workspace/hraczw/github/hoki_genomics/005.pilon_correctFLYE/FlyeAssembly_PFR_AGRF_plusP_P1_pilon-2_corrected.noBacteria.fasta

566
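The two counts above differ by 23 records. A small sketch of the same check in Python; `count_records` is a hypothetical helper that, unlike `grep -c '>'`, only counts lines that start with `>`:

```python
def count_records(fasta_lines):
    """Count FASTA records, i.e. lines that start with '>'."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


# Counts reported by the two grep commands above
before, after = 589, 566
removed = before - after
print(removed)  # 23
```

If every ID in the `.bacteria.txt` list occurs exactly once in the assembly, `removed` should equal the number of lines in that list, which is a quick sanity check on the filtering step.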
