# contamination removal

* Sequences of microbial origin should be removed from the assembly (gap-filled scaffolds)
* Kraken2 is used to identify the contigs that belong to microbial contaminants so they can be removed
* Workflow adopted from Cecilia

**NOTE** from Cecilia: the DB I used was /input/powerPlant/appsdata/kraken2db/Refseq91 (dated Feb 2019). It took ages to build the DB, hence we don't update it often. There is a newer version, RefSeq93, built about two weeks ago. I'd recommend using a server with large RAM to avoid memory issues when loading the DB.

## 1. input contigs

In [2]:
mkdir 0031.Contamination.analysis

In [1]:
WORKDIR=/workspace/hraczw/github/GA/Gillenia_genome/0031.Contamination.analysis

Create a file named ContigFiles.txt that lists, one path per line, all contig sets required for the analysis

In [2]:
cat 0031.Contamination.analysis/ContigFiles.txt

/workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaff_links_i1_gapFilled.noCorrection.tgs-gapcloser.scaff_seqs.fa
/workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaff_links_i4_gapFilled.noCorrection.tgs-gapcloser.scaff_seqs.fa


## 2. Kraken

In [3]:
python << EOF

import os

# WORKDIR is expanded by the shell (unquoted heredoc) before Python runs
filename = '$WORKDIR/ContigFiles.txt'

with open(filename, 'r') as f:
    for line in f:
        fileName = line.strip()
        if not fileName:
            continue  # skip blank lines in ContigFiles.txt

        # e.g. .../foo.scaff_seqs.fa -> foo.scaff_seqs
        sampleName = fileName.split('/')[-1].split('.fa')[0]

        # submit one Kraken2 job per contig set; -n (LSF slots) matches --threads
        os.system('bsub -J kraken \
                   -m wkoppb50 \
                   -n 40 \
                   -o $WORKDIR/' + sampleName + '.out \
                   -e $WORKDIR/' + sampleName + '.err \
                   "/workspace/hraaxt/kraken2/kraken2 \
                    --output $WORKDIR/' + sampleName + '.kraken2.cut \
                    --unclassified-out $WORKDIR/' + sampleName + '.kraken2.unclassified.out \
                    --classified-out $WORKDIR/' + sampleName + '.kraken2.classified.out \
                    --report $WORKDIR/' + sampleName + '.kraken2.report.txt \
                    --use-names \
                    --db /input/powerPlant/appsdata/kraken2db/RefSeq93 \
                    --threads 40 \
                    --memory-mapping \
                    ' + fileName + '"')

EOF

# > .log/Try2.G3_2_S3.1.log 2>.log/Try2.G3_2_S3.1.err &


Job <255378> is submitted to default queue <lowpriority>.
Job <255379> is submitted to default queue <lowpriority>.
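The submission above builds the `bsub` call by string concatenation, which is easy to get wrong when quoting. As a minimal sketch, the same command could be assembled as an argument list instead. The function names `sample_name` and `kraken_bsub_command` are hypothetical helpers introduced here for illustration; the flags and paths are the ones used in the cell above, and nothing is executed by the sketch itself.

```python
from pathlib import Path


def sample_name(fasta_path):
    """Mirror the notebook's naming: file basename up to the first '.fa'."""
    return Path(fasta_path).name.split(".fa")[0]


def kraken_bsub_command(fasta_path, workdir, db, threads=40, host="wkoppb50",
                        kraken2="/workspace/hraaxt/kraken2/kraken2"):
    """Return the bsub + kraken2 command as a list of arguments (not executed)."""
    prefix = f"{workdir}/{sample_name(fasta_path)}"
    return [
        "bsub", "-J", "kraken",
        "-m", host,
        "-n", str(threads),           # LSF slots, kept equal to --threads
        "-o", f"{prefix}.out",
        "-e", f"{prefix}.err",
        kraken2,
        "--output", f"{prefix}.kraken2.cut",
        "--unclassified-out", f"{prefix}.kraken2.unclassified.out",
        "--classified-out", f"{prefix}.kraken2.classified.out",
        "--report", f"{prefix}.kraken2.report.txt",
        "--use-names",
        "--db", db,
        "--threads", str(threads),
        "--memory-mapping",
        fasta_path,
    ]
```

A list like this could then be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting altogether and raises an error if `bsub` rejects the job.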


## 3. generate html report

In [4]:
module load Krona/2.7

In [5]:
# generate krona input files from kraken output .cut file

for i in scaff_links_i1_gapFilled.noCorrection.tgs-gapcloser.scaff_seqs scaff_links_i4_gapFilled.noCorrection.tgs-gapcloser.scaff_seqs
do
   cutFile=$WORKDIR/$i'.kraken2.cut'
   outFile=$WORKDIR/$i'.kraken2.out'
   perl -lane '@a=split /\t/; if ($a[2] =~ /taxid\s+(\d+)/) {print "$a[1]\t$1";}' $cutFile > $outFile
done
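For readers less familiar with Perl, the one-liner above can be sketched in Python. With `--use-names`, the third tab-separated column of the Kraken2 output looks like `Escherichia coli (taxid 562)`; the loop keeps the sequence ID (column 2) and the numeric taxid, which is the two-column input that `ktImportTaxonomy` reads. The function name `krona_rows` is a hypothetical helper, not part of Kraken2 or Krona.

```python
import re

# Matches the "(taxid N)" suffix that --use-names adds to column 3
TAXID_RE = re.compile(r"taxid\s+(\d+)")


def krona_rows(kraken_lines):
    """Yield (sequence_id, taxid) pairs from Kraken2 per-sequence output lines."""
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        match = TAXID_RE.search(fields[2])
        if match:
            yield fields[1], match.group(1)


if __name__ == "__main__":
    example = [
        "C\tscaffold_1\tEscherichia coli (taxid 562)\t1500\t562:10\n",
        "U\tscaffold_2\tunclassified (taxid 0)\t900\t0:5\n",
    ]
    for seq_id, taxid in krona_rows(example):
        print(f"{seq_id}\t{taxid}")
```

Like the Perl version, this keeps unclassified sequences (taxid 0), so they still show up in the Krona chart.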

In [6]:
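python << EOF

import os

# WORKDIR is expanded by the shell (unquoted heredoc) before Python runs
filename = '$WORKDIR/ContigFiles.txt'

with open(filename, 'r') as f:
    for line in f:
        fileName = line.strip()
        if not fileName:
            continue  # skip blank lines in ContigFiles.txt

        sampleName = fileName.split('/')[-1].split('.fa')[0]

        # build one Krona html chart per sample from the two-column .kraken2.out file
        os.system('bsub -J krona \
                   -o $WORKDIR/' + sampleName + '.krona.out \
                   -e $WORKDIR/' + sampleName + '.krona.err \
                   "ktImportTaxonomy \
                   $WORKDIR/' + sampleName + '.kraken2.out \
                   -i \
                   -o $WORKDIR/' + sampleName + '.kraken2.html \
                   -tax /software/bioinformatics/Krona-2.7/taxonomy/"')

EOF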

Job <239083> is submitted to default queue <lowpriority>.
Job <239084> is submitted to default queue <lowpriority>.


## 4. remove contamination seqs

In [8]:
module list

Currently Loaded Modulefiles:
  1) powerPlant/core    4) git/2.21.0         7) asub/2.1
  2) texlive/20151117   5) perlbrew/0.76      8) Krona/2.7
  3) pandoc/1.19.2      6) perl/5.28.0


In [9]:
module load pfr-python3

In [10]:
module list

Currently Loaded Modulefiles:
  1) powerPlant/core     4) git/2.21.0          7) asub/2.1
  2) texlive/20151117    5) perlbrew/0.76       8) Krona/2.7
  3) pandoc/1.19.2       6) perl/5.28.0         9) pfr-python3/3.6.6


In [13]:
python /workspace/hraczw/scriptomics/filter_fasta_by_list_of_headers.py \
/workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaff_links_i6_gapFilled.noCorrection.tgs-gapcloser.scaff_seqs.fa \
$WORKDIR/scaff_links_i6_gapFilled.noCorrection.tgs-gapcloser.scaff_seqs.bacteria.txt > \
/workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaff_links_i6_gapFilled.noCorrection.tgs-gapcloser.scaff_seqs.nobacteria.fasta
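The `*.bacteria.txt` header list passed to the filter script is not generated anywhere in this notebook. One possible way to derive such a list, offered only as a sketch and not as the method actually used here, is to collect every taxid nested under `Bacteria` in the Kraken2 report (whose last column is a scientific name indented by taxonomic depth) and then keep the classified contigs whose taxid falls in that set. The helper names `clade_taxids` and `contigs_in_clade` are hypothetical, and the exact list format expected by `filter_fasta_by_list_of_headers.py` is an assumption.

```python
import re

TAXID_RE = re.compile(r"taxid\s+(\d+)")


def clade_taxids(report_lines, root_name="Bacteria"):
    """Collect the taxid of root_name and of every taxon nested below it.

    Assumes the standard 6-column Kraken2 report:
    percent, clade count, direct count, rank code, taxid, indented name.
    """
    taxids = set()
    root_depth = None
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        name = fields[5]
        depth = len(name) - len(name.lstrip(" "))
        if root_depth is None:
            if name.strip() == root_name:
                root_depth = depth
                taxids.add(fields[4].strip())
        elif depth > root_depth:
            taxids.add(fields[4].strip())   # still inside the clade
        else:
            break                            # indentation dropped: clade ended
    return taxids


def contigs_in_clade(kraken_lines, taxids):
    """Return IDs of classified sequences whose assigned taxid is in taxids."""
    ids = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "C":
            continue
        match = TAXID_RE.search(fields[2])
        if match and match.group(1) in taxids:
            ids.append(fields[1])
    return ids
```

Writing the returned IDs one per line would give a candidate header list. Contigs assigned to other non-plant groups (for example fungi or viruses) would need their own clade names passed as `root_name`.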

In [16]:
grep -c ">" /workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaff_links_i8_gapFilled.noCorrection.tgs-gapcloser.scaff_seqs.fa

4126


In [17]:
grep -c ">" /workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaff_links_i8_gapFilled.noCorrection.tgs-gapcloser.scaff_seqs.nobacteria.fasta

4077


In [13]:
python /workspace/hraczw/scriptomics/filter_fasta_by_list_of_headers.py \
/workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaffolds_slr_gapClosed.fasta \
$WORKDIR/scaffolds_slr_gapClosed.bacteria.txt > \
/workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaffolds_slr_gapClosed.nobacteria.fasta

In [14]:
grep -c ">" /workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaffolds_slr_gapClosed.fasta

5463


In [15]:
grep -c ">" /workspace/hraczw/github/GA/Gillenia_genome/005.GapFilling/scaffolds_slr_gapClosed.nobacteria.fasta

5419
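The before/after counts above imply that 49 scaffolds were removed from the i8 set (4126 to 4077) and 44 from the SLR set (5463 to 5419). The source of `filter_fasta_by_list_of_headers.py` is not shown in this notebook; the sketch below assumes it drops every record whose ID appears in the header list, and shows how that filtering and the count check could be reproduced. All function names are hypothetical.

```python
def read_headers(list_lines):
    """Load header IDs, one per line, tolerating a leading '>' and blank lines."""
    return {line.strip().lstrip(">") for line in list_lines if line.strip()}


def drop_records(fasta_lines, unwanted):
    """Yield FASTA lines, skipping records whose ID (first word) is unwanted."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            keep = seq_id not in unwanted
        if keep:
            yield line


def count_records(fasta_lines):
    """Count FASTA records, equivalent to grep -c '>' on a well-formed file."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


if __name__ == "__main__":
    fasta = [">a desc\n", "ACGT\n", ">b\n", "GG\n", "TT\n", ">c\n", "A\n"]
    unwanted = read_headers(["b\n"])
    kept = list(drop_records(fasta, unwanted))
    removed = count_records(fasta) - count_records(kept)
    print(f"removed {removed} of {count_records(fasta)} records")
```

If the number of removed records is smaller than the number of IDs in the list, some listed headers did not match any record in the FASTA file, which is worth checking before using the filtered assembly.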
