# Contamination removal

* Contaminant contigs should be removed from the 10x contig sets before further scaffolding
* Kraken2 is used to identify contigs of microbial origin so that they can be removed
* Workflow adopted from Cecilia

**NOTE** from Cecilia: the DB I used was /input/powerPlant/appsdata/kraken2db/Refseq91 (dated Feb 2019). Building the DB took ages, so we don’t update it often. There is a newer version, RefSeq93, built about two weeks ago. I’d recommend using a server with a large amount of RAM to avoid memory issues when loading the DB.

## 1. Input contigs

In [1]:
mkdir 0101.Contamination.analysis

In [1]:
WORKDIR=/workspace/hraczw/github/GA/Bilberry_genome/0101.Contamination.analysis

Create a file named ContigFiles.txt that lists every contig set required for the analysis, one FASTA path per line:

In [2]:
cat 0101.Contamination.analysis/ContigFiles.txt

/workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Shasta_all_v0.2.0.assembly.racon_i1.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.fasta
/workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Shasta_all_v0.2.0.assembly.racon_i3.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.fasta
/workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Flye_all_trimmed_i1.assembly.racon_i3.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.fasta


## 2. Kraken

In [4]:
python << EOF

import os

# $WORKDIR is expanded by the shell because the heredoc delimiter (EOF) is unquoted
filename = '$WORKDIR/ContigFiles.txt'

with open(filename, 'r') as f:
    for line in f:
        fileName = line.strip()
        if not fileName:
            continue  # skip blank lines in ContigFiles.txt

        # e.g. Shasta_all_v0.2.0.assembly.racon_i1...fasta -> Shasta_racon_i1_pilon-2
        sampleName = fileName.split('/')[-1].split('_')[0] + '_racon' + fileName.split('racon')[1].split('.')[0] + '_pilon-2'

        # submit one Kraken2 job per contig set to LSF
        os.system('bsub -J kraken \
                   -m wkoppb50 \
                   -n 20 \
                   -o $WORKDIR/' + sampleName + '.out \
                   -e $WORKDIR/' + sampleName + '.err \
                   "/workspace/hraaxt/kraken2/kraken2 \
                    --output $WORKDIR/' + sampleName + '.kraken2.cut \
                    --unclassified-out $WORKDIR/' + sampleName + '.kraken2.unclassified.out \
                    --classified-out $WORKDIR/' + sampleName + '.kraken2.classified.out \
                    --report $WORKDIR/' + sampleName + '.kraken2.report.txt \
                    --use-names \
                    --db /input/powerPlant/appsdata/kraken2db/RefSeq93 \
                    --threads 20 \
                    --memory-mapping \
                    ' + fileName + '"')

EOF

# > .log/Try2.G3_2_S3.1.log 2>.log/Try2.G3_2_S3.1.err &


Job <213064> is submitted to default queue <lowpriority>.
Job <213065> is submitted to default queue <lowpriority>.
Job <213066> is submitted to default queue <lowpriority>.
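
The sample labels used for the output files are derived from each FASTA path by string splitting. As a minimal sketch of that naming logic, the block below isolates it in a function and builds the Kraken2 argument list without submitting anything. The function names `sample_name` and `kraken2_args` are hypothetical helpers introduced here for illustration; the flags and the database path are the ones used in the cell above.

```python
def sample_name(fasta_path):
    """Derive the short sample label from a contig FASTA path."""
    path = fasta_path.strip()
    assembler = path.split("/")[-1].split("_")[0]        # e.g. "Shasta"
    racon_round = path.split("racon")[1].split(".")[0]   # e.g. "_i1"
    return assembler + "_racon" + racon_round + "_pilon-2"


def kraken2_args(fasta_path, workdir, db, threads=20):
    """Build the kraken2 argument list with the same flags as the cell above."""
    prefix = workdir + "/" + sample_name(fasta_path)
    return [
        "kraken2",
        "--output", prefix + ".kraken2.cut",
        "--unclassified-out", prefix + ".kraken2.unclassified.out",
        "--classified-out", prefix + ".kraken2.classified.out",
        "--report", prefix + ".kraken2.report.txt",
        "--use-names",
        "--db", db,
        "--threads", str(threads),
        "--memory-mapping",
        fasta_path.strip(),
    ]


# Example with one of the file names from ContigFiles.txt (directory shortened)
example = ("200.Pilon.correction/Shasta_all_v0.2.0.assembly.racon_i1."
           "includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.fasta\n")
print(sample_name(example))
print(" ".join(kraken2_args(example, "WORKDIR",
                            "/input/powerPlant/appsdata/kraken2db/RefSeq93")))
```

Note that the label depends on the substring `racon` appearing exactly once before the polishing round; a path that contains `racon` elsewhere would produce a different label.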


## 3. Generate report

In [5]:
module load Krona/2.7

In [9]:
# generate Krona input files (contig ID <TAB> taxid) from the Kraken2 per-sequence output (.cut)

for i in Flye_racon_i3_pilon-2 Shasta_racon_i1_pilon-2 Shasta_racon_i3_pilon-2
do
   cutFile=$WORKDIR/$i'.kraken2.cut'
   outFile=$WORKDIR/$i'.kraken2.out'
   perl -lane '@a=split /\t/; if ($a[2] =~ /taxid\s+(\d+)/) {print "$a[1]\t$1";}' $cutFile > $outFile
done
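
For readers less familiar with Perl, the conversion performed by the one-liner above can be sketched in Python. This is only an illustrative equivalent, not part of the original workflow: `kraken_to_krona` is a hypothetical name, and the example lines are made up to mimic the tab-separated per-sequence output that Kraken2 writes with `--use-names`, where the third column has the form `name (taxid N)`.

```python
import re

# Matches the "(taxid 1234)" part of the third column
TAXID_PATTERN = re.compile(r"taxid\s+(\d+)")


def kraken_to_krona(lines):
    """Return (contig_id, taxid) pairs from Kraken2 per-sequence output lines."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a Kraken2 output line
        match = TAXID_PATTERN.search(fields[2])
        if match:
            rows.append((fields[1], match.group(1)))
    return rows


# Made-up example lines, for illustration only
example_lines = [
    "C\tcontig_1\tEscherichia coli (taxid 562)\t5000\t562:10",
    "U\tcontig_2\tunclassified (taxid 0)\t3000\t0:5",
]
for contig_id, taxid in kraken_to_krona(example_lines):
    print(contig_id + "\t" + taxid)
```

Like the Perl version, this keeps unclassified contigs (taxid 0) in the output.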

In [10]:
python << EOF

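import os

# $WORKDIR is expanded by the shell because the heredoc delimiter (EOF) is unquoted
filename = '$WORKDIR/ContigFiles.txt'

with open(filename, 'r') as f:
    for line in f:
        fileName = line.strip()
        if not fileName:
            continue  # skip blank lines in ContigFiles.txt

        # same sample label as in the Kraken step, e.g. Shasta_racon_i1_pilon-2
        sampleName = fileName.split('/')[-1].split('_')[0] + '_racon' + fileName.split('racon')[1].split('.')[0] + '_pilon-2'

        # submit one Krona job per sample to LSF
        os.system('bsub -J krona \
                   -o $WORKDIR/' + sampleName + '.krona.out \
                   -e $WORKDIR/' + sampleName + '.krona.err \
                   "ktImportTaxonomy \
                   $WORKDIR/' + sampleName + '.kraken2.out \
                   -i \
                   -o $WORKDIR/' + sampleName + '.kraken2.html \
                   -tax /software/bioinformatics/Krona-2.7/taxonomy/"')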

EOF

Job <213639> is submitted to default queue <lowpriority>.
Job <213640> is submitted to default queue <lowpriority>.
Job <213641> is submitted to default queue <lowpriority>.


## 4. Remove contaminant sequences

In [11]:
module load pfr-python3
module list

Currently Loaded Modulefiles:
  1) powerPlant/core     4) git/2.21.0          7) asub/2.1
  2) texlive/20151117    5) perlbrew/0.76       8) Krona/2.7
  3) pandoc/1.19.2       6) perl/5.28.0         9) pfr-python3/3.6.6


In [12]:
ls $WORKDIR/*.bacteria.txt

/workspace/hraczw/github/GA/Bilberry_genome/0101.Contamination.analysis/Flye_racon_i3_pilon-2.bacteria.txt
/workspace/hraczw/github/GA/Bilberry_genome/0101.Contamination.analysis/Shasta_racon_i1_pilon-2.bacteria.txt
/workspace/hraczw/github/GA/Bilberry_genome/0101.Contamination.analysis/Shasta_racon_i3_pilon-2.bacteria.txt
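
The notebook does not show how the `*.bacteria.txt` lists were produced. One possible way, sketched below purely as an assumption, is to collect every taxid nested under "Bacteria" in the Kraken2 report (where the indentation of the name column encodes the taxonomic tree) and then select the contigs assigned to any of those taxids in the per-sequence output. The function names and the example data are hypothetical; the actual lists may have been built differently, for example by manual inspection of the Krona charts.

```python
import re

TAXID_PATTERN = re.compile(r"taxid\s+(\d+)")


def clade_taxids(report_lines, clade_name="Bacteria"):
    """Collect the taxids of `clade_name` and of everything nested under it."""
    taxids = set()
    clade_depth = None
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        name = fields[5]
        depth = len(name) - len(name.lstrip(" "))  # indentation = tree depth
        if clade_depth is None:
            if name.strip() == clade_name:
                clade_depth = depth
                taxids.add(fields[4].strip())
        elif depth > clade_depth:
            taxids.add(fields[4].strip())
        else:
            break  # back at the clade's own level or above: the clade has ended
    return taxids


def contigs_in_clade(kraken_lines, taxids):
    """Return contig IDs whose assigned taxid is in `taxids`."""
    ids = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        match = TAXID_PATTERN.search(fields[2])
        if match and match.group(1) in taxids:
            ids.append(fields[1])
    return ids


# Made-up report and per-sequence lines, for illustration only
report = [
    " 25.00\t1\t1\tU\t0\tunclassified",
    " 75.00\t3\t0\tR\t1\troot",
    " 75.00\t3\t0\tR1\t131567\t  cellular organisms",
    " 50.00\t2\t0\tD\t2\t    Bacteria",
    " 50.00\t2\t2\tP\t1224\t      Proteobacteria",
    " 25.00\t1\t1\tD\t2759\t    Eukaryota",
]
kraken = [
    "C\tcontig_1\tProteobacteria (taxid 1224)\t5000\t1224:10",
    "C\tcontig_2\tEukaryota (taxid 2759)\t8000\t2759:20",
    "U\tcontig_3\tunclassified (taxid 0)\t3000\t0:5",
]
print("\n".join(contigs_in_clade(kraken, clade_taxids(report))))
```

Writing the printed IDs to a file would give a list in the one-ID-per-line form that a header-based FASTA filter can consume.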


In [15]:
python /workspace/hraczw/scriptomics/filter_fasta_by_list_of_headers.py \
/workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Shasta_all_v0.2.0.assembly.racon_i1.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.fasta \
$WORKDIR/Shasta_racon_i1_pilon-2.bacteria.txt > \
/workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Shasta_all_v0.2.0.assembly.racon_i1.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.noBacteria.fasta

In [14]:
grep -c '>' /workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Shasta_all_v0.2.0.assembly.racon_i1.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.fasta

3003


In [16]:
grep -c '>' /workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Shasta_all_v0.2.0.assembly.racon_i1.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.noBacteria.fasta

2989
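
The two counts differ by 3003 - 2989 = 14 contigs. A simple sanity check, sketched here with hypothetical helper names, is to confirm that this difference equals the number of distinct IDs in the corresponding `.bacteria.txt` list. Unlike `grep -c '>'`, which counts any line containing `>`, this sketch counts only lines that start with `>`.

```python
def count_headers(fasta_lines):
    """Count FASTA records, i.e. lines that start with '>'."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def removal_is_consistent(n_before, n_after, listed_ids):
    """True if the number of removed records equals the number of distinct listed IDs."""
    distinct = {i.strip() for i in listed_ids if i.strip()}
    return n_before - n_after == len(distinct)


# Toy example: three records before, two after, one ID listed for removal
before = [">a", "ACGT", ">b", "GGCC", ">c", "TTAA"]
after = [">a", "ACGT", ">c", "TTAA"]
print(removal_is_consistent(count_headers(before), count_headers(after), ["b\n"]))
```

If the check fails, some listed IDs may not match any header exactly, or the list may contain IDs that are absent from the FASTA file.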


In [17]:
python /workspace/hraczw/scriptomics/filter_fasta_by_list_of_headers.py \
/workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Shasta_all_v0.2.0.assembly.racon_i3.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.fasta \
$WORKDIR/Shasta_racon_i3_pilon-2.bacteria.txt > \
/workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Shasta_all_v0.2.0.assembly.racon_i3.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.noBacteria.fasta

In [19]:
python /workspace/hraczw/scriptomics/filter_fasta_by_list_of_headers.py \
/workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Flye_all_trimmed_i1.assembly.racon_i3.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.fasta \
$WORKDIR/Flye_racon_i3_pilon-2.bacteria.txt > \
/workspace/hraczw/github/GA/Bilberry_genome/200.Pilon.correction/Flye_all_trimmed_i1.assembly.racon_i3.includeUnpolished.gpu_pilon-1_corrected_pilon-2_corrected.noBacteria.fasta
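
The script `filter_fasta_by_list_of_headers.py` is not shown in this notebook. Judging by the drop in record count for the first Shasta assembly (3003 to 2989), it removes the records whose headers appear in the list. The block below is a minimal sketch of that kind of filter and is not the author's script: the function name `drop_listed_records` is hypothetical, and matching on the first whitespace-delimited token of the header is an assumption.

```python
def drop_listed_records(fasta_lines, headers_to_drop):
    """Yield FASTA lines, skipping every record whose ID is in `headers_to_drop`."""
    drop = {h.strip().lstrip(">") for h in headers_to_drop if h.strip()}
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            record_id = tokens[0] if tokens else ""
            keep = record_id not in drop  # decision applies until the next header
        if keep:
            yield line


# Toy example: remove contig_2 from a three-record FASTA
fasta = [
    ">contig_1 len=4\n", "ACGT\n",
    ">contig_2 len=4\n", "GGCC\n",
    ">contig_3 len=4\n", "TTAA\n",
]
print("".join(drop_listed_records(fasta, ["contig_2\n"])), end="")
```

Because the function is a generator, it can be fed an open file handle and written out line by line without holding a whole assembly in memory.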