This repo contains the code to reproduce the analysis of the unkown trematoda DNA, for the following study:
Krupenko D.*, Miroliubov A.*, Kryukov E.*, Faure L.*, Minemizu R., Haag L., Lundgren M., Kameneva P., Kastriti M. E., Adameyko I.
Polymorphic parasitic larvae cooperate to build swimming colonies luring hosts
Current Biology, (2023) https://doi.org/10.1016/j.cub.2023.08.090
It is recommended to set the environment using conda
or mamba
.
mamba create -n trematoda -c bioconda -c conda-forge seqkit=2.3.1 \
trimmomatic=0.36 spades=3.11.1 bwa=0.7.17 blast=2.12.0 \
barrnap=0.9 python=3.9 bamtocov=2.7.0 seqtk=1.4 \
make=4.3 gxx=12.2.0 --yes
mamba activate trematoda
pip install snakemake ffq gget
git clone https://github.com/LouisFaure/pleorchis_project
cd pleorchis_project
It is recommended to download the original fastq files from AWS, the aws
tool must be installed and credentials must be setup.
mkdir -P data && cd data
# Pipe AWS links to aws s3 cp and download
ffq --aws SRR23342063 | grep -Eo '"url": "[^"]*"' | grep -o '"[^"]*"$' | grep -0 "fastq.gz" | xargs -I {} aws s3 cp {} .
ffq --aws SRR23342064 | grep -Eo '"url": "[^"]*"' | grep -o '"[^"]*"$' | grep -0 "fastq.gz" | xargs -I {} aws s3 cp {} .
ffq --aws SRR23342065 | grep -Eo '"url": "[^"]*"' | grep -o '"[^"]*"$' | grep -0 "fastq.gz" | xargs -I {} aws s3 cp {} .
# Remove trailing '.1'
for f in *; do mv "$f" "${f::-2}"; done
cd ..
Data from passengers was of poor quality (not much reads associated to illumina indexes recovered), hinting on a possible issue while prepary libraries. As the 2 libraries were separately loaded into each lane of the Illumina SP flow cell according to Xp workflow, we could recover the data from Undetermined reads from lane 1 as data containing sequences from passengers sample.
cd data
mv Undetermined_S0_L001_R1_001.fastq.gz Passengers_S0_L001_R1_001.fastq.gz
mv Undetermined_S0_L001_R2_001.fastq.gz Passengers_S0_L001_R2_001.fastq.gz
cd ..
This is the main part of the analysis, and also the most time consuming and compute intensive, the following steps are applied:
- Trim raw reads with
trimmomatic
. - Do a first pass of
spades
, for error correction of the trimmed reads.
- For Undetermined, add one step of
trimmomatic
, trimming reads using illumina.fa provided.
- Do a second pass of
spades
, this time for assembly of contigs. - Filter contigs by removing short ones.
snakemake -j 80 contigs_500-Sailors_S2_L002-seq.fasta \
contigs_500-Passengers_S0_L001-seq.fasta
The outputs are fasta files containing all contigs that are at least 500bp long.
All the contigs of the high quality "Sailors" sample are BLASTed against SSU and LSU sequences of SILVA database, restricted to Holozoa clade, only the ones with hit are kept.
snakemake -j 40 main_contig-Sailors_S2_L002-seq.fasta \
main_contig-Passengers_S0_L001-seq.fasta
The output is a single contig in fasta format, as we have found out that only one (the longest) was having multiple hits with holozoan LSU/SSU.
From that main contig, we predict ribosomal DNA seqs using HMMER 3.1 (using barrnap wrapper).
snakemake -j 1 main_contig-Sailors_S2_L002-rRNA.fasta \
main_contig-Passengers_S0_L001-rRNA.fasta
The output sequences are the ones that have been submitted to GenBank.
Here, we re-map raw corrected reads against newly found main contig.
snakemake -j 40 main_contig-Sailors_S2_L002-mapping.bam
The output bam and bai files can be visualised in IGV browser.
snakemake -j 1 main_contig-Sailors_S2_L002-blast_nt.csv
Here we use sequences included in fasta files pleorchis_rDNA.fasta
and goliath_rDNA.fasta
snakemake -j 1 main_contig-Sailors_S2_L002-blast-goliath.txt main_contig-Sailors_S2_L002-blast-pleorchis.txt
Warning This part will take a very long time to run, as it will download the whole nucleotide database from ncbi!
snakemake -j 40 contigs_500-Sailors_S2_L002-fcounts_classified.csv \
contigs_500-Passengers_S0_L001-fcounts_classified.csv
snakemake -j 1 common_hits.csv