Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensuring to find the correct file after Demultiplexing my 16S amplicon raw dataset with combinatorial dual indexes cutadapt command #776

Open
JayalalKJ opened this issue Apr 16, 2024 · 5 comments

Comments

@JayalalKJ
Copy link

JayalalKJ commented Apr 16, 2024

  • Cutadapt and Python version used: 4.8
  • The method you used to install the tool (conda or pip, for example): Used mamba
  • The command-line parameters you used for demultiplexing paired-end reads with combinatorial dual indexes cutadapt command:
cutadapt \
    -e 0.15 --no-indels \
    -g ^file:barcodes_fwd.fasta \
    -G ^file:barcodes_rev.fasta \
    -o {name1}-{name2}.1.fastq.gz -p {name1}-{name2}.2.fastq.gz \
    input.1.fastq.gz input.2.fastq.gz

Here are some of the output generated from the code above:

LIB1_sample_189-LIB1_sample_183.1.fastq.gz
LIB1_sample_189-LIB1_sample_183.2.fastq.gz
LIB1_sample_189-LIB1_sample_184.1.fastq.gz
LIB1_sample_189-LIB1_sample_184.2.fastq.gz
LIB1_sample_189-LIB1_sample_185.1.fastq.gz
LIB1_sample_189-LIB1_sample_185.2.fastq.gz
LIB1_sample_189-LIB1_sample_186.1.fastq.gz
LIB1_sample_189-LIB1_sample_186.2.fastq.gz
LIB1_sample_189-LIB1_sample_187.1.fastq.gz
LIB1_sample_189-LIB1_sample_187.2.fastq.gz
LIB1_sample_189-LIB1_sample_188.1.fastq.gz
LIB1_sample_189-LIB1_sample_188.2.fastq.gz
LIB1_sample_189-LIB1_sample_189.1.fastq.gz
LIB1_sample_189-LIB1_sample_189.2.fastq.gz
LIB1_sample_189-LIB1_sample_190.1.fastq.gz
LIB1_sample_189-LIB1_sample_190.2.fastq.gz
  1. What is the correct combination of fastq.gz files for further analysis? I have a total of 230 samples.
  2. How can I pick those right files from the combination pool folder?
@JayalalKJ JayalalKJ changed the title Ensuring the correct path or file after Demultiplexing my 16S amplicon raw dataset Ensuring to find the correct file after Demultiplexing my 16S amplicon raw dataset with combinatorial dual indexes cutadapt command Apr 16, 2024
@marcelm
Copy link
Owner

marcelm commented Apr 16, 2024

How many barcodes are in barcodes_fwd.fasta and barcodes_rev.fasta, respectively? Are you sure you have combinatorial dual indices? According to Illumina’s documentation, you can have up to 96 samples, so I wonder how the 230 samples fit in there.

@JayalalKJ
Copy link
Author

JayalalKJ commented Apr 17, 2024

Thanks for your question. It was 160 (set B- 96 and the rest in Set C)

kit used: NextFlex Rapid XP kit

Forward primer: GTGCCAGCMGCCGCGGTAA
Reverse primer: GGACTACHVGGGTWTCTAAT

Two files have been received.
sample1_1_L001_R1_001.fastq sample1_1_L001_R2_001.fastq
sample2_1_L001_R1_001.fastq sample2_1_L001_R2_001.fastq

head -n 4 sample2_1_L001_R1_001.fastq
@A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
GTGCCAGCCGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCTCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
+A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
,,:FFF,F,:FF:F:F:FFFF:F,F,FF:,FFF:F,,,,:FFF,F,,:FF::F:,:FFFF:F:F:F,FF::F:FF:,FFFFF,FF,,F,F,,FFFFFFFFF::FF:FF:FFF:FF:FF,,:,:FF:FF:::FF:FF,F:FFF,:F,F:FFFFF:F,:F,FFFFFFF:FFFFF:FFFFFFFF:F:FFFFFF:FFFFFFFFFFFFFFFFF::FFFFF:FFFFFFF:FFF,FFFFFFFFFFFFFF

How many barcodes are in barcodes_fwd.fasta and barcodes_rev.fasta, respectively?

Two 96 plates: B, and C barcode , example see below:

In Plate B, the barcode combination is as follows:
"LIB1" "LIB1_sample_001" "AACAAGCC:GGAATGAG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01A"
"LIB1" "LIB1_sample_002" "TTACCGCT:CCAGTATG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01B"
"LIB1" "LIB1_sample_003" "TTACGCCA:TCATAGCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01C"
"LIB1" "LIB1_sample_004" "GTGATCTC:ACTTGGCT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01D"
"LIB1" "LIB1_sample_005" "GGATAGCA:CAGTAGAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01E"
"LIB1" "LIB1_sample_006" "AACCTCAG:GGATGATC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01F"
"LIB1" "LIB1_sample_007" "CAACCTCA:GGATTCGA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01G"
"LIB1" "LIB1_sample_008" "CGCCAATT:TAGCAAGG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01H"
"LIB1" "LIB1_sample_009" "GGAATGAG:AATTGCCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02A"
"LIB1" "LIB1_sample_010" "CCAGTATG:TGAGATGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02B"
"LIB1" "LIB1_sample_011" "TCATAGCG:TGAGGACA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02C"

In Plate C, the barcode combination is as follows:
"LIB1" "LIB1_sample_001" "AACAAGCC:AATTGCCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01A"
"LIB1" "LIB1_sample_002" "TTACCGCT:TGAGATGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01B"
"LIB1" "LIB1_sample_003" "TTACGCCA:TGAGGACA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01C"
"LIB1" "LIB1_sample_004" "GTGATCTC:CGATACAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01D"
"LIB1" "LIB1_sample_005" "GGATAGCA:TAGCCACT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01E"
"LIB1" "LIB1_sample_006" "AACCTCAG:TATCTGGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01F"
"LIB1" "LIB1_sample_007" "CAACCTCA:TTAGGCAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01G"
"LIB1" "LIB1_sample_008" "CGCCAATT:GTCACAGA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01H"
"LIB1" "LIB1_sample_009" "GGAATGAG:AACAAGCC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02A"
"LIB1" "LIB1_sample_010" "CCAGTATG:TTACCGCT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02B"
"LIB1" "LIB1_sample_011" "TCATAGCG:TTACGCCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02C"
"LIB1" "LIB1_sample_012" "ACTTGGCT:GTGATCTC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02D"
"LIB1" "LIB1_sample_013" "CAGTAGAC:GGATAGCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02E"
"LIB1" "LIB1_sample_014" "GGATGATC:AACCTCAG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02F"
"LIB1" "LIB1_sample_015" "GGATTCGA:CAACCTCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02G"

I combined plates B and C and created barcodes_fwd.fasta and barcodes_rev.fasta files for each sample number. This means that plate B starts with LIB1_sample_001 and ends with LIB1_sample_096, while plate C starts with LIB1_sample_097 and ends with LIB1_sample_160.

barcodes_fwd.fasta

LIB1_sample_001
AACAAGCC
LIB1_sample_002
TTACCGCT
LIB1_sample_003
TTACGCCA
LIB1_sample_004
GTGATCTC
LIB1_sample_005
GGATAGCA
LIB1_sample_006
AACCTCAG
LIB1_sample_007
CAACCTCA
...................................
LIB1_sample_160

barcodes_rev.fasta

LIB1_sample_001
GGAATGAG
LIB1_sample_002
CCAGTATG
LIB1_sample_003
TCATAGCG
LIB1_sample_004
ACTTGGCT
LIB1_sample_005
CAGTAGAC
LIB1_sample_006
GGATGATC
LIB1_sample_007
GGATTCGA
...............................
LIB1_sample_160

@marcelm
Copy link
Owner

marcelm commented Apr 18, 2024

I still don’t have a clear picture of how your data is structured. I believe you need to understand this yourself before you can proceed. In particular, you need to figure out where the index sequences are.

First, to be explicit: There is a difference between unique dual indexing and combinatorial indexing. Unless you have reliable information that combinatorial indexing was used, it is more likely that unique dual indexing was done.

kit used: NextFlex Rapid XP kit

I am not familiar with it, but the manual for version 2 of that kit talks about Unique Dual Indices:

In addition, the availability of up to 1,536 different
Unique Dual Index adapter barcodes facilitates high-throughput applications.

According to the same manual, the UDIs need to be bought separately, so it would still possible to use combinatorial indexing, but that would be against Illumina’s own advice:

Illumina recommends using unique dual indexing (UDI) whenever possible for the most accurate demultiplexing.

Your first read looks like this:

head -n 4 sample2_1_L001_R1_001.fastq

@A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
GTGCCAGCCGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCTCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
...
  • The first bases are GTGCCAGCCGCCGCGGTAA, which matches your forward primer. That means the indices are not in the beginning of R1 and you therefore cannot use -g ^file:... to demultiplex because that option is for providing "anchored 5' adapters", which must be at the 5' end of the read.
  • The CGCTGCTC+GATCTGCC in the FASTQ header are index sequences. This information is added by the Illumina software that produced the FASTQ file. This means that the software has already done demultiplexing. (It is not impossible that two levels of multiplexing were done, though.)
  • When I BLAST the read sequence above, it yields a 100% match against 16S RNA, that is, there is no index sequence at all in the read.

Two files have been received. sample1_1_L001_R1_001.fastq sample1_1_L001_R2_001.fastq sample2_1_L001_R1_001.fastq sample2_1_L001_R2_001.fastq

Do all reads in the first file contain CGCTGCTC+GATCTGCC? What about the second file?

@JayalalKJ
Copy link
Author

JayalalKJ commented Apr 19, 2024

Hello there

Do all reads in the first file contain CGCTGCTC+GATCTGCC? What about the second file?
Yes.

Checking LIB1_L2_1.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_1_L2_1.fq: 23013460
Total number of reads in LIB1_L2_1.fq: 23013460
Checking LIB1_L2_2.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_L2_2.fq: 23013460
Total number of reads in LIB1_L2_2.fq: 23013460
Checking LIB1_L1_1.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_L1_1.fq: 24176149
Total number of reads in LIB1_L1_1.fq: 24176149
Checking LIB1_L1_2.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_L1_2.fq: 24176149
Total number of reads in LIB1_L1_2.fq: 24176149

I had given a bad example,: here is right one from L001_1.fastq

In this string 'TTACCGCTGTGCCAGCAGCCGCGGTAA' of of L001_1, the first eight nucleotides are barcodesTTACCGCT, and the rest is forward primer.

@A00783:1516:HWGV2DRX3:1:2145:3992:12289 1:N:0:CGCTGCTC+GATCTGCC
TTACCGCTGTGCCAGCAGCCGCGGTAACACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCGCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
+
FFFFFF:FFFF:FFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFF:FFFF,FFFFFFFFFF:FFFFF::FFFFF:FFFF:FFFFFF,,FFF,FFFFFF:FF:FFF:FFF:FFFFFFFFFFF:FF:FFFFFFFFF

@A00783:1516:HWGV2DRX3:1:2146:4408:26099 1:N:0:CGCTGCTC+GATCTGCC
GCCTTCGCCTCTGGTGTTCTTCCATATATCTACGCATTCCACCGCTCCACATGGAGTTCCAAAACTCTCTACTATACTCTAGACTTACAGTTTCCAATGCAAACAACCGTTAAGCAGTTGTCTTTAACATCAGACTTGTAAATCCGCCTGCGCCCGCTTTACGCCCAGTAAATCCGGATAACGCTTGCATCCTATGTGTTACCGCTGTGCCAGCAGCCGCGGTAACAGATCGGAAGAGCACACGTCTGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:,FF:FFFFFFFFFFF

@A00783:1516:HWGV2DRX3:1:2147:20889:26741 1:N:0:CGCTGCTC+GATCTGCC
ATTACCGCTGTGCCAGCAGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCGCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGAGATATGTAAGTCCCCCGGAGGCGAACGAGACACCTGAGGCGATCTCTGACGC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFFFFFF,FFFFFFFFFFFF,FF::F,F,:,,:,,F,:,,F,FF,,,,FF,F,F,,:F:,,F,,,F,,F,FF

@A00783:1516:HWGV2DRX3:1:2151:12608:26271 1:N:0:CGCTGCTC+GATCTGCC
AGCCTAAGTTTTCGCCTTCGCCTCTGGTGTTCTTCCATATATCTACGCATTCCACCGCTCCACATGGAGTTCCAAAACTCTCTACTATACTCTAGACTTACAGTTTCCAATGCAAACAACCGTTAAGCAGTTGTCTTTAACATCAGACTTGTAAATCCGCCTGCGCTCGCTTTACGCCCAGTAAATCCGGATAACGCTTGCATCCTATGTATTACCGCTGTGCCAGCAGCCGCGGTAAAGATCGGAAGAG

L001_1.fastq

In this string 'TTACCGCTGGACTACAGGGGTATCTAAT' of L001_1 , the first eight nucleotides are barcodesTTACCGCT, and the rest is reverse primer.

@A00783:1516:HWGV2DRX3:1:2116:7536:34084 1:N:0:CGCTGCTC+GATCTGCC
TTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACCTGAGCGTCAGTCTTCGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACGAGACTCAAGCTTGCCAGTATCAGATGCAGTTCCCAGGTTGAGCCCGGGGATTTCACATCTGACTTAACAAACCGCCTGCGTGCGCTTTACGCCCA
+
,FF:FFF:FFFFFFFF:F::FFFFFFF::F:F:FFFFFFFF:FFFF:FFFF:FFFFFF:FFFFF:FFF:FFF:FFFFFFFFFF:FFFFFFFFFFFF:FFFFF:FFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFF:,FFFFFFF:FFFFFFFFFFF::F,,F:FF,FFFF:FFF:FF::FF:F,FF::F::FF:,FFFF:::FFF:FF,FFFF:FFF:FF,F:F:,,FFFFF:FFFF,,FFFF

@A00783:1516:HWGV2DRX3:1:2117:22815:14325 1:N:0:CGCTGCTC+GATCTGCC
CTTGTTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGAGCCTCAGTGTCAGTATGATGCCAGGAAGCTGCCTTCGCCATCGGTATTCCTTCAGATCTCTACGCATTTCACCGCTACACCTGAAATTCTACTTCCCTCTCACCTACTCTAGCCTAACAGTTTCAGATGCAGTTCCCAGGTTAAGCCCGGGGATTTCACATCTGACTTATCAAGCCACCTACGCTCGCTTTACG
+
FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFF

@A00783:1516:HWGV2DRX3:1:2121:20509:32659 1:N:0:CGCTGCTC+GATCTGCC
CACCTTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTCCCCACACTTTCGCACCTCAGCGTCAGTATCGAGCCAGTGAGCCGCCTTCGCCACTGGTGTTCCTCCGAATATCTACGAATTTCACCTCTACACTCGGAATTCCACTCACCTCTCTCGAACTCAAGACCAGGAGTTTACAAGGCAGTTCCAGGGTTGAGCCCTGGGATTTCACCCTATACTTTCTGATCCGCCTACGTGCGCTTTACG

@marcelm
Copy link
Owner

marcelm commented Apr 22, 2024

What is the correct combination of fastq.gz files for further analysis? I have a total of 230 samples.
How can I pick those right files from the combination pool folder?

Do you know which 230 barcode combinations are possible? Then you look for only those in the following way:

  • Create a file forward.fasta and one reverse.fasta. The first lists all 230 forward barcodes and the other the 230 corresponding reverse barcodes. That is, the first barcode in forward.fasta and the first in reverse.fasta are the barcodes that are used for the first sample, the second from both are the barcodes for the second sample etc.
  • Use the --pair-adapters option (see the documentation) and use {name} in the file name template.
  • Cutadapt will produce 230 files, one for each sample.

If you do not know which combinations are possible, just use the 230 biggest files. That may not be entirely correct, but it may be good enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants