Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sc/snATAC-seq barcode correction #754

Open
sknaack opened this issue Jan 10, 2024 · 5 comments
Open

sc/snATAC-seq barcode correction #754

sknaack opened this issue Jan 10, 2024 · 5 comments

Comments

@sknaack
Copy link

sknaack commented Jan 10, 2024

Hi Cutadapt team,

This is not a standard "issue" but an inquiry regarding sn/scATAC-seq data preprocessing (10x Genomics multiome). I'm using Cutadapt to trim snATAC-seq data reads before alignment with bwa-mem2, and it certainly works admirably, and the parallelization is nicely implemented. My concern relates to barcode assessment and correction.

In the workflow I've been testing barcode correction is done in a separate preprocessing step. The three .fastq files (R1 R2 and I1) are read into an executable and 1) the sequenced barcode is extracted from the correct position in the data for each read, 2) compared to a whitelist of anticipated barcodes (from 10x), and 3) the data are written out with the corrected/mapped barcodes as tags in the read names. Those output fastq files are passed to Cutadapt and ultimately the corrected barcode tags on the read names are used in the downstream analysis after alignment.

Such a barcode correction step appears distinct from the demultiplexing options described in the online documentation, and I am inferring this is not currently an anticipated usage for Cutadapt. Is that accurate? Nevertheless, it may be surprisingly easy to implement and a nice extension to be able to fully preprocess sn/scATAC-seq data in one step. Is this a functionality you are already implementing, or open to including? On what time scale? Thanks in advance!

Best regards,

Sara Knaack

@marcelm
Copy link
Owner

marcelm commented Jan 10, 2024

I’m not familiar with how sn/scATAC-seq data looks and admit I don’t have the time to find this out myself right now. I’ll comment a bit based on what I learned recently in a project in which I had to trim some scRNAseq data and if this doesn’t apply here, please get back to me with an explanation of how the sn/scATAC-seq reads are structured.

For my project, R1 contained multiple index sequences (or barcodes) and a UMI. Let’s assume the first index is right at the 5' end. Given a list of index sequences in indices.fasta, I used a command like this to find the index sequence and to add its name to the read header:

cutadapt -g ^file:indices.fasta --rename '{header} ligation_index={r1.adapter_name}' ...

The code is public.

The --rename option is meant for this type of usage.

If you don’t want the name of the adapter/index/barcode in the header, you can use {r1.match_sequence} instead of {r1.adapter_name}. match_sequence is the sequence from the read, so it includes errors. If you want to do "error correction" of the barcode sequence, you would currently have to use r1.adapter_name and provide a FASTA file where the sequence names are the same as the sequences, like this:

>AAAA
AAAA
>CCCC
CCCC

I would be open to extending the --rename option to make this easier (adding something like a {adapter_sequence} placeholder which would get replaced with the original sequence).

Cutadapt cannot read I1/I2 files at the moment, see issue #491. In my case, the I1 sequence was already in the read header (of R1 and R2), so I didn’t need to read the I1 file at all.

@sknaack
Copy link
Author

sknaack commented Jan 10, 2024

Hi Marcel,

Thanks much for the quick reply. Indeed what you describe makes perfect sense for snRNA-seq. For sc/snATAC-seq data the sequenced barcode is exclusively in the I1 data file as far as I can tell. If accessing that isn't readily supported it would understandably take some finagling. On some time scale it could be worth checking into, it'd be a nice enhancement. Thanks!

Sara

@marcelm
Copy link
Owner

marcelm commented Jan 11, 2024

I would like to understand what exactly you would need Cutadapt to do. Could you provide a (toy) example of an input read triple (R1, R2, I1) plus the expected output?

@sknaack
Copy link
Author

sknaack commented Jan 11, 2024

Hi Marcel,

That's very kind. Indeed I can do so readily to illustrate. I should first say I slightly misspoke/mistyped. Technically the three files are labeled R1, R2 and R3, but the R1 and R3 become the paired input/output to Cutadapt. The middle file (I1 as I was thinking of it, technically R2 in the naming convention) contains the 16 bp barcode string that represents the cell ID. What is called for is to take the reverse complement of the last 16 nt from the R2 string, and compare that to the whitelisted set of barcodes, and apply a barcode correction to a Hamming distance of 1. This form of correction is standard of course, and would putatively be implemented amidst the demultiplexing code.

Here are two example raw triplicate reads from the raw data:

File R1:
@lh00134:29:2235HVLT3:2:1101:2396:1048 1:N:0:ACCGAACA
CNCATGGAGAACCTGTACTAGGACAATATGAAGGGGAAATGTGGGATTGGAGCCCCCACACAAGTCCCCACTGGGGCACTGTCTAGTGGAGCTGGGAGAAGAGGGCCATCATCCTCCAGACCCTGGAATGGTAGAGCCACTGGCAGTTTG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFF5-F--FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFF
@lh00134:29:2235HVLT3:2:1101:3117:1048 1:N:0:ACCGAACA
ANCCTGGGCAACACAGTGAAACCTTGTCTCTACAAAAAATGCAAAACTTAGTGGGGCATGGTAGCATGAGCCTATGGTCCCAGCTACTCGAGAGGCTGAGGTGGGAGGATGGCTTAAGCCCAGGAGGCGGAGGCTGCAGTGAGCCGAGAT
+
F#F5FFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFF--FFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFF

File R2:
@lh00134:29:2235HVLT3:2:1101:2396:1048 2:N:0:ACCGAACA
CAGACGCGCTAATGCTGACCGCTA
+
FFFFFFFFFFFFFFFFFFFFFFFF
@lh00134:29:2235HVLT3:2:1101:3117:1048 2:N:0:ACCGAACA
CAGACGCGGCAGGCATGAGCCAAT
+
-FFFFFFFFFF--FFFFFF-FF-F

File R3:
@lh00134:29:2235HVLT3:2:1101:2396:1048 3:N:0:ACCGAACA
GNCTTTTCCAACTACAGGGTGCAAACTGCCAGTGGCTCTACCATTCCAGGGTCTGGAGGATGATGGCCCTCTTCTCCCAGCTCCACTAGACAGTGCCCCAGTGGGGACTTGTGTGGGGGCTCCAATCTCACATTTCCCCTTCATATTGTC
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFF----FFFFFFFFFFFFFF-FF-F
@lh00134:29:2235HVLT3:2:1101:3117:1048 3:N:0:ACCGAACA
ANATAGGGTCTTGCTCTGTCGCCCAAGATGGAGTGCAGTGGCATGATCTCGGCTCACTGCAGCCTCCGCCTCCTGGGCTTAAGCCATCCTCCCACCTCAGCCTCTCGAGTAGCTGGGACCATAGGCTCATGCTACCATGCCCCACTAAGT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFF-FFFF-FFFFFFFF--FFFFFFFFFFFFFFFFFF-FFF-FFF-F

The anticipated output format is two fastq files with the uncorrected and corrected barcode as read name tags, with appropriate sequence trimming (these examples weren't actually changed):

Corrected file R1:
@lh00134:29:2235HVLT3:2:1101:2396:1048 CR:Z:TAGCGGTCAGCATTAG	CB:Z:TAGCGGTCAGCATTAG
CNCATGGAGAACCTGTACTAGGACAATATGAAGGGGAAATGTGGGATTGGAGCCCCCACACAAGTCCCCACTGGGGCACTGTCTAGTGGAGCTGGGAGAAGAGGGCCATCATCCTCCAGACCCTGGAATGGTAGAGCCACTGGCAGTTTG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFF5-F--FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFF
@lh00134:29:2235HVLT3:2:1101:3117:1048 CR:Z:ATTGGCTCATGCCTGC	CB:Z:ATTCGCTCATGCCTGC
ANCCTGGGCAACACAGTGAAACCTTGTCTCTACAAAAAATGCAAAACTTAGTGGGGCATGGTAGCATGAGCCTATGGTCCCAGCTACTCGAGAGGCTGAGGTGGGAGGATGGCTTAAGCCCAGGAGGCGGAGGCTGCAGTGAGCCGAGAT
+
F#F5FFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFF--FFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFF

Corrected file R3:
@lh00134:29:2235HVLT3:2:1101:2396:1048 CR:Z:TAGCGGTCAGCATTAG	CB:Z:TAGCGGTCAGCATTAG
GNCTTTTCCAACTACAGGGTGCAAACTGCCAGTGGCTCTACCATTCCAGGGTCTGGAGGATGATGGCCCTCTTCTCCCAGCTCCACTAGACAGTGCCCCAGTGGGGACTTGTGTGGGGGCTCCAATCTCACATTTCCCCTTCATATTGTC
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFF----FFFFFFFFFFFFFF-FF-F
@lh00134:29:2235HVLT3:2:1101:3117:1048 CR:Z:ATTGGCTCATGCCTGC	CB:Z:ATTCGCTCATGCCTGC
ANATAGGGTCTTGCTCTGTCGCCCAAGATGGAGTGCAGTGGCATGATCTCGGCTCACTGCAGCCTCCGCCTCCTGGGCTTAAGCCATCCTCCCACCTCAGCCTCTCGAGTAGCTGGGACCATAGGCTCATGCTACCATGCCCCACTAAGT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFF

Hopefully this clarifies what I mean. Here the CR tag is the extracted reverse-complement of the last 16 nt from the R2 file, and the CB tag is the whitelisted barcode sequence it is mapped to. Naturally some nominal statistics about the correction and how many reads are lost and how many barcodes are used or not would also be helpful for QC and documentation. This is the jist of what I understand sc/snATAC-seq would require, at least for our 10x multiome data. The precise length, location and orientation of the barcode is what would vary in differing platforms/chemistries.

Best,

Sara

@sknaack
Copy link
Author

sknaack commented Jan 16, 2024

Hi Marcel, A quick follow up, I hope the example .fastq lines above were helpful. I understand if it's not 100% doable off hand, but I'll nevertheless be intrigued if you happen to think it's possible at some point. Thank you again for obliging my inquiry! Best, Sara

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants