group method: not all UMI-containing reads assigned UG tag #635

epiliper · 2024-03-20T23:37:39Z

Hi! After running group on a sorted bamfile of roughly 3.4 million reads, I noticed that only half (~1.6 million reads) were actually given UG tags.

I checked this by running samtools view -d UG on the grouped file and counting the number of reads meeting these filter conditions.

All reads in the bamfile I used should contain UMIs. Is this expected behavior, and if so, what might cause some reads to not get assigned to a read group?

Thanks in advance for your patience; I'd really appreciate any help/explanation.

I'm a new user so apologies if I'm missing the obvious.

The text was updated successfully, but these errors were encountered:

TomSmithCGAT · 2024-03-21T10:49:29Z

Hi @epiliper - Could you please post the umi_tools group command used. Cheers

epiliper · 2024-03-21T11:07:18Z

Command used:

umi_tools group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=":" --paired -S UMI3-RSV90-10_S2_umi.bam

Here is the QNAME of a typical read in my input bamfile:

M04202:286:000000000-L5TBR:1:1102:16626:7061:TGTCAGGCTAT

Where TGTCAGGCTAT is the UMI.

When I run the above command, it runs with these options:

UMI-tools version: 1.1.4
 output generated by group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=: --paired -S UMI3-RSV90-10_S2_umi.bam
job started at Thu Mar 21 04:04:31 2024 on MacBeth.users.gowaveg.com -- fb9c6e30-e5bd-4b2f-9b5f-62ccf0d1abd9
pid: 40348, system: Darwin 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:43 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T6000 arm64
assigned_tag                            : None
cell_tag                                : None
cell_tag_delim                          : None
cell_tag_split                          : -
chimeric_pairs                          : use
chrom                                   : None
compresslevel                           : 6
detection_method                        : None
filter_umi                              : None
gene_tag                                : None
 gene_transcript_map                     : None
get_umi_method                          : read_id
ignore_tlen                             : False
ignore_umi                              : False
in_sam                                  : False
 log2stderr                              : False
loglevel                                : 1
 mapping_quality                         : 0
method                                  : directional
no_sort_output                          : False
out_sam                                 : False
output_bam                              : True
output_unmapped                         : False
paired                                  : True
per_cell                                : False
per_contig                              : False
per_gene                                : False
random_seed                             : None
read_length                             : False
short_help                              : None
skip_regex                              : ^(__|Unassigned)
soft_clip_threshold                     : 4
spliced                                 : False
stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
stdin                                   : <_io.TextIOWrapper name='UMI3-RSV90-10_S2.aligned.sorted.bam' mode='r' encoding='UTF-8'>
stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
stdout                                  : <_io.TextIOWrapper name='UMI3-RSV90-10_S2_umi.bam' mode='w' encoding='UTF-8'>
subset                                  : None
threshold                               : 1
timeit_file                             : None
timeit_header                           : None
timeit_name                             : all
tmpdir                                  : None
tsv                                     : None
umi_group_tag                           : BX
umi_sep                                 : :
umi_tag                                 : RX
umi_tag_delim                           : None
umi_tag_split                           : None
umi_whitelist                           : None
umi_whitelist_paired                    : None
unmapped_reads                          : discard
unpaired_reads                          : use
whole_contig                            : False

Thanks again.

epiliper · 2024-03-21T21:55:55Z

Update: I think I know what might be happening...

Looking at part of the definition of get_bundles() in lines 320-327 of sam-methods:

if read.is_read2:
                if self.return_read2:
                    if not read.is_unmapped or (
                            read.is_unmapped and self.return_unmapped):
                        yield read, None, "single_read"
                continue
            else:
                self.read_events['Input Reads'] += 1

If I'm understanding this right, umi-tools doesn't consider read2 in grouping and downstream analysis. That would explain why roughly half of my reads are getting tagged.

Does this sound like a reasonable cause?

I apologize if I missed this in the docs; I guess checking the percentage of reads tagged with "UG" after running group isn't commonly done.

TomSmithCGAT · 2024-03-26T12:51:54Z

Second time in a week that an issue has been 'solved' by a user before I get around to it. I should be neglectful more often 😉

Yes, you're right that group doesn't add read tags to read2s, which are just written out as soon as they are read in, along with unmapped and/or chimeric reads depending on the options

UMI-tools/umi_tools/group.py

Lines 242 to 253 in 7e799bc

    
           for bundle, key, status in bundle_iterator(inreads): 
        
               # write out read2s and unmapped/chimeric (if these options are set) 
        
               if status == 'single_read': 
        
                   # bundle is just a single read here 
        
                   nInput += 1 
        
                   if outfile: 
        
                       outfile.write(bundle) 
        
                   nOutput += 1 
        
                   continue

Looking over the documentation, I don't think this is stated anywhere. @IanSudbery, do you agree this is an oversight, or have I missed it too?! I'm happy to rectify.

IanSudbery · 2024-03-26T19:01:53Z

I don't see it in the documentation either. I think it should definitely be documented. I also wonder if a tool similar to prepare-for-rsem could add grouping info to the read2s. Ian Sudbery (He/Him) Senior Lecturer in Bioinformatics, Sheffield Institute for Nucleic Acids, School of Biosciences, The University of Sheffield. web: www.sudlab.co.uk Tel: 0114 222 2738 Twitter: IanSudbery Show Calendar Availability <https://calendar.google.com/calendar/u/0?cid=aS5zdWRiZXJ5QHNoZWZmaWVsZC5hYy51aw>

…

On Tue, 26 Mar 2024 at 12:52, Tom Smith ***@***.***> wrote: Second time in a week that an issue has been 'solved' by a user before I get around to it. I should be neglectful more often 😉 Yes, you're right that group doesn't add read tags to read2s, which are just written out as soon as they are read in, along with unmapped and/or chimeric reads depending on the options https://github.com/CGATOxford/UMI-tools/blob/7e799bc120f185128e3983cbc328e180a0b6b263/umi_tools/group.py#L242-L253 Looking over the documentation, I don't think this is stated anywhere. @IanSudbery <https://github.com/IanSudbery>, do you agree this is an oversight, or have I missed it too?! I'm happy to rectify. — Reply to this email directly, view it on GitHub <#635 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABJELDXZMV55M36RQUT3GNDY2FOPDAVCNFSM6AAAAABFANQ64KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRQGM2DMNZYGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

TomSmithCGAT · 2024-03-26T21:21:07Z

OK, I'll put updating the docs regarding this on the to-do.

If we do want to include read tags on read2, I'm torn between allowing this in group and a separate tool to add these afterwards. The former seem cleaner and shouldn't be too hard, but the later will probably be quicker to actually get done! Thoughts?..

epiliper · 2024-03-26T23:56:22Z

Thank you so much for confirming the cause of half-tagged reads! Peace of mind has been restored.

If you're asking me for my thoughts:

I asked some labmates also involved in UMI NGS, and we think a solution built within UMI-tools group for R2 processing would be great.
A coworker proposed separating, grouping and deduplicating R1 and R2 in parallel, then aligning them both to the reference in a final bam file.

All of this is of course just proposals; Either way, we'd be interested to hear more about this considering it would have a massive impact on how we set up our NGS runs.

Thanks again!

TomSmithCGAT self-assigned this Mar 26, 2024

TomSmithCGAT added the todo label Mar 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

group method: not all UMI-containing reads assigned UG tag #635

group method: not all UMI-containing reads assigned UG tag #635

epiliper commented Mar 20, 2024 •

edited

Loading

TomSmithCGAT commented Mar 21, 2024

epiliper commented Mar 21, 2024

epiliper commented Mar 21, 2024 •

edited

Loading

TomSmithCGAT commented Mar 26, 2024

IanSudbery commented Mar 26, 2024 via email

TomSmithCGAT commented Mar 26, 2024

epiliper commented Mar 26, 2024

group method: not all UMI-containing reads assigned UG tag #635

group method: not all UMI-containing reads assigned UG tag #635

Comments

epiliper commented Mar 20, 2024 • edited Loading

TomSmithCGAT commented Mar 21, 2024

epiliper commented Mar 21, 2024

epiliper commented Mar 21, 2024 • edited Loading

TomSmithCGAT commented Mar 26, 2024

IanSudbery commented Mar 26, 2024 via email

TomSmithCGAT commented Mar 26, 2024

epiliper commented Mar 26, 2024

epiliper commented Mar 20, 2024 •

edited

Loading

epiliper commented Mar 21, 2024 •

edited

Loading