UMI removal #29

casimonet · 2016-05-23T15:45:36Z

Hi,
Sorry to open another issue. The fact that the UMI is removed from the sequence is a problem for downstream analyses, for example when building a BAM file in which reads are filtered out if their UMI contains low-quality bases. It would also help simply to keep a column with the UMI in the BAM file, which can be more practical for certain downstream analyses than having the UMI in the read name.

Is there no way to avoid truncating the read?

Thank you,
Camille

TomSmithCGAT · 2016-05-23T16:07:51Z

Hi Camille,

No need to apologise. Comments are always very welcome.

You raise two issues:

1. Filtering on quality
2. Retaining the UMI elsewhere in the BAM file

Filtering on quality:
It's not realistically possible to cover all the filtering one might wish to apply to the reads. Previously, where I've needed to do this, I've written bespoke code to perform the filtering at the UMI extraction stage, as this step is normally relatively simple. I can see that filtering on quality would be a useful feature in general, though, and it would be easy to implement, so I'm happy to add it to the extract tool.

Retaining the UMI elsewhere in the BAM file:
For the alignment step it's necessary to retain the UMI in the read ID, as this ensures all aligners will retain the UMI sequence in the resultant BAM. I guess it would be possible to have an option within dedup to move the UMI to a BAM field during deduplication. Can you give me a motivating example as to why this would be useful?

Regards,

Tom

IanSudbery · 2016-05-23T16:23:08Z

Retaining the UMI on the read is not possible: if you were to leave the UMI on the read, your reads would not map.

I think adding quality filtering to extract is probably a good idea in the medium to long term. Until then, I'd recommend filtering on the UMI quality before trimming the UMI.
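As a rough illustration of filtering on UMI quality before trimming, here is a minimal Python sketch. It assumes the UMI occupies the first `umi_length` bases of each read and that qualities are Phred+33 encoded. The function names are hypothetical and are not part of UMI-tools.

```python
def min_umi_quality(quality_string, umi_length, phred_offset=33):
    """Return the lowest Phred score among the first umi_length bases.

    Assumption: the UMI sits at the 5' end of the read.
    """
    umi_quals = quality_string[:umi_length]
    return min(ord(char) - phred_offset for char in umi_quals)


def filter_fastq_records(records, umi_length, threshold, phred_offset=33):
    """Yield (name, seq, qual) records whose UMI bases all meet the threshold."""
    for name, seq, qual in records:
        if min_umi_quality(qual, umi_length, phred_offset) >= threshold:
            yield name, seq, qual


# Toy records: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
records = [
    ("read1", "ACGTTTGCA", "IIIIIIIII"),
    ("read2", "ACGTTTGCA", "I#IIIIIII"),  # second UMI base is low quality
]
kept = list(filter_fastq_records(records, umi_length=4, threshold=20))
```

With these toy records only `read1` survives, because the second UMI base of `read2` falls below the threshold. Reading and writing real FASTQ files is left out of the sketch.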

I have previously started thinking about a tool for moving UMIs around a BAM record, since certain software expects the UMI in certain places: usually either a particular place in the read name or sometimes a particular tag. For example, iCLIPro requires the UMI to be encoded as :rbc[ATCGN]+: at the end of the read name. I got bogged down trying to specify a format for describing the different possible locations.
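For a single target format, the read-name rewrite could look something like the sketch below. It assumes the UMI was appended to the read name after an underscore (`NAME_UMI`) and takes the `:rbc[ATCGN]+:` pattern quoted above literally; both should be checked against the actual data and the iCLIPro documentation. `to_rbc_name` is a hypothetical helper, not an existing tool.

```python
import re

# Assumption: the read name ends in "_<UMI>", with the UMI made of A, C, G, T or N.
UMI_SUFFIX = re.compile(r"^(?P<name>.+)_(?P<umi>[ACGTN]+)$")


def to_rbc_name(read_name):
    """Rewrite 'NAME_UMI' as 'NAME:rbcUMI:' (pattern taken from the thread)."""
    match = UMI_SUFFIX.match(read_name)
    if match is None:
        raise ValueError("no trailing UMI found in read name: %r" % read_name)
    return "{}:rbc{}:".format(match.group("name"), match.group("umi"))


converted = to_rbc_name("read1_ACGT")
```

Supporting every possible source and target location is the harder problem; this only covers one fixed pair.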

I don't think a specific column in the BAM file is a good idea as it would break compatibility with the BAM format standard.

Both of these are improvements that could be made to the software in future versions, but I don't think leaving the UMI on the read will ever be possible.

casimonet · 2016-05-24T06:59:21Z

Hi

If a filtering option could be implemented it would be very useful.

The idea of moving the UMI to a BAM field is to make the output compatible with a previously designed pipeline, in particular its quality-filtering step. As @IanSudbery suggested, the best approach for now is to do this filtering before UMI extraction.
I'm not sure how common my pipeline's strategy is, but being able to move the UMI back to a BAM field would give more flexibility. People could compare the results obtained with UMI-tools against those from their usual pipeline, for example when re-testing previous datasets.
I'm also very new to bioinformatics, so there are probably ways around this.
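One possible workaround, sketched under assumptions: copy the UMI from the read name into an optional tag on each SAM record after alignment. The sketch assumes the read name ends in `_UMI` and uses `RX`, the tag the SAM optional-fields specification lists for UMI bases. Whether a given pipeline reads that tag is not something this thread confirms, and `copy_umi_to_tag` is a hypothetical helper, not a UMI-tools feature.

```python
def copy_umi_to_tag(sam_line, tag="RX", separator="_"):
    """Append '<tag>:Z:<UMI>' to a SAM alignment line, leaving the read name as is.

    Assumption: the UMI is the part of the read name after the last separator.
    Header lines (starting with '@') are returned unchanged.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line
    fields = line.split("\t")
    _, found, umi = fields[0].rpartition(separator)
    if not found or not umi:
        raise ValueError("no UMI found in read name: %r" % fields[0])
    fields.append("{}:Z:{}".format(tag, umi))
    return "\t".join(fields)


record = "read1_ACGT\t0\tchr1\t100\t60\t5M\t*\t0\t0\tTTGCA\tIIIII"
tagged = copy_umi_to_tag(record)
```

An optional tag is an extra field at the end of the record rather than a new fixed column, so the record still has the standard layout. The read name is kept untouched so that tools expecting the UMI there keep working.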

Thank you both for your answer and advice.

TomSmithCGAT · 2016-05-31T12:01:28Z

@casimonet The latest version of UMI-Tools (v0.2.0) now allows reads to be filtered out during the extraction stage using the quality scores (see the options --quality-threshold and --quality-encoding).

IanSudbery added the enhancement label May 24, 2016

TomSmithCGAT mentioned this issue May 25, 2016

Ts add quality filtering #33

Merged

TomSmithCGAT closed this as completed Jan 19, 2017