UMI removal #29

Closed
casimonet opened this issue May 23, 2016 · 4 comments

Comments

@casimonet

Hi,
Sorry to open another issue. The fact that the UMI is removed from the sequence is a problem for downstream analyses. One example is building a BAM file in which reads are filtered out when their UMI contains low-quality bases. Another is simply keeping a column with the UMI in the BAM file, which can be more practical for certain downstream analyses than having the UMI in the read name.

Is there no way to avoid truncating the read?

Thank you,
Camille

@TomSmithCGAT
Member

Hi Camille,

No need to apologise. Comments are always very welcome.

You raise two issues:

  1. Filtering on quality
  2. Retaining the UMI elsewhere in the BAM file

  1. Filtering on quality:
    It's not realistically possible to cover every kind of filtering one might wish to apply to the reads. Previously, where I've needed to do this, I've written bespoke code to perform the filtering at the UMI extraction stage, as this step is normally relatively simple. I can see that filtering on quality would be a useful feature in general, though, and easy to implement, so I'm happy to add it to the extract tool.
  2. Retaining the UMI elsewhere in the BAM file:
    For the alignment step it's necessary to retain the UMI in the read ID, as this ensures all aligners will retain the UMI sequence in the resultant BAM. I guess it would be possible to have an option within dedup to move the UMI to a BAM field during deduplication. Can you give me a motivating example as to why this would be useful?

Regards,

Tom

Tom Smith, PhD
CGAT Training Fellow | http://www.cgat.org
MRC Functional Genomics Unit | University of Oxford
+44 1865 285854 | Thomas.Smith2@dpag.ox.ac.uk
https://cgatoxford.wordpress.com/author/tss505/



@IanSudbery
Member

Retaining the UMI on the read is not possible: if you were to leave the UMI on the read, your reads would not map.

I think adding quality filtering to extract is probably a good idea in the medium to long term. Until then, I'd recommend filtering on the UMI quality before trimming the UMI.
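For anyone wanting to do this pre-filtering by hand, here is a minimal sketch of the idea, not anything taken from UMI-tools itself. It assumes the UMI sits at the 5' end of the read, that qualities are Phred+33 encoded, and that the FASTQ records have already been parsed into (name, sequence, quality) tuples. The function names and the threshold are hypothetical.

```python
def umi_passes_quality(qualities, umi_length, threshold, offset=33):
    """Return True if every UMI base has a Phred score >= threshold.

    Assumes the UMI occupies the first `umi_length` positions of the read
    and that `qualities` is the FASTQ quality string (Phred+33 by default).
    """
    umi_qualities = qualities[:umi_length]
    return all(ord(char) - offset >= threshold for char in umi_qualities)


def filter_fastq_records(records, umi_length, threshold, offset=33):
    """Yield only the (name, sequence, quality) records whose UMI passes."""
    for name, sequence, quality in records:
        if umi_passes_quality(quality, umi_length, threshold, offset):
            yield name, sequence, quality


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
    reads = [
        ("read1", "ACGTAAAA", "IIII####"),  # good UMI, poor insert
        ("read2", "ACGTAAAA", "I#IIIIII"),  # one poor UMI base
    ]
    for record in filter_fastq_records(reads, umi_length=4, threshold=20):
        print(record[0])
```

The reads that survive can then be written back to FASTQ and passed to the extraction step as usual.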

I have previously started thinking about a tool for moving UMIs around a BAM record - certain software expects the UMI in certain places - usually either a particular place in the read name or sometimes in a particular tag. For example iCLIPro requires the UMI to be encoded as :rbc[ATCGN]+: at the end of the read name. I got bogged down trying to specify a format for describing the different possible locations.
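As a rough illustration of what such a "move the UMI" step might look like for read names, the sketch below rewrites a name of the form READID_UMI into one ending in :rbc followed by the UMI and a colon. It takes the pattern quoted above literally, and it assumes the UMI was appended to the read name after an underscore at extraction; both are assumptions to check against your own data and the target tool's documentation. The function name is hypothetical.

```python
import re

# Assumed input layout: <read id>_<UMI>, with the UMI made of A, C, G, T or N.
UMI_SUFFIX = re.compile(r"^(?P<read_id>.+)_(?P<umi>[ACGTN]+)$")


def to_rbc_read_name(read_name):
    """Rewrite 'READID_UMI' as 'READID:rbcUMI:'.

    Raises ValueError if the name does not end in an underscore plus UMI.
    """
    match = UMI_SUFFIX.match(read_name)
    if match is None:
        raise ValueError(f"no UMI suffix found in read name: {read_name!r}")
    return f"{match.group('read_id')}:rbc{match.group('umi')}:"


if __name__ == "__main__":
    print(to_rbc_read_name("read_1_ACGTN"))
```

Because the read ID part is matched greedily, underscores inside the original read ID are left alone and only the final underscore-delimited field is treated as the UMI.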

I don't think a specific column in the BAM file is a good idea as it would break compatibility with the BAM format standard.
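An optional tag, by contrast, stays within the format. The following is only a sketch of that alternative, working on plain SAM text lines rather than through any BAM library. It assumes the UMI is the last underscore-delimited part of the read name, and it uses the tag name RX, which to my knowledge the SAM optional-fields specification sets aside for UMI sequences; treat both as assumptions. The function name is hypothetical.

```python
def move_umi_to_tag(sam_line, tag="RX"):
    """Move a trailing '_UMI' from the read name into an optional SAM tag.

    Header lines (starting with '@') and reads without an underscore in
    their name are returned unchanged, minus any trailing newline.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line
    fields = line.split("\t")
    read_id, separator, umi = fields[0].rpartition("_")
    if not separator or not read_id or not umi:
        return line
    fields[0] = read_id
    fields.append(f"{tag}:Z:{umi}")  # Z marks a string-valued tag
    return "\t".join(fields)


if __name__ == "__main__":
    record = "read1_ACGT\t0\tchr1\t100\t60\t4M\t*\t0\t0\tTTTT\tIIII"
    print(move_umi_to_tag(record))
```

This would have to run after deduplication if the dedup step reads the UMI from the read name.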

Both these are improvements that could be made to the software in future versions, but I don't think leaving the UMI on the read will ever be possible.

@casimonet
Author

Hi

If a filtering option could be implemented it would be very useful.

The idea of moving the UMI to a BAM field is to make the output compatible with a previously designed pipeline, in particular its quality filtering. As suggested by @IanSudbery, the best way for now is to do this before UMI extraction.
I'm not sure how commonly used my pipeline's strategy is, but being able to move the UMI back to a BAM field would allow more flexibility. It would let people compare the results obtained with UMI-tools against those from their usual pipeline, for example when re-testing previous datasets.
I'm also very new to bioinformatics, though, so there are probably ways around this.

Thank you both for your answer and advice.

@TomSmithCGAT
Member

@casimonet The latest version of UMI-Tools (v0.2.0) now allows reads to be filtered out during the extraction stage using the quality scores (see options --quality-threshold and --quality-encoding).
