Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specification for GTF file #30

Open
dylkot opened this issue Apr 23, 2018 · 11 comments
Open

Specification for GTF file #30

dylkot opened this issue Apr 23, 2018 · 11 comments

Comments

@dylkot
Copy link

dylkot commented Apr 23, 2018

Hi There, thanks for putting all of this together!

I am just putting a few comments together as I seek to run this pipeline on a Rhesus Macaque sample. I'm downloading the GTF from ENSEMBL:
ftp://ftp.ensembl.org/pub/release-92/gtf/macaca_mulatta

Initially this file didn't include a gene_name in the attribute column. This was leading to a NullPointerException in org.broadinstitute.dropseqrna.annotation.ReduceGTF

I manually added a gene_name attribute but now I'm getting an exception "Missing transcript_name"

Problems:
Missing transcript_name
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:71)
at org.broadinstitute.dropseqrna.annotation.ReduceGTF.writeRecords(ReduceGTF.java:166)
at org.broadinstitute.dropseqrna.annotation.ReduceGTF.doWork(ReduceGTF.java:112)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

It would be helpful if there was a similar error message for gene_name missing and if there was documentation in the "reference files" section indicating that the transcript_name and gene_name attributes are required and may not be included directly in the ensembl download.

@Hoohm
Copy link
Owner

Hoohm commented Apr 27, 2018 via email

@dylkot
Copy link
Author

dylkot commented May 2, 2018

Hi. Yep, I fixed it by manually appending gene_name and transcript_name entries to the rows that were missing them in the GTF. I simply used the gene_id and transcript_id for those that were lacking them. I then found that gene_name (which is used by the pipeline as the gene identifier) was not unique in the Macaque annotation. So I overwrote the gene_name entry in the attribute column in the format geneid__genename so it would propagate information for both features through the pipeline. I'm not sure which of these features you would want to add into the package or if you would want to just indicate that gene_name and transcript_name are required attributes. In any case, I'm copying some relevant code here below:

`
import pandas as pd
import numpy as np
import re

def append_gtf_attr(ingtf, outgtf):
    '''Adds gene_name and transcript_name fields to rows that are missing them. Saves gene_name 
       in the format geneid__genename'''
    gtf = pd.read_csv(ingtf, sep='\t', header=None, comment='#')

    ## Extract gene_id and transcript_id
    gtf['gid'] = gtf[8].apply(lambda x: re.findall('gene_id "(.+?)"', x)[0])
    gtf['tid'] = np.nan
    tidind = gtf[2] != 'gene'
    gtf.loc[tidind, 'tid'] = gtf.loc[tidind,8].apply(lambda x: re.findall('transcript_id "(.+?)"', x)[0])

    ## Add attr column to hold our updated attribute info
    gtf['attr'] = gtf[8].copy()

    ## Use gene_id as gene_name for rows that don't have gene_name in attr
    hasgn = gtf['attr'].apply(lambda x: 'gene_name' in x)
    gtf.loc[~hasgn, 'attr'] += ' gene_name "' + gtf.loc[~hasgn,'gid'] + '";'

    ## Use transcript_id as transcript_name for rows missing transcript_name
    hastn = gtf['attr'].apply(lambda x: 'transcript_name' in x)
    notnidx = ~hastn & (gtf[2]!='gene')
    gtf.loc[notnidx, 'attr'] += ' transcript_name "' + gtf.loc[notnidx,'tid'] + '";'

    ## Update gene_name to be geneid__genename format
    gtf['gene_name'] = gtf['attr'].apply(lambda x: re.findall('gene_name "(.+?)"', x)[0])
    gtf['newname'] = gtf['gid'] + '__' + gtf['gene_name']
    strbase = 'gene_name "%s"'
    gtf['attr'] = gtf.apply(lambda x: x['attr'].replace(strbase % x['gene_name'], strbase % x['newname']), axis=1)

    ## Drop extraneous columns and save resulting file
    gtf = gtf.drop([8, 'gid', 'tid', 'gene_name', 'newname'], axis=1)
    gtf.to_csv(outgtf, sep='\t', index=False, header=False, quoting=3)

`

@Hoohm
Copy link
Owner

Hoohm commented Nov 28, 2018

Hey @dylkot !
I'm working on automating the reference and genome download generation. I'm not familiar with all possible species that people use for single cell and I would greatly appreciate help for specific ones such as Rhesus Macaque.

Would you be able to help integrate this once the new version is out?

@dylkot
Copy link
Author

dylkot commented Nov 28, 2018

Sure! I'd be happy to help with integration for Rhesus Macaque. By 'new version', are you referring to incorporating Drop-seq tools 2.0? I was just starting to wonder about how I will use that going forward because I would especially like to have quantitation of the intronic reads in my data

@Hoohm
Copy link
Owner

Hoohm commented Nov 28, 2018

Version 0.4 and many more things.

Which functionnality would you specifically need?

@dylkot
Copy link
Author

dylkot commented Nov 28, 2018

Exciting! Really just the ability to generate a count matrices for reads falling in introns in addition to reads falling in the coding sequence.

@Hoohm
Copy link
Owner

Hoohm commented Dec 19, 2018

Hey! If you want to try out the new feature for automatic mixed species download, merging and generation, you can test out the new 0.4 version!

@dylkot
Copy link
Author

dylkot commented Dec 20, 2018

Awesome, I just got a new load of data and so I'll run it with the new version! I'll let you know how it goes.

@dylkot
Copy link
Author

dylkot commented Dec 22, 2018

Is there documentation anywhere on how to setup the binaries to run the new version? When I try to run it with the environment that was working for the previous version of dropSeqPipe, I get the error:

/bin/bash: cutadapt: command not found

I can install cutadapt but I imagine there might be several changes to the environment including downloading the next version of Drop-seq_tools and I just want to make sure I do all of that before I try to run things.

@Hoohm
Copy link
Owner

Hoohm commented Dec 22, 2018

are you adding --use-conda to the command? Because that would download and activate the envs for each rule and you should not get this kind of errors.

@dylkot
Copy link
Author

dylkot commented Dec 22, 2018

Nope, I wasn't. Sorry I missed that! Will give it a try.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants