# First look at dodgy transcripts

Some of the snpEff results are very strange, with a high number of SNPs flagging errors with overlapping transcripts.


Here is a first look at what is causing this.


## Input data

I am using snpEff v 4.3.1t-2, installed via conda, along with the most recent genome and annotation from the NCBI. Md5sums for genome and gtf are:

In [58]:
! md5sum /home/david/data/SNPeff_db/bStrHab1.2.pri/genes.gtf.gz

5a7f1364a01ad6aa081c5399f5a19d76  /home/david/data/SNPeff_db/bStrHab1.2.pri/genes.gtf.gz


In [37]:
! md5fa /home/david/data/SNPeff_db/bStrHab1.2.pri/sequences.fa | tail -n2

6221971aabdc307a89e66f816ef6241f  /home/david/data/SNPeff_db/bStrHab1.2.pri/sequences.fa  >ordered
2a1eb856ec7ecb9155a6e823aa88c38a  /home/david/data/SNPeff_db/bStrHab1.2.pri/sequences.fa  >unordered


I then ran snpEff, using bcftools to update the chromosome names (the cell was not run, as it takes quite a while to complete a run):

```
bcftools annotate  --rename-chrs remap_chroms.tsv ~/analysis/kakapo_birds/vars/Trained.bcf  | \
   java -jar /home/david/miniconda/envs/aspergil/share/snpeff-4.3.1t-2/snpEff.jar \
   eff -v  bStrHab1.2.pri > annotated.vcf
```


## Parsing the annotated VCF

snpEff writes its annotations for each variant to an "ANN" field in the INFO column of the VCF. Here is some old code to parse those out:

In [1]:
from collections import namedtuple
import vcf

# Use a named tuple to represent site annotation info
Annotation = namedtuple("Annotation",
                  ["allele", "annotation", "impact", "gene_name", "gene_id",
                    "feature_type", "feature_id", "transcript_biotype", "rank",
                    "HGVS_c", "HGVS_p", "cDNA_pos","CDS_pos", "AA_pos",
                    "distance", "messages"]
)


def _parse_annot(ANN_string):
    """ Represent the ANN information from a VCF INFO field """
    return Annotation(*ANN_string.split("|"))

def get_annotations(site):
    """ Get all annotations from a site"""
    return([_parse_annot(a) for a in site.INFO["ANN"]])
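
As a quick sanity check on the parser, here is a minimal sketch that parses a single ANN entry without going through PyVCF. The entry itself is made up for illustration (`GENE1` and `XM_000000000.1` are not real records from this dataset), and `parse_annot_checked` is a hypothetical variant of `_parse_annot` that complains if the number of `|`-separated fields is not the 16 the named tuple expects.

```python
from collections import namedtuple

ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "HGVS_c", "HGVS_p", "cDNA_pos", "CDS_pos", "AA_pos",
    "distance", "messages",
]
Annotation = namedtuple("Annotation", ANN_FIELDS)


def parse_annot_checked(ann_string):
    """Split one ANN entry, failing loudly if the field count is unexpected."""
    parts = ann_string.split("|")
    if len(parts) != len(ANN_FIELDS):
        raise ValueError(
            "expected {} ANN fields, got {}".format(len(ANN_FIELDS), len(parts))
        )
    return Annotation(*parts)


# Made-up ANN entry, for illustration only (not taken from annotated.vcf)
example = (
    "T|missense_variant|MODERATE|GENE1|GENE1|transcript|XM_000000000.1|"
    "protein_coding|3/10|c.100A>T|p.Thr34Ser|150/2000|100/1500|34/499||"
    "WARNING_TRANSCRIPT_INCOMPLETE"
)
ann = parse_annot_checked(example)
print(ann.gene_id, ann.feature_id, ann.messages)
```

The explicit length check is only a convenience: the original `_parse_annot` would raise a less informative `TypeError` in the same situation.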

## Find the genes causing warnings

In [117]:
sites = vcf.Reader(open("annotated.vcf"))

In [114]:
# Parse through the sites. A site can have multiple annotations if it affects different transcripts of the same
# gene or is upstream/downstream of multiple genes. So, for each site we call the _parse_annot function for
# all of the annotations given by the "ANN" field.

dodgy_genes = []
for s in sites:
    for annotation in get_annotations(s):
        if annotation.messages:
            dodgy_genes.append( annotation ) 

        

In [131]:
len(dodgy_genes)

131661

## Work out which genes are associated with which warnings

There are a tonne of warnings, mostly from a small number of genes. 

There are a couple of ways to collect them up. First, make a dictionary to look up the genes associated with each warning message...

In [133]:
from collections import defaultdict
warning_dict = defaultdict(set)

In [134]:
for anno in dodgy_genes:
    warning_dict[anno.messages].add(anno.gene_id)

In [135]:
for warning,gene_list in warning_dict.items():
    print(warning, len(gene_list))

INFO_REALIGN_3_PRIME 4113
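
One caveat with the grouping above: it keys on the whole `messages` string. As far as I understand the ANN format, snpEff can put more than one code in that field, joined with `&`, so a combined string would get its own dictionary key, and one that happens to begin with an INFO code would be missed by a `startswith("WARNING")` filter. Here is a minimal sketch of splitting the codes first. `MiniAnno` is a cut-down stand-in for the `Annotation` tuple, and the records are made up.

```python
from collections import defaultdict, namedtuple

# Cut-down stand-in for the notebook's Annotation tuple (hypothetical)
MiniAnno = namedtuple("MiniAnno", ["gene_id", "messages"])


def genes_by_message(annotations):
    """Map each individual message code to the set of gene IDs carrying it."""
    by_message = defaultdict(set)
    for anno in annotations:
        # Assumption: multiple codes in one field are separated by '&'
        for msg in anno.messages.split("&"):
            if msg:  # skip annotations with an empty messages field
                by_message[msg].add(anno.gene_id)
    return by_message


# Made-up records, for illustration only
demo = [
    MiniAnno("GENE1", "WARNING_TRANSCRIPT_INCOMPLETE&INFO_REALIGN_3_PRIME"),
    MiniAnno("GENE2", "INFO_REALIGN_3_PRIME"),
    MiniAnno("GENE3", ""),
]
demo_dict = genes_by_message(demo)
for msg, genes in sorted(demo_dict.items()):
    print(msg, len(genes))
```

The same function should work unchanged on `dodgy_genes`, since it only relies on the `gene_id` and `messages` attributes.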


In [136]:
warning_dict["WARNING_TRANSCRIPT_INCOMPLETE"]

{'FIBP',
 'KHSRP',
 'LOC115602847',
 'LOC115603008',
 'LOC115603047',
 'LOC115603536',
 'MAG',
 'OTUB1',
 'PLP2',
 'STK19',
 'TAF1C',
 'TARS2',
 'VARS1'}

So, most of the WARNINGs involve only a relatively small number of genes. To look at these more closely, write out a list of all unique warning-gene pairs...

In [145]:
n = 0
all_warnings = set()
with open("warning_summary.tsv", "w") as out:    
    for warning,gene_list in warning_dict.items():
        #ignore INFO for now
        if warning.startswith("WARNING"):
            for gene in gene_list:
                all_warnings.add(gene)
                out.write("{}\t{}\n".format(warning, gene))
                n += 1
n
            

174

.... and finally write out each unique gene. You can use this list to show which genes are causing the errors, most of which appear to be flagged as manual translation exceptions.

In [160]:
with open("unique_warning_genes.list", "w") as out:
    for g in all_warnings:
        out.write(g +"\n")
len(all_warnings)

140

In [168]:
! zcat /home/david/data/SNPeff_db/genes.gtf.gz | grep -f unique_warning_genes.list  | head -n5

NC_044277.2	Gnomon	gene	44847719	44870969	.	+	.	gene_id "LOC115609468"; db_xref "GeneID:115609468"; gbkey "Gene"; gene "LOC115609468"; gene_biotype "protein_coding"; 
NC_044277.2	Gnomon	exon	44847719	44847757	.	+	.	gene_id "LOC115609468"; transcript_id "XM_030489764.1"; db_xref "GeneID:115609468"; exception "unclassified transcription discrepancy"; gbkey "mRNA"; gene "LOC115609468"; model_evidence "Supporting evidence includes similarity to: 17 Proteins, 1 long SRA read, and 63% coverage of the annotated genomic feature by RNAseq alignments"; note "The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 4 bases in 3 codons; deleted 3 bases in 3 codons"; product "desmocollin-2-like"; exon_number "1"; 
NC_044277.2	Gnomon	exon	44850321	44850405	.	+	.	gene_id "LOC115609468"; transcript_id "XM_030489764.1"; db_xref "GeneID:115609468"; exception "unclassified transcription discrepancy"; gbkey "mRNA"; gene "LOC115609

In [197]:
! zcat /home/david/data/SNPeff_db/genes.gtf.gz | grep -f unique_warning_genes.list | grep exon | grep -v "modified" | grep -v "LOW Q"

NC_044277.2	Gnomon	exon	99846217	99846272	.	+	.	gene_id "OBSCN"; transcript_id "XM_030482854.1"; db_xref "GeneID:115606602"; gbkey "mRNA"; gene "OBSCN"; model_evidence "Supporting evidence includes similarity to: 2 mRNAs, 5 ESTs, 2 Proteins, and 87% coverage of the annotated genomic feature by RNAseq alignments"; product "obscurin, cytoskeletal calmodulin and titin-interacting RhoGEF"; exon_number "1"; 
NC_044277.2	Gnomon	exon	99855236	99856380	.	+	.	gene_id "OBSCN"; transcript_id "XM_030482854.1"; db_xref "GeneID:115606602"; gbkey "mRNA"; gene "OBSCN"; model_evidence "Supporting evidence includes similarity to: 2 mRNAs, 5 ESTs, 2 Proteins, and 87% coverage of the annotated genomic feature by RNAseq alignments"; product "obscurin, cytoskeletal calmodulin and titin-interacting RhoGEF"; exon_number "2"; 
NC_044277.2	Gnomon	exon	99862886	99863155	.	+	.	gene_id "OBSCN"; transcript_id "XM_030482854.1"; db_xref "GeneID:115606602"; gbkey "mRNA"; gene "OBSCN"; model_evidence "Supporting evid

NC_044277.2	Gnomon	CDS	113950520	113950690	.	+	2	gene_id "MPP7"; transcript_id "XM_030504945.1"; db_xref "GeneID:115616096"; gbkey "CDS"; gene "MPP7"; product "MAGUK p55 subfamily member 7 isoform X1"; protein_id "XP_030360805.1"; exon_number "16"; 
NC_044277.2	Gnomon	CDS	113957514	113957594	.	+	2	gene_id "MPP7"; transcript_id "XM_030504945.1"; db_xref "GeneID:115616096"; gbkey "CDS"; gene "MPP7"; product "MAGUK p55 subfamily member 7 isoform X1"; protein_id "XP_030360805.1"; exon_number "17"; 
NC_044277.2	Gnomon	CDS	113959418	113959511	.	+	2	gene_id "MPP7"; transcript_id "XM_030504945.1"; db_xref "GeneID:115616096"; gbkey "CDS"; gene "MPP7"; product "MAGUK p55 subfamily member 7 isoform X1"; protein_id "XP_030360805.1"; exon_number "18"; 
NC_044277.2	Gnomon	CDS	113960605	113960713	.	+	1	gene_id "MPP7"; transcript_id "XM_030504945.1"; db_xref "GeneID:115616096"; gbkey "CDS"; gene "MPP7"; product "MAGUK p55 subfamily member 7 isoform X1"; protein_id "XP_030360805.1"; exon_number "19"

NC_044277.2	Gnomon	CDS	131295371	131295573	.	+	2	gene_id "MPP6"; transcript_id "XM_030489726.1"; db_xref "GeneID:115609452"; gbkey "CDS"; gene "MPP6"; product "MAGUK p55 subfamily member 6"; protein_id "XP_030345586.1"; exon_number "10"; 
NC_044277.2	Gnomon	CDS	131295977	131296105	.	+	0	gene_id "MPP6"; transcript_id "XM_030489726.1"; db_xref "GeneID:115609452"; gbkey "CDS"; gene "MPP6"; product "MAGUK p55 subfamily member 6"; protein_id "XP_030345586.1"; exon_number "11"; 
NC_044277.2	Gnomon	CDS	131298930	131299103	.	+	0	gene_id "MPP6"; transcript_id "XM_030489726.1"; db_xref "GeneID:115609452"; gbkey "CDS"; gene "MPP6"; product "MAGUK p55 subfamily member 6"; protein_id "XP_030345586.1"; exon_number "12"; 
NC_044277.2	Gnomon	start_codon	131273824	131273826	.	+	0	gene_id "MPP6"; transcript_id "XM_030489726.1"; db_xref "GeneID:115609452"; gbkey "CDS"; gene "MPP6"; product "MAGUK p55 subfamily member 6"; protein_id "XP_030345586.1"; exon_number "2"; 
NC_044277.2	Gnomon	stop_codon	131

NC_044278.2	Gnomon	exon	116066194	116066363	.	-	.	gene_id "DLG2"; transcript_id "XM_030474393.1"; db_xref "GeneID:115602909"; gbkey "mRNA"; gene "DLG2"; model_evidence "Supporting evidence includes similarity to: 3 ESTs, 6 Proteins, 352 long SRA reads, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "discs large MAGUK scaffold protein 2, transcript variant X6"; exon_number "10"; 
NC_044278.2	Gnomon	exon	116042432	116042568	.	-	.	gene_id "DLG2"; transcript_id "XM_030474393.1"; db_xref "GeneID:115602909"; gbkey "mRNA"; gene "DLG2"; model_evidence "Supporting evidence includes similarity to: 3 ESTs, 6 Proteins, 352 long SRA reads, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "discs large MAGUK scaffold protein 2, transcript variant X6"; exon_number "11"; 
NC_044278.2	Gnomon	exon	116030431	116030575	.	

NC_044278.2	Gnomon	exon	116066194	116066363	.	-	.	gene_id "DLG2"; transcript_id "XM_030474516.1"; db_xref "GeneID:115602909"; gbkey "mRNA"; gene "DLG2"; model_evidence "Supporting evidence includes similarity to: 2 ESTs, 5 Proteins, 350 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "discs large MAGUK scaffold protein 2, transcript variant X23"; exon_number "5"; 
NC_044278.2	Gnomon	exon	116042432	116042568	.	-	.	gene_id "DLG2"; transcript_id "XM_030474516.1"; db_xref "GeneID:115602909"; gbkey "mRNA"; gene "DLG2"; model_evidence "Supporting evidence includes similarity to: 2 ESTs, 5 Proteins, 350 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "discs large MAGUK scaffold protein 2, transcript variant X23"; exon_number "6"; 
NC_044278.2	Gnomon	exon	116030431	116030575	.	-	

NC_044279.2	Gnomon	CDS	58095517	58095578	.	-	2	gene_id "PPFIBP1"; transcript_id "XM_030478645.1"; db_xref "GeneID:115604963"; gbkey "CDS"; gene "PPFIBP1"; product "liprin-beta-1 isoform X1"; protein_id "XP_030334505.1"; exon_number "11"; 
NC_044279.2	Gnomon	CDS	58094715	58094799	.	-	0	gene_id "PPFIBP1"; transcript_id "XM_030478645.1"; db_xref "GeneID:115604963"; gbkey "CDS"; gene "PPFIBP1"; product "liprin-beta-1 isoform X1"; protein_id "XP_030334505.1"; exon_number "12"; 
NC_044279.2	Gnomon	CDS	58092896	58093053	.	-	2	gene_id "PPFIBP1"; transcript_id "XM_030478645.1"; db_xref "GeneID:115604963"; gbkey "CDS"; gene "PPFIBP1"; product "liprin-beta-1 isoform X1"; protein_id "XP_030334505.1"; exon_number "13"; 
NC_044279.2	Gnomon	CDS	58091839	58091969	.	-	0	gene_id "PPFIBP1"; transcript_id "XM_030478645.1"; db_xref "GeneID:115604963"; gbkey "CDS"; gene "PPFIBP1"; product "liprin-beta-1 isoform X1"; protein_id "XP_030334505.1"; exon_number "14"; 
NC_044279.2	Gnomon	CDS	58090252	58090305

NC_046358.1	Gnomon	exon	9923532	9923759	.	-	.	gene_id "AP5M1"; transcript_id "XM_030483156.1"; db_xref "GeneID:115606733"; gbkey "mRNA"; gene "AP5M1"; model_evidence "Supporting evidence includes similarity to: 2 ESTs, 12 Proteins, and 91% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; partial "true"; product "adaptor related protein complex 5 subunit mu 1"; exon_number "2"; 
NC_046358.1	Gnomon	exon	9922937	9923076	.	-	.	gene_id "AP5M1"; transcript_id "XM_030483156.1"; db_xref "GeneID:115606733"; gbkey "mRNA"; gene "AP5M1"; model_evidence "Supporting evidence includes similarity to: 2 ESTs, 12 Proteins, and 91% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; partial "true"; product "adaptor related protein complex 5 subunit mu 1"; exon_number "3"; 
NC_046358.1	Gnomon	exon	9922100	9922185	.	-	.	gene_id "AP5M1"; transcript_id "XM_030

NC_046358.1	Gnomon	exon	58363302	58363362	.	-	.	gene_id "PPFIBP2"; transcript_id "XR_003990916.1"; db_xref "GeneID:115606507"; gbkey "misc_RNA"; gene "PPFIBP2"; model_evidence "Supporting evidence includes similarity to: 3 ESTs, 1 Protein, 252 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "PPFIA binding protein 2, transcript variant X1"; exon_number "16"; 
NC_046358.1	Gnomon	exon	58361372	58361413	.	-	.	gene_id "PPFIBP2"; transcript_id "XR_003990916.1"; db_xref "GeneID:115606507"; gbkey "misc_RNA"; gene "PPFIBP2"; model_evidence "Supporting evidence includes similarity to: 3 ESTs, 1 Protein, 252 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "PPFIA binding protein 2, transcript variant X1"; exon_number "17"; 
NC_046358.1	Gnomon	exon	58359370	58359514	.	-	.	gene_id "P

NC_046358.1	Gnomon	exon	80281943	80281982	.	-	.	gene_id "LOC115606427"; transcript_id "XM_030482319.1"; db_xref "GeneID:115606427"; gbkey "mRNA"; gene "LOC115606427"; model_evidence "Supporting evidence includes similarity to: 12 Proteins, and 45% coverage of the annotated genomic feature by RNAseq alignments"; product "cytochrome P450 2W1-like"; exon_number "3"; 
NC_046358.1	Gnomon	exon	80281788	80281794	.	-	.	gene_id "LOC115606427"; transcript_id "XM_030482319.1"; db_xref "GeneID:115606427"; gbkey "mRNA"; gene "LOC115606427"; model_evidence "Supporting evidence includes similarity to: 12 Proteins, and 45% coverage of the annotated genomic feature by RNAseq alignments"; product "cytochrome P450 2W1-like"; exon_number "4"; 
NC_046358.1	Gnomon	exon	80281127	80281295	.	-	.	gene_id "LOC115606427"; transcript_id "XM_030482319.1"; db_xref "GeneID:115606427"; gbkey "mRNA"; gene "LOC115606427"; model_evidence "Supporting evidence includes similarity to: 12 Proteins, and 45% coverage of the 

NC_044281.2	Gnomon	exon	27198801	27198949	.	-	.	gene_id "DLG5"; transcript_id "XM_030485715.1"; db_xref "GeneID:115607897"; gbkey "mRNA"; gene "DLG5"; model_evidence "Supporting evidence includes similarity to: 2 mRNAs, 7 ESTs, 70 long SRA reads, and 98% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "discs large MAGUK scaffold protein 5, transcript variant X4"; exon_number "26"; 
NC_044281.2	Gnomon	exon	27198320	27198490	.	-	.	gene_id "DLG5"; transcript_id "XM_030485715.1"; db_xref "GeneID:115607897"; gbkey "mRNA"; gene "DLG5"; model_evidence "Supporting evidence includes similarity to: 2 mRNAs, 7 ESTs, 70 long SRA reads, and 98% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "discs large MAGUK scaffold protein 5, transcript variant X4"; exon_number "27"; 
NC_044281.2	Gnomon	exon	27196995	27197191	.	-	.	gene_id "D

NC_044281.2	Gnomon	exon	49510641	49517668	.	-	.	gene_id "MGAT5"; transcript_id "XM_030487180.1"; db_xref "GeneID:115608408"; gbkey "mRNA"; gene "MGAT5"; model_evidence "Supporting evidence includes similarity to: 2 ESTs, 21 Proteins, 247 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments"; product "alpha-1,6-mannosylglycoprotein 6-beta-N-acetylglucosaminyltransferase"; exon_number "17"; 
NC_044281.2	Gnomon	exon	52106897	52106962	.	+	.	gene_id "LOC115608763"; transcript_id "XM_030488084.1"; db_xref "GeneID:115608763"; gbkey "mRNA"; gene "LOC115608763"; model_evidence "Supporting evidence includes similarity to: 1 Protein"; partial "true"; product "protein PXR1-like"; exon_number "1"; 
NC_044281.2	Gnomon	exon	52107321	52107374	.	+	.	gene_id "LOC115608763"; transcript_id "XM_030488084.1"; db_xref "GeneID:115608763"; gbkey "mRNA"; gene "LOC115608763"; model_evidence "Supporting evidence includes similarity to: 1 Protein"; partial "true"; product "prote

NC_044282.2	Gnomon	exon	43592097	43592267	.	+	.	gene_id "TPO"; transcript_id "XM_030489711.1"; db_xref "GeneID:115609450"; gbkey "mRNA"; gene "TPO"; model_evidence "Supporting evidence includes similarity to: 21 Proteins, and 81% coverage of the annotated genomic feature by RNAseq alignments"; product "thyroid peroxidase"; exon_number "13"; 
NC_044282.2	Gnomon	exon	43595250	43595410	.	+	.	gene_id "TPO"; transcript_id "XM_030489711.1"; db_xref "GeneID:115609450"; gbkey "mRNA"; gene "TPO"; model_evidence "Supporting evidence includes similarity to: 21 Proteins, and 81% coverage of the annotated genomic feature by RNAseq alignments"; product "thyroid peroxidase"; exon_number "14"; 
NC_044282.2	Gnomon	exon	49571471	49571745	.	-	.	gene_id "LOC115610106"; transcript_id "XM_030491171.1"; db_xref "GeneID:115610106"; gbkey "mRNA"; gene "LOC115610106"; model_evidence "Supporting evidence includes similarity to: 1 Protein, 3 long SRA reads, and 99% coverage of the annotated genomic feature by R

NC_044284.2	Gnomon	exon	23162295	23164031	.	-	.	gene_id "DLG1"; transcript_id "XM_030494780.1"; db_xref "GeneID:115611598"; gbkey "mRNA"; gene "DLG1"; model_evidence "Supporting evidence includes similarity to: 1 mRNA, 4 ESTs, 17 Proteins, 130 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments"; product "discs large MAGUK scaffold protein 1, transcript variant X21"; exon_number "23"; 
NC_044284.2	Gnomon	exon	23253348	23254190	.	-	.	gene_id "DLG1"; transcript_id "XM_030494782.1"; db_xref "GeneID:115611598"; gbkey "mRNA"; gene "DLG1"; model_evidence "Supporting evidence includes similarity to: 1 mRNA, 3 ESTs, 17 Proteins, 129 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments"; product "discs large MAGUK scaffold protein 1, transcript variant X23"; exon_number "1"; 
NC_044284.2	Gnomon	exon	23227873	23227971	.	-	.	gene_id "DLG1"; transcript_id "XM_030494782.1"; db_xref "GeneID:115611598"; gbkey "mRNA"; gene "DLG1"; 

NC_044285.2	Gnomon	exon	13514122	13514306	.	-	.	gene_id "WDR72"; transcript_id "XM_030498252.1"; db_xref "GeneID:115613137"; gbkey "mRNA"; gene "WDR72"; model_evidence "Supporting evidence includes similarity to: 9 Proteins, 29 long SRA reads, and 88% coverage of the annotated genomic feature by RNAseq alignments"; product "WD repeat domain 72"; exon_number "22"; 
NC_044285.2	Gnomon	exon	13506964	13506969	.	-	.	gene_id "WDR72"; transcript_id "XM_030498252.1"; db_xref "GeneID:115613137"; gbkey "mRNA"; gene "WDR72"; model_evidence "Supporting evidence includes similarity to: 9 Proteins, 29 long SRA reads, and 88% coverage of the annotated genomic feature by RNAseq alignments"; product "WD repeat domain 72"; exon_number "23"; 
NC_044285.2	Gnomon	exon	13482695	13482717	.	-	.	gene_id "WDR72"; transcript_id "XM_030498252.1"; db_xref "GeneID:115613137"; gbkey "mRNA"; gene "WDR72"; model_evidence "Supporting evidence includes similarity to: 9 Proteins, 29 long SRA reads, and 88% coverage of 

NC_044285.2	Gnomon	exon	43984981	43985031	.	+	.	gene_id "DLG3"; transcript_id "XM_030474690.1"; db_xref "GeneID:115603089"; gbkey "mRNA"; gene "DLG3"; model_evidence "Supporting evidence includes similarity to: 3 ESTs, 13 Proteins, 181 long SRA reads, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "discs large MAGUK scaffold protein 3, transcript variant X6"; exon_number "15"; 
NC_044285.2	Gnomon	exon	43985649	43985750	.	+	.	gene_id "DLG3"; transcript_id "XM_030474690.1"; db_xref "GeneID:115603089"; gbkey "mRNA"; gene "DLG3"; model_evidence "Supporting evidence includes similarity to: 3 ESTs, 13 Proteins, 181 long SRA reads, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "discs large MAGUK scaffold protein 3, transcript variant X6"; exon_number "16"; 
NC_044285.2	Gnomon	exon	43985869	43986041	.	+	.	

NC_044285.2	Gnomon	exon	43985869	43986041	.	+	.	gene_id "DLG3"; transcript_id "XM_030474691.1"; db_xref "GeneID:115603089"; gbkey "mRNA"; gene "DLG3"; model_evidence "Supporting evidence includes similarity to: 2 ESTs, 2 Proteins, 180 long SRA reads, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "discs large MAGUK scaffold protein 3, transcript variant X7"; exon_number "9"; 
NC_044285.2	Gnomon	exon	43987101	43987210	.	+	.	gene_id "DLG3"; transcript_id "XM_030474691.1"; db_xref "GeneID:115603089"; gbkey "mRNA"; gene "DLG3"; model_evidence "Supporting evidence includes similarity to: 2 ESTs, 2 Proteins, 180 long SRA reads, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "discs large MAGUK scaffold protein 3, transcript variant X7"; exon_number "10"; 
NC_044285.2	Gnomon	exon	43987865	43987956	.	+	.	gen

NC_046360.1	Gnomon	exon	33019589	33019624	.	+	.	gene_id "MAGI1"; transcript_id "XM_030501211.1"; db_xref "GeneID:115614374"; gbkey "mRNA"; gene "MAGI1"; model_evidence "Supporting evidence includes similarity to: 3 mRNAs, 5 ESTs, 9 Proteins, 41 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "membrane associated guanylate kinase, WW and PDZ domain containing 1, transcript variant X26"; exon_number "7"; 
NC_046360.1	Gnomon	exon	33023852	33023909	.	+	.	gene_id "MAGI1"; transcript_id "XM_030501211.1"; db_xref "GeneID:115614374"; gbkey "mRNA"; gene "MAGI1"; model_evidence "Supporting evidence includes similarity to: 3 mRNAs, 5 ESTs, 9 Proteins, 41 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "membrane associated guanylate kinase, WW and PDZ domain containing 1, transcript v

NC_046360.1	Gnomon	CDS	33083168	33083865	.	+	2	gene_id "MAGI1"; transcript_id "XM_030501203.1"; db_xref "GeneID:115614374"; gbkey "CDS"; gene "MAGI1"; product "membrane-associated guanylate kinase, WW and PDZ domain-containing protein 1 isoform X18"; protein_id "XP_030357063.1"; exon_number "23"; 
NC_046360.1	Gnomon	start_codon	33009639	33009641	.	+	0	gene_id "MAGI1"; transcript_id "XM_030501203.1"; db_xref "GeneID:115614374"; gbkey "CDS"; gene "MAGI1"; product "membrane-associated guanylate kinase, WW and PDZ domain-containing protein 1 isoform X18"; protein_id "XP_030357063.1"; exon_number "3"; 
NC_046360.1	Gnomon	stop_codon	33083866	33083868	.	+	0	gene_id "MAGI1"; transcript_id "XM_030501203.1"; db_xref "GeneID:115614374"; gbkey "CDS"; gene "MAGI1"; product "membrane-associated guanylate kinase, WW and PDZ domain-containing protein 1 isoform X18"; protein_id "XP_030357063.1"; exon_number "23"; 
NC_046360.1	Gnomon	exon	32959112	32959449	.	+	.	gene_id "MAGI1"; transcript_id "XM_030

NC_046360.1	Gnomon	CDS	33047657	33047688	.	+	2	gene_id "MAGI1"; transcript_id "XM_030501205.1"; db_xref "GeneID:115614374"; gbkey "CDS"; gene "MAGI1"; product "membrane-associated guanylate kinase, WW and PDZ domain-containing protein 1 isoform X20"; protein_id "XP_030357065.1"; exon_number "11"; 
NC_046360.1	Gnomon	CDS	33055137	33055353	.	+	0	gene_id "MAGI1"; transcript_id "XM_030501205.1"; db_xref "GeneID:115614374"; gbkey "CDS"; gene "MAGI1"; product "membrane-associated guanylate kinase, WW and PDZ domain-containing protein 1 isoform X20"; protein_id "XP_030357065.1"; exon_number "12"; 
NC_046360.1	Gnomon	CDS	33061171	33061248	.	+	2	gene_id "MAGI1"; transcript_id "XM_030501205.1"; db_xref "GeneID:115614374"; gbkey "CDS"; gene "MAGI1"; product "membrane-associated guanylate kinase, WW and PDZ domain-containing protein 1 isoform X20"; protein_id "XP_030357065.1"; exon_number "13"; 
NC_046360.1	Gnomon	CDS	33063920	33064011	.	+	2	gene_id "MAGI1"; transcript_id "XM_030501205.1"; db_x

NC_044290.2	Gnomon	start_codon	3218377	3218379	.	+	0	gene_id "MGAT5B"; transcript_id "XM_030506503.2"; db_xref "GeneID:115616797"; gbkey "CDS"; gene "MGAT5B"; product "alpha-1,6-mannosylglycoprotein 6-beta-N-acetylglucosaminyltransferase B isoform X2"; protein_id "XP_030362363.1"; exon_number "1"; 
NC_044290.2	Gnomon	stop_codon	3283975	3283977	.	+	0	gene_id "MGAT5B"; transcript_id "XM_030506503.2"; db_xref "GeneID:115616797"; gbkey "CDS"; gene "MGAT5B"; product "alpha-1,6-mannosylglycoprotein 6-beta-N-acetylglucosaminyltransferase B isoform X2"; protein_id "XP_030362363.1"; exon_number "17"; 
NC_044293.2	Gnomon	exon	5306473	5306628	.	-	.	gene_id "APLP2"; transcript_id "XM_030505543.1"; db_xref "GeneID:115616386"; gbkey "mRNA"; gene "APLP2"; model_evidence "Supporting evidence includes similarity to: 22 ESTs, 12 Proteins, 2354 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 8 samples with support for all annotated introns"; product "amyl

NC_044295.2	Gnomon	CDS	5651940	5652058	.	+	2	gene_id "MPP2"; transcript_id "XM_030508739.1"; db_xref "GeneID:115617798"; gbkey "CDS"; gene "MPP2"; product "MAGUK p55 subfamily member 2 isoform X4"; protein_id "XP_030364599.1"; exon_number "2"; 
NC_044295.2	Gnomon	CDS	5652872	5653024	.	+	0	gene_id "MPP2"; transcript_id "XM_030508739.1"; db_xref "GeneID:115617798"; gbkey "CDS"; gene "MPP2"; product "MAGUK p55 subfamily member 2 isoform X4"; protein_id "XP_030364599.1"; exon_number "3"; 
NC_044295.2	Gnomon	CDS	5653114	5653260	.	+	0	gene_id "MPP2"; transcript_id "XM_030508739.1"; db_xref "GeneID:115617798"; gbkey "CDS"; gene "MPP2"; product "MAGUK p55 subfamily member 2 isoform X4"; protein_id "XP_030364599.1"; exon_number "4"; 
NC_044295.2	Gnomon	CDS	5653780	5654007	.	+	0	gene_id "MPP2"; transcript_id "XM_030508739.1"; db_xref "GeneID:115617798"; gbkey "CDS"; gene "MPP2"; product "MAGUK p55 subfamily member 2 isoform X4"; protein_id "XP_030364599.1"; exon_number "5"; 
NC_044295.2	Gnom

NC_044301.2	Gnomon	exon	21270664	21270758	.	+	.	gene_id "NIP7"; transcript_id "XM_030511327.1"; db_xref "GeneID:115619283"; gbkey "mRNA"; gene "NIP7"; model_evidence "Supporting evidence includes similarity to: 1 Protein, and 77% coverage of the annotated genomic feature by RNAseq alignments"; partial "true"; product "nucleolar pre-rRNA processing protein NIP7"; exon_number "1"; 
NC_044301.2	Gnomon	exon	21270831	21270969	.	+	.	gene_id "NIP7"; transcript_id "XM_030511327.1"; db_xref "GeneID:115619283"; gbkey "mRNA"; gene "NIP7"; model_evidence "Supporting evidence includes similarity to: 1 Protein, and 77% coverage of the annotated genomic feature by RNAseq alignments"; partial "true"; product "nucleolar pre-rRNA processing protein NIP7"; exon_number "2"; 
NC_044301.2	Gnomon	exon	21271046	21271186	.	+	.	gene_id "NIP7"; transcript_id "XM_030511327.1"; db_xref "GeneID:115619283"; gbkey "mRNA"; gene "NIP7"; model_evidence "Supporting evidence includes similarity to: 1 Protein, and 77% co

NC_044302.2	Gnomon	exon	2953621	2953733	.	-	.	gene_id "ZNF469"; transcript_id "XM_030512476.1"; db_xref "GeneID:115619726"; gbkey "mRNA"; gene "ZNF469"; model_evidence "Supporting evidence includes similarity to: 2 Proteins, 4 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments"; partial "true"; product "zinc finger protein 469"; exon_number "2"; 
NC_044302.2	Gnomon	exon	2924784	2924853	.	-	.	gene_id "ZNF469"; transcript_id "XM_030512476.1"; db_xref "GeneID:115619726"; gbkey "mRNA"; gene "ZNF469"; model_evidence "Supporting evidence includes similarity to: 2 Proteins, 4 long SRA reads, and 99% coverage of the annotated genomic feature by RNAseq alignments"; partial "true"; product "zinc finger protein 469"; exon_number "3"; 
NC_044302.2	Gnomon	exon	2910067	2920705	.	-	.	gene_id "ZNF469"; transcript_id "XM_030512476.1"; db_xref "GeneID:115619726"; gbkey "mRNA"; gene "ZNF469"; model_evidence "Supporting evidence includes similarity to: 2 Proteins, 4 l




In [170]:
! zcat /home/david/data/SNPeff_db/genes.gtf.gz| wc -l

1386827


In [179]:
[w for w in dodgy_genes if w.feature_id == "XM_030489726"]

[]
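
The empty list may not mean the transcript is free of messages: the query here has no version suffix, whereas the transcript IDs in the GTF carry one (`XM_030489726.1`). Here is a small sketch of a version-insensitive comparison. `strip_version` and `same_transcript` are hypothetical helpers, and I am assuming the version is always a trailing `.N` with numeric `N`.

```python
def strip_version(accession):
    """Drop a trailing '.N' version from a RefSeq-style accession, if present."""
    base, dot, version = accession.rpartition(".")
    return base if dot and version.isdigit() else accession


def same_transcript(feature_id, query):
    """Compare two transcript IDs while ignoring any version suffix."""
    return strip_version(feature_id) == strip_version(query)


print(same_transcript("XM_030489726.1", "XM_030489726"))
print(strip_version("unknown_transcript_1"))
```

With these, the check could be written as `[w for w in dodgy_genes if same_transcript(w.feature_id, "XM_030489726")]`.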

In [198]:
! grep "OBSCN" warning_summary.tsv



In [184]:
! grep MPP6 unique_warning_genes.list

In [200]:
! zcat /home/david/data/SNPeff_db/genes.gtf.gz |  grep "OBSCN"

NC_044277.2	Gnomon	gene	99846217	100029216	.	+	.	gene_id "OBSCN"; db_xref "GeneID:115606602"; gbkey "Gene"; gene "OBSCN"; gene_biotype "protein_coding"; 
NC_044277.2	Gnomon	exon	99846217	99846272	.	+	.	gene_id "OBSCN"; transcript_id "XM_030482854.1"; db_xref "GeneID:115606602"; gbkey "mRNA"; gene "OBSCN"; model_evidence "Supporting evidence includes similarity to: 2 mRNAs, 5 ESTs, 2 Proteins, and 87% coverage of the annotated genomic feature by RNAseq alignments"; product "obscurin, cytoskeletal calmodulin and titin-interacting RhoGEF"; exon_number "1"; 
NC_044277.2	Gnomon	exon	99855236	99856380	.	+	.	gene_id "OBSCN"; transcript_id "XM_030482854.1"; db_xref "GeneID:115606602"; gbkey "mRNA"; gene "OBSCN"; model_evidence "Supporting evidence includes similarity to: 2 mRNAs, 5 ESTs, 2 Proteins, and 87% coverage of the annotated genomic feature by RNAseq alignments"; product "obscurin, cytoskeletal calmodulin and titin-interacting RhoGEF"; exon_number "2"; 
NC_044277.2	Gnomon	exon	99862
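
So OBSCN and MPP6 are not in the warning lists, yet they turn up in the grep output. A likely explanation for at least some of these hits is that `grep -f` treats every line of the list as a pattern that can match anywhere in a GTF line: `MAG` matches the "MAGUK" in the MPP6 product description, and `FIBP` matches `PPFIBP1`. Likewise `grep exon` matches `exon_number`, which would let CDS lines through. Here is a sketch of a stricter filter that compares the `gene_id` attribute exactly. `gtf_lines_for_genes` and `load_gene_list` are hypothetical helpers, and the demo lines are made up to mimic the GTF layout.

```python
import gzip
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def load_gene_list(path):
    """Read one gene ID per line into a set."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def gtf_lines_for_genes(gtf_lines, wanted_genes, feature=None):
    """Yield GTF lines whose gene_id is exactly one of wanted_genes.

    If feature is given, column 3 must equal it as well.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        if feature is not None and cols[2] != feature:
            continue
        match = GENE_ID_RE.search(cols[8])
        if match and match.group(1) in wanted_genes:
            yield line


# Made-up lines mimicking the GTF layout, for illustration only
demo = [
    'chr1\tGnomon\texon\t1\t10\t.\t+\t.\tgene_id "MAG"; transcript_id "T1"; \n',
    'chr1\tGnomon\texon\t20\t30\t.\t+\t.\tgene_id "MPP6"; product "MAGUK p55 subfamily member 6"; \n',
    'chr1\tGnomon\tCDS\t1\t10\t.\t+\t0\tgene_id "MAG"; exon_number "1"; \n',
]
hits = list(gtf_lines_for_genes(demo, {"MAG"}, feature="exon"))
print(len(hits))

# On the real files this might look like:
# wanted = load_gene_list("unique_warning_genes.list")
# with gzip.open("/home/david/data/SNPeff_db/genes.gtf.gz", "rt") as gtf:
#     for line in gtf_lines_for_genes(gtf, wanted, feature="exon"):
#         print(line, end="")
```

Matching on the parsed attribute rather than the raw line avoids hits that come from product descriptions or from gene names that contain a shorter gene name.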

In [201]:
transcript_dict = defaultdict(set)
for anno in dodgy_genes:
    transcript_dict[anno.messages].add(anno.feature_id)

In [205]:
with open("unique_warning_transcripts.list", "w") as out:
    for msg, T in transcript_dict.items():
        if msg.startswith("WARNING"):
            for transcript in T:
                out.write(transcript +"\n")


In [211]:
! wc unique_warning_transcripts.list



In [219]:
! zcat /home/david/data/SNPeff_db/genes.gtf.gz | grep -f unique_warning_transcripts.list | grep CDS | grep -v "LOW Q" | grep -v "except" | grep -v "partial"

In [208]:
! head unique_warning_transcripts.list

XM_030489541.1
XM_030494723.1
XM_030509761.1
unknown_transcript_1
XM_030501634.1
XM_030470721.2
XM_030470565.1
XM_030489711.1
XM_030492783.1
XM_030511325.1


In [None]:
! zcat /home/david/data/SNPeff_db/genes.gtf.gz | grep -f unique_warning_genes.list | bedtools intersect -b annotated.vcf -a - | wc -l