# AnnotQC

Run the cells in this notebook to assess the quality of a genome annotation. The latest sheep (*Ovis aries*) genome annotation is used here as an example, but the URLs below can be changed to point to any other matching annotation report and GFF3 file.

In [9]:
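from annotqc_functions import *
# Explicit import for the widgets.HTML calls below; assumes ipywidgets is the
# widget library (the star import above may already provide this name).
import ipywidgets as widgets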

## Specify and download annotation files

Enter the URLs of a matching pair of files from the same annotation release: the annotation report XML file and the GFF3 annotation file.

In [10]:
# GFF3 annotation file (gzip-compressed on the server; curl saves the bytes as-is)
gff3_url = "https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9940/104/GCF_016772045.1_ARS-UI_Ramb_v2.0/GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.gff.gz"
annot_gff = "annotation.gff3"
!curl -Ss {gff3_url} -o {annot_gff}

# Annotation report XML file
xml_url = "https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9940/104/GCF_016772045.1_ARS-UI_Ramb_v2.0/Ovis_aries_AR104_annotation_report.xml"
annot_xml = "annot_report.xml"
!curl -Ss {xml_url} -o {annot_xml}
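
The GFF3 URL above ends in `.gz`, and `curl -o` writes the bytes exactly as received, so `annotation.gff3` may still be gzip-compressed on disk despite its name. The following is a minimal sketch for checking that and reading the file either way. The helpers `is_gzipped`, `open_text_maybe_gzip`, and `first_data_line` are hypothetical names introduced for illustration; they are not part of `annotqc_functions`, and whether the AnnotQC functions decompress the file themselves is not stated in this notebook.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text_maybe_gzip(path):
    """Open `path` for text reading, decompressing on the fly if needed."""
    if is_gzipped(path):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def first_data_line(path):
    """Return the first non-blank line that does not start with '#', or None."""
    with open_text_maybe_gzip(path) as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                return line.rstrip("\n")
    return None
```

Calling `first_data_line(annot_gff)` after the download is a quick way to confirm that the file contains tab-separated GFF3 records rather than an HTML error page.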

## Gene features

In [11]:
create_genes_plot(annot_xml)

| | protein coding | non coding | non-transcribed pseudo | Ig TCR segment | transcribed pseudo | other |
|---|---|---|---|---|---|---|
| genes | 21257.0 | 8720.0 | 3615.0 | 200.0 | 9.0 | 0.0 |
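
The raw counts are easier to compare across annotation releases when expressed as shares of all genes. A small sketch, using the counts copied from the output above; `category_shares` is a hypothetical helper, not an AnnotQC function.

```python
# Gene counts copied from the create_genes_plot output above.
gene_counts = {
    "protein coding": 21257,
    "non coding": 8720,
    "non-transcribed pseudo": 3615,
    "Ig TCR segment": 200,
    "transcribed pseudo": 9,
    "other": 0,
}


def category_shares(counts):
    """Return each category's percentage of the total count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genes to summarise")
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = category_shares(gene_counts)
# Print categories from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>24}: {pct:6.2f}%")
```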


In [16]:
df = tabulate_genes_attributes(annot_xml)
widgets.HTML(get_html(df, width=950))

| | has variants | partial | major correction | minor correction | premature stop | has frameshifts |
|---|---|---|---|---|---|---|
| genes | 11749 | 140 | 794 | 414 | 719 | 657 |



## Transcript features

In [13]:
create_transcripts_plots(annot_xml)

| | model RefSeq | known RefSeq |
|---|---|---|
| mRNAs | 59726.0 | 889.0 |
| non-coding RNAs | 9731.0 | 106.0 |
| pseudo transcripts | 8.0 | 1.0 |
| CDSs | 59726.0 | 902.0 |


In [14]:
df = tabulate_tx_attributes(annot_xml)
widgets.HTML(get_html(df, width=950))

| | exon <= 3nt | partial | correction | known RefSeq with correction | fully supported | ab initio > 5% | has gaps | model RefSeq with correction | total | major correction | minor correction | premature stop | has frameshifts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mRNAs | 127 | 137 | 1335 | 729 | 58423 | 1478 | 0.0 | 606.0 | 60615 | | | | |
| non-coding RNAs | 6 | 0 | 0 | 2 | 8107 | 0 | 0.0 | | 9837 | | | | |
| pseudo transcripts | 0 | 0 | 0 | 0 | 8 | 0 | 0.0 | | 9 | | | | |
| CDSs | 1058 | 107 | 1281 | 475 | 58423 | 1640 | | 606.0 | 60628 | 796.0 | 415.0 | 720.0 | 660.0 |
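
Two quick checks can be made on these numbers: each `total` should equal the model RefSeq count plus the known RefSeq count from the `create_transcripts_plots` output, and the `fully supported` column can be read as a percentage of that total. The sketch below hard-codes the values shown in the two outputs above; `mismatched_totals` and `fully_supported_pct` are hypothetical helpers, not AnnotQC functions.

```python
# (model RefSeq, known RefSeq) counts from the create_transcripts_plots output.
model_known = {
    "mRNAs": (59726, 889),
    "non-coding RNAs": (9731, 106),
    "pseudo transcripts": (8, 1),
    "CDSs": (59726, 902),
}

# (total, fully supported) counts from the tabulate_tx_attributes output.
total_supported = {
    "mRNAs": (60615, 58423),
    "non-coding RNAs": (9837, 8107),
    "pseudo transcripts": (9, 8),
    "CDSs": (60628, 58423),
}


def mismatched_totals(parts, totals):
    """Return the row names whose total differs from model + known."""
    return [row for row, (model, known) in parts.items()
            if model + known != totals[row][0]]


def fully_supported_pct(totals):
    """Return the percentage of fully supported features per row."""
    return {row: 100.0 * supported / total
            for row, (total, supported) in totals.items()}


print("rows with inconsistent totals:", mismatched_totals(model_known, total_supported))
for row, pct in fully_supported_pct(total_supported).items():
    print(f"{row:>20}: {pct:5.2f}% fully supported")
```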



## Long-read RNA-Seq alignments 

In [15]:
df = tabulate_longread_aligns(annot_xml)
widgets.HTML(get_html(df))

| run | sample | num_reads | avg_read_length | AlignedReads | AlignedReadsPct | AlignmentCount | PctCoverage | PctIdentity |
|---|---|---|---|---|---|---|---|---|
| SRR11036012 | SAMN14053364 | 674893 | 2303 | 611413 | 90.59 | 615418 | 94.68 | 98.97 |
| SRR11036013 | SAMN14053363 | 683822 | 1741 | 624019 | 91.25 | 625854 | 93.11 | 99.17 |
| SRR11036014 | SAMN14053362 | 242908 | 2061 | 190247 | 78.32 | 190286 | 83.63 | 98.96 |
| SRR11036015 | SAMN14053361 | 191254 | 2077 | 159044 | 83.15 | 159069 | 88.64 | 99.08 |
| SRR11036016 | SAMN14053360 | 233611 | 2445 | 168716 | 72.22 | 168759 | 82.99 | 98.86 |
| SRR8173311 | SAMEA104495026 | 180833 | 734 | 110699 | 61.21 | 111447 | 98.01 | 97.38 |
| SRR8173312 | SAMEA104495022 | 483508 | 1515 | 370455 | 76.61 | 370801 | 98.75 | 97.97 |
| SRR8173313 | SAMEA104495023 | 391485 | 1502 | 298878 | 76.34 | 299994 | 98.44 | 97.35 |


## Short-read RNA-Seq alignments