# Orthogroups QA
This notebook is dedicated to QA of orthogroup clustering results. It is based on a small data set with genes from 6 samples, clustered with GET_HOMOLOGUES-EST (GHE) and OrthoFinder2 (OF2).
I start with unfiltered annotation results; the main goal is to use the clustering results to find the best filtering criteria.
Two of the 6 samples, <i>ITAG3.2</i> and <i>heinz1706_3_00__ver100</i>, are the same genome with the official annotation and my annotation, respectively.

In [2]:
import pandas as pd
import numpy as np
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

In [2]:
### GHE result files
cluster_stats_f = "/groups/itay_mayrose/nosnap/liorglic/Projects/tomato_pan_genome/output/manual_GHE/data_est_homologues/cluster_stats.tsv"
gene_stats_f = "/groups/itay_mayrose/nosnap/liorglic/Projects/tomato_pan_genome/output/manual_GHE/data_est_homologues/gene_stats.tsv"
dix_annotation_qa_f = "/groups/itay_mayrose/nosnap/liorglic/Projects/tomato_pan_genome/output/genome_annotation/annotation_results/per_sample/dixie_golden_giant_tr00020__ver100/Annotation_QA/dixie_golden_giant_tr00020__ver100.QA_report.tsv"

### General exploration of cluster stats

In [3]:
# read the per-sample columns as strings (each cell is "<count>;<gene list>")
dtype_dict = {'ITAG3.2': str,
              'dixie_golden_giant_tr00020__ver100': str,
              'heinz1706_3_00__ver100': str,
              'la4133_tr00026__ver100': str,
              'moneymaker_la2706_pi262996__ver100': str,
              'pi311117_ea05701__ver100': str}
cluster_stats_df = pd.read_csv(cluster_stats_f, sep='\t', dtype=dtype_dict)
cluster_stats_df.head()

Unnamed: 0,Cluster name,Occupancy,Size,ITAG3.2,dixie_golden_giant_tr00020__ver100,heinz1706_3_00__ver100,la4133_tr00026__ver100,moneymaker_la2706_pi262996__ver100,pi311117_ea05701__ver100
0,1_Solyc01g111080.3.1,6,35,"6;Solyc01g111080.3.1,Solyc01g111070.3.1,Solyc0...",6;Solyc01g111040.3.1__maker-1_91126460_9284156...,5;maker-1_96608960_98455869-augustus-gene-1.9-...,6;augustus-1_92287730_93987019-processed-gene-...,7;augustus-1_94349354_96655447-processed-gene-...,5;augustus-1_87883719_90546862-processed-gene-...
1,2_Solyc02g089263.1.1,6,12,"2;Solyc02g089263.1.1,Solyc02g089260.3.1",2;augustus-2_45563230_48243420-processed-gene-...,2;maker-2_49684608_52444864-augustus-gene-4.31...,2;augustus-2_46143865_48858210-processed-gene-...,2;Solyc02g089260.3.1__maker-2_47174677_4994965...,2;augustus-2_45273431_47936574-processed-gene-...
2,3_Solyc01g106230.3.1,6,15,"4;Solyc01g106230.3.1,Solyc01g106253.1.1,Solyc0...",2;Solyc01g106220.3.1__maker-1_85766080_8844627...,3;augustus-1_93848704_96608960-processed-gene-...,2;Solyc01g106220.3.1__maker-1_86859040_8957338...,2;augustus-1_91574373_94349354-processed-gene-...,2;Solyc01g106220.3.1__maker-1_85220576_8788371...
3,4_Solyc08g078530.3.1,6,12,"2;Solyc08g078530.3.1,Solyc08g078540.3.1",2;maker-8_56283990_58964180-augustus-gene-3.11...,2;maker-8_60725632_63485888-augustus-gene-3.12...,2;maker-8_57001245_59715590-augustus-gene-3.11...,2;maker-8_58274601_61049582-augustus-gene-2.4-...,2;maker-8_55926003_58589146-augustus-gene-4.17...
4,5_Solyc02g083900.3.1,6,6,1;Solyc02g083900.3.1,1;Solyc02g083775.1.1__maker-2_40202850_4288304...,1;Solyc02g083775.1.1__maker-2_46924352_4968460...,1;Solyc02g083775.1.1__maker-2_40715175_4342952...,1;Solyc02g083775.1.1__maker-2_41624715_4439969...,1;Solyc02g083775.1.1__maker-2_39947145_4261028...


In [4]:
trace = go.Histogram(x=cluster_stats_df['Occupancy'])
layout = go.Layout(
    title='Histogram of cluster occupancy',
    xaxis=dict(
        title='Occupancy'
    ),
    yaxis=dict(
        title='# of clusters'
    )
)
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [5]:
# only clusters with fewer than 100 genes are plotted
trace = go.Histogram(x=cluster_stats_df.loc[cluster_stats_df['Size'] < 100, 'Size'])
layout = go.Layout(
    title='Histogram of cluster size',
    xaxis=dict(
        title='Cluster size'
    ),
    yaxis=dict(
        title='# of clusters'
    )
)
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

### Explore singleton properties in dixie
Join the annotation QA table of dixie with the clustering info for genes from this sample to check if singletons have characteristic annotation stats.

In [6]:
gene_stats_df = pd.read_csv(gene_stats_f, sep='\t')
dix_annotation_qa_df = pd.read_csv(dix_annotation_qa_f, sep='\t')

In [7]:
# take an explicit copy so later modifications do not trigger SettingWithCopyWarning
dix_gene_stats_df = gene_stats_df.loc[gene_stats_df['Sample'] == "dixie_golden_giant_tr00020__ver100"].copy()
dix_gene_stats_df = dix_gene_stats_df.rename(columns={"Gene": "gene"})
# index by gene ID, keeping 'gene' as a regular column as well
dix_gene_stats_df = dix_gene_stats_df.set_index('gene', drop=False)
dix_gene_stats_df.head()

gene (index),gene,Sample,Cluster name,Cluster Occupancy,Cluster size,Copies,Cluster samples
Solyc01g111040.3.1__maker-1_91126460_92841563-pred_gff_maker-gene-1.33-mRNA-1,Solyc01g111040.3.1__maker-1_91126460_92841563-...,dixie_golden_giant_tr00020__ver100,1_Solyc01g111080.3.1,6,35,6,ITAG3.2:6;dixie_golden_giant_tr00020__ver100:6...
maker-1_91126460_92841563-augustus-gene-1.171-mRNA-1__maker-1_91126460_92841563-augustus-gene-1.171-mRNA-1,maker-1_91126460_92841563-augustus-gene-1.171-...,dixie_golden_giant_tr00020__ver100,1_Solyc01g111080.3.1,6,35,6,ITAG3.2:6;dixie_golden_giant_tr00020__ver100:6...
augustus-1_91126460_92841563-processed-gene-1.291-mRNA-1__augustus-1_91126460_92841563-processed-gene-1.291-mRNA-1,augustus-1_91126460_92841563-processed-gene-1....,dixie_golden_giant_tr00020__ver100,1_Solyc01g111080.3.1,6,35,6,ITAG3.2:6;dixie_golden_giant_tr00020__ver100:6...
maker-1_91126460_92841563-augustus-gene-1.174-mRNA-1__maker-1_91126460_92841563-augustus-gene-1.174-mRNA-1,maker-1_91126460_92841563-augustus-gene-1.174-...,dixie_golden_giant_tr00020__ver100,1_Solyc01g111080.3.1,6,35,6,ITAG3.2:6;dixie_golden_giant_tr00020__ver100:6...
Solyc01g111075.1.1__maker-1_91126460_92841563-pred_gff_maker-gene-1.119-mRNA-1,Solyc01g111075.1.1__maker-1_91126460_92841563-...,dixie_golden_giant_tr00020__ver100,1_Solyc01g111080.3.1,6,35,6,ITAG3.2:6;dixie_golden_giant_tr00020__ver100:6...


In [8]:
dix_annotation_qa_df.set_index(dix_annotation_qa_df['gene'], inplace=True)
dix_annotation_qa_df.head()

gene (index),gene,Chromosome,AED,Exons,UTR,Repeats,BUSCO,BLAST,IPS
Solyc00g005050.3.1__maker-10000001_2680190_5360380-pred_gff_maker-gene-4.625-mRNA-1,Solyc00g005050.3.1__maker-10000001_2680190_536...,10000001,0.11,5,2,4.419116,,,4.24e-18
Solyc00g005060.1.1__maker-10000001_5360380_8040570-pred_gff_maker-gene-0.110-mRNA-1,Solyc00g005060.1.1__maker-10000001_5360380_804...,10000001,0.01,2,2,0.0,,,
Solyc00g005080.2.1__maker-10000001_0_2680190-pred_gff_maker-gene-1.0-mRNA-1,Solyc00g005080.2.1__maker-10000001_0_2680190-p...,10000001,0.12,1,0,2.439024,,85.0,
Solyc00g005080.2.1__maker-6_24121710_26801900-pred_gff_maker-gene-3.17-mRNA-1,Solyc00g005080.2.1__maker-6_24121710_26801900-...,6,0.11,1,2,0.0,,,
Solyc00g005090.1.1__maker-10000001_0_2680190-pred_gff_maker-gene-4.197-mRNA-1,Solyc00g005090.1.1__maker-10000001_0_2680190-p...,10000001,0.01,1,0,0.0,,,


In [9]:
dix_joint_stats_df = dix_annotation_qa_df.join(dix_gene_stats_df, lsuffix="_ann", rsuffix='_GHE')

In [10]:
dix_joint_stats_df.shape

(48334, 16)
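
The join above is an index-based left join, so every gene in the QA table is kept and genes without a clustering record get missing values. The float-valued occupancy columns in the joined table suggest that at least some rows had no match. Below is a minimal sketch of how such rows could be counted; the toy tables and gene IDs are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for the QA table and the per-gene clustering table (hypothetical values)
qa_toy = pd.DataFrame(
    {"gene": ["g1", "g2", "g3"], "AED": [0.05, 0.40, 0.90]}
).set_index("gene", drop=False)
ghe_toy = pd.DataFrame(
    {"gene": ["g1", "g2"], "Cluster Occupancy": [6, 1]}
).set_index("gene", drop=False)

# DataFrame.join is a left join on the index by default;
# the suffixes disambiguate the duplicated 'gene' column
joint_toy = qa_toy.join(ghe_toy, lsuffix="_ann", rsuffix="_GHE")

# Genes present in the QA table but missing from the clustering output
unclustered = joint_toy["Cluster Occupancy"].isna()
print(joint_toy.shape)         # rows follow the left (QA) table
print(int(unclustered.sum()))  # number of QA genes without a cluster
```

Running the same `isna()` check on `dix_joint_stats_df['Cluster Occupancy']` would show how many annotated dixie genes never reached the clustering step.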

In [11]:
dix_joint_stats_df.head()

gene (index),gene_ann,Chromosome,AED,Exons,UTR,Repeats,BUSCO,BLAST,IPS,gene_GHE,Sample,Cluster name,Cluster Occupancy,Cluster size,Copies,Cluster samples
Solyc00g005050.3.1__maker-10000001_2680190_5360380-pred_gff_maker-gene-4.625-mRNA-1,Solyc00g005050.3.1__maker-10000001_2680190_536...,10000001,0.11,5,2,4.419116,,,4.24e-18,Solyc00g005050.3.1__maker-10000001_2680190_536...,dixie_golden_giant_tr00020__ver100,21567_Solyc00g005050.3.1,6.0,6.0,1.0,ITAG3.2:1;dixie_golden_giant_tr00020__ver100:1...
Solyc00g005060.1.1__maker-10000001_5360380_8040570-pred_gff_maker-gene-0.110-mRNA-1,Solyc00g005060.1.1__maker-10000001_5360380_804...,10000001,0.01,2,2,0.0,,,,Solyc00g005060.1.1__maker-10000001_5360380_804...,dixie_golden_giant_tr00020__ver100,32406_Solyc00g005060.1.1,4.0,4.0,1.0,ITAG3.2:1;dixie_golden_giant_tr00020__ver100:1...
Solyc00g005080.2.1__maker-10000001_0_2680190-pred_gff_maker-gene-1.0-mRNA-1,Solyc00g005080.2.1__maker-10000001_0_2680190-p...,10000001,0.12,1,0,2.439024,,85.0,,Solyc00g005080.2.1__maker-10000001_0_2680190-p...,dixie_golden_giant_tr00020__ver100,32032_Solyc00g005080.2.1,5.0,5.0,1.0,ITAG3.2:1;dixie_golden_giant_tr00020__ver100:1...
Solyc00g005080.2.1__maker-6_24121710_26801900-pred_gff_maker-gene-3.17-mRNA-1,Solyc00g005080.2.1__maker-6_24121710_26801900-...,6,0.11,1,2,0.0,,,,Solyc00g005080.2.1__maker-6_24121710_26801900-...,dixie_golden_giant_tr00020__ver100,80415_Solyc00g005080.2.1__maker-6_24121710_268...,5.0,5.0,1.0,dixie_golden_giant_tr00020__ver100:1;heinz1706...
Solyc00g005090.1.1__maker-10000001_0_2680190-pred_gff_maker-gene-4.197-mRNA-1,Solyc00g005090.1.1__maker-10000001_0_2680190-p...,10000001,0.01,1,0,0.0,,,,Solyc00g005090.1.1__maker-10000001_0_2680190-p...,dixie_golden_giant_tr00020__ver100,32094_Solyc00g005090.1.1,6.0,6.0,1.0,ITAG3.2:1;dixie_golden_giant_tr00020__ver100:1...


In [12]:
singletons = go.Histogram(x=dix_joint_stats_df.loc[dix_joint_stats_df['Cluster Occupancy'] == 1]['AED'],
                          name="Singletons",
                         opacity=0.5)
others = go.Histogram(x=dix_joint_stats_df.loc[dix_joint_stats_df['Cluster Occupancy'] > 1]['AED'],
                      name="Other",
                     opacity=0.5)
data=[singletons,others]
layout = go.Layout(
    title='Histogram of AED - Singletons vs. other',
    xaxis=dict(
        title='AED'
    ),
    yaxis=dict(
        title='# of genes'
    ),
    barmode='overlay'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
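
An overlaid histogram is easier to interpret next to a few numbers. The following is a sketch of a per-group numeric summary that could complement the plot above; the toy values are invented for illustration, and only the column names `AED` and `Cluster Occupancy` come from the joined table.

```python
import pandas as pd

# Hypothetical toy data: AED per gene plus the occupancy of its cluster
toy = pd.DataFrame({
    "AED": [0.02, 0.10, 0.15, 0.55, 0.80, 0.95],
    "Cluster Occupancy": [6, 6, 5, 1, 1, 1],
})

# Genes without a cluster (missing occupancy) belong to neither group
toy = toy.dropna(subset=["Cluster Occupancy"]).copy()

# Label each gene as singleton (occupancy 1) or other
toy["group"] = toy["Cluster Occupancy"].eq(1).map({True: "Singletons", False: "Other"})

# Per-group gene count and AED quartiles
summary = toy.groupby("group")["AED"].describe()[["count", "25%", "50%", "75%"]]
print(summary)
```

If the singleton median AED is clearly higher than that of the other genes, an AED cutoff becomes a plausible filtering criterion.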

In [13]:
# chromosome ID 10000001 is treated as unmapped sequence here (see trace names)
unmapped = go.Histogram(x=dix_joint_stats_df.loc[dix_joint_stats_df['Chromosome'] == 10000001, 'AED'],
                        name="Unmapped",
                        opacity=0.5)
mapped = go.Histogram(x=dix_joint_stats_df.loc[dix_joint_stats_df['Chromosome'] != 10000001, 'AED'],
                      name="Other",
                      opacity=0.5)
data = [unmapped, mapped]
layout = go.Layout(
    title='Histogram of AED - Unmapped vs. other',
    xaxis=dict(
        title='AED'
    ),
    yaxis=dict(
        title='# of genes'
    ),
    barmode='overlay'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

### Check agreement between ITAG annotation and my annotation of Heinz1706

In [14]:
# difference in gene count per cluster between ITAG and heinz (and between ITAG and dixie, for comparison)
itag_gene_cn = cluster_stats_df["ITAG3.2"].apply(lambda x: int(x.split(';')[0]))
heinz_gene_cn = cluster_stats_df["heinz1706_3_00__ver100"].apply(lambda x: int(x.split(';')[0]))
dixie_gene_cn = cluster_stats_df["dixie_golden_giant_tr00020__ver100"].apply(lambda x: int(x.split(';')[0]))
itag_delta_heinz = itag_gene_cn - heinz_gene_cn
itag_delta_dixie = itag_gene_cn - dixie_gene_cn
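
The per-sample cells have the form `<count>;<gene list>`, and the code above trusts the leading count. A small sketch of a parser that also cross-checks the count against the listed genes is shown below. The helper `parse_cell` is hypothetical, and the encoding of an empty cell (here `"0"`) is an assumption I have not verified against the GHE output.

```python
import pandas as pd

def parse_cell(cell):
    """Split a 'count;gene1,gene2,...' cell into (count, [genes])."""
    count_str, _, genes_str = cell.partition(";")
    genes = genes_str.split(",") if genes_str else []
    return int(count_str), genes

# Toy column; the last value assumes absent samples are stored as a bare "0"
toy_col = pd.Series(["2;geneA,geneB", "1;geneC", "0"])

parsed = toy_col.apply(parse_cell)
counts = parsed.apply(lambda p: p[0])

# Sanity check: the declared count should equal the number of listed genes
consistent = parsed.apply(lambda p: p[0] == len(p[1]))
print(counts.tolist(), bool(consistent.all()))
```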

In [15]:
heinz_trace = go.Histogram(x=itag_delta_heinz,
                           name="ITAG - heinz",
                           opacity=0.5)
dixie_trace = go.Histogram(x=itag_delta_dixie,
                           name="ITAG - dixie",
                           opacity=0.5)
data = [heinz_trace, dixie_trace]
layout = go.Layout(
    title='Histogram of difference in gene number per cluster',
    xaxis=dict(
        title='diff'
    ),
    yaxis=dict(
        title='# of clusters'
    ),
    barmode='overlay'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [16]:
# estimate of false PAVs: number of clusters present in only one of ITAG / my heinz annotation (same genome)
cn_df = pd.DataFrame({"ITAG": itag_gene_cn, "heinz": heinz_gene_cn, "dixie": dixie_gene_cn})
len(cn_df.loc[((cn_df["ITAG"] == 0) & (cn_df["heinz"] != 0)) | ((cn_df["ITAG"] != 0) & (cn_df["heinz"] == 0))])

14537

In [17]:
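# estimate of false CNVs: number of clusters whose gene count differs between ITAG and my heinz annotation (includes the PAV cases above)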
len(cn_df.loc[cn_df["ITAG"] != cn_df["heinz"]])

16471

Most of the disagreement is PAV (14,537 of the 16,471 clusters with differing gene counts, about 88%), but I need to check whether this is due to junk singletons.
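
Every PAV disagreement is also a copy-number disagreement, so the two counts above imply 16,471 - 14,537 = 1,934 clusters that are present in both annotations but differ in copy number. The sketch below shows one way to get all three numbers in a single pass; the function name `copy_number_disagreement` and the toy counts are hypothetical.

```python
import pandas as pd

def copy_number_disagreement(a, b):
    """Compare per-cluster gene counts of two annotations of the same genome."""
    pav = (a == 0) != (b == 0)  # present in exactly one annotation
    cnv = a != b                # any difference in gene count (includes PAV)
    return {
        "n_clusters": len(a),
        "pav": int(pav.sum()),
        "cnv": int(cnv.sum()),
        "cnv_only": int((cnv & ~pav).sum()),  # present in both, different count
    }

# Toy gene counts per cluster for two annotations
itag_toy = pd.Series([1, 0, 2, 1, 3])
heinz_toy = pd.Series([1, 1, 1, 0, 3])

result = copy_number_disagreement(itag_toy, heinz_toy)
print(result)
```

Dividing `pav` and `cnv` by `n_clusters` would turn the counts into rates that can be compared between clustering tools.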

## OrthoFinder2 results
Parse OF2 output, then analyze.

In [3]:
# orthofinder orthogroups tsv contains only non-singletons
in_of2_orthogroups_f = "/groups/itay_mayrose/nosnap/liorglic/Projects/tomato_pan_genome/output/manual_orthoFinder/data/Results_May03/Orthogroups.csv"
# orthofinder unassigned genes are actually singleton clusters
in_of2_singletons_f = "/groups/itay_mayrose/nosnap/liorglic/Projects/tomato_pan_genome/output/manual_orthoFinder/data/Results_May03/Orthogroups_UnassignedGenes.csv"

# read tables
in_of2_orthogroups_df = pd.read_csv(in_of2_orthogroups_f, sep='\t')
in_of2_singletons_df = pd.read_csv(in_of2_singletons_f, sep='\t')

In [4]:
# stack the two tables - they have the same columns
all_orthogroups_df = pd.concat([in_of2_orthogroups_df, in_of2_singletons_df], axis=0, ignore_index=True)
# the first column has no header in the OF2 files; it holds the orthogroup ID
all_orthogroups_df = all_orthogroups_df.rename(columns={"Unnamed: 0": "Orthogroup"})

In [5]:
# calculate occupancy per orthogroup
all_orthogroups_df["Occupancy"] = all_orthogroups_df.count(axis=1) - 1

In [6]:
# calculate genes per sample
gene_counts_df = all_orthogroups_df.iloc[:,1:-1].apply(lambda x: x.fillna('').apply(lambda y: 0 if not y else len(y.split(', '))))
new_col_names = [c + " gene_count" for c in gene_counts_df.columns]
gene_counts_df.columns = new_col_names
# calculate cluster size (sum of genes per sample)
gene_counts_df['Cluster size'] = gene_counts_df.sum(axis=1)

# join to original table
all_orthogroups_df = all_orthogroups_df.join(gene_counts_df)
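
The occupancy and gene-count steps above rely on column positions (`count(axis=1) - 1`, `iloc[:,1:-1]`), which silently break if columns are added or reordered. Here is a sketch of the same calculation with the sample columns selected by name, on a toy table in the same layout (comma-plus-space separated gene lists, NaN for absent samples). The sample and gene names are placeholders.

```python
import pandas as pd
import numpy as np

# Toy orthogroups table in the OF2 layout (hypothetical IDs)
toy_og = pd.DataFrame({
    "Orthogroup": ["OG0", "OG1", "OG2"],
    "sampleA": ["a1, a2", np.nan, "a3"],
    "sampleB": ["b1", "b2, b3, b4", np.nan],
})
sample_cols = ["sampleA", "sampleB"]

# Occupancy: number of samples with at least one gene in the orthogroup
occupancy = toy_og[sample_cols].notna().sum(axis=1)

# Genes per sample: number of separators plus one; absent samples count as 0
gene_counts = toy_og[sample_cols].apply(
    lambda col: col.str.count(", ").add(1).fillna(0).astype(int)
)

# Cluster size: total number of genes across samples
cluster_size = gene_counts.sum(axis=1)
print(occupancy.tolist(), cluster_size.tolist())
```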

In [7]:
trace = go.Histogram(x=all_orthogroups_df['Occupancy'])
layout = go.Layout(
    title='Histogram of cluster occupancy',
    xaxis=dict(
        title='Occupancy'
    ),
    yaxis=dict(
        title='# of clusters'
    )
)
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [9]:
# create genes df
genes = pd.DataFrame(columns=["Gene","Sample","Orthogroup"])
tmp_sub = all_orthogroups_df.iloc[:,0:7]

In [10]:
for col in tmp_sub.columns[1:]:
    # one row per gene: split the comma-separated gene list and stack it
    tmp = tmp_sub[col].str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
    tmp_df = pd.DataFrame(tmp)
    tmp2 = tmp_sub.join(tmp_df)
    # rename() returns a new frame, so adding a column does not trigger SettingWithCopyWarning
    tmp3 = tmp2[['Orthogroup', 0]].rename(columns={0: "Gene"})
    tmp3["Sample"] = col
    # sort=True keeps the alphabetical column order (Gene, Orthogroup, Sample)
    genes = pd.concat([genes, tmp3], axis=0, ignore_index=True, sort=True)
# drop placeholder rows from orthogroups with no gene in a given sample
genes = genes.loc[genes['Gene'].notnull()]

In [13]:
genes.head()

Unnamed: 0,Gene,Orthogroup,Sample
0,Solyc08g006254.1.1,OG0000000,ITAG3.2
1,Solyc08g006256.1.1,OG0000000,ITAG3.2
2,Solyc08g006258.1.1,OG0000000,ITAG3.2
4,Solyc03g097882.1.1,OG0000002,ITAG3.2
5,Solyc05g006607.1.1,OG0000002,ITAG3.2
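
The wide-to-long conversion above can also be written without a loop. The sketch below uses `melt` followed by `explode` (available in pandas 0.25 and later) on a toy table; it is an alternative formulation, not the code that produced the output above, and the sample and gene names are placeholders.

```python
import pandas as pd
import numpy as np

# Toy orthogroups table: one column per sample, comma-separated gene lists
toy_og = pd.DataFrame({
    "Orthogroup": ["OG0", "OG1"],
    "sampleA": ["a1, a2", np.nan],
    "sampleB": ["b1", "b2, b3"],
})

long_df = (
    toy_og.melt(id_vars="Orthogroup", var_name="Sample", value_name="Gene")
    .dropna(subset=["Gene"])                                # samples absent from an orthogroup
    .assign(Gene=lambda d: d["Gene"].str.split(", "))       # string -> list of gene IDs
    .explode("Gene")                                        # one row per gene
    .reset_index(drop=True)
    .loc[:, ["Gene", "Orthogroup", "Sample"]]
)
print(long_df)
```

Unlike the loop, this version drops absent samples before expanding, so no placeholder rows need to be filtered out afterwards and the resulting index has no gaps.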


In [16]:
genes_full = genes.merge(all_orthogroups_df, on="Orthogroup")
genes_full.head()

Unnamed: 0,Gene,Orthogroup,Sample,ITAG3.2,dixie_golden_giant_tr00020__ver100,heinz1706_3_00__ver100,la4133_tr00026__ver100,moneymaker_la2706_pi262996__ver100,pi311117_ea05701__ver100,Occupancy,ITAG3.2 gene_count,dixie_golden_giant_tr00020__ver100 gene_count,heinz1706_3_00__ver100 gene_count,la4133_tr00026__ver100 gene_count,moneymaker_la2706_pi262996__ver100 gene_count,pi311117_ea05701__ver100 gene_count,Cluster size
0,Solyc08g006254.1.1,OG0000000,ITAG3.2,"Solyc08g006254.1.1, Solyc08g006256.1.1, Solyc0...",Solyc08g006254.1.1__maker-10000001_10720760_13...,Solyc08g006254.1.1__maker-10000001_11041024_13...,Solyc03g026417.1.1__maker-10000001_5428690_814...,Solyc03g026417.1.1__maker-10000001_30524791_33...,Solyc03g026417.1.1__maker-10000001_21305144_23...,6,3,148,40,185,175,239,790
1,Solyc08g006256.1.1,OG0000000,ITAG3.2,"Solyc08g006254.1.1, Solyc08g006256.1.1, Solyc0...",Solyc08g006254.1.1__maker-10000001_10720760_13...,Solyc08g006254.1.1__maker-10000001_11041024_13...,Solyc03g026417.1.1__maker-10000001_5428690_814...,Solyc03g026417.1.1__maker-10000001_30524791_33...,Solyc03g026417.1.1__maker-10000001_21305144_23...,6,3,148,40,185,175,239,790
2,Solyc08g006258.1.1,OG0000000,ITAG3.2,"Solyc08g006254.1.1, Solyc08g006256.1.1, Solyc0...",Solyc08g006254.1.1__maker-10000001_10720760_13...,Solyc08g006254.1.1__maker-10000001_11041024_13...,Solyc03g026417.1.1__maker-10000001_5428690_814...,Solyc03g026417.1.1__maker-10000001_30524791_33...,Solyc03g026417.1.1__maker-10000001_21305144_23...,6,3,148,40,185,175,239,790
3,Solyc08g006254.1.1__maker-10000001_10720760_13...,OG0000000,dixie_golden_giant_tr00020__ver100,"Solyc08g006254.1.1, Solyc08g006256.1.1, Solyc0...",Solyc08g006254.1.1__maker-10000001_10720760_13...,Solyc08g006254.1.1__maker-10000001_11041024_13...,Solyc03g026417.1.1__maker-10000001_5428690_814...,Solyc03g026417.1.1__maker-10000001_30524791_33...,Solyc03g026417.1.1__maker-10000001_21305144_23...,6,3,148,40,185,175,239,790
4,Solyc08g006254.1.1__maker-10000001_10720760_13...,OG0000000,dixie_golden_giant_tr00020__ver100,"Solyc08g006254.1.1, Solyc08g006256.1.1, Solyc0...",Solyc08g006254.1.1__maker-10000001_10720760_13...,Solyc08g006254.1.1__maker-10000001_11041024_13...,Solyc03g026417.1.1__maker-10000001_5428690_814...,Solyc03g026417.1.1__maker-10000001_30524791_33...,Solyc03g026417.1.1__maker-10000001_21305144_23...,6,3,148,40,185,175,239,790


I'd like to explore genes that are present in my heinz annotation but absent from the official ITAG annotation, and see how many of them look like real genes. I expect very few genuine cases (predictions that ITAG missed); most are likely to be MAKER artifacts.

In [25]:
heinz_not_in_itag = genes_full.loc[(genes_full["ITAG3.2 gene_count"] == 0) & (genes_full["heinz1706_3_00__ver100 gene_count"] == 1)]["Gene"]

In [26]:
heinz_annotation_qa_f = "/groups/itay_mayrose/nosnap/liorglic/Projects/tomato_pan_genome/output/genome_annotation/annotation_results/per_sample/heinz1706_3_00__ver100/Annotation_QA/heinz1706_3_00__ver100.QA_report.tsv"
heinz_annotation_qa_df = pd.read_csv(heinz_annotation_qa_f, sep='\t')
heinz_annotation_qa_df.columns

Index(['gene', 'Chromosome', 'AED', 'Exons', 'UTR', 'Repeats', 'BUSCO',
       'BLAST', 'IPS'],
      dtype='object')

In [27]:
heinz_not_in_itag_annotation_qa_df = heinz_annotation_qa_df.loc[heinz_annotation_qa_df['gene'].isin(heinz_not_in_itag)]

In [28]:
heinz_not_in_itag_annotation_qa_df.shape

(7421, 9)
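
Note that the filter on `genes_full` above selects on orthogroup-level counts, so the resulting `Gene` series also contains genes from other samples that share those orthogroups; it is the later `isin()` against the heinz QA table that keeps only IDs occurring there. A sketch of making the sample restriction explicit is given below, on toy data with shortened, hypothetical column and sample names.

```python
import pandas as pd

# Toy per-gene table with orthogroup-level counts attached (hypothetical names)
toy_genes = pd.DataFrame({
    "Gene": ["h1", "d1", "h2", "i1", "h3"],
    "Sample": ["heinz", "dixie", "heinz", "ITAG", "heinz"],
    "Orthogroup": ["OG0", "OG0", "OG1", "OG1", "OG2"],
    "ITAG gene_count": [0, 0, 1, 1, 0],
    "heinz gene_count": [1, 1, 1, 1, 1],
})

# Orthogroups with one heinz gene and no ITAG gene
og_mask = (toy_genes["ITAG gene_count"] == 0) & (toy_genes["heinz gene_count"] == 1)

# Without a sample filter, genes from other samples come along
all_samples = toy_genes.loc[og_mask, "Gene"]

# With the filter, only heinz genes remain
heinz_only = toy_genes.loc[og_mask & (toy_genes["Sample"] == "heinz"), "Gene"]
print(all_samples.tolist(), heinz_only.tolist())
```

With the explicit filter the result no longer depends on gene IDs being unique across samples.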

In [29]:
trace = go.Histogram(x=heinz_not_in_itag_annotation_qa_df['AED'])
layout = go.Layout(
    title='Histogram of AED',
    xaxis=dict(
        title='AED'
    ),
    yaxis=dict(
        title='# of genes'
    )
)
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

Let's look at the genes with a very good AED (below 0.01; a lower AED means better agreement with the evidence).

In [30]:
heinz_not_in_itag_annotation_qa_df.loc[heinz_not_in_itag_annotation_qa_df['AED'] < 0.01].head()

Unnamed: 0,gene,Chromosome,AED,Exons,UTR,Repeats,BUSCO,BLAST,IPS
877,Solyc01g008590.1.1__maker-1_0_2760256-pred_gff...,1,0.0,1,0,0.0,,,
1274,Solyc01g017040.1.1__maker-1_22082048_24842304-...,1,0.0,1,0,0.0,,,
1308,Solyc01g017630.1.1__maker-1_22082048_24842304-...,1,0.0,1,0,0.0,,,3.4e-06
1586,Solyc01g060350.1.1__maker-7_5520512_8280768-pr...,7,0.0,1,1,2.193784,,,
1701,Solyc01g067417.1.1__maker-1_74526912_77287168-...,1,0.0,1,1,0.0,,77.0,1.57e-07
