In [None]:
import pandas as pd

In [None]:
%%html
<!-- "Force the table in the next cell to left justify" -->
<style>
table {float:left}
</style>


# Review the Housekeeping Genes from the O'Rourke paper

Whole-body gene expression atlas of an adult metazoan

https://www.biorxiv.org/content/10.1101/2022.11.06.515345v2



 Sheet | Description
------ | ---------------
2 | List of genes with negative skew
3 | List of genes with Gini < 0.3
4 | List of genes with Gini < 0.3 found in L2 as housekeeping
5 | List of genes with Gini < 0.3 found in L2 as housekeeping and essential 
6 | List of genes with Gini < 0.3 found in L2 as housekeeping, essential and conserved across species
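
The sheets above filter on a Gini coefficient below 0.3. As a rough illustration of what that threshold measures, here is a minimal sketch of the standard Gini formula applied to a vector of expression values. The function name `gini` and the toy values are hypothetical, and the paper's exact procedure (which expression values are used and how they are normalized) may differ.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly even."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here for empty or all-zero input
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy (made-up) expression values across four tissues.
even = gini([5.0, 5.0, 5.0, 5.0])     # expressed equally everywhere
skewed = gini([0.0, 0.0, 0.0, 20.0])  # expressed in a single tissue
print(even, skewed)
```

A gene expressed evenly everywhere scores close to 0 and would pass a `< 0.3` filter, while a gene confined to one tissue scores much higher.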

## How well do the O'Rourke housekeeping genes align with the Wormcat categories?

* For most genes, the results are as expected.
* However, 15 housekeeping genes were identified as unassigned.
* Additionally, some of the unassigned genes had limited expression yet still showed up as housekeeping.

wormbase_id	|Sequence ID	| Observed in Cell Count 	| Cumulative Expression 	| Max Expression Observed 
------------|--------------|--------------------------|--------------------------|------------------------
WBGene00019537	|K08D12.3	| 13,309 	| 138,109.968 	| 135.986 
WBGene00019466	|K07B1.6	| 10,812 	| 94,266.221 	| 1,074.407 
WBGene00007630	|C16C10.11	| 4,949 	| 35,813.754 	| 72.790 
WBGene00022053	|Y67D2.3	| 2,104 	| 14,001.399 	| 59.574 
WBGene00022114	|Y71F9AL.9	| 1,672 	| 13,220.190 	| 61.342 
WBGene00007192	|B0491.5	| 1,628 	| 10,969.477 	| 72.247 
WBGene00011735	|T12D8.8	| 1,560 	| 12,546.396 	| 51.983 
WBGene00014016	|ZK632.9	| 1,280 	| 10,731.927 	| 103.150 
WBGene00004167	|Y57A10A.18	| 1,269 	| 10,381.947 	| 159.451 
WBGene00009688	|F44E5.1	| 1,059 	| 6,648.277 	| 51.690 
WBGene00019893	|R05F9.10	| 1,032 	| 7,475.850 	| 37.841 
WBGene00010639	|K07F5.15	| 1,021 	| 6,281.891 	| 64.413 
WBGene00017088	|E01A2.6	| 667 	| 5,077.094 	| 35.484 
WBGene00008530	|F02E9.5	| 438 	| 3,231.856 	| 41.776 
WBGene00004138	|R07B7.3	| 434 	| 3,631.449 	| 185.442 

In [None]:
# Supplementary material containing the housekeeping genes
# https://www.biorxiv.org/content/10.1101/2022.11.06.515345v2.supplementary-material

xlsx_file_nm = './input_data/media-2.xlsx'
fpkm_adult_xlsx = pd.ExcelFile(xlsx_file_nm)

In [None]:
sheet_names = fpkm_adult_xlsx.sheet_names
sheet_names

In [None]:
# Sheet4 has the "List of genes with Gini < 0.3 found in L2 as housekeeping"

housekeeping_df = pd.read_excel(fpkm_adult_xlsx, sheet_name='Sheet4')

In [None]:
housekeeping_df

In [None]:
# Read in the Wormcat Catalog
wormcat_df = pd.read_csv('./input_data/whole_genome_v2_nov-11-2021.csv')

In [None]:
wormcat_df

In [None]:
# Merge on the Wormbase ID; a left join keeps every Wormcat row, matched or not
wormcat_w_housekeeping_df = pd.merge(wormcat_df, housekeeping_df, left_on='Wormbase ID', right_on='gene_ID', how='left')
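
One way to audit a left join like the one above is pandas' `indicator=True` option, which records whether each row found a match. The sketch below uses toy stand-ins (`toy_wormcat`, `toy_housekeeping`, and made-up IDs and categories), not the real input files. It also lists housekeeping IDs that are absent from the catalog, since a left join drops those silently.

```python
import pandas as pd

# Toy stand-ins for the real tables; the IDs and categories are made up.
toy_wormcat = pd.DataFrame({
    'Wormbase ID': ['WBGene_A', 'WBGene_B', 'WBGene_C'],
    'Category 1': ['Metabolism', 'Unassigned', 'Ribosome'],
})
toy_housekeeping = pd.DataFrame({'gene_ID': ['WBGene_A', 'WBGene_B', 'WBGene_X']})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join.
toy_merged = pd.merge(toy_wormcat, toy_housekeeping,
                      left_on='Wormbase ID', right_on='gene_ID',
                      how='left', indicator=True)
n_matched = int((toy_merged['_merge'] == 'both').sum())

# Housekeeping IDs missing from the catalog never appear in a left join,
# so collect them with a separate membership test.
not_in_catalog = toy_housekeeping.loc[
    ~toy_housekeeping['gene_ID'].isin(toy_wormcat['Wormbase ID']), 'gene_ID'
].tolist()
print(n_matched, not_in_catalog)
```

If `not_in_catalog` is non-empty on the real data, those housekeeping genes have no Wormcat annotation at all and would not appear in the output file.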

In [None]:
wormcat_w_housekeeping_df

In [None]:
# Count Wormcat rows with no match in the housekeeping list (True = no match)
missing = wormcat_w_housekeeping_df['gene_ID'].isna()
missing.value_counts()

In [None]:
# An inner join would return the same rows directly;
# the left join plus the count above is just a sanity check.
wormcat_w_housekeeping_df = wormcat_w_housekeeping_df[wormcat_w_housekeeping_df['gene_ID'].notna()]
wormcat_w_housekeeping_df = wormcat_w_housekeeping_df.drop(['Unnamed: 0','gene_ID'], axis=1)
wormcat_w_housekeeping_df
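
The filtering step keeps only the rows that found a match, which is what an inner join does in one step. A small sketch on made-up data (the frames `left` and `right` and their IDs are hypothetical) compares the two routes:

```python
import pandas as pd

# Made-up catalog and housekeeping list, for illustration only.
left = pd.DataFrame({
    'Wormbase ID': ['WBGene_A', 'WBGene_B', 'WBGene_C'],
    'Category 1': ['Metabolism', 'Unassigned', 'Ribosome'],
})
right = pd.DataFrame({'gene_ID': ['WBGene_C', 'WBGene_A']})

# Route 1: left join, then keep only the rows that found a match.
via_left = pd.merge(left, right, left_on='Wormbase ID', right_on='gene_ID', how='left')
via_left = via_left[via_left['gene_ID'].notna()]

# Route 2: inner join directly.
via_inner = pd.merge(left, right, left_on='Wormbase ID', right_on='gene_ID', how='inner')

same_ids = sorted(via_left['Wormbase ID']) == sorted(via_inner['Wormbase ID'])
print(same_ids, len(via_inner))
```

The left-join route costs one extra step but exposes the count of unmatched catalog rows along the way.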

In [None]:
wormcat_w_housekeeping_df.to_csv('./output_data/wormcat_w_housekeeping.csv', index=False)