In this notebook, we go through the steps of downloading the data from [Replogle et al. (2022)](https://www.sciencedirect.com/science/article/pii/S0092867422005979) and applying some basic filtering to reproduce the results of the sVAE+ manuscript.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import scanpy as sc
import scvi
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Global seed set to 0


# 1. Download the data

Information about the dataset is available [here](https://gwps.wi.mit.edu/). We are not interested in the raw data (sequences), but in the gene expression matrix. The full dataset contains gene expression from several perturbation experiments; here, we are interested in the largest-scale experiment, conducted on K562 cells. This data is available on [Figshare](https://plus.figshare.com/articles/dataset/_Mapping_information-rich_genotype-phenotype_landscapes_with_genome-scale_Perturb-seq_Replogle_et_al_2022_processed_Perturb-seq_datasets/20029387). Download the file **K562_gwps_raw_singlecell_01.h5ad** (60 GB), and put it in this directory.
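Because the file is large, it can help to confirm that the download is in place before trying to load it. The following is a minimal sketch using only the standard library; the helper name `describe_download` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def describe_download(path):
    """Return the size of `path` in GB, or raise FileNotFoundError if it is missing."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"{p} not found; download it from Figshare first")
    # 1 GB is taken as 1e9 bytes here
    return p.stat().st_size / 1e9
```

Calling `describe_download("K562_gwps_raw_singlecell_01.h5ad")` from this directory should then report a size close to the one listed on Figshare.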

# 2. Load processed dataset

In [32]:
adata = sc.read_h5ad("K562_gwps_raw_singlecell_01.h5ad")

Let's check out the information available for every cell.

In [33]:
adata.obs

cell_barcode,gem_group,gene,gene_id,transcript,gene_transcript,sgID_AB,mitopercent,UMI_count,z_gemgroup_UMI,core_scale_factor,core_adjusted_UMI_count
AAACCCAAGAAACCAT-157,157,CTSC,ENSG00000109861,P1P2,1946_CTSC_P1P2_ENSG00000109861,CTSC_-_88070848.23-P1P2|CTSC_-_88070918.23-P1P2,0.088177,14709.0,0.470687,1.051176,13992.900391
AAACCCAAGAAACCAT-207,207,CWC25,ENSG00000273559,P1P2,1973_CWC25_P1P2_ENSG00000273559,CWC25_+_36981555.23-P1P2|CWC25_+_36981567.23-P1P2,0.114342,16162.0,0.824790,1.074744,15038.004883
AAACCCAAGAAACCAT-29,29,PDE4DIP,ENSG00000178104,ENST00000313431.9,6168_PDE4DIP_ENST00000313431.9_ENSG00000178104,PDE4DIP_+_144932474.23-ENST00000313431.9|PDE4D...,0.107157,33297.0,2.627126,1.472444,22613.423828
AAACCCAAGAAAGCGA-149,149,ZZEF1,ENSG00000074755,P1P2,10745_ZZEF1_P1P2_ENSG00000074755,ZZEF1_+_4046247.23-P1P2|ZZEF1_+_4046255.23-P1P2,0.143107,7435.0,0.918149,0.480401,15476.669922
AAACCCAAGAAATCCA-172,172,SNAPIN,ENSG00000143553,P1P2,8210_SNAPIN_P1P2_ENSG00000143553,SNAPIN_+_153631238.23-P1P2|SNAPIN_-_153631252....,0.130754,7755.0,-0.230920,0.695649,11147.856445
...,...,...,...,...,...,...,...,...,...,...,...
TTTGTTGTCTTTCCAA-201,201,SCRIB,ENSG00000180900,P1P2,7706_SCRIB_P1P2_ENSG00000180900,SCRIB_-_144897465.23-P1P2|SCRIB_-_144897339.23...,0.115868,11867.0,-0.441665,1.151834,10302.698242
TTTGTTGTCTTTCGAT-79,79,CADM4,ENSG00000105767,P1P2,1151_CADM4_P1P2_ENSG00000105767,CADM4_+_44143915.23-P1P2|CADM4_+_44143932.23-P1P2,0.094454,6437.0,-1.208969,0.935249,6882.660156
TTTGTTGTCTTTCTAG-218,218,SLC4A7,ENSG00000033867,P1P2,8098_SLC4A7_P1P2_ENSG00000033867,SLC4A7_+_27525847.23-P1P2|SLC4A7_+_27525631.23...,0.170080,11377.0,-0.109167,1.000028,11376.682617
TTTGTTGTCTTTGCGC-169,169,TMEM131,ENSG00000075568,P1P2,8997_TMEM131_P1P2_ENSG00000075568,TMEM131_-_98612356.23-P1P2|TMEM131_-_98612239....,0.103772,10658.0,-0.358464,1.023092,10417.442383


"gene" refers to the gene targeted by the genetic perturbation. A distinct target corresponds therefore to a unique "regime" for our latent model. 

In [34]:
adata.obs["gene"].unique()

['CTSC', 'CWC25', 'PDE4DIP', 'ZZEF1', 'SNAPIN', ..., 'LRP8', 'FECH', 'OGDH', 'MCOLN1', 'ZNF774']
Length: 9867
Categories (9867, object): ['A1BG', 'AAAS', 'AACS', 'AAGAB', ..., 'ZYG11B', 'ZYX', 'ZZEF1', 'non-targeting']
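To make the notion of a regime concrete, the sketch below counts cells per target on a toy list standing in for `adata.obs["gene"]`. It assumes that the `non-targeting` category marks control cells; the variable names are illustrative only.

```python
from collections import Counter

# Toy stand-in for adata.obs["gene"]; the real column has one entry per cell.
targets = ["CTSC", "non-targeting", "CWC25", "CTSC", "non-targeting", "ZZEF1"]

# Assumption: this category corresponds to control (unperturbed) cells.
CONTROL_LABEL = "non-targeting"

# Number of cells observed in each regime (one regime per distinct target).
cells_per_regime = Counter(targets)
n_control = cells_per_regime[CONTROL_LABEL]

# Regimes that correspond to an actual perturbation, in alphabetical order.
perturbed_regimes = sorted(t for t in cells_per_regime if t != CONTROL_LABEL)

print(n_control, perturbed_regimes)
```

On the real data, the equivalent pandas call would be `adata.obs["gene"].value_counts()`.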

There are two potential issues with this dataset. First, there are too many perturbations; to validate our algorithms, we wish to focus on the subset with a strong transcriptional effect. Second, we are measuring many genes (8,248 in this expression matrix), and we would like to focus on a subset of them to minimize noise.

# 3. Filter the data

To address both issues, we exploit the supplementary information from the paper. This filtering was already performed in the supplements (Supplementary Table 3), and we simply pull that information here to filter (i) targeted genes and (ii) measured genes.

In [35]:
# these are copied from the excel files
string = """DCAF11,HIPK3,NPC1,OXCT1,SYNJ2
ZC3H13,CBLL1,METTL14,METTL3,PSMG1,RBM15
CASC3,CLOCK,IPO13,MAGOH,SMG5,SMG7,SMG8,SMG9,UPF1,UPF2
ZNHIT1,ACTR6,CFDP1,CHCHD3,H2AFZ,HOXD3,HSPH1,KANSL2,KANSL3,KAT8,MCRS1,PCM1,PPP1CA,VPS72,YEATS4
ACTR8,GBA2,INO80B,INO80C,INO80,NFRKB,UCHL5,ZBTB17
ZC3H3,ZFC3H1,CAMTA2,DHX29,DIS3,EXOSC1,EXOSC2,EXOSC3,EXOSC4,EXOSC5,EXOSC6,EXOSC7,EXOSC8,EXOSC9,MBNL1,PABPN1,PIBF1,MTREX,ST20-MTHFS,THAP2
PCNX3,PIAS1,SAE1,UBA2,UBE2I
GLRX3,KEAP1,NAA30,NAA35
DPY19L4,SUMO2,UBE3C,UGCG
CMTR2,RBM14-RBM4,RBM4,UNCX,WDFY3
CCT2,CCT3,CCT4,CCT5,CCT7,CCT8,TBCB,TCP1
CDC73,CTR9,LEO1,PAF1,TUBB2A,WDR61
ELOF1,IWS1,SH3KBP1,SSRP1,SUPT16H,SUPT4H1
LAMTOR1,LAMTOR3,LAMTOR4,TMEM214
DAXX,METTL21A,NAE1,NEDD8-MDP1,NEDD8,BMP2K,UBA3
COPS2,COPS3,COPS4,COPS5,COPS6,COPS8,GPS1
ZBTB4,MRPS34,SRRT,TAF7
CCNK,CCNK,CDK12,GRB2
MEF2A,PRPF19,SFSWAP,BCAS2
INTS1,INTS2,INTS5,INTS7,INTS8
ACSL3,CPSF6,ANKRD39,NUDT21,OGFOD1
CAD,DHODH,SDHA,SDHC,UMPS
NUP54,NUP62,ATF5,XPO1
CPSF2,CPSF3,ITGB1BP1,SYMPK
DERL2,MIS12,PMF1,BGLAP,BUB1B,TTK
NELFA,NELFB,NELFCD,NELFE
CYREN,ARPC4,SUPT20H,TADA1,TADA2B,TADA3,TAF6L
CTU1,CTU2,ELP3,ELP4,ELP5
ZMAT2,CLNS1A,DDX20,DDX41,DDX46,ECD,GEMIN4,GEMIN5,GEMIN6,GEMIN8,INTS3,INTS4,INTS9,ICE1,LSM2,LSM3,LSM5,LSM5,LSM6,LSM7,MMP17,PHAX,PRPF4,PRPF6,SART3,SF3A2,SMN2,SNAPC1,SNAPC3,SNRPD3,SNRPG,TIPARP,TTC27,TXNL4A,USPL1
DGCR8,DICER1,DROSHA,XPO5
CLTC,EXOC1,EXOC4,AP2S1,ATP6AP1,ATP6V1A,ATP6V1G1,ATP6V1H,SACM1L,WDR7
ZNF560,CNDP2,MYB,PTGR2,TBC1D4
AC118549.1,GYG1,MBIP,MYSM1,PDHA1,PET117,YEATS2
C7orf26,FASN,INTS10,ARID5B,INTS13,INTS14
FUNDC2,MEIS3,NKX6-1,SUGP1,TAF10,TAF11,TAF12,TAF13,TAF1,TAF2,TAF3,TAF5,TAF6,TAF8,TYK2
CRLF3,FASTK,GSK3B,MON1A
EIF3A,EIF3G,XRCC5,XRCC6
CHMP3,CHMP6,GLUL,ATP6V1B2,TSG101,VPS28
EIF1AX,EIF3F,EIF3H,EIF3M,EIF4G2
GATA1,HMBOX1,LDB1,LMO2,NCKAP1,PRDM8
PSMA4,PSMB7,PSMC1,PSMC2,PSMC4,PSMD12,PSMD2,PSMD6,PSMD7,PSMG3
CASP8AP2,CHAF1A,CHAF1B,FOLR3,HINFP,LSM10,LSM11,ARHGAP6,ARPC1B,SLBP
FKBP9,MIOS,MOB4,MTOR,PPP1R37,PPP2CB,RPTOR,SEH1L,SIRT7
ALDOA,ENO1,PGAM1,PGK1
ETV4,RAD21,SMC1A,SMC3
DMAP1,EP400,ESYT2,KIF20A,MAX,TELO2,BRD8,TRRAP,TTI1
GAB2,INPPL1,PTPN11,SHC1,BCR
DONSON,GINS1,GINS2,GINS4,MCM10,MCM2,MCM3,MCM4,MCM5,MCM6,MMS22L,ORC5,POLD1,POLD2,POLD3,RAD51,RFC2,RFC4,RFC5
RAMAC,NCBP1,NCBP2,RNMT
ZDHHC7,ADAM10,EPS8L1,FAM136A,POGLUT3,MED10,MED11,MED12,MED14,MED17,MED18,MED19,MED1,MED20,MED21,MED22,MED28,MED29,MED30,MED6,MED7,MED8,MED9,SUPT6H,BRIX1,TMX2
C1QBP,CCNH,ERCC2,ERCC3,GPN1,GPN3,GTF2E1,GTF2E2,GTF2H1,GTF2H4,MNAT1,NUMA1,PDRG1,PFDN2,POLR2B,POLR2F,POLR2G,RPAP1,RPAP2,RPAP3,TANGO6,TMEM161B,UXT
NBAS,RINT1,STX18,BNIP1,USE1
DMRTC2,HSD17B12,MAD2L1,STT3B
DAD1,DDOST,DHDDS,ALG2,SEC61A1,SEC61G,SRP19,SRP68,SRP72,SRP9,SRPRB
ZCCHC9,ZNF236,C1orf131,ZNF84,ZNHIT6,CCDC59,AATF,CPEB1,DDX10,DDX18,DDX21,DDX47,DDX52,DHX33,DHX37,DIMT1,DKC1,DNTTIP2,ESF1,FBL,FBXL14,FCF1,GLB1,HOXA3,IMP4,IMPA2,KRI1,KRR1,LTV1,MPHOSPH10,MRM1,NAF1,NOB1,NOC4L,NOL6,NOP10,PDCD11,ABT1,PNO1,POP1,POP4,POP5,PSMG4,PWP2,RCL1,RIOK1,RIOK2,RNF31,RPP14,RPP30,RPP40,RPS10-NUDT3,RPS10,RPS11,RPS12,RPS13,RPS15A,RPS18,RPS19BP1,RPS19,RPS21,RPS23,RPS24,RPS27A,RPS27,RPS28,RPS29,RPS2,RPS3A,RPS3,RPS4X,RPS5,RPS6,RPS7,RPS9,RPSA,RRP12,RRP7A,RRP9,SDR39U1,SRFBP1,TBL3,TRMT112,TSR1,TSR2,BYSL,C12orf45,USP36,UTP11,UTP20,UTP23,UTP6,BUD23,WDR36,WDR3,WDR46,AAR2
AARS2,DHX30,GFM1,HMGB3,MALSU1,MRPL10,MRPL11,MRPL13,MRPL14,MRPL16,MRPL17,MRPL18,MRPL19,MRPL22,MRPL23,MRPL24,MRPL27,MRPL2,MRPL33,MRPL35,MRPL36,MRPL37,MRPL38,MRPL39,MRPL3,MRPL41,MRPL42,MRPL43,MRPL44,MRPL4,MRPL50,MRPL51,MRPL53,MRPL55,MRPL9,MRPS18A,MRPS30,NARS2,PTCD1,RPUSD4,TARS2,VARS2,YARS2
ACSS2,CEBPG,LRPPRC,MTPAP,PNPT1,POLRMT,SSBP1,TEFM,TFAM,TOMM20
CD3EAP,HEATR1,NOL11,POLR1A,POLR1B,POLR1C,POLR1D,POLR1E,TAF1C,TAF1D,TWISTNB,UTP15,WDR75
CARF,CCDC86,DDX24,DDX51,DDX56,EIF6,ABCF1,GNL2,LSG1,MAK16,MDN1,MYBBP1A,NIP7,NLE1,NOL8,NOP16,NVL,PES1,PPAN,RBM28,RPL10A,RPL10,RPL11,RPL13,RPL14,RPL17,RPL19,RPL21,RPL23A,RPL23,RPL24,RPL26,RPL27A,RPL30,RPL31,RPL32,RPL34,RPL36,RPL37A,RPL37,RPL38,RPL4,RPL5,RPL6,RPL7,RPL8,RPL9,RRS1,RSL1D1,SDAD1,BOP1,TEX10,WDR12
AARS,CHCHD4,DNAJA3,DNAJC19,EIF2B1,EIF2B2,EIF2B3,EIF2B4,EIF2B5,FARSA,FARSB,GFER,GRPEL1,HARS,HSPA9,HSPD1,HSPE1,IARS2,LARS,LETM1,NARS,OXA1L,PGS1,PHB2,PHB,PMPCA,PMPCB,ATP5F1A,ATP5F1B,ATP5PD,QARS,RARS,SAMM50,PRELID3B,TARS,TIMM23B,TIMM44,TOMM22,TTC1,VARS"""


In [37]:
valid_guides = string.replace("\n", ",").split(",")

In [38]:
len(valid_guides)

683
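Note that the pasted table repeats some names (for example, CCNK and LSM5 each appear twice on their lines), so the 683 entries are not all distinct. This does not affect the membership test used for filtering, but it matters if the list length is read as a number of perturbations. A small sketch of how to detect such repeats, run here on a two-line excerpt of the table rather than the full string:

```python
from collections import Counter

# Two lines copied from the table above; the second one repeats CCNK.
excerpt = """ZBTB4,MRPS34,SRRT,TAF7
CCNK,CCNK,CDK12,GRB2"""

# Same parsing as in the notebook: newlines become commas, then split.
names = excerpt.replace("\n", ",").split(",")

# Names that occur more than once, and the de-duplicated list.
duplicates = sorted(name for name, n in Counter(names).items() if n > 1)
unique_names = sorted(set(names))

print(len(names), len(unique_names), duplicates)
```

Applying the same three lines to `valid_guides` would give the number of distinct targets in the full list.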

In [None]:
# Boolean mask over cells: True where the targeted gene is in the curated list
to_keep = adata.obs["gene"].isin(valid_guides).to_numpy()

In [19]:
subset = adata[to_keep].copy()

In [20]:
subset

AnnData object with n_obs × n_vars = 116641 × 8248
    obs: 'gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript', 'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI', 'core_scale_factor', 'core_adjusted_UMI_count'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano'
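To see what the boolean mask does without loading the full file, here is a toy sketch on a small pandas table standing in for `adata.obs`. The barcodes and targets are taken from the preview above; the keep-list is hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for adata.obs, using the first five cells of the preview.
obs = pd.DataFrame(
    {"gene": ["CTSC", "CWC25", "PDE4DIP", "ZZEF1", "SNAPIN"]},
    index=[
        "AAACCCAAGAAACCAT-157",
        "AAACCCAAGAAACCAT-207",
        "AAACCCAAGAAACCAT-29",
        "AAACCCAAGAAAGCGA-149",
        "AAACCCAAGAAATCCA-172",
    ],
)

# Hypothetical keep-list; in the notebook this role is played by valid_guides.
keep = ["CWC25", "SNAPIN"]

# One boolean per cell: True if the cell's target is in the keep-list.
mask = obs["gene"].isin(keep).to_numpy()

# Boolean indexing keeps only the rows where the mask is True.
filtered = obs[mask]
print(filtered)
```

The same mask, applied to the `AnnData` object instead of the table, selects the corresponding cells.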

In [11]:
string_genes = """MT-ND6,MT-CO2,MT-CYB,MT-ND2,MT-ND5,MT-CO1,MT-ND3,MT-ND4,MT-ND1,MT-ATP6,MT-CO3,MT-ND4L,MT-ATP8,MTRNR2L8,MTRNR2L12
LETMD1,CIRBP,CCNB1IP1,TAF1D,ZFAS1,SELENOW,SNHG12,SNHG32,EPB41L4A-AS1,GAS5,SNHG6,SNHG8
HSPA5,HERPUD1,TRAM1,NUCB2,MYDGF,DNAJB11,DNAJC3,KDELR1,SEC61B,SSR3,RPN2,SERP1,SSR1,CANX,SDF2L1,TMED7,KDELR2,PDIA6,MANF,HYOU1,PDIA4,SSR2,HSP90B1,PPIB,PDIA3,CALR,P4HB,SND1,OSTC,DDOST
HIST1H2BJ,HIST1H4H,HIST1H2AC,HIST1H2BC,HIST3H2A,HIST1H1C,HIST1H2AG,HIST1H2BN,LINC00958,HIST1H2BH,HELLPAR
CYP51A1,MSMO1,IDI1,TFRC,FDFT1,SCD,SLC25A1,SQLE,SLC9A3R1,TMEM97,HMGCS1,HMGCR,DHCR24,ACAT2,ACLY,SLC40A1,MMAB,ABHD2,FADS1,TM7SF2,FDPS,CITED2,MVD,FASN,ELOVL6,DHCR7,INSIG1
ACSM3,PNN,CLN5,NPRL3,YWHAH,CYB561A3,MTHFD2L,FMC1,PCF11,ATF7IP2,SMIM4,MAT2A,SCAND1,CSTF3,EPM2AIP1,DPM3,ZNF708,H1FX,SLC39A10,PBX2,PPP1R10,THUMPD3-AS1,ERV3-1,MIR17HG,DANCR,HCG18,DLEU2,SNHG3,GABPB1-AS1,RAB30-DT,TMEM161B-AS1,MINCR,AP002387.2,EID1,SNHG19,LINC01003,AC004687.1,ILF3-DT,SNHG30,ZSCAN16-AS1,ZNF595,PAXIP1-AS1,SNHG4,AC016074.2
RPL18,RPL31,RPL34,RPL22,RPL5,RPS15A,RPS24,RPL32,RPS3A,RPL37,RPL7,RPL26,RPS14,RPL13,RPL15,RPL37A
SLC25A5,EIF4B,SLC25A3,RPL3,CCNG1,RPL23,EIF2S3,HNRNPA1,RPS2,RPL7A,RPS3,SNHG16,EEF2,RPSA,RPS9,TOMM20,EIF3F,RPL14,RPL12,RPL10A,SNHG5,RPL41
RPS20,RPS5,RPS16,RPS19,RPL18A,RPL28,RPL19,COX7C,SEC11A,RPS11,RPL13A,HIGD2A,RPL10,RPL27A,RPS27,ACTG1,POLR1D,RPS23,RPL23A
LUC7L,DIS3,ACIN1,SF3B1,CEBPZ,SNRPA1,DEF8,HDGF,DDX46,PRPF39,MAGEA6,MAGEA12,MAGEA3
AP2S1,THRAP3,SNRNP70,OAZ1,DDX5,CD81,SKP1,NUP214,H3F3B,EIF4E2,DEGS1,PTMS,H3F3A,CFL1,POLR2A,BCAP31,UBL5,BAG6
RALA,STARD3NL,AKAP8L,ATP2B1,MED15,MTPN,SELENOK,DGUOK,SRSF11,ATP6V0B,RBM39,SRP14,BRD4,TMCO1,ARF1,LIMS1,GATA2,UBALD2,ATP6V0C,PTPN1,MAFG,PLCG2,YTHDF2
SARS,ZCCHC8,GCAT,XBP1,TRIB3,MDK,RTN4,IGFBP2,CCND2,ATF4,SESN2,MAP1B,SLC31A1,IDH1,WARS,HAX1,DDIT4,RNF41,PRSS57,COLGALT2,CCPG1,LINC00662
BCAT1,MTHFD2,RAB27A,ERP29,AARS,PHGDH,PCK2,SLC7A5,SLC1A5,GARS,ALDH2,MTHFD1L,YARS,PSAT1,EPRS,PSPH,SLC7A11,SHMT2,EIF4EBP1,IARS,LONP1,CRNDE
ATP1B3,HADHA,MGST2,SEC22C,HCFC1R1,ELOB,FAM32A,HSPB1,ZNHIT1,COX6A1,CUTA,COX7A2,HMGN3,NDUFB3,ATP5F1E,ROMO1,COX6B1,UQCR11,PNKD,NEDD8,DAD1,COX7B,NDUFA2,SEC61G,DCTN3,COX17,ANAPC11,NDUFB11,COX6C,CIAO2B,ATP5MG,ATP5ME,COPS9,MICOS10,ATP5MD,GNG5,CLTB,COX8A,POLR2L,NDUFB1,UQCR10,UBE2L3,NDUFA13,NDUFA4,TMA7,ATP5MF,C4orf48,NDUFA7
CHI3L2,OAT,BLVRB,KLF1,SLC2A1,NENF,NFE2,UROD,ASH2L,APOC1,HBZ,TMOD1,HEMGN,SLC39A8,SORD,MFGE8,SLC25A37,FEZ1,ERMAP,GFI1B,STAT3,FAM178B,TANGO2,C2orf88,HBG2,ANXA6,MYL4,MPIG6B,HBA1,HBG1,UCA1,HBD,PVT1,HMBS,AC079466.1,AC079801.1,AP001531.1
LTBP1,TSC22D1,TSPAN13,BLVRA,CCND3,FAM210B,NARF,MGST3,FAM83A,CARHSP1,ALAS2,GYPA,GYPE,GYPB
ERP44,DERA,FNIP2,FSCN1,KIFAP3,PCM1,STX7,NCOA1,GNAS,AAMDC,OSBPL8,MYL6,SNAP29,QPRT,FCGRT,ETFB,CPED1,VAT1,BIRC2,CORO1C,CERT1,TFDP2,ATP6V1A,HSDL2,CYSTM1,TAF12,POLR3GL,TRIM24,ATP6V1F,REEP5,APOE,ZNF428,VPS25,BTG1,CMPK2,USP15,LCP1,ZDHHC4,NDUFB5,TPMT,CTDSPL2,CLTC,GUK1,COMMD3,HSD17B12,MBIP,APOOL,APPL1,EPB41,CSTB,PRMT2,KIAA1522,AGGF1,ZNF83,ZNF507,STXBP6,JUP,SNAPC5,GRINA,EXOC3,TRAPPC6B,TTC3,RPS27L,PURA,MYO6,DCAF12,CSAG1,SNX2,UBE2V1,TXNIP,IGFL2-AS1,EBLN3P
PSMB1,PSMC4,RPL26L1,NDUFB4,UFD1,HSPB11,BZW1,PSMC5,CDC5L,ERH,PSMC1,PSMA6,PSMA7,IDH3B,PSMD7,RPA3,PSMA2,COA1,PSMD3,PSMD11,GNPDA1,PSMD14,NDUFA8,SNRPC,GLO1,PSMB2,PSMA1,ADRM1,POMP,YWHAQ,PSMB7,GTF2A2,ETFA,PSMB6,NDUFS6,NDUFC2,VBP1,SEC13,PSMD4,PSMD6,KIAA1143,PSMC3,YWHAB,LSM3,BANF1,MRPS23,PSMD13,HSPA14,PSMD12,UQCC3,HSBP1,PSMB3
CFLAR,VIM,CAPG,FTL,NFKBIA,SOD2,AHNAK,SAT1,JUND,TUBB2B,IER3,UBC,RAB11FIP1,IFNGR2,IER2,SQSTM1,TUBA1A,JUNB,NABP1,ANXA2,ZFP36L1,ISG15,S100A10,TAPBP
HIPK2,NCKAP1L,TPST2,ITM2B,TLN1,ACTR2,BCL2L11,FCER1G,LMNA,LRP10
WAS,PIK3CB,CYBA,SPI1,BAX,LGALS1,PYCARD,TNNT1,PLIN3,CD33,AIP,ARPC3,GNAI2,VAMP8,ELF1,VASP,RAC2,ARHGEF6,ARPC1B,GMFG,CAP1,DOCK2,PAPSS1,SLC27A2,EMP3,SH3BGRL3,EVA1B,CD53,PTPN7,PPP1R18,ABRACL,MSN,TAGLN2,LAPTM5,ARPC5,S100A11,TPM4,GPRC5C,CAVIN1,FMNL1,IFITM2,SOCS1,SMYD3,STAC3,SPN,CFD,TAFA2,AIF1,CLIC1
MRPS24,IARS2,MRPS34,VDAC3,SLC27A5,TIMM13,POLD2,MRPS15,CHCHD5,NDUFA5,MRPL34,TXNL4A,ATP5MC1,BOLA3,CENPX,MRPL11,AURKAIP1
TPR,CLSPN,CDC45,MSH2,MYBL2,PIH1D1,MCM3,GMNN,DROSHA,TTF2,HELLS,ZWINT,DGCR8,DUT,DNMT1,GINS2,FIGNL1,NASP,RFC3,ODF2,BRCA2,FANCI,PAXBP1,RBBP4,SLBP,MCM7,CDCA4,EXO1,TYMS,HIST2H2AC,HIST1H4C,PRIM1,DHFR,PRKDC,TAF15
NDUFAB1,NDUFB2,PSMB5,PRPF31,CHCHD2,SUB1,POLE4,PARK7,ENY2,ECSIT,SNRPG,ATP5MC3,ATP5PF,PHB,NDUFS5,HINT1,TOMM5,SNRPE,XRCC6,TOMM7,RPS26,TOMM6,GTF2H5
ATP6AP1,ITM2A,NPC2,CREG1,UQCRB,B2M,GLB1,BSG,PSAP,HEXA
KMT2E,JARID2,BOD1L1,KMT2C,IP6K2,AFF4,TNRC6B,ARHGAP5,MIB1,XIAP,ARMCX3,NFAT5,TUSC3,AK1,CLCN3,CPEB4,SPTBN1,TMEM59,ASH1L,PFDN5,NCOA3,PNISR,WBP2,CD63,CEP350,GOLGA4,RNF145,ADD3,ATM,ZNF117,SPPL3,WDR26,NIPBL,MIDN,GABARAP,KIAA0232,BPTF,ZMAT3,GOLGB1,CHD2,C18orf32,DAZAP2,TMEM50A,UBE2H,NF1
MPHOSPH9,ASPM,TPX2,UBE2S,CENPA,CDC20,STMN1,CENPF,TUBA1B,CKS2,TOP2A,NUSAP1,MKI67,GSTO1,CCNB2,DBF4B,HMGB2,PTTG1,PCLAF,TUBA1C,NETO2,RRM2,TUBB4B,H2AFX,TUBB
KDM1A,POLR2J,UQCRC1,PHPT1,PSMD8,NDUFB7,RBX1,FKBP3,PSMC6,CCDC34,NDUFS8,CDK2AP1,SRSF9,SPCS1,NDUFS7,TIMM17B,PIN1,EMC6,ATP5IF1,MRPS36,MRPL47,UQCC2,PNMT,ENSA,SRP9,ATG3,NDUFB9,PDXK,NDUFB8,COX11,PPP1R14A,VEGFB,RFESD,RPS6KB2,SSNA1,DLEU1,PXMP2,RPP25,RAD23A,HIGD1A,ARL6IP4,SUMO3,NDUFA6,MRPL40,MYL6B,ADH5,MRPL42,FIS1,AL445524.1,ATP5PO
MAD1L1,PNPLA4,ERCC1,MRPS10,TNPO3,FAM50A,LYRM2,DYNLL1,PPP2R3C,SCFD1,EZR,MRPS18A,SNW1,DHRS7,COG4,C11orf58,CHKA,COQ5,VPS29,COPZ1,MRPS30,SNX4,AUP1,TSNAX,SDHB,NSL1,WASHC3,COPS5,RNF114,MEA1,NGDN,SNX6,TMEM106C,ISCU,XPA,HINT2,DYNC2LI1,COPS4,NOSIP,MPC2,LYPLAL1,C1orf131,CNIH4,TPRKB,PLIN2,SAP18,HAUS1,CWC27,DPH3,ZCCHC10,MED8,CCDC28B,MED27,MRPS18C,KRTCAP2,MRPL58,TMEM99,RNF181,MFF,NFU1,TMEM126A,EXOSC1,MRPL52,FAM192A,MRPS22,TMEM70,RBIS,ZNRF1,ZNF273,DDRGK1,MRPS18B,MCTS1,MRPS6,MRPL46
NUB1,ALAS1,SEC63,WNK1,VPS35,WAC,RAB18,MAPK1,STAG2,TJP1,TLE5,OCIAD1,UBE2D3,SNX3,CLINT1,MSH3,RNF7,TFG,BBX,ACTR3,SRSF4,RO60,PAIP2,PDLIM2,DESI2,LRIF1,RPAP2,SEPTIN7,KTN1,RAP1B,ORMDL1,ZRANB2,DCTN4,KRAS,EMC7,TAOK3,RAC1,ATP6V1G1,MTCH1,SEC31A,PIP5K1A,CALM2,RAB5A,C7orf50,GOLGA7,THYN1,SRP19,MCM3AP,IKZF3,NEMF,TRIM44,HDGFL3,TERF2IP,ARF4,TMEM208,C15orf40,KIF5B,CXXC5,HOXB2,AKIRIN1,PAK2,NRIP1,MORF4L1,KPNA4,ZNF280B,GGNBP2
CLNS1A,SNX5,PABPC4,HSP90AB1,RBM3,HSPA8,MS4A4A,ANP32B,MYC,EIF4A1,MRPL16,SNUPN,NPM1,PTMA,PHB2
THUMPD1,RIF1,SEPHS1,HNRNPC,HNRNPM,HNRNPA2B1,SYNCRIP,HNRNPD,HNRNPU,EIF5B,NOLC1,HNRNPA3
YBX3,RANBP1,POP1,C1QBP,NCL,HSPE1,RAN,EIF5A,BZW2,GCSH,HSPD1,HNRNPDL,SRSF2,SNRPD1,PPIH,EIF1AX,ALYREF,HNRNPAB,JPT2,MIF
AK2,SS18L2,NOP16,HEBP2,MRTO4,YBX1,EIF2B3,TRIB2,DAZAP1,NUDC,FH,CDV3,NOP56,NAA10,GSPT1,NAMPT,EIF3B,RBM28,BLMH,CCDC47,DRG2,GPN3,NUP107,TPD52L1,HSPA9,MRPL3,RRP9,SRM,EBNA1BP2,RCN2,SET,GTF3A,CLPP,PRMT1,CTAG2,AUNIP,TOMM40,METTL26,DKC1,PSME3,PDHA1,SCO1,CCT7,IMP4,RABEPK,SSB,WDR12,ZNF593,ILF2,LYAR,NHP2,BOD1,CCT6A,MTDH,SSRP1,CCT5,QDPR,IGF2BP1,RRP1B,FAM207A,COA7,RPL22L1,WDR43,GNL3,BMS1,PA2G4,NOC3L,NAA20,YIF1A,RUVBL1,DCTPP1,PSMG4,TBX1,CMSS1,NOC2L,ATAD3A,OPA1,SLC35B4,RTL10,BOP1,MRPL12
IFRD1,PPIE,EIF5,LTBR,MAP7D1,VAPB,DDX27,LRRFIP1,UPF3B,NSRP1,DGCR6L,DCTD,MED10,CAB39,RAP1GDS1,C1orf35,ZC3H8,RP9,NME6,KCMF1,POLR3C,DBNDD2,CNPY2,CEP95
MCUR1,DIMT1,THOC5,NSMCE4A,BAG2,FARSB,UCHL5,EXOSC8,EEF1E1,EMG1,ESD,SPATC1L,ARL5A,FXN,NUDT5,DNAJC21,SMARCC1,PSMG1
HCCS,AGPS,FAM136A,MTREX,PITHD1,KIF2A,REXO2,EXOSC5,PCNP,MTIF2,MRPL28,METTL2A,CDC34,ARMC1,UBXN8,PLEKHJ1,PDCD5,POLR2I,TMEM147,ZPR1,MAGOHB,CHPT1,UNC50,C1orf109,CENPK,UBA2,EIF3G,VPS4A,DAP3,PWP1,PNPT1,PREB,METTL5,PPIP5K2,SURF2,REXO4,MIS18A,CCDC58,GPATCH4,THOC7,NDUFAF2,STOML2,ZNF22,ATP5F1C,IDH3A,UBE2V2,CNBP,AVEN,NUDCD2,RNASEH1,PAIP1,GLMN,NDUFA11,FARSA,MRPS16,C8orf33,PGP,IRAK1,THAP7,LYRM7,RPS19BP1,SDHAF3,MRPL21,RPF2,CNOT7,MRPL38,AKT1S1,TMX2,SMIM30,NOL7,EIF6"""

In [22]:
valid_genes = string_genes.replace("\n", ",").split(",")

In [26]:
# Boolean mask over measured genes: True where the gene name is in the curated list
to_keep_genes = adata.var["gene_name"].isin(valid_genes).to_numpy()

In [28]:
adata_ = subset[:, to_keep_genes].copy()

In [29]:
adata_

AnnData object with n_obs × n_vars = 116641 × 1187
    obs: 'gem_group', 'gene', 'gene_id', 'transcript', 'gene_transcript', 'sgID_AB', 'mitopercent', 'UMI_count', 'z_gemgroup_UMI', 'core_scale_factor', 'core_adjusted_UMI_count'
    var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano'

# 4. Put together a list of pathways based on the supplements

Finally, in order to produce meaningful splits of the data for transfer learning, we pull more information from Replogle et al.: their perturbation clustering, curated into meaningful pathways.

In [18]:
EXOSOME = ['ZC3H3',
 'ZFC3H1',
 'CAMTA2',
 'DHX29',
 'DIS3',
 'EXOSC1',
 'EXOSC2',
 'EXOSC3',
 'EXOSC4',
 'EXOSC5',
 'EXOSC6',
 'EXOSC7',
 'EXOSC8',
 'EXOSC9',
 'MBNL1',
 'PABPN1',
 'PIBF1',
 'MTREX',
 'ST20-MTHFS',
 'THAP2']

SPLICEOSOME = ['ZMAT2',
 'CLNS1A',
 'DDX20',
 'DDX41',
 'DDX46',
 'ECD',
 'GEMIN4',
 'GEMIN5',
 'GEMIN6',
 'GEMIN8',
 'INTS3',
 'INTS4',
 'INTS9',
 'ICE1',
 'LSM2',
 'LSM3',
 'LSM5',
 'LSM5',
 'LSM6',
 'LSM7',
 'MMP17',
 'PHAX',
 'PRPF4',
 'PRPF6',
 'SART3',
 'SF3A2',
 'SMN2',
 'SNAPC1',
 'SNAPC3',
 'SNRPD3',
 'SNRPG',
 'TIPARP',
 'TTC27',
 'TXNL4A',
 'USPL1']

MEDIATOR_COMPLEX = ['ZDHHC7',
 'ADAM10',
 'EPS8L1',
 'FAM136A',
 'POGLUT3',
 'MED10',
 'MED11',
 'MED12',
 'MED14',
 'MED17',
 'MED18',
 'MED19',
 'MED1',
 'MED20',
 'MED21',
 'MED22',
 'MED28',
 'MED29',
 'MED30',
 'MED6',
 'MED7',
 'MED8',
 'MED9',
 'SUPT6H',
 'BRIX1',
 'TMX2']

NUCLEOTIDE_EXCISION_REPAIR = ['C1QBP',
 'CCNH',
 'ERCC2',
 'ERCC3',
 'GPN1',
 'GPN3',
 'GTF2E1',
 'GTF2E2',
 'GTF2H1',
 'GTF2H4',
 'MNAT1',
 'NUMA1',
 'PDRG1',
 'PFDN2',
 'POLR2B',
 'POLR2F',
 'POLR2G',
 'RPAP1',
 'RPAP2',
 'RPAP3',
 'TANGO6',
 'TMEM161B',
 'UXT']

S40_RIBOSOMAL_UNIT = ['ZCCHC9',
 'ZNF236',
 'C1orf131',
 'ZNF84',
 'ZNHIT6',
 'CCDC59',
 'AATF',
 'CPEB1',
 'DDX10',
 'DDX18',
 'DDX21',
 'DDX47',
 'DDX52',
 'DHX33',
 'DHX37',
 'DIMT1',
 'DKC1',
 'DNTTIP2',
 'ESF1',
 'FBL',
 'FBXL14',
 'FCF1',
 'GLB1',
 'HOXA3',
 'IMP4',
 'IMPA2',
 'KRI1',
 'KRR1',
 'LTV1',
 'MPHOSPH10',
 'MRM1',
 'NAF1',
 'NOB1',
 'NOC4L',
 'NOL6',
 'NOP10',
 'PDCD11',
 'ABT1',
 'PNO1',
 'POP1',
 'POP4',
 'POP5',
 'PSMG4',
 'PWP2',
 'RCL1',
 'RIOK1',
 'RIOK2',
 'RNF31',
 'RPP14',
 'RPP30',
 'RPP40',
 'RPS10-NUDT3',
 'RPS10',
 'RPS11',
 'RPS12',
 'RPS13',
 'RPS15A',
 'RPS18',
 'RPS19BP1',
 'RPS19',
 'RPS21',
 'RPS23',
 'RPS24',
 'RPS27A',
 'RPS27',
 'RPS28',
 'RPS29',
 'RPS2',
 'RPS3A',
 'RPS3',
 'RPS4X',
 'RPS5',
 'RPS6',
 'RPS7',
 'RPS9',
 'RPSA',
 'RRP12',
 'RRP7A',
 'RRP9',
 'SDR39U1',
 'SRFBP1',
 'TBL3',
 'TRMT112',
 'TSR1',
 'TSR2',
 'BYSL',
 'C12orf45',
 'USP36',
 'UTP11',
 'UTP20',
 'UTP23',
 'UTP6',
 'BUD23',
 'WDR36',
 'WDR3',
 'WDR46',
 'AAR2']

S39_RIBOSOMAL_UNIT = ['AARS2',
 'DHX30',
 'GFM1',
 'HMGB3',
 'MALSU1',
 'MRPL10',
 'MRPL11',
 'MRPL13',
 'MRPL14',
 'MRPL16',
 'MRPL17',
 'MRPL18',
 'MRPL19',
 'MRPL22',
 'MRPL23',
 'MRPL24',
 'MRPL27',
 'MRPL2',
 'MRPL33',
 'MRPL35',
 'MRPL36',
 'MRPL37',
 'MRPL38',
 'MRPL39',
 'MRPL3',
 'MRPL41',
 'MRPL42',
 'MRPL43',
 'MRPL44',
 'MRPL4',
 'MRPL50',
 'MRPL51',
 'MRPL53',
 'MRPL55',
 'MRPL9',
 'MRPS18A',
 'MRPS30',
 'NARS2',
 'PTCD1',
 'RPUSD4',
 'TARS2',
 'VARS2',
 'YARS2']

S60_RIBOSOMAL_UNIT = ['CARF',
 'CCDC86',
 'DDX24',
 'DDX51',
 'DDX56',
 'EIF6',
 'ABCF1',
 'GNL2',
 'LSG1',
 'MAK16',
 'MDN1',
 'MYBBP1A',
 'NIP7',
 'NLE1',
 'NOL8',
 'NOP16',
 'NVL',
 'PES1',
 'PPAN',
 'RBM28',
 'RPL10A',
 'RPL10',
 'RPL11',
 'RPL13',
 'RPL14',
 'RPL17',
 'RPL19',
 'RPL21',
 'RPL23A',
 'RPL23',
 'RPL24',
 'RPL26',
 'RPL27A',
 'RPL30',
 'RPL31',
 'RPL32',
 'RPL34',
 'RPL36',
 'RPL37A',
 'RPL37',
 'RPL38',
 'RPL4',
 'RPL5',
 'RPL6',
 'RPL7',
 'RPL8',
 'RPL9',
 'RRS1',
 'RSL1D1',
 'SDAD1',
 'BOP1',
 'TEX10',
 'WDR12']


MT_PROTEIN_TRANSLOCATION = ['AARS',
 'CHCHD4',
 'DNAJA3',
 'DNAJC19',
 'EIF2B1',
 'EIF2B2',
 'EIF2B3',
 'EIF2B4',
 'EIF2B5',
 'FARSA',
 'FARSB',
 'GFER',
 'GRPEL1',
 'HARS',
 'HSPA9',
 'HSPD1',
 'HSPE1',
 'IARS2',
 'LARS',
 'LETM1',
 'NARS',
 'OXA1L',
 'PGS1',
 'PHB2',
 'PHB',
 'PMPCA',
 'PMPCB',
 'ATP5F1A',
 'ATP5F1B',
 'ATP5PD',
 'QARS',
 'RARS',
 'SAMM50',
 'PRELID3B',
 'TARS',
 'TIMM23B',
 'TIMM44',
 'TOMM22',
 'TTC1',
 'VARS']

In [19]:
pathway_list = [EXOSOME, SPLICEOSOME, MEDIATOR_COMPLEX, NUCLEOTIDE_EXCISION_REPAIR,
                S40_RIBOSOMAL_UNIT, S39_RIBOSOMAL_UNIT, S60_RIBOSOMAL_UNIT,
                MT_PROTEIN_TRANSLOCATION]

In [20]:
[len(x) for x in pathway_list]

[20, 35, 26, 23, 97, 43, 53, 40]
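One possible way to use these lists for a split is to map each targeted gene to its pathway and hold out one pathway entirely. The sketch below is an assumption about how such a split could look, not the procedure of the manuscript; `demo_pathways`, `gene_to_pathway` and `split_by_pathway` are hypothetical names, and the lists are shortened to the first three genes of two pathways.

```python
# Shortened stand-ins for two of the lists above (first three genes of each).
demo_pathways = {
    "EXOSOME": ["ZC3H3", "ZFC3H1", "CAMTA2"],
    "SPLICEOSOME": ["ZMAT2", "CLNS1A", "DDX20"],
}

# Map every targeted gene to the name of its pathway.
gene_to_pathway = {}
for name, genes in demo_pathways.items():
    for gene in genes:
        # setdefault keeps the first pathway if a gene were listed twice
        gene_to_pathway.setdefault(gene, name)


def split_by_pathway(targets, held_out):
    """Split perturbation targets into (train, test) by holding out one pathway."""
    train = [t for t in targets if gene_to_pathway.get(t) != held_out]
    test = [t for t in targets if gene_to_pathway.get(t) == held_out]
    return train, test


# Targets outside any listed pathway (here CTSC) stay in the training set.
train, test = split_by_pathway(
    ["ZC3H3", "ZMAT2", "DDX20", "CTSC"], held_out="SPLICEOSOME"
)
print(train, test)
```

With the full lists, the dictionary would be built from all eight pathways in `pathway_list`.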

Finally, we write the filtered dataset to disk.

In [32]:
adata_.write_h5ad("replogle.h5ad")
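The call above stores the filtered expression matrix, but not the pathway lists. If those are needed later, one option is to serialize them as JSON next to the `.h5ad` file. This is a hedged sketch with shortened lists; the dictionary layout and the file name mentioned in the comment are assumptions, not part of the original notebook.

```python
import json

# Shortened stand-ins for two of the pathway lists defined earlier.
demo_pathways = {
    "EXOSOME": ["ZC3H3", "ZFC3H1"],
    "MEDIATOR_COMPLEX": ["MED10", "MED11"],
}

# Serialize to a JSON string; sort_keys makes the output order deterministic.
serialized = json.dumps(demo_pathways, indent=2, sort_keys=True)

# Round trip: parsing the string gives back the same dictionary.
restored = json.loads(serialized)

# To persist, the string could be written to a file, e.g. (hypothetical name):
# pathlib.Path("replogle_pathways.json").write_text(serialized)
print(serialized)
```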