Keep popular fails with duplicate gene names #92

Closed
rbutleriii opened this issue Apr 3, 2024 · 7 comments

@rbutleriii

Trying to use the keep_popular setting in drop_uninformative_genes, and I get an error:

> ctd_drop <- drop_uninformative_genes(
+   exp = expr,
+   level2annot = ctype,
+   convert_orths = TRUE,
+   non121_strategy = "keep_popular",
+   input_species = "mouse",
+   output_species = "human",
+   no_cores = 4
+ )
Check 12936Check 1122
4 core(s) assigned as workers (60 reserved).
one2one_strategy='keep_popular' selected.
Setting method='gprofiler' and 'mthreshold=1.
Preparing gene_df.
data.frame format detected.
Extracting genes from rownames.
12,936 genes extracted.
Converting mouse ==> human orthologs using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: mmusculus
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Checking for genes without orthologs in human.
Extracting genes from input_gene.
13,075 genes extracted.
Extracting genes from ortholog_gene.
13,075 genes extracted.
Dropping 1,829 NAs of all kinds from ortholog_gene.
Checking for genes without 1:1 orthologs.
Filtering gene_df with gene_map
Setting ortholog_gene to rownames.
Error in `.rowNamesDF<-`(x, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘ACAA1’, ‘ACAD10’, ‘ACOT2’, ‘ACOT9’, ‘AGTPBP1’, ‘ALDH1A1’, ‘ALDOA’, ‘ALYREF’, ‘APH1B’, ‘BEX2’, ‘CALM1’, ‘CD59’, ‘CS’, ‘CYP4F8’, ‘DYNLT1’, ‘ECI2’, ‘ERMARD’, ‘GBP4’, ‘GBP6’, ‘GSTM1’, ‘H2AC4’, ‘HJURP’, ‘HMSD’, ‘HSPA1B’, ‘IFITM3’, ‘IFT70B’, ‘IRGM’, ‘ISOC2’, ‘LCORL’, ‘LY6S’, ‘MAPK1IP1L’, ‘MFAP1’, ‘MORC2’, ‘MYC’, ‘NUDT11’, ‘NUTF2’, ‘PCDHB4’, ‘PCDHB5’, ‘PLSCR1’, ‘PNP’, ‘POT1’, ‘PRPS1’, ‘PSME2’, ‘RBM12B’, ‘RBMX’, ‘RNF113A’, ‘RPL34’, ‘RPL36A’, ‘RPLP2’, ‘RSPH3’, ‘SCD’, ‘SERPINB1’, ‘SIAH1’, ‘SLC22A5’, ‘SLCO1A2’, ‘SPCS3’, ‘SRP54’, ‘TCEAL3’, ‘TMT1A’, ‘UBA52’, ‘UCHL3’, ‘UTP14A’, ‘VWA5A’, ‘ZCCHC18’, ‘ZFP91’, ‘ZKSCAN4’, ‘ZNF195’, ‘ZNF233’, ‘ZNF274’, ‘ZNF34’, ‘ZNF420’, ‘ZNF443 [... truncated]

Looks like it returns duplicates. Here are the first 500 of my gene names; Hjurp appears to be one of the genes that causes the error.

> dput(rownames(expr)[1:500])
c("Xkr4", "Sox17", "Mrpl15", "Lypla1", "Tcea1", "Rgs20", "Atp6v1h",
"Oprk1", "Npbwr1", "Rb1cc1", "St18", "Pcmtd1", "Sntg1", "Rrs1",
"3110035E14Rik", "Vcpip1", "Sgk3", "Snhg6", "Snord87", "Cops5",
"Cspp1", "Arfgef1", "Prex2", "A830018L16Rik", "Slco5a1", "Ncoa2",
"Tram1", "Lactb2", "Eya1", "Kcnb2", "Terf1", "Rpl7", "Rdh10",
"Stau2", "Ube2w", "Tceb1", "Tmem70", "Ly96", "Jph1", "Gdap1",
"Crispld1", "Mcm3", "Paqr8", "Tmem14a", "Kcnq5", "Rims1_loc1",
"Rims1_loc2", "Ogfrl1", "B3gat2", "Smap1", "1110058L19Rik", "Fam135a",
"Col19a1", "Lmbrd1", "Bai3", "Gm20172", "Phf3", "Ptp4a1", "Gm13363",
"Khdrbs2", "Prim2", "Rab23", "Bag2", "Zfp451", "Gm15455", "Bend6",
"Dst", "Ccdc115", "Imp4", "Amer3", "Arhgef4", "Fam168b", "Plekhb2",
"Hs6st1", "Uggt1", "Arid5a", "Kansl3", "Lman2l", "Cnnm4", "Cnnm3",
"Ankrd23", "Ankrd39", "Sema4c", "Fam178b_loc2", "Cox5b", "Actr1b",
"Tmem131", "Inpp4a", "Coa5", "Unc50", "Mgat4a", "4930594C11Rik",
"2010300C02Rik", "Tsga10", "Lipt1", "Mitd1", "Mrpl30", "Txndc9",
"Eif5b", "Rev1", "Aff3", "Lonrf2", "Chst10", "Pdcl3", "Npas2",
"Rpl31", "Tbc1d8", "Cnot11", "Rnf149", "Creg2", "Map4k4", "Gm16894",
"Il1r1", "Slc9a2", "Mfsd9", "2610017I09Rik", "Pou3f3", "2900092D14Rik",
"Mrps9", "Gpr45", "Tgfbrap1", "AI597479", "Fhl2", "Nck2", "Uxs1",
"Tpp2", "Tex30", "Kdelc1", "Bivm", "Ercc5", "Gulp1", "Col3a1",
"Col5a2", "Wdr75", "Slc40a1", "Slc39a10", "Tmeff2", "Sdpr", "Nabp1",
"Myo1b", "Stat1", "Gls", "Nab1", "Tmem194b", "Mfsd6", "Inpp1",
"Hibch", "1700019D03Rik", "Mstn", "Pms1", "Ormdl1", "Osgepl1",
"Asnsd1", "Stk17b", "Hecw2", "Gtf3c3", "Pgap1", "Ankrd44", "Sf3b1",
"Coq10b", "Hspd1", "Hspe1", "Mob4", "Rftn2", "Mars2", "Plcl1",
"1700066M21Rik", "Tyw5", "9430016H08Rik", "Spats2l", "Kctd18",
"Aox1", "Aox3", "Bzw1", "Clk1", "Ppil3", "Nif3l1", "Orc2", "Fam126b",
"Ndufb3", "Cflar", "Casp8", "Trak2", "Stradb", "Tmem237", "Mpp4",
"Als2", "Sumo1", "Nop58", "Bmpr2", "Fam117b", "Ica1l", "Wdr12",
"Carf", "Nbeal1", "Cyp20a1", "Abi2", "Raph1", "Nrp2", "Ino80d",
"Gm11602", "Ndufs1", "Eef1b2", "Zdbf2", "Adam23", "Fastkd2",
"Klf7", "Creb1", "Mettl21a", "2810408I11Rik", "Ccnyl1", "Fzd5",
"Plekhm3", "D630023F18Rik", "Idh1", "Pikfyve", "Map2", "Unc80",
"Rpe", "Kansl1l", "Acadl", "Lancl1", "Erbb4", "Ikzf2", "Vwc2l",
"Atic", "Fn1", "Mreg", "Pecr", "Tmem169", "Xrcc5", "Marchf4",
"Smarcal1", "Rpl37a", "Igfbp2", "Igfbp5", "Tns1", "Arpc2", "Aamp",
"Pnkd_loc2", "Pnkd_loc1", "Tmbim1", "Ctdsp1", "Usp37", "Rqcd1",
"Plcd4", "Zfp142", "Bcs1l", "Rnf25", "Cyp27a1", "Wnt10a", "Cdk5r2",
"Nhej1", "Cnppd1", "Fam134a", "Zfand2b", "Abcb6", "Atg9a", "Ankzf1",
"Glb1l", "Stk16", "Tuba4a", "Dnajb2", "Ptprn", "Resp18", "Dnpep",
"Des", "Speg", "Gmppa", "Asic4", "Chpf", "Tmem198", "Obsl1",
"Inha", "Stk11ip", "Slc4a3", "Epha4", "Sgpp2", "Farsb", "Acsl3",
"Utp14b", "Scg2", "Wdfy1", "Mrpl44", "Serpine2", "Cul3", "Dock10",
"Nyap2", "Irs1", "Rhbdd1", "Mff", "Agfg1", "Slc19a3", "Sphkap",
"Pid1", "Dner", "Trip12", "Fbxo36", "Slc16a14", "Sp110", "Sp140",
"Sp100", "Cab39", "Itm2c", "4933407L21Rik", "Psmd1", "Armc9",
"Ncl", "Snora75", "C130036L24Rik", "Ptma", "Pde6d", "Cops7b",
"Nppc", "Dis3l2", "Ecel1", "Prss56", "Eif4e2", "Efhd1", "Gigyf2",
"Ngef", "Neu2", "Inpp5d", "Atg16l1", "Dgkd", "Usp40", "Dnajb3",
"Hjurp", "6430706D22Rik", "A730008H23Rik", "Arl4c", "Sh3bp4",
"Agap1", "Ackr3", "Cops8", "Lrrfip1", "Ramp1", "Ube2f", "Scly",
"Ilkap", "Hes6", "Per2", "Traf3ip1", "Asb1", "Hdac4", "Ndufa10",
"Myeov2", "Gpc1", "Dusp28", "Rnpepl1", "9430060I03Rik", "Capn10",
"Kif1a", "Mterfd2", "Pask", "Ppp1r7", "Hdlbp", "Septin2", "Farp2",
"Stk25", "Bok", "Thap4", "Atg4b", "Dtymk", "Ing5", "D2hgdh",
"Fam174a", "St8sia4", "Slco4c1", "D1Ertd622e", "Ppip5k2", "Gin1",
"Pam", "B230216N24Rik", "Cntnap5b", "Cdh20", "Rnf152", "Pign",
"2310035C23Rik", "Zcchc2", "Phlpp1", "Bcl2", "Kdsr", "Vps4b",
"Cdh7", "Cdh19", "Dsel", "Tsn", "Mki67ip", "Clasp1", "2900060B14Rik",
"Ralb", "Tmem185b", "Epb4.1l5", "Ptpn4", "Tmem177", "Tmem37",
"Dbi", "3110009E18Rik", "Steap3", "Insig2", "Ccdc93", "Ddx18",
"Dpp10", "Actr3", "Slc35f5", "Gpr39", "Lypd1", "Nckap5", "Mgat5",
"Tmem163", "2900009J06Rik", "Ccnt2", "Rab3gap1", "Zranb3", "R3hdm1",
"Ubxn4", "Mcm6", "Dars", "Cxcr4", "Thsd7b", "Cd55", "Pfkfb2",
"Yod1", "AA986860", "Mapkapk2", "Eif2d", "Rassf5", "Srgap2",
"5430435G22Rik", "Slc41a1", "Rab7l1", "Nucks1", "Slc45a3", "Mfsd4",
"Cdk18", "Klhdc8a", "Tmcc2", "Dstyk", "Rbbp5", "Tmem81", "Cntn2",
"Nfasc", "Lrrn2", "Mdm4", "Pik3c2b", "Ppp1r15b", "Plekha6", "Etnk2",
"Sox13", "Snrpe", "Zc3h11a", "Zbed6", "Atp2b4", "Prelp", "Btg2",
"Adora1", "Ppfia4", "Tmem183a", "Cyb5r1", "Adipor1", "Klhl12",
"Rabif", "Kdm5b", "Syt2", "Ppp1r12b", "Ptprv", "Ptpn7", "Arl8a",
"Gpr37l1", "Rnpep", "Timm17a", "Lmod1", "Shisa4", "Ipo9", "Nav1",
"Csrp1", "Phlda3", "Tnni1", "Tmem9", "Kif21b", "Gpr25", "Camsap2",
"Ddx59", "Zfp281", "Nek7", "2310009B15Rik", "Dennd1b", "Zbtb41",
"Gm4788", "Cfhr2", "Cfh", "Kcnt2", "Cdc73", "B3galt2", "Glrx2",
"Trove2", "Uchl5", "Rgs2", "Rgs1")
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] stringr_1.5.1       biomaRt_2.54.1      ggrepel_0.9.5
 [4] ggridges_0.5.6      reshape2_1.4.4      ggplot2_3.4.4
 [7] plyr_1.8.9          EWCE_1.6.0          RNOmni_1.0.1.2
[10] data.table_1.15.0   GEOquery_2.66.0     Biobase_2.58.0
[13] BiocGenerics_0.44.0

loaded via a namespace (and not attached):
  [1] backports_1.4.1               AnnotationHub_3.6.0
  [3] BiocFileCache_2.11.1          lazyeval_0.2.2
  [5] orthogene_1.4.2               ewceData_1.6.0
  [7] BiocParallel_1.37.1           GenomeInfoDb_1.34.9
  [9] digest_0.6.34                 yulab.utils_0.1.4
 [11] htmltools_0.5.7               fansi_1.0.6
 [13] magrittr_2.0.3                memoise_2.0.1
 [15] tzdb_0.4.0                    limma_3.58.1
 [17] Biostrings_2.66.0             readr_2.1.5
 [19] matrixStats_1.2.0             R.utils_2.12.3
 [21] prettyunits_1.2.0             colorspace_2.1-0
 [23] blob_1.2.4                    rappdirs_0.3.3
 [25] dplyr_1.1.4                   tcltk_4.2.2
 [27] crayon_1.5.2                  RCurl_1.98-1.14
 [29] jsonlite_1.8.8                SparseArray_1.2.3
 [31] ape_5.7-1                     glue_1.7.0
 [33] gtable_0.3.4                  zlibbioc_1.48.0
 [35] XVector_0.42.0                HGNChelper_0.8.1
 [37] DelayedArray_0.28.0           car_3.1-2
 [39] SingleCellExperiment_1.20.1   abind_1.4-7
 [41] scales_1.3.0                  DBI_1.2.2
 [43] rstatix_0.7.2                 Rcpp_1.0.12
 [45] progress_1.2.3                viridisLite_0.4.2
 [47] xtable_1.8-4                  gridGraphics_0.5-1
 [49] tidytree_0.4.6                bit_4.0.5
 [51] stats4_4.2.2                  S4Arrays_1.2.0
 [53] htmlwidgets_1.6.4             httr_1.4.7
 [55] ellipsis_0.3.2                XML_3.99-0.16.1
 [57] pkgconfig_2.0.3               R.methodsS3_1.8.2
 [59] farver_2.1.1                  dbplyr_2.4.0
 [61] utf8_1.2.4                    ggplotify_0.1.2
 [63] tidyselect_1.2.1              labeling_0.4.3
 [65] rlang_1.1.3                   later_1.3.2
 [67] AnnotationDbi_1.60.2          munsell_0.5.0
 [69] BiocVersion_3.16.0            tools_4.2.2
 [71] cachem_1.0.8                  cli_3.6.2
 [73] generics_0.1.3                RSQLite_2.3.5
 [75] ExperimentHub_2.6.0           broom_1.0.5
 [77] fastmap_1.1.1                 ggdendro_0.2.0
 [79] yaml_2.3.8                    ggtree_3.6.2
 [81] babelgene_22.9                bit64_4.0.5
 [83] fs_1.6.3                      purrr_1.0.2
 [85] KEGGREST_1.38.0               gprofiler2_0.2.3
 [87] nlme_3.1-164                  mime_0.12
 [89] R.oo_1.26.0                   grr_0.9.5
 [91] aplot_0.2.2                   xml2_1.3.6
 [93] compiler_4.2.2                plotly_4.10.4
 [95] filelock_1.0.3                curl_5.2.0
 [97] png_0.1-8                     interactiveDisplayBase_1.36.0
 [99] ggsignif_0.6.4                treeio_1.22.0
[101] tibble_3.2.1                  statmod_1.5.0
[103] homologene_1.4.68.19.3.27     stringi_1.8.3
[105] lattice_0.22-5                Matrix_1.6-5
[107] vctrs_0.6.5                   pillar_1.9.0
[109] lifecycle_1.0.4               BiocManager_1.30.22
[111] bitops_1.0-7                  httpuv_1.6.14
[113] patchwork_1.2.0               GenomicRanges_1.50.2
[115] R6_2.5.1                      promises_1.2.1
[117] IRanges_2.32.0                codetools_0.2-19
[119] MASS_7.3-60.0.1               SummarizedExperiment_1.28.0
[121] withr_3.0.0                   S4Vectors_0.36.2
[123] GenomeInfoDbData_1.2.9        parallel_4.2.2
[125] hms_1.1.3                     grid_4.2.2
[127] ggfun_0.1.4                   tidyr_1.3.1
[129] MatrixGenerics_1.10.0         carData_3.0-5
[131] ggpubr_0.6.0                  shiny_1.8.0
@rbutleriii rbutleriii added the bug label Apr 3, 2024
@Al-Murphy
Collaborator

@bschilder any idea what the keep_popular strategy in orthogene::convert_orthologs is doing, and what the expected behaviour is for EWCE?

@rbutleriii
Author

rbutleriii commented Apr 3, 2024

Looks like it calls gprofiler for that, and I assume it does something like this:

y <- data.table(
  gorth(
    query = specs$GENE, 
    source_organism = "mmusculus", 
    target_organism = "hsapiens", 
    mthreshold = 1, 
    filter_na = TRUE
  )
)

I have looked at their web documentation, and I don't know how they pick favorites, but in my case it would be very helpful to have an ortholog resolution that keeps one of the 1:many mappings, as using only 1:1 orthologs drops plenty of key genes. Notably, in my case Mt1 is pretty biologically relevant to oligodendrocytes, but it results in 10 duplicate human orthologs.
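For illustration, here is a minimal base R sketch of one way to collapse a 1:many (and many:1) mapping table down to 1:1 by keeping the first hit. The gene symbols are toy placeholders, the column names `input_gene` and `ortholog_gene` are borrowed from the log messages above, and "keep the first row" is an arbitrary rule chosen for the sketch; it is not how gprofiler or orthogene's keep_popular picks a winner.

```r
# Toy mapping table: symbols are illustrative placeholders, not real g:Orth output.
map <- data.frame(
  input_gene    = c("GeneA", "GeneA", "GeneA", "GeneB", "GeneC", "GeneD"),
  ortholog_gene = c("HsA1",  "HsA2",  "HsA3",  "HsB",   "HsCD",  "HsCD"),
  stringsAsFactors = FALSE
)

# 1:many -> keep the first ortholog listed for each input gene.
one_per_input <- map[!duplicated(map$input_gene), ]

# many:1 -> keep the first input gene for each remaining ortholog.
one2one <- one_per_input[!duplicated(one_per_input$ortholog_gene), ]

# The result can now be used as row names without clashes.
stopifnot(anyDuplicated(one2one$ortholog_gene) == 0)
print(one2one)
```

Both passes are needed: deduplicating only on the input gene still leaves GeneC and GeneD pointing at the same ortholog, which is the kind of clash that trips the row.names error.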

Currently I tend to leave the duplicates in, knowing that my own conversion (below) produces them as well, but I am not sure biomaRt has anything specifying a "best" ortholog either.

#' For a given column of mouse Ensembl IDs, replace them with human Ensembl gene IDs.
#' Filters out pseudogene and TEC biotypes. Currently still locked at Ensembl v105
#' until the fix for https://github.com/grimbough/biomaRt/issues/61
#'
#' @param dt data table to replace names
#' @param colname name of the column containing mouse ensgs
#' @param keep.mouse boolean, retain the mouse ENSG column?
#'
#' @returns dt with a GENE column with or without the mGENE column
convertMouseHuman = function(dt, colname, keep.mouse=F){
  require("biomaRt")
  print(paste(length(unique(dt[[colname]])), "genes to look up"))
  mouse = useEnsembl("ensembl", dataset="mmusculus_gene_ensembl", version="105")
  human = useEnsembl("ensembl", dataset="hsapiens_gene_ensembl", version="105")
  # match with ensembl_gene_id across both
  x = data.table(getLDS(mart=mouse, attributes=c('ensembl_gene_id'), martL=human,
                        attributesL=c('ensembl_gene_id', 'chromosome_name', 
                                      'gene_biotype'), 
                        filters='ensembl_gene_id', values=unique(dt[[colname]]),
                        bmHeader=F))
  names(x) = make.unique(names(x))
  print(paste(nrow(x), "total genes returned"))
  # ensembl transcript types to keep, excluding pseudogenes and TECs
  keep_categ = grep("pseudo", unique(x$gene_biotype), value=T, invert=T)
  keep_categ = grep("TEC", keep_categ, value=T, invert=T)
  x = x[gene_biotype %in% keep_categ]
  print(paste(nrow(x), "genes correct biotype"))
  # use only canonical chromosomes
  use_chrs = c(1:22, "X", "Y", "MT")
  x = x[chromosome_name %in% use_chrs]
  print(paste(nrow(x), "genes in chr 1:22 X Y MT"))
  setnames(x, c("ensembl_gene_id", "ensembl_gene_id.1"), c("mGENE", "GENE"))
  # merge the human IDs back onto dt (nothing is written to file here)
  dt = merge(x[, .(GENE, mGENE)], dt, all.y=T, by.x="mGENE", by.y=colname)
  dt = dt[!is.na(GENE)]
  print(paste(length(unique(dt[["GENE"]])), "unique genes with ensembl ids"))
  if (keep.mouse == FALSE) dt[, mGENE := NULL]
  return(unique(dt))
}

#' For a given column of mouse gene symbols, replace them with Ensembl IDs.
#' Filters out pseudogene and TEC biotypes.
#'
#' @param dt data table to replace names
#' @param colname name of the column containing gene symbols
#' @param keep.symbols boolean, retain the Symbol column?
#'
#' @returns dt with a GENE column with or without the Symbol column
convertGeneSymbol = function(dt, colname, keep.symbols=F){
  require("biomaRt")
  print(paste(length(unique(dt[[colname]])), "genes to look up"))
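  # NOTE: `ensver` is not defined in this snippet; it is assumed to be set
  # in the calling environment to the desired Ensembl release.
  ensembl = useEnsembl("ensembl", dataset="mmusculus_gene_ensembl", version=ensver)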
  # match with ensembl_gene_id across both
  x = data.table(getBM(attributes=c("ensembl_gene_id", "external_gene_name",
                                    "gene_biotype", "chromosome_name"),
                       filters="external_gene_name", 
                       values=unique(dt[[colname]]), 
                       mart=ensembl))
  print(paste(nrow(x), "total genes returned"))
  # ensembl transcript types to keep, excluding pseudogenes and TECs
  keep_categ = grep("pseudo", unique(x$gene_biotype), value=T, invert=T)
  keep_categ = grep("TEC", keep_categ, value=T, invert=T)
  x = x[gene_biotype %in% keep_categ]
  print(paste(nrow(x), "genes correct biotype"))
  # use only canonical chromosomes
  use_chrs = c(1:19, "X", "Y", "MT")
  x = x[chromosome_name %in% use_chrs]
  print(paste(nrow(x), "genes in chr 1:19, X, Y, MT"))
  setnames(x, c("ensembl_gene_id", "external_gene_name"), c("GENE", "Symbol"))
  # merge the Ensembl IDs back onto dt (nothing is written to file here)
  dt = merge(x[, .(GENE, Symbol)], dt, all.y=T, by.x="Symbol", by.y=colname)
  dt = dt[!is.na(GENE)]
  print(paste(length(unique(dt[["GENE"]])), "unique genes with ensembl ids"))
  if (keep.symbols == FALSE) dt[, Symbol := NULL]
  return(unique(dt))
}

@Al-Murphy
Collaborator

Looking into this more now, this seems like more of a bug for orthogene. Yes, you would think the 'keep popular' option would return just one gene per mapping, but it doesn't. EWCE uses orthogene for this mapping, and you can see that drop_non121() doesn't have a method to handle keep_popular (kp), so the code assumes a 1:1 return for this strategy, which isn't what comes back. @rbutleriii could you create a bug issue on orthogene and reference this one instead?

@bschilder
Collaborator

@rbutleriii I would indeed expect keep_popular to return only 1:1 genes, which should be compatible with EWCE.
https://github.com/neurogenomics/orthogene/blob/bc242c50396018d55fd12e653c0c069bc34dca67/R/check_keep_popular.R#L1

I think logging this as an issue (with example data) would be the best thing to do. I can then look into it there.

@rbutleriii
Author

Ok, so that took a bit, but I can now reproduce it without the dataset. Turns out the error is not specific to orthogene. It appears to be caused simply by the expression matrix being a data.frame rather than a matrix:

# library(data.table)
library(EWCE)
# library(orthogene)
library(gprofiler2)

mGenes = c(
  "Hjurp", "6430706D22Rik", "A730008H23Rik", "Arl4c", "Sh3bp4"
)
set.seed(1)
expr = data.frame(matrix(
  sample(100, length(mGenes) * 9, replace = TRUE), 
  nrow = length(mGenes)
))
rownames(expr) = mGenes
ctype = unlist(lapply(c("A", "B", "C"), rep, 3))

x <- drop_uninformative_genes(
  exp = expr,
  level2annot = ctype,
  convert_orths = TRUE,
  non121_strategy = "keep_popular",
  input_species = "mouse",
  output_species = "human"
)

If instead you load it in and keep it as a matrix, it runs fine.

expr = matrix(
  sample(100, length(mGenes) * 9, replace = TRUE), 
  nrow = length(mGenes)
)
rownames(expr) = mGenes
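The difference between the two input types can be seen in base R alone. The sketch below uses toy values and placeholder row names; it only demonstrates that a matrix tolerates duplicated row names while a data.frame refuses them. Whether this is the whole story inside orthogene is an assumption, since the thread does not show that code path.

```r
# Toy values only; the repeated symbol stands in for a duplicated ortholog.
vals <- matrix(1:6, nrow = 3)

# A matrix accepts duplicated row names without complaint.
m <- vals
rownames(m) <- c("HJURP", "HJURP", "ARL4C")

# A data.frame rejects them, which is the error reported at the top of this issue.
df <- as.data.frame(vals)
res <- suppressWarnings(tryCatch(
  {
    rownames(df) <- c("HJURP", "HJURP", "ARL4C")
    "ok"
  },
  error = function(e) conditionMessage(e)
))
print(res)
```

So any step that sets ortholog symbols as row names will fail on a data.frame as soon as two input genes map to the same ortholog, but will pass silently on a matrix.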

@bschilder
Collaborator

Thanks for making the reprex, @rbutleriii . Will close this unless there's anything else we can help with.

@Al-Murphy
Collaborator

Thanks @rbutleriii for figuring this out! I actually think this is something EWCE should handle too, though, so I have pushed a fix so drop_uninformative_genes() will now catch cases where the expression matrix is a data.frame and convert it to a matrix. See version >=1.11.4 for this.
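For anyone on an older EWCE version, a possible workaround is to do the conversion by hand before calling the function. This is a sketch built from the reprex above using base R's `as.matrix()`; it is not the code of the fix that was pushed to EWCE, and the names `expr_df` and `expr_mat` are made up for the example.

```r
mGenes <- c("Hjurp", "6430706D22Rik", "A730008H23Rik", "Arl4c", "Sh3bp4")
set.seed(1)
expr_df <- data.frame(matrix(
  sample(100, length(mGenes) * 9, replace = TRUE),
  nrow = length(mGenes)
))
rownames(expr_df) <- mGenes

# Coerce to a numeric matrix; row names (gene symbols) are carried over.
expr_mat <- as.matrix(expr_df)
stopifnot(
  is.matrix(expr_mat),
  is.numeric(expr_mat),
  identical(rownames(expr_mat), mGenes)
)

# expr_mat can then be passed as `exp` to drop_uninformative_genes() with the
# same arguments as in the reprex above (requires EWCE and network access).
```

Note that `as.matrix()` only yields a numeric matrix when every column of the data.frame is numeric; a stray character or factor column would turn the whole matrix into character.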
