add new file
reedliu committed Sep 2, 2021
1 parent bd28492 commit a42f9e3
Showing 5 changed files with 63 additions and 8 deletions.
6 changes: 2 additions & 4 deletions R/transID.R
@@ -14,13 +14,11 @@
#' @examples
#' transId(
#' id = c("Cyp2c23", "Fhit", "Gal3st2b", "Trp53", "Tp53"),
-#' trans_to = "ensembl", org = "mouse", simple = TRUE
-#' )
+#' trans_to = "ensembl", org = "mouse", simple = TRUE)
#' # input id contains duplicates, a fake id, and a one-to-many match id
#' transId(
#' id = c("MMD2", "HBD", "TP53", "RNR1", "TEC", "BCC7", "FAKEID", "TP53"),
-#' trans_to = "entrez", org = "hg", simple = FALSE
-#' )
+#' trans_to = "entrez", org = "hg", simple = FALSE)
transId <- function(id, trans_to, org, simple = TRUE) {

#--- args ---#
14 changes: 14 additions & 0 deletions README.md
@@ -96,6 +96,7 @@ remotes::install_github("GangLiLab/genekitr", build_vignettes = TRUE, dependenci

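- [x] ID conversion with `transId` tolerates invalid ids (they return NA), and the output order matches the input order
- [x] Extract the converted ids from the `genInfo` result, which is faster and more accurate and guarantees that output and input share the same order
- [ ] Speed up ID conversion for large datasets; this requires improving `genInfo`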
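
The order-preserving behaviour described in the items above can be illustrated with base R's `match()`. This is only a minimal sketch, not genekitr's implementation: the lookup table `toy_map`, its placeholder values, and the helper `lookup_in_order()` are hypothetical and exist purely for illustration.

```r
# Hypothetical lookup table; the values on the right are placeholders,
# not real Ensembl ids.
toy_map <- data.frame(
  symbol  = c("Fhit", "Trp53", "Cyp2c23"),
  ensembl = c("ENS_FHIT", "ENS_TRP53", "ENS_CYP2C23"),
  stringsAsFactors = FALSE
)

# Return one result per input id, in input order.
# Ids that are absent from the table yield NA instead of being dropped.
lookup_in_order <- function(id, map) {
  idx <- match(id, map$symbol)  # NA where the id has no match
  out <- map$ensembl[idx]
  names(out) <- id
  out
}

# Input with a fake id and a duplicate
res <- lookup_in_order(c("Trp53", "FAKEID", "Fhit", "Trp53"), toy_map)
print(res)
```

Because `match()` returns exactly one index (or NA) per input element, the result always has the same length and order as the input, duplicates included.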

##### Data analysis (Analyse)

@@ -122,11 +123,24 @@ remotes::install_github("GangLiLab/genekitr", build_vignettes = TRUE, dependenci
## DEBUG

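- [x] The `use_symbol` argument of `genGO` had no effect (cause: when the supplied ids are already symbols, the argument is ignored)

- [x] Functions work, but their help pages do not show up (cause: forgot to run `devtools::document()` after writing the function; reloading the package without that step leaves the documentation stale)

- [x] When one symbol maps to several entrez ids, they are sorted in ascending numeric order before being merged, because for a given symbol name the smaller entrez id is the more commonly used one

- [x] Updated `genInfo` and `transID` with a new argument `simple = TRUE` to handle an id that has several matches (with `simple = TRUE`, the result follows the order of the input ids; with `simple = F`, all matches are returned)

- [ ] Are the `biomart` results incomplete as well? In Ensembl, [this gene](http://asia.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000002079;r=7:99238829-99311130), `ENSG00000002079`, corresponds to `MYH16`, but the biomart result has no match for `ENSG00000002079`, and the Ensembl id returned for `MYH16` is also NA

- Workaround 1: try downloading all of the mapping data for the species from Ensembl: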

```xml
# Using human as an example, download the ensembl, symbol, and uniprot mappings
wget -O result.txt 'http://www.ensembl.org/biomart/martservice?query=<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query><Query virtualSchemaName = "default" formatter = "TSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" ><Dataset name = "hsapiens_gene_ensembl" interface = "default" ><Attribute name = "ensembl_gene_id" /><Attribute name = "hgnc_symbol" /><Attribute name = "uniprotswissprot" /></Dataset></Query>'
```
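
The one-to-many handling described in the DEBUG items above (ascending entrez order, `simple = TRUE` versus `simple = F`) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `toy_map`, its made-up entrez values, and the helper `resolve_ids()` are hypothetical. How genekitr itself merges multiple matches is not shown in this commit, so the sketch simply keeps the smallest entrez id when `simple = TRUE`.

```r
# Hypothetical one-to-many table: "TEC" has two matches.
# The entrez values are made-up placeholders, not real ids.
toy_map <- data.frame(
  symbol = c("TP53", "TEC", "TEC", "HBD"),
  entrez = c("700", "9000", "35", "12"),
  stringsAsFactors = FALSE
)

resolve_ids <- function(id, map, simple = TRUE) {
  # Sort numerically so the smallest entrez id comes first for each symbol
  map <- map[order(as.numeric(map$entrez)), ]
  if (simple) {
    # One row per input id, in input order; match() picks the first,
    # i.e. smallest, hit and gives NA for ids without a match
    data.frame(
      input_id = id,
      entrez = map$entrez[match(id, map$symbol)],
      stringsAsFactors = FALSE
    )
  } else {
    # All matches for each distinct input id; unmatched ids keep one NA row
    hits <- lapply(unique(id), function(x) {
      e <- map$entrez[map$symbol == x]
      if (length(e) == 0) e <- NA_character_
      data.frame(input_id = x, entrez = e, stringsAsFactors = FALSE)
    })
    do.call(rbind, hits)
  }
}

ids <- c("TEC", "FAKEID", "TP53", "TEC")
simple_res <- resolve_ids(ids, toy_map, simple = TRUE)
all_res <- resolve_ids(ids, toy_map, simple = FALSE)
print(simple_res)
print(all_res)
```

Sorting once up front means both branches see the same ordering, so the row chosen under `simple = TRUE` is always the first row listed for that id under `simple = FALSE`.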





## Let's mine data!
6 changes: 2 additions & 4 deletions man/transId.Rd


36 changes: 36 additions & 0 deletions test-zone/generate-expdat.R
@@ -4,3 +4,39 @@ airway
as.data.frame(colData(airway))
summary(colSums(assay(airway))/1e6)
metadata(rowRanges(airway))

exprSet=assay(airway)
group_list=colData(airway)[,3]
colnames(exprSet)
pheatmap::pheatmap(cor(exprSet))
group_list
tmp=data.frame(g=group_list)
rownames(tmp)=colnames(exprSet)
# Samples within a group should be more similar to each other than to samples from other groups!
pheatmap::pheatmap(cor(exprSet),annotation_col = tmp)
dim(exprSet)
exprSet=exprSet[apply(exprSet,1, function(x) sum(x>1) > 5),]
dim(exprSet)

exprSet=log(edgeR::cpm(exprSet)+1)
dim(exprSet)
exprSet=exprSet[names(sort(apply(exprSet, 1,mad),decreasing = T)[1:500]),]
dim(exprSet)
M=cor(log2(exprSet+1))
tmp=data.frame(g=group_list)
rownames(tmp)=colnames(M)
pheatmap::pheatmap(M,annotation_col = tmp)

pheatmap::pheatmap(scale(cor(log2(exprSet+1))))

length(unique(rownames(exprSet)))
id = transId(rownames(exprSet),'symbol','hs',simple = F)

id2 = clusterProfiler::bitr(rownames(exprSet),'ENSEMBL','SYMBOL','org.Hs.eg.db')

table(is.na(id))


x = genInfo(rownames(exprSet),'hs')


9 changes: 9 additions & 0 deletions test-zone/test_biomart.R
@@ -67,6 +67,15 @@ bmt2 <- getBM( values = mm_id,
dataset = paste0(organism,"_gene_ensembl"),
host = "asia.ensembl.org"))


getBM( values = 'ENSG00000002079',
attributes = c("uniprot_gn_symbol",'uniprotswissprot','ensembl_gene_id',
'uniprot_gn_id','uniprotsptrembl'),
filters = "ensembl_gene_id",
mart = useMart("ensembl",
dataset = paste0(organism,"_gene_ensembl"),
host = "asia.ensembl.org"))

# %>%
# data.table::setnames(., old =colnames(.),
# new = c('ensembl','chr','start','end','strand','gc_content','gene_biotype','transcript_count')) %>%
