export from tome to any other format? #29

Open
maximilianh opened this issue Dec 4, 2019 · 7 comments

Comments

@maximilianh

Hi, we have a tome file that we need to process. Is there any function or way to get the data out of the .tome file in a standard format, such as .mtx, .h5, or a .csv or .tsv file with genes as rows and the gene ID in the first column (possibly with the symbol, separated by | or similar)?

I can see that tome is very good at importing files, but I cannot see an export function...

thanks!
Max

@hypercompetent
Member

Hi Max,

.tome files are an HDF5-formatted sparse matrix format, so you should be able to extract the data back out to .h5 (following 10x conventions) or .mtx. Here are some examples in R (below).

If your target is Python, let me know. I think I have some code that may be able to go straight from .tome into a scipy sparse csc matrix.

For .mtx

library(rhdf5)
library(scrattch.io)

tome_file <- "//allen/programs/celltypes/workgroups/rnaseqanalysis/shiny/tomes/facs/mouse_V1_ALM_20170913/faster_transcrip.tome"

# Read the sparse matrix for exon counts
tome_matrix <- read_tome_dgCMatrix(tome_file,
                                   "/data/exon")

# Write to .mtx using the Matrix package
Matrix::writeMM(tome_matrix,
                "tome.mtx")

# Read the sample and gene names (row and column names, respectively)
sample_names <- h5read(tome_file, "/sample_names")
gene_names <- h5read(tome_file, "/gene_names")

# Write row and column names to .csv
write.csv(sample_names, "row_sample_names.csv")
write.csv(gene_names, "col_gene_names.csv")
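To check the export, the .mtx file and the two name files can be read back and reassembled. This is a minimal sketch on a toy matrix, assuming only the Matrix package; the toy values, the sample and gene names, and the temporary file paths are stand-ins for the real tome contents.

```r
library(Matrix)

# Toy stand-in for tome_matrix: 2 samples (rows) x 3 genes (columns)
toy <- sparseMatrix(i = c(1, 2, 2), j = c(1, 1, 3), x = c(4, 1, 6),
                    dims = c(2, 3))
sample_names <- c("cell_a", "cell_b")      # illustrative names
gene_names <- c("gene1", "gene2", "gene3") # illustrative names

# Write the matrix and the names the same way as above
mtx_file <- tempfile(fileext = ".mtx")
rows_file <- tempfile(fileext = ".csv")
cols_file <- tempfile(fileext = ".csv")
writeMM(toy, mtx_file)
write.csv(sample_names, rows_file)
write.csv(gene_names, cols_file)

# Read everything back; readMM() returns a triplet matrix,
# so convert it to column-compressed form
restored <- as(readMM(mtx_file), "CsparseMatrix")

# write.csv() on a plain vector stores the values in a column named "x"
restored_rows <- read.csv(rows_file, stringsAsFactors = FALSE)$x
restored_cols <- read.csv(cols_file, stringsAsFactors = FALSE)$x
dimnames(restored) <- list(restored_rows, restored_cols)
```

The .mtx format itself carries no row or column names, which is why the two name files have to travel with it.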

There is a 10x .h5 output function (write_dgCMatrix_h5()), but it looks like it may be out of date with the current structure used by 10x. Here's a way to output to the current structure:

library(rhdf5)
library(scrattch.io)

tome_file <- "//allen/programs/celltypes/workgroups/rnaseqanalysis/shiny/tomes/facs/mouse_V1_ALM_20170913/faster_transcrip.tome"

# Read the sparse matrix for exon counts
tome_matrix <- read_tome_dgCMatrix(tome_file,
                                   "/data/exon")

# Transpose to match the orientation expected by 10x
tome_matrix <- Matrix::t(tome_matrix)

# Now sample_names correspond to columns, gene_names to rows
sample_names <- h5read(tome_file, "/sample_names")
gene_names <- h5read(tome_file, "/gene_names")

# Output data in .h5 locations
h5_file <- "path_to_your.h5"

# Build groups
h5createFile(h5_file)
h5createGroup(h5_file, "/matrix")
h5createGroup(h5_file, "/matrix/features")

# Create Datasets and write their sparse matrix components
h5createDataset(h5_file, dataset = "/matrix/data", dims = length(tome_matrix@x), chunk = 1000)
h5write(tome_matrix@x, h5_file, "/matrix/data")

h5createDataset(h5_file, dataset = "/matrix/indices", dims = length(tome_matrix@i), chunk = 1000)
h5write(tome_matrix@i, h5_file, "/matrix/indices")

h5createDataset(h5_file, dataset = "/matrix/indptr", dims = length(tome_matrix@p), chunk = 1000)
h5write(tome_matrix@p, h5_file, "/matrix/indptr")

# Add shape/dims and row and column names
h5write(dim(tome_matrix), h5_file, "/matrix/shape")

h5write(sample_names, h5_file, "/matrix/barcodes")
h5write(gene_names, h5_file,  "/matrix/features/id")
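For readers unfamiliar with the @x, @i and @p slots written above, here is a small illustration of how a dgCMatrix stores its values in compressed sparse column form. It is a sketch on a toy matrix, assuming only the Matrix package; the values are made up for illustration.

```r
library(Matrix)

# Toy 3 x 3 matrix with three non-zero entries:
#   (row 1, col 1) = 5, (row 3, col 1) = 7, (row 2, col 3) = 9
toy <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(5, 7, 9),
                    dims = c(3, 3))

# Non-zero values in column-major order (written to /matrix/data above)
csc_data <- toy@x

# 0-based row index of each non-zero value (written to /matrix/indices)
csc_indices <- toy@i

# Offsets into data/indices where each column starts, with one extra
# trailing entry equal to the number of non-zeros (written to /matrix/indptr)
csc_indptr <- toy@p

# Number of rows and columns (written to /matrix/shape)
csc_shape <- dim(toy)
```

Here column 2 is empty, so its start offset in csc_indptr equals that of column 3. Both index slots are 0-based, unlike ordinary R indexing.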

I wouldn't recommend a .csv or .tsv, as expanding these files out to full matrices instead of sparse formats can make the resulting files very large. However, there is a .csv export function, write_dgCMatrix_csv():

library(rhdf5)
library(scrattch.io)

tome_file <- "//allen/programs/celltypes/workgroups/rnaseqanalysis/shiny/tomes/facs/mouse_V1_ALM_20170913/faster_transcrip.tome"

# Read the sparse matrix for exon counts
tome_matrix <- read_tome_dgCMatrix(tome_file,
                                   "/data/exon")

# Transpose to genes as rows if that's what you'd like to use
tome_matrix <- Matrix::t(tome_matrix)

# Write to .csv
write_dgCMatrix_csv(tome_matrix,
                    "tome.csv",
                    col1_name = "geneId",
                    chunk_size = 1000)

@maximilianh
Author

Hi, would it be possible to provide these files in some other format, like .mtx? Getting all these packages to work on our server is painful: it requires a certain version of R, which seems like a lot of work just for reading a few files. Would you mind providing these files in some other format besides your own file format? It would be very much appreciated and may help increase community uptake of your results...
many thanks!
Max

@wuzhaoqi1015

wuzhaoqi1015 commented Apr 2, 2020

Hello, I got the following error when I used it. Could you tell me the reason?


sample_name <- read_tome_sample_names(tome)
Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) :
HDF5. File accessibilty. Unable to open file.
gene_name <- read_tome_gene_names(tome)
Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) :
HDF5. File accessibilty. Unable to open file.


@adrisede

adrisede commented Apr 2, 2020

Hello wuzhaoqi1015,

From both of your inquiries, it seems the test dataset may not have been loaded properly.

Try this:

library(scrattch.io)
library("rhdf5")

tome <- system.file("testdata/tome",
                    "transcrip.tome",
                    package = "scrattch.io")

@wuzhaoqi1015

wuzhaoqi1015 commented Apr 3, 2020

Thank you for your reply to the previous message. I have some questions to ask.
1. I want to get a matrix with row and column names. After extracting the sparse matrix, can I add column and row names as follows?

a <- read_tome_dgCMatrix(tome, "data/t_exon")    # read exon
b <- read_tome_dgCMatrix(tome, "data/t_intron")  # read intron
sample_name <- read_tome_sample_names(tome)
gene_name <- read_tome_gene_names(tome)
rownames(a) <- gene_name
colnames(a) <- sample_name

2. I want to export this matrix to a file, but when I run write_dgCMatrix_csv() I get an error. What is the reason for it? Is it possible to use as.matrix() and then write.csv() instead?

write_dgCMatrix_csv(a, "filename", col1_name = "gene_names", chunk_size = 2000)
[1] "Writing rows 1 to 2000"
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 0, 2000
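
On the second question, the thread does not establish what causes this error. As a possible fallback, the as.matrix() plus write.csv() idea can be applied one block of rows at a time, so that only a small dense chunk is held in memory. The following is a minimal sketch assuming only the Matrix package and a matrix that already has row and column names; write_sparse_csv_chunked() is a hypothetical helper, not a scrattch.io function.

```r
library(Matrix)

# Hypothetical helper (not part of scrattch.io): write a sparse matrix to CSV
# in row chunks, converting only `chunk_size` rows to dense form at a time.
write_sparse_csv_chunked <- function(mat, file,
                                     col1_name = "gene_names",
                                     chunk_size = 2000) {
  stopifnot(nrow(mat) > 0,
            !is.null(rownames(mat)),
            !is.null(colnames(mat)))
  n <- nrow(mat)
  starts <- seq(1, n, by = chunk_size)
  for (k in seq_along(starts)) {
    rows <- starts[k]:min(starts[k] + chunk_size - 1, n)
    # Densify just this block of rows
    chunk <- as.matrix(mat[rows, , drop = FALSE])
    df <- data.frame(rownames(chunk), chunk,
                     check.names = FALSE,
                     stringsAsFactors = FALSE,
                     row.names = NULL)
    names(df)[1] <- col1_name
    # Write the header with the first chunk only, then append
    write.table(df, file, sep = ",", row.names = FALSE,
                col.names = (k == 1), append = (k > 1))
  }
  invisible(file)
}

# Usage on a toy genes x samples matrix (illustrative values)
toy <- sparseMatrix(i = c(1, 3, 5), j = c(1, 2, 1), x = c(1, 2, 3),
                    dims = c(5, 2),
                    dimnames = list(paste0("g", 1:5), c("s1", "s2")))
csv_file <- tempfile(fileext = ".csv")
write_sparse_csv_chunked(toy, csv_file, chunk_size = 2)
```

The resulting file is fully dense, so the earlier warning about file size for .csv output still applies.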

@maximilianh
Author

maximilianh commented Apr 3, 2020 via email

@KaitlynPrice

I am also getting this error:

write_dgCMatrix_csv(a, "filename", col1_name = "gene_names", chunk_size = 2000)
[1] "Writing rows 1 to 2000"
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 0, 2000


5 participants