Is matrixValidity printing row and column wrong? #52

Open
jamespeapen opened this issue Dec 4, 2023 · 17 comments

Comments

@jamespeapen
Contributor

jamespeapen commented Dec 4, 2023

I'm trying out a dataset with 333 cells and 60329 genes.

> head input/matrix.mtx

%%MatrixMarket matrix coordinate real general
333 60329 2995863
40 1 1.064
72 1 1.036
152 1 15
1 2 2
128 2 16
258 2 1
40 9 116.936
72 9 12.964

When running:

too-many-cells make-tree \
    --matrix-path input \
    --output out
"Warning: mismatch in number of (features, cells) (60329,333) with matrix (rows, columns) (333,60329), will probably result in error."..................................]   0%
too-many-cells: matMat : incompatible matrix sizes((333,60329),(88,1))
CallStack (from HasCallStack):
  error, called at src/Data/Sparse/SpMatrix.hs:793:22 in sparse-linear-algebra-0.3.2-8Sr5Y9guRLx7MwdljauHcO:Data.Sparse.SpMatrix

I wasn't sure what the error meant, so I checked the code. It looks like it's printing the matrix's cols,rows instead of rows,cols as the message says. However, the matrix does have 333 rows and 60329 columns, so I'm not sure how to interpret it or whether my matrix is set up wrong. I'm also not sure what the (88, 1) means.

-- | Check validity of matrix.
matrixValidity :: (MatrixLike a) => a -> Maybe String
matrixValidity mat
  | rows /= numCells || cols /= numFeatures =
      Just $ "Warning: mismatch in number of (features, cells) ("
                           <> show numFeatures
                           <> ","
                           <> show numCells
                           <> ") with matrix (rows, columns) ("
                           <> show cols
                           <> ","
                           <> show rows
                           <> "), will probably result in error."
  | otherwise = Nothing
  where
    (rows, cols) = S.dimSM . getMatrix $ mat
    numCells = V.length . getRowNames $ mat
    numFeatures = V.length . getColNames $ mat

I'm having issues where not all 333 cells are in the final clusters.csv output - fewer than 100 make it, depending on the --filter-thresholds used. Since the documentation says this option is optional, I took it out, and then hit this error.

I've made a draft PR swapping the two prints if it is actually wrong.

@GregorySchwartz
Owner

Thanks for catching the error in the error!

You are right, the actual error is in the (88, 1) portion. Usually this happens when the filtering or the input is unexpected in some way. For instance, you say that you have 333 cells, but they are rows in the input matrix. Cell Ranger's Matrix Market output has cells as columns, so that is what we use for input. Could you check that your barcode and features files match that? If not, you can also transpose the matrix with the --matrix-transpose argument, but you need to make sure those other files match.
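
As an alternative to --matrix-transpose, the file itself can be flipped by swapping the two index columns. The following is a minimal sketch, assuming an uncompressed coordinate-format file with `general` symmetry; `transpose_mtx` is a hypothetical helper and not part of too-many-cells.

```shell
# Hypothetical helper: transpose a coordinate-format MatrixMarket file.
# Assumes "general" symmetry and an uncompressed input file.
# Usage: transpose_mtx IN.mtx OUT.mtx
transpose_mtx() {
    awk '
        /^%/   { print; next }   # keep the banner and comment lines as they are
        NF < 2 { print; next }   # pass through blank or malformed lines untouched
        {
            # Swap row and column indices. This also flips the size line,
            # because "rows cols nnz" has the same layout as an entry.
            tmp = $1; $1 = $2; $2 = tmp
            print
        }
    ' "$1" > "$2"
}
```

After running it, the file describing the matrix rows (genes.tsv in Cell Ranger orientation) still has to match the new row count, as noted above. The entries are not re-sorted, which the Matrix Market format does not require.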

@jamespeapen
Contributor Author

jamespeapen commented Dec 5, 2023

I had initially tried cells as columns, but only got one cell in the output with the default filter thresholds:

clusters.csv:

cell,cluster,path
SE6052_SA56912_S1_L001_R1_001.merged_quant,0,0

I tried different threshold combinations, but still only had this one cell in the output.

Not including filter thresholds gives:

too-many-cells: matMat : incompatible matrix sizes((60329,333),(1,1))...........................] 0%
CallStack (from HasCallStack):
  error, called at src/Data/Sparse/SpMatrix.hs:793:22 in sparse-linear-algebra-0.3.2-8Sr5Y9guRLx7MwdljauHcO:Data.Sparse.SpMatrix

However, using the earlier mtx layout (cells as rows, genes as columns), I ran

too-many-cells make-tree \
    --matrix-path input \
    --matrix-transpose \
    --output out

This listed the genes instead of the barcodes as cells in its output, so I swapped the genes and barcodes filenames so that genes.tsv contains the cell ids and barcodes.tsv contains the genes. Then, when I reran the previous command, I got all 333 cells in the output and a dendrogram.

clusters.csv:

cell,cluster,path
SE6052_SA57031_S120_L001_R1_001.merged_quant,3,3/2/1/0
SE6054_SA56854_S135_L002_R1_001_quant,3,3/2/1/0
SE6054_SA56903_S184_L002_R1_001_quant,3,3/2/1/0
SE6052_SA56932_S21_L001_R1_001.merged_quant,5,5/4/2/1/0
SE6052_SA56952_S41_L001_R1_001.merged_quant,5,5/4/2/1/0
...

dendrogram

I'm not sure where I've gone wrong, but the opposite (and supposedly correct) setup doesn't work: a matrix with cells as columns and genes as rows, genes.tsv with gene ids, and barcodes.tsv with cell ids gives the incompatible matrix sizes error.

@GregorySchwartz
Owner

GregorySchwartz commented Dec 5, 2023

Can I see a head of each file (matrix, features, barcodes) with each file name?

For the actual case (non-transposed).

@jamespeapen
Contributor Author

jamespeapen commented Dec 5, 2023

==> barcodes.tsv <==
SE6052_SA56912_S1_L001_R1_001.merged_quant
SE6052_SA56914_S3_L001_R1_001.merged_quant
SE6052_SA56915_S4_L001_R1_001.merged_quant
SE6052_SA56916_S5_L001_R1_001.merged_quant
SE6052_SA56917_S6_L001_R1_001.merged_quant
SE6052_SA56918_S7_L001_R1_001.merged_quant
SE6052_SA56919_S8_L001_R1_001.merged_quant
SE6052_SA56920_S9_L001_R1_001.merged_quant
SE6052_SA56921_S10_L001_R1_001.merged_quant
SE6052_SA56922_S11_L001_R1_001.merged_quant

==> genes.tsv <==
ENSG00000223972
ENSG00000243485
ENSG00000284332
ENSG00000268020
ENSG00000240361
ENSG00000186092
ENSG00000233750
ENSG00000241599
ENSG00000279928
ENSG00000286448

==> matrix.mtx <==
%%MatrixMarket matrix coordinate real general
60329 333 2995863
2 1 2
15 1 138.588
16 1 155.831
17 1 61.541
18 1 1
20 1 854.483
21 1 2.003
25 1 35.331
> wc -l barcodes.tsv             
333 barcodes.tsv
> wc -l genes.tsv   
60329 genes.tsv

@GregorySchwartz
Owner

GregorySchwartz commented Dec 5, 2023

And can I see the command you ran with those files along with the error? Is it the first comment of this thread?

@jamespeapen
Contributor Author

Yes, with this data it was

too-many-cells make-tree \
    --matrix-path input \
    --output out

Adding this

too-many-cells make-tree \
    --filter-thresholds "(250, 1)" \
    --matrix-path input \
    --output out

removes the error but gives only one cell in the output.

@GregorySchwartz
Owner

I'm starting to confuse myself. I've reverted the pull request, as I realized the message was referring to features as rows and cells as columns, like with Cell Ranger (even though the program's own convention is cells as rows). Hence the swap.

For your problem, this means that the original error is that the feature file had 60329 rows but the matrix had 333 rows (and vice versa for columns). Based on what you sent me (the wc -l for each), I don't know why this would happen. As you can see in

it is getting the cell file as rows and the features as columns.

Silly question, what is in the input folder?

What happens if you do not change your features and barcode files but use -T?

@GregorySchwartz
Owner

GregorySchwartz commented Dec 5, 2023

Also, you should not use that "default" filtering threshold, as your values are definitely not 10x scRNA-seq. Just leave them at (0,0) for now.

Could you also send the tail of each file?

@jamespeapen
Contributor Author

Silly question, what is in the input folder?

> ls input
barcodes.tsv
genes.tsv
matrix.mtx

Could you also send the tail of each file?

==> barcodes.tsv <==
SE6054_SA56902_S183_L002_R1_001_quant
SE6054_SA56903_S184_L002_R1_001_quant
SE6054_SA56904_S185_L002_R1_001_quant
SE6054_SA56905_S186_L002_R1_001_quant
SE6054_SA56906_S187_L002_R1_001_quant
SE6054_SA56907_S188_L002_R1_001_quant
SE6054_SA56908_S189_L002_R1_001_quant
SE6054_SA56909_S190_L002_R1_001_quant
SE6054_SA56910_S191_L002_R1_001_quant
SE6054_SA56911_S192_L002_R1_001_quant

==> genes.tsv <==
ENSG00000277761
ENSG00000277836
ENSG00000275869
ENSG00000273554
ENSG00000278633
ENSG00000278066
ENSG00000276017
ENSG00000278817
ENSG00000277196
ENSG00000278625

==> matrix.mtx <==
60270 333 2
60271 333 26
60273 333 68
60274 333 436
60275 333 3
60278 333 393
60279 333 258
60287 333 11
60288 333 1.999
60291 333 2

What happens if you do not change your features and barcode files but use -T?

too-many-cells make-tree \
    --matrix-path input \
    -T \
    --output input
too-many-cells: matMat : incompatible matrix sizes((333,60329),(155,1)).......................] 0%
CallStack (from HasCallStack):
  error, called at src/Data/Sparse/SpMatrix.hs:793:22 in sparse-linear-algebra-0.3.2-8Sr5Y9guRLx7MwdljauHcO:Data.Sparse.SpMatrix

@jamespeapen
Contributor Author

If it's helpful, the data are from a new scRNA-seq protocol, and I'm using the data from that paper. I got a SingleCellExperiment object, pulled the count matrix from it, and wrote it to an mtx file.

@GregorySchwartz
Owner

I'm a little confused. In the first comment in this thread you had cells as rows, but in the latest one you have them as columns. Which is the original, and which error goes with which matrix?

@GregorySchwartz
Owner

Just to be clear, in a perfect world where it works, the matrix should have 333 columns and the barcode file 333 lines, with no transposing and the filters at 0.
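
The expected layout described above can be checked mechanically before running make-tree. This is only a sketch, assuming an uncompressed input directory with the file names used in this thread (matrix.mtx, genes.tsv, barcodes.tsv) and Cell Ranger orientation; `check_mtx_dims` is a hypothetical helper, not a too-many-cells command.

```shell
# Hypothetical helper: compare the MatrixMarket size line with the label files.
# Assumes matrix rows = features (genes.tsv) and matrix columns = cells
# (barcodes.tsv). Note that wc -l counts newline characters, so a label file
# without a trailing newline is reported one line short.
# Usage: check_mtx_dims INPUT_DIR
check_mtx_dims() {
    dir=$1
    # The size line is the first line that does not start with '%'.
    rows=$(awk '!/^%/ { print $1; exit }' "$dir/matrix.mtx")
    cols=$(awk '!/^%/ { print $2; exit }' "$dir/matrix.mtx")
    n_genes=$(wc -l < "$dir/genes.tsv" | tr -d ' ')
    n_cells=$(wc -l < "$dir/barcodes.tsv" | tr -d ' ')

    if [ "$rows" -eq "$n_genes" ] && [ "$cols" -eq "$n_cells" ]; then
        echo "ok: $rows features x $cols cells"
    elif [ "$rows" -eq "$n_cells" ] && [ "$cols" -eq "$n_genes" ]; then
        echo "transposed: matrix is $rows cells x $cols features"
        return 1
    else
        echo "mismatch: matrix ${rows}x${cols}, genes=$n_genes, barcodes=$n_cells"
        return 1
    fi
}
```

Run as `check_mtx_dims input`. A passing check only confirms that the counts agree; it says nothing about whether the individual entries are well formed.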

@GregorySchwartz
Owner

Try testing on the example from the TooManyCells workshop to see if it has the appropriate output, and see if the inputs match yours.

@jamespeapen
Contributor Author

jamespeapen commented Dec 5, 2023

The first comment had cells as rows and genes as columns and produced that error about mismatched dimensions (incompatible matrix sizes((333,60329),(88,1))). That matrix was not in the orientation too-many-cells expects.

The last comment from me uses the same data but, as expected, with cells as columns and genes as rows. The perfect-world scenario also produced an error with different mismatched dimensions: incompatible matrix sizes((333,60329),(155,1)).

In both cases, using the --filter-thresholds flag removes the error. When the matrix is set up as too-many-cells expects, only the first cell is present in the output, regardless of the filter threshold values.

Try testing on the example from the TooManyCells workshop to see if it has the appropriate output, and see if the inputs match yours.

I ran this and it worked perfectly. There is probably something up with my matrix - I'm having a hard time tracking down the source of the error. As far as I can tell the inputs match in mtx header and the counts of features and barcodes.

My comment with the dendrogram picture was the only time I did a transpose of the 'wrong' matrix format (like my first comment) and swapped the genes/barcodes filenames. It's also the only one that produces results without filters.

Sorry for the confusion and thanks for taking the time to go through this!

@jamespeapen
Contributor Author

jamespeapen commented Dec 5, 2023

The only difference I found between the workshop data and mine is the numeric type - real vs integer:

mine:

%%MatrixMarket matrix coordinate real general
60329 333 2995863
2 1 2
15 1 138.588
16 1 155.831
17 1 61.541
18 1 1
20 1 854.483
21 1 2.003
25 1 35.331

wc -l genes.tsv
60329

wc -l barcodes.tsv
333

workshop brain:

%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2, "software_version": "3.0.0"}
31053 1301 4220492
30976 1 73
30974 1 1
30973 1 56
30972 1 2
30971 1 11
30970 1 116
30969 1 123

zcat too-many-cells/workshop/data/brain/filtered_feature_bc_matrix/barcodes.tsv.gz | wc -l
1301

zcat too-many-cells/workshop/data/brain/filtered_feature_bc_matrix/features.tsv.gz | wc -l
31053
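
To see how much of the matrix the real-versus-integer difference actually touches, the fractional values can be counted. This is a rough sketch, assuming an uncompressed coordinate-format file; `count_fractional` is a hypothetical helper, and whether fractional values matter to too-many-cells is not established in this thread.

```shell
# Hypothetical helper: count entries whose value is not a whole number.
# Assumes an uncompressed coordinate-format file (row, column, value per line).
# Usage: count_fractional matrix.mtx
count_fractional() {
    awk '
        /^%/           { next }                   # skip banner and comment lines
        !seen_size     { seen_size = 1; next }    # skip the "rows cols nnz" line
        NF >= 3 && $3 != int($3) { n++ }          # value has a fractional part
        END            { print n + 0 }
    ' "$1"
}
```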

@GregorySchwartz
Owner

Very weird. What if you use a csv? It will be slower, but we can see whether it has something to do with the mtx file, which I think is the culprit.

@jamespeapen
Contributor Author

That worked! The tree also makes more biological sense than the transposed approach. I'll try converting between csv/mtx and figure out where my mtx file is broken.
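
One possible starting point for tracking down a broken mtx file is to check the entries against the size line. The sketch below assumes an uncompressed coordinate-format file and only looks for three generic problems (entry count differing from the declared count, indices outside the declared dimensions, repeated row/column pairs). `validate_mtx` is a hypothetical helper, and none of these is confirmed as the cause in this thread.

```shell
# Hypothetical helper: basic consistency checks on a coordinate-format file.
# Prints one summary line; duplicate detection keeps every index pair in
# memory, which is fine for a few million entries.
# Usage: validate_mtx matrix.mtx
validate_mtx() {
    awk '
        /^%/    { next }                          # skip banner and comment lines
        NF == 0 { next }                          # skip blank lines
        !have_size {
            # First data line is the size line: rows, columns, entry count.
            rows = $1 + 0; cols = $2 + 0; declared = $3 + 0
            have_size = 1
            next
        }
        {
            entries++
            r = $1 + 0; c = $2 + 0
            if (r < 1 || r > rows || c < 1 || c > cols) out_of_range++
            if (seen[r "," c]++) duplicates++
        }
        END {
            printf "declared=%d actual=%d out_of_range=%d duplicates=%d\n",
                   declared, entries, out_of_range, duplicates
        }
    ' "$1"
}
```

A clean file should report matching declared and actual counts with zero out-of-range indices and zero duplicates.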


2 participants