ENH: add embedding, fixative and staining to sm_index #30

fedorov · 2024-06-26T15:46:11Z

Since there are multiple items within a series, the use of arrays becomes unavoidable - unless we switch to instance-level index for SM

Related to https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1796.

@DanielaSchacherer please take a look!

Since there are multiple items within a series, the use of arrays becomes unavoidable - unless we switch to instance-level index for SM

DanielaSchacherer · 2024-06-27T09:15:46Z

Copy from the message that I sent you in Slack regarding the query in https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1796:

I select the base level to make sure that I am consistently taking the staining & embedding information from a defined instance. These attributes are supposed to be the same across all instances (at least VOLUME instances, I am not 100% sure about the LABEL instances) in a Series. If they are not, then it's a mistake. And to somehow be deterministic André and I decided to always use the base level.

fedorov · 2024-06-28T13:34:01Z

If we consider only the unique codes across all levels, I don't think this would be problematic. Also, this would allow us to handle those cases where we have different stains for different channels in the base layer.

fedorov · 2024-06-28T14:08:36Z

I tested it with HTAN, and it does not give me the results I expected to see. Need to spend more time on this!

Surprisingly or not, but not all slides have embedding and fixative information

fedorov · 2024-07-01T13:31:16Z

The above was due to the use of inner join and invalid assumption that all processing codes we are using are available for all slides! This is fixed now.

fedorov · 2024-07-01T14:06:08Z

I select the base level to make sure that I am consistently taking the staining & embedding information from a defined instance. These attributes are supposed to be the same across all instances (at least VOLUME instances, I am not 100% sure about the LABEL instances) in a Series. If they are not, then it's a mistake. And to somehow be deterministic André and I decided to always use the base level.

@DanielaSchacherer can you please review the query and the results it is producing?

I think aggregating distinct values over all instances in a series is the right way to do it.

fedorov · 2024-07-02T14:11:17Z

I will revisit the names for the newly added columns, and add separate columns for designator:code and codeMeaning, to be consistent with what we did in other similar columns already included.

fedorov · 2024-07-03T19:04:26Z

@DanielaSchacherer how about we call the new columns as follows:

embeddingMedium_code_designator_value_str
embeddingMedium_CodeMeaning
tissueFixative_code_designator_value_str
tissueFixative_CodeMeaning
usingSubstance_code_designator_value_str
usingSubstance_CodeMeaning

the *_code_designator_value_str columns would have values of form DCM:111744, and *_CodeMeaning is self-explanatory.

This way we would be consistent with the conventions used for illuminationType, primaryAnatomicStructure etc, which I believe we all discussed and agreed upon earlier when we added the table to IDC BQ dataset.

@dclunie do you have any comments?

DanielaSchacherer · 2024-07-04T11:47:17Z

I know, it's a deviation from the convention, but what about
Staining_usingSubstance_code_designator_value_str
Staining_usingSubstance_CodeMeaning
instead of
usingSubstance_code_designator_value_str
usingSubstance_CodeMeaning
?

dclunie · 2024-07-04T17:39:04Z

Coupling the SpecimenPreparationProcedure (of SCT:127790008 "staining") with the "Using substance" value makes sense to me, because in future there could be other substances used for other purposes.

Also, note that there may be multiple values (and this is common, e.g., H&E uses two "Using substance" entries, one for the H and one for the E.

DanielaSchacherer · 2024-07-05T10:16:24Z

yes, true. We should make sure that this is distinguishable from the cases where there is also multiple values per series, but because there are different values for the different instances instead of multiple (but the same) values for all instances. This is not currently not apparent from the query result, do you have an idea how we could achieve it?

@fedorov If we are decided about the naming, we could maybe close this pull request and make another one for the question above.

fedorov · 2024-07-05T19:29:16Z

I will update the current PR to account for this feedback.

fedorov · 2024-07-05T19:52:12Z

Also, note that there may be multiple values (and this is common, e.g., H&E uses two "Using substance" entries, one for the H and one for the E.

Yes, and this is already covered by the query, since it will aggregate all distinct values for each of the concepts selected. In this specific case, the result will look like the following (see the last 2 columns).

Similarly, for the HTAN multiplexed images, all of the stains encountered across the series instances will be captured.

Given this organization, we can select all series that were processed using any given stain or embedding.

fedorov · 2024-07-05T20:16:23Z

@DanielaSchacherer please take a look at the query, run it in BQ studio to look at the results, and let me know if you have any concerns.

DanielaSchacherer

I ran the query and checked the results. It looked good for me so far.

One concern I have is that you can not deduce from the query's result, if you have multiple values for e.g. staining if they originate from different instances or from the same. Of course, in case of H&E stain people would know that they belong together, but that might not always be true. Right now, I am not sure though about a good solution.
Another thing which confuses me a little is that for most Series we don't have any information about embedding/tissue fixative. Is that expected?

fedorov · 2024-07-09T13:32:52Z

One concern I have is that you can not deduce from the query's result, if you have multiple values for e.g. staining if they originate from different instances or from the same. Of course, in case of H&E stain people would know that they belong together, but that might not always be true. Right now, I am not sure though about a good solution.

This is expected, as we discussed today. Series-level index has the union of all stains across all resolution levels and instances within the series. Instance-level index in #31 keeps track of this at the instance level.

Another thing which confuses me a little is that for most Series we don't have any information about embedding/tissue fixative. Is that expected?

I was somewhat surprised by this as well. @dclunie is this expected?

fedorov · 2024-07-09T16:54:47Z

Per @dclunie, fixation and embedding should be non-empty except for CPTAC and HTAN. I will look and report.

fedorov · 2024-07-09T17:54:47Z

Only the following CPTAC and HTAN collections do not have embedding codes defined.

cptac_aml
cptac_brca
cptac_ccrcc
cptac_cm
cptac_coad
cptac_gbm
cptac_hnscc
cptac_lscc
cptac_luad
cptac_ov
cptac_pda
cptac_sar
cptac_ucec
htan_hms
htan_ohsu
htan_wustl

Many TCGA collections do not have fixative code defined, but I guess this is also expected, since some samples were frozen and I assume no fixative was applied.

cptac_aml
cptac_brca
cptac_ccrcc
cptac_cm
cptac_coad
cptac_gbm
cptac_hnscc
cptac_lscc
cptac_luad
cptac_ov
cptac_pda
cptac_sar
cptac_ucec
htan_hms
htan_ohsu
htan_wustl
tcga_acc
tcga_blca
tcga_brca
tcga_cesc
tcga_chol
tcga_coad
tcga_dlbc
tcga_esca
tcga_gbm
tcga_hnsc
tcga_kich
tcga_kirc
tcga_kirp
tcga_lgg
tcga_lihc
tcga_luad
tcga_lusc
tcga_meso
tcga_ov
tcga_paad
tcga_pcpg
tcga_prad
tcga_read
tcga_sarc
tcga_skcm
tcga_stad
tcga_tgct
tcga_thca
tcga_thym
tcga_ucec
tcga_ucs
tcga_uvm

DanielaSchacherer · 2024-07-10T14:01:01Z

While checking for myself, I saw that temp_table extracts collection_id, but it is not used further. Do we want to output it as well as part of the final result?

fedorov · 2024-07-10T17:28:18Z

I don't know ... it's a tradeoff between redundancy and duplication of columns available in the main index and convenience of not having to do joins. I am not sure there is a perfect solution to this.

DanielaSchacherer · 2024-07-11T09:32:47Z

collection_id helps you to avoid joins?

ENH: add embedding, fixative and staining to sm_index

0b25272

Since there are multiple items within a series, the use of arrays becomes unavoidable - unless we switch to instance-level index for SM

fedorov force-pushed the sm_index-expand branch from ccef90f to 0b25272 Compare June 26, 2024 15:59

fedorov marked this pull request as draft June 28, 2024 14:08

use left in place of inner join

66a89a0

Surprisingly or not, but not all slides have embedding and fixative information

fedorov marked this pull request as ready for review July 1, 2024 14:12

fedorov mentioned this pull request Jul 3, 2024

Add slide microscopy (SM) index #15

Closed

fedorov force-pushed the sm_index-expand branch from f528399 to a5b413b Compare July 5, 2024 19:58

update columns to follow the earlier conventions for coded items

82c9f1d

fedorov force-pushed the sm_index-expand branch from a5b413b to 82c9f1d Compare July 5, 2024 19:59

fedorov marked this pull request as draft July 5, 2024 20:01

fedorov marked this pull request as ready for review July 5, 2024 20:14

fedorov assigned DanielaSchacherer and unassigned DanielaSchacherer Jul 5, 2024

fedorov requested a review from DanielaSchacherer July 5, 2024 20:15

DanielaSchacherer reviewed Jul 9, 2024

View reviewed changes

DanielaSchacherer approved these changes Jul 9, 2024

View reviewed changes

fedorov merged commit 743ab20 into main Jul 9, 2024
10 checks passed

fedorov deleted the sm_index-expand branch July 9, 2024 13:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: add embedding, fixative and staining to sm_index #30

ENH: add embedding, fixative and staining to sm_index #30

fedorov commented Jun 26, 2024

DanielaSchacherer commented Jun 27, 2024

fedorov commented Jun 28, 2024

fedorov commented Jun 28, 2024

fedorov commented Jul 1, 2024 •

edited

Loading

fedorov commented Jul 1, 2024

fedorov commented Jul 2, 2024

fedorov commented Jul 3, 2024

DanielaSchacherer commented Jul 4, 2024

dclunie commented Jul 4, 2024

DanielaSchacherer commented Jul 5, 2024 •

edited

Loading

fedorov commented Jul 5, 2024

fedorov commented Jul 5, 2024 •

edited

Loading

fedorov commented Jul 5, 2024

DanielaSchacherer left a comment

fedorov commented Jul 9, 2024

fedorov commented Jul 9, 2024

fedorov commented Jul 9, 2024

DanielaSchacherer commented Jul 10, 2024

fedorov commented Jul 10, 2024

DanielaSchacherer commented Jul 11, 2024

ENH: add embedding, fixative and staining to sm_index #30

ENH: add embedding, fixative and staining to sm_index #30

Conversation

fedorov commented Jun 26, 2024

DanielaSchacherer commented Jun 27, 2024

fedorov commented Jun 28, 2024

fedorov commented Jun 28, 2024

fedorov commented Jul 1, 2024 • edited Loading

fedorov commented Jul 1, 2024

fedorov commented Jul 2, 2024

fedorov commented Jul 3, 2024

DanielaSchacherer commented Jul 4, 2024

dclunie commented Jul 4, 2024

DanielaSchacherer commented Jul 5, 2024 • edited Loading

fedorov commented Jul 5, 2024

fedorov commented Jul 5, 2024 • edited Loading

fedorov commented Jul 5, 2024

DanielaSchacherer left a comment

Choose a reason for hiding this comment

fedorov commented Jul 9, 2024

fedorov commented Jul 9, 2024

fedorov commented Jul 9, 2024

DanielaSchacherer commented Jul 10, 2024

fedorov commented Jul 10, 2024

DanielaSchacherer commented Jul 11, 2024

fedorov commented Jul 1, 2024 •

edited

Loading

DanielaSchacherer commented Jul 5, 2024 •

edited

Loading

fedorov commented Jul 5, 2024 •

edited

Loading