Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add embedding, fixative and staining to sm_index #30

Merged
merged 3 commits into from
Jul 9, 2024

Conversation

fedorov
Copy link
Member

@fedorov fedorov commented Jun 26, 2024

Since there are multiple items within a series, the use of arrays becomes unavoidable - unless we switch to instance-level index for SM

Related to https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1796.

@DanielaSchacherer please take a look!

Since there are multiple items within a series, the use of arrays becomes
unavoidable - unless we switch to instance-level index for SM
@DanielaSchacherer
Copy link

Copy from the message that I sent you in Slack regarding the query in https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1796:

I select the base level to make sure that I am consistently taking the staining & embedding information from a defined instance. These attributes are supposed to be the same across all instances (at least VOLUME instances, I am not 100% sure about the LABEL instances) in a Series. If they are not, then it's a mistake. And to somehow be deterministic André and I decided to always use the base level.

@fedorov
Copy link
Member Author

fedorov commented Jun 28, 2024

If we consider only the unique codes across all levels, I don't think this would be problematic. Also, this would allow us to handle those cases where we have different stains for different channels in the base layer.

@fedorov
Copy link
Member Author

fedorov commented Jun 28, 2024

I tested it with HTAN, and it does not give me the results I expected to see. Need to spend more time on this!

@fedorov fedorov marked this pull request as draft June 28, 2024 14:08
Surprisingly or not, but not all slides have embedding and fixative information
@fedorov
Copy link
Member Author

fedorov commented Jul 1, 2024

The above was due to the use of inner join and invalid assumption that all processing codes we are using are available for all slides! This is fixed now.

@fedorov
Copy link
Member Author

fedorov commented Jul 1, 2024

I select the base level to make sure that I am consistently taking the staining & embedding information from a defined instance. These attributes are supposed to be the same across all instances (at least VOLUME instances, I am not 100% sure about the LABEL instances) in a Series. If they are not, then it's a mistake. And to somehow be deterministic André and I decided to always use the base level.

@DanielaSchacherer can you please review the query and the results it is producing?

I think aggregating distinct values over all instances in a series is the right way to do it.

@fedorov fedorov marked this pull request as ready for review July 1, 2024 14:12
@fedorov
Copy link
Member Author

fedorov commented Jul 2, 2024

I will revisit the names for the newly added columns, and add separate columns for designator:code and codeMeaning, to be consistent with what we did in other similar columns already included.

@fedorov
Copy link
Member Author

fedorov commented Jul 3, 2024

@DanielaSchacherer how about we call the new columns as follows:

  • embeddingMedium_code_designator_value_str

  • embeddingMedium_CodeMeaning

  • tissueFixative_code_designator_value_str

  • tissueFixative_CodeMeaning

  • usingSubstance_code_designator_value_str

  • usingSubstance_CodeMeaning

the *_code_designator_value_str columns would have values of form DCM:111744, and *_CodeMeaning is self-explanatory.

This way we would be consistent with the conventions used for illuminationType, primaryAnatomicStructure etc, which I believe we all discussed and agreed upon earlier when we added the table to IDC BQ dataset.

@dclunie do you have any comments?

@DanielaSchacherer
Copy link

I know, it's a deviation from the convention, but what about
Staining_usingSubstance_code_designator_value_str
Staining_usingSubstance_CodeMeaning
instead of
usingSubstance_code_designator_value_str
usingSubstance_CodeMeaning
?

@dclunie
Copy link

dclunie commented Jul 4, 2024

Coupling the SpecimenPreparationProcedure (of SCT:127790008 "staining") with the "Using substance" value makes sense to me, because in future there could be other substances used for other purposes.

Also, note that there may be multiple values (and this is common, e.g., H&E uses two "Using substance" entries, one for the H and one for the E.

@DanielaSchacherer
Copy link

DanielaSchacherer commented Jul 5, 2024

yes, true. We should make sure that this is distinguishable from the cases where there is also multiple values per series, but because there are different values for the different instances instead of multiple (but the same) values for all instances. This is not currently not apparent from the query result, do you have an idea how we could achieve it?

@fedorov If we are decided about the naming, we could maybe close this pull request and make another one for the question above.

@fedorov
Copy link
Member Author

fedorov commented Jul 5, 2024

I will update the current PR to account for this feedback.

@fedorov
Copy link
Member Author

fedorov commented Jul 5, 2024

Also, note that there may be multiple values (and this is common, e.g., H&E uses two "Using substance" entries, one for the H and one for the E.

Yes, and this is already covered by the query, since it will aggregate all distinct values for each of the concepts selected. In this specific case, the result will look like the following (see the last 2 columns).

image

Similarly, for the HTAN multiplexed images, all of the stains encountered across the series instances will be captured.

image

Given this organization, we can select all series that were processed using any given stain or embedding.

@fedorov fedorov marked this pull request as draft July 5, 2024 20:01
@fedorov fedorov marked this pull request as ready for review July 5, 2024 20:14
@fedorov
Copy link
Member Author

fedorov commented Jul 5, 2024

@DanielaSchacherer please take a look at the query, run it in BQ studio to look at the results, and let me know if you have any concerns.

Copy link

@DanielaSchacherer DanielaSchacherer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the query and checked the results. It looked good for me so far.

  • One concern I have is that you can not deduce from the query's result, if you have multiple values for e.g. staining if they originate from different instances or from the same. Of course, in case of H&E stain people would know that they belong together, but that might not always be true. Right now, I am not sure though about a good solution.
  • Another thing which confuses me a little is that for most Series we don't have any information about embedding/tissue fixative. Is that expected?

@fedorov
Copy link
Member Author

fedorov commented Jul 9, 2024

One concern I have is that you can not deduce from the query's result, if you have multiple values for e.g. staining if they originate from different instances or from the same. Of course, in case of H&E stain people would know that they belong together, but that might not always be true. Right now, I am not sure though about a good solution.

This is expected, as we discussed today. Series-level index has the union of all stains across all resolution levels and instances within the series. Instance-level index in #31 keeps track of this at the instance level.

Another thing which confuses me a little is that for most Series we don't have any information about embedding/tissue fixative. Is that expected?

I was somewhat surprised by this as well. @dclunie is this expected?

@fedorov fedorov merged commit 743ab20 into main Jul 9, 2024
10 checks passed
@fedorov fedorov deleted the sm_index-expand branch July 9, 2024 13:33
@fedorov
Copy link
Member Author

fedorov commented Jul 9, 2024

Per @dclunie, fixation and embedding should be non-empty except for CPTAC and HTAN. I will look and report.

@fedorov
Copy link
Member Author

fedorov commented Jul 9, 2024

Only the following CPTAC and HTAN collections do not have embedding codes defined.

cptac_aml
cptac_brca
cptac_ccrcc
cptac_cm
cptac_coad
cptac_gbm
cptac_hnscc
cptac_lscc
cptac_luad
cptac_ov
cptac_pda
cptac_sar
cptac_ucec
htan_hms
htan_ohsu
htan_wustl

Many TCGA collections do not have fixative code defined, but I guess this is also expected, since some samples were frozen and I assume no fixative was applied.

cptac_aml
cptac_brca
cptac_ccrcc
cptac_cm
cptac_coad
cptac_gbm
cptac_hnscc
cptac_lscc
cptac_luad
cptac_ov
cptac_pda
cptac_sar
cptac_ucec
htan_hms
htan_ohsu
htan_wustl
tcga_acc
tcga_blca
tcga_brca
tcga_cesc
tcga_chol
tcga_coad
tcga_dlbc
tcga_esca
tcga_gbm
tcga_hnsc
tcga_kich
tcga_kirc
tcga_kirp
tcga_lgg
tcga_lihc
tcga_luad
tcga_lusc
tcga_meso
tcga_ov
tcga_paad
tcga_pcpg
tcga_prad
tcga_read
tcga_sarc
tcga_skcm
tcga_stad
tcga_tgct
tcga_thca
tcga_thym
tcga_ucec
tcga_ucs
tcga_uvm

@DanielaSchacherer
Copy link

While checking for myself, I saw that temp_table extracts collection_id, but it is not used further. Do we want to output it as well as part of the final result?

@fedorov
Copy link
Member Author

fedorov commented Jul 10, 2024

I don't know ... it's a tradeoff between redundancy and duplication of columns available in the main index and convenience of not having to do joins. I am not sure there is a perfect solution to this.

@DanielaSchacherer
Copy link

collection_id helps you to avoid joins?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants