[BUG] BioTEA prepare cannot process (some) Agilent arrays #8

Open
MrHedmad opened this issue Jan 27, 2023 · 1 comment
@MrHedmad
Member

Describe the bug
Many Agilent arrays fail to be processed by BioTEA prepare. Some examples include:

  • GSE102238
  • GSE91035
  • GSE71729
  • GSE40098

To Reproduce
Steps to reproduce the behavior:

  1. Download the data for the GEO datasets listed above;
  2. Run BioTEA prepare against the data;
  3. BioTEA fails just after "reading input files..."

Desktop:

  • OS: Arch Linux
  • BioTEA Version (Run biotea info biotea): 1.1.0
  • Docker engine version (Run docker --version): N/A
  • BioTEA container version (if applicable): 1.0.4
@MrHedmad MrHedmad added the bug Something isn't working label Jan 27, 2023
@MrHedmad MrHedmad self-assigned this Jan 27, 2023
@MrHedmad MrHedmad added the critical This needs to be addressed ASAP label Jan 27, 2023
@MrHedmad
Member Author

MrHedmad commented Feb 3, 2023

The bug was triaged by @Feat-FeAR. In short, the read.maimages function needs to know which scanner produced the input files in order to choose the right column names to read. This is the source of the "colnames not found" error generated by the GSEs mentioned above.

This is not a trivial bug to solve: the user cannot say what scanner they used, GEO often does not hold this information (unless you open the input files and read the names manually), and some columns (like gIsWellOverBG, which we use later for filtering) are needed in the later parts of the scripts.
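
To make the "open the input files and read the names" step concrete, here is a minimal Python sketch (not BioTEA code). It assumes the input is a tab-delimited Agilent Feature Extraction text file whose column-header row starts with the field FEATURES. The function name find_header_columns and the sample lines are made up for illustration.

```python
from typing import Iterable, List, Optional


def find_header_columns(
    lines: Iterable[str], marker: str = "FEATURES"
) -> Optional[List[str]]:
    """Return the column names from the first tab-delimited row whose
    first field equals `marker`, or None if no such row exists.

    Assumption: the header row of the data table starts with "FEATURES".
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] == marker:
            return fields[1:]
    return None


# Toy input, invented for this example (not taken from a real GSM file).
sample = [
    "TYPE\ttext\tinteger",
    "FEPARAMS\tProtocol_Name\tScan_NumChannels",
    "DATA\tplaceholder\t1",
    "*",
    "FEATURES\tFeatureNum\tProbeName\tgMedianSignal\tgIsWellOverBG",
    "DATA\t1\tprobe_1\t120.5\t1",
]

columns = find_header_columns(sample)
print(columns)
```

Returning None when no header row is found lets the caller tell "not an array file at all" apart from "an array file with unexpected columns".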

A few band-aid fixes could be useful:

  • A better error message. However, the function can crash with the same error for other reasons, such as being given a completely different or even completely invalid file. Notably, for Agilent arrays, GPL files are typically bundled together with the GSM files in the GSExxxx_RAW.tar archive available from GEO. Should we write a regex to detect them and remove them from the file list passed to read.maimages()?
  • A brute-force approach: try every scanner that read.maimages supports and choose the first one that does not crash. In this case, we have to either manually add the columns that we need later on, change the downstream code to not run if the columns are missing, or do something else entirely.
  • Look up the valid column names for every supported chip and give a specific error message if the input does not match. Partial matching could give an even more specific message (e.g. "This looks like scanner A, but the cols a, b, and c are missing.")
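
As a rough illustration of the first and third ideas, here is a language-agnostic sketch in Python (not BioTEA code). It assumes platform files in the archive have names starting with "GPL" followed by digits. The profiles scanner_A and scanner_B and their column sets are placeholders, not the sources read.maimages actually supports, and the helper names are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Set

# Assumption: platform annotation files are named like "GPL1234_....txt.gz".
PLATFORM_FILE = re.compile(r"^GPL\d+", re.IGNORECASE)


def drop_platform_files(paths: Iterable[str]) -> List[str]:
    """Keep only the paths whose file name does not look like a GPL file."""
    return [p for p in paths if not PLATFORM_FILE.match(PurePosixPath(p).name)]


# Placeholder profiles: required column names per (hypothetical) scanner.
SCANNER_PROFILES: Dict[str, Set[str]] = {
    "scanner_A": {"gMedianSignal", "gBGMedianSignal", "gIsWellOverBG"},
    "scanner_B": {"gMeanSignal", "gBGMeanSignal", "gIsWellOverBG"},
}


def diagnose_columns(
    found: Iterable[str], profiles: Dict[str, Set[str]] = SCANNER_PROFILES
) -> str:
    """Describe how the columns found in a file match the known profiles."""
    found_set = set(found)
    best_name, best_missing, best_hits = None, set(), 0
    for name, required in profiles.items():
        missing = required - found_set
        if not missing:
            return f"Columns match {name}."
        hits = len(required & found_set)
        if hits > best_hits:
            # Remember the profile sharing the most columns with the file.
            best_name, best_missing, best_hits = name, missing, hits
    if best_name is None:
        return "No known scanner profile shares any column with this file."
    return (
        f"This looks like {best_name}, but these columns are missing: "
        f"{', '.join(sorted(best_missing))}."
    )


files = ["GSE_RAW/GSM1_sample.txt.gz", "GSE_RAW/GPL4133_annotations.txt.gz"]
kept = drop_platform_files(files)
message = diagnose_columns(["gMedianSignal", "gIsWellOverBG"])
print(kept)
print(message)
```

Picking the profile with the most shared columns is only one possible heuristic. Ties go to whichever profile is listed first, so the order of the profiles matters.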

@MrHedmad MrHedmad changed the title [BUG] BioTEA prepare cannot process Agilent arrays [BUG] BioTEA prepare cannot process (some) Agilent arrays Feb 3, 2023
@MrHedmad MrHedmad removed the critical This needs to be addressed ASAP label Feb 3, 2023