Skip to content

Cell Line Datasets Separated, Improved Error Handling for all Build Scripts#237

Merged
sgosline merged 33 commits intomainfrom
split_broad_sanger
Nov 12, 2024
Merged

Cell Line Datasets Separated, Improved Error Handling for all Build Scripts#237
sgosline merged 33 commits intomainfrom
split_broad_sanger

Conversation

@jjacobson95
Copy link
Copy Markdown
Collaborator

@jjacobson95 jjacobson95 commented Oct 21, 2024

Ready to merge.

Multiple Changes will be coming with this PR.

Summary:

@sgosline
Copy link
Copy Markdown
Member

Is there a way to generalize the validation script so we only need one for all datasets? It feels like this is redundant and increases the burden on anyone wanting to add a dataset. Since all the files have the same file name structure ([datasetname]_[datatype]), can we just iterate through a directory and validate each one according to the file name?

@jjacobson95
Copy link
Copy Markdown
Collaborator Author

Yes, that is a more maintainable approach, I'll make that change!

Likely will use a decent amount of hard coded file names to ensure that all expected files are present (as is currently done) rather than simple pattern matching.

@sgosline
Copy link
Copy Markdown
Member

I'd avoid hard-coded files. Perhaps use a config file that can be updated so the validation code itself remains the same.

@jjacobson95
Copy link
Copy Markdown
Collaborator Author

Good call, that sounds more elegant

@sgosline
Copy link
Copy Markdown
Member

sgosline commented Nov 2, 2024

Please merge #242 before this.

@jjacobson95
Copy link
Copy Markdown
Collaborator Author

Merged. I've got one bug that I'm working on (The broad_sanger data split is working in the individual dataset script, but not in the build_all script) and then I'll rebuild all of the data.

@jjacobson95
Copy link
Copy Markdown
Collaborator Author

Just a note, I'm running into consistent memory issues when splitting the broad_sanger data in docker on AWS. I'm reworking this script to reduce the memory usage.

@jjacobson95
Copy link
Copy Markdown
Collaborator Author

Resolves #227. This is the most notable update.
Resolves #233, #234, #236 (with a schema fix to allow "Any"), #238, and #240. These are all bug and schema updates.
-We never made this an issue, but this moves all of the schema checking scripts into a single unified script that runs schema checking in parallel. Old schema code has been removed.

Closes #239. Not because this was resolved, but because it is a non-issue.

@jjacobson95
Copy link
Copy Markdown
Collaborator Author

Ready to merge!

@sgosline
Copy link
Copy Markdown
Member

I reviewed this, and it looks solid!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants