TIMX 338 - Add init-job functionality #2
Conversation
Why these changes are being introduced:

The first step in an AB comparison is initializing a "Job" that will be used for subsequent "Runs" of the job. This requires setting up a working directory, starting a `job.json` file, and some supporting utilities.

How this addresses that need:
* Adds first function `init_job` in `abdiff.core`
* Adds first utility functions for job names, slugs, and working directories

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-338
```python
__all__ = [
    "init_job",
]
```
I had figured we could just add functions here as they get created?
```python
    return job_slug, Path(CONFIG.data_directory) / job_slug
```

```python
def update_or_create_job_json(job_name: str, new_job_data: dict) -> dict:
```
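Only the signature of this helper appears in the diff above; a minimal sketch of the read-merge-write behavior it implies might look like this (the body, the root directory, and the slug logic are assumptions, not the actual abdiff implementation):

```python
import json
from pathlib import Path

ROOT = Path("output")  # assumed default root working directory


def update_or_create_job_json(job_name: str, new_job_data: dict) -> dict:
    """Merge new_job_data into the job's job.json, creating the file if absent."""
    job_dir = ROOT / job_name.lower().replace(" ", "-")  # assumed slug logic
    job_dir.mkdir(parents=True, exist_ok=True)
    job_json = job_dir / "job.json"
    # Read existing data (if any), merge in the new keys, and write back.
    data = json.loads(job_json.read_text()) if job_json.exists() else {}
    data.update(new_job_data)
    job_json.write_text(json.dumps(data, indent=2))
    return data
```

Because it merges rather than overwrites, later steps that record run metadata could reuse it without clobbering earlier keys.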
To anyone who works on functionality that will update the job's job.json: my hope is that this utility function will be reusable for that purpose.
ehanson8 left a comment
Looks good but a few comments about directory naming
abdiff/config.py
Outdated
```python
def data_directory(self) -> str:
    return self.DATA_DIRECTORY or "output"
```
Could this just be output_directory for simplicity?
@ehanson8 - I'm not 100% opposed, but to push back a bit: I think parts of the pipeline will also read from this directory. As in, it's not entirely just a destination for writing out files; there is also some reading and interaction.
Full transparency, I used output/ as the default value for this configuration as it's already included in our .gitignore and to me is kind of a scratch / working directory for our CLIs.
But open to renaming either the property or the default root working directory if you or @jonavellecuerdo feel it's warranted.
Maybe pulling in your comment below @ehanson8: what if we named the property root_working_directory? and updated the env var to ROOT_WORKING_DIRECTORY?
Then we'd have the following throughout:
* `root_working_directory`: the base folder for Jobs to get created in
* `job_working_directory`: a Job-specific folder, under the root working directory
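The proposed rename could look something like the following sketch (a hypothetical stand-in for the actual abdiff `Config` class, shown only to illustrate the env var / property symmetry being discussed):

```python
import os


class Config:
    """Hypothetical sketch; the real abdiff Config may differ."""

    @property
    def root_working_directory(self) -> str:
        # The ROOT_WORKING_DIRECTORY env var overrides the default
        # scratch folder, keeping env var and property names aligned.
        return os.environ.get("ROOT_WORKING_DIRECTORY") or "output"
```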
That works for me, just looking for symmetry and alignment between env var, args, and app variables whenever possible!
Saw your comment after I wrote my input, but I like what you've proposed above -- root_working_directory and job_working_directory.
Thanks @ehanson8 and @jonavellecuerdo for the feedback. I agree it was a little awkward before. Working on some changes now and will re-request a review.
```python
    1. create a working directory for job
    2. initialize a job.json file
    """
    job_slug, job_working_directory = get_job_slug_and_working_directory(job_name)
```
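Based on the docstring steps quoted in the diff, a hedged sketch of what `init_job` might do (the slug logic, root directory, and `job.json` contents here are assumptions; only the two steps and the utility call appear in the diff):

```python
import json
from pathlib import Path


def get_job_slug_and_working_directory(job_name: str) -> tuple[str, Path]:
    """Assumed behavior: slugify the name and join it to an 'output' root."""
    slug = job_name.lower().replace(" ", "-")
    return slug, Path("output") / slug


def init_job(job_name: str) -> dict:
    """Initialize a Job.

    1. create a working directory for job
    2. initialize a job.json file
    """
    job_slug, job_working_directory = get_job_slug_and_working_directory(job_name)
    job_working_directory.mkdir(parents=True, exist_ok=True)
    job_data = {"job_name": job_name, "job_slug": job_slug}
    (job_working_directory / "job.json").write_text(json.dumps(job_data, indent=2))
    return job_data
```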
Building on the output_directory comment, to further minimize potential confusion, could this just be job_directory or working_directory consistently?
I agree with @ehanson8 on choosing one naming convention and keeping it consistent across functions.
I am leaning towards job_directory. I like that it explicitly links the directory to a job. Related to @ehanson8's comment on output_directory: data_directory makes me think of a folder containing file extracts and transformed record JSON files, but this directory will contain more than that -- the structure of the directory is very specific to a job, containing job and run JSON files and dated 'run' folders in addition to the transformed records. 🤔
Pull Request Test Coverage Report for Build 10565589134 - Details
💛 - Coveralls
@ehanson8, @jonavellecuerdo - I've renamed the property per the discussion above.

While we're at it: I wonder if the default directory should change as well. At least in my IDE, I have to be careful that it doesn't attempt to scan and index that directory because it's part of the project. Pushing another commit now that will change the default.

Lastly, while I imagine we may need to tune some things as it comes together more, we really haven't discussed the potentially fairly large data this app will be saving locally. Probably only hundreds of megabytes... unless someone gets ambitious and tries to run on full Alma (20x files @ 1gb each). Counter-point: this is naturally pretty hard to do by accident given how slow the transform is for those files; it would be a potentially multi-day job running locally. Somewhat unrelated to the default root working directory, but felt worth mentioning as we're thinking about files on local disk.
Scratch that! Going to create a ticket to think about file management over time, job footprints, etc. Seems conceivable we might even want to support the job working directory being in S3. For now, feels okay to leave the root working directory as-is.
Ticket created: https://mitlibraries.atlassian.net/browse/TIMX-342
@ghukill Curious what you mean by your IDE "attempt[ing] to scan and index" a directory? 🤔
My IDE (PyCharm) indexes all the files in the project. This is great for searching around code, as it makes it quicker... but is less great when it finds 20gb of JSON files and starts to index them all for searching 😅. It's really not designed for "data" your app may use, just the code.

I can exclude a directory if I remember, and short-circuit this problem. But I think it's consistent with a more general concept that storing ephemeral, working data inside a project's sort of "runtime" environment is not a great idea. For this reason, and others.

Hoping this ticket, https://mitlibraries.atlassian.net/browse/TIMX-342, can become the locus of these discussions.
Purpose and background context
The first step in starting a Transmogrifier A/B test is creating a "Job", which is a wrapper around subsequent "Runs" of the job. The job provides a working directory and, later, the Transmogrifier versions used, etc. This information is captured in a `job.json` file that lives in the Job's working directory.

The first function importable from `abdiff.core` has been created to support this: `init_job()`. It takes only a single argument, `job_name`, and the rest is opinionated behavior from `abdiff.core.utils`, where it is intended that other functions that require knowing a job name, slug, or working directory can use those utilities as well.

How can a reviewer manually see the effects of these changes?
1. Start an ipython shell
2. Create a new job
3. If no environment variables are set, you should see a new directory in the local `output` folder

This is basically the totality of this function's requirements. Any downstream jobs can use `utils.get_job_slug_and_working_directory()` to get a job's slug and working directory from the job name alone.
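The downstream pattern described above might look like the following (a sketch with an assumed slug implementation, since the real utility lives in `abdiff.core.utils` and is not shown here):

```python
from pathlib import Path


def get_job_slug_and_working_directory(job_name: str) -> tuple[str, Path]:
    """Assumed behavior: slugify the name and join it to the root directory."""
    slug = job_name.lower().replace(" ", "-")
    return slug, Path("output") / slug


# A downstream step needs only the job name to locate the job's files:
slug, working_dir = get_job_slug_and_working_directory("My First Job")
assert slug == "my-first-job"
assert working_dir == Path("output") / "my-first-job"
```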
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)