TIMX 338 - Add init-job functionality #2

ghukill · 2024-08-26T14:36:43Z

Purpose and background context

The first step in starting a Transmogrifier A/B test is creating a "Job" which is a wrapper around subsequent "Runs" of the job. The job provides a working directory, later the Transmogrifier versions used, etc. This information is captured in a job.json file that lives in the Job's working directory.

The first function importable from abdiff.core has been created to support this: init_job(). It takes only a single argument job_name, and the rest is opinionated behavior from abdiff.core.utils, where it is intended that other functions that require knowing a job name, slug, or working directory can use those as well.

How can a reviewer manually see the effects of these changes?

Start an ipython shell

pipenv run ipython

Create a new job:

from abdiff.core import init_job

job_data = init_job("Fruits and Vegetables")

display(job_data)
#Out[3]: 
#{'job_name': 'Fruits and Vegetables',
# 'job_slug': 'fruits-and-vegetables',
# 'working_directory': 'output/fruits-and-vegetables'}

If no environemnt variables are set, you should see a new directory in the local output folder:

/ output
├── fruits-and-vegetables
│   └── job.json

This is basically the totality of this function's requirements. Any downstream jobs can use utils.get_job_slug_and_working_directory() to get a job's slug and working directory from the job name alone.

Delete this section if it isn't applicable to the PR.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-338

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced: The first step in an AB comparison is initializing a 'Job' that will be used for subsequent 'Runs' of the job. This requires setting up a working directory, start a 'job.json' file, and some supporting utilities. How this addresses that need: * Adds first function 'init_job' in abdiff.core * Adds first utility functions for job names, slug, and working directory Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-338

ghukill · 2024-08-26T14:51:32Z

abdiff/core/__init__.py

+__all__ = [
+    "init_job",
+]


I had figured we could just add functions here as they get created?

ghukill · 2024-08-26T14:52:41Z

abdiff/core/utils.py

+    return job_slug, Path(CONFIG.data_directory) / job_slug
+
+
+def update_or_create_job_json(job_name: str, new_job_data: dict) -> dict:


To anyone who works' on functionality that will update the job's job.json, my hope is that this utility function would be reusable for that purpose.

ehanson8

Looks good but a few comments about directory naming

ehanson8 · 2024-08-26T17:42:26Z

abdiff/config.py

+    def data_directory(self) -> str:
+        return self.DATA_DIRECTORY or "output"


Could this just be output_directory for simplicity?

@ehanson8 - I'm not 100% opposed, but but to pushback a bit I think parts ot the pipeline will also read from this directory as well. As in, it's not entirely just a destination for writing out files, but also some reading and interaction.

Full transparency, I used output/ as the default value for this configuration as it's already included in our .gitignore and to me is kind of a scratch / working directory for our CLIs.

But open to renaming either the property or the default root working directory if you or @jonavellecuerdo feel it's warranted.

Maybe pulling in your comment below @ehanson8: what if we named the property root_working_directory? and updated the env var to ROOT_WORKING_DIRECTORY?

Then we'd have the following throughout:

root_working_directory: the base folder for Jobs to get created in

job_working_directory: a Job specific folder, under the root working directory

That works for me, just looking for symmetry and alignment between env var, args, and app variables whenever possible!

Saw your comment after I wrote my input, but I like what you've proposed above -- root_working_directory and job_working_directory.

Thanks @ehanson8 and @jonavellecuerdo for the feedback. I agree it was a little awkward before. Working on some changes now and will re-request a review.

ehanson8 · 2024-08-26T17:43:21Z

abdiff/core/init_job.py

+    1. create a working directory for job
+    2. initialize a job.json file
+    """
+    job_slug, job_working_directory = get_job_slug_and_working_directory(job_name)


Building on the output_directory comment, to further minimize potential confusion, could this just be job_directory or working_directory consistently?

I agree with @ehanson8 on choosing one naming convention and keeping it consistent across functions.

I am leaning towards job_directory. I like that it is explicitly linking the directory to a job. Related to @ehanson8 's comment on output_directory: data_directory makes me think of a folder containing file extracts and transformed record JSON files, but this directory will contain more than that -- the structure of the directory is very specific to a job, containing job and run JSON files and dated 'run' folders in addition to the transformed records. 🤔

abdiff/core/init_job.py

coveralls · 2024-08-26T19:00:46Z

Pull Request Test Coverage Report for Build 10565589134

Details

49 of 49 (100.0%) changed or added relevant lines in 4 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 10560877145:	0.0%
Covered Lines:	77
Relevant Lines:	77

💛 - Coveralls

ghukill · 2024-08-26T19:09:19Z

@ehanson8, @jonavellecuerdo - I've renamed data_directory to root_working_directory throughout in this commit: b51e8fd.

While we're at it: I wonder if the default directory should be /tmp instead of ./output relative to the project directory?

At least in my IDE, I have to be careful that it doesn't attempt to scan and index that directory because it's part of the project.

Pushing another commit now that will change the default to /tmp. Will re-request review shortly.

Lastly, while I imagine we may need to tune some things as it comes together more, we really haven't discussed the potentially fairly large data this app will be saving locally. Probably only hundreds of megabytes... unless someone gets ambitious and tries to run on full Alma (20x files @ 1gb each). Counter-point, this is naturally pretty hard to do by accident given how slow the transform is for those files; it would be a potentilaly multi-day running job locally.

Somewhat unrelated to the default root working directory, but felt worth mentioning as we're thinking about files on local disk.

ghukill · 2024-08-26T19:16:34Z

While we're at it: I wonder if the default directory should be /tmp instead of ./output relative to the project directory?

At least in my IDE, I have to be careful that it doesn't attempt to scan and index that directory because it's part of the project.

Pushing another commit now that will change the default to /tmp. Will re-request review shortly.

Scratch that!

Going to create a ticket to think about file management over time, job footprints, etc. Seems conceivable we might even want to support the job working directory being in S3. For now, feels okay to have the root working directory be <repository/output.

ghukill · 2024-08-26T19:23:04Z

While we're at it: I wonder if the default directory should be /tmp instead of ./output relative to the project directory?
At least in my IDE, I have to be careful that it doesn't attempt to scan and index that directory because it's part of the project.
Pushing another commit now that will change the default to /tmp. Will re-request review shortly.

Scratch that!

Going to create a ticket to think about file management over time, job footprints, etc. Seems conceivable we might even want to support the job working directory being in S3. For now, feels okay to have the root working directory be <repository/output.

Ticket created: https://mitlibraries.atlassian.net/browse/TIMX-342

jonavellecuerdo · 2024-08-26T19:33:04Z

Rename data_directory to root_working_directory

@ghukill Curious what you mean by your IDE "attempt[ing] to scan and index" a directory? 🤔

ghukill · 2024-08-26T19:48:55Z

Rename data_directory to root_working_directory

@ghukill Curious what you mean by your IDE "attempt[ing] to scan and index" a directory? 🤔

My IDE (PyCharm) indexes all the files in the project. This is great for searching around code as it makes it quicker... but is less great when it finds 20gb of JSON files and starts to index them all for searching 😅. It's really not designed for "data" your app may use, just the code.

I can exclude a directory if I remember, and short-circuit this problem. But I think it's consistent with a more general concept that storing ephemeral, working data inside a project's sort of "runtime" environment is not a great idea. For this reason, and others.

Hoping this ticket, https://mitlibraries.atlassian.net/browse/TIMX-342, can become the locus of these discussions.

ghukill requested review from ehanson8 and jonavellecuerdo August 26, 2024 14:36

ghukill commented Aug 26, 2024

View reviewed changes

ehanson8 reviewed Aug 26, 2024

View reviewed changes

ghukill requested a review from ehanson8 August 26, 2024 18:35

Rename data_directory to root_working_directory

b51e8fd

jonavellecuerdo approved these changes Aug 26, 2024

View reviewed changes

ehanson8 approved these changes Aug 27, 2024

View reviewed changes

ghukill merged commit 3c92ab1 into main Aug 27, 2024

ghukill deleted the TIMX-338-init-job branch September 25, 2024 18:25

		return job_slug, Path(CONFIG.data_directory) / job_slug


		def update_or_create_job_json(job_name: str, new_job_data: dict) -> dict:

		def data_directory(self) -> str:
		return self.DATA_DIRECTORY or "output"

TIMX 338 - Add init-job functionality #2

TIMX 338 - Add init-job functionality #2

Uh oh!

Conversation

ghukill commented Aug 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ehanson8 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghukill Aug 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coveralls commented Aug 26, 2024

Pull Request Test Coverage Report for Build 10565589134

Details

💛 - Coveralls

Uh oh!

ghukill commented Aug 26, 2024

Uh oh!

ghukill commented Aug 26, 2024

Uh oh!

ghukill commented Aug 26, 2024

Uh oh!

jonavellecuerdo commented Aug 26, 2024

Uh oh!

ghukill commented Aug 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ghukill commented Aug 26, 2024 •

edited

Loading

ghukill Aug 26, 2024 •

edited

Loading