Skip to content

Conversation

@ghukill
Copy link
Contributor

@ghukill ghukill commented Aug 26, 2024

Purpose and background context

The first step in starting a Transmogrifier A/B test is creating a "Job" which is a wrapper around subsequent "Runs" of the job. The job provides a working directory, later the Transmogrifier versions used, etc. This information is captured in a job.json file that lives in the Job's working directory.

The first function importable from abdiff.core has been created to support this: init_job(). It takes only a single argument job_name, and the rest is opinionated behavior from abdiff.core.utils, where it is intended that other functions that require knowing a job name, slug, or working directory can use those as well.

How can a reviewer manually see the effects of these changes?

Start an ipython shell

pipenv run ipython

Create a new job:

from abdiff.core import init_job

job_data = init_job("Fruits and Vegetables")

display(job_data)
#Out[3]: 
#{'job_name': 'Fruits and Vegetables',
# 'job_slug': 'fruits-and-vegetables',
# 'working_directory': 'output/fruits-and-vegetables'}

If no environemnt variables are set, you should see a new directory in the local output folder:

/ output
├── fruits-and-vegetables
│   └── job.json

This is basically the totality of this function's requirements. Any downstream jobs can use utils.get_job_slug_and_working_directory() to get a job's slug and working directory from the job name alone.

Delete this section if it isn't applicable to the PR.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

The first step in an AB comparison is initializing a 'Job' that
will be used for subsequent 'Runs' of the job.  This requires
setting up a working directory, start a 'job.json' file, and
some supporting utilities.

How this addresses that need:
* Adds first function 'init_job' in abdiff.core
* Adds first utility functions for job names, slug, and
working directory

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-338
Comment on lines +8 to +10
__all__ = [
"init_job",
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had figured we could just add functions here as they get created?

return job_slug, Path(CONFIG.data_directory) / job_slug


def update_or_create_job_json(job_name: str, new_job_data: dict) -> dict:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To anyone who works' on functionality that will update the job's job.json, my hope is that this utility function would be reusable for that purpose.

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but a few comments about directory naming

abdiff/config.py Outdated
Comment on lines 18 to 19
def data_directory(self) -> str:
return self.DATA_DIRECTORY or "output"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this just be output_directory for simplicity?

Copy link
Contributor Author

@ghukill ghukill Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehanson8 - I'm not 100% opposed, but but to pushback a bit I think parts ot the pipeline will also read from this directory as well. As in, it's not entirely just a destination for writing out files, but also some reading and interaction.

Full transparency, I used output/ as the default value for this configuration as it's already included in our .gitignore and to me is kind of a scratch / working directory for our CLIs.

But open to renaming either the property or the default root working directory if you or @jonavellecuerdo feel it's warranted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe pulling in your comment below @ehanson8: what if we named the property root_working_directory? and updated the env var to ROOT_WORKING_DIRECTORY?

Then we'd have the following throughout:

  • root_working_directory: the base folder for Jobs to get created in
  • job_working_directory: a Job specific folder, under the root working directory

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me, just looking for symmetry and alignment between env var, args, and app variables whenever possible!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saw your comment after I wrote my input, but I like what you've proposed above -- root_working_directory and job_working_directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ehanson8 and @jonavellecuerdo for the feedback. I agree it was a little awkward before. Working on some changes now and will re-request a review.

1. create a working directory for job
2. initialize a job.json file
"""
job_slug, job_working_directory = get_job_slug_and_working_directory(job_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building on the output_directory comment, to further minimize potential confusion, could this just be job_directory or working_directory consistently?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @ehanson8 on choosing one naming convention and keeping it consistent across functions.

I am leaning towards job_directory. I like that it is explicitly linking the directory to a job. Related to @ehanson8 's comment on output_directory: data_directory makes me think of a folder containing file extracts and transformed record JSON files, but this directory will contain more than that -- the structure of the directory is very specific to a job, containing job and run JSON files and dated 'run' folders in addition to the transformed records. 🤔

@ghukill ghukill requested a review from ehanson8 August 26, 2024 18:35
@coveralls
Copy link

Pull Request Test Coverage Report for Build 10565589134

Details

  • 49 of 49 (100.0%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 10560877145: 0.0%
Covered Lines: 77
Relevant Lines: 77

💛 - Coveralls

@ghukill
Copy link
Contributor Author

ghukill commented Aug 26, 2024

@ehanson8, @jonavellecuerdo - I've renamed data_directory to root_working_directory throughout in this commit: b51e8fd.

While we're at it: I wonder if the default directory should be /tmp instead of ./output relative to the project directory?

At least in my IDE, I have to be careful that it doesn't attempt to scan and index that directory because it's part of the project.

Pushing another commit now that will change the default to /tmp. Will re-request review shortly.

Lastly, while I imagine we may need to tune some things as it comes together more, we really haven't discussed the potentially fairly large data this app will be saving locally. Probably only hundreds of megabytes... unless someone gets ambitious and tries to run on full Alma (20x files @ 1gb each). Counter-point, this is naturally pretty hard to do by accident given how slow the transform is for those files; it would be a potentilaly multi-day running job locally.

Somewhat unrelated to the default root working directory, but felt worth mentioning as we're thinking about files on local disk.

@ghukill
Copy link
Contributor Author

ghukill commented Aug 26, 2024

While we're at it: I wonder if the default directory should be /tmp instead of ./output relative to the project directory?

At least in my IDE, I have to be careful that it doesn't attempt to scan and index that directory because it's part of the project.

Pushing another commit now that will change the default to /tmp. Will re-request review shortly.

Scratch that!

Going to create a ticket to think about file management over time, job footprints, etc. Seems conceivable we might even want to support the job working directory being in S3. For now, feels okay to have the root working directory be <repository/output.

@ghukill
Copy link
Contributor Author

ghukill commented Aug 26, 2024

While we're at it: I wonder if the default directory should be /tmp instead of ./output relative to the project directory?
At least in my IDE, I have to be careful that it doesn't attempt to scan and index that directory because it's part of the project.
Pushing another commit now that will change the default to /tmp. Will re-request review shortly.

Scratch that!

Going to create a ticket to think about file management over time, job footprints, etc. Seems conceivable we might even want to support the job working directory being in S3. For now, feels okay to have the root working directory be <repository/output.

Ticket created: https://mitlibraries.atlassian.net/browse/TIMX-342

@jonavellecuerdo
Copy link
Contributor

Rename data_directory to root_working_directory

@ghukill Curious what you mean by your IDE "attempt[ing] to scan and index" a directory? 🤔

@ghukill
Copy link
Contributor Author

ghukill commented Aug 26, 2024

Rename data_directory to root_working_directory

@ghukill Curious what you mean by your IDE "attempt[ing] to scan and index" a directory? 🤔

My IDE (PyCharm) indexes all the files in the project. This is great for searching around code as it makes it quicker... but is less great when it finds 20gb of JSON files and starts to index them all for searching 😅. It's really not designed for "data" your app may use, just the code.

I can exclude a directory if I remember, and short-circuit this problem. But I think it's consistent with a more general concept that storing ephemeral, working data inside a project's sort of "runtime" environment is not a great idea. For this reason, and others.

Hoping this ticket, https://mitlibraries.atlassian.net/browse/TIMX-342, can become the locus of these discussions.

@ghukill ghukill merged commit 3c92ab1 into main Aug 27, 2024
@ghukill ghukill deleted the TIMX-338-init-job branch September 25, 2024 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants