-
Notifications
You must be signed in to change notification settings - Fork 0
USE 137 - support JSONLines input #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
USE 137 - support JSONLines input #19
Conversation
embeddings/cli.py
Outdated
| # read input records from TIMDEX dataset (default) | ||
| # or a JSONLines file | ||
| if input_jsonl: | ||
| with ( | ||
| smart_open.open(input_jsonl, "r") as file_obj, # type: ignore[no-untyped-call] | ||
| jsonlines.Reader(file_obj) as reader, | ||
| ): | ||
| timdex_records = iter(list(reader)) | ||
|
|
||
| else: | ||
| if not dataset_location or not run_id: | ||
| raise click.UsageError( | ||
| "Both '--dataset-location' and '--run-id' are required arguments " | ||
| "when reading input records from the TIMDEX dataset." | ||
| ) | ||
|
|
||
| # init TIMDEXDataset | ||
| timdex_dataset = TIMDEXDataset(dataset_location) | ||
|
|
||
| # query TIMDEX dataset for an iterator of records | ||
| timdex_records = timdex_dataset.read_dicts_iter( | ||
| columns=[ | ||
| "timdex_record_id", | ||
| "run_id", | ||
| "run_record_offset", | ||
| "transformed_record", | ||
| ], | ||
| run_id=run_id, | ||
| where=f"""run_record_offset >= {run_record_offset}""", | ||
| limit=record_limit, | ||
| action="index", | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At one point, I think @ehanson8 floated the idea of refactoring this out to a utility function. I considered that again here now that we handle two different possible inputs.
For the time being, I'd propose to keep here as-is, because there are no other known use cases for these inputs getting handled or used. Until then, it feels reasonable that this CLI handles the input + output handling.
Why these changes are being introduced: Initially, the CLI command create-embeddings only supported reading input records from the TIMDEX dataset via TDA. While this is likely the way we'll get input records, supporting a JSONLines file as input is helpful for testing. How this addresses that need: * Adds a new --input-jsonl argument that reads a JSONLines file and uses those rows as input for creating embeddings. * Args --dataset-location and --run-id are required when --input-jsonl is not set. Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-137
7216c9e to
b898882
Compare
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works as expected! ✨
NOTE: this PR builds on #18. Once the other PR is complete, this PR will get pointed to
mainand merged.Purpose and background context
Small PR to support a JSONLines files as input for the CLI command
create-embeddings.While normally the CLI will read directly from the TIMDEX dataset via TDA, it can be optionally invoked to read records from as JSONLines file that simulates the columns/fields that would come from TDA. This is helpful for testing and will be helpful for ironing out the parallel processing work.
How can a reviewer manually see the effects of these changes?
Set env vars:
If not already done, ensure model is downloaded locally:
Create embeddings with entirely local input + output files:
Note: S3 uris are supported for both
--input-jsonland--ouput-jsonlviasmart_open.Includes new or updated dependencies?
YES:
smart_open[s3]Changes expectations for external applications?
NO
What are the relevant tickets?
Code review