Config class and S3 clients #6

ehanson8 · 2024-12-05T20:16:16Z

Purpose and background context

This code is largely copied over from wiley-deposits with some minor changes to align it with our current coding practices.

Note: Most of the line changes are from Pipfile.lock

How can a reviewer manually see the effects of these changes?

Not possible yet

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced: * The application needs a class to manage config values and a stream variable to store the application logs for an email sent at the end of each run. How this addresses that need: * Add Config class with corresponding unit tests and fixtures for the methods * Update cli.py with calls to Config methods and stream variable * Update README.md * Update pyproject.toml Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1108

Why these changes are being introduced: * An S3 client is needed for submission metadata and content files How this addresses that need: * Add S3Client class with corresponding unit tests and fixtures for the methods * Update dependencies Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1099

ehanson8 · 2024-12-05T20:21:39Z

dsc/config.py

+            message = f"Missing required environment variables: {', '.join(missing_vars)}"
+            raise OSError(message)
+
+    def configure_logger(


This is a combination of configure_logger from wiley-deposits and our template config code, there's some repetition and I have a gut feeling there's a better way to do this but didn't land on anything specific, curious about your thoughts

Here's a possible refactor, removing some redundancies:

def configure_logger( self, logger: logging.Logger, stream: StringIO, *, verbose: bool ) -> str: logging_format_base = "%(asctime)s %(levelname)s %(name)s.%(funcName)s()" logger.addHandler(logging.StreamHandler(stream)) if verbose: log_method, log_level = logger.debug, logging.DEBUG template = logging_format_base + " line %(lineno)d: %(message)s" for handler in logging.root.handlers: handler.addFilter(logging.Filter("dsc")) else: log_method, log_level = logger.info, logging.INFO template = logging_format_base + ": %(message)s" logger.setLevel(log_level) logging.basicConfig(format=template) logger.addHandler(logging.StreamHandler(stream)) log_method(f"{logging.getLevelName(logger.getEffectiveLevel())}") return ( f"Logger '{logger.name}' configured with level=" f"{logging.getLevelName(logger.getEffectiveLevel())}" )

At a glance, what makes this hard to refactor are the tests:

test_configure_logger_not_verbose

test_configure_logger_verbose

which attempt to read from the stream handler that is setup. That's what the log_method, log_level = do above, is get the appropriate logging method to use for log_method(f"{logging.getLevelName(logger.getEffectiveLevel())}") which satisfies these tests.

My feeling: good enough for now, but globally I think we could improve our logging setups. The presence of logging.basicConfig() is kind of a red flag unto itself. But, I think this is mostly consistent with other projects, so value in that.

For example: I'm not sure that returning a string is ideal. Perhaps this method could just log what it has done, and return None. But this would require updating the CLI, tests, etc. that are expecting a string response. And as mentioned, this is consistent with other apps.

Still need review this at a deeper level but figured I'd pass along a discussion @ghukill and I had on a previous PR I worked on re: excluding configure_sentry and configure_logger from the Config class:

From @ghukill :

I like having a Config class that I can put complex behavior on for static-esque values I want to access through the application

I'm indifferent to configure_logger and configure_sentry on there / they feel somewhat disconnected from the essence of the application, which I attribute to the Config class. Especially sentry. Logging is a little more difficult for me... but then again logging is always kind of difficult. \ Of course, if those methods (configure_sentry and configure_logger) are on it, they don't need to be called.... but if I'm not supposed to call configure_sentry from anywhere in the app, then maybe it doesn't belong on that global Config object?
This is 5:00pm talking, no much reasoning, but just my knee-jerk thinking.

Great refactor, thanks! I was hoping this would prompt a larger conversation about the template/config class because I think it needs work as well, maybe a future DataEng meeting topic?

I think a future DataEng topic could be good, for sure. But I think it could/should extend longer than 30 minute meeting w/ CB.

I have two somewhat conflicting feelings at the moment re: config and logging:

it's okay for projects to tweak our established approach as needed

it's okay if slightly awkward or sub-optimal insofar as it's consistent with other repos

I think it'd be easy for any one project to grind to a halt trying to get it perfect, when we'll always have tension with, "but others don't do it that way..."

Until we do take a long and thoughtful look at our config + logging approach for python projects, I'm pretty okay with allowing projects to try and strike a balance between feelings #1 and #2 above.

ghukill

Couple of comments and a suggestion for a docstring update, but none blocking! Looking like good foundation until the more interesting bits come.

ghukill · 2024-12-06T14:38:30Z

dsc/config.py

+            message = f"Missing required environment variables: {', '.join(missing_vars)}"
+            raise OSError(message)
+
+    def configure_logger(


Here's a possible refactor, removing some redundancies:

def configure_logger( self, logger: logging.Logger, stream: StringIO, *, verbose: bool ) -> str: logging_format_base = "%(asctime)s %(levelname)s %(name)s.%(funcName)s()" logger.addHandler(logging.StreamHandler(stream)) if verbose: log_method, log_level = logger.debug, logging.DEBUG template = logging_format_base + " line %(lineno)d: %(message)s" for handler in logging.root.handlers: handler.addFilter(logging.Filter("dsc")) else: log_method, log_level = logger.info, logging.INFO template = logging_format_base + ": %(message)s" logger.setLevel(log_level) logging.basicConfig(format=template) logger.addHandler(logging.StreamHandler(stream)) log_method(f"{logging.getLevelName(logger.getEffectiveLevel())}") return ( f"Logger '{logger.name}' configured with level=" f"{logging.getLevelName(logger.getEffectiveLevel())}" )

At a glance, what makes this hard to refactor are the tests:

test_configure_logger_not_verbose

test_configure_logger_verbose

which attempt to read from the stream handler that is setup. That's what the log_method, log_level = do above, is get the appropriate logging method to use for log_method(f"{logging.getLevelName(logger.getEffectiveLevel())}") which satisfies these tests.

My feeling: good enough for now, but globally I think we could improve our logging setups. The presence of logging.basicConfig() is kind of a red flag unto itself. But, I think this is mostly consistent with other projects, so value in that.

For example: I'm not sure that returning a string is ideal. Perhaps this method could just log what it has done, and return None. But this would require updating the CLI, tests, etc. that are expecting a string response. And as mentioned, this is consistent with other apps.

dsc/s3.py

jonavellecuerdo

A couple small change requests, but it's looking good! Will inspect at a deeper level via second pass.

dsc/s3.py

jonavellecuerdo · 2024-12-06T14:40:31Z

dsc/config.py

+            message = f"Missing required environment variables: {', '.join(missing_vars)}"
+            raise OSError(message)
+
+    def configure_logger(


Still need review this at a deeper level but figured I'd pass along a discussion @ghukill and I had on a previous PR I worked on re: excluding configure_sentry and configure_logger from the Config class:

From @ghukill :

I like having a Config class that I can put complex behavior on for static-esque values I want to access through the application

I'm indifferent to configure_logger and configure_sentry on there / they feel somewhat disconnected from the essence of the application, which I attribute to the Config class. Especially sentry. Logging is a little more difficult for me... but then again logging is always kind of difficult. \ Of course, if those methods (configure_sentry and configure_logger) are on it, they don't need to be called.... but if I'm not supposed to call configure_sentry from anywhere in the app, then maybe it doesn't belong on that global Config object?
This is 5:00pm talking, no much reasoning, but just my knee-jerk thinking.

dsc/s3.py

jonavellecuerdo · 2024-12-06T14:44:45Z

dsc/s3.py

+    def __init__(self) -> None:
+        self.client = client("s3")
+
+    def archive_file_with_new_key(


Curious whether archiving the metadata file will be required for workflows other than Wiley deposits? 🤔 If an archive/ folder will exist for every workflow bucket, I think the function name is fine as-is, otherwise, it might be good to rename the function so it is more universal?

I believe so, we probably want an archived/processed path in all of the workflows to prevent re-processing of files

dsc/s3.py

* Refactor Config.configure_logger method * Rename retrieve_file_type_from_bucket > get_files_iter and refactor method

coveralls · 2024-12-06T16:54:30Z

Pull Request Test Coverage Report for Build 12202770589

Details

64 of 68 (94.12%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-4.8%) to 95.238%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
dsc/config.py	35	37	94.59%
dsc/s3.py	22	24	91.67%

Totals
Change from base Build 12183247713:	-4.8%
Covered Lines:	80
Relevant Lines:	84

💛 - Coveralls

* Add Config class Why these changes are being introduced: * The application needs a class to manage config values and a stream variable to store the application logs for an email sent at the end of each run. How this addresses that need: * Add Config class with corresponding unit tests and fixtures for the methods * Update cli.py with calls to Config methods and stream variable * Update README.md * Update pyproject.toml Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1108 * Add S3 client Why these changes are being introduced: * An S3 client is needed for submission metadata and content files How this addresses that need: * Add S3Client class with corresponding unit tests and fixtures for the methods * Update dependencies Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1099 * Updates based on discussion in PR #6 * Refactor Config.configure_logger method * Rename retrieve_file_type_from_bucket > get_files_iter and refactor method

ehanson8 added 2 commits December 5, 2024 11:27

ehanson8 commented Dec 5, 2024

View reviewed changes

ehanson8 marked this pull request as ready for review December 5, 2024 20:43

ehanson8 requested review from ghukill and jonavellecuerdo December 5, 2024 20:43

ghukill approved these changes Dec 6, 2024

View reviewed changes

jonavellecuerdo requested changes Dec 6, 2024

View reviewed changes

Updates based on discussion in PR #6

9ac46a5

* Refactor Config.configure_logger method * Rename retrieve_file_type_from_bucket > get_files_iter and refactor method

ehanson8 requested a review from jonavellecuerdo December 6, 2024 16:55

jonavellecuerdo approved these changes Dec 6, 2024

View reviewed changes

ehanson8 merged commit 8eb29e2 into main Dec 6, 2024
2 checks passed

ehanson8 deleted the IN-1099-aws-clients branch December 6, 2024 17:55

Config class and S3 clients #6

Config class and S3 clients #6

Uh oh!

Conversation

ehanson8 commented Dec 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

Uh oh!

ehanson8 Dec 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghukill Dec 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonavellecuerdo Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

ehanson8 Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

ghukill Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

ghukill left a comment

Choose a reason for hiding this comment

Uh oh!

ghukill Dec 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jonavellecuerdo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jonavellecuerdo Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jonavellecuerdo Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

ehanson8 Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coveralls commented Dec 6, 2024

Pull Request Test Coverage Report for Build 12202770589

Details

💛 - Coveralls

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ehanson8 commented Dec 5, 2024 •

edited

Loading

ehanson8 Dec 5, 2024 •

edited

Loading

ghukill Dec 6, 2024 •

edited

Loading

ghukill Dec 6, 2024 •

edited

Loading