-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HRQB 11 - Data Warehouse client and SQLQueryExtractTask base task #16
Conversation
Why these changes are being introduced: A primary source of data for this application to work with is data coming from the MIT Oracle Data Warehouse. A normalized way of connecting and performing queries is needed. How this addresses that need: * Creates new DWClient to connect to, and query from, a remote database. * DWClient includes a method execute_query that uses pandas to return a DataFrame of the results, leaning into the DataFrame-first nature of the PandasPickleTasks. Side effects of this change: * Connection possible to external Data Warehouse Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/HRQB-11
Why these changes are being introduced: For testing purposes it can be difficult to test when an attrs field gets a default from an env var at class definition, as this will not allow monkeypatching. How this addresses that need: * Use factory pattern where the default value is set each instantiation of the class via a lambda Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/HRQB-13
Why these changes are being introduced: Similar to base task QuickbaseUpsertTask, a very common extract task will be querying the Oracle Data Warehouse for data. To reduce boilerplate code in each of these defined tasks, a base class will be helpful for setting up some structure to follow. How this addresses that need: * New base task SQLQueryExtractTask * requires only a sql_query property, where the rest of querying and writing pandas dataframe as task target is handled * optional SQL query parameters may be defined * optional DWClient instance may be defined (e.g. testing to SQLite database) Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/HRQB-11
"""Client to provide Oracle Data Warehouse connection and querying.""" | ||
|
||
connection_string: str = field( | ||
factory=lambda: Config().DATA_WAREHOUSE_CONNECTION_STRING, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure if I commented on this in tests, but using a factory=lambda: ...
pattern here is helpful for testing. This ensures that only when the DWClient
is instantiated, does it pull the env var DATA_WAREHOUSE_CONNECTION_STRING
if an explicit connection string is not passed.
If this had been default=Config().DATA_WAREHOUSE_CONNECTION_STRING
, then monkey patching the env var at pytest time would not be sufficient, as this class is already defined with the default value for any future instances being the env var before that monkey patching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Non-blocking] Maybe pull some of this information into the doc string or as a comment above the declaration of connection_string
?
Why these changes are being introduced: Some extract tasks that query the data warehouse will have complex SQL queries. Storing them in a dedicated file will support syntax highlighting and testing of those files directly, while keeping the task definitions sipmler. How this addresses that need: * Add new property SQLQueryExtractTask.sql_file * Required either sql_query OR sql_file defined Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/HRQB-11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems solid, a few suggestions and questions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but still recommend updating that docstring. Edit: don't know why this didn't link properly 16ad2a7#r141964334
16ad2a7
to
82fe684
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Just have a few small suggestions / questions. 🤓
hrqb/utils/data_warehouse.py
Outdated
def init_engine(self) -> None: | ||
"""Instantiate a SQLAlchemy engine if not already configured and set. | ||
|
||
User provided engine parameters will override self.default_engine_parameters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be updated: self.default_engine_parameters
-> self.engine_parameters
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Updated, and even better, can remove that altogether; I think setting engine_parameters
explicitly is pretty transparent in how it works now.
factory=lambda: Config().DATA_WAREHOUSE_CONNECTION_STRING, | ||
repr=False, | ||
) | ||
engine_parameters: dict = field(factory=lambda: {"thick_mode": True}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would field(default={"thick_mode": True})
be sufficient in this case as it doesn't call any methods to set the value? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Given the DATA_WAREHOUSE_CONNECTIONS_STRING
env var, these are the only SQLAlchemy engine parameters required (at least so far in testing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to this comment and another, updated the docstring for this class to explain the fields:
"""Client to provide Oracle Data Warehouse connection and querying.
Fields:
connection_string: str
- full SQLAlchemy connection string, e.g.
- oracle: oracle+oracledb://user1:pass1@example.org:1521/ABCDE
- sqlite: sqlite:///:memory:
- defaults to env var DATA_WAREHOUSE_CONNECTION_STRING, loaded from env vars
at time of DWClient initialization
engine_parameters: dict
- optional dictionary of SQLAlchemy engine parameters
engine: Engine
- set via self.init_engine()
"""
"""Client to provide Oracle Data Warehouse connection and querying.""" | ||
|
||
connection_string: str = field( | ||
factory=lambda: Config().DATA_WAREHOUSE_CONNECTION_STRING, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Non-blocking] Maybe pull some of this information into the doc string or as a comment above the declaration of connection_string
?
hrqb/base/task.py
Outdated
if self.sql_query: | ||
query = self.sql_query | ||
elif self.sql_file: | ||
with open(self.sql_file) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where would this SQL files live and in what cases is a SQL file preferred over a string? 🤔
Hmm, what do you think of an update like:
class SQLQueryExtractTask(PandasPickleTask):
"""Base class for Tasks that make SQL queries for data."""
...
@property
def sql_query(self) -> str | None:
"""SQL query from string or loaded from file"""
if self.sql_query:
query = self.sql_query
elif self.sql_file:
with open(self.sql_file) as f:
query = f.read()
else:
message = "Property sql_query or sql_file must be set."
raise AttributeError(message)
return query
@property
def sql_query_string(self) -> str | None:
"""SQL query from string to execute."""
return None
@property
def sql_file(self) -> str | None:
"""SQL query loaded from file to execute."""
return None
@property
def sql_query_parameters(self) -> dict:
"""Optional parameters to include with SQL query."""
return {}
def get_dataframe(self) -> pd.DataFrame:
return self.dwclient.execute_query(
self.sql_query,
params=self.sql_query_parameters,
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question and refactor suggestion are well received. In fact, given some work since this PR on actually scoping out some extract tasks, it seems pretty clear that most (if not all) SQL extract tasks will use a file.
However, being able to easily override that with an explicit string is handy for testing. I think there is a middleground here:
self.sql_query
returns a string, defaulting to readingself.sql_file
- if neither are overridden, throws an exception that
sql_file
must be defined ORsql_query
overridden to provide a string vs reading a file
Going to make that change and push a commit.
Why these changes are being introduced: During code review, and some work on sketching actual extract tasks, it is clear that SQL extract tasks will most likely use SQL files. However, it remains helpful for testing and other dev work to define the query directly in the class. How this addresses that need: * Default SQLQueryExtractTask.sql_query to return string, and get this string from SQLQueryExtractTask.sql_file * Allows for overridding SQLQueryExtractTask.sql_query with an explicit string Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/HRQB-11
@jonavellecuerdo - added a couple of commits addressing your feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghukill Thank you for taking the time to make this update. Just one follow-up question!
Purpose and background context
This PR introduces class
DWClient
to assist with making queries to the Oracle Data Warehouse. The goal was to keep this as simple as possible, assisting with setting up a connection, supporting testing to a local SQLite database, and returning the results of a query as a pandas DataFrame.Additionally, a new base task class was created
SQLQueryExtractTask
. This base class is designed to assist with Extract tasks that query for data from the warehouse, using theDWClient
, and return a DataFrame for downstream tasks to work with. This leans into the requirement ofPandasPickleTask
to define aget_dataframe()
method, which in turn this defines as an opinionated method to perform a SQL query.The end result is a task that can look as simple as this:
When this task runs, it will use the
sql_query
property defined, perform a query, and return the result as a DataFrame. The goal is to focus on writing the SQL queries themselves, and not the orchestration of executing queries or preparing the results for downstream tasks.How can a reviewer manually see the effects of these changes?
An ipython shell is likely the easiest way to get a feel for the new base task type
SQLQueryExtractTask
andDWClient
working together.Set env vars:
Ipython shell:
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES; hypothetical connection to data warehouse, though credentials are not deployed yet
What are the relevant tickets?
Developer
Code Reviewer(s)