HRQB 11 - Data Warehouse client and SQLQueryExtractTask base task #16

ghukill · 2024-05-10T20:06:42Z

Purpose and background context

This PR introduces the DWClient class to assist with making queries to the Oracle Data Warehouse. The goal was to keep this as simple as possible: setting up a connection, supporting testing against a local SQLite database, and returning the results of a query as a pandas DataFrame.

Additionally, a new base task class, SQLQueryExtractTask, was created. This base class is designed to assist with Extract tasks that query the warehouse using DWClient and return a DataFrame for downstream tasks to work with. It leans into the requirement that PandasPickleTask subclasses define a get_dataframe() method, which this class implements as an opinionated method that performs a SQL query.

The end result is a task that can look as simple as this:

class ExtractTable1AndTable2Data(SQLQueryExtractTask):
    pipeline = luigi.Parameter()
    stage = luigi.Parameter("Extract")

    @property
    def sql_query(self) -> str:
        return """
        select
           t1.column_a,
           t1.column_b,
           t2.column_c
        from my_table t1
        inner join my_other_table t2 on t1.column_a = t2.column_a
        where t2.column_d = 'foo'
        """

When this task runs, it will use the sql_query property defined, perform a query, and return the result as a DataFrame. The goal is to focus on writing the SQL queries themselves, and not the orchestration of executing queries or preparing the results for downstream tasks.
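A minimal sketch of the pattern described above, not the actual hrqb implementation. It assumes a hypothetical base class (`SketchSQLExtractTask`) that owns query execution while subclasses supply only `sql_query`. To stay dependency-free it runs against an in-memory SQLite database via the standard library `sqlite3` module and returns a list of dicts as a stand-in for the pandas DataFrame the real task returns; the class, method, and table names are illustrative.

```python
import sqlite3
from abc import ABC, abstractmethod


class SketchSQLExtractTask(ABC):
    """Hypothetical stand-in for a SQL extract base task (not the hrqb class)."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.connection = connection

    @property
    @abstractmethod
    def sql_query(self) -> str:
        """SQL query that subclasses must provide."""

    def get_rows(self) -> list[dict]:
        """Run sql_query and return rows as dicts (a DataFrame stand-in)."""
        cursor = self.connection.execute(self.sql_query)
        columns = [description[0] for description in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]


class ExtractAnimalColors(SketchSQLExtractTask):
    """Subclass that only defines the SQL to run."""

    @property
    def sql_query(self) -> str:
        return "select animal_id, color from animal_color order by animal_id"


# Build a small in-memory database to query against.
connection = sqlite3.connect(":memory:")
connection.execute("create table animal_color (animal_id integer, color text)")
connection.executemany(
    "insert into animal_color values (?, ?)", [(42, "green"), (101, "red")]
)

rows = ExtractAnimalColors(connection).get_rows()
print(rows)
```

The subclass contains nothing but SQL; connection handling and result shaping live in the base class, which is the division of labor the PR description aims for.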

How can a reviewer manually see the effects of these changes?

An ipython shell is likely the easiest way to get a feel for the new base task type SQLQueryExtractTask and DWClient working together.

Set env vars:

WORKSPACE=dev
SENTRY_DSN=None
DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH
LUIGI_CONFIG_PATH=hrqb/luigi.cfg
QUICKBASE_API_TOKEN=abc123
QUICKBASE_APP_ID=def456

Ipython shell:

# import a testing Task class from fixtures
from tests.fixtures.tasks.extract import SQLExtractAnimalColors

# instantiate the Task with pipeline name and stage
task = SQLExtractAnimalColors(pipeline="Testing", stage="Extract")

# run get_dataframe() method
# this relies on the base class SQLQueryExtractTask to perform a SQL query
# (configured here to query a local SQLite database that has data)
# and returns a dataframe only, no files are written
task.get_dataframe()
# Out[3]: 
#   animal_id  color
# 0         42  green
# 1        101    red

# run actual run() method, and note the file created:
# output/Testing__Extract__SQLExtractAnimalColors.pickle
task.run()

# other confirmations that the data was written the same way
# as by the PandasPickleTask that this task class extends

task.path
# Out[4]: 'output/Testing__Extract__SQLExtractAnimalColors.pickle'

task.target
# Out[5]: <hrqb.base.target.PandasPickleTarget at 0x12de27e10>

task.target.exists()
# Out[6]: True

task.complete()
# Out[7]: True

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES; hypothetical connection to data warehouse, though credentials are not deployed yet

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/HRQB-11

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced:
A primary source of data for this application to work with is data coming from the MIT Oracle Data Warehouse. A normalized way of connecting and performing queries is needed.

How this addresses that need:
* Creates new DWClient to connect to, and query from, a remote database.
* DWClient includes a method execute_query that uses pandas to return a DataFrame of the results, leaning into the DataFrame-first nature of the PandasPickleTasks.

Side effects of this change:
* Connection possible to external Data Warehouse

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-11

Why these changes are being introduced:
For testing purposes it can be difficult to test when an attrs field gets a default from an env var at class definition, as this will not allow monkeypatching.

How this addresses that need:
* Use factory pattern where the default value is set each instantiation of the class via a lambda

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-13

Why these changes are being introduced:
Similar to base task QuickbaseUpsertTask, a very common extract task will be querying the Oracle Data Warehouse for data. To reduce boilerplate code in each of these defined tasks, a base class will be helpful for setting up some structure to follow.

How this addresses that need:
* New base task SQLQueryExtractTask
    * requires only a sql_query property, where the rest of querying and writing pandas dataframe as task target is handled
    * optional SQL query parameters may be defined
    * optional DWClient instance may be defined (e.g. testing to SQLite database)

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-11

ghukill · 2024-05-10T20:19:30Z

hrqb/utils/data_warehouse.py

+    """Client to provide Oracle Data Warehouse connection and querying."""
+
+    connection_string: str = field(
+        factory=lambda: Config().DATA_WAREHOUSE_CONNECTION_STRING,


Unsure if I commented on this in tests, but using a factory=lambda: ... pattern here is helpful for testing. It ensures that the env var DATA_WAREHOUSE_CONNECTION_STRING is read only when the DWClient is instantiated, and only if an explicit connection string is not passed.

If this had been default=Config().DATA_WAREHOUSE_CONNECTION_STRING, then monkeypatching the env var at pytest time would not be sufficient, because the class would already be defined, with the default for all future instances fixed to the env var's value from before the monkeypatching.
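The timing difference can be sketched with the standard library alone. This example uses `dataclasses` as a stand-in for attrs (`default_factory` plays the role of attrs' `factory`) and a hypothetical env var name, `SKETCH_CONNECTION_STRING`; it illustrates the issue and is not the DWClient code.

```python
import os
from dataclasses import dataclass, field

os.environ["SKETCH_CONNECTION_STRING"] = "sqlite:///original.db"


@dataclass
class EagerClient:
    # Evaluated once, when the class body runs.
    connection_string: str = os.environ["SKETCH_CONNECTION_STRING"]


@dataclass
class LazyClient:
    # Evaluated on every instantiation.
    connection_string: str = field(
        default_factory=lambda: os.environ["SKETCH_CONNECTION_STRING"]
    )


# Simulate a test patching the env var after the classes were defined.
os.environ["SKETCH_CONNECTION_STRING"] = "sqlite:///:memory:"

eager = EagerClient()
lazy = LazyClient()
print(eager.connection_string)  # still the value captured at class definition
print(lazy.connection_string)  # reflects the patched value
```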

[Non-blocking] Maybe pull some of this information into the doc string or as a comment above the declaration of connection_string?

Why these changes are being introduced:
Some extract tasks that query the data warehouse will have complex SQL queries. Storing them in a dedicated file will support syntax highlighting and testing of those files directly, while keeping the task definitions simpler.

How this addresses that need:
* Add new property SQLQueryExtractTask.sql_file
* Requires either sql_query OR sql_file to be defined

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-11

ehanson8

Seems solid, a few suggestions and questions

hrqb/base/task.py

hrqb/utils/data_warehouse.py

tests/test_base_task.py

ehanson8

Looks good but still recommend updating that docstring. Edit: don't know why this didn't link properly 16ad2a7#r141964334

jonavellecuerdo

Looking good! Just have a few small suggestions / questions. 🤓

jonavellecuerdo · 2024-05-15T17:53:42Z

hrqb/utils/data_warehouse.py

+    def init_engine(self) -> None:
+        """Instantiate a SQLAlchemy engine if not already configured and set.
+
+        User provided engine parameters will override self.default_engine_parameters.


Should this be updated: self.default_engine_parameters -> self.engine_parameters?

Good catch! Updated, and even better, can remove that altogether; I think setting engine_parameters explicitly is pretty transparent in how it works now.

jonavellecuerdo · 2024-05-15T17:56:57Z

hrqb/utils/data_warehouse.py

+        factory=lambda: Config().DATA_WAREHOUSE_CONNECTION_STRING,
+        repr=False,
+    )
+    engine_parameters: dict = field(factory=lambda: {"thick_mode": True})


Would field(default={"thick_mode": True}) be sufficient in this case as it doesn't call any methods to set the value? 🤔

Yep! Given the DATA_WAREHOUSE_CONNECTION_STRING env var, these are the only SQLAlchemy engine parameters required (at least so far in testing).

Related to this comment and another, updated the docstring for this class to explain the fields:

"""Client to provide Oracle Data Warehouse connection and querying.

Fields:
    connection_string: str
        - full SQLAlchemy connection string, e.g.
            - oracle: oracle+oracledb://user1:pass1@example.org:1521/ABCDE
            - sqlite: sqlite:///:memory:
        - defaults to env var DATA_WAREHOUSE_CONNECTION_STRING, loaded from env vars
          at time of DWClient initialization
    engine_parameters: dict
        - optional dictionary of SQLAlchemy engine parameters
    engine: Engine
        - set via self.init_engine()
"""
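One more consideration for the default= versus factory= question: a dict is mutable, and a per-instance factory is the usual way to avoid sharing one dict across instances. The sketch below uses standard library `dataclasses` as a stand-in for attrs (`default_factory` corresponds to `factory`). Note the stand-in is stricter than attrs: `dataclasses` rejects a mutable default outright, whereas attrs, as far as its documentation describes, accepts the value and recommends a factory for mutable types. The class names are hypothetical.

```python
from dataclasses import dataclass, field

# dataclasses refuses a mutable default at class-definition time.
try:

    @dataclass
    class SharedDefaultClient:
        engine_parameters: dict = {"thick_mode": True}

    mutable_default_rejected = False
except ValueError:
    mutable_default_rejected = True


@dataclass
class FactoryClient:
    # A new dict is built for every instance.
    engine_parameters: dict = field(default_factory=lambda: {"thick_mode": True})


first = FactoryClient()
second = FactoryClient()
first.engine_parameters["thick_mode"] = False

print(mutable_default_rejected)
print(second.engine_parameters)  # unaffected by the change made to `first`
```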

jonavellecuerdo · 2024-05-15T18:24:20Z

hrqb/utils/data_warehouse.py

+    """Client to provide Oracle Data Warehouse connection and querying."""
+
+    connection_string: str = field(
+        factory=lambda: Config().DATA_WAREHOUSE_CONNECTION_STRING,


[Non-blocking] Maybe pull some of this information into the doc string or as a comment above the declaration of connection_string?

jonavellecuerdo · 2024-05-15T18:31:41Z

hrqb/base/task.py

+        if self.sql_query:
+            query = self.sql_query
+        elif self.sql_file:
+            with open(self.sql_file) as f:


Where would these SQL files live and in what cases is a SQL file preferred over a string? 🤔

Hmm, what do you think of an update like:

class SQLQueryExtractTask(PandasPickleTask):
    """Base class for Tasks that make SQL queries for data."""

    ...

    @property
    def sql_query(self) -> str | None:
        """SQL query from string or loaded from file"""
        # prefer an explicit query string, otherwise fall back to the SQL file
        if self.sql_query_string:
            query = self.sql_query_string
        elif self.sql_file:
            with open(self.sql_file) as f:
                query = f.read()
        else:
            message = "Property sql_query_string or sql_file must be set."
            raise AttributeError(message)
        return query

    @property
    def sql_query_string(self) -> str | None:
        """SQL query from string to execute."""
        return None

    @property
    def sql_file(self) -> str | None:
        """SQL query loaded from file to execute."""
        return None

    @property
    def sql_query_parameters(self) -> dict:
        """Optional parameters to include with SQL query."""
        return {}

    def get_dataframe(self) -> pd.DataFrame:
        return self.dwclient.execute_query(
            self.sql_query,
            params=self.sql_query_parameters,
        )

The question and refactor suggestion are well received. In fact, given some work since this PR on actually scoping out some extract tasks, it seems pretty clear that most (if not all) SQL extract tasks will use a file.

However, being able to easily override that with an explicit string is handy for testing. I think there is a middleground here:

self.sql_query returns a string, defaulting to reading self.sql_file

if neither are overridden, throws an exception that sql_file must be defined OR sql_query overridden to provide a string vs reading a file

Going to make that change and push a commit.
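A rough sketch of the middle ground described above, assuming a hypothetical base class (`SketchSQLQueryTask`) rather than the code that was actually pushed: `sql_query` reads `sql_file` by default, a subclass can override `sql_query` with an explicit string, and an error is raised if neither is provided. The exception type and message are illustrative.

```python
import tempfile
from pathlib import Path
from typing import Optional


class SketchSQLQueryTask:
    """Hypothetical base class: sql_query defaults to the contents of sql_file."""

    @property
    def sql_file(self) -> Optional[str]:
        """Path to a SQL file; subclasses may override."""
        return None

    @property
    def sql_query(self) -> str:
        """SQL string, read from sql_file unless a subclass overrides this."""
        if self.sql_file is None:
            message = "Property sql_file must be set or sql_query overridden."
            raise AttributeError(message)
        return Path(self.sql_file).read_text()


class FileBackedTask(SketchSQLQueryTask):
    """Typical case: the query lives in a dedicated SQL file."""

    def __init__(self, path: str) -> None:
        self._path = path

    @property
    def sql_file(self) -> Optional[str]:
        return self._path


class InlineTask(SketchSQLQueryTask):
    """Testing case: the query is given directly as a string."""

    @property
    def sql_query(self) -> str:
        return "select 1"


with tempfile.TemporaryDirectory() as tmp_dir:
    sql_path = Path(tmp_dir) / "animal_colors.sql"
    sql_path.write_text("select animal_id, color from animal_color")
    file_query = FileBackedTask(str(sql_path)).sql_query

inline_query = InlineTask().sql_query

# With neither property overridden, accessing sql_query raises.
try:
    SketchSQLQueryTask().sql_query
    neither_error = None
except AttributeError as exc:
    neither_error = str(exc)
```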

Why these changes are being introduced:
During code review, and some work on sketching actual extract tasks, it is clear that SQL extract tasks will most likely use SQL files. However, it remains helpful for testing and other dev work to define the query directly in the class.

How this addresses that need:
* Default SQLQueryExtractTask.sql_query to return string, and get this string from SQLQueryExtractTask.sql_file
* Allows for overriding SQLQueryExtractTask.sql_query with an explicit string

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-11

ghukill · 2024-05-15T20:01:50Z

@jonavellecuerdo - added a couple of commits addressing your feedback.

jonavellecuerdo

@ghukill Thank you for taking the time to make this update. Just one follow-up question!

hrqb/base/task.py

ghukill added 6 commits May 10, 2024 15:20

Test utils functions

4558d32

sqlalchemy dependency and linting updates

ee18bd9

update README with new env var

23f3e0a

ghukill requested review from ehanson8 and jonavellecuerdo May 10, 2024 20:09

ghukill commented May 10, 2024

ehanson8 reviewed May 13, 2024

ghukill requested a review from ehanson8 May 13, 2024 17:28

ehanson8 approved these changes May 13, 2024

Code review updates

82fe684

ghukill force-pushed the HRQB-11-dw-connection-and-helpers branch from 16ad2a7 to 82fe684 Compare May 13, 2024 18:22

jonavellecuerdo reviewed May 15, 2024

ghukill added 2 commits May 15, 2024 15:57

Additional DWClient docstrings

b32ef9f

ghukill requested a review from jonavellecuerdo May 15, 2024 20:01

jonavellecuerdo reviewed May 16, 2024

ghukill requested a review from jonavellecuerdo May 16, 2024 13:23

jonavellecuerdo approved these changes May 16, 2024

ghukill merged commit 86810bb into main May 16, 2024
2 of 3 checks passed
