Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HRQB 11 - Data Warehouse client and SQLQueryExtractTask base task #16

Merged
merged 10 commits into from
May 16, 2024

Conversation

ghukill
Copy link
Collaborator

@ghukill ghukill commented May 10, 2024

Purpose and background context

This PR introduces class DWClient to assist with making queries to the Oracle Data Warehouse. The goal was to keep this as simple as possible, assisting with setting up a connection, supporting testing to a local SQLite database, and returning the results of a query as a pandas DataFrame.

Additionally, a new base task class was created SQLQueryExtractTask. This base class is designed to assist with Extract tasks that query for data from the warehouse, using the DWClient, and return a DataFrame for downstream tasks to work with. This leans into the requirement of PandasPickleTask to define a get_dataframe() method, which in turn this defines as an opinionated method to perform a SQL query.

The end result is a task that can look as simple as this:

class ExtractTable1AndTable2Data(SQLQueryExtractTask):
    pipeline = luigi.Parameter()
    stage = luigi.Parameter("Extract")

    @property
    def sql_query(self) -> str:
        return """
        select
           t1.column_a,
           t1.column_b,
           t2.column_c
        from my_table t1
        inner join my_other_table t2 on t1.column_a = t2.column_a
        where t2.column_d = 'foo'
        """

When this task runs, it will use the sql_query property defined, perform a query, and return the result as a DataFrame. The goal is to focus on writing the SQL queries themselves, and not the orchestration of executing queries or preparing the results for downstream tasks.

How can a reviewer manually see the effects of these changes?

An ipython shell is likely the easiest way to get a feel for the new base task type SQLQueryExtractTask and DWClient working together.

Set env vars:

WORKSPACE=dev
SENTRY_DSN=None
DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH
LUIGI_CONFIG_PATH=hrqb/luigi.cfg
QUICKBASE_API_TOKEN=abc123
QUICKBASE_APP_ID=def456

Ipython shell:

# import a testing Task class from fixtures
from tests.fixtures.tasks.extract import SQLExtractAnimalColors

# instantiate the Task with pipeline name and stage
task = SQLExtractAnimalColors(pipeline="Testing", stage="Extract")

# run get_dataframe() method
# this relies on the base class SQLQueryExtractTask to perform a SQL query
# (configured here to query a local SQLite database that has data)
# and returns a dataframe only, no files are written
task.get_dataframe()
# Out[3]: 
#   animal_id  color
# 0         42  green
# 1        101    red

# run actual run() method, and note the file created:
# output/Testing__Extract__SQLExtractAnimalColors.pickle
task.run()

# other confirmations that data was written identical
# to the PandasPickleTask this task class extends

task.path
# Out[4]: 'output/Testing__Extract__SQLExtractAnimalColors.pickle'

task.target
# Out[5]: <hrqb.base.target.PandasPickleTarget at 0x12de27e10>

task.target.exists()
# Out[6]: True

task.complete()
# Out[7]: True

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES; hypothetical connection to data warehouse, though credentials are not deployed yet

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

A primary source of data for this application to work with
is data coming from the MIT Oracle Data Warehouse.  A normalized
way of connecting and performing queries is needed.

How this addresses that need:
* Creates new DWClient to connect to, and query from, a remote
database.
* DWClient includes a method execute_query that uses pandas
to return a DataFrame of the results, leaning into the
DataFrame-first nature of the PandasPickleTasks.

Side effects of this change:
* Connection possible to external Data Warehouse

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-11
Why these changes are being introduced:
For testing purposes it can be difficult to test when
an attrs field gets a default from an env var at class
definition, as this will not allow monkeypatching.

How this addresses that need:
* Use factory pattern where the default value is set
each instantiation of the class via a lambda

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-13
Why these changes are being introduced:

Similar to base task QuickbaseUpsertTask, a very common extract task
will be querying the Oracle Data Warehouse for data.  To reduce boilerplate
code in each of these defined tasks, a base class will be helpful for
setting up some structure to follow.

How this addresses that need:
* New base task SQLQueryExtractTask
  * requires only a sql_query property, where the rest of querying and
writing pandas dataframe as task target is handled
  * optional SQL query parameters may be defined
  * optional DWClient instance may be defined (e.g. testing to
SQLite database)

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-11
"""Client to provide Oracle Data Warehouse connection and querying."""

connection_string: str = field(
factory=lambda: Config().DATA_WAREHOUSE_CONNECTION_STRING,
Copy link
Collaborator Author

@ghukill ghukill May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure if I commented on this in tests, but using a factory=lambda: ... pattern here is helpful for testing. This ensures that only when the DWClient is instantiated, does it pull the env var DATA_WAREHOUSE_CONNECTION_STRING if an explicit connection string is not passed.

If this had been default=Config().DATA_WAREHOUSE_CONNECTION_STRING, then monkey patching the env var at pytest time would not be sufficient, as this class is already defined with the default value for any future instances being the env var before that monkey patching.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Non-blocking] Maybe pull some of this information into the doc string or as a comment above the declaration of connection_string?

Why these changes are being introduced:

Some extract tasks that query the data warehouse will
have complex SQL queries.  Storing them in a dedicated
file will support syntax highlighting and testing of
those files directly, while keeping the task definitions
sipmler.

How this addresses that need:
* Add new property SQLQueryExtractTask.sql_file
* Required either sql_query OR sql_file defined

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-11
Copy link

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems solid, a few suggestions and questions

hrqb/base/task.py Show resolved Hide resolved
hrqb/utils/data_warehouse.py Outdated Show resolved Hide resolved
hrqb/utils/data_warehouse.py Outdated Show resolved Hide resolved
hrqb/utils/data_warehouse.py Outdated Show resolved Hide resolved
tests/test_base_task.py Outdated Show resolved Hide resolved
@ghukill ghukill requested a review from ehanson8 May 13, 2024 17:28
Copy link

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but still recommend updating that docstring. Edit: don't know why this didn't link properly 16ad2a7#r141964334

@ghukill ghukill force-pushed the HRQB-11-dw-connection-and-helpers branch from 16ad2a7 to 82fe684 Compare May 13, 2024 18:22
Copy link

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just have a few small suggestions / questions. 🤓

def init_engine(self) -> None:
"""Instantiate a SQLAlchemy engine if not already configured and set.

User provided engine parameters will override self.default_engine_parameters.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be updated: self.default_engine_parameters -> self.engine_parameters?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Updated, and even better, can remove that altogether; I think setting engine_parameters explicitly is pretty transparent in how it works now.

factory=lambda: Config().DATA_WAREHOUSE_CONNECTION_STRING,
repr=False,
)
engine_parameters: dict = field(factory=lambda: {"thick_mode": True})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would field(default={"thick_mode": True}) be sufficient in this case as it doesn't call any methods to set the value? 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Given the DATA_WAREHOUSE_CONNECTIONS_STRING env var, these are the only SQLAlchemy engine parameters required (at least so far in testing).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to this comment and another, updated the docstring for this class to explain the fields:

"""Client to provide Oracle Data Warehouse connection and querying.

Fields:
    connection_string: str
        - full SQLAlchemy connection string, e.g.
            - oracle: oracle+oracledb://user1:pass1@example.org:1521/ABCDE
            - sqlite: sqlite:///:memory:
        - defaults to env var DATA_WAREHOUSE_CONNECTION_STRING, loaded from env vars
        at time of DWClient initialization
    engine_parameters: dict
        - optional dictionary of SQLAlchemy engine parameters
    engine: Engine
        - set via self.init_engine()
"""

"""Client to provide Oracle Data Warehouse connection and querying."""

connection_string: str = field(
factory=lambda: Config().DATA_WAREHOUSE_CONNECTION_STRING,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Non-blocking] Maybe pull some of this information into the doc string or as a comment above the declaration of connection_string?

if self.sql_query:
query = self.sql_query
elif self.sql_file:
with open(self.sql_file) as f:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where would this SQL files live and in what cases is a SQL file preferred over a string? 🤔

Hmm, what do you think of an update like:

class SQLQueryExtractTask(PandasPickleTask):
    """Base class for Tasks that make SQL queries for data."""

   ...

    @property
    def sql_query(self) -> str | None:
        """SQL query from string or loaded from file"""
        if self.sql_query:
            query = self.sql_query
        elif self.sql_file:
            with open(self.sql_file) as f:
                query = f.read()
        else:
            message = "Property sql_query or sql_file must be set."
            raise AttributeError(message)
        return query

    @property
    def sql_query_string(self) -> str | None:
        """SQL query from string to execute."""
        return None

    @property
    def sql_file(self) -> str | None:
        """SQL query loaded from file to execute."""
        return None

    @property
    def sql_query_parameters(self) -> dict:
        """Optional parameters to include with SQL query."""
        return {}

    def get_dataframe(self) -> pd.DataFrame:
        return self.dwclient.execute_query(
            self.sql_query,
            params=self.sql_query_parameters,
        )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question and refactor suggestion are well received. In fact, given some work since this PR on actually scoping out some extract tasks, it seems pretty clear that most (if not all) SQL extract tasks will use a file.

However, being able to easily override that with an explicit string is handy for testing. I think there is a middleground here:

  • self.sql_query returns a string, defaulting to reading self.sql_file
  • if neither are overridden, throws an exception that sql_file must be defined OR sql_query overridden to provide a string vs reading a file

Going to make that change and push a commit.

Why these changes are being introduced:

During code review, and some work on sketching actual extract tasks,
it is clear that SQL extract tasks will most likely use SQL files.
However, it remains helpful for testing and other dev work to define
the query directly in the class.

How this addresses that need:
* Default SQLQueryExtractTask.sql_query to return string, and get
this string from SQLQueryExtractTask.sql_file
* Allows for overridding SQLQueryExtractTask.sql_query with an
explicit string

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-11
@ghukill
Copy link
Collaborator Author

ghukill commented May 15, 2024

@jonavellecuerdo - added a couple of commits addressing your feedback.

Copy link

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill Thank you for taking the time to make this update. Just one follow-up question!

hrqb/base/task.py Show resolved Hide resolved
jonavellecuerdo

This comment was marked as outdated.

@ghukill ghukill merged commit 86810bb into main May 16, 2024
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants