This repo is a collection of Python classes that help extract data from a source (a database, a file, a data lake, or other) and apply the minimal transformations necessary to load it into your data warehouse.
- Install ASDF to keep versions consistent
- Install Python 3.11.4
    asdf plugin add python
    asdf install python 3.11.4
- Install Poetry 1.5.1
    asdf plugin add poetry https://github.com/asdf-community/asdf-poetry.git
    asdf install poetry 1.5.1
- In the source repository run:
    poetry install
    poetry run pre-commit install
- Copy .env.example to .env in the 'tests' folder and fill in the correct value for each variable
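Once the .env file is in place, the tests need some way to read it. The README does not say how the repo does this, so the following is only a minimal, stdlib-only sketch of one possibility. The helper names `load_env_file` and `apply_env` are hypothetical and not part of the repo.

```python
import os
from pathlib import Path


def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines (hypothetical helper, not part of the repo)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: dict) -> None:
    """Export parsed values without overwriting variables that are already set."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Calling `apply_env(load_env_file("tests/.env"))` before the tests run would make the values available through `os.environ`. Using `setdefault` means a variable already exported in the shell takes precedence over the file.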
The classes do nothing on their own. To see the code in action, run one of the pipelines in the folder ./src/pipelines
There are two main folders at the root of the project:
- src
- Here we have all the code
- tests
- Here, we have all the tests, configuration files, fixtures, and anything else necessary to run tests on the code.
Inside the src folder, we have four main folders:
- extract
- Here, we have the definition of the extract classes and the declaration of the interface
- load
- transform
- pipeline
To simplify the import statements in your tests, every 'test_*.py' file is placed in the root of the tests folder. Any other definition needed to run the tests is placed in its respective subfolder.
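As an illustration of that layout, a test file in the tests root might look like the sketch below. It assumes pytest-style test functions, which the README does not confirm. The file name and the `InMemoryExtract` class are hypothetical; in the repo the class under test would be imported from src rather than defined in the test file.

```python
# tests/test_in_memory_extract.py (hypothetical file name)


class InMemoryExtract:
    """Stand-in for a real extract class; defined here only to keep the sketch self-contained."""

    def __init__(self, rows):
        self._rows = list(rows)

    def extract(self):
        # Return a copy so callers cannot mutate the internal state.
        return list(self._rows)


def test_extract_returns_all_rows():
    source = [{"id": 1}, {"id": 2}]
    assert InMemoryExtract(source).extract() == source


def test_extract_returns_a_copy():
    extractor = InMemoryExtract([{"id": 1}])
    extractor.extract().append({"id": 99})
    assert len(extractor.extract()) == 1
```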
The main objective when defining each of the base classes is that your pipeline can be defined as:
# Pick one class for each step; the module names below are placeholders
from ..extract.choose_one_extract_class import ExtractClass
from ..transform.choose_one_transform_class import TransformClass
from ..load.choose_one_load_class import LoadClass

# The **kwargs depend on the class you choose
extract_obj = ExtractClass(**kwargs)
transform_obj = TransformClass(**kwargs)
load_obj = LoadClass(**kwargs)

# Run the steps in order: extract, transform, load
extracted_data = extract_obj.extract(**kwargs)
transformed_data = transform_obj.transform(extracted_data, **kwargs)
load_obj.load(transformed_data, **kwargs)
Not every pipeline needs a transformation before the load; in that case the extracted data can be passed straight to the load step.
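To make the pattern concrete, here is a small runnable sketch of a pipeline with and without the transform step. Every class in it (`ListExtract`, `UppercaseNames`, `ListLoad`) and the `run_pipeline` helper are hypothetical toy stand-ins, not classes from this repo, and they skip the interfaces for brevity.

```python
class ListExtract:
    """Hypothetical extract step: reads rows from an in-memory list."""

    def __init__(self, rows):
        self._rows = rows

    def extract(self):
        return list(self._rows)


class UppercaseNames:
    """Hypothetical transform step: upper-cases the 'name' field of each row."""

    def transform(self, rows):
        return [{**row, "name": row["name"].upper()} for row in rows]


class ListLoad:
    """Hypothetical load step: appends rows to a list standing in for a warehouse table."""

    def __init__(self, target):
        self._target = target

    def load(self, rows):
        self._target.extend(rows)


def run_pipeline(extract_obj, load_obj, transform_obj=None):
    """Run extract -> (optional) transform -> load."""
    data = extract_obj.extract()
    if transform_obj is not None:
        data = transform_obj.transform(data)
    load_obj.load(data)


warehouse = []
run_pipeline(ListExtract([{"name": "ada"}]), ListLoad(warehouse), UppercaseNames())
run_pipeline(ListExtract([{"name": "grace"}]), ListLoad(warehouse))  # no transform step
print(warehouse)  # [{'name': 'ADA'}, {'name': 'grace'}]
```

Making the transform argument optional is one way to express that the step can be skipped; a pass-through transform class would work equally well.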
- When creating a new class, inherit from the interface for that step. For example, a class that extracts data must inherit from the extract_interface
- Implement all the methods defined in the interface, plus any other methods your class needs
- Create unit tests for your class
- Open a PR
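The first two steps above can be sketched as follows. This assumes the interface is an abstract base class with a single `extract` method, which the README does not confirm; the real extract_interface may differ. `ExtractInterface`, `CsvTextExtract`, and `IncompleteExtract` are hypothetical names used only for illustration.

```python
import csv
import io
from abc import ABC, abstractmethod


class ExtractInterface(ABC):
    """Assumed shape of the extract interface; the real one may differ."""

    @abstractmethod
    def extract(self, **kwargs):
        """Return the extracted data."""


class CsvTextExtract(ExtractInterface):
    """Hypothetical extract class that parses CSV text held in memory."""

    def __init__(self, text):
        self._text = text

    def extract(self, **kwargs):
        # The delimiter is an example of a class-specific keyword argument.
        delimiter = kwargs.get("delimiter", ",")
        reader = csv.DictReader(io.StringIO(self._text), delimiter=delimiter)
        return [dict(row) for row in reader]


class IncompleteExtract(ExtractInterface):
    """Does not implement extract(), so it cannot be instantiated."""


rows = CsvTextExtract("id,name\n1,ada\n").extract()
print(rows)  # [{'id': '1', 'name': 'ada'}]

try:
    IncompleteExtract()
except TypeError as error:
    print(f"rejected: {error}")
```

With an abstract base class, forgetting a required method fails as soon as the class is instantiated rather than later, in the middle of a pipeline run.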