A hands-on challenge covering core data engineering concepts using Python and SQL, with GitHub Codespaces integration.
In this challenge, you are working as a Data Engineer for a retail banking company. Your task is to build an ETL (Extract, Transform, Load) pipeline that processes the bank’s transaction data. The company collects various transaction types, customer ages, and balances, and they need a system to extract this data from raw files, clean and transform it, and load it into a database for analysis.
You will be working with a dataset containing 100 rows of customer IDs, ages, transaction types, and balances, which is provided in the repository as bank_transactions_dataset.csv. Your task is to implement the ETL pipeline to process this data.
- Fork this project to create your own copy of the repository.
- Use GitHub Codespaces:
  - Click the green Code button in your forked repository.
  - Select Codespaces and choose "Create codespace on main" to open your development environment.
- The repository includes a pre-configured `python.yml` file, which automatically sets up the Python environment in Codespaces.
  - Ensure the Python version is correct: check that `python.yml` specifies the Python version your project needs, and update it if necessary:

    ```yaml
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.8'  # Adjust this version based on your project's needs
    ```
- Once the Codespace is ready and the environment is set up, review the code in the `data_engineering/` folder to understand the structure.
- Implement the missing functions marked with TODO comments.
- Test your implementation by running the `main.py` file inside GitHub Codespaces.
The repository is organized as follows:

- `data_engineering/extract.py`: Implement data extraction from the provided CSV file.
- `data_engineering/transform.py`: Implement data transformation logic (data cleaning, feature engineering).
- `data_engineering/load.py`: Implement loading of the cleaned data into a SQLite database.
- `main.py`: Control the ETL pipeline and test your implementation.
Your tasks, file by file (minimal sketches of each piece follow this list):

- `extract.py`: Implement the `extract_data` function to read the data from `bank_transactions_dataset.csv`.
- `transform.py`: Implement the `transform_data` function to clean and transform the data (handle missing values, format changes, etc.).
- `load.py`: Implement the `load_data` function to load the transformed data into a SQLite database.
- `main.py`: Implement the pipeline flow (call the extract, transform, and load functions).
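As a starting point, here is a minimal sketch of `extract_data`, assuming pandas is available in the Codespace environment (the repo's actual starter code may differ):

```python
import pandas as pd


def extract_data(csv_path: str = "bank_transactions_dataset.csv") -> pd.DataFrame:
    """Read the raw transaction data from the provided CSV into a DataFrame."""
    df = pd.read_csv(csv_path)
    print(f"Extracted {len(df)} rows from {csv_path}")
    return df
```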
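One possible shape for `transform_data`; the column names (`customer_id`, `age`, `transaction_type`, `balance`) are assumptions based on the dataset description, so adjust them to match the real CSV header:

```python
import pandas as pd


def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and transform the raw data. Column names are illustrative."""
    df = df.copy()
    # Drop rows missing a customer identifier (assumed column name).
    df = df.dropna(subset=["customer_id"])
    # Coerce balances to floats, treating unparseable values as missing, then fill with 0.0.
    df["balance"] = pd.to_numeric(df["balance"], errors="coerce").fillna(0.0)
    # Coerce ages to numbers and drop rows where the age is unusable.
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df = df.dropna(subset=["age"])
    df["age"] = df["age"].astype(int)
    # Normalize transaction types to lowercase strings for consistency.
    df["transaction_type"] = df["transaction_type"].astype(str).str.lower()
    return df
```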
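For `load_data`, pandas' `DataFrame.to_sql` together with the standard-library `sqlite3` module keeps things short; the database file name and table name here are assumptions:

```python
import sqlite3

import pandas as pd


def load_data(df: pd.DataFrame, db_path: str = "bank_transactions.db") -> None:
    """Write the cleaned DataFrame into a SQLite table (names are illustrative)."""
    with sqlite3.connect(db_path) as conn:
        # Replace the table on each run so the pipeline is rerunnable.
        df.to_sql("transactions", conn, if_exists="replace", index=False)
    print(f"Loaded {len(df)} rows into {db_path}")
```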
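Finally, `main.py` just wires the three steps together, assuming the `data_engineering/` folder is an importable package exposing the functions above:

```python
from data_engineering.extract import extract_data
from data_engineering.load import load_data
from data_engineering.transform import transform_data


def main() -> None:
    """Run the full ETL pipeline end to end."""
    raw = extract_data()
    cleaned = transform_data(raw)
    load_data(cleaned)


if __name__ == "__main__":
    main()
```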
To recap, the overall workflow is:

- Fork the repository and set up your environment using GitHub Codespaces.
- `extract.py`: Implement the logic to extract data from `bank_transactions_dataset.csv`.
- `transform.py`: Implement the transformation logic (e.g., handle missing values, data normalization).
- `load.py`: Implement the loading logic to insert the cleaned data into a SQLite database.
- Run the code inside GitHub Codespaces to test your implementation.
A few tips:

- Make sure to handle the different data types properly (e.g., strings, integers, floats).
- Use Python's built-in `sqlite3` library to interact with the database.
- Ensure data quality checks are in place after the transformation step (see the sketch following this list).
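For the data quality checks mentioned above, a few lightweight assertions run between the transform and load steps go a long way. This sketch reuses the illustrative column names from the `transform_data` example:

```python
import pandas as pd


def check_quality(df: pd.DataFrame) -> None:
    """Raise an AssertionError if the transformed data violates basic expectations."""
    assert not df.empty, "transformed dataset is empty"
    assert df["customer_id"].notna().all(), "missing customer IDs remain"
    assert (df["age"] >= 0).all(), "negative ages found"
    assert pd.api.types.is_float_dtype(df["balance"]), "balance is not a float column"
```

Calling `check_quality(cleaned)` in `main.py` right before `load_data` surfaces problems before anything reaches the database.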
After completing the challenge, submit the link to your forked GitHub repository in the submission text box in the LMS.
Your submission will be evaluated on:

- Correct implementation of the ETL pipeline (extract, transform, load).
- Proper handling of missing or invalid data.
- Accurate data insertion into the SQLite database.
- Clean and readable code with appropriate comments and structure.
- Successful execution of the project within GitHub Codespaces.
Good luck, and happy coding!