Data Engineering: standard Extract, Transform, Load (ETL) practices
A data warehouse is a large, centralized repository for storing and managing an organization's data from various sources. Its purpose is to provide a single source of truth for all data in an organization, enabling easy analysis and reporting.
Terraform is an open-source infrastructure-as-code (IaC) tool developed by HashiCorp. It allows developers to manage and provision infrastructure resources such as virtual machines, networks, and storage using code.
Terraform uses a declarative language to define the desired state of infrastructure resources, allowing developers to easily create, modify, and destroy infrastructure resources using version-controlled configuration files. This enables teams to automate infrastructure provisioning and ensure consistency across environments.
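As a minimal sketch of that declarative style (the provider, resource, and bucket name below are hypothetical placeholders, not taken from this repo), a Terraform configuration describing a single storage bucket might look like:

```hcl
# Declarative desired state: Terraform creates, updates, or destroys
# resources until reality matches this description.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Hypothetical bucket for staging raw ETL extracts.
resource "aws_s3_bucket" "etl_staging" {
  bucket = "example-etl-staging-bucket"
}
```

Running `terraform plan` previews the changes needed to reach this state, and `terraform apply` makes them, which is what allows the same version-controlled file to reproduce identical infrastructure across environments.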
This repository experiments with several data engineering practices for preparing data from different sources into formats that are easily and readily available for descriptive and predictive analysis, using:
- SQL
- Python3
- Pandas
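The ETL flow these tools support can be sketched in a few lines of pandas; the column names and an in-memory CSV string are illustrative stand-ins for a real source file:

```python
import io
import sqlite3

import pandas as pd

# Extract: read raw records (a CSV string stands in for a source file).
raw = io.StringIO(
    "user_id,amount,ts\n"
    "1,10.5,2021-01-01\n"
    "1,4.0,2021-01-02\n"
    "2,7.25,2021-01-01\n"
)
df = pd.read_csv(raw, parse_dates=["ts"])

# Transform: aggregate per-user totals for descriptive analysis.
totals = df.groupby("user_id", as_index=False)["amount"].sum()

# Load: write the transformed table into a database.
conn = sqlite3.connect(":memory:")
totals.to_sql("user_totals", conn, index=False)

print(conn.execute(
    "SELECT user_id, amount FROM user_totals ORDER BY user_id"
).fetchall())  # [(1, 14.5), (2, 7.25)]
```

The same extract/transform/load shape applies whether the target is SQLite, PostgreSQL, or Cassandra; only the load step's connection changes.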
This repo contains three folders, each of which analyzes a different data modeling scheme.
The relational DB section covers the management of structured, relational database systems, using SQL (Structured Query Language) for querying and maintenance. Here, the PostgreSQL engine is used for data modeling operations, which include: table creation, joins, normalization, denormalization, schemas, and warehousing.
Requirements: 'python3', 'postgresql', 'sql', 'pandas', 'numpy' and 'json'.
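The kind of modeling this section covers — normalized tables joined back together at query time — can be sketched with Python's built-in sqlite3 module as a runnable stand-in for PostgreSQL (the table and column names below are illustrative, not taken from this repo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized schema: artists are factored out of songs into their own
# table, so each artist's name is stored exactly once.
cur.execute("CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE songs (
    song_id INTEGER PRIMARY KEY,
    title TEXT,
    artist_id INTEGER REFERENCES artists(artist_id))""")

cur.executemany("INSERT INTO artists VALUES (?, ?)",
                [(1, "Nina Simone"), (2, "Miles Davis")])
cur.executemany("INSERT INTO songs VALUES (?, ?, ?)",
                [(10, "Feeling Good", 1), (11, "So What", 2)])

# Join: recombine the normalized tables into one denormalized view.
rows = cur.execute("""
    SELECT s.title, a.name
    FROM songs s JOIN artists a ON s.artist_id = a.artist_id
    ORDER BY s.song_id""").fetchall()
print(rows)  # [('Feeling Good', 'Nina Simone'), ('So What', 'Miles Davis')]
```

Against a real PostgreSQL instance the same DDL and join would run through a psycopg2 connection instead of sqlite3; the modeling ideas are identical.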
The non-relational database section implements a non-tabular schema that is optimized for the specific requirements of the type of data being stored. Here, CQL on the Cassandra engine is used for data modeling operations, which include: table creation, denormalization (in place of joins, which CQL does not support), and query clauses.
Requirements: 'python3', 'cassandra', 'psycopg2', 'pandas', 'numpy' and 'json'.
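Cassandra modeling is query-first: you design one denormalized table per access pattern, with the partition key chosen to match the query's WHERE clause. A hedged sketch (the table, columns, and literal values here are hypothetical, not taken from this repo's notebooks):

```sql
-- Hypothetical query-first CQL table: the partition key (session_id)
-- matches the query below, and artist data is denormalized into each
-- row because CQL has no JOIN.
CREATE TABLE songs_by_session (
    session_id int,
    item_in_session int,
    artist_name text,
    song_title text,
    PRIMARY KEY (session_id, item_in_session)
);

SELECT artist_name, song_title
FROM songs_by_session
WHERE session_id = 338 AND item_in_session = 4;
```

A second query pattern (say, songs by user) would get its own table with a different primary key, duplicating the data rather than joining.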
The Data_Warehousing section uses PostgreSQL and CQL to manage schemas on the Pagila dataset, including ETL, fact and dimension tables, OLAP cubes, and OLTP operations.
Requirements: 'python3', 'postgresql', 'sql', 'pandas', 'numpy' and 'json'.
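A minimal star-schema sketch in PostgreSQL-flavored SQL illustrates what fact and dimension tables and an OLAP cube look like; the column subset is illustrative and not taken verbatim from the Pagila schema:

```sql
-- Dimension table: descriptive attributes of each calendar date.
CREATE TABLE dim_date (
    date_key serial PRIMARY KEY,
    day int, month int, year int
);

-- Fact table: one row per rental payment, with a foreign key into
-- the dimension and a numeric measure to aggregate.
CREATE TABLE fact_sales (
    sales_key serial PRIMARY KEY,
    date_key int REFERENCES dim_date(date_key),
    amount numeric
);

-- OLAP cube: subtotals for every combination of month and year,
-- including grand totals, in a single query.
SELECT d.month, d.year, SUM(f.amount) AS revenue
FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
GROUP BY CUBE (d.month, d.year);
```

OLTP workloads hit the normalized source tables with small single-row reads and writes, while the star schema above serves the analytical (OLAP) side.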
The Pagila PostgreSQL movie rental dataset is used for analysis in this work. You can find the licensing for the data and other descriptive information at the link available here. Otherwise, feel free to use the code here as you would like.