Masters in Data Science and Engineering
Faculdade de Engenharia da Universidade do Porto
Cátia Teixeira | Rojan Aslani | Miguel Veloso | Luís Henriques
A data warehouse (an infrastructure to store data) was created to hold data from an online retail store, using a dataset retrieved from Kaggle. It contains two star schemas, in which information from several tables is combined into two tables with different purposes, and a sales cube, which aggregates sales by month or customer segment to serve frequently made queries.
The warehouse was built from scratch for the chosen dataset using PostgreSQL, pgAdmin 4, Pentaho PDI, and Power BI.
-
Setup
- The Data Warehouse was implemented using PostgreSQL;
- The ETL was developed using Pentaho PDI;
- The Mondrian schema was created using Pentaho Schema Workbench;
- The dashboards were built using Microsoft Power BI;
-
Create dimensions
- Create a schema inside a PostgreSQL database;
- Open the transformations (ETL) for the dimension tables (customer, time, location, product) in Pentaho PDI (Kettle), from the 'Spoon' dir;
- Configure the source CSV to your local path in each transformation file;
- Configure your local db connection in the 'Table Output' step of each transformation file;
- Run the transformations to create the tables and populate them;
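As a rough illustration of what a dimension transformation does (read the source CSV, keep one row per natural key, and write to a table with a surrogate key), here is a minimal Python sketch. It is not the Pentaho transformation itself: it uses an in-memory SQLite database instead of PostgreSQL so it can run anywhere, and the table name `dim_customer`, its columns, and the sample rows are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for the source CSV; real column names may differ.
SOURCE_CSV = """customer_id,customer_name,segment
C-1,Ana,Consumer
C-2,Rui,Corporate
C-1,Ana,Consumer
"""


def load_customer_dimension(conn, csv_text):
    """Create dim_customer (if needed) and insert one row per distinct customer."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS dim_customer (
            customer_key  INTEGER PRIMARY KEY,   -- surrogate key, assigned by the database
            customer_id   TEXT UNIQUE NOT NULL,  -- natural key from the source
            customer_name TEXT,
            segment       TEXT
        )
        """
    )
    for row in csv.DictReader(io.StringIO(csv_text)):
        # INSERT OR IGNORE skips rows whose natural key is already loaded.
        conn.execute(
            "INSERT OR IGNORE INTO dim_customer (customer_id, customer_name, segment) "
            "VALUES (?, ?, ?)",
            (row["customer_id"], row["customer_name"], row["segment"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_customer_dimension(conn, SOURCE_CSV)
dim_rows = conn.execute(
    "SELECT customer_key, customer_id, segment FROM dim_customer ORDER BY customer_key"
).fetchall()
print(dim_rows)
```

The duplicate `C-1` row in the sample is loaded only once, which is the behaviour a dimension table generally needs.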
-
Create facts and agg tables
- Open the transformations (ETL) for the fact and aggregation tables (order_item, order, agg_sales) in Pentaho;
- Configure the source CSV to your local path in each transformation file;
- Configure each 'Table Input' step in each file to use the local db connection where the dimensions were created;
- Configure the 'Table Output' step in each file to use your local db connection;
- Run the transformations to create the tables and populate them;
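The fact and aggregation steps above can be sketched in Python as follows. This is only an illustration of the idea (look up surrogate keys in the dimensions, then pre-sum sales by month and customer segment), not the actual Pentaho transformations. It assumes an in-memory SQLite database in place of PostgreSQL, and the staging table `stg_order_item`, the table `fact_order_item`, and all column names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables assumed to exist already (hypothetical columns).
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_id TEXT, segment TEXT);
    CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);
    INSERT INTO dim_customer VALUES (1, 'C-1', 'Consumer'), (2, 'C-2', 'Corporate');
    INSERT INTO dim_time VALUES (1, '2020-01-15', 2020, 1), (2, '2020-02-03', 2020, 2);

    -- Staging table standing in for the rows read from the source CSV.
    CREATE TABLE stg_order_item (order_id TEXT, customer_id TEXT, order_date TEXT, sales REAL);
    INSERT INTO stg_order_item VALUES
        ('O-1', 'C-1', '2020-01-15', 10.0),
        ('O-2', 'C-1', '2020-01-15', 5.0),
        ('O-3', 'C-2', '2020-02-03', 20.0);

    -- Fact table: natural keys are replaced by surrogate keys looked up in the dimensions.
    CREATE TABLE fact_order_item AS
    SELECT s.order_id, c.customer_key, t.time_key, s.sales
    FROM stg_order_item s
    JOIN dim_customer c ON c.customer_id = s.customer_id
    JOIN dim_time t ON t.full_date = s.order_date;

    -- Aggregate table: sales pre-summed by month and customer segment.
    CREATE TABLE agg_sales AS
    SELECT t.year, t.month, c.segment, SUM(f.sales) AS total_sales
    FROM fact_order_item f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_time t ON t.time_key = f.time_key
    GROUP BY t.year, t.month, c.segment;
    """
)
agg_rows = conn.execute(
    "SELECT year, month, segment, total_sales FROM agg_sales ORDER BY year, month, segment"
).fetchall()
print(agg_rows)
```

Because the fact table references the dimensions by key, the dimension tables have to be populated before this step runs.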
-
Create dashboards and cube in Power BI
- Load the data from PostgreSQL (the data warehouse) into Power BI;
- Create dashboards;
- Compute sales aggregations by creating a data cube, which speeds up the frequent queries made for decision making;
- Use other tools to create the data cube;
- Explore, in a production setting, how efficient the ETL pipelines are and what efficiency improvements could be made;
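To make the data cube idea concrete, the sketch below pre-computes sales totals for every combination of two dimensions (month and customer segment), using `ALL` to mark a dimension that has been rolled up. It is a plain-Python illustration under assumed sample data, not what Power BI or Mondrian does internally; the names `FACTS`, `DIMENSIONS`, and `build_cube` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (month, segment, sales).
FACTS = [
    ("2020-01", "Consumer", 10.0),
    ("2020-01", "Corporate", 5.0),
    ("2020-02", "Consumer", 20.0),
]
DIMENSIONS = ("month", "segment")


def build_cube(facts, dimensions):
    """Sum sales for every subset of dimensions; 'ALL' marks a rolled-up dimension."""
    cube = defaultdict(float)
    n = len(dimensions)
    for size in range(n + 1):
        # Each subset of dimension positions defines one level of aggregation.
        for kept in combinations(range(n), size):
            for *dims, sales in facts:
                key = tuple(dims[i] if i in kept else "ALL" for i in range(n))
                cube[key] += sales
    return dict(cube)


cube = build_cube(FACTS, DIMENSIONS)
print(cube[("2020-01", "ALL")])   # total sales for one month, all segments
print(cube[("ALL", "Consumer")])  # total sales for one segment, all months
print(cube[("ALL", "ALL")])       # grand total
```

Once the cube is built, a frequent query such as "sales per month" becomes a dictionary lookup instead of a scan over the fact rows, which is the speed-up the aggregation step aims for.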