This project aims to create an ETL pipeline from energy consumption data.

JaviSandoval94/ETL-Project

 
 


ETL-Project

Background

The purpose of our project was to create a database of energy consumption in restaurants across four regions in Mexico over the last six months: CDMX, Guadalajara, Monterrey and Cancún. The final database should cover each of the most relevant consumption categories in the restaurant industry (HVAC, illumination, ventilation, kitchen, machinery and general contacts) and provide a comprehensible data frame that allows comparisons against the weather and commercial activity of each site.

Extract

The original information came in CSV format across three files, and in JSON format from an API:

  • sites.csv: This file contains information on ten different sites, classified into eight fields, including each site's tariff category and CFE division. The columns required for the rest of the project are site_id, site_name and zona. Since only four regions were needed for this project, the data was filtered down to the first four restaurants in the list.
| site_id | site_name | client_id | tarifa_cfe | tarifa2 | division_cfe | tarifa3 | zona |
|---|---|---|---|---|---|---|---|
| 15099 | Restaurante 01 | 1 | GDMTH | HM | Jalisco | NULL | GDL |
| 36983 | Restaurante 02 | 1 | GDMTH | HM | Golfo Norte | NULL | MTY |
| 38716 | Restaurante 03 | 1 | GDMTH | HM | Peninsular | NULL | YUC |
| 26804 | Restaurante 04 | 1 | GDMTH | HM | Valle de México Centro | NULL | CDMX |
| 32703 | Restaurante 05 | 1 | GDMTH | HM | Golfo Norte | NULL | MTY |
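As a sketch of this extraction step, the snippet below reads site data and keeps only the needed columns and the first four restaurants. An inline sample (rows taken from the table above) stands in for the real sites.csv so the snippet is self-contained:

```python
import io
import pandas as pd

# Inline sample mirroring sites.csv; in the real pipeline this would be
# pd.read_csv("sites.csv").
sites_csv = io.StringIO(
    "site_id,site_name,client_id,tarifa_cfe,tarifa2,division_cfe,tarifa3,zona\n"
    "15099,Restaurante 01,1,GDMTH,HM,Jalisco,,GDL\n"
    "36983,Restaurante 02,1,GDMTH,HM,Golfo Norte,,MTY\n"
    "38716,Restaurante 03,1,GDMTH,HM,Peninsular,,YUC\n"
    "26804,Restaurante 04,1,GDMTH,HM,Valle de México Centro,,CDMX\n"
    "32703,Restaurante 05,1,GDMTH,HM,Golfo Norte,,MTY\n"
)

sites = pd.read_csv(sites_csv)

# Keep only the columns used downstream, restricted to the first four sites.
sites = sites[["site_id", "site_name", "zona"]].head(4)
print(sites)
```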
  • measurements.csv: This file contains power and energy consumption data for all the devices in each of the original ten sites, collected over the last six months in 15-minute intervals. In addition, each device is classified into one of six categories: HVAC, Illumination (Ilum), Ventilation (Iny_Ext), Machinery (Mach), Kitchen (Cocina) and general electrical contacts (Contactos).
| device_id | power(kW) | energy(kWh) | site_id | category | device_name | Fecha | Hora |
|---|---|---|---|---|---|---|---|
| 44989 | 3.2694 | 0.81735 | 15099 | Ilum | Iluminación A | 01/11/2019 | 06:00:00 p.m. |
| 44989 | 3.2839 | 0.820975 | 15099 | Ilum | Iluminación A | 01/11/2019 | 06:15:00 p.m. |
| 44989 | 3.6099 | 0.902475 | 15099 | Ilum | Iluminación A | 01/11/2019 | 06:30:00 p.m. |
| 44989 | 3.3515 | 0.837875 | 15099 | Ilum | Iluminación A | 01/11/2019 | 06:45:00 p.m. |
| 44989 | 3.2893 | 0.822325 | 15099 | Ilum | Iluminación A | 01/11/2019 | 07:00:00 p.m. |
  • ventas.csv: This file contains sales and customer data for each of the original ten sites. All sales are given in Mexican pesos (MXN) in daily intervals for the last six months. This data also had to be filtered down to the four selected sites.

  • Weather data: This data was extracted from the premium API of "World Weather Online", requested for the location of each of the four restaurants. Once the responses were obtained, the JSON was flattened into a data frame and the unneeded columns were dropped.

weather-data
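The JSON-flattening step can be sketched as follows. The response structure and field names here are assumptions for illustration, not verified against the World Weather Online API:

```python
import pandas as pd

# Hypothetical excerpt of a weather API response; the nesting and field
# names are assumed, not taken from the actual API documentation.
sample_response = {
    "data": {
        "weather": [
            {"date": "2019-11-01", "maxtempC": "24", "mintempC": "12", "sunHour": "9.5"},
            {"date": "2019-11-02", "maxtempC": "25", "mintempC": "13", "sunHour": "10.0"},
        ]
    }
}

# Flatten the nested JSON into a DataFrame, one row per day.
weather = pd.json_normalize(sample_response["data"]["weather"])

# Drop columns not needed downstream (sunHour stands in for the
# "unneeded" columns mentioned above), then tag rows with the site's zone.
weather = weather.drop(columns=["sunHour"])
weather["zone"] = "GDL"
print(weather)
```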

Transform

Loading these files into the data set required several steps of filtering and grouping the raw data. In addition, some categories from the original tables were split out into separate tables to maximize data normalization and allow primary key connections across the whole database. The most relevant steps in the process were the following:

Data filtering

A list of the required sites was created as a Python list from the site_id column, and every dataset containing this field was filtered against it.

data-filtering
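A minimal sketch of this filter, using an inline stand-in for one of the raw tables (the column names are illustrative):

```python
import pandas as pd

# The four required site_ids, taken from the sites table above.
required_sites = [15099, 36983, 38716, 26804]

# Small stand-in for a raw table that carries a site_id column.
ventas = pd.DataFrame({
    "site_id": [15099, 32703, 26804, 36983],
    "ventas_mxn": [12500.0, 9800.0, 15300.0, 11100.0],
})

# Keep only the rows belonging to the selected sites.
ventas_filtered = ventas[ventas["site_id"].isin(required_sites)]
print(ventas_filtered)
```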

Grouping

The data contained in the measurements.csv file was grouped by four keys: zone, site_id, date and category.

grouping
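The grouping step can be sketched like this, collapsing the 15-minute readings into daily totals per category (column names are illustrative stand-ins for the cleaned measurements table):

```python
import pandas as pd

# Minimal stand-in for measurements.csv after the site filter, with the
# four grouping keys named above.
measurements = pd.DataFrame({
    "zone":       ["GDL", "GDL", "GDL", "GDL"],
    "site_id":    [15099, 15099, 15099, 15099],
    "date":       ["2019-11-01", "2019-11-01", "2019-11-01", "2019-11-02"],
    "category":   ["Ilum", "Ilum", "HVAC", "Ilum"],
    "energy_kwh": [0.81735, 0.820975, 1.2, 0.902475],
})

# Sum the 15-minute energy readings into one row per zone/site/day/category.
daily = (
    measurements
    .groupby(["zone", "site_id", "date", "category"], as_index=False)["energy_kwh"]
    .sum()
)
print(daily)
```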

Column selection using Pandas

With all the information loaded into Pandas DataFrames, it was easy to select just the required columns from each data set, which allowed more flexibility in the normalization process. To enable primary keys across all the tables and avoid repeated fields, some of the data was moved into new, separate tables. This was done for the regions and the consumption categories.

column-selection-1

column-selection-2
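One way to split a repeated field into its own table is to build a small lookup table with a surrogate key, then replace the free-text column with a foreign key. A sketch (the category_id name is a hypothetical choice, not necessarily the key used in the final schema):

```python
import pandas as pd

# Stand-in for the measurements data with a repeated free-text field.
measurements = pd.DataFrame({
    "site_id":  [15099, 36983, 15099],
    "category": ["Ilum", "HVAC", "Ilum"],
})

# Build a lookup table of unique categories with a surrogate primary key.
categories = (
    measurements[["category"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .rename_axis("category_id")   # name the new integer index
    .reset_index()                # turn it into a column
)

# Replace the text column with a foreign key into the lookup table.
measurements = measurements.merge(categories, on="category").drop(columns=["category"])
print(categories)
print(measurements)
```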

Additional steps in the transformation process included converting the date and time strings into a proper datetime type.
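Using the Fecha/Hora format shown in the measurements sample above, the conversion can be sketched as:

```python
import pandas as pd

# Fecha / Hora strings as they appear in measurements.csv.
df = pd.DataFrame({
    "Fecha": ["01/11/2019", "01/11/2019"],
    "Hora":  ["06:00:00 p.m.", "06:15:00 p.m."],
})

# Normalize "p.m." to "pm" so the %p directive can parse it, then combine
# both columns into a single datetime (day-first, as in the source files).
hora = df["Hora"].str.replace(".", "", regex=False)
df["timestamp"] = pd.to_datetime(df["Fecha"] + " " + hora,
                                 format="%d/%m/%Y %I:%M:%S %p")
print(df["timestamp"])
```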

Load

The final database contains six tables in total, related to each other according to the following diagram. These tables were loaded into PostgreSQL using pgAdmin.

db-diagram
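The load step can be sketched with SQLAlchemy's engine and DataFrame.to_sql. The connection string below is a placeholder, and an in-memory SQLite engine stands in for PostgreSQL so the snippet runs anywhere:

```python
import pandas as pd
from sqlalchemy import create_engine

# The real pipeline targets PostgreSQL, e.g. (placeholder credentials):
# engine = create_engine("postgresql://user:password@localhost:5432/energy_db")
# An in-memory SQLite engine stands in here so the example is self-contained.
engine = create_engine("sqlite://")

# One of the small lookup tables from the Transform step.
regions = pd.DataFrame({
    "region_id": [1, 2, 3, 4],
    "zone": ["GDL", "MTY", "YUC", "CDMX"],
})

# Write the table; if_exists="replace" recreates it on each run.
regions.to_sql("regions", engine, index=False, if_exists="replace")

# Read it back to confirm the load.
print(pd.read_sql("SELECT * FROM regions", engine))
```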

The information was structured this way to normalize the data as much as possible. Our final database stores daily data across all the tables, allowing daily breakdowns of the consumption categories (HVAC, illumination, kitchen, ventilation, machinery and general contacts) against weather data and commercial activity (sales and customers).

Final remarks

This project allowed us to have fun while learning how to transform and load data into comprehensible databases. The most challenging part was defining the database structure and formatting the original data sets so they fit the desired structure. The data normalization was also challenging, in that we had to extract separate tables from the original sets to allow flexibility in handling primary and foreign keys. Next steps with this database could be analysis and data visualizations comparing how consumption has varied over the last six months across the selected restaurants and consumption categories.
