The purpose of our project was to create a database containing information on energy consumption in restaurants across four regions of Mexico over the last six months: CDMX, Guadalajara, Monterrey, and Cancún. The final database should cover each of the most relevant consumption categories in the restaurant industry (HVAC, illumination, ventilation, kitchen, machinery, and general contacts) and provide a comprehensible data frame that allows comparisons against weather variables and the commercial activity of each site.
The original information came in CSV format from three different files and in JSON format from an API:
- `sites.csv`: This file contains general information for ten different sites, classified into eight different fields, including each site's tariff category and CFE division. The columns required for the rest of the project are `site_id`, `site_name`, and `zona`. For the purpose of this project, a sample of only four regions was required, so the data had to be filtered down to the first four restaurants in the list.
site_id | site_name | client_id | tarifa_cfe | tarifa2 | division_cfe | tarifa3 | zona |
---|---|---|---|---|---|---|---|
15099 | Restaurante 01 | 1 | GDMTH | HM | Jalisco | NULL | GDL |
36983 | Restaurante 02 | 1 | GDMTH | HM | Golfo Norte | NULL | MTY |
38716 | Restaurante 03 | 1 | GDMTH | HM | Peninsular | NULL | YUC |
26804 | Restaurante 04 | 1 | GDMTH | HM | Valle de México Centro | NULL | CDMX |
32703 | Restaurante 05 | 1 | GDMTH | HM | Golfo Norte | NULL | MTY |
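The site selection described above can be sketched as follows. This is a minimal example using pandas; the rows are copied from the sample table, and in the actual project the data would be read from `sites.csv` with `pd.read_csv`.

```python
import pandas as pd

# Sample rows from sites.csv (values copied from the table above).
sites_df = pd.DataFrame({
    "site_id": [15099, 36983, 38716, 26804, 32703],
    "site_name": ["Restaurante 01", "Restaurante 02", "Restaurante 03",
                  "Restaurante 04", "Restaurante 05"],
    "zona": ["GDL", "MTY", "YUC", "CDMX", "MTY"],
})

# Keep only the first four restaurants in the list, as described above.
sites_df = sites_df.head(4)

# This site_id list is then reused to filter the other files.
site_ids = sites_df["site_id"].tolist()
```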
- `measurements.csv`: This file contains power and energy consumption data for all the devices at each of the original ten sites. The data was collected over the last six months in 15-minute intervals. In addition, each device falls into one of six categories: HVAC, illumination (`Ilum`), ventilation (`Iny_Ext`), machinery (`Mach`), kitchen (`Cocina`), and general electrical contacts (`Contactos`).
device_id | power(kW) | energy(kWh) | site_id | category | device_name | Fecha | Hora |
---|---|---|---|---|---|---|---|
44989 | 3.2694 | 0.81735 | 15099 | Ilum | Iluminación A | 01/11/2019 | 06:00:00 p.m. |
44989 | 3.2839 | 0.820975 | 15099 | Ilum | Iluminación A | 01/11/2019 | 06:15:00 p.m. |
44989 | 3.6099 | 0.902475 | 15099 | Ilum | Iluminación A | 01/11/2019 | 06:30:00 p.m. |
44989 | 3.3515 | 0.837875 | 15099 | Ilum | Iluminación A | 01/11/2019 | 06:45:00 p.m. |
44989 | 3.2893 | 0.822325 | 15099 | Ilum | Iluminación A | 01/11/2019 | 07:00:00 p.m. |
- `ventas.csv`: This file contains sales and customer information for each of the original ten sites. All sales are given in Mexican pesos (MXN) in daily intervals for the last six months. This data also had to be filtered down to the four selected sites.
- Weather data: This data was extracted from the premium API of World Weather Online, requested based on the location of each of the four restaurants. Once the data was obtained, all the columns were extracted from the JSON and converted into a data frame, and the unwanted columns were dropped.
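The JSON-to-DataFrame step can be sketched like this. The nested structure shown here is a simplified, hypothetical excerpt of a weather API response (the exact field names in the real World Weather Online payload may differ); `pd.json_normalize` flattens the records before the unwanted columns are dropped.

```python
import pandas as pd

# Hypothetical excerpt of the JSON returned by the weather API;
# field names here are illustrative, not the exact API schema.
sample_response = {
    "data": {
        "weather": [
            {"date": "2019-11-01", "maxtempC": "29", "mintempC": "21"},
            {"date": "2019-11-02", "maxtempC": "30", "mintempC": "22"},
        ]
    }
}

# Flatten the nested daily records into a DataFrame,
# then keep only the desired columns.
weather_df = pd.json_normalize(sample_response["data"]["weather"])
weather_df = weather_df[["date", "maxtempC", "mintempC"]]
```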
Loading these files into the data set required various steps of filtering and grouping the raw data. In addition, some of the categories in the original tables were moved into separate tables to allow for maximum data normalization and primary key connections across the whole database. The most relevant steps in the process were the following:
- A list of the required sites was created as a Python array from the `site_id` column, and every dataset containing this field was filtered using that array.
- The data contained in the `measurements.csv` file was grouped by four fields: `zone`, `site_id`, `date`, and `category`.
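The filter-then-group steps above can be sketched as follows. The rows here are a tiny hypothetical sample shaped like `measurements.csv`, and the aggregation (summing `energy(kWh)` into daily totals) is one plausible choice; the original project may have aggregated differently.

```python
import pandas as pd

# Hypothetical measurement rows, shaped like measurements.csv above.
measurements_df = pd.DataFrame({
    "site_id": [15099, 15099, 36983],
    "category": ["Ilum", "Ilum", "HVAC"],
    "energy(kWh)": [0.81735, 0.820975, 0.90],
    "Fecha": ["01/11/2019", "01/11/2019", "01/11/2019"],
})

# Filter to the four selected restaurants using the site_id array.
site_ids = [15099, 36983, 38716, 26804]
filtered = measurements_df[measurements_df["site_id"].isin(site_ids)]

# Collapse the 15-minute readings into daily totals per site and category.
daily = (filtered
         .groupby(["site_id", "Fecha", "category"], as_index=False)["energy(kWh)"]
         .sum())
```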
With all the information loaded into pandas DataFrames, it was easy to select just the required columns from each of the provided data sets, which allowed more flexibility in the normalization process. To enable primary keys across all the tables and to avoid repeated fields, some of the data had to be moved into new, separate tables. This was done for the regions and the consumption categories.
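Extracting such a lookup table can be sketched like this for the regions; the same pattern applies to the consumption categories. The column and key names (`zona`, `zone_id`) are illustrative, based on the sample tables above, and may not match the project's exact schema.

```python
import pandas as pd

# Hypothetical site rows; "zona" values come from sites.csv above.
sites_df = pd.DataFrame({
    "site_id": [15099, 36983, 38716, 26804],
    "zona": ["GDL", "MTY", "YUC", "CDMX"],
})

# Build a small lookup table so each zone gets its own primary key.
zones_df = (sites_df[["zona"]].drop_duplicates()
            .reset_index(drop=True))
zones_df["zone_id"] = zones_df.index + 1

# Replace the repeated text column with the foreign key.
sites_df = sites_df.merge(zones_df, on="zona").drop(columns="zona")
```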
Additional steps in the transformation process included data type conversion, changing string columns into proper datetime types.
The final database contains six tables in total, related to each other according to the following diagram. These tables were loaded into PostgreSQL using PGAdmin.
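A table load of this kind can be sketched with `DataFrame.to_sql`. In the project the target was PostgreSQL (typically via a SQLAlchemy engine such as `create_engine("postgresql://user:password@host:5432/dbname")`, with that connection string being hypothetical); an in-memory SQLite connection is used here so the sketch stays runnable.

```python
import sqlite3

import pandas as pd

# Stand-in for the project's PostgreSQL connection.
conn = sqlite3.connect(":memory:")

# One of the six final tables, reduced to a single illustrative row.
sites_df = pd.DataFrame({"site_id": [15099], "site_name": ["Restaurante 01"]})
sites_df.to_sql("sites", conn, if_exists="replace", index=False)

# Read back to confirm the load.
loaded = pd.read_sql("SELECT * FROM sites", conn)
```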
The information was structured this way to normalize the data as much as possible. Our final database stores daily data across all the tables, allowing daily breakdowns of the consumption categories (HVAC, illumination, kitchen, ventilation, machinery, and general contacts) against weather data and commercial activity (sales and customers).
This project allowed us to have fun while learning how to transform and load data into comprehensible databases. The most challenging part was defining the database structure and formatting the original data sets so they fit into it. The data normalization was also challenging in that we had to extract separate tables from the original sets to allow flexibility in the handling of primary and foreign keys. Next steps with this database could include performing analyses and generating visualizations to compare how consumption has varied over the last six months across the selected restaurants and each of the consumption categories.