# Data science for energy engineers

## Copyright
These notebooks, authored by Hussain Kazmi, are licensed under the AGPL License; you may not use this file except in compliance with the License. Notebooks are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. Support from KU Leuven and InnoEnergy is gratefully acknowledged.

## Data is the new electricity

Data and artificial intelligence have been termed the new electricity by prominent [business practitioners](https://www.youtube.com/watch?v=VIAFQ5p2dxU) and [researchers](https://www.youtube.com/watch?v=21EiKfQYZXc). This is because of the ubiquity of data today, and the products that make use of artificial intelligence algorithms to make sense of this data. We use data science to refer to the end-to-end process of acquiring relevant data, analysing and manipulating it to create real-world value.

Data-driven algorithms allow us to build robust, adaptable systems, as opposed to rule-based systems with hard-coded parameters. The latest innovations in data science are driven by three key developments:
1. Computational power has increased manifold over the last few decades.
2. The amount of available data has likewise grown explosively during this time.
3. Algorithmic evolution and availability of open source toolkits (such as this one) have contributed greatly to democratizing data science.


## Data use cases in energy

We start off by highlighting some of the most common use cases of applying data science in energy engineering. The list is by no means exhaustive, but the interested reader is referred to [this](https://www.sciencedirect.com/science/article/pii/S037877961500293X?casa_token=PMDOZOsOu9kAAAAA:NEZUJTiwOL6OqzBXAZ2GKZn9NF5LkHVJd1E2lJ6SK8tS6zS2gg5Hnynd1dGjApYXSHkrpK2xEg) and [this](https://www.mdpi.com/1996-1073/9/5/348) review for more details.

### Smarter buildings
Data can be used to optimize the design and operation of buildings and building components. For instance:
1. Architects and building engineers use data from simulations and previous projects to design buildings that minimize the lifetime costs (and emissions). This idea is embraced by the industry standard cost-optimality criterion to design and/or refurbish buildings in a sustainable manner.
2. Building engineers use energy demand data of a building for space conditioning (e.g. heating or cooling) to determine whether energy efficiency investments make economic sense for an existing building. This includes replacing the building facade with better-insulating materials, or replacing an outdated HVAC (heating, ventilation and air conditioning) system with a modern, more efficient system such as a heat pump.
3. Building energy management systems also make use of real-time data to make buildings more responsive to user needs (for instance through recognizing preferred temperature set-points by detecting occupancy etc.). This has two key advantages: it increases user comfort, and also cuts down on unnecessary energy demand.
4. Historic building energy demand can also help practitioners identify the optimal dimensioning of distributed energy resources, e.g. rooftop solar PV and electrical battery systems. However, not only can data guide the optimal design choices, it also helps with optimizing the operation of these systems in practice. This topic is the focus of much of this course.

### Smarter energy grids
Data can also be used to optimize grid operation on a scale larger than an individual building. For instance:
1. Practitioners use real world data to identify the optimal location for utility scale solar and wind farms. This helps minimize the levelised cost of electricity production. Furthermore, once these renewable energy projects are realized, plant operators use historical production data to detect and predict faults. Data is also used to model degradation of such systems: e.g. to better understand the drivers that affect the longevity of solar PV panels, wind turbines or battery-inverter systems.
2. Grid operators use data from numerous sources to make forecasts for generation and consumption of energy. These forecasts are then used to commission enough flexibility to ensure grid operation can continue in a stable manner. This data-driven analysis can be conducted by the transmission system operators (TSOs) for the national grid, and by the distribution system operators (DSOs) for the local or regional grids. This flexibility is often offered by aggregators through automated demand response.
3. On a related note, data-driven analysis is also used extensively by entities such as energy traders, who make forecasts for future electricity prices, and then use this information to optimize energy generation and/or consumption.

While we focus on electricity in this course, the same concepts apply to other energy sources such as gas.
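The levelised cost of electricity mentioned above is commonly defined as the discounted lifetime cost of a plant divided by its discounted lifetime energy production. The sketch below illustrates that definition only; the function name `levelised_cost`, the discount rate and all numbers are assumptions made up for this example, not data from a real project.

```python
def levelised_cost(costs, energy, discount_rate):
    """LCOE: discounted lifetime costs divided by discounted lifetime energy.

    costs[t] and energy[t] are the cost and the energy produced in year t,
    with t = 0 being the year of the initial investment.
    """
    discounted_costs = sum(c / (1 + discount_rate) ** t for t, c in enumerate(costs))
    discounted_energy = sum(e / (1 + discount_rate) ** t for t, e in enumerate(energy))
    return discounted_costs / discounted_energy


# Hypothetical plant: 1000 EUR up front, 20 EUR/year upkeep and
# 500 kWh/year of production over three years (made-up numbers).
costs = [1000.0, 20.0, 20.0, 20.0]
energy = [0.0, 500.0, 500.0, 500.0]

lcoe = levelised_cost(costs, energy, discount_rate=0.05)
print(f"LCOE: {lcoe:.3f} EUR/kWh")
```

Comparing this value across candidate sites (each with its own expected production profile) is one simple way to rank them.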

## Types of practical data science projects

So, what does an end-to-end workflow in such a data science project in energy look like? There are a few different ways of looking at this, which we explore in this section.

### The [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) view
This view is focused more on forecasting related problems. More specifically, if your project is related to forecasting energy consumption and/or production, then this view is likely the most appropriate. Forecasting energy is mostly posed as a regression problem (i.e. the output of your machine learning models is a continuous value). This project is usually split into two parts:
1. A training phase, where you use collected training examples to train a model 
2. An evaluation (or inference) phase, where you use the trained model to make predictions for cases that you did not see in the training process

To summarize, the question we want to answer in this case is: 'if you have historical demand (or generation) data, how can you forecast it for the future?' A bird's-eye overview is visualized in the figure below. We will deal with how to make forecasts in lecture 3.

<img src="slearn.png" style="width: 640px;">

Note that there are other use cases of supervised learning in the energy domain; however, we do not deal with them in this course.
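The two phases described above can be sketched in a few lines. This is a minimal illustration rather than the method used later in the course: it assumes a one-feature linear model that predicts the demand of the next hour from the demand of the current hour. The helper `fit_linear` and the demand values are hypothetical.

```python
def fit_linear(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical hourly demand in kWh (made-up numbers for illustration only).
demand = [1.2, 1.0, 0.9, 1.1, 1.8, 2.4, 2.6, 2.2, 1.9, 1.7, 2.0, 2.5]

# Each feature is the demand at hour t, each target the demand at hour t + 1.
features, targets = demand[:-1], demand[1:]

# Training phase: fit the model on the first 8 examples.
split = 8
slope, intercept = fit_linear(features[:split], targets[:split])

# Evaluation phase: predict the held-out examples and measure the error.
predictions = [slope * x + intercept for x in features[split:]]
mae = sum(abs(p - t) for p, t in zip(predictions, targets[split:])) / len(predictions)
print(f"Mean absolute error on unseen data: {mae:.2f} kWh")
```

Keeping the evaluation examples out of the training phase is what makes the reported error an honest estimate of how the model behaves on data it has not seen.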


### The end to end workflow
While forecasting lets you better understand the future, it does not allow you to 'close the loop' or take any actions to affect the future. End to end data science solutions, on the other hand, create actionable insights and schedules etc. from forecasts. In the case of energy projects, it is straightforward to use demand and production forecasts to optimize schedules (e.g. in trading portfolios). A common example is to charge large-scale battery storage systems (either thermal or electrical) when the prices are low and discharge them when the prices are high. This is known as cost arbitrage, and you will learn how to do it in lecture 4.
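As a rough illustration of cost arbitrage, the sketch below applies a simple threshold rule: charge when the price is below the median of the forecast prices, and discharge when it is above. It assumes hourly time steps, a lossless battery and perfectly known prices, none of which hold in practice. The function `arbitrage_schedule`, its parameters and the price values are hypothetical.

```python
from statistics import median


def arbitrage_schedule(prices, capacity_kwh=10.0, power_kw=5.0):
    """Threshold rule: charge below the median price, discharge above it.

    Returns the energy exchanged per hour (positive = charging) and the
    net cost of the schedule (negative = profit).
    """
    threshold = median(prices)
    state_of_charge = 0.0
    schedule = []
    net_cost = 0.0
    for price in prices:
        if price < threshold:
            # Charge as much as the power limit and remaining capacity allow.
            energy = min(power_kw, capacity_kwh - state_of_charge)
        elif price > threshold:
            # Discharge as much as the power limit and stored energy allow.
            energy = -min(power_kw, state_of_charge)
        else:
            energy = 0.0
        state_of_charge += energy
        net_cost += energy * price
        schedule.append(energy)
    return schedule, net_cost


# Hypothetical hourly prices in EUR/kWh (made-up numbers).
prices = [0.10, 0.08, 0.12, 0.30, 0.35, 0.20]
schedule, net_cost = arbitrage_schedule(prices)
print(schedule)
print(f"Net cost: {net_cost:.2f} EUR")
```

A rule like this is only a baseline; a proper optimization would also account for efficiency losses and forecast uncertainty.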

The following image shows an overview of the end to end data science project. Note that such a project includes forecasting as a component, but greatly expands the scope. An alternative to this forecast-control staged process is to use reinforcement learning, which often combines the two into a single block (as model-free control). While we do not cover reinforcement learning in this course, we discuss some of the remaining building blocks (data sensing etc.) in the next section.

<img src="Pipeline.png" style="width: 200px;">
(Image courtesy Lex Fridman)


### The [A/B testing](https://en.wikipedia.org/wiki/A/B_testing) view
This view builds further on the other two views, and answers questions such as 'given two different algorithms, how do I know which one will lead to the greatest improvement?'. As such, it deals with ways of designing experiments and drawing (statistically significant) conclusions from the gathered evidence. One example of such a test is to run two different closed-loop forecasting and optimization algorithms on two identical populations (e.g. sets of buildings). Assuming that the two populations are otherwise identical, the results of such an experiment can help identify the real world usefulness of the two algorithms, and give an indication about which to use in practice.

An example of A/B testing is given below. Note that A/B testing is a special class of randomized experimentation, where we test only two variants in a single experiment. Such tests are used extensively in energy as well as other domains. However, we will not cover them further in this course.

<img src="ab_testing.png" style="width: 640px;">
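One way to check whether the difference between two variants is statistically significant is a permutation test. The sketch below is a minimal version under stated assumptions: the function `permutation_test` and the energy savings of the two groups of buildings are hypothetical, and a real experiment would need far more careful design.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the observations to the two groups.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimated p-value above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical energy savings (%) under algorithm A and algorithm B.
savings_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
savings_b = [10.2, 10.5, 9.9, 10.1, 10.4, 10.0]

p_value = permutation_test(savings_a, savings_b)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value suggests that a difference this large would be unlikely if the two algorithms performed equally well.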


### Other views

It is also possible to look at data science projects in energy in a number of other ways. These include:
1. Using unsupervised learning to cluster buildings with similar energy demand. Another possibility is to use unsupervised learning algorithms to reduce the dimensionality of data, for example using principal component analysis.
2. There are genuine concerns surrounding the use of electricity consumption data from buildings. This has sparked a number of debates surrounding data privacy and security. Privacy-preserving learning algorithms provide one way to still create value from data while preserving the user's privacy.
3. Many data-driven algorithms require enormous amounts of training data before they can be utilized for practical purposes. Learning from simulations and transfer learning present two key approaches to mitigate these issues.

We do not discuss these further in this course, however the interested reader can find more details [here]().

## Practical concerns in data science projects

The 'smart' component of most data science projects is often made possible by an enormous effort to gather, store, curate and transmit the appropriate data. This is also evident in the end to end workflow figure above, where sensors (and sensed data) form an integral part of the overall value chain. In this section, we briefly address data acquisition, storage and processing. However, we deal with loading and working with data in lecture 2.

### Data acquisition

Data sensing and acquisition form the foundation on which any data science project is built. As such, the quality of the data determines, to a large extent, the fate of the project and how much value it can create. A rule of thumb that applies to these projects is therefore 'Garbage-in-Garbage-out', or GIGO for short.
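A basic sanity check on incoming data can catch some of this 'garbage' early. The sketch below is only an illustration: the function `quality_report`, the plausibility limit and the readings are hypothetical, and real checks depend heavily on the sensor and the application.

```python
def quality_report(readings, max_kwh):
    """Return the positions of missing and physically implausible readings."""
    missing = [i for i, r in enumerate(readings) if r is None]
    implausible = [
        i for i, r in enumerate(readings)
        if r is not None and not 0 <= r <= max_kwh
    ]
    return {"missing": missing, "implausible": implausible}


# Hypothetical meter readings in kWh; None marks a dropped measurement.
readings = [0.4, 0.5, None, 0.6, -1.0, 250.0, 0.5]

report = quality_report(readings, max_kwh=10.0)
print(report)
```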

Despite the amount of data being generated every second, sensing relevant data can still be expensive. This is especially true for energy systems, where many privacy concerns exist. For optimal operation of the grid, we ideally want perfect visibility of both the demand and the supply side, in high spatial and temporal resolution. The reality is a bit different:

1. On the demand side, as we move from larger commercial buildings to residential buildings, the state of smart metering usually degrades quickly. While some European countries have made progress in smart metering infrastructure, this is by no means universal. Furthermore, even when this data is being recorded, it will seldom be available for public consumption (for good reasons). Even in larger installations where electricity demand is being logged, legacy systems often prevent export of this data. Likewise, even when the overall electricity demand in a building is recorded, it will not give the full picture that can be obtained by sub-metering individual loads (such as heating, ventilation etc.). Sub-metering is an expensive proposition, not just because of the investment cost for sensors, but also because of the huge amounts of data that need to be transmitted and stored.
2. On the supply side, large electricity generators and centralized, utility scale power plants record operational data. However, on the building level, the output of distributed energy resources such as solar panels is seldom monitored. Thus, even in cases where this data is being generated, very often it is not analysed to see if there are opportunities for further optimization (by dealing with shading or soiling etc.).


### Data storage

As alluded to earlier, once the sensors are in place to gather data, the next step is to decide where to record it and how to make it available for subsequent analysis. This can be done in multiple ways:

1. The sensors recording the data can write it to a database which can then be accessed by data scientists.
2. Alternatively, the sensors can make the data available on demand. In this case, the data does not get written continuously, thereby reducing its volume.

Which solution works better depends on the problem at hand. In the next lecture, we look into a number of popular file formats to load and analyse energy data.
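The first option above, writing sensor readings to a database that data scientists can query, can be sketched with Python's built-in `sqlite3` module. The table layout, meter names and readings are assumptions made for this example; a production system would typically use a dedicated (time series) database.

```python
import sqlite3

# An in-memory database stands in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (timestamp TEXT, meter_id TEXT, energy_kwh REAL)"
)

# Hypothetical readings written by two smart meters.
readings = [
    ("2024-01-01T00:00", "meter_1", 0.42),
    ("2024-01-01T00:15", "meter_1", 0.38),
    ("2024-01-01T00:00", "meter_2", 1.10),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", readings)
conn.commit()

# A data scientist can later query the stored data, e.g. the total per meter.
query = "SELECT meter_id, SUM(energy_kwh) FROM readings GROUP BY meter_id"
totals = dict(conn.execute(query).fetchall())
conn.close()
print(totals)
```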

In addition to storing data for subsequent use, it is also often necessary to store machine learning models trained using this data. In the Python programming language, there are a number of ways to do this. These include storing the models in the same (binary) format as your data. More recently, formats specialized for machine learning models, such as ONNX, have also been developed.
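One common way to store a Python object in binary form is the built-in `pickle` module. The sketch below assumes, for illustration only, that the 'model' is a plain dictionary of fitted parameters; the file name and parameter values are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical fitted parameters standing in for a trained model.
model = {"slope": 0.8, "intercept": 0.3}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demand_model.pkl"

    # Save the model to disk in binary format.
    with model_path.open("wb") as f:
        pickle.dump(model, f)

    # Load it back, e.g. in a later session or on another machine.
    with model_path.open("rb") as f:
        restored = pickle.load(f)

print(restored)
```

Note that unpickling can execute arbitrary code, so only load pickle files from sources you trust.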

Finally, it is important to keep in mind that the data, code and models all need to be versioned correctly for replication purposes. Online services such as GitHub can provide part of this service to open-source and private projects.

### Processing

Once the data can be retrieved for further analysis, we can start analysing it. This can be done either on your own computer or in the cloud (as in this course). Increasingly, edge computation is becoming feasible with the growing computational power on sensing devices. Such processing is the focus of this course.


## Course structure

The remainder of the course is organized as follows:
* We will look at ways to import, clean and visualize data in lecture 2. We will consider energy demand from 200 (simulated) Belgian households.
* In lecture 3, we discuss how to better understand the trends in time series data, and make forecasts using various techniques.
* Lecture 4 will introduce a general purpose, derivative-free framework to solve optimization problems in energy. More specifically, we will look at ways to minimize energy costs using a battery.
* Lecture 5 on advanced topics will not contain any coding per se, and will therefore take on the form of a discussion.

<img src="course-content.png">

## Why Python?

The material for this course has been developed in Python, a high level programming language. Python is among the most popular programming languages at the moment. Some of the advantages that it offers include:

1. Extensive support for data science with open source libraries and frameworks
2. Straightforward interfaces with other programming languages
3. Interactive development environments and the IPython framework.

At the same time, there are also some disadvantages associated with using Python. These include:
1. Relatively slow computational speed (especially when code is not written in Pythonic syntax)
2. Dynamically typed language (type errors only surface at run time)
3. Compatibility issues between Python 2 and 3

For more details on where Python ranks at the moment in terms of popularity as estimated by the Tiobe Index, look [here](https://www.tiobe.com/tiobe-index/). For more details on some of the reasons behind Python's popularity with the data science community, check out [this](https://www.kdnuggets.com/2017/07/6-reasons-python-suddenly-super-popular.html) and [this](https://medium.com/@mindfiresolutions.usa/advantages-and-disadvantages-of-python-programming-language-fd0b394f2121) blog.

More specifically, we use Jupyter notebooks to teach Python and data science skills in this course. However, you can also execute these notebooks on your local machine using IDEs such as [PyCharm](https://www.jetbrains.com/pycharm/).

## Additional resources

In general, for those interested in data science, [Medium](https://medium.com) and [KDNuggets](https://www.kdnuggets.com) have plenty of (introductory) content written by bloggers. In this section, we include a list of some additional resources that might be useful for the interested student.

### Introduction to programming

This course expects a basic familiarity with programming in Python. If this sounds intimidating, make sure to follow the prerequisites lecture in this course before proceeding to the next lecture. If learning from static notebooks is not your thing, there are a number of online resources to bring you up to speed with Python. These include:


1. [Introduction to Python](https://www.datacamp.com/courses/intro-to-python-for-data-science)
This course provides a brief introduction to lists, NumPy arrays and built-in functions in Python. After this course, you will know how to create and manipulate data in multiple forms.


2. [Intermediate Python](https://www.datacamp.com/courses/intermediate-python-for-data-science)
This course introduces two key data science libraries: Pandas (for loading and working with your data) and Matplotlib (for visualizing). It also introduces logical operators and loops - necessary tools for any serious project.


3. Numerous courses on Pandas and Matplotlib:
A number of courses on DataCamp cover these in much greater detail. These are not essential for following the course, but will provide you with greater knowledge and skills to tackle real world problems.


4. [Lecture notes from a one day course on NumPy and Linear Algebra](https://github.com/ADGEfficiency/teaching-monolith/tree/master/numpy) that explain vector, matrix and tensor data processing.


5. An [introduction to coding standards in Python](https://www.datacamp.com/community/tutorials/pep8-tutorial-python-code) that goes over the PEP-8 standard.


### Introduction to machine learning

This course will introduce a number of machine learning and optimization concepts for the energy domain. Here is a non-exhaustive list of courses and material that focus more on machine learning and artificial intelligence itself. These can serve either as follow-up to the course or a complement, but they are not required for this course.

1. [Andrew Ng's introduction to machine learning](https://www.youtube.com/playlist?list=PLoR5VjrKytrCv-Vxnhp5UyS1UjZsXP0Kj) is perhaps the most watched lecture series on the internet about machine learning. This course offers an extremely accessible introduction to machine learning (and is also pretty watered down). For the more adventurous, Ng also has a much more comprehensive [machine learning course](https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599) he taught at Stanford online. Finally, he has also made his [deep learning course](https://www.youtube.com/playlist?list=PLoROMvodv4rOABXSygHTsbvUz4G_YQhOb) available online recently - however, unlike the other two, this one focuses exclusively on deep learning (as opposed to the more general machine learning content of earlier courses).


2. Like Prof. Ng, Nando de Freitas also has multiple courses on machine learning online. These start from an [undergrad level course](https://www.youtube.com/playlist?list=PLE6Wd9FR--Ecf_5nCbnSQMHqORpiChfJf) to a [graduate level course](https://www.youtube.com/playlist?list=PLE6Wd9FR--EdyJ5lbFl8UuGjecvVw66F6) on machine learning. Finally, there is also a course on [deep learning](https://www.youtube.com/playlist?list=PLjK8ddCbDMphIMSXn-w1IjyYpHU3DaUYw). The content covered by these courses is quite different, and can provide complementary information to Prof. Ng's lectures.


3. For an excellent and comprehensive introduction to reinforcement learning, check out [David Silver's course](https://www.youtube.com/playlist?list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-). These lectures follow the seminal book by Sutton and Barto, Reinforcement Learning: An Introduction, available online and in LIMO. Pieter Abbeel also covers the foundations of RL in his [artificial intelligence course](https://www.youtube.com/watch?v=i0o-ui1N35U) (which is also a good complement to the courses discussed earlier).


4. For more details on deep learning, check out the recent book by Goodfellow et al. For computer vision applications, Stanford has a set of very intuitive [courses](https://www.youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv) online as well.


5. Finally, Zico Kolter has a [course](https://www.youtube.com/playlist?list=PLTOBJKrkhpoOjsfYdEeKskarea09we9DJ) online on computational methods for sustainable energy which covers smart grids and machine learning.



