# Pandas and matplotlib exercise with open data from Helen 
### 12.4.2022

In this exercise we will make a plot similar to the one found on https://www.helen.fi/helen-oy/vastuullisuus/ajankohtaista/avoindata . We will also calculate and visualize some yearly summary data. The aim of this exercise notebook is to practice parsing and manipulating CSV files with pandas and visualizing data with matplotlib. Feel free to add cells and play around with the variables to get a better understanding.

First import necessary packages:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Go to https://www.helen.fi/helen-oy/vastuullisuus/ajankohtaista/avoindata and copy the URL of the CSV file.
The cell below is provided as a reference, but it will not yet produce the required dataframe.

In [None]:
url = ???
storage_options = {'User-Agent': 'Mozilla/5.0'}
df = pd.read_csv(url,storage_options=storage_options)
df.head()

The first attempt at parsing the CSV resulted in an error, so more parameters are needed to read the file successfully. This is typical. The cell below is provided as help: it makes a request, reads the raw data and prints the first 5 lines as raw bytes. From these one can deduce the required parameters. Opening the CSV in Excel can also help to figure out the correct format.

In [None]:
from urllib.request import Request, urlopen

req = Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0')
content = urlopen(req)

data = content.readlines()
for line in data[0:5]:
    print(line)

The '\xef\xbb\xbf' at the start of the first line is the UTF-8 byte order mark, or BOM (https://www.knownhost.com/kb/htaccess-invalid-command-xefxbbxbf/), and can be ignored.
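To see what these three bytes are, here is a minimal sketch using only the standard library. The byte string is a made-up header line, not the actual first line of the Helen file. It shows that the bytes match the UTF-8 byte order mark and that decoding with the 'utf-8-sig' codec strips them.

```python
import codecs

# Hypothetical first line of a CSV as raw bytes, starting with the UTF-8 BOM
raw = b'\xef\xbb\xbftimestamp;value\n'

# The three leading bytes are exactly the UTF-8 byte order mark
has_bom = raw.startswith(codecs.BOM_UTF8)

plain = raw.decode('utf-8')      # the BOM survives as the character '\ufeff'
clean = raw.decode('utf-8-sig')  # the BOM is removed while decoding

print(has_bom)
print(repr(plain))
print(repr(clean))
```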

Some useful documentation:
- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
- https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea
- https://pandas.pydata.org/docs/user_guide/timeseries.html

Pandas has many useful methods for working with time data.

In [None]:
pd.to_datetime('2.1.2015 1:00')
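A date string such as the one above is ambiguous: it can be read month first or day first. The following small sketch compares the default behaviour of `pd.to_datetime` with `dayfirst=True`; the variable names are only for illustration.

```python
import pandas as pd

text = '2.1.2015 1:00'

# Default: the first number is read as the month
month_first = pd.to_datetime(text)

# dayfirst=True: the first number is read as the day
day_first = pd.to_datetime(text, dayfirst=True)

print(month_first)  # 2015-02-01 01:00:00
print(day_first)    # 2015-01-02 01:00:00
```

Which reading is correct depends on the convention used by whoever produced the file.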

Fill in the right parameters:

In [None]:
df = pd.read_csv(url,
            storage_options = storage_options,
            delimiter=???,
            header=???,
            decimal=???,
            parse_dates=???,
            index_col= ???,
            dayfirst=???
        )

df.head()
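As an illustration of how these parameters work together, here is a sketch on a small in-memory CSV. The sample text, the column names `timestamp` and `load`, and the assumed format (semicolon delimiter, decimal comma, day-first dates) are made up for this example and are not taken from the Helen file, so the right values for the real data may differ.

```python
import io
import pandas as pd

# Made-up sample in an assumed format, not the real data
sample = (
    "timestamp;load\n"
    "1.2.2015 1:00;10,5\n"
    "1.2.2015 2:00;11,25\n"
    "13.2.2015 3:00;12,0\n"
)

demo_csv = pd.read_csv(
    io.StringIO(sample),
    delimiter=";",              # fields are separated by semicolons
    header=0,                   # the first line holds the column names
    decimal=",",                # "10,5" is read as 10.5
    parse_dates=["timestamp"],  # convert this column to datetimes
    index_col="timestamp",      # and use it as the index
    dayfirst=True,              # "1.2.2015" means 1 February 2015
)

print(demo_csv)
```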

Run the tests below to check that your dataframe is (probably) in the correct format.

In [None]:
assert isinstance(df.index, pd.DatetimeIndex), "The dataframe named df should have a DatetimeIndex"
assert df.isna().sum().item() == 0, "There should not be any NA values in the dataframe"
assert df.dtypes.item() == np.float64, "The dataframe should contain float64's"
assert df.size == 52607, "The dataframe is of wrong size"
assert df.iloc[df.index == "2015-2-1 01:00:00",0].item() == 1052.6, "The dataframe was parsed with the datetimes in the wrong convention, American vs European"
print("Passed tests ! :)")

If the dataframe is correct the cell below should produce a graph similar to the one seen on: https://www.helen.fi/helen-oy/vastuullisuus/ajankohtaista/avoindata .

In [None]:
# Set figure size
f, ax = plt.subplots(1, 1, figsize = (15, 5))

# Plot lineplot
plt.plot(df["dh_MWh"])
# Add title and labels
plt.title("Helsinki district heating effect")
plt.xlabel("Date")
plt.ylabel("MWh")
# fill space between x-axis and lineplot
plt.fill_between(df.index, 0, df["dh_MWh"], facecolor='indigo', alpha=0.7)

plt.show()

Next we are going to calculate some summary statistics (mean, min, max, std) for each complete year, i.e. 2015-2020. Downsample the time series to yearly frequency and aggregate it (think groupby year) with each respective function.
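The idea can be sketched on synthetic data. The daily series below, its column name `load` and all `demo_` variable names are assumptions made for illustration. `resample("YS")` groups the rows into bins that start on 1 January of each year, and grouping by `index.year` gives the same numbers with an integer year as the index.

```python
import numpy as np
import pandas as pd

# Synthetic daily data covering two complete years: 0, 1, 2, ...
idx = pd.date_range("2019-01-01", "2020-12-31", freq="D")
demo_ts = pd.DataFrame({"load": np.arange(len(idx), dtype=float)}, index=idx)

# Downsample to one bin per year, then aggregate each bin
yearly = demo_ts.resample("YS")
demo_means = yearly.mean()
demo_mins = yearly.min()
demo_maxes = yearly.max()
demo_stds = yearly.std()

# Equivalent aggregation with an explicit groupby on the year
demo_by_year = demo_ts.groupby(demo_ts.index.year).mean()

print(demo_means)
print(demo_by_year)
```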

In [None]:
# Downsample the timeseries and calculate statistics

yearly_means = ???
yearly_mins = ???
yearly_maxes = ???
yearly_stds = ???


Run the cell below to test your calculations.

In [None]:
assert yearly_means.sum().item()//1 == 4663.0, "The means are not correct"
assert yearly_mins.sum().item()//1 == 1005.0, "The min values are not correct"
assert yearly_maxes.sum().item()//1 == 13252.0, "The max values are not correct"
assert (yearly_stds.sum().item())//1 == 2565.0, "The standard deviations are not correct"
print("Passed tests ! :)")

Next we can visualize the results with a stacked errorbar plot:

In [None]:

# Create stacked errorbars:

# Set figure size
f, ax = plt.subplots(1, 1, figsize = (10, 7))


# Mean and standard deviation
plt.errorbar(yearly_means.index,
    yearly_means["dh_MWh"],
    yearly_stds["dh_MWh"],
    fmt='ok',
    lw=3)

# Mean with the range from yearly minimum to yearly maximum
plt.errorbar(yearly_means.index,
    yearly_means["dh_MWh"],
    [yearly_means["dh_MWh"] - yearly_mins["dh_MWh"], yearly_maxes["dh_MWh"] - yearly_means["dh_MWh"]],
    fmt='.k',
    ecolor='gray',
    lw=1)


# Add title and labels
plt.title("Yearly Helsinki district heating effect")
plt.xlabel("Date")
plt.ylabel("MWh")

# Add grid
ax.grid(axis='y')

plt.show()

Finally, plot a 2x3 grid of histograms, each cell containing a histogram of that year's heating usage, to showcase the variation from year to year.
Some useful resources:
- https://matplotlib.org/3.5.0/gallery/statistics/hist.html
- https://matplotlib.org/stable/gallery/statistics/histogram_multihist.html#sphx-glr-gallery-statistics-histogram-multihist-py
- https://stackoverflow.com/questions/20174468/how-to-create-subplots-of-pictures-made-with-the-hist-function-in-matplotlib-p
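
One possible pattern for filling such a grid is sketched below on synthetic data. The random values, the column name `load`, the bin count and the `demo_` names are all assumptions for illustration. The sketch pairs each axis with one year's group by iterating over `axs.flat` and a groupby on the year.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily data for 2015-2020; values are random, not real measurements
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", "2020-12-31", freq="D")
demo_hist = pd.DataFrame({"load": rng.normal(800, 300, len(idx))}, index=idx)

demo_fig, demo_axs = plt.subplots(2, 3, sharey=True, figsize=(15, 7))

# axs.flat walks the grid row by row; groupby yields (year, rows of that year)
groups = demo_hist.groupby(demo_hist.index.year)
for ax, (year, group) in zip(demo_axs.flat, groups):
    ax.hist(group["load"], bins=20)
    ax.set_title(str(year))

demo_fig.tight_layout()
```

In a notebook the figure is shown when the cell finishes; elsewhere, call `plt.show()` to display it.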

In [None]:
fig, axs = plt.subplots(2, 3, sharey=True, figsize = (15, 7))

???

plt.show()
