**Data Visualization course - winter semester 23/24 - FU Berlin**

*Tutorials adapted from the [Information Visualization](https://infovis.fh-potsdam.de/tutorials/) course at the FH Potsdam*

# Tutorial 5: Temporal analysis

We encounter time series data in pretty much every domain, from finance to weather, from public health to renewable energies. Visualizations of temporal data may represent recorded observations from the past, predicted developments for the future, or both, which is why visual representations of temporal data are so important and interesting. Especially in the context of the ongoing climate and corona crises, we encounter many time series visualizations.

## 🛒 1. Prepare

Before we are able to do anything, we need to include the libraries that we are working with (as always):

In [None]:
import pandas as pd
import altair as alt

### Parse dates and times

In its most basic form, time series data contain a quantitative measure that changes over time. To reference a time point we use [Timestamp](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html) of Pandas as the data type for temporal entities.

With **`to_datetime()`** you can create a Timestamp with a string containing a date and/or time. Pandas can infer the date and time from various date/time strings. Let's start with the present:


In [None]:
pd.to_datetime('now')

We can pass a range of date formats and Pandas will guess which numbers refer to years, months, days, hours, etc.:

In [None]:
pd.to_datetime('2024-10-07 3pm')

When expressing dates and times in written language, the order of the different entities can be ambiguous. The most frequent ambiguity concerns the order of days and months, as they are typically both expressed in one- or two-digit numbers, unlike years that tend to be expressed with four digits. However, date conventions vary across the world. For example, the following date might be interpreted differently depending on the country; it may refer to Saint Nicholas Day in 1929 or Anne Frank's birthday:

In [None]:
pd.to_datetime('12.6.1929')

To tell Pandas that the first number refers to the day, you can add the parameter **`dayfirst`**:

In [None]:
pd.to_datetime('12.6.1929', dayfirst=True)

The function `to_datetime()` can also handle a list of date strings; it then returns a `DatetimeIndex`, which is crucial for temporal indexing with Pandas.

In [None]:
sessions=["2.4.2020", "9.4.2020", "16.4.2020", "23.4.2020", "7.5.2020", "14.5.2020", "28.5.2020", "4.6.2020", "11.6.2020", "25.6.2020", "2.7.2020", "9.7.2020"]

pd.to_datetime(sessions, dayfirst=True)
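To see why the `DatetimeIndex` matters, here is a minimal sketch that uses the parsed dates as the index of a Series. The attendance numbers are made up purely for illustration; only the first three session dates come from the list above.

```python
import pandas as pd

# First three session dates from above, written day-first
sessions = ["2.4.2020", "9.4.2020", "16.4.2020"]
index = pd.to_datetime(sessions, dayfirst=True)

# Hypothetical attendance numbers, indexed by session date
attendance = pd.Series([12, 15, 11], index=index)

print(type(index).__name__)   # the parsed result is a DatetimeIndex
print(index.month.tolist())   # all three sessions are in April
print(attendance.loc[pd.Timestamp("2020-04-09")])
```

Without `dayfirst=True`, the first string would be read as February 4 rather than April 2, so it is worth checking a few parsed values by hand.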

If you want to make extra sure that the date/time string is parsed correctly and quickly, you can pass a fixed **`format`** for the date/time strings to be parsed:

In [None]:
pd.to_datetime('2020-05-07', format='%Y-%m-%d')

In [None]:
pd.to_datetime('8.5.1945 23:01', format='%d.%m.%Y %H:%M')
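A fixed `format` also makes parsing strict: strings that do not match raise an error. The following sketch, using made-up input strings, shows how the `errors='coerce'` option of `to_datetime()` turns such strings into `NaT` (not a time) instead, which can be handy for messy data.

```python
import pandas as pd

# The second entry is deliberately not a date
raw = ["7.5.2020", "not a date"]

# By default a mismatch raises a ValueError
try:
    pd.to_datetime(raw, format="%d.%m.%Y")
    raised = False
except ValueError:
    raised = True

# With errors='coerce', unparseable entries become NaT
parsed = pd.to_datetime(raw, format="%d.%m.%Y", errors="coerce")
print(parsed)
```

Whether coercing is a good idea depends on the data: silently dropping bad dates can hide problems, so it is usually worth counting the `NaT` values afterwards.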

### Load time series data

In this tutorial we will be analyzing our usual data source, the COVID-19 dataset from Our World in Data. `read_csv()` has a convenient feature: the `parse_dates` parameter lets you specify the columns containing date/time information.

In [None]:
covid_data = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv", parse_dates=['date'])

We filter the data for rows that contain German data. We also drop some columns that we do not need (`iso_code`, `continent`, `location`) or that would cause problems later in the process because of their object type (`tests_units`):

In [None]:
covid_data_de = covid_data[ covid_data.location == 'Germany' ]
covid_data_de = covid_data_de.drop(['iso_code', 'continent', 'location', 'tests_units'], axis=1, inplace=False)

In [None]:
covid_data_de.info()

In [None]:
covid_data_de = covid_data_de.set_index('date')
covid_data_de

The `DatetimeIndex` provides a few handy methods to extract temporal units such as months, days, week of the year, etc.: 

In [None]:
covid_data_de.index.year.unique()

✏️ *Try to extract any other [temporal attributes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html):*


## 🕒 2. Process


### Query time points and spans

A particularly powerful feature of the Pandas DataFrame is its indexing capability, which also works with time-based entities such as dates and times. We have already created the index above, so let's put it to use.

One useful feature of a temporal index is querying: we can quickly extract the rows for a given time point or period.

In [None]:
covid_data_de.loc['2020-10-11']

The query above is an example of partial-string indexing: although the index holds full timestamps, you can query it quickly (!) with just a date, or with an even shorter string such as a month:

In [None]:
covid_data_de.loc['2020-10']

In [None]:
covid_data_de.loc['2020-09-11': '2020-10-09']
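The following self-contained sketch illustrates both kinds of queries on a small synthetic daily series (the values are just a running counter, not real data). Note that, unlike ordinary Python slices, a date slice with `.loc` includes both endpoints.

```python
import pandas as pd

# Synthetic daily series covering the year 2020
days = pd.date_range("2020-01-01", "2020-12-31", freq="D")
toy = pd.Series(range(len(days)), index=days)

# Partial-string indexing: all days of October 2020
october = toy.loc["2020-10"]

# Date slice: both the start and the end date are included
span = toy.loc["2020-09-11":"2020-10-09"]

print(len(october), len(span))
```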

### Aggregate values along time


The DataFrame's `resample()` method provides a concise and quick way of aggregating temporally indexed data along time units. Here we create a DataFrame with summed-up values for each month, aggregated from the original dataset:

In [None]:
sums = covid_data_de.resample('M').sum()
sums
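One caveat when summing every column: some columns, such as `total_cases`, are already cumulative, so their sums are hard to interpret. As a possible alternative, the sketch below uses `agg()` to pick a different aggregation per column. It runs on a small synthetic frame (two full weeks, made-up numbers) in which `total_cases` is simply the running sum of `new_cases`.

```python
import pandas as pd

# Two full Monday-to-Sunday weeks of made-up daily values
days = pd.date_range("2021-01-04", periods=14, freq="D")
toy = pd.DataFrame({"new_cases": range(1, 15)}, index=days)
toy["total_cases"] = toy["new_cases"].cumsum()

# Sum the daily counts, but keep only the last value of the cumulative column
weekly = toy.resample("W").agg({"new_cases": "sum", "total_cases": "last"})
print(weekly)
```

The same idea should carry over to `covid_data_de`, although which aggregation fits which column there is a judgment call.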

The resample operations can also be carried out one after another. For example, we might want to know how the weekly sums vary between the quarters:

In [None]:
weekly_sums = covid_data_de.resample('W').sum()

quarterly_sums = weekly_sums.resample('Q').mean()

quarterly_sums.astype(int)


## 🥗 3. Present

Enough data processing. It's time for visualization!

### Overall trends

Next, we are going to return to the COVID-19 time series data that we loaded above. Remember that the original dataset contains daily values for many countries, resulting in too many data points to visualize at once. By default, Altair refuses to plot datasets with more than 5000 rows.

To reduce the dataset to a manageable size, we keep only the South American countries and a single year (2021), and compute monthly sums per country by grouping on `location` and a monthly `pd.Grouper`:

In [None]:
covid_data_sa = covid_data[ covid_data.continent == 'South America' ]
covid_data_sa = covid_data_sa.drop(['iso_code', 'continent', 'tests_units'], axis=1, inplace=False)  
covid_data_sa['date'] = pd.to_datetime(covid_data_sa['date'], format="%Y-%m-%d")
# Filter the data for the year 2021
covid_data_sa = covid_data_sa.loc[covid_data_sa['date'].dt.year == 2021]
# Resample the data by month and location and compute the sum
covid_data_sa = covid_data_sa.groupby(['location', pd.Grouper(key='date', freq='M')]).sum()
# Reset the index to convert the grouped columns back to regular columns
covid_data_sa = covid_data_sa.reset_index()
covid_data_sa
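To make the grouping step easier to follow, here is a minimal sketch of the same pattern on synthetic data: two hypothetical locations `A` and `B` with constant daily values. It uses the frequency `'MS'` (month start) as an assumption, because it is accepted across Pandas versions; the cell above uses `'M'`, which labels each group with the month end instead.

```python
import pandas as pd

# Made-up daily data for two hypothetical locations, January and February 2021
dates = pd.date_range("2021-01-01", "2021-02-28", freq="D")
toy = pd.concat([
    pd.DataFrame({"location": "A", "date": dates, "new_cases": 1}),
    pd.DataFrame({"location": "B", "date": dates, "new_cases": 2}),
], ignore_index=True)

# Group by location and calendar month, then sum the daily values
monthly = (
    toy.groupby(["location", pd.Grouper(key="date", freq="MS")])["new_cases"]
    .sum()
    .reset_index()
)
print(monthly)
```

The result has one row per location and month, which is the shape that the charts below expect.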



Now let's start with a scatterplot visualization of the monthly sums:

In [None]:
alt.Chart(covid_data_sa).mark_circle().encode(
    x='date:T',
    y='new_cases',
    color='location',
    tooltip='location'
).properties(
    width=800,
    height=400
)

With this our eyes can already see several patterns going on. Some are more distinct than others.

✏️ *There is a lot of overplotting going on. Reduce the `size` and `opacity` of all dots, by passing these as parameters to `mark_circle`!*

Next, we are going to connect the dots and create a line chart from this data. So basically the same code as above, except we're now using `mark_line()` instead of `mark_circle()`:

In [None]:
chart1 = alt.Chart(covid_data_sa).mark_line(opacity=0.5).encode(
    x='date:T',
    y='new_cases',
    color='location',
    tooltip='location'
).properties(
    width=800,
    height=400
)

chart1

This chart already shows a lot: we can see the jittery up and down in the new cases over time. What else do you see?


While the line chart above is truthful to the local fluctuations, it makes it hard to actually grasp the ups and downs over longer periods. Let's resample the data to monthly averages to examine the overall patterns in the data.

In [None]:
months = covid_data_sa.set_index('date').groupby('location').resample('M').mean().reset_index()

chart2 = alt.Chart(months).mark_line(interpolate='basis').encode(
    x='date:T',
    y='new_cases',
    color='location',
    tooltip='location'
).properties(
    width=800,
    height=400
)

chart2

What do you think? The fine-grained jitter is now gone and we might have lost too much detail. In fact, first downsampling the data and then including an interpolation is maybe giving it too much of a treatment (like overusing Photoshop's blur function).

One way to integrate the local and global patterns is to create a layered chart. Here we combine the line chart from before (`chart1`) with the smoothed line chart of monthly averages (`chart2`).


We combine the two line charts with the **+** operator:

In [None]:
chart1 + chart2

With this view we already get a good sense of the overall time patterns, while still seeing some of the particular variations.

✏️ *Add the `.interactive()` directive to one of these charts to make them zoomable!*

### Rolling windows

While the `resample()` method takes a broad brush and results in a reduced dataset and a chart with smooth curves, `rolling()` offers an alternative way of smoothing out local outliers without actually reducing the resolution of the dataset.

The first parameter determines the window size. With `center=True` the window is centered on the current date/time, so values in both directions are considered. Finally, `win_type` determines how the values across the window are weighted; with the `triang` option the values further away contribute less:

In [None]:
rolling = covid_data_sa.set_index('date').groupby('location').rolling(14, center=True, win_type='triang').mean().reset_index()

chart3 = alt.Chart(rolling).mark_line().encode(
    x='date:T',
    y='new_cases',
    color='location',
    tooltip='location'
).properties(
    width=800,
    height=400
)

chart1 + chart2 + chart3

Above you see the two lines for daily sums and monthly averages from the previous cell (slightly more transparent), on top of which you can see the time curve generated with a rolling window. It is quite apparent that this curve still features more pronounced dips around the end-of-year periods and elsewhere.
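To see what a centered rolling window does to individual values, here is a small sketch on a synthetic series with a single spike. It uses an unweighted window of size 3 for simplicity (no `win_type`, which would additionally require SciPy), so it is a simplified variant of the cell above rather than the same computation.

```python
import pandas as pd

# Seven made-up daily values with one spike in the middle
s = pd.Series(
    [0, 0, 0, 7, 0, 0, 0],
    index=pd.date_range("2021-01-01", periods=7, freq="D"),
)

# Centered window of 3 days: each value is averaged with its two neighbors
smooth = s.rolling(3, center=True).mean()
print(smooth)
```

The smoothed series has as many rows as the input, and the spike is spread over the neighboring days. The first and last values are `NaN` because their windows are incomplete; passing `min_periods=1` would fill them with averages over the available values.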


✏️ *Play around with different window sizes and other parameters in the first line of the cell above!*

## Small Multiples

In [None]:
from vega_datasets import data

In [None]:
code_lookup = pd.read_csv('country_lookup.csv')
countries = alt.topo_feature(data.world_110m.url, 'countries')

In [None]:
def generate_map_plot(date, covid_data):
    # Select all rows for the given date (covid_data must be indexed by date)
    date_infections = covid_data.loc[date]
    # One value per country: mean of total cases per million on that date
    country_infections = date_infections[['iso_code', 'total_cases_per_million']].groupby('iso_code').mean().reset_index()
    # Attach the numeric country codes that the map features use as 'id'
    lookup_data = country_infections.merge(code_lookup, left_on='iso_code', right_on='Alpha-3 code').rename(columns={'Numeric code': 'id'})

    # Color each country shape by its value, looked up via the numeric 'id'
    map_chart = alt.Chart(countries).mark_geoshape(
        stroke='white'
    ).encode(
        color=alt.Color('total_cases_per_million:Q', scale=alt.Scale(type='symlog')),
        tooltip=['Country:N', 'total_cases_per_million:Q']
    ).transform_lookup(
        lookup='id',
        from_=alt.LookupData(data=lookup_data, key='id', fields=['total_cases_per_million', 'Country'])
    ).project(
        type='mercator',
        scale=125,
        center=[-30,70],
        clipExtent=[[0,0], [200,150]]
    ).properties(
        width=200,
        height=150
    )

    return map_chart
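The merge step inside this function can be tried out in isolation. The sketch below uses two tiny hand-made tables: the lookup table is a hypothetical stand-in for `country_lookup.csv`, with column names assumed from the code above, and the values per million are invented.

```python
import pandas as pd

# Invented values; OWID_WRL is an aggregate code without a matching country
infections = pd.DataFrame({
    "iso_code": ["DEU", "FRA", "OWID_WRL"],
    "total_cases_per_million": [100.0, 200.0, 150.0],
})

# Hypothetical stand-in for country_lookup.csv (column names are assumptions)
code_lookup = pd.DataFrame({
    "Country": ["Germany", "France"],
    "Alpha-3 code": ["DEU", "FRA"],
    "Numeric code": [276, 250],
})

# Inner join on the three-letter code, then rename the numeric code to 'id'
merged = infections.merge(
    code_lookup, left_on="iso_code", right_on="Alpha-3 code"
).rename(columns={"Numeric code": "id"})
print(merged[["Country", "id", "total_cases_per_million"]])
```

Because `merge()` performs an inner join by default, rows without a matching country code are dropped, which conveniently removes aggregates before they reach the map.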

In [None]:
dates = ['2020-02-01', '2020-03-01', '2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01', '2020-09-01', '2020-10-01']

maps = []
covid_data = covid_data.set_index('date')

for i, date in enumerate(dates):
    maps.append(generate_map_plot(date, covid_data))


# Arrange the nine maps in a 3x3 grid: concatenate each group of three
# horizontally, then stack the resulting rows vertically
rows = [alt.hconcat(*maps[i:i + 3]) for i in range(0, len(maps), 3)]
plot = alt.vconcat(*rows)

plot

### Time spans

One of the first time visualizations was [*A Chart of Biography*](https://en.wikipedia.org/wiki/A_Chart_of_Biography) (1765) by Joseph Priestley. Let's create a similar visualization of the US presidencies since World War II. First we load the CSV file with `pd.read_csv()`:

In [None]:
presidents = pd.read_csv("http://infovis.fh-potsdam.de/temp/us_presidents.csv", parse_dates=['start', 'end'])

The following chart consists of two parts: `bars` and `labels`. The former will be the main bar chart representing the time spans of the presidencies, and the latter will add the presidents' names. This way we can position the labels right next to the bars, much nicer!

In [None]:
bars = alt.Chart(presidents).mark_bar(height=5).encode(
    x='start',
    x2='end',
    y=alt.Y('name', sort='x', axis=None),
    color='party'
)

labels = bars.mark_text(align='right', dx=-5).encode(text='name')

bars + labels

✏️ *Customize this chart. For example, you might want to change the colors associated with the parties… There was a time when orange was not the color of the Republicans…*

## Sources

Tutorials & Examples
- [Tutorial: Time Series Analysis with Pandas by Jennifer Walker](https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/)
- [Altair Interval Selection Example](https://altair-viz.github.io/gallery/interval_selection.html)

Documentation: Pandas
- [Time series / date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
- [Timestamp](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html)
- [DatetimeIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html)
- [Time-aware rolling vs. resampling](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#time-aware-rolling-vs-resampling)

