# Lesson 5: time series pt1, tricks & DIY

In this lesson we focus on longitudinal data. We have already seen line plots in one of the first lessons; they are often used to visualise longitudinal/time-series data. In this lesson we are going to:

- see why lines are a good idea for time series data
- look at two types of time series data (wide vs long)
- learn how to efficiently plot multiple lines (and in specified colors)
- find some data and Do It Yourself!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 100

# Lines are a good idea

Let's look at an example; then tell me which version you prefer and why. I'm going to load some data that some of you might be familiar with (the optional exercise from lesson 2): the weather data from Ukkel. Specifically:

- the average temperature since 2000
- the weather for the first semester (14 sep 2020 - 19 feb 2021)

In [None]:
Ukkel_avg = pd.read_csv('Ukkel_average2000.csv')
Ukkel_avg.head()

In [None]:
Ukkel_sem = pd.read_csv('Ukkel_latest.csv')
Ukkel_sem.head()

### A bit of pre-processing

To get the data in a comparable format for plotting we need to fix some things. Rather than fully reconciling the dates of the two datasets, let's keep it easy and focus on the month of January:

- convert TAVG, which is clearly not in Celsius (the code below treats it as Fahrenheit)
- select January

In [None]:
# Select only month 1 (January) and sort on day (day 1 first).
# Chaining the two steps returns a new DataFrame, which avoids the
# SettingWithCopyWarning that an in-place sort on a filtered slice can raise.
Ukkel_avg = Ukkel_avg[Ukkel_avg['month'] == 1].sort_values(by='day')
Ukkel_avg.head()

In [None]:
# Convert Fahrenheit to Celsius; store in a new column so re-running the cell does not transform twice
Ukkel_sem['TAVG_cel'] = (5/9) * (Ukkel_sem['TAVG'] - 32)
# Convert date to a 'datetime' object (to calculate and select with dates)
Ukkel_sem['DATE'] = pd.to_datetime(Ukkel_sem['DATE'])

Ukkel_sem['month'] = Ukkel_sem['DATE'].dt.month # calculate the month (2020-09-14 -> month 9)
Ukkel_sem['day'] = Ukkel_sem['DATE'].dt.day # calculate the day (2020-09-14 -> day 14)

# Select only January (month 1) and sort on day; chaining avoids an in-place sort on a filtered slice
Ukkel_sem = Ukkel_sem[Ukkel_sem['month'] == 1].sort_values(by='day')
Ukkel_sem.head()

### Making the plot

Which one do you prefer, lines or points? Why?

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, sharey=True, sharex=True)
# Top panel: the two series as points
ax1.scatter(Ukkel_avg['day'], Ukkel_avg['average_temp'], label="Average")
ax1.scatter(Ukkel_sem['day'], Ukkel_sem['TAVG_cel'], label="2021")
ax1.legend() # without this call the labels are never shown
# Bottom panel: the same two series as lines
ax2.plot(Ukkel_avg['day'], Ukkel_avg['average_temp'], label="Average")
ax2.plot(Ukkel_sem['day'], Ukkel_sem['TAVG_cel'], label="2021")
ax2.legend();

# 2 types of (time series) data: wide vs long

In [None]:
polio_vac = pd.read_csv('pol3_vacc.csv')
polio_vac.head()

In [None]:
polio_vac_long = pd.melt(polio_vac, id_vars='country', var_name='year', value_name='percentage')
polio_vac_long

In [None]:
polio_vac_long.dtypes

In [None]:
polio_vac_long['year'] = polio_vac_long['year'].astype(int)
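
To make the difference between the two layouts concrete, here is a minimal sketch on a toy table. The countries `A` and `B` and all numbers are made up for illustration; they are not values from `pol3_vacc.csv`. It shows the same `pd.melt` step as above, plus `pivot` as one possible way to go back from long to wide.

```python
import pandas as pd

# Toy WIDE table: one row per country, one column per year (made-up numbers)
wide = pd.DataFrame({
    'country': ['A', 'B'],
    '2000': [80, 60],
    '2001': [82, 65],
})

# WIDE -> LONG: one row per (country, year) observation
long = pd.melt(wide, id_vars='country', var_name='year', value_name='percentage')
long['year'] = long['year'].astype(int)  # column names were strings, so convert to int

# LONG -> WIDE again: countries as index, years as columns
wide_again = long.pivot(index='country', columns='year', values='percentage')
```

In the wide layout each year is its own column, which is easy to read as a table. In the long layout every row is a single observation, which is usually easier to filter, group and plot.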

# Plotting multiple lines (data) automatically
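
One possible way to draw a line per group without writing a `plot` call for each one is to loop over a `groupby` on long-format data. The sketch below uses a made-up toy table (countries `A` and `B`, invented percentages); the same loop should work on `polio_vac_long`, assuming it has the `country`, `year` and `percentage` columns created by the melt above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy LONG table (made-up numbers, for illustration only)
long_df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2000, 2001, 2002, 2000, 2001, 2002],
    'percentage': [80, 82, 85, 60, 65, 71],
})

fig, ax = plt.subplots()
# groupby yields one (name, sub-DataFrame) pair per country
for country, group in long_df.groupby('country'):
    ax.plot(group['year'], group['percentage'], label=country)

ax.set_xlabel('Year')
ax.set_ylabel('Percentage')
ax.legend()
```

With many groups the legend gets crowded quickly, so in practice you may want to label only a few lines of interest.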

## Something about color maps

https://matplotlib.org/stable/gallery/color/colormap_reference.html
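
As a minimal sketch of how a colormap from that reference can supply line colors: sample the colormap at evenly spaced positions between 0 and 1 and pass each sampled color to `plot`. The choice of `viridis`, the number of lines and the dummy data are arbitrary assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

n_lines = 5
cmap = plt.get_cmap('viridis')              # any name from the colormap reference
colors = cmap(np.linspace(0, 1, n_lines))   # one RGBA row per line

x = np.arange(10)  # dummy x values
fig, ax = plt.subplots()
for i, color in enumerate(colors):
    # dummy data: parallel lines shifted by i, each in its own sampled color
    ax.plot(x, x + i, color=color, label=f'line {i}')
ax.legend()
```

A sequential colormap like this one suits lines that have a natural order (for example years); for unordered categories a qualitative colormap is often the easier read.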

# Do It Yourself

Depending on the time (otherwise we continue with this next time):

1. Think of something you want to visualise with the gapminder.org data (don't think of the visualisation yet; think of the story you want to tell or the question you want answered).
2. Get the data and load it in.
3. Make a plot.
