# Week 04a Assignment weather data

Welcome to week four of the course Programming 1. Analyzing time-related data, such as estimating a seasonal effect or a year effect, can be a challenge. How do you separate the essential information from the noise? How do you apply signal analysis to noisy data? How do you make compact, useful visualizations? Python has several constructs to handle date and time data. For plotting, the relevant classes are Locators and Formatters: Locators determine where the ticks are, and Formatters control the formatting of the tick labels. For date and time data, the relevant type is the pandas datetime data type, which offers methods such as `resample` and several ways to display data at different frequencies. As a case study we will work with weather data. If you have data that fits the learning goals, you can bring your own data.
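
To make the idea of resampling to another frequency concrete, here is a minimal sketch. The temperatures are synthetic (a made-up seasonal curve plus noise, not KNMI data), and the names `daily` and `monthly` are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperatures for one year (made-up values, not KNMI data)
days = pd.date_range("2018-01-01", "2018-12-31", freq="D")
rng = np.random.default_rng(seed=0)
seasonal = 10 - 8 * np.cos(2 * np.pi * (days.dayofyear.to_numpy() - 1) / 365)
daily = pd.Series(seasonal + rng.normal(0, 2, len(days)), index=days, name="Tmean")

# Resample from daily to monthly frequency ("MS" = month start) and average
monthly = daily.resample("MS").mean()

print(monthly.round(1).head())
```

Averaging per month is itself a simple form of smoothing: the day-to-day noise disappears and the seasonal pattern remains.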

Keywords: signal processing, smoothing, resample, formatters and locators, datetime object

More to read: 

- https://fennaf.gitbook.io/bfvm22prog1/
- https://machinelearningmastery.com/time-series-data-visualization-with-python/
- https://towardsdatascience.com/how-to-plot-time-series-86b5358197d6
- https://pandas.pydata.org/docs/reference/offset_frequency.html (more about frequencies)
- https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html (all the methods of the datetime object)
- https://en.wikipedia.org/wiki/Smoothing


Learning objectives

- load, inspect and clean a dataset
- reshape dataframes to group data at a certain frequency
- apply smoothing techniques
- create useful visualisations of time series data
- maintain a development environment
- apply coding standards and FAIR principles

Please add topics you want to learn here: https://padlet.com/ffeenstra1/z9duo25d39dcgezz


## Assignment

You will organise your data into the required format and apply smoothing. In this assignment we will work with weather data from the KNMI. A subset of the weather data is available in the file `KNMI_20181231`. The data consists of daily weather data from several stations over several years. Your task is to make a plot similar to the plot below. 

<img src="../images/weather.png" alt="drawing" width="400"/>


Furthermore, the plot needs the following enhancements:

1. proper titles and ticks
2. widgets for selecting a particular year or all years
3. smoothed lines
4. a legend

Use your creativity. Consider colors, alpha settings, sizes etc. 

Learning outcomes

- load, inspect and clean a dataset 
- reformat dataframes
- apply smoothing techniques
- visualize timeseries data

The assignment consists of 6 parts:

- [part 1: load the data](#0)
- [part 2: clean the data](#1)
- [part 3: reformat data](#2)
- [part 4: smooth the data](#3)
- [part 5: visualize the data](#4)
- [part 6: Challenge](#5)

Parts 1 and 5 are mandatory; part 6 is optional (bonus).
Note that you may not copy code without referencing its source. If you copy code, you need to be able to explain it verbally, and you will not get the full score. 


NB: if you want to make a plot with more recent data, you can download data from https://openweathermap.org/api 


---

<a name='0'></a>
## Part 1: Load the data

Load either the dataset `KNMI_20181231.csv` or `KNMI_20181231.txt.tsv`. 
Preferably, read the data using a config file rather than a hard-coded data path. See https://fennaf.gitbook.io/bfvm22prog1/data-processing/configuration-files/yaml. 

- The data headers contain spaces and are not very self-explanatory. Change them into more readable ones. 
- Select the data from one station. Station 270 is near Groningen. 
- For our plot we only need the mean, minimum and maximum temperature. Of course you are welcome to select other data if you think it might be useful for your visualization. The data should look something like this:


In [1]:
import yaml
import pandas as pd
import numpy as np

In [2]:
# Import data
with open('config.yaml', 'r') as stream:
    config = yaml.safe_load(stream)
    
path = config['filepath_ass4a']

In [3]:
# Retrieve headers
with open(path, 'r') as file:
    comments = file.read()
    comments = comments.split('#')

# Locating row header
headers_raw = comments[-2]
# Splitting headers
headers_raw = headers_raw.split(',')

# Strip the surrounding whitespace from each header
headers = [header.strip() for header in headers_raw]
    
headers

['STN', 'YYYYMMDD', 'TG', 'TN', 'TX', 'SQ', 'DR', 'RH']

In [4]:
# Read the file; lines starting with '#' are skipped and blank cells (five spaces) become NaN
df = pd.read_csv(path,
                 sep=',',
                 header=None,
                 comment='#',
                 na_values='     ',
                )
df.columns = headers

In [5]:
# Rename columns
header_dict = {'YYYYMMDD': 'date',
                'TG': 'Tmean',
                'TN': 'Tmin',
                'TX': 'Tmax',
                }

df = df.rename(columns= header_dict)

df.tail()

Unnamed: 0,STN,date,Tmean,Tmin,Tmax,SQ,DR,RH
331311,391,20181227,12.0,-18.0,47.0,28.0,0.0,0.0
331312,391,20181228,7.0,-29.0,30.0,23.0,0.0,0.0
331313,391,20181229,59.0,25.0,92.0,0.0,15.0,5.0
331314,391,20181230,78.0,52.0,87.0,3.0,42.0,17.0
331315,391,20181231,89.0,74.0,97.0,0.0,0.0,-1.0


In [6]:
# Get date in the right format

df['date'] = pd.to_datetime(df['date'], format= "%Y%m%d")
# Note: datetime64[ns] always stores a time component. pandas hides it in the
# output when every value is at midnight, so no extra step is needed here.
# To reset times to midnight explicitly, use df['date'].dt.normalize().

print(df['date'].head())
df.info()


0   2001-01-30
1   2001-01-31
2   2001-02-01
3   2001-02-02
4   2001-02-03
Name: date, dtype: datetime64[ns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331316 entries, 0 to 331315
Data columns (total 8 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   STN     331316 non-null  int64         
 1   date    331316 non-null  datetime64[ns]
 2   Tmean   237566 non-null  float64       
 3   Tmin    237567 non-null  float64       
 4   Tmax    237567 non-null  float64       
 5   SQ      223728 non-null  float64       
 6   DR      223669 non-null  float64       
 7   RH      224072 non-null  float64       
dtypes: datetime64[ns](1), float64(6), int64(1)
memory usage: 20.2 MB


In [7]:
# Select station 270 (near Groningen)
df = df[df['STN'] == 270]

In [8]:
df.head()

Unnamed: 0,STN,date,Tmean,Tmin,Tmax,SQ,DR,RH
97641,270,2000-01-01,42.0,-4.0,79.0,49.0,15.0,11.0
97642,270,2000-01-02,55.0,33.0,74.0,12.0,0.0,-1.0
97643,270,2000-01-03,74.0,49.0,89.0,0.0,124.0,172.0
97644,270,2000-01-04,46.0,22.0,75.0,4.0,13.0,11.0
97645,270,2000-01-05,41.0,14.0,56.0,56.0,0.0,0.0


---

<a name='1'></a>
## Part 2: Clean the data

The data is not clean. There are empty cells in the dataframe, which need to be replaced with NaNs, and the temperature is given in tenths of a degree Celsius, which needs to be converted to degrees Celsius. The date field needs a datetime format. For visualization convenience we would like to remove the leap day (29 February). Conduct the cleaning.

In [None]:
#replace cells that contain only spaces with NaN
#change data formats
#convert temperatures to degrees Celsius
#remove the leap day (29 February)

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')</li>
    <li>regex for empty cells = `^\s*$` </li>
    <li>remove month == 2 & day == 29</li> 
</ul>
</details>
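
The hints above can be combined into one cleaning function. The following is a minimal sketch on a tiny made-up sample; the function name `clean`, the sample values and the assumption that the columns are already renamed to `date`, `Tmean`, `Tmin` and `Tmax` are illustrative, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample in the raw layout (temperatures in 0.1 degrees Celsius)
raw = pd.DataFrame({
    "date": [20160228, 20160229, 20160301, 20170228],
    "Tmean": ["42", "     ", "55", "61"],
    "Tmin": ["-4", "10", "", "33"],
    "Tmax": ["79", "90", "74", "88"],
})

def clean(df):
    """Return a cleaned copy: NaN for blanks, datetime dates, degrees Celsius, no 29 Feb."""
    # Cells that are empty or contain only whitespace become NaN
    df = df.replace(r"^\s*$", np.nan, regex=True)
    df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
    temp_cols = ["Tmean", "Tmin", "Tmax"]
    # Convert text to numbers, then from 0.1 degrees Celsius to degrees Celsius
    df[temp_cols] = df[temp_cols].apply(pd.to_numeric) / 10
    # Drop only the leap day, so every year keeps exactly 365 days
    is_leap_day = (df["date"].dt.month == 2) & (df["date"].dt.day == 29)
    return df[~is_leap_day]

cleaned = clean(raw)
print(cleaned)
```

Dropping only 29 February keeps the other 365 days of a leap year, which matters if those years should still contribute to the multi-year minimum and maximum.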

In [9]:
# Flag every row that falls in a leap year
# (the hint suggests removing only month == 2 & day == 29)
print(df.info())
df = df.copy()
df['leapyear'] = df['date'].dt.is_leap_year


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6940 entries, 97641 to 104580
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   STN     6940 non-null   int64         
 1   date    6940 non-null   datetime64[ns]
 2   Tmean   6940 non-null   float64       
 3   Tmin    6940 non-null   float64       
 4   Tmax    6940 non-null   float64       
 5   SQ      6940 non-null   float64       
 6   DR      6940 non-null   float64       
 7   RH      6940 non-null   float64       
dtypes: datetime64[ns](1), float64(6), int64(1)
memory usage: 488.0 KB
None


In [10]:
# Remove leap years. Note: this drops every row of a leap year,
# not only 29 February as the hint suggests.
df = df[~df['leapyear']]
df = df.drop(columns='leapyear')

#Test your outcome
#write code to check if you have done the above
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5110 entries, 98007 to 104580
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   STN     5110 non-null   int64         
 1   date    5110 non-null   datetime64[ns]
 2   Tmean   5110 non-null   float64       
 3   Tmin    5110 non-null   float64       
 4   Tmax    5110 non-null   float64       
 5   SQ      5110 non-null   float64       
 6   DR      5110 non-null   float64       
 7   RH      5110 non-null   float64       
dtypes: datetime64[ns](1), float64(6), int64(1)
memory usage: 359.3 KB


In [11]:
df.head()

Unnamed: 0,STN,date,Tmean,Tmin,Tmax,SQ,DR,RH
98007,270,2001-01-01,21.0,4.0,38.0,0.0,54.0,53.0
98008,270,2001-01-02,63.0,37.0,85.0,22.0,29.0,36.0
98009,270,2001-01-03,53.0,26.0,79.0,43.0,31.0,24.0
98010,270,2001-01-04,65.0,46.0,79.0,0.0,55.0,71.0
98011,270,2001-01-05,66.0,56.0,83.0,0.0,80.0,129.0


In [12]:
df.tail()

Unnamed: 0,STN,date,Tmean,Tmin,Tmax,SQ,DR,RH
104576,270,2018-12-27,57.0,53.0,62.0,0.0,9.0,2.0
104577,270,2018-12-28,71.0,58.0,81.0,0.0,0.0,0.0
104578,270,2018-12-29,85.0,69.0,102.0,0.0,14.0,18.0
104579,270,2018-12-30,80.0,68.0,90.0,0.0,14.0,5.0
104580,270,2018-12-31,87.0,82.0,97.0,38.0,0.0,-1.0


In [13]:
df['Tmean'] = df['Tmean'] / 10
df['Tmin'] = df['Tmin']   / 10
df['Tmax'] = df['Tmax']   / 10
df.head()

Unnamed: 0,STN,date,Tmean,Tmin,Tmax,SQ,DR,RH
98007,270,2001-01-01,2.1,0.4,3.8,0.0,54.0,53.0
98008,270,2001-01-02,6.3,3.7,8.5,22.0,29.0,36.0
98009,270,2001-01-03,5.3,2.6,7.9,43.0,31.0,24.0
98010,270,2001-01-04,6.5,4.6,7.9,0.0,55.0,71.0
98011,270,2001-01-05,6.6,5.6,8.3,0.0,80.0,129.0


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5110 entries, 98007 to 104580
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   STN     5110 non-null   int64         
 1   date    5110 non-null   datetime64[ns]
 2   Tmean   5110 non-null   float64       
 3   Tmin    5110 non-null   float64       
 4   Tmax    5110 non-null   float64       
 5   SQ      5110 non-null   float64       
 6   DR      5110 non-null   float64       
 7   RH      5110 non-null   float64       
dtypes: datetime64[ns](1), float64(6), int64(1)
memory usage: 359.3 KB


### Expected outcome

---

<a name='2'></a>
## Part 3: Reformat your data

- First, split the data into data from 2018 and data from before 2018, preferably as two dataframes. 
- Next, for the non-2018 data we need the minimum and the maximum value for each calendar day. For example, we look for the lowest of all 1 January minimum values, regardless of the year. 
- Create a dataframe with 365 days containing the ultimate minimum and the ultimate maximum per day. 


In [15]:
df_2018 = df[df['date'].dt.year == 2018]
df_before_2018 = df[df['date'].dt.year != 2018]

In [16]:
print(f'Original df has {df.shape[0]} entries. \n\ndf_2018 has {df_2018.shape[0]}, df_before_2018 has {df_before_2018.shape[0]}.\nTogether they have {df_2018.shape[0] + df_before_2018.shape[0]} entries.')

Original df has 5110 entries. 

df_2018 has 365, df_before_2018 has 4745.
Together they have 5110 entries.


In [17]:
df_before_2018.head()

Unnamed: 0,STN,date,Tmean,Tmin,Tmax,SQ,DR,RH
98007,270,2001-01-01,2.1,0.4,3.8,0.0,54.0,53.0
98008,270,2001-01-02,6.3,3.7,8.5,22.0,29.0,36.0
98009,270,2001-01-03,5.3,2.6,7.9,43.0,31.0,24.0
98010,270,2001-01-04,6.5,4.6,7.9,0.0,55.0,71.0
98011,270,2001-01-05,6.6,5.6,8.3,0.0,80.0,129.0


In [19]:
df_test.head()

Unnamed: 0,STN,date,Tmean,Tmin,Tmax,SQ,DR,RH,month,year,day
98007,270,2001-01-01,2.1,0.4,3.8,0.0,54.0,53.0,1,2001,1
98008,270,2001-01-02,6.3,3.7,8.5,22.0,29.0,36.0,1,2001,2
98009,270,2001-01-03,5.3,2.6,7.9,43.0,31.0,24.0,1,2001,3
98010,270,2001-01-04,6.5,4.6,7.9,0.0,55.0,71.0,1,2001,4
98011,270,2001-01-05,6.6,5.6,8.3,0.0,80.0,129.0,1,2001,5


In [20]:
df_test.groupby(['month', 'day']).min()

month,day,STN,date,Tmean,Tmin,Tmax,SQ,DR,RH,year
1,1,270,2001-01-01,-1.9,-5.8,1.0,0.0,0.0,-1.0,2001
1,2,270,2001-01-02,-3.4,-7.5,-1.4,0.0,0.0,-1.0,2001
1,3,270,2001-01-03,-4.1,-12.6,0.1,0.0,0.0,-1.0,2001
1,4,270,2001-01-04,-3.8,-6.7,-0.2,0.0,0.0,-1.0,2001
1,5,270,2001-01-05,-2.3,-6.2,0.8,0.0,0.0,-1.0,2001
...,...,...,...,...,...,...,...,...,...,...
12,27,270,2001-12-27,-1.7,-6.0,0.6,0.0,0.0,0.0,2001
12,28,270,2001-12-28,-2.9,-7.4,0.3,0.0,0.0,-1.0,2001
12,29,270,2001-12-29,-4.0,-7.3,-2.6,0.0,0.0,0.0,2001
12,30,270,2001-12-30,-1.9,-6.7,-0.4,0.0,0.0,0.0,2001


In [21]:
# Group df_2018 by month and day to get the Tmean

df_test = df_2018.copy()

df_test['month'] = df_test['date'].dt.month
df_test['year'] = df_test['date'].dt.year
df_test['day'] = df_test['date'].dt.day

# df_test = df_test.set_index([#'year',
#                              'month', 
#                              'day',
#                             ]).sort_index()

df_groupedbymonthday_2018 = df_test.groupby(['month', 'day'])


In [22]:
# Select the column first so only Tmean is aggregated
ser_Tmean_2018 = df_groupedbymonthday_2018['Tmean'].mean()

In [23]:
ser_Tmean_2018

month  day
1      1      6.0
       2      5.6
       3      7.5
       4      7.3
       5      6.0
             ... 
12     27     5.7
       28     7.1
       29     8.5
       30     8.0
       31     8.7
Name: Tmean, Length: 365, dtype: float64

In [24]:
# Groupby on df_before_2018 to get the Tmin and Tmax

df_test = df_before_2018.copy()

df_test['month'] = df_test['date'].dt.month
df_test['year'] = df_test['date'].dt.year
df_test['day'] = df_test['date'].dt.day

# df_test = df_test.set_index([#'year',
#                              'month', 
#                              'day',
#                             ]).sort_index()


df_groupedbymonthday_before_2018 = df_test.groupby(['month', 'day'])


In [25]:
# Lowest Tmin per calendar day over all years before 2018
ser_Tmin = df_groupedbymonthday_before_2018['Tmin'].min()
ser_Tmin

month  day
1      1      -5.8
       2      -7.5
       3     -12.6
       4      -6.7
       5      -6.2
              ... 
12     27     -6.0
       28     -7.4
       29     -7.3
       30     -6.7
       31     -5.1
Name: Tmin, Length: 365, dtype: float64

In [26]:
# Highest Tmax per calendar day over all years before 2018
ser_Tmax = df_groupedbymonthday_before_2018['Tmax'].max()

In [28]:
# Both series share the (month, day) index, so they can be placed side by side
df_over_years = pd.concat([ser_Tmin, ser_Tmax], axis=1)

In [None]:
# def month_day(df_multipleyears):
#     #your code to reform data here
    
#     df_test = df_multipleyears.copy()

#     # Select grunn
#     df_test = df_test[df_test['STN'] == 270]



#     df_groupedbymonthday = df_test.groupby(['month', 'day'])
    
#     print(df_groupedbymonthday.min())

In [None]:
# #Test your code
# def test_reformed(df):
#     #
#     df = df[(df.index.year > 2007) & (df.index.year < 2018)]
#     month_day(df)

# test_reformed(df_test)

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>use the dt.month and dt.day to groupby</li>
</ul>
</details>

### Expected outcome
Note: the layout or names may differ, but the length should be 365 and the minimum values should be the same.
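
As a compact illustration of the grouping hint, the sketch below computes both extremes in a single `groupby` call. The function name `month_day_extremes` and the demo data (random values for three non-leap years) are made up for the example; with the real data the input would be the cleaned dataframe from before 2018.

```python
import numpy as np
import pandas as pd

def month_day_extremes(df):
    """Return the lowest Tmin and highest Tmax per calendar day, indexed by (month, day)."""
    grouped = df.groupby([df["date"].dt.month.rename("month"),
                          df["date"].dt.day.rename("day")])
    return grouped.agg(Tmin=("Tmin", "min"), Tmax=("Tmax", "max"))

# Made-up data for three non-leap years, only to show the shape of the result
dates = pd.date_range("2001-01-01", "2003-12-31", freq="D")
rng = np.random.default_rng(seed=1)
demo = pd.DataFrame({"date": dates,
                     "Tmin": rng.normal(5, 4, len(dates)),
                     "Tmax": rng.normal(14, 4, len(dates))})

extremes = month_day_extremes(demo)
print(extremes.shape)
```

Grouping on the `dt.month` and `dt.day` series directly avoids adding helper columns to the dataframe.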

---

<a name='3'></a>
## Part 4: Smooth the data

Make a function that takes an array or a dataframe column and returns an array of smoothed data. Explain in words why you chose a certain smoothing algorithm. Ask the signal analysis teacher if you want some advice.


In [None]:
#your code here
#your motivation here
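
One possible smoother, shown only as a sketch: a centred moving average. It is not the required answer; the function name `smooth`, the default window of 15 days and the edge padding are assumptions chosen for illustration. A moving average is easy to explain and keeps the seasonal shape, but it flattens short peaks and propagates NaN values.

```python
import numpy as np

def smooth(values, window=15):
    """Centred moving average; returns an array with the same length as the input."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    # Repeat the edge values so the first and last days are not pulled towards zero
    padded = np.pad(values, half, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

# Made-up noisy seasonal signal to try the function on
rng = np.random.default_rng(seed=2)
signal = 10 - 8 * np.cos(2 * np.pi * np.arange(365) / 365)
noisy = signal + rng.normal(0, 2, 365)
smoothed = smooth(noisy, window=15)
```

A larger window gives a smoother line at the cost of detail, so the window size is worth motivating in your answer.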

---

<a name='4'></a>
## Part 5: Visualize the data

Plot the mean temperature of the year 2018. Create a shaded band with the ultimate minimum values and the ultimate maximum values from the multi-year dataset. Add labels, titles and legends. Use proper ranges. Be creative to make the plot attractive. 



<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>use from bokeh.models import Band</li>
    <li>use ColumnDataSource to parse data arrays</li>
    <li>look for xaxis tick formatters</li>
</ul>
</details>
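
The hints can be sketched as follows. The three curves are made up so that the example runs on its own; in the assignment they would come from the 2018 mean and the multi-year minimum and maximum. The column names `date`, `mean`, `lower` and `upper` and the colours are assumptions.

```python
import numpy as np
import pandas as pd
from bokeh.models import Band, ColumnDataSource
from bokeh.plotting import figure

# Made-up curves on a 365-day axis; replace them with your own mean, min and max
days = pd.date_range("2018-01-01", "2018-12-31", freq="D")
mean = 10 - 8 * np.cos(2 * np.pi * np.arange(len(days)) / 365)
source = ColumnDataSource(data={
    "date": days.to_numpy(),
    "mean": mean,
    "lower": mean - 10,
    "upper": mean + 10,
})

p = figure(title="Mean temperature in 2018 against the multi-year range",
           x_axis_type="datetime",
           x_axis_label="Date",
           y_axis_label="Temperature (°C)")
p.line(x="date", y="mean", source=source, line_width=2, legend_label="Mean 2018")

# The Band fills the area between the 'lower' and 'upper' columns along 'date'
band = Band(base="date", lower="lower", upper="upper", source=source,
            fill_alpha=0.3, fill_color="lightblue", line_color="steelblue")
p.add_layout(band)
```

Call `show(p)` after `output_notebook()` to display the figure. Because the line and the band read from the same `ColumnDataSource`, updating that source later updates both.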

---

In [None]:
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot

from bokeh.plotting import ColumnDataSource

from bokeh.models import Band
output_notebook()

In [None]:
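### WORK IN PROGRESS: I AM HERE ###

# One possible choice: x is the day of the year (1..365), y is the 2018 daily mean
p = figure(title= 'Mean temperatures of 2018', 
           x_axis_label= 'Day of the year', 
#            x_axis_type= 'datetime',
           y_axis_label= 'Temperature (°C)',
          )
p.line(x=np.arange(1, len(ser_Tmean_2018) + 1), 
       y=ser_Tmean_2018.to_numpy(), 
      )

show(p)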


<a name='5'></a>
## Part 6: Challenge

Make a widget in which you can select the year range for the multi-year set, or a widget where you choose a different station. Add this to your layout to make the plot interactive. Add another widget to select or deselect the smoother. Inspiration: https://demo.bokeh.org/weather
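
As a starting point, the sketch below pairs a plain pandas filter with a Bokeh `RangeSlider`. The names `select_years`, `year_slider` and `on_year_change` and the demo dataframe are hypothetical. Note that Python callbacks registered with `on_change` only run when the plot is served by a Bokeh server; in a static notebook output they do not fire.

```python
import pandas as pd
from bokeh.models import RangeSlider

def select_years(df, first_year, last_year):
    """Keep only the rows whose date falls within first_year..last_year (inclusive)."""
    years = df["date"].dt.year
    return df[(years >= first_year) & (years <= last_year)]

# Made-up dataframe with only a date column, to demonstrate the filter
demo = pd.DataFrame({"date": pd.date_range("2000-01-01", "2018-12-31", freq="D")})

year_slider = RangeSlider(title="Years for the min/max band",
                          start=2000, end=2017, value=(2008, 2017), step=1)

def on_year_change(attr, old, new):
    """Bokeh server callback: select the rows for the chosen year range."""
    first_year, last_year = (int(v) for v in new)
    subset = select_years(demo, first_year, last_year)
    # In a real app: recompute the per-day minimum and maximum from `subset`
    # and assign the result to the ColumnDataSource behind the band.
    print(f"{len(subset)} rows selected")

year_slider.on_change("value", on_year_change)
```

Keeping the filtering in a separate function makes it testable without a running server, and the same function can be reused for a station selector.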