# Introduction

The exercises presented here offer a practical introduction to general and useful data handling steps.

The intention is to practice and become more comfortable with tools commonly used during such processes.

# Table of Contents

1. Generated Sample Data
2. Glucose Time Series

# Setup

Imported/required packages:

- [numpy](numpy.org)
- [pandas](pandas.pydata.org)
- [matplotlib](matplotlib.org)
- [scipy](scipy.org)

__Note:__ make sure to look at the websites of the required packages, including their documentation and resources. These libraries are heavily used when dealing with data in a Python environment.

For additional, more general information, see:
- [Book: Python for Everybody](py4e.com/book.php)
- [Book: Python Data Science Handbook](jakevdp.github.io/PythonDataScienceHandbook/)
- [Tutorials, Exercises](pynative.com)

In [None]:
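from importlib.metadata import version

# requirements list
dependencies = [
  'numpy',
  'pandas',
  'scipy',
  'matplotlib',
]

# requirements check: raises PackageNotFoundError if a package is not installed
for package in dependencies:
    print(package, version(package))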

In [None]:
import numpy as np
import pandas as pd
from scipy.ndimage import gaussian_filter1d
from matplotlib import pyplot as plt

#plt.style.use('seaborn')
plt_fig_size = (6, 4)

# 1. Generated Sample Data

## Data

### Generate

Data can come in different flavors and structures. For now, let us generate our own data. The following exercise makes use of data generated from an arbitrary function.

**EXERCISE** 
<br>
_Use the function $y = x \sin(x^2) + 1$ to generate an output of $100$ data points with $-2 \leq x \leq 3$._

In [None]:
# custom function: (y = x * sin(x^2) + 1 | dy/dx = sin(x^2) + 2x^2 * cos(x^2)) -> en.wikipedia.org/wiki/Derivative
x = 
y = 

# data
data = np.column_stack((x, y))

Functions usually expect their input data in a specific format. An array/matrix-based format is the most common choice, which leads to the extensive use of **NumPy**'s [n-dimensional arrays](numpy.org/doc/stable/reference/arrays.ndarray.html).
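
As a small illustration of the array format, the sketch below stacks two toy arrays into a two-column array and slices it. The names `a`, `b`, and `stacked` and their values are made up for this example and are not part of the exercise data.

```python
import numpy as np

# two toy 1-D arrays (hypothetical values)
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# column_stack places each 1-D array in its own column
stacked = np.column_stack((a, b))

print(stacked.shape)   # (rows, columns)
print(stacked[:2])     # first two rows
print(stacked[:, 1])   # second column only
```

Rows are selected with the first index and columns with the second, which is the convention used by most functions that take array input.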

**EXERCISE** 
<br>
_Output the $10$ first entries of `data`._

In [None]:
# numpy n-dimensional array


However, array/matrix-formatted data is harder to read than a table-like format. For this reason, __pandas__'s [_DataFrame_](pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) can be used to store, and also display, data in a more readable structure.
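
A minimal sketch of the idea, using a made-up array `arr` and made-up column names `a` and `b` (none of these come from the exercise):

```python
import numpy as np
import pandas as pd

# toy 2-D array (hypothetical values)
arr = np.array([[0.0, 1.0], [0.5, 1.2], [1.0, 1.8]])

# wrap the array in a DataFrame and label its columns
table = pd.DataFrame(arr, columns=['a', 'b'])

# head(n) returns the first n rows as a new DataFrame
print(table.head(2))
```

In a notebook, leaving `table.head(2)` as the last expression of a cell renders it as a formatted table instead of plain text.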

**EXERCISE** 
<br>
Using `data`:
1. _Create a DataFrame and store it in a variable called `df`._
2. _Name its columns `x` and `y`._
3. _Display the first $10$ entries of the DataFrame._

In [None]:
# DataFrame
df = 

# display the first 10 entries of df


**EXERCISE** 
<br>
_Plot the data contained in the DataFrame._

__Tip:__ plot the entire raw data using the function outputs as the `y` values and the data point indices as the `x` values.

In [None]:
# use matplotlib to plot here


## Filtering

Raw data can come with a high degree of variation and/or noise. Applying a filter smooths it, which can make visualization easier.

In this exercise, a [_rolling_](pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html) median and a [_Gaussian_](docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter1d.html) filter will be used.

### Gaussian and rolling window

**EXERCISE** 
<br>
_Add two more columns to the existing `df` DataFrame, one for each filter (rolling median and Gaussian)._

In [None]:
# filtering
df['filtered_median'] = 
df['filtered_gaussian'] = 

# display df (e.g., first 10 entries) with the new columns


__Note:__ other aggregations, such as _mean_, can be used with `rolling()`, so make sure to look for different ones and try them out.
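
To see how the choice of aggregation matters, the sketch below applies a rolling median and a rolling mean to a toy series with one outlier. The series `s` and the window size of 3 are arbitrary choices for illustration.

```python
import pandas as pd

# toy series with a single outlier (hypothetical values)
s = pd.Series([1.0, 2.0, 100.0, 4.0, 5.0])

# window of 3 samples; the first two results are NaN
# because the window is not yet full
rolling_median = s.rolling(window=3).median()
rolling_mean = s.rolling(window=3).mean()

print(pd.DataFrame({'raw': s, 'median': rolling_median, 'mean': rolling_mean}))
```

The median ignores the outlier as long as it is a minority within the window, while the mean is pulled towards it in every window that contains it.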

**EXERCISE** 
<br>
_Plot the previous filtered values together with the raw data._

In [None]:
# plot


Both filters can be tweaked, for instance by varying the window size for the rolling filter and/or using the mean instead of the median, and by using a different $\sigma$ value for the _Gaussian_ filter.
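
The effect of $\sigma$ can be checked numerically. This sketch assumes a made-up noisy sine signal and measures roughness as the standard deviation of consecutive differences, which is just one possible measure chosen for this example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
t = np.linspace(0, 2 * np.pi, 200)
noisy = np.sin(t) + rng.normal(scale=0.3, size=t.size)

# sigma is expressed in number of samples, not in units of t
smooth_small = gaussian_filter1d(noisy, sigma=1)
smooth_large = gaussian_filter1d(noisy, sigma=5)

# roughness: standard deviation of the point-to-point differences
for name, values in [('raw', noisy), ('sigma=1', smooth_small), ('sigma=5', smooth_large)]:
    print(name, np.diff(values).std())
```

A larger $\sigma$ averages over more neighbouring samples, so the output is smoother but short-lived variations are flattened as well.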

## Identifying changes

The tools used here help identify special values and variations within the data. For that purpose, let us start with __pandas__ [`diff()`](pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html).
<br>
This method evaluates the differences between consecutive values in a Series, or in each column of a DataFrame.

**EXERCISE** 
<br>
_Add a new column to `df` called `y_diff` that has the output of `diff()` applied to `y`._

In [None]:
df['y_diff'] = 

# display df with the new column


__Note:__ the very first value of the newly added column `y_diff` is set to `NaN`. This is because __pandas__ cannot evaluate the difference for the first value, as there is no previous value.
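
The behaviour described in the note can be reproduced on a toy series (the values in `s` are made up):

```python
import pandas as pd

s = pd.Series([3, 5, 4, 10])

# each entry is the current value minus the previous one
d = s.diff()
print(d)

# the same result, written explicitly with shift()
manual = s - s.shift(1)
print(manual)
```

The first entry is `NaN` in both cases, since `shift(1)` has nothing to place in the first position.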

**EXERCISE**
<br>
_Using the values in the previously added column, answer: which data points show the largest decrease and the largest increase relative to the previous data point?_
_For that:_
<br>
1. _Get the index of the entry with the lowest value of `y_diff`._
2. _Get the index of the entry with the highest value of `y_diff`._

In [None]:
# lowest value index
idx_min_diff = 

In [None]:
# highest value index
idx_max_diff = 

**EXERCISE**
<br>
_Now, show the values of both the lowest and the highest differences._

In [None]:
# lowest


In [None]:
# highest


## Using derivatives for trend identification

The previous approach using `diff()` might be enough to see how the data varies over time (increasing, decreasing). This becomes visible when plotting the data.

**EXERCISE**
<br>
_Plot both `y` and `y_diff` together._

In [None]:
# plot


However, the more volatile the data is, the harder such identification becomes. In that case, approximated derivatives are a useful tool.

**EXERCISE**
<br>
_Calculate such approximation:_


1. _Using the already evaluated differences._

In [None]:
# approximated derivative
df['deriv'] = 

2. _Using __NumPy__ [`gradient()`](Numpy.org/doc/stable/reference/generated/numpy.gradient.html)._

In [None]:
# approximated derivative
df['deriv_gradient'] = 

3. _Applying the already mentioned _Gaussian_ filter, but now using `order=1`._

In [None]:
# approximated derivative
df['deriv_filtered_gaussian'] = 

__NOTE__: the `order` parameter sets the derivative order used for the filter approximation, which means that by using a first order filter (`order=1`) we are getting an approximation to the 1st derivative. The default value is `order=0`, which can be seen as an approximation to the original function.
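
The sketch below compares two derivative approximations against a known derivative, using $\sin(x)$ as a toy signal (not the exercise function). The value `sigma=3` and the 20-sample margin excluded at each edge are arbitrary choices for this example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

x = np.linspace(0, 2 * np.pi, 500)
dx = x[1] - x[0]          # uniform sample spacing
y = np.sin(x)             # exact derivative is cos(x)

# np.gradient accepts the spacing, so the result is already dy/dx
deriv_gradient = np.gradient(y, dx)

# gaussian_filter1d works in units of samples,
# so the result is divided by dx to obtain dy/dx
deriv_gaussian = gaussian_filter1d(y, sigma=3, order=1) / dx

# compare with the exact derivative, ignoring the edges
inner = slice(20, -20)
err_gradient = np.abs(deriv_gradient[inner] - np.cos(x[inner])).max()
err_gaussian = np.abs(deriv_gaussian[inner] - np.cos(x[inner])).max()
print(err_gradient, err_gaussian)
```

Dividing by the sample spacing matters whenever the `x` values are not spaced by exactly 1; without it, the filter output is a derivative per sample rather than per unit of `x`.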

**EXERCISE**
<br>
_Plot the approximated derivative values, each one in a subplot._

In [None]:
# plot


# 2. Glucose Time Series

## Data

For the next exercises, real, previously collected data will be used. The values are blood glucose readings collected from one person over several days.

### Load

The data is distributed in an `.xml` file whose nodes represent specific events. XML is a declarative, container-like structure, and the file used here holds the observations that make up the data to be handled: events with a time and a blood glucose value.
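
As a rough sketch of how such nodes can be read with the standard library, the example below parses a small XML string. The element names `patient` and `glucose_level`, the timestamp format, and the values are assumptions made for illustration; the layout of the real file may differ.

```python
import xml.etree.ElementTree as ET

# hypothetical XML snippet; only the <event ts="..." value="..."/> shape
# is taken from the exercise text
xml_text = """
<patient id="0">
  <glucose_level>
    <event ts="01-01-2021 00:00:00" value="101"/>
    <event ts="01-01-2021 00:05:00" value="98"/>
  </glucose_level>
</patient>
"""

root = ET.fromstring(xml_text)

# path query: all <event> children of <glucose_level>
events = root.findall('./glucose_level/event')

for node in events:
    # attributes are returned as a dict of strings
    print(node.attrib)
```

For a file on disk, `ET.parse(path).getroot()` returns the root element in the same way. Note that every attribute value is a string, including the numeric ones.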

**EXERCISE**
<br>
1. _Extract from `training.xml` the nodes containing glucose events, i.e., all the `<event ts="..." value="..."/>` nodes._
2. _Store it in `glucose_event_nodes`._

In [None]:
import xml.etree.ElementTree as ET

# parse XML data
patient_xml_root = 
glucose_event_nodes = 

### Preprocessing

With the `.xml` nodes in hand, the data must now be put into a form that is easier to handle. For that, as already covered, let us create from this data a structure that holds it and makes it easier to work with.

**EXERCISE**
<br>
_For that:_
1. _From the event nodes, create a DataFrame (assign it to `df`) containing columns for the timestamp (name it `ts`) and the blood glucose values (name it `value`)._

In [None]:
df = 

2. _Display the type of the columns of `df`._

In [None]:
# column types


In [None]:
# [optional] use a df information/summary display method


__Note:__ When creating a _DataFrame_ coming from a more diverse data source, it is not guaranteed that __pandas__ will set the types of the columns correctly. So it is important to always check and then set the types properly.
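
The sketch below shows the issue on a toy DataFrame built from strings, which is how values read from text sources typically arrive. The name `toy` and its contents are made up; real timestamps may need an explicit `format` argument in `pd.to_datetime`.

```python
import pandas as pd

# toy data where every value is a string (hypothetical values)
toy = pd.DataFrame({
    'ts': ['2021-01-01 00:00:00', '2021-01-01 00:05:00'],
    'value': ['101', '98'],
})
print(toy.dtypes)  # neither column is numeric or datetime yet

# explicit conversions
toy['value'] = pd.to_numeric(toy['value'])
toy['ts'] = pd.to_datetime(toy['ts'])
print(toy.dtypes)
```

Until the conversion is done, operations such as `mean()` or date-based selection either fail or behave as string operations.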

3. _Set the columns to their proper types: `value` to numeric, and `ts` to datetime. Check the types afterwards._

In [None]:
df['value'] = 
df['ts'] = 

In [None]:
# display column types and/or df info


As a time series is a sequence of values gathered continuously through time, it makes sense to associate each data point to a moment in time instead of an integer index.
Thus, to get the best out of __pandas__ when dealing with time series, let us set `ts` as `df`'s index.

4. _Set `ts` column as the index of `df`._

In [None]:
# set 'ts' as the index of df


Now that a properly set _DataFrame_ is stored, an extra step can be done to persist this for further usage.

**EXERCISE**
<br>
_Store the data of `df` in a `.csv` file called `glucose.csv`._

In [None]:
# create a .csv file to be used on further steps


__Note:__ Persisting data in an interchangeable format such as `.csv` facilitates further usage, and is therefore a very common part of the preprocessing step.
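
One caveat is that `.csv` stores plain text, so column types and the index are not preserved automatically. The sketch below round-trips a toy DataFrame through an in-memory buffer, used here as a stand-in for a file on disk; the data is made up.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {'value': [101, 98]},
    index=pd.DatetimeIndex(['2021-01-01 00:00:00', '2021-01-01 00:05:00'], name='ts'),
)

buffer = io.StringIO()  # in-memory stand-in for a .csv file
toy.to_csv(buffer)

# plain read: 'ts' comes back as an ordinary text column
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain.dtypes)

# index_col restores the index, parse_dates=True parses it as datetime
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='ts', parse_dates=True)
print(restored.index.dtype)
```

Passing `index_col` and `parse_dates` at read time avoids having to convert the column and reset the index in separate steps afterwards.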

## Selecting parts of the time series (day)

Time series are commonly long and continuous streams of data. Very often, though, only specific parts of them are needed for analysis.
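
When a DataFrame has a datetime index, __pandas__ allows selecting rows by a partial date string. The sketch below uses a made-up DataFrame `toy` with two readings per day over three days.

```python
import pandas as pd

# hypothetical readings, two per day over three days
idx = pd.to_datetime([
    '2021-01-01 00:00', '2021-01-01 12:00',
    '2021-01-02 00:00', '2021-01-02 12:00',
    '2021-01-03 00:00', '2021-01-03 12:00',
])
toy = pd.DataFrame({'ts': idx, 'value': [100, 110, 120, 130, 140, 150]})
toy = toy.set_index('ts')

# partial-string indexing: all rows that fall on one calendar day
day_two = toy.loc['2021-01-02']
print(day_two)

# distinct calendar days present in the index
days = toy.index.normalize().unique()
print(days)
```

The list of distinct days can then be indexed by position to find, for example, the date of the n-th day in the series.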

**EXERCISE**
<br>
_Use indexing to isolate:_

1. _The data of the 5th day in the blood glucose time series (display and plot)._

In [None]:
# display


In [None]:
# plot


2. _The data of the day with the first occurrence of the highest blood glucose value in the whole time series (display and plot)._

In [None]:
# display


In [None]:
# plot

## Filtering

Here, let us apply to the blood glucose time series the same filters used in the generated data from Section 1.

**EXERCISE**
<br>
1. _Use the stored `glucose.csv` to reload the whole time series data into `df`._

In [None]:
df = 

In [None]:
# [optional] display df information


__TIP:__ Make sure to check and set the types, as well as set `ts` as the index.

2. _Apply the previously covered filters to the whole time series (display and plot)._

In [None]:
# filtering
df['filtered_median'] = 
df['filtered_gaussian'] = 

In [None]:
# display


In [None]:
# plot


3. _Show the 5th day of the time series including the filtered values (display and plot)._

In [None]:
# display


In [None]:
# plot


__Note:__ the _Gaussian_ filter used will not capture variations within small windows. This is because the $\sigma$ used is not sensitive enough to catch such variations. Try using $\sigma < 5$.

## Identifying changes

**EXERCISE**
<br>
_Use the same techniques previously mentioned (`diff()`, approximated derivatives) to identify the variations in the time series._

In [None]:
# use diff()


## Using derivatives for trend identification

**EXERCISE**
<br>
_Use the approximated derivatives to identify different trends._

In [None]:
# use the derivative approximations
