# Exercise 3.2 - Daily distance


## Introduction 
In this exercise you are going to analyze the daily distance of our vehicle, i.e., the distance the vehicle covers in one day. The daily distance is an established characteristic value because it is simple and readily available. It reflects the level and regularity of utilization and therefore allows first conclusions about the vehicle’s movement patterns.

## Preparation
First of all we need to import all necessary packages and modules:
* pandas (pandas dataframes)
* numpy (numpy arrays as well as various mathematical methods)
* matplotlib.pyplot (plotting)
* register_matplotlib_converters (datetime converter for a matplotlib plotting method)

Note:<br/>
In order to use datetime objects in plots, a datetime converter is necessary. Currently, the datetime converter is registered implicitly by pandas on import. However, future versions of pandas will require you to register matplotlib converters explicitly. This is why we import and register the datetime converter manually.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters

Further, we want to set the font size of all plots globally and we need to register the datetime converter. 

In [None]:
# set font size of all plots globally
plt.rcParams.update({'font.size': 16})

# register datetime converter for a matplotlib plotting method
register_matplotlib_converters()

## Data import

In [None]:
data_df = pd.read_pickle('data/e32_data_df.pkl')

## Available data
The data available for the following exercises is a pandas dataframe called data_df with the following columns:
* day: day as datetime64
* daily_distance: distance in m covered by the vehicle in one day

In [None]:
display(data_df)

## Exercise 3.2.1 Calculate statistical measures

### Task
First of all, we are going to calculate some statistical measures for later use. The five-number summary is a proven method to describe a numerical data set statistically.
1. Calculate all statistical measures of the five-number summary.
2. Ensure that all quantities are scaled to km in order to increase readability.

##### Signature of the script
The signature of the script defines the interfaces (INPUT, OUTPUT) of the current task within this notebook. How you get from INPUT to OUTPUT is up to you.
* INPUT: Pandas dataframe data_df
* OUTPUT: Scalar values minimum, lower_quartile, median, upper_quartile and maximum

###### Reminder
The five-number summary is a common set of statistical measures. It consists of the following five values:
* The minimum value of a data set
* The lower quartile of a data set
* The median of a data set
* The upper quartile of a data set
* The maximum value of a data set
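
As a small illustration of the list above, the following sketch computes all five measures in one call. The sample values are made up for illustration and are not taken from data_df. `np.percentile` with its default linear interpolation is only one of several common quartile conventions.

```python
import numpy as np

# hypothetical daily distances in m (illustrative values, not from data_df)
sample_m = np.array([2000.0, 4000.0, 6000.0, 8000.0, 10000.0])

# scale to km before computing the statistics
sample_km = sample_m / 1000

# the percentiles 0, 25, 50, 75 and 100 correspond to
# minimum, lower quartile, median, upper quartile and maximum
five_numbers = np.percentile(sample_km, [0, 25, 50, 75, 100])

print(five_numbers)
```

For this sample the five values are 2, 4, 6, 8 and 10 km.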

##### Hint
Use numpy (https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.statistics.html) to calculate the five-number summary.

##### Solution
minimum = 4.194<br/>
lower_quartile = 29.7685<br/>
median = 84.403<br/>
upper_quartile = 126.7895<br/>
maximum = 353.66

### Your code goes here:

In [None]:
#<<solution>>
# convert data to numpy arrays
daily_distance = data_df['daily_distance'].values
days = data_df['day'].values

# scale values to km
daily_distance = daily_distance / 1000

# calculate minimum
minimum = np.min(daily_distance)

# calculate lower quartile
lower_quartile = np.quantile(daily_distance, 0.25)

# calculate median
median = np.median(daily_distance)

# calculate upper quartile
upper_quartile = np.quantile(daily_distance, 0.75)

# calculate maximum
maximum = np.max(daily_distance)

# print results
print('minimum =', minimum)
print('lower_quartile =', lower_quartile)
print('median =', median)
print('upper_quartile =', upper_quartile)
print('maximum =', maximum)
#<</solution>>

## Exercise 3.2.2 Plot daily distances over days

### Task
In this task we want to plot the daily distances over the days. Our aim is to gain a first impression of our vehicle's movement patterns. We are going to use a bar chart for this purpose. In addition, we want to visualize the five-number summary using horizontal lines within the same plot.
1. Create a plot as depicted below.

##### Signature of the script
The signature of the script defines the interfaces (INPUT, OUTPUT) of the current task within this notebook. How you get from INPUT to OUTPUT is up to you.
* INPUT: Pandas dataframe data_df, scalar values minimum, lower_quartile, median, upper_quartile and maximum
* OUTPUT: Plot as depicted below

###### Reminder
-

##### Hint
-

##### Solution
![title](data/img/solution_e412.png)

### Your code goes here:

In [None]:
#<<solution>>
# create figure
fig, ax = plt.subplots(figsize=(15, 10))

# set title
plt.title('Daily distances over days')

# plot daily distances as bar plot
ax.bar(days, daily_distance, label='Daily distance')

# plot statistical measures as horizontal lines
plt.hlines(maximum, days.min(), days.max(), label='Maximum', color='red', linestyle='-')
plt.hlines(upper_quartile, days.min(), days.max(), label='Upper quartile', color='grey', linestyle='-.')
plt.hlines(median, days.min(), days.max(), label='Median', color='orange', linestyle='-')
plt.hlines(lower_quartile, days.min(), days.max(), label='Lower quartile', color='grey', linestyle='--')
plt.hlines(minimum, days.min(), days.max(), label='Minimum', color='green', linestyle='-')

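# rotate the tick labels of the x-axis for better readability
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45)

# add legend
ax.legend(loc='upper right')

# label axes
plt.ylabel('Daily distance in km')
plt.xlabel('Day')

# draw horizontal grid lines behind the bars
ax.set_axisbelow(True)
ax.grid(axis='y')

plt.show()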
#<</solution>>

## Exercise 3.2.3 Compare histogram and CDF

### Task
As you already know, a cumulative distribution function (CDF) has some clear advantages over the classic histogram. In order to compare both visualization techniques, we now want to plot both within one matplotlib figure.
1. Create a plot as depicted below.
2. Can you confirm all claims of the article "Why we love the CDF and do not like histograms that much?" published here: https://www.andata.at/en/software-blog-reader/why-we-love-the-cdf-and-do-not-like-histograms-that-much.html

##### Signature of the script
The signature of the script defines the interfaces (INPUT, OUTPUT) of the current task within this notebook. How you get from INPUT to OUTPUT is up to you.
* INPUT: Pandas dataframe data_df
* OUTPUT: Plot as depicted below

###### Reminder
-

##### Hint
* Use the subplot function (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.subplot.html) to place two plots within one figure
* In order to get the same plot styling as depicted below, you need to use the following parameters: alpha=0.2, edgecolor='blue'
* Use the hist function (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html) for both plots, i.e., histogram and CDF. Use the following parameters to get the CDF: density=True and cumulative=True
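
To see what these two parameters do, the sketch below calls hist on a small made-up sample and inspects the returned bar heights. The sample values and the choice of bins=4 are assumptions for illustration only and do not reproduce the plot of the solution.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical daily distances in km (illustrative values only)
sample_km = np.array([5.0, 12.0, 30.0, 31.0, 48.0, 90.0, 95.0, 120.0])

fig, ax = plt.subplots(2, sharex=True)

# plain histogram: bar heights are absolute counts per bin
counts, edges, _ = ax[0].hist(sample_km, bins=4, alpha=0.2, edgecolor='blue')

# density=True and cumulative=True: bar heights accumulate up to 1
cum_freq, _, _ = ax[1].hist(sample_km, bins=4, density=True, cumulative=True,
                            alpha=0.2, edgecolor='blue')

print(counts)    # number of days per bin
print(cum_freq)  # cumulative relative frequency per bin
```

The heights of the lower plot equal the running sum of the counts divided by the total number of values, so the last bar always reaches 1.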

##### Solution
![title](data/img/solution_e413.png)

### Your code goes here:

In [None]:
#<<solution>>
# create figure with two subplots sharing the x-axis
fig, ax = plt.subplots(2, sharex=True, figsize=(15, 10))
fig.suptitle('Histogram and CDF')

# upper plot: histogram, lower plot: CDF
ax[0].hist(daily_distance, label='Daily distance', alpha=0.2, edgecolor='blue')
ax[1].hist(daily_distance, density=True, cumulative=True, label='Daily distance', alpha=0.2, edgecolor='blue')

# label y-axes
ax[0].set(ylabel='Number of days')
ax[1].set(ylabel='Cumulative relative frequency')

# limit the x-axis to the range of the data
x_min = daily_distance.min()
x_max = daily_distance.max()

for axs in ax.flat:
    axs.set(xlabel='Daily distance in km')
    axs.set_xlim([x_min, x_max])
    axs.grid(axis='y')
    axs.set_axisbelow(True)
    # hide x labels and tick labels of the upper plot
    axs.label_outer()

plt.show()
#<</solution>>

## Exercise 3.2.4 Plot CDF 

### Task
Obviously, the CDF has some advantages over the histogram. However, the explanatory power of our CDF above is still limited because matplotlib uses only a small number of bins by default (bins=10, see https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html), which makes each bin quite wide. To get a more familiar representation of the CDF, i.e., as a continuous line, we need to calculate the CDF before plotting it as a simple line plot.
1. Create a plot as depicted below.

###### Reminder
You can calculate the CDF using the following code:<br/>
hist, bin_edges = np.histogram(data, bins=1000, density=True)<br/>
cdf = np.cumsum(hist)<br/>
cdf = cdf / cdf[-1]<br/>

And plot it as follows:<br/>
ax.plot(bin_edges[1:], cdf)
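
The binning step above approximates the CDF: the more bins, the closer the curve follows the data. As a point of comparison, the following sketch also computes the empirical CDF without any binning by sorting the values. This sort-based variant is an alternative technique and not the method used in this exercise. The sample values and bins=4 are made up for illustration.

```python
import numpy as np

# hypothetical daily distances in km (illustrative values only)
sample_km = np.array([3.0, 1.0, 2.0, 5.0, 2.0])

# histogram-based CDF as in the reminder above, here with only 4 bins
hist, bin_edges = np.histogram(sample_km, bins=4, density=True)
cdf_binned = np.cumsum(hist)
cdf_binned = cdf_binned / cdf_binned[-1]

# sort-based empirical CDF: the i-th smallest value has cumulative share i / n
x_sorted = np.sort(sample_km)
ecdf = np.arange(1, x_sorted.size + 1) / x_sorted.size

print(bin_edges[1:], cdf_binned)
print(x_sorted, ecdf)
```

Both curves end at 1. Plotting the sorted values against ecdf with `ax.step(x_sorted, ecdf, where='post')` would draw the exact step function instead of the binned approximation.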

##### Hint
-

##### Solution
![title](data/img/solution_e414.png)

### Your code goes here:

In [None]:
#<<solution>>
# Choose how many bins you want here
num_bins = 1000

# Use the numpy histogram function to bin the data
hist, bin_edges = np.histogram(daily_distance, bins=num_bins, density=True)

# Now find the cdf
cdf = np.cumsum(hist)
cdf = cdf/cdf[-1]

# create figure
fig, ax = plt.subplots(figsize=(15, 10))

# set title
plt.title('Cumulative distribution function')

# plot cdf
ax.plot(bin_edges[1:], cdf, label='CDF')

# plot statistical measures as vertical lines
plt.vlines(maximum, 0, 1, label='Maximum', color='red', linestyle='-')
plt.vlines(upper_quartile, 0, 0.75, label='Upper quartile', color='grey', linestyle='-.')
plt.vlines(median, 0, 0.5, label='Median', color='orange', linestyle='-')
plt.vlines(lower_quartile, 0, 0.25, label='Lower quartile', color='grey', linestyle='--')
plt.vlines(minimum, 0, cdf.min(), label='Minimum', color='green', linestyle='-')

# connect each statistical measure to the y-axis with a horizontal line
plt.hlines(1, 0, maximum, colors='red', linestyle='-')
plt.hlines(0.75, 0, upper_quartile, colors='grey', linestyle='-.')
plt.hlines(0.5, 0, median, colors='orange', linestyle='-')
plt.hlines(0.25, 0, lower_quartile, colors='grey', linestyle='--')
plt.hlines(cdf.min(), 0, minimum, colors='green', linestyle='-')

# add legend
ax.legend(loc='lower right')

# leave 5 % headroom beyond the largest values on both axes
ax.set_xlim([0, daily_distance.max() * 1.05])
ax.set_ylim([0, cdf.max() * 1.05])
        
plt.ylabel('Cumulative relative frequency')
plt.xlabel('Daily distance in km')

plt.yticks(np.arange(0, 1.1, step=0.1))

ax.grid()
plt.show()
#<</solution>>