# Cross validation of time series with scikit-learn

In machine learning, it is quite common to assume that the data are
"independent and identically distributed" (i.i.d.),
meaning that the generative process does not have any memory of past samples
when generating new samples.
This assumption is usually violated when dealing with time series: a sample
depends on past information.
We will use an example to highlight the issues that non-i.i.d. data cause
for the cross-validation strategies presented previously.

First we load financial quotations from some energy companies.

In [None]:
!wget https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/financial-data/COP.csv -P ../datasets/financial-data/
!wget https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/financial-data/CVX.csv -P ../datasets/financial-data/
!wget https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/financial-data/TOT.csv -P ../datasets/financial-data/
!wget https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/financial-data/VLO.csv -P ../datasets/financial-data/
!wget https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/financial-data/XOM.csv -P ../datasets/financial-data/

In [None]:
import pandas as pd

symbols = {"TOT": "Total", "XOM": "Exxon", "CVX": "Chevron",
           "COP": "ConocoPhillips", "VLO": "Valero Energy"}
template_name = ("../datasets/financial-data/{}.csv")

quotes = {}
for symbol in symbols:
    data = pd.read_csv(
        template_name.format(symbol), index_col=0, parse_dates=True
    )
    quotes[symbols[symbol]] = data["open"]
quotes = pd.DataFrame(quotes)

We can start by plotting the different financial quotations.

In [None]:
import matplotlib.pyplot as plt

quotes.plot(figsize=(10, 6))
plt.ylabel("Quote value")
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
_ = plt.title("Stock values over time")

Here, we want to predict the quotation of Chevron
using all other energy companies' quotes.

To make explanatory plots, we will use a single split in addition to the
cross-validation that you used in the introductory exercise.

In [None]:
from sklearn.model_selection import train_test_split

data, target = quotes.drop(columns=["Chevron"]), quotes["Chevron"]
data_train, data_test, target_train, target_test = train_test_split(
    data, target, shuffle=True, random_state=0)

We will use a decision tree regressor that we expect to overfit and thus not
generalize to unseen data. We will use a `ShuffleSplit` cross-validation to
check the generalization performance of our model.

Let's first define our model.

In [None]:
from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor()

And now the cross-validation strategy.

In [None]:
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(random_state=0)

Finally, we perform the evaluation.

In [None]:
from sklearn.model_selection import cross_val_score

test_score = cross_val_score(regressor, data_train, target_train, cv=cv,
                             n_jobs=2)
print(f"The mean R2 is: "
      f"{test_score.mean():.2f} +/- {test_score.std():.2f}")

Surprisingly, we get outstanding generalization performance. We will investigate
and find the reason for such good results with a model that is expected to
fail. We previously mentioned that `ShuffleSplit` is an iterative
cross-validation scheme that shuffles the data and then splits it. We will
simplify this procedure with a single split and plot the predictions. We can
use `train_test_split` for this purpose.

In [None]:
regressor.fit(data_train, target_train)
target_predicted = regressor.predict(data_test)
# Give `target_predicted` the index of `target_test` to ease the plotting
target_predicted = pd.Series(target_predicted, index=target_test.index)

Let's check the generalization performance of our model on this split.

In [None]:
from sklearn.metrics import r2_score

test_score = r2_score(target_test, target_predicted)
print(f"The R2 on this single split is: {test_score:.2f}")

Similarly, we obtain good results in terms of $R^2$.
We will plot the training, testing and prediction samples.

In [None]:
target_train.plot(label="Training", figsize=(10, 6))
target_test.plot(label="Testing")
target_predicted.plot(label="Prediction")

plt.ylabel("Quote value")
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
_ = plt.title("Model predictions using a ShuffleSplit strategy")

In this context, the model predictions seem to follow the testing samples
closely. But we can also see that each testing sample sits next to some
training samples. With these time series, there is a strong relationship
between a sample at time `t` and a sample at time `t+1`. In this case, we are
violating the i.i.d. assumption. The insight to get is the following: a model
can output the value it memorized from its training set at time `t` for a
testing sample at time `t+1`. This prediction would be close to the true value
even if our model did not learn anything and just memorized the training
dataset.
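To make this memorization effect concrete, here is a minimal sketch on synthetic data. It is not the stock data above: the random walk, its length and its noise scale are assumptions chosen only for illustration. It predicts the value at `t+1` with the value observed at `t`, without any learning.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic random walk (an assumption, not the stock data): each value is
# the previous one plus a small random step.
series = 100 + np.cumsum(rng.normal(scale=1.0, size=2_000))

# "Memorization" baseline: predict the value at t+1 with the value at t.
previous_value = series[:-1]
next_value = series[1:]

lag_one_correlation = np.corrcoef(previous_value, next_value)[0, 1]
persistence_r2 = r2_score(next_value, previous_value)
print(f"Lag-1 correlation: {lag_one_correlation:.3f}")
print(f"R2 of the persistence baseline: {persistence_r2:.3f}")
```

Both values should be close to 1 on such a series, which suggests that a high score obtained with shuffled splits says little about what the model actually learned.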

An easy way to verify this hypothesis is to not shuffle the data when doing
the split. In this case, we will use the first 75% of the data to train and
the remaining data to test.

In [None]:
data_train, data_test, target_train, target_test = train_test_split(
    data, target, shuffle=False, random_state=0,
)
regressor.fit(data_train, target_train)
target_predicted = regressor.predict(data_test)
target_predicted = pd.Series(target_predicted, index=target_test.index)

In [None]:
test_score = r2_score(target_test, target_predicted)
print(f"The R2 on this single split is: {test_score:.2f}")

In this case, we see that our model is not magical anymore. Indeed, it
performs worse than just predicting the mean of the target. We can visually
check what we are predicting.

In [None]:
target_train.plot(label="Training", figsize=(10, 6))
target_test.plot(label="Testing")
target_predicted.plot(label="Prediction")

plt.ylabel("Quote value")
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
_ = plt.title("Model predictions using a split without shuffling")

We see that our model cannot predict anything because it doesn't have samples
around the testing sample. Let's check how we could have made a proper
cross-validation scheme to get a reasonable generalization performance estimate.

One solution would be to group the samples into time blocks, e.g. by quarter,
and predict each group's information by using information from the other
groups. We can use the `LeaveOneGroupOut` cross-validation for this purpose.

In [None]:
from sklearn.model_selection import LeaveOneGroupOut

groups = quotes.index.to_period("Q")
cv = LeaveOneGroupOut()
test_score = cross_val_score(regressor, data, target,
                             cv=cv, groups=groups, n_jobs=2)
print(f"The mean R2 is: "
      f"{test_score.mean():.2f} +/- {test_score.std():.2f}")

In this case, we see that we cannot make good predictions, which is less
surprising than our original results.
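To see what the quarter-based grouping does, the following sketch applies the same idea to a hypothetical daily index covering one year. The names prefixed with `toy_` and the dates are assumptions for illustration. Here the quarters are first encoded as integers with `pd.factorize`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical daily index covering one year, i.e. four quarters.
toy_index = pd.date_range("2020-01-01", "2020-12-31", freq="D")
toy_data = np.arange(len(toy_index)).reshape(-1, 1)

# Encode each quarter as an integer label (0, 1, 2, 3).
toy_groups, toy_quarters = pd.factorize(toy_index.to_period("Q"))

toy_cv = LeaveOneGroupOut()
for train_idx, test_idx in toy_cv.split(toy_data, groups=toy_groups):
    held_out = toy_quarters[toy_groups[test_idx][0]]
    print(f"held-out quarter: {held_out}, "
          f"train size: {len(train_idx)}, test size: {len(test_idx)}")
```

Each iteration holds out one full quarter, so no testing sample has a training neighbour inside its own quarter.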

Another thing to consider is the actual application of our solution. If our
model is aimed at forecasting (i.e., predicting future data from past data),
we should not use training data that come after the testing data in time. In
this case, we can use the `TimeSeriesSplit` cross-validation to enforce this
behaviour.
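Before applying it to the quotes, a small sketch on a toy array of 8 chronologically ordered samples can help to visualize the splits. The array and the choice of `n_splits=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy dataset: 8 samples assumed to be in chronological order.
toy_samples = np.arange(8).reshape(-1, 1)
toy_cv = TimeSeriesSplit(n_splits=3)

toy_splits = list(toy_cv.split(toy_samples))
for train_idx, test_idx in toy_splits:
    # Every training index precedes every testing index.
    print(f"train: {train_idx}, test: {test_idx}")
```

The training window grows at each iteration while the testing window always lies after it.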

In [None]:
from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=groups.nunique())
test_score = cross_val_score(regressor, data, target,
                             cv=cv, groups=groups, n_jobs=2)
print(f"The mean R2 is: "
      f"{test_score.mean():.2f} +/- {test_score.std():.2f}")

In conclusion, it is really important not to use an off-the-shelf
cross-validation strategy whose assumptions, such as having i.i.d. data, are
violated. It might lead to absurd results that could make us think that a
predictive model works when it does not.

# Exercise

Load the [bike_rides.csv dataset](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_bike_rides.html). The exercise consists of using measurements from cheap sensors (GPS, heart-rate monitor, etc.) to predict a cyclist's power time series. Power can indeed be recorded via a cycling power meter, but this device is rather expensive.

In [None]:
!wget https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/bike_rides.csv -P ../datasets/

In [None]:
import pandas as pd

cycling = pd.read_csv("../datasets/bike_rides.csv", index_col=0,
                      parse_dates=True)
cycling.index.name = ""
target_name = "power"
data, target = cycling.drop(columns=target_name), cycling[target_name]
data

Instead of blindly applying machine learning, we will first introduce a bit of
classical mechanics: Newton's second law.

$P_{meca} = (\frac{1}{2} \rho . SC_x . V_{a}^{2} + C_r . mg . \cos \alpha + mg . \sin \alpha + ma) V_d$

where $\rho$ is the air density in kg.m$^{-3}$, $S$ is the frontal surface
area of the cyclist in m$^{2}$, $C_x$ is the drag coefficient, $V_a$ is the
air speed in m.s$^{-1}$, $C_r$ is the rolling coefficient, $m$ is the mass of
the rider and bicycle in kg, $g$ is the standard acceleration due to gravity
which is equal to 9.81 m.s$^{-2}$, $\alpha$ is the slope in radians, $V_d$ is
the rider speed in m.s$^{-1}$, and $a$ is the rider acceleration in m.s$^{-2}$.

This equation might look a bit complex at first, but we can explain in words
what the different terms within the parentheses represent once multiplied by
the rider speed $V_d$:

- the first term is the power that a cyclist is required to produce to fight
  wind;
- the second term is the power that a cyclist is required to produce to fight
  the rolling resistance created by the tires on the ground;
- the third term is the power that a cyclist is required to produce to go up a
  hill if the slope is positive. If the slope is negative, the cyclist does not
  need to produce any power to go forward;
- the fourth and last term is the power that a cyclist requires to change
  speed (i.e. to accelerate).

We can simplify the model above by using the data that we have at hand. It
would look like the following.

$P_{meca} = \beta_{1} V_{d}^{3} + \beta_{2} V_{d} + \beta_{3} \sin(\alpha) V_{d} + \beta_{4} a V_{d}$

This model is closer to what we saw previously: it is a linear model trained on
a non-linear feature transformation. We will build, train and evaluate such a
model as part of this exercise. Thus, you need to:

- create a new data matrix containing the cube of the speed, the speed, the
  speed multiplied by the sine of the angle of the slope, and the speed
  multiplied by the acceleration. To compute the angle of the slope, you need to
  take the arc tangent of the slope (`alpha = np.arctan(slope)`). In addition,
  we can limit ourselves to positive acceleration only by clipping the
  negative acceleration values to 0 (they would correspond to some power
  created by the braking that we are not modeling here).
- using the new data matrix, create a linear predictive model based on a
  [`sklearn.preprocessing.StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
  and a
  [`sklearn.linear_model.RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html);
- use a
  [`sklearn.model_selection.ShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html)
  cross-validation strategy with only 4 splits (`n_splits=4`) to evaluate the
  generalization performance of the model. Use the mean absolute error (MAE) as
  a generalization performance metric. Also, pass the parameter
  `return_estimator=True` and `return_train_score=True` to answer the subsequent
  questions. Be aware that the `ShuffleSplit` strategy is a naive strategy and
  we will investigate the consequence of making this choice in the subsequent
  questions.
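The steps above can be sketched as follows on synthetic data. This is only an illustration of the workflow, not the solution on the real dataset: the column names `speed` and `slope`, the value ranges and the made-up target are assumptions, so the numbers it prints say nothing about the quiz answers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500
# Synthetic stand-in for the sensor data; column names are assumptions.
toy = pd.DataFrame({
    "speed": rng.uniform(2, 12, size=n_samples),
    "slope": rng.normal(0, 0.03, size=n_samples),
    "acceleration": rng.normal(0, 0.3, size=n_samples),
})

alpha = np.arctan(toy["slope"])
positive_acceleration = toy["acceleration"].clip(lower=0)
toy_linear_features = pd.DataFrame({
    "speed^3": toy["speed"] ** 3,
    "speed": toy["speed"],
    "speed*sin(alpha)": toy["speed"] * np.sin(alpha),
    "speed*acceleration": toy["speed"] * positive_acceleration,
})
# Made-up target so that the sketch runs end to end.
toy_target = (0.2 * toy_linear_features["speed^3"]
              + 5 * toy_linear_features["speed"]
              + rng.normal(0, 10, size=n_samples))

toy_model = make_pipeline(StandardScaler(), RidgeCV())
toy_cv = ShuffleSplit(n_splits=4, random_state=0)
toy_results = cross_validate(
    toy_model, toy_linear_features, toy_target, cv=toy_cv,
    scoring="neg_mean_absolute_error",
    return_estimator=True, return_train_score=True,
)
# The scorer returns the negative MAE, so we flip the sign.
toy_test_mae = -toy_results["test_score"]
print(f"Test MAE: {toy_test_mae.mean():.1f} +/- {toy_test_mae.std():.1f}")
```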

### Q1

What is the mean value of the column containing the information of
$\sin(\alpha) V_{d}$?

- a) about -3
- b) about -0.3
- c) about -0.03
- d) about -0.003

_Select a single answer_

### Q2
On average, the Mean Absolute Error on the test sets obtained through
cross-validation is closest to:

- a) 20 Watts
- b) 50 Watts
- c) 70 Watts
- d) 90 Watts

_Select a single answer_

Hint: pass `scoring="neg_mean_absolute_error"` to the `cross_validate`
function to compute the negative of the requested metric.

Hint: it is possible to replace the negative acceleration values by 0 using
`data["acceleration"].clip(lower=0)`.

### Q3
Given the model
$P_{meca} = \beta_{1} V_{d}^{3} + \beta_{2} V_{d} + \beta_{3} \sin(\alpha) V_{d} + \beta_{4} a V_{d}$
that you created, inspect the weights of the linear models fitted during
cross-validation and select the correct statements:

- a) $\beta_{1} < \beta_{2} < \beta_{3}$
- b) $\beta_{3} < \beta_{1} < \beta_{2}$
- c) $\beta_{2} < \beta_{3} < \beta_{1}$
- d) $\beta_{1} < 0$
- e) $\beta_{2} < 0$
- f) $\beta_{3} < 0$
- g) $\beta_{4} < 0$
- h) All $\beta$s are $> 0$

_Select all answers that apply_
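One possible way to inspect the weights is sketched below on synthetic data. The feature names and the true coefficients are assumptions for illustration. It relies on `return_estimator=True` and on the fact that the last step of each fitted pipeline is the linear model. Note that the weights apply to the standardized features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features and target; the true coefficients are made up.
toy_X = pd.DataFrame(rng.normal(size=(200, 4)),
                     columns=["beta_1", "beta_2", "beta_3", "beta_4"])
toy_y = toy_X @ np.array([3.0, 1.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=200)

toy_results = cross_validate(
    make_pipeline(StandardScaler(), RidgeCV()), toy_X, toy_y,
    cv=ShuffleSplit(n_splits=4, random_state=0), return_estimator=True,
)
# One row per cross-validation split, one column per feature.
toy_weights = pd.DataFrame(
    [pipeline[-1].coef_ for pipeline in toy_results["estimator"]],
    columns=toy_X.columns,
)
print(toy_weights)
```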

### Q4
Now, we will create a predictive model that uses all `data`, including available
sensor measurements such as cadence (the speed at which a cyclist turns pedals
measured in rotations per minute) and heart-rate (the number of beats per minute
of the heart of the cyclist while exercising). Also, we will use a non-linear
regressor, a
[`sklearn.ensemble.HistGradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html).
Fix the maximum number of iterations to 1000 (`max_iter=1_000`) and activate
early stopping (`early_stopping=True`). Repeat the previous evaluation using
this regressor.

On average, the Mean Absolute Error on the test sets obtained through
cross-validation is closest to:

- a) 20 Watts
- b) 40 Watts
- c) 60 Watts
- d) 80 Watts

_Select a single answer_

### Q5
Comparing both the linear model and the histogram gradient boosting model and
taking into consideration the train and test MAE obtained via cross-validation,
select the correct statements:

- a) the generalization performance of the histogram gradient-boosting model is
  limited by its underfitting
- b) the generalization performance of the histogram gradient-boosting model is
  limited by its overfitting
- c) the generalization performance of the linear model is limited by its
  underfitting
- d) the generalization performance of the linear model is limited by its
  overfitting

_Select all answers that apply_

Hint: look at the values of the `train_score` and the `test_score` collected
in the dictionaries returned by the `cross_validate` function.

### Q6
How many bike rides are stored in the dataframe `data`?

- a) 2
- b) 3
- c) 4
- d) 5

_Select a single answer_

Hint: You can check the unique days in the `DatetimeIndex` (the index of the
dataframe `data`). Indeed, we assume that the rider went cycling at most once
per day.

Hint: You can access the date and time of a `DatetimeIndex` using
`df.index.date` and `df.index.time`, respectively.
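The idea behind these hints can be sketched on a small hypothetical index. The timestamps below are made up and unrelated to the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical index: samples recorded on two different days.
toy_index = pd.DatetimeIndex([
    "2021-05-01 09:00:00", "2021-05-01 09:00:01", "2021-05-01 09:00:02",
    "2021-05-03 18:30:00", "2021-05-03 18:30:01",
])
# `.date` drops the time of day, so samples of the same day become equal.
unique_days = np.unique(toy_index.date)
print(f"Number of distinct days: {len(unique_days)}")
```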

### Q7
Instead of using the naive `ShuffleSplit` strategy, we will use a strategy that
takes into account the group defined by each individual date. It corresponds to
a bike ride. We would like to have a cross-validation strategy that evaluates
the capacity of our model to predict on a completely new bike ride: the samples
in the validation set should only come from rides not present in the training
set. Therefore, we can use a `LeaveOneGroupOut` strategy: at each iteration of
the cross-validation, we will keep a bike ride for the evaluation and use all
other bike rides to train our model.

Thus, you concretely need to:

- create a variable called `groups` that is a 1D numpy array containing the
  index of each ride present in the dataframe. Therefore, the length of `groups`
  will be equal to the number of samples in `data`. If we had 2 bike
  rides, we would expect the indices 0 and 1 in `groups` to differentiate the
  bike rides. You can use
  [`pd.factorize`](https://pandas.pydata.org/docs/reference/api/pandas.factorize.html)
  to encode any Python types into integer indices.
- create a cross-validation object named `cv` using the
  [`sklearn.model_selection.LeaveOneGroupOut`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut)
  strategy.
- evaluate both the linear and histogram gradient boosting models with this
  strategy.
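A minimal sketch of this procedure on hypothetical data is given below. The three made-up "rides", the single feature and the `RidgeCV` stand-in model are assumptions for illustration and do not reproduce the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import LeaveOneGroupOut, cross_validate

rng = np.random.default_rng(0)
# Hypothetical data: three "rides" of 50 one-second samples on three days.
starts = ("2021-05-01 09:00", "2021-05-03 18:30", "2021-05-07 07:15")
toy_index = pd.DatetimeIndex(
    [ts for start in starts
     for ts in pd.date_range(start, periods=50, freq="s")]
)
toy_data = pd.DataFrame(
    {"feature": rng.normal(size=len(toy_index))}, index=toy_index
)
toy_target = 2 * toy_data["feature"] + rng.normal(scale=0.1, size=len(toy_index))

# One integer label per sample: 0 for the first day, 1 for the second, ...
toy_groups, toy_days = pd.factorize(toy_data.index.date)

toy_results = cross_validate(
    RidgeCV(), toy_data, toy_target, groups=toy_groups, cv=LeaveOneGroupOut(),
    scoring="neg_mean_absolute_error", return_train_score=True,
)
print(f"Number of rides: {len(toy_days)}")
print(f"Test MAE per held-out ride: {-toy_results['test_score']}")
```

With this strategy, the number of cross-validation iterations equals the number of rides.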

Using the previous evaluations (with the `LeaveOneGroupOut` strategy)
and looking at the train and test errors for both models, select the
correct statements:

- a) the generalization performance of the gradient-boosting model is
  limited by its underfitting
- b) the generalization performance of the gradient-boosting model is
  limited by its overfitting
- c) the generalization performance of the linear model is limited by its
  underfitting
- d) the generalization performance of the linear model is limited by its
  overfitting

_Select all answers that apply_

### Q8
In this case we cannot compare cross-validation scores fold-to-fold as the folds
are not aligned (they are not generated by the exact same strategy). Instead,
compare the mean of the cross-validation test errors in the evaluations of the
**linear model** to select the correct statement.

When using the `ShuffleSplit` strategy, the mean test error:

- a) is greater than the `LeaveOneGroupOut` mean test error by more than 3 Watts,
  i.e. `ShuffleSplit` is giving over-pessimistic results
- b) differs from the `LeaveOneGroupOut` mean test error by less than 3 Watts,
  i.e. both cross-validation strategies are equivalent
- c) is lower than the `LeaveOneGroupOut` mean test error by more than 3 Watts,
  i.e. `ShuffleSplit` is giving over-optimistic results

_Select a single answer_

### Q9
Compare the mean of the cross-validation test errors in the evaluations of the
**gradient-boosting model** to select the correct statement.

When using the `ShuffleSplit` strategy, the mean test error:

- a) is greater than the `LeaveOneGroupOut` mean test error by more than 3 Watts,
  i.e. `ShuffleSplit` is giving over-pessimistic results
- b) differs from the `LeaveOneGroupOut` mean test error by less than 3 Watts,
  i.e. both cross-validation strategies are equivalent
- c) is lower than the `LeaveOneGroupOut` mean test error by more than 3 Watts,
  i.e. `ShuffleSplit` is giving over-optimistic results

_Select a single answer_

### Q10

Compare more precisely the errors estimated through cross-validation and select
the correct statement:

- a) in general, the standard deviation of the train and test errors increased
  using the `LeaveOneGroupOut` cross-validation
- b) in general, the standard deviation of the train and test errors decreased
  using the `LeaveOneGroupOut` cross-validation

_Select a single answer_

### Q11
Now, we will go into more detail by picking a single ride for testing and
analysing the predictions of the models for this test ride. To do so, we can
reuse the `LeaveOneGroupOut` cross-validation object in the following manner:

In [None]:
cv = LeaveOneGroupOut()
train_indices, test_indices = list(cv.split(data, target, groups=groups))[0]

data_linear_model_train = data_linear_model.iloc[train_indices]
data_linear_model_test = data_linear_model.iloc[test_indices]

data_train = data.iloc[train_indices]
data_test = data.iloc[test_indices]

target_train = target.iloc[train_indices]
target_test = target.iloc[test_indices]

Now, fit both the linear model and the histogram gradient boosting regressor
on the training data and collect the predictions on the testing data.
Make a scatter plot with the measured powers (true target) on the x-axis and
the predicted powers (predicted target) on the y-axis. Make a separate plot
for each model.
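One possible layout for such plots is sketched below with made-up values. The "Model A" and "Model B" predictions are synthetic placeholders, not outputs of the models from the exercise. The dashed diagonal marks perfect predictions, which makes systematic under- or over-prediction easier to spot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up "measured" powers and two sets of synthetic predictions.
toy_true = rng.uniform(0, 400, size=200)
toy_predictions = {
    "Model A": toy_true + rng.normal(scale=30, size=200),
    "Model B": 0.7 * toy_true + 50 + rng.normal(scale=30, size=200),
}

fig, axes = plt.subplots(ncols=2, figsize=(12, 6), sharex=True, sharey=True)
for ax, (name, predicted) in zip(axes, toy_predictions.items()):
    ax.scatter(toy_true, predicted, alpha=0.3)
    # Points on this diagonal are perfectly predicted.
    ax.plot([0, 400], [0, 400], color="black", linestyle="--")
    ax.set_xlabel("Measured power (W)")
    ax.set_ylabel("Predicted power (W)")
    ax.set_title(name)
```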

By analysing the plots, select the correct statements:

- a) the linear regressor tends to under-predict samples with high power
- b) the linear regressor tends to over-predict samples with high power
- c) the linear regressor makes catastrophic predictions for samples with power
  close to zero
- d) the histogram gradient boosting regressor tends to under-predict samples
  with high power
- e) the histogram gradient boosting regressor tends to over-predict samples
  with high power
- f) the histogram gradient boosting makes catastrophic predictions for samples
  with power close to zero

_Select all answers that apply_

### Q12
Now select a portion of the testing data using the following code:

In [None]:
time_slice = slice("2020-08-18 17:00:00", "2020-08-18 17:05:00")

data_test_linear_model_subset = data_linear_model_test[time_slice]
data_test_subset = data_test[time_slice]
target_test_subset = target_test[time_slice]

This selects the data from 5:00 pm until 5:05 pm. Use the previously fitted
models (linear and gradient-boosting regressors) to predict on this portion of
the test data. Draw the true targets and the predictions of each model on the
same plot.
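As a side note on the slicing itself, the sketch below uses a hypothetical series sampled every second (an assumption, unrelated to the real recordings). It shows that slicing a `DatetimeIndex` with string bounds keeps both end points.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one sample per second around the time slice.
toy_index = pd.date_range("2020-08-18 16:55:00", "2020-08-18 17:10:00", freq="s")
toy_series = pd.Series(np.arange(len(toy_index)), index=toy_index)

toy_slice = slice("2020-08-18 17:00:00", "2020-08-18 17:05:00")
toy_subset = toy_series.loc[toy_slice]
print(f"First timestamp: {toy_subset.index[0]}")
print(f"Last timestamp: {toy_subset.index[-1]}")
print(f"Number of samples: {len(toy_subset)}")
```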

By using the previous plot, select the correct statements:

- a) the linear model is more accurate than the histogram gradient boosting
  regressor
- b) the histogram gradient boosting regressor is more accurate than the linear
  model

_Select a single answer_