# Feature Engineering
In this section you'll learn the following topics:
* How to create features for timeseries data
* How to build functions that can help you generate a ton of features with little code
* How to make your functions testable so that you can guarantee expected behavior
* Bonus: implement your own `scikit-learn` pipeline to do some data transformation

---

## Load your dataset
We'll use the dataset from the previous exercise. Load it into memory and call it `df`. If you have more than 2000 rows, go back and check whether you wrote the output correctly.
- Are the data types correct? If so, how come?

In [None]:
# Your solution

---

# Create helper functions
In this section you'll create helper functions that can generate a bunch of features for you. Since we're doing timeseries forecasting, we are especially interested in features that are derived from the target variable and the time column.

## Datetime features
Create a function `create_datetime_features` that takes in a dataframe with a datetime index and produces the following derived columns:
- Day of week
- Week
- Month
- Quarter
- Year
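
If you want a starting point, here is a minimal sketch of one possible shape for such a function. The output column names and the synthetic `example` frame are assumptions made for illustration, and "week" is interpreted here as the ISO week number; your own solution may reasonably differ.

```python
import pandas as pd


def create_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with calendar columns derived from its DatetimeIndex."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise TypeError("expected a dataframe with a DatetimeIndex")
    out = df.copy()
    out["day_of_week"] = out.index.dayofweek  # Monday=0 ... Sunday=6
    # ISO week number; cast to a plain integer array before assigning
    out["week"] = out.index.isocalendar().week.astype("int64").to_numpy()
    out["month"] = out.index.month
    out["quarter"] = out.index.quarter
    out["year"] = out.index.year
    return out


# Synthetic stand-in for the real dataset (assumption, not the course data)
example = pd.DataFrame(
    {"power": [1.0, 2.0, 3.0]},
    index=pd.date_range("2021-03-01", periods=3, freq="D"),
)
features = create_datetime_features(example)
```

Because the function returns a new frame and raises on a wrong index type, it is straightforward to cover with a couple of small unit tests.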

In [None]:
# Your solution

In [1]:
# Your solution

---

# Windows and moving aggregates

To get a better idea of the trend, we can use window functions. Essentially, we'll compute a moving average over a period to filter out noise. Window functions perform calculations on a set of rows that are related to each other. Unlike aggregate functions, however, window functions do not collapse those rows into a single value. Instead, every row keeps its original identity and a calculated result is returned for each row.

- What kind of effect does the window size have?
- What is the disadvantage of using a moving average?
- Implement a 14-day moving average in pandas. Plot this together with the daily power consumption.
- What choices do you need to make when creating a window function? There are three important components (not including the column choice). 
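
As a hint for the questions above, the sketch below shows a few of the knobs that pandas exposes on `rolling`. The `power` series is synthetic random data (an assumption, not the course dataset), and the parameters shown are examples rather than a complete answer.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for the real power column (assumption)
idx = pd.date_range("2021-01-01", periods=60, freq="D")
rng = np.random.default_rng(0)
power = pd.Series(100 + rng.normal(0, 10, size=len(idx)), index=idx, name="power")

# Trailing 14-day mean: the first 13 rows are NaN because the window is incomplete
ma_14 = power.rolling(window=14).mean()

# Same window, but accept partial windows at the start of the series
ma_14_partial = power.rolling(window=14, min_periods=1).mean()

# Same window, but aligned around each row instead of ending at it
ma_14_centered = power.rolling(window=14, center=True).mean()
```

Note that the output has the same length as the input, which is the "rows keep their identity" property described above. To compare visually, you could plot `power` and `ma_14` on the same axes.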

In [None]:
# Your solution

## Weighting datapoints
One can imagine that more recent data points are more important than those further back in time. What kind of alternative to the moving average can you think of?
- Implement this in pandas
- Add it to the plot
- What difference do you see?
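
One possible answer is an exponentially weighted moving average, which pandas offers through `ewm`. The sketch below uses a tiny hand-made series and `span=3`, both chosen purely for illustration; other weighting schemes would be valid answers too.

```python
import pandas as pd

# Tiny hand-made series so the arithmetic is easy to follow (assumption)
power = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 14.0],
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
    name="power",
)

# span=3 gives alpha = 2 / (span + 1) = 0.5
# adjust=False uses the recursive form: y_t = (1 - alpha) * y_{t-1} + alpha * x_t
ewma = power.ewm(span=3, adjust=False).mean()
```

With `alpha = 0.5`, each new observation gets half the weight and everything before it shares the other half, so older points fade out geometrically instead of dropping off abruptly at the window edge.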

In [None]:
# Your solution

## Expanding windows
Investigate what expanding windows are in pandas.
- When would you use something like this?
- Can you implement an expanding window that computes the total daily power used?
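
As a small illustration, an expanding window starts at the first row and grows by one row at each step. The series below is made up for the example, and reading "total power used" as a running sum is one interpretation of the question.

```python
import pandas as pd

# Made-up daily values (assumption, not the course data)
power = pd.Series(
    [5.0, 3.0, 4.0, 6.0],
    index=pd.date_range("2021-01-01", periods=4, freq="D"),
    name="power",
)

# Running total: each row sums everything from the start up to and including itself
running_total = power.expanding().sum()

# Running mean that waits for at least two observations before reporting a value
running_mean = power.expanding(min_periods=2).mean()
```

For a plain sum this gives the same result as `power.cumsum()`; the expanding form becomes more interesting with other aggregations such as the mean or standard deviation.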

In [None]:
# Your solution

---

## Calculating lags as features
In this section you'll compute a few features that are derived from the target variable column. Derived features can be very powerful in timeseries forecasting. I advise you to write a function that does the different computations for you. Making clever use of loops can save you a bunch of time. In addition, your code will be testable, which makes it easier to bring it to production.

We'll consider the following types of features:
- Lags (values from previous days)
- Windowed values (aggregated values over a certain period)

Try to add the following features to your data frame. Base the features on your `power` column:
- Add lags for 1 to 14 days ago (14 columns)
- Add windows for the mean and variance with different timespans: 7, 14, 21, 28, 35 (10 columns). Make the input configurable.
- Plot the result
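
The sketch below shows one way the loops could be organised. The function names, the column naming scheme, and the synthetic frame are all assumptions. Shifting by one day before rolling is also a design choice made here so that each window only sees past values; the exercise does not prescribe it.

```python
import numpy as np
import pandas as pd


def add_lag_features(df, column, lags):
    """Add one column per lag, e.g. power_lag_1 ... power_lag_14."""
    out = df.copy()
    for lag in lags:
        out[f"{column}_lag_{lag}"] = out[column].shift(lag)
    return out


def add_window_features(df, column, windows, aggregations=("mean", "var")):
    """Add one column per (aggregation, window) pair, e.g. power_mean_7."""
    out = df.copy()
    # Shift by one row so the window for a day excludes that day's own value
    shifted = out[column].shift(1)
    for window in windows:
        rolled = shifted.rolling(window=window)
        for agg in aggregations:
            out[f"{column}_{agg}_{window}"] = rolled.agg(agg)
    return out


# Synthetic frame standing in for the real dataset (assumption)
df = pd.DataFrame(
    {"power": np.arange(50, dtype=float)},
    index=pd.date_range("2021-01-01", periods=50, freq="D"),
)
features = add_lag_features(df, "power", lags=range(1, 15))
features = add_window_features(features, "power", windows=[7, 14, 21, 28, 35])
```

Because the lags, windows, and aggregations are plain arguments, changing the configuration does not require touching the function bodies.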

In [None]:
# Your solution

## Missing values
Adding these lags can introduce missing values. Make sure to deal with these before you continue.
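
To see where the missing values come from, here is a minimal sketch on a made-up frame. Dropping the incomplete rows is only one option, shown here as an assumption; whether it suits your data depends on how many rows you can afford to lose.

```python
import pandas as pd

# Made-up frame with a single lag column (assumption)
df = pd.DataFrame(
    {"power": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)
df["power_lag_2"] = df["power"].shift(2)

# Count missing values per column: a lag of k leaves k NaNs at the start
n_missing = df.isna().sum()

# Simplest treatment: drop every row that has at least one missing value
cleaned = df.dropna()
```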

In [None]:
# Your solution

---
# End of training material

---

## Bonus: Scikit-learn pipelines 
If creating the functions was straightforward and you feel up to a challenge, implement your feature creation as `scikit-learn` pipeline objects. `Scikit-learn` pipeline objects are a great way to chain together a bunch of operations that allow you to build a model pipeline with blocks. 

These blocks are testable and follow the `scikit-learn` style. Pipelines allow you to create a single pipeline object that can do your data preprocessing and model fitting in one go. Additionally, it's easy to swap out or change components in the pipeline.
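
As a rough idea of what such a block can look like, the sketch below wraps a datetime feature step in a custom transformer and chains it with a model. The class name `DatetimeFeatures`, the step names, the choice of `LinearRegression`, and the synthetic data are all assumptions for illustration, not the expected solution.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class DatetimeFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: turns a DatetimeIndex into calendar columns."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self follows the scikit-learn convention
        return self

    def transform(self, X):
        return pd.DataFrame(
            {
                "day_of_week": np.asarray(X.index.dayofweek),
                "month": np.asarray(X.index.month),
            },
            index=X.index,
        )


# Synthetic data: only the index carries information here (assumption)
idx = pd.date_range("2021-01-01", periods=28, freq="D")
X = pd.DataFrame(index=idx)
y = pd.Series(np.arange(28, dtype=float), index=idx)

pipeline = Pipeline(
    steps=[
        ("datetime_features", DatetimeFeatures()),
        ("model", LinearRegression()),
    ]
)
pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

The transformer can be tested on its own by calling `DatetimeFeatures().fit_transform(X)`, and the model step can be swapped without touching the feature code.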

In [None]:
# Your solution

---

## Other features
Can you add other features that you think are useful for predicting the power usage? Perhaps something to do with holidays or season. Feel free to add a bunch of features, we'll be using them in the final part. :)
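
For inspiration, here is a small sketch of two calendar-based features. The function name, the column names, and the month-to-season mapping (meteorological seasons in the northern hemisphere) are assumptions; adjust them to the region your data comes from.

```python
import pandas as pd

# Assumed mapping: meteorological seasons, northern hemisphere
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def add_calendar_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a weekend flag and a season label."""
    out = df.copy()
    out["is_weekend"] = out.index.dayofweek >= 5  # Saturday=5, Sunday=6
    out["season"] = out.index.month.map(SEASON_BY_MONTH)
    return out


# Hand-picked dates covering all four seasons (assumption)
example = pd.DataFrame(
    {"power": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(["2021-01-02", "2021-04-05", "2021-07-11", "2021-10-13"]),
)
flags = add_calendar_flags(example)
```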

In [None]:
# Your solution

---

# Store prepared data
Store your prepared dataset as parquet. You'll need it in the next exercise.

In [None]:
# Your solution