# Lasso for timeseries example

This notebook provides an example of how to create a timeseries model with SAM, using the `LassoTimeseriesRegressor`.

The timeseries model utilizes the feature engineering capabilities of SAM. To learn more about feature engineering, see the notebook `feature_engineering.ipynb` and the [Feature Engineering](https://sam.nist.gov/docs/feature-engineering) section of the SAM documentation.

In [92]:
# autoreload
%load_ext autoreload
%autoreload 2




In [93]:
from sam.models import LassoTimeseriesRegressor
from sam.feature_engineering import SimpleFeatureEngineer

import pandas as pd

In [94]:
data = pd.read_parquet("../data/rainbow_beach.parquet")
data.head()

TIME,batttery_life,transducer_depth,turbidity,water_temperature,wave_height,wave_period
2014-06-15 00:00:00,11.6,1.495,0.85,16.6,0.136,3.0
2014-06-15 01:00:00,11.6,1.42,0.87,16.3,0.117,4.0
2014-06-15 02:00:00,11.6,1.478,0.79,16.1,0.114,7.0
2014-06-15 03:00:00,11.6,1.518,0.76,15.9,0.111,3.0
2014-06-15 04:00:00,11.6,1.507,0.77,15.7,0.107,3.0


To use the model, we need a feature engineering transformer. `sam.feature_engineering` contains a number of transformers that can be used to create features from the data, suitable for time series problems.

In [104]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Rolling means of wave_height over several windows, plus one-hot time features.
simple_features = SimpleFeatureEngineer(
    rolling_features=[
        ("wave_height", "mean", 48),
        ("wave_height", "mean", 24),
        ("wave_height", "mean", 12),
        ("wave_height", "mean", 6),
        ("wave_height", "mean", 3),
    ],
    time_features=[
        ("hour_of_day", "onehot"),
        ("day_of_week", "onehot"),
    ],
    keep_original=False,  # keep only the engineered features
)

# The full dataframe is passed as X; the target is the water temperature.
X = data
y = data["water_temperature"]

# Engineer the features, then impute missing values and standardize.
feature_pipeline = Pipeline(
    steps=[
        ("features", simple_features),
        ("imputer", SimpleImputer()),
        ("scaler", StandardScaler()),
    ]
)
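
To make the feature configuration above more concrete, the sketch below approximates in plain pandas what rolling-mean and one-hot hour features look like. It is an illustration under assumptions, not SAM's implementation: the synthetic `wave_height` values and the output column names are made up, and `SimpleFeatureEngineer` may name and compute its columns differently.

```python
import pandas as pd

# Synthetic hourly series standing in for the `wave_height` column (made-up values).
index = pd.date_range("2014-06-15", periods=72, freq=pd.Timedelta(hours=1))
wave_height = pd.Series([0.1 + 0.01 * i for i in range(72)], index=index)

features = pd.DataFrame(index=index)

# Rolling means over the last 3, 6 and 12 rows (hours, for hourly data).
for window in (3, 6, 12):
    features[f"wave_height_mean_{window}"] = wave_height.rolling(window).mean()

# One-hot encoding of the hour of day: one indicator column per hour.
hours = pd.Series(index.hour.to_numpy(), index=index)
features = features.join(pd.get_dummies(hours, prefix="hour"))

# The first `window - 1` rows of each rolling column have no full window yet.
print(features.isna().sum().head(3))
```

Rows without a full window come out as NaN in this sketch, which is one plausible reason for placing a `SimpleImputer` before the scaler in the pipeline.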


The following example creates a model for nowcasting, that is, predicting the value of the target (here `water_temperature`) at the current timestep. This corresponds to `predict_ahead=(0,)`.

In [103]:
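model = LassoTimeseriesRegressor(
    predict_ahead=(0,),  # nowcasting: predict the value at the current timestep
    quantiles=(0.1, 0.9),  # also predict the 10% and 90% quantiles
    alpha=0.01,  # regularization strength
    average_type="median",  # use the median as the central prediction
    feature_engineer=feature_pipeline,
)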

model.fit(X, y)
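
With `quantiles=(0.1, 0.9)`, the model predicts quantiles of the target in addition to the average. Quantile regression typically fits each quantile by minimizing the pinball (quantile) loss. The sketch below is a generic pure-Python illustration of that loss, not code from SAM; the helper names `pinball_loss` and `best_constant` are hypothetical.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss of predictions for one quantile level."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        diff = actual - predicted
        # Under-prediction is weighted by `quantile`, over-prediction by `1 - quantile`.
        total += max(quantile * diff, (quantile - 1) * diff)
    return total / len(y_true)


def best_constant(y_true, candidates, quantile):
    """Return the candidate constant prediction with the lowest pinball loss."""
    return min(
        candidates,
        key=lambda c: pinball_loss(y_true, [c] * len(y_true), quantile),
    )


y_true = list(range(1, 100))      # toy targets 1, 2, ..., 99
candidates = list(range(0, 101))  # constant predictions to try

for q in (0.1, 0.5, 0.9):
    print(q, best_constant(y_true, candidates, q))
```

On this toy data, the best constant for each level lands on the matching empirical quantile. A quantile model generalizes this by letting the prediction depend on the features.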

In [102]:
pred, Xout = model.predict(X, return_data=True, force_monotonic_quantiles=True)

pred.plot()
y.plot()


In [None]:
pred

TIME,predict_lead_0_q_0.1,predict_lead_0_q_0.5,predict_lead_0_q_0.9,predict_lead_0_mean
2014-06-15 00:00:00,15.977594,17.131630,19.792821,17.334895
2014-06-15 01:00:00,15.977594,17.127520,19.792821,17.211953
2014-06-15 02:00:00,15.977594,17.131630,19.792821,17.252665
2014-06-15 03:00:00,15.943347,17.063374,19.628948,17.148618
2014-06-15 04:00:00,15.813574,17.061542,19.546175,17.117780
...,...,...,...,...
2014-07-15 16:00:00,18.269236,18.976203,20.306284,19.621426
2014-07-15 17:00:00,18.284076,19.673023,20.288593,19.724330
2014-07-15 18:00:00,18.244491,19.158389,20.392872,19.728669
2014-07-15 19:00:00,18.347562,19.219369,20.172443,19.632209
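
When each quantile is estimated by its own model, predicted quantiles can occasionally cross, for example a 0.1 quantile that ends up above the 0.5 quantile. One simple, generic way to restore the order is to sort the quantile predictions within each row. The sketch below shows that idea in plain Python; it is an assumption for illustration and not necessarily what `force_monotonic_quantiles=True` does inside SAM. The helper name `sort_quantiles` and the numbers are hypothetical.

```python
def sort_quantiles(rows):
    """Sort each row of quantile predictions so values increase with the quantile level."""
    return [sorted(row) for row in rows]


# Made-up predictions ordered as (q=0.1, q=0.5, q=0.9); the second row crosses.
predictions = [
    (15.9, 17.1, 19.8),
    (17.3, 17.0, 19.6),
]

monotonic = sort_quantiles(predictions)
print(monotonic)
```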


In [99]:
model.score(X, y)

3.601537859814757

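To create a forecasting model, choose `predict_ahead` differently: a tuple of several nonzero values, such as `(1, 2, 3)`, predicts multiple timesteps ahead. The parameter `use_diff_of_y` can also be useful in forecasting applications, since it lets the model predict the change in `y` relative to its current value instead of the future value itself.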
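
One way to picture these two parameters is through the targets a forecasting model would be trained on. The pandas sketch below builds such targets by shifting `y`; it is a simplified illustration under assumptions, not SAM's internal code, and the helper `make_targets` and its column names are hypothetical.

```python
import pandas as pd


def make_targets(y, predict_ahead, use_diff_of_y=False):
    """Build one target column per lead time by shifting y into the future."""
    targets = {}
    for lead in predict_ahead:
        future = y.shift(-lead)  # value of y `lead` timesteps ahead
        # With differencing, the target is the change relative to the current value.
        targets[f"lead_{lead}"] = future - y if use_diff_of_y else future
    return pd.DataFrame(targets)


y = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])  # made-up values

absolute = make_targets(y, predict_ahead=(1, 2))
differenced = make_targets(y, predict_ahead=(1, 2), use_diff_of_y=True)
print(absolute)
print(differenced)
```

The last rows of each column are NaN because the future value is not known yet. In this sketch, the differenced targets stay in a narrower range than the absolute ones, which can make a trending series easier to model.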