created by Ignacio Oguiza - email: oguiza@gmail.com

In [4]:
%%javascript
utils.load_extension('collapsible_headings/main')
utils.load_extension('hide_input/main')
utils.load_extension('autosavetime/main')
utils.load_extension('execute_time/ExecuteTime')
utils.load_extension('code_prettify/code_prettify')
utils.load_extension('scroll_down/main')
utils.load_extension('jupyter-js-widgets/extension')

<IPython.core.display.Javascript object>

In [5]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

## Purpose 😇

The purpose of this notebook is to show you how you can create a simple, state-of-the-art time series classification model using the great **fastai2** library in 4 steps:
1. Import libraries
2. Prepare data
3. Build learner
4. Train model

In general, there are 3 main ways to classify time series, based on the input to the neural network:

- raw data

- image data (encoded from raw data)

- feature data (extracted from raw data)

In this notebook, we will use the first approach. We will cover other approaches in future notebooks.

Throughout the notebook you will see this ✳️. It means there's some value you need to select.

## Import libraries 📚

In [6]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
from timeseries.imports import *
from timeseries.data import *
from timeseries.models import *
print('fastai2 :', fastai2.__version__)
print('torch   :', torch.__version__)

fastai2 : 0.0.12
torch   : 1.4.0


## Prepare data 🔢

### Download data ⬇️

In this notebook, we'll use one of the most widely used time series classification databases: the UEA & UCR Time Series Classification Repository. As of Sep 2019 it contains 128 univariate datasets and 30 multivariate datasets.


In [8]:
print(len(get_UCR_univariate_list()))
#pprint.pprint(get_UCR_univariate_list())

128


In [9]:
print(len(get_UCR_multivariate_list()))
# pprint.pprint(get_UCR_multivariate_list())

30


With UCR data, loading a dataset is very easy. Let's select one. You can change this to any name from the previous lists (univariate or multivariate).

In [10]:
# dataset id
dsid = 'ArticularyWordRecognition'   # ✳️

In [12]:
X_train, y_train, X_valid, y_valid = get_UCR_data(dsid, path='..', verbose=True) # indicate path to data/UCR dir

Dataset: ArticularyWordRecognition
X_train: (275, 9, 144)
y_train: (275,)
X_valid: (300, 9, 144)
y_valid: (300,) 



In [13]:
# Loop-local names keep the dataset selected above (dsid, X_train, ...) from being overwritten
for i, name in enumerate(get_UCR_univariate_list()):
    try:
        X_tr, y_tr, X_te, y_te = get_UCR_data(name, path='..', verbose=False)
        print(f'{i:3} {name:30} {len(y_tr):5} {len(y_te):5} {X_tr.shape[1]:4} {X_tr.shape[-1]:5} {len(np.unique(y_tr)):2}')
        del X_tr, y_tr, X_te, y_te
    except Exception:
        # datasets that fail to load are flagged with ***
        print(f'{i:3} {name:30}***')

  0 ACSF1                            100   100    1  1460 10
  1 Adiac                            390   391    1   176 37
  2 AllGestureWiimoteX               300   700    1   385 10
  3 AllGestureWiimoteY               300   700    1   369 10
  4 AllGestureWiimoteZ               300   700    1   326 10
  5 ArrowHead                         36   175    1   251  3
  6 Beef                              30    30    1   470  5
  7 BeetleFly                         20    20    1   512  2
  8 BirdChicken                       20    20    1   512  2
  9 BME                               30   150    1   128  3
 10 Car                               60    60    1   577  4
 11 CBF                               30   900    1   128  3
 12 Chinatown                         20   343    1    24  2
 13 ChlorineConcentration            467  3840    1   166  3
 14 CinCECGTorso                      40  1380    1  1639  4
 15 Coffee                            28    28    1   286  2
 16 Computers           

In [14]:
# Loop-local names keep the dataset selected above (dsid, X_train, ...) from being overwritten
for i, name in enumerate(get_UCR_multivariate_list()):
    try:
        X_tr, y_tr, X_te, y_te = get_UCR_data(name, path='..', verbose=False)
        print(f'{i:3} {name:30} {len(y_tr):5} {len(y_te):5} {X_tr.shape[1]:4} {X_tr.shape[-1]:5} {len(np.unique(y_tr)):2}')
        del X_tr, y_tr, X_te, y_te
    except Exception:
        # datasets that fail to load are flagged with ***
        print(f'{i:3} {name:30}***')

  0 ArticularyWordRecognition        275   300    9   144 25
  1 AtrialFibrillation                15    15    2   640  3
  2 BasicMotions                      40    40    6   100  4
  3 CharacterTrajectories           1422  1436    3   180 20
  4 Cricket                          108    72    6  1197 12
  5 DuckDuckGeese                 ***
  6 EigenWorms                       128   131    6 17984  5
  7 Epilepsy                         137   138    3   206  4
  8 ERing                             30   270    4    65  6
  9 EthanolConcentration             261   263    3  1751  4
 10 FaceDetection                   5890  3524  144    62  2
 11 FingerMovements                  316   100   28    50  2
 12 HandMovementDirection            160    74   10   400  4
 13 Handwriting                      150   850    3   152 26
 14 Heartbeat                        204   205   61   405  2
 15 InsectWingbeat                ***
 16 JapaneseVowels                   270   370   12    26  9
 17 Libra

☣️ **When you prepare your own data, it must be in a 3-d array whose axes are, in this order:**

1. Number of samples
2. Dimensions
3. Time series length (aka time steps)
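
As a minimal sketch of that layout, assuming your own data starts as equal-length numpy arrays (the helper `to_3d` is hypothetical and not part of the library), you could add the missing channel axis for univariate data like this:

```python
import numpy as np

def to_3d(samples):
    """Return an array shaped (n_samples, n_channels, seq_len). Hypothetical helper."""
    arr = np.asarray(samples, dtype=np.float32)
    if arr.ndim == 2:            # univariate input: (n_samples, seq_len)
        arr = arr[:, None, :]    # insert a channel axis of size 1
    if arr.ndim != 3:
        raise ValueError(f"expected 2 or 3 dimensions, got {arr.ndim}")
    return arr

# 5 univariate series with 20 time steps
X_uni = to_3d(np.random.rand(5, 20))
# 5 multivariate series with 3 channels and 20 time steps
X_multi = to_3d(np.random.rand(5, 3, 20))
print(X_uni.shape, X_multi.shape)
```

Series of different lengths would need padding or cropping first, which this sketch does not cover.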

All UEA & UCR Time Series Classification data have already been split between train and valid. When you use your own data, you'll have to split it yourself. We'll see examples of this in future notebooks.
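
For your own data, one possible way to split is a stratified random split done with plain numpy. This is only a sketch: `stratified_split` and its arguments are hypothetical names, and the 20% validation share is an assumption.

```python
import numpy as np

def stratified_split(y, valid_pct=0.2, seed=1234):
    """Return (train_idx, valid_idx) with roughly equal class proportions."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, valid_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))   # shuffle this class
        n_valid = max(1, int(round(len(idx) * valid_pct)))
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return np.sort(train_idx), np.sort(valid_idx)

X = np.random.rand(100, 3, 50)        # (samples, channels, time steps)
y = np.repeat(np.arange(4), 25)       # 4 balanced classes
train_idx, valid_idx = stratified_split(y)
X_train, y_train = X[train_idx], y[train_idx]
X_valid, y_valid = X[valid_idx], y[valid_idx]
print(X_train.shape, X_valid.shape)
```

Stratifying matters most on small datasets, where a purely random split can leave a class out of the validation set.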

### Prepare databunch 💿

You always need to define the batch size (`bs`) when you create the databunch, the object that contains all the data required for training.

It's also best practice to scale the data using the train stats. There are several options to choose from:

1. the scaling method: standardization or normalization.

2. how the stats are calculated: across all samples, per channel, or per sample.

3. the scale range (for normalization only).

The most common practice is to standardize data per channel.
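
To make the options concrete, here is a minimal numpy sketch of standardization with train stats. It is not the library's implementation: `standardize` is a hypothetical helper, and the way `by_sample` is handled is an assumption about what per-sample scaling means.

```python
import numpy as np

def standardize(X_train, X_valid, by_channel=True, by_sample=False, eps=1e-8):
    """Standardize (samples, channels, time steps) arrays. Hypothetical helper."""
    if by_sample:
        # per-sample scaling: every sample uses its own stats, nothing is shared
        axes = (2,) if by_channel else (1, 2)
        def scale(X):
            return (X - X.mean(axis=axes, keepdims=True)) / (X.std(axis=axes, keepdims=True) + eps)
        return scale(X_train), scale(X_valid)
    # stats come from the train set only and are reused for the valid set
    axes = (0, 2) if by_channel else (0, 1, 2)
    mean = X_train.mean(axis=axes, keepdims=True)
    std = X_train.std(axis=axes, keepdims=True) + eps
    return (X_train - mean) / std, (X_valid - mean) / std

rng = np.random.default_rng(0)
offsets = np.array([0.0, 10.0, 100.0])[None, :, None]   # channels on very different scales
X_train = rng.normal(size=(32, 3, 50)) + offsets
X_valid = rng.normal(size=(16, 3, 50)) + offsets
X_train_s, X_valid_s = standardize(X_train, X_valid)
print(X_train_s.mean(axis=(0, 2)).round(3), X_train_s.std(axis=(0, 2)).round(3))
```

Scaling per channel keeps one channel with large values from dominating the others, which is probably why it is the usual choice.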

In [None]:
bs = 64                            # ✳️
seed = 1234                        # ✳️
scale_type = 'standardize'         # ✳️ 'standardize', 'normalize'
scale_by_channel = True            # ✳️ 
scale_by_sample  = False           # ✳️ 
scale_range = (-1, 1)              # ✳️ for normalization only: usually left to (-1, 1)
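
For the normalization case, `scale_range` sets the interval the train data is mapped onto. A minimal per-channel sketch, assuming min-max scaling with train stats (`normalize` is a hypothetical helper, not the library's code):

```python
import numpy as np

def normalize(X_train, X_valid, scale_range=(-1, 1), eps=1e-8):
    """Min-max scale each channel to scale_range using train min/max."""
    lo, hi = scale_range
    x_min = X_train.min(axis=(0, 2), keepdims=True)
    x_max = X_train.max(axis=(0, 2), keepdims=True)
    def scale(X):
        return lo + (X - x_min) * (hi - lo) / (x_max - x_min + eps)
    return scale(X_train), scale(X_valid)

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 5, size=(32, 3, 50))
X_valid = rng.uniform(0, 5, size=(16, 3, 50))
X_train_n, X_valid_n = normalize(X_train, X_valid)
print(X_train_n.min(axis=(0, 2)).round(3), X_train_n.max(axis=(0, 2)).round(3))
```

Because the min and max come from the train set, validation values can fall slightly outside the range.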

Now, the last step in data preparation is to build the databunch.
Time series data may come as numpy arrays, pandas dataframes, etc.
The 2 most common ways to load data into a databunch are from numpy arrays/torch tensors or from a pandas dataframe. Let's see how we'd work in either case.

#### From 3D numpy arrays/ torch tensors

1) You need to first create ItemLists from TimeSeriesList (custom type of ItemList built for Time Series)

2) You need to label the ItemLists. You'll find a lot of information [here](https://docs.fast.ai/data_block.html)

3) You enter the train bs and val_bs and create the databunch object.

4) You add features and seq_len.

In [None]:
db = (ItemLists('.', TimeSeriesList(X_train), TimeSeriesList(X_valid))
      .label_from_lists(y_train, y_valid)
      .databunch(bs=min(bs, len(X_train)), val_bs=min(len(X_valid), bs * 2), num_workers=cpus, device=device)
      .scale(scale_type=scale_type, scale_by_channel=scale_by_channel, 
             scale_by_sample=scale_by_sample,scale_range=scale_range)
     )
db

#### From pandas dataframe

Let's now extract data from a pandas dataframe. Since we don't have the UCR data available as a dataframe, we'll first need to create and save it. You won't need to do this when you have a time series dataframe.

In [None]:
dsid = 'NATOPS' 
X_train, y_train, X_valid, y_valid = get_UCR_data(dsid)
# Stack train and valid once, then build one block of rows per channel:
# [channel id | time steps ... | target]
X_all = np.concatenate((X_train, X_valid))
y_all = np.concatenate((y_train, y_valid))
data = np.concatenate([
    np.concatenate((np.full((len(X_all), 1), ch), X_all[:, ch], y_all[:, None]), axis=-1)
    for ch in range(X_all.shape[-2])
])
df = pd.DataFrame(data, columns=['feat'] + list(np.arange(X_train.shape[-1]).astype('str')) + ['target'])
df.to_csv(path/f'data/UCR/{dsid}/{dsid}.csv', index=False)
pd.read_csv(path/f'data/UCR/{dsid}/{dsid}.csv')
print(df.shape)
df.head()

You would actually start here, loading an existing dataframe.

In [None]:
dsid = 'NATOPS'   # ✳️
df = pd.read_csv(path/f'data/UCR/{dsid}/{dsid}.csv')
print(df.shape)
display(df.head())

🔎 To create the TimeSeriesList, you need to select only the columns that contain the time series, neither the target nor the feature column (for multivariate TS).
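
To illustrate the column selection, here is a small pandas sketch on a toy dataframe with the same layout as the one above (a `feat` column, one column per time step, and a `target` column). The toy data and variable names are made up, and the sketch assumes the rows of every channel are in the same sample order.

```python
import numpy as np
import pandas as pd

# Toy frame: 4 samples, 2 channels, 5 time steps, stacked channel by channel
n_samples, n_channels, seq_len = 4, 2, 5
X = np.arange(n_samples * n_channels * seq_len, dtype=float).reshape(n_samples, n_channels, seq_len)
y = np.array([0, 1, 0, 1])
blocks = []
for ch in range(n_channels):
    block = pd.DataFrame(X[:, ch], columns=[str(t) for t in range(seq_len)])
    block.insert(0, 'feat', ch)      # channel id goes first
    block['target'] = y              # label goes last
    blocks.append(block)
df = pd.concat(blocks, ignore_index=True)

ts_cols = df.columns[1:-1]           # time steps only: drop 'feat' and 'target'
channels = [g[ts_cols].to_numpy() for _, g in df.groupby('feat', sort=True)]
X_back = np.stack(channels, axis=1)  # (samples, channels, time steps)
y_back = df.loc[df['feat'] == 0, 'target'].to_numpy()
print(X_back.shape, y_back.shape)
```

This is the same `df.columns.values[1:-1]` slice used in the databunch cell: everything between the first and the last column.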

🔎 You should use **label_cls=CategoryList** when labels are floats but it is a classification problem. Otherwise, the fastai library would take it as a regression problem.
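
Conceptually, treating float labels as categories means mapping each distinct value to an integer class index. A numpy-only illustration of that idea (not what the library does internally, and the label values are made up):

```python
import numpy as np

y = np.array([1.0, 2.0, 6.0, 2.0, 1.0])             # class ids stored as floats
classes, y_idx = np.unique(y, return_inverse=True)  # sorted distinct labels + index per sample
print(classes)
print(y_idx)
```

With a regression target, the floats would instead be used as they are, which is why the label class has to be stated explicitly.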

1) You first need to create a TimeSeriesList (custom type of ItemList built for Time Series) from the dataframe. As cols you should only enter the columns that hold the time series (X values, not y).

2) Then you split the TimeSeriesList into 2 lists (train and valid). There are multiple ways to do that. More info [here](https://docs.fast.ai/data_block.html)

3) You need to label the ItemLists. You'll find a lot of information [here](https://docs.fast.ai/data_block.html)

4) You enter the train bs and val_bs and create the databunch object.

5) You add features and seq_len.

In [None]:
db = (TimeSeriesList.from_df(df, '.', cols=df.columns.values[1:-1], feat='feat')
      .split_by_rand_pct(valid_pct=0.2, seed=seed)
      .label_from_df(cols='target', label_cls=CategoryList)
      .databunch(bs=bs,  val_bs=bs * 2,  num_workers=cpus,  device=device)
      .scale(scale_type=scale_type, scale_by_channel=scale_by_channel, 
             scale_by_sample=scale_by_sample,scale_range=scale_range)
     )
db

### Visualize data

In [None]:
db.show_batch()

## Build learner 🏗

In [None]:
from torchtimeseries.models import *
# Select one arch from these state-of-the-art time series/ 1D models:
# ResCNN, FCN, InceptionTime, ResNet
arch = InceptionTime                     # ✳️   
arch_kwargs = dict()                     # ✳️ 
opt_func = Ranger                        # ✳️ a state-of-the-art optimizer
loss_func = LabelSmoothingCrossEntropy() # ✳️
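
As a rough idea of what a label-smoothing loss computes, here is a numpy sketch of the common formulation: a weighted mix of the usual cross-entropy and the cross-entropy against a uniform target. The function name and `eps=0.1` are assumptions, and the library's own class may differ in details.

```python
import numpy as np

def label_smoothing_ce(logits, targets, eps=0.1):
    """Mean label-smoothed cross-entropy for logits (n, c) and integer targets (n,)."""
    logits = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]            # standard cross-entropy term
    uniform = -log_probs.mean(axis=1)                             # cross-entropy vs. a uniform target
    return float(((1 - eps) * nll + eps * uniform).mean())

logits = np.array([[4.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
targets = np.array([0, 1])
plain = label_smoothing_ce(logits, targets, eps=0.0)     # eps=0 is ordinary cross-entropy
smoothed = label_smoothing_ce(logits, targets, eps=0.1)
print(plain, smoothed)
```

For confident, correct predictions like these, the smoothed loss is larger than the plain one, which discourages the model from becoming overconfident.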

In [None]:
model = arch(db.features, db.c, **arch_kwargs).to(device)
learn = Learner(db, model, opt_func=opt_func, loss_func=loss_func)
learn.save('stage_0')
print(learn.model)
print(learn.summary())

## Train model 🚵🏼‍

### LR find 🔎

In [None]:
learn.load('stage_0')
learn.lr_find()
learn.recorder.plot(suggestion=True)

### Train 🏃🏽‍♀️

In [None]:
epochs = 100         # ✳️ 
max_lr = 1e-2        # ✳️ 
warmup = False       # ✳️
pct_start = .7       # ✳️
metrics = [accuracy] # ✳️
wd = 1e-2

In [None]:
learn.metrics = metrics
learn.load('stage_0')
learn.fit_one_cycle(epochs, max_lr=max_lr, pct_start=pct_start, moms=(.95, .85) if warmup else (.95, .95),
                    div_factor=25.0 if warmup else 1., wd=wd)
learn.save('stage_1')
learn.recorder.plot_lr()
learn.recorder.plot_losses()
learn.recorder.plot_metrics()
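
The `warmup`, `pct_start` and `div_factor` settings shape the learning rate curve. The sketch below approximates a cosine one-cycle schedule in plain Python to show their effect. It is not the library's exact code: `one_cycle_lr` is a hypothetical helper and the final learning rate (`final_div`) is an assumption.

```python
import math

def one_cycle_lr(n_iter, max_lr=1e-2, pct_start=0.7, div_factor=25.0, final_div=None):
    """Cosine one-cycle schedule: start_lr -> max_lr -> end_lr over n_iter steps."""
    if final_div is None:
        final_div = div_factor * 1e4          # assumption: very small final learning rate
    start_lr, end_lr = max_lr / div_factor, max_lr / final_div
    n_up = int(n_iter * pct_start)            # iterations spent going up to max_lr

    def cos_anneal(a, b, pct):                # moves from a (pct=0) to b (pct=1)
        return b + (a - b) / 2 * (math.cos(math.pi * pct) + 1)

    lrs = []
    for i in range(n_iter):
        if i < n_up:
            lrs.append(cos_anneal(start_lr, max_lr, i / max(1, n_up)))
        else:
            lrs.append(cos_anneal(max_lr, end_lr, (i - n_up) / max(1, n_iter - n_up - 1)))
    return lrs

with_warmup = one_cycle_lr(100, div_factor=25.0)   # warmup = True: starts at max_lr / 25
no_warmup = one_cycle_lr(100, div_factor=1.0)      # warmup = False: starts at max_lr
print(with_warmup[0], no_warmup[0], max(with_warmup))
```

With `div_factor=1.` the first `pct_start` share of training runs at a flat `max_lr`, and only the final part anneals the learning rate down.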

In [None]:
archs_names, acc_, acces_, acc5_, n_params_ = [], [], [], [], []
archs_names.append(arch.__name__)
# epoch at which the training loss reached its minimum
early_stop = math.ceil(np.argmin(learn.recorder.losses) / len(learn.data.train_dl))
# accuracy after the last epoch
acc_.append('{:.5}'.format(learn.recorder.metrics[-1][0].item()))
# accuracy at the epoch with the lowest training loss
acces_.append('{:.5}'.format(learn.recorder.metrics[early_stop - 1][0].item()))
# best accuracy over all epochs
acc5_.append('{:.5}'.format(np.mean(np.max(learn.recorder.metrics))))
n_params_.append(count_params(learn))
clear_output()
df = (pd.DataFrame(np.stack((archs_names, acc_, acces_, acc5_, n_params_)).T,
                   columns=['arch', 'accuracy', 'accuracy train loss', 'max_accuracy','n_params'])
      .sort_values('accuracy train loss', ascending=False).reset_index(drop=True))
display(df)
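
The `early_stop` line converts the iteration with the lowest training loss into an epoch number. A small worked example with made-up losses and accuracies (all values and names here are hypothetical):

```python
import math
import numpy as np

batches_per_epoch = 5
# Hypothetical per-batch training losses for 4 epochs (20 iterations):
# the loss falls for 12 iterations, then creeps back up
losses = np.concatenate([np.linspace(2.0, 0.5, 12), np.linspace(0.6, 0.9, 8)])
epoch_acc = [0.60, 0.75, 0.82, 0.80]          # hypothetical accuracy after each epoch

best_iter = int(np.argmin(losses))            # iteration index of the lowest loss
best_epoch = math.ceil(best_iter / batches_per_epoch)
acc_at_best = epoch_acc[best_epoch - 1]       # accuracy at that epoch
acc_final = epoch_acc[-1]                     # accuracy after the last epoch
print(best_iter, best_epoch, acc_at_best, acc_final)
```

Note that if the minimum falls on the very first iteration, the epoch number is 0 and `early_stop - 1` indexes the last epoch, so a guard such as `max(1, ...)` may be worth adding.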

### Results

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
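
The plot above is built from a table of counts: rows are the actual classes and columns the predicted ones. A numpy sketch of how such a table can be computed, with made-up labels and a hypothetical `confusion_matrix` helper:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often each actual class (row) is predicted as each class (column)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)        # accumulates repeated (row, col) pairs
    return cm

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, 3)
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # share of each actual class predicted correctly
print(cm)
print(per_class_acc)
```

Off-diagonal cells show which classes get confused with each other, which is often more informative than the overall accuracy.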