# Long Short-Term Memory

In this Jupyter notebook you will find an implementation of a long short-term memory (LSTM) network using the Keras library, with scikit-learn used for scaling and scoring. It is intended to test this algorithm and to complete the [forecasting.md](https://github.com/Hurence/historian/blob/forecasting/docs/forecasting.md) document.

First we import the libraries we will use and define the 'create_dataset' function, which is used later to turn a series into input/target pairs.

In [1]:
import time
import math
import numpy as np
import sklearn.linear_model as sk
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
# LSTM
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

In [2]:
# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return np.array(dataX), np.array(dataY)
# fix random seed for reproducibility
np.random.seed(7)
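
To make the windowing concrete, here is a small sketch that runs the same helper on a toy series. The toy data is illustrative only and does not come from the CSV file. It also shows that, because of the `-1` in the loop range, the last available pair (here 4 → 5) is not emitted.

```python
import numpy as np

def create_dataset(dataset, look_back=1):
    """Same windowing logic as the notebook's helper, repeated so this runs alone."""
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back - 1):
        dataX.append(dataset[i:(i + look_back), 0])
        dataY.append(dataset[i + look_back, 0])
    return np.array(dataX), np.array(dataY)

# Toy single-column series 0, 1, ..., 5 (illustrative data)
toy = np.arange(6, dtype='float32').reshape(-1, 1)

# look_back=1: each input is one value, the target is the next value
X1, Y1 = create_dataset(toy, look_back=1)
print(X1.ravel(), Y1)   # inputs 0..3, targets 1..4

# look_back=2: each input is a window of two consecutive values
X2, Y2 = create_dataset(toy, look_back=2)
print(X2, Y2)           # windows [0,1], [1,2], [2,3], targets 2, 3, 4
```

With `look_back=1` the network only ever sees the previous value, which is the setting used in the rest of this notebook.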

#### 1) Dataset

We load the CSV file and keep only its first four columns, which are the ones used in the next steps.

In [3]:
# Load the dataset
# ts_data = pd.read_csv('../data/dataHistorian.csv', sep=';', encoding='cp1252')
ts_data = pd.read_csv('../data/it-data-4metrics.csv', sep=',')

# Delete the useless columns
ts_data = ts_data.iloc[:,0:4]
ts_data.head()

| Unnamed: 0 | metric_id | timestamp | value | metric_name |
|---|---|---|---|---|
| 0 | 091c334c-a90a-4d8f-ba75-2c936220cd64 | 1575157723 | 13.375 | cpu_prct_used |
| 1 | 091c334c-a90a-4d8f-ba75-2c936220cd64 | 1575157423 | 13.5 | cpu_prct_used |
| 2 | 091c334c-a90a-4d8f-ba75-2c936220cd64 | 1575157123 | 13.375 | cpu_prct_used |
| 3 | 091c334c-a90a-4d8f-ba75-2c936220cd64 | 1575156823 | 13.5 | cpu_prct_used |
| 4 | 091c334c-a90a-4d8f-ba75-2c936220cd64 | 1575156523 | 13.75 | cpu_prct_used |


We create two dictionaries to classify all the time series according to their metric_name and their metric_id.

In [4]:
# Build dic_name (metric_name -> list of metric_id) and dic_id (metric_id -> [metric_name])
dic_name = {}
dic_id = {}
for indx in ts_data.index:
    if ts_data['metric_name'][indx] not in dic_name.keys():
        dic_name[ts_data['metric_name'][indx]] = []
    if ts_data['metric_id'][indx] not in dic_name[ts_data['metric_name'][indx]]:
        dic_name[ts_data['metric_name'][indx]].append(ts_data['metric_id'][indx])
        dic_id[ts_data['metric_id'][indx]] = [ts_data['metric_name'][indx]]
keys_name = list(dic_name.keys())
keys_id = list(dic_id.keys())
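
The same two dictionaries can also be built without a row-by-row loop. The following is a minimal sketch on a toy frame, assuming each metric_id belongs to exactly one metric_name. The rows are illustrative; only the column names match `ts_data`.

```python
import pandas as pd

# Illustrative rows only; column names match ts_data in the notebook
toy = pd.DataFrame({
    'metric_id':   ['a', 'a', 'b', 'c'],
    'metric_name': ['cpu', 'cpu', 'cpu', 'mem'],
})

# One row per distinct (metric_name, metric_id) pair, in order of first appearance
pairs = toy[['metric_name', 'metric_id']].drop_duplicates()

# metric_name -> list of metric_id (sort=False keeps first-appearance order)
dic_name = pairs.groupby('metric_name', sort=False)['metric_id'].apply(list).to_dict()

# metric_id -> [metric_name]; results are appended to this list later on
dic_id = {mid: [name] for name, mid in zip(pairs['metric_name'], pairs['metric_id'])}

print(dic_name)
print(dic_id)
```

On large files this avoids one Python-level lookup per row, at the cost of being a little less explicit than the loop.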

#### 2) Training the neural network

For each metric_id, we split the series into a training set (the first 67 % of the points) and a validation set (the remaining points). The training set is used to fit the neural network and the validation set to test it. The results are stored in dic_id so that they can be reused later.

In [5]:
sample = len(keys_id)
for i in range(sample):
    indx = keys_id[i]
    indexNames = ts_data[ ts_data['metric_id'] == indx ].index
    data = ts_data.iloc[indexNames].sort_values(by='timestamp', ascending=True).loc[:,'value']
    dataset = data.values
    dataset = dataset.astype('float32')
    dic_id[indx].append(dataset)
    
    # normalize the dataset
    scaler = MinMaxScaler(feature_range=(0, 1))
    dataset = scaler.fit_transform(dataset.reshape(-1, 1))
    
    # split into train and test sets
    train_size = int(len(dataset) * 0.67)
    test_size = len(dataset) - train_size
    train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
    
    # reshape into X=t and Y=t+1
    look_back = 1
    x_train, y_train = create_dataset(train, look_back)
    x_valid, y_valid = create_dataset(test, look_back)
    
    # reshape input to be [samples, time steps, features]
    x_train = np.reshape(x_train, (x_train.shape[0], 1, x_train.shape[1]))
    x_valid = np.reshape(x_valid, (x_valid.shape[0], 1, x_valid.shape[1]))
    
    # create and fit the LSTM network
    model = Sequential()
    model.add(LSTM(4, input_shape=(1, look_back)))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    start_train = time.time()
    model.fit(x_train, y_train, epochs=100, batch_size=1, verbose=0)
    end_train = time.time()
    
    # make predictions
    y_pred_train = model.predict(x_train)
    start_pred = time.time()
    y_pred_valid = model.predict(x_valid)
    end_pred = time.time()
    
    # invert predictions
    y_pred_train = scaler.inverse_transform(y_pred_train)
    y_train = scaler.inverse_transform([y_train])
    y_pred_valid = scaler.inverse_transform(y_pred_valid)
    y_valid = scaler.inverse_transform([y_valid])
    # calculate root mean squared error
    testScore = math.sqrt(mean_squared_error(y_valid[0], y_pred_valid[:,0]))
    
    dic_id[indx].append(testScore)
    dic_id[indx].append([x_train, y_train, y_pred_train])
    dic_id[indx].append([x_valid, y_valid, y_pred_valid])
    dic_id[indx].append(end_train - start_train)
    dic_id[indx].append(end_pred - start_pred)
    if (i+1) % 25 == 0:
        print("%.2f" % ((100/sample)*(i+1)),"% completed...")

9.88 % completed...
19.76 % completed...
29.64 % completed...
39.53 % completed...
49.41 % completed...
59.29 % completed...
69.17 % completed...
79.05 % completed...
88.93 % completed...
98.81 % completed...
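
In the loop above, the scaler is fitted on the whole series before the split, so the validation range influences the scaling. As a possible variant (not what the notebook does), the sketch below fits the min-max parameters on the training part only. It uses plain NumPy and the formula `(x - min) / (max - min)`, which is what `MinMaxScaler` applies for `feature_range=(0, 1)`; the series values are illustrative.

```python
import numpy as np

# Illustrative series with an upward trend
series = np.array([10, 12, 11, 15, 14, 18, 20, 19, 22], dtype='float32').reshape(-1, 1)

# Chronological split, same 67 % ratio as in the notebook
train_size = int(len(series) * 0.67)
train, test = series[:train_size], series[train_size:]

# Scaling parameters come from the training part only
lo, hi = train.min(), train.max()

def scale(x):
    """Min-max scaling with the training minimum and maximum."""
    return (x - lo) / (hi - lo)

train_s, test_s = scale(train), scale(test)

print(train_s.ravel())  # stays within [0, 1]
print(test_s.ravel())   # may leave [0, 1] when the validation data exceeds the training range
```

The trade-off is that validation inputs can fall outside [0, 1], which is closer to what happens when forecasting genuinely unseen data.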


In [6]:
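# dic_id maps each metric_id to a list, so build one row per metric_id
# (index 0: metric_name, 2: RMSE on the validation set, 5: training time, 6: inference time).
# The 'mean_squarred_error' column name is kept as is; it holds the RMSE.
rows = [{'metric_name': v[0], 'metric_id': k, 'mean_squarred_error': v[2],
         'training_time': v[5], 'inference_time': v[6]} for k, v in dic_id.items()]
pd.DataFrame(rows).to_csv('LSTM_bis.csv', encoding='utf-8')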

In [7]:
# Here we have two dictionaries:
# First, we have a link between the metric_name and their metric_id

# {'metric_name_1':[metric_id_1, metric_id_2, ...],
#  'metric_name_2':[metric_id_1, metric_id_2, ...],
#  ...}


# Second, we have all the information according to the metric_id

# {'metric_id_1':[metric_name_x, ts_data['value'], RMSE, [x_train, y_train, y_pred_train], [x_valid, y_valid, y_pred_valid], training_time, inference_time],
#  'metric_id_2':[metric_name_y, ts_data['value'], RMSE, [x_train, y_train, y_pred_train], [x_valid, y_valid, y_pred_valid], training_time, inference_time],
#  ...}
# 

#### 3) Results

In [8]:
l = []
for indx_name in keys_name:
    somme = 0
    cptr = 0
    for indx_id in dic_name[indx_name]:
        somme += dic_id[indx_id][2]
        cptr += 1
    l.append(somme/cptr)
    
# The averaged value is the validation RMSE stored at index 2, not an R2 score
dic = {'metric_name':keys_name, 'rmse_mean':l}
mean_error = pd.DataFrame(dic)
print(mean_error)

IndexError: list index out of range
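
The per-metric average computed in the cell above can also be expressed as a pandas group-by on a flat results table. This is a minimal sketch with hypothetical scores; in the notebook the values would come from `dic_id[metric_id][2]`, and the column names `rmse` and `rmse_mean` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical scores, one row per metric_id
results = pd.DataFrame({
    'metric_name': ['cpu', 'cpu', 'mem'],
    'metric_id':   ['a', 'b', 'c'],
    'rmse':        [1.0, 3.0, 0.5],
})

# Mean RMSE per metric_name, keeping first-appearance order
mean_error = (results.groupby('metric_name', sort=False)['rmse']
                     .mean()
                     .reset_index(name='rmse_mean'))
print(mean_error)
```

A flat table like this only contains metric_ids that actually have a score, so a missing result shows up as a missing row rather than as an index lookup on a short list.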

In [None]:
for i in range(sample):
    indx = keys_id[i]
    # dic_id[indx][1] is already a NumPy array (stored as data.values), so no .to_numpy() call
    series = dic_id[indx][1].reshape(-1,1)
    train_pred = dic_id[indx][3][2]
    valid_pred = dic_id[indx][4][2]
    fig, ax = plt.subplots()
    # shift train predictions for plotting
    trainPlot = np.empty_like(series)
    trainPlot[:, :] = np.nan
    trainPlot[look_back:len(train_pred)+look_back, :] = train_pred
    # shift test predictions for plotting
    validPlot = np.empty_like(series)
    validPlot[:, :] = np.nan
    validPlot[len(train_pred)+(look_back*2)+1:len(series)-1, :] = valid_pred
    # plot baseline and predictions
    ax.plot(series)
    ax.plot(trainPlot)
    ax.plot(validPlot)
plt.show()
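
The index arithmetic used to align the predictions with the original series is easier to check on a small case. The sketch below reproduces the offsets with placeholder predictions (constant values, not model output) for a series of 12 points and `look_back = 1`, without any plotting.

```python
import numpy as np

n, look_back = 12, 1
train_size = int(n * 0.67)                        # 8 points in the training part
n_train_pred = train_size - look_back - 1         # 6 windows produced by create_dataset
n_valid_pred = (n - train_size) - look_back - 1   # 2 windows for the validation part

series = np.arange(n, dtype='float32').reshape(-1, 1)
train_pred = np.ones((n_train_pred, 1), dtype='float32')        # placeholder predictions
valid_pred = np.full((n_valid_pred, 1), 2.0, dtype='float32')   # placeholder predictions

# Training predictions start at index look_back: the first target is series[look_back]
train_plot = np.full_like(series, np.nan)
train_plot[look_back:look_back + n_train_pred, :] = train_pred

# Validation predictions start at train_size + look_back
valid_plot = np.full_like(series, np.nan)
start = n_train_pred + 2 * look_back + 1
valid_plot[start:n - 1, :] = valid_pred

print(np.flatnonzero(~np.isnan(train_plot.ravel())))  # positions covered by training predictions
print(np.flatnonzero(~np.isnan(valid_plot.ravel())))  # positions covered by validation predictions
```

The gaps left as NaN correspond to the first `look_back` points of each part, which have no preceding window, and to the final pair that `create_dataset` drops.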