## Data File Format

The data we received from the company is a comma-delimited CSV file. Each line of the file has the following format:

`metricId, hostId, controlPointId, n, firstTime, lastTime, warn, crit, v1:s1:t1, ..., vn:sn:tn`

Here:
* `metricId`: the id of the sensor.
* `hostId`: the id of a host or device. A host contains multiple metrics. We assume different hosts are independent of one another.
* `controlPointId`: can be ignored for now.
* `n`: the number of data points available for the current metric.
* `firstTime, lastTime`: time stamps of the first and last data points.
* `warn, crit`: values designating the two status thresholds, warning and critical.
* `vi:si:ti`: the i-th data point. `vi` is the actual value, `si` is the status of the metric at that time (OK, warning or critical), and `ti` is the Unix time stamp.
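
As a minimal sketch of how one such line breaks down, the snippet below splits a line into its eight fixed fields and its data points. The function name `parse_line`, the dictionary keys, and the sample line (including its ids, thresholds and status codes) are illustrative assumptions and do not come from the data set. It also assumes that time stamps are integer Unix seconds.

```python
def parse_line(line):
    """Split one data line into its fixed fields and its data points."""
    fields = line.strip().split(",")
    points = []
    # Everything after the 8 fixed fields is a value:status:time triple.
    for item in fields[8:]:
        value, status, timestamp = item.split(":")
        # Assumption: time stamps are integer Unix seconds.
        points.append((float(value), status, int(timestamp)))
    return {
        "metric_id": fields[0],
        "host_id": fields[1],
        "control_point_id": fields[2],
        "n": int(fields[3]),
        "first_time": fields[4],
        "last_time": fields[5],
        "warn": fields[6],
        "crit": fields[7],
        "points": points,
    }


# Hypothetical sample line, made up for illustration only.
sample = "m1,h1,cp1,3,1000,1120,80,90,10.5:0:1000,12.0:0:1060,85.2:1:1120"
record = parse_line(sample)
```

Here `record["points"]` holds one `(value, status, time)` tuple per data point, so its length should match `record["n"]` for a well-formed line.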

## Method to Read Data
The following Python method reads the data.

In [None]:
# required packages
import numpy as np
import linecache

'''
get_data: method to read certain metrics from the data file.
@param: FILE_NAME is the path to the data file
@param: line_indices is the list of line numbers (1-based; line 1 is the header) corresponding to the metrics to be retrieved
@return: data matrix of shape [number of instances (n), number of time series (length of line_indices)]
@return: metric_ids, host_ids, header_names
'''
def get_data(FILE_NAME, line_indices=(14, 15)):
    # Process the header line, if it exists.
    header = True  # True means line 1 of the csv file holds the names of the 8 fixed columns
    # dictionaries to store metric ids and host ids against the line indices
    metric_ids = {}
    host_ids = {}
    header_names = []
    if header:
        a = linecache.getline(FILE_NAME, 1)
        b = a.strip().split(',')
        header_names = b[0:8]  # the 8 fixed fields that precede the data points
        
    # empty matrix to store data
    data = []
    
    # line_indices: line numbers of the time series to read; they should belong to the same device
    for line_index in line_indices:
        # retrieve the different fields of a line
        a = linecache.getline(FILE_NAME, line_index)
        b = a.strip().split(',')
        
        # stores the metricID and hostID against line numbers
        if header == True:
            metric_ids[line_index] = b[0]
            host_ids[line_index] = b[1]
        # values of the current metric, v1..vn     
        V = []                
        for i in range(8,len(b)):            
            c = b[i]
            v, s, t = c.split(":") # value:status:time
            V.append(float(v))
        # append current values to the data matrix
        data.append(V)
    
    # convert data to numpy format to be used later by scikit-learn methods
    data = np.array(data)
    data = np.transpose(data)
    # the returned data matrix has shape [number of instances (n), number of time series (length of line_indices)]
    # each column contains the sequence of one time series
    return (data, metric_ids, host_ids, header_names)
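
Because the method relies on `linecache.getline`, line numbers start at 1, and a line number past the end of the file yields an empty string rather than an error. The sketch below illustrates both points on a small temporary file; the file contents are made up for illustration.

```python
import linecache
import os
import tempfile

# Hypothetical three-line file: one header line followed by two metrics.
rows = [
    "metricId,hostId,controlPointId,n,firstTime,lastTime,warn,crit",
    "m1,h1,cp1,2,1000,1060,80,90,1.0:0:1000,2.0:0:1060",
    "m2,h1,cp1,2,1000,1060,80,90,3.0:0:1000,4.0:0:1060",
]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("\n".join(rows) + "\n")
    path = tmp.name

header_line = linecache.getline(path, 1)   # line numbers start at 1
first_metric = linecache.getline(path, 2)  # the first metric sits on line 2
missing = linecache.getline(path, 99)      # out of range: empty string, no exception

# Clean up the cache entry and the temporary file.
linecache.clearcache()
os.remove(path)
```

An empty result is therefore worth checking for before splitting a line, since it would otherwise produce a time series with no values.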

## Testing Protocol
We assume that the multivariate time series has already been converted into the training matrix. For each variable we select the window size that works best for that variable, using cross-validation. These window sizes are later used to embed the time series into the training matrix.
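
The `windowize` helper used in the next cell is not shown in this section. The following is a hypothetical sketch of what such an embedding could look like, not the actual implementation. It assumes a single window size shared by all series, a target taken from one column (`target_col`, a parameter introduced here for illustration), and a target located `f_horizon` steps after the end of the window.

```python
import numpy as np


def windowize(data, window_size, f_horizon=1, target_col=0):
    """Embed a multivariate series into a (features, target) pair.

    Hypothetical sketch: each row holds the `window_size` most recent values
    of every series, and the target is series `target_col` observed
    `f_horizon` steps after the window ends.
    """
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    n_rows = n - window_size - f_horizon + 1
    if n_rows <= 0:
        raise ValueError("time series too short for this window and horizon")
    X = np.empty((n_rows, window_size * k))
    y = np.empty(n_rows)
    for j in range(n_rows):
        window = data[j:j + window_size, :]
        # Lay out the lags of series 0 first, then series 1, and so on.
        X[j, :] = window.T.ravel()
        y[j] = data[j + window_size + f_horizon - 1, target_col]
    return X, y


# Toy data: two series with six time steps each.
toy = np.column_stack([np.arange(6), 10 * np.arange(6)])
X, y = windowize(toy, window_size=3, f_horizon=1)
```

With six time steps, a window of three and a horizon of one, this yields three training rows of six features each.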

In [None]:
# required packages
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold  # replaces the removed sklearn.cross_validation module

'''
train_model_CV: selects the window size and the number of trees by cross-validation
@param: data is the data matrix obtained from the get_data method above
@param: n_fold is the number of folds for cross-validation
@param: windows is the list of window sizes to choose from
@param: n_trees is the list of numbers of trees to choose from
@param: f_horizon is the forecasting horizon

@return: selected window size, number of trees and the corresponding MSE
'''
def train_model_CV(data, n_fold=5, windows=(5, 10, 25, 50, 75, 100, 150),
                   n_trees=(500, 1000, 2000, 3000), f_horizon=1):

    scores = np.zeros((len(windows), len(n_trees), n_fold))

    for w_ind in range(0, len(windows)):
        # obtain the training matrix from the time series data with a given window size
        # (windowize is assumed to be defined elsewhere in the notebook)
        (w_train, y_wtrain) = windowize(data, windows[w_ind], f_horizon=f_horizon)

        # cross-validation
        kf = KFold(n_splits=n_fold)
        for tree_ind in range(0, len(n_trees)):
            reg = RandomForestRegressor(n_estimators=n_trees[tree_ind], n_jobs=24)
            n = 0
            for train_index, val_index in kf.split(w_train):
                # getting training and validation data
                X_train, X_val = w_train[train_index, :], w_train[val_index, :]
                y_train, y_val = y_wtrain[train_index], y_wtrain[val_index]
                # train the model and compute the MSE on the validation fold
                reg.fit(X_train, y_train)
                pred_val = reg.predict(X_val)
                scores[w_ind, tree_ind, n] = mean_squared_error(y_val, pred_val)
                n += 1
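    # average the MSE over the folds
    m_scores = np.average(scores, axis=2)
    mse = m_scores.min()

    # select the window size and number of trees with the smallest mean MSE;
    # np.unravel_index yields scalar indices that can index the sequences directly
    (b_w_ind, b_tree_ind) = np.unravel_index(np.argmin(m_scores), m_scores.shape)
    window_size, nr_tree = windows[b_w_ind], n_trees[b_tree_ind]

    return (window_size, nr_tree, mse)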
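
The final selection step amounts to locating the smallest entry of the fold-averaged score matrix. The sketch below shows one way to do this on a small matrix; the candidate lists and the score values are made up for illustration and are not results from the data.

```python
import numpy as np

windows = [5, 10, 25]
n_trees = [500, 1000]

# Made-up mean MSE values: one row per window size, one column per tree count.
m_scores = np.array([[0.9, 0.8],
                     [0.4, 0.5],
                     [0.7, 0.6]])

# argmin gives the position in the flattened matrix; unravel_index maps it
# back to a (row, column) pair of scalar indices.
b_w_ind, b_tree_ind = np.unravel_index(np.argmin(m_scores), m_scores.shape)
best_window, best_trees = windows[b_w_ind], n_trees[b_tree_ind]
```

If several entries tie for the minimum, `np.argmin` returns the first one in row-major order, so the smallest window size among the tied candidates is chosen.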