## Data File Format

The data we received from the company is a comma-delimited CSV file. Each line of the file has the following format:

`metricId, hostId, controlPointId, n, firstTime, lastTime, warn, crit, v1:s1:t1, ..., vn:sn:tn`

Here:
* `metricId`: the id of the sensor.
* `hostId`: the id of a host or device. A host contains multiple metrics. We assume different hosts are independent of one another.
* `controlPointId`: can be ignored for now.
* `n`: the number of data points we have for the current metric.
* `firstTime, lastTime`: time stamps of the first and last data points.
* `warn, crit`: values designating two status thresholds, warning and critical.
* `vi:si:ti`: the i-th data point. `vi` is the actual value, `si` is the current status of the metric (OK, warning or critical), and `ti` is the Unix time stamp.
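
To make the field layout concrete, here is a minimal sketch that splits one line into its eight fixed fields and its `value:status:time` triples. The names `DataPoint`, `MetricRecord` and `parse_line`, as well as the sample line, are hypothetical and only serve as illustration. The sketch assumes that time stamps are integer seconds and leaves the status, `warn` and `crit` as raw strings, since their exact encoding is not specified above.

```python
from typing import NamedTuple


class DataPoint(NamedTuple):
    value: float
    status: str   # kept as a raw string; the status encoding is an assumption
    time: int     # assumed to be an integer Unix time stamp


class MetricRecord(NamedTuple):
    metric_id: str
    host_id: str
    control_point_id: str
    n: int
    first_time: str
    last_time: str
    warn: str
    crit: str
    points: list


def parse_line(line: str) -> MetricRecord:
    """Split one csv line into the 8 fixed fields and the data points."""
    fields = [f.strip() for f in line.strip().split(",")]
    head, tail = fields[:8], fields[8:]
    points = []
    for item in tail:
        v, s, t = item.split(":")  # value:status:time
        points.append(DataPoint(float(v), s, int(t)))
    return MetricRecord(head[0], head[1], head[2], int(head[3]), *head[4:8], points)


# Hypothetical sample line with n = 2 data points
sample = "m1, h1, cp1, 2, 1400000000, 1400000060, 80, 90, 71.5:0:1400000000, 85.0:1:1400000060"
record = parse_line(sample)
print(record.metric_id, record.n, [p.value for p in record.points])
```

Comparing `record.n` with `len(record.points)` is a cheap sanity check for truncated lines.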


## Method to Read Data
The following Python function reads the data.

In [None]:
# required packages
import numpy as np
import linecache

def get_data(FILE_NAME, line_indices=(14, 15)):
    '''
    get_data: read selected metrics from the data file.
    @param FILE_NAME: path to the data file
    @param line_indices: 1-based line numbers of the metrics to be retrieved
    @return: data matrix of shape [number of instances (n), number of time series (length of line_indices)]
    @return: metric_ids, host_ids, header_names
    '''
    # This block processes the header line, if it exists
    header = True  # True means columns 1 to 8 of the first line of the csv file are variable names
    # dictionaries to store metric ids and host ids against the line indices
    # (initialised up front so the return statement also works when header is False)
    metric_ids = {}
    host_ids = {}
    header_names = []
    if header:
        a = linecache.getline(FILE_NAME, 1)
        b = a.split(',')
        header_names = b[0:8]  # the 8 fixed fields that precede the data points
        
    # empty matrix to store data
    data = []
    
    # line_indices: line numbers of the time series that belong to the same device
    for line_index in line_indices:
        # retrieve the different fields of a line
        a = linecache.getline(FILE_NAME, line_index)
        b = a.split(',')
        
        # stores the metricID and hostID against line numbers
        if header == True:
            metric_ids[line_index] = b[0]
            host_ids[line_index] = b[1]
        # values of the current metric, v1..vn     
        V = []                
        for i in range(8,len(b)):            
            c = b[i]
            v, s, t = c.split(":") # value:status:time
            V.append(float(v))
        # append current values to the data matrix
        data.append(V)
    
    # convert data to a numpy array to be used later by scikit-learn methods
    data = np.array(data)
    data = np.transpose(data)
    # returned data matrix has shape [number of instances (n), number of time series (length of line_indices)]
    # each column contains the sequence of a time series
    return (data, metric_ids, host_ids, header_names)
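
Because `get_data` addresses metrics by line number through `linecache`, it helps to see how `linecache.getline` counts lines. The sketch below writes a small, hypothetical file (the column names and values are made up) to a temporary path and reads individual lines from it. Line numbers start at 1, so with a header row the first metric is on line 2, and a line number past the end of the file yields an empty string rather than an exception.

```python
import linecache
import os
import tempfile

# Hypothetical three-line file: one header row followed by two metric rows.
rows = [
    "metricId,hostId,controlPointId,n,firstTime,lastTime,warn,crit",
    "m1,h1,cp1,1,100,100,80,90,1.0:0:100",
    "m2,h1,cp1,1,100,100,80,90,2.0:0:100",
]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("\n".join(rows) + "\n")
    path = f.name

header_line = linecache.getline(path, 1)    # line numbers are 1-based
first_metric = linecache.getline(path, 2)   # first metric row when a header exists
missing = linecache.getline(path, 99)       # past the end: '' and no exception

print(repr(first_metric))
print(repr(missing))

# linecache caches file contents; clear it before removing the temporary file.
linecache.clearcache()
os.remove(path)
```

Checking for an empty result before splitting a line is one way to catch a wrong entry in `line_indices` early.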