# Dataset: PEMS

http://www.timeseriesclassification.com/description.php?Dataset=PEMS-SF  
https://archive.ics.uci.edu/ml/datasets/PEMS-SF

### Info from data source:
Source: California Department of Transportation, www.pems.dot.ca.gov
Creator: Marco Cuturi, Kyoto University, mcuturi '@' i.kyoto-u.ac.jp

Data Set Information:

15 months' worth of daily data from the California Department of Transportation PEMS website. The data describe the occupancy rate, between 0 and 1, of different car lanes of San Francisco Bay Area freeways. The measurements cover the period from Jan. 1st 2008 to Mar. 30th 2009 and are sampled every 10 minutes. We consider each day in this database as a single time series of dimension 963 (the number of sensors which functioned consistently throughout the studied period) and length 6 x 24 = 144. We remove public holidays from the dataset, as well as two days with anomalies (March 8th 2009 and March 9th 2008) where all sensors were muted between 2:00 and 3:00 AM. This results in a database of 440 time series.
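
As a quick sanity check on the figures quoted above, the following sketch counts the calendar days in the stated period and compares them with the 440 retained series. It uses only the dates and counts given in the description; the variable names are illustrative.

```python
from datetime import date

# Period stated in the data set description (both end points inclusive)
first_day = date(2008, 1, 1)
last_day = date(2009, 3, 30)
days_in_period = (last_day - first_day).days + 1

# Number of daily series kept after removing holidays and anomalous days
series_kept = 440
days_removed = days_in_period - series_kept

print(days_in_period, days_removed)
```

The period spans 455 calendar days, so 15 days in total (public holidays plus the two anomalous days) are missing from the 440 series.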

The task is to classify each observed day as the correct day of the week, from Monday to Sunday, e.g. label it with an integer in {1,2,3,4,5,6,7}.
Each attribute describes the measurement of the occupancy rate (between 0 and 1) of a sensor location as recorded by a measuring station, at a given timestamp during the day. The ID of each station is given in the stations_list text file. For more information on the location (GPS, Highway, Direction) of each station please refer to the PEMS website. There are 963 (stations) x 144 (timestamps) = 138,672 attributes for each record.
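
The attribute count follows directly from the sampling scheme. A minimal sketch of the arithmetic, using only the numbers stated above (variable names are illustrative):

```python
# Sampling every 10 minutes gives 6 measurements per hour
samples_per_hour = 60 // 10
timestamps_per_day = samples_per_hour * 24

# Number of stations that functioned consistently throughout the period
n_stations = 963
attributes_per_record = n_stations * timestamps_per_day

print(timestamps_per_day, attributes_per_record)
```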

Relevant Papers:
[1] M. Cuturi, Fast Global Alignment Kernels, Proceedings of the International Conference on Machine Learning (ICML), 2011.


### Size:
+ Training samples: 267
+ Test samples: 173
+ Dimension: 144 timepoints x 963 channels
+ Classes: 7


In [1]:
import numpy as np
import os
import sys
import pandas as pd

CODE = 'C:\\OneDrive - Netherlands eScience Center\\Project_mcfly\\mcfly\\mcfly'
DATA = 'C:\\OneDrive - Netherlands eScience Center\\Project_mcfly\\data\\PEMS-SF'
sys.path.append(CODE)

In [2]:
file_train = os.path.join(DATA, 'PEMS-SF_TRAIN.arff')
file_test = os.path.join(DATA, 'PEMS-SF_TEST.arff')

In [3]:
def load_arff(filename):
    """Read a multivariate time series .arff file.

    Returns the data as a float array of shape (samples, timepoints, channels)
    and the class labels as a list of strings.
    """
    data = []
    labels = []
    in_data = False  # becomes True once the '@data' header line has been passed
    with open(filename) as fp:
        for line in fp:
            if in_data:
                # Each data line consists of a quoted series followed by the label
                parts = line.split("',")
                labels.append(parts[-1].replace('\n', ''))
                # Channels are separated by a literal backslash-n inside the quotes
                data.append([[x.replace("'", "") for x in row.split(',')]
                             for row in parts[0].split('\\n')])
            elif line.startswith('@data'):
                in_data = True

    # Parsed layout is (samples, channels, timepoints); swap the last two axes
    return np.swapaxes(np.array(data).astype(float), 1, 2), labels

X_train, y_train = load_arff(file_train)
X_test0, y_test0 = load_arff(file_test)

In [4]:
print("X_train.shape", X_train.shape)
print(len(y_train))

print("X_test.shape", X_test0.shape)
print(len(y_test0))

X_train.shape (267, 144, 963)
267
X_test.shape (173, 144, 963)
173
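
To make the parsing logic in `load_arff` easier to follow, here is a minimal sketch on a synthetic data line. The line layout (one quoted string per sample, channels separated by a literal backslash-n, followed by the label) is inferred from the loader above rather than from a format specification, and the toy values and the helper name `parse_data_line` are made up for illustration.

```python
import numpy as np

def parse_data_line(line):
    """Parse one synthetic data line into a (channels, timepoints) array and a label."""
    series, label = line.split("',")
    # Channels are separated by the two characters backslash + n
    rows = [row.replace("'", "").split(',') for row in series.split('\\n')]
    return np.array(rows).astype(float), label.strip()

# Toy sample with 2 channels and 3 timepoints, class label 4.0
toy_line = "'0.1,0.2,0.3\\n0.4,0.5,0.6',4.0\n"
channels_first, label = parse_data_line(toy_line)

# Same axis swap as in load_arff, here for a single sample: (timepoints, channels)
time_first = channels_first.T

print(channels_first.shape, time_first.shape, label)
```

For the real files the same steps yield 963 channels of 144 timepoints per sample, which is why the loaded arrays have shape (samples, 144, 963) after the swap.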


In [5]:
type(X_train[0,0,0])

numpy.float64

In [6]:
X_train[0,:,10]

array([0.0134, 0.0129, 0.0122, 0.0105, 0.0103, 0.0095, 0.0086, 0.0084,
       0.0079, 0.0075, 0.0075, 0.0076, 0.0073, 0.0073, 0.007 , 0.0074,
       0.0074, 0.0072, 0.0071, 0.0078, 0.0078, 0.0101, 0.0109, 0.0111,
       0.0113, 0.0126, 0.0161, 0.0175, 0.0238, 0.0247, 0.0275, 0.0314,
       0.0397, 0.0532, 0.0568, 0.0593, 0.0589, 0.0721, 0.0765, 0.0893,
       0.0947, 0.0951, 0.094 , 0.0987, 0.1094, 0.1108, 0.1159, 0.1143,
       0.1076, 0.1083, 0.1078, 0.1052, 0.1051, 0.0975, 0.0931, 0.0879,
       0.086 , 0.0861, 0.0857, 0.0834, 0.0754, 0.0745, 0.0736, 0.0731,
       0.0742, 0.0725, 0.0691, 0.0704, 0.0711, 0.072 , 0.0713, 0.0699,
       0.0683, 0.0703, 0.0707, 0.0714, 0.0719, 0.0718, 0.0683, 0.0703,
       0.071 , 0.0703, 0.0723, 0.0706, 0.0698, 0.072 , 0.0736, 0.0744,
       0.0774, 0.0743, 0.0731, 0.079 , 0.079 , 0.077 , 0.0814, 0.0794,
       0.0759, 0.0791, 0.0769, 0.0765, 0.0823, 0.081 , 0.0813, 0.0865,
       0.0892, 0.0834, 0.083 , 0.0789, 0.0755, 0.0747, 0.0723, 0.0657,
      

### Split test into test and validation:

In [7]:
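y_val = []
y_test = []
IDs_val = []
IDs_test = []

np.random.seed(1)
# Stratified split: for each class, draw half of its test samples (rounded down)
# for validation and keep the remainder as the new test set.
# Note: set() ordering of string labels can differ between Python sessions, so
# iterating over sorted(set(y_test0)) would make the split fully reproducible.
for label in list(set(y_test0)):
    idx = np.where(np.array(y_test0) == label)[0]
    idx1 = np.random.choice(idx, len(idx)//2, replace=False)
    idx2 = list(set(idx) - set(idx1))
    IDs_val.extend(idx1)
    IDs_test.extend(idx2)
    y_val.extend(len(idx1) * [label])
    y_test.extend(len(idx2) * [label])

    # Print the label and its number of samples in the original test set
    print(label, y_test0.count(label))

X_test = X_test0[IDs_test,:,:]
X_val = X_test0[IDs_val,:,:]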

2.0 25
3.0 26
1.0 30
4.0 23
7.0 20
5.0 22
6.0 27


In [9]:
print(X_test.shape, X_val.shape)
print(len(y_test), len(y_val))

(88, 144, 963) (85, 144, 963)
88 85
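
The per-class halving used above can also be written as a small reusable helper. The following is a sketch rather than the notebook's own code: the function name `stratified_half_split` is hypothetical, it iterates over the sorted labels, and it uses NumPy's `default_rng` generator instead of the global `np.random.seed`, so it will not reproduce the exact indices drawn above.

```python
import numpy as np

def stratified_half_split(labels, seed=1):
    """Split sample indices per class into two halves (first half rounded down)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    first, second = [], []
    # np.unique returns the labels sorted, so the order is stable across runs
    for label in np.unique(labels):
        idx = rng.permutation(np.where(labels == label)[0])
        half = len(idx) // 2
        first.extend(idx[:half])
        second.extend(idx[half:])
    return np.array(first), np.array(second)

# Toy labels with the same class sizes as the PEMS test set printed above
class_sizes = {'1.0': 30, '2.0': 25, '3.0': 26, '4.0': 23,
               '5.0': 22, '6.0': 27, '7.0': 20}
toy_labels = [label for label, n in class_sizes.items() for _ in range(n)]

ids_val, ids_test = stratified_half_split(toy_labels)
print(len(ids_val), len(ids_test))
```

With these class sizes the helper yields 85 validation and 88 test indices, matching the split sizes obtained above.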


## Save pre-processed data as numpy files

In [10]:
dataset_name = 'PEMS_'

output_path = 'C:\\OneDrive - Netherlands eScience Center\\Project_mcfly\\data\\processed'
np.save(os.path.join(output_path, dataset_name + 'X_train.npy'), X_train)
np.save(os.path.join(output_path, dataset_name + 'X_val.npy'), X_val)
np.save(os.path.join(output_path, dataset_name + 'X_test.npy'), X_test)
np.save(os.path.join(output_path, dataset_name + 'y_train.npy'), y_train)
np.save(os.path.join(output_path, dataset_name + 'y_val.npy'), y_val)
np.save(os.path.join(output_path, dataset_name + 'y_test.npy'), y_test)
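
When the saved files are read back, note that the labels were stored as strings such as `'1.0'`, because `load_arff` never converts them. The sketch below illustrates the round trip on small stand-in arrays written to a temporary directory; the array contents, the demo file names, and the conversion to integer class labels are illustrative assumptions, not part of the original pipeline.

```python
import os
import tempfile
import numpy as np

# Stand-in data: 4 samples, 144 timepoints, 3 channels (the real data has 963)
X_demo = np.zeros((4, 144, 3))
y_demo = ['1.0', '7.0', '3.0', '1.0']  # labels as strings, as returned by load_arff

with tempfile.TemporaryDirectory() as tmp_dir:
    np.save(os.path.join(tmp_dir, 'PEMS_X_demo.npy'), X_demo)
    np.save(os.path.join(tmp_dir, 'PEMS_y_demo.npy'), y_demo)

    X_loaded = np.load(os.path.join(tmp_dir, 'PEMS_X_demo.npy'))
    y_loaded = np.load(os.path.join(tmp_dir, 'PEMS_y_demo.npy'))

# The labels come back as a string array; convert them to integers 1..7 if needed
y_int = y_loaded.astype(float).astype(int)

print(X_loaded.shape, y_loaded.dtype.kind, y_int.tolist())
```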

## Or: create a new split of the data?