You can load the data sets either from Snappy-compressed Parquet files or from an HDF5 file store. Both formats contain the same data, so you only need to load one of them.
The Parquet files are much smaller and therefore faster to download, but you need to install pyarrow in your environment to read them. An HDF5 library is already available in the Anaconda base environment.

In [None]:
import pandas as pd

# How to load train and validation set from Parquet

In [None]:
train = pd.read_parquet("c99temp_train_pseudo.snappy.parquet")
valid = pd.read_parquet("c99temp_valid_pseudo.snappy.parquet")

# How to load train and validation set from HDF5

In [5]:
trainH5 = pd.read_hdf('c99temp.h5', key='train')
validH5 = pd.read_hdf('c99temp.h5', key='valid')
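
If you are not sure whether a Parquet engine is installed, a small helper can try Parquet first and fall back to HDF5. This is only a sketch: `load_split` and its parameters are hypothetical names, and the file names are the ones used in the cells above.

```python
from pathlib import Path

import pandas as pd


def load_split(name, parquet_dir=".", hdf_path="c99temp.h5"):
    """Load the 'train' or 'valid' split, preferring Parquet over HDF5.

    Hypothetical helper: assumes the file naming used in this notebook.
    """
    parquet_file = Path(parquet_dir) / f"c99temp_{name}_pseudo.snappy.parquet"
    if parquet_file.exists():
        try:
            return pd.read_parquet(parquet_file)
        except ImportError:
            # pandas raises ImportError when neither pyarrow nor
            # fastparquet is installed; fall through to HDF5.
            pass
    return pd.read_hdf(hdf_path, key=name)
```

Calling `load_split("train")` and `load_split("valid")` would then return the same frames as the cells above, whichever format is available.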

# A first look at the training data

In [11]:
train[train['machineNumberPseudo'] == 'B0DrwR'].sort_values("timestamp")

Unnamed: 0,timestamp,cpuType,cpuBoard,tempBoardAK0,tempBoardSLAVE,remainingLifetime,machineNumberPseudo
1779671,2015-06-17 00:44:16.259,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,37.963571,41.358571,940,B0DrwR
1798193,2015-06-17 01:04:16.258,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,37.881429,41.337143,940,B0DrwR
1089908,2015-06-17 02:09:16.259,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,38.158571,41.532857,940,B0DrwR
1835037,2015-06-17 03:34:16.258,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,37.737143,41.218571,940,B0DrwR
1191552,2015-06-17 04:09:16.260,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,38.113750,41.513750,940,B0DrwR
1712635,2015-06-17 05:14:16.258,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,37.692857,41.194286,940,B0DrwR
1496850,2015-06-17 06:54:16.259,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,38.077143,41.467143,940,B0DrwR
1264755,2015-06-17 07:44:16.258,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,37.837143,41.314286,940,B0DrwR
388047,2015-06-17 08:04:16.258,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,38.094286,41.468571,940,B0DrwR
1712648,2015-06-17 09:14:16.259,Intel(R) Celeron(R) M processor 1.50GHz,886LCD-M/ATX 08_01_07,37.777500,41.263750,940,B0DrwR


# Data Summary 

In [6]:
train.describe(include='all')

Unnamed: 0,timestamp,cpuType,cpuBoard,tempBoardAK0,tempBoardSLAVE,remainingLifetime,machineNumberPseudo
count,1851112,1851112,1851112,1851112.0,1851112.0,1851112.0,1851112
unique,1847880,6,10,,,,235
top,2015-09-06 23:59:57.067000,Intel(R) Celeron(R) M processor 1.50GHz,CB60-BX/C 09_11_02,,,,YGJdPp
freq,3,926565,794461,,,,29110
first,2011-01-03 00:23:16.032000,,,,,,
last,2016-06-26 00:13:38.610000,,,,,,
mean,,,,37.04872,40.5239,476.9044,
std,,,,37.21539,18.61576,331.326,
min,,,,-99.0,-99.0,0.0,
25%,,,,39.135,38.83333,189.0,


In [12]:
valid.describe(include='all')

Unnamed: 0,timestamp,cpuType,cpuBoard,tempBoardAK0,tempBoardSLAVE,remainingLifetime,machineNumberPseudo
count,685756,685756,685756,685756.0,685756.0,685756.0,685756
unique,685695,6,8,,,,102
top,2016-12-01 01:58:29.888000,INTEL Pentium III CPU,CB60-BX/C 09_11_02,,,,MwV yV
freq,2,348457,345763,,,,9867
first,2016-06-26 00:23:55.684000,,,,,,
last,2017-08-29 00:13:11.702000,,,,,,
mean,,,,38.279918,39.717545,364.867082,
std,,,,35.138555,22.153997,200.937949,
min,,,,-99.0,-99.0,0.0,
25%,,,,39.1875,39.0,202.0,
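
Both summaries show a minimum of -99.0 in the two temperature columns. That value may be a placeholder for missing sensor readings. This is an assumption, not something the data description states, so check it before relying on it. Under that assumption, a minimal sketch for masking the value (with `mask_sentinel` as a hypothetical helper name) could look like this:

```python
import numpy as np
import pandas as pd

TEMP_COLS = ["tempBoardAK0", "tempBoardSLAVE"]


def mask_sentinel(df, cols=TEMP_COLS, sentinel=-99.0):
    """Return a copy of df where the assumed sentinel value becomes NaN."""
    out = df.copy()
    out[cols] = out[cols].replace(sentinel, np.nan)
    return out


# Tiny made-up example, not real rows from the data set.
demo = pd.DataFrame({
    "tempBoardAK0": [37.9, -99.0, 38.1],
    "tempBoardSLAVE": [41.3, 41.5, -99.0],
})
cleaned = mask_sentinel(demo)
print(cleaned.isna().sum())
```

Masking instead of dropping rows keeps the timestamps intact, which matters if you later compute trends per machine.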


### As the timestamp ranges show, the training and validation data were not split randomly, but along the time axis.
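
The split can also be checked programmatically. The sketch below uses hypothetical helper names and assumes both frames have the `timestamp` and `machineNumberPseudo` columns shown above. It tests whether all training rows precede all validation rows and lists machines that occur in both sets.

```python
import pandas as pd


def is_time_split(train, valid, col="timestamp"):
    """True if every training timestamp is earlier than every validation one."""
    return pd.to_datetime(train[col]).max() < pd.to_datetime(valid[col]).min()


def shared_machines(train, valid, col="machineNumberPseudo"):
    """Set of machine identifiers that occur in both frames."""
    return set(train[col]) & set(valid[col])


# Made-up frames that only reuse the boundary timestamps from the summaries.
demo_train = pd.DataFrame({
    "timestamp": ["2011-01-03 00:23:16.032", "2016-06-26 00:13:38.610"],
    "machineNumberPseudo": ["A", "B"],
})
demo_valid = pd.DataFrame({
    "timestamp": ["2016-06-26 00:23:55.684", "2017-08-29 00:13:11.702"],
    "machineNumberPseudo": ["B", "C"],
})
print(is_time_split(demo_train, demo_valid))
print(shared_machines(demo_train, demo_valid))
```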

# Your Tasks

## 1. Predict remaining lifetime on the validation set 

The validation metric is your own choice.
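
As a starting point, two common regression metrics are sketched below. They are only examples, not a recommendation. Both are expressed in the same unit as `remainingLifetime`.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values for illustration.
print(mae([100, 200, 300], [110, 190, 330]))
print(rmse([100, 200, 300], [110, 190, 330]))
```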

We will evaluate your final model on a secret test set with different machine numbers.
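
Because the test set contains different machines, it may help to hold out whole machines during model selection so that no machine contributes rows to both sides. The following is a minimal sketch of such a split. `split_by_machine` and its parameters are hypothetical, and the 20% hold-out fraction is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def split_by_machine(df, holdout_frac=0.2, group_col="machineNumberPseudo", seed=0):
    """Split df so that each machine ends up entirely in one of the two parts."""
    machines = df[group_col].unique()
    rng = np.random.default_rng(seed)
    n_holdout = max(1, int(round(len(machines) * holdout_frac)))
    holdout = rng.choice(machines, size=n_holdout, replace=False)
    mask = df[group_col].isin(holdout)
    return df[~mask], df[mask]


# Made-up frame: five machines with two rows each.
demo = pd.DataFrame({
    "machineNumberPseudo": ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"],
    "remainingLifetime": range(10),
})
fit_part, holdout_part = split_by_machine(demo)
print(sorted(holdout_part["machineNumberPseudo"].unique()))
```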

## 2. Ensure a good user experience

Some aspects to think about:
- Can you explain your predictions to the user? 
    - Extrapolating the temperature trend could be helpful. 
    - Informing about the usual temperature range of similar machines might also be nice.
- Are there sudden changes in the remaining lifetime you predicted (one day it's 500 days, the next day it's only 5)? This might confuse the user.
- How confident are you about your predictions?
- Is there a minimum confidence that should be reached before alerting the user?
- Should you display information about the certainty of your prediction to the user?
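
Two of these points can be sketched in a few lines: extrapolating a temperature trend and damping sudden jumps in the predicted lifetime. The helpers below are illustrative assumptions, not a prescribed approach. They use a straight-line fit for the trend and a trailing rolling median for smoothing, and both function names are hypothetical.

```python
import numpy as np
import pandas as pd


def extrapolate_trend(timestamps, temps, days_ahead):
    """Fit a straight line to temperature over time and project it forward."""
    t = pd.to_datetime(pd.Series(timestamps))
    days = (t - t.iloc[0]).dt.total_seconds().to_numpy() / 86400.0
    slope, intercept = np.polyfit(days, np.asarray(temps, dtype=float), deg=1)
    return float(slope * (days[-1] + days_ahead) + intercept)


def smooth_predictions(preds, window=7):
    """Trailing rolling median; uses only past values, so it works day by day."""
    return pd.Series(preds, dtype=float).rolling(window, min_periods=1).median()


# Made-up daily readings and predictions for one machine.
stamps = pd.date_range("2015-06-17", periods=5, freq="D")
projected = extrapolate_trend(stamps, [38.0, 38.5, 39.0, 39.5, 40.0], days_ahead=2)
smoothed = smooth_predictions([500, 498, 5, 496, 495], window=3)
print(projected)
print(smoothed.tolist())
```

In the made-up prediction series, the single outlier of 5 days no longer appears in the smoothed output. A median reacts more slowly to genuine drops, so the window length is a trade-off between stability and responsiveness.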