# Utilizing HPAT and daal4py in Data Science Workflows

This notebook demonstrates daal4py in a data science context. It uses a cycling dataset for pyworkout-toolkit and attempts to build a linear regression model from the 5 telemetry features to predict the user's power output in the absence of a power meter.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import sys
%matplotlib inline
sys.version

This example explores workout data pulled from Strava and processed into a Parquet file for pandas and daal4py usage. Below, we use pandas to read in the Parquet file and look at the first rows of the dataframe with .head().

In [None]:
train_set = pd.read_parquet('cycling_train_dataset.pq')
train_set.head()

The data above has several key features that should be useful here.
- Altitude can affect performance, so it might be a useful feature.
- Cadence is the revolutions per minute of the crank, and may have an influence.
- Heart Rate is a measure of the body's workout strain, and has a high likelihood of influence.
- Distance is highly route dependent, so any correlation is likely to be loose.
- Speed is likely to correlate, as it ties directly into power.
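
One quick way to gauge how much a feature might matter is its Pearson correlation with power. The sketch below is for illustration only: the helper name `pearson_r` and the toy arrays are made up here and are not taken from the cycling dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Center both series, then normalize the covariance term
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0.0:
        # A constant series has no defined correlation
        return float("nan")
    return float((xc * yc).sum() / denom)

# Toy values for illustration only (not real telemetry)
speed = [10.0, 20.0, 30.0, 40.0]
power = [100.0, 200.0, 300.0, 400.0]
r = pearson_r(speed, power)
print(r)
```

On a real dataframe, pandas' `DataFrame.corr()` computes the same statistic for every pair of numeric columns.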

## Steps with HPAT and daal4py
In general, we are trying to predict 'power' in watts, to see whether we can build a model that estimates a rider's power output without the use of a cycling power meter.
1. Load the training data and clean it up
2. Train the linear regression model

## Loading and preparing the data

In the sections below, we will be using daal4py directly. After loading the data, we will split it into separate independent and dependent arrays, then use daal4py's training class to generate a workable model.

In [None]:
train_set = pd.read_parquet('cycling_train_dataset.pq')

In [None]:
# Remove entries where power==0
train_set = train_set[train_set.power!=0]
# Reduce the dataset, create X.  We drop the target, and other non-essential features.
reduced_dataset = train_set.drop(['time','power','latitude','longitude'], axis=1)
# Get the target, create Y as a 2D array of float64
target = train_set.power.values.reshape(len(train_set),1).astype(np.float64)
# This is equivalent to np.array(train_set.power.values, ndmin=2).T:
# the target must be a 2-dimensional array even though we only have 1 target

X is 1991 rows by 5 features; Y is 1991 rows by 1 column.

In [None]:
print(reduced_dataset.values.shape, target.shape)
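
The target has to be 2-dimensional even though there is only one response column. The small check below uses a made-up array in place of `train_set.power.values` and shows that the `reshape` call and the `np.array(..., ndmin=2).T` form mentioned in the comment produce the same column vector.

```python
import numpy as np

# Stand-in for train_set.power.values (made-up numbers)
power = np.array([150, 200, 175])

# Form used in the notebook: n rows, 1 column, float64
target_a = power.reshape(len(power), 1).astype(np.float64)

# Alternative from the comment: promote to 2-D (1 x n), then transpose to (n x 1)
target_b = np.array(power, ndmin=2).T.astype(np.float64)

print(target_a.shape, target_b.shape)
print(np.array_equal(target_a, target_b))
```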

## Training the model

Create the linear regression model and train it with the data. We use daal4py's linear_regression_training class to create the algorithm object, then call .compute() with the independent and dependent data as the parameters.

In [None]:
import daal4py as d4p

In [None]:
# Create a linear regression algorithm object
d4p_lm = d4p.linear_regression_training(interceptFlag=True)
# Train the model
lm_trained = d4p_lm.compute(reduced_dataset.values, target)

In [None]:
print("Model has this number of features: ", lm_trained.model.NumberOfFeatures)
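
For intuition about what training a linear model with an intercept computes, ordinary least squares can be written in a few lines of NumPy. This is a sketch on synthetic data, not daal4py's implementation, and `fit_linear` is a hypothetical helper name.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept.

    Returns (intercept, coefficients) for y ~ intercept + X @ coefficients.
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64).reshape(len(X), -1)
    # Prepend a column of ones so the first solved weight is the intercept
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

# Synthetic, noise-free data: y = 3 + 2*x0 - 1*x1
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

intercept, coef = fit_linear(X, y)
print(intercept, coef.ravel())
```

Because the synthetic data has no noise, the recovered intercept and coefficients match the generating values up to floating-point error. Real telemetry will not fit this cleanly.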

## Adding HPAT
Next, we put the steps above into a function and apply the HPAT jit decorator.

In [None]:
# First import all the hpat stuff
import hpat
import daal4py
import daal4py.hpat

In [None]:
@hpat.jit
def train():
    # Read training data
    train_set = pd.read_parquet('cycling_train_dataset.pq')
    # Remove entries where power==0
    train_set = train_set[train_set.power!=0]
    # Reduce the dataset, create X.  We drop the target, and other non-essential features.
    reduced_dataset = train_set.drop(['time','power','latitude','longitude'], axis=1)
    # Get the target, create Y as an 2d array of float64
    target = train_set.power.values.reshape(len(train_set),1).astype(np.float64)
    
    # Create a daal4py linear regression algorithm object
    d4p_lm = d4p.linear_regression_training(interceptFlag=True)
    # Train the model
    lm_trained = d4p_lm.compute(reduced_dataset.values, target)

    # Finally return the result
    return lm_trained

In [None]:
train_result = train()
print(train_result)

## Prediction (inference) with the trained model

Now that the model is trained, we can test it with the test part of the dataset. We drop the same columns so that the features match those of the trained model, then pass the data to daal4py's linear_regression_prediction class.

In [None]:
@hpat.jit
def predict(model):
    # read and clean as before
    test_set = pd.read_parquet('cycling_test_dataset.pq')
    test_set = test_set[test_set.power!=0]
    subset = test_set.drop(['time','power','latitude','longitude'], axis=1)
    
    # create our prediction algorithm object
    lm_predictor = d4p.linear_regression_prediction()
    # Now run prediction, passing the independent data and the trained model from above
    result = lm_predictor.compute(subset.values, model)
    
    return result

In [None]:
# our linear model is the 'model' attribute of the training result
pred_result = predict(train_result.model)
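
The same clean-up runs on both the training and the test file, so it can be written once. The helper below is a sketch with a hypothetical name, `prepare_xy`. The dropped columns come from the cells above, while the five feature column names in the toy frame are assumptions made for illustration. Whether such a helper compiles inside an `@hpat.jit` function is not assumed here.

```python
import numpy as np
import pandas as pd

# Columns removed before training and prediction (from the notebook cells)
DROP_COLUMNS = ['time', 'power', 'latitude', 'longitude']

def prepare_xy(df):
    """Drop rows with zero power and split into features X and target y."""
    df = df[df.power != 0]
    X = df.drop(DROP_COLUMNS, axis=1).values
    y = df.power.values.reshape(len(df), 1).astype(np.float64)
    return X, y

# Tiny made-up frame; the feature column names are assumed, not from the dataset
toy = pd.DataFrame({
    'time': [0, 1, 2],
    'power': [0, 180, 210],
    'latitude': [0.0, 0.0, 0.0],
    'longitude': [0.0, 0.0, 0.0],
    'altitude': [100.0, 101.0, 102.0],
    'cadence': [0, 85, 90],
    'heartrate': [90, 120, 130],
    'distance': [0.0, 8.0, 16.0],
    'speed': [0.0, 8.0, 8.0],
})

X, y = prepare_xy(toy)
print(X.shape, y.shape)
```

Keeping the column list in one place makes it harder for the training and prediction inputs to drift apart.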

In [None]:
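test_set = pd.read_parquet('cycling_test_dataset.pq')
test_set = test_set[test_set.power!=0]
# Compare the first 300 predictions against the measured power
plt.plot(pred_result.prediction[0:300], label='predicted')
plt.plot(test_set.power.values[0:300], label='measured')
plt.legend()
plt.show()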

The graph above shows the predicted result (blue, plotted first) over the original data (orange). This dataset is notoriously sparse in features, which makes the target difficult to predict!
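
To put a number on how closely the predictions track the measured power, root-mean-square error and the coefficient of determination are common choices. The helper names below are hypothetical and the arrays are toy values; in this notebook the inputs would be `test_set.power.values` and `pred_result.prediction`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values for illustration only; predictions may be (n, 1) as in the notebook
measured = np.array([100.0, 200.0, 300.0])
predicted = np.array([[110.0], [190.0], [310.0]])

print(rmse(measured, predicted), r_squared(measured, predicted))
```

Both helpers flatten their inputs, so an `(n, 1)` prediction array can be compared directly against a 1-D array of measurements.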

The remainder of the notebook provides more details on using daal4py.

------------------------------------------

## Model properties
The trained model also exposes its properties, which are explored below.

In [None]:
print("Betas:",lm_trained.model.Beta) 
print("Number of betas:", lm_trained.model.NumberOfBetas)
print("Number of Features:", lm_trained.model.NumberOfFeatures)
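
The sketch below shows one way to separate the intercept from the per-feature coefficients. It assumes that `Beta` has one row per response and stores the intercept in the first column, which would make the number of betas equal to the number of features plus one. That layout is an assumption worth confirming against the daal4py documentation for the installed version. `split_betas` is a hypothetical helper and the array is made up.

```python
import numpy as np

def split_betas(beta):
    """Split a (n_responses, n_features + 1) array into intercepts and coefficients.

    Assumes the intercept is stored in column 0 (an assumption, see the text above).
    """
    beta = np.atleast_2d(np.asarray(beta, dtype=np.float64))
    return beta[:, 0], beta[:, 1:]

# Made-up values standing in for lm_trained.model.Beta (1 response, 5 features)
beta = np.array([[12.0, 0.1, 0.5, 1.2, -0.3, 4.0]])

intercepts, coefs = split_betas(beta)
print(intercepts.shape, coefs.shape)
```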

## Additional metrics
We can generate metrics on the independent data with daal4py's low_order_moments() class.

In [None]:
metrics_processor = d4p.low_order_moments()
data = metrics_processor.compute(reduced_dataset.values)
data.standardDeviation
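
As a cross-check, the same kind of per-column statistics can be computed with NumPy. The matrix below is made up and stands in for `reduced_dataset.values`. This notebook does not state whether daal4py normalizes by n or by n - 1, so it is worth comparing against both `ddof=0` and `ddof=1` before relying on an exact match.

```python
import numpy as np

# Made-up matrix standing in for reduced_dataset.values (rows are samples)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

col_mean = X.mean(axis=0)
col_var = X.var(axis=0, ddof=1)   # sample variance (divides by n - 1)
col_std = X.std(axis=0, ddof=1)   # sample standard deviation

print(col_mean, col_var, col_std)
```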

## Migrating the trained model for inference on external systems

Occasionally one may need to migrate the trained model to another system for inference only. This allows training on a much more powerful machine with a larger dataset, then placing the trained model on a smaller machine that only runs inference.

In [None]:
import pickle

In [None]:
with open('trained_model2.pickle', 'wb') as model_pi:
    pickle.dump(lm_trained.model, model_pi)
    # No explicit close needed: the with block closes the file on exit

The trained model file above can be moved to an inference-only or embedded system. This is useful if the training is extremely heavy or the target system is compute-limited.

In [None]:
with open('trained_model2.pickle', 'rb') as model_import:
    lm_import = pickle.load(model_import)

The imported model from file is now usable again.  We can check the betas from the model to ensure that the trained model is present.

In [None]:
lm_import.Beta
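
The save and load steps can be verified end to end with a round-trip check. The sketch below uses a plain dictionary as a stand-in for the trained model, so it runs without daal4py; the file name and the stand-in contents are made up for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable object round-trips the same way
model = {'Beta': [[12.0, 0.1, 0.5]], 'NumberOfFeatures': 2}

# Write to a temporary directory so the example leaves the working directory untouched
path = os.path.join(tempfile.mkdtemp(), 'model.pickle')

with open(path, 'wb') as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```

Unpickling can execute arbitrary code, so model files should only be loaded from trusted sources. The receiving system also needs the library that defines the pickled class installed in order to load it.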