# The final touch: a decision model with the extracted features 

We will move quickly through the process of exploratory data analysis, using the data available from the gait score experiment.


## Load and prepare the data

In [None]:
import pixiedust

Let us load all feature files.

First, let's read the features from the forceplate.

In [None]:
# If you didn't run Notebook 2, get the data from here
# !wget https://zenodo.org/record/3563513/files/fp_features.csv
fp = spark.read.csv('fp_features.csv', inferSchema=True, header=True)

In [None]:
fp.printSchema()

Second, we read the features from the accelerometer sensors.

In [None]:
# If you didn't run Notebook 3, get the data from here
# !wget https://zenodo.org/record/3563513/files/acc_features.csv
acc = spark.read.csv('acc_features.csv', inferSchema=True, header=True)

In [None]:
acc.printSchema()

Last comes the metadata file, which includes the gait score.

In [None]:
metadata = spark.read.csv('files/Walking_trial_IDmatch_edu.csv', inferSchema=True, header=True)

In [None]:
gait = metadata.select('Wingband0','Score').withColumnRenamed('Wingband0','ID')

In [None]:
gait.printSchema()

Let's stitch it all together with two joins!

In [None]:
gait = gait.join(acc, 'ID')\
           .join(fp, 'ID')

gait.show()
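
Spark's `join(other, 'ID')` performs an inner join by default, so a trial survives only if its `ID` appears in the metadata, the accelerometer features, and the forceplate features. The plain-Python sketch below illustrates that behaviour on hypothetical toy data (the IDs and values are invented and are not taken from the real files):

```python
# Plain-Python illustration of inner-join semantics.
# All IDs and values below are hypothetical toy data.
score = {"A": 2, "B": 0, "C": 1}             # ID -> gait score
acc_steps = {"A": 14, "B": 11}               # ID -> steps (no row for "C")
fp_weight = {"A": 2.1, "B": 1.8, "D": 2.4}   # ID -> weight (no row for "C")

# An inner join keeps only the IDs that appear in every table.
common_ids = sorted(set(score) & set(acc_steps) & set(fp_weight))

joined = [
    {"ID": i, "Score": score[i], "steps": acc_steps[i], "weight": fp_weight[i]}
    for i in common_ids
]

for row in joined:
    print(row)
```

Comparing `gait.count()` before and after the joins is a quick way to see whether any trials were dropped for this reason.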

Our goal is to develop (train and assess) a regression model that estimates the gait score from the sensor features we extracted previously.

$\mathit{score} = f(\mathit{steps}, \mathit{timeOnPlate}, \mathit{weight})$
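
Because the learner used in this notebook is a linear regression, $f$ has the form below. The symbols $\beta_0$ (intercept) and $\beta_1, \beta_2, \beta_3$ (coefficients) are notation introduced here for illustration; training estimates their values from the data.

$$\widehat{\mathit{score}} = \beta_0 + \beta_1 \cdot \mathit{steps} + \beta_2 \cdot \mathit{timeOnPlate} + \beta_3 \cdot \mathit{weight}$$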

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import LinearRegressionModel
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator


## Create a learning pipeline

In [None]:
vectorizer = VectorAssembler(
                inputCols=["steps", "timeOnPlate", "weight"],
                outputCol="features")

In [None]:
# Initialize the linear regression learner with default values for the parameters
lr = LinearRegression()

In [None]:
lr.setPredictionCol("Predicted_Score")\
  .setLabelCol("Score")

In [None]:
lrPipeline = Pipeline()
lrPipeline.setStages([vectorizer, lr])

## Train and inspect a model

In [None]:
# Let's first train on the training dataset to see what we get
lrModel = lrPipeline.fit(gait)

In [None]:
# The coefficients (i.e., weights) are as follows:
weights = lrModel.stages[1].coefficients

# The corresponding features for these weights are:
featuresNoLabel = vectorizer.getInputCols()


# Print the coefficients
print(list(zip(featuresNoLabel, weights)))

# Print the intercept
print('intercept', lrModel.stages[1].intercept)
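
To make the role of these numbers concrete, here is a minimal plain-Python sketch of how a linear model turns one row of features into a prediction. The coefficient, intercept, and feature values are hypothetical and chosen only for easy arithmetic; the real ones are those printed by the cell above.

```python
# Hypothetical coefficients and intercept, for illustration only;
# the real values come from the fitted model.
coefficients = {"steps": 0.5, "timeOnPlate": -0.25, "weight": 1.0}
intercept = 2.0


def predict_score(features, coefficients, intercept):
    """Return intercept + sum(coefficient * feature value)."""
    return intercept + sum(coefficients[name] * features[name] for name in coefficients)


# One hypothetical trial.
trial = {"steps": 4.0, "timeOnPlate": 8.0, "weight": 2.0}
prediction = predict_score(trial, coefficients, intercept)
print(prediction)
```

Each coefficient is the change in the predicted score for a one-unit change in that feature with the others held fixed, so their magnitudes are only comparable when the features are on similar scales.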

## Learner evaluation

In [None]:
# Apply the LR model to the same dataset and predict the gait score
predictionsLR = lrModel.transform(gait)\
                       .select("steps", "timeOnPlate", "weight", "Score", "Predicted_Score")

# Print the first rows of the predictions
predictionsLR.show()

Looks great! We managed to fit a line between 3 points :)
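
The remark is tongue-in-cheek: when a model has about as many free parameters as there are data points, a near-perfect fit on the training data is expected and says little about new data. A minimal plain-Python sketch with hypothetical numbers, using a two-parameter line and two training points:

```python
# Hypothetical (x, y) pairs, chosen only to illustrate the point.
train = [(1.0, 2.0), (3.0, 3.0)]
held_out = (5.0, 1.0)

# A line y = a + b*x has two parameters, so two points determine it exactly.
(x1, y1), (x2, y2) = train
b = (y2 - y1) / (x2 - x1)
a = y1 - b * x1

# Absolute error on the points used for fitting and on an unseen point.
train_errors = [abs(y - (a + b * x)) for x, y in train]
held_out_error = abs(held_out[1] - (a + b * held_out[0]))

print("training errors:", train_errors)
print("held-out error:", held_out_error)
```

The training errors are zero by construction, while the held-out point can be arbitrarily far off. With more trials, Spark's `randomSplit` on the DataFrame would be one way to keep a test set aside.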

Now let's compute an evaluation metric for our (training) dataset.

In [None]:
# Create an RMSE evaluator using the label and predicted columns
regEval = RegressionEvaluator(predictionCol="Predicted_Score", labelCol="Score", metricName="rmse")

# Run the evaluator on the DataFrame
rmse = regEval.evaluate(predictionsLR)

print("Root Mean Squared Error: %.20f" % rmse)
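
For reference, the root mean squared error is the square root of the average squared difference between the labels and the predictions. The plain-Python sketch below spells that out on hypothetical numbers; it is meant to clarify the metric, not to replace the Spark evaluator.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical scores: two exact predictions and one that is off by 3.
labels = [0.0, 1.0, 2.0]
predictions = [0.0, 1.0, 5.0]
value = rmse(labels, predictions)
print(value)
```

Because the errors are squared before averaging, a single large miss dominates the result, and the metric is expressed in the same units as the gait score.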