# Regression Task

This notebook predicts the duration of bike trips with several regression models and compares each of them against a mean baseline.

In [1]:
# Load the environment
!pip install -e ..
import nextbike as nb
import warnings
warnings.filterwarnings('ignore')

Using TensorFlow backend.


In [2]:
# Load dataset, conduct train-test-split and standardize the input values
x_train, x_test, x_train_scale, x_test_scale, y_train, y_test = nb.models.general.prepare_dataset()

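x_train_scale and x_test_scale are standardized versions of x_train and x_test. Because problems occurred when the standardized datasets were combined with fractions of the whole dataset, the standardized datasets are used only when the full dataset is used. The hyperparameter searches below, which run on a fraction of the training set, therefore take the unstandardized x_train.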
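
The internals of `prepare_dataset` are not shown in this notebook. As a minimal sketch of what standardization means here, the following uses only the standard library and assumes that the scaling statistics are computed on the training split and reused for the test split. The helper names `fit_scaler` and `apply_scaler` and the toy values are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Compute one (mean, std) pair per feature column from the given rows."""
    columns = zip(*rows)
    # Fall back to 1.0 for constant columns to avoid division by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def apply_scaler(rows, params):
    """Standardize each value with the supplied per-column (mean, std) pairs."""
    return [
        [(value - mu) / sigma for value, (mu, sigma) in zip(row, params)]
        for row in rows
    ]

# Toy feature rows (hypothetical values, two features each).
x_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
x_test = [[2.0, 40.0]]

params = fit_scaler(x_train)                  # statistics from the training split only
x_train_scale = apply_scaler(x_train, params)
x_test_scale = apply_scaler(x_test, params)   # test rows reuse the training statistics
```

Reusing the training statistics for the test rows keeps information from the test set out of the preprocessing step.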

In [3]:
nb.models.regression.modelling('Mean', x_train_scale, x_test_scale, y_train, y_test)

   Name   MSE CI_low CI_high  Time
0  Mean  6407   6089    6725   0.0


The mean model predicts the mean of all durations for every observation. It serves as the reference against which the other predictive models are evaluated. On the test set, the mean model has a mean squared error (MSE) of 6407 with a 95% confidence interval ranging from 6089 to 6725.
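
How the library computes the confidence interval is not shown here. The reported interval is symmetric around the MSE, which would be consistent with a normal approximation on the per-observation squared errors. The sketch below assumes that approach; the function name `mse_with_ci` and the toy durations are hypothetical.

```python
import math
from statistics import mean, stdev

def mse_with_ci(y_true, y_pred, z=1.96):
    """Return (mse, ci_low, ci_high) using a normal approximation
    on the per-observation squared errors (an assumption, see above)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mse = mean(squared_errors)
    half_width = z * stdev(squared_errors) / math.sqrt(len(squared_errors))
    return mse, mse - half_width, mse + half_width

# Toy durations (hypothetical values).
y_train = [10.0, 20.0, 30.0, 40.0]
y_test = [15.0, 25.0, 35.0]

# The mean model: one constant prediction for every test observation.
baseline = mean(y_train)
y_pred = [baseline] * len(y_test)

mse, ci_low, ci_high = mse_with_ci(y_test, y_pred)
print(round(mse, 2), round(ci_low, 2), round(ci_high, 2))
```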

I also implemented an option to export each model as a pickle file; it is disabled by default.
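
The export option could look roughly like the following sketch. It is not the library's actual code: the function names `export_model` and `load_model`, the `export` flag, and the file name are hypothetical, and a plain dictionary stands in for a fitted model.

```python
import pickle
import tempfile
from pathlib import Path

def export_model(model, path, export=False):
    """Write the model to `path` as a pickle file, but only if `export` is True."""
    if not export:
        return None
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path

def load_model(path):
    """Read a previously exported model back from disk."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)

# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "mean_model.pkl"
    skipped = export_model({"mean": 25.0}, target)               # default: no file written
    written = export_model({"mean": 25.0}, target, export=True)  # explicit opt-in
    restored = load_model(written)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.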

In [4]:
nb.models.regression.modelling('OLS_Regression', x_train_scale, x_test_scale, y_train, y_test)

             Name   MSE  CI_low  CI_high      Time
0  OLS_Regression  6035    5740     6329  0.674629
1            Mean  6407    6089     6725  0.000000


Since the confidence interval of the OLS regression (5740 to 6329) overlaps with that of the mean model (6089 to 6725), the OLS regression does not predict significantly better than the mean model.
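
The overlap criterion used here can be written as a small check. The interval bounds are copied from the result table above; the function name `intervals_overlap` is only illustrative.

```python
def intervals_overlap(a, b):
    """Return True if the closed intervals a=(low, high) and b=(low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

# 95% confidence intervals of the MSE, taken from the table above.
mean_ci = (6089, 6725)
ols_ci = (5740, 6329)

print(intervals_overlap(ols_ci, mean_ci))  # True
```

Non-overlapping intervals are a conservative criterion: two models can differ significantly even when their intervals overlap slightly, so a paired test on the per-observation errors would be a stricter follow-up.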

In [6]:
nb.models.general.hyperparameter('KNN_Regression', fraction=0.5, x_train=x_train, y_train=y_train, 
                                 leaf_size=list(range(100,200)), n_neighbors=list(range(10,200)))

{'algorithm': 'auto', 'leaf_size': 140, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 34, 'p': 2, 'weights': 'uniform'} Time:  2171.2984414100647 seconds


This operation runs a randomized search over the hyperparameters leaf_size and n_neighbors and outputs the hyperparameters of the best of 100 sampled models, evaluated with 5-fold cross-validation. n_neighbors is the number of nearest neighbors considered for every observation. leaf_size is the leaf size of the tree structure (ball tree or k-d tree) used for the neighbor search; it affects the speed and memory use of the search rather than which neighbors are found.

Only a fraction of the training set is used in order to reduce processing time. A more thorough approach would additionally run a grid search around the hyperparameters found here and pass the refined values to the actual model.
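
The two-stage idea (a coarse randomized search followed by a grid search around the best candidate) can be sketched with scikit-learn as follows. This is not what `nb.models.general.hyperparameter` does internally; the synthetic data, the parameter ranges, the scoring choice, and the reduced `n_iter=10` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (the notebook uses a fraction of the real training set).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Stage 1: randomized search over wide ranges with 5-fold cross-validation.
coarse = RandomizedSearchCV(
    KNeighborsRegressor(),
    param_distributions={
        "n_neighbors": list(range(2, 60)),
        "leaf_size": list(range(10, 60)),
    },
    n_iter=10,  # the notebook samples 100 candidates
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=0,
)
coarse.fit(X, y)
k = coarse.best_params_["n_neighbors"]

# Stage 2: exhaustive grid search in a narrow window around the best n_neighbors.
fine = GridSearchCV(
    KNeighborsRegressor(leaf_size=coarse.best_params_["leaf_size"]),
    param_grid={"n_neighbors": list(range(max(1, k - 3), k + 4))},
    cv=5,
    scoring="neg_mean_squared_error",
)
fine.fit(X, y)
print(fine.best_params_)
```

Because the grid contains the value found in the first stage, the refined result is at least as good as the coarse one on the same cross-validation folds.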

In [7]:
nb.models.regression.modelling('KNN_Regression', x_train_scale, x_test_scale, y_train, y_test, 
                               leaf_size=140, n_neighbors=34)

             Name   MSE  CI_low  CI_high        Time
0  KNN_Regression  5947    5659     6234  146.013052
1  OLS_Regression  6035    5740     6329    0.674629
2            Mean  6407    6089     6725    0.000000


The KNN regression performs slightly better than the OLS regression, but its confidence interval still overlaps with that of the mean model, so it is not significantly better than the mean model either.

In [16]:
nb.models.general.hyperparameter('Random_Forest_Regression', x_train=x_train, y_train=y_train, 
                                 n_estimators=list(range(10,50)), min_samples_split=list(range(2,15)), 
                                 min_samples_leaf=list(range(1,15)), fraction=0.25)

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 11, 'min_samples_split': 3, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 39, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False} Time:  4991.5405921936035 seconds


This operation optimizes three hyperparameters of a random forest regressor: the number of trees (n_estimators), the minimum number of observations required to split a node (min_samples_split), and the minimum number of observations required in a leaf (min_samples_leaf).

In [17]:
nb.models.regression.modelling('Random_Forest_Regression', x_train_scale, x_test_scale, y_train, y_test,
                              n_estimators=39, min_samples_split=3, min_samples_leaf=11)

                       Name   MSE  CI_low  CI_high        Time
0  Random_Forest_Regression  5079    4822     5336   84.472171
1            KNN_Regression  5947    5659     6234  146.013052
2            OLS_Regression  6035    5740     6329    0.674629
3                      Mean  6407    6089     6725    0.000000


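The random forest regressor performs considerably better than the other models. Its confidence interval (4822 to 5336) does not overlap with that of the mean model (6089 to 6725), so it is also significantly better than the mean model.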

In [19]:
nb.models.regression.modelling('Neural_Network_2hl', x_train_scale, x_test_scale, y_train, y_test, 
                               layers=256, batch_size=16)

                       Name   MSE  CI_low  CI_high        Time
0  Random_Forest_Regression  5079    4822     5336   84.472171
1        Neural_Network_2hl  5656    5383     5930  189.244573
2            KNN_Regression  5947    5659     6234  146.013052
3            OLS_Regression  6035    5740     6329    0.674629
4                      Mean  6407    6089     6725    0.000000


Here, a neural network with two hidden layers is applied. The value of the layers parameter (256) and the batch size (16) were tuned manually. The network is the second-best of the four predictive models, behind the random forest regressor.

All in all, the random forest regressor is the best of the four predictive models. With a mean squared error of 5079, it explains 21% of the variance in trip duration that the mean model leaves unexplained. Its root mean squared error on the test set is about 71 minutes, which indicates the typical size of the deviation between predicted and actual trip durations.
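
Both figures follow directly from the MSE values in the final table, as this short calculation shows. The variable names are illustrative; the unit (minutes) is taken from the text above.

```python
import math

mse_mean = 6407  # mean model (baseline), from the final table
mse_rf = 5079    # random forest regressor, from the final table

# Share of the baseline's squared error removed by the random forest.
improvement = 1 - mse_rf / mse_mean

# Root mean squared error, in the same unit as the trip duration.
rmse_rf = math.sqrt(mse_rf)

print(f"{improvement:.1%}")  # 20.7%
print(f"{rmse_rf:.1f}")      # 71.3
```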

I could also have used principal component analysis (PCA) to reduce the complexity of the models. With a more sophisticated modelling strategy and a larger dataset, I could have searched for a trade-off between applying PCA and using fractions of the dataset for hyperparameter optimization and modelling.

In addition, the Python library includes a console command that trains and exports a model. The default model is an OLS regression. The other models can be selected with an option, but only with their standard hyperparameters.