# Classification Task

This notebook predicts whether a bike trip is heading towards or away from the university and compares several classification models on that task.

In [1]:
# Load the environment
!pip install -e ..
import nextbike as nb
import warnings
warnings.filterwarnings('ignore')

Obtaining file:///C:/Users/meikh/Dropbox/Programming%20Data%20Science/PDS2020_Herber
Installing collected packages: PDS2020-Herber
  Attempting uninstall: PDS2020-Herber
    Found existing installation: PDS2020-Herber 0.0.1
    Uninstalling PDS2020-Herber-0.0.1:
      Successfully uninstalled PDS2020-Herber-0.0.1
  Running setup.py develop for PDS2020-Herber
Successfully installed PDS2020-Herber


Using TensorFlow backend.


In [2]:
# Load dataset, conduct train-test-split and standardize the input values
x_train, x_test, x_train_scale, x_test_scale, y_train, y_test = nb.models.general.prepare_dataset('dortmund_preprocessed_classification', 'towards_uni')

x_train_scale and x_test_scale are standardized versions of x_train and x_test. Because problems occurred when the standardized datasets were combined with a fraction of the whole dataset, the standardized datasets are used only when no fractioning takes place.
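
As an illustration of what such a standardization typically looks like, here is a minimal sketch in plain Python. It assumes that the scaling statistics are computed on the training set only and then reused for the test set; whether prepare_dataset works exactly this way is not shown in this notebook. The helper names and the toy numbers are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Return per-column (mean, std) computed from the training rows only."""
    stats = []
    for col in zip(*rows):
        mu = mean(col)
        sigma = pstdev(col)
        # Guard against constant columns to avoid a division by zero.
        stats.append((mu, sigma if sigma > 0 else 1.0))
    return stats

def standardize(rows, stats):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - mu) / sigma for v, (mu, sigma) in zip(row, stats)]
            for row in rows]

# Toy data, made up for illustration (not taken from the nextbike dataset).
toy_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
toy_test = [[2.0, 40.0]]

stats = fit_standardizer(toy_train)
toy_train_scale = standardize(toy_train, stats)
toy_test_scale = standardize(toy_test, stats)
```

Reusing the training statistics for the test set keeps information from the test data out of the preprocessing step.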

In [3]:
nb.models.classification.modelling('Mean', x_train_scale, x_test_scale, y_train, y_test)

   Name  Accuracy    CI_low  CI_high  Time
0  Mean  0.698904  0.695567  0.70224   0.0


The mean model predicts the most frequent outcome for every observation. It serves as a reference model against which the performance of the other predictive models is evaluated. Applied to the test set, the mean model has an accuracy of 69.9% with a 95% confidence interval ranging from 69.6% to 70.2%.
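
The baseline and its interval can be sketched in a few lines. This is only an illustration: it assumes a normal-approximation (Wald) interval for the accuracy, which may differ from the method used inside nb.models.classification.modelling, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter

def majority_baseline(y_train):
    """Return the most frequent label in the training targets."""
    return Counter(y_train).most_common(1)[0][0]

def accuracy_with_ci(y_true, y_pred, z=1.96):
    """Accuracy plus a normal-approximation (Wald) confidence interval."""
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, acc - half_width, acc + half_width

# Toy labels, made up for illustration: 1 = towards the university, 0 = away.
toy_y_train = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
toy_y_test = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

label = majority_baseline(toy_y_train)
toy_pred = [label] * len(toy_y_test)  # same prediction for every observation
acc, ci_low, ci_high = accuracy_with_ci(toy_y_test, toy_pred)
```

With only ten toy observations the interval is very wide; the narrow interval reported above reflects the much larger test set.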

For every model I also implemented an option to export it as a pickle file; this option is disabled by default.

In [4]:
nb.models.classification.modelling('Logistic_Regression', x_train_scale, x_test_scale, y_train, y_test)

                  Name  Accuracy    CI_low   CI_high      Time
0  Logistic_Regression  0.710595  0.707297  0.713894  2.727823
1                 Mean  0.698904  0.695567  0.702240  0.000000


Since the confidence interval of the logistic regression does not overlap with that of the mean model, the logistic regression predicts significantly better than the mean model.
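
The overlap argument can be checked directly with the bounds from the table above. The helper below is a hypothetical illustration and is not part of the nextbike package.

```python
def intervals_overlap(a, b):
    """True if the closed intervals a=(low, high) and b=(low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Bounds copied from the results table above.
mean_ci = (0.695567, 0.702240)
logreg_ci = (0.707297, 0.713894)

overlap = intervals_overlap(mean_ci, logreg_ci)
print(overlap)  # prints False
```

Requiring two 95% intervals not to overlap is a conservative criterion, so a difference that passes this check would also be judged significant by a direct comparison at the 5% level.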

In [11]:
nb.models.general.hyperparameter('KNN_Classification', fraction=0.5, x_train=x_train, y_train=y_train, 
                                 leaf_size=list(range(30,150)), n_neighbors=list(range(20,200)))

{'algorithm': 'auto', 'leaf_size': 38, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 134, 'p': 2, 'weights': 'uniform'} Time:  2437.5062384605408 seconds


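This operation conducts a randomized search over the hyperparameters leaf_size and n_neighbors and outputs the hyperparameters of the best performing of 100 sampled models, evaluated with 5-fold cross-validation. n_neighbors is the number of nearest neighbors considered for every observation. leaf_size is the leaf size of the tree structure used to look up the neighbors; assuming the underlying estimator is scikit-learn's KNeighborsClassifier, as the parameter output suggests, it mainly affects the speed and memory use of the neighbor search rather than which neighbors are found.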

Only a fraction of the training set is used in order to reduce processing time. A more thorough approach would follow up with a grid search around the hyperparameters found here and pass the refined hyperparameters to the actual model.
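
The two-stage idea described above (randomized search on a fraction of the data, then a grid search around the result) could look roughly like the following sketch. It uses scikit-learn directly on synthetic data and is not the implementation behind nb.models.general.hyperparameter; the dataset, the window of ±5 neighbors, and the reduced n_iter are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the notebook itself uses x_train and y_train.
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

# Use half of the data for tuning, analogous to fraction=0.5 above.
X_half, _, y_half, _ = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)

# Stage 1: randomized search over wide ranges with 5-fold cross-validation.
# n_iter is kept small here; the notebook reports 100 sampled models.
coarse = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={
        "leaf_size": list(range(30, 150)),
        "n_neighbors": list(range(20, 200)),
    },
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
coarse.fit(X_half, y_half)
best_k = coarse.best_params_["n_neighbors"]

# Stage 2: exhaustive grid search in a narrow window around the coarse optimum.
fine = GridSearchCV(
    KNeighborsClassifier(leaf_size=coarse.best_params_["leaf_size"]),
    param_grid={"n_neighbors": list(range(max(1, best_k - 5), best_k + 6))},
    cv=5,
    scoring="accuracy",
)
fine.fit(X_half, y_half)
print(fine.best_params_, round(fine.best_score_, 3))
```

The refined value in fine.best_params_ would then be passed to the final model in place of the coarse result.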

In [12]:
nb.models.classification.modelling('KNN_Classification', x_train_scale, x_test_scale, y_train, y_test, 
                                   leaf_size=38, n_neighbors=134)

                  Name  Accuracy    CI_low   CI_high        Time
0   KNN_Classification  0.719298  0.716030  0.722566  142.190417
1  Logistic_Regression  0.710595  0.707297  0.713894    2.727823
2                 Mean  0.698904  0.695567  0.702240    0.000000


The KNN classifier performs better than the logistic regression: its accuracy is 71.9% compared with 71.1%, and the two confidence intervals do not overlap.

In [21]:
nb.models.general.hyperparameter('Random_Forest_Classification', x_train=x_train_scale, y_train=y_train, 
                                 n_estimators=list(range(10,75)), min_samples_split=list(range(2,15)), 
                                 min_samples_leaf=list(range(1,15)))

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 9, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 63, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False} Time:  5118.01823091507 seconds


This operation optimizes three hyperparameters of a random forest classifier: the number of trees (n_estimators), the minimum number of observations required to split a node (min_samples_split), and the minimum number of observations required in a leaf (min_samples_leaf).

In [22]:
nb.models.classification.modelling('Random_Forest_Classification', x_train_scale, x_test_scale, y_train, y_test, 
                                   n_estimators=63, min_samples_split=9, min_samples_leaf=1)

                           Name  Accuracy    CI_low   CI_high        Time
0  Random_Forest_Classification  0.776226  0.773195  0.779258   19.545172
1            KNN_Classification  0.719298  0.716030  0.722566  142.190417
2           Logistic_Regression  0.710595  0.707297  0.713894    2.727823
3                          Mean  0.698904  0.695567  0.702240    0.000000


The random forest classifier performs considerably better than the other models, with an accuracy of 77.6% compared with 71.9% for the KNN classifier.

In [32]:
nb.models.classification.modelling('Neural_Network_2hl', x_train_scale, x_test_scale, y_train, y_test, 
                               layers=256, batch_size=16)

                           Name  Accuracy    CI_low   CI_high        Time
0  Random_Forest_Classification  0.776226  0.773195  0.779258   19.545172
1            Neural_Network_2hl  0.732353  0.729133  0.735573  176.658982
2            KNN_Classification  0.719298  0.716030  0.722566  142.190417
3           Logistic_Regression  0.710595  0.707297  0.713894    2.727823
4                          Mean  0.698904  0.695567  0.702240    0.000000


Here, a neural network with two hidden layers is applied. The values of the layers and batch_size parameters were tuned manually. With an accuracy of 73.2%, it is the second best of the four predictive models, behind the random forest classifier.

All in all, the random forest classifier is the best of the four predictive models. It has an accuracy of 77.6% and thereby reduces the misclassification rate of the mean model from 30.1% to 22.4%, a relative reduction of 25.7%.
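
The 25.7% figure can be reproduced from the accuracies in the final results table as the relative reduction in misclassification rate. The variable names below are chosen for illustration only.

```python
# Accuracies copied from the final results table.
acc_mean = 0.698904
acc_forest = 0.776226

err_mean = 1 - acc_mean      # misclassification rate of the mean model
err_forest = 1 - acc_forest  # misclassification rate of the random forest

relative_error_reduction = (err_mean - err_forest) / err_mean
print(f"{relative_error_reduction:.1%}")  # prints 25.7%
```

This is a net figure: it compares the two error rates and does not track which individual observations changed from wrong to right.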