# [Bike Sharing Demand] Competition

My solution comes with:
- "Exploratory.ipynb": a graphical notebook that explores the main features of the dataset;
- "Train.ipynb": the main notebook, which contains all the feature engineering, model tuning, and prediction;
- "utils.py": the utility functions developed for the task;
- "xgb_lgb.csv": the submission file in the Kaggle format.

## Exploratory Data Analysis, Feature Engineering

First, an EDA was performed to detect flaws in the dataset and to guide the feature engineering in a sound way.

Inspecting the dataset shows that the value "weather=4" occurs only once in the whole training set, and only twice in the test set. As a design choice, this category was merged with the most similar one (weather=3).
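The merge described above can be sketched in a few lines of pandas. The helper name `merge_rare_weather` and the toy frame are illustrative assumptions, not the code in utils.py; only the column name `weather` and the 4-to-3 recoding come from the text.

```python
import pandas as pd


def merge_rare_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which weather category 4 is recoded as category 3."""
    out = df.copy()
    out["weather"] = out["weather"].replace({4: 3})
    return out


# Toy data standing in for the competition's train/test frames.
train = pd.DataFrame({"weather": [1, 2, 3, 4, 1]})
merged = merge_rare_weather(train)
print(merged["weather"].value_counts().sort_index())
```

Applying the same function to both the training and the test frame keeps the category levels consistent before any dummy encoding.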

Next, a check against the USA / Washington DC bank holidays showed that Thanksgiving Fridays were not flagged as holidays; they were recoded as such.

Similarly, crucial days surrounding Christmas and New Year's Eve (24, 26, 31 Dec) were recoded in the "holiday"/"workingday" variables.
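A minimal sketch of this kind of recoding is shown below. It assumes that every affected day is set to `holiday = 1` and `workingday = 0`; the text does not spell out the exact values used for the December days, so treat that choice, and the helper names `thanksgiving_friday` and `recode_holidays`, as assumptions.

```python
import pandas as pd


def thanksgiving_friday(year: int) -> pd.Timestamp:
    """Day after the fourth Thursday of November of the given year."""
    nov = pd.date_range(f"{year}-11-01", f"{year}-11-30", freq="D")
    thursdays = nov[nov.dayofweek == 3]  # Monday is 0, so Thursday is 3
    return thursdays[3] + pd.Timedelta(days=1)


def recode_holidays(df: pd.DataFrame) -> pd.DataFrame:
    """Flag Thanksgiving Fridays and 24/26/31 Dec as non-working holidays.

    Assumption: all of these days get holiday=1 and workingday=0.
    """
    out = df.copy()
    dt = pd.to_datetime(out["datetime"])
    day = dt.dt.normalize()  # drop the time of day
    fridays = [thanksgiving_friday(int(y)) for y in dt.dt.year.unique()]
    mask = day.isin(fridays) | ((dt.dt.month == 12) & dt.dt.day.isin([24, 26, 31]))
    out.loc[mask, "holiday"] = 1
    out.loc[mask, "workingday"] = 0
    return out


# Toy rows: a Thanksgiving Friday, Christmas Eve, and an ordinary Thursday.
sample = pd.DataFrame({
    "datetime": ["2011-11-25 08:00:00", "2012-12-24 10:00:00", "2012-12-27 10:00:00"],
    "holiday": [0, 0, 0],
    "workingday": [1, 1, 1],
})
recoded = recode_holidays(sample)
print(recoded)
```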

Further, the distributions of the log(response + 1) variables were plotted in the "Exploratory.ipynb" notebook in order to detect any particularly problematic behaviour. Although the distributions appear skewed and far from normal, a shift other than 1 (i.e. log(response + shift)) did not empirically help the modelling phase in terms of RMSLE, so shift = 1 was kept in the next steps.
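To make the link between the shift and the metric concrete, here is a small sketch of the shifted log transform, its inverse, and the RMSLE. The function names `to_log`, `from_log`, and `rmsle` are illustrative assumptions. With shift = 1, a squared-error loss on the transformed target coincides with the squared RMSLE, which is one reason that shift is a natural default.

```python
import numpy as np


def to_log(y, shift=1.0):
    """Shifted log transform of the response: log(y + shift)."""
    return np.log(np.asarray(y, dtype=float) + shift)


def from_log(z, shift=1.0):
    """Inverse of to_log: exp(z) - shift."""
    return np.exp(np.asarray(z, dtype=float)) - shift


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + y)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


counts = np.array([0.0, 3.0, 120.0])  # toy rental counts
roundtrip = from_log(to_log(counts))
print(rmsle(counts, counts))
```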

Major interactions between variables were checked graphically, and the ones used in the following steps can be seen in "Exploratory.ipynb". In particular, temp-atemp, atemp-humidity, and hour-workingday stood out, and were included as predictors in the specific form that can be checked in utils.py.

Further, squared(temp) and squared(atemp) were included. The intuition behind these last two features is that as the temperature becomes too hot or too cold, a person's propensity to use a bike decreases more than proportionally, so these two variables help guide a model towards such reasoning.
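The exact construction of these features lives in utils.py and is not reproduced here. As a rough sketch, interactions can be built as simple products and the hour-workingday interaction as a binned flag. The product form, the `rush` column name, and the `RUSH_HOURS` bins below are all assumptions made for illustration; only the names `temp_atemp`, `sq_temp`, and `sq_atemp` appear in this write-up.

```python
import pandas as pd

# Hypothetical commuting-hour bins; the real bins are defined in utils.py.
RUSH_HOURS = {7, 8, 9, 17, 18, 19}


def add_interactions(df: pd.DataFrame) -> pd.DataFrame:
    """Add product, squared, and binned hour-workingday features (sketch)."""
    out = df.copy()
    out["temp_atemp"] = out["temp"] * out["atemp"]
    out["atemp_humidity"] = out["atemp"] * out["humidity"]
    out["sq_temp"] = out["temp"] ** 2
    out["sq_atemp"] = out["atemp"] ** 2
    # 1 only for commuting hours on working days, 0 otherwise.
    out["rush"] = (out["hour"].isin(RUSH_HOURS) & (out["workingday"] == 1)).astype(int)
    return out


sample = pd.DataFrame({
    "temp": [10.0, 30.0],
    "atemp": [12.0, 33.0],
    "humidity": [50, 80],
    "hour": [8, 13],
    "workingday": [1, 1],
})
feats = add_interactions(sample)
print(feats)
```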

## Variables Encoding

From the "datetime" variable, the classical time-related variables were extracted; of these, "hour" and "day of the week" were kept as predictors.

As usual, categorical variables (weather, season, etc.) were one-hot encoded (dummy variables), with the exception of the "hour" variable, which empirically showed better predictive power when kept as a single numeric variable. This can be explained as follows: while each day of the week weighs equally in the choice of renting a bike, the hour of the day does not work in this way.
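The encoding step can be sketched with `pandas.get_dummies`, keeping `hour` as a plain integer column. The helper name `encode`, the `dayofweek` column name, and the toy rows are assumptions; the set of dummy-encoded columns is limited to the ones named in the text.

```python
import pandas as pd


def encode(df: pd.DataFrame) -> pd.DataFrame:
    """Extract hour and day of week, then one-hot encode the categoricals.

    'hour' is deliberately left as a single integer column.
    """
    out = df.copy()
    dt = pd.to_datetime(out["datetime"])
    out["hour"] = dt.dt.hour
    out["dayofweek"] = dt.dt.dayofweek  # Monday is 0, Sunday is 6
    out = pd.get_dummies(out, columns=["season", "weather", "dayofweek"], dtype=int)
    return out.drop(columns=["datetime"])


sample = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-03 17:00:00"],  # Saturday, Monday
    "season": [1, 1],
    "weather": [1, 2],
    "temp": [9.84, 10.0],
})
encoded = encode(sample)
print(encoded.columns.tolist())
```

When train and test are encoded separately, the resulting dummy columns should be aligned afterwards, since a category missing from one frame produces no column there.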

Finally, heavily skewed variables, such as "windspeed", "sq_atemp", "sq_temp", and "temp_atemp", were Box-Cox transformed.
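For reference, the one-parameter Box-Cox transform is (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0, defined for strictly positive x. The sketch below applies it with a fixed λ; the λ value and the shift of 1 applied to `windspeed` (which contains zeros) are assumptions for illustration. In practice λ is usually estimated from the data, for example with `scipy.stats.boxcox`.

```python
import numpy as np


def boxcox(x, lmbda):
    """One-parameter Box-Cox transform for a fixed lambda.

    Raises ValueError if any input is not strictly positive.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda


windspeed = np.array([0.0, 3.0, 15.0])  # toy values, including a zero
# Assumption: shift by 1 so that zero wind speeds become valid inputs.
transformed = boxcox(windspeed + 1.0, 0.5)
print(transformed)
```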

## Modeling part

Although the problem could be treated as a time series problem, as stated on the Kaggle forums classical TS models (e.g. GARCH models) tended to underperform when compared to modern machine learning methods. Intuitively this makes sense: any auto-correlation scheme in the bike rental process seems to be a weak assumption, while the driving forces underlying the event are usually more related to the time, the weather, etc.

Nevertheless, to double-check whether machine learning models remain robust when predicting many days ahead, a preliminary xgboost model was trained on the first 15 days, and the RMSLE was computed for each of the next 4 days provided; the metric showed no significant difference across those days.

A lot of time was dedicated to creating a sound cross-validation scheme, in order to obtain a result that generalizes and is as close as possible to the leaderboard results.

Since the test set covers a block of days every month (from the 20th to the end of the month), a 5-fold CV scheme was created in which every fold holds out roughly 19/5 adjacent days (e.g. train on 7-19, predict on 1-4, and so on).
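A possible way to build such folds is to split the days of the month 1-19 into five contiguous blocks and hold out one block at a time across all months. The generator name `day_block_folds` is an assumption, and this sketch does not model any gap between the held-out days and the training days, so it may differ in detail from the scheme in "Train.ipynb".

```python
import numpy as np
import pandas as pd


def day_block_folds(datetimes, n_folds=5, last_train_day=19):
    """Yield (train_idx, val_idx) pairs based on contiguous day-of-month blocks."""
    days = pd.to_datetime(pd.Series(datetimes)).dt.day.to_numpy()
    # Days 1..19 split into blocks of sizes 4, 4, 4, 4, 3.
    blocks = np.array_split(np.arange(1, last_train_day + 1), n_folds)
    for block in blocks:
        val_mask = np.isin(days, block)
        yield np.where(~val_mask)[0], np.where(val_mask)[0]


# Toy index: one row per day for the first 19 days of two months.
stamps = pd.date_range("2011-01-01", periods=19, freq="D").append(
    pd.date_range("2011-02-01", periods=19, freq="D")
)
folds = list(day_block_folds(stamps))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

The index pairs have the same shape as those produced by scikit-learn splitters, so they can be passed wherever a list of (train, validation) index arrays is accepted.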

For the sake of time and computational resources, only tree-based models were tried, specifically xgboost and LightGBM. Although both follow a GBM schema, the underlying boosting algorithm is significantly different, and hence an ensemble of the two models resulted in improved performance.

Specifically, both models' hyper-parameters were briefly tuned with a random search approach, and their predictions were subsequently blended in a naive fashion (0.6 * xgboost + 0.4 * lgb). This approach resulted in a 3.65x RMSLE on the leaderboard.
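The naive blend is a fixed-weight average of the two models' predictions. The sketch below assumes the blend is applied to predictions on the same scale (the text does not say whether this was the log scale or the count scale) and that the casual and registered sub-models are trained on log(y + 1) and summed after back-transforming; the function names are illustrative.

```python
import numpy as np


def blend(pred_xgb, pred_lgb, w_xgb=0.6):
    """Fixed-weight average: w_xgb * xgboost + (1 - w_xgb) * lightgbm."""
    pred_xgb = np.asarray(pred_xgb, dtype=float)
    pred_lgb = np.asarray(pred_lgb, dtype=float)
    if pred_xgb.shape != pred_lgb.shape:
        raise ValueError("prediction arrays must have the same shape")
    return w_xgb * pred_xgb + (1.0 - w_xgb) * pred_lgb


def total_count(log_casual, log_registered):
    """Back-transform two log(y + 1) predictions and sum them.

    Assumption: the final count is casual + registered.
    """
    return np.expm1(np.asarray(log_casual)) + np.expm1(np.asarray(log_registered))


# Toy predictions standing in for the two tuned models' outputs.
blended = blend([10.0, 20.0], [20.0, 10.0])
total = total_count(np.log1p([3.0]), np.log1p([7.0]))
print(blended, total)
```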

To open up the black box, importance plots for the XGBoost models are presented at the end of the "Train.ipynb" notebook, in which the gain metric (i.e. the relative contribution of the corresponding feature to each tree in the model) is shown for both the casual and the registered models. Gain was chosen over weight or frequency since it better describes which features matter most in generating a prediction.

Looking at the results, it makes sense that the most important variables for both sub-models are:
- hour;
- rush_x feature engineered hour (hour - workingday binned interaction);
- temperature variables, including interactions, and squared values;
- workingday.



## Conclusions

Although good, my result could be significantly improved in several ways that were not implemented for the sake of time. Clearly, in a commercial environment or in the case of a still active Kaggle competition, the following paths should be taken:
- Tuning was kept minimal, with only 10 models, which is far from covering the parameter space: usually 60+ models should be tested in a random search approach. Alternatively, Cartesian grid or Bayesian optimisation approaches could be implemented.
- One could argue that blending predictions from XGBoost and LGBM is not a sound choice, given that both come from the same GBM family. As a consequence, using a completely different model such as a DNN could improve the performance of the final model, and should hence be implemented.
- Feature engineering helped a lot in obtaining the performance, as witnessed by the importance plots. However, further exploration would be interesting; also, including "artificial" variables by taking advantage of k-NN models would be worth exploring.
- Meta-features from different models could be used to implement a more refined model stacking.

Finally, several other features were created but were not helpful in improving the model, such as a "school holidays" variable. This is most likely because it carries similar information to the "holiday" variable, and because a large share of these holidays fall in the test set (winter holidays).
