In [1]:
import pandas as pd

# Summary


**Task:** Prediction of bike rental counts using the available features, recorded both hourly and daily. Evaluation is performed on a 10% random hold-out sample, with mean absolute error (MAE) as the scoring metric.

**Algorithm:** Gradient Boosted Decision Trees

**Files:**

- hour.csv : bike sharing counts aggregated on an hourly basis. Records: 17379 hours
- day.csv : bike sharing counts aggregated on a daily basis. Records: 731 days

**Data Source:** https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
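
As a rough illustration of the evaluation protocol described under **Task**, the sketch below holds out a 10% random sample and scores a gradient boosted model with MAE. It assumes scikit-learn (this summary does not name the library), uses default hyperparameters, and runs on synthetic data rather than `hour.csv` or `day.csv`; the helper `holdout_mae` is a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def holdout_mae(X, y, test_size=0.1, seed=0):
    """Fit on 90% of the rows and return the MAE on the held-out 10%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # Default hyperparameters (an assumption); a real run would tune these.
    model = GradientBoostingRegressor(random_state=seed)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


# Synthetic stand-in for the real data (hypothetical, not the bike data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 3))
y = 200.0 * X[:, 0] + 50.0 * X[:, 1] + rng.normal(0.0, 5.0, size=500)

mae = holdout_mae(X, y)
# MAE of always predicting the mean, as a reference point.
baseline = float(np.mean(np.abs(y - y.mean())))
print(f"hold-out MAE: {mae:.2f} (mean-only baseline: {baseline:.2f})")
```

Comparing against a mean-only baseline gives the MAE some context, since the raw value depends on the scale of the target.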

# Approach

The general steps I took in this analysis are as follows:

1. **Data Understanding:** Review the available information about the data and how it relates to the task at hand. Create tentative hypotheses to guide exploration.
2. **Data Exploration:** Visualise the target variable and its relationships with the other available features.
3. **Data Cleaning and Transformation:** Handle any missing data and peculiarities that may affect model fitting, and perform any necessary transformations.
4. **Model Fitting:** Fit the model and tune its parameters. Also analyse feature importances and train/test deviance plots.
5. **Model Evaluation:** Assess how well the fitted model generalises by carrying out cross-validation.


# Understanding

It is known that the rental process is highly correlated with environmental and seasonal settings, for instance weather conditions, precipitation, day of week, season and hour of the day.

Therefore the following features are likely candidates to be of most interest in this task: 

* season : season (1:spring, 2:summer, 3:fall, 4:winter)
* hr : hour (0 to 23)
* workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
* temp/atemp : Normalised temperature/feeling temperature in Celsius. The values are divided by 41 (max)
* weathersit : 
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* hum : Normalised humidity
* windspeed : Normalised windspeed
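
To read the normalised temperature in physical units, the scaling described above can be undone. This is a minimal sketch assuming the divisor of 41 stated in the feature list; the function name is hypothetical, and the dataset README should be checked for the divisor that applies to `atemp`.

```python
TEMP_MAX_C = 41.0  # divisor stated in the feature description above


def temp_to_celsius(normalised_temp, max_temp=TEMP_MAX_C):
    """Undo the division by the maximum to recover degrees Celsius."""
    return normalised_temp * max_temp


print(temp_to_celsius(0.5))  # 20.5
```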

**Further Details of Dataset:** [README.txt](./data/raw/README.txt)


# Exploration

* [Explore Rentals by Hour.ipynb](./Explore Rentals by Hour.ipynb)
* [Explore Rentals by Day.ipynb](./Explore Rentals by Day.ipynb)

## Findings

* In both data sets, weather and season have an obvious impact on the rental count.
* Daily rental count looks to be positively correlated with temp/atemp, and negatively correlated with windspeed.
* Hourly rental count looks to be positively correlated with temp/atemp, and negatively correlated with humidity.
* Hourly rental count also shows an obvious bimodal relationship with time of day, with peaks during commuting times.
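
Findings of this kind can be checked numerically as well as visually. The sketch below computes the correlation of `temp` and `hum` with the rental count and the mean rental count per hour of day. The rows are a small made-up sample rather than values from `hour.csv`, and the target column name `cnt` is an assumption, as it is not named in this summary.

```python
import pandas as pd

# Tiny hypothetical sample; a real analysis would load hour.csv instead.
hours = pd.DataFrame(
    {
        "hr": [3, 8, 8, 12, 17, 17, 23],
        "temp": [0.20, 0.30, 0.34, 0.60, 0.62, 0.58, 0.26],
        "hum": [0.90, 0.70, 0.72, 0.40, 0.45, 0.50, 0.80],
        "cnt": [10, 320, 350, 180, 400, 420, 40],  # assumed target column name
    }
)

# Pearson correlation of each feature with the rental count.
correlations = hours[["temp", "hum"]].corrwith(hours["cnt"])

# Mean rentals per hour of day; two separate peaks indicate a bimodal profile.
hourly_profile = hours.groupby("hr")["cnt"].mean()

print(correlations)
print(hourly_profile)
```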

# Data Cleaning and Transformation

The data set appears to have been well sanitised already, and the continuous features (temp, atemp, hum, windspeed) have already been normalised to the range [0, 1]. There is therefore little to do at this stage.
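
A quick way to confirm this is to count missing values and check that the normalised columns stay within [0, 1]. The following is a minimal sketch on hypothetical rows; `check_clean` is an illustrative helper, not part of the original notebooks.

```python
import pandas as pd

NORMALISED_COLUMNS = ["temp", "atemp", "hum", "windspeed"]


def check_clean(df, columns=NORMALISED_COLUMNS):
    """Return (number of missing cells, whether all listed columns lie in [0, 1])."""
    missing = int(df.isna().sum().sum())
    in_range = bool(((df[columns] >= 0) & (df[columns] <= 1)).all().all())
    return missing, in_range


# Hypothetical rows standing in for the real CSV files.
sample = pd.DataFrame(
    {
        "temp": [0.24, 0.22, 0.80],
        "atemp": [0.29, 0.27, 0.76],
        "hum": [0.81, 0.80, 0.55],
        "windspeed": [0.0, 0.09, 0.19],
    }
)
print(check_clean(sample))
```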

# Prediction

## By Day
* [Predict Rentals by Day.ipynb](./Predict Rentals by Day.ipynb)

### Error Metrics

In [2]:
pd.read_csv('data/results/RentalPredictionsByDay-MAEs.csv', index_col=0, names=['Scores'])

| Metric | Scores |
|---|---|
| MeanAbsoluteError | 1137.927306 |
| MedianAbsoluteError | 1005.646595 |
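
For reference, the two metrics reported here can be written out in a few lines. This is a plain-Python sketch of the standard definitions, using made-up values; it is not the code that produced the scores above.

```python
from statistics import mean, median


def absolute_errors(y_true, y_pred):
    """Absolute difference between each actual and predicted value."""
    return [abs(t - p) for t, p in zip(y_true, y_pred)]


def mean_absolute_error(y_true, y_pred):
    return mean(absolute_errors(y_true, y_pred))


def median_absolute_error(y_true, y_pred):
    return median(absolute_errors(y_true, y_pred))


# Hypothetical rental counts and predictions.
actual = [100, 200, 300, 400]
predicted = [110, 190, 330, 300]

print(mean_absolute_error(actual, predicted))    # 37.5
print(median_absolute_error(actual, predicted))  # 20.0
```

A median well below the mean, as in this toy example, indicates that a few large errors are pulling the mean up.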


### Feature Importances and Deviance

![](./data/results/RentalPredictionsByDay-DevianceImportance.png)
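
A plot like the one above could be derived along the following lines. The sketch assumes scikit-learn's `GradientBoostingRegressor` with its default squared error loss, so the deviance at each stage is the mean squared error; the data and feature names are synthetic and hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: one strong feature, one weak feature, one pure noise feature.
rng = np.random.default_rng(0)
feature_names = ["signal", "weak", "noise"]  # hypothetical features
X = rng.uniform(0.0, 1.0, size=(400, 3))
y = 100.0 * X[:, 0] + 10.0 * X[:, 1] + rng.normal(0.0, 5.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)
model = GradientBoostingRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Error after each boosting stage, on training and held-out data.
train_curve = [mean_squared_error(y_train, p) for p in model.staged_predict(X_train)]
test_curve = [mean_squared_error(y_test, p) for p in model.staged_predict(X_test)]

# Stage at which the held-out error is lowest.
best_stage = int(np.argmin(test_curve)) + 1

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(feature_names, model.feature_importances_))
print(best_stage, importances)
```

A held-out curve that flattens or rises while the training curve keeps falling is the usual sign of overfitting.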

## By Hour

* [Predict Rentals by Hour.ipynb](./Predict Rentals by Hour.ipynb)

In [3]:
pd.read_csv('data/results/RentalPredictionsByHour-MAEs.csv', index_col=0, names=['Scores'])

| Metric | Scores |
|---|---|
| MeanAbsoluteError | 72.930337 |
| MedianAbsoluteError | 44.325089 |


![](./data/results/RentalPredictionsByHour-DevianceImportance.png)

# Evaluation


## Cross Validation (using Mean Absolute Error)

### By Day

In [4]:
res = pd.read_csv('data/results/RentalPredictionsByDay-CV.csv', names=['Fold', 'Score'])
res['Score'].describe()

count      10.000000
mean     1574.286833
std       130.917235
min      1335.975730
25%      1509.937722
50%      1570.112931
75%      1624.928505
max      1810.883460
Name: Score, dtype: float64

### By Hour

In [5]:
res = pd.read_csv('data/results/RentalPredictionsByHour-CV.csv', names=['Fold', 'Score'])
res['Score'].describe()

count     10.000000
mean      83.164166
std       12.819342
min       67.916556
25%       73.368066
50%       81.068206
75%       91.922432
max      108.370040
Name: Score, dtype: float64
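
Fold scores like those summarised above could be produced roughly as follows. The sketch assumes scikit-learn, 10 shuffled folds (the fold count matches the summaries; shuffling is an assumption) and synthetic data in place of the real files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 200.0 * X[:, 0] + 50.0 * X[:, 1] + rng.normal(0.0, 5.0, size=300)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn reports negated MAE so that larger is better; flip the sign back.
neg_scores = cross_val_score(
    GradientBoostingRegressor(random_state=0),
    X,
    y,
    cv=cv,
    scoring="neg_mean_absolute_error",
)
scores = pd.Series(-neg_scores, name="Score")
print(scores.describe())
```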

# Conclusions

This analysis used Gradient Boosting to predict rental counts by hour and by day.

* The mean absolute error is shown above together with the standard deviation of the cross-validated error scores. In absolute terms, the learner shows less spread in error scores on the hourly data set than on the daily data set (a standard deviation of about 12.8 versus 130.9). This suggests that the learner would generalise better to unseen hourly data than to unseen daily data, although the two targets are on very different scales, so the comparison should be read with caution.

* The deviance plot for daily predictions also shows that the gap between training and test set error widens as more estimators are added. This suggests that the daily rental model is overfitting the training data and may perform poorly on newly gathered daily rental data.

* In the feature importances, `weathersit` scored low for both daily and hourly rental predictions. This suggests that the learner is able to account for weather conditions using other features such as temperature, humidity and windspeed.
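
Because hourly and daily counts are on different scales, the spread of the cross-validated scores can also be expressed relative to each model's mean error. The sketch below does this with the means and standard deviations reported in the Evaluation section; it is offered as a complementary view, and the helper name is hypothetical.

```python
# Mean and standard deviation of the 10-fold MAE scores, copied from the
# Evaluation section.
cv_summary = {
    "day": {"mean": 1574.286833, "std": 130.917235},
    "hour": {"mean": 83.164166, "std": 12.819342},
}


def coefficient_of_variation(mean, std):
    """Standard deviation expressed as a fraction of the mean."""
    return std / mean


relative_spread = {
    name: coefficient_of_variation(**stats) for name, stats in cv_summary.items()
}
for name, value in relative_spread.items():
    print(f"{name}: {value:.1%}")
```

On these figures the relative spread is roughly 8% for the daily model and 15% for the hourly model.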