# Exercise with Capital Bikeshare data

## Introduction

- Capital Bikeshare dataset from Kaggle: [data](https://github.com/justmarkham/DAT8/blob/master/data/bikeshare.csv), [data dictionary](https://www.kaggle.com/c/bike-sharing-demand/data)
- Each observation represents the bikeshare rentals initiated during a given hour of a given day

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics

In [None]:
# read the data and set "datetime" as the index
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)

In [None]:
# "count" is also a DataFrame method, so rename that column to avoid a clash with attribute access
bikes.rename(columns={'count':'total'}, inplace=True)

In [None]:
# create "hour" as its own feature
bikes['hour'] = bikes.index.hour

In [None]:
bikes.head()

In [None]:
bikes.tail()

- **hour** ranges from 0 (midnight) through 23 (11pm)
- **workingday** is either 0 (weekend or holiday) or 1 (non-holiday weekday)

## Task 1

Run these two `groupby` statements and figure out what they tell you about the data.

In [None]:
# mean rentals for each value of "workingday"
bikes.groupby('workingday').total.mean()

In [None]:
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean()
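
To see what these `groupby` calls compute, it may help to try the same pattern on a table small enough to check by hand. The sketch below uses a hypothetical five-row frame (`toy`, invented for illustration and not taken from the bikeshare file).

```python
import pandas as pd

# hypothetical toy data, NOT the real bikeshare file
toy = pd.DataFrame({
    'workingday': [0, 0, 1, 1, 1],
    'total': [10, 30, 100, 200, 300],
})

# split the rows by "workingday", then average "total" within each group
group_means = toy.groupby('workingday')['total'].mean()
print(group_means)
```

The result is a Series indexed by the grouping column, with one mean per distinct value of `workingday`.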

## Task 2

Run this plotting code, and make sure you understand the output. Then split it into two plots conditioned on "workingday". (In other words, one plot should display the hourly trend for "workingday=0", and the other should display the hourly trend for "workingday=1".)

In [None]:
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean().plot()

In [None]:
# hourly rental trend for "workingday=0"
bikes[bikes.workingday==0].groupby('hour').total.mean().plot()

In [None]:
# hourly rental trend for "workingday=1"
bikes[bikes.workingday==1].groupby('hour').total.mean().plot()

In [None]:
# combine the two plots
bikes.groupby(['hour', 'workingday']).total.mean().unstack().plot()
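
The combined plot relies on `unstack()` to turn the inner level of the grouped index into columns. A minimal sketch on hypothetical toy data (the values are invented) shows the intermediate table without drawing anything:

```python
import pandas as pd

# hypothetical toy data, NOT the real bikeshare file
toy = pd.DataFrame({
    'hour': [8, 8, 8, 8, 17, 17],
    'workingday': [0, 1, 1, 0, 1, 0],
    'total': [20, 300, 500, 40, 450, 90],
})

# grouping by two columns gives a Series with a two-level index (hour, workingday)
by_both = toy.groupby(['hour', 'workingday'])['total'].mean()

# unstack() moves the inner level ("workingday") into the columns
wide = by_both.unstack()
print(wide)
```

Calling `.plot()` on a frame shaped like `wide` draws one line per column, which is why the two `workingday` trends appear on the same axes.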

## Task 3

Create a decision tree to predict the total number of bikeshare rentals.

- Create a train/test split
- Use at least two features of your choice
- Evaluate the model using the root mean squared error

In [None]:
# set your features and response
feature_cols = ['holiday', 'workingday', 'weather', 'temp', 'humidity', 'windspeed', 'hour']
X = bikes[feature_cols]
y = bikes.total

# create a train/test split (fix random_state so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
from sklearn.tree import DecisionTreeRegressor

# instantiate a decision tree (regression model)
tree = DecisionTreeRegressor()

# fit the model
tree.fit(X_train, y_train)

# create and store predictions
y_pred = tree.predict(X_test)

# evaluate the model
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
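
The root mean squared error is the square root of the average squared difference between actual and predicted values. As a sanity check, the sketch below computes it by hand and with scikit-learn on four made-up numbers (`y_true` and `y_hat` are illustrative, not model output).

```python
import numpy as np
from sklearn import metrics

# made-up actual and predicted values, for illustration only
y_true = np.array([10, 20, 30, 40])
y_hat = np.array([12, 18, 33, 35])

# errors are -2, 2, -3, 5; squared errors are 4, 4, 9, 25; their mean is 10.5
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is in the same units as the response, it can be read here as a typical error in rentals per hour.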

## Task 4 - Observe feature importances and improve the decision tree

Improve the model by trying different features or model hyperparameters.

In [None]:
# observe the feature_importances_
pd.DataFrame({'feature': feature_cols, 'importance': tree.feature_importances_})
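
Feature importances are easier to read when labeled and sorted. The following is a sketch on synthetic data, where the column names `signal` and `noise` are invented and only `signal` actually drives the response, so the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# synthetic data: the response depends on "signal" only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.uniform(0, 10, size=200),
    'noise': rng.uniform(0, 10, size=200),
})
y_demo = 5 * X_demo['signal'] + rng.normal(0, 0.1, size=200)

demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
demo_tree.fit(X_demo, y_demo)

# label each importance with its column name and sort, largest first
importances = pd.Series(demo_tree.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as that feature's share of the total error reduction achieved by the tree's splits.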

In [None]:
# improve the decision tree

# instantiate a decision tree (regression model)

# fit the model

# create and store predictions

# evaluate the model


## Task 5

Repeat the process with the following models:

- sklearn.ensemble.BaggingRegressor (using a DecisionTreeRegressor)
- sklearn.ensemble.RandomForestRegressor

Tune the hyperparameters of each model.
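
As one possible starting point, the sketch below fits both ensembles on synthetic data and compares their test RMSE. The data, the `n_estimators=50` and `max_features=2` settings, and the variable names are all assumptions made for illustration, not recommended values. The base tree is passed to `BaggingRegressor` positionally because the keyword for that argument differs between scikit-learn versions.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data, for illustration only
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 10 * np.sin(X_demo[:, 0]) + X_demo[:, 1] + rng.normal(0, 1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

models = {
    # bagging: many trees, each fit to a bootstrap sample of the training rows
    'bagging': BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=1),
    # random forest: bagged trees that also consider a random subset of features at each split
    'random forest': RandomForestRegressor(n_estimators=50, max_features=2, random_state=1),
}

rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse[name] = np.sqrt(metrics.mean_squared_error(y_te, model.predict(X_te)))

print(rmse)
```

Hyperparameters such as `n_estimators`, `max_features`, and the depth of the underlying trees are natural candidates to vary when tuning.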

In [None]:
from sklearn.ensemble import BaggingRegressor

# instantiate a bagging regressor (with a DecisionTreeRegressor as the base model)

# fit the model

# create and store predictions

# evaluate the model


In [None]:
from sklearn.ensemble import RandomForestRegressor

# instantiate a random forest (regression model)

# fit the model

# create and store predictions

# evaluate the model


## Task 6

Use 5-fold cross-validation to evaluate a decision tree model with those same features (using any "max_depth" you choose).
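
One way to get an RMSE out of `cross_val_score` is to request the `'neg_mean_squared_error'` scorer, flip the sign, and take the square root. The sketch below does this on synthetic data; the data and the choice of `max_depth=4` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data, for illustration only
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=200)

demo_tree = DecisionTreeRegressor(max_depth=4, random_state=2)

# scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse = cross_val_score(demo_tree, X_demo, y_demo, cv=5,
                          scoring='neg_mean_squared_error')

# convert each fold's negated MSE to an RMSE, then average across folds
fold_rmse = np.sqrt(-neg_mse)
mean_rmse = fold_rmse.mean()
print(fold_rmse, mean_rmse)
```

Averaging over five folds gives a steadier estimate than a single train/test split, at the cost of fitting the model five times.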

In [None]:
# cross validation with a decision tree


In [None]:
# cross validation with a bagging regressor



## Select the best model and continue to improve by creating your own new features