# Bike Sharing Demand

## Forecast use of a city bikeshare system

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems make them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

![](https://kaggle2.blob.core.windows.net/competitions/kaggle/3948/media/bikes.png)

## The Data

You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.

* **datetime** - hourly date + timestamp  
* **season** -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
* **holiday** - whether the day is considered a holiday
* **workingday** - whether the day is neither a weekend nor holiday
* **weather** - 
    * 1: Clear, Few clouds, Partly cloudy
    * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* **temp** - temperature in Celsius
* **atemp** - "feels like" temperature in Celsius
* **humidity** - relative humidity
* **windspeed** - wind speed
* **casual** - number of non-registered user rentals initiated
* **registered** - number of registered user rentals initiated
* **count** - number of total rentals
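
The train/test rule described above (days 1 to 19 of each month for training, day 20 onward for testing) can be sketched with the standard library alone. This is only an illustration: `is_training_row` and the sample timestamps are hypothetical and are not part of the competition files.

```python
from datetime import datetime


def is_training_row(timestamp):
    """Return True when the timestamp falls on day 1-19 of its month."""
    return timestamp.day <= 19


# Hypothetical hourly timestamps around the train/test boundary
timestamps = [
    datetime(2011, 1, 19, 23),
    datetime(2011, 1, 20, 0),
    datetime(2011, 2, 1, 0),
]

train = [ts for ts in timestamps if is_training_row(ts)]
test = [ts for ts in timestamps if not is_training_row(ts)]
```

Because the test rows sit between blocks of training rows in time, the two sets interleave month by month rather than forming one early block and one late block.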

## Evaluation

Submissions are evaluated on the Root Mean Squared Logarithmic Error (RMSLE). The RMSLE is calculated as

$$ \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } $$

Where:

* $n$ is the number of hours in the test set
* $p_i$ is your predicted count
* $a_i$ is the actual count
* $\log(x)$ is the natural logarithm
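
As a worked example of the formula, here is a small dependency-free sketch. The function name `rmsle` and the sample numbers are illustrative assumptions; the notebook's own scoring function appears later and uses NumPy.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error for two equal-length sequences."""
    n = len(actual)
    total = sum(
        (math.log(p + 1) - math.log(a + 1)) ** 2
        for p, a in zip(predicted, actual)
    )
    return math.sqrt(total / n)


# Perfect predictions give an error of zero
perfect = rmsle([3, 10], [3, 10])

# With p = e - 1 and a = 0 the log difference is log(e) - log(1) = 1
one_off = rmsle([math.e - 1], [0])

# Missing by 5 rentals costs more when the prediction is too low than too high
under = rmsle([5], [10])
over = rmsle([15], [10])
```

Because the error is measured on a log scale, it reflects relative rather than absolute differences, and under-predicting by a fixed number of rentals is penalized more than over-predicting by the same number.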

### [Kaggle Reference](https://www.kaggle.com/c/bike-sharing-demand)

## Planning

1. **Exploration**
    1. Distributions (Univariate)
    1. Correlations (Bivariate)
    1. Plots (Multivariate)
2. **Analysis**
    1. Write the Scoring Method
    1. Build a 'mean value' baseline model for reference
    1. Set up cross-validation pipeline
    1. Build our first regression model
    1. Feature Engineering
    1. Tune parameters to improve the model
3. **Submission**
    1. Submit our predictions to Kaggle.

## Exploration

### Version

In [None]:
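import sys

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

# Record the library versions so that the results can be reproduced
for module in (pd, np, mpl, sns, sklearn):
    print('%s %s' % (module.__name__, module.__version__))
print(sys.version)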

### Preliminaries

In [None]:
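%matplotlib inline
c = {}  # assumed: rc overrides for sns.plotting_context (the original definition of `c` is not shown)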

#### Utility Functions

In [None]:
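from IPython.display import HTML, display
def table(df, replace_match="", replace_str=""):
    # Render a DataFrame as a Bootstrap-styled HTML table, with an optional string substitution
    return display(HTML(df.to_html().replace('<table border="1" class="dataframe">','<table class="table table-striped table-hover">').replace(replace_match,replace_str)))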

### Load the Data

In [None]:
DATA_DIR = '../data/bikeshare/'
TRAIN_FILE = DATA_DIR + 'train.csv'
TEST_FILE = DATA_DIR + 'test.csv'

In [None]:
df = pd.read_csv(TRAIN_FILE)

### Inspect the Data

In [None]:
df.info()

#### Fix the Datatypes

In [None]:
df.datetime = pd.to_datetime(df.datetime)

In [None]:
df.info()

In [None]:
df = df.set_index('datetime')

#### Head

In [None]:
table(df.head(5))

In [None]:
table(df.tail(5))

#### Random sample of rows

In [None]:
from random import sample
table(df.ix[sample(df.index,10)])

### Univariate

In [None]:
df.hist(figsize=(12,12));

#### Checking for normality

In [None]:
from statsmodels.graphics.gofplots import qqplot

with sns.plotting_context("poster", font_scale=1, rc=c):
    qqplot(df['windspeed'], line='45', fit=True);

### Bivariate

In [None]:
b, g, r, p = sns.color_palette("muted", 4)

with sns.plotting_context("poster", font_scale=1, rc=c):
    g = sns.PairGrid(df, hue="workingday")
    g.map_diag(plt.hist)
    g.map_offdiag(plt.scatter)
    g.add_legend()

#### Working Day vs. Count

In [None]:
with sns.plotting_context("poster", font_scale=1, rc=c):
    g = sns.FacetGrid(df, col="workingday")
    g.map(plt.hist, "count");

In [None]:
with sns.plotting_context("poster", font_scale=1, rc=c):
    g = sns.FacetGrid(df, col="season", size=4, aspect=.5)
    g.map(sns.boxplot, "atemp");

### Multivariate

In [None]:
with sns.plotting_context("poster", font_scale=1, rc=c):
    g = sns.FacetGrid(df, col="season", hue="count")
    g.map(plt.scatter, "temp", "atemp", alpha=.7)

## Analysis

### Training / Test Split

In [None]:
def get_train_data():
    # Loads the training data, but splits the y from the X
    df = pd.read_csv(TRAIN_FILE)
    return df.iloc[:, 0:9], df.iloc[:,-1]

### Scoring Method

In [None]:
from sklearn.metrics import make_scorer

# First, we should set up a scoring function, so that we can benchmark our progress as we go.
# The evaluation metric is the Root Mean Squared Logarithmic Error (RMSLE).
def rmsele(actual, pred):
    """
    Given a column of actuals and a column of predictions, calculate the RMSLE
    """
    squared_errors = (np.log(pred + 1) - np.log(actual + 1)) ** 2
    mean_squared = np.sum(squared_errors) / len(squared_errors)
    return np.sqrt(mean_squared)

# This helper makes a callable that we can use in cross_val_score.
# greater_is_better=False negates the metric, so the reported scores are negative
rmsele_scorer = make_scorer(rmsele, greater_is_better=False)
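
The sign convention can be illustrated with a small stand-in. This is a sketch of the idea only, not scikit-learn's implementation: `as_loss_scorer` and `mean_absolute_error` are hypothetical helpers written for this example.

```python
def mean_absolute_error(actual, predicted):
    """A simple loss: lower is better."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def as_loss_scorer(loss):
    """Wrap a loss so that larger return values mean better predictions."""
    def scorer(actual, predicted):
        return -loss(actual, predicted)
    return scorer


scorer = as_loss_scorer(mean_absolute_error)

# Close predictions get a score near zero, poor ones a more negative score
good = scorer([10, 20], [11, 19])
bad = scorer([10, 20], [1, 100])
```

Model selection tools can then always maximize the score, whether the underlying metric is an accuracy or an error.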

### Baseline Model

In [None]:
# Predict the overall mean count for every hour and score it as a reference point
expected_value = df['count'].mean()
yhat = np.array([expected_value] * len(df['count']))

rmsele(df['count'].values, yhat)
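
One observation about constant baselines under this metric: the constant that minimizes the RMSLE is the mean taken on the log scale, not the arithmetic mean. The sketch below checks this on made-up numbers; the `counts` list is hypothetical and `rmsle` is a plain-Python stand-in for the notebook's scoring function.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error for two equal-length sequences."""
    total = sum(
        (math.log(p + 1) - math.log(a + 1)) ** 2
        for p, a in zip(predicted, actual)
    )
    return math.sqrt(total / len(actual))


# Hypothetical, strongly skewed hourly rental counts
counts = [1, 5, 40, 200, 500]
n = len(counts)

# Baseline 1: the arithmetic mean of the counts
mean_guess = sum(counts) / n

# Baseline 2: average log(count + 1), then map back to the count scale
log_mean_guess = math.exp(sum(math.log(c + 1) for c in counts) / n) - 1

mean_score = rmsle([mean_guess] * n, counts)
log_mean_score = rmsle([log_mean_guess] * n, counts)
```

On skewed data like this the log-scale constant scores noticeably better, so it may be a fairer reference point than the arithmetic mean.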


### Simple Model

In [None]:
from sklearn.linear_model import Ridge
from sklearn.cross_validation import KFold, cross_val_score

# Let's just train a basic model so that we can test whether our scoring and
# cross-validation framework works. We'll use a Ridge regression,
# which is a regularized form of linear regression
X, y = get_train_data()
# Subset the X to just use temp, atemp, and humidity
Xhat = X[['temp', 'atemp', 'humidity']]
ridge_estimator = Ridge(normalize=True)
scores = cross_val_score(ridge_estimator, Xhat, y, scoring=rmsele_scorer, cv=5, verbose=1)
scores

### CrossValidation

In [None]:
# Fill in some of the parameters on cross_val_score
def perform_cv(estimator, X, y):
    return cross_val_score(estimator, X, y, scoring=rmsele_scorer, cv=5, verbose=1)
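
To make `cv=5` more concrete, the following sketch shows one common way of cutting rows into folds: contiguous, unshuffled blocks. It illustrates the general idea and is not scikit-learn's code; `contiguous_folds` is a hypothetical helper.

```python
def contiguous_folds(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional row each
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Seven rows split into three folds of sizes 3, 2 and 2
folds = list(contiguous_folds(7, 3))
```

Since the rental rows are ordered by time, contiguous folds hold out whole stretches of the calendar, which is worth keeping in mind when reading the cross-validation scores.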

### Grid Search

In [None]:
from sklearn.grid_search import GridSearchCV

# Try a simple grid search with the estimator
parameters = {'alpha': np.logspace(0, 2, 10)}
grid = GridSearchCV(ridge_estimator, parameters, scoring=rmsele_scorer, cv=5)
grid.fit(Xhat, y)
grid.grid_scores_

In [None]:
# And for grid_search
def perform_grid_search(estimator, parameters, X, y):
    grid_search = GridSearchCV(estimator, parameters, scoring=rmsele_scorer, cv=5)
    grid_search.fit(X, y)
    return grid_search
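
The idea behind the grid search can be sketched without any libraries: fit the model once per candidate `alpha` and keep the value with the lowest validation error. The example uses a one-feature ridge regression with no intercept, whose closed-form slope is `sum(x*y) / (sum(x*x) + alpha)`. The data and helper names are hypothetical, and a single hold-out split stands in for the 5-fold cross-validation used above.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge slope for a single feature with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mean_squared_error(xs, ys, slope):
    """Average squared difference between slope * x and y."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training and validation data, roughly y = 2x
train_x, train_y = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
valid_x, valid_y = [4.0, 5.0], [8.0, 10.1]

# Evaluate every candidate alpha on the held-out data
alphas = [0.0, 1.0, 10.0, 100.0]
validation_errors = {
    alpha: mean_squared_error(valid_x, valid_y, fit_ridge_1d(train_x, train_y, alpha))
    for alpha in alphas
}
best_alpha = min(validation_errors, key=validation_errors.get)
```

Larger values of `alpha` shrink the slope toward zero, so on data this clean the smallest candidate wins; on noisier data with many features some shrinkage often helps.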

### Custom Ridge to floor to Zero

In [None]:
from sklearn.linear_model import Ridge

# Custom Ridge to floor predictions at 0
class FlooredRidge(Ridge):
    def predict(self, X, *args, **kwargs):
        pred = super(FlooredRidge, self).predict(X, *args, **kwargs)
        pred[pred < 0] = 0
        return pred
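
The reason for flooring is that the metric takes `log(pred + 1)`, which is undefined once a prediction drops to -1 or below. A minimal standard-library sketch with made-up predictions follows; `floor_at_zero` is a hypothetical helper, not part of the notebook.

```python
import math


def floor_at_zero(predictions):
    """Replace negative predictions with zero."""
    return [max(0.0, p) for p in predictions]


# A linear model can easily predict a negative rental count
raw = [-3.2, 0.5, 12.0]

# log(-3.2 + 1) is not defined for real numbers
try:
    math.log(raw[0] + 1)
    log_failed = False
except ValueError:
    log_failed = True

# After flooring, every term of the metric is well defined
floored = floor_at_zero(raw)
log_terms = [math.log(p + 1) for p in floored]
```

With NumPy the same situation produces `nan` and a warning rather than an exception, which then makes the whole score `nan`.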

## Transform Data

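In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Fit a scaler on the three numeric columns
Xhat = X[['temp', 'atemp', 'humidity']]
normalize = StandardScaler()
normalize.fit(Xhat)

In [None]:
# Compare the fitted statistics with a manual standardization.
# (`std_` was renamed `scale_` in later scikit-learn releases.)
print(normalize.std_)
print(normalize.mean_)
# StandardScaler uses the population standard deviation, hence ddof=0
(Xhat - Xhat.mean()) / Xhat.std(ddof=0)
(Xhat - normalize.mean_) / normalize.std_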
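
The arithmetic behind the scaler can be reproduced with the standard library. This is a sketch on hypothetical temperature readings; `standardize` is an illustrative helper, and it uses the population standard deviation (dividing by n), which is the convention `StandardScaler` follows.

```python
import statistics


def standardize(values):
    """Shift values to mean 0 and scale them to population standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (divides by n)
    return [(v - mean) / std for v in values]


# Hypothetical temperature readings in Celsius
temps = [9.84, 9.02, 14.76, 22.14, 30.34]
scaled = standardize(temps)
```

pandas' `std()` divides by n - 1 by default, so a manual check against the scaler only matches exactly when the same convention is used on both sides.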

In [None]:
from sklearn.preprocessing import StandardScaler

# Now let's move on to the actual transformation of the inputs.
# Not every estimator we'll use will have the "normalize" keyword,
# so let's break it out into a transformer, so that we have better control over it
normalize = StandardScaler()
ridge_estimator = Ridge()
Xhat = X[['temp', 'atemp', 'humidity']]
Xhat = normalize.fit_transform(Xhat)
scores = perform_cv(ridge_estimator, Xhat, y)
scores

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion

# Now we have the beginnings of a multi-step pipeline
# Scikit lets you wrap each of these steps into a Pipeline object, so you just have to run fit / predict once
# instead of manually feeding the data from one transformer to the next
normalize = StandardScaler()
ridge_estimator = Ridge()
pipeline = Pipeline([('normalize', normalize), ('ridge', ridge_estimator)])
Xhat = X[['temp', 'atemp', 'humidity']]
scores = perform_cv(pipeline, Xhat, y)
scores

In [None]:
# Additionally, you can perform grid search over all of the steps of the pipeline
# So you don't have to tune each step manually
# The pipeline exposes the underlying steps' parameters like so:
# ridge__alpha, and normalize__with_mean
normalize = StandardScaler()
ridge_estimator = Ridge()
parameters = {'ridge__alpha': np.logspace(0, 3, 10)}
Xhat = X[['temp', 'atemp', 'humidity']]
pipeline = Pipeline([('normalize', normalize), ('ridge', ridge_estimator)])
grid = GridSearchCV(pipeline, parameters, scoring=rmsele_scorer, cv=5)
grid.fit(Xhat, y)
grid.grid_scores_
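
The `step__parameter` naming convention can be pictured as a simple routing rule: everything before the double underscore names a pipeline step, and the rest names a parameter of that step. The sketch below only illustrates the convention and is not how scikit-learn implements it; `route_params` is a hypothetical helper.

```python
def route_params(params):
    """Group 'step__name' keys into a dictionary of per-step parameters."""
    routed = {}
    for key, value in params.items():
        step, _, name = key.partition('__')
        routed.setdefault(step, {})[name] = value
    return routed


routed = route_params({'ridge__alpha': 10.0, 'normalize__with_mean': False})
```

In this example `routed` maps `'ridge'` to `{'alpha': 10.0}` and `'normalize'` to `{'with_mean': False}`.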

## Feature Engineering

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelBinarizer
# Let's move on to including more features in our model.
# We probably want to use a factor like season in our model, but it's
# a categorical feature, and we'll need to convert it to a series of booleans
one_hot = OneHotEncoder()
# Selecting with a list gives the encoder a 2-D, single-column input
season = one_hot.fit_transform(X[['season']]).toarray()

In [None]:
# Inspect the encoded columns
season
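
What the encoder produces can be shown with a plain-Python sketch: each category gets its own 0/1 column, and exactly one column is set per row. The helper `one_hot_rows` and the sample values are hypothetical.

```python
def one_hot_rows(values, categories):
    """Return one 0/1 indicator row per value, with one column per category."""
    return [[1 if value == category else 0 for category in categories] for value in values]


# season codes: 1 = spring, 2 = summer, 3 = fall, 4 = winter
seasons = [1, 3, 4, 2]
encoded = one_hot_rows(seasons, categories=[1, 2, 3, 4])
```

Encoding the codes this way stops a linear model from treating winter (4) as "four times" spring (1).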

In [None]:
# We then have to join this with the other variables
normalize = StandardScaler()
ridge_estimator = Ridge()
pipeline = Pipeline([('normalize', normalize), ('ridge', ridge_estimator)])
Xhat = np.hstack([X[['temp', 'atemp', 'humidity']], season])
scores = perform_cv(pipeline, Xhat, y)
scores

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class ToArray(BaseEstimator, TransformerMixin):
    # We need this because OneHotEncoder returns a sparse matrix, and normalize requires a dense array
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.toarray()
        
Xhat = X[['season', 'weather', 'temp', 'atemp', 'humidity']]
# Actually there's a faster way of doing the encoding: the 'categorical_features' argument
# selects which columns to one-hot encode, so we don't need np.hstack any more.
# I think n_values needs to be 5 here, because the encoder assumes that 0 is a possible value for an int column
# Should probably specify the data types in read_csv
one_hot = OneHotEncoder(n_values=[5, 5], categorical_features=[0, 1])
desparse = ToArray()
normalize = StandardScaler()
ridge_estimator = FlooredRidge()
pipeline = Pipeline([('onehot', one_hot), ('desparse', desparse), ('normalize', normalize), ('ridge', ridge_estimator)])
scores = perform_cv(pipeline, Xhat, y)
scores

In [None]:
# OK, so now we've got a pipeline that does one-hot encoding of two categorical variables
# and then normalizes the variables.
# But actually we're not supposed to normalize the dummy variables,
# so we need some way of only normalizing non-dummy variables

# Oops, actually the CV splitting converts the pandas DataFrame to an array, so we can't rely
# on the normalizer having the proper column names; we select columns by position instead
class SelectiveNormalize(StandardScaler):
    def __init__(self, cols, copy=True, with_mean=True, with_std=True):
        self.cols = cols
        super(SelectiveNormalize, self).__init__(copy, with_mean, with_std)
    
    def fit(self, X, y=None):
        subset = X[:, self.cols]
        return super(SelectiveNormalize, self).fit(subset, y)
        
    def transform(self, X):
        subset = X[:, self.cols]
        normalized = super(SelectiveNormalize, self).transform(subset)
        others = [col for col in range(X.shape[1]) if col not in self.cols]
        res = np.hstack([normalized, X[:, others]])
        return res

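Xhat = X[['season', 'weather', 'temp', 'atemp', 'humidity']]
# SelectiveNormalize outputs the normalized columns first, so season and weather
# have moved to positions 3 and 4 by the time they reach the encoder
one_hot = OneHotEncoder(n_values=[5, 5], categorical_features=[3, 4])
normalize = SelectiveNormalize([2, 3, 4])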
desparse = ToArray()
ridge_estimator = FlooredRidge()
pipeline = Pipeline([('normalize', normalize), ('onehot', one_hot), ('desparse', desparse), ('ridge', ridge_estimator)])
scores = perform_cv(pipeline, Xhat, y)
scores
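
The column bookkeeping in `SelectiveNormalize.transform` is easy to get wrong, so here is a small sketch of the reordering it performs. `reorder_columns` is a hypothetical helper that mirrors the `np.hstack([normalized, X[:, others]])` step using plain lists.

```python
def reorder_columns(n_columns, normalized_cols):
    """Return the output column order: normalized columns first, the rest after."""
    others = [col for col in range(n_columns) if col not in normalized_cols]
    return list(normalized_cols) + others


names = ['season', 'weather', 'temp', 'atemp', 'humidity']
order = reorder_columns(len(names), [2, 3, 4])
reordered = [names[i] for i in order]
```

After the transform the two categorical columns sit at positions 3 and 4, which is why the encoder in this pipeline is configured with `categorical_features=[3, 4]`.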

## DateTime

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Let's try tackling the date column now. The time of day is probably really important,
# so we need some way of extracting the hour.
# We'll use a FeatureUnion to do this, to demonstrate the functionality
def get_train_data():
    # Loads the training data, but splits the y from the X
    df = pd.read_csv(TRAIN_FILE, parse_dates=['datetime'])
    return df.iloc[:, 0:9], df.iloc[:,-1]


class SelectColumns(BaseEstimator, TransformerMixin):
    """
    Passes on a subset of columns from an input ndarray
    """
    def __init__(self, cols):
        self.cols = cols
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[:, self.cols]
    

class ExtractHour(BaseEstimator, TransformerMixin):
    """
    Extracts hour from a datetime series
    """
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # X is a single-column array of datetime objects
        res = np.zeros(X.shape[0])
        for xx in range(X.shape[0]):
            res[xx] = X[xx, 0].hour
        return res.reshape(-1, 1)
    

class CastType(BaseEstimator, TransformerMixin):
    def __init__(self, cast_to):
        self.cast_to = cast_to
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.astype(self.cast_to)

X, y = get_train_data()
# Reminder of the columns:
# ['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed']
select_date = SelectColumns([0])
select_others = SelectColumns(list(range(1, 9)))
cast_float = CastType(np.float64)
# After select_others, season is column 0 and weather is column 3
one_hot = OneHotEncoder(n_values=[5, 5], categorical_features=[0, 3])
get_hour = ExtractHour()
desparse = ToArray()
# A random forest replaces the ridge model here. Tree ensembles do not need
# scaled inputs, so the normalization step is left out of this pipeline
forest_estimator = RandomForestRegressor(n_estimators=200)

hour_feature = Pipeline([('select_date', select_date), ('get_hour', get_hour)])
other_features = Pipeline([('select_others', select_others), ('cast_float', cast_float), ('onehot', one_hot), ('desparse', desparse)])
join_features = FeatureUnion([('hour', hour_feature), ('others', other_features)])
predict = Pipeline([('featurize', join_features), ('estimator', forest_estimator)])
scores = perform_cv(predict, X, y)
scores
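
The hour extraction at the heart of `ExtractHour` can be sketched with the standard library. The timestamp format `'%Y-%m-%d %H:%M:%S'` is an assumption about how the `datetime` column is written in the CSV, and `extract_hours` is a hypothetical helper.

```python
from datetime import datetime


def extract_hours(stamps, fmt='%Y-%m-%d %H:%M:%S'):
    """Parse timestamp strings and return the hour of day for each one."""
    return [datetime.strptime(stamp, fmt).hour for stamp in stamps]


# Hypothetical rows: midnight and the evening rush hour
hours = extract_hours(['2011-01-01 00:00:00', '2011-01-01 17:00:00'])
```

In the notebook the parsing is already done by `parse_dates=['datetime']`, so the transformer only has to read the `.hour` attribute of each value.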

## Submission

In [None]:
from IPython.display import HTML

HTML('''<script>

code_show=true;

function code_toggle() {
    if (code_show){ 
        $('div.input').hide();
        $('.output_scroll').removeClass('output_scroll');
        $('.prompt').hide();
    } else {
        $('div.input').show();
        $('.output_scroll').removeClass('output_scroll');
        $('.prompt').show();
    }
    code_show = !code_show
}
</script>
 
<a class='btn btn-warning btn-lg' style="margin:0 auto; display:block; max-width:320px" href="javascript:code_toggle()">TOGGLE CODE</a>''')

In [None]:
HTML('''<link href='http://fonts.googleapis.com/css?family=Roboto|Open+Sans' rel='stylesheet' type='text/css'>
<style>
body #notebook {
    font-family : 'Open Sans','Source Sans Pro','Proxima Nova', sans-serif;
    font-size : 1.3em;
    line-height : 1.5em;
}

h1,h2,h3,h4,h5 {
    font-family : 'Roboto','Source Sans Pro','Proxima Nova', sans-serif;
}


#notebook .panel-body {
  font-size: 1.1em;
  line-height: 1.6em;
}

#notebook .table,
#notebook .table th,
#notebook .table td,
#notebook .table tr {
    text-align : center;
    border: 0;
}
</style>

<script>
$(function(){
    code_toggle()
})
</script>

''')

## Exploration

In [None]:
import numpy as np
import pandas as pd


TEST_FILE = 'data/test.csv'
TRAIN_FILE = 'data/train.csv'

In [None]:
df = pd.read_csv(TRAIN_FILE)

### Distributions

In [None]:
import matplotlib
pd.options.display.mpl_style = 'default'

### Histograms

In [None]:
df.hist(figsize=(12, 12))

### Boxplots

In [None]:
df.boxplot(column='atemp', by='season');

In [None]:
df.boxplot(figsize=(12, 12))

### Frequency Analysis

In [None]:
df['windspeed'].value_counts().plot(kind='bar');

### QQPlot

In [None]:
from statsmodels.graphics.gofplots import qqplot

qqplot(df['windspeed'], line='45', fit=True);

### Scatter Matrix

In [None]:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(df, alpha=0.2, figsize=(12, 12), diagonal='kde')

## DateTime

In [None]:
X, y = get_train_data()
X['datetime'].apply(lambda xx: xx.hour)