# PDPilot Bike Rentals Colab Example

This notebook demonstrates how to use PDPilot to analyze a model trained on the [hourly bike rentals dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset), with pre-processing by [Christoph Molnar](https://christophm.github.io/interpretable-ml-book/bike-data.html).

First, we install PDPilot.

In [None]:
!pip install pdpilot -U

Next, we import pandas to load the data, our chosen model class from scikit-learn, and the `partial_dependence` function and `PDPilotWidget` class from PDPilot.

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from pdpilot import partial_dependence, PDPilotWidget

We also need one extra import and call so that custom Jupyter widgets work in Colab notebooks.

In [None]:
from google.colab import output
output.enable_custom_widget_manager()
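
Outside Colab, the `google.colab` package is normally not installed, so the import above fails. Here is a minimal sketch of a guard, assuming you want the same notebook to run in both environments. The `in_colab` flag is a name we introduce here for illustration.

```python
# Enable the custom widget manager only when running inside Colab.
try:
    from google.colab import output
    output.enable_custom_widget_manager()
    in_colab = True
except ImportError:
    # Not in Colab: widgets are assumed to work without extra setup.
    in_colab = False

print('Running in Colab:', in_colab)
```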

I've hosted a pre-processed version of the dataset as a gist. Here, we load it into a pandas dataframe.

In [None]:
dataset_url = 'https://gist.githubusercontent.com/DanielKerrigan/f324b392dc9a58d8bd8f8d79e1101a12/raw/c3b4760c9facfac26bcab2cd7465c4cab88ef304/bike-hour.csv'

In [None]:
df_original = pd.read_csv(dataset_url).drop(columns=['yr'])

The target variable is "cnt", the number of bikes rented during that hour.

In [None]:
df_original

The weather situation ("weathersit") feature is categorical, so we'll one-hot encode it.

In [None]:
df_one_hot = pd.get_dummies(df_original, columns=['weathersit'])
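
As a small illustration of what `pd.get_dummies` does, the sketch below encodes a made-up three-row frame (the `toy` data is ours, not part of the bike dataset). Each distinct value becomes an indicator column named `<column>_<value>`, and values that never occur in the data get no column.

```python
import pandas as pd

# Toy frame with an integer-coded categorical column (made-up values).
toy = pd.DataFrame({'temp': [0.2, 0.5, 0.7], 'weathersit': [1, 3, 1]})

# One indicator column is created per distinct value of 'weathersit'.
toy_one_hot = pd.get_dummies(toy, columns=['weathersit'])

# Value 2 never appears above, so there is no 'weathersit_2' column.
print(list(toy_one_hot.columns))
# ['temp', 'weathersit_1', 'weathersit_3']
```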

In [None]:
df_X = df_one_hot.drop(columns=['cnt'])

In [None]:
y = df_original['cnt'].to_numpy()

Next, we train a random forest model on the dataset.

In [None]:
regr = RandomForestRegressor(n_estimators=20)
regr.fit(df_X, y)

We will give PDPilot a list of features that we want it to compute plots for. Here, we get a list of the names of all of the features. Note that we use the original names of one-hot encoded features. For example, this list includes "weathersit" instead of "weathersit_1", "weathersit_2", etc.

In [None]:
features = [col for col in df_original.columns if col != 'cnt']

For one-hot encoded features, we need to tell PDPilot which columns belong to which feature and what values those columns correspond to.

In [None]:
one_hot_features = {
    'weathersit': [
        ('weathersit_1', 'clear'),
        ('weathersit_2', 'mist'),
        ('weathersit_3', 'rain'),
        ('weathersit_4', 'storm')
    ]
}
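
With many encoded columns, a mapping like the one above could be generated rather than typed by hand. The following is a minimal sketch, assuming the columns follow the `<feature>_<integer code>` naming that `pd.get_dummies` produces. The names `build_one_hot_entry`, `encoded_columns`, and `weathersit_labels` are hypothetical and exist only for this illustration.

```python
# Hypothetical lookup from integer code to display label.
weathersit_labels = {1: 'clear', 2: 'mist', 3: 'rain', 4: 'storm'}

# Stand-in for list(df_X.columns); only the weathersit_* columns matter here.
encoded_columns = ['temp', 'hum', 'weathersit_1', 'weathersit_2',
                   'weathersit_3', 'weathersit_4']


def build_one_hot_entry(feature, columns, labels):
    """Return (column name, label) pairs for one one-hot encoded feature."""
    prefix = feature + '_'
    entry = []
    for col in columns:
        if col.startswith(prefix):
            code = int(col[len(prefix):])  # 'weathersit_3' -> 3
            entry.append((col, labels[code]))
    return entry


one_hot_sketch = {
    'weathersit': build_one_hot_entry('weathersit', encoded_columns,
                                      weathersit_labels)
}
print(one_hot_sketch)
```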

For ordinal-encoded features, we can optionally supply string labels to display in place of the integer values.

In [None]:
feature_value_mappings = {
    'season': {
        1: 'winter',
        2: 'spring',
        3: 'summer',
        4: 'fall'
    },
    'weekday': {
        0: 'S',
        1: 'M',
        2: 'T',
        3: 'W',
        4: 'R',
        5: 'F',
        6: 'S'
    }
}
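
Before handing such a mapping to PDPilot, it can be worth checking that every value in the data has a label. We have not verified how PDPilot treats unlabeled values, so the sketch below only reports them. The `toy` frame and `toy_mappings` are made up, and the season mapping is incomplete on purpose.

```python
import pandas as pd

# Made-up data and a deliberately incomplete, hypothetical mapping.
toy = pd.DataFrame({'season': [1, 2, 3, 4, 1], 'weekday': [0, 6, 3, 3, 5]})
toy_mappings = {
    'season': {1: 'winter', 2: 'spring', 3: 'summer'},  # 4 omitted on purpose
    'weekday': {0: 'S', 1: 'M', 2: 'T', 3: 'W', 4: 'R', 5: 'F', 6: 'S'},
}

# For each mapped feature, list the values in the data that have no label.
unmapped = {
    feature: sorted(set(toy[feature].unique().tolist()) - set(mapping))
    for feature, mapping in toy_mappings.items()
}
print(unmapped)
# {'season': [4], 'weekday': []}
```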

PDPilot can support up to a few thousand instances. Here we randomly sample 500 instances from our dataset and get the corresponding ground truth labels.

In [None]:
subset = df_X.sample(500)
labels = y[subset.index]
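
Indexing `y` with `subset.index` works here because `df_X` still has the default 0, 1, 2, ... index from `pd.read_csv`, so index labels and row positions coincide. As a hedged sketch of a more defensive variant for frames with a different index, the toy example below (made-up data) first converts labels to positions with `Index.get_indexer`.

```python
import numpy as np
import pandas as pd

# Toy features with a non-default index, plus targets stored by position.
toy_X = pd.DataFrame({'hr': [0, 1, 2, 3, 4]}, index=[10, 11, 12, 13, 14])
toy_y = np.array([5, 8, 13, 21, 34])

# Fixing random_state makes the sample reproducible.
toy_subset = toy_X.sample(3, random_state=0)

# Translate the sampled index labels into row positions of toy_X,
# then use those positions to pick the matching targets.
positions = toy_X.index.get_indexer(toy_subset.index)
toy_labels = toy_y[positions]
```

Using `toy_y[toy_subset.index]` directly would raise an `IndexError` here, since the labels 10 to 14 are not valid positions in a five-element array.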

We pass this data to the `partial_dependence` function to compute the necessary data for the widget. For regression, the function that we pass to the `predict` parameter is expected to take a pandas dataframe containing instances as input and return a 1D numpy array containing the predictions for those instances.
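
If a model's prediction function returns something other than a flat array, for example a column vector of shape `(n, 1)`, a thin wrapper can adapt it to the contract described above. This is a minimal sketch: `raw_predict` is a hypothetical stand-in for such a model, and `predict_1d` is a name we introduce here.

```python
import numpy as np
import pandas as pd


# Hypothetical model function that returns a column vector of shape (n, 1).
def raw_predict(frame):
    return frame[['temp']].to_numpy() * 10.0


def predict_1d(frame):
    """Adapt raw_predict to return a flat array with one value per row."""
    return np.asarray(raw_predict(frame)).ravel()


toy_frame = pd.DataFrame({'temp': [0.1, 0.2, 0.3]})
toy_predictions = predict_1d(toy_frame)
print(toy_predictions.shape)
# (3,)
```

scikit-learn regressors fitted on a 1D target, like the random forest above, already return a 1D array from `predict`, so no wrapper is needed in this notebook.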

In [None]:
pd_data = partial_dependence(
    predict=regr.predict,
    df=subset,
    features=features,
    one_hot_features=one_hot_features,
    feature_value_mappings=feature_value_mappings,
    resolution=20,
    n_jobs=1,
)
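
For intuition about what a one-way partial dependence curve is, here is a minimal hand-rolled sketch. It is not PDPilot's implementation. It assumes an evenly spaced grid, and its `resolution` argument simply means the number of grid points in this sketch. `manual_partial_dependence`, `toy_predict`, and `toy_df` are hypothetical names and data.

```python
import numpy as np
import pandas as pd


def manual_partial_dependence(predict, df, feature, resolution=20):
    """Average prediction as `feature` sweeps a grid, other columns fixed."""
    grid = np.linspace(df[feature].min(), df[feature].max(), resolution)
    averages = []
    for value in grid:
        modified = df.copy()
        modified[feature] = value  # set every instance to this grid value
        averages.append(predict(modified).mean())
    return grid, np.array(averages)


# Hypothetical stand-in model: takes a dataframe, returns a 1D array.
def toy_predict(frame):
    return (2.0 * frame['temp'] + frame['hum']).to_numpy()


toy_df = pd.DataFrame({'temp': [0.0, 0.5, 1.0], 'hum': [0.2, 0.4, 0.6]})
grid, curve = manual_partial_dependence(toy_predict, toy_df, 'temp',
                                        resolution=3)
print(grid, curve)
```

For this linear toy model, the curve is `2 * temp` plus the mean of `hum`, so it rises linearly across the grid.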

Now we are ready to run the widget.

In [None]:
w = PDPilotWidget(
    predict=regr.predict,
    df=subset,
    labels=labels,
    pd_data=pd_data,
    height=600
)

w