This notebook is an example of what users can achieve on their own systems after installing Jupyter, pandas, scikit-learn, and related packages. Feel free to edit and run the code here.

Predicting Pittsburgh Bike-share Rides
This example takes a data set from a Pittsburgh bike-share company and shows the strong correlation between the maximum temperature and the number of rides taken for a given day. The data set contains other variables, such as month or holiday, so feel free to modify the code to explore the data further!

In [None]:
import warnings
warnings.filterwarnings("ignore")
import io
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
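# `js.fetch` is only available in a browser-based (Pyodide/JupyterLite) kernel.
from js import fetch
# Download the CSV and read the response body as text.
res = await fetch('https://jupyterlite.anaconda.cloud/b0df9a1c-3954-4c78-96e6-07ab473bea1a/files/bikeshare.csv')
csv_data = await res.text()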

In [None]:
df = pd.read_csv(io.StringIO(csv_data))
df.head()

Next we visualize the maximum temperature and number of rides. The correlation looks pretty strong, but appearances can be deceiving. Feel free to modify this code to look at other variables, like Holiday or home_game!

In [None]:
df_temp_rides = df.set_index('Date')[['Max Temp','n_rides']]
fig, axs = plt.subplots(2, 1, figsize=(18,6))
plot_temp = axs[0]
plot_rides = axs[1]
plot_temp.set_xlabel('Day')
plot_temp.set_ylabel('Max Temperature')
plot_temp.plot(df_temp_rides['Max Temp'])
plot_rides.set_xlabel('Day')
plot_rides.set_ylabel('Number of Rides')
plot_rides.plot(df_temp_rides['n_rides'])
plt.show()
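
Because a visual impression can mislead, it may help to put a number on the relationship. The following is a minimal sketch that computes the Pearson correlation coefficient with pandas. The rows in `sample` are hypothetical stand-in values, not data from the bike-share file; only the column names 'Max Temp' and 'n_rides' come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, use the real `df` instead.
sample = pd.DataFrame({
    "Max Temp": [35, 48, 52, 61, 70, 78, 84],
    "n_rides": [40, 95, 120, 210, 260, 340, 390],
})

# Pearson correlation coefficient between the two columns (range -1 to 1).
r = sample["Max Temp"].corr(sample["n_rides"])
print(f"Pearson r = {r:.3f}")
```

On the real data, the equivalent call would be `df['Max Temp'].corr(df['n_rides'])`. A value close to 1 would support the visual impression, while a value near 0 would suggest the two plots only look similar.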

Grid search with a gradient boosting regressor
Next, we run a grid search over the tree depth of a gradient boosting regressor that predicts n_rides, so that we can then see how strongly each variable contributes to the prediction. We hold out 20% of the dataset for validation, fit the grid search on the remaining 80%, and score the best estimator on the held-out data.

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
# One-hot encode Month, then drop the date and the target column from the features.
X = pd.get_dummies(df, columns=['Month'], drop_first=True).drop(['Date','n_rides'], axis='columns')
y = df['n_rides']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)
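
To make the split proportions concrete, here is a small sketch on synthetic arrays. `X_demo` and `y_demo` are hypothetical placeholders, and `random_state=0` is added only so the sketch is reproducible; the notebook's own call does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 feature columns.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# test_size=0.2 holds out 20% of the rows; the other 80% are used for training.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_va.shape)
```

Without a fixed `random_state`, each run produces a different split, so the validation score in the next cell can vary from run to run.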

In [None]:
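# Scale features to [0, 1], then fit a gradient boosting regressor.
model = make_pipeline(MinMaxScaler(), GradientBoostingRegressor())
# Search over the maximum depth of the individual trees.
params = {'gradientboostingregressor__max_depth': range(3, 20)}
grid = GridSearchCV(model, params, cv=5)

grid.fit(X_train, y_train)

# R^2 of the best estimator on the held-out validation set.
grid.best_estimator_.score(X_valid, y_valid)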

In [None]:
grid.best_params_

Lastly, we extract the feature importances from the best estimator discovered by the grid search. We can see that the best estimator also ranks maximum temperature as the most important feature.
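
One possible way to do this extraction is sketched below. It is self-contained and runs on synthetic data: `X_demo`, `y_demo`, the reduced parameter grid, `n_estimators=20`, and `random_state=0` are all assumptions chosen to keep the sketch small and fast, not values from the notebook. The step name `gradientboostingregressor` is the one `make_pipeline` assigns automatically.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the target depends on 'Max Temp' but not on 'Holiday'.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Max Temp": rng.uniform(20, 90, size=120),
    "Holiday": rng.integers(0, 2, size=120),
})
y_demo = 5 * X_demo["Max Temp"] + rng.normal(0, 10, size=120)

model = make_pipeline(
    MinMaxScaler(),
    GradientBoostingRegressor(n_estimators=20, random_state=0),
)
grid = GridSearchCV(model, {"gradientboostingregressor__max_depth": [2, 3]}, cv=3)
grid.fit(X_demo, y_demo)

# Pull the fitted regressor out of the best pipeline and label its importances.
gbr = grid.best_estimator_.named_steps["gradientboostingregressor"]
importances = pd.Series(
    gbr.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook itself, the same last three statements should work with the fitted `grid` and `X.columns` in place of `X_demo.columns`. Note that these impurity-based importances sum to 1 and describe the fitted model, not a causal effect.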