Let's build a model to predict how many bikes will be available at a given station, expressed as a fraction of the station's capacity.

The following features can be derived directly from the timestamp of each report:

* Hour of day (people riding to work in the morning, from work in the afternoon)
* Day of week (usage patterns may be different on weekends)
* Month of the year (weather and seasonality may be a factor)

Additionally, we enrich the data with:

* Elevation information for each station (from open-elevation.com)
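
The elevation lookup itself is not shown in this notebook. A minimal sketch of how `station_elevation.csv` might be produced follows. It assumes the public Open-Elevation endpoint `https://api.open-elevation.com/api/v1/lookup`, which is expected to accept `locations=lat,lon|lat,lon` and return a JSON `results` list in request order. The helper names and the output columns (`station_id`, `elevation`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public Open-Elevation API; not confirmed by this notebook.
LOOKUP_URL = "https://api.open-elevation.com/api/v1/lookup"


def build_lookup_url(stations):
    """Build a GET URL for a batch of (station_id, lat, lon) tuples."""
    locations = "|".join(f"{lat},{lon}" for _, lat, lon in stations)
    return f"{LOOKUP_URL}?{urllib.parse.urlencode({'locations': locations})}"


def parse_elevations(stations, payload):
    """Pair each station_id with the elevation returned in the same position."""
    results = payload["results"]
    if len(results) != len(stations):
        raise ValueError("response length does not match request length")
    return [
        {"station_id": station_id, "elevation": result["elevation"]}
        for (station_id, _, _), result in zip(stations, results)
    ]


def fetch_elevations(stations, timeout=30):
    """Query the API once for all stations (performs a network request)."""
    with urllib.request.urlopen(build_lookup_url(stations), timeout=timeout) as resp:
        return parse_elevations(stations, json.load(resp))
```

The returned rows could then be written out with `pd.DataFrame(rows).to_csv('station_elevation.csv', index=False)`.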

Start by importing and preparing the data.

In [24]:
import pandas as pd

# read the archived data, because there is more of it
df = pd.read_csv('station_status.csv')

df['last_reported'] = pd.to_datetime(df['last_reported'], unit='s')
df['hour'] = df['last_reported'].dt.hour
df['day_of_week'] = df['last_reported'].dt.dayofweek
df['is_weekday'] = df['last_reported'].dt.weekday < 5
df['month'] = df['last_reported'].dt.month


station_information = pd.read_csv('station_information.csv')
df = station_information.join(
    df.reset_index().set_index('station_id'), on='station_id', lsuffix='_information', how='inner'
)
station_elevation = pd.read_csv('station_elevation.csv')
df = station_elevation.join(
    df.reset_index().set_index('station_id'), on='station_id', how='inner'
)

# Add capacity_used column, as the stations vary in size 
df['capacity_used'] = df['num_bikes_available'] / df['capacity']


# Calculate various rolling averages
# Set datetime as index for rolling operations
df = df.set_index('last_reported')
df['7day_avg'] = df.groupby('station_id')['capacity_used'].rolling('7D').mean().reset_index(0, drop=True)
df['24hr_avg'] = df.groupby('station_id')['capacity_used'].rolling('24h').mean().reset_index(0, drop=True)
df['7day_std'] = df.groupby('station_id')['capacity_used'].rolling('7D').std().reset_index(0, drop=True)

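# Calculate week-over-week difference: compare each report with the one
# 7*20*3 = 420 rows earlier for the same station (assumes about 60 reports per station per day)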
df['wow_diff'] = df.groupby('station_id')['capacity_used'].diff(periods=7*20*3)

# Calculate exponentially weighted moving average
df['ewma'] = df.groupby('station_id')['capacity_used'].ewm(span=7).mean().reset_index(0, drop=True)

df.dropna(inplace=True)
df['station_id'] = df['station_id'].astype('category')
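
Note that the rolling windows and the EWMA above include the current row's `capacity_used`, which is also the prediction target. If the goal is to forecast a reading before it is observed, one option is to shift each station's series by one report before aggregating. A minimal sketch on toy data follows. The frame `toy` and the `_prev` column names are hypothetical, and a count-based window of 3 stands in for the time-based windows used above.

```python
import pandas as pd

# Toy data: two stations, four reports each, already sorted by station and time.
toy = pd.DataFrame({
    "station_id": ["a"] * 4 + ["b"] * 4,
    "capacity_used": [0.2, 0.4, 0.6, 0.8, 0.9, 0.7, 0.5, 0.3],
})

# Shift within each station so row t only sees values up to t-1.
previous = toy.groupby("station_id")["capacity_used"].shift(1)

# Aggregate the shifted series per station; the target at row t is excluded.
toy["ewma_prev"] = (
    previous.groupby(toy["station_id"])
    .transform(lambda s: s.ewm(span=7).mean())
)
toy["avg3_prev"] = (
    previous.groupby(toy["station_id"])
    .transform(lambda s: s.rolling(3, min_periods=1).mean())
)

print(toy)
```

The first report of each station has no history, so its shifted features are `NaN` and would be removed by a later `dropna`.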


Define our features and split the data into training and test sets.

In [25]:
from sklearn.model_selection import train_test_split

# Features and target
columns = [
    # category
    'station_id',
    # temporal features
    'hour', 
    'day_of_week', 
    'month',
    'is_weekday',
    # location features
    'lat', 'lon',
    'elevation',
    # rolling averages
    '7day_avg',
    '24hr_avg',
    '7day_std',
    'wow_diff',
    'ewma',
]
X = df[columns]
# the variable we want to predict
y_bikes = df['capacity_used']

# Train-test split
X_train, X_test, y_bikes_train, y_bikes_test = train_test_split(X, y_bikes, test_size=0.2, shuffle=False)
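
With `shuffle=False`, `train_test_split` takes the last 20% of rows as the test set, so the split is chronological only if the rows are ordered by time. The joins earlier may leave rows grouped by station instead. A minimal sketch of an explicit time-based split on toy data follows; the helper `chronological_split` and the toy frame are hypothetical.

```python
import pandas as pd


def chronological_split(frame, test_size=0.2):
    """Return (train, test) where test holds the most recent rows by index time."""
    ordered = frame.sort_index(kind="stable")
    cutoff = int(len(ordered) * (1 - test_size))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Toy frame indexed by timestamp, with rows grouped by station as a join might leave them.
times = pd.date_range("2024-01-01", periods=5, freq="D")
toy = pd.DataFrame(
    {"station_id": ["a"] * 5 + ["b"] * 5, "capacity_used": [0.1 * i for i in range(10)]},
    index=times.append(times),
)

train, test = chronological_split(toy)
print(train.index.max(), "<=", test.index.min())
```

Splitting on a cutoff timestamp rather than a row count would additionally keep all reports from the same instant on the same side of the split.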


Fit a linear regression model and evaluate it on the test set.

In [26]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Initialize and train the Linear Regression model
# Note: station_id enters as a single numeric column, so the model treats it as an ordinal value
model = LinearRegression()
model.fit(X_train, y_bikes_train)

# Make predictions
y_bikes_pred = model.predict(X_test)

# Evaluate the model
mae_bikes = mean_absolute_error(y_bikes_test, y_bikes_pred)
mse_bikes = mean_squared_error(y_bikes_test, y_bikes_pred)

print(f"Features: {', '.join(X.columns)}")
print('--------------------------')
print(f"MAE Bikes: {mae_bikes}")
print(f"MSE Bikes: {mse_bikes}")

# View the coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})

print('--------------------------')
print(coefficients.sort_values(by='Coefficient', ascending=False))
print('--------------------------')
# Residuals for the first ten test rows (actual minus predicted)
print('Diff prediction actual')
print(y_bikes_test[:10].values - y_bikes_pred[:10])

Features: station_id, hour, day_of_week, month, is_weekday, lat, lon, elevation, 7day_avg, 24hr_avg, 7day_std, wow_diff, ewma
--------------------------
MAE Bikes: 0.040965197926536016
MSE Bikes: 0.004416209870681401
--------------------------
        Feature  Coefficient
12         ewma     1.012358
8      7day_avg     0.132357
11     wow_diff     0.081184
10     7day_std     0.014875
1          hour     0.000516
7     elevation     0.000059
0    station_id     0.000011
3         month    -0.000186
2   day_of_week    -0.000398
4    is_weekday    -0.002835
6           lon    -0.084451
9      24hr_avg    -0.150325
5           lat    -0.263020
--------------------------
Diff prediction actual
[-0.02161095 -0.00831283  0.03766476  0.0707149   0.04221698  0.05988045
  0.03698652  0.00182849 -0.00693978 -0.01378236]
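
The `ewma` coefficient of about 1.01 suggests the model largely passes through a smoothed version of recent values, so it helps to compare the MAE against a naive baseline. A minimal sketch on toy numbers follows. It scores a persistence forecast (each prediction is the station's previous reading) and converts the fractional error into bikes. The toy frame and the column `naive_pred` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "station_id": ["a", "a", "a", "b", "b", "b"],
    "capacity": [20, 20, 20, 10, 10, 10],
    "capacity_used": [0.50, 0.60, 0.55, 0.20, 0.40, 0.30],
})

# Persistence baseline: predict each reading with the station's previous one.
toy["naive_pred"] = toy.groupby("station_id")["capacity_used"].shift(1)
scored = toy.dropna(subset=["naive_pred"])

abs_err = (scored["capacity_used"] - scored["naive_pred"]).abs()
mae_fraction = abs_err.mean()
# Scale each fractional error by its station's capacity to express it in bikes.
mae_bikes = (abs_err * scored["capacity"]).mean()

print(f"Baseline MAE (fraction of capacity): {mae_fraction:.4f}")
print(f"Baseline MAE (bikes): {mae_bikes:.2f}")
```

Running the same calculation on the real test set would show how much the regression improves on simply repeating the last observed value.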
