Let's build a model that predicts how full a given station will be, i.e. how many bikes are available relative to its capacity.

The following features can be derived directly from the data:

* Hour of day (people riding to work in the morning, from work in the afternoon)
* Day of week (usage patterns may be different on weekends)
* Month of the year (weather and seasonality may be a factor)

Additionally, we enrich the data with:

* Elevation information for each station (from open-elevation.com)
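
Hour of day and month are periodic: hour 23 is next to hour 0, but as plain integers they sit at opposite ends of the scale. The notebook below uses the raw integers. As an illustration only, one common alternative is a sine/cosine encoding; the helper name `add_cyclical` and the toy data are assumptions made for this sketch and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def add_cyclical(frame: pd.DataFrame, column: str, period: int) -> pd.DataFrame:
    """Return a copy of `frame` with sin/cos encodings of a periodic integer column.

    Hypothetical helper for illustration; not used elsewhere in the notebook.
    """
    out = frame.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


# Toy data: a few hours of the day
toy_hours = pd.DataFrame({"hour": [0, 6, 12, 23]})
encoded = add_cyclical(toy_hours, "hour", period=24)
print(encoded.round(3))
```

In the encoded space, 23:00 ends up close to 00:00 and far from 12:00, which a linear model cannot express with a single integer column.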

Start by importing and preparing the data:

In [146]:
import pandas as pd

# read the archived data, because there is more of it
df = pd.read_csv('station_status.archive-2022-08-2023-10.csv')

df['last_reported'] = pd.to_datetime(df['last_reported'], unit='s')
df['hour'] = df['last_reported'].dt.hour
df['day_of_week'] = df['last_reported'].dt.dayofweek
df['is_weekday'] = df['last_reported'].dt.weekday < 5
df['month'] = df['last_reported'].dt.month

station_information = pd.read_csv('station_information.csv')
df = station_information.join(
    df.reset_index().set_index('station_id'), on='station_id', lsuffix='_information'
)
station_elevation = pd.read_csv('station_elevation.csv')
df = station_elevation.join(
    df.reset_index().set_index('station_id'), on='station_id',
)

# Add capacity_used column (bikes available as a fraction of station capacity), as the stations vary in size
df['capacity_used'] = df['num_bikes_available'] / df['capacity']
df.dropna(inplace=True)
df['station_id'] = df['station_id'].astype('category')
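
Casting `station_id` to `category` marks the column as categorical in pandas, but `LinearRegression` does not act on that dtype. Judging by the single `station_id` coefficient in the output further down, the ID is fitted as an ordinary number. If each station should get its own offset instead, one option is one-hot encoding inside a scikit-learn pipeline. The following is a minimal sketch on made-up toy data, not the notebook's method:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the notebook's DataFrame (values are made up)
toy = pd.DataFrame({
    "station_id": [10, 10, 20, 20, 30, 30],
    "hour": [8, 17, 8, 17, 8, 17],
    "capacity_used": [0.2, 0.6, 0.8, 0.4, 0.5, 0.5],
})
toy_features = toy[["station_id", "hour"]]

# One-hot encode station_id; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    transformers=[
        ("station", OneHotEncoder(handle_unknown="ignore"), ["station_id"]),
    ],
    remainder="passthrough",
)
pipeline = make_pipeline(preprocess, LinearRegression())
pipeline.fit(toy_features, toy["capacity_used"])

# Three station indicator columns plus the hour column
n_columns = pipeline[0].transform(toy_features).shape[1]
print(n_columns)
```

With `handle_unknown="ignore"`, a station that never appears in the training rows is encoded as all zeros, so this approach only helps if the test stations are also present in the training set.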


Define the features and the target, then split the data into training and test sets:

In [147]:
from sklearn.model_selection import train_test_split

# Features and target
columns = [
    # category
    'station_id',
    # temporal features
    'hour', 
    'day_of_week', 
    'month',
    'is_weekday',
    # location features
    'lat', 'lon',
    'elevation'
]
X = df[columns]
# the variable we want to predict
y_bikes = df['capacity_used']

# Train-test split
X_train, X_test, y_bikes_train, y_bikes_test = train_test_split(X, y_bikes, test_size=0.2, shuffle=False)
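
With `shuffle=False`, `train_test_split` holds out the last 20% of rows in their current order. After the joins above, the rows most likely follow the order of the station tables rather than time, so the held-out rows may come from a few stations instead of from the most recent period. If a chronological hold-out is the goal, sorting by `last_reported` before splitting is one way to get it. A sketch on toy data (the values are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data grouped by station, as it would be after a station-first join
timestamps = [0, 3600, 7200, 10800, 14400]
toy = pd.DataFrame({
    "station_id": [1] * 5 + [2] * 5,
    "last_reported": pd.to_datetime(timestamps * 2, unit="s"),
    "capacity_used": [0.2, 0.3, 0.4, 0.5, 0.6, 0.9, 0.8, 0.7, 0.6, 0.5],
})

# Without sorting, the last 20% of rows all belong to station 2
_, unsorted_test = train_test_split(toy, test_size=0.2, shuffle=False)

# Sorting by time first makes the hold-out the most recent reports
ordered = toy.sort_values("last_reported", kind="stable")
chrono_train, chrono_test = train_test_split(ordered, test_size=0.2, shuffle=False)

print(sorted(unsorted_test["station_id"].unique()))
print(sorted(chrono_test["station_id"].unique()))
```

In the chronological variant every held-out report is at least as recent as the newest training report, and both stations appear in the test set.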


Fit a linear regression model and evaluate it on the test set:

In [148]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_bikes_train)

# Make predictions
y_bikes_pred = model.predict(X_test)

# Evaluate the model
mae_bikes = mean_absolute_error(y_bikes_test, y_bikes_pred)
mse_bikes = mean_squared_error(y_bikes_test, y_bikes_pred)

print(f"Features: {', '.join(X.columns)}")
print('--------------------------')
print(f"MAE Bikes: {mae_bikes}")
print(f"MSE Bikes: {mse_bikes}")

# View the coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})

print('--------------------------')
print(coefficients.sort_values(by='Coefficient', ascending=False))
print('--------------------------')
print('First 10 predictions:')
print(y_bikes_pred[:10])  # Show first 10 predictions
print('--------------------------')
print('First 10 actual values:')
print(y_bikes_test[:10].values)
print('--------------------------')
# Residuals for the same rows: actual value minus predicted value
print('Diff prediction actual')
print(y_bikes_test[:10].values - y_bikes_pred[:10])

Features: station_id, hour, day_of_week, month, is_weekday, lat, lon, elevation
--------------------------
MAE Bikes: 0.2657597743147935
MSE Bikes: 0.10329297768802523
--------------------------
       Feature  Coefficient
5          lat     3.554366
6          lon     0.941189
1         hour     0.000481
0   station_id     0.000171
2  day_of_week    -0.000387
4   is_weekday    -0.000951
7    elevation    -0.007064
3        month    -0.020759
--------------------------
First 10 predictions:
[0.38387199 0.38387199 0.38387199 0.38435263 0.38435263 0.38435263
 0.38483327 0.38483327 0.38483327 0.38531391]
--------------------------
First 10 actual values:
[0.20512821 0.20512821 0.20512821 0.20512821 0.20512821 0.20512821
 0.23076923 0.23076923 0.23076923 0.23076923]
--------------------------
Diff prediction actual
[-0.17874379 -0.17874379 -0.17874379 -0.17922443 -0.17922443 -0.17922443
 -0.15406404 -0.15406404 -0.15406404 -0.15454468]
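
An MAE of about 0.27 on a target that lies between 0 and 1 is hard to judge without a reference point. One simple reference is a model that always predicts the training mean. The sketch below shows the comparison on synthetic stand-in data; the data, the 400/100 split, and the variable names are assumptions for illustration, and the numbers say nothing about the bike data itself.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in: occupancy rises with hour, plus noise (made-up relationship)
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500)
X_toy = hour.reshape(-1, 1).astype(float)
y_toy = np.clip(0.5 + 0.02 * (hour - 12) + rng.normal(0, 0.1, size=500), 0, 1)

# Simple ordered split: first 400 rows for training, last 100 for testing
X_tr, X_te = X_toy[:400], X_toy[400:]
y_tr, y_te = y_toy[:400], y_toy[400:]

# Reference model that always predicts the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
toy_model = LinearRegression().fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_toy_model = mean_absolute_error(y_te, toy_model.predict(X_te))
print(f"MAE mean baseline:     {mae_baseline:.4f}")
print(f"MAE linear regression: {mae_toy_model:.4f}")
```

In the notebook, the same comparison would fit `DummyRegressor` on `X_train` and `y_bikes_train` and score it on `X_test`. If the linear model's MAE is not clearly below that value, the features are contributing little in their current form.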
