Let's build a model to predict how many bikes will be available at a given station.

We assume the following variables are the biggest factors in bike availability:

* Hour of day (people riding to work in the morning, from work in the afternoon)
* Day of week (usage patterns may be different on weekends)
* Month of the year (weather and seasonality may be a factor)

Start by importing and prepping the data.

In [117]:
import pandas as pd

# read the archived data, because there is more of it
df = pd.read_csv('station_status.archive-2022-08-2023-10.csv')

df['last_reported'] = pd.to_datetime(df['last_reported'], unit='s')
df['hour'] = df['last_reported'].dt.hour
df['day_of_week'] = df['last_reported'].dt.dayofweek
df['is_weekday'] = df['last_reported'].dt.weekday < 5
df['month'] = df['last_reported'].dt.month

station_information = pd.read_csv('station_information.csv')
df = station_information.join(
    df.reset_index().set_index('station_id'), on='station_id', lsuffix='_information'
)
df['station_id'] = df['station_id'].astype('category')

# Add capacity_used column, as the stations vary in size 
df['capacity_used'] = df['num_bikes_available'] / df['capacity']
df.dropna(inplace=True)
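
As a quick sanity check of the capacity_used ratio, here is a minimal sketch on made-up rows. The values, and the station reporting a capacity of 0, are hypothetical and not taken from the archive file. It illustrates one thing worth watching for: dividing by a capacity of 0 gives inf rather than NaN, so dropna alone would keep that row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; not taken from the archive file
toy = pd.DataFrame({
    'station_id': [1, 2, 3],
    'num_bikes_available': [5, 12, 3],
    'capacity': [10, 16, 0],  # assumption: one station reports capacity 0
})

# Same ratio as in the cell above
toy['capacity_used'] = toy['num_bikes_available'] / toy['capacity']

# 3 / 0 yields inf, which dropna() keeps; convert inf to NaN first
toy_clean = toy.replace([np.inf, -np.inf], np.nan).dropna()

print(toy_clean[['station_id', 'capacity_used']])
```

Whether the real station table contains zero-capacity rows is not known from the notebook; if it does not, the extra replace step is harmless.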

Define our features and target, then split the data into training and test sets.

In [118]:
from sklearn.model_selection import train_test_split

# Features and target
columns = [
    # category
    'station_id',
    # temporal features
    'hour', 
    'day_of_week', 
    'month',
    'is_weekday',
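    # NOTE: 'station_id' is listed a second time here. The duplicate column is
    # perfectly collinear with the first entry, which is why the output below
    # shows two huge station_id coefficients that cancel each other out.
    'station_id',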
    # location features
    'lat', 'lon',
]
X = df[columns]
# the variable we want to predict
y_bikes = df['capacity_used']

# Train-test split
X_train, X_test, y_bikes_train, y_bikes_test = train_test_split(X, y_bikes, test_size=0.2, shuffle=False)
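
Note that shuffle=False keeps the rows in their existing order, so the test set is simply the last 20% of rows. After the join above, that order most likely follows the station table rather than time, which is worth checking. If a time-based hold-out is what is intended, one option is to sort by last_reported before splitting. A minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: two stations, five hourly reports each, grouped by station
toy = pd.DataFrame({
    'station_id': [1] * 5 + [2] * 5,
    'last_reported': pd.to_datetime([0, 3600, 7200, 10800, 14400] * 2, unit='s'),
    'capacity_used': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
})

# Sort chronologically so that shuffle=False holds out the most recent rows
toy_sorted = toy.sort_values('last_reported', kind='stable').reset_index(drop=True)

train, test = train_test_split(toy_sorted, test_size=0.2, shuffle=False)

# Every training row is now older than every test row
print(train['last_reported'].max(), test['last_reported'].min())
```

Without the sort, the same split on this toy frame would hold out only rows from station 2, so the model would be evaluated on a station it saw less of during training.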


In [125]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_bikes_train)

# Make predictions
y_bikes_pred = model.predict(X_test)

# Evaluate the model
mae_bikes = mean_absolute_error(y_bikes_test, y_bikes_pred)
mse_bikes = mean_squared_error(y_bikes_test, y_bikes_pred)

print(f"Features: {', '.join(X.columns)}")
print('--------------------------')
print(f"MAE Bikes: {mae_bikes}")
print(f"MSE Bikes: {mse_bikes}")

# View the coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})

print('--------------------------')
print(coefficients.sort_values(by='Coefficient', ascending=False))
print('--------------------------')
print('First 10 predictions:')
print(y_bikes_pred[:10])  # Show first 10 predictions
print('--------------------------')
print('First 10 actual values:')
print(y_bikes_test[:10].values)
print('--------------------------')
print('Diff prediction actual')
# Residuals: actual minus predicted (negative means the model over-predicted)
print(y_bikes_test[:10].values - y_bikes_pred[:10])

Features: station_id, hour, day_of_week, month, is_weekday, station_id, lat, lon
--------------------------
MAE Bikes: 0.2673303001577008
MSE Bikes: 0.09641394536518778
--------------------------
       Feature    Coefficient
5   station_id  227310.300135
1         hour      -0.000320
2  day_of_week      -0.000954
4   is_weekday      -0.003464
3        month      -0.020200
7          lon      -0.932282
6          lat     -12.531551
0   station_id -227310.300078
--------------------------
First 10 predictions:
[0.33707747 0.33707747 0.33675775 0.33675775 0.33675775 0.34155879
 0.34155879 0.34155879 0.34123907 0.34123907]
--------------------------
First 10 actual values:
[0.15151515 0.15151515 0.18181818 0.18181818 0.18181818 0.18181818
 0.18181818 0.18181818 0.15151515 0.18181818]
--------------------------
Diff prediction actual
[-0.18556232 -0.18556232 -0.15493957 -0.15493957 -0.15493957 -0.1597406
 -0.1597406  -0.1597406  -0.18972392 -0.15942089]
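
The two station_id coefficients in the output above are very large and almost exactly cancel, which is what a duplicated (perfectly collinear) column tends to produce. The model is also treating station_id as a plain number rather than a label. One possible alternative, sketched below on synthetic data, is to list each feature once, one-hot encode station_id in a pipeline, and compare against a mean-predicting baseline to put the MAE in context. The toy data, the per-station fill levels, and the choice of columns are assumptions for illustration, not results from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: each station has its own typical fill level
n = 600
toy = pd.DataFrame({
    'station_id': rng.choice([101, 205, 309], size=n),
    'hour': rng.integers(0, 24, size=n),
})
station_level = toy['station_id'].map({101: 0.2, 205: 0.8, 309: 0.5})
toy['capacity_used'] = (station_level + rng.normal(0, 0.05, size=n)).clip(0, 1)

X_toy = toy[['station_id', 'hour']]  # each feature listed once
y_toy = toy['capacity_used']
X_tr, X_te = X_toy.iloc[:480], X_toy.iloc[480:]
y_tr, y_te = y_toy.iloc[:480], y_toy.iloc[480:]

# One-hot encode station_id so it is treated as a label, not a magnitude.
# drop='first' avoids introducing another perfectly collinear set of columns.
preprocess = ColumnTransformer(
    [('station', OneHotEncoder(drop='first'), ['station_id'])],
    remainder='passthrough',
)
model_ohe = make_pipeline(preprocess, LinearRegression()).fit(X_tr, y_tr)

# Baseline that always predicts the training mean
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)

mae_model = mean_absolute_error(y_te, model_ohe.predict(X_te))
mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
print(f"MAE one-hot model: {mae_model:.3f}")
print(f"MAE mean baseline: {mae_baseline:.3f}")
```

On the real data, the same baseline comparison would show whether an MAE of about 0.27 is meaningfully better than always predicting the average fill level.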
