Eric-Ristol/bicing-predictor

Bicing Availability Predictor

Predict how many bikes will be available at a Bicing station in Barcelona 15 minutes from now.

This is a small but complete time-series ML project: data pipeline, feature engineering, proper time-based train/test split, a persistence baseline, three sklearn/LightGBM models, evaluation, saving, prediction, and tests.


Why this problem

Bicing is the public bike-sharing service in Barcelona. If you've ever cycled up to a station at 8:55 and found zero bikes, or rolled up to drop your bike off at 9:05 and found zero docks, you already know why predicting availability a few minutes ahead is useful.

A 15-minute horizon is short enough to be useful for routing decisions and long enough to be non-trivial (you can't just return the current count — demand shifts by neighbourhood and hour).


What's in the repo

bicing-availability-predictor/
├── data.py              synthetic snapshot generator + feature engineering
├── fetch_live.py        optional: pull real GBFS snapshots from the live API
├── train.py             baseline + LinearRegression + RandomForest + LightGBM, saves winner
├── predict.py           load model, predict bikes_available at t+15 min
├── main.py              CLI menu (I - VII)
├── test_pipeline.py     pytest tests
├── requirements.txt
├── api/
│   ├── app.py           FastAPI server (model loaded once, REST predictions)
│   └── static/
│       └── index.html   web demo (pick a station, see the prediction)
├── data/                generated snapshots live here
├── models/              saved model + feature list + comparison.csv
└── plots/               true-vs-predicted scatter + feature importance chart

Dataset

The default path uses synthetic snapshots generated by data.py. They match the real Bicing GBFS schema (station_id, lat, lon, capacity, timestamp, bikes_available, docks_available), so swapping in real data requires no changes elsewhere.

Three station archetypes are baked in:

  • residential: full at night, empties in the morning
  • central: empty at night, fills during the workday
  • university: bell-shaped fill pattern around midday

Weekends flatten the daily pattern. Gaussian noise is added so the series isn't deterministic.
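The archetype curves above could be sketched like this (an illustrative sketch, not the actual `data.py` implementation — the function name, cosine/Gaussian shapes, and noise level are all assumptions chosen to match the described behaviour):

```python
import numpy as np

def fill_fraction(station_type: str, hour: float, is_weekend: bool,
                  noise: float = 0.05) -> float:
    """Illustrative fill fraction (0..1) for one station archetype at a given hour."""
    if station_type == "residential":
        # Full at night (peak near 3:00), empties toward midday.
        base = 0.5 + 0.4 * np.cos(2 * np.pi * (hour - 3) / 24)
    elif station_type == "central":
        # Mirror image: empty at night, fills during the workday.
        base = 0.5 - 0.4 * np.cos(2 * np.pi * (hour - 3) / 24)
    else:
        # University: bell-shaped fill pattern around midday.
        base = 0.15 + 0.7 * np.exp(-((hour - 13) ** 2) / (2 * 3.0 ** 2))
    if is_weekend:
        # Weekends flatten the daily pattern toward the mean.
        base = 0.5 + 0.4 * (base - 0.5)
    # Gaussian noise so the series isn't deterministic; clip to a valid fraction.
    return float(np.clip(base + np.random.normal(0.0, noise), 0.0, 1.0))

bikes_now = round(fill_fraction("residential", hour=3, is_weekend=False) * 24)  # capacity 24
```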

If you want to train on real data, run python fetch_live.py --loop on your laptop for a few days and the CSV will grow with live observations.

Features

For each row (station, timestamp) the model sees:

  • hour, minute, dayofweek, is_weekend — time features
  • capacity, bikes_available, docks_available — current state
  • lag_1, lag_2, lag_3 — bikes available 15, 30, 45 minutes ago (per station)
  • rolling_1h — rolling mean of bikes available over the last hour
  • type_residential, type_central, type_university — one-hot station type

Target: bikes_available at t + 15 minutes (same station).

Models

Model              Why it's here
Persistence        Baseline — "the future will look like right now". Must beat it.
Linear Regression  Sanity check. Fast and interpretable.
Random Forest      Handles interactions and non-linearities without tuning.
LightGBM           Gradient-boosted trees — winner on most tabular time-series problems. Fast.

Latest results on the synthetic 14-day dataset (20 stations, 15-min snapshots):

Model             MAE    RMSE   R²
Persistence       2.44   3.12   0.79
LinearRegression  2.04   2.59   0.86
RandomForest      1.86   2.36   0.88
LightGBM          1.84   2.33   0.89

Metrics reported: MAE, RMSE, R². RMSE is the primary score because big misses (empty / full stations) hurt users the most.
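The persistence baseline and the three metrics can be reproduced in a few lines of sklearn (a sketch of the idea; `train.py` may structure this differently, and the sample arrays are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MAE, RMSE, and R² for one model's predictions."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
    }

# Persistence baseline: predict that t+15 equals the current count.
y_now = np.array([5, 8, 2, 12])    # bikes_available at t
y_true = np.array([4, 9, 0, 12])   # bikes_available at t + 15 min
baseline_scores = evaluate(y_true, y_now)
```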

Train / test split

Time-based, never random. The last 20% of the timeline is the test set. Random splitting leaks future info into training on time-series data and produces optimistic-but-wrong numbers.
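That split can be expressed in a few lines (a sketch assuming a DataFrame with a `timestamp` column; the cutoff-by-position approach is one of several ways to do it):

```python
import pandas as pd

def time_split(df: pd.DataFrame, test_frac: float = 0.2):
    """Hold out the last `test_frac` of the timeline as the test set. No shuffling."""
    ts = df["timestamp"].sort_values()
    cutoff = ts.iloc[int(len(ts) * (1 - test_frac))]
    train = df[df["timestamp"] < cutoff]
    test = df[df["timestamp"] >= cutoff]
    return train, test
```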

How to run

pip install -r requirements.txt
python main.py         # interactive menu

Or drive each piece directly:

python data.py          # (nothing — functions only; use main.py option I)
python train.py         # generates data if missing, trains, saves best model
python predict.py       # asks you for a station id, prints a prediction
pytest -q               # runs the tests

Web demo (API)

After training, launch the API server:

python main.py          # pick option VI
# or directly:
uvicorn api.app:app --reload

Then open http://localhost:8000 in your browser. Pick a station from the dropdown, click Predict, and see the current bikes vs the 15-minute prediction with a visual capacity bar.

Endpoints:

  • GET / — the web demo
  • GET /stations — list all stations with metadata (type, capacity, coordinates)
  • POST /predict — send {"station_id": 5}, get back bikes now + predicted in 15 min
  • GET /health — server health check
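With the server running, the /predict endpoint can be called from Python using only the standard library (the request body follows the example above; the shape of the JSON response is whatever `api/app.py` returns and is not assumed here):

```python
import json
import urllib.request

def predict_bikes(station_id: int, base_url: str = "http://localhost:8000") -> dict:
    """POST {"station_id": ...} to /predict and return the JSON response."""
    req = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps({"station_id": station_id}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())

# With the server started via `uvicorn api.app:app`:
# print(predict_bikes(5))
```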

What I'd do next

  • Build a dashboard showing a live map with predicted availability per station (done: shipped as the FastAPI + web demo above instead of Streamlit).
  • Expand the horizon to a multi-step forecast (15 / 30 / 45 / 60 min) and plot error vs horizon.
  • Use weather features (rain → bike demand crashes). Barcelona publishes free weather data via the AEMET open API.
  • Deploy fetch_live.py on a cheap VM or a cron job and train on weeks of real Bicing history.
  • Tune LightGBM with n_estimators=500 + early stopping once real data is available.

Tests

pytest -q

Covers: CSV generation, feature matrix correctness, no train/test time leakage, model artifacts are saved, best model beats the persistence baseline, and predictions stay inside [0, capacity].
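The last check, keeping predictions inside [0, capacity], can be sketched as (a hypothetical helper; `test_pipeline.py` may structure this differently):

```python
import numpy as np

def clip_prediction(pred: float, capacity: int) -> float:
    """A station can't hold fewer than 0 or more than `capacity` bikes."""
    return float(np.clip(pred, 0, capacity))
```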


Built as part of my AI/ML portfolio. Feedback and issues welcome.
