Predict how many bikes will be available at a Bicing station in Barcelona 15 minutes from now.
This is a small but complete time-series ML project: data pipeline, feature engineering, proper time-based train/test split, a persistence baseline, three sklearn/LightGBM models, evaluation, saving, prediction, and tests.
Bicing is the public bike-sharing service in Barcelona. If you've ever cycled up to a station at 8:55 and found zero bikes, or rolled up to drop your bike off at 9:05 and found zero docks, you already know why predicting availability a few minutes ahead is useful.
A 15-minute horizon is short enough to be useful for routing decisions and long enough to be non-trivial (you can't just return the current count — demand shifts by neighbourhood and hour).
bicing-availability-predictor/
├── data.py synthetic snapshot generator + feature engineering
├── fetch_live.py optional: pull real GBFS snapshots from the live API
├── train.py baseline + LinearRegression + RandomForest + LightGBM, saves winner
├── predict.py load model, predict bikes_available at t+15 min
├── main.py CLI menu (I - VII)
├── test_pipeline.py pytest tests
├── requirements.txt
├── api/
│ ├── app.py FastAPI server (model loaded once, REST predictions)
│ └── static/
│ └── index.html web demo (pick a station, see the prediction)
├── data/ generated snapshots live here
├── models/ saved model + feature list + comparison.csv
└── plots/ true-vs-predicted scatter + feature importance chart
The default path uses synthetic snapshots generated by data.py. They match the real Bicing GBFS schema (station_id, lat, lon, capacity, timestamp, bikes_available, docks_available), so swapping in real data requires no changes elsewhere.
Three station archetypes are baked in:
- residential: full at night, empties in the morning
- central: empty at night, fills during the workday
- university: bell-shaped fill pattern around midday
Weekends flatten the daily pattern. Gaussian noise is added so the series isn't deterministic.
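The three archetypes can be pictured as simple daily fill curves. The sketch below is illustrative only and is not the code in data.py: the function names (`expected_fill`, `sample_bikes`), the curve shapes, the weekend flattening factor, and the noise level are all assumptions chosen to match the descriptions above.

```python
import math
import random


def expected_fill(station_type: str, hour: float, is_weekend: bool) -> float:
    """Expected fraction of docks holding a bike (0..1). Hypothetical curves."""
    # Smooth 0..1 curve: 0 at 03:00, 1 at 15:00.
    workday = 0.5 - 0.5 * math.cos(2 * math.pi * (hour - 3) / 24)
    if station_type == "residential":
        fill = 1.0 - workday  # full at night, empties during the day
    elif station_type == "central":
        fill = workday  # empty at night, fills during the workday
    elif station_type == "university":
        # Bell shape centred on 13:00 with a 3-hour spread (assumed values).
        fill = math.exp(-((hour - 13) ** 2) / (2 * 3 ** 2))
    else:
        raise ValueError(f"unknown station type: {station_type}")
    if is_weekend:
        # Flatten the daily pattern by pulling it toward half full.
        fill = 0.5 + 0.4 * (fill - 0.5)
    return fill


def sample_bikes(station_type: str, hour: float, is_weekend: bool,
                 capacity: int, rng: random.Random, noise_sd: float = 1.5) -> int:
    """Draw one noisy bike count, kept inside [0, capacity]."""
    mean = capacity * expected_fill(station_type, hour, is_weekend)
    noisy = rng.gauss(mean, noise_sd)  # Gaussian noise around the curve
    return int(min(capacity, max(0, round(noisy))))
```

Calling `sample_bikes("central", 9.0, False, 20, random.Random(0))` gives one plausible morning reading for a 20-dock central station.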
If you want to train on real data, run python fetch_live.py --loop on your laptop for a few days and the CSV will grow with live observations.
For each row (station, timestamp) the model sees:
- hour, minute, dayofweek, is_weekend: time features
- capacity, bikes_available, docks_available: current state
- lag_1, lag_2, lag_3: bikes available 15, 30, 45 minutes ago (per station)
- rolling_1h: rolling mean of bikes available over the last hour
- type_residential, type_central, type_university: one-hot station type
Target: bikes_available at t + 15 minutes (same station).
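A minimal pandas sketch of how the lag, rolling, and target columns could be built. It is not the project's actual implementation: `build_features` and the `target` column name are hypothetical, one snapshot per station every 15 minutes is assumed (so 4 rows equal one hour), and the rolling window is assumed to include the current snapshot. The one-hot station type columns are left out because the station type is not part of the snapshot schema.

```python
import pandas as pd


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add time, lag, rolling and target columns, computed per station."""
    df = df.sort_values(["station_id", "timestamp"]).copy()
    ts = pd.to_datetime(df["timestamp"])
    df["hour"] = ts.dt.hour
    df["minute"] = ts.dt.minute
    df["dayofweek"] = ts.dt.dayofweek
    df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)

    # Group by station so lags never cross from one station into another.
    by_station = df.groupby("station_id")["bikes_available"]
    for k in (1, 2, 3):
        df[f"lag_{k}"] = by_station.shift(k)  # value 15 * k minutes ago
    df["rolling_1h"] = by_station.transform(
        lambda s: s.rolling(4, min_periods=1).mean()
    )
    # Target: the same station's count one snapshot (15 minutes) later.
    df["target"] = by_station.shift(-1)

    # Rows without a full lag history or without a future value are dropped.
    return df.dropna(subset=["lag_1", "lag_2", "lag_3", "target"])
```

Shifting inside the groupby is the important detail: a plain `df["bikes_available"].shift(1)` would leak the last reading of one station into the first row of the next.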
| Model | Why it's here |
|---|---|
| Persistence | Baseline: "the future will look like right now". Every other model must beat it. |
| Linear Regression | Sanity check. Fast and interpretable. |
| Random Forest | Handles interactions and non-linearities without tuning. |
| LightGBM ✓ | Gradient-boosted trees. Often the strongest model on tabular problems, and fast to train. |
Latest results on the synthetic 14-day dataset (20 stations, 15-min snapshots):
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Persistence | 2.44 | 3.12 | 0.79 |
| LinearRegression | 2.04 | 2.59 | 0.86 |
| RandomForest | 1.86 | 2.36 | 0.88 |
| LightGBM | 1.84 | 2.33 | 0.89 |
Metrics reported: MAE, RMSE, R². RMSE is the primary score because big misses (empty / full stations) hurt users the most.
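For reference, the three metrics can be computed with a few lines of NumPy. This is a sketch using the standard definitions, not necessarily the code in train.py; the function name `regression_metrics` and the toy numbers are made up for illustration. The example scores a persistence forecast, where the prediction is simply the current bike count.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """MAE, RMSE and R^2 from their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))  # squaring weights big misses more
    ss_res = float(np.sum(err ** 2))
    # Assumes y_true is not constant, otherwise ss_tot is zero.
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {"MAE": mae, "RMSE": rmse, "R2": 1.0 - ss_res / ss_tot}


# Toy persistence baseline: predict that t+15 equals the current count.
bikes_now = [9, 12, 10, 4]
bikes_in_15 = [10, 12, 8, 0]
scores = regression_metrics(bikes_in_15, bikes_now)
```

In the toy data one miss of 4 bikes accounts for 16 of the 21 units of squared error, which is why RMSE reacts more strongly than MAE to a station that unexpectedly empties.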
The train/test split is time-based, never random. The last 20% of the timeline is the test set. On time-series data, a random split leaks future information into training and produces optimistic but wrong numbers.
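A time-based split can be sketched as a single cutoff on the timestamp column. The helper name `time_based_split` is hypothetical and the real train.py may cut the data differently (for example by row count); this version cuts at 80% of the elapsed time so that every station shares the same boundary.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, time_col: str = "timestamp",
                     test_fraction: float = 0.2):
    """Return (train, test) where test is the last part of the timeline."""
    times = pd.to_datetime(df[time_col])
    span = times.max() - times.min()
    cutoff = times.min() + span * (1 - test_fraction)
    train = df[times <= cutoff]
    test = df[times > cutoff]  # strictly later than anything in train
    return train, test
```

Because the cutoff is a point in time rather than a row index, no test row can be older than a training row, which is the property the leakage test checks.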
pip install -r requirements.txt
python main.py # interactive menu

Or drive each piece directly:

python data.py # (nothing — functions only; use main.py option I)
python train.py # generates data if missing, trains, saves best model
python predict.py # asks you for a station id, prints a prediction
pytest -q # runs the tests

After training, launch the API server:

python main.py # pick option VI
# or directly:
uvicorn api.app:app --reload

Then open http://localhost:8000 in your browser. Pick a station from the dropdown, click Predict, and see the current bike count next to the 15-minute prediction, with a visual capacity bar.
Endpoints:
- GET /: the web demo
- GET /stations: list all stations with metadata (type, capacity, coordinates)
- POST /predict: send {"station_id": 5}, get back bikes now + predicted in 15 min
- GET /health: server health check
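If you would rather call the API from a script than from the browser, something like the following should work with only the standard library. The helper names are hypothetical, the server is assumed to be running locally on port 8000, and the response is returned as parsed JSON without assuming any particular field names.

```python
import json
import urllib.request


def build_predict_request(station_id: int,
                          base_url: str = "http://localhost:8000"):
    """Build the POST /predict request without sending it."""
    body = json.dumps({"station_id": station_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(station_id: int, base_url: str = "http://localhost:8000"):
    """Send the request and return whatever JSON the server answers with."""
    request = build_predict_request(station_id, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, `print(predict(5))` shows the raw response for station 5.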
- Build a Streamlit dashboard showing the live map with predicted availability per station. (Replaced with FastAPI + web demo!)
- Expand the horizon to a multi-step forecast (15 / 30 / 45 / 60 min) and plot error vs horizon.
- Use weather features (rain → bike demand crashes). AEMET, Spain's national weather agency, publishes free weather data through its open API.
- Deploy fetch_live.py on a cheap VM or a cron job and train on weeks of real Bicing history.
- Tune LightGBM with n_estimators=500 + early stopping once real data is available.
pytest -q

Covers: CSV generation, feature matrix correctness, no train/test time leakage, model artifacts are saved, best model beats the persistence baseline, and predictions stay inside [0, capacity].
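The last check, predictions staying inside [0, capacity], is usually enforced by post-processing the raw model output. A possible version is sketched below; `clip_to_capacity` is a hypothetical name, and rounding to whole bikes is an assumption that predict.py may or may not make.

```python
import numpy as np


def clip_to_capacity(raw_predictions, capacity):
    """Round to whole bikes and clamp each value into [0, capacity]."""
    rounded = np.rint(np.asarray(raw_predictions, dtype=float))
    # capacity can be a scalar or one value per station (broadcast by NumPy).
    return np.clip(rounded, 0, capacity).astype(int)


# A regressor can return negative or over-capacity values near the extremes.
clipped = clip_to_capacity([-1.2, 3.6, 25.0], [20, 20, 20])
```

Tree models rarely stray far outside the training range, but a linear model can, so clamping keeps all three models comparable.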
Built as part of my AI/ML portfolio. Feedback and issues welcome.