The below example uses data from the BigQuery Public Dataset on London Bike Share which contains millions of bike trips for analysis.
The London Bike Share dataset has two tables:
cycle_stations
- the basic information about where you can rent bikes (has lat/long)cycle_hire
- transactional inforamation on rentals (duration, starting location, ending location etc.)
Our goal is to create a map showing the station start and ending pair which has the highest average speed for rides over 30 minutes.
WITH staging AS (
SELECT
STRUCT(
start_stn.name,
ST_GEOGPOINT(start_stn.longitude, start_stn.latitude) AS point,
start_stn.docks_count,
start_stn.install_date
) AS starting,
STRUCT(
end_stn.name,
ST_GEOGPOINT(end_stn.longitude, end_stn.latitude) AS point,
end_stn.docks_count,
end_stn.install_date
) AS ending,
STRUCT(
rental_id,
bike_id,
duration, -- seconds
ST_DISTANCE(
ST_GEOGPOINT(start_stn.longitude, start_stn.latitude),
ST_GEOGPOINT(end_stn.longitude, end_stn.latitude)
) AS distance, -- meters
ST_MAKELINE(
ST_GEOGPOINT(start_stn.longitude, start_stn.latitude),
ST_GEOGPOINT(end_stn.longitude, end_stn.latitude)
) AS trip_line, -- straight line (for GeoViz)
start_date,
end_date
) AS bike
FROM `bigquery-public-data.london_bicycles.cycle_stations` AS start_stn
LEFT JOIN `bigquery-public-data.london_bicycles.cycle_hire` AS b
ON start_stn.id = b.start_station_id
LEFT JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS end_stn
ON end_stn.id = b.end_station_id
)
-- Find the fastest avg biking pace for rides over 30 mins
SELECT
starting.name AS starting_name,
ending.name AS ending_name,
ROUND(bike.distance/1000,2) distance_km,
ST_UNION_AGG(bike.trip_line) AS trip_line,
COUNT(bike.rental_id) AS total_trips,
ROUND(
AVG(
(bike.distance / 1000) -- meters --> km
/ (bike.duration / 60 / 60) -- seconds --> hours
)
,2)
AS avg_kmph
FROM staging
WHERE bike.duration > (30 * 60) -- at least 30 minutes = 1800 seconds
GROUP BY
starting.name,
ending.name,
bike.distance
HAVING total_trips > 100
ORDER BY avg_kmph DESC
LIMIT 100;
Notes to interpret the query:
- What's with the large
WITH
clause? Wouldn't it be better to store that as a table? Possibly but if you are filtering later on in your query (like Duration > 30), BigQuery can magically push those filters into theWITH
clause with automatic predicate pushdown. If you processed that data already into a table and your users are only using a small fraction of it you may be better of with using WITH clauses. - What are the ST_XXXXX functions? "ST" functions are spatial type functions which are commonly used in GIS applications like map calculations. BigQuery natively supports many GIS functions like
ST_GEOGPOINT
which converts a lat/long input into a mappable point andST_DISTANCE
that calculates the straight-line distance in meters between two map points. - What are
STRUCT()
functions? STRUCTs are containers for other fields -- think of them like pre-joined tables in a single table. STRUCTs themselves have names and their field values have names so a STRUCT field likebike.duration
reads just like a table alias. STRUCTs are commonly used for readability in very wide and denormalized schemas like the Google Analytics schema.
Below are the fastest (on average) bike station pairs
You can quickly connect your BigQuery project to BigQuery GeoViz and use a UI to build a map based on your geographic columns. Here we mapped the straight line path of the fastest bike commutes with the thicker lines being the faster riders.
For our next problem, we want to see how long bikes are idle at each station on averge before they are picked up by another rider. This could help avoid stock-outs for high-demand stations if we setup a program to incentivize bike rentals from slower turnover stations to highly demanded ones.
Questions
- Which 10 stations turn over bikes the quickest (lowest time at station)?
- Which 10 stations are the slowest at turning over bikes?
WITH staging AS (
SELECT
STRUCT(
start_stn.name,
ST_GEOGPOINT(start_stn.longitude, start_stn.latitude) AS point,
start_stn.docks_count,
start_stn.install_date
) AS starting,
STRUCT(
end_stn.name,
ST_GEOGPOINT(end_stn.longitude, end_stn.latitude) AS point,
end_stn.docks_count,
end_stn.install_date
) AS ending,
STRUCT(
rental_id,
bike_id,
duration, -- seconds
ST_DISTANCE(
ST_GEOGPOINT(start_stn.longitude, start_stn.latitude),
ST_GEOGPOINT(end_stn.longitude, end_stn.latitude)
) AS distance, -- meters
ST_MAKELINE(
ST_GEOGPOINT(start_stn.longitude, start_stn.latitude),
ST_GEOGPOINT(end_stn.longitude, end_stn.latitude)
) AS trip_line, -- straight line (for GeoViz)
start_date,
end_date
) AS bike
FROM `bigquery-public-data.london_bicycles.cycle_stations` AS start_stn
LEFT JOIN `bigquery-public-data.london_bicycles.cycle_hire` AS b
ON start_stn.id = b.start_station_id
LEFT JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS end_stn
ON end_stn.id = b.end_station_id
),
lag_end_date AS (
-- Find how long after one ride ends, another one begins (on average)
SELECT
starting.name,
starting.docks_count,
starting.install_date,
bike.bike_id,
LAG(bike.end_date) OVER (
PARTITION BY bike.bike_id
ORDER BY bike.start_date)
AS last_end_date,
bike.start_date,
bike.end_date
FROM staging
)
SELECT
name,
docks_count,
install_date,
COUNT(bike_id) AS total_trips,
ROUND(
AVG(
TIMESTAMP_DIFF(start_date, last_end_date, HOUR)
)
,2) AS time_at_station_hrs
FROM lag_end_date
GROUP BY
name,
docks_count,
install_date
HAVING total_trips > 0
ORDER BY time_at_station_hrs ASC -- fastest turnover first
LIMIT 10;
Notes to interpret the query:
- The tricky part of this query is getting the last bike's return time and then calculating the time until the next rental for that
bike_id
. If you're focused on the previous of something or the next something you want to use SQLLAG
orLEAD
functions. - Here we want the previous rental return date for a given bike which should alert you that you need to break apart your data into separate windows for analysis. It would do us no good to get the last return time for any bike at any time. That is why we want the
LAG(end_date)
over a window of all the samebike_id
ensuring that the return dates are in order from oldest to newest. That is what thePARTITION BY
analytical window function does in the code above. - Note: You could absolutely combine the
lag_end_date
common table expression into the final query by doing theTIMESTAMP_DIFF
and the analytical window function in one go. I broke them out here for readability.
Simply change the ORDER BY
to DESC
and now you get the highest idle time for bikes at these stations:
Lastly, let's use our new knowledge of window functions to create a list of bikes which are in the most need of maintenance.
WITH staging AS (
SELECT
STRUCT(
start_stn.name,
ST_GEOGPOINT(start_stn.longitude, start_stn.latitude) AS point,
start_stn.docks_count,
start_stn.install_date
) AS starting,
STRUCT(
end_stn.name,
ST_GEOGPOINT(end_stn.longitude, end_stn.latitude) AS point,
end_stn.docks_count,
end_stn.install_date
) AS ending,
STRUCT(
rental_id,
bike_id,
duration, -- seconds
ST_DISTANCE(
ST_GEOGPOINT(start_stn.longitude, start_stn.latitude),
ST_GEOGPOINT(end_stn.longitude, end_stn.latitude)
) AS distance, -- meters
ST_MAKELINE(
ST_GEOGPOINT(start_stn.longitude, start_stn.latitude),
ST_GEOGPOINT(end_stn.longitude, end_stn.latitude)
) AS trip_line, -- straight line (for GeoViz)
start_date,
end_date
) AS bike
FROM `bigquery-public-data.london_bicycles.cycle_stations` AS start_stn
LEFT JOIN `bigquery-public-data.london_bicycles.cycle_hire` AS b
ON start_stn.id = b.start_station_id
LEFT JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS end_stn
ON end_stn.id = b.end_station_id
)
-- Collect key stats for each bike on total usage
, maintenance_stats AS (
SELECT
bike.bike_id,
STRUCT(
RANK() OVER(
PARTITION BY bike.bike_id
ORDER BY bike.start_date
) AS current_trip_number,
SUM(bike.duration/60/60) OVER(
PARTITION BY bike.bike_id
ORDER BY bike.start_date
) AS cumulative_duration_hr,
SUM(bike.distance/1000) OVER(
PARTITION BY bike.bike_id
ORDER BY bike.start_date
) AS cumulative_distance_km,
bike.start_date,
bike.end_date
) AS stats
FROM staging
)
SELECT
-- High level summary
RANK() OVER(
ORDER BY MAX(stats.cumulative_distance_km) DESC
) AS most_distance_km_rank,
bike_id,
MAX(stats.cumulative_distance_km) AS distance_travelled,
-- Detail within array (show 10 most recent rides)
ARRAY_AGG(stats ORDER BY stats.end_date DESC LIMIT 10) AS maint_stats
FROM maintenance_stats
GROUP BY bike_id
ORDER BY most_distance_km_rank LIMIT 10
We used two levels of RANK()
functions to first rank each trip for a given bike and then we ranked the bikes themselves based on their overall distance travelled. The final table shows both aggregate metrics (comparing between bikes) and also per-bike per-ride metrics in an ARRAY
. BigQuery natively supports repeated fields (ARRAYs) so you can have both levels of granularity in a single table: