document Mosaic #1012

Merged
merged 6 commits into from Mar 7, 2024

Contributor

@Fil Fil commented Mar 7, 2024

related #1011
mosaic

cc: @jheer

@Fil Fil requested a review from mbostock March 7, 2024 09:28
Member

Can we use a data loader to make this, or at least include the script to make it alongside this cached output?

Contributor Author

Absolutely. I just didn't want to document a data loader as part of this page, which has enough matter already.

It's not a hugely complicated data loader, except that it needs to install the duckdb binary.

export TMPDIR="docs/.observablehq/cache"
export PATH=$TMPDIR:$PATH
duckdb :memory: << EOF
-- Load spatial extension
INSTALL spatial; LOAD spatial;

-- Project, following the example at https://github.com/duckdb/duckdb_spatial
CREATE TEMP TABLE rides AS SELECT
  pickup_datetime::TIMESTAMP AS datetime,
  ST_Transform(ST_Point(pickup_latitude, pickup_longitude), 'EPSG:4326', 'EPSG:32118') AS pick,
  ST_Transform(ST_Point(dropoff_latitude, dropoff_longitude), 'EPSG:4326', 'EPSG:32118') AS drop
FROM 'https://uwdata.github.io/mosaic-datasets/data/nyc-rides-2010.parquet';

-- Write output parquet file
COPY (SELECT
  HOUR(datetime) + MINUTE(datetime) / 60 AS time,
  ST_X(pick)::INTEGER AS px, -- extract pickup x-coord
  ST_Y(pick)::INTEGER AS py, -- extract pickup y-coord
  ST_X(drop)::INTEGER AS dx, -- extract dropoff x-coord
  ST_Y(drop)::INTEGER AS dy  -- extract dropoff y-coord
FROM rides
ORDER BY 2,3,4,5,1 -- optimize output size by sorting
) TO '$TMPDIR/trips.parquet' (COMPRESSION 'ZSTD', row_group_size 10000000);
EOF

cat $TMPDIR/trips.parquet >&1  # Write output to stdout
rm $TMPDIR/trips.parquet       # Clean up

Member

@mbostock mbostock Mar 7, 2024

I think having the data loader checked in, even if non-functional and undocumented, is a big help!


This looks great. Thanks for including it! I agree that the data loader is an important component of the example, even if non-functional in this particular deployment.

Contributor Author

@Fil Fil Mar 7, 2024

I've checked in a simple version that does not try to work in CI — with the expectation that duckdb is on the $PATH.

Note: it's not active, only here for reference. To make this data loader work in CI, you have to install the proper duckdb binary on the $PATH, and I'd recommend using a dedicated $TMPDIR rather than writing to the root folder.
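
A minimal sketch of those two recommendations, assuming a POSIX sh. The helper names require_cmd and make_scratch_dir are hypothetical and are not part of the checked-in loader; the commented usage at the end only suggests how they might wrap the duckdb call.

```shell
#!/bin/sh
# Hypothetical helpers (illustrative names, not from the PR).

# Succeed if the named command is on $PATH; otherwise print a hint and fail.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "error: $1 not found on \$PATH" >&2
  return 1
}

# Create a dedicated scratch directory and print its path.
make_scratch_dir() {
  mktemp -d "${TMPDIR:-/tmp}/loader.XXXXXX"
}

# Usage sketch (left commented out so the block runs without duckdb installed):
#   require_cmd duckdb || exit 1
#   scratch=$(make_scratch_dir) || exit 1
#   trap 'rm -rf "$scratch"' EXIT      # clean up even if duckdb fails
#   ... run duckdb, writing to "$scratch/trips.parquet" ...
#   cat "$scratch/trips.parquet"       # data loaders emit their result on stdout
```

Cleaning up with a trap rather than a trailing rm means a failed query does not leave a partial file behind.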
@@ -0,0 +1,25 @@
duckdb :memory: << EOF
Member

@mbostock mbostock Mar 7, 2024

Should this have a shebang? One of these probably…

Suggested change
- duckdb :memory: << EOF
+ #!/usr/bin/env bash
+ duckdb :memory: << EOF
Suggested change
- duckdb :memory: << EOF
+ #!/bin/sh
+ duckdb :memory: << EOF

Contributor Author

I don't think so; shebangs are only used for executables?

Member

@mbostock mbostock Mar 7, 2024

Ah, so even if you have a shebang, it’s ignored because we explicitly run it with sh? If it’s not ignored, we should add one just to document our expectations around how the script is interpreted.

Contributor Author

Yes, sh ignores the shebang (to the shell it is just a comment) and interprets the rest of the file directly.

❯ chmod +x ./test.py
❯ cat ./test.py
#! /usr/bin/env python3
print("hello, world")
❯ bash ./test.py
./test.py: line 2: syntax error near unexpected token `"hello, world"'
./test.py: line 2: `print("hello, world")'
❯ ./test.py
hello, world
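
The same point can be sketched from the other direction, assuming a POSIX sh: a script whose shebang names an interpreter that does not exist still runs when passed to sh explicitly, because the first line starts with # and is parsed as a comment. The path /no/such/interpreter and the temporary file name are made up for the illustration.

```shell
# Write a throwaway script whose shebang points at a nonexistent interpreter.
script=$(mktemp "${TMPDIR:-/tmp}/shebang.XXXXXX")
printf '%s\n' '#!/no/such/interpreter' 'echo "hello from sh"' > "$script"

# Run it explicitly with sh: the shebang line is treated as a comment,
# so only the echo is executed.
out=$(sh "$script")
echo "$out"   # prints: hello from sh

rm -f "$script"
```

Executing the file directly (./script) would instead ask the kernel to launch the interpreter named in the shebang, which is the case where the line matters.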

Member

Thanks for educating me. 😄

Co-authored-by: Mike Bostock <mbostock@gmail.com>
@Fil Fil changed the base branch from main to mbostock/vgplot March 7, 2024 21:00
Member

mbostock commented Mar 7, 2024

Waiting for #1015 to land and then we’ll update this.

@mbostock mbostock merged commit 44d21dc into mbostock/vgplot Mar 7, 2024
@mbostock mbostock deleted the fil/document-mosaic branch March 7, 2024 22:32

3 participants