# Spatial Joins Exercises

Here's a reminder of some of the functions we have seen. Hint: they
should be useful for the exercises!

-   `sum(expression)`: aggregate to
    return a sum for a set of records
-   `count(expression)`: aggregate to
    return the size of a set of records
-   `ST_Area(geometry)` returns the
    area of the polygons
-   `ST_AsText(geometry)` returns WKT `text`
-   `ST_Contains(geometry A, geometry B)` returns true if geometry A contains geometry B
-   `ST_Distance(geometry A, geometry B)` returns the minimum distance between geometry A and
    geometry B
-   `ST_DWithin(geometry A, geometry B, radius)` returns true if geometry A is radius distance or less from geometry B
-   `ST_GeomFromText(text)` returns `geometry`
-   `ST_Intersects(geometry A, geometry B)` returns true if geometry A intersects geometry B
-   `ST_Length(linestring)` returns the length of the linestring
-   `ST_Touches(geometry A, geometry B)` returns true if the boundary of geometry A touches geometry B
-   `ST_Within(geometry A, geometry B)` returns true if geometry A is within geometry B


If needed, install the required packages (`duckdb` and `leafmap`) before running the cells below.


In [11]:
import leafmap
import duckdb

# Download and unzip the ZIP file
url = "https://open.gishub.org/data/duckdb/nyc_data.zip"
leafmap.download_file(url, unzip=True, overwrite=True)

# Connect to DuckDB and enable httpfs
con = duckdb.connect("nyc_data.db")
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.install_extension("spatial")
con.load_extension("spatial")

con.sql("SHOW TABLES;")


Downloading...
From: https://open.gishub.org/data/duckdb/nyc_data.zip
To: /content/nyc_data.zip
100%|██████████| 8.73M/8.73M [00:00<00:00, 62.8MB/s]


Extracting files...


┌─────────┐
│  name   │
│ varchar │
├─────────┤
│ 0 rows  │
└─────────┘

In [21]:
import os

extract_path = "/content/nyc_data"

# Check whether the path exists
if os.path.exists(extract_path):
    if os.path.isfile(extract_path):  # the path is a file
        print("⚠️ `/content/nyc_data` is already a file; deleting it first")
        os.remove(extract_path)  # delete the file
    elif os.path.isdir(extract_path):  # the path is a directory
        print("✅ `/content/nyc_data` is already a folder")
else:
    print("🔍 `/content/nyc_data` does not exist; safe to extract")

⚠️ `/content/nyc_data` is already a file; deleting it first


In [22]:
import zipfile

zip_path = "/content/nyc_data.zip"
extract_path = "/content/nyc_data"

# Re-extract the ZIP file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

# Inspect the extracted contents
print("📂 Files after extraction:", os.listdir(extract_path))

📂 Files after extraction: ['nyc_census_blocks.shp', 'nyc_subway_stations.prj', 'nyc_subway_stations.shx', 'nyc_neighborhoods.shp', 'nyc_streets.prj', 'nyc_census_blocks.dbf', 'nyc_census_blocks.shx', 'nyc_census_blocks.prj', 'nyc_homicides.prj', 'nyc_neighborhoods.dbf', 'nyc_census_sociodata.sql', 'nyc_homicides.shp', 'nyc_streets.shp', 'nyc_subway_stations.dbf', 'nyc_streets.dbf', 'nyc_neighborhoods.prj', 'nyc_homicides.shx', 'nyc_subway_stations.shp', 'nyc_homicides.dbf', 'nyc_neighborhoods.shx', 'nyc_streets.shx', 'README.txt']


In [18]:
# Read the shapefile from the local extraction directory; the remote URL
# https://open.gishub.org/data/duckdb/nyc_data/nyc_neighborhoods.shp returns 404 (Not Found).
shp_file = "/content/nyc_data/nyc_neighborhoods.shp"

con.execute(f"""
    CREATE TABLE nyc_neighborhoods AS
    SELECT * FROM ST_Read('{shp_file}');
""")

print(con.execute("SHOW TABLES;").fetchdf())
print(con.execute("SELECT * FROM nyc_neighborhoods LIMIT 5").fetchdf())

Download the [nyc_data.zip](https://github.com/opengeos/data/raw/main/duckdb/nyc_data.zip) dataset using leafmap. The zip file contains the following datasets. Create a new DuckDB database and import the datasets into the database. Each dataset should be imported into a separate table.

- nyc_census_blocks
- nyc_homicides
- nyc_neighborhoods
- nyc_streets
- nyc_subway_stations
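
Because the five shapefiles share one naming pattern, the import can be written as a loop. The following is a minimal sketch, assuming `con` is an open DuckDB connection with the `spatial` extension loaded and that the archive was extracted to a directory such as `/content/nyc_data`. `import_shapefiles` is a hypothetical helper name, not part of DuckDB or leafmap.

```python
# Table names taken from the dataset list above; each one matches a .shp file.
TABLES = [
    "nyc_census_blocks",
    "nyc_homicides",
    "nyc_neighborhoods",
    "nyc_streets",
    "nyc_subway_stations",
]


def import_shapefiles(con, data_dir, tables=TABLES):
    """Create one DuckDB table per shapefile found in data_dir.

    `con` is assumed to be a DuckDB connection with `spatial` already loaded.
    """
    base = data_dir.rstrip("/")
    for table in tables:
        shp_path = f"{base}/{table}.shp"
        # ST_Read goes through GDAL and exposes the geometry as a `geom` column.
        con.execute(
            f"CREATE OR REPLACE TABLE {table} AS "
            f"SELECT * FROM ST_Read('{shp_path}');"
        )
    return list(tables)
```

In the notebook this would be called as `import_shapefiles(con, "/content/nyc_data")`, followed by `con.sql("SHOW TABLES;")` to confirm that all five tables exist. `CREATE OR REPLACE` is used so the cell can be re-run without a "table already exists" error.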

1. **What subway station is in 'Little Italy'? What subway route is it on?**

In [27]:
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

shp_file = "/content/nyc_data/nyc_subway_stations.shp"

con.execute(f"""
    CREATE TABLE nyc_subway_stations AS
    SELECT * FROM ST_Read('{shp_file}');
""")


<duckdb.duckdb.DuckDBPyConnection at 0x7f3ef71c45b0>

In [29]:
con.sql("select name, st_astext(geom) from nyc_subway_stations where borough = 'Little Italy'")

┌─────────┬─────────────────┐
│  NAME   │ st_astext(geom) │
│ varchar │     varchar     │
├─────────┴─────────────────┤
│          0 rows           │
└───────────────────────────┘
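
The filter on `borough` returns no rows, which is consistent with 'Little Italy' being a neighborhood name stored in `nyc_neighborhoods` rather than a borough value in `nyc_subway_stations`. One way to approach the question is a spatial join between the two tables. The sketch below assumes both tables have been imported with `ST_Read` (so the geometry column is `geom`) and that `nyc_neighborhoods` has a `name` column, as in the PostGIS workshop version of this data. `stations_in_neighborhood` is a hypothetical helper name.

```python
def stations_in_neighborhood(con, neighborhood):
    """Return (station name, routes) rows for stations inside a neighborhood.

    `con` is assumed to be a DuckDB connection with `spatial` loaded and the
    `nyc_subway_stations` and `nyc_neighborhoods` tables already created.
    """
    query = """
        SELECT s.name, s.routes
        FROM nyc_subway_stations AS s
        JOIN nyc_neighborhoods AS n
          ON ST_Contains(n.geom, s.geom)  -- neighborhood polygon contains station point
        WHERE n.name = ?;
    """
    # The neighborhood name is passed as a bound parameter rather than
    # interpolated into the SQL string.
    return con.execute(query, [neighborhood]).fetchall()
```

Calling `stations_in_neighborhood(con, "Little Italy")` should then return the station and its routes, which can be compared with the linked answer sheet.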

2. **What are all the neighborhoods served by the 6-train?** (Hint: The `routes` column in the `nyc_subway_stations` table has values like 'B,D,6,V' and 'C,6')
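
The hint implies that `routes` is a comma-separated string, so the filter has to test membership rather than equality. As a rough illustration, here are the two usual predicates written in plain Python, with the DuckDB expressions they mirror noted in the comments. The function names are hypothetical, and the value `'A,6X'` in the usage note is made up to show the difference and is not taken from the data.

```python
def mentions_route(routes, route):
    """Substring test; mirrors `strpos(routes, '6') > 0` in SQL."""
    return route in routes


def serves_route(routes, route):
    """Token test; mirrors `list_contains(string_split(routes, ','), '6')`,
    with whitespace around each token trimmed as an extra precaution."""
    tokens = [token.strip() for token in routes.split(",")]
    return route in tokens
```

Both predicates agree on the values from the hint (`'B,D,6,V'` and `'C,6'`). They differ only when a route label merely contains the character, such as the made-up `'A,6X'`, where the substring test matches and the token test does not. Either predicate can go in the `WHERE` clause of a join between stations and neighborhoods.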


3. **After 9/11, the 'Battery Park' neighborhood was off limits for several days. How many people had to be evacuated?**

4. **What neighborhood has the highest population density (persons/km²)?**
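
For the density question, the unit conversion is the step that is easy to get wrong. `ST_Area` returns square units of the layer's coordinate system, so the sketch below assumes the geometries are in a metre-based projected CRS (the PostGIS workshop distributes this data in UTM zone 18N), which makes the area square metres. `density_per_km2` is a hypothetical helper that only illustrates the arithmetic.

```python
M2_PER_KM2 = 1_000_000  # 1 km² = 1000 m x 1000 m


def density_per_km2(population, area_m2):
    """Persons per square kilometre, given an area in square metres."""
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return population / (area_m2 / M2_PER_KM2)


# The same arithmetic in SQL would look roughly like
#   sum(<population column>) / (ST_Area(n.geom) / 1000000.0)
# grouped by neighborhood, where the population column comes from the
# census blocks table (named `popn_total` in the PostGIS workshop data).
```

For example, `density_per_km2(5000, 2_000_000)` gives `2500.0`: 5,000 people over 2 km².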


When you're finished, you can check your answers [here](https://postgis.net/workshops/postgis-intro/joins_exercises.html).

# Ship-to-Ship Transfer Detection

Now for a less structured exercise. We're going to look at ship-to-ship transfers. The idea is that two ships meet up in the middle of the ocean, and one ship transfers cargo to the other. This is a common way to avoid sanctions, and is often used to transfer oil from sanctioned countries to other countries. We're going to look at a few different ways to detect these transfers using AIS data.

In [None]:
%pip install duckdb duckdb-engine jupysql

In [None]:
import duckdb
import pandas as pd

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
%sql duckdb:///:memory:

In [None]:
%%sql
INSTALL httpfs;
LOAD httpfs;
INSTALL spatial;
LOAD spatial;

## Step 1

Create a spatial database using the following AIS data:

https://storage.googleapis.com/qm2/casa0025_ships.csv

Each row in this dataset is an AIS 'ping' indicating the position of a ship at a particular date/time, alongside vessel-level characteristics.

It contains the following columns:
* `vesselid`: A unique numerical identifier for each ship, like a license plate
* `vessel_name`: The ship's name
* `vsl_descr`: The ship's type
* `dwt`: The ship's Deadweight Tonnage (how many tons it can carry)
* `v_length`: The ship's length in meters
* `draught`: How many meters deep the ship is draughting (how low it sits in the water). Effectively indicates how much cargo the ship is carrying
* `sog`: Speed over Ground (in knots)
* `date`: A timestamp for the AIS signal
* `lat`: The latitude of the AIS signal (EPSG:4326)
* `lon`: The longitude of the AIS signal (EPSG:4326)

Create a table called 'ais' where each row is a different AIS ping, with no superfluous information. Construct a geometry column.

Create a second table called 'vinfo' which contains vessel-level information with no superfluous information.
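
One possible layout for the two tables is sketched below. It assumes a DuckDB connection `con` with `httpfs` and `spatial` loaded (the notebook itself uses `%%sql` cells, so the same statements could be pasted there instead). The staging table `raw_ais` and the helper name `build_ais_tables` are hypothetical. Keeping `draught` and `sog` in `ais` is a judgment call based on the column descriptions, since both can change from ping to ping.

```python
AIS_URL = "https://storage.googleapis.com/qm2/casa0025_ships.csv"


def build_ais_tables(con, csv_url=AIS_URL):
    """Split the raw CSV into a ping-level table and a vessel-level table."""
    # Stage the CSV once so it is not downloaded twice.
    con.execute(
        f"CREATE OR REPLACE TABLE raw_ais AS "
        f"SELECT * FROM read_csv_auto('{csv_url}');"
    )
    # Ping-level table: one row per AIS signal, with a point geometry.
    # ST_Point takes (x, y), i.e. (longitude, latitude).
    con.execute("""
        CREATE OR REPLACE TABLE ais AS
        SELECT vesselid, "date", sog, draught, ST_Point(lon, lat) AS geom
        FROM raw_ais;
    """)
    # Vessel-level table: one row per distinct set of ship attributes.
    con.execute("""
        CREATE OR REPLACE TABLE vinfo AS
        SELECT DISTINCT vesselid, vessel_name, vsl_descr, dwt, v_length
        FROM raw_ais;
    """)
    con.execute("DROP TABLE raw_ais;")
```

After running it, a quick check such as `SELECT count(*) FROM vinfo` against `SELECT count(DISTINCT vesselid) FROM ais` shows whether any vessel has more than one set of attributes.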

## Step 2

Use a spatial join to identify ship-to-ship transfers in this dataset.
Two ships are considered to be conducting a ship-to-ship transfer if all of the following hold:

* They are within 500 meters of each other
* They stay that close for more than two hours
* Both are moving at less than 1 knot

Some things to consider: make sure you're not joining ships with themselves. Try working with subsets of the data first while you try different things out.
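
Because `lat` and `lon` are in EPSG:4326, planar distance functions applied to the raw points measure degrees, not meters, so the 500-meter rule needs either a projected copy of the geometry or a great-circle distance. The plain-Python sketch below spells out the three criteria on individual pings so the logic can be checked on a small subset before it is written as a SQL self-join. All names are hypothetical, the Earth is treated as a sphere, and the two pings are assumed to have already been matched to the same time window.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def is_close_and_slow(ping_a, ping_b, max_dist_m=500.0, max_sog=1.0):
    """True if two pings from different ships meet the distance and speed rules."""
    if ping_a["vesselid"] == ping_b["vesselid"]:
        return False  # never pair a ship with itself
    if ping_a["sog"] >= max_sog or ping_b["sog"] >= max_sog:
        return False  # both ships must be moving at less than max_sog knots
    dist = haversine_m(ping_a["lat"], ping_a["lon"], ping_b["lat"], ping_b["lon"])
    return dist <= max_dist_m


def contact_duration(timestamps):
    """Time between the first and last qualifying ping for one pair of ships.

    This ignores gaps in between, so it is an upper bound on continuous contact.
    """
    if not timestamps:
        return timedelta(0)
    return max(timestamps) - min(timestamps)
```

A pair of ships would then be flagged when `contact_duration(...)` over its qualifying pings exceeds `timedelta(hours=2)`. In SQL the same idea becomes a self-join of `ais` on a shared time bucket with `a.vesselid < b.vesselid`, which both avoids self-matches and keeps each pair only once.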