This notebook uses `duckdb` to illustrate a window function.

## Task

I want to group stars by HEALPix region and find, for each star, its rank by magnitude within that region (1 for the brightest, 2 for the second brightest, and so on). Smaller magnitudes mean brighter stars, so ranking by ascending magnitude puts the brightest star first.
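To make the target concrete, here is a minimal pure-Python sketch of the same computation on a handful of made-up stars. The tuples and the `rank_within_groups` helper are illustrative only and are not part of the Gaia data or any library: group by region, sort each group by ascending magnitude, and number the stars from 1.

```python
from collections import defaultdict


def rank_within_groups(stars):
    """Return {star_id: rank}, where rank 1 is the brightest star in its region.

    `stars` is an iterable of (star_id, healpix, magnitude) tuples.
    """
    groups = defaultdict(list)
    for star_id, healpix, mag in stars:
        groups[healpix].append((mag, star_id))

    ranks = {}
    for members in groups.values():
        # Smaller magnitude = brighter, so an ascending sort puts the brightest first.
        for position, (_, star_id) in enumerate(sorted(members), start=1):
            ranks[star_id] = position
    return ranks


# Made-up example: two regions, five stars.
toy_stars = [
    ("a", 10, 18.2),
    ("b", 10, 15.1),
    ("c", 11, 20.4),
    ("d", 10, 19.9),
    ("e", 11, 17.0),
]
ranks = rank_within_groups(toy_stars)
print(ranks)
```

The SQL window function used in the methods further down does the same thing declaratively: `PARTITION BY` plays the role of the grouping, and `ORDER BY` inside `OVER (...)` plays the role of the sort.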

## Get data

I fetch the data from the Gaia DR3 catalogue, then convert the result to a `pandas` DataFrame.

In [1]:
%%time
from astroquery.gaia import Gaia

query_string = """SELECT random_index, phot_g_mean_mag, GAIA_HEALPIX_INDEX(4, source_id) as healpix
  FROM gaiadr3.gaia_source 
  WHERE random_index < 1000000 
  ORDER BY random_index"""

job = Gaia.launch_job_async(query=query_string, verbose=False)
gaia_data = job.get_results()
gaia_df = gaia_data.to_pandas()

INFO: Query finished. [astroquery.utils.tap.core]
CPU times: user 6.48 s, sys: 345 ms, total: 6.82 s
Wall time: 39.5 s


In [3]:
gaia_df

        random_index  phot_g_mean_mag  healpix
0                  0        15.244129   1895.0
1                  1        20.906347   2332.0
2                  2        20.531225    860.0
3                  3        20.145899   2651.0
4                  4        19.787357   2643.0
...              ...              ...      ...
999995        999995        20.033766   2682.0
999996        999996        19.326818   2068.0
999997        999997        18.571827    628.0
999998        999998        20.949343    801.0


In [2]:
import duckdb

### Method 1 - window function in the main query

In [15]:
%%time
duckdb.query(
    """SELECT
    healpix,
    phot_g_mean_mag,
    random_index,
    ROW_NUMBER() OVER (PARTITION BY healpix ORDER BY phot_g_mean_mag ASC) AS mag_rank
FROM gaia_df
ORDER BY random_index;"""
)

CPU times: user 2.44 ms, sys: 502 μs, total: 2.95 ms
Wall time: 2.23 ms


┌─────────┬─────────────────┬──────────────┬──────────┐
│ healpix │ phot_g_mean_mag │ random_index │ mag_rank │
│ double  │      float      │    int64     │  int64   │
├─────────┼─────────────────┼──────────────┼──────────┤
│  1895.0 │       15.244129 │            0 │       33 │
│  2332.0 │       20.906347 │            1 │     1769 │
│   860.0 │       20.531225 │            2 │      122 │
│  2651.0 │       20.145899 │            3 │     1579 │
│  2643.0 │       19.787357 │            4 │     1454 │
│  2670.0 │        21.05777 │            5 │      416 │
│  2720.0 │       19.050842 │            6 │      221 │
│  2672.0 │        20.33379 │            7 │      933 │
│  2310.0 │       18.942072 │            8 │       77 │
│  1848.0 │       20.702047 │            9 │     1156 │
│     ·   │           ·     │            · │       ·  │
│     ·   │           ·     │            · │       ·  │
│     ·   │           ·     │            · │       ·  │
│  2070.0 │       18.848368 │         9990 │    

### Method 2 - using a Common Table Expression

In [16]:
%%time
duckdb.query(
    """WITH ranked_stars AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY healpix ORDER BY phot_g_mean_mag ASC) AS mag_rank
    FROM gaia_df
)
SELECT healpix, phot_g_mean_mag, random_index, mag_rank
FROM ranked_stars
ORDER BY random_index;"""
)

CPU times: user 2.46 ms, sys: 1.11 ms, total: 3.56 ms
Wall time: 2.63 ms


┌─────────┬─────────────────┬──────────────┬──────────┐
│ healpix │ phot_g_mean_mag │ random_index │ mag_rank │
│ double  │      float      │    int64     │  int64   │
├─────────┼─────────────────┼──────────────┼──────────┤
│  1895.0 │       15.244129 │            0 │       33 │
│  2332.0 │       20.906347 │            1 │     1769 │
│   860.0 │       20.531225 │            2 │      122 │
│  2651.0 │       20.145899 │            3 │     1579 │
│  2643.0 │       19.787357 │            4 │     1454 │
│  2670.0 │        21.05777 │            5 │      416 │
│  2720.0 │       19.050842 │            6 │      221 │
│  2672.0 │        20.33379 │            7 │      933 │
│  2310.0 │       18.942072 │            8 │       77 │
│  1848.0 │       20.702047 │            9 │     1156 │
│     ·   │           ·     │            · │       ·  │
│     ·   │           ·     │            · │       ·  │
│     ·   │           ·     │            · │       ·  │
│  2070.0 │       18.848368 │         9990 │    

### Method 3 - subquery

In [17]:
%%time
duckdb.query(
    """SELECT
    r.healpix,
    r.phot_g_mean_mag,
    r.random_index,
    r.mag_rank
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY healpix ORDER BY phot_g_mean_mag ASC) AS mag_rank
    FROM gaia_df
) r
ORDER BY r.random_index;"""
)

CPU times: user 2.68 ms, sys: 858 μs, total: 3.54 ms
Wall time: 2.6 ms


┌─────────┬─────────────────┬──────────────┬──────────┐
│ healpix │ phot_g_mean_mag │ random_index │ mag_rank │
│ double  │      float      │    int64     │  int64   │
├─────────┼─────────────────┼──────────────┼──────────┤
│  1895.0 │       15.244129 │            0 │       33 │
│  2332.0 │       20.906347 │            1 │     1769 │
│   860.0 │       20.531225 │            2 │      122 │
│  2651.0 │       20.145899 │            3 │     1579 │
│  2643.0 │       19.787357 │            4 │     1454 │
│  2670.0 │        21.05777 │            5 │      416 │
│  2720.0 │       19.050842 │            6 │      221 │
│  2672.0 │        20.33379 │            7 │      933 │
│  2310.0 │       18.942072 │            8 │       77 │
│  1848.0 │       20.702047 │            9 │     1156 │
│     ·   │           ·     │            · │       ·  │
│     ·   │           ·     │            · │       ·  │
│     ·   │           ·     │            · │       ·  │
│  2070.0 │       18.848368 │         9990 │    