# Raster and geoparquet

## 1. Raster Data

Raster data is a `type of geospatial data model`. It represents the world as `a grid of cells (pixels)`, each cell containing a value.

Typical uses of raster data:
- satellite imagery,
- digital elevation models (DEM),
- climate data,
- land cover maps.

Each pixel may store:
- A continuous value (elevation, temperature, rainfall).
- A discrete class (land cover type: forest, water, urban).

Example: A 1000×1000 raster with 30 m resolution covers 30 km × 30 km of terrain.
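
The arithmetic behind this example can be sketched in a few lines of Python. The helper name `raster_extent_km` is made up for illustration, and it assumes square pixels and a north-up grid.

```python
from typing import Tuple


def raster_extent_km(n_cols: int, n_rows: int, resolution_m: float) -> Tuple[float, float]:
    """Return the ground extent (width_km, height_km) covered by a raster.

    Assumes square pixels of `resolution_m` meters and no rotation.
    """
    width_km = n_cols * resolution_m / 1000.0
    height_km = n_rows * resolution_m / 1000.0
    return width_km, height_km


# the example from the text: 1000 x 1000 pixels at 30 m resolution
print(raster_extent_km(1000, 1000, 30.0))  # (30.0, 30.0)
```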

Raster data can be stored in various file formats:
- NetCDF (Network Common Data Form): climate/weather, remote sensing
- GeoTIFF: GIS maps. It is the standard in GIS: flexible, supports compression, multiple bands, and large files.
- ASCII Grid (ESRI .asc)
- HDF (Hierarchical Data Format, HDF4 / HDF5): climate/weather, remote sensing
- etc.

### 1.1 NetCDF (Network Common Data Form)

**NetCDF** is a file format and data model that stores `multi-dimensional arrays` (e.g., latitude × longitude × time × depth).

It has the following advantages:
- Self-describing: contains metadata describing variables, units, coordinate systems.
- Binary: efficient for large spatiotemporal datasets (like daily temperature over decades).

It's commonly used in atmospheric reanalysis, global climate models, and remote sensing time series.

A NetCDF file defines:

- dimensions: they define the shape of the data arrays.
- variables: they store the actual data.
- coordinates: 1D or 2D variables that define position.
- attributes: they store metadata at the file or variable level.
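
To make these four building blocks concrete, here is a plain-Python sketch of such a layout. It uses ordinary dictionaries and does not read any file. The dimension sizes are borrowed from the sample file used later in this tutorial, while the attribute values and helper names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# dimensions: name -> length
dimensions: Dict[str, int] = {"time": 24, "lat": 170, "lon": 180}

# variables: name -> the dimensions that index the variable
variables: Dict[str, Tuple[str, ...]] = {
    "time": ("time",),
    "lat": ("lat",),
    "lon": ("lon",),
    "tos": ("time", "lat", "lon"),
}

# attributes: metadata per variable (values here are illustrative only)
attributes: Dict[str, Dict[str, str]] = {
    "lat": {"units": "degrees_north"},
    "lon": {"units": "degrees_east"},
    "tos": {"long_name": "sea surface temperature"},
}


def shape_of(name: str) -> Tuple[int, ...]:
    """Shape of a variable, derived from the dimensions it is defined on."""
    return tuple(dimensions[d] for d in variables[name])


def coordinate_variables() -> List[str]:
    """1D variables that carry the same name as their single dimension."""
    return [name for name, dims in variables.items() if dims == (name,)]


print(shape_of("tos"))          # (24, 170, 180)
print(coordinate_variables())   # ['time', 'lat', 'lon']
```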

### 1.2 GeoTIFF

**GeoTIFF** is a `georeferenced raster image format` widely used in `geographic information systems (GIS)` to store spatial data such as satellite imagery, aerial photos, digital elevation models, and other gridded data.
A GeoTIFF is a regular TIFF file, so standard TIFF viewers can open it, extended with `geographic tags (e.g. coordinate system, projection, geotransform)` defined by the GeoTIFF specification.

It can contain multiple bands (e.g., RGB, multispectral, or elevation).
Almost all GIS and remote sensing software can read and write it, including `GIS tools such as QGIS, ArcGIS, and GDAL`.

Example: a Landsat scene distributed as GeoTIFF files, each band in a separate file.

### 1.3 ASCII Grid (ESRI .asc)

`ASCII Grid` is a simple text-based raster format.
It's easy to read but `inefficient for large datasets`.
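
Because the format is plain text, a reader fits in a few lines. The sketch below parses the usual header keys (`ncols`, `nrows`, `xllcorner`, `yllcorner`, `cellsize`, `NODATA_value`) followed by the cell values. The function names and the tiny sample grid are made up for illustration, and the parser covers only this basic layout.

```python
from typing import Dict, List, Tuple

HEADER_KEYS = {"ncols", "nrows", "xllcorner", "yllcorner",
               "xllcenter", "yllcenter", "cellsize", "nodata_value"}


def parse_ascii_grid(text: str) -> Tuple[Dict[str, float], List[List[float]]]:
    """Parse an ASCII grid into (header, rows); rows are ordered north to south."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header: Dict[str, float] = {}
    idx = 0
    # header lines are "key value" pairs at the top of the file
    while idx < len(lines) and lines[idx][0].lower() in HEADER_KEYS:
        header[lines[idx][0].lower()] = float(lines[idx][1])
        idx += 1
    values = [float(tok) for line in lines[idx:] for tok in line]
    ncols, nrows = int(header["ncols"]), int(header["nrows"])
    if len(values) != ncols * nrows:
        raise ValueError("cell count does not match ncols * nrows")
    rows = [values[r * ncols:(r + 1) * ncols] for r in range(nrows)]
    return header, rows


def grid_bounds(header: Dict[str, float]) -> Tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax), assuming xllcorner/yllcorner are given."""
    xmin, ymin = header["xllcorner"], header["yllcorner"]
    xmax = xmin + header["ncols"] * header["cellsize"]
    ymax = ymin + header["nrows"] * header["cellsize"]
    return xmin, ymin, xmax, ymax


# a made-up 3 x 2 grid, not the tutorial's sample file
SAMPLE = """ncols 3
nrows 2
xllcorner 100.0
yllcorner 200.0
cellsize 10.0
NODATA_value -9999
1 2 3
4 -9999 6
"""

header, rows = parse_ascii_grid(SAMPLE)
print(rows)                 # [[1.0, 2.0, 3.0], [4.0, -9999.0, 6.0]]
print(grid_bounds(header))  # (100.0, 200.0, 130.0, 220.0)
```

Every value is stored as text and parsed one token at a time, which is the main reason the format scales poorly to large rasters.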

### 1.4 HDF (Hierarchical Data Format, HDF4 / HDF5)

**HDF** is similar to `NetCDF`, designed for storing large scientific datasets.

> NASA distributes MODIS satellite products in HDF.

### 1.5 Data source in this tutorial

The `NetCDF` sample data (sea_surface_temperature_O1_2001-2002.nc) is from https://www.unidata.ucar.edu/software/netcdf/examples/files.html. It describes sea surface temperatures collected by `PCMDI` for use by the `IPCC`.

The `GeoTIFF` sample data (PlanetDEM_3s_SanFrancisco.tif) is from https://www.planetobserver.com/geospatial-data-samples. It is an elevation dataset covering the San Francisco area, USA.

## 2. Use sedona to read raster data

In this tutorial, we will use sedona to read various raster data formats, and try some raster operations.

The full API documentation is available [here](https://sedona.apache.org/latest/api/sql/Raster-operators/).

In [1]:
from sedona.spark import *
from pathlib import Path
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import expr

In [2]:
import os
os.environ["PYSPARK_PYTHON"]="python"
os.environ["PYSPARK_DRIVER_PYTHON"]="python"

In [3]:
# build a sedona session offline
project_root_dir = Path.cwd().parent

print(project_root_dir.as_posix())

C:/Users/pliu/Documents/git/Webinaire_CASD_GeoParquet_Sedona


In [5]:
# here we choose sedona 1.8.0, built for spark 3.5.* with scala 2.12
sedona_version = "sedona-35-212-180"
jar_folder = Path(f"{project_root_dir}/jars/{sedona_version}")
jar_list = [str(jar) for jar in jar_folder.iterdir() if jar.is_file()]
jar_path = ",".join(jar_list)

# build a sedona session (sedona = 1.8.0) offline
spark = SparkSession.builder \
    .appName("sedona_tutorial") \
    .master("local[*]") \
    .config("spark.jars", jar_path) \
    .getOrCreate()

In [6]:
# create a sedona context
sedona = SedonaContext.create(spark)
# get the spark context
sc = spark.sparkContext
# use UTF-8 as the default encoding
sc.setSystemProperty("sedona.global.charset", "utf8")

In [7]:

raster_data_dir = f"{project_root_dir}/data/raster"
netcdf_sample = f"{raster_data_dir}/netcdf/sea_surface_temperature_O1_2001-2002.nc"
geotiff_sample = f"{raster_data_dir}/geotiff/PlanetDEM_3s_SanFrancisco.tif"
ascii_sample = f"{raster_data_dir}/sample.asc"

### 2.1 Read netcdf via sedona

We will use the `RS_FromNetCDF` function to read NetCDF data. It loads the array data of the given record variable into memory along with `all its dimensions`. Since the NetCDF format has many variants, the reader might not work for every file.

This function has been tested with `netCDF classic (NetCDF 1, 2, 5) and netCDF4/HDF5 files`.
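
Which variant a file uses can usually be told from its first bytes. The sketch below is a plain-Python helper that checks the well-known signatures: `CDF` followed by a version byte (1, 2 or 5) for the classic family, the 8-byte HDF5 signature for netCDF-4, and the two TIFF byte-order marks. The name `sniff_raster_format` is hypothetical and not part of sedona. The `ncols` check for ASCII grids is only a heuristic, since that format has no magic number, and a TIFF signature alone does not prove that the file carries GeoTIFF tags.

```python
def sniff_raster_format(head: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if head[:8] == b"\x89HDF\r\n\x1a\n":
        return "netCDF-4/HDF5"
    if head[:3] == b"CDF" and len(head) >= 4:
        # the 4th byte is the version of the classic format
        versions = {
            1: "netCDF classic",
            2: "netCDF 64-bit offset",
            5: "netCDF 64-bit data (CDF-5)",
        }
        return versions.get(head[3], "unknown")
    if head[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF"
    if head[:5].lower() == b"ncols":
        return "ASCII grid (heuristic)"
    return "unknown"


# leading bytes as they appear in the `content` column of the three sample files
print(sniff_raster_format(bytes.fromhex("4344460100")))  # netCDF classic
print(sniff_raster_format(bytes.fromhex("49492A0008")))  # TIFF
print(sniff_raster_format(bytes.fromhex("6E636F6C73")))  # ASCII grid (heuristic)
```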



In [8]:
raw_netcdf = sedona.read.format("binaryFile").load(netcdf_sample)

In [9]:
raw_netcdf.show(5)

+--------------------+--------------------+-------+--------------------+
|                path|    modificationTime| length|             content|
+--------------------+--------------------+-------+--------------------+
|file:/C:/Users/pl...|2025-06-16 11:42:...|2949152|[43 44 46 01 00 0...|
+--------------------+--------------------+-------+--------------------+



In [10]:
# we need to get the netcdf record info first
recordInfo = raw_netcdf.selectExpr("RS_NetCDFInfo(content) as record_info").first()[0]
print(recordInfo)

lon_bnds(lon=180, bnds=2)

lat_bnds(lat=170, bnds=2)

time_bnds(time=24, bnds=2)

tos(time=24, lat=170, lon=180)
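
The listing above follows a regular `name(dim=size, ...)` pattern, so it can be turned into a dictionary when the variable and dimension names need to be picked programmatically. This is a minimal sketch: the helper `parse_record_info` is hypothetical, and it assumes the text layout shown in this output.

```python
import re
from typing import Dict


def parse_record_info(info: str) -> Dict[str, Dict[str, int]]:
    """Map each variable name to its {dimension: size} dictionary."""
    result: Dict[str, Dict[str, int]] = {}
    for name, dims in re.findall(r"(\w+)\(([^)]*)\)", info):
        sizes: Dict[str, int] = {}
        for part in dims.split(","):
            if not part.strip():
                continue  # variable without dimensions
            dim, size = part.split("=")
            sizes[dim.strip()] = int(size)
        result[name] = sizes
    return result


record_info_text = """lon_bnds(lon=180, bnds=2)

lat_bnds(lat=170, bnds=2)

time_bnds(time=24, bnds=2)

tos(time=24, lat=170, lon=180)
"""

parsed = parse_record_info(record_info_text)
print(parsed["tos"])  # {'time': 24, 'lat': 170, 'lon': 180}
```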


In [11]:
# convert the binary column into raster column
netcdf_df = raw_netcdf.withColumn("raster", expr("RS_FromNetCDF(content, 'tos', 'lon', 'lat')"))

In [12]:
netcdf_df.select("raster").show(1, truncate=False, vertical=True)

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Now let's check whether sedona can work with the raster column:
- The function `RS_PixelAsCentroid`: returns the centroid (point geometry) of the specified pixel's area. The pixel coordinates specified are 1-indexed. If `colX` and `rowY` are out of bounds for the raster, they are interpolated assuming the same skew and translate values.

In [13]:
netcdf_df_centroid_1_1 = netcdf_df.select("raster").withColumn("centroid", expr("RS_PixelAsCentroid(raster,1,1)"))
netcdf_df_centroid_1_1.show(1, truncate=False, vertical=True)

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [14]:
def get_raster_pixel_centroid(src_raster_df: DataFrame, grid_x: int, grid_y: int) -> DataFrame:
    """
    This function takes a raster dataframe and pixel coordinates, then returns the centroid of the pixel
    """
    return src_raster_df.select("raster").withColumn("centroid", expr(f"RS_PixelAsCentroid(raster,{grid_x},{grid_y})"))


def show_raster_pixel_centroid(src_raster_df: DataFrame, grid_x: int, grid_y: int) -> None:
    """
    This function takes a raster dataframe and pixel coordinates, then prints the centroid of the pixel
    """
    get_raster_pixel_centroid(src_raster_df, grid_x, grid_y).show(1, truncate=False, vertical=True)

In [15]:
netcdf_df_centroid_1_2 = get_raster_pixel_centroid(netcdf_df, 1, 2)
netcdf_df_centroid_1_2.show(1, truncate=False, vertical=True)

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [16]:
show_raster_pixel_centroid(netcdf_df, 1, 3)

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The centroids of the first three pixels in column 1 are:
- grid(1,1): POINT (1, 89.5)
- grid(1,2): POINT (1, 88.5)
- grid(1,3): POINT (1, 87.5)
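
These values follow from simple grid arithmetic: with 1-indexed pixels, the centroid sits half a pixel past the pixel's upper-left corner. The sketch below reproduces the three points. It assumes an upper-left corner at (0, 90), a pixel width of 2 degrees, and a pixel height of 1 degree counted southward. These parameters are inferred from the printed centroids and the 180 × 170 grid, not read from the file, and skew is ignored.

```python
from typing import Tuple


def pixel_centroid(col_x: int, row_y: int,
                   upper_left_x: float, upper_left_y: float,
                   scale_x: float, scale_y: float) -> Tuple[float, float]:
    """Centroid of a 1-indexed pixel on an unrotated grid."""
    x = upper_left_x + (col_x - 0.5) * scale_x
    y = upper_left_y + (row_y - 0.5) * scale_y
    return x, y


# assumed georeferencing (see the note above); scale_y is negative because rows run north to south
UPPER_LEFT_X, UPPER_LEFT_Y = 0.0, 90.0
SCALE_X, SCALE_Y = 2.0, -1.0

for row in (1, 2, 3):
    print(pixel_centroid(1, row, UPPER_LEFT_X, UPPER_LEFT_Y, SCALE_X, SCALE_Y))
# (1.0, 89.5)
# (1.0, 88.5)
# (1.0, 87.5)
```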

### 2.2 Read geotiff with sedona

In [17]:
# read the raw geotiff file
raw_geotiff = sedona.read.format("binaryFile").load(geotiff_sample)

# note that the content of the geotiff file is read as a binary column
raw_geotiff.show(5)


+--------------------+--------------------+-------+--------------------+
|                path|    modificationTime| length|             content|
+--------------------+--------------------+-------+--------------------+
|file:/C:/Users/pl...|2025-06-16 11:42:...|2890124|[49 49 2A 00 08 F...|
+--------------------+--------------------+-------+--------------------+



In [18]:
raw_geotiff.printSchema()

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)



In [19]:
geotiff_df = (
    raw_geotiff
    .withColumn("raster", expr("RS_FromGeoTiff(content)"))
    .select("modificationTime", "raster", "content")
)

In [20]:
geotiff_df.select("raster").show(1, truncate=False, vertical=True)

-RECORD 0--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 raster | GridCoverage2D["geotiff_coverage", GeneralBounds[(-123.0, 37.00000000000001), (-122.0, 38.0)], DefaultGeographicCRS["WGS 84"]]\r\n│   RenderedSampleDimension("GRAY_INDEX":[-32767.0 ... -32767.0])\r\n│     ‣ Category("No data":[-32767.0 ... -32767.0])\r\n└ Image=RenderedImageAdapter[]\r\n 



In [21]:
geotiff_df.printSchema()

root
 |-- modificationTime: timestamp (nullable = true)
 |-- raster: raster (nullable = true)
 |-- content: binary (nullable = true)



In [22]:
geotiff_df_polygon_1_1 = geotiff_df.select("raster").withColumn("polygon", expr("RS_PixelAsPolygon(raster,1,1)"))
geotiff_df_polygon_1_1.show(1, truncate=False, vertical=True)

-RECORD 0---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 raster  | GridCoverage2D["geotiff_coverage", GeneralBounds[(-123.0, 37.00000000000001), (-122.0, 38.0)], DefaultGeographicCRS["WGS 84"]]\r\n│   RenderedSampleDimension("GRAY_INDEX":[-32767.0 ... -32767.0])\r\n│     ‣ Category("No data":[-32767.0 ... -32767.0])\r\n└ Image=RenderedImageAdapter[]\r\n 
 polygon | POLYGON ((-123 38, -122.99916666666667 38, -122.99916666666667 37.99916666666667, -123 37.99916666666667, -123 38))                                                                                                                                                                              



In [23]:
def get_raster_pixel_polygon(src_raster_df: DataFrame, grid_x: int, grid_y: int) -> DataFrame:
    """
    This function takes a raster dataframe and pixel coordinates, then returns the polygon (bounding box) of the pixel
    """
    return src_raster_df.select("raster").withColumn("polygon", expr(f"RS_PixelAsPolygon(raster,{grid_x},{grid_y})"))


def show_raster_pixel_polygon(src_raster_df: DataFrame, grid_x: int, grid_y: int) -> None:
    """
    This function takes a raster dataframe and pixel coordinates, then prints the polygon (bounding box) of the pixel
    """
    get_raster_pixel_polygon(src_raster_df, grid_x, grid_y).show(1, truncate=False, vertical=True)

In [24]:
geotiff_df_polygon_1_2 = get_raster_pixel_polygon(geotiff_df, 1, 2)
geotiff_df_polygon_1_2.show(1, truncate=False, vertical=True)

-RECORD 0---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 raster  | GridCoverage2D["geotiff_coverage", GeneralBounds[(-123.0, 37.00000000000001), (-122.0, 38.0)], DefaultGeographicCRS["WGS 84"]]\r\n│   RenderedSampleDimension("GRAY_INDEX":[-32767.0 ... -32767.0])\r\n│     ‣ Category("No data":[-32767.0 ... -32767.0])\r\n└ Image=RenderedImageAdapter[]\r\n 
 polygon | POLYGON ((-123 37.99916666666667, -122.99916666666667 37.99916666666667, -122.99916666666667 37.998333333333335, -123 37.998333333333335, -123 37.99916666666667))                                                                                                                               
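
The polygon printed above can be cross-checked by hand. The sketch below rebuilds the ring of a 1-indexed pixel from the raster's upper-left corner and pixel size. It assumes the upper-left corner (-123, 38) taken from the printed bounds, a pixel size of 1/1200 degree (3 arc-seconds, consistent with the coordinates in the output), and no skew. The helper `pixel_polygon` is hypothetical and independent of sedona.

```python
from typing import List, Tuple


def pixel_polygon(col_x: int, row_y: int,
                  upper_left_x: float, upper_left_y: float,
                  scale_x: float, scale_y: float) -> List[Tuple[float, float]]:
    """Closed ring of a 1-indexed pixel: upper-left, upper-right, lower-right, lower-left."""
    x0 = upper_left_x + (col_x - 1) * scale_x
    y0 = upper_left_y + (row_y - 1) * scale_y
    x1 = x0 + scale_x
    y1 = y0 + scale_y
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1), (x0, y0)]


# assumed georeferencing (see the note above)
UL_X, UL_Y = -123.0, 38.0
PIXEL = 1.0 / 1200.0

ring = pixel_polygon(1, 2, UL_X, UL_Y, PIXEL, -PIXEL)
for x, y in ring:
    # rounded for display; compare with the RS_PixelAsPolygon output for pixel (1, 2)
    print(round(x, 6), round(y, 6))
```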



In [25]:
geotiff_df_polygon_1_2.printSchema()

root
 |-- raster: raster (nullable = true)
 |-- polygon: geometry (nullable = true)



### 2.3 Read ASCII format


In [26]:
# read the ASCII grid file as binary
raw_ascii = sedona.read.format("binaryFile").load(ascii_sample)
raw_ascii.show(5)

+--------------------+--------------------+------+--------------------+
|                path|    modificationTime|length|             content|
+--------------------+--------------------+------+--------------------+
|file:/C:/Users/pl...|2025-08-25 11:01:...|   181|[6E 63 6F 6C 73 2...|
+--------------------+--------------------+------+--------------------+



In [27]:
# convert the binary content column to raster column
ascii_df = raw_ascii.withColumn("raster", expr("RS_FromArcInfoAsciiGrid(content)"))
ascii_df.select("raster").show(1, truncate=False, vertical=True)

-RECORD 0----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 raster | GridCoverage2D["AsciiGrid", GeneralBounds[(100.0, 200.0), (150.0, 240.0)], DefaultEngineeringCRS["Generic cartesian 2D"]]\r\n│   RenderedSampleDimension("AsciiGrid":[-9999.0 ... -9999.0])\r\n│     ‣ Category("No data":[-9999.0 ... -9999.0])\r\n└ Image=RenderedImageAdapter[]\r\n 



## Don't forget to close the spark session


In [28]:
spark.stop()