# Raster and GeoParquet

## 1. Raster Data

Raster data is a `type of geospatial data model`. It represents the world as `a grid of cells (pixels)`, each cell containing a value.

Typical uses of raster data:
- satellite imagery,
- digital elevation models (DEM),
- climate data,
- land cover maps.

Each pixel may store:
- A continuous value (elevation, temperature, rainfall).
- A discrete class (land cover type: forest, water, urban).

Example: A 1000×1000 raster with 30 m resolution covers 30 km × 30 km of terrain.
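
The arithmetic behind this example can be checked with a few lines of Python. This is only a minimal sketch: the helper name `raster_extent_km` is hypothetical and it assumes square cells whose size is given in metres.

```python
# Ground extent of a raster = number of cells x cell size, computed per axis.
def raster_extent_km(n_cols, n_rows, cell_size_m):
    """Return (width_km, height_km) covered by a raster with square cells."""
    width_km = n_cols * cell_size_m / 1000
    height_km = n_rows * cell_size_m / 1000
    return width_km, height_km


width_km, height_km = raster_extent_km(1000, 1000, 30)
print(width_km, height_km)  # 30.0 30.0
```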

The raster data model is implemented by many file formats:
- NetCDF (Network Common Data Form): climate/weather, remote sensing
- GeoTIFF: GIS maps. Standard in GIS, flexible, supports compression, multiple bands, and large file sizes.
- ASCII Grid (ESRI .asc)
- HDF (Hierarchical Data Format, HDF4 / HDF5): climate/weather, remote sensing
- etc.

### 1.1 NetCDF (Network Common Data Form)

**NetCDF** is a file format and data model, which stores `multi-dimensional arrays` (e.g., latitude × longitude × time × depth).

It has the following advantages:
- Self-describing: contains metadata describing variables, units, coordinate systems.
- Binary: efficient for large spatiotemporal datasets (like daily temperature over decades).

It's commonly used in atmospheric reanalysis, global climate models, and remote sensing time series.

A NetCDF file defines:

- dimensions: define the shape of the data arrays.
- variables: store the actual data.
- coordinates: 1D or 2D variables that define position.
- attributes: store metadata at the file or variable level.
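
To make these four building blocks concrete, the sketch below models a NetCDF-like file with plain Python dictionaries. It is not the API of any NetCDF library. The dimension sizes mirror the `tos(time=24, lat=170, lon=180)` variable read later in this tutorial, while the attribute values and the helper `variable_shape` are illustrative assumptions.

```python
from math import prod

# Hypothetical in-memory description of a NetCDF-like file (plain dicts only).
dataset = {
    # file-level attributes (illustrative value)
    "attributes": {"title": "sea surface temperature sample"},
    # dimensions: name -> length
    "dimensions": {"time": 24, "lat": 170, "lon": 180},
    # variables: each one lists the dimensions it spans plus its own attributes
    "variables": {
        "lat": {"dims": ("lat",), "attributes": {"units": "degrees_north"}},  # coordinate
        "lon": {"dims": ("lon",), "attributes": {"units": "degrees_east"}},   # coordinate
        "tos": {"dims": ("time", "lat", "lon"), "attributes": {"units": "K"}},  # data (unit assumed)
    },
}


def variable_shape(ds, name):
    """Resolve a variable's shape from the dimensions it references."""
    return tuple(ds["dimensions"][dim] for dim in ds["variables"][name]["dims"])


shape = variable_shape(dataset, "tos")
print(shape, prod(shape))  # (24, 170, 180) 734400
```

The shape of a variable is never stored on the variable itself: it is derived from the shared dimensions, which is why several variables can reuse the same `lat` and `lon` axes.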

### 1.2 GeoTIFF

**GeoTIFF** is a `georeferenced raster image format` widely used in `geographic information systems (GIS)` to store spatial data such as satellite imagery, aerial photos, digital elevation models, and other gridded data.
A GeoTIFF file remains readable by ordinary TIFF viewers, but it is extended with `geographic tags (e.g. coordinate system, projection, geotransform)` following the GeoTIFF specification.

It can contain multiple bands (e.g., RGB, multispectral, or elevation).
It's widely supported by `GIS tools (e.g. QGIS, ArcGIS, GDAL, etc.)`.
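
The geotransform mentioned above is what ties a pixel position to a map coordinate. The sketch below applies the six-coefficient affine convention used by GDAL in plain Python. The function name `pixel_to_world` and the sample coefficients (30 m cells, north-up image, arbitrary projected origin) are assumptions for illustration, not values taken from a real file.

```python
def pixel_to_world(geotransform, col, row):
    """Map a pixel (col, row) to the map coordinate of its upper-left corner.

    geotransform follows the GDAL ordering:
    (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height)
    """
    x0, pixel_w, row_rot, y0, col_rot, pixel_h = geotransform
    x = x0 + col * pixel_w + row * row_rot
    y = y0 + col * col_rot + row * pixel_h
    return x, y


# Illustrative geotransform: origin at (500000, 4200000), 30 m cells, no rotation.
# pixel_height is negative because row numbers grow southwards in a north-up image.
gt = (500000.0, 30.0, 0.0, 4200000.0, 0.0, -30.0)

print(pixel_to_world(gt, 0, 0))        # (500000.0, 4200000.0)
print(pixel_to_world(gt, 1000, 1000))  # (530000.0, 4170000.0)
```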

Example: a Landsat scene distributed as GeoTIFF files, each band in a separate file.

### 1.3 ASCII Grid (ESRI .asc)

`ASCII Grid` is a simple text-based raster format.
It's easy to read but `inefficient for large datasets`.
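
Because the format is plain text, a tiny grid can be parsed with the standard library alone. The sketch below assumes the common six-line header (`ncols`, `nrows`, `xllcorner`, `yllcorner`, `cellsize`, `NODATA_value`). The sample content and the helper `parse_ascii_grid` are made up for illustration, and files that omit the optional `NODATA_value` line are not handled.

```python
# A made-up 3 x 2 grid in ESRI ASCII Grid layout.
sample_asc = """\
ncols 3
nrows 2
xllcorner 0.0
yllcorner 0.0
cellsize 30.0
NODATA_value -9999
1 2 3
4 -9999 6
"""


def parse_ascii_grid(text):
    """Return (header, rows); cells equal to NODATA_value become None."""
    lines = text.strip().splitlines()
    header = {}
    for line in lines[:6]:  # assumption: exactly six header lines
        key, value = line.split()
        header[key.lower()] = float(value)
    nodata = header["nodata_value"]
    rows = [
        [None if float(cell) == nodata else float(cell) for cell in line.split()]
        for line in lines[6:]
    ]
    return header, rows


header, rows = parse_ascii_grid(sample_asc)
print(header["ncols"], header["nrows"], header["cellsize"])  # 3.0 2.0 30.0
print(rows)  # [[1.0, 2.0, 3.0], [4.0, None, 6.0]]
```

Every cell is stored as a decimal string and has to be parsed one by one, which is why the format scales poorly compared with binary formats.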

### 1.4 HDF (Hierarchical Data Format, HDF4 / HDF5)

**HDF** is similar in purpose to `NetCDF`: it is designed for storing large scientific datasets.

> NASA distributes MODIS satellite products in HDF.


## 2. Use Sedona to read raster data

In this tutorial, we will use Sedona to read various raster data formats.

In [1]:
from sedona.spark import *
from pathlib import Path
from pyspark.sql import SparkSession, DataFrame

In [2]:
# locate the project root (the parent of the current working directory)
project_root_dir = Path.cwd().parent

print(project_root_dir.as_posix())

C:/Users/PLIU/Documents/git/Seminar_PySpark_Sedona_GeoParquet


In [3]:
# here we choose sedona 1.7.2 for spark 3.5.*, built with scala 2.12
jar_folder = Path(f"{project_root_dir}/jars/sedona-35-212-172")
jar_list = [str(jar) for jar in jar_folder.iterdir() if jar.is_file()]
jar_path = ",".join(jar_list)

# build a sedona session (sedona = 1.7.2) offline
spark = SparkSession.builder \
    .appName("sedona_tutorial") \
    .master("local[*]") \
    .config("spark.jars", jar_path) \
    .getOrCreate()

In [4]:
# create a sedona context
sedona = SedonaContext.create(spark)
# get the spark context
sc = spark.sparkContext
# use utf-8 as the default encoding
sc.setSystemProperty("sedona.global.charset", "utf8")

In [5]:

raster_data_dir = f"{project_root_dir}/data/raster"
netcdf_sample = f"{raster_data_dir}/netcdf/sea_surface_temperature_O1_2001-2002.nc"
geotiff_sample = f"{raster_data_dir}/geotiff/PlanetDEM_3s_SanFrancisco.tif"

### 2.1 Read NetCDF via Sedona

We will use the `RS_FromNetCDF` function to read NetCDF data. It reads the array data of the record variable into memory along with `all its dimensions`. Because the NetCDF format has many variants, the reader might not work for every file.

The reader has been tested on `netCDF classic (NetCDF 1, 2, 5) and netCDF4/HDF5 files`.



In [6]:
# load the whole NetCDF file as raw bytes (one row per file, bytes in the `content` column)
raw_netcdf = sedona.read.format("binaryFile").load(netcdf_sample)

In [7]:
raw_netcdf.show(5)

+--------------------+--------------------+-------+--------------------+
|                path|    modificationTime| length|             content|
+--------------------+--------------------+-------+--------------------+
|file:/C:/Users/PL...|2025-06-16 11:42:...|2949152|[43 44 46 01 00 0...|
+--------------------+--------------------+-------+--------------------+



In [9]:
# we need to get the netcdf record info first
recordInfo = raw_netcdf.selectExpr("RS_NetCDFInfo(content) as record_info").first()[0]
print(recordInfo)

lon_bnds(lon=180, bnds=2)

lat_bnds(lat=170, bnds=2)

time_bnds(time=24, bnds=2)

tos(time=24, lat=170, lon=180)
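
The record info is returned as plain text, so choosing the record variable can be done by eye or with a small parser. The sketch below is only an illustration: it assumes the `name(dim=size, ...)` layout printed above, uses a copy of that output instead of the notebook's `recordInfo` variable so that it runs on its own, and the helper `parse_record_info` is hypothetical.

```python
import re

# Copy of the output printed above, used here so the sketch is self-contained.
record_info = """lon_bnds(lon=180, bnds=2)

lat_bnds(lat=170, bnds=2)

time_bnds(time=24, bnds=2)

tos(time=24, lat=170, lon=180)
"""


def parse_record_info(text):
    """Return {variable_name: {dimension_name: size}} from the info string."""
    variables = {}
    for name, dims in re.findall(r"(\w+)\(([^)]*)\)", text):
        variables[name] = {
            dim.split("=")[0].strip(): int(dim.split("=")[1])
            for dim in dims.split(",")
        }
    return variables


variables = parse_record_info(record_info)
# Keep only variables that span both a latitude and a longitude dimension.
gridded = [name for name, dims in variables.items() if {"lat", "lon"}.issubset(dims)]
print(gridded)           # ['tos']
print(variables["tos"])  # {'time': 24, 'lat': 170, 'lon': 180}
```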


In [10]:
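from pyspark.sql.functions import expr

# build a raster from the record variable 'tos';
# 'lon' and 'lat' are the names of its longitude and latitude dimensions
netcdf_df = raw_netcdf.withColumn("raster", expr("RS_FromNetCDF(content, 'tos', 'lon', 'lat')"))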

In [12]:
netcdf_df.select("raster").show(1, truncate=False, vertical=True)

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------