# Sedona quick start

In this notebook, we will:

1. create a sedona-spark context
2. read geospatial data
3. perform some geospatial computations
4. write the result as a geoparquet file

## 1. Get the sedona required jar files

As I explained before, sedona is written in `scala` and released as `.jar` files.


To make sedona work, you need two jar files:
- `sedona-spark-shaded-3.5_2.12-1.7.2.jar`: this sedona jar is built for `spark 3.5.*` and compiled with `scala 2.12`. The `sedona version is 1.7.2` (the latest as of 11/08/2025).
- `geotools-wrapper-1.7.2-28.5.jar`: this geotools jar is built for `sedona version 1.7.2`; the `geotools version is 28.5`.
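
The naming convention described above can be read mechanically. The sketch below is a minimal illustration, assuming the shaded jar always follows the pattern `sedona-spark-shaded-<spark>_<scala>-<sedona>.jar`; the function name `parse_sedona_jar_name` is made up for this example.

```python
import re

# Assumed naming pattern: sedona-spark-shaded-<spark>_<scala>-<sedona>.jar
SEDONA_JAR_PATTERN = re.compile(
    r"^sedona-spark-shaded-(?P<spark>\d+\.\d+)_(?P<scala>\d+\.\d+)-(?P<sedona>\d+\.\d+\.\d+)\.jar$"
)


def parse_sedona_jar_name(jar_name: str) -> dict:
    """Split a shaded sedona jar name into its three version components."""
    match = SEDONA_JAR_PATTERN.match(jar_name)
    if match is None:
        raise ValueError(f"unexpected jar name: {jar_name}")
    return match.groupdict()


print(parse_sedona_jar_name("sedona-spark-shaded-3.5_2.12-1.7.2.jar"))
# {'spark': '3.5', 'scala': '2.12', 'sedona': '1.7.2'}
```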

You can find the jar files at the URLs below:
- sedona-jars: https://repo.maven.apache.org/maven2/org/apache/sedona/
- geotools-wrapper-jars: https://repo.maven.apache.org/maven2/org/datasyslab/geotools-wrapper/
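
Both repositories follow the standard Maven directory layout, so the download URL of a jar can be derived from its coordinates. The helper below is a sketch: the function name `maven_jar_url` is an assumption of this example, the group ids are read from the two URLs above, and the resulting URLs are not checked against the repository.

```python
MAVEN_ROOT = "https://repo.maven.apache.org/maven2"


def maven_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the URL of a jar from its Maven coordinates (standard repository layout)."""
    group_path = group_id.replace(".", "/")
    return f"{MAVEN_ROOT}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


print(maven_jar_url("org.apache.sedona", "sedona-spark-shaded-3.5_2.12", "1.7.2"))
print(maven_jar_url("org.datasyslab", "geotools-wrapper", "1.7.2-28.5"))
```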

### 1.1. Install apache-sedona python dependency

Here, we assume you have completed the tutorials `01.Create_python_env_in_CASD` and `02.Use_pyspark_in_CASD`.
In other words, you have already installed spark, and you have a `python virtual environment` that contains `pyspark`.

```shell
# To install pyspark along with Sedona Python in one go, use the spark extra
pip install "apache-sedona[spark]"

# check the version of apache-sedona, because the jar version must match it
pip show apache-sedona

# also verify the pyspark version after installing apache-sedona:
# sedona lists pyspark as a requirement, so pip may install the latest pyspark
pip show pyspark

# if the pyspark version is not the one you want, reinstall the right version
pip install pyspark==3.5.7
```

> Make sure the version of the sedona jar files and the version of the `apache-sedona` python package are exactly the same.
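
This version check can be automated. The sketch below compares the installed `apache-sedona` package version with the sedona version embedded in each jar file name. It assumes the jar names follow the two patterns shown in section 1; the helper names are made up for this example.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def sedona_version_of_jar(jar_name: str) -> str:
    """Extract the sedona version (x.y.z) from a sedona or geotools-wrapper jar name."""
    match = re.search(r"-(\d+\.\d+\.\d+)(?:-[\d.]+)?\.jar$", jar_name)
    if match is None:
        raise ValueError(f"no sedona version found in: {jar_name}")
    return match.group(1)


def installed_sedona_version():
    """Return the installed apache-sedona version, or None if the package is absent."""
    try:
        return version("apache-sedona")
    except PackageNotFoundError:
        return None


def mismatched_jars(python_version: str, jar_names: list) -> list:
    """Return the jar names whose sedona version differs from the python package version."""
    return [name for name in jar_names if sedona_version_of_jar(name) != python_version]


jars = ["sedona-spark-shaded-3.5_2.12-1.7.2.jar", "geotools-wrapper-1.7.2-28.5.jar"]
print(installed_sedona_version())
print(mismatched_jars("1.7.2", jars))
```

An empty list from `mismatched_jars` means every jar carries the same sedona version as the python package.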

## 2. Create a sedona-spark session

As I explained before, the sedona context is built on top of spark. So we first need to create a spark session, and then create a sedona context from it.

In [1]:
from sedona.spark import *
from pathlib import Path
from pyspark.sql import SparkSession, DataFrame


In [2]:
import os
os.environ["PYSPARK_PYTHON"]="python"
os.environ["PYSPARK_DRIVER_PYTHON"]="python"
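
The value `"python"` resolves to whichever interpreter comes first on the `PATH`. A possible alternative, assuming the notebook kernel already runs inside the virtual environment that contains `pyspark`, is to point both variables at the running interpreter:

```python
import os
import sys

# Assumption: the current kernel runs inside the virtual environment that contains pyspark.
# sys.executable is the absolute path of the running python interpreter.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print(os.environ["PYSPARK_PYTHON"])
```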

In [3]:
# the project root is the parent of the current working directory
project_root_dir = Path.cwd().parent

print(project_root_dir.as_posix())

C:/Users/pliu/Documents/git/Webinaire_CASD_GeoParquet_Sedona


In [4]:
# here we choose sedona 1.8.0 for spark 3.5.*, built with scala 2.12
sedona_version = "sedona-35-212-180"
jar_folder = project_root_dir / "jars" / sedona_version
jar_list = [str(jar) for jar in jar_folder.iterdir() if jar.is_file()]
jar_path = ",".join(jar_list)

# build a sedona session (sedona = 1.8.0) offline
spark = SparkSession.builder \
    .appName("sedona_tutorial") \
    .master("local[*]") \
    .config("spark.jars", jar_path) \
    .getOrCreate()
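
The `spark.jars` option expects a comma-separated list of jar paths, which is why the cell above joins the file names with `,`. The sketch below wraps the same logic in a helper that keeps only `.jar` files and fails early when the folder is missing or empty. The name `collect_jars` is an assumption of this example, and the demo runs on a temporary folder with empty placeholder files.

```python
import tempfile
from pathlib import Path


def collect_jars(jar_folder: Path) -> str:
    """Return the .jar files of a folder as the comma-separated string expected by spark.jars."""
    if not jar_folder.is_dir():
        raise FileNotFoundError(f"jar folder not found: {jar_folder}")
    jars = sorted(p.as_posix() for p in jar_folder.iterdir() if p.is_file() and p.suffix == ".jar")
    if not jars:
        raise FileNotFoundError(f"no .jar file in: {jar_folder}")
    return ",".join(jars)


# Demo on a temporary folder with placeholder files (not real jars).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.jar").touch()
    (folder / "b.jar").touch()
    (folder / "notes.txt").touch()
    demo_jar_path = collect_jars(folder)
    print(demo_jar_path)
```

Failing early is a design choice: a missing jar otherwise only shows up later, when the sedona context is created.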


In [5]:
# create a sedona context
sedona = SedonaContext.create(spark)

In [6]:
# get the spark context
sc = spark.sparkContext
# use UTF-8 as the default encoding
sc.setSystemProperty("sedona.global.charset", "utf8")

## 3. Read geospatial data with sedona

In this section, we will read two shapefiles:
- airports_shape: all international airports in the world. The geometry column contains points.
- countries_shape: all countries in the world. The geometry column contains multi-polygons.
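
A shapefile is not a single file: the `.shp`, `.shx` and `.dbf` components are all mandatory, and here each dataset sits in its own folder. The sketch below checks a folder for these three components before handing it to the reader. The helper name is made up, and the demo uses a temporary folder with empty placeholder files.

```python
import tempfile
from pathlib import Path

# Mandatory components of a shapefile: geometry, index, attribute table
REQUIRED_SUFFIXES = {".shp", ".shx", ".dbf"}


def missing_shapefile_parts(folder: Path) -> set:
    """Return the mandatory shapefile extensions that are absent from the folder."""
    present = {p.suffix.lower() for p in folder.iterdir() if p.is_file()}
    return REQUIRED_SUFFIXES - present


# Demo: a folder that lacks the .shx index file
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "airports.shp").touch()
    (folder / "airports.dbf").touch()
    missing = missing_shapefile_parts(folder)
    print(sorted(missing))
    # ['.shx']
```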

In [7]:
data_dir = f"{project_root_dir}/data"
airports_file_path = f"{data_dir}/airports_shape"
countries_file_path = f"{data_dir}/countries_shape"

In [8]:
# read countries shape file
countries_df = sedona.read.format("shapefile").load(countries_file_path)
countries_df.printSchema()

root
 |-- geometry: geometry (nullable = true)
 |-- featurecla: string (nullable = true)
 |-- scalerank: long (nullable = true)
 |-- LABELRANK: long (nullable = true)
 |-- SOVEREIGNT: string (nullable = true)
 |-- SOV_A3: string (nullable = true)
 |-- ADM0_DIF: long (nullable = true)
 |-- LEVEL: long (nullable = true)
 |-- TYPE: string (nullable = true)
 |-- ADMIN: string (nullable = true)
 |-- ADM0_A3: string (nullable = true)
 |-- GEOU_DIF: long (nullable = true)
 |-- GEOUNIT: string (nullable = true)
 |-- GU_A3: string (nullable = true)
 |-- SU_DIF: long (nullable = true)
 |-- SUBUNIT: string (nullable = true)
 |-- SU_A3: string (nullable = true)
 |-- BRK_DIFF: long (nullable = true)
 |-- NAME: string (nullable = true)
 |-- NAME_LONG: string (nullable = true)
 |-- BRK_A3: string (nullable = true)
 |-- BRK_NAME: string (nullable = true)
 |-- BRK_GROUP: string (nullable = true)
 |-- ABBREV: string (nullable = true)
 |-- POSTAL: string (nullable = true)
 |-- FORMAL_EN: string (nullable

In [9]:
countries_df.show(1, vertical=True)

-RECORD 0--------------------------
 geometry   | MULTIPOLYGON (((3... 
 featurecla | Admin-0 country      
 scalerank  | 1                    
 LABELRANK  | 3                    
 SOVEREIGNT | Zimbabwe             
 SOV_A3     | ZWE                  
 ADM0_DIF   | 0                    
 LEVEL      | 2                    
 TYPE       | Sovereign country    
 ADMIN      | Zimbabwe             
 ADM0_A3    | ZWE                  
 GEOU_DIF   | 0                    
 GEOUNIT    | Zimbabwe             
 GU_A3      | ZWE                  
 SU_DIF     | 0                    
 SUBUNIT    | Zimbabwe             
 SU_A3      | ZWE                  
 BRK_DIFF   | 0                    
 NAME       | Zimbabwe             
 NAME_LONG  | Zimbabwe             
 BRK_A3     | ZWE                  
 BRK_NAME   | Zimbabwe             
 BRK_GROUP  |                      
 ABBREV     | Zimb.                
 POSTAL     | ZW                   
 FORMAL_EN  | Republic of Zimbabwe 
 FORMAL_FR  |               

In [12]:
countries_view_df = countries_df.select("geometry","FORMAL_EN","POP_EST","POP_RANK","POP_YEAR")
countries_view_df.show(5)

+--------------------+--------------------+--------+--------+--------+
|            geometry|           FORMAL_EN| POP_EST|POP_RANK|POP_YEAR|
+--------------------+--------------------+--------+--------+--------+
|MULTIPOLYGON (((3...|Republic of Zimbabwe|13805084|      14|    2017|
|MULTIPOLYGON (((3...|  Republic of Zambia|15972000|      14|    2017|
|MULTIPOLYGON (((5...|   Republic of Yemen|28036829|      15|    2017|
|MULTIPOLYGON (((1...|Socialist Republi...|96160163|      16|    2017|
|MULTIPOLYGON (((-...|Bolivarian Republ...|31304016|      15|    2017|
+--------------------+--------------------+--------+--------+--------+
only showing top 5 rows



In [11]:
# read airports shape file
airports_df = sedona.read.format("shapefile").load(airports_file_path)
airports_df.printSchema()

root
 |-- geometry: geometry (nullable = true)
 |-- scalerank: long (nullable = true)
 |-- featurecla: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- abbrev: string (nullable = true)
 |-- location: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- wikipedia: string (nullable = true)
 |-- natlscale: decimal(8,3) (nullable = true)



In [15]:
airports_df.show(1, vertical=True)

-RECORD 0--------------------------
 geometry   | POINT (113.935016... 
 scalerank  | 2                    
 featurecla | Airport              
 type       | major                
 name       | Hong Kong Int'l      
 abbrev     | HKG                  
 location   | terminal             
 gps_code   | VHHH                 
 iata_code  | HKG                  
 wikipedia  | http://en.wikiped... 
 natlscale  | 150.000              
only showing top 1 row



## 4. Use sedona geo function

In this example, we join the countries data frame and the airports data frame on the condition **ST_Contains(c.geometry, a.geometry)**. A country row and an airport row are linked when the `airport (point)` lies inside the `country (polygon)`.
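
To make the predicate concrete, here is a plain python sketch of a point-in-polygon test based on ray casting. This is an illustration of the idea only, not how sedona implements `ST_Contains`: it handles a single ring without holes and does not treat points on the boundary specially.

```python
def point_in_polygon(x: float, y: float, ring: list) -> bool:
    """Ray casting: cast a horizontal ray from (x, y) to the right and count edge crossings."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```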

In [16]:
from pyspark.sql.functions import col

# create a new dataframe to host the result of the join
countries_airport_df = (
    countries_df.alias("c")
    .join(
        airports_df.alias("a"),
        ST_Contains(col("c.geometry"), col("a.geometry"))
    )
    .select(
        col("c.geometry").alias("country_location"),
        col("c.NAME_EN").alias("country_name"),
        col("a.geometry").alias("airport_location"),
        col("a.name").alias("airport_name")
    )
)

countries_airport_df.show(5, truncate=False)



In [17]:
countries_airport_df.printSchema()

root
 |-- country_location: geometry (nullable = true)
 |-- country_name: string (nullable = true)
 |-- airport_location: geometry (nullable = true)
 |-- airport_name: string (nullable = true)



In [18]:
# create a temporary view to store the result of the geo join
countries_airport_df.createOrReplaceTempView("country_airport")

In [19]:
from pyspark.sql.functions import count, desc

airports_count = (countries_airport_df
                  .groupBy("country_name", "country_location")
                  .agg(count("*").alias("airport_count"))
                  .sort(desc("airport_count")))

airports_count.show(5)

+--------------------+--------------------+-------------+
|        country_name|    country_location|airport_count|
+--------------------+--------------------+-------------+
|United States of ...|MULTIPOLYGON (((-...|           35|
|              Canada|MULTIPOLYGON (((-...|           15|
|              Mexico|MULTIPOLYGON (((-...|           12|
|              Brazil|MULTIPOLYGON (((-...|           12|
|People's Republic...|MULTIPOLYGON (((1...|            7|
+--------------------+--------------------+-------------+
only showing top 5 rows



In [20]:
france_airports = (
    countries_airport_df
    .filter(col("country_name").like("%France%"))
    .select(
        col("country_name"),
        col("airport_name"),
        col("airport_location")
    )
)
france_airports.show(5, truncate=False)

+------------+-----------------------+--------------------------------------------+
|country_name|airport_name           |airport_location                            |
+------------+-----------------------+--------------------------------------------+
|France      |Paris Orly             |POINT (2.367379127837731 48.73130304580517) |
|France      |Charles de Gaulle Int'l|POINT (2.5418677673945727 49.01442009693855)|
+------------+-----------------------+--------------------------------------------+
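
The `airport_location` values are printed as WKT text, where a point is written `POINT (x y)`. Assuming the data uses longitude/latitude degrees (the values above are consistent with the Paris area), `x` is the longitude and `y` the latitude. The sketch below parses such a string in plain python; the helper name is made up for this example.

```python
import re

POINT_PATTERN = re.compile(r"^POINT \((?P<x>-?\d+(?:\.\d+)?) (?P<y>-?\d+(?:\.\d+)?)\)$")


def parse_wkt_point(wkt: str) -> tuple:
    """Return (x, y) from a 2D WKT point such as 'POINT (2.36 48.73)'."""
    match = POINT_PATTERN.match(wkt.strip())
    if match is None:
        raise ValueError(f"not a 2D WKT point: {wkt}")
    return float(match["x"]), float(match["y"])


# Assumption: x is the longitude and y the latitude, both in degrees
lon, lat = parse_wkt_point("POINT (2.367379127837731 48.73130304580517)")
print(lon, lat)
```

Inside a data frame, the sedona functions `ST_X` and `ST_Y` extract the same coordinates without going through text.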



In [21]:
france_airports.printSchema()

root
 |-- country_name: string (nullable = true)
 |-- airport_name: string (nullable = true)
 |-- airport_location: geometry (nullable = true)



## 5. Write the result to a geoparquet file

In [22]:
out_path = f"{data_dir}/tmp/france_airports"
france_airports.write.mode("overwrite").format("geoparquet").save(out_path)
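
Spark writes `out_path` as a directory that holds one or more part files plus marker files, not as a single file. The sketch below lists the parquet part files of such a directory. The helper name is an assumption of this example, and the demo uses a temporary directory with made-up file names rather than real spark output.

```python
import tempfile
from pathlib import Path


def list_parquet_parts(out_dir: Path) -> list:
    """Return the names of the parquet part files inside a spark output directory."""
    return sorted(p.name for p in out_dir.iterdir() if p.is_file() and p.suffix == ".parquet")


# Demo on a temporary directory with made-up file names (not real spark output).
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    (out_dir / "_SUCCESS").touch()
    (out_dir / "part-00000.parquet").touch()
    (out_dir / "part-00001.parquet").touch()
    parts = list_parquet_parts(out_dir)
    print(parts)
    # ['part-00000.parquet', 'part-00001.parquet']
```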

In [23]:
tmp_df = sedona.read.format("geoparquet").load(out_path)
tmp_df.show(5, truncate=False)

+------------+-----------------------+--------------------------------------------+
|country_name|airport_name           |airport_location                            |
+------------+-----------------------+--------------------------------------------+
|France      |Paris Orly             |POINT (2.367379127837731 48.73130304580517) |
|France      |Charles de Gaulle Int'l|POINT (2.5418677673945727 49.01442009693855)|
+------------+-----------------------+--------------------------------------------+



In [24]:
tmp_df.printSchema()

root
 |-- country_name: string (nullable = true)
 |-- airport_name: string (nullable = true)
 |-- airport_location: geometry (nullable = true)
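
The geometry type survives the round trip because a geoparquet file carries a `geo` entry in its file metadata: a JSON document that names the geometry columns and their encoding. The sketch below checks such a document for the fields that the GeoParquet 1.0 specification marks as required. The example metadata is hypothetical: its values are assumed for the `france_airports` result and are not read from the file written above.

```python
import json

REQUIRED_TOP_LEVEL = ("version", "primary_column", "columns")
REQUIRED_PER_COLUMN = ("encoding", "geometry_types")


def geo_metadata_problems(geo_json: str) -> list:
    """Return the required fields that are missing from a geoparquet 'geo' metadata document."""
    meta = json.loads(geo_json)
    problems = [f"missing top-level field: {key}" for key in REQUIRED_TOP_LEVEL if key not in meta]
    for name, column in meta.get("columns", {}).items():
        problems += [f"column {name}: missing {key}" for key in REQUIRED_PER_COLUMN if key not in column]
    primary = meta.get("primary_column")
    if primary is not None and primary not in meta.get("columns", {}):
        problems.append(f"primary_column {primary} is not described in columns")
    return problems


# Hypothetical metadata: the values are assumed, not read from the file.
example = json.dumps({
    "version": "1.0.0",
    "primary_column": "airport_location",
    "columns": {"airport_location": {"encoding": "WKB", "geometry_types": ["Point"]}},
})
print(geo_metadata_problems(example))
# []
```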



In [25]:
# close spark session
spark.stop()