# Spatial data processing with Apache Sedona

[Apache Sedona](https://sedona.apache.org/latest/) is a cluster computing system for processing `large-scale spatial data`. In this tutorial, we will learn
how to use Sedona in a `pyspark environment`. Sedona also supports other `cluster computing systems`, such as `Apache Flink` and `Snowflake`, but we will not cover them in this tutorial.

## 0. Sedona internals

2. Sedona Core (Scala)

Sedona adds support for spatial data types and operations to Spark. It includes:

- Spatial RDDs
- Geometry types (Point, Polygon, etc.)
- Spatial indexes (QuadTree, RTree)
- Spatial operations (joins, predicates)
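
To make the idea of a spatial predicate concrete, here is a minimal pure-Python sketch of a point-in-polygon test, the kind of question a predicate such as `ST_Within` answers. It only illustrates the concept with the classic ray-casting rule; it is not Sedona's implementation, and the function name `point_in_polygon` and the sample square are made up for this example.

```python
from typing import List, Tuple

Coord = Tuple[float, float]  # (x, y), i.e. (longitude, latitude)


def point_in_polygon(point: Coord, ring: List[Coord]) -> bool:
    """Ray casting: count the polygon edges crossed by a horizontal ray from the point."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# hypothetical unit square used only for the demonstration
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(point_in_polygon((0.5, 0.5), square))  # True
print(point_in_polygon((1.5, 0.5), square))  # False
```

An odd number of crossings means the point lies inside the ring. Points exactly on an edge are not handled specially in this sketch.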

3. Py4J Bridge

PySpark uses Py4J to connect the Python driver to the JVM. The Sedona functions you call in Python (e.g., `ST_Point`, `ST_Within`) are Python wrappers that translate calls to JVM methods.

Sedona ships these wrappers in a Python package, `apache-sedona`, which can be installed with optional extras such as `apache-sedona[spark]`.

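## 1. Prepare Sedona in your dev environment

### 1.1 Check the required jar files

As explained above, Sedona is written in `scala` and released as `.jar` files. PySpark therefore needs these jars when the session starts; in this tutorial they are loaded from a local folder.

### 1.2 Install the Python dependencies

`pip install apache-sedona[spark]`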
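
After installing, it can help to confirm which version of the package the interpreter actually sees, since the Python package and the jar files should generally come from the same Sedona release. The sketch below uses only the standard library; the helper name `installed_version` is an assumption for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# prints the installed version string, or None when the package is missing
print(installed_version("apache-sedona"))
```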


In [1]:
from sedona.spark import *
from pyspark.sql import SparkSession, DataFrame
from pathlib import Path
from pyspark.sql.functions import trim, split, expr, col, lit

In [2]:
# resolve the project root (the parent of the current working directory)
project_root_dir = Path.cwd().parent
print(project_root_dir.as_posix())

C:/Users/PLIU/Documents/git/Seminar_PySpark_Sedona_GeoParquet


In [3]:
# here we choose sedona 1.7.2 for spark 3.5.*, built with scala 2.12
jar_folder = Path(f"{project_root_dir}/jars/sedona-35-212-172")
jar_list = [str(jar) for jar in jar_folder.iterdir() if jar.is_file()]
jar_path = ",".join(jar_list)

# build a sedona session (sedona = 1.7.2) offline
spark = SparkSession.builder \
    .appName("sedona_tutorial") \
    .master("local[*]") \
    .config("spark.jars", jar_path) \
    .getOrCreate()
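
The `spark.jars` option expects a comma-separated list of jar paths. Below is a minimal sketch of a helper that builds this string and fails early when the folder is missing or holds no jar, which is usually easier to diagnose than an error raised later when the Sedona context is created. The helper name `build_jar_path` is hypothetical.

```python
from pathlib import Path


def build_jar_path(jar_folder: Path) -> str:
    """Return the .jar files of a folder as a comma-separated string for spark.jars."""
    if not jar_folder.is_dir():
        raise FileNotFoundError(f"jar folder not found: {jar_folder}")
    # sort for a reproducible order and ignore anything that is not a .jar file
    jars = sorted(p for p in jar_folder.iterdir() if p.is_file() and p.suffix == ".jar")
    if not jars:
        raise ValueError(f"no .jar file found in {jar_folder}")
    return ",".join(p.as_posix() for p in jars)
```

It could stand in for the `jar_list` and `jar_path` lines above, as in `jar_path = build_jar_path(jar_folder)`.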

In [4]:
# create a sedona context
sedona = SedonaContext.create(spark)

In [5]:
sc = spark.sparkContext
# use UTF-8 as the default charset
sc.setSystemProperty("sedona.global.charset", "utf8")

In [6]:
fr_immo_transaction_path = project_root_dir / "data/fr_immo_transaction.parquet"
fr_immo_transactions_df = spark.read.parquet(fr_immo_transaction_path.as_posix())

In [7]:
required_col = ["id_transaction","date_transaction","prix","departement","ville","code_postal","adresse","type_batiment","n_pieces","surface_habitable","latitude","longitude"]
clean_fr_immo_df = fr_immo_transactions_df.select(required_col)

In [8]:
# cache the dataframe for better performance
# clean_fr_immo_df.cache()
clean_fr_immo_df.show()

+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+----------------+----------------+
|id_transaction|date_transaction|    prix|departement|               ville|code_postal|             adresse|type_batiment|n_pieces|surface_habitable|        latitude|       longitude|
+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+----------------+----------------+
|        141653|      2014-01-02|197000.0|         01|             TREVOUX|       1600|  6346 MTE DES LILAS|  Appartement|       4|               84|45.9423014034837|4.77069364742062|
|        141970|      2014-01-02|157500.0|         01|              VIRIAT|       1440|1369 RTE DE STRAS...|       Maison|       4|              103|46.2364072868351|5.26293493674271|
|        139240|      2014-01-02|112000.0|         01|SAINT-JEAN-SUR-VEYLE|     

In [9]:
clean_fr_immo_df.printSchema()

root
 |-- id_transaction: integer (nullable = true)
 |-- date_transaction: date (nullable = true)
 |-- prix: double (nullable = true)
 |-- departement: string (nullable = true)
 |-- ville: string (nullable = true)
 |-- code_postal: integer (nullable = true)
 |-- adresse: string (nullable = true)
 |-- type_batiment: string (nullable = true)
 |-- n_pieces: integer (nullable = true)
 |-- surface_habitable: integer (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)



## Create a geometry column

In the raw data frame, the `latitude` and `longitude` columns are plain doubles. To run spatial operations such as spatial joins against other geospatial data, we need to build a geometry column from them.

In [10]:
fr_immo_geometry_df = clean_fr_immo_df.withColumn("geo_coord", ST_Point(col("longitude"), col("latitude"))).drop("longitude", "latitude")
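
Note the argument order: `ST_Point` takes the x coordinate first, so longitude comes before latitude. The plain-Python sketch below mirrors that order and produces the WKT text that a point is displayed as; the helper `to_wkt_point` and its range checks are assumptions added for illustration and are not part of Sedona.

```python
def to_wkt_point(longitude: float, latitude: float) -> str:
    """Format a coordinate pair as WKT, x (longitude) first, then y (latitude)."""
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    return f"POINT ({longitude} {latitude})"


# coordinates of the first row of the data frame shown earlier
print(to_wkt_point(4.77069364742062, 45.9423014034837))
```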

In [11]:
fr_immo_geometry_df.show()

+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+--------------------+
|id_transaction|date_transaction|    prix|departement|               ville|code_postal|             adresse|type_batiment|n_pieces|surface_habitable|           geo_coord|
+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+--------------------+
|        141653|      2014-01-02|197000.0|         01|             TREVOUX|       1600|  6346 MTE DES LILAS|  Appartement|       4|               84|POINT (4.77069364...|
|        141970|      2014-01-02|157500.0|         01|              VIRIAT|       1440|1369 RTE DE STRAS...|       Maison|       4|              103|POINT (5.26293493...|
|        139240|      2014-01-02|112000.0|         01|SAINT-JEAN-SUR-VEYLE|       1290|5174  SAINT JEAN ...|       Maison|       3|              

## Filter with column code_postal


In [31]:
# keep the transactions in Montrouge (postal code 92120) dated after 2021-12-31
montrouge_immo_df = fr_immo_geometry_df \
    .filter((col("code_postal") == 92120) & (col("date_transaction") > lit("2021-12-31"))) \
    .select("adresse", "type_batiment", "n_pieces", "surface_habitable", "prix", "geo_coord")
montrouge_immo_df.count()

1720

In [29]:
map_config = {
    "visState": {
        "layers": [
            {
                "type": "point",
                "config": {
                    # dataId should match the dataset name passed to create_map
                    "dataId": "montrouge_immo_transaction",
                    "label": "Transactions",
                    "color": [255, 0, 0],
                    "isVisible": True,
                },
                "visualChannels": {
                    "colorField": {"name": "prix", "type": "real"},
                    "colorScale": "quantile"
                }
            }
        ]
    },
    "mapState": {
        "bearing": 0,
        "latitude": 48.816,      # Starting center latitude
        "longitude": 2.313,      # Starting center longitude
        "pitch": 0,
        "zoom": 13                # Starting zoom level
    }
}
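
In kepler.gl, each layer is tied to a dataset through its `dataId`, which is expected to match the name given to the dataset when it is added to the map. The sketch below lists the `dataId` values of a config so they can be compared with that name before the map is created. The helper `layer_data_ids` is hypothetical, and it assumes the layers sit under `visState`, either at the top level as in `map_config` above or nested under a `config` key.

```python
from typing import Any, Dict, List


def layer_data_ids(config: Dict[str, Any]) -> List[str]:
    """Collect the dataId of every layer declared in a kepler.gl style config dict."""
    # accept both {"visState": ...} and {"config": {"visState": ...}}
    inner = config.get("config", config)
    layers = inner.get("visState", {}).get("layers", [])
    return [layer.get("config", {}).get("dataId", "") for layer in layers]


# trimmed-down, hypothetical config with the same shape as map_config
sample_config = {
    "visState": {
        "layers": [
            {"type": "point", "config": {"dataId": "my_dataset"}}
        ]
    }
}
print(layer_data_ids(sample_config))  # ['my_dataset']
```

If a returned id differs from the `name` passed to `SedonaKepler.create_map`, the layer settings may not be applied to that dataset.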

In [32]:
# SedonaKepler requires the kepler extra: pip install apache-sedona[kepler-map]
kepler_map_path = project_root_dir / "tmp/montrouge_immo_map.html"
montrouge_immo_map = SedonaKepler.create_map(df=montrouge_immo_df, name="montrouge_immo_transaction", config=map_config)
montrouge_immo_map.save_to_html(file_name=kepler_map_path.as_posix())

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter
Map saved to C:/Users/PLIU/Documents/git/Seminar_PySpark_Sedona_GeoParquet/tmp/montrouge_immo_map.html!
