# Sedona Exercise

In this notebook, you will apply what you have learned so far to solve the following problems:
1. Create a sedona context
2. Read a parquet file
3. Convert the longitude and latitude columns into a geometry column
4. Get all immo transactions of 2020
5. Get all immo transactions near CASD (within a 3-kilometer radius)
6. Write the results of steps 4 and 5 to a GeoParquet file

## 1. Create a sedona context

Before you start:
- check spark and pyspark
- check the required jar files of sedona
- check the python dependencies of sedona

If everything is checked, you can continue
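
The checks above can be scripted. The snippet below is a minimal sketch and not part of the original exercise: `missing_packages` and `list_jars` are hypothetical helper names, and the package names (`pyspark`, `sedona`) and the jar folder passed to them are assumptions to adapt to your own installation.

```python
from importlib.util import find_spec
from pathlib import Path


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


def list_jars(folder: Path) -> list[str]:
    """Return the sorted names of the .jar files in `folder` (empty if the folder is absent)."""
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir() if p.is_file() and p.suffix == ".jar")


if __name__ == "__main__":
    # assumed package names; adjust them to your environment
    print("missing python packages:", missing_packages(["pyspark", "sedona"]))
    # assumed jar location, relative to the project root
    print("jars found:", list_jars(Path.cwd().parent / "jars" / "sedona-35-212-172"))
```

An empty list of missing packages and a non-empty list of jars suggest that the environment is ready; neither check validates version compatibility between Spark, Scala and Sedona.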

In [2]:
from sedona.spark import *
from pyspark.sql import SparkSession, DataFrame
from pathlib import Path
from pyspark.sql.functions import trim, split, expr, col

In [3]:
# locate the project root (the parent of the current working directory)
project_root_dir = Path.cwd().parent
print(project_root_dir.as_posix())

C:/Users/PLIU/Documents/git/Seminar_PySpark_Sedona_GeoParquet


In [4]:
# here we choose sedona 1.7.2 for spark 3.5.*, built with scala 2.12
jar_folder = project_root_dir / "jars" / "sedona-35-212-172"
jar_list = [str(jar) for jar in jar_folder.iterdir() if jar.is_file()]
jar_path = ",".join(jar_list)

# build a sedona session (sedona = 1.7.2) offline
spark = SparkSession.builder \
    .appName("sedona_tutorial") \
    .master("local[*]") \
    .config("spark.jars", jar_path) \
    .getOrCreate()

In [5]:
# create a sedona context
sedona = SedonaContext.create(spark)

In [6]:
sc = spark.sparkContext
# use UTF-8 as the default charset
sc.setSystemProperty("sedona.global.charset", "utf8")

## 2. Read a parquet file

In [7]:
data_dir = project_root_dir / "data"
fr_immo_transaction_path = data_dir / "large_ds/fr_immo_transaction.parquet"
fr_immo_transactions_df = spark.read.parquet(fr_immo_transaction_path.as_posix())

In [8]:
required_col = ["id_transaction", "date_transaction", "prix", "departement", "ville", "code_postal", "adresse",
                "type_batiment", "n_pieces", "surface_habitable", "latitude", "longitude"]
clean_fr_immo_df = fr_immo_transactions_df.select(required_col)

In [9]:
# optionally cache the dataframe if it is reused several times (disabled here)
# clean_fr_immo_df.cache()
clean_fr_immo_df.show()

+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+----------------+----------------+
|id_transaction|date_transaction|    prix|departement|               ville|code_postal|             adresse|type_batiment|n_pieces|surface_habitable|        latitude|       longitude|
+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+----------------+----------------+
|        141653|      2014-01-02|197000.0|         01|             TREVOUX|       1600|  6346 MTE DES LILAS|  Appartement|       4|               84|45.9423014034837|4.77069364742062|
|        141970|      2014-01-02|157500.0|         01|              VIRIAT|       1440|1369 RTE DE STRAS...|       Maison|       4|              103|46.2364072868351|5.26293493674271|
|        139240|      2014-01-02|112000.0|         01|SAINT-JEAN-SUR-VEYLE|     

In [10]:
clean_fr_immo_df.printSchema()

root
 |-- id_transaction: integer (nullable = true)
 |-- date_transaction: date (nullable = true)
 |-- prix: double (nullable = true)
 |-- departement: string (nullable = true)
 |-- ville: string (nullable = true)
 |-- code_postal: integer (nullable = true)
 |-- adresse: string (nullable = true)
 |-- type_batiment: string (nullable = true)
 |-- n_pieces: integer (nullable = true)
 |-- surface_habitable: integer (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)



## 3. Create a geometry column

Notice that in the raw data frame, the `latitude` and `longitude` columns are of type double. We need to combine them into a single geometry column before we can run spatial operations, such as distance computations or spatial joins with other geospatial data.
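
Both `ST_Point(x, y)` and the WKT text format expect the x coordinate (longitude) first and the y coordinate (latitude) second, which is easy to get backwards. The helper below is a small illustrative sketch in plain Python; `point_wkt` is a hypothetical name, not a Sedona function, and the range checks are an added safeguard that only catches some swapped inputs.

```python
def point_wkt(longitude: float, latitude: float) -> str:
    """Build a WKT point string; WKT puts x (longitude) before y (latitude)."""
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    return f"POINT ({longitude} {latitude})"


# coordinates of the first row shown in the table above
print(point_wkt(4.77069364742062, 45.9423014034837))
```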

In [11]:
fr_immo_geometry_df = clean_fr_immo_df.withColumn("geo_coord", ST_Point(col("longitude"), col("latitude"))).drop(
    "longitude", "latitude")

In [12]:
fr_immo_geometry_df.show()

+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+--------------------+
|id_transaction|date_transaction|    prix|departement|               ville|code_postal|             adresse|type_batiment|n_pieces|surface_habitable|           geo_coord|
+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+--------------------+
|        141653|      2014-01-02|197000.0|         01|             TREVOUX|       1600|  6346 MTE DES LILAS|  Appartement|       4|               84|POINT (4.77069364...|
|        141970|      2014-01-02|157500.0|         01|              VIRIAT|       1440|1369 RTE DE STRAS...|       Maison|       4|              103|POINT (5.26293493...|
|        139240|      2014-01-02|112000.0|         01|SAINT-JEAN-SUR-VEYLE|       1290|5174  SAINT JEAN ...|       Maison|       3|              

## 4. Get all immo transactions of 2020

The `date_transaction` column is of type date, so we can extract its year with `year()` and filter on the result.

In [13]:
from pyspark.sql.functions import year

transaction_immo_2020 = fr_immo_geometry_df.filter(year(col("date_transaction")) == 2020)
transaction_immo_2020.show(5)

+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+--------------------+
|id_transaction|date_transaction|    prix|departement|               ville|code_postal|             adresse|type_batiment|n_pieces|surface_habitable|           geo_coord|
+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+--------------------+
|         67688|      2020-01-02|182925.0|         01|   AMBERIEU-EN-BUGEY|       1500|78 AV DU GEN SARRAIL|  Appartement|       5|               93|POINT (5.33384020...|
|         69832|      2020-01-02|430000.0|         01|     PREVESSIN-MOENS|       1280| 134 CHE DES HAUTINS|  Appartement|       5|              105|POINT (6.08684459...|
|         69585|      2020-01-02|165000.0|         01|              CHEVRY|       1170|  347 RUE DU CHATEAU|  Appartement|       3|              

## 5. Get all immo transactions near CASD

In [23]:
# casd geo location
casd_latitude = "48.8190155"
casd_longitude = "2.3081911"
casd_geo = f"POINT({casd_longitude} {casd_latitude})"

distance = 500.0  # search radius in meters; the exercise statement asks for 3 km, i.e. 3000.0



In [26]:
from pyspark.sql.functions import asc, lit


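def get_near_immo_transaction(geo_df: DataFrame, target_loc: str, distance: float) -> DataFrame:
    """
    Return the transactions located within a given distance of a target location.
    :param geo_df: data frame with a geometry column named `geo_coord`
    :param target_loc: target location as a WKT point, e.g. "POINT(lon lat)"
    :param distance: maximum distance to the target, in meters
    :return: the matching rows with an extra `distance_meter` column, nearest first
    """
    tmp_df = geo_df.withColumn("distance_meter", ST_DistanceSphere(ST_GeomFromWKT(lit(target_loc)), col("geo_coord")))

    # filter before sorting so that only the matching rows are ordered
    return tmp_df.filter(col("distance_meter") <= distance).orderBy(asc("distance_meter"))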
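
To sanity-check the values in `distance_meter`, the great-circle distance can be recomputed by hand. The sketch below uses the haversine formula in plain Python. It assumes a spherical earth with a mean radius of 6371008 m, which is meant to match the default documented for `ST_DistanceSphere`; treat that radius as an assumption and expect small differences from Sedona's output. `haversine_meters` is a hypothetical helper name.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371008.0  # assumed mean earth radius, in meters


def haversine_meters(lon1: float, lat1: float, lon2: float, lat2: float,
                     radius: float = EARTH_RADIUS_M) -> float:
    """Great-circle distance in meters between two lon/lat points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    # clamp to 1.0 to guard against rounding errors for near-antipodal points
    return 2 * radius * asin(min(1.0, sqrt(a)))


# CASD location used in this notebook
casd_lon, casd_lat = 2.3081911, 48.8190155
# a point 0.001 degree further north, roughly 111 m away
print(haversine_meters(casd_lon, casd_lat, casd_lon, casd_lat + 0.001))
```

Because the result is in meters, the `distance` threshold passed to `get_near_immo_transaction` is also expressed in meters.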

In [27]:
immo_transaction_near_casd = get_near_immo_transaction(fr_immo_geometry_df, casd_geo, distance)

In [29]:
immo_transaction_near_casd.count()

2372

In [28]:
immo_transaction_near_casd.show(5)


+--------------+----------------+---------+-----------+--------+-----------+--------------------+-------------+--------+-----------------+--------------------+-----------------+
|id_transaction|date_transaction|     prix|departement|   ville|code_postal|             adresse|type_batiment|n_pieces|surface_habitable|           geo_coord|   distance_meter|
+--------------+----------------+---------+-----------+--------+-----------+--------------------+-------------+--------+-----------------+--------------------+-----------------+
|      14144156|      2019-12-18|1007200.0|         92|MALAKOFF|      92240|9 RUE FRANCOIS CO...|  Appartement|       2|              139|POINT (2.30806162...|30.35060097718332|
|      14168314|      2019-07-29|1164550.0|         92|MALAKOFF|      92240|9 RUE FRANCOIS CO...|  Appartement|       3|              103|POINT (2.30806162...|30.35060097718332|
|      14042392|      2020-02-27|1620000.0|         92|MALAKOFF|      92240|9 RUE FRANCOIS CO...|  Appartement