# 2. Use Sedona to read various geospatial data formats

To work with geospatial data, we need to read and write data in geospatial formats. The **full list of constructors for the geo data types** can be found [here](https://sedona.apache.org/1.7.2/api/sql/Constructor/).
Sedona `1.7.2` supports at least 10 formats.

In this tutorial, we will use Sedona to read geospatial data formats such as:
- plain text strings without a format
- CSV/TSV
- WKT/WKB
- EWKT/EWKB
- GML
- KML
- GeoJSON
- shapefile
- GeoHash
- GeoParquet
- PBF
- Raster (we will talk about this later)
- etc.

This website: https://data.enseignementsup-recherche.gouv.fr/explore/dataset/fr-esr-referentiel-geographique/export/ shows how the French ministry shares their geospatial data.

## Geometry data type

To use Sedona for geospatial operations (e.g. calculate distance, area hierarchy, etc.), we first need to construct a dataframe with a geometry column. A geometry column can have the following types:
- Point: a point on the map with (x, y) coordinates
- Line: two or more points that form a line
- Polygon: a list of points that form a polygon
- Multi-Line: a list of lines
- Multi-Polygon: a list of polygons
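
These geometry types are commonly written as WKT strings. The sketch below is only illustrative: the coordinates and the helper `wkt_type` are made up for this example and are not part of Sedona.

```python
# Hypothetical WKT samples, one per geometry type listed above
wkt_examples = {
    "Point": "POINT (1.1 101.1)",
    "Line": "LINESTRING (0 0, 1 1, 2 1)",
    "Polygon": "POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))",
    "Multi-Line": "MULTILINESTRING ((0 0, 1 1), (2 2, 3 3))",
    "Multi-Polygon": "MULTIPOLYGON (((0 0, 1 0, 1 1, 0 0)), ((2 2, 3 2, 3 3, 2 2)))",
}


def wkt_type(wkt):
    """Return the geometry keyword that precedes the first parenthesis."""
    return wkt.split("(", 1)[0].strip()


for label, wkt in wkt_examples.items():
    print(f"{label:>13} -> {wkt_type(wkt)}")
```

Note that a polygon ring repeats its first point at the end to close the shape.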


In [1]:
from sedona.spark import *
from pyspark.sql import SparkSession, DataFrame
from pathlib import Path
# `lit` is needed later to pass literal strings to ST_LineStringFromText
from pyspark.sql.functions import trim, split, expr, col, lit

In [None]:
import os
os.environ["PYSPARK_PYTHON"]="python"
os.environ["PYSPARK_DRIVER_PYTHON"]="python"

In [2]:
# locate the project root (parent of the current working directory)
project_root_dir = Path.cwd().parent
print(project_root_dir.as_posix())

C:/Users/PLIU/Documents/git/Seminar_PySpark_Sedona_GeoParquet


In [3]:
data_dir = f"{project_root_dir}/data"

In [4]:
# here we choose sedona 1.8.0 for spark 3.5.* build with scala 2.12
sedona_version = "sedona-35-212-180"
jar_folder = Path(f"{project_root_dir}/jars/{sedona_version}")
jar_list = [str(jar) for jar in jar_folder.iterdir() if jar.is_file()]
jar_path = ",".join(jar_list)

# build a sedona session (sedona = 1.8.0) offline
spark = SparkSession.builder \
    .appName("sedona_tutorial") \
    .master("local[*]") \
    .config("spark.jars", jar_path) \
    .getOrCreate()


In [5]:
# create a sedona context
sedona = SedonaContext.create(spark)

In [6]:
sc = spark.sparkContext
# use UTF-8 as the default charset
sc.setSystemProperty("sedona.global.charset", "utf8")

## 1. Read geospatial data

In this section, we will use sedona to read various geospatial data formats.

### 1.1 Plain text file format

In the plain text file format, all geospatial coordinates are represented as strings. These values can be stored in a normal `csv file` with two columns (e.g. x, y). The content of the csv is just `plain text` strings.

#### 1.1.1 Point example

In the example below, we will construct a geo dataframe which contains a **Point** column.

In [7]:
point_file_path = f"{data_dir}/test_points.csv"

# read a normal csv
raw_point_df = sedona.read.format("csv").\
          option("delimiter",",").\
          option("header","true").\
          load(point_file_path)

raw_point_df.printSchema()

root
 |-- longitude: string (nullable = true)
 |-- latitude: string (nullable = true)



In [8]:
raw_point_df.show(5)

+---------+--------+
|longitude|latitude|
+---------+--------+
|      1.1|   101.1|
|      2.1|   102.1|
|      3.1|   103.1|
|      4.1|   104.1|
|      5.1|   105.1|
+---------+--------+
only showing top 5 rows



In [9]:
# cast the string columns to decimal first, then use `ST_Point` to build a geometry column from the two csv columns
geo_point_df = (
    raw_point_df
    .select(
        ST_Point(
            col("longitude").cast("decimal(24,20)"),
            col("latitude").cast("decimal(24,20)")
        ).alias("point")
    )
)

In [10]:

geo_point_df.printSchema()

root
 |-- point: geometry (nullable = true)



In [11]:
geo_point_df.show(5,truncate=False)

+-----------------+
|point            |
+-----------------+
|POINT (1.1 101.1)|
|POINT (2.1 102.1)|
|POINT (3.1 103.1)|
|POINT (4.1 104.1)|
|POINT (5.1 105.1)|
+-----------------+
only showing top 5 rows
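
To see what happens conceptually, here is a minimal pure-Python sketch (no Spark involved) that reads the same kind of csv content and builds WKT `POINT` strings. The inline `csv_text` and the helper `rows_to_wkt_points` are assumptions made for this illustration; it is not how Sedona implements `ST_Point`.

```python
import csv
import io

# Illustrative csv content mirroring the first rows of test_points.csv
csv_text = "longitude,latitude\n1.1,101.1\n2.1,102.1\n"


def rows_to_wkt_points(text):
    """Turn (longitude, latitude) string columns into WKT POINT strings."""
    reader = csv.DictReader(io.StringIO(text))
    points = []
    for row in reader:
        # csv values are plain strings, so cast them before building a geometry
        x = float(row["longitude"])
        y = float(row["latitude"])
        points.append(f"POINT ({x} {y})")
    return points


wkt_points = rows_to_wkt_points(csv_text)
print(wkt_points)
```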



#### 1.1.2 Line example

To create a line type, we can use the constructor **ST_LineStringFromText (Text:string, Delimiter:char)**. In the example below, it takes a delimited string of GPS coordinates and returns a geometry column.

In [12]:
geo_line_df = (
    spark.range(1)  # dummy DataFrame with 1 row
    .select(
        ST_LineStringFromText(
            lit("-74.0428197,40.6867969,-74.0421975,40.6921336,-74.0508020,40.6912794"),
            lit(",")
        ).alias("line")
    )
)
geo_line_df.printSchema()

root
 |-- line: geometry (nullable = true)



In [13]:
geo_line_df.show(5,truncate=False)

+----------------------------------------------------------------------------------+
|line                                                                              |
+----------------------------------------------------------------------------------+
|LINESTRING (-74.0428197 40.6867969, -74.0421975 40.6921336, -74.050802 40.6912794)|
+----------------------------------------------------------------------------------+
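
The input of `ST_LineStringFromText` is a flat, delimited list of `x,y,x,y,...` values. The pure-Python sketch below shows how such a string can be paired up into coordinates; the helper `linestring_from_text` is hypothetical and only mimics the input format, not Sedona's implementation.

```python
def linestring_from_text(text, delimiter=","):
    """Pair up a flat 'x,y,x,y,...' string and format it as a WKT LINESTRING."""
    values = [float(v) for v in text.split(delimiter)]
    if len(values) < 4 or len(values) % 2 != 0:
        raise ValueError("expected an even number of values and at least two points")
    # even positions are x (longitude), odd positions are y (latitude)
    pairs = zip(values[0::2], values[1::2])
    return "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in pairs) + ")"


line_wkt = linestring_from_text(
    "-74.0428197,40.6867969,-74.0421975,40.6921336,-74.0508020,40.6912794"
)
print(line_wkt)
```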



#### 1.1.3 Polygon example

To build a polygon, we can use the constructor **ST_GeomFromText()**, which parses a WKT string into a geometry.

In [14]:
us_county_file_path = f"{data_dir}/county_small.tsv"

In [15]:
raw_df = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load(us_county_file_path)
raw_df.show()

+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|                 _c0|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|_c8|  _c9|_c10| _c11|_c12|_c13|      _c14|    _c15|       _c16|        _c17|
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|POLYGON ((-97.019...| 31|039|00835841|31039|     Cuming|       Cuming County| 06| H1|G4020|NULL| NULL|NULL|   A|1477895811|10447360|+41.9158651|-096.7885168|
|POLYGON ((-123.43...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06| H1|G4020|NULL| NULL|NULL|   A| 682138871|61658258|+46.2946377|-123.4244583|
|POLYGON ((-104.56...| 35|011|00933054|35011|    De Baca|      De Baca County| 06| H1|G4020|NULL| NULL|NULL|   A|6015539696|29159492|+34.3592729|-104.3686961|
|POLYGON ((-96.910...| 31|109|00835876|31109| 

In [16]:
raw_poly_df = raw_df.select("_c0","_c6").withColumnRenamed("_c0","county_polygon").withColumnRenamed("_c6","county_name")
raw_poly_df.createOrReplaceTempView("gon_raw_table")
raw_poly_df.show(5)

+--------------------+----------------+
|      county_polygon|     county_name|
+--------------------+----------------+
|POLYGON ((-97.019...|   Cuming County|
|POLYGON ((-123.43...|Wahkiakum County|
|POLYGON ((-104.56...|  De Baca County|
|POLYGON ((-96.910...|Lancaster County|
|POLYGON ((-98.273...| Nuckolls County|
+--------------------+----------------+
only showing top 5 rows



In [17]:
raw_poly_df.printSchema()

root
 |-- county_polygon: string (nullable = true)
 |-- county_name: string (nullable = true)



In [18]:
geo_polygon_df = (
    raw_poly_df
    .select(
        ST_GeomFromText(col("county_polygon")).alias("county_shape"),
        col("county_name")
    )
)
geo_polygon_df.show(5)

+--------------------+----------------+
|        county_shape|     county_name|
+--------------------+----------------+
|POLYGON ((-97.019...|   Cuming County|
|POLYGON ((-123.43...|Wahkiakum County|
|POLYGON ((-104.56...|  De Baca County|
|POLYGON ((-96.910...|Lancaster County|
|POLYGON ((-98.273...| Nuckolls County|
+--------------------+----------------+
only showing top 5 rows



In [19]:
geo_polygon_df.printSchema()

root
 |-- county_shape: geometry (nullable = true)
 |-- county_name: string (nullable = true)



### 1.2 Read wkt and wkb file

**Well-known text (WKT)** is a `text markup language` for representing vector geometry objects on a map and spatial reference systems of spatial objects. A binary equivalent, known as **well-known binary (WKB)** is used to transfer and store the same information for geometry objects.

Geometries in a `WKT or WKB` file always occupy a single column, no matter how many coordinates they have. Sedona provides `WktReader and WkbReader` to create a `generic SpatialRDD`, which we then convert to a dataframe.

> You must use the wkt reader to read wkt file, and wkb reader to read wkb file.

`EWKT/EWKB` carry an extra `SRID (Spatial Reference Identifier) code` compared to WKT/WKB:

```sql
SELECT ST_AsText(ST_GeomFromEWKT('SRID=4269;POINT(40.7128 -74.0060)'))

-- output example
-- POINT(40.7128 -74.006)
```
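
To make the EWKT structure concrete, the sketch below separates the SRID prefix from the WKT body in plain Python. The helper `split_ewkt` is hypothetical and handles only the simple `SRID=<code>;<wkt>` form used in the query above.

```python
def split_ewkt(ewkt):
    """Return (srid, wkt); srid is None when there is no SRID prefix."""
    if ewkt.upper().startswith("SRID="):
        prefix, wkt = ewkt.split(";", 1)
        return int(prefix[len("SRID="):]), wkt.strip()
    return None, ewkt.strip()


srid, wkt = split_ewkt("SRID=4269;POINT(40.7128 -74.0060)")
print(srid, wkt)
```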

In [20]:
us_county_wkb_file_path = f"{data_dir}/county_small_wkb.tsv"

In [21]:
from sedona.core.formatMapper import WkbReader
from sedona.utils.adapter import Adapter

In [22]:
# The WKB string starts from column 0
wkbColumn = 0 
allowTopologyInvalidGeometries = True
skipSyntaxInvalidGeometries = False

spatialRdd = WkbReader.readToGeometryRDD(sedona.sparkContext, us_county_wkb_file_path, wkbColumn, allowTopologyInvalidGeometries, skipSyntaxInvalidGeometries)

In [23]:
geo_county_wkb_df = Adapter.toDf(spatialRdd,sedona).withColumnRenamed("geometry", "county_shape")
geo_county_wkb_df.show(5)

+--------------------+
|        county_shape|
+--------------------+
|POLYGON ((-97.019...|
|POLYGON ((-123.43...|
|POLYGON ((-104.56...|
|POLYGON ((-96.910...|
|POLYGON ((-98.273...|
+--------------------+
only showing top 5 rows



In [24]:
geo_county_wkb_df.printSchema()

root
 |-- county_shape: geometry (nullable = true)
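
For intuition about what WKB stores, here is a minimal sketch of the WKB layout of a 2D point using only the standard `struct` module: one byte-order byte, a 4-byte geometry type, then two 8-byte doubles. The helpers `point_to_wkb` and `wkb_to_point` are written for this illustration and cover points only.

```python
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # B: byte-order flag (1 = little-endian), I: geometry type (1 = Point), dd: x and y
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb):
    """Decode a 2D point from WKB, honoring the byte-order flag."""
    endian = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", wkb[1:21])
    if geom_type != 1:
        raise ValueError("this sketch only decodes Point geometries")
    return x, y


wkb = point_to_wkb(1.0, 2.0)
print(len(wkb), wkb.hex())
print(wkb_to_point(wkb))
```

A point always takes 21 bytes in this encoding, regardless of how many digits its coordinates have in text form.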



### 1.3 Read GeoJSON (Geographic JavaScript Object Notation)

Reference: https://sedona.apache.org/1.6.1/tutorial/sql

GeoJSON files can be organized in two different ways:
- single-line
- multi-line

#### 1.3.1 Read single-line GeoJSON

In the single-line GeoJSON organization, each line is a separate, self-contained GeoJSON object. Below is an example:

```json
{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}
```
Notice that each line starts with `{` and ends with `}`, which means it's a self-contained JSON object.

> This format is efficient for processing large datasets, because each line is an independent GeoJSON Feature which can be processed in parallel.  

In [25]:
us_county_json_file_path = f"{data_dir}/us_county.json"

In [26]:
raw_json_df = sedona.read.format("geojson").load(us_county_json_file_path)
raw_json_df.show(5)

+--------------------+--------------------+-------+
|            geometry|          properties|   type|
+--------------------+--------------------+-------+
|POLYGON ((-87.621...|{1500000US0107701...|Feature|
|POLYGON ((-85.719...|{1500000US0104502...|Feature|
|POLYGON ((-86.000...|{1500000US0105500...|Feature|
|POLYGON ((-86.574...|{1500000US0108900...|Feature|
|POLYGON ((-85.382...|{1500000US0106904...|Feature|
+--------------------+--------------------+-------+
only showing top 5 rows



In [27]:
raw_json_df.printSchema()

root
 |-- geometry: geometry (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- AFFGEOID: string (nullable = true)
 |    |-- ALAND: long (nullable = true)
 |    |-- AWATER: long (nullable = true)
 |    |-- BLKGRPCE: string (nullable = true)
 |    |-- COUNTYFP: string (nullable = true)
 |    |-- GEOID: string (nullable = true)
 |    |-- LSAD: string (nullable = true)
 |    |-- NAME: string (nullable = true)
 |    |-- STATEFP: string (nullable = true)
 |    |-- TRACTCE: string (nullable = true)
 |-- type: string (nullable = true)
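
The reason single-line GeoJSON parallelizes well can be sketched in plain Python: every line can be parsed on its own, without knowing anything about the other lines. The sample text and the helper `parse_line` are illustrative assumptions.

```python
import json

# Two self-contained GeoJSON features, one per line
single_line_text = (
    '{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}\n'
    '{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0]]},"properties":{"prop0":"value1"}}\n'
)


def parse_line(line):
    """Parse one line independently and return (geometry type, properties)."""
    feature = json.loads(line)
    return feature["geometry"]["type"], feature["properties"]


# Each call depends on a single line only, so the lines could be split across workers
rows = [parse_line(line) for line in single_line_text.splitlines() if line.strip()]
print(rows)
```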



#### 1.3.2 Read multi-line GeoJSON

The multi-line GeoJSON uses a global `{ "type": "FeatureCollection", }` to encapsulate all geo features in one JSON object. Below is an example:

```json
{ "type": "FeatureCollection",
    "features": [
      { "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
        "properties": {"prop0": "value0"}
        },
      { "type": "Feature",
        "geometry": {
          "type": "LineString",
          "coordinates": [
            [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
            ]
          },
        "properties": {
          "prop0": "value1",
          "prop1": 0.0
          }
        },
      { "type": "Feature",
         "geometry": {
           "type": "Polygon",
           "coordinates": [
             [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
               [100.0, 1.0], [100.0, 0.0] ]
             ]
         },
         "properties": {
           "prop0": "value2",
           "prop1": {"this": "that"}
           }
         }
       ]
}
```

The multi-line format is preferable when files need to be human-readable or manually edited.

As the entire file is a single JSON object, it's hard to process in parallel.

In [28]:
from pyspark.sql.functions import expr

multi_line_json_file_path = f"{data_dir}/multi_lines.json"

df_raw = sedona.read.format("geojson").option("multiLine", "true").load(multi_line_json_file_path)
          
df_raw.show(5,truncate=False)
df_raw.printSchema()

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+
|features                                                                                                                                                                                            |type             |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+
|[{POINT (102 0.5), {value0, NULL}, Feature}, {LINESTRING (102 0, 103 1, 104 0, 105 1), {value1, 0.0}, Feature}, {POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0)), {value2, {"this":"that"}}, Feature}]|FeatureCollection|
+-----------------------------------------------------------------------------------------------------------------------------------

As the entire file is a single JSON object (FeatureCollection), all features are considered items of one big array. To get one feature per row, we need to explode the enclosing array.

In [29]:
df = df_raw.selectExpr("explode(features) as features").select("features.*")

df.show()
df.printSchema()

+--------------------+--------------------+-------+
|            geometry|          properties|   type|
+--------------------+--------------------+-------+
|     POINT (102 0.5)|      {value0, NULL}|Feature|
|LINESTRING (102 0...|       {value1, 0.0}|Feature|
|POLYGON ((100 0, ...|{value2, {"this":...|Feature|
+--------------------+--------------------+-------+

root
 |-- geometry: geometry (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- prop0: string (nullable = true)
 |    |-- prop1: string (nullable = true)
 |-- type: string (nullable = true)
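
The same "explode" idea can be sketched without Spark: load the whole FeatureCollection as one object, then emit one row per item of its `features` array. The shortened sample document below is an assumption made for this illustration.

```python
import json

feature_collection_text = """
{ "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
     "properties": {"prop0": "value0"}},
    {"type": "Feature",
     "geometry": {"type": "LineString", "coordinates": [[102.0, 0.0], [103.0, 1.0]]},
     "properties": {"prop0": "value1", "prop1": 0.0}}
  ]
}
"""

# The whole file is a single JSON object, so it must be parsed in one piece
collection = json.loads(feature_collection_text)

# "Explode": one row per feature, with the feature's fields as columns
rows = [
    {"geometry": f["geometry"], "properties": f["properties"], "type": f["type"]}
    for f in collection["features"]
]
for row in rows:
    print(row["type"], row["geometry"]["type"], row["properties"])
```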



### 1.4 Read GML

**GML (Geography Markup Language)** is an `XML-based encoding standard` for geographic information developed by the `Open Geospatial Consortium (OGC)`, formerly the OpenGIS Consortium. You can find the official doc [here](https://www.ogc.org/publications/standard/gml/).
It has three major versions:
- GML 1
- GML 2
- GML 3

Sedona(<v1.6.1) only supports `GML1 and GML2` for now.  

In [30]:
gml_sample = """
<gml:LineString srsName="EPSG:4269">
        <gml:coordinates>
            -71.16028,42.258729
            -71.160837,42.259112
            -71.161143,42.25932
        </gml:coordinates>
</gml:LineString>
"""

gml_df = sedona.sql(f"SELECT ST_GeomFromGML('{gml_sample}') as line")

In [31]:
gml_df.show(5, truncate=False)
gml_df.printSchema()

+---------------------------------------------------------------------------+
|line                                                                       |
+---------------------------------------------------------------------------+
|LINESTRING (-71.16028 42.258729, -71.160837 42.259112, -71.161143 42.25932)|
+---------------------------------------------------------------------------+

root
 |-- line: geometry (nullable = true)
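
To show how a GML 2 `gml:coordinates` element is laid out (points separated by whitespace, `x,y` separated by commas), here is a sketch using the standard library XML parser. Two assumptions: the `xmlns:gml` declaration is added so that the standalone fragment parses, and the helper `gml_linestring_to_wkt` is hypothetical and handles only this LineString shape.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"

# Same sample as above, plus a namespace declaration so the stdlib parser accepts it
gml_text = f"""
<gml:LineString xmlns:gml="{GML_NS}" srsName="EPSG:4269">
    <gml:coordinates>
        -71.16028,42.258729
        -71.160837,42.259112
        -71.161143,42.25932
    </gml:coordinates>
</gml:LineString>
"""


def gml_linestring_to_wkt(text):
    """Read a GML 2 LineString and format its coordinates as WKT."""
    root = ET.fromstring(text.strip())
    coords = root.find(f"{{{GML_NS}}}coordinates").text
    # whitespace separates points, a comma separates x from y
    pairs = [tuple(float(v) for v in pair.split(",")) for pair in coords.split()]
    return "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in pairs) + ")"


gml_wkt = gml_linestring_to_wkt(gml_text)
print(gml_wkt)
```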



### 1.5 Read KML

**Keyhole Markup Language (KML)** is an `XML notation` for expressing geographic annotation and visualization within two-dimensional maps and three-dimensional Earth browsers. `KML was developed for use with Google Earth`, which was originally named `Keyhole Earth Viewer`.

Below is a complete KML file example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
  <name>New York City</name>
  <description>New York City</description>
  <Point>
    <coordinates>-74.006393,40.714172,0</coordinates>
  </Point>
</Placemark>
</Document>
</kml>
``` 

Notice that the coordinates have three components (longitude, latitude, altitude). With the altitude component, we can build a 3D map.




In [32]:
kml_sample = """
<Point>
    <coordinates>-74.006393,40.714172,0</coordinates>
</Point>
"""

kml_df = sedona.sql(f"SELECT ST_GeomFromKML('{kml_sample}') as point")

In [33]:
kml_df.show(5, truncate=False)
kml_df.printSchema()

+----------------------------+
|point                       |
+----------------------------+
|POINT (-74.006393 40.714172)|
+----------------------------+

root
 |-- point: geometry (nullable = true)
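
`ST_GeomFromKML` takes only the geometry fragment. For a complete KML file, the placemark name and the three coordinate components can be pulled out with the standard library, as in the sketch below. The helper `read_placemarks` is hypothetical and handles only `Point` placemarks.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

kml_text = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
  <name>New York City</name>
  <description>New York City</description>
  <Point>
    <coordinates>-74.006393,40.714172,0</coordinates>
  </Point>
</Placemark>
</Document>
</kml>
"""


def read_placemarks(text):
    """Return (name, longitude, latitude, altitude) for each Point placemark."""
    root = ET.fromstring(text.encode("utf-8"))
    result = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.find("kml:name", KML_NS).text
        coords = placemark.find("kml:Point/kml:coordinates", KML_NS).text
        lon, lat, alt = (float(v) for v in coords.strip().split(","))
        result.append((name, lon, lat, alt))
    return result


placemarks = read_placemarks(kml_text)
print(placemarks)
```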



### 1.6 Read Geohash

A **Geohash is a unique identifier** of a specific region on the Earth. The basic idea is that the Earth is divided into regions of user-defined size and each region is assigned a unique id, which is called its Geohash. 

You can look up the geohash of any region on this website: https://www.movable-type.co.uk/scripts/geohash.html

In [34]:
geohash_sample = "s00twy01mt"


geohash_df = sedona.sql(f"SELECT ST_GeomFromGeoHash('{geohash_sample}') as polygon")

In [35]:
geohash_df.show(5, truncate=False)
geohash_df.printSchema()

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|polygon                                                                                                                                                                                                      |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|POLYGON ((0.9999918937683105 0.9999972581863403, 0.9999918937683105 1.0000026226043701, 1.0000026226043701 1.0000026226043701, 1.0000026226043701 0.9999972581863403, 0.9999918937683105 0.9999972581863403))|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
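
The polygon above is the bounding box encoded by the geohash. The standard decoding can be sketched in a few lines of plain Python: each base-32 character contributes 5 bits, and the bits alternately halve the longitude and latitude intervals. The helper `geohash_to_bbox` is written for this illustration and is not Sedona's code.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_to_bbox(geohash):
    """Decode a geohash into (min_lon, min_lat, max_lon, max_lat)."""
    lon = [-180.0, 180.0]
    lat = [-90.0, 90.0]
    is_lon = True  # the first bit refines longitude, then the bits alternate
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            interval = lon if is_lon else lat
            mid = (interval[0] + interval[1]) / 2
            if bit:
                interval[0] = mid  # keep the upper half
            else:
                interval[1] = mid  # keep the lower half
            is_lon = not is_lon
    return lon[0], lat[0], lon[1], lat[1]


min_lon, min_lat, max_lon, max_lat = geohash_to_bbox("s00twy01mt")
print(min_lon, min_lat, max_lon, max_lat)
```

Longer geohashes add more bits, so each extra character shrinks the box further.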

### 1.7 Read shape file

A shapefile is a `vector data file format` commonly used for geospatial analysis. Shapefiles store the location, geometry, and attribution of `point, line, and polygon` features.

You can visit the wiki page via this [link](https://en.wikipedia.org/wiki/Shapefile).


In [36]:
airports_shape_file_path = f"{data_dir}/airports_shape"

# read the airports shapefile
airports_rdd = ShapefileReader.readToGeometryRDD(sc, airports_shape_file_path)
airports_df = Adapter.toDf(airports_rdd, sedona)


In [37]:
airports_df.printSchema()
airports_df.show(1, truncate=False)

root
 |-- geometry: geometry (nullable = true)
 |-- scalerank: string (nullable = true)
 |-- featurecla: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- abbrev: string (nullable = true)
 |-- location: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- wikipedia: string (nullable = true)
 |-- natlscale: string (nullable = true)

+---------------------------------------------+---------+--------------------------------------------------------------------------------+--------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------------------------+------------------------------------------------------------------------------------------------------------

In [38]:
airports_df.select("geometry","type","name","location","iata_code","gps_code").show(5, truncate=False)

+---------------------------------------------+--------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|geometry                                     |type                                              |name                                     

### 1.8 Read GeoParquet

GeoParquet is an **Open Geospatial Consortium (OGC) standard** that adds interoperable geospatial types `(Point, Line, Polygon)` to Parquet. As of 27/11/2024, the stable version is 1.1.0.

You can find the official site of geo-parquet [here](https://geoparquet.org/).

You can find the data format specification of the current version [here](https://geoparquet.org/releases/v1.1.0/). The supported geo types are:
- point
- line
- polygon
- multipoint
- multiline
- multipolygon

In [44]:
airports_parquet_path = f"{data_dir}/airports.parquet"

airports_parquet_df= sedona.read.format("geoparquet").load(airports_parquet_path)

In [45]:
airports_parquet_df.printSchema()
airports_parquet_df.show(5)

root
 |-- geometry: geometry (nullable = true)
 |-- scalerank: string (nullable = true)
 |-- featurecla: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- abbrev: string (nullable = true)
 |-- location: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- wikipedia: string (nullable = true)
 |-- natlscale: string (nullable = true)

+--------------------+---------+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+---------+
|            geometry|scalerank|          featurecla|                type|                name|abbrev|            location|            gps_code|           iata_code|           wikipedia|natlscale|
+--------------------+---------+--------------------+--------------------+--------------------+------+--------------------+--------------------+------------------

#### 1.8.1 Read metadata of geo parquet

As a Parquet file can contain custom metadata, we can read the geospatial metadata stored in a GeoParquet file.

In [46]:
metadata_df = sedona.read.format("geoparquet.metadata").load(airports_parquet_path)
metadata_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- version: string (nullable = true)
 |-- primary_column: string (nullable = true)
 |-- columns: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- encoding: string (nullable = true)
 |    |    |-- geometry_types: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- bbox: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
 |    |    |-- crs: string (nullable = true)
 |    |    |-- covering: string (nullable = true)



In [47]:
metadata_df.show(truncate=False, vertical=True)

-RECORD 0--------------------------------------------------------------------------------------------------------------------
 path           | file:/C:/Users/PLIU/Documents/git/Seminar_PySpark_Sedona_GeoParquet/data/airports.parquet                  
 version        | 1.1.0                                                                                                      
 primary_column | geometry                                                                                                   
 columns        | {geometry -> {WKB, [Point], [-175.135635, -53.005069825517666, 178.5600483699593, 71.289299], null, NULL}} 
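
In the GeoParquet specification, this information is stored as a JSON string under the `geo` key of the Parquet file metadata. The sketch below parses such a JSON document with the standard library. The JSON text is rebuilt by hand from the values printed above, so treat it as an illustration rather than the exact bytes in the file; `summarize_geo_metadata` is a hypothetical helper.

```python
import json

# Hand-written "geo" metadata, using the values shown in the output above
geo_metadata_text = """
{
  "version": "1.1.0",
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "encoding": "WKB",
      "geometry_types": ["Point"],
      "bbox": [-175.135635, -53.005069825517666, 178.5600483699593, 71.289299]
    }
  }
}
"""


def summarize_geo_metadata(text):
    """Extract the fields of the primary geometry column."""
    meta = json.loads(text)
    primary = meta["primary_column"]
    column = meta["columns"][primary]
    min_x, min_y, max_x, max_y = column["bbox"]
    return {
        "version": meta["version"],
        "primary_column": primary,
        "encoding": column["encoding"],
        "geometry_types": column["geometry_types"],
        "bbox_width": max_x - min_x,
        "bbox_height": max_y - min_y,
    }


summary = summarize_geo_metadata(geo_metadata_text)
print(summary)
```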



## 2. Write geospatial data with sedona

Natively, Sedona (v1.7.2) allows users to write geospatial data in:
  - wkt
  - wkb
  - plaintext(csv, parquet)
  - geoparquet
  - GeoJson

### 2.1 Prepare source data

We will take a subset of the airports as our source data. Notice that the dataset contains a geometry column, which holds the airport location.
 

In [48]:
airports_parquet_path = f"{data_dir}/parquet/airports"

airports_parquet_df= sedona.read.format("geoparquet").load(airports_parquet_path)

In [49]:
airports_parquet_df.printSchema()

root
 |-- geometry: geometry (nullable = true)
 |-- scalerank: string (nullable = true)
 |-- featurecla: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- abbrev: string (nullable = true)
 |-- location: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- wikipedia: string (nullable = true)
 |-- natlscale: string (nullable = true)



In [48]:
output_df = airports_parquet_df.select("geometry","name","iata_code","gps_code")

In [51]:
output_df.show(1,truncate=False,vertical=True)

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 geometry  | POINT (113.93501638737635 22.315332828086753)                                                                                                                                                                                                                  
 name      | Hong Kong Int'l                                                                                                                                                                                                                                                
 iata_code | HKG                                                                                                                                                                                 

In [52]:
output_df.printSchema()

root
 |-- geometry: geometry (nullable = true)
 |-- name: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- gps_code: string (nullable = true)



In [52]:
print(f"Total row number: {output_df.count()}")

Total row number: 281


### 2.2 Write as WKT

The WKT (Well-Known Text) file format stores the geometry column as `plain text (POINT(…), POLYGON(…))`. Its advantages are that it is human-readable and easy to modify.

Its disadvantages are `larger storage space`, `no indexing`, and `slower for large-scale use`.

> We can convert the geometry column to wkt column by using function `ST_AsText()`

In [53]:
out_dir = f"{data_dir}/tmp"
wkt_out_path = f"{out_dir}/airports_output_wkt"

In [55]:
# convert the geometry column to wkt column by using function ST_AsText()
airports_wkt = output_df.withColumn("wkt", ST_AsText(col("geometry")))
airports_wkt.show(1,truncate=False,vertical=True)

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 geometry  | POINT (113.93501638737635 22.315332828086753)                                                                                                                                                                                                                  
 name      | Hong Kong Int'l                                                                                                                                                                                                                                                
 iata_code | HKG                                                                                                                                                                                 

In [56]:
airports_wkt.printSchema()

root
 |-- geometry: geometry (nullable = true)
 |-- name: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- wkt: string (nullable = true)



In [57]:
airports_wkt.select("name", "iata_code", "gps_code", "wkt").write.option("header",True).mode("overwrite").csv(wkt_out_path)

> You can check the output file in `data/tmp/airports_output_wkt`
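
One detail worth knowing when WKT goes into a csv file: multi-point geometries contain commas, so the field has to be quoted to survive a round trip. The pure-Python sketch below demonstrates this with the standard `csv` module. The second row is a made-up example, and this is not the Spark csv writer itself.

```python
import csv
import io

rows = [
    {"name": "Hong Kong Int'l", "iata_code": "HKG",
     "wkt": "POINT (113.93501638737635 22.315332828086753)"},
    # Hypothetical row: a WKT value that contains a comma
    {"name": "Example line", "iata_code": "XXX", "wkt": "LINESTRING (0 0, 1 1)"},
]

# Write the rows as csv text; fields containing the delimiter get quoted
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "iata_code", "wkt"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back and check that the WKT strings are unchanged
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back == rows)
```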

### 2.2 Write as WKB

The WKB (Well-Known Binary) format stores the geometry column as binary.
Because WKB is binary, we can't write it to plain-text file formats such as csv.
But we can write it to native Parquet (which is not GeoParquet).

It's `more compact than WKT`, but it's `no longer human-readable`.


In [62]:
wkb_out_path = f"{out_dir}/airports_output_wkb"

In [58]:
# convert the geometry column to a wkb column by using function ST_AsBinary()
airports_wkb = output_df.withColumn("wkb", ST_AsBinary(col("geometry")))
airports_wkb.show(1,truncate=False,vertical=True)

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 geometry  | POINT (113.93501638737635 22.315332828086753)                                                                                                                                                                                                                  
 name      | Hong Kong Int'l                                                                                                                                                                                                                                                
 iata_code | HKG                                                                                                                                                                                 

In [59]:
airports_wkb.printSchema()

root
 |-- geometry: geometry (nullable = true)
 |-- name: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- wkb: binary (nullable = true)



In [65]:
# we can't write binary in a plaintext file. So we can only write the dataframe as parquet file
airports_wkb.select("name", "iata_code", "gps_code", "wkb").write.mode("overwrite").parquet(wkb_out_path)

### 2.3 Write as GeoJSON

Since `v1.6.1`, the GeoJSON data source in Sedona can be used to save a Spatial DataFrame to a `single-line JSON` file, with geometries written in GeoJSON format.

It's very widely supported (e.g. GIS, web maps). But it has drawbacks: `large storage space`, `slower parsing`, `no strong typing`.

In [66]:
geojson_out_path= f"{out_dir}/airports_output_geojson"
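# overwrite mode lets the cell be re-run without failing on an existing output folder
output_df.write.format("geojson").mode("overwrite").save(geojson_out_path)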
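
As a rough picture of what a single-line GeoJSON output contains, the sketch below turns one row into a compact one-line `Feature`. The row layout (`lon` and `lat` keys) and the helper `point_row_to_feature_line` are assumptions for this illustration; the exact text Sedona writes may differ.

```python
import json


def point_row_to_feature_line(row):
    """Serialize a row with lon/lat keys as a one-line GeoJSON Feature."""
    feature = {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
        # every other key of the row becomes a property
        "properties": {k: v for k, v in row.items() if k not in ("lon", "lat")},
    }
    # compact separators and no indentation keep the feature on a single line
    return json.dumps(feature, separators=(",", ":"))


row = {
    "lon": 113.93501638737635,
    "lat": 22.315332828086753,
    "name": "Hong Kong Int'l",
    "iata_code": "HKG",
}
feature_line = point_row_to_feature_line(row)
print(feature_line)
```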

### 2.4 Write as geoparquet

As geoparquet contains many new concepts, we will use another chapter to explain geoparquet. Here we only show a simple example.

In [67]:
geoparquet_path = f"{out_dir}/airports_output_geoparquet"
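# overwrite mode lets the cell be re-run without failing on an existing output folder
output_df.write.format("geoparquet").option("geoparquet.version","1.1.0").mode("overwrite").save(geoparquet_path)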

In [84]:
output_df.printSchema()

root
 |-- geometry: geometry (nullable = true)
 |-- name: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- gps_code: string (nullable = true)



## Don't forget to close spark session

If you don't close the spark session, the reserved resources can't be released, so your teammates can't use them.

In [None]:
spark.stop()