Working with Geopandas and Shapely

**Test this on a geospatially-enabled carto server with Sedona already installed.**

In [0]:
import sedona
import geopandas as gpd
import folium
from sedona.sql.types import GeometryType
from sedona.register import SedonaRegistrator
from sedona.utils.adapter import Adapter
from pyspark.sql.functions import col, expr
from shapely import wkt

Sedona before 1.6.0 only works with Shapely 1.x. If you want to work with Shapely 2.x, please use Sedona no earlier than 1.6.0.

If you use Sedona < 1.6.0, please use GeoPandas <= 0.11.1 since GeoPandas > 0.11.1 will automatically install Shapely 2.0. If you use Shapely, please use <= 1.8.5.

**BUT** if you are using Sedona 1.7.0 like we are, use Geopandas > 1.0.0
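
The version rule above can be written down as a small check. This is a minimal sketch in plain Python, not part of Sedona: the helper names `parse_version` and `shapely_ok_for_sedona` are made up for illustration, and the function only encodes the one rule stated here (Sedona before 1.6.0 needs Shapely 1.x).

```python
import re


def parse_version(text):
    """Turn a version string such as '1.8.5' into a tuple like (1, 8, 5).

    Hypothetical helper: only the leading digits of the first three
    dot-separated pieces are used, so '1.6.0rc1' becomes (1, 6, 0).
    """
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def shapely_ok_for_sedona(sedona_version, shapely_version):
    """Return False only for Sedona < 1.6.0 combined with Shapely 2.x or later."""
    if parse_version(sedona_version) >= (1, 6, 0):
        return True
    return parse_version(shapely_version)[0] < 2


print(shapely_ok_for_sedona("1.5.1", "2.0.1"))
print(shapely_ok_for_sedona("1.7.0", "2.0.6"))
```

In a notebook you could pass in the installed version strings, for example gpd.__version__ for GeoPandas or shapely.__version__ for Shapely.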

# Test Sedona Installation #

In [0]:
#test Sedona using SQL. Should return a point
print(f"sedona version: {sedona.version}")
df = spark.sql('SELECT ST_ASTEXT(ST_POINT(0,0)) AS test_point')
df.display()

Some sources say you need to load the Sedona context. The way we set up the cluster, the context is loaded when the cluster starts.

In [0]:
#you don't need to do this. The Sedona context is created when the Databricks cluster builds.
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sc = SedonaContext.create(config)



# Test Geopandas Installation #

In [0]:
print(f" geopandas version: {gpd.__version__}")

In [0]:
#Test Apache Arrow. It is not enabled by default but should be enabled on the cluster.
# It is necessary for reading geodataframes.
arrow_enabled = spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
print(f"Is Arrow enabled? {arrow_enabled}")

## Read a map into a GeoPandas GeoDataFrame

In [0]:


#Files that are not Delta tables can be stored easily in a volume.

#read a zipped Shapefile
shapefile_path = '/Volumes/moo_ops_workspace/geospatial/geospatial_files_volume/Community Districts.zip'
gdf = gpd.read_file(shapefile_path)
gdf.info()

#read a GeoJSON file. This overwrites the gdf read from the Shapefile above.
geojson_path = '/Volumes/moo_ops_workspace/geospatial/geospatial_files_volume/BusinessImprovementDistrict.geojson'
gdf = gpd.read_file(geojson_path)

#date columns will cause the geodataframe to break, so drop them if present.
#errors='ignore' skips any column that does not exist.
gdf = gdf.drop(columns=['created', 'modified'], errors='ignore')
gdf.info()
print(gdf.geometry.head(4))
m = folium.Map(location=[gdf.geometry.centroid.y.mean(), 
                         gdf.geometry.centroid.x.mean()
                         ], 
               zoom_start=10
            )
folium.GeoJson(gdf.to_json()).add_to(m)
display(m)


## Write GeoPandas GeoDataFrame as GeoJSON ##

https://sedona.apache.org/1.6.1/tutorial/sql/#load-geojson-data

In [0]:
test_path =  '/Volumes/moo_ops_workspace/geospatial/geospatial_files_volume/test.geojson'
gdf.to_file(test_path)

# Interoperate with GeoPandas #
Sedona Python implements serializers and deserializers that convert Sedona Geometry objects into Shapely BaseGeometry objects. This makes it possible to load data from a file with GeoPandas (see the drivers that Fiona supports) and create a Spark DataFrame from the GeoDataFrame object.

# Geopandas to Sedona #


You can use GeoPandas (which uses Shapely) to import geometry. Then you can pass the GeoPandas GeoDataFrame to Sedona.

In [0]:
#read the GeoPandas GeoDataFrame into a Sedona dataframe
#convert the geometry to well-known text (WKT) first.
gdf_wkt = gdf.assign(geometry=gdf.geometry.apply(lambda geom: geom.wkt))
gdf_wkt.info()

#convert geodataframe into vanilla Spark dataframe.
sdf = spark.createDataFrame(gdf_wkt)
sdf.display()




The data is now in a Spark dataframe. You need to parse the geometry column as a well-known text geometry.

In [0]:
#parse the geometry string column as wkt
sedona_sdf_from_wkt = sdf.select(col("geometry"), expr("ST_GeomFromWKT(geometry) AS geom"))

#can't display a geospatial dataframe unless you convert to text
#sedona_sdf_from_wkt.display()


In [0]:
# you cannot display a geospatial dataframe, so this call is expected to fail
sedona_sdf_from_wkt.display()

# Display a Sedona Geospatial Dataframe by converting it back to a GeoPandas GeoDataframe

To display a Sedona dataframe, convert the geometry back to WKT, bring the result into pandas, and rebuild a GeoPandas GeoDataFrame from it. I don't know how to display it directly.

In [0]:
#convert it back to wkt
sdf_back_to_wkt = sedona_sdf_from_wkt.select(expr("ST_ASTEXT(geom) AS geometry"))

#convert it back to pandas
sdf_to_pd = sdf_back_to_wkt.toPandas()

#convert WKT string to Shapely geometry object using shapely.wkt
pd_to_gdf = sdf_to_pd.copy()
pd_to_gdf['geometry'] = sdf_to_pd['geometry'].apply(wkt.loads)

#convert the pandas dataframe with Shapely geometry back to a GeoDataFrame to get centroids.
gdf = gpd.GeoDataFrame(pd_to_gdf, geometry='geometry')

#plot using folium

m = folium.Map(location=[gdf.geometry.centroid.y.mean(),
                         gdf.geometry.centroid.x.mean()
                         ],
                zoom_start=10
                )


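#serialize the GeoDataFrame (gdf), not the plain pandas dataframe (pd_to_gdf).
#pandas cannot serialize Shapely objects and typically fails with
#"OverflowError: Maximum recursion level reached".
try:
    folium.GeoJson(gdf.to_json()).add_to(m)
except OverflowError:
    print("OverflowError while serializing the map.\n Simplifying the geometry and retrying")
    gdf_simplified = gdf.simplify(tolerance=0.0001)
    folium.GeoJson(gdf_simplified.to_json()).add_to(m)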
display(m)

# From Spark Sedona to Geopandas #

# From Spark DataFrame to GeoPandas GeoDataFrame #
Convert the vanilla Spark dataframe back into a GeoDataFrame for display.

In [0]:
from shapely import wkt
df_from_spark = sdf.toPandas()
df_from_spark['geometry'] = df_from_spark['geometry'].apply(wkt.loads)
df_from_spark.info()
gdf_from_spark = gpd.GeoDataFrame(df_from_spark, geometry='geometry')
m = folium.Map(location=[gdf_from_spark.geometry.centroid.y.mean(), 
                         gdf_from_spark.geometry.centroid.x.mean()
                         ], 
               zoom_start=10
            )
folium.GeoJson(gdf_from_spark.to_json()).add_to(m)
display(m)

# Read GeoJSON into Sedona natively #
This doesn't work so well. **The easiest thing to do is read as geopandas and convert to Sedona.**

One way is to do spark.read.json. But this only imports it as a Spark Dataframe, not a geospatial dataframe.


In [0]:
'''
 0   BIDID       76 non-null     int32   
 1   BID         76 non-null     object  
 2   SHAPE_AREA  76 non-null     float64 
 3   SHAPE_LEN   76 non-null     float64 
 4   borough     76 non-null     object  
 5   geometry    76 non-null     geometry
 '''
schema = "BIDID INT, BID STRING, SHAPE_AREA DOUBLE, SHAPE_LEN DOUBLE, borough STRING, geometry STRING"
#schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>";
#sdf = sedona.read.schema(schema).json(test_path)

sdf_from_geojson = spark.read.json(geojson_path)
sdf_from_geojson.display()

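As you can see, Spark reports corrupt records. The file itself is fine: by default spark.read.json expects one complete JSON object per line, so any line that is not a complete object ends up in the _corrupt_record column. However, you can use the partly parsed result to tweak the schema and get the correct results.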
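
To see where corrupt records can come from, here is a minimal sketch in plain Python with no Spark involved. It uses a made-up two-feature file laid out one feature per line and compares parsing it line by line with parsing it as one document. Python's json module stands in for a line-by-line reader here; Spark's own parser may be more lenient on some lines, so treat the counts as an illustration only.

```python
import json

# A tiny, made-up FeatureCollection written one feature per line.
geojson_text = """{"type": "FeatureCollection", "features": [
{"type": "Feature", "properties": {"BIDID": 1}, "geometry": {"type": "Point", "coordinates": [0, 0]}},
{"type": "Feature", "properties": {"BIDID": 2}, "geometry": {"type": "Point", "coordinates": [1, 1]}}
]}"""


def count_parseable_lines(text):
    """Parse each line on its own and count successes and failures."""
    good, bad = 0, 0
    for line in text.splitlines():
        try:
            json.loads(line)
            good += 1
        except json.JSONDecodeError:
            bad += 1
    return good, bad


# Line by line, most lines are not complete JSON objects.
print(count_parseable_lines(geojson_text))

# Parsed as one document, the same text is valid and holds two features.
print(len(json.loads(geojson_text)["features"]))
```

Spark's JSON reader has a multiLine option for files that hold a single JSON document spread over several lines. I have not tested it against this file.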

In [0]:
#you may unpack the json with a select query. AI will speed this up.
sdf_unpacked = sdf_from_geojson.select(
    col("geometry.coordinates")[0].alias("geometry"),
    col("properties.BIDID").alias("BIDID"),
    col("properties.BID").alias("BID"),
    col("properties.SHAPE_AREA").alias("SHAPE_AREA"),
    col("properties.SHAPE_LEN").alias("SHAPE_LEN"),
    col("properties.borough").alias("borough")
)
display(sdf_unpacked)

But how do you convert an array of an array of strings to a Sedona geometry?

In [0]:
#this isn't working because geometry is packed inside a double array
st_gdf = sdf_unpacked.withColumn("geometry_string", expr("CAST(geometry AS STRING)"))

st_gdf = st_gdf.withColumn("geometry",
                           expr("ST_ASWKT(ST_GeomFromGeoJSON(CONCAT('{\"type\": \"Polygon\", \"coordinates\": ', geometry_string, '}')))")
                          )
'''
st_gdf = st_gdf.withColumn(
    "geometry",
    expr(
        "ST_ASWKT(ST_GeomFromGeoJSON(CONCAT('{\" \"type\": \"Polygon\", \"coordinates\": ', geometry_string, '\"}')))"
    )
)
'''
#you cannot display this geospatially enabled geodataframe
st_gdf.display()
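
One way around the nested-array problem is to build the WKT string from the coordinates directly. The sketch below is plain Python and is not a Sedona API: the function names are made up for illustration, and it assumes the coordinates arrive as nested lists of numbers (or numeric strings) in the standard GeoJSON Polygon or MultiPolygon layout.

```python
def position_to_wkt(position):
    """Format one [x, y, ...] position as 'x y', using only x and y."""
    # float() also accepts numeric strings such as "40.7".
    return f"{float(position[0])!r} {float(position[1])!r}"


def ring_to_wkt(ring):
    """Format one linear ring, a list of positions."""
    return "(" + ", ".join(position_to_wkt(p) for p in ring) + ")"


def polygon_body(rings):
    """Format the rings of one polygon, outer ring first."""
    return "(" + ", ".join(ring_to_wkt(r) for r in rings) + ")"


def geojson_geometry_to_wkt(geometry):
    """Convert a GeoJSON Polygon or MultiPolygon dict to a WKT string."""
    kind = geometry["type"]
    coords = geometry["coordinates"]
    if kind == "Polygon":
        return "POLYGON " + polygon_body(coords)
    if kind == "MultiPolygon":
        return "MULTIPOLYGON (" + ", ".join(polygon_body(p) for p in coords) + ")"
    raise ValueError(f"unsupported geometry type: {kind}")


triangle = {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}
print(geojson_geometry_to_wkt(triangle))
```

The resulting string can then go through ST_GeomFromWKT, the same function used earlier in this notebook.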

## Some other attempts at reading a geojson into Sedona that failed ##

In [0]:
#this does not work. there is no spark source type called geojson
sdf = spark.read.format("geojson").load(geojson_path)
display(sdf)

In [0]:
#there is no geojson method
spark.read.geojson(geojson_path).display()

In [0]:
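#this does not work
#note: the Sedona docs call readToGeometryRDD with a SparkContext (sc) as the
#first argument, not the SparkSession passed here.
from sedona.core.formatMapper import GeoJsonReader
print(geojson_path)
geojson_reader : GeoJsonReader = GeoJsonReader()
sdf_direct = geojson_reader.readToGeometryRDD(spark, geojson_path)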

In [0]:
#this is listed in the documentation but doesn't work
geojson_reader.createSpatialRDD(spark, geojson_path)

To display the result, convert it to pandas.