# PySpark - GeoPandas Example

*Rob Knapen, Wageningen Environmental Research*
<br>

This notebook investigates using GeoPandas for spatial data operations and PySpark to (potentially) run them on a compute cluster. This could be useful for processing large datasets of species observations.

Note that installation is slightly tricky because PySpark mixes Python with the JVM-based (Scala/Java) Spark and relies on recent developments, such as Apache Arrow, to bridge the two. Transporting spatial data between the two environments adds further complexity. This notebook was tested with Spark 3.4.0 (the latest version at the time of writing).

### Import Packages
Some imports are for future use :-)

Hint: Make sure the pyspark package version matches the installed Apache Spark version!
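A minimal sketch of how the version hint could be checked. The helper names `major_minor` and `versions_compatible` are hypothetical, and treating "matches" as "same major and minor number" is an assumption; a stricter comparison may be needed for some setups.

```python
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Return the (major, minor) part of a version string such as '3.4.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_compatible(pyspark_version: str, spark_version: str) -> bool:
    """Assumption: versions are compatible when major and minor numbers agree."""
    return major_minor(pyspark_version) == major_minor(spark_version)


# Look up the installed pyspark package without importing it.
try:
    installed = metadata.version("pyspark")
    print(f"pyspark package version: {installed}")
except metadata.PackageNotFoundError:
    installed = None
    print("pyspark is not installed in this environment")

# Once a Spark session exists, compare against the running Spark version:
#   assert versions_compatible(installed, spark.version)
```

Running the comparison right after creating the session makes a mismatch visible early, before it surfaces as a less obvious error.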

In [None]:
import os
os.environ['USE_PYGEOS'] = '0'

import geopandas.geoseries
import matplotlib.pyplot as plt

import matplotlib
matplotlib.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (10, 5)

# for exponential backoff when calling APIs
from retrying import retry

# PySpark libraries
import pyspark.sql.functions as func
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DecimalType
from pyspark.sql import SparkSession

# Spatial pandas libraries
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
from shapely.geometry import Point, Polygon, shape
from shapely import wkb, wkt

# Used to decode data from Java
from ast import literal_eval as make_tuple

### Create a Spark Session

In [None]:
spark = SparkSession.builder.appName("fairicube-geopandas").getOrCreate()
sc = spark.sparkContext
sc
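The introduction mentions Apache Arrow as one of the bridges between Python and the JVM. As a hedged sketch, the two Spark SQL settings below switch on Arrow-based conversion between pandas and Spark DataFrames. Whether this notebook needs them is an assumption, and `ARROW_CONF` and `apply_conf` are hypothetical names introduced for illustration.

```python
# Settings that enable Arrow-based pandas <-> Spark conversion.
ARROW_CONF = {
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    # fall back to the non-Arrow path if the Arrow conversion fails
    "spark.sql.execution.arrow.pyspark.fallback.enabled": "true",
}


def apply_conf(builder, conf):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage with the session builder from this notebook:
#   builder = SparkSession.builder.appName("fairicube-geopandas")
#   spark = apply_conf(builder, ARROW_CONF).getOrCreate()
```

Keeping the settings in one dictionary makes it easy to see, and to change, how the session was configured.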

### Load sample species observation data
The example uses a dataset from the Dutch 'Nationale Databank Flora en Fauna' (ndff.nl). We hope it will become available as open data soon, but it is not yet. However, we have permission to use it for the FAIRiCUBE EU project.

In [None]:
# the NDFF datafile to process
species_filename = "../../../local/data/ndff/broedvogels_2016.csv"

# the columns to drop right away
always_drop_cols_from_source = None

In [None]:
species_df = pd.read_csv(species_filename, header='infer', sep=';', on_bad_lines='warn')

# remove columns that are not needed
if always_drop_cols_from_source:
    species_df.drop(columns=always_drop_cols_from_source, inplace=True)

# remove the crs prefix from the wkt data
species_df['wkt_excl_crs'] = species_df['wkt'].map(lambda x: x.split(';')[1], na_action='ignore')
species_df.drop(columns=['wkt'], inplace=True)
species_df.rename(columns={'wkt_excl_crs': 'wkt'}, inplace=True)

species_df.head(5)
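The prefix removal above takes the second part of a `;`-separated string, which raises an `IndexError` for any value that has no prefix. The sketch below shows a more tolerant variant. It assumes the column holds values shaped like `SRID=28992;POINT (...)`; the sample strings are made up for illustration and the helper name `strip_crs_prefix` is hypothetical.

```python
def strip_crs_prefix(value: str) -> str:
    """Return the WKT part of '<crs>;<wkt>', or the input if there is no prefix."""
    _prefix, sep, rest = value.partition(";")
    return rest.strip() if sep else value.strip()


# Made-up sample values, with and without a CRS prefix.
print(strip_crs_prefix("SRID=28992;POINT (155000 463000)"))  # POINT (155000 463000)
print(strip_crs_prefix("POINT (155000 463000)"))             # POINT (155000 463000)
```

It could be applied with `species_df['wkt'].map(strip_crs_prefix, na_action='ignore')`.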

In [None]:
# get stats on the numerical data
species_df.describe()

In [None]:
# get stats on the species
species_df['sci_name'].value_counts(sort=True, dropna=True)

### Create a GeoPandas DataFrame
The observations have spatial attributes, so lift them into a GeoPandas DataFrame to be able to process them.

In [None]:
# construct a GeoDataFrame, with the data using the Dutch RD coordinate reference system
gs = gpd.GeoSeries.from_wkt(species_df['wkt'])
species_gdf = gpd.GeoDataFrame(species_df, geometry=gs, crs="EPSG:28992")

# transform the dataset to the more common WGS84 (unprojected) CRS
species_gdf.to_crs(crs="EPSG:4326", inplace=True)
species_gdf.drop(columns=['wkt'], inplace=True)

species_gdf.head(5)
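After reprojecting, a quick sanity check on the extent can catch a wrong source CRS or swapped axes. The sketch below tests whether a bounds tuple, such as the one returned by the GeoPandas `total_bounds` attribute, falls inside a rough longitude/latitude box around the Netherlands. The box limits are an illustrative assumption, not an official extent, and `NL_BBOX` and `bounds_within` are hypothetical names.

```python
# Rough WGS84 box around the Netherlands with generous margins (assumption).
NL_BBOX = (3.0, 50.5, 7.5, 54.0)  # (min_lon, min_lat, max_lon, max_lat)


def bounds_within(bounds, bbox=NL_BBOX):
    """Check that (minx, miny, maxx, maxy) lies completely inside bbox."""
    minx, miny, maxx, maxy = bounds
    inside_x = bbox[0] <= minx <= maxx <= bbox[2]
    inside_y = bbox[1] <= miny <= maxy <= bbox[3]
    return bool(inside_x and inside_y)


# Made-up bounds in degrees: inside the box.
print(bounds_within((4.2, 51.3, 6.1, 53.2)))                   # True
# Made-up bounds still in RD metres: far outside the box.
print(bounds_within((13000.0, 306000.0, 278000.0, 619000.0)))  # False

# Usage with the reprojected data: bounds_within(species_gdf.total_bounds)
```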

In [None]:
# display a single sample observation
sample_gdf = species_gdf[0:1]
sample_gdf.plot(column='sci_name', categorical=True, legend=True)
plt.show()

In [None]:
# display all species observations (this might take a while to draw)
species_gdf.plot(column='sci_name', categorical=False, legend=False)
plt.show()

In [None]:
# or use explore(), but not for large datasets
# species_gdf[0:100].explore()

### Create a Spark DataFrame
Here comes the trickier part: lifting the GeoDataFrame into a (distributed) Spark DataFrame.

In [None]:
# take a small sample to experiment with
small_gdf = species_gdf[0:100000].copy()

# have to convert the geometry objects (back) to wkt strings for Spark compatibility
small_gdf['wkt'] = pd.Series(
    small_gdf['geometry'].map(lambda x: str(x.wkt), na_action='ignore'),
    index=small_gdf.index, dtype='string'
)

# get rid of the geometry objects that Spark cannot automatically interpret
small_gdf.drop(columns=['geometry'], inplace=True)

small_gdf

In [None]:
# now create a Spark DataFrame from the GeoPandas DataFrame
spark_df = spark.createDataFrame(data=small_gdf)
spark_df.printSchema()

In [None]:
spark_df.show(10, truncate=True)

In [None]:
# now we can put Spark to work ...
spark_df.select("sci_name", "orig_abundance", "straal").summary().show()