# MobilityPySpark UDTs

This notebook is a basic example of how MobilityPySpark handles user-defined types (UDTs).

In [1]:
from pymeos import *

from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F

from pysparkmeos.UDT.MeosDatatype import *
from pysparkmeos.utils.udt_appender import udt_append
from pysparkmeos.utils.utils import *
from pysparkmeos.UDF.udf import *

from typing import *
import os, sys

## Initialize PySpark and PyMEOS

In [2]:
# Initialize PyMEOS
pymeos_initialize("UTC")

os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = "notebook"
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_PYTHON'] = sys.executable

# Initialize a Spark session
spark = SparkSession.builder \
    .appName("PySpark UDT Example with PyMEOS") \
    .master("local[3]") \
    .config("spark.default.parallelism", 3) \
    .config("spark.executor.memory", "3g") \
    .config("spark.executor.cores", 1) \
    .config("spark.driver.memory", "2g") \
    .config("spark.driver.maxResultSize", 0) \
    .config("spark.sql.allowMultipleTableArguments.enabled", True) \
    .getOrCreate()

#spark.sparkContext.setLogLevel("DEBUG")

# Append the UDT mapping to the PyMEOS classes
udt_append()

# Register the UDFs in spark
register_udfs_under_spark_sql(spark)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/23 09:16:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/07/23 09:17:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/07/23 09:17:03 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/07/23 09:17:11 WARN SimpleFunctionRegistry: The function length replaced a previously registered function.
24/07/23 09:17:11 WARN SimpleFunctionRegistry: The function nearest_approach_distance replaced a previously registered function.


An example dataset is already prepared; let's explore it first.

In [3]:
data_path = '../datasets/preproc.csv'

In [4]:
!head -n 2 $data_path

icao24,pointStr
34718e,POINT(1.9229736328125 40.87294006347656)@2022-06-27 00:00:00+00
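
Each `pointStr` value pairs a WKT point with a timestamp, separated by `@`. As a rough illustration of that layout only, the standard-library sketch below splits one value into its parts. It is not how PyMEOS parses the text, and `INSTANT_RE` and `split_instant` are hypothetical names introduced here.

```python
import re
from datetime import datetime

# Hypothetical pattern for "POINT(x y)@timestamp"; real MEOS input accepts more forms.
INSTANT_RE = re.compile(
    r"^POINT\(\s*(?P<x>[-+]?\d+(?:\.\d+)?)\s+(?P<y>[-+]?\d+(?:\.\d+)?)\s*\)@(?P<ts>.+)$"
)


def split_instant(text):
    """Split 'POINT(x y)@timestamp' into (x, y, timezone-aware datetime)."""
    match = INSTANT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a point instant: {text!r}")
    ts = match.group("ts").strip()
    # Offsets such as "+00" carry no minutes; pad them so that
    # datetime.fromisoformat accepts them on older Python versions too.
    if re.search(r"\d{2}:\d{2}:\d{2}(?:\.\d+)?[+-]\d{2}$", ts):
        ts += ":00"
    return float(match.group("x")), float(match.group("y")), datetime.fromisoformat(ts)


x, y, t = split_instant("POINT(1.9229736328125 40.87294006347656)@2022-06-27 00:00:00+00")
print(x, y, t.isoformat())
```

In the notebook itself this work is done by PyMEOS through the UDT, so no manual parsing is needed.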


## Read UDTs

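The dataset already contains preprocessed points in a text form that MobilityPySpark can read directly: we define a schema that types the point column as TGeogPointInstUDT. Because an explicit schema is passed, the column names come from the schema (`PointStr`) rather than from the file header (`pointStr`).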
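
A PySpark `UserDefinedType` maps a Python object to a storage value through `serialize` and back through `deserialize`. The plain-Python sketch below mimics only that round trip, assuming the value is stored as its text form. `PointInstant` and `PointInstantCodec` are hypothetical stand-ins, not MobilityPySpark classes, and the real `TGeogPointInstUDT` may store values differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointInstant:
    """Hypothetical stand-in for a temporal point instant."""
    x: float
    y: float
    timestamp: str  # kept as text to keep the sketch small


class PointInstantCodec:
    """Mirrors the serialize/deserialize pair of a UDT (text storage is an assumption)."""

    def serialize(self, obj):
        # Python object -> storage value (here: a string)
        return f"POINT({obj.x!r} {obj.y!r})@{obj.timestamp}"

    def deserialize(self, datum):
        # Storage value -> Python object
        geometry, timestamp = datum.split("@", 1)
        x, y = geometry[len("POINT("):-1].split()
        return PointInstant(float(x), float(y), timestamp)


codec = PointInstantCodec()
raw = "POINT(1.9229736328125 40.87294006347656)@2022-06-27 00:00:00+00"
value = codec.deserialize(raw)
# The object survives a serialize/deserialize round trip unchanged.
assert codec.deserialize(codec.serialize(value)) == value
```

The point of the pattern is that Spark only ever sees the storage value, while user code on the Python side works with the richer object.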

In [5]:
schema = StructType([
    StructField("icao24", StringType()),
    StructField("PointStr", TGeogPointInstUDT())  
])
df = spark.read.csv(
    data_path, 
    header=True, 
    schema=schema,
    mode='PERMISSIVE'
)
df.printSchema()
# Rename the point column and derive a spatiotemporal bounding box per instant
(
    df.withColumnRenamed("PointStr", "PointInst")
    .withColumn("STBox", point_to_stbox("PointInst"))
    .show(5)
)
df.head()

root
 |-- icao24: string (nullable = true)
 |-- PointStr: pythonuserdefined (nullable = true)



                                                                                

+------+--------------------+--------------------+
|icao24|           PointInst|               STBox|
+------+--------------------+--------------------+
|34718e|POINT(1.922973632...|SRID=4326;GEODSTB...|
|ac6364|POINT(-85.5262662...|SRID=4326;GEODSTB...|
|406471|POINT(1.838302612...|SRID=4326;GEODSTB...|
|a04417|POINT(-83.4583702...|SRID=4326;GEODSTB...|
|c04aa1|POINT(-79.3079393...|SRID=4326;GEODSTB...|
+------+--------------------+--------------------+
only showing top 5 rows



Row(icao24='34718e', PointStr=TGeogPointInstWrap(POINT(1.9229736328125 40.87294006347656)@2022-06-27 00:00:00+00))

## Write UDTs

Now we write the DataFrame back to disk, once as CSV and once as Parquet.

In [6]:
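# Each output is a directory of part files; overwrite so the cell can be re-run
df.write.csv("../datasets/out.csv", mode="overwrite")
df.write.parquet("../datasets/out.parquet", mode="overwrite")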

                                                                                

In [7]:
!ls ../datasets/out.csv

part-00000-ca918e4c-36f8-40c7-90fa-e0c7924029f8-c000.csv  _SUCCESS


In [8]:
!ls ../datasets/out.parquet

part-00000-30bf6d15-c985-4def-b2dc-9fa66dad9502-c000.snappy.parquet  _SUCCESS
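
As the listings show, Spark writes each output as a directory holding one or more `part-*` files plus a `_SUCCESS` marker. A small standard-library check along the following lines could confirm that a write completed before the data is read back. `completed_part_files` is a hypothetical helper, and the temporary directory only simulates Spark's layout so that the sketch runs without Spark.

```python
import tempfile
from pathlib import Path


def completed_part_files(output_dir):
    """Return sorted part-file names if the _SUCCESS marker exists, else raise."""
    out = Path(output_dir)
    if not (out / "_SUCCESS").is_file():
        raise FileNotFoundError(f"no _SUCCESS marker in {out}")
    return sorted(p.name for p in out.glob("part-*"))


# Simulate the layout of a finished Spark write (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    fake_output = Path(tmp) / "out.csv"
    fake_output.mkdir()
    (fake_output / "part-00000-example-c000.csv").write_text("icao24,value\n")
    (fake_output / "_SUCCESS").touch()
    parts = completed_part_files(fake_output)
    print(parts)
```

Against the real output, the call would be `completed_part_files("../datasets/out.csv")`.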


This short notebook showed how MobilityPySpark's UDTs support basic read and write operations: Spark reads the point instants from CSV through a typed schema and writes them back as CSV and Parquet.