**This notebook describes the classes and APIs used in the Location Intelligence SDK for Big Data.**

**Supported spatial operations:**

1. **PointInPolygon**
2. **SearchNearest**
3. **JoinByDistance**
4. **GenerateHexagon**



In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
      .master("yarn") \
      .appName("Li-sdk-pyspark") \
      .config("spark.sql.legacy.allowUntypedScalaUDF", True) \
      .getOrCreate()
spark.sparkContext.addPyFile('/location_intelligence_bigdata_li_sdk_pyspark_0_SNAPSHOT.zip')

## **PointInPolygon**

**A PointInPolygon operation: This method filters the input dataframe for point coordinates that lie within a specified polygon (for example, the polygon of the continental USA). It adds the output fields from the polygon table to the input dataset as columns.**

In [0]:
# perform point in polygon operation
from li.SpatialAPI import SpatialAPI
fabricPath = "addressFabric50.csv" #The HDFS path to input file
fabricDF = spark.read.csv(fabricPath, header=True, sep = ',' )
pointInPolygonDF = SpatialAPI.pointInPolygon(
    inputDF = fabricDF, #dataframe of input dataset
    tableFileType ="TAB", #Type of spatial data provided
    tableFilePath = "", #The HDFS path to spatial table
    tableFileName = "", #Spatial table file name
    libraries = None, # libraries in case of geodatabase tableFileType, defaults to None.
    longitude = "lon", #Longitude column name 
    latitude = "lat", #Latitude column name
    includeEmptySearchResults = True, # If True, an empty search keeps the original input row with null values in the new columns; if False, the row is dropped from the output DataFrame
    outputFields = ["ZIP", "Name"] # Fields from the polygon table to include in the output
)
pointInPolygonDF.show(2)
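
To make the semantics of the operation concrete, here is a minimal pure-Python sketch of a point-in-polygon test using ray casting, together with the keep-or-drop behaviour described for `includeEmptySearchResults`. It is not the SDK's implementation. The names `point_in_ring` and `tag_points`, the sample square, and the polygon name `"A"` are hypothetical and exist only for illustration.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: True if (lon, lat) lies inside the ring of (lon, lat) vertices."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge j -> i straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def tag_points(points, polygons, include_empty=True):
    """Attach the first matching polygon name to each point (hypothetical helper)."""
    rows = []
    for lon, lat in points:
        match = next(
            (name for name, ring in polygons.items() if point_in_ring(lon, lat, ring)),
            None,
        )
        # Mirrors the documented flag: unmatched rows are kept with a null
        # value when include_empty is True and dropped otherwise.
        if match is not None or include_empty:
            rows.append((lon, lat, match))
    return rows


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]  # made-up polygon
points = [(2.0, 2.0), (5.0, 2.0)]

kept = tag_points(points, {"A": square}, include_empty=True)
dropped = tag_points(points, {"A": square}, include_empty=False)
```

With `include_empty=True` the point outside the square stays in the result with `None` as its polygon name, which corresponds to the null output columns described above.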

## **SearchNearest**
**A SearchNearest operation: This method takes a geometry string (in GeoJSON, WKT, KML, or WKB format) and searches a table of geometries for matches within a specified distance. The number of geometries returned can be limited with the maxCandidates parameter. By default, geometries are listed from nearest to farthest.**

In [0]:
from li.SpatialAPI import SpatialAPI
inputPath = "/FileStore/lisdk/input/geometryGeoJson6.csv" #The HDFS path to input file
inputDF = spark.read.csv(inputPath, header=True, sep = ',' )
searchNearestDF = SpatialAPI.searchNearest(
        inputDF = inputDF, 
        tableFileType = "TAB", #Type of spatial data provided
        tableFilePath = "", #The HDFS path to spatial table
        tableFileName = "LANDMARKS.TAB", #Spatial table file name
        maxCandidates = 2, # Maximum number of candidates
        distanceValue = 100.0, # The search distance around the input geometry
        libraries = None,
        distanceUnit = "mi", # Unit of measurement for the distanceValue parameter
        distanceColumnName = "distance", # Distance column name in output
        geometryStringType = "GeoJSON", # Type of geometry string data provided
        geometryColumnName = "geometry", # Geometry column name for input data
        includeEmptySearchResults = True, # If True, an empty search keeps the original input row with null values in the new columns; if False, the row is dropped from the output DataFrame
        outputFields = ["Name", "State", "Landmark"], # Fields from the spatial table to include in the output
)
searchNearestDF.show(5)
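
The roles of `distanceValue` and `maxCandidates` can be illustrated with a small pure-Python sketch. It handles only point geometries and uses the haversine great-circle distance, which is an assumption made for illustration: the SDK's own distance calculation is not described in this notebook. The function names, the Earth radius constant, and the sample landmarks are hypothetical.

```python
import math

EARTH_RADIUS_MI = 3958.8  # approximate mean Earth radius in miles


def haversine_mi(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def search_nearest(lon, lat, candidates, distance_value, max_candidates):
    """Return up to max_candidates (name, distance) pairs within distance_value miles."""
    hits = [(name, haversine_mi(lon, lat, clon, clat)) for name, clon, clat in candidates]
    hits = [hit for hit in hits if hit[1] <= distance_value]
    hits.sort(key=lambda hit: hit[1])  # nearest first
    return hits[:max_candidates]


# Made-up landmarks as (name, longitude, latitude), deliberately unsorted.
landmarks = [("C", 0.0, 1.0), ("A", 0.0, 0.1), ("D", 0.0, 5.0), ("B", 0.0, 0.5)]

nearest = search_nearest(0.0, 0.0, landmarks, distance_value=100.0, max_candidates=2)
within_radius = search_nearest(0.0, 0.0, landmarks, distance_value=100.0, max_candidates=10)
```

Here `D` lies outside the 100-mile radius and is never returned, while `C` is within the radius but is cut off when `max_candidates` is 2.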

## **JoinByDistance**
**A JoinByDistance operation: This method joins two dataframes using longitude and latitude values, one pair from each dataframe, that represent the location of each record to be joined. The coordinate values must be in the CoordSysConstants.longLatWGS84 coordinate system. The method also takes a searchRadius, which is the buffer around the first point within which the second point must lie, and a geohash precision that is used in the calculation.**

In [0]:
from li.DistanceJoinOption import DistanceJoinOption
from li.LimitMethods import LimitMethods
from li.SpatialAPI import SpatialAPI
poiCSV = "" #The path to the first input file
addressFabricCSV = "" # The path to the second input file
distanceColumnName = "outputDistance" # The output distance column name
limitMethod = LimitMethods.RowNumber # The limit method name
limitMatches = 7 # The limit value, passed as DistanceJoinOption.LimitMatches
df1 = spark.read.csv(poiCSV, header=True, sep = ',', inferSchema=True)
df2 = spark.read.csv(addressFabricCSV, header=True, sep = ',', inferSchema=True)

# Perform join by distance
joinedDF = SpatialAPI.joinByDistance(df1 = df1,
      df2 = df2,
      df1Longitude = "LONGITUDE", # The longitude column name of the first dataframe
      df1Latitude = "LATITUDE", # The latitude column name of the first dataframe
      df2Longitude = "LON", # The longitude column name of the second dataframe
      df2Latitude = "LAT", # The latitude column name of the second dataframe
      searchRadius = 0.5, # The absolute value of buffer length around point 1 to search for point 2
      distanceUnit = "mi", # Unit of measurement for the searchRadius parameter
      geoHashPrecision = 7, # The geohash precision
      options = {
          DistanceJoinOption.DistanceColumnName: distanceColumnName,
          DistanceJoinOption.LimitMatches: limitMatches,
          DistanceJoinOption.LimitMethod: limitMethod,  # the limit method defined above
      })
joinedDF.show(2)

## **GenerateHexagon**
**A HexagonGeneration operation: This method generates hexagons within a bounding box defined by the minimum and maximum values of longitude and latitude. The hexagon output can be used for map display.**

In [0]:
# Perform Hexgen
from li.SpatialAPI import SpatialAPI
hexGenDF = SpatialAPI.generateHexagon(
    sparkSession = spark,
    minLongitude = -73.728200, # The bottom left longitude of the bounding box
    minLatitude = 40.979800, # The bottom left latitude of the bounding box
    maxLongitude = -71.787480, # The upper right longitude of the bounding box
    maxLatitude = 42.050496, # The upper right latitude of the bounding box
    hexLevel =  3, # The level to generate hexagons for. Must be between 1 and 11
    containerLevel = 2, # A hint for providing some parallel hexagon generation. Must be less than the hexLevel property
    numOfPartitions = 1, # Number of partitions
    maximumNumOfRowsPerPartition = 5 # Maximum number of rows per partition. This number depends on the memory available to the executor.
)
hexGenDF.show(2)
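
The parameter comments above imply a few constraints that can be checked before submitting the job. The sketch below collects them in a hypothetical helper, `validate_hexgen_args`, which is not part of the SDK. The longitude and latitude ranges assume WGS84 degrees, which this notebook does not state for hexagon generation.

```python
def validate_hexgen_args(min_lon, min_lat, max_lon, max_lat, hex_level, container_level):
    """Return a list of problems found in the arguments (empty if none)."""
    errors = []
    # Assumption: coordinates are WGS84 degrees.
    if not (-180.0 <= min_lon < max_lon <= 180.0):
        errors.append("longitude bounds must satisfy -180 <= min < max <= 180")
    if not (-90.0 <= min_lat < max_lat <= 90.0):
        errors.append("latitude bounds must satisfy -90 <= min < max <= 90")
    # From the parameter comments: hexLevel must be between 1 and 11.
    if not (1 <= hex_level <= 11):
        errors.append("hexLevel must be between 1 and 11")
    # From the parameter comments: containerLevel must be less than hexLevel.
    if not (container_level < hex_level):
        errors.append("containerLevel must be less than hexLevel")
    return errors


# The values used in the cell above pass every check.
ok = validate_hexgen_args(-73.728200, 40.979800, -71.787480, 42.050496, 3, 2)

# A hexLevel outside the 1 to 11 range is reported.
bad = validate_hexgen_args(-73.728200, 40.979800, -71.787480, 42.050496, 12, 2)
```

Running a check like this on the driver avoids launching a Spark job that would fail on an invalid level or an inverted bounding box.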