Denikozub/Geomarketing

Spatial Feature Engineering for Geomarketing

This is a quick guide for data scientists and analysts on working with geospatial data. Geographical information in a dataset can be used for feature generation and can provide statistically important data. This is called spatial feature engineering. In this guide I mainly focus on geomarketing tasks, but the concepts can be applied in any field.

Spatial weights

Spatial weights represent the spatial structure of the data. They define neighbouring spatial relations, so they can be treated as a weighted graph stored as an adjacency matrix or an adjacency list. If no relation is considered between objects i and j, the corresponding w_ij = 0. A spatial weight matrix is required to build models and to compute most geospatial indices. The spatial relationships represented by the weights can be defined in different ways. For a better understanding of the methods I recommend reading this article by ArcGIS.
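As an illustration, here is a minimal sketch of one possible definition: k-nearest-neighbour weights with row standardisation, stored both as an adjacency list and as a dense matrix. The function names `knn_weights` and `to_dense`, the planar coordinates and the choice of k are assumptions made for this example; in practice a library such as PySAL would normally build the weights.

```python
import math


def knn_weights(points, k=2):
    """Return row-standardised k-nearest-neighbour weights as {i: {j: w_ij}}."""
    weights = {}
    for i, (xi, yi) in enumerate(points):
        # Euclidean distance from point i to every other point
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        # Row standardisation: the weights in each row sum to 1
        weights[i] = {j: 1.0 / len(neighbours) for j in neighbours}
    return weights


def to_dense(weights, n):
    """Convert the adjacency list to an n x n matrix; missing pairs get w_ij = 0."""
    return [[weights[i].get(j, 0.0) for j in range(n)] for i in range(n)]


# Hypothetical planar coordinates (e.g. metres in a projected CRS)
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
w = knn_weights(points, k=2)
W = to_dense(w, len(points))
```

Note that k-nearest-neighbour weights are not symmetric in general: the remote point (5, 5) has neighbours, but it is nobody's neighbour.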

Geostatistics

Basic features

  1. Counting nearby objects (e.g. competitors) - buffer + clip
  2. Distance to key objects (e.g. feature center, central feature or feature hotspot)
    • Haversine distance
    • Manhattan distance with haversine formula
  3. Distance to closest object (e.g. storage, shopping center or bus stop)
  4. Spatial lag - neighbour-weighted average of a feature
  5. Building areas, perimeters and volumes (momepy)
  6. Building alignment, adjacency, shared walls (momepy)
  7. Building intensity
  8. Neighbouring-based diversity indices (momepy)
  9. Elevation (altitude)
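Several of the basic features above need nothing beyond the haversine formula. The following is a minimal sketch, assuming coordinates are given as (latitude, longitude) in degrees and the Earth is a sphere of mean radius 6371 km. All function names, the sample coordinates and the sample weights are hypothetical and chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def manhattan_haversine_km(lat1, lon1, lat2, lon2):
    """East-west leg along the starting latitude plus north-south leg."""
    return haversine_km(lat1, lon1, lat1, lon2) + haversine_km(lat1, lon2, lat2, lon2)


def count_within(point, others, radius_km):
    """Number of objects within radius_km of point (a simple buffer count)."""
    return sum(haversine_km(*point, *o) <= radius_km for o in others)


def nearest_km(point, others):
    """Distance to the closest object."""
    return min(haversine_km(*point, *o) for o in others)


def spatial_lag(values, weights):
    """Neighbour-weighted average; weights is {i: {j: w_ij}}, row-standardised."""
    return [
        sum(w_ij * values[j] for j, w_ij in weights[i].items())
        for i in range(len(values))
    ]


# Hypothetical store and competitor locations
store = (55.7558, 37.6173)
competitors = [(55.7600, 37.6200), (55.7000, 37.5000), (55.9000, 37.9000)]

features = {
    "competitors_1km": count_within(store, competitors, 1.0),
    "nearest_competitor_km": nearest_km(store, competitors),
    "manhattan_to_first_km": manhattan_haversine_km(*store, *competitors[0]),
}

# Hypothetical feature values and row-standardised weights for three objects
lag = spatial_lag([10, 20, 30], {0: {1: 0.5, 2: 0.5}, 1: {0: 1.0}, 2: {0: 0.5, 1: 0.5}})
```

The brute-force loops are fine for small datasets; for larger ones a spatial index (for example a ball tree with the haversine metric) is the usual choice.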

Advanced features

  1. Local spatial autocorrelation (cluster and outlier analysis)
  2. Shape measures and urban shape measures for polygonal objects
  3. Spatial accessibility metrics
  4. Clustering
  5. Location set covering problem (LSCP)
  6. Silhouette samples
  7. Distance-preserving dimensionality reduction
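To make the first item more concrete, here is a minimal sketch of local Moran's I in the form I_i = (z_i / m2) * sum_j w_ij z_j, where z_i is the deviation from the mean and m2 is the mean squared deviation. Library implementations (e.g. PySAL) may scale the statistic differently and also provide permutation-based significance tests, which are omitted here. The function name and the toy data are assumptions.

```python
def local_morans_i(values, weights):
    """Local Moran's I for each observation.

    values  -- list of feature values
    weights -- {i: {j: w_ij}}, typically row-standardised
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]      # deviations from the mean
    m2 = sum(d * d for d in dev) / n      # mean squared deviation
    return [
        dev[i] * sum(w_ij * dev[j] for j, w_ij in weights[i].items()) / m2
        for i in range(n)
    ]


# Toy example: objects on a line, each linked to its immediate neighbours
chain4 = {0: {1: 1.0}, 1: {0: 0.5, 2: 0.5}, 2: {1: 0.5, 3: 0.5}, 3: {2: 1.0}}
clustered = local_morans_i([1, 2, 9, 10], chain4)  # similar values sit together

chain3 = {0: {1: 1.0}, 1: {0: 0.5, 2: 0.5}, 2: {1: 1.0}}
outlier = local_morans_i([1, 10, 1], chain3)       # high value among low ones
```

Positive values indicate that an object is surrounded by similar values (a cluster), negative values indicate a spatial outlier. Either the raw statistic or the derived cluster/outlier label can be used as a feature.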

Network analysis

In urban areas each object has a corresponding part of the road graph (defined by a point and a radius). It can be used to generate features with osmnx.stats.basic_stats.

  1. Average circuity (circuity_avg)
  2. Total edge length per km^2 (edge_density_km)
  3. Intersection count per km^2 (intersection_density_km)
  4. Self-loop proportion (self_loop_proportion)
  5. Average number of streets per node (streets_per_node_avg)
  6. Node count per km^2 (node_density_km)
  7. Average street length (street_length_avg)
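To show what these statistics measure, here is a dependency-free sketch that computes simplified versions of them for a toy undirected graph. It does not call osmnx: the key names mirror the statistics listed above, but the definitions are my simplification (osmnx, for instance, distinguishes directed edges from undirected streets). The function name, the node coordinates in metres and the area are assumptions.

```python
import math


def basic_network_stats(nodes, edges, area_km2):
    """Simplified street network statistics.

    nodes    -- {node_id: (x, y)} with planar coordinates in metres
    edges    -- [(u, v, length_m)], one entry per undirected street segment
    area_km2 -- area of the analysed region in square kilometres
    """
    total_length = sum(length for _, _, length in edges)
    # Straight-line distance between the endpoints of every edge
    straight = sum(math.dist(nodes[u], nodes[v]) for u, v, _ in edges)
    degree = {node: 0 for node in nodes}
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1
    return {
        # Edge length relative to straight-line length (1.0 = perfectly straight)
        "circuity_avg": total_length / straight if straight else None,
        "edge_density_km": total_length / area_km2,
        "node_density_km": len(nodes) / area_km2,
        "self_loop_proportion": sum(u == v for u, v, _ in edges) / len(edges),
        "street_length_avg": total_length / len(edges),
        "streets_per_node_avg": sum(degree.values()) / len(nodes),
    }


# Toy block: a 100 m x 100 m square with one slightly curved side
nodes = {0: (0, 0), 1: (100, 0), 2: (100, 100), 3: (0, 100)}
edges = [(0, 1, 100.0), (1, 2, 120.0), (2, 3, 100.0), (3, 0, 100.0)]
stats = basic_network_stats(nodes, edges, area_km2=0.01)
```

Each value in `stats` can be attached to the object whose neighbourhood the graph describes.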

The PySAL spaghetti module also provides various network statistics:

  1. Moran’s I (API)
  2. Point snap distance (API)
  3. Network weights (Network w_network attribute) used to find distances to network hotspots / K nearest clusters / ...

Geocoding

In some cases reverse geocoding can provide useful features. It can be done using GeoPy or reverse geocoder. The resulting physical address can be used to generate categorical features (country, city, district, street) or can be treated as text data.

Another categorical feature that represents spatial proximity is the geohash, a unique identifier of a specific region on the Earth. It can be computed using the geohash library.
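For readers who want to see how the identifier is built, here is a minimal sketch of the standard geohash encoding written from scratch: the longitude and latitude ranges are bisected in turn, and every five bits become one base-32 character. The function name `geohash_encode` is an assumption and is not the API of any particular geohash library.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


cell = geohash_encode(57.64911, 10.40744, precision=5)  # "u4pru"
```

Because a shorter geohash is a prefix of a longer one for the same point, truncating the string gives a coarser region. This makes it easy to produce categorical features at several spatial resolutions.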

Text data

Text data can also be used for feature engineering. This has nothing to do with spatial data analysis, but I want to cover the majority of topics in this guide. One way to utilize text data is to generate categorical data by parsing text using separators or regular expressions. A different approach involves using word2vec word embeddings for a single word. If you have a phrase of multiple words, it is worth taking the average of the word embeddings or (more precisely) the average of the word embeddings weighted by their TF-IDF scores. A completely different approach involves using BERT text embeddings as features.
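The TF-IDF-weighted average can be sketched as follows. The tiny two-dimensional embedding table stands in for real word2vec vectors and is entirely made up, as are the corpus and the function name. The IDF uses one common smoothed variant, ln((1 + N) / (1 + df)) + 1, and the result is normalised by the sum of the weights; other conventions are equally valid.

```python
import math
from collections import Counter


def tfidf_weighted_embedding(tokens, corpus, embeddings):
    """Average the embeddings of tokens, weighting each word by its TF-IDF score.

    tokens     -- list of words in the phrase
    corpus     -- list of tokenised documents used to compute document frequency
    embeddings -- {word: vector}; words without a vector are skipped
    """
    n_docs = len(corpus)
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in Counter(tokens).items():
        if word not in embeddings:
            continue
        df = sum(word in doc for doc in corpus)          # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0    # smoothed IDF
        weight = (count / len(tokens)) * idf             # TF * IDF
        total = [t + weight * x for t, x in zip(total, embeddings[word])]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total


# Made-up corpus and embeddings, for illustration only
corpus = [["coffee", "shop"], ["coffee", "bar"], ["book", "shop"]]
embeddings = {
    "coffee": [1.0, 0.0],
    "shop": [0.0, 1.0],
    "bar": [1.0, 1.0],
    "book": [0.0, 0.5],
}
phrase_vector = tfidf_weighted_embedding(["coffee", "bar"], corpus, embeddings)
```

The rarer word ("bar") receives a larger weight than the more frequent one ("coffee"), so it pulls the phrase vector towards its own embedding.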

Geospatial models

There are machine learning models that do not require any spatial feature engineering at all. They model spatial relationships along with attribute information and can serve as a pipeline for quick metric estimation, a good starting point for hypothesis testing, or an initial data overview. Since most of these models are linear, it is worth applying feature transformations (log, power).

References

  1. ArcGIS documentation
  2. GeoDa documentation
  3. PySAL library
  4. Geographic Data Science with Python