Spatial Feature Engineering for Geomarketing

This is a quick guide for data scientists and analysts on working with geospatial data. Geographical information in a dataset can be used to generate features that carry statistically important signal. This is called spatial feature engineering. In this guide I mainly focus on geomarketing tasks, but the concepts can be used in any field.

Spatial weights

Spatial weights represent the spatial structure of the data. They define neighbour relations between objects, so they can be treated as a weighted graph and stored as an adjacency matrix or list. If no relation is considered between objects i and j, the corresponding w_ij = 0. A spatial weight matrix is required to build models and compute most geospatial indices. The spatial relationships represented by the weights can be defined in different ways. For a better understanding of the methods I recommend reading this article by ArcGIS.
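As a minimal sketch of the idea, the snippet below builds binary distance-band weights and row-standardizes them using only the standard library. The function names, the coordinates and the 150 m threshold are assumptions made for illustration; in practice a library such as PySAL would normally build the weights.

```python
import math

def distance_band_weights(points, threshold):
    """Binary weights: w_ij = 1 if i != j and d_ij <= threshold, else 0."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= threshold:
                w[i][j] = 1.0
    return w

def row_standardize(w):
    """Scale each row to sum to 1; rows without neighbours stay all-zero."""
    out = []
    for row in w:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

# Hypothetical object locations in projected coordinates (metres).
points = [(0, 0), (100, 0), (0, 100), (1000, 1000)]
w = row_standardize(distance_band_weights(points, threshold=150))
```

The last point has no neighbours within the threshold, so its row stays all-zero; such "islands" usually need separate handling before modelling.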

Geostatistics

Basic features

Counting nearby objects (e.g. competitors) - buffer + clip
Distance to key objects (e.g. feature center, central feature or feature hotspot)
- Haversine distance
- Manhattan distance with haversine formula
Distance to closest object (e.g. storage, shopping center or bus stop)
Spatial lag - neighbour-weighted average of a feature
Building areas, perimeters and volumes (momepy)
Building alignment, adjacency, shared walls (momepy)
Building intensity
Neighbouring-based diversity indices (momepy)
Elevation (altitude)
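Several of the basic features above can be sketched in plain Python. The helper names, the coordinates and the toy weights below are hypothetical, and the Manhattan variant shown is only one common formulation (an east-west leg plus a north-south leg, each measured with the haversine formula).

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def manhattan_haversine_km(a, b):
    """East-west leg plus north-south leg, both measured from point a."""
    return haversine_km(a, (a[0], b[1])) + haversine_km(a, (b[0], a[1]))

def count_within_km(origin, others, radius_km):
    """Number of objects within radius_km of origin (a simple buffer count)."""
    return sum(haversine_km(origin, o) <= radius_km for o in others)

def nearest_km(origin, others):
    """Distance from origin to the closest object."""
    return min(haversine_km(origin, o) for o in others)

def spatial_lag(values, w):
    """Neighbour-weighted average, given a row-standardized weight matrix."""
    return [sum(w_ij * v for w_ij, v in zip(row, values)) for row in w]

# Hypothetical store and competitor coordinates (lat, lon).
store = (55.7500, 37.6200)
competitors = [(55.7510, 37.6210), (55.7600, 37.6400), (55.9000, 37.9000)]

n_close = count_within_km(store, competitors, radius_km=1.0)
d_nearest = nearest_km(store, competitors)
d_manhattan = manhattan_haversine_km(store, competitors[1])

# Toy row-standardized weights for three objects and one feature column.
toy_w = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
lag = spatial_lag([10.0, 20.0, 30.0], toy_w)
```

For large datasets a spatial index (for example a KD-tree or ball tree) is a better fit than these linear scans.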

Advanced features

Local spatial autocorrelation (cluster and outlier analysis)
- Local Moran
- Local G
- Local Geary
- Local join counts (for binary features)
Shape measures and urban shape measures for polygonal objects
Spatial accessibility metrics
Clustering
- AZP (automatic zoning procedure)
- Bottom-up agglomerative
- Regional K-Means
- A-DBSCAN
- Multivariate
Location set covering problem (LSCP)
Silhouette samples
Distance-preserving dimensionality reduction
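To make the local autocorrelation idea concrete, here is a minimal sketch of the local Moran statistic, I_i = (z_i / m2) * sum_j w_ij z_j, where z is the mean-centred feature and m2 is its mean square. The chain-shaped neighbourhood and the values are invented for illustration, and the sketch omits the permutation-based significance testing that a library implementation would normally provide.

```python
def chain_weights(n):
    """Row-standardized weights where each object neighbours the previous and next one."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in neighbours:
            w[i][j] = 1.0 / len(neighbours)
    return w

def local_moran(values, w):
    """Local Moran's I for every observation (no inference, statistic only)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n
    return [z[i] / m2 * sum(w[i][j] * z[j] for j in range(n)) for i in range(n)]

# Hypothetical feature with one high value surrounded by low values.
values = [1.0, 1.0, 10.0, 1.0, 1.0]
moran = local_moran(values, chain_weights(len(values)))
```

Positive values indicate an object that resembles its neighbours (part of a cluster), while negative values flag spatial outliers such as the middle observation here.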

Network analysis

In urban areas each object has a corresponding part of the road graph (defined by a point and a radius). It can be used to generate features with osmnx.stats.basic_stats.

Average circuity (circuity_avg)
Total edge length per km^2 (edge_density_km)
Intersection count per km^2 (intersection_density_km)
Self-loop proportion (self_loop_proportion)
Average number of streets per node (streets_per_node_avg)
Node count per km^2 (node_density_km)
Average street length (street_length_avg)
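The sketch below shows roughly what a few of these statistics measure on a toy graph. It is not the osmnx implementation: the function name, the planar coordinates in metres and the 0.5 km^2 area are assumptions, and straight-line distance stands in for the great-circle distance a real road graph would use.

```python
import math

def basic_network_stats(nodes, edges, area_km2):
    """Toy versions of a few road-network statistics.

    nodes: {node_id: (x, y)} in metres; edges: [(u, v, length_m)].
    """
    total_length = sum(length for _, _, length in edges)
    straight = sum(math.dist(nodes[u], nodes[v]) for u, v, _ in edges)
    return {
        # Ratio of travelled length to straight-line length (1.0 = perfectly direct).
        "circuity_avg": total_length / straight,
        # Metres of street per square kilometre.
        "edge_density_km": total_length / area_km2,
        "node_density_km": len(nodes) / area_km2,
        "street_length_avg": total_length / len(edges),
    }

# Hypothetical neighbourhood: the a-c street curves, so it is longer than 500 m.
nodes = {"a": (0, 0), "b": (300, 0), "c": (300, 400)}
edges = [("a", "b", 300.0), ("b", "c", 400.0), ("a", "c", 600.0)]
stats = basic_network_stats(nodes, edges, area_km2=0.5)
```

Each resulting value can be attached to the object at the centre of the sampled area as a numeric feature.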

The PySAL spaghetti module also provides various network statistics:

Moran’s I (API)
Point snap distance (API)
Network weights (Network w_network attribute) used to find distances to network hotspots / K nearest clusters / ...

Geocoding

In some cases reverse geocoding can provide useful features. It can be done using GeoPy or reverse geocoder. A physical address can be used to generate categorical features (country, city, district, street) or can be treated as text data.

Another categorical feature that represents spatial proximity is the geohash, a unique identifier of a specific region on the Earth. It can be computed using a geohash library.
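To show where the identifier comes from, here is a small standard-library sketch of geohash encoding: longitude and latitude bits are interleaved by repeated bisection and every five bits become one base-32 character. The function name and the default precision are my own choices; a dedicated library is the more practical option.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=7):
    """Encode a (lat, lon) pair in degrees as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

cell = geohash_encode(57.64911, 10.40744, precision=5)
```

Truncating the string gives a coarser cell, so prefixes of different lengths can serve as categorical features at several spatial scales.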

Text data

Text data can also be used for feature engineering. This has nothing to do with spatial data analysis, but I want to cover the majority of topics in this guide. One way to use text data is to generate categorical data by parsing the text with separators or regular expressions. A different approach is to use word2vec word embeddings for a single word. If you have a phrase of multiple words, it is worth taking the average of the word embeddings or (more precisely) the average of the word embeddings weighted by their TF-IDF scores. A completely different approach is to use BERT text embeddings as features.
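The TF-IDF weighted average can be sketched as follows. The two-dimensional vectors stand in for real word2vec embeddings, the corpus is invented, and the smoothed IDF formula and the normalization by the sum of weights are assumptions rather than the only reasonable choices.

```python
import math

def tfidf_weighted_embedding(tokens, corpus, vectors):
    """Average the word vectors of a phrase, weighted by TF-IDF scores."""
    n_docs = len(corpus)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in set(tokens):
        if word not in vectors:
            continue  # skip out-of-vocabulary words
        tf = tokens.count(word) / len(tokens)
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumed variant)
        weight = tf * idf
        total = [t + weight * v for t, v in zip(total, vectors[word])]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

# Hypothetical tokenized object names and toy 2-d "embeddings".
corpus = [["coffee", "shop"], ["coffee", "bar"], ["flower", "shop"]]
vectors = {"coffee": [1.0, 0.0], "bar": [0.0, 1.0],
           "shop": [1.0, 1.0], "flower": [-1.0, 0.0]}

phrase_vector = tfidf_weighted_embedding(["coffee", "bar"], corpus, vectors)
```

Because "bar" is rarer in the corpus than "coffee", it receives the larger weight and pulls the phrase vector towards its own embedding.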

Geospatial models

There are machine learning models that do not require any spatial feature engineering at all. They account for spatial relationships along with attribute information and can serve as a pipeline for quick metrics estimation, a good start for hypothesis testing, or an initial data overview. Since most of these models are linear, it is worth applying feature transformations (log, power).

Geographically weighted regression (GWR)
Multiscale GWR (MGWR)
Ordinary least squares
Seemingly unrelated regression
Fixed and random effects panels
Forest-based classification and regression
Tests of homoskedasticity, normality, spatial randomness etc.
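To illustrate how geographically weighted regression lets coefficients vary over space, the sketch below fits a separate weighted least-squares line at every location, using Gaussian kernel weights that decay with distance. It is restricted to one predictor with a fixed bandwidth, and the function name and data are invented; a full implementation would also cover bandwidth selection, multiple predictors and diagnostics.

```python
import math

def gwr_single_predictor(coords, x, y, bandwidth):
    """Return a local (intercept, slope) pair for every location."""
    results = []
    for ci in coords:
        # Gaussian kernel: nearby observations count more in the local fit.
        w = [math.exp(-0.5 * (math.dist(ci, cj) / bandwidth) ** 2) for cj in coords]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx
        results.append((my - slope * mx, slope))
    return results

# Hypothetical data: the effect of x on y is 1 in the west and 3 in the east.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]

local_fits = gwr_single_predictor(coords, x, y, bandwidth=1.0)
```

A global ordinary least squares fit would report a single slope for all six points, whereas the local fits recover a different slope for each cluster.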

References

ArcGIS documentation
GeoDa documentation
PySAL library
Geographic Data Science with Python
