# Python Applications: GeoPandas Data Exploration

In [None]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import rasterio

- Libraries generally have tutorials and detailed documentation available.
- A large part of learning to code is knowing how to search effectively for documentation, examples, and error messages.

# Basic Operations on GIS Data

In this cell, we begin by loading a vector dataset containing borehole information using GeoPandas, a useful package that brings spatial data handling to the world of pandas.

Step-by-Step Actions:

- Define Data Path: source_boreholes stores the relative path to a shapefile (*.shp) containing borehole locations.

- Load with read_file(): gpd.read_file() reads the shapefile into a GeoDataFrame (df_boreholes), preserving both attribute and geometry data.

- Inspect with .info(): This method displays a summary of the dataset, including:

    - Number of entries (e.g., boreholes)

    - Data types and memory usage

    - Column names and non-null counts

In [None]:
# Define the path to the file
source_boreholes = './data/GIS/UFS_Boreholes.shp'
# Feed the path to the file to the Geopandas Read File function
df_boreholes = gpd.read_file(source_boreholes)
# Let us display some basic information of the loaded shapefile
df_boreholes.info()
# These methods are very useful when working with large data sets,
# where GIS software can slow down significantly due to the rendering.

In [None]:
# Similar to regular Pandas:
# We can also request a description of the shapefile which will give statistics for the included data
df_boreholes.describe()

In [None]:
# If we want to know the coordinate reference system of the shapefile it only requires a single command
df_boreholes.crs

In [None]:
# We can also just request the geometry of the object, disregarding any attached data
df_boreholes.geometry.head()

In [None]:
# This will be the primary way of viewing the data
df_boreholes.head(10)
# head() displays the first 5 records by default; here we ask for 10

In [None]:
# We can invoke the explore method to visualize our data on an interactive map
# The default tiles for explore are OpenStreetMap
df_boreholes.explore()

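Let us calculate the distance between two boreholes on the map. A geographic CRS stores coordinates in degrees, so distances computed in it are not in meters. We therefore first need to change to a projected CRS: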

In [None]:
# First we need to convert the CRS of the boreholes
# from a geographic CRS to a projected CRS
df_boreholes_Merc = df_boreholes.to_crs(3395) # WGS84 / World Mercator
# Save the newly projected file to a geopackage.
df_boreholes_Merc.to_file('./output/UFS_Boreholes_Merc.gpkg',
                           driver='GPKG',
                           layer='boreholes')
# Let us make sure the CRS change has taken effect.
df_boreholes_Merc.crs
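
Before measuring anything, it can help to check programmatically whether a layer is still in a geographic CRS. The sketch below is one possible way to do that; the points, the IDs, and the helper name ensure_projected are made up for illustration and are not part of the borehole dataset.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical lon/lat points (not taken from the borehole shapefile)
gdf_geo = gpd.GeoDataFrame(
    {"ID": ["P1", "P2"]},
    geometry=[Point(26.18, -29.11), Point(26.19, -29.11)],
    crs="EPSG:4326",  # WGS84, a geographic CRS (units: degrees)
)


def ensure_projected(gdf, epsg=3395):
    """Return gdf unchanged if its CRS is projected, else reproject it."""
    if gdf.crs is None:
        raise ValueError("GeoDataFrame has no CRS; set one before reprojecting")
    if gdf.crs.is_projected:
        return gdf
    return gdf.to_crs(epsg)


gdf_proj = ensure_projected(gdf_geo)
print(gdf_geo.crs.is_geographic)  # True for the lon/lat input
print(gdf_proj.crs.to_epsg())     # EPSG code after reprojection
```

The helper leaves an already projected layer untouched, so it is safe to call more than once in a workflow.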

Next, we isolate and display the records for two specific boreholes using attribute filtering.

What’s happening:

- Define Borehole IDs: Stores the identifiers of two target boreholes, 'UO10' and 'UO24', into borehole_name_a and borehole_name_b.

- Filter GeoDataFrame:

    - Uses .loc[] indexing with a conditional filter (['ID'] == borehole_name) to extract matching records from df_boreholes_Merc.

    - These filters yield two new GeoDataFrames: borehole_a and borehole_b, each containing a single record.

- Display Borehole Info: print(borehole_a.head()) shows the attribute and geometry data of the selected borehole to confirm the filter was successful.

In [None]:
# Add the identifiers of the boreholes in question
borehole_name_a = 'UO10'
borehole_name_b = 'UO24'
# Get the boreholes from the shapefile
borehole_a = df_boreholes_Merc.loc[
    df_boreholes_Merc['ID'] == borehole_name_a
]
borehole_b = df_boreholes_Merc.loc[
    df_boreholes_Merc['ID'] == borehole_name_b
]
# Print the information of the boreholes
print(borehole_a.head())

In [None]:
# Print the information of the second borehole
print(borehole_b.head())
# Note how the geometry now has different coordinate values
# This is due to the CRS transformation. to_crs() returns a new GeoDataFrame, so the original shapefile on disk is unchanged.
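
The attribute filter works the same way on any GeoDataFrame. Here is a minimal sketch on a made-up table (the IDs, depths, and coordinates are hypothetical) that also shows how to select several IDs at once and how to spot an ID that matched nothing.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical borehole table; IDs, depths and coordinates are made up
gdf = gpd.GeoDataFrame(
    {"ID": ["BH1", "BH2", "BH3"], "Depth": [30.0, 45.0, 60.0]},
    geometry=[Point(0, 0), Point(100, 50), Point(200, 80)],
    crs="EPSG:3395",
)

# A comparison on a column yields a boolean Series (one True/False per row)
mask = gdf["ID"] == "BH2"
single = gdf.loc[mask]

# isin() selects several IDs in one go
pair = gdf.loc[gdf["ID"].isin(["BH1", "BH3"])]

# An empty result means no row matched, for example after a typo in the ID
missing = gdf.loc[gdf["ID"] == "BH99"]
print(len(single), len(pair), missing.empty)
```

Checking .empty before using a filtered frame avoids an IndexError later when .iloc[0] is called on a result with no rows.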

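Now that we've isolated the two boreholes of interest, next we use GeoPandas to calculate the Euclidean (straight-line) distance between them in the units of the projected CRS (meters for EPSG:3395). Keep in mind that Mercator projections stretch distances increasingly away from the equator, so the value is a planar distance in the projection rather than an exact ground distance.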

How It Works:

- .distance() Method: Computes the shortest distance between the geometries of borehole_a and borehole_b.

- The parameter align=False pairs the rows of the two objects by position instead of matching them on their index labels. This is essential here, because the two single-row frames keep the index labels of their original rows, which differ.

- Accessing the Result: The result is a pandas Series with a single value, so we extract it with .iloc[0], cast it to float, and round it to two decimal places for readability.

In [None]:
# Calculate the distance between borehole a and borehole b
distance_between = borehole_a.distance(borehole_b, align=False)
# Print the calculated distance
print(f"Distance between {borehole_name_a} and {borehole_name_b}: ")
print(f"{round(float(distance_between.iloc[0]), 2)} m")
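
To get a feel for how much a Mercator distance can differ from a distance on the ground, the sketch below compares the two for a pair of made-up lon/lat positions (they are assumptions, not values from the borehole file). The geodesic distance comes from pyproj's Geod class, which is installed alongside GeoPandas.

```python
import geopandas as gpd
from pyproj import Geod
from shapely.geometry import Point

# Hypothetical lon/lat positions on the same parallel (made-up values)
lon_a, lat_a = 26.18, -29.11
lon_b, lat_b = 26.19, -29.11

# Two single-row layers whose index labels differ (7 and 42)
pt_a = gpd.GeoSeries([Point(lon_a, lat_a)], index=[7], crs="EPSG:4326").to_crs(3395)
pt_b = gpd.GeoSeries([Point(lon_b, lat_b)], index=[42], crs="EPSG:4326").to_crs(3395)

# align=False pairs rows by position, so row 0 of pt_a meets row 0 of pt_b
mercator_m = float(pt_a.distance(pt_b, align=False).iloc[0])

# Geodesic distance on the WGS84 ellipsoid for comparison
_, _, geodesic_m = Geod(ellps="WGS84").inv(lon_a, lat_a, lon_b, lat_b)

print(f"World Mercator: {mercator_m:.1f} m")
print(f"Geodesic:       {geodesic_m:.1f} m")
print(f"Ratio:          {mercator_m / geodesic_m:.3f}")
```

At this latitude the Mercator value comes out noticeably larger than the geodesic one. Where accurate ground distances matter, a local projected CRS (such as the appropriate UTM zone) is usually a better choice than World Mercator.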

Assume we want to clip these boreholes so that only those boreholes that occur within our study area remain. First, we need to import the boundary with which to clip our boreholes:

In [None]:
# Import our boundary polygon
cts_boundary = './data/GIS/CTS_Boundary.shp'
df_cts_boundary = gpd.read_file(cts_boundary)
# Take a look at the file
df_cts_boundary.head()

In [None]:
# We need to make sure the two files are in the same projected CRS
# Convert to 3395 World Mercator
df_cts_boundary = df_cts_boundary.to_crs(3395)
df_cts_boundary.crs

In [None]:
# Take a look at the boundary, with Google satellite images as the base map.
df_cts_boundary.explore(tiles='https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}',
                          attr='Google Earth')

This cell inspects key geometric properties of the imported polygon layer.

Geometric Attributes Calculated:

- .area Calculates the polygon’s area using the coordinate reference system’s units (typically square meters if projected). This gives a quantitative sense of the model domain’s size.

- .centroid Returns the geometric center of the polygon as a Point object. This is often used for labeling, snapping, or referencing the polygon’s central location.

- .boundary Extracts the polygon’s boundary as a LineString (or a MultiLineString if the polygon has holes), useful for plotting outlines or for clipping/intersecting with other layers.

Always double-check that your boundary is in the right projection and has the expected area/shape before using it as a mask, domain, or buffer.

In [None]:
# Let us calculate some metrics of our polygon
print(f"Model Boundary Polygon:\nArea: {df_cts_boundary.area}")
print(f"Centroid: {df_cts_boundary.centroid}")
print(f"Boundary: {df_cts_boundary.boundary}")
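
These attributes are easiest to sanity-check on a shape whose answers we already know. The sketch below uses a hypothetical 10 x 10 square, not the CTS boundary, and pulls out single values with .iloc[0] instead of printing whole Series.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical 10 x 10 square in a projected CRS (coordinates in CRS units)
square = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
gdf_square = gpd.GeoDataFrame({"Name": ["demo"]}, geometry=[square], crs="EPSG:3395")

# Each attribute returns one value per row, so pick the first row with .iloc[0]
area = gdf_square.area.iloc[0]
centroid = gdf_square.centroid.iloc[0]
outline = gdf_square.boundary.iloc[0]

print(f"Area: {area}")
print(f"Centroid: ({centroid.x}, {centroid.y})")
print(f"Boundary type: {outline.geom_type}, length: {outline.length}")
```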

Now, we perform the spatial clip to limit the borehole dataset to only those points that fall within the defined model boundary polygon (df_cts_boundary). 

Operation Details of df_boreholes_Merc.clip(df_cts_boundary):

- Calls the .clip() method from GeoPandas, which retains only the geometries in df_boreholes_Merc that intersect the model boundary polygon.

- Returns a new GeoDataFrame, df_boreholes_cts, containing the clipped subset of boreholes.

The coordinate systems must match (i.e., both datasets should be in the same projected CRS) for this to work correctly.

In [None]:
# In order to clip the boreholes we just invoke the clip method
# on the boreholes and we pass the model boundary as the argument
df_boreholes_cts = df_boreholes_Merc.clip(df_cts_boundary)
# Let us display the results again
df_boreholes_cts.explore(tiles='https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}',
                          attr='Google Earth')
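
Here is a small, self-contained sketch of the same clip on made-up data (three hypothetical points and a square study area), including an explicit CRS check before clipping.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

crs = "EPSG:3395"
# Hypothetical points and study-area polygon (made-up coordinates)
points = gpd.GeoDataFrame(
    {"ID": ["in_1", "in_2", "out_1"]},
    geometry=[Point(2, 2), Point(8, 5), Point(20, 20)],
    crs=crs,
)
boundary = gpd.GeoDataFrame(
    geometry=[Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])], crs=crs
)

# Refuse to clip when the layers disagree on CRS
if points.crs != boundary.crs:
    raise ValueError("Reproject both layers to the same CRS before clipping")

clipped = points.clip(boundary)
kept_ids = sorted(clipped["ID"])
print(f"{len(points)} points before, {len(clipped)} after: {kept_ids}")
```

Comparing the row counts before and after is a quick way to confirm that the clip removed what we expected.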

For our next magic trick, let's see what land use zones these boreholes are located on. First, we once again import the file containing the land use data:

In [None]:
# Let us import the land use shapefile
source_land_use = './data/GIS/CTS_LandUse.shp'
df_land_use = gpd.read_file(source_land_use)
# And preview the data
df_land_use.head()

In [None]:
# Spatial join will require all geometries to be in the same CRS
df_land_use = df_land_use.to_crs(3395)
df_land_use.crs

In [None]:
# We can also pass extra keyword arguments to the explore method to style the map
df_land_use.explore(tiles='https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}',
                    attr='Google Earth',
                    column="LandUseTyp",
                    cmap="rainbow")

This cell performs a spatial join that assigns land use classifications to each borehole based on its location within polygon zones.

What are we using:

- Left Table: df_boreholes_cts contains the clipped boreholes (points).

- Right Table: df_land_use[['LandUseTyp', 'geometry']] contains land use zones as polygons, with just the relevant attributes selected.

Join Operation:

- how="left" ensures that every borehole is retained in the result—even if it doesn’t intersect a land use polygon.

- predicate="within" specifies that a borehole must fall inside a polygon for the join to succeed.

- Result: The resulting df_boreholes_land_use includes a new LandUseTyp column describing the land use zone for each borehole.

Why do we include geometry in the selected columns? 

- GeoPandas requires the geometry column to be present in both dataframes during spatial joins to execute geometry-based predicates like "within" or "intersects".

In [None]:
# Perform the spatial join, notice the keyword arguments specifying how the operation is executed
# Geometry should always be included as a selected column when performing a spatial join
df_boreholes_land_use = df_boreholes_cts.sjoin(df_land_use[['LandUseTyp', 'geometry']],
                                                   how="left",
                                                   predicate="within")
# Preview the table to confirm the spatial join
df_boreholes_land_use.head()
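
To see what how="left" does with a borehole that falls outside every zone, here is a minimal sketch on made-up data (the borehole IDs, zone names, and coordinates are all hypothetical).

```python
import geopandas as gpd
from shapely.geometry import Point, box

crs = "EPSG:3395"
# Hypothetical boreholes and land use zones (made-up values)
boreholes = gpd.GeoDataFrame(
    {"ID": ["A", "B", "C"]},
    geometry=[Point(1, 1), Point(7, 7), Point(20, 20)],
    crs=crs,
)
zones = gpd.GeoDataFrame(
    {"LandUseTyp": ["Residential", "Industrial"]},
    geometry=[box(0, 0, 5, 5), box(6, 6, 10, 10)],
    crs=crs,
)

joined = boreholes.sjoin(zones[["LandUseTyp", "geometry"]],
                         how="left", predicate="within")

# Borehole C lies outside every zone, so its LandUseTyp is missing (NaN)
unmatched = joined.loc[joined["LandUseTyp"].isna(), "ID"].tolist()
print(joined[["ID", "LandUseTyp"]])
print("Outside all zones:", unmatched)
```

Listing the unmatched IDs after a left join is a cheap check for boreholes that sit in gaps between polygons.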

In [None]:
# Let us plot the results
fig, ax = plt.subplots(figsize=(10,10))
# Give the figure a title
ax.set_title('Boreholes by Land Use')
# Plot the land use polygons, colored by the Land Use Type column
df_land_use.plot(ax=ax, column='LandUseTyp', cmap='Pastel1')
# Plot the boreholes on top, classified by the Land Use Type column
df_boreholes_land_use.plot(ax=ax,
                           column='LandUseTyp',
                           cmap='rainbow',
                           missing_kwds={
                               "color": "black",
                               "label": "Outside Land Use Zones"
                           },
                           legend=True)

Next, let's buffer the boreholes. The cell below describes how to generate buffer polygons around each borehole point, representing a uniform spatial zone (e.g., for zone-of-influence analysis, environmental compliance, or proximity checks).

Step-by-step breakdown:

- Copy the Source Layer:

    - df_boreholes_cts.copy() ensures that the original clipped borehole layer remains unchanged.

    - The copy is stored in df_boreholes_buffered.

- Generate Buffers:

    - The .buffer(distance=5) method creates a circular buffer with a radius of 5 CRS units (meters in this projected CRS) around each borehole’s geometry.

    - The new geometries overwrite the geometry column in the copied GeoDataFrame.

- Plot the Buffers:

    - plot(figsize=(5,5)) visualizes the result—each borehole now displayed as a small polygon (rather than a point).

In [None]:
# Create a copy of our clipped boreholes layer
df_boreholes_buffered = df_boreholes_cts.copy()
# Buffer the boreholes, and assign the new geometry to the copied layer
df_boreholes_buffered['geometry'] = df_boreholes_cts.geometry.buffer(distance=5)
# Visualize the results
df_boreholes_buffered.plot(figsize=(5,5))
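
A quick way to convince ourselves that the buffer behaves as expected is to compare its area with that of an exact circle. The sketch below does this for a single made-up point.

```python
import math

import geopandas as gpd
from shapely.geometry import Point

# Hypothetical point in a projected CRS
pts = gpd.GeoDataFrame({"ID": ["BH1"]}, geometry=[Point(100, 200)], crs="EPSG:3395")

buffered = pts.copy()
buffered["geometry"] = pts.geometry.buffer(distance=5)

buffer_area = buffered.area.iloc[0]
circle_area = math.pi * 5 ** 2
print(buffered.geom_type.iloc[0])
print(f"Buffer area: {buffer_area:.2f}, exact circle: {circle_area:.2f}")
```

The buffer is a polygon that approximates the circle with short straight segments, which is why its area is slightly smaller than the exact value.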

This next cell combines all individual land use polygons into a single unified geometry using GeoPandas’ geometry aggregation tool. This is especially useful when simplifying spatial layers or defining a total study area for masking, buffering, or overlay analysis.

Key Steps:

- Merge Geometries:

    - df_land_use['geometry'].unary_union merges all land use polygons into one seamless MultiPolygon or Polygon, removing internal boundaries. In GeoPandas 1.0 and later this attribute is deprecated in favor of the union_all() method.

    - This operation uses the Shapely engine under the hood, ensuring clean and topologically valid output.

- Create Attribute Table:

    - Constructs a simple dictionary merged_data representing the new merged feature with basic metadata (ID, Name, geometry).

- Convert to GeoDataFrame:

    - Wraps the new record in a GeoDataFrame, reusing the CRS of df_land_use (EPSG:3395 after the earlier reprojection) so that the merged layer aligns with the existing spatial layers.

- Plot:

    - Displays the result using plot(), giving a quick visual confirmation of the merged boundary.

In [None]:
# Merge vectors
poly_land_use_merged = df_land_use['geometry'].unary_union
# Take the merged polygon and create a new dataframe with it
merged_data = {'ID': ['0'],
               'Name': ["Land use merged"],
               'geometry': [poly_land_use_merged]}
df_land_use_merged = pd.DataFrame(merged_data)
# Convert the DataFrame to a GeoDataFrame, reusing the CRS of the land use layer so the two stay aligned
gdf_land_use_merged = gpd.GeoDataFrame(df_land_use_merged,
                                       geometry='geometry',
                                       crs=df_land_use.crs)
gdf_land_use_merged.plot(figsize=(10,10))
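
The merge can be tried out on two made-up squares that share an edge. The sketch below is written to run on both older and newer GeoPandas releases: it uses union_all() where that method is available and falls back to unary_union otherwise. The zone names and coordinates are hypothetical.

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical zones that share an edge
zones = gpd.GeoDataFrame(
    {"LandUseTyp": ["Residential", "Industrial"]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
    crs="EPSG:3395",
)

# union_all() exists in GeoPandas 1.0+; older releases only offer unary_union
if hasattr(zones.geometry, "union_all"):
    merged_geom = zones.geometry.union_all()
else:
    merged_geom = zones.geometry.unary_union

# Reuse the source CRS instead of typing an EPSG code by hand
merged = gpd.GeoDataFrame(
    {"ID": ["0"], "Name": ["Zones merged"]},
    geometry=[merged_geom],
    crs=zones.crs,
)
print(merged_geom.geom_type, merged.area.iloc[0], merged.crs == zones.crs)
```

Taking the CRS from the source layer means the merged layer cannot end up with a label that differs from the coordinates it actually holds.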

Next, let's take a look at working with raster data using rasterio. A quick breakdown:

- src.read(1) grabs the first band — common for DEMs or intensity rasters.

- cmap='gray' renders it as a grayscale image, but you can also try colormaps like 'terrain', 'viridis', or 'plasma'.

- If it’s a multiband raster (e.g. RGB), you can read all bands and display them as an image using 'np.dstack'.
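
The multiband case can be sketched without a file. rasterio returns all bands in (bands, rows, cols) order when src.read() is called without a band number, while plt.imshow expects (rows, cols, bands). The array below is a made-up stand-in for such a read.

```python
import numpy as np

# Hypothetical 3-band raster in rasterio's (bands, rows, cols) layout;
# with a real file this array would come from src.read()
bands = np.arange(3 * 4 * 5, dtype=np.uint8).reshape(3, 4, 5)
red, green, blue = bands

# np.dstack stacks the 2D bands along a new last axis: (rows, cols, bands)
rgb = np.dstack([red, green, blue])
print(bands.shape, "->", rgb.shape)
```

The resulting rgb array can be passed directly to plt.imshow, provided the values are in a range that matplotlib accepts for images (0 to 255 for integers, 0 to 1 for floats).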

In [None]:
# Define the path to your raster file
raster_path = './data/GIS/output_AW3D30.tif'

# Open the raster using rasterio
with rasterio.open(raster_path) as src:
    raster_data = src.read(1)  # Read the first band
    raster_crs = src.crs
    raster_bounds = src.bounds
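    # Mask NoData values (optional but common); skip if the file defines no NoData value
    nodata = src.nodata
    masked_raster = (np.ma.masked_equal(raster_data, nodata)
                     if nodata is not None else np.ma.masked_array(raster_data))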

# Display the raster
plt.figure(figsize=(10, 10))
plt.imshow(raster_data, cmap='gray')
plt.title("Raster Visualization")
plt.axis('off')
plt.show()

Luckily, we had a good raster, so masking the NoData values (invalid or missing data) was not strictly necessary, but we will visualise the masked version regardless:

In [None]:
# Display the masked raster
plt.figure(figsize=(10, 10))
plt.imshow(masked_raster, cmap='gray')
plt.title("Masked Raster Visualization")
plt.axis('off')
plt.show()
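
Masking matters as soon as a raster does contain NoData cells. The sketch below uses a tiny made-up grid with -9999 as an assumed NoData flag to show how much a single unmasked cell can distort a statistic.

```python
import numpy as np

# Hypothetical elevation grid with one NoData cell flagged as -9999
nodata = -9999
grid = np.array([[100.0, 102.0], [nodata, 104.0]])

masked = np.ma.masked_equal(grid, nodata)

print("Masked cells:", int(masked.mask.sum()))
print("Mean ignoring NoData:", masked.mean())
print("Mean including NoData:", grid.mean())
```

Statistics on the masked array skip the flagged cell, whereas the plain array treats -9999 as a real elevation.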

When working with spatial grids — such as digital elevation models, rasters, or groundwater model arrays — we often need to perform large-scale mathematical operations across thousands (or millions) of grid cells. This is where NumPy becomes a game changer.

What Makes NumPy So Powerful?

- Vectorized Operations: Unlike standard Python loops, NumPy allows you to apply operations across entire arrays at once — like adding two grids or computing slopes.

- Memory Efficiency: NumPy stores values in compact, fixed-type arrays that keep memory usage low, which matters when working with high-resolution rasters or 3D model arrays. NumPy is also much faster than equivalent pure-Python loops because its core routines are implemented in C.

- Broadcasting: You can apply operations across arrays of different shapes — for example, scaling each grid layer differently — without writing extra logic.

Why It’s Perfect for Spatial Modeling

- Whether you’re:

    - Applying mathematical transforms (e.g. normalizing a raster)

    - Creating masks (e.g. cells above a threshold)

    - Computing differences across time steps or layers

    - Extracting values for zonal stats

- ...NumPy allows these to happen quickly, reproducibly, and expressively.
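
The ideas above can be sketched on two small made-up grids (the values are hypothetical, for example water levels at two time steps).

```python
import numpy as np

# Two hypothetical 2 x 3 grids (for example, water levels at two time steps)
levels_t0 = np.array([[10.0, 11.0, 12.0], [13.0, 14.0, 15.0]])
levels_t1 = np.array([[10.5, 10.0, 12.5], [13.0, 15.0, 14.0]])

# Vectorized: one expression replaces a nested loop over rows and columns
change = levels_t1 - levels_t0

# Boolean mask: cells where the level rose
rose = change > 0

# Broadcasting: a (2, 1, 1) array of factors scales each layer of a (2, 2, 3) stack
stack = np.stack([levels_t0, levels_t1])
factors = np.array([1.0, 100.0]).reshape(2, 1, 1)
scaled = stack * factors

print(change)
print("Cells that rose:", int(rose.sum()))
print(scaled.shape)
```

Reshaping the factors to (2, 1, 1) is what lets NumPy stretch one factor across every cell of its layer without an explicit loop.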

Next, we explore how to manipulate raster data numerically after loading it with rasterio. Since raster files are essentially gridded datasets, we can apply standard NumPy operations for analysis and transformation.

# Example Calculations:

Scaling:

- Multiply the raster by 100 to convert values (e.g., from meters to centimeters).

Thresholding:

- Create a binary mask where values greater than 1000 become 1 and others become 0. This is useful for classification or visibility filtering.

Slope Approximation:

- Use np.gradient() to estimate local elevation change (or other raster gradients) across cells.

- Combine gradients using the Pythagorean formula to get a simple slope-like surface.

Normalization:

- Rescale values between 0 and 1 using min–max normalization — a standard preprocessing step for many machine learning workflows.

In [None]:
# 1. Scale the raster by a factor (e.g., convert elevation from meters to centimeters)
scaled_raster = masked_raster * 100

# 2. Apply thresholding (e.g., highlight elevation > 1000 m)
thresholded_raster = np.where(masked_raster > 1000, 1, 0)

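# 3. Calculate slope-like differences between adjacent cells
# Masked (NoData) cells are filled with 0 first, which can create artificially steep gradients next to them
gradient_y, gradient_x = np.gradient(masked_raster.filled(0))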
slope_like = np.sqrt(gradient_x**2 + gradient_y**2)

# 4. Normalize raster values (e.g., 0 to 1)
raster_min = masked_raster.min()
raster_max = masked_raster.max()
normalized_raster = (masked_raster - raster_min) / (raster_max - raster_min)

# Plot the normalized results
plt.figure(figsize=(8, 8))
plt.imshow(normalized_raster, cmap='viridis')
plt.title("Normalized Raster")
plt.axis('off')
plt.colorbar(label='Normalized Value')
plt.show()
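
Because the real raster is large, it can be easier to verify these four calculations on a grid where the answers are known in advance. The sketch below uses a made-up tilted plane that rises 40 m per row and 30 m per column, so the gradient magnitude should be 50 in every cell.

```python
import numpy as np

# Hypothetical 3 x 4 elevation grid (meters): a tilted plane rising
# 40 m per row and 30 m per column
rows, cols = np.indices((3, 4))
dem = 950.0 + 40.0 * rows + 30.0 * cols

scaled = dem * 100                               # meters -> centimeters
above_1000 = np.where(dem > 1000, 1, 0)          # binary threshold mask
grad_y, grad_x = np.gradient(dem)                # change per cell along rows, columns
slope_like = np.sqrt(grad_x ** 2 + grad_y ** 2)  # gradient magnitude per cell
normalized = (dem - dem.min()) / (dem.max() - dem.min())

print(above_1000)
print(slope_like[0, 0], normalized.min(), normalized.max())
```

Note that np.gradient measures change per cell by default. To turn this into a slope in physical units, the cell size can be passed as extra arguments, for example np.gradient(dem, cell_size_y, cell_size_x), where the two cell size names are placeholders for your raster's resolution.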

# Exercise: 

Play with these functions to make sure you understand them, and ask if you have any questions. I might not know the answers off the top of my head, but we can explore how we should go about researching and implementing new code.