# Raster Data Extraction

Raster data is often of little use until we extract and summarize it. For instance, extracting raster values at point coordinates lets us pass them to machine learning models for land cover classification or cloud masking.

## Subsetting rasters

We can subset sections of the data array in a number of ways. In this case we create a slice based on row and column location to subset Landsat data, using a `rasterio.windows.Window` to hold the offsets and size.

The window's `col_off`, `row_off`, `width` and `height` attributes are converted to slices and passed to `isel` on the array returned by `rxr.open_rasterio`.

In [9]:
import rioxarray as rxr
import xarray as xr
from rasterio.windows import Window

# Use the actual rgbn file from your data
rgbn = '../../pygis/data/rgbn.tif'

w = Window(row_off=0, col_off=0, height=100, width=100)
src = rxr.open_rasterio(rgbn, chunks=True)
src = src.sel(band=[1, 2, 3]).astype('float32')
src = src.assign_coords(band=['blue', 'green', 'red'])
src = src.isel(x=slice(w.col_off, w.col_off + w.width), 
               y=slice(w.row_off, w.row_off + w.height))
print(src)

<xarray.DataArray (band: 3, y: 100, x: 100)> Size: 120kB
dask.array<getitem, shape=(3, 100, 100), dtype=float32, chunksize=(1, 100, 100), chunktype=numpy.ndarray>
Coordinates:
  * x            (x) float64 800B 7.93e+05 7.93e+05 ... 7.935e+05 7.935e+05
  * y            (y) float64 800B 2.05e+06 2.05e+06 ... 2.05e+06 2.05e+06
    spatial_ref  int64 8B 0
  * band         (band) <U5 60B 'blue' 'green' 'red'
Attributes:
    DataType:            Generic
    AREA_OR_POINT:       Area
    RepresentationType:  ATHEMATIC
    scale_factor:        1.0
    add_offset:          0.0
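
The window arithmetic can be sketched on a plain NumPy array, with no raster file involved. The array and the window values here are made up for illustration; only the offset-plus-size slicing mirrors the cell above.

```python
import numpy as np

# Synthetic 3-band array standing in for a raster, ordered (band, y, x)
data = np.arange(3 * 200 * 300).reshape(3, 200, 300)

# Hypothetical window: offsets and size, mirroring the fields of a Window
row_off, col_off, height, width = 10, 20, 100, 50

# A window is equivalent to one slice per spatial axis
rows = slice(row_off, row_off + height)
cols = slice(col_off, col_off + width)
subset = data[:, rows, cols]

print(subset.shape)
```

The first cell of the subset is the cell at (`row_off`, `col_off`) in the full array, which is a quick way to check that the offsets were not swapped.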


We can also slice a subset of data using a tuple of bounded coordinates.

In [10]:
import rioxarray as rxr

bounds = (793475.76, 2049033.03, 794222.03, 2049527.24)
src = rxr.open_rasterio(rgbn, chunks=True)
src = src.sel(band=[2, 3, 4]).astype('float32')
src = src.assign_coords(band=['green', 'red', 'nir'])
src_clipped = src.rio.clip_box(*bounds)
print(src_clipped)

<xarray.DataArray (band: 3, y: 100, x: 150)> Size: 180kB
dask.array<getitem, shape=(3, 100, 150), dtype=float32, chunksize=(1, 100, 150), chunktype=numpy.ndarray>
Coordinates:
  * x            (x) float64 1kB 7.935e+05 7.935e+05 ... 7.942e+05 7.942e+05
  * y            (y) float64 800B 2.05e+06 2.05e+06 ... 2.049e+06 2.049e+06
  * band         (band) <U5 60B 'green' 'red' 'nir'
    spatial_ref  int64 8B 0
Attributes:
    DataType:            Generic
    AREA_OR_POINT:       Area
    RepresentationType:  ATHEMATIC
    scale_factor:        1.0
    add_offset:          0.0
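
As a rough illustration of what a bounding-box subset does, the sketch below keeps the cells whose centers fall inside the bounds. The helper `cells_in_bounds` and the grid are hypothetical, and the rule rioxarray applies to cells that straddle the edge of the box may differ from this one.

```python
import numpy as np

def cells_in_bounds(x_centers, y_centers, bounds):
    """Return boolean selectors for cells whose centers lie inside bounds.

    bounds is (minx, miny, maxx, maxy), the same order used with clip_box.
    """
    minx, miny, maxx, maxy = bounds
    x_keep = (x_centers >= minx) & (x_centers <= maxx)
    y_keep = (y_centers >= miny) & (y_centers <= maxy)
    return x_keep, y_keep

# Synthetic 10 x 10 grid of 10 m cells: x ascends, y descends (north-up)
x_centers = 5.0 + 10.0 * np.arange(10)   # 5, 15, ..., 95
y_centers = 95.0 - 10.0 * np.arange(10)  # 95, 85, ..., 5
data = np.arange(100).reshape(10, 10)

x_keep, y_keep = cells_in_bounds(x_centers, y_centers, (20.0, 30.0, 60.0, 80.0))
subset = data[np.ix_(y_keep, x_keep)]

print(subset.shape)
```

Because whole cells are selected and nothing is resampled, the subset keeps the cell size and alignment of the source grid.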


`clip_box` selects whole cells from the source grid rather than resampling, so the subset stays aligned with the original raster. By default the bounds are assumed to be in the raster's CRS; if they come from a different CRS, pass it through the `crs` argument. The example below passes the raster's own CRS explicitly and keeps all four bands.

In [11]:
# Alternative method using clip_box with CRS
src = rxr.open_rasterio(rgbn, chunks=True)
src_aligned = src.rio.clip_box(*bounds, crs=src.rio.crs)
print(src_aligned)

<xarray.DataArray (band: 4, y: 100, x: 150)> Size: 60kB
dask.array<getitem, shape=(4, 100, 150), dtype=uint8, chunksize=(1, 100, 150), chunktype=numpy.ndarray>
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * x            (x) float64 1kB 7.935e+05 7.935e+05 ... 7.942e+05 7.942e+05
  * y            (y) float64 800B 2.05e+06 2.05e+06 ... 2.049e+06 2.049e+06
    spatial_ref  int64 8B 0
Attributes:
    DataType:            Generic
    AREA_OR_POINT:       Area
    RepresentationType:  ATHEMATIC
    scale_factor:        1.0
    add_offset:          0.0


## Extracting data by coordinates

To extract values at a coordinate pair, translate the coordinates into array indices. For extraction by geometry, for instance with a shapefile, see [extract by point geometry](#f-rs-extraction-point).

In [13]:
import numpy as np
import rasterio
from rasterio.windows import Window

# Use actual Landsat file
l8_224078_20200518 = '../../pygis/data/LC08_L1TP_224078_20200518_20200518_01_RT.TIF'

# Coordinates in map projection units
y, x = -2823031.15, 761592.60

with rasterio.open(l8_224078_20200518) as dataset:
    # Transform the map coordinates to array indices (row, column)
    j, i = dataset.index(x, y)
    # Check if indices are within bounds
    if 0 <= j < dataset.height and 0 <= i < dataset.width:
        # Read only the target cell for the first 3 bands
        data = dataset.read([1, 2, 3], window=Window(i, j, 1, 1))
    else:
        print(f"Coordinates ({x}, {y}) are out of bounds")
        # Keep an array so the flatten() call below still works
        data = np.zeros(3, dtype=dataset.dtypes[0])

print(data.flatten())

[7448 6882 6090]
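
The coordinate-to-index step can be written out by hand for the simplest case. This sketch assumes a north-up raster with square cells and no rotation, whose origin is the upper-left corner; `xy_to_rowcol` and the numbers are hypothetical and are not taken from the Landsat file.

```python
import math

def xy_to_rowcol(x, y, x_origin, y_origin, cell_size):
    """Convert map coordinates to (row, col) for a north-up grid.

    x_origin and y_origin are the upper-left corner of the raster.
    Columns grow with x, rows grow as y decreases.
    """
    col = math.floor((x - x_origin) / cell_size)
    row = math.floor((y_origin - y) / cell_size)
    return row, col

# Hypothetical 30 m grid with its upper-left corner at (1000, 2000)
row, col = xy_to_rowcol(1045.0, 1960.0, x_origin=1000.0, y_origin=2000.0, cell_size=30.0)
print(row, col)
```

A negative row or column, or one beyond the raster's height or width, means the coordinate lies outside the raster, which is why the cell above checks the bounds before reading.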


A latitude/longitude pair can be extracted after converting to the map projection.

In [14]:
import numpy as np
import rasterio
from rasterio.windows import Window
from pyproj import Transformer

# Coordinates in latitude/longitude
lat, lon = -25.50142964, -54.39756038

with rasterio.open(l8_224078_20200518) as dataset:
    # Transform the coordinates to map units (always_xy expects lon, lat order)
    transformer = Transformer.from_crs("EPSG:4326", dataset.crs, always_xy=True)
    x, y = transformer.transform(lon, lat)
    # Transform the map coordinates to array indices (row, column)
    j, i = dataset.index(x, y)
    # Check if indices are within bounds
    if 0 <= j < dataset.height and 0 <= i < dataset.width:
        # Read only the target cell for the first 3 bands
        data = dataset.read([1, 2, 3], window=Window(i, j, 1, 1))
    else:
        print(f"Coordinates ({x}, {y}) are out of bounds")
        # Keep an array so the flatten() call below still works
        data = np.zeros(3, dtype=dataset.dtypes[0])

print(data.flatten())

[7448 6882 6090]


(f_rs_extraction_point)=
## Extracting data with point geometry

In the example below, 'LC08_L1TP_224078_20200518_20200518_01_RT_points.gpkg' is a [GeoPackage](https://www.geopackage.org/) of point locations, and the output `df` is a [GeoPandas GeoDataFrame](https://geopandas.org/docs/reference/api/geopandas.GeoDataFrame.html?highlight=geodataframe#geopandas.GeoDataFrame). To extract the raster values at the point locations, convert each point to a row and column index with rasterio and read the cell values from the array.

In [16]:
import rioxarray as rxr
import geopandas as gpd
import rasterio
import pandas as pd

l8_224078_20200518_points = '../../pygis/data/LC08_L1TP_224078_20200518_20200518_01_RT_points.gpkg'

# Read points
points_gdf = gpd.read_file(l8_224078_20200518_points)

# Extract values at point locations by converting each point to a row/column index
results = []
with rasterio.open(l8_224078_20200518) as src:
    bands = src.read([1, 2, 3])  # Read the first 3 bands once, not once per point
    for idx, point in points_gdf.iterrows():
        geom = point.geometry
        if geom.geom_type == 'Point':
            row, col = src.index(geom.x, geom.y)
            if 0 <= row < src.height and 0 <= col < src.width:
                values = bands[:, row, col]
                result = {'1': values[0], '2': values[1], '3': values[2]}
                # Add original attributes
                for col_name in points_gdf.columns:
                    if col_name != 'geometry':
                        result[col_name] = point[col_name]
                result['geometry'] = geom
                results.append(result)

df = pd.DataFrame(results)
# Carry the CRS of the input points over to the output
df = gpd.GeoDataFrame(df, geometry='geometry', crs=points_gdf.crs)
print(df)

      1     2     3       name                         geometry
0  7966  7326  6254      water  POINT (741522.314 -2811204.698)
1  8030  7490  8080       crop  POINT (736140.845 -2806478.364)
2  7561  6874  6106       tree  POINT (745919.508 -2805168.579)
3  8302  8202  8111  developed  POINT (739056.735 -2811710.662)
4  8277  7982  7341      water  POINT (737802.183 -2818016.412)
5  7398  6711  6007       tree   POINT (759209.443 -2828566.23)
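
Once row and column indices are known for every point, the per-point loop can be replaced by a single indexing operation. The sketch below uses a synthetic array and made-up indices, so it only illustrates the indexing pattern, not the Landsat values above.

```python
import numpy as np

# Synthetic raster ordered (band, row, col)
arr = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# Hypothetical row/column indices for four points; the last one is off the grid
rows = np.array([0, 2, 3, 9])
cols = np.array([1, 4, 0, 0])

# Keep only the points that fall inside the array
inside = (rows >= 0) & (rows < arr.shape[1]) & (cols >= 0) & (cols < arr.shape[2])

# Index all bands at all valid points at once, then put points on the rows
values = arr[:, rows[inside], cols[inside]].T

print(values.shape)
```

Each row of `values` holds one point and each column one band, which is the same layout as the table printed above.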


In the previous example, the point vector had a CRS that matched the raster. If the CRS does not match, the points must be reprojected to the raster CRS before their values are read; in the example below, `to_crs` performs that coordinate transformation.

In [17]:
import rioxarray as rxr
import geopandas as gpd
import rasterio
import pandas as pd

# Read points and check CRS
point_df = gpd.read_file(l8_224078_20200518_points)
print(point_df.crs)

# Transform the CRS to WGS84 lat/lon
point_df = point_df.to_crs('epsg:4326')
print(point_df.crs)

# Extract values at point locations
results = []
with rasterio.open(l8_224078_20200518) as src:
    # Transform points back to raster CRS for sampling
    points_in_raster_crs = point_df.to_crs(src.crs)
    bands = src.read([1, 2, 3])  # Read the first 3 bands once, not once per point

    for idx, point in points_in_raster_crs.iterrows():
        geom = point.geometry
        if geom.geom_type == 'Point':
            row, col = src.index(geom.x, geom.y)
            if 0 <= row < src.height and 0 <= col < src.width:
                values = bands[:, row, col]
                result = {'1': values[0], '2': values[1], '3': values[2]}
                # Add original attributes from the original point_df
                # idx is an index label from iterrows, so look it up with .loc
                original_point = point_df.loc[idx]
                for col_name in point_df.columns:
                    if col_name != 'geometry':
                        result[col_name] = original_point[col_name]
                result['geometry'] = original_point.geometry
                results.append(result)

df = pd.DataFrame(results)
df = gpd.GeoDataFrame(df, geometry='geometry', crs='epsg:4326')
print(df)

EPSG:32621
epsg:4326
      1     2     3       name                     geometry
0  7966  7326  6254      water   POINT (-54.5992 -25.39813)
1  8030  7490  8080       crop  POINT (-54.65348 -25.35635)
2  7561  6874  6106       tree  POINT (-54.55662 -25.34295)
3  8302  8202  8111  developed    POINT (-54.6236 -25.4031)
4  8277  7982  7341      water  POINT (-54.63496 -25.46019)
5  7398  6711  6007       tree  POINT (-54.42018 -25.55178)


Set band names for readability by labelling the first three bands blue, green and red.

In [18]:
import rioxarray as rxr
import geopandas as gpd
import rasterio
import pandas as pd

# Extract with named bands
results = []
with rasterio.open(l8_224078_20200518) as src:
    points_gdf = gpd.read_file(l8_224078_20200518_points)
    bands = src.read([1, 2, 3])  # Read the first 3 bands once, not once per point

    for idx, point in points_gdf.iterrows():
        geom = point.geometry
        if geom.geom_type == 'Point':
            row, col = src.index(geom.x, geom.y)
            if 0 <= row < src.height and 0 <= col < src.width:
                values = bands[:, row, col]
                result = {'blue': values[0], 'green': values[1], 'red': values[2]}
                # Add original attributes
                for col_name in points_gdf.columns:
                    if col_name != 'geometry':
                        result[col_name] = point[col_name]
                result['geometry'] = geom
                results.append(result)

df = pd.DataFrame(results)
df = gpd.GeoDataFrame(df, geometry='geometry')
print(df)

   blue  green   red       name                         geometry
0  7966   7326  6254      water  POINT (741522.314 -2811204.698)
1  8030   7490  8080       crop  POINT (736140.845 -2806478.364)
2  7561   6874  6106       tree  POINT (745919.508 -2805168.579)
3  8302   8202  8111  developed  POINT (739056.735 -2811710.662)
4  8277   7982  7341      water  POINT (737802.183 -2818016.412)
5  7398   6711  6007       tree   POINT (759209.443 -2828566.23)


## Extracting time series images by point geometry

We can also extract a time series of raster images. Extracted pixel values are returned in 'wide' format with labelled columns; for instance, the column 't2_blue' holds the blue band for the second time period.

In [19]:
import rioxarray as rxr
import xarray as xr
import geopandas as gpd
import rasterio
import pandas as pd

# Load raster twice to simulate time series
src1 = rxr.open_rasterio(l8_224078_20200518, chunks=True)
src2 = rxr.open_rasterio(l8_224078_20200518, chunks=True)

# Concatenate along a time dimension
src = xr.concat([src1, src2], dim='time')
src = src.assign_coords(time=['t1', 't2'])
src = src.assign_coords(band=['blue', 'green', 'red'])

# Extract time series at points
points_gdf = gpd.read_file(l8_224078_20200518_points)

results = []
with rasterio.open(l8_224078_20200518) as raster_src:
    bands = raster_src.read([1, 2, 3])  # Read the first 3 bands once
    for idx, point in points_gdf.iterrows():
        geom = point.geometry
        if geom.geom_type == 'Point':
            row, col = raster_src.index(geom.x, geom.y)
            if 0 <= row < raster_src.height and 0 <= col < raster_src.width:
                # Extract for both time periods (the same image is used twice for this demo)
                values_t1 = bands[:, row, col]
                values_t2 = bands[:, row, col]
                
                result = {
                    't1_blue': values_t1[0], 't1_green': values_t1[1], 't1_red': values_t1[2],
                    't2_blue': values_t2[0], 't2_green': values_t2[1], 't2_red': values_t2[2]
                }
                # Add original attributes
                for col_name in points_gdf.columns:
                    if col_name != 'geometry':
                        result[col_name] = point[col_name]
                result['geometry'] = geom
                results.append(result)

df = pd.DataFrame(results)
df = gpd.GeoDataFrame(df, geometry='geometry')
print(df)

   t1_blue  t1_green  t1_red  t2_blue  t2_green  t2_red       name  \
0     7966      7326    6254     7966      7326    6254      water   
1     8030      7490    8080     8030      7490    8080       crop   
2     7561      6874    6106     7561      6874    6106       tree   
3     8302      8202    8111     8302      8202    8111  developed   
4     8277      7982    7341     8277      7982    7341      water   
5     7398      6711    6007     7398      6711    6007       tree   

                          geometry  
0  POINT (741522.314 -2811204.698)  
1  POINT (736140.845 -2806478.364)  
2  POINT (745919.508 -2805168.579)  
3  POINT (739056.735 -2811710.662)  
4  POINT (737802.183 -2818016.412)  
5   POINT (759209.443 -2828566.23)  
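
The 'wide' column labels can also be generated rather than typed out, which helps once there are more than two dates. This is a minimal sketch for a single point, assuming the values are held in a (time, band) array; the numbers repeat the first row of the table above.

```python
import numpy as np

times = ['t1', 't2']
bands = ['blue', 'green', 'red']

# Values for one point, ordered (time, band)
values = np.array([[7966, 7326, 6254],
                   [7966, 7326, 6254]])

# One column per time/band pair, labelled '<time>_<band>'
record = {
    f"{t}_{b}": int(values[ti, bi])
    for ti, t in enumerate(times)
    for bi, b in enumerate(bands)
}

print(list(record))
```

Iterating over time in the outer loop keeps all bands of one date next to each other, matching the column order used in the extraction above.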


## Extracting data by polygon geometry

To extract values within polygons, use rioxarray clipping operations with polygon geometries.

In [21]:
import rioxarray as rxr
import geopandas as gpd
import numpy as np
import pandas as pd

l8_224078_20200518_polygons = '../../pygis/data/LC08_L1TP_224078_20200518_20200518_01_RT_polygons.gpkg'

# Read polygons
polygons_gdf = gpd.read_file(l8_224078_20200518_polygons)

# Load raster
src = rxr.open_rasterio(l8_224078_20200518, chunks=True)
src = src.sel(band=[1, 2, 3]).assign_coords(band=['blue', 'green', 'red'])

# Extract pixel values for each polygon
all_results = []

for idx, polygon_row in polygons_gdf.iterrows():
    polygon_geom = polygon_row.geometry
    polygon_id = polygon_row.get('id', idx)
    polygon_name = polygon_row.get('name', f'polygon_{idx}')
    
    try:
        # Clip raster to polygon
        clipped = src.rio.clip([polygon_geom], polygons_gdf.crs)
        
        # Get coordinates of valid pixels
        y_coords, x_coords = np.meshgrid(clipped.y.values, clipped.x.values, indexing='ij')
        
        # Flag valid (non-NaN) pixels. Integer rasters cannot hold NaN, so for
        # integer data this mask keeps every cell in the clipped window
        valid_mask = ~np.isnan(clipped.isel(band=0).values)
        
        if valid_mask.any():
            y_flat = y_coords[valid_mask]
            x_flat = x_coords[valid_mask]
            
            # Extract values for each band
            blue_vals = clipped.sel(band='blue').values[valid_mask]
            green_vals = clipped.sel(band='green').values[valid_mask]
            red_vals = clipped.sel(band='red').values[valid_mask]
            
            # Create point results for each pixel
            for i in range(len(y_flat)):
                from shapely.geometry import Point
                point_geom = Point(x_flat[i], y_flat[i])
                
                result = {
                    'id': polygon_id,
                    'point': i,
                    'geometry': point_geom,
                    'name': polygon_name,
                    'blue': blue_vals[i],
                    'green': green_vals[i],
                    'red': red_vals[i]
                }
                all_results.append(result)
                
    except Exception as e:
        print(f"Error processing polygon {idx}: {e}")
        continue

# Create final DataFrame
df = pd.DataFrame(all_results)
df = gpd.GeoDataFrame(df, geometry='geometry')
print(df)

      id  point                 geometry       name  blue  green   red
0      0      0  POINT (737550 -2795250)      water  7994   7423  6272
1      0      1  POINT (737580 -2795250)      water  8017   7428  6292
2      0      2  POINT (737610 -2795250)      water  8008   7446  6292
3      0      3  POINT (737640 -2795250)      water  8008   7412  6291
4      0      4  POINT (737670 -2795250)      water  8018   7398  6250
...   ..    ...                      ...        ...   ...    ...   ...
1030   3     85  POINT (739050 -2811840)  developed  8329   8216  8239
1031   3     86  POINT (739080 -2811840)  developed  8810   8828  8746
1032   3     87  POINT (739110 -2811840)  developed     0      0     0
1033   3     88  POINT (739140 -2811840)  developed     0      0     0
1034   3     89  POINT (739170 -2811840)  developed     0      0     0

[1035 rows x 7 columns]
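
The trailing rows with 0 in every band may be cells that lie inside the polygon's bounding box but outside the polygon itself. That is an assumption about this dataset, not something the output confirms. The general point is that an integer array cannot store NaN, so a NaN test only separates inside from outside once the data is floating point. A minimal sketch with a synthetic band and a made-up polygon mask:

```python
import numpy as np

# Synthetic integer band and a hypothetical inside-polygon mask
band = np.array([[8329, 8810],
                 [7000, 7500]], dtype='uint16')
inside_polygon = np.array([[True, True],
                           [False, False]])

# On integer data the NaN test never flags anything
print(np.isnan(band).any())

# Cast to float first so cells outside the polygon can be set to NaN
masked = np.where(inside_polygon, band.astype('float32'), np.nan)
valid = ~np.isnan(masked)

print(masked[valid])
```

If zeros like these are in fact outside-polygon cells, they would also pull down any per-polygon statistic computed from the table.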

## Calculate mean pixel value by polygon

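It is then simple to calculate the mean value of pixels within each polygon by grouping on the polygon `id` column with the pandas `groupby` function. Other statistics such as min, max or median can be calculated the same way. Note that `point` is only a per-polygon pixel counter, so its mean has no physical meaning.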

In [22]:
# Calculate mean values by polygon using the previous extraction results
if len(all_results) > 0:
    df_stats = df.drop(columns=['geometry']).groupby(['id', 'name']).agg({
        'point': 'mean',
        'blue': 'mean',
        'green': 'mean', 
        'red': 'mean'
    })
    print(df_stats)
else:
    print("No valid polygon extractions found")

              point         blue        green          red
id name                                                   
0  water      110.0  7664.425339  7086.855204  6009.547511
1  crop       241.5  3051.607438  2791.654959  3002.904959
2  tree       119.5  6191.087500  5636.945833  5022.350000
3  developed   44.5  7804.111111  7458.033333  7499.144444
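
Several statistics can be requested in one pass by giving `agg` a list of function names. The table below is synthetic and much smaller than the extraction result; only the column names `id`, `name` and `blue` are borrowed from it.

```python
import pandas as pd

# Synthetic per-pixel table with the same key columns as the extraction result
df_demo = pd.DataFrame({
    'id': [0, 0, 0, 1, 1],
    'name': ['water', 'water', 'water', 'crop', 'crop'],
    'blue': [10, 20, 60, 5, 15],
})

# One row per polygon, one column per statistic
stats = df_demo.groupby(['id', 'name'])['blue'].agg(['min', 'max', 'median', 'mean'])

print(stats)
```

Selecting the band column before calling `agg` keeps helper columns such as a pixel counter out of the summary.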
