--------------
```{admonition} Learning Objectives
  - Subset raster data using spatial windows and coordinate bounds
  - Extract pixel values at specific coordinate locations using rasterio
  - Sample raster data at point geometries with coordinate transformations
  - Aggregate raster values within polygon boundaries using clipping operations
  - Handle coordinate system transformations during extraction workflows
```
--------------


# Raster Data Extraction with Rioxarray and Rasterio

Raster data becomes valuable when we can extract and summarize information from specific locations or areas. This includes extracting pixel values at point locations for analysis, sampling data for machine learning models, or calculating statistics within defined boundaries.

## Spatial Subsetting Using Windows

Raster subsetting lets you work with a smaller, focused area of a large dataset. This example uses a rasterio `Window` to define row and column offsets, then applies them to the array with `isel()`.

In [None]:
import rioxarray as rxr
import xarray as xr
import numpy as np
from rasterio.windows import Window
import rasterio
from rasterio.transform import from_bounds

# Create sample RGB data similar to rgbn
height, width = 500, 500
rgbn_data = np.random.randint(0, 255, (3, height, width), dtype=np.uint8)
transform = from_bounds(-120, 35, -119, 36, width, height)

# Save as temporary file
rgbn = '/tmp/rgbn.tif'
with rasterio.open(rgbn, 'w', driver='GTiff', height=height, width=width,
                   count=3, dtype=rgbn_data.dtype, crs='EPSG:4326',
                   transform=transform) as dst:
    dst.write(rgbn_data)

w = Window(row_off=0, col_off=0, height=100, width=100)
src = rxr.open_rasterio(rgbn, chunks=True)
src = src.sel(band=[1, 2, 3]).astype('float32')
src = src.assign_coords(band=['blue', 'green', 'red'])
src = src.isel(x=slice(w.col_off, w.col_off + w.width), 
               y=slice(w.row_off, w.row_off + w.height))
print(src)
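
The window arithmetic above can be sketched without any raster library. The helper names below (`window_to_slices`, `window_bounds`) are hypothetical, and the sketch assumes a north-up raster with no rotation; it only illustrates how window offsets map to array slices and to map coordinates.

```python
def window_to_slices(col_off, row_off, width, height):
    """Return (row_slice, col_slice) selecting the window from a (y, x) array."""
    return slice(row_off, row_off + height), slice(col_off, col_off + width)


def window_bounds(col_off, row_off, width, height, left, top, xres, yres):
    """Return (left, bottom, right, top) of a window in a north-up raster.

    `left` and `top` are the raster's upper-left corner; `xres` and `yres`
    are positive pixel sizes (hypothetical argument names).
    """
    w_left = left + col_off * xres
    w_top = top - row_off * yres
    return (w_left, w_top - height * yres, w_left + width * xres, w_top)


# Values matching the sample raster: 500 x 500 pixels over (-120, 35, -119, 36)
xres = (-119 - -120) / 500
yres = (36 - 35) / 500

rows, cols = window_to_slices(0, 0, 100, 100)
bounds = window_bounds(0, 0, 100, 100, left=-120, top=36, xres=xres, yres=yres)
print(rows, cols)
print(bounds)
```

Row offsets count downward from the top edge, which is why the window's top coordinate decreases as `row_off` grows.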

## Alternative Subsetting Methods

Rioxarray can also subset by coordinate bounds with `rio.clip_box()`. The bounds are given as `(minx, miny, maxx, maxy)` and must be in the raster's CRS unless you pass the `crs` argument. This method is particularly useful when working with known geographic extents.

```python
# Using coordinate bounds (minx, miny, maxx, maxy) in the raster's CRS
bounds = (793475.76, 2049033.03, 794222.03, 2049527.24)
src_clipped = src.rio.clip_box(*bounds)

# open_rasterio() has no clip-on-open argument, so open lazily
# with chunks=True and clip straight afterwards
src = rxr.open_rasterio('file.tif', chunks=True)
src_clipped = src.rio.clip_box(*bounds)
```
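
Conceptually, subsetting by coordinate bounds keeps the rows and columns whose coordinates fall inside the box. The sketch below shows that idea with plain NumPy on a tiny made-up grid; it is a simplification, and the library's handling of pixels that straddle the boundary may differ.

```python
import numpy as np

# Pixel-center coordinates for a small 4 x 5 north-up grid (hypothetical values)
x = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
y = np.array([3.5, 2.5, 1.5, 0.5])  # descending, as in most north-up rasters
data = np.arange(20).reshape(4, 5)

minx, miny, maxx, maxy = 1.0, 1.0, 4.0, 3.0

# Keep the rows and columns whose centers fall inside the bounds
col_mask = (x >= minx) & (x <= maxx)
row_mask = (y >= miny) & (y <= maxy)
subset = data[np.ix_(row_mask, col_mask)]

print(subset)
```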

## Point-based Data Extraction
 
To extract raster values at specific coordinate locations, you need to convert map coordinates to array indices. Rasterio's `dataset.index(x, y)` performs this conversion using the raster's affine transform and returns the indices as `(row, col)`.

In [None]:
import rioxarray as rxr
import rasterio
import numpy as np
from rasterio.transform import from_bounds

# Create sample Landsat data
height, width = 1000, 1000
landsat_data = np.random.randint(0, 4096, (7, height, width), dtype=np.uint16)
transform = from_bounds(500000, 4000000, 530000, 4030000, width, height)  # UTM coordinates

l8_224078_20200518 = '/tmp/l8_224078_20200518.tif'
with rasterio.open(l8_224078_20200518, 'w', driver='GTiff', height=height, width=width,
                   count=7, dtype=landsat_data.dtype, crs='EPSG:32633',
                   transform=transform) as dst:
    dst.write(landsat_data)

# Coordinates in map projection units - use center of raster
with rasterio.open(l8_224078_20200518) as dataset:
    bounds = dataset.bounds
    y, x = (bounds.bottom + bounds.top) / 2, (bounds.left + bounds.right) / 2
    # Transform the map coordinates to data indices
    j, i = dataset.index(x, y)
    # Check if indices are within bounds
    if 0 <= j < dataset.height and 0 <= i < dataset.width:
        data = dataset.read()[:, j, i]
    else:
        print(f"Coordinates ({x}, {y}) are out of bounds")
        data = np.full(dataset.count, dataset.nodata or 0)
print(data.flatten())
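
For a north-up raster, the coordinate-to-index conversion reduces to two divisions. The sketch below uses a hypothetical helper, `xy_to_rowcol`, with the extent and pixel size of the synthetic raster above; it assumes no rotation and is meant to illustrate the arithmetic, not to replace `dataset.index()`.

```python
import math


def xy_to_rowcol(x, y, left, top, xres, yres):
    """Convert map coordinates to (row, col) for a north-up raster.

    `left` and `top` are the upper-left corner; `xres` and `yres` are
    positive pixel sizes (hypothetical argument names).
    """
    row = math.floor((top - y) / yres)
    col = math.floor((x - left) / xres)
    return row, col


# Synthetic raster above: 1000 x 1000 pixels over (500000, 4000000, 530000, 4030000)
left, top = 500000, 4030000
xres = yres = 30000 / 1000  # 30 map units per pixel

row, col = xy_to_rowcol(515000, 4015000, left, top, xres, yres)
print(row, col)

# A point just west of the raster yields a negative column index
outside = xy_to_rowcol(499999, 4015000, left, top, xres, yres)
print(outside)
```

Using `math.floor` rather than `int()` matters for points slightly outside the raster: truncation toward zero would map them to index 0 and hide the out-of-bounds condition.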

## Extracting Data Using Lat/Lon Coordinates

When working with latitude/longitude coordinates, you need to transform them to the raster's projection system before extraction. This example uses pyproj for coordinate transformation.

In [None]:
import rioxarray as rxr
import rasterio
from pyproj import Transformer
import numpy as np
from rasterio.transform import from_bounds

# Use the same sample data
height, width = 1000, 1000
landsat_data = np.random.randint(0, 4096, (7, height, width), dtype=np.uint16)
transform = from_bounds(500000, 4000000, 530000, 4030000, width, height)

l8_224078_20200518 = '/tmp/l8_224078_20200518.tif'
with rasterio.open(l8_224078_20200518, 'w', driver='GTiff', height=height, width=width,
                   count=7, dtype=landsat_data.dtype, crs='EPSG:32633',
                   transform=transform) as dst:
    dst.write(landsat_data)

# Get actual center coordinates in lat/lon
with rasterio.open(l8_224078_20200518) as dataset:
    bounds = dataset.bounds
    center_x, center_y = (bounds.left + bounds.right) / 2, (bounds.bottom + bounds.top) / 2
    transformer = Transformer.from_crs(dataset.crs, "EPSG:4326", always_xy=True)
    lon, lat = transformer.transform(center_x, center_y)

with rasterio.open(l8_224078_20200518) as dataset:
    # Transform the coordinates to map units
    transformer = Transformer.from_crs("EPSG:4326", dataset.crs, always_xy=True)
    x, y = transformer.transform(lon, lat)
    # Transform the map coordinates to data indices
    j, i = dataset.index(x, y)
    # Check if indices are within bounds
    if 0 <= j < dataset.height and 0 <= i < dataset.width:
        data = dataset.read()[:, j, i]
    else:
        print(f"Coordinates ({x}, {y}) are out of bounds")
        data = np.full(dataset.count, dataset.nodata or 0)
print(data.flatten())

(f_rs_extraction_point)=
## Extracting Data from Point Geometries 

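When working with point geometries (like GPS locations or field sampling sites), you can subset the raster to the area covering the points with `rio.clip_box()`. Passing the points' CRS through the `crs` argument lets rioxarray reproject the bounding box to the raster's CRS. Note that the result is the block of pixels inside the points' bounding box, not one value per point.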

In [None]:
import rioxarray as rxr
import geopandas as gpd
from shapely.geometry import Point
import pandas as pd

# Create sample points
points = [Point(510000, 4010000), Point(520000, 4020000)]
l8_224078_20200518_points = gpd.GeoDataFrame(
    {'id': [1, 2], 'name': ['point1', 'point2']}, 
    geometry=points, 
    crs='EPSG:32633'
)

src = rxr.open_rasterio(l8_224078_20200518, chunks=True)
df = src.rio.clip_box(*l8_224078_20200518_points.total_bounds, crs=l8_224078_20200518_points.crs)
print(df)

## Handling Coordinate System Mismatches

When point geometries are in a different coordinate system than the raster, reproject them to the raster's CRS before looking up pixel values. GeoPandas' `to_crs()` handles the reprojection, so the points can be stored in any projection.

In [None]:
import rioxarray as rxr
import geopandas as gpd
import rasterio
from shapely.geometry import Point
import pandas as pd

# Create sample points
points = [Point(510000, 4010000)]
l8_224078_20200518_points = gpd.GeoDataFrame(
    {'id': [1], 'name': ['point1']}, 
    geometry=points, 
    crs='EPSG:32633'
)

point_df = l8_224078_20200518_points.copy()
print(point_df.crs)
# Transform the CRS to WGS84 lat/lon
point_df = point_df.to_crs('epsg:4326')
print(point_df.crs)

# Extract values at point locations by converting coordinates to array indices
results = []
with rasterio.open(l8_224078_20200518) as src:
    # Transform points back to raster CRS for sampling
    points_in_raster_crs = point_df.to_crs(src.crs)
    # Read the raster once rather than once per point
    data = src.read()

    for idx, point in points_in_raster_crs.iterrows():
        geom = point.geometry
        if geom.geom_type == 'Point':
            row, col = src.index(geom.x, geom.y)
            if 0 <= row < src.height and 0 <= col < src.width:
                values = data[:, row, col]
                result = {f'band_{i+1}': val for i, val in enumerate(values)}
                # Carry over the point's attribute columns
                result.update({name: point[name] for name in point_df.columns if name != 'geometry'})
                results.append(result)

df = pd.DataFrame(results)
print(df)
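
With many points, the per-point loop can be replaced by a single NumPy indexing operation once the row and column indices are known. The sketch below uses random data and made-up indices; the names `rows`, `cols`, and `inside` are illustrative only. Rasterio also offers a `sample()` method on open datasets, which may be a better fit when the raster is too large to read into memory.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic raster laid out as (band, row, col)
data = rng.integers(0, 4096, size=(7, 100, 100), dtype=np.uint16)

# Row/column indices for three points; the last one is out of bounds
rows = np.array([10, 50, 120])
cols = np.array([20, 50, 30])

# Keep only the points that fall inside the array
inside = (
    (rows >= 0) & (rows < data.shape[1]) &
    (cols >= 0) & (cols < data.shape[2])
)

# Fancy indexing returns (band, point); transpose to (point, band)
values = data[:, rows[inside], cols[inside]].T
print(values.shape)
```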

## Working with Real-World Point Data

This example builds a point from a real Landsat scene. It reads the raster's bounds and CRS and places a point at the center, which guarantees that the point falls inside the raster. The resulting GeoDataFrame shares the raster's CRS, so no reprojection is needed before extraction.

In [None]:
import geopandas as gpd
from shapely.geometry import Point
import rasterio

# Open raster
raster_path = "../../pygis/data/LC08_L1TP_224078_20200518_20200518_01_RT.TIF"
with rasterio.open(raster_path) as src:
    bounds = src.bounds
    crs = src.crs

# Create a point at the center of the raster
center_x = (bounds.left + bounds.right) / 2
center_y = (bounds.top + bounds.bottom) / 2
point = Point(center_x, center_y)

# Create GeoDataFrame
gdf = gpd.GeoDataFrame({'id': [1]}, geometry=[point], crs=crs)

print("Point within raster:", point)
print("CRS:", crs)
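
For points that are not derived from the raster itself, a bounds check before extraction avoids silent out-of-range lookups. The helper below, `point_in_bounds`, is hypothetical, and the bounds are those of the synthetic raster used earlier rather than the Landsat scene.

```python
def point_in_bounds(x, y, left, bottom, right, top):
    """Return True if (x, y) lies within the bounding box, edges included."""
    return left <= x <= right and bottom <= y <= top


# Bounds of the synthetic raster used earlier: (left, bottom, right, top)
bounds = (500000, 4000000, 530000, 4030000)

# The center point is always inside
center_x = (bounds[0] + bounds[2]) / 2
center_y = (bounds[1] + bounds[3]) / 2
print(point_in_bounds(center_x, center_y, *bounds))

# A point east of the raster is rejected
print(point_in_bounds(540000, 4015000, *bounds))
```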

## Time Series Extraction from Multiple Images

For temporal analysis, you can extract pixel values from multiple raster images representing different time periods. This example stacks the rasters along a new `time` dimension, labels each time step, and clips the stack to the point's location. The same image is loaded twice here only to simulate two dates.

In [None]:
import rioxarray as rxr
import xarray as xr
import geopandas as gpd
from shapely.geometry import Point

# Path to your raster
l8_224078_20200518 = "../../pygis/data/LC08_L1TP_224078_20200518_20200518_01_RT.TIF"

# Create a valid point in the same CRS and within raster bounds
# We'll use a point near the center of the raster
point = Point(748000, -2800000)  # EPSG:32621
points_gdf = gpd.GeoDataFrame(
    {'id': [1], 'name': ['point1']}, 
    geometry=[point], 
    crs='EPSG:32621'
)

# Load raster twice to simulate time stacking
src1 = rxr.open_rasterio(l8_224078_20200518, chunks=True)
src2 = rxr.open_rasterio(l8_224078_20200518, chunks=True)

# Concatenate along a time dimension
src = xr.concat([src1, src2], dim='time')
src = src.assign_coords(time=['t1', 't2'])

# Clip using bounding box of the point
df = src.rio.clip_box(
    *points_gdf.total_bounds, 
    crs=points_gdf.crs,
    allow_one_dimensional_raster=True  # A point's bounding box has zero area, so permit a one-pixel-wide result
)

print(df)
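
The stacking step can be illustrated with plain NumPy: each image becomes one slice along a new leading axis, and a pixel's time series is read by fixing its row and column. The arrays and indices below are random placeholders, not Landsat values.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic images laid out as (band, row, col)
t1 = rng.integers(0, 4096, size=(3, 50, 50))
t2 = rng.integers(0, 4096, size=(3, 50, 50))

# Stack along a new leading time axis: (time, band, row, col)
stack = np.stack([t1, t2], axis=0)

# Time series for one pixel: one row per time step, one column per band
row, col = 25, 25
series = stack[:, :, row, col]
print(series.shape)
```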

## Polygon-based Raster Extraction

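Polygon-based extraction lets you calculate statistics (mean, sum, etc.) for all pixels that fall within polygon boundaries. This is valuable for analyzing data within administrative boundaries, land parcels, or ecological zones. The cell below first repeats the point-sampling workflow against the Landsat scene; the polygon aggregation itself is covered under Calculating Zonal Statistics.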

In [None]:
import rioxarray as rxr
import geopandas as gpd
import rasterio
from shapely.geometry import Point
import pandas as pd

# Create a sample point; the coordinates are in the raster's CRS (EPSG:32621)
points = [Point(748000, -2800000)]
l8_224078_20200518_points = gpd.GeoDataFrame(
    {'id': [1], 'name': ['point1']}, 
    geometry=points, 
    crs='EPSG:32621'
)

point_df = l8_224078_20200518_points.copy()
print(point_df.crs)
# Transform the CRS to WGS84 lat/lon
point_df = point_df.to_crs('epsg:4326')
print(point_df.crs)

# Extract values at point locations by converting coordinates to array indices
results = []
with rasterio.open(l8_224078_20200518) as src:
    # Transform points back to raster CRS for sampling
    points_in_raster_crs = point_df.to_crs(src.crs)
    # Read the raster once rather than once per point
    data = src.read()

    for idx, point in points_in_raster_crs.iterrows():
        geom = point.geometry
        if geom.geom_type == 'Point':
            row, col = src.index(geom.x, geom.y)
            if 0 <= row < src.height and 0 <= col < src.width:
                values = data[:, row, col]
                result = {f'band_{i+1}': val for i, val in enumerate(values)}
                # Carry over the point's attribute columns
                result.update({name: point[name] for name in point_df.columns if name != 'geometry'})
                results.append(result)

df = pd.DataFrame(results)
print(df)

### Calculating Zonal Statistics

Zonal statistics calculate summary metrics (mean, median, min, max) for raster pixels within polygon boundaries. This example builds a test polygon inset 20% from each edge of the raster, clips the raster to it with `rio.clip()`, and averages each band over the `x` and `y` dimensions.

In [None]:
from shapely.geometry import box
import rioxarray as rxr
import geopandas as gpd
import numpy as np

# Load raster
raster_path = "../../pygis/data/LC08_L1TP_224078_20200518_20200518_01_RT.TIF"
src = rxr.open_rasterio(raster_path, chunks=True)
xmin, ymin, xmax, ymax = src.rio.bounds()
raster_crs = src.rio.crs

# Calculate valid inner box (using 20% reduction from edges)
width = xmax - xmin
height = ymax - ymin
buffer_x = width * 0.20  # Reduce 20% from left/right
buffer_y = height * 0.20  # Reduce 20% from top/bottom

# Create valid inner polygon
inner_box = box(
    xmin + buffer_x,
    ymin + buffer_y,
    xmax - buffer_x,
    ymax - buffer_y
)
polygons = gpd.GeoDataFrame(
    {'id': [1]}, 
    geometry=[inner_box], 
    crs=raster_crs
)

# Assign band names
band_names = [f'band_{i+1}' for i in range(len(src.band))]
src = src.assign_coords(band=band_names)

# Clip and compute mean (more efficient method)
clipped = src.rio.clip(polygons.geometry, polygons.crs)

# Compute mean directly without converting to DataFrame
band_means = clipped.mean(dim=['x', 'y'])

# Convert to DataFrame with proper naming
df_mean = band_means.to_dataframe(name='mean_value').reset_index()
df_mean = df_mean[['band', 'mean_value']]

print(df_mean)
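
The aggregation behind a zonal mean can be sketched with a boolean mask in NumPy. The example below is a minimal illustration on a tiny made-up array; the zone mask, the `nodata` value, and the variable names are all assumptions, and it shows one way to keep nodata pixels out of the average.

```python
import numpy as np

# Synthetic 2-band raster laid out as (band, row, col)
data = np.arange(2 * 4 * 4, dtype="float64").reshape(2, 4, 4)

# Boolean zone mask: True for pixels inside the polygon (here a 2 x 2 block)
zone = np.zeros((4, 4), dtype=bool)
zone[1:3, 1:3] = True

# Hypothetical nodata value that should not contribute to the statistics
nodata = 5.0

# Valid pixels are inside the zone and not nodata; the mask broadcasts over bands
valid = zone & (data != nodata)
masked = np.where(valid, data, np.nan)

# Per-band mean over rows and columns, ignoring masked pixels
band_means = np.nanmean(masked, axis=(1, 2))
print(band_means)
```

Other statistics follow the same pattern by swapping `np.nanmean` for `np.nanmedian`, `np.nanmin`, or `np.nanmax`.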