# K-Means Clustering for Raster Data

This notebook demonstrates how to perform unsupervised K-Means clustering on multi-band raster data using `scikit-learn` and `rasterio` in Python. K-Means clustering is useful for segmenting remote sensing imagery into distinct groups (e.g., land cover types) without labeled training data.

## Prerequisites
- Install required libraries: `rasterio`, `scikit-learn`, `numpy`, `matplotlib` (listed in `requirements.txt`).
- A multi-band GeoTIFF file (e.g., `sample.tif`). Replace the file path with your own raster file.

## Learning Objectives
- Apply K-Means clustering to raster data.
- Visualize the resulting clusters.
- Save the clustered output as a raster.

In [None]:
# Import required libraries
import rasterio
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

## Step 1: Load the Raster File

Load a multi-band raster file and prepare it for clustering.

In [None]:
# Define the path to the raster file
raster_path = 'sample.tif'

# Open the raster file
with rasterio.open(raster_path) as src:
    raster_data = src.read(masked=True)  # Shape: (bands, height, width)
    profile = src.profile

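# Reshape for clustering: (height * width, bands)
n_bands, height, width = raster_data.shape
# Use the underlying ndarray so no masked array is passed to scikit-learn
X = raster_data.data.transpose(1, 2, 0).reshape(-1, n_bands)

# Remove masked (no-data) pixels: a pixel is dropped if any of its bands is masked.
# getmaskarray always returns a full boolean array, even when nothing is masked.
mask = np.ma.getmaskarray(raster_data).any(axis=0).ravel()
X_valid = X[~mask]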

# Print basic information
print(f'Raster shape: {raster_data.shape}')
print(f'Valid pixels for clustering: {X_valid.shape[0]}')
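
The reshape and masking logic above can be checked on a small synthetic array before running it on a real raster. The following is a minimal sketch, assuming a hypothetical 3-band, 2 x 4 pixel array that stands in for the contents of `sample.tif`; only `numpy` is required.

```python
import numpy as np

# Hypothetical toy raster: 3 bands, 2 rows, 4 columns (not read from sample.tif)
bands, height, width = 3, 2, 4
data = np.arange(bands * height * width, dtype=np.float32).reshape(bands, height, width)

# Mask one pixel (row 0, col 1) in band 2 only, as a no-data value would be
raster = np.ma.masked_array(data, mask=np.zeros(data.shape, dtype=bool))
raster[2, 0, 1] = np.ma.masked

# (bands, height, width) -> (height * width, bands): one row per pixel
X = raster.data.transpose(1, 2, 0).reshape(-1, bands)

# A pixel is dropped if any of its bands is masked
mask = np.ma.getmaskarray(raster).any(axis=0).ravel()
X_valid = X[~mask]

print(X.shape)        # (8, 3)
print(X_valid.shape)  # (7, 3)
print(X[1])           # band values of the pixel at row 0, col 1
```

Each row of `X` holds the band values of one pixel, which is the layout `KMeans` expects: samples in rows, features in columns.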

## Step 2: Apply K-Means Clustering

Fit K-Means to the valid pixels. The number of clusters must be chosen in advance; this example uses 5.

In [None]:
# Define number of clusters
n_clusters = 5

# Initialize and fit K-Means.
# n_init is set explicitly because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
kmeans.fit(X_valid)

# Map labels back onto the full grid; masked (no-data) pixels keep the value -1
clusters = np.full(height * width, -1, dtype=np.int32)
clusters[~mask] = kmeans.labels_
clusters = clusters.reshape(height, width)

# Visualize clusters; hide no-data pixels (-1) so they are not drawn as cluster 0
plt.figure(figsize=(8, 8))
plt.imshow(np.ma.masked_equal(clusters, -1), cmap='tab10', vmin=0, vmax=n_clusters-1)
plt.colorbar(label='Cluster')
plt.title('K-Means Clustering Result')
plt.xlabel('Column')
plt.ylabel('Row')
plt.show()
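
To make clearer what the fit step does, here is a simplified sketch of the standard Lloyd iteration behind K-Means: assign each pixel to its nearest center, then move each center to the mean of its pixels. This is illustrative only and is not scikit-learn's implementation, which uses smarter initialization and convergence checks. The function name `lloyd_kmeans` and the synthetic data are hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, n_clusters, n_iter=20, seed=0):
    """Simplified Lloyd iteration; illustrative, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen pixels as initial centers
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned pixels
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return labels, centers

# Hypothetical data: two well-separated groups of 2-band "pixels"
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])
labels, centers = lloyd_kmeans(X_demo, n_clusters=2)
print(np.bincount(labels, minlength=2))  # number of pixels per cluster
```

Cluster numbers are arbitrary: the same grouping can come back with different label values depending on the initial centers, which is why the notebook fixes `random_state`.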

## Step 3: Save Clustered Raster

Save the clustering result as a single-band GeoTIFF file.

In [None]:
# Update profile for single-band output
output_profile = profile.copy()
output_profile.update(count=1, dtype=rasterio.int32, nodata=-1)

# Save clustered raster
with rasterio.open('kmeans_clusters.tif', 'w', **output_profile) as dst:
    dst.write(clusters, 1)

print('Clustered raster saved to: kmeans_clusters.tif')

## Next Steps

- Replace `sample.tif` with your own multi-band raster file.
- Experiment with different numbers of clusters (`n_clusters`).
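- Normalize input data before clustering (see `11_normalization_scaling.ipynb`). K-Means relies on Euclidean distance, so bands with larger value ranges otherwise dominate the result.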
- Proceed to the next notebook (`14_feature_selection.ipynb`) for feature selection.
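
As a starting point for two of the suggestions above, the sketch below standardizes the bands and then compares the inertia (within-cluster sum of squared distances) for several values of `n_clusters`. It assumes synthetic data in place of `X_valid`, and `StandardScaler` is one possible scaling choice, not necessarily the one used in the normalization notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_valid: 300 pixels, 2 bands on very different scales
rng = np.random.default_rng(0)
group_means = np.array([[0.1, 1000.0], [0.5, 1000.0], [0.9, 1000.0]])
X_demo = np.vstack([
    rng.normal(loc=m, scale=[0.02, 50.0], size=(100, 2)) for m in group_means
])

# Standardize each band to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit K-Means for several values of k and record the inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f'k={k}: inertia={value:.1f}')
```

Inertia always decreases as `k` grows, so a common heuristic is to look for the "elbow" where further increases bring only small improvements, rather than picking the minimum.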

## Notes
- Ensure the raster has multiple bands for meaningful clustering.
- Handle large datasets by sampling pixels or using `MiniBatchKMeans` from `sklearn.cluster`.
- See `docs/installation.md` for troubleshooting library installation.
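
For large rasters, the sampling and `MiniBatchKMeans` suggestions can be combined. The following is a minimal sketch, assuming random data in place of `X_valid`; the sample size, `batch_size`, and `n_init` values are arbitrary choices for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in for X_valid: 100,000 pixels with 4 bands
rng = np.random.default_rng(0)
X_demo = rng.random((100_000, 4)).astype(np.float32)

# Fit on a random sample of pixels using mini-batches
sample_idx = rng.choice(X_demo.shape[0], size=10_000, replace=False)
model = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3, random_state=42)
model.fit(X_demo[sample_idx])

# Label every pixel, including those not used for fitting
labels = model.predict(X_demo)

print(labels.shape)
print(np.unique(labels))
```

Because the model is fitted on a subset, labels for the full raster come from `predict` rather than from `labels_`, which only covers the sampled pixels.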