# Feature Selection for Remote Sensing Data

This notebook demonstrates how to perform feature selection on multi-band raster data using `scikit-learn` in Python. Feature selection identifies the most informative bands or derived features for a machine learning task, which reduces dimensionality and can improve model performance.

## Prerequisites
- Install required libraries: `rasterio`, `geopandas`, `scikit-learn`, `numpy`, `matplotlib` (listed in `requirements.txt`).
- A multi-band GeoTIFF file (e.g., `sample.tif`) and a shapefile with labeled data (e.g., `labels.shp`). Replace file paths with your own data.

## Learning Objectives
- Extract features from a raster using labeled data.
- Apply feature selection using Random Forest importance scores.
- Visualize and save selected features.

In [None]:
# Import required libraries
import rasterio
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from rasterio.features import geometry_mask

## Step 1: Load Raster and Labeled Data

Load the multi-band raster and shapefile with labeled data.

In [None]:
# Define file paths
raster_path = 'sample.tif'
shapefile_path = 'labels.shp'

# Load the shapefile
gdf = gpd.read_file(shapefile_path)

# Load the raster
with rasterio.open(raster_path) as src:
    raster_data = src.read()  # Shape: (bands, height, width)
    raster_crs = src.crs
    raster_transform = src.transform
    profile = src.profile

# Reproject shapefile to match raster CRS if needed
if gdf.crs != raster_crs:
    gdf = gdf.to_crs(raster_crs)

# Print basic information
print(f'Raster shape: {raster_data.shape}')
# Double quotes outside so the inner 'class' key parses on Python < 3.12
print(f"Number of classes: {gdf['class'].nunique()}")  # Assumes 'class' column

## Step 2: Extract Training Data

Extract pixel values from the raster at labeled geometry locations.

In [None]:
# Initialize lists for features and labels
X_train = []
y_train = []

# Extract pixel values for each geometry
for _, row in gdf.iterrows():
    label = row['class']  # Assumes 'class' column
    # True for pixels covered by the geometry (invert=True)
    mask = geometry_mask(
        [row.geometry],
        transform=raster_transform,
        out_shape=raster_data.shape[1:],
        invert=True,
    )
    # Boolean indexing over the two spatial axes gives (bands, n_pixels);
    # transpose to one row per pixel and one column per band
    features = raster_data[:, mask].T
    X_train.extend(features)
    y_train.extend([label] * len(features))

# Convert to arrays
X_train = np.array(X_train)
y_train = np.array(y_train)

# Print training data info
print(f'Training features shape: {X_train.shape}')
print(f'Training labels shape: {y_train.shape}')
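
The extraction in this step comes down to one boolean-indexing operation on the raster array. The sketch below reproduces it on a small synthetic array so it runs without any raster file; `fake_raster` and `fake_mask` are made-up stand-ins for `raster_data` and the output of `geometry_mask`.

```python
import numpy as np

# Synthetic stand-in for raster_data: 3 bands, 4 rows, 5 columns
fake_raster = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# Synthetic stand-in for a geometry mask: True marks labeled pixels
fake_mask = np.zeros((4, 5), dtype=bool)
fake_mask[1:3, 2:4] = True  # a 2 x 2 block of labeled pixels

# Indexing the two spatial axes with the mask gives (bands, n_pixels);
# transposing yields one row per pixel and one column per band
pixels = fake_raster[:, fake_mask].T

print(pixels.shape)  # (4, 3)
print(pixels[0])     # [ 7 27 47]: band values of the first labeled pixel
```

Each row of `pixels` is one training sample, which is the layout `scikit-learn` estimators expect for `X`.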

## Step 3: Perform Feature Selection

Train a Random Forest and rank the bands by its impurity-based importance scores (`feature_importances_`), then keep the top-ranked bands.

In [None]:
# Train Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importance scores
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]  # Sort in descending order

# Select top features (e.g., top 3 bands)
n_top_features = 3
top_indices = indices[:n_top_features]

# Print feature importance
print('Feature Importance Scores:')
for idx in indices:
    print(f'Band {idx+1}: {importances[idx]:.4f}')

# Selected bands
print(f'Selected top {n_top_features} bands: {top_indices+1}')
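
Impurity-based scores are computed from the training samples, so it can be worth cross-checking the ranking with a second method. The sketch below uses permutation importance from `sklearn.inspection` on held-out data. It is an alternative technique, not part of the original workflow, and `X_demo` and `y_demo` are synthetic stand-ins for `X_train` and `y_train` in which only the first band carries signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (X_train, y_train): 4 "bands", only band 1 is informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Hold out part of the data so importance is measured on unseen samples
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_fit, y_fit)

# Shuffle each band in turn and record the mean drop in accuracy
result = permutation_importance(
    rf_demo, X_val, y_val, n_repeats=5, random_state=0
)
perm_ranking = np.argsort(result.importances_mean)[::-1]

for idx in perm_ranking:
    print(f'Band {idx + 1}: {result.importances_mean[idx]:.4f}')
```

If the two rankings disagree strongly on real data, that suggests the top bands need a closer look before any are discarded.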

## Step 4: Create Raster with Selected Features

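Extract the selected bands from the original raster and write them to a new GeoTIFF. The bands are written in order of decreasing importance, not in their original order.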

In [None]:
# Create raster with selected bands
selected_data = raster_data[top_indices]

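# Update profile for selected bands; take the count from the array itself
# so it stays correct if the raster has fewer bands than n_top_features
selected_profile = profile.copy()
selected_profile.update(count=selected_data.shape[0])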

# Save selected features raster
with rasterio.open('selected_features.tif', 'w', **selected_profile) as dst:
    dst.write(selected_data)

print('Selected features raster saved to: selected_features.tif')
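
Because the output bands follow the importance ranking, it helps to record which source band each output band came from. The sketch below illustrates this with NumPy only; `demo_raster` and `demo_importances` are made-up values, and `band_map` is a hypothetical helper structure, not something the notebook defines.

```python
import numpy as np

# Synthetic stand-ins: a 5-band raster and made-up importance scores
demo_raster = np.arange(5 * 2 * 2).reshape(5, 2, 2)
demo_importances = np.array([0.10, 0.30, 0.05, 0.40, 0.15])

# 0-based indices of the three most important bands, most important first
demo_top = np.argsort(demo_importances)[::-1][:3]
demo_selected = demo_raster[demo_top]

# Map each 1-based output band to its 1-based source band
band_map = {out + 1: int(src) + 1 for out, src in enumerate(demo_top)}
print(band_map)  # {1: 4, 2: 2, 3: 5}

# Alternative: sort the indices to keep the original band order
demo_sorted = demo_raster[np.sort(demo_top)]
```

Storing `band_map` alongside the output file, or sorting the indices before writing, avoids confusion when the selected raster is opened later.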

## Step 5: Visualize Feature Importance

Plot the feature importance scores for all bands, sorted from most to least important.

In [None]:
# Plot feature importance
plt.figure(figsize=(8, 6))
plt.bar(range(len(importances)), importances[indices], align='center')
plt.xticks(range(len(importances)), [f'Band {i+1}' for i in indices])
plt.title('Feature Importance Scores')
plt.xlabel('Band')
plt.ylabel('Importance')
plt.show()

## Next Steps

- Replace `sample.tif` and `labels.shp` with your own raster and labeled shapefile.
- Adjust the number of selected features (`n_top_features`).
- Use the selected features for classification (see `12_classification_rf_svm.ipynb`).
- Proceed to the next notebook (`15_unet_segmentation.ipynb`) for deep learning segmentation.
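
One possible way to choose `n_top_features` instead of fixing it by hand is to keep the smallest set of bands whose cumulative importance reaches a chosen share of the total. The sketch below assumes a hypothetical 80% cut-off (`threshold`) and uses made-up scores in place of `rf.feature_importances_`.

```python
import numpy as np

# Made-up importance scores; in the notebook this would be rf.feature_importances_
demo_importances = np.array([0.05, 0.35, 0.10, 0.30, 0.20])
threshold = 0.80  # hypothetical cut-off: keep 80% of the total importance

# Rank bands from most to least important and accumulate their scores
order = np.argsort(demo_importances)[::-1]
cumulative = np.cumsum(demo_importances[order])

# Smallest number of bands whose cumulative importance reaches the threshold,
# capped at the number of bands to guard against floating-point rounding
n_keep = min(int(np.searchsorted(cumulative, threshold)) + 1, len(order))

print(f'Keep {n_keep} bands: {order[:n_keep] + 1}')  # Keep 3 bands: [2 4 5]
```

The cut-off itself is a judgment call; a stricter or looser value trades raster size against retained information.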

## Notes
- Ensure the shapefile has a 'class' column with labels.
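- Feature selection reduces computational cost, but discarding bands can lower model accuracy. Compare accuracy with and without the selected bands before relying on the reduced set.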
- See `docs/installation.md` for troubleshooting library installation.
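
To check how much accuracy changes after selection, cross-validated scores on all bands can be compared with scores on the selected bands. The following is a minimal sketch on synthetic data, where `X_demo` and `y_demo` stand in for `X_train` and `y_train` and only the first two of six bands are informative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, y_train): 6 "bands", 2 informative
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Rank bands by impurity-based importance and keep the top two
rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)
top_demo = np.argsort(rf_demo.feature_importances_)[::-1][:2]

# 5-fold cross-validated accuracy with all bands and with the top bands only
acc_all = cross_val_score(rf_demo, X_demo, y_demo, cv=5).mean()
acc_top = cross_val_score(rf_demo, X_demo[:, top_demo], y_demo, cv=5).mean()

print(f'All bands: {acc_all:.3f}')
print(f'Top {len(top_demo)} bands: {acc_top:.3f}')
```

Note that the bands here are selected using all samples before cross-validation, which can make the reduced-set score slightly optimistic; a stricter comparison would repeat the selection inside each fold.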