<a href="https://colab.research.google.com/github/Manvithtech07/Forest-Fire-Detection-And-Prediction/blob/main/notebooks/data_sampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install rasterio

Collecting rasterio
  Downloading rasterio-1.4.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.1 kB)
Collecting affine (from rasterio)
  Downloading affine-2.4.0-py3-none-any.whl.metadata (4.0 kB)
Collecting cligj>=0.5 (from rasterio)
  Downloading cligj-0.7.2-py3-none-any.whl.metadata (5.0 kB)
Collecting click-plugins (from rasterio)
  Downloading click_plugins-1.1.1.2-py2.py3-none-any.whl.metadata (6.5 kB)
Downloading rasterio-1.4.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (22.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22.3/22.3 MB 30.5 MB/s eta 0:00:00
Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Downloading affine-2.4.0-py3-none-any.whl (15 kB)
Downloading click_plugins-1.1.1.2-py2.py3-none-any.whl (11 kB)
Installing collected packages: cligj, click-plugins, affine, rasterio
Successfully installed affine-2.4.0 click-plugins-1.1.1.2 cligj-0.7.2 rasterio-1.4.3


In [2]:
from google.colab import drive
drive.mount('/content/drive')

import rasterio
import numpy as np
import pandas as pd
import os
import random

# To ignore common warnings
import warnings
warnings.filterwarnings('ignore')

Mounted at /content/drive


In [3]:
# No trailing slash here: the sub-paths below add their own separator.
GDRIVE_PROJECT_PATH = "/content/drive/My Drive"
FEATURES_PATH = f"{GDRIVE_PROJECT_PATH}/data/features"
DATA_PATH = f"{GDRIVE_PROJECT_PATH}/data"

## 1. Load All Feature Rasters

We load all 8 of our rasters (`.tif` files) into NumPy arrays: 7 feature maps plus `dNBR`, which serves as the target variable. We store them in a dictionary keyed by name so each one is easy to access later.

In [4]:
feature_filenames = [
    "ndvi_pre.tif",
    "ndvi_post.tif",
    "nbr_pre.tif",
    "nbr_post.tif",
    "elevation.tif",
    "slope.tif",
    "aspect.tif",
    "dNBR.tif" # target variable
]
features_data = {}

template_meta = None

try:
    for filename in feature_filenames:
        feature_name = os.path.splitext(filename)[0]  # strip the .tif extension

        file_path = f"{FEATURES_PATH}/{filename}"

        with rasterio.open(file_path) as src:
            features_data[feature_name] = src.read(1)

            if template_meta is None:
                template_meta = src.meta

        print(f" Loaded {filename} as '{feature_name}'")

    print("Shape of our data:", features_data['dNBR'].shape)

except Exception as e:
    print(f"ERROR: Could not load files. Check your 'feature_filenames' list. {e}")

 Loaded ndvi_pre.tif as 'ndvi_pre'
 Loaded ndvi_post.tif as 'ndvi_post'
 Loaded nbr_pre.tif as 'nbr_pre'
 Loaded nbr_post.tif as 'nbr_post'
 Loaded elevation.tif as 'elevation'
 Loaded slope.tif as 'slope'
 Loaded aspect.tif as 'aspect'
 Loaded dNBR.tif as 'dNBR'
Shape of our data: (2337, 3669)
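
Only the `dNBR` shape is printed above, so a raster with a different grid size would go unnoticed at this point. The following is a minimal sketch of a shape check over a dictionary of arrays. `find_shape_mismatches` is a hypothetical helper, not part of this notebook, and the toy arrays and their sizes are made up to stand in for the real rasters.

```python
import numpy as np

def find_shape_mismatches(arrays, reference_name):
    """Return {name: shape} for every array whose shape differs from the reference."""
    reference_shape = arrays[reference_name].shape
    return {
        name: arr.shape
        for name, arr in arrays.items()
        if arr.shape != reference_shape
    }

# Toy stand-ins for the loaded rasters (hypothetical sizes).
toy_features = {
    "dNBR": np.zeros((4, 6)),
    "ndvi_pre": np.zeros((4, 6)),
    "elevation": np.zeros((3, 5)),
}

mismatches = find_shape_mismatches(toy_features, "dNBR")
print(mismatches)  # {'elevation': (3, 5)}
```

Calling the same helper on `features_data` right after loading would list any raster that needs to be aligned before pixel coordinates can be shared between arrays.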


## 2. Create Pixel Masks for Sampling

Now we use our `dNBR` data to find all the "Burned" and "Unburned" pixels. This is the core of our stratified sampling.

Following the USGS burn severity classification, in which `dNBR` values between -0.1 and +0.1 are treated as unburned, we set our threshold for a "burned" area at **`dNBR > 0.1`**.

We will:
1.  Create a "Burned" list of all pixel coordinates `(row, col)` that meet this condition.
2.  Create an "Unburned" list of all pixels that *don't* meet this condition (but are still valid data).

In [6]:
BURN_THRESHOLD = 0.1
dnbr_data = features_data['dNBR']

valid_pixels_mask = ~np.isnan(dnbr_data)
print(f"Total valid (non-cloud) pixels: {np.sum(valid_pixels_mask)}")

burned_mask = (dnbr_data > BURN_THRESHOLD) & (valid_pixels_mask)
burned_pixels = np.argwhere(burned_mask)
print(f"Found {len(burned_pixels)} 'Burned' pixels.")

unburned_mask = (dnbr_data <= BURN_THRESHOLD) & (valid_pixels_mask)
unburned_pixels = np.argwhere(unburned_mask)
print(f"Found {len(unburned_pixels)} 'Unburned' pixels.")



Total valid (non-cloud) pixels: 4057872
Found 519879 'Burned' pixels.
Found 3537993 'Unburned' pixels.
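
To make the mask logic concrete, here is a small sketch that applies the same steps to a toy 2 x 3 grid. The values are invented for illustration; only the threshold matches the cell above.

```python
import numpy as np

BURN_THRESHOLD = 0.1

# Toy dNBR grid; NaN marks a pixel with no valid data.
toy_dnbr = np.array([
    [0.30, 0.05, np.nan],
    [0.10, 0.25, -0.02],
])

valid = ~np.isnan(toy_dnbr)
burned = (toy_dnbr > BURN_THRESHOLD) & valid
unburned = (toy_dnbr <= BURN_THRESHOLD) & valid

burned_coords = np.argwhere(burned)      # one (row, col) pair per burned pixel
unburned_coords = np.argwhere(unburned)  # one (row, col) pair per unburned pixel

print(burned_coords.tolist())    # [[0, 0], [1, 1]]
print(unburned_coords.tolist())  # [[0, 1], [1, 0], [1, 2]]
```

Two details are worth noticing: a pixel exactly at the threshold (0.10) falls into the "Unburned" class, and the NaN pixel appears in neither list, so the two classes never overlap.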


## 3. Perform Stratified Sampling

Now we create our final, balanced dataset. We will take a sample of 5,000 "Burned" points and 5,000 "Unburned" points to create a balanced dataset of 10,000 total samples.

For each sample point, we will:
1.  Get its `(row, col)` coordinate.
2.  Extract the value from all 8 of our feature maps at that *exact* coordinate.
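3.  Store these 8 values, plus a `Burned` label (1 or 0), as one row in our dataset.
4.  Finally, we will save this dataset as a `.csv` file.

The terrain rasters (`elevation`, `slope`, `aspect`) have a different shape from the other rasters, so the next cell first resizes every array to the `dNBR` shape. Without this step, a `(row, col)` coordinate taken from `dNBR` would not index the same location in the terrain arrays.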

In [8]:
!pip install -q scikit-image
from skimage.transform import resize


master_shape = features_data['dNBR'].shape
print(f"Master shape (from dNBR) is: {master_shape}")

resized_features_data = {}

for feature_name, data_array in features_data.items():
    if data_array.shape != master_shape:
        print(f"  -> Resizing '{feature_name}' from {data_array.shape} to {master_shape}...")

        resized_array = resize(
            data_array,
            master_shape,
            order=1, # Use 'bilinear' interpolation
            preserve_range=True,
            anti_aliasing=True
        )
        resized_features_data[feature_name] = resized_array
    else:
        print(f"  -> Shape for '{feature_name}' is already correct.")
        resized_features_data[feature_name] = data_array

features_data = resized_features_data

Master shape (from dNBR) is: (2337, 3669)
  -> Shape for 'ndvi_pre' is already correct.
  -> Shape for 'ndvi_post' is already correct.
  -> Shape for 'nbr_pre' is already correct.
  -> Shape for 'nbr_post' is already correct.
  -> Resizing 'elevation' from (2215, 3486) to (2337, 3669)...
  -> Resizing 'slope' from (2215, 3486) to (2337, 3669)...
  -> Resizing 'aspect' from (2215, 3486) to (2337, 3669)...
  -> Shape for 'dNBR' is already correct.


In [13]:
N_SAMPLES_PER_CLASS = 5000

training_data_list = []

print(f"Sampling {N_SAMPLES_PER_CLASS} 'Burned' pixels")

n_burned = min(len(burned_pixels), N_SAMPLES_PER_CLASS)
sampled_burned_indices = random.sample(range(len(burned_pixels)), n_burned)

for i in sampled_burned_indices:
    row, col = burned_pixels[i]

    sample_data = {}
    for feature_name, data_array in features_data.items():
        sample_data[feature_name] = data_array[row, col]

    sample_data['Burned'] = 1

    training_data_list.append(sample_data)

print(f"Sampled {n_burned} 'Burned' pixels.")

print(f"Sampling {N_SAMPLES_PER_CLASS} 'Unburned' pixels")

n_unburned = min(len(unburned_pixels), N_SAMPLES_PER_CLASS)
sampled_unburned_indices = random.sample(range(len(unburned_pixels)), n_unburned)

for i in sampled_unburned_indices:
    row, col = unburned_pixels[i]

    sample_data = {}
    for feature_name, data_array in features_data.items():
        sample_data[feature_name] = data_array[row, col]

    sample_data['Burned'] = 0

    training_data_list.append(sample_data)

print(f"Sampled {n_unburned} 'Unburned' pixels.")

print("\nConverting data to pandas DataFrame")
df = pd.DataFrame(training_data_list)

# dropna() removes rows containing NaN only; nodata sentinels such as -32767 or -9999 are kept.
df_clean = df.dropna()

print(f"Total rows before cleaning: {len(df)}")
print(f"Total rows after cleaning: {len(df_clean)}")

csv_output_path = f"{DATA_PATH}/training_data.csv"
df_clean.to_csv(csv_output_path, index=False)

Sampling 5000 'Burned' pixels
Sampled 5000 'Burned' pixels.
Sampling 5000 'Unburned' pixels
Sampled 5000 'Unburned' pixels.

Converting data to pandas DataFrame
Total rows before cleaning: 10000
Total rows after cleaning: 10000
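
The sampling cell above draws with the unseeded `random` module, so each run produces a different sample. As a sketch of one way to make the draw repeatable and avoid the per-pixel loop, the example below uses a seeded NumPy generator and array indexing on toy data. `sample_class` is a hypothetical helper, and the seed value and toy grids are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed (arbitrary) for repeatable draws

# Toy stand-ins: two 5 x 5 feature grids and a list of candidate coordinates.
toy_features = {
    "ndvi_pre": np.arange(25, dtype=float).reshape(5, 5),
    "elevation": np.arange(25, dtype=float).reshape(5, 5) * 10,
}
candidate_pixels = np.argwhere(toy_features["ndvi_pre"] > 11)  # 13 coordinates

def sample_class(pixels, features, n_samples, label, rng):
    """Draw up to n_samples coordinates without replacement and read every feature there."""
    n = min(len(pixels), n_samples)
    chosen = pixels[rng.choice(len(pixels), size=n, replace=False)]
    rows, cols = chosen[:, 0], chosen[:, 1]
    # Indexing with two integer arrays reads all sampled pixels at once.
    table = {name: arr[rows, cols] for name, arr in features.items()}
    table["Burned"] = label  # scalar is broadcast to every row
    return pd.DataFrame(table)

sample_df = sample_class(candidate_pixels, toy_features, n_samples=4, label=1, rng=rng)
print(sample_df)
```

With the real data, the same helper would be called once with `burned_pixels` and label 1, once with `unburned_pixels` and label 0, and the two frames combined with `pd.concat`.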


In [14]:
df

|  | ndvi_pre | ndvi_post | nbr_pre | nbr_post | elevation | slope | aspect | dNBR | Burned |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.177976 | 0.096213 | 0.107546 | -0.024990 | 1196.689297 | 89.999107 | 345.582886 | 0.132537 | 1 |
| 1 | 0.234447 | 0.154161 | 0.190272 | 0.073477 | -32767.000000 | -9999.000000 | -9999.000000 | 0.116795 | 1 |
| 2 | 0.312417 | 0.142838 | 0.247186 | 0.040099 | 1533.746324 | 89.999039 | 55.282330 | 0.207087 | 1 |
| 3 | 0.231127 | 0.162649 | 0.182231 | 0.075043 | 851.715557 | 89.999123 | 223.241806 | 0.107188 | 1 |
| 4 | 0.378242 | 0.148193 | 0.325877 | 0.054974 | 1808.838535 | 89.999062 | 301.757935 | 0.270903 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 0.207101 | 0.128117 | 0.129672 | 0.046324 | 370.869269 | 89.992195 | 207.635376 | 0.083348 | 0 |
| 9996 | 0.168335 | 0.191926 | 0.084934 | 0.107751 | 312.140027 | 89.981163 | 290.893555 | -0.022817 | 0 |
| 9997 | 0.223156 | 0.183147 | 0.120947 | 0.065372 | -32767.000000 | -9999.000000 | -9999.000000 | 0.055575 | 0 |
| 9998 | 0.169578 | 0.105547 | 0.108281 | 0.011692 | 1249.629774 | 89.998741 | 77.509628 | 0.096589 | 0 |
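
Rows 1 and 9997 in the table above carry `-32767` and `-9999` in the terrain columns. These look like nodata sentinels rather than measurements, and `dropna()` does not remove them because they are not NaN. A minimal sketch of one way to handle this follows; treating these two values as nodata is an assumption based on the table, and the toy frame reuses two rows from it with fewer columns.

```python
import numpy as np
import pandas as pd

# Two rows copied from the table above, reduced to four columns.
toy_df = pd.DataFrame({
    "ndvi_pre": [0.177976, 0.234447],
    "elevation": [1196.689297, -32767.0],
    "slope": [89.999107, -9999.0],
    "Burned": [1, 1],
})

# Sentinel values observed in the table; treating them as nodata is an assumption.
NODATA_VALUES = [-32767.0, -9999.0]

# Turn sentinels into NaN first, so that dropna() can remove those rows.
cleaned = toy_df.replace(NODATA_VALUES, np.nan).dropna()
print(len(toy_df), len(cleaned))  # 2 1
```

Applying the same `replace(...).dropna()` chain to `df` before writing the CSV would reduce the row count below 10,000, so the two classes may no longer be exactly balanced afterwards.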
