Importing Necessary Dependencies

In [10]:
import pandas as pd
import numpy as np

In [11]:
PREPROCESSED_DATA_PATH = "../../data/field-data/preprocessed.csv"
FINAL_PLOT_DATA_PATH = "../../data/field-data/final.csv"

In [12]:
df = pd.read_csv(PREPROCESSED_DATA_PATH)
# Shape of the dataframe
print(f"rows = {df.shape[0]}\ncolumns = {df.shape[1]}")
df.head()

rows = 2009
columns = 7


Unnamed: 0,id,lon,lat,AGB_tha,cluster_id,plot_id,subplot_id
0,10-75-3,80.413257,28.870558,454.729757,10,75,3
1,10-75-6,80.416333,28.870571,499.730683,10,75,6
2,10-84-4,80.405569,29.192646,367.626079,10,84,4
3,10-92-3,80.392803,29.484019,11.786484,10,92,3
4,10-92-4,80.395913,29.481326,77.825277,10,92,4


Data Preparation

*We prepare the final (ground-truth) dataset by aggregating the subplot measurements to the plot level. This is necessary because satellite imagery such as Sentinel-2 or Landsat has a spatial resolution of 10 m to 30 m. Extracting data for tiny subplots at that resolution often results in "mixed pixels," where the signal is blurred across multiple units.*
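To make the scale mismatch concrete, the short sketch below compares the area of one subplot with the area of one pixel. It takes the 0.075 ha subplot area used in this notebook's aggregation code; the dictionary name `pixels_per_subplot` is only illustrative, and the calculation ignores plot shape and pixel-grid alignment.

```python
# Subplot area from this notebook (0.075 ha), expressed in square metres.
SUBPLOT_AREA_M2 = 0.075 * 10_000

# Nominal pixel edge lengths in metres.
PIXEL_SIZES_M = {"Sentinel-2": 10, "Landsat": 30}

# Ratio of subplot area to pixel area for each sensor.
pixels_per_subplot = {
    sensor: SUBPLOT_AREA_M2 / size_m**2 for sensor, size_m in PIXEL_SIZES_M.items()
}

for sensor, n_pixels in pixels_per_subplot.items():
    print(f"{sensor}: one subplot covers about {n_pixels:.2f} pixel(s)")
```

By area alone, a single subplot spans only a handful of Sentinel-2 pixels and less than one Landsat pixel, which is why subplot-level extraction is prone to mixed pixels.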

In [None]:
df["cluster_id"] = df["cluster_id"].astype(str).str.strip()
df["plot_id"] = df["plot_id"].astype(str).str.strip()
df["subplot_area_ha"] = 0.075  # constant subplot area in hectares, used as the weight

plot_level_df = (
    df.groupby(["cluster_id", "plot_id"], dropna=False)
    .apply(
        lambda x: pd.Series(
            {
                "plot_lon": x["lon"].mean(),
                "plot_lat": x["lat"].mean(),
                "plot_agb_mean_t_ha": np.average(
                    x["AGB_tha"], weights=x["subplot_area_ha"]
                ),
                "plot_total_agb_t": np.sum(x["AGB_tha"] * x["subplot_area_ha"]),
                "subplot_count": len(x),
            }
        ),
        include_groups=False,
    )
    .reset_index()
)

plot_level_df["plotId"] = plot_level_df["cluster_id"] + "-" + plot_level_df["plot_id"]
plot_level_df.drop(columns=["cluster_id", "plot_id"], inplace=True)
plot_level_df.head()

Unnamed: 0,plot_lon,plot_lat,plot_agb_mean_t_ha,plot_total_agb_t,subplot_count,plotId
0,80.414795,28.870564,477.23022,71.584533,2.0,10-75
1,80.405569,29.192646,367.626079,27.571956,1.0,10-84
2,80.394871,29.483126,45.633926,10.267633,3.0,10-92
3,84.10442,28.228486,135.266012,40.579803,4.0,100-56
4,84.145391,27.650959,329.968995,148.486048,6.0,101-40
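As a quick sanity check on the aggregation, the sketch below re-derives the totals from the rows displayed above. Because every subplot has the same area, the area-weighted mean reduces to a plain mean, and `plot_total_agb_t` should equal `plot_agb_mean_t_ha * subplot_count * 0.075`. The values are copied from the printed outputs; the variable names are illustrative.

```python
import numpy as np

SUBPLOT_AREA_HA = 0.075  # same constant as in the aggregation cell

# (plot_agb_mean_t_ha, plot_total_agb_t, subplot_count) copied from the output above.
rows = [
    (477.23022, 71.584533, 2),
    (367.626079, 27.571956, 1),
    (45.633926, 10.267633, 3),
    (135.266012, 40.579803, 4),
    (329.968995, 148.486048, 6),
]

# Total biomass recomputed as mean density times total sampled area.
recomputed = [mean_t_ha * n * SUBPLOT_AREA_HA for mean_t_ha, _, n in rows]
for value, (_, total_t, _) in zip(recomputed, rows):
    assert np.isclose(value, total_t, atol=1e-4)

# With equal weights, np.average matches the plain mean (subplots of plot 10-75).
first_plot_mean = np.average([454.729757, 499.730683], weights=[0.075, 0.075])
assert np.isclose(first_plot_mean, 477.23022, atol=1e-5)
print("aggregation is internally consistent")
```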


In [14]:
rows, _ = plot_level_df.shape
print('Number of plots : ', rows)

Number of plots :  524


### Why this is the "Golden" Dataset for Satellite Matching

- `Pixel Alignment:` A typical plot ($500\,\mathrm{m}^2$ to $750\,\mathrm{m}^2$) covers a large enough area to match Sentinel-2 (10 m) or Landsat (30 m) pixels without being highly sensitive to small GPS shifts.

- `Statistical Robustness:` By averaging the AGB across subplots, you reduce the influence of "outlier trees" (e.g., one massive tree in a single subplot) and get a more stable estimate of the forest stand's biomass.

- `Subplot Count Column:` The dataset now includes a `subplot_count` column. When you train your model, you can choose to give more weight to plots that had 4+ subplots, as their "ground truth" is more reliable than that of a plot with only 1 subplot.
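One possible way to act on the subplot-count point is to convert `subplot_count` into per-plot sample weights. The sketch below is an assumption rather than part of the original workflow: the helper `subplot_weights` and the cap of 4 subplots are hypothetical choices, and the toy frame only stands in for `plot_level_df`.

```python
import numpy as np
import pandas as pd


def subplot_weights(counts: pd.Series, cap: int = 4) -> np.ndarray:
    """Return weights in (0, 1] that grow with subplot count and saturate at `cap`."""
    # Hypothetical scheme: plots with `cap` or more subplots get full weight.
    clipped = counts.clip(upper=cap)
    return (clipped / cap).to_numpy(dtype=float)


# Toy frame standing in for plot_level_df (values are illustrative only).
demo = pd.DataFrame({"plotId": ["a", "b", "c"], "subplot_count": [1.0, 4.0, 6.0]})
demo["sample_weight"] = subplot_weights(demo["subplot_count"])
print(demo)
```

The resulting array can be passed to estimators that accept a `sample_weight` argument in `fit`, which many scikit-learn regressors do. Whether a linear ramp, a cap, or some other scheme is appropriate depends on the sampling design.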

Save the final plot-level data

In [15]:
plot_level_df.to_csv(FINAL_PLOT_DATA_PATH, index=False)