<a href="https://colab.research.google.com/github/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/blob/Country-Model---Colombia/MRV_AGC_ML_Model_Country_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Objective:**  
The goal is to produce accurate, scalable, and reproducible carbon stock estimates to support countries’ Measurement, Reporting, and Verification (MRV) obligations under climate agreements.

**Project Overview:**  
The degradation of mangrove ecosystems across Latin America and the Caribbean (LAC) presents a significant challenge for climate mitigation and blue carbon preservation. Current national capacities to measure and report on mangrove carbon stocks, particularly above-ground biomass (AGB) and soil organic carbon (SOC), are often limited by data availability, inconsistent methodologies, and a lack of scalable digital tools. **The problem is how to develop a standardized, scalable, and validated AI-driven MRV system for mangrove ecosystems that can produce reliable blue carbon stock estimates**, support international reporting obligations, and enable sustainable blue carbon management in Latin America and the Caribbean. This consultancy also contributes to preparing countries for results-based payment schemes by producing standardized carbon estimates aligned with carbon credit market verification. While the core focus is on AGB estimation, methane data and modelling processes, when provided, will be integrated to enhance blue carbon stock accuracy.  

**The project will use machine learning methods trained on high-resolution satellite data, LiDAR, field plots, and other auxiliary sources** to produce validated, country-specific models. By piloting in Trinidad and scaling to Colombia, Jamaica, Panama, and Suriname, this initiative will enhance national reporting systems, support access to results-based climate finance, and build regional capacity for sustainable carbon monitoring. It directly contributes to climate reporting obligations under the Paris Agreement, while enabling future alignment with carbon credit verification standards.

Import data  
Prep data (join, impute, clean, etc)
Explore data   
Train model  
Test model  
Produce visuals

# **Colombia Data & Model**

## 1. Import processed data from Github repo for Colombia

---

In [None]:
!wget https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Colombia/02_interim/Colombia_S1_Predictors_2022_2023.xlsx
!wget https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Colombia/02_interim/Colombia_S2_Predictors_2022_2023.xlsx
!wget https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Colombia/02_interim/Colombia_plot_AGB_AGC_Chave2014.xlsx

--2025-12-31 11:49:06--  https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Colombia/02_interim/Colombia_S1_Predictors_2022_2023.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6916 (6.8K) [application/octet-stream]
Saving to: ‘Colombia_S1_Predictors_2022_2023.xlsx’


2025-12-31 11:49:07 (34.6 MB/s) - ‘Colombia_S1_Predictors_2022_2023.xlsx’ saved [6916/6916]

--2025-12-31 11:49:07--  https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Colombia/02_interim/Colombia_S2_Predictors_2022_2023.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.

In [None]:
import pandas as pd

col_s1_pred_df = pd.read_excel('/content/Colombia_S1_Predictors_2022_2023.xlsx')
col_s2_pred_df = pd.read_excel('/content/Colombia_S2_Predictors_2022_2023.xlsx')
col_plot_df = pd.read_excel('/content/Colombia_plot_AGB_AGC_Chave2014.xlsx')

print("All Excel files have been loaded into DataFrames:")
print("col_s1_pred_df, col_s2_pred_df, col_plot_df")

All Excel files have been loaded into DataFrames:
col_s1_pred_df, col_s2_pred_df, col_plot_df


In [None]:
print("Colombia S1 Predictors 2022 2023 Info:")
col_s1_pred_df.info()
print("\nColombia S1 Predictors DataFrame 2022 2023 Info:")
display(col_s1_pred_df.describe())

print("Colombia S2 Predictors 2022 2023 Info:")
col_s2_pred_df.info()
print("\nColombia S2 Predictors 2022 2023 DataFrame Description:")
display(col_s2_pred_df.describe())


print("Colombia Plot AGB AGC Chave2014 DataFrame Info:")
col_plot_df.info()
print("\nColombia Plot AGB AGC Chave2014  DataFrame Description:")
display(col_plot_df.describe())

Colombia S1 Predictors 2022 2023 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   system:index  14 non-null     int64  
 1   Latitude      14 non-null     float64
 2   Longitude     14 non-null     float64
 3   Plot          14 non-null     object 
 4   Study_area    14 non-null     object 
 5   VH            14 non-null     float64
 6   VH_VV_ratio   14 non-null     float64
 7   VV            14 non-null     float64
 8   VV_minus_VH   14 non-null     float64
 9   .geo          14 non-null     object 
dtypes: float64(6), int64(1), object(3)
memory usage: 1.2+ KB

Colombia S1 Predictors DataFrame 2022 2023 Info:


Unnamed: 0,system:index,Latitude,Longitude,VH,VH_VV_ratio,VV,VV_minus_VH
count,14.0,14.0,14.0,14.0,14.0,14.0,14.0
mean,6.5,9.411235,-75.63436,-14.023681,1.847316,-7.713321,6.34493
std,4.1833,0.014699,0.01277,0.428797,0.137105,0.659795,0.396044
min,0.0,9.392361,-75.656792,-14.545701,1.63809,-8.933878,5.647963
25%,3.25,9.401521,-75.641015,-14.319607,1.7632,-7.941399,6.120765
50%,6.5,9.408647,-75.633374,-14.003779,1.837804,-7.621978,6.30025
75%,9.75,9.418269,-75.623428,-13.930117,1.889189,-7.526474,6.532688
max,13.0,9.444222,-75.618083,-12.739493,2.224207,-5.999545,7.108129


Colombia S2 Predictors 2022 2023 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   system:index  14 non-null     int64  
 1   B11           14 non-null     float64
 2   B12           14 non-null     float64
 3   B2            14 non-null     float64
 4   B3            14 non-null     float64
 5   B4            14 non-null     float64
 6   B5            14 non-null     float64
 7   B6            14 non-null     float64
 8   B7            14 non-null     float64
 9   B8            14 non-null     float64
 10  B8A           14 non-null     float64
 11  EVI           14 non-null     float64
 12  Latitude      14 non-null     float64
 13  Longitude     14 non-null     float64
 14  NDRE          14 non-null     float64
 15  NDVI          14 non-null     float64
 16  NDWI          14 non-null     float64
 17  Plot          14 non-null     object

Unnamed: 0,system:index,B11,B12,B2,B3,B4,B5,B6,B7,B8,B8A,EVI,Latitude,Longitude,NDRE,NDVI,NDWI
count,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0
mean,6.5,0.110014,0.046186,0.033414,0.050141,0.030775,0.082441,0.240061,0.300575,0.297875,0.329002,0.516585,9.411235,-75.63436,0.556278,0.792211,-0.692699
std,4.1833,0.009888,0.007292,0.00312,0.003761,0.004069,0.004711,0.01657,0.021823,0.024164,0.022768,0.040229,0.014699,0.01277,0.027383,0.034409,0.028161
min,0.0,0.0945,0.0368,0.0288,0.0425,0.0253,0.07655,0.20875,0.25535,0.2451,0.28215,0.426469,9.392361,-75.656792,0.496578,0.71948,-0.738366
25%,3.25,0.103338,0.040275,0.031275,0.047575,0.02785,0.078713,0.228662,0.288137,0.284675,0.320325,0.491393,9.401521,-75.641015,0.538067,0.778276,-0.709947
50%,6.5,0.1066,0.0433,0.0324,0.049825,0.029675,0.081275,0.246125,0.30635,0.30285,0.3347,0.522705,9.408647,-75.633374,0.555113,0.787732,-0.693418
75%,9.75,0.117463,0.052613,0.036025,0.05315,0.033963,0.085125,0.2512,0.3153,0.3137,0.342831,0.539594,9.418269,-75.623428,0.577916,0.820342,-0.672777
max,13.0,0.12435,0.0571,0.03825,0.05595,0.03935,0.09145,0.2645,0.3352,0.3272,0.36455,0.579331,9.444222,-75.618083,0.604333,0.839022,-0.643662


Colombia Plot AGB AGC Chave2014 DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Plot                      14 non-null     object 
 1   Study_area                14 non-null     object 
 2   Latitude                  14 non-null     float64
 3   Longitude                 14 non-null     float64
 4   AGBd_kg_per_ha            14 non-null     float64
 5   AGBd_t_per_ha_chave2014   14 non-null     float64
 6   AGCd_tC_per_ha_chave2014  14 non-null     float64
dtypes: float64(5), object(2)
memory usage: 916.0+ bytes

Colombia Plot AGB AGC Chave2014  DataFrame Description:


Unnamed: 0,Latitude,Longitude,AGBd_kg_per_ha,AGBd_t_per_ha_chave2014,AGCd_tC_per_ha_chave2014
count,14.0,14.0,14.0,14.0,14.0
mean,9.411235,-75.63436,128405.491956,128.405492,60.350581
std,0.014699,0.01277,77606.548007,77.606548,36.475078
min,9.392361,-75.656792,66603.472264,66.603472,31.303632
25%,9.401521,-75.641015,96685.065672,96.685066,45.441981
50%,9.408647,-75.633374,108395.402522,108.395403,50.945839
75%,9.418269,-75.623428,124795.277006,124.795277,58.65378
max,9.444222,-75.618083,375835.12285,375.835123,176.642508


In [None]:
display(col_s1_pred_df.head())
display(col_s2_pred_df.head())
display(col_plot_df.head())


Unnamed: 0,system:index,Latitude,Longitude,Plot,Study_area,VH,VH_VV_ratio,VV,VV_minus_VH,.geo
0,0,9.410258,-75.622394,Caimanera P10,Caimanera,-13.92754,1.765943,-7.911034,5.910434,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."
1,1,9.405389,-75.655378,Caimanera P11,Caimanera,-14.336914,1.739304,-8.359852,5.946622,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."
2,2,9.396867,-75.641917,Caimanera P12,Caimanera,-14.330285,1.63809,-8.933878,5.647963,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."
3,3,9.401303,-75.632144,Caimanera P13,Caimanera,-14.412853,1.960792,-7.505751,6.556708,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."
4,4,9.402175,-75.656792,Caimanera P14,Caimanera,-13.872831,1.762286,-7.95152,6.09055,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."


Unnamed: 0,system:index,B11,B12,B2,B3,B4,B5,B6,B7,B8,B8A,EVI,Latitude,Longitude,NDRE,NDVI,NDWI,Plot,Study_area,.geo
0,0,0.11675,0.0527,0.0358,0.05595,0.0354,0.09145,0.23935,0.2951,0.303,0.32425,0.530022,9.410258,-75.622394,0.53183,0.779148,-0.674684,Caimanera P10,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."
1,1,0.11595,0.05075,0.0308,0.04745,0.03025,0.08025,0.2252,0.282,0.2717,0.3134,0.48089,9.405389,-75.655378,0.537678,0.786892,-0.688529,Caimanera P11,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."
2,2,0.12175,0.0571,0.0379,0.05345,0.03935,0.0808,0.20875,0.25535,0.2451,0.28215,0.426469,9.396867,-75.641917,0.496578,0.71948,-0.643662,Caimanera P12,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."
3,3,0.10225,0.03985,0.03235,0.0474,0.0296,0.07845,0.24725,0.3163,0.3073,0.345,0.51885,9.401303,-75.632144,0.579528,0.788571,-0.698307,Caimanera P13,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."
4,4,0.12435,0.05345,0.0361,0.0538,0.03385,0.08445,0.23125,0.28715,0.2945,0.3199,0.483964,9.402175,-75.656792,0.550188,0.777985,-0.663404,Caimanera P14,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper..."


Unnamed: 0,Plot,Study_area,Latitude,Longitude,AGBd_kg_per_ha,AGBd_t_per_ha_chave2014,AGCd_tC_per_ha_chave2014
0,Caimanera P10,Caimanera,9.410258,-75.622394,107947.135025,107.947135,50.735153
1,Caimanera P11,Caimanera,9.405389,-75.655378,76199.462781,76.199463,35.813748
2,Caimanera P12,Caimanera,9.396867,-75.641917,145465.880608,145.465881,68.368964
3,Caimanera P13,Caimanera,9.401303,-75.632144,109155.240827,109.155241,51.302963
4,Caimanera P14,Caimanera,9.402175,-75.656792,186442.110115,186.44211,87.627792


In [None]:
print(f"col_s1_pred_df: {col_s1_pred_df.shape[0]} rows x {col_s1_pred_df.shape[1]} columns")
print(f"col_s2_pred_df: {col_s2_pred_df.shape[0]} rows x {col_s2_pred_df.shape[1]} columns")
print(f"col_plot_df: {col_plot_df.shape[0]} rows x {col_plot_df.shape[1]} columns")


col_s1_pred_df: 14 rows x 10 columns
col_s2_pred_df: 14 rows x 20 columns
col_plot_df: 14 rows x 7 columns


## 2. Clean & Prep Data

---

In [None]:
if not col_s1_pred_df['Plot'].is_unique:
    print("Warning: 'Plot' column in col_s1_pred_df is not unique. Please investigate before merging.")
if not col_s2_pred_df['Plot'].is_unique:
    print("Warning: 'Plot' column in col_s2_pred_df is not unique. Please investigate before merging.")
if not col_plot_df['Plot'].is_unique:
    print("Warning: 'Plot' column in col_plot_df is not unique. Please investigate before merging.")

# Merge col_s1_pred_df and col_s2_pred_df on 'Plot'
colombia_df = pd.merge(col_s1_pred_df, col_s2_pred_df, on='Plot', how='inner', suffixes=('_s1', '_s2'))

# Merge the result with col_plot_df on 'Plot'
colombia_df = pd.merge(colombia_df, col_plot_df, on='Plot', how='inner')

print("Merged Colombia DataFrame Info:")
colombia_df.info()
print("\nMerged Colombia DataFrame Head:")
display(colombia_df.head())

# Verify the shape of the merged dataframe
print(f"\nMerged Colombia DataFrame shape: {colombia_df.shape[0]} rows x {colombia_df.shape[1]} columns")

# Record AGCd_tC_per_ha_chave2014 as the dependent variable; X and y are split out in a later cell
dependent_variable = 'AGCd_tC_per_ha_chave2014'
print(f"\nThe dependent variable for modeling is: {dependent_variable}")

Merged Colombia DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   system:index_s1           14 non-null     int64  
 1   Latitude_s1               14 non-null     float64
 2   Longitude_s1              14 non-null     float64
 3   Plot                      14 non-null     object 
 4   Study_area_s1             14 non-null     object 
 5   VH                        14 non-null     float64
 6   VH_VV_ratio               14 non-null     float64
 7   VV                        14 non-null     float64
 8   VV_minus_VH               14 non-null     float64
 9   .geo_s1                   14 non-null     object 
 10  system:index_s2           14 non-null     int64  
 11  B11                       14 non-null     float64
 12  B12                       14 non-null     float64
 13  B2                        14 non-nu

Unnamed: 0,system:index_s1,Latitude_s1,Longitude_s1,Plot,Study_area_s1,VH,VH_VV_ratio,VV,VV_minus_VH,.geo_s1,...,NDVI,NDWI,Study_area_s2,.geo_s2,Study_area,Latitude,Longitude,AGBd_kg_per_ha,AGBd_t_per_ha_chave2014,AGCd_tC_per_ha_chave2014
0,0,9.410258,-75.622394,Caimanera P10,Caimanera,-13.92754,1.765943,-7.911034,5.910434,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",...,0.779148,-0.674684,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",Caimanera,9.410258,-75.622394,107947.135025,107.947135,50.735153
1,1,9.405389,-75.655378,Caimanera P11,Caimanera,-14.336914,1.739304,-8.359852,5.946622,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",...,0.786892,-0.688529,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",Caimanera,9.405389,-75.655378,76199.462781,76.199463,35.813748
2,2,9.396867,-75.641917,Caimanera P12,Caimanera,-14.330285,1.63809,-8.933878,5.647963,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",...,0.71948,-0.643662,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",Caimanera,9.396867,-75.641917,145465.880608,145.465881,68.368964
3,3,9.401303,-75.632144,Caimanera P13,Caimanera,-14.412853,1.960792,-7.505751,6.556708,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",...,0.788571,-0.698307,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",Caimanera,9.401303,-75.632144,109155.240827,109.155241,51.302963
4,4,9.402175,-75.656792,Caimanera P14,Caimanera,-13.872831,1.762286,-7.95152,6.09055,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",...,0.777985,-0.663404,Caimanera,"{""geodesic"":false,""crs"":{""type"":""name"",""proper...",Caimanera,9.402175,-75.656792,186442.110115,186.44211,87.627792



Merged Colombia DataFrame shape: 14 rows x 35 columns

The dependent variable for modeling is: AGCd_tC_per_ha_chave2014
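
The uniqueness checks in the merge cell only print a warning, and an inner join drops unmatched plots without notice. A minimal sketch of a stricter variant is given below, assuming hypothetical miniature tables (`s1_demo`, `plot_demo`) rather than the project data. It uses the `validate` and `indicator` arguments of `pd.merge` to enforce one row per plot and to list plots that appear in only one table.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the predictor and plot tables
s1_demo = pd.DataFrame({"Plot": ["P1", "P2", "P3"], "VH": [-14.0, -13.9, -14.3]})
plot_demo = pd.DataFrame({"Plot": ["P1", "P2", "P4"], "AGC": [50.7, 35.8, 68.4]})

# validate="one_to_one" raises pandas.errors.MergeError if 'Plot' repeats on either side
merged_demo = pd.merge(s1_demo, plot_demo, on="Plot", how="inner", validate="one_to_one")

# An outer join with indicator=True lists plots that an inner join would drop silently
audit = pd.merge(s1_demo, plot_demo, on="Plot", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", ["Plot", "_merge"]]

print(merged_demo)
print(unmatched)
```

For the Colombia tables all 14 plots survive the merge, so a check like this would pass unchanged; it mainly guards against future data updates.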


## 3. Train & Test ML Models

---



Clean and prepare the `colombia_df` by removing unnecessary and redundant columns, and then split the data into features (X) and target (y = 'AGCd_tC_per_ha_chave2014') for machine learning.

In [None]:
columns_to_drop_colombia = [
    'system:index_s1',
    'system:index_s2',
    'Plot',
    '.geo_s1',
    '.geo_s2',
    'Latitude_s1',
    'Longitude_s1',
    'Latitude_s2',
    'Longitude_s2',
    'AGBd_kg_per_ha',
    'AGBd_t_per_ha_chave2014'
]

# Create a cleaned DataFrame by dropping identified columns
colombia_df_cleaned = colombia_df.drop(columns=columns_to_drop_colombia)

print("--- Colombia Cleaned DataFrame Info ---")
colombia_df_cleaned.info()
print("\n--- Colombia Cleaned DataFrame Head ---")
display(colombia_df_cleaned.head())

# Verify unique values of 'Study_area'
print("\nUnique values in 'Study_area' column:")
print(colombia_df_cleaned['Study_area'].unique())

# Drop the redundant 'Study_area' columns
colombia_df_cleaned = colombia_df_cleaned.drop(columns=['Study_area_s1', 'Study_area_s2', 'Study_area'])


--- Colombia Cleaned DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Study_area_s1             14 non-null     object 
 1   VH                        14 non-null     float64
 2   VH_VV_ratio               14 non-null     float64
 3   VV                        14 non-null     float64
 4   VV_minus_VH               14 non-null     float64
 5   B11                       14 non-null     float64
 6   B12                       14 non-null     float64
 7   B2                        14 non-null     float64
 8   B3                        14 non-null     float64
 9   B4                        14 non-null     float64
 10  B5                        14 non-null     float64
 11  B6                        14 non-null     float64
 12  B7                        14 non-null     float64
 13  B8                        1

Unnamed: 0,Study_area_s1,VH,VH_VV_ratio,VV,VV_minus_VH,B11,B12,B2,B3,B4,...,B8A,EVI,NDRE,NDVI,NDWI,Study_area_s2,Study_area,Latitude,Longitude,AGCd_tC_per_ha_chave2014
0,Caimanera,-13.92754,1.765943,-7.911034,5.910434,0.11675,0.0527,0.0358,0.05595,0.0354,...,0.32425,0.530022,0.53183,0.779148,-0.674684,Caimanera,Caimanera,9.410258,-75.622394,50.735153
1,Caimanera,-14.336914,1.739304,-8.359852,5.946622,0.11595,0.05075,0.0308,0.04745,0.03025,...,0.3134,0.48089,0.537678,0.786892,-0.688529,Caimanera,Caimanera,9.405389,-75.655378,35.813748
2,Caimanera,-14.330285,1.63809,-8.933878,5.647963,0.12175,0.0571,0.0379,0.05345,0.03935,...,0.28215,0.426469,0.496578,0.71948,-0.643662,Caimanera,Caimanera,9.396867,-75.641917,68.368964
3,Caimanera,-14.412853,1.960792,-7.505751,6.556708,0.10225,0.03985,0.03235,0.0474,0.0296,...,0.345,0.51885,0.579528,0.788571,-0.698307,Caimanera,Caimanera,9.401303,-75.632144,51.302963
4,Caimanera,-13.872831,1.762286,-7.95152,6.09055,0.12435,0.05345,0.0361,0.0538,0.03385,...,0.3199,0.483964,0.550188,0.777985,-0.663404,Caimanera,Caimanera,9.402175,-75.656792,87.627792



Unique values in 'Study_area' column:
['Caimanera']


Since the 'Study_area' column in `colombia_df_cleaned` has only one unique value ('Caimanera'), it carries no information for the model and is dropped along with 'Study_area_s1' and 'Study_area_s2'.

With these columns removed, the features (X) and the target variable (y) are defined, and the data is split into training and testing sets using `train_test_split`.



In [None]:
from sklearn.model_selection import train_test_split

# Define the target variable
y = colombia_df_cleaned['AGCd_tC_per_ha_chave2014']

# Define the features (all columns except the target)
X = colombia_df_cleaned.drop(columns=['AGCd_tC_per_ha_chave2014'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

print("--- Features (X) Head ---")
display(X.head())
print("--- Target (y) Head ---")
display(y.head())

print(f"\nShape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

--- Features (X) Head ---


Unnamed: 0,VH,VH_VV_ratio,VV,VV_minus_VH,B11,B12,B2,B3,B4,B5,B6,B7,B8,B8A,EVI,NDRE,NDVI,NDWI,Latitude,Longitude
0,-13.92754,1.765943,-7.911034,5.910434,0.11675,0.0527,0.0358,0.05595,0.0354,0.09145,0.23935,0.2951,0.303,0.32425,0.530022,0.53183,0.779148,-0.674684,9.410258,-75.622394
1,-14.336914,1.739304,-8.359852,5.946622,0.11595,0.05075,0.0308,0.04745,0.03025,0.08025,0.2252,0.282,0.2717,0.3134,0.48089,0.537678,0.786892,-0.688529,9.405389,-75.655378
2,-14.330285,1.63809,-8.933878,5.647963,0.12175,0.0571,0.0379,0.05345,0.03935,0.0808,0.20875,0.25535,0.2451,0.28215,0.426469,0.496578,0.71948,-0.643662,9.396867,-75.641917
3,-14.412853,1.960792,-7.505751,6.556708,0.10225,0.03985,0.03235,0.0474,0.0296,0.07845,0.24725,0.3163,0.3073,0.345,0.51885,0.579528,0.788571,-0.698307,9.401303,-75.632144
4,-13.872831,1.762286,-7.95152,6.09055,0.12435,0.05345,0.0361,0.0538,0.03385,0.08445,0.23125,0.28715,0.2945,0.3199,0.483964,0.550188,0.777985,-0.663404,9.402175,-75.656792


--- Target (y) Head ---


Unnamed: 0,AGCd_tC_per_ha_chave2014
0,50.735153
1,35.813748
2,68.368964
3,51.302963
4,87.627792



Shape of X_train: (11, 20)
Shape of X_test: (3, 20)
Shape of y_train: (11,)
Shape of y_test: (3,)
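
The 11/3 split follows from the rounding applied to a fractional `test_size`. The sketch below reproduces the arithmetic in plain Python, on the assumption that scikit-learn rounds the test count up and assigns the remaining rows to the training set when `train_size` is not given.

```python
import math

n_samples = 14      # rows in colombia_df_cleaned
test_size = 0.15    # fraction passed to train_test_split

# Assumed rounding rule: test count is rounded up, training set takes the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```

With so few rows, each test plot accounts for a third of every reported test metric.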


### Train and Evaluate Linear Regression and Decision Tree Models

Train a multiple linear regression model and a decision tree regressor on the prepared Colombia data and evaluate their performance.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# --- Linear Regression Model ---
print("--- Training Linear Regression Model ---")
# Instantiate Linear Regression model
linear_model = LinearRegression()

# Train the model
linear_model.fit(X_train, y_train)

# Make predictions
y_pred_lr = linear_model.predict(X_test)

# Evaluate the model
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print(f"Linear Regression Model Performance:\n")
print(f"Mean Squared Error (MSE): {mse_lr:.2f}")
print(f"R-squared (R2): {r2_lr:.2f}\n")

# --- Decision Tree Regressor Model ---
print("--- Training Decision Tree Regressor Model ---")
# Instantiate Decision Tree Regressor model
dt_model = DecisionTreeRegressor(random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

# Make predictions
y_pred_dt = dt_model.predict(X_test)

# Evaluate the model
mse_dt = mean_squared_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

print(f"Decision Tree Regressor Model Performance:\n")
print(f"Mean Squared Error (MSE): {mse_dt:.2f}")
print(f"R-squared (R2): {r2_dt:.2f}")

--- Training Linear Regression Model ---
Linear Regression Model Performance:

Mean Squared Error (MSE): 1065.91
R-squared (R2): -157.69

--- Training Decision Tree Regressor Model ---
Decision Tree Regressor Model Performance:

Mean Squared Error (MSE): 88.11
R-squared (R2): -12.12


### Train and Evaluate XGBoost Model

Train an XGBoost Regressor model on the prepared Colombia data and evaluate its performance.


In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 2. Instantiate an XGBRegressor model
xgb_model = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42
)

# 3. Train the XGBoost model
xgb_model.fit(X_train, y_train)

# 4. Make predictions on the X_test dataset
y_pred_xgb = xgb_model.predict(X_test)

# 5. Calculate and print the evaluation metrics
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost Model Performance:\n")
print(f"Mean Squared Error (MSE): {mse_xgb:.2f}")
print(f"R-squared (R2): {r2_xgb:.2f}")

XGBoost Model Performance:

Mean Squared Error (MSE): 28.14
R-squared (R2): -3.19


### Train and Evaluate Random Forest Model

Train a Random Forest Regressor model on the prepared Colombia data and evaluate its performance.


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 3. Instantiate a RandomForestRegressor model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# 4. Train the Random Forest model
rf_model.fit(X_train, y_train)

# 5. Make predictions on the X_test dataset
y_pred_rf = rf_model.predict(X_test)

# 6. Calculate and print the evaluation metrics
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest Model Performance:\n")
print(f"Mean Squared Error (MSE): {mse_rf:.2f}")
print(f"R-squared (R2): {r2_rf:.2f}")

Random Forest Model Performance:

Mean Squared Error (MSE): 307.72
R-squared (R2): -44.81


### Train and Evaluate Ensemble Model

Create and train an ensemble model using the base models (XGBoost, Random Forest) and evaluate its performance.

The base estimators (the XGBoost and Random Forest models defined above) are passed to a `StackingRegressor` with `LinearRegression` as the final estimator. The stacked model is fitted on the training set and used to predict the test set, and its performance is evaluated with the Mean Squared Error (MSE) and R-squared (R2).


In [None]:
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 2. Define a list of base estimators
estimators = [
    ('xgb', xgb_model),
    ('rf', rf_model)
]

# 3. Instantiate a StackingRegressor with the base estimators and a LinearRegression model as the final_estimator
ensemble_model = StackingRegressor(
    estimators=estimators,
    final_estimator=LinearRegression(),
    cv=5 # Using 5-fold cross-validation for stacking
)

# 4. Train the StackingRegressor
ensemble_model.fit(X_train, y_train)

# 5. Make predictions on the X_test dataset
y_pred_ensemble = ensemble_model.predict(X_test)

# 6. Calculate and print the evaluation metrics
mse_ensemble = mean_squared_error(y_test, y_pred_ensemble)
r2_ensemble = r2_score(y_test, y_pred_ensemble)

print(f"Ensemble Model Performance (StackingRegressor with LinearRegression):")
print(f"Mean Squared Error (MSE): {mse_ensemble:.2f}")
print(f"R-squared (R2): {r2_ensemble:.2f}")

Ensemble Model Performance (StackingRegressor with LinearRegression):
Mean Squared Error (MSE): 271.51
R-squared (R2): -39.42
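
The `cv=5` setting in the stacking cell splits the 11 training plots into five folds. The sketch below shows how small those folds are, assuming that an integer `cv` for a regressor corresponds to an unshuffled `KFold`. The placeholder array stands in for `X_train` and carries no real data.

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 11                          # number of rows in X_train
placeholder = np.zeros((n_train, 1))  # stand-in array, values are irrelevant here

# Record (training rows, held-out rows) for each of the five folds
fold_sizes = [
    (len(train_idx), len(val_idx))
    for train_idx, val_idx in KFold(n_splits=5).split(placeholder)
]
print(fold_sizes)
```

Each base model is therefore refitted on only 8 or 9 plots to produce the out-of-fold predictions, and the final `LinearRegression` is trained on those 11 predictions per base model.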


### Train and Evaluate Artificial Neural Network (ANN)

Train an Artificial Neural Network (ANN) on the prepared Colombia data and evaluate its performance.

The features are first standardized with `StandardScaler`, which is fitted on the training set only and then applied to the test set. A Keras `Sequential` model is then trained on the scaled features.

In [None]:
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import mean_squared_error, r2_score

# 2. Scale the input features X_train and X_test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Create a Sequential ANN model
ann_model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1)  # Output layer for regression, no activation
])

# 4. Compile the model
ann_model.compile(optimizer='adam', loss='mean_squared_error')

# 5. Train the ANN model
history = ann_model.fit(X_train_scaled, y_train, epochs=50, batch_size=2, verbose=0)

# 6. Make predictions on the scaled X_test dataset
y_pred_ann = ann_model.predict(X_test_scaled).flatten()

# 7. Calculate and print the evaluation metrics
mse_ann = mean_squared_error(y_test, y_pred_ann)
r2_ann = r2_score(y_test, y_pred_ann)

print(f"ANN Model Performance:\n")
print(f"Mean Squared Error (MSE): {mse_ann:.2f}")
print(f"R-squared (R2): {r2_ann:.2f}")

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 65ms/step
ANN Model Performance:

Mean Squared Error (MSE): 315.51
R-squared (R2): -45.97

The next cell repeats the ANN experiment with an explicit `Input` layer in place of the `input_shape` argument on the first `Dense` layer, which avoids the Keras warning shown above. No random seed is set, so the metrics differ between the two runs.


In [None]:
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from sklearn.metrics import mean_squared_error, r2_score

# 2. Scale the input features X_train and X_test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Create a Sequential ANN model, using Input layer for clarity and best practice
ann_model = Sequential([
    Input(shape=(X_train_scaled.shape[1],)), # Explicit Input layer
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1)  # Output layer for regression, no activation
])

# 4. Compile the model
ann_model.compile(optimizer='adam', loss='mean_squared_error')

# 5. Train the ANN model
history = ann_model.fit(X_train_scaled, y_train, epochs=50, batch_size=2, verbose=0)

# 6. Make predictions on the scaled X_test dataset
y_pred_ann = ann_model.predict(X_test_scaled).flatten()

# 7. Calculate and print the evaluation metrics
mse_ann = mean_squared_error(y_test, y_pred_ann)
r2_ann = r2_score(y_test, y_pred_ann)

print(f"ANN Model Performance:\n")
print(f"Mean Squared Error (MSE): {mse_ann:.2f}")
print(f"R-squared (R2): {r2_ann:.2f}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 77ms/step
ANN Model Performance:

Mean Squared Error (MSE): 463.29
R-squared (R2): -67.97


## Summary: Colombia Data

Among the models tested on the Colombia dataset, the **XGBoost Regressor** had the lowest test error, with a Mean Squared Error (MSE) of 28.14 and an R-squared (${R}^{2}$) of -3.19. A negative ${R}^{2}$ means the model predicts the three test plots worse than simply using their mean would, so XGBoost is the least poor of the models here rather than a usable one.
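
One way to read the strongly negative scores is through the identity ${R}^{2} = 1 - \mathrm{MSE} / \mathrm{Var}(y_{test})$, where the variance is the population variance of the test targets. The sketch below rearranges it to recover the implied test-set variance from each (MSE, ${R}^{2}$) pair printed above; the dictionary keys are labels chosen for illustration.

```python
# (MSE, R2) pairs copied from the cell outputs above
reported = {
    "linear_regression": (1065.91, -157.69),
    "decision_tree": (88.11, -12.12),
    "xgboost": (28.14, -3.19),
    "random_forest": (307.72, -44.81),
    "stacking": (271.51, -39.42),
}

# R2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R2)
implied_variance = {name: mse / (1.0 - r2) for name, (mse, r2) in reported.items()}

for name, var in implied_variance.items():
    print(f"{name}: implied variance = {var:.2f}, std = {var ** 0.5:.2f} tC/ha")
```

Every pair implies a variance of about 6.72, that is, a standard deviation of roughly 2.6 tC/ha across the three test plots, compared with a standard deviation of 36.5 tC/ha across all 14 plots. The test targets happen to be tightly clustered, so even moderate absolute errors translate into very negative ${R}^{2}$ values.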

### Data Analysis Key Findings
*   **Data Preparation:**
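    *   The `colombia_df` was cleaned by dropping 11 unnecessary or redundant columns (ID columns, `.geo` geometry columns, the duplicated `_s1`/`_s2` coordinate columns, and the alternative AGB target variables), resulting in `colombia_df_cleaned` with 24 columns. The plot-level `Latitude` and `Longitude` columns were kept and remain among the 20 features.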
    *   The `Study_area`, `Study_area_s1`, and `Study_area_s2` columns were also dropped due to containing only one unique value ('Caimanera').
    *   The data was split into features (X) and target (y = 'AGCd_tC_per_ha_chave2014'), with the training set having 11 samples and 20 features (`X_train` shape: (11, 20)) and the test set having 3 samples and 20 features (`X_test` shape: (3, 20)).
*   **XGBoost Model Performance:** The XGBoost Regressor achieved an MSE of 28.14 and an ${R}^{2}$ of -3.19.
*   **Random Forest Model Performance:** The Random Forest Regressor performed worse, with an MSE of 307.72 and an ${R}^{2}$ of -44.81.
*   **Ensemble Model Performance:** The StackingRegressor (XGBoost + Random Forest with Linear Regression as final estimator) showed an MSE of 271.51 and an ${R}^{2}$ of -39.42, not outperforming the standalone XGBoost.
*   **Artificial Neural Network (ANN) Performance:** The ANN model, trained on scaled features with an explicit `Input` layer, yielded an MSE of 463.29 and an ${R}^{2}$ of -67.97 in the run shown above, the poorest performance among all models.

### Insights or Next Steps
*   All models performed poorly, as indicated by negative ${R}^{2}$ values, suggesting that the current features and limited dataset size (especially the very small test set of 3 samples) are insufficient to accurately predict the target variable.
*   **Next Steps:**
    *   Investigate the dataset for more samples, as the current training (11 samples) and testing (3 samples) sizes are extremely small and likely contribute to the poor model performance and high variance in metrics.
    *   Explore feature engineering to create more relevant predictors, and perform thorough hyperparameter tuning for each model to potentially improve performance given the dataset constraints.
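
The size of these negative scores follows from the identity $R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y_{test})$, where the variance is the population variance of the test targets. As a rough check, the sketch below back-computes the implied test-set variance from the rounded XGBoost, Random Forest, and stacking metrics reported above. The dictionary `reported` is only a transcription of those numbers, not an object from the notebook.

```python
import math

# (MSE, R2) pairs transcribed from the Colombia results above (rounded to 2 d.p.)
reported = {
    "XGBoost": (28.14, -3.19),
    "Random Forest": (307.72, -44.81),
    "Stacking": (271.51, -39.42),
}

# R2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R2)
implied_var = {name: mse / (1.0 - r2) for name, (mse, r2) in reported.items()}

for name, var in implied_var.items():
    print(f"{name}: implied Var(y_test) = {var:.2f}, std = {math.sqrt(var):.2f} tC/ha")
```

All three models imply roughly the same variance (about 6.7, a standard deviation near 2.6 tC/ha), which suggests the three test plots have very similar carbon densities. With so little spread in the test targets, even modest prediction errors produce strongly negative ${R}^{2}$ values.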


# **Panama Data & Model**

## 1. Import processed data from GitHub repo for Panama

---

In [None]:
!wget https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Panama/02_interim/Panama_S1_predictors_2022_2023_allPlots.xlsx
!wget https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Panama/02_interim/Panama_S2_predictors_2022_2023_allPlots.xlsx
!wget https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Panama/02_interim/Panama_plot_AGB_AGC_Chave2014.xlsx

--2025-12-31 11:49:30--  https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Panama/02_interim/Panama_S1_predictors_2022_2023_allPlots.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9965 (9.7K) [application/octet-stream]
Saving to: ‘Panama_S1_predictors_2022_2023_allPlots.xlsx’


2025-12-31 11:49:30 (97.8 MB/s) - ‘Panama_S1_predictors_2022_2023_allPlots.xlsx’ saved [9965/9965]

--2025-12-31 11:49:30--  https://raw.githubusercontent.com/Kevan123/MRV-Blue-Carbon-Project-LAC-2025/Country-Model---Panama/02_interim/Panama_S2_predictors_2022_2023_allPlots.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com

Load the Panama S1 predictors, S2 predictors, and plot AGB/AGC data into separate pandas DataFrames, then display their head, info, and description. Finally, merge them into a single DataFrame, ensuring uniqueness of the 'Plot' column.

In [None]:
import pandas as pd

pan_s1_pred_df = pd.read_excel('/content/Panama_S1_predictors_2022_2023_allPlots.xlsx')
pan_s2_pred_df = pd.read_excel('/content/Panama_S2_predictors_2022_2023_allPlots.xlsx')
pan_plot_df = pd.read_excel('/content/Panama_plot_AGB_AGC_Chave2014.xlsx')

print("All Panama Excel files have been loaded into DataFrames:")
print("pan_s1_pred_df, pan_s2_pred_df, pan_plot_df")

All Panama Excel files have been loaded into DataFrames:
pan_s1_pred_df, pan_s2_pred_df, pan_plot_df


In [None]:
print("\n--- pan_s1_pred_df ---")
display(pan_s1_pred_df.head())
print("pan_s1_pred_df Info:")
pan_s1_pred_df.info()
print("\npan_s1_pred_df Description:")
display(pan_s1_pred_df.describe())

print("\n--- pan_s2_pred_df ---")
display(pan_s2_pred_df.head())
print("pan_s2_pred_df Info:")
pan_s2_pred_df.info()
print("\npan_s2_pred_df Description:")
display(pan_s2_pred_df.describe())

print("\n--- pan_plot_df ---")
display(pan_plot_df.head())
print("pan_plot_df Info:")
pan_plot_df.info()
print("\npan_plot_df Description:")
display(pan_plot_df.describe())


--- pan_s1_pred_df ---


Unnamed: 0,system:index,Plot,Study_area,VH,VH_VV_ratio,VV,VV_minus_VH,latitude,longitude,.geo
0,0,1 PTYB,PTYB,-18.253788,1.137714,-16.044262,2.209525,8.686865,-78.597537,"{""type"":""Point"",""coordinates"":[-78.59753659090..."
1,2,10 PTYB,PTYB,-15.488933,1.755239,-8.8244,6.664533,8.978688,-79.049684,"{""type"":""Point"",""coordinates"":[-79.04968445454..."
2,4,11 PTYB,PTYB,-15.187275,1.24389,-12.209504,2.977771,8.993518,-79.049947,"{""type"":""Point"",""coordinates"":[-79.04994706666..."
3,6,12 PTYB,PTYB,-14.310285,2.128691,-6.722575,7.58771,9.000275,-79.101021,"{""type"":""Point"",""coordinates"":[-79.10102137931..."
4,8,13 PTYB,PTYB,-18.006457,2.452954,-7.340724,10.665733,9.061109,-79.083997,"{""type"":""Point"",""coordinates"":[-79.08399679245..."


pan_s1_pred_df Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   system:index  41 non-null     object 
 1   Plot          41 non-null     object 
 2   Study_area    41 non-null     object 
 3   VH            41 non-null     float64
 4   VH_VV_ratio   41 non-null     float64
 5   VV            41 non-null     float64
 6   VV_minus_VH   41 non-null     float64
 7   latitude      41 non-null     float64
 8   longitude     41 non-null     float64
 9   .geo          41 non-null     object 
dtypes: float64(6), object(4)
memory usage: 3.3+ KB

pan_s1_pred_df Description:


Unnamed: 0,VH,VH_VV_ratio,VV,VV_minus_VH,latitude,longitude
count,41.0,41.0,41.0,41.0,41.0,41.0
mean,-15.187278,1.899261,-8.619641,6.567637,8.541295,-79.701189
std,1.961519,0.541349,2.616165,2.631114,0.392228,0.737187
min,-20.110386,1.137714,-16.044262,1.649477,7.910432,-80.518897
25%,-16.431861,1.589142,-10.402564,5.442194,8.264046,-80.391957
50%,-15.187275,1.812761,-7.906982,7.004431,8.333108,-80.239453
75%,-13.264306,2.215556,-6.722575,7.953938,8.978688,-79.049947
max,-11.721276,3.293236,-4.186126,12.011482,9.082679,-78.597537



--- pan_s2_pred_df ---


Unnamed: 0,system:index,B11,B12,B2,B3,B4,B5,B6,B7,B8,EVI,NDRE,NDVI,NDWI,Plot,Study_area,latitude,longitude,.geo
0,00000000000000000000_0,0.1109,0.048086,0.0335,0.04252,0.026076,0.073275,0.23835,0.292029,0.2874,0.480669,0.563478,0.819242,-0.714076,1 PTYB,PTYB,8.686844,-78.597512,"{""geodesic"":false,""type"":""Point"",""coordinates""..."
1,00000000000000000002_0,0.123118,0.058617,0.0507,0.066267,0.051657,0.083986,0.254211,0.324343,0.324086,0.58026,0.551226,0.703468,-0.623022,10 PTYB,PTYB,8.978706,-79.049724,"{""geodesic"":false,""type"":""Point"",""coordinates""..."
2,00000000000000000004_0,0.12258,0.053687,0.042844,0.05975,0.03725,0.087833,0.27478,0.33292,0.32624,0.570436,0.526082,0.744355,-0.65724,11 PTYB,PTYB,8.993528,-79.049903,"{""geodesic"":false,""type"":""Point"",""coordinates""..."
3,00000000000000000006_0,0.113028,0.045347,0.034851,0.042674,0.02774,0.073038,0.26917,0.340473,0.340171,0.569494,0.638195,0.832,-0.743986,12 PTYB,PTYB,9.000266,-79.101018,"{""geodesic"":false,""type"":""Point"",""coordinates""..."
4,00000000000000000008_0,0.1305,0.05685,0.034646,0.049655,0.03435,0.0817,0.254414,0.323433,0.309956,0.534617,0.547772,0.771619,-0.677506,13 PTYB,PTYB,9.061082,-79.084039,"{""geodesic"":false,""type"":""Point"",""coordinates""..."


pan_s2_pred_df Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   system:index  41 non-null     object 
 1   B11           41 non-null     float64
 2   B12           41 non-null     float64
 3   B2            41 non-null     float64
 4   B3            41 non-null     float64
 5   B4            41 non-null     float64
 6   B5            41 non-null     float64
 7   B6            41 non-null     float64
 8   B7            41 non-null     float64
 9   B8            41 non-null     float64
 10  EVI           41 non-null     float64
 11  NDRE          41 non-null     float64
 12  NDVI          41 non-null     float64
 13  NDWI          41 non-null     float64
 14  Plot          41 non-null     object 
 15  Study_area    41 non-null     object 
 16  latitude      41 non-null     float64
 17  longitude     41 non-null     float64
 18  .geo       

Unnamed: 0,B11,B12,B2,B3,B4,B5,B6,B7,B8,EVI,NDRE,NDVI,NDWI,latitude,longitude
count,41.0,41.0,41.0,41.0,41.0,41.0,41.0,41.0,41.0,41.0,41.0,41.0,41.0,41.0,41.0
mean,0.120066,0.055843,0.037221,0.054432,0.036548,0.084746,0.246423,0.301998,0.295743,0.491151,0.528928,0.75791,-0.666091,8.541288,-79.701191
std,0.024529,0.020091,0.014419,0.015622,0.015861,0.018907,0.025533,0.033624,0.033888,0.070225,0.093347,0.102936,0.086394,0.392224,0.737187
min,0.095967,0.0368,0.023357,0.035761,0.019168,0.065178,0.159767,0.185475,0.197,0.25813,0.257145,0.419592,-0.756478,7.91043,-80.518918
25%,0.103725,0.044586,0.026831,0.042433,0.026836,0.073038,0.23794,0.2914,0.281826,0.473291,0.490269,0.744355,-0.732683,8.264007,-80.391986
50%,0.1115,0.048386,0.0335,0.049521,0.028813,0.0806,0.24532,0.302017,0.300333,0.508304,0.547772,0.78575,-0.688318,8.333087,-80.239453
75%,0.12258,0.05685,0.043,0.065514,0.042237,0.08792,0.254414,0.323947,0.322067,0.545974,0.601669,0.829221,-0.638252,8.978706,-79.049903
max,0.1971,0.12175,0.09212,0.1044,0.083175,0.140654,0.299277,0.363545,0.36344,0.588522,0.638195,0.849624,-0.41878,9.082641,-78.597512



--- pan_plot_df ---


Unnamed: 0,Plot,Study_area,Latitude,Longitude,AGBd_kg_per_ha,AGBd_t_per_ha_chave2014,AGCd_tC_per_ha_chave2014
0,1 PTYB,PTYB,8.686865,-78.597537,317515.099014,317.515099,149.232097
1,1 ParB,ParB,8.042502,-80.45969,116862.70252,116.862703,54.92547
2,10 PTYB,PTYB,8.978688,-79.049684,272038.65052,272.038651,127.858166
3,10 ParB,ParB,8.321395,-80.339017,13722.469614,13.72247,6.449561
4,11 PTYB,PTYB,8.993518,-79.049947,166582.672007,166.582672,78.293856


pan_plot_df Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Plot                      41 non-null     object 
 1   Study_area                41 non-null     object 
 2   Latitude                  41 non-null     float64
 3   Longitude                 41 non-null     float64
 4   AGBd_kg_per_ha            41 non-null     float64
 5   AGBd_t_per_ha_chave2014   41 non-null     float64
 6   AGCd_tC_per_ha_chave2014  41 non-null     float64
dtypes: float64(5), object(2)
memory usage: 2.4+ KB

pan_plot_df Description:


Unnamed: 0,Latitude,Longitude,AGBd_kg_per_ha,AGBd_t_per_ha_chave2014,AGCd_tC_per_ha_chave2014
count,41.0,41.0,41.0,41.0,41.0
mean,8.541295,-79.701189,164558.03147,164.558031,77.342275
std,0.392228,0.737187,81028.274457,81.028274,38.083289
min,7.910432,-80.518897,13722.469614,13.72247,6.449561
25%,8.264046,-80.391957,94410.023283,94.410023,44.372711
50%,8.333108,-80.239453,160023.515121,160.023515,75.211052
75%,8.978688,-79.049947,218935.22602,218.935226,102.899556
max,9.082679,-78.597537,379316.183274,379.316183,178.278606


In [None]:
print('Checking for unique Plot IDs before merge:')
if not pan_s1_pred_df['Plot'].is_unique:
    print("Warning: 'Plot' column in pan_s1_pred_df is not unique. Investigate before merging.")
else:
    print("Plot column in pan_s1_pred_df is unique.")

if not pan_s2_pred_df['Plot'].is_unique:
    print("Warning: 'Plot' column in pan_s2_pred_df is not unique. Investigate before merging.")
else:
    print("Plot column in pan_s2_pred_df is unique.")

if not pan_plot_df['Plot'].is_unique:
    print("Warning: 'Plot' column in pan_plot_df is not unique. Investigate before merging.")
else:
    print("Plot column in pan_plot_df is unique.")

# Merge pan_s1_pred_df and pan_s2_pred_df on 'Plot'
panama_df = pd.merge(pan_s1_pred_df, pan_s2_pred_df, on='Plot', how='inner', suffixes=('_s1', '_s2'))

# Merge the result with pan_plot_df on 'Plot'
panama_df = pd.merge(panama_df, pan_plot_df, on='Plot', how='inner')

print("\nMerged Panama DataFrame Info:")
panama_df.info()
print("\nMerged Panama DataFrame Head:")
display(panama_df.head())

# Verify the shape of the merged dataframe
print(f"\nMerged Panama DataFrame shape: {panama_df.shape[0]} rows x {panama_df.shape[1]} columns")

# Store the name of the dependent variable (target column) used for modeling
dependent_variable_panama = 'AGCd_tC_per_ha_chave2014'
print(f"\nThe dependent variable for modeling in Panama is: {dependent_variable_panama}")

Checking for unique Plot IDs before merge:
Plot column in pan_s1_pred_df is unique.
Plot column in pan_s2_pred_df is unique.
Plot column in pan_plot_df is unique.

Merged Panama DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 34 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   system:index_s1           41 non-null     object 
 1   Plot                      41 non-null     object 
 2   Study_area_s1             41 non-null     object 
 3   VH                        41 non-null     float64
 4   VH_VV_ratio               41 non-null     float64
 5   VV                        41 non-null     float64
 6   VV_minus_VH               41 non-null     float64
 7   latitude_s1               41 non-null     float64
 8   longitude_s1              41 non-null     float64
 9   .geo_s1                   41 non-null     object 
 10  system:index_s2           41 non-null 

Unnamed: 0,system:index_s1,Plot,Study_area_s1,VH,VH_VV_ratio,VV,VV_minus_VH,latitude_s1,longitude_s1,.geo_s1,...,Study_area_s2,latitude_s2,longitude_s2,.geo_s2,Study_area,Latitude,Longitude,AGBd_kg_per_ha,AGBd_t_per_ha_chave2014,AGCd_tC_per_ha_chave2014
0,0,1 PTYB,PTYB,-18.253788,1.137714,-16.044262,2.209525,8.686865,-78.597537,"{""type"":""Point"",""coordinates"":[-78.59753659090...",...,PTYB,8.686844,-78.597512,"{""geodesic"":false,""type"":""Point"",""coordinates""...",PTYB,8.686865,-78.597537,317515.099014,317.515099,149.232097
1,2,10 PTYB,PTYB,-15.488933,1.755239,-8.8244,6.664533,8.978688,-79.049684,"{""type"":""Point"",""coordinates"":[-79.04968445454...",...,PTYB,8.978706,-79.049724,"{""geodesic"":false,""type"":""Point"",""coordinates""...",PTYB,8.978688,-79.049684,272038.65052,272.038651,127.858166
2,4,11 PTYB,PTYB,-15.187275,1.24389,-12.209504,2.977771,8.993518,-79.049947,"{""type"":""Point"",""coordinates"":[-79.04994706666...",...,PTYB,8.993528,-79.049903,"{""geodesic"":false,""type"":""Point"",""coordinates""...",PTYB,8.993518,-79.049947,166582.672007,166.582672,78.293856
3,6,12 PTYB,PTYB,-14.310285,2.128691,-6.722575,7.58771,9.000275,-79.101021,"{""type"":""Point"",""coordinates"":[-79.10102137931...",...,PTYB,9.000266,-79.101018,"{""geodesic"":false,""type"":""Point"",""coordinates""...",PTYB,9.000275,-79.101021,280797.34708,280.797347,131.974753
4,8,13 PTYB,PTYB,-18.006457,2.452954,-7.340724,10.665733,9.061109,-79.083997,"{""type"":""Point"",""coordinates"":[-79.08399679245...",...,PTYB,9.061082,-79.084039,"{""geodesic"":false,""type"":""Point"",""coordinates""...",PTYB,9.061109,-79.083997,215276.758959,215.276759,101.180077



Merged Panama DataFrame shape: 41 rows x 34 columns

The dependent variable for modeling in Panama is: AGCd_tC_per_ha_chave2014
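
The uniqueness checks above can also be enforced inside the merge itself. The following is a minimal sketch on toy frames (the plot IDs and values are made up for illustration). `validate="one_to_one"` makes pandas raise a `MergeError` if a key repeats, and an outer merge with `indicator=True` reveals plots missing from either side, which an inner merge would silently drop.

```python
import pandas as pd

# Toy stand-ins for the predictor and plot tables (illustrative values only)
s1 = pd.DataFrame({"Plot": ["1 A", "2 A", "3 A"], "VH": [-18.2, -15.5, -15.2]})
plots = pd.DataFrame({"Plot": ["1 A", "2 A", "4 A"], "AGC": [149.2, 127.9, 78.3]})

# Outer merge with an indicator column shows which side each Plot came from
check = pd.merge(s1, plots, on="Plot", how="outer",
                 validate="one_to_one", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["Plot", "_merge"]]
print(unmatched)

# A duplicated key violates the one-to-one contract and raises MergeError
dup = pd.concat([s1, s1.iloc[[0]]], ignore_index=True)
try:
    pd.merge(dup, plots, on="Plot", how="inner", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("Duplicate key detected:", raised)
```

In the Panama data all three tables hold the same 41 plots, so the inner merge loses nothing, but the check becomes useful as more plots or sensors are added.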


## 2. Clean & Prep Data

---

Remove unnecessary and redundant columns from `panama_df` and consolidate similar columns.


In [None]:
columns_to_drop = [
    'system:index_s1',
    'system:index_s2',
    '.geo_s1',
    '.geo_s2',
    'latitude_s1',   # Sentinel-1 sample coordinates; the field 'Latitude'/'Longitude' columns are kept
    'longitude_s1',
    'latitude_s2',
    'longitude_s2',
    'Study_area_s1',
    'Study_area_s2'
]

# Drop the identified columns from panama_df
panama_df_cleaned = panama_df.drop(columns=columns_to_drop)

print("Columns dropped successfully. Displaying info and head of the cleaned DataFrame:")
panama_df_cleaned.info()
display(panama_df_cleaned.head())

Columns dropped successfully. Displaying info and head of the cleaned DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Plot                      41 non-null     object 
 1   VH                        41 non-null     float64
 2   VH_VV_ratio               41 non-null     float64
 3   VV                        41 non-null     float64
 4   VV_minus_VH               41 non-null     float64
 5   B11                       41 non-null     float64
 6   B12                       41 non-null     float64
 7   B2                        41 non-null     float64
 8   B3                        41 non-null     float64
 9   B4                        41 non-null     float64
 10  B5                        41 non-null     float64
 11  B6                        41 non-null     float64
 12  B7                        41 non-null    

Unnamed: 0,Plot,VH,VH_VV_ratio,VV,VV_minus_VH,B11,B12,B2,B3,B4,...,EVI,NDRE,NDVI,NDWI,Study_area,Latitude,Longitude,AGBd_kg_per_ha,AGBd_t_per_ha_chave2014,AGCd_tC_per_ha_chave2014
0,1 PTYB,-18.253788,1.137714,-16.044262,2.209525,0.1109,0.048086,0.0335,0.04252,0.026076,...,0.480669,0.563478,0.819242,-0.714076,PTYB,8.686865,-78.597537,317515.099014,317.515099,149.232097
1,10 PTYB,-15.488933,1.755239,-8.8244,6.664533,0.123118,0.058617,0.0507,0.066267,0.051657,...,0.58026,0.551226,0.703468,-0.623022,PTYB,8.978688,-79.049684,272038.65052,272.038651,127.858166
2,11 PTYB,-15.187275,1.24389,-12.209504,2.977771,0.12258,0.053687,0.042844,0.05975,0.03725,...,0.570436,0.526082,0.744355,-0.65724,PTYB,8.993518,-79.049947,166582.672007,166.582672,78.293856
3,12 PTYB,-14.310285,2.128691,-6.722575,7.58771,0.113028,0.045347,0.034851,0.042674,0.02774,...,0.569494,0.638195,0.832,-0.743986,PTYB,9.000275,-79.101021,280797.34708,280.797347,131.974753
4,13 PTYB,-18.006457,2.452954,-7.340724,10.665733,0.1305,0.05685,0.034646,0.049655,0.03435,...,0.534617,0.547772,0.771619,-0.677506,PTYB,9.061109,-79.083997,215276.758959,215.276759,101.180077
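
Before discarding the suffixed coordinate columns, it can be worth confirming that they agree with the field coordinates. The sketch below uses the latitudes of the first three rows shown in the merged output above and an assumed tolerance of 1e-3 degrees (roughly 100 m). The threshold is an illustrative choice, not a project setting.

```python
import numpy as np
import pandas as pd

# First three rows of the merged Panama table, latitude columns only
df = pd.DataFrame({
    "Plot": ["1 PTYB", "10 PTYB", "11 PTYB"],
    "latitude_s1": [8.686865, 8.978688, 8.993518],
    "latitude_s2": [8.686844, 8.978706, 8.993528],
    "Latitude": [8.686865, 8.978688, 8.993518],
})

tol_deg = 1e-3  # assumed tolerance in degrees

# Flag rows where a satellite-derived latitude departs from the field latitude
field_lat = df["Latitude"].to_numpy()
agrees = pd.DataFrame(
    {
        col: np.isclose(df[col].to_numpy(), field_lat, rtol=0.0, atol=tol_deg)
        for col in ["latitude_s1", "latitude_s2"]
    },
    index=df["Plot"],
)
print(agrees)
print("All coordinates agree:", bool(agrees.to_numpy().all()))
```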


## 3. Train & Test ML Models

---



The `panama_df_cleaned` DataFrame is prepared for machine learning by dropping unnecessary columns, one-hot encoding the 'Study_area' column, and splitting the data into features (X) and the target variable (y = 'AGCd_tC_per_ha_chave2014') across training and testing sets. Linear Regression, Decision Tree Regressor, XGBoost, Random Forest, Ensemble (StackingRegressor), and Artificial Neural Network (ANN) models are then trained and evaluated, and their performance is compared using Mean Squared Error (MSE) and R-squared (R2).

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

# 1. Drop the 'Plot' column from panama_df_cleaned
panama_df_processed = panama_df_cleaned.drop(columns=['Plot'])

# 2. Perform one-hot encoding on the 'Study_area' column
panama_df_processed = pd.get_dummies(panama_df_processed, columns=['Study_area'], drop_first=True)

# Drop the AGB columns: the target is proportional to them (AGC = 0.47 * AGB in this data), so keeping them would leak the target
panama_df_processed = panama_df_processed.drop(columns=['AGBd_kg_per_ha', 'AGBd_t_per_ha_chave2014'])

print("--- Panama Processed DataFrame Info After One-Hot Encoding and Column Drops ---")
panama_df_processed.info()
print("\n--- Panama Processed DataFrame Head ---")
display(panama_df_processed.head())

# 3. Define the target variable y
y_panama = panama_df_processed['AGCd_tC_per_ha_chave2014']

# 4. Define the features X
X_panama = panama_df_processed.drop(columns=['AGCd_tC_per_ha_chave2014'])

# 5. Split the X and y data into training and testing sets
X_train_panama, X_test_panama, y_train_panama, y_test_panama = train_test_split(X_panama, y_panama, test_size=0.2, random_state=42)

print("\n--- Features (X_panama) Head ---")
display(X_panama.head())
print("\n--- Target (y_panama) Head ---")
display(y_panama.head())

print(f"\nShape of X_train_panama: {X_train_panama.shape}")
print(f"Shape of X_test_panama: {X_test_panama.shape}")
print(f"Shape of y_train_panama: {y_train_panama.shape}")
print(f"Shape of y_test_panama: {y_test_panama.shape}")

--- Panama Processed DataFrame Info After One-Hot Encoding and Column Drops ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   VH                        41 non-null     float64
 1   VH_VV_ratio               41 non-null     float64
 2   VV                        41 non-null     float64
 3   VV_minus_VH               41 non-null     float64
 4   B11                       41 non-null     float64
 5   B12                       41 non-null     float64
 6   B2                        41 non-null     float64
 7   B3                        41 non-null     float64
 8   B4                        41 non-null     float64
 9   B5                        41 non-null     float64
 10  B6                        41 non-null     float64
 11  B7                        41 non-null     float64
 12  B8                        41 non-null     

Unnamed: 0,VH,VH_VV_ratio,VV,VV_minus_VH,B11,B12,B2,B3,B4,B5,...,B7,B8,EVI,NDRE,NDVI,NDWI,Latitude,Longitude,AGCd_tC_per_ha_chave2014,Study_area_ParB
0,-18.253788,1.137714,-16.044262,2.209525,0.1109,0.048086,0.0335,0.04252,0.026076,0.073275,...,0.292029,0.2874,0.480669,0.563478,0.819242,-0.714076,8.686865,-78.597537,149.232097,False
1,-15.488933,1.755239,-8.8244,6.664533,0.123118,0.058617,0.0507,0.066267,0.051657,0.083986,...,0.324343,0.324086,0.58026,0.551226,0.703468,-0.623022,8.978688,-79.049684,127.858166,False
2,-15.187275,1.24389,-12.209504,2.977771,0.12258,0.053687,0.042844,0.05975,0.03725,0.087833,...,0.33292,0.32624,0.570436,0.526082,0.744355,-0.65724,8.993518,-79.049947,78.293856,False
3,-14.310285,2.128691,-6.722575,7.58771,0.113028,0.045347,0.034851,0.042674,0.02774,0.073038,...,0.340473,0.340171,0.569494,0.638195,0.832,-0.743986,9.000275,-79.101021,131.974753,False
4,-18.006457,2.452954,-7.340724,10.665733,0.1305,0.05685,0.034646,0.049655,0.03435,0.0817,...,0.323433,0.309956,0.534617,0.547772,0.771619,-0.677506,9.061109,-79.083997,101.180077,False



--- Features (X_panama) Head ---


Unnamed: 0,VH,VH_VV_ratio,VV,VV_minus_VH,B11,B12,B2,B3,B4,B5,B6,B7,B8,EVI,NDRE,NDVI,NDWI,Latitude,Longitude,Study_area_ParB
0,-18.253788,1.137714,-16.044262,2.209525,0.1109,0.048086,0.0335,0.04252,0.026076,0.073275,0.23835,0.292029,0.2874,0.480669,0.563478,0.819242,-0.714076,8.686865,-78.597537,False
1,-15.488933,1.755239,-8.8244,6.664533,0.123118,0.058617,0.0507,0.066267,0.051657,0.083986,0.254211,0.324343,0.324086,0.58026,0.551226,0.703468,-0.623022,8.978688,-79.049684,False
2,-15.187275,1.24389,-12.209504,2.977771,0.12258,0.053687,0.042844,0.05975,0.03725,0.087833,0.27478,0.33292,0.32624,0.570436,0.526082,0.744355,-0.65724,8.993518,-79.049947,False
3,-14.310285,2.128691,-6.722575,7.58771,0.113028,0.045347,0.034851,0.042674,0.02774,0.073038,0.26917,0.340473,0.340171,0.569494,0.638195,0.832,-0.743986,9.000275,-79.101021,False
4,-18.006457,2.452954,-7.340724,10.665733,0.1305,0.05685,0.034646,0.049655,0.03435,0.0817,0.254414,0.323433,0.309956,0.534617,0.547772,0.771619,-0.677506,9.061109,-79.083997,False



--- Target (y_panama) Head ---


Unnamed: 0,AGCd_tC_per_ha_chave2014
0,149.232097
1,127.858166
2,78.293856
3,131.974753
4,101.180077



Shape of X_train_panama: (32, 20)
Shape of X_test_panama: (9, 20)
Shape of y_train_panama: (32,)
Shape of y_test_panama: (9,)
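
With only 41 plots, a single 80/20 split leaves 9 test samples, so the reported metrics depend heavily on which plots land in the test set. One common alternative is repeated K-fold cross-validation, sketched below on synthetic data of the same shape (41 x 20). The synthetic `X` and `y`, the choice of 5 folds and 10 repeats, and the Random Forest settings are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in with the same shape as X_panama / y_panama
rng = np.random.default_rng(42)
X = rng.normal(size=(41, 20))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=41)

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn returns negated MSE so that higher is always better
neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
mse = -neg_mse

print(f"Folds evaluated: {mse.size}")
print(f"MSE mean: {mse.mean():.2f}, std: {mse.std():.2f}")
```

Reporting the mean and spread across folds gives a more stable picture than a single split, at the cost of fitting the model many times.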


### Train and Evaluate Linear Regression Model
An Ordinary Least Squares (OLS) Linear Regression model was trained on the prepared Panama training data, and its performance was evaluated on the test set using MSE and R2 scores.



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Linear Regression Model for Panama ---")
# Instantiate Linear Regression model
linear_model_panama = LinearRegression()

# Train the model
linear_model_panama.fit(X_train_panama, y_train_panama)

# Make predictions
y_pred_lr_panama = linear_model_panama.predict(X_test_panama)

# Evaluate the model
mse_lr_panama = mean_squared_error(y_test_panama, y_pred_lr_panama)
r2_lr_panama = r2_score(y_test_panama, y_pred_lr_panama)

print(f"Linear Regression Model Performance for Panama:\n")
print(f"Mean Squared Error (MSE): {mse_lr_panama:.2f}")
print(f"R-squared (R2): {r2_lr_panama:.2f}\n")

--- Training Linear Regression Model for Panama ---
Linear Regression Model Performance for Panama:

Mean Squared Error (MSE): 4664.09
R-squared (R2): -4.88
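
Each model in this section repeats the same fit, predict, and score steps. A small helper can remove that duplication and add a mean-predicting baseline for context. This is a sketch on synthetic data; `evaluate_model` is a hypothetical helper, not a function defined in the notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model`, predict on the test set, and return MSE, RMSE and R2."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = float(mean_squared_error(y_test, y_pred))
    return {"MSE": mse, "RMSE": float(np.sqrt(mse)), "R2": float(r2_score(y_test, y_pred))}


# Synthetic stand-in data (41 samples, 20 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(41, 20))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=41)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

results = {
    "Baseline (train mean)": evaluate_model(DummyRegressor(strategy="mean"), X_tr, y_tr, X_te, y_te),
    "Linear Regression": evaluate_model(LinearRegression(), X_tr, y_tr, X_te, y_te),
}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The baseline's test R2 can never exceed 0, so a model whose R2 is not clearly above it has learned little beyond the mean of the training targets.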



### Train and Evaluate Decision Tree Regressor Model
A Decision Tree Regressor model was trained on the prepared Panama training data, and its performance was evaluated on the test set using MSE and R2 scores, following the pattern established for the other models.



In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Decision Tree Regressor Model for Panama ---")
# Instantiate Decision Tree Regressor model
dt_model_panama = DecisionTreeRegressor(random_state=42)

# Train the model
dt_model_panama.fit(X_train_panama, y_train_panama)

# Make predictions
y_pred_dt_panama = dt_model_panama.predict(X_test_panama)

# Evaluate the model
mse_dt_panama = mean_squared_error(y_test_panama, y_pred_dt_panama)
r2_dt_panama = r2_score(y_test_panama, y_pred_dt_panama)

print(f"Decision Tree Regressor Model Performance for Panama:\n")
print(f"Mean Squared Error (MSE): {mse_dt_panama:.2f}")
print(f"R-squared (R2): {r2_dt_panama:.2f}\n")

--- Training Decision Tree Regressor Model for Panama ---
Decision Tree Regressor Model Performance for Panama:

Mean Squared Error (MSE): 785.92
R-squared (R2): 0.01
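
An unconstrained decision tree can memorise a training set of this size, so comparing training and test scores helps reveal overfitting. The sketch below uses synthetic data; `min_samples_leaf=5` is an arbitrary illustrative value, not a tuned one.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (41 samples, 20 features)
rng = np.random.default_rng(1)
X = rng.normal(size=(41, 20))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=41)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

configs = {"unconstrained": {}, "min_samples_leaf=5": {"min_samples_leaf": 5}}
scores = {}
for label, params in configs.items():
    tree = DecisionTreeRegressor(random_state=42, **params).fit(X_tr, y_tr)
    # Store (train R2, test R2) for each configuration
    scores[label] = (r2_score(y_tr, tree.predict(X_tr)), r2_score(y_te, tree.predict(X_te)))
    print(f"{label}: train R2 = {scores[label][0]:.2f}, test R2 = {scores[label][1]:.2f}")
```

A training R2 of 1.0 alongside a much lower test R2 indicates that the tree is fitting noise, which constraints such as `min_samples_leaf` or `max_depth` are designed to limit.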



### Train and Evaluate XGBoost Model

An XGBoost Regressor model was trained on the prepared Panama training data, and its performance was evaluated on the test set using MSE and R2 scores, following the pattern established for the other models.



In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training XGBoost Regressor Model for Panama ---")
# Instantiate an XGBRegressor model
xgb_model_panama = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42
)

# Train the XGBoost model
xgb_model_panama.fit(X_train_panama, y_train_panama)

# Make predictions on the X_test_panama dataset
y_pred_xgb_panama = xgb_model_panama.predict(X_test_panama)

# Calculate and print the evaluation metrics
mse_xgb_panama = mean_squared_error(y_test_panama, y_pred_xgb_panama)
r2_xgb_panama = r2_score(y_test_panama, y_pred_xgb_panama)

print(f"XGBoost Model Performance for Panama:\n")
print(f"Mean Squared Error (MSE): {mse_xgb_panama:.2f}")
print(f"R-squared (R2): {r2_xgb_panama:.2f}")

--- Training XGBoost Regressor Model for Panama ---
XGBoost Model Performance for Panama:

Mean Squared Error (MSE): 1493.00
R-squared (R2): -0.88


### Train and Evaluate Random Forest Model
A Random Forest Regressor model was trained on the prepared Panama training data, and its performance was evaluated on the test set using MSE and R2 scores, following the pattern established for the other models.



In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Random Forest Regressor Model for Panama ---")
# Instantiate a RandomForestRegressor model
rf_model_panama = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the Random Forest model
rf_model_panama.fit(X_train_panama, y_train_panama)

# Make predictions on the X_test_panama dataset
y_pred_rf_panama = rf_model_panama.predict(X_test_panama)

# Calculate and print the evaluation metrics
mse_rf_panama = mean_squared_error(y_test_panama, y_pred_rf_panama)
r2_rf_panama = r2_score(y_test_panama, y_pred_rf_panama)

print(f"Random Forest Model Performance for Panama:\n")
print(f"Mean Squared Error (MSE): {mse_rf_panama:.2f}")
print(f"R-squared (R2): {r2_rf_panama:.2f}")

--- Training Random Forest Regressor Model for Panama ---
Random Forest Model Performance for Panama:

Mean Squared Error (MSE): 1078.76
R-squared (R2): -0.36
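
Random forests offer an out-of-bag (OOB) estimate that scores each sample using only the trees that did not see it during bootstrapping. This gives a validation signal without giving up any of the scarce plots to a test set. The sketch below runs on synthetic data, with `n_estimators=200` as an assumed setting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (41 samples, 20 features)
rng = np.random.default_rng(7)
X = rng.normal(size=(41, 20))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=41)

# oob_score=True requires bootstrap sampling, which is the default
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print(f"OOB R2: {rf.oob_score_:.2f}")
# oob_prediction_ holds the per-sample out-of-bag predictions
print("OOB predictions shape:", rf.oob_prediction_.shape)
```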


### Train and Evaluate Ensemble Model
An Ensemble model (StackingRegressor) was trained for the Panama data using XGBoost and Random Forest as base estimators and Linear Regression as the final estimator, and its performance was evaluated using MSE and R2 scores.



In [None]:
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Ensemble Model for Panama ---")

# Define base estimators (StackingRegressor clones and refits them, so the earlier fits are not reused)
estimators_panama = [
    ('xgb', xgb_model_panama),
    ('rf', rf_model_panama)
]

# Instantiate a StackingRegressor
ensemble_model_panama = StackingRegressor(
    estimators=estimators_panama,
    final_estimator=LinearRegression(),
    cv=5 # Using 5-fold cross-validation for stacking
)

# Train the StackingRegressor
ensemble_model_panama.fit(X_train_panama, y_train_panama)

# Make predictions on the X_test_panama dataset
y_pred_ensemble_panama = ensemble_model_panama.predict(X_test_panama)

# Calculate and print the evaluation metrics
mse_ensemble_panama = mean_squared_error(y_test_panama, y_pred_ensemble_panama)
r2_ensemble_panama = r2_score(y_test_panama, y_pred_ensemble_panama)

print(f"Ensemble Model Performance for Panama (StackingRegressor with LinearRegression):\n")
print(f"Mean Squared Error (MSE): {mse_ensemble_panama:.2f}")
print(f"R-squared (R2): {r2_ensemble_panama:.2f}")

--- Training Ensemble Model for Panama ---
Ensemble Model Performance for Panama (StackingRegressor with LinearRegression):

Mean Squared Error (MSE): 1159.86
R-squared (R2): -0.46
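
Note that `StackingRegressor.fit` clones and refits its base estimators, both for the internal cross-validation and for the final fit, so models fitted beforehand are not reused. The sketch below illustrates this on synthetic data, using a Random Forest and a Decision Tree as stand-in base estimators to avoid an XGBoost dependency.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (32 samples, 20 features)
rng = np.random.default_rng(3)
X = rng.normal(size=(32, 20))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=32)

# Base estimators are deliberately left unfitted
rf = RandomForestRegressor(n_estimators=20, random_state=42)
dt = DecisionTreeRegressor(max_depth=3, random_state=42)

stack = StackingRegressor(
    estimators=[("rf", rf), ("dt", dt)],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X, y)

# The fitted copies live in stack.estimators_; the originals are untouched
fitted_rf = stack.estimators_[0]
print("Same object as the one passed in:", fitted_rf is rf)
print("Original rf was fitted:", hasattr(rf, "estimators_"))
print("Cloned rf was fitted:", hasattr(fitted_rf, "estimators_"))
```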


### Train and Evaluate Artificial Neural Network (ANN) Model
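An Artificial Neural Network (ANN) model with two hidden layers (64 and 32 ReLU units) was trained on the standardized Panama training features for 50 epochs using the Adam optimizer, and its performance was evaluated on the test set using MSE and R2 scores.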



In [None]:
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Artificial Neural Network (ANN) Model for Panama ---")
# 2. Scale the input features X_train and X_test
scaler_panama = StandardScaler()
X_train_panama_scaled = scaler_panama.fit_transform(X_train_panama)
X_test_panama_scaled = scaler_panama.transform(X_test_panama)

# 3. Create a Sequential ANN model, using Input layer for clarity and best practice
ann_model_panama = Sequential([
    Input(shape=(X_train_panama_scaled.shape[1],)), # Explicit Input layer
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1)  # Output layer for regression, no activation
])

# 4. Compile the model
ann_model_panama.compile(optimizer='adam', loss='mean_squared_error')

# 5. Train the ANN model
history_panama = ann_model_panama.fit(X_train_panama_scaled, y_train_panama, epochs=50, batch_size=2, verbose=0)

# 6. Make predictions on the scaled X_test dataset
y_pred_ann_panama = ann_model_panama.predict(X_test_panama_scaled).flatten()

# 7. Calculate and print the evaluation metrics
mse_ann_panama = mean_squared_error(y_test_panama, y_pred_ann_panama)
r2_ann_panama = r2_score(y_test_panama, y_pred_ann_panama)

print(f"ANN Model Performance for Panama:\n")
print(f"Mean Squared Error (MSE): {mse_ann_panama:.2f}")
print(f"R-squared (R2): {r2_ann_panama:.2f}")

--- Training Artificial Neural Network (ANN) Model for Panama ---
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 64ms/step
ANN Model Performance for Panama:

Mean Squared Error (MSE): 1766.11
R-squared (R2): -1.22


In [None]:
import logging
import os

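# Suppress TensorFlow log output.
# Note: TF_CPP_MIN_LOG_LEVEL only affects the C++ logs if it is set before TensorFlow is first imported.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Hide INFO, WARNING and ERROR messages from the C++ backend
logging.getLogger('tensorflow').setLevel(logging.ERROR)  # Python-side logger: only show errors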

from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Artificial Neural Network (ANN) Model for Panama ---")
# 2. Scale the input features X_train and X_test
scaler_panama = StandardScaler()
X_train_panama_scaled = scaler_panama.fit_transform(X_train_panama)
X_test_panama_scaled = scaler_panama.transform(X_test_panama)

# 3. Create a Sequential ANN model, using Input layer for clarity and best practice
ann_model_panama = Sequential([
    Input(shape=(X_train_panama_scaled.shape[1],)), # Explicit Input layer
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1)  # Output layer for regression, no activation
])

# 4. Compile the model
ann_model_panama.compile(optimizer='adam', loss='mean_squared_error')

# 5. Train the ANN model
history_panama = ann_model_panama.fit(X_train_panama_scaled, y_train_panama, epochs=50, batch_size=2, verbose=0)

# 6. Make predictions on the scaled X_test dataset
y_pred_ann_panama = ann_model_panama.predict(X_test_panama_scaled).flatten()

# 7. Calculate and print the evaluation metrics
mse_ann_panama = mean_squared_error(y_test_panama, y_pred_ann_panama)
r2_ann_panama = r2_score(y_test_panama, y_pred_ann_panama)

print(f"ANN Model Performance for Panama:\n")
print(f"Mean Squared Error (MSE): {mse_ann_panama:.2f}")
print(f"R-squared (R2): {r2_ann_panama:.2f}")


--- Training Artificial Neural Network (ANN) Model for Panama ---
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 63ms/step
ANN Model Performance for Panama:

Mean Squared Error (MSE): 1474.84
R-squared (R2): -0.86


## Summary: Panama Data

The best-performing model among those tested on the Panama dataset is the **Decision Tree Regressor**, with an R-squared (${R}^{2}$) of 0.01 and a Mean Squared Error (MSE) of 785.92. An ${R}^{2}$ this close to zero means the model does essentially no better than predicting the mean of the test targets, so it is still a poor fit. It is simply the only model that did not yield a negative ${R}^{2}$, that is, the only one that did not perform worse than that mean baseline.
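
To make the negative values concrete: ${R}^{2} = 1 - SS_{res}/SS_{tot}$, so a model that always predicts the mean of the test targets scores exactly 0, and any model with larger squared error than that baseline scores below 0. The following is a minimal sketch using invented numbers (not project data) to illustrate both cases.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy target values, invented for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Baseline: always predict the mean of the observed values
y_mean = np.full_like(y_true, y_true.mean())

# A model whose predictions run in the wrong direction
y_bad = y_true[::-1]

r2_mean = r2_score(y_true, y_mean)  # SS_res equals SS_tot, so R2 is 0
r2_bad = r2_score(y_true, y_bad)    # SS_res exceeds SS_tot, so R2 is negative

print(f"Mean baseline R2: {r2_mean:.2f}")
print(f"Reversed predictions R2: {r2_bad:.2f}")
```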

### Data Analysis Key Findings
*   **Data Preparation:**
    *   The `panama_df_cleaned` was further processed by dropping the 'Plot' identifier column and other redundant AGB columns (`AGBd_kg_per_ha`, `AGBd_t_per_ha_chave2014`).
    *   One-hot encoding was applied to the 'Study_area' column, resulting in `Study_area_ParB` (a boolean column) as a new feature.
    *   The data was split into features (X_panama) and target (y_panama = 'AGCd_tC_per_ha_chave2014'), with `X_train_panama` having 32 samples and 20 features, and `X_test_panama` having 9 samples and 20 features.
*   **Model Performance for Panama Data:**
    *   **Linear Regression:** MSE: 4664.09, ${R}^{2}$: -4.88
    *   **Decision Tree Regressor:** MSE: 785.92, ${R}^{2}$: 0.01
    *   **XGBoost Regressor:** MSE: 1493.00, ${R}^{2}$: -0.88
    *   **Random Forest Regressor:** MSE: 1078.76, ${R}^{2}$: -0.36
    *   **Ensemble Model (StackingRegressor):** MSE: 1159.86, ${R}^{2}$: -0.46
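    *   **Artificial Neural Network (ANN):** MSE: 1474.84, ${R}^{2}$: -0.86 in the latest run shown above (an earlier run gave MSE: 1766.11, ${R}^{2}$: -1.22; the network is not seeded, so results vary between runs)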

### Insights or Next Steps
*   As with the Colombia dataset, all models performed poorly on the Panama dataset, with most yielding negative ${R}^{2}$ values. The Decision Tree Regressor performed marginally better, with a near-zero ${R}^{2}$.
*   The small dataset size (32 training samples, 9 test samples) is likely a major contributing factor to the poor model performance and high variance in metrics.
*   **Next Steps:**
    *   **Increase Data Size:** The most critical step is to acquire more data samples for both training and testing to allow models to learn more robust patterns.
    *   **Feature Engineering:** Explore more advanced feature engineering techniques to create predictors that better capture the relationships with Above-Ground Carbon Density (AGCd).
    *   **Hyperparameter Tuning:** Conduct thorough hyperparameter tuning for all models, especially for tree-based models like Decision Tree, XGBoost, and Random Forest, which can be sensitive to parameter choices.
    *   **Cross-Validation:** Use K-Fold cross-validation (ideally repeated) to obtain a more reliable estimate of model performance. With only 9 test samples, the metrics from a single train/test split are highly variable.
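
The cross-validation step above could look roughly like the following. This is a sketch only: it uses synthetic data from `make_regression` as a stand-in with the same shape as the Panama data (41 plots, 20 features), and the fold and repeat counts are arbitrary choices rather than project settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in with the same shape as the Panama data (41 plots, 20 features)
X_demo, y_demo = make_regression(n_samples=41, n_features=20, noise=10.0, random_state=42)

# 5 folds repeated 5 times gives 25 held-out scores instead of one
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn returns negated MSE so that higher is always better
neg_mse = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error")
mse_scores = -neg_mse

print(f"MSE over {len(mse_scores)} folds: mean={mse_scores.mean():.2f}, std={mse_scores.std():.2f}")
```

Reporting the spread across folds alongside the mean shows how much a single split could mislead at this sample size.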
  

# **Global Model**

To create a global model, the data from Colombia and Panama must first be combined. This involves:

1.  Adding a 'Country' column to each cleaned DataFrame (`colombia_df_cleaned` and `panama_df_processed`) to distinguish their origin.
2.  Ensuring a consistent set of features across both DataFrames by dropping columns that are not present in both or are redundant (e.g., 'B8A' from Colombia, 'Study_area_ParB' from Panama).
3.  Merging these processed DataFrames into a single `global_df`.

## Prepare Global Data

Add a 'Country' column to the cleaned Colombia and Panama DataFrames, ensure consistent feature sets, and then merge them into a single global DataFrame.


In [None]:
import pandas as pd

# 1. Add 'Country' column to colombia_df_cleaned
colombia_df_cleaned['Country'] = 'Colombia'

# 2. Add 'Country' column to panama_df_processed
panama_df_processed['Country'] = 'Panama'

# 3. Drop 'B8A' column from colombia_df_cleaned to match panama_df_processed features
# First, check if 'B8A' exists in colombia_df_cleaned
if 'B8A' in colombia_df_cleaned.columns:
    colombia_df_cleaned = colombia_df_cleaned.drop(columns=['B8A'])
    print("Dropped 'B8A' from colombia_df_cleaned.")
else:
    print("'B8A' not found in colombia_df_cleaned, no drop needed.")

# 4. Drop 'Study_area_ParB' column from panama_df_processed to match colombia_df_cleaned features
# First, check if 'Study_area_ParB' exists in panama_df_processed
if 'Study_area_ParB' in panama_df_processed.columns:
    panama_df_processed = panama_df_processed.drop(columns=['Study_area_ParB'])
    print("Dropped 'Study_area_ParB' from panama_df_processed.")
else:
    print("'Study_area_ParB' not found in panama_df_processed, no drop needed.")

# Ensure column order is the same before concatenation
# Identify common columns
common_cols = list(set(colombia_df_cleaned.columns) & set(panama_df_processed.columns))

# Reorder dataframes to have common columns in the same order
colombia_df_cleaned = colombia_df_cleaned[common_cols + [col for col in colombia_df_cleaned.columns if col not in common_cols]]
panama_df_processed = panama_df_processed[common_cols + [col for col in panama_df_processed.columns if col not in common_cols]]

# Check if column sets are now identical for concatenation
if not colombia_df_cleaned.columns.equals(panama_df_processed.columns):
    print("Warning: Columns are not identical after dropping. Re-aligning...")
    # Align columns explicitly for robustness
    all_columns = sorted(list(set(colombia_df_cleaned.columns) | set(panama_df_processed.columns)))
    colombia_df_aligned = colombia_df_cleaned.reindex(columns=all_columns, fill_value=0) # fill_value can be adjusted
    panama_df_aligned = panama_df_processed.reindex(columns=all_columns, fill_value=0) # fill_value can be adjusted
else:
    colombia_df_aligned = colombia_df_cleaned
    panama_df_aligned = panama_df_processed
    print("Columns are aligned for concatenation.")

# 5. Concatenate the modified DataFrames into a single global DataFrame
global_df = pd.concat([colombia_df_aligned, panama_df_aligned], ignore_index=True)

# 6. Print the information about the global_df
print("\n--- Global DataFrame Info ---")
global_df.info()

# 7. Display the first few rows of the global_df
print("\n--- Global DataFrame Head ---")
display(global_df.head())

print("\n--- Global DataFrame Tail ---")
display(global_df.tail())

Dropped 'B8A' from colombia_df_cleaned.
Dropped 'Study_area_ParB' from panama_df_processed.
Columns are aligned for concatenation.

--- Global DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   B7                        55 non-null     float64
 1   B12                       55 non-null     float64
 2   EVI                       55 non-null     float64
 3   Country                   55 non-null     object 
 4   AGCd_tC_per_ha_chave2014  55 non-null     float64
 5   Longitude                 55 non-null     float64
 6   B2                        55 non-null     float64
 7   B3                        55 non-null     float64
 8   NDWI                      55 non-null     float64
 9   Latitude                  55 non-null     float64
 10  VV                        55 non-null     float64
 11  NDVI           

Unnamed: 0,B7,B12,EVI,Country,AGCd_tC_per_ha_chave2014,Longitude,B2,B3,NDWI,Latitude,...,NDVI,NDRE,B11,B4,VH_VV_ratio,B8,B6,VV_minus_VH,VH,B5
0,0.2951,0.0527,0.530022,Colombia,50.735153,-75.622394,0.0358,0.05595,-0.674684,9.410258,...,0.779148,0.53183,0.11675,0.0354,1.765943,0.303,0.23935,5.910434,-13.92754,0.09145
1,0.282,0.05075,0.48089,Colombia,35.813748,-75.655378,0.0308,0.04745,-0.688529,9.405389,...,0.786892,0.537678,0.11595,0.03025,1.739304,0.2717,0.2252,5.946622,-14.336914,0.08025
2,0.25535,0.0571,0.426469,Colombia,68.368964,-75.641917,0.0379,0.05345,-0.643662,9.396867,...,0.71948,0.496578,0.12175,0.03935,1.63809,0.2451,0.20875,5.647963,-14.330285,0.0808
3,0.3163,0.03985,0.51885,Colombia,51.302963,-75.632144,0.03235,0.0474,-0.698307,9.401303,...,0.788571,0.579528,0.10225,0.0296,1.960792,0.3073,0.24725,6.556708,-14.412853,0.07845
4,0.28715,0.05345,0.483964,Colombia,87.627792,-75.656792,0.0361,0.0538,-0.663404,9.402175,...,0.777985,0.550188,0.12435,0.03385,1.762286,0.2945,0.23125,6.09055,-13.872831,0.08445



--- Global DataFrame Tail ---


Unnamed: 0,B7,B12,EVI,Country,AGCd_tC_per_ha_chave2014,Longitude,B2,B3,NDWI,Latitude,...,NDVI,NDRE,B11,B4,VH_VV_ratio,B8,B6,VV_minus_VH,VH,B5
50,0.30948,0.04775,0.563443,Panama,109.055341,-80.249312,0.026642,0.042097,-0.742137,8.313478,...,0.840303,0.623512,0.117309,0.026836,1.801107,0.322891,0.246617,6.223351,-13.99179,0.073515
51,0.297502,0.045087,0.508451,Panama,75.211052,-80.27516,0.027465,0.042856,-0.738522,8.303792,...,0.838763,0.598502,0.110675,0.026829,1.970301,0.300333,0.246438,7.672155,-15.579137,0.074585
52,0.2914,0.051908,0.47434,Panama,38.960034,-80.327194,0.02723,0.042357,-0.722035,7.933749,...,0.818298,0.552193,0.1143,0.02725,1.655803,0.26765,0.236309,5.442194,-13.740718,0.074431
53,0.302889,0.044941,0.510068,Panama,42.408553,-80.325247,0.026407,0.042246,-0.747462,7.910432,...,0.849624,0.606724,0.10305,0.019168,1.589142,0.291143,0.244045,4.657925,-12.564214,0.066628
54,0.307143,0.038317,0.545974,Panama,87.419459,-80.339771,0.026087,0.041467,-0.739161,8.308273,...,0.830193,0.621994,0.099288,0.0254,3.293236,0.31612,0.24585,12.011482,-17.249268,0.06625
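
One reproducibility caveat: the cell above builds `common_cols` from a Python `set`, and the iteration order of a set of strings can change between interpreter sessions. That can reorder the feature columns, which in turn can shift results for tree-based models even with a fixed `random_state`. A minimal sketch of an order-preserving alternative follows; the helper name `align_common_columns` and the toy frames are hypothetical, not part of the notebook.

```python
import pandas as pd

def align_common_columns(df_a, df_b):
    """Restrict both frames to their shared columns, in df_a's column order."""
    shared = [col for col in df_a.columns if col in df_b.columns]
    return df_a[shared], df_b[shared]

# Toy frames (invented values) with one unshared column each
a = pd.DataFrame({"B2": [0.03], "B8A": [0.30], "NDVI": [0.78], "Country": ["Colombia"]})
b = pd.DataFrame({"NDVI": [0.84], "Study_area_ParB": [True], "B2": [0.02], "Country": ["Panama"]})

a_aligned, b_aligned = align_common_columns(a, b)
merged = pd.concat([a_aligned, b_aligned], ignore_index=True)
print(list(merged.columns))
```

Because the shared columns follow the first frame's order, the merged frame has the same layout on every run.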


## Prepare Global Data for ML

Define features (X_global) and target (y_global) from the merged global DataFrame, handle any categorical variables, and split the data into training and testing sets.


In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

# 1. Define the target variable y_global
y_global = global_df['AGCd_tC_per_ha_chave2014']

# 2. Define the features X_global
X_global = global_df.drop(columns=['AGCd_tC_per_ha_chave2014'])

# 3. Apply one-hot encoding to the 'Country' column within the X_global DataFrame
X_global = pd.get_dummies(X_global, columns=['Country'], drop_first=True)

# 4. Split the X_global and y_global datasets into training and testing sets
X_train_global, X_test_global, y_train_global, y_test_global = train_test_split(X_global, y_global, test_size=0.2, random_state=42)

print("--- Features (X_global) Head ---")
display(X_global.head())
print("\n--- Target (y_global) Head ---")
display(y_global.head())

print(f"\nShape of X_train_global: {X_train_global.shape}")
print(f"Shape of X_test_global: {X_test_global.shape}")
print(f"Shape of y_train_global: {y_train_global.shape}")
print(f"Shape of y_test_global: {y_test_global.shape}")

--- Features (X_global) Head ---


Unnamed: 0,B7,B12,EVI,Longitude,B2,B3,NDWI,Latitude,VV,NDVI,NDRE,B11,B4,VH_VV_ratio,B8,B6,VV_minus_VH,VH,B5,Country_Panama
0,0.2951,0.0527,0.530022,-75.622394,0.0358,0.05595,-0.674684,9.410258,-7.911034,0.779148,0.53183,0.11675,0.0354,1.765943,0.303,0.23935,5.910434,-13.92754,0.09145,False
1,0.282,0.05075,0.48089,-75.655378,0.0308,0.04745,-0.688529,9.405389,-8.359852,0.786892,0.537678,0.11595,0.03025,1.739304,0.2717,0.2252,5.946622,-14.336914,0.08025,False
2,0.25535,0.0571,0.426469,-75.641917,0.0379,0.05345,-0.643662,9.396867,-8.933878,0.71948,0.496578,0.12175,0.03935,1.63809,0.2451,0.20875,5.647963,-14.330285,0.0808,False
3,0.3163,0.03985,0.51885,-75.632144,0.03235,0.0474,-0.698307,9.401303,-7.505751,0.788571,0.579528,0.10225,0.0296,1.960792,0.3073,0.24725,6.556708,-14.412853,0.07845,False
4,0.28715,0.05345,0.483964,-75.656792,0.0361,0.0538,-0.663404,9.402175,-7.95152,0.777985,0.550188,0.12435,0.03385,1.762286,0.2945,0.23125,6.09055,-13.872831,0.08445,False



--- Target (y_global) Head ---


Unnamed: 0,AGCd_tC_per_ha_chave2014
0,50.735153
1,35.813748
2,68.368964
3,51.302963
4,87.627792



Shape of X_train_global: (44, 20)
Shape of X_test_global: (11, 20)
Shape of y_train_global: (44,)
Shape of y_test_global: (11,)
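
With only 11 test rows, a purely random split may leave one country under-represented in the test set. A possible refinement, sketched below on a toy frame, is to pass the country label to the `stratify` argument of `train_test_split`. The row counts per country and the single `NDVI` feature are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy stand-in for global_df: 55 rows with an uneven country mix (illustrative counts)
demo_df = pd.DataFrame({
    "NDVI": rng.uniform(0.7, 0.9, size=55),
    "Country": ["Colombia"] * 14 + ["Panama"] * 41,
    "AGCd_tC_per_ha_chave2014": rng.uniform(20, 120, size=55),
})

X_demo = pd.get_dummies(
    demo_df.drop(columns=["AGCd_tC_per_ha_chave2014"]),
    columns=["Country"],
    drop_first=True,
)
y_demo = demo_df["AGCd_tC_per_ha_chave2014"]

# stratify keeps the Colombia/Panama ratio roughly equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=demo_df["Country"]
)

print(X_te["Country_Panama"].value_counts())
```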


## 03. Train & Test ML Models

---

### Train and Evaluate Linear Regression Model (Global)

Train a Linear Regression model on the prepared global data and evaluate its performance.


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Linear Regression Model for Global Data ---")
# Instantiate Linear Regression model
linear_model_global = LinearRegression()

# Train the model
linear_model_global.fit(X_train_global, y_train_global)

# Make predictions
y_pred_lr_global = linear_model_global.predict(X_test_global)

# Evaluate the model
mse_lr_global = mean_squared_error(y_test_global, y_pred_lr_global)
r2_lr_global = r2_score(y_test_global, y_pred_lr_global)

print(f"Linear Regression Model Performance for Global Data:\n")
print(f"Mean Squared Error (MSE): {mse_lr_global:.2f}")
print(f"R-squared (R2): {r2_lr_global:.2f}\n")

--- Training Linear Regression Model for Global Data ---
Linear Regression Model Performance for Global Data:

Mean Squared Error (MSE): 2999.82
R-squared (R2): -1.03



### Train and Evaluate Decision Tree Regressor Model (Global)

Train a Decision Tree Regressor model on the prepared global data and evaluate its performance.


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Decision Tree Regressor Model for Global Data ---")
# Instantiate Decision Tree Regressor model
dt_model_global = DecisionTreeRegressor(random_state=42)

# Train the model
dt_model_global.fit(X_train_global, y_train_global)

# Make predictions
y_pred_dt_global = dt_model_global.predict(X_test_global)

# Evaluate the model
mse_dt_global = mean_squared_error(y_test_global, y_pred_dt_global)
r2_dt_global = r2_score(y_test_global, y_pred_dt_global)

print(f"Decision Tree Regressor Model Performance for Global Data:\n")
print(f"Mean Squared Error (MSE): {mse_dt_global:.2f}")
print(f"R-squared (R2): {r2_dt_global:.2f}\n")

--- Training Decision Tree Regressor Model for Global Data ---
Decision Tree Regressor Model Performance for Global Data:

Mean Squared Error (MSE): 2624.51
R-squared (R2): -0.78



### Train and Evaluate XGBoost Model (Global)

Train an XGBoost Regressor model on the prepared global data and evaluate its performance.

In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training XGBoost Regressor Model for Global Data ---")
# Instantiate an XGBRegressor model
xgb_model_global = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42
)

# Train the XGBoost model
xgb_model_global.fit(X_train_global, y_train_global)

# Make predictions on the X_test_global dataset
y_pred_xgb_global = xgb_model_global.predict(X_test_global)

# Calculate and print the evaluation metrics
mse_xgb_global = mean_squared_error(y_test_global, y_pred_xgb_global)
r2_xgb_global = r2_score(y_test_global, y_pred_xgb_global)

print(f"XGBoost Model Performance for Global Data:\n")
print(f"Mean Squared Error (MSE): {mse_xgb_global:.2f}")
print(f"R-squared (R2): {r2_xgb_global:.2f}")

--- Training XGBoost Regressor Model for Global Data ---
XGBoost Model Performance for Global Data:

Mean Squared Error (MSE): 1639.72
R-squared (R2): -0.11


### Train and Evaluate Random Forest Model (Global)

Train a Random Forest Regressor model on the prepared global data and evaluate its performance.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Random Forest Regressor Model for Global Data ---")
# Instantiate a RandomForestRegressor model
rf_model_global = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the Random Forest model
rf_model_global.fit(X_train_global, y_train_global)

# Make predictions on the X_test_global dataset
y_pred_rf_global = rf_model_global.predict(X_test_global)

# Calculate and print the evaluation metrics
mse_rf_global = mean_squared_error(y_test_global, y_pred_rf_global)
r2_rf_global = r2_score(y_test_global, y_pred_rf_global)

print(f"Random Forest Model Performance for Global Data:\n")
print(f"Mean Squared Error (MSE): {mse_rf_global:.2f}")
print(f"R-squared (R2): {r2_rf_global:.2f}")

--- Training Random Forest Regressor Model for Global Data ---
Random Forest Model Performance for Global Data:

Mean Squared Error (MSE): 1126.79
R-squared (R2): 0.24


### Train and Evaluate Ensemble Model (Global)

Train an ensemble model using the base models (XGBoost, Random Forest) on the prepared global data and evaluate its performance.

In [None]:
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Ensemble Model for Global Data ---")

# Define base estimators. StackingRegressor clones and refits them on the training data,
# so passing the already-fitted models only reuses their hyperparameters.
estimators_global = [
    ('xgb', xgb_model_global),
    ('rf', rf_model_global)
]

# Instantiate a StackingRegressor
ensemble_model_global = StackingRegressor(
    estimators=estimators_global,
    final_estimator=LinearRegression(),
    cv=5 # Using 5-fold cross-validation for stacking
)

# Train the StackingRegressor
ensemble_model_global.fit(X_train_global, y_train_global)

# Make predictions on the X_test_global dataset
y_pred_ensemble_global = ensemble_model_global.predict(X_test_global)

# Calculate and print the evaluation metrics
mse_ensemble_global = mean_squared_error(y_test_global, y_pred_ensemble_global)
r2_ensemble_global = r2_score(y_test_global, y_pred_ensemble_global)

print(f"Ensemble Model Performance for Global Data (StackingRegressor with LinearRegression):\n")
print(f"Mean Squared Error (MSE): {mse_ensemble_global:.2f}")
print(f"R-squared (R2): {r2_ensemble_global:.2f}")

--- Training Ensemble Model for Global Data ---
Ensemble Model Performance for Global Data (StackingRegressor with LinearRegression):

Mean Squared Error (MSE): 1431.92
R-squared (R2): 0.03
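
A note on how stacking treats its base models: with an integer `cv`, `StackingRegressor` fits fresh clones of the estimators it is given, so models fitted in earlier cells are not reused. The sketch below checks this on synthetic data. `GradientBoostingRegressor` is used as a stand-in for `XGBRegressor` purely to avoid an extra dependency, and all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=44, n_features=20, noise=10.0, random_state=42)

# Pre-fit one base model, as the notebook does before stacking
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)
gb_demo = GradientBoostingRegressor(random_state=42)  # stand-in for XGBRegressor

stack_demo = StackingRegressor(
    estimators=[("rf", rf_demo), ("gb", gb_demo)],
    final_estimator=LinearRegression(),
    cv=5,
)
stack_demo.fit(X_demo, y_demo)

# The fitted base models are stored in estimators_ as separate objects
refit_rf = stack_demo.estimators_[0]
print(refit_rf is rf_demo)
```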


### Train and Evaluate ANN Model (Global)

Train an Artificial Neural Network (ANN) model on the prepared global data, after standardizing the features with `StandardScaler`, and evaluate its performance.

In [None]:
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from sklearn.metrics import mean_squared_error, r2_score

print("--- Training Artificial Neural Network (ANN) Model for Global Data ---")
# 1. Scale the input features X_train_global and X_test_global
scaler_global = StandardScaler()
X_train_global_scaled = scaler_global.fit_transform(X_train_global)
X_test_global_scaled = scaler_global.transform(X_test_global)

# 2. Create a Sequential ANN model, using Input layer for clarity and best practice
ann_model_global = Sequential([
    Input(shape=(X_train_global_scaled.shape[1],)), # Explicit Input layer
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1)  # Output layer for regression, no activation
])

# 3. Compile the model
ann_model_global.compile(optimizer='adam', loss='mean_squared_error')

# 4. Train the ANN model
history_global = ann_model_global.fit(X_train_global_scaled, y_train_global, epochs=50, batch_size=2, verbose=0)

# 5. Make predictions on the scaled X_test_global dataset
y_pred_ann_global = ann_model_global.predict(X_test_global_scaled).flatten()

# 6. Calculate and print the evaluation metrics
mse_ann_global = mean_squared_error(y_test_global, y_pred_ann_global)
r2_ann_global = r2_score(y_test_global, y_pred_ann_global)

print(f"ANN Model Performance for Global Data:\n")
print(f"Mean Squared Error (MSE): {mse_ann_global:.2f}")
print(f"R-squared (R2): {r2_ann_global:.2f}")

--- Training Artificial Neural Network (ANN) Model for Global Data ---
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 69ms/step
ANN Model Performance for Global Data:

Mean Squared Error (MSE): 1976.28
R-squared (R2): -0.34
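
Because the network above is trained without a fixed seed, its metrics change from run to run. The sketch below shows the general idea of seeding a small neural network and bundling the scaler with it in one pipeline. It swaps in scikit-learn's `MLPRegressor` for the Keras model and uses synthetic data, so it illustrates the pattern rather than reproducing the notebook's model. In Keras, the analogous step would be calling `tf.keras.utils.set_random_seed` before building the model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=55, n_features=20, noise=10.0, random_state=42)

def fit_predict_once(seed):
    """Fit a scaler + small MLP pipeline with a fixed seed and return test predictions."""
    pipe = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=200, random_state=seed),
    )
    pipe.fit(X_demo[:44], y_demo[:44])   # first 44 rows act as the training set
    return pipe.predict(X_demo[44:])     # remaining 11 rows act as the test set

preds_a = fit_predict_once(seed=42)
preds_b = fit_predict_once(seed=42)

# Two runs with the same seed should give matching predictions
print(np.allclose(preds_a, preds_b))
```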


## Summary: Global Data

Based on the performance metrics (Mean Squared Error (MSE) and R-squared (${R}^{2}$)) across all models trained on the global dataset, the **Random Forest Regressor** showed the best performance, with an ${R}^{2}$ of **0.24** and an MSE of **1126.79**. This is a slight improvement over the individual country models, which largely exhibited negative ${R}^{2}$ values.

### Data Analysis Key Findings
*   **Data Preparation:**
    *   The Colombia and Panama datasets were merged into a single `global_df` after ensuring consistent features and adding a 'Country' column.
    *   The 'Country' column was one-hot encoded (`Country_Panama`), and the target variable (`AGCd_tC_per_ha_chave2014`) was separated from the features.
    *   The global data was split into training (44 samples, 20 features) and testing (11 samples, 20 features) sets.
*   **Model Performance for Global Data:**
    *   **Linear Regression:** MSE: 2999.82, ${R}^{2}$: -1.03
    *   **Decision Tree Regressor:** MSE: 2624.51, ${R}^{2}$: -0.78
    *   **XGBoost Regressor:** MSE: 1639.72, ${R}^{2}$: -0.11
    *   **Random Forest Regressor:** MSE: 1126.79, ${R}^{2}$: 0.24 (Best performing)
    *   **Ensemble Model (StackingRegressor):** MSE: 1431.92, ${R}^{2}$: 0.03
    *   **Artificial Neural Network (ANN):** MSE: 1976.28, ${R}^{2}$: -0.34

### Insights or Next Steps
*   While the Random Forest model showed the best ${R}^{2}$ score among the global models, an ${R}^{2}$ of 0.24 still indicates a poor fit: most of the variance in the target variable is not explained by the model.
*   The performance of the global model is generally better than that of the individual country models, which suggests that combining data, even if limited, can provide some benefit by increasing the sample size and potentially exposing more diverse patterns. The models were evaluated on different test sets, however, so this comparison is indicative rather than conclusive.


*   **Next Steps:**
    *   **Data Augmentation:** Explore methods to generate or acquire more data, as data scarcity remains the most significant limitation. This could include synthetic data generation or leveraging additional auxiliary datasets.
    *   **Advanced Feature Engineering:** Investigate the creation of more sophisticated features from the existing sensor data, potentially using domain knowledge specific to mangrove ecosystems (e.g., specific band ratios, texture features from SAR imagery).
    *   **Cross-Validation Strategy:** Given the small dataset, robust cross-validation techniques (e.g., K-fold stratified by country, or Leave-One-Out Cross-Validation, which is inexpensive at 55 samples) are crucial to obtain more reliable performance estimates and to detect overfitting.
    *   **Hyperparameter Optimization:** Conduct extensive hyperparameter tuning for the Random Forest and XGBoost models, as they showed the most promise. Grid search or randomized search with cross-validation could be employed.
    *   **Model Interpretability:** For the best-performing models, analyze feature importances to understand which variables contribute most to the predictions. This can provide insights for future data collection or feature engineering efforts.
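
As one possible starting point for the interpretability step, permutation importance measures how much a fitted model's held-out score drops when a single feature is shuffled. The sketch below runs on synthetic data with placeholder feature names (`feature_0` to `feature_4`), so the ranking it prints says nothing about the real predictors.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 55 samples, 5 features with placeholder names
X_arr, y_demo = make_regression(
    n_samples=55, n_features=5, n_informative=2, noise=5.0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and record the mean drop in R2
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=20, random_state=42)
ranking = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(ranking)
```

With a test set this small the importances are noisy, so the standard deviations in `result.importances_std` are worth inspecting alongside the means.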

In [40]:
import folium
from folium.plugins import Fullscreen
from branca.colormap import linear

# Ensure Latitude and Longitude are numeric
global_df['Latitude'] = pd.to_numeric(global_df['Latitude'], errors='coerce')
global_df['Longitude'] = pd.to_numeric(global_df['Longitude'], errors='coerce')
global_df['AGCd_tC_per_ha_chave2014'] = pd.to_numeric(global_df['AGCd_tC_per_ha_chave2014'], errors='coerce')

# Drop rows with NaN values in critical columns for mapping
global_df_cleaned = global_df.dropna(subset=['Latitude', 'Longitude', 'AGCd_tC_per_ha_chave2014'])

# Calculate the mean latitude and longitude to center the map
center_lat = global_df_cleaned['Latitude'].mean()
center_lon = global_df_cleaned['Longitude'].mean()

# Create a base map centered between the two countries
m = folium.Map(location=[center_lat, center_lon], zoom_start=6)

# Add fullscreen button
Fullscreen().add_to(m)

# Define a colormap for carbon density
# Adjust vmin and vmax based on your data's range
vmin = global_df_cleaned['AGCd_tC_per_ha_chave2014'].min()
vmax = global_df_cleaned['AGCd_tC_per_ha_chave2014'].max()

# Create a linear colormap (YlGnBu: yellow through green to blue)
colormap = linear.YlGnBu_09.scale(vmin=vmin, vmax=vmax)
colormap.caption = 'AGCd_tC_per_ha_chave2014 (Carbon Density)'

# Add colormap to map
m.add_child(colormap)

# Create feature groups for each country
colombia_group = folium.FeatureGroup(name='Colombia').add_to(m)
panama_group = folium.FeatureGroup(name='Panama').add_to(m)

# Add data points to the map
for idx, row in global_df_cleaned.iterrows():
    if pd.notnull(row['Latitude']) and pd.notnull(row['Longitude']):
        carbon_density = row['AGCd_tC_per_ha_chave2014']
        color = colormap(carbon_density)

        # Create a popup with relevant information
        popup_content = (f"<b>Country:</b> {row['Country']}<br>"
                         f"<b>Plot:</b> {row['Plot'] if 'Plot' in row else 'N/A'}<br>"
                         f"<b>Carbon Density:</b> {carbon_density:.2f} tC/ha<br>"
                         f"<b>Latitude:</b> {row['Latitude']:.2f}<br>"
                         f"<b>Longitude:</b> {row['Longitude']:.2f}")

        # Add a circle marker for each point
        folium.CircleMarker(
            location=[row['Latitude'], row['Longitude']],
            radius=5,  # Fixed size for now, can be scaled by carbon_density
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.7,
            popup=popup_content
        ).add_to(colombia_group if row['Country'] == 'Colombia' else panama_group)

# Add layer control to toggle countries
folium.LayerControl().add_to(m)

# Display the map
m
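
The map cell uses a fixed marker radius and notes that it could be scaled by carbon density. A minimal sketch of such a scaling is shown here; the helper `scaled_radius` and its default radius range are hypothetical choices, not part of the notebook.

```python
def scaled_radius(value, vmin, vmax, r_min=3.0, r_max=12.0):
    """Map a value in [vmin, vmax] linearly onto a marker radius in [r_min, r_max]."""
    if vmax == vmin:
        # Degenerate range: fall back to a mid-sized marker
        return (r_min + r_max) / 2
    fraction = (value - vmin) / (vmax - vmin)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp values outside the range
    return r_min + fraction * (r_max - r_min)

# Toy values: the low end, the midpoint and the high end of a 0-100 range
for demo_value in (0.0, 50.0, 100.0):
    print(demo_value, scaled_radius(demo_value, 0.0, 100.0))
```

In the map loop, the result could replace the fixed `radius=5`, for example `radius=scaled_radius(carbon_density, vmin, vmax)`.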