# **APPLYING OUR MODEL TO A NEW DATASET DEFINITION**

---

We want to explore the behaviour of our model when facing a new dataset definition.

This time, we will use a new year, `2020`, to test the predictive power of our models.

The aim of this notebook is to **show if our models retain their predictive capacity when tested on a different year** while also **showcasing the use of our serialized models**.



This notebook, unlike the others, will be short.


<br>

<br>



## **VERIFYING DATASET INTEGRITY**

In [1]:
import os
import rasterio
from collections import defaultdict
from tqdm import tqdm

def list_tif_files(folder_path):
    return [f for f in os.listdir(folder_path) if f.endswith('.tif')]

def explore_folder(folder_path):
    config_count = defaultdict(int)
    
    tif_files = list_tif_files(folder_path)
    for tif_file in tqdm(tif_files, desc=f"Processing {folder_path}"): 
        file_path = os.path.join(folder_path, tif_file)
        with rasterio.open(file_path) as src:
            config = (
                src.width,
                src.height,
                tuple(src.dtypes),
                str(src.crs),
                src.count
            )
            config_count[config] += 1
    
    return config_count

def explore_tif_folders(folder_paths):
    for folder_path in folder_paths:
        absolute_folder_path = os.path.abspath(folder_path)
        print(f"\n\n\nExploring folder: {absolute_folder_path}")

        config_count = explore_folder(folder_path)
        print("=======================================")
        print("\nUnique Configurations:")
        for i, (config, count) in enumerate(config_count.items(), 1):
            width, height, dtypes, crs, bands = config
            print(f"\nConfiguration #{i}:")
            print(f"  Width: {width}, Height: {height}")
            print(f"  Data Types: {dtypes}")
            print(f"  CRS: {crs}")
            print(f"  Number of Bands: {bands}")
            print(f"  Count of Images with this Configuration: {count}")
        print("=======================================\n")


# Initial folder paths, relative to the current script location

tif_dataset_paths = [
    'data/sentinel2rgbmedian2020.py',
    'data/treecover2020.py'
]
explore_tif_folders(tif_dataset_paths)





Exploring folder: /home/felipera/projects/udeai_forest/data/sentinel2rgbmedian2020.py


Processing data/sentinel2rgbmedian2020.py: 100%|██████████| 64106/64106 [01:22<00:00, 773.78it/s]



Unique Configurations:

Configuration #1:
  Width: 100, Height: 100
  Data Types: ('uint8', 'uint8', 'uint8')
  CRS: EPSG:4326
  Number of Bands: 3
  Count of Images with this Configuration: 64106




Exploring folder: /home/felipera/projects/udeai_forest/data/treecover2020.py


Processing data/treecover2020.py: 100%|██████████| 64106/64106 [01:21<00:00, 783.42it/s]


Unique Configurations:

Configuration #1:
  Width: 100, Height: 100
  Data Types: ('uint8', 'uint8', 'uint8')
  CRS: EPSG:4326
  Number of Bands: 3
  Count of Images with this Configuration: 64106






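Both folders report a single configuration covering all 64,106 images (100×100 pixels, 3 `uint8` bands, `EPSG:4326`). No file has an unexpected shape, band count, or CRS, which suggests that the data was downloaded without errors.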
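Matching configurations do not by themselves show that every Sentinel-2 tile has a tree cover counterpart, and the serialization step assumes identical file names in both directories. A minimal sketch of that extra check follows; the helper name `compare_tif_names` is hypothetical and not part of the project code.

```python
import os


def compare_tif_names(dir_a, dir_b):
    """Return the .tif file names found only in dir_a and only in dir_b."""
    names_a = {f for f in os.listdir(dir_a) if f.endswith('.tif')}
    names_b = {f for f in os.listdir(dir_b) if f.endswith('.tif')}
    # Sorted lists make the report stable and easy to read
    return sorted(names_a - names_b), sorted(names_b - names_a)
```

Calling `compare_tif_names('data/sentinel2rgbmedian2020.py', 'data/treecover2020.py')` should return two empty lists if the folders pair up one to one.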

<br>

<br>

---

## **SERIALIZING DATASETS TO PREPARE FOR PREDICTIONS**

In [2]:
# Importing required libraries
import os
import json
import torch
import rasterio
import numpy as np
from tqdm import tqdm

# Function to read and process a single .tif file for 'y'
def read_and_process_tif_file_for_y(filepath):
    with rasterio.open(filepath) as src:
        # Read the bands 1 and 3, skipping the empty band 2
        band1, band3 = src.read(1), src.read(3)
        
        # Apply the transformation to scale the values between 0 and 100
        band1 = (band1 / 255.0) * 100
        band3 = (band3 / 255.0) * 100
        
        # Calculate Percent_Vegetation_Coverage and clip it to be between 0 and 100
        percent_vegetation_coverage = np.clip(band1 + band3, 0, 100)
        
        # Calculate a single Percent_Vegetation_Coverage value for the entire image (e.g., mean)
        single_value = np.mean(percent_vegetation_coverage)
        
        return single_value, src.meta


# Function to read a single .tif file and return as numpy array
def read_tif_file(filepath):
    with rasterio.open(filepath) as src:
        return np.array(src.read()), src.meta

# Function to save metadata to a JSON file
def save_metadata(metadata_dict, save_path):
    for meta in metadata_dict:
        if 'sentinel_meta' in meta and 'crs' in meta['sentinel_meta']:
            meta['sentinel_meta']['crs'] = str(meta['sentinel_meta']['crs'])
        if 'treecover_meta' in meta and 'crs' in meta['treecover_meta']:
            meta['treecover_meta']['crs'] = str(meta['treecover_meta']['crs'])
    with open(save_path, 'w') as f:
        json.dump(metadata_dict, f)


# Initialize empty lists to store images and metadata
X_images = []
y_images = []
metadata_list = []


### ---------------------------- ###

# Directory paths
sentinel_dir = 'data/sentinel2rgbmedian2020.py'  # Replace with your actual directory
treecover_dir = 'data/treecover2020.py'  # Replace with your actual directory

X_test_tensor_name = 'X_test_tensor_visuals.pth'
y_test_tensor_name = 'y_test_tensor_visuals.pth'

### ---------------------------- ###


# File names are assumed to be the same in both directories; only .tif files are read
filenames = [f for f in os.listdir(sentinel_dir) if f.endswith('.tif')]

# Loop through each file and read the image and metadata
for filename in tqdm(filenames):
    # Read SENTINEL-2 image
    sentinel_path = os.path.join(sentinel_dir, filename)
    sentinel_img, sentinel_meta = read_tif_file(sentinel_path)
    
    # Read and process Tree Cover image
    treecover_path = os.path.join(treecover_dir, filename)
    percent_vegetation_coverage, treecover_meta = read_and_process_tif_file_for_y(treecover_path)
    
    # Append to lists
    X_images.append(sentinel_img)
    y_images.append(percent_vegetation_coverage)
    metadata_list.append({
        'filename': filename,
        'sentinel_meta': sentinel_meta,
        'treecover_meta': treecover_meta
    })

# Convert lists to PyTorch tensors
X_tensor = torch.tensor(np.stack(X_images, axis=0))
y_tensor = torch.tensor(np.stack(y_images, axis=0), dtype=torch.float32)

# Serialize tensors and save to disk
torch.save(X_tensor, X_test_tensor_name)
torch.save(y_tensor, y_test_tensor_name)

# Save metadata to JSON file
save_metadata(metadata_list, 'metadata_test.json')

print("Data preparation and serialization complete.")


100%|██████████| 64106/64106 [03:21<00:00, 318.38it/s]


Data preparation and serialization complete.


Our data was successfully serialized into the tensor files `'X_test_tensor_visuals.pth'` and `'y_test_tensor_visuals.pth'`.
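To make the label computation in `read_and_process_tif_file_for_y` easier to follow, here is a small worked example on a synthetic 2×2 tile. The helper `percent_coverage` and the input arrays are illustrative assumptions; only the scaling, clipping, and averaging steps mirror the cell above.

```python
import numpy as np


def percent_coverage(band1, band3):
    """Scale two uint8 bands to 0-100, add them, clip to 0-100, and average."""
    scaled1 = (band1 / 255.0) * 100
    scaled3 = (band3 / 255.0) * 100
    per_pixel = np.clip(scaled1 + scaled3, 0, 100)
    return float(np.mean(per_pixel))


# Synthetic bands: the bottom-right pixel sums to 200 and is clipped to 100
band1 = np.array([[255, 0], [51, 255]], dtype=np.uint8)
band3 = np.array([[0, 0], [51, 255]], dtype=np.uint8)

# Per-pixel coverage is [[100, 0], [40, 100]], so the tile label is about 60
label = percent_coverage(band1, band3)
print(label)
```

Each image therefore contributes a single scalar to `y_tensor`, which is why its shape is one value per tile.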


<br>

<br>

<br>

---

<br>

<br>

## **LOADING OUR SERIALIZED MODELS AND THE DATASET**

<br>

To load our models, we must first make sure that all the required libraries, functions, and dependencies are available in our environment:

In [2]:
X_test_tensor_name = 'X_test_tensor_visuals.pth'
y_test_tensor_name = 'y_test_tensor_visuals.pth'

In [7]:
import torch
import joblib
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error, mean_absolute_error
import os
import json
import time
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV


# custom definitions
from src.classes.VGGUdeaSpectral import VGGUdeaSpectral

from src.classes.MultipleRegressionModel import MultipleRegressionModel


def rmse_score(net, X, y):
    y_pred = net.predict(X)
    rmse = (mean_squared_error(y_true=y, y_pred=y_pred)) ** 0.5
    return -rmse  # Skorch tries to maximize the score, so negate the RMSE

In [9]:
# Load the tensors
X_test_tensor = torch.load(X_test_tensor_name)
y_test_tensor = torch.load(y_test_tensor_name).view(-1,1)

# Print the shapes and data types of the tensors
print("Shape of X_tensor:", X_test_tensor.shape)
print("Data type of X_tensor:", X_test_tensor.dtype)
print("Number of elements in X_tensor:", torch.numel(X_test_tensor))

print("Shape of y_tensor:", y_test_tensor.shape)
print("Data type of y_tensor:", y_test_tensor.dtype)
print("Number of elements in y_tensor:", torch.numel(y_test_tensor))

# Calculate the size in bytes
print("Size of X_tensor in bytes:", X_test_tensor.element_size() * X_test_tensor.nelement())
print("Size of y_tensor in bytes:", y_test_tensor.element_size() * y_test_tensor.nelement())

Shape of X_tensor: torch.Size([64106, 3, 100, 100])
Data type of X_tensor: torch.uint8
Number of elements in X_tensor: 1923180000
Shape of y_tensor: torch.Size([64106, 1])
Data type of y_tensor: torch.float32
Number of elements in y_tensor: 64106
Size of X_tensor in bytes: 1923180000
Size of y_tensor in bytes: 256424


In [4]:
## LOADING MODEL VGGUdeaSpectral
import joblib
from src.classes.VGGUdeaSpectral import VGGUdeaSpectral


vgg_model = joblib.load("src/trained_models/VGGUdeaSpectral1/VGGUdeaSpectral_model1.joblib")

vgg_model

In [None]:
## LOADING MODEL VGGUdeaSpectral

from src.classes.VGGUdeaSpectral import VGGUdeaSpectral


vgg_model = joblib.load("src/trained_models/VGGUdeaSpectral0/VGGUdeaSpectral_model0.joblib")

vgg_model

In [8]:
## LOADING MODEL MultipleRegression

from src.classes.MultipleRegressionModel import MultipleRegressionModel


multiple_regression_model = joblib.load("src/trained_models/MultipleRegression0/MultipleRegression_model0.joblib")

multiple_regression_model

In [10]:
# Convert X to float32
X_test_tensor = X_test_tensor.to(dtype=torch.float32)

predictions = vgg_model.predict(X_test_tensor)


# Convert y to a NumPy array if it is a PyTorch tensor
y_np = y_test_tensor.cpu().numpy() if isinstance(y_test_tensor, torch.Tensor) else y_test_tensor

# Calculate MAE and RMSE using the true labels y
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_np, predictions)
rmse = np.sqrt(mean_squared_error(y_np, predictions))

print(f'Average MAE: {mae}')
print(f'Average RMSE: {rmse}')

Average MAE: 5.838313579559326
Average RMSE: 7.768616676330566
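Casting the whole `X_test_tensor` to `float32` quadruples its footprint, from the 1,923,180,000 bytes reported earlier to roughly 7.7 GB. If memory is tight, one option is to keep the data as `uint8` and convert one batch at a time. The sketch below is only an illustration: `predict_in_batches` and `predict_fn` are hypothetical names, and it assumes the model's `predict` accepts NumPy arrays and gives the same result per batch as on the full array.

```python
import numpy as np


def predict_in_batches(predict_fn, X_uint8, batch_size=1024):
    """Run predict_fn on float32 copies of consecutive slices of X_uint8."""
    outputs = []
    for start in range(0, len(X_uint8), batch_size):
        # Only this slice is held in memory as float32
        batch = X_uint8[start:start + batch_size].astype(np.float32)
        outputs.append(np.asarray(predict_fn(batch)))
    return np.concatenate(outputs, axis=0)
```

Under those assumptions, the call would look like `predict_in_batches(vgg_model.predict, X_uint8)`, where `X_uint8` is the tensor loaded from disk and converted with `.numpy()` before any cast to `float32`.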


In [14]:
print("Predictions Info: ")
print("Datatype:    ", type(predictions))
print("Shape:       ", predictions.shape)

print("Unique values as percentage of vegetation coverage per 1km^2 tile: \n")

print(np.unique(predictions))



Predictions Info: 
Datatype:     <class 'numpy.ndarray'>
Shape:        (64106, 1)
Unique values as percentage of vegetation coverage per 1km^2 tile: 

[46.53532  46.53537  46.535534 ... 94.0871   94.39075  94.58438 ]


In [14]:
# Convert X to float32
X_test_tensor = X_test_tensor.to(dtype=torch.float32)

# Flatten the images
X_test = X_test_tensor.reshape(X_test_tensor.shape[0], -1)

predictions = multiple_regression_model.predict(X_test)

# Convert y to a NumPy array if it is a PyTorch tensor
y_np = y_test_tensor.cpu().numpy() if isinstance(y_test_tensor, torch.Tensor) else y_test_tensor

# Calculate MAE and RMSE using the true labels y
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_np, predictions)
rmse = np.sqrt(mean_squared_error(y_np, predictions))

print(f'Average MAE: {mae}')
print(f'Average RMSE: {rmse}')

Average MAE: 4.6125633849907945
Average RMSE: 9.391506176502144


<br>
<br>

---

## **CONCLUSIONS**

<br>
<br>

### **FOR THE VGGUdeaSpectral Model**


<br>

<br>

```javascript
Validation Set (2019):

MAE: 3.9198
RMSE: 6.394


Test Set (2020):

MAE: 5.8383
RMSE: 7.7686
```

<br>

We observe that the model's performance deteriorated by approximately 48.94% in Mean Absolute Error (MAE) and 21.50% in Root Mean Squared Error (RMSE) when tested on the 2020 images compared to the validation performance on the 2019 data. This substantial deterioration might suggest:

- **Temporal Overfitting:** The model may be significantly overfitting to the specific conditions of 2019, including weather patterns, seasonal vegetation changes, or specific events (like fires or deforestation) that uniquely characterized that year.

- **Shift in Data Distribution:** The characteristics of the images from 2020 could differ markedly from those of 2019, indicating a significant distribution shift. This discrepancy could be due to various factors, such as changes in satellite imaging conditions, alterations in land use, or natural vegetation cycles.

- **Model Robustness:** The pronounced performance drop highlights potential issues with the model's robustness, suggesting it may struggle with the natural year-to-year variability in the data. This indicates a need to train on a more diverse dataset spanning multiple years to improve temporal generalization.
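The deterioration percentages quoted for the VGGUdeaSpectral model can be reproduced from the four reported metrics. The helper `relative_change` is a hypothetical name introduced here for illustration.

```python
def relative_change(reference, new):
    """Percentage change of `new` relative to `reference`."""
    return (new - reference) / reference * 100


# Validation (2019) versus test (2020) metrics reported above
vgg_mae_change = relative_change(3.9198, 5.8383)
vgg_rmse_change = relative_change(6.394, 7.7686)

print(f"MAE change:  {vgg_mae_change:.2f}%")   # 48.94%
print(f"RMSE change: {vgg_rmse_change:.2f}%")  # 21.50%
```

A positive value means the error grew, so the model did worse on the test year.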

<br>
<br>

### **FOR THE MultipleRegressor Model**

<br>

<br>

```javascript
Validation Set (2019):

MAE: 4.659
RMSE: 9.1532


Test Set (2020):

MAE: 4.6126
RMSE: 9.3915
```

<br>

The performance of the MultipleRegressor model on the 2020 test set, compared to its validation on the 2019 data, shows:

- A slight improvement in Mean Absolute Error (MAE) of approximately 1.00%.
- A deterioration in Root Mean Squared Error (RMSE) of approximately 2.60%.

This suggests that:

1. On average, the model's predictions were marginally closer to the actual values in 2020 than in 2019, despite the model being trained on 2019 data.

2. The increase in RMSE indicates that the 2020 predictions likely contain a few larger errors, since RMSE squares the errors and is therefore more sensitive to large ones.

3. Overall, the model shows relatively stable performance across the two years. A small deterioration in RMSE is to be expected given year-to-year variation in the data.
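The second point, that MAE can hold steady or improve while RMSE worsens, is easier to see with a toy example. The error values below are made up for illustration and are not taken from the models.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Two hypothetical residual sets with the same MAE of 5.0
uniform_errors = [5.0, 5.0, 5.0, 5.0]
skewed_errors = [1.0, 1.0, 1.0, 17.0]  # one large outlier

print(mae(uniform_errors), rmse(uniform_errors))
print(mae(skewed_errors), rmse(skewed_errors))
```

Both sets share the same MAE, but the set with the single large error has a clearly higher RMSE. That pattern matches what the MultipleRegressor results show.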