
compute_rpss is not working on generated lightning data #5

Status: Open

harry9713 (Collaborator) commented Apr 4, 2024

I tried with the examples given; the function that calculates RPSS raises an error about a missing dim_0 dimension.

version: 1.5.0
python: 3.10.2

Code:

```python
import nwpeval as nw

obs, model = lightning_data()
metrics_obj = nw.NWP_Stats(obs.lightning_density, model.lightning_density)
thresholds = {
    'SEDS': 0.6,
    'SEDI': 0.7,
    'RPSS': 0.8
}

# Compute probabilistic metrics with custom thresholds
metrics = ['SEDS', 'SEDI', 'RPSS']
metric_values = metrics_obj.compute_metrics(metrics, thresholds=thresholds)
```

Error (traceback excerpt from xarray/namedarray/core.py):

```
    682 except ValueError:
--> 683     raise ValueError(f"{dim!r} not found in array dimensions {self.dims!r}")

ValueError: 'dim_0' not found in array dimensions ('time', 'lat', 'lon')
```

When I tracked the error down, it came from the compute_rpss function, which expects dim_0 to be a native dimension of the input. I am not sure that will be the case for every dataset we provide.

Possible solution
Standardise the input and output data structures by enforcing attributes and adding checking routines.
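As an illustration, a checking routine of the kind proposed here could be a small helper that validates the expected dimensions before any metric is computed. This is only a sketch; the function name, signature, and error message are hypothetical and not part of nwpeval:

```python
def check_dims(data, required_dims=("time", "lat", "lon")):
    """Raise a clear error if the input lacks any expected dimension.

    Hypothetical helper sketch: `data` is anything with a `.dims`
    attribute (e.g. an xarray DataArray). Not part of nwpeval today.
    """
    missing = [d for d in required_dims if d not in data.dims]
    if missing:
        raise ValueError(
            f"Input is missing required dimension(s) {missing}; "
            f"found dimensions {tuple(data.dims)}"
        )
    return data
```

Calling such a check at the top of each compute_* method would turn the opaque `'dim_0' not found` failure into an actionable message about which dimensions the caller's data actually has.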

Please let me know what you think.

Debasish-Mahapatra (Owner) commented:

Is this an issue only for the compute_rpss method, or do other methods have the same issue?
At this point I think it is a bug from the previous versions, where dim_0 was hardcoded into the method.

For the moment, the fix can be just changing this hardcoded value?

Standardising the IO by enforcing attributes and checking routines would be a good upgrade to the entire code base; I do not think it will be much of a hassle to implement.

That being said, I think this can be a fix in the next version perhaps? Do you agree?

Debasish-Mahapatra (Owner) commented Apr 4, 2024

(Screenshot attached: 2024-04-04 at 10:36 AM)

Fixed (temporary) -- changing the code of the compute_rpss method to the version below fixes the issue.

```python
def compute_rpss(self, threshold, dim=None):
    """
    Compute the Ranked Probability Skill Score (RPSS) for a given threshold.

    Args:
        threshold (float): The threshold value for binary classification.
        dim (str, list, or None): The dimension(s) along which to compute the RPSS.
                                  If None, compute the RPSS over the entire data.

    Returns:
        xarray.DataArray: The computed RPSS values.
    """
    # Convert data to binary based on the threshold
    obs_binary = (self.obs_data >= threshold).astype(int)
    model_binary = (self.model_data >= threshold).astype(int)

    # Calculate the RPS for the model data
    rps_model = ((model_binary.cumsum(dim) - obs_binary.cumsum(dim)) ** 2).mean(dim=dim)

    # Calculate the RPS for the climatology (base rate)
    base_rate = obs_binary.mean(dim=dim)
    rps_climo = ((xr.full_like(model_binary, base_rate).cumsum(dim) - obs_binary.cumsum(dim)) ** 2).mean(dim=dim)

    # Calculate the RPSS
    rpss = 1 - rps_model / rps_climo

    return rpss
```

Debasish-Mahapatra (Owner) commented:

Did a major fix to compute_rpss so that it works for both scalar and non-scalar values.

```python
def compute_rpss(self, threshold, dim=None):
    """
    Compute the Ranked Probability Skill Score (RPSS) for a given threshold.

    Args:
        threshold (float): The threshold value for binary classification.
        dim (str, list, or None): The dimension(s) along which to compute the RPSS.
                                  If None, compute the RPSS over the entire data.

    Returns:
        xarray.DataArray: The computed RPSS values.
    """
    # Convert data to binary based on the threshold
    obs_binary = (self.obs_data >= threshold).astype(int)
    model_binary = (self.model_data >= threshold).astype(int)

    # Calculate the RPS for the model data
    rps_model = ((model_binary.cumsum(dim) - obs_binary.cumsum(dim)) ** 2).mean(dim=dim)

    # Calculate the RPS for the climatology (base rate)
    base_rate = obs_binary.mean(dim=dim)
    rps_climo = ((xr.full_like(model_binary, 0).cumsum(dim) - obs_binary.cumsum(dim)) ** 2).mean(dim=dim)
    rps_climo = rps_climo + base_rate * (1 - base_rate)

    # Calculate the RPSS
    rpss = 1 - rps_model / rps_climo

    return rpss
```

In-depth explanation of what this means (for new users)

The updated compute_rpss method will work correctly for both scalar and non-scalar base_rate values.

In the context of xarray and dimensions/coordinates in a dataset, a scalar value refers to a single value that does not depend on any dimensions. It is a 0-dimensional value. On the other hand, a non-scalar value is an array or a DataArray that depends on one or more dimensions and has corresponding coordinates.

Let's consider an example to illustrate the difference:

Suppose we have a dataset with dimensions "time", "lat", and "lon". The dataset contains a variable "temperature" with corresponding coordinates for each dimension.

  • Scalar value: If we calculate the mean temperature over all dimensions using temperature.mean(), the resulting value will be a scalar. It will be a single value that does not depend on any dimensions.

  • Non-scalar value: If we calculate the mean temperature over a specific dimension, such as temperature.mean(dim="time"), the resulting value will be a non-scalar DataArray. It will have dimensions "lat" and "lon" and corresponding coordinates, but it will not depend on the "time" dimension anymore.
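The same distinction can be sketched with plain NumPy; xarray reductions behave analogously, just with named dimensions instead of axis numbers. The array below is a small stand-in for a (time, lat, lon) temperature field:

```python
import numpy as np

# A small (time, lat, lon)-like array: 2 time steps, 3 lats, 4 lons
temperature = np.arange(24, dtype=float).reshape(2, 3, 4)

# Reducing over all axes gives a scalar (0-dimensional) value,
# analogous to temperature.mean() in xarray
overall_mean = temperature.mean()
print(overall_mean.ndim)   # 0

# Reducing over only the first ("time") axis leaves a (lat, lon) array,
# analogous to temperature.mean(dim="time") in xarray
time_mean = temperature.mean(axis=0)
print(time_mean.shape)     # (3, 4)
```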

In the updated compute_rpss method, the line base_rate = obs_binary.mean(dim=dim) calculates the mean of obs_binary over the specified dimensions dim. If dim is None, it will calculate the mean over all dimensions, resulting in a scalar value. If dim is a specific dimension or a list of dimensions, it will calculate the mean over those dimensions, resulting in a non-scalar DataArray.

The subsequent lines of code in the compute_rpss method handle both cases correctly:

```python
rps_climo = ((xr.full_like(model_binary, 0).cumsum(dim) - obs_binary.cumsum(dim)) ** 2).mean(dim=dim)
rps_climo = rps_climo + base_rate * (1 - base_rate)
```

If base_rate is a scalar value, it will be broadcasted to match the shape of rps_climo, and the calculation will be performed element-wise. If base_rate is a non-scalar DataArray, it will be aligned with rps_climo based on the common dimensions, and the calculation will be performed element-wise.
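A minimal NumPy sketch of the broadcasting behaviour being relied on here (xarray additionally aligns on coordinate labels, but the element-wise arithmetic is the same). The shapes and values are illustrative only:

```python
import numpy as np

# Stand-in for the climatological RPS field on a (lat, lon) grid
rps_climo = np.full((3, 4), 0.25)

# Scalar base_rate: the single value is broadcast across the whole field
scalar_rate = 0.5
out_scalar = rps_climo + scalar_rate * (1 - scalar_rate)
print(out_scalar.shape)                  # (3, 4)

# Non-scalar base_rate: combined element-wise with a matching-shaped field
field_rate = np.full((3, 4), 0.5)
out_field = rps_climo + field_rate * (1 - field_rate)
print(np.allclose(out_scalar, out_field))  # True
```

Both branches produce the same field here because the scalar and the field hold the same value; the point is that the `rps_climo + base_rate * (1 - base_rate)` line is valid in either case.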

Now, will this work with data that have different coordinates? The updated compute_rpss method should work correctly as long as the dimensions and coordinates of obs_binary and model_binary are compatible. The method relies on xarray's broadcasting and alignment rules to handle data with different coordinates.

However, it's important to note that if the coordinates of obs_binary and model_binary are completely different or incompatible, you may encounter issues with dimension alignment or broadcasting. In such cases, you would need to ensure that the coordinates are properly aligned or resampled before applying the compute_rpss method.

In summary, the updated compute_rpss method should work correctly for both scalar and non-scalar base_rate values, and it should handle data with different coordinates as long as the dimensions and coordinates are compatible between obs_binary and model_binary.

@Debasish-Mahapatra Debasish-Mahapatra added bug Something isn't working documentation Improvements or additions to documentation enhancement New feature or request good first issue Good for newcomers labels Apr 4, 2024
@Debasish-Mahapatra Debasish-Mahapatra self-assigned this Apr 4, 2024
Debasish-Mahapatra (Owner) commented:

@harry9713 I have created a new branch. Can you test this and let me know if the bug is gone? Then we can close this, with the "IO" changes planned as a major enhancement in the next major update or so.

harry9713 (Collaborator, Author) commented Apr 4, 2024

@Debasish-Mahapatra it does fix the runtime error for now, so we can merge this. I will check the rest of the methods as well. However, I did not get a realistic result, as the sample data give NaN values every time. We will have to release a new version, since this is a major fix.

Debasish-Mahapatra (Owner) commented Apr 4, 2024

@harry9713 I will merge this with the main branch. Were you testing the code with the added lightning data? It might be giving NaN values because of the threshold; did you check that? It might also be a good idea to check with some "real case data" before releasing the next version, to see if we get realistic results.

I am still keeping this issue open; we can close it depending on how the tests with ERA data go.

harry9713 (Collaborator, Author) commented:

I have checked with different thresholds, but the result was the same. It would be nice to have real cases to test with.

Debasish-Mahapatra (Owner) commented:

In the new branch, I have added some plots in nwpeval/examples. Check the following: metrics_time_avg.png and metrics_diurnal.png. These are from the generated "test" lightning data.

Attachment: Era_rain.nc.zip

Can you take a look at this?

```python
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import nwpeval as nw

# Load observation and model data
obs_data = xr.open_dataset("Era_rain.nc")
model_data = xr.open_dataset("Era_rain.nc")

# Extract precipitation variables
obs_tp = obs_data["tp"] * 10000  # Convert from m to mm
model_tp = model_data["tp"] * 1000  # Convert from m to mm

# Create an instance of the NWP_Stats class
metrics_obj = nw.NWP_Stats(obs_tp, model_tp)

#####################

# Define the thresholds for metric calculations
thresholds = {
    'RPSS': 1,
}

# Calculate time-averaged metrics
metrics_time_avg = {}
for metric, threshold in thresholds.items():
    metrics_time_avg[metric] = metrics_obj.compute_metrics([metric], thresholds={metric: threshold}, dim="time")[metric]

# Calculate area-average diurnal cycle metrics
metrics_diurnal = {}
for metric, threshold in thresholds.items():
    metrics_diurnal[metric] = metrics_obj.compute_metrics([metric], thresholds={metric: threshold}, dim=["latitude", "longitude"])[metric].groupby("time.hour").mean()

# Plot time-averaged metrics
for metric in thresholds.keys():
    plt.figure(figsize=(9, 6))
    metrics_time_avg[metric].plot(cmap="coolwarm", vmin=-1, vmax=1)
    plt.title(f"Time-Averaged {metric}")
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.tight_layout()
    plt.savefig(f"metrics_time_avg_{metric}.png")
    plt.show()

# Plot area-average diurnal cycle metrics
for metric in thresholds.keys():
    plt.figure(figsize=(9, 6))
    metrics_diurnal[metric].plot()
    plt.title(f"Area-Average Diurnal Cycle {metric}")
    plt.xlabel("Hour")
    plt.ylabel(metric)
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(f"metrics_diurnal_{metric}.png")
    plt.show()
```

(Plots attached: metrics_diurnal_RPSS and metrics_time_avg_RPSS)

I have done this with the attached Era_rain.nc data.

harry9713 (Collaborator, Author) commented:

Definitely. I will check with the provided data later today.

I think the conversion from m to mm is wrong for the observations, since the factor should be 1000 instead of 10000. Please check whether it is the same on your side as well. Other than that, it seems to be working.

Debasish-Mahapatra (Owner) commented Apr 4, 2024

Oops..! My bad, I should have mentioned this earlier. As I was too lazy to download two data sets, match the grids, and go through all the pre-processing steps, I downloaded just one dataset from ERA5 and multiplied in these "factors" (10000 and 1000) so that the two fields differ when you calculate something. 🤣🤣🤣 And then I forgot to change the comments.

Not a bug! It was intentional...😂

```python
# Extract precipitation variables
obs_tp = obs_data["tp"] * 10000  # offset to create data variability
model_tp = model_data["tp"] * 1000  # offset to create data variability
```
