# Statistical Analysis of Extreme Precipitation Indices (EPI)

This notebook carries out the statistical analysis of the output data produced by `Computing_Indices.ipynb`. These files contain the computed daily extreme precipitation indices.

---

 - Author:          
                    Luis F Patino Velasquez - MA
 - Date:            
                    Jun 2020
 - Version:         
                    1.0
 - Notes:            
                    Files used in this notebook are outputs of the Computing_Indices.ipynb notebook
 - Jupyter version: 
                    jupyter core     : 4.7.1
                    jupyter-notebook : 6.4.0
                    qtconsole        : 5.1.1
                    ipython          : 7.25.0
                    ipykernel        : 6.0.3
                    jupyter client   : 6.1.12
                    jupyter lab      : 3.0.16
                    nbconvert        : 6.1.0
                    ipywidgets       : 7.6.3
                    nbformat         : 5.1.3
                    traitlets        : 5.0.5
 - Python version:  
                    3.8.5 

---

## Main considerations

* Data from HadUK-Grid have been regridded to the GPM-IMERG grid using conservative interpolation in CDO. An example of the command used is: `cdo -remapcon,gpm_imerg_xclimSeason_QSDEC_prcp_2001-2019.nc haduk_metoffice_xclimSeason_QSDEC_prcp_2001-2019 haduk_metoffice_xclimSeason_QSDEC_prcp_RegridToIMERG_2001-2019`
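
To illustrate what conservative interpolation does, here is a minimal 1-D sketch: each target cell receives the overlap-weighted mean of the source cells it covers, so the integrated total is preserved. This is only a toy illustration and not how `cdo -remapcon` is implemented (CDO works with cell areas on the sphere). The function name `conservative_remap_1d` and the toy grids are assumptions made for the example.

```python
import numpy as np

def conservative_remap_1d(src_edges, src_values, dst_edges):
    """Overlap-weighted (first-order conservative) remap of cell means in 1-D."""
    src_edges = np.asarray(src_edges, dtype=float)
    dst_edges = np.asarray(dst_edges, dtype=float)
    src_values = np.asarray(src_values, dtype=float)
    out = np.full(len(dst_edges) - 1, np.nan)
    for k in range(len(dst_edges) - 1):
        lo, hi = dst_edges[k], dst_edges[k + 1]
        # Length of the overlap between target cell k and every source cell
        overlap = np.clip(
            np.minimum(hi, src_edges[1:]) - np.maximum(lo, src_edges[:-1]),
            0.0, None)
        if overlap.sum() > 0:
            out[k] = np.sum(overlap * src_values) / overlap.sum()
    return out

# Toy data: four fine cells remapped onto two coarse cells
fine_edges = np.linspace(0.0, 1.0, 5)
fine_values = np.array([1.0, 3.0, 5.0, 7.0])
coarse_edges = np.array([0.0, 0.5, 1.0])
coarse_values = conservative_remap_1d(fine_edges, fine_values, coarse_edges)

# The integral (value x cell width) is the same before and after remapping
fine_total = np.sum(fine_values * np.diff(fine_edges))
coarse_total = np.sum(coarse_values * np.diff(coarse_edges))
print(coarse_values, fine_total, coarse_total)
```

The design point is that, unlike bilinear interpolation, the weights come from cell overlap, which is why this family of methods is usually preferred for precipitation totals.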

## Setting Python Modules

In [None]:
# Imports for xclim and xarray
import xclim as xc
import pandas as pd
import numpy as np
import xarray as xr

# File handling libraries
import time
import tempfile
from pathlib import Path

# other python packages
import functools
import warnings
from itertools import groupby

# Geospatial libraries
import geopandas
import rioxarray
from shapely.geometry import mapping

# import plotting stuff
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.mlab as mlab
import matplotlib.colors as mcolors
# set colours
# plt.style.use('default')
plt.style.use("~/.local/lib/python3.8/site-packages/matplotlib/mpl-data/stylelib/lfpv.mplstyle")

%matplotlib inline
# Set some plotting defaults
plt.rcParams['figure.figsize'] = (15, 11)
plt.rcParams['figure.dpi'] = 50

# Mapping libraries
from mpl_toolkits.basemap import Basemap

sep = '-----------\n-----------'
print(sep)

def savingFile(file_name, data_xarray):
    """Save an xarray dataset as NetCDF in the stats folder unless the file already exists."""
    xclim_indices = Path('/mnt/d/MRes_dataset/active_data/103_stats') / file_name
    # Check if the file already exists
    if not xclim_indices.is_file():
        print('saving to ', xclim_indices)
        # Set the spatial dimensions and the CRS (EPSG:4326) before writing
        data_xarray.rio.set_spatial_dims(x_dim='lon', y_dim='lat', inplace=True)
        data_xarray.rio.write_crs(4326, inplace=True)
        data_xarray.to_netcdf(path=xclim_indices)
        print('finished saving')
    else:
        print('{} already exists - try using a different file name'.format(file_name))

# Reading the Data

In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/active_data/102_prcp')
# Check saved files
!ls {fldr_src}

<div class="alert alert-block alert-warning">
 <p style="color:black"> <b>R95pTOT and R99pTOT are run separately. Make sure to check the column names used in the script as well as the dataset chosen in the step below</b></p>
</div>

## Working with data excluding r99ptot

In [None]:
# Reading the seasonal data
ERA_dataset_season = xr.open_dataset(Path(fldr_src / 'era5_copernicus_xclimSeason_QSDEC_prcp_RegridedToIMERG_2001-2019.nc'))
GPM_dataset_season = xr.open_dataset(Path(fldr_src / 'gpm_imerg_xclimSeason_QSDEC_prcp_2001-2019.nc'))
HAD_dataset_season = xr.open_dataset(Path(fldr_src / 'hadukWGS84Attr_metoffice_xclimSeason_QSDEC_prcp_RegridedToIMERG_2001-2019.nc'))

print(sep)
print('Dataset setup')
print(sep)

HAD_dataset_season

# Checking and Analysing season data

## Season Analysis

### Grouping data by season

In [None]:
# Grouping by season keeping time series
# This is needed to carry out analysis by choosing the time dimension
# Each subset holds the same season for all years

# ERA5
ERA_dataset_DJF = ERA_dataset_season.sel(time=ERA_dataset_season.time.dt.season=="DJF")
ERA_dataset_MAM = ERA_dataset_season.sel(time=ERA_dataset_season.time.dt.season=="MAM")
ERA_dataset_JJA = ERA_dataset_season.sel(time=ERA_dataset_season.time.dt.season=="JJA")
ERA_dataset_SON = ERA_dataset_season.sel(time=ERA_dataset_season.time.dt.season=="SON")

# IMERG
GPM_dataset_DJF = GPM_dataset_season.sel(time=GPM_dataset_season.time.dt.season=="DJF")
GPM_dataset_MAM = GPM_dataset_season.sel(time=GPM_dataset_season.time.dt.season=="MAM")
GPM_dataset_JJA = GPM_dataset_season.sel(time=GPM_dataset_season.time.dt.season=="JJA")
GPM_dataset_SON = GPM_dataset_season.sel(time=GPM_dataset_season.time.dt.season=="SON")

# HADUK
HAD_dataset_DJF = HAD_dataset_season.sel(time=HAD_dataset_season.time.dt.season=="DJF")
HAD_dataset_MAM = HAD_dataset_season.sel(time=HAD_dataset_season.time.dt.season=="MAM")
HAD_dataset_JJA = HAD_dataset_season.sel(time=HAD_dataset_season.time.dt.season=="JJA")
HAD_dataset_SON = HAD_dataset_season.sel(time=HAD_dataset_season.time.dt.season=="SON")

print(sep)
print('All data have been grouped by seasons')
print(sep)

### Statistical comparison

* GPM-IMERG and ERA5 data will be compared with values from HadUK-Grid
* The analysis will be carried out by seasons for all years e.g. Pearson correlation coefficient for DJF season in 2001,2002,2003....2019
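
As a minimal sketch of the per-season comparison described above, the Pearson coefficient can be computed for each grid cell from its series of yearly values. The function name `pearson_per_cell` and the toy series are assumptions for illustration; the notebook itself uses `scipy.stats.pearsonr` inside `stats_indices_v2`.

```python
import numpy as np

def pearson_per_cell(obs, oth):
    """Pearson r along the year axis for arrays shaped (n_cells, n_years)."""
    obs = np.asarray(obs, dtype=float)
    oth = np.asarray(oth, dtype=float)
    # Anomalies with respect to each cell's own mean
    obs_anom = obs - obs.mean(axis=1, keepdims=True)
    oth_anom = oth - oth.mean(axis=1, keepdims=True)
    cov = (obs_anom * oth_anom).sum(axis=1)
    denom = np.sqrt((obs_anom ** 2).sum(axis=1) * (oth_anom ** 2).sum(axis=1))
    return cov / denom

# Toy data: two grid cells, one value per DJF season from 2001 to 2019
years = np.arange(2001, 2020)
obs = np.vstack([years - 2000.0, years - 2000.0])
oth = np.vstack([2.0 * obs[0] + 1.0,   # perfectly correlated with obs
                 -obs[1]])             # perfectly anti-correlated with obs
r = pearson_per_cell(obs, oth)
print(r)
```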

In [None]:
# Stats packages
import scipy.stats as stats
from scipy.stats import mannwhitneyu
from scipy.stats import variation

<a id='func_stats'></a>
#### Functions to be used for data processing and statistical analysis

In [None]:
def season_dataframe(HADxarray_season, GPMxarray_season, ERAxarray_season):
    """
    Converts the xarray season datasets to pandas dataframes, changing timedelta64 data to float days
    :param HADxarray_season: xarray with time variable grouped by season
    :param GPMxarray_season: xarray with time variable grouped by season
    :param ERAxarray_season: xarray with time variable grouped by season
    :return: tuple - pandas dataframes (df_era_season, df_gpm_season, df_had_season)
    """
    ##############
    # ERA5
    ##############
    df_era_season = ERAxarray_season.to_dataframe().reset_index()
    # Change all timedelta64 (x days) for only the day
    df_era_season['r10mm'] = np.where(np.isnat(df_era_season['r10mm']),np.nan, (df_era_season['r10mm'] / np.timedelta64(1, 'D'))).astype(float)
    df_era_season['r20mm'] = np.where(np.isnat(df_era_season['r20mm']),np.nan, (df_era_season['r20mm'] / np.timedelta64(1, 'D'))).astype(float)
    df_era_season['cdd'] = np.where(np.isnat(df_era_season['cdd']),np.nan, (df_era_season['cdd'] / np.timedelta64(1, 'D'))).astype(float)
    df_era_season['cwd'] = np.where(np.isnat(df_era_season['cwd']),np.nan, (df_era_season['cwd'] / np.timedelta64(1, 'D'))).astype(float)
    ##############
    # GPM-IMERG
    ##############
    df_gpm_season = GPMxarray_season.to_dataframe().reset_index()
    # Change all timedelta64 (x days) for only the day
    df_gpm_season['r10mm'] = np.where(np.isnat(df_gpm_season['r10mm']),np.nan, (df_gpm_season['r10mm'] / np.timedelta64(1, 'D'))).astype(float)
    df_gpm_season['r20mm'] = np.where(np.isnat(df_gpm_season['r20mm']),np.nan, (df_gpm_season['r20mm'] / np.timedelta64(1, 'D'))).astype(float)
    df_gpm_season['cdd'] = np.where(np.isnat(df_gpm_season['cdd']),np.nan, (df_gpm_season['cdd'] / np.timedelta64(1, 'D'))).astype(float)
    df_gpm_season['cwd'] = np.where(np.isnat(df_gpm_season['cwd']),np.nan, (df_gpm_season['cwd'] / np.timedelta64(1, 'D'))).astype(float)
    
    ##############
    # HADUK-GRID
    ##############
    df_had_season = HADxarray_season.to_dataframe().reset_index()
    # Change all timedelta64 (x days) for only the day
    # If value is nan then leave it as nan otherwise change to date number as a float
    df_had_season['r10mm'] = np.where(np.isnat(df_had_season['r10mm']),np.nan, (df_had_season['r10mm'] / np.timedelta64(1, 'D')))
    df_had_season['r20mm'] = np.where(np.isnat(df_had_season['r20mm']),np.nan, (df_had_season['r20mm'] / np.timedelta64(1, 'D')))
    df_had_season['cdd'] = np.where(np.isnat(df_had_season['cdd']),np.nan, (df_had_season['cdd'] / np.timedelta64(1, 'D')))
    df_had_season['cwd'] = np.where(np.isnat(df_had_season['cwd']),np.nan, (df_had_season['cwd'] / np.timedelta64(1, 'D')))
    
    return(df_era_season, df_gpm_season, df_had_season)

def setting_dataframes(OBS_dataframe, OTH_dataframe):
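    """
    Matches the OBS and OTH dataframes on coordinates and time, keeping data for the UK only
    :OBS_dataframe: pandas dataframe (HadUK-Grid)
    :OTH_dataframe: pandas dataframe (GPM-IMERG or ERA5)
    :return: tuple - pandas dataframes grouped by lat and lon
    """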
    col_lst = ['r10mm_x', 'r20mm_x', 'cdd_x', 'cwd_x', 'sdii_x', 'rx1day_x', 'rx5day_x', 'prcptot_x', 'r95ptot_x']
    col_lst_ord = ['lat', 'lon', 'r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']

    # Add date as string to help with inner join
    df_OBS = OBS_dataframe.copy()
    df_OBS['tiempo'] = OBS_dataframe['time'].apply(lambda x: x.strftime('%Y%m%d'))
    df_OTH = OTH_dataframe.copy()
    df_OTH['tiempo'] = OTH_dataframe['time'].apply(lambda x: x.strftime('%Y%m%d'))

    # Inner join of dataframe - make sure we are looking at the same set of coordinates
    new_df = pd.merge(df_OBS, df_OTH, how='inner', on=['lat','lon','tiempo'])

    # Drop rows with NaN values - only looking at the indices of the obs dataset (HadUK-Grid)
    # Needed for the statistical analysis. A row is only removed if all of its obs indices are NaN
    selected_rows = new_df.dropna(subset=col_lst, how='all', axis=0).reset_index()


    # Split the joined dataframe into OBS (HAD) and OTH (GPM or ERA)
    df_OBS_split = selected_rows.filter(regex='_x')
    df_OTH_split = selected_rows.filter(regex='_y')

    # create copy of dataframe to avoid the slice error
    df_OBS_final = df_OBS_split.copy()
    df_OTH_final = df_OTH_split.copy()

    # add lat and lon to split dataframes
    coord_lst = ['lat', 'lon']
    for coord in coord_lst:
        df_OBS_final[coord] = selected_rows[coord]
        df_OTH_final[coord] = selected_rows[coord]

    # rename and reorder columns
    df_OBS_final.columns = df_OBS_final.columns.str.replace('_x','')
    df_OTH_final.columns = df_OTH_final.columns.str.replace('_y','')
    # re-order
    df_OBS_final = df_OBS_final.reindex(columns=col_lst_ord)
    df_OTH_final = df_OTH_final.reindex(columns=col_lst_ord)

    df_OBS_grouped = df_OBS_final.groupby(['lat', 'lon']).agg(lambda x: list(x)).reset_index()
    df_OTH_grouped = df_OTH_final.groupby(['lat', 'lon']).agg(lambda x: list(x)).reset_index()
    
    return(df_OBS_grouped,df_OTH_grouped)

def stats_indices_v2(dataframe_obs_season, dataframe_oth_season, stat_method):
    """
    Returns xarray containing the output of statistical analysis
    :dataframe_obs_season: pandas dataframe
    :dataframe_oth_season: pandas dataframe
    :stat_method: string of statistical test
    :return: xarray
    """
    col_lst = ['r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']
    
    # Correct the coordinates mismatch
    matchCoord_dfs = setting_dataframes(dataframe_obs_season, dataframe_oth_season)
    
    # Set the dataset taking output from function
    groupsOBS = matchCoord_dfs[0]   
    groupsOTH = matchCoord_dfs[1]
    
    ############################################################
    # Stats are based on OTHER compared to OBSERVATION
    # OBSERVATION = HadUK-Grid / OTHER = GPM-IMERG OR ERA5
    ############################################################
    # Set values ready for iteration inside dataframe
    df_OBS_rows = groupsOBS.shape[0]
    df_OBS_cols = groupsOBS.shape[1]

    # Create df to store the results of the statistical test
    df_stats = pd.DataFrame(columns=col_lst)

    # Loop through dataframes getting list of values ready for statistic evaluation
    for i in range(0,df_OBS_rows):
        temp_lst = []
        for j in range(2,df_OBS_cols): # indices start from column 3 in the dataframe        
            # set list of index to be removed
            OBS_toRemove_idx = []
            OTH_toRemove_idx = []
            # check if nan value exist and get index
            obs_data = np.array(groupsOBS.iat[i,j])
            oth_data = np.array(groupsOTH.iat[i,j])
            OBS_toRemove = np.argwhere(np.isnan(obs_data))
            OTH_toRemove = np.argwhere(np.isnan(oth_data))
            # Calculate stats
            # Only compute the statistic when both series have fewer than 5 NaN values
            if (len(OBS_toRemove) < 5) and (len(OTH_toRemove) < 5):
                # replace NaN values with the average of the list ignoring NaN values
                obs_data[np.isnan(obs_data)] = np.nanmean(obs_data)
                oth_data[np.isnan(oth_data)] = np.nanmean(oth_data)
        
                # calculate the required stats
                if stat_method == 'pearson':
                    pearson = stats.pearsonr(obs_data, oth_data)
                    temp_lst.append(pearson[0])
                if stat_method == 'spearman':
                    spearm = stats.spearmanr(obs_data, oth_data)
                    temp_lst.append(spearm[0])
                if stat_method == 'wm':
                    wm = stats.mannwhitneyu(obs_data, oth_data)
                    temp_lst.append(wm[1])
                if stat_method == 'levene':
                    levene = stats.levene(obs_data, oth_data, center='median')
                    temp_lst.append(levene[1])
            else:
                # Too many NaN values, so the statistic cannot be calculated - NaN assigned
                temp_lst.append(np.nan)
        # add data to final dataframe
        df_stats.loc[len(df_stats)] = temp_lst

    # Add the coordinates back ready for xarray
    df_stats['lat'] = groupsOBS['lat']
    df_stats['lon'] = groupsOBS['lon']

    # Reorder the columns and reset index
    df_statsFinal = df_stats.reindex(columns= ['lat', 'lon', 'r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot'])
    df_statsFinal = df_statsFinal.reset_index()
    df_statsFinal = df_statsFinal.drop(['index'], axis = 1)

    # Convert the stats dataframe back to an xarray object
    final_output = df_statsFinal.set_index(['lat', 'lon']).to_xarray()
    
    return (final_output)

def reference_dataframes(OBS_dataframe, OTH_dataframe):
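    """
    Creates the OTH dataframe restricted to the cells with HadUK-Grid data (UK only)
    :OBS_dataframe: pandas dataframe (HadUK-Grid)
    :OTH_dataframe: pandas dataframe (HadUK-Grid, GPM-IMERG or ERA5)
    :return: pandas dataframe grouped by lat and lon
    """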
    col_lst = ['r10mm_x', 'r20mm_x', 'cdd_x', 'cwd_x', 'sdii_x', 'rx1day_x', 'rx5day_x', 'prcptot_x', 'r95ptot_x']
    col_lst_ord = ['lat', 'lon', 'r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']

    # Add date as string to help with inner join df_HAD_DJF,df_HAD_DJF
    df_OBS = OBS_dataframe.copy()
    df_OBS['tiempo'] = OBS_dataframe['time'].apply(lambda x: x.strftime('%Y%m%d'))
    df_OTH = OTH_dataframe.copy()
    df_OTH['tiempo'] = OTH_dataframe['time'].apply(lambda x: x.strftime('%Y%m%d'))

    # Inner join of dataframe - make sure we are looking at the same set of coordinates
    new_df = pd.merge(df_OBS, df_OTH, how='inner', on=['lat','lon','tiempo'])

    # Drop rows where all obs (HadUK-Grid) indices are NaN - this makes sure we are looking at the UK extent
    selected_rows = new_df.dropna(subset=col_lst, how='all', axis=0).reset_index()

    # Split the joined dataframe into OBS (HAD) and OTH (GPM or ERA)
    df_OTH_split = selected_rows.filter(regex='_y')

    # create copy of dataframe to avoid the slice error
    df_OTH_final = df_OTH_split.copy()

    # add lat and lon to split dataframes
    coord_lst = ['lat', 'lon']
    for coord in coord_lst:
        df_OTH_final[coord] = selected_rows[coord]

    # rename and reorder columns
    df_OTH_final.columns = df_OTH_final.columns.str.replace('_y','')
    # re-order
    df_OTH_final = df_OTH_final.reindex(columns=col_lst_ord)

    df_OTH_grouped = df_OTH_final.groupby(['lat', 'lon']).agg(lambda x: list(x)).reset_index()

    return(df_OTH_grouped)

def indices_avg(OBS_dataframe, OTH_dataframe):
    """
    Returns xarray containing the output of the average analysis
    :OBS_dataframe: pandas dataframe
    :OTH_dataframe: pandas dataframe
    :return: xarray
    """
    col_lst = ['r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']
    # Correct the coordinates mismatch
    matchCoord_dfs = reference_dataframes(OBS_dataframe, OTH_dataframe)
    
    # Set the dataset taking output from function
    groupsOTH = matchCoord_dfs

    # Set values ready for iteration inside dataframe
    df_OTH_rows = groupsOTH.shape[0]
    df_OTH_cols = groupsOTH.shape[1]

    # Create df to store the averaged indices
    df_stats = pd.DataFrame(columns= col_lst)

    # Loop through dataframes getting list of values ready for statistic evaluation
    for i in range(0,df_OTH_rows):
        temp_lst = []
        for j in range(2,df_OTH_cols):  # indices start from column 3 in the dataframe  
            # set list of index to be removed
            OTH_toRemove_idx = []
            # check if nan value exist and get index
            OTH_toRemove = np.argwhere(np.isnan(groupsOTH.iat[i,j]))

            # Calculate stats
            # add NaN if more than 10 values in the list are NaN (over half of the data)
            if len(OTH_toRemove) > 10:
                temp_lst.append(np.nan)
            else:
                # create dataset for stats analysis
                OTH_data = groupsOTH.iat[i,j]
                temp_lst.append(np.nanmean(OTH_data))
        # add data to final dataframe
        df_stats.loc[len(df_stats)] = temp_lst
    # Add the coordinates back ready for xarray
    df_stats['lat'] = groupsOTH['lat']
    df_stats['lon'] = groupsOTH['lon']

    # Reorder the columns and reset index
    df_statsFinal = df_stats.reindex(columns= ['lat', 'lon', 'r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot'])
    df_statsFinal = df_statsFinal.reset_index()
    df_statsFinal = df_statsFinal.drop(['index'], axis = 1)

    # Convert the averages dataframe back to an xarray object
    final_output = df_statsFinal.set_index(['lat', 'lon']).to_xarray()

    return (final_output)

print(sep)
print('Functions set')
print(sep)

* **Convert the xarray data to pandas dataframes to carry out the statistical analysis**

<div class="alert alert-block alert-warning">
 <p style="color:black"> <b>NOTE: GPM-IMERG and ERA5 need to be compared against HadUK-Grid</b></p>
</div>
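
The day-count indices (`r10mm`, `r20mm`, `cdd`, `cwd`) are stored as `timedelta64` values, and `season_dataframe` converts them to float days while keeping missing values as NaN. A minimal sketch of that conversion on a toy array (the helper name `timedelta_to_days` is hypothetical and not part of the notebook):

```python
import numpy as np

def timedelta_to_days(values):
    """Convert timedelta64 values to float days, mapping NaT to NaN."""
    values = np.asarray(values)
    # Dividing by one day gives a float; NaT entries are set to NaN explicitly
    return np.where(np.isnat(values), np.nan, values / np.timedelta64(1, 'D'))

# Toy data: 3 days, a missing value and 10 days
toy = np.array([3, 0, 10], dtype='timedelta64[D]')
toy[1] = np.timedelta64('NaT')
days = timedelta_to_days(toy)
print(days)
```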

In [None]:
# Create the season dataset containing HADUK-Grid, ERA5 and GPM-IMERG data 
#  the order of the function return is (df_era_season, df_gpm_season, df_had_season)
df_DJF_season = season_dataframe(HAD_dataset_DJF, GPM_dataset_DJF, ERA_dataset_DJF)
df_MAM_season = season_dataframe(HAD_dataset_MAM, GPM_dataset_MAM, ERA_dataset_MAM)
df_JJA_season = season_dataframe(HAD_dataset_JJA, GPM_dataset_JJA, ERA_dataset_JJA)
df_SON_season = season_dataframe(HAD_dataset_SON, GPM_dataset_SON, ERA_dataset_SON)

# DJF Season ONLY dataframe
df_ERA_DJF = df_DJF_season[0]
df_GPM_DJF = df_DJF_season[1]
df_HAD_DJF = df_DJF_season[2]

# MAM Season ONLY dataframe
df_ERA_MAM = df_MAM_season[0]
df_GPM_MAM = df_MAM_season[1]
df_HAD_MAM = df_MAM_season[2]

# JJA Season ONLY dataframe
df_ERA_JJA = df_JJA_season[0]
df_GPM_JJA = df_JJA_season[1]
df_HAD_JJA = df_JJA_season[2]

# SON Season ONLY dataframe
df_ERA_SON = df_SON_season[0]
df_GPM_SON = df_SON_season[1]
df_HAD_SON = df_SON_season[2]

print(sep)
print('All season datasets have been set')
print(sep)

<div class="alert alert-block alert-warning">
 <p style="color:black"> <b>Below is a quick check for NaN values. We are checking that the dataframes produced from the computed indices nc files are not storing only NaN values</b><br><i>This only checks whether every row contains a NaN value. Further NaN checks need to be made later in the process</i></p>
</div>
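
A minimal sketch of this kind of check, assuming the dataframes are kept in a dictionary rather than looked up by variable name. The helper `has_complete_rows` and the toy dataframes are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd

def has_complete_rows(df):
    """True when at least one row has no NaN in any column."""
    return bool(df.notna().all(axis=1).any())

# Toy data: one usable dataframe and one where every row contains a NaN
frames = {
    "toy_ok": pd.DataFrame({"lat": [50.0, 50.1], "r10mm": [1.0, np.nan]}),
    "toy_empty": pd.DataFrame({"lat": [50.0, 50.1], "r10mm": [np.nan, np.nan]}),
}
only_nan = [name for name, df in frames.items() if not has_complete_rows(df)]
print(only_nan)
```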

In [None]:
# Checking if there is a dataframe with all values as NaN
df_names = ["df_ERA_DJF", "df_GPM_DJF", "df_HAD_DJF",
            "df_ERA_MAM", "df_GPM_MAM", "df_HAD_MAM",
            "df_ERA_JJA", "df_GPM_JJA", "df_HAD_JJA",
            "df_ERA_SON", "df_GPM_SON", "df_HAD_SON"]

dfNames_withOnlyNan = []
# This loop uses the name in the df_names list and calls the local variable
for name in df_names:
    df = locals()[name]
    # Keep only the rows that have no NaN value in any column
    selected_rows = df[~df.isnull().any(axis=1)]
    # shape is a tuple, so the number of rows is shape[0]
    if selected_rows.shape[0] == 0:
        dfNames_withOnlyNan.append(name)
if not dfNames_withOnlyNan:
    print(sep)
    print('All dataframes have numeric values that can be computed')
    print(sep)
else:
    print(sep)
    print('The following dataframes are entirely built with NaN values: {}'.format(dfNames_withOnlyNan))
    print(sep)

#### Pearson correlation coefficient (r)
The function can be seen [here](#func_stats)

In [None]:
# Calculation (r) for each season
print(sep)
print('Calculating Pearson correlation...')
print(sep)

# DJF Season ONLY dataframe
print('Doing DJF...')
pearson_HAD_GPM_DJF = stats_indices_v2(df_HAD_DJF, df_GPM_DJF, 'pearson')
pearson_HAD_ERA_DJF = stats_indices_v2(df_HAD_DJF, df_ERA_DJF, 'pearson')
print('Doing MAM...')
# MAM Season ONLY dataframe
pearson_HAD_GPM_MAM = stats_indices_v2(df_HAD_MAM, df_GPM_MAM, 'pearson')
pearson_HAD_ERA_MAM = stats_indices_v2(df_HAD_MAM, df_ERA_MAM, 'pearson')
print('Doing JJA...')
# JJA Season ONLY dataframe
pearson_HAD_GPM_JJA = stats_indices_v2(df_HAD_JJA, df_GPM_JJA, 'pearson')
pearson_HAD_ERA_JJA = stats_indices_v2(df_HAD_JJA, df_ERA_JJA, 'pearson')
print('Doing SON...')
# SON Season ONLY dataframe
pearson_HAD_GPM_SON = stats_indices_v2(df_HAD_SON, df_GPM_SON, 'pearson')
pearson_HAD_ERA_SON = stats_indices_v2(df_HAD_SON, df_ERA_SON, 'pearson')

print('Saving outputs of correlation as .nc files')
savingFile('pearson_HAD_GPM_DJF.nc', pearson_HAD_GPM_DJF)
savingFile('pearson_HAD_ERA_DJF.nc', pearson_HAD_ERA_DJF)
savingFile('pearson_HAD_GPM_MAM.nc', pearson_HAD_GPM_MAM)
savingFile('pearson_HAD_ERA_MAM.nc', pearson_HAD_ERA_MAM)
savingFile('pearson_HAD_GPM_JJA.nc', pearson_HAD_GPM_JJA)
savingFile('pearson_HAD_ERA_JJA.nc', pearson_HAD_ERA_JJA)
savingFile('pearson_HAD_GPM_SON.nc', pearson_HAD_GPM_SON)
savingFile('pearson_HAD_ERA_SON.nc', pearson_HAD_ERA_SON)


print(sep)
print('Finished calculating Pearson correlations...')
print(sep)

#### Spearman correlation coefficient (r)
The function can be seen [here](#func_stats)
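
For intuition, the Spearman coefficient is the Pearson coefficient computed on the ranks of the data, so it measures monotonic rather than linear association. A minimal sketch under that definition (the function name `spearman_rho` and the toy series are assumptions; the notebook itself calls `scipy.stats.spearmanr`):

```python
import numpy as np
import pandas as pd

def spearman_rho(x, y):
    """Spearman coefficient as the Pearson correlation of average ranks."""
    rank_x = pd.Series(x).rank(method="average").to_numpy()
    rank_y = pd.Series(y).rank(method="average").to_numpy()
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Toy data: a monotonic but non-linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3
rho = spearman_rho(x, y)
r_linear = float(np.corrcoef(x, y)[0, 1])
print(rho, r_linear)
```

In this toy case the ranks agree exactly, so the rank-based coefficient reaches 1 while the linear coefficient stays below 1.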

In [None]:
# Calculation (r) for each season
print(sep)
print('Calculating Spearman correlation...')
print(sep)
print('Doing DJF...')
# DJF Season ONLY dataframe
spearman_HAD_GPM_DJF = stats_indices_v2(df_HAD_DJF, df_GPM_DJF, 'spearman')
spearman_HAD_ERA_DJF = stats_indices_v2(df_HAD_DJF, df_ERA_DJF, 'spearman')
print('Doing MAM...')
# MAM Season ONLY dataframe
spearman_HAD_GPM_MAM = stats_indices_v2(df_HAD_MAM, df_GPM_MAM, 'spearman')
spearman_HAD_ERA_MAM = stats_indices_v2(df_HAD_MAM, df_ERA_MAM, 'spearman')
print('Doing JJA...')
# JJA Season ONLY dataframe
spearman_HAD_GPM_JJA = stats_indices_v2(df_HAD_JJA, df_GPM_JJA, 'spearman')
spearman_HAD_ERA_JJA = stats_indices_v2(df_HAD_JJA, df_ERA_JJA, 'spearman')
print('Doing SON...')
# SON Season ONLY dataframe
spearman_HAD_GPM_SON = stats_indices_v2(df_HAD_SON, df_GPM_SON, 'spearman')
spearman_HAD_ERA_SON = stats_indices_v2(df_HAD_SON, df_ERA_SON, 'spearman')

print('Saving outputs of spearman correlation as .nc files')
savingFile('spearman_HAD_GPM_DJF.nc', spearman_HAD_GPM_DJF)
savingFile('spearman_HAD_ERA_DJF.nc', spearman_HAD_ERA_DJF)
savingFile('spearman_HAD_GPM_MAM.nc', spearman_HAD_GPM_MAM)
savingFile('spearman_HAD_ERA_MAM.nc', spearman_HAD_ERA_MAM)
savingFile('spearman_HAD_GPM_JJA.nc', spearman_HAD_GPM_JJA)
savingFile('spearman_HAD_ERA_JJA.nc', spearman_HAD_ERA_JJA)
savingFile('spearman_HAD_GPM_SON.nc', spearman_HAD_GPM_SON)
savingFile('spearman_HAD_ERA_SON.nc', spearman_HAD_ERA_SON)

print(sep)
print('Finished calculating Spearman correlations...')
print(sep)

#### Mann-Whitney U test
The function can be seen [here](#func_stats)

In [None]:
# Calculation (p) for each season
print(sep)
print('Calculating Mann-Whitney...')
print(sep)
print('Doing DJF...')
# DJF Season ONLY dataframe
MW_HAD_GPM_DJF = stats_indices_v2(df_HAD_DJF, df_GPM_DJF, 'wm')
MW_HAD_ERA_DJF = stats_indices_v2(df_HAD_DJF, df_ERA_DJF, 'wm')
print('Doing MAM...')
# MAM Season ONLY dataframe
MW_HAD_GPM_MAM = stats_indices_v2(df_HAD_MAM, df_GPM_MAM, 'wm')
MW_HAD_ERA_MAM = stats_indices_v2(df_HAD_MAM, df_ERA_MAM, 'wm')
print('Doing JJA...')
# JJA Season ONLY dataframe
MW_HAD_GPM_JJA = stats_indices_v2(df_HAD_JJA, df_GPM_JJA, 'wm')
MW_HAD_ERA_JJA = stats_indices_v2(df_HAD_JJA, df_ERA_JJA, 'wm')
print('Doing SON...')
# SON Season ONLY dataframe
MW_HAD_GPM_SON = stats_indices_v2(df_HAD_SON, df_GPM_SON, 'wm')
MW_HAD_ERA_SON = stats_indices_v2(df_HAD_SON, df_ERA_SON, 'wm')

print('Saving outputs of Mann-Whitney as .nc files')
savingFile('MW_HAD_GPM_DJF.nc', MW_HAD_GPM_DJF)
savingFile('MW_HAD_ERA_DJF.nc', MW_HAD_ERA_DJF)
savingFile('MW_HAD_GPM_MAM.nc', MW_HAD_GPM_MAM)
savingFile('MW_HAD_ERA_MAM.nc', MW_HAD_ERA_MAM)
savingFile('MW_HAD_GPM_JJA.nc', MW_HAD_GPM_JJA)
savingFile('MW_HAD_ERA_JJA.nc', MW_HAD_ERA_JJA)
savingFile('MW_HAD_GPM_SON.nc', MW_HAD_GPM_SON)
savingFile('MW_HAD_ERA_SON.nc', MW_HAD_ERA_SON)

print(sep)
print('Finished calculating Mann-Whitney...')
print(sep)

#### Levene test based on the median

In [None]:
# Calculation (p) for each season
print(sep)
print('Calculating Levene...')
print(sep)
print('Doing DJF...')
# DJF Season ONLY dataframe
levene_HAD_GPM_DJF = stats_indices_v2(df_HAD_DJF, df_GPM_DJF, 'levene')
levene_HAD_ERA_DJF = stats_indices_v2(df_HAD_DJF, df_ERA_DJF, 'levene')
print('Doing MAM...')
# MAM Season ONLY dataframe
levene_HAD_GPM_MAM = stats_indices_v2(df_HAD_MAM, df_GPM_MAM, 'levene')
levene_HAD_ERA_MAM = stats_indices_v2(df_HAD_MAM, df_ERA_MAM, 'levene')
print('Doing JJA...')
# JJA Season ONLY dataframe
levene_HAD_GPM_JJA = stats_indices_v2(df_HAD_JJA, df_GPM_JJA, 'levene')
levene_HAD_ERA_JJA = stats_indices_v2(df_HAD_JJA, df_ERA_JJA, 'levene')
print('Doing SON...')
# SON Season ONLY dataframe
levene_HAD_GPM_SON = stats_indices_v2(df_HAD_SON, df_GPM_SON, 'levene')
levene_HAD_ERA_SON = stats_indices_v2(df_HAD_SON, df_ERA_SON, 'levene')

print('Saving outputs of Levene as .nc files')
savingFile('levene_HAD_GPM_DJF.nc', levene_HAD_GPM_DJF)
savingFile('levene_HAD_ERA_DJF.nc', levene_HAD_ERA_DJF)
savingFile('levene_HAD_GPM_MAM.nc', levene_HAD_GPM_MAM)
savingFile('levene_HAD_ERA_MAM.nc', levene_HAD_ERA_MAM)
savingFile('levene_HAD_GPM_JJA.nc', levene_HAD_GPM_JJA)
savingFile('levene_HAD_ERA_JJA.nc', levene_HAD_ERA_JJA)
savingFile('levene_HAD_GPM_SON.nc', levene_HAD_GPM_SON)
savingFile('levene_HAD_ERA_SON.nc', levene_HAD_ERA_SON)

print(sep)
print('Finished calculating Levene...')
print(sep)

#### Precipitation indices average for 2001 to 2019
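
The averaging step in `indices_avg` can be sketched as a mean that ignores NaN values but returns NaN when too many values are missing. The function name `mean_with_nan_limit` is hypothetical; the default limit of 10 mirrors the threshold used in `indices_avg`.

```python
import numpy as np

def mean_with_nan_limit(values, max_nan=10):
    """Mean ignoring NaN values, or NaN when more than max_nan values are missing."""
    values = np.asarray(values, dtype=float)
    if np.isnan(values).sum() > max_nan:
        return np.nan
    return float(np.nanmean(values))

# Toy data: one series with a single gap and one that is mostly missing
mostly_valid = mean_with_nan_limit([1.0, 2.0, np.nan, 3.0])
mostly_missing = mean_with_nan_limit([np.nan] * 11 + [5.0] * 8)
print(mostly_valid, mostly_missing)
```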

In [None]:
# Calculation indices average for the entire time period
print(sep)
print('Calculating indices average...')
print(sep)
print('Doing DJF...')
# DJF Season ONLY dataframe indices_avg(OBS_dataframe, OTH_dataframe)
IndAvg_HAD_DJF = indices_avg(df_HAD_DJF,df_HAD_DJF)
IndAvg_GPM_DJF = indices_avg(df_HAD_DJF,df_GPM_DJF) 
IndAvg_ERA_DJF = indices_avg(df_HAD_DJF,df_ERA_DJF) 
print('Doing MAM...')
# MAM Season ONLY dataframe
IndAvg_HAD_MAM = indices_avg(df_HAD_MAM,df_HAD_MAM)
IndAvg_GPM_MAM = indices_avg(df_HAD_MAM,df_GPM_MAM) 
IndAvg_ERA_MAM = indices_avg(df_HAD_MAM,df_ERA_MAM) 
print('Doing JJA...')
# JJA Season ONLY dataframe
IndAvg_HAD_JJA = indices_avg(df_HAD_JJA,df_HAD_JJA)
IndAvg_GPM_JJA = indices_avg(df_HAD_JJA,df_GPM_JJA) 
IndAvg_ERA_JJA = indices_avg(df_HAD_JJA,df_ERA_JJA) 
print('Doing SON...')
# SON Season ONLY dataframe
IndAvg_HAD_SON = indices_avg(df_HAD_SON,df_HAD_SON)
IndAvg_GPM_SON = indices_avg(df_HAD_SON,df_GPM_SON) 
IndAvg_ERA_SON = indices_avg(df_HAD_SON,df_ERA_SON) 

print('Saving outputs of indices average as .nc files')
savingFile('IndAvg_HAD_DJF.nc', IndAvg_HAD_DJF)
savingFile('IndAvg_GPM_DJF.nc', IndAvg_GPM_DJF)
savingFile('IndAvg_ERA_DJF.nc', IndAvg_ERA_DJF)
savingFile('IndAvg_HAD_MAM.nc', IndAvg_HAD_MAM)
savingFile('IndAvg_GPM_MAM.nc', IndAvg_GPM_MAM)
savingFile('IndAvg_ERA_MAM.nc', IndAvg_ERA_MAM)
savingFile('IndAvg_HAD_JJA.nc', IndAvg_HAD_JJA)
savingFile('IndAvg_GPM_JJA.nc', IndAvg_GPM_JJA)
savingFile('IndAvg_ERA_JJA.nc', IndAvg_ERA_JJA)
savingFile('IndAvg_HAD_SON.nc', IndAvg_HAD_SON)
savingFile('IndAvg_GPM_SON.nc', IndAvg_GPM_SON)
savingFile('IndAvg_ERA_SON.nc', IndAvg_ERA_SON)


print(sep)
print('Finished calculating indices average...')
print(sep)

# Creating graphs and Maps

* Colour functions source: https://towardsdatascience.com/beautiful-custom-colormaps-with-matplotlib-5bab3d1f0e72

In [None]:
def hex_to_rgb(value):
    '''
    Converts hex to rgb colours
    value: string of 6 characters representing a hex colour.
    Returns: tuple (length 3) of RGB values'''
    value = value.strip("#") # removes hash symbol if present
    lv = len(value)
    return tuple(int(value[i:i + lv // 3], 16) for i in range(0, lv, lv // 3))


def rgb_to_dec(value):
    '''
    Converts rgb to decimal colours (i.e. divides each value by 256)
    value: list (length 3) of RGB values
    Returns: list (length 3) of decimal values'''
    return [v/256 for v in value]

def get_continuous_cmap(hex_list, float_list=None):
    ''' creates and returns a color map that can be used in heat map figures.
        If float_list is not provided, colour map graduates linearly between each color in hex_list.
        If float_list is provided, each color in hex_list is mapped to the respective location in float_list. 
        
        Source
        ----------
        https://towardsdatascience.com/beautiful-custom-colormaps-with-matplotlib-5bab3d1f0e72
        
        Parameters
        ----------
        hex_list: list of hex code strings
        float_list: list of floats between 0 and 1, same length as hex_list. Must start with 0 and end with 1.
        
        Returns
        ----------
        colour map'''
    
    rgb_list = [rgb_to_dec(hex_to_rgb(i)) for i in hex_list]
    if float_list:
        pass
    else:
        float_list = list(np.linspace(0,1,len(rgb_list)))
        
    cdict = dict()
    for num, col in enumerate(['red', 'green', 'blue']):
        col_list = [[float_list[i], rgb_list[i][num], rgb_list[i][num]] for i in range(len(float_list))]
        cdict[col] = col_list
    cmp = mcolors.LinearSegmentedColormap('my_cmp', segmentdata=cdict, N=256)
    return cmp

<a id='func_plots'></a>
## Functions to be used for data plotting

In [None]:
def setting_map(prcp_index, season_dataset, graph_title, season, legend_text, row_num, col_num, vmin, vmax):
    """
    Draws one precipitation index on a Basemap panel of the existing fig and axs objects
    :prcp_index: string
    :season_dataset: xarray dataset
    :graph_title: string
    :season: string
    :legend_text: string
    :row_num: integer
    :col_num: integer
    :vmin: number
    :vmax: number
    :return: Basemap instance drawn on axs[row_num, col_num]
    """
    mm = Basemap(resolution='i',projection='merc',ellps='WGS84',llcrnrlat=49,urcrnrlat=61,llcrnrlon=-9,urcrnrlon=3,lat_ts=20, ax=axs[row_num,col_num])
    lons = season_dataset.variables['lon'][:]
    lats = season_dataset.variables['lat'][:]
    ext_ind = season_dataset.variables[prcp_index][:]
    lon, lat = np.meshgrid(lons, lats)
    xi, yi = mm(lon, lat)
    hex_list = ['#f6eff7', '#c0e0d4', '#8bcdc3', '#4cb9c3', '#3d9ebe', '#3883b6', '#3369ac', '#2d4ea0', '#253494']

    cs = mm.pcolor(xi,yi,np.squeeze(ext_ind ),shading='auto', vmin=vmin, vmax=vmax, cmap=get_continuous_cmap(hex_list))
    fig.colorbar(cs, ax=axs[row_num,col_num], shrink=0.8, pad=0.05, label=legend_text, orientation = 'horizontal')

    # add shp file as coastline
    mm.readshapefile('/mnt/d/MRes_dataset/active_data/101_admin/uk_admin_boundary_py_nasa_pp_countryOutlineFromGiovanni', 'uk_boundary')

    # draw parallels and meridians.
    # Mercator
    mm.drawparallels(np.arange(-40,61.,2.),labels=[True, False, False, True])
    mm.drawmeridians(np.arange(-20.,21.,2.),labels=[True, False, False, True])
    
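    # set title
    # format the index name used in the title
    if prcp_index in ['r10mm', 'r20mm','r95p']:
        title_text = prcp_index.capitalize()
    elif prcp_index in ['cdd', 'cwd', 'sdii']:
        title_text = prcp_index.upper()
    elif prcp_index in ['rx1day', 'rx5day']:
        title_text = prcp_index[0:2].upper() + prcp_index[2:]
    elif prcp_index == 'prcptot':
        title_text = prcp_index[0:4].upper() + prcp_index[4:]
    elif prcp_index in ['r95ptot', 'r99ptot']:
        title_text = prcp_index[:4].capitalize() + prcp_index[4:].upper()
    else:
        # fall back to the raw name so title_text is always defined
        title_text = prcp_index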
        
    graph_title = graph_title + '\n' + season + ' - ' + title_text
    axs[row_num,col_num].set_title(graph_title)
    
    return(mm)

def season_mean_UK(season_dataset,lst_seasonNames):
    """
    Returns pandas dataframe containing the seasonal mean of each index
    :season_dataset: xarray
    :lst_seasonNames: list of strings
    :return: pandas dataframe
    """
    
    col_lst = ['r10mm_x', 'r20mm_x', 'cdd_x', 'cwd_x', 'sdii_x', 'rx1day_x', 'rx5day_x', 'prcptot_x', 'r99ptot_x']
    col_lst_ord = ['season', 'r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r99ptot']

    # Seasonal mean of the observations (HadUK-Grid) and of the dataset being compared
    arr_averageSeasonOBS = HAD_dataset_season.groupby('time.season').mean(dim='time')
    df_averageSeasonOBS = arr_averageSeasonOBS.to_dataframe().reset_index()

    arr_averageSeasonOTH = season_dataset.groupby('time.season').mean(dim='time')
    df_averageSeasonOTH = arr_averageSeasonOTH.to_dataframe().reset_index()

    # Change all timedelta64 (x days) for only the day
    indices_lst = ['r10mm', 'r20mm', 'cdd', 'cwd']
    for prc_ind in indices_lst:
        df_averageSeasonOBS[prc_ind] = np.where(np.isnan(df_averageSeasonOBS[prc_ind] ),np.nan, (df_averageSeasonOBS[prc_ind]  / np.timedelta64(1, 'D')))
        df_averageSeasonOTH[prc_ind]  = np.where(np.isnan(df_averageSeasonOTH[prc_ind] ),np.nan, (df_averageSeasonOTH[prc_ind]  / np.timedelta64(1, 'D')))

    # Join both datasets on lat, lon and season; together with the NaN filter below,
    # this restricts the analysis to the UK extent of the obs dataset (HadUK-Grid)
    new_df = pd.merge(df_averageSeasonOBS, df_averageSeasonOTH, how='inner', on=['lat','lon','season'])

    # Select rows without NaN values - Only looking at the indices of the obs dataset (HadUK-Grid)
    # Needed for the statistical analysis. It only removes a row if all values are NaN
    selected_rows = new_df.dropna(subset=col_lst, how='all', axis=0).reset_index()

    # Split the joined dataframe into OBS (HAD) and OTH (GPM or ERA)
    df_OTH_split = selected_rows.filter(regex='_y')

    # create copy of dataframe to avoid the slice error
    df_OTH_final = df_OTH_split.copy()

    # add season to split 
    df_OTH_final['season'] = selected_rows['season']

    # rename and reorder columns
    df_OTH_final.columns = df_OTH_final.columns.str.replace('_y','')
    # re-order
    df_OTH_final = df_OTH_final.reindex(columns=col_lst_ord)

    # Get average by season for all indices
    df_averageSeason_final = df_OTH_final.groupby('season').agg('mean')
    return(df_averageSeason_final.reindex(lst_seasonNames).reset_index())

def year_mean_UK(season_dataset, dataset_name):
    """
    Returns pandas dataframe containing the yearly mean of each index
    :season_dataset: xarray
    :dataset_name: string ('HAD' for HadUK-Grid, any other value for ERA or GPM)
    :return: pandas dataframe
    """
    col_lst = ['r10mm_x', 'r20mm_x', 'cdd_x', 'cwd_x', 'sdii_x', 'rx1day_x', 'rx5day_x', 'prcptot_x', 'r99ptot_x']
    col_lst_ord = ['year', 'r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r99ptot']
    yrs_lst = [*range(2001,2020,1)]

    # Yearly totals for the compared dataset and for the observations (HadUK-Grid)
    if dataset_name != 'HAD':
        arr_averageSeasonOTH = season_dataset.groupby('time.year').sum('time')
        df_averageSeasonOTH = arr_averageSeasonOTH.to_dataframe().reset_index()

        arr_averageSeasonOBS = HAD_dataset_season.groupby('time.year').sum('time')
        df_averageSeasonOBS = arr_averageSeasonOBS.to_dataframe().reset_index()

        # Change all timedelta64 (x days) for only the day
        indices_lst = ['r10mm', 'r20mm', 'cdd', 'cwd']
        for prc_ind in indices_lst:
            df_averageSeasonOBS[prc_ind] = np.where(np.isnan(df_averageSeasonOBS[prc_ind] ),np.nan, (df_averageSeasonOBS[prc_ind]  / np.timedelta64(1, 'D')))
            df_averageSeasonOTH[prc_ind]  = np.where(np.isnan(df_averageSeasonOTH[prc_ind] ),np.nan, (df_averageSeasonOTH[prc_ind]  / np.timedelta64(1, 'D')))
            
        # Join both datasets on lat, lon and year; together with the NaN filter below,
        # this restricts the analysis to the UK extent of the obs dataset (HadUK-Grid)
        new_df = pd.merge(df_averageSeasonOBS, df_averageSeasonOTH, how='inner', on=['lat','lon','year'])
        # Select rows without NaN values - Only looking at the indices of the obs dataset (HadUK-Grid)
        # Needed for the statistical analysis. It only removes a row if all values are NaN
        selected_rows = new_df.dropna(subset=col_lst, how='all', axis=0).reset_index()

        # Split the joined dataframe into OBS (HAD) and OTH (GPM or ERA)
        df_OTH_split = selected_rows.filter(regex='_y')

        # create copy of dataframe to avoid the slice error
        df_OTH_final = df_OTH_split.copy()

        # add year to split 
        df_OTH_final['year'] = selected_rows['year']

        # rename and reorder columns
        df_OTH_final.columns = df_OTH_final.columns.str.replace('_y','')
        # re-order
        df_OTH_final = df_OTH_final.reindex(columns=col_lst_ord)
        # Get average by year for all indices
        df_averageYear = df_OTH_final.groupby('year').agg('mean')
#         df_averageYear = df_averageYear.replace(np.nan, '', regex=True)
        df_averageYear = df_averageYear.iloc[1:]

    else:
        # Remove zero as this affects the average
        arr_averageSeasonOBS = season_dataset.groupby('time.year').sum('time')
        df_averageSeasonOBS = arr_averageSeasonOBS.to_dataframe().reset_index()
        # Change all timedelta64 (x days) for only the day
        indices_lst = ['r10mm', 'r20mm', 'cdd', 'cwd']
        for prc_ind in indices_lst:
            df_averageSeasonOBS[prc_ind] = np.where(np.isnan(df_averageSeasonOBS[prc_ind] ),np.nan, (df_averageSeasonOBS[prc_ind]  / np.timedelta64(1, 'D')))
        
        # For HadUK-Grid replace zero with NaN to avoid using zero in the mean value
        df_averageSeasonOBS = df_averageSeasonOBS.replace(0, np.nan)

        # Drop unnecessary columns
        df_averageSeasonOBS.drop(['lat', 'lon', 'percentiles'], axis=1, inplace=True)
        # Get average by year for all indices
        df_averageYear = df_averageSeasonOBS.groupby('year').agg('mean')
#         df_averageYear = df_averageYear.replace(np.nan, '', regex=True)
        df_averageYear = df_averageYear.iloc[1:]

    return(df_averageYear.reindex(yrs_lst).reset_index().round(2))
 
def indices_plot(ERAdataframe_seasonAverage, GPMdataframe_seasonAverage, HADdataframe_seasonAverage, prcp_index):
    """
    Returns tuple with one list of values per dataset (ERA, GPM, HAD)
    :ERAdataframe_seasonAverage: pandas dataframe
    :GPMdataframe_seasonAverage: pandas dataframe
    :HADdataframe_seasonAverage: pandas dataframe
    :prcp_index: string
    :return: tuple of lists
    """
    #setting the data
    era_index = ERAdataframe_seasonAverage[prcp_index].tolist()
    gpm_index = GPMdataframe_seasonAverage[prcp_index].tolist()
    had_index = HADdataframe_seasonAverage[prcp_index].tolist()
    
    # This deals with NaN values in 2019
    if prcp_index in ['r10mm', 'r20mm', 'cdd', 'cwd']:
        era_index = era_index[:-1]
        gpm_index = gpm_index[:-1]
        had_index = had_index[:-1]
        
    return(era_index, gpm_index, had_index)

def plot_setup(subplot_ref,lst_yrs, y_label, x_label):
    """
    Formats a matplotlib subplot in place (ticks, grid, axis labels, text box and legend)
    :subplot_ref: matplotlib axes
    :lst_yrs: list of integers
    :y_label: string
    :x_label: string
    :return: None
    """
    # Set the tick positions
    subplot_ref.set_xticks(lst_yrs)
    # Set the tick labels
    subplot_ref.set_xticklabels(lst_yrs)
    subplot_ref.xaxis.set_tick_params(labelsize='x-large')
    subplot_ref.yaxis.set_tick_params(labelsize='x-large')
    # Set title and axis
    subplot_ref.grid()
    subplot_ref.set_ylabel(y_label, fontdict={'fontsize': 18, 'fontweight': 'normal'})
    subplot_ref.set_xlabel(x_label, fontdict={'fontsize': 18, 'fontweight': 'normal'})
    # Set text (`label` is not a parameter: it must be defined at module level before calling)
    subplot_ref.text(0.1, 0.95, label, horizontalalignment='center', verticalalignment="top",\
                  transform=subplot_ref.transAxes, fontsize='x-large', fontweight='bold',\
                  bbox=dict(facecolor='none', edgecolor='black', boxstyle='round'))
    # Set legend
    subplot_ref.legend(bbox_to_anchor=(0, 1, 1, 0), loc='lower center', fontsize='x-large', ncol=3)
    
def saving_image(subplot_ref, fldr_plot, file_name, plot_dmn=(1.0, 1.0)):
    """
    Save a single subplot as an image in the output folder
    Relies on the module-level `fig` that owns the subplot
    :subplot_ref: matplotlib axes
    :fldr_plot: pathlib folder path
    :file_name: string
    :plot_dmn: pair of floats, x and y expansion factors for the saved area (1.0 keeps the axes extent)
    """
    extent = subplot_ref.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
    # Expand the saved area around the axes, e.g. [0.85, 1.38] as used for the maps
    fig.savefig(Path(fldr_plot) / file_name, bbox_inches=extent.expanded(plot_dmn[0], plot_dmn[1]), dpi=150)
    
print(sep)
print('Mapping functions set')
print(sep)

## Creating maps of statistical outputs

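This cell maps each statistical result. Only one type of input can be plotted per run, e.g. the Pearson coefficient between HadUK-Grid and GPM-IMERG for all seasons. The title, legend, file name, output folder and input variables must be changed to match the statistic and dataset pair being plotted.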
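
Each index shares one colour bar range across the four seasonal maps, so the limits are taken over all seasons rather than per map. A minimal sketch of that idea, using plain NumPy arrays as stand-ins for the seasonal grids (the function name `shared_colour_limits` and the example values are hypothetical, not part of this notebook):

```python
import numpy as np

def shared_colour_limits(season_fields):
    """Return (vmin, vmax) across all seasonal arrays, ignoring NaN cells."""
    vmin = min(float(np.nanmin(field)) for field in season_fields.values())
    vmax = max(float(np.nanmax(field)) for field in season_fields.values())
    return vmin, vmax

# Made-up p-value grids for one index; NaN marks cells outside the UK
example_fields = {
    "DJF": np.array([[0.02, 0.40], [np.nan, 0.75]]),
    "MAM": np.array([[0.10, 0.05], [0.90, np.nan]]),
    "JJA": np.array([[0.33, 0.61], [0.01, 0.20]]),
    "SON": np.array([[np.nan, 0.50], [0.25, 0.95]]),
}

vmin, vmax = shared_colour_limits(example_fields)
print(vmin, vmax)
```

Passing the same `vmin` and `vmax` to every seasonal map keeps the colours comparable along a row of subplots.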

In [None]:
indices_lst = ['r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']
seasons_lst = ['DJF', 'MAM', 'JJA', 'SON']
# title and legend - needs to match the dataset plotted
title = 'Levene (p) value\n HADUK-Grid - ERA5'
legend = 'Levene (p)'
fileName = 'levene_HADUK-ERA_'
fldr_images = Path('/mnt/c/Users/C0060017/Documents/Taught_Material/MRes_Dissertation/Dissertation/Images/levene_ERA')


# # Restricting view to the UK - ONLY DO THIS FOR ERA AND GPM-IMERG
indices_statsUK_DJF = levene_HAD_ERA_DJF
indices_statsUK_MAM = levene_HAD_ERA_MAM
indices_statsUK_JJA = levene_HAD_ERA_JJA
indices_statsUK_SON = levene_HAD_ERA_SON

# Getting vmin and vmax for the graphical output - these set the colour bar range
# Using the xarray to find vmin and vmax for the map display
# Each max and min is selected by index rather than by season

vars = ['indices_statsUK_DJF','indices_statsUK_MAM',
        'indices_statsUK_JJA','indices_statsUK_SON']

df_cbar = pd.DataFrame(columns=['DJF_min','DJF_max','MAM_min','MAM_max',
                               'JJA_min','JJA_max','SON_min','SON_max'])
# This loop uses the name in the vars list and calls the local variable
for prcp_ind in indices_lst:
    temp_lst=[]
    for name in vars:
        min_values = locals()[name][prcp_ind].min().values
        max_values = locals()[name][prcp_ind].max().values
        temp_lst.append(min_values)
        temp_lst.append(max_values)
    # add values to dataframe
    df_cbar.loc[len(df_cbar)] = temp_lst  

# Create final dataframe
cbarFinal = pd.DataFrame(columns=['prcp_ind_cbar','vmin','vmax'])
# Add data to final dataframe
cbarFinal['prcp_ind_cbar'] = indices_lst
cbarFinal['vmin'] = df_cbar.min(axis=1)
cbarFinal['vmax'] = df_cbar.max(axis=1)
# End of Vmin and Vmax for graphical output - this is the colour bar


# Start Plotting
# set number of rows for subplots
n_rows = len(indices_lst)

# start subplots
from mpl_toolkits.axes_grid1 import make_axes_locatable
fig, axs = plt.subplots(n_rows, 4,figsize=(30,100))

# Create the plots
# Map each season to its dataset so one loop fills the four columns
season_maps = {'DJF': indices_statsUK_DJF, 'MAM': indices_statsUK_MAM,
               'JJA': indices_statsUK_JJA, 'SON': indices_statsUK_SON}
for i, prcp_ind in enumerate(indices_lst):
    vmin = cbarFinal.loc[cbarFinal.prcp_ind_cbar == prcp_ind,'vmin'].values[0]
    vmax = cbarFinal.loc[cbarFinal.prcp_ind_cbar == prcp_ind,'vmax'].values[0]
    for j, season in enumerate(seasons_lst):
        setting_map(prcp_ind, season_maps[season], title, season, legend, i, j, vmin, vmax)
        # saving the subplot
        file_name = fileName + season + '_' + prcp_ind + '.png'
        saving_image(axs[i, j], fldr_images, file_name,[0.85,1.38])
# Make sure it shows a nice layout avoiding overlapping (must run before plt.show())
plt.tight_layout()
plt.show()

## Creating maps of indices average

In [None]:
indices_lst = ['r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']
seasons_lst = ['DJF', 'MAM', 'JJA', 'SON']
# title and legend - needs to match the dataset plotted
title = 'Mean value ERA5'
fileName = 'IndicesMeanAvg_ERA_'
DATASET = 'ERA'
fldr_images = Path('/mnt/c/Users/C0060017/Documents/Taught_Material/MRes_Dissertation/Dissertation/Images/indicesMeanAvg_ERA')
    
if DATASET == 'ERA':
    indices_avgUK_DJF = IndAvg_ERA_DJF
    indices_avgUK_MAM = IndAvg_ERA_MAM
    indices_avgUK_JJA = IndAvg_ERA_JJA
    indices_avgUK_SON = IndAvg_ERA_SON
if DATASET == 'GPM':
    indices_avgUK_DJF = IndAvg_GPM_DJF
    indices_avgUK_MAM = IndAvg_GPM_MAM
    indices_avgUK_JJA = IndAvg_GPM_JJA
    indices_avgUK_SON = IndAvg_GPM_SON
if DATASET == 'HAD':  
    indices_avgUK_DJF = IndAvg_HAD_DJF
    indices_avgUK_MAM = IndAvg_HAD_MAM
    indices_avgUK_JJA = IndAvg_HAD_JJA
    indices_avgUK_SON = IndAvg_HAD_SON

# Getting vmin and vmax for the graphical output - these set the colour bar range
# Using the xarray to find vmin and vmax for the map display
# Each max and min is selected by index rather than by season

vars = ['indices_avgUK_DJF','indices_avgUK_MAM',
        'indices_avgUK_JJA','indices_avgUK_SON']

df_cbar = pd.DataFrame(columns=['DJF_min','DJF_max','MAM_min','MAM_max',
                               'JJA_min','JJA_max','SON_min','SON_max'])
# This loop uses the name in the vars list and calls the local variable
for prcp_ind in indices_lst:
    temp_lst=[]
    for name in vars:
        min_values = locals()[name][prcp_ind].min().values
        max_values = locals()[name][prcp_ind].max().values
        temp_lst.append(min_values)
        temp_lst.append(max_values)
    # add values to dataframe
    df_cbar.loc[len(df_cbar)] = temp_lst  

# Create final dataframe
cbarFinal = pd.DataFrame(columns=['prcp_ind_cbar','vmin','vmax'])
# Add data to final dataframe
cbarFinal['prcp_ind_cbar'] = indices_lst
cbarFinal['vmin'] = df_cbar.min(axis=1)
cbarFinal['vmax'] = df_cbar.max(axis=1)
# End of Vmin and Vmax for graphical output - this is the colour bar


# set number of rows for subplots
n_rows = len(indices_lst)

# start subplots
from mpl_toolkits.axes_grid1 import make_axes_locatable
fig, axs = plt.subplots(n_rows, 4, figsize=(30,100))

# Create the plots
i = 0
while i < len(indices_lst):
    j = 0
    # setting legend in bar
    if indices_lst[i] in ['r10mm', 'r20mm']:
        legend_text = indices_lst[i].capitalize() + ' (days)' 
    if indices_lst[i] in ['cdd', 'cwd']:
        legend_text = indices_lst[i].upper() + ' (days)'
    if indices_lst[i] in ['rx1day', 'rx5day']:
        legend_text = indices_lst[i][0:2].upper() + indices_lst[i][2:] + ' (mm)'
    if indices_lst[i] == 'prcptot':
        legend_text = indices_lst[i].upper() + ' (mm)'
    if indices_lst[i] in ['r95ptot', 'r99ptot']:
        legend_text = indices_lst[i][:4].capitalize() + indices_lst[i][4:].upper()  + ' (%)'
    if indices_lst[i] == 'sdii':
        legend_text = indices_lst[i].upper() + ' (mm)'
  
    # Create plots
    prcp_ind = indices_lst[i]
    vmin = cbarFinal.loc[cbarFinal.prcp_ind_cbar == prcp_ind,'vmin'].values[0]
    vmax = cbarFinal.loc[cbarFinal.prcp_ind_cbar == prcp_ind,'vmax'].values[0]
    # Map each season to its dataset so one loop fills the four columns
    season_maps = {'DJF': indices_avgUK_DJF, 'MAM': indices_avgUK_MAM,
                   'JJA': indices_avgUK_JJA, 'SON': indices_avgUK_SON}
    for j, season in enumerate(seasons_lst):
        setting_map(prcp_ind, season_maps[season], title, season, legend_text, i, j, vmin, vmax)
        # saving the subplot
        file_name = fileName + season + '_' + prcp_ind + '.png'
        saving_image(axs[i, j], fldr_images, file_name,[0.85,1.38])
    i +=1
# Make sure it shows a nice layout avoiding overlapping (must run before plt.show())
plt.tight_layout()
plt.show()

## Creating maps of difference in indices average

In [None]:
indices_lst = ['r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']
seasons_lst = ['DJF', 'MAM', 'JJA', 'SON']
# title and legend - needs to match the dataset plotted
title = 'Difference Index mean value\n HadUK-Grid - IMERG'
fileName = 'IndicesMeanAvgDifference_GPM_'
DATASET = 'GPM'
fldr_images = Path('/mnt/c/Users/C0060017/Documents/Taught_Material/MRes_Dissertation/Dissertation/Images/IndicesDiff_GPM')
    
if DATASET == 'ERA':
    diff_indAvgUK_DJF = IndAvg_HAD_DJF - IndAvg_ERA_DJF
    diff_indAvgUK_MAM = IndAvg_HAD_MAM - IndAvg_ERA_MAM
    diff_indAvgUK_JJA = IndAvg_HAD_JJA - IndAvg_ERA_JJA
    diff_indAvgUK_SON = IndAvg_HAD_SON - IndAvg_ERA_SON
if DATASET == 'GPM':
    diff_indAvgUK_DJF = IndAvg_HAD_DJF - IndAvg_GPM_DJF
    diff_indAvgUK_MAM = IndAvg_HAD_MAM - IndAvg_GPM_MAM
    diff_indAvgUK_JJA = IndAvg_HAD_JJA - IndAvg_GPM_JJA
    diff_indAvgUK_SON = IndAvg_HAD_SON - IndAvg_GPM_SON
    
# Getting vmin and vmax for the graphical output - these set the colour bar range
# Using the xarray to find vmin and vmax for the map display
# Each max and min is selected by index rather than by season

vars = ['diff_indAvgUK_DJF','diff_indAvgUK_MAM',
        'diff_indAvgUK_JJA','diff_indAvgUK_SON']

df_cbar = pd.DataFrame(columns=['DJF_min','DJF_max','MAM_min','MAM_max',
                               'JJA_min','JJA_max','SON_min','SON_max'])
# This loop uses the name in the vars list and calls the local variable
for prcp_ind in indices_lst:
    temp_lst=[]
    for name in vars:
        min_values = locals()[name][prcp_ind].min().values
        max_values = locals()[name][prcp_ind].max().values
        temp_lst.append(min_values)
        temp_lst.append(max_values)
    # add values to dataframe
    df_cbar.loc[len(df_cbar)] = temp_lst  

# Create final dataframe
cbarFinal = pd.DataFrame(columns=['prcp_ind_cbar','vmin','vmax'])
# Add data to final dataframe
cbarFinal['prcp_ind_cbar'] = indices_lst
cbarFinal['vmin'] = df_cbar.min(axis=1)
cbarFinal['vmax'] = df_cbar.max(axis=1)
# End of Vmin and Vmax for graphical output - this is the colour bar

# set number of rows for subplots
n_rows = len(indices_lst)

# start subplots
from mpl_toolkits.axes_grid1 import make_axes_locatable
fig, axs = plt.subplots(n_rows, 4, figsize=(30,100))

# Create the plots
i = 0
while i < len(indices_lst):
    j = 0
    # setting legend in bar
    if indices_lst[i] in ['r10mm', 'r20mm']:
        legend_text = indices_lst[i].capitalize() + ' (days)' 
    if indices_lst[i] in ['cdd', 'cwd']:
        legend_text = indices_lst[i].upper() + ' (days)'
    if indices_lst[i] in ['rx1day', 'rx5day']:
        legend_text = indices_lst[i][0:2].upper() + indices_lst[i][2:] + ' (mm)'
    if indices_lst[i] == 'prcptot':
        legend_text = indices_lst[i].upper() + ' (mm)'
    if indices_lst[i] in ['r95ptot', 'r99ptot']:
        legend_text = indices_lst[i][:4].capitalize() + indices_lst[i][4:].upper() + ' (%)'
    if indices_lst[i] == 'sdii':
        legend_text = indices_lst[i].upper() + ' (mm)'
 
    # Create plots
    prcp_ind = indices_lst[i]
    vmin = cbarFinal.loc[cbarFinal.prcp_ind_cbar == prcp_ind,'vmin'].values[0]
    vmax = cbarFinal.loc[cbarFinal.prcp_ind_cbar == prcp_ind,'vmax'].values[0]
    # Map each season to its dataset so one loop fills the four columns
    season_maps = {'DJF': diff_indAvgUK_DJF, 'MAM': diff_indAvgUK_MAM,
                   'JJA': diff_indAvgUK_JJA, 'SON': diff_indAvgUK_SON}
    for j, season in enumerate(seasons_lst):
        setting_map(prcp_ind, season_maps[season], title, season, legend_text, i, j, vmin, vmax)
        # saving the subplot
        file_name = fileName + season + '_' + prcp_ind + '.png'
        saving_image(axs[i, j], fldr_images, file_name,[0.85,1.38])
    i +=1
plt.show()

## Creating average time series (season) for each index for each dataset
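
The day-count indices (r10mm, r20mm, cdd, cwd) are held as timedelta64 values, which is why `season_mean_UK` and `year_mean_UK` divide them by `np.timedelta64(1, 'D')` before averaging. A minimal sketch of that conversion on made-up values (the helper name `timedelta_to_days` is hypothetical and not part of this notebook):

```python
import numpy as np
import pandas as pd

def timedelta_to_days(series):
    """Convert a timedelta64 Series to float days; missing values (NaT) become NaN."""
    return series / np.timedelta64(1, "D")

# Made-up day counts for three grid cells, one of them missing
counts = pd.Series([pd.Timedelta(days=3), pd.Timedelta(days=12), pd.NaT])
days = timedelta_to_days(counts)
print(days.tolist())
```

Dividing by a one-day timedelta yields plain floats, so the result can be averaged with the other indices.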

In [None]:
seasons_lst = ['DJF', 'MAM', 'JJA', 'SON']

ERA_averageSeason = season_mean_UK(ERA_dataset_season, seasons_lst)
GPM_averageSeason = season_mean_UK(GPM_dataset_season, seasons_lst)
HAD_averageSeason = season_mean_UK(HAD_dataset_season, seasons_lst)

ERA_averageYear_ind = year_mean_UK(ERA_dataset_season,'ERA')
GPM_averageYear_ind = year_mean_UK(GPM_dataset_season,'GPM')
HAD_averageYear_ind = year_mean_UK(HAD_dataset_season,'HAD')

# ERA_averageSeason.reindex(seasons_lst)
ERA_averageYear_ind

* **Creating plots**

In [None]:
indices_lst = ['r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']
lst_yrs = [*range(2001,2020,1)]
# seasons_lst = ['DJF', 'MAM', 'JJA', 'SON']
# x = [1,2,3,4]
fldr_images = Path('/mnt/d/MRes_dataset/Images/avgIndicesSeasonPlots')

# start subplots
n_rows = len(indices_lst)
from mpl_toolkits.axes_grid1 import make_axes_locatable
fig, axs = plt.subplots(5, 2, figsize=(50,50))

# Create the plots
i = 0
k = 0
while i < len(indices_lst): 
    # setting legend in bar
    if indices_lst[i] in ['r10mm', 'r20mm']:
        lbl_yaxis = indices_lst[i].capitalize() + ' (days)' 
    if indices_lst[i] in ['cdd', 'cwd']:
        lbl_yaxis = indices_lst[i].upper() + ' (days)'
    if indices_lst[i] in ['rx1day', 'rx5day']:
        lbl_yaxis = indices_lst[i][0:2].upper() + indices_lst[i][2:] + ' (mm)'
    if indices_lst[i] == 'prcptot':
        lbl_yaxis = indices_lst[i][0:4].upper() + indices_lst[i][4:] + ' (mm)'
    if indices_lst[i] in ['r95p', 'r95ptot']:
        lbl_yaxis = indices_lst[i].capitalize() + ' (mm)'
    if indices_lst[i] == 'sdii':
        lbl_yaxis = indices_lst[i].upper() + ' (mm)'
    
    # indices_plot drops the last year for r10mm, r20mm, cdd and cwd,
    # so these four indices are plotted against the shortened list of years
    if indices_lst[i] in ['r10mm', 'r20mm', 'cdd', 'cwd']:
        lst_yrs_modified = [*range(2001,2019,1)]
        data_graph = indices_plot(ERA_averageYear_ind, GPM_averageYear_ind, HAD_averageYear_ind, indices_lst[i])
        axs[i,0].plot(lst_yrs_modified, data_graph[0], label = 'ERA5', marker='D')
        axs[i,0].plot(lst_yrs_modified, data_graph[1], label = 'GPM-IMERG', marker='v')
        axs[i,0].plot(lst_yrs_modified, data_graph[2], label = 'HadUK-Grid', marker='o')
        # creating plot
        plot_setup(axs[i,0],lst_yrs_modified, lbl_yaxis, 'Seasons')
        # saving subplot
        file_name = 'average_seasonSeries_allDatasets_UK_' + indices_lst[i] + '.png'
        saving_image(axs[i,0], fldr_images, file_name)
    elif indices_lst[i] == 'sdii':
        data_graph = indices_plot(ERA_averageYear_ind, GPM_averageYear_ind, HAD_averageYear_ind, indices_lst[i])
        axs[i,0].plot(lst_yrs, data_graph[0], label = 'ERA5', marker='D')
        axs[i,0].plot(lst_yrs, data_graph[1], label = 'GPM-IMERG', marker='v')
        axs[i,0].plot(lst_yrs, data_graph[2], label = 'HadUK-Grid', marker='o')
        # creating plot
        plot_setup(axs[i,0],lst_yrs, lbl_yaxis, 'Seasons')
        # saving subplot
        file_name = 'average_seasonSeries_allDatasets_UK_' + indices_lst[i] + '.png'
        saving_image(axs[i,0], fldr_images, file_name)

    else:
        data_graph = indices_plot(ERA_averageYear_ind, GPM_averageYear_ind, HAD_averageYear_ind, indices_lst[i])
        axs[k,1].plot(lst_yrs, data_graph[0], label = 'ERA5', marker='D')
        axs[k,1].plot(lst_yrs, data_graph[1], label = 'GPM-IMERG', marker='v')
        axs[k,1].plot(lst_yrs, data_graph[2], label = 'HadUK-Grid', marker='o')
        # creating plot
        plot_setup(axs[k,1],lst_yrs, lbl_yaxis, 'Seasons')
        if k <= 3:
            # saving subplot
            file_name = 'average_seasonSeries_allDatasets_UK_' + indices_lst[i] + '.png'
            saving_image(axs[k,1], fldr_images, file_name)
        
        k += 1

    i += 1

# Make sure it shows a nice layout avoiding overlapping (must run before plt.show())
plt.tight_layout()
plt.show()

## Percentages of Grid cells
Creates a table with the percentage of grid cells whose mean (MW) and variance (Levene) do not differ significantly at the 5% level, plus the average correlation coefficient, for 2001-2019.

<div class="alert alert-block alert-warning">
 <p style="color:black"> <b>NOTE: The loop that computes the percentage of grid cells where H0 can be accepted for MW and Levene must be run separately for each product. First run it with MW_HAD_ERA_DJF, MW_HAD_ERA_MAM, MW_HAD_ERA_JJA and MW_HAD_ERA_SON; then change the inputs to MW_HAD_GPM_DJF, MW_HAD_GPM_MAM, MW_HAD_GPM_JJA and MW_HAD_GPM_SON and run it again.</b></p>
</div>
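
The percentage reported for MW and Levene is the share of grid cells whose p-value exceeds 0.05. A minimal sketch of that calculation on made-up p-values follows. The function name `percent_h0_retained` is hypothetical, and the sketch assumes cells with a missing p-value are left out of the denominator, which may differ from how the cell below counts rows.

```python
import numpy as np
import pandas as pd

def percent_h0_retained(p_values, alpha=0.05):
    """Percentage of non-missing p-values above alpha (H0 not rejected)."""
    valid = p_values.dropna()
    if valid.empty:
        return float("nan")
    return float((valid > alpha).mean() * 100)

# Made-up p-values for two indices over four grid cells
example = pd.DataFrame({
    "r10mm": [0.20, 0.01, 0.60, np.nan],
    "cdd": [0.03, 0.04, 0.50, 0.70],
})

summary = {col: percent_h0_retained(example[col]) for col in example.columns}
print(summary)
```

Keeping the missing cells in the denominator instead would lower the percentage for any index with gaps, so the choice should be stated alongside the table.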

In [None]:
col_lst = ['r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']
seasons_lst = ['DJF', 'MAM', 'JJA', 'SON']

# Create empty df to store values
perc_col = ['season', 'stat_test', 'r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']
df_perc = pd.DataFrame(columns=perc_col)


def perc_calc(df,season_name,test_stat):
    lst_prc = []
    # append season to list (use the argument rather than the global loop variable)
    lst_prc.append(season_name)
    lst_prc.append(test_stat)
    # Calculate percentage of values where H0 can be accepted
    if test_stat in ['MW', 'Levene']:
        for prc_ind in col_lst:
            df_pvalues_H0 = pd.DataFrame()
            # Create mask with the H0 condition
            mask = (df[prc_ind] > 0.05)
            df_pvalues_H0 = df[mask]
            # Get percentage of values where H0 can be accepted
            perc_H0 = ((df_pvalues_H0.shape[0] / df.shape[0]) * 100)
            lst_prc.append(perc_H0)

        # add data to final dataframe
        df_perc.loc[len(df_perc)] = lst_prc
    if test_stat in ['Pearson', 'Spearman']:
        for prc_ind in col_lst:
            lst_prc.append(df[prc_ind].mean())

        # add data to final dataframe
        df_perc.loc[len(df_perc)] = lst_prc
    
    return(df_perc)
    

# Summarise each test per season: percentage of rows with p > 0.05 for MW and
# Levene, mean of the stored values for Pearson and Spearman
test_results = {
    'DJF': {'MW': MW_HAD_ERA_DJF, 'Levene': levene_HAD_ERA_DJF,
            'Pearson': pearson_HAD_ERA_DJF, 'Spearman': spearman_HAD_ERA_DJF},
    'MAM': {'MW': MW_HAD_ERA_MAM, 'Levene': levene_HAD_ERA_MAM,
            'Pearson': pearson_HAD_ERA_MAM, 'Spearman': spearman_HAD_ERA_MAM},
    'JJA': {'MW': MW_HAD_ERA_JJA, 'Levene': levene_HAD_ERA_JJA,
            'Pearson': pearson_HAD_ERA_JJA, 'Spearman': spearman_HAD_ERA_JJA},
    'SON': {'MW': MW_HAD_ERA_SON, 'Levene': levene_HAD_ERA_SON,
            'Pearson': pearson_HAD_ERA_SON, 'Spearman': spearman_HAD_ERA_SON},
}

for season in seasons_lst:
    # Seasons without an entry in test_results are skipped
    for test_stat, ds in test_results.get(season, {}).items():
        df_test = ds.to_dataframe().reset_index()
        # Drop rows where every index is NaN
        df_select = df_test.dropna(subset=col_lst, how='all', axis=0).reset_index()
        perc_calc(df_select, season, test_stat)
        
final_df = df_perc.round(2)
final_df
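
The percentage computed in `perc_calc` can also be expressed as a column-wise operation. The sketch below uses a small, made-up table of p-values (the numbers are illustrative, not results from this notebook) and shows how the choice of denominator matters when a column contains NaN: counting all rows, as `perc_calc` does, gives a lower percentage than counting only rows with a valid p-value.

```python
import numpy as np
import pandas as pd

# Hypothetical p-values for two indices at five grid cells (illustrative only)
pvals = pd.DataFrame({
    'r10mm': [0.01, 0.20, 0.50, 0.03, 0.70],
    'cdd':   [0.04, 0.06, np.nan, 0.90, 0.02],
})

alpha = 0.05  # significance level used in perc_calc

# Percentage of rows with p > alpha, with all rows in the denominator
# (NaN > alpha evaluates to False, so NaN rows count as "rejected")
perc_all_rows = (pvals > alpha).mean() * 100

# Alternative: only rows with a valid p-value in the denominator
perc_valid_only = (pvals > alpha).sum() / pvals.notna().sum() * 100

print(perc_all_rows.round(2))
print(perc_valid_only.round(2))
```

Which denominator is appropriate depends on whether cells with a missing p-value should count against the percentage; this sketch only makes the difference visible.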

## Average indices difference value 
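This stage uses the seasonal index averages from section 2.1.2.6 (Precipitation indices average for 2001 to 2019) and relates to section 3.4 (Difference Maps). For each season, the ERA and GPM averages are subtracted from the HadUK-Grid averages, and the mean difference per index is stored in a summary table.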
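
As a minimal sketch of the operation, assuming two datasets that share the same grid, the example below subtracts one `xarray.Dataset` from another and takes the mean over all grid cells. The dataset names `had` and `era`, the 2x2 grid, and the values are hypothetical and serve only to illustrate how NaN cells propagate through the difference and are skipped by the mean.

```python
import numpy as np
import xarray as xr

# Hypothetical 2x2 grids for a single index (values are illustrative only)
coords = {'lat': [50.0, 51.0], 'lon': [-1.0, 0.0]}
had = xr.Dataset(
    {'r10mm': (('lat', 'lon'), np.array([[11.0, 12.0], [np.nan, 8.0]]))},
    coords=coords,
)
era = xr.Dataset(
    {'r10mm': (('lat', 'lon'), np.array([[9.0, 15.0], [7.0, 4.0]]))},
    coords=coords,
)

# Cell-by-cell difference; a NaN in either input gives NaN in the result
diff = had - era

# Mean difference over all grid cells; NaN cells are skipped by default
mean_diff = float(diff['r10mm'].mean())
print(mean_diff)
```

The cell below reaches the same kind of summary through `to_dataframe()` and the pandas `mean()`, which also skips NaN values by default.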

In [None]:
# Differences between HadUK-Grid and each comparison dataset, per season
diff_indAvgUK_DJF_ERA = IndAvg_HAD_DJF - IndAvg_ERA_DJF
diff_indAvgUK_MAM_ERA = IndAvg_HAD_MAM - IndAvg_ERA_MAM
diff_indAvgUK_JJA_ERA = IndAvg_HAD_JJA - IndAvg_ERA_JJA
diff_indAvgUK_SON_ERA = IndAvg_HAD_SON - IndAvg_ERA_SON

diff_indAvgUK_DJF_GPM = IndAvg_HAD_DJF - IndAvg_GPM_DJF
diff_indAvgUK_MAM_GPM = IndAvg_HAD_MAM - IndAvg_GPM_MAM
diff_indAvgUK_JJA_GPM = IndAvg_HAD_JJA - IndAvg_GPM_JJA
diff_indAvgUK_SON_GPM = IndAvg_HAD_SON - IndAvg_GPM_SON

# Create empty df to store values
diff_col = ['season', 'dataset', 'r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot']
df_diff = pd.DataFrame(columns=diff_col)


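# Difference datasets keyed by name; each name encodes the season and the
# comparison dataset (diff_indAvgUK_<season>_<dataset>)
diff_datasets = {
    'diff_indAvgUK_DJF_ERA': diff_indAvgUK_DJF_ERA,
    'diff_indAvgUK_MAM_ERA': diff_indAvgUK_MAM_ERA,
    'diff_indAvgUK_JJA_ERA': diff_indAvgUK_JJA_ERA,
    'diff_indAvgUK_SON_ERA': diff_indAvgUK_SON_ERA,
    'diff_indAvgUK_DJF_GPM': diff_indAvgUK_DJF_GPM,
    'diff_indAvgUK_MAM_GPM': diff_indAvgUK_MAM_GPM,
    'diff_indAvgUK_JJA_GPM': diff_indAvgUK_JJA_GPM,
    'diff_indAvgUK_SON_GPM': diff_indAvgUK_SON_GPM,
}

# Summarise each difference dataset in turn
for name, diff_ds in diff_datasets.items():
    # Set dataframe
    df = diff_ds.to_dataframe().reset_index()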
    df_select = df.dropna(subset=col_lst, how='all', axis=0).reset_index()
    lst_prc = []
    
    # append season and dataset to list
    lst_prc.append(name.split('_')[2])
    lst_prc.append('HadUk-Grid - ' + name.split('_')[3])
    
    # Calculate the mean difference for each index
    for prc_ind in col_lst:
        lst_prc.append(df_select[prc_ind].mean())

    # add data to final dataframe
    df_diff.loc[len(df_diff)] = lst_prc 

final_diff = df_diff.round(2)