### Simple one-way ANOVA test to analyse the relationship between the water level variability and peat type

**Motivation for analysis**
- Visually, there seems to be a strong correspondence between water level variation and peat vegetation type, with higher variation over palm swamp regions. 

**Inputs**
- land type map 
- standard deviation of the final derived water level timeseries

**Method**
- Calculate water level standard deviation
- Run a one-way ANOVA (F-test) on the water level variation, grouped by peat swamp type (hardwood/palm)
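
The first method step can be sketched on synthetic data as follows. This is only an illustration: the array shape, the `(time, y, x)` axis order, and the random values are assumptions, and the notebook itself loads a precomputed `WL_stdev.nc` rather than computing the standard deviation here. `np.nanstd` is used so that a missing day does not turn the whole pixel into NaN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily water level cube in cm, shape (time, y, x)
wl_daily = rng.normal(0.0, 5.0, size=(90, 4, 5))

# Simulate one missing observation in a single pixel
wl_daily[10, 0, 0] = np.nan

# Per-pixel standard deviation over the time axis, ignoring NaN days
wl_std = np.nanstd(wl_daily, axis=0)

# For comparison: the plain standard deviation propagates the NaN
wl_std_plain = np.std(wl_daily, axis=0)

print(wl_std.shape)
```

With an xarray object the equivalent reduction would be along the time dimension, whose name depends on how the dataset was written.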



In [1]:
from scipy import stats
from scipy.stats import f_oneway
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import pandas as pd
import xarray as xr
import numpy as np
xr.set_options(cmap_sequential='viridis')

<xarray.core.options.set_options at 0x7fb651621a30>

In [2]:
ALOS_OUT = 'filepath'

In [3]:
# Landtype map
lt_map = xr.open_dataset(ALOS_OUT + 'landtype_100m.nc')
lt_map = lt_map.where(lt_map.isin([4,5]))
# water level daily timeseries
wl_final = xr.open_dataset(ALOS_OUT + 'WL_daily_final.nc')

# mean, min, max and stdev
WL_mean = xr.open_dataset(ALOS_OUT + 'WL_mean.nc')
WL_min = xr.open_dataset(ALOS_OUT + 'WL_min.nc')
WL_max = xr.open_dataset(ALOS_OUT + 'WL_max.nc')
WL_stdev = xr.open_dataset(ALOS_OUT + 'WL_stdev.nc')

In [6]:
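# One-way ANOVA between the peat swamp type and a water level statistic
def calculate_anova(lt_map, wl_stat, stat_name):
    d1 = lt_map['type'].values
    d2 = wl_stat['water_level'].values

    # removing nan values (each array is filtered independently)
    d1 = d1[~np.isnan(d1)]
    d2 = d2[~np.isnan(d2)]
    # f_oneway treats each argument as one group of samples, so this call
    # compares the land type codes (4/5) with the water level values rather
    # than comparing water levels between the two swamp types
    f_stat, pvalue = f_oneway(d1, d2)

    print('The one-way ANOVA outputs for ' + stat_name + ' are: ')
    print('f stat: ', str(f_stat))
    print('p value: ', str(pvalue))

    # the non-NaN water level values are returned for inspection
    return d2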
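
`f_oneway` expects one array per group. To test whether a water level statistic differs between swamp types, the land type codes should therefore split the water level values into groups instead of being passed as a group themselves. The sketch below contrasts the two calls on synthetic data. The arrays, the sample size, and the assumption that both classes share one distribution are all made up for illustration; the codes 4 (palm) and 5 (hardwood) follow the land type map used in this notebook.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
n = 2000

# Hypothetical land type codes: 4 = palm swamp, 5 = hardwood swamp
land_type = rng.choice([4.0, 5.0], size=n)
# Hypothetical water level statistic, same distribution for both classes,
# so a grouped test should find little or no class effect
water_level = rng.normal(0.0, 1.0, size=n)

# Drop pixels where either array is missing, using one shared mask so the
# two arrays stay aligned pixel by pixel
valid = ~np.isnan(land_type) & ~np.isnan(water_level)
land_type, water_level = land_type[valid], water_level[valid]

# Codes passed as a group: compares the mean code (about 4.5) with the mean
# water level (about 0), which is large whatever the class effect is
f_codes, p_codes = f_oneway(land_type, water_level)

# Water level split by class: the actual one-way ANOVA on swamp type
palm = water_level[land_type == 4]
hardwood = water_level[land_type == 5]
f_grouped, p_grouped = f_oneway(palm, hardwood)

print('codes as a group: F =', f_codes)
print('grouped by class: F =', f_grouped)
```

The grouped call tests the same hypothesis as the `ols('water_level ~ C(type)')` model fitted later in the notebook.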



In [7]:
d2 = calculate_anova(lt_map, WL_stdev, 'the standard deviation of the water level')
print (len(d2))

The one-way ANOVA outputs for the standard deviation of the water level are: 
f stat:  52284.51333912362
p value:  0.0
8646611


In [8]:
d2

array([4.76587954, 7.95832095, 5.12864247, ..., 3.76799344, 4.95910528,
       6.9644831 ])

In [9]:
calculate_anova(lt_map, WL_mean, 'the mean water level')

The one-way ANOVA outputs for the mean water level are: 
f stat:  91620.45040894908
p value:  0.0


array([ 0.78489061,  5.68351625, -0.26425023, ...,  3.58119379,
       -0.73116236,  5.39631525])

In [10]:
calculate_anova(lt_map, WL_min, 'the min water level')

The one-way ANOVA outputs for the min water level are: 
f stat:  51443043.22214476
p value:  0.0


array([ -7.83417105,  -7.00494687, -10.3283852 , ...,  -4.37110852,
       -13.27824619,  -8.0539838 ])

In [11]:
calculate_anova(lt_map, WL_max, 'the max water level')

The one-way ANOVA outputs for the max water level are: 
f stat:  21131989.584988564
p value:  0.0


array([19.30696636, 30.94396974, 10.46517399, ..., 15.17239394,
       12.54865373, 29.65567032])

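#### Testing the eta squared statistic
- eta squared is the effect size of the one-way ANOVA test (analogous to R squared): it measures the proportion of the total variance in the dependent variable that is explained by the independent variable
- eta squared is a biased estimate that tends to overstate the effect in small samples
- with very large samples (about 8.6 million pixels here) the p-value approaches zero even for negligible effects, so the p-value alone says little about the strength of the relationship
- important to report the effect size and sample size together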
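
As a worked illustration, eta squared can be computed directly from the between-group and within-group sums of squares, and recovered from the F statistic through the identity eta² = F·df_between / (F·df_between + df_within). The sketch below uses synthetic values; the group means, spread, and sample sizes are assumptions chosen only to give a visible effect.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical water level standard deviations (cm) for two swamp types
palm = rng.normal(5.9, 2.0, size=5000)
hardwood = rng.normal(4.4, 2.0, size=8000)
groups = (palm, hardwood)

values = np.concatenate(groups)
grand_mean = values.mean()

# Sum of squares between groups: group size times squared mean offset
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Sum of squares within groups: squared deviations from each group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Proportion of total variance explained by the grouping
eta_squared = ss_between / (ss_between + ss_within)

# The same quantity recovered from the F statistic and its degrees of freedom
df_between = len(groups) - 1
df_within = values.size - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
eta_from_f = f_stat * df_between / (f_stat * df_between + df_within)

print('Eta squared:', eta_squared)
print('Eta squared from F:', eta_from_f)
```

Applying the same identity to the standard deviation ANOVA table in this notebook (F = 898284.884846 with 1 and 8646609 degrees of freedom) gives about 0.0941, in line with the eta squared printed by that cell.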

In [None]:
%%time

# merging the land type and stdev arrays
ds_merged = xr.merge([lt_map['type'],WL_stdev['water_level']])
df = ds_merged.to_dataframe()


In [52]:
aov_table

              sum_sq         df              F  PR(>F)
C(type)    3828724.0        1.0  898284.884846     0.0
Residual  36854100.0  8646609.0            NaN     NaN


In [12]:
%%time

# merging the land type and stdev arrays
ds_merged = xr.merge([lt_map['type'],WL_stdev['water_level']])
df = ds_merged.to_dataframe()

model = ols('water_level ~ C(type)', data = df).fit()

# anova test
aov_table = anova_lm(model, typ=2)

# effect size
# sum of squares between the groups
SSbetween = aov_table.loc['C(type)','sum_sq']
df_between = aov_table.loc['C(type)','df']

# sum of squares within the groups
SSwithin = aov_table.loc['Residual','sum_sq']
df_within = aov_table.loc['Residual','df']

eta_squared = SSbetween / (SSbetween + SSwithin)
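# same ratio recomputed from the rounded sums of squares shown in the table;
# with a single factor, partial eta squared is identical to eta squared
eta_squared2 = 3828724/(3828724+36854100)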
print (aov_table)
print ('Eta squared: ' + str(eta_squared))
print ('Eta squared partial: ' + str(eta_squared2))

                sum_sq         df              F  PR(>F)
C(type)   3.828724e+06        1.0  898284.884846     0.0
Residual  3.685410e+07  8646609.0            NaN     NaN
Eta squared: 0.09411156328014522
Eta squared partial: 0.09411155921722641
CPU times: user 1min 8s, sys: 14.7 s, total: 1min 23s
Wall time: 1min 4s


**About 9.4% of the variance in the water level standard deviation is explained by peat swamp type (pvalue<<0.05)**

In [66]:
%%time

# merging the land type and minimum water level arrays
ds_merged = xr.merge([lt_map['type'],WL_min['water_level']])
df = ds_merged.to_dataframe()

model = ols('water_level ~ C(type)', data = df).fit()

# anova test
aov_table = anova_lm(model, typ=2)

# effect size
# sum of squares between the groups
SSbetween = aov_table.loc['C(type)','sum_sq']
df_between = aov_table.loc['C(type)','df']

# sum of squares within the groups
SSwithin = aov_table.loc['Residual','sum_sq']
df_within = aov_table.loc['Residual','df']

eta_squared = SSbetween / (SSbetween + SSwithin)

print (aov_table)
print ('\nEta squared: ' + str(eta_squared))

                sum_sq         df            F  PR(>F)
C(type)   6.122928e+04        1.0  2687.934014     0.0
Residual  1.969638e+08  8646609.0          NaN     NaN

Eta squared: 0.00031076907578721257
CPU times: user 1min 7s, sys: 20.6 s, total: 1min 28s
Wall time: 1min 3s


**Peat swamp type explains only about 0.03% of the variance in the minimum water level: a negligible effect, even though the F-test is still significant (pvalue<<0.05)**

In [91]:
%%time

# merging the land type and maximum water level arrays
ds_merged = xr.merge([lt_map['type'],WL_max['water_level']])
df = ds_merged.to_dataframe()

model = ols('water_level ~ C(type)', data = df).fit()

# anova test
aov_table = anova_lm(model, typ=2)

# effect size
# sum of squares between the groups
SSbetween = aov_table.loc['C(type)','sum_sq']
df_between = aov_table.loc['C(type)','df']

# sum of squares within the groups
SSwithin = aov_table.loc['Residual','sum_sq']
df_within = aov_table.loc['Residual','df']

eta_squared = SSbetween / (SSbetween + SSwithin)


print (aov_table)
print ('\nEta squared: ' + str(eta_squared))

                sum_sq         df             F  PR(>F)
C(type)   8.134953e+07        1.0  1.230110e+06     0.0
Residual  5.718170e+08  8646609.0           NaN     NaN

Eta squared: 0.12454639155426386
CPU times: user 1min 5s, sys: 16.6 s, total: 1min 22s
Wall time: 1min 1s


**About 12.5% of the variance in the maximum water level is explained by peat swamp type (pvalue<<0.05)**

In [64]:
%%time

# merging the land type and mean water level arrays
ds_merged = xr.merge([lt_map['type'],WL_mean['water_level']])
df = ds_merged.to_dataframe()

model = ols('water_level ~ C(type)', data = df).fit()

# anova test
aov_table = anova_lm(model, typ=2)

# effect size
# sum of squares between the groups
SSbetween = aov_table.loc['C(type)','sum_sq']
df_between = aov_table.loc['C(type)','df']

# sum of squares within the groups
SSwithin = aov_table.loc['Residual','sum_sq']
df_within = aov_table.loc['Residual','df']

eta_squared = SSbetween / (SSbetween + SSwithin)
print (aov_table)
print ('\nEta squared: ' + str(eta_squared))

                sum_sq         df              F  PR(>F)
C(type)   1.554615e+07        1.0  898409.935041     0.0
Residual  1.496215e+08  8646609.0            NaN     NaN

Eta squared: 0.09412343141012111
CPU times: user 1min 6s, sys: 20.6 s, total: 1min 27s
Wall time: 1min 2s


**About 9.4% of the variance in the mean water level is explained by peat swamp type (pvalue<<0.05)**

#### Mean value of stdev, min, max and mean for each peat swamp type

In [89]:
def calculate_mean_values(ds, variable_name):
    # land type codes: 4 = palm swamp, 5 = hardwood swamp
    for label, code in [('Palm swamp ', 4), ('Hardwood swamp ', 5)]:
        # mean of the statistic over all pixels of this swamp type
        class_mean = ds.where(lt_map['type'] == code)['water_level'].mean().values
        print(label + variable_name + str(np.round(class_mean, 2)) + ' cm')


In [90]:
calculate_mean_values(WL_stdev,'standard deviation: ')
calculate_mean_values(WL_min,'min: ')
calculate_mean_values(WL_max,'max: ')
calculate_mean_values(WL_mean,'mean: ')

Palm swamp standard deviation: 5.86 cm
Hardwood swamp standard deviation: 4.41 cm
Palm swamp min: -5.72 cm
Hardwood swamp min: -5.91 cm
Palm swamp max: 21.6 cm
Hardwood swamp max: 14.93 cm
Palm swamp mean: 6.32 cm
Hardwood swamp mean: 3.4 cm


In [None]:
### The minimum value might not be very informative here because rainfall during this period was significantly higher than usual. It would be interesting to investigate if the minimum water level becomes significant