# WEGE index
This notebook provides code for the Weighted Endemism including Global Endangerment (WEGE) index as described in [Farooq et al. (2020)](https://onlinelibrary.wiley.com/doi/full/10.1111/ddi.13148). WEGE is calculated as:

$$\text{WEGE} = \sum_{i=1}^\text{N} \sqrt{WE_i}\times ER_i$$

- $WE_i$: weighted endemism for species $i$ (see [Crisp et al. (2002)](https://biology-assets.anu.edu.au/hosted_sites/Crisp/pdfs/Crisp2001_endemism.pdf))
- $ER_i$: probability of extinction of species $i$ (using the IUCN50 transformation from [Davis et al. (2018)](https://www.pnas.org/doi/10.1073/pnas.1804906115))
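
As a worked illustration with made-up numbers (not taken from the paper), consider two hypothetical species with $WE_1 = 0.25$, $ER_1 = 0.4276$ and $WE_2 = 0.04$, $ER_2 = 0.0513$:

$$\text{WEGE} = \sqrt{0.25}\times 0.4276 + \sqrt{0.04}\times 0.0513 = 0.2138 + 0.01026 \approx 0.2241$$

The first species dominates the sum because it is both more widespread within the target area and more threatened.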

### Weighted Endemism (WE)
[Crisp et al. (2002)](https://biology-assets.anu.edu.au/hosted_sites/Crisp/pdfs/Crisp2001_endemism.pdf) describe weighted endemism in terms of grid cells, stating:
> ...a single-cell endemic has the maximum weight of 1, a species occurring in two cells has a weight of 0.5, and a species occurring in 100 cells has a weight of 0.01. **To obtain an endemism score for a cell, these weights are summed for all species occurring in the cell. We term this measure *weighted endemism*.**
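
The cell-based weighting in the quote can be sketched in a few lines of Python. This is only an illustration: the species names, cell IDs, and the helper `cell_weighted_endemism` are hypothetical. Each species contributes `1 / (number of cells it occupies)` to every cell where it occurs.

```python
from collections import defaultdict

# Hypothetical occurrences: species -> set of grid-cell IDs where it is found
occurrences = {
    "sp_a": {"c1"},                    # single-cell endemic, weight 1
    "sp_b": {"c1", "c2"},              # two cells, weight 0.5
    "sp_c": {"c1", "c2", "c3", "c4"},  # four cells, weight 0.25
}


def cell_weighted_endemism(occurrences):
    """Sum, for each cell, the weights 1/n_cells of the species present in it."""
    scores = defaultdict(float)
    for cells in occurrences.values():
        weight = 1.0 / len(cells)
        for cell in cells:
            scores[cell] += weight
    return dict(scores)


scores = cell_weighted_endemism(occurrences)
```

Cell `c1` hosts all three species, so it collects the single-cell endemic's full weight plus the partial weights of the two wider-ranging species.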

Instead of counting grid cells, we can define the endemism weight for each species as a ratio of areas:

$$WE_i = \frac{\text{"Area where species i is found within the target area"}}{\text{"Target area"}}$$
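
To make the area ratio concrete, here is a minimal sketch that uses axis-aligned boxes instead of real range polygons. The boxes and the helpers `box_area` and `box_intersection` are assumptions made for illustration; the notebook itself works with polygon geometries.

```python
def box_area(box):
    """Area of an axis-aligned box (minx, miny, maxx, maxy); empty boxes give 0."""
    minx, miny, maxx, maxy = box
    return max(0.0, maxx - minx) * max(0.0, maxy - miny)


def box_intersection(a, b):
    """Overlap of two axis-aligned boxes (may be empty, i.e. have zero area)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


target = (0.0, 0.0, 10.0, 10.0)         # hypothetical target area, 100 square units
species_range = (5.0, 5.0, 20.0, 20.0)  # hypothetical range, partly outside the target

# WE_i = area of the range inside the target / area of the target
we = box_area(box_intersection(species_range, target)) / box_area(target)
```

Only the part of the range that falls inside the target area counts, so a species with a large range mostly outside the target can still have a small $WE_i$.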


### Extinction Probability (ER)
[Farooq et al. (2020)](https://onlinelibrary.wiley.com/doi/full/10.1111/ddi.13148) base the extinction probability on each species' IUCN Red List category, using the transformation from [Davis et al. (2018)](https://www.pnas.org/doi/10.1073/pnas.1804906115). Other transformations exist in the literature, such as [Mooers et al. (2008)](https://www.sfu.ca/~amooers/papers/Mooers_etal_PLoSOne08.pdf). Here, I follow [Farooq et al. (2020)](https://onlinelibrary.wiley.com/doi/full/10.1111/ddi.13148) and use the [Davis et al. (2018)](https://www.pnas.org/doi/10.1073/pnas.1804906115) transformation. Also following [Farooq et al. (2020)](https://onlinelibrary.wiley.com/doi/full/10.1111/ddi.13148), I set the extinction probability of data deficient (DD) species to that of "vulnerable" species, as suggested by [Bland et al. (2015)](https://conbio.onlinelibrary.wiley.com/doi/10.1111/cobi.12372). This is based on the idea that DD species may be more threatened than some data-sufficient species ([Bland et al., 2015](https://conbio.onlinelibrary.wiley.com/doi/10.1111/cobi.12372), [Borgelt et al., 2022](https://www.nature.com/articles/s42003-022-03638-9)).

Here is the transformation used for each [IUCN Red List category](https://www.iucnredlist.org/resources/categories-and-criteria):

$$(DD, LC, NT, VU, EN, CR, EW, EX) \mapsto (0.0513, 0.0009, 0.0071, 0.0513, 0.4276, 0.9688, 1.0, 1.0)$$

Where:     
- DD = "Data Deficient"
- LC = "Least Concern"
- NT = "Near Threatened"
- VU = "Vulnerable"
- EN = "Endangered"
- CR = "Critically Endangered"
- EW = "Extinct In The Wild"
- EX = "Extinct"

### Notes
- The square root transformation is used to improve the normality of the $WE$ data. Whether it is necessary could be tested by plotting a histogram of $WE$ to see if it is skewed and whether the square root actually improves normality.
- Which mapping from IUCN category to extinction probability should we use?
- What should the probability of extinction be for "data deficient" species? [Bland et al. (2015)](https://conbio.onlinelibrary.wiley.com/doi/10.1111/cobi.12372) and [Borgelt et al. (2022)](https://www.nature.com/articles/s42003-022-03638-9) suggest DD species may be more threatened than data-sufficient species.
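
One quick way to run the check suggested in the first note is to compare skewness before and after the transformation. The sketch below uses made-up $WE$ values and a hypothetical `skewness` helper. Skewness is only a rough indicator of normality, so a histogram or a formal test on the real data would be more informative.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical WE values: many small ranges and a few large ones
we_values = [0.0001, 0.0004, 0.0009, 0.0025, 0.01, 0.04, 0.25, 0.81]
sqrt_values = [math.sqrt(v) for v in we_values]

skew_raw = skewness(we_values)
skew_sqrt = skewness(sqrt_values)
print(f"skewness raw: {skew_raw:.2f}, after sqrt: {skew_sqrt:.2f}")
```

For this toy sample the square root reduces the right skew but does not remove it. Whether the same holds for the actual species ranges remains to be checked.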

In [None]:
def extinction_risk(cat: str = None) -> float:
    '''Calculates extinction risk (ER) for a species
    following Farooq et al. (2020).
    Each IUCN category is assigned a probability of extinction
    using the extinction probabilities
    from Table S2 in the supplemental material of Davis et al. (2018).
    Here we use the IUCN50 values, same as Farooq et al. (2020).
    
    Extinction risk for the data deficient (DD) category is assigned
    the vulnerable (VU) probability,
    see Bland et al. (2015) for explanation.
    
    Args:
        cat (str): IUCN category
            - DD = Data Deficient
            - LC = Least Concern
            - NT = Near Threatened
            - VU = Vulnerable
            - EN = Endangered
            - CR = Critically Endangered
            - EW = Extinct in the wild 
            - EX = Extinct
        
    Returns:
        float: probability of extinction
        
    References:
        Bland et al. (2015) "Predicting the conservation status of data-deficient species" 
            https://doi.org/10.1111/cobi.12372
        Davis et al. (2018) "Mammal diversity will take millions of years to recover from the current biodiversity crisis"
            https://doi.org/10.1073/pnas.1804906115
        Farooq et al. (2020) "WEGE: A new metric for ranking locations for biodiversity conservation" 
            https://doi.org/10.1111/ddi.13148
    '''
    cat_to_risk = dict(
        DD=0.0513, # using Bland et al. (2015) assumption
        LC=0.0009,
        NT=0.0071,
        VU=0.0513,
        EN=0.4276,
        CR=0.9688,
        EW=1.0,
        EX=1.0
    )

    if cat not in cat_to_risk:
        raise ValueError("Invalid value for 'cat', expected one of 'DD', 'LC', 'NT', 'VU', 'EN', 'CR', 'EW', 'EX'")
    
    return cat_to_risk[cat]

In [None]:
def weighted_endemism(species_area: float=None, total_area: float=None, power: float=None) -> float:
    '''calculates the endemism score for a single species
    following Crisp et al. (2002) and using the scaling power of 0.5
    from Farooq et al. (2020).
    
    `("area species is found" / "total area")^(power)`
    
    Args:
        species_area (float): amount of area within the total where species is found
        total_area (float): total area where searching for species
        power (float): raise each endemism value to this power [default: 0.5]
        
    Returns:
        float: endemism score for the species
    
    References:
        Crisp et al. (2002) "Endemism in the Australian flora"
            https://onlinelibrary.wiley.com/doi/abs/10.1046/j.1365-2699.2001.00524.x
        Farooq et al. (2020) "WEGE: A new metric for ranking locations for biodiversity conservation" 
            https://doi.org/10.1111/ddi.13148
    '''
    power = 0.5 if power is None else power # see Farooq et al. (2020)
    
    if not isinstance(species_area, (int, float)):
        raise TypeError("species_area must be a number")
        
    if species_area < 0:
        raise ValueError("species_area must be non-negative")

    if not isinstance(total_area, (int, float)):
        raise TypeError("total_area must be a number")
        
    if total_area <= 0:
        raise ValueError("total_area must be positive and non-zero")
    
    return (species_area/total_area)**power


Calculate WEGE from lists of `species_area`, `total_area`, and IUCN `categories`:

```
WEGE = sum(
    weighted_endemism(sp_area, tot_area) * extinction_risk(cat)
    for sp_area, tot_area, cat
    in zip(species_area, total_area, categories)
)
```
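
A small end-to-end run with made-up inputs shows how the pieces fit together. So that the block runs on its own, it restates compact stand-ins for the two helpers defined above, without their input validation. The three species and their areas are hypothetical.

```python
# Compact stand-ins for the notebook helpers, so this block is self-contained
CAT_TO_RISK = {
    "DD": 0.0513, "LC": 0.0009, "NT": 0.0071, "VU": 0.0513,
    "EN": 0.4276, "CR": 0.9688, "EW": 1.0, "EX": 1.0,
}


def weighted_endemism(species_area, total_area, power=0.5):
    return (species_area / total_area) ** power


def extinction_risk(cat):
    return CAT_TO_RISK[cat]


# Hypothetical inputs: three species in a target area of 100 square units
species_area = [25.0, 100.0, 1.0]
total_area = [100.0, 100.0, 100.0]
categories = ["EN", "LC", "DD"]

# One term per species: sqrt(WE_i) * ER_i
terms = [
    weighted_endemism(sp_area, tot_area) * extinction_risk(cat)
    for sp_area, tot_area, cat in zip(species_area, total_area, categories)
]
WEGE = sum(terms)
```

The endangered species covering a quarter of the area contributes far more than the least-concern species covering all of it, which reflects how strongly the extinction probabilities weight the index.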

# 1. Read data

In [None]:
import geopandas as gpd
import numpy as np
from shapely.ops import linemerge, unary_union, polygonize

In [None]:
from MBU_utils import *

## 1.1 IUCN Redlist
**Note:** this takes a couple of minutes to load. Loading directly from the S3 bucket takes longer and requires an AWS account.

In [None]:
%%time
df_redlist = gpd.read_file('s3://ocean-program/data/processed/ACMC_IUCN_RedList/')

In [None]:
df_redlist = gpd.read_file('/Users/maureenfonseca/Desktop/Data-Oceans/ACMC_IUCN_data/gdf_ACMC_IUCN_range_status_filtered.shp')

In [None]:
# replaces long RedList name with two-letter code
long_to_short = {
    'Data Deficient':'DD',
    'Least Concern':'LC',
    'Near Threatened':'NT',
    'Vulnerable':'VU',
    'Endangered':'EN',
    'Critically Endangered':'CR',
    'Extinct In The Wild':'EW',
    'Extinct':'EX'
}

df_redlist['redlistCat'] = df_redlist['redlistCat'].replace(long_to_short)

## 1.2 Coco Marine Conservation Area
- The Coco Marine Conservation Area (ACMC)
- the Bicentennial Marine Management Area (AMMB)
- Cocos Island National Park (PNIC)

In [None]:
%%time
df_acmc = gpd.read_file('s3://ocean-program/data/processed/geospatial/')

# 2. Calculate WEGE

In [None]:
# set coord. references system
# https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.set_crs.html
df_redlist = df_redlist.set_crs('epsg:4326', allow_override=True)
df_acmc = df_acmc.set_crs('epsg:4326', allow_override=True)

In [None]:
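# redundant with the cell above; the {"init": ...} dict form is deprecated in pyproj
df_redlist = df_redlist.set_crs("EPSG:4326", allow_override=True)
df_acmc = df_acmc.set_crs("EPSG:4326", allow_override=True)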

In [None]:
# area of ACMC
acmc_area = df_acmc.area[0]

In [None]:
# create list of weighted endemism to specified power for each species
# hiding warnings for now (e.g. areas computed in a geographic CRS)

import warnings

# ACMC polygon (first feature), matching `acmc_area` above
acmc_geom = df_acmc.geometry.iloc[0]

with warnings.catch_warnings(record=True):
    power = 0.5
    WE_list = []
    for _, row in df_redlist.iterrows():
        species_ranges = df_redlist.loc[df_redlist.BINOMIAL == row.BINOMIAL]
        # intersect with a single geometry: passing the whole GeoDataFrame
        # would pair rows by index instead of intersecting each range with ACMC
        species_area_within_acmc = species_ranges.intersection(acmc_geom).area
        WE_tmp = ((species_area_within_acmc / acmc_area).sum()) ** power
        WE_list.append(WE_tmp)

In [None]:
# list of extinction probabilities for each species
ER_list = [extinction_risk(cat) for cat in df_redlist['redlistCat']]

In [None]:
# calculate WEGE index
WEGE = sum([WE * ER for WE, ER in zip(WE_list, ER_list)])

**Next Steps**
- Calculate the WEGE index per grid (sqkm)

In [None]:
df_redlist = gpd.clip(df_redlist.set_crs(epsg=4326, allow_override=True), df_acmc)

In [None]:
we = np.round(df_redlist.area / acmc_area, decimals=4)
sq_we = we**(0.5)

In [None]:
df_redlist['we'] = we
df_redlist['sq_we'] = sq_we

In [None]:
# list of extinction probabilities for each species
df_redlist['ER'] = [extinction_risk(cat) for cat in df_redlist['redlistCat']]

In [None]:
df_redlist['wege_i'] = df_redlist['sq_we']*df_redlist['ER']

In [None]:
def sum_values(gdf):
    #main source: https://stackoverflow.com/questions/65073549/combine-and-sum-values-of-overlapping-polygons-in-geopandas

    #The explode() method converts each element of the specified column(s) into a row
    #This is useful if there are multipolygons
    new_gdf = gdf.explode('geometry')

    #convert all polygons to lines and perform union
    lines = unary_union(linemerge([geom.exterior for geom in new_gdf.geometry]))

    #convert again to (smaller) intersecting polygons and to geodataframe
    polygons = list(polygonize(lines))
    intersects = gpd.GeoDataFrame({'geometry': polygons}, crs="EPSG:4326")
    
    #to fix invalid geometries
    intersects['geometry'] = intersects['geometry'].buffer(0)

    #Perform sjoin with original geoframe to get overlapping polygons.
    #Afterwards group per intersecting polygon to perform (arbitrary) aggregation
    intersects['sum_overlaps'] = (intersects
                            .sjoin(new_gdf, predicate='within')
                            .reset_index()
                            .groupby(['level_0', 'index_right0'])
                            .head(1)
                            .groupby('level_0')
                            .wege_i.sum())
    return intersects

In [None]:
df_redlist = df_redlist[0:200]

In [None]:
overlap_wege_v = sum_values(df_redlist)

In [None]:
grid = create_grid(df_acmc, grid_shape="hexagon", grid_size_deg=1.)

In [None]:
merged = gpd.sjoin(overlap_wege_v, grid, how='left')

# carry the summed WEGE of each overlap polygon into the joined frame
merged['n_value'] = overlap_wege_v['sum_overlaps']

# sum the WEGE values of all overlap polygons falling in each grid cell
dissolve = merged.dissolve(by="index_right", aggfunc="sum")

# write the per-cell totals back onto the grid
grid.loc[dissolve.index, 'n_value'] = dissolve.n_value.values
    
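# project to EPSG:31970 (SIRGAS 2000 / UTM zone 16N, metres; https://epsg.io/31970)
# and convert the cell area from square metres to square kilometres
grid['area_sqkm'] = grid.to_crs(crs=31970).area * 10**(-6)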

grid['mbu_wege'] = grid['n_value']*grid['area_sqkm']