# Open Ocean
# Open Earth Foundation

<h1> Step 2: calculate different metrics for each modulating factor </h1>

This notebook is the second part of the workflow and continues from `Step1_Curate_IUCN_RedList.ipynb`.

<h2> Modulating Factor 1: Normalize Biodiversity Score </h2>

Species diversity refers to the variety of different species present in a given area, as well as their abundance and distribution. This includes the number of species, their relative abundances, and how evenly or unevenly distributed they are.
Our proposal: apply the Simpson and Shannon indices to obtain a local diversity value within the MPA, then normalize the value of each square kilometer.
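
The Simpson index is named in the proposal but is not implemented further down in this notebook. A minimal sketch, assuming the Gini-Simpson form $1 - \sum p_i^2$ (other variants exist) and a hypothetical helper name `simpson_index` that is not part of `MBU_utils`; the abundances are made up for illustration:

```python
import numpy as np

def simpson_index(abundance):
    """Gini-Simpson diversity, 1 - sum(p_i^2). Hypothetical helper for illustration."""
    counts = np.asarray(abundance, dtype=float)
    total = counts.sum()
    if total == 0:
        # No individuals recorded: define the diversity as 0
        return 0.0
    p = counts / total  # proportion of each species in the community
    return float(1.0 - np.sum(p ** 2))

even_community = simpson_index([10, 10, 10, 10])  # four equally abundant species
single_species = simpson_index([40, 0, 0, 0])     # one species only
print(even_community, single_species)
```

Values closer to 1 indicate a more diverse, more evenly distributed community; a community with a single species scores 0.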

### Data needed for this project

- Species names
- Species abundance
- Species distribution

Next Steps:

1. Find a database or datasets with abundance and distribution information for the entire ACMC
2. If that isn't realistic, try to simulate the data

Options:
1. IUCN Red List and simulate abundance information
2. GBIF species information and simulate abundance and distribution information

### 1. Importing libraries.

In [1]:
# load basic libraries
import os
import glob
import boto3

import math
import numpy as np
import pandas as pd

# to plot
import matplotlib.pyplot as plt

# to manage shapefiles
import shapely
import geopandas as gpd
from shapely.geometry import Polygon, Point, box
from shapely.ops import linemerge, unary_union, polygonize

In [2]:
import fiona; #help(fiona.open)

**Import OEF functions**

In [3]:
%load_ext autoreload

In [4]:
#Run this to reload the python file
%autoreload 2
from MBU_utils import *

### 2. Load data

In [5]:
ACMC = gpd.read_file('https://ocean-program.s3.amazonaws.com/data/raw/MPAs/ACMC.geojson')

In [None]:
%%time
df = gpd.read_file('https://ocean-program.s3.amazonaws.com/data/processed/ACMC_IUCN_RedList/gdf_ACMC_IUCN_range_status_filtered.shp')

In [6]:
%%time
df = gpd.read_file('/Users/maureenfonseca/Desktop/Data-Oceans/ACMC_IUCN_data/gdf_ACMC_IUCN_range_status_filtered.shp')

CPU times: user 7min 20s, sys: 3.72 s, total: 7min 24s
Wall time: 7min 25s


In [7]:
grid = create_grid(ACMC, grid_shape="hexagon", grid_size_deg=1.)

### 3. Preliminary calculations


**Shannon Index**

$H = -\sum_{i} p_i \ln(p_i)$


where $p_i$ is the proportion of the entire community made up of species $i$:

$p_i = n_i/N$, with $n_i$ the abundance of species $i$ and $N$ the total abundance of all species
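
As a worked example of the formula (the abundances are illustrative, not taken from the ACMC data): for three species with abundances 50, 30 and 20, the total is $N = 100$ and the proportions are 0.5, 0.3 and 0.2, so

$H = -(0.5\ln 0.5 + 0.3\ln 0.3 + 0.2\ln 0.2) \approx 0.347 + 0.361 + 0.322 \approx 1.030$

For comparison, three equally abundant species would give the maximum possible value for three species, $H = \ln 3 \approx 1.099$.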

In [None]:
def shannon(gdf_abundance_col):
    """
    Calculates the Shannon index H from the given abundance values.
    
    Parameters:
        - gdf_abundance_col (list): Abundance of each species in the community
        
    Returns:
        - H (float): The calculated value of H
    """
    
    abundance = np.array(gdf_abundance_col)
    N = np.sum(abundance)
    
    p = (abundance/N)
    
    H = 0
    for pi in p:
        if pi > 0:
            H += pi * math.log(pi)
    H = -H
    return H

In [10]:
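# keep the first 100 rows as a test subset; copy so that new columns can be added without a SettingWithCopyWarning
df = df[0:100].copy()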

In [11]:
fake_abundance = np.random.randint(50, size = (len(df)))

In [12]:
df['abundance'] = fake_abundance
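
The simulated abundances above change on every run because no seed is set. A minimal sketch of a reproducible variant, assuming a fixed seed and a hypothetical helper `simulate_abundance`; the seed, the count range and the species names are illustrative choices, not values from the project:

```python
import numpy as np
import pandas as pd

def simulate_abundance(n_species, max_count=50, seed=42):
    """Draw one integer abundance per species. Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so counts fall in [0, max_count)
    return rng.integers(0, max_count, size=n_species)

# Toy table standing in for the species GeoDataFrame
demo = pd.DataFrame({"species": ["sp_a", "sp_b", "sp_c"]})
demo["abundance"] = simulate_abundance(len(demo))
print(demo)
```

With a fixed seed, the same fake abundances are produced on every run, which makes the downstream index values comparable between sessions.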

In [24]:
def shannon(roi, gdf, grid_gdf, gdf_col_name):
    """
    This function calculates the Shannon Index per grid cell
    pi = (n/N): where n is the abundance number per species and N is the total abundance number in the dataset 
    
    input(s):
    roi <shapely polygon in CRS WGS84:EPSG 4326>: region of interest or the total project area
    gdf <geopandas dataframe>: contains at least the name of the species, the distribution polygons of each of them 
                             : and their abundance
    grid_gdf <geopandas dataframe>: consists of polygons of grids typically generated by the gridding function
                                  : contains at least a geometry column and a unique grid_id
    gdf_col_name <string>: corresponds to the name of the abundance information column in the gdf
    
    output(s):
    grid_gdf <geopandas dataframe>: the input grid with an additional column ('Shannon') containing the
                                  : Shannon index of that grid or geometry
    """
    
    #Clip the species geometries to the ROI
    gdf = gpd.clip(gdf.set_crs(epsg=4326, allow_override=True), roi)

    #This function calculates the sum of all abundances of overlapping species
    overlap = sum_values(gdf, str(gdf_col_name))

    #Merge the summed values of overlapping geometries with the grid gdf
    merged = gpd.sjoin(overlap, grid_gdf, how='left')
    merged['n_value'] = overlap['sum_overlaps']
    
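    #Calculate the pi value per row
    pi = merged['n_value']/np.sum(merged['n_value'])
    pi = pi.fillna(0)
    #Replace pi = 0 with 1 inside the log so that empty rows contribute 0 instead of 0 * -inf = NaN
    merged['pilogpi'] = pi*np.log(pi.where(pi > 0, 1.0))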

    #Dissolve the DataFrame by 'index_right' and aggregate using the calculated Shannon entropy
    dissolve = merged.dissolve(by="index_right", aggfunc={'pilogpi': 'sum'})
    
    #Calculate the Shannon index per grid
    dissolve['pilogpi'] = (-1)*dissolve['pilogpi']

    #Write the Shannon index into the matching grid cells
    grid_gdf.loc[dissolve.index, 'Shannon'] = dissolve.pilogpi.values

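    #Normalization factor: Shannon index of each cell relative to the maximum over the grid,
    #stored in a column so that it is not discarded when the function returns
    grid_gdf['Shannon_norm'] = grid_gdf['Shannon']/grid_gdf['Shannon'].max()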
    
    return grid_gdf

In [25]:
shannon(ACMC, df, grid, 'abundance')
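
The function computes a normalization factor by dividing each cell's Shannon value by the maximum over the grid. A minimal standalone sketch of that max-normalization on toy values, assuming a hypothetical helper `normalize_by_max`; the numbers are made up and are not results from the ACMC grid:

```python
import numpy as np
import pandas as pd

def normalize_by_max(values):
    """Divide each value by the maximum, ignoring NaN. Hypothetical helper for illustration."""
    s = pd.Series(values, dtype=float)
    peak = s.max()  # NaN entries are skipped by default
    if pd.isna(peak) or peak == 0:
        # Nothing to scale by: return zeros (NaN entries stay NaN)
        return s * 0.0
    return s / peak

toy_shannon = [0.0, 0.5, 1.5, np.nan]  # toy per-cell values; NaN marks a cell without data
normalized = normalize_by_max(toy_shannon)
print(normalized.tolist())
```

The cell with the highest index maps to 1, and cells without any species data remain NaN rather than being treated as zero diversity.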