# Open Earth Foundation
# Ocean Biodiversity Credit Data Framework

This is the first notebook in a series of five that explains, step by step, how to calculate each modulating factor and assign credits for the [Marine Biodiversity Credits methodology](https://zenodo.org/records/10182712), applied to the Cocos Marine Conservation Area of Costa Rica.

<h1> Step 1: curate the IUCN data in the Eastern Tropical Pacific </h1>

This notebook shows the first step in getting:
- species distribution
- species statuses

and the pre-processing that goes along with it. 

## Data sources

The data needed for this project are available in the Ocean Program S3 bucket.

**Species information**

At this time, this data must be requested from IUCN and downloaded manually.

The downloaded files are available in our public S3 bucket:
https://ocean-program.s3.amazonaws.com/data/raw/IUCN_RedList_CentralPacific/

**Geospatial information**

The geospatial shapefiles were downloaded from SNIT - CR. (link?)
As our case study and example for these data pipelines, we use the Area de Conservación Marina Isla del Coco (ACMC).

ACMC: https://ocean-program.s3.amazonaws.com/data/raw/MPAs/ACMC.geojson

Section 2 shows how to load this file.

## 1. Loading libraries

In [None]:
import os
import glob
import boto3

import numpy as np
import pandas as pd
import geopandas as gpd
import concurrent.futures

import fiona; #help(fiona.open)

import seaborn as sns
from shapely.geometry import Point
import matplotlib.pyplot as plt

## 2. Get the conservation area, i.e. the area of interest

**Cocos Island Coordinates**

Cocos Island is located at 05°31′41″N; 87°03′40″W

In [None]:
Cocos_lat = 5+31/60+41/3600
Cocos_lon = -(87+3/60+40/3600)

In [None]:
Cocos = Point(Cocos_lon, Cocos_lat)
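
The conversion from degrees, minutes and seconds to decimal degrees used above can be wrapped in a small helper. This is a minimal sketch; the function name `dms_to_decimal` and the `_dd` variable names are illustrative and not part of the original notebook.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention
    return -value if hemisphere.upper() in ("S", "W") else value


# Cocos Island: 05°31′41″N; 87°03′40″W
cocos_lat_dd = dms_to_decimal(5, 31, 41, "N")
cocos_lon_dd = dms_to_decimal(87, 3, 40, "W")
print(round(cocos_lat_dd, 6), round(cocos_lon_dd, 6))
```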

**Import entire ACMC**

ACMC = Cocos Marine Conservation Area

In [None]:
ACMC = gpd.read_file('https://ocean-program.s3.amazonaws.com/data/raw/MPAs/ACMC.geojson')

Inspect the files and their Coordinate Reference Systems (CRS).

In [None]:
ACMC

Let's check the coordinate reference system:

In [None]:
ACMC.crs

### Plot to visually inspect the data.

In [None]:
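# Note: `gpd.datasets` was deprecated in GeoPandas 0.14 and removed in 1.0.
# On newer versions, download the Natural Earth countries layer and pass its path here.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))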

In [None]:
fig, ax = plt.subplots()

ax.set_aspect('equal')

world.plot(ax=ax, color='white', edgecolor='black')

ACMC.plot(ax=ax, alpha = 0.35, color = 'turquoise', label = 'ACMC')

ax.scatter(Cocos.x, Cocos.y, c = 'r', label = 'Cocos Island')

ax.set_xlim((-95, -75))
ax.set_ylim((0, 12.5))
ax.legend();
plt.show();

In [None]:
print("\nTotal Area, ACMC:")
print("{:0.2f}".format(ACMC.area.item()) + " sqdeg.")
print("{:,.2f}".format(ACMC.to_crs(crs=31970).area.item()*10**(-6)) + " sqkm in CRS 31970.")
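
As a rough cross-check of the two numbers printed above, square degrees can be converted to square kilometres with a spherical approximation. This is only a sanity check under stated assumptions (a spherical Earth with a mean radius, and an area small enough that one latitude is representative); it is not a substitute for reprojecting to a projected CRS. The function name is illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumption: spherical Earth)


def approx_sqdeg_to_sqkm(area_sqdeg, latitude_deg):
    """Approximate the area in km^2 of `area_sqdeg` square degrees near `latitude_deg`."""
    km_per_deg = math.pi / 180 * EARTH_RADIUS_KM
    # A degree of longitude shrinks with the cosine of the latitude
    return area_sqdeg * km_per_deg**2 * math.cos(math.radians(latitude_deg))


# One square degree near the latitude of Cocos Island (about 5.53°N)
one_sqdeg_at_cocos = approx_sqdeg_to_sqkm(1.0, 5.528)
print("{:,.0f} sqkm per sqdeg".format(one_sqdeg_at_cocos))
```

Multiplying the square-degree area by this factor should land in the same order of magnitude as the projected area; a large mismatch would suggest a CRS problem.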

## 3. Get the species data

### Data gathering for the distribution range

This first step is the pre-processing to combine the ~7GB data downloaded from IUCN into a single shapefile that only covers the species within the ACMC. 

The outcome of this step has been saved in https://ocean-program.s3.amazonaws.com/data/processed/ACMC_IUCN_RedList/

List all of the .shp files.

In [None]:
# Initialize the S3 resource
s3 = boto3.resource('s3')

# Select the bucket that holds the raw IUCN shapefiles
bucket = s3.Bucket('ocean-program')

keys = [obj.key for obj in bucket.objects.filter(Prefix='data/raw/IUCN_RedList_CentralPacific/')]

# Keep only the .shp files (endswith avoids matching sidecar files such as .shp.xml)
fnames = [f's3://ocean-program/{key}' for key in keys if key.endswith('.shp')]
print(np.sort(fnames))
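
When selecting shapefiles from a key listing, a substring test and a suffix test can give different results, because shapefiles often ship with sidecar files. The sketch below uses made-up object keys (they are not the actual contents of the bucket) to show the difference.

```python
from pathlib import PurePosixPath

# Hypothetical object keys, for illustration only
keys = [
    "data/raw/example/part_1.shp",
    "data/raw/example/part_1.shp.xml",
    "data/raw/example/part_1.shx",
    "data/raw/example/part_1.dbf",
]

# Substring test: also picks up the .shp.xml metadata file
substring_matches = [k for k in keys if ".shp" in k]

# Suffix test: only keys whose final extension is .shp
suffix_matches = [k for k in keys if PurePosixPath(k).suffix == ".shp"]

print(len(substring_matches), len(suffix_matches))
```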

### Turn the IUCN data into a geopandas dataframe

In [None]:
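# Read a shapefile and label it with the given EPSG code
def read_file(file, crs):

    gdf = gpd.read_file(file)
    # set_crs only assigns the CRS (overriding any existing one); it does not reproject
    gdf = gdf.set_crs(epsg=crs, allow_override=True)

    return gdf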

Start with the first data file to get column headers.

In [None]:
fname = fnames[0]
fname

In [None]:
gdf = read_file(fname, 4326)

In [None]:
print("The dataframe has " + str(len(gdf)) + " rows.")

In [None]:
df1 = gpd.GeoDataFrame(columns = gdf.columns)

In [None]:
df1

<h3> Overlap with Marine Protected Area spatial boundaries </h3>

In this case, we use the Area de Conservación Marina Isla del Coco (ACMC) as an example.

We now want to filter the dataframe to only keep rows that overlap with our area of interest, `ACMC`.

Note: some rows in `gdf` contain invalid geometries that break boolean filtering. For example,

`df[df.overlaps(ACMC)]` or `df.loc[:][df.loc[:].overlaps(ACMC)]`

gives the following error:
```
TopologicalError: The operation 'GEOSOverlaps_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x16b4b0880>
```

To avoid this, apply `.buffer(0)` to the geometry before testing it. This takes a long time, however.
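
A minimal sketch of what `.buffer(0)` does to an invalid geometry, using a deliberately self-intersecting toy polygon rather than IUCN data. The names `bowtie` and `repaired` are illustrative.

```python
from shapely.geometry import Polygon

# A self-intersecting "bowtie" ring: a deliberately invalid polygon
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
print(bowtie.is_valid)

# A zero-width buffer rebuilds the geometry, and the result is valid
repaired = bowtie.buffer(0)
print(repaired.is_valid)

# The repair can change the shape, so it is worth comparing areas afterwards
print(bowtie.area, repaired.area)
```

Because the repaired geometry is not guaranteed to cover the same region as the original, it seems reasonable to apply the repair only where it is needed and to spot-check the affected rows.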


We therefore loop over `gdf` row by row, falling back to `.buffer(0)` only for the rows that raise an error, to extract the rows that overlap with `ACMC`.

See the python notebook / html file on data wrangling for this project. 

---------------------------------------------------------------------------------------------------------------------
Run the next cell only for AMMB and later for PNIC for all the 0..4 files

---------------------------------------------------------------------------------------------------------------------

In [None]:
area_of_conservation = ACMC.geometry.item()

In [None]:
area_of_conservation

In [None]:
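import time

start = time.perf_counter()  # wall-clock timer; timeit.timeit() would benchmark a no-op instead
print("We start with a df of length " + str(len(gdf)))
keep = []  # positions of the rows that overlap the conservation area
for idj in range(0, len(gdf)):
    geom = gdf.geometry.iloc[idj]
    try:
        hit = geom.overlaps(area_of_conservation)
    except Exception:
        try:
            # Retry on a repaired copy when the geometry is invalid
            hit = geom.buffer(0).overlaps(area_of_conservation)
        except Exception:
            print("Issue at row " + str(idj))
            hit = False
    if hit:
        keep.append(idj)
# DataFrame.append was removed in pandas 2.0, so concatenate the kept rows in one go
df1 = gpd.GeoDataFrame(pd.concat([df1, gdf.iloc[keep]]), geometry="geometry", crs=gdf.crs)
end = time.perf_counter()
print("We end with a df of length " + str(len(df1)) + " and it took:")
print(end - start)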

Now we append the rest of the files in fnames.

In [None]:
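import time

for fname in fnames[1:]:
    gdf = read_file(fname, 4326)
    print("The dataframe has " + str(len(gdf)) + " rows.")
    print(gdf.crs)
    start = time.perf_counter()  # wall-clock timer

    keep = []  # positions of the rows that overlap the conservation area
    for idj in range(0, len(gdf)):
        geom = gdf.geometry.iloc[idj]
        try:
            hit = geom.overlaps(area_of_conservation)
        except Exception:
            try:
                # Retry on a repaired copy when the geometry is invalid
                hit = geom.buffer(0).overlaps(area_of_conservation)
            except Exception:
                print("Issue at row " + str(idj))
                hit = False
        if hit:
            keep.append(idj)
    # DataFrame.append was removed in pandas 2.0, so concatenate the kept rows instead
    df1 = gpd.GeoDataFrame(pd.concat([df1, gdf.iloc[keep]]), geometry="geometry", crs=gdf.crs)
    end = time.perf_counter()
    print("We end with a df of length " + str(len(df1)) + " and it took:")
    print(end - start)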
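
One detail worth checking in the loops above: in shapely, `overlaps` is stricter than the everyday meaning of the word. It returns `False` when one geometry completely contains the other, whereas `intersects` returns `True` whenever the two share any point. The sketch below uses toy rectangles (not IUCN ranges) to show the difference. Whether any species range in this dataset fully encloses the ACMC is not something this notebook establishes, so treat this as a check to run, not a correction.

```python
from shapely.geometry import box

protected_area = box(0, 0, 1, 1)      # stand-in for the conservation area
partial_range = box(0.5, 0.5, 2, 2)   # crosses the area's boundary
enclosing_range = box(-5, -5, 5, 5)   # contains the whole area

partial_overlaps = partial_range.overlaps(protected_area)
enclosing_overlaps = enclosing_range.overlaps(protected_area)
enclosing_intersects = enclosing_range.intersects(protected_area)

print(partial_overlaps, enclosing_overlaps, enclosing_intersects)
```

If wide-ranging species turn out to be missing from `df1`, comparing the row counts obtained with `overlaps` and with `intersects` would show whether this predicate is the cause.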

In [None]:
type(df1)

In [None]:
len(df1)

In [None]:
df1.head()

In [None]:
df1.to_file('gdf_species_in_ACMC.shp') 

`gdf_species_in_ACMC.shp` is the saved output. It is also on the Drive and can be retrieved with:

In [None]:
#df = gpd.read_file('s3://ocean-program/data/processed/ACMC_IUCN_RedList/gdf_species_in_ACMC.shp')

In [None]:
df1 = df1.reset_index()

In [None]:
print("There are " + str(len(df1)) + " rows (" + str(df1.BINOMIAL.nunique()) + " unique species) in this dataset.")
print("The dates span " + str(df1.YEAR.min()) + " to " + str(df1.YEAR.max()))

<h2> 3.2 Get the conservation status </h2>

This was also downloaded manually, following a request to IUCN.

In [None]:
stat = pd.read_csv('s3://ocean-program/data/raw/IUCN_RedList_CentralPacific/IUCN status - redlist_species_data_a5560fc7-ec95-45c9-8c1f-364584e4173d/assessments.csv')
stat.head()

<h2> 3.3 Append conservation status to list of species & distribution </h2>

We make copies to be safe.

In [None]:
df = df1.copy()

In [None]:
df.head()

In [None]:
df_nonan= df.copy()

In [None]:
df_nonan['BINOMIAL'].isnull().values.any()

In [None]:
#print("With nan's we have " + str(len(df)) + " rows.")
print("Without nan's we have " + str(len(df_nonan[~df_nonan['BINOMIAL'].isnull()])) + " rows.")

In [None]:
df_nonan = df_nonan[~df_nonan['BINOMIAL'].isnull()]

In [None]:
len(df_nonan)

In [None]:
df_nonan["redlistCategory"] = ""
df_nonan["scientificName"] = ""

In [None]:
scientificName = []
redlistCategory = []
for _, row in df_nonan.iterrows():
    # All assessment rows whose scientific name matches this species
    matches = stat.loc[stat.scientificName == row.BINOMIAL, "redlistCategory"]
    # Use the first match if there is one; otherwise flag the species
    if len(matches) > 0:
        redlistCategory.append(matches.iloc[0])
    else:
        redlistCategory.append("No category found")
    scientificName.append(row.BINOMIAL)

In [None]:
df_nonan["redlistCategory"] = redlistCategory
df_nonan["scientificName"] = scientificName
df_nonan.head()
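
The same lookup can be written without an explicit loop by mapping species names through a name-to-category Series. The sketch below uses small made-up tables standing in for `stat` and `df_nonan`; the species names and categories are placeholders, and keeping the first assessment for a duplicated name is an assumption, not something the methodology specifies.

```python
import pandas as pd

# Toy stand-ins for `stat` and `df_nonan`; the values are made up for illustration
stat_demo = pd.DataFrame({
    "scientificName": ["Species a", "Species b", "Species b"],
    "redlistCategory": ["Vulnerable", "Endangered", "Endangered"],
})
ranges_demo = pd.DataFrame({"BINOMIAL": ["Species a", "Species b", "Species c"]})

# One category per name (assumption: keep the first assessment when names repeat)
lookup = (
    stat_demo.drop_duplicates("scientificName")
    .set_index("scientificName")["redlistCategory"]
)

# Names missing from the lookup become NaN, which is then replaced by a flag
ranges_demo["redlistCategory"] = (
    ranges_demo["BINOMIAL"].map(lookup).fillna("No category found")
)
print(ranges_demo)
```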

As a sanity check, the new `scientificName` column should match `BINOMIAL` in every row:

In [None]:
(df_nonan.scientificName==df_nonan.BINOMIAL).unique()

We have the following conservation statuses:

In [None]:
df_nonan.redlistCategory.unique()

In [None]:
for status in df_nonan.redlistCategory.unique():
    print("There are " + str(len(df_nonan[df_nonan.redlistCategory==status])) + \
          " species with the status " + status)

In [None]:
print("The species with the status Critically Endangered are :")
print(df_nonan[df_nonan.redlistCategory=='Critically Endangered'].BINOMIAL)

- *Carcharhinus longimanus* is the oceanic whitetip shark
- *Eretmochelys imbricata* is the hawksbill sea turtle
- *Pristis pristis* is the largetooth sawfish

### Alternative approach to merging the DFs

In [None]:
# rename column
stat.rename(columns = {'internalTaxonId':'ID_NO'}, inplace=True)
stat.columns, df_nonan.columns

In [None]:
#Convert ID_NO from string to int
df_nonan[['ID_NO']] = df_nonan[['ID_NO']].apply(pd.to_numeric)
df_nonan.dtypes

In [None]:
ACMC_IUCN_df = df_nonan.merge(stat, on=['ID_NO'], how='left')

Check:

In [None]:
ACMC_IUCN_df.scientificName_x.equals(ACMC_IUCN_df.scientificName_y)
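
Beyond comparing the two name columns, pandas can report which rows found no partner in a left merge. The sketch below uses toy frames with made-up values; `validate` and `indicator` are standard `DataFrame.merge` parameters, while the frame names `left` and `right` are illustrative.

```python
import pandas as pd

# Toy stand-ins for the species ranges (left) and the assessments table (right)
left = pd.DataFrame({"ID_NO": [1, 2, 3], "BINOMIAL": ["Species a", "Species b", "Species c"]})
right = pd.DataFrame({"ID_NO": [1, 2], "redlistCategory": ["Vulnerable", "Endangered"]})

merged = left.merge(
    right,
    on="ID_NO",
    how="left",
    validate="many_to_one",  # raises if ID_NO is duplicated in the right table
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Species present in the ranges but absent from the assessments
unmatched = merged.loc[merged["_merge"] == "left_only", "BINOMIAL"].tolist()
print(unmatched)
```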

In [None]:
# Drop unused columns
drop_cols = [
    'redlistCriteria', 'yearPublished', 'assessmentDate', 'criteriaVersion',
    'language', 'rationale', 'habitat', 'threats', 'population', 'populationTrend',
    'range', 'useTrade', 'systems', 'conservationActions', 'realm', 'yearLastSeen',
    'possiblyExtinct', 'possiblyExtinctInTheWild', 'scopes',
    'scientificName_y', 'redlistCategory_y', 'assessmentId',
]
ACMC_IUCN_df.drop(columns=drop_cols, inplace=True)

In [None]:
# Now remove all geometries that are outside the ACMC -> clip does the job
#ACMC_IUCN_df1 = gpd.clip(ACMC_IUCN_df.set_crs(epsg=4326, allow_override=True), ACMC)

In [None]:
ACMC_IUCN_df.columns, len(ACMC_IUCN_df), ACMC_IUCN_df.dtypes

In [None]:
ACMC_IUCN_df.head()

In [None]:
# ACMC_IUCN_df1.drop('index_left', inplace=True, axis=1)
ACMC_IUCN_df1 = ACMC_IUCN_df.reset_index(drop=True).sort_values(['ID_NO'])
ACMC_IUCN_df1

<h1> 4. Saving output </h1>

Excellent! We save the final result as `gdf_ACMC_IUCN_range_status_filtered.shp` under `ACMCC_IUCN_data`. It is also on the Drive.

In [None]:
df_nonan.to_file('gdf_ACMC_IUCN_range_status_filtered.shp') 
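
One caveat when saving to the shapefile format: its attribute table limits field names to 10 characters, so longer column names such as `redlistCategory` and `scientificName` are shortened on write. The sketch below flags the affected names by plain truncation. The exact renaming rule depends on the driver, so the shortened names shown are an approximation, and the column list is an assumed subset of the real dataframe.

```python
MAX_DBF_FIELD_LEN = 10  # field-name limit of the shapefile attribute table

# Assumed subset of the columns in the dataframe being saved
columns = ["ID_NO", "BINOMIAL", "redlistCategory", "scientificName", "geometry"]

# Map each over-long name to its first 10 characters (approximate driver behaviour)
too_long = {c: c[:MAX_DBF_FIELD_LEN] for c in columns if len(c) > MAX_DBF_FIELD_LEN}
print(too_long)
```

If the full column names need to survive a round trip, writing a GeoPackage instead (for example `to_file(..., driver="GPKG")`) is one option, since that format does not have this limit.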