# Protect Marine Life

### This notebook outlines the general workflow for the data within the [Protect Marine Life](https://oceancentral.org/track/protect-marine-life) page of the Ocean Central website.

# Overview Figures

## Figure 1

<p align="center">
  <img src="Figs/marine_life_1.png" style="width:50%;">
</p>


The data are derived from the Marine populations dataset of the [Living Planet Index](https://www.livingplanetindex.org/latest_results).

## Figure 2

<p align="center">
  <img src="Figs/marine_life_2.png" style="width:50%;">
</p>


The data are derived from [Rebuilding Marine Life](https://www.nature.com/articles/s41586-020-2146-7) -- Duarte et al. (2020).

# IUCN Figures

### This section contains code to get marine species, threats, and trends from the IUCN API

In [None]:
# Utility functions and required packages

import requests
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
import pycountry
import time

from dotenv import load_dotenv 
from tqdm import tqdm
from sklearn.linear_model import LinearRegression

# Load environment variables
load_dotenv()
token = os.getenv("IUCN_API_KEY")

# Define the base URL
base_url = "https://api.iucnredlist.org/api/v4/habitats/"

# Set up the headers with the token
headers = {
    "accept": "application/json",
    "Authorization": token
}

# Function to convert country code to country name
def get_country_name(code):
    try:
        return pycountry.countries.get(alpha_2=code).name
    except Exception:  # unknown or missing codes (e.g. NaN) fall through to 'Unknown'
        return 'Unknown'

# Function to get assessments for a given habitat
def get_assessments(habitat_id):
    page = 1
    per_page = 100
    assessments = []
    
    while True:
        print(f"Habitat: {habitat_id}, Page: {page}")
        # Construct the URL with pagination parameters
        url = f"{base_url}{habitat_id}?page={page}&per_page={per_page}"
        
        # Make the GET request
        response = requests.get(url, headers=headers)
        
        # Check if the request was successful
        if response.status_code == 200:
            data = response.json()
            assessments.extend(data.get('assessments', []))
            
            # Check if we have reached the last page
            total_pages = int(response.headers.get('total-pages', 1))
            if page >= total_pages:
                break
            
            # Move to the next page
            page += 1
        else:
            print(f"Failed to retrieve data for habitat {habitat_id}: {response.status_code} - {response.text}")
            break
            
    return assessments

**The code block below retrieves all marine species assessments, i.e. those listed under habitat codes 9, 10, 11, 12, or 13 (the marine habitats) or one of their subcategories. The species IDs (`sis_taxon_id`) and assessment IDs (`assessment_id`) are used by the code that generates all of the figures.**
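
As a minimal sketch of the traversal described above, the habitat identifiers can be enumerated up front, before any request is made. The helper name `habitat_codes` and its `max_sub` parameter are hypothetical, and the assumption that subcategories are numbered 1 to 15 mirrors the loop in the cell below rather than the IUCN documentation.

```python
def habitat_codes(major_habitats, max_sub=15):
    """Return each major habitat code followed by its assumed subcategory codes."""
    codes = []
    for habitat_id in major_habitats:
        codes.append(str(habitat_id))
        # Subcategory codes are assumed to look like "9_1", "9_2", ...
        codes.extend(f"{habitat_id}_{sub_id}" for sub_id in range(1, max_sub + 1))
    return codes


marine_codes = habitat_codes([9, 10, 11, 12, 13])
print(len(marine_codes))   # 5 majors * (1 + 15) = 80 codes
print(marine_codes[:3])    # ['9', '9_1', '9_2']
```

Codes that do not exist on the server simply produce a failed request, which `get_assessments` reports and skips.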

In [None]:
if not os.path.exists("../Data/marine_habitats.csv"):
    # Initialize variables for storing all assessments
    all_assessments = []
    
    # Iterate through major categories and their subcategories
    major_habitats = [9, 10, 11, 12, 13]
    for habitat_id in major_habitats:
        # Get assessments for the main category
        all_assessments.extend(get_assessments(habitat_id))
        
        # Check for subcategories (assuming subcategories range from 1 to 15)
        for sub_id in range(1, 16):
            sub_habitat_id = f"{habitat_id}_{sub_id}"
            all_assessments.extend(get_assessments(sub_habitat_id))
    
    # Extract relevant information for each assessment
    rows = []
    for assessment in all_assessments:
        row = {
            "year_published": assessment.get("year_published"),
            "latest": assessment.get("latest"),
            "sis_taxon_id": assessment.get("sis_taxon_id"),
            "url": assessment.get("url"),
            "assessment_id": assessment.get("assessment_id"),
            "code": assessment.get("code"),
            "code_type": assessment.get("code_type"),
            "scope_description": assessment.get("scopes")[0].get("description").get("en") if assessment.get("scopes") else None,
            "scope_code": assessment.get("scopes")[0].get("code") if assessment.get("scopes") else None,
            "habitat_id": assessment.get("habitat_id"),  # Add habitat_id to the row
        }
        rows.append(row)
    
    # Create a DataFrame
    marine_df = pd.DataFrame(rows)
    marine_df.to_csv("../Data/marine_habitats.csv")
else:
    marine_df = pd.read_csv("../Data/marine_habitats.csv")

# Display the table (a bare expression only renders as the last statement of a cell)
marine_df

**The code block below gets the threats and population trends for the most recent assessment of each marine species identified in the block above.**
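
The cell below first narrows the table to one assessment per species by taking the row with the latest `year_published` in each `sis_taxon_id` group. A small illustration of that step on made-up data (the IDs and years are invented for the example):

```python
import pandas as pd

# Toy data: species 101 was assessed twice, species 202 once
toy = pd.DataFrame({
    "sis_taxon_id": [101, 101, 202],
    "assessment_id": [1, 2, 3],
    "year_published": [2010, 2019, 2015],
})

# idxmax returns the row label of the latest year within each species group
latest_idx = toy.groupby("sis_taxon_id")["year_published"].idxmax()
latest = toy.loc[latest_idx].reset_index(drop=True)

print(latest["assessment_id"].tolist())  # [2, 3]
```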

In [None]:
base_url = "https://api.iucnredlist.org/api/v4/assessment"

# Keep only the most recently published assessment for each species
assessments_list = marine_df.loc[marine_df.groupby('sis_taxon_id')['year_published'].idxmax()].reset_index(drop=True)

# List of assessment IDs
assessment_ids = assessments_list['assessment_id'].dropna().astype(int).values

# Initialize empty lists to hold all the data if not already defined
try:
    all_data
except NameError:
    all_data = []

try:
    all_data_threats
except NameError:
    all_data_threats = []

# Extract processed assessment_ids from all_data
processed_assessment_ids = {entry["assessment_id"] for entry in all_data}

def fetch_data_with_retry(url, headers, retries=5, backoff_factor=0.5):
    for i in range(retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429:
            wait_time = backoff_factor * (2 ** i)
            print(f"Rate limit exceeded. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
        else:
            print(f"Failed to retrieve data: {response.status_code} - {response.text}")
            break
    return None

if not os.path.exists("../Data/all_threats_species.csv"):
    for assessment_id in tqdm(assessment_ids):
        # Skip if the assessment_id is already processed
        if assessment_id in processed_assessment_ids:
            continue
    
        # Construct the URL for the current assessment ID
        url = f"{base_url}/{assessment_id}"
        
        # Fetch the data with retry mechanism
        data = fetch_data_with_retry(url, headers)
        
        if data:
            # Nested fields may be missing or null, so fall back to an empty dict or list
            trend = (data.get("population_trend") or {}).get("code")
            class_name = (data.get("taxon") or {}).get("class_name")
            sis_taxon_id = data.get("sis_taxon_id")
            locations = data.get("locations") or []
            threats = data.get("threats") or []
            status = (data.get("red_list_category") or {}).get("code")
            year_published = data.get("year_published")
    
            # Extract the common English name (if available)
            common_names = data.get("taxon", {}).get("common_names", [])
            english_common_name = next(
                (name["name"] for name in common_names if name.get("language") == "eng"),
                None
            )

            
            # Add the assessment data to the list
            for location in locations:
                country_code = location.get("code")
                assessment_data = {
                    "assessment_id": assessment_id,
                    "sis_taxon_id": sis_taxon_id,
                    "country_code": country_code,
                    "trend": trend,
                    "class_name": class_name,
                    "status": status,
                    "year_published": year_published,
                    "english_common_name": english_common_name  # Store the common English name here
                }
                all_data.append(assessment_data)
    
            # Add the threats data to the list
            for location in locations:
                country_code = location.get("code")
                for threat in threats:
                    threat_code = threat.get("code")
                    threat_data = {
                        "assessment_id": assessment_id,
                        "sis_taxon_id": sis_taxon_id,
                        "country_code": country_code,
                        "threat_code": threat_code,
                        "class_name": class_name,
                        "year_published": year_published,
                        "english_common_name": english_common_name  # Store the common English name here
                    }
                    all_data_threats.append(threat_data)
            
            # Mark the assessment_id as processed
            processed_assessment_ids.add(assessment_id)
    
    # Create DataFrames from the collected data
    trends_df = pd.DataFrame(all_data)
    threats_df = pd.DataFrame(all_data_threats)
    trends_df.to_csv("../Data/all_trends_species.csv")
    threats_df.to_csv("../Data/all_threats_species.csv")
else:
    trends_df = pd.read_csv("../Data/all_trends_species.csv")
    threats_df = pd.read_csv("../Data/all_threats_species.csv")
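
For reference, the waits that `fetch_data_with_retry` applies after repeated 429 responses follow a doubling schedule. The helper below is a hypothetical stand-in that only reproduces the arithmetic with the same default arguments; it makes no requests.

```python
def backoff_schedule(retries=5, backoff_factor=0.5):
    """Wait time in seconds before each retry: backoff_factor * 2**attempt."""
    return [backoff_factor * (2 ** attempt) for attempt in range(retries)]


waits = backoff_schedule()
print(waits)       # [0.5, 1.0, 2.0, 4.0, 8.0]
print(sum(waits))  # 15.5 seconds in the worst case for one assessment
```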

## Figure 3

<p align="center">
  <img src="Figs/marine_life_3.png" style="width:50%;">
</p>


**This code iterates through each species over time, and if a species does not have an assessment for a given year, it assigns that species-year combination the most recent assessment's trend for that species.**
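
The carry-forward rule described above can be illustrated on a toy table. This is a simplified sketch rather than the notebook's implementation: it assumes one row per species and year, uses invented IDs, and relies on `reindex` plus `ffill` instead of an explicit loop over years.

```python
import pandas as pd

# Toy data: species 101 assessed in 2018 and 2021, species 202 in 2020
toy = pd.DataFrame({
    "sis_taxon_id": [101, 101, 202],
    "year_published": [2018, 2021, 2020],
    "trend_title": ["Stable", "Decreasing", "Unknown"],
})

years = range(2018, 2023)

pieces = []
for taxon_id, group in toy.groupby("sis_taxon_id"):
    by_year = group.set_index("year_published")["trend_title"]
    # Add the missing years, carry the last known value forward, and drop
    # the years that precede the species' first assessment
    filled = by_year.reindex(years).ffill().dropna()
    pieces.append(pd.DataFrame({
        "sis_taxon_id": taxon_id,
        "year_published": filled.index,
        "trend_title": filled.values,
    }))

filled_df = pd.concat(pieces, ignore_index=True)
print(filled_df)
```

Species 101 keeps "Stable" for 2019 and 2020 and switches to "Decreasing" from 2021 on, while species 202 has no rows before 2020.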

In [None]:
# Work on a copy so that the column assignments below do not trigger SettingWithCopyWarning
trends_df = trends_df[["assessment_id","year_published","sis_taxon_id","country_code","status","class_name","english_common_name","trend"]].copy()
trends_df['year_published'] = trends_df['year_published'].astype(int)

# Mapping dictionary
threats_mapping = {
    'DD': 'Data Deficient',
    'LC': 'Least Concern',
    'NT': 'Near Threatened',
    'VU': 'Vulnerable',
    'EN': 'Endangered',
    'CR': 'Critically Endangered',
    'EW': 'Extinct in the Wild',
    'EX': 'Extinct',
    'NE': 'Not Evaluated',
}

# Map the status codes to category names in a new 'Threat' column
trends_df['Threat'] = trends_df['status'].replace(threats_mapping)

# Map the population trend codes to labels; the codes may be strings (fresh from the API) or numbers (read from the CSV)
trend_titles = {0: "Increasing", 1: "Decreasing", 2: "Stable", 3: "Unknown"}
trends_df['trend_title'] = pd.to_numeric(trends_df['trend'], errors='coerce').map(trend_titles)

# Initialize an empty list to store new rows
new_rows = []

# Define the range of years
years = range(1997, 2026)

# Iterate through each sis_taxon_id
for sis_taxon_id in tqdm(trends_df['sis_taxon_id'].unique()):
    # Filter data for the current sis_taxon_id
    df_sis = trends_df[trends_df['sis_taxon_id'] == sis_taxon_id]
    
    # Iterate through each year in the defined range
    for year in years:
        # Check if there is an entry for the current year
        if year not in df_sis['year_published'].values:
            # Find the most recent threat categorization from a previous year
            previous_threats = df_sis[df_sis['year_published'] < year]
            if not previous_threats.empty:
                last_threat = previous_threats.iloc[-1]['Threat']
                class_name = previous_threats.iloc[-1]['class_name']
                country_code = previous_threats.iloc[-1]['country_code']
                english_common_name = previous_threats.iloc[-1]['english_common_name']
                trend_title = previous_threats.iloc[-1]['trend_title']
                # Create a new row with the current year, sis_taxon_id, and the most recent threat categorization
                new_row = {
                    'year_published': year,
                    'Threat': last_threat,
                    'sis_taxon_id': sis_taxon_id,
                    'class_name': class_name,
                    "country_code": country_code,
                    "english_common_name": english_common_name,
                    "trend_title": trend_title,
                }
                new_rows.append(new_row)

# Create a DataFrame from the new rows
df_new_rows = pd.DataFrame(new_rows)

# Append the new rows to the original dataframe
df_combined = pd.concat([trends_df, df_new_rows])

# Sort the combined dataframe
df_combined = df_combined.sort_values(by=['sis_taxon_id', 'year_published']).reset_index(drop=True)

# Apply the function to the country_code column
df_combined['country_name'] = df_combined['country_code'].apply(get_country_name)

**We then translate the scientific class names into general categories (fish, mammals, reptiles, and birds). We also filter to species whose Red List category is Endangered, Critically Endangered, or Extinct in the Wild.**
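
A small illustration of the two steps on made-up rows. The mapping here is deliberately abbreviated and the variable names are hypothetical; classes missing from the mapping (a coral class in this toy table) become NaN and are dropped.

```python
import pandas as pd

# Abbreviated, illustrative mapping (the full mapping is defined in the cell below)
demo_mapping = {"AVES": "Birds", "MAMMALIA": "Mammals"}

toy = pd.DataFrame({
    "class_name": ["AVES", "MAMMALIA", "ANTHOZOA", "AVES"],
    "Threat": ["Endangered", "Least Concern", "Endangered", "Vulnerable"],
})

# Unmapped classes become NaN and are removed
toy["category"] = toy["class_name"].map(demo_mapping)
toy = toy.dropna(subset=["category"])

# Keep only the three threat levels used for the figures
threat_levels_demo = ["Critically Endangered", "Endangered", "Extinct in the Wild"]
threatened = toy[toy["Threat"].isin(threat_levels_demo)]
print(threatened)
```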

In [None]:
# Define the mapping
class_to_category = {
    'ACTINOPTERYGII': 'Fish', 
    'CHONDRICHTHYES': 'Fish', 
    'SARCOPTERYGII': 'Fish', 
    'PETROMYZONTI': 'Fish',
    'MYXINI': 'Fish',
    'CEPHALASPIDOMORPHI': 'Fish',
    'MAMMALIA': 'Mammals', 
    'REPTILIA': 'Reptiles', 
    'AVES': 'Birds'
}

df_combined = df_combined.drop_duplicates(["sis_taxon_id","year_published"]).merge(marine_df.drop_duplicates("sis_taxon_id")[["sis_taxon_id"]],on='sis_taxon_id',how='inner')

# Map the class names to categories
df_combined['category'] = df_combined['class_name'].map(class_to_category)

# Remove rows where category is NaN (those not in the specified classes)
df_combined = df_combined.dropna(subset=['category'])

# Create a pivot table for the total counts of species across all years and categories
pivot_total = df_combined.pivot_table(index='year_published', columns='category', aggfunc='size', fill_value=0)

# Filter the data for specific threats
filtered_df = df_combined[df_combined['Threat'].isin(['Critically Endangered', 'Endangered', 'Extinct in the Wild'])]

# Create a pivot table for the filtered (threatened) species
pivot_filtered = filtered_df.pivot_table(index='year_published', columns='category', aggfunc='size', fill_value=0)

pivot_filtered = pivot_filtered.query("year_published > 2009")
pivot_total = pivot_total.query("year_published > 2009")

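**We then calculate, for each year, the share of assessed species in each category that are threatened (Endangered, Critically Endangered, or Extinct in the Wild), together with a linear trend weighted by the number of species assessed. This information is used in Figure 3 of the Protect Marine Life section.**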
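
The notebook fits the trend with scikit-learn's `LinearRegression` and its `sample_weight` argument. As a dependency-light illustration of the same idea (not the notebook's code), the sketch below fits a count-weighted line with NumPy on invented numbers. Note that `np.polyfit` applies its weights to the residuals before squaring, so the square root of the counts is passed to weight each squared residual by the count.

```python
import numpy as np

years = np.array([2010, 2011, 2012, 2013], dtype=float)
threatened = np.array([10, 24, 42, 64], dtype=float)   # invented counts
assessed = np.array([100, 200, 300, 400], dtype=float)  # invented counts

share = threatened / assessed  # share of assessed species that are threatened

# Center the years to keep the fit well conditioned
x = years - years.min()

# Weighted least squares: minimizes sum(assessed * (share - fit)**2)
slope, intercept = np.polyfit(x, share, deg=1, w=np.sqrt(assessed))
trend = slope * x + intercept

print(round(slope, 4))  # change in threatened share per year
```

Weighting by the number of assessed species gives years with more assessments more influence on the fitted line.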

In [None]:
# Define the threats of interest
threat_levels = ['Critically Endangered', 'Endangered', 'Extinct in the Wild']

X_years = pivot_filtered.reset_index()["year_published"].values.reshape([-1, 1])

# Define a function to compute the linear trend for a given species category
def compute_trend(species, pivot_filtered, pivot_total, X_years):
    # Calculate the percentage and count for the given species
    y_percents = pivot_filtered[species].values / pivot_total[species].values
    y_counts = pivot_total[species].values
    
    # Filter out rows where y_percents or y_counts are NaN
    mask = ~np.isnan(y_percents) & ~np.isnan(y_counts)
    X_filtered = X_years[mask]
    y_percents_filtered = y_percents[mask]
    y_counts_filtered = y_counts[mask]
    
    # Initialize and fit the linear regression model on filtered data
    linear_model = LinearRegression()
    linear_model.fit(X_filtered, y_percents_filtered, sample_weight=y_counts_filtered)
    
    # Predict using the full X_years, but only return the valid predictions
    y_pred = np.full_like(y_percents, np.nan, dtype=np.float64)  # Initialize with NaN
    y_pred[mask] = linear_model.predict(X_filtered)
    
    return y_pred

pivot_percents = pivot_filtered/pivot_total

# Count threatened species (all threat levels of interest combined) by year and category
pivot_threat_total = df_combined[df_combined['Threat'].isin(threat_levels)].pivot_table(
    index='year_published', 
    columns='category', 
    aggfunc='size', 
    fill_value=0
).query("year_published > 2009")

# Calculate individual trends for Birds, Fish, Mammals, and Reptiles, considering all threat levels
pivot_percents['Bird_Linear_Trend'] = compute_trend("Birds", pivot_threat_total, pivot_total, X_years)
pivot_percents['Fish_Linear_Trend'] = compute_trend("Fish", pivot_threat_total, pivot_total, X_years)
pivot_percents['Mammal_Linear_Trend'] = compute_trend("Mammals", pivot_threat_total, pivot_total, X_years)
pivot_percents['Reptile_Linear_Trend'] = compute_trend("Reptiles", pivot_threat_total, pivot_total, X_years)

# Calculate the overall trend for all categories combined
y_percents_all = (pivot_threat_total["Birds"] + pivot_threat_total["Fish"] + 
                  pivot_threat_total["Mammals"] + pivot_threat_total["Reptiles"]).values / \
                 (pivot_total["Birds"] + pivot_total["Fish"] + pivot_total["Mammals"] + pivot_total["Reptiles"]).values
y_counts_all = (pivot_total["Birds"] + pivot_total["Fish"] + pivot_total["Mammals"] + pivot_total["Reptiles"]).values

# Filter out NaNs for the overall trend
mask_all = ~np.isnan(y_percents_all) & ~np.isnan(y_counts_all)
X_filtered_all = X_years[mask_all]
y_percents_filtered_all = y_percents_all[mask_all]
y_counts_filtered_all = y_counts_all[mask_all]

# Fit the linear regression model for the overall trend
linear_model = LinearRegression()
linear_model.fit(X_filtered_all, y_percents_filtered_all, sample_weight=y_counts_filtered_all)
# Build the prediction array first to avoid chained assignment on the DataFrame column
all_trend = np.full_like(y_percents_all, np.nan, dtype=np.float64)
all_trend[mask_all] = linear_model.predict(X_filtered_all)
pivot_percents['All_Linear_Trend'] = all_trend
pivot_percents['All'] = y_percents_all

# Define the threats of interest
threat_levels = ['Critically Endangered', 'Endangered', 'Extinct in the Wild']

# Initialize an empty dictionary to hold pivot tables by threat level
pivot_by_threat = {}

# Create pivot tables for each threat level by species category
for threat in threat_levels:
    filtered_df_threat = df_combined[df_combined['Threat'] == threat]
    pivot_by_threat[threat] = filtered_df_threat.pivot_table(
        index='year_published', 
        columns='category', 
        aggfunc='size', 
        fill_value=0
    )

# Add threat percentages for each species and threat level
for threat in threat_levels:
    for species in ["Birds", "Fish", "Mammals", "Reptiles"]:
        # Calculate the percentage of the species category for this threat level
        try:
            pivot_percents[f"{species}_{threat}"] = pivot_by_threat[threat][species] / pivot_total[species]
        except KeyError:  # no species of this category at this threat level
            pivot_percents[f"{species}_{threat}"] = 0

# Columns exported for Figure 3
export_columns = [
    'year_published', 'All_Linear_Trend', 'All', 'Bird_Linear_Trend', 'Birds', 
    'Fish_Linear_Trend', 'Fish', 'Mammal_Linear_Trend', 'Mammals', 
    'Reptile_Linear_Trend', 'Reptiles',
    'Birds_Critically Endangered', 'Birds_Endangered', 'Birds_Extinct in the Wild',
    'Fish_Critically Endangered', 'Fish_Endangered', 'Fish_Extinct in the Wild',
    'Mammals_Critically Endangered', 'Mammals_Endangered', 'Mammals_Extinct in the Wild',
    'Reptiles_Critically Endangered', 'Reptiles_Endangered', 'Reptiles_Extinct in the Wild'
]

# Fill any remaining missing values with 0
subset = pivot_percents.reset_index()[export_columns].fillna(0)

# Save as CSV and JSON
subset.to_csv("../Data/protect_marine_life_1.csv")
subset.to_json("../Data/Figure_3.json", orient="records", indent=2)

# Display the updated DataFrame with linear trends and threat percentages
pivot_percents.fillna(0)

## Figure 4

<p align="center">
  <img src="Figs/marine_life_4.png" style="width:50%;">
</p>


**This code counts the threatened marine species (Endangered, Critically Endangered, or Extinct in the Wild) in 2025 by country and category.**

In [None]:
threat_levels = ['Critically Endangered', 'Endangered', 'Extinct in the Wild']

df_subset = df_combined[df_combined['Threat'].isin(threat_levels)]

# Count threatened species by country and category
counts = df_subset.query("year_published == 2025") \
    .groupby(["country_name","country_code","category"]) \
    .size() \
    .reset_index(name="count")

# Save to JSON
counts.to_json("../Data/Figure_4.json", orient="records", indent=2)

## Figure 5

<p align="center">
  <img src="Figs/marine_life_5.png" style="width:50%;">
</p>

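**This code computes the distribution of population trends (Increasing, Decreasing, Stable, Unknown) across marine species in 2025, both within each category and overall.**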
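
The percent-within-category calculation in the cell below can be illustrated on a toy table (the rows are invented for the example). `transform("sum")` returns one total per row, aligned with the original rows, so the division works element-wise.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["Fish", "Fish", "Fish", "Birds"],
    "trend_title": ["Decreasing", "Decreasing", "Stable", "Unknown"],
})

# Count species by trend and category
demo_counts = toy.groupby(["trend_title", "category"]).size().reset_index(name="count")

# Divide each count by the total for its category
category_totals = demo_counts.groupby("category")["count"].transform("sum")
demo_counts["percent"] = demo_counts["count"] / category_totals * 100

print(demo_counts)
```

Within each category the percentages sum to 100; here Fish splits into two thirds "Decreasing" and one third "Stable".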

In [None]:
trends_df = df_combined.query("year_published==2025")[['category','sis_taxon_id','country_name','country_code','trend_title']]
trends_df.to_csv("../Data/protect_marine_life_2.csv")

# Count rows by trend_title and category
counts = (
    df_combined.query("year_published == 2025")
    .groupby(["trend_title", "category"])
    .size()
    .reset_index(name="count")
)

# Compute total counts per category
category_totals = counts.groupby("category")["count"].transform("sum")

# Add percent within each category
counts["percent"] = counts["count"] / category_totals * 100

# Save to JSON
counts.to_json("../Data/Figure_5_by_species.json", orient="records", indent=2)

# Count rows by trend_title across all categories
counts = df_combined.query("year_published == 2025") \
    .groupby(["trend_title"]) \
    .size() \
    .reset_index(name="count")

# Add percent of all species
counts["percent"] = counts["count"] / counts["count"].sum() * 100

# Save to JSON
counts.to_json("../Data/Figure_5_all.json", orient="records", indent=2)

In [None]:
counts

## Figures 6 and 8

<p align="center">
  <img src="Figs/marine_life_6.png" style="width:50%;">
</p>

<p align="center">
  <img src="Figs/marine_life_8.png" style="width:50%;">
</p>

**This code iterates through each species over time, and if a species does not have an assessment for a given year, it assigns that species-year combination the most recent assessment's threat for that species.**

In [None]:
# Top-level IUCN threat categories, keyed by the leading number of the threat code
THREAT_CATEGORIES = {
    '1': 'Residential & commercial development',
    '2': 'Agriculture & aquaculture',
    '3': 'Energy production & mining',
    '4': 'Transportation & service corridors',
    '5': 'Biological resource use',
    '6': 'Human intrusions & disturbance',
    '7': 'Natural system modifications',
    '8': 'Invasive and other problematic species, genes & diseases',
    '9': 'Pollution',
    '10': 'Geological events',
    '11': 'Climate change & severe weather',
    '12': 'Other',
}

# Define a function to map threat_code to Threat
def map_threat(threat_code):
    # Missing values read back from the CSV are not strings
    if not isinstance(threat_code, str):
        return 'Unknown'
    # The part before the first underscore identifies the top-level category
    return THREAT_CATEGORIES.get(threat_code.split('_')[0], 'Unknown')

threats_df = pd.read_csv("../Data/all_threats_species.csv")
threats_df['year_published'] = threats_df['year_published'].astype(int)

# Apply the function to create the new Threat column
threats_df['Threat'] = threats_df['threat_code'].apply(map_threat)

# Initialize an empty list to store new rows
new_rows = []

# Define the range of years
years = range(1997, 2026)

# Iterate through each sis_taxon_id
for sis_taxon_id in tqdm(threats_df['sis_taxon_id'].unique()):
    # Filter data for the current sis_taxon_id
    df_sis = threats_df[threats_df['sis_taxon_id'] == sis_taxon_id]
    
    # Iterate through each year in the defined range
    for year in years:
        # Check if there is an entry for the current year
        if year not in df_sis['year_published'].values:
            # Find the most recent threat categorization from a previous year
            previous_threats = df_sis[df_sis['year_published'] < year]
            if not previous_threats.empty:
                last_threat = previous_threats.iloc[-1]['Threat']
                class_name = previous_threats.iloc[-1]['class_name']
                country_code = previous_threats.iloc[-1]['country_code']
                english_common_name = previous_threats.iloc[-1]['english_common_name']
                # Create a new row with the current year, sis_taxon_id, and the most recent threat categorization
                new_row = {
                    'year_published': year,
                    'Threat': last_threat,
                    'sis_taxon_id': sis_taxon_id,
                    'class_name': class_name,
                    "country_code": country_code,
                    "english_common_name": english_common_name,
                }
                new_rows.append(new_row)

# Create a DataFrame from the new rows
df_new_rows = pd.DataFrame(new_rows)

# Append the new rows to the original dataframe
threats_combined = pd.concat([threats_df, df_new_rows])

# Sort the combined dataframe
threats_combined = threats_combined.sort_values(by=['sis_taxon_id', 'year_published']).reset_index(drop=True)

# Map the class names to categories
threats_combined['category'] = threats_combined['class_name'].map(class_to_category)

# Apply the function to the country_code column
threats_combined['country_name'] = threats_combined['country_code'].apply(get_country_name)

threats_df = threats_combined.query("year_published==2024")[['category','country_name','country_code','Threat']]

In [None]:
# Count rows by threat and category
counts = threats_df \
    .groupby(["Threat","category"]) \
    .size() \
    .reset_index(name="count")

# Save to JSON
counts.to_json("../Data/Figure_6.json", orient="records", indent=2)

# Count rows by country, threat, and category
counts = threats_df \
    .groupby(["country_name","country_code","Threat","category"]) \
    .size() \
    .reset_index(name="count")

# Save to JSON
counts.to_json("../Data/Figure_8.json", orient="records", indent=2)

## Figures 7 and 9

**This figure takes the most recent trends for each marine species and shows how they are distributed across countries globally.**

<p align="center">
  <img src="Figs/marine_life_7.png" style="width:50%;">
</p>

<p align="center">
  <img src="Figs/marine_life_9.png" style="width:50%;">
</p>

In [None]:
from tqdm import tqdm

trends_df = pd.read_csv("../Data/all_trends_species.csv")
# Work on a copy so that the column assignments below do not trigger SettingWithCopyWarning
trends_df = trends_df[["assessment_id","sis_taxon_id","year_published","country_code","status","class_name","english_common_name","trend"]].copy()

# Mapping dictionary
threats_mapping = {
    'DD': 'Data Deficient',
    'LC': 'Least Concern',
    'NT': 'Near Threatened',
    'VU': 'Vulnerable',
    'EN': 'Endangered',
    'CR': 'Critically Endangered',
    'EW': 'Extinct in the Wild',
    'EX': 'Extinct',
    'NE': 'Not Evaluated',
}

# Replace the values in the 'Threat' column
trends_df['Threat'] = trends_df['status'].replace(threats_mapping)

trends_df.loc[trends_df["trend"] == 0, 'trend_title'] = "Increasing"
trends_df.loc[trends_df["trend"] == 1, 'trend_title'] = "Decreasing"
trends_df.loc[trends_df["trend"] == 2, 'trend_title'] = "Stable"
trends_df.loc[trends_df["trend"] == 3, 'trend_title'] = "Unknown"

trends_df['year'] = trends_df['year_published']

# Initialize an empty list to store new rows
new_rows = []

# Define the range of years
years = range(1997, 2026)

# Iterate through each sis_taxon_id
for sis_taxon_id in tqdm(trends_df['sis_taxon_id'].unique()):
    # Filter data for the current sis_taxon_id
    df_sis = trends_df[trends_df['sis_taxon_id'] == sis_taxon_id]
    
    # Iterate through each year in the defined range
    for year in years:
        # Check if there is an entry for the current year
        if year not in df_sis['year_published'].values:
            # Find the most recent threat categorization from a previous year
            previous_threats = df_sis[df_sis['year_published'] < year]
            if not previous_threats.empty:
                last_threat = previous_threats.iloc[-1]['Threat']
                class_name = previous_threats.iloc[-1]['class_name']
                country_code = previous_threats.iloc[-1]['country_code']
                english_common_name = previous_threats.iloc[-1]['english_common_name']
                trend_title = previous_threats.iloc[-1]['trend_title']
                # Create a new row with the current year, sis_taxon_id, and the most recent threat categorization
                new_row = {
                    'year_published': previous_threats.iloc[-1]['year_published'],
                    'year': year,
                    'Threat': last_threat,
                    'sis_taxon_id': sis_taxon_id,
                    'class_name': class_name,
                    "country_code": country_code,
                    "english_common_name": english_common_name,
                    "trend_title": trend_title,
                }
                new_rows.append(new_row)

# Create a DataFrame from the new rows
df_new_rows = pd.DataFrame(new_rows)

# Append the new rows to the original dataframe
df_combined = pd.concat([trends_df, df_new_rows])

# Sort the combined dataframe
df_combined = df_combined.sort_values(by=['sis_taxon_id', 'year_published']).reset_index(drop=True)

# Apply the function to the country_code column
df_combined['country_name'] = df_combined['country_code'].apply(get_country_name)

In [None]:
import pandas as pd

df = df_combined.query("year_published == year")

# Define current year for relative calculations
CURRENT_YEAR = 2025

def classify_group(group):
    years = pd.to_numeric(group["year_published"], errors="coerce").dropna().astype(int)
    years = sorted(years.unique())  # still keep unique years for recency checks
    n_rows = len(group)  # total rows for this assessment_id
    
    if n_rows == 0:
        return "Insufficient"
    
    max_year = max(years) if years else None
    if max_year is None or max_year < CURRENT_YEAR - 10:
        return "Expired"
    
    has_recent = any(year >= CURRENT_YEAR - 7 for year in years)
    has_stale = any(CURRENT_YEAR - 10 <= year <= CURRENT_YEAR - 8 for year in years)
    
    if has_recent and n_rows >= 2:   # use total number of rows instead of unique years
        return "Sufficient"
    elif has_recent:
        return "Recent"
    elif has_stale:
        return "Stale"
    else:
        return "Expired"  # fallback

# Apply per assessment_id
classification = df.groupby("assessment_id").apply(classify_group)

# Merge back into dataframe
df = df.merge(classification.rename("classification"), 
              on="assessment_id", how="left")

# Keep most recent entry for each assessment_id
df = df.loc[df.groupby("assessment_id")["year_published"].idxmax()].reset_index(drop=True)

# Map classes to categories
class_to_category = {
    'ACTINOPTERYGII': 'Fish', 
    'CHONDRICHTHYES': 'Fish', 
    'SARCOPTERYGII': 'Fish', 
    'PETROMYZONTI': 'Fish',
    'MYXINI': 'Fish',
    'CEPHALASPIDOMORPHI': 'Fish',
    'MAMMALIA': 'Mammals', 
    'REPTILIA': 'Reptiles', 
    'AVES': 'Birds'
}
df['category'] = df['class_name'].map(class_to_category)

df
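
The recency rules in `classify_group` reduce to the age of the latest publication year plus the number of rows for the assessment (in this notebook, one row per location). The function below is a hypothetical, simplified restatement for illustration; it omits the "Insufficient" case for empty groups.

```python
CURRENT_YEAR = 2025  # same reference year as the cell above


def recency_label(latest_year, n_rows, current_year=CURRENT_YEAR):
    """Label an assessment by the age of its latest publication year."""
    age = current_year - latest_year
    if age > 10:
        return "Expired"   # older than 10 years
    if age >= 8:
        return "Stale"     # 8 to 10 years old
    # Published within the last 7 years
    return "Sufficient" if n_rows >= 2 else "Recent"


print(recency_label(2020, 1))  # Recent
print(recency_label(2018, 3))  # Sufficient
print(recency_label(2016, 5))  # Stale
print(recency_label(2010, 2))  # Expired
```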


In [None]:
# Define the threats of interest
threat_levels = ['Critically Endangered', 'Endangered', 'Extinct in the Wild']

final_df = df[df['Threat'].isin(threat_levels)][['category','country_name','country_code','english_common_name','Threat','trend_title']]

final_df.to_json("../Data/Figure_7.json", orient="records", indent=2)

In [None]:
# Count threatened species by country and threat level
counts = final_df \
    .groupby(["country_name","country_code","Threat"]) \
    .size() \
    .reset_index(name="count")

# Save to JSON
counts.to_json("../Data/Figure_9.json", orient="records", indent=2)

## Ship Strikes Figure

<p align="center">
  <img src="Figs/marine_life_10.png" style="width:50%;">
</p>

The data are derived from Table 7 of this [IWC report](https://www.researchgate.net/publication/342734400_Global_Numbers_of_Ship_Strikes_An_Assessment_of_Collisions_Between_Vessels_and_Cetaceans_Using_Available_Data_in_the_IWC_Ship_Strike_Database_Report_to_the_International_Whaling_Commission_IWC68BSC_HI).