# Apply Distance Truncation to Acoustic Survey Data

```yaml
---
title: "Apply Distance Truncation to Acoustic Survey Data"
description: >
  This notebook demonstrates the process of applying perceptibility truncation 
  to acoustic survey data for bird species. It includes steps for loading 
  libraries and data, creating distance-based amplitude dictionaries, filtering 
  data based on predicted amplitudes, and processing species counts. The 
  truncation method adjusts detection distances based on species, habitat type, 
  and recording equipment, improving the accuracy of abundance estimates from 
  acoustic surveys.
author: "Isabelle Lebeuf-Taylor"
date: "2024-09-25"
tags:
  - acoustic surveys
  - distance truncation
---
```

## 1. Load libraries and data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [5]:
# Load the counts to truncate, the attenuation-model predictions, and the number of transcribed recordings per site
counts_to_truncate = pd.read_csv("Data/counts_to_truncate.csv")
predicted_amps = pd.read_csv("Data/predicted_distance_amplitudes.csv")
number_of_visits_per_site = pd.read_csv("Data/number_of_transcribed_recordings_per_site.csv")

## 2. Apply truncation to real data

### 2.1 Dictionary of distances and predicted amplitudes

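Build a nested dictionary keyed by distance (an integer, in metres). For each distance, the value holds the predicted amplitudes in open (OP) and forested (FO) habitat at which to truncate detections of each focal species. Each target species takes its predicted amplitudes from a reference species, as listed in the table below.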

| Target species  | Reference |
|------|-------|
| WTSP | VESP  |
| RCKI | AMRO  |
| TEWA | AMRO  |
| OSFL | WEME  |
| YRWA | CCSP  |
| REVI | BRBL  |
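
For orientation, here is a minimal sketch of the nested layout described above and of how a single threshold would be looked up. The amplitude values and the name `example_distance_dict` are invented placeholders for illustration, not output from the attenuation model.

```python
# Hypothetical amplitude thresholds (placeholder values, not model predictions)
example_distance_dict = {
    150: {
        "WTSP": {
            "OP": {"SM2_0": -50.0, "SM2_1": -48.0},
            "FO": {"SM2_0": -55.0, "SM2_1": -53.0},
        }
    }
}

# Threshold for WTSP at 150 m, open habitat, SM2 == 0
threshold = example_distance_dict[150]["WTSP"]["OP"]["SM2_0"]
print(threshold)
```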


In [21]:
# Reference species for my focal species
spp_dict = {
    "WTSP": "VESP",
    "RCKI": "AMRO",
    "TEWA": "AMRO",
    "OSFL": "WEME",
    "YRWA": "CCSP",
    "REVI": "BRBL"
}

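def map_spp(reference_spp):
    """Return the first target species whose reference species is `reference_spp`, else None."""
    # AMRO is the reference for both RCKI and TEWA; only RCKI (the first match) is returned here.
    for key, value in spp_dict.items():
        if reference_spp == value:
            return key
    return None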

predicted_amps['spp'] = predicted_amps['species_code'].apply(map_spp)
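
Because two target species share AMRO as their reference, the reverse lookup above can only label AMRO rows with one of them. The sketch below inverts the mapping while keeping every target per reference, which makes the collision visible. `spp_dict_example` is a copy of `spp_dict` and `targets_by_reference` is a hypothetical name used only for illustration.

```python
from collections import defaultdict

# Copy of the target -> reference mapping used in this notebook
spp_dict_example = {
    "WTSP": "VESP",
    "RCKI": "AMRO",
    "TEWA": "AMRO",
    "OSFL": "WEME",
    "YRWA": "CCSP",
    "REVI": "BRBL",
}

# Invert the mapping, keeping every target species for each reference species
targets_by_reference = defaultdict(list)
for target, reference in spp_dict_example.items():
    targets_by_reference[reference].append(target)

print(targets_by_reference["AMRO"])  # both RCKI and TEWA rely on AMRO
```

This is why the TEWA thresholds are later copied from the RCKI entries rather than looked up directly.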

In [27]:
def create_distance_dict(predicted_amps):
    """
    Create a nested dictionary of predicted amplitudes for species of interest across different distances,
    forest types, and SM2 conditions.

    Args:
    predicted_amps (DataFrame): A DataFrame containing predicted amplitudes and associated metadata.

    Returns:
    dict: A nested dictionary with the structure:
          {distance: {species: {'OP': {'SM2_0': value, 'SM2_1': value},
                                'FO': {'SM2_0': value, 'SM2_1': value}}}}
    """

    # TEWA and RCKI share the same reference species (AMRO), so the RCKI values are reused for TEWA below
    species_of_interest = ['RCKI', 'WTSP', 'YRWA', 'REVI', 'OSFL']

    # Filter DataFrames for 'BinForest' conditions with the addition of 'SM2' filtering
    df_op_0 = predicted_amps[(predicted_amps['BinForest'] == 'OP') & (predicted_amps['SM2'] == 0)]
    df_op_1 = predicted_amps[(predicted_amps['BinForest'] == 'OP') & (predicted_amps['SM2'] == 1)]
    df_fo_0 = predicted_amps[(predicted_amps['BinForest'] == 'FO') & (predicted_amps['SM2'] == 0)]
    df_fo_1 = predicted_amps[(predicted_amps['BinForest'] == 'FO') & (predicted_amps['SM2'] == 1)]

    # Initialize the main dictionary
    distance_dict = {}

    for distance in range(30, 500):
        distance_dict[distance] = {}
        for species in species_of_interest:
            # Initialize nested dictionaries for this species
            distance_dict[distance][species] = {'OP': {'SM2_0': None, 'SM2_1': None},
                                                'FO': {'SM2_0': None, 'SM2_1': None}}

            op_rows_0 = df_op_0[df_op_0['spp'] == species]
            if not op_rows_0.empty:
                closest_op_row_0 = op_rows_0.iloc[(op_rows_0['distance'] - distance).abs().argsort()[:1]]
                closest_op_distance_0 = closest_op_row_0['distance'].values[0]
                if abs(closest_op_distance_0 - distance) <= 1:
                    distance_dict[distance][species]['OP']['SM2_0'] = closest_op_row_0['predicted'].values[0]

            op_rows_1 = df_op_1[df_op_1['spp'] == species]
            if not op_rows_1.empty:
                closest_op_row_1 = op_rows_1.iloc[(op_rows_1['distance'] - distance).abs().argsort()[:1]]
                closest_op_distance_1 = closest_op_row_1['distance'].values[0]
                if abs(closest_op_distance_1 - distance) <= 1:
                    distance_dict[distance][species]['OP']['SM2_1'] = closest_op_row_1['predicted'].values[0]

            fo_rows_0 = df_fo_0[df_fo_0['spp'] == species]
            if not fo_rows_0.empty:
                closest_fo_row_0 = fo_rows_0.iloc[(fo_rows_0['distance'] - distance).abs().argsort()[:1]]
                closest_fo_distance_0 = closest_fo_row_0['distance'].values[0]
                if abs(closest_fo_distance_0 - distance) <= 1:
                    distance_dict[distance][species]['FO']['SM2_0'] = closest_fo_row_0['predicted'].values[0]

            fo_rows_1 = df_fo_1[df_fo_1['spp'] == species]
            if not fo_rows_1.empty:
                closest_fo_row_1 = fo_rows_1.iloc[(fo_rows_1['distance'] - distance).abs().argsort()[:1]]
                closest_fo_distance_1 = closest_fo_row_1['distance'].values[0]
                if abs(closest_fo_distance_1 - distance) <= 1:
                    distance_dict[distance][species]['FO']['SM2_1'] = closest_fo_row_1['predicted'].values[0]

    for distance, species_dict in distance_dict.items():
        if 'RCKI' in species_dict:
            species_dict['TEWA'] = species_dict['RCKI']

    return distance_dict

distance_dict = create_distance_dict(predicted_amps)
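
The core step in `create_distance_dict` is a nearest-distance lookup with a 1 m tolerance. The sketch below isolates that step on a toy table. It uses `idxmin` rather than the `argsort` approach above, and `nearest_predicted`, `toy_predictions`, and the numbers are hypothetical, chosen only to illustrate the behaviour.

```python
import pandas as pd

# Toy predictions (placeholder values, not model output)
toy_predictions = pd.DataFrame({
    "distance": [148.2, 149.6, 151.9, 155.0],
    "predicted": [-51.0, -51.4, -52.1, -53.0],
})

def nearest_predicted(rows, distance, tolerance=1):
    """Return the prediction closest to `distance`, or None if no row lies within `tolerance` metres."""
    if rows.empty:
        return None
    closest_idx = (rows["distance"] - distance).abs().idxmin()
    if abs(rows.loc[closest_idx, "distance"] - distance) <= tolerance:
        return rows.loc[closest_idx, "predicted"]
    return None

print(nearest_predicted(toy_predictions, 150))  # closest row is at 149.6 m
print(nearest_predicted(toy_predictions, 153))  # no row within 1 m, so None
```

Distances with no prediction within the tolerance keep a `None` threshold, which the filtering step treats as "no amplitude filter".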

## 3. Filter predicted data by distance

In [33]:
def filter_dataframes_by_distance(dataframe, distance_dicts):
    """
    Filter a dataframe based on distance-specific criteria.

    Args:
    dataframe (pd.DataFrame): The input dataframe to be filtered.
    distance_dicts (dict): A dictionary containing filtering criteria for each distance.

    Returns:
    dict: A dictionary where keys are distances and values are filtered dataframes.
    """
    filtered_dfs = {}
    
    for distance in range(30, 501): 
        filter_dict = distance_dicts.get(distance)
        if filter_dict is not None: 
            all_filtered_rows = []  
            for species, habitats in filter_dict.items():
                for habitat_type, sm2_values in habitats.items():
                    for sm2, amp_threshold in sm2_values.items():
                        year_condition = (dataframe['Year_since_logging'] < 11) if habitat_type == 'OP' else (dataframe['Year_since_logging'] >= 12)
                        sm2_condition = (dataframe['SM2'] == float(sm2.split('_')[-1]))
                        amp_condition = (dataframe['mean_amp'] >= amp_threshold) if amp_threshold is not None else pd.Series(True, index=dataframe.index)
                        condition = (dataframe['species_code'] == species) & year_condition & sm2_condition & amp_condition
                        
                        filtered_rows = dataframe[condition]
                        if not filtered_rows.empty:
                            all_filtered_rows.append(filtered_rows)
            
            if all_filtered_rows:
                filtered_df = pd.concat(all_filtered_rows, ignore_index=True)
                filtered_dfs[distance] = filtered_df
    
    return filtered_dfs



In [30]:
filtered_dfs = filter_dataframes_by_distance(counts_to_truncate, distance_dict)
# filtered_dfs[150].to_csv('Truncated_150m.csv', index=False)
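
To make the filter logic concrete, the sketch below applies one species, habitat, and SM2 combination to a toy table. The rows and the threshold are invented for illustration; only the column names and the `Year_since_logging < 11` rule for open habitat come from the function above.

```python
import pandas as pd

# Toy detections (placeholder values)
toy_counts = pd.DataFrame({
    "species_code": ["WTSP", "WTSP", "WTSP", "REVI"],
    "Year_since_logging": [5, 5, 30, 5],
    "SM2": [0, 0, 0, 0],
    "mean_amp": [-45.0, -60.0, -45.0, -45.0],
})

amp_threshold = -50.0  # hypothetical threshold for WTSP, OP, SM2_0

keep = (
    (toy_counts["species_code"] == "WTSP")
    & (toy_counts["Year_since_logging"] < 11)    # open (OP) habitat
    & (toy_counts["SM2"] == 0)
    & (toy_counts["mean_amp"] >= amp_threshold)  # loud enough to be within range
)
print(toy_counts[keep])
```

Only the first row survives: the second is too quiet, the third falls in forested habitat, and the fourth is a different species.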

#### Check that the filtering behaves as expected

In [44]:
def sanity_check(dataframe, filter_dict):
    """
    Perform a sanity check on a filtered dataframe against given filter conditions.

    This function verifies if all rows in the dataframe meet the specified amplitude thresholds for each species, 
    considering different thresholds based on the years since logging and SM2 status.

    Args:
    dataframe (pd.DataFrame): The filtered dataframe to check. Must contain columns 'species_code', 'Year_since_logging', 'SM2', and 'mean_amp'.
    
    filter_dict (dict): A nested dictionary with the structure:
                        {species: {'OP': {'SM2_0': value, 'SM2_1': value},
                                   'FO': {'SM2_0': value, 'SM2_1': value}}}

    Returns:
    str: A message indicating whether the sanity check passed or failed. If failed, 
         it specifies which species and condition caused the failure.

    Note:
    The function assumes that 'Year_since_logging' < 11 corresponds to recently logged areas (OP),
    and >= 12 to older logged or unlogged areas (FO). 'SM2' can be 0 or 1.
    """
    for species, conditions in filter_dict.items():
        species_df = dataframe[dataframe['species_code'] == species]
        
        for sm2_status in [0, 1]:
            op_threshold = conditions['OP'][f'SM2_{sm2_status}']
            fo_threshold = conditions['FO'][f'SM2_{sm2_status}']

            # Check for recently logged areas (OP); a None threshold means no amplitude filter was applied
            if op_threshold is not None:
                invalid_rows_op = species_df[
                    (species_df['Year_since_logging'] < 11) &
                    (species_df['SM2'] == sm2_status) &
                    (species_df['mean_amp'] < op_threshold)
                ]
                if not invalid_rows_op.empty:
                    return f"Sanity check failed for species {species}, recently logged areas (OP), SM2 {sm2_status}. Check the filter conditions and data."

            # Check for older logged or unlogged areas (FO)
            if fo_threshold is not None:
                invalid_rows_fo = species_df[
                    (species_df['Year_since_logging'] >= 12) &
                    (species_df['SM2'] == sm2_status) &
                    (species_df['mean_amp'] < fo_threshold)
                ]
                if not invalid_rows_fo.empty:
                    return f"Sanity check failed for species {species}, older logged/unlogged areas (FO), SM2 {sm2_status}. Check the filter conditions and data."
    
    return "Sanity check passed! The filtered dataframe meets all conditions."

sanity_check_result = sanity_check(filtered_dfs[150], distance_dict[150])
print(sanity_check_result)

Sanity check passed! The filtered dataframe meets all conditions.


## 4. Apply filtering to real data

In [45]:
transcribed_tasks_all = pd.read_csv("Data/transcribed_tasks_all.csv")

### 4.1 Exclude songs that are quieter than the expected amplitude given a distance

In [51]:
def process_species_counts(filtered_dfs, transcribed_tasks_all, distance, species_of_interest=None):
    """
    Process species counts data for a given distance, filtering and combining data from multiple sources.

    Args:
    filtered_dfs (dict): Dictionary of filtered dataframes for different distances.
    transcribed_tasks_all (pd.DataFrame): Dataframe containing all transcribed tasks.
    distance (int): The distance (in meters) for which to process the data.
    species_of_interest (list, optional): List of species codes to include. Defaults to ['TEWA', 'RCKI', 'WTSP', 'YRWA', 'REVI', 'OSFL'].

    Returns:
    pd.DataFrame: Processed dataframe with species counts for each location and recording date/time.

    Note:
    - Removes specific sites ("H23-RS-167", "H-RS-1-98", "H23-RS-109"), which have retention patches larger than 12,000 m^2.
    - Filters for locations with 10 or more visits.
    """
    if species_of_interest is None:
        species_of_interest = ['TEWA', 'RCKI', 'WTSP', 'YRWA', 'REVI', 'OSFL']

    # Read and filter sites
    list_of_sites_to_use = number_of_visits_per_site[number_of_visits_per_site['number_of_visits'] >= 10]['location'].tolist()
    list_of_sites_to_use = [site for site in list_of_sites_to_use if site not in ["H23-RS-167", "H-RS-1-98", "H23-RS-109"]]

    # Group and pivot
    grouped_all_df = filtered_dfs[distance].groupby(['location', 'recording_date_time', 'species_code']).size().reset_index(name='count')
    pivot_all_df = grouped_all_df.pivot_table(index=['location', 'recording_date_time'], columns='species_code', values='count', fill_value=0).reset_index()
    pivot_all_df = pivot_all_df[['location', 'recording_date_time'] + species_of_interest].fillna(0)

    # Check for missing combinations
    transcribed_combinations = set(transcribed_tasks_all[['location', 'recording_date_time']].apply(tuple, axis=1))
    existing_combinations = set(pivot_all_df[['location', 'recording_date_time']].apply(tuple, axis=1))
    missing_combinations = transcribed_combinations - existing_combinations

    # Create dataframe for missing combinations
    missing_combinations_df = pd.DataFrame(list(missing_combinations), columns=['location', 'recording_date_time'])
    for species in species_of_interest:
        missing_combinations_df[species] = 0

    # Combine existing and missing data
    updated_pivot_df = pd.concat([pivot_all_df, missing_combinations_df], ignore_index=True)

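    # Keep only recordings that were transcribed
    truncated_counts = updated_pivot_df[updated_pivot_df[['location', 'recording_date_time']].apply(tuple, axis=1).isin(transcribed_combinations)]

    # Apply the site filter described in the Note: 10 or more visits, excluding the three listed sites
    truncated_counts = truncated_counts[truncated_counts['location'].isin(list_of_sites_to_use)]

    return truncated_counts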


truncated_150m_counts = process_species_counts(filtered_dfs, transcribed_tasks_all, 150)
truncated_250m_counts = process_species_counts(filtered_dfs, transcribed_tasks_all, 250)

truncated_150m_counts = pd.read_csv("Data/Abundance within 150m.csv")
truncated_250m_counts = pd.read_csv("Data/Abundance within 250m.csv")
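
The count-building step can be hard to follow inside the full function, so here is a minimal sketch on toy data. It swaps the set-difference and `concat` approach used in `process_species_counts` for a left merge, which gives recordings with no surviving detections a row of zeros. All names and values below are hypothetical.

```python
import pandas as pd

# Toy detections that survived truncation
detections = pd.DataFrame({
    "location": ["A", "A", "A", "B"],
    "recording_date_time": ["2023-06-01 05:00"] * 3 + ["2023-06-02 05:00"],
    "species_code": ["WTSP", "WTSP", "REVI", "WTSP"],
})

# Every transcribed recording, including one (site C) with no detections
all_tasks = pd.DataFrame({
    "location": ["A", "B", "C"],
    "recording_date_time": ["2023-06-01 05:00", "2023-06-02 05:00", "2023-06-03 05:00"],
})
species = ["WTSP", "REVI"]

# Count detections per recording and species, one column per species
counts = (
    detections.groupby(["location", "recording_date_time", "species_code"])
    .size()
    .unstack("species_code", fill_value=0)
    .reindex(columns=species, fill_value=0)
    .reset_index()
)

# Left-merge onto all recordings so recordings without detections get zeros
full_counts = all_tasks.merge(counts, on=["location", "recording_date_time"], how="left")
full_counts[species] = full_counts[species].fillna(0).astype(int)
print(full_counts)
```

Keeping the zero rows matters because a recording with no detections within the truncation distance is still a valid survey visit.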