# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></div><div class="lev1 toc-item"><a href="#Import-modules-and-functions" data-toc-modified-id="Import-modules-and-functions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import modules and functions</a></div><div class="lev1 toc-item"><a href="#Input-fixed-variables-and-datapaths" data-toc-modified-id="Input-fixed-variables-and-datapaths-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Input fixed variables and datapaths</a></div><div class="lev1 toc-item"><a href="#Import-data" data-toc-modified-id="Import-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Import data</a></div><div class="lev1 toc-item"><a href="#Classify-receiver-clusters" data-toc-modified-id="Classify-receiver-clusters-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Classify receiver clusters</a></div><div class="lev1 toc-item"><a href="#Check-the-classification:-3-fold-cross-validation" data-toc-modified-id="Check-the-classification:-3-fold-cross-validation-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Check the classification: 3-fold cross validation</a></div><div class="lev1 toc-item"><a href="#Filter-fish-positions-based-on-receiver-cluster-classification" data-toc-modified-id="Filter-fish-positions-based-on-receiver-cluster-classification-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Filter fish positions based on receiver cluster classification</a></div>

# Introduction

This notebook shows how to use the function `classify_receiver_clusters`.   
A **cluster** is a group of at least three receivers used to calculate a position.   
Clusters are classified as **well** or **badly** performing, according to their ability to correctly position fixed tags.   
A cluster is only classified if it is used to calculate at least 10 fixed tag positions.   

Before using the function, you have to define:
- an **accuracy objective** = the acceptable distance between calculated and real fish position
- a **confidence level** = the percentage of fixed tag positions falling within the accuracy objective for a well performing cluster.

For example, with a confidence level of 95%:

If at least 95% of the fixed tag positions calculated by a cluster fall within the accuracy objective, the cluster is classified as well performing. If fewer than 95% fall within this objective, the cluster is classified as badly performing. 

Only fish positions calculated by well performing clusters are retained for behaviour analysis.
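
The rule above can be sketched on toy data as follows. This is a minimal illustration, not the implementation of `classify_receiver_clusters`: the helper name `sketch_classify`, the string encoding of the URX column and all values are assumptions made for the example.

```python
import pandas as pd

def sketch_classify(fixed_positions, accuracy_goal, confidence_level, min_group_size=10):
    """Illustrative only: label each cluster 'good', 'bad' or 'unclassified'."""
    # Flag every fixed tag position whose error is within the accuracy objective
    flagged = fixed_positions.assign(within=fixed_positions["HPEm"] <= accuracy_goal)
    grouped = flagged.groupby("URX")["within"]
    n_positions = grouped.size()      # number of positions per cluster
    share_within = grouped.mean()     # fraction of positions within the objective

    labels = pd.Series("bad", index=n_positions.index)
    labels[share_within >= confidence_level] = "good"
    labels[n_positions < min_group_size] = "unclassified"  # too few positions to judge
    return labels

# Hypothetical data: three clusters with 20, 20 and 3 fixed tag positions
toy = pd.DataFrame({
    "URX": ["A B C"] * 20 + ["A B D"] * 20 + ["B C D"] * 3,
    "HPEm": [1.0] * 20 + [1.0] * 15 + [4.0] * 5 + [1.0] * 3,
})
labels = sketch_classify(toy, accuracy_goal=2.5, confidence_level=0.95)
print(labels)
```

Here cluster `A B C` has all positions within 2.5 m and is labelled good, `A B D` has only 75% within the objective and is labelled bad, and `B C D` has fewer than 10 positions and stays unclassified.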

# Import modules and functions

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
from classify_receiver_clusters import classify_receiver_clusters


In [2]:
def classification_percentage(to_classify, performers_list):
    """
    Function to calculate the percentage of good/bad performers in a deployment
    
    Parameters
    ----------
    to_classify = series with URX-groups used in a certain deployment (or dataset)
    performers_list = list of bad or good performers
    
    Returns
    -------
    percentage
    """
    return sum(to_classify.isin(performers_list))/len(to_classify)*100
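
A quick usage sketch of `classification_percentage` on made-up input. The function is repeated so the example runs on its own; the cluster names are hypothetical and the real URX encoding may differ.

```python
import pandas as pd

def classification_percentage(to_classify, performers_list):
    # Same calculation as in the cell above
    return sum(to_classify.isin(performers_list)) / len(to_classify) * 100

# Four hypothetical clusters, one of which is in the list of good performers
clusters = pd.Series(["R1 R2 R3", "R1 R2 R4", "R2 R3 R4", "R1 R3 R4"])
share_good = classification_percentage(clusters, ["R1 R2 R3"])
print(share_good)  # 25.0
```

Clusters that are in neither list (for example because they have too few fixed tag positions) count towards neither percentage, so the good and bad percentages need not add up to 100.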

In [3]:
def URX_filtering_stats(pos_to_filter, known_pos, levels, accuracy_goal, min_nb_in_group=10):
    """
    Calculate percentages of excluded, included and unclassified positions for data after the classification of the receiver clusters, and also false neg/pos/neutrals.
    The filtering is based on data with known positions, of which the URX groups are classified for a given confidence level.
    
    Parameters
    ----------
    pos_to_filter = dataframe with positioning data needing to be filtered
    known_pos = dataframe with positions with known HPEm (error), used to classify the receiver clusters
    levels = list of confidence levels (e.g. [0.7, 0.8, 0.9])
    accuracy_goal = maximum acceptable error (e.g. 2.5)
    min_nb_in_group = minimum number of positions calculated by a URX-cluster to allow it to be classified (default 10)
    
    Returns
    -------
    dataframe with confidence levels in index and percentages in columns
    """
    keys = ['excluded', 'false_neg', 'f_neg_tot', 'included', 'false_pos', 'f_pos_tot', 'unclassified', 'false_neutral', 'f_neutr_tot', 'avg_HPEm', 'med_HPEm', 'quant_95']
    stats = {key: {} for key in keys}

    for lim in levels:
        URX_groups, good_perf, bad_perf = classify_receiver_clusters(known_pos, accuracy_goal, lim, min_group_size = min_nb_in_group)
        
        # Positions from badly performing clusters; a false negative is an excluded position within the accuracy goal
        excluded_pos = pos_to_filter[pos_to_filter['URX'].isin(bad_perf)]
        stats['excluded'][lim] = len(excluded_pos)/len(pos_to_filter)*100
        if len(excluded_pos) > 0:
            stats['false_neg'][lim] = len(excluded_pos[excluded_pos.HPEm<=accuracy_goal])/len(excluded_pos)*100
        else:
            stats['false_neg'][lim] = None
        
        stats['f_neg_tot'][lim] = len(excluded_pos[excluded_pos.HPEm<=accuracy_goal])/len(pos_to_filter)*100

        # Positions from well performing clusters; a false positive is an included position exceeding the accuracy goal
        included_pos = pos_to_filter[pos_to_filter['URX'].isin(good_perf)]
        stats['included'][lim] = len(included_pos)/len(pos_to_filter)*100
        if len(included_pos) > 0:
            stats['false_pos'][lim] = len(included_pos[included_pos.HPEm>accuracy_goal])/len(included_pos)*100
        else:
            stats['false_pos'][lim] = None
        
        stats['f_pos_tot'][lim] = len(included_pos[included_pos.HPEm>accuracy_goal])/len(pos_to_filter)*100
            
        # Positions from clusters that are in neither list
        unclassified_pos = pos_to_filter[np.logical_not((pos_to_filter['URX'].isin(bad_perf)|(pos_to_filter['URX'].isin(good_perf))))]
        stats['unclassified'][lim] = len(unclassified_pos)/len(pos_to_filter)*100
        if len(unclassified_pos) > 0:
            stats['false_neutral'][lim] = len(unclassified_pos[unclassified_pos.HPEm>accuracy_goal])/len(unclassified_pos)*100
        else:
            stats['false_neutral'][lim] = None
            
        stats['f_neutr_tot'][lim] = len(unclassified_pos[unclassified_pos.HPEm>accuracy_goal])/len(pos_to_filter)*100
        
        stats['avg_HPEm'][lim] = included_pos.HPEm.mean()
        stats['med_HPEm'][lim] = included_pos.HPEm.median()
        stats['quant_95'][lim] = included_pos.HPEm.quantile(0.95)

            
    return pd.DataFrame.from_dict(stats)

In [4]:
def classify_fish_pos(fishdata, good_performers, bad_performers):
    """
    Classify the fish positions as good, bad or unclassified, according to the classification 
    of the receiver clusters based on fixed tag data.
    This function also prints the percentages of good, bad and unclassified positions.
    
    Parameters
    ----------
    fishdata = dataframe with fish positions and at least column URX
    good_performers = list with good performing receiver clusters
    bad_performers = list with bad performing receiver clusters
    
    Returns
    -------
    fish_good, fish_bad, fish_rest = dataframes with the positions calculated by good, bad and unclassified clusters
    """
    
    fish_good = fishdata[fishdata.URX.isin(good_performers)]
    fish_bad = fishdata[fishdata.URX.isin(bad_performers)]
    fish_rest = fishdata[~fishdata.URX.isin(good_performers+bad_performers)]
    
    print('Good positions: {:.2f}%'.format(len(fish_good)/len(fishdata)*100))
    print('Bad positions: {:.2f}%'.format(len(fish_bad)/len(fishdata)*100))
    print('Unclassified positions: {:.2f}%'.format(len(fish_rest)/len(fishdata)*100))

    
    return fish_good, fish_bad, fish_rest

In [None]:
def three_fold_cross(dataset):
    """
    Split dataset randomly in 3 parts.
    
    Inputs:
    -------
    dataset = pandas dataframe
    
    Returns:
    --------
    3 new dataframes which are parts of the old dataframe
    """
    nb_list = np.arange(len(dataset))
    np.random.shuffle(nb_list)
    
    # Cut points at one third and two thirds; the slices are contiguous, so no row is dropped
    cut1 = round(len(dataset)/3)
    cut2 = round(len(dataset)/3*2)
    part1 = dataset.iloc[nb_list[:cut1],:]
    part2 = dataset.iloc[nb_list[cut1:cut2],:]
    part3 = dataset.iloc[nb_list[cut2:],:]
    
    return part1, part2, part3


# Input fixed variables and datapaths

Give the path to the CSV file that contains all calculated positions of fish tags and fixed tags (synchronization and reference tags). For each position, this file contains at least the columns TRANSMITTER, DATETIME, LAT, LON and URX (the list of receivers used to calculate the position), plus HPEm (absolute error) for fixed tag positions.   
Also give the names of the fixed tags, as used in the TRANSMITTER column.
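
As a rough illustration of the expected layout, the sketch below builds a tiny dataframe with the required columns and checks for missing ones. All values, the transmitter name `FISH-01`, the URX encoding and the helper `missing_columns` are invented for the example and are not taken from the real data file.

```python
import pandas as pd

REQUIRED_COLUMNS = ["TRANSMITTER", "DATETIME", "LAT", "LON", "URX", "HPEm"]

def missing_columns(positions):
    """Return the required column names that are absent from the dataframe."""
    return [col for col in REQUIRED_COLUMNS if col not in positions.columns]

# Hypothetical rows: two fixed tags (with HPEm) and one fish tag (no HPEm)
example = pd.DataFrame({
    "TRANSMITTER": ["R1", "S7", "FISH-01"],
    "DATETIME": pd.to_datetime(["2016-03-04 10:00:00",
                                "2016-03-04 10:00:30",
                                "2016-03-04 10:01:00"]),
    "LAT": [51.0001, 51.0002, 51.0003],
    "LON": [5.0001, 5.0002, 5.0003],
    "URX": ["A B C", "A B D", "A B C"],
    "HPEm": [1.2, 3.1, float("nan")],
})

print(missing_columns(example))                       # []
print(missing_columns(example.drop(columns="HPEm")))  # ['HPEm']
```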

In [5]:
path_to_vps_data = '/Users/jennavergeynst/Documents/Ham/VPS_01_20160304/data/positions/ALL-CALC-POSITIONS.csv'
ref_sync = ['R1', 'R2', 'R3', 'S10', 'S12', 'S13', 'S14', 'S15', 'S16', 'S17', 'S18', 'S7', 'S8']

In [6]:
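accuracy_goal = 2.5      # maximum acceptable error, in metres
confidence_level = 0.95  # required fraction of fixed tag positions within the accuracy goal
min_nb_in_group = 10     # minimum number of fixed tag positions needed to classify a cluster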

# Import data

In [7]:
# infer_datetime_format is deprecated since pandas 2.0, where format inference is the default
vps_data = pd.read_csv(path_to_vps_data, parse_dates=['DATETIME'])

In [8]:
fixed_tags = vps_data[vps_data.TRANSMITTER.isin(ref_sync)]
fish_tags = vps_data[~vps_data.TRANSMITTER.isin(ref_sync)]

# Classify receiver clusters

In [9]:
URX_groups, good_performers, bad_performers = classify_receiver_clusters(fixed_tags, accuracy_goal, confidence_level, min_nb_in_group)

In [10]:
clusters = URX_groups['URX']
fish_URX = pd.Series(fish_tags.URX.unique())

print('Percentage bad performer clusters: {:.2f}'.format(classification_percentage(clusters, bad_performers)))
print('Percentage good performer clusters: {:.2f}'.format(classification_percentage(clusters, good_performers)))
print('Percentage bad performers for fish positions: {:.2f}'.format(classification_percentage(fish_URX, bad_performers)))
print('Percentage good performers for fish positions: {:.2f}'.format(classification_percentage(fish_URX, good_performers)))

Percentage bad performer clusters: 25.97
Percentage good performer clusters: 7.14
Percentage bad performers for fish positions: 23.05
Percentage good performers for fish positions: 6.44


# Check the classification: 3-fold cross validation

In [None]:
part1, part2, part3 = three_fold_cross(fixed_tags)

In [None]:
levels = [0.7, 0.8, 0.9, 0.95, 0.99] # validate the classification for different confidence percentages
cols = ['excluded', 'false_neg', 'f_neg_tot', 'included', 'false_pos', 'f_pos_tot', 'unclassified', 'false_neutral', 'f_neutr_tot', 'avg_HPEm', 'med_HPEm', 'quant_95']

In [None]:
URX_filtering_stats(part1, pd.concat([part2, part3]), levels, accuracy_goal, min_nb_in_group).loc[:,cols].round(1)

In [None]:
URX_filtering_stats(part2, pd.concat([part1, part3]), levels, accuracy_goal, min_nb_in_group).loc[:,cols].round(1)

In [None]:
URX_filtering_stats(part3, pd.concat([part1, part2]), levels, accuracy_goal, min_nb_in_group).loc[:,cols].round(1)
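
Each of the three cells above holds out one part and classifies the clusters on the other two. The same rotation can be sketched compactly on toy data. This is an alternative sketch, not the notebook's own method: it uses `np.array_split` and a seeded generator (seed chosen arbitrarily) instead of `three_fold_cross`, and the dataframe values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed tag positions
toy = pd.DataFrame({"URX": ["A B C"] * 10, "HPEm": np.linspace(0.5, 5.0, 10)})

rng = np.random.default_rng(42)        # seeded so the split is reproducible
shuffled = rng.permutation(len(toy))   # shuffled row numbers
folds = [toy.iloc[idx] for idx in np.array_split(shuffled, 3)]

for i, held_out in enumerate(folds):
    # The two remaining folds together form the data used for classification
    training = pd.concat([fold for j, fold in enumerate(folds) if j != i])
    # In the notebook, `training` would be passed as known_pos
    # and `held_out` as pos_to_filter to URX_filtering_stats.
    print(i, len(training), len(held_out))
```

Every row ends up in exactly one fold, so each position is used once for validation and twice for classification.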

# Filter fish positions based on receiver cluster classification

In [None]:
fish_good, fish_bad, fish_rest = classify_fish_pos(fish_tags, good_performers, bad_performers)