# **PDAC CellTracksColab - General**
---

### Modified CellTracksColab Notebook for Circulating Cell Attachment Analysis

<font size = 4>This version of the CellTracksColab notebook has been specifically adapted to analyze the attachment of circulating cells to endothelial cells. It builds upon the original framework to offer specialized functionalities tailored for this complex aspect of cell migration studies.

<font size = 4>For reference, the original CellTracksColab notebook and its comprehensive suite of tools can be found at the CellMigrationLab GitHub repository:

<font size = 4>[CellMigrationLab/CellTracksColab](https://github.com/CellMigrationLab/CellTracksColab)



<font size = 4>Notebook created by [Guillaume Jacquemet](https://cellmig.org/)


In [None]:
# @title #MIT License

print("""
**MIT License**

Copyright (c) 2023 Guillaume Jacquemet

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.""")

--------------------------------------------------------
# **Part 1: Prepare the session and load your data**
--------------------------------------------------------


## **1.1. Install key dependencies**
---
<font size = 4>

In [None]:
#@markdown ##Play to install
!pip -q install pandas scikit-learn
!pip -q install hdbscan
!pip -q install umap-learn
!pip -q install plotly
!pip -q install tqdm

!git clone https://github.com/CellMigrationLab/CellTracksColab.git


# Standard library
import os
import sys
import re
import gzip
import itertools
import warnings

# Third-party packages
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import ipywidgets as widgets
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.cm as cm
import scipy.stats as stats

from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.gridspec import GridSpec
from ipywidgets import Dropdown, interact,Layout, VBox, Button, Accordion, SelectMultiple, IntText
from tqdm.notebook import tqdm
from IPython.display import display, clear_output
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cosine, pdist
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.metrics import pairwise_distances
from scipy.stats import zscore, ks_2samp
from sklearn.preprocessing import MinMaxScaler
from multiprocessing import Pool
from matplotlib.ticker import FixedLocator
from matplotlib.ticker import FuncFormatter
from matplotlib.colors import LogNorm
sys.path.append("../")
sys.path.append("CellTracksColab/")

import celltracks
from celltracks import *
from celltracks.Track_Plots import *
from celltracks.BoxPlots_Statistics import *
from celltracks.Track_Metrics import *


def save_dataframe_with_progress(df, path, desc="Saving", chunk_size=500000):
    """Save a DataFrame with a progress bar and gzip compression."""

    # Create a tqdm instance for progress tracking
    with tqdm(total=len(df), unit="rows", desc=desc) as pbar:
        # Open the file for writing with gzip compression
        with gzip.open(path, "wt") as f:
            # Write the header once at the beginning
            df.head(0).to_csv(f, index=False)

            # Write the rows in slices of chunk_size; iloc slicing avoids calling
            # np.array_split on a DataFrame, which newer pandas versions warn about
            for start in range(0, len(df), chunk_size):
                chunk = df.iloc[start:start + chunk_size]
                chunk.to_csv(f, header=False, index=False)
                pbar.update(len(chunk))




## **1.2. Mount your Google Drive**
---
<font size = 4> To use this notebook on the data present in your Google Drive, you need to mount your Google Drive to this notebook.

<font size = 4> Play the cell below to mount your Google Drive and follow the instructions.

<font size = 4> Once this is done, your data are available in the **Files** tab at the top left of the notebook.

In [None]:
#@markdown ##Play the cell to connect your Google Drive to Colab

from google.colab import drive
drive.mount('/content/gdrive/')



## **1.3.1 Compile your data or load existing dataframes**
---

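<font size = 4> Please ensure that your data is properly organised. The cells below infer the cell type, flow speed, treatment and experiment number from substrings of each file name, so the file names must follow the expected naming scheme.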
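
<font size = 4> As a rough illustration of how the metadata are read from file names, the sketch below repeats the substring matching used in the `populate_columns` cells on made-up file names. The helper `parse_filename` and both example names are hypothetical and exist only for this illustration; the two dictionaries are copied from the cells below.

```python
import re

# Substring -> label maps, copied from the populate_columns cells below
cells_conditions = {'Mia': 'MiaPaca-2', 'P10': 'Panc10', 'p10': 'Panc10', 'As': 'AsPc1',
                    'neu': 'Neutrophil', 'mono': 'Monocyte', 'mon': 'Monocyte'}
flow_speed_conditions = {'p1': 300, 'p2': 200, 'p3': 100, 'p4': 'wash'}


def parse_filename(filename):
    """Hypothetical helper: return the metadata inferred from a file name."""
    def first_match(mapping):
        # The first key found anywhere in the file name wins
        return next((v for k, v in mapping.items() if k in filename), 'Unknown')

    match = re.search(r'n(\d+)', filename)
    return {
        'Cells': first_match(cells_conditions),
        'Flow_speed': first_match(flow_speed_conditions),
        'experiment_nb': int(match.group(1)) if match else 'Unknown',
    }


example = parse_filename('Mia_ctrl_n2_p3-tracks.csv')     # made-up file name
ambiguous = parse_filename('p10_ctrl_n1_p3-tracks.csv')   # made-up file name
print(example)
print(ambiguous)
```

<font size = 4> Because the matching is by substring, a name that contains `p10` also contains `p1`, so the second example is assigned a flow speed of 300 even though it ends in `p3`. It may be worth checking the metadata table produced further down for this kind of collision.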


In [None]:
#@markdown ##Run for the CTC vs CIC and IL-1b dataset


def populate_columns(df, filename):
    cells_conditions = {
        'Mia': 'MiaPaca-2', 'P10': 'Panc10',  'p10': 'Panc10', 'As': 'AsPc1',
        'neu': 'Neutrophil', 'mono': 'Monocyte', 'mon': 'Monocyte'
    }
    flow_speed_conditions = {'p1': 300, 'p2': 200, 'p3': 100, 'p4': 'wash'}
    Treatment_conditions = {'IL1b': 'IL1b', 'il1b': 'IL1b', 'ctrl': 'CTRL'}

    df['Cells'] = next((v for k, v in cells_conditions.items() if k in filename), 'Unknown')
    df['Flow_speed'] = next((v for k, v in flow_speed_conditions.items() if k in filename), 'Unknown')
    df['Treatment'] = next((v for k, v in Treatment_conditions.items() if k in filename), 'CTRL')
    filename_without_extension = os.path.splitext(os.path.basename(filename))[0]
    df['File_name'] = remove_suffix(filename_without_extension)
    df['Condition'] = df['Cells'] + '_' + df['Flow_speed'].astype(str) + '_' + df['Treatment']
    match = re.search(r'n(\d+)', filename)
    df['experiment_nb'] = int(match.group(1)) if match else 'Unknown'

    return df

In [None]:
#@markdown ##Run for the CD44 ab, siCD44 and HA datasets


def populate_columns(df, filename):
    cells_conditions = {
        'Mia': 'MiaPaca-2', 'P10': 'Panc10',  'p10': 'Panc10', 'As': 'AsPc1',
        'neu': 'Neutrophil', 'mono': 'Monocyte', 'mon': 'Monocyte'
    }
    flow_speed_conditions = {'p1': 300, 'p2': 200, 'p3': 100, 'p4': 'wash'}
    Treatment_conditions = {'siCtrl': 'siCtrl', 'sictrl': 'siCtrl', 'si2': 'siCD44_2', 'si1': 'siCD44_1', 'si3': 'siCD44_3', 'ctrldig': 'ctrldig', 'HUdig': 'HUdig','TCdig': 'TCdig', 'blockboth': 'blockboth', 'ctrlblock': 'ctrlblock', 'HUblock': 'HUblock', 'TCblock': 'TCblock'}

    df['Cells'] = next((v for k, v in cells_conditions.items() if k in filename), 'Unknown')
    df['Flow_speed'] = next((v for k, v in flow_speed_conditions.items() if k in filename), 'Unknown')
    df['Treatment'] = next((v for k, v in Treatment_conditions.items() if k in filename), 'Unknown')
    filename_without_extension = os.path.splitext(os.path.basename(filename))[0]
    df['File_name'] = remove_suffix(filename_without_extension)
    df['Condition'] = df['Cells'] + '_' + df['Flow_speed'].astype(str) + '_' + df['Treatment']
    match = re.search(r'n(\d+)', filename)
    df['experiment_nb'] = int(match.group(1)) if match else 'Unknown'

    return df

In [None]:
#@markdown ##Provide the path to your dataset:

#@markdown ###If you have multiple TrackMate files to compile, provide the path to the folder containing them:

import os
import re
import glob
import pandas as pd
from tqdm.notebook import tqdm
import numpy as np
import requests
import zipfile
import gzip


Folder_path = ''  # @param {type: "string"}

#@markdown ###If you already have compiled dataframes, provide the paths to your track and spot tables:

Track_table = ''  # @param {type: "string"}
Spot_table = ''  # @param {type: "string"}

#@markdown ###Provide the path to your Result folder

Results_Folder = ""  # @param {type: "string"}

if not Results_Folder:
    Results_Folder = '/content/Results'  # Default Results_Folder path if not defined

if not os.path.exists(Results_Folder):
    os.makedirs(Results_Folder)  # Create Results_Folder if it doesn't exist

# Print the location of the result folder
print(f"Result folder is located at: {Results_Folder}")


def load_and_populate(file_pattern, usecols=None, chunksize=500000):
    df_list = []
    pattern = re.compile(file_pattern)
    files_to_process = [f for f in glob.glob(Folder_path + '/*') if pattern.match(os.path.basename(f))]

    # Metadata list
    metadata_list = []

    for filepath in tqdm(files_to_process, desc="Processing Files"):
        print(filepath)
        # Get the expected number of rows in the file (subtracting the 4 header rows).
        # The context manager makes sure the file handle is closed afterwards.
        with open(filepath) as fh:
            expected_rows = sum(1 for _ in fh) - 4

        # Add to the metadata list
        metadata_list.append({
            'filename': os.path.basename(filepath),
            'expected_rows': expected_rows
        })

        chunked_reader = pd.read_csv(filepath, skiprows=[1, 2, 3], usecols=usecols, chunksize=chunksize)
        for chunk in chunked_reader:
            df_list.append(populate_columns(chunk, filepath))

    if not df_list:
        print(f"No files found with pattern: {file_pattern}")
        return pd.DataFrame()

    merged_df = pd.concat(df_list, ignore_index=True)

    # Verify the total rows in the merged dataframe matches the total expected rows from metadata
    total_expected_rows = sum(item['expected_rows'] for item in metadata_list)
    if len(merged_df) != total_expected_rows:
        print(f"Warning: Mismatch in total rows. Expected {total_expected_rows}, found {len(merged_df)} in the merged dataframe.")
    else:
        print(f"Success: The processed dataframe matches the metadata. Total rows: {len(merged_df)}")

    return merged_df

def sort_and_generate_repeat(merged_df):
    merged_df.sort_values(['Condition', 'experiment_nb'], inplace=True)
    merged_df = merged_df.groupby('Condition', group_keys=False).apply(generate_repeat)
    return merged_df

def generate_repeat(group):
    # Convert to string if the experiment_nb has numeric and 'Unknown' values
    group['experiment_nb'] = group['experiment_nb'].astype(str)

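    # Split rows into numeric experiment numbers and non-numeric ones (here, 'Unknown').
    # .copy() makes independent frames so that adding 'Repeat' below does not
    # trigger pandas' SettingWithCopyWarning.
    numeric_part = group[group['experiment_nb'].str.isdigit()].copy()
    non_numeric_part = group[~group['experiment_nb'].str.isdigit()].copy()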

    # Sort numeric values and assign repeats
    unique_experiment_nbs_numeric = sorted(numeric_part['experiment_nb'].unique(), key=int)
    experiment_nb_to_repeat_numeric = {experiment_nb: i+1 for i, experiment_nb in enumerate(unique_experiment_nbs_numeric)}
    numeric_part['Repeat'] = numeric_part['experiment_nb'].map(experiment_nb_to_repeat_numeric)

    # Handle non-numeric parts, you can decide how to sort and assign repeat values
    # Here we give all 'Unknown' the same repeat number, for example, 0
    non_numeric_part['Repeat'] = 0  # Or some other logic for non-numeric parts

    # Concatenate the parts back together
    group = pd.concat([numeric_part, non_numeric_part])

    return group


def remove_suffix(filename):
    suffixes_to_remove = ["-tracks", "-spots"]
    for suffix in suffixes_to_remove:
        if filename.endswith(suffix):
            filename = filename[:-len(suffix)]
            break
    return filename


def validate_tracks_df(df):
    """Validate the tracks dataframe for necessary columns and data types."""
    required_columns = ['TRACK_ID']
    for col in required_columns:
        if col not in df.columns:
            print(f"Error: Column '{col}' missing in tracks dataframe.")
            return False

    # Additional data type checks or value ranges can be added here
    return True

def validate_spots_df(df):
    """Validate the spots dataframe for necessary columns and data types."""
    required_columns = ['TRACK_ID', 'POSITION_X', 'POSITION_Y', 'POSITION_T']
    for col in required_columns:
        if col not in df.columns:
            print(f"Error: Column '{col}' missing in spots dataframe.")
            return False

    # Additional data type checks or value ranges can be added here
    return True

def check_unique_id_match(df1, df2):
    df1_ids = set(df1['Unique_ID'])
    df2_ids = set(df2['Unique_ID'])

    # Check if the IDs in the two dataframes match
    if df1_ids == df2_ids:
        print("The Unique_ID values in both dataframes match perfectly!")
    else:
        missing_in_df1 = df2_ids - df1_ids
        missing_in_df2 = df1_ids - df2_ids

        if missing_in_df1:
            print(f"There are {len(missing_in_df1)} Unique_ID values present in the second dataframe but missing in the first.")
            print("Examples of these IDs are:", list(missing_in_df1)[:5])

        if missing_in_df2:
            print(f"There are {len(missing_in_df2)} Unique_ID values present in the first dataframe but missing in the second.")
            print("Examples of these IDs are:", list(missing_in_df2)[:5])

if Folder_path:

    merged_tracks_df = load_and_populate(r'.*tracks.*\.csv')

    if not validate_tracks_df(merged_tracks_df):
        print("Error: Validation failed for merged tracks dataframe.")
    else:
        merged_tracks_df = sort_and_generate_repeat(merged_tracks_df)
        merged_tracks_df['Unique_ID'] = merged_tracks_df['File_name'] + "_" + merged_tracks_df['TRACK_ID'].astype(str)
        save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv.gz', desc="Saving Tracks")


    merged_spots_df = load_and_populate(r'.*spots.*\.csv', usecols=['TRACK_ID', 'POSITION_X', 'POSITION_Y', 'POSITION_T', 'RADIUS', 'CIRCULARITY', 'SOLIDITY', 'SHAPE_INDEX'])

    if not validate_spots_df(merged_spots_df):
        print("Error: Validation failed for merged spots dataframe.")
    else:
        merged_spots_df = sort_and_generate_repeat(merged_spots_df)
        merged_spots_df.dropna(subset=['POSITION_X', 'POSITION_Y'], inplace=True)
        merged_spots_df.reset_index(drop=True, inplace=True)
        merged_spots_df['Unique_ID'] = merged_spots_df['File_name'] + "_" + merged_spots_df['TRACK_ID'].astype(str)
        save_dataframe_with_progress(merged_spots_df, Results_Folder + '/' + 'merged_Spots.csv.gz', desc="Saving Spots")
        # Now, call the check function
        check_unique_id_match(merged_spots_df, merged_tracks_df)
        print("...Done")

# For existing dataframes
if Track_table:
    print("Loading track table file....")
    merged_tracks_df = pd.read_csv(Track_table, low_memory=False)
    if not validate_tracks_df(merged_tracks_df):
        print("Error: Validation failed for loaded tracks dataframe.")

if Spot_table:
    print("Loading spot table file....")
    merged_spots_df = pd.read_csv(Spot_table, low_memory=False)
    if not validate_spots_df(merged_spots_df):
        print("Error: Validation failed for loaded spots dataframe.")


check_for_nans(merged_spots_df, "merged_spots_df")
check_for_nans(merged_tracks_df, "merged_tracks_df")


In [None]:
#@markdown ##Check Metadata


# Define the metadata columns that are expected to have identical values for each filename
metadata_columns = ['Cells', 'Flow_speed', 'Treatment', 'Condition', 'experiment_nb', 'Repeat']

# Group the DataFrame by 'File_name' and then check if all entries within each group are identical
consistent_metadata = True
for name, group in merged_tracks_df.groupby('File_name'):
    for col in metadata_columns:
        if not group[col].nunique() == 1:
            consistent_metadata = False
            print(f"Inconsistency found for file: {name} in column: {col}")
            break  # Stop checking other columns for this group and move to the next file
    if not consistent_metadata:
        break  # Stop the entire process if any inconsistency is found

if consistent_metadata:
    print("All files have consistent metadata across the specified columns.")
else:
    print("There are inconsistencies in the metadata. Please check the output for details.")

# Drop duplicates based on the 'File_name' to get a unique list of filenames and their metadata
unique_files_df = merged_tracks_df.drop_duplicates(subset=['File_name'])[['File_name', 'Cells', 'Flow_speed', 'Treatment', 'Condition', 'experiment_nb', 'Repeat']]

# Reset the index to clean up the DataFrame
unique_files_df.reset_index(drop=True, inplace=True)

# Display the resulting DataFrame in a nicely formatted HTML table
unique_files_df

# Group by 'Condition' and 'Repeat' and count how many files share each combination
grouped = unique_files_df.groupby(['Condition', 'Repeat']).size().reset_index(name='counts')

# Check if any combinations have a count greater than 1, which means they are not unique
non_unique_combinations = grouped[grouped['counts'] > 1]

# Print the non-unique combinations
if not non_unique_combinations.empty:
    print("There are non-unique combinations of Conditions and Repeats:")
    print(non_unique_combinations)
else:
    print("All combinations of Conditions and Repeats are unique.")

check_unique_id_match(merged_spots_df, merged_tracks_df)


# Group the DataFrame by 'Cells', 'Treatment', 'Repeat' and then check if there are 4 unique 'Flow_speed' values for each group
consistent_flow_speeds = True
for (cells, Treatment, repeat), group in merged_tracks_df.groupby(['Cells', 'Treatment', 'Repeat']):
    if group['Flow_speed'].nunique() != 4:
        consistent_flow_speeds = False
        print(f"Inconsistency found for Cells: {cells}, Treatment: {Treatment}, Repeat: {repeat} - Expected 4 Flow_speeds, found {group['Flow_speed'].nunique()}")
        break  # Stop the entire process if any inconsistency is found

if consistent_flow_speeds:
    print("Each combination of 'Cells', 'Treatment', 'Repeat' has exactly 4 different 'Flow_speed' values.")
else:
    print("There are inconsistencies in 'Flow_speed' values. Please check the output for details.")


unique_cells = unique_files_df['Cells'].unique()
unique_flow_speeds = unique_files_df['Flow_speed'].unique()
unique_Treatment = unique_files_df['Treatment'].unique()
unique_conditions = unique_files_df['Condition'].unique()

print("Unique Cells:", unique_cells)
print("Unique Flow Speeds:", unique_flow_speeds)
print("Unique Treatments:", unique_Treatment)
print("Unique Conditions:", unique_conditions)


## **1.4. Filter tracks shorter than 50 spots**
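
<font size = 4> The filtering step can be pictured on toy data as follows. This is a minimal sketch with invented tables; only the column names `NUMBER_SPOTS` and `Unique_ID` and the threshold of 50 are taken from this notebook. Tracks below the threshold are dropped first, and the spot table is then restricted to the tracks that remain.

```python
import pandas as pd

# Invented toy tables that only carry the columns needed for the filter
tracks = pd.DataFrame({'Unique_ID': ['a_0', 'a_1', 'b_0'],
                       'NUMBER_SPOTS': [120, 12, 50]})
spots = pd.DataFrame({'Unique_ID': ['a_0', 'a_1', 'a_1', 'b_0'],
                      'POSITION_T': [0, 0, 1, 0]})

min_spots = 50  # threshold used in this notebook

# Keep tracks with at least min_spots spots, then keep only their spots
kept_tracks = tracks[tracks['NUMBER_SPOTS'] >= min_spots]
kept_spots = spots[spots['Unique_ID'].isin(kept_tracks['Unique_ID'])]

print(f"Tracks kept: {len(kept_tracks)} of {len(tracks)}")
print(f"Spots kept: {len(kept_spots)} of {len(spots)}")
```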


In [None]:
# @title ##Filter tracks shorter than 50 spots


merged_tracks_df = merged_tracks_df[merged_tracks_df['NUMBER_SPOTS'] >= 50]
merged_spots_df = merged_spots_df[merged_spots_df['Unique_ID'].isin(merged_tracks_df['Unique_ID'])]


## **1.5. Visualise your tracks**
---

In [None]:
# @title ##Run the cell and choose the file you want to inspect

import ipywidgets as widgets
from ipywidgets import interact
import matplotlib.pyplot as plt

# Create the Results_Folder/Tracks subfolder if it doesn't exist
os.makedirs(Results_Folder + "/Tracks", exist_ok=True)

# Extract unique filenames from the dataframe
filenames = merged_spots_df['File_name'].unique()

# Create a Dropdown widget with the filenames
filename_dropdown = widgets.Dropdown(
    options=filenames,
    value=filenames[0] if len(filenames) > 0 else None,  # Default selected value
    description='File Name:',
)

def plot_coordinates(filename):
    if filename:
        # Filter the DataFrame based on the selected filename
        filtered_df = merged_spots_df[merged_spots_df['File_name'] == filename]

        plt.figure(figsize=(10, 8))
        for unique_id in filtered_df['Unique_ID'].unique():
            unique_df = filtered_df[filtered_df['Unique_ID'] == unique_id].sort_values(by='POSITION_T')
            plt.plot(unique_df['POSITION_X'], unique_df['POSITION_Y'], marker='o', linestyle='-', markersize=2)

        plt.xlabel('POSITION_X')
        plt.ylabel('POSITION_Y')
        plt.title(f'Coordinates for {filename}')
        plt.savefig(f"{Results_Folder}/Tracks/Tracks_{filename}.pdf")
        plt.show()
    else:
        print("No valid filename selected")

# Link the Dropdown widget to the plotting function
interact(plot_coordinates, filename=filename_dropdown)


In [None]:
# @title ##Batch Process


import os
import matplotlib.pyplot as plt

# Ensure the Results_Folder/Tracks directory exists
if not os.path.exists(Results_Folder + "/Tracks"):
    os.makedirs(Results_Folder + "/Tracks")

# Extract unique filenames from the dataframe
filenames = merged_spots_df['File_name'].unique()

def plot_coordinates(filename):
    if filename:
        # Filter the DataFrame based on the selected filename
        filtered_df = merged_spots_df[merged_spots_df['File_name'] == filename]

        plt.figure(figsize=(10, 8))
        for unique_id in filtered_df['Unique_ID'].unique():
            unique_df = filtered_df[filtered_df['Unique_ID'] == unique_id].sort_values(by='POSITION_T')
            plt.plot(unique_df['POSITION_X'], unique_df['POSITION_Y'], marker='o', linestyle='-', markersize=2)

        plt.xlabel('POSITION_X')
        plt.ylabel('POSITION_Y')
        plt.title(f'Coordinates for {filename}')
        plt.savefig(f"{Results_Folder}/Tracks/Tracks_{filename}.pdf")
        plt.close()  # Close the plot to avoid displaying it

# Loop through all filenames and generate plots
for filename in filenames:
    plot_coordinates(filename)


In [None]:
# @title ##Speed density plots


# Visualise the distributions of track speed and distance per condition as overlaid histograms (sns.histplot)

import seaborn as sns
import matplotlib.pyplot as plt

def plot_distribution_by_condition_updated(df):
    conditions = df['Condition'].unique()

    # Setting up the plotting environment
    sns.set_style("whitegrid")
    plt.figure(figsize=(18, 20))  # Increased height to fit the fourth plot

    # Plotting histograms for TRACK_MEAN_SPEED
    plt.subplot(4, 1, 1)
    for condition in conditions:
        sns.histplot(df[df['Condition'] == condition]['TRACK_MEAN_SPEED'], label=condition, kde=False, bins=30)
    plt.title('Histogram of TRACK_MEAN_SPEED by Condition')
    plt.legend()

    # Plotting histograms for TRACK_MAX_SPEED
    plt.subplot(4, 1, 2)
    for condition in conditions:
        sns.histplot(df[df['Condition'] == condition]['TRACK_MAX_SPEED'], label=condition, kde=False, bins=30)
    plt.title('Histogram of TRACK_MAX_SPEED by Condition')
    plt.legend()

    # Plotting histograms for TRACK_MIN_SPEED
    plt.subplot(4, 1, 3)
    for condition in conditions:
        sns.histplot(df[df['Condition'] == condition]['TRACK_MIN_SPEED'], label=condition, kde=False, bins=30)
    plt.title('Histogram of TRACK_MIN_SPEED by Condition')
    plt.legend()

    # Plotting histograms for TOTAL_DISTANCE_TRAVELED
    plt.subplot(4, 1, 4)
    for condition in conditions:
        sns.histplot(df[df['Condition'] == condition]['TOTAL_DISTANCE_TRAVELED'], label=condition, kde=False, bins=30)
    plt.title('Histogram of TOTAL_DISTANCE_TRAVELED by Condition')
    plt.legend()

    plt.tight_layout()
    plt.show()

# You can call this function with your dataframe like this:
plot_distribution_by_condition_updated(merged_tracks_df)



In [None]:
# @title ##Time points per track


import matplotlib.pyplot as plt


# Calculate the count of time points per track
time_points_per_track = merged_spots_df.groupby('Unique_ID').size()

# Plotting
plt.figure(figsize=(10, 6))
time_points_per_track.hist(bins=30, edgecolor='black')
plt.title('Distribution of Time Points per Track')
plt.xlabel('Number of Time Points')
plt.ylabel('Count of Tracks')
plt.grid(False)
plt.show()


--------------------------------------------------------
# **Part 2. Compute Additional Metrics (Optional)**
--------------------------------------------------------



<font size="4" color="red">Part 2 does not support Track splitting</font>.

<font size="4" color="red">Part 2 supports 3D tracking data</font>.

<font size="4">In this section, you can compute useful track metrics. These metrics can be calculated from the start to the end of the track or using a rolling window approach.

<font size = 4>**Usefulness of Start to End Approach**

<font size = 4>The start to end approach calculates metrics over the entire length of the track, providing a comprehensive overview of the track's characteristics from beginning to end. This method is useful for understanding overall trends such as directionality or average speed over the entire track.

<font size = 4>**Usefulness of the Rolling Window Approach**

<font size = 4>The rolling window approach is particularly useful when comparing tracks of different lengths, especially when the metric is not normalized over time, such as the total distance traveled. By using rolling averages, you ensure that the comparisons account for variations in track length and provide a more consistent basis for analysis.

<font size = 4>**Choosing the Window Size**

- <font size = 4>**Window Size**: The `window_size` parameter determines the number of data points considered in each rolling calculation. A larger window size will smooth the data more, averaging out short-term variations and focusing on long-term trends. Conversely, a smaller window size will be more sensitive to short-term changes, capturing finer details of the movement.
- <font size = 4>**Selection Tips**: The optimal window size depends on the nature of your data and the specific analysis goals. It also depends on the length of your tracks.
</font>
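
<font size = 4> To get a feel for the effect of `window_size`, the sketch below smooths a synthetic speed trace with a small and a large window. The trace, the seed and the two window sizes (3 and 21) are arbitrary choices made for this illustration, not recommended values; the rolling call uses the same centred mean as the rolling-average cell later in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy trace is reproducible

# Synthetic speed trace (arbitrary units): fast for 50 frames, then slow, plus noise
speed = pd.Series(np.r_[np.full(50, 10.0), np.full(50, 1.0)] + rng.normal(0, 1, 100))

# Centred rolling mean with two different window sizes
small_window = speed.rolling(window=3, min_periods=1, center=True).mean()
large_window = speed.rolling(window=21, min_periods=1, center=True).mean()

# Frame-to-frame variability shrinks as the window grows
for label, trace in [("raw", speed), ("window=3", small_window), ("window=21", large_window)]:
    print(f"{label:>10}: std of frame-to-frame change = {trace.diff().std():.2f}")
```

<font size = 4> The larger window removes more noise but also spreads the abrupt drop at frame 50 over about 21 frames, which is the trade-off to keep in mind when tracks contain short events such as attachment.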


## **2.1. Duration and speed metrics**
---
<font size = 4>When this cell is executed, it calculates various metrics for each unique track (using the whole track). Specifically, for each track, it determines the duration of the track, the average, maximum, minimum, and standard deviation of speeds, as well as the total distance traveled by the tracked object.
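
<font size = 4> The sketch below shows one way such whole-track metrics can be derived from time-sorted positions. It is not the `calculate_track_metrics` function from the `celltracks` package used in the cell below; `sketch_track_metrics`, its output names and the toy track are assumptions made for illustration, and only 2D positions are considered.

```python
import numpy as np
import pandas as pd


def sketch_track_metrics(track):
    """Illustrative only: whole-track metrics from 2D positions of one track."""
    track = track.sort_values('POSITION_T')
    # Displacement between consecutive spots and the matching time step
    step = np.sqrt(track['POSITION_X'].diff() ** 2 + track['POSITION_Y'].diff() ** 2)
    dt = track['POSITION_T'].diff()
    speed = (step / dt).dropna()  # the first spot has no previous spot
    return pd.Series({
        'duration': track['POSITION_T'].iloc[-1] - track['POSITION_T'].iloc[0],
        'mean_speed': speed.mean(),
        'max_speed': speed.max(),
        'min_speed': speed.min(),
        'std_speed': speed.std(),
        'total_distance': step.sum(),
    })


# Toy track: moves 5 units, pauses for one frame, moves 5 units again
toy = pd.DataFrame({'POSITION_T': [0, 1, 2, 3],
                    'POSITION_X': [0.0, 3.0, 3.0, 6.0],
                    'POSITION_Y': [0.0, 4.0, 4.0, 8.0]})
metrics = sketch_track_metrics(toy)
print(metrics)
```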

In [None]:
# @title ##Calculate duration and speed metrics

print("Calculating track metrics...")

# The spot tables loaded above contain only X and Y positions; add a constant Z column
merged_spots_df['POSITION_Z'] = 0


merged_spots_df.dropna(subset=['POSITION_X', 'POSITION_Y', 'POSITION_Z'], inplace=True)

tqdm.pandas(desc="Calculating Track Metrics")

columns_to_remove = [
    "TRACK_DURATION",
    "TRACK_MEAN_SPEED",
    "TRACK_MAX_SPEED",
    "TRACK_MIN_SPEED",
    "TRACK_MEDIAN_SPEED",
    "TRACK_STD_SPEED",
    "TOTAL_DISTANCE_TRAVELED"
]

for column in columns_to_remove:
    if column in merged_tracks_df.columns:
        merged_tracks_df.drop(column, axis=1, inplace=True)

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
df_track_metrics = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_track_metrics).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_track_metrics.columns).drop('Unique_ID')
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)
merged_tracks_df = pd.merge(merged_tracks_df, df_track_metrics, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv.gz')
check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

<font size = 4>**Calculate duration and speed metrics using rolling windows**

<font size = 4>When this cell is executed, it calculates various metrics for each unique track using a rolling window approach. Specifically, it computes rolling sums for distances traveled and various rolling statistics for speeds, including the mean, median, maximum, minimum, and standard deviation within the defined window.

- <font size = 4>**Mean Speed Rolling**: The average speed within each rolling window.
- <font size = 4>**Median Speed Rolling**: The median speed within each rolling window.
- <font size = 4>**Max Speed Rolling**: The highest speed within each rolling window.
- <font size = 4>**Min Speed Rolling**: The lowest speed within each rolling window.
- <font size = 4>**Speed Standard Deviation Rolling**: The variability of speeds within each rolling window.
- <font size = 4>**Total Distance Traveled Rolling**: The average distance traveled within each rolling window.
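
<font size = 4> The rolling statistics listed above can be sketched with pandas as follows. This is a minimal illustration on invented speeds, assuming a trailing window of 3 points; the names `rolling_stats` and `summary` are hypothetical and the actual column names produced by the notebook are not reproduced here. Grouping by `Unique_ID` before rolling keeps each window inside a single track.

```python
import pandas as pd

# Invented per-spot speeds for two tracks
spots = pd.DataFrame({
    'Unique_ID': ['a'] * 5 + ['b'] * 5,
    'Speed': [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 10.0, 0.0, 10.0, 10.0],
})

window_size = 3  # assumption for this sketch

# Rolling windows are computed within each track, never across tracks
rolling = spots.groupby('Unique_ID')['Speed'].rolling(window=window_size, min_periods=1)
rolling_stats = pd.DataFrame({
    'mean': rolling.mean(),
    'median': rolling.median(),
    'max': rolling.max(),
    'min': rolling.min(),
    'std': rolling.std(),
})

# One value per track: average each rolling statistic over all of its windows
summary = rolling_stats.groupby(level='Unique_ID').mean()
print(summary)
```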


In [None]:
# @title ##Compute Speed and rolling distance

from tqdm.notebook import tqdm

import numpy as np

def compute_instantaneous_speed(dataframe):
    # Check for required columns
    required_columns = ['Unique_ID', 'POSITION_T', 'POSITION_X', 'POSITION_Y']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    # Check for duplicate entries
    if dataframe.duplicated(subset=['Unique_ID', 'POSITION_T']).any():
        raise ValueError("There are duplicate entries based on 'Unique_ID' and 'POSITION_T'.")

    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    speeds = []

    for _, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Speeds"):
        # Check for NaN values in columns
        if track[['POSITION_X', 'POSITION_Y', 'POSITION_T']].isna().any().any():
            raise ValueError(f"Track with ID '{track['Unique_ID'].iloc[0]}' contains NaN values which might affect the computation.")

        # Calculate the instantaneous speed using positional data and time difference
        speed = np.sqrt(track['POSITION_X'].diff()**2 + track['POSITION_Y'].diff()**2) / track['POSITION_T'].diff()

        # Ensure that time differences are non-negative
        if (track['POSITION_T'].diff() < 0).any():
            raise ValueError(f"Track with ID '{track['Unique_ID'].iloc[0]}' has negative time differences.")

        # Ensuring the first speed value for each track is NaN
        speed.iloc[0] = np.nan

        speeds.extend(speed.tolist())

    # Safety Check
    if len(speeds) != len(dataframe):
        raise ValueError("The computed speeds list length doesn't match the dataframe's length.")

    dataframe['Speed'] = speeds

    return dataframe

# Example usage:
merged_spots_df = compute_instantaneous_speed(merged_spots_df)


def compute_rolling_average(dataframe, window_size=5):
    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    rolling_avg_speeds = []

    # Wrap the groupby object with tqdm for progress visualization
    for _, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Rolling Averages"):
        rolling_avg = track['Speed'].rolling(window=window_size, min_periods=1, center=True).mean()
        rolling_avg_speeds.extend(rolling_avg.tolist())

    # Safety Check
    if len(rolling_avg_speeds) != len(dataframe):
        raise ValueError("The computed rolling averages list length doesn't match the dataframe's length.")

    dataframe['RollingAvgSpeed'] = rolling_avg_speeds

    return dataframe

# Example usage:
merged_spots_df = compute_rolling_average(merged_spots_df, window_size=5)


def average_speed_first_last_n(dataframe, n=5):
    # Ensure n is a positive integer
    if not isinstance(n, int) or n <= 0:
        raise ValueError("n should be a positive integer.")

    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    speeds_first = {}
    speeds_last = {}

    for track_id, track in tqdm(dataframe.groupby('Unique_ID'), desc="Calculating average speeds"):
        # Ensure the track has at least n points
        if len(track) < n:
            print(f"Track {track_id} has fewer than {n} points. Skipping.")
            continue

        # Average instantaneous speed over the first n time points
        # (the first 'Speed' value of each track is NaN and is skipped by mean())
        avg_speed_first = track['Speed'].iloc[:n].mean()
        speeds_first[track_id] = avg_speed_first

        # Average instantaneous speed over the last n time points
        avg_speed_last = track['Speed'].iloc[-n:].mean()
        speeds_last[track_id] = avg_speed_last

    # Convert average speeds to DataFrames
    avg_speeds_first_df = pd.DataFrame(speeds_first.items(), columns=['Unique_ID', 'AvgSpeedFirstN'])
    avg_speeds_last_df = pd.DataFrame(speeds_last.items(), columns=['Unique_ID', 'AvgSpeedLastN'])

    return avg_speeds_first_df, avg_speeds_last_df

# Example usage:
avg_speeds_first, avg_speeds_last = average_speed_first_last_n(merged_spots_df, 5)


def compute_min_rolling_speed(dataframe):
    # Safeguard: Ensure required columns are present
    required_columns = ['Unique_ID', 'POSITION_T', 'RollingAvgSpeed']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    # Safeguard: Check for duplicate entries
    if dataframe.duplicated(subset=['Unique_ID', 'POSITION_T']).any():
        raise ValueError("There are duplicate entries based on 'Unique_ID' and 'POSITION_T'.")

    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    min_speeds = {}

    # Wrap the groupby object with tqdm for progress visualization
    for track_id, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Min Rolling Speeds"):

        min_speed = track['RollingAvgSpeed'].min()
        min_speeds[track_id] = min_speed

    # Convert the dictionary to a DataFrame
    min_speed_df = pd.DataFrame(min_speeds.items(), columns=['Unique_ID', 'MinRollingAvgSpeed'])

    return min_speed_df

# Compute the minimum rolling speed for merged_spots_df
min_rolling_speed_df = compute_min_rolling_speed(merged_spots_df)


def merge_speeds(df_main, df_to_merge, key='Unique_ID'):
    # Safeguard: Ensure 'key' is present in both dataframes
    if key not in df_main.columns or key not in df_to_merge.columns:
        raise ValueError(f"The key '{key}' is not present in both dataframes to be merged.")

    overlapping_columns = df_main.columns.intersection(df_to_merge.columns).drop(key)
    df_main.drop(columns=overlapping_columns, inplace=True)
    return pd.merge(df_main, df_to_merge, on=key, how='left')


merged_tracks_df = merge_speeds(merged_tracks_df, avg_speeds_first)
merged_tracks_df = merge_speeds(merged_tracks_df, avg_speeds_last)
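# Use the same left merge on 'Unique_ID' as above so that re-running the cell does not duplicate columns or drop tracks
merged_tracks_df = merge_speeds(merged_tracks_df, min_rolling_speed_df)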

def compute_rolling_distance(dataframe, window_size=3):
    """Compute the total distance traveled within a rolling time window."""
    # Safeguard: Ensure required columns are present
    required_columns = ['Unique_ID', 'POSITION_T', 'POSITION_X', 'POSITION_Y']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    # Safeguard: Handle potential negative or zero values for window size
    if window_size <= 0:
        raise ValueError("Window size must be a positive integer.")

    # Safeguard: Check for duplicate entries
    if dataframe.duplicated(subset=['Unique_ID', 'POSITION_T']).any():
        raise ValueError("There are duplicate entries based on 'Unique_ID' and 'POSITION_T'.")

    # Safeguard: Ensure window size is odd for trimming edges correctly
    if window_size % 2 == 0:
        raise ValueError("Please use an odd value for the window size for accurate trimming.")

    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    trim_size = window_size // 2  # Determine how much to trim from the edges
    rolling_distances = []

    for _, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Rolling Distance"):
        # Compute the Euclidean distance between consecutive points
        distances = np.sqrt(track['POSITION_X'].diff()**2 + track['POSITION_Y'].diff()**2).fillna(0)

        # Compute the rolling sum of distances
        rolling_distance = distances.rolling(window=window_size, center=True).sum()

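        # Trim the edges (skipped when trim_size is 0, because [-0:] would select the whole series)
        if trim_size > 0:
            rolling_distance.iloc[:trim_size] = np.nan
            rolling_distance.iloc[-trim_size:] = np.nan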

        rolling_distances.extend(rolling_distance.tolist())

    # Safeguard: Ensure the list of rolling distances matches the length of the dataframe
    if len(rolling_distances) != len(dataframe):
        raise ValueError("The computed rolling distances list length doesn't match the dataframe's length.")

    dataframe['RollingDistance'] = rolling_distances
    return dataframe

merged_spots_df = compute_rolling_distance(merged_spots_df, window_size=5)


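def average_rolling_distance_first_last_n(dataframe, n=1):
    """Sum the rolling distance over the first and last n points of each track.

    Note: despite the 'Avg' prefix in the function and column names, the values are sums;
    NaN values at the trimmed track edges are skipped by sum().
    """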

    # Safeguard: Ensure required columns are present
    required_columns = ['Unique_ID', 'POSITION_T', 'RollingDistance']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    # Safeguard: Handle potential non-positive values for n
    if n <= 0:
        raise ValueError("n must be a positive integer.")

    # Safeguard: Check for duplicate entries
    if dataframe.duplicated(subset=['Unique_ID', 'POSITION_T']).any():
        raise ValueError("There are duplicate entries based on 'Unique_ID' and 'POSITION_T'.")

    distance_first = {}
    distance_last = {}
    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one


    for track_id, track in tqdm(dataframe.groupby('Unique_ID'), desc="Calculating average rolling distances"):
        avg_distance_first = track['RollingDistance'].iloc[:n].sum()
        distance_first[track_id] = avg_distance_first

        avg_distance_last = track['RollingDistance'].iloc[-n:].sum()
        distance_last[track_id] = avg_distance_last

    avg_distances_first_df = pd.DataFrame(distance_first.items(), columns=['Unique_ID', 'AvgRollingDistanceFirstN'])
    avg_distances_last_df = pd.DataFrame(distance_last.items(), columns=['Unique_ID', 'AvgRollingDistanceLastN'])

    return avg_distances_first_df, avg_distances_last_df


def merge_rolling_distances(df_main, df_to_merge, key='Unique_ID'):
    """Merge rolling distances into a dataframe."""
    overlapping_columns = df_main.columns.intersection(df_to_merge.columns).drop(key)

    # Safeguard: Ensure that the df_main is updated correctly after dropping overlapping columns
    df_main = df_main.drop(columns=overlapping_columns)
    return pd.merge(df_main, df_to_merge, on=key, how='left')


def compute_min_rolling_distance(dataframe):
    """Compute the minimum rolling distance for each track."""

    # Safeguard: Ensure required columns are present
    required_columns = ['Unique_ID', 'RollingDistance']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    min_distances = {}

    for track_id, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Min Rolling Distances"):
        min_distance = track['RollingDistance'].min()
        min_distances[track_id] = min_distance

    min_distance_df = pd.DataFrame(min_distances.items(), columns=['Unique_ID', 'MinRollingDistance'])

    return min_distance_df

# Usage and merging operations:
avg_distances_first, avg_distances_last = average_rolling_distance_first_last_n(merged_spots_df, 1)
merged_tracks_df = merge_rolling_distances(merged_tracks_df, avg_distances_first)
merged_tracks_df = merge_rolling_distances(merged_tracks_df, avg_distances_last)

min_rolling_distance_df = compute_min_rolling_distance(merged_spots_df)
overlapping_columns = merged_tracks_df.columns.intersection(min_rolling_distance_df.columns).drop('Unique_ID')

# Safeguard: Ensure that the merged_tracks_df is updated correctly after dropping overlapping columns
merged_tracks_df = merged_tracks_df.drop(columns=overlapping_columns)
merged_tracks_df = pd.merge(merged_tracks_df, min_rolling_distance_df, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_spots_df, Results_Folder + '/' + 'merged_Spots.csv.gz', desc="Saving Spots")
save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv.gz', desc="Saving Tracks")

# Safeguard: check for NaN
check_for_nans(merged_tracks_df, "merged_tracks_df")
check_for_nans(merged_spots_df, "merged_spots_df")


## **2.2. Directionality**
---
<font size = 4>To calculate the directionality of a track in 3D space, we consider a series of points, each with $x$, $y$, and $z$ coordinates, sorted by time. The directionality, denoted $D$, is calculated using the formula:

$$ D = \frac{d_{\text{euclidean}}}{d_{\text{total path}}} $$

where $d_{\text{euclidean}}$ is the Euclidean distance between the first and the last points of the track, calculated as:

$$ d_{\text{euclidean}} = \sqrt{(x_{\text{end}} - x_{\text{start}})^2 + (y_{\text{end}} - y_{\text{start}})^2 + (z_{\text{end}} - z_{\text{start}})^2} $$

and $d_{\text{total path}}$ is the sum of the Euclidean distances between all consecutive points in the track, that is, the total path length traveled. If the total path length is zero, the directionality is defined to be zero. This measure describes the straightness of the path: a value of 1 indicates a straight path between the start and end points, and values approaching 0 indicate more circuitous paths.</font>
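
<font size = 4>As an illustration of the formula above, here is a minimal sketch that evaluates $D$ for a single track given as a list of $(x, y, z)$ points. The function name `directionality_sketch` is hypothetical; it is not the `calculate_directionality` function imported from `celltracks`, whose implementation may differ in detail.

```python
import numpy as np

def directionality_sketch(points):
    """Ratio of start-to-end distance to total path length for one track.

    Hypothetical helper for illustration; `points` must be sorted by time.
    """
    points = np.asarray(points, dtype=float)
    steps = np.diff(points, axis=0)                    # displacement between consecutive points
    total_path = np.linalg.norm(steps, axis=1).sum()   # d_total_path
    if total_path == 0:
        return 0.0                                     # convention stated in the text
    euclidean = np.linalg.norm(points[-1] - points[0]) # d_euclidean
    return float(euclidean / total_path)

straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
detour = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]

print(directionality_sketch(straight))  # 1.0: the path is a straight line
print(directionality_sketch(detour))    # 1/3: net distance 1 over path length 3
```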


In [None]:
# @title ##Calculate directionality
from celltracks.Track_Metrics import calculate_directionality

print("In progress...")

merged_spots_df.dropna(subset=['POSITION_X', 'POSITION_Y', 'POSITION_Z'], inplace=True)

tqdm.pandas(desc="Calculating Directionality")

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

df_directionality = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_directionality).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_directionality.columns).drop('Unique_ID')

merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_directionality, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv.gz')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

<font size = 4>**Calculate directionality using rolling windows**

<font size = 4>When this cell is executed, it calculates the directionality for each unique track using a rolling window approach.

- <font size = 4>**Directionality Rolling**: The average directionality within each rolling window, indicating how straight the path is in that segment of the track.
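
<font size = 4>One way to read the rolling metric is sketched below: directionality is computed for every window of consecutive points and the window values are then averaged per track. This is an assumption about the approach, not the notebook's own implementation; the function name and the default `window_size` are hypothetical.

```python
import numpy as np

def rolling_directionality_sketch(points, window_size=5):
    """Mean directionality over all windows of `window_size` consecutive points.

    Hypothetical helper for illustration; returns NaN if the track is
    shorter than one window.
    """
    points = np.asarray(points, dtype=float)
    values = []
    for start in range(len(points) - window_size + 1):
        window = points[start:start + window_size]
        path = np.linalg.norm(np.diff(window, axis=0), axis=1).sum()
        net = np.linalg.norm(window[-1] - window[0])
        values.append(net / path if path > 0 else 0.0)
    return float(np.mean(values)) if values else float("nan")

# A cell that moves forward two steps and then returns to its start:
# the whole-track directionality is 0, but two of the three windows are straight.
track = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (1, 0, 0), (0, 0, 0)]
print(rolling_directionality_sketch(track, window_size=3))  # 2/3
```

<font size = 4>The example shows why a rolling measure can be informative: locally straight movement is still detected even when the track as a whole ends where it began.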


## **2.3. Tortuosity**
---
<font size = 4>To calculate the tortuosity of a track in 3D space, we consider a series of points, each with $x$, $y$, and $z$ coordinates, sorted by time. The tortuosity, denoted $T$, is calculated using the formula:

$$ T = \frac{d_{\text{total path}}}{d_{\text{euclidean}}} $$

<font size = 4>where $d_{\text{total path}}$ is the total path length and $d_{\text{euclidean}}$ is the Euclidean distance between the first and last points of the track, so $T$ is the reciprocal of the directionality $D$. This measure describes the curvature and complexity of the path: a value of 1 indicates a straight path between the start and end points, and values greater than 1 indicate paths with more twists and turns.
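
<font size = 4>A minimal sketch of this ratio for one track is given below. The name `tortuosity_sketch` is hypothetical, and returning NaN when the start and end points coincide is an assumption made here; the `calculate_tortuosity` function used in the next cell may handle that case differently.

```python
import numpy as np

def tortuosity_sketch(points):
    """Total path length divided by start-to-end distance for one track.

    Hypothetical helper for illustration; `points` must be sorted by time.
    """
    points = np.asarray(points, dtype=float)
    total_path = np.linalg.norm(np.diff(points, axis=0), axis=1).sum()
    euclidean = np.linalg.norm(points[-1] - points[0])
    if euclidean == 0:
        return float("nan")  # assumption: undefined when the track ends where it started
    return float(total_path / euclidean)

l_shaped = [(0, 0, 0), (3, 0, 0), (3, 4, 0)]
print(tortuosity_sketch(l_shaped))  # path length 7 over net distance 5
```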



In [None]:
# @title ##Calculate tortuosity
print("In progress...")

tqdm.pandas(desc="Calculating Tortuosity")

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

df_tortuosity = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_tortuosity).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_tortuosity.columns).drop('Unique_ID')

merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_tortuosity, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv.gz')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

<font size = 4>**Calculate tortuosity using rolling windows**

<font size = 4>When this cell is executed, it calculates the tortuosity for each unique track using a rolling window approach.

- <font size = 4>**Tortuosity Rolling**: The average tortuosity within each rolling window, indicating how convoluted or twisted the path is in that segment of the track. Tortuosity is calculated as the ratio of the total path length to the Euclidean distance between the start and end points of each window. This metric helps in understanding the complexity of movement patterns over short segments of the track, providing insights into the movement behavior of tracked objects.


## **2.4. Calculate the total turning angle**
---

<font size = 4>This measure provides insight into the cumulative amount of turning along the path, with a value of 0 indicating a straight path with no turning, and higher values indicating paths with more turning.

<font size = 4>To calculate the Total Turning Angle of a track in 3D space, we consider a series of points, each with $x$, $y$, and $z$ coordinates, sorted by time. The Total Turning Angle, denoted $A$, is the sum of the angles between each pair of consecutive direction vectors along the track.

<font size = 4>For each pair of consecutive segments in the track, we calculate the direction vectors $\vec{v_1}$ and $\vec{v_2}$; the angle $\theta$ between them follows from:

$$ \cos(\theta) = \frac{\vec{v_1} \cdot \vec{v_2}}{||\vec{v_1}|| \cdot ||\vec{v_2}||} $$

<font size = 4>where $\vec{v_1} \cdot \vec{v_2}$ is the dot product of the direction vectors, and $||\vec{v_1}||$ and $||\vec{v_2}||$ are their magnitudes. The Total Turning Angle $A$ is then the sum of all the angles $\theta$ calculated between each pair of consecutive direction vectors along the track:

$$ A = \sum \theta $$

<font size = 4>If either of the direction vectors is a zero vector, the angle between them is undefined, and such cases are skipped in the calculation.
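
<font size = 4>The calculation can be sketched as follows for a single track. The function name is hypothetical and the result is returned in radians, which is an assumption; the unit used by `calculate_total_turning_angle` in the next cell is not stated here.

```python
import numpy as np

def total_turning_angle_sketch(points):
    """Sum of the angles between consecutive direction vectors, in radians.

    Hypothetical helper for illustration; `points` must be sorted by time.
    """
    points = np.asarray(points, dtype=float)
    vectors = np.diff(points, axis=0)              # one direction vector per segment
    total = 0.0
    for v1, v2 in zip(vectors[:-1], vectors[1:]):
        n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
        if n1 == 0 or n2 == 0:
            continue                               # angle undefined for a zero vector
        # Clip to [-1, 1] so rounding errors cannot push arccos out of its domain
        cos_theta = np.clip(np.dot(v1, v2) / (n1 * n2), -1.0, 1.0)
        total += np.arccos(cos_theta)
    return float(total)

straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
two_right_angles = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]

print(total_turning_angle_sketch(straight))          # 0.0: no turning
print(total_turning_angle_sketch(two_right_angles))  # about pi: two 90-degree turns
```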


In [None]:
# @title ##Calculate the total turning angle

tqdm.pandas(desc="Calculating Total Turning Angle")

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

df_turning_angle = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_total_turning_angle).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_turning_angle.columns).drop('Unique_ID')

merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_turning_angle, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv.gz')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

<font size = 4>**Calculate the total turning angle using rolling windows**

<font size = 4>When this cell is executed, it calculates the total turning angle for each unique track using a rolling window approach.

- <font size = 4>**Total Turning Angle Rolling**: The average total turning angle within each rolling window, indicating how much the direction of movement changes over short segments of the track. This metric helps in understanding the directional changes and maneuverability of the tracked objects over time.


## **2.5. Calculate the Spatial Coverage**
---

<font size = 4>Spatial coverage provides insight into the spatial extent covered by the object's movement, with higher values indicating that the object has covered a larger area or volume during its movement.


<font size = 4>To calculate the spatial coverage of a track in 2D or 3D space, we consider a series of points, each with $x$, $y$, and optionally $z$ coordinates, sorted by time. The spatial coverage, denoted $S$, is the area (in 2D) or volume (in 3D) enclosed by the convex hull of the points in the track.

<font size = 4>**In the implementation below we:**
1. <font size = 4>**Check Dimensionality**:
   - <font size = 4>If the variance of the $z$ coordinates is zero, implying all $z$ coordinates are the same, the spatial coverage is calculated in 2D using only the $x$ and $y$ coordinates.
   - <font size = 4>If the $z$ coordinates vary, the spatial coverage is calculated in 3D using the $x$, $y$, and $z$ coordinates.

2. <font size = 4>**Form Convex Hull**:
   - <font size = 4>In 2D, a minimum of 3 non-collinear points is required to form a convex hull.
   - <font size = 4>In 3D, a minimum of 4 non-coplanar points is required to form a convex hull.
   - <font size = 4>If the required minimum points are not available, the spatial coverage is defined to be zero.

3. <font size = 4>**Calculate Spatial Coverage**:
   - <font size = 4>In 2D, the spatial coverage $S$ is the area of the convex hull formed by the points in the track.
   - <font size = 4>In 3D, the spatial coverage $S$ is the volume of the convex hull formed by the points in the track.

<font size = 4>**Formula:**
- <font size = 4>For 2D Spatial Coverage (Area of Convex Hull), if points are $P_1(x_1, y_1), P_2(x_2, y_2), \ldots, P_n(x_n, y_n)$:
  $$ S_{2D} = \text{Area of Convex Hull formed by } P_1, P_2, \ldots, P_n $$

- <font size = 4>For 3D Spatial Coverage (Volume of Convex Hull), if points are $P_1(x_1, y_1, z_1), P_2(x_2, y_2, z_2), \ldots, P_n(x_n, y_n, z_n)$:
  $$ S_{3D} = \text{Volume of Convex Hull formed by } P_1, P_2, \ldots, P_n $$
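
<font size = 4>The three steps above can be sketched with `scipy.spatial.ConvexHull`, whose `volume` attribute is the enclosed area for 2D input and the enclosed volume for 3D input. This is an illustration under stated assumptions rather than the `calculate_spatial_coverage` function used in the next cell: the function name is hypothetical, constant $z$ is detected with a range check instead of a variance (equivalent, but less sensitive to rounding), and degenerate point sets are detected with a matrix rank test.

```python
import numpy as np
from scipy.spatial import ConvexHull

def spatial_coverage_sketch(points):
    """Convex hull area (2D) or volume (3D) of one track's points.

    Hypothetical helper for illustration.
    """
    points = np.asarray(points, dtype=float)

    # Step 1: drop z when it is constant, so the hull is computed in 2D
    if points.shape[1] == 3 and np.ptp(points[:, 2]) == 0:
        points = points[:, :2]
    dim = points.shape[1]

    # Step 2: a hull needs dim + 1 points that are not collinear/coplanar
    if len(points) < dim + 1:
        return 0.0
    if np.linalg.matrix_rank(points - points.mean(axis=0)) < dim:
        return 0.0

    # Step 3: `volume` is the area in 2D and the volume in 3D
    return float(ConvexHull(points).volume)

square_2d = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0), (0.5, 0.5, 0)]
tetrahedron = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
collinear = [(0, 0, 0), (1, 1, 0), (2, 2, 0)]

print(spatial_coverage_sketch(square_2d))    # area of the unit square
print(spatial_coverage_sketch(tetrahedron))  # volume 1/6
print(spatial_coverage_sketch(collinear))    # 0.0: no hull can be formed
```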



In [None]:
# @title ##Calculate the Spatial Coverage

tqdm.pandas(desc="Calculating Spatial Coverage")

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

df_spatial_coverage = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_spatial_coverage).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_spatial_coverage.columns).drop('Unique_ID')

merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_spatial_coverage, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv.gz')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

<font size = 4>**Calculate Spatial Coverage using rolling windows**

<font size = 4>When this cell is executed, it calculates the spatial coverage for each unique track using a rolling window approach.

- <font size = 4>**Spatial Coverage Rolling**: The average spatial coverage within each rolling window, representing the area (in 2D) or volume (in 3D) covered by the tracked object over short segments of the track. This metric helps in understanding the spatial extent and movement patterns of the tracked objects over time.


## **2.6. Compute additional metrics**
---

<font size = 4>This cell computes various metrics for each track in the provided dataset. These metrics are derived from the information provided by your tracking software.


In [None]:
# @title ##Compute additional metrics

print("In progress...")

# List of potential metrics to compute
potential_metrics = [
    'MEAN_INTENSITY_CH1', 'MEDIAN_INTENSITY_CH1', 'MIN_INTENSITY_CH1', 'MAX_INTENSITY_CH1',
    'TOTAL_INTENSITY_CH1', 'STD_INTENSITY_CH1', 'CONTRAST_CH1', 'SNR_CH1', 'ELLIPSE_X0',
    'ELLIPSE_Y0', 'ELLIPSE_MAJOR', 'ELLIPSE_MINOR', 'ELLIPSE_THETA', 'ELLIPSE_ASPECTRATIO',
    'AREA', 'PERIMETER', 'CIRCULARITY', 'SOLIDITY', 'SHAPE_INDEX','MEAN_INTENSITY_CH2', 'MEDIAN_INTENSITY_CH2', 'MIN_INTENSITY_CH2', 'MAX_INTENSITY_CH2',
    'TOTAL_INTENSITY_CH2', 'STD_INTENSITY_CH2', 'CONTRAST_CH2', 'SNR_CH2', 'MEAN_INTENSITY_CH3', 'MEDIAN_INTENSITY_CH3', 'MIN_INTENSITY_CH3', 'MAX_INTENSITY_CH3',
    'TOTAL_INTENSITY_CH3', 'STD_INTENSITY_CH3', 'CONTRAST_CH3', 'SNR_CH3', 'MEAN_INTENSITY_CH4', 'MEDIAN_INTENSITY_CH4', 'MIN_INTENSITY_CH4', 'MAX_INTENSITY_CH4',
    'TOTAL_INTENSITY_CH4', 'STD_INTENSITY_CH4', 'CONTRAST_CH4', 'SNR_CH4',
    'Diameter_0', 'Euclidean_Diameter_0', 'Number_of_Holes_0', 'Center_of_the_Skeleton_0', 'Center_of_the_Skeleton_1',
    'Length_of_the_Skeleton_0', 'Convexity_0', 'Number_of_Defects_0', 'Mean_Defect_Displacement_0', 'Mean_Defect_Area_0',
    'Variance_of_Defect_Area_0', 'Convex_Hull_Center_0', 'Convex_Hull_Center_1', 'Object_Center_0', 'Object_Center_1',
    'Object_Area_0', 'Kurtosis_of_Intensity_0', 'Maximum_intensity_0', 'Mean_Intensity_0', 'Minimum_intensity_0',
    'Principal_components_of_the_object_0', 'Principal_components_of_the_object_1', 'Principal_components_of_the_object_2',
    'Principal_components_of_the_object_3', 'Radii_of_the_object_0', 'Radii_of_the_object_1', 'Skewness_of_Intensity_0',
    'Total_Intensity_0', 'Variance_of_Intensity_0', 'Bounding_Box_Maximum_0', 'Bounding_Box_Maximum_1', 'Bounding_Box_Minimum_0',
    'Bounding_Box_Minimum_1', 'Size_in_pixels_0'
]

available_metrics = check_metrics_availability(merged_spots_df, potential_metrics)

morphological_metrics_df = compute_morphological_metrics(merged_spots_df, available_metrics)

morphological_metrics_df.reset_index(inplace=True)

if 'Unique_ID' in merged_tracks_df.columns:
    overlapping_columns = merged_tracks_df.columns.intersection(morphological_metrics_df.columns).drop('Unique_ID', errors='ignore')
    merged_tracks_df.drop(columns=overlapping_columns, inplace=True)
    merged_tracks_df = merged_tracks_df.merge(morphological_metrics_df, on='Unique_ID', how='left')
    save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv.gz')

else:
    print("Error: 'Unique_ID' column missing in merged_tracks_df. Skipping merging with morphological metrics.")

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

## **2.7. Calculate the FMI**
---
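
<font size = 4>The cell below computes a forward migration index (FMI) for each track as the net displacement along the $x$ axis divided by the total path length in the $xy$ plane, and returns 0 for tracks that do not move. The sketch below reproduces that ratio on toy tracks. `fmi_sketch` is a hypothetical helper written for illustration; treating $x$ as the forward direction is taken from the notebook code.

```python
import numpy as np

def fmi_sketch(x, y):
    """Net x displacement divided by total xy path length for one track.

    Hypothetical helper for illustration; coordinates must be sorted by time.
    """
    dx = np.diff(np.asarray(x, dtype=float))
    dy = np.diff(np.asarray(y, dtype=float))
    total_path_length = np.hypot(dx, dy).sum()
    if total_path_length == 0:
        return 0.0
    return float(dx.sum() / total_path_length)

print(fmi_sketch([0, 1, 2], [0, 0, 0]))  # 1.0: movement entirely along +x
print(fmi_sketch([2, 1, 0], [0, 0, 0]))  # -1.0: movement entirely along -x
print(fmi_sketch([0, 0, 0], [0, 1, 2]))  # 0.0: movement perpendicular to x
```

<font size = 4>Values therefore range from -1 to 1, and the sign indicates whether the net movement is along or against the positive $x$ direction.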



In [None]:
# @title ##Calculate the FMI

from tqdm.notebook import tqdm

def calculate_fmi(group):
    group = group.sort_values('POSITION_T')

    deltas = np.sqrt(group['POSITION_X'].diff().fillna(0)**2 + group['POSITION_Y'].diff().fillna(0)**2)
    total_path_length = deltas.sum()

    total_forward_displacement = group['POSITION_X'].diff().fillna(0).sum()

    FMI = total_forward_displacement / total_path_length if total_path_length != 0 else 0

    return pd.Series({'FMI': FMI})


# Use tqdm.pandas() for progress_apply
tqdm.pandas(desc="Processing tracks")

# Sort the DataFrame by 'Unique_ID' and 'POSITION_T'
merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

# Group by track ID and calculate metrics with tqdm progress bar
grouped = merged_spots_df.groupby('Unique_ID')
df_fmi = grouped.progress_apply(calculate_fmi).reset_index()

# Find the overlapping columns between the two DataFrames, excluding the merging key
overlapping_columns = merged_tracks_df.columns.intersection(df_fmi.columns).drop('Unique_ID')

# Drop the overlapping columns from the left DataFrame
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

# Merge the FMI values back into the original DataFrame
merged_tracks_df = pd.merge(merged_tracks_df, df_fmi, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv.gz')

check_for_nans(merged_tracks_df, "merged_tracks_df")



-------------------------------------------

# **Part 3. Plot track parameters**
-------------------------------------------

<font size = 4> In this section you can plot all the track parameters previously computed. Data and graphs are automatically saved in your result folder.

<font size = 4 color="red"> Parameters computed are in the unit you provided when tracking your data in TrackMate.

## **Statistical analyses**
### Cohen's d (Effect Size):
<font size = 4>Cohen's d measures the size of the difference between two group means, normalized by their pooled standard deviation. By Cohen's conventional benchmarks, values around 0.2 indicate a small effect, around 0.5 a medium effect, and 0.8 or above a large effect. It quantifies how large the observed difference is, independently of whether it is statistically significant.
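
<font size = 4>A minimal sketch of Cohen's d with a pooled standard deviation is shown below. `cohen_d_sketch` is a hypothetical name; the `cohen_d` function called in the plotting cell is defined elsewhere in the notebook and may differ, for example in how it handles missing values.

```python
import numpy as np

def cohen_d_sketch(group1, group2):
    """Difference of means divided by the pooled standard deviation.

    Hypothetical helper for illustration.
    """
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    # Pooled variance from the two sample variances (ddof=1)
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return float((g1.mean() - g2.mean()) / np.sqrt(pooled_var))

control = [1, 2, 3, 4, 5]
treated = [2, 3, 4, 5, 6]

# Means differ by 1 and the pooled standard deviation is sqrt(2.5),
# so the magnitude of d is about 0.63
print(cohen_d_sketch(control, treated))
```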

### Randomization Test:
<font size = 4>This non-parametric test evaluates whether the observed difference between two conditions could have arisen by chance. It shuffles the condition labels many times (1,000 iterations in the cell below), recalculating Cohen's d each time. The resulting p-value is the proportion of shuffles that yield an effect size at least as large as the observed one, computed as (count + 1) / (iterations + 1); a smaller p-value implies stronger evidence against the null hypothesis of no difference between conditions.
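
<font size = 4>The shuffling procedure can be sketched as follows. The function names and the fixed `seed` are assumptions made for reproducibility of the illustration; the plotting cell below implements the same idea with pandas and the notebook's own `cohen_d`.

```python
import numpy as np

def pooled_d(a, b):
    """Cohen's d with a pooled standard deviation (hypothetical helper)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def randomization_p_value(group1, group2, n_iterations=1000, seed=0):
    """Share of label shuffles whose |d| is at least the observed |d|."""
    rng = np.random.default_rng(seed)
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    observed = abs(pooled_d(g1, g2))
    combined = np.concatenate([g1, g2])

    count_extreme = 0
    for _ in range(n_iterations):
        shuffled = rng.permutation(combined)       # reassign values to the two groups
        new_d = pooled_d(shuffled[:len(g1)], shuffled[len(g1):])
        if abs(new_d) >= observed:
            count_extreme += 1

    # Adding 1 to numerator and denominator keeps the p-value above zero
    return (count_extreme + 1) / (n_iterations + 1)

slow = np.arange(10.0)          # values 0..9
fast = np.arange(10.0) + 100.0  # clearly separated values 100..109
p = randomization_p_value(slow, fast)
print(p)  # small, because almost no shuffle reproduces such a separation
```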

### Bonferroni Correction:
<font size = 4>When several pairs of conditions are compared, the Bonferroni Correction adjusts the significance threshold to limit the risk of false positives: the standard significance level (alpha) is divided by the number of tests. Equivalently, each p-value can be multiplied by the number of comparisons and capped at 1, which is what the cell below reports. This method is conservative and can overlook genuine effects.
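
<font size = 4>As a small worked example, the sketch below applies the correction to six made-up p-values, the number of pairwise comparisons among four conditions. The values and the helper name `bonferroni_adjust` are hypothetical and serve only to illustrate the arithmetic.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    n_tests = len(p_values)
    return [min(p * n_tests, 1.0) for p in p_values]

n_conditions = 4
num_comparisons = n_conditions * (n_conditions - 1) // 2   # 6 pairwise comparisons

raw_p_values = [0.001, 0.02, 0.04, 0.3, 0.5, 0.9]          # made-up values, one per pair
adjusted = bonferroni_adjust(raw_p_values)

alpha = 0.05
significant = [p < alpha for p in adjusted]
print(adjusted)
print(significant)  # only the first comparison remains significant
```

<font size = 4>Note that 0.02 and 0.04 fall below 0.05 before correction but not after it, which illustrates how conservative the method is.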


In [None]:
# @title ##Plot track parameters

# Import necessary libraries
import os
import itertools
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from matplotlib.backends.backend_pdf import PdfPages
import ipywidgets as widgets
from matplotlib.ticker import FixedLocator

# Create the output directories if they do not exist yet (parent folders are created as needed)
os.makedirs(f"{Results_Folder}/track_parameters_plots/pdf", exist_ok=True)
os.makedirs(f"{Results_Folder}/track_parameters_plots/csv", exist_ok=True)


def get_selectable_columns(df):
    """Get columns that can be plotted."""
    exclude_cols = ['Condition', 'File_name', 'Flow_speed', 'Cells', 'Treatment', 'Repeat', 'Unique_ID',
                    'experiment_nb', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION',
                    'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION']
    return [col for col in df.columns if col not in exclude_cols]

def display_variable_checkboxes(selectable_columns):
    """Display checkboxes for selecting variables."""
    variable_checkboxes = [widgets.Checkbox(value=False, description=col) for col in selectable_columns]
    display(widgets.VBox([
        widgets.Label('Variables to Plot:'),
        widgets.GridBox(variable_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 300px)" % 3))
    ]))
    return variable_checkboxes


def create_filename(base, selected_cells, selected_speeds, selected_Treatment, var):
    """Create a unique filename based on selected options."""
    def summarize_options(options):
        if len(options) > 3:
            return f"{len(options)}options"
        return "_".join(options)

    selected_options = "_".join([
        summarize_options(selected_cells),
        summarize_options(selected_speeds),
        summarize_options(selected_Treatment)
    ])

    filename = f"{base}_{selected_options}_{var}.pdf"
    return filename.replace(" ", "_")  # Replace spaces with underscores for file compatibility


# Create checkboxes for various attributes
cells_checkboxes = [widgets.Checkbox(value=False, description=str(cell)) for cell in merged_tracks_df['Cells'].unique()]
flow_speed_checkboxes = [widgets.Checkbox(value=False, description=str(speed)) for speed in merged_tracks_df['Flow_speed'].unique()]
Treatment_checkboxes = [widgets.Checkbox(value=False, description=str(treatment)) for treatment in merged_tracks_df['Treatment'].unique()]


# Display checkboxes
display(widgets.VBox([
    widgets.Label('Cells:'),
    widgets.GridBox(cells_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 100px)" % 4)),
    widgets.Label('Flow Speed:'),
    widgets.GridBox(flow_speed_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 100px)" % 4)),
    widgets.Label('Treatment:'),
    widgets.GridBox(Treatment_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 100px)" % 4))

]))

# Convert Flow_speed to string for checkbox matching
merged_tracks_df['Flow_speed'] = merged_tracks_df['Flow_speed'].astype(str)

# Define the plotting function
def plot_selected_vars(button, variable_checkboxes):
    print("Plotting in progress...")

    # Fetch selected values
    selected_cells = [box.description for box in cells_checkboxes if box.value]
    selected_speeds = [box.description for box in flow_speed_checkboxes if box.value]
    selected_Treatment = [box.description for box in Treatment_checkboxes if box.value]
    variables_to_plot = [box.description for box in variable_checkboxes if box.value]

    # Filter dataframe
    filtered_df = merged_tracks_df.copy()
    filtered_df = filtered_df[filtered_df['Cells'].isin(selected_cells)]
    filtered_df = filtered_df[filtered_df['Flow_speed'].isin(selected_speeds)]
    filtered_df = filtered_df[filtered_df['Treatment'].isin(selected_Treatment)]

    # Initialize matrices for statistics
    effect_size_matrices = {}
    p_value_matrices = {}
    bonferroni_matrices = {}

    unique_conditions = filtered_df['Condition'].unique().tolist()
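    num_comparisons = len(unique_conditions) * (len(unique_conditions) - 1) // 2
    alpha = 0.05
    # Guard against division by zero when fewer than two conditions are selected
    corrected_alpha = alpha / num_comparisons if num_comparisons > 0 else alpha
    n_iterations = 1000

    # Loop through each variable to plot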
    for var in variables_to_plot:

      filename = create_filename("track_parameters_plots", selected_cells, selected_speeds, selected_Treatment, var)
      pdf_path = os.path.join(Results_Folder, "track_parameters_plots", "pdf", filename)
      csv_path = os.path.join(Results_Folder, "track_parameters_plots", "csv", f"{filename[:-4]}.csv")  # Remove '.pdf' and add '.csv'

      pdf_pages = PdfPages(pdf_path)

      effect_size_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)
      p_value_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)
      bonferroni_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)

      for cond1, cond2 in itertools.combinations(unique_conditions, 2):
        group1 = filtered_df[filtered_df['Condition'] == cond1][var]
        group2 = filtered_df[filtered_df['Condition'] == cond2][var]

        original_d = abs(cohen_d(group1, group2))
        effect_size_matrix.loc[cond1, cond2] = original_d
        effect_size_matrix.loc[cond2, cond1] = original_d  # Mirroring

        count_extreme = 0
        for i in range(n_iterations):
            combined = pd.concat([group1, group2])
            shuffled = combined.sample(frac=1, replace=False).reset_index(drop=True)
            new_group1 = shuffled[:len(group1)]
            new_group2 = shuffled[len(group1):]

            new_d = cohen_d(new_group1, new_group2)
            if np.abs(new_d) >= np.abs(original_d):
                count_extreme += 1

        p_value = (count_extreme + 1) / (n_iterations + 1)
        p_value_matrix.loc[cond1, cond2] = p_value
        p_value_matrix.loc[cond2, cond1] = p_value  # Mirroring

        # Apply Bonferroni correction
        bonferroni_corrected_p_value = min(p_value * num_comparisons, 1.0)
        bonferroni_matrix.loc[cond1, cond2] = bonferroni_corrected_p_value
        bonferroni_matrix.loc[cond2, cond1] = bonferroni_corrected_p_value  # Mirroring

      effect_size_matrices[var] = effect_size_matrix
      p_value_matrices[var] = p_value_matrix
      bonferroni_matrices[var] = bonferroni_matrix

      # Concatenate the three matrices side by side, labelling each column with its statistic
      combined_df = pd.concat(
        [
            effect_size_matrices[var].rename(columns={col: f"{col} (Effect Size)" for col in effect_size_matrices[var].columns}),
            p_value_matrices[var].rename(columns={col: f"{col} (P-Value)" for col in p_value_matrices[var].columns}),
            bonferroni_matrices[var].rename(columns={col: f"{col} (Bonferroni-corrected P-Value)" for col in bonferroni_matrices[var].columns})
        ], axis=1
      )

      # Save the combined DataFrame to a CSV file
      combined_df.to_csv(csv_path)

      # Create a new figure
      fig = plt.figure(figsize=(16, 10))

      # Create a gridspec with 2 rows and 3 columns (boxplot on top, three heatmaps below)
      gs = GridSpec(2, 3, height_ratios=[1.5, 1])

      # Create the boxplot axis spanning the whole top row
      ax_box = fig.add_subplot(gs[0, :])

      # Extract the data for this variable
      data_for_var = filtered_df[['Condition', var, 'Repeat', 'File_name']]

      # Save data_for_var to a CSV for replotting
      data_for_var.to_csv(os.path.join(Results_Folder, "track_parameters_plots", "csv", f"{var}_boxplot_data.csv"), index=False)

      # Calculate the interquartile range (IQR) from the 25th and 75th percentiles
      Q1 = filtered_df[var].quantile(0.25)
      Q3 = filtered_df[var].quantile(0.75)
      IQR = Q3 - Q1

      # Define bounds used to clip the y-axis so that extreme outliers do not compress the plot
      multiplier = 10
      lower_bound = Q1 - multiplier * IQR
      upper_bound = Q3 + multiplier * IQR


      # Plotting
      sns.boxplot(x='Condition', y=var, data=filtered_df, ax=ax_box, color='lightgray')  # Boxplot
      sns.stripplot(x='Condition', y=var, data=filtered_df, ax=ax_box, hue='Repeat', dodge=True, jitter=True, alpha=0.2)  # Individual data points
      # Clip the y-axis to the outlier bounds; Series.min()/max() skip NaN values
      ax_box.set_ylim([max(filtered_df[var].min(), lower_bound), min(filtered_df[var].max(), upper_bound)])
      ax_box.set_title(f"{var}")
      ax_box.set_xlabel('Condition')
      ax_box.set_ylabel(var)
      tick_labels = ax_box.get_xticklabels()
      tick_locations = ax_box.get_xticks()
      ax_box.xaxis.set_major_locator(FixedLocator(tick_locations))
      ax_box.set_xticklabels(tick_labels, rotation=90)
      ax_box.legend(loc='center left', bbox_to_anchor=(1, 0.5), title='Repeat')

      # Statistical analyses and heatmaps

      # Effect size heatmap axis
      ax_d = fig.add_subplot(gs[1, 0])
      sns.heatmap(effect_size_matrices[var].fillna(0), annot=True, cmap="viridis", cbar=True, square=True, ax=ax_d, vmax=1)
      ax_d.set_title(f"Effect Size (Cohen's d) for {var}")

      # p-value heatmap axis
      ax_p = fig.add_subplot(gs[1, 1])
      sns.heatmap(p_value_matrices[var].fillna(1), annot=True, cmap="viridis_r", cbar=True, square=True, ax=ax_p, vmax=0.1)
      ax_p.set_title(f"Randomization Test p-value for {var}")

      # Bonferroni-corrected p-value heatmap axis
      ax_bonf = fig.add_subplot(gs[1, 2])
      sns.heatmap(bonferroni_matrices[var].fillna(1), annot=True, cmap="viridis_r", cbar=True, square=True, ax=ax_bonf, vmax=0.1)
      ax_bonf.set_title(f"Bonferroni-corrected p-value for {var}")

      plt.tight_layout()
      pdf_pages.savefig(fig)
      # Close the PDF
      pdf_pages.close()

# Display variable checkboxes and button
selectable_columns = get_selectable_columns(merged_tracks_df)
variable_checkboxes = display_variable_checkboxes(selectable_columns)
button = widgets.Button(description="Plot Selected Variables", layout=widgets.Layout(width='400px'))
button.on_click(lambda b: plot_selected_vars(b, variable_checkboxes))
display(button)
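
<font size = 4>The statistics computed in the cell above (pairwise effect size, randomization test p-value, and Bonferroni correction) can be tried out in isolation with the minimal sketch below. It is not the notebook's implementation: `cohen_d_sketch`, `permutation_p_value`, and the toy `groups` data are hypothetical names introduced for illustration, Cohen's d is assumed to use a pooled standard deviation, and `num_comparisons` is assumed to be the number of condition pairs.

```python
import itertools

import numpy as np


def cohen_d_sketch(a, b):
    """Cohen's d with a pooled standard deviation (assumed definition)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def permutation_p_value(a, b, n_iterations=1000, rng=None):
    """Two-sided randomization test on |Cohen's d| (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(cohen_d_sketch(a, b))
    combined = np.concatenate([a, b])

    count_extreme = 0
    for _ in range(n_iterations):
        # Shuffle the pooled values and split them back into two groups
        shuffled = rng.permutation(combined)
        new_d = cohen_d_sketch(shuffled[:len(a)], shuffled[len(a):])
        if abs(new_d) >= observed:
            count_extreme += 1

    # The +1 terms keep the estimated p-value strictly above zero
    return (count_extreme + 1) / (n_iterations + 1)


# Toy data: conditions A and B share a mean, condition C is shifted
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(0, 1, 30),
    "B": rng.normal(0, 1, 30),
    "C": rng.normal(3, 1, 30),
}

pairs = list(itertools.combinations(groups, 2))
num_comparisons = len(pairs)  # Assumption: one comparison per condition pair

results = {}
for cond1, cond2 in pairs:
    p_value = permutation_p_value(groups[cond1], groups[cond2], n_iterations=500, rng=rng)
    # Bonferroni correction: scale by the number of comparisons, capped at 1
    results[(cond1, cond2)] = (p_value, min(p_value * num_comparisons, 1.0))

for pair, (p_value, corrected) in results.items():
    print(pair, f"p={p_value:.4f}", f"bonferroni={corrected:.4f}")
```

<font size = 4>With `n_iterations` permutations, the smallest p-value this test can report is `1 / (n_iterations + 1)`, so the number of iterations limits how small a Bonferroni-corrected value can be.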
