# **PDAC CellTracksColab - General**
---

<font size = 4>Colab Notebook for Analyzing Migration Tracks generated by [TrackMate](https://imagej.net/plugins/trackmate/)


<font size = 4>Notebook created by [Guillaume Jacquemet](https://cellmig.org/)


In [None]:
# @title #MIT License

print("""
**MIT License**

Copyright (c) 2023 Guillaume Jacquemet

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.""")

--------------------------------------------------------
# **Part 1: Prepare the session and load your data**
--------------------------------------------------------


## **1.1. Install key dependencies**
---
<font size = 4>

In [None]:
#@markdown ##Play to install
!pip -q install pandas scikit-learn
!pip -q install hdbscan
!pip -q install umap-learn
!pip -q install plotly
!pip -q install tqdm

import ipywidgets as widgets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import numpy as np
import itertools
from matplotlib.gridspec import GridSpec
import requests


#----------------------- Key functions -----------------------------#

# Function to calculate Cohen's d
def cohen_d(group1, group2):
    diff = group1.mean() - group2.mean()
    n1, n2 = len(group1), len(group2)
    var1 = group1.var()
    var2 = group2.var()
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    d = diff / np.sqrt(pooled_var)
    return d

from tqdm.notebook import tqdm  # progress bar used by save_dataframe_with_progress


def save_dataframe_with_progress(df, path, desc="Saving", chunk_size=50000):
    """Save a DataFrame with a progress bar."""

    # Estimating the number of chunks based on the provided chunk size
    num_chunks = int(len(df) / chunk_size) + 1

    # Create a tqdm instance for progress tracking
    with tqdm(total=len(df), unit="rows", desc=desc) as pbar:
        # Open the file for writing
        with open(path, "w") as f:
            # Write the header once at the beginning
            df.head(0).to_csv(f, index=False)

            for chunk in np.array_split(df, num_chunks):
                chunk.to_csv(f, mode="a", header=False, index=False)
                pbar.update(len(chunk))


def check_for_nans(df, df_name):
    """
    Checks the given DataFrame for NaN values and prints the count for each column containing NaNs.

    Args:
    df (pd.DataFrame): DataFrame to be checked for NaN values.
    df_name (str): The name of the DataFrame as a string, used for printing.
    """
    # Check if the DataFrame has any NaN values and print a warning if it does.
    nan_columns = df.columns[df.isna().any()].tolist()

    if nan_columns:
        for col in nan_columns:
            nan_count = df[col].isna().sum()
            print(f"Column '{col}' in {df_name} contains {nan_count} NaN values.")
    else:
        print(f"No NaN values found in {df_name}.")




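<font size = 4> As a quick sanity check of the `cohen_d` helper defined above, the sketch below recomputes the same pooled-variance effect size using only the standard library. The name `cohen_d_reference` and the two small samples are illustrative assumptions, not part of the notebook. Note that the notebook version relies on pandas' `.var()`, which defaults to the sample variance (`ddof=1`); NumPy arrays passed to it would use the population variance by default.

```python
import math
import statistics


def cohen_d_reference(group1, group2):
    """Cohen's d with a pooled sample variance (hypothetical stdlib-only helper)."""
    n1, n2 = len(group1), len(group2)
    diff = statistics.mean(group1) - statistics.mean(group2)
    # statistics.variance is the sample variance (n - 1 denominator)
    var1 = statistics.variance(group1)
    var2 = statistics.variance(group2)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return diff / math.sqrt(pooled_var)


# Made-up samples: means 4 and 3, both with sample variance 4
effect_size = cohen_d_reference([2, 4, 6], [1, 3, 5])
print(effect_size)
```

<font size = 4> Here the means differ by 1 and the pooled standard deviation is 2, so the effect size is 0.5.
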
## **1.2. Mount your Google Drive**
---
<font size = 4> To use this notebook on the data present in your Google Drive, you need to mount your Google Drive to this notebook.

<font size = 4> Play the cell below to mount your Google Drive and follow the instructions.

<font size = 4> Once this is done, your data are available in the **Files** tab at the top left of the notebook.

In [None]:
#@markdown ##Play the cell to connect your Google Drive to Colab

from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive



## **1.3. Compile your data or load existing dataframes**
---

<font size = 4> Please ensure that your data is properly organised (see above)
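
<font size = 4> The loading cell below infers each file's metadata from tokens in its name (cell line, flow speed, IL1b treatment, and an `n<number>` experiment index). The sketch below mirrors that substring lookup on a single file name, so you can check how a file would be labelled before compiling a whole folder. `parse_metadata`, `first_match` and the example file name are hypothetical; the lookup tables are copied from the loading cell.

```python
import os
import re

# Lookup tables copied from the loading cell of this notebook
CELLS = {'Mia': 'MiaPaca-2', 'P10': 'Panc10', 'p10': 'Panc10', 'As': 'AsPc1',
         'neu': 'Neutrophil', 'mono': 'Monocyte', 'mon': 'Monocyte'}
FLOW_SPEED = {'p1': 300, 'p2': 200, 'p3': 100, 'p4': 'wash'}
ILBETA = {'IL1b': 'IL1b', 'il1b': 'IL1b', 'ctrl': 'CTRL'}


def first_match(table, name, default):
    """Return the value of the first key found as a substring of name."""
    return next((value for key, value in table.items() if key in name), default)


def parse_metadata(path):
    """Hypothetical helper: derive the metadata labels from one file name."""
    name = os.path.basename(path)
    cells = first_match(CELLS, name, 'Unknown')
    flow_speed = first_match(FLOW_SPEED, name, 'Unknown')
    ilbeta = first_match(ILBETA, name, 'CTRL')
    match = re.search(r'n(\d+)', name)
    return {
        'Cells': cells,
        'Flow_speed': flow_speed,
        'ILbeta': ilbeta,
        'Condition': f"{cells}_{flow_speed}_{ilbeta}",
        'experiment_nb': int(match.group(1)) if match else 'Unknown',
    }


# Example file name (an assumption, not taken from a real dataset)
metadata = parse_metadata('Mia_p2_IL1b_n3-tracks.csv')
print(metadata)
```

<font size = 4> Because keys are matched as substrings, a token such as `p10` also contains `p1`. Keeping cell-line and flow-speed tokens distinct in your file names avoids ambiguous labels.
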


In [None]:
#@markdown ##Provide the path to your dataset (chunk):

#@markdown ###If you have multiple TrackMate files to compile, provide the path to your:

import os
import re
import glob
import pandas as pd
from tqdm.notebook import tqdm
import numpy as np
import requests
import zipfile

Folder_path = ''  # @param {type: "string"}

#@markdown ###If you have existing dataframes, provide the path to your:

Track_table = ''  # @param {type: "string"}
Spot_table = ''  # @param {type: "string"}

#@markdown ###Provide the path to your Results folder

Results_Folder = ""  # @param {type: "string"}

if not Results_Folder:
    Results_Folder = '/content/Results'  # Default Results_Folder path if not defined

if not os.path.exists(Results_Folder):
    os.makedirs(Results_Folder)  # Create Results_Folder if it doesn't exist

# Print the location of the result folder
print(f"Result folder is located at: {Results_Folder}")

def populate_columns(df, filename):
    cells_conditions = {
        'Mia': 'MiaPaca-2', 'P10': 'Panc10',  'p10': 'Panc10', 'As': 'AsPc1',
        'neu': 'Neutrophil', 'mono': 'Monocyte', 'mon': 'Monocyte'
    }
    flow_speed_conditions = {'p1': 300, 'p2': 200, 'p3': 100, 'p4': 'wash'}
    ilbeta_conditions = {'IL1b': 'IL1b', 'il1b': 'IL1b', 'ctrl': 'CTRL'}

    # Match condition keys against the file name only, so that folder names in the path cannot trigger a match
    base_name = os.path.basename(filename)
    df['Cells'] = next((v for k, v in cells_conditions.items() if k in base_name), 'Unknown')
    df['Flow_speed'] = next((v for k, v in flow_speed_conditions.items() if k in base_name), 'Unknown')
    df['ILbeta'] = next((v for k, v in ilbeta_conditions.items() if k in base_name), 'CTRL')
    filename_without_extension = os.path.splitext(base_name)[0]
    df['File_name'] = remove_suffix(filename_without_extension)
    df['Condition'] = df['Cells'] + '_' + df['Flow_speed'].astype(str) + '_' + df['ILbeta']
    match = re.search(r'n(\d+)', base_name)
    df['experiment_nb'] = int(match.group(1)) if match else 'Unknown'

    return df

def load_and_populate(file_pattern, usecols=None, chunksize=500000):
    df_list = []
    pattern = re.compile(file_pattern)
    files_to_process = [f for f in glob.glob(Folder_path + '/*') if pattern.match(os.path.basename(f))]

    # Metadata list
    metadata_list = []

    for filepath in tqdm(files_to_process, desc="Processing Files"):
        print(filepath)
        # Count the data rows in the file (total lines minus the 4 header rows)
        with open(filepath) as file_handle:
            expected_rows = sum(1 for _ in file_handle) - 4

        # Add to the metadata list
        metadata_list.append({
            'filename': os.path.basename(filepath),
            'expected_rows': expected_rows
        })

        chunked_reader = pd.read_csv(filepath, skiprows=[1, 2, 3], usecols=usecols, chunksize=chunksize)
        for chunk in chunked_reader:
            df_list.append(populate_columns(chunk, filepath))

    if not df_list:
        print(f"No files found with pattern: {file_pattern}")
        return pd.DataFrame()

    merged_df = pd.concat(df_list, ignore_index=True)

    # Verify the total rows in the merged dataframe matches the total expected rows from metadata
    total_expected_rows = sum(item['expected_rows'] for item in metadata_list)
    if len(merged_df) != total_expected_rows:
        print(f"Warning: Mismatch in total rows. Expected {total_expected_rows}, found {len(merged_df)} in the merged dataframe.")
    else:
        print(f"Success: The processed dataframe matches the metadata. Total rows: {len(merged_df)}")

    return merged_df

def sort_and_generate_repeat(merged_df):
    merged_df.sort_values(['Condition', 'experiment_nb'], inplace=True)
    merged_df = merged_df.groupby('Condition', group_keys=False).apply(generate_repeat)
    return merged_df

def generate_repeat(group):
    # Convert to string if the experiment_nb has numeric and 'Unknown' values
    group['experiment_nb'] = group['experiment_nb'].astype(str)

    # Split numeric and non-numeric (e.g. 'Unknown') experiment numbers.
    # Copies are taken so that adding the 'Repeat' column does not trigger a SettingWithCopyWarning.
    numeric_part = group[group['experiment_nb'].str.isdigit()].copy()
    non_numeric_part = group[~group['experiment_nb'].str.isdigit()].copy()

    # Sort numeric values and assign repeats
    unique_experiment_nbs_numeric = sorted(numeric_part['experiment_nb'].unique(), key=int)
    experiment_nb_to_repeat_numeric = {experiment_nb: i+1 for i, experiment_nb in enumerate(unique_experiment_nbs_numeric)}
    numeric_part['Repeat'] = numeric_part['experiment_nb'].map(experiment_nb_to_repeat_numeric)

    # Handle non-numeric parts, you can decide how to sort and assign repeat values
    # Here we give all 'Unknown' the same repeat number, for example, 0
    non_numeric_part['Repeat'] = 0  # Or some other logic for non-numeric parts

    # Concatenate the parts back together
    group = pd.concat([numeric_part, non_numeric_part])

    return group


def remove_suffix(filename):
    suffixes_to_remove = ["-tracks", "-spots"]
    for suffix in suffixes_to_remove:
        if filename.endswith(suffix):
            filename = filename[:-len(suffix)]
            break
    return filename


def validate_tracks_df(df):
    """Validate the tracks dataframe for necessary columns and data types."""
    required_columns = ['TRACK_ID']
    for col in required_columns:
        if col not in df.columns:
            print(f"Error: Column '{col}' missing in tracks dataframe.")
            return False

    # Additional data type checks or value ranges can be added here
    return True

def validate_spots_df(df):
    """Validate the spots dataframe for necessary columns and data types."""
    required_columns = ['TRACK_ID', 'POSITION_X', 'POSITION_Y', 'POSITION_T']
    for col in required_columns:
        if col not in df.columns:
            print(f"Error: Column '{col}' missing in spots dataframe.")
            return False

    # Additional data type checks or value ranges can be added here
    return True

def check_unique_id_match(df1, df2):
    df1_ids = set(df1['Unique_ID'])
    df2_ids = set(df2['Unique_ID'])

    # Check if the IDs in the two dataframes match
    if df1_ids == df2_ids:
        print("The Unique_ID values in both dataframes match perfectly!")
    else:
        missing_in_df1 = df2_ids - df1_ids
        missing_in_df2 = df1_ids - df2_ids

        if missing_in_df1:
            print(f"There are {len(missing_in_df1)} Unique_ID values present in the second dataframe but missing in the first.")
            print("Examples of these IDs are:", list(missing_in_df1)[:5])

        if missing_in_df2:
            print(f"There are {len(missing_in_df2)} Unique_ID values present in the first dataframe but missing in the second.")
            print("Examples of these IDs are:", list(missing_in_df2)[:5])

if Folder_path:

    merged_tracks_df = load_and_populate(r'.*tracks.*\.csv')

    if not validate_tracks_df(merged_tracks_df):
        print("Error: Validation failed for merged tracks dataframe.")
    else:
        merged_tracks_df = sort_and_generate_repeat(merged_tracks_df)
        merged_tracks_df['Unique_ID'] = merged_tracks_df['File_name'] + "_" + merged_tracks_df['TRACK_ID'].astype(str)
        save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv', desc="Saving Tracks")


    merged_spots_df = load_and_populate(r'.*spots.*\.csv', usecols=['TRACK_ID', 'POSITION_X', 'POSITION_Y', 'POSITION_T', 'RADIUS', 'CIRCULARITY', 'SOLIDITY', 'SHAPE_INDEX'])

    if not validate_spots_df(merged_spots_df):
        print("Error: Validation failed for merged spots dataframe.")
    else:
        merged_spots_df = sort_and_generate_repeat(merged_spots_df)
        merged_spots_df.dropna(subset=['POSITION_X', 'POSITION_Y'], inplace=True)
        merged_spots_df.reset_index(drop=True, inplace=True)
        merged_spots_df['Unique_ID'] = merged_spots_df['File_name'] + "_" + merged_spots_df['TRACK_ID'].astype(str)
        save_dataframe_with_progress(merged_spots_df, Results_Folder + '/' + 'merged_Spots.csv', desc="Saving Spots")
        # Now, call the check function
        check_unique_id_match(merged_spots_df, merged_tracks_df)
        print("...Done")

# For existing dataframes
if Track_table:
    print("Loading track table file....")
    merged_tracks_df = pd.read_csv(Track_table, low_memory=False)
    if not validate_tracks_df(merged_tracks_df):
        print("Error: Validation failed for loaded tracks dataframe.")

if Spot_table:
    print("Loading spot table file....")
    merged_spots_df = pd.read_csv(Spot_table, low_memory=False)
    if not validate_spots_df(merged_spots_df):
        print("Error: Validation failed for loaded spots dataframe.")


check_for_nans(merged_spots_df, "merged_spots_df")
check_for_nans(merged_tracks_df, "merged_tracks_df")


In [None]:
#@markdown ##Check Metadata


# Define the metadata columns that are expected to have identical values for each filename
metadata_columns = ['Cells', 'Flow_speed', 'ILbeta', 'Condition', 'experiment_nb', 'Repeat']

# Group the DataFrame by 'File_name' and then check if all entries within each group are identical
consistent_metadata = True
for name, group in merged_tracks_df.groupby('File_name'):
    for col in metadata_columns:
        if not group[col].nunique() == 1:
            consistent_metadata = False
            print(f"Inconsistency found for file: {name} in column: {col}")
            break  # Stop checking other columns for this group and move to the next file
    if not consistent_metadata:
        break  # Stop the entire process if any inconsistency is found

if consistent_metadata:
    print("All files have consistent metadata across the specified columns.")
else:
    print("There are inconsistencies in the metadata. Please check the output for details.")

# Drop duplicates based on the 'File_name' to get a unique list of filenames and their metadata
unique_files_df = merged_tracks_df.drop_duplicates(subset=['File_name'])[['File_name', 'Cells', 'Flow_speed', 'ILbeta', 'Condition', 'experiment_nb', 'Repeat']]

# Reset the index to clean up the DataFrame
unique_files_df.reset_index(drop=True, inplace=True)

# Display the resulting DataFrame as a formatted table
# (display() is needed because this is not the last expression of the cell)
display(unique_files_df)

# Group by 'Condition' and 'Repeat' and count the number of files in each combination
grouped = unique_files_df.groupby(['Condition', 'Repeat']).size().reset_index(name='counts')

# Check if any combinations have a count greater than 1, which means they are not unique
non_unique_combinations = grouped[grouped['counts'] > 1]

# Print the non-unique combinations
if not non_unique_combinations.empty:
    print("There are non-unique combinations of Conditions and Repeats:")
    print(non_unique_combinations)
else:
    print("All combinations of Conditions and Repeats are unique.")

check_unique_id_match(merged_spots_df, merged_tracks_df)


# Group the DataFrame by 'Cells', 'ILbeta', 'Repeat' and then check if there are 4 unique 'Flow_speed' values for each group
consistent_flow_speeds = True
for (cells, ilbeta, repeat), group in merged_tracks_df.groupby(['Cells', 'ILbeta', 'Repeat']):
    if group['Flow_speed'].nunique() != 4:
        consistent_flow_speeds = False
        print(f"Inconsistency found for Cells: {cells}, ILbeta: {ilbeta}, Repeat: {repeat} - Expected 4 Flow_speeds, found {group['Flow_speed'].nunique()}")
        break  # Stop the entire process if any inconsistency is found

if consistent_flow_speeds:
    print("Each combination of 'Cells', 'ILbeta', 'Repeat' has exactly 4 different 'Flow_speed' values.")
else:
    print("There are inconsistencies in 'Flow_speed' values. Please check the output for details.")


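<font size = 4> The `Repeat` column checked above is derived, per condition, from the experiment numbers found in the file names: numeric values are ranked in numerical order starting at 1, and non-numeric values such as 'Unknown' get 0. A minimal standalone sketch of that mapping, with `assign_repeats` as a hypothetical name and made-up experiment numbers:

```python
def assign_repeats(experiment_numbers):
    """Map experiment numbers (as strings) to repeat indices within one condition."""
    numeric = {value for value in experiment_numbers if value.isdigit()}
    # Sort with key=int so that '10' ranks after '3' (plain string sorting would not)
    mapping = {value: rank for rank, value in enumerate(sorted(numeric, key=int), start=1)}
    # Non-numeric entries (e.g. 'Unknown') all share repeat 0
    return [mapping.get(value, 0) for value in experiment_numbers]


# Made-up experiment numbers for a single condition
repeats = assign_repeats(['3', '10', 'Unknown', '3', '7'])
print(repeats)
```

<font size = 4> Repeat indices are therefore consecutive within a condition even when the experiment numbers themselves have gaps.
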
## **1.4. Filter out tracks shorter than 50 spots**


In [None]:
# @title ##Filter out tracks shorter than 50 spots


merged_tracks_df = merged_tracks_df[merged_tracks_df['NUMBER_SPOTS'] >= 50]
merged_spots_df = merged_spots_df[merged_spots_df['Unique_ID'].isin(merged_tracks_df['Unique_ID'])]


## **1.5. Visualise your tracks**
---

In [None]:
# @title ##Run the cell and choose the file you want to inspect

import ipywidgets as widgets
from ipywidgets import interact
import matplotlib.pyplot as plt

# Create the Results_Folder/Tracks subfolder if it doesn't exist
os.makedirs(Results_Folder + "/Tracks", exist_ok=True)

# Extract unique filenames from the dataframe
filenames = merged_spots_df['File_name'].unique()

# Create a Dropdown widget with the filenames
filename_dropdown = widgets.Dropdown(
    options=filenames,
    value=filenames[0] if len(filenames) > 0 else None,  # Default selected value
    description='File Name:',
)

def plot_coordinates(filename):
    if filename:
        # Filter the DataFrame based on the selected filename
        filtered_df = merged_spots_df[merged_spots_df['File_name'] == filename]

        plt.figure(figsize=(10, 8))
        for unique_id in filtered_df['Unique_ID'].unique():
            unique_df = filtered_df[filtered_df['Unique_ID'] == unique_id].sort_values(by='POSITION_T')
            plt.plot(unique_df['POSITION_X'], unique_df['POSITION_Y'], marker='o', linestyle='-', markersize=2)

        plt.xlabel('POSITION_X')
        plt.ylabel('POSITION_Y')
        plt.title(f'Coordinates for {filename}')
        plt.savefig(f"{Results_Folder}/Tracks/Tracks_{filename}.pdf")
        plt.show()
    else:
        print("No valid filename selected")

# Link the Dropdown widget to the plotting function
interact(plot_coordinates, filename=filename_dropdown)


In [None]:
# @title ##Batch Process


import os
import matplotlib.pyplot as plt

# Ensure the Results_Folder/Tracks directory exists
if not os.path.exists(Results_Folder + "/Tracks"):
    os.makedirs(Results_Folder + "/Tracks")

# Extract unique filenames from the dataframe
filenames = merged_spots_df['File_name'].unique()

def plot_coordinates(filename):
    if filename:
        # Filter the DataFrame based on the selected filename
        filtered_df = merged_spots_df[merged_spots_df['File_name'] == filename]

        plt.figure(figsize=(10, 8))
        for unique_id in filtered_df['Unique_ID'].unique():
            unique_df = filtered_df[filtered_df['Unique_ID'] == unique_id].sort_values(by='POSITION_T')
            plt.plot(unique_df['POSITION_X'], unique_df['POSITION_Y'], marker='o', linestyle='-', markersize=2)

        plt.xlabel('POSITION_X')
        plt.ylabel('POSITION_Y')
        plt.title(f'Coordinates for {filename}')
        plt.savefig(f"{Results_Folder}/Tracks/Tracks_{filename}.pdf")
        plt.close()  # Close the plot to avoid displaying it

# Loop through all filenames and generate plots
for filename in filenames:
    plot_coordinates(filename)


In [None]:
# @title ##Speed density plots


# Visualize the per-condition distributions of track speed and distance as overlaid histograms

import seaborn as sns
import matplotlib.pyplot as plt

def plot_distribution_by_condition_updated(df):
    conditions = df['Condition'].unique()
    metrics = ['TRACK_MEAN_SPEED', 'TRACK_MAX_SPEED', 'TRACK_MIN_SPEED', 'TOTAL_DISTANCE_TRAVELED']

    # Setting up the plotting environment
    sns.set_style("whitegrid")
    plt.figure(figsize=(18, 20))  # Tall figure to fit one subplot per metric

    # One subplot per metric, with one overlaid histogram per condition
    for position, metric in enumerate(metrics, start=1):
        plt.subplot(len(metrics), 1, position)
        for condition in conditions:
            sns.histplot(df[df['Condition'] == condition][metric], label=condition, kde=False, bins=30)
        plt.title(f'Histogram of {metric} by Condition')
        plt.legend()

    plt.tight_layout()
    plt.show()

# You can call this function with your dataframe like this:
plot_distribution_by_condition_updated(merged_tracks_df)



In [None]:
# @title ##Time points per track


import matplotlib.pyplot as plt


# Calculate the count of time points per track
time_points_per_track = merged_spots_df.groupby('Unique_ID').size()

# Plotting
plt.figure(figsize=(10, 6))
time_points_per_track.hist(bins=30, edgecolor='black')
plt.title('Distribution of Time Points per Track')
plt.xlabel('Number of Time Points')
plt.ylabel('Count of Tracks')
plt.grid(False)
plt.show()


--------------------------------------------------------
# **Part 2. Compute additional metrics**
--------------------------------------------------------
<font size = 4 color="red">Part 2 does not support track splitting.</font>

<font size = 4> To compute the additional track metrics in this part, track splitting must be disabled in TrackMate.
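
<font size = 4> The metric functions in this part stop with an error when a track contains two spots at the same time point. One plausible reason this matters for split tracks, stated here as an assumption rather than something the notebook spells out, is that both branches of a split share one `TRACK_ID`. The sketch below shows a standalone way to list such duplicates; `find_duplicate_timepoints` and the sample rows are hypothetical.

```python
from collections import Counter


def find_duplicate_timepoints(spots):
    """Return the (track_id, time) pairs that occur more than once.

    spots is an iterable of (track_id, time) tuples.
    """
    counts = Counter(spots)
    return sorted(pair for pair, count in counts.items() if count > 1)


# Made-up spots: track 'fileA_1' has two spots at t=2, as a split branch would (assumption)
sample_spots = [
    ('fileA_0', 0), ('fileA_0', 1), ('fileA_0', 2),
    ('fileA_1', 0), ('fileA_1', 1), ('fileA_1', 2), ('fileA_1', 2),
]
duplicates = find_duplicate_timepoints(sample_spots)
print(duplicates)
```
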


## **2.1. Compute Speed and rolling distance**
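
<font size = 4> The cell below derives two per-spot quantities from the consecutive positions of each track: an instantaneous speed (step length divided by the time step, undefined for the first spot) and a rolling distance (sum of step lengths in a centred window, undefined where the window is incomplete). A minimal pure-Python sketch of both on a single made-up track follows; the function names are illustrative and differ from the pandas implementation in the cell, and an odd window size is assumed.

```python
import math


def instantaneous_speeds(track):
    """track: list of (x, y, t) tuples sorted by t. The first spot has no speed."""
    speeds = [math.nan]
    for (x0, y0, t0), (x1, y1, t1) in zip(track, track[1:]):
        speeds.append(math.hypot(x1 - x0, y1 - y0) / (t1 - t0))
    return speeds


def rolling_distances(track, window_size=3):
    """Sum of step lengths in a centred window; NaN where the window is incomplete."""
    # The first spot has no previous position, so its step length is set to 0
    steps = [0.0] + [math.hypot(x1 - x0, y1 - y0)
                     for (x0, y0, _), (x1, y1, _) in zip(track, track[1:])]
    half = window_size // 2
    result = []
    for i in range(len(steps)):
        if i - half < 0 or i + half >= len(steps):
            result.append(math.nan)
        else:
            result.append(sum(steps[i - half:i + half + 1]))
    return result


# Made-up track: a 5-unit step, a pause, then another 5-unit step over 2 time units
track = [(0, 0, 0), (3, 4, 1), (3, 4, 2), (6, 8, 4)]
speeds = instantaneous_speeds(track)
distances = rolling_distances(track, window_size=3)
print(speeds)
print(distances)
```
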

In [None]:
# @title ##Compute Speed and rolling distance

from tqdm.notebook import tqdm

import numpy as np

def compute_instantaneous_speed(dataframe):
    # Check for required columns
    required_columns = ['Unique_ID', 'POSITION_T', 'POSITION_X', 'POSITION_Y']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    # Check for duplicate entries
    if dataframe.duplicated(subset=['Unique_ID', 'POSITION_T']).any():
        raise ValueError("There are duplicate entries based on 'Unique_ID' and 'POSITION_T'.")

    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    speeds = []

    for _, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Speeds"):
        # Check for NaN values in columns
        if track[['POSITION_X', 'POSITION_Y', 'POSITION_T']].isna().any().any():
            raise ValueError(f"Track with ID '{track['Unique_ID'].iloc[0]}' contains NaN values which might affect the computation.")

        # Ensure that time differences are non-negative before dividing by them
        time_diff = track['POSITION_T'].diff()
        if (time_diff < 0).any():
            raise ValueError(f"Track with ID '{track['Unique_ID'].iloc[0]}' has negative time differences.")

        # Calculate the instantaneous speed using positional data and time difference
        speed = np.sqrt(track['POSITION_X'].diff()**2 + track['POSITION_Y'].diff()**2) / time_diff

        # Ensuring the first speed value for each track is NaN
        speed.iloc[0] = np.nan

        speeds.extend(speed.tolist())

    # Safety Check
    if len(speeds) != len(dataframe):
        raise ValueError("The computed speeds list length doesn't match the dataframe's length.")

    dataframe['Speed'] = speeds

    return dataframe

# Example usage:
merged_spots_df = compute_instantaneous_speed(merged_spots_df)


def compute_rolling_average(dataframe, window_size=5):
    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    rolling_avg_speeds = []

    # Wrap the groupby object with tqdm for progress visualization
    for _, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Rolling Averages"):
        rolling_avg = track['Speed'].rolling(window=window_size, min_periods=1, center=True).mean()
        rolling_avg_speeds.extend(rolling_avg.tolist())

    # Safety Check
    if len(rolling_avg_speeds) != len(dataframe):
        raise ValueError("The computed rolling averages list length doesn't match the dataframe's length.")

    dataframe['RollingAvgSpeed'] = rolling_avg_speeds

    return dataframe

# Example usage:
merged_spots_df = compute_rolling_average(merged_spots_df, window_size=5)


def average_speed_first_last_n(dataframe, n=5):
    # Ensure n is a positive integer
    if not isinstance(n, int) or n <= 0:
        raise ValueError("n should be a positive integer.")

    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    speeds_first = {}
    speeds_last = {}

    for track_id, track in tqdm(dataframe.groupby('Unique_ID'), desc="Calculating average speeds"):
        # Ensure the track has at least n points
        if len(track) < n:
            print(f"Track {track_id} has fewer than {n} points. Skipping.")
            continue

        # Average instantaneous speed over the first n time points
        # (the first Speed value of a track is NaN and is ignored by mean())
        avg_speed_first = track['Speed'].iloc[:n].mean()
        speeds_first[track_id] = avg_speed_first

        # Average instantaneous speed over the last n time points
        avg_speed_last = track['Speed'].iloc[-n:].mean()
        speeds_last[track_id] = avg_speed_last

    # Convert average speeds to DataFrames
    avg_speeds_first_df = pd.DataFrame(speeds_first.items(), columns=['Unique_ID', 'AvgSpeedFirstN'])
    avg_speeds_last_df = pd.DataFrame(speeds_last.items(), columns=['Unique_ID', 'AvgSpeedLastN'])

    return avg_speeds_first_df, avg_speeds_last_df

# Example usage:
avg_speeds_first, avg_speeds_last = average_speed_first_last_n(merged_spots_df, 5)


def compute_min_rolling_speed(dataframe):
    # Safeguard: Ensure required columns are present
    required_columns = ['Unique_ID', 'POSITION_T', 'RollingAvgSpeed']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    # Safeguard: Check for duplicate entries
    if dataframe.duplicated(subset=['Unique_ID', 'POSITION_T']).any():
        raise ValueError("There are duplicate entries based on 'Unique_ID' and 'POSITION_T'.")

    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    min_speeds = {}

    # Wrap the groupby object with tqdm for progress visualization
    for track_id, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Min Rolling Speeds"):

        min_speed = track['RollingAvgSpeed'].min()
        min_speeds[track_id] = min_speed

    # Convert the dictionary to a DataFrame
    min_speed_df = pd.DataFrame(min_speeds.items(), columns=['Unique_ID', 'MinRollingAvgSpeed'])

    return min_speed_df

# Compute the minimum rolling speed for merged_spots_df
min_rolling_speed_df = compute_min_rolling_speed(merged_spots_df)


def merge_speeds(df_main, df_to_merge, key='Unique_ID'):
    # Safeguard: Ensure 'key' is present in both dataframes
    if key not in df_main.columns or key not in df_to_merge.columns:
        raise ValueError(f"The key '{key}' is not present in both dataframes to be merged.")

    overlapping_columns = df_main.columns.intersection(df_to_merge.columns).drop(key)
    df_main.drop(columns=overlapping_columns, inplace=True)
    return pd.merge(df_main, df_to_merge, on=key, how='left')


merged_tracks_df = merge_speeds(merged_tracks_df, avg_speeds_first)
merged_tracks_df = merge_speeds(merged_tracks_df, avg_speeds_last)
merged_tracks_df = merge_speeds(merged_tracks_df, min_rolling_speed_df)

def compute_rolling_distance(dataframe, window_size=3):
    """Compute the total distance traveled within a rolling time window."""
    # Safeguard: Ensure required columns are present
    required_columns = ['Unique_ID', 'POSITION_T', 'POSITION_X', 'POSITION_Y']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    # Safeguard: Handle potential negative or zero values for window size
    if window_size <= 0:
        raise ValueError("Window size must be a positive integer.")

    # Safeguard: Check for duplicate entries
    if dataframe.duplicated(subset=['Unique_ID', 'POSITION_T']).any():
        raise ValueError("There are duplicate entries based on 'Unique_ID' and 'POSITION_T'.")

    # Safeguard: Ensure window size is odd for trimming edges correctly
    if window_size % 2 == 0:
        raise ValueError("Please use an odd value for the window size for accurate trimming.")

    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one

    trim_size = window_size // 2  # Determine how much to trim from the edges
    rolling_distances = []

    for _, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Rolling Distance"):
        # Compute the Euclidean distance between consecutive points
        distances = np.sqrt(track['POSITION_X'].diff()**2 + track['POSITION_Y'].diff()**2).fillna(0)

        # Compute the rolling sum of distances
        rolling_distance = distances.rolling(window=window_size, center=True).sum()

        # Trim the edges (positional indexing; nothing to trim when trim_size is 0)
        if trim_size > 0:
            rolling_distance.iloc[:trim_size] = np.nan
            rolling_distance.iloc[-trim_size:] = np.nan

        rolling_distances.extend(rolling_distance.tolist())

    # Safeguard: Ensure the list of rolling distances matches the length of the dataframe
    if len(rolling_distances) != len(dataframe):
        raise ValueError("The computed rolling distances list length doesn't match the dataframe's length.")

    dataframe['RollingDistance'] = rolling_distances
    return dataframe

merged_spots_df = compute_rolling_distance(merged_spots_df, window_size=5)


def average_rolling_distance_first_last_n(dataframe, n=1):
    """Compute the average rolling distance for the first and last n points."""

    # Safeguard: Ensure required columns are present
    required_columns = ['Unique_ID', 'POSITION_T', 'RollingDistance']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    # Safeguard: Handle potential non-positive values for n
    if n <= 0:
        raise ValueError("n must be a positive integer.")

    # Safeguard: Check for duplicate entries
    if dataframe.duplicated(subset=['Unique_ID', 'POSITION_T']).any():
        raise ValueError("There are duplicate entries based on 'Unique_ID' and 'POSITION_T'.")

    distance_first = {}
    distance_last = {}
    dataframe.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
    dataframe.reset_index(drop=True, inplace=True)  # Reset the index and drop the old one


    for track_id, track in tqdm(dataframe.groupby('Unique_ID'), desc="Calculating average rolling distances"):
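        # Note: despite the "Avg" naming, the first n values are summed; NaN edge values are skipped by sum()
        avg_distance_first = track['RollingDistance'].iloc[:n].sum()
        distance_first[track_id] = avg_distance_first

        # Same for the last n values
        avg_distance_last = track['RollingDistance'].iloc[-n:].sum()
        distance_last[track_id] = avg_distance_last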

    avg_distances_first_df = pd.DataFrame(distance_first.items(), columns=['Unique_ID', 'AvgRollingDistanceFirstN'])
    avg_distances_last_df = pd.DataFrame(distance_last.items(), columns=['Unique_ID', 'AvgRollingDistanceLastN'])

    return avg_distances_first_df, avg_distances_last_df


def merge_rolling_distances(df_main, df_to_merge, key='Unique_ID'):
    """Merge rolling distances into a dataframe."""
    overlapping_columns = df_main.columns.intersection(df_to_merge.columns).drop(key)

    # Safeguard: Ensure that the df_main is updated correctly after dropping overlapping columns
    df_main = df_main.drop(columns=overlapping_columns)
    return pd.merge(df_main, df_to_merge, on=key, how='left')


def compute_min_rolling_distance(dataframe):
    """Compute the minimum rolling distance for each track."""

    # Safeguard: Ensure required columns are present
    required_columns = ['Unique_ID', 'RollingDistance']
    for col in required_columns:
        if col not in dataframe.columns:
            raise ValueError(f"Column '{col}' is missing in the dataframe.")

    min_distances = {}

    for track_id, track in tqdm(dataframe.groupby('Unique_ID'), desc="Computing Min Rolling Distances"):
        min_distance = track['RollingDistance'].min()
        min_distances[track_id] = min_distance

    min_distance_df = pd.DataFrame(min_distances.items(), columns=['Unique_ID', 'MinRollingDistance'])

    return min_distance_df

# Usage and merging operations:
avg_distances_first, avg_distances_last = average_rolling_distance_first_last_n(merged_spots_df, 1)
merged_tracks_df = merge_rolling_distances(merged_tracks_df, avg_distances_first)
merged_tracks_df = merge_rolling_distances(merged_tracks_df, avg_distances_last)

min_rolling_distance_df = compute_min_rolling_distance(merged_spots_df)
overlapping_columns = merged_tracks_df.columns.intersection(min_rolling_distance_df.columns).drop('Unique_ID')

# Safeguard: Ensure that the merged_tracks_df is updated correctly after dropping overlapping columns
merged_tracks_df = merged_tracks_df.drop(columns=overlapping_columns)
merged_tracks_df = pd.merge(merged_tracks_df, min_rolling_distance_df, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_spots_df, Results_Folder + '/' + 'merged_Spots.csv', desc="Saving Spots")
save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv', desc="Saving Tracks")

# Safeguard: check for NaN
check_for_nans(merged_tracks_df, "merged_tracks_df")
check_for_nans(merged_spots_df, "merged_spots_df")


## **2.2. Directionality**
---
<font size = 4>To calculate the directionality of a track, we consider a series of points, each with $x$, $y$, and $z$ coordinates, sorted by time. The directionality, denoted $D$, is calculated using the formula:

$$ D = \frac{d_{\text{euclidean}}}{d_{\text{total path}}} $$

where $d_{\text{euclidean}}$ is the Euclidean distance between the first and the last points of the track, calculated as:

$$ d_{\text{euclidean}} = \sqrt{(x_{\text{end}} - x_{\text{start}})^2 + (y_{\text{end}} - y_{\text{start}})^2 + (z_{\text{end}} - z_{\text{start}})^2} $$

and $d_{\text{total path}}$ is the sum of the Euclidean distances between all consecutive points in the track, that is, the total path length traveled. If the total path length is zero, the directionality is defined to be zero. This measure describes the straightness of the path: a value of 1 indicates a straight path between the start and end points, and values approaching 0 indicate more circuitous paths. The code cell below uses only `POSITION_X` and `POSITION_Y`, so the $z$ term is omitted and the measure is computed in the 2D plane.</font>
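
<font size = 4>As an illustration, here is a minimal sketch of the ratio above on two made-up 2D tracks. The helper name `directionality` and the coordinates are assumptions chosen for this example and are not part of the notebook.</font>

```python
import numpy as np

def directionality(points):
    """Net displacement divided by total path length (0 if the track never moves)."""
    points = np.asarray(points, dtype=float)
    # Straight-line distance between the first and last points
    d_euclidean = np.linalg.norm(points[-1] - points[0])
    # Sum of the step lengths between consecutive points
    d_total_path = np.linalg.norm(np.diff(points, axis=0), axis=1).sum()
    return d_euclidean / d_total_path if d_total_path != 0 else 0.0

# Hypothetical tracks, already sorted by time
straight = [(0, 0), (1, 0), (2, 0)]
zigzag = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]
stationary = [(1, 1), (1, 1)]

print(directionality(straight))    # path length equals net displacement
print(directionality(zigzag))      # detours lower the ratio
print(directionality(stationary))  # zero path length falls back to 0
```

<font size = 4>The zig-zag track ends 4 units from its start but travels $4\sqrt{2}$ units, so its directionality is $1/\sqrt{2}$.</font>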


In [None]:
# @title ##Calculate directionality
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

print("In progress...")

# Check if spots_df_to_use is None or empty; if so, set it to merged_spots_df
if 'spots_df_to_use' not in globals() or spots_df_to_use is None or spots_df_to_use.empty:
    spots_df_to_use = merged_spots_df

spots_df_to_use.dropna(subset=['POSITION_X', 'POSITION_Y'], inplace=True)

# Function to calculate Directionality
def calculate_directionality(group):

    group = group.sort_values('POSITION_T')
    start_point = group.iloc[0][['POSITION_X', 'POSITION_Y']].to_numpy()
    end_point = group.iloc[-1][['POSITION_X', 'POSITION_Y']].to_numpy()

    # Euclidean distance between start and end points (X and Y only)
    euclidean_distance = np.linalg.norm(end_point - start_point)

    # Total path length: sum of the step lengths between consecutive points
    deltas = np.linalg.norm(np.diff(group[['POSITION_X', 'POSITION_Y']].values, axis=0), axis=1)
    total_path_length = deltas.sum()

    # Calculating Directionality
    D = euclidean_distance / total_path_length if total_path_length != 0 else 0

    return pd.Series({'Directionality': D})


# Create a tqdm object for the groupby apply
tqdm.pandas(desc="Calculating Directionality")

# Assuming spots_df_to_use is your DataFrame
spots_df_to_use.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

# Calculate directionality for each track
df_directionality = spots_df_to_use.groupby('Unique_ID').progress_apply(calculate_directionality).reset_index()

# Find the overlapping columns between the two DataFrames, excluding the merging key
overlapping_columns = merged_tracks_df.columns.intersection(df_directionality.columns).drop('Unique_ID')

# Drop the overlapping columns from the left DataFrame
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

# Merge the directionality back into the original DataFrame
merged_tracks_df = pd.merge(merged_tracks_df, df_directionality, on='Unique_ID', how='left')

# Save the DataFrame with the calculated directionality
save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")


## **2.3. Tortuosity**
---
<font size = 4>This measure describes the curvature and complexity of the path taken: a value of 1 indicates a straight path between the start and end points, and values greater than 1 indicate paths with more twists and turns.
To calculate the tortuosity of a track, we consider a series of points, each with $x$, $y$, and $z$ coordinates, sorted by time (the code cell below uses only `POSITION_X` and `POSITION_Y`). The tortuosity, denoted $T$, is the total path length divided by the Euclidean distance between the first and last points, and is set to zero when that distance is zero:

$$ T = \frac{d_{\text{total path}}}{d_{\text{euclidean}}} $$
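
<font size = 4>The ratio can be checked by hand on small tracks. The sketch below uses a hypothetical helper named `tortuosity` and invented coordinates. It follows the convention of the code cell in this section, which returns 0 when the start and end points coincide.</font>

```python
import numpy as np

def tortuosity(points):
    """Total path length divided by net displacement (0 if start and end coincide)."""
    points = np.asarray(points, dtype=float)
    d_euclidean = np.linalg.norm(points[-1] - points[0])
    d_total_path = np.linalg.norm(np.diff(points, axis=0), axis=1).sum()
    return d_total_path / d_euclidean if d_euclidean != 0 else 0.0

# Hypothetical tracks, already sorted by time
l_shaped = [(0, 0), (3, 0), (3, 4)]              # path length 7, net displacement 5
closed_loop = [(0, 0), (1, 0), (1, 1), (0, 0)]   # returns to its starting point

t_l = tortuosity(l_shaped)
t_loop = tortuosity(closed_loop)
print(t_l, t_loop)
```

<font size = 4>For any track with a non-zero net displacement, $T$ is the reciprocal of the directionality $D$. A closed loop is the exception: it gets $T = 0$ by convention, so a value of 0 should not be read as "straighter than straight".</font>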



In [None]:
# @title ##Calculate tortuosity

import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

print("In progress...")

# Initialize tqdm with pandas
tqdm.pandas(desc="Calculating Tortuosity")

# Check if spots_df_to_use is None or empty; if so, set it to merged_spots_df
if 'spots_df_to_use' not in globals() or spots_df_to_use is None or spots_df_to_use.empty:
    spots_df_to_use = merged_spots_df

def calculate_tortuosity(group):
    group = group.sort_values('POSITION_T')

    # Extract the X and Y coordinates (in the units used in TrackMate)
    calibrated_coords = group[['POSITION_X', 'POSITION_Y']].values

    start_point = calibrated_coords[0]
    end_point = calibrated_coords[-1]

    # Euclidean distance between start and end points (X and Y only)
    euclidean_distance = np.linalg.norm(end_point - start_point)

    # Total path length: sum of the step lengths between consecutive points
    deltas = np.linalg.norm(np.diff(calibrated_coords, axis=0), axis=1)
    total_path_length = deltas.sum()

    # Calculating Tortuosity
    T = total_path_length / euclidean_distance if euclidean_distance != 0 else 0

    return pd.Series({'Tortuosity': T})

# Sort the DataFrame by 'Unique_ID' and 'POSITION_T'
spots_df_to_use.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

# Calculate tortuosity for each track using progress_apply
df_tortuosity = spots_df_to_use.groupby('Unique_ID').progress_apply(calculate_tortuosity).reset_index()

# Find the overlapping columns between the two DataFrames, excluding the merging key
overlapping_columns = merged_tracks_df.columns.intersection(df_tortuosity.columns).drop('Unique_ID')

# Drop the overlapping columns from the left DataFrame
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

# Merge the tortuosity back into the original DataFrame
merged_tracks_df = pd.merge(merged_tracks_df, df_tortuosity, on='Unique_ID', how='left')

# Save the DataFrame with the calculated tortuosity
save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")


## **2.4. Calculate the total turning angle**
---

<font size = 4>This measure describes the cumulative amount of turning along the path: a value of 0 indicates a straight path with no turning, and higher values indicate paths with more turning.

<font size = 4>To calculate the Total Turning Angle of a track, we consider a series of points, each with $x$, $y$, and $z$ coordinates, sorted by time (the code cell below uses only `POSITION_X` and `POSITION_Y`). The Total Turning Angle, denoted $A$, is the sum of the angles between each pair of consecutive direction vectors along the track.

<font size = 4>For each pair of consecutive segments in the track, we calculate the direction vectors $\vec{v_1}$ and $\vec{v_2}$, and the angle $\theta$ between them is calculated using the formula:

$$ \cos(\theta) = \frac{\vec{v_1} \cdot \vec{v_2}}{||\vec{v_1}|| \cdot ||\vec{v_2}||} $$

<font size = 4>where $\vec{v_1} \cdot \vec{v_2}$ is the dot product of the direction vectors, and $||\vec{v_1}||$ and $||\vec{v_2}||$ are their magnitudes. The Total Turning Angle $A$ is then the sum of all the angles $\theta$ (reported in degrees) along the track:

$$ A = \sum \theta $$

<font size = 4>If either of the direction vectors is a zero vector, the angle between them is undefined, and such cases are skipped in the calculation.
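
<font size = 4>A small worked example may help when interpreting the values. The sketch below applies the formula to invented 2D tracks; the helper name `total_turning_angle` and the coordinates are assumptions for illustration only.</font>

```python
import numpy as np

def total_turning_angle(points):
    """Sum of the angles (in degrees) between consecutive step vectors."""
    steps = np.diff(np.asarray(points, dtype=float), axis=0)
    total = 0.0
    for v1, v2 in zip(steps[:-1], steps[1:]):
        n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
        if n1 == 0 or n2 == 0:
            continue  # the angle is undefined for a zero-length step
        # Clip to [-1, 1] to guard against floating-point overshoot
        cos_theta = np.clip(np.dot(v1, v2) / (n1 * n2), -1.0, 1.0)
        total += np.degrees(np.arccos(cos_theta))
    return total

# Hypothetical tracks, already sorted by time
straight = [(0, 0), (1, 0), (2, 0)]
three_sides_of_square = [(0, 0), (1, 0), (1, 1), (0, 1)]

angle_straight = total_turning_angle(straight)
angle_square = total_turning_angle(three_sides_of_square)
print(angle_straight, angle_square)
```

<font size = 4>The second track makes two right-angle turns, so its total is about 180 degrees. Because the angles are summed without sign, a left turn followed by a right turn adds up rather than cancelling out.</font>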


In [None]:
# @title ##Calculate the total turning angle

import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

print("In progress...")

# Check if spots_df_to_use is None or empty; if so, set it to merged_spots_df
if 'spots_df_to_use' not in globals() or spots_df_to_use is None or spots_df_to_use.empty:
    spots_df_to_use = merged_spots_df

# Initialize tqdm with pandas
tqdm.pandas(desc="Calculating Total Turning Angle")

def calculate_total_turning_angle(group):
    group = group.sort_values('POSITION_T')
    directions = group[['POSITION_X', 'POSITION_Y']].diff().dropna()
    total_turning_angle = 0

    for i in range(1, len(directions)):
        dir1 = directions.iloc[i - 1]
        dir2 = directions.iloc[i]

        if np.linalg.norm(dir1) == 0 or np.linalg.norm(dir2) == 0:
            continue

        cos_angle = np.dot(dir1, dir2) / (np.linalg.norm(dir1) * np.linalg.norm(dir2))
        cos_angle = np.clip(cos_angle, -1, 1)
        angle = np.degrees(np.arccos(cos_angle))
        total_turning_angle += angle

    return pd.Series({'Total_Turning_Angle': total_turning_angle})

# Sort the DataFrame by 'Unique_ID' and 'POSITION_T'
spots_df_to_use.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

# Calculate total turning angle for each track using progress_apply instead of apply
df_turning_angle = spots_df_to_use.groupby('Unique_ID').progress_apply(calculate_total_turning_angle).reset_index()

# Check if 'Total_Turning_Angle' is in the columns of df_turning_angle
if 'Total_Turning_Angle' not in df_turning_angle.columns:
    print("Error: 'Total_Turning_Angle' not in df_turning_angle columns")

# Find the overlapping columns between the two DataFrames, excluding the merging key
overlapping_columns = merged_tracks_df.columns.intersection(df_turning_angle.columns).drop('Unique_ID')

# Drop the overlapping columns from the left DataFrame
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

# Merge the total turning angle back into the original DataFrame
merged_tracks_df = pd.merge(merged_tracks_df, df_turning_angle, on='Unique_ID', how='left')

# Save the DataFrame with the calculated total turning angle
save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")


## **2.6. Calculate the FMI**
---
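
<font size = 4>As implemented in the code cell of this section, the forward migration index (FMI) is the net displacement along the $x$ axis divided by the total path length, and is set to zero when the path length is zero:

$$ \text{FMI} = \frac{x_{\text{end}} - x_{\text{start}}}{d_{\text{total path}}} $$

<font size = 4>The sketch below reproduces this ratio on invented tracks. The helper name `fmi` and the coordinates are assumptions for illustration. Treating $x$ as the "forward" axis is what the notebook code does; whether $x$ matches the direction of interest in a given experiment depends on how the images were acquired.</font>

```python
import numpy as np

def fmi(points):
    """Net x displacement divided by total path length (0 if the track never moves)."""
    points = np.asarray(points, dtype=float)
    # Total path length: sum of the step lengths between consecutive points
    d_total_path = np.linalg.norm(np.diff(points, axis=0), axis=1).sum()
    # Net displacement along the x axis
    forward = points[-1, 0] - points[0, 0]
    return forward / d_total_path if d_total_path != 0 else 0.0

# Hypothetical tracks, already sorted by time
along_x = [(0, 0), (1, 0), (2, 0)]      # moves only in +x
against_x = [(2, 0), (1, 0), (0, 0)]    # moves only in -x
along_y = [(0, 0), (0, 1), (0, 2)]      # moves perpendicular to x

print(fmi(along_x), fmi(against_x), fmi(along_y))
```

<font size = 4>The value therefore ranges from -1 (straight motion against the $x$ axis) to 1 (straight motion along it), with 0 for motion perpendicular to $x$.</font>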



In [None]:
# @title ##Calculate the FMI

import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

def calculate_fmi(group):
    group = group.sort_values('POSITION_T')

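    # Total path length: sum of the step lengths between consecutive points
    deltas = np.sqrt(group['POSITION_X'].diff().fillna(0)**2 + group['POSITION_Y'].diff().fillna(0)**2)
    total_path_length = deltas.sum()

    # Net displacement along the X axis (the signed X steps sum to x_end - x_start)
    total_forward_displacement = group['POSITION_X'].diff().fillna(0).sum()

    # FMI is set to 0 for tracks with zero path length
    FMI = total_forward_displacement / total_path_length if total_path_length != 0 else 0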

    return pd.Series({'FMI': FMI})


# Use tqdm.pandas() for progress_apply
tqdm.pandas(desc="Processing tracks")

# Sort the DataFrame by 'Unique_ID' and 'POSITION_T'
merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

# Group by track ID and calculate metrics with tqdm progress bar
grouped = merged_spots_df.groupby('Unique_ID')
df_fmi = grouped.progress_apply(calculate_fmi).reset_index()

# Find the overlapping columns between the two DataFrames, excluding the merging key
overlapping_columns = merged_tracks_df.columns.intersection(df_fmi.columns).drop('Unique_ID')

# Drop the overlapping columns from the left DataFrame
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

# Merge the FMI values back into the original DataFrame
merged_tracks_df = pd.merge(merged_tracks_df, df_fmi, on='Unique_ID', how='left')

merged_tracks_df.to_csv(Results_Folder + '/' + 'merged_Tracks.csv', index=False)

check_for_nans(merged_tracks_df, "merged_tracks_df")



## **2.7. Additional morphological metrics**
---

In [None]:
# @title ##Compute additional morphological metrics
from tqdm.notebook import tqdm

print("In progress...")

def compute_morphological_metrics(spots_df, metrics):
    # Compute mean, median, std, min, and max for each metric
    mean_df = spots_df.groupby('Unique_ID')[metrics].mean(numeric_only=True).add_prefix('MEAN_')
    median_df = spots_df.groupby('Unique_ID')[metrics].median(numeric_only=True).add_prefix('MEDIAN_')
    std_df = spots_df.groupby('Unique_ID')[metrics].std(numeric_only=True).add_prefix('STD_')
    min_df = spots_df.groupby('Unique_ID')[metrics].min(numeric_only=True).add_prefix('MIN_')
    max_df = spots_df.groupby('Unique_ID')[metrics].max(numeric_only=True).add_prefix('MAX_')

    # Concatenate the computed metrics into a single dataframe without resetting the index
    metrics_df = pd.concat([mean_df, median_df, std_df, min_df, max_df], axis=1)

    return metrics_df

# Columns used by compute_morphological_metrics (the morphological ones may be absent from the spots table)
required_columns_spots = ['Unique_ID', 'RADIUS', 'CIRCULARITY', 'SOLIDITY', 'SHAPE_INDEX']

# Check which required columns are present in merged_spots_df
available_columns = [col for col in required_columns_spots if col in merged_spots_df.columns]
missing_columns = [col for col in required_columns_spots if col not in merged_spots_df.columns]

# Report the columns that were not found so it is clear which metrics are skipped
if missing_columns:
    print(f"Warning: columns not found in merged_spots_df and skipped: {missing_columns}")

# Compute the morphological metrics
morphological_metrics_df = compute_morphological_metrics(merged_spots_df, available_columns)

# Reset the index for the morphological_metrics_df to have Unique_ID as a column
morphological_metrics_df.reset_index(inplace=True)

# Find overlapping columns and merge
if 'Unique_ID' in merged_tracks_df.columns:
    overlapping_columns = merged_tracks_df.columns.intersection(morphological_metrics_df.columns).drop('Unique_ID', errors='ignore')
    merged_tracks_df.drop(columns=overlapping_columns, inplace=True)
    merged_tracks_df = merged_tracks_df.merge(morphological_metrics_df, on='Unique_ID', how='left')
    save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

else:
    print("Error: 'Unique_ID' column missing in merged_tracks_df. Skipping merging with morphological metrics.")

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

-------------------------------------------

# **Part 3. Plot track parameters**
-------------------------------------------

<font size = 4> In this section you can plot all the track parameters previously computed. Data and graphs are automatically saved in your result folder.

<font size = 4 color="red"> Parameters computed are in the unit you provided when tracking your data in TrackMate.

## **Statistical analyses**
### Cohen's d (Effect Size):
<font size = 4>Cohen's d measures the size of the difference between two group means, normalized by their pooled standard deviation. By Cohen's conventional benchmarks, an absolute value of about 0.2 is a small effect, about 0.5 a medium effect, and 0.8 or above a large effect. It quantifies how large the observed difference is, independently of whether it is statistically significant.
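
<font size = 4>To make the definition concrete, the sketch below computes Cohen's d with the textbook pooled standard deviation, in which each sample variance is weighted by its degrees of freedom. The function name `cohen_d_pooled` and the numbers are invented for this example.</font>

```python
import numpy as np

def cohen_d_pooled(a, b):
    """Difference of means divided by the pooled sample standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Pooled variance: sample variances (ddof=1) weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical measurements for two conditions
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 3.0, 5.0]

d = cohen_d_pooled(group_a, group_b)
print(d)
```

<font size = 4>Both groups here have a sample variance of 4 and their means differ by 1, so d is 1 / 2 = 0.5. The sign of d depends on which group is listed first.</font>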

### Randomization Test:
<font size = 4>This non-parametric test evaluates whether the observed difference between two conditions could have arisen by chance. It shuffles the condition labels many times, recalculating Cohen's d each time. The p-value is the fraction of shuffles whose absolute Cohen's d is at least as large as the observed one; a smaller p-value implies stronger evidence against the null hypothesis of no difference between the conditions.
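
<font size = 4>The procedure can be sketched in a few lines. The example below is an illustration under stated assumptions, not the notebook's implementation: the function names, the fixed random seed, and the data are all invented, and NumPy arrays stand in for the DataFrame columns.</font>

```python
import numpy as np

def cohen_d(a, b):
    """Cohen's d with the textbook pooled standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def randomization_p_value(a, b, n_iterations=1000, seed=0):
    """Fraction of label shuffles whose |d| is at least the observed |d|."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)  # fixed seed so the sketch is repeatable
    observed = abs(cohen_d(a, b))
    combined = np.concatenate([a, b])
    count_extreme = 0
    for _ in range(n_iterations):
        shuffled = rng.permutation(combined)  # reassign condition labels at random
        new_a, new_b = shuffled[:len(a)], shuffled[len(a):]
        if abs(cohen_d(new_a, new_b)) >= observed:
            count_extreme += 1
    return count_extreme / n_iterations

# Hypothetical measurements for two conditions
well_separated = randomization_p_value(np.arange(8), np.arange(8) + 100)
identical = randomization_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print(well_separated, identical)
```

<font size = 4>With a finite number of shuffles, the smallest non-zero p-value that can be reported is 1 / `n_iterations`, so a reported value of 0 means "below the resolution of the test" rather than exactly zero.</font>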

### Bonferroni Correction:
<font size = 4>When several pairwise comparisons are made, the Bonferroni correction reduces the risk of false positives by dividing the significance level (alpha) by the number of tests. Equivalently, each p-value can be multiplied by the number of comparisons and capped at 1, which is how the corrected p-values are reported here. The method is conservative and can miss genuine effects when many comparisons are made.
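
<font size = 4>A short numerical example shows that the two formulations give the same decisions. The p-values and comparison labels below are invented for illustration.</font>

```python
# Hypothetical raw p-values for three pairwise comparisons
p_values = {"A vs B": 0.004, "A vs C": 0.03, "B vs C": 0.2}

alpha = 0.05
num_comparisons = len(p_values)

# Formulation 1: compare each raw p-value to a stricter threshold
corrected_alpha = alpha / num_comparisons
significant = {pair: p < corrected_alpha for pair, p in p_values.items()}

# Formulation 2: inflate each p-value (capped at 1) and compare to the original alpha
adjusted = {pair: min(p * num_comparisons, 1.0) for pair, p in p_values.items()}
significant_adjusted = {pair: p_adj < alpha for pair, p_adj in adjusted.items()}

print(significant)
print(adjusted)
```

<font size = 4>In this example only "A vs B" remains significant: 0.03 is below 0.05 but above the corrected threshold of 0.05 / 3.</font>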


In [None]:
# @title ##Plot track parameters

# Import necessary libraries
import os
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from matplotlib.backends.backend_pdf import PdfPages
import ipywidgets as widgets
from IPython.display import display


# Create the output directories if they do not exist yet (parent folders are created automatically)
for subfolder in ("pdf", "csv"):
    os.makedirs(f"{Results_Folder}/track_parameters_plots/{subfolder}", exist_ok=True)

# Helper functions
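def cohen_d(group1, group2):
    """Compute Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    mean_diff = group1.mean() - group2.mean()
    # Pooled variance: sample variances (ddof=1) weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * group1.var() + (n2 - 1) * group2.var()) / (n1 + n2 - 2)
    d = mean_diff / pooled_var**0.5
    return d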

def get_selectable_columns(df):
    """Get columns that can be plotted."""
    exclude_cols = ['Condition', 'File_name', 'Flow_speed', 'Cells', 'ILbeta', 'Repeat', 'Unique_ID',
                    'experiment_nb', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION',
                    'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION']
    return [col for col in df.columns if col not in exclude_cols]

def display_variable_checkboxes(selectable_columns):
    """Display checkboxes for selecting variables."""
    variable_checkboxes = [widgets.Checkbox(value=False, description=col) for col in selectable_columns]
    display(widgets.VBox([
        widgets.Label('Variables to Plot:'),
        widgets.GridBox(variable_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 300px)" % 3))
    ]))
    return variable_checkboxes


def create_filename(base, selected_cells, selected_speeds, selected_ilbetas, var):
    """Create a unique filename based on selected options."""
    def summarize_options(options):
        if len(options) > 3:
            return f"{len(options)}options"
        return "_".join(options)

    selected_options = "_".join([
        summarize_options(selected_cells),
        summarize_options(selected_speeds),
        summarize_options(selected_ilbetas)
    ])

    filename = f"{base}_{selected_options}_{var}.pdf"
    return filename.replace(" ", "_")  # Replace spaces with underscores for file compatibility


# Create checkboxes for various attributes
cells_checkboxes = [widgets.Checkbox(value=False, description=str(cell)) for cell in merged_tracks_df['Cells'].unique()]
flow_speed_checkboxes = [widgets.Checkbox(value=False, description=str(speed)) for speed in merged_tracks_df['Flow_speed'].unique()]
ilbeta_checkboxes = [widgets.Checkbox(value=False, description=str(ilbeta)) for ilbeta in merged_tracks_df['ILbeta'].unique()]


# Display checkboxes
display(widgets.VBox([
    widgets.Label('Cells:'),
    widgets.GridBox(cells_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 100px)" % 4)),
    widgets.Label('Flow Speed:'),
    widgets.GridBox(flow_speed_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 100px)" % 4)),
    widgets.Label('ILbeta:'),
    widgets.GridBox(ilbeta_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 100px)" % 4))

]))

# Convert Flow_speed to string for checkbox matching
merged_tracks_df['Flow_speed'] = merged_tracks_df['Flow_speed'].astype(str)

# Define the plotting function
def plot_selected_vars(button, variable_checkboxes):
    print("Plotting in progress...")

    # Fetch selected values
    selected_cells = [box.description for box in cells_checkboxes if box.value]
    selected_speeds = [box.description for box in flow_speed_checkboxes if box.value]
    selected_ilbetas = [box.description for box in ilbeta_checkboxes if box.value]
    variables_to_plot = [box.description for box in variable_checkboxes if box.value]

    # Filter dataframe
    filtered_df = merged_tracks_df.copy()
    filtered_df = filtered_df[filtered_df['Cells'].isin(selected_cells)]
    filtered_df = filtered_df[filtered_df['Flow_speed'].isin(selected_speeds)]
    filtered_df = filtered_df[filtered_df['ILbeta'].isin(selected_ilbetas)]

    # Initialize matrices for statistics
    effect_size_matrices = {}
    p_value_matrices = {}
    bonferroni_matrices = {}

    unique_conditions = filtered_df['Condition'].unique().tolist()
    num_comparisons = len(unique_conditions) * (len(unique_conditions) - 1) // 2
    alpha = 0.05
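    # Guard against division by zero when fewer than two conditions are selected
    corrected_alpha = alpha / num_comparisons if num_comparisons > 0 else alpha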
    n_iterations = 1000

    # Loop through each variable to plot
    for var in variables_to_plot:

      filename = create_filename("track_parameters_plots", selected_cells, selected_speeds, selected_ilbetas, var)
      pdf_path = os.path.join(Results_Folder, "track_parameters_plots", "pdf", filename)
      csv_path = os.path.join(Results_Folder, "track_parameters_plots", "csv", f"{filename[:-4]}.csv")  # Remove '.pdf' and add '.csv'

      pdf_pages = PdfPages(pdf_path)

      effect_size_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)
      p_value_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)
      bonferroni_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)

      for cond1, cond2 in itertools.combinations(unique_conditions, 2):
        group1 = filtered_df[filtered_df['Condition'] == cond1][var]
        group2 = filtered_df[filtered_df['Condition'] == cond2][var]

        original_d = cohen_d(group1, group2)
        effect_size_matrix.loc[cond1, cond2] = original_d
        effect_size_matrix.loc[cond2, cond1] = original_d  # Mirroring

        count_extreme = 0
        for i in range(n_iterations):
            combined = pd.concat([group1, group2])
            shuffled = combined.sample(frac=1, replace=False).reset_index(drop=True)
            new_group1 = shuffled[:len(group1)]
            new_group2 = shuffled[len(group1):]

            new_d = cohen_d(new_group1, new_group2)
            if np.abs(new_d) >= np.abs(original_d):
                count_extreme += 1

        p_value = count_extreme / n_iterations
        p_value_matrix.loc[cond1, cond2] = p_value
        p_value_matrix.loc[cond2, cond1] = p_value  # Mirroring

        # Apply Bonferroni correction
        bonferroni_corrected_p_value = min(p_value * num_comparisons, 1.0)
        bonferroni_matrix.loc[cond1, cond2] = bonferroni_corrected_p_value
        bonferroni_matrix.loc[cond2, cond1] = bonferroni_corrected_p_value  # Mirroring

      effect_size_matrices[var] = effect_size_matrix
      p_value_matrices[var] = p_value_matrix
      bonferroni_matrices[var] = bonferroni_matrix

    # Concatenate the three matrices side-by-side
      combined_df = pd.concat(
        [
            effect_size_matrices[var].rename(columns={col: f"{col} (Effect Size)" for col in effect_size_matrices[var].columns}),
            p_value_matrices[var].rename(columns={col: f"{col} (P-Value)" for col in p_value_matrices[var].columns}),
            bonferroni_matrices[var].rename(columns={col: f"{col} (Bonferroni-corrected P-Value)" for col in bonferroni_matrices[var].columns})
        ], axis=1
    )

    # Save the combined DataFrame to a CSV file
      combined_df.to_csv(csv_path)

    # Create a new figure
      fig = plt.figure(figsize=(16, 10))

    # Create a gridspec for 2 rows and 3 columns
      gs = GridSpec(2, 3, height_ratios=[1.5, 1])

    # Create the ax for boxplot using the gridspec
      ax_box = fig.add_subplot(gs[0, :])

    # Extract the data for this variable
      data_for_var = filtered_df[['Condition', var, 'Repeat', 'File_name' ]]

    # Save the data_for_var to a CSV for replotting
      data_for_var.to_csv(f"{Results_Folder}/track_parameters_plots/csv/{var}_boxplot_data.csv", index=False)

    # Calculate the Interquartile Range (IQR) using the 25th and 75th percentiles
      Q1 = filtered_df[var].quantile(0.25)
      Q3 = filtered_df[var].quantile(0.75)
      IQR = Q3 - Q1

    # Define bounds for the outliers
      multiplier = 10
      lower_bound = Q1 - multiplier * IQR
      upper_bound = Q3 + multiplier * IQR

    # Plotting
      sns.boxplot(x='Condition', y=var, data=filtered_df, ax=ax_box, color='lightgray')  # Boxplot
      sns.stripplot(x='Condition', y=var, data=filtered_df, ax=ax_box, hue='Repeat', dodge=True, jitter=True, alpha=0.2)  # Individual data points
      ax_box.set_ylim([max(min(filtered_df[var]), lower_bound), min(max(filtered_df[var]), upper_bound)])
      ax_box.set_title(f"{var}")
      ax_box.set_xlabel('Condition')
      ax_box.set_ylabel(var)
      ax_box.set_xticklabels(ax_box.get_xticklabels(), rotation=90)
      ax_box.legend(loc='center left', bbox_to_anchor=(1, 0.5), title='Repeat')

    # Statistical Analyses and Heatmaps

    # Effect Size heatmap ax
      ax_d = fig.add_subplot(gs[1, 0])
      sns.heatmap(effect_size_matrices[var].fillna(0), annot=True, cmap="coolwarm", cbar=True, square=True, ax=ax_d)
      ax_d.set_title(f"Effect Size (Cohen's d) for {var}")

    # p-value heatmap ax
      ax_p = fig.add_subplot(gs[1, 1])
      sns.heatmap(p_value_matrices[var].fillna(1), annot=True, cmap="viridis_r", cbar=True, square=True, ax=ax_p, vmax=0.1)
      ax_p.set_title(f"Randomization Test p-value for {var}")

    # Bonferroni corrected p-value heatmap ax
      ax_bonf = fig.add_subplot(gs[1, 2])
      sns.heatmap(bonferroni_matrices[var].fillna(1), annot=True, cmap="viridis_r", cbar=True, square=True, ax=ax_bonf, vmax=0.1)
      ax_bonf.set_title(f"Bonferroni-corrected p-value for {var}")

      plt.tight_layout()
      pdf_pages.savefig(fig)

      # Close the PDF
      pdf_pages.close()

# Display variable checkboxes and button
selectable_columns = get_selectable_columns(merged_tracks_df)
variable_checkboxes = display_variable_checkboxes(selectable_columns)
button = widgets.Button(description="Plot Selected Variables", layout=widgets.Layout(width='400px'))
button.on_click(lambda b: plot_selected_vars(b, variable_checkboxes))
display(button)


--------
# **Part 4. Quality Control**
--------


### Compute Similarity Metrics between Fields of View (FOVs) and between Conditions and Repeats

<font size = 4>**Purpose**:

<font size = 4>This section provides a set of tools to compute and visualize similarities between different fields of view (FOVs) based on selected track parameters. Using hierarchical clustering, the resulting dendrogram shows how different FOVs, conditions, or repeats relate to one another. This tool is useful for:

<font size = 4>1. **Quality Control**:
    - Ensuring that FOVs from the same condition or experimental setup are more similar to each other than to FOVs from different conditions.
    - Confirming that repeats of the same experiment yield consistent results and cluster together.
    
<font size = 4>2. **Data Integrity**:
    - Identifying potential outliers or anomalies in the dataset.
    - Assessing the overall consistency of the experiment and ensuring reproducibility.

<font size = 4>**How to Use**:

<font size = 4>1. **Track Parameters Selection**:
    - A list of checkboxes allows users to select which track parameters they want to consider for similarity calculations. By default, all parameters are selected. Users can deselect parameters that they believe might not contribute significantly to the similarity.

<font size = 4>2. **Similarity Metric**:
    - Users can choose a similarity metric from a dropdown list. Options include cosine, euclidean, cityblock, jaccard, and correlation. The choice of similarity metric can influence the clustering results, so users might need to experiment with different metrics to see which one provides the most meaningful results.

<font size = 4>3. **Linkage Method**:
    - Determines how the distance between clusters is calculated in the hierarchical clustering process. Different linkage methods can produce different dendrograms, so users might want to try various methods.

<font size = 4>4. **Visualization**:
    - Once the parameters are selected, users can click on the "Select the track parameters and visualize similarity" button. This will compute the hierarchical clustering and display two dendrograms:
        - One dendrogram displays similarities between individual FOVs.
        - Another dendrogram aggregates the data based on conditions and repeats, providing a higher-level view of the similarities.
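
<font size = 4>The steps above can be sketched on a toy table. This is a sketch under stated assumptions rather than the notebook's implementation: the FOV labels, column names, and values are invented, and standardizing each parameter before computing distances is a choice made for this example.</font>

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical per-FOV averages of two track parameters
fov_means = pd.DataFrame(
    {"MEAN_SPEED": [1.0, 1.1, 5.0, 5.2], "Directionality": [0.20, 0.25, 0.80, 0.85]},
    index=["ctrl_FOV1", "ctrl_FOV2", "treated_FOV1", "treated_FOV2"],
)

# Standardize each parameter so that no single one dominates the distance
scaled = (fov_means - fov_means.mean()) / fov_means.std()

# Pairwise distances between FOVs using the chosen similarity metric
distances = pdist(scaled.values, metric="euclidean")

# Hierarchical clustering using the chosen linkage method
Z = linkage(distances, method="average")

# Build the dendrogram structure without drawing it; 'ivl' holds the leaf order
tree = dendrogram(Z, labels=scaled.index.tolist(), no_plot=True)
print(tree["ivl"])
```

<font size = 4>In this toy table the two control FOVs merge with each other before joining the treated FOVs, which is the pattern expected when FOVs from the same condition are consistent. Changing `metric` or `method` is the programmatic equivalent of changing the dropdown selections.</font>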
      



In [None]:
# @title ##Filter the data

import pandas as pd
import ipywidgets as widgets
from IPython.display import display

# Module-level variables that store the filtered data and the selected options
filtered_df = pd.DataFrame()

selected_cells, selected_speeds, selected_ilbetas = [], [], []

# Function to summarize selected options into a string
def summarize_options(options):
    return "_".join([str(option) for option in options if option])  # Filters out any 'falsy' values like empty strings or None

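# Function to create a filename based on selected options
# (this replaces the five-argument create_filename defined in Part 3 with a three-argument version)
def create_filename(selected_cells, selected_speeds, selected_ilbetas):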
    # Join the summarized options for each parameter with an underscore
    selected_options = "_".join([
        summarize_options(selected_cells),
        summarize_options(selected_speeds),
        summarize_options(selected_ilbetas)
    ])

    # Replace spaces with underscores and return the filename
    filename = f"{selected_options}"
    return filename.replace(" ", "_")

# Create checkboxes for each category
cells_checkboxes = [widgets.Checkbox(value=False, description=str(cell)) for cell in merged_tracks_df['Cells'].unique()]
flow_speed_checkboxes = [widgets.Checkbox(value=False, description=str(speed)) for speed in merged_tracks_df['Flow_speed'].unique()]
ilbeta_checkboxes = [widgets.Checkbox(value=False, description=str(ilbeta)) for ilbeta in merged_tracks_df['ILbeta'].unique()]

# Function to filter dataframe and update global variables based on selected checkbox values
def filter_dataframe(button):
    global filtered_df, selected_cells, selected_speeds, selected_ilbetas

    # Cast to string and trim whitespace so the values match the checkbox labels
    merged_tracks_df['Cells'] = merged_tracks_df['Cells'].astype(str).str.strip()
    merged_tracks_df['Flow_speed'] = merged_tracks_df['Flow_speed'].astype(str).str.strip()
    merged_tracks_df['ILbeta'] = merged_tracks_df['ILbeta'].astype(str).str.strip()

    selected_cells = [box.description for box in cells_checkboxes if box.value]
    selected_speeds = [box.description for box in flow_speed_checkboxes if box.value]
    selected_ilbetas = [box.description for box in ilbeta_checkboxes if box.value]

    # Debugging output
    print("Selected Cells:", selected_cells)
    print("Selected Speeds:", selected_speeds)
    print("Selected ILbetas:", selected_ilbetas)
    print("Original DF length:", len(merged_tracks_df))

    filtered_df = merged_tracks_df[
        (merged_tracks_df['Cells'].isin(selected_cells)) &
        (merged_tracks_df['Flow_speed'].isin(selected_speeds)) &
        (merged_tracks_df['ILbeta'].isin(selected_ilbetas))
    ]

    # More debugging output
    print("Filtered DF length:", len(filtered_df))
    if len(filtered_df) == 0:
        print("No data matched the selected filters. Check filters and data for consistency.")
        print("Unique 'Cells' in DataFrame:", merged_tracks_df['Cells'].unique())
        print("Unique 'Flow_speed' in DataFrame:", merged_tracks_df['Flow_speed'].unique())
        print("Unique 'ILbeta' in DataFrame:", merged_tracks_df['ILbeta'].unique())

    print("Done")

# Now call the filter function or trigger the button to filter the dataframe and see the output.


# Button to trigger dataframe filtering
filter_button = widgets.Button(description="Filter Dataframe")
filter_button.on_click(filter_dataframe)

# Display checkboxes and button
display(widgets.VBox([
    widgets.Label('Select Cells:'),
    widgets.HBox(cells_checkboxes),
    widgets.Label('Select Flow Speed:'),
    widgets.HBox(flow_speed_checkboxes),
    widgets.Label('Select ILbeta:'),
    widgets.HBox(ilbeta_checkboxes),
    filter_button
]))


In [None]:
# @title ##Compute similarity metrics between FOV and between conditions and repeats

import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import ipywidgets as widgets
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import pdist

# Create the "Similarity" results folder if it does not exist yet
import os
os.makedirs(f"{Results_Folder}/Similarity", exist_ok=True)

# Columns to exclude
excluded_columns = ['experiment_nb', 'File_name', 'Repeat', 'TRACK_INDEX', 'TRACK_ID',
                    'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION']

filename = create_filename(selected_cells, selected_speeds, selected_ilbetas)

selected_df = pd.DataFrame()

# Filter out non-numeric columns but keep 'File_name'
numeric_df = filtered_df.select_dtypes(include=['float64', 'int64']).copy()
numeric_df['File_name'] = filtered_df['File_name']

# List the selectable metric columns (identifier and position columns are excluded)
column_names = [col for col in numeric_df.columns if col not in excluded_columns]

# Create a checkbox for each column
checkboxes = [widgets.Checkbox(value=True, description=col, indent=False) for col in column_names]

# Dropdown for similarity metrics
similarity_dropdown = widgets.Dropdown(
    options=['cosine', 'euclidean', 'cityblock', 'jaccard', 'correlation'],
    value='cosine',
    description='Similarity Metric:'
)

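# Dropdown for linkage methods
# Note: 'ward' linkage is only well defined for Euclidean distances, so pair it with the 'euclidean' metric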
linkage_dropdown = widgets.Dropdown(
    options=['single', 'complete', 'average', 'ward'],
    value='single',
    description='Linkage Method:'
)

# Arrange checkboxes in a 2x grid
grid = widgets.GridBox(checkboxes, layout=widgets.Layout(grid_template_columns="repeat(2, 300px)"))

# Create a button to trigger the selection and visualization
button = widgets.Button(description="Select the track parameters and visualize similarity", layout=widgets.Layout(width='400px'))

# Define the button click event handler
def on_button_click(b):
    global selected_df  # Declare selected_df as global

    # Get the selected columns from the checkboxes
    selected_columns = [box.description for box in checkboxes if box.value]
    selected_columns.append('File_name')  # Always include 'File_name'

    # Extract the selected columns from the DataFrame
    selected_df = numeric_df[selected_columns]

    # Aggregate the data by filename
    aggregated_by_filename = selected_df.groupby('File_name').mean(numeric_only=True)

    # Aggregate the data by condition and repeat
    aggregated_by_condition_repeat = filtered_df.groupby(['Condition', 'Repeat'])[selected_columns].mean(numeric_only=True)

    # Compute condensed distance matrices
    distance_matrix_filename = pdist(aggregated_by_filename, metric=similarity_dropdown.value)
    distance_matrix_condition_repeat = pdist(aggregated_by_condition_repeat, metric=similarity_dropdown.value)

    # Perform hierarchical clustering
    linked_filename = linkage(distance_matrix_filename, method=linkage_dropdown.value)
    linked_condition_repeat = linkage(distance_matrix_condition_repeat, method=linkage_dropdown.value)

    annotation_text = f"Similarity Method: {similarity_dropdown.value}, Linkage Method: {linkage_dropdown.value}"


    # Plot the dendrograms one under the other
    plt.figure(figsize=(10, 10))

    # Dendrogram for individual filenames
    plt.subplot(2, 1, 1)
    dendrogram(linked_filename, labels=aggregated_by_filename.index, orientation='top', distance_sort='descending', leaf_rotation=90)
    plt.title(f'Dendrogram of Field of view Similarities\n{annotation_text}')

    # Dendrogram for aggregated data based on condition and repeat
    plt.subplot(2, 1, 2)
    dendrogram(linked_condition_repeat, labels=aggregated_by_condition_repeat.index, orientation='top', distance_sort='descending', leaf_rotation=90)
    plt.title(f'Dendrogram of Aggregated Similarities by Condition and Repeat\n{annotation_text}')

    plt.tight_layout()

    # Save the dendrogram to a PDF
    pdf_pages = PdfPages(f"{Results_Folder}/Similarity/{filename}_Dendrogram_Similarities.pdf")

    # Save the current figure to the PDF
    pdf_pages.savefig()

    # Close the PdfPages object to finalize the document
    pdf_pages.close()

    plt.show()

# Set the button click event handler
button.on_click(on_button_click)

# Display the widgets
display(grid, similarity_dropdown, linkage_dropdown, button)


--------
# **Part 5. Explore your high-dimensional data using UMAP and HDBSCAN**
--------

<font size = 4> The workflow provided below is inspired by [cellPLATO](https://github.com/Michael-shannon/cellPLATO)

## **5.1. Choose the track metrics to use for clustering**



In [None]:
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

# @title ##Filter the data


# Global variables to store the selected options
global filtered_df
filtered_df = pd.DataFrame()

global selected_cells, selected_speeds, selected_ilbetas
selected_cells, selected_speeds, selected_ilbetas = [], [], []

# Function to summarize selected options into a string
def summarize_options(options):
    return "_".join([str(option) for option in options if option])  # Filters out any 'falsy' values like empty strings or None

# Function to create a filename based on selected options
def create_filename(selected_cells, selected_speeds, selected_ilbetas):
    # Join the summarized options for each parameter with an underscore
    selected_options = "_".join([
        summarize_options(selected_cells),
        summarize_options(selected_speeds),
        summarize_options(selected_ilbetas)
    ])

    # Replace spaces with underscores and return the filename
    filename = f"{selected_options}"
    return filename.replace(" ", "_")

# Create checkboxes for each category
cells_checkboxes = [widgets.Checkbox(value=False, description=str(cell)) for cell in merged_tracks_df['Cells'].unique()]
flow_speed_checkboxes = [widgets.Checkbox(value=False, description=str(speed)) for speed in merged_tracks_df['Flow_speed'].unique()]
ilbeta_checkboxes = [widgets.Checkbox(value=False, description=str(ilbeta)) for ilbeta in merged_tracks_df['ILbeta'].unique()]

# Function to filter dataframe and update global variables based on selected checkbox values
def filter_dataframe(button):
    global filtered_df, selected_cells, selected_speeds, selected_ilbetas

    # Cast to string and trim whitespace so the values match the checkbox labels (which are strings)
    merged_tracks_df['Cells'] = merged_tracks_df['Cells'].astype(str).str.strip()
    merged_tracks_df['Flow_speed'] = merged_tracks_df['Flow_speed'].astype(str).str.strip()
    merged_tracks_df['ILbeta'] = merged_tracks_df['ILbeta'].astype(str).str.strip()

    selected_cells = [box.description for box in cells_checkboxes if box.value]
    selected_speeds = [box.description for box in flow_speed_checkboxes if box.value]
    selected_ilbetas = [box.description for box in ilbeta_checkboxes if box.value]

    # Debugging output
    print("Selected Cells:", selected_cells)
    print("Selected Speeds:", selected_speeds)
    print("Selected ILbetas:", selected_ilbetas)
    print("Original DF length:", len(merged_tracks_df))

    filtered_df = merged_tracks_df[
        (merged_tracks_df['Cells'].isin(selected_cells)) &
        (merged_tracks_df['Flow_speed'].isin(selected_speeds)) &
        (merged_tracks_df['ILbeta'].isin(selected_ilbetas))
    ]

    # More debugging output
    print("Filtered DF length:", len(filtered_df))
    if len(filtered_df) == 0:
        print("No data matched the selected filters. Check filters and data for consistency.")
        print("Unique 'Cells' in DataFrame:", merged_tracks_df['Cells'].unique())
        print("Unique 'Flow_speed' in DataFrame:", merged_tracks_df['Flow_speed'].unique())
        print("Unique 'ILbeta' in DataFrame:", merged_tracks_df['ILbeta'].unique())

    print("Done")

# Now call the filter function or trigger the button to filter the dataframe and see the output.


# Button to trigger dataframe filtering
filter_button = widgets.Button(description="Filter Dataframe")
filter_button.on_click(filter_dataframe)

# Display checkboxes and button
display(widgets.VBox([
    widgets.Label('Select Cells:'),
    widgets.HBox(cells_checkboxes),
    widgets.Label('Select Flow Speed:'),
    widgets.HBox(flow_speed_checkboxes),
    widgets.Label('Select ILbeta:'),
    widgets.HBox(ilbeta_checkboxes),
    filter_button
]))


In [None]:
# @title ##Choose the track metrics to use

import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Create the "Umap" results folder if it does not exist yet
import os
os.makedirs(f"{Results_Folder}/Umap", exist_ok=True)


excluded_columns = ['experiment_nb', 'TRACK_INDEX', 'TRACK_ID',
                    'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION']

# Columns you want to always include
columns_to_include = ['File_name', 'Repeat', 'Condition', 'Unique_ID']

selected_df = pd.DataFrame()
nan_columns = []  # Names of columns that contain NaN values (filled in by the button handler)
# Keep a copy of the always-included columns that exist in the filtered dataframe
saved_columns = {col: filtered_df[col].copy() for col in columns_to_include if col in filtered_df}

# Filter out non-numeric columns
numeric_df = filtered_df.select_dtypes(include=['float64', 'int64'])  # Selecting only numeric columns

column_names = [col for col in numeric_df.columns if col not in excluded_columns]

# Create a checkbox for each column
checkboxes = [widgets.Checkbox(value=True, description=col, indent=False) for col in column_names]

# Arrange checkboxes in a 2x grid
grid = widgets.GridBox(checkboxes, layout=widgets.Layout(grid_template_columns="repeat(2, 300px)"))

# Create a button to trigger the selection
button = widgets.Button(description="Select the track parameters", layout=widgets.Layout(width='400px'))

# Define the button click event handler
def on_button_click(b):
    global selected_df  # Declare selected_df as global
    global nan_columns
    # Get the selected columns from the checkboxes
    selected_columns = [box.description for box in checkboxes if box.value]

    # Extract the selected columns from the DataFrame
    selected_df = numeric_df[selected_columns].copy()

    # Add back the always-included columns to selected_df
    for col, data in saved_columns.items():
        selected_df.loc[:, col] = data

    # Find columns that contain NaN values and drop the affected rows, reporting how many were removed
    nan_columns = selected_df.columns[selected_df.isna().any()].tolist()

    for col in nan_columns:
        initial_row_count = len(selected_df)
        selected_df = selected_df.dropna(subset=[col])  # Drop rows that have NaN in this column
        dropped_row_count = initial_row_count - len(selected_df)

        print(f"Dropped {dropped_row_count} rows with NaN values in column '{col}'.")


    print("Done")

# Set the button click event handler
button.on_click(on_button_click)

# Display the grid of checkboxes and the button
display(grid, button)



## **5.2. UMAP**
---

<font size = 4> The code below performs UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction on the filtered dataset, using the numeric track metrics selected in section 5.1, and visualizes the result. Three parameters, `n_neighbors`, `min_dist`, and `n_dimension`, determine the structure and appearance of the resulting low-dimensional representation.

<font size = 4>`n_neighbors`: This parameter controls how UMAP balances local versus global structure in the data. It determines the size of the local neighborhood UMAP will look at when learning the manifold structure of the data.
- A smaller value emphasizes the local structure of the data, potentially at the expense of the global structure.
- A larger value allows UMAP to consider more distant neighbors, emphasizing the global structure of the data.
- Typically, values in the range of 5 to 50 are chosen, depending on the density and scale of the data.

<font size = 4>`min_dist`: This parameter controls how tightly UMAP is allowed to pack points together. It determines the minimum distance between points in the low-dimensional representation.
- Setting it to a low value will allow points to be packed more closely, potentially revealing clusters in the data.
- A higher value ensures that points are more spread out in the representation.
- Values usually range between 0 and 1.

<font size = 4>`n_dimension`: This parameter sets the number of dimensions of the low-dimensional space that the data will be reduced to. It is passed to UMAP as `n_components`.
For visualization purposes, `n_dimension` is typically set to 2 or 3 to obtain 2D or 3D representations, respectively.
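
<font size = 4>The ranges mentioned above can be turned into a quick sanity check before running UMAP. The helper below is a hypothetical sketch: `check_umap_params` is not part of this notebook or of umap-learn, and its limits are rules of thumb taken from the text above rather than the library's own validation.

```python
def check_umap_params(n_neighbors, min_dist, n_dimension, n_samples):
    """Return a list of human-readable problems; an empty list means the values look usable."""
    problems = []
    # A neighborhood needs at least one neighbor besides the point itself,
    # and cannot be larger than the dataset.
    if n_neighbors < 2:
        problems.append("n_neighbors should be at least 2")
    elif n_neighbors >= n_samples:
        problems.append("n_neighbors should be smaller than the number of tracks")
    # Rule of thumb from the text above: min_dist usually lies between 0 and 1.
    if not 0.0 <= min_dist <= 1.0:
        problems.append("min_dist should lie between 0 and 1")
    # The plots in this notebook only handle 1D, 2D and 3D embeddings.
    if n_dimension not in (1, 2, 3):
        problems.append("n_dimension should be 1, 2 or 3 to match the plots in this notebook")
    return problems


# Example: the notebook defaults on a dataset of 500 tracks, then on only 20 tracks.
print(check_umap_params(n_neighbors=30, min_dist=0.1, n_dimension=2, n_samples=500))
print(check_umap_params(n_neighbors=30, min_dist=0.1, n_dimension=2, n_samples=20))
```

<font size = 4>With the default `n_neighbors = 30`, a heavily filtered dataset containing fewer than 31 tracks would be flagged, which is a hint to lower `n_neighbors` or relax the filters.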


In [None]:
# @title ##Perform UMAP
import umap
import plotly.offline as pyo


filename = create_filename(selected_cells, selected_speeds, selected_ilbetas)

# Check and create necessary directories
if not os.path.exists(f"{Results_Folder}/Umap/{filename}"):
    os.makedirs(f"{Results_Folder}/Umap/{filename}")

#@markdown ###UMAP parameters:

n_neighbors = 30  # @param {type: "number"}
min_dist = 0.1  # @param {type: "number"}
n_dimension = 2  # @param {type: "slider", min: 1, max: 3}

#@markdown ###Display parameters:
spot_size = 10 # @param {type: "number"}

# Initialize UMAP object with the specified settings
reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist, n_components=n_dimension, random_state=42)
# Exclude non-numeric columns when fitting UMAP
embedding = reducer.fit_transform(selected_df.drop(columns=columns_to_include))
# Create dynamic column names based on n_components
column_names = [f'UMAP dimension {i}' for i in range(1, n_dimension + 1)]

# Extract the columns_to_include from selected_df
included_data = selected_df[columns_to_include].reset_index(drop=True)

# Concatenate the UMAP embedding with the included columns
umap_df = pd.concat([pd.DataFrame(embedding, columns=column_names), included_data], axis=1)


# Check if the DataFrame has any NaN values and print a warning if it does.
nan_columns = umap_df.columns[umap_df.isna().any()].tolist()

if nan_columns:
  warnings.warn(f"The DataFrame contains NaN values in the following columns: {', '.join(nan_columns)}")
  for col in nan_columns:
    umap_df = umap_df.dropna(subset=[col])  # Drop NaN values only from columns containing them

# Visualize the UMAP projection
plt.figure(figsize=(12, 10))

# The plot will adjust automatically based on the n_components
if n_dimension == 2:
    sns.scatterplot(x=column_names[0], y=column_names[1], hue='Condition', data=umap_df, palette='Set2', s=spot_size)
    plt.title('UMAP Projection of the Dataset')
    plt.savefig(f"{Results_Folder}/Umap/{filename}/umap_projection_2D.pdf")  # Save 2D plot as PDF
    plt.show()
elif n_dimension == 1:
    sns.stripplot(x=column_names[0], hue='Condition', data=umap_df, palette='Set2', jitter=0.05, size=spot_size)
    plt.title('UMAP Projection of the Dataset')
    plt.savefig(f"{Results_Folder}/Umap/{filename}/umap_projection_1D.pdf")  # Save 1D plot as PDF
    plt.show()
else:
    # umap_df should have columns like 'UMAP dimension 1', 'UMAP dimension 2', 'UMAP dimension 3', and 'condition'
    import plotly.express as px
    import pandas as pd
    import numpy as np

    fig = px.scatter_3d(umap_df,
                    x='UMAP dimension 1',
                    y='UMAP dimension 2',
                    z='UMAP dimension 3',
                    color='Condition')

    for trace in fig.data:
      trace.marker.size = spot_size  # You can set this to any desired value

    fig.show()
    pyo.plot(fig, filename=f"{Results_Folder}/Umap/{filename}/umap_projection.html", auto_open=False)

## **5.3. HDBSCAN**
---

<font size = 4> The code below uses HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to identify clusters, either in the UMAP embedding computed above or in the raw track metrics. HDBSCAN does not need the number of clusters to be specified in advance, copes with clusters of varying density, and labels points that fit no cluster as noise.

<font size = 4>In the provided HDBSCAN code, the parameters `min_samples`, `min_cluster_size`, and `metric` are crucial for determining the structure and appearance of the resulting clusters in the data.

<font size = 4>`min_samples`: This parameter primarily controls the degree to which the algorithm is willing to declare noise. It's the number of samples in a neighborhood for a point to be considered as a core point.
- A smaller value of `min_samples` makes the algorithm more prone to declaring points as part of a cluster, potentially leading to larger clusters and fewer noise points.
- A larger value makes the algorithm more conservative, resulting in more points declared as noise and smaller, more defined clusters.
- The choice of `min_samples` typically depends on the density of the data; denser datasets may require a larger value.

<font size = 4>`min_cluster_size`: This parameter determines the smallest size grouping that you wish to consider a cluster.
- A smaller value will allow the formation of smaller clusters, whereas a larger value will prevent small isolated groups of points from being declared as clusters.
- The choice of `min_cluster_size` depends on the scale of the data and the desired level of granularity in the clustering.

<font size = 4>`metric`: This parameter is the metric used for distance computation between data points, and it affects the shape of the clusters.
- The `euclidean` metric is a good starting point, and depending on the clustering results and the data type, it might be beneficial to experiment with different metrics.
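
<font size = 4>When tuning `min_samples` and `min_cluster_size`, it can help to look at how many clusters were found and how many tracks were labeled as noise (HDBSCAN assigns the label `-1` to noise). The helper below is a minimal sketch; `summarize_clusters` and the example labels are hypothetical and not part of this notebook.

```python
import numpy as np


def summarize_clusters(labels):
    """Summarize cluster labels, treating -1 as noise."""
    labels = np.asarray(labels)
    is_noise = labels == -1
    # Count the members of every real cluster (noise excluded).
    cluster_ids, sizes = np.unique(labels[~is_noise], return_counts=True)
    return {
        "n_clusters": int(cluster_ids.size),
        "noise_fraction": float(is_noise.mean()) if labels.size else 0.0,
        "cluster_sizes": {int(c): int(s) for c, s in zip(cluster_ids, sizes)},
    }


# Made-up labels: three clusters and two noise points out of eight tracks.
example_labels = [0, 0, 0, 1, 1, -1, -1, 2]
summary = summarize_clusters(example_labels)
print(summary)
```

<font size = 4>After running the clustering cell, the same helper could be applied to `clusterer.labels_`. A very high noise fraction suggests lowering `min_samples`, while many tiny clusters suggest raising `min_cluster_size`.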


In [None]:
# @title ##Run to see more information about the available metrics
print("""
Metric                   Description                                                               Suitable For
-------------------------------------------------------------------------------------------------------------------------------------------------------
Euclidean                Standard distance metric.                                                 Numerical data.
Manhattan                Sum of absolute differences.                                              Numerical/Categorical data.
Chebyshev                Maximum value of absolute differences.                                    Numerical data.
Minkowski                Generalization of Euclidean and Manhattan distance.                       Numerical data.
Bray-Curtis              Dissimilarity between sample sets.                                        Numerical data.
Canberra                 Weighted version of Manhattan distance.                                   Numerical data.
Mahalanobis              Distance between a point and a distribution.                              Numerical data.

""")


In [None]:
# @title ##Identify clusters using HDBSCAN
import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.offline as pyo  # Needed to save the 3D plot as HTML
import pandas as pd
import numpy as np

#@markdown ###HDBSCAN parameters:
clustering_data_source = 'umap'  # @param ['umap', 'raw']
min_samples = 50  # @param {type: "number"}
min_cluster_size = 100  # @param {type: "number"}
metric = "euclidean"  # @param ['euclidean', 'manhattan', 'chebyshev', 'braycurtis', 'canberra']

#@markdown ###Display parameters:
spot_size = 10 # @param {type: "number"}
# Apply HDBSCAN
clusterer = hdbscan.HDBSCAN(min_samples=min_samples, min_cluster_size=min_cluster_size, metric=metric)  # You may need to tune these parameters

if clustering_data_source == 'umap':
  if n_dimension == 1:
    clusterer.fit(umap_df[['UMAP dimension 1']])  # Use only one UMAP dimension for clustering

  elif n_dimension == 2:
    clusterer.fit(umap_df[['UMAP dimension 1', 'UMAP dimension 2']])  # Use two UMAP dimensions for clustering

  elif n_dimension == 3:
    clusterer.fit(umap_df[['UMAP dimension 1', 'UMAP dimension 2', 'UMAP dimension 3']])  # Use three UMAP dimensions for clustering

else:
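  # Cluster on the selected track metrics only, not on identifier columns such as 'Repeat'
  clusterer.fit(selected_df.drop(columns=columns_to_include, errors='ignore').select_dtypes(include=['number']))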

# Add the cluster labels to your UMAP DataFrame
umap_df['Cluster'] = clusterer.labels_

# If a Cluster column already exists in filtered_df (from a previous run), drop it to avoid duplicates
if 'Cluster' in filtered_df.columns:
    filtered_df = filtered_df.drop(columns='Cluster')

# Merge the Cluster column from umap_df into filtered_df based on Unique_ID
filtered_df = pd.merge(filtered_df, umap_df[['Unique_ID', 'Cluster']], on='Unique_ID', how='left')

# Tracks without a cluster label (for example rows dropped because of NaN values) get -1, the label HDBSCAN uses for noise
filtered_df['Cluster'] = filtered_df['Cluster'].fillna(-1)

# Save the DataFrame with the identified clusters
filtered_df.to_csv(Results_Folder + '/Umap/'+filename+'/' + 'filtered_Tracks.csv', index=False)

# Plotting the results
if n_dimension == 1:
    plt.figure(figsize=(12, 6))
    sns.stripplot(data=umap_df, x='UMAP dimension 1', hue='Cluster', palette='viridis', s=spot_size)
    plt.title('Clusters Identified by HDBSCAN (1D)')
    plt.xlabel('UMAP dimension 1')
    plt.ylabel('Count')
    plt.savefig(f"{Results_Folder}/Umap/{filename}/HDBSCAN_clusters_1D.pdf")  # Save 1D strip plot as PDF
    plt.show()

if n_dimension == 2:

  plt.figure(figsize=(12,10))
  sns.scatterplot(x='UMAP dimension 1', y='UMAP dimension 2', hue='Cluster', palette='viridis', data=umap_df, s=spot_size)
  plt.title('Clusters Identified by HDBSCAN')
  plt.savefig(f"{Results_Folder}/Umap/{filename}/HDBSCAN_clusters_2D.pdf")  # Save 2D plot as PDF
  plt.show()

if n_dimension == 3:

  fig = px.scatter_3d(umap_df,
                    x='UMAP dimension 1',
                    y='UMAP dimension 2',
                    z='UMAP dimension 3',
                    color='Cluster')

  for trace in fig.data:
    trace.marker.size = spot_size

  fig.show()
  pyo.plot(fig, filename=f"{Results_Folder}/Umap/{filename}/HDBSCAN_clusters.html", auto_open=False)

## **5.4. Understand your clusters using box plots**

<font size = 4>The code below plots the distribution of each selected track parameter across the identified clusters. For every parameter, a box plot overlaid with the individual tracks shows how its values spread within each cluster and differ between clusters. The plotted data are also saved as CSV files.
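
<font size = 4>The numbers behind a box plot can also be inspected directly. The sketch below computes the quartiles and interquartile range per cluster on a made-up table; the column name and values are assumptions for illustration only.

```python
import pandas as pd

# Toy table with two clusters; values are invented for illustration.
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 1, 1, 1, 1],
    "TRACK_MEAN_SPEED": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# Quartiles per cluster: these are the box edges and the median line of a box plot.
quartiles = (
    toy.groupby("Cluster")["TRACK_MEAN_SPEED"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q25", "median", "q75"]

# Interquartile range: the height of the box.
quartiles["iqr"] = quartiles["q75"] - quartiles["q25"]
print(quartiles)
```

<font size = 4>Clusters whose boxes barely overlap for a given parameter (here, cluster 0 versus cluster 1) are well separated along that parameter.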




In [None]:
import os
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.gridspec import GridSpec
import pandas as pd
import ipywidgets as widgets

# @title ##Plot track parameters based on clusters

# Create the "Track_parameters" output folder if it does not exist yet
if not os.path.exists(f"{Results_Folder}/Umap/{filename}/Track_parameters"):
    os.makedirs(f"{Results_Folder}/Umap/{filename}/Track_parameters")

def get_selectable_columns(df):
    # Exclude certain columns from being plotted
    exclude_cols = ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION', 'Exemplar']
    return [col for col in df.columns if col not in exclude_cols]

def display_variable_checkboxes(selectable_columns):
    # Create checkboxes for selectable columns
    variable_checkboxes = [widgets.Checkbox(value=False, description=col) for col in selectable_columns]

    # Display checkboxes in the notebook
    display(widgets.VBox([
        widgets.Label('Variables to Plot:'),
        widgets.GridBox(variable_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 300px)" % 3)),
    ]))
    return variable_checkboxes

def plot_selected_vars(button, variable_checkboxes, df, Results_Folder):
    print("Plotting in progress...")

    # Get selected variables
    variables_to_plot = [box.description for box in variable_checkboxes if box.value]
    n_plots = len(variables_to_plot)

    if n_plots == 0:
        print("No variables selected for plotting")
        return

    for var in variables_to_plot:
        # Extract data for the specific variable and cluster
        data_to_save = df[['Cluster', var]]

        # Save data for the plot to CSV
        data_to_save.to_csv(f"{Results_Folder}/Umap/{filename}/Track_parameters/{var}_data_by_Cluster.csv", index=False)

        plt.figure(figsize=(16, 10))

        # Plotting
        sns.boxplot(x='Cluster', y=var, data=df, color='lightgray')  # Boxplot by cluster
        sns.stripplot(x='Cluster', y=var, data=df, jitter=True, alpha=0.2)  # Individual data points

        plt.title(f"{var} by Cluster")
        plt.xlabel('Cluster')
        plt.ylabel(var)
        plt.xticks(rotation=90)
        plt.tight_layout()

        # Save the plot
        plt.savefig(f"{Results_Folder}/Umap/{filename}/Track_parameters/{var}_Boxplots_by_Cluster.pdf")
        plt.show()

selectable_columns = get_selectable_columns(filtered_df)
variable_checkboxes = display_variable_checkboxes(selectable_columns)

# Create and display the plot button
button = widgets.Button(description="Plot Selected Variables", layout=widgets.Layout(width='400px'))
button.on_click(lambda b: plot_selected_vars(b, variable_checkboxes, filtered_df, Results_Folder))
display(button)


## **5.5. Understand your clusters using heatmaps**

<font size = 4>This section helps visualize how different track parameters vary across the identified clusters. The variations are displayed as a heatmap, a color-coded representation of the median value of each parameter in each cluster, normalized per parameter using a Z-score. This makes it easier to spot differences or patterns among the clusters.
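
<font size = 4>The normalization can be written as z = (m - mean) / std, where m is the median of a parameter in one cluster, and the mean and standard deviation are taken over the cluster medians of that same parameter. The sketch below reproduces this with plain pandas on a made-up table. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`; the column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy table with three clusters; values are invented for illustration.
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2],
    "TRACK_MEAN_SPEED": [1.0, 3.0, 4.0, 6.0, 7.0, 9.0],
    "TRACK_DURATION": [100.0, 300.0, 150.0, 250.0, 300.0, 500.0],
})

# Median per cluster, transposed so that rows are parameters and columns are clusters.
medians = toy.groupby("Cluster").median().T

# Row-wise Z-score: each parameter is centered and scaled across clusters.
row_mean = medians.mean(axis=1)
row_std = medians.std(axis=1, ddof=0)
z = medians.sub(row_mean, axis=0).div(row_std, axis=0)
print(z.round(3))
```

<font size = 4>Because each row is normalized separately, parameters with very different units become comparable in one color scale. A parameter whose median is identical in every cluster has a standard deviation of zero and yields NaN values.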


In [None]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import pandas as pd
from scipy.stats import zscore

# @title ##Plot normalized track parameters by cluster as a heatmap

# Create the "Track_parameters" output folder if it does not exist yet
if not os.path.exists(f"{Results_Folder}/Umap/{filename}/Track_parameters"):
    os.makedirs(f"{Results_Folder}/Umap/{filename}/Track_parameters")

def get_selectable_columns(df):
    # Exclude certain columns from being plotted
    exclude_cols = ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION', 'Exemplar']
    return [col for col in df.columns if col not in exclude_cols]

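def heatmap_comparison(df, Results_Folder):
    # Keep only numeric track parameters; the grouping column itself is not a variable to plot
    variables_to_plot = [col for col in get_selectable_columns(df)
                         if col != 'Cluster' and pd.api.types.is_numeric_dtype(df[col])]

    # Compute median for each variable across clusters
    median_values = df.groupby('Cluster')[variables_to_plot].median().transpose()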

    # Normalize the median values using Z-score
    normalized_values = median_values.apply(zscore, axis=1)

    # Plot the heatmap
    plt.figure(figsize=(16, 10))
    sns.heatmap(normalized_values, cmap='coolwarm', annot=True, linewidths=.5)
    plt.title("Z-score Normalized Median Values of Variables by Cluster")
    plt.tight_layout()

    # Save the heatmap
    plt.savefig(f"{Results_Folder}/Umap/{filename}/Track_parameters/Heatmap_Normalized_Median_Values_by_Cluster.pdf")
    plt.show()

    # Save the normalized median values data to CSV
    normalized_values.to_csv(f"{Results_Folder}/Umap/{filename}/Track_parameters/Normalized_Median_Values_by_Cluster.csv")

# Plot the heatmap directly
heatmap_comparison(filtered_df, Results_Folder)


## **5.6. Fingerprint**
---

<font size = 4>This section plots, for each condition, the percentage of tracks that fall into each cluster. The resulting stacked bar chart is the cluster 'fingerprint' of that condition.
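
<font size = 4>The percentages can be illustrated with a compact sketch. It uses `pd.crosstab` as a shortcut instead of the group, merge and pivot steps used in the cell below, and the condition names and cluster labels are made up.

```python
import pandas as pd

# Toy table: one row per track, with its condition and cluster label (-1 means noise).
toy = pd.DataFrame({
    "Condition": ["ctrl", "ctrl", "ctrl", "ctrl", "treated", "treated"],
    "Cluster": [0, 0, 1, -1, 1, 1],
})

# Rows are conditions, columns are clusters, values are percentages of tracks.
# normalize="index" makes every row sum to 1 before scaling to percent.
fingerprint = pd.crosstab(toy["Condition"], toy["Cluster"], normalize="index") * 100
print(fingerprint)
```

<font size = 4>Each row sums to 100, and condition-cluster combinations that never occur are filled with 0.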

In [None]:
# @title ##Plot the 'fingerprint' of each cluster per condition

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Group by 'Condition' and 'Cluster' and calculate the size of each group
cluster_counts = umap_df.groupby(['Condition', 'Cluster']).size().reset_index(name='counts')

# Calculate the total number of points per condition
total_counts = umap_df.groupby('Condition').size().reset_index(name='total_counts')

# Merge the DataFrames on 'Condition' to calculate percentages
percentage_df = pd.merge(cluster_counts, total_counts, on='Condition')
percentage_df['percentage'] = (percentage_df['counts'] / percentage_df['total_counts']) * 100

# Save the percentage_df DataFrame as a CSV file
percentage_df.to_csv(Results_Folder+'/Umap/'+filename+'/UMAP_percentage_results.csv', index=False)

# Pivot the percentage_df to have Conditions as index, Clusters as columns, and percentages as values
pivot_df = percentage_df.pivot(index='Condition', columns='Cluster', values='percentage')

# Fill NaN values with 0 if any, as there might be some Condition-Cluster combinations that are not present
pivot_df.fillna(0, inplace=True)

# Initialize PDF
pdf_pages = PdfPages(Results_Folder+'/Umap/'+filename+'/UMAP_Cluster_Fingerprint_Plot.pdf')

# Plotting
fig, ax = plt.subplots(figsize=(10, 7))
pivot_df.plot(kind='bar', stacked=True, ax=ax, colormap='Set2')
plt.title('Percentage in each cluster per Condition')
plt.ylabel('Percentage')
plt.xlabel('Condition')
plt.xticks(rotation=90)
plt.tight_layout()

# Save the figure to a PDF
pdf_pages.savefig(fig)

# Close the PDF
pdf_pages.close()

# Display the plot
plt.show()



--------
# **Part 6. Explore your high-dimensional data using t-SNE and HDBSCAN**
--------

In [None]:
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

# @title ##Filter the data


# Global variables to store the selected options
global filtered_df
filtered_df = pd.DataFrame()

global selected_cells, selected_speeds, selected_ilbetas
selected_cells, selected_speeds, selected_ilbetas = [], [], []

# Function to summarize selected options into a string
def summarize_options(options):
    # Skip any 'falsy' values such as empty strings or None
    return "_".join(str(option) for option in options if option)

# Function to create a filename based on selected options
def create_filename(selected_cells, selected_speeds, selected_ilbetas):
    # Summarize each parameter and skip empty groups so the name has no stray underscores
    parts = [summarize_options(group) for group in (selected_cells, selected_speeds, selected_ilbetas)]
    filename = "_".join(part for part in parts if part)

    # Replace spaces with underscores and return the filename
    return filename.replace(" ", "_")

# Create checkboxes for each category
cells_checkboxes = [widgets.Checkbox(value=False, description=str(cell)) for cell in merged_tracks_df['Cells'].unique()]
flow_speed_checkboxes = [widgets.Checkbox(value=False, description=str(speed)) for speed in merged_tracks_df['Flow_speed'].unique()]
ilbeta_checkboxes = [widgets.Checkbox(value=False, description=str(ilbeta)) for ilbeta in merged_tracks_df['ILbeta'].unique()]

# Function to filter dataframe and update global variables based on selected checkbox values
def filter_dataframe(button):
    global filtered_df, selected_cells, selected_speeds, selected_ilbetas

    # Compare as stripped strings: checkbox descriptions are built with str(),
    # so this also works if a column holds numbers rather than text
    cells_col = merged_tracks_df['Cells'].astype(str).str.strip()
    speeds_col = merged_tracks_df['Flow_speed'].astype(str).str.strip()
    ilbetas_col = merged_tracks_df['ILbeta'].astype(str).str.strip()

    selected_cells = [box.description.strip() for box in cells_checkboxes if box.value]
    selected_speeds = [box.description.strip() for box in flow_speed_checkboxes if box.value]
    selected_ilbetas = [box.description.strip() for box in ilbeta_checkboxes if box.value]

    # Debugging output
    print("Selected Cells:", selected_cells)
    print("Selected Speeds:", selected_speeds)
    print("Selected ILbetas:", selected_ilbetas)
    print("Original DF length:", len(merged_tracks_df))

    # Keep rows that match a selected value in every category; copy to avoid modifying a view later
    filtered_df = merged_tracks_df[
        cells_col.isin(selected_cells) &
        speeds_col.isin(selected_speeds) &
        ilbetas_col.isin(selected_ilbetas)
    ].copy()

    # More debugging output
    print("Filtered DF length:", len(filtered_df))
    if len(filtered_df) == 0:
        print("No data matched the selected filters. Check filters and data for consistency.")
        print("Unique 'Cells' in DataFrame:", merged_tracks_df['Cells'].unique())
        print("Unique 'Flow_speed' in DataFrame:", merged_tracks_df['Flow_speed'].unique())
        print("Unique 'ILbeta' in DataFrame:", merged_tracks_df['ILbeta'].unique())

    print("Done")

# Now call the filter function or trigger the button to filter the dataframe and see the output.


# Button to trigger dataframe filtering
filter_button = widgets.Button(description="Filter Dataframe")
filter_button.on_click(filter_dataframe)

# Display checkboxes and button
display(widgets.VBox([
    widgets.Label('Select Cells:'),
    widgets.HBox(cells_checkboxes),
    widgets.Label('Select Flow Speed:'),
    widgets.HBox(flow_speed_checkboxes),
    widgets.Label('Select ILbeta:'),
    widgets.HBox(ilbeta_checkboxes),
    filter_button
]))
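
<font size = 4>The checkboxes above are labelled with `str(...)` of each unique value, so every selection is a list of strings. The sketch below uses plain Python and made-up values (not taken from the dataset) to illustrate why values are best compared as stripped strings, assuming a column such as `Flow_speed` were stored as numbers:

```python
# Hypothetical values standing in for a numeric 'Flow_speed' column
flow_speed_values = [300, 600, 300, 900]

# Checkbox descriptions are created with str(), so the selection holds strings
selected_speeds = ["300", "900"]

# Comparing raw numbers with strings matches nothing
naive_mask = [value in selected_speeds for value in flow_speed_values]

# Converting each value to a stripped string first gives the intended matches
string_mask = [str(value).strip() in selected_speeds for value in flow_speed_values]

print(naive_mask)   # [False, False, False, False]
print(string_mask)  # [True, False, True, True]
```

<font size = 4>If the filter returns zero rows, the unique values printed by the cell above are a good place to look for this kind of type or whitespace mismatch.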


In [None]:
# @title ##Choose the track metrics to use

import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Create the "Umap" results folder if it does not exist yet
os.makedirs(f"{Results_Folder}/Umap", exist_ok=True)


excluded_columns = ['experiment_nb', 'TRACK_INDEX', 'TRACK_ID',
                    'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION']

# Columns you want to always include
columns_to_include = ['File_name', 'Repeat', 'Condition', 'Unique_ID']

selected_df = pd.DataFrame()
nan_columns = pd.DataFrame()
# Extract the columns you always want to include and ensure they exist in the original dataframe
saved_columns = {col: filtered_df[col].copy() for col in columns_to_include if col in filtered_df}

# Filter out non-numeric columns
numeric_df = filtered_df.select_dtypes(include=['float64', 'int64'])  # Selecting only numeric columns

column_names = [col for col in numeric_df.columns if col not in excluded_columns]

# Create a checkbox for each column
checkboxes = [widgets.Checkbox(value=True, description=col, indent=False) for col in column_names]

# Arrange checkboxes in a 2x grid
grid = widgets.GridBox(checkboxes, layout=widgets.Layout(grid_template_columns="repeat(2, 300px)"))

# Create a button to trigger the selection
button = widgets.Button(description="Select the track parameters", layout=widgets.Layout(width='400px'))

# Define the button click event handler
def on_button_click(b):
    global selected_df  # Declare selected_df as global
    global nan_columns
    # Get the selected columns from the checkboxes
    selected_columns = [box.description for box in checkboxes if box.value]

    # Extract the selected columns from the DataFrame
    selected_df = numeric_df[selected_columns].copy()

    # Add back the always-included columns to selected_df
    for col, data in saved_columns.items():
        selected_df.loc[:, col] = data

    # Find the columns that contain NaN values
    nan_columns = selected_df.columns[selected_df.isna().any()].tolist()

    # Drop the affected rows column by column and report how many rows each column removed
    for col in nan_columns:
        initial_row_count = len(selected_df)
        selected_df = selected_df.dropna(subset=[col])
        dropped_row_count = initial_row_count - len(selected_df)

        print(f"Dropped {dropped_row_count} rows due to NaN values in column '{col}'.")

    print("Done")

# Set the button click event handler
button.on_click(on_button_click)

# Display the grid of checkboxes and the button
display(grid, button)



In [None]:
# @title ##Perform t-SNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
import pandas as pd

filename = create_filename(selected_cells, selected_speeds, selected_ilbetas)

# Check and create necessary directories
tsne_folder_path = f"{Results_Folder}/Tsne/{filename}"
if not os.path.exists(tsne_folder_path):
    os.makedirs(tsne_folder_path)

#@markdown ###t-SNE parameters:

perplexity = 50  # @param {type: "number"}
learning_rate = 200  # @param {type: "number"}
n_iter = 1000  # @param {type: "number"}
n_dimension = 2  # The number of dimensions is set to 2 for t-SNE as standard practice

#@markdown ###Display parameters:
spot_size = 10  # @param {type: "number"}

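# Initialize t-SNE object with the specified settings.
# Recent scikit-learn releases rename `n_iter` to `max_iter`, so use whichever name this version accepts.
import inspect
iter_kwarg = 'max_iter' if 'max_iter' in inspect.signature(TSNE.__init__).parameters else 'n_iter'
tsne = TSNE(n_components=n_dimension, perplexity=perplexity, learning_rate=learning_rate,
            random_state=42, **{iter_kwarg: n_iter})

# Exclude non-numeric columns when fitting t-SNE
numeric_columns = selected_df.select_dtypes(include='number')
embedding = tsne.fit_transform(numeric_columns)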

# Create dynamic column names based on n_components
column_names = [f't-SNE dimension {i+1}' for i in range(n_dimension)]

# Extract the columns_to_include from selected_df
included_data = selected_df[columns_to_include].reset_index(drop=True)

# Concatenate the t-SNE embedding with the included columns
tsne_df = pd.concat([pd.DataFrame(embedding, columns=column_names), included_data], axis=1)

# Warn if the DataFrame has any NaN values, then drop the affected rows
nan_columns = tsne_df.columns[tsne_df.isna().any()].tolist()
if nan_columns:
    warnings.warn(f"The DataFrame contains NaN values in the following columns: {', '.join(nan_columns)}")
    tsne_df.dropna(subset=nan_columns, inplace=True)  # Drop NaN values only from columns containing them

# Visualize the t-SNE projection
plt.figure(figsize=(12, 10))
sns.scatterplot(x=column_names[0], y=column_names[1], hue='Condition', data=tsne_df, palette='Set2', s=spot_size)
plt.title('t-SNE Projection of the Dataset')
tsne_output_path = os.path.join(tsne_folder_path, 'tsne_projection_2D.pdf')
plt.savefig(tsne_output_path)  # Save 2D plot as PDF
plt.show()
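
<font size = 4>scikit-learn expects the t-SNE `perplexity` to be smaller than the number of samples, which can become an issue when the filters above leave only a few tracks. The helper below is a hypothetical sketch (`clamp_perplexity` is not part of this notebook or of scikit-learn) showing one way to keep the requested value within range before building the `TSNE` object:

```python
def clamp_perplexity(requested, n_samples):
    """Return a perplexity strictly below n_samples (hypothetical helper)."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two tracks")
    # Keep the requested value when it is valid, otherwise fall back to n_samples - 1
    return min(float(requested), float(n_samples - 1))

print(clamp_perplexity(50, 1000))  # 50.0
print(clamp_perplexity(50, 30))    # 29.0
```

<font size = 4>In the cell above, this could be called as `clamp_perplexity(perplexity, len(selected_df))`. A value this close to the sample count is only a safeguard against errors; lowering the perplexity manually is usually the better choice for small subsets.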


In [None]:
# @title ##Identify clusters using HDBSCAN
import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

#@markdown ###HDBSCAN parameters:
clustering_data_source = 'tsne'  # @param ['tsne', 'raw']
min_samples = 50  # @param {type: "number"}
min_cluster_size = 100  # @param {type: "number"}
metric = "euclidean"  # @param ['euclidean', 'manhattan', 'chebyshev', 'braycurtis', 'canberra']

#@markdown ###Display parameters:
spot_size = 10 # @param {type: "number"}

# Apply HDBSCAN
clusterer = hdbscan.HDBSCAN(min_samples=min_samples, min_cluster_size=min_cluster_size, metric=metric)

# Depending on the data source, we fit HDBSCAN to the t-SNE dimensions or the raw data
if clustering_data_source == 'tsne':
    # We only have two t-SNE dimensions based on the previous t-SNE code provided
    clusterer.fit(tsne_df[['t-SNE dimension 1', 't-SNE dimension 2']])
else:
    # If raw data is selected, we use all the numerical columns for clustering
    clusterer.fit(selected_df.select_dtypes(include=['number']))

# Add the cluster labels to your t-SNE DataFrame
tsne_df['Cluster'] = clusterer.labels_

# If the Cluster column already exists in filtered_df, drop it to avoid duplications
if 'Cluster' in filtered_df.columns:
    filtered_df.drop(columns='Cluster', inplace=True)

# Merge the Cluster column from tsne_df to filtered_df based on Unique_ID
filtered_df = pd.merge(filtered_df, tsne_df[['Unique_ID', 'Cluster']], on='Unique_ID', how='left')

# Handle cases where some rows in filtered_df might not have a corresponding cluster label.
# Assign -1 (the label HDBSCAN also uses for noise points) to tracks without a cluster label.
filtered_df['Cluster'] = filtered_df['Cluster'].fillna(-1)

# Save the DataFrame with the identified clusters
filtered_df.to_csv(os.path.join(Results_Folder, 'Tsne', filename, 'filtered_Tracks.csv'), index=False)

# Plotting the results
plt.figure(figsize=(12, 10))
sns.scatterplot(x='t-SNE dimension 1', y='t-SNE dimension 2', hue='Cluster', palette='viridis', data=tsne_df, s=spot_size)
plt.title('Clusters Identified by HDBSCAN')
plt.savefig(os.path.join(Results_Folder, 'Tsne', filename, 'HDBSCAN_clusters_2D.pdf'))  # Save 2D plot as PDF
plt.show()
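
<font size = 4>The cell above copies cluster labels back onto the filtered tracks by `Unique_ID` and fills any missing label with -1. The plain-Python sketch below, using made-up track IDs, shows the consequence: -1 ends up covering both HDBSCAN noise points and tracks that never reached the embedding (for example, rows dropped because of NaN values):

```python
# Hypothetical cluster labels keyed by Unique_ID; -1 is the HDBSCAN noise label
labels_by_id = {"track_1": 0, "track_2": 1, "track_3": -1}

# Hypothetical list of all filtered tracks; track_4 was never embedded
all_track_ids = ["track_1", "track_2", "track_3", "track_4"]

# Equivalent of a left merge followed by fillna(-1)
clusters = {track_id: labels_by_id.get(track_id, -1) for track_id in all_track_ids}

print(clusters)
# {'track_1': 0, 'track_2': 1, 'track_3': -1, 'track_4': -1}
```

<font size = 4>Keep this in mind when interpreting the -1 group in the plots that follow.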


In [None]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import ipywidgets as widgets

# @title ##Plot track parameters based on clusters

# Define paths for Tsne
tsne_track_parameters_path = os.path.join(Results_Folder, 'Tsne', filename, 'Track_parameters')
os.makedirs(tsne_track_parameters_path, exist_ok=True)

def get_selectable_columns(df):
    # Exclude identifier/metadata columns and the cluster label itself; keep only numeric columns
    exclude_cols = ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION', 'Exemplar', 'Cluster']
    numeric_cols = df.select_dtypes(include='number').columns
    return [col for col in numeric_cols if col not in exclude_cols]

def display_variable_checkboxes(selectable_columns):
    # Create checkboxes for selectable columns
    variable_checkboxes = [widgets.Checkbox(value=False, description=col) for col in selectable_columns]

    # Display checkboxes in the notebook
    display(widgets.VBox([
        widgets.Label('Variables to Plot:'),
        widgets.GridBox(variable_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(3, 300px)")),
    ]))
    return variable_checkboxes

def plot_selected_vars(button, variable_checkboxes, df, Results_Folder, filename):
    print("Plotting in progress...")

    # Get selected variables
    variables_to_plot = [box.description for box in variable_checkboxes if box.value]
    n_plots = len(variables_to_plot)

    if n_plots == 0:
        print("No variables selected for plotting")
        return

    for var in variables_to_plot:
        # Extract data for the specific variable and cluster
        data_to_save = df[['Cluster', var]]

        # Save data for the plot to CSV
        data_to_save.to_csv(os.path.join(tsne_track_parameters_path, f"{var}_data_by_Cluster.csv"), index=False)

        plt.figure(figsize=(16, 10))

        # Plotting
        sns.boxplot(x='Cluster', y=var, data=df, color='lightgray')  # Boxplot by cluster
        sns.stripplot(x='Cluster', y=var, data=df, jitter=True, alpha=0.2)  # Individual data points

        plt.title(f"{var} by Cluster")
        plt.xlabel('Cluster')
        plt.ylabel(var)
        plt.xticks(rotation=90)
        plt.tight_layout()

        # Save the plot to PDF
        plt.savefig(os.path.join(tsne_track_parameters_path, f"{var}_Boxplots_by_Cluster.pdf"))
        plt.show()

selectable_columns = get_selectable_columns(filtered_df)
variable_checkboxes = display_variable_checkboxes(selectable_columns)

# Create and display the plot button
button = widgets.Button(description="Plot Selected Variables", layout=widgets.Layout(width='400px'))
button.on_click(lambda b: plot_selected_vars(b, variable_checkboxes, filtered_df, Results_Folder, filename))
display(button)


In [None]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore
import pandas as pd

# @title ##Plot normalized track parameters based on clusters as a heatmap

# Create "Tsne/Track_parameters" directory if it doesn't exist
tsne_track_parameters_path = os.path.join(Results_Folder, 'Tsne', filename, 'Track_parameters')
os.makedirs(tsne_track_parameters_path, exist_ok=True)

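def get_selectable_columns(df):
    # Exclude identifier/metadata columns and the cluster label itself; keep only numeric columns,
    # since the median cannot be computed on text columns such as 'Cells'
    exclude_cols = ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION', 'Exemplar', 'Cluster']
    numeric_cols = df.select_dtypes(include='number').columns
    return [col for col in numeric_cols if col not in exclude_cols]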

def heatmap_comparison(df, Results_Folder, filename):
    # Get all the selectable columns
    variables_to_plot = get_selectable_columns(df)

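    # Compute the median of each variable per cluster (rows: variables, columns: clusters)
    median_values = df.groupby('Cluster')[variables_to_plot].median().transpose()

    # Normalize each variable (row) across clusters using the Z-score.
    # Wrapping the result in a Series keeps the cluster labels as column names.
    normalized_values = median_values.apply(
        lambda row: pd.Series(zscore(row.to_numpy(dtype=float)), index=row.index), axis=1)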

    # Plot the heatmap
    plt.figure(figsize=(16, 10))
    sns.heatmap(normalized_values, cmap='coolwarm', annot=True, linewidths=.5)
    plt.title("Z-score Normalized Median Values of Variables by Cluster")
    plt.tight_layout()

    # Save the heatmap to PDF
    heatmap_pdf_path = os.path.join(tsne_track_parameters_path, 'Heatmap_Normalized_Median_Values_by_Cluster.pdf')
    plt.savefig(heatmap_pdf_path)
    plt.show()

    # Save the normalized median values data to CSV
    normalized_values_csv_path = os.path.join(tsne_track_parameters_path, 'Normalized_Median_Values_by_Cluster.csv')
    normalized_values.to_csv(normalized_values_csv_path)

# Plot the heatmap directly
heatmap_comparison(filtered_df, Results_Folder, filename)
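
<font size = 4>Each row of the heatmap is the Z-score of one variable's per-cluster medians, so colours compare clusters within a variable rather than across variables. The sketch below reproduces the arithmetic in plain Python on made-up medians; it assumes the population standard deviation (`ddof=0`), which is the default of `scipy.stats.zscore`:

```python
from statistics import mean, pstdev

# Hypothetical per-cluster medians: one row per variable, one value per cluster
median_values = {
    "variable_a": [1.0, 2.0, 3.0],
    "variable_b": [10.0, 10.0, 40.0],
}

def row_zscore(values):
    """Z-score a row using the population standard deviation (ddof=0)."""
    centre = mean(values)
    spread = pstdev(values)
    return [(value - centre) / spread for value in values]

normalized = {name: row_zscore(values) for name, values in median_values.items()}
for name, scores in normalized.items():
    print(name, [round(score, 2) for score in scores])
# variable_a [-1.22, 0.0, 1.22]
# variable_b [-0.71, -0.71, 1.41]
```

<font size = 4>Note that a variable whose median is identical in every cluster has zero spread, so its Z-score is undefined; this also applies when only a single cluster is found.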


In [None]:
# @title ##Plot the 'fingerprint' of each cluster per condition

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Group by 'Condition' and 'Cluster' and calculate the size of each group
cluster_counts = tsne_df.groupby(['Condition', 'Cluster']).size().reset_index(name='counts')

# Calculate the total number of points per condition
total_counts = tsne_df.groupby('Condition').size().reset_index(name='total_counts')

# Merge the DataFrames on 'Condition' to calculate percentages
percentage_df = pd.merge(cluster_counts, total_counts, on='Condition')
percentage_df['percentage'] = (percentage_df['counts'] / percentage_df['total_counts']) * 100

# Save the percentage_df DataFrame as a CSV file
percentage_df.to_csv(os.path.join(Results_Folder, 'Tsne', filename, 'TSNE_percentage_results.csv'), index=False)

# Pivot the percentage_df to have Conditions as index, Clusters as columns, and percentages as values
pivot_df = percentage_df.pivot(index='Condition', columns='Cluster', values='percentage')

# Fill NaN values with 0 if any, as there might be some Condition-Cluster combinations that are not present
pivot_df.fillna(0, inplace=True)

# Initialize PDF
pdf_path = os.path.join(Results_Folder, 'Tsne', filename, 'TSNE_Cluster_Fingerprint_Plot.pdf')
pdf_pages = PdfPages(pdf_path)

# Plotting
fig, ax = plt.subplots(figsize=(10, 7))
pivot_df.plot(kind='bar', stacked=True, ax=ax, colormap='Set2')
plt.title('Percentage in each cluster per Condition')
plt.ylabel('Percentage')
plt.xlabel('Condition')
plt.xticks(rotation=90)
plt.tight_layout()

# Save the figure to a PDF
pdf_pages.savefig(fig)

# Close the PDF
pdf_pages.close()

# Display the plot
plt.show()
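
<font size = 4>The fingerprint is the share of each condition's tracks that falls into each cluster, so every bar sums to 100%. The sketch below shows the same calculation in plain Python on made-up (Condition, Cluster) pairs; the condition names are placeholders, not values from the dataset:

```python
from collections import Counter

# Hypothetical (Condition, Cluster) pairs, one per track
tracks = [("control", 0), ("control", 0), ("control", 1), ("treated", 1), ("treated", -1)]

# Count tracks per (condition, cluster) pair and per condition
pair_counts = Counter(tracks)
condition_totals = Counter(condition for condition, _ in tracks)

# Percentage of each condition's tracks found in each cluster
percentages = {
    pair: 100 * count / condition_totals[pair[0]]
    for pair, count in pair_counts.items()
}

for (condition, cluster), pct in sorted(percentages.items()):
    print(f"{condition} cluster {cluster}: {pct:.1f}%")
# control cluster 0: 66.7%
# control cluster 1: 33.3%
# treated cluster -1: 50.0%
# treated cluster 1: 50.0%
```

<font size = 4>Because the percentages are computed per condition, conditions with very different track counts remain directly comparable in the stacked bar plot.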


## **Part 7. Version log**
---
<font size = 4>While we strive to provide accurate and helpful information, please be aware that:
  - This notebook may contain bugs.
  - Features are currently limited and will be expanded in future releases.

<font size = 4>We encourage users to report any issues or suggestions for improvement. Please check the [repository](https://github.com/guijacquemet/CellTracksColab) regularly for updates and the latest version of this notebook.

#### **Known Issues**:
- Tracks are displayed in 2D in section 1.4

<font size = 4>**Version 0.6**
  - Improved organisation of the results
  - Track visualisations are now saved

<font size = 4>**Version 0.5**
  - Improved part 5
  - Added the possibility to find exemplars in the raw movies when available
  - Added the possibility to export videos with the exemplars labeled
  - Code improved to handle larger datasets (tested with over 50k tracks)
  - Test dataset now contains raw video and is hosted on Zenodo
  - Results are now organised in folders
  - Added progress bars
  - Minor code fixes

<font size = 4>**Version 0.4**
  - Added the possibility to filter and smooth tracks
  - Added spatial and temporal calibration
  - Streamlined the notebook
  - Multiple bug fixes
  - Removed t-SNE
  - Improved documentation

<font size = 4>**Version 0.3**
  - Fixed a nasty bug in the import functions
  - Added basic exemplars for UMAP
  - Added the statistical analyses and their explanations
  - Added a new quality control part that helps assess the similarity of results between FOV, conditions and repeats
  - Improved part 5 (previously part 4).

<font size = 4>**Version 0.2**
  - Added support for 3D tracks
  - New documentation and metrics added.

<font size = 4>**Version 0.1**
  - First release of this notebook

---