# **CellTracksColab - TrackMate**
---

<font size = 4>Colab Notebook for Analyzing Migration Tracks generated by [TrackMate](https://imagej.net/plugins/trackmate/)

<font size = 4>This notebook can digest both .csv and .xml files






# **Before getting started**
---

<font size = 5>**Important notes**

---
## Data Requirements for Analysis

<font size = 4>Be advised of one significant limitation inherent to this notebook.

<font size = 4 color="red">**This notebook does not support Track splitting**</font>. <font size = 4>For users aiming to compute additional track metrics within this environment, it is crucial to disable track splitting in TrackMate.

<font size = 4>It’s important to clarify that the absence of track splitting support does not hinder the notebook's ability to compile and display results in part 3 of the analysis process.

---

## Folder Hierarchy
<font size = 4> This notebook is compatible with the output of TrackMate, in either CSV or XML format.

<font size = 4>To load your TrackMate outputs, your dataset should be organized into a two-tiered folder hierarchy as depicted below. When the files are CSV, we also expect the name of each track file to end with `-tracks` and the name of each spot table to end with `-spots`.

<font size = 4>Here's a common data structure that can work:

- 📁 **Experiments** `[Folder_path]`
  - 🌿 **Condition_1** `[‘condition’ is derived from this folder name]`
    - 🔄 **R1** `[‘repeat’ is derived from this folder name]`
      - 📄 `FOV1-spots.csv`
      - 📄 `FOV1-tracks.csv`
      - 📄 `FOV2-spots.csv`
      - 📄 `FOV2-tracks.csv`
    - 🔄 **R2**
      - 📄 `FOV1-spots.csv`
      - 📄 `FOV1-tracks.csv`
      - 📄 `FOV2-spots.csv`
      - 📄 `FOV2-tracks.csv`
  - 🌿 **Condition_2**
    - 🔄 **R1**
    - 🔄 **R2**

<font size = 4>In this representation, different symbols are used to represent folders and files clearly:

- 📁 represents the main folder or directory.
- 🌿 represents the condition folders.
- 🔄 represents the repeat folders.
- 📄 represents the individual CSV files or the corresponding XML files.
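
<font size = 4>Before loading, it can help to check that a folder follows this layout. The snippet below is a minimal sketch, not part of the `celltracks` package: `scan_trackmate_folder` is a hypothetical helper that walks `Folder_path/Condition/Repeat` and reports, for each field of view, whether both the `-spots.csv` and `-tracks.csv` files are present. It assumes CSV exports only.

```python
from pathlib import Path


def scan_trackmate_folder(folder_path):
    """Hypothetical helper: list the FOVs found under folder_path/Condition/Repeat."""
    records = []
    root = Path(folder_path)
    # First tier: condition folders; second tier: repeat folders
    for condition_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for repeat_dir in sorted(p for p in condition_dir.iterdir() if p.is_dir()):
            # Strip the expected suffixes to recover the FOV names
            spots = {f.name[:-len("-spots.csv")] for f in repeat_dir.glob("*-spots.csv")}
            tracks = {f.name[:-len("-tracks.csv")] for f in repeat_dir.glob("*-tracks.csv")}
            for fov in sorted(spots | tracks):
                records.append({
                    "condition": condition_dir.name,
                    "repeat": repeat_dir.name,
                    "fov": fov,
                    "has_spots": fov in spots,
                    "has_tracks": fov in tracks,
                })
    return records
```

<font size = 4>Any record with `has_spots` or `has_tracks` set to `False` points to a field of view with a missing or misnamed file.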

---

## Test dataset

A test dataset can be downloaded directly in this notebook or is available here:

https://zenodo.org/record/8413510

---


In [None]:
# @title #MIT License

print("""
**MIT License**

Copyright (c) 2023 Guillaume Jacquemet

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.""")

--------------------------------------------------------
# **Part 0. Prepare the Google Colab session**
--------------------------------------------------------
<font size = 4>Skip this section when using a local installation.


## **0.1. Install key dependencies**
---
<font size = 4>

In [None]:
#@markdown ##Play to install

print("In progress....")
%pip -q install pandas scikit-learn
%pip -q install plotly
%pip -q install tqdm

!git clone --single-branch --branch dev-egm https://github.com/CellMigrationLab/CellTracksColab.git


## **0.2. Mount your Google Drive**
---
<font size = 4> To use this notebook on the data present in your Google Drive, you need to mount your Google Drive to this notebook.

<font size = 4> Play the cell below to mount your Google Drive and follow the instructions.

<font size = 4> Once this is done, your data are available in the **Files** tab on the top left of the notebook.

In [None]:
#@markdown ##Play the cell to connect your Google Drive to Colab
from google.colab import drive
drive.mount('/content/Gdrive')
# This command was in the original notebook, but it does not appear to have any effect
## %cd /gdrive

--------------------------------------------------------
# **Part 1. Prepare the session and load the data**
--------------------------------------------------------

## **1.1. Load key dependencies**
---
<font size = 4>

In [None]:
#@markdown ##Play to load the dependencies
import os
import pandas as pd
import seaborn as sns
import numpy as np
import sys
import matplotlib.colors as mcolors
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import itertools
import requests
import ipywidgets as widgets
import warnings
import scipy.stats as stats

from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.gridspec import GridSpec
from ipywidgets import Dropdown, interact,Layout, VBox, Button, Accordion, SelectMultiple, IntText
from tqdm.notebook import tqdm
from IPython.display import display, clear_output
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cosine, pdist
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.metrics import pairwise_distances
from scipy.stats import zscore, ks_2samp
from sklearn.preprocessing import MinMaxScaler
from multiprocessing import Pool
from matplotlib.ticker import FixedLocator
from matplotlib.ticker import FuncFormatter
from matplotlib.colors import LogNorm

sys.path.append("../")
sys.path.append("CellTracksColab/")

import celltracks
from celltracks import *
from celltracks.Track_Plots import *
from celltracks.BoxPlots_Statistics import *
from celltracks.Track_Metrics import *

# Current version of the notebook the user is running
current_version = "1.0.1"
Notebook_name = 'TrackMate'

# URL to the raw content of the version file in the repository
version_url = "https://raw.githubusercontent.com/guijacquemet/CellTracksColab/main/Notebook/latest_version.txt"

# Function to define colors for formatting messages
class bcolors:
    WARNING = '\033[91m'  # Red color for warning messages
    ENDC = '\033[0m'      # Reset color to default

# Check if this is the latest version of the notebook
try:
    All_notebook_versions = pd.read_csv(version_url, dtype=str)
    print('Notebook version: ' + current_version)

    # Check if 'Version' column exists in the DataFrame
    if 'Version' in All_notebook_versions.columns:
        Latest_Notebook_version = All_notebook_versions[All_notebook_versions["Notebook"] == Notebook_name]['Version'].iloc[0]
        print('Latest notebook version: ' + Latest_Notebook_version)

        if current_version == Latest_Notebook_version:
            print("This notebook is up-to-date.")
        else:
            print(bcolors.WARNING + "A new version of this notebook has been released. We recommend that you download it at https://github.com/guijacquemet/CellTracksColab" + bcolors.ENDC)
    else:
        print("The 'Version' column is not present in the version file.")
except requests.exceptions.RequestException as e:
    print("Unable to fetch the latest version information. Please check your internet connection.")
except Exception as e:
    print("An error occurred:", str(e))



## **1.2. Compile your data or load existing dataframes**
---

<font size = 4> Please ensure that your data is properly organised (see above)


In [None]:
#@markdown ##Provide the path to your dataset:

#@markdown ###If you have multiple tracking files to compile, provide the path to the folder that contains them:

Folder_path = ''  # @param {type: "string"}
Data_Dims = "2D" #@param ["2D", "3D"]
File_Format = "xml" #@param ["csv", "xml"]
Data_Type = "TrackMate Files"

#@markdown ###Or use a test dataset (up to 10 min download)
Use_test_dataset = False #@param {type:"boolean"}

#@markdown ###Provide the path to your Results folder

Results_Folder = "/content/Results"  # @param {type: "string"}

# Update the parameters to load the data
CellTracks = celltracks.TrackingData()
if Use_test_dataset:
    # Download the test dataset
    test_celltrackscolab = "https://zenodo.org/records/8413510/files/T_cell_dataset.zip?download=1"
    CellTracks.DownloadTestData(test_celltrackscolab)
    File_Format = "csv"
else:
    CellTracks.Folder_path = Folder_path

CellTracks.Results_Folder = Results_Folder
CellTracks.skiprows = None
CellTracks.data_type = Data_Type
CellTracks.data_dims = Data_Dims
CellTracks.file_format = File_Format

# Load data
CellTracks.LoadTrackingData()

merged_spots_df = CellTracks.spots_data
check_for_nans(merged_spots_df, "merged_spots_df")
merged_tracks_df = CellTracks.tracks_data
if Data_Dims == '2D':
  merged_spots_df['POSITION_Z'] = 0

CellTracks.dim_mapping

print("...Done")

--------------------------------------------------------
# **Part 2. Visualise your tracks**
--------------------------------------------------------

## **2.1 Visualise your tracks in each field of view**
---

In [None]:
# @title ##Run the cell and choose the file you want to inspect
display_plots=True

if not os.path.exists(Results_Folder+"/Tracks"):
    os.makedirs(Results_Folder+"/Tracks")

filenames = merged_spots_df['File_name'].unique()

filename_dropdown = widgets.Dropdown(
    options=filenames,
    value=filenames[0] if len(filenames) > 0 else None,  # Default selected value
    description='File Name:',
)

interact(lambda filename: plot_track_coordinates(filename, merged_spots_df, Results_Folder, display_plots=display_plots), filename=filename_dropdown)


In [None]:
# @title ##Process all fields of view

display_plots = False # @param {type:"boolean"}

print("Plotting and saving tracks for all FOVs...")
for filename in tqdm(filenames, desc="Processing"):
  plot_track_coordinates(filename, merged_spots_df, Results_Folder, display_plots=display_plots)

print(f"All plots saved in: {Results_Folder}/Tracks/")


## **2.2 Origin-Normalized Plot for each field of view**

In [None]:
# @title ##Run the cell and choose the file you want to inspect

display_plots=True

if not os.path.exists(Results_Folder+"/Tracks"):
    os.makedirs(Results_Folder+"/Tracks")

filenames = merged_spots_df['File_name'].unique()

filename_dropdown = widgets.Dropdown(
    options=filenames,
    value=filenames[0] if len(filenames) > 0 else None,
    description='File Name:',
)

interact(lambda filename: plot_origin_normalized_coordinates_FOV(filename, merged_spots_df, Results_Folder), filename=filename_dropdown)


In [None]:
# @title ##Process all fields of view

display_plots = False # @param {type:"boolean"}

print("Plotting and saving tracks for all FOVs...")
for filename in tqdm(filenames, desc="Processing"):
  plot_origin_normalized_coordinates_FOV(filename, merged_spots_df, Results_Folder, display_plots=display_plots)

print(f"All plots saved in: {Results_Folder}/Tracks/")


## **2.3 Origin-Normalized Plot for each condition and repeat**

In [None]:
# @title ##Run the cell and choose the condition and repeat you want to inspect

if not os.path.exists(Results_Folder + "/Tracks"):
    os.makedirs(Results_Folder + "/Tracks")  # Ensure the directory exists for saving the plots

conditions = merged_spots_df['Condition'].unique()
repeats = merged_spots_df['Repeat'].unique()

condition_dropdown = widgets.Dropdown(
    options=conditions,
    value=conditions[0] if len(conditions) > 0 else None,
    description='Condition:',
)

repeat_dropdown = widgets.Dropdown(
    options=repeats,
    value=repeats[0] if len(repeats) > 0 else None,
    description='Repeat:',
)

interact(lambda condition, repeat: plot_origin_normalized_coordinates_condition_repeat(
            condition, repeat, merged_spots_df, Results_Folder),
         condition=condition_dropdown,
         repeat=repeat_dropdown)

In [None]:
# @title ##Process all Repeat/Condition combinations

from celltracks.Track_Plots import plot_origin_normalized_coordinates_condition_repeat

display_plots = False # @param {type:"boolean"}

if not os.path.exists(Results_Folder + "/Tracks"):
  os.makedirs(Results_Folder + "/Tracks")

conditions = merged_spots_df['Condition'].unique()
repeats = merged_spots_df['Repeat'].unique()

print("Plotting and saving tracks for all combinations of Conditions and Repeats...")

for condition in tqdm(conditions, desc="Conditions"):
  for repeat in tqdm(repeats, desc="Repeats", leave=False):
    plot_origin_normalized_coordinates_condition_repeat(condition, repeat, merged_spots_df, Results_Folder, display_plots=display_plots)

print(f"All plots saved in: {Results_Folder}/Tracks/")


## **2.4 Origin-Normalized Plot for each condition**

In [None]:
# @title ##Run the cell and choose the condition you want to inspect

if not os.path.exists(Results_Folder + "/Tracks"):
    os.makedirs(Results_Folder + "/Tracks")  # Ensure the directory exists for saving the plots

conditions = merged_spots_df['Condition'].unique()

condition_dropdown = widgets.Dropdown(
    options=conditions,
    value=conditions[0] if len(conditions) > 0 else None,
    description='Condition:',
)

interact(lambda condition: plot_origin_normalized_coordinates_condition(
            condition, merged_spots_df, Results_Folder),
         condition=condition_dropdown)

In [None]:
# @title ##Process all conditions

from celltracks.Track_Plots import plot_origin_normalized_coordinates_condition

display_plots = False # @param {type:"boolean"}

if not os.path.exists(Results_Folder + "/Tracks"):
  os.makedirs(Results_Folder + "/Tracks")

conditions = merged_spots_df['Condition'].unique()

print("Plotting and saving tracks for all Conditions...")

# Iterate over all conditions
for condition in tqdm(conditions, desc="Conditions"):
    plot_origin_normalized_coordinates_condition(condition, merged_spots_df, Results_Folder, display_plots=display_plots)

print(f"All plots saved in: {Results_Folder}/Tracks/")


## **2.5 Plot the migration vectors for each field of view**

In [None]:
# @title ##Plot the migration vectors
display_plots=True

fovs = merged_spots_df['File_name'].unique()
fov_dropdown = Dropdown(
    options=fovs,
    value=fovs[0] if len(fovs) > 0 else None,
    description='Select FOV:',
)

interact(lambda filename, display_plots: plot_migration_vectors(filename, merged_spots_df, Results_Folder, display_plots),
         filename=fov_dropdown,
         display_plots=display_plots)

In [None]:
# @title ##Process all fields of view

display_plots = False # @param {type:"boolean"}

print("Plotting and saving track vectors for all FOVs...")
for filename in tqdm(filenames, desc="Processing"):
  plot_migration_vectors(filename, merged_spots_df, Results_Folder, display_plots=display_plots)
print(f"All plots saved in: {Results_Folder}/Tracks/")




--------------------------------------------------------
# **Part 3. Compute additional metrics (optional)**
--------------------------------------------------------
<font size = 4 color="red">Part 3 does not support Track splitting</font>.

<font size = 4> For users aiming to compute additional track metrics within this environment, it is crucial to disable track splitting in TrackMate.


## **3.1 Filter and smooth your tracks (Optional)**
---


<font size = 4>The following section provides an interactive way to refine your tracking data. Here's what it's designed to achieve:

1. <font size = 4>**Filter Tracks**:
    - <font size = 4>This feature allows you to restrict the tracks you keep based on basic track metrics: duration, mean speed, maximum speed, minimum speed, and total distance traveled. By adjusting the corresponding sliders, you can ignore, for example, very short or very long tracks that might be artifacts or noise in your data.

2. <font size = 4>**Smooth Tracks**:
    - <font size = 4>The positional data in your tracks can be smoothed using a moving average technique. By adjusting the `Smoothing Neighbors` slider, you can control the degree of smoothing applied to the tracks. A higher value will average over more points, producing smoother tracks. This can be beneficial if your raw data has a lot of jitter or minor positional fluctuations.

<font size = 4>**How to Use**:

- <font size = 4>**Duration**: Use the slider to restrict the tracks by their duration.
- <font size = 4>**Mean Speed**, **Max Speed**, **Min Speed**: Use these sliders to restrict the tracks by their speed.
- <font size = 4>**Total Distance**: Use the slider to restrict the tracks by the total distance traveled.
- <font size = 4>**Smoothing Neighbors**: Adjust this slider to control the degree of smoothing you'd like to apply to your tracks.
- <font size = 4>**Apply Filters**: After adjusting the sliders to your preference, click this button. This will process the data based on your choices and prepare it for downstream analyses.
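
<font size = 4>To illustrate what moving-average smoothing does to a single track, here is a minimal sketch using pandas. It is not the notebook's `optimized_filter_and_smooth_tracks` function: `smooth_positions` is a hypothetical helper, and the centered window with `min_periods=1` is an assumption about how track ends are handled.

```python
import pandas as pd


def smooth_positions(track, neighbors=3, columns=("POSITION_X", "POSITION_Y")):
    """Hypothetical helper: centered moving average over `neighbors` spots of one track."""
    smoothed = track.sort_values("POSITION_T").copy()
    for col in columns:
        # min_periods=1 keeps the first and last spots by averaging fewer points there
        smoothed[col] = smoothed[col].rolling(
            window=neighbors, center=True, min_periods=1
        ).mean()
    return smoothed


# A jittery toy track oscillating between x = 0 and x = 1
track = pd.DataFrame({
    "POSITION_T": [0, 1, 2, 3, 4],
    "POSITION_X": [0.0, 1.0, 0.0, 1.0, 0.0],
    "POSITION_Y": [0.0, 0.0, 0.0, 0.0, 0.0],
})
smoothed_track = smooth_positions(track, neighbors=3)
```

<font size = 4>With `neighbors=1` the track is returned unchanged; larger values pull each position toward the mean of its neighbors and reduce the amplitude of the jitter.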



In [None]:
# @title ##Run to compute basic track metrics for filtering purpose

tqdm.pandas(desc="Calculating track metrics for filtering purpose")

global_metrics_df = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_track_metrics)

In [None]:
# @title ##Run to filter and smooth your tracks (slow when the dataset is large)

duration_slider = create_metric_slider('Duration:', 'Track Duration', global_metrics_df, width='500px')
mean_speed_slider = create_metric_slider('Mean Speed:', 'Mean Speed', global_metrics_df, width='500px')
max_speed_slider = create_metric_slider('Max Speed:', 'Max Speed', global_metrics_df, width='500px')
min_speed_slider = create_metric_slider('Min Speed:', 'Min Speed', global_metrics_df, width='500px')
total_distance_slider = create_metric_slider('Total Distance:', 'Total Distance Traveled', global_metrics_df, width='500px')
smoothing_slider = widgets.IntSlider(
    value=3,  # Default value; adjust as needed
    min=1,    # Minimum value
    max=10,   # Maximum value, adjust based on expected maximum
    step=1,   # Step value for the slider
    description='Smoothing Neighbors:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')  # Adjust width to match other sliders if necessary
)

def filter_on_button_click(button):
    global filtered_and_smoothed_df
    metric_filters = {
        'Track Duration': duration_slider.value,
        'Mean Speed': mean_speed_slider.value,
        'Max Speed': max_speed_slider.value,
        'Min Speed': min_speed_slider.value,
        'Total Distance Traveled': total_distance_slider.value,
    }
    with output:
        clear_output(wait=True)
        filtered_and_smoothed_df, metrics_summary_df = optimized_filter_and_smooth_tracks(
            merged_spots_df,
            metric_filters,
            smoothing_neighbors=smoothing_slider.value,
            global_metrics_df=global_metrics_df
        )
        # Save parameters
        params_file_path = os.path.join(Results_Folder, "filter_smoothing_parameters.csv")
        save_filter_smoothing_params(
            params_file_path,
            smoothing_slider.value,
            duration_slider.value,
            mean_speed_slider.value,
            max_speed_slider.value,
            min_speed_slider.value,
            total_distance_slider.value
        )
        print("Filtering and Smoothing Done")

apply_button = widgets.Button(description="Apply Filters", button_style='info')
apply_button.on_click(filter_on_button_click)
output = widgets.Output()

display_widgets = widgets.VBox([
    smoothing_slider,
    duration_slider, mean_speed_slider, max_speed_slider, min_speed_slider, total_distance_slider,
    apply_button, output
])
display(display_widgets)

In [None]:
# @title ##Compare Raw vs Filtered tracks

if not os.path.exists(Results_Folder+"/Tracks"):
    os.makedirs(Results_Folder+"/Tracks")  # Create Results_Folder if it doesn't exist

# Extract unique filenames from the dataframe
filenames = merged_spots_df['File_name'].unique()

# Create a Dropdown widget with the filenames
filename_dropdown = widgets.Dropdown(
    options=filenames,
    value=filenames[0] if len(filenames) > 0 else None,  # Default selected value
    description='File Name:',
)

# Link the Dropdown widget to the plotting function
interact(lambda filename: plot_coordinates_side_by_side(filename, merged_spots_df, filtered_and_smoothed_df, Results_Folder), filename=filename_dropdown)

In [None]:
# @title ##Run to choose which data you want to use for further analysis

widget_layout = widgets.Layout(width='500px')

# Create a RadioButtons widget to allow users to choose the DataFrame
data_choice = widgets.RadioButtons(
    options=[('Raw data', 'raw'), ('Smooth and filtered data', 'smoothed')],
    description='Use:',
    value='raw',
    disabled=False,
    layout=widget_layout
)

# Create a button for analysis
analyze_button = widgets.Button(
    description="Analyze",
    button_style='info',
    layout=widget_layout
)

# Define the button click callback
def on_analyze_button_click(button):
    # Declare both dataframes as global so the choice applies to the following cells
    global merged_spots_df
    global merged_tracks_df

    if data_choice.value == 'smoothed':
        merged_spots_df = filtered_and_smoothed_df
        save_dataframe_with_progress(merged_spots_df, Results_Folder + '/' + 'merged_Spots.csv')
        merged_tracks_df = merged_tracks_df[merged_tracks_df['Unique_ID'].isin(merged_spots_df['Unique_ID'])]
        save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

    print(f"Analysis will be performed using: {data_choice.label}")

# Assign button callback
analyze_button.on_click(on_analyze_button_click)

# Initial display of the widgets
display(data_choice)
display(analyze_button)


## **3.2. Duration and speed metrics**
---
<font size = 4>When this cell is executed, it calculates various metrics for each unique track. Specifically, for each track, it determines the duration of the track, the average, maximum, minimum, and standard deviation of speeds, as well as the total distance traveled by the tracked object.
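
<font size = 4>The sketch below shows one way these quantities can be derived from the spot positions of a single track. It is an illustration under stated assumptions, not the `calculate_track_metrics` function used in the next cell: `basic_track_metrics` is a hypothetical helper, speeds are computed per step as distance divided by the time difference, and steps with a zero time difference are ignored.

```python
import numpy as np


def basic_track_metrics(t, x, y, z=None):
    """Hypothetical helper: duration, speed statistics and path length of one track."""
    t = np.asarray(t, dtype=float)
    z = np.zeros(len(t)) if z is None else z
    coords = np.column_stack([x, y, z]).astype(float)

    # Sort the spots by time before computing step-wise quantities
    order = np.argsort(t)
    t, coords = t[order], coords[order]

    step_lengths = np.linalg.norm(np.diff(coords, axis=0), axis=1)
    dt = np.diff(t)
    valid = dt > 0  # assumption: skip steps without elapsed time
    speeds = step_lengths[valid] / dt[valid]
    has_speed = speeds.size > 0

    return {
        "duration": float(t[-1] - t[0]) if t.size else 0.0,
        "total_distance": float(step_lengths.sum()),
        "mean_speed": float(speeds.mean()) if has_speed else 0.0,
        "max_speed": float(speeds.max()) if has_speed else 0.0,
        "min_speed": float(speeds.min()) if has_speed else 0.0,
        "std_speed": float(speeds.std()) if has_speed else 0.0,
    }


# Toy track: one step of length 5, then one frame without movement
metrics = basic_track_metrics(t=[0, 1, 2], x=[0, 3, 3], y=[0, 4, 4])
```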

In [None]:
# @title ##Calculate track metrics

print("Calculating track metrics...")

merged_spots_df.dropna(subset=['POSITION_X', 'POSITION_Y', 'POSITION_Z'], inplace=True)

tqdm.pandas(desc="Calculating Track Metrics")

columns_to_remove = [
    "TRACK_DURATION",
    "TRACK_MEAN_SPEED",
    "TRACK_MAX_SPEED",
    "TRACK_MIN_SPEED",
    "TRACK_MEDIAN_SPEED",
    "TRACK_STD_SPEED",
    "TOTAL_DISTANCE_TRAVELED"
]

for column in columns_to_remove:
    if column in merged_tracks_df.columns:
        merged_tracks_df.drop(column, axis=1, inplace=True)

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
df_track_metrics = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_track_metrics).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_track_metrics.columns).drop('Unique_ID')
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)
merged_tracks_df = pd.merge(merged_tracks_df, df_track_metrics, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')
check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

In [None]:
# @title ##Calculate track metrics using rolling windows

window_size = 5  # @param {type: "number"}

tqdm.pandas(desc="Calculating Track Metrics using a rolling window")

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)
# Use the window size chosen above rather than a hard-coded value
df_track_metrics = merged_spots_df.groupby('Unique_ID').progress_apply(lambda x: calculate_track_metrics_rolling(x, window_size=window_size)).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_track_metrics.columns).drop('Unique_ID')
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)
merged_tracks_df = pd.merge(merged_tracks_df, df_track_metrics, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")


## **3.3. Directionality**
---
<font size = 4>To calculate the directionality of a track in 3D space, we consider a series of points each with \(x\), \(y\), and \(z\) coordinates, sorted by time. The directionality, denoted as \(D\), is calculated using the formula:

$$ D = \frac{d_{\text{euclidean}}}{d_{\text{total path}}} $$

where $d_{\text{euclidean}}$ is the Euclidean distance between the first and the last points of the track, calculated as:

$$ d_{\text{euclidean}} = \sqrt{(x_{\text{end}} - x_{\text{start}})^2 + (y_{\text{end}} - y_{\text{start}})^2 + (z_{\text{end}} - z_{\text{start}})^2} $$

and $d_{\text{total path}}$ is the sum of the Euclidean distances between all consecutive points in the track, representing the total path length traveled. If the total path length is zero, the directionality is defined to be zero. This measure provides insight into the straightness of the path taken, with a value of 1 indicating a straight path between the start and end points, and values approaching 0 indicating more circuitous paths.</font>
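
<font size = 4>As a minimal sketch of this formula (not the `calculate_directionality` function imported in the next cell), the hypothetical helper `directionality` below takes the time-sorted coordinates of one track as an N×3 array and applies the zero-path-length rule described above.

```python
import numpy as np


def directionality(coords):
    """Hypothetical helper: net displacement divided by total path length."""
    coords = np.asarray(coords, dtype=float)
    if len(coords) < 2:
        return 0.0
    # Sum of the distances between consecutive points
    total_path = np.linalg.norm(np.diff(coords, axis=0), axis=1).sum()
    if total_path == 0:
        return 0.0  # defined as zero when the object never moves
    # Straight-line distance between the first and last points
    d_euclidean = np.linalg.norm(coords[-1] - coords[0])
    return float(d_euclidean / total_path)


straight = directionality([[0, 0, 0], [1, 0, 0], [2, 0, 0]])
there_and_back = directionality([[0, 0, 0], [1, 0, 0], [0, 0, 0]])
```

<font size = 4>Here `straight` evaluates to 1 because the path never deviates from the start-to-end line, while `there_and_back` evaluates to 0 because the track returns to its origin.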


In [None]:
# @title ##Calculate directionality
from celltracks.Track_Metrics import calculate_directionality

print("In progress...")

merged_spots_df.dropna(subset=['POSITION_X', 'POSITION_Y', 'POSITION_Z'], inplace=True)

tqdm.pandas(desc="Calculating Directionality")

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

df_directionality = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_directionality).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_directionality.columns).drop('Unique_ID')

merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_directionality, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

In [None]:
# @title ##Calculate directionality using rolling windows

window_size = 5  # @param {type: "number"}

tqdm.pandas(desc="Calculating Rolling Directionality")

df_rolling_directionality = merged_spots_df.groupby('Unique_ID').progress_apply(lambda x: calculate_rolling_directionality(x, window_size=window_size)).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_rolling_directionality.columns).drop('Unique_ID')
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_rolling_directionality, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')
print("Rolling Directionality Calculation...Done")

check_for_nans(merged_tracks_df, "merged_tracks_df")

## **3.4. Tortuosity**
---
<font size = 4>This measure provides insight into the curvature and complexity of the path taken, with a value of 1 indicating a straight path between the start and end points, and values greater than 1 indicating paths with more twists and turns.
To calculate the tortuosity of a track in 3D space, we consider a series of points each with \(x\), \(y\), and \(z\) coordinates, sorted by time. The tortuosity, denoted as \(T\), is calculated using the formula:

$$ T = \frac{d_{\text{total path}}}{d_{\text{euclidean}}} $$
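
<font size = 4>A minimal sketch of this ratio follows. It is not the `calculate_tortuosity` function used in the next cell: `tortuosity` is a hypothetical helper, and returning `NaN` when the start and end points coincide is an assumption made here, since the ratio is undefined in that case.

```python
import numpy as np


def tortuosity(coords):
    """Hypothetical helper: total path length divided by net displacement."""
    coords = np.asarray(coords, dtype=float)
    if len(coords) < 2:
        return float("nan")
    total_path = np.linalg.norm(np.diff(coords, axis=0), axis=1).sum()
    d_euclidean = np.linalg.norm(coords[-1] - coords[0])
    if d_euclidean == 0:
        return float("nan")  # assumption: undefined for closed or static tracks
    return float(total_path / d_euclidean)


# L-shaped track: path length 2, net displacement sqrt(2)
l_shaped = tortuosity([[0, 0, 0], [1, 0, 0], [1, 1, 0]])
```

<font size = 4>For any track with a non-zero net displacement, this value is the reciprocal of the directionality defined in section 3.3.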



In [None]:
# @title ##Calculate tortuosity
print("In progress...")

tqdm.pandas(desc="Calculating Tortuosity")

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

df_tortuosity = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_tortuosity).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_tortuosity.columns).drop('Unique_ID')

merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_tortuosity, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

In [None]:
# @title ##Calculate tortuosity using rolling windows

window_size = 5  # @param {type: "number"}

tqdm.pandas(desc="Calculating Rolling Tortuosity")
df_rolling_tortuosity = merged_spots_df.groupby('Unique_ID').progress_apply(lambda x: calculate_rolling_tortuosity(x, window_size=window_size)).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_rolling_tortuosity.columns).drop('Unique_ID')
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_rolling_tortuosity, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')
check_for_nans(merged_tracks_df, "merged_tracks_df")

print("Rolling Tortuosity Calculation...Done")

## **3.5. Calculate the total turning angle**
---

<font size = 4>This measure provides insight into the cumulative amount of turning along the path, with a value of 0 indicating a straight path with no turning, and higher values indicating paths with more turning.

<font size = 4>To calculate the Total Turning Angle of a track in 3D space, we consider a series of points each with \(x\), \(y\), and \(z\) coordinates, sorted by time. The Total Turning Angle, denoted as \(A\), is the sum of the angles between each pair of consecutive direction vectors along the track, representing the cumulative amount of turning along the path.

<font size = 4>For each pair of consecutive segments in the track, we calculate the direction vectors $\vec{v_1}$ and $\vec{v_2}$, and the angle $\theta$ between them is calculated using the formula:

$$ \cos(\theta) = \frac{\vec{v_1} \cdot \vec{v_2}}{||\vec{v_1}|| \cdot ||\vec{v_2}||} $$

<font size = 4>where $\vec{v_1} \cdot \vec{v_2}$ is the dot product of the direction vectors, and $||\vec{v_1}||$ and $||\vec{v_2}||$ are the magnitudes of the direction vectors. The Total Turning Angle $A$ is then the sum of all the angles $\theta$ calculated between each pair of consecutive direction vectors along the track:

$$ A = \sum \theta $$

<font size = 4>If either of the direction vectors is a zero vector, the angle between them is undefined, and such cases are skipped in the calculation.
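
<font size = 4>The calculation can be sketched as follows. This is an illustration, not the `calculate_total_turning_angle` function used in the next cell: `total_turning_angle` is a hypothetical helper, and reporting the result in degrees is an assumption, since the unit is not stated above.

```python
import numpy as np


def total_turning_angle(coords):
    """Hypothetical helper: sum of the angles between consecutive direction vectors."""
    coords = np.asarray(coords, dtype=float)
    vectors = np.diff(coords, axis=0)  # one direction vector per step
    total = 0.0
    for v1, v2 in zip(vectors[:-1], vectors[1:]):
        n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
        if n1 == 0 or n2 == 0:
            continue  # angle undefined for a zero vector, so the pair is skipped
        # Clip to [-1, 1] to guard against floating-point round-off
        cos_theta = np.clip(np.dot(v1, v2) / (n1 * n2), -1.0, 1.0)
        total += np.degrees(np.arccos(cos_theta))  # assumption: degrees
    return float(total)


# Staircase track with two right-angle turns
staircase = total_turning_angle([[0, 0, 0], [1, 0, 0], [1, 1, 0], [2, 1, 0]])
```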


In [None]:
# @title ##Calculate the total turning angle

tqdm.pandas(desc="Calculating Total Turning Angle")

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

df_turning_angle = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_total_turning_angle).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_turning_angle.columns).drop('Unique_ID')

merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_turning_angle, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

In [None]:
# @title ##Calculate the total turning angle using rolling windows

window_size = 5  # @param {type: "number"}

tqdm.pandas(desc="Calculating Average Total Turning Angle")
df_rolling_turning_angle = merged_spots_df.groupby('Unique_ID').progress_apply(lambda x: calculate_rolling_total_turning_angle(x, window_size=window_size)).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_rolling_turning_angle.columns).drop('Unique_ID')
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)
merged_tracks_df = pd.merge(merged_tracks_df, df_rolling_turning_angle, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')
check_for_nans(merged_tracks_df, "merged_tracks_df")

print("Rolling Total Turning Angle Calculation...Done")

## **3.6. Calculate the Spatial Coverage**
---

<font size = 4>Spatial coverage provides insight into the spatial extent covered by the object's movement, with higher values indicating that the object has covered a larger area or volume during its movement.


<font size = 4>To calculate the spatial coverage of a track in 2D or 3D space, we consider a series of points, each with \(x\), \(y\), and optionally \(z\) coordinates, sorted by time. The spatial coverage, denoted as \(S\), is the area (in 2D) or volume (in 3D) enclosed by the convex hull of the points in the track.

#### In the implementation below we:
1. <font size = 4>**Check Dimensionality**:
   <font size = 4>- If the variance of the \(z\) coordinates is zero, implying all \(z\) coordinates are the same, the spatial coverage is calculated in 2D using only the \(x\) and \(y\) coordinates.
   <font size = 4>- If the \(z\) coordinates vary, the spatial coverage is calculated in 3D using the \(x\), \(y\), and \(z\) coordinates.

2. <font size = 4>**Form Convex Hull**:
   <font size = 4>- In 2D, a minimum of 3 non-collinear points is required to form a convex hull.
   <font size = 4>- In 3D, a minimum of 4 non-coplanar points is required to form a convex hull.
   <font size = 4>- If the required minimum points are not available, the spatial coverage is defined to be zero.

3. <font size = 4>**Calculate Spatial Coverage**:
   <font size = 4>- In 2D, the spatial coverage \(S\) is the area of the convex hull formed by the points in the track.
   <font size = 4>- In 3D, the spatial coverage \(S\) is the volume of the convex hull formed by the points in the track.

#### Formula:
- For 2D Spatial Coverage (Area of Convex Hull), if points are \(P_1(x_1, y_1), P_2(x_2, y_2), \ldots, P_n(x_n, y_n)\):
  $$ S_{2D} = \text{Area of Convex Hull formed by } P_1, P_2, \ldots, P_n $$

- For 3D Spatial Coverage (Volume of Convex Hull), if points are \(P_1(x_1, y_1, z_1), P_2(x_2, y_2, z_2), \ldots, P_n(x_n, y_n, z_n)\):
  $$ S_{3D} = \text{Volume of Convex Hull formed by } P_1, P_2, \ldots, P_n $$
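
<font size = 4>The three steps above can be sketched with `scipy.spatial.ConvexHull`. This is an illustration rather than the notebook's `calculate_spatial_coverage` function: the name `spatial_coverage_sketch` is hypothetical, and the rank test used to detect collinear or coplanar points is one possible choice made for this example.

```python
import numpy as np
from scipy.spatial import ConvexHull

def spatial_coverage_sketch(points):
    """Area (2D) or volume (3D) of the convex hull of an (n, 3) array of x, y, z."""
    points = np.asarray(points, dtype=float)
    # Step 1: check dimensionality and drop z when it does not vary.
    if np.var(points[:, 2]) == 0:
        points = points[:, :2]
    dim = points.shape[1]
    # Step 2: a hull needs at least dim + 1 points that span every dimension
    # (non-collinear in 2D, non-coplanar in 3D); otherwise coverage is zero.
    if len(points) < dim + 1:
        return 0.0
    if np.linalg.matrix_rank(points - points.mean(axis=0)) < dim:
        return 0.0
    # Step 3: for 2D input, SciPy's `volume` attribute holds the enclosed area.
    return float(ConvexHull(points).volume)

unit_square = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]
print(spatial_coverage_sketch(unit_square))
```

<font size = 4>Note that for 2D hulls the `area` attribute of `ConvexHull` is the perimeter, which is why the sketch reads `volume` in both cases.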



In [None]:
# @title ##Calculate the Spatial Coverage

tqdm.pandas(desc="Calculating Spatial Coverage")

merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

df_spatial_coverage = merged_spots_df.groupby('Unique_ID').progress_apply(calculate_spatial_coverage).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_spatial_coverage.columns).drop('Unique_ID')

merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_spatial_coverage, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

In [None]:
# @title ##Calculate the Spatial Coverage using rolling windows

window_size = 5  # Adjust as needed

tqdm.pandas(desc="Calculating Rolling Spatial Coverage")

df_rolling_spatial_coverage = merged_spots_df.groupby('Unique_ID').progress_apply(lambda x: calculate_rolling_spatial_coverage(x, window_size=window_size)).reset_index()

overlapping_columns = merged_tracks_df.columns.intersection(df_rolling_spatial_coverage.columns).drop('Unique_ID')
merged_tracks_df.drop(columns=overlapping_columns, inplace=True)

merged_tracks_df = pd.merge(merged_tracks_df, df_rolling_spatial_coverage, on='Unique_ID', how='left')

save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("Rolling Spatial Coverage Calculation...Done")

## **3.7. Compute additional metrics**
---

This cell computes various metrics for each track in the provided dataset. These metrics are derived from the information provided by TrackMate in the spots table and include statistical properties like mean, median, standard deviation, minimum, and maximum values.
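
As a rough illustration of this kind of per-track summary, the sketch below aggregates spot-level values with pandas. It is not the notebook's `compute_morphological_metrics` function: the name `summarize_spot_metrics_sketch` and the `METRIC_statistic` naming of the output columns are assumptions made for this example.

```python
import pandas as pd

def summarize_spot_metrics_sketch(spots_df, metrics, track_col="Unique_ID"):
    """Return one row per track with mean/median/std/min/max of each available metric."""
    # Keep only the metrics that are present in the spots table.
    available = [m for m in metrics if m in spots_df.columns]
    summary = spots_df.groupby(track_col)[available].agg(
        ["mean", "median", "std", "min", "max"]
    )
    # Flatten the (metric, statistic) column MultiIndex into names such as AREA_mean.
    summary.columns = [f"{metric}_{stat}" for metric, stat in summary.columns]
    return summary

# Toy spots table with two tracks; PERIMETER is requested but absent.
spots = pd.DataFrame({
    "Unique_ID": [1, 1, 1, 2, 2],
    "AREA": [10.0, 12.0, 14.0, 5.0, 7.0],
})
summary = summarize_spot_metrics_sketch(spots, ["AREA", "PERIMETER"])
print(summary)
```

Metrics that are missing from the spots table are skipped rather than raising an error, mirroring the availability check performed in the cell below.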


In [None]:
# @title ##Compute additional metrics

print("In progress...")

# List of potential metrics to compute
potential_metrics = [
    'MEAN_INTENSITY_CH1', 'MEDIAN_INTENSITY_CH1', 'MIN_INTENSITY_CH1', 'MAX_INTENSITY_CH1',
    'TOTAL_INTENSITY_CH1', 'STD_INTENSITY_CH1', 'CONTRAST_CH1', 'SNR_CH1', 'ELLIPSE_X0',
    'ELLIPSE_Y0', 'ELLIPSE_MAJOR', 'ELLIPSE_MINOR', 'ELLIPSE_THETA', 'ELLIPSE_ASPECTRATIO',
    'AREA', 'PERIMETER', 'CIRCULARITY', 'SOLIDITY', 'SHAPE_INDEX','MEAN_INTENSITY_CH2', 'MEDIAN_INTENSITY_CH2', 'MIN_INTENSITY_CH2', 'MAX_INTENSITY_CH2',
    'TOTAL_INTENSITY_CH2', 'STD_INTENSITY_CH2', 'CONTRAST_CH2', 'SNR_CH2', 'MEAN_INTENSITY_CH3', 'MEDIAN_INTENSITY_CH3', 'MIN_INTENSITY_CH3', 'MAX_INTENSITY_CH3',
    'TOTAL_INTENSITY_CH3', 'STD_INTENSITY_CH3', 'CONTRAST_CH3', 'SNR_CH3', 'MEAN_INTENSITY_CH4', 'MEDIAN_INTENSITY_CH4', 'MIN_INTENSITY_CH4', 'MAX_INTENSITY_CH4',
    'TOTAL_INTENSITY_CH4', 'STD_INTENSITY_CH4', 'CONTRAST_CH4', 'SNR_CH4'
]

available_metrics = check_metrics_availability(merged_spots_df, potential_metrics)

morphological_metrics_df = compute_morphological_metrics(merged_spots_df, available_metrics)

morphological_metrics_df.reset_index(inplace=True)

if 'Unique_ID' in merged_tracks_df.columns:
    overlapping_columns = merged_tracks_df.columns.intersection(morphological_metrics_df.columns).drop('Unique_ID', errors='ignore')
    merged_tracks_df.drop(columns=overlapping_columns, inplace=True)
    merged_tracks_df = merged_tracks_df.merge(morphological_metrics_df, on='Unique_ID', how='left')
    save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

else:
    print("Error: 'Unique_ID' column missing in merged_tracks_df. Skipping merging with morphological metrics.")

check_for_nans(merged_tracks_df, "merged_tracks_df")

print("...Done")

--------
# **Part 4. Quality Control**
--------

      



## **4.1. Assess if your dataset is balanced**
---

In cell tracking and similar biological analyses, the balance of the dataset is important, particularly in ensuring that each biological repeat carries equal weight. Here's why this balance is essential:

### Accurate Representation of Biological Variability

- **Capturing True Biological Variation**: Biological repeats are crucial for capturing the natural variability inherent in biological systems. Equal weighting ensures that this variability is accurately represented.
- **Reducing Sampling Bias**: By balancing the dataset, we avoid overemphasizing the characteristics of any single repeat, which might not be representative of the broader biological context.

If your dataset is strongly imbalanced, check that the imbalance does not drive your results, for example by repeating the analysis on the balanced dataset generated in section 5.2.



In [None]:
# @title ##Check the number of tracks per condition and repeat

if not os.path.exists(f"{Results_Folder}/QC"):
    os.makedirs(f"{Results_Folder}/QC")

result_df = count_tracks_by_condition_and_repeat(merged_tracks_df, f"{Results_Folder}/QC")


## **4.2. Compute Similarity Metrics between Fields of View (FOV) and between Conditions and Repeats**
---

<font size = 4>**Purpose**:

<font size = 4>This section provides a set of tools to compute and visualize similarities between different fields of view (FOVs) based on selected track parameters. Using hierarchical clustering, the resulting dendrograms show how the different FOVs, conditions, or repeats relate to one another. This tool is useful for:

<font size = 4>1. **Quality Control**:
    - Ensuring that FOVs from the same condition or experimental setup are more similar to each other than to FOVs from different conditions.
    - Confirming that repeats of the same experiment yield consistent results and cluster together.
    
<font size = 4>2. **Data Integrity**:
    - Identifying potential outliers or anomalies in the dataset.
    - Assessing the overall consistency of the experiment and ensuring reproducibility.

<font size = 4>**How to Use**:

<font size = 4>1. **Track Parameters Selection**:
    - A list of checkboxes allows users to select which track parameters they want to consider for similarity calculations. By default, all parameters are selected. Users can deselect parameters that they believe might not contribute significantly to the similarity.

<font size = 4>2. **Similarity Metric**:
    - Users can choose a similarity metric from a dropdown list. Options include cosine, euclidean, cityblock, jaccard, and correlation. The choice of similarity metric can influence the clustering results, so users might need to experiment with different metrics to see which one provides the most meaningful results.

<font size = 4>3. **Linkage Method**:
    - Determines how the distance between clusters is calculated in the hierarchical clustering process. Different linkage methods can produce different dendrograms, so users might want to try various methods.

<font size = 4>4. **Visualization**:
    - Once the parameters are selected, users can click on the "Select the track parameters and visualize similarity" button. This will compute the hierarchical clustering and display two dendrograms:
        - One dendrogram displays similarities between individual FOVs.
        - Another dendrogram aggregates the data based on conditions and repeats, providing a higher-level view of the similarities.
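
<font size = 4>The workflow described above (aggregate, compute pairwise distances, link, draw a dendrogram) can be sketched on a toy table. The FOV labels and values below are hypothetical, and the metric and linkage method are example choices rather than recommendations.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical per-FOV mean values for two track parameters (rows = FOVs).
labels = ["ctrl_FOV1", "ctrl_FOV2", "drug_FOV1", "drug_FOV2"]
fov_means = np.array([
    [1.0, 10.0],
    [1.1, 10.5],
    [3.0, 2.0],
    [3.2, 2.1],
])

# Condensed vector of pairwise distances between FOVs.
distances = pdist(fov_means, metric="euclidean")
# Each row of `linked` records one merge: the two clusters joined and their distance.
linked = linkage(distances, method="average")
# no_plot=True returns the tree layout without drawing it.
tree = dendrogram(linked, labels=labels, no_plot=True)
print(tree["ivl"])  # leaf labels in dendrogram order
```

<font size = 4>In this toy example the two FOVs of each condition merge with each other before the conditions are joined, which is the pattern expected from a consistent experiment. Keep in mind that SciPy documents the `ward` linkage method as correctly defined only for Euclidean distances.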
      


In [None]:
# @title ##Compute similarity metrics between FOV and between conditions and repeats

# Check and create "QC" folder
if not os.path.exists(f"{Results_Folder}/QC"):
    os.makedirs(f"{Results_Folder}/QC")

# Columns to exclude
excluded_columns = ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION', 'Exemplar','TRACK_STOP', 'TRACK_START', 'Cluster_UMAP', 'Cluster_tsne']

selected_df = pd.DataFrame()

# Filter out non-numeric columns but keep 'File_name'
numeric_df = merged_tracks_df.select_dtypes(include=['float64', 'int64']).copy()
numeric_df['File_name'] = merged_tracks_df['File_name']

# List the selectable columns, leaving out the excluded ones (including 'File_name')
column_names = [col for col in numeric_df.columns if col not in excluded_columns]

# Create a checkbox for each column
checkboxes = [widgets.Checkbox(value=True, description=col, indent=False) for col in column_names]

# Dropdown for similarity metrics
similarity_dropdown = widgets.Dropdown(
    options=['cosine', 'euclidean', 'cityblock', 'jaccard', 'correlation'],
    value='cosine',
    description='Similarity Metric:'
)

# Dropdown for linkage methods
linkage_dropdown = widgets.Dropdown(
    options=['single', 'complete', 'average', 'ward'],
    value='single',
    description='Linkage Method:'
)

# Arrange checkboxes in a two-column grid
grid = widgets.GridBox(checkboxes, layout=widgets.Layout(grid_template_columns="repeat(2, 300px)"))

# Create a button to trigger the selection and visualization
button = widgets.Button(description="Select the track parameters and visualize similarity", layout=widgets.Layout(width='400px'), button_style='info')

# Define the button click event handler
def on_button_click(b):
    global selected_df  # Declare selected_df as global

    # Get the selected columns from the checkboxes
    selected_columns = [box.description for box in checkboxes if box.value]
    selected_columns.append('File_name')  # Always include 'File_name'

    # Extract the selected columns from the DataFrame
    selected_df = numeric_df[selected_columns]

    # Report the percentage of NaNs for every affected column, then drop the affected rows once
    nan_percentages = selected_df.isna().mean() * 100
    nan_percentages = nan_percentages[nan_percentages > 0]
    if not nan_percentages.empty:
        print("Warning: NaN values found in the selected data.")
        for column, nan_percentage in nan_percentages.items():
            print(f"{column}: {nan_percentage:.2f}%")
        print("Proceeding to handle NaN values by dropping the affected rows.")
        selected_df = selected_df.dropna()

    # Aggregate the data by filename
    aggregated_by_filename = selected_df.groupby('File_name').mean(numeric_only=True)

    # Aggregate the data by condition and repeat
    aggregated_by_condition_repeat = merged_tracks_df.groupby(['Condition', 'Repeat'])[selected_columns].mean(numeric_only=True)

    # Compute condensed distance matrices
    distance_matrix_filename = pdist(aggregated_by_filename, metric=similarity_dropdown.value)
    distance_matrix_condition_repeat = pdist(aggregated_by_condition_repeat, metric=similarity_dropdown.value)

    # Perform hierarchical clustering
    linked_filename = linkage(distance_matrix_filename, method=linkage_dropdown.value)
    linked_condition_repeat = linkage(distance_matrix_condition_repeat, method=linkage_dropdown.value)

    annotation_text = f"Similarity Method: {similarity_dropdown.value}, Linkage Method: {linkage_dropdown.value}"

    # Prepare the parameters dictionary
    similarity_params = {
        'Similarity Metric': similarity_dropdown.value,
        'Linkage Method': linkage_dropdown.value,
        'Selected Columns': ', '.join(selected_columns)
    }

    # Save the parameters
    params_file_path = os.path.join(Results_Folder, "QC/analysis_parameters.csv")
    save_parameters(similarity_params, params_file_path, 'Similarity Metrics')

    # Plot the dendrograms one under the other
    plt.figure(figsize=(10, 10))

    # Dendrogram for individual filenames
    plt.subplot(2, 1, 1)
    dendrogram(linked_filename, labels=aggregated_by_filename.index, orientation='top', distance_sort='descending', leaf_rotation=90)
    plt.title(f'Dendrogram of Field of view Similarities\n{annotation_text}')

    # Dendrogram for aggregated data based on condition and repeat
    plt.subplot(2, 1, 2)
    dendrogram(linked_condition_repeat, labels=aggregated_by_condition_repeat.index, orientation='top', distance_sort='descending', leaf_rotation=90)
    plt.title(f'Dendrogram of Aggregated Similarities by Condition and Repeat\n{annotation_text}')

    plt.tight_layout()

    # Save the dendrogram to a PDF
    pdf_pages = PdfPages(f"{Results_Folder}/QC/Dendrogram_Similarities.pdf")

    # Save the current figure to the PDF
    pdf_pages.savefig()

    # Close the PdfPages object to finalize the document
    pdf_pages.close()

    plt.show()

# Set the button click event handler
button.on_click(on_button_click)

# Display the widgets
display(grid, similarity_dropdown, linkage_dropdown, button)


-------------------------------------------

# **Part 5. Plot track parameters**
-------------------------------------------



<font size = 4> In this section you can plot all the track parameters previously computed. Data and graphs are automatically saved in your result folder.

<font size = 4 color="red"> Parameters computed are in the unit you provided when tracking your data in TrackMate.

## **Statistical analyses**
### Cohen's d (Effect Size):
<font size = 4>Cohen's d measures the size of the difference between two groups, normalized by their pooled standard deviation. Values can be interpreted as small (0 to 0.2), medium (0.2 to 0.5), or large (0.5 and above) effects. It quantifies how large the observed difference is, independently of whether that difference is statistically significant.
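
<font size = 4>A minimal sketch of this calculation follows, assuming the textbook pooled standard deviation (sample variances weighted by their degrees of freedom). The function name `cohens_d_sketch` is hypothetical and the notebook's own implementation may differ in detail.

```python
import numpy as np

def cohens_d_sketch(group_a, group_b):
    """Difference of the group means divided by the pooled standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Means differ by 1 and the pooled standard deviation is 2.
print(cohens_d_sketch([2, 4, 6], [1, 3, 5]))
```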

### Randomization Test:
<font size = 4>This non-parametric test evaluates whether the observed difference between conditions could have arisen by chance. It shuffles the condition labels many times, recalculating Cohen's d each time. The resulting p-value is the fraction of shuffles that produce an effect at least as large as the one actually observed: the smaller the p-value, the stronger the evidence against the null hypothesis of no difference between conditions.
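
<font size = 4>The shuffling procedure can be sketched as below. The function name, the number of shuffles, the two-sided comparison on the absolute effect size, and the add-one correction are all assumptions made for this example, not a description of the notebook's implementation.

```python
import numpy as np

def randomization_test_sketch(group_a, group_b, n_shuffles=1000, seed=0):
    """Two-sided randomization test that uses Cohen's d as the test statistic."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)

    def effect(x, y):
        # Cohen's d with the pooled standard deviation.
        pooled_var = (
            (len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1)
        ) / (len(x) + len(y) - 2)
        return (x.mean() - y.mean()) / np.sqrt(pooled_var)

    observed = abs(effect(a, b))
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_shuffles):
        # Shuffling the pooled values is equivalent to shuffling the condition labels.
        shuffled = rng.permutation(pooled)
        if abs(effect(shuffled[:len(a)], shuffled[len(a):])) >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (count + 1) / (n_shuffles + 1)

p_separated = randomization_test_sketch(range(10, 18), range(0, 8))
print(p_separated)
```

<font size = 4>Fixing the seed makes the p-value reproducible; with more shuffles the estimate becomes more precise at the cost of computation time.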

### Bonferroni Correction:
<font size = 4>Given multiple comparisons, the Bonferroni Correction adjusts significance thresholds to mitigate the risk of false positives. By dividing the standard significance level (alpha) by the number of tests, it ensures that only robust findings are considered significant. However, it's worth noting that this method can be conservative, sometimes overlooking genuine effects.
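
<font size = 4>As a short worked example of the correction, the sketch below divides alpha by the number of tests and flags the p-values that fall below the adjusted threshold. The function name and the p-values are hypothetical.

```python
def bonferroni_sketch(p_values, alpha=0.05):
    """Return the Bonferroni-adjusted threshold and a significance flag per test."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    # The significance level is shared equally between all tests.
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# Three comparisons: only the first survives the adjusted threshold of 0.05 / 3.
threshold, significant = bonferroni_sketch([0.001, 0.02, 0.04])
print(threshold, significant)
```

<font size = 4>Here 0.02 and 0.04 would pass an uncorrected threshold of 0.05 but not the corrected one, which illustrates how conservative the method can be.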

## **5.1. Plot your entire dataset**
--------

In [None]:
# @title ##Plot normalized track parameters by condition as a heatmap (entire dataset)

base_folder = f"{Results_Folder}/track_parameters_plots"
Conditions = 'Condition'
df_to_plot = merged_tracks_df

folders = ["pdf", "csv"]
for folder in folders:
    dir_path = os.path.join(base_folder, folder)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

# Plot the z-score-normalized heatmap of the track parameters
heatmap_comparison(df_to_plot, base_folder, Conditions, normalization="zscore")

In [None]:
# @title ##Plot track parameters (entire dataset)

base_folder = f"{Results_Folder}/track_parameters_plots"
Conditions = 'Condition'
df_to_plot = merged_tracks_df

folders = ["pdf", "csv"]
for folder in folders:
    dir_path = os.path.join(base_folder, folder)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

condition_selector, condition_accordion = display_condition_selection(df_to_plot, Conditions)
checkboxes_dict, checkboxes_accordion = display_variable_checkboxes(categorize_columns(df_to_plot))
variable_checkboxes, checkboxes_widget = display_variable_checkboxes(get_selectable_columns_plots(df_to_plot))
stat_method_selector = widgets.Dropdown(
    options=['randomization test', 't-test'],
    value='randomization test',
    description='Stat Method:',
    style={'description_width': 'initial'}
)

button = Button(description="Plot Selected Variables", layout=Layout(width='400px'), button_style='info')
button.on_click(lambda b: plot_selected_vars(b, checkboxes_dict, df_to_plot, Conditions, base_folder, condition_selector, stat_method_selector))

display(VBox([condition_accordion, checkboxes_accordion, stat_method_selector, button]))

## **5.2. Plot a balanced dataset**
--------

### **5.2.1. Downsample your dataset to ensure that it is balanced**
--------

### Downsampling and Balancing Dataset

This section of the notebook addresses imbalances in the dataset, which can otherwise bias the analysis. The cell below downsamples the dataset to balance the number of tracks across conditions and repeats. For reproducibility, it includes a `random_seed` parameter, which is set to 42 by default but can be adjusted as needed.

All results from this section will be saved in the Balanced Dataset Directory created in your `Results_Folder`.
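
One way to downsample in this spirit is sketched below, assuming that every condition-repeat group is reduced to the size of the smallest group. The function name `balance_dataset_sketch` is hypothetical, and the notebook's `balance_dataset` function may follow a different balancing rule.

```python
import pandas as pd

def balance_dataset_sketch(tracks_df, group_cols=("Condition", "Repeat"), random_seed=42):
    """Randomly downsample every group to the size of the smallest group."""
    group_cols = list(group_cols)
    # The smallest group sets the number of tracks kept in every group.
    n_keep = int(tracks_df.groupby(group_cols).size().min())
    return (
        tracks_df.groupby(group_cols, group_keys=False)
        .sample(n=n_keep, random_state=random_seed)
        .reset_index(drop=True)
    )

# Toy dataset with 5, 3, and 4 tracks in its three condition-repeat groups.
toy_tracks = pd.DataFrame({
    "Condition": ["A"] * 8 + ["B"] * 4,
    "Repeat": ["R1"] * 5 + ["R2"] * 3 + ["R1"] * 4,
    "TRACK_MEAN_SPEED": [float(i) for i in range(12)],
})
balanced = balance_dataset_sketch(toy_tracks)
print(balanced.groupby(["Condition", "Repeat"]).size())
```

Because the seed is fixed, running the sketch twice on the same table returns the same subset of tracks.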




In [None]:
# @title ##Run this cell to downsample and balance your dataset

random_seed = 42

if not os.path.exists(f"{Results_Folder}/Balanced_dataset"):
    os.makedirs(f"{Results_Folder}/Balanced_dataset")

balanced_merged_tracks_df = balance_dataset(merged_tracks_df, random_seed=random_seed)
result_df = count_tracks_by_condition_and_repeat(balanced_merged_tracks_df, f"{Results_Folder}/Balanced_dataset")

check_for_nans(balanced_merged_tracks_df, "balanced_merged_tracks_df")
save_dataframe_with_progress(balanced_merged_tracks_df, Results_Folder + '/Balanced_dataset/merged_Tracks_balanced_dataset.csv')


### **5.2.2. Check if the downsampling has affected data distribution**
--------

This section of the notebook generates a heatmap visualizing the Kolmogorov-Smirnov (KS) p-values for each numerical column in the dataset, comparing the distributions before and after downsampling. This heatmap serves as a tool for assessing the impact of downsampling on data quality, guiding decisions on whether the downsampled dataset is suitable for further analysis.

#### Purpose of the Heatmap
- **KS Test:** The KS test is used to determine if two samples are drawn from the same distribution. In this context, it compares the distribution of each numerical column in the original dataset (`merged_tracks_df`) with its counterpart in the downsampled dataset (`balanced_merged_tracks_df`).
- **P-Value Interpretation:** The p-value is the probability of observing a difference between the two samples at least as large as the measured one if both were drawn from the same distribution. A higher p-value means there is less evidence that downsampling changed the distribution.

#### Interpreting the Heatmap
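- **Color Coding:** The heatmap uses a color gradient (from viridis) to represent the range of p-values. Darker colors indicate lower p-values. The color scale is capped at 0.5 (`vmax=0.5`), so all p-values at or above 0.5 share the lightest color.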
- **P-Value Thresholds:**
  - **High P-Values (Lighter Areas):** Indicate that the downsampling process likely did not significantly alter the distribution of that numerical column for the specific condition-repeat group.
  - **Low P-Values (Darker Areas):** Suggest that the downsampling process may have affected the distribution significantly.
- **Varying P-Values:** Variations in color across different columns and rows help identify which specific numerical columns and condition-repeat groups are most affected by the downsampling.
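
The comparison behind each heatmap cell can be illustrated with `scipy.stats.ks_2samp` on synthetic data. The arrays below are hypothetical, and the notebook's `calculate_ks_p_value` helper may handle missing values or small groups differently.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
original = rng.normal(loc=0.0, scale=1.0, size=500)

# A random subsample, as produced by downsampling, should follow the same distribution.
subsample = rng.choice(original, size=200, replace=False)
# A shifted copy stands in for a sample whose distribution was visibly altered.
shifted = original[:200] + 1.0

p_same = ks_2samp(original, subsample).pvalue
p_shifted = ks_2samp(original, shifted).pvalue
print(f"random subsample: p = {p_same:.3f}")
print(f"shifted sample:   p = {p_shifted:.3g}")
```

The shifted sample yields a much smaller p-value than the random subsample, which is the pattern the darker cells of the heatmap are meant to flag.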




In [None]:
# @title ##Check if your downsampling has affected your data distribution

numerical_columns = merged_tracks_df.select_dtypes(include=['int64', 'float64']).columns

# Initialize a DataFrame to store KS p-values
ks_p_values = pd.DataFrame(columns=numerical_columns)

# Iterate over each group and numerical column
for group, group_df in merged_tracks_df.groupby(['Condition', 'Repeat']):
    group_p_values = []
    balanced_group_df = balanced_merged_tracks_df[(balanced_merged_tracks_df['Condition'] == group[0]) & (balanced_merged_tracks_df['Repeat'] == group[1])]
    for column in numerical_columns:
        p_value = calculate_ks_p_value(group_df, balanced_group_df, column)
        group_p_values.append(p_value)
    ks_p_values.loc[f'Condition: {group[0]}, Repeat: {group[1]}'] = group_p_values

max_columns_per_heatmap = 20

total_columns = len(ks_p_values.columns)

num_heatmaps = -(-total_columns // max_columns_per_heatmap)  # Ceiling division

pdf_filepath = Results_Folder+'/Balanced_dataset/p-Value Heatmap.pdf'

# Create a PDF file
with PdfPages(pdf_filepath) as pdf:
    # Loop through each subset of columns and create a heatmap
    for i in range(num_heatmaps):
        start_col = i * max_columns_per_heatmap
        end_col = min(start_col + max_columns_per_heatmap, total_columns)

        # Subset of columns for this heatmap
        subset_columns = ks_p_values.columns[start_col:end_col]

        # Create the heatmap for the subset of columns
        plt.figure(figsize=(12, 8))
        sns.heatmap(ks_p_values[subset_columns], cmap='viridis', vmax=0.5, vmin=0)
        plt.title(f'Kolmogorov-Smirnov P-Value Heatmap (Columns {start_col+1} to {end_col})')
        plt.xlabel('Numerical Columns')
        plt.ylabel('Condition-Repeat Groups')
        plt.tight_layout()

        # Save the current figure to the PDF
        pdf.savefig()
        plt.show()
        plt.close()

print(f"Saved all heatmaps to {pdf_filepath}")

ks_p_values.to_csv(Results_Folder + '/Balanced_dataset/ks_p_values.csv')
print("Saved KS p-values to ks_p_values.csv")


### **5.2.3. Plot your balanced dataset**
--------

In [None]:
# @title ##Plot track parameters (balanced dataset)

# Parameters to adapt depending on the notebook section
base_folder = f"{Results_Folder}/Balanced_dataset/track_parameters_plots"
Conditions = 'Condition'
df_to_plot = balanced_merged_tracks_df

# Check and create necessary directories
folders = ["pdf", "csv"]
for folder in folders:
    dir_path = os.path.join(base_folder, folder)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

condition_selector, condition_accordion = display_condition_selection(df_to_plot, Conditions)
checkboxes_dict, checkboxes_accordion = display_variable_checkboxes(categorize_columns(df_to_plot))
variable_checkboxes, checkboxes_widget = display_variable_checkboxes(get_selectable_columns_plots(df_to_plot))
stat_method_selector = widgets.Dropdown(
    options=['randomization test', 't-test'],
    value='randomization test',
    description='Stat Method:',
    style={'description_width': 'initial'}
)

button = Button(description="Plot Selected Variables", layout=Layout(width='400px'), button_style='info')
button.on_click(lambda b: plot_selected_vars(b, checkboxes_dict, df_to_plot, Conditions, base_folder, condition_selector, stat_method_selector))

display(VBox([condition_accordion, checkboxes_accordion, stat_method_selector, button]))

# **Part 6. Version log**
---
<font size = 4>While we strive to provide accurate and helpful information, please be aware that:
  - This notebook may contain bugs.
  - Features are currently limited and will be expanded in future releases.

<font size = 4>We encourage users to report any issues or suggestions for improvement. Please check the [repository](https://github.com/guijacquemet/CellTracksColab) regularly for updates and the latest version of this notebook.

#### **Known Issues**:
- Tracks are displayed in 2D in section 1.4

<font size = 4>**Version 10.0.1**
  - Includes a general data reader
    
<font size = 4>**Version 0.9.2**
  - Added the Origin normalized plots

<font size = 4>**Version 0.9.1**
  - Added the PIP freeze option to save a requirement text
  - Added the heatmap visualisation of track parameters
  - Heatmaps can now be displayed on multiple pages
  - Fix userwarning message during plotting (all box plots)
  - Added the possibility to copy and paste an existing list of selected metrics for clustering analyses

<font size = 4>**Version 0.9**
  - Improved plotting strategy. Specific conditions can be chosen
  - Absolute Cohen's d values are now shown
  - In the QC section, the heatmap is automatically divided into subplots when the dataframe has too many columns

<font size = 4>**Version 0.8**
  - Settings are now saved
  - Order of the section has been modified to help streamline biological discoveries
  - New section added to Quality Control to check if the dataset is balanced
  - New section added to the UMAP and t-SNE section to plot track parameters for selected clusters
  - Clusters for UMAP and t-SNE are now saved in the dataframe separately

<font size = 4>**Version 0.7**
  - check_for_nans function added
  - Clustering using t-SNE added

<font size = 4>**Version 0.6**
  - Improved organisation of the results
  - Track visualisations are now saved

<font size = 4>**Version 0.5**
  - Improved part 5
  - Added the possibility to find exemplars on the raw movies when available
  - Added the possibility to export videos with the exemplars labeled
  - Code improved to deal with larger datasets (tested with over 50k tracks)
  - Test dataset now contains raw video and is hosted on Zenodo
  - Results are now organised in folders
  - Added progress bars
  - Minor code fixes

<font size = 4>**Version 0.4**

  - Added the possibility to filter and smooth tracks
  - Added spatial and temporal calibration
  - Notebook is streamlined
  - Multiple bug fixes
  - Removed t-SNE
  - Improved documentation

<font size = 4>**Version 0.3**
  - Fixed a nasty bug in the import functions
  - Added basic exemplars for UMAP
  - Added the statistical analyses and their explanations.
  - Added a new quality control part that helps assess the similarity of results between FOVs, conditions and repeats
  - Improved part 5 (previously part 4).

<font size = 4>**Version 0.2**
  - Added support for 3D tracks
  - New documentation and metrics added.

<font size = 4>**Version 0.1**
  - This is the first release of this notebook.

---