# **CellTracksColab Dimensionality Reduction**
---


<font size = 4>**Notebook Overview:**

<font size = 4>This notebook is designed for analyzing datasets stored in the CellTracksColab format, utilizing advanced dimensionality reduction techniques to facilitate the interpretation of complex, high-dimensional data.

<font size = 4>**Techniques Employed:**

- <font size = 4>**UMAP (Uniform Manifold Approximation and Projection):** UMAP is a powerful method for dimensionality reduction that preserves as much of the local and global structure of the data as possible. It's particularly suited for large datasets and is capable of revealing intricate structures within the data that other techniques might miss.

- <font size = 4>**t-SNE (t-distributed Stochastic Neighbor Embedding):** t-SNE is another highly effective technique used for exploring high-dimensional data. It converts affinities of data points to probabilities and then tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.

- <font size = 4>**HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise):** Complementing the above dimensionality reduction techniques, HDBSCAN identifies clusters using a density-based approach. This method excels in finding clusters of varying densities and sizes from data shaped by UMAP or t-SNE, making it invaluable for discerning the nuanced groupings within complex datasets.
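
<font size = 4>To make the t-SNE objective mentioned above more concrete, here is a minimal sketch of the Kullback-Leibler divergence between two discrete probability distributions. It is an illustration only, not the t-SNE implementation; the function name `kl_divergence` and the three toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) between two discrete distributions."""
    # Terms with p_i == 0 contribute nothing to the sum by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical neighbor probabilities for one point (made-up numbers).
high_dim = [0.5, 0.3, 0.2]        # in the original high-dimensional space
good_embedding = [0.5, 0.3, 0.2]  # an embedding that preserves them exactly
poor_embedding = [0.2, 0.3, 0.5]  # an embedding that swaps near and far neighbors

print(kl_divergence(high_dim, good_embedding))  # 0.0 for identical distributions
print(kl_divergence(high_dim, poor_embedding))  # positive when they disagree
```

<font size = 4>The divergence is zero when the two distributions agree and grows as they diverge, which is why minimizing it pulls the embedding toward the neighborhood structure of the original data.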

<font size = 4>**Objective:**

<font size = 4>The goal of this notebook is to make complex, high-dimensional data more interpretable and actionable by applying these techniques. This approach not only aids in visualizing the data in two or three dimensions but also in identifying inherent clusters and patterns critical for further analysis and decision-making processes.








# **Before getting started**
---



In [None]:
# @title #MIT License

print("""
**MIT License**

Copyright (c) 2023 Guillaume Jacquemet

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.""")

--------------------------------------------------------
# **Part 0. Prepare the Google Colab session**
--------------------------------------------------------
<font size = 4>Skip this section when using a local installation.


## **0.1. Install key dependencies**
---
<font size = 4>

In [None]:
#@markdown ##Play to install

print("In progress....")

!git clone --single-branch --branch dev-egm https://github.com/CellMigrationLab/CellTracksColab.git

%pip -q install pandas scikit-learn
%pip -q install hdbscan
%pip -q install umap-learn
%pip -q install plotly
%pip -q install tqdm



## **0.2. Mount your Google Drive**
---
<font size = 4> To use this notebook on the data present in your Google Drive, you need to mount your Google Drive to this notebook.

<font size = 4> Play the cell below to mount your Google Drive and follow the instructions.

<font size = 4> Once this is done, your data are available in the **Files** tab at the top left of the notebook.

In [None]:
#@markdown ##Play the cell to connect your Google Drive to Colab
from google.colab import drive
drive.mount('/content/Gdrive')
# The command below was part of the original notebook; it appears to have no effect, so it is left commented out
## %cd /gdrive

--------------------------------------------------------
# **Part 1. Prepare the session and load the data**
--------------------------------------------------------

## **1.1 Load key dependencies**
---
<font size = 4>

In [None]:
#@markdown ##Play to load the dependencies
import os
import pandas as pd
import seaborn as sns
import numpy as np
import sys
import matplotlib.colors as mcolors
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import itertools
import requests
import ipywidgets as widgets
import warnings
import scipy.stats as stats

from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.gridspec import GridSpec
from ipywidgets import Dropdown, interact,Layout, VBox, Button, Accordion, SelectMultiple, IntText
from tqdm.notebook import tqdm
from IPython.display import display, clear_output
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cosine, pdist
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.metrics import pairwise_distances
from scipy.stats import zscore, ks_2samp
from sklearn.preprocessing import MinMaxScaler
from multiprocessing import Pool
from matplotlib.ticker import FixedLocator
from matplotlib.ticker import FuncFormatter
from matplotlib.colors import LogNorm

sys.path.append("../")
sys.path.append("CellTracksColab/")

import celltracks
from celltracks import *
from celltracks.Track_Plots import *
from celltracks.BoxPlots_Statistics import *
from celltracks.Dimensionality_Reduction import *


# Current version of the notebook the user is running
current_version = "1.0.1"
Notebook_name = 'Dimensionality_Reduction'

# URL to the raw content of the version file in the repository
version_url = "https://raw.githubusercontent.com/guijacquemet/CellTracksColab/main/Notebook/latest_version.txt"

# Function to define colors for formatting messages
class bcolors:
    WARNING = '\033[91m'  # Red color for warning messages
    ENDC = '\033[0m'      # Reset color to default

# Check if this is the latest version of the notebook
try:
    All_notebook_versions = pd.read_csv(version_url, dtype=str)
    print('Notebook version: ' + current_version)

    # Check if 'Version' column exists in the DataFrame
    if 'Version' in All_notebook_versions.columns:
        Latest_Notebook_version = All_notebook_versions[All_notebook_versions["Notebook"] == Notebook_name]['Version'].iloc[0]
        print('Latest notebook version: ' + Latest_Notebook_version)

        if current_version == Latest_Notebook_version:
            print("This notebook is up-to-date.")
        else:
            print(bcolors.WARNING + "A new version of this notebook has been released. We recommend that you download it at https://github.com/guijacquemet/CellTracksColab" + bcolors.ENDC)
    else:
        print("The 'Version' column is not present in the version file.")
except requests.exceptions.RequestException as e:
    print("Unable to fetch the latest version information. Please check your internet connection.")
except Exception as e:
    print("An error occurred:", str(e))

#----------------------- Key functions -----------------------------#



## **1.2. Load your CellTracksColab dataset**
---

<font size="4"> Before proceeding, please ensure that your data has been properly processed using CellTracksColab. Typically, your Track table should be named `merged_Tracks.csv`, and your Spot table should be named `merged_Spots.csv`.

For the `Results_Folder` parameter, you can choose the same folder that already contains all the results associated with your dataset. Results generated by this notebook are saved in its `Umap` subfolder.
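
Before loading, it can help to confirm that the Track table carries the identifier columns this notebook relies on (`File_name`, `Repeat`, `Condition`, `Unique_ID`). The sketch below is one possible check, not part of the CellTracksColab library; the helper `missing_columns`, the toy table, and the metric name `TRACK_MEAN_SPEED` are hypothetical.

```python
import pandas as pd

# Identifier columns this notebook carries alongside the selected metrics.
REQUIRED_COLUMNS = ['File_name', 'Repeat', 'Condition', 'Unique_ID']

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy track table standing in for merged_Tracks.csv (values are made up).
toy_tracks = pd.DataFrame({
    'Unique_ID': [0, 1],
    'Condition': ['control', 'treated'],
    'Repeat': [1, 1],
    'TRACK_MEAN_SPEED': [0.8, 1.3],  # hypothetical metric column
})

print(missing_columns(toy_tracks))  # ['File_name']
```

With a real dataset, the same function could be applied to a table read with `pd.read_csv` before running the loading cell.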

In [None]:
#@markdown ##Provide the path to your CellTracksColab dataset:

Data_Dims = "2D" #@param ["2D", "3D"]
Data_Type = "CellTracksColab"

Track_table = ''  # @param {type: "string"}
Spot_table = ''  # @param {type: "string"}


Use_test_dataset = False

#@markdown ###Provide the path to your Result folder

Results_Folder = ""  # @param {type: "string"}

# Update the parameters to load the data
CellTracks = celltracks.TrackingData()
if Use_test_dataset:
    # Download the test dataset
    test_celltrackscolab = "https://zenodo.org/record/8420011/files/T_Cells_spots_only.zip?download=1"
    CellTracks.DownloadTestData(test_celltrackscolab)
    File_Format = "csv"
else:

    CellTracks.Spot_table = Spot_table
    CellTracks.Track_table = Track_table

CellTracks.Results_Folder = Results_Folder
CellTracks.skiprows = None
CellTracks.data_type = Data_Type
CellTracks.data_dims = Data_Dims

# Load data
CellTracks.LoadTrackingData()

merged_spots_df = CellTracks.spots_data
check_for_nans(merged_spots_df, "merged_spots_df")
merged_tracks_df = CellTracks.tracks_data
print("...Done")

--------
# **Part 2. Explore your high-dimensional data using UMAP and HDBSCAN**
--------

<font size = 4> The workflow provided below is inspired by [cellPLATO](https://github.com/Michael-shannon/cellPLATO).

## **2.1. Choose the track metrics to use for clustering**
--------


In [None]:
# @title ##Choose the track metrics to use

import ipywidgets as widgets
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Check and create the "Umap" folder
if not os.path.exists(os.path.join(Results_Folder, "Umap")):
    os.makedirs(os.path.join(Results_Folder, "Umap"), exist_ok=True)


excluded_columns = ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION', 'Exemplar','TRACK_STOP', 'TRACK_START', 'Cluster_UMAP', 'Cluster_tsne']

# Columns you want to always include
columns_to_include = ['File_name', 'Repeat', 'Condition', 'Unique_ID']

selected_df = pd.DataFrame()
nan_columns = pd.DataFrame()
# Extract the columns you always want to include and ensure they exist in the original dataframe
saved_columns = {col: merged_tracks_df[col].copy() for col in columns_to_include if col in merged_tracks_df}

# Filter out non-numeric columns
numeric_df = merged_tracks_df.select_dtypes(include=['float64', 'int64'])  # Selecting only numeric columns

column_names = [col for col in numeric_df.columns if col not in excluded_columns]

# Text area for user to paste the list of metrics
text_area = widgets.Textarea(
    value='',
    placeholder='Copy and paste your list of metrics here, separated by commas.',
    description='Metrics:',
    disabled=False,
    layout=widgets.Layout(width='100%', height='100px')
)


# Function to parse the text area content into a list
def parse_text_area(text):
    return [item.strip() for item in text.split(',') if item.strip() in column_names]

# Create a checkbox for each column
def create_checkboxes(parsed_metrics):
    return [widgets.Checkbox(value=(col in parsed_metrics or not parsed_metrics), description=col, indent=False) for col in column_names]

checkboxes = create_checkboxes(column_names)  # Initialize with all metrics

# Grid for displaying checkboxes
grid = widgets.GridBox(checkboxes, layout=widgets.Layout(grid_template_columns="repeat(2, 300px)"))

# Create a button to trigger the selection
button = widgets.Button(description="Select the track parameters",
                        layout=widgets.Layout(width='400px'),
                        button_style='info')


def on_button_click(b):
    global selected_df
    global nan_columns
    parsed_metrics = parse_text_area(text_area.value)
    selected_columns = [box.description for box in checkboxes if box.value]

    # Extract the selected columns from the DataFrame
    selected_df = numeric_df[selected_columns].copy()

    # Prepare the parameters dictionary
    UMAP_params = {
        'Selected Columns': ', '.join(selected_columns)
    }

    # Save the parameters
    params_file_path = os.path.join(Results_Folder, "Umap", "analysis_parameters.csv")
    save_parameters(UMAP_params, params_file_path, 'UMAP')

    # Add back the always-included columns to selected_df
    for col, data in saved_columns.items():
        selected_df.loc[:, col] = data

    # Drop the rows that contain NaN values in any of the selected columns.
    nan_columns = selected_df.columns[selected_df.isna().any()].tolist()
    if nan_columns:
        selected_df = selected_df.dropna(subset=nan_columns)

    print("Done")

# Set the button click event handler
button.on_click(on_button_click)

# Function to update checkboxes based on text area input
def update_checkboxes(b):
    parsed_metrics = parse_text_area(text_area.value)
    global checkboxes
    checkboxes = create_checkboxes(parsed_metrics)
    grid.children = checkboxes

# Update checkboxes when text area content changes
text_area.observe(update_checkboxes, names='value')

# Display the text area, grid of checkboxes, and the button
display(text_area, grid, button)

## **2.2. UMAP**
---

<font size = 4> This cell runs UMAP (Uniform Manifold Approximation and Projection) on the track metrics selected in section 2.1 and visualizes the resulting embedding. The parameters `n_neighbors`, `min_dist`, and `n_dimension` (passed to UMAP as `n_components`) determine the structure and appearance of the low-dimensional representation.

<font size = 4>`n_neighbors`: This parameter controls how UMAP balances local versus global structure in the data. It determines the size of the local neighborhood UMAP will look at when learning the manifold structure of the data.
- A smaller value emphasizes the local structure of the data, potentially at the expense of the global structure.
- A larger value allows UMAP to consider more distant neighbors, emphasizing more on the global structure of the data.
- Typically, values in the range of 5 to 50 are chosen, depending on the density and scale of the data.

<font size = 4>`min_dist`: This parameter controls how tightly UMAP is allowed to pack points together. It determines the minimum distance between points in the low-dimensional representation.
- Setting it to a low value will allow points to be packed more closely, potentially revealing clusters in the data.
- A higher value ensures that points are more spread out in the representation.
- Values usually range between 0 and 1.

<font size = 4>`n_dimension`: This parameter sets the number of dimensions of the low-dimensional space that the data will be reduced to; the code passes it to UMAP as `n_components`.
For visualization purposes, `n_dimension` is typically set to 2 or 3 to obtain 2D or 3D representations, respectively.
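
<font size = 4>To illustrate what a "local neighborhood" of size `n_neighbors` means, the sketch below finds the nearest neighbors of each point in a tiny made-up dataset with NumPy. This is a simplification for intuition only: UMAP builds a weighted neighbor graph with its own search, and the helper `nearest_neighbors` and the coordinates in `toy_points` are hypothetical.

```python
import numpy as np

def nearest_neighbors(points, k):
    """Return, for each point, the indices of its k nearest neighbors (Euclidean)."""
    points = np.asarray(points, dtype=float)
    # Pairwise distance matrix via broadcasting: shape (n_points, n_points).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # in this sketch a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

# Two tight pairs of points that are far apart from each other (made-up coordinates).
toy_points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]]

print(nearest_neighbors(toy_points, k=1).ravel().tolist())  # [1, 0, 3, 2]
print(nearest_neighbors(toy_points, k=2).tolist())
```

<font size = 4>With `k=1` each point only sees its partner in the same pair; with `k=2` the neighborhoods start to bridge the two pairs, which mirrors how larger `n_neighbors` values bring more global structure into the embedding.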


In [None]:
# @title ##Perform UMAP
import umap
import plotly.offline as pyo

#@markdown ###UMAP parameters:

n_neighbors = 10  # @param {type: "number"}
min_dist = 0  # @param {type: "number"}
n_dimension = 2  # @param {type: "slider", min: 1, max: 3}

#@markdown ###Display parameters:
spot_size = 30 # @param {type: "number"}

# Initialize UMAP object with the specified settings
reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist, n_components=n_dimension, random_state=42)
# Exclude non-numeric columns when fitting UMAP
embedding = reducer.fit_transform(selected_df.drop(columns=columns_to_include))
# Create dynamic column names based on n_components
column_names = [f'UMAP dimension {i}' for i in range(1, n_dimension + 1)]

# Extract the columns_to_include from selected_df
included_data = selected_df[columns_to_include].reset_index(drop=True)

# Concatenate the UMAP embedding with the included columns
umap_df = pd.concat([pd.DataFrame(embedding, columns=column_names), included_data], axis=1)


# Check if the DataFrame has any NaN values and print a warning if it does.
nan_columns = umap_df.columns[umap_df.isna().any()].tolist()

if nan_columns:
  warnings.warn(f"The DataFrame contains NaN values in the following columns: {', '.join(nan_columns)}")
  for col in nan_columns:
    umap_df = umap_df.dropna(subset=[col])  # Drop NaN values only from columns containing them

# Prepare the parameters dictionary
UMAP_params = {
    'n_neighbors': n_neighbors,
    'min_dist': min_dist,
    'n_dimension': n_dimension
}

# Save the parameters
params_file_path = os.path.join(Results_Folder, "Umap","analysis_parameters.csv")
save_parameters(UMAP_params, params_file_path, 'UMAP')

# Visualize the UMAP projection
plt.figure(figsize=(12, 10))

# The plot will adjust automatically based on the n_components
if n_dimension == 2:
    sns.scatterplot(x=column_names[0], y=column_names[1], hue='Condition', data=umap_df, palette='Set2', s=spot_size)
    plt.title('UMAP Projection of the Dataset')
    plt.savefig(f"{Results_Folder}/Umap/umap_projection_2D.pdf")  # Save 2D plot as PDF
    plt.show()
elif n_dimension == 1:
    sns.stripplot(x=column_names[0], hue='Condition', data=umap_df, palette='Set2', jitter=0.05, size=spot_size)
    plt.title('UMAP Projection of the Dataset')
    plt.savefig(f"{Results_Folder}/Umap/umap_projection_1D.pdf")  # Save 1D plot as PDF
    plt.show()
else:
    # umap_df should have columns like 'UMAP dimension 1', 'UMAP dimension 2', 'UMAP dimension 3', and 'condition'
    import plotly.express as px
    import pandas as pd
    import numpy as np

    fig = px.scatter_3d(umap_df,
                    x='UMAP dimension 1',
                    y='UMAP dimension 2',
                    z='UMAP dimension 3',
                    color='Condition')

    for trace in fig.data:
      trace.marker.size = spot_size/10  # You can set this to any desired value

    fig.show()
    pyo.plot(fig, filename=os.path.join(Results_Folder, "Umap", "umap_projection.html"), auto_open=False)

## **2.3. HDBSCAN**
---

<font size="4">The provided code employs HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to identify clusters within a dataset that has already undergone UMAP dimensionality reduction. HDBSCAN is utilized for its proficiency in determining the optimal number of clusters while managing varied densities within the data.</font>

<font size="4">In the provided HDBSCAN code, the parameters `min_samples`, `min_cluster_size`, and `metric` are crucial for determining the structure and appearance of the resulting clusters in the data.</font>

<font size="4">`min_samples`: This parameter primarily controls the degree to which the algorithm is willing to declare noise. It's the number of samples in a neighborhood for a point to be considered as a core point.</font>
- <font size="4">A smaller value of `min_samples` makes the algorithm more prone to declaring points as part of a cluster, potentially leading to larger clusters and fewer noise points.</font>
- <font size="4">A larger value makes the algorithm more conservative, resulting in more points declared as noise and smaller, more defined clusters.</font>
- <font size="4">The choice of `min_samples` typically depends on the density of the data; denser datasets may require a larger value.</font>

<font size="4">`min_cluster_size`: This parameter determines the smallest size grouping that you wish to consider a cluster.</font>
- <font size="4">A smaller value will allow the formation of smaller clusters, whereas a larger value will prevent small isolated groups of points from being declared as clusters.</font>
- <font size="4">The choice of `min_cluster_size` depends on the scale of the data and the desired level of granularity in the clustering.</font>

<font size="4">`metric`: This parameter is the metric used for distance computation between data points, and it affects the shape of the clusters.</font>
- <font size="4">The `euclidean` metric is a good starting point, and depending on the clustering results and the data type, it might be beneficial to experiment with different metrics.</font>

---

| <font size="4">Metric</font> | <font size="4">Description</font>                                   | <font size="4">Data Type</font>            |
|-------------------|-------------------------------------------------------|-------------------------------|
| <font size="4">Euclidean</font>   | <font size="4">Standard distance metric.</font>                          | <font size="4">Numerical data.</font>               |
| <font size="4">Manhattan</font>   | <font size="4">Sum of absolute differences.</font>                       | <font size="4">Numerical/Categorical data.</font>   |
| <font size="4">Chebyshev</font>   | <font size="4">Maximum value of absolute differences.</font>             | <font size="4">Numerical data.</font>               |
| <font size="4">Bray-Curtis</font> | <font size="4">Dissimilarity between sample sets.</font>                 | <font size="4">Numerical data.</font>               |
| <font size="4">Canberra</font>    | <font size="4">Weighted version of Manhattan distance.</font>            | <font size="4">Numerical data.</font>               |


In [None]:
# @title ##Identify clusters using HDBSCAN
import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pandas as pd
import numpy as np

#@markdown ###HDBSCAN parameters:
clustering_data_source = 'umap'  # @param ['umap', 'raw']
min_samples = 20  # @param {type: "number"}
min_cluster_size = 200  # @param {type: "number"}
metric = "euclidean"  # @param ['euclidean', 'manhattan', 'chebyshev', 'braycurtis', 'canberra']

#@markdown ###Display parameters:
spot_size = 30 # @param {type: "number"}
# Apply HDBSCAN
clusterer = hdbscan.HDBSCAN(min_samples=min_samples, min_cluster_size=min_cluster_size, metric=metric)  # You may need to tune these parameters

if clustering_data_source == 'umap':
  if n_dimension == 1:
    clusterer.fit(umap_df[['UMAP dimension 1']])  # Use only one UMAP dimension for clustering

  elif n_dimension == 2:
    clusterer.fit(umap_df[['UMAP dimension 1', 'UMAP dimension 2']])  # Use two UMAP dimensions for clustering

  elif n_dimension == 3:
    clusterer.fit(umap_df[['UMAP dimension 1', 'UMAP dimension 2', 'UMAP dimension 3']])  # Use three UMAP dimensions for clustering

else:
  clusterer.fit(selected_df.select_dtypes(include=['number']))

# Add the cluster labels to your UMAP DataFrame
umap_df['Cluster_UMAP'] = clusterer.labels_

# If the Cluster column already exists in merged_tracks_df, drop it to avoid duplications
if 'Cluster_UMAP' in merged_tracks_df.columns:
    merged_tracks_df.drop(columns='Cluster_UMAP', inplace=True)

# Merge the Cluster column from umap_df to merged_tracks_df based on Unique_ID
merged_tracks_df = pd.merge(merged_tracks_df, umap_df[['Unique_ID', 'Cluster_UMAP']], on='Unique_ID', how='left')

# Handle cases where some rows in merged_tracks_df might not have a corresponding cluster label
merged_tracks_df['Cluster_UMAP'] = merged_tracks_df['Cluster_UMAP'].fillna(-1)  # Assign -1 to tracks without a cluster label (avoids chained in-place assignment)

# Save the DataFrame with the identified clusters
merged_tracks_df.to_csv(Results_Folder + '/' + 'merged_Tracks.csv', index=False)

# Prepare the parameters dictionary
UMAP_params = {
    'clustering_data_source': clustering_data_source,
    'min_samples': min_samples,
    'min_cluster_size': min_cluster_size,
    'metric': metric
}

# Save the parameters
params_file_path = os.path.join(Results_Folder, "Umap/analysis_parameters.csv")
save_parameters(UMAP_params, params_file_path, 'HDBSCAN')

# Plotting the results
if n_dimension == 1:
    plt.figure(figsize=(12, 6))
    sns.stripplot(data=umap_df, x='UMAP dimension 1', hue='Cluster_UMAP', palette='viridis', s=spot_size)
    plt.title('Clusters Identified by HDBSCAN (1D)')
    plt.xlabel('UMAP dimension 1')
    plt.ylabel('Count')
    plt.savefig(f"{Results_Folder}/Umap/HDBSCAN_clusters_1D.pdf")  # Save 1D strip plot as PDF
    plt.show()

if n_dimension == 2:

  plt.figure(figsize=(12,10))
  sns.scatterplot(x='UMAP dimension 1', y='UMAP dimension 2', hue='Cluster_UMAP', palette='viridis', data=umap_df, s=spot_size)
  plt.title('Clusters Identified by HDBSCAN')
  plt.savefig(os.path.join(Results_Folder, "Umap", "HDBSCAN_clusters_2D.pdf"))  # Save 2D plot as PDF
  plt.show()

if n_dimension == 3:

  fig = px.scatter_3d(umap_df,
                    x='UMAP dimension 1',
                    y='UMAP dimension 2',
                    z='UMAP dimension 3',
                    color='Cluster_UMAP')

  for trace in fig.data:
    trace.marker.size = spot_size/10

  fig.show()
  pyo.plot(fig, filename=os.path.join(Results_Folder, "Umap", "HDBSCAN_clusters.html"), auto_open=False)

## **2.4. Fingerprint**
---

<font size = 4>This section is designed to visualize the distribution of different clusters within each condition in a dataset, showing the 'fingerprint' of each cluster per condition.
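
<font size = 4>As a minimal sketch of what such a fingerprint contains, the example below computes the percentage of tracks per cluster within each condition for a small made-up table. It uses `pd.crosstab` as a compact alternative to the group-and-merge steps in the cell that follows; the table `toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up cluster assignments; -1 marks tracks that HDBSCAN labeled as noise.
toy = pd.DataFrame({
    'Condition':    ['A', 'A', 'A', 'A', 'B', 'B'],
    'Cluster_UMAP': [0,   0,   1,   -1,  1,   1],
})

# Row-normalized contingency table: each row (condition) sums to 100 %.
fingerprint = pd.crosstab(toy['Condition'], toy['Cluster_UMAP'], normalize='index') * 100

print(fingerprint)
```

<font size = 4>Each row is the fingerprint of one condition: here half of the tracks in condition A fall into cluster 0, while all tracks in condition B fall into cluster 1.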

In [None]:
# @title ##Plot the 'fingerprint' of each cluster per condition

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Group by 'Condition' and 'Cluster_UMAP' and calculate the size of each group
cluster_counts = umap_df.groupby(['Condition', 'Cluster_UMAP']).size().reset_index(name='counts')

# Calculate the total number of points per condition
total_counts = umap_df.groupby('Condition').size().reset_index(name='total_counts')

# Merge the DataFrames on 'Condition' to calculate percentages
percentage_df = pd.merge(cluster_counts, total_counts, on='Condition')
percentage_df['percentage'] = (percentage_df['counts'] / percentage_df['total_counts']) * 100

# Save the percentage_df DataFrame as a CSV file
percentage_df.to_csv(Results_Folder+'/Umap/UMAP_percentage_results.csv', index=False)

# Pivot the percentage_df to have Conditions as index, Clusters as columns, and percentages as values
pivot_df = percentage_df.pivot(index='Condition', columns='Cluster_UMAP', values='percentage')

# Fill NaN values with 0 if any, as there might be some Condition-Cluster combinations that are not present
pivot_df.fillna(0, inplace=True)

# Initialize PDF
pdf_pages = PdfPages(os.path.join(Results_Folder, 'Umap', 'UMAP_Cluster_Fingerprint_Plot.pdf'))

# Plotting
fig, ax = plt.subplots(figsize=(10, 7))
pivot_df.plot(kind='bar', stacked=True, ax=ax, colormap='viridis')
plt.title('Percentage in each cluster per Condition')
plt.ylabel('Percentage')
plt.xlabel('Condition')
plt.xticks(rotation=90)
plt.tight_layout()

# Save the figure to a PDF
pdf_pages.savefig(fig)

# Close the PDF
pdf_pages.close()

# Display the plot
plt.show()



## **2.5. Understand your clusters using heatmaps**
--------

<font size = 4>This section helps visualize how different track parameters vary across the identified clusters. The variations are displayed as a heatmap, which offers a color-coded representation of the median value of each parameter for each cluster. This visualization makes it easier to spot differences or patterns among the clusters.
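
<font size = 4>The sketch below shows one way to build the kind of table such a heatmap displays: the median of each parameter per cluster, followed by a z-score across clusters so that parameters with different units share one color scale. This is an assumption about the general approach and not the implementation of `heatmap_comparison`; the table `toy` and the metric names `speed` and `directionality` are hypothetical.

```python
import pandas as pd

# Made-up track metrics for three clusters.
toy = pd.DataFrame({
    'Cluster_UMAP':   [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'speed':          [1.0, 2.0, 3.0, 4.0, 5.0, 9.0, 7.0, 8.0, 9.0],
    'directionality': [0.2, 0.4, 0.9, 0.8, 0.6, 0.7, 0.3, 0.4, 0.5],
})

# Median of every metric within each cluster (rows: clusters, columns: metrics).
medians = toy.groupby('Cluster_UMAP').median()

# Z-score each metric across clusters (population standard deviation, ddof=0).
normalized = (medians - medians.mean()) / medians.std(ddof=0)

print(medians)
print(normalized.round(3))
```

<font size = 4>The `normalized` table could then be passed to a plotting function such as `sns.heatmap` to obtain a color-coded view.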


In [None]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import pandas as pd
from scipy.stats import zscore

# @title ##Plot normalized track parameters per cluster as a heatmap

# Parameters to adapt depending on the notebook section
base_folder = os.path.join(Results_Folder, "Umap", "Track_parameters")
Conditions = 'Cluster_UMAP'

# Check and create necessary directories
folders = ["pdf", "csv"]
for folder in folders:
    dir_path = os.path.join(base_folder, folder)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)


heatmap_comparison(merged_tracks_df, base_folder, Conditions)

## **2.6. Understand your clusters using box plots**
--------

<font size = 4>The provided code aims to visually represent the distribution of different track parameters across the identified clusters. Specifically, for each parameter selected, a boxplot is generated to showcase the spread of its values across different clusters. This approach provides a comprehensive view of how each track parameter varies within and across the clusters.




In [None]:
# @title ##Plot track parameters for each cluster

base_folder = f"{Results_Folder}/Umap/Track_parameters/"
Cluster = "Cluster_UMAP"

if not os.path.exists(base_folder):
  os.makedirs(base_folder)

checkboxes_dict, checkboxes_accordion = display_variable_checkboxes(categorize_columns(merged_tracks_df))
variable_checkboxes, checkboxes_widget = display_variable_checkboxes(get_selectable_columns_plots(merged_tracks_df))

# Create and display the plot button
button = widgets.Button(description="Plot Selected Variables", layout=widgets.Layout(width='400px'))
button.on_click(lambda b: plot_selected_vars_per_cluster(b, Cluster, checkboxes_dict, merged_tracks_df, base_folder));

# Display the UI components
display(VBox([checkboxes_accordion, button]))

## **2.7. Plot track parameters for a selected cluster**
---

In [None]:
# @title ##Plot track parameters for a selected cluster

import ipywidgets as widgets
from ipywidgets import Layout, VBox, Button, Accordion
import pandas as pd
import os
import itertools
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.ticker import FixedLocator
from matplotlib.gridspec import GridSpec
import seaborn as sns

# Parameters to adapt depending on the notebook section
base_folder = f"{Results_Folder}/Umap/cluster_plots"
Conditions = 'Condition'
df_to_plot = merged_tracks_df
Cluster = "Cluster_UMAP"

# Check and create necessary directories
folders = ["pdf", "csv"]
for folder in folders:
    dir_path = os.path.join(base_folder, folder)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

condition_selector, condition_accordion = display_condition_selection(df_to_plot, Conditions)
checkboxes_dict, checkboxes_accordion = display_variable_checkboxes(categorize_columns(df_to_plot))
variable_checkboxes, checkboxes_widget = display_variable_checkboxes(get_selectable_columns_plots(df_to_plot))
stat_method_selector = widgets.Dropdown(
    options=['randomization test', 't-test'],
    value='randomization test',
    description='Stat Method:',
    style={'description_width': 'initial'}
)
cluster_dropdown = display_cluster_dropdown(merged_tracks_df, Cluster)

button = Button(description="Plot Selected Variables", layout=Layout(width='400px'), button_style='info')
button.on_click(lambda b: plot_selected_vars_cluster(b, checkboxes_dict, df_to_plot, Conditions, Cluster, cluster_dropdown, base_folder, condition_selector, stat_method_selector));

display(VBox([condition_accordion, checkboxes_accordion, stat_method_selector, cluster_dropdown, button]))


## **2.8. Identify exemplar tracks from each cluster**
---

<font size = 4>Exemplars, in the context of clustering analysis, refer to representative data points that are selected to encapsulate the essential characteristics of a cluster. They are often chosen because they are central or prototypical members of a cluster, making them valuable for summarizing the cluster's properties. In the provided code, exemplars are identified using the HDBSCAN clustering algorithm and marked within the dataset.

<font size = 4>**Keep in mind that not all clusters will have exemplars**.
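
<font size = 4>In the hdbscan library, the `exemplars_` attribute is a list with one array of exemplar point coordinates per cluster, not a list of row numbers. The sketch below shows one way to map such coordinates back to row positions in the data that was clustered; the helper `rows_matching_points` and the toy arrays are hypothetical and stand in for the real embedding and exemplars.

```python
import numpy as np

def rows_matching_points(data, points):
    """Return the positional indices of rows in `data` equal to any row of `points`."""
    data = np.asarray(data)
    points = np.asarray(points)
    # Compare every data row with every point: shape (n_rows, n_points, n_features).
    matches = (data[:, None, :] == points[None, :, :]).all(axis=-1)
    return np.flatnonzero(matches.any(axis=1))

# Made-up 2D embedding with two clusters of two points each.
embedding = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# Stand-in for clusterer.exemplars_: one array of exemplar coordinates per cluster.
exemplars_per_cluster = [np.array([[0.1, 0.2]]), np.array([[5.0, 5.0]])]

indices = rows_matching_points(embedding, np.vstack(exemplars_per_cluster))
print(indices.tolist())  # [1, 2]
```

<font size = 4>The resulting positions can be used with `DataFrame.iloc` to retrieve the corresponding tracks. Exact equality works here because the exemplars are copies of rows of the clustered data.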


In [None]:
import plotly.express as px  # Importing plotly for 3D plots

# @title ##Identify exemplar tracks using HDBSCAN

#@markdown ###Display parameters:
spot_size = 15 # @param {type: "number"}

# Check and create the "Examplar" folder
if not os.path.exists(f"{Results_Folder}/Umap/Examplar"):
    os.makedirs(f"{Results_Folder}/Umap/Examplar")

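# Extracting exemplar points
# clusterer.exemplars_ holds one array of exemplar coordinates per cluster (not row indices),
# so the coordinates are matched back to the rows of the data HDBSCAN was fitted on.
if clustering_data_source == 'umap':
    fitted_data = umap_df[[f'UMAP dimension {i}' for i in range(1, n_dimension + 1)]].to_numpy()
else:
    fitted_data = selected_df.select_dtypes(include=['number']).to_numpy()
exemplar_points = np.vstack(clusterer.exemplars_)

# Positional indices of the rows whose coordinates equal an exemplar point
flattened_exemplars = [i for i, row in enumerate(fitted_data) if (exemplar_points == row).all(axis=1).any()]

# Now pass the list of indices to iloc (copy so that a column can be added safely)
exemplar_df = umap_df.iloc[flattened_exemplars].copy()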

# Deduplicate exemplar_df based on 'Unique_ID' (copy so the column assignment below does not act on a view)
exemplar_df = exemplar_df.drop_duplicates(subset='Unique_ID').copy()

# Create a new column in exemplar_df to indicate it's an exemplar
exemplar_df['Exemplar'] = 1

# If the Exemplar column already exists in merged_tracks_df, drop it to avoid duplications
if 'Exemplar' in merged_tracks_df.columns:
    merged_tracks_df.drop(columns='Exemplar', inplace=True)

# Merge the Exemplar column from exemplar_df to merged_tracks_df based on Unique_ID
merged_tracks_df = pd.merge(merged_tracks_df, exemplar_df[['Unique_ID', 'Exemplar']], on='Unique_ID', how='left')

# Handle cases where some rows in merged_tracks_df might not have a corresponding Exemplar label
merged_tracks_df['Exemplar'] = merged_tracks_df['Exemplar'].fillna(0)  # Assigning 0 to cells that were not identified as exemplars

# Save the DataFrame with the identified clusters Exemplar label
merged_tracks_df.to_csv(Results_Folder + '/' + 'merged_Tracks.csv', index=False)


# Plotting clusters and exemplar points
if n_dimension == 1:
    plt.figure(figsize=(12,10))
    sns.stripplot(x='UMAP dimension 1', hue='Cluster_UMAP', data=umap_df, palette='viridis', jitter=0.05, size=spot_size)
    sns.stripplot(x='UMAP dimension 1', color='red', label='Exemplars', data=exemplar_df, jitter=0.05, size=spot_size, marker='X')
    plt.title('Clusters and Exemplar tracks Identified by HDBSCAN')
    plt.show()

elif n_dimension == 2:
    plt.figure(figsize=(12,10))
    sns.scatterplot(x='UMAP dimension 1', y='UMAP dimension 2', hue='Cluster_UMAP', palette='viridis', data=umap_df, s=spot_size)
    sns.scatterplot(x='UMAP dimension 1', y='UMAP dimension 2', color='red', label='Exemplars', data=exemplar_df, s=spot_size*2, marker='X')
    plt.title('Clusters and Exemplar tracks Identified by HDBSCAN')
    plt.show()

elif n_dimension == 3:
    fig = px.scatter_3d(umap_df,
                        x='UMAP dimension 1',
                        y='UMAP dimension 2',
                        z='UMAP dimension 3',
                        color='Cluster_UMAP',
                        color_discrete_sequence=px.colors.qualitative.Vivid)

    # Add a new column for coloring exemplars
    exemplar_df['ExemplarColor'] = 'Exemplar'
    exemplar_fig = px.scatter_3d(exemplar_df,
                                 x='UMAP dimension 1',
                                 y='UMAP dimension 2',
                                 z='UMAP dimension 3',
                                 color='ExemplarColor',
                                 color_discrete_map={'Exemplar':'red'})

    for trace in fig.data:
        trace.marker.size = spot_size

    for trace in exemplar_fig.data:
        trace.marker.size = spot_size
        trace.marker.symbol = 'x'

    fig.add_trace(exemplar_fig.data[0])
    fig.show()
    pyo.plot(fig, filename=f"{Results_Folder}/Umap/Examplar/HDBSCAN_examplar.html", auto_open=False)


## **2.9. See the exemplar tracks**
---

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.backends.backend_pdf import PdfPages

# @title ##Plot the exemplar tracks for each cluster

# Create the exemplar output folder if it does not exist
if not os.path.exists(f"{Results_Folder}/Umap/Examplar"):
    os.makedirs(f"{Results_Folder}/Umap/Examplar")


# Extracting actual indices for exemplar rows
exemplar_indices = umap_df.iloc[flattened_exemplars].index

# Add a new column to umap_df to indicate if a point is an exemplar
umap_df['Exemplar'] = 0

# Mark the rows corresponding to exemplars as 1
umap_df.loc[exemplar_indices, 'Exemplar'] = 1

# Determine max and min coordinates from the DataFrame
min_x = merged_spots_df['POSITION_X'].min()
max_x = merged_spots_df['POSITION_X'].max()
min_y = merged_spots_df['POSITION_Y'].min()
max_y = merged_spots_df['POSITION_Y'].max()

# Extract exemplars from the umap_df
exemplar_info = umap_df[umap_df['Exemplar'] == 1]

# Determine the unique clusters from exemplar_info
clusters = exemplar_info['Cluster_UMAP'].unique()

# Create a PDF object to save the plots
with PdfPages(f"{Results_Folder}/Umap/Examplar/Examplar_tracks_Clusters.pdf") as pdf:

    # Iterate over each cluster
    for cluster in clusters:

        # Start a new figure for this cluster
        plt.figure(figsize=(7, 7))

        # Extract unique IDs for the current cluster from exemplar_info
        cluster_unique_ids = exemplar_info[exemplar_info['Cluster_UMAP'] == cluster]['Unique_ID'].tolist()

        # For each unique ID in the cluster, plot the track
        for unique_id in cluster_unique_ids:

            # Filter dataframe based on the unique ID
            unique_df = merged_spots_df[merged_spots_df['Unique_ID'] == unique_id].sort_values(by='POSITION_T')

            # Color code tracks based on 'Condition' using seaborn's color palette
            color = sns.color_palette('husl', n_colors=merged_spots_df['Condition'].nunique())[merged_spots_df['Condition'].unique().tolist().index(unique_df['Condition'].iloc[0])]

            plt.plot(unique_df['POSITION_X'], unique_df['POSITION_Y'], marker='o', linestyle='-', markersize=2, color=color, label=unique_df['Condition'].iloc[0])

            # Set title for the subplot
            plt.title(f'Coordinates for Cluster {cluster}')

            # Limit the plot dimensions based on your data's extent
            plt.xlim(min_x - 1, max_x + 1)
            plt.ylim(min_y - 1, max_y + 1)

            # Add legend to differentiate tracks based on condition
            plt.legend(loc='best')

            plt.xlabel('POSITION_X')
            plt.ylabel('POSITION_Y')

        # Adjust layout to avoid overlap before saving
        plt.tight_layout()

        # Save the figure in the PDF
        pdf.savefig()

        plt.show()


## **2.10. Find the exemplar on your raw images**
--------


<font size = 4>This Python script serves as a user-friendly tool for visualizing exemplar tracks within your microscopy video.

<font size = 4>To utilize it effectively, **users should provide the path to the directory containing the raw stacks of their data**.
*  Use .tif or .tiff files only.
*  Ensure that each TIFF file has the same name as the corresponding CSV file used in the analysis.
*  The TIFF files can be in the same folder as your CSV files.

<font size = 4>Additionally, users are required to **specify the pixel calibration** value to accurately scale the visualization.
*   Use the same calibration as the one used before


<font size = 4>With these inputs, the script automates the retrieval of matching TIFF files, adjusts for pixel calibration, and overlays vital information on video frames. Users can interactively select a specific cluster and initiate the visualization process with a single click.
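
<font size = 4>The calibration step can be sketched as follows, assuming that track positions are stored in calibrated units and that the pixel calibration is expressed as units per pixel. The helper `to_pixel_index` is hypothetical and only illustrates the conversion. It also guards against a missing calibration value, since the field is empty by default.

```python
def to_pixel_index(position, pixel_calibration):
    """Convert a calibrated position into an integer pixel index."""
    if pixel_calibration is None or pixel_calibration <= 0:
        raise ValueError("Pixel_calibration must be a positive number")
    # Divide by the calibration, then round to the nearest pixel
    return int(round(position / pixel_calibration))

x_px = to_pixel_index(12.8, 0.5)
y_px = to_pixel_index(10.0, 0.5)
print(x_px, y_px)
```

<font size = 4>The notebook cells truncate with `int()` instead of rounding, so the two approaches can differ by one pixel for positions that fall between pixel centres.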

In [None]:
import requests
import zipfile
import os
from tqdm import tqdm
from tifffile import imread
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, widgets, Button, Output
from IPython.display import clear_output

# @title #Find your exemplar tracks

Use_test_dataset = False

#@markdown ##Provide the path to the folder containing your .tif files

Video_path = ''  # @param {type: "string"}


#@markdown ###Provide the pixel calibration

Pixel_calibration = None # @param {type: "number"}


# Function to visualize a track for a cluster
def visualize_track_for_cluster(cluster_number):
    # Filter merged_tracks_df for exemplars and the selected cluster
    exemplar_tracks = merged_tracks_df[(merged_tracks_df['Cluster_UMAP'] == cluster_number) & (merged_tracks_df['Exemplar'] == 1)]

    if exemplar_tracks.empty:
        display_error_message("No exemplar found for this cluster.")
        return

    for idx, track in exemplar_tracks.iterrows():
        # Get the filename
        filename = track['File_name']

        # Find the corresponding tiff file
        full_path = find_matching_tiff_file(Video_path, filename)

        if not full_path:
            display_error_message(f"No matching .tif or .tiff file found for filename: {filename}")
            continue

        # Load the movie
        movie = imread(full_path)

        if len(movie.shape) != 3:
            display_error_message(f"Warning: The loaded movie from file '{filename}' is not 2D over time.")
            continue

        # Fetch the track coordinates from merged_spots_df and adjust for calibration
        track_id = track['Unique_ID']
        track_coordinates = merged_spots_df[merged_spots_df['Unique_ID'] == track_id][['POSITION_T', 'POSITION_X', 'POSITION_Y']].copy()
        track_coordinates['POSITION_X'] = track_coordinates['POSITION_X'] / Pixel_calibration
        track_coordinates['POSITION_Y'] = track_coordinates['POSITION_Y'] / Pixel_calibration

        # Define a function to update the display based on the frame slider
        def update_display(frame_number):
            plt.figure(figsize=(10, 10))
            frame_with_square = movie[frame_number, :, :].copy()
            coords_for_frame = track_coordinates[track_coordinates['POSITION_T'] == frame_number]
            if not coords_for_frame.empty:
                x, y = int(coords_for_frame['POSITION_X'].values[0]), int(coords_for_frame['POSITION_Y'].values[0])
                frame_with_square = overlay_square_on_frame(frame_with_square, x, y)
            plt.imshow(frame_with_square, cmap='gray')
            plt.title(f"Frame {frame_number} for Exemplar in Cluster {cluster_number} from file {filename}")
            plt.show()

        # Create a slider for frame navigation
        frame_slider = widgets.IntSlider(min=0, max=len(movie) - 1, description='Frame')

        # Display the visualization with interactive for more reactive updates
        w = interactive(update_display, frame_number=frame_slider)
        display(w)  # This line explicitly displays the widget
        break  # Only display for the first matching exemplar for the sake of demonstration


# Error message widget
error_output = Output()

if Use_test_dataset:
  Video_path = '/content/Tracks/Tracks'

# Dropdown widget for cluster selection
clusters = merged_tracks_df['Cluster_UMAP'].unique()
cluster_dropdown = widgets.Dropdown(
    options=clusters,
    description='Select Cluster:',
    disabled=False,
)

# Button to trigger visualization
plot_button = Button(description="Plot")

# Function to handle button click
def on_plot_button_click(b):
    cluster_number = cluster_dropdown.value
    # Clear the previous output
    clear_output()

    display(cluster_dropdown)
    display(plot_button)
    display(error_output)
    visualize_track_for_cluster(cluster_number)

# Bind the function to the button click event
plot_button.on_click(on_plot_button_click)

# Display the widgets
display(cluster_dropdown)
display(plot_button)
display(error_output)


## **2.11. Export movies of the exemplar tracks**
---
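
<font size = 4>Before writing MP4 files, the export cell in this section converts each movie to 8-bit using `percentile_normalize_and_convert_uint8`, a helper imported from the main CellTracksColab code whose implementation is not shown here. The sketch below illustrates what a percentile-based conversion of this kind might look like. The function name `percentile_to_uint8` and the 1st/99th percentile defaults are assumptions, not the values used by the actual helper.

```python
import numpy as np

def percentile_to_uint8(stack, low=1.0, high=99.0):
    """Rescale `stack` so that the `low` and `high` percentiles map to 0 and 255."""
    stack = np.asarray(stack, dtype=np.float64)
    lo, hi = np.percentile(stack, [low, high])
    if hi <= lo:
        # Flat image: there is no contrast to stretch
        return np.zeros(stack.shape, dtype=np.uint8)
    # Stretch to [0, 1], clipping the values outside the percentile window
    scaled = np.clip((stack - lo) / (hi - lo), 0.0, 1.0)
    return (scaled * 255).round().astype(np.uint8)

# Toy 16-bit movie with 2 frames of 4 x 4 pixels
movie = np.arange(2 * 4 * 4, dtype=np.uint16).reshape(2, 4, 4) * 100
movie_8bit = percentile_to_uint8(movie)
print(movie_8bit.dtype, movie_8bit.shape)
```

<font size = 4>Clipping at percentiles rather than at the absolute minimum and maximum keeps a few very bright pixels from compressing the rest of the intensity range.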

In [None]:
# @title ##Export movies with the exemplar tracks labelled

import os
import numpy as np
from tifffile import imread, imwrite  # imread is used below to load each movie
from tqdm.notebook import tqdm
import imageio

# Create a directory to store the exported videos
video_export_folder = Results_Folder + "/Umap/Examplar/Exported_Videos"
if not os.path.exists(video_export_folder):
    os.makedirs(video_export_folder)

# Iterate over all exemplar tracks
for idx, track in tqdm(merged_tracks_df[merged_tracks_df['Exemplar'] == 1].iterrows(), total=merged_tracks_df[merged_tracks_df['Exemplar'] == 1].shape[0]):
    # Get the filename and cluster number
    filename = track['File_name']
    cluster_num = track['Cluster_UMAP']

    # Find the corresponding tiff file
    full_path = find_matching_tiff_file(Video_path, filename)

    if not full_path:
        print(f"No matching .tif or .tiff file found for filename: {filename}")
        continue

    # Load the movie
    movie = imread(full_path)

    # Check dimensions to ensure 2D video
    if len(movie.shape) != 3:
        print(f"Skipping {filename} as it is not a 2D video.")
        continue

    # Fetch the track coordinates from merged_spots_df
    track_id = track['Unique_ID']
    track_coordinates = merged_spots_df[merged_spots_df['Unique_ID'] == track_id][['POSITION_T', 'POSITION_X', 'POSITION_Y']]

    # Overlay the track on the video using the overlay_square_on_frame function
    for _, coord in track_coordinates.iterrows():
        frame_number = int(coord['POSITION_T'])
        x = int(coord['POSITION_X'] / Pixel_calibration)
        y = int(coord['POSITION_Y'] / Pixel_calibration)
        movie[frame_number] = overlay_square_on_frame(movie[frame_number], x, y)

    # Incorporate the cluster number in the output filenames
    output_video_path_tiff = os.path.join(video_export_folder, f"{filename}_Cluster_{cluster_num}_with_track.tiff")
    output_video_path_mp4 = os.path.join(video_export_folder, f"{filename}_Cluster_{cluster_num}_with_track.mp4")

    # Save the video with overlaid track as TIFF
    imwrite(output_video_path_tiff, movie)

    # Normalize and convert the movie to uint8
    movie_uint8 = percentile_normalize_and_convert_uint8(movie)

    # Convert and save as MP4
    imageio.mimwrite(output_video_path_mp4, movie_uint8, fps=10)

print("Video export completed.")



--------
# **Part 3. Explore your high-dimensional data using t-SNE and HDBSCAN**
--------

## **3.1. Choose the track metrics to use for clustering**
--------

In [None]:
# @title ##Choose the track metrics to use

import ipywidgets as widgets
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Create the "Tsne" results folder if it does not exist
if not os.path.exists(f"{Results_Folder}/Tsne"):
    os.makedirs(f"{Results_Folder}/Tsne")

excluded_columns = ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION', 'Exemplar','TRACK_STOP', 'TRACK_START', 'Cluster_UMAP', 'Cluster_tsne']

# Columns you want to always include
columns_to_include = ['File_name', 'Repeat', 'Condition', 'Unique_ID']

selected_df = pd.DataFrame()
nan_columns = pd.DataFrame()
# Extract the columns you always want to include and ensure they exist in the original dataframe
saved_columns = {col: merged_tracks_df[col].copy() for col in columns_to_include if col in merged_tracks_df}

# Filter out non-numeric columns
numeric_df = merged_tracks_df.select_dtypes(include=['float64', 'int64'])  # Selecting only numeric columns

column_names = [col for col in numeric_df.columns if col not in excluded_columns]

# Text area for user to paste the list of metrics
text_area = widgets.Textarea(
    value='',
    placeholder='Copy and paste your list of metrics here, separated by commas.',
    description='Metrics:',
    disabled=False,
    layout=widgets.Layout(width='100%', height='100px')
)


# Function to parse the text area content into a list
def parse_text_area(text):
    return [item.strip() for item in text.split(',') if item.strip() in column_names]

# Create a checkbox for each column
def create_checkboxes(parsed_metrics):
    return [widgets.Checkbox(value=(col in parsed_metrics or not parsed_metrics), description=col, indent=False) for col in column_names]

checkboxes = create_checkboxes(column_names)  # Initialize with all metrics

# Grid for displaying checkboxes
grid = widgets.GridBox(checkboxes, layout=widgets.Layout(grid_template_columns="repeat(2, 300px)"))

# Create a button to trigger the selection
button = widgets.Button(description="Select the track parameters",
                        layout=widgets.Layout(width='400px'),
                        button_style='info')


def on_button_click(b):
    global selected_df
    global nan_columns
    parsed_metrics = parse_text_area(text_area.value)
    selected_columns = [box.description for box in checkboxes if box.value]

    # Extract the selected columns from the DataFrame
    selected_df = numeric_df[selected_columns].copy()

    # Prepare the parameters dictionary
    Tsne_params = {
        'Selected Columns': ', '.join(selected_columns)
    }

    # Save the parameters
    params_file_path = os.path.join(Results_Folder, "Tsne/analysis_parameters.csv")
    save_parameters(Tsne_params, params_file_path, 'Tsne')

    # Add back the always-included columns to selected_df
    for col, data in saved_columns.items():
        selected_df.loc[:, col] = data

    # Drop the rows that contain NaN values in any of the selected columns
    nan_columns = selected_df.columns[selected_df.isna().any()].tolist()
    if nan_columns:
        for col in nan_columns:
            selected_df = selected_df.dropna(subset=[col])  # Drop NaN values only from columns containing them

    print("Done")

# Set the button click event handler
button.on_click(on_button_click)

# Function to update checkboxes based on text area input
def update_checkboxes(b):
    parsed_metrics = parse_text_area(text_area.value)
    global checkboxes
    checkboxes = create_checkboxes(parsed_metrics)
    grid.children = checkboxes

# Update checkboxes when text area content changes
text_area.observe(update_checkboxes, names='value')

# Display the text area, grid of checkboxes, and the button
display(text_area, grid, button)

## **3.2. t-SNE**
--------

The code snippet provided performs **t-Distributed Stochastic Neighbor Embedding (t-SNE)**, a powerful technique for dimensionality reduction, particularly suited for the visualization of high-dimensional datasets. The process is applied to the merged tracks dataframe, focusing on its numeric columns, with the goal of visualizing the data in a lower-dimensional space.

**Key Parameters of t-SNE:**

- **Perplexity (`perplexity`):**
  - This parameter is a measure of the effective number of local neighbors each point has.
  - Perplexity influences the t-SNE algorithm's ability to capture local versus global aspects of the data.
  - Typical values for perplexity range between 5 and 50, with the choice depending on dataset size and density.

- **Learning Rate (`learning_rate`):**
  - This parameter controls the step size in the optimization process.
  - A suitable learning rate helps t-SNE to converge to a meaningful low-dimensional representation.
  - Values too high might cause the algorithm to converge to a suboptimal solution, while too low values can slow down the convergence.

- **Number of Iterations (`n_iter`):**
  - This parameter defines the number of optimization iterations t-SNE will run.
  - A higher number of iterations allows the algorithm more time to find a stable configuration.
  - Generally, a value of 1000 iterations is sufficient for most datasets.

- **Number of Dimensions (`n_dimension`):**
  - The target dimensionality for the lower-dimensional space.
  - For visualization purposes, this is commonly set to 2, allowing the data to be plotted in a 2D scatter plot.
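
To make the idea of an "effective number of neighbors" concrete, perplexity is defined as 2 raised to the Shannon entropy (in bits) of the neighbor probability distribution of a point. The sketch below computes it for two hand-written distributions. The probability values are illustrative only and do not come from the dataset.

```python
import math

def perplexity(probabilities):
    """Perplexity of a discrete distribution: 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

# Eight equally likely neighbors
uniform = [1 / 8] * 8
# One dominant neighbor and seven unlikely ones
peaked = [0.93] + [0.01] * 7

print(perplexity(uniform))
print(perplexity(peaked))
```

A distribution spread evenly over eight neighbors has a perplexity of 8, while one dominated by a single neighbor has a perplexity close to 1. Setting `perplexity` therefore asks t-SNE to tune each point's neighborhood width until roughly that many neighbors carry most of the probability.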


In [None]:
# @title ##Perform t-SNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
import pandas as pd

# Check and create necessary directories
tsne_folder_path = f"{Results_Folder}/Tsne/"
if not os.path.exists(tsne_folder_path):
    os.makedirs(tsne_folder_path)

#@markdown ###t-SNE parameters:

perplexity = 20  # @param {type: "number"}
learning_rate = 100  # @param {type: "number"}
n_iter = 1000  # @param {type: "number"}
n_dimension = 2  # The number of dimensions is set to 2 for t-SNE as standard practice

#@markdown ###Display parameters:
spot_size = 15  # @param {type: "number"}

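# Initialize t-SNE object with the specified settings
# Note: recent scikit-learn releases rename `n_iter` to `max_iter`; adjust the keyword if your version rejects it
tsne = TSNE(n_components=n_dimension, perplexity=perplexity, learning_rate=learning_rate, n_iter=n_iter, random_state=42)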

# Exclude non-numeric columns when fitting t-SNE (public API instead of the private _get_numeric_data)
numeric_columns = selected_df.select_dtypes(include='number')
embedding = tsne.fit_transform(numeric_columns)

# Prepare the parameters dictionary
tsne_params = {
    'perplexity': perplexity,
    'learning_rate': learning_rate,
    'n_iter': n_iter,
    'n_dimension': n_dimension,
    'spot_size': spot_size
}

# Save the parameters
params_file_path = os.path.join(Results_Folder, "Tsne/analysis_parameters.csv")
save_parameters(tsne_params, params_file_path, 'tsne')

# Create dynamic column names based on n_dimension
column_names = [f't-SNE dimension {i+1}' for i in range(n_dimension)]

# Extract the columns_to_include from selected_df
included_data = selected_df[columns_to_include].reset_index(drop=True)

# Concatenate the t-SNE embedding with the included columns
tsne_df = pd.concat([pd.DataFrame(embedding, columns=column_names), included_data], axis=1)

# Check if the DataFrame has any NaN values and print a warning if it does.
nan_columns = tsne_df.columns[tsne_df.isna().any()].tolist()
if nan_columns:
  warnings.warn(f"The DataFrame contains NaN values in the following columns: {', '.join(nan_columns)}")
  tsne_df.dropna(subset=nan_columns, inplace=True)  # Drop NaN values only from columns containing them

# Visualize the t-SNE projection
plt.figure(figsize=(12, 10))
sns.scatterplot(x=column_names[0], y=column_names[1], hue='Condition', data=tsne_df, palette='Set2', s=spot_size)
plt.title('t-SNE Projection of the Dataset')
tsne_output_path = os.path.join(tsne_folder_path, 'tsne_projection_2D.pdf')
plt.savefig(tsne_output_path)  # Save 2D plot as PDF
plt.show()


## **3.3. HDBSCAN**
---

<font size = 4>The provided code employs HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to identify clusters within a dataset that has already undergone t-SNE dimensionality reduction. HDBSCAN is used because it infers the number of clusters from the data and copes with clusters of varying density.

<font size = 4>In the provided HDBSCAN code, the parameters `min_samples`, `min_cluster_size`, and `metric` are crucial for determining the structure and appearance of the resulting clusters in the data.

<font size = 4>`min_samples`: This parameter primarily controls the degree to which the algorithm is willing to declare noise. It's the number of samples in a neighborhood for a point to be considered as a core point.
- A smaller value of `min_samples` makes the algorithm more prone to declaring points as part of a cluster, potentially leading to larger clusters and fewer noise points.
- A larger value makes the algorithm more conservative, resulting in more points declared as noise and smaller, more defined clusters.
- The choice of `min_samples` typically depends on the density of the data; denser datasets may require a larger value.
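
<font size = 4>One way to build intuition for `min_samples` is through the core distance, taken here as the distance from a point to its k-th nearest neighbor. The sketch below uses NumPy only and does not count the point itself as a neighbor. The hdbscan library may count neighbors slightly differently, so treat this as an illustration of the concept rather than a reproduction of its internals. The function name and the toy positions are hypothetical.

```python
import numpy as np

def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (the point itself excluded)."""
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]
    # Full pairwise Euclidean distance matrix
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # After sorting each row, column 0 is the point itself (distance 0)
    return np.sort(dist, axis=1)[:, k]

# Three tightly packed points and one isolated point
positions = [0.0, 1.0, 2.0, 10.0]
core = core_distances(positions, k=2)
print(core.tolist())
```

<font size = 4>The isolated point has a much larger core distance than the others. Increasing k raises the core distances of sparsely surrounded points further, which is why larger `min_samples` values tend to label more points as noise.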

<font size = 4>`min_cluster_size`: This parameter determines the smallest size grouping that you wish to consider a cluster.
- A smaller value will allow the formation of smaller clusters, whereas a larger value will prevent small isolated groups of points from being declared as clusters.
- The choice of `min_cluster_size` depends on the scale of the data and the desired level of granularity in the clustering.

<font size = 4>`metric`: This parameter is the metric used for distance computation between data points, and it affects the shape of the clusters.
- The `euclidean` metric is a good starting point, and depending on the clustering results and the data type, it might be beneficial to experiment with different metrics.
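
<font size = 4>To show how the choice of metric changes the notion of distance, the sketch below computes four of the metrics offered in the dropdown for a single pair of points, using their standard textbook definitions. The two points are arbitrary examples.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    total = 0.0
    for x, y in zip(a, b):
        denominator = abs(x) + abs(y)
        if denominator > 0:  # terms of the form 0/0 are treated as 0
            total += abs(x - y) / denominator
    return total

point_a = (1.0, 0.0)
point_b = (4.0, 4.0)
for metric_function in (euclidean, manhattan, chebyshev, canberra):
    print(metric_function.__name__, metric_function(point_a, point_b))
```

<font size = 4>Canberra divides each coordinate difference by the magnitude of the coordinates, so the same absolute gap counts for more near zero than far from it. This can change which points appear close to each other compared with the Euclidean metric.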


In [None]:
# @title ##Identify clusters using HDBSCAN
import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

#@markdown ###HDBSCAN parameters:
clustering_data_source = 'tsne'  # @param ['tsne', 'raw']
min_samples = 20  # @param {type: "number"}
min_cluster_size = 200  # @param {type: "number"}
metric = "canberra"  # @param ['euclidean', 'manhattan', 'chebyshev', 'braycurtis', 'canberra']

#@markdown ###Display parameters:
spot_size = 15 # @param {type: "number"}

# Apply HDBSCAN
clusterer = hdbscan.HDBSCAN(min_samples=min_samples, min_cluster_size=min_cluster_size, metric=metric)


# Prepare the parameters dictionary
tsne_params = {
    'clustering_data_source': clustering_data_source,
    'min_samples': min_samples,
    'min_cluster_size': min_cluster_size,
    'metric': metric
}

# Save the parameters
params_file_path = os.path.join(Results_Folder, "Tsne/analysis_parameters.csv")
save_parameters(tsne_params, params_file_path, 'tsne')

# Depending on the data source, we fit HDBSCAN to the t-SNE dimensions or the raw data
if clustering_data_source == 'tsne':
    # We only have two t-SNE dimensions based on the previous t-SNE code provided
    clusterer.fit(tsne_df[['t-SNE dimension 1', 't-SNE dimension 2']])
else:
    # If raw data is selected, we use all the numerical columns for clustering
    clusterer.fit(selected_df.select_dtypes(include=['number']))

# Add the cluster labels to your t-SNE DataFrame
tsne_df['Cluster_tsne'] = clusterer.labels_

# If the Cluster column already exists in merged_tracks_df, drop it to avoid duplications
if 'Cluster_tsne' in merged_tracks_df.columns:
    merged_tracks_df.drop(columns='Cluster_tsne', inplace=True)

# Merge the Cluster column from tsne_df to merged_tracks_df based on Unique_ID
merged_tracks_df = pd.merge(merged_tracks_df, tsne_df[['Unique_ID', 'Cluster_tsne']], on='Unique_ID', how='left')

# Handle cases where some rows in merged_tracks_df might not have a corresponding cluster label
merged_tracks_df['Cluster_tsne'] = merged_tracks_df['Cluster_tsne'].fillna(-1)  # Assigning -1 to cells that were not assigned to any cluster

# Save the DataFrame with the identified clusters
save_dataframe_with_progress(merged_tracks_df, Results_Folder + '/' + 'merged_Tracks.csv')

# Plotting the results
plt.figure(figsize=(12, 10))
sns.scatterplot(x='t-SNE dimension 1', y='t-SNE dimension 2', hue='Cluster_tsne', palette='viridis', data=tsne_df, s=spot_size)
plt.title('Clusters Identified by HDBSCAN')
plt.savefig(os.path.join(Results_Folder, 'Tsne', 'HDBSCAN_clusters_2D.pdf'))  # Save 2D plot as PDF
plt.show()


## **3.4. Fingerprint**
---

<font size = 4>This section is designed to visualize the distribution of different clusters within each condition in a dataset, showing the 'fingerprint' of each cluster per condition.
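
<font size = 4>As a small worked example of the percentages behind the fingerprint plot, the sketch below builds a toy table of tracks and computes the share of each cluster within each condition. It uses `pandas.crosstab` with row-wise normalization, which is a compact alternative to the group-and-merge approach used in the cell and not the notebook's own code. The data are made up for illustration.

```python
import pandas as pd

# Toy dataset: eight tracks from two conditions; -1 marks noise points
tracks = pd.DataFrame({
    "Condition": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Cluster_tsne": [0, 0, 0, 1, 1, 1, 1, -1],
})

# Percentage of tracks of each condition that fall in each cluster
fingerprint = pd.crosstab(
    tracks["Condition"], tracks["Cluster_tsne"], normalize="index"
) * 100
print(fingerprint)
```

<font size = 4>Each row sums to 100, so conditions with different numbers of tracks can be compared directly.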

In [None]:
# @title ##Plot the 'fingerprint' of each cluster per condition

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Group by 'Condition' and 'Cluster' and calculate the size of each group
cluster_counts = tsne_df.groupby(['Condition', 'Cluster_tsne']).size().reset_index(name='counts')

# Calculate the total number of points per condition
total_counts = tsne_df.groupby('Condition').size().reset_index(name='total_counts')

# Merge the DataFrames on 'Condition' to calculate percentages
percentage_df = pd.merge(cluster_counts, total_counts, on='Condition')
percentage_df['percentage'] = (percentage_df['counts'] / percentage_df['total_counts']) * 100

# Save the percentage_df DataFrame as a CSV file
percentage_df.to_csv(os.path.join(Results_Folder, 'Tsne', 'TSNE_percentage_results.csv'), index=False)

# Pivot the percentage_df to have Conditions as index, Clusters as columns, and percentages as values
pivot_df = percentage_df.pivot(index='Condition', columns='Cluster_tsne', values='percentage')

# Fill NaN values with 0 if any, as there might be some Condition-Cluster combinations that are not present
pivot_df.fillna(0, inplace=True)

# Initialize PDF
pdf_path = os.path.join(Results_Folder, 'Tsne', 'TSNE_Cluster_Fingerprint_Plot.pdf')
pdf_pages = PdfPages(pdf_path)

# Plotting
fig, ax = plt.subplots(figsize=(10, 7))
pivot_df.plot(kind='bar', stacked=True, ax=ax, colormap='viridis')
plt.title('Percentage in each cluster per Condition')
plt.ylabel('Percentage')
plt.xlabel('Condition')
plt.xticks(rotation=90)
plt.tight_layout()

# Save the figure to a PDF
pdf_pages.savefig(fig)

# Close the PDF
pdf_pages.close()

# Display the plot
plt.show()


## **3.5. Understand your clusters using heatmaps**
--------
<font size = 4>This section helps visualize how different track parameters vary across the identified clusters. The approach is to display these variations using a heatmap, which offers a color-coded representation of the median values of each parameter for each cluster. This visualization technique can make it easier to spot differences or patterns among the clusters.
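
<font size = 4>The heatmap itself is produced by `heatmap_comparison`, which is imported from the main CellTracksColab code and not shown here. The sketch below illustrates one plausible way to prepare such values: take the median of each parameter per cluster, then z-score each parameter across clusters so that parameters with different units share a common color scale. The column names `MEAN_SPEED` and `TRACK_DURATION` and the values are hypothetical, and the real function may normalize differently.

```python
import pandas as pd

# Toy dataset: nine tracks in three clusters with two made-up parameters
tracks = pd.DataFrame({
    "Cluster_tsne": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "MEAN_SPEED": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "TRACK_DURATION": [30.0, 20.0, 10.0, 60.0, 50.0, 40.0, 90.0, 80.0, 70.0],
})

# Median of each parameter within each cluster
medians = tracks.groupby("Cluster_tsne").median()

# Z-score each parameter across clusters (population standard deviation)
z_scores = (medians - medians.mean()) / medians.std(ddof=0)
print(z_scores)
```

<font size = 4>After this step, a positive value means the cluster median lies above the average of the cluster medians for that parameter, and a negative value means it lies below.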


In [None]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import pandas as pd
from scipy.stats import zscore

# @title ##Plot normalized track parameters based on clusters as a heatmap

# Parameters to adapt depending on the notebook section
base_folder = f"{Results_Folder}/Tsne/Track_parameters"
Conditions = 'Cluster_tsne'

# Check and create necessary directories
folders = ["pdf", "csv"]
for folder in folders:
    dir_path = os.path.join(base_folder, folder)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

# Example usage
heatmap_comparison(merged_tracks_df, base_folder, Conditions)

## **3.6. Understand your clusters using box plots**
--------
<font size = 4>The provided code aims to visually represent the distribution of different track parameters across the identified clusters. Specifically, for each parameter selected, a boxplot is generated to showcase the spread of its values across different clusters. This approach provides a comprehensive view of how each track parameter varies within and across the clusters.




In [None]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import ipywidgets as widgets

# @title ##Plot track parameters based on clusters

# Define and create the output folder for the track-parameter plots
Cluster = "Cluster_tsne"
base_folder = f"{Results_Folder}/Tsne/Track_parameters/"
os.makedirs(base_folder, exist_ok=True)

checkboxes_dict, checkboxes_accordion = display_variable_checkboxes(categorize_columns(merged_tracks_df))
variable_checkboxes, checkboxes_widget = display_variable_checkboxes(get_selectable_columns_plots(merged_tracks_df))

# Create and display the plot button
button = widgets.Button(description="Plot Selected Variables", layout=widgets.Layout(width='400px'))
button.on_click(lambda b: plot_selected_vars_per_cluster(b, Cluster, checkboxes_dict, merged_tracks_df, base_folder));

# Display the UI components
display(VBox([checkboxes_accordion, button]))

## **3.7. Plot track parameters for a selected cluster**
---

In [None]:
# @title ##Plot track parameters for a selected cluster

base_folder = f"{Results_Folder}/Tsne/Track_parameters/"
Conditions = 'Condition'
df_to_plot = merged_tracks_df
Cluster = "Cluster_tsne"

# Check and create necessary directories
folders = ["pdf", "csv"]
for folder in folders:
    dir_path = os.path.join(base_folder, folder)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

condition_selector, condition_accordion = display_condition_selection(df_to_plot, Conditions)
checkboxes_dict, checkboxes_accordion = display_variable_checkboxes(categorize_columns(df_to_plot))
variable_checkboxes, checkboxes_widget = display_variable_checkboxes(get_selectable_columns_plots(df_to_plot))
stat_method_selector = widgets.Dropdown(
    options=['randomization test', 't-test'],
    value='randomization test',
    description='Stat Method:',
    style={'description_width': 'initial'}
)
cluster_dropdown = display_cluster_dropdown(merged_tracks_df, Cluster)

button = Button(description="Plot Selected Variables", layout=Layout(width='400px'), button_style='info')
button.on_click(lambda b: plot_selected_vars_cluster(b, checkboxes_dict, df_to_plot, Conditions, Cluster, cluster_dropdown, base_folder, condition_selector, stat_method_selector));

display(VBox([condition_accordion, checkboxes_accordion, stat_method_selector, cluster_dropdown, button]))
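
<font size = 4>The `Stat Method` dropdown above offers a randomization test as its default. As a rough illustration of the general idea only, the sketch below implements a two-sided randomization test on the difference in means using the Python standard library. The helper `randomization_test`, its parameters, and the example values are hypothetical and are not taken from CellTracksColab; the notebook's own plotting functions may compute their statistics differently.

```python
import random
from statistics import mean


def randomization_test(a, b, n_iter=10_000, seed=0):
    """Two-sided randomization test on the difference in means.

    Hypothetical helper for illustration; not part of CellTracksColab.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_iter):
        # Randomly reassign the pooled values to the two groups
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    return (hits + 1) / (n_iter + 1)


# Made-up values for two conditions (illustrative only, not real data)
control = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0]
treated = [1.8, 2.1, 1.9, 2.0, 2.2, 1.7]
p_value = randomization_test(control, treated, n_iter=2000)
print(f"p = {p_value:.4f}")
```

<font size = 4>A test of this kind makes no assumption that the values are normally distributed, which is one common reason to choose it over a t-test for track metrics.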

# **Part 4. Version log**
---
<font size = 4>While we strive to provide accurate and helpful information, please be aware that:
  - This notebook may contain bugs.
  - Features are currently limited and will be expanded in future releases.

<font size = 4>We encourage users to report any issues or suggestions for improvement. Please check the [repository](https://github.com/guijacquemet/CellTracksColab) regularly for updates and the latest version of this notebook.

#### **Known Issues**:
- Tracks are displayed in 2D in section 1.4

<font size = 4>**Version 1.0.1**
  - Includes a general data reader
  - Plotting functions are imported from the main code
    
<font size = 4>**Version 0.9.2**
  - Added the Origin normalized plots

<font size = 4>**Version 0.9.1**
  - Added a `pip freeze` option to save a requirements text file
  - Added the heatmap visualisation of track parameters
  - Heatmaps can now be displayed on multiple pages
  - Fixed a UserWarning message during plotting (all box plots)
  - Added the possibility to copy and paste an existing list of selected metrics for clustering analyses

<font size = 4>**Version 0.9**
  - Improved plotting strategy. Specific conditions can be chosen
  - Absolute Cohen's d values are now shown
  - In the QC, the heatmap is automatically divided into subplots when the dataframe has too many columns

<font size = 4>**Version 0.8**
  - Settings are now saved
  - Order of the sections has been modified to help streamline biological discoveries
  - New section added to Quality Control to check if the dataset is balanced
  - New section added to the UMAP and t-SNE sections to plot track parameters for selected clusters
  - Clusters for UMAP and t-SNE are now saved separately in the dataframe

<font size = 4>**Version 0.7**
  - check_for_nans function added
  - Clustering using t-SNE added

<font size = 4>**Version 0.6**
  - Improved organisation of the results
  - Track visualisations are now saved

<font size = 4>**Version 0.5**
  - Improved part 5
  - Added the possibility to find exemplars in the raw movies when available
  - Added the possibility to export videos with the exemplars labeled
  - Code improved to deal with larger datasets (tested with over 50k tracks)
  - Test dataset now contains raw video and is hosted on Zenodo
  - Results are now organised in folders
  - Added progress bars
  - Minor code fixes

<font size = 4>**Version 0.4**

  - Added the possibility to filter and smooth tracks
  - Added spatial and temporal calibration
  - Notebook is streamlined
  - Multiple bug fixes
  - Removed t-SNE
  - Improved documentation

<font size = 4>**Version 0.3**
  - Fixed a nasty bug in the import functions
  - Added basic exemplars for UMAP
  - Added the statistical analyses and their explanations.
  - Added a new quality control part that helps assess the similarity of results between FOV, conditions and repeats
  - Improved part 5 (previously part 4).

<font size = 4>**Version 0.2**
  - Added support for 3D tracks
  - New documentation and metrics added.

<font size = 4>**Version 0.1**
  - This is the first release of this notebook.

---