# **CellTracksColab**
---

Colab Notebook for Analyzing Migration Tracks generated by [TrackMate](https://imagej.net/plugins/trackmate/)

This Colab notebook analyzes migration tracks, with an emphasis on tracks generated by TrackMate, a Fiji plugin for single-particle tracking.

Notebook created by [Guillaume Jacquemet](https://cellmig.org/)


# **0. Before getting started**
---

---
<font size = 4>**Important note**

To load your TrackMate outputs, your dataset should be meticulously organized into a two-tiered folder hierarchy as depicted below.

<font size = 4>Here's a common data structure that can work:

## Folder Hierarchy

- 📁 **Experiments** `[Folder_path]`
  - 🌿 **Condition_1** `[‘condition’ is derived from this folder name]`
    - 🔄 **R1** `[‘repeat’ is derived from this folder name]`
      - 📄 `FOV_spots_1.csv`
      - 📄 `FOV_tracks_1.csv`
      - 📄 `FOV_spots_2.csv`
      - 📄 `FOV_tracks_2.csv`
    - 🔄 **R2**
      - 📄 `FOV_spots_1.csv`
      - 📄 `FOV_tracks_1.csv`
      - 📄 `FOV_spots_2.csv`
      - 📄 `FOV_tracks_2.csv`
  - 🌿 **Condition_2**
    - 🔄 **R1**
    - 🔄 **R2**

<font size = 4>In this representation, different symbols are used to represent folders and files clearly:

- 📁 represents the main folder or directory.
- 🌿 represents the condition folders.
- 🔄 represents the repeat folders.
- 📄 represents the individual CSV files.
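
<font size = 4>Before loading, it can help to confirm that a folder actually follows this layout. The snippet below is a minimal sketch using only the standard library. The helper name `summarize_layout` and the throwaway folder it builds are hypothetical and not part of the notebook.

```python
from pathlib import Path
import tempfile


def summarize_layout(root):
    """Map each condition folder to its repeat folders and their CSV files.

    Assumes the two-level layout root/<condition>/<repeat>/<file>.csv.
    """
    layout = {}
    for csv_path in sorted(Path(root).glob("*/*/*.csv")):
        repeat_dir = csv_path.parent
        condition = repeat_dir.parent.name
        layout.setdefault(condition, {}).setdefault(repeat_dir.name, []).append(csv_path.name)
    return layout


# Build a tiny throwaway hierarchy to demonstrate the function.
with tempfile.TemporaryDirectory() as tmp:
    for condition in ("Condition_1", "Condition_2"):
        for repeat in ("R1", "R2"):
            folder = Path(tmp) / condition / repeat
            folder.mkdir(parents=True)
            for name in ("FOV_spots_1.csv", "FOV_tracks_1.csv"):
                (folder / name).touch()
    layout = summarize_layout(tmp)

# Print one line per repeat folder with the CSV files it contains.
for condition, repeats in layout.items():
    for repeat, files in repeats.items():
        print(condition, repeat, files)
```

<font size = 4>CSV files that sit directly in the main folder or in a condition folder do not match the `*/*/*.csv` pattern, so they are absent from the summary, which makes misplaced files easy to spot.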

---
<font size = 4>**Important note 2**

Be advised of two significant limitations inherent to this notebook.

1) <font size = 4 color="red">**It does not support Track splitting**</font>. For users aiming to compute additional track metrics within this environment, it is crucial to disable track splitting in TrackMate.

It’s important to clarify that the absence of track splitting support does not hinder the notebook's ability to compile and display results in part 3 of the analysis process. The results compilation and display mechanisms are designed to function independently of track splitting, allowing users to visualize and interpret the data accurately.

Before initiating the analysis, ensure that track splitting is disabled if the additional metrics computations are needed, to maintain the integrity and reliability of the results obtained through this notebook.

2) <font size = 4 color="red">**It is currently limited to the analysis of 2D tracks**</font>.
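
One way to check whether a dataset contains split tracks is to inspect the track table after loading. The sketch below assumes that the track export includes a `NUMBER_SPLITS` column; if your export does not, the helper raises an error instead of guessing. The helper name `tracks_with_splits` and the toy table are hypothetical.

```python
import pandas as pd


def tracks_with_splits(tracks_df, split_column="NUMBER_SPLITS"):
    """Return the rows of a track table that contain at least one split event."""
    if split_column not in tracks_df.columns:
        raise KeyError(f"Column {split_column!r} not found in the track table")
    return tracks_df[tracks_df[split_column] > 0]


# Toy track table standing in for a loaded track export (values are made up).
example_tracks = pd.DataFrame({
    "TRACK_ID": [0, 1, 2],
    "NUMBER_SPLITS": [0, 2, 0],
})

split_tracks = tracks_with_splits(example_tracks)
print(f"{len(split_tracks)} track(s) contain splitting events")
```

If the result is not empty, re-run the tracking with track splitting disabled before computing additional metrics.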




In [None]:
# @title #MIT License

print("""
**MIT License**

Copyright (c) 2023 Guillaume Jacquemet

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.""")

# **Version log**
---
<font size = 4>**Version 0.1**

This is the first release of this notebook. While I strive to provide accurate and helpful information, please be aware that:
  - This version may contain bugs.
  - Features are currently limited and will be expanded in future releases.

We encourage users to report any issues or suggestions for improvement. Please check the [repository](https://github.com/guijacquemet/CellTracksColab) regularly for updates and the latest version of this notebook.

### Known Issues:
- Part 4 is limited and unstable.

---

--------------------------------------------------------
# **Part 1: Set up the Colab session and load your tracks and spots data**
--------------------------------------------------------


## **1.1. Install key dependencies**
---
<font size = 4>

In [None]:
#@markdown ##Play to install
!pip -q install pandas scikit-learn
!pip -q install hdbscan
!pip -q install umap-learn
!pip -q install plotly


import requests

# URL to the raw content of the version file in the repository
version_url = "https://raw.githubusercontent.com/guijacquemet/CellTracksColab/main/Notebook/latest_version.txt"

# Current version of the notebook the user is running
current_version = "0.1"

try:
    response = requests.get(version_url)
    response.raise_for_status()  # Check whether the request was successful
    latest_version = response.text.strip()  # Get the latest version from the version file

    if latest_version != current_version:
        print(f"A newer version of this notebook is available: {latest_version}. "
              f"Please download the latest version from the repository.")
    else:
        print("You are running the latest version of this notebook.")
except requests.RequestException as e:
    print("Could not check for the latest version of the notebook.")


## **1.2. Mount your Google Drive**
---
<font size = 4> To use this notebook on the data present in your Google Drive, you need to mount your Google Drive to this notebook.

<font size = 4> Play the cell below to mount your Google Drive and follow the link. In the new browser window, select your drive and select 'Allow', copy the code, paste into the cell and press enter. This will give Colab access to the data on the drive.

<font size = 4> Once this is done, your data are available in the **Files** tab on the top left of the notebook.

In [None]:
#@markdown ##Play the cell to connect your Google Drive to Colab

from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive



## **1.3. Load your (or test) data**

<font size = 4> Please ensure that your data is properly organised (see above)
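
<font size = 4> The loading cell reads every CSV with `skiprows=[1, 2, 3]`, which keeps the first row as column names and discards the three rows that follow it. The sketch below illustrates this on an in-memory file. The content of the three extra header rows (names, short names, units) is an assumption made for illustration, and all values are invented.

```python
import io
import pandas as pd

# Hypothetical miniature export: one row of column keys, three descriptive
# header rows, then the data. The values are made up for illustration.
csv_text = """LABEL,TRACK_ID,POSITION_X,POSITION_Y,POSITION_T
Label,Track ID,X,Y,T
Label,Track ID,X,Y,T
,,(micron),(micron),(sec)
ID1,0,1.5,2.0,0.0
ID2,0,2.5,2.0,1.0
"""

# skiprows=[1, 2, 3] keeps row 0 as the header and drops the three rows after it.
spots = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2, 3])
print(spots.dtypes)
```

<font size = 4> Without `skiprows`, the descriptive rows would be parsed as data and the position columns would be read as text rather than numbers.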


In [None]:
#@markdown ###Provide the path to your dataset

Folder_path = ''  # @param {type: "string"}

#@markdown ###Or use a test dataset
Use_test_dataset = True #@param {type:"boolean"}

#@markdown ###Provide the path to your Result folder

Results_Folder = "/content"  # @param {type: "string"}

import os
import re
import glob
import pandas as pd

def populate_columns(df, filepath):
    # Extract the parts of the file path
    path_parts = os.path.normpath(filepath).split(os.sep)

    if len(path_parts) < 3:
        # if there are not enough parts in the path to extract folder and parent folder
        print(f"Error: Cannot extract parent folder and folder from the filepath: {filepath}")
        return df

    # Assuming that the file is located at least two levels deep in the directory structure
    folder_name = path_parts[-2]  # The folder name is the second last part of the path
    parent_folder_name = path_parts[-3]  # The parent folder name is the third last part of the path

    df['File_name'] = os.path.splitext(os.path.basename(filepath))[0]
    df['Condition'] = parent_folder_name  # Populate 'Condition' with the parent folder name
    df['experiment_nb'] = folder_name  # Populate 'experiment_nb' with the folder name ('Repeat' is derived from it later)

    return df

def load_and_populate(file_pattern, usecols=None):
    df_list = []
    pattern = re.compile(file_pattern)  # Compile the file pattern to a regex object

    # Go through each root, dirs, files triplet returned by os.walk
    for dirpath, dirnames, filenames in os.walk(Folder_path):
        for filename in filenames:
            if pattern.match(filename):  # Check if the filename matches the file pattern
                filepath = os.path.join(dirpath, filename)
                df = pd.read_csv(filepath, skiprows=[1, 2, 3], usecols=usecols)
                df_list.append(populate_columns(df, filepath))

    if not df_list:  # if df_list is empty, return an empty DataFrame
        print(f"No files found with pattern: {file_pattern}")
        return pd.DataFrame()

    merged_df = pd.concat(df_list, ignore_index=True)
    return merged_df

def sort_and_generate_repeat(merged_df):
    merged_df.sort_values(['Condition', 'experiment_nb'], inplace=True)
    merged_df = merged_df.groupby('Condition', group_keys=False).apply(generate_repeat)
    return merged_df

def generate_repeat(group):
    unique_experiment_nbs = sorted(group['experiment_nb'].unique())
    experiment_nb_to_repeat = {experiment_nb: i+1 for i, experiment_nb in enumerate(unique_experiment_nbs)}
    group['Repeat'] = group['experiment_nb'].map(experiment_nb_to_repeat)
    return group

if (Use_test_dataset):
  print("Downloading test dataset")
  !wget -nc -O /content/T_cell_dataset.zip https://github.com/guijacquemet/CellTracksColab/raw/main/Test_dataset/T_cell_dataset.zip && unzip -q /content/T_cell_dataset.zip -d /content
  Folder_path = "/content/Tracks"

print("Merging CSV files....")

merged_tracks_df = load_and_populate(r'.*tracks.*\.csv')  # Use raw string to avoid escape character issues
merged_tracks_df = sort_and_generate_repeat(merged_tracks_df)
merged_tracks_df['Unique_ID'] = merged_tracks_df['Condition'] + "_" + merged_tracks_df['experiment_nb'] + "_" + merged_tracks_df['TRACK_ID'].astype(str)
merged_tracks_df.to_csv(Results_Folder + '/' + 'merged_Tracks.csv', index=False)

merged_spots_df = load_and_populate(r'.*spots.*\.csv')  # Use raw string to avoid escape character issues
merged_spots_df = sort_and_generate_repeat(merged_spots_df)
merged_spots_df['Unique_ID'] = merged_spots_df['Condition'] + "_" + merged_spots_df['experiment_nb'] + "_" + merged_spots_df['TRACK_ID'].astype(str)
merged_spots_df.to_csv(Results_Folder + '/' + 'merged_Spots.csv.gz', index=False, compression='gzip')  # .gz extension reflects the gzip compression

print("Done")


## **1.4. Visualise your tracks**

In [None]:
# @title ##Run the cell and choose the file you want to inspect


import ipywidgets as widgets
from ipywidgets import interact
import matplotlib.pyplot as plt

# Extract unique filenames from the dataframe
filenames = merged_spots_df['File_name'].unique()

# Create a Dropdown widget with the filenames
filename_dropdown = widgets.Dropdown(
    options=filenames,
    value=filenames[0] if len(filenames) > 0 else None,  # Default selected value
    description='File Name:',
)

def plot_coordinates(filename):
    if filename:
        # Filter the DataFrame based on the selected filename
        filtered_df = merged_spots_df[merged_spots_df['File_name'] == filename]

        plt.figure(figsize=(10, 8))
        for unique_id in filtered_df['Unique_ID'].unique():
            unique_df = filtered_df[filtered_df['Unique_ID'] == unique_id].sort_values(by='POSITION_T')
            plt.plot(unique_df['POSITION_X'], unique_df['POSITION_Y'], marker='o', linestyle='-', markersize=2)

        plt.xlabel('POSITION_X')
        plt.ylabel('POSITION_Y')
        plt.title(f'Coordinates for {filename}')
        plt.show()
    else:
        print("No valid filename selected")

# Link the Dropdown widget to the plotting function
interact(plot_coordinates, filename=filename_dropdown)


--------------------------------------------------------
# **Part 2: Compute additional metrics**
--------------------------------------------------------
<font size = 4> Additional Metrics will be added later.
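
<font size = 4> Directionality, as computed in the cell below, is the straight-line distance between the first and last position of a track divided by the total path length, and is set to 0 when the path length is 0. The following is a minimal standalone sketch of that ratio on plain coordinate lists. The function name `directionality` and the example tracks are illustrative only.

```python
import math


def directionality(points):
    """Ratio of net displacement to total path length for an ordered 2D track."""
    if len(points) < 2:
        return 0.0
    path_length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if path_length == 0:
        return 0.0
    return math.dist(points[0], points[-1]) / path_length


# A straight track: net displacement equals path length.
straight = [(0, 0), (1, 0), (2, 0)]
# A track with a right-angle turn: net displacement 5, path length 7.
detour = [(0, 0), (3, 0), (3, 4)]

print(directionality(straight))
print(directionality(detour))
```

<font size = 4> Values close to 1 indicate nearly straight movement, while values close to 0 indicate a track that ends near where it started relative to the distance travelled.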


In [None]:
# @title ##Calculate directionality
import pandas as pd
import numpy as np

# Function to calculate Directionality
def calculate_directionality(group):
    group = group.sort_values('POSITION_T')
    start_point = group.iloc[0][['POSITION_X', 'POSITION_Y']]
    end_point = group.iloc[-1][['POSITION_X', 'POSITION_Y']]
    euclidean_distance = np.sqrt((end_point - start_point).pow(2).sum())

    deltas = np.sqrt(group['POSITION_X'].diff().fillna(0)**2 + group['POSITION_Y'].diff().fillna(0)**2)
    total_path_length = deltas.sum()

    D = euclidean_distance / total_path_length if total_path_length != 0 else 0
    return pd.Series({'Directionality': D})

# Sort the DataFrame by 'Unique_ID' and 'POSITION_T'
merged_spots_df.sort_values(by=['Unique_ID', 'POSITION_T'], inplace=True)

# Calculate directionality for each track
df_directionality = merged_spots_df.groupby('Unique_ID').apply(calculate_directionality).reset_index()

# Merge the directionality back into the original DataFrame
merged_tracks_df = pd.merge(merged_tracks_df, df_directionality, on='Unique_ID', how='left')

merged_tracks_df.to_csv(Results_Folder + '/' + 'merged_Tracks.csv', index=False)

In [None]:
# @title ##Plot directionality


import ipywidgets as widgets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# List of variables to plot
variables_to_plot = ["Directionality"]


# Initialize PDF
pdf_pages = PdfPages(f"{Results_Folder}/Boxplots.pdf")
# Create a single figure with one subplot per variable
fig, axes = plt.subplots(len(variables_to_plot), 1, figsize=(10, 6))

# Make sure axes is a list in case there's only one subplot
if len(variables_to_plot) == 1:
    axes = [axes]

for ax, var in zip(axes, variables_to_plot):
    # Extract the data for this variable
    data_for_var = merged_tracks_df[['Condition', var]]

    # Save this data to a CSV file
    data_for_var.to_csv(f"{Results_Folder}/data_for_{var}.csv", index=False)
    sns.boxplot(x='Condition', y=var, data=merged_tracks_df, ax=ax, color='lightgray')  # Boxplot
    sns.stripplot(x='Condition', y=var, data=merged_tracks_df, ax=ax, hue='Repeat', dodge=True, jitter=True, alpha=0.2)  # Individual data points
    ax.set_title(f"{var}")
    ax.set_xlabel('Condition')
    ax.set_ylabel(var)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90)


# Save the figure to a PDF
plt.tight_layout()
pdf_pages.savefig(fig)

# Close the PDF
pdf_pages.close()

-------------------------------------------

# **Part 3: Plot track parameters**
-------------------------------------------

<font size = 4> In this section you can plot all the track parameters previously computed. Data and graphs are automatically saved in your result folder.


In [None]:
# @title ##Plot useful tracks data


import ipywidgets as widgets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Create a list of potential variables from DataFrame columns
all_columns = merged_tracks_df.columns.tolist()

# Remove unwanted columns like 'condition' and 'repeat' from the list
selectable_columns = [col for col in all_columns if col not in ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION']]

# Create checkboxes for selectable columns
variable_checkboxes = [widgets.Checkbox(value=False, description=col) for col in selectable_columns]


# Arrange and display checkboxes in the notebook
display(widgets.VBox([
    widgets.Label('Variables to Plot:'),
    widgets.GridBox(variable_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 300px)" % 3)),
]))

# Define the plotting function
def plot_selected_vars(button):
  print("Plotting in progress...")

  variables_to_plot = [box.description for box in variable_checkboxes if box.value]

  # Determine the number of variables to plot
  n_plots = len(variables_to_plot)

  # If no variables are selected, avoid creating a plot or an empty PDF
  if n_plots == 0:
    print("No variables selected for plotting")
  else:
    pdf_pages = PdfPages(f"{Results_Folder}/boxplots.pdf")
    # Set the height of each subplot and figure width
    subplot_height = 5  # Adjust as per your requirement
    fig_width = 10  # Adjust as per your requirement

    # Calculate the total figure height
    fig_height = n_plots * subplot_height

    # Create subplots with dynamic figure size
    fig, axes = plt.subplots(n_plots, 1, figsize=(fig_width, fig_height))

    # Make axes iterable in case there's only one subplot
    if n_plots == 1:
        axes = [axes]

    for ax, var in zip(axes, variables_to_plot):
        data_for_var = merged_tracks_df[['Condition', var]]
        data_for_var.to_csv(f"{Results_Folder}/data_for_{var}.csv", index=False)
        sns.boxplot(x='Condition', y=var, data=merged_tracks_df, ax=ax, color='lightgray')
        sns.stripplot(x='Condition', y=var, data=merged_tracks_df, ax=ax, hue='Repeat', dodge=True, jitter=True, alpha=0.2)
        ax.set_title(f"{var}")
        ax.set_xlabel('Condition')
        ax.set_ylabel(var)
        ax.set_xticklabels(ax.get_xticklabels(), rotation=90)

    plt.tight_layout()
    pdf_pages.savefig(fig)
    pdf_pages.close()
    plt.show()

# Create a button that will execute the plotting function when clicked
button = widgets.Button(description="Plot Selected Variables", layout=widgets.Layout(width='400px'))
button.on_click(plot_selected_vars)
display(button)


--------
# **Part 4: Visualization of high-dimensional data (work in progress)**
--------

<font size = 4> The workflow provided below is inspired by [CellPlato](https://github.com/Michael-shannon/cellPLATO)

### **4.1: UMAP**

<font size = 4> The cell below performs UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction on the numeric columns of the merged tracks dataframe and visualizes the result. UMAP is often used to visualize high-dimensional data in 2D or 3D.
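
<font size = 4> When rows with missing values are dropped before the embedding is computed, any label attached afterwards has to come from the same surviving rows. The sketch below shows one way to keep features and labels matched, using pandas only (UMAP itself is not called). The table and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy track table; the column names and values are only illustrative.
tracks = pd.DataFrame({
    "Condition": ["A", "A", "B", "B"],
    "TRACK_MEAN_SPEED": [1.0, np.nan, 3.0, 4.0],
    "Directionality": [0.2, 0.5, 0.9, 0.4],
})

# Keep numeric columns and drop any row with a missing value.
numeric = tracks.select_dtypes(include="number").dropna()

# Look up labels through the surviving index so rows stay matched.
labels = tracks.loc[numeric.index, "Condition"].reset_index(drop=True)
features = numeric.reset_index(drop=True)

print(pd.concat([features, labels], axis=1))
```

<font size = 4> Resetting the index only after the labels have been selected keeps row *i* of the features and row *i* of the labels pointing at the same track.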

In [None]:
# @title ##Perform UMAP

# User alterable parameters
umap_nn = 30  # UMAP nearest neighbors
min_dist = 0.0  # UMAP minimum distance
n_components = 3  # Number of UMAP dimensions to calculate

import umap
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Select only numeric columns
numeric_df = merged_tracks_df.select_dtypes(include=['number'])

# Check if the DataFrame has any NaN values and print a warning if it does.
nan_columns = numeric_df.columns[numeric_df.isna().any()].tolist()

if nan_columns:
    warnings.warn(f"The DataFrame contains NaN values in the following columns: {', '.join(nan_columns)}")
    numeric_df = numeric_df.dropna()

# Initialize UMAP object with the specified settings
reducer = umap.UMAP(n_neighbors=umap_nn, min_dist=min_dist, n_components=n_components, random_state=42)
embedding = reducer.fit_transform(numeric_df)

# Create dynamic column names based on n_components
column_names = [f'UMAP dimension {i}' for i in range(1, n_components + 1)]

# Create a DataFrame with the UMAP results
umap_df = pd.DataFrame(embedding, columns=column_names)

# Concatenate the conditions (if available), taking them from the rows kept after dropping NaNs
if 'Condition' in merged_tracks_df.columns:
    umap_df = pd.concat([umap_df, merged_tracks_df.loc[numeric_df.index, 'Condition'].reset_index(drop=True)], axis=1)

# Visualize the UMAP projection
plt.figure(figsize=(12, 10))

# The plot will adjust automatically based on the n_components
if n_components == 2:
    sns.scatterplot(x=column_names[0], y=column_names[1], hue='Condition', data=umap_df, palette='viridis', s=60)
    plt.title('UMAP Projection of the Dataset')
    plt.show()
elif n_components == 1:
    sns.stripplot(x=column_names[0], hue='Condition', data=umap_df, palette='viridis', jitter=0.05, size=6)
    plt.title('UMAP Projection of the Dataset')
    plt.show()
else:
    # umap_df should have columns like 'UMAP dimension 1', 'UMAP dimension 2', 'UMAP dimension 3', and 'condition'
    import plotly.express as px
    import pandas as pd
    import numpy as np

    fig = px.scatter_3d(umap_df,
                    x='UMAP dimension 1',
                    y='UMAP dimension 2',
                    z='UMAP dimension 3',
                    color='Condition')

    for trace in fig.data:
      trace.marker.size = 2  # You can set this to any desired value

    fig.show()

### 4.2 **HDBSCAN**

<font size = 4> The cell below uses HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to identify clusters in the UMAP embedding computed in section 4.1. HDBSCAN does not require the number of clusters to be set in advance and can handle clusters of varying density.
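
<font size = 4> HDBSCAN labels the points it treats as noise with `-1`; all other labels are cluster identifiers. The sketch below summarizes such a label array with the standard library only. The helper name `summarize_labels` and the example labels are hypothetical stand-ins for `clusterer.labels_`.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (cluster sizes, noise fraction) for HDBSCAN-style labels."""
    n_points = len(labels)
    counts = Counter(labels)
    # Label -1 marks noise points; remove it from the cluster counts.
    noise = counts.pop(-1, 0)
    noise_fraction = noise / n_points if n_points > 0 else 0.0
    return dict(counts), noise_fraction


# Made-up labels standing in for clusterer.labels_.
example_labels = [0, 0, 1, -1, 1, 1, -1, 0]
sizes, noise_fraction = summarize_labels(example_labels)
print(sizes, noise_fraction)
```

<font size = 4> A large noise fraction can be a hint that `min_cluster_size` or `min_samples` needs tuning for the dataset.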

In [None]:
# @title ##Identify clusters using HDBSCAN

import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes UMAP has already been run in the previous cell, producing umap_df

# Apply HDBSCAN
clusterer = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=200, metric='euclidean')  # You may need to tune these parameters
# Cluster on every UMAP dimension that was computed, so the labels match the plotted embedding
umap_columns = [col for col in umap_df.columns if col.startswith('UMAP dimension')]
clusterer.fit(umap_df[umap_columns])
#clusterer.fit(merged_tracks_df.select_dtypes(include=['number']))

# Add the cluster labels to your UMAP DataFrame
umap_df['Cluster'] = clusterer.labels_

if n_components == 2:
  # Plotting the results
  plt.figure(figsize=(12,10))
  sns.scatterplot(x='UMAP dimension 1', y='UMAP dimension 2', hue='Cluster', palette='viridis', data=umap_df, s=60)
  plt.title('Clusters Identified by HDBSCAN')
  plt.show()

if n_components == 3:
  import plotly.express as px
  import pandas as pd
  import numpy as np

  fig = px.scatter_3d(umap_df,
                    x='UMAP dimension 1',
                    y='UMAP dimension 2',
                    z='UMAP dimension 3',
                    color='Cluster')

  for trace in fig.data:
    trace.marker.size = 2  # You can set this to any desired value

  fig.show()

In [None]:
# @title ##Identify exemplar cells using HDBSCAN (not available)

import numpy as np

# clusterer.exemplars_ is a list with one array of exemplar coordinates per cluster
# (the points themselves, not row indices), so stack them into a single array
exemplar_points = np.vstack(clusterer.exemplars_)

# Keep the first two coordinates to match the 2D scatter plot below
exemplar_df = pd.DataFrame(exemplar_points[:, :2], columns=['UMAP dimension 1', 'UMAP dimension 2'])

# Plotting clusters and exemplar points
plt.figure(figsize=(12,10))
sns.scatterplot(x='UMAP dimension 1', y='UMAP dimension 2', hue='Cluster', palette='viridis', data=umap_df, s=60)
sns.scatterplot(x='UMAP dimension 1', y='UMAP dimension 2', color='red', label='Exemplars', data=exemplar_df, s=100, marker='X')
plt.title('Clusters and Exemplar Cells Identified by HDBSCAN')
plt.show()

In [None]:
# @title ##Plot the 'fingerprint' of each cluster per condition

import pandas as pd

# Group by 'Condition' and 'Cluster' and calculate the size of each group
cluster_counts = umap_df.groupby(['Condition', 'Cluster']).size().reset_index(name='counts')

# Calculate the total number of points per condition
total_counts = umap_df.groupby('Condition').size().reset_index(name='total_counts')

# Merge the DataFrames on 'Condition' to calculate percentages
percentage_df = pd.merge(cluster_counts, total_counts, on='Condition')
percentage_df['percentage'] = (percentage_df['counts'] / percentage_df['total_counts']) * 100


# Pivot the percentage_df to have Conditions as index, Clusters as columns, and percentages as values
pivot_df = percentage_df.pivot(index='Condition', columns='Cluster', values='percentage')

# Fill NaN values with 0 if any, as there might be some Condition-Cluster combinations that are not present
pivot_df.fillna(0, inplace=True)

import matplotlib.pyplot as plt

# Plotting
ax = pivot_df.plot(kind='bar', stacked=True, figsize=(10, 7), colormap='viridis')
plt.title('Percentage in each cluster per Condition')
plt.ylabel('Percentage')
plt.xlabel('Condition')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()



In [None]:
# @title ##Perform t-SNE


import pandas as pd
from sklearn.manifold import TSNE
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
import warnings

# Assuming df is your DataFrame containing the data
# Drop non-numeric columns or encode them to numeric

numeric_df = merged_tracks_df.select_dtypes(include=['float64', 'int64'])  # Selecting only numeric columns


# Check if the DataFrame has any NaN values and print a warning if it does.
nan_columns = numeric_df.columns[numeric_df.isna().any()].tolist()

if nan_columns:
    warnings.warn(f"The DataFrame contains NaN values in the following columns: {', '.join(nan_columns)}", UserWarning)
    numeric_df = numeric_df.dropna()

scaled_df = StandardScaler().fit_transform(numeric_df)  # Scaling the numeric features

tsne = TSNE(n_components=2, perplexity=30, n_iter=300, random_state=42)
tsne_results = tsne.fit_transform(scaled_df)

tsne_df = pd.DataFrame(data=tsne_results, columns=['Dimension 1', 'Dimension 2'])
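# Take the conditions from the rows kept after dropping NaNs so labels stay matched to the embedding
tsne_df['Condition'] = merged_tracks_df.loc[numeric_df.index, 'Condition'].to_numpy()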

plt.figure(figsize=(10,8))
sns.scatterplot(
    x='Dimension 1', y='Dimension 2',
    hue='Condition',
    palette=sns.color_palette("hsv", len(tsne_df['Condition'].unique())),
    data=tsne_df,
    legend="full",
    alpha=0.9
)
plt.title('t-SNE Plot')
plt.show()


In [None]:
# @title ##Identify clusters using HDBSCAN

clusterer = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=500, metric='euclidean')  # You may need to tune these parameters

clusterer.fit(tsne_results)

# Create a DataFrame with the t-SNE results and cluster labels
clustered_df = pd.DataFrame(data=tsne_results, columns=['Dimension 1', 'Dimension 2'])
clustered_df['Cluster'] = clusterer.labels_
clustered_df['Condition'] = tsne_df['Condition']


# Visualize the clusters
plt.figure(figsize=(10,8))
sns.scatterplot(x='Dimension 1', y='Dimension 2', hue='Cluster', palette='viridis', data=clustered_df, s=60)
plt.title('Clusters identified by HDBSCAN on t-SNE Results')
plt.show()


In [None]:
# @title ##Plot the 'fingerprint' of each cluster per condition


import pandas as pd
import matplotlib.pyplot as plt

# Assuming your dataframe is clustered_df after t-SNE and HDBSCAN
# and it contains a 'Condition' column with the original condition labels,
# and a 'Cluster' column with the HDBSCAN cluster labels

# Group by 'Condition' and 'Cluster' and calculate the size of each group
cluster_counts = clustered_df.groupby(['Condition', 'Cluster']).size().reset_index(name='counts')

# Calculate the total number of points per condition
total_counts = clustered_df.groupby('Condition').size().reset_index(name='total_counts')

# Merge the DataFrames on 'condition' to calculate percentages
percentage_df = pd.merge(cluster_counts, total_counts, on='Condition')
percentage_df['percentage'] = (percentage_df['counts'] / percentage_df['total_counts']) * 100

# Pivot the percentage_df to have conditions as index, Clusters as columns, and percentages as values
pivot_df = percentage_df.pivot(index='Condition', columns='Cluster', values='percentage')

# Fill NaN values with 0, as there might be some condition-Cluster combinations that are not present
pivot_df.fillna(0, inplace=True)

# Plotting
ax = pivot_df.plot(kind='bar', stacked=True, figsize=(10, 7), colormap='viridis')
plt.title('Percentage in each cluster per condition')
plt.ylabel('Percentage')
plt.xlabel('Condition')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
