# **Plot&Stats - dimensionality reduction**
---

<font size = 4>Colab Notebook for generating PCA, UMAP or t-SNE dimensionality reductions of multidimensional datasets.


<font size = 4>Notebook created by [Guillaume Jacquemet](https://cellmig.org/)



# **Part 0. Before getting started**
---



## <font size = 5>**Important notes**

---
## Data Requirements for Analysis

<font size = 4>This section details the format your data must follow to be analyzed with this notebook. Following these guidelines helps ensure accurate and efficient analysis.

### File Format
- **CSV**: Data should be in CSV (Comma-Separated Values) format, easily generated from spreadsheet applications (e.g., Excel, Google Sheets) or statistical software (e.g., R, Python).
- **Copy and Paste**: Data can also be copied and pasted directly from spreadsheet software (pasted data is read as tab-separated text).
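
As a minimal sketch of these two input routes, the snippet below parses the same small table once as comma-separated text and once as tab-separated text, which is what spreadsheet applications typically place on the clipboard. The column names and values are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical example data; the column names are illustrative only.
csv_text = "Condition,Repeat,Area\nControl,1,10.5\nTreated,1,12.1\n"
pasted_text = csv_text.replace(",", "\t")  # mimic a tab-separated spreadsheet paste

df_from_csv = pd.read_csv(StringIO(csv_text))                 # default separator is a comma
df_from_paste = pd.read_csv(StringIO(pasted_text), sep="\t")  # tabs must be declared explicitly

# Both routes should yield the same table.
print(df_from_csv.equals(df_from_paste))
```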

### Data Structure: Tidy Format
Data must follow the tidy data principles for optimal processing:
- **Each Variable Forms a Column**: Every column represents a single variable.
- **Each Observation Forms a Row**: Every row represents a single observation.
- **Each Type of Observational Unit Forms a Table**: Different observational units should be in separate tables or clearly distinguishable.
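
If your table is in a "wide" layout (for example, one column per condition), it can usually be reshaped into tidy format with pandas. The sketch below uses a made-up table; `Area` and the condition names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per condition (not tidy).
wide_df = pd.DataFrame({
    "Repeat": [1, 2, 3],
    "Control": [10.2, 11.0, 9.8],
    "Treated": [14.1, 13.5, 15.0],
})

# Reshape so that each row is one observation and each column one variable.
tidy_df = wide_df.melt(
    id_vars="Repeat",      # columns kept as identifiers
    var_name="Condition",  # former column headers become values of this column
    value_name="Area",     # measured values go here (illustrative name)
)

print(tidy_df)
```

After reshaping, every row holds one measurement together with its condition and repeat, which is the layout this notebook expects.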

### Essential Columns
Your dataset must include specific columns for analysis:
- **Biological Repeat Column**: Identifies biological replicates. Names can vary (e.g., "Repeat", "Bio_Replicate") but must consistently identify each biological repeat.
- **Condition Column**: Categorizes observations by experimental conditions or treatments. Names can vary (e.g., "Condition", "Treatment") but must provide clear, consistent categorization.

### Data Preparation Tips
- **Consistency and Clarity**: Ensure consistent and descriptive naming within "Biological Repeat" and "Condition" columns.
- **Data Cleaning**: Address missing or erroneous entries in these essential columns to prevent analysis issues.

### Column Naming Flexibility
- The exact names of the "Biological Repeat" and "Condition" columns are flexible to fit various dataset structures and terminologies. You'll specify these columns when using the notebook.

Adhering to these guidelines ensures your data is primed for the notebook's analytical capabilities, allowing for insightful comparisons across biological repeats and conditions.
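
Before loading your data, a quick check along the following lines can reveal missing entries and inconsistent labels in the two essential columns. This is only a sketch on a made-up table; `Treatment`, `Bio_Replicate`, and `Speed` are placeholder column names, not names required by the notebook.

```python
import pandas as pd

# Hypothetical dataset; replace the column names with your own.
df = pd.DataFrame({
    "Treatment": ["Control", "control", "Drug", "Drug", "Drug"],
    "Bio_Replicate": ["R1", "R2", "R1", "R2", None],
    "Speed": [1.2, 1.4, 2.1, 2.3, 2.0],
})

condition_col, repeat_col = "Treatment", "Bio_Replicate"

# Count missing entries in the two essential columns.
missing = df[[condition_col, repeat_col]].isna().sum()
print(missing)

# Drop rows that cannot be assigned to a condition or a repeat.
clean_df = df.dropna(subset=[condition_col, repeat_col])

# List the condition labels to spot inconsistent spellings such as "Control" vs "control".
labels = sorted(clean_df[condition_col].unique())
print(labels)
```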

In [None]:
# @title #MIT License

print("""
**MIT License**

Copyright (c) 2023 Guillaume Jacquemet

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.""")

--------------------------------------------------------
# **Part 1. Prepare the session and load your data**
--------------------------------------------------------


## **1.1. Install key dependencies**
---
<font size = 4>

In [None]:
#@markdown ##Play to install
!pip -q install pandas scikit-learn
!pip -q install hdbscan
!pip -q install umap-learn
!pip -q install plotly
!pip -q install prettytable
!pip -q install adjustText





In [None]:
#@markdown ##Play to load the dependencies

import ipywidgets as widgets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import numpy as np
import itertools
from matplotlib.gridspec import GridSpec
import requests
from tqdm.auto import tqdm  # Progress bar used by save_dataframe_with_progress

!pip freeze > requirements.txt

# Current version of the notebook the user is running
current_version = "0.1"
Notebook_name = 'dimensionality_reduction'

# URL to the raw content of the version file in the repository
version_url = "https://raw.githubusercontent.com/CellMigrationLab/Plot-Stats/main/Notebooks/latest_version.txt"

# Class holding ANSI color codes for formatting messages
class bcolors:
    WARNING = '\033[91m'  # Red color for warning messages
    ENDC = '\033[0m'      # Reset color to default

# Check if this is the latest version of the notebook
try:
    All_notebook_versions = pd.read_csv(version_url, dtype=str)
    print('Notebook version: ' + current_version)

    # Check if 'Version' column exists in the DataFrame
    if 'Version' in All_notebook_versions.columns:
        Latest_Notebook_version = All_notebook_versions[All_notebook_versions["Notebook"] == Notebook_name]['Version'].iloc[0]
        print('Latest notebook version: ' + Latest_Notebook_version)

        if current_version == Latest_Notebook_version:
            print("This notebook is up-to-date.")
        else:
            print(bcolors.WARNING + "A new version of this notebook has been released. We recommend that you download it at https://github.com/CellMigrationLab/Plot-Stats" + bcolors.ENDC)
    else:
        print("The 'Version' column is not present in the version file.")
except requests.exceptions.RequestException as e:
    print("Unable to fetch the latest version information. Please check your internet connection.")
except Exception as e:
    print("An error occurred:", str(e))




# Function to calculate Cohen's d
def cohen_d(group1, group2):
    diff = group1.mean() - group2.mean()
    n1, n2 = len(group1), len(group2)
    var1 = group1.var()
    var2 = group2.var()
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    d = diff / np.sqrt(pooled_var)
    return d

def save_dataframe_with_progress(df, path, desc="Saving", chunk_size=50000):
    """Save a DataFrame with a progress bar."""

    # Create a tqdm instance for progress tracking
    with tqdm(total=len(df), unit="rows", desc=desc) as pbar:
        # Open the file once; newline="" lets pandas control the line endings
        with open(path, "w", newline="") as f:
            # Write the header once at the beginning
            df.head(0).to_csv(f, index=False)

            # Write the rows in consecutive slices of at most chunk_size rows
            for start in range(0, len(df), chunk_size):
                chunk = df.iloc[start:start + chunk_size]
                chunk.to_csv(f, header=False, index=False)
                pbar.update(len(chunk))

def check_for_nans(df, df_name):
    """
    Checks the given DataFrame for NaN values and prints the count for each column containing NaNs.

    Args:
    df (pd.DataFrame): DataFrame to be checked for NaN values.
    df_name (str): The name of the DataFrame as a string, used for printing.
    """
    # Check if the DataFrame has any NaN values and print a warning if it does.
    nan_columns = df.columns[df.isna().any()].tolist()

    if nan_columns:
        for col in nan_columns:
            nan_count = df[col].isna().sum()
            print(f"Column '{col}' in {df_name} contains {nan_count} NaN values.")
    else:
        print(f"No NaN values found in {df_name}.")


import pandas as pd
import os

def save_parameters(params, file_path, param_type):
    # Convert params dictionary to a DataFrame for human readability
    new_params_df = pd.DataFrame(list(params.items()), columns=['Parameter', 'Value'])
    new_params_df['Type'] = param_type

    if os.path.exists(file_path):
        # Read existing file
        existing_params_df = pd.read_csv(file_path)

        # Merge the new parameters with the existing ones
        # Update existing parameters or append new ones
        updated_params_df = pd.merge(existing_params_df, new_params_df,
                                     on=['Type', 'Parameter'],
                                     how='outer',
                                     suffixes=('', '_new'))

        # If there's a new value, update it, otherwise keep the old value
        updated_params_df['Value'] = updated_params_df['Value_new'].combine_first(updated_params_df['Value'])

        # Drop the temporary new value column
        updated_params_df.drop(columns='Value_new', inplace=True)
    else:
        # Use new parameters DataFrame directly if file doesn't exist
        updated_params_df = new_params_df

    # Save the updated DataFrame to CSV
    updated_params_df.to_csv(file_path, index=False)


## **1.2. Mount your Google Drive**
---
<font size = 4> To use this notebook on the data present in your Google Drive, you need to mount your Google Drive to this notebook.

<font size = 4> Play the cell below to mount your Google Drive and follow the instructions.

<font size = 4> Once this is done, your data are available in the **Files** tab on the top left of the notebook.

In [None]:
#@markdown ##Play the cell to connect your Google Drive to Colab

from google.colab import drive
drive.mount('/content/gdrive')
%cd /content/gdrive



## **1.3. Load your dataset**
---

<font size = 4> Please ensure that your data is properly organised (see the data requirements in Part 0).


In [None]:
#@markdown ##Load your dataset:


import pandas as pd
import os
from io import StringIO
import ipywidgets as widgets
from IPython.display import display, clear_output

# Initialize dataset_df as an empty DataFrame globally
dataset_df = pd.DataFrame()


# Create widgets
dataset_path_input = widgets.Text(
    value='',
    placeholder='Enter the path to your dataset',
    description='Dataset Path:',
    layout={'width': '80%'}
)

results_folder_input = widgets.Text(
    value='',
    placeholder='Enter the path to your results folder',
    description='Results Folder:',
    layout={'width': '80%'}
)

data_textarea = widgets.Textarea(
    value='',
    placeholder='Or copy and paste your tab-separated data here (direct copy and paste from a spreadsheet)',
    description='Or Paste Data:',
    layout={'width': '80%', 'height': '200px'}
)

load_button = widgets.Button(
    description='Load Data',
    button_style='success',  # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click to load the data',
)

output = widgets.Output()

# Load data function
def load_data(b):
    global dataset_df
    global Results_Folder

    with output:
        clear_output()
        Results_Folder = results_folder_input.value.strip()
        if not Results_Folder:
            Results_Folder = './Results'  # Default path if not provided
        if not os.path.exists(Results_Folder):
            os.makedirs(Results_Folder)  # Create the folder if it doesn't exist
        print(f"Results folder is located at: {Results_Folder}")

        if dataset_path_input.value.strip():
            dataset_path = dataset_path_input.value.strip()
            try:
                dataset_df = pd.read_csv(dataset_path)
                print(f"Loaded dataset from {dataset_path}")
            except Exception as e:
                print(f"Failed to load dataset from {dataset_path}: {e}")
                return  # Stop here so the checks below do not run on stale or empty data
        elif data_textarea.value.strip():
            input_data = StringIO(data_textarea.value)
            try:
                dataset_df = pd.read_csv(input_data, sep='\t')
                print("Loaded dataset from pasted tab-separated data")
            except Exception as e:
                print(f"Failed to load dataset from pasted data: {e}")
                return  # Stop here so the checks below do not run on stale or empty data
        else:
            print("No dataset path provided or data pasted. Please provide a dataset.")
            return

        # Perform a check for NaNs or any other required processing here
        check_for_nans(dataset_df, "your dataset")

        display(dataset_df.head())

# Set the button click event
load_button.on_click(load_data)

# Display the widgets
display(widgets.VBox([dataset_path_input, results_folder_input, data_textarea, load_button, output]))


## **1.4. Map your data**
---

## Required Columns

<font size = 4>To plot your data, we need to ensure the presence of specific columns in the dataset. Here's a breakdown of the required columns:

- **`Condition`**: Identifies the biological condition.

- **`Repeat`**: Represents the biological repeat.
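
<font size = 4>The mapping step essentially renames the columns you select to `Condition` and `Repeat` and then checks that every condition contains the same repeats. The sketch below illustrates this on a made-up table; `Treatment`, `Bio_Replicate`, and `Speed` are placeholder names.

```python
import pandas as pd

# Hypothetical dataset with user-specific column names.
df = pd.DataFrame({
    "Treatment": ["Control", "Control", "Drug", "Drug"],
    "Bio_Replicate": [1, 2, 1, 2],
    "Speed": [1.2, 1.4, 2.1, 2.3],
})

# Map the user's column names onto the names the notebook expects.
column_mapping = {"Treatment": "Condition", "Bio_Replicate": "Repeat"}
mapped_df = df.rename(columns=column_mapping)

# Check that every condition contains the same set of repeats.
repeat_sets = mapped_df.groupby("Condition")["Repeat"].apply(set)
same_repeats = len(set(map(frozenset, repeat_sets))) == 1
print("Same repeats in every condition:", same_repeats)
```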


In [None]:
#@markdown ##Map your dataset:


import ipywidgets as widgets  # Ensure we have the required widgets module imported
import pandas as pd

def single_stage_column_mapping(df):
    # Define the columns we need to map: Condition, Repeat
    mappings = {
        'Condition': 'Identifies the biological conditions.',
        'Repeat': 'Represents the biological repeats.'
    }

    dropdowns = {}
    for key, description in mappings.items():
        description_label = widgets.Label(f"{key} ({description}):")
        dropdowns[key] = widgets.Dropdown(options=df.columns, layout=widgets.Layout(width='250px'))

        # Use HBox to display the description label next to the dropdown
        hbox = widgets.HBox([description_label, dropdowns[key]])
        display(hbox)

    confirm_button = widgets.Button(description="Confirm Mappings")

    def confirm_mappings(button):
        # Perform the mapping based on the user selection
        column_mapping = {dropdown.value: key for key, dropdown in dropdowns.items()}
        new_df = df.rename(columns=column_mapping)

        print("Columns Mapped Successfully!")

        # Count and print unique conditions
        unique_conditions = new_df['Condition'].unique()
        print(f"Number of unique conditions: {len(unique_conditions)}")
        print("Conditions:", ", ".join(map(str, unique_conditions)))

        # Count and print biological repeats
        unique_repeats = new_df['Repeat'].unique()
        print(f"Number of biological repeats: {len(unique_repeats)}")
        print("Repeats:", ", ".join(map(str, unique_repeats)))


        # Check that each biological condition has exactly the same repeat names
        condition_repeats = new_df.groupby('Condition')['Repeat'].apply(set)
        if len(set(map(frozenset, condition_repeats))) == 1:
            print("All biological conditions have exactly the same repeat names.")
        else:
            print("Warning: Not all biological conditions have the same repeat names.")

        # Update the global dataset_df with the new mappings
        global dataset_df
        dataset_df = new_df

    confirm_button.on_click(confirm_mappings)
    display(confirm_button)

single_stage_column_mapping(dataset_df)


--------
# **Part 2. Explore your high-dimensional data using PCA and HDBSCAN**
--------

## **2.1. Choose the Conditions to use**



In [None]:
#@markdown ##Choose the conditions to use:

import pandas as pd
import ipywidgets as widgets
from IPython.display import display
import os

PCA_folder_path = os.path.join(Results_Folder, "PCA")
if not os.path.exists(PCA_folder_path):
    os.makedirs(PCA_folder_path)

# Function to parse the text area content into a list
def parse_text_area(text):
    return [item.strip() for item in text.split(',') if item.strip()]

# Text area for user to paste the list of conditions
text_area_conditions = widgets.Textarea(
    value='',
    placeholder='Copy and paste your list of conditions here, separated by commas. Or tick the boxes below.',
    description='Conditions:',
    disabled=False,
    layout=widgets.Layout(width='80%', height='100px')
)

# Create checkboxes for each unique condition in the dataset
Condition_checkboxes = [widgets.Checkbox(value=True, description=str(condition)) for condition in dataset_df['Condition'].unique()]

# Function to filter dataframe based on selected checkbox values and text area input
def filter_dataframe(button):
    global filtered_df
    # Initialize an empty list to hold selected conditions
    selected_conditions = []

    # Check if the text area is not empty
    if text_area_conditions.value.strip():
        # Use conditions from the text area
        selected_conditions = parse_text_area(text_area_conditions.value)
    else:
        # Use conditions from checkboxes if the text area is empty
        selected_conditions = [box.description for box in Condition_checkboxes if box.value]

    # Filter DataFrame (compare as strings, since checkbox and text-area values are strings)
    filtered_df = dataset_df[dataset_df['Condition'].astype(str).isin(selected_conditions)]

    print("Selected Conditions:", selected_conditions)
    print("Filtered DataFrame length:", len(filtered_df))
    if len(filtered_df) == 0:
        print("No data matched the selected filters. Check filters and data for consistency.")

    # Save selected conditions as parameters
    params = {'Selected Conditions': ', '.join(selected_conditions)}
    params_file_path = os.path.join(PCA_folder_path, "PCA_analysis_parameters.csv")
    save_parameters(params, params_file_path, 'PCA Conditions')

# Button to trigger dataframe filtering
filter_button = widgets.Button(description="Filter Dataframe")
filter_button.on_click(filter_dataframe)

# Display text area, checkboxes, and button
display(text_area_conditions, widgets.VBox([widgets.Label('Select Conditions:')] + Condition_checkboxes + [filter_button]))


## **2.2. Choose the numeric data to use**



In [None]:
import os
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

#@markdown ##Choose the numeric data to use


PCA_folder_path = os.path.join(Results_Folder, "PCA")
if not os.path.exists(PCA_folder_path):
    os.makedirs(PCA_folder_path)

# Define columns to preserve and to exclude if present
preserve_columns = ['Condition']
exclude_columns = ['Repeat', 'Cluster_UMAP', 'Cluster_TSNE']

# Collect the numeric columns, leaving out 'Condition' and the columns listed in 'exclude_columns'
numeric_columns = [col for col in filtered_df.select_dtypes(include=['float64', 'int64']).columns
                   if col not in exclude_columns + preserve_columns]
# Keep a copy of the numeric columns together with the 'Condition' column
numeric_df = filtered_df[numeric_columns + preserve_columns].copy()

# Text area for user to paste the list of metrics
text_area = widgets.Textarea(
    value='',
    placeholder='Copy and paste your list of metrics here, separated by commas. Or tick the boxes below.',
    description='Metrics:',
    disabled=False,
    layout=widgets.Layout(width='80%', height='100px')
)

# Function to parse the text area content into a list
def parse_text_area(text):
    return [item.strip() for item in text.split(',') if item.strip() in numeric_columns]

# Create a checkbox for each numeric column (excluding 'Condition' and potentially 'Repeat')
checkboxes = [widgets.Checkbox(value=True, description=col, indent=False) for col in numeric_columns if col not in exclude_columns]

# Arrange checkboxes in a grid
grid = widgets.GridBox(checkboxes, layout=widgets.Layout(grid_template_columns="repeat(2, 300px)"))

# Create a button to trigger the selection
button = widgets.Button(description="Select Numeric Data", layout=widgets.Layout(width='400px'))

# Define the button click event handler
def on_button_click(b):
    global selected_df

    # Use metrics from the text area if not empty, else from checkboxes
    if text_area.value.strip():
        selected_columns = parse_text_area(text_area.value)
    else:
        selected_columns = [box.description for box in checkboxes if box.value]

    selected_df = numeric_df[selected_columns + preserve_columns].copy()

    # Check for NaN values and handle them
    nan_columns = selected_df.columns[selected_df.isna().any()].tolist()
    if nan_columns:
        for col in nan_columns:
            selected_df.dropna(subset=[col], inplace=True)  # Drop rows with NaN values in these columns
            print(f"Dropped rows from column '{col}' due to NaN values.")

    print("Numeric data selected and NaN values handled.")
    params = {'Selected numeric data': ', '.join(selected_columns)}
    params_file_path = os.path.join(PCA_folder_path, "PCA_analysis_parameters.csv")
    save_parameters(params, params_file_path, 'PCA numeric data')

# Set the button click event handler
button.on_click(on_button_click)

# Display the text area, grid of checkboxes, and the button
display(text_area, grid, button)


## **2.3. Perform the PCA**



In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
import pandas as pd
import plotly.express as px


# Verify 'Condition' column exists
if 'Condition' not in filtered_df.columns:
    raise ValueError("The 'Condition' column is missing from the DataFrame.")

# Check and create "PCA" folder
pca_folder_path = f"{Results_Folder}/PCA"
if not os.path.exists(pca_folder_path):
    os.makedirs(pca_folder_path)

#@markdown ###PCA parameters:
n_dimension = 2  # @param {type: "slider", min: 1, max: 3}

#@markdown ###Display parameters:
spot_size = 10 # @param {type: "number"}

# Keep only the numeric columns of 'selected_df'; non-numeric columns such as 'Condition' are left out
numeric_columns = selected_df.select_dtypes(include=['float64', 'int64']).columns
numeric_data = selected_df[numeric_columns]

# Initialize and fit StandardScaler on the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)

# Initialize PCA object with the specified settings
pca = PCA(n_components=n_dimension, random_state=42)
# Fit PCA on the scaled data
pca_result = pca.fit_transform(scaled_data)

# Create dynamic column names based on n_components
column_names = [f'PCA Dimension {i}' for i in range(1, n_dimension + 1)]

# Combine PCA results with 'Condition' column
pca_df = pd.DataFrame(pca_result, columns=column_names)
pca_df['Condition'] = selected_df['Condition'].values  # Use 'selected_df' so the rows match the data given to PCA (rows with NaN were dropped)


# Check for NaN values in the resulting DataFrame
nan_columns = pca_df.columns[pca_df.isna().any()].tolist()
if nan_columns:
    warnings.warn(f"The DataFrame contains NaN values in the following columns: {', '.join(nan_columns)}")
    pca_df.dropna(subset=nan_columns, inplace=True)

# Visualization
plt.figure(figsize=(12, 10))
sns.set(style="whitegrid")

if n_dimension == 2:
    sns.scatterplot(x=column_names[0], y=column_names[1], hue='Condition', data=pca_df, palette='Set2', s=spot_size)
    plt.title('PCA Projection to 2D')
    plt.savefig(f"{pca_folder_path}/PCA_projection_2D.pdf")
elif n_dimension == 1:
    sns.stripplot(x=column_names[0], hue='Condition', data=pca_df, palette='Set2', jitter=True, size=spot_size)
    plt.title('PCA Projection to 1D')
    plt.savefig(f"{pca_folder_path}/PCA_projection_1D.pdf")
else:  # 3D: use Plotly Express for an interactive scatter plot
    plt.close()  # Discard the empty matplotlib figure created above

    fig = px.scatter_3d(pca_df, x='PCA Dimension 1', y='PCA Dimension 2', z='PCA Dimension 3',
                        color='Condition', title='3D PCA Projection', opacity=0.7)

    # Tighten the margins around the plot
    fig.update_layout(margin=dict(l=0, r=0, b=0, t=30))

    # Show the plot
    fig.show()

    # Save the plot as an HTML file
    html_file_path = os.path.join(pca_folder_path, "PCA_3d_projection.html")
    fig.write_html(html_file_path)
    print(f"3D PCA plot saved to: {html_file_path}")

# Display the matplotlib figure (1D and 2D cases)
plt.show()


# Save parameters used in PCA
PCA_params = {
    'n_dimension': n_dimension,
    'spot_size': spot_size
}
params_file_path = os.path.join(pca_folder_path, "PCA_analysis_parameters.csv")
save_parameters(PCA_params, params_file_path, 'PCA')



## **2.4. Display the PCA Loadings**



In [None]:
#@markdown ##Display the PCA Loadings

from adjustText import adjust_text
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os

# Ensure the Results_Folder/PCA directory exists
save_path = os.path.join(Results_Folder, "PCA")
os.makedirs(save_path, exist_ok=True)

# Loadings calculation
loadings = pca.components_.T

if n_dimension == 1:
    plt.figure(figsize=(12, 10))
    sns.scatterplot(x='PCA Dimension 1', y=[0] * len(pca_df), hue='Condition', data=pca_df, palette='Set2', s=10, alpha=0.7)
    plt.xlabel('PCA Dimension 1')
    plt.title('PCA Projection to 1D with Loadings')
    # Loadings are not typically visualized in 1D PCA plots.

elif n_dimension == 2:
    plt.figure(figsize=(12, 10))
    sns.scatterplot(x='PCA Dimension 1', y='PCA Dimension 2', hue='Condition', data=pca_df, palette='Set2', s=10, alpha=0.7)
    # Scale the loadings to the range of the projected data; absolute values keep the factor positive
    scale_factor = max(pca_df['PCA Dimension 1'].abs().max(), pca_df['PCA Dimension 2'].abs().max()) / abs(loadings[:, :2]).max()
    texts = []
    for i, feature in enumerate(numeric_data.columns):
        plt.arrow(0, 0, loadings[i, 0] * scale_factor, loadings[i, 1] * scale_factor, color='r', alpha=0.5)
        texts.append(plt.text(loadings[i, 0] * scale_factor * 1.1, loadings[i, 1] * scale_factor * 1.1, feature, color='g', ha='center', va='center'))
    adjust_text(texts, arrowprops=dict(arrowstyle='->', color='gray'))
    plt.xlabel('PCA Dimension 1')
    plt.ylabel('PCA Dimension 2')
    plt.title('PCA Projection to 2D with Loadings')
    plt.grid(True)

elif n_dimension == 3:
    print("Feature not available for 3D yet.")

# Save the figure for 1D and 2D cases
if n_dimension in [1, 2]:
    plt_path = os.path.join(save_path, f'PCA_projection_with_loadings_{n_dimension}D.pdf')
    plt.savefig(plt_path, bbox_inches='tight')
    plt.show()
    print(f"Plot saved to: {plt_path}")


In [None]:
# @title ##Print the PCA loading vectors


# Now, create the DataFrame for loadings
loadings = pd.DataFrame(pca.components_.T,  # Transpose the components matrix
                        columns=[f'PCA Dimension {i}' for i in range(1, n_dimension + 1)],  # Naming columns based on the number of dimensions
                        index=numeric_columns)  # Using the numeric columns as the index

# Display the loadings DataFrame
print(loadings)

# Specify the path for the CSV file where you want to save the loadings
loadings_file_path = os.path.join(Results_Folder, "PCA", "PCA_loadings_vectors.csv")

# Save the loadings DataFrame to a CSV file
loadings.to_csv(loadings_file_path)

print(f"PCA loadings vectors saved to: {loadings_file_path}")



-------------------------------------------

# **Part 3. Explore your high-dimensional data using UMAP and HDBSCAN**
-------------------------------------------



## **3.1. Choose the Conditions to use**



In [None]:
#@markdown ##Choose the conditions to use:

import pandas as pd
import ipywidgets as widgets
from IPython.display import display
import os

UMAP_folder_path = os.path.join(Results_Folder, "UMAP")
if not os.path.exists(UMAP_folder_path):
    os.makedirs(UMAP_folder_path)

# Function to parse the text area content into a list
def parse_text_area(text):
    return [item.strip() for item in text.split(',') if item.strip()]

# Text area for user to paste the list of conditions
text_area_conditions = widgets.Textarea(
    value='',
    placeholder='Copy and paste your list of conditions here, separated by commas. Or tick the boxes below.',
    description='Conditions:',
    disabled=False,
    layout=widgets.Layout(width='80%', height='100px')
)

# Create checkboxes for each unique condition in the dataset
Condition_checkboxes = [widgets.Checkbox(value=True, description=str(condition)) for condition in dataset_df['Condition'].unique()]

# Function to filter dataframe based on selected checkbox values and text area input
def filter_dataframe(button):
    global filtered_df
    # Initialize an empty list to hold selected conditions
    selected_conditions = []

    # Check if the text area is not empty
    if text_area_conditions.value.strip():
        # Use conditions from the text area
        selected_conditions = parse_text_area(text_area_conditions.value)
    else:
        # Use conditions from checkboxes if the text area is empty
        selected_conditions = [box.description for box in Condition_checkboxes if box.value]

    # Filter DataFrame (compare as strings, since checkbox and text-area values are strings)
    filtered_df = dataset_df[dataset_df['Condition'].astype(str).isin(selected_conditions)]

    print("Selected Conditions:", selected_conditions)
    print("Filtered DataFrame length:", len(filtered_df))
    if len(filtered_df) == 0:
        print("No data matched the selected filters. Check filters and data for consistency.")

    # Save selected conditions as parameters
    params = {'Selected Conditions': ', '.join(selected_conditions)}
    params_file_path = os.path.join(UMAP_folder_path, "UMAP_analysis_parameters.csv")
    save_parameters(params, params_file_path, 'UMAP Conditions')

# Button to trigger dataframe filtering
filter_button = widgets.Button(description="Filter Dataframe")
filter_button.on_click(filter_dataframe)

# Display text area, checkboxes, and button
display(text_area_conditions, widgets.VBox([widgets.Label('Select Conditions:')] + Condition_checkboxes + [filter_button]))


## **3.2. Choose the numeric data to use**



In [None]:
import os
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

#@markdown ##Choose the numeric data to use


UMAP_folder_path = os.path.join(Results_Folder, "UMAP")
if not os.path.exists(UMAP_folder_path):
    os.makedirs(UMAP_folder_path)

# Define columns to preserve and to exclude if present
preserve_columns = ['Condition']
exclude_columns = ['Repeat', 'Cluster_UMAP', 'Cluster_TSNE']

# Collect the numeric columns, leaving out 'Condition' and the columns listed in 'exclude_columns'
numeric_columns = [col for col in filtered_df.select_dtypes(include=['float64', 'int64']).columns
                   if col not in exclude_columns + preserve_columns]
# Keep a copy of the numeric columns together with the 'Condition' column
numeric_df = filtered_df[numeric_columns + preserve_columns].copy()

# Text area for user to paste the list of metrics
text_area = widgets.Textarea(
    value='',
    placeholder='Copy and paste your list of metrics here, separated by commas. Or tick the boxes below',
    description='Metrics:',
    disabled=False,
    layout=widgets.Layout(width='80%', height='100px')
)

# Function to parse the text area content into a list
def parse_text_area(text):
    return [item.strip() for item in text.split(',') if item.strip() in numeric_columns]

# Create a checkbox for each numeric column (excluding 'Condition' and potentially 'Repeat')
checkboxes = [widgets.Checkbox(value=True, description=col, indent=False) for col in numeric_columns if col not in exclude_columns]

# Arrange checkboxes in a grid
grid = widgets.GridBox(checkboxes, layout=widgets.Layout(grid_template_columns="repeat(2, 300px)"))

# Create a button to trigger the selection
button = widgets.Button(description="Select Numeric Data", layout=widgets.Layout(width='400px'))

# Define the button click event handler
def on_button_click(b):
    global selected_df

    # Use metrics from the text area if not empty, else from checkboxes
    if text_area.value.strip():
        selected_columns = parse_text_area(text_area.value)
    else:
        selected_columns = [box.description for box in checkboxes if box.value]

    selected_df = numeric_df[selected_columns + preserve_columns].copy()

    # Check for NaN values and handle them
    nan_columns = selected_df.columns[selected_df.isna().any()].tolist()
    if nan_columns:
        for col in nan_columns:
            selected_df.dropna(subset=[col], inplace=True)  # Drop rows with NaN values in these columns
            print(f"Dropped rows from column '{col}' due to NaN values.")

    print("Numeric data selected and NaN values handled.")
    params = {'Selected numeric data': ', '.join(selected_columns)}
    params_file_path = os.path.join(UMAP_folder_path, "UMAP_analysis_parameters.csv")
    save_parameters(params, params_file_path, 'UMAP numeric data')

# Set the button click event handler
button.on_click(on_button_click)

# Display the text area, grid of checkboxes, and the button
display(text_area, grid, button)


## **3.3. UMAP**
---

<font size = 4>This cell performs UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction on the numeric columns selected above and visualizes the result. The parameters `n_neighbors`, `min_dist`, and `n_dimension` determine the structure and appearance of the resulting low-dimensional representation of the data.

<font size = 4>`n_neighbors`: This parameter controls how UMAP balances local versus global structure in the data. It determines the size of the local neighborhood UMAP will look at when learning the manifold structure of the data.
- A smaller value emphasizes the local structure of the data, potentially at the expense of the global structure.
- A larger value lets UMAP consider more distant neighbors, emphasizing the global structure of the data.
- Typically, values in the range of 5 to 50 are chosen, depending on the density and scale of the data.

<font size = 4>`min_dist`: This parameter controls how tightly UMAP is allowed to pack points together. It determines the minimum distance between points in the low-dimensional representation.
- Setting it to a low value will allow points to be packed more closely, potentially revealing clusters in the data.
- A higher value ensures that points are more spread out in the representation.
- Values usually range between 0 and 1.

<font size = 4>`n_dimension`: This parameter sets the number of dimensions of the low-dimensional space that the data is reduced to. It is passed to UMAP as `n_components`.
For visualization purposes, `n_dimension` is typically set to 2 or 3 to obtain 2D or 3D representations, respectively.
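
<font size = 4>The ranges described above can be turned into a quick sanity check before running UMAP. The sketch below is only illustrative: `check_umap_params` and its `n_samples` argument are hypothetical names that are not part of this notebook or of the UMAP library, and the lower bound of 2 for `n_neighbors` is an assumption about what a meaningful neighborhood needs.

```python
def check_umap_params(n_neighbors, min_dist, n_dimension, n_samples):
    """Return a list of problems found; an empty list means the values look sane."""
    problems = []
    # A neighborhood needs at least 2 points and cannot exceed the dataset size
    if not 2 <= n_neighbors < n_samples:
        problems.append(
            f"n_neighbors={n_neighbors} should be at least 2 and smaller than the number of rows ({n_samples})."
        )
    # The text above gives 0 to 1 as the usual range for min_dist
    if not 0 <= min_dist <= 1:
        problems.append(f"min_dist={min_dist} should lie between 0 and 1.")
    # The notebook slider only offers 1, 2 or 3 dimensions
    if n_dimension not in (1, 2, 3):
        problems.append(f"n_dimension={n_dimension} should be 1, 2 or 3.")
    return problems


# Values matching the notebook defaults on a dataset of 500 rows
print(check_umap_params(n_neighbors=10, min_dist=0, n_dimension=2, n_samples=500))

# Values that violate all three checks
for problem in check_umap_params(n_neighbors=200, min_dist=1.5, n_dimension=4, n_samples=50):
    print(problem)
```

<font size = 4>In this notebook, `n_samples` would correspond to `len(selected_df)`.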


In [None]:
import umap
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as pyo
import plotly.express as px
import pandas as pd
import os
import warnings

#@markdown ###UMAP parameters:

n_neighbors = 10  # @param {type: "number"}
min_dist = 0  # @param {type: "number"}
n_dimension = 2  # @param {type: "slider", min: 1, max: 3}

#@markdown ###Display parameters:
spot_size = 15 # @param {type: "number"}

# Assuming 'selected_df' is correctly defined and includes 'Condition' column
if 'Condition' not in selected_df.columns:
    raise KeyError("The 'Condition' column is missing from the selected_df DataFrame.")


# Initialize UMAP object with the specified settings
reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist, n_components=n_dimension, random_state=42)

# Assuming 'numeric_columns' contains the names of numeric columns used for UMAP
numeric_columns = selected_df.select_dtypes(include=['float64', 'int64']).columns.tolist()
numeric_data = selected_df[numeric_columns]

# Fit UMAP and transform the data
embedding = reducer.fit_transform(numeric_data)

# Create dynamic column names based on n_dimension
column_names = [f'UMAP Dimension {i}' for i in range(1, n_dimension + 1)]

# Concatenate the UMAP embedding with the 'Condition' column
umap_df = pd.concat([pd.DataFrame(embedding, columns=column_names), selected_df[['Condition']].reset_index(drop=True)], axis=1)

# Prepare the parameters dictionary for saving
UMAP_params = {'n_neighbors': n_neighbors, 'min_dist': min_dist, 'n_dimension': n_dimension}

# Save the parameters
params_file_path = os.path.join(Results_Folder, "UMAP/UMAP_analysis_parameters.csv")

save_parameters(UMAP_params, params_file_path, 'UMAP parameters')

# Visualize the UMAP projection
if n_dimension in [1, 2]:
    plt.figure(figsize=(12, 10))
    sns.scatterplot(x=column_names[0], y=column_names[1] if n_dimension == 2 else [0]*len(umap_df), hue='Condition', data=umap_df, palette='Set2', s=spot_size)
    plt.title(f'UMAP Projection to {n_dimension}D')
    plt.savefig(os.path.join(Results_Folder, f"UMAP/UMAP_projection_{n_dimension}D.pdf"))
    plt.show()
elif n_dimension == 3:
    fig = px.scatter_3d(umap_df, x='UMAP Dimension 1', y='UMAP Dimension 2', z='UMAP Dimension 3', color='Condition', size_max=spot_size)
    fig.update_traces(marker=dict(size=spot_size))
    fig.show()
    html_file_path = os.path.join(Results_Folder, "UMAP/umap_projection_3D.html")
    pyo.plot(fig, filename=html_file_path, auto_open=False)
    print(f"UMAP 3D projection saved to: {html_file_path}")
else:
    warnings.warn("Invalid number of dimensions for UMAP projection.")


## **3.4. HDBSCAN**
---

<font size = 4> The provided code employs HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to identify clusters within a dataset that has already undergone UMAP dimensionality reduction. HDBSCAN is utilized for its proficiency in determining the optimal number of clusters while managing varied densities within the data.

<font size = 4>In the provided HDBSCAN code, the parameters `min_samples`, `min_cluster_size`, and `metric` are crucial for determining the structure and appearance of the resulting clusters in the data.

<font size = 4>`min_samples`: This parameter primarily controls the degree to which the algorithm is willing to declare noise. It's the number of samples in a neighborhood for a point to be considered as a core point.
- A smaller value of `min_samples` makes the algorithm more prone to declaring points as part of a cluster, potentially leading to larger clusters and fewer noise points.
- A larger value makes the algorithm more conservative, resulting in more points declared as noise and smaller, more defined clusters.
- The choice of `min_samples` typically depends on the density of the data; denser datasets may require a larger value.
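
<font size = 4>To judge whether `min_samples` is too strict or too permissive, it can help to check how many points end up as noise. HDBSCAN marks noise points with the label `-1`. The sketch below summarizes a list of labels; `summarize_labels` and the toy labels are hypothetical and only serve as an illustration.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize cluster labels where -1 denotes noise."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # remove the noise label from the cluster counts
    n_total = len(labels)
    return {
        "n_clusters": len(counts),
        "noise_fraction": n_noise / n_total if n_total else 0.0,
        "cluster_sizes": dict(sorted(counts.items())),
    }


# Toy labels: three clusters (0, 1, 2) and two noise points
toy_labels = [0, 0, 1, -1, 1, 1, -1, 0, 2, 2]
summary = summarize_labels(toy_labels)
print(summary)
```

<font size = 4>After the clustering cell has run, the same helper could be applied to `clusterer.labels_.tolist()`. A high noise fraction suggests trying a smaller `min_samples`.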

<font size = 4>`min_cluster_size`: This parameter determines the smallest size grouping that you wish to consider a cluster.
- A smaller value will allow the formation of smaller clusters, whereas a larger value will prevent small isolated groups of points from being declared as clusters.
- The choice of `min_cluster_size` depends on the scale of the data and the desired level of granularity in the clustering.

<font size = 4>`metric`: This parameter is the metric used for distance computation between data points, and it affects the shape of the clusters.
- The `euclidean` metric is a good starting point, and depending on the clustering results and the data type, it might be beneficial to experiment with different metrics.
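
<font size = 4>To give a feel for how the metrics offered in this notebook differ, the sketch below computes each of them by hand for two small vectors. The functions are plain-Python illustrations written for this example, not the implementations used internally by HDBSCAN.

```python
def euclidean(a, b):
    # Straight-line distance
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    # Sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    # Largest absolute difference along any single dimension
    return max(abs(x - y) for x, y in zip(a, b))


def braycurtis(a, b):
    # Sum of absolute differences divided by the sum of absolute sums
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(abs(x + y) for x, y in zip(a, b))


def canberra(a, b):
    # Manhattan distance where each term is weighted by the magnitude of the values;
    # terms where both values are zero are skipped
    return sum(abs(x - y) / (abs(x) + abs(y)) for x, y in zip(a, b) if abs(x) + abs(y) > 0)


a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

for metric_fn in (euclidean, manhattan, chebyshev, braycurtis, canberra):
    print(f"{metric_fn.__name__:<12}{metric_fn(a, b):.4f}")
```

<font size = 4>Because Bray-Curtis and Canberra divide by the magnitude of the values, they react more strongly to differences between small values than Euclidean or Manhattan distances do.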


In [None]:
# @title ##Run to see more information about the available metrics
print("""
Metric                   Description                                                               Suitable For
-------------------------------------------------------------------------------------------------------------------------------------------------------
Euclidean                Standard distance metric.                                                 Numerical data.
Manhattan                Sum of absolute differences.                                              Numerical/Categorical data.
Chebyshev                Maximum value of absolute differences.                                    Numerical data.
Bray-Curtis              Dissimilarity between sample sets.                                        Numerical data.
Canberra                 Weighted version of Manhattan distance.                                   Numerical data.

""")


In [None]:
import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pandas as pd
import numpy as np
import os

# @title ##Identify clusters using HDBSCAN

#@markdown ###HDBSCAN parameters:
clustering_data_source = 'umap'  # @param ['umap', 'raw']
min_samples = 10  # @param {type: "number"}
min_cluster_size = 100  # @param {type: "number"}
metric = "euclidean"  # @param ['euclidean', 'manhattan', 'chebyshev', 'braycurtis', 'canberra']

#@markdown ###Display parameters:
spot_size = 20 # @param {type: "number"}

umap_folder_path = os.path.join(Results_Folder, "UMAP")
os.makedirs(umap_folder_path, exist_ok=True)

# Initialize HDBSCAN with specified settings
clusterer = hdbscan.HDBSCAN(min_samples=min_samples, min_cluster_size=min_cluster_size, metric=metric)

# Fit HDBSCAN based on the specified data source
if clustering_data_source == 'umap':
    # Construct a list of UMAP dimension column names based on n_dimension
    umap_columns = [f'UMAP Dimension {i}' for i in range(1, n_dimension + 1)]
    data_for_clustering = umap_df[umap_columns]
else:
    # Use selected numeric data for clustering if 'raw' data source is chosen
    data_for_clustering = selected_df.select_dtypes(include=[np.number])

# Apply clustering
clusterer.fit(data_for_clustering)

# Add the cluster labels to umap_df
umap_df['Cluster_UMAP'] = clusterer.labels_

# Plotting the results based on n_dimension
plot_title = 'Clusters Identified by HDBSCAN'
if n_dimension == 1:
    plt.figure(figsize=(12, 6))
    sns.stripplot(x='UMAP Dimension 1', hue='Cluster_UMAP', data=umap_df, palette='viridis', size=spot_size)
    plt.title(f'{plot_title} (1D)')
elif n_dimension == 2:
    plt.figure(figsize=(12, 10))
    sns.scatterplot(x='UMAP Dimension 1', y='UMAP Dimension 2', hue='Cluster_UMAP', data=umap_df, palette='viridis', s=spot_size)
    plt.title(f'{plot_title} (2D)')
elif n_dimension == 3:
    fig = px.scatter_3d(umap_df, x='UMAP Dimension 1', y='UMAP Dimension 2', z='UMAP Dimension 3', color='Cluster_UMAP', size_max=spot_size)
    fig.update_traces(marker=dict(size=spot_size/10))
    fig.show()
    # Save the 3D plot as HTML
    html_file_path = os.path.join(umap_folder_path, "HDBSCAN_clusters_3D.html")
    fig.write_html(html_file_path, auto_open=False)
    print(f"3D cluster plot saved to: {html_file_path}")
else:
    print("Invalid n_dimension value for clustering visualization.")


# Save the plots as PDF for 1D and 2D visualizations
if n_dimension in [1, 2]:
    pdf_file_path = os.path.join(umap_folder_path, f"HDBSCAN_clusters_{n_dimension}D.pdf")
    plt.savefig(pdf_file_path)
    plt.show()
    print(f"{n_dimension}D cluster plot saved to: {pdf_file_path}")

dataset_df.reset_index(drop=True, inplace=True)
umap_df.reset_index(drop=True, inplace=True)

# If the Cluster column already exists in dataset_df, drop it to avoid duplications
if 'Cluster_UMAP' in dataset_df.columns:
    dataset_df.drop(columns='Cluster_UMAP', inplace=True)

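# Join the 'Cluster_UMAP' column to dataset_df by row position.
# Note: this assumes umap_df rows line up with dataset_df rows, i.e. that no rows
# were removed by the condition filter or by the NaN handling above.
dataset_df['Cluster_UMAP'] = umap_df['Cluster_UMAP']

# Rows without a cluster assignment are marked as -1
dataset_df['Cluster_UMAP'] = dataset_df['Cluster_UMAP'].fillna(-1)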

# Save the DataFrame with the identified clusters
dataset_df.to_csv(f"{Results_Folder}/merged_dataset_clusters.csv", index=False)


# Save clustering parameters and results
params_file_path = os.path.join(umap_folder_path, "UMAP_analysis_parameters.csv")
HDBSCAN_params = {
    'clustering_data_source': clustering_data_source,
    'min_samples': min_samples,
    'min_cluster_size': min_cluster_size,
    'metric': metric
}
save_parameters(HDBSCAN_params, params_file_path, 'HDBSCAN parameters')




## **3.5. Fingerprint**
---

<font size = 4>This section is designed to visualize the distribution of different clusters within each condition in a dataset, showing the 'fingerprint' of each cluster per condition.
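
<font size = 4>As a small illustration of what a fingerprint contains, the sketch below computes the percentage of points falling in each cluster for two toy conditions. It uses `pd.crosstab` as a compact alternative to the group-by, merge, and pivot steps of the cell below; the toy data and the variable names `toy` and `fingerprint` are made up for this example.

```python
import pandas as pd

# Toy data: one row per observation, with its condition and its cluster label (-1 = noise)
toy = pd.DataFrame({
    "Condition": ["Control", "Control", "Control", "Control", "Drug", "Drug", "Drug", "Drug"],
    "Cluster_UMAP": [0, 0, 1, -1, 1, 1, 1, 0],
})

# Rows are conditions, columns are clusters, values are percentages within each condition
fingerprint = pd.crosstab(toy["Condition"], toy["Cluster_UMAP"], normalize="index") * 100
print(fingerprint)
```

<font size = 4>Each row sums to 100, so conditions with different numbers of observations can be compared directly.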

In [None]:
# @title ##Plot the 'fingerprint' of each cluster per condition

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Group by 'Condition' and 'Cluster' and calculate the size of each group
cluster_counts = umap_df.groupby(['Condition', 'Cluster_UMAP']).size().reset_index(name='counts')

# Calculate the total number of points per condition
total_counts = umap_df.groupby('Condition').size().reset_index(name='total_counts')

# Merge the DataFrames on 'Condition' to calculate percentages
percentage_df = pd.merge(cluster_counts, total_counts, on='Condition')
percentage_df['percentage'] = (percentage_df['counts'] / percentage_df['total_counts']) * 100

# Save the percentage_df DataFrame as a CSV file
percentage_df.to_csv(Results_Folder+'/UMAP/UMAP_percentage_results.csv', index=False)

# Pivot the percentage_df to have Conditions as index, Clusters as columns, and percentages as values
pivot_df = percentage_df.pivot(index='Condition', columns='Cluster_UMAP', values='percentage')

# Fill NaN values with 0 if any, as there might be some Condition-Cluster combinations that are not present
pivot_df.fillna(0, inplace=True)

# Initialize PDF
pdf_pages = PdfPages(Results_Folder+'/UMAP/UMAP_Cluster_Fingerprint_Plot.pdf')

# Plotting
fig, ax = plt.subplots(figsize=(10, 7))
pivot_df.plot(kind='bar', stacked=True, ax=ax, colormap='viridis')
plt.title('Percentage in each cluster per Condition')
plt.ylabel('Percentage')
plt.xlabel('Condition')
plt.xticks(rotation=90)
plt.tight_layout()

ax.legend(title='Cluster_UMAP', loc='upper left', bbox_to_anchor=(1, 1))


# Save the figure to a PDF
pdf_pages.savefig(fig, bbox_inches='tight')

# Close the PDF
pdf_pages.close()

# Display the plot
plt.show()



## **3.6. Understand your clusters using heatmaps**
--------

<font size = 4>This section helps visualize how the numeric parameters vary across the identified clusters. The variations are displayed as a heatmap, a color-coded representation of the median value of each parameter in each cluster, normalized per parameter using a Z-score. This makes it easier to spot differences or patterns among the clusters.
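
<font size = 4>The normalization can be sketched on a toy table as follows: take the median of each variable per cluster, then rescale each variable (row) so that its values across clusters have a mean of 0 and a standard deviation of 1. The toy data and variable names are made up, and the Z-score is written out with pandas arithmetic using the population standard deviation (`ddof=0`), which is the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Toy data: two variables measured in three clusters
toy = pd.DataFrame({
    "Cluster_UMAP": [0, 0, 1, 1, 2, 2],
    "speed": [1.0, 3.0, 4.0, 6.0, 7.0, 9.0],
    "area": [10.0, 10.0, 20.0, 40.0, 60.0, 60.0],
})

# Median of each variable per cluster; transpose so that rows are variables
medians = toy.groupby("Cluster_UMAP")[["speed", "area"]].median().transpose()

# Row-wise Z-score: subtract the row mean and divide by the row standard deviation
row_mean = medians.mean(axis=1)
row_std = medians.std(axis=1, ddof=0)
z = medians.sub(row_mean, axis=0).div(row_std, axis=0)

print(medians)
print(z.round(2))
```

<font size = 4>Because each row is normalized separately, the colors of the heatmap show in which clusters a variable is relatively high or low, not its absolute value.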


In [None]:
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import pandas as pd
from scipy.stats import zscore

# @title ##Plot normalized numeric data based on clusters as a heatmap

# Parameters to adapt depending on the notebook section
base_folder = f"{Results_Folder}/UMAP"
Conditions = 'Cluster_UMAP'

def get_selectable_columns(df):
    # Exclude certain columns from being plotted
    exclude_cols = ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION', 'Exemplar','TRACK_STOP', 'TRACK_START', 'Cluster_UMAP', 'Cluster_tsne', 'Cluster_TSNE']
    # Select only numerical columns
    return [col for col in df.columns if (df[col].dtype.kind in 'biufc') and (col not in exclude_cols)]


def heatmap_comparison(df, Results_Folder, Conditions, variables_per_page=40):
    # Get all the selectable columns
    variables_to_plot = get_selectable_columns(df)

    # Drop rows that contain a NaN in any column
    df = df.dropna()

    # Compute median for each variable across Clusters
    median_values = df.groupby(Conditions)[variables_to_plot].median().transpose()

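    # Normalize the median values using Z-score (one row per variable, across clusters);
    # result_type='broadcast' keeps the original index and columns of the DataFrame
    normalized_values = median_values.apply(zscore, axis=1, result_type='broadcast')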

    # Number of pages
    total_variables = len(variables_to_plot)
    num_pages = int(np.ceil(total_variables / variables_per_page))

    # Initialize an empty DataFrame to store all pages' data
    all_pages_data = pd.DataFrame()

    # Create a PDF file to save the heatmaps
    with PdfPages(f"{Results_Folder}/Heatmaps_Normalized_Median_Values_by_Cluster.pdf") as pdf:
        for page in range(num_pages):
            start = page * variables_per_page
            end = min(start + variables_per_page, total_variables)
            page_data = normalized_values.iloc[start:end]

            # Append this page's data to the all_pages_data DataFrame
            all_pages_data = pd.concat([all_pages_data, page_data])

            plt.figure(figsize=(16, 10))
            sns.heatmap(page_data, cmap='coolwarm', annot=True, linewidths=.1)
            plt.title(f"Z-score Normalized Median Values of Variables by {Conditions} (Page {page + 1})")
            plt.tight_layout()

            pdf.savefig()  # saves the current figure into a pdf page
            plt.show()
            plt.close()

    # Save all pages data to a single CSV file
    all_pages_data.to_csv(f"{Results_Folder}/Normalized_Median_Values_by_Condition.csv")

    print(f"Heatmaps saved to {Results_Folder}/Heatmaps_Normalized_Median_Values_by_Cluster.pdf")
    print(f"All data saved to {Results_Folder}/Normalized_Median_Values_by_Condition.csv")

# Example usage
heatmap_comparison(dataset_df, base_folder, Conditions)

## **3.7. Understand your clusters using box plots**
--------

<font size = 4>This section shows how each selected numeric variable is distributed across the identified clusters. For every variable you select, a box plot with the individual data points overlaid displays the spread of its values in each cluster, making it easy to compare variation within and between clusters.
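
<font size = 4>The numbers behind a box plot can also be computed directly. The sketch below calculates the quartiles and the interquartile range (IQR) of one variable per cluster on toy data; the column `speed` and the variable names are made up for this example.

```python
import pandas as pd

# Toy data: one numeric variable measured in two clusters
toy = pd.DataFrame({
    "Cluster_UMAP": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "speed": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Quartiles per cluster: one row per cluster, one column per quantile
summary = toy.groupby("Cluster_UMAP")["speed"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["Q1", "median", "Q3"]

# The IQR is the height of the box in a box plot
summary["IQR"] = summary["Q3"] - summary["Q1"]
print(summary)
```

<font size = 4>Comparing the medians and IQRs across clusters gives a numeric counterpart to the plots produced by the cell below.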




In [None]:
import os
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.gridspec import GridSpec
import pandas as pd
import ipywidgets as widgets

# @title ##Plot numeric data based on clusters

# Create the output folder for the per-cluster data and plots
os.makedirs(f"{Results_Folder}/UMAP/Data_by_clusters", exist_ok=True)

def get_selectable_columns(df):
    # Exclude certain columns from being plotted
    exclude_cols = ['Condition', 'Repeat', 'Cluster_UMAP', 'Cluster_TSNE']
    # Select only numerical columns
    return [col for col in df.columns if (df[col].dtype.kind in 'biufc') and (col not in exclude_cols)]

def display_variable_checkboxes(selectable_columns):
    # Create checkboxes for selectable columns
    variable_checkboxes = [widgets.Checkbox(value=False, description=col) for col in selectable_columns]

    # Display checkboxes in the notebook
    display(widgets.VBox([
        widgets.Label('Variables to Plot:'),
        widgets.GridBox(variable_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 300px)" % 3)),
    ]))
    return variable_checkboxes

def plot_selected_vars(button, variable_checkboxes, df, Results_Folder):
    print("Plotting in progress...")

    # Get selected variables
    variables_to_plot = [box.description for box in variable_checkboxes if box.value]
    n_plots = len(variables_to_plot)

    if n_plots == 0:
        print("No variables selected for plotting")
        return

    for var in variables_to_plot:
        # Extract data for the specific variable and cluster
        data_to_save = df[['Cluster_UMAP', var]]

        # Save data for the plot to CSV
        data_to_save.to_csv(f"{Results_Folder}/UMAP/Data_by_clusters/{var}_data_by_Cluster.csv", index=False)

        plt.figure(figsize=(16, 10))

        # Plotting
        sns.boxplot(x='Cluster_UMAP', y=var, data=df, color='lightgray')  # Boxplot by cluster
        sns.stripplot(x='Cluster_UMAP', y=var, data=df, jitter=True, alpha=0.2)  # Individual data points

        plt.title(f"{var} by Cluster")
        plt.xlabel('Cluster_UMAP')
        plt.ylabel(var)
        plt.xticks(rotation=90)
        plt.tight_layout()

        # Save the plot
        plt.savefig(f"{Results_Folder}/UMAP/Data_by_clusters/{var}_Boxplots_by_Cluster.pdf")
        plt.show()

selectable_columns = get_selectable_columns(dataset_df)
variable_checkboxes = display_variable_checkboxes(selectable_columns)

# Create and display the plot button
button = widgets.Button(description="Plot Selected Variables", layout=widgets.Layout(width='400px'))
button.on_click(lambda b: plot_selected_vars(b, variable_checkboxes, dataset_df, Results_Folder))
display(button)


## **3.8. Plot numeric data for a selected cluster**
---
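
<font size = 4>The cell below compares conditions within one selected cluster using an effect size (Cohen's d), a randomization test, and a Bonferroni correction. The sketch below illustrates these three steps on toy data. The `cohen_d` function used by the notebook is defined elsewhere, so the pooled-standard-deviation version shown here is an assumption, and `randomization_p_value` and the toy groups are hypothetical names introduced for this example.

```python
import itertools
import random
import statistics


def cohen_d(x, y):
    """Cohen's d using a pooled standard deviation (assumed definition)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / pooled_var ** 0.5


def randomization_p_value(x, y, n_iterations=1000, seed=0):
    """Fraction of random relabelings whose |d| is at least as large as the observed |d|."""
    rng = random.Random(seed)
    observed = abs(cohen_d(x, y))
    pooled = list(x) + list(y)
    count_extreme = 0
    for _ in range(n_iterations):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        if abs(cohen_d(pooled[:len(x)], pooled[len(x):])) >= observed:
            count_extreme += 1
    return count_extreme / n_iterations


# Toy measurements for three conditions
groups = {
    "Control": [1.0, 1.2, 0.9, 1.1, 1.0, 0.8],
    "DrugA": [2.0, 2.2, 1.9, 2.1, 2.3, 1.8],
    "DrugB": [1.1, 0.9, 1.0, 1.2, 1.0, 0.9],
}

pairs = list(itertools.combinations(groups, 2))
results = {}
for cond1, cond2 in pairs:
    d = abs(cohen_d(groups[cond1], groups[cond2]))
    p = randomization_p_value(groups[cond1], groups[cond2])
    p_bonferroni = min(p * len(pairs), 1.0)  # Bonferroni: multiply by the number of comparisons
    results[(cond1, cond2)] = (d, p, p_bonferroni)
    print(f"{cond1} vs {cond2}: d={d:.2f}, p={p:.3f}, Bonferroni p={p_bonferroni:.3f}")
```

<font size = 4>With 1000 iterations, the smallest non-zero p-value that can be reported is 0.001, so a p-value of 0 should be read as "below 0.001".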

In [None]:
# @title ##Plot numeric data for a selected cluster

import itertools
import ipywidgets as widgets
from ipywidgets import Layout, VBox, Button, Accordion, SelectMultiple, IntText
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.gridspec import GridSpec
from matplotlib.ticker import FixedLocator


# Parameters to adapt depending on the notebook section
base_folder = f"{Results_Folder}/UMAP/Data_selected_cluster"
Conditions = 'Condition'
df_to_plot = dataset_df

# Check and create necessary directories
folders = ["pdf", "csv"]
for folder in folders:
    dir_path = os.path.join(base_folder, folder)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

def get_selectable_columns(df):
    # Exclude certain columns from being plotted
    exclude_cols = ['Condition', 'Repeat', 'Cluster_UMAP', 'Cluster_TSNE']
    # Select only numerical columns
    return [col for col in df.columns if (df[col].dtype.kind in 'biufc') and (col not in exclude_cols)]


def display_cluster_dropdown(df):
    # Extract unique clusters
    unique_clusters = df['Cluster_UMAP'].unique()
    cluster_dropdown = widgets.Dropdown(
        options=unique_clusters,
        description='Select Cluster:',
        disabled=False,
    )
    display(cluster_dropdown)
    return cluster_dropdown


def display_variable_checkboxes(selectable_columns):
    # Create checkboxes for selectable columns
    variable_checkboxes = [widgets.Checkbox(value=False, description=col) for col in selectable_columns]

    # Display checkboxes in the notebook
    display(widgets.VBox([
        widgets.Label('Variables to Plot:'),
        widgets.GridBox(variable_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 300px)" % 3)),
    ]))
    return variable_checkboxes

def create_condition_selector(df, column_name):
    conditions = df[column_name].unique()
    condition_selector = SelectMultiple(
        options=conditions,
        description='Conditions:',
        disabled=False,
        layout=Layout(width='80%')  # Adjusting the layout width
    )
    return condition_selector

def display_condition_selection(df, column_name):
    condition_selector = create_condition_selector(df, column_name)

    condition_accordion = Accordion(children=[VBox([condition_selector])])
    condition_accordion.set_title(0, 'Select Conditions')
    display(condition_accordion)
    return condition_selector


def plot_selected_vars(button, variable_checkboxes, df, Conditions, cluster_dropdown, Results_Folder, condition_selector):

    selected_cluster = cluster_dropdown.value
    print(f"Plotting in progress for Cluster {selected_cluster}...")

    plt.clf()  # Clear the current figure before creating a new plot


  # Get selected variables
    variables_to_plot = [box.description for box in variable_checkboxes if box.value]
    n_plots = len(variables_to_plot)

    if n_plots == 0:
        print("No variables selected for plotting")
        return

  # Get selected conditions
    selected_conditions = condition_selector.value
    n_selected_conditions = len(selected_conditions)

    if n_selected_conditions == 0:
        print("No conditions selected for plotting")
        return

# Use only selected and ordered conditions
    filtered_df = df[(df[Conditions].isin(selected_conditions)) & (df['Cluster_UMAP'] == selected_cluster)].copy()

# Initialize matrices to store effect sizes and p-values for each variable
    effect_size_matrices = {}
    p_value_matrices = {}
    bonferroni_matrices = {}

    unique_conditions = filtered_df[Conditions].unique().tolist()
    num_comparisons = len(unique_conditions) * (len(unique_conditions) - 1) // 2
    alpha = 0.05
    # Guard against a single selected condition (zero pairwise comparisons)
    corrected_alpha = alpha / num_comparisons if num_comparisons else alpha
    n_iterations = 1000

# Loop through each variable to plot
    for var in variables_to_plot:

      pdf_pages = PdfPages(f"{Results_Folder}/pdf/Cluster_{selected_cluster}_{var}_Boxplots_and_Statistics.pdf")
      effect_size_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)
      p_value_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)
      bonferroni_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)

      for cond1, cond2 in itertools.combinations(unique_conditions, 2):
        # Compare the two conditions within the selected cluster only
        group1 = filtered_df[filtered_df[Conditions] == cond1][var]
        group2 = filtered_df[filtered_df[Conditions] == cond2][var]

        original_d = abs(cohen_d(group1, group2))
        effect_size_matrix.loc[cond1, cond2] = original_d
        effect_size_matrix.loc[cond2, cond1] = original_d  # Mirroring

        count_extreme = 0
        for i in range(n_iterations):
            combined = pd.concat([group1, group2])
            shuffled = combined.sample(frac=1, replace=False).reset_index(drop=True)
            new_group1 = shuffled[:len(group1)]
            new_group2 = shuffled[len(group1):]

            new_d = abs(cohen_d(new_group1, new_group2))
            if np.abs(new_d) >= np.abs(original_d):
                count_extreme += 1

        p_value = count_extreme / n_iterations
        p_value_matrix.loc[cond1, cond2] = p_value
        p_value_matrix.loc[cond2, cond1] = p_value  # Mirroring

        # Apply Bonferroni correction
        bonferroni_corrected_p_value = min(p_value * num_comparisons, 1.0)
        bonferroni_matrix.loc[cond1, cond2] = bonferroni_corrected_p_value
        bonferroni_matrix.loc[cond2, cond1] = bonferroni_corrected_p_value  # Mirroring

      effect_size_matrices[var] = effect_size_matrix
      p_value_matrices[var] = p_value_matrix
      bonferroni_matrices[var] = bonferroni_matrix

    # Concatenate the three matrices side-by-side
      combined_df = pd.concat(
        [
            effect_size_matrices[var].rename(columns={col: f"{col} (Effect Size)" for col in effect_size_matrices[var].columns}),
            p_value_matrices[var].rename(columns={col: f"{col} (P-Value)" for col in p_value_matrices[var].columns}),
            bonferroni_matrices[var].rename(columns={col: f"{col} (Bonferroni-corrected P-Value)" for col in bonferroni_matrices[var].columns})
        ], axis=1
    )

    # Save the combined DataFrame to a CSV file
      combined_df.to_csv(f"{Results_Folder}/csv/Cluster_{selected_cluster}_{var}_statistics_combined.csv")

    # Create a new figure
      fig = plt.figure(figsize=(16, 10))

    # Create a gridspec for 2 rows and 3 columns
      gs = GridSpec(2, 3, height_ratios=[1.5, 1])

    # Create the ax for boxplot using the gridspec
      ax_box = fig.add_subplot(gs[0, :])

    # Extract the data for this variable (selected cluster and conditions only, as plotted)
      data_for_var = filtered_df[[Conditions, var, 'Repeat']]

    # Save the data_for_var to a CSV for replotting
      data_for_var.to_csv(f"{Results_Folder}/csv/Cluster_{selected_cluster}_{var}_boxplot_data.csv", index=False)

    # Calculate the Interquartile Range (IQR) using the 25th and 75th percentiles
      Q1 = df[var].quantile(0.25)
      Q3 = df[var].quantile(0.75)
      IQR = Q3 - Q1

    # Define bounds for the outliers
      multiplier = 10
      lower_bound = Q1 - multiplier * IQR
      upper_bound = Q3 + multiplier * IQR

    # Plotting
      sns.boxplot(x=Conditions, y=var, data=filtered_df, ax=ax_box, color='lightgray')  # Boxplot
      sns.stripplot(x=Conditions, y=var, data=filtered_df, ax=ax_box, hue='Repeat', dodge=True, jitter=True, alpha=0.2)  # Individual data points
      ax_box.set_ylim([max(min(filtered_df[var]), lower_bound), min(max(filtered_df[var]), upper_bound)])
      ax_box.set_title(f"{var} for Cluster {selected_cluster}")
      ax_box.set_xlabel('Condition')
      ax_box.set_ylabel(var)
      tick_labels = ax_box.get_xticklabels()
      tick_locations = ax_box.get_xticks()
      ax_box.xaxis.set_major_locator(FixedLocator(tick_locations))
      ax_box.set_xticklabels(tick_labels, rotation=90)
      ax_box.legend(loc='center left', bbox_to_anchor=(1, 0.5), title='Repeat')

    # Statistical Analyses and Heatmaps

    # Effect Size heatmap ax
      ax_d = fig.add_subplot(gs[1, 0])
      sns.heatmap(effect_size_matrices[var].fillna(0), annot=True, cmap="viridis", cbar=True, square=True, ax=ax_d, vmax=1)
      ax_d.set_title(f"Effect Size (Cohen's d) for {var}")

    # p-value heatmap ax
      ax_p = fig.add_subplot(gs[1, 1])
      sns.heatmap(p_value_matrices[var].fillna(1), annot=True, cmap="viridis_r", cbar=True, square=True, ax=ax_p, vmax=0.1)
      ax_p.set_title(f"Randomization Test p-value for {var}")

    # Bonferroni corrected p-value heatmap ax
      ax_bonf = fig.add_subplot(gs[1, 2])
      sns.heatmap(bonferroni_matrices[var].fillna(1), annot=True, cmap="viridis_r", cbar=True, square=True, ax=ax_bonf, vmax=0.1)
      ax_bonf.set_title(f"Bonferroni-corrected p-value for {var}")

      plt.tight_layout()
      pdf_pages.savefig(fig)

    # Close the PDF
      pdf_pages.close()

condition_selector = display_condition_selection(df_to_plot, Conditions)
selectable_columns = get_selectable_columns(df_to_plot)
variable_checkboxes = display_variable_checkboxes(selectable_columns)
cluster_dropdown = display_cluster_dropdown(dataset_df)


button = Button(description="Plot Selected Variables", layout=Layout(width='400px'), button_style='info')
button.on_click(lambda b: plot_selected_vars(b, variable_checkboxes, df_to_plot, Conditions, cluster_dropdown, base_folder, condition_selector))
display(button)

--------
# **Part 4. Explore your high-dimensional data using t-SNE and HDBSCAN**
--------

## **4.1. Choose the conditions to use**



In [None]:
#@markdown ##Choose the conditions to use:

import pandas as pd
import ipywidgets as widgets
from IPython.display import display
import os

TSNE_folder_path = os.path.join(Results_Folder, "TSNE")
if not os.path.exists(TSNE_folder_path):
    os.makedirs(TSNE_folder_path)

# Function to parse the text area content into a list
def parse_text_area(text):
    return [item.strip() for item in text.split(',') if item.strip()]

# Text area for user to paste the list of conditions
text_area_conditions = widgets.Textarea(
    value='',
    placeholder='Copy and paste your list of conditions here, separated by commas. Or tick the boxes below.',
    description='Conditions:',
    disabled=False,
    layout=widgets.Layout(width='80%', height='100px')
)

# Create checkboxes for each unique condition in the dataset
Condition_checkboxes = [widgets.Checkbox(value=True, description=str(condition)) for condition in dataset_df['Condition'].unique()]

# Function to filter dataframe based on selected checkbox values and text area input
def filter_dataframe(button):
    global filtered_df
    # Initialize an empty list to hold selected conditions
    selected_conditions = []

    # Check if the text area is not empty
    if text_area_conditions.value.strip():
        # Use conditions from the text area
        selected_conditions = parse_text_area(text_area_conditions.value)
    else:
        # Use conditions from checkboxes if the text area is empty
        selected_conditions = [box.description for box in Condition_checkboxes if box.value]

    # Filter DataFrame
    filtered_df = dataset_df[dataset_df['Condition'].isin(selected_conditions)]

    print("Selected Conditions:", selected_conditions)
    print("Filtered DataFrame length:", len(filtered_df))
    if len(filtered_df) == 0:
        print("No data matched the selected filters. Check filters and data for consistency.")

    # Save selected conditions as parameters
    params = {'Selected Conditions': ', '.join(selected_conditions)}
    params_file_path = os.path.join(TSNE_folder_path, "TSNE_analysis_parameters.csv")
    save_parameters(params, params_file_path, 'TSNE Conditions')

# Button to trigger dataframe filtering
filter_button = widgets.Button(description="Filter Dataframe")
filter_button.on_click(filter_dataframe)

# Display text area, checkboxes, and button
display(text_area_conditions, widgets.VBox([widgets.Label('Select Conditions:')] + Condition_checkboxes + [filter_button]))


## **4.2. Choose the numeric data to use**



In [None]:
import os
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

#@markdown ##Choose the numeric data to use


TSNE_folder_path = os.path.join(Results_Folder, "TSNE")
if not os.path.exists(TSNE_folder_path):
    os.makedirs(TSNE_folder_path)

# Define columns to preserve and to exclude if present
preserve_columns = ['Condition']
exclude_columns = ['Repeat', 'Cluster_UMAP', 'Cluster_TSNE']

# Ensure 'Condition' column and potentially 'Repeat' column are handled in 'numeric_df' for later use
numeric_columns = filtered_df.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Keep a copy of the 'Condition' column along with numeric columns
# Exclude 'Repeat' column if it exists
numeric_df = filtered_df[numeric_columns + preserve_columns].copy()
if 'Repeat' in numeric_df.columns:
    numeric_df = numeric_df.drop(columns=['Repeat'])

# Text area for user to paste the list of metrics
text_area = widgets.Textarea(
    value='',
    placeholder='Copy and paste your list of metrics here, separated by commas. Or tick the boxes below',
    description='Metrics:',
    disabled=False,
    layout=widgets.Layout(width='80%', height='100px')
)

# Function to parse the text area content into a list
def parse_text_area(text):
    return [item.strip() for item in text.split(',') if item.strip() in numeric_columns]

# Create a checkbox for each numeric column (excluding 'Condition' and potentially 'Repeat')
checkboxes = [widgets.Checkbox(value=True, description=col, indent=False) for col in numeric_columns if col not in exclude_columns]

# Arrange checkboxes in a grid
grid = widgets.GridBox(checkboxes, layout=widgets.Layout(grid_template_columns="repeat(2, 300px)"))

# Create a button to trigger the selection
button = widgets.Button(description="Select Track Parameters", layout=widgets.Layout(width='400px'))

# Define the button click event handler
def on_button_click(b):
    global selected_df

    # Use metrics from the text area if not empty, else from checkboxes
    if text_area.value.strip():
        selected_columns = parse_text_area(text_area.value)
    else:
        selected_columns = [box.description for box in checkboxes if box.value]

    selected_df = numeric_df[selected_columns + preserve_columns].copy()

    # Check for NaN values and handle them
    nan_columns = selected_df.columns[selected_df.isna().any()].tolist()
    if nan_columns:
        for col in nan_columns:
            selected_df.dropna(subset=[col], inplace=True)  # Drop rows with NaN values in these columns
            print(f"Dropped rows from column '{col}' due to NaN values.")

    print("Track parameters selected and NaN values handled.")
    params = {'Selected numeric data': ', '.join(selected_columns)}
    params_file_path = os.path.join(TSNE_folder_path, "TSNE_analysis_parameters.csv")
    save_parameters(params, params_file_path, 'TSNE numeric data')

# Set the button click event handler
button.on_click(on_button_click)

# Display the text area, grid of checkboxes, and the button
display(text_area, grid, button)


## **4.3. t-SNE**
--------

The cell below performs **t-Distributed Stochastic Neighbor Embedding (t-SNE)**, a dimensionality reduction technique well suited to visualizing high-dimensional datasets. It is applied to the numeric columns selected in section 4.2 and embeds the data in a lower-dimensional space for plotting.

### Key Parameters of t-SNE:

- **Perplexity (`perplexity`):**
  - This parameter is a measure of the effective number of local neighbors each point has.
  - Perplexity influences the t-SNE algorithm's ability to capture local versus global aspects of the data.
  - Typical values for perplexity range between 5 and 50, with the choice depending on dataset size and density.

- **Learning Rate (`learning_rate`):**
  - This parameter controls the step size in the optimization process.
  - A suitable learning rate helps t-SNE to converge to a meaningful low-dimensional representation.
  - If the learning rate is too high, the embedding may look like a diffuse ball in which points are roughly equidistant from their neighbors; if it is too low, convergence is slow and most points may stay compressed in a dense cloud.

- **Number of Iterations (`n_iter`):**
  - This parameter defines the number of optimization iterations t-SNE will run.
  - A higher number of iterations allows the algorithm more time to find a stable configuration.
  - Generally, a value of 1000 iterations is sufficient for most datasets.

- **Number of Dimensions (`n_dimension`):**
  - The target dimensionality for the lower-dimensional space.
  - For visualization purposes, this is commonly set to 2, allowing the data to be plotted in a 2D scatter plot.
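
As a rough illustration of how these parameters are passed to scikit-learn, the sketch below embeds a small synthetic dataset. The data, the variable names (`toy_data`, `toy_tsne`, `toy_embedding`) and the parameter values are assumptions chosen for the example, not recommendations; the number of iterations is left at the library default.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data (illustrative only): two well-separated groups of 30 points in 5 dimensions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 5))
group_b = rng.normal(loc=8.0, scale=1.0, size=(30, 5))
toy_data = np.vstack([group_a, group_b])

# The perplexity is kept well below the number of samples (60 here)
toy_tsne = TSNE(n_components=2, perplexity=10, learning_rate=100.0,
                init='random', random_state=42)
toy_embedding = toy_tsne.fit_transform(toy_data)

# One 2D coordinate per input row
print(toy_embedding.shape)
```

Recent scikit-learn versions reject a perplexity that is not smaller than the number of rows, so a heavily filtered dataset may need a lower value than the one set in the cell below.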


In [None]:
# @title ##Perform t-SNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
import pandas as pd

# Check and create necessary directories
tsne_folder_path = f"{Results_Folder}/TSNE/"
if not os.path.exists(tsne_folder_path):
    os.makedirs(tsne_folder_path)

#@markdown ###t-SNE parameters:

perplexity = 20  # @param {type: "number"}
learning_rate = 100  # @param {type: "number"}
n_iter = 1000  # @param {type: "number"}
n_dimension = 2  # The number of dimensions is set to 2 for t-SNE as standard practice

#@markdown ###Display parameters:
spot_size = 15  # @param {type: "number"}

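# Initialize t-SNE object with the specified settings.
# scikit-learn 1.5 renamed `n_iter` to `max_iter`; use whichever name the installed version accepts.
import inspect
iter_kwarg = 'max_iter' if 'max_iter' in inspect.signature(TSNE.__init__).parameters else 'n_iter'
tsne = TSNE(n_components=n_dimension, perplexity=perplexity, learning_rate=learning_rate, random_state=42, **{iter_kwarg: int(n_iter)})

# Exclude non-numeric columns (e.g. 'Condition') when fitting t-SNE
numeric_data = selected_df.select_dtypes(include='number')
embedding = tsne.fit_transform(numeric_data)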

# Prepare the parameters dictionary
tsne_params = {
    'perplexity': perplexity,
    'learning_rate': learning_rate,
    'n_iter': n_iter,
    'n_dimension': n_dimension,
    'spot_size': spot_size
}

# Save the parameters
params_file_path = os.path.join(tsne_folder_path, "TSNE_analysis_parameters.csv")
save_parameters(tsne_params, params_file_path, 'TSNE')

# Create dynamic column names based on n_components
column_names = [f't-SNE dimension {i+1}' for i in range(n_dimension)]

# Reset the index of selected_df so that it lines up row by row with the embedding
included_data = selected_df.reset_index(drop=True)

# Concatenate the t-SNE embedding with the included columns
tsne_df = pd.concat([pd.DataFrame(embedding, columns=column_names), included_data], axis=1)

# Check if the DataFrame has any NaN values and print a warning if it does.
nan_columns = tsne_df.columns[tsne_df.isna().any()].tolist()
if nan_columns:
  warnings.warn(f"The DataFrame contains NaN values in the following columns: {', '.join(nan_columns)}")
  tsne_df.dropna(subset=nan_columns, inplace=True)  # Drop NaN values only from columns containing them

# Visualize the t-SNE projection
plt.figure(figsize=(12, 10))
sns.scatterplot(x=column_names[0], y=column_names[1], hue='Condition', data=tsne_df, palette='Set2', s=spot_size)
plt.title('t-SNE Projection of the Dataset')
tsne_output_path = os.path.join(tsne_folder_path, 'TSNE_projection_2D.pdf')
plt.savefig(tsne_output_path)  # Save 2D plot as PDF
plt.show()


## **4.4. HDBSCAN**
---

<font size = 4>The cell below uses HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to identify clusters in the dataset after t-SNE dimensionality reduction. HDBSCAN does not require the number of clusters to be set in advance and copes with clusters of varying density.

<font size = 4>In the provided HDBSCAN code, the parameters `min_samples`, `min_cluster_size`, and `metric` are crucial for determining the structure and appearance of the resulting clusters in the data.

<font size = 4>`min_samples`: This parameter primarily controls the degree to which the algorithm is willing to declare noise. It's the number of samples in a neighborhood for a point to be considered as a core point.
- A smaller value of `min_samples` makes the algorithm more prone to declaring points as part of a cluster, potentially leading to larger clusters and fewer noise points.
- A larger value makes the algorithm more conservative, resulting in more points declared as noise and smaller, more defined clusters.
- The choice of `min_samples` typically depends on the density of the data; denser datasets may require a larger value.

<font size = 4>`min_cluster_size`: This parameter determines the smallest size grouping that you wish to consider a cluster.
- A smaller value will allow the formation of smaller clusters, whereas a larger value will prevent small isolated groups of points from being declared as clusters.
- The choice of `min_cluster_size` depends on the scale of the data and the desired level of granularity in the clustering.

<font size = 4>`metric`: This parameter is the metric used for distance computation between data points, and it affects the shape of the clusters.
- The `euclidean` metric is a good starting point, and depending on the clustering results and the data type, it might be beneficial to experiment with different metrics.
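
<font size = 4>One way to judge whether `min_samples` and `min_cluster_size` suit a dataset is to look at how many points each cluster receives and what fraction is labeled as noise (HDBSCAN assigns the label `-1` to noise points). The sketch below uses made-up labels; `toy_labels` is a hypothetical stand-in for `clusterer.labels_`.

```python
import pandas as pd

# Hypothetical labels standing in for clusterer.labels_; -1 marks noise points
toy_labels = pd.Series([0, 0, 0, 1, 1, -1, 0, 1, -1, 0])

# Number of points per cluster, ignoring noise
cluster_sizes = toy_labels[toy_labels != -1].value_counts().sort_index()

# Fraction of points that were not assigned to any cluster
noise_fraction = (toy_labels == -1).mean()

print(cluster_sizes.to_dict())
print(f"Noise fraction: {noise_fraction:.0%}")
```

<font size = 4>If most points end up as noise, lowering `min_samples` or `min_cluster_size` is a reasonable first adjustment to try.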


In [None]:
import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pandas as pd
import numpy as np
import os

# @title ##Identify clusters using HDBSCAN

#@markdown ###HDBSCAN parameters:
clustering_data_source = 'tsne'  # @param ['tsne', 'raw']
min_samples = 10  # @param {type: "number"}
min_cluster_size = 100  # @param {type: "number"}
metric = "euclidean"  # @param ['euclidean', 'manhattan', 'chebyshev', 'braycurtis', 'canberra']

#@markdown ###Display parameters:
spot_size = 20 # @param {type: "number"}

tsne_folder_path = os.path.join(Results_Folder, "TSNE")
os.makedirs(tsne_folder_path, exist_ok=True)

column_names = [f't-SNE dimension {i+1}' for i in range(n_dimension)]

# Initialize HDBSCAN
clusterer = hdbscan.HDBSCAN(min_samples=min_samples, min_cluster_size=min_cluster_size, metric=metric)

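# Select the data to cluster according to clustering_data_source
if clustering_data_source == 'tsne':
    # Cluster on the t-SNE coordinates
    data_for_clustering = tsne_df[column_names]
else:
    # Cluster on the selected numeric data instead of the embedding
    raw_columns = [col for col in tsne_df.select_dtypes(include='number').columns
                   if col not in column_names + ['Cluster_TSNE', 'Cluster_UMAP']]
    data_for_clustering = tsne_df[raw_columns]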

# Apply clustering
clusterer.fit(data_for_clustering)

# Add the cluster labels to tsne_df
tsne_df['Cluster_TSNE'] = clusterer.labels_

# Plotting the results
if n_dimension == 1:
    plt.figure(figsize=(12, 6))
    sns.stripplot(x=column_names[0], hue='Cluster_TSNE', data=tsne_df, palette='viridis', size=spot_size)
    plt.title('Clusters Identified by HDBSCAN (1D)')
    plt.savefig(os.path.join(tsne_folder_path, "HDBSCAN_clusters_1D.pdf"))
elif n_dimension == 2:
    plt.figure(figsize=(12, 10))
    sns.scatterplot(x=column_names[0], y=column_names[1], hue='Cluster_TSNE', data=tsne_df, palette='viridis', s=spot_size)
    plt.title('Clusters Identified by HDBSCAN (2D)')
    plt.savefig(os.path.join(tsne_folder_path, "HDBSCAN_clusters_2D.pdf"))
elif n_dimension == 3:
    fig = px.scatter_3d(tsne_df, x=column_names[0], y=column_names[1], z=column_names[2], color='Cluster_TSNE', size_max=spot_size)
    fig.update_traces(marker=dict(size=spot_size/10))
    fig.show()
    fig.write_html(os.path.join(tsne_folder_path, "HDBSCAN_clusters_3D.html"), auto_open=False)
else:
    print("Invalid n_dimension value for clustering visualization.")

plt.show()

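# tsne_df rows are in the same order as selected_df, whose index still holds the
# original dataset_df row labels. Align the cluster labels on those row labels rather
# than on row position, so that rows removed by the condition filter or the NaN
# handling do not shift the labels onto the wrong observations.
tsne_df.reset_index(drop=True, inplace=True)
if len(tsne_df) != len(selected_df):
    raise ValueError("tsne_df and selected_df differ in length; re-run sections 4.2 and 4.3 before clustering.")
cluster_labels = pd.Series(tsne_df['Cluster_TSNE'].to_numpy(), index=selected_df.index)

# If the Cluster column already exists in dataset_df, drop it to avoid duplications
if 'Cluster_TSNE' in dataset_df.columns:
    dataset_df.drop(columns='Cluster_TSNE', inplace=True)

# Add the 'Cluster_TSNE' column to dataset_df; rows that were not clustered get -1
dataset_df['Cluster_TSNE'] = cluster_labels.reindex(dataset_df.index).fillna(-1).astype(int)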

# Save the DataFrame with the identified clusters
dataset_df.to_csv(f"{Results_Folder}/merged_dataset_clusters.csv", index=False)

# Save clustering parameters
params_file_path = os.path.join(tsne_folder_path, "TSNE_analysis_parameters.csv")

# Save clustering parameters and results
HDBSCAN_params = {
    'clustering_data_source': clustering_data_source,
    'min_samples': min_samples,
    'min_cluster_size': min_cluster_size,
    'metric': metric
}
save_parameters(HDBSCAN_params, params_file_path, 'HDBSCAN parameters')


## **4.5. Fingerprint**
---

<font size = 4>This section is designed to visualize the distribution of different clusters within each condition in a dataset, showing the 'fingerprint' of each cluster per condition.
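
<font size = 4>The fingerprint is the percentage of observations of each condition that falls into each cluster. As a minimal sketch of that calculation on made-up data (the `toy_fingerprint_df` table and its values are assumptions for illustration), a row-normalized cross-tabulation gives the same kind of table:

```python
import pandas as pd

# Toy data (illustrative only): six observations, two conditions, clusters 0 and 1
toy_fingerprint_df = pd.DataFrame({
    'Condition': ['Control', 'Control', 'Control', 'Control', 'Treated', 'Treated'],
    'Cluster_TSNE': [0, 0, 0, 1, 1, 1],
})

# Row-normalized cross-tabulation: each row (condition) sums to 100 %
toy_fingerprint = pd.crosstab(
    toy_fingerprint_df['Condition'],
    toy_fingerprint_df['Cluster_TSNE'],
    normalize='index',
) * 100

print(toy_fingerprint)
```

<font size = 4>Here 75 % of the "Control" observations are in cluster 0 and 25 % in cluster 1, while all "Treated" observations are in cluster 1.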

In [None]:
# @title ##Plot the 'fingerprint' of each cluster per condition

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Group by 'Condition' and 'Cluster' and calculate the size of each group
cluster_counts = tsne_df.groupby(['Condition', 'Cluster_TSNE']).size().reset_index(name='counts')

# Calculate the total number of points per condition
total_counts = tsne_df.groupby('Condition').size().reset_index(name='total_counts')

# Merge the DataFrames on 'Condition' to calculate percentages
percentage_df = pd.merge(cluster_counts, total_counts, on='Condition')
percentage_df['percentage'] = (percentage_df['counts'] / percentage_df['total_counts']) * 100

# Save the percentage_df DataFrame as a CSV file
percentage_df.to_csv(Results_Folder+'/TSNE/TSNE_percentage_results.csv', index=False)

# Pivot the percentage_df to have Conditions as index, Clusters as columns, and percentages as values
pivot_df = percentage_df.pivot(index='Condition', columns='Cluster_TSNE', values='percentage')

# Fill NaN values with 0 if any, as there might be some Condition-Cluster combinations that are not present
pivot_df.fillna(0, inplace=True)

# Initialize PDF
pdf_pages = PdfPages(Results_Folder+'/TSNE/TSNE_Cluster_Fingerprint_Plot.pdf')

# Plotting
fig, ax = plt.subplots(figsize=(10, 7))
pivot_df.plot(kind='bar', stacked=True, ax=ax, colormap='viridis')
plt.title('Percentage in each cluster per Condition')
plt.ylabel('Percentage')
plt.xlabel('Condition')
plt.xticks(rotation=90)
plt.tight_layout()

ax.legend(title='Cluster_TSNE', loc='upper left', bbox_to_anchor=(1, 1))


# Save the figure to a PDF
pdf_pages.savefig(fig, bbox_inches='tight')

# Close the PDF
pdf_pages.close()

# Display the plot
plt.show()



## **4.6. Understand your clusters using heatmaps**
--------

<font size = 4>This section helps visualize how the numeric variables differ across the identified clusters. The median of each variable is computed for each cluster, normalized by Z-score across clusters, and displayed as a color-coded heatmap, which makes differences and patterns among the clusters easier to spot.
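
<font size = 4>The normalization step can be sketched on a toy table as follows. The data and names (`toy_heatmap_df`, `speed`, `area`) are made up for illustration, and the Z-score is written out by hand with the population standard deviation (`ddof=0`), which is the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Toy data (illustrative only): two variables measured in three clusters
toy_heatmap_df = pd.DataFrame({
    'Cluster_TSNE': [0, 0, 1, 1, 2, 2],
    'speed': [1.0, 3.0, 4.0, 6.0, 7.0, 9.0],
    'area': [10.0, 10.0, 20.0, 20.0, 60.0, 60.0],
})

# Median of each variable per cluster, with variables as rows
toy_medians = toy_heatmap_df.groupby('Cluster_TSNE')[['speed', 'area']].median().transpose()

# Z-score each row across clusters (population standard deviation, ddof=0)
row_mean = toy_medians.mean(axis=1)
row_std = toy_medians.std(axis=1, ddof=0)
toy_z = toy_medians.sub(row_mean, axis=0).div(row_std, axis=0)

print(toy_z.round(2))
```

<font size = 4>Because each row is normalized separately, the colors show in which clusters a variable is relatively high or low, not its absolute magnitude.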


In [None]:
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import pandas as pd
from scipy.stats import zscore

# @title ##Plot normalized numeric data based on clusters as a heatmap

# Parameters to adapt in function of the notebook section
base_folder = f"{Results_Folder}/TSNE"
Conditions = 'Cluster_TSNE'

def get_selectable_columns(df):
    # Exclude certain columns from being plotted
    exclude_cols = ['Condition', 'experiment_nb', 'File_name', 'Repeat', 'Unique_ID', 'LABEL', 'TRACK_INDEX', 'TRACK_ID', 'TRACK_X_LOCATION', 'TRACK_Y_LOCATION', 'TRACK_Z_LOCATION', 'Exemplar','TRACK_STOP', 'TRACK_START', 'Cluster_UMAP', 'Cluster_TSNE']
    # Select only numerical columns
    return [col for col in df.columns if (df[col].dtype.kind in 'biufc') and (col not in exclude_cols)]


def heatmap_comparison(df, Results_Folder, Conditions, variables_per_page=40):
    # Get all the selectable columns
    variables_to_plot = get_selectable_columns(df)

    # Drop rows where all of the variables_to_plot columns are NaN
    df = df.dropna(subset=variables_to_plot, how='all')

    # Compute median for each variable across Clusters
    median_values = df.groupby(Conditions)[variables_to_plot].median().transpose()

    # Normalize the median values using Z-score
    normalized_values = median_values.apply(zscore, axis=1)

    # Number of pages
    total_variables = len(variables_to_plot)
    num_pages = int(np.ceil(total_variables / variables_per_page))

    # Initialize an empty DataFrame to store all pages' data
    all_pages_data = pd.DataFrame()

    # Create a PDF file to save the heatmaps
    with PdfPages(f"{Results_Folder}/Heatmaps_Normalized_Median_Values_by_Cluster.pdf") as pdf:
        for page in range(num_pages):
            start = page * variables_per_page
            end = min(start + variables_per_page, total_variables)
            page_data = normalized_values.iloc[start:end]

            # Append this page's data to the all_pages_data DataFrame
            all_pages_data = pd.concat([all_pages_data, page_data])

            plt.figure(figsize=(16, 10))
            sns.heatmap(page_data, cmap='coolwarm', annot=True, linewidths=.1)
            plt.title(f"Z-score Normalized Median Values of Variables by {Conditions} (Page {page + 1})")
            plt.tight_layout()

            pdf.savefig()  # saves the current figure into a pdf page
            plt.show()
            plt.close()

    # Save all pages data to a single CSV file
    all_pages_data.to_csv(f"{Results_Folder}/Normalized_Median_Values_by_Condition.csv")

    print(f"Heatmaps saved to {Results_Folder}/Heatmaps_Normalized_Median_Values_by_Cluster.pdf")
    print(f"All data saved to {Results_Folder}/Normalized_Median_Values_by_Condition.csv")

# Example usage
heatmap_comparison(dataset_df, base_folder, Conditions)

## **4.7. Understand your clusters using box plots**
--------

<font size = 4>The cell below shows how each selected numeric variable is distributed across the identified clusters. For every variable you select, a boxplot with the individual data points overlaid shows the spread of its values in each cluster, so that variation within and between clusters can be compared.
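
<font size = 4>The box of a boxplot spans the first to the third quartile, with a line at the median. As a small sketch on made-up data (`toy_box_df` and its values are assumptions for illustration), the same numbers can be tabulated per cluster:

```python
import pandas as pd

# Toy data (illustrative only): one variable measured in two clusters
toy_box_df = pd.DataFrame({
    'Cluster_TSNE': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    'speed': [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Quartiles of the variable within each cluster, one row per cluster
toy_quartiles = (
    toy_box_df.groupby('Cluster_TSNE')['speed']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
toy_quartiles.columns = ['Q1', 'median', 'Q3']

print(toy_quartiles)
```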




In [None]:
import os
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.gridspec import GridSpec
import pandas as pd
import ipywidgets as widgets

# @title ##Plot numeric data based on clusters

# Check and create "pdf" folder
if not os.path.exists(f"{Results_Folder}/TSNE/Data_by_clusters"):
    os.makedirs(f"{Results_Folder}/TSNE/Data_by_clusters")

def get_selectable_columns(df):
    # Exclude certain columns from being plotted
    exclude_cols = ['Condition', 'Repeat', 'Cluster_UMAP', 'Cluster_TSNE']
    # Select only numerical columns
    return [col for col in df.columns if (df[col].dtype.kind in 'biufc') and (col not in exclude_cols)]

def display_variable_checkboxes(selectable_columns):
    # Create checkboxes for selectable columns
    variable_checkboxes = [widgets.Checkbox(value=False, description=col) for col in selectable_columns]

    # Display checkboxes in the notebook
    display(widgets.VBox([
        widgets.Label('Variables to Plot:'),
        widgets.GridBox(variable_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 300px)" % 3)),
    ]))
    return variable_checkboxes

def plot_selected_vars(button, variable_checkboxes, df, Results_Folder):
    print("Plotting in progress...")

    # Get selected variables
    variables_to_plot = [box.description for box in variable_checkboxes if box.value]
    n_plots = len(variables_to_plot)

    if n_plots == 0:
        print("No variables selected for plotting")
        return

    for var in variables_to_plot:
        # Extract data for the specific variable and cluster
        data_to_save = df[['Cluster_TSNE', var]]

        # Save data for the plot to CSV
        data_to_save.to_csv(f"{Results_Folder}/TSNE/Data_by_clusters/{var}_data_by_Cluster.csv", index=False)

        plt.figure(figsize=(16, 10))

        # Plotting
        sns.boxplot(x='Cluster_TSNE', y=var, data=df, color='lightgray')  # Boxplot by cluster
        sns.stripplot(x='Cluster_TSNE', y=var, data=df, jitter=True, alpha=0.2)  # Individual data points

        plt.title(f"{var} by Cluster")
        plt.xlabel('Cluster_TSNE')
        plt.ylabel(var)
        plt.xticks(rotation=90)
        plt.tight_layout()

        # Save the plot
        plt.savefig(f"{Results_Folder}/TSNE/Data_by_clusters/{var}_Boxplots_by_Cluster.pdf")
        plt.show()

selectable_columns = get_selectable_columns(dataset_df)
variable_checkboxes = display_variable_checkboxes(selectable_columns)

# Create and display the plot button
button = widgets.Button(description="Plot Selected Variables", layout=widgets.Layout(width='400px'))
button.on_click(lambda b: plot_selected_vars(b, variable_checkboxes, dataset_df, Results_Folder))
display(button)


## **4.8. Plot numeric data for a selected cluster**
---
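
<font size = 4>The cell below compares conditions pairwise within the selected cluster using Cohen's d as effect size, a randomization test for the p-value, and a Bonferroni correction. The sketch below illustrates that idea on synthetic data. `cohen_d_sketch` and `randomization_p_value` are hypothetical helpers written for this example (the notebook's own `cohen_d` function is defined elsewhere and may differ), and the data and the number of comparisons are made up.

```python
import numpy as np

def cohen_d_sketch(a, b):
    """Cohen's d with a pooled standard deviation (hypothetical helper, for illustration)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def randomization_p_value(a, b, n_iterations=1000, seed=0):
    """Fraction of label shuffles whose |d| is at least as large as the observed |d|."""
    rng = np.random.default_rng(seed)
    observed = abs(cohen_d_sketch(a, b))
    combined = np.concatenate([a, b])
    count_extreme = 0
    for _ in range(n_iterations):
        shuffled = rng.permutation(combined)
        shuffled_d = abs(cohen_d_sketch(shuffled[:len(a)], shuffled[len(a):]))
        if shuffled_d >= observed:
            count_extreme += 1
    return count_extreme / n_iterations

# Synthetic groups (illustrative only) with a clear difference in their means
data_rng = np.random.default_rng(1)
toy_control = data_rng.normal(0.0, 1.0, size=40)
toy_treated = data_rng.normal(1.5, 1.0, size=40)

p_value = randomization_p_value(toy_control, toy_treated)

# Bonferroni correction: multiply by the number of pairwise comparisons, capped at 1
num_comparisons = 3  # assumed: three conditions compared pairwise
bonferroni_p = min(p_value * num_comparisons, 1.0)

print(f"p = {p_value:.3f}, Bonferroni-corrected p = {bonferroni_p:.3f}")
```

<font size = 4>With 1000 shuffles the smallest non-zero p-value that can be reported is 0.001, so a printed value of 0 means "below the resolution of the test" rather than exactly zero.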

In [None]:
# @title ##Plot numeric data for a selected cluster

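import itertools
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from ipywidgets import Layout, VBox, Button, Accordion, SelectMultiple, IntText
from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.gridspec import GridSpec
from matplotlib.ticker import FixedLocator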


# Parameters to adapt in function of the notebook section
base_folder = f"{Results_Folder}/TSNE/Data_selected_cluster"
Conditions = 'Condition'
df_to_plot = dataset_df

# Check and create necessary directories
folders = ["pdf", "csv"]
for folder in folders:
    dir_path = os.path.join(base_folder, folder)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

def get_selectable_columns(df):
    # Exclude certain columns from being plotted
    exclude_cols = ['Condition', 'Repeat', 'Cluster_UMAP', 'Cluster_TSNE']
    # Select only numerical columns
    return [col for col in df.columns if (df[col].dtype.kind in 'biufc') and (col not in exclude_cols)]


def display_cluster_dropdown(df):
    # Extract unique clusters
    unique_clusters = df['Cluster_TSNE'].unique()
    cluster_dropdown = widgets.Dropdown(
        options=unique_clusters,
        description='Select Cluster:',
        disabled=False,
    )
    display(cluster_dropdown)
    return cluster_dropdown


def display_variable_checkboxes(selectable_columns):
    # Create checkboxes for selectable columns
    variable_checkboxes = [widgets.Checkbox(value=False, description=col) for col in selectable_columns]

    # Display checkboxes in the notebook
    display(widgets.VBox([
        widgets.Label('Variables to Plot:'),
        widgets.GridBox(variable_checkboxes, layout=widgets.Layout(grid_template_columns="repeat(%d, 300px)" % 3)),
    ]))
    return variable_checkboxes

def create_condition_selector(df, column_name):
    conditions = df[column_name].unique()
    condition_selector = SelectMultiple(
        options=conditions,
        description='Conditions:',
        disabled=False,
        layout=Layout(width='80%')  # Adjusting the layout width
    )
    return condition_selector

def display_condition_selection(df, column_name):
    condition_selector = create_condition_selector(df, column_name)

    condition_accordion = Accordion(children=[VBox([condition_selector])])
    condition_accordion.set_title(0, 'Select Conditions')
    display(condition_accordion)
    return condition_selector


def plot_selected_vars(button, variable_checkboxes, df, Conditions, cluster_dropdown, Results_Folder, condition_selector):

    selected_cluster = cluster_dropdown.value
    print(f"Plotting in progress for Cluster {selected_cluster}...")

    plt.clf()  # Clear the current figure before creating a new plot


  # Get selected variables
    variables_to_plot = [box.description for box in variable_checkboxes if box.value]
    n_plots = len(variables_to_plot)

    if n_plots == 0:
        print("No variables selected for plotting")
        return

  # Get selected conditions
    selected_conditions = condition_selector.value
    n_selected_conditions = len(selected_conditions)

    if n_selected_conditions == 0:
        print("No conditions selected for plotting")
        return

# Use only selected and ordered conditions
    filtered_df = df[(df[Conditions].isin(selected_conditions)) & (df['Cluster_TSNE'] == selected_cluster)].copy()

# Initialize matrices to store effect sizes and p-values for each variable
    effect_size_matrices = {}
    p_value_matrices = {}
    bonferroni_matrices = {}

    unique_conditions = filtered_df[Conditions].unique().tolist()
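    num_comparisons = len(unique_conditions) * (len(unique_conditions) - 1) // 2
    alpha = 0.05
    # Avoid dividing by zero when only one condition is present in the selected cluster
    corrected_alpha = alpha / num_comparisons if num_comparisons > 0 else alpha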
    n_iterations = 1000

# Loop through each variable to plot
    for var in variables_to_plot:

      pdf_pages = PdfPages(f"{Results_Folder}/pdf/Cluster_{selected_cluster}_{var}_Boxplots_and_Statistics.pdf")
      effect_size_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)
      p_value_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)
      bonferroni_matrix = pd.DataFrame(index=unique_conditions, columns=unique_conditions)

      for cond1, cond2 in itertools.combinations(unique_conditions, 2):
        # Compare conditions within the selected cluster only
        group1 = filtered_df[filtered_df[Conditions] == cond1][var]
        group2 = filtered_df[filtered_df[Conditions] == cond2][var]

        original_d = abs(cohen_d(group1, group2))
        effect_size_matrix.loc[cond1, cond2] = original_d
        effect_size_matrix.loc[cond2, cond1] = original_d  # Mirroring

        count_extreme = 0
        for i in range(n_iterations):
            combined = pd.concat([group1, group2])
            shuffled = combined.sample(frac=1, replace=False).reset_index(drop=True)
            new_group1 = shuffled[:len(group1)]
            new_group2 = shuffled[len(group1):]

            new_d = abs(cohen_d(new_group1, new_group2))
            if np.abs(new_d) >= np.abs(original_d):
                count_extreme += 1

        p_value = count_extreme / n_iterations
        p_value_matrix.loc[cond1, cond2] = p_value
        p_value_matrix.loc[cond2, cond1] = p_value  # Mirroring

        # Apply Bonferroni correction
        bonferroni_corrected_p_value = min(p_value * num_comparisons, 1.0)
        bonferroni_matrix.loc[cond1, cond2] = bonferroni_corrected_p_value
        bonferroni_matrix.loc[cond2, cond1] = bonferroni_corrected_p_value  # Mirroring

      effect_size_matrices[var] = effect_size_matrix
      p_value_matrices[var] = p_value_matrix
      bonferroni_matrices[var] = bonferroni_matrix

    # Concatenate the three matrices side-by-side
      combined_df = pd.concat(
        [
            effect_size_matrices[var].rename(columns={col: f"{col} (Effect Size)" for col in effect_size_matrices[var].columns}),
            p_value_matrices[var].rename(columns={col: f"{col} (P-Value)" for col in p_value_matrices[var].columns}),
            bonferroni_matrices[var].rename(columns={col: f"{col} (Bonferroni-corrected P-Value)" for col in bonferroni_matrices[var].columns})
        ], axis=1
    )

    # Save the combined DataFrame to a CSV file
      combined_df.to_csv(f"{Results_Folder}/csv/Cluster_{selected_cluster}_{var}_statistics_combined.csv")

    # Create a new figure
      fig = plt.figure(figsize=(16, 10))

    # Create a gridspec for 2 rows and 3 columns
      gs = GridSpec(2, 3, height_ratios=[1.5, 1])

    # Create the ax for boxplot using the gridspec
      ax_box = fig.add_subplot(gs[0, :])

    # Extract the data for this variable (selected cluster and conditions only)
      data_for_var = filtered_df[[Conditions, var, 'Repeat']]

    # Save the data_for_var to a CSV for replotting
      data_for_var.to_csv(f"{Results_Folder}/csv/Cluster_{selected_cluster}_{var}_boxplot_data.csv", index=False)

    # Calculate the Interquartile Range (IQR) using the 25th and 75th percentiles
      Q1 = df[var].quantile(0.25)
      Q3 = df[var].quantile(0.75)
      IQR = Q3 - Q1

    # Define bounds for the outliers
      multiplier = 10
      lower_bound = Q1 - multiplier * IQR
      upper_bound = Q3 + multiplier * IQR

    # Plotting
      sns.boxplot(x=Conditions, y=var, data=filtered_df, ax=ax_box, color='lightgray')  # Boxplot
      sns.stripplot(x=Conditions, y=var, data=filtered_df, ax=ax_box, hue='Repeat', dodge=True, jitter=True, alpha=0.2)  # Individual data points
      ax_box.set_ylim([max(min(filtered_df[var]), lower_bound), min(max(filtered_df[var]), upper_bound)])
      ax_box.set_title(f"{var} for Cluster {selected_cluster}")
      ax_box.set_xlabel('Condition')
      ax_box.set_ylabel(var)
      tick_labels = ax_box.get_xticklabels()
      tick_locations = ax_box.get_xticks()
      ax_box.xaxis.set_major_locator(FixedLocator(tick_locations))
      ax_box.set_xticklabels(tick_labels, rotation=90)
      ax_box.legend(loc='center left', bbox_to_anchor=(1, 0.5), title='Repeat')

    # Statistical Analyses and Heatmaps

    # Effect Size heatmap ax
      ax_d = fig.add_subplot(gs[1, 0])
      sns.heatmap(effect_size_matrices[var].fillna(0), annot=True, cmap="viridis", cbar=True, square=True, ax=ax_d, vmax=1)
      ax_d.set_title(f"Effect Size (Cohen's d) for {var}")

    # p-value heatmap ax
      ax_p = fig.add_subplot(gs[1, 1])
      sns.heatmap(p_value_matrices[var].fillna(1), annot=True, cmap="viridis_r", cbar=True, square=True, ax=ax_p, vmax=0.1)
      ax_p.set_title(f"Randomization Test p-value for {var}")

    # Bonferroni corrected p-value heatmap ax
      ax_bonf = fig.add_subplot(gs[1, 2])
      sns.heatmap(bonferroni_matrices[var].fillna(1), annot=True, cmap="viridis_r", cbar=True, square=True, ax=ax_bonf, vmax=0.1)
      ax_bonf.set_title(f"Bonferroni-corrected p-value for {var}")

      plt.tight_layout()
      pdf_pages.savefig(fig)

    # Close the PDF
      pdf_pages.close()

condition_selector = display_condition_selection(df_to_plot, Conditions)
selectable_columns = get_selectable_columns(df_to_plot)
variable_checkboxes = display_variable_checkboxes(selectable_columns)
cluster_dropdown = display_cluster_dropdown(dataset_df)


button = Button(description="Plot Selected Variables", layout=Layout(width='400px'), button_style='info')
button.on_click(lambda b: plot_selected_vars(b, variable_checkboxes, df_to_plot, Conditions, cluster_dropdown, base_folder, condition_selector))
display(button)

# **Part 5. Version log**
---
<font size = 4>While we strive to provide accurate and helpful information, please be aware that:
  - This notebook may contain bugs.
  - Features are currently limited and will be expanded in future releases.

<font size = 4>We encourage users to report any issues or suggestions for improvement. Please check the [repository](https://github.com/guijacquemet/CellTracksColab) regularly for updates and the latest version of this notebook.


<font size = 4>**Version 0.1**
This is the first release of this notebook.

---