# **Analyzing Cell Attachment Patterns: Simulated vs Observed Marker Overlaps**
---

This notebook simulates random cell attachment and compares the simulated cell-marker overlaps with the observed ones, to assess whether the observed attachment pattern is consistent with chance or shows a specific trend.

---

#### 1. Load Dataset
- Load the dataset by entering a file path or pasting tab-separated data.
- The dataset must include FOV, marker information, and cell counts.
- A results folder is created for the output files (default: `./Results` if no path is given).
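
As a rough illustration of the expected input, the sketch below parses a small tab-separated table and checks for the columns that the simulation cell reads (`FOV`, `Marker`, `Repeat`, `num_attached_cells`, `num_contact_cells`). The rows, FOV names and marker names are made up for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical example rows; only the column names come from the notebook.
pasted = (
    "FOV\tMarker\tRepeat\tnum_attached_cells\tnum_contact_cells\n"
    "fov_01\tmarkerA\t1\t12\t7\n"
    "fov_02\tmarkerB\t1\t8\t2\n"
)

# Pasted spreadsheet data is tab-separated, so sep="\t" is needed here.
example_df = pd.read_csv(StringIO(pasted), sep="\t")

# Columns read by the simulation cell later in the notebook.
required = ["FOV", "Marker", "Repeat", "num_attached_cells", "num_contact_cells"]
missing = [col for col in required if col not in example_df.columns]
print("Missing columns:", missing)
```

Running a check like this before the simulation can catch a misnamed column early.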

---

#### 2. Extract FOV Masks
- Unzip the provided ZIP file containing FOV masks.
- Masks correspond to specific FOVs in the dataset and are used in the simulation.
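
Before extracting, it can help to look at what the archive contains. The sketch below lists the `.tif` entries of a ZIP file without unpacking it. It builds a small in-memory archive with hypothetical file names as a stand-in for the real ZIP file.

```python
import io
import zipfile

# Build a small in-memory archive standing in for the real ZIP file.
# The folder and file names are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Masks/fov_01.tif", b"")
    archive.writestr("Masks/fov_02.tif", b"")
    archive.writestr("Masks/readme.txt", b"")

# List the .tif entries without extracting anything.
with zipfile.ZipFile(buffer, "r") as archive:
    mask_entries = [name for name in archive.namelist() if name.lower().endswith(".tif")]

print(mask_entries)
```

With a real archive, `buffer` would be replaced by the path to the ZIP file.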

---

#### 3. Load and Process Masks
- For each FOV, the corresponding binary mask is loaded.
- If a mask is missing, the simulation for that FOV is skipped and its simulated overlap is recorded as NaN.
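
The notebook looks for each mask as `<FOV>.tif` in the mask folder. A minimal sketch of a pre-check that reports which FOVs have no mask is shown below. The helper name `find_missing_masks` and the FOV names are hypothetical, and a temporary folder stands in for the real mask directory.

```python
import os
import tempfile

def find_missing_masks(fov_names, mask_directory):
    """Return the FOV names that have no '<FOV>.tif' file in mask_directory."""
    return [
        fov for fov in fov_names
        if not os.path.isfile(os.path.join(mask_directory, f"{fov}.tif"))
    ]

# Demonstration with a temporary folder and hypothetical FOV names.
with tempfile.TemporaryDirectory() as tmp_dir:
    # Create an empty placeholder file for the first FOV only.
    open(os.path.join(tmp_dir, "fov_01.tif"), "wb").close()
    missing_masks = find_missing_masks(["fov_01", "fov_02"], tmp_dir)

print("FOVs without a mask:", missing_masks)
```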

---

#### 4. Simulate Overlaps
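- Cells are placed at random positions in the FOV as circular footprints. A cell counts as overlapping if any part of its footprint touches the marker mask.
- Each FOV is simulated 1,000 times to estimate the distribution of overlaps expected by chance.
- The mean number of simulated overlaps is calculated for each FOV.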
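
The idea can be sketched on a toy mask. The example below is a simplified stand-in for the simulation cell, not a copy of it: it draws the circular footprint with plain NumPy instead of `skimage.draw.disk`, uses a made-up 100 x 100 mask, and runs fewer simulations. The cell area and pixel size are the notebook's default values.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Toy 100 x 100 binary mask with a square marker region (hypothetical data).
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True

# Convert the cell area (um^2) to a radius in pixels: area_px = area_um2 / pixel_size^2,
# then radius = sqrt(area_px / pi).
cell_area_um2 = 236
pixel_size_um = 0.6341464
radius_px = np.sqrt(cell_area_um2 / pixel_size_um**2 / np.pi)

def count_overlaps(mask, n_cells, radius_px, rng):
    """Place n_cells discs at uniform random centres; count those touching the mask."""
    height, width = mask.shape
    yy, xx = np.ogrid[:height, :width]
    hits = 0
    for _ in range(n_cells):
        yc = rng.integers(0, height)
        xc = rng.integers(0, width)
        # Boolean disc footprint, automatically clipped to the image bounds.
        footprint = (yy - yc) ** 2 + (xx - xc) ** 2 < radius_px**2
        if mask[footprint].any():
            hits += 1
    return hits

n_cells = 10
n_simulations = 200  # fewer than the notebook's 1,000 to keep the sketch quick
simulated = [count_overlaps(mask, n_cells, radius_px, rng) for _ in range(n_simulations)]
mean_simulated = float(np.mean(simulated))
print(f"radius: {radius_px:.2f} px, mean simulated overlaps: {mean_simulated:.2f} of {n_cells}")
```

The mean over all simulations is the chance-level overlap count for that FOV.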

---

#### 5. Compare Observed vs Simulated
- The observed overlaps are compared to the simulated (random) overlaps to assess if the patterns differ from chance.
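
The comparison uses the same quantities as the simulation cell: observed and simulated overlaps are each divided by the number of attached cells, and their ratio is taken. The sketch below shows the arithmetic on made-up numbers. A ratio above 1 means more overlaps than expected by chance, and below 1 means fewer.

```python
import pandas as pd

# Hypothetical per-FOV values for illustration only.
df = pd.DataFrame({
    "num_attached_cells": [20, 10],
    "observed_num_overlaps": [10, 2],
    "mean_simulated_overlaps": [5.0, 4.0],
})

# Proportion of attached cells that overlap a marker, observed and expected by chance.
df["normalized_observed"] = df["observed_num_overlaps"] / df["num_attached_cells"]
df["normalized_expected"] = df["mean_simulated_overlaps"] / df["num_attached_cells"]

# Ratio > 1: more overlap than chance; ratio < 1: less overlap than chance.
df["Ratio"] = df["normalized_observed"] / df["normalized_expected"]
print(df[["normalized_observed", "normalized_expected", "Ratio"]])
```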

---

#### 6. Plot Results
- Visualize the comparison using a combined dot plot and boxplot to show both observed and expected overlaps for each marker.

---

#### 7. Save Results
- The simulation results are saved as a CSV file for further analysis, containing observed and simulated overlap counts.

---

<font size = 4>Notebook created by [Guillaume Jacquemet](https://cellmig.org/)



--------------------------------------------------------
# **Part 1. Prepare the session and load your data**
--------------------------------------------------------


## **1.1. Mount your Google Drive**
---
<font size = 4> To use this notebook on the data present in your Google Drive, you need to mount your Google Drive to this notebook.

<font size = 4> Play the cell below to mount your Google Drive and follow the instructions.

<font size = 4> Once this is done, your data are available in the **Files** tab at the top left of the notebook.

In [None]:
#@markdown ##Play the cell to connect your Google Drive to Colab

from google.colab import drive
drive.mount('/content/gdrive')
%cd /content/gdrive



## **1.2. Load your dataset**
---

<font size = 4> Please ensure that your data is properly organised (see above)


In [None]:
#@markdown ##Load your dataset:

import pandas as pd
import os
from io import StringIO
import ipywidgets as widgets
from IPython.display import display, clear_output

def check_for_nans(df, df_name):
    """
    Checks the given DataFrame for NaN values and prints the count for each column containing NaNs.

    Args:
    df (pd.DataFrame): DataFrame to be checked for NaN values.
    df_name (str): The name of the DataFrame as a string, used for printing.
    """
    # Check if the DataFrame has any NaN values and print a warning if it does.
    nan_columns = df.columns[df.isna().any()].tolist()

    if nan_columns:
        for col in nan_columns:
            nan_count = df[col].isna().sum()
            print(f"Column '{col}' in {df_name} contains {nan_count} NaN values.")
    else:
        print(f"No NaN values found in {df_name}.")


# Initialize dataset_df as an empty DataFrame globally
dataset_df = pd.DataFrame()


# Create widgets
dataset_path_input = widgets.Text(
    value='',
    placeholder='Enter the path to your dataset',
    description='Dataset Path:',
    layout={'width': '80%'}
)

results_folder_input = widgets.Text(
    value='',
    placeholder='Enter the path to your results folder',
    description='Results Folder:',
    layout={'width': '80%'}
)

data_textarea = widgets.Textarea(
    value='',
    placeholder='Or copy and paste your tab-separated data here (direct copy and paste from a spreadsheet)',
    description='Or Paste Data:',
    layout={'width': '80%', 'height': '200px'}
)

load_button = widgets.Button(
    description='Load Data',
    button_style='success',  # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click to load the data',
)

output = widgets.Output()

# Load data function
def load_data(b):
    global dataset_df
    global Results_Folder

    with output:
        clear_output()
        Results_Folder = results_folder_input.value.strip()
        if not Results_Folder:
            Results_Folder = './Results'  # Default path if not provided
        if not os.path.exists(Results_Folder):
            os.makedirs(Results_Folder)  # Create the folder if it doesn't exist
        print(f"Results folder is located at: {Results_Folder}")

        if dataset_path_input.value.strip():
            dataset_path = dataset_path_input.value.strip()
            try:
                dataset_df = pd.read_csv(dataset_path)
                print(f"Loaded dataset from {dataset_path}")
            except Exception as e:
                print(f"Failed to load dataset from {dataset_path}: {e}")
                return  # Nothing was loaded, so skip the checks below
        elif data_textarea.value.strip():
            input_data = StringIO(data_textarea.value)
            try:
                dataset_df = pd.read_csv(input_data, sep='\t')
                print("Loaded dataset from pasted tab-separated data")
            except Exception as e:
                print(f"Failed to load dataset from pasted data: {e}")
                return  # Nothing was loaded, so skip the checks below
        else:
            print("No dataset path provided or data pasted. Please provide a dataset.")
            return

        # Perform a check for NaNs or any other required processing here
        check_for_nans(dataset_df, "your dataset")

        display(dataset_df.head())

# Set the button click event
load_button.on_click(load_data)

# Display the widgets
display(widgets.VBox([dataset_path_input, results_folder_input, data_textarea, load_button, output]))


## **1.3. Unzip your masks (optional)**

In [None]:
import zipfile
import os

# Path to the zip file
zip_file_path = '/content/Masks_AsPc1.zip' # @param {type: "string"}

# Directory where the contents should be extracted
extract_dir = '/content/' # @param {type: "string"}

# Unzip the file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)


# **Part 2. Start the simulations**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skimage.draw import disk
from PIL import Image
import os
from tqdm.notebook import tqdm

# Directory where FOV masks are stored
mask_directory = '/content/Masks_AsPc1'  # @param {type: "string"}

# Area of a single cancer cell in um²
cell_area_um2 = 236  # @param {type: "number"}

# Specify the result folder where you want to save the CSV
Result_folder = Results_Folder

# Cell type for naming the output files
Cell_type = 'AsPC1'  # @param {type: "string"}

# Generate the output file path using the cell type in the file name
output_file_path = os.path.join(Result_folder, f'simulation_results_{Cell_type}.csv')

def load_mask(fov_name, mask_directory):
    """Load the binary mask corresponding to a given FOV in .tif format."""
    mask_path = os.path.join(mask_directory, f"{fov_name}.tif")  # Assuming masks are saved as .tif files
    try:
        with Image.open(mask_path) as img:
            mask = np.array(img)
        return mask
    except FileNotFoundError:
        print(f"Error: Mask file not found for FOV: {fov_name}. Skipping this FOV.")
        return None

def simulate_overlaps(dataset_df, mask_directory, cell_area_um2, pixel_size_um=0.6341464, num_simulations=1000):
    results = []

    # Use tqdm to add a progress bar
    for index, row in tqdm(dataset_df.iterrows(), total=dataset_df.shape[0], desc="Simulating overlaps"):
        fov_name = row['FOV']
        marker = row['Marker']
        repeat = row['Repeat']
        num_attached_cells = row['num_attached_cells']
        observed_num_overlaps = row['num_contact_cells']

        # Load the mask for the specific FOV
        marker_mask = load_mask(fov_name, mask_directory)

        if marker_mask is None:
            # If the mask is missing, assign NaN for the analysis
            mean_simulated_overlaps = np.nan
        else:
            mask_height, mask_width = marker_mask.shape

            # Convert cell area from um² to pixels²
            cell_area_pixels = cell_area_um2 / (pixel_size_um**2)

            # Convert cell area to radius in pixels
            cell_radius_pixels = np.sqrt(cell_area_pixels / np.pi)

            # Run multiple simulations to estimate the distribution of overlaps
            simulated_overlaps_list = []
            for _ in range(num_simulations):
                simulated_overlaps = 0
                # Cast to int so counts read as floats (e.g. 12.0) still work with range()
                for _cell in range(int(num_attached_cells)):
                    # Randomly place the cell within the area defined by the mask
                    x_center, y_center = np.random.randint(0, mask_width), np.random.randint(0, mask_height)

                    # Create a circular footprint for the cell on the mask
                    rr, cc = disk((y_center, x_center), cell_radius_pixels, shape=marker_mask.shape)

                    # Check if any part of the cell footprint overlaps with the marker mask
                    if np.any(marker_mask[rr, cc]):
                        simulated_overlaps += 1

                simulated_overlaps_list.append(simulated_overlaps)

            # Calculate the mean of the simulated overlaps
            mean_simulated_overlaps = np.mean(simulated_overlaps_list)

        # Store the results for this FOV
        results.append({
            'FOV': fov_name,
            'Marker': marker,
            'Repeat': repeat,
            'num_attached_cells': num_attached_cells,
            'observed_num_overlaps': observed_num_overlaps,
            'mean_simulated_overlaps': mean_simulated_overlaps
        })

    # Convert results to a DataFrame
    results_df = pd.DataFrame(results)

    # Calculate normalized observed and expected overlaps
    results_df['normalized_observed'] = results_df['observed_num_overlaps'] / results_df['num_attached_cells']
    results_df['normalized_expected'] = results_df['mean_simulated_overlaps'] / results_df['num_attached_cells']
    results_df['Ratio'] = results_df['normalized_observed'] / results_df['normalized_expected']

    return results_df

# Run the full simulation across all FOVs
simulation_results_df = simulate_overlaps(dataset_df, mask_directory, cell_area_um2)

# Save the DataFrame as a CSV file in the results folder using the cell type in the file name
simulation_results_df.to_csv(output_file_path, index=False)

# Print confirmation
print(f"Simulation results saved to {output_file_path}")


# **Part 3. Plot and save the results**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import os

#@markdown ##Plot and save your results

def plot_observed_vs_expected_and_ratio(simulation_results_df, result_folder, cell_type):
    # Filter out FOVs with fewer than 5 attached cells
    filtered_df = simulation_results_df[simulation_results_df['num_attached_cells'] >= 5]

    # Save the filtered DataFrame with cell type in the file name
    filtered_output_path = os.path.join(result_folder, f'filtered_simulation_results_{cell_type}.csv')
    filtered_df.to_csv(filtered_output_path, index=False)
    print(f"Filtered simulation results saved to {filtered_output_path}")

    # Melt the DataFrame to plot both observed and expected in the same plot
    melted_df = filtered_df.melt(id_vars=['Marker'],
                                 value_vars=['normalized_observed', 'normalized_expected'],
                                 var_name='Type',
                                 value_name='Proportion')

    # Create a new figure with two subplots
    fig, axes = plt.subplots(1, 2, figsize=(16, 8))

    # Create the boxplot on the first subplot (axes[0])
    sns.boxplot(x='Marker', y='Proportion', hue='Type', data=melted_df, palette="Set2", showfliers=False, ax=axes[0])

    # Overlay the dotplot on the first subplot (axes[0])
    sns.stripplot(x='Marker', y='Proportion', hue='Type', data=melted_df,
                  dodge=True, jitter=True, palette="Set1", linewidth=1, alpha=0.7, ax=axes[0])

    # Adjusting the legend for the first plot
    handles, labels = axes[0].get_legend_handles_labels()
    axes[0].legend(handles[0:2], ['Observed', 'Expected'], title='Type', bbox_to_anchor=(1, 1), loc='upper left')
    axes[0].set_xlabel('Marker')
    axes[0].set_ylabel('Proportion of Cells Overlapping')
    axes[0].set_title(f'Observed vs Expected Cell Overlaps by Marker for {cell_type}')
    axes[0].tick_params(axis='x', rotation=45)

    # Plot the ratio as a boxplot in the second subplot (axes[1])
    sns.boxplot(x='Marker', y='Ratio', data=filtered_df, showfliers=False, ax=axes[1], palette="Set3")

    # Overlay the dotplot on the second subplot (axes[1])
    sns.stripplot(x='Marker', y='Ratio', data=filtered_df, dodge=True, jitter=True, color='blue', linewidth=1, alpha=0.7, ax=axes[1])

    # Add a red line at y=1 to represent the expected ratio
    axes[1].axhline(1, color='red', linestyle='--', linewidth=2, alpha=0.5)

    # Labeling and title for the second subplot
    axes[1].set_xlabel('Marker')
    axes[1].set_ylabel('Observed to Expected Ratio')
    axes[1].set_title(f'Ratio of Observed to Expected Overlaps by Marker for {cell_type}')
    axes[1].tick_params(axis='x', rotation=45)

    # Tight layout for proper spacing
    plt.tight_layout()

    # Save the plot as a PDF in the result folder with cell type in the file name
    plot_output_path = os.path.join(result_folder, f'observed_vs_expected_and_ratio_plot_{cell_type}.pdf')
    fig.savefig(plot_output_path, format='pdf')
    print(f"Plot saved to {plot_output_path}")

    # Display the plot
    plt.show()

# Example usage:
plot_observed_vs_expected_and_ratio(simulation_results_df, Result_folder, Cell_type)
