# Scatter Plot Examples

This notebook demonstrates the features and options of the `scatter_plot` function.

## Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

from omero_screen_plots import scatter_plot
from omero_screen_plots.colors import COLOR
from omero_screen_plots.utils import save_fig

# Setup output directory
path = Path("../images")
path.mkdir(parents=True, exist_ok=True)

# Load sample data
df = pd.read_csv("data/sample_plate_data.csv")

# Define conditions for examples
conditions = ['control', 'cond01', 'cond02', 'cond03']
print("Available conditions:", conditions)
print("Available cell lines:", df['cell_line'].unique())
print("Data shape:", df.shape)
print("\nSample features:")
print("- integrated_int_DAPI_norm: DNA content (for cell cycle)")
print("- intensity_mean_EdU_nucleus_norm: EdU intensity (S-phase marker)")
print("- intensity_mean_p21_nucleus: p21 protein expression")
print("- area_cell: Cell area")
print("- cell_cycle: Cell cycle phase classification")

## 1. Basic Scatter Plot - Single Condition (DNA vs EdU)

By default, the scatter plot shows DNA content vs EdU intensity and automatically applies:
- Log scales (base 2) for both axes
- Cell cycle phase coloring (if available)
- KDE density overlay
- Reference lines at x=3, y=3 for cell cycle gating
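
To make the role of the two reference lines concrete, here is a minimal sketch of a quadrant-style gate. It assumes that EdU-positive (S-phase) cells lie above the horizontal line and that the vertical line separates G1 from G2/M among EdU-negative cells. The function `gate_phase` and its cutoffs are hypothetical; the `cell_cycle` column in the sample data may have been assigned by a different rule, and this sketch ignores Sub-G1 and polyploid cells.

```python
def gate_phase(dna, edu, dna_cut=3.0, edu_cut=3.0):
    """Assign a coarse phase label from two cutoffs (illustrative only)."""
    if edu >= edu_cut:
        return "S"  # EdU-positive: assumed to be replicating DNA
    # EdU-negative cells are split by DNA content
    return "G1" if dna < dna_cut else "G2/M"


# (normalized DNA content, normalized EdU intensity) for three made-up cells
cells = [(2.0, 1.2), (2.9, 8.0), (4.1, 1.5)]
labels = [gate_phase(dna, edu) for dna, edu in cells]
print(labels)
```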

In [None]:
# Default DNA vs EdU scatter plot - automatically detects cell cycle context
fig, ax = scatter_plot(
    df=df, 
    conditions="control",  # Single condition as string
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=3000,  # Sample 3000 cells for performance
    title="Cell Cycle Analysis - DNA vs EdU",
    show_title=True,
    save=True,
    path=path,
    file_format="pdf"
)
print("Auto-detected DNA vs EdU plot with cell cycle coloring and KDE overlay")

## 2. Multiple Conditions - Subplots

Passing a list of conditions creates one subplot per condition, each with its own title.

In [None]:
# Multiple conditions create separate subplots
fig, axes = scatter_plot(
    df=df,
    conditions=conditions,  # List of conditions
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=2000,  # Fewer cells per plot for performance
    title="Cell Cycle Analysis - Multiple Conditions",
    show_title=True,
    fig_size=(16, 4),  # 4cm per condition width
    save=True,
    path=path,
    file_format="pdf"
)
print(f"Created {len(axes)} subplots for {len(conditions)} conditions")
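
The `fig_size` comment above gives the width in centimetres, whereas matplotlib's own `figsize` argument expects inches. If you build figures directly with `plt.subplots` (as later sections do) and want matching dimensions, a small conversion helper is enough. `cm_to_inches` is a hypothetical helper for this notebook, not part of `omero_screen_plots`, and the assumption that `fig_size=(16, 4)` is in centimetres rests only on that comment.

```python
CM_PER_INCH = 2.54


def cm_to_inches(size_cm):
    """Convert a (width, height) tuple from centimetres to inches."""
    width_cm, height_cm = size_cm
    return (width_cm / CM_PER_INCH, height_cm / CM_PER_INCH)


width_in, height_in = cm_to_inches((16, 4))
print(f"{width_in:.2f} x {height_in:.2f} inches")  # 6.30 x 1.57 inches
```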

## 3. Custom Features - DNA vs Protein Expression

When DNA content is on the x-axis and another feature is on the y-axis, these automatic settings apply:
- X-axis: Log scale, limits (1,16), reference line at 3
- Cell cycle coloring (if available)
- No KDE overlay (only for DNA vs EdU)
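
On a base-2 log axis, equal distances correspond to doublings, so the limits (1, 16) span four doublings. The sketch below lists the powers of two inside a given range; `log2_ticks` is a hypothetical helper, and the actual tick placement is decided by the plotting library.

```python
import math


def log2_ticks(lower, upper):
    """Return the powers of two lying within [lower, upper]."""
    first = math.ceil(math.log2(lower))
    last = math.floor(math.log2(upper))
    return [2 ** k for k in range(first, last + 1)]


ticks = log2_ticks(1, 16)
print(ticks)  # [1, 2, 4, 8, 16]

# Position of the reference line at 3, measured in doublings above 1
print(round(math.log2(3), 2))  # 1.58
```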

In [None]:
# DNA content vs protein expression - auto-detects DNA settings
fig, axes = scatter_plot(
    df=df,
    y_feature="intensity_mean_p21_nucleus",  # Change y-feature
    conditions=conditions,
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=3000,
    y_scale="linear",
    y_limits=(1000, 12000),  # Custom y-limits for p21 intensity
    title="DNA Content vs p21 Expression",
    show_title=True,
    save=True,
    path=path,
    file_format="pdf"
)
print("DNA content x-axis auto-detected: log scale, limits (1,16), cell cycle colors")

## 4. Threshold-Based Coloring

When a threshold is specified, it overrides cell cycle coloring, and points are colored by whether their y-value falls below or above it:
- Blue points: Below threshold
- Red points: Above threshold
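
As a rough illustration of the rule above, the sketch below assigns a color to each point by comparing its value with the threshold. It uses plain NumPy rather than `scatter_plot` internals; the function name, the color names, and the choice to treat values equal to the threshold as "above" are assumptions.

```python
import numpy as np


def threshold_colors(values, threshold):
    """Return 'blue' for values below the threshold and 'red' otherwise."""
    values = np.asarray(values, dtype=float)
    return np.where(values < threshold, "blue", "red")


# Made-up p21 intensities around a threshold of 5000
p21 = [1200.0, 4999.0, 5000.0, 9800.0]
colors = threshold_colors(p21, threshold=5000)
print(colors.tolist())
```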

In [None]:
# Threshold-based coloring overrides cell cycle colors
fig, axes = scatter_plot(
    df=df,
    y_feature="intensity_mean_p21_nucleus",
    conditions=conditions,
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=3000,
    threshold=5000,  # Threshold for p21 expression
    y_limits=(1000, 12000),
    title="p21 Expression Threshold Analysis",
    show_title=True,
    save=True,
    path=path,
    file_format="pdf"
)
print("Threshold coloring: Blue (below 5000) vs Red (above 5000)")

## 5. Linear Scale Scatter Plot

For non-DNA features, linear scales are used by default.

In [None]:
# Linear scale plot - no auto-detection since neither axis is DNA/EdU
fig, ax = scatter_plot(
    df=df,
    x_feature="area_cell",
    y_feature="intensity_mean_p21_nucleus",
    conditions="control",
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=5000,
    x_limits=(0, 8000),  # Cell area range
    y_limits=(1000, 12000),  # p21 intensity range
    title="Cell Area vs p21 Expression (Linear Scale)",
    show_title=True,
    save=True,
    path=path,
    file_format="pdf"
)
print("Linear scales used - no special DNA/EdU detection")

## 6. Custom Scales and Reference Lines

Override automatic detection with custom settings.

In [None]:
# Override automatic settings with custom scales
fig, ax = scatter_plot(
    df=df,
    x_feature="area_cell",
    y_feature="intensity_mean_p21_nucleus",
    conditions="control",
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=5000,
    x_scale="log",  # Force log scale on area
    y_scale="log",  # Force log scale on p21
    x_limits=(100, 10000),
    y_limits=(1000, 15000),
    vline=2000,  # Custom vertical reference line
    hline=5000,  # Custom horizontal reference line
    kde_overlay=True,  # Force KDE overlay
    title="Custom Log Scales with Reference Lines",
    show_title=True,
    save=True,
    path=path,
    file_format="pdf"
)
print("Custom log scales and reference lines applied")

## 7. Manual Cell Cycle Coloring

Explicitly control cell cycle phase coloring and order.
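
Before fixing a `hue_order`, it can help to check which phases are actually present, since a phase that is listed but absent from the data has no points to draw. The sketch uses a small made-up frame; with the notebook's data you would use `df["cell_cycle"]` instead.

```python
import pandas as pd

phase_order = ["Sub-G1", "G1", "S", "G2/M", "Polyploid"]

# Made-up labels standing in for df["cell_cycle"]
demo = pd.DataFrame({"cell_cycle": ["G1", "S", "G1", "G2/M", "G1", "S"]})

# Count cells per phase, reported in the intended plotting order
counts = demo["cell_cycle"].value_counts().reindex(phase_order, fill_value=0)
print(counts.to_dict())
```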

In [None]:
# Manual cell cycle phase control
fig, ax = scatter_plot(
    df=df,
    conditions="control",
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=4000,
    hue="cell_cycle",  # Explicit hue setting
    hue_order=["Sub-G1", "G1", "S", "G2/M", "Polyploid"],  # Custom order
    show_legend=True,  # Show legend
    legend_title="Cell Cycle Phase",
    title="Manual Cell Cycle Phase Coloring",
    show_title=True,
    fig_size=(6, 6),  # Square plot
    save=True,
    path=path,
    file_format="pdf"
)
print("Manual cell cycle control with legend")

## 8. KDE Overlay Customization

Control the appearance of KDE density overlays.
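
For readers unfamiliar with kernel density estimation, the sketch below evaluates a two-dimensional Gaussian KDE on a grid with plain NumPy. It only illustrates the idea behind a density overlay. How `scatter_plot` computes its overlay is not shown in this notebook, and the fixed bandwidth, the grid size, and the function name `gaussian_kde_grid` are assumptions made for the example.

```python
import numpy as np


def gaussian_kde_grid(points, grid_x, grid_y, bandwidth):
    """Evaluate an isotropic 2D Gaussian KDE at every (grid_x, grid_y) node."""
    xx, yy = np.meshgrid(grid_x, grid_y)
    # Offsets from every grid node to every data point
    dx = xx[..., None] - points[:, 0]
    dy = yy[..., None] - points[:, 1]
    kernel = np.exp(-(dx ** 2 + dy ** 2) / (2 * bandwidth ** 2))
    # Normalize so that the estimate integrates to 1
    norm = 2 * np.pi * bandwidth ** 2 * len(points)
    return kernel.sum(axis=-1) / norm


rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))  # synthetic stand-in for two features
grid = np.linspace(-5, 5, 81)
density = gaussian_kde_grid(points, grid, grid, bandwidth=0.3)

# Sanity check: a density should integrate to roughly 1 over the grid
cell_area = (grid[1] - grid[0]) ** 2
total = float(density.sum() * cell_area)
print(density.shape, round(total, 2))
```

Contour levels drawn over such a grid give the overlay its shape, and the overlay's alpha then controls how strongly it covers the points.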

In [None]:
# Compare different KDE overlay settings
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
kde_settings = [
    {"kde_alpha": 0.1, "kde_cmap": "rocket_r", "title": "Subtle (α=0.1)"},
    {"kde_alpha": 0.3, "kde_cmap": "viridis", "title": "Medium (α=0.3)"},
    {"kde_alpha": 0.6, "kde_cmap": "plasma", "title": "Strong (α=0.6)"}
]

for ax, settings in zip(axes, kde_settings):
    scatter_plot(
        df=df,
        conditions="control",
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        cell_number=3000,
        kde_overlay=True,
        kde_alpha=settings["kde_alpha"],
        kde_cmap=settings["kde_cmap"],
        axes=ax
    )
    ax.set_title(settings["title"])

fig.suptitle("KDE Overlay Customization", fontsize=14)
fig.tight_layout()
save_fig(fig, path, "scatter_kde_comparison", fig_extension="pdf")
print("KDE overlay comparison created")

## 9. Size and Alpha Customization

Control point size and transparency for different data densities.
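
Why low alpha helps with dense data follows from standard alpha compositing: when `n` points with the same opacity `alpha` overlap, the background shows through with weight `(1 - alpha) ** n`, so the combined opacity is `1 - (1 - alpha) ** n`. The helper name below is made up for this illustration.

```python
def stacked_opacity(alpha, n_overlapping):
    """Combined opacity of n overlapping points that share one alpha value."""
    return 1 - (1 - alpha) ** n_overlapping


for alpha in (1.0, 0.3, 0.1):
    row = [round(stacked_opacity(alpha, n), 2) for n in (1, 5, 20)]
    print(alpha, row)
```

At `alpha=0.1`, twenty overlapping points reach about 88% opacity, so dense regions still stand out while isolated points stay faint.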

In [None]:
# Compare different point sizes and transparencies
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
point_settings = [
    {"size": 1, "alpha": 1.0, "title": "Small, Opaque"},
    {"size": 4, "alpha": 1.0, "title": "Large, Opaque"},
    {"size": 2, "alpha": 0.3, "title": "Medium, Transparent"},
    {"size": 1, "alpha": 0.1, "title": "Small, Very Transparent"}
]

# Iterate over a separately named list so the loop variable does not shadow it
for ax, settings in zip(axes.flat, point_settings):
    scatter_plot(
        df=df,
        conditions="control",
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        cell_number=5000,  # More points to show density effects
        size=settings["size"],
        alpha=settings["alpha"],
        kde_overlay=False,  # No KDE to focus on points
        axes=ax
    )
    ax.set_title(settings["title"])

fig.suptitle("Point Size and Transparency Options", fontsize=14)
fig.tight_layout()
save_fig(fig, path, "scatter_size_alpha_comparison", fig_extension="pdf")
print("Size and alpha comparison created")

## 10. Cell Number Sampling Effects

Compare different sample sizes to see the trade-off between rendering speed and visible detail.
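
A minimal sketch of what down-sampling to a fixed number of cells looks like in pandas, assuming simple random sampling without replacement. Whether `scatter_plot` samples in exactly this way is not stated here; `sample_cells` and the fixed `random_state` are hypothetical choices for the illustration.

```python
import pandas as pd


def sample_cells(data, cell_number, random_state=0):
    """Return at most `cell_number` randomly chosen rows of `data`."""
    if len(data) <= cell_number:
        return data  # nothing to drop when the frame is already small enough
    return data.sample(n=cell_number, random_state=random_state)


# Made-up frame standing in for the filtered plate data
demo = pd.DataFrame({"area_cell": range(10_000)})
subset = sample_cells(demo, 3000)
print(len(subset), len(sample_cells(demo, 20_000)))  # 3000 10000
```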

In [None]:
# Compare different cell sampling sizes
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
cell_numbers = [500, 1000, 3000, 10000]

for ax, n_cells in zip(axes, cell_numbers):
    scatter_plot(
        df=df,
        conditions="control",
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        cell_number=n_cells,
        size=1.5,
        alpha=0.6,
        kde_overlay=True,
        kde_alpha=0.2,
        axes=ax
    )
    ax.set_title(f"{n_cells} cells")
    ax.set_xlabel("")  # Remove x-label for cleaner comparison

# Only show y-label on first plot
for ax in axes[1:]:
    ax.set_ylabel("")

fig.suptitle("Cell Sampling Size Comparison", fontsize=14)
fig.tight_layout()
save_fig(fig, path, "scatter_sampling_comparison", fig_extension="pdf")
print("Cell sampling comparison created")

## 11. Complex Multi-Panel Analysis

Create a comprehensive figure combining different scatter plot views.

In [None]:
# Create a complex multi-panel analysis
fig = plt.figure(figsize=(16, 12))

# Top row: Different conditions for DNA vs EdU
for i, cond in enumerate(conditions[:4]):
    ax = plt.subplot(3, 4, i+1)
    scatter_plot(
        df=df,
        conditions=cond,
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        cell_number=2000,
        size=1,
        alpha=0.7,
        kde_overlay=True,
        kde_alpha=0.15,
        axes=ax
    )
    ax.set_title(f"DNA vs EdU - {cond}", fontsize=10)
    if i > 0:
        ax.set_ylabel("")  # Only first plot gets y-label

# Middle row: DNA vs different features
features = ["intensity_mean_p21_nucleus", "area_cell", "intensity_mean_EdU_nucleus_norm"]
feature_names = ["p21 Expression", "Cell Area", "EdU Intensity"]
for i, (feat, name) in enumerate(zip(features, feature_names)):
    ax = plt.subplot(3, 4, i+5)
    scatter_plot(
        df=df,
        y_feature=feat,
        conditions="control",
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        cell_number=3000,
        size=1.5,
        alpha=0.5,
        axes=ax
    )
    ax.set_title(f"DNA vs {name}", fontsize=10)
    if i > 0:
        ax.set_ylabel("")  # Only first plot gets y-label

# Bottom row: Threshold analysis
thresholds = [4000, 5000, 6000]
for i, thresh in enumerate(thresholds):
    ax = plt.subplot(3, 4, i+9)
    scatter_plot(
        df=df,
        y_feature="intensity_mean_p21_nucleus",
        conditions="control",
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        cell_number=3000,
        threshold=thresh,
        size=1.5,
        alpha=0.6,
        axes=ax
    )
    ax.set_title(f"p21 Threshold: {thresh}", fontsize=10)
    if i > 0:
        ax.set_ylabel("")  # Only first plot gets y-label

fig.suptitle("Comprehensive Scatter Plot Analysis", fontsize=16, fontweight='bold')
fig.tight_layout()
save_fig(fig, path, "scatter_comprehensive_analysis", fig_extension="pdf")
print("Comprehensive analysis figure created")
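
The panel numbers `i+1`, `i+5`, and `i+9` above follow from matplotlib's 1-based, row-major subplot numbering. A small helper makes the mapping explicit; `panel_index` is a hypothetical name, not a matplotlib function.

```python
def panel_index(row, col, ncols):
    """Convert zero-based (row, col) to the 1-based index plt.subplot expects."""
    return row * ncols + col + 1


# First panel of each row in a grid with 3 rows and 4 columns
starts = [panel_index(row, 0, ncols=4) for row in range(3)]
print(starts)  # [1, 5, 9]
```

The middle and bottom rows fill only three of the four columns, so panels 8 and 12 stay empty.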

## 12. Custom Colors and Styling

Override default colors and styling options.

In [None]:
# Custom color palettes
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Standard cell cycle colors (auto-detected)
scatter_plot(
    df=df,
    conditions="control",
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=3000,
    hue="cell_cycle",
    axes=axes[0]
)
axes[0].set_title("Standard Cell Cycle Colors")

# Custom cell cycle colors
custom_colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FECA57']
scatter_plot(
    df=df,
    conditions="control",
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=3000,
    hue="cell_cycle",
    palette=custom_colors,
    axes=axes[1]
)
axes[1].set_title("Custom Cell Cycle Colors")

# Threshold colors (blue/red)
scatter_plot(
    df=df,
    y_feature="intensity_mean_p21_nucleus",
    conditions="control",
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    cell_number=3000,
    threshold=5000,
    axes=axes[2]
)
axes[2].set_title("Threshold Colors (Blue/Red)")

fig.suptitle("Color Customization Options", fontsize=14)
fig.tight_layout()
save_fig(fig, path, "scatter_color_customization", fig_extension="pdf")
print("Color customization examples created")

## Summary

The `scatter_plot` function provides intelligent defaults and extensive customization:

### Smart Auto-Detection:
1. **DNA Content (x-axis)**: Automatically applies log scale, limits (1,16), reference line at 3, cell cycle coloring
2. **EdU Intensity (y-axis)**: Automatically applies log scale, reference line at 3
3. **DNA vs EdU**: Full cell cycle plot with KDE overlay
4. **Threshold Override**: When threshold is set, uses blue (below) and red (above) coloring

### Key Features:
- **Flexible input**: Single condition (string) or multiple conditions (list)
- **Cell sampling**: Control performance with `cell_number` parameter
- **Multiple scales**: Linear or log (with auto-detection)
- **KDE overlays**: Density contours with customizable appearance
- **Reference lines**: Automatic or custom positioning
- **Color schemes**: Cell cycle phases, threshold-based, or custom palettes
- **Size control**: Point size and transparency for different data densities

### Common Use Cases:
1. **Cell cycle analysis**: DNA vs EdU with phase coloring and gating
2. **Biomarker analysis**: DNA vs protein expression with threshold coloring
3. **Morphological analysis**: Cell size vs protein levels
4. **Multi-condition comparisons**: Side-by-side treatment effects
5. **Quality control**: Data distribution visualization with density overlays

### Performance Tips:
- Use a `cell_number` between 1000 and 5000 for fast visualization
- Use lower `alpha` values (0.1 to 0.5) for high-density data
- KDE overlays add computational cost but improve interpretation