# Histogram Plot Examples

This notebook demonstrates the features and options of the `histogram_plot` function, one example per section.

## Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

from omero_screen_plots import histogram_plot
from omero_screen_plots.colors import COLOR
from omero_screen_plots.utils import save_fig

# Setup output directory
path = Path("../images")
path.mkdir(parents=True, exist_ok=True)

# Load sample data
df = pd.read_csv("data/sample_plate_data.csv")

# Define conditions for examples
conditions = ['control', 'cond01', 'cond02', 'cond03']
print("Available conditions:", conditions)
print("Available cell lines:", df['cell_line'].unique())
print("Data shape:", df.shape)
print("\nSample features:")
print("- integrated_int_DAPI_norm: DNA content (for cell cycle)")
print("- intensity_mean_p21_nucleus: p21 protein expression")
print("- area_cell: Cell area")

## 1. Basic Histogram - Single Condition

In [None]:
# Default histogram for a single condition
fig, ax = histogram_plot(
    df=df, 
    feature="intensity_mean_p21_nucleus",
    conditions="control",  # Single condition as string
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    title="Basic Histogram - p21 Expression",
    show_title=True,
    save=True,
    path=path,
    file_format="pdf"
)

## 2. Multiple Conditions - Subplots

In [None]:
# Multiple conditions create separate subplots
fig, axes = histogram_plot(
    df=df,
    feature="intensity_mean_p21_nucleus",
    conditions=conditions,  # List of conditions
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    bins=50,  # Fewer bins for clearer visualization
    title="Multiple Conditions - p21 Expression",
    show_title=True,
    fig_size=(16, 4),  # Set explicitly here: 4 cm of width per condition
    save=True,
    path=path,
    file_format="pdf"
)
print(f"Created {len(axes)} subplots")

## 3. DNA Content with Log Scale

In [None]:
# DNA content histogram with log2 scale (common for cell cycle analysis)
fig, axes = histogram_plot(
    df=df,
    feature="integrated_int_DAPI_norm",
    conditions=conditions,
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    bins=100,  # Default 100 bins for good resolution
    log_scale=True,
    log_base=2,  # Base 2 for DNA content (2N, 4N, etc.)
    x_limits=(1, 16),  # Typical DNA content range
    title="DNA Content Distribution (Log2 Scale)",
    show_title=True,
    save=True,
    path=path,
    file_format="pdf"
)
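
To see why base 2 suits DNA content, the sketch below uses hypothetical normalized values in which the 2N peak sits at 2 and the 4N peak at 4 (illustrative numbers, not taken from the sample data). On a log2 axis each doubling is one unit wide, and bin edges spaced evenly in log2 space divide the range `(1, 16)` into equal-looking steps. The helper `log2_bin_edges` is a made-up name for this illustration; how `histogram_plot` builds its bins internally is not shown in this notebook.

```python
import math

def log2_bin_edges(lower, upper, n_bins):
    """Return n_bins + 1 edges spaced evenly in log2 space between lower and upper."""
    lo, hi = math.log2(lower), math.log2(upper)
    step = (hi - lo) / n_bins
    return [2 ** (lo + i * step) for i in range(n_bins + 1)]

# Hypothetical peak positions for 2N and 4N DNA content
g1_peak, g2_peak = 2.0, 4.0
distance = math.log2(g2_peak) - math.log2(g1_peak)  # one doubling on the log2 axis

# Four bins across the x_limits used above: one bin per doubling
edges = log2_bin_edges(1, 16, n_bins=4)
print(distance)
print(edges)
```

With linear bins, the 4N peak would be spread over twice as many bins as the 2N peak. Log-spaced bins give both peaks a comparable width.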

## 4. KDE Overlay - Single Plot Comparison

In [None]:
# KDE overlay mode: Shows only smooth density curves in a single plot
fig, ax = histogram_plot(
    df=df,
    feature="integrated_int_DAPI_norm",
    conditions=conditions,  # All conditions overlaid
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    kde_overlay=True,  # Enables KDE-only mode
    kde_smoothing=0.8,  # Smoothing factor (0.5-2.0)
    log_scale=True,
    log_base=2,
    x_limits=(1, 16),
    title="KDE Overlay - DNA Content Comparison",
    show_title=True,
    fig_size=(8, 5),
    save=True,
    path=path,
    file_format="pdf"
)
print("Note: Returns single Axes object for KDE overlay mode")

## 5. KDE Smoothing Comparison

In [None]:
# Compare different smoothing levels
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
smoothing_values = [0.5, 1.0, 2.0]
titles = ["Smooth (0.5)", "Default (1.0)", "Detailed (2.0)"]

for ax, smooth, title in zip(axes, smoothing_values, titles):
    histogram_plot(
        df=df,
        feature="intensity_mean_p21_nucleus",
        conditions=['control', 'cond02'],
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        kde_overlay=True,
        kde_smoothing=smooth,
        axes=ax
    )
    ax.set_title(title)

fig.suptitle("KDE Smoothing Comparison", fontsize=12)
fig.tight_layout()
save_fig(fig, path, "histogram_kde_smoothing_comparison", fig_extension="pdf")

## 6. Normalized Histograms (Density)

A normalized histogram uses `stat="density"` in seaborn's `histplot`, which:

1. Converts counts to probability density: each bar's height represents probability density rather than a raw count
2. Normalizes the total area to 1: the bar areas sum to 1, as for a true probability density
3. Applies the formula: density = count / (total_count * bin_width)

This means:

- The y-axis shows probability density (not probability)
- The area of each bar (height × width) is the probability of a value falling in that bin
- The total area of all bars is 1.0

Key points:

- Not a simple percentage: counts are divided by the bin width as well as by the total count
- Bin width matters: narrower bins have higher density values for the same probability
- Units: the y-axis units are 1/[x-axis units] (e.g., if x is cell area in μm², y is 1/μm²)

Why use density normalization?

1. Compare distributions with different sample sizes: a condition with 1,000 cells can be compared directly to one with 10,000 cells
2. Compatible with KDE: KDE curves are density-based, so normalized histograms are on the same scale
3. Standard statistical practice: this is the usual way to build a probability density histogram
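
The density formula above can be checked by hand. The sketch below is a plain-Python illustration on made-up values, not the code `histogram_plot` runs. The bins are deliberately of unequal width to show that bar height and bin probability are different things.

```python
def density_histogram(values, edges):
    """Return (counts, densities) for bins [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    # density = count / (total_count * bin_width)
    densities = [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]
    return counts, densities

# Made-up measurements and bin edges (widths 2, 2 and 6)
values = [1, 2, 2, 3, 3, 3, 4, 6, 7, 9]
edges = [0, 2, 4, 10]

counts, densities = density_histogram(values, edges)
# Sum of height * width over all bars
area = sum(d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities))
print(counts)
print(densities)
print(area)
```

The last bin holds 4 of the 10 values (probability 0.4), yet its bar is lower than the middle one because it is three times as wide.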

In [None]:
# Normalized histograms show density instead of counts
# Useful for comparing distributions with different sample sizes
fig, axes = histogram_plot(
    df=df,
    feature="area_cell",
    conditions=conditions,
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    bins=50,
    normalize=True,  # Show density instead of counts
    title="Cell Area Distribution (Normalized)",
    show_title=True,
    save=True,
    path=path,
    file_format="pdf"
)

## 7. Custom Binning Strategies

In [None]:
# Compare different binning strategies
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
bin_strategies = [30, 100, 'auto', 'sturges']
titles = ['30 bins', '100 bins (default)', 'Auto', 'Sturges']

for ax, bins, title in zip(axes.flat, bin_strategies, titles):
    histogram_plot(
        df=df,
        feature="intensity_mean_p21_nucleus",
        conditions='control',
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        bins=bins,
        axes=ax
    )
    ax.set_title(title)

fig.suptitle("Binning Strategy Comparison", fontsize=12)
fig.tight_layout()
save_fig(fig, path, "histogram_binning_comparison", fig_extension="pdf")
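
Named strategies such as `'sturges'` derive the bin count from the sample size. The sketch below assumes the string is handed on to NumPy's bin estimators (via seaborn), which this notebook does not confirm. Under that assumption, Sturges' rule gives `ceil(log2(n) + 1)` bins. The function name `sturges_bin_count` is hypothetical.

```python
import math

def sturges_bin_count(n_samples):
    """Number of bins suggested by Sturges' rule for n_samples observations."""
    if n_samples < 1:
        raise ValueError("need at least one observation")
    return math.ceil(math.log2(n_samples) + 1)

# Bin counts for a range of plausible cell numbers
for n in (100, 1_000, 10_000, 100_000):
    print(n, sturges_bin_count(n))
```

Because the rule grows only logarithmically, it suggests far fewer bins than the default of 100 even for very large samples, which can merge nearby peaks in multimodal data.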

## 8. Custom Colors

In [None]:
# Custom colors for histograms
custom_colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']

# Regular histograms with custom colors (all use first color)
fig1, axes1 = histogram_plot(
    df=df,
    feature="intensity_mean_p21_nucleus",
    conditions=conditions,
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    colors=['#FF6B6B'],  # Single color for all histograms
    bins=50,
    title="Custom Color Histograms",
    show_title=True
)

# KDE overlay with custom colors (uses different colors)
fig2, ax2 = histogram_plot(
    df=df,
    feature="intensity_mean_p21_nucleus",
    conditions=conditions,
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    kde_overlay=True,
    colors=custom_colors,  # Different colors for each KDE line
    title="Custom Color KDE Overlay",
    show_title=True,
    fig_size=(8, 5)
)

save_fig(fig1, path, "histogram_custom_colors", fig_extension="pdf")
save_fig(fig2, path, "histogram_kde_custom_colors", fig_extension="pdf")

## 9. Combining with Other Plots

In [None]:
# Create a complex figure combining different histogram views
fig = plt.figure(figsize=(12, 8))

# Top: Multiple histograms for different conditions
for i, cond in enumerate(conditions[:2]):
    ax = plt.subplot(2, 3, i+1)
    histogram_plot(
        df=df,
        feature="integrated_int_DAPI_norm",
        conditions=cond,
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        log_scale=True,
        log_base=2,
        x_limits=(1, 16),
        axes=ax
    )
    ax.set_title(f"DNA Content - {cond}", fontsize=10)

# Top right: KDE overlay
ax3 = plt.subplot(2, 3, 3)
histogram_plot(
    df=df,
    feature="integrated_int_DAPI_norm",
    conditions=conditions[:2],
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    kde_overlay=True,
    log_scale=True,
    log_base=2,
    x_limits=(1, 16),
    axes=ax3
)
ax3.set_title("KDE Comparison", fontsize=10)

# Bottom: Different features
features = ["intensity_mean_p21_nucleus", "area_cell", "intensity_mean_EdU_nucleus_norm"]
for i, feat in enumerate(features):
    ax = plt.subplot(2, 3, i+4)
    histogram_plot(
        df=df,
        feature=feat,
        conditions="control",
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        bins=50,
        axes=ax
    )
    ax.set_title(feat.replace("_", " ").title()[:20], fontsize=10)

fig.suptitle("Comprehensive Histogram Analysis", fontsize=14, fontweight='bold')
fig.tight_layout()
save_fig(fig, path, "histogram_comprehensive", fig_extension="pdf")

## 10. Advanced KDE Parameters

In [None]:
# Advanced KDE customization
fig, ax = histogram_plot(
    df=df,
    feature="intensity_mean_p21_nucleus",
    conditions=conditions,
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    kde_overlay=True,
    kde_smoothing=0.6,  # Extra smooth
    kde_params={
        'gridsize': 500,     # High resolution for very smooth curves
        'bw_method': 'scott', # Bandwidth selection method
        'cut': 3,            # Extend KDE beyond data range
        'alpha': 0.9,        # Opacity (1.0 = fully opaque)
        'linewidth': 3       # Thicker lines
    },
    title="Advanced KDE Parameters",
    show_title=True,
    fig_size=(10, 5),
    save=True,
    path=path,
    file_format="pdf"
)
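
The keys `gridsize`, `bw_method` and `cut` resemble the arguments of seaborn's `kdeplot`. Whether `kde_params` is forwarded there unchanged is an assumption. The sketch below is a minimal pure-Python Gaussian KDE, written for illustration only, that shows what each of the three does: Scott's rule sets the bandwidth, `cut` extends the evaluation grid past the data, and `gridsize` sets how many points the curve is evaluated on. The function name and its `bw_adjust` argument are hypothetical.

```python
import math
import statistics

def gaussian_kde_curve(values, gridsize=200, cut=3, bw_adjust=1.0):
    """Evaluate a 1-D Gaussian KDE on an evenly spaced grid.

    Bandwidth follows Scott's rule (std * n ** -0.2), scaled by bw_adjust.
    The grid extends cut * bandwidth past the smallest and largest value.
    """
    n = len(values)
    bandwidth = statistics.stdev(values) * n ** (-1 / 5) * bw_adjust
    start = min(values) - cut * bandwidth
    stop = max(values) + cut * bandwidth
    step = (stop - start) / (gridsize - 1)
    grid = [start + i * step for i in range(gridsize)]
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]
    return grid, density

# Made-up bimodal sample standing in for a measured feature
values = [1.0, 1.2, 1.1, 0.9, 1.3, 3.0, 3.2, 2.9, 3.1, 2.8]
grid, density = gaussian_kde_curve(values, gridsize=500, cut=3)

# Trapezoidal area under the curve; close to 1 because cut=3 captures the tails
area = sum((density[i] + density[i + 1]) / 2 * (grid[i + 1] - grid[i])
           for i in range(len(grid) - 1))
print(round(area, 3))
```

In this sketch `gridsize` only changes how finely the curve is drawn. The shape of the estimate is set by the bandwidth.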

## 11. Figure Size Control

In [None]:
# Demonstrate figure size control
sizes = [(4, 4), (8, 4), (6, 8)]
titles = ["Square (4x4 cm)", "Wide (8x4 cm)", "Tall (6x8 cm)"]

for size, title in zip(sizes, titles):
    fig, ax = histogram_plot(
        df=df,
        feature="intensity_mean_p21_nucleus",
        conditions="control",
        condition_col="condition",
        selector_col="cell_line",
        selector_val="MCF10A",
        fig_size=size,
        size_units="cm",  # Can also use "inches"
        title=title,
        show_title=True
    )
    plt.show()
    plt.close()
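
Matplotlib itself measures figure sizes in inches, so a size given in centimetres has to be divided by 2.54 at some point. The helper below sketches that conversion. The name `cm_to_inches` is hypothetical, and how `histogram_plot` converts sizes internally is not shown in this notebook.

```python
CM_PER_INCH = 2.54

def cm_to_inches(size_cm):
    """Convert a (width, height) tuple from centimetres to inches."""
    width_cm, height_cm = size_cm
    return (width_cm / CM_PER_INCH, height_cm / CM_PER_INCH)

# The three sizes used in the loop above
for size in [(4, 4), (8, 4), (6, 8)]:
    width_in, height_in = cm_to_inches(size)
    print(f"{size} cm -> ({width_in:.2f}, {height_in:.2f}) in")
```

A 4 x 4 cm panel is only about 1.6 inches on each side, so font sizes and tick density matter more at these dimensions than in a default Matplotlib figure.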

## 12. X-axis Customization

In [None]:
# Demonstrate x-axis customization options
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Default x-axis
histogram_plot(
    df=df,
    feature="area_cell",
    conditions="control",
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    axes=axes[0]
)
axes[0].set_title("Default X-axis")

# Custom x-limits
histogram_plot(
    df=df,
    feature="area_cell",
    conditions="control",
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    x_limits=(0, 5000),  # Focus on smaller cells
    axes=axes[1]
)
axes[1].set_title("Custom X-limits (0-5000)")

# Rotated labels (useful for long or crowded tick labels)
histogram_plot(
    df=df,
    feature="area_cell",
    conditions="control",
    condition_col="condition",
    selector_col="cell_line",
    selector_val="MCF10A",
    rotation=45,  # Rotate x-tick labels
    axes=axes[2]
)
axes[2].set_title("Rotated X-labels")

fig.suptitle("X-axis Customization Options", fontsize=14)
fig.tight_layout()
save_fig(fig, path, "histogram_xaxis_options", fig_extension="pdf")

## Summary

The `histogram_plot` function provides:

1. **Flexible input**: Single condition (string) or multiple conditions (list)
2. **Two visualization modes**:
   - Regular histograms (separate subplots for multiple conditions)
   - KDE overlay mode (single plot with overlaid density curves)
3. **Log scale support**: Useful for DNA content and other data that vary multiplicatively (e.g., 2N, 4N)
4. **Binning control**: Integer count, `'auto'`, or another named strategy such as `'sturges'`
5. **Normalization**: Show density instead of counts
6. **Extensive customization**: Colors, figure size, titles, axis formatting
7. **KDE smoothing**: Adjustable smoothness for density curves

### Key Parameters:
- `kde_overlay=True`: Switches to KDE-only mode
- `kde_smoothing`: Controls curve smoothness (0.5-2.0)
- `log_scale=True, log_base=2`: For DNA content analysis
- `normalize=True`: For comparing distributions of different sizes
- `bins`: 100 (default), 'auto', or custom value
- `fig_size`: Dynamic default or custom (width, height) in cm or inches