# Cell State Classification for OSDR v2.0

This notebook walks through the complete workflow for classifying cell types and functional states in triple-negative breast cancer (TNBC) imaging mass cytometry (IMC) data.

**OSDR v2.0 Core Features:**
- PD1+ vs PD1- T cell states (immune exhaustion)
- CAF vs resting fibroblast states
- Additional states: Cytotoxic activity, Macrophage polarization, Tregs

**Pipeline:**
1. Load TNBC data
2. Determine thresholds for all markers
3. Classify cell types
4. Classify functional states
5. Validate and visualize
6. Save classified data for OSDR v2.0 inference

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import state classification module
from osdr_validation.state_classification import (
    auto_determine_thresholds,
    classify_all_states,
    validate_state_classification,
    plot_state_distributions,
    visualize_marker_distribution
)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100

## 1. Load TNBC Data

Load the cell_table with spatial coordinates and marker intensities.

In [None]:
# Load data (adjust path as needed)
# data_path = '../data/cell_table_bothneighbourhoods.csv'
data_path = '/path/to/your/tnbc_data.csv'  # MODIFY THIS

df = pd.read_csv(data_path)

print(f"Loaded {len(df)} cells")
print("\nAvailable columns:")
print(df.columns.tolist())

## 2. Define Markers for Classification

Specify which markers are available for threshold determination.

In [None]:
# Core markers for cell type identification
cell_type_markers = [
    'CD3',          # T cells
    'CD4',          # CD4+ T cells
    'CD8a',         # CD8+ T cells
    'CD68',         # Macrophages
    'Vimentin',     # Mesenchymal (fibroblasts)
    'Pan-Keratin',  # Epithelial/cancer cells
]

# State markers (OSDR v2.0)
state_markers = [
    'PD1',          # T cell exhaustion (PRIMARY v2.0 marker)
    'Alpha-SMA',    # CAF marker (PRIMARY v2.0 marker)
    'Granzyme-B',   # Cytotoxic activity (additional)
    'CD163',        # Macrophage M2 polarization (additional)
    'FOXP3',        # Regulatory T cells (additional)
]

# Combine all markers
all_markers = cell_type_markers + state_markers

# Check which markers are available
available_markers = [m for m in all_markers if m in df.columns]
missing_markers = [m for m in all_markers if m not in df.columns]

print(f"Available markers ({len(available_markers)}):")
for m in available_markers:
    print(f"  ✓ {m}")

if missing_markers:
    print(f"\nMissing markers ({len(missing_markers)}):")
    for m in missing_markers:
        print(f"  ✗ {m}")
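
If some state markers are missing, it may help to restrict classification to the states that can actually be computed. The sketch below assumes a one-to-one mapping from state name to marker, inferred from the comments in the marker lists above. The mapping and the helper `classifiable_states` are illustrative and not part of `osdr_validation`.

```python
# Hypothetical mapping from state name to the marker it depends on,
# inferred from the comments in the state_markers list.
STATE_TO_MARKER = {
    'PD1': 'PD1',
    'CAF': 'Alpha-SMA',
    'Cytotoxic': 'Granzyme-B',
    'Macrophage': 'CD163',
    'Treg': 'FOXP3',
}


def classifiable_states(columns, state_to_marker=STATE_TO_MARKER):
    """Return the states whose required marker is present in `columns`."""
    present = set(columns)
    return [state for state, marker in state_to_marker.items() if marker in present]


# Example with a made-up column list that lacks three of the state markers
example_columns = ['CD3', 'CD8a', 'PD1', 'Alpha-SMA', 'X', 'Y']
states = classifiable_states(example_columns)
print(states)
```

The resulting list could then be passed as `states_to_classify` in step 5, assuming that argument accepts any subset of the five state names.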

## 3. Automatic Threshold Determination

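Determine a positivity threshold for each marker. The cell below uses Gaussian Mixture Models, which suit markers whose intensities split into a negative and a positive population.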

**Methods available:**
- `'gmm'`: Gaussian Mixture Model (best for bimodal distributions)
- `'otsu'`: Otsu's method (maximizes between-class variance)
- `'median'`: Simple median split
- `'percentile'`: Fixed cutoff at the 75th percentile
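
To make the non-GMM options concrete, here is a minimal NumPy sketch of Otsu's method and of the two fixed-quantile cutoffs. It illustrates the general techniques only. How `auto_determine_thresholds` implements them is not shown in this notebook, and the function names and the synthetic data below are assumptions.

```python
import numpy as np


def otsu_threshold(values, bins=256):
    """Otsu's method: choose the histogram cut that maximizes between-class variance."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    weights = counts / counts.sum()

    cum_weight = np.cumsum(weights)            # weight of the low class per cut
    cum_mean = np.cumsum(weights * centers)    # unnormalized mean of the low class
    total_mean = cum_mean[-1]

    # Consider a cut after every bin except the last one
    w0 = cum_weight[:-1]
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros_like(w0)
    between[valid] = (
        (total_mean * w0[valid] - cum_mean[:-1][valid]) ** 2 / (w0[valid] * w1[valid])
    )
    best = int(np.argmax(between))
    return edges[best + 1]  # upper edge of the last bin in the low class


def simple_threshold(values, method):
    """Median split or 75th-percentile cutoff."""
    if method == 'median':
        return float(np.median(values))
    if method == 'percentile':
        return float(np.percentile(values, 75))
    raise ValueError(f"Unknown method: {method}")


# Synthetic bimodal marker: a negative and a positive population
rng = np.random.default_rng(0)
intensities = np.concatenate([rng.normal(0.1, 0.05, 5000), rng.normal(0.9, 0.05, 5000)])
print(f"Otsu:       {otsu_threshold(intensities):.3f}")
print(f"Median:     {simple_threshold(intensities, 'median'):.3f}")
print(f"Percentile: {simple_threshold(intensities, 'percentile'):.3f}")
```

Unlike Otsu's method, the median and percentile cutoffs always label a fixed fraction of cells as positive, regardless of the shape of the distribution.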

In [None]:
# Option 1: Automatic threshold determination (GMM - RECOMMENDED)
print("Determining thresholds using Gaussian Mixture Models...\n")

thresholds = auto_determine_thresholds(
    df,
    markers=available_markers,
    method='gmm',
    plot=True  # Set to False to skip visualization
)

print("\n" + "="*60)
print("DETERMINED THRESHOLDS:")
print("="*60)
for marker, threshold in sorted(thresholds.items()):
    print(f"{marker:15s}: {threshold:8.3f}")
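
For intuition about what a GMM-based threshold means, the sketch below fits a two-component one-dimensional Gaussian mixture with a small hand-written EM loop and returns the point between the two means where the components are equally likely. This is an illustration under stated assumptions (two components, quartile initialization, synthetic data), not the code behind `auto_determine_thresholds`.

```python
import numpy as np


def gmm_threshold(values, n_iter=200, tol=1e-8):
    """Fit a two-component 1D Gaussian mixture by EM and return the
    crossover point between the low and high components."""
    x = np.asarray(values, dtype=float)
    # Initialize the means at the lower and upper quartiles
    mu = np.percentile(x, [25, 75])
    sigma = np.full(2, x.std() / 2 + 1e-9)
    weight = np.array([0.5, 0.5])

    def weighted_density(points):
        # Shape (len(points), 2): mixing weight times Gaussian density
        z = (points[:, None] - mu) / sigma
        return weight * np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = weighted_density(x)
        total = dens.sum(axis=1, keepdims=True) + 1e-300
        resp = dens / total
        # M-step: update means, standard deviations, and mixing weights
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
        weight = nk / nk.sum()
        # Stop when the log-likelihood no longer changes
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    # First grid point between the means where the high component
    # is at least as likely as the low component
    low, high = np.argsort(mu)
    grid = np.linspace(mu[low], mu[high], 1001)
    dens = weighted_density(grid)
    crossing = int(np.argmax(dens[:, high] >= dens[:, low]))
    return grid[crossing]


# Synthetic bimodal marker: 3000 negative and 2000 positive cells
rng = np.random.default_rng(0)
intensities = np.concatenate([rng.normal(0.1, 0.05, 3000), rng.normal(0.9, 0.05, 2000)])
threshold = gmm_threshold(intensities)
print(f"GMM threshold: {threshold:.3f}")
```

A threshold like this is only meaningful when the marker really is bimodal, which is why inspecting the plots for each marker is worthwhile before accepting the values.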

In [None]:
# Option 2: Manual threshold specification
#
# Uncomment and edit the block below to replace the automatic thresholds.
# It replaces the whole dictionary, so list every marker you need:
# thresholds_manual = {
#     'PD1': 0.5,
#     'Alpha-SMA': 0.3,
#     'CD3': 0.4,
#     'CD4': 0.3,
#     'CD8a': 0.3,
#     'CD68': 0.4,
#     'Granzyme-B': 0.2,
#     'CD163': 0.3,
#     'FOXP3': 0.2,
#     'Vimentin': 0.5,
#     'Pan-Keratin': 0.4,
# }
# thresholds = thresholds_manual
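
When only one or two markers need adjusting, merging overrides into the automatic thresholds avoids retyping the full dictionary. The helper below is a hypothetical convenience, not part of `osdr_validation`, and the numeric values are illustrative only.

```python
def apply_overrides(auto_thresholds, overrides):
    """Return a copy of `auto_thresholds` with selected markers overridden."""
    unknown = sorted(set(overrides) - set(auto_thresholds))
    if unknown:
        # Catch typos in marker names before they silently add new keys
        raise KeyError(f"Overrides for markers without an automatic threshold: {unknown}")
    return {**auto_thresholds, **overrides}


auto = {'PD1': 0.42, 'Alpha-SMA': 0.31, 'CD3': 0.38}  # illustrative values only
merged = apply_overrides(auto, {'PD1': 0.5})
print(merged)
```

The original dictionary is left untouched, so the automatic values remain available for comparison.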

## 4. Visualize Individual Marker Distributions

Examine specific markers of interest with their thresholds.

In [None]:
# Visualize key v2.0 markers
key_markers = ['PD1', 'Alpha-SMA', 'Granzyme-B']

for marker in key_markers:
    if marker in df.columns and marker in thresholds:
        visualize_marker_distribution(
            df,
            marker,
            threshold=thresholds[marker]
        )

## 5. Classify Cell Types and States

Apply the full classification pipeline.
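
Conceptually, each functional state is a threshold on one marker, applied only to the cell types where that state is meaningful. The sketch below shows that idea on a toy table, using the `Cell_Type` and `PD1_State` column names and the `'PD1+'`, `'PD1-'`, and `'N/A'` labels that appear later in this notebook. The helper `label_binary_state`, the cell type names, and the strict `>` comparison are assumptions. `classify_all_states` may work differently.

```python
import numpy as np
import pandas as pd


def label_binary_state(df, marker, threshold, eligible, pos_label, neg_label):
    """Label eligible cells as positive or negative for `marker`; others get 'N/A'."""
    above = df[marker] > threshold  # assumption: strictly above counts as positive
    labels = np.where(above, pos_label, neg_label)
    return pd.Series(np.where(eligible, labels, 'N/A'), index=df.index)


# Toy table with made-up intensities
cells = pd.DataFrame({
    'Cell_Type': ['T cell', 'T cell', 'Macrophage', 'Fibroblast'],
    'PD1': [0.9, 0.1, 0.8, 0.0],
})
is_t_cell = cells['Cell_Type'] == 'T cell'
cells['PD1_State'] = label_binary_state(cells, 'PD1', 0.5, is_t_cell, 'PD1+', 'PD1-')
print(cells)
```

The macrophage in the third row has high PD1 intensity but is labeled `'N/A'`, because the state is only defined for T cells in this sketch.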

In [None]:
# Classify all states
df_classified = classify_all_states(
    df,
    thresholds=thresholds,
    states_to_classify=['PD1', 'CAF', 'Cytotoxic', 'Macrophage', 'Treg']
)

print("Classification complete!")
print("\nNew columns added:")
new_cols = [c for c in df_classified.columns if c not in df.columns]
for col in new_cols:
    print(f"  - {col}")

## 6. Validation and Summary Statistics

Summarize the classification results in a table, then plot the distribution of each state.

In [None]:
# Generate summary table
summary = validate_state_classification(df_classified)

print("\nCLASSIFICATION SUMMARY:")
print("="*70)
print(summary.to_string(index=False))
print("="*70)

In [None]:
# Visualize all state distributions
plot_state_distributions(df_classified)

## 7. Focus on OSDR v2.0 Core States

Analyze the primary v2.0 features: PD1+ T cells and CAF states.

In [None]:
# PD1 states in T cells
t_cells = df_classified[df_classified['PD1_State'] != 'N/A']

print("PD1+ T CELL ANALYSIS:")
print("="*60)
print(f"Total T cells analyzed: {len(t_cells)}")
print("\nPD1 state distribution:")
pd1_dist = t_cells['PD1_State'].value_counts()
for state, count in pd1_dist.items():
    pct = 100 * count / len(t_cells)
    print(f"  {state:10s}: {count:6d} cells ({pct:5.1f}%)")

# By T cell subtype
print("\nPD1 states by T cell subtype:")
pd1_by_type = pd.crosstab(
    t_cells['Cell_Type'],
    t_cells['PD1_State'],
    normalize='index'
) * 100
print(pd1_by_type.round(1))

In [None]:
# CAF states in fibroblasts
fibroblasts = df_classified[df_classified['CAF_State'] != 'N/A']

print("CAF ANALYSIS:")
print("="*60)
print(f"Total fibroblasts analyzed: {len(fibroblasts)}")
print("\nCAF state distribution:")
caf_dist = fibroblasts['CAF_State'].value_counts()
for state, count in caf_dist.items():
    pct = 100 * count / len(fibroblasts)
    print(f"  {state:10s}: {count:6d} cells ({pct:5.1f}%)")
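
The PD1 and CAF cells above repeat the same count-and-percentage logic. One possible refactoring is a small helper that works for any state column. The name `state_summary` is hypothetical, and the sketch assumes that inapplicable cells are marked with the string `'N/A'`, as in the filters above.

```python
import pandas as pd


def state_summary(df, state_col, na_label='N/A'):
    """Count and percentage of each state among cells where the state applies."""
    applicable = df.loc[df[state_col] != na_label, state_col]
    counts = applicable.value_counts()
    return pd.DataFrame({
        'Count': counts,
        'Percent': (100 * counts / counts.sum()).round(1),
    })


# Toy example with made-up labels
toy = pd.DataFrame({'PD1_State': ['PD1+', 'PD1-', 'PD1-', 'N/A']})
summary_table = state_summary(toy, 'PD1_State')
print(summary_table)
```

The same call with `'CAF_State'` would reproduce the fibroblast summary.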

## 8. Spatial Visualization of States

Visualize where different states are located in tissue.

In [None]:
# Spatial plot of PD1 states
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# PD1+ vs PD1- T cells
pd1_pos = t_cells[t_cells['PD1_State'] == 'PD1+']
pd1_neg = t_cells[t_cells['PD1_State'] == 'PD1-']

ax1.scatter(pd1_neg['X'], pd1_neg['Y'], s=1, c='blue', alpha=0.5, label='PD1- (functional)')
ax1.scatter(pd1_pos['X'], pd1_pos['Y'], s=1, c='red', alpha=0.5, label='PD1+ (exhausted)')
ax1.set_xlabel('X (μm)')
ax1.set_ylabel('Y (μm)')
ax1.set_title(f'PD1 T Cell States (n={len(t_cells)})')
ax1.legend(markerscale=5)  # enlarge legend markers; s=1 points are barely visible
ax1.set_aspect('equal')

# CAF vs Resting fibroblasts
if len(fibroblasts) > 0:
    caf = fibroblasts[fibroblasts['CAF_State'] == 'CAF']
    resting = fibroblasts[fibroblasts['CAF_State'] == 'Resting']
    
    ax2.scatter(resting['X'], resting['Y'], s=1, c='green', alpha=0.5, label='Resting')
    ax2.scatter(caf['X'], caf['Y'], s=1, c='orange', alpha=0.5, label='CAF')
    ax2.set_xlabel('X (μm)')
    ax2.set_ylabel('Y (μm)')
    ax2.set_title(f'Fibroblast States (n={len(fibroblasts)})')
    ax2.legend(markerscale=5)  # enlarge legend markers; s=1 points are barely visible
    ax2.set_aspect('equal')

plt.tight_layout()
plt.show()

## 9. Save Classified Data

Save the fully classified dataset for OSDR v2.0 inference.

In [None]:
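# Save classified data (create the output directory if it does not exist)
from pathlib import Path

output_path = '../data/tnbc_classified_states.csv'
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
df_classified.to_csv(output_path, index=False)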

print(f"Classified data saved to: {output_path}")
print(f"  Total cells: {len(df_classified)}")
print(f"  Columns: {len(df_classified.columns)}")

In [None]:
# Save thresholds for reproducibility
threshold_df = pd.DataFrame([
    {'Marker': marker, 'Threshold': threshold}
    for marker, threshold in thresholds.items()
])

threshold_path = '../data/marker_thresholds.csv'
threshold_df.to_csv(threshold_path, index=False)

print(f"\nThresholds saved to: {threshold_path}")
print(threshold_df.to_string(index=False))
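
To reuse the saved thresholds in a later session, the CSV can be read back into the same dictionary shape. The sketch below assumes the `Marker` and `Threshold` column names written above. The helper name is hypothetical, and the round trip uses an in-memory buffer with illustrative values so that it runs without touching the disk.

```python
import io

import pandas as pd


def thresholds_from_csv(path_or_buffer):
    """Read a Marker/Threshold table back into a {marker: threshold} dict."""
    table = pd.read_csv(path_or_buffer)
    return dict(zip(table['Marker'], table['Threshold']))


# Round trip through an in-memory buffer instead of a file on disk
original = {'PD1': 0.5, 'Alpha-SMA': 0.3}  # illustrative values only
buffer = io.StringIO()
pd.DataFrame(
    [{'Marker': marker, 'Threshold': value} for marker, value in original.items()]
).to_csv(buffer, index=False)
buffer.seek(0)

restored = thresholds_from_csv(buffer)
print(restored)
```

Passing `threshold_path` instead of the buffer would load the file written in the previous cell.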

## 10. Next Steps: OSDR v2.0 Inference

With states now classified, you can proceed to:

1. **Multivariate logistic regression** for state probability models
2. **State equilibration** in tissue simulations
3. **Treatment response prediction** (if longitudinal data are available)

See notebooks:
- `6_osdr2_state_models.ipynb` - State probability regression
- `7_osdr2_simulation.ipynb` - Two-step simulation with states
- `8_osdr2_validation.ipynb` - Validation on simulated + real data

## Summary

**Completed:**
- ✅ Automatic threshold determination for all markers
- ✅ Cell type classification (T cells, Macrophages, Fibroblasts, Cancer)
- ✅ PD1 state classification (primary v2.0 feature)
- ✅ CAF state classification (primary v2.0 feature)
- ✅ Additional states (Cytotoxic, Macrophage polarization, Tregs)
- ✅ Validation and visualization
- ✅ Data saved for downstream analysis

**Ready for:**
- Phase 2: Multivariate state probability models
- Phase 3: Two-step simulation framework
- Phase 4: OSDR v2.0 validation