# EqualWidthBinning Illustration

This notebook demonstrates the complete workflow of using EqualWidthBinning from the binning package, including:

1. Basic usage with different input formats (NumPy arrays, pandas DataFrames, Polars DataFrames)
2. Various output formats and transformations
3. scikit-learn integration and pipeline usage
4. Serialization and parameter transfer
5. Advanced features and edge cases

Let's start by importing the necessary libraries and creating sample data.

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
try:
    import polars as pl
    POLARS_AVAILABLE = True
except ImportError:
    POLARS_AVAILABLE = False
    print("Polars not available - skipping polars examples")

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import warnings

# Import binning classes
from binning.methods import EqualWidthBinning

print("Libraries imported successfully!")
print(f"Polars available: {POLARS_AVAILABLE}")

Libraries imported successfully!
Polars available: True


In [3]:
# Create sample data for demonstration
np.random.seed(42)

# Generate synthetic data with different distributions
n_samples = 1000

# Feature 1: Normal distribution
feature1 = np.random.normal(50, 15, n_samples)

# Feature 2: Exponential distribution
feature2 = np.random.exponential(2, n_samples)

# Feature 3: Uniform distribution
feature3 = np.random.uniform(0, 100, n_samples)

# Feature 4: Bimodal distribution
feature4 = np.concatenate([
    np.random.normal(20, 5, n_samples//2),
    np.random.normal(80, 5, n_samples//2)
])

# Create target variable for classification
target = (feature1 + feature2 + feature3 + feature4 > 150).astype(int)

# Create datasets in different formats
# 1. NumPy array
X_numpy = np.column_stack([feature1, feature2, feature3, feature4])

# 2. Pandas DataFrame
X_pandas = pd.DataFrame({
    'normal_feature': feature1,
    'exponential_feature': feature2,
    'uniform_feature': feature3,
    'bimodal_feature': feature4
})

# 3. Polars DataFrame (if available)
if POLARS_AVAILABLE:
    X_polars = pl.DataFrame({
        'normal_feature': feature1,
        'exponential_feature': feature2,
        'uniform_feature': feature3,
        'bimodal_feature': feature4
    })

print("Sample data created:")
print(f"NumPy array shape: {X_numpy.shape}")
print(f"Pandas DataFrame shape: {X_pandas.shape}")
if POLARS_AVAILABLE:
    print(f"Polars DataFrame shape: {X_polars.shape}")
print(f"Target distribution: {np.bincount(target)}")

Sample data created:
NumPy array shape: (1000, 4)
Pandas DataFrame shape: (1000, 4)
Polars DataFrame shape: (1000, 4)
Target distribution: [481 519]


## 1. Basic Usage with Different Input Formats

Let's start with the basic workflow of EqualWidthBinning, showing how it works with different input formats.
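
Before running the package, it may help to see what equal-width binning computes. The following is a minimal NumPy sketch, not the package's implementation: `equal_width_edges` and `assign_bins` are hypothetical helper names, and the choice of half-open intervals `[edge_k, edge_k+1)` is an assumption. The edge formula itself (evenly spaced points between the column minimum and maximum) agrees with the bin edges printed in the cells of this section.

```python
import numpy as np

def equal_width_edges(values, n_bins):
    """Return n_bins + 1 evenly spaced edges spanning [min, max] of values."""
    lo, hi = np.min(values), np.max(values)
    return np.linspace(lo, hi, n_bins + 1)

def assign_bins(values, edges):
    """Map each value to a bin index in 0..n_bins-1 using the interior edges."""
    # With only the interior edges, np.digitize puts x in bin k when
    # edges[k] <= x < edges[k + 1]; values at or above the last interior
    # edge (including the maximum) land in the last bin.
    return np.digitize(values, edges[1:-1])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0])
edges = equal_width_edges(x, n_bins=5)
codes = assign_bins(x, edges)
print(edges.tolist())  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(codes.tolist())  # [0, 0, 1, 1, 2, 4]
```

Because the bins have equal width rather than equal counts, a skewed column can leave some bins nearly empty: here bin 3 receives no values at all.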

In [6]:
# 1.1 Basic usage with NumPy arrays
print("=== NumPy Array Example ===")

# Initialize the binning transformer
binning_numpy = EqualWidthBinning(n_bins=5)

# Fit and transform the data
X_binned_numpy = binning_numpy.fit_transform(X_numpy)

print(f"Original data shape: {X_numpy.shape}")
print(f"Binned data shape: {X_binned_numpy.shape}")
print(f"Original data sample:\n{X_numpy[:3]}")
print(f"Binned data sample:\n{X_binned_numpy[:3]}")

# Check the bin edges
print(f"\nBin edges for each column:")
for i, edges in enumerate(binning_numpy._bin_edges.values()):
    print(f"Column {i}: {edges}")

=== NumPy Array Example ===
Original data shape: (1000, 4)
Binned data shape: (1000, 4)
Original data sample:
[[57.4507123   0.36660227 21.90688091 16.96649853]
 [47.92603548  0.22089763  3.67213625 21.05641846]
 [59.71532807  2.02356823 10.80257541 26.00039478]]
Binned data sample:
[[2 0 1 0]
 [2 0 0 0]
 [2 0 0 1]]

Bin edges for each column:
Column 0: [np.float64(1.380989898963911), np.float64(22.66298639113529), np.float64(43.94498288330667), np.float64(65.22697937547805), np.float64(86.50897586764944), np.float64(107.79097235982081)]
Column 1: [np.float64(0.00644690670446925), np.float64(2.9818466892453372), np.float64(5.957246471786205), np.float64(8.932646254327073), np.float64(11.908046036867942), np.float64(14.883445819408811)]
Column 2: [np.float64(0.0011634755366141114), np.float64(19.957347894068853), np.float64(39.91353231260109), np.float64(59.86971673113333), np.float64(79.82590114966557), np.float64(99.78208556819781)]
Column 3: [np.float64(5.044320145496384), np.float64

In [9]:
# 1.2 Usage with Pandas DataFrames
print("=== Pandas DataFrame Example ===")

# Use a different number of bins (3 instead of 5)
binning_pandas = EqualWidthBinning(n_bins=3)

# Fit and transform
X_binned_pandas = binning_pandas.fit_transform(X_pandas)

print(f"Original DataFrame:\n{X_pandas.head()}")
print(f"\nBinned DataFrame:")
if isinstance(X_binned_pandas, pd.DataFrame):
    print(X_binned_pandas.head())
    print(f"\nData types - Original: {X_pandas.dtypes.tolist()}")
    print(f"Data types - Binned: {X_binned_pandas.dtypes.tolist()}")
    # Show column names are preserved
    print(f"\nColumn names preserved: {list(X_pandas.columns) == list(X_binned_pandas.columns)}")
else:
    # It's a numpy array
    print(X_binned_pandas[:5])  # Show first 5 rows
    print(f"\nData types - Original: {X_pandas.dtypes.tolist()}")
    print(f"Data types - Binned: {X_binned_pandas.dtype}")
    print(f"\nOutput type: numpy array (not DataFrame)")

# Check statistics
print(f"\nOriginal data statistics:")
print(X_pandas.describe())
print(f"\nBinned data statistics:")
if isinstance(X_binned_pandas, pd.DataFrame):
    print(X_binned_pandas.describe())
else:
    # Create a temporary DataFrame for statistics display
    temp_df = pd.DataFrame(X_binned_pandas, columns=X_pandas.columns)
    print(temp_df.describe())

=== Pandas DataFrame Example ===
Original DataFrame:
   normal_feature  exponential_feature  uniform_feature  bimodal_feature
0       57.450712             0.366602        21.906881        16.966499
1       47.926035             0.220898         3.672136        21.056418
2       59.715328             2.023568        10.802575        26.000395
3       72.845448             2.451590        33.886065        17.540488
4       46.487699             0.064191        80.258568        10.617236

Binned DataFrame:
[[1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [2 0 1 0]
 [1 0 2 0]]

Data types - Original: [dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64')]
Data types - Binned: int64

Output type: numpy array (not DataFrame)

Original data statistics:
       normal_feature  exponential_feature  uniform_feature  bimodal_feature
count     1000.000000          1000.000000      1000.000000      1000.000000
mean        50.289981             2.015972        49.449499        50.036445
std       

In [11]:
# 1.3 Usage with Polars DataFrames (if available)
if POLARS_AVAILABLE:
    print("=== Polars DataFrame Example ===")
    
    binning_polars = EqualWidthBinning(n_bins=4)
    X_binned_polars = binning_polars.fit_transform(X_polars)
    
    print(f"Original Polars DataFrame:")
    print(X_polars.head())
    print(f"\nBinned Polars DataFrame:")
    if hasattr(X_binned_polars, 'head'):
        # It's still a polars DataFrame
        print(X_binned_polars.head())
        print(f"\nData types - Original: {X_polars.dtypes}")
        print(f"Data types - Binned: {X_binned_polars.dtypes}")
    else:
        # It's a numpy array
        print("Output type: numpy array")
        print(X_binned_polars[:5])  # Show first 5 rows
        print(f"\nData types - Original: {X_polars.dtypes}")
        print(f"Data types - Binned: {X_binned_polars.dtype}")
else:
    print("Polars not available - skipping polars example")

=== Polars DataFrame Example ===
Original Polars DataFrame:
shape: (5, 4)
┌────────────────┬─────────────────────┬─────────────────┬─────────────────┐
│ normal_feature ┆ exponential_feature ┆ uniform_feature ┆ bimodal_feature │
│ ---            ┆ ---                 ┆ ---             ┆ ---             │
│ f64            ┆ f64                 ┆ f64             ┆ f64             │
╞════════════════╪═════════════════════╪═════════════════╪═════════════════╡
│ 57.450712      ┆ 0.366602            ┆ 21.906881       ┆ 16.966499       │
│ 47.926035      ┆ 0.220898            ┆ 3.672136        ┆ 21.056418       │
│ 59.715328      ┆ 2.023568            ┆ 10.802575       ┆ 26.000395       │
│ 72.845448      ┆ 2.45159             ┆ 33.886065       ┆ 17.540488       │
│ 46.487699      ┆ 0.064191            ┆ 80.258568       ┆ 10.617236       │
└────────────────┴─────────────────────┴─────────────────┴─────────────────┘

Binned Polars DataFrame:
Output type: numpy array
[[2 0 0 0]
 [1 0 0 0]
 [2 0 

## 2. Different Output Formats and Transformations

This section varies the number of bins, bins a constant column, and fits on a training split before transforming the training and test splits separately.
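
One case in this section is a constant column, where the minimum equals the maximum and evenly spaced edges would all coincide. The sketch below shows one common way to handle this: widen the zero-width range by a small margin on each side. This is an assumption for illustration only; the function name `edges_with_constant_guard` and the `eps` parameter are hypothetical, and the package may use a different strategy.

```python
import numpy as np

def edges_with_constant_guard(values, n_bins, eps=0.5):
    """Equal-width edges; a zero-width range is widened by eps on each side."""
    lo, hi = float(np.min(values)), float(np.max(values))
    if lo == hi:
        # Hypothetical fallback: without it, every edge would be identical.
        lo, hi = lo - eps, hi + eps
    return np.linspace(lo, hi, n_bins + 1)

constant = np.ones(100)
edges = edges_with_constant_guard(constant, n_bins=5)
codes = np.digitize(constant, edges[1:-1])
print(np.unique(codes).tolist())  # [2]
```

Under this assumption the constant value sits in the middle of the widened range, so with five bins every row receives the middle bin index.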

In [7]:
# 2.1 Different numbers of bins
print("=== Different Numbers of Bins ===")

# Try different n_bins values
for n_bins in [2, 5, 10]:
    binning = EqualWidthBinning(n_bins=n_bins)
    X_binned = binning.fit_transform(X_pandas.iloc[:, 0:1])  # Just first column
    
    print(f"\nWith {n_bins} bins:")
    # Handle both NumPy and pandas outputs
    if isinstance(X_binned, np.ndarray):
        unique_values = np.unique(X_binned[:, 0])
    else:
        unique_values = np.unique(X_binned.iloc[:, 0])
    print(f"Unique values: {unique_values}")
    print(f"Bin edges: {list(binning._bin_edges.values())[0]}")

# 2.2 Handling edge cases and warnings
print("\n=== Edge Cases ===")

# Data with constant values (may trigger a warning; any warning raised is printed below)
constant_data = pd.DataFrame({'constant': np.ones(100)})
binning_constant = EqualWidthBinning(n_bins=5)

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    X_constant_binned = binning_constant.fit_transform(constant_data)
    
    if w:
        print(f"Warning caught: {w[0].message}")
    # Handle both numpy and pandas outputs
    if isinstance(X_constant_binned, np.ndarray):
        unique_values = np.unique(X_constant_binned[:, 0])
    else:
        unique_values = np.unique(X_constant_binned.iloc[:, 0])
    print(f"Constant data binned: unique values = {unique_values}")

# 2.3 Separate fit and transform
print("\n=== Separate Fit and Transform ===")

binning_separate = EqualWidthBinning(n_bins=3)

# Fit on training data
X_train, X_test = train_test_split(X_pandas, test_size=0.3, random_state=42)
binning_separate.fit(X_train)

# Transform both training and test data using the same bins
X_train_binned = binning_separate.transform(X_train)
X_test_binned = binning_separate.transform(X_test)

print(f"Training data binned shape: {X_train_binned.shape}")
print(f"Test data binned shape: {X_test_binned.shape}")
print(f"Bin edges determined from training data:")
for col_id, edges in binning_separate._bin_edges.items():
    print(f"  Column {col_id}: [{edges[0]:.2f}, {edges[-1]:.2f}] with {len(edges)-1} bins")

=== Different Numbers of Bins ===

With 2 bins:
Unique values: [0 1]
Bin edges: [np.float64(1.380989898963911), np.float64(54.58598112939236), np.float64(107.79097235982081)]

With 5 bins:
Unique values: [0 1 2 3 4]
Bin edges: [np.float64(1.380989898963911), np.float64(22.66298639113529), np.float64(43.94498288330667), np.float64(65.22697937547805), np.float64(86.50897586764944), np.float64(107.79097235982081)]

With 10 bins:
Unique values: [0 1 2 3 4 5 6 7 8 9]
Bin edges: [np.float64(1.380989898963911), np.float64(12.021988145049601), np.float64(22.66298639113529), np.float64(33.30398463722098), np.float64(43.94498288330667), np.float64(54.58598112939236), np.float64(65.22697937547805), np.float64(75.86797762156374), np.float64(86.50897586764944), np.float64(97.14997411373511), np.float64(107.79097235982081)]

=== Edge Cases ===
Constant data binned: unique values = [2]

=== Separate Fit and Transform ===
Training data binned shape: (700, 4)
Test data binned shape: (300, 4)
Bin edges 

## 3. Scikit-learn Integration

EqualWidthBinning follows the scikit-learn transformer interface (`fit`, `transform`, `fit_transform`, `get_params`), so it can be used directly as a pipeline step.
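
To make the transformer interface concrete, here is a toy class written with NumPy only. It is a sketch under stated assumptions, not the package's class: the name `EqualWidthSketch` and the attribute `edges_` are hypothetical. It shows the contract that matters for pipelines: `fit` learns state from the data it is given and returns `self`, and `transform` only applies that stored state.

```python
import numpy as np

class EqualWidthSketch:
    """Toy transformer with a scikit-learn-style fit/transform interface."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn one set of edges per column from the data passed to fit().
        self.edges_ = [
            np.linspace(col.min(), col.max(), self.n_bins + 1) for col in X.T
        ]
        return self  # returning self allows fit(...).transform(...)

    def transform(self, X):
        if not hasattr(self, "edges_"):
            raise RuntimeError("call fit() before transform()")
        X = np.asarray(X, dtype=float)
        # Reuse the stored edges; nothing is re-estimated here.
        cols = [np.digitize(col, e[1:-1]) for col, e in zip(X.T, self.edges_)]
        return np.column_stack(cols)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(70, 2))
X_test = rng.uniform(0, 100, size=(30, 2))

binner = EqualWidthSketch(n_bins=4).fit(X_train)
train_codes = binner.transform(X_train)
test_codes = binner.transform(X_test)
print(train_codes.shape, test_codes.shape)  # (70, 2) (30, 2)
```

A pipeline calls `fit_transform` on each transformer step during training and `transform` during prediction, which is why the test split is always binned with edges learned from the training split.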

In [12]:
# 3.1 Pipeline Integration
print("=== Scikit-learn Pipeline Integration ===")

# Create a pipeline with binning and classification
pipeline = Pipeline([
    ('binning', EqualWidthBinning(n_bins=5)),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42))
])

# Split data for training and testing
X_train_pipe, X_test_pipe, y_train, y_test = train_test_split(
    X_pandas, target, test_size=0.3, random_state=42
)

# Train the pipeline
pipeline.fit(X_train_pipe, y_train)

# Make predictions
y_pred = pipeline.predict(X_test_pipe)
accuracy = accuracy_score(y_test, y_pred)

print(f"Pipeline accuracy: {accuracy:.4f}")

# Access the fitted binning transformer
fitted_binning = pipeline.named_steps['binning']
print(f"Binning fitted: {fitted_binning._fitted}")
print(f"Number of columns processed: {len(fitted_binning._bin_edges)}")

# 3.2 Manual sklearn-style usage
print("\n=== Manual sklearn-style usage ===")

# Initialize binning
binning_sklearn = EqualWidthBinning(n_bins=4)

# Check if fitted (should be False initially)
print(f"Initially fitted: {hasattr(binning_sklearn, '_fitted') and binning_sklearn._fitted}")

# Fit the binning
binning_sklearn.fit(X_train_pipe)

# Check if fitted (should be True after fitting)
print(f"After fit: {binning_sklearn._fitted}")

# Transform data
X_train_transformed = binning_sklearn.transform(X_train_pipe)
X_test_transformed = binning_sklearn.transform(X_test_pipe)

print(f"Transformed training data shape: {X_train_transformed.shape}")
print(f"Transformed test data shape: {X_test_transformed.shape}")

# Verify consistent binning
print(f"Training data bin ranges:")
if isinstance(X_train_transformed, pd.DataFrame):
    for col in X_train_transformed.columns:
        print(f"  {col}: [{X_train_transformed[col].min()}, {X_train_transformed[col].max()}]")
else:
    for i in range(X_train_transformed.shape[1]):
        col_name = X_pandas.columns[i] if i < len(X_pandas.columns) else f"col_{i}"
        print(f"  {col_name}: [{X_train_transformed[:, i].min()}, {X_train_transformed[:, i].max()}]")

print(f"Test data bin ranges:")
if isinstance(X_test_transformed, pd.DataFrame):
    for col in X_test_transformed.columns:
        print(f"  {col}: [{X_test_transformed[col].min()}, {X_test_transformed[col].max()}]")
else:
    for i in range(X_test_transformed.shape[1]):
        col_name = X_pandas.columns[i] if i < len(X_pandas.columns) else f"col_{i}"
        print(f"  {col_name}: [{X_test_transformed[:, i].min()}, {X_test_transformed[:, i].max()}]")

=== Scikit-learn Pipeline Integration ===
Pipeline accuracy: 0.9300
Binning fitted: True
Number of columns processed: 4

=== Manual sklearn-style usage ===
Initially fitted: False
After fit: True
Transformed training data shape: (700, 4)
Transformed test data shape: (300, 4)
Training data bin ranges:
  normal_feature: [0, 3]
  exponential_feature: [0, 3]
  uniform_feature: [0, 3]
  bimodal_feature: [0, 3]
Test data bin ranges:
  normal_feature: [0, 3]
  exponential_feature: [0, 3]
  uniform_feature: [0, 3]
  bimodal_feature: [0, 3]


## 4. Serialization and Parameter Transfer

A fitted EqualWidthBinning exposes its learned bin edges through `get_params()`, so an equivalent transformer can be rebuilt from those parameters without refitting.

In [17]:
# 4.1 Parameter Transfer Workflow
print("=== Parameter Transfer Workflow ===")

# Step 1: Create and fit original binning
original_binning = EqualWidthBinning(n_bins=7)
original_binning.fit(X_pandas)

# Transform some data with the original
original_result = original_binning.transform(X_pandas.iloc[:5])
print("Original binning result:")
print(original_result)

# Step 2: Get parameters from fitted binning
fitted_params = original_binning.get_params()
print(f"\nFitted parameters keys: {list(fitted_params.keys())}")

# Step 3: Create new binning instance with these parameters
# When bin_edges are provided in constructor, the binning is automatically fitted
new_binning = EqualWidthBinning(**fitted_params)

# Step 4: Verify the new binning produces identical results WITHOUT additional fitting
new_result = new_binning.transform(X_pandas.iloc[:5])
print(f"\nNew binning result (ready to use):")
print(new_result)

# Step 5: Verify results are identical
results_identical = np.allclose(original_result, new_result)
print(f"\nResults are identical: {results_identical}")
print(f"New binning is fitted: {new_binning._fitted}")

# 4.2 Detailed Parameter Inspection
print("\n=== Detailed Parameter Inspection ===")

# Show the key parameters that enable reconstruction
important_params = ['n_bins', 'bin_range', 'bin_edges', 'bin_representatives', 'clip', 'preserve_dataframe']
for key in important_params:
    if key in fitted_params:
        value = fitted_params[key]
        if isinstance(value, dict) and len(value) > 0:
            # For dictionaries, show structure
            first_key, first_val = next(iter(value.items()))
            if isinstance(first_val, (list, np.ndarray)):
                print(f"{key}: Dict with {len(value)} columns, e.g., column {first_key} has {len(first_val)} values")
            else:
                print(f"{key}: {value}")
        else:
            print(f"{key}: {value}")

# 4.3 Robustness Test
print("\n=== Robustness Test ===")

# Test with different data to ensure both binnings behave identically
test_data = pd.DataFrame({
    'normal_feature': [25, 75, 50, 100, 0],
    'exponential_feature': [0.5, 2.0, 1.0, 5.0, 0.1],
    'uniform_feature': [10, 90, 50, 100, 0],
    'bimodal_feature': [15, 85, 50, 95, 5]
})

original_test_result = original_binning.transform(test_data)
new_test_result = new_binning.transform(test_data)

print("Test data:")
print(test_data)
print(f"\nOriginal binning on test data:")
print(original_test_result)
print(f"\nNew binning on test data:")
print(new_test_result)

# Compare results
test_results_identical = np.allclose(original_test_result, new_test_result)
print(f"\nTest results identical: {test_results_identical}")

=== Parameter Transfer Workflow ===
Original binning result:
[[3 0 1 0]
 [3 0 0 1]
 [3 0 0 1]
 [4 1 2 0]
 [2 0 5 0]]

Fitted parameters keys: ['bin_edges', 'bin_range', 'bin_representatives', 'clip', 'fit_jointly', 'n_bins', 'preserve_dataframe']

New binning result (ready to use):
[[3 0 1 0]
 [3 0 0 1]
 [3 0 0 1]
 [4 1 2 0]
 [2 0 5 0]]

Results are identical: True
New binning is fitted: True

=== Detailed Parameter Inspection ===
n_bins: 7
bin_range: None
bin_edges: Dict with 4 columns, e.g., column normal_feature has 8 values
bin_representatives: Dict with 4 columns, e.g., column normal_feature has 7 values
clip: True
preserve_dataframe: False

=== Robustness Test ===
Test data:
   normal_feature  exponential_feature  uniform_feature  bimodal_feature
0              25                  0.5               10               15
1              75                  2.0               90               85
2              50                  1.0               50               50
3             100 

In [18]:
# 4.4 Practical Serialization Examples
print("=== Practical Serialization Examples ===")

import json
import pickle

# Example 1: JSON serialization (for configuration files)
print("1. JSON Serialization:")

# Get parameters and convert numpy arrays to lists for JSON compatibility
params_for_json = original_binning.get_params()
json_compatible_params = {}

for key, value in params_for_json.items():
    if isinstance(value, dict):
        # Handle dictionary parameters (like bin_edges)
        json_compatible_params[key] = {
            k: v.tolist() if hasattr(v, 'tolist') else v 
            for k, v in value.items()
        }
    elif hasattr(value, 'tolist'):
        # Handle numpy arrays
        json_compatible_params[key] = value.tolist()
    else:
        # Handle regular parameters
        json_compatible_params[key] = value

# Simulate saving to JSON file
json_str = json.dumps(json_compatible_params, indent=2)
print(f"JSON configuration (first 200 chars): {json_str[:200]}...")

# Simulate loading from JSON
loaded_params = json.loads(json_str)

# Convert lists back to numpy arrays where needed
def restore_numpy_arrays(params):
    restored = {}
    for key, value in params.items():
        if isinstance(value, dict):
            restored[key] = {k: np.array(v) if isinstance(v, list) else v for k, v in value.items()}
        elif isinstance(value, list) and key in ['bin_edges', 'bin_representatives']:
            restored[key] = np.array(value)
        else:
            restored[key] = value
    return restored

restored_params = restore_numpy_arrays(loaded_params)
json_binning = EqualWidthBinning(**restored_params)

# Test JSON-restored binning
json_result = json_binning.transform(X_pandas.iloc[:3])
print(f"JSON-restored binning works: {np.allclose(original_result[:3], json_result)}")

print("\n2. Pickle Serialization (for Python objects):")

# Pickle serialization (preserves exact object state)
import io
buffer = io.BytesIO()
pickle.dump(original_binning, buffer)
buffer.seek(0)
pickled_binning = pickle.load(buffer)

# Test pickled binning
pickle_result = pickled_binning.transform(X_pandas.iloc[:3])
print(f"Pickled binning works: {np.allclose(original_result[:3], pickle_result)}")
print(f"Pickled binning is fitted: {pickled_binning._fitted}")

print("\n3. Custom Serialization Function:")

def serialize_binning(binning_obj):
    """Custom serialization function for binning objects."""
    return {
        'class_name': binning_obj.__class__.__name__,
        'params': binning_obj.get_params(),
        'fitted': binning_obj._fitted,
        'version': '1.0'  # For version compatibility
    }

def deserialize_binning(serialized_data):
    """Custom deserialization function for binning objects."""
    if serialized_data['class_name'] != 'EqualWidthBinning':
        raise ValueError(f"Unsupported class: {serialized_data['class_name']}")
    
    return EqualWidthBinning(**serialized_data['params'])

# Test custom serialization
serialized = serialize_binning(original_binning)
deserialized_binning = deserialize_binning(serialized)

custom_result = deserialized_binning.transform(X_pandas.iloc[:3])
print(f"Custom serialization works: {np.allclose(original_result[:3], custom_result)}")

print("\n4. Summary:")
print("✓ Parameter transfer enables object reconstruction")
print("✓ JSON serialization works for configuration storage") 
print("✓ Pickle serialization preserves complete object state")
print("✓ Custom serialization allows version control and validation")
print("✓ All methods produce identical transformation results")

=== Practical Serialization Examples ===
1. JSON Serialization:
JSON configuration (first 200 chars): {
  "bin_edges": {
    "normal_feature": [
      1.380989898963911,
      16.58241596480061,
      31.78384203063731,
      46.98526809647401,
      62.18669416231071,
      77.38812022814741,
      9...
JSON-restored binning works: True

2. Pickle Serialization (for Python objects):
Pickled binning works: True
Pickled binning is fitted: True

3. Custom Serialization Function:
Custom serialization works: True

4. Summary:
✓ Parameter transfer enables object reconstruction
✓ JSON serialization works for configuration storage
✓ Pickle serialization preserves complete object state
✓ Custom serialization allows version control and validation
✓ All methods produce identical transformation results
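
One caveat for the JSON route: JSON object keys are always strings. The DataFrame example above is keyed by column name, so the round trip is lossless, but a dictionary keyed by integer column index would come back with string keys. The sketch below uses a hypothetical parameter dictionary (the assumption being that array input is keyed by integer index, which this notebook does not confirm) and a hypothetical helper `restore_int_keys`.

```python
import json

# Hypothetical fitted parameters for array input, keyed by integer column index.
params = {"n_bins": 2, "bin_edges": {0: [0.0, 5.0, 10.0], 1: [1.0, 2.0, 3.0]}}

round_tripped = json.loads(json.dumps(params))
print(list(round_tripped["bin_edges"].keys()))  # ['0', '1']

def restore_int_keys(mapping):
    """Turn keys that look like integers back into ints; leave other keys alone."""
    restored = {}
    for key, value in mapping.items():
        is_int_like = key.lstrip("-").isdigit()
        restored[int(key) if is_int_like else key] = value
    return restored

round_tripped["bin_edges"] = restore_int_keys(round_tripped["bin_edges"])
print(round_tripped == params)  # True
```

Note that this helper would also convert a column whose name is literally the string `"0"`, so it is only appropriate when the original keys are known to be integers. Pickle avoids the key problem, but pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.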


## 5. Advanced Features and Edge Cases

Let's explore some advanced features and how EqualWidthBinning handles various edge cases.
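
The two behaviours examined first in this section, missing values and out-of-range values, can be sketched with NumPy as follows. This is an illustration under assumptions rather than the package's code: `bin_with_missing` and `missing_code` are hypothetical names. The conventions it uses (a `-1` code for missing values, and clipping out-of-range values into the first or last bin) match the outputs recorded in this notebook, where `clip` is `True`.

```python
import numpy as np

def bin_with_missing(values, edges, missing_code=-1):
    """Bin against fixed edges: NaN -> missing_code, out-of-range -> end bins."""
    values = np.asarray(values, dtype=float)
    n_bins = len(edges) - 1
    # With the full edge list, digitize returns 0 below the range and
    # n_bins + 1 at or above the top edge; subtracting 1 gives bin indices.
    codes = np.digitize(values, edges) - 1
    # Clip values outside the fitted range into the first or last bin.
    codes = np.clip(codes, 0, n_bins - 1)
    # Mark missing inputs with a sentinel that cannot be a real bin index.
    codes[np.isnan(values)] = missing_code
    return codes

edges = np.linspace(0.0, 10.0, 6)  # five bins of width 2
values = [-5.0, 0.0, 3.0, 10.0, 25.0, np.nan]
print(bin_with_missing(values, edges).tolist())  # [0, 0, 1, 4, 4, -1]
```

Because the output is an integer array, it cannot hold NaN; downstream code has to check for the sentinel value instead of calling `np.isnan`.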

In [19]:
# 5.1 Handling Missing Values
print("=== Handling Missing Values ===")

# Create data with missing values
data_with_na = X_pandas.copy()
data_with_na.iloc[10:20, 0] = np.nan
data_with_na.iloc[50:55, 1] = np.nan

binning_na = EqualWidthBinning(n_bins=4)
result_with_na = binning_na.fit_transform(data_with_na)

print(f"Original data with NaN count per column:")
print(data_with_na.isnull().sum())
print(f"\nBinned data with NaN count per column:")
# Handle both DataFrame and array outputs
if isinstance(result_with_na, pd.DataFrame):
    print(result_with_na.isnull().sum())
    print(f"\nBinned data sample with NaN:")
    print(result_with_na.iloc[10:15])
else:
    # Integer NumPy output cannot hold NaN, so these counts are 0;
    # in this run missing inputs appear as -1 in the sample printed below
    nan_counts = [np.isnan(result_with_na[:, i]).sum() for i in range(result_with_na.shape[1])]
    for i, count in enumerate(nan_counts):
        col_name = data_with_na.columns[i] if i < len(data_with_na.columns) else f"col_{i}"
        print(f"{col_name}: {count}")
    print(f"\nBinned data sample with NaN:")
    print(result_with_na[10:15])

# 5.2 Out-of-bounds values during transform
print("\n=== Out-of-bounds Values ===")

# Fit on a subset
subset_data = X_pandas.iloc[200:800]  # Middle portion
binning_subset = EqualWidthBinning(n_bins=5)
binning_subset.fit(subset_data)

print(f"Training data range:")
for col in subset_data.columns:
    print(f"  {col}: [{subset_data[col].min():.2f}, {subset_data[col].max():.2f}]")

# Transform full data (includes out-of-bounds values)
full_result = binning_subset.transform(X_pandas)

print(f"\nFull data range:")
for col in X_pandas.columns:
    print(f"  {col}: [{X_pandas[col].min():.2f}, {X_pandas[col].max():.2f}]")

print(f"\nBinned full data range:")
if isinstance(full_result, pd.DataFrame):
    for col in full_result.columns:
        print(f"  {col}: [{full_result[col].min():.2f}, {full_result[col].max():.2f}]")
else:
    for i in range(full_result.shape[1]):
        col_name = X_pandas.columns[i] if i < len(X_pandas.columns) else f"col_{i}"
        print(f"  {col_name}: [{full_result[:, i].min():.2f}, {full_result[:, i].max():.2f}]")

# 5.3 Single column vs multi-column
print("\n=== Single Column Processing ===")

# Process just one column
single_col_binning = EqualWidthBinning(n_bins=6)
single_col_result = single_col_binning.fit_transform(X_pandas[['normal_feature']])

print(f"Single column binning:")
print(f"Bin edges: {list(single_col_binning._bin_edges.values())[0]}")
# Handle both DataFrame and array outputs
if isinstance(single_col_result, pd.DataFrame):
    unique_values = np.unique(single_col_result.iloc[:, 0])
else:
    unique_values = np.unique(single_col_result[:, 0])
print(f"Unique bin values: {unique_values}")

# 5.4 Reproducibility
print("\n=== Reproducibility ===")

# Create identical binning instances
binning1 = EqualWidthBinning(n_bins=5)
binning2 = EqualWidthBinning(n_bins=5)

# Fit on same data
binning1.fit(X_pandas)
binning2.fit(X_pandas)

# Check if bin edges are identical
edges_identical = all(
    np.allclose(edges1, edges2) 
    for edges1, edges2 in zip(binning1._bin_edges.values(), binning2._bin_edges.values())
)

print(f"Bin edges are identical across instances: {edges_identical}")

# Transform and check results
result1 = binning1.transform(X_pandas.iloc[:10])
result2 = binning2.transform(X_pandas.iloc[:10])

# Handle different output types for comparison
if isinstance(result1, pd.DataFrame) and isinstance(result2, pd.DataFrame):
    results_identical = np.allclose(result1.values, result2.values)
elif isinstance(result1, np.ndarray) and isinstance(result2, np.ndarray):
    results_identical = np.allclose(result1, result2)
else:
    # Mixed types - convert to arrays for comparison
    arr1 = result1.values if hasattr(result1, 'values') else result1
    arr2 = result2.values if hasattr(result2, 'values') else result2
    results_identical = np.allclose(arr1, arr2)

print(f"Transform results are identical: {results_identical}")

=== Handling Missing Values ===
Original data with NaN count per column:
normal_feature         10
exponential_feature     5
uniform_feature         0
bimodal_feature         0
dtype: int64

Binned data with NaN count per column:
normal_feature: 0
exponential_feature: 0
uniform_feature: 0
bimodal_feature: 0

Binned data sample with NaN:
[[-1  0  0  0]
 [-1  0  2  0]
 [-1  0  0  0]
 [-1  0  1  0]
 [-1  0  2  0]]

=== Out-of-bounds Values ===
Training data range:
  normal_feature: [1.38, 107.79]
  exponential_feature: [0.01, 12.81]
  uniform_feature: [0.00, 99.78]
  bimodal_feature: [5.04, 94.34]

Full data range:
  normal_feature: [1.38, 107.79]
  exponential_feature: [0.01, 14.88]
  uniform_feature: [0.00, 99.78]
  bimodal_feature: [5.04, 94.34]

Binned full data range:
  normal_feature: [0.00, 4.00]
  exponential_feature: [0.00, 4.00]
  uniform_feature: [0.00, 4.00]
  bimodal_feature: [0.00, 4.00]

=== Single Column Processing ===
Single column binning:
Bin edges: [np.float64(1.380989

## 6. Summary and Best Practices

This notebook demonstrated the comprehensive functionality of EqualWidthBinning:

### Key Features Demonstrated:
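1. **Multi-format Support**: Accepts NumPy arrays, pandas DataFrames, and Polars DataFrames as input; in the runs above (`preserve_dataframe=False`) the output is a NumPy array
2. **Sklearn Integration**: Works as a step in a scikit-learn `Pipeline` and follows the `fit`/`transform` interface
3. **Serialization**: `get_params()` on a fitted instance includes the bin edges, so the transformer can be recreated without refitting
4. **Edge Case Handling**: Missing values are encoded as `-1`, a constant column is binned without error, and with `clip=True` out-of-range values fall into the first or last bin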
5. **Reproducibility**: Consistent results across identical configurations

### Best Practices:
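- Fit on the training split only, then call `transform()` on both the training and test splits, to avoid data leakage
- Use `get_params()` to capture the fitted bin edges, and pass them to the constructor or `set_params()` to rebuild the transformer at deployment
- Put the binning step inside a `Pipeline` so that fitting and transforming stay consistent between training and prediction
- Decide explicitly how downstream steps should treat the `-1` code for missing values and constant features that carry no information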
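
The data-leakage point in the list above can be illustrated with plain NumPy. The numbers are made up for the example, and the edges are computed with `np.linspace` as a stand-in for the fitting step.

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([5.0, 90.0])  # contains values outside the training range

# Correct: edges come from the training split only.
edges_train_only = np.linspace(train.min(), train.max(), 4)

# Leaky: edges computed on train and test together see the test extremes.
combined = np.concatenate([train, test])
edges_leaky = np.linspace(combined.min(), combined.max(), 4)

print(edges_train_only.tolist())  # [10.0, 20.0, 30.0, 40.0]
print(edges_leaky.tolist())
```

The leaky edges span 5 to 90, so information about the test split has shaped the features the model is trained on. Fitting on the training split alone keeps the edges at 10 to 40 and leaves the out-of-range test values to be clipped at transform time.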

### Typical Workflow:
```python
# 1. Initialize
binning = EqualWidthBinning(n_bins=5)

# 2. Fit on training data
binning.fit(X_train)

# 3. Transform both train and test
X_train_binned = binning.transform(X_train)
X_test_binned = binning.transform(X_test)

# 4. Save parameters for later use
params = binning.get_params()

# 5. Recreate identical binning later
new_binning = EqualWidthBinning().set_params(**params)
X_new_binned = new_binning.transform(X_new)  # No fit required!
```