# 🎯 Manual and Domain-Driven Binning Methods: Expert Knowledge Integration

This notebook explores the manual and domain-driven binning methods in the `binlearn` package. These techniques let you build expert knowledge, business rules, and domain-specific requirements directly into your data preprocessing pipeline.

## 🔧 Overview of Manual Binning Approaches

Manual binning methods bridge the gap between automated algorithms and real-world requirements. They allow data scientists to encode domain expertise, regulatory constraints, and business logic directly into the binning process, ensuring that the resulting bins are not only statistically meaningful but also practically relevant.

## 🛠️ Methods Covered in This Notebook

### **ManualIntervalBinning** 📐
- **Principle**: Define explicit breakpoints based on domain knowledge or business rules
- **Strengths**: Full control over boundaries, incorporates expert knowledge, interpretable
- **Best for**: Industry standards, regulatory thresholds, established business segments
- **Examples**: Credit score ranges, age groups, income brackets
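
To make the principle concrete, here is a minimal sketch of breakpoint-based binning written with plain pandas rather than `binlearn`, so it assumes nothing about the package's API. The scores are invented for illustration and sit on or next to the category boundaries; the upper edge of 851 is an assumption made so that a score of 850 falls inside the last half-open bin.

```python
import pandas as pd

# Hypothetical scores, chosen to sit on and around the category boundaries
scores = pd.Series([300, 579, 580, 669, 670, 739, 740, 799, 800, 850])

# Explicit breakpoints; with right=False each bin is [lower, upper)
edges = [300, 580, 670, 740, 800, 851]
labels = ['Poor', 'Fair', 'Good', 'Very Good', 'Excellent']

categories = pd.cut(scores, bins=edges, labels=labels, right=False)
print(pd.DataFrame({'score': scores, 'category': categories}))
```

Because the edges are written out by hand, anyone reviewing the rule can check each boundary directly against the published standard.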

### **ManualFlexibleBinning** 🔧
- **Principle**: Create complex, customizable bin patterns with flexible rules and conditions
- **Strengths**: Maximum flexibility, handles irregular patterns, supports complex logic
- **Best for**: Multi-criteria binning, complex business rules, mixed data types
- **Examples**: Customer segmentation, risk categorization, performance tiers
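
One way to read "flexible" binning is a single bin list in which some entries match one exact value and others match a range. The sketch below implements that idea in plain Python. The `assign_flexible_bin` helper, the bin specification format, and the `-1` fallback are all hypothetical illustrations, not the `ManualFlexibleBinning` API.

```python
import math

# Hypothetical bin specification: a bare number is an exact-match bin,
# a (lower, upper) tuple is a half-open interval [lower, upper).
flexible_bins = [
    0,                 # bin 0: exactly zero (e.g. "no purchases")
    (0, 100),          # bin 1: small amounts
    (100, 1000),       # bin 2: medium amounts
    (1000, math.inf),  # bin 3: large amounts
]

def assign_flexible_bin(value, bins, missing=-1):
    """Return the index of the first bin that matches value, else `missing`."""
    for index, spec in enumerate(bins):
        if isinstance(spec, tuple):
            lower, upper = spec
            if lower <= value < upper:
                return index
        elif value == spec:
            return index
    return missing

amounts = [0, 0.5, 99.99, 100, 2500, -5]
print([assign_flexible_bin(a, flexible_bins) for a in amounts])
```

The first matching bin wins, so placing the exact-match bin for zero ahead of the interval that also contains zero is what keeps "no purchases" separate from small amounts.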

### **SingletonBinning** 🔹
- **Principle**: Preserve unique values as individual bins, maintaining categorical integrity
- **Strengths**: No information loss, perfect for discrete data, maintains exact values
- **Best for**: Categorical data, unique identifiers, discrete ordinal scales
- **Examples**: Product categories, geographic regions, discrete ratings
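
The "one bin per unique value" idea can be sketched with NumPy alone. This is an illustration of the principle under the assumption that bins are numbered in sorted order of the distinct values; it does not show how `SingletonBinning` itself numbers or stores its bins.

```python
import numpy as np

ratings = np.array([5, 3, 4, 3, 1, 5, 5])

# Each distinct value becomes its own bin; `codes` holds the bin index per sample
unique_values, codes = np.unique(ratings, return_inverse=True)

print(unique_values)         # distinct values, sorted ascending
print(codes)                 # bin index for each rating
print(unique_values[codes])  # indexing back recovers the original values
```

Since the mapping is one-to-one, the original values can always be recovered from the bin codes, which is the "no information loss" property listed above.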

## 🎯 Key Advantages of Manual Binning

✅ **Domain Expertise Integration**: Leverage years of industry knowledge and experience  
✅ **Business Alignment**: Ensure bins match existing business processes and terminology  
✅ **Regulatory Compliance**: Meet specific legal, regulatory, or compliance requirements  
✅ **Stakeholder Buy-in**: Create interpretable bins that resonate with business users  
✅ **Consistency**: Maintain consistent binning across different projects and teams  
✅ **Flexibility**: Handle complex, irregular, or multi-criteria binning scenarios  

## 🎯 Strategic Applications

### **Industry Standards & Regulations** 📋
- Credit scoring (FICO ranges: 300-579, 580-669, 670-739, 740-799, 800-850)
- Healthcare risk categories (Low, Moderate, High, Critical)
- Financial risk assessment (Basel III compliance tiers)
- Environmental impact levels (EPA standards)

### **Business Process Integration** 🏢
- Customer lifecycle stages (Prospect, New, Active, Loyal, At-Risk, Churned)
- Sales performance tiers (Bronze, Silver, Gold, Platinum)
- Product pricing categories (Economy, Standard, Premium, Luxury)
- Geographic market segments (Local, Regional, National, International)

### **Data Quality & Governance** 🛡️
- Maintaining categorical integrity
- Standardizing across different data sources
- Ensuring reproducible and auditable binning
- Supporting data lineage and documentation requirements
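
Reproducible and auditable binning usually comes down to storing the rule itself as data. The sketch below serializes a binning specification to JSON and fingerprints it with SHA-256, using only the standard library. The field names (`feature`, `breakpoints`, `labels`, `source`) are hypothetical and not a `binlearn` format.

```python
import hashlib
import json

# Hypothetical binning specification; field names are illustrative
bin_spec = {
    "feature": "credit_score",
    "breakpoints": [300, 580, 670, 740, 800, 850],
    "labels": ["Poor", "Fair", "Good", "Very Good", "Excellent"],
    "source": "FICO score ranges",
}

# sort_keys gives a canonical serialization, so equal specs hash identically
canonical = json.dumps(bin_spec, sort_keys=True)
fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Round-trip check: the stored text reproduces the specification exactly
restored = json.loads(canonical)
assert restored == bin_spec

print(f"spec fingerprint: {fingerprint[:12]}")
```

Recording the fingerprint next to a model or report makes it possible to confirm later which version of the rule produced a given result.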

## 🔄 When to Choose Manual Binning Methods

### **Ideal Scenarios** ⭐
- Strong domain expertise is available
- Regulatory or compliance requirements exist
- Business stakeholder alignment is critical
- Existing binning standards must be maintained
- Interpretability is paramount
- Complex, multi-criteria binning is needed

### **Consider Alternatives When** ⚠️
- Exploring new domains without established knowledge
- Automated pattern discovery is preferred
- Large-scale feature engineering with many variables
- Rapid prototyping and experimentation phases
- Domain knowledge is limited or uncertain

## 📚 What You'll Learn

1. **Expert Knowledge Integration**: How to encode domain expertise into binning rules
2. **Business Rule Implementation**: Translating business logic into technical specifications
3. **Regulatory Compliance**: Meeting specific industry and legal requirements
4. **Interpretability Optimization**: Creating bins that stakeholders understand and trust
5. **Flexibility Showcase**: Handling complex, irregular, and multi-criteria scenarios
6. **Quality Assurance**: Validation and testing of manual binning rules
7. **Best Practices**: Guidelines for maintainable and scalable manual binning
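
As a small preview of the quality-assurance item, hand-written rules can be checked before they are used. The sketch below is a hypothetical `validate_bin_spec` helper (not part of `binlearn`) that tests three things: breakpoints are strictly increasing, there is one label per bin, and the data falls within the outer breakpoints.

```python
import numpy as np

def validate_bin_spec(breakpoints, labels, data=None):
    """Return a list of problems found in a manual binning rule (empty if none)."""
    problems = []
    edges = np.asarray(breakpoints, dtype=float)
    if edges.size < 2:
        problems.append("need at least two breakpoints")
    if np.any(np.diff(edges) <= 0):
        problems.append("breakpoints must be strictly increasing")
    if len(labels) != edges.size - 1:
        problems.append(f"expected {edges.size - 1} labels, got {len(labels)}")
    if data is not None and edges.size >= 2:
        values = np.asarray(data, dtype=float)
        uncovered = int(np.sum((values < edges[0]) | (values > edges[-1])))
        if uncovered:
            problems.append(f"{uncovered} value(s) fall outside the breakpoints")
    return problems

# A well-formed rule produces no problems
print(validate_bin_spec([18, 25, 35, 50, 65, 85], ['A', 'B', 'C', 'D', 'E']))
# A repeated edge, a missing label, and an out-of-range value are all reported
print(validate_bin_spec([0, 100, 100, 400], ['Low', 'High'], data=[50, 450]))
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything wrong with a rule in a single pass.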

Let's explore how to effectively leverage domain knowledge in your binning strategy!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import time
import warnings
warnings.filterwarnings('ignore')

# Import manual binning methods from binlearn
from binlearn.methods import ManualIntervalBinning, SingletonBinning

# Set plotting style
plt.style.use('default')
sns.set_palette("Set2")
np.random.seed(42)

print("📝 Manual and Domain-Driven Binning Methods")
print("=" * 50)
print("✅ All libraries imported successfully!")
print(f"📊 NumPy version: {np.__version__}")
print(f"🐼 Pandas version: {pd.__version__}")
print("🎯 Methods: ManualIntervalBinning, SingletonBinning")
print("💼 Focus: Leveraging domain expertise for intelligent binning!")
print("🏢 Use cases: Credit scoring, demographics, regulatory compliance")

## 1. Real-World Scenarios

Let's create realistic datasets that benefit from domain knowledge and manual binning approaches.

In [None]:
# Create realistic datasets for manual binning demonstrations
print("🏢 Creating Real-World Datasets")
print("=" * 40)

# Dataset 1: Credit Scores (Standard industry breakpoints)
np.random.seed(42)
credit_scores = np.random.beta(2, 2, 1000) * 550 + 300  # Scale Beta(2, 2) on [0, 1] to the 300-850 range
credit_scores = np.clip(credit_scores, 300, 850)  # Realistic FICO range

# Dataset 2: Customer Ages (Business-relevant age groups)
# Sizes sum to 1000 so the array matches the other DataFrame columns in length
age_groups = np.concatenate([
    np.random.normal(25, 5, 300),  # Young adults
    np.random.normal(40, 8, 400),  # Middle-aged
    np.random.normal(60, 10, 300)  # Seniors
])
age_groups = np.clip(age_groups, 18, 85)

# Dataset 3: Income levels (Tax bracket-based)
income_levels = np.concatenate([
    np.random.lognormal(10.5, 0.5, 400),   # Lower income
    np.random.lognormal(11.2, 0.3, 300),   # Middle income  
    np.random.lognormal(12.0, 0.4, 200),   # Higher income
    np.random.lognormal(13.0, 0.6, 100)    # High income
])

# Dataset 4: Product ratings (discrete values)
product_ratings = np.random.choice([1, 2, 3, 4, 5], size=800, 
                                 p=[0.05, 0.10, 0.25, 0.35, 0.25])

# Dataset 5: Medical test results (clinical thresholds)
glucose_levels = np.random.gamma(2, 50, 600)  # mg/dL
glucose_levels = np.clip(glucose_levels, 70, 400)

# Create comprehensive DataFrame
df_manual = pd.DataFrame({
    'credit_score': credit_scores,
    'age': age_groups,
    'income': income_levels,
    'rating': np.random.choice(product_ratings, len(credit_scores)),
    'glucose': np.random.choice(glucose_levels, len(credit_scores))
})

# Visualize the raw data
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

# Plot each feature
features = ['credit_score', 'age', 'income', 'rating', 'glucose']
titles = [
    'Credit Scores\n(Industry Standard Ranges)',
    'Customer Ages\n(Life Stage Groups)',
    'Income Levels\n(Tax Bracket Based)',
    'Product Ratings\n(Discrete Values)',
    'Glucose Levels\n(Clinical Thresholds)'
]

colors = ['lightcoral', 'skyblue', 'lightgreen', 'orange', 'plum']

for i, (feature, title, color) in enumerate(zip(features, titles, colors)):
    if feature == 'rating':
        # Bar plot for discrete ratings
        rating_counts = df_manual[feature].value_counts().sort_index()
        axes[i].bar(rating_counts.index, rating_counts.values, color=color, alpha=0.7, edgecolor='black')
        axes[i].set_xticks(rating_counts.index)
    else:
        # Histogram for continuous features
        axes[i].hist(df_manual[feature], bins=30, alpha=0.7, color=color, edgecolor='black')
    
    axes[i].set_title(title, fontweight='bold')
    axes[i].set_xlabel(feature.replace('_', ' ').title())
    axes[i].set_ylabel('Frequency')
    axes[i].grid(True, alpha=0.3)

# Remove empty subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

# Display summary statistics
print("\n📊 Dataset Summary Statistics:")
print("=" * 50)
print(df_manual.describe().round(2))

print(f"\n✅ Generated realistic datasets:")
print(f"   • Credit scores: {len(credit_scores)} samples (300-850 range)")
print(f"   • Ages: {len(age_groups)} samples (18-85 range)")
print(f"   • Income: {len(income_levels)} samples (log-normal distribution)")
print(f"   • Ratings: {len(product_ratings)} samples (1-5 discrete)")
print(f"   • Glucose: {len(glucose_levels)} samples (70-400 mg/dL)")

## 2. Manual Interval Binning - Industry Standards

This section applies established industry breakpoints and regulatory thresholds to produce meaningful business categories.
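
When breakpoints come from a published standard, it matters which side of each edge is inclusive: under the FICO ranges a score of 579 is Poor while 580 is Fair. The sketch below shows the two conventions with `np.digitize`. It is a NumPy illustration only; which convention `ManualIntervalBinning` applies is not stated in this notebook and should be checked against the `binlearn` documentation.

```python
import numpy as np

inner_edges = [580, 670, 740, 800]   # FICO category boundaries
scores = np.array([579, 580, 669, 670, 800])

# right=False: bins are [lower, upper), so a score equal to an edge moves up
left_closed = np.digitize(scores, inner_edges, right=False)
# right=True: bins are (lower, upper], so a score equal to an edge stays down
right_closed = np.digitize(scores, inner_edges, right=True)

print(left_closed)
print(right_closed)
```

Only the left-closed result matches the FICO definition here, since it places 580 in the second bin and 800 in the last one.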

In [None]:
print("📋 Manual Interval Binning with Industry Standards")
print("=" * 55)

# Define industry-standard breakpoints
domain_breakpoints = {
    'credit_score': {
        'breakpoints': [300, 580, 670, 740, 800, 850],
        'labels': ['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'],
        'description': 'FICO Credit Score Categories'
    },
    'age': {
        'breakpoints': [18, 25, 35, 50, 65, 85],
        'labels': ['Gen Z', 'Millennials', 'Gen X', 'Boomers', 'Silent'],
        'description': 'Generational Cohorts'
    },
    'income': {
        'breakpoints': [0, 30000, 60000, 100000, 200000, np.inf],
        'labels': ['Low', 'Lower-Mid', 'Middle', 'Upper-Mid', 'High'],
        'description': 'Income Brackets (USD)'
    },
    'glucose': {
        'breakpoints': [0, 100, 125, 200, 400],
        'labels': ['Normal', 'Prediabetic', 'Diabetic', 'Severe'],
        'description': 'Clinical Glucose Levels (mg/dL)'
    }
}

# Apply manual interval binning
manual_results = {}

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for i, (feature, config) in enumerate(domain_breakpoints.items()):
    print(f"\n🎯 Processing {feature} - {config['description']}")
    
    # Create manual interval binner
    binner = ManualIntervalBinning(
        intervals={feature: config['breakpoints']}
    )
    
    # Fit and transform
    feature_data = df_manual[[feature]]
    binner.fit(feature_data)
    binned_data = binner.transform(feature_data)
    
    # Store results
    manual_results[feature] = {
        'binner': binner,
        'original': feature_data,
        'binned': binned_data,
        'breakpoints': config['breakpoints'],
        'labels': config['labels']
    }
    
    # Create visualization
    ax = axes[i]
    
    # Plot original distribution
    original_vals = feature_data[feature].values
    ax.hist(original_vals, bins=30, alpha=0.6, color='lightblue', 
           label='Original Distribution', density=True)
    
    # Add breakpoint lines (skip the open-ended 0 and inf edges)
    for edge in config['breakpoints']:
        if edge != np.inf and edge != 0:
            ax.axvline(edge, color='red', linestyle='--', alpha=0.8, linewidth=2)
    
    # Label each bin at its midpoint; bin j spans breakpoints[j] to breakpoints[j + 1]
    y_label = ax.get_ylim()[1] * 0.8
    for j, label in enumerate(config['labels']):
        # Clip the bin to the observed data range so open-ended bins get a finite midpoint
        lower = max(config['breakpoints'][j], original_vals.min())
        upper = min(config['breakpoints'][j + 1], original_vals.max())
        if lower >= upper:
            continue  # Bin lies entirely outside the observed data
        ax.text((lower + upper) / 2, y_label, label,
               rotation=90, ha='center', va='bottom', fontweight='bold',
               bbox=dict(boxstyle="round,pad=0.3", facecolor='yellow', alpha=0.7))
    
    ax.set_title(f"{config['description']}\n{feature.replace('_', ' ').title()}", fontweight='bold')
    ax.set_xlabel(feature.replace('_', ' ').title())
    ax.set_ylabel('Density')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
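    # Print bin statistics (flatten so this works for DataFrame or array output)
    bin_codes = pd.Series(np.asarray(binned_data).ravel())
    bin_counts = bin_codes.value_counts().sort_index()
    print(f"   📊 Bin distribution:")
    for bin_val, count in bin_counts.items():
        percentage = (count / len(bin_codes)) * 100
        # Negative codes would wrap around to the last label, and codes past
        # the end were silently skipped, so report both explicitly
        if 0 <= bin_val < len(config['labels']):
            label = config['labels'][int(bin_val)]
        else:
            label = f"Unlabeled bin {bin_val}"
        print(f"      {label}: {count} samples ({percentage:.1f}%)")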

plt.tight_layout()
plt.show()

print(f"\n✅ Applied manual interval binning to {len(domain_breakpoints)} features")
print("📋 All breakpoints based on established industry standards")
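
The loop above turns integer bin codes into names by indexing into `config['labels']`. A pandas-only alternative is sketched below with `pd.Categorical.from_codes`. The bin codes are invented for illustration, and using `-1` for "no bin assigned" is an assumption of this sketch, not a statement about what `binlearn` returns.

```python
import numpy as np
import pandas as pd

labels = np.array(['Normal', 'Prediabetic', 'Diabetic', 'Severe'])
# Hypothetical integer bin codes; -1 stands in for "no bin assigned"
bin_codes = np.array([0, 2, 1, -1, 3, 0])

# from_codes treats -1 as missing, unlike labels[bin_codes], which would wrap around
labelled = pd.Categorical.from_codes(bin_codes, categories=labels, ordered=True)
print(pd.Series(labelled).value_counts(sort=False, dropna=False))
```

Keeping the result as an ordered categorical preserves the clinical ordering of the bins, so sorting and plotting follow Normal through Severe instead of alphabetical order.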