# MadMatcher: Comprehensive Usage Examples

This notebook walks through every public API function and abstract class in the MadMatcher toolkit, end to end. Each section pairs a short explanation with code examples that exercise different input types and configurations and display their output.

## What is MadMatcher?

MadMatcher is a toolkit for entity matching and record linkage that provides:
- Multiple tokenization strategies for text preprocessing
- Various similarity functions for comparing records
- Feature engineering capabilities for machine learning
- Active learning support for efficient labeling
- Extensible abstract classes for custom implementations

## Table of Contents
1. [Setup](#Setup)
2. [Public API Overview](#Public-API-Overview)  
3. [Tokenizers and Similarity Functions](#Tokenizers-and-Similarity-Functions)
4. [Feature Creation](#Feature-Creation)
5. [Featurization](#Featurization)
6. [Down Sampling](#Down-Sampling)
7. [Seed Creation](#Seed-Creation)
8. [Training a Matcher](#Training-a-Matcher)
9. [Applying a Matcher](#Applying-a-Matcher)
10. [Active Learning Labeling](#Active-Learning-Labeling)
11. [Custom Abstract Classes](#Custom-Abstract-Classes)
12. [Advanced Examples](#Advanced-Examples)

## Setup

Install MadMatcher and its dependencies:

```bash
pip install madmatcher_tools
```

Import all public API functions and classes:

In [32]:
from madmatcher_tools import (
    # Core functions
    get_base_tokenizers, get_extra_tokenizers, get_base_sim_functions,
    create_features, featurize, down_sample,
    create_seeds, train_matcher, apply_matcher, label_data,
    # Abstract base classes for customization
    Tokenizer, Vectorizer, Feature, MLModel, Labeler, CustomLabeler
)
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

# Initialize Spark
spark = SparkSession.builder.appName('MadMatcherComprehensiveDemo').getOrCreate()

## Public API Overview

MadMatcher exposes the following public API functions and abstract classes:

### Core Functions
- **`get_base_tokenizers()`**: Returns default tokenizers for text processing
- **`get_extra_tokenizers()`**: Returns additional specialized tokenizers  
- **`get_base_sim_functions()`**: Returns default similarity functions
- **`create_features(A, B, a_cols, b_cols, ...)`**: Generate feature objects for comparing records
- **`featurize(features, A, B, candidates, ...)`**: Apply features to candidate pairs
- **`down_sample(fvs, percent, search_id_column, ...)`**: Reduce dataset size by sampling
- **`create_seeds(fvs, nseeds, labeler, ...)`**: Generate initial labeled examples
- **`train_matcher(model_spec, labeled_data, ...)`**: Train a matching model
- **`apply_matcher(model, df, feature_col, output_col)`**: Apply trained model for predictions
- **`label_data(model_spec, mode, labeler_spec, fvs, ...)`**: Active learning for labeling

### Abstract Base Classes (for customization)
- **`Tokenizer`**: Base class for text tokenization strategies
- **`Vectorizer`**: Base class for converting tokens to vectors  
- **`Feature`**: Base class for similarity/distance features
- **`MLModel`**: Base class for machine learning models
- **`Labeler`**: Base class for labeling strategies
- **`CustomLabeler`**: Extended labeler with access to full record data

Each section below demonstrates these with different input types and configurations.


## Tokenizers and Similarity Functions

MadMatcher provides various tokenizers for text preprocessing and similarity functions for comparing records. Understanding these building blocks is essential for effective feature engineering.
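
To make these building blocks concrete, the sketch below implements two common tokenization strategies and one set-based similarity in plain Python. The helper names (`whitespace_tokens`, `qgram_tokens`, `jaccard`) are illustrative assumptions and are not MadMatcher's own implementations, which may normalize, pad, or weight tokens differently.

```python
def whitespace_tokens(text):
    """Lowercase the text and split on whitespace; return a set of tokens."""
    return set(text.lower().split())


def qgram_tokens(text, q=3):
    """Return the set of overlapping character q-grams (no padding)."""
    text = text.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


word_sim = jaccard(whitespace_tokens("Alice Smith"), whitespace_tokens("Alicia Smith"))
gram_sim = jaccard(qgram_tokens("Alice Smith"), qgram_tokens("Alicia Smith"))
print(f"word-level Jaccard:   {word_sim:.3f}")
print(f"3-gram-level Jaccard: {gram_sim:.3f}")
```

On this pair the q-gram score comes out higher than the word-level score, because "Alice" and "Alicia" share character trigrams even though they are different words. That is the usual reason to combine several tokenizers when names contain spelling variants.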

In [None]:
# Get available tokenizers and similarity functions
base_tokenizers = get_base_tokenizers()
extra_tokenizers = get_extra_tokenizers()
base_sim_functions = get_base_sim_functions()

print('BASE TOKENIZERS:')
for i, tokenizer in enumerate(base_tokenizers):
    print(f'  {i+1}. {tokenizer.__class__.__name__}: {tokenizer.NAME}')

print('EXTRA TOKENIZERS:')
for i, tokenizer in enumerate(extra_tokenizers):
    print(f'  {i+1}. {tokenizer.__class__.__name__}: {tokenizer.NAME}')

print('BASE SIMILARITY FUNCTIONS:')
for i, sim_func in enumerate(base_sim_functions):
    print(f'  {i+1}. {sim_func.__name__}')

### Testing Tokenizers with Different Input Types

Let's see how different tokenizers handle various types of input data:


In [None]:
# Test different input types with tokenizers
test_inputs = [
    'Alice Smith',           # Normal name
    'Bob Jones Jr.',         # Name with suffix
    'Jean-Pierre O\'Connor', # Name with special characters
    'Company123 Inc.',       # Mixed alphanumeric
    '123-456-7890',          # Phone number
    '',                      # Empty string
    None,                    # None value
    42                       # Numeric value
]

print("TOKENIZER TESTING ON DIFFERENT INPUT TYPES\n")

# Test different tokenizers
tokenizers_to_test = [
    base_tokenizers[0],  # StrippedWhiteSpaceTokenizer
    base_tokenizers[1],  # NumericTokenizer  
    base_tokenizers[2],  # QGramTokenizer
    extra_tokenizers[0]  # AlphaNumericTokenizer
]

for tokenizer in tokenizers_to_test:
    print(f"{tokenizer.__class__.__name__} ({tokenizer.NAME}):")
    for input_val in test_inputs:
        try:
            if input_val is None:
                tokens = tokenizer.tokenize(input_val)
            else:
                tokens = tokenizer.tokenize(str(input_val))
            print(f"  Input: {repr(input_val):<20} → Tokens: {tokens}")
        except Exception as e:
            print(f"  Input: {repr(input_val):<20} → Error: {type(e).__name__}")
    print()


## Feature Creation

Feature creation is the core of MadMatcher's functionality. The `create_features()` function automatically generates appropriate features based on your data characteristics.

In [None]:
# Create more comprehensive toy DataFrames for testing
A = pd.DataFrame({
    '_id': [1, 2, 3, 4],
    'name': ['Alice Smith', 'Bob Jones', 'Carol Davis', 'David Wilson'],
    'age': [25, 30, 28, None],  # Include missing values
    'email': ['alice@email.com', 'bob@email.com', None, 'david@email.com'],
    'phone': ['123-456-7890', '987-654-3210', '555-123-4567', ''],
    'address': ['123 Main St', '456 Oak Ave', '789 Pine Rd', '321 Elm St']
})

B = pd.DataFrame({
    '_id': [101, 102, 103, 104], 
    'name': ['Alicia Smith', 'Robert Jones', 'Caroline Davis', 'Dave Wilson'],
    'age': [26, 29, 28, 35],
    'email': ['alicia@email.com', 'robert@gmail.com', 'carol@email.com', None],
    'phone': ['123-456-7891', '987-654-3211', '', '555-999-8888'],
    'address': ['124 Main St', '457 Oak Ave', '790 Pine Rd', '322 Elm St']
})

print("SAMPLE DATASETS:")
print("\nTable A:")
print(A.to_string(index=False))
print("\nTable B:")
print(B.to_string(index=False))

### Different Feature Creation Configurations

The `create_features()` function can be customized in several ways:


In [None]:
print("FEATURE CREATION CONFIGURATIONS:\n")

# 1. Default configuration (automatic feature generation)
features_default = create_features(A, B, a_cols=['name', 'age'], b_cols=['name', 'age'])
print(f"DEFAULT FEATURES ({len(features_default)} features):")
for i, feature in enumerate(features_default):
    print(f"   {i+1}. {feature}")
print()

# 2. Only numeric columns (for RelDiff features)
features_numeric = create_features(A, B, a_cols=['age'], b_cols=['age'])
print(f"NUMERIC-ONLY FEATURES ({len(features_numeric)} features):")
for i, feature in enumerate(features_numeric):
    print(f"   {i+1}. {feature}")
print()

# 3. Custom tokenizers and similarity functions
from madmatcher_tools._internal.tokenizer.tokenizer import AlphaNumericTokenizer
from madmatcher_tools._internal.feature.token_feature import JaccardFeature, CosineFeature

custom_tokenizers = [AlphaNumericTokenizer()]
custom_sim_functions = [JaccardFeature, CosineFeature]

features_custom = create_features(
    A, B, 
    a_cols=['name'], b_cols=['name'],
    tokenizers=custom_tokenizers,
    sim_functions=custom_sim_functions
)
print(f"CUSTOM TOKENIZERS & SIM FUNCTIONS ({len(features_custom)} features):")
for i, feature in enumerate(features_custom):
    print(f"   {i+1}. {feature}")
print()

# 4. Strict null threshold (drops columns with too many missing values)
features_filtered = create_features(
    A, B, 
    a_cols=['name', 'age', 'email'], b_cols=['name', 'age', 'email'],
    null_threshold=0.3  # Only use columns with <30% null values
)
print(f"NULL-FILTERED FEATURES ({len(features_filtered)} features):")
for i, feature in enumerate(features_filtered):
    print(f"   {i+1}. {feature}")
print()
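
The effect of `null_threshold` can be approximated in a few lines of plain Python. The sketch below assumes that a column is dropped when its fraction of missing values exceeds the threshold. The helper `columns_within_null_threshold`, the toy table, and the inclusive comparison are assumptions for illustration; MadMatcher's exact rule may differ.

```python
def null_fraction(values):
    """Fraction of entries that are None (treated as missing in this sketch)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def columns_within_null_threshold(table, threshold):
    """Keep columns whose missing-value fraction does not exceed `threshold`.

    Assumption: the comparison is inclusive; the real library may use a strict one.
    """
    return [col for col, values in table.items() if null_fraction(values) <= threshold]


# Hypothetical column data: 0%, 25% and 50% missing values respectively
table_a = {
    "name": ["Alice Smith", "Bob Jones", "Carol Davis", "David Wilson"],
    "age": [25, 30, 28, None],
    "email": ["alice@email.com", None, None, "david@email.com"],
}

kept_strict = columns_within_null_threshold(table_a, threshold=0.3)
kept_loose = columns_within_null_threshold(table_a, threshold=0.7)
print("threshold=0.3 keeps:", kept_strict)
print("threshold=0.7 keeps:", kept_loose)
```

A lower threshold is the stricter setting: it removes sparsely populated columns before any features are generated for them.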


## Featurization

Featurization applies the generated features to candidate record pairs, producing feature vectors for machine learning.
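
As a rough mental model, featurization expands each candidate row into individual (id1, id2) pairs and evaluates every feature on each pair. The plain-Python sketch below illustrates that idea with two hand-written features. `featurize_sketch`, `word_jaccard`, and `relative_difference` are hypothetical helpers, and the relative-difference formula is one common definition, not necessarily the one MadMatcher uses.

```python
import math


def word_jaccard(s, t):
    """Jaccard similarity over lowercase word sets; NaN if either value is missing."""
    if s is None or t is None:
        return math.nan
    a, b = set(s.lower().split()), set(t.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relative_difference(x, y):
    """|x - y| / max(|x|, |y|); NaN if a value is missing, 0.0 if both are zero."""
    if x is None or y is None:
        return math.nan
    denom = max(abs(x), abs(y))
    return abs(x - y) / denom if denom else 0.0


def featurize_sketch(candidates, table_a, table_b):
    """Expand id1_list into one row per pair and attach a feature vector."""
    rows = []
    for cand in candidates:
        rec_b = table_b[cand["id2"]]
        for id1 in cand["id1_list"]:
            rec_a = table_a[id1]
            features = [
                word_jaccard(rec_a["name"], rec_b["name"]),
                relative_difference(rec_a["age"], rec_b["age"]),
            ]
            rows.append({"id1": id1, "id2": cand["id2"], "features": features})
    return rows


table_a = {1: {"name": "Alice Smith", "age": 25}, 2: {"name": "Bob Jones", "age": 30}}
table_b = {101: {"name": "Alicia Smith", "age": 26}}
candidates = [{"id1_list": [1, 2], "id2": 101}]

pairs = featurize_sketch(candidates, table_a, table_b)
for row in pairs:
    print(row)
```

Missing values propagate as NaN in this sketch, which is why a fill step is usually needed before training.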


In [None]:
# Create different types of candidate pairs
print("FEATURIZATION EXAMPLES:\n")

# 1. Basic candidate pairs (one-to-one mapping)
candidates_basic = pd.DataFrame({
    'id1_list': [[1], [2], [3], [4]],  # Each record from A
    'id2': [101, 102, 103, 104]        # Paired with record from B
})

# 2. One-to-many candidates (blocking results)
candidates_blocked = pd.DataFrame({
    'id1_list': [[1, 2], [3], [4]],    # Multiple A records can match one B record
    'id2': [101, 103, 104]
})

# 3. Custom candidates with additional metadata
candidates_custom = pd.DataFrame({
    'id1_list': [[1], [2], [3]],
    'id2': [101, 102, 103],
    'block_id': ['name_block_1', 'name_block_2', 'name_block_1'],  # Additional info
    'similarity_score': [0.8, 0.6, 0.9]  # Pre-computed scores
})

print("CANDIDATE PAIR EXAMPLES:")
print("\n1. Basic pairs:")
print(candidates_basic)
print("\n2. Blocked pairs (one-to-many):")  
print(candidates_blocked)
print("\n3. Custom pairs with metadata:")
print(candidates_custom)


In [None]:
# Apply featurization with different configurations
print("\nFEATURIZATION RESULTS:\n")

# Use the default features for comprehensive comparison
features = features_default

# 1. Basic featurization
fvs_basic = featurize(features, A, B, candidates_basic)
print(f"BASIC FEATURIZATION:")
print(f" Shape: {fvs_basic.shape}")
print(f" Columns: {list(fvs_basic.columns)}")
print(f" Feature vector length: {len(fvs_basic['features'].iloc[0])}")
print(f" Sample feature vector: {fvs_basic['features'].iloc[0][:5]}... (showing first 5)")
print()

# 2. Featurization with custom output column name
fvs_custom_col = featurize(features, A, B, candidates_basic, output_col='similarity_features')
print(f"CUSTOM OUTPUT COLUMN:")
print(f" Columns: {list(fvs_custom_col.columns)}")
print()

# 3. Featurization with fill_na parameter
fvs_filled = featurize(features, A, B, candidates_basic, fill_na=0.0)
print(f"WITH NaN FILLING:")
print(f" NaN count in features: {sum(np.isnan(x).sum() for x in fvs_basic['features'])}")
print(f" NaN count after filling: {sum(np.isnan(x).sum() for x in fvs_filled['features'])}")
print()

# Display sample results
print("SAMPLE FEATURIZATION OUTPUT:")
print(fvs_basic[['id2', 'id1', '_id']].head())
print("\nFeature vectors (first 3 features for first 3 pairs):")
for i in range(min(3, len(fvs_basic))):
    print(f" Pair {i+1}: {fvs_basic['features'].iloc[i][:3]}")


## Down Sampling

Down sampling reduces the size of your feature vector dataset while preserving the most promising candidate pairs based on a scoring function.

In [None]:
# First, let's work with our basic feature vectors and add scores
fvs = fvs_basic.copy()

# Add various types of scores for demonstration
np.random.seed(42)  # For reproducible results
fvs['score'] = np.random.beta(2, 5, len(fvs))  # Simulated similarity scores
fvs['random_score'] = np.random.uniform(0, 1, len(fvs))

print("DOWN SAMPLING EXAMPLES:\n")
print(f"Original dataset size: {len(fvs)} pairs")
print("Original scores:", fvs['score'].round(3).tolist())
print()

# 1. Basic down sampling (50%)
down_50 = down_sample(fvs, percent=0.5, search_id_column='id2')
print(f"50% DOWN SAMPLING:")
print(f" Result size: {len(down_50)} pairs")
print(f" Retained scores: {down_50['score'].round(3).tolist()}")
print()

# 2. Aggressive down sampling (25%)  
down_25 = down_sample(fvs, percent=0.25, search_id_column='id2')
print(f"25% DOWN SAMPLING:")
print(f" Result size: {len(down_25)} pairs")
print(f" Retained scores: {down_25['score'].round(3).tolist()}")
print()

# 3. Custom score column
down_custom = down_sample(fvs, percent=0.5, search_id_column='id2', score_column='random_score')
print(f"CUSTOM SCORE COLUMN:")
print(f" Result size: {len(down_custom)} pairs")
print(f" Retained random_scores: {down_custom['random_score'].round(3).tolist()}")
print()

# 4. Custom bucket size for large datasets
down_buckets = down_sample(fvs, percent=0.75, search_id_column='id2', bucket_size=2)
print(f"CUSTOM BUCKET SIZE:")
print(f" Result size: {len(down_buckets)} pairs")
print(f" Bucket size parameter: 2 (for demonstration)")
print()

print("Note: Down sampling preserves the highest-scoring pairs within each hash bucket.")
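
The note above can be illustrated with a small plain-Python sketch: rows are assigned to buckets by hashing an id, and the top-scoring fraction of each bucket is kept. `down_sample_sketch` and its bucketing rule are assumptions made for illustration; the actual partitioning used by `down_sample` may be different.

```python
import math
from collections import defaultdict


def down_sample_sketch(rows, percent, id_key="id2", bucket_size=2):
    """Keep roughly `percent` of the rows in each hash bucket, preferring high scores."""
    n_buckets = max(1, math.ceil(len(rows) / bucket_size))
    buckets = defaultdict(list)
    for row in rows:
        # hash() is stable for integers; string ids would need a stable hash instead
        buckets[hash(row[id_key]) % n_buckets].append(row)

    kept = []
    for bucket_rows in buckets.values():
        bucket_rows.sort(key=lambda r: r["score"], reverse=True)
        n_keep = math.ceil(len(bucket_rows) * percent)
        kept.extend(bucket_rows[:n_keep])
    return kept


# Hypothetical scored pairs
rows = [
    {"id1": 1, "id2": 101, "score": 0.9},
    {"id1": 2, "id2": 102, "score": 0.2},
    {"id1": 3, "id2": 103, "score": 0.4},
    {"id1": 4, "id2": 104, "score": 0.7},
]

sampled = down_sample_sketch(rows, percent=0.5)
print(sorted(r["id2"] for r in sampled))
```

Sampling per bucket rather than globally keeps the selection spread across the id space, so a single dense region of high scores cannot crowd out everything else.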

## Seed Creation

Seeds are initial labeled examples used to train machine learning models. MadMatcher supports various labeling strategies for creating these seeds.

In [None]:
print("SEED CREATION EXAMPLES:\n")

# Create different types of labelers for demonstration

# 1. Gold standard labeler (ground truth)
gold_labels = pd.DataFrame({
    'id1': [1, 3, 4],        # Records from table A
    'id2': [101, 103, 104]   # Matching records from table B  
})
print("Ground truth matches:")
print(gold_labels)
print()

gold_labeler = {'name': 'gold', 'gold': gold_labels}

# 2. Create a custom always-positive labeler
class AlwaysPositiveLabeler(Labeler):
    def __call__(self, id1, id2):
        return 1.0  # Always return positive match

# 3. Create a custom rule-based labeler  
class RuleBasedLabeler(Labeler):
    def __call__(self, id1, id2):
        # Simple rule: match if id2 - id1 == 100
        return 1.0 if (id2 - id1) == 100 else 0.0

print("DIFFERENT LABELING STRATEGIES:\n")

# Test with gold labeler
seeds_gold = create_seeds(fvs, nseeds=3, labeler=gold_labeler)
print("GOLD STANDARD LABELER:")
print(f" Seeds created: {len(seeds_gold)}")
print(f" Positive labels: {sum(seeds_gold['label'] == 1.0)}")
print(f" Negative labels: {sum(seeds_gold['label'] == 0.0)}")
print("  Sample seeds:")
print(seeds_gold[['id1', 'id2', 'score', 'label']].to_string(index=False))
print()

# Test with rule-based labeler
rule_labeler = RuleBasedLabeler()
seeds_rule = create_seeds(fvs, nseeds=3, labeler=rule_labeler)
print("RULE-BASED LABELER:")
print(f" Seeds created: {len(seeds_rule)}")
print(f" Positive labels: {sum(seeds_rule['label'] == 1.0)}")
print(f" Negative labels: {sum(seeds_rule['label'] == 0.0)}")
print("  Sample seeds:")
print(seeds_rule[['id1', 'id2', 'score', 'label']].to_string(index=False))
print()

# Test with always-positive labeler
positive_labeler = AlwaysPositiveLabeler()
seeds_positive = create_seeds(fvs, nseeds=2, labeler=positive_labeler)
print("ALWAYS-POSITIVE LABELER:")
print(f" Seeds created: {len(seeds_positive)}")
print(f" All labels: {seeds_positive['label'].tolist()}")
print()
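
One plausible way to pick seeds, sketched below in plain Python, is to alternate between the highest-scoring and lowest-scoring pairs so that the seed set is likely to contain both matches and non-matches. This selection rule, and the helper `create_seeds_sketch`, are assumptions for illustration only; they are not taken from the `create_seeds` implementation.

```python
def create_seeds_sketch(rows, nseeds, labeler):
    """Label up to `nseeds` pairs, alternating between the top and bottom of the score ranking."""
    ranked = sorted(rows, key=lambda r: r["score"], reverse=True)
    seeds = []
    take_top = True
    while ranked and len(seeds) < nseeds:
        row = ranked.pop(0) if take_top else ranked.pop()
        seeds.append({**row, "label": labeler(row["id1"], row["id2"])})
        take_top = not take_top
    return seeds


# Ground truth matching the gold labels used in this section
gold = {(1, 101), (3, 103), (4, 104)}


def gold_lookup(id1, id2):
    """Return 1.0 for a known match and 0.0 otherwise."""
    return 1.0 if (id1, id2) in gold else 0.0


# Hypothetical scores
rows = [
    {"id1": 1, "id2": 101, "score": 0.9},
    {"id1": 2, "id2": 102, "score": 0.2},
    {"id1": 3, "id2": 103, "score": 0.6},
    {"id1": 4, "id2": 104, "score": 0.4},
]

seeds = create_seeds_sketch(rows, nseeds=3, labeler=gold_lookup)
for seed in seeds:
    print(seed)
```

A seed set that contains only one class cannot train a classifier, which is why score-aware selection is preferable to labeling the first few rows.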

## Training a Matcher

MadMatcher supports various machine learning models for training matchers, including scikit-learn models and custom implementations.
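
The model specifications used in this section are plain dictionaries: a `model` class, keyword arguments in `model_args`, and a `nan_fill` value. The sketch below shows one way such a spec could be resolved, using a toy estimator so that it runs without scikit-learn. `MajorityClassifier`, `fill_nan`, and `train_from_spec` are hypothetical names, and nothing here reflects how `train_matcher` is actually implemented.

```python
import math


class MajorityClassifier:
    """Toy stand-in for a scikit-learn estimator: always predicts the most common label."""

    def __init__(self, default=0.0):
        self.default = default
        self.majority_ = default

    def fit(self, X, y):
        if y:
            self.majority_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


def fill_nan(vector, fill_value):
    """Replace NaN entries so estimators that reject NaN can be trained."""
    return [fill_value if math.isnan(x) else x for x in vector]


def train_from_spec(spec, vectors, labels):
    """Instantiate spec['model'] with spec['model_args'] and fit it on NaN-filled vectors."""
    X = [fill_nan(v, spec.get("nan_fill", 0.0)) for v in vectors]
    model = spec["model"](**spec.get("model_args", {}))
    return model.fit(X, labels), X


spec = {"model": MajorityClassifier, "nan_fill": 0.0, "model_args": {"default": 0.0}}
vectors = [[0.8, math.nan], [0.1, 0.2], [0.9, 0.7]]
labels = [1.0, 0.0, 1.0]

model, filled = train_from_spec(spec, vectors, labels)
print(filled[0], model.predict(filled))
```

Passing the class rather than an instance lets the caller construct a fresh estimator for every training run.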

In [None]:
print("TRAINING DIFFERENT TYPES OF MATCHERS:\n")

# Use the gold standard seeds for training
seeds = seeds_gold
labeled_data = seeds.copy()

print(f"Training data: {len(labeled_data)} labeled examples")
print(f"Feature vector dimension: {len(labeled_data['features'].iloc[0])}")
print()

# 1. Logistic Regression
model_lr = train_matcher(
    {'model_type': 'sklearn', 'model': LogisticRegression, 'nan_fill': 0, 'model_args': {'random_state': 42}}, 
    labeled_data
)
print("LOGISTIC REGRESSION MODEL:")
print(f" Model type: {type(model_lr.trained_model).__name__}")
print(f" Parameters: {model_lr.params_dict()}")
print()

# 2. Random Forest
model_rf = train_matcher(
    {'model_type': 'sklearn', 'model': RandomForestClassifier, 'nan_fill': 0,
     'model_args': {'n_estimators': 10, 'random_state': 42}}, 
    labeled_data
)
print("RANDOM FOREST MODEL:")
print(f" Model type: {type(model_rf.trained_model).__name__}")
print(f" Parameters: {model_rf.params_dict()}")
print()

# 3. Custom MLModel implementation
class SimpleThresholdModel(MLModel):
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self._trained = False
        self._trained_model = None
        
    @property
    def nan_fill(self): return 0.0
    @property  
    def use_vectors(self): return False
    @property
    def use_floats(self): return True

    @property
    def trained_model(self):
        return self._trained_model
        
    def train(self, df, vector_col, label_column, return_estimator=False):
        self._trained = True
        self._trained_model = self
        return self
        
    def predict(self, df, vector_col, output_col):
        # Simple rule: predict 1 if first feature > threshold
        df = df.copy()
        df[output_col] = df[vector_col].apply(lambda x: 1.0 if len(x) > 0 and x[0] > self.threshold else 0.0)
        return df
        
    def prediction_conf(self, df, vector_col, label_column):
        df = df.copy()
        df['conf'] = 0.8  # Fixed confidence
        return df
        
    def entropy(self, df, vector_col, output_col):
        df = df.copy()
        df[output_col] = 0.5  # Fixed entropy
        return df
        
    def params_dict(self):
        return {'threshold': self.threshold, 'trained': self._trained}

custom_model = SimpleThresholdModel(threshold=0.1)
model_custom = train_matcher(custom_model, labeled_data)
print("CUSTOM THRESHOLD MODEL:")
print(f" Model type: {type(model_custom).__name__}")
print(f" Parameters: {model_custom.params_dict()}")
print()

print("All models trained successfully!")

## Applying a Matcher

Once trained, matchers can be applied to new data to generate predictions and confidence scores.

In [None]:
print("APPLYING TRAINED MATCHERS:\n")

# Apply different trained models to the same feature vectors
test_fvs = fvs.copy()

# 1. Apply Logistic Regression model
result_lr = apply_matcher(model_lr, test_fvs, feature_col='features', output_col='lr_prediction')
print("LOGISTIC REGRESSION PREDICTIONS:")
print(result_lr[['id1', 'id2', 'score', 'lr_prediction']].head())
print(f" Positive predictions: {sum(result_lr['lr_prediction'] == 1.0)}/{len(result_lr)}")
print()

# 2. Apply Random Forest model
result_rf = apply_matcher(model_rf, test_fvs, feature_col='features', output_col='rf_prediction')
print("RANDOM FOREST PREDICTIONS:")
print(result_rf[['id1', 'id2', 'score', 'rf_prediction']].head())
print(f" Positive predictions: {sum(result_rf['rf_prediction'] == 1.0)}/{len(result_rf)}")
print()

# 3. Apply Custom Threshold model
result_custom = apply_matcher(model_custom, test_fvs, feature_col='features', output_col='custom_prediction')
print("CUSTOM THRESHOLD PREDICTIONS:")
print(result_custom[['id1', 'id2', 'score', 'custom_prediction']].head())
print(f" Positive predictions: {sum(result_custom['custom_prediction'] == 1.0)}/{len(result_custom)}")
print()

# 4. Compare all predictions
comparison = pd.DataFrame({
    'id1': result_lr['id1'],
    'id2': result_lr['id2'],
    'original_score': result_lr['score'].round(3),
    'lr_pred': result_lr['lr_prediction'],
    'rf_pred': result_rf['rf_prediction'],
    'custom_pred': result_custom['custom_prediction']
})

print("PREDICTION COMPARISON:")
print(comparison.to_string(index=False))
print()

# Calculate agreement between models
lr_rf_agreement = sum(result_lr['lr_prediction'] == result_rf['rf_prediction']) / len(result_lr)
print(f"Model Agreement:")
print(f" LR vs RF: {lr_rf_agreement:.2%}")
print(f" LR vs Custom: {sum(result_lr['lr_prediction'] == result_custom['custom_prediction']) / len(result_lr):.2%}")
print(f" RF vs Custom: {sum(result_rf['rf_prediction'] == result_custom['custom_prediction']) / len(result_rf):.2%}")
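
The agreement figures computed above generalize to any number of models. The following plain-Python sketch computes the agreement rate for every pair of prediction lists; `agreement_rate`, `pairwise_agreement`, and the prediction values are hypothetical and serve only to illustrate the calculation.

```python
from itertools import combinations


def agreement_rate(preds_a, preds_b):
    """Fraction of positions where two prediction lists agree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        return 0.0
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def pairwise_agreement(predictions):
    """Agreement rate for every unordered pair of models, keyed by model names."""
    return {
        (m1, m2): agreement_rate(predictions[m1], predictions[m2])
        for m1, m2 in combinations(sorted(predictions), 2)
    }


# Hypothetical predictions for four candidate pairs
predictions = {
    "lr": [1.0, 0.0, 1.0, 1.0],
    "rf": [1.0, 1.0, 1.0, 0.0],
    "custom": [1.0, 0.0, 1.0, 1.0],
}

agreement = pairwise_agreement(predictions)
for models, rate in agreement.items():
    print(models, f"{rate:.2%}")
```

Low agreement between models is a useful signal in its own right: the pairs they disagree on are often the ones worth sending to a human labeler.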

## Active Learning Labeling

**Note**: The `label_data` function has a known bug in the current version (line 157 in tools.py). This section therefore shows the intended usage together with a manual simulation; the function itself may need to be fixed before the calls shown here will run.

Active learning helps efficiently label data by selecting the most informative examples for human review.
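
A common way to decide which example is "most informative" is uncertainty sampling: rank unlabeled pairs by the entropy of the model's predicted match probability and label the highest-entropy pairs first. The sketch below shows this in plain Python. The `match_prob` values and the helper names are hypothetical, and this is a generic technique rather than a description of what `label_data` does internally.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) prediction; 0.0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def most_uncertain(pool, labeled_pairs, k=1):
    """Return the k unlabeled pairs whose predicted probability has the highest entropy."""
    unlabeled = [r for r in pool if (r["id1"], r["id2"]) not in labeled_pairs]
    return sorted(unlabeled, key=lambda r: binary_entropy(r["match_prob"]), reverse=True)[:k]


# Hypothetical predicted match probabilities
pool = [
    {"id1": 1, "id2": 101, "match_prob": 0.95},
    {"id1": 2, "id2": 102, "match_prob": 0.55},
    {"id1": 3, "id2": 103, "match_prob": 0.10},
    {"id1": 4, "id2": 104, "match_prob": 0.48},
]
already_labeled = {(4, 104)}

picked = most_uncertain(pool, already_labeled, k=1)
print(picked)
```

Entropy needs probabilities, not hard 0/1 predictions: a hard label always has zero entropy, so it cannot rank examples by uncertainty.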

In [None]:
print("ACTIVE LEARNING LABELING (INTENDED USAGE):\n")

# Due to the bug in label_data function, we'll demonstrate the intended usage
# and provide a manual implementation

print("❌ Current issue: label_data function has a bug in create_seeds when handling Spark DataFrames")
print("🔧 Here's how it should work:\n")

print("INTENDED USAGE:")
print("""
# Batch mode active learning
labels_batch = label_data(
    model_spec={'model_type': 'sklearn', 'model': LogisticRegression, 'model_args': {}},
    mode='batch',
    labeler_spec=gold_labeler,
    fvs=fvs
)

# Continuous mode active learning  
labels_continuous = label_data(
    model_spec={'model_type': 'sklearn', 'model': LogisticRegression, 'model_args': {}},
    mode='continuous',
    labeler_spec=gold_labeler,
    fvs=fvs,
    seeds=seeds_gold  # Optional pre-existing seeds
)
""")

print("\nMANUAL ACTIVE LEARNING SIMULATION:\n")

# Simulate active learning workflow manually
def simulate_active_learning(fvs, model, labeler, n_iterations=2):
    """Simulate the active learning process"""
    current_labeled = seeds_gold.copy()
    
    for iteration in range(n_iterations):
        print(f"Iteration {iteration + 1}:")
        
        # Train model on current labeled data
        model = train_matcher(
            {'model_type': 'sklearn', 'model': LogisticRegression, 'nan_fill': 0.0, 'model_args': {'random_state': 42}},
            current_labeled
        )
        
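        # Apply the model to get predictions
        predictions = apply_matcher(model, fvs, 'features', 'prediction')

        # Uncertainty proxy: closeness to the decision boundary (0.5).
        # If the predictions are hard 0/1 labels, this is 0 for every pair and
        # idxmax below simply picks the first unlabeled pair; class probabilities
        # or the model's entropy would give a meaningful ranking.
        predictions['uncertainty'] = predictions['prediction'].apply(lambda x: 1 - abs(x - 0.5) * 2)

        # Select the most uncertain pair that is not labeled yet. Pairs are matched
        # on (id1, id2) because the row indexes of the two frames are unrelated.
        labeled_pairs = set(zip(current_labeled['id1'], current_labeled['id2']))
        pair_is_labeled = [p in labeled_pairs for p in zip(predictions['id1'], predictions['id2'])]
        unlabeled = predictions[~np.array(pair_is_labeled, dtype=bool)]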
        if len(unlabeled) == 0:
            break
            
        most_uncertain = unlabeled.loc[unlabeled['uncertainty'].idxmax()]
        
        # Get label from labeler
        label = labeler(most_uncertain['id1'], most_uncertain['id2'])
        
        # Add to labeled set - only copy the necessary columns to avoid confusion
        new_labeled = pd.DataFrame({
            'id1': [most_uncertain['id1']],
            'id2': [most_uncertain['id2']],
            'features': [most_uncertain['features']],
            'score': [most_uncertain['score']],
            'label': [float(label)]  # Ensure label is float
        })
        current_labeled = pd.concat([current_labeled, new_labeled], ignore_index=True)
        
        print(f" Selected pair: ({most_uncertain['id1']}, {most_uncertain['id2']})")
        print(f" Uncertainty: {most_uncertain['uncertainty']:.3f}")
        print(f" Label: {label}")
        print(f" Total labeled: {len(current_labeled)}")
        print()
    
    return current_labeled

# Run simulation
final_labeled = simulate_active_learning(fvs, model_lr, rule_labeler)
print("FINAL LABELED DATASET:")
print(final_labeled[['id1', 'id2', 'score', 'label']].to_string(index=False))

## Custom Abstract Classes

MadMatcher's power comes from its extensibility. You can implement custom tokenizers, features, ML models, and labelers by extending the abstract base classes.
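
The extension pattern itself is ordinary Python: an abstract base class declares the methods a component must provide, and a subclass fills them in. The sketch below uses a hypothetical `BaseTokenizer` as a stand-in so that it runs without MadMatcher installed; the real `Tokenizer` base class may require additional methods or attributes.

```python
from abc import ABC, abstractmethod


class BaseTokenizer(ABC):
    """Hypothetical stand-in for an abstract tokenizer base class."""

    NAME = None

    @abstractmethod
    def tokenize(self, s):
        """Return a list of tokens, or None for unsupported input."""


class InitialsTokenizer(BaseTokenizer):
    """Tokenizer that keeps only the lowercase first letter of each word."""

    NAME = "initials"

    def tokenize(self, s):
        if not isinstance(s, str):
            return None
        return [word[0].lower() for word in s.split()]


# The abstract class cannot be instantiated until tokenize() is implemented
try:
    BaseTokenizer()
    abstract_instantiable = True
except TypeError:
    abstract_instantiable = False

tokens = InitialsTokenizer().tokenize("Jean-Pierre O'Connor")
print(abstract_instantiable, tokens)
```

Returning `None` for non-string input, as the custom tokenizer in this section also does, gives downstream features a clear signal that the value was missing rather than empty.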


In [None]:
print("CUSTOM IMPLEMENTATION EXAMPLES:\n")

# 1. Custom Tokenizer
class ReverseWordTokenizer(Tokenizer):
    """Tokenizer that reverses each word before returning tokens"""
    NAME = 'reverse_word_tokens'
    
    def tokenize(self, s):
        if not isinstance(s, str):
            return None
        words = s.lower().split()
        return [word[::-1] for word in words]  # Reverse each word

# Test custom tokenizer
reverse_tokenizer = ReverseWordTokenizer()
print("CUSTOM TOKENIZER (ReverseWordTokenizer):")
test_string = "Alice Smith"
print(f" Input: '{test_string}'")
print(f" Tokens: {reverse_tokenizer.tokenize(test_string)}")
print(f" Name: {reverse_tokenizer.NAME}")
print()

# 2. Custom Feature
class WordCountDifferenceFeature(Feature):
    """Feature that computes absolute difference in word count"""
    
    def __str__(self):
        return f'word_count_diff({self.a_attr}, {self.b_attr})'
    
    def __call__(self, rec, recs):
        b_value = rec[self.b_attr]
        a_values = recs[self.a_attr]
        
        if not isinstance(b_value, str):
            return pd.Series(np.nan, index=a_values.index)
        
        b_word_count = len(b_value.split()) if b_value else 0
        
        def word_count_diff(a_value):
            if not isinstance(a_value, str):
                return np.nan
            a_word_count = len(a_value.split()) if a_value else 0
            return abs(a_word_count - b_word_count)
        
        return a_values.apply(word_count_diff).astype(np.float64)
    
    def _preprocess(self, data, input_col):
        return data  # No preprocessing needed
    
    def _preprocess_output_column(self, attr):
        return None  # No preprocessing output

# Test custom feature
custom_feature = WordCountDifferenceFeature('name', 'name')
print("CUSTOM FEATURE (WordCountDifferenceFeature):")
print(f" Feature string: {custom_feature}")

# Create test data for feature
test_rec = {'name': 'Alice Smith'}
test_recs = pd.DataFrame({'name': ['Bob Jones', 'Jean-Pierre O\'Connor', 'X']})
feature_result = custom_feature(test_rec, test_recs)
print(f" Input B: '{test_rec['name']}' (2 words)")
print(f" Input A values: {test_recs['name'].tolist()}")
print(f" Word count differences: {feature_result.tolist()}")
print()

# 3. Custom Labeler with complex logic
class SmartSimilarityLabeler(CustomLabeler):
    """Labeler that uses multiple criteria for matching"""
    
    def label_pair(self, row1, row2):
        # Get name similarity (simple word overlap)
        name1_words = set(row1['name'].lower().split())
        name2_words = set(row2['name'].lower().split())
        name_overlap = len(name1_words & name2_words) / max(len(name1_words | name2_words), 1)
        
        # Get age similarity 
        age1, age2 = row1.get('age'), row2.get('age')
        # pandas stores missing numeric values as NaN, so check with pd.notna rather than `is not None`
        age_diff = abs(age1 - age2) if (pd.notna(age1) and pd.notna(age2)) else float('inf')
        
        # Labeling logic
        if name_overlap >= 0.5 and age_diff <= 5:
            return 1.0  # Strong match
        elif name_overlap >= 0.3 or age_diff <= 2:
            return 0.5  # Uncertain (could be treated as unsure)
        else:
            return 0.0  # No match

smart_labeler = SmartSimilarityLabeler(A, B)
print("CUSTOM LABELER (SmartSimilarityLabeler):")
print(" Logic: High name overlap + small age difference = match")

# Test the smart labeler
test_pairs = [(1, 101), (2, 102), (3, 103), (4, 104)]
for id1, id2 in test_pairs:
    label = smart_labeler(id1, id2)
    row1 = A[A['_id'] == id1].iloc[0]
    row2 = B[B['_id'] == id2].iloc[0]
    print(f" Pair ({id1},{id2}): '{row1['name']}' vs '{row2['name']}' → {label}")
print()

# 4. Demonstrate custom implementations in pipeline
print("USING CUSTOM COMPONENTS IN PIPELINE:")

# Create features with custom tokenizer
custom_features = create_features(
    A, B, 
    a_cols=['name'], b_cols=['name'],
    tokenizers=[reverse_tokenizer],
    sim_functions=get_base_sim_functions()[:2]  # Use first 2 similarity functions
)

# Add our custom feature
custom_features.append(custom_feature)

print(f" Created {len(custom_features)} features (including custom)")
for i, feature in enumerate(custom_features):
    print(f"   {i+1}. {feature}")
print()

# Test featurization with custom features
small_candidates = pd.DataFrame({'id1_list': [[1], [2]], 'id2': [101, 102]})
custom_fvs = featurize(custom_features, A, B, small_candidates)

print(f" Custom featurization result shape: {custom_fvs.shape}")
print(f" Feature vector length: {len(custom_fvs['features'].iloc[0])}")
print("  Custom components integrated successfully!")


## Advanced Examples

This section demonstrates more complex usage patterns and best practices for real-world entity matching scenarios.


In [None]:
print("ADVANCED USAGE PATTERNS:\n")

# 1. Multi-column feature engineering
print("MULTI-COLUMN FEATURE ENGINEERING:")

# Create features for all available columns
all_features = create_features(
    A, B,
    a_cols=['name', 'age', 'email', 'phone', 'address'],
    b_cols=['name', 'age', 'email', 'phone', 'address'],
    null_threshold=0.7  # Allow columns with up to 70% missing values
)

print(f" Generated {len(all_features)} features across all columns:")
feature_types = {}
for feature in all_features:
    feature_type = type(feature).__name__
    feature_types[feature_type] = feature_types.get(feature_type, 0) + 1

for ftype, count in feature_types.items():
    print(f"   {ftype}: {count}")
print()

# 2. Feature selection and evaluation
print("FEATURE EVALUATION:")

# Apply all features to get comprehensive feature vectors
comprehensive_fvs = featurize(all_features, A, B, candidates_basic)
comprehensive_fvs['simple_score'] = comprehensive_fvs['features'] \
    .apply(lambda fv: np.nansum(fv))
print(f" Full feature vector dimension: {len(comprehensive_fvs['features'].iloc[0])}")

# Calculate feature statistics
feature_stats = []
for i in range(len(comprehensive_fvs['features'].iloc[0])):
    # Collect the i-th feature across all pairs as a float array
    arr = np.array([fv[i] for fv in comprehensive_fvs['features']], dtype=float)
    non_nan = arr[~np.isnan(arr)]
    stats = {
        'feature_idx': i,
        'mean': np.nanmean(arr),
        'std': np.nanstd(arr),
        'nan_count': int(np.isnan(arr).sum()),
        'unique_values': len(set(non_nan))
    }
    feature_stats.append(stats)

# Display top features by variance (good features should have variability)
feature_stats_df = pd.DataFrame(feature_stats)
feature_stats_df['variance'] = feature_stats_df['std'] ** 2
top_features = feature_stats_df.nlargest(5, 'variance')

print(" Top 5 features by variance:")
print(top_features[['feature_idx', 'mean', 'std', 'variance']].round(4).to_string(index=False))
print()

# 3. Model comparison and ensemble
print("MODEL ENSEMBLE:")

# Train multiple models on the comprehensive features
models = {}
model_configs = [
    ('LR', {'model_type': 'sklearn', 'model': LogisticRegression, 'nan_fill': 0.0, 'model_args': {'random_state': 42}}),
    ('RF', {'model_type': 'sklearn', 'model': RandomForestClassifier, 'nan_fill': 0.0, 'model_args': {'n_estimators': 10, 'random_state': 42}}),
]

# Use comprehensive seeds for training
comprehensive_seeds = create_seeds(comprehensive_fvs, nseeds=4, labeler=gold_labeler, score_column='simple_score')
print(comprehensive_seeds)
for name, config in model_configs:
    models[name] = train_matcher(config, comprehensive_seeds)
    print(f" Trained {name} model")

# Apply all models and create ensemble
ensemble_results = comprehensive_fvs[['id1', 'id2']].copy()

for name, model in models.items():
    result = apply_matcher(model, comprehensive_fvs, 'features', f'{name}_pred')
    ensemble_results[f'{name}_pred'] = result[f'{name}_pred']

# Simple ensemble: average predictions
ensemble_results['ensemble_pred'] = ensemble_results[['LR_pred', 'RF_pred']].mean(axis=1)
ensemble_results['ensemble_pred_binary'] = (ensemble_results['ensemble_pred'] > 0.5).astype(float)

print(" Ensemble results:")
print(ensemble_results.round(3).to_string(index=False))
print()

# 4. Performance evaluation simulation
print("PERFORMANCE EVALUATION:")

# Simulate evaluation against ground truth
ground_truth = {(1, 101): 1, (2, 102): 0, (3, 103): 1, (4, 104): 1}

def evaluate_predictions(predictions, truth_dict):
    correct = 0
    total = 0
    for _, row in predictions.iterrows():
        pair = (row['id1'], row['id2'])
        if pair in truth_dict:
            pred = row['ensemble_pred_binary']
            true_label = truth_dict[pair]
            if pred == true_label:
                correct += 1
            total += 1
    return correct / total if total > 0 else 0

accuracy = evaluate_predictions(ensemble_results, ground_truth)
print(f" Ensemble accuracy: {accuracy:.2%}")

# Calculate precision and recall
true_positives = sum(1 for _, row in ensemble_results.iterrows() 
                    if (row['id1'], row['id2']) in ground_truth 
                    and ground_truth[(row['id1'], row['id2'])] == 1 
                    and row['ensemble_pred_binary'] == 1)

predicted_positives = sum(ensemble_results['ensemble_pred_binary'])
actual_positives = sum(ground_truth.values())

precision = true_positives / predicted_positives if predicted_positives > 0 else 0
recall = true_positives / actual_positives if actual_positives > 0 else 0

print(f" Precision: {precision:.2%}")
print(f" Recall: {recall:.2%}")
print()

print("Advanced examples completed successfully!")
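
The precision and recall calculation in the cell above can be written more compactly with sets of matched pairs, which also makes it easy to add F1. The sketch below is plain Python; the `predicted` set is hypothetical and does not represent the output of any model trained in this notebook.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 from sets of predicted and true match pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# True matches, consistent with the ground truth used in this section
actual = {(1, 101), (3, 103), (4, 104)}
# Hypothetical model output: two correct matches and one false positive
predicted = {(1, 101), (2, 102), (3, 103)}

precision, recall, f1 = precision_recall_f1(predicted, actual)
print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
```

With only four candidate pairs these metrics are illustrative at best; on real data they should be computed on a held-out labeled sample.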


---

## Summary

This notebook has demonstrated the key functionality of MadMatcher:

### Core Functions Covered
**Tokenizers & Similarity Functions**: Understanding the building blocks  
**Feature Creation**: Automatic and custom feature generation  
**Featurization**: Converting candidate pairs to feature vectors  
**Down Sampling**: Efficient dataset reduction  
**Seed Creation**: Generating initial training labels  
**Model Training**: Multiple ML approaches  
**Model Application**: Making predictions  
**Active Learning**: Efficient labeling strategies  

### Abstract Classes Extended
**Custom Tokenizer**: `ReverseWordTokenizer`  
**Custom Feature**: `WordCountDifferenceFeature`  
**Custom MLModel**: `SimpleThresholdModel`  
**Custom Labeler**: `SmartSimilarityLabeler`  

### Advanced Patterns
**Multi-column Feature Engineering**  
**Feature Evaluation & Selection**  
**Model Ensembles**  
**Performance Evaluation**  


For more information, visit the [MadMatcher Website](https://madmatcher.ai) or explore the source code in the `madmatcher_tools` package.

In [None]:
# Clean up Spark session
spark.stop()
print("Spark session stopped. Notebook complete!")
