# MadLib: Comprehensive Usage Examples

This notebook walks through every public API function and abstract class in the MadLib toolkit, end to end. Each section pairs a short explanation with code and its output, covering different input types and configurations.

## What is MadLib?

MadLib is a toolkit for the matching step of entity matching that provides:
- Multiple tokenization strategies for text preprocessing
- Various similarity functions for comparing records
- Active learning support for efficient labeling
- Extensible abstract classes for custom implementations

## Table of Contents
1. [Setup](#Setup)
2. [Public API Overview](#Public-API-Overview)  
3. [Tokenizers and Similarity Functions](#Tokenizers-and-Similarity-Functions)
4. [Feature Creation](#Feature-Creation)
5. [Featurization](#Featurization)
6. [Down Sampling](#Down-Sampling)
7. [Seed Creation](#Seed-Creation)
8. [Training a Matcher](#Training-a-Matcher)
9. [Applying a Matcher](#Applying-a-Matcher)
10. [Active Learning Labeling](#Active-Learning-Labeling)
11. [Custom Abstract Classes](#Custom-Abstract-Classes)

## Setup

Install MadLib and its dependencies:

```bash
pip install MadLib
```

Import all public API functions and classes:

In [None]:
from MadLib import (
    # Core functions
    get_base_tokenizers, get_extra_tokenizers, get_base_sim_functions,
    create_features, featurize, down_sample,
    create_seeds, train_matcher, apply_matcher, label_data,
    # Abstract base classes for customization
    Tokenizer, Vectorizer, Feature, MLModel, Labeler, CustomLabeler
)
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

# Initialize Spark - needed if you use label_data, featurize, a SparkML Model, or any Spark DataFrame
spark = SparkSession.builder.appName('MadLibComprehensiveDemo').getOrCreate()



## Public API Overview

MadLib exposes the following public API functions and abstract classes:

### Core Functions
- **`get_base_tokenizers()`**: Returns default tokenizers for text processing
- **`get_extra_tokenizers()`**: Returns additional specialized tokenizers  
- **`get_base_sim_functions()`**: Returns default similarity functions
- **`create_features(A, B, a_cols, b_cols, tokenizers=None, sim_functions=None, null_threshold=0.5)`**: Generate feature objects for comparing records
- **`featurize(features, A, B, candidates, output_col='features', fill_na=None)`**: Apply features to candidate pairs
- **`down_sample(fvs, percent, search_id_column, score_column='score', bucket_size=1000)`**: Reduce dataset size by sampling
- **`create_seeds(fvs, nseeds, labeler, score_column='score')`**: Generate initial labeled examples
- **`train_matcher(model_spec, labeled_data, feature_col='features', label_col='label')`**: Train a matching model
- **`apply_matcher(model, df, feature_col, output_col)`**: Apply trained model for predictions
- **`label_data(model_spec, mode, labeler_spec, fvs, seeds=None)`**: Active learning for labeling


### Abstract Base Classes (for customization)
- **`Tokenizer`**: Base class for text tokenization strategies
- **`Vectorizer`**: Base class for converting tokens to vectors  
- **`Feature`**: Base class for similarity/distance features
- **`MLModel`**: Base class for machine learning models
- **`Labeler`**: Base class for labeling strategies
- **`CustomLabeler`**: Extended labeler with access to full record data

Each section below demonstrates these with different input types and configurations.


## Tokenizers and Similarity Functions

MadLib provides tokenizers for text preprocessing and similarity functions for comparing records. These are the building blocks that `create_features()` combines, so it helps to know what each one produces before doing any feature engineering.

In [2]:
# Get available tokenizers and similarity functions
base_tokenizers = get_base_tokenizers()
extra_tokenizers = get_extra_tokenizers()
base_sim_functions = get_base_sim_functions()

print('BASE TOKENIZERS:')
for i, tokenizer in enumerate(base_tokenizers):
    print(f'  {i+1}. {tokenizer.__class__.__name__}: {tokenizer.NAME}')

print('EXTRA TOKENIZERS:')
for i, tokenizer in enumerate(extra_tokenizers):
    print(f'  {i+1}. {tokenizer.__class__.__name__}: {tokenizer.NAME}')

print('BASE SIMILARITY FUNCTIONS:')
for i, sim_func in enumerate(base_sim_functions):
    print(f'  {i+1}. {sim_func.__name__}')

BASE TOKENIZERS:
  1. StrippedWhiteSpaceTokenizer: stripped_whitespace_tokens
  2. NumericTokenizer: num_tokens
  3. QGramTokenizer: 3gram_tokens
EXTRA TOKENIZERS:
  1. AlphaNumericTokenizer: alnum_tokens
  2. QGramTokenizer: 5gram_tokens
  3. StrippedQGramTokenizer: stripped_3gram_tokens
  4. StrippedQGramTokenizer: stripped_5gram_tokens
BASE SIMILARITY FUNCTIONS:
  1. TFIDFFeature
  2. JaccardFeature
  3. SIFFeature
  4. OverlapCoeffFeature
  5. CosineFeature
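
To make the tokenizer and similarity names above more concrete, here is a minimal pure-Python sketch of character 3-grams combined with three set-based similarity measures. The function names are illustrative and are not MadLib's internals; the TF-IDF and SIF features are weighted variants and are not covered here.

```python
import math

def qgrams(s, q=3):
    """Return the overlapping character q-grams of a lowercased string."""
    s = s.lower()
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the distinct tokens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_coeff(a, b):
    """|A ∩ B| / min(|A|, |B|) over the distinct tokens."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|) over the distinct tokens."""
    a, b = set(a), set(b)
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0

x = qgrams("Alice Smith")    # 9 distinct 3-grams
y = qgrams("Alicia Smith")   # 10 distinct 3-grams, 6 shared with x
print(round(jaccard(x, y), 4))        # 0.4615  (6 / 13)
print(round(overlap_coeff(x, y), 4))  # 0.6667  (6 / 9)
print(round(cosine(x, y), 4))         # 0.6325  (6 / sqrt(90))
```

The Jaccard value of 6/13 is consistent with the fifth entry of the sample feature vector printed in the Featurization section, which corresponds to `jaccard(3gram_tokens(name), 3gram_tokens(name))` for the same pair of names.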


### Testing Tokenizers with Different Input Types

Let's see how different tokenizers handle various types of input data:


In [3]:
# Test different input types with tokenizers
test_inputs = [
    'Alice Smith',           # Normal name
    'Bob Jones Jr.',         # Name with suffix
    'Jean-Pierre O\'Connor', # Name with special characters
    'Company123 Inc.',       # Mixed alphanumeric
    '123-456-7890',          # Phone number
    '',                      # Empty string
    None,                    # None value
    42                       # Numeric value
]

print("TOKENIZER TESTING ON DIFFERENT INPUT TYPES\n")

# Test different tokenizers
tokenizers_to_test = [
    base_tokenizers[0],  # StrippedWhiteSpaceTokenizer
    base_tokenizers[1],  # NumericTokenizer  
    base_tokenizers[2],  # QGramTokenizer
    extra_tokenizers[0]  # AlphaNumericTokenizer
]

for tokenizer in tokenizers_to_test:
    print(f"{tokenizer.__class__.__name__} ({tokenizer.NAME}):")
    for input_val in test_inputs:
        try:
            if input_val is None:
                tokens = tokenizer.tokenize(input_val)
            else:
                tokens = tokenizer.tokenize(str(input_val))
            print(f"  Input: {repr(input_val):<20} → Tokens: {tokens}")
        except Exception as e:
            print(f"  Input: {repr(input_val):<20} → Error: {type(e).__name__}")
    print()


TOKENIZER TESTING ON DIFFERENT INPUT TYPES

StrippedWhiteSpaceTokenizer (stripped_whitespace_tokens):
  Input: 'Alice Smith'        → Tokens: ['alice', 'smith']
  Input: 'Bob Jones Jr.'      → Tokens: ['bob', 'jones', 'jr']
  Input: "Jean-Pierre O'Connor" → Tokens: ['jeanpierre', 'oconnor']
  Input: 'Company123 Inc.'    → Tokens: ['company123', 'inc']
  Input: '123-456-7890'       → Tokens: ['1234567890']
  Input: ''                   → Tokens: []
  Input: None                 → Tokens: None
  Input: 42                   → Tokens: ['42']

NumericTokenizer (num_tokens):
  Input: 'Alice Smith'        → Tokens: []
  Input: 'Bob Jones Jr.'      → Tokens: []
  Input: "Jean-Pierre O'Connor" → Tokens: []
  Input: 'Company123 Inc.'    → Tokens: ['123']
  Input: '123-456-7890'       → Tokens: ['123', '456', '7890']
  Input: ''                   → Tokens: []
  Input: None                 → Tokens: None
  Input: 42                   → Tokens: ['42']

QGramTokenizer (3gram_tokens):
  Input: 'Alice
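
The behavior shown in the output above can be approximated with two small regex-based functions. This is only a sketch that reproduces the printed results for `StrippedWhiteSpaceTokenizer` and `NumericTokenizer`; the function names are hypothetical, and the real tokenizers may treat non-ASCII characters or other edge cases differently.

```python
import re

def stripped_whitespace_tokens(s):
    """Lowercase, drop punctuation, then split on whitespace."""
    if not isinstance(s, str):
        return None  # mirrors the None -> None behavior in the output above
    cleaned = re.sub(r"[^a-z0-9\s]", "", s.lower())
    return cleaned.split()

def num_tokens(s):
    """Return each maximal run of digits as a token."""
    if not isinstance(s, str):
        return None
    return re.findall(r"[0-9]+", s)

print(stripped_whitespace_tokens("Jean-Pierre O'Connor"))  # ['jeanpierre', 'oconnor']
print(stripped_whitespace_tokens("123-456-7890"))          # ['1234567890']
print(num_tokens("123-456-7890"))                          # ['123', '456', '7890']
print(num_tokens("Company123 Inc."))                       # ['123']
```

Note how the same phone number becomes one token under the first function and three under the second, which is why the choice of tokenizer matters for fields such as phone numbers and addresses.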

## Feature Creation

Feature creation is the core of MadLib's functionality. The `create_features()` function automatically generates appropriate features based on your data characteristics.

In [4]:
# Create more comprehensive toy DataFrames for testing
A = pd.DataFrame({
    '_id': [1, 2, 3, 4],
    'name': ['Alice Smith', 'Bob Jones', 'Carol Davis', 'David Wilson'],
    'age': [25, 30, 28, None],  # Include missing values
    'email': ['alice@email.com', 'bob@email.com', None, 'david@email.com'],
    'phone': ['123-456-7890', '987-654-3210', '555-123-4567', ''],
    'address': ['123 Main St', '456 Oak Ave', '789 Pine Rd', '321 Elm St']
})

B = pd.DataFrame({
    '_id': [101, 102, 103, 104], 
    'name': ['Alicia Smith', 'Robert Jones', 'Caroline Davis', 'Dave Wilson'],
    'age': [26, 29, 28, 35],
    'email': ['alicia@email.com', 'robert@gmail.com', 'carol@email.com', None],
    'phone': ['123-456-7891', '987-654-3211', '', '555-999-8888'],
    'address': ['124 Main St', '457 Oak Ave', '790 Pine Rd', '322 Elm St']
})

print("SAMPLE DATASETS:")
print("\nTable A:")
print(A.to_string(index=False))
print("\nTable B:")
print(B.to_string(index=False))

SAMPLE DATASETS:

Table A:
 _id         name  age           email        phone     address
   1  Alice Smith 25.0 alice@email.com 123-456-7890 123 Main St
   2    Bob Jones 30.0   bob@email.com 987-654-3210 456 Oak Ave
   3  Carol Davis 28.0            None 555-123-4567 789 Pine Rd
   4 David Wilson  NaN david@email.com               321 Elm St

Table B:
 _id           name  age            email        phone     address
 101   Alicia Smith   26 alicia@email.com 123-456-7891 124 Main St
 102   Robert Jones   29 robert@gmail.com 987-654-3211 457 Oak Ave
 103 Caroline Davis   28  carol@email.com              790 Pine Rd
 104    Dave Wilson   35             None 555-999-8888  322 Elm St


### Different Feature Creation Configurations

The `create_features()` function can be customized in several ways:


In [None]:
print("FEATURE CREATION CONFIGURATIONS:\n")

# 1. Default configuration (automatic feature generation)
features_default = create_features(A, B, a_cols=['name', 'age'], b_cols=['name', 'age'])
print(f"DEFAULT FEATURES ({len(features_default)} features):")
for i, feature in enumerate(features_default):
    print(f"   {i+1}. {feature}")
print()

# 2. Only numeric columns (for RelDiff features)
features_numeric = create_features(A, B, a_cols=['age'], b_cols=['age'])
print(f"NUMERIC-ONLY FEATURES ({len(features_numeric)} features):")
for i, feature in enumerate(features_numeric):
    print(f"   {i+1}. {feature}")
print()

# 3. Custom tokenizers and similarity functions
from MadLib._internal.tokenizer.tokenizer import AlphaNumericTokenizer
from MadLib._internal.feature.token_feature import JaccardFeature, CosineFeature

custom_tokenizers = [AlphaNumericTokenizer()]
custom_sim_functions = [JaccardFeature, CosineFeature]

features_custom = create_features(
    A, B, 
    a_cols=['name'], b_cols=['name'],
    tokenizers=custom_tokenizers,
    sim_functions=custom_sim_functions
)
print(f"CUSTOM TOKENIZERS & SIM FUNCTIONS ({len(features_custom)} features):")
for i, feature in enumerate(features_custom):
    print(f"   {i+1}. {feature}")
print()

# 4. Stricter null threshold (lower than the 0.5 default; drops columns with too many missing values)
features_filtered = create_features(
    A, B, 
    a_cols=['name', 'age', 'email'], b_cols=['name', 'age', 'email'],
    null_threshold=0.3  # Only use columns with <30% null values
)
print(f"NULL-FILTERED FEATURES ({len(features_filtered)} features):")
for i, feature in enumerate(features_filtered):
    print(f"   {i+1}. {feature}")
print()


FEATURE CREATION CONFIGURATIONS:

DEFAULT FEATURES (8 features):
   1. exact_match(name, name)
   2. exact_match(age, age)
   3. rel_diff(age, age)
   4. tf_idf_3gram_tokens(name, name)
   5. jaccard(3gram_tokens(name), 3gram_tokens(name))
   6. sif_3gram_tokens(name, name)
   7. overlap_coeff(3gram_tokens(name), 3gram_tokens(name))
   8. cosine(3gram_tokens(name), 3gram_tokens(name))

NUMERIC-ONLY FEATURES (2 features):
   1. exact_match(age, age)
   2. rel_diff(age, age)

CUSTOM TOKENIZERS & SIM FUNCTIONS (4 features):
   1. exact_match(name, name)
   2. monge_elkan_jw(name, name)
   3. edit_distance(name, name)
   4. smith_waterman(name, name)

NULL-FILTERED FEATURES (14 features):
   1. exact_match(name, name)
   2. exact_match(age, age)
   3. exact_match(email, email)
   4. rel_diff(age, age)
   5. tf_idf_3gram_tokens(name, name)
   6. jaccard(3gram_tokens(name), 3gram_tokens(name))
   7. sif_3gram_tokens(name, name)
   8. overlap_coeff(3gram_tokens(name), 3gram_tokens(name))
   9
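
The non-token features listed above can be sketched in a few lines. The `rel_diff` formula below, |a - b| / max(|a|, |b|), is inferred from the feature vectors printed in the Featurization section (ages 25 and 26 give 1/26 ≈ 0.0385); the handling of missing values and the way the null rate is computed are assumptions, and all function names are illustrative.

```python
import math

def is_missing(v):
    """True for None and float NaN."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def exact_match(a, b):
    """1.0 if the values are equal, 0.0 otherwise (NaN if either is missing)."""
    if is_missing(a) or is_missing(b):
        return math.nan  # assumption: missing inputs give a missing feature
    return 1.0 if a == b else 0.0

def rel_diff(a, b):
    """Absolute difference scaled by the larger magnitude."""
    if is_missing(a) or is_missing(b):
        return math.nan  # assumption, as above
    denom = max(abs(a), abs(b))
    return abs(a - b) / denom if denom else 0.0

def null_fraction(values):
    """Share of missing entries in a column."""
    return sum(is_missing(v) for v in values) / len(values)

print(round(rel_diff(25, 26), 6))  # 0.038462
print(round(rel_diff(30, 29), 6))  # 0.033333

emails_a = ["alice@email.com", "bob@email.com", None, "david@email.com"]
print(null_fraction(emails_a))     # 0.25, below the 0.3 threshold used above
```

A null fraction of 0.25 is consistent with `email` still appearing in the null-filtered feature list, although whether MadLib measures the rate per table or over both tables is not shown here.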

## Featurization

Featurization applies the generated features to candidate record pairs, producing feature vectors for machine learning.


In [6]:
# Create different types of candidate pairs
print("FEATURIZATION EXAMPLES:\n")

# 1. Basic candidate pairs (one-to-one mapping)
candidates_basic = pd.DataFrame({
    'id1_list': [[1], [2], [3], [4]],  # Each record from A
    'id2': [101, 102, 103, 104]        # Paired with record from B
})

# 2. One-to-many candidates (blocking results)
candidates_blocked = pd.DataFrame({
    'id1_list': [[1, 2], [3], [4]],    # Multiple A records can match one B record
    'id2': [101, 103, 104]
})

# 3. Custom candidates with additional metadata
candidates_custom = pd.DataFrame({
    'id1_list': [[1], [2], [3]],
    'id2': [101, 102, 103],
    'block_id': ['name_block_1', 'name_block_2', 'name_block_1'],  # Additional info
    'similarity_score': [0.8, 0.6, 0.9]  # Pre-computed scores
})

print("CANDIDATE PAIR EXAMPLES:")
print("\n1. Basic pairs:")
print(candidates_basic)
print("\n2. Blocked pairs (one-to-many):")  
print(candidates_blocked)
print("\n3. Custom pairs with metadata:")
print(candidates_custom)


FEATURIZATION EXAMPLES:

CANDIDATE PAIR EXAMPLES:

1. Basic pairs:
  id1_list  id2
0      [1]  101
1      [2]  102
2      [3]  103
3      [4]  104

2. Blocked pairs (one-to-many):
  id1_list  id2
0   [1, 2]  101
1      [3]  103
2      [4]  104

3. Custom pairs with metadata:
  id1_list  id2      block_id  similarity_score
0      [1]  101  name_block_1               0.8
1      [2]  102  name_block_2               0.6
2      [3]  103  name_block_1               0.9


In [7]:
# Apply featurization with different configurations
print("\nFEATURIZATION RESULTS:\n")

# Use the default features for comprehensive comparison
features = features_default

# 1. Basic featurization
fvs_basic = featurize(features, A, B, candidates_basic)
print(f"BASIC FEATURIZATION:")
print(f" Shape: {fvs_basic.shape}")
print(f" Columns: {list(fvs_basic.columns)}")
print(f" Feature vector length: {len(fvs_basic['features'].iloc[0])}")
print(f" Sample feature vector: {fvs_basic['features'].iloc[0][:5]}... (showing first 5)")
print()

# 2. Featurization with custom output column name
fvs_custom_col = featurize(features, A, B, candidates_basic, output_col='similarity_features')
print(f"CUSTOM OUTPUT COLUMN:")
print(f" Columns: {list(fvs_custom_col.columns)}")
print()

# 3. Featurization with fill_na parameter
fvs_filled = featurize(features, A, B, candidates_basic, fill_na=0.0)
print(f"WITH NaN FILLING:")
print(f" NaN count in features: {sum(np.isnan(x).sum() for x in fvs_basic['features'])}")
print(f" NaN count after filling: {sum(np.isnan(x).sum() for x in fvs_filled['features'])}")
print()

# Display sample results
print("SAMPLE FEATURIZATION OUTPUT:")
print(fvs_basic[['id2', 'id1', '_id']].head())
print("\nFeature vectors (first 3 features for first 3 pairs):")
for i in range(min(3, len(fvs_basic))):
    print(f" Pair {i+1}: {fvs_basic['features'].iloc[i][:3]}")



FEATURIZATION RESULTS:



                                                                                

BASIC FEATURIZATION:
 Shape: (4, 4)
 Columns: ['id2', 'id1', 'features', '_id']
 Feature vector length: 8
 Sample feature vector: [0.0, 0.0, 0.03846153989434242, 0.5474452376365662, 0.4615384638309479]... (showing first 5)

CUSTOM OUTPUT COLUMN:
 Columns: ['id2', 'id1', 'similarity_features', '_id']

WITH NaN FILLING:
 NaN count in features: 0
 NaN count after filling: 0

SAMPLE FEATURIZATION OUTPUT:
   id2  id1          _id
0  101    1   8589934592
1  102    2   8589934593
2  104    4  17179869184
3  103    3  25769803776

Feature vectors (first 3 features for first 3 pairs):
 Pair 1: [0.0, 0.0, 0.03846153989434242]
 Pair 2: [0.0, 0.0, 0.03333333507180214]
 Pair 3: [0.0, 0.0, 0.0]
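
The output columns above contain a scalar `id1` rather than the `id1_list` of the input, which suggests each candidate row is expanded into one row per (id1, id2) pair before features are computed. A minimal sketch of that expansion, using plain dictionaries instead of DataFrames (the helper name is hypothetical):

```python
def explode_candidates(candidates):
    """Return one (id1, id2) pair for every id in each row's id1_list."""
    return [(id1, row["id2"]) for row in candidates for id1 in row["id1_list"]]

candidates_blocked = [
    {"id1_list": [1, 2], "id2": 101},
    {"id1_list": [3], "id2": 103},
    {"id1_list": [4], "id2": 104},
]

pairs = explode_candidates(candidates_blocked)
print(pairs)  # [(1, 101), (2, 101), (3, 103), (4, 104)]
```

With this layout, three blocked candidate rows yield four feature vectors, one per pair.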


## Down Sampling

Down sampling reduces the size of your feature vector dataset while preserving the most promising candidate pairs based on a scoring function.

In [8]:
# First, let's work with our basic feature vectors and add scores
fvs = fvs_basic.copy()

# Add various types of scores for demonstration
np.random.seed(42)  # For reproducible results
fvs['score'] = np.random.beta(2, 5, len(fvs))  # Simulated similarity scores
fvs['random_score'] = np.random.uniform(0, 1, len(fvs))

print("DOWN SAMPLING EXAMPLES:\n")
print(f"Original dataset size: {len(fvs)} pairs")
print("Original scores:", fvs['score'].round(3).tolist())
print()

# 1. Basic down sampling (50%)
down_50 = down_sample(fvs, percent=0.5, search_id_column='id2')
print(f"50% DOWN SAMPLING:")
print(f" Result size: {len(down_50)} pairs")
print(f" Retained scores: {down_50['score'].round(3).tolist()}")
print()

# 2. Aggressive down sampling (25%)  
down_25 = down_sample(fvs, percent=0.25, search_id_column='id2')
print(f"25% DOWN SAMPLING:")
print(f" Result size: {len(down_25)} pairs")
print(f" Retained scores: {down_25['score'].round(3).tolist()}")
print()

# 3. Custom score column
down_custom = down_sample(fvs, percent=0.5, search_id_column='id2', score_column='random_score')
print(f"CUSTOM SCORE COLUMN:")
print(f" Result size: {len(down_custom)} pairs")
print(f" Retained random_scores: {down_custom['random_score'].round(3).tolist()}")
print()

# 4. Custom bucket size for large datasets
down_buckets = down_sample(fvs, percent=0.75, search_id_column='id2', bucket_size=2)
print(f"CUSTOM BUCKET SIZE:")
print(f" Result size: {len(down_buckets)} pairs")
print(f" Bucket size parameter: 2 (for demonstration)")
print()

print("Note: Down sampling preserves the highest-scoring pairs within each hash bucket.")

DOWN SAMPLING EXAMPLES:

Original dataset size: 4 pairs
Original scores: [0.354, 0.249, 0.416, 0.16]

50% DOWN SAMPLING:
 Result size: 2 pairs
 Retained scores: [0.354, 0.416]

25% DOWN SAMPLING:
 Result size: 1 pairs
 Retained scores: [0.416]

CUSTOM SCORE COLUMN:
 Result size: 2 pairs
 Retained random_scores: [0.525, 0.432]

CUSTOM BUCKET SIZE:
 Result size: 2 pairs
 Bucket size parameter: 2 (for demonstration)

Note: Down sampling preserves the highest-scoring pairs within each hash bucket.
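
One way to picture "highest-scoring pairs within each hash bucket" is the sketch below: rows are assigned to buckets by hashing their search id, and the top `percent` of each bucket is kept by score. This is an illustration under stated assumptions, not MadLib's implementation; in particular `n_buckets` is a hypothetical parameter, whereas the real `bucket_size` argument presumably controls how many rows land in each bucket.

```python
import math
from collections import defaultdict

def down_sample_sketch(rows, percent, id_key, score_key="score", n_buckets=1):
    """Keep the top `percent` of rows by score inside each hash bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[id_key]) % n_buckets].append(row)

    kept = []
    for bucket in buckets.values():
        k = math.ceil(len(bucket) * percent)  # rows to keep from this bucket
        ranked = sorted(bucket, key=lambda r: r[score_key], reverse=True)
        kept.extend(ranked[:k])
    return kept

# Scores taken from the "Original scores" line in the output above.
rows = [
    {"id2": 101, "score": 0.354},
    {"id2": 102, "score": 0.249},
    {"id2": 104, "score": 0.416},
    {"id2": 103, "score": 0.16},
]

print([r["score"] for r in down_sample_sketch(rows, 0.5, "id2")])   # [0.416, 0.354]
print([r["score"] for r in down_sample_sketch(rows, 0.25, "id2")])  # [0.416]
```

With a single bucket, the sketch keeps the same pairs as the 50% and 25% runs above. Bucketing matters on large datasets because ranking happens inside each bucket, so no global sort is needed.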


## Seed Creation

Seeds are initial labeled examples used to train machine learning models. MadLib supports various labeling strategies for creating these seeds.

In [9]:
print("SEED CREATION EXAMPLES:\n")

# Create different types of labelers for demonstration

# 1. Gold standard labeler (ground truth)
gold_labels = pd.DataFrame({
    'id1': [1, 3, 4],        # Records from table A
    'id2': [101, 103, 104]   # Matching records from table B  
})
print("Ground truth matches:")
print(gold_labels)
print()

gold_labeler = {'name': 'gold', 'gold': gold_labels}

# 2. Create a custom always-positive labeler
class AlwaysPositiveLabeler(Labeler):
    def __call__(self, id1, id2):
        return 1.0  # Always return positive match

# 3. Create a custom rule-based labeler  
class RuleBasedLabeler(Labeler):
    def __call__(self, id1, id2):
        # Simple rule: match if id2 - id1 == 100
        return 1.0 if (id2 - id1) == 100 else 0.0

print("DIFFERENT LABELING STRATEGIES:\n")

# Test with gold labeler
seeds_gold = create_seeds(fvs, nseeds=3, labeler=gold_labeler)
print("GOLD STANDARD LABELER:")
print(f" Seeds created: {len(seeds_gold)}")
print(f" Positive labels: {sum(seeds_gold['label'] == 1.0)}")
print(f" Negative labels: {sum(seeds_gold['label'] == 0.0)}")
print("  Sample seeds:")
print(seeds_gold[['id1', 'id2', 'score', 'label']].to_string(index=False))
print()

# Test with rule-based labeler
rule_labeler = RuleBasedLabeler()
seeds_rule = create_seeds(fvs, nseeds=3, labeler=rule_labeler)
print("RULE-BASED LABELER:")
print(f" Seeds created: {len(seeds_rule)}")
print(f" Positive labels: {sum(seeds_rule['label'] == 1.0)}")
print(f" Negative labels: {sum(seeds_rule['label'] == 0.0)}")
print("  Sample seeds:")
print(seeds_rule[['id1', 'id2', 'score', 'label']].to_string(index=False))
print()

# Test with always-positive labeler
positive_labeler = AlwaysPositiveLabeler()
seeds_positive = create_seeds(fvs, nseeds=2, labeler=positive_labeler)
print("ALWAYS-POSITIVE LABELER:")
print(f" Seeds created: {len(seeds_positive)}")
print(f" All labels: {seeds_positive['label'].tolist()}")
print()

SEED CREATION EXAMPLES:

Ground truth matches:
   id1  id2
0    1  101
1    3  103
2    4  104

DIFFERENT LABELING STRATEGIES:

seeds: pos_count = 2 neg_count = 1
GOLD STANDARD LABELER:
 Seeds created: 3
 Positive labels: 2
 Negative labels: 1
  Sample seeds:
 id1  id2    score  label
   4  104 0.415959    1.0
   3  103 0.159968    1.0
   2  102 0.248558    0.0

seeds: pos_count = 3 neg_count = 0
RULE-BASED LABELER:
 Seeds created: 3
 Positive labels: 3
 Negative labels: 0
  Sample seeds:
 id1  id2    score  label
   4  104 0.415959    1.0
   3  103 0.159968    1.0
   2  102 0.248558    1.0

seeds: pos_count = 2 neg_count = 0
ALWAYS-POSITIVE LABELER:
 Seeds created: 2
 All labels: [1.0, 1.0]
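
Conceptually, a gold labeler is a lookup against the known matches. The standalone class below is a sketch of that idea and does not subclass MadLib's `Labeler`; how the built-in `{'name': 'gold', ...}` spec is implemented internally is an assumption.

```python
class GoldLabelerSketch:
    """Label a pair 1.0 if it appears in the ground-truth set, else 0.0."""

    def __init__(self, gold_pairs):
        # A set gives constant-time membership checks.
        self._gold = set(gold_pairs)

    def __call__(self, id1, id2):
        return 1.0 if (id1, id2) in self._gold else 0.0

labeler = GoldLabelerSketch([(1, 101), (3, 103), (4, 104)])
print(labeler(4, 104))  # 1.0
print(labeler(2, 102))  # 0.0
```

These two results agree with the gold-standard seeds above, where pair (2, 102) is the only negative. The rule-based labeler marks the same pair as a match because 102 - 2 == 100, which shows how a simple rule can disagree with the ground truth.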



## Training a Matcher

MadLib supports various machine learning models for training matchers, including scikit-learn models and custom implementations.

In [10]:
print("TRAINING DIFFERENT TYPES OF MATCHERS:\n")

# Use the gold standard seeds for training
seeds = seeds_gold
labeled_data = seeds.copy()

print(f"Training data: {len(labeled_data)} labeled examples")
print(f"Feature vector dimension: {len(labeled_data['features'].iloc[0])}")
print()

# 1. Logistic Regression
model_lr = train_matcher(
    {'model_type': 'sklearn', 'model': LogisticRegression, 'nan_fill': 0, 'model_args': {'random_state': 42}}, 
    labeled_data
)
print("LOGISTIC REGRESSION MODEL:")
print(f" Model type: {type(model_lr.trained_model).__name__}")
print(f" Parameters: {model_lr.params_dict()}")
print()

# 2. Random Forest
model_rf = train_matcher(
    {'model_type': 'sklearn', 'model': RandomForestClassifier, 'nan_fill': 0,
     'model_args': {'n_estimators': 10, 'random_state': 42}}, 
    labeled_data
)
print("RANDOM FOREST MODEL:")
print(f" Model type: {type(model_rf.trained_model).__name__}")
print(f" Parameters: {model_rf.params_dict()}")
print()

# 3. Custom MLModel implementation
class SimpleThresholdModel(MLModel):
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self._trained = False
        self._trained_model = None
        
    @property
    def nan_fill(self): return 0.0
    @property  
    def use_vectors(self): return False
    @property
    def use_floats(self): return True

    @property
    def trained_model(self):
        return self._trained_model
        
    def train(self, df, vector_col, label_column, return_estimator=False):
        self._trained = True
        self._trained_model = self
        return self
        
    def predict(self, df, vector_col, output_col):
        # Simple rule: predict 1 if first feature > threshold
        df = df.copy()
        df[output_col] = df[vector_col].apply(lambda x: 1.0 if len(x) > 0 and x[0] > self.threshold else 0.0)
        return df
        
    def prediction_conf(self, df, vector_col, label_column):
        df = df.copy()
        df['conf'] = 0.8  # Fixed confidence
        return df
        
    def entropy(self, df, vector_col, output_col):
        df = df.copy()
        df[output_col] = 0.5  # Fixed entropy
        return df
        
    def params_dict(self):
        return {'threshold': self.threshold, 'trained': self._trained}

custom_model = SimpleThresholdModel(threshold=0.1)
model_custom = train_matcher(custom_model, labeled_data)
print("CUSTOM THRESHOLD MODEL:")
print(f" Model type: {type(model_custom).__name__}")
print(f" Parameters: {model_custom.params_dict()}")
print()

print("All models trained successfully!")

TRAINING DIFFERENT TYPES OF MATCHERS:

Training data: 3 labeled examples
Feature vector dimension: 8

LOGISTIC REGRESSION MODEL:
 Model type: LogisticRegression
 Parameters: {'model': "<class 'sklearn.linear_model._logistic.LogisticRegression'>", 'nan_fill': 0, 'model_args': {'random_state': 42}}

RANDOM FOREST MODEL:
 Model type: RandomForestClassifier
 Parameters: {'model': "<class 'sklearn.ensemble._forest.RandomForestClassifier'>", 'nan_fill': 0, 'model_args': {'n_estimators': 10, 'random_state': 42}}

CUSTOM THRESHOLD MODEL:
 Model type: SimpleThresholdModel
 Parameters: {'threshold': 0.1, 'trained': True}

All models trained successfully!
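
The model spec dictionaries used above bundle an estimator class, its keyword arguments, and a `nan_fill` value. The sketch below shows one plausible way such a spec could be consumed; both helper functions and the stand-in estimator are hypothetical, and MadLib's actual handling may differ.

```python
import math

def fill_nan(vector, fill_value):
    """Replace NaN entries so the estimator only sees finite numbers."""
    return [fill_value if math.isnan(x) else x for x in vector]

def build_estimator(spec):
    # Assumption: the class under 'model' is instantiated with
    # 'model_args' passed as keyword arguments.
    return spec["model"](**spec.get("model_args", {}))

class MeanThreshold:
    """Hypothetical stand-in estimator: predicts 1.0 when the mean feature exceeds a cutoff."""

    def __init__(self, cutoff=0.5):
        self.cutoff = cutoff

    def predict_one(self, vector):
        return 1.0 if sum(vector) / len(vector) > self.cutoff else 0.0

spec = {"model": MeanThreshold, "nan_fill": 0.0, "model_args": {"cutoff": 0.25}}
model = build_estimator(spec)
vector = fill_nan([0.9, math.nan, 0.3], spec["nan_fill"])
print(vector)                     # [0.9, 0.0, 0.3]
print(model.predict_one(vector))  # 1.0
```

Passing the class rather than an instance keeps the spec easy to record, which matches the `params_dict()` output above where the model appears as a class name string.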


## Applying a Matcher

Once trained, matchers can be applied to new data to generate predictions and confidence scores.

In [11]:
print("APPLYING TRAINED MATCHERS:\n")

# Apply different trained models to the same feature vectors
test_fvs = fvs.copy()

# 1. Apply Logistic Regression model
result_lr = apply_matcher(model_lr, test_fvs, feature_col='features', output_col='lr_prediction')
print("LOGISTIC REGRESSION PREDICTIONS:")
print(result_lr[['id1', 'id2', 'score', 'lr_prediction']].head())
print(f" Positive predictions: {sum(result_lr['lr_prediction'] == 1.0)}/{len(result_lr)}")
print()

# 2. Apply Random Forest model
result_rf = apply_matcher(model_rf, test_fvs, feature_col='features', output_col='rf_prediction')
print("RANDOM FOREST PREDICTIONS:")
print(result_rf[['id1', 'id2', 'score', 'rf_prediction']].head())
print(f" Positive predictions: {sum(result_rf['rf_prediction'] == 1.0)}/{len(result_rf)}")
print()

# 3. Apply Custom Threshold model
result_custom = apply_matcher(model_custom, test_fvs, feature_col='features', output_col='custom_prediction')
print("CUSTOM THRESHOLD PREDICTIONS:")
print(result_custom[['id1', 'id2', 'score', 'custom_prediction']].head())
print(f" Positive predictions: {sum(result_custom['custom_prediction'] == 1.0)}/{len(result_custom)}")
print()

# 4. Compare all predictions
comparison = pd.DataFrame({
    'id1': result_lr['id1'],
    'id2': result_lr['id2'],
    'original_score': result_lr['score'].round(3),
    'lr_pred': result_lr['lr_prediction'],
    'rf_pred': result_rf['rf_prediction'],
    'custom_pred': result_custom['custom_prediction']
})

print("PREDICTION COMPARISON:")
print(comparison.to_string(index=False))
print()

# Calculate agreement between models
lr_rf_agreement = sum(result_lr['lr_prediction'] == result_rf['rf_prediction']) / len(result_lr)
print(f"Model Agreement:")
print(f" LR vs RF: {lr_rf_agreement:.2%}")
print(f" LR vs Custom: {sum(result_lr['lr_prediction'] == result_custom['custom_prediction']) / len(result_lr):.2%}")
print(f" RF vs Custom: {sum(result_rf['rf_prediction'] == result_custom['custom_prediction']) / len(result_rf):.2%}")

APPLYING TRAINED MATCHERS:

LOGISTIC REGRESSION PREDICTIONS:
   id1  id2     score  lr_prediction
0    1  101  0.353677            1.0
1    2  102  0.248558            1.0
2    4  104  0.415959            1.0
3    3  103  0.159968            1.0
 Positive predictions: 4/4

RANDOM FOREST PREDICTIONS:
   id1  id2     score  rf_prediction
0    1  101  0.353677            1.0
1    2  102  0.248558            0.0
2    4  104  0.415959            1.0
3    3  103  0.159968            1.0
 Positive predictions: 3/4

CUSTOM THRESHOLD PREDICTIONS:
   id1  id2     score  custom_prediction
0    1  101  0.353677                0.0
1    2  102  0.248558                0.0
2    4  104  0.415959                0.0
3    3  103  0.159968                0.0
 Positive predictions: 0/4

PREDICTION COMPARISON:
 id1  id2  original_score  lr_pred  rf_pred  custom_pred
   1  101           0.354      1.0      1.0          0.0
   2  102           0.249      1.0      0.0          0.0
   4  104           0.416    
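
The agreement rate used in the cell above is simply the share of pairs on which two models predict the same label. The sketch below recomputes it from the three prediction tables printed above, using plain lists in the same row order.

```python
def agreement(a, b):
    """Fraction of positions where two prediction lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Predictions copied from the LR, RF, and custom threshold tables above.
lr = [1.0, 1.0, 1.0, 1.0]
rf = [1.0, 0.0, 1.0, 1.0]
custom = [0.0, 0.0, 0.0, 0.0]

print(f"LR vs RF: {agreement(lr, rf):.2%}")              # LR vs RF: 75.00%
print(f"LR vs Custom: {agreement(lr, custom):.2%}")      # LR vs Custom: 0.00%
print(f"RF vs Custom: {agreement(rf, custom):.2%}")      # RF vs Custom: 25.00%
```

The custom model predicts 0.0 everywhere because it thresholds only the first feature, `exact_match(name, name)`, and none of the name pairs in this toy dataset match exactly.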

## Active Learning Labeling


Active learning helps efficiently label data by selecting the most informative examples for human review.

In [12]:
print("ACTIVE LEARNING LABELING:\n")

print("BATCH MODE ACTIVE LEARNING:")
print("Batch mode waits for a batch of examples to be labeled before training a new model.")
labels_batch = label_data(
    model_spec={'model_type': 'sklearn', 'model': LogisticRegression, 'nan_fill': 0.0, 'model_args': {'random_state': 42}},
    mode='batch',
    labeler_spec=gold_labeler,
    fvs=fvs
)
print(f"✓ Batch mode completed: {len(labels_batch)} examples labeled")
print(f"  - Positive matches: {sum(labels_batch['label'] == 1.0)}")
print(f"  - Negative matches: {sum(labels_batch['label'] == 0.0)}")
print()

print("CONTINUOUS MODE ACTIVE LEARNING:")
print("Continuous mode trains new models as examples are labeled.")
labels_continuous = label_data(
    model_spec={'model_type': 'sklearn', 'model': LogisticRegression, 'nan_fill': 0.0, 'model_args': {'random_state': 42}},
    mode='continuous',
    labeler_spec=gold_labeler,
    fvs=fvs,
)
print(f"✓ Continuous mode completed: {len(labels_continuous)} examples labeled")
print(f"  - Positive matches: {sum(labels_continuous['label'] == 1.0)}")
print(f"  - Negative matches: {sum(labels_continuous['label'] == 0.0)}")
print()

print("ACTIVE LEARNING INSIGHTS:")
print("• Both modes automatically stop when all available examples are labeled")
print("• With small datasets (like this 4-record example), both modes will label all examples")
print("• In larger datasets, active learning focuses on the most uncertain/informative examples")
print("• The 'Ran out of examples' message simply indicates the algorithm has processed all available data")
print()

print("LABELED RESULTS COMPARISON:")
print("Batch mode results:")
print(labels_batch[['id1', 'id2', 'label']].to_string(index=False))
print()
print("Continuous mode results:")
print(labels_continuous[['id1', 'id2', 'label']].to_string(index=False))

ACTIVE LEARNING LABELING:

BATCH MODE ACTIVE LEARNING:
Batch mode waits for a batch of examples to be labeled before training a new model.
Ran out of examples before reaching nseeds
seeds: pos_count = 2 neg_count = 1


[ent_active_learner.py:120 - train() ] 2025-06-20 14:42:10,059 : running al to completion would label everything, but self._terminate_if_label_everything is False so AL will still run
[ent_active_learner.py:138 - train() ] 2025-06-20 14:42:10,103 : max iter = 4
[ent_active_learner.py:142 - train() ] 2025-06-20 14:42:10,104 : starting iteration 0
[ent_active_learner.py:144 - train() ] 2025-06-20 14:42:10,104 : training model
[ent_active_learner.py:152 - train() ] 2025-06-20 14:42:10,285 : selecting and labeling new examples
[ent_active_learner.py:186 - train() ] 2025-06-20 14:42:10,683 : new batch positive = 1.0 negative = 0.0, total positive = 3.0 negative = 1.0
[ent_active_learner.py:189 - train() ] 2025-06-20 14:42:10,685 : all fvs labeled, terminating active learning


✓ Batch mode completed: 4 examples labeled
  - Positive matches: 3
  - Negative matches: 1

CONTINUOUS MODE ACTIVE LEARNING:
Continuous mode trains new models as examples are labeled.
Ran out of examples before reaching nseeds
seeds: pos_count = 2 neg_count = 1


[cont_entropy_active_learner.py:177 - _training_loop() ] 2025-06-20 14:42:11,896 : Insufficient examples for active learning: 4 total, 3 seeds. No unlabeled examples to select from. Labeling all examples.


✓ Continuous mode completed: 4 examples labeled
  - Positive matches: 3
  - Negative matches: 1

ACTIVE LEARNING INSIGHTS:
• Both modes automatically stop when all available examples are labeled
• With small datasets (like this 4-record example), both modes will label all examples
• In larger datasets, active learning focuses on the most uncertain/informative examples
• The 'Ran out of examples' message simply indicates the algorithm has processed all available data

LABELED RESULTS COMPARISON:
Batch mode results:
 id1  id2  label
   4  104    1.0
   3  103    1.0
   2  102    0.0
   1  101    1.0

Continuous mode results:
 id1  id2  label
   4  104    1.0
   3  103    1.0
   1  101    1.0
   2  102    0.0
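
The insights above note that on larger datasets active learning concentrates on the most uncertain examples. The log lines from `ent_active_learner.py` and `cont_entropy_active_learner.py` suggest the uncertainty measure is entropy, and the sketch below illustrates that idea with plain NumPy. It is not MadLib's implementation: `binary_entropy`, `select_most_uncertain`, and the probabilities are hypothetical names and values chosen only for illustration.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a match probability; hypothetical helper, not a MadLib function."""
    # Clip to avoid log2(0) for probabilities of exactly 0 or 1
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def select_most_uncertain(match_probs, batch_size):
    """Return the positions of the batch_size highest-entropy predictions."""
    entropy = binary_entropy(match_probs)
    # Sort by descending entropy; a stable sort keeps ties in input order
    order = np.argsort(-entropy, kind="stable")
    return order[:batch_size]

# Illustrative model outputs for five unlabeled candidate pairs (made-up values)
probs = np.array([0.97, 0.52, 0.10, 0.45, 0.80])
picked = select_most_uncertain(probs, batch_size=2)
print(picked)
```

Pairs whose predicted match probability is close to 0.5 have the highest entropy, so they are the ones sent for labeling first.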


## Custom Abstract Classes

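MadLib can be customized for your use case: extend the abstract base classes `Tokenizer`, `Feature`, `MLModel`, and `CustomLabeler` to plug in your own tokenizers, features, ML models, and labelers. The cell below implements a custom tokenizer, feature, and labeler, then uses the tokenizer and feature in the feature-creation and featurization steps.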


In [13]:
print("CUSTOM IMPLEMENTATION EXAMPLES:\n")

# 1. Custom Tokenizer
class ReverseWordTokenizer(Tokenizer):
    """Tokenizer that lowercases the input, splits on whitespace, and reverses each word (returns None for non-string input)"""
    NAME = 'reverse_word_tokens'
    
    def tokenize(self, s):
        if not isinstance(s, str):
            return None
        words = s.lower().split()
        return [word[::-1] for word in words]  # Reverse each word

# Test custom tokenizer
reverse_tokenizer = ReverseWordTokenizer()
print("CUSTOM TOKENIZER (ReverseWordTokenizer):")
test_string = "Alice Smith"
print(f" Input: '{test_string}'")
print(f" Tokens: {reverse_tokenizer.tokenize(test_string)}")
print(f" Name: {reverse_tokenizer.NAME}")
print()

# 2. Custom Feature
class WordCountDifferenceFeature(Feature):
    """Feature that computes absolute difference in word count"""
    
    def __str__(self):
        return f'word_count_diff({self.a_attr}, {self.b_attr})'
    
    def __call__(self, rec, recs):
        b_value = rec[self.b_attr]
        a_values = recs[self.a_attr]
        
        if not isinstance(b_value, str):
            return pd.Series(np.nan, index=a_values.index)
        
        b_word_count = len(b_value.split()) if b_value else 0
        
        def word_count_diff(a_value):
            if not isinstance(a_value, str):
                return np.nan
            a_word_count = len(a_value.split()) if a_value else 0
            return abs(a_word_count - b_word_count)
        
        return a_values.apply(word_count_diff).astype(np.float64)
    
    def _preprocess(self, data, input_col):
        return data  # No preprocessing needed
    
    def _preprocess_output_column(self, attr):
        return None  # No preprocessing output

# Test custom feature
custom_feature = WordCountDifferenceFeature('name', 'name')
print("CUSTOM FEATURE (WordCountDifferenceFeature):")
print(f" Feature string: {custom_feature}")

# Create test data for feature
test_rec = {'name': 'Alice Smith'}
test_recs = pd.DataFrame({'name': ['Bob Jones', 'Jean-Pierre O\'Connor', 'X']})
feature_result = custom_feature(test_rec, test_recs)
print(f" Input B: '{test_rec['name']}' (2 words)")
print(f" Input A values: {test_recs['name'].tolist()}")
print(f" Word count differences: {feature_result.tolist()}")
print()

# 3. Custom Labeler with complex logic
class SmartSimilarityLabeler(CustomLabeler):
    """Labeler that uses multiple criteria for matching"""
    
    def label_pair(self, row1, row2):
        # Name similarity: Jaccard overlap of the lowercase word sets
        name1_words = set(row1['name'].lower().split())
        name2_words = set(row2['name'].lower().split())
        name_overlap = len(name1_words & name2_words) / max(len(name1_words | name2_words), 1)
        
        # Absolute age difference (infinite if either age is None)
        age1, age2 = row1.get('age'), row2.get('age')
        age_diff = abs(age1 - age2) if (age1 is not None and age2 is not None) else float('inf')
        
        # Labeling logic
        if name_overlap >= 0.5 and age_diff <= 5:
            return 1.0  # Strong match
        elif name_overlap >= 0.3 or age_diff <= 2:
            return 0.5  # Uncertain (could be treated as unsure)
        else:
            return 0.0  # No match

smart_labeler = SmartSimilarityLabeler(A, B)
print("CUSTOM LABELER (SmartSimilarityLabeler):")
print(" Logic: High name overlap + small age difference = match")

# Test the smart labeler
test_pairs = [(1, 101), (2, 102), (3, 103), (4, 104)]
for id1, id2 in test_pairs:
    label = smart_labeler(id1, id2)
    row1 = A[A['_id'] == id1].iloc[0]
    row2 = B[B['_id'] == id2].iloc[0]
    print(f" Pair ({id1},{id2}): '{row1['name']}' vs '{row2['name']}' → {label}")
print()

# 4. Demonstrate custom implementations in pipeline
print("USING CUSTOM COMPONENTS IN PIPELINE:")

# Create features with custom tokenizer
custom_features = create_features(
    A, B, 
    a_cols=['name'], b_cols=['name'],
    tokenizers=[reverse_tokenizer],
    sim_functions=get_base_sim_functions()[:2]  # Use first 2 similarity functions
)

# Add our custom feature
custom_features.append(custom_feature)

print(f" Created {len(custom_features)} features (including custom)")
for i, feature in enumerate(custom_features):
    print(f"   {i+1}. {feature}")
print()

# Test featurization with custom features
small_candidates = pd.DataFrame({'id1_list': [[1], [2]], 'id2': [101, 102]})
custom_fvs = featurize(custom_features, A, B, small_candidates)

print(f" Custom featurization result shape: {custom_fvs.shape}")
print(f" Feature vector length: {len(custom_fvs['features'].iloc[0])}")
print("  Custom components integrated successfully!")


CUSTOM IMPLEMENTATION EXAMPLES:

CUSTOM TOKENIZER (ReverseWordTokenizer):
 Input: 'Alice Smith'
 Tokens: ['ecila', 'htims']
 Name: reverse_word_tokens

CUSTOM FEATURE (WordCountDifferenceFeature):
 Feature string: word_count_diff(name, name)
 Input B: 'Alice Smith' (2 words)
 Input A values: ['Bob Jones', "Jean-Pierre O'Connor", 'X']
 Word count differences: [0.0, 0.0, 1.0]

CUSTOM LABELER (SmartSimilarityLabeler):
 Logic: High name overlap + small age difference = match
 Pair (1,101): 'Alice Smith' vs 'Alicia Smith' → 0.5
 Pair (2,102): 'Bob Jones' vs 'Robert Jones' → 0.5
 Pair (3,103): 'Carol Davis' vs 'Caroline Davis' → 0.5
 Pair (4,104): 'David Wilson' vs 'Dave Wilson' → 0.5

USING CUSTOM COMPONENTS IN PIPELINE:
 Created 2 features (including custom)
   1. exact_match(name, name)
   2. word_count_diff(name, name)

 Custom featurization result shape: (2, 4)
 Feature vector length: 2
  Custom components integrated successfully!
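
Every pair in the custom labeler output above is labeled 0.5. This follows from the name-overlap arithmetic alone. Each pair shares exactly one of three distinct words, so the overlap is 1/3. That is below the 0.5 required for a 1.0 label and above the 0.3 that triggers a 0.5 label, whatever the ages are. The snippet below re-checks that arithmetic outside MadLib; `name_overlap` is a hypothetical helper that mirrors the expression used in `label_pair`.

```python
def name_overlap(name1, name2):
    """Jaccard overlap of lowercase word sets, mirroring the expression in label_pair."""
    words1 = set(name1.lower().split())
    words2 = set(name2.lower().split())
    return len(words1 & words2) / max(len(words1 | words2), 1)

# The four name pairs shown in the labeler output
pairs = [
    ("Alice Smith", "Alicia Smith"),
    ("Bob Jones", "Robert Jones"),
    ("Carol Davis", "Caroline Davis"),
    ("David Wilson", "Dave Wilson"),
]

overlaps = [name_overlap(a, b) for a, b in pairs]
for (a, b), overlap in zip(pairs, overlaps):
    print(f"{a!r} vs {b!r}: overlap = {overlap:.3f}")
```

Because nicknames such as "Dave" and "David" are different tokens, a whole-word overlap never reaches 0.5 for these pairs. A labeler meant to return 1.0 for them would need a lower name threshold or a character-level similarity.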


In [14]:
# Clean up Spark session
spark.stop()
print("Spark session stopped. Notebook complete!")


Spark session stopped. Notebook complete!


---

## Summary

This notebook demonstrated the key functionality of MadLib:

### Core Functions Covered
**Tokenizers & Similarity Functions**: Understanding the building blocks  
**Feature Creation**: Automatic and custom feature generation  
**Featurization**: Converting candidate pairs to feature vectors  
**Down Sampling**: Intelligent dataset reduction  
**Seed Creation**: Generating initial training labels  
**Model Training**: Multiple ML approaches  
**Model Application**: Making predictions  
**Active Learning**: Efficient labeling strategies  

### Abstract Classes Extended
**Custom Tokenizer**: `ReverseWordTokenizer`  
**Custom Feature**: `WordCountDifferenceFeature`  
**Custom MLModel**: `SimpleThresholdModel`  
**Custom Labeler**: `SmartSimilarityLabeler`  



For more information, visit the [MadMatcher Website](https://madmatcher.ai) or explore the source code in the `MadLib` package.