# Medical Specialty Standardization System

This notebook implements a comprehensive system to standardize healthcare provider specialties against the NUCC (National Uniform Claim Committee) taxonomy. It uses multiple matching strategies including exact matching, fuzzy matching, and semantic similarity to map raw specialty text to standardized codes.

## Step 1: Import Required Libraries

We start by importing all necessary libraries for data processing, machine learning, and text matching.

In [16]:
!pip install rapidfuzz sentence-transformers torch scikit-learn pandas numpy




In [17]:
import pandas as pd
import numpy as np
import re
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
from enum import Enum
import warnings
warnings.filterwarnings('ignore')

from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer, util
import torch
from sklearn.isotonic import IsotonicRegression

print("✓ All imports successful")

✓ All imports successful


## Step 2: Define Comprehensive Medical Abbreviations Map

This dictionary maps common medical abbreviations to their full specialty names. This is crucial for preprocessing raw specialty text that contains abbreviated forms like 'cardio' → 'cardiology', 'obgyn' → 'obstetrics and gynecology', etc.

In [18]:
MEDICAL_ABBREVIATIONS = {
    r"\bcardio\b": "cardiology",
    r"\bcard\b": "cardiology",
    r"\bcv\b": "cardiovascular",
    r"\bcvs\b": "cardiovascular",
    r"\bent\b": "otolaryngology",
    r"\bento\b": "otolaryngology",
    r"\bot\b": "otolaryngology",
    r"\bsurg\b": "surgery",
    r"\bcardiothoracic\b": "cardiac surgery",
    r"\bthracic\b": "thoracic surgery",
    r"\bobgyn\b": "obstetrics and gynecology",
    r"\bob-gyn\b": "obstetrics and gynecology",
    r"\bobs\b": "obstetrics",
    r"\bgyn\b": "gynecology",
    r"\burol\b": "urology",
    r"\buro\b": "urology",
    r"\bortho\b": "orthopedics",
    r"\borthopaedic\b": "orthopedics",
    r"\borthopedic\b": "orthopedics",
    r"\bpsych\b": "psychiatry",
    r"\bpsy\b": "psychiatry",
    r"\bneuro surg(?:ery)?\b": "neurological surgery",  # must precede the bare "neuro" rule; r"\bsurg\b" above has already expanded "surg"
    r"\bneuro\b": "neurology",
    r"\bderma\b": "dermatology",
    r"\bderm\b": "dermatology",
    r"\blap path\b": "laboratory pathology",  # multi-word rule must run before the bare "path" rule
    r"\bpath\b": "pathology",
    r"\brad\b": "radiology",
    r"\bradiotherapy\b": "radiation therapy",
    r"\bicu\b": "critical care medicine",
    r"\bccu\b": "cardiac care",
    r"\bcritical\b(?! care)": "critical care medicine",  # lookahead skips text already expanded from "icu"
    r"\bpedi\b": "pediatrics",
    r"\bped\b": "pediatrics",
    r"\bpediatric\b": "pediatrics",
    r"\bim doc\b": "internal medicine",  # multi-word rule must run before the bare "im" rule
    r"\bim\b": "internal medicine",
    r"\bpt\b": "physical therapy",
    r"\bphysical med\b": "physical medicine and rehabilitation",
    r"\bmd\b": "medical doctor",
    r"\brn\b": "registered nurse",
    r"\blpn\b": "licensed practical nurse",
    r"\bpa\b": "physician assistant",
    r"\bnp\b": "nurse practitioner",
}

print(f"✓ Loaded {len(MEDICAL_ABBREVIATIONS)} abbreviation mappings")

✓ Loaded 44 abbreviation mappings
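
The patterns are applied one after another in dictionary order, so the relative order of overlapping rules matters. The sketch below illustrates this with a two-rule subset, deliberately listed with the shorter rule first. `expand_in_order` mirrors the sequential loop used later in the preprocessor, while `expand_longest_first` is a hypothetical alternative that is not part of this notebook.

```python
import re

# Two overlapping rules, deliberately listed with the shorter one first.
RULES = {
    r"\bneuro\b": "neurology",
    r"\bneuro surg\b": "neurological surgery",
}

def expand_in_order(text: str, rules: dict) -> str:
    """Apply each rule in dict order, feeding the output of one into the next."""
    for pattern, expansion in rules.items():
        text = re.sub(pattern, expansion, text)
    return text

def expand_longest_first(text: str, rules: dict) -> str:
    """Hypothetical alternative: a single pass that tries the longest pattern first."""
    ordered = sorted(rules, key=len, reverse=True)
    combined = re.compile("|".join(f"({p})" for p in ordered))
    # m.lastindex is the 1-based number of the alternative that matched.
    return combined.sub(lambda m: rules[ordered[m.lastindex - 1]], text)

in_order = expand_in_order("neuro surg", RULES)
longest_first = expand_longest_first("neuro surg", RULES)
print(in_order)       # neurology surg
print(longest_first)  # neurological surgery
```

With sequential application, the short rule rewrites `neuro` before the two-word rule can match. A single pass also avoids re-expanding text that an earlier rule produced.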


## Step 3: Define Enums and Data Classes

We use enums to track the matching method used, and dataclasses to structure the match results. This provides type safety and clarity about what data each match operation produces.

In [19]:
class MatchMethod(Enum):
    """Enum to track which matching strategy was used"""
    EXACT_MATCH = "exact_match"
    FUZZY_MATCH = "fuzzy_match"
    SEMANTIC_MATCH = "semantic_match"
    FALLBACK_MATCH = "fallback_match"
    NO_MATCH = "no_match"
    EMPTY_INPUT = "empty_input"

@dataclass
class MatchResult:
    """Dataclass to hold results of a specialty matching operation"""
    primary_code: str
    primary_confidence: float
    calibrated_confidence: float
    method: MatchMethod
    is_multi_specialty: bool
    alternatives: List[Tuple[str, float]]

print("✓ MatchMethod and MatchResult defined")

✓ MatchMethod and MatchResult defined
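
As a minimal sketch of how these types are used, the block below builds one result and flattens it into a plain dictionary. The definitions are repeated (with a shortened enum) so the block runs on its own, and the scores are made-up illustrative values.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import List, Tuple

class MatchMethod(Enum):
    EXACT_MATCH = "exact_match"
    NO_MATCH = "no_match"

@dataclass
class MatchResult:
    primary_code: str
    primary_confidence: float
    calibrated_confidence: float
    method: MatchMethod
    is_multi_specialty: bool
    alternatives: List[Tuple[str, float]]

# Illustrative values only; the scores are made up.
result = MatchResult(
    primary_code="207L00000X",
    primary_confidence=0.98,
    calibrated_confidence=0.95,
    method=MatchMethod.EXACT_MATCH,
    is_multi_specialty=False,
    alternatives=[("207LP3000X", 0.71)],
)

row = asdict(result)                 # dataclass -> plain dict
row["method"] = result.method.value  # Enum member -> its string value
print(row["primary_code"], row["method"])

# Enum lookup by value, e.g. when reading a method name back from a CSV
restored = MatchMethod("no_match")
```

Storing `method.value` rather than the enum member keeps the output serializable, which is what the result formatter in Step 8 does for the `Method` column.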


## Step 4: Create the Specialty Preprocessor

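The preprocessor applies these steps in order:
- **Null/empty value handling**: Returns an empty string for missing values and for inputs shorter than two characters
- **ID removal**: Strips a leading or trailing NUCC code separated by a hyphen (e.g. `Cardio Surgery - 2058367891X`)
- **Lowercasing**: Converts to lowercase for consistent matching
- **Abbreviation expansion**: Applies the abbreviation map, one pattern at a time in dictionary order
- **Character normalization**: Replaces `/` and `&` with `and`; replaces hyphens, underscores, commas, and parentheses with spaces
- **Stopword removal**: Drops non-informative words such as 'service', 'center', 'clinic', along with single-character tokens
- **Misspelling correction**: Fixes common typos in medical terms

Because abbreviations are expanded before underscores are normalized, an input such as `neuro_surg(brain)` is not expanded: `_` counts as a word character, so `\bneuro\b` does not match. The test output below shows this case.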

In [20]:
class SpecialtyPreprocessor:
    """Preprocesses raw specialty text for matching"""

    def __init__(self):
        self.abbreviation_map = MEDICAL_ABBREVIATIONS

    def preprocess(self, text: str) -> Tuple[str, bool]:
        """
        Main preprocessing function.
        Returns: (cleaned_text, is_compound_specialty)
        """
        # Handle null and empty values
        if pd.isna(text) or text == '':
            return '', False

        text = str(text).strip()
        if len(text) < 2:
            return '', False

        # Remove NUCC codes (format: 10 alphanumeric chars, optionally ending with X)
        text = re.sub(r'\s*-\s*[0-9A-Z]{10}X?\s*$', '', text, flags=re.IGNORECASE)
        text = re.sub(r'^[0-9A-Z]{10}X?\s*-\s*', '', text, flags=re.IGNORECASE)

        # Convert to lowercase
        text = text.lower()

        # Expand abbreviations
        for abbrev_pattern, expansion in self.abbreviation_map.items():
            text = re.sub(abbrev_pattern, expansion, text, flags=re.IGNORECASE)

        # Normalize special characters
        text = re.sub(r'[/&]', ' and ', text)
        text = re.sub(r'[\-_]', ' ', text)
        text = re.sub(r'[,()]', ' ', text)

        # Remove stopwords
        stop_words = {'service', 'center', 'clinic', 'hospital', 'department',
                      'medical', 'healthcare', 'provider', 'physician', 'doctor',
                      'general', 'office', 'practice', 'specialty', 'specialization'}
        words = text.split()
        words = [w for w in words if w not in stop_words and len(w) > 1]
        text = ' '.join(words)

        # Fix common misspellings
        text = self._fix_common_misspellings(text)

        # Clean up extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()

        # Detect if this is a compound specialty (contains 'and' or 3+ words)
        is_compound = ' and ' in text or (len(text.split()) >= 3)

        return text, is_compound

    def _fix_common_misspellings(self, text: str) -> str:
        """Fix commonly misspelled medical terms"""
        corrections = {
            'clinal': 'clinical',
            'cardiak': 'cardiac',
            'diabetus': 'diabetes',
            'ural': 'urology',
            'oncolog': 'oncology',
            'patho': 'pathology',
            'radiolog': 'radiology',
            'throacic': 'thoracic',
            'neurolog': 'neurology',
        }

        for typo, correction in corrections.items():
            text = re.sub(r'\b' + typo + r'\b', correction, text, flags=re.IGNORECASE)

        return text

# Test the preprocessor
test_preprocessor = SpecialtyPreprocessor()
test_cases = [
    'Cardio Surgery - 2058367891X',
    'obgyn & Obstetrics',
    'neuro_surg(brain)',
    'ENT/Otolaryngology Department',
]

print("\n✓ Preprocessor test cases:")
for test in test_cases:
    cleaned, is_compound = test_preprocessor.preprocess(test)
    print(f"  '{test}' → '{cleaned}' (compound: {is_compound})")


✓ Preprocessor test cases:
  'Cardio Surgery - 2058367891X' → 'cardiology surgery' (compound: False)
  'obgyn & Obstetrics' → 'obstetrics and gynecology and obstetrics' (compound: True)
  'neuro_surg(brain)' → 'neuro surg brain' (compound: True)
  'ENT/Otolaryngology Department' → 'otolaryngology and otolaryngology' (compound: True)
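
Two of the behaviours in the test output can be reproduced in isolation. The sketch below reuses the trailing-code regex from `preprocess` and then checks why `neuro_surg` is left unexpanded; the variable names are illustrative only.

```python
import re

# 1) Trailing NUCC code removal, using the same regex as preprocess().
raw = "Cardio Surgery - 2058367891X"
without_code = re.sub(r"\s*-\s*[0-9A-Z]{10}X?\s*$", "", raw, flags=re.IGNORECASE)

# 2) "_" is a word character, so \b does not fire between "o" and "_".
matches_before = re.search(r"\bneuro\b", "neuro_surg") is not None

# Once "_" is replaced by a space, the word boundary exists.
normalized = re.sub(r"[\-_]", " ", "neuro_surg")
matches_after = re.search(r"\bneuro\b", normalized) is not None

print(without_code)    # Cardio Surgery
print(matches_before)  # False
print(matches_after)   # True
```

Running character normalization before abbreviation expansion would let such inputs be expanded, at the cost of changing how hyphenated patterns like `ob-gyn` need to be written.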


## Step 5: Create the Specialty Matcher

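The matcher tries the following strategies in order and returns the first one that clears its threshold:
1. **Exact Match**: Perfect or near-perfect matches (returned with a fixed confidence of 0.98)
2. **Fuzzy Match**: Handles typos and word-order variations using RapidFuzz's `token_set_ratio`; accepted at a score of 0.85 or higher
3. **Semantic Match**: Uses sentence embeddings (`all-MiniLM-L6-v2`) to find conceptually similar specialties; accepted at a cosine similarity of 0.50 or higher
4. **Multi-Specialty Match**: Splits combined specialties like 'cardiology and internal medicine' on 'and', matches each part semantically, and keeps the best part (reported as `semantic_match`)
5. **Fallback Match**: When nothing above succeeds, reuses the best fuzzy candidate (score of at least 0.70) with its confidence capped at 0.45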

In [21]:
class SpecialtyMatcher:
    """Matches specialty text against NUCC taxonomy using multiple strategies"""

    def __init__(self, nucc_df: pd.DataFrame):
        self.nucc_df = nucc_df.copy()
        self.preprocessor = SpecialtyPreprocessor()

        # Build lookup dictionaries
        self.code_to_display = dict(zip(nucc_df['Code'], nucc_df['Display_Name']))
        self.nucc_display_clean = [
            self.preprocessor.preprocess(name)[0] for name in nucc_df['Display_Name']
        ]

        # Load semantic model for embeddings
        try:
            self.model = SentenceTransformer('all-MiniLM-L6-v2')
            self.nucc_embeddings = self.model.encode(self.nucc_display_clean, convert_to_tensor=True)
            self.semantic_ready = True
            print("✓ Semantic model loaded")
        except Exception as e:
            print(f"⚠ Semantic model failed: {e}")
            self.semantic_ready = False

    def match(self, specialty: str) -> MatchResult:
        """Main matching function - tries multiple strategies in order"""
        cleaned, is_compound = self.preprocessor.preprocess(specialty)

        # Check for empty input
        if not cleaned or len(cleaned) < 2:
            return MatchResult(
                primary_code='JUNK',
                primary_confidence=0.0,
                calibrated_confidence=0.0,
                method=MatchMethod.EMPTY_INPUT,
                is_multi_specialty=False,
                alternatives=[]
            )

        # Try exact match first (highest confidence)
        exact_result = self._exact_match(cleaned)
        if exact_result:
            code, confidence = exact_result
            return self._create_result(code, confidence, MatchMethod.EXACT_MATCH, is_compound, cleaned)

        # Try fuzzy match (good for typos)
        fuzzy_result = self._fuzzy_match(cleaned)
        if fuzzy_result and fuzzy_result[1] >= 0.85:
            code, confidence = fuzzy_result
            return self._create_result(code, confidence, MatchMethod.FUZZY_MATCH, is_compound, cleaned)

        # Try semantic match (good for paraphrasing)
        if self.semantic_ready:
            semantic_result = self._semantic_match(cleaned)
            if semantic_result and semantic_result[1] >= 0.50:
                code, confidence = semantic_result
                return self._create_result(code, confidence, MatchMethod.SEMANTIC_MATCH, is_compound, cleaned)

        # Try multi-specialty match for compound inputs
        if is_compound and ' and ' in cleaned:
            multi_result = self._multi_specialty_match(cleaned)
            if multi_result:
                code, confidence = multi_result
                return self._create_result(code, confidence, MatchMethod.SEMANTIC_MATCH, True, cleaned)

        # Fallback: use fuzzy match with low confidence
        if fuzzy_result:
            code, confidence = fuzzy_result
            confidence = min(confidence, 0.45)
            return self._create_result(code, confidence, MatchMethod.FALLBACK_MATCH, is_compound, cleaned)

        # No match found at all
        return MatchResult(
            primary_code='JUNK',
            primary_confidence=0.0,
            calibrated_confidence=0.0,
            method=MatchMethod.NO_MATCH,
            is_multi_specialty=is_compound,
            alternatives=[]
        )

    def _exact_match(self, cleaned: str) -> Optional[Tuple[str, float]]:
        """Check for exact or near-exact matches"""
        for i, nucc_clean in enumerate(self.nucc_display_clean):
            if nucc_clean == cleaned or (nucc_clean in cleaned and len(cleaned) > 5):
                if fuzz.ratio(cleaned, nucc_clean) > 95:
                    code = self.nucc_df.iloc[i]['Code']
                    return code, 0.98
        return None

    def _fuzzy_match(self, cleaned: str) -> Optional[Tuple[str, float]]:
        """Fuzzy matching using RapidFuzz's token_set_ratio (insensitive to word order and repeated tokens)"""
        best_code = None
        best_score = 0

        for i, nucc_clean in enumerate(self.nucc_display_clean):
            score = fuzz.token_set_ratio(cleaned, nucc_clean) / 100.0
            if score > best_score:
                best_score = score
                best_code = self.nucc_df.iloc[i]['Code']

        if best_score >= 0.70:
            return best_code, best_score
        return None

    def _semantic_match(self, cleaned: str) -> Optional[Tuple[str, float]]:
        """Semantic matching using transformer embeddings"""
        if not self.semantic_ready:
            return None

        input_embedding = self.model.encode(cleaned, convert_to_tensor=True)
        similarities = util.pytorch_cos_sim(input_embedding, self.nucc_embeddings)[0]

        best_idx = torch.argmax(similarities).item()
        best_score = float(similarities[best_idx])

        if best_score >= 0.40:
            code = self.nucc_df.iloc[best_idx]['Code']
            return code, best_score

        return None

    def _get_top_alternatives(self, cleaned: str, top_n: int = 5) -> List[Tuple[str, float]]:
        """Get top N alternative matches for reporting"""
        if not self.semantic_ready:
            return []

        input_embedding = self.model.encode(cleaned, convert_to_tensor=True)
        similarities = util.pytorch_cos_sim(input_embedding, self.nucc_embeddings)[0]

        top_scores, top_indices = torch.topk(similarities, k=min(top_n + 1, len(similarities)))

        alternatives = []
        for idx, score in zip(top_indices.tolist(), top_scores.tolist()):
            code = self.nucc_df.iloc[idx]['Code']
            if score >= 0.35:
                alternatives.append((code, float(score)))

        return alternatives

    def _multi_specialty_match(self, cleaned: str) -> Optional[Tuple[str, float]]:
        """Handle compound specialties like 'cardiology and internal medicine'"""
        parts = [p.strip() for p in re.split(r'\s+and\s+', cleaned)]
        if len(parts) < 2:
            return None

        part_matches = []
        for part in parts:
            if len(part) < 3:
                continue
            result = self._semantic_match(part)
            if result:
                part_matches.append(result)

        if not part_matches:
            return None

        best_code, best_conf = max(part_matches, key=lambda x: x[1])
        return best_code, best_conf * 0.95

    def _create_result(self, code: str, confidence: float, method: MatchMethod,
                     is_compound: bool, cleaned: str) -> MatchResult:
        """Create a result object with alternatives"""
        alternatives = self._get_top_alternatives(cleaned, top_n=5)
        alternatives = [(c, s) for c, s in alternatives if c != code and s < confidence]

        return MatchResult(
            primary_code=code,
            primary_confidence=confidence,
            calibrated_confidence=confidence,
            method=method,
            is_multi_specialty=is_compound,
            alternatives=alternatives
        )

print("✓ SpecialtyMatcher class defined")

✓ SpecialtyMatcher class defined
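
The threshold cascade in `match()` can be summarized without the models. The following is a simplified sketch that takes precomputed scores as input; `choose_method` is a hypothetical helper, and the multi-specialty step is omitted for brevity.

```python
from typing import Optional, Tuple

def choose_method(is_exact: bool,
                  fuzzy_score: Optional[float],
                  semantic_score: Optional[float]) -> Tuple[str, float]:
    """Hypothetical helper mirroring the order of checks in match()."""
    if is_exact:
        return "exact_match", 0.98
    # _fuzzy_match only returns a candidate when its score is >= 0.70
    fuzzy = fuzzy_score if fuzzy_score is not None and fuzzy_score >= 0.70 else None
    if fuzzy is not None and fuzzy >= 0.85:
        return "fuzzy_match", fuzzy
    if semantic_score is not None and semantic_score >= 0.50:
        return "semantic_match", semantic_score
    if fuzzy is not None:
        # Fallback reuses the fuzzy candidate with a capped confidence
        return "fallback_match", min(fuzzy, 0.45)
    return "no_match", 0.0

print(choose_method(False, 0.90, 0.30))  # ('fuzzy_match', 0.9)
print(choose_method(False, 0.75, 0.60))  # ('semantic_match', 0.6)
print(choose_method(False, 0.75, 0.30))  # ('fallback_match', 0.45)
print(choose_method(False, 0.50, 0.30))  # ('no_match', 0.0)
```

Every fallback result ends up with a confidence of exactly 0.45, because a fuzzy candidate is only kept at 0.70 or higher and is then capped.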


## Step 6: Create the Confidence Calibrator

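This class uses isotonic regression to map raw confidence scores to calibrated probabilities: if the calibrated confidence is 0.80, the match should be correct approximately 80% of the time. Fitting requires a labeled validation set (raw scores plus a correct/incorrect flag for each match). The notebook never calls `fit`, and Step 10 runs with `apply_calibration=False`, so the results below use the fixed per-method rules in `_simple_calibrate` instead.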

In [22]:
class ConfidenceCalibrator:
    """Calibrates raw confidence scores to true probabilities"""

    def __init__(self):
        self.iso_reg = IsotonicRegression(out_of_bounds='clip')
        self.is_fitted = False

    def fit(self, original_scores: np.ndarray, ground_truth: np.ndarray):
        """
        Fit the calibrator using validation data

        Args:
            original_scores: Raw confidence scores from the matcher
            ground_truth: Binary labels (1 if correct match, 0 if incorrect)
        """
        original_scores = np.array(original_scores).flatten()
        ground_truth = np.array(ground_truth).flatten()

        self.iso_reg.fit(original_scores, ground_truth)
        self.is_fitted = True

    def calibrate(self, scores: np.ndarray) -> np.ndarray:
        """Apply calibration to new scores"""
        if not self.is_fitted:
            return scores
        return self.iso_reg.predict(scores)

print("✓ ConfidenceCalibrator class defined")

✓ ConfidenceCalibrator class defined
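
To show what an isotonic fit produces, here is a minimal plain-Python sketch of the pool-adjacent-violators procedure, a standard way to compute a non-decreasing fit. It is not a claim about how scikit-learn implements `IsotonicRegression`; `pav` is a hypothetical helper and the labels are made up.

```python
from typing import List

def pav(y: List[float]) -> List[float]:
    """Pool-adjacent-violators: best non-decreasing fit to y (equal weights).

    y must already be ordered by increasing raw score.
    """
    # Each block holds [sum of values, count]; its fitted value is the mean.
    blocks: List[List[float]] = []
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted

# Toy validation set (made-up labels), sorted by raw confidence:
# 1 = the match was correct, 0 = it was wrong.
raw_scores = [0.45, 0.55, 0.70, 0.85, 0.90, 0.98]
correct = [0, 1, 0, 1, 1, 1]
calibrated = pav(correct)
print(list(zip(raw_scores, calibrated)))
```

The out-of-order pair at 0.55 and 0.70 is pooled to 0.5, so the fitted values never decrease as the raw score increases.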


## Step 7: Create the Junk Classifier

This classifier determines whether a match result should be classified as 'JUNK' (unmappable). It uses threshold rules based on the matching method used to decide when to mark a result as unmappable.

In [23]:
class JunkClassifier:
    """Determines if a match should be classified as unmappable (JUNK)"""

    @staticmethod
    def should_classify_junk(result: MatchResult, raw_text: str) -> bool:
        """
        Determine if this match should be marked as JUNK.
        Uses different thresholds based on the matching method.
        """
        # Empty input is always junk
        if result.method == MatchMethod.EMPTY_INPUT:
            return True

        # Very short text is likely junk (str() guards against non-string values such as numbers)
        if len(str(raw_text).strip()) < 2:
            return True

        # Apply method-specific thresholds
        if result.method == MatchMethod.EXACT_MATCH:
            return result.primary_confidence < 0.95
        elif result.method == MatchMethod.FUZZY_MATCH:
            return result.primary_confidence < 0.80
        elif result.method == MatchMethod.SEMANTIC_MATCH:
            return result.primary_confidence < 0.50
        elif result.method == MatchMethod.FALLBACK_MATCH:
            return result.primary_confidence < 0.35
        elif result.method == MatchMethod.NO_MATCH:
            return True

        return False

print("✓ JunkClassifier class defined")

✓ JunkClassifier class defined
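
The method-specific rules amount to a small lookup table. The sketch below is a hypothetical table-driven equivalent (`JUNK_THRESHOLDS` and `is_junk` are illustrative names) that leaves out the raw-text length check.

```python
# Minimum confidence required per method; below it the result becomes JUNK.
JUNK_THRESHOLDS = {
    "exact_match": 0.95,
    "fuzzy_match": 0.80,
    "semantic_match": 0.50,
    "fallback_match": 0.35,
}

def is_junk(method: str, confidence: float) -> bool:
    """Hypothetical table-driven version of should_classify_junk."""
    if method in ("empty_input", "no_match"):
        return True
    # Unknown methods fall through to "not junk", as in the original.
    return confidence < JUNK_THRESHOLDS.get(method, 0.0)

print(is_junk("fallback_match", 0.45))   # False
print(is_junk("semantic_match", 0.418))  # True
print(is_junk("no_match", 0.0))          # True
```

Given the matcher's own thresholds, most of these rules cannot fire: exact matches always carry 0.98, accepted fuzzy matches are at least 0.85, and fallback matches are always 0.45. The semantic rule can still fire for multi-specialty results, which are labeled `semantic_match` but scaled by 0.95; a part score of 0.44 becomes 0.418 and is marked JUNK.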


## Step 8: Create the Main Standardizer

This is the main orchestrator class that:
1. Takes raw specialty data
2. Runs it through the matcher
3. Applies junk classification
4. Calibrates confidences, using the fitted isotonic calibrator when `apply_calibration=True` and fixed per-method rules otherwise
5. Returns formatted results with alternatives

In [24]:
class ProviderSpecialtyStandardizer:
    """Main orchestrator for specialty standardization"""

    def __init__(self, nucc_df: pd.DataFrame):
        self.nucc_df = nucc_df
        self.matcher = SpecialtyMatcher(nucc_df)
        self.calibrator = ConfidenceCalibrator()

    def standardize(self, input_df: pd.DataFrame,
                   specialty_column: str = 'raw_specialty',
                   apply_calibration: bool = False) -> pd.DataFrame:
        """
        Main standardization function.

        Args:
            input_df: DataFrame with specialty data
            specialty_column: Name of the column containing specialties (default: 'raw_specialty')
            apply_calibration: Whether to apply learned calibration

        Returns:
            DataFrame with standardized specialties
        """
        # Verify column exists
        if specialty_column not in input_df.columns:
            raise KeyError(f"Column '{specialty_column}' not found. Available: {input_df.columns.tolist()}")

        results = []

        # Process each specialty
        for idx, row in input_df.iterrows():
            specialty = row[specialty_column]
            match_result = self.matcher.match(specialty)

            # Apply junk classification
            is_junk = JunkClassifier.should_classify_junk(match_result, specialty)

            if is_junk:
                match_result.primary_code = 'JUNK'
                match_result.primary_confidence = 0.0
                match_result.calibrated_confidence = 0.0
            elif apply_calibration:
                # Apply learned calibration if available
                cal_score = self.calibrator.calibrate(
                    np.array([match_result.primary_confidence])
                )[0]
                match_result.calibrated_confidence = cal_score
            else:
                # Apply simple calibration rules
                match_result.calibrated_confidence = self._simple_calibrate(
                    match_result.primary_confidence,
                    match_result.method
                )

            results.append(match_result)

            # Progress reporting (counts processed rows, so it also works with a non-integer index)
            if len(results) % 1000 == 0:
                print(f"  Processed {len(results)} records...")

        return self._format_results(input_df, results, specialty_column)

    def _simple_calibrate(self, score: float, method: MatchMethod) -> float:
        """Apply simple calibration rules based on matching method"""
        if method == MatchMethod.EXACT_MATCH:
            return min(score * 1.02, 0.95)
        elif method == MatchMethod.FUZZY_MATCH:
            return min(score * 1.05, 0.90)
        elif method == MatchMethod.SEMANTIC_MATCH:
            return min(score ** 0.5, 0.85)
        elif method == MatchMethod.FALLBACK_MATCH:
            return min(score * 0.95, 0.50)
        else:
            return score

    def _format_results(self, input_df: pd.DataFrame,
                      results: List[MatchResult],
                      specialty_column: str) -> pd.DataFrame:
        """Format results into a comprehensive output DataFrame"""
        output_rows = []

        for i, (idx, row) in enumerate(input_df.iterrows()):
            result = results[i]
            cleaned, _ = self.matcher.preprocessor.preprocess(row[specialty_column])

            # Build output row
            output_row = {
                'Specialty': row[specialty_column],
                'Preprocessed': cleaned,
                'Primary_Code': result.primary_code,
                'Original_Confidence': round(result.primary_confidence, 4),
                'Calibrated_Confidence': round(result.calibrated_confidence, 4),
                'Method': result.method.value,
                'Is_Multi_Specialty': result.is_multi_specialty,
            }

            # Add alternative matches
            for j, (alt_code, alt_score) in enumerate(result.alternatives[:5]):
                output_row[f'Alternative_Code_{j+1}'] = alt_code
                output_row[f'Alternative_Score_{j+1}'] = round(float(alt_score), 4)

            output_rows.append(output_row)

        output_df = pd.DataFrame(output_rows)

        # Ensure all alternative columns exist without overwriting populated ones
        for j in range(1, 6):
            for col in (f'Alternative_Code_{j}', f'Alternative_Score_{j}'):
                if col not in output_df.columns:
                    output_df[col] = np.nan

        return output_df

    def compute_validation_metrics(self, output_df: pd.DataFrame) -> Dict:
        """
        Compute comprehensive validation metrics for the standardization results.

        Returns:
            Dictionary with metrics including junk rate, success rate, confidence stats, etc.
        """
        metrics = {}

        # Basic counts
        total = len(output_df)
        junk_count = (output_df['Primary_Code'] == 'JUNK').sum()

        metrics['total_records'] = total
        metrics['junk_records'] = junk_count
        metrics['mapped_records'] = total - junk_count
        metrics['junk_percentage'] = round((junk_count / total) * 100, 2)
        metrics['mapping_success_rate'] = round(((total - junk_count) / total) * 100, 2)

        # Confidence metrics
        non_junk = output_df[output_df['Primary_Code'] != 'JUNK']

        metrics['avg_original_confidence'] = round(non_junk['Original_Confidence'].mean(), 4)
        metrics['avg_calibrated_confidence'] = round(non_junk['Calibrated_Confidence'].mean(), 4)
        metrics['confidence_improvement'] = round(
            metrics['avg_calibrated_confidence'] - metrics['avg_original_confidence'], 4
        )

        # Method distribution
        metrics['method_distribution'] = output_df['Method'].value_counts().to_dict()

        # Confidence by method
        metrics['confidence_by_method'] = {}
        for method in output_df['Method'].unique():
            method_data = non_junk[non_junk['Method'] == method]['Calibrated_Confidence']
            if len(method_data) > 0:
                metrics['confidence_by_method'][method] = round(method_data.mean(), 4)

        # Multi-specialty metrics
        multi = output_df[output_df['Is_Multi_Specialty'] == True]
        if len(multi) > 0:
            metrics['multi_specialty_count'] = len(multi)
            metrics['multi_specialty_avg_confidence'] = round(
                multi[multi['Primary_Code'] != 'JUNK']['Calibrated_Confidence'].mean(), 4
            )

        # Low confidence tracking
        low_conf = non_junk[non_junk['Calibrated_Confidence'] < 0.60]
        metrics['low_confidence_count'] = len(low_conf)
        metrics['low_confidence_percentage'] = (
            round((len(low_conf) / len(non_junk)) * 100, 2) if len(non_junk) > 0 else 0.0
        )

        return metrics

print("✓ ProviderSpecialtyStandardizer class defined")

✓ ProviderSpecialtyStandardizer class defined
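
The effect of the per-method rules is easiest to see with a few worked values. The sketch below is a stand-alone copy of the rules in `_simple_calibrate`, keyed by the method's string value, applied to the lowest (or fixed) confidence each method can produce.

```python
def simple_calibrate(score: float, method: str) -> float:
    """Same rules as _simple_calibrate, keyed by the method's string value."""
    if method == "exact_match":
        return min(score * 1.02, 0.95)
    if method == "fuzzy_match":
        return min(score * 1.05, 0.90)
    if method == "semantic_match":
        return min(score ** 0.5, 0.85)
    if method == "fallback_match":
        return min(score * 0.95, 0.50)
    return score

examples = {
    "exact_match": 0.98,     # fixed confidence returned by _exact_match
    "fuzzy_match": 0.85,     # lowest accepted fuzzy score
    "semantic_match": 0.50,  # lowest accepted semantic score
    "fallback_match": 0.45,  # fallback confidence is capped at 0.45
}
calibrated = {m: round(simple_calibrate(s, m), 4) for m, s in examples.items()}
print(calibrated)
```

Exact matches always come out at the 0.95 cap and fallback matches at 0.4275. Semantic scores are raised by the square root and reach the 0.85 cap once the raw similarity is about 0.7225 or higher. These are fixed rescaling rules, not probabilities learned from labeled data.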


## Step 9: Load Data and Initialize Standardizer

Load the NUCC taxonomy master file and the input specialties data, then initialize the standardizer.

In [25]:
print("\n" + "="*50)
print("LOADING DATA")
print("="*50)

# Load your data
nucc_df = pd.read_csv('nucc_taxonomy_master.csv')
input_df = pd.read_csv('input_specialties.csv')

print(f"✓ NUCC records: {len(nucc_df)}")
print(f"✓ Input specialties: {len(input_df)}")
print(f"✓ Input columns: {input_df.columns.tolist()}")

# Create standardizer
print("\n" + "="*50)
print("INITIALIZING STANDARDIZER")
print("="*50)

standardizer = ProviderSpecialtyStandardizer(nucc_df)
print("\n✓ Standardizer initialized and ready")


LOADING DATA
✓ NUCC records: 879
✓ Input specialties: 10050
✓ Input columns: ['raw_specialty']

INITIALIZING STANDARDIZER
✓ Semantic model loaded

✓ Standardizer initialized and ready


## Step 10: Run Standardization

Execute the standardization process on all input specialties. This will match each specialty against the NUCC taxonomy and assign codes with confidence scores.

In [26]:
print("\n" + "="*50)
print("RUNNING STANDARDIZATION")
print("="*50 + "\n")

# Run standardization; specialty_column must name the column that holds the raw text
output_df = standardizer.standardize(
    input_df,
    specialty_column='raw_specialty'
)

print("\n✓ Standardization completed")


RUNNING STANDARDIZATION

  Processed 1000 records...
  Processed 2000 records...
  Processed 3000 records...
  Processed 4000 records...
  Processed 5000 records...
  Processed 6000 records...
  Processed 7000 records...
  Processed 8000 records...
  Processed 9000 records...
  Processed 10000 records...

✓ Standardization completed


## Step 11: Compute and Display Validation Metrics

Compute summary metrics for the run: mapping and junk rates, confidence averages, and the share of records handled by each matching method. No ground-truth labels are involved, so these figures describe coverage and confidence, not accuracy.

In [27]:
print("\n" + "="*50)
print("VALIDATION METRICS")
print("="*50)

# Compute metrics
metrics = standardizer.compute_validation_metrics(output_df)

print("\n=== CORE METRICS ===")
print(f"Total records: {metrics['total_records']}")
print(f"Successfully mapped: {metrics['mapped_records']} ({metrics['mapping_success_rate']}%)")
print(f"Unmappable (JUNK): {metrics['junk_records']} ({metrics['junk_percentage']}%)")

print("\n=== CONFIDENCE METRICS ===")
print(f"Average original confidence: {metrics['avg_original_confidence']}")
print(f"Average calibrated confidence: {metrics['avg_calibrated_confidence']}")
print(f"Confidence improvement: {metrics['confidence_improvement']}")

print("\n=== METHOD DISTRIBUTION ===")
for method, count in sorted(metrics['method_distribution'].items(), key=lambda x: x[1], reverse=True):
    print(f"{method}: {count}")

print("\n=== CONFIDENCE BY METHOD ===")
for method, conf in sorted(metrics['confidence_by_method'].items(), key=lambda x: x[1], reverse=True):
    print(f"{method}: {conf}")

if 'multi_specialty_count' in metrics:
    print(f"\nMulti-specialty matches: {metrics['multi_specialty_count']}")
    print(f"Multi-specialty avg confidence: {metrics['multi_specialty_avg_confidence']}")

print(f"\nLow confidence matches (<0.60): {metrics['low_confidence_count']} ({metrics['low_confidence_percentage']}%)")


VALIDATION METRICS

=== CORE METRICS ===
Total records: 10050
Successfully mapped: 9547 (95.0%)
Unmappable (JUNK): 503 (5.0%)

=== CONFIDENCE METRICS ===
Average original confidence: 0.9575
Average calibrated confidence: 0.909
Confidence improvement: -0.0485

=== METHOD DISTRIBUTION ===
fuzzy_match: 4874
exact_match: 3586
semantic_match: 1043
no_match: 401
empty_input: 102
fallback_match: 44

=== CONFIDENCE BY METHOD ===
exact_match: 0.95
fuzzy_match: 0.9
semantic_match: 0.8307
fallback_match: 0.4275

Multi-specialty matches: 3673
Multi-specialty avg confidence: 0.9066

Low confidence matches (<0.60): 44 (0.46%)
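
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below copies the counts from the output above and assumes, as the numbers suggest, that the JUNK rows are exactly the `no_match` and `empty_input` rows and that the low-confidence rows are the fallback matches.

```python
total_records = 10050
method_counts = {
    "fuzzy_match": 4874,
    "exact_match": 3586,
    "semantic_match": 1043,
    "no_match": 401,
    "empty_input": 102,
    "fallback_match": 44,
}

# Assumption: only no_match and empty_input rows were marked JUNK.
junk = method_counts["no_match"] + method_counts["empty_input"]
mapped = total_records - junk

success_rate = round(mapped / total_records * 100, 2)
junk_rate = round(junk / total_records * 100, 2)
# Assumption: the low-confidence rows are the fallback matches.
low_conf_rate = round(method_counts["fallback_match"] / mapped * 100, 2)

print(sum(method_counts.values()), junk, mapped)
print(success_rate, junk_rate, low_conf_rate)
```

The method counts sum to the record total, and the derived rates agree with the printed 95.0%, 5.0%, and 0.46%.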


## Step 12: Save Standardized Results

Save the standardized output to a CSV file with all matching details including the primary NUCC code, confidence scores, and alternative matches.

In [28]:
print("\n" + "="*50)
print("SAVING RESULTS")
print("="*50)

# Save output
output_df.to_csv('Output_detailed.csv', index=False)
print("✓ Saved to Output_detailed.csv")

print(f"\nOutput shape: {output_df.shape}")
print("\nFirst few results:")
print(output_df[['Specialty', 'Preprocessed', 'Primary_Code', 'Calibrated_Confidence', 'Method']].head(10))


SAVING RESULTS
✓ Saved to Output_detailed.csv

Output shape: (10050, 17)

First few results:
                           Specialty                     Preprocessed  \
0                        ACUPUNCTURE                      acupuncture   
1                ADOLESCENT MEDICINE              adolescent medicine   
2               ALLERGY & IMMUNOLOGY           allergy and immunology   
3      ANATOMIC & CLINICAL PATHOLOGY  anatomic and clinical pathology   
4                     ANESTHESIOLOGY                   anesthesiology   
5  APPLIED BEHAVIORAL ANALYSIS (ABA)  applied behavioral analysis aba   
6                          AUDIOLOGY                        audiology   
7                  BARIATRIC SURGERY                bariatric surgery   
8          CARDIAC ELECTROPHYSIOLOGY        cardiac electrophysiology   
9                    CARDIAC SURGERY                  cardiac surgery   

  Primary_Code  Calibrated_Confidence          Method  
0   171100000X                 0.8500  semanti

## Step 13: Create Explainable Output Format

Create a simplified, pipe-separated output format with explainable rationale. This provides a user-friendly view showing:
- Raw specialty input
- NUCC codes (primary and alternatives)
- Confidence scores
- Simple explanation of the match

In [29]:
print("\n" + "="*50)
print("CREATING EXPLAINABLE OUTPUT")
print("="*50 + "\n")

def create_explain_row(row: pd.Series) -> pd.Series:
    """
    Processes a single row from the main output_df to create the
    pipe-separated 'explain' format.
    """

    # 1. Compile Codes and Confidences
    # Start with the primary match
    codes = [row['Primary_Code']]
    confidences = [str(row['Calibrated_Confidence'])]

    # Add alternatives, checking for NaNs
    for i in range(1, 6):
        alt_code = row[f'Alternative_Code_{i}']
        alt_score = row[f'Alternative_Score_{i}']

        if pd.notna(alt_code):
            codes.append(str(alt_code))
            confidences.append(str(round(float(alt_score), 4)))

    # 2. Create the Rationale
    explain_text = ""
    if row['Primary_Code'] == 'JUNK':
        explain_text = "Input was empty, too short, or unmappable (JUNK)."
    else:
        method = row['Method']
        confidence = row['Calibrated_Confidence']
        explain_text = f"Mapped via {method} with confidence {confidence:.2f}."

        if row['Is_Multi_Specialty']:
            explain_text += " (Detected multi-specialty input)"

    # 3. Return the new row as a Series
    return pd.Series({
        'raw_specialty': row['Specialty'],  # 'Specialty' holds the raw input
        'nucc_codes': '|'.join(codes),
        'confidence': '|'.join(confidences),
        'explain': explain_text
    })

# Apply the function across the main output DataFrame
explain_df = output_df.apply(create_explain_row, axis=1)

# Save the new CSV
explain_df.to_csv('Final_Submission_output.csv', index=False)

print("✓ Saved to Final_Submission_output.csv")
print(f"\nExplain output shape: {explain_df.shape}")
print("\nFirst few 'explain' results:")
print(explain_df.head(10))


CREATING EXPLAINABLE OUTPUT

✓ Saved to Final_Submission_output.csv

Explain output shape: (10050, 4)

First few 'explain' results:
                       raw_specialty  \
0                        ACUPUNCTURE   
1                ADOLESCENT MEDICINE   
2               ALLERGY & IMMUNOLOGY   
3      ANATOMIC & CLINICAL PATHOLOGY   
4                     ANESTHESIOLOGY   
5  APPLIED BEHAVIORAL ANALYSIS (ABA)   
6                          AUDIOLOGY   
7                  BARIATRIC SURGERY   
8          CARDIAC ELECTROPHYSIOLOGY   
9                    CARDIAC SURGERY   

                                          nucc_codes  \
0  171100000X|208VP0000X|207LP2900X|2081P2900X|26...   
1  207QA0000X|2080A0000X|207RA0000X|207QA0505X|20...   
2  207K00000X|207KI0005X|207RA0201X|207KA0200X|20...   
3  207ZP0101X|207ZP0102X|207ZC0006X|207ZP0105X|20...   
4  207L00000X|1223D0004X|367H00000X|207LP3000X|20...   
5  103K00000X|106E00000X|251S00000X|106S00000X|10...   
6  2355A2700X|231H00000X|231HA2400

## Step 14: Summary and Next Steps

### What You Now Have:

1. **Output_detailed.csv** - Comprehensive output with:
   - Original specialty text
   - Preprocessed text
   - Primary NUCC code with confidence
   - Up to 5 alternative matches with scores
   - Matching method used
   - Multi-specialty flag

2. **Final_Submission_output.csv** - Simple explainable format with:
   - Raw input
   - Pipe-separated NUCC codes (primary and alternatives)
   - Pipe-separated confidence scores
   - Plain English explanation
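
A downstream consumer has to split the two pipe-separated fields back into pairs. Here is a minimal sketch, assuming both fields always hold the same number of entries with the primary match first; `split_alternatives` is a hypothetical helper and the confidence values are made up.

```python
from typing import List, Tuple

def split_alternatives(nucc_codes: str, confidence: str) -> List[Tuple[str, float]]:
    """Pair each pipe-separated code with its score; the first pair is the primary match."""
    codes = nucc_codes.split("|")
    scores = [float(s) for s in confidence.split("|")]
    if len(codes) != len(scores):
        raise ValueError("nucc_codes and confidence have different lengths")
    return list(zip(codes, scores))

# Codes taken from the sample output above; scores are illustrative only.
pairs = split_alternatives("207L00000X|1223D0004X|367H00000X", "0.95|0.61|0.58")
primary_code, primary_score = pairs[0]
print(primary_code, primary_score)
print(pairs[1:])
```

A JUNK row parses the same way and yields a single pair, `("JUNK", 0.0)`.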

### Key Metrics:
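- **Mapping Success Rate**: Percentage of records assigned a NUCC code (95.0% in the run above, 9,547 of 10,050)
- **Junk Rate**: Percentage of records marked unmappable (5.0%, 503 records)
- **Confidence Scores**: Raw and rescaled confidence for each match; in this run the rescaling uses the fixed rules in `_simple_calibrate`, not a fitted calibrator
- **Method Distribution**: How many records each matching strategy handled (fuzzy 4,874; exact 3,586; semantic 1,043; fallback 44)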

### To Use This Notebook:
1. Update the file paths in Step 9 to point to your data
2. Ensure your input CSV has a column named `raw_specialty`
3. Ensure your NUCC master CSV has columns `Code` and `Display_Name`
4. Run cells sequentially from top to bottom
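
The column requirements in steps 2 and 3 can be checked before running anything expensive. The sketch below uses only the standard library on in-memory CSV text; `missing_columns` is a hypothetical helper and the sample rows are made up. In the notebook itself the same check could be done against `df.columns` after loading.

```python
import csv
import io
from typing import List

def missing_columns(csv_text: str, required: List[str]) -> List[str]:
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name for name in required if name not in header]

# Made-up sample files: the first has the expected header, the second does not.
sample_nucc = "Code,Display_Name\n207L00000X,Anesthesiology\n"
sample_input = "specialty\nANESTHESIOLOGY\n"

nucc_missing = missing_columns(sample_nucc, ["Code", "Display_Name"])
input_missing = missing_columns(sample_input, ["raw_specialty"])
print(nucc_missing)   # []
print(input_missing)  # ['raw_specialty']
```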