# APMC DATA EXPLORER

This notebook provides analytical approaches for exploring the `txtRawStateData` field and its related parsed fields, to help determine the best database normalization strategy.

> The real value resides in well-parsed descriptive text fields from original catalog sources rather than elaborate normalized structures that often remain unused.

## Contents
[0. Setup and Data Loading](#0-Setup-and-Data-Loading) - _Load CSV data and configure display settings_

[1. Cardinality Analysis](#1-Cardinality-Analysis) - _Determine which fields justify lookup tables vs inline storage_

[2. Text Pattern Mining](#2-Text-Pattern-Mining) - _Extract structural patterns from txtRawStateData to understand formatting consistency_

[3. Field Population Analysis](#3-Field-Population-Analysis) - _Identify essential vs sparsely-populated fields by population rates_

[4. Normalization Trade-off Analysis](#4-Normalization-Trade-off-Analysis) - _Evaluate cost/benefit of different normalization approaches_

[5. Philatelic-Specific Analysis](#5-Philatelic-Specific-Analysis) - _Domain-specific explorations: town names, dates, colors, sizes_

[6. Raw vs Parsed Field Comparison](#6-Raw-vs-Parsed-Field-Comparison) - _Examine whether parsed fields capture all information from raw source_

7\. Specific Issue Investigation:
   - [7.1. Geographic Coverage Analysis](#71-Geographic-Coverage-Analysis) - _State-by-state record distribution_
   - [7.2. Color Handling Ambiguity Analysis](#72-Color-Handling-Ambiguity-Analysis) - _Investigate multi-color notation semantics and ASCC documentation gaps_

[8. Decision Framework & Analysis Summary](#8-Decision-Framework--Analysis-Summary) - _Normalization criteria and final recommendations_


## 0. Setup and Data Loading

In [92]:
import pandas as pd
import numpy as np
import re
from collections import Counter
from pathlib import Path

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)

In [93]:
# Load the main data file
# Adjust path as needed for your environment
CSV_PATH = "./wip/out/tblRawStateData.csv"

print("Loading data...")
df = pd.read_csv(CSV_PATH, low_memory=False)
print(f"Loaded {len(df):,} records with {len(df.columns)} columns")

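# Later cells operate on `approved`. Filter to approved, non-deleted records when
# the status columns are present; otherwise fall back to the full dataset.
if {'ynDeleted', 'approve_status'}.issubset(df.columns):
    approved = df[(df['ynDeleted'] == 0) & (df['approve_status'] == 'Approved')].copy()
else:
    approved = df.copy()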

Loading data...
Loaded 51,392 records with 31 columns


In [94]:
# Quick overview of columns
print("Columns in dataset:")
for i, col in enumerate(df.columns):
    print(f"  {i:2}. {col}")

Columns in dataset:
   0. nRawStateDataID
   1. nRawStateDataID_parent
   2. nStateID
   3. txtRawStateData
   4. txtPostmark
   5. txtDatesSeen
   6. txtSizes
   7. txtColors
   8. txtRates
   9. txtRatesText
  10. txtValue
  11. txtTown
  12. txtTownPostmark
  13. txtTownmarkShape
  14. txtTownmarkLettering
  15. txtTownmarkDateFormat
  16. txtTownmarkFraming
  17. txtTownmarkColor
  18. nWidth
  19. nHeight
  20. txtOther
  21. nEarliestUseDay
  22. nEarliestUseYear
  23. nLatestUseDay
  24. nLatestUseYear
  25. ynManuscript
  26. ynBackstamp
  27. txtDefaultImage
  28. nOrder
  29. nImageCount
  30. ynForReview


---
## 1. Cardinality Analysis

**Purpose**: Determine which fields justify lookup tables versus inline storage.

**Evidence for normalization (via lookup table)**:
- Low cardinality (roughly 50-100 unique values or fewer)
- Values appear repeatedly across records
- Values need controlled vocabulary for filtering/faceting
- Values need additional metadata (descriptions, display order)

**Evidence _against_ normalization**:
- High cardinality (approaches 1:1 with records)
- Values are primarily free-text with minimal repetition
- Forcing into categories loses information
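
These criteria can be turned into a quick numeric check. The sketch below is a minimal, standard-library illustration on toy data. The function name `lookup_table_signals` and its thresholds (`max_unique=100`, `min_top5_pct=50.0`) are assumptions chosen to mirror the rule of thumb above, not fixed rules.

```python
from collections import Counter

def lookup_table_signals(values, max_unique=100, min_top5_pct=50.0):
    """Summarize the signals that argue for or against a lookup table.

    Thresholds are illustrative assumptions, not fixed rules.
    """
    # Treat None, empty strings, and the literal 'NULL' as unpopulated
    populated = [v for v in values if v not in (None, "", "NULL")]
    if not populated:
        return None

    counts = Counter(populated)
    top5_total = sum(n for _, n in counts.most_common(5))
    top5_pct = top5_total / len(populated) * 100

    return {
        "unique_values": len(counts),
        "cardinality_ratio": len(counts) / len(populated),
        "top_5_concentration_pct": top5_pct,
        "lookup_candidate": len(counts) < max_unique and top5_pct > min_top5_pct,
    }

# Hypothetical toy columns: a repetitive classification and a free-text field
shapes = ["Circle"] * 6 + ["Oval"] * 3 + ["Box"]
notes = [f"free-text note {i}" for i in range(20)]

shape_signals = lookup_table_signals(shapes)
note_signals = lookup_table_signals(notes)
print(shape_signals["lookup_candidate"], note_signals["lookup_candidate"])
```

The repetitive column passes both tests, while the free-text column fails on concentration even though its unique count is below the cap. This is why the unique count alone is not a sufficient signal.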

In [95]:
def cardinality_analysis(df, columns_of_interest):
    """
    Analyze fields to determine normalization candidates.
    
    Key metrics:
    - Unique count vs total records
    - Frequency distribution
    - Top N values coverage percentage
    """
    results = []
    total_records = len(df)
    
    for col in columns_of_interest:
        non_null = df[df[col].notna() & (df[col] != '') & (df[col] != 'NULL')]
        populated_count = len(non_null)
        unique_values = non_null[col].nunique()
        
        if populated_count > 0:
            # Share of populated records covered by the 10 and 25 most common values
            value_counts = non_null[col].value_counts()
            top_10_coverage = value_counts.head(10).sum() / populated_count * 100
            top_25_coverage = value_counts.head(25).sum() / populated_count * 100
            
            # Concentration ratio (does a small set dominate?)
            concentration = value_counts.head(5).sum() / populated_count * 100
            
            results.append({
                'column': col,
                'total_records': total_records,
                'populated_count': populated_count,
                'population_rate': populated_count / total_records * 100,
                'unique_values': unique_values,
                'cardinality_ratio': unique_values / populated_count,
                'top_10_coverage_pct': top_10_coverage,
                # Label matches the head(25) computation above
                'top_25_coverage_pct': top_25_coverage,
                'top_5_concentration_pct': concentration,
                'normalization_candidate': unique_values < 100 and concentration > 50
            })
    
    return pd.DataFrame(results)

In [96]:
def examine_value_distribution(df, column, top_n=30):
    """
    Deep dive into a specific column's value distribution.
    Useful for understanding if a controlled vocabulary fits.
    """
    non_null = df[df[column].notna() & (df[column] != '') & (df[column] != 'NULL')]
    
    print(f"\n{'='*60}")
    print(f"Value Distribution: {column}")
    print(f"{'='*60}")
    print(f"Total records: {len(df):,}")
    print(f"Populated: {len(non_null):,} ({len(non_null)/len(df)*100:.1f}%)")
    print(f"Unique values: {non_null[column].nunique():,}")
    print(f"\nTop {top_n} values:")
    print(non_null[column].value_counts().head(top_n).to_string())
    
    return non_null[column].value_counts()

In [97]:
# Run cardinality analysis on classification fields
classification_cols = [
    'txtTownmarkShape', 'txtTownmarkLettering', 'txtTownmarkDateFormat',
    'txtTownmarkFraming','txtTownmarkColor'
]

cardinality_df = cardinality_analysis(approved, classification_cols)
cardinality_df

Unnamed: 0,column,total_records,populated_count,population_rate,unique_values,cardinality_ratio,top_10_coverage_pct,top_25_coverage_pct,top_5_concentration_pct,normalization_candidate
0,txtTownmarkShape,43068,1227,2.848983,24,0.01956,97.310513,100.0,92.09454,True
1,txtTownmarkLettering,43068,3209,7.451008,4,0.001246,100.0,100.0,100.0,True
2,txtTownmarkDateFormat,43068,249,0.578155,10,0.040161,100.0,100.0,85.140562,True
3,txtTownmarkFraming,43068,160,0.371506,6,0.0375,100.0,100.0,99.375,True
4,txtTownmarkColor,43068,3724,8.646791,93,0.024973,95.649839,97.986037,89.661654,True


In [98]:
# Deep dive into specific fields
for col in cardinality_df.loc[cardinality_df['normalization_candidate'], 'column']:
    examine_value_distribution(approved, col)


Value Distribution: txtTownmarkShape
Total records: 43,068
Populated: 1,227 (2.8%)
Unique values: 24

Top 30 values:
txtTownmarkShape
Straight line             484
Circle                    381
Double Circle             138
Oval                       65
Arc                        62
Double Line Circle         18
Double Oval                16
Fancy/morticed             12
Box                        10
Octagon                     8
Straight Line               5
Pictoral                    4
Dotted Oval                 4
Double Line Oval            4
Fancy Oval                  3
Tombstone                   3
Framed Arc                  2
Fancy Box                   2
Shell Design                1
Dashed Circle               1
Double Lined Box            1
Dotted Circle               1
Straight line - 2 line      1
Straight line - 3 line      1

Value Distribution: txtTownmarkLettering
Total records: 43,068
Populated: 3,209 (7.5%)
Unique values: 4

Top 30 values:
txtTownmarkLettering
Nor

---
## 2. Text Pattern Mining

**Purpose**: Extract structural patterns from `txtRawStateData` to understand what information is present and how consistently it's formatted.

In [99]:
def extract_raw_data_components(raw_text):
    """
    Parse a txtRawStateData value into component parts.
    
    Typical format example:
    - "Alexa.(Alexandria)(E)(May 21, 1772;Ms;Black) 1,500"
    - "FREDERICKSBURG(\"F\" 5mm high, used as bkstp)(March 1, 1775;SL-50x3,MDD below;Black,Red) 1,200"
    - "(L)(June 27, 1775) 1,000"
    
    Components:
    - Town postmark text
    - Town name (if different from postmark)
    - (E) = Earliest known, (L) = Latest known
    - Date range
    - Size specifications (SL-50x3 = Straight Line 50mm x 3mm)
    - Date format (MDD = Month-Day-Day)
    - Colors
    - Value
    """
    if pd.isna(raw_text) or raw_text == 'NULL':
        return None
    
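    components = {
        'raw': raw_text,
        'has_earliest_marker': '(E)' in raw_text,
        'has_latest_marker': '(L)' in raw_text,
        'has_backstamp': 'backstamp' in raw_text.lower() or 'bkstp' in raw_text.lower(),
        # Require field delimiters around "Ms" so that state abbreviations such as
        # "MILBURY/Ms." are not counted as manuscript markings
        'has_manuscript': bool(re.search(r'[;(]Ms[;,)]', raw_text)),
        'has_size_spec': bool(re.search(r'SL-\d+x[\d.]+', raw_text)),
        # \b keeps "DC-29" (double circle) from matching as a plain circle
        'has_circle': bool(re.search(r'\bC-\d+', raw_text)),
        'has_value': bool(re.search(r'\s[\d,]+$', raw_text.strip())),
    }
    
    # Extract size specifications. Straight lines carry width x height (SL-50x3);
    # circles carry a single diameter (C-28), so the height group is optional.
    size_match = re.search(r'\b(SL|C)-(\d+(?:\.\d+)?)(?:x(\d+(?:\.\d+)?))?', raw_text)
    if size_match:
        components['shape'] = 'Straight line' if size_match.group(1) == 'SL' else 'Circle'
        components['width'] = float(size_match.group(2))  # diameter for circles
        if size_match.group(3):
            components['height'] = float(size_match.group(3))
    
    # Extract colors: each name must follow ';' or ',' and precede ',', ')' or ']',
    # so every entry in a list such as "Black,Red" is captured
    color_pattern = r'(?<=[;,])(Black|Red|Blue|Brown|Green|Orange|Magenta)(?=[,)\]])'
    colors = re.findall(color_pattern, raw_text, re.IGNORECASE)
    components['colors'] = colors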
    
    # Extract value
    value_match = re.search(r'\s([\d,]+)$', raw_text.strip())
    if value_match:
        components['value'] = value_match.group(1).replace(',', '')
    
    return components

In [100]:
def pattern_frequency_analysis(df, column='txtRawStateData', sample_size=None):
    """
    Analyze patterns in the raw data to understand formatting consistency.
    This helps determine what parsing rules will work across the dataset.
    """
    data = df[df[column].notna() & (df[column] != 'NULL')][column]
    
    if sample_size and len(data) > sample_size:
        data = data.sample(sample_size, random_state=42)
    
    patterns = {
        'has_parenthetical_name': 0,  # Town(Full Name)
        'has_E_marker': 0,
        'has_L_marker': 0,
        'has_backstamp': 0,
        'has_manuscript': 0,
        'has_SL_size': 0,
        'has_circle_size': 0,
        'has_trailing_value': 0,
        'has_color_spec': 0,
        'has_date_format_spec': 0,  # MDD, MD, YMDD, etc.
    }
    
    for text in data:
        if re.search(r'\([A-Z][a-z]+.*?\)', text):
            patterns['has_parenthetical_name'] += 1
        if '(E)' in text:
            patterns['has_E_marker'] += 1
        if '(L)' in text:
            patterns['has_L_marker'] += 1
        if re.search(r'backstamp|bkstp', text, re.I):
            patterns['has_backstamp'] += 1
        if ';Ms;' in text or ';Ms,' in text:
            patterns['has_manuscript'] += 1
        if re.search(r'SL-\d+', text):
            patterns['has_SL_size'] += 1
        if re.search(r'C-\d+', text):
            patterns['has_circle_size'] += 1
        if re.search(r'\s[\d,]+$', text.strip()):
            patterns['has_trailing_value'] += 1
        if re.search(r';(Black|Red|Blue|Brown|Green)', text, re.I):
            patterns['has_color_spec'] += 1
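        # Match MD, MDD, YMD, YMDD as standalone codes followed by a delimiter,
        # so forms such as ",MD;" are counted as well as "MD "
        if re.search(r'(?<![A-Za-z])Y?MDD?(?=[;,)\s])', text):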
            patterns['has_date_format_spec'] += 1
    
    total = len(data)
    print(f"\nPattern Frequency Analysis (n={total:,})")
    print("="*50)
    
    results = []
    for pattern, count in sorted(patterns.items(), key=lambda x: -x[1]):
        pct = count/total*100
        print(f"{pattern:30} {count:6,} ({pct:5.1f}%)")
        results.append({'pattern': pattern, 'count': count, 'percentage': pct})
    
    return pd.DataFrame(results)

In [101]:
# Run pattern analysis on raw data
pattern_df = pattern_frequency_analysis(approved, 'txtRawStateData', sample_size=5000)


Pattern Frequency Analysis (n=5,000)
has_trailing_value              4,608 ( 92.2%)
has_color_spec                  2,632 ( 52.6%)
has_parenthetical_name            698 ( 14.0%)
has_manuscript                    287 (  5.7%)
has_SL_size                       216 (  4.3%)
has_circle_size                   210 (  4.2%)
has_E_marker                      208 (  4.2%)
has_L_marker                      179 (  3.6%)
has_date_format_spec              151 (  3.0%)
has_backstamp                       7 (  0.1%)


In [102]:
# Sample some raw data entries to see the format
print("Sample txtRawStateData entries:")
print("="*70)
samples = approved[approved['txtRawStateData'].notna() & 
                   (approved['txtRawStateData'] != 'NULL')]['txtRawStateData'].sample(15, random_state=42)
for i, s in enumerate(samples, 1):
    print(f"{i:2}. {s}")

Sample txtRawStateData entries:
 1. West Batavia 1854 35
 2. Same/VA.(1860s;32;Black) Union mail 20
 3. Bethel 1841-42 10
 4. Point Coupee 1821-50 75/40
 5. Hudsonville 1841 75
 6. (L)(April 1, 1760;CD) 1,000
 7. COOKSVILLE,/Wisn.(185-;30;PAID/3[C];Black) 25
 8. Andover 1841,1849-50s 15
 9. *Cooke’s Store 1852 --
10. Orange C.H. 1801-32,1852 40
11. Lyme Bridge 1849 250
12. Church Hill 1846 75
13. AKRON,O(August 2, 1833;DL box-34x16,MD;Black) 400
14. Bucksport 1852-55,1860 150
15. Whitneyville 1845-54 20
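
The samples above show two broad layouts: plain entries (`West Batavia 1854 35`) and entries with parenthetical groups holding semicolon-delimited details. The following is a minimal sketch of a structural split that handles both. The function `split_raw_entry` and its result keys are hypothetical names used for illustration, and the sketch assumes that parentheses are not nested and that the last group carries the detail fields.

```python
import re

def split_raw_entry(raw_text):
    """Split a raw catalog entry into leading text, parenthetical groups, and value."""
    text = raw_text.strip()

    # Trailing catalog value, e.g. " 1,500"; entries ending in "--" or "75/40" yield None
    value_match = re.search(r'\s([\d,]+)$', text)
    value = value_match.group(1).replace(',', '') if value_match else None
    body = text[:value_match.start()] if value_match else text

    # Non-nested parenthetical groups, in order of appearance
    groups = re.findall(r'\(([^()]*)\)', body)

    return {
        # Everything before the first "("; plain entries keep their year here
        'leading_text': body.split('(', 1)[0].strip(),
        'groups': groups,
        # Assumption: the last group holds the semicolon-delimited detail fields
        'detail_fields': groups[-1].split(';') if groups else [],
        'value': value,
    }

entry = split_raw_entry('Alexa.(Alexandria)(E)(May 21, 1772;Ms;Black) 1,500')
print(entry['leading_text'], entry['detail_fields'], entry['value'])
```

Plain entries come back with an empty `groups` list, which gives a cheap way to measure how many records carry any structured detail at all.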


---
## 3. Field Population Analysis

**Purpose**: Understand which parsed fields are actually populated. Low population rates suggest fields that might not justify dedicated columns.

In [103]:
def field_population_report(df, exclude_system_cols=True):
    """
    Generate population rates for all fields.
    Helps identify which fields are essential vs sparsely used.
    """
    system_cols = ['dtEntered', 'dtUpdated', 'ynActive', 'ynDeleted', 'nOrder']
    
    results = []
    for col in df.columns:
        if exclude_system_cols and col in system_cols:
            continue
            
        non_null = df[df[col].notna()]
        
        # Also exclude 'NULL' strings and empty strings
        if df[col].dtype == 'object':
            non_null = non_null[
                (non_null[col] != 'NULL') & 
                (non_null[col] != '') &
                (non_null[col] != 'n/a') &
                (non_null[col] != '--')
            ]
        
        results.append({
            'column': col,
            'dtype': str(df[col].dtype),
            'populated': len(non_null),
            'population_rate': len(non_null) / len(df) * 100,
            'unique_values': non_null[col].nunique() if len(non_null) > 0 else 0
        })
    
    report = pd.DataFrame(results).sort_values('population_rate', ascending=False)
    return report

In [104]:
def essential_vs_sparse_fields(df, threshold=50):
    """
    Categorize fields into essential (>= threshold%) vs sparse (< threshold%).
    """
    report = field_population_report(df)
    
    essential = report[report['population_rate'] >= threshold]
    sparse = report[report['population_rate'] < threshold]
    
    print(f"\n{'='*60}")
    print(f"ESSENTIAL FIELDS (>{threshold}% populated)")
    print(f"{'='*60}")
    display(essential[['column', 'population_rate', 'unique_values']])
    
    print(f"\n{'='*60}")
    print(f"SPARSE FIELDS (<{threshold}% populated)")
    print(f"{'='*60}")
    display(sparse[['column', 'population_rate', 'unique_values']])
    
    return essential, sparse

In [105]:
# Run field population analysis
essential, sparse = essential_vs_sparse_fields(approved, threshold=30)


ESSENTIAL FIELDS (>30% populated)


Unnamed: 0,column,population_rate,unique_values
0,nRawStateDataID,100.0,43068
1,nRawStateDataID_parent,100.0,32844
30,ynForReview,100.0,2
29,nImageCount,100.0,14
27,ynBackstamp,100.0,2
26,ynManuscript,100.0,2
31,approve_status,100.0,1
2,nStateID,100.0,55
13,txtTownPostmark,99.969815,27883
12,txtTown,99.923377,13896



SPARSE FIELDS (<30% populated)


Unnamed: 0,column,population_rate,unique_values
28,txtDefaultImage,14.89505,6415
18,txtTownmarkColor,8.646791,93
23,nEarliestUseYear,8.36584,126
15,txtTownmarkLettering,7.451008,4
25,nLatestUseYear,5.523823,121
19,nWidth,4.295533,70
14,txtTownmarkShape,2.848983,24
9,txtRates,2.809511,606
22,nEarliestUseDay,2.807189,31
21,txtOther,2.326553,427


---
## 4. Normalization Trade-off Analysis

**Purpose**: Evaluate the cost/benefit of different normalization approaches.

In [106]:
def text_preservation_analysis(df, text_col, parsed_col):
    """
    Compare original text field to its parsed equivalent.
    Determines if parsing loses information that users need.
    
    Can the parsed field REPLACE the original text?
    Or does the original text contain nuances that must be preserved?
    """
    both_populated = df[
        df[text_col].notna() & (df[text_col] != 'NULL') &
        df[parsed_col].notna() & (df[parsed_col] != 'NULL') & (df[parsed_col] != 'n/a')
    ]
    
    print(f"\nText Preservation Analysis: {text_col} -> {parsed_col}")
    print("="*60)
    print(f"Records with both fields populated: {len(both_populated):,}")
    
    # Sample comparisons
    print("\nSample comparisons (original -> parsed):")
    sample = both_populated.sample(min(10, len(both_populated)), random_state=42)
    for _, row in sample.iterrows():
        orig = str(row[text_col])[:60]
        print(f"  '{orig}...' -> '{row[parsed_col]}'")

In [107]:
def lookup_justification_report(df, field, lookup_values):
    """
    Check if existing lookup table values provide coverage for the actual data.
    """
    actual_values = df[df[field].notna() & (df[field] != 'NULL')][field].unique()
    
    covered = set(actual_values) & set(lookup_values)
    uncovered = set(actual_values) - set(lookup_values)
    unused_lookups = set(lookup_values) - set(actual_values)
    
    print(f"\nLookup Justification: {field}")
    print("="*60)
    print(f"Actual unique values: {len(actual_values)}")
    print(f"Lookup table values: {len(lookup_values)}")
    print(f"Covered by lookup: {len(covered)}")
    print(f"Uncovered (needs adding): {len(uncovered)}")
    print(f"Unused lookups: {len(unused_lookups)}")
    
    if uncovered:
        print(f"\nUncovered values:")
        for v in list(uncovered)[:20]:
            print(f"  - {v}")

In [108]:
# Compare raw data to parsed fields
text_preservation_analysis(approved, 'txtRawStateData', 'txtTownmarkShape')


Text Preservation Analysis: txtRawStateData -> txtTownmarkShape
Records with both fields populated: 1,079

Sample comparisons (original -> parsed):
  'ERIE,ALA.(1846-47;Box-37x21;[ms date in box below town];V,PA...' -> 'Box'
  'WILMN D.(1799-1813;27;PAID,FREE[box];Black,Red) 75...' -> 'Straight line'
  '+MOUND CITY/K.T.(Feb. 21, --;C-35;Black) 400...' -> 'Circle'
  'WASHINGTON CITY/D.C.(1857-58; 32.5,YMDD;PAID,3;Black) 25...' -> 'Straight line'
  'NUEVO MEXICO.(E)(June 3, 1800;SL-51x4 letters slanting;3[ms]...' -> 'Straight line'
  'Same(1864;DC-29,YMDD;Black) 30...' -> 'Double Circle'
  'Same UNPAID(1873;21;--;Black) 350...' -> 'Straight line'
  'PAID/AT/SAN JUAN PORTO RICO(1844-64;25[crowned];Black,Red) 2...' -> 'Double Circle'
  'VIEQUES(1876-77;34x5;Black) 200...' -> 'Straight line'
  'SIOUX FALLS CITY/D.T.(E)(Aug. 15, 1859;oval--;Black) --...' -> 'Oval'


In [109]:
# Example: Check lookup table coverage for shapes
known_shapes = [
    'Straight line', 'Circle', 'Double Circle', 'Oval', 'Arc',
    'Double Line Circle', 'Double Oval', 'Fancy/morticed', 'Box',
    'Octagon', 'Pictoral', 'Tombstone'
]

lookup_justification_report(approved, 'txtTownmarkShape', known_shapes)


Lookup Justification: txtTownmarkShape
Actual unique values: 24
Lookup table values: 12
Covered by lookup: 12
Uncovered (needs adding): 12
Unused lookups: 0

Uncovered values:
  - Fancy Oval
  - Fancy Box
  - Straight Line
  - Framed Arc
  - Double Line Oval
  - Straight line - 3 line
  - Shell Design
  - Double Lined Box
  - Straight line - 2 line
  - Dotted Oval
  - Dashed Circle
  - Dotted Circle


---
## 5. Philatelic-Specific Analysis

**Purpose**: Domain-specific explorations for postal history cataloging.

In [110]:
def town_name_variations(df, town_col='txtTown', postmark_col='txtTownPostmark'):
    """
    Analyze relationship between town names and postmark text.
    
    The postmark text on the cover may differ
    from the normalized town name (abbreviations, historical spellings).
    Both need to be preserved for different purposes:
    - Normalized town: for filtering, grouping, geographic lookup
    - Postmark text: for exact matching, historical accuracy
    """
    both_populated = df[
        df[town_col].notna() & (df[town_col] != 'NULL') &
        df[postmark_col].notna() & (df[postmark_col] != 'NULL')
    ]
    
    # Find cases where they differ
    different = both_populated[
        both_populated[town_col].str.lower() != both_populated[postmark_col].str.lower()
    ]
    
    print(f"\nTown Name vs Postmark Text Analysis")
    print("="*60)
    print(f"Records with both: {len(both_populated):,}")
    print(f"Records where they differ: {len(different):,} ({len(different)/len(both_populated)*100:.1f}%)")
    
    print("\nSamples where town name differs from postmark:")
    sample = different.sample(min(15, len(different)), random_state=42)
    for _, row in sample.iterrows():
        print(f"  Town: '{row[town_col]}' | Postmark: '{row[postmark_col]}'")
    
    return different

In [111]:
def date_format_patterns(df, dates_col='txtDatesSeen'):
    """
    Analyze date format variations in the catalog.
    
    This helps determine:
    - Whether dates can be reliably parsed into structured fields
    - What date formats exist (e.g., "May 21, 1772", "1771", "dateline April 24, 1767")
    """
    non_null = df[df[dates_col].notna() & (df[dates_col] != 'NULL')][dates_col]
    
    patterns = Counter()
    for date_str in non_null:
        # Classify pattern
        if re.match(r'^\d{4}$', str(date_str)):
            patterns['year_only'] += 1
        elif re.match(r'^[A-Z][a-z]+\.?\s+\d{4}$', str(date_str)):
            patterns['month_year'] += 1
        elif re.match(r'^[A-Z][a-z]+\.?\s+\d{1,2},?\s+\d{4}$', str(date_str)):
            patterns['full_date'] += 1
        elif '-' in str(date_str) or ' to ' in str(date_str).lower():
            patterns['date_range'] += 1
        else:
            patterns['other'] += 1
    
    print(f"\nDate Format Patterns in {dates_col}")
    print("="*60)
    total = len(non_null)
    for pattern, count in patterns.most_common():
        print(f"{pattern:25} {count:6,} ({count/total*100:5.1f}%)")
    
    return patterns

In [112]:
def color_vocabulary_analysis(df, color_col='txtTownmarkColor'):
    """
    Analyze color values to determine if controlled vocabulary is appropriate.
    
    Colors in postal markings are often complex:
    - Basic: Black, Red, Blue, Brown
    - Compound: Black,Red (multiple colors on one marking)
    - Qualified: Light Blue, Dark Green
    """
    non_null = df[df[color_col].notna() & (df[color_col] != 'NULL') & (df[color_col] != 'n/a')]
    
    colors = non_null[color_col].value_counts()
    
    print(f"\nColor Vocabulary Analysis")
    print("="*60)
    print(f"Unique color values: {len(colors):,}")
    print(f"\nTop 30 color values:")
    print(colors.head(30).to_string())
    
    # Check for compound colors
    compound = non_null[non_null[color_col].str.contains(',', na=False)]
    print(f"\nCompound colors (contain comma): {len(compound):,}")
    print(compound[color_col].value_counts().head(10).to_string())
    
    return colors

In [113]:
def size_pattern_analysis(df, sizes_col='txtSizes'):
    """
    Analyze size specification patterns.
    
    Common formats:
    - SL-50x3 (Straight Line, 50mm wide x 3mm tall)
    - C-28 (Circle, 28mm diameter)
    - Ms (Manuscript - no standard size)
    - O-30x36 (Oval dimensions)
    """
    non_null = df[df[sizes_col].notna() & (df[sizes_col] != 'NULL') & (df[sizes_col] != '')]
    sizes = non_null[sizes_col]
    
    print(f"\nSize Specification Analysis")
    print("="*60)
    print(f"Populated records: {len(sizes):,}")
    print(f"Unique patterns: {sizes.nunique():,}")
    
    # Categorize patterns
    categories = Counter()
    for s in sizes:
        s = str(s)
        if re.match(r'^SL-\d+x[\d.]+$', s):
            categories['SL-WxH (standard straight line)'] += 1
        elif re.match(r'^C-\d+$', s):
            categories['C-D (standard circle)'] += 1
        elif 'Ms' in s:
            categories['Ms (manuscript)'] += 1
        elif 'SL' in s:
            categories['SL with extras'] += 1
        elif 'C-' in s:
            categories['Circle with extras'] += 1
        elif re.match(r'^[\d.]+x[\d.]+$', s):
            categories['WxH numeric only'] += 1
        elif re.match(r'^\d+$', s):
            categories['Single number (diameter)'] += 1
        else:
            categories['other'] += 1
    
    print("\nSize format categories:")
    for cat, count in sorted(categories.items(), key=lambda x: -x[1]):
        print(f"  {cat:40} {count:6,} ({count/len(sizes)*100:5.1f}%)")
    
    return categories

In [114]:
# Run philatelic-specific analyses
town_variations = town_name_variations(approved)


Town Name vs Postmark Text Analysis
Records with both: 43,023
Records where they differ: 25,407 (59.1%)

Samples where town name differs from postmark:
  Town: 'Quakertown' | Postmark: 'QUAKERTOWN/P(“P” in center)'
  Town: 'Baltimore' | Postmark: 'BALTIMORE/MD'
  Town: 'Bentonsport' | Postmark: 'Bentons Port I.T.'
  Town: 'Charlestown' | Postmark: 'CHARLES/TOWN(backstamp)'
  Town: 'Fort Mitchell' | Postmark: 'FORT MITCHELL AL'
  Town: 'Urbana' | Postmark: 'URBANA O.'
  Town: 'Atalissa' | Postmark: 'ATALISSA/IOWA'
  Town: 'Stevens Point' | Postmark: 'STEVENS POINT/Wis'
  Town: 'Milbury' | Postmark: 'MILBURY/Ms.'
  Town: 'Belvidere' | Postmark: 'BELVIDERE/lll.'
  Town: 'St. Augustine' | Postmark: 'ST.AUGUSTINE/Fl.T.'
  Town: 'Uxbridge' | Postmark: 'UXBRIDGE/MS.'
  Town: 'Solon' | Postmark: 'SOLON/O.'
  Town: 'Jefferson' | Postmark: 'JEFFERSON/Ga.'
  Town: 'Palmyra' | Postmark: 'PALMYRA/N.Y.'
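
Many of the differing pairs above are simply the town name followed by a state or territory suffix. A minimal sketch for separating that case from true spelling differences is given below. The function name and its three labels are assumptions made for illustration.

```python
def classify_town_difference(town, postmark):
    """Label how a postmark string relates to the normalized town name."""
    town_key = town.casefold().strip()
    postmark_key = postmark.casefold().strip()
    if town_key == postmark_key:
        return 'same'
    # e.g. 'Baltimore' vs 'BALTIMORE/MD': the postmark adds a suffix to the town
    if postmark_key.startswith(town_key):
        return 'town_plus_suffix'
    # Abbreviations, line breaks, or historical spellings, e.g. 'CHARLES/TOWN(backstamp)'
    return 'other'

pairs = [
    ('Baltimore', 'BALTIMORE/MD'),
    ('Bentonsport', 'Bentons Port I.T.'),
    ('Charlestown', 'CHARLES/TOWN(backstamp)'),
]
labels = [classify_town_difference(town, postmark) for town, postmark in pairs]
print(labels)
```

Counting these labels across the `different` frame would show how much of the 59.1% comes from suffixes alone rather than from different spellings.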


In [115]:
date_patterns = date_format_patterns(approved)


Date Format Patterns in txtDatesSeen
date_range                17,773 ( 43.5%)
year_only                 14,058 ( 34.4%)
full_date                  4,608 ( 11.3%)
other                      4,311 ( 10.6%)
month_year                    73 (  0.2%)
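
Date ranges are the largest category, so whether they can be parsed reliably matters for any structured year fields. The sketch below expands abbreviated ranges of the kind visible in the raw samples (`1841-42`, `1821-50`). It assumes that `txtDatesSeen` uses the same forms and that a two-digit end year smaller than the start rolls into the next century. Both points are assumptions to verify against the data.

```python
import re

def expand_year_range(text):
    """Expand 'YYYY-YY' or 'YYYY-YYYY' into (start_year, end_year); None if no match."""
    match = re.fullmatch(r'(\d{4})-(\d{4}|\d{2})', text.strip())
    if match is None:
        return None

    start = int(match.group(1))
    end_text = match.group(2)
    if len(end_text) == 4:
        end = int(end_text)
    else:
        # Two-digit end years inherit the century of the start year
        end = start // 100 * 100 + int(end_text)
        if end < start:
            end += 100  # assumed rollover, e.g. 1898-02 -> 1902
    return start, end

print(expand_year_range('1841-42'))
print(expand_year_range('1849-50s'))  # decade suffixes are deliberately left unparsed
```

Values that return `None`, such as comma-separated lists or decade suffixes, would remain in the text field rather than being forced into a structured pair.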


In [116]:
color_vocab = color_vocabulary_analysis(approved)


Color Vocabulary Analysis
Unique color values: 93

Top 30 color values:
txtTownmarkColor
Black                          2283
Red                             512
Blue                            280
Black,Red                       169
Green                            95
Black,Blue                       62
Black,Blue,Red                   57
Blue,Red                         52
Brown                            37
Black,Brown                      15
Red brown                        14
Magenta                          10
Brown,Red                         8
Red orange                        8
Orange                            7
Red,Blue                          7
Blue green                        5
Black,Brown,Red                   4
Black,Red,Blue                    4
Black brown                       4
n/a,Black                         4
n/a,Red                           3
Orange,Red                        3
Black,Blue,Brown,Red              3
Brownish red                      3
Olive gree

In [117]:
size_patterns = size_pattern_analysis(approved)


Size Specification Analysis
Populated records: 21,258
Unique patterns: 3,025

Size format categories:
  Single number (diameter)                 11,850 ( 55.7%)
  other                                     3,142 ( 14.8%)
  Ms (manuscript)                           2,492 ( 11.7%)
  SL with extras                            1,246 (  5.9%)
  Circle with extras                        1,246 (  5.9%)
  SL-WxH (standard straight line)             735 (  3.5%)
  C-D (standard circle)                       394 (  1.9%)
  WxH numeric only                            153 (  0.7%)
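
Most of the categories above share one shape: an optional letter prefix, a first number, and an optional `x` plus second number. The following is a minimal sketch of a single parser for those clean forms. The pattern `SIZE_RE` and the function `parse_size` are hypothetical, and anything with extra text is deliberately rejected rather than guessed at.

```python
import re

# Optional prefix (SL, C, O, ...), a first dimension, and an optional second dimension
SIZE_RE = re.compile(
    r'^(?:(?P<prefix>[A-Za-z]+)-)?'
    r'(?P<first>\d+(?:\.\d+)?)'
    r'(?:x(?P<second>\d+(?:\.\d+)?))?$'
)

def parse_size(spec):
    """Parse a clean size spec into its parts; return None for anything else."""
    match = SIZE_RE.match(spec.strip())
    if match is None:
        return None
    second = match.group('second')
    return {
        'prefix': match.group('prefix'),          # None for bare numbers such as "32"
        'first': float(match.group('first')),     # width, or diameter when alone
        'second': float(second) if second else None,
    }

for spec in ['SL-50x3', 'C-28', '32.5', '34x5', 'Ms', 'SL-51x4 letters slanting']:
    print(spec, '->', parse_size(spec))
```

Specs that return `None` correspond roughly to the "with extras", "Ms", and "other" categories, which is a further argument for keeping the original text next to any parsed dimensions.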


---
## 6. Raw vs Parsed Field Comparison

**Purpose**: Examine whether parsed fields capture all information from the raw source.

In [118]:

# To run this cell against the original data, remove the enclosing triple quotes.
# It is disabled because some of these fields are removed in the transformer as a
# result of this analysis.

"""
# Sample records showing raw data alongside parsed fields.
# `&` binds tighter than `|`, so the alternative field groups are parenthesized explicitly.
sample = approved[
    approved['txtRawStateData'].notna() & (
        (approved['txtTownPostmark'].notna() & approved['txtPostmark'].notna() & approved['txtTown'].notna()) |
        (approved['txtRatesText'].notna() & approved['txtRates'].notna()) |
        (approved['txtTownmarkColor'].notna() & approved['txtColors'].notna())
    )
].sample(10, random_state=42)

print("RAW DATA vs PARSED FIELDS COMPARISON")
print("="*70)

for idx, row in sample.iterrows():
    print(f"RAW: {row['txtRawStateDataTemp']}")
    print(f"  Town Postmark Text: {row['txtTownPostmark']} | Postmark Text: {row['txtPostmark']} | Town Text: {row['txtTown']}")
    print(f"  Rates Text: {row['txtRatesText']} | Rates: {row['txtRates']}")
    print(f"  Townmark Color Text: {row['txtTownmarkColor']} | Color Text: {row['txtColors']}")
    print(f"  Other (Notes): {row['txtOther']}")
"""


---
## 7.1. Geographic Coverage Analysis

In [119]:
# State distribution
state_counts = approved['nStateID'].value_counts().head(25)

# Try to load state names
try:
    states_df = pd.read_csv('./wip/tblStates.csv')
    state_map = dict(zip(states_df['nStateID'], states_df['txtState']))
    
    print("Records by State (Top 25)")
    print("="*50)
    for state_id, count in state_counts.items():
        name = state_map.get(state_id, f'Unknown ({state_id})')
        print(f"  {name:25} {count:6,}")
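except (FileNotFoundError, KeyError):
    # Fall back to raw IDs if the lookup file or its expected columns are missing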
    print("State counts by ID:")
    print(state_counts)

State counts by ID:
nStateID
32    5672
21    3847
38    2792
35    2747
46    2115
19    1730
13    1592
22    1452
49    1374
7     1332
45    1268
14    1111
33    1106
29     991
24     982
1      917
10     891
15     863
17     861
30     809
20     787
42     771
25     750
5      677
43     667
Name: count, dtype: int64


---
## 7.2. Color Handling Ambiguity Analysis

### The Documentation Gap

A critical question for database normalization is: **When a catalog entry lists multiple colors (e.g., "Black,Red"), does this represent:**

1. **One postmark device** observed in multiple ink colors over its period of use?
2. **Multiple separate observations** that should be distinct records?
3. **A single cover** with multiple ink colors on the same marking?

### What the ASCC Header Actually Says

The **only** explicit text about colors in the ASCC catalog introduction (Page xv, Section 5: "COLOR OF MARKINGS"):

> *"Manuscript markings are commonly found applied in black ink. This catalog makes no distinction in scarcity and value for manuscript markings applied in colors other than black except in the case of Territorial and Colonial markings.*
>
> *Handstamped markings are commonly found applied in black, blue and red, and generally no distinction is made in evaluating markings in these colors. Handstamped markings applied in green, purple, magenta, yellow, brown and orange are considerably scarcer and listings in this catalog often reflect increased valuations for markings known to exist in these colors. Red markings sometimes turn brownish with age."*

**Critically absent from the documentation:**
- What comma-separated colors in a listing mean
- Whether multiple colors represent one device or separate observations
- Any formatting conventions for color notation
- How to interpret compound colors like "Brown black" vs "Black,Brown"

This section analyzes what the **data structure itself** implies about color semantics, acknowledging that these are inferences, not documented conventions.

In [120]:
def analyze_color_field_structure(df, color_col='txtColors', alt_color_col='txtTownmarkColor'):
    """
    Summarize how color values are stored in each color field.

    Questions addressed:
    - Are colors stored as single values or comma-separated lists?
    - What delimiters are used?
    - How do the two color fields relate?
    """
    
    # Analyze both color fields in turn
    for col in [color_col, alt_color_col]:
        if col not in df.columns:
            print(f"\nColumn {col} not found in dataframe")
            continue
            
        non_null = df[df[col].notna() & (df[col] != 'NULL') & (df[col] != 'n/a') & (df[col] != '')]
        
        print(f"\n--- {col} ---")
        print(f"Populated records: {len(non_null):,}")
        
        # Count entries with multiple colors (comma-separated)
        has_comma = non_null[non_null[col].str.contains(',', na=False)]
        print(f"Entries with comma (potential multi-color): {len(has_comma):,} ({len(has_comma)/len(non_null)*100:.1f}%)")
        
        # Count entries with spaces that might indicate compound colors
        has_space = non_null[non_null[col].str.contains(' ', na=False)]
        print(f"Entries with space: {len(has_space):,} ({len(has_space)/len(non_null)*100:.1f}%)")
        
        # Sample multi-color entries
        if len(has_comma) > 0:
            print(f"\nSample multi-color values (first 15):")
            for val in has_comma[col].value_counts().head(15).index:
                count = (non_null[col] == val).sum()
                print(f"  '{val}' ({count:,} records)")

In [121]:
# Run color field structure analysis
analyze_color_field_structure(approved)


--- txtColors ---
Populated records: 23,214
Entries with comma (potential multi-color): 5,356 (23.1%)
Entries with space: 470 (2.0%)

Sample multi-color values (first 15):
  'Black,Red' (2,313 records)
  'Black,Blue,Red' (883 records)
  'Black,Blue' (688 records)
  'Blue,Red' (556 records)
  'Orange,Red' (65 records)
  'Black,Brown' (60 records)
  'Brown,Red' (53 records)
  'Black,Brown,Red' (49 records)
  'Black,Orange,Red' (37 records)
  'Black,Blue,Brown,Red' (36 records)
  'Black,Blue,Orange,Red' (29 records)
  'Blue,Orange,Red' (26 records)
  'Black,Blue,Green,Red' (26 records)
  'Black,Green' (19 records)
  'Black,Green,Red' (19 records)

--- txtTownmarkColor ---
Populated records: 3,724
Entries with comma (potential multi-color): 454 (12.2%)
Entries with space: 75 (2.0%)

Sample multi-color values (first 15):
  'Black,Red' (169 records)
  'Black,Blue' (62 records)
  'Black,Blue,Red' (57 records)
  'Blue,Red' (52 records)
  'Black,Brown' (15 records)
  'Brown,Red' (8 records)
  

In [122]:
def color_date_correlation_analysis(df, color_col='txtColors', dates_col='txtDatesSeen'):
    """
    Analyze correlation between multi-color entries and date patterns.
    
    HYPOTHESIS: If multi-color entries represent one device observed over time,
    they should MORE OFTEN have date RANGES than single-color entries.
    
    This would suggest: "Black,Red" = one device, observed in black ink on some dates,
    red ink on others, across a span of years.
    """

    # Filter to records with both color and date info
    has_both = df[
        df[color_col].notna() & (df[color_col] != 'NULL') & (df[color_col] != '') &
        df[dates_col].notna() & (df[dates_col] != 'NULL') & (df[dates_col] != '')
    ].copy()
    
    print(f"\nRecords with both color and date: {len(has_both):,}")
    
    # Classify color entries
    has_both['is_multi_color'] = has_both[color_col].str.contains(',', na=False)
    
    # Classify date entries (range vs single date)
    def classify_date(date_str):
        if pd.isna(date_str) or date_str in ['NULL', '', 'n/a']:
            return 'missing'
        date_str = str(date_str)
        if '-' in date_str and not date_str.startswith('-'):
            return 'date_range'
        elif re.match(r'^\d{4}$', date_str):
            return 'year_only'
        elif re.match(r'^[A-Za-z]+\.?\s+\d{1,2},?\s+\d{4}$', date_str):
            return 'specific_date'
        elif re.match(r'^[A-Za-z]+\.?\s+\d{4}$', date_str):
            return 'month_year'
        else:
            return 'other'
    
    has_both['date_type'] = has_both[dates_col].apply(classify_date)
    
    # Cross-tabulation
    print("\n--- Date Pattern by Color Type ---")
    
    single_color = has_both[~has_both['is_multi_color']]
    multi_color = has_both[has_both['is_multi_color']]
    
    print(f"\nSingle-color entries: {len(single_color):,}")
    single_dist = single_color['date_type'].value_counts()
    for dt, count in single_dist.items():
        print(f"  {dt:20} {count:6,} ({count/len(single_color)*100:5.1f}%)")
    
    print(f"\nMulti-color entries: {len(multi_color):,}")
    if len(multi_color) > 0:
        multi_dist = multi_color['date_type'].value_counts()
        for dt, count in multi_dist.items():
            print(f"  {dt:20} {count:6,} ({count/len(multi_color)*100:5.1f}%)")
    
    # Descriptive comparison of date-range shares (no significance test is run here)
    single_range_pct = (single_color['date_type'] == 'date_range').mean() * 100
    multi_range_pct = (multi_color['date_type'] == 'date_range').mean() * 100 if len(multi_color) > 0 else 0
    
    print(f"\n--- KEY FINDING ---")
    print(f"Single-color entries with date ranges: {single_range_pct:.1f}%")
    print(f"Multi-color entries with date ranges:  {multi_range_pct:.1f}%")
    
    if multi_range_pct > single_range_pct:
        if single_range_pct > 0:
            print(f"\n-> Multi-color entries are {multi_range_pct/single_range_pct:.1f}x MORE LIKELY to have date ranges.")
        else:
            print("\n-> Only multi-color entries have date ranges.")
        print("  This SUPPORTS the hypothesis that multi-color = one device over time.")
    else:
        print("\n-> Multi-color entries are not more likely to have date ranges. Hypothesis not supported by this evidence.")
    
    return has_both

In [123]:
# Run color-date correlation analysis
color_date_df = color_date_correlation_analysis(approved)


Records with both color and date: 21,970

--- Date Pattern by Color Type ---

Single-color entries: 16,709
  year_only             6,981 ( 41.8%)
  date_range            4,958 ( 29.7%)
  specific_date         2,980 ( 17.8%)
  other                 1,729 ( 10.3%)
  month_year               61 (  0.4%)

Multi-color entries: 5,261
  date_range            4,429 ( 84.2%)
  year_only               585 ( 11.1%)
  other                   166 (  3.2%)
  specific_date            81 (  1.5%)

--- KEY FINDING ---
Single-color entries with date ranges: 29.7%
Multi-color entries with date ranges:  84.2%

-> Multi-color entries are 2.8x MORE LIKELY to have date ranges.
  This SUPPORTS the hypothesis that multi-color = one device over time.
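
The comparison above is descriptive: the function prints a ratio of two percentages but runs no significance test. As a rough check, the sketch below applies a two-proportion z-test to the counts copied from the output above. The choice of test is an assumption, not part of the original analysis, and it treats records as independent, which may not hold for catalog entries from the same town or cataloger. It measures only the association between multi-color entries and date ranges; it cannot confirm what the comma notation means.

```python
import math

# Counts copied from the cell output above.
multi_range, multi_total = 4429, 5261
single_range, single_total = 4958, 16709


def two_proportion_z(x1, n1, x2, n2):
    """Return (risk_ratio, z) comparing the proportions x1/n1 and x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return p1 / p2, (p1 - p2) / se


risk_ratio, z = two_proportion_z(multi_range, multi_total, single_range, single_total)
# Two-sided p-value under the normal approximation
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"Risk ratio: {risk_ratio:.2f}")
print(f"z statistic: {z:.1f}")
print(f"Two-sided p-value: {p_value:.3g}")
```

With samples this large, almost any difference in proportions will register as statistically significant, so the size of the ratio is more informative than the p-value.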


In [124]:
def extract_color_disambiguation_datasets(df, color_col='txtColors', dates_col='txtDatesSeen',
                                          output_dir='./wip/out/'):
    """
    Generate datasets to help manually disambiguate color interpretation.
    
    Produces:
    1. multi_color_with_ranges.csv - Multi-color entries with date ranges (likely one device)
    2. multi_color_specific_dates.csv - Multi-color entries without a date range (ambiguous)
    3. compound_colors.csv - Entries with space-separated colors (e.g., "Brown black")
    4. color_vocabulary.csv - All unique color values with frequencies
    5. color_by_shape.csv - Color distribution by postmark shape
    """
    from pathlib import Path
    
    # Create output directory
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    print("=" * 70)
    print("GENERATING COLOR DISAMBIGUATION DATASETS")
    print(f"Output directory: {output_dir}")
    print("=" * 70)
    
    # Filter to records with color data
    has_color = df[
        df[color_col].notna() & 
        (df[color_col] != 'NULL') & 
        (df[color_col] != 'n/a') & 
        (df[color_col] != '')
    ].copy()
    
    # Classify entries
    has_color['has_comma'] = has_color[color_col].str.contains(',', na=False)
    has_color['has_space'] = has_color[color_col].str.contains(' ', na=False) & ~has_color['has_comma']
    
    # Date classification
    def has_date_range(date_str):
        if pd.isna(date_str) or str(date_str) in ['NULL', '', 'n/a']:
            return False
        return '-' in str(date_str) and not str(date_str).startswith('-')
    
    has_color['has_date_range'] = has_color[dates_col].apply(has_date_range)
    
    # Select columns for export
    export_cols = [
        'nRawStateDataID', 'txtTown', 'txtPostmark', color_col, 
        'txtTownmarkColor', dates_col, 'txtTownmarkShape', 
        'txtSizes', 'txtRawStateData'
    ]
    export_cols = [c for c in export_cols if c in has_color.columns]
    
    # 1. Multi-color with date ranges
    multi_with_ranges = has_color[has_color['has_comma'] & has_color['has_date_range']][export_cols]
    multi_with_ranges.to_csv(f"{output_dir}/multi_color_with_ranges.csv", index=False)
    print(f"\n1. multi_color_with_ranges.csv: {len(multi_with_ranges):,} records")
    print("   -> These likely represent ONE DEVICE observed in multiple inks over time")
    
    # 2. Multi-color without a date range (ambiguous)
    multi_specific = has_color[has_color['has_comma'] & ~has_color['has_date_range']][export_cols]
    multi_specific.to_csv(f"{output_dir}/multi_color_specific_dates.csv", index=False)
    print(f"\n2. multi_color_specific_dates.csv: {len(multi_specific):,} records")
    print("   -> AMBIGUOUS: Could be one device or multiple observations")
    
    # 3. Compound colors (space-separated)
    compound = has_color[has_color['has_space']][export_cols]
    compound.to_csv(f"{output_dir}/compound_colors.csv", index=False)
    print(f"\n3. compound_colors.csv: {len(compound):,} records")
    print("   -> Space-separated colors (e.g., 'Brown black') - meaning unclear")
    
    # 4. Color vocabulary with frequencies
    color_vocab = has_color[color_col].value_counts().reset_index()
    color_vocab.columns = ['color_value', 'frequency']
    color_vocab['is_multi'] = color_vocab['color_value'].str.contains(',')
    color_vocab['has_space'] = color_vocab['color_value'].str.contains(' ') & ~color_vocab['is_multi']
    color_vocab.to_csv(f"{output_dir}/color_vocabulary.csv", index=False)
    print(f"\n4. color_vocabulary.csv: {len(color_vocab):,} unique color values")
    
    # 5. Color by shape cross-tab
    if 'txtTownmarkShape' in has_color.columns:
        color_shape = has_color.groupby(['txtTownmarkShape', color_col]).size().reset_index(name='count')
        color_shape = color_shape.sort_values('count', ascending=False)
        color_shape.to_csv(f"{output_dir}/color_by_shape.csv", index=False)
        print(f"\n5. color_by_shape.csv: {len(color_shape):,} shape-color combinations")
    
    # Summary statistics
    print("\n" + "=" * 70)
    print("SUMMARY STATISTICS")
    print("=" * 70)
    print(f"Total records with color data: {len(has_color):,}")
    print(f"  - Single color: {(~has_color['has_comma'] & ~has_color['has_space']).sum():,}")
    print(f"  - Multi-color (comma): {has_color['has_comma'].sum():,}")
    print(f"  - Compound (space): {has_color['has_space'].sum():,}")
    
    return {
        'multi_with_ranges': multi_with_ranges,
        'multi_specific': multi_specific,
        'compound': compound,
        'vocabulary': color_vocab
    }

In [125]:
# Generate color disambiguation datasets
# Uncomment to run
# color_datasets = extract_color_disambiguation_datasets(approved)

### Color Ambiguity: Conclusions

#### What the Data Structure SUGGESTS (Inferences, Not Proven):

| Evidence | Finding | Implication |
|----------|---------|-------------|
| **txtColors field** | Stores "Black,Red" as a single string | Designed to keep colors together as one unit |
| **Date correlation** | Multi-color entries more often have date ranges | Multi-color = period of use, not single observation |
| **No normalization** | txtTownmarkColor also holds comma-separated values | Neither field was designed to extract individual colors |

#### What CANNOT Be Proven From Available Documentation:

- The **exact meaning** of comma-separated colors, which the ASCC introduction does not define
- Whether "Black,Red" means one device with multiple inks, or something else
- What space-separated compound colors (like "Brown black") definitively mean
- Whether the data entry conventions were consistent across all catalogers

#### Recommended Approach for Normalization:

1. **Preserve the original string** in txtColors/txtTownmarkColor as-is (authoritative source)
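2. **Create a junction table** for querying by individual colors:
   - `PostmarkColors(postmark_id, color_id, is_primary)`
   - Split comma-separated values when populating
   - Leave `is_primary` unset unless a source supports it: the sampled multi-color values above are listed alphabetically, so list order does not indicate a primary color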
3. **Flag compound colors** (space-separated) for manual review
4. **Accept ambiguity** - some entries may never have a definitive interpretation
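
The splitting and flagging steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `split_color_entry` and `build_junction_rows` are hypothetical helpers, the IDs in `sample` are made up, and rows carry the color name rather than a `color_id`. `is_primary` is omitted because the source data gives no basis for it. Unlike the `has_space` flag used earlier, this sketch flags a space inside any token, including tokens within a comma-separated list.

```python
def split_color_entry(raw):
    """Split a raw color string into individual color tokens.

    Returns (colors, needs_review). Comma-separated parts become separate
    tokens; a token containing a space (e.g. 'Brown black') is kept whole
    and flags the entry for manual review.
    """
    if raw is None or raw.strip() in ("", "NULL", "n/a"):
        return [], False
    colors = [part.strip() for part in raw.split(",") if part.strip()]
    needs_review = any(" " in color for color in colors)
    return colors, needs_review


def build_junction_rows(records):
    """Yield (postmark_id, color, needs_review) rows; the raw string is left untouched."""
    for postmark_id, raw in records:
        colors, needs_review = split_color_entry(raw)
        for color in colors:
            yield (postmark_id, color, needs_review)


# Hypothetical IDs paired with color strings of the kind seen in txtColors.
sample = [(1, "Black,Red"), (2, "Brown black"), (3, "NULL"), (4, "Black,Blue,Red")]
rows = list(build_junction_rows(sample))
for row in rows:
    print(row)
```

Because the original string stays in `txtColors`, the junction rows can be regenerated if the interpretation of commas or compound colors changes.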

The strong correlation between multi-color entries and date ranges is **suggestive but not definitive**. Without explicit ASCC documentation stating the convention, any interpretation remains inference from data patterns.

---
## 8. Decision Framework & Analysis Summary

### Normalization Decision Criteria

**NORMALIZE INTO LOOKUP TABLE WHEN:**
- Cardinality < 50 unique values
- Top 10 values cover > 80% of records
- Users need to FILTER/FACET by this attribute
- Controlled vocabulary adds value (standardization)
- Additional metadata needed (display order, descriptions)

**KEEP AS TEXT FIELD WHEN:**
- High cardinality (approaches record count)
- Values contain nuance that categories would lose
- Historical accuracy requires exact preservation
- Field is sparsely populated (< 30%) and therefore rarely useful for filtering
- Forcing into categories creates "Other" catchall problems

**DUAL APPROACH (both lookup + text) WHEN:**
- Text appears ON the physical artifact (postmark text, rates)
- An FK is needed for filtering/classification
- A text field is needed to preserve exact appearance

**ALWAYS PRESERVE: txtRawStateData**
- Authoritative source text from the ASCC catalog
- Even if perfectly parsed, keep for:
  - Audit trail / provenance
  - Reparsing if rules improve
  - Edge cases that don't fit schema
  - Scholar citation needs
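
The numeric thresholds above can be checked mechanically. The sketch below is a hypothetical helper (`storage_signals` is not part of the notebook) that reports which of the stated numeric criteria a field meets. It returns every matching signal rather than a single verdict, since the criteria do not say which rule wins when a field meets more than one. The qualitative criteria (filtering needs, nuance, historical accuracy) are deliberately left out.

```python
def storage_signals(unique_values, top10_coverage_pct, population_pct):
    """Return the numeric criteria a field meets, using the thresholds listed above."""
    signals = []
    # Lookup table: low cardinality and concentrated values
    if unique_values < 50 and top10_coverage_pct > 80:
        signals.append("lookup table")
    # Text field: sparsely populated
    if population_pct < 30:
        signals.append("text field (sparse)")
    return signals


# txtTownmarkShape figures as reported in this notebook: 24 values, 92% in the
# top 5 (top-10 coverage is at least that high), 2.85% populated.
print(storage_signals(24, 92.0, 2.85))

# Hypothetical high-cardinality, well-populated field: no numeric rule decides.
print(storage_signals(3000, 10.0, 50.0))
```

A field that returns both signals meets the cardinality test for a lookup table while being too sparse to filter on, so the decision falls back to the qualitative criteria.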

In [126]:
# Summary recommendation generator
def generate_normalization_summary(df):
    """
    Generate a summary of normalization recommendations based on the analysis.
    """
    classification_cols = [
        'txtTownmarkShape', 'txtTownmarkLettering', 'txtTownmarkDateFormat',
        'txtTownmarkFraming', 'txtTownmarkColor'
    ]
    
    card = cardinality_analysis(df, classification_cols)
    pop = field_population_report(df)
    
    print("\nSTRONG LOOKUP TABLE CANDIDATES:")
    for _, row in card[card['normalization_candidate'] == True].iterrows():
        print(f"  - {row['column']}: {row['unique_values']} values, "
              f"{row['top_5_concentration_pct']:.0f}% in top 5")
    
    print("\nPRESERVE AS TEXT (high cardinality or sparse):")
    text_candidates = ['txtSizes', 'txtDatesSeen', 'txtColors', 'txtRatesText', 'txtOther']
    for col in text_candidates:
        pop_row = pop[pop['column'] == col]
        if len(pop_row) > 0:
            print(f"  - {col}: {pop_row['unique_values'].values[0]} unique, "
                  f"{pop_row['population_rate'].values[0]:.1f}% populated")
    
    print("\nALWAYS PRESERVE:")
    print("  - txtRawStateData (authoritative source)")
    print("  - txtPostmark (exact marking text)")
    print("  - txtTownPostmark (differs from txtTown 59% of time)")
    
    print("\nCANDIDATES FOR DEPRECATION (<30% populated):")
    deprecated = pop[pop['population_rate'] < 30][['column', 'population_rate']]
    for _, row in deprecated.iterrows():
        print(f"  - {row['column']}: {row['population_rate']:.2f}%")

generate_normalization_summary(approved)


STRONG LOOKUP TABLE CANDIDATES:
  - txtTownmarkShape: 24 values, 92% in top 5
  - txtTownmarkLettering: 4 values, 100% in top 5
  - txtTownmarkDateFormat: 10 values, 85% in top 5
  - txtTownmarkFraming: 6 values, 99% in top 5
  - txtTownmarkColor: 93 values, 90% in top 5

PRESERVE AS TEXT (high cardinality or sparse):
  - txtSizes: 3025 unique, 49.4% populated
  - txtDatesSeen: 9305 unique, 94.8% populated
  - txtColors: 360 unique, 53.9% populated
  - txtRatesText: 3384 unique, 31.8% populated
  - txtOther: 427 unique, 2.3% populated

ALWAYS PRESERVE:
  - txtRawStateData (authoritative source)
  - txtPostmark (exact marking text)
  - txtTownPostmark (differs from txtTown 59% of time)

CANDIDATES FOR DEPRECATION (<30% populated):
  - txtDefaultImage: 14.90%
  - txtTownmarkColor: 8.65%
  - nEarliestUseYear: 8.37%
  - txtTownmarkLettering: 7.45%
  - nLatestUseYear: 5.52%
  - nWidth: 4.30%
  - txtTownmarkShape: 2.85%
  - txtRates: 2.81%
  - nEarliestUseDay: 2.81%
  - txtOther: 2.33%
  - 