# Notebook 02: Data Preprocessing

Systematic preprocessing of the Ames Housing dataset, based on the exploratory findings from Notebook 01, to prepare clean data for machine learning model development.

## 1. Data Loading and Initial Processing

Load datasets and implement parser-guided missing data treatment strategies identified during exploratory analysis.

### 1.1 Dataset Import and Validation


In [57]:
# Load required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Load the datasets
df_train = pd.read_csv('../data/raw/train.csv')
df_test = pd.read_csv('../data/raw/test.csv')

print("Dataset Import Validation:")
print(f"Training data: {df_train.shape}")
print(f"Test data: {df_test.shape}")

# Create combined dataset for consistent feature processing
df_combined = pd.concat([
    df_train.drop('SalePrice', axis=1),
    df_test
], ignore_index=True)
df_combined['dataset_source'] = ['train']*len(df_train) + ['test']*len(df_test)

print(f"Combined dataset: {df_combined.shape}")
print(f"Features to process: {df_combined.shape[1] - 1}")

# Validate against Notebook 01 findings
expected_missing_features = 34
actual_missing_features = df_combined.drop('dataset_source', axis=1).isnull().any().sum()
print(f"\nMissing data validation:")
print(f"Expected features with missing data: {expected_missing_features}")
print(f"Actual features with missing data: {actual_missing_features}")
print(f"Validation: {'✓ PASS' if actual_missing_features == expected_missing_features else '✗ FAIL'}")

Dataset Import Validation:
Training data: (1460, 81)
Test data: (1459, 80)
Combined dataset: (2919, 81)
Features to process: 80

Missing data validation:
Expected features with missing data: 34
Actual features with missing data: 34
Validation: ✓ PASS


Dataset validation succeeded: the combined dataset has 2,919 rows and 80 features, and 34 features contain missing values, matching the Notebook 01 analysis.
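The combine-and-validate pattern used in this cell can be sketched on a toy frame. The values below are made up for illustration and the variable names are not from the notebook; only the steps (drop the target, concatenate, tag the source, count columns that contain NaN) mirror the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv / test.csv (illustrative values only)
train = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "Alley": [np.nan, "Grvl", np.nan],
    "YearBuilt": [2003, 1976, 2001],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({
    "LotFrontage": [70.0, 60.0],
    "Alley": ["Pave", np.nan],
    "YearBuilt": [1961, 1958],
})

# Drop the target, stack the frames, and record where each row came from
combined = pd.concat([train.drop(columns="SalePrice"), test], ignore_index=True)
combined["dataset_source"] = ["train"] * len(train) + ["test"] * len(test)

# Count feature columns that contain at least one missing value
feature_cols = combined.columns.drop("dataset_source")
n_missing_features = int(combined[feature_cols].isnull().any().sum())
print(n_missing_features)
```

Keeping the source label in a helper column makes it straightforward to split the frame back into train and test rows once preprocessing is finished.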

### 1.2 Parser Integration Setup

In [58]:
# Setup data description parser for domain knowledge
from data_description_parser import (
    load_feature_descriptions,
    quick_feature_lookup,
    display_summary_table,
    get_categorical_features,
    get_numerical_features
)

# Load official documentation
feature_descriptions = load_feature_descriptions()
print("Parser Integration Setup:")
print("✓ Official real estate documentation loaded successfully")

# Get feature classifications for preprocessing
categorical_features = get_categorical_features(feature_descriptions)
numerical_features = get_numerical_features(feature_descriptions)

print(f"✓ Categorical features identified: {len(categorical_features)}")
print(f"✓ Numerical features identified: {len(numerical_features)}")

# Verify integer-coded categorical features flagged in Notebook 01
# (OverallQual and OverallCond are ordinal; MSSubClass is nominal)
misclassified_features = ['OverallQual', 'OverallCond', 'MSSubClass']
print(f"\nMisclassified ordinal features to correct:")
for feature in misclassified_features:
    feature_type = 'Categorical' if feature in categorical_features else 'Numerical'
    pandas_type = df_train[feature].dtype
    print(f"  {feature}: Parser={feature_type}, Pandas={pandas_type}")

Parser Integration Setup:
✓ Official real estate documentation loaded successfully
✓ Categorical features identified: 46
✓ Numerical features identified: 33

Misclassified ordinal features to correct:
  OverallQual: Parser=Categorical, Pandas=int64
  OverallCond: Parser=Categorical, Pandas=int64
  MSSubClass: Parser=Categorical, Pandas=int64


Parser integration confirmed 46 categorical and 33 numerical features. Three features that the documentation classifies as categorical (OverallQual, OverallCond, MSSubClass) are stored as int64 in pandas and require correction.
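The mismatch check above boils down to comparing a documented feature type with the dtype pandas inferred. A minimal sketch follows, assuming the documented categorical names are available as a plain set; `documented_categorical` is a hypothetical stand-in, not the return value of the notebook's parser.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "OverallQual": [7, 6, 8],
    "MSSubClass": [60, 20, 60],
    "GrLivArea": [1710, 1262, 1786],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Hypothetical stand-in for the parser's list of categorical features
documented_categorical = {"OverallQual", "MSSubClass", "Street"}

# Features documented as categorical but stored with a numeric dtype
integer_coded = sorted(
    col for col in df.columns
    if col in documented_categorical and is_numeric_dtype(df[col])
)
print(integer_coded)
```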

## 2. Feature Classification Correction

Correct the integer-stored categorical features identified in Notebook 01 before missing data treatment so that each carries the proper data type.

### 2.1 Ordinal Feature Correction

In [59]:
# Correct integer-coded categorical features identified in Notebook 01
# (the list keeps its original name; MSSubClass is nominal and is handled separately below)
ordinal_features = ['OverallQual', 'OverallCond', 'MSSubClass']

print("Ordinal Feature Correction:")
print("Converting integer-stored ordinal features to proper categorical types")

# Show current state before correction
print(f"\nBefore correction:")
for feature in ordinal_features:
    dtype = df_combined[feature].dtype
    unique_vals = sorted(df_combined[feature].unique())
    print(f"  {feature}: {dtype} with {len(unique_vals)} unique values: {unique_vals}")

# Convert to ordered categorical for combined dataset
print(f"\nApplying corrections:")
for feature in ordinal_features:
    if feature == 'MSSubClass':
        # MSSubClass: dwelling type categories
        print(f"  {feature}: Converting to unordered categorical (dwelling types)")
        df_combined[feature] = df_combined[feature].astype('category')
    else:
        # OverallQual and OverallCond: 1-10 quality scales
        print(f"  {feature}: Converting to ordered categorical (quality scale)")
        df_combined[feature] = pd.Categorical(df_combined[feature],
                                            categories=sorted(df_combined[feature].unique()),
                                            ordered=True)

print(f"\nAfter correction:")
for feature in ordinal_features:
    dtype = df_combined[feature].dtype
    is_ordered = hasattr(df_combined[feature], 'cat') and df_combined[feature].cat.ordered
    print(f"  {feature}: {dtype} (ordered: {is_ordered})")
    
    # Show categories for verification
    if hasattr(df_combined[feature], 'cat'):
        categories = list(df_combined[feature].cat.categories)
        print(f"    Categories: {categories}")

Ordinal Feature Correction:
Converting integer-stored ordinal features to proper categorical types

Before correction:
  OverallQual: int64 with 10 unique values: [np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10)]
  OverallCond: int64 with 9 unique values: [np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)]
  MSSubClass: int64 with 16 unique values: [np.int64(20), np.int64(30), np.int64(40), np.int64(45), np.int64(50), np.int64(60), np.int64(70), np.int64(75), np.int64(80), np.int64(85), np.int64(90), np.int64(120), np.int64(150), np.int64(160), np.int64(180), np.int64(190)]

Applying corrections:
  OverallQual: Converting to ordered categorical (quality scale)
  OverallCond: Converting to ordered categorical (quality scale)
  MSSubClass: Converting to unordered categorical (dwelling types)

After correction:
  OverallQual: category (ordered

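Converted the 3 integer-coded features to categorical types. OverallQual and OverallCond are now ordered categoricals that preserve their rating order. Because the categories are taken from the observed values, OverallCond carries levels 1-9 rather than the full documented 1-10 scale. MSSubClass is an unordered categorical representing dwelling types.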
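One alternative to deriving the categories from observed values is to declare the documented scale explicitly, so that levels absent from the data are still part of the type. The sketch below assumes a 1-10 rating scale and uses a toy series; `quality_scale` is an illustrative name, not something defined in the notebook.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Documented 1-10 rating scale, declared once so it can be reused
quality_scale = CategoricalDtype(categories=list(range(1, 11)), ordered=True)

overall_cond = pd.Series([5, 8, 9, 3], name="OverallCond").astype(quality_scale)

# Level 10 is kept as a category even though no row uses it
print(list(overall_cond.cat.categories))

# Ordered categoricals support comparisons against a level on the scale
above_average = overall_cond > 5
print(above_average.tolist())
```

Declaring the dtype up front also guarantees that train and test frames share identical categories if they are ever converted separately.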

## 3. Missing Data Treatment

Systematic analysis and treatment of the 34 features with missing data, using parser consultation for domain-guided decisions.

### 3.1 Missing Data Analysis

In [60]:
# Get all features with missing data from combined dataset
missing_data = df_combined.drop('dataset_source', axis=1).isnull().sum()
missing_features = missing_data[missing_data > 0].sort_values(ascending=False)

print("Missing Data Overview:")
print(f"Total features with missing data: {len(missing_features)}")
print(f"Total missing values: {missing_features.sum()}")
n_features = df_combined.shape[1] - 1  # exclude the dataset_source helper column
total_cells = len(df_combined) * n_features
print(f"Dataset completeness: {(1 - missing_features.sum() / total_cells) * 100:.1f}%")

print(f"\nTop 10 features with missing data:")
for feature, count in missing_features.head(10).items():
    pct = (count / len(df_combined)) * 100
    print(f"  {feature}: {count} missing ({pct:.1f}%)")

print(f"\nAll missing features for systematic treatment:")
for feature, count in missing_features.items():
    pct = (count / len(df_combined)) * 100
    print(f"  {feature}: {count} ({pct:.1f}%)")

Missing Data Overview:
Total features with missing data: 34
Total missing values: 15707
Dataset completeness: 93.3%

Top 10 features with missing data:
  PoolQC: 2909 missing (99.7%)
  MiscFeature: 2814 missing (96.4%)
  Alley: 2721 missing (93.2%)
  Fence: 2348 missing (80.4%)
  MasVnrType: 1766 missing (60.5%)
  FireplaceQu: 1420 missing (48.6%)
  LotFrontage: 486 missing (16.6%)
  GarageFinish: 159 missing (5.4%)
  GarageQual: 159 missing (5.4%)
  GarageCond: 159 missing (5.4%)

All missing features for systematic treatment:
  PoolQC: 2909 (99.7%)
  MiscFeature: 2814 (96.4%)
  Alley: 2721 (93.2%)
  Fence: 2348 (80.4%)
  MasVnrType: 1766 (60.5%)
  FireplaceQu: 1420 (48.6%)
  LotFrontage: 486 (16.6%)
  GarageFinish: 159 (5.4%)
  GarageQual: 159 (5.4%)
  GarageCond: 159 (5.4%)
  GarageYrBlt: 159 (5.4%)
  GarageType: 157 (5.4%)
  BsmtExposure: 82 (2.8%)
  BsmtCond: 82 (2.8%)
  BsmtQual: 81 (2.8%)
  BsmtFinType2: 80 (2.7%)
  BsmtFinType1: 79 (2.7%)
  MasVnrArea: 23 (0.8%)
  MSZoning: 4 (

Confirmed 34 features with 15,707 missing values across the combined dataset. Two patterns stand out: amenity features that are mostly missing (PoolQC, MiscFeature, Alley, and Fence each exceed 80%), and coordinated feature groups (garage ~5.4%, basement ~2.8%) whose members are missing together, which suggests the structure itself is absent rather than unrecorded.
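Whether a feature group is missing together can be checked row by row. The following is a minimal sketch on toy data: rows where every column in the group is missing are consistent with an absent structure, while rows where only some are missing look more like genuine data gaps. The small differences in the counts above (for example 157 vs 159 for the garage columns) hint that a few such partial rows exist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GarageType":   ["Attchd", np.nan, "Detchd", np.nan, "Detchd"],
    "GarageFinish": ["RFn",    np.nan, "Unf",    np.nan, np.nan],
    "GarageQual":   ["TA",     np.nan, "TA",     np.nan, np.nan],
})

group = ["GarageType", "GarageFinish", "GarageQual"]
n_missing_in_group = df[group].isnull().sum(axis=1)

# Every column in the group is missing: consistent with "no garage"
all_missing = n_missing_in_group == len(group)
# Only some columns are missing: likely a genuine data gap
partially_missing = (n_missing_in_group > 0) & ~all_missing

print(int(all_missing.sum()), int(partially_missing.sum()))
```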

### 3.2 Parser-Guided Feature Analysis

In [61]:
# Parser consultation for missing features - discovery phase
print(f"\nSystematic Parser Consultation for Missing Features:")
print("="*70)

# Get all features with missing data 
all_missing_features = missing_features.index.tolist()  # From section 3.1
print(f"Total features to analyze: {len(all_missing_features)}")

# Process each feature individually
for i, feature in enumerate(all_missing_features, 1):
    missing_count = df_combined[feature].isnull().sum()
    missing_pct = (missing_count / len(df_combined)) * 100
    
    print(f"\n{i}. {feature}")
    print("-" * 50)
    print(f"Missing: {missing_count} values ({missing_pct:.1f}%)")
    
    # Parser consultation
    quick_feature_lookup(feature, feature_descriptions)
    
    # Show feature context and distribution
    if df_combined[feature].dtype == 'object':
        # Categorical feature
        unique_values = df_combined[feature].dropna().unique()
        print(f"Data type: Categorical ({len(unique_values)} unique values)")
        
        # Value distribution 
        value_counts = df_combined[feature].value_counts()
        print(f"Value distribution:")
        for value, count in value_counts.items():
            pct = (count / value_counts.sum()) * 100
            print(f"  {value}: {count} ({pct:.1f}%)")
    else:
        # Numerical feature  
        train_mask = df_combined['dataset_source'] == 'train'
        train_data = df_combined[train_mask][feature].dropna()
        print(f"Data type: Numerical")
        print(f"Range: {train_data.min():.1f} - {train_data.max():.1f}")
        print(f"Stats: Mean={train_data.mean():.1f}, Median={train_data.median():.1f}")
        
        # Check for zero values
        zero_count = (train_data == 0).sum()
        if zero_count > 0:
            zero_pct = (zero_count / len(train_data)) * 100
            print(f"Zero values: {zero_count} ({zero_pct:.1f}%)")
    

print(f"\n✓ Parser consultation completed for all {len(all_missing_features)} features")


Systematic Parser Consultation for Missing Features:
Total features to analyze: 34

1. PoolQC
--------------------------------------------------
Missing: 2909 values (99.7%)
Feature: PoolQC
Description: Pool quality
Type: Categorical

Categories:
  Ex: Excellent
  Gd: Good
  TA: Average/Typical
  Fa: Fair
  NA: No Pool
------------------------------------------------------------
Data type: Categorical (3 unique values)
Value distribution:
  Ex: 4 (40.0%)
  Gd: 4 (40.0%)
  Fa: 2 (20.0%)

2. MiscFeature
--------------------------------------------------
Missing: 2814 values (96.4%)
Feature: MiscFeature
Description: Miscellaneous feature not covered in other categories
Type: Categorical

Categories:
  Elev: Elevator
  Gar2: 2nd Garage (if not described in garage section)
  Othr: Other
  Shed: Shed (over 100 SF)
  TenC: Tennis Court
  NA: None
------------------------------------------------------------
Data type: Categorical (4 unique values)
Value distribution:
  Shed: 95 (90.5%)
  Gar2

Based on parser consultation, categorize features into treatment strategies.
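The key signal from the consultation is whether the documentation defines an `NA` code for a feature (for example "NA: No Pool"), in which case a missing value most plausibly means the structure is absent. A rough sketch of that rule follows; the `documented_codes` dict is a hand-written, abbreviated excerpt and an assumption about structure, not the parser's actual return format.

```python
# Hypothetical, abbreviated excerpt of documented category codes per feature.
# The real parser's data structure may differ; this dict is an assumption.
documented_codes = {
    "PoolQC": {"Ex": "Excellent", "Gd": "Good", "TA": "Average/Typical",
               "Fa": "Fair", "NA": "No Pool"},
    "MiscFeature": {"Elev": "Elevator", "Gar2": "2nd Garage", "Othr": "Other",
                    "Shed": "Shed (over 100 SF)", "TenC": "Tennis Court",
                    "NA": "None"},
    "KitchenQual": {"Ex": "Excellent", "Gd": "Good", "TA": "Typical/Average",
                    "Fa": "Fair", "Po": "Poor"},
}


def missing_means_absent(feature: str) -> bool:
    """Return True when the documentation defines an 'NA' code for the feature."""
    return "NA" in documented_codes.get(feature, {})


# Features where NaN can reasonably be replaced by an explicit 'None' level
none_fill_candidates = [f for f in documented_codes if missing_means_absent(f)]
print(none_fill_candidates)
```

Features without an `NA` code, such as KitchenQual here, fall through to a different strategy (mode or median imputation) because their missing values are true gaps.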

### 3.3 Missing Data Treatment Implementation

In [62]:
# Categorize features based on parser consultation findings and logical reasoning
print("Feature Categorization Based on Parser Analysis:")

# GROUP 1: Amenity/Optional Structure Features - Missing = Structure Doesn't Exist
# Logic: If no pool/fence/alley exists, there's no quality/type to rate → "None"
amenity_none_features = [
    'PoolQC',        # 99.7% missing - most houses don't have pools
    'MiscFeature',   # 96.4% missing - most houses don't have elevators/tennis courts
    'Alley',         # 93.2% missing - most houses don't have alley access
    'Fence',         # 80.4% missing - many houses don't have fences
    'FireplaceQu'    # 48.6% missing - many houses don't have fireplaces
]

# GROUP 2: Coordinated Structure Features - Missing = No Garage/Basement Exists
# Logic: Garage and basement features missing together indicate absent structure → "None"
garage_none_features = [
    'GarageFinish',  # 5.4% missing - no garage = no finish to evaluate
    'GarageQual',    # 5.4% missing - no garage = no quality to rate
    'GarageCond',    # 5.4% missing - no garage = no condition to assess
    'GarageType'     # 5.4% missing - no garage = no type to classify
]

basement_none_features = [
    'BsmtExposure',  # 2.8% missing - no basement = no exposure to evaluate
    'BsmtCond',      # 2.8% missing - no basement = no condition to assess
    'BsmtQual',      # 2.8% missing - no basement = no quality to rate
    'BsmtFinType2',  # 2.7% missing - no basement = no finished area types
    'BsmtFinType1'   # 2.7% missing - no basement = no finished area types
]

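# GROUP 3: Masonry Special Case - documentation defines an explicit "None" category
# Note: pandas >= 2.0 treats the literal string "None" as NaN in read_csv by default,
# which likely explains the 60.5% missing rate for this feature
masonry_none_features = ['MasVnrType']  # Restore 'None' for houses without masonry veneer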

# Combine all "None" replacement features
all_none_features = amenity_none_features + garage_none_features + basement_none_features + masonry_none_features

# GROUP 4: Geographic Features - Use Neighborhood Context
# Logic: Properties in same neighborhood have similar characteristics
geographic_features = [
    'LotFrontage'    # Street frontage varies by neighborhood development patterns
]

# GROUP 5: Coordinated Numerical Features - Absent Structure = 0 Value
# Logic: If structure doesn't exist, measurements should be 0, not estimated
coordinated_numerical = {
    'GarageYrBlt': 'YearBuilt if garage exists, 0 if not',  # Garage built with house
    'MasVnrArea': '0 (coordinated with masonry absence)',  # Coordinated with MasVnrType=None
    'GarageArea': '0 (coordinated with garage absence)',
    'GarageCars': '0 (coordinated with garage absence)',
    'TotalBsmtSF': '0 (coordinated with basement absence)',
    'BsmtUnfSF': '0 (coordinated with basement absence)',
    'BsmtFinSF2': '0 (coordinated with basement absence)',
    'BsmtFinSF1': '0 (coordinated with basement absence)',
    'BsmtFullBath': '0 (coordinated with basement absence)',
    'BsmtHalfBath': '0 (coordinated with basement absence)'
}

# GROUP 6A: Geographic Categorical Features - Use Neighborhood Mode
# Logic: These features cluster geographically within neighborhoods
geographic_categorical = {
    'MSZoning': 'neighborhood mode (zoning clusters geographically)',
    'Exterior1st': 'neighborhood mode (architectural styles cluster)',
    'Exterior2nd': 'neighborhood mode (architectural styles cluster)',
    'SaleType': 'neighborhood mode (construction patterns vary by area)'
}

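# GROUP 6B: System/Standard Features - Use a Fixed Typical Value
# Logic: Technical standards that don't vary geographically; the fill values are
# hard-coded below and assumed to match the overall mode of each feature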
system_categorical = {
    'Electrical': 'SBrkr (standard circuit breakers)',
    'Utilities': 'AllPub (all public utilities)',
    'Functional': 'Typ (typical functionality)',
    'KitchenQual': 'TA (typical/average)'
}

print(f"✓ Amenity features: {len(amenity_none_features)}")
print(f"✓ Garage features: {len(garage_none_features)}")
print(f"✓ Basement features: {len(basement_none_features)}")
print(f"✓ Masonry features: {len(masonry_none_features)}")
print(f"✓ Geographic numerical: {len(geographic_features)}")
print(f"✓ Coordinated numerical: {len(coordinated_numerical)}")
print(f"✓ Geographic categorical: {len(geographic_categorical)}")
print(f"✓ System categorical: {len(system_categorical)}")

# STEP 1: Apply "None" replacement for structural absence
print("\n=== STEP 1: 'None' Replacement for Absent Structures ===")
for feature in all_none_features:
    before_missing = df_combined[feature].isnull().sum()
    df_combined[feature] = df_combined[feature].fillna('None')
    print(f"✓ {feature}: {before_missing} missing → 'None'")

print(f"\n✓ Completed 'None' replacement for {len(all_none_features)} features")

# STEP 2: Geographic features with neighborhood context
print("\n=== STEP 2: Geographic Features (Neighborhood Context) ===")
train_mask = df_combined['dataset_source'] == 'train'

for feature in geographic_features:
    if df_combined[feature].isnull().sum() > 0:
        missing_count = df_combined[feature].isnull().sum()
        df_combined[feature] = df_combined.groupby('Neighborhood')[feature].transform(
            lambda x: x.fillna(x.median())
        )
        print(f"✓ {feature}: {missing_count} missing → neighborhood median (geographic context)")

# STEP 3: Smart Coordinated Numerical (Neighborhood-Aware)
# Logic: If structure exists but data missing → use neighborhood median
# Why: Building patterns and standards cluster by neighborhood
print("\n=== STEP 3: Coordinated Numerical Features (Neighborhood-Aware Logic) ===")

# GarageYrBlt - smart imputation based on garage existence
if 'GarageYrBlt' in coordinated_numerical and df_combined['GarageYrBlt'].isnull().sum() > 0:
    missing_count = df_combined['GarageYrBlt'].isnull().sum()

    # If garage exists, use YearBuilt (garage typically built with house)
    garage_exists_mask = df_combined['GarageType'] != 'None'
    garage_missing_mask = df_combined['GarageYrBlt'].isnull()

    df_combined.loc[garage_exists_mask & garage_missing_mask, 'GarageYrBlt'] = df_combined.loc[garage_exists_mask & garage_missing_mask, 'YearBuilt']
    df_combined.loc[~garage_exists_mask & garage_missing_mask, 'GarageYrBlt'] = 0

    print(f"✓ GarageYrBlt: {missing_count} missing → YearBuilt if garage exists, 0 if not")

# Garage area/capacity - neighborhood median for existing garages
garage_area_features = ['GarageArea', 'GarageCars']
for feature in garage_area_features:
    if feature in coordinated_numerical and df_combined[feature].isnull().sum() > 0:
        missing_count = df_combined[feature].isnull().sum()

        # Neighborhood median for houses with garages, 0 for houses without
        garage_exists_mask = df_combined['GarageType'] != 'None'
        missing_mask = df_combined[feature].isnull()

        # Use neighborhood median for existing garages (suburban vs urban differ)
        neighborhood_values = df_combined[garage_exists_mask].groupby('Neighborhood')[feature].transform('median')
        df_combined.loc[garage_exists_mask & missing_mask, feature] = neighborhood_values[missing_mask & garage_exists_mask]
        df_combined.loc[~garage_exists_mask & missing_mask, feature] = 0

        # Count imputation split
        garage_missing_count = (garage_exists_mask & missing_mask).sum()
        no_garage_missing_count = (~garage_exists_mask & missing_mask).sum()

        print(f"✓ {feature}: {missing_count} missing → {garage_missing_count} neighborhood median, {no_garage_missing_count} set to 0")

# Basement features - neighborhood median for existing basements
basement_features = ['TotalBsmtSF', 'BsmtUnfSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath']
for feature in basement_features:
    if feature in coordinated_numerical and df_combined[feature].isnull().sum() > 0:
        missing_count = df_combined[feature].isnull().sum()

        # Check if basement exists using BsmtQual as indicator
        basement_exists_mask = df_combined['BsmtQual'] != 'None'
        missing_mask = df_combined[feature].isnull()

        # Use neighborhood median for existing basements (local codes create consistency)
        if basement_exists_mask.any() and missing_mask.any():
            neighborhood_values = df_combined[basement_exists_mask].groupby('Neighborhood')[feature].transform('median')
            df_combined.loc[basement_exists_mask & missing_mask, feature] = neighborhood_values[missing_mask & basement_exists_mask]
        df_combined.loc[~basement_exists_mask & missing_mask, feature] = 0

        # Count imputation split
        basement_missing_count = (basement_exists_mask & missing_mask).sum()
        no_basement_missing_count = (~basement_exists_mask & missing_mask).sum()

        print(f"✓ {feature}: {missing_count} missing → {basement_missing_count} neighborhood median, {no_basement_missing_count} set to 0")

# Masonry area - neighborhood median by masonry type
if 'MasVnrArea' in coordinated_numerical and df_combined['MasVnrArea'].isnull().sum() > 0:
    missing_count = df_combined['MasVnrArea'].isnull().sum()

    # Use neighborhood median for houses with masonry (local styles cluster)
    masonry_exists_mask = df_combined['MasVnrType'] != 'None'
    missing_mask = df_combined['MasVnrArea'].isnull()

    if masonry_exists_mask.any():
        neighborhood_values = df_combined[masonry_exists_mask].groupby(['Neighborhood', 'MasVnrType'])['MasVnrArea'].transform('median')
        df_combined.loc[masonry_exists_mask & missing_mask, 'MasVnrArea'] = neighborhood_values[missing_mask & masonry_exists_mask]
    df_combined.loc[~masonry_exists_mask & missing_mask, 'MasVnrArea'] = 0

    # Count imputation split
    masonry_missing_count = (masonry_exists_mask & missing_mask).sum()
    no_masonry_missing_count = (~masonry_exists_mask & missing_mask).sum()

    print(f"✓ MasVnrArea: {missing_count} missing → {masonry_missing_count} neighborhood median, {no_masonry_missing_count} set to 0")

# STEP 4A: Geographic categorical features (neighborhood mode)
print("\n=== STEP 4A: Geographic Categorical Features (Neighborhood Mode) ===")
for feature, reason in geographic_categorical.items():
    if df_combined[feature].isnull().sum() > 0:
        missing_count = df_combined[feature].isnull().sum()

        # Geographic features use combined data (external structural knowledge)
        neighborhood_mode = df_combined.groupby('Neighborhood')[feature].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None)

        # Fill missing values with neighborhood mode
        for neighborhood in df_combined['Neighborhood'].unique():
            if neighborhood in neighborhood_mode.index and pd.notna(neighborhood_mode[neighborhood]):
                mask = (df_combined['Neighborhood'] == neighborhood) & df_combined[feature].isnull()
                df_combined.loc[mask, feature] = neighborhood_mode[neighborhood]

        # Fallback: use overall mode for any remaining missing values
        if df_combined[feature].isnull().sum() > 0:
            overall_mode = df_combined[feature].mode()[0]
            # Assign back: chained inplace fillna is deprecated and may not update df_combined
            df_combined[feature] = df_combined[feature].fillna(overall_mode)

        print(f"✓ {feature}: {missing_count} missing → {reason}")

# STEP 4B: System/standard categorical features (overall mode)
print("\n=== STEP 4B: System/Standard Features (Overall Mode) ===")
for feature, reason in system_categorical.items():
    if df_combined[feature].isnull().sum() > 0:
        missing_count = df_combined[feature].isnull().sum()
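        # Extract the fill value: the text before ' (' in the reason string
        fill_value = reason.split(' (')[0]
        # Assign back: chained inplace fillna is deprecated and may not update df_combined
        df_combined[feature] = df_combined[feature].fillna(fill_value)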
        print(f"✓ {feature}: {missing_count} missing → '{fill_value}' ({reason})")

print(f"\n✓ Completed systematic missing data treatment using intelligent strategies")

# Validate missing data treatment
final_missing = df_combined.drop('dataset_source', axis=1).isnull().sum()
remaining_missing = final_missing[final_missing > 0]

print(f"\nMissing Data Treatment Validation:")
print(f"Features with remaining missing values: {len(remaining_missing)}")
if len(remaining_missing) == 0:
    print("✓ All missing data successfully treated")
else:
    print("Remaining missing values:")
    for feature, count in remaining_missing.items():
        print(f"  {feature}: {count}")

Feature Categorization Based on Parser Analysis:
✓ Amenity features: 5
✓ Garage features: 4
✓ Basement features: 5
✓ Masonry features: 1
✓ Geographic numerical: 1
✓ Coordinated numerical: 10
✓ Geographic categorical: 4
✓ System categorical: 4

=== STEP 1: 'None' Replacement for Absent Structures ===
✓ PoolQC: 2909 missing → 'None'
✓ MiscFeature: 2814 missing → 'None'
✓ Alley: 2721 missing → 'None'
✓ Fence: 2348 missing → 'None'
✓ FireplaceQu: 1420 missing → 'None'
✓ GarageFinish: 159 missing → 'None'
✓ GarageQual: 159 missing → 'None'
✓ GarageCond: 159 missing → 'None'
✓ GarageType: 157 missing → 'None'
✓ BsmtExposure: 82 missing → 'None'
✓ BsmtCond: 82 missing → 'None'
✓ BsmtQual: 81 missing → 'None'
✓ BsmtFinType2: 80 missing → 'None'
✓ BsmtFinType1: 79 missing → 'None'
✓ MasVnrType: 1766 missing → 'None'

✓ Completed 'None' replacement for 15 features

=== STEP 2: Geographic Features (Neighborhood Context) ===
✓ LotFrontage: 486 missing → neighborhood median (geographic context)

==

Systematic missing data treatment completed using parser-guided strategy with neighborhood-aware imputation for coordinated features.
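Neighborhood-median imputation can leave gaps when a neighborhood has no observed values for the feature. The sketch below is one possible variant, not the notebook's exact method: it learns the medians from training rows only (a common precaution against test-set leakage, whereas the cell above uses the combined data) and falls back to the global training median. The data and neighborhood assignments are toy values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert", "Blueste"],
    "LotFrontage":  [70.0,    80.0,    np.nan,  60.0,      np.nan,    np.nan],
    "dataset_source": ["train", "train", "test", "train", "test", "test"],
})

train_rows = df[df["dataset_source"] == "train"]

# Medians learned from training rows only
neighborhood_median = train_rows.groupby("Neighborhood")["LotFrontage"].median()
global_median = train_rows["LotFrontage"].median()

# Look up each row's neighborhood median; neighborhoods with no training
# observations fall back to the global training median
fill_values = df["Neighborhood"].map(neighborhood_median).fillna(global_median)
df["LotFrontage"] = df["LotFrontage"].fillna(fill_values)

print(df["LotFrontage"].tolist())
```

Using `map` with a fallback keeps the result free of NaN even for neighborhoods that appear only in the test rows, which a plain `groupby(...).transform` would leave unfilled.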