### Overview — what preprocessing should do

1. Clean obvious problems (IDs, wrong dtypes).

2. Handle missing values safely (fit imputers on training data only).

3. Engineer a few robust features (age, totals, flags).

4. Encode categorical variables (ordinal vs nominal).

5. Handle skewed numeric features (log1p when needed).

6. Scale if required (for linear models).

7. Build a single sklearn Pipeline/ColumnTransformer that does everything reproducibly.

8. Save the fitted pipeline for inference.

### 0 — Imports and load (start in a notebook/script)

In [3]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, FunctionTransformer
import joblib


### 1 — Quick cleanup & target transformation

Why: remove non-features and stabilise the target distribution.

In [5]:
df = pd.read_csv("../data/train.csv")
# drop Id (not a feature)
if 'Id' in df.columns:
    df = df.drop(columns=['Id'])

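# If target is skewed, log-transform for training stability (invert predictions with np.expm1)
# NOTE: 'price' is the target column name assumed here; change it if your CSV uses another name such as 'SalePrice'
target_col = 'price'
df['SalePrice_log'] = np.log1p(df[target_col])
y = df['SalePrice_log']
X = df.drop(columns=[target_col, 'SalePrice_log'])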
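
Because the model is trained on the log of the target, its predictions come out on the log scale and have to be mapped back before they are reported. A minimal sketch of that round trip, using made-up prices rather than values from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sale prices, for illustration only
prices = pd.Series([50_000.0, 120_000.0, 755_000.0])

y_log = np.log1p(prices)       # what the model is trained on
recovered = np.expm1(y_log)    # what you report after predicting

# The round trip is exact up to floating-point error
print(np.allclose(recovered, prices))
```

`np.expm1` is the exact inverse of `np.log1p`, so no extra constant needs to be tracked.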


### 2 — Decide column groups (programmatic)

Why: separate numeric, ordinal, nominal for different transforms.

In [6]:
# 'number' catches every numeric dtype, not only int64/float64
num_cols = X.select_dtypes(include='number').columns.tolist()
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
# optionally remove year columns from numeric if you will create age features
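
The optional step of removing year columns can be sketched on a toy frame. The values below are invented and the `_demo` names are only for illustration; the column names follow the housing data used in this notebook.

```python
import pandas as pd

# Toy frame with hypothetical values
X_demo = pd.DataFrame({
    "LotArea": [8450, 9600],
    "YearBuilt": [2003, 1976],
    "YrSold": [2008, 2007],
    "Neighborhood": ["CollgCr", "Veenker"],
})

num_cols_demo = X_demo.select_dtypes(include="number").columns.tolist()
cat_cols_demo = X_demo.select_dtypes(exclude="number").columns.tolist()

# Year columns are replaced by age features later, so keep them out of the numeric group
year_cols = ["YearBuilt", "YrSold"]
num_cols_demo = [c for c in num_cols_demo if c not in year_cols]

print(num_cols_demo, cat_cols_demo)
```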


### 3 — Custom feature engineering transformer

Why: create features like TotalSF, HouseAge, HasPool reproducibly inside the pipeline.

In [7]:
class FeatureEngineer(BaseEstimator, TransformerMixin):
    """Stateless feature builder: nothing is learned in fit."""
    def fit(self, X, y=None):
        return self

    @staticmethod
    def _col(X, name):
        # Column with NaN -> 0, or a column of zeros if it is absent
        return X[name].fillna(0) if name in X.columns else pd.Series(0, index=X.index)

    def transform(self, X):
        X = X.copy()
        # Example engineered features
        # GrLivArea is left out of TotalSF: in the Ames data it already contains 1stFlrSF + 2ndFlrSF
        X['TotalSF'] = self._col(X, 'TotalBsmtSF') + self._col(X, '1stFlrSF') + self._col(X, '2ndFlrSF')
        # Half baths count as 0.5
        X['TotalBath'] = (self._col(X, 'FullBath') + 0.5 * self._col(X, 'HalfBath')
                          + self._col(X, 'BsmtFullBath') + 0.5 * self._col(X, 'BsmtHalfBath'))
        X['HouseAge'] = X['YrSold'] - X['YearBuilt']
        X['RemodAge'] = X['YrSold'] - X['YearRemodAdd']
        X['HasPool'] = (self._col(X, 'PoolArea') > 0).astype(int)
        X['HasGarage'] = (self._col(X, 'GarageArea') > 0).astype(int)
        X['HasBsmt'] = (self._col(X, 'TotalBsmtSF') > 0).astype(int)
        # add more engineered features as needed
        return X


### 4 — Missing value strategy (rules & implementation)

Why: different kinds of missingness call for different imputations.

* If missing means "none" (e.g., PoolQC, GarageType) → fill with 'None'.

* If a numeric value is missing for structural reasons (e.g., LotFrontage) → impute with the group median (grouped by Neighborhood).

* If a numeric value is missing at random → impute with the median.

* For features where missing might be predictive, create a missing indicator column.

Implementation inside pipeline: use SimpleImputer and a small custom transformer for groupwise fills if needed.
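
The missing-indicator rule can be covered by `SimpleImputer` itself through its `add_indicator` option. A minimal sketch on a single column with invented values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical values; one entry is missing
X_demo = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})

imputer = SimpleImputer(strategy="median", add_indicator=True)
out = imputer.fit_transform(X_demo)

# out has two columns: the imputed value and a 0/1 "was missing" flag
print(out)
```

The flag column is appended after the imputed columns, so a downstream model can use the fact that a value was missing as a signal of its own.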

Example for groupwise fill (outside pipeline or inside a transformer):

In [8]:
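def fill_lotfrontage_by_neighborhood(df):
    # Medians come from the frame passed in: use training data only to avoid leakage
    df = df.copy()
    df['LotFrontage'] = df.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
    # Neighborhoods with no observed LotFrontage are still NaN; fall back to the overall median
    df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())
    return df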
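
To use the group-wise fill inside a pipeline, the medians have to be learned in `fit` and reused in `transform`. One possible sketch follows; `GroupMedianImputer` is a hypothetical helper written for this example, not a scikit-learn class, and the data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fill target_col with the median of its group_col group."""
    def __init__(self, group_col="Neighborhood", target_col="LotFrontage"):
        self.group_col = group_col
        self.target_col = target_col

    def fit(self, X, y=None):
        # Medians are learned only from the data passed to fit (the training set)
        self.group_medians_ = X.groupby(self.group_col)[self.target_col].median()
        self.global_median_ = X[self.target_col].median()
        return self

    def transform(self, X):
        X = X.copy()
        fill = X[self.group_col].map(self.group_medians_)
        # Unseen or all-missing groups fall back to the overall training median
        fill = fill.fillna(self.global_median_)
        X[self.target_col] = X[self.target_col].fillna(fill)
        return X

train = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, 80.0, np.nan, 40.0, 50.0],
})
new = pd.DataFrame({"Neighborhood": ["A", "C"], "LotFrontage": [np.nan, np.nan]})

imputer = GroupMedianImputer().fit(train)
filled = imputer.transform(new)
print(filled)
```

Neighborhood "C" never appears in the training frame, so it receives the overall training median rather than a group median.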


### 5 — Ordinal encoding for quality-like features

Why: ExterQual, KitchenQual, BsmtQual are ordered; map them to numbers.

In [9]:
# Missing values are not in the map; OrdinalMapper turns them (and unknown labels) into 0
qual_map = {"Ex":5, "Gd":4, "TA":3, "Fa":2, "Po":1}
# In pipeline you can use sklearn's OrdinalEncoder with custom categories order,
# but a mapping in FeatureEngineer or a small transformer is simpler.
class OrdinalMapper(BaseEstimator, TransformerMixin):
    def __init__(self, mappings):
        self.mappings = mappings
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        for col, m in self.mappings.items():
            if col in X.columns:
                # Adds <col>_num; the original string column is kept, so drop it downstream
                X[col + "_num"] = X[col].map(m).fillna(0)
        return X

# All three quality columns share the same scale
ord_mappings = {col: qual_map for col in ['ExterQual', 'KitchenQual', 'BsmtQual']}
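
For comparison, the `OrdinalEncoder` route mentioned in the comment could look roughly like this. It is a sketch under the assumption that a missing value should map to the lowest level, which is done here by imputing the string 'None' first; the sample values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Lowest to highest; "None" stands in for a missing value
quality_order = ["None", "Po", "Fa", "TA", "Gd", "Ex"]

ordinal_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="None")),
    ("encoder", OrdinalEncoder(
        categories=[quality_order],          # one list per input column
        handle_unknown="use_encoded_value",  # unseen labels do not raise
        unknown_value=-1,
    )),
])

X_demo = pd.DataFrame({"KitchenQual": ["Ex", np.nan, "TA", "Po"]}, dtype=object)
encoded = ordinal_pipeline.fit_transform(X_demo)
print(encoded.ravel())
```

The resulting codes match the dictionary mapping (Ex is 5, missing is 0), while labels outside the list are flagged with -1 instead of being silently set to 0.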


### 6 — Nominal (high-cardinality) categorical encoding

Options & recommendation:

* Low-cardinality (<= ~10 unique values): one-hot encode.

* High-cardinality (Neighborhood, etc.): frequency encoding, or target (mean) encoding with a CV-safe implementation.

Frequency encoding example:

In [10]:
def freq_encode(series):
    # Share of rows per category, computed from the series passed in
    freq = series.value_counts() / len(series)
    return series.map(freq)

# In pipeline: implement as transformer (learn the frequencies in fit) or do before pipeline and treat as numeric
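
A transformer version might look like the sketch below, so that frequencies learned on the training data are reused for new data. `FrequencyEncoder` is a hypothetical helper for this example, not a scikit-learn class, and mapping unseen categories to 0 is a design assumption.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: replace each category by its share of the training rows."""
    def __init__(self, cols=("Neighborhood",)):
        self.cols = cols

    def fit(self, X, y=None):
        # Learn the category shares from the training data only
        self.freqs_ = {c: X[c].value_counts(normalize=True) for c in self.cols}
        return self

    def transform(self, X):
        X = X.copy()
        for c in self.cols:
            # Categories never seen during fit get frequency 0
            X[c + "_freq"] = X[c].map(self.freqs_[c]).fillna(0.0)
        return X

train = pd.DataFrame({"Neighborhood": ["A", "A", "A", "B"]})
new = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})

encoded = FrequencyEncoder().fit(train).transform(new)
print(encoded)
```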


Safe target encoding (out-of-fold) — sketch (use KFold on training only):

In [11]:
def target_mean_encode_train(X, y, col, n_splits=5):
    # Out-of-fold encoding: each row is encoded with target means from the other folds
    te = pd.Series(np.nan, index=X.index, name=col + "_te")
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for tr_idx, val_idx in kf.split(X):
        # y is separate from X, so group the target by the category column of the same rows
        tr_mean = y.iloc[tr_idx].groupby(X[col].iloc[tr_idx]).mean()
        te.iloc[val_idx] = X[col].iloc[val_idx].map(tr_mean).to_numpy()
    # global mean for categories not seen in the other folds
    return te.fillna(y.mean())
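
The out-of-fold scheme only applies to the training rows. Validation or test rows can be encoded with means computed from the whole training set, roughly as follows; `target_mean_encode_apply` is a hypothetical companion function and the numbers are invented.

```python
import pandas as pd

def target_mean_encode_apply(train_col, y_train, new_col):
    """Hypothetical helper: encode new data with means from the full training set."""
    means = y_train.groupby(train_col).mean()
    # Categories absent from training fall back to the global training mean
    return new_col.map(means).fillna(y_train.mean())

train_nb = pd.Series(["A", "A", "B", "B"], name="Neighborhood")
y_train_demo = pd.Series([10.0, 12.0, 20.0, 22.0], name="SalePrice_log")
new_nb = pd.Series(["A", "B", "C"], name="Neighborhood")

encoded = target_mean_encode_apply(train_nb, y_train_demo, new_nb)
print(encoded.tolist())
```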


### 7 — Skewness & numeric transforms

Why: many area/price-like columns are right-skewed — log1p helps linear models and stabilises variance.

In [12]:
# find skewed numeric features (ideally computed on the training split only)
skewness = X[num_cols].apply(lambda s: s.dropna().skew()).sort_values(ascending=False)
# log1p only corrects right skew and needs values > -1, so select positive skew
skewed_feats = skewness[skewness > 0.75].index.tolist()

# transform them (inside pipeline use FunctionTransformer, already imported in section 0)
log_transformer = FunctionTransformer(np.log1p, validate=False)
# apply log_transformer to skewed numeric columns only in ColumnTransformer
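
Restricting the log transform to the skewed columns can be sketched with a small `ColumnTransformer`. The frame and the `skewed_demo` list are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

X_demo = pd.DataFrame({
    "LotArea": [0.0, 9.0, 99.0],       # treated as skewed here
    "OverallQual": [5.0, 6.0, 7.0],    # left untouched
})
skewed_demo = ["LotArea"]

log_ct = ColumnTransformer(
    transformers=[("log", FunctionTransformer(np.log1p), skewed_demo)],
    remainder="passthrough",   # other columns are passed through unchanged
)
out = np.asarray(log_ct.fit_transform(X_demo))

# Transformed columns come first, passthrough columns after them
print(out)
```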


### 8 — Outlier handling

Options:

* Remove extreme, obvious errors (e.g., GrLivArea > 4500 may be a data error), but only after inspection.

* Clip/winsorize numeric features.

* Use robust models (tree-based) that tolerate outliers.

Example to clip:

In [13]:
def clip_outliers(X, columns, lower_quantile=0.01, upper_quantile=0.99):
    X = X.copy()
    for col in columns:
        lo = X[col].quantile(lower_quantile)
        hi = X[col].quantile(upper_quantile)
        X[col] = X[col].clip(lo, hi)
    return X
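
The function above takes its quantiles from whatever frame it receives. If clipping should happen inside the pipeline, the bounds can be learned on the training data and reused, as in this sketch; `QuantileClipper` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: clip columns to quantile bounds learned during fit."""
    def __init__(self, columns, lower_quantile=0.01, upper_quantile=0.99):
        self.columns = columns
        self.lower_quantile = lower_quantile
        self.upper_quantile = upper_quantile

    def fit(self, X, y=None):
        # One lower and one upper bound per column, taken from the training data
        self.lo_ = X[self.columns].quantile(self.lower_quantile)
        self.hi_ = X[self.columns].quantile(self.upper_quantile)
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].clip(self.lo_[col], self.hi_[col])
        return X

train = pd.DataFrame({"GrLivArea": np.arange(1.0, 102.0)})   # 1, 2, ..., 101
new = pd.DataFrame({"GrLivArea": [-50.0, 50.0, 5000.0]})

clipper = QuantileClipper(["GrLivArea"]).fit(train)
clipped = clipper.transform(new)
print(clipped)
```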


### 9 — Scaling

* Tree models (XGBoost, RandomForest): no scaling needed.

* Linear models / KNN / SVM: use StandardScaler or RobustScaler (more robust to outliers).

Include scaling in numeric pipeline:

In [14]:
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('log', FunctionTransformer(np.log1p, validate=False)),  # optional: only on skewed subset
    ('scaler', StandardScaler())
])


### 10 — Putting it together — example ColumnTransformer pipeline

This example demonstrates a full pipeline including feature engineering, ordinal mapping, numeric & categorical processing. Adjust lists to your dataset.

In [16]:
# select columns (example)
numeric_features = ['LotFrontage','LotArea','OverallQual','GrLivArea','TotalBsmtSF','1stFlrSF','2ndFlrSF','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd']
ordinal_features = ['ExterQual','KitchenQual','BsmtQual']  # OrdinalMapper adds <col>_num for each
low_card_cat = ['MSZoning','Street','SaleCondition']  # one-hot
high_card_cat = ['Neighborhood']  # freq/target encode separately (not wired in below)

# Columns created upstream by FeatureEngineer and OrdinalMapper
engineered_features = ['TotalSF','TotalBath','HouseAge','RemodAge','HasPool','HasGarage','HasBsmt']
ordinal_num_features = [c + '_num' for c in ordinal_features]

# Pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())   # or omitted for trees
])

cat_low_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# combine into ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_pipeline, numeric_features + engineered_features + ordinal_num_features),
    ('cat_low', cat_low_pipeline, low_card_cat),
], remainder='drop')  # drop unlisted columns; 'passthrough' would also let raw string columns through

# Full pipeline with custom feature engineering & ordinal mapping BEFORE ColumnTransformer
full_pipeline = Pipeline([
    ('feat_eng', FeatureEngineer()),
    ('ordinal', OrdinalMapper(ord_mappings)),
    ('preproc', preprocessor)
])


### 11 — Fit on training, transform test safely

Always fit the pipeline only on training data and transform validation/test.

In [None]:

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

full_pipeline.fit(X_train, y_train)         # fit imputers, scalers, etc.
X_train_trans = full_pipeline.transform(X_train)
X_valid_trans = full_pipeline.transform(X_valid)
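
The same rule extends to cross-validation: if the preprocessing and the model sit in one `Pipeline`, each fold refits the imputers and scalers on its own training part. A minimal sketch on synthetic data, where `Ridge` is only a stand-in model and the columns are generated at random:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 3000, 60),
    "OverallQual": rng.integers(1, 11, 60).astype(float),
})
y_demo = np.log1p(50 * X_demo["GrLivArea"] + 10_000 * X_demo["OverallQual"])

model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
# Scores are negative RMSE on the log scale, one value per fold
scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                         scoring="neg_root_mean_squared_error")
print(scores)
```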

### 12 — Save pipeline & metadata

Save the fitted pipeline (and any mapping dicts) for inference.

In [None]:
os.makedirs("models", exist_ok=True)
joblib.dump(full_pipeline, "models/preprocessor.joblib")
# Save ordinal mapping too (if used externally)
joblib.dump(ord_mappings, "models/ordinal_mappings.joblib")

### Load in inference:

In [None]:
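# FeatureEngineer and OrdinalMapper must be defined or importable before loading
preprocessor = joblib.load("models/preprocessor.joblib")
X_new = pd.read_csv("data/test.csv")  # raw test; adjust the path to match where train.csv lives
X_new_transformed = preprocessor.transform(X_new)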
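
A quick way to gain confidence in the saved artifact is to check that a reloaded pipeline reproduces the original output. The sketch below does this with a small stand-in pipeline and a temporary directory; the data is invented.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0, 9550.0]})
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
]).fit(X_demo)

# Save and reload inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessor.joblib")
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)

same = np.allclose(pipe.transform(X_demo), reloaded.transform(X_demo))
print(same)
```

Because joblib pickles objects by reference to their classes, a pipeline containing custom transformers can only be loaded where those class definitions are available.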
