# Machine Learning Project - Ames Housing Data

Ames, Iowa, is the college town of **Iowa State University**. The Ames housing dataset consists of about 2,500 house sale records from 2006 to 2010. Each record contains detailed information about the house's attributes along with its sale price. The goals of the project are to:
- perform descriptive data analysis to gain business (i.e., housing market) insights
- build descriptive machine learning models to understand the local housing market
- build predictive machine learning models for local house price prediction

A subset of the **Ames** dataset is hosted on [**Kaggle**](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) as an entry-level regression competition. The Kaggle page includes the data dictionary, which explains the meaning of each data column. This notebook describes various project ideas related to this data.

## Data Fields

Here's a brief version of what you'll find in the data description file.

- **`GrLivArea`** – Above-grade (ground) living area in square feet.
---
- **`SalePrice`** – The property's sale price in dollars. This is the target variable that you're trying to predict.
---
- **`MSSubClass`** – The building class. Identifies the type of dwelling involved in the sale.
<br>**Codes:**
    - `20` – 1-STORY 1946 & NEWER ALL STYLES  
    - `30` – 1-STORY 1945 & OLDER  
    - `40` – 1-STORY W/FINISHED ATTIC ALL AGES  
    - `45` – 1-1/2 STORY - UNFINISHED ALL AGES  
    - `50` – 1-1/2 STORY FINISHED ALL AGES  
    - `60` – 2-STORY 1946 & NEWER  
    - `70` – 2-STORY 1945 & OLDER  
    - `75` – 2-1/2 STORY ALL AGES  
    - `80` – SPLIT OR MULTI-LEVEL  
    - `85` – SPLIT FOYER  
    - `90` – DUPLEX - ALL STYLES AND AGES  
    - `120` – 1-STORY PUD (Planned Unit Development) - 1946 & NEWER  
    - `150` – 1-1/2 STORY PUD - ALL AGES  
    - `160` – 2-STORY PUD - 1946 & NEWER  
    - `180` – PUD - MULTILEVEL - INCL SPLIT LEV/FOYER  
    - `190` – 2 FAMILY CONVERSION - ALL STYLES AND AGES  
---
- **`MSZoning`** – The general zoning classification of the sale.
<br>**Codes:**
    - `A` – Agriculture  
    - `C` – Commercial  
    - `FV` – Floating Village Residential  
    - `I` – Industrial  
    - `RH` – Residential High Density  
    - `RL` – Residential Low Density  
    - `RP` – Residential Low Density Park  
    - `RM` – Residential Medium Density  
---
- **`LotFrontage`** – Linear feet of street connected to property.
---
- **`LotArea`** – Lot size in square feet.
---
- **`Street`** – Type of road access.
<br>**Codes:**
    - `Grvl` – Gravel  
    - `Pave` – Paved  
---
- **`Alley`** – Type of alley access.
<br>**Codes:**
    - `Grvl` – Gravel  
    - `Pave` – Paved  
    - `NA` – No alley access  
---
- **`LotShape`** – General shape of property.
<br>**Codes:**
    - `Reg` – Regular  
    - `IR1` – Slightly irregular  
    - `IR2` – Moderately irregular  
    - `IR3` – Irregular  
---
- **`LandContour`** – Flatness of the property.
<br>**Codes:**
    - `Lvl` – Near Flat/Level  
    - `Bnk` – Banked - Quick and significant rise from street grade to building  
    - `HLS` – Hillside - Significant slope from side to side  
    - `Low` – Depression  
---
- **`Utilities`** – Type of utilities available.
<br>**Codes:**
    - `AllPub` – All public Utilities (E, G, W, & S)  
    - `NoSewr` – Electricity, Gas, and Water (Septic Tank)  
    - `NoSeWa` – Electricity and Gas Only  
    - `ELO` – Electricity only  
---
- **`LotConfig`** – Lot configuration.
<br>**Codes:**
    - `Inside` – Inside lot  
    - `Corner` – Corner lot  
    - `CulDSac` – Cul-de-sac  
    - `FR2` – Frontage on 2 sides of property  
    - `FR3` – Frontage on 3 sides of property  
---
- **`LandSlope`** – Slope of property.
<br>**Codes:**
    - `Gtl` – Gentle slope  
    - `Mod` – Moderate slope  
    - `Sev` – Severe slope  
---
- **`Neighborhood`** – Physical locations within Ames city limits.
<br>**Neighborhood codes:**
    - `Blmngtn` – Bloomington Heights (`NO`)  
    - `Blueste` – Bluestem (`SW`)  
    - `BrDale` – Briardale (`NO`)  
    - `BrkSide` – Brookside (`DT`)  
    - `ClearCr` – Clear Creek (`NO`)  
    - `CollgCr` – College Creek (`SW`)  
    - `Crawfor` – Crawford (`SW`)  
    - `Edwards` – Edwards (`SW`)  
    - `Gilbert` – Gilbert (`NO`)  
    - `IDOTRR` – Iowa DOT and Rail Road (`DT`)  
    - `MeadowV` – Meadow Village (`SE`)  
    - `Mitchel` – Mitchell (`SE`)  
    - `NAmes` – North Ames (`NO`)  
    - `NoRidge` – Northridge (`NW`)  
    - `NPkVill` – Northpark Villa (`NO`)  
    - `NridgHt` – Northridge Heights (`NW`)  
    - `NWAmes` – Northwest Ames (`NO`)  
    - `OldTown` – Old Town (`DT`)  
    - `SWISU` – South & West of ISU (`SW`)  
    - `Sawyer` – Sawyer (`NW`)  
    - `SawyerW` – Sawyer West (`NW`)  
    - `Somerst` – Somerset (`NW`)  
    - `StoneBr` – Stone Brook (`NO`)  
    - `Timber` – Timberland (`SW`)  
    - `Veenker` – Veenker (`NW`)  
    - `Greens` – Greensboro (`NW`)  
    - `GrnHill` – Greens Hills (`SO`)  
    - `Landmrk` – Landmark Villas?? (`DT`) 
    <br>**Daniel recoding to city sectors (per https://www.thinkames.com/maps/):**
        - `NW` – NorthWest  
        - `NO` – North  
        - `NE` – NorthEast  
        - `SW` – SouthWest  
        - `DT` – Downtown  
        - `SO` – South  
        - `SE` – SouthEast  
---
- **`Condition1`** – Proximity to main road or railroad.
<br>**Codes:**
    - `Artery` – Adjacent to arterial street  
    - `Feedr` – Adjacent to feeder street  
    - `Norm` – Normal  
    - `RRNn` – Within 200' of North-South Railroad  
    - `RRAn` – Adjacent to North-South Railroad  
    - `PosN` – Near positive off-site feature (park, greenbelt, etc.)  
    - `PosA` – Adjacent to positive off-site feature  
    - `RRNe` – Within 200' of East-West Railroad  
    - `RRAe` – Adjacent to East-West Railroad  
---
- **`Condition2`** – Proximity to main road or railroad (if a second is present).
<br>**Codes:**
    - `Artery` – Adjacent to arterial street  
    - `Feedr` – Adjacent to feeder street  
    - `Norm` – Normal  
    - `RRNn` – Within 200' of North-South Railroad  
    - `RRAn` – Adjacent to North-South Railroad  
    - `PosN` – Near positive off-site feature (park, greenbelt, etc.)  
    - `PosA` – Adjacent to positive off-site feature  
    - `RRNe` – Within 200' of East-West Railroad  
    - `RRAe` – Adjacent to East-West Railroad  
---
- **`BldgType`** – Type of dwelling.
<br>**Codes:**
    - `1Fam` – Single-family detached  
    - `2FmCon` – Two-family conversion (originally built as one-family dwelling)  
    - `Duplx` – Duplex  
    - `TwnhsE` – Townhouse end unit  
    - `TwnhsI` – Townhouse inside unit  
---
- **`HouseStyle`** – Style of dwelling.
<br>**Codes:**
    - `1Story` – One story  
    - `1.5Fin` – One and one-half story: 2nd level finished  
    - `1.5Unf` – One and one-half story: 2nd level unfinished  
    - `2Story` – Two story  
    - `2.5Fin` – Two and one-half story: 2nd level finished  
    - `2.5Unf` – Two and one-half story: 2nd level unfinished  
    - `SFoyer` – Split foyer  
    - `SLvl` – Split level  
---
- **`OverallQual`** – Overall material and finish quality.
<br>**Codes:**
    - `10` – Very Excellent  
    - `9` – Excellent  
    - `8` – Very Good  
    - `7` – Good  
    - `6` – Above Average  
    - `5` – Average  
    - `4` – Below Average  
    - `3` – Fair  
    - `2` – Poor  
    - `1` – Very Poor 
---
- **`OverallCond`** – Overall condition rating.
<br>**Codes:**
    - `10` – Very Excellent  
    - `9` – Excellent  
    - `8` – Very Good  
    - `7` – Good  
    - `6` – Above Average  
    - `5` – Average  
    - `4` – Below Average  
    - `3` – Fair  
    - `2` – Poor  
    - `1` – Very Poor  
---
- **`YearBuilt`** – Original construction date.
---
- **`YearRemodAdd`** – Remodel date.
---
- **`RoofStyle`** – Type of roof.
<br>**Codes:**
    - `Flat` – Flat  
    - `Gable` – Gable  
    - `Gambrel` – Gambrel (barn)  
    - `Hip` – Hip  
    - `Mansard` – Mansard  
    - `Shed` – Shed  
---
- **`RoofMatl`** – Roof material.
<br>**Codes:**
    - `ClyTile` – Clay or Tile  
    - `CompShg` – Standard (Composite) Shingle  
    - `Membran` – Membrane  
    - `Metal` – Metal  
    - `Roll` – Roll  
    - `Tar&Grv` – Gravel & Tar  
    - `WdShake` – Wood Shakes  
    - `WdShngl` – Wood Shingles  
---
- **`Exterior1st`** – Exterior covering on house.
<br>**Codes:**
    - `AsbShng` – Asbestos Shingles  
    - `AsphShn` – Asphalt Shingles  
    - `BrkComm` – Brick Common  
    - `BrkFace` – Brick Face  
    - `CBlock` – Cinder Block  
    - `CemntBd` – Cement Board  
    - `HdBoard` – Hard Board  
    - `ImStucc` – Imitation Stucco  
    - `MetalSd` – Metal Siding  
    - `Other` – Other  
    - `Plywood` – Plywood  
    - `PreCast` – PreCast  
    - `Stone` – Stone  
    - `Stucco` – Stucco  
    - `VinylSd` – Vinyl Siding  
    - `Wd Sdng` – Wood Siding  
    - `WdShing` – Wood Shingles  
---
- **`Exterior2nd`** – Exterior covering on house (if more than one material).
<br>**Codes:**
    - `AsbShng` – Asbestos Shingles  
    - `AsphShn` – Asphalt Shingles  
    - `BrkComm` – Brick Common  
    - `BrkFace` – Brick Face  
    - `CBlock` – Cinder Block  
    - `CemntBd` – Cement Board  
    - `HdBoard` – Hard Board  
    - `ImStucc` – Imitation Stucco  
    - `MetalSd` – Metal Siding  
    - `Other` – Other  
    - `Plywood` – Plywood  
    - `PreCast` – PreCast  
    - `Stone` – Stone  
    - `Stucco` – Stucco  
    - `VinylSd` – Vinyl Siding  
    - `Wd Sdng` – Wood Siding  
    - `WdShing` – Wood Shingles  
---
- **`MasVnrType`** – Masonry veneer type.
<br>**Codes:**
    - `BrkCmn` – Brick Common  
    - `BrkFace` – Brick Face  
    - `CBlock` – Cinder Block  
    - `None` – None  
    - `Stone` – Stone  
---
- **`MasVnrArea`** – Masonry veneer area in square feet.
---
- **`ExterQual`** – Exterior material quality.
<br>**Codes:**
    - `Ex` – Excellent  
    - `Gd` – Good  
    - `TA` – Average/Typical  
    - `Fa` – Fair  
    - `Po` – Poor  
---
- **`ExterCond`** – Present condition of the material on the exterior.
<br>**Codes:**
    - `Ex` – Excellent  
    - `Gd` – Good  
    - `TA` – Average/Typical  
    - `Fa` – Fair  
    - `Po` – Poor  
---
- **`Foundation`** – Type of foundation.
<br>**Codes:**
    - `BrkTil` – Brick & Tile  
    - `CBlock` – Cinder Block  
    - `PConc` – Poured Concrete  
    - `Slab` – Slab  
    - `Stone` – Stone  
    - `Wood` – Wood  
---
- **`BsmtQual`** – Height of the basement.
<br>**Codes:**
    - `Ex` – Excellent (100+ inches)  
    - `Gd` – Good (90–99 inches)  
    - `TA` – Typical (80–89 inches)  
    - `Fa` – Fair (70–79 inches)  
    - `Po` – Poor (<70 inches)  
    - `NA` – No Basement  
---
- **`BsmtCond`** – General condition of the basement.
<br>**Codes:**
    - `Ex` – Excellent  
    - `Gd` – Good  
    - `TA` – Typical - slight dampness allowed  
    - `Fa` – Fair - dampness or some cracking or settling  
    - `Po` – Poor - severe cracking, settling, or wetness  
    - `NA` – No Basement  
---
- **`BsmtExposure`** – Walkout or garden level basement walls.
<br>**Codes:**
    - `Gd` – Good Exposure  
    - `Av` – Average Exposure (split levels or foyers typically score average or above)  
    - `Mn` – Minimum Exposure  
    - `No` – No Exposure  
    - `NA` – No Basement  
---
- **`BsmtFinType1`** – Quality of basement finished area.
<br>**Codes:**
    - `GLQ` – Good Living Quarters  
    - `ALQ` – Average Living Quarters  
    - `BLQ` – Below Average Living Quarters  
    - `Rec` – Average Rec Room  
    - `LwQ` – Low Quality  
    - `Unf` – Unfinished  
    - `NA` – No Basement  
---
- **`BsmtFinSF1`** – Type 1 finished square feet.
---
- **`BsmtFinType2`** – Quality of second finished area (if present).
<br>**Codes:**
    - `GLQ` – Good Living Quarters  
    - `ALQ` – Average Living Quarters  
    - `BLQ` – Below Average Living Quarters  
    - `Rec` – Average Rec Room  
    - `LwQ` – Low Quality  
    - `Unf` – Unfinished  
    - `NA` – No Basement  
---
- **`BsmtFinSF2`** – Type 2 finished square feet.
---
- **`BsmtUnfSF`** – Unfinished square feet of basement area.
---
- **`TotalBsmtSF`** – Total square feet of basement area.
---
- **`Heating`** – Type of heating.
<br>**Codes:**
    - `Floor` – Floor furnace  
    - `GasA` – Gas forced warm air furnace  
    - `GasW` – Gas hot water or steam heat  
    - `Grav` – Gravity furnace  
    - `OthW` – Hot water or steam heat other than gas  
    - `Wall` – Wall furnace  
---
- **`HeatingQC`** – Heating quality and condition.
<br>**Codes:**
    - `Ex` – Excellent  
    - `Gd` – Good  
    - `TA` – Average/Typical  
    - `Fa` – Fair  
    - `Po` – Poor  
---
- **`CentralAir`** – Central air conditioning.
<br>**Codes:**
    - `N` – No  
    - `Y` – Yes  
---
- **`Electrical`** – Electrical system.
<br>**Codes:**
    - `SBrkr` – Standard circuit breakers & Romex  
    - `FuseA` – Fuse box over 60 AMP and all Romex wiring (Average)  
    - `FuseF` – 60 AMP fuse box and mostly Romex wiring (Fair)  
    - `FuseP` – 60 AMP fuse box and mostly knob & tube wiring (Poor)  
    - `Mix` – Mixed  
---
- **`1stFlrSF`** – First floor square feet.
---
- **`2ndFlrSF`** – Second floor square feet.
---
- **`LowQualFinSF`** – Low quality finished square feet (all floors).
---
- **`BsmtFullBath`** – Basement full bathrooms.
---
- **`BsmtHalfBath`** – Basement half bathrooms.
---
- **`FullBath`** – Full bathrooms above grade.
---
- **`HalfBath`** – Half baths above grade.
---
- **`BedroomAbvGr`** – Number of bedrooms above basement level.
---
- **`KitchenAbvGr`** – Number of kitchens above grade.
---
- **`KitchenQual`** – Kitchen quality.
<br>**Codes:**
    - `Ex` – Excellent  
    - `Gd` – Good  
    - `TA` – Typical/Average  
    - `Fa` – Fair  
    - `Po` – Poor  
---
- **`TotRmsAbvGrd`** – Total rooms above grade (does not include bathrooms).
---
- **`Functional`** – Home functionality rating.
<br>**Codes:**
    - `Typ` – Typical Functionality  
    - `Min1` – Minor Deductions 1  
    - `Min2` – Minor Deductions 2  
    - `Mod` – Moderate Deductions  
    - `Maj1` – Major Deductions 1  
    - `Maj2` – Major Deductions 2  
    - `Sev` – Severely Damaged  
    - `Sal` – Salvage only  
---
- **`Fireplaces`** – Number of fireplaces.
---
- **`FireplaceQu`** – Fireplace quality.
<br>**Codes:**
    - `Ex` – Excellent - exceptional masonry fireplace  
    - `Gd` – Good - masonry fireplace in main level  
    - `TA` – Average - prefabricated fireplace in main living area or masonry fireplace in basement  
    - `Fa` – Fair - prefabricated fireplace in basement  
    - `Po` – Poor - Ben Franklin stove  
    - `NA` – No fireplace  
---
- **`GarageType`** – Garage location.
<br>**Codes:**
    - `2Types` – More than one type of garage  
    - `Attchd` – Attached to home  
    - `Basment` – Basement garage  
    - `BuiltIn` – Built-in (garage part of house - typically has room above garage)  
    - `CarPort` – Car port  
    - `Detchd` – Detached from home  
    - `NA` – No garage  
---
- **`GarageYrBlt`** – Year garage was built.
---
- **`GarageFinish`** – Interior finish of the garage.
<br>**Codes:**
    - `Fin` – Finished  
    - `RFn` – Rough finished  
    - `Unf` – Unfinished  
    - `NA` – No garage  
---
- **`GarageCars`** – Size of garage in car capacity.
---
- **`GarageArea`** – Size of garage in square feet.
---
- **`GarageQual`** – Garage quality.
<br>**Codes:**
    - `Ex` – Excellent  
    - `Gd` – Good  
    - `TA` – Typical/Average  
    - `Fa` – Fair  
    - `Po` – Poor  
    - `NA` – No garage  
---
- **`GarageCond`** – Garage condition.
<br>**Codes:**
    - `Ex` – Excellent  
    - `Gd` – Good  
    - `TA` – Typical/Average  
    - `Fa` – Fair  
    - `Po` – Poor  
    - `NA` – No garage  
---
- **`PavedDrive`** – Paved driveway.
<br>**Codes:**
    - `Y` – Paved  
    - `P` – Partial pavement  
    - `N` – Dirt/Gravel  
---
- **`WoodDeckSF`** – Wood deck area in square feet.
---
- **`OpenPorchSF`** – Open porch area in square feet.
---
- **`EnclosedPorch`** – Enclosed porch area in square feet.
---
- **`3SsnPorch`** – Three season porch area in square feet.
---
- **`ScreenPorch`** – Screen porch area in square feet.
---
- **`PoolArea`** – Pool area in square feet.
---
- **`PoolQC`** – Pool quality.
<br>**Codes:**
    - `Ex` – Excellent  
    - `Gd` – Good  
    - `TA` – Average/Typical  
    - `Fa` – Fair  
    - `NA` – No pool  
---
- **`Fence`** – Fence quality.
<br>**Codes:**
    - `GdPrv` – Good privacy  
    - `MnPrv` – Minimum privacy  
    - `GdWo` – Good wood  
    - `MnWw` – Minimum wood/wire  
    - `NA` – No fence  
---
- **`MiscFeature`** – Miscellaneous feature not covered in other categories.
<br>**Codes:**
    - `Elev` – Elevator  
    - `Gar2` – 2nd garage (if not described in garage section)  
    - `Othr` – Other  
    - `Shed` – Shed (over 100 SF)  
    - `TenC` – Tennis court  
    - `NA` – None  
---
- **`MiscVal`** – Dollar value of miscellaneous feature.
---
- **`MoSold`** – Month sold (MM).
---
- **`YrSold`** – Year sold (YYYY).
---
- **`SaleType`** – Type of sale.
<br>**Codes:**
    - `WD` – Warranty Deed - Conventional  
    - `CWD` – Warranty Deed - Cash  
    - `VWD` – Warranty Deed - VA Loan  
    - `New` – Home just constructed and sold  
    - `COD` – Court Officer Deed/Estate  
    - `Con` – Contract 15% down payment, regular terms  
    - `ConLw` – Contract low down payment and low interest  
    - `ConLI` – Contract low interest  
    - `ConLD` – Contract low down  
    - `Oth` – Other  
---
- **`SaleCondition`** – Condition of sale.
<br>**Codes:**
    - `Normal` – Normal sale  
    - `Abnorml` – Abnormal sale (trade, foreclosure, short sale)  
    - `AdjLand` – Adjoining land purchase  
    - `Alloca` – Allocation - two linked properties with separate deeds (e.g., condo with a garage unit)  
    - `Family` – Sale between family members  
    - `Partial` – Home was not completed when last assessed (associated with new homes)  

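The sector recoding listed under `Neighborhood` can be applied with a plain lookup table. The following is a minimal sketch rather than the project's actual code: the names `NEIGHBORHOOD_TO_SECTOR` and `to_sector` are hypothetical, the mapping is copied from the list above, and `NAmes` is assumed to be the spelling used in the data file for North Ames. With pandas, `df["Neighborhood"].map(NEIGHBORHOOD_TO_SECTOR)` would apply the same lookup to a whole column.

```python
# Neighborhood code -> city sector, copied from the data dictionary above.
# Assumption: North Ames appears in the data as "NAmes".
NEIGHBORHOOD_TO_SECTOR = {
    "Blmngtn": "NO", "Blueste": "SW", "BrDale": "NO", "BrkSide": "DT",
    "ClearCr": "NO", "CollgCr": "SW", "Crawfor": "SW", "Edwards": "SW",
    "Gilbert": "NO", "IDOTRR": "DT", "MeadowV": "SE", "Mitchel": "SE",
    "NAmes": "NO", "NoRidge": "NW", "NPkVill": "NO", "NridgHt": "NW",
    "NWAmes": "NO", "OldTown": "DT", "SWISU": "SW", "Sawyer": "NW",
    "SawyerW": "NW", "Somerst": "NW", "StoneBr": "NO", "Timber": "SW",
    "Veenker": "NW", "Greens": "NW", "GrnHill": "SO", "Landmrk": "DT",
}


def to_sector(neighborhood, default="Unknown"):
    """Return the sector for a neighborhood code, or `default` if unlisted."""
    return NEIGHBORHOOD_TO_SECTOR.get(neighborhood, default)


print(to_sector("CollgCr"))  # sector recorded for College Creek
print(to_sector("GrnHill"))  # sector recorded for Greens Hills
```
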
In [14]:
import pandas as pd
import numpy as np
import joblib

# Scikit-Learn Imports
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder, FunctionTransformer
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Lasso

# Tree Models
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

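The next cell encodes categorical columns with `OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)` and, for the Lasso branch, one-hot encodes those integer codes with `handle_unknown='ignore'`. The toy sketch below uses made-up values rather than the Ames columns, and shows what that combination is expected to do with a category that was never seen during fitting: it becomes `-1` after ordinal encoding and an all-zero row after one-hot encoding.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy training column with three known categories (illustrative values).
train = np.array([["Gd"], ["TA"], ["Ex"], ["TA"]], dtype=object)
# New data contains "Po", which was not present during fit.
new = np.array([["TA"], ["Po"]], dtype=object)

# Step 1: strings -> integer codes; unseen categories are mapped to -1.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = ordinal.fit_transform(train)
new_codes = ordinal.transform(new)

# Step 2: one-hot encode the integer codes; the unseen code -1 is ignored,
# which leaves its row filled with zeros.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_codes)
new_onehot = ohe.transform(new_codes).toarray()

print(ordinal.categories_[0])  # learned categories, sorted alphabetically
print(new_codes.ravel())       # codes for "TA" and the unseen "Po"
print(new_onehot)              # one row per new sample
```
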
In [15]:
# ==========================================
# 1. CONFIGURATION & DATA LOADING
# ==========================================

# Define the columns (Manual Lists from your Lab Notebook)
CATEGORICAL_COLS = [
    'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
    'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
    'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl',
    'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond',
    'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
    'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
    'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
    'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
    'MoSold', 'YrSold', 'SaleType', 'SaleCondition'
]

NUMERICAL_COLS = [
    'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
    'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
    'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
    'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
    'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars',
    'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
    'ScreenPorch', 'PoolArea', 'MiscVal'
]

# Load Data (Assuming standard path)
# Adjust path if necessary
df = pd.read_csv("data/Ames_Housing_Price_Data.csv")

# Basic Cleanup
# errors='ignore' skips any of these columns that are absent
df = df.drop(columns=['PID', 'Unnamed: 0'], errors='ignore')

X = df.drop(columns=['SalePrice'])
y = df['SalePrice']

# Split (fit on the training set only, so the held-out test score stays comparable to our earlier 0.933)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Data Loaded. Train Shape: {X_train.shape}")

# ==========================================
# 2. DEFINE THE "TRANSLATOR" (Preprocessing)
# ==========================================

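# Helper function: cast every categorical column to string so that mixed-type
# columns (e.g. integer codes such as MSSubClass) are encoded uniformly
def cast_to_str(x):
    return x.astype(str)

# A. Categorical Branch
# 1. Force to String
#    Caveat: depending on the pandas version, astype(str) can turn NaN into the
#    literal string 'nan'. Those values then form their own category and the
#    imputer in step 2 has nothing left to fill.
# 2. Fill Missing with 'None' (The "Safety Net")
# 3. Ordinal Encode (Strings -> Integers like 0, 1, 2)
#    Note: We use -1 for unknown categories so the model doesn't crash on new data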
cat_preprocessing = Pipeline([
    ('caster', FunctionTransformer(cast_to_str, validate=False)),
    ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])

# B. Numerical Branch
# 1. Fill Missing with Median
num_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])

# C. Global Preprocessor
# This combines the two branches. 
# IMPORTANT: It outputs [Categorical_Cols, Numerical_Cols] in that order.
preprocessor = ColumnTransformer([
    ('cat', cat_preprocessing, CATEGORICAL_COLS),
    ('num', num_preprocessing, NUMERICAL_COLS)
], verbose_feature_names_out=False)

# ==========================================
# 3. DEFINE THE "BRAIN" (Model Branches)
# ==========================================

# We need to calculate how many categorical columns we have.
# This helps the Lasso branch know which columns to One-Hot Encode.
n_cats = len(CATEGORICAL_COLS)

# --- Branch A: Lasso ---
# Input: [Ordinal_Ints, Floats]
# Lasso needs One-Hot Encoding for Categories, but NOT for Numericals.
# We use a sub-ColumnTransformer to apply OHE only to the first 'n_cats' columns.
lasso_pipeline = Pipeline([
    ('prep', ColumnTransformer([
        ('ohe', OneHotEncoder(categories='auto', sparse_output=False, handle_unknown='ignore'), slice(0, n_cats))
    ], remainder='passthrough')), # Numerical columns pass through as-is
    ('scaler', StandardScaler()),
    ('model', Lasso(alpha=0.001, max_iter=50000, random_state=42))
])

# --- Branch B: XGBoost ---
# XGBoost handles the [Ordinal_Ints, Floats] array natively.
xgb_model = XGBRegressor(
    n_estimators=500, learning_rate=0.1, max_depth=3, subsample=0.8,
    random_state=42, n_jobs=1
)

# --- Branch C: CatBoost ---
# CatBoost handles the [Ordinal_Ints, Floats] array natively.
cb_model = CatBoostRegressor(
    iterations=1000, learning_rate=0.05, depth=4, l2_leaf_reg=3,
    loss_function='RMSE', random_seed=42, verbose=0, allow_writing_files=False
)

# --- The Voting Ensemble ---
voting_model = VotingRegressor(
    estimators=[
        ('lasso', lasso_pipeline),
        ('xgb', xgb_model),
        ('catboost', cb_model)
    ],
    weights=[1, 2, 2],
    n_jobs=1
)

# ==========================================
# 4. THE "GRAND PIPELINE" (End-to-End)
# ==========================================

# We wrap the whole thing:
# Raw Data -> Preprocessor -> LogTransform -> VotingModel -> InverseLog
final_production_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', TransformedTargetRegressor(
        regressor=voting_model,
        func=np.log1p,
        inverse_func=np.expm1
    ))
])

# ==========================================
# 5. TRAINING
# ==========================================
print("Training the Unified Production Pipeline (Raw Data -> Prediction)...")
# Notice we pass X_train (Raw), not X_train_ordinal!
final_production_pipeline.fit(X_train, y_train)

# Score
score = final_production_pipeline.score(X_test, y_test)
print(f"✅ Training Complete. Test R^2: {score:.5f}")
print("(This should match or slightly exceed your previous 0.933 score)")

# ==========================================
# 6. SAVE ARTIFACT (Production Ready)
# ==========================================
# 1. Save the Model Pipeline (Brain + Translator)
model_filename = 'ames_housing_super_model_production.pkl'
joblib.dump(final_production_pipeline, model_filename)

# 2. Save the Column List (The Alignment Key)
# This is required so the API knows how to create the NaN columns
cols_filename = 'ames_model_columns.pkl'
joblib.dump(X_train.columns.tolist(), cols_filename)

print(f"✅ Model saved to:   {model_filename}")
print(f"✅ Columns saved to: {cols_filename}")
print("   You can now restart app_3.0.py")

# ==========================================
# 7. PRODUCTION SIMULATION (Inference)
# ==========================================
print("\n--- Simulating Production Inference ---")

# 1. THE USER INPUT (Partial Data)
user_input = {
    "Neighborhood": "CollgCr",
    "LotArea": 9600,
    "OverallQual": 7,
    "YearBuilt": 2000,
    "GrLivArea": 1700,
    # Missing: GarageCars, KitchenQual, etc.
}

# 2. CREATE DATAFRAME
input_df = pd.DataFrame([user_input])

# 3. THE BRIDGE (Crucial Alignment Step)
# Load the columns we just saved (simulating a real API restart)
expected_cols = joblib.load(cols_filename)

# Force input to match training structure (add missing cols as NaN)
input_df = input_df.reindex(columns=expected_cols)

# 4. PREDICT
# The pipeline handles the NaNs using the internal SimpleImputer
pred_price = final_production_pipeline.predict(input_df)[0]

print(f"Input: {user_input}")
print(f"Predicted Price: ${pred_price:,.2f}")

Data Loaded. Train Shape: (2064, 79)
Training the Unified Production Pipeline (Raw Data -> Prediction)...
✅ Training Complete. Test R^2: 0.93176
(This should match or slightly exceed your previous 0.933 score)
✅ Model saved to:   ames_housing_super_model_production.pkl
✅ Columns saved to: ames_model_columns.pkl
   You can now restart app_3.0.py

--- Simulating Production Inference ---
Input: {'Neighborhood': 'CollgCr', 'LotArea': 9600, 'OverallQual': 7, 'YearBuilt': 2000, 'GrLivArea': 1700}
Predicted Price: $160,803.77
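
Sections 3 and 4 of the cell above combine a weighted `VotingRegressor` with a `TransformedTargetRegressor` that uses `np.log1p` and `np.expm1`. The sketch below illustrates the arithmetic this implies for a single house, using NumPy only. The three log-scale predictions are made-up numbers, not outputs of the fitted models; only the weights `[1, 2, 2]` come from the cell above.

```python
import numpy as np

# Hypothetical log1p-scale predictions for one house
# (order: lasso, xgb, catboost). These are made-up values.
log_preds = np.array([12.00, 12.10, 12.05])
weights = np.array([1, 2, 2])  # same weights as the voting ensemble

# A weighted voting regressor returns the weighted average of its
# estimators' predictions, here still on the log scale.
log_ensemble = np.average(log_preds, weights=weights)

# The target transformer then maps the result back with the inverse function.
price = np.expm1(log_ensemble)

# Round-trip check: expm1 undoes log1p.
assert np.isclose(np.expm1(np.log1p(200_000.0)), 200_000.0)

print(f"log-scale ensemble: {log_ensemble:.4f}")
print(f"price: {price:,.0f}")
```

Because the averaging happens on the log scale, the ensemble behaves like a weighted geometric mean of the individual price predictions (each shifted by one) rather than an arithmetic mean.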

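The alignment step in section 7 relies on `DataFrame.reindex(columns=...)`. The sketch below isolates that step on a toy example: `expected_cols` is a shortened, hypothetical stand-in for the saved training column list, and the extra `Comment` key is invented to show what happens to fields the model never saw.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the saved training columns.
expected_cols = ["Neighborhood", "LotArea", "GarageCars", "KitchenQual"]

# Partial user input: two expected fields, in a different order,
# plus one extra key that is not a model column.
user_input = {"LotArea": 9600, "Neighborhood": "CollgCr", "Comment": "n/a"}

input_df = pd.DataFrame([user_input])

# reindex keeps only the expected columns, in the expected order,
# and fills the ones that were not supplied with NaN.
aligned = input_df.reindex(columns=expected_cols)

print(aligned.columns.tolist())
print(aligned.isna().iloc[0].tolist())
```

The extra key is dropped and the missing fields arrive as NaN, which is what lets the pipeline's own preprocessing decide how to handle them.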