# Machine Learning Project - Ames Housing Data

Ames, Iowa is the college town of **Iowa State University**. The Ames housing dataset consists of about 2,500 house sale records from 2006 to 2010. Each record holds detailed information about the house's attributes along with its sale price. The goals of the project are to:
- perform descriptive data analysis to gain business (i.e. housing market) insights
- build descriptive machine learning models to understand the local housing market
- build predictive machine learning models for local house price prediction

A subset of the **Ames** dataset is hosted on [**Kaggle**](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) as an entry-level regression competition. The competition page includes the data dictionary, which explains what each data column means. In this notebook, we describe various project ideas related to this data.

In [2]:
import pandas as pd
import numpy as np
import joblib
from pathlib import Path

# Scikit-Learn Imports
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder, FunctionTransformer
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Lasso

# Tree Models
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

In [3]:
import sklearn
print(f"Notebook Scikit-Learn Version: {sklearn.__version__}")

Notebook Scikit-Learn Version: 1.6.1

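The version printed above matters for the artifacts saved later in this notebook: a pipeline written with `joblib.dump` is a pickle, and scikit-learn does not guarantee that pickled estimators load correctly under a different library version. A minimal sketch of one way to guard against that follows. The file name `versions.json` and both helper functions are hypothetical and are not part of this project.

```python
import json
import tempfile
from pathlib import Path

import sklearn


def write_version_stamp(model_dir: Path) -> Path:
    """Record the training-time scikit-learn version next to the model files."""
    stamp = model_dir / "versions.json"  # hypothetical file name
    stamp.write_text(json.dumps({"sklearn": sklearn.__version__}))
    return stamp


def versions_match(stamp: Path) -> bool:
    """Return True when the running scikit-learn matches the recorded version."""
    saved = json.loads(stamp.read_text())
    return saved.get("sklearn") == sklearn.__version__


# Demonstration in a throwaway directory, so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    stamp_path = write_version_stamp(Path(tmp))
    same_env = versions_match(stamp_path)

print(f"Versions match: {same_env}")
```

A dashboard or API could call a check like `versions_match` before `joblib.load` and refuse to start, or at least log a warning, when the versions differ.
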

In [4]:
# ==========================================
# 1. CONFIGURATION & DATA LOADING
# ==========================================

# Define the columns (Manual Lists from your Lab Notebook)
CATEGORICAL_COLS = [
    'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
    'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
    'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl',
    'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond',
    'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
    'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
    'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
    'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
    'MoSold', 'YrSold', 'SaleType', 'SaleCondition'
]

NUMERICAL_COLS = [
    'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
    'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
    'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
    'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
    'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars',
    'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
    'ScreenPorch', 'PoolArea', 'MiscVal'
]

# 1. Get the current working directory (where this notebook is)
current_dir = Path.cwd()

# 2. Define the path to the data file
# If your notebook is in a subfolder (e.g., 'notebooks/'), use .parent to go up one level
data_path = current_dir.parent / "data" / "Ames_Housing_Price_Data.csv"

# Check if the file exists before loading (optional but helpful for debugging)
if not data_path.exists():
    # Fallback: Maybe the notebook IS in the root?
    data_path = current_dir / "data" / "Ames_Housing_Price_Data.csv"

df = pd.read_csv(data_path)

# Quick check
print(f"Data loaded successfully from: {data_path}")
print(df.shape)

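# Basic cleanup: drop identifier/index columns if they are present
# (errors='ignore' makes the check on 'PID' unnecessary and also removes
# 'Unnamed: 0' when 'PID' is absent)
df = df.drop(columns=['PID', 'Unnamed: 0'], errors='ignore')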

X = df.drop(columns=['SalePrice'])
y = df['SalePrice']

# Hold out 20% as a test set. We fit on the training split only, so the test
# R^2 stays comparable with the earlier 0.933 score.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Data Loaded. Train Shape: {X_train.shape}")

# ==========================================
# 2. DEFINE THE "TRANSLATOR" (Preprocessing)
# ==========================================

# The helper below is deprecated here and now lives in utils.py: dashboard.py
# defined it manually (which caused Shiny deployment errors), while
# dashboard_2.0.py imports it from utils.py.
# It ensures everything is a string before imputation.
# def cast_to_str(x):
#     return x.astype(str)
import sys

# 1. Get the current directory, then go to the parent (project root)
# .resolve() ensures we have the absolute path
project_root = Path.cwd().parent.resolve()

# 2. Add it to sys.path if it isn't already there
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# 3. Now you can import
from utils import cast_to_str

print(f"cast_to_str module: {cast_to_str.__module__}")
# ------------------------------------------------------

# A. Categorical Branch
# 1. Force to String
# 2. Fill Missing with 'None' (The "Safety Net")
# 3. Ordinal Encode (Strings -> Integers like 0, 1, 2)
#    Note: We use -1 for unknown categories so the model doesn't crash on new data
cat_preprocessing = Pipeline([
    ('caster', FunctionTransformer(cast_to_str, validate=False)),
    ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])

# B. Numerical Branch
# 1. Fill Missing with Median
num_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])

# C. Global Preprocessor
# This combines the two branches. 
# IMPORTANT: It outputs [Categorical_Cols, Numerical_Cols] in that order.
preprocessor = ColumnTransformer([
    ('cat', cat_preprocessing, CATEGORICAL_COLS),
    ('num', num_preprocessing, NUMERICAL_COLS)
], verbose_feature_names_out=False)

# ==========================================
# 3. DEFINE THE "BRAIN" (Model Branches)
# ==========================================

# We need to calculate how many categorical columns we have.
# This helps the Lasso branch know which columns to One-Hot Encode.
n_cats = len(CATEGORICAL_COLS)

# --- Branch A: Lasso ---
# Input: [Ordinal_Ints, Floats]
# Lasso needs One-Hot Encoding for Categories, but NOT for Numericals.
# We use a sub-ColumnTransformer to apply OHE only to the first 'n_cats' columns.
lasso_pipeline = Pipeline([
    ('prep', ColumnTransformer([
        ('ohe', OneHotEncoder(categories='auto', sparse_output=False, handle_unknown='ignore'), slice(0, n_cats))
    ], remainder='passthrough')), # Numerical columns pass through as-is
    ('scaler', StandardScaler()),
    ('model', Lasso(alpha=0.001, max_iter=50000, random_state=42))
])

# --- Branch B: XGBoost ---
# XGBoost handles the [Ordinal_Ints, Floats] array natively.
xgb_model = XGBRegressor(
    n_estimators=500, learning_rate=0.1, max_depth=3, subsample=0.8,
    random_state=42, n_jobs=1
)

# --- Branch C: CatBoost ---
# CatBoost handles the [Ordinal_Ints, Floats] array natively.
cb_model = CatBoostRegressor(
    iterations=1000, learning_rate=0.05, depth=4, l2_leaf_reg=3,
    loss_function='RMSE', random_seed=42, verbose=0, allow_writing_files=False
)

# --- The Voting Ensemble ---
voting_model = VotingRegressor(
    estimators=[
        ('lasso', lasso_pipeline),
        ('xgb', xgb_model),
        ('catboost', cb_model)
    ],
    weights=[1, 2, 2],
    n_jobs=1
)

# ==========================================
# 4. THE "GRAND PIPELINE" (End-to-End)
# ==========================================

# We wrap the whole thing:
# Raw Data -> Preprocessor -> LogTransform -> VotingModel -> InverseLog
final_production_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', TransformedTargetRegressor(
        regressor=voting_model,
        func=np.log1p,
        inverse_func=np.expm1
    ))
])

# ==========================================
# 5. TRAINING
# ==========================================
print("Training the Unified Production Pipeline (Raw Data -> Prediction)...")
# Notice we pass X_train (Raw), not X_train_ordinal!
final_production_pipeline.fit(X_train, y_train)

# Score
score = final_production_pipeline.score(X_test, y_test)
print(f"✅ Training Complete. Test R^2: {score:.5f}")
print("(This should match or slightly exceed your previous 0.933 score)")

# ==========================================
# 6. SAVE ARTIFACT (Production Ready)
# ==========================================

# 1. Set up the directory
# This creates a 'models' folder in the current working directory (the notebook's folder)
MODEL_DIR = Path.cwd() / 'models'
MODEL_DIR.mkdir(parents=True, exist_ok=True)

# 2. Define paths
model_path = MODEL_DIR / 'ames_housing_super_model_production.pkl'
cols_path = MODEL_DIR / 'ames_model_columns.pkl'

# 3. Save the model pipeline (Brain + Translator)
joblib.dump(final_production_pipeline, model_path)

# 4. Save the column list (the alignment key)
# This is required so the API knows how to create the NaN columns
# Reuse the absolute path so the reload below does not depend on the working directory
cols_filename = cols_path
joblib.dump(X_train.columns.tolist(), cols_path)

print(f"✅ Model saved to:   {model_path}")
print(f"✅ Columns saved to: {cols_path}")
print("   You can now restart app_3.0.py")

# ==========================================
# 7. PRODUCTION SIMULATION (Inference)
# ==========================================
print("\n--- Simulating Production Inference ---")

# 1. THE USER INPUT (Partial Data)
user_input = {
    "Neighborhood": "CollgCr",
    "LotArea": 9600,
    "OverallQual": 7,
    "YearBuilt": 2000,
    "GrLivArea": 1700,
    # Missing: GarageCars, KitchenQual, etc.
}

# 2. CREATE DATAFRAME
input_df = pd.DataFrame([user_input])

# 3. THE BRIDGE (Crucial Alignment Step)
# Load the columns we just saved (simulating a real API restart)
expected_cols = joblib.load(cols_filename)

# Force input to match training structure (add missing cols as NaN)
input_df = input_df.reindex(columns=expected_cols)

# 4. PREDICT
# The pipeline handles the NaNs using the internal SimpleImputer
pred_price = final_production_pipeline.predict(input_df)[0]

print(f"Input: {user_input}")
print(f"Predicted Price: ${pred_price:,.2f}")

Data loaded successfully from: /Users/ozkangelincik/git_proj/ames-housing-prices/data/Ames_Housing_Price_Data.csv
(2580, 82)
Data Loaded. Train Shape: (2064, 79)
cast_to_str module: utils
Training the Unified Production Pipeline (Raw Data -> Prediction)...
✅ Training Complete. Test R^2: 0.93176
(This should match or slightly exceed your previous 0.933 score)
✅ Model saved to:   /Users/ozkangelincik/git_proj/ames-housing-prices/notebooks/models/ames_housing_super_model_production.pkl
✅ Columns saved to: /Users/ozkangelincik/git_proj/ames-housing-prices/notebooks/models/ames_model_columns.pkl
   You can now restart app_3.0.py

--- Simulating Production Inference ---
Input: {'Neighborhood': 'CollgCr', 'LotArea': 9600, 'OverallQual': 7, 'YearBuilt': 2000, 'GrLivArea': 1700}
Predicted Price: $160,803.77

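The `TransformedTargetRegressor` in the grand pipeline trains the ensemble on `log1p(SalePrice)` and applies `expm1` to its predictions, so callers only ever see prices in dollars. The sketch below checks that equivalence on synthetic data, with `LinearRegression` standing in for the voting ensemble. Both the data and the stand-in model are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, price-like target that is exactly linear on the log1p scale.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.expm1(11.0 + 1.5 * X[:, 0])

# Wrapped version: the target transform is handled inside the estimator.
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
wrapped.fit(X, y)
wrapped_pred = wrapped.predict(X)

# Manual version: transform the target, fit, then invert the predictions.
manual = LinearRegression().fit(X, np.log1p(y))
manual_pred = np.expm1(manual.predict(X))

print(np.allclose(wrapped_pred, manual_pred))
```

Keeping the transform inside the estimator means the saved pipeline carries it along, so inference code cannot forget the inverse step.
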

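The alignment step in the production simulation relies on `DataFrame.reindex(columns=...)`: columns missing from the input are created as NaN, and the column order is forced to match training. It also silently drops any column that is not in the list, which can hide a misspelled field name. The sketch below is one possible stricter wrapper. The function `align_input` and the shortened column list are hypothetical.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the saved training column list.
expected_cols = ["Neighborhood", "LotArea", "OverallQual", "GarageCars", "KitchenQual"]


def align_input(user_input: dict, columns: list) -> pd.DataFrame:
    """Build a one-row frame with exactly the training columns, in training order."""
    row = pd.DataFrame([user_input])
    unknown = sorted(set(row.columns) - set(columns))
    if unknown:
        # reindex would drop these silently, so fail loudly instead
        raise ValueError(f"Unexpected input fields: {unknown}")
    return row.reindex(columns=columns)  # missing columns become NaN


aligned = align_input(
    {"Neighborhood": "CollgCr", "LotArea": 9600, "OverallQual": 7}, expected_cols
)
print(aligned)
```

The NaN cells are then filled by the `SimpleImputer` steps inside the pipeline, exactly as in the simulation above.
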
In [7]:
# ==========================================
# FINAL SAVE: Correct Path & All Files (Fixed)
# ==========================================
import joblib
from pathlib import Path

# 1. Define Root
project_root = Path.cwd().parent.resolve()
MODELS_DIR = project_root / 'models'
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# 2. Calculate Defaults (For Math/Imputation)
train_defaults = {}
for col in X_train.columns:
    if col in CATEGORICAL_COLS:
        # Use mode[0] to get the single most frequent value
        # Ensure we handle empty modes just in case
        modes = X_train[col].mode()
        train_defaults[col] = modes[0] if not modes.empty else "None"
    else:
        train_defaults[col] = X_train[col].median()

# 3. Calculate Unique Options (For Dashboard Dropdowns!)
#    This extracts every unique value seen in the training set
unique_options = {}
for col in X_train.columns:
    if col in CATEGORICAL_COLS:
        # .astype(str) turns NaN into the string "nan", so sorted() no longer
        # crashes when a column mixes strings with float NaN values
        unique_vals = X_train[col].astype(str).unique().tolist()
        unique_options[col] = sorted(unique_vals)

# 4. Save EVERYTHING
print(f"Saving files to: {MODELS_DIR} ...")

# A. The Model
joblib.dump(final_production_pipeline, MODELS_DIR / 'ames_housing_super_model_production.pkl')

# B. The Columns
joblib.dump(X_train.columns.tolist(), MODELS_DIR / 'ames_model_columns.pkl')

# C. The Defaults (For Math)
joblib.dump(train_defaults, MODELS_DIR / 'ames_model_defaults.pkl')

# D. The Options (For Dropdowns)
joblib.dump(unique_options, MODELS_DIR / 'ames_model_options.pkl')
print("✅ Options Saved (ames_model_options.pkl)")

print("\n🚀 READY! Re-run the dashboard now.")

Saving files to: /Users/ozkangelincik/git_proj/ames-housing-prices/models ...
✅ Options Saved (ames_model_options.pkl)

🚀 READY! Re-run the dashboard now.

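As a rough illustration of how a dashboard might consume the saved defaults and options, the sketch below overlays a user's partial input on the training defaults and filters the `"nan"` placeholder out of a dropdown list. The shortened dictionaries and both helper functions are hypothetical stand-ins. Whether the real dashboard fills defaults itself or leaves NaN for the pipeline's imputers is a design choice this notebook does not settle.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the saved artifacts.
expected_cols = ["Neighborhood", "LotArea", "GarageCars"]
train_defaults = {"Neighborhood": "NAmes", "LotArea": 9400.0, "GarageCars": 2.0}
unique_options = {"Neighborhood": ["CollgCr", "NAmes", "OldTown", "nan"]}


def build_row(user_input: dict, defaults: dict, columns: list) -> pd.DataFrame:
    """Start from the training defaults and let user-supplied values override them."""
    merged = {**defaults, **user_input}
    return pd.DataFrame([merged]).reindex(columns=columns)


def dropdown_choices(options: dict, col: str) -> list:
    """Return the selectable values for a column, hiding the 'nan' placeholder."""
    return [value for value in options.get(col, []) if value != "nan"]


row = build_row({"LotArea": 12000}, train_defaults, expected_cols)
choices = dropdown_choices(unique_options, "Neighborhood")
print(row)
print(choices)
```
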