# Ames Housing Price Prediction Pipeline

This notebook walks through data preprocessing, feature engineering, and a linear regression model for predicting house prices using the Ames Housing dataset. 

## Imports and Setup
Import necessary libraries for data handling, modeling, and preprocessing.

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

## Load Data
Read training, test, and sample submission files from disk.

In [4]:
# Load datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
submission = pd.read_csv('sample_submission.csv')

## Prepare Features and Target
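- Separate the target variable `SalePrice` and the `Id` columns.
- Drop them from the feature sets to prepare for modeling.
- Concatenate the train and test features so the engineered columns are computed identically for both.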

In [5]:
# Separate target and IDs
y = train['SalePrice']
train_ids = train['Id']
test_ids = test['Id']

# Drop unwanted columns
train_features = train.drop(['SalePrice', 'Id'], axis=1)
test_features = test.drop(['Id'], axis=1)

# Combine for uniform preprocessing
all_data = pd.concat([train_features, test_features], axis=0)
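
The concatenate-then-split pattern depends on row order being preserved. As a minimal sketch with two toy frames (the values are invented for illustration), positional slicing with `iloc` recovers the original partition even though `pd.concat` leaves duplicate index labels behind:

```python
import pandas as pd

# Toy stand-ins for train_features / test_features (hypothetical values)
toy_train = pd.DataFrame({'LotArea': [8450, 9600, 11250]})
toy_test = pd.DataFrame({'LotArea': [11622, 14267]})

combined = pd.concat([toy_train, toy_test], axis=0)

# The index restarts at 0 for the test rows, so labels are not unique
print(combined.index.tolist())

# Positional slicing ignores labels and relies only on row order
n_toy_train = len(toy_train)
train_part = combined.iloc[:n_toy_train]
test_part = combined.iloc[n_toy_train:]
```

Because the labels repeat, label-based selection such as `combined.loc[0]` would return rows from both sets; slicing by position avoids that ambiguity.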

## Feature Engineering
Create new composite features:
- Total square footage (`TotalSF`)
- Total porch area (`TotalPorchSF`)
- Combined bathroom count (`TotalBath`)
- House age, remodel age, and garage age

In [6]:
# Composite features
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
all_data['TotalPorchSF'] = all_data[['OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch']].sum(axis=1)
all_data['TotalBath'] = (all_data['FullBath'] + 0.5*all_data['HalfBath'] +
                         all_data['BsmtFullBath'] + 0.5*all_data['BsmtHalfBath'])
all_data['HouseAge'] = all_data['YrSold'] - all_data['YearBuilt']
all_data['RemodelAge'] = all_data['YrSold'] - all_data['YearRemodAdd']
all_data['GarageAge'] = all_data['YrSold'] - all_data['GarageYrBlt']
# Houses without a garage have no GarageYrBlt; assign the filled column back
all_data['GarageAge'] = all_data['GarageAge'].fillna(0)
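
To illustrate the missing-garage case, the sketch below uses a toy frame (values are invented) in which one house has no `GarageYrBlt`. The subtraction propagates the `NaN`, and assigning the result of `fillna` back to the column fills it without calling `inplace=True` on a column selection, a pattern pandas warns about.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); the second house has no garage
toy = pd.DataFrame({
    'YrSold': [2008, 2010],
    'GarageYrBlt': [2003.0, np.nan],
})

# NaN in GarageYrBlt propagates through the subtraction
toy['GarageAge'] = toy['YrSold'] - toy['GarageYrBlt']

# Assign the filled column back instead of using inplace=True
toy['GarageAge'] = toy['GarageAge'].fillna(0)
print(toy['GarageAge'].tolist())
```

One consequence of this choice is that a filled value of 0 cannot be told apart from a garage built in the year of sale.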


## Split Back into Train/Test
Separate the combined data back into training and test sets, and log-transform the target for modeling.

In [7]:
# Split
n_train = y.shape[0]
X_train = all_data.iloc[:n_train, :].copy()
X_test = all_data.iloc[n_train:, :].copy()

# Log-transform the target
y_log = np.log1p(y)
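
As a quick check of the transform used here, the sketch below (with made-up prices) shows that `np.expm1` inverts `np.log1p`, so predictions made in log space can be mapped back to the price scale:

```python
import numpy as np

# Made-up prices, including 0 to show that log1p is defined there
prices = np.array([0.0, 50_000.0, 208_500.0])

logged = np.log1p(prices)    # log(1 + x)
restored = np.expm1(logged)  # exp(x) - 1, the inverse of log1p

print(np.allclose(restored, prices))
```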

## Preprocessing Pipelines
Define numeric and categorical transformers and combine them using `ColumnTransformer`.

In [8]:
# Identify feature types
numeric_feats = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_feats = X_train.select_dtypes(include=['object']).columns.tolist()

# Numeric pipeline: impute and scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: impute and one-hot encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine into ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_feats),
    ('cat', categorical_transformer, categorical_feats)
])
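
The behaviour of this preprocessor can be seen on a toy frame. The sketch below is an illustration under assumptions, not part of the notebook's pipeline: the data is invented, and `sparse_threshold=0` is set only so that the output is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values) with one missing entry per column
toy_train = pd.DataFrame({
    'LotArea': [8000.0, np.nan, 12000.0],
    'Street': ['Pave', 'Grvl', np.nan],
})
toy_test = pd.DataFrame({
    'LotArea': [10000.0],
    'Street': ['Dirt'],  # category never seen during fit
})

toy_pre = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['LotArea']),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['Street']),
], sparse_threshold=0)  # assumption for this sketch: force dense output

train_out = toy_pre.fit_transform(toy_train)
test_out = toy_pre.transform(toy_test)

# One scaled numeric column plus one column per fitted category
print(train_out.shape, test_out.shape)
# The unseen category is encoded as all zeros rather than raising an error
print(test_out[0, 1:])
```

The missing `Street` value becomes its own `Missing` category during fitting, while the unseen `Dirt` value at transform time gets no active one-hot column.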

## Modeling and Evaluation
- Build a full pipeline with preprocessing and `LinearRegression`.
- Evaluate using 5-fold cross-validated RMSE on the log-transformed target.

In [9]:
# Full modeling pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Cross-validated RMSE (log-target)
cv_scores = np.sqrt(-cross_val_score(
    model_pipeline, X_train, y_log, cv=5, scoring='neg_mean_squared_error'
))
print(f'CV RMSE (log-target): {cv_scores.mean():.4f}')

CV RMSE (log-target): 0.1572
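
One rough way to read a log-space RMSE, offered as an interpretation rather than a formal guarantee: an error of `r` in log space corresponds to a multiplicative factor of about `exp(r)` on the price scale. The sketch below applies that to the score printed above, which suggests a typical error on the order of 17% of the sale price.

```python
import numpy as np

# CV RMSE on the log1p-transformed target, copied from the output above
cv_rmse_log = 0.1572

# Approximate multiplicative error on the original price scale
factor = np.exp(cv_rmse_log)
print(f'Typical multiplicative error: x{factor:.3f}')
```

The approximation treats `log1p(y)` as `log(y)`, which is a negligible difference at the magnitude of house prices.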


## Train Final Model and Create Submission
Train on the full training data, generate predictions on the test set, and prepare the submission file.

In [10]:
# Fit on the full training set
model_pipeline.fit(X_train, y_log)

# Predict and invert log-transform
preds_log = model_pipeline.predict(X_test)
preds = np.expm1(preds_log)

# Prepare submission: overwrite the sample's SalePrice column with the predictions
submission['SalePrice'] = preds
submission.to_csv('submission.csv', index=False)
print("Saved submission.csv")

Saved submission.csv
