# 🧠 Feature Engineering & Preprocessing

This notebook prepares the cleaned housing dataset for machine learning: it one-hot encodes the categorical features, drops the raw target column, and saves the final dataset.


## 📥 Step 1: Load Cleaned Dataset


In [1]:
import pandas as pd

df = pd.read_csv('../data/house_prices_cleaned.csv')
df.head()


| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GarageYrBlt | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd | SalePrice | SalePrice_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 548 | RFn | 2003.0 | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 2003 | 2003 | 208500 | 12.247694 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | 460 | RFn | 1976.0 | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | 1976 | 1976 | 181500 | 12.109011 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 608 | RFn | 2001.0 | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | 2001 | 2002 | 223500 | 12.317167 |
| 3 | 961 | 0.0 | 3.0 | No | 216 | ALQ | 540 | 642 | Unf | 1998.0 | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | 1915 | 1970 | 140000 | 11.849398 |
| 4 | 1145 | 0.0 | 4.0 | Av | 655 | GLQ | 490 | 836 | RFn | 2000.0 | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | 2000 | 2000 | 250000 | 12.429216 |


## 🏷️ Step 2: Encode Categorical Variables

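We apply one-hot encoding with `pd.get_dummies` to convert each categorical feature into numeric indicator columns. Passing `drop_first=True` drops the first level of every categorical feature, which removes a redundant column per feature.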


In [2]:
# Encode categorical variables using one-hot encoding
df_encoded = pd.get_dummies(df, drop_first=True)
df_encoded.shape


(1460, 32)
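
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The frame `toy` is hypothetical and only borrows two column names and a few values from the preview in Step 1; it is not the notebook's actual data.

```python
import pandas as pd

# Hypothetical toy frame: one categorical and one numeric column
toy = pd.DataFrame({
    "BsmtExposure": ["No", "Gd", "Mn", "No", "Av"],
    "GarageArea": [548, 460, 608, 642, 836],
})

# Numeric columns pass through untouched; the categorical column is expanded
encoded = pd.get_dummies(toy, drop_first=True)

# Levels are sorted alphabetically, so "Av" is the dropped (baseline) level
print(encoded.columns.tolist())

# A row whose original value was "Av" has every BsmtExposure_* indicator off
dummy_cols = [c for c in encoded.columns if c.startswith("BsmtExposure_")]
print(encoded.loc[4, dummy_cols].tolist())
```

With four levels in `BsmtExposure`, only three indicator columns are produced; the dropped level is implied when all three are zero.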

## 📉 Step 3: Drop Original Target Column

The model will be trained on the log-transformed price (`SalePrice_log`), so we drop the original `SalePrice`. Leaving it in would leak the raw target into the feature set.


In [3]:
# Keep only SalePrice_log as the target; reassign rather than mutate in place
df_encoded = df_encoded.drop(columns=['SalePrice'])
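
The `SalePrice_log` values in the Step 1 preview appear consistent with a natural logarithm of `SalePrice`. The sketch below checks that assumption on the five previewed prices and shows how a prediction on the log scale could be mapped back to dollars with `np.exp`. The variable names are illustrative, and how the log column was actually created is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# The five SalePrice values shown in the Step 1 preview
prices = pd.Series([208500, 181500, 223500, 140000, 250000], name="SalePrice")

# Assumption: the log column was built with the natural log (np.log)
log_prices = np.log(prices)
print(log_prices.round(6).tolist())

# Inverse transform: predictions made on the log scale map back with np.exp
recovered = np.exp(log_prices)
print(recovered.round(0).tolist())
```

Because the original price is fully recoverable from the log column, dropping `SalePrice` loses no information.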


## 💾 Step 4: Save the Processed Dataset

We save the fully preprocessed dataset to a new CSV file. Passing `index=False` keeps the row index from being written as an extra column.


In [4]:
df_encoded.to_csv('../data/house_prices_preprocessed.csv', index=False)
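
As a brief illustration of why `index=False` matters, the sketch below round-trips a small hypothetical frame through an in-memory buffer instead of a file. With the default `index=True`, reloading the CSV produces an extra `Unnamed: 0` column.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the preprocessed data
frame = pd.DataFrame({
    "GarageArea": [548, 460],
    "SalePrice_log": [12.247694, 12.109011],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)     # ['Unnamed: 0', 'GarageArea', 'SalePrice_log']

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['GarageArea', 'SalePrice_log']
```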
