# Feature Engineering
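This notebook converts categorical columns to numeric form with one-hot encoding, standardizes the numeric columns, and saves the result for model training.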


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Confirm that the imports succeeded
print("Libraries loaded successfully.")



Libraries loaded successfully.


## Load Cleaned Data
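Load the cleaned training data and preview the first rows before feature engineering. The `Unnamed: 0` column in the preview is most likely a row index written by an earlier `to_csv` call rather than a real feature.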


In [3]:
# Load the cleaned dataset
data = pd.read_csv("../data/final_cleaned_train.csv")

# Display the first few rows
print("Loaded cleaned dataset:")
data.head()


Loaded cleaned dataset:


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Grvl,Reg,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,Grvl,Reg,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,12,2008,WD,Normal,250000
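
The stray `Unnamed: 0` column in the preview above can be reproduced on a small in-memory CSV. The following is a minimal sketch, not the notebook's actual loading code: the `toy` frame and its values are hypothetical, and it assumes the column came from saving with the default `index=True`. It shows how the column appears and how `index_col=0` avoids it.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the cleaned training data.
toy = pd.DataFrame({"Id": [1, 2, 3], "LotArea": [8450, 9600, 11250]})

# Writing with the default index=True stores the row index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading it back without further options yields an extra "Unnamed: 0" column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 treats that first column as the index instead.
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))  # ['Unnamed: 0', 'Id', 'LotArea']
print(list(clean.columns))  # ['Id', 'LotArea']
```

If the column is left in place, it is carried through encoding and scaling like any other numeric feature.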


## Encoding Categorical Variables
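Most models require numeric inputs, so categorical variables must be converted before training.
- One-hot encoding is applied to every column with dtype `object`.
- `drop_first=True` drops the first level of each feature, which is implied whenever all of that feature's remaining indicators are zero. This avoids perfect multicollinearity among the indicator columns.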

In [4]:
# Identify categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns

# Apply one-hot encoding
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

# Display dataset info after encoding
print(f"Feature Engineering: Categorical variables encoded. Data now has {data.shape[1]} features.")


Feature Engineering: Categorical variables encoded. Data now has 246 features.
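
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical toy frame. The column names mirror the dataset, but the values are illustrative only. Without `drop_first`, the indicator columns of a feature always sum to 1, so one of them is redundant.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative, not taken from the real file.
toy = pd.DataFrame({
    "LotShape": ["Reg", "IR1", "Reg", "IR2"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# One indicator column per level: IR1, IR2, Reg (levels are sorted).
full = pd.get_dummies(toy, columns=["LotShape"])

# drop_first=True removes the first level (IR1), which becomes the baseline.
reduced = pd.get_dummies(toy, columns=["LotShape"], drop_first=True)

print(list(full.columns))     # ['LotArea', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_Reg']
print(list(reduced.columns))  # ['LotArea', 'LotShape_IR2', 'LotShape_Reg']

# The full set of indicators sums to 1 in every row, which is the redundancy
# that drop_first removes.
indicator_sum = full[["LotShape_IR1", "LotShape_IR2", "LotShape_Reg"]].sum(axis=1)
print(indicator_sum.tolist())  # [1, 1, 1, 1]
```

A row with zeros in both remaining indicators belongs to the dropped baseline level, so no information is lost.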


## Feature Scaling
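Standardizing numeric variables puts them on a comparable scale, which matters for scale-sensitive models such as regularized linear or distance-based models.
- `StandardScaler` rescales each numeric column to zero mean and unit variance.
- As written, the cell scales every numeric column, including `Unnamed: 0`, `Id`, and `SalePrice`.
- Whether the one-hot indicator columns are scaled depends on the pandas version: from pandas 2.0 on, `get_dummies` returns `bool` columns, which `include=['number']` does not select.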



In [5]:
# Initialize the scaler
scaler = StandardScaler()

# Apply scaling only to numeric columns
numeric_cols = data.select_dtypes(include=['number']).columns
data[numeric_cols] = scaler.fit_transform(data[numeric_cols])

print("Feature Engineering: Applied standard scaling.")


Feature Engineering: Applied standard scaling.
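
The cell above scales every numeric column. As an alternative, the sketch below leaves identifier and target columns untouched. It is a minimal illustration on hypothetical toy data, and it assumes that `Id` is an identifier and `SalePrice` is the prediction target; neither assumption is stated in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Assumption: Id is an identifier and SalePrice is the target, so neither is scaled.
excluded = ["Id", "SalePrice"]
feature_cols = toy.select_dtypes(include=["number"]).columns.difference(excluded)

# Scale only the remaining numeric feature columns.
scaled = toy.copy()
scaled[feature_cols] = StandardScaler().fit_transform(toy[feature_cols])

print(scaled)
```

Keeping the target in its original units means that predictions and error metrics remain in the original units as well.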


## Save Processed Dataset
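The processed dataset is written to `../data/processed_train.csv` for use in model training. `to_csv` does not create missing directories, so the cell creates the output directory first. The `OSError` recorded below is what pandas raises when the target directory is missing; it names `data` rather than `../data`, so it appears to come from an earlier version of the cell.


In [None]:
from pathlib import Path

# Save the processed dataset, creating the output directory first if needed
output_path = Path("../data/processed_train.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
data.to_csv(output_path, index=False)

print("Feature Engineering complete. Processed dataset saved successfully!")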


OSError: Cannot save file into a non-existent directory: 'data'