In a real estate pricing analysis, size-related variables are an important part of feature engineering: they can improve model accuracy and make the results easier to interpret. The Python code below derives several size-related features from the cleaned dataset, along with a few indicator and age features.

In [1]:
import pandas as pd

In [2]:
# Load cleaned data
path = r'C:\Users\Admin\Desktop\NextHikes\EDA_Project_Complete\Data\cleaned_data.csv'
df = pd.read_csv(path)

In [3]:
# 1. Total square footage as the sum of basement, 1st and 2nd floor areas
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
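
One thing to keep in mind with the sum above: adding columns with `+` propagates missing values, so a single NaN in any of the three areas makes `TotalSF` NaN for that row. The cleaned data presumably has no such gaps, but a guard could look like the sketch below. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the second row has a missing basement area.
toy = pd.DataFrame({
    'TotalBsmtSF': [800.0, np.nan],
    '1stFlrSF': [900, 1000],
    '2ndFlrSF': [700, 0],
})

area_cols = ['TotalBsmtSF', '1stFlrSF', '2ndFlrSF']

# Plain addition propagates NaN to the result.
plain_sum = toy['TotalBsmtSF'] + toy['1stFlrSF'] + toy['2ndFlrSF']

# A row-wise sum skips NaN by default, which treats a missing area as 0.
safe_sum = toy[area_cols].sum(axis=1)

print(plain_sum.isna().sum())  # number of rows where the plain sum is NaN
print(safe_sum)
```

Whether treating a missing area as 0 is appropriate depends on why the value is missing, so it is worth checking `df[area_cols].isna().sum()` before choosing either form.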

In [4]:
# 2. Total bathrooms including full and half, both above and below grade
df['TotalBathrooms'] = (df['FullBath'] + (0.5 * df['HalfBath']) +
                        df['BsmtFullBath'] + (0.5 * df['BsmtHalfBath']))

In [5]:
# 3. Total porch area (sum of all porch area variables)
df['TotalPorchSF'] = (df['OpenPorchSF'] + df['EnclosedPorch'] + 
                      df['3SsnPorch'] + df['ScreenPorch'])

In [6]:
# 4. Adding boolean feature for whether the house has a 2nd floor
df['Has2ndFloor'] = (df['2ndFlrSF'] > 0).astype(int)

In [7]:
# 5. Adding boolean feature for whether the house has a garage
df['HasGarage'] = (df['GarageArea'] > 0).astype(int)

In [8]:
# 6. Adding boolean feature for whether the house has a basement
df['HasBasement'] = (df['TotalBsmtSF'] > 0).astype(int)

In [9]:
# 7. Adding boolean feature for whether the house has a fireplace
df['HasFireplace'] = (df['Fireplaces'] > 0).astype(int)
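
The four indicator features above all follow the same pattern (source column greater than zero, cast to 0/1), so they could also be generated from a single mapping. The following is a minimal sketch on a made-up two-row frame; the `flag_sources` dictionary is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy frame with made-up values for the four source columns.
toy = pd.DataFrame({
    '2ndFlrSF': [854, 0],
    'GarageArea': [548, 0],
    'TotalBsmtSF': [856, 0],
    'Fireplaces': [0, 1],
})

# Hypothetical mapping: new indicator name -> source column.
flag_sources = {
    'Has2ndFloor': '2ndFlrSF',
    'HasGarage': 'GarageArea',
    'HasBasement': 'TotalBsmtSF',
    'HasFireplace': 'Fireplaces',
}

# Each indicator is 1 when the source value is positive, else 0.
for flag, source in flag_sources.items():
    toy[flag] = (toy[source] > 0).astype(int)

print(toy[list(flag_sources)])
```

Keeping the mapping in one place makes it easy to add further indicators later without repeating the comparison logic.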

In [10]:
# 8. Interaction feature: total area x overall quality (lets size count for more in higher-quality houses)
df['Area_x_OverallQual'] = df['TotalSF'] * df['OverallQual']

In [12]:
import numpy as np

In [13]:
# 9. Log transformation of TotalSF to reduce right skew; log1p(x) computes log(1 + x), so zero areas are safe
df['LogTotalSF'] = np.log1p(df['TotalSF'])
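
As a quick sanity check on the transform, `np.log1p(x)` and `np.log(x + 1)` give the same result, and both map an area of 0 to 0 rather than to negative infinity. The sketch below uses made-up area values, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up total areas, including a zero to show why the +1 shift matters.
total_sf = pd.Series([0, 1000, 2566, 10000], name='TotalSF')

log_plus_one = np.log(total_sf + 1)  # explicit shift
log1p = np.log1p(total_sf)           # same quantity via the dedicated function

print(np.allclose(log_plus_one, log1p))
print(log1p)
```

To get back to the original scale, for example when reading model coefficients, `np.expm1` is the inverse of `np.log1p`.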

In [14]:
# 10. Creating new feature for the age of the house at the time of sale
df['HouseAge'] = df['YrSold'] - df['YearBuilt']

In [15]:
# 11. Feature to capture years since remodeling
df['YearsSinceRemodel'] = df['YrSold'] - df['YearRemodAdd']
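
Both year differences can come out negative if a sale is recorded before the build or remodel year, so it may be worth counting such rows before modeling. The sketch below assumes that clipping negatives to 0 is an acceptable fix, which is a judgment call; the `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy frame with made-up years; the second row is remodeled after the sale year.
toy = pd.DataFrame({
    'YrSold': [2008, 2007],
    'YearBuilt': [2003, 2007],
    'YearRemodAdd': [2003, 2008],
})

toy['HouseAge'] = toy['YrSold'] - toy['YearBuilt']
toy['YearsSinceRemodel'] = toy['YrSold'] - toy['YearRemodAdd']

# Count negative values before deciding how to handle them.
n_negative = int((toy['YearsSinceRemodel'] < 0).sum())
print(n_negative)

# Assumption: treat a negative gap as "just remodeled" by clipping at 0.
toy['YearsSinceRemodel'] = toy['YearsSinceRemodel'].clip(lower=0)
print(toy)
```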

In [19]:
# Save the enhanced dataset with new features
df.to_csv('C:/Users/Admin/Desktop/NextHikes/EDA_Project_Complete/Data/enhanced_data_with_size_features.csv', index=False)

In [18]:
print(df.head())  # To check the new features

   Unnamed: 0  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
0           0           65     8450            7            5       2003   
1           1           80     9600            6            8       1976   
2           2           68    11250            7            5       2001   
3           3           60     9550            7            5       1915   
4           4           84    14260            8            5       2000   

   YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  ...  TotalBathrooms  \
0          2003         196         706           0  ...             3.5   
1          1976           0         978           0  ...             2.5   
2          2002         162         486           0  ...             3.5   
3          1970           0         216           0  ...             2.0   
4          2000         350         655           0  ...             3.5   

   TotalPorchSF  Has2ndFloor  HasGarage  HasBasement  HasFireplace  \
0            61 
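
The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV was written with its index and then read back without `index_col`. Assuming that is what happened when the cleaned file was saved, the round trip below reproduces the effect on a made-up frame and shows one way to avoid it. It uses an in-memory buffer instead of the real file path.

```python
import io
import pandas as pd

# Made-up frame standing in for the cleaned data.
toy = pd.DataFrame({'LotArea': [8450, 9600]})

# to_csv writes the index by default, as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading it back without index_col turns that column into 'Unnamed: 0'.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))

# Reading with index_col=0 restores it as the index instead.
buffer.seek(0)
reloaded_clean = pd.read_csv(buffer, index_col=0)
print(list(reloaded_clean.columns))
```

Saving with `index=False`, as the final `to_csv` call in this notebook already does, prevents the extra column from appearing in later loads.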