# Feature Engineering for House Price Prediction

This notebook explores feature engineering for a house price model. We derive new features from the existing columns, fit a linear regression on a feature set that includes them, and evaluate it with mean squared error and R-squared on a held-out test set.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the processed dataset
data = pd.read_csv('../data/processed/processed_data.csv')
data.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,Price
0,15.0234,0.0,18.1,0,0.614,5.304,97.3,2.1007,24,666,20.2,349.48,24.91,12.0
1,0.62739,0.0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9
2,0.03466,35.0,6.06,0,0.4379,6.031,23.3,6.6407,1,304,16.9,362.25,7.83,19.4
3,7.05042,0.0,18.1,0,0.614,6.103,85.1,2.0218,24,666,20.2,2.52,23.29,13.4
4,0.7258,0.0,8.14,0,0.538,5.727,69.5,3.7965,4,307,21.0,390.95,11.28,18.2


In [None]:
print(data.columns)

Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'Price'],
      dtype='object')


## Creating New Features

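We derive two ratio features from existing columns: `rooms_per_tax` (`rm` divided by `tax`) and `lstat_to_crim_ratio` (`lstat` divided by `crim`). Ratios like these can expose relationships between columns that a linear model cannot represent from the raw columns alone.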

In [22]:
# Example: Creating a new feature 'rooms_per_tax'
data['rooms_per_tax'] = data['rm'] / data['tax']

# Example: Creating a new feature 'lstat_to_crim_ratio'
data['lstat_to_crim_ratio'] = data['lstat'] / data['crim']

# Display the new features
print(data[['rm', 'tax', 'rooms_per_tax', 'lstat', 'crim', 'lstat_to_crim_ratio']].head())

# Display the same columns for the full dataset (pandas truncates the middle rows)
print(data[['rm', 'tax', 'rooms_per_tax', 'lstat', 'crim', 'lstat_to_crim_ratio']])

      rm  tax  rooms_per_tax  lstat      crim  lstat_to_crim_ratio
0  5.304  666       0.007964  24.91  15.02340             1.658080
1  5.834  307       0.019003   8.47   0.62739            13.500375
2  6.031  304       0.019839   7.83   0.03466           225.908829
3  6.103  666       0.009164  23.29   7.05042             3.303349
4  5.727  307       0.018655  11.28   0.72580            15.541471
        rm  tax  rooms_per_tax  lstat      crim  lstat_to_crim_ratio
0    5.304  666       0.007964  24.91  15.02340             1.658080
1    5.834  307       0.019003   8.47   0.62739            13.500375
2    6.031  304       0.019839   7.83   0.03466           225.908829
3    6.103  666       0.009164  23.29   7.05042             3.303349
4    5.727  307       0.018655  11.28   0.72580            15.541471
..     ...  ...            ...    ...       ...                  ...
399  5.836  384       0.015198  18.66   0.17120           108.995327
400  5.856  223       0.026260  13.00   0.2991
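
Both ratios divide by a raw column, so a zero in `tax` or `crim` would produce `inf` (or `NaN` for 0/0) and break a later model fit. The rows shown above contain no zeros, but a guarded version may be worth considering. A minimal sketch, where `safe_ratio` is a hypothetical helper and the small frame holds made-up values rather than the notebook's dataset:

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN wherever the denominator is zero."""
    # Mask zero denominators as NaN first so the division never yields +/-inf.
    return numerator / denominator.where(denominator != 0)


# Made-up illustrative values (assumption): not taken from processed_data.csv.
demo = pd.DataFrame({
    "lstat": [24.91, 8.47, 7.83],
    "crim": [15.0, 0.0, 0.5],
})
demo["lstat_to_crim_ratio"] = safe_ratio(demo["lstat"], demo["crim"])
print(demo)
```

Rows that end up as `NaN` can then be dropped or imputed before fitting; which option is appropriate depends on the data.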

## Testing Feature Combinations

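We fit a linear regression on the original columns `rm`, `tax`, `lstat`, and `crim` together with the two engineered ratios, then evaluate it on a held-out 20% test split. Repeating this with other feature subsets shows which ones contribute the most to the model's performance.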

In [24]:
# Example: Testing combinations of features
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define features and target
X = data[['rm', 'tax', 'rooms_per_tax', 'lstat', 'crim', 'lstat_to_crim_ratio']]  # Original and engineered features
y = data['Price']  # Target variable

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 26.267323886242025
R-squared: 0.6305574781652072
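
The cell above scores a single feature set, so on its own it cannot say whether the engineered ratios help. One way to test combinations is to score several candidate sets under the same cross-validation splits. The sketch below uses synthetic stand-in data and a hypothetical `compare_feature_sets` helper; the column names mirror the notebook, but the values and the resulting scores are not the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def compare_feature_sets(df, target, feature_sets, n_splits=5, seed=42):
    """Return the mean cross-validated R^2 for each named list of columns."""
    # A fixed seed gives every feature set the same folds, so scores are comparable.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for name, cols in feature_sets.items():
        r2 = cross_val_score(LinearRegression(), df[cols], df[target], cv=cv, scoring="r2")
        scores[name] = r2.mean()
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic stand-in data (assumption): random values, not the housing dataset.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "rm": rng.normal(6.0, 0.7, n),
    "tax": rng.integers(200, 700, n).astype(float),
    "lstat": rng.uniform(2.0, 35.0, n),
    "crim": rng.uniform(0.01, 20.0, n),
})
demo["rooms_per_tax"] = demo["rm"] / demo["tax"]
demo["lstat_to_crim_ratio"] = demo["lstat"] / demo["crim"]
demo["Price"] = 5.0 * demo["rm"] - 0.5 * demo["lstat"] + rng.normal(0.0, 2.0, n)

base = ["rm", "tax", "lstat", "crim"]
feature_sets = {
    "base": base,
    "base + rooms_per_tax": base + ["rooms_per_tax"],
    "base + both ratios": base + ["rooms_per_tax", "lstat_to_crim_ratio"],
}
results = compare_feature_sets(demo, "Price", feature_sets)
print(results)
```

Applied to the notebook's `data` frame, the same helper would put the base set and the engineered sets side by side, which is the comparison needed to judge the new features.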


## Conclusion

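In this notebook, we created two ratio features, `rooms_per_tax` and `lstat_to_crim_ratio`, and fit a linear regression that uses them alongside `rm`, `tax`, `lstat`, and `crim`. On the 20% test split, the model reached a mean squared error of about 26.27 and an R-squared of about 0.63. Because no model without the engineered features was fit here, these numbers do not yet show whether the new features help. Comparing against such a baseline, refining the features, and testing additional combinations are natural next steps.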