# House Prices - Model Validation and Improvement

This notebook walks through building and validating a regression model that predicts house sale prices.
The main goal is to understand why a model's measured accuracy changes depending on how it is evaluated and how complex it is, and how validation supports sound modeling decisions.

## Load and Inspect the Data

I start by loading the dataset and performing basic checks:
- shape and columns,
- missing values,
- duplicates,
- basic descriptive statistics.

This helps ensure that the data is clean and suitable for modeling.

In [1]:
import pandas as pd
file_path = "C:\\Users\\lb_20\\Downloads\\house_prices_practice.csv"
df = pd.read_csv(file_path)

In [2]:
df

,Id,OverallQual,GrLivArea,GarageCars,TotalBsmtSF,YearBuilt,FullBath,BedroomAbvGr,LotArea,SalePrice
0,1,7,1560,0,1658,1969,2,1,8059,177106
1,2,4,2827,2,1319,2012,3,4,13530,301044
2,3,8,3920,0,841,2010,1,4,9010,360609
3,4,5,3044,0,1058,1998,0,4,13207,240556
4,5,7,801,1,2428,2020,0,1,9117,193656
...,...,...,...,...,...,...,...,...,...,...
295,296,1,3495,1,1792,1954,2,5,4978,250604
296,297,5,3438,3,1266,2003,0,1,9373,329906
297,298,6,1992,0,1148,1996,1,1,7907,184623
298,299,3,3722,1,1407,1998,1,1,8097,303345


In [3]:
df.isnull().sum()

Id              0
OverallQual     0
GrLivArea       0
GarageCars      0
TotalBsmtSF     0
YearBuilt       0
FullBath        0
BedroomAbvGr    0
LotArea         0
SalePrice       0
dtype: int64

In [4]:
df.duplicated().sum()

0

In [5]:
df["Id"].nunique()

300

In [6]:
df.describe()

,Id,OverallQual,GrLivArea,GarageCars,TotalBsmtSF,YearBuilt,FullBath,BedroomAbvGr,LotArea,SalePrice
count,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0
mean,150.5,5.326667,2307.386667,1.33,1468.796667,1986.163333,1.523333,2.926667,8969.453333,252262.903333
std,86.746758,2.873001,1042.561303,1.109898,672.333705,21.377089,1.131543,1.456604,3753.531132,74998.055214
min,1.0,1.0,504.0,0.0,303.0,1950.0,0.0,1.0,2009.0,82494.0
25%,75.75,3.0,1392.25,0.0,903.0,1967.0,0.0,2.0,5996.25,190355.25
50%,150.5,5.0,2265.5,1.0,1502.0,1986.0,2.0,3.0,9031.0,251292.5
75%,225.25,8.0,3306.5,2.0,2129.5,2004.25,3.0,4.0,12316.0,307105.0
max,300.0,10.0,3998.0,3.0,2492.0,2023.0,3.0,5.0,14987.0,435291.0


## Feature Selection
All columns in the dataset are numeric. Id is only a row identifier and SalePrice is the target, so the remaining eight columns, each a plausible driver of a home's price, are used as the feature matrix X.

In [7]:
df.columns

Index(['Id', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF',
       'YearBuilt', 'FullBath', 'BedroomAbvGr', 'LotArea', 'SalePrice'],
      dtype='object')

In [8]:
y = df.SalePrice

feature_names = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF", "YearBuilt", "FullBath", "BedroomAbvGr", "LotArea"]
X = df[feature_names]
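
As a small sketch of an alternative to typing the list by hand, the feature names can be derived from the column index by dropping the identifier and the target. The `columns` list below is copied from the `df.columns` output above; the names `excluded` and `derived_feature_names` are illustrative and not part of the original notebook.

```python
# Column names as printed by df.columns above.
columns = ["Id", "OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF",
           "YearBuilt", "FullBath", "BedroomAbvGr", "LotArea", "SalePrice"]

# Illustrative name: columns that should not be used as predictors.
excluded = {"Id", "SalePrice"}

# Keep the original column order while dropping the excluded names.
derived_feature_names = [name for name in columns if name not in excluded]
print(derived_feature_names)
```

Inside the notebook, `columns = list(df.columns)` would take the place of the hard-coded list.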

In [9]:
X.describe()

,OverallQual,GrLivArea,GarageCars,TotalBsmtSF,YearBuilt,FullBath,BedroomAbvGr,LotArea
count,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0
mean,5.326667,2307.386667,1.33,1468.796667,1986.163333,1.523333,2.926667,8969.453333
std,2.873001,1042.561303,1.109898,672.333705,21.377089,1.131543,1.456604,3753.531132
min,1.0,504.0,0.0,303.0,1950.0,0.0,1.0,2009.0
25%,3.0,1392.25,0.0,903.0,1967.0,0.0,2.0,5996.25
50%,5.0,2265.5,1.0,1502.0,1986.0,2.0,3.0,9031.0
75%,8.0,3306.5,2.0,2129.5,2004.25,3.0,4.0,12316.0
max,10.0,3998.0,3.0,2492.0,2023.0,3.0,5.0,14987.0


In [10]:
X.head()

,OverallQual,GrLivArea,GarageCars,TotalBsmtSF,YearBuilt,FullBath,BedroomAbvGr,LotArea
0,7,1560,0,1658,1969,2,1,8059
1,4,2827,2,1319,2012,3,4,13530
2,8,3920,0,841,2010,1,4,9010
3,5,3044,0,1058,1998,0,4,13207
4,7,801,1,2428,2020,0,1,9117


## Baseline Model
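As a baseline, I train a simple regression algorithm, a decision tree, on the full dataset and predict on the same rows. The in-sample predictions match the actual prices exactly, which shows that the tree has memorized the training data; it does not yet tell us how well the model performs on new data.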

In [11]:
from sklearn.tree import DecisionTreeRegressor as dtr
house_prices_model = dtr(random_state = 1)
house_prices_model.fit(X, y)

In [12]:
predictions = house_prices_model.predict(X)
print(predictions[:10])

[177106. 301044. 360609. 240556. 193656. 213952. 145539. 350830. 275955.
 211664.]


In [13]:
print("First in-sample predictions:", house_prices_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [177106. 301044. 360609. 240556. 193656.]
Actual target values for those homes: [177106, 301044, 360609, 240556, 193656]
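
The exact match above is what an unconstrained decision tree is expected to produce on its own training rows. The sketch below illustrates this on synthetic stand-in data (an assumption, not the notebook's CSV): the tree reproduces the rows it was fitted on, while rows it never saw show a much larger error. The names `X_demo`, `y_demo`, and `tree` are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the notebook's dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 1000 * X_demo[:, 0] + rng.normal(0, 5000, size=200)

# Fit on the first 150 rows only; the last 50 rows stay unseen.
tree = DecisionTreeRegressor(random_state=1)
tree.fit(X_demo[:150], y_demo[:150])

# Error on the rows used for fitting versus rows the tree never saw.
in_sample_mae = mean_absolute_error(y_demo[:150], tree.predict(X_demo[:150]))
unseen_mae = mean_absolute_error(y_demo[150:], tree.predict(X_demo[150:]))

# The in-sample error is expected to be zero (or numerically negligible).
print("In-sample MAE:", in_sample_mae)
print("Unseen-data MAE:", unseen_mae)
```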


## Model Validation

To estimate performance on unseen data, I split the data into training and validation sets.
The model is evaluated with the mean absolute error (MAE), the average absolute difference between predicted and actual prices.

In [15]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
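
Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. A minimal sketch of what that means for 300 rows, using placeholder row numbers rather than the real data (the names `rows`, `train_rows`, and `val_rows` are illustrative):

```python
from sklearn.model_selection import train_test_split

# 300 placeholder rows, matching the row count reported by df.describe().
rows = list(range(300))

# No test_size given, so scikit-learn uses its default of 0.25.
train_rows, val_rows = train_test_split(rows, random_state=1)

print(len(train_rows), len(val_rows))  # 225 75
```

So every validation figure in this notebook is an average over 75 houses, which is worth keeping in mind when comparing models whose errors are close.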

In [16]:
house_prices_model = dtr(random_state = 1)
house_prices_model.fit(train_X, train_y)

In [17]:
val_predictions = house_prices_model.predict(val_X)

print(val_predictions[:5])
print(val_y.head())

[424406. 166363. 319207. 135390. 153184.]
189    435291
123    178656
185    312483
213    123716
106    165288
Name: SalePrice, dtype: int64


In [18]:
from sklearn.metrics import mean_absolute_error as mae
val_mae = mae(val_y, val_predictions)
print("Validation MAE: {:.2f}".format(val_mae))

Validation MAE: 36950.32
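
To make the metric concrete: MAE is $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $y_i$ is the actual price and $\hat{y}_i$ the prediction. The sketch below applies that formula by hand to the five validation rows printed above; the variable names are illustrative.

```python
# First five validation predictions and actual prices, copied from the output above.
predicted = [424406, 166363, 319207, 135390, 153184]
actual = [435291, 178656, 312483, 123716, 165288]

# Absolute error per house, then the plain average.
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae_five_rows = sum(abs_errors) / len(abs_errors)

print(abs_errors)     # [10885, 12293, 6724, 11674, 12104]
print(mae_five_rows)  # 10736.0
```

These five rows average well below the full validation MAE of 36950.32, so some of the remaining validation rows must carry considerably larger errors.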


## Underfitting and Overfitting

I vary the "max_leaf_nodes" parameter of the decision tree to see how model complexity affects validation error. This illustrates the trade-off between underfitting (too few leaves) and overfitting (too many leaves).

In [20]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE."""
    model = dtr(max_leaf_nodes = max_leaf_nodes, random_state = 0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    val_mae = mae(val_y, preds_val)
    return val_mae

In [21]:
options_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
for max_leaf_nodes in options_max_leaf_nodes:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  43339
Max leaf nodes: 25  		 Mean Absolute Error:  38746
Max leaf nodes: 50  		 Mean Absolute Error:  34592
Max leaf nodes: 100  		 Mean Absolute Error:  35673
Max leaf nodes: 250  		 Mean Absolute Error:  35796
Max leaf nodes: 500  		 Mean Absolute Error:  35796
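
Rather than reading the best size off the printed table, it can be picked programmatically. A minimal sketch, using the MAE values copied from the output above (the dictionary name `mae_by_leaf_nodes` is illustrative):

```python
# Validation MAE per max_leaf_nodes, copied from the printed results above.
mae_by_leaf_nodes = {5: 43339, 25: 38746, 50: 34592, 100: 35673, 250: 35796, 500: 35796}

# The key with the smallest MAE is the preferred tree size.
best_size = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print(best_size)  # 50
```

In the notebook itself, the dictionary could be built with `{n: get_mae(n, train_X, val_X, train_y, val_y) for n in options_max_leaf_nodes}`.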


In [22]:
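# 50 leaf nodes gave the lowest validation MAE in the comparison above
best_tree_size = 50

# Refit on all of the data now that the tree size has been chosen
final_model = dtr(max_leaf_nodes = best_tree_size, random_state = 1)
final_model.fit(X, y)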

## Random Forest

Finally, I replace the single decision tree with an ensemble of trees (a random forest).
Averaging many trees typically improves generalization by reducing variance.
On this validation split, the random forest reaches a lower MAE (29117.72) than either single-tree configuration (36950.32 with default settings, 34592.24 with 50 leaf nodes), which makes it the best of the three models tried here.

In [24]:
val_predictions = house_prices_model.predict(val_X)
val_mae = mae(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: {:.2f}".format(val_mae))

house_prices_model = dtr(max_leaf_nodes = 50, random_state = 0)
house_prices_model.fit(train_X, train_y)
val_predictions = house_prices_model.predict(val_X)
val_mae = mae(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:.2f}".format(val_mae))

Validation MAE when not specifying max_leaf_nodes: 36950.32
Validation MAE for best value of max_leaf_nodes: 34592.24


In [25]:
from sklearn.ensemble import RandomForestRegressor as rfr
rf_model = rfr(random_state = 1)
rf_model.fit(train_X, train_y)

rf_val_preds = rf_model.predict(val_X)
# scikit-learn's argument order is (y_true, y_pred); MAE is symmetric, so the value is unchanged
rf_val_mae = mae(val_y, rf_val_preds)

print("Validation MAE for Random Forest Model: {:.2f}".format(rf_val_mae))

Validation MAE for Random Forest Model: 29117.72
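
To put the three validation results side by side, the sketch below computes how much lower the random forest's MAE is relative to each model. The MAE values are copied from the outputs above; the names `results` and `reductions` are illustrative.

```python
# Validation MAE values copied from the outputs above.
results = {
    "Decision tree (default)": 36950.32,
    "Decision tree (max_leaf_nodes=50)": 34592.24,
    "Random forest": 29117.72,
}

forest_mae = results["Random forest"]

# Relative reduction in MAE achieved by the forest versus each model, in percent.
reductions = {
    name: 100 * (model_mae - forest_mae) / model_mae
    for name, model_mae in results.items()
}

for name, model_mae in results.items():
    print(f"{name:<36} MAE {model_mae:>10,.2f}  forest is {reductions[name]:4.1f}% lower")
```

On this split the forest's error is roughly 21% lower than the default tree's and roughly 16% lower than the tuned tree's. Since all three figures come from the same 75-row validation set, the gap is an estimate rather than a guarantee for new data.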
