<a href="https://www.kaggle.com/code/danuherath/house-prices-regression-advanced?scriptVersionId=201526848" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<h1 align="center"> Iowa House Prices Prediction (Regression) </h1>

<img 
    src="https://storage.googleapis.com/kaggle-media/competitions/kaggle/5407/media/housesbanner.png"
    alt="" 
    height="300"
    width="500" 
    style="display: block; margin: 0 auto; border-radius:15px" 
/>

---

## Problem Definition

- Dataset

    - The [House Prices - Advanced Regression Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data) dataset from Kaggle, which contains 79 features describing "(almost) every aspect of residential homes in Ames, Iowa". The training set contains 1,460 samples, each representing one house.

<br>

- Objective

    - The goal of this project is to predict the sale price of each house from the features above.

<br>

- Algorithms

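    - The following regression algorithms are used to train the models. The competition scores submissions by Root-Mean-Squared-Error (RMSE) between the logarithms of the predicted and observed sale prices; the validation metric tracked in this notebook is mean absolute error (MAE).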
    
        - [TensorFlow Decision Forests (TFDF) - RandomForestModel](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel)
        - [TensorFlow Decision Forests (TFDF) - GradientBoostedTreesModel](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel)
    
<br>
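To make the RMSE metric concrete, here is a minimal sketch in plain Python. The function names `rmse` and `rmse_log` and the sample prices are hypothetical, chosen only for illustration; the log variant reflects the assumption that the competition compares logarithms of prices.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def rmse_log(y_true, y_pred):
    """RMSE on natural-log prices (assumed competition variant)."""
    return rmse([math.log(t) for t in y_true], [math.log(p) for p in y_pred])

# Hypothetical prices, for illustration only.
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]

print(f"RMSE:     {rmse(actual, predicted):.2f}")
print(f"log-RMSE: {rmse_log(actual, predicted):.4f}")
```

Working on logarithms means a 10% miss on a cheap house costs as much as a 10% miss on an expensive one, whereas plain RMSE is dominated by the most expensive houses.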

---


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

import tensorflow as tf
import tensorflow_decision_forests as tfdf

# import warnings
# warnings.filterwarnings('ignore')

print("TensorFlow v" + tf.__version__)
print("TensorFlow Decision Forests v" + tfdf.__version__)


2024-10-16 15:46:40.563205: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-16 15:46:40.563339: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-16 15:46:40.719627: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


TensorFlow v2.15.0
TensorFlow Decision Forests v1.8.1


In [2]:
train_data = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')


In [3]:
train_data.drop("Id", axis=1, inplace=True)
test_data.drop("Id", axis=1, inplace=True)


---

### TFDF models need no preprocessing, but the data must be converted to a [TF dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).

---

In [4]:
# Hold out 20% of the training data for validation.
train_df, val_df = train_test_split(train_data, test_size=0.2, random_state=42)

# Convert the pandas DataFrames to TF datasets, naming the regression target.
label = 'SalePrice'
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label=label, task=tfdf.keras.Task.REGRESSION)
val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(val_df, label=label, task=tfdf.keras.Task.REGRESSION)
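As a quick check on the split above, the arithmetic below shows how many houses end up in each part. It is a sketch that assumes scikit-learn rounds the held-out count up when `test_size` is a fraction; the variable names are illustrative.

```python
import math

n_samples = 1460   # rows in train.csv, as stated in the problem definition
test_size = 0.2    # fraction passed to train_test_split above

# Assumption: the validation count is ceil(test_size * n_samples),
# and the training set receives the remainder.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```

The resulting training count matches the "Found 1168 examples" line that TFDF logs later when reading the training dataset.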


---

# Train Models

---

In [5]:
rf_model = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION, verbose=0)

# Track mean absolute error (MAE) as the evaluation metric.
rf_model.compile(metrics=["mae"])

rf_model.fit(train_ds, verbose=0)

# evaluate() returns [loss, mae]; TFDF does not train through a Keras loss,
# so the loss entry is a 0.0 placeholder and the MAE is the value to read.
rf_score = rf_model.evaluate(val_ds, verbose=0)
print("Validation [loss, MAE]:", rf_score)


[INFO 24-10-16 15:47:00.5111 UTC kernel.cc:1233] Loading model from path /tmp/tmppa9aq00f/model/ with prefix 54064f001e7f473e
[INFO 24-10-16 15:47:00.8979 UTC decision_forest.cc:660] Model loaded with 300 root(s), 111040 node(s), and 74 input feature(s).
[INFO 24-10-16 15:47:00.8980 UTC abstract_model.cc:1344] Engine "RandomForestOptPred" built
[INFO 24-10-16 15:47:00.8980 UTC kernel.cc:1061] Use fast generic engine


Validation [loss, MAE]: [0.0, 16184.6630859375]


In [6]:
gbt_model = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.REGRESSION, verbose=0)

# Track mean absolute error (MAE) as the evaluation metric.
gbt_model.compile(metrics=["mae"])

gbt_model.fit(train_ds, verbose=0)

# evaluate() returns [loss, mae]; the loss entry is a 0.0 placeholder.
gbt_score = gbt_model.evaluate(val_ds, verbose=0)
print("Validation [loss, MAE]:", gbt_score)


[INFO 24-10-16 15:47:19.1575 UTC kernel.cc:1233] Loading model from path /tmp/tmphoaj1w99/model/ with prefix c771bd2d69bc4a8e
[INFO 24-10-16 15:47:19.1873 UTC quick_scorer_extended.cc:903] The binary was compiled without AVX2 support, but your CPU supports it. Enable it for faster model inference.
[INFO 24-10-16 15:47:19.1895 UTC abstract_model.cc:1344] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO 24-10-16 15:47:19.1895 UTC kernel.cc:1061] Use fast generic engine


Validation [loss, MAE]: [0.0, 15609.1767578125]
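The second value in each evaluation output above is the validation MAE in dollars. The short calculation below compares the two untuned models; the variable names are illustrative, and the numbers are copied from the outputs above.

```python
rf_mae = 16184.6630859375    # RandomForestModel validation MAE
gbt_mae = 15609.1767578125   # GradientBoostedTreesModel validation MAE

absolute_gap = rf_mae - gbt_mae
relative_gap = absolute_gap / rf_mae

print(f"GBT MAE is lower by {absolute_gap:.2f} ({relative_gap:.2%} of the RF MAE)")
```

The gap is only a few percent and comes from a single 80/20 split, so it may not be a reliable ranking of the two algorithms on its own.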


---

# Tune Hyperparameters

---
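The tuning cells in this section use random search. The idea can be sketched in plain Python as follows; this is not TFDF's implementation, and the search space, the `toy_objective` function, and the `random_search` helper are all hypothetical stand-ins for "train a model with these hyperparameters and return its validation error".

```python
import random

def random_search(objective, search_space, num_trials, seed=0):
    """Sample num_trials random configurations and keep the lowest-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(num_trials):
        # Draw one value per hyperparameter, independently and uniformly.
        config = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Hypothetical search space (not TFDF's predefined one).
search_space = {
    "max_depth": [3, 4, 6, 8],
    "shrinkage": [0.02, 0.05, 0.1],
}

def toy_objective(config):
    # Stand-in for a validation error; lower is better.
    return abs(config["max_depth"] - 6) + abs(config["shrinkage"] - 0.05)

best_config, best_score = random_search(toy_objective, search_space, num_trials=50)
print(best_config, best_score)
```

Each trial corresponds to one full training run, which is why the number of trials drives the total tuning time.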

In [7]:
# tuner_rf = tfdf.tuner.RandomSearch(num_trials=5, use_predefined_hps=True)

# rf_model_tuned = tfdf.keras.RandomForestModel(tuner=tuner_rf, task=tfdf.keras.Task.REGRESSION)
# rf_model_tuned.fit(train_ds)

# rf_tuned_score = rf_model_tuned.evaluate(val_ds)
# print("Validation accuracy:", rf_tuned_score)


In [8]:
tuner_gbt = tfdf.tuner.RandomSearch(num_trials=50, use_predefined_hps=True)

gbt_model_tuned = tfdf.keras.GradientBoostedTreesModel(tuner=tuner_gbt, task=tfdf.keras.Task.REGRESSION)
gbt_model_tuned.fit(train_ds)

# No compile(metrics=...) call was made for this model, so evaluate() returns
# only the placeholder loss (0.0), not an error metric such as MAE.
gbt_tuned_score = gbt_model_tuned.evaluate(val_ds)
print("Validation loss:", gbt_tuned_score)


Use /tmp/tmpruy80_qf as temporary training directory
Reading training dataset...


Training dataset read in 0:00:00.973082. Found 1168 examples.
Training model...
Model trained in 0:29:55.700446
Compiling model...


[INFO 24-10-16 16:17:17.2368 UTC kernel.cc:1233] Loading model from path /tmp/tmpruy80_qf/model/ with prefix 47017115d6694f50
[INFO 24-10-16 16:17:17.2787 UTC decision_forest.cc:660] Model loaded with 256 root(s), 8804 node(s), and 67 input feature(s).
[INFO 24-10-16 16:17:17.2787 UTC abstract_model.cc:1344] Engine "GradientBoostedTreesGeneric" built
[INFO 24-10-16 16:17:17.2788 UTC kernel.cc:1061] Use fast generic engine


Model compiled.
Validation loss: 0.0
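The 0.0 printed above is not an error of zero: the tuned model was evaluated without any metric attached, so only the placeholder loss comes back. One way to get a comparable number is to compute MAE directly from predictions. The sketch below is written for plain Python lists and uses made-up values; `mean_absolute_error` is a hypothetical helper, and it assumes predictions arrive as one-element rows, as Keras-style `predict()` output would after conversion with `.tolist()`.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE; accepts predictions as flat values or as one-element rows."""
    flat_pred = [p[0] if isinstance(p, (list, tuple)) else p for p in y_pred]
    if len(y_true) != len(flat_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, flat_pred)) / len(y_true)

# Hypothetical labels and (n, 1)-shaped predictions, for illustration only.
labels = [200000.0, 150000.0, 320000.0]
predictions = [[210000.0], [140000.0], [300000.0]]

print(mean_absolute_error(labels, predictions))
```

Applied to the validation labels and the tuned model's validation predictions, this would give a value directly comparable to the MAE of the two untuned models.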


---

# Predict on Test Data

---

In [9]:
# The test set has no SalePrice column, so no label is passed.
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_data, task=tfdf.keras.Task.REGRESSION)

# test_predictions = rf_model_tuned.predict(test_ds)
test_predictions = gbt_model_tuned.predict(test_ds)




In [10]:
submission = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv")
submission['SalePrice'] = test_predictions.ravel()  # flatten (n, 1) predictions to a 1-D column
submission.to_csv('submission.csv', index=False)

submission.head()


|   | Id | SalePrice |
|---|----|-----------|
| 0 | 1461 | 130578.296875 |
| 1 | 1462 | 150648.921875 |
| 2 | 1463 | 170689.21875 |
| 3 | 1464 | 189688.078125 |
| 4 | 1465 | 181756.171875 |
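Before uploading, it can help to sanity-check the submission file. The sketch below writes the five rows shown above to an in-memory CSV and reads them back; the specific checks (header names, unique ids, positive prices) are assumptions about what a valid submission needs, not rules taken from the competition page.

```python
import csv
import io

# The five rows shown in the output above.
rows = [
    (1461, 130578.296875),
    (1462, 150648.921875),
    (1463, 170689.21875),
    (1464, 189688.078125),
    (1465, 181756.171875),
]

# Write an Id,SalePrice CSV to an in-memory buffer instead of disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Id", "SalePrice"])
writer.writerows(rows)

# Read it back and run basic sanity checks.
buffer.seek(0)
parsed = list(csv.DictReader(buffer))
ids = [int(row["Id"]) for row in parsed]
prices = [float(row["SalePrice"]) for row in parsed]

has_expected_header = list(parsed[0].keys()) == ["Id", "SalePrice"]
ids_are_unique = len(set(ids)) == len(ids)
prices_are_positive = all(price > 0 for price in prices)

print(has_expected_header, ids_are_unique, prices_are_positive)
```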
