# House Prices Prediction using TensorFlow Decision Forests

## Import the Library

In [None]:
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
print("TensorFlow v" + tf.__version__)
print("TensorFlow Decision Forests v" + tfdf.__version__)

## Loading Dataset

In [None]:
train_file_path = "../input/house-prices-advanced-regression-techniques/train.csv"
dataset_df = pd.read_csv(train_file_path)
print("Full train dataset shape is {}".format(dataset_df.shape))

The data has 81 columns and 1460 rows. We can inspect it by printing the first 5 rows.

In [None]:
dataset_df.head(5)

There are 79 feature columns. Using these features, the model has to predict the house sale price, given by the label column `SalePrice`.

We will drop the `Id` column, as it is not needed for model training.

In [None]:
dataset_df = dataset_df.drop('Id', axis=1)


In [None]:
dataset_df.head(3)

Next, let us inspect the column types and non-null counts of the features.

In [None]:
dataset_df.info()

In [None]:
print(dataset_df['SalePrice'].describe())
plt.figure(figsize=(9, 8))
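# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable figure.
sns.histplot(dataset_df['SalePrice'], color='g', bins=100, kde=True, alpha=0.4);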

## Numerical data distribution

We will look at how the numerical features are distributed. To do this, let us first list the data types present in our dataset and then select only the numerical columns.

In [None]:
list(set(dataset_df.dtypes.tolist()))

Now let us select the numerical columns and plot the distribution of each one.

In [None]:
df_num = dataset_df.select_dtypes(include = ['float64', 'int64'])
df_num.head()

In [None]:
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);

## Prepare the Dataset

This dataset mixes numeric and categorical features, and several columns contain missing values. Fortunately, TensorFlow Decision Forests (TF-DF) can handle all of these directly, without manual preprocessing. This built-in flexibility makes tree-based models a convenient starting point for machine learning with TensorFlow.
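To see this mix of feature types in practice, a quick pandas summary can help. The sketch below is illustrative only: `toy_df` and its values are made up, and `summarize_features` is a hypothetical helper, not part of TF-DF. On the real data you would pass `dataset_df` instead.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dataset_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [None, "Grvl", None, None],
    "SalePrice": [208500, 181500, 223500, 140000],
})

def summarize_features(df):
    """Return one row per column with its dtype and number of missing values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("n_missing", ascending=False)

feature_summary = summarize_features(toy_df)
print(feature_summary)
```

A table like this makes it easy to spot which columns are numeric, which are categorical, and where values are missing, before handing the frame to the model.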

Next, let's split the dataset into training and validation sets.

In [None]:
import numpy as np

def split_dataset(dataset, test_ratio=0.30):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, valid_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in validation.".format(
    len(train_ds_pd), len(valid_ds_pd)))
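Because `split_dataset` draws from NumPy's global random state, each run produces a different split. If you want the split to be repeatable, one option is a seeded generator. The following is a minimal sketch under that assumption: `split_dataset_seeded` and its `seed` parameter are hypothetical names introduced here, and `toy_df` stands in for `dataset_df`.

```python
import numpy as np
import pandas as pd

def split_dataset_seeded(dataset, test_ratio=0.30, seed=42):
    """Randomly split a DataFrame; the same seed always yields the same split."""
    rng = np.random.default_rng(seed)
    # Each row lands in the held-out part with probability test_ratio,
    # so the held-out share is only approximately test_ratio.
    held_out = rng.random(len(dataset)) < test_ratio
    return dataset[~held_out], dataset[held_out]

# Toy frame standing in for dataset_df.
toy_df = pd.DataFrame({"x": range(1000)})
train_part, valid_part = split_dataset_seeded(toy_df)
train_again, valid_again = split_dataset_seeded(toy_df)
print(len(train_part), len(valid_part))
```

Note that the split is made row by row, so the validation share is close to, but not exactly, `test_ratio`.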

Before training the model, there's one more important step: converting the dataset from a Pandas DataFrame (pd.DataFrame) to a TensorFlow Dataset (tf.data.Dataset).

TensorFlow Datasets provide efficient data pipelines, which are especially useful when training models on hardware accelerators like GPUs or TPUs.

Also, since the default setting for the Random Forest model is classification, and our task is regression, we need to explicitly specify the task type using tfdf.keras.Task.REGRESSION.

In [None]:
label = 'SalePrice'
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task = tfdf.keras.Task.REGRESSION)
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label, task = tfdf.keras.Task.REGRESSION)

## Select a Model
TensorFlow Decision Forests offers multiple tree-based models to choose from:

- RandomForestModel

- GradientBoostedTreesModel

- CartModel

- DistributedGradientBoostedTreesModel

To begin with, we'll use the Random Forest model — one of the most popular and widely used decision forest algorithms.

A Random Forest is an ensemble of decision trees, where each tree is trained independently on a randomly sampled subset of the training data (with replacement). This approach makes the model robust to overfitting and easy to use, even with minimal hyperparameter tuning.

You can view all the available models in TensorFlow Decision Forests using the following command:

In [None]:
tfdf.keras.get_all_models()

## How to Configure the Models

TensorFlow Decision Forests comes with good default settings: the top-ranking hyperparameters on its internal benchmarks, slightly modified so that training runs in a reasonable time.

However, if you want to fine-tune the model for better accuracy, you have the flexibility to customize various hyperparameters.

You can start by selecting a predefined hyperparameter template and specifying key parameters like this:

rf = tfdf.keras.RandomForestModel(hyperparameter_template="benchmark_rank1", task=tfdf.keras.Task.REGRESSION)

## Create a Random Forest
For today’s implementation, we’ll use the default settings to create a Random Forest model, while specifying the task type as tfdf.keras.Task.REGRESSION to indicate that this is a regression problem.

In [None]:
rf = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)
rf.compile(metrics=["mse"]) # Optional, you can use this to include a list of eval metrics

## Train the Model
Training the model can be done with a simple one-liner.

In [None]:
rf.fit(x=train_ds)


## Visualize the Model
One advantage of tree-based models is their interpretability: you can easily inspect individual decision trees. By default, the Random Forest model contains 300 trees. Below, we extract one of them (the first tree) with the model inspector and print its structure:

In [None]:
tree = rf.make_inspector().extract_tree(tree_idx=0)
print(tree)



## Evaluate the Model Using OOB Data and Validation Dataset

* Before training, we manually set aside roughly **30% of the dataset as a validation set** (`test_ratio=0.30`), which was converted to `valid_ds`.
* In addition to this, we can evaluate the **Random Forest model** using the **Out-of-Bag (OOB) score**.

---

### What is OOB Data?

* During training, the Random Forest algorithm samples random subsets of the training data **with replacement**.
* The **samples not selected** for a particular tree are referred to as that tree's **Out-of-Bag (OOB)** data.
* These OOB samples act like an internal validation set used to estimate model performance.
* The model computes an **OOB score** using this data, helping to assess its generalization without needing a separate validation set.

 [Learn more about OOB data here.](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr)
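To get a feel for how much data ends up out-of-bag, the sampling step can be simulated with NumPy. This is a sketch of the classic bootstrap setup, assuming each tree draws as many rows as the training set contains; it does not reproduce TF-DF's internal sampling, and `oob_fraction` is a hypothetical helper.

```python
import numpy as np

def oob_fraction(n_examples, n_trees, seed=0):
    """Average share of examples left out of a bootstrap sample of size n_examples."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n_examples row indices with replacement.
        sampled = rng.integers(0, n_examples, size=n_examples)
        # Rows that were never drawn are out-of-bag for this tree.
        n_oob = n_examples - len(np.unique(sampled))
        fractions.append(n_oob / n_examples)
    return float(np.mean(fractions))

avg_oob = oob_fraction(n_examples=1000, n_trees=50)
print(f"Average out-of-bag share: {avg_oob:.3f}")
```

Under this assumption, each tree leaves out a little over a third of the rows (the share approaches 1/e, about 0.368, as the dataset grows), which is why OOB data can serve as a built-in validation set.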

---

### Interpreting the RMSE Plot

* The training log shows how the **Root Mean Squared Error (RMSE)** on OOB data evolves as more trees are added.
* We can plot this to visualize model performance.
* **Note:** For RMSE, **lower values indicate better performance**.
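For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal sketch of the calculation follows; the prices are made up for illustration and `rmse` is a hypothetical helper, not a TF-DF function.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up sale prices and predictions, for illustration only.
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 300000]
error = rmse(actual, predicted)
print(f"RMSE: {error:.2f}")
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is expressed in the same unit as the label (here, the sale price).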




In [None]:
import matplotlib.pyplot as plt
logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.rmse for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("RMSE (out-of-bag)")
plt.show()

We can also see some general stats on the OOB dataset:

In [None]:
inspector = rf.make_inspector()
inspector.evaluation()

Now, let us run an evaluation using the validation dataset.

In [None]:
evaluation = rf.evaluate(x=valid_ds,return_dict=True)

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")


## Variable Importances

Variable importance helps us understand how much each feature contributes to the model's predictions or performance.

TensorFlow Decision Forests provides several methods to evaluate feature importance in Decision Tree models.

Let’s explore the different types of variable importance metrics available.


In [None]:
print("Available variable importances:")
for importance in inspector.variable_importances().keys():
  print("\t", importance)


### Example: NUM\_AS\_ROOT Variable Importance

Let’s look at feature importance using the `NUM_AS_ROOT` metric.

* A **higher score** means the feature is used as the **root node of more trees** in the forest.
* Features at the **top of the list** are chosen as the first split most often, which suggests (but does not prove) a **strong influence** on model predictions.
* The output is **sorted by decreasing score**, with the most frequently used root features listed first.
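The idea behind `NUM_AS_ROOT` can be sketched in plain Python: count how often each feature sits at the root of a tree and sort by that count. The forest below is hypothetical (six made-up root features), and `num_as_root` is an illustrative helper rather than the TF-DF implementation.

```python
from collections import Counter

# Hypothetical forest: each entry is the feature tested at one tree's root node.
root_features = [
    "OverallQual", "GrLivArea", "OverallQual", "GarageCars",
    "OverallQual", "GrLivArea",
]

def num_as_root(roots):
    """Return (feature, count) pairs sorted by decreasing root-node count."""
    return Counter(roots).most_common()

ranking = num_as_root(root_features)
for feature, count in ranking:
    print(f"{feature}: {count}")
```

Features that never appear at a root get no entry at all, so this metric usually lists only a subset of the columns.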


In [None]:
inspector.variable_importances()["NUM_AS_ROOT"]

Plot the variable importances from the inspector using Matplotlib:

In [None]:
plt.figure(figsize=(12, 4))

# Variable importance metric: number of times features are used as root nodes.
variable_importance_metric = "NUM_AS_ROOT"
variable_importances = inspector.variable_importances()[variable_importance_metric]

# Retrieve feature names and their corresponding importance scores.
# `variable_importances` contains tuples of (feature, importance).
feature_names = [vi[0].name for vi in variable_importances]
feature_importances = [vi[1] for vi in variable_importances]

# Features are sorted by decreasing importance.
feature_ranks = range(len(feature_names))

bar = plt.barh(feature_ranks, feature_importances, label=[str(x) for x in feature_ranks])
plt.yticks(feature_ranks, feature_names)
plt.gca().invert_yaxis()

# TODO: Update to use "plt.bar_label()" when it becomes available.
# Annotate each bar with its importance value.
for importance, patch in zip(feature_importances, bar.patches):
    plt.text(patch.get_x() + patch.get_width(), patch.get_y(), f"{importance:.4f}", va="top")

plt.xlabel(variable_importance_metric)
plt.title("NUM_AS_ROOT variable importance")
plt.tight_layout()
plt.show()



## Submission

Finally, use the trained model to make predictions on the competition’s test dataset.


In [None]:
test_file_path = "../input/house-prices-advanced-regression-techniques/test.csv"
test_data = pd.read_csv(test_file_path)
ids = test_data.pop('Id')

test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    test_data,
    task = tfdf.keras.Task.REGRESSION)

preds = rf.predict(test_ds)
output = pd.DataFrame({'Id': ids,
                       'SalePrice': preds.squeeze()})

output.head()

In [None]:
sample_submission_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
# Reuse the predictions computed above, flattened to one value per row.
sample_submission_df['SalePrice'] = preds.squeeze()
sample_submission_df.to_csv('/kaggle/working/submission.csv', index=False)
sample_submission_df.head()
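Before uploading, it can be worth running a few sanity checks on the submission frame. The sketch below is one possible set of checks, not an official validator: `check_submission` is a hypothetical helper, the ids and prices are made up, and the rule that prices must be positive is an assumption added here.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems
    if list(submission["Id"]) != list(expected_ids):
        problems.append("Id column does not match the test set")
    if submission["SalePrice"].isna().any():
        problems.append("SalePrice contains missing values")
    # Assumption: a valid house price prediction should be positive.
    if (submission["SalePrice"] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems

# Made-up ids and predictions, for illustration only.
toy_ids = [1461, 1462, 1463]
good = pd.DataFrame({"Id": toy_ids, "SalePrice": [120000.0, 155000.5, 180250.0]})
bad = pd.DataFrame({"Id": toy_ids, "SalePrice": [120000.0, None, -5.0]})
print(check_submission(good, toy_ids))
print(check_submission(bad, toy_ids))
```

With the notebook's variables, a call such as `check_submission(output, ids)` would apply the same checks to the real predictions.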