# Housing Price Prediction Using Neural Networks

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>

## Description:

This application uses a neural network to predict housing prices from a set of input features. It covers preprocessing, model training, and evaluation across several experimental setups that vary the hyperparameters and, in one case, add synthetic data, with the aim of improving prediction accuracy.

## Step 1: Install Requirements

This step installs the Python dependencies listed in requirements.txt. The requirements_installed flag prevents repeated installs, and the max_retries and retries variables bound the retry attempts: if installation fails, the function retries up to 3 times before exiting.

In [2]:
import os

requirements_installed = False
max_retries = 3
retries = 0


def install_requirements():
    """Installs the requirements from requirements.txt file"""
    global requirements_installed, retries  # retries is reassigned below, so it must be declared global
    if requirements_installed:
        print("Requirements already installed.")
        return

    print("Installing requirements...")
    install_status = os.system("pip install -r requirements.txt")
    if install_status == 0:
        print("Requirements installed successfully.")
        requirements_installed = True
    else:
        print("Failed to install requirements.")
        if retries < max_retries:
            print("Retrying...")
            retries += 1
            return install_requirements()
        exit(1)
    return

install_requirements()
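
The retry logic above can also be written as a bounded loop instead of recursion with global counters. The following is a minimal sketch, not the notebook's own implementation: `run_with_retries` is a hypothetical helper name, and the commented-out pip command assumes requirements.txt sits next to the notebook.

```python
import subprocess
import sys


def run_with_retries(command, max_retries=3):
    """Run `command` until it exits with status 0 or the retry budget is spent.

    Returns the number of attempts used on success, or None on failure.
    """
    # One initial attempt plus up to `max_retries` retries.
    for attempt in range(1, max_retries + 2):
        status = subprocess.run(command).returncode
        if status == 0:
            return attempt
        print(f"Attempt {attempt} failed with exit status {status}.")
    return None


# Hypothetical usage (assumes requirements.txt is in the working directory):
# run_with_retries([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
```

Calling pip through `sys.executable -m pip` installs into the interpreter that is running the notebook, which is usually what is wanted when several Python environments exist on the machine.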

## Step 2: Set Up Environment Variables

This step loads environment variables from a .env file. It checks whether each variable listed in variables_to_check is set and exits with a message if one is missing. The list is empty here, so no variables are required by default.

In [4]:
from dotenv import load_dotenv
import os


def setup_env():
    """Sets up the environment variables"""
    def check_env(env_var):
        value = os.getenv(env_var)
        if value is None:
            print(f"Please set the {env_var} environment variable.")
            exit(1)
        else:
            print(f"{env_var} is set.")
    load_dotenv()

    variables_to_check = []

    for var in variables_to_check:
        check_env(var)

setup_env()
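
If the list of required variables grows, it can be friendlier to report every missing variable at once rather than exiting on the first one. A minimal sketch under stated assumptions: `missing_env_vars` is a hypothetical helper, the variable names are invented for illustration, and, unlike check_env above, an empty value is treated as missing.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the entries of `names` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Hypothetical variable names, used only for illustration.
required = ["HOUSING_DATA_DIR", "HOUSING_TARGET_COLUMN"]
missing = missing_env_vars(required, environ={"HOUSING_DATA_DIR": "data/housing_prices"})
if missing:
    print("Please set: " + ", ".join(missing))
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real process environment.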

## Step 3: Load, Preprocess, and Train Housing Price Prediction Model

This code defines a housing price prediction model using a neural network in PyTorch. It loads a CSV dataset, splits it into training and test sets, and preprocesses it by scaling numeric features and one-hot encoding categorical ones. The preprocessor is fitted on the training split only and then applied to the test split.

A custom model class builds the neural network architecture dynamically based on the configuration (e.g., number of layers, activation functions).

The model is trained using the Adam optimizer and MSE loss. After training, it can make predictions and evaluate performance metrics such as MSE, RMSE, MAPE, and MAE. Because the target is divided by scale_factor (1,000,000 by default), the loss and the absolute error metrics are expressed in those scaled units.
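
To make the preprocessing concrete, here is a small standard-library sketch of what scaling and one-hot encoding do when they are fitted on training values and then applied to unseen values. It is an illustration only, not the scikit-learn implementation; the helper names and sample values are hypothetical, and mapping an unseen category to all zeros mirrors OneHotEncoder's `handle_unknown='ignore'` option rather than its default.

```python
import statistics


def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]


def fit_one_hot(train_labels):
    """Learn the sorted list of categories seen during training."""
    return sorted(set(train_labels))


def one_hot(label, categories):
    """Encode one label; an unseen label becomes an all-zero vector."""
    return [1.0 if label == category else 0.0 for category in categories]


# Hypothetical feature values, for illustration only.
mean, std = fit_standardizer([1000.0, 2000.0, 3000.0])
test_scaled = standardize([2000.0, 4000.0], mean, std)
categories = fit_one_hot(["urban", "rural", "urban"])
encoded = one_hot("rural", categories)
```

Fitting on the training split alone keeps information about the test rows out of the learned statistics.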

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
from copy import deepcopy

# Adjust the random seed for reproducibility
random_seed = 42

torch.manual_seed(random_seed)

class HousingPriceDataset:
    """Class to load and preprocess the housing dataset."""
    def __init__(self, csv_path, target_column, scale_factor=1_000_000):
        # Load the dataset
        self.data = pd.read_csv(csv_path)

        # Separate features and target
        self.X = self.data.drop(columns=[target_column])
        self.y = self.data[target_column] / scale_factor  # Scale target

        # Identify categorical and numeric columns
        categorical_cols = self.X.select_dtypes(include=['object']).columns
        numeric_cols = self.X.select_dtypes(include=['number']).columns

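        # Preprocess the data: scale numeric columns, one-hot encode categorical ones.
        # sparse_threshold=0 forces a dense array so it can be passed to torch.tensor;
        # handle_unknown='ignore' avoids errors on categories seen only in the test split.
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', StandardScaler(), numeric_cols),
                ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
            ],
            sparse_threshold=0
        )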

        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(self.X, self.y, test_size=0.2, random_state=42)

        # Fit and transform the training data, transform the testing data
        self.X_train = torch.tensor(preprocessor.fit_transform(X_train), dtype=torch.float32)
        self.X_test = torch.tensor(preprocessor.transform(X_test), dtype=torch.float32)
        self.y_train = torch.tensor(y_train.values, dtype=torch.float32).reshape(-1, 1)
        self.y_test = torch.tensor(y_test.values, dtype=torch.float32).reshape(-1, 1)

    def get_input_size(self):
        return self.X_train.shape[1]
    
    def get_train_data(self):
        return self.X_train, self.y_train

class HousingPriceModel(nn.Module):
    """Class to define the neural network model for predicting housing prices."""
    def __init__(self, dataset: HousingPriceDataset, training_config=None):
        # Initialize nn.Module before assigning any attributes
        super().__init__()
        if training_config is None:
            training_config = {
                "epochs": 500,
                "learning_rate": 0.01,
                "dropout_rate": 0.5,
                "l2_lambda": 0.01,
                "layers": [64, 32],
                "activation": "ReLU"
            }
        input_size = dataset.get_input_size()
        self.X_train, self.y_train = dataset.get_train_data()
        self.training_config = training_config

        # Build the model dynamically based on config
        layers = []
        previous_size = input_size
        for size in training_config["layers"]:
            layers.append(nn.Linear(previous_size, size))
            if training_config["activation"] == "ReLU":
                layers.append(nn.ReLU())
            elif training_config["activation"] == "Tanh":
                layers.append(nn.Tanh())
            elif training_config["activation"] == "Sigmoid":
                layers.append(nn.Sigmoid())
            layers.append(nn.Dropout(training_config["dropout_rate"]))
            previous_size = size
        layers.append(nn.Linear(previous_size, 1))  # Output layer
        self.model = nn.Sequential(*layers)
        self.cached_model = None

    def forward(self, x):
        return self.model(x)
    
    def set_training_config(self, config):
        self.training_config = deepcopy(config)

    def get_training_config(self):
        return deepcopy(self.training_config)

    def train_model(self):
        self.train()  # Enable dropout, in case eval() was called earlier
        training_config = self.get_training_config()
        epochs = training_config["epochs"]
        learning_rate = training_config["learning_rate"]
        l2_lambda = training_config["l2_lambda"]
        criterion = nn.MSELoss()
        optimizer = optim.Adam(self.parameters(), lr=learning_rate, weight_decay=l2_lambda)

        for epoch in range(epochs):
            # Forward pass
            outputs = self(self.X_train)
            loss = criterion(outputs, self.y_train)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Print loss every 50 epochs
            if (epoch + 1) % 50 == 0:
                print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}')

        # Cache the trained model
        self.cached_model = deepcopy(self.state_dict())  # Copy so later training does not alter the cache

    def _assert_model_state(self):
        if self.cached_model is None:
            print("Model is not trained. Training the model now...")
            self.train_model()

    def predict_price(self, X):
        self._assert_model_state()
        self.eval() 
        with torch.no_grad():
            prediction = self(X)
        return prediction.item()

    def batch_predict(self, X, y=None):
        self._assert_model_state()
        self.eval()  
        metrics = {
            "loss": None,
            "rms": None,
            "mape": None,
            "mae": None
        }
        with torch.no_grad():
            predictions = self(X)
            if y is not None:
                loss = nn.MSELoss()(predictions, y).item()
                rms = torch.sqrt(torch.tensor(loss)).item()
                mape = torch.mean(torch.abs((predictions - y) / y) * 100).item()

                metrics["loss"] = loss
                metrics["rms"] = rms
                metrics["mape"] = mape
                metrics["mae"] = nn.L1Loss()(predictions, y).item()
        return predictions, metrics
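
For reference, the four metrics returned by batch_predict can be written out in plain Python. This is a minimal sketch of the standard formulas, not a replacement for the PyTorch code above; `regression_metrics` is a hypothetical name and the sample values are made up. MAPE divides by the actual value, so it is undefined when a target is zero.

```python
import math


def regression_metrics(predictions, targets):
    """Compute MSE, RMSE, MAPE (in percent), and MAE for two equal-length lists."""
    errors = [p - t for p, t in zip(predictions, targets)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # Percentage error relative to each actual value; targets must be non-zero.
    mape = sum(abs(e / t) for e, t in zip(errors, targets)) / n * 100
    # Keys match the metrics dictionary used by batch_predict.
    return {"loss": mse, "rms": math.sqrt(mse), "mape": mape, "mae": mae}


# Made-up scaled prices, for illustration only.
example = regression_metrics(predictions=[2.0, 4.0], targets=[1.0, 5.0])
```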

## Step 4: Run Experiment and Evaluate Model Performance

This step defines the run_experiment function, which:


- Loads the housing price dataset.

- Initializes and trains the HousingPriceModel using the provided training configuration.

- Evaluates the model on test data and calculates performance metrics such as MSE, RMSE, MAPE, and MAE.

- Displays the metrics and compares sample predictions with the actual values.

In [7]:
from IPython.display import clear_output


def run_experiment(
    csv_path,
    target_column,
    training_config={
        "epochs": 500,
        "learning_rate": 0.01,
        "dropout_rate": 0.5,
        "l2_lambda": 0.01,
        "layers": [128, 64, 32],
        "activation": "ReLU",
    },
):
    """Train a model on the dataset at csv_path and print test-set metrics."""
    dataset = HousingPriceDataset(csv_path, target_column)
    model = HousingPriceModel(dataset, training_config)
    model.train_model()
    predictions, metrics = model.batch_predict(dataset.X_test, dataset.y_test)
    clear_output()

    print("METRICS:")
    print(f"MSE: {metrics['loss']:.4f}")
    print(f"RMSE: {metrics['rms']:.4f}")
    print(f"MAPE: {metrics['mape']:.4f}%")
    print(f"MAE: {metrics['mae']:.4f}")

    # Print sample predictions (up to 5, in case the test set is smaller)
    print("Sample predictions vs actual values:")
    for i in range(min(5, len(predictions))):
        print(
            f"Predicted: {predictions[i].item():.2f}, Actual: {dataset.y_test[i].item():.2f}"
        )
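
Since HousingPriceDataset divides the target by scale_factor (1,000,000 by default), the printed metrics and sample predictions are in scaled units. The sketch below shows one way to convert them back to original price units. `rescale_metrics` is a hypothetical helper and the metric values are placeholders, not results from any experiment here.

```python
def rescale_metrics(metrics, scale_factor=1_000_000):
    """Convert metrics computed on scaled targets back to original price units."""
    return {
        "mse": metrics["loss"] * scale_factor ** 2,  # squared units scale quadratically
        "rmse": metrics["rms"] * scale_factor,
        "mae": metrics["mae"] * scale_factor,
        "mape": metrics["mape"],  # a percentage, unaffected by scaling
    }


# Placeholder values chosen only to show the conversion.
scaled = {"loss": 0.25, "rms": 0.5, "mae": 0.4, "mape": 12.0}
original_units = rescale_metrics(scaled)
```

MAPE is a ratio of error to actual value, so it reads the same before and after rescaling.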

## Step 5: Basic Experiment with Initial Training Configuration

This experiment uses a basic training configuration with the following settings:

- 1000 epochs

- Learning rate: 0.01

- Dropout rate: 0.5

- L2 regularization: 0.01

- Two hidden layers (128 and 64 neurons)

- ReLU activation function

The housing price dataset is loaded from data/housing_prices/housing_prices.csv, and the target column is "price". The model is trained, and the results, including performance metrics and sample predictions, are displayed.

In [None]:
# Experiment 1: Baseline configuration.

training_config = {
    "epochs": 1000,
    "learning_rate": 0.01,
    "dropout_rate": 0.5,
    "l2_lambda": 0.01,
    "layers": [128, 64],
    "activation": "ReLU",
}

csv_path = "data/housing_prices/housing_prices.csv"
target_column = "price"  
run_experiment(csv_path, target_column, training_config=training_config)
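
Because HousingPriceModel silently adds no activation when the name is not "ReLU", "Tanh", or "Sigmoid", a typo in the configuration can go unnoticed. A small check before training can catch this. The sketch below is one possible approach; `validate_training_config` and the specific range checks are assumptions, not part of the notebook.

```python
REQUIRED_KEYS = {"epochs", "learning_rate", "dropout_rate", "l2_lambda", "layers", "activation"}
SUPPORTED_ACTIVATIONS = {"ReLU", "Tanh", "Sigmoid"}  # the names HousingPriceModel recognizes


def validate_training_config(config):
    """Return a list of problems found in `config`; an empty list means it looks usable."""
    missing = REQUIRED_KEYS - set(config)
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    if config["activation"] not in SUPPORTED_ACTIVATIONS:
        problems.append(f"unsupported activation: {config['activation']!r}")
    if not 0 <= config["dropout_rate"] < 1:
        problems.append("dropout_rate should be in [0, 1)")
    if config["epochs"] <= 0 or config["learning_rate"] <= 0:
        problems.append("epochs and learning_rate should be positive")
    if any(size <= 0 for size in config["layers"]):
        problems.append("layer sizes should be positive")
    return problems


baseline = {
    "epochs": 1000,
    "learning_rate": 0.01,
    "dropout_rate": 0.5,
    "l2_lambda": 0.01,
    "layers": [128, 64],
    "activation": "ReLU",
}
issues = validate_training_config(baseline)
```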

## Step 6: Optimized Experiment with 1000 Epochs and 3 Layers

This experiment uses a refined configuration with the following settings:


- 1000 epochs

- Learning rate: 0.01

- Dropout rate: 0.5

- L2 regularization: 0.01

- Three hidden layers (128, 64, and 32 neurons)

- ReLU activation function

This configuration performed better than the previous experiment and gave the best metrics of the setups tried.


In [None]:
# Experiment 2: 1000 epochs, 3 layers, ReLU activation
# This configuration gave the best results.

training_config = {
    "epochs": 1000,
    "learning_rate": 0.01,
    "dropout_rate": 0.5,
    "l2_lambda": 0.01,
    "layers": [128, 64, 32],
    "activation": "ReLU",
}

csv_path = "data/housing_prices/housing_prices.csv"
target_column = "price"  
run_experiment(csv_path, target_column, training_config=training_config)

## Step 7: Experiment with Increased Epochs and More Layers

In this experiment, the following changes were made:

- 5000 epochs (increased training duration)

- Learning rate: 0.05 (higher learning rate)

- Dropout rate: 0.5 (kept the same)

- L2 regularization: 0.01 (kept the same)

- Five hidden layers with varying sizes: [128, 64, 32, 64, 128]

- ReLU activation function

This setup explores the effects of longer training time, a higher learning rate, and a deeper network. It may help refine model performance, though further adjustments might be needed based on the metrics.

In [None]:
# Experiment 3: 5x the epochs and add more hidden layers

training_config = {
    "epochs": 5000,
    "learning_rate": 0.05,
    "dropout_rate": 0.5,
    "l2_lambda": 0.01,
    "layers": [128, 64, 32, 64, 128],
    "activation": "ReLU",
}

csv_path = "data/housing_prices/housing_prices.csv"
target_column = "price"
run_experiment(csv_path, target_column, training_config=training_config)
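
One way to compare the capacity of the layer configurations used so far is to count their trainable parameters. The sketch below counts weights and biases for a stack of fully connected layers ending in a single output, which matches how HousingPriceModel builds its network. The input size of 10 is a hypothetical value; the real one depends on the dataset after one-hot encoding.

```python
def count_parameters(input_size, hidden_layers):
    """Count weights and biases of Linear layers: input -> hidden layers -> 1 output."""
    total = 0
    previous_size = input_size
    for size in list(hidden_layers) + [1]:  # the final 1 is the output layer
        total += previous_size * size + size  # weight matrix plus bias vector
        previous_size = size
    return total


# Hypothetical input size; the hidden layer lists are the ones from Steps 5 to 7.
input_size = 10
for layers in ([128, 64], [128, 64, 32], [128, 64, 32, 64, 128]):
    print(layers, count_parameters(input_size, layers))
```

Dropout and activation layers add no parameters, so they are left out of the count.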

## Step 8: Experiment with Synthetic Data (Augmented Dataset)

In this experiment, 1000 synthetic samples were added to the dataset, and the following configuration was used:

- 5000 epochs (extended training duration)

- Learning rate: 0.01 (same as in Step 6)

- Dropout rate: 0.5 (kept the same)

- L2 regularization: 0.01 (kept the same)

- Layers: [128, 64, 32]

- ReLU activation function

The goal was to test whether synthetic data would improve model performance. The results suggest that the added data does not significantly improve the model's predictions compared to the earlier experiments on the original dataset.

In [None]:
# Experiment 4: 5000 epochs, 3 layers, ReLU activation, synthetic data with 1000 samples
# Note: Adding synthetic data doesn't seem to help much.

training_config = {
    "epochs": 5000,
    "learning_rate": 0.01,
    "dropout_rate": 0.5,
    "l2_lambda": 0.01,
    "layers": [128, 64, 32],
    "activation": "ReLU",
}

csv_path = "data/housing_prices/augmented_housing_prices_large.csv"
target_column = "price"  
run_experiment(csv_path, target_column, training_config=training_config)
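
Note that run_experiment splits whatever CSV it is given, so with the augmented file the test set can contain synthetic rows and is not the same test set as in the earlier experiments. One way to make the comparison cleaner is to hold out only real rows for testing. The sketch below illustrates the idea with plain lists. It assumes a per-row flag marking synthetic rows, which the augmented CSV may not provide, and `split_with_real_test` is a hypothetical helper.

```python
import random


def split_with_real_test(rows, is_synthetic, test_fraction=0.2, seed=42):
    """Split rows so that the test set contains only real (non-synthetic) rows."""
    real_indices = [i for i, flag in enumerate(is_synthetic) if not flag]
    rng = random.Random(seed)
    rng.shuffle(real_indices)
    # The test size is a fraction of the real rows, with at least one row held out.
    n_test = max(1, int(len(real_indices) * test_fraction))
    test_indices = set(real_indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [rows[i] for i in sorted(test_indices)]
    return train, test


# Toy data: rows 0-4 stand for real samples, rows 5-9 for synthetic ones.
rows = list(range(10))
flags = [False] * 5 + [True] * 5
train_rows, test_rows = split_with_real_test(rows, flags)
```

All synthetic rows end up in the training set, so any change in test metrics reflects performance on real data only.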

## Conclusion: 

The app for predicting housing prices using a neural network offers an effective solution for real-estate analysis by providing insights into price predictions based on historical data. The experiments conducted have shown the following:

### Model Performance:

- The model performs optimally with 1000 epochs and 3 hidden layers (128, 64, 32) using a ReLU activation function. This configuration produced the best predictive results with low error rates, making it the preferred setup.

### Training Configurations:

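- Increasing the epochs to 5000 or adding more layers did not significantly enhance performance. This suggests that the model reaches its useful capacity at a moderate configuration, and that further complexity does not always translate into better predictions. Note that the deeper network in Step 7 was also trained with a higher learning rate (0.05), so the effect of depth alone was not isolated.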

### Synthetic Data Usage:

- Adding synthetic data with 1000 additional samples did not noticeably improve performance. The original dataset provided sufficient variability and patterns for accurate predictions, suggesting that more data is not always the solution unless it is of higher quality or more representative of the real world.

### App Takeaways:

- The app efficiently predicts housing prices, utilizing a trained neural network to generate accurate results based on given features.

- The best configuration for training the model involves 3 layers and 1000 epochs, providing a balance of training time and accuracy.

- The app can be extended further with additional data preprocessing, feature engineering, or deployment capabilities to handle larger datasets and improve prediction outcomes over time.


---

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)


