# House Prices Prediction and Particle Swarm Optimization

This notebook predicts house prices with three regressors (Random Forest, SVR, and K-Nearest Neighbors) and uses Particle Swarm Optimization (PSO) to choose the weights that combine their predictions into an ensemble.


## Step 1: Load and Preprocess the Data
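We load the training and test datasets, one-hot encode the categorical variables, fill missing values with the training-set medians, and standardize the features. Standardization matters here because SVR and KNN both depend on distances between feature vectors.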


In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from pyswarm import pso

# Load training data
data_train = pd.read_csv("data/train.csv")

# Separate features and target for training
X_train = data_train.drop(["SalePrice", "Id"], axis=1)
y_train = data_train["SalePrice"]

# Handle categorical features in training data
X_train = pd.get_dummies(X_train, drop_first=True)

# Fill missing values in training data
X_train = X_train.fillna(X_train.median())

# Load test data
data_test = pd.read_csv("data/test.csv")

# Keep track of IDs in test data for output
test_ids = data_test["Id"]

# Drop ID column in test data
X_test = data_test.drop(["Id"], axis=1)

# Handle categorical features in test data
X_test = pd.get_dummies(X_test, drop_first=True)

# Align test data with training data (add missing columns, drop columns unseen in training)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Fill missing values in test data
X_test = X_test.fillna(X_train.median())

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
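
The alignment step above matters because `pd.get_dummies` only creates columns for categories present in the frame it is given. The sketch below uses two toy frames (the `Area` and `Roof` columns are invented for illustration, not taken from the dataset) to show the mismatch and how `reindex` resolves it.

```python
import pandas as pd

# Toy frames: the hypothetical "Roof" column has a category ("Tile")
# that appears only in the training data.
train = pd.DataFrame({"Area": [50, 80, 65], "Roof": ["Flat", "Tile", "Gable"]})
test = pd.DataFrame({"Area": [70, 55], "Roof": ["Gable", "Flat"]})

train_enc = pd.get_dummies(train, drop_first=True)
test_enc = pd.get_dummies(test, drop_first=True)

print(list(train_enc.columns))  # includes a Roof_Tile column
print(list(test_enc.columns))   # has no Roof_Tile column

# Reindex so the test frame has exactly the training columns, in the same order.
# Columns missing from the test frame are created and filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

One caveat, stated as a general property of `drop_first=True` rather than something observed in this dataset: the dropped level is the first one present in each frame, so encoding train and test separately can drop different levels if the test set lacks the training set's first category.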

## Step 2: Train Multiple Regressors
We train three regression models: Random Forest, Support Vector Regression (SVR), and K-Nearest Neighbors (KNN).

In [5]:
models = {
    "RandomForest": RandomForestRegressor(random_state=42, n_estimators=100),
    "SVR": SVR(kernel='rbf', C=10, gamma=0.1),
    "KNeighbors": KNeighborsRegressor(n_neighbors=7)
}

# Train and store predictions
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

# Replace NaN values in predictions if necessary
for name in predictions:
    predictions[name] = np.nan_to_num(predictions[name])

# Generate training predictions for PSO
train_predictions = {}
for name, model in models.items():
    train_predictions[name] = model.predict(X_train)

# Display individual model performance on training data
# (reuses the training predictions computed above instead of predicting again)
for name in models:
    mse = mean_squared_error(y_train, train_predictions[name])
    print(f"Training MSE for {name}: {mse}")

Training MSE for RandomForest: 120823814.9010809
Training MSE for SVR: 6626763595.51494
Training MSE for KNeighbors: 1279290126.952502
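
These are in-sample errors: each model is scored on the rows it was fitted on, which tends to flatter flexible models such as Random Forest. One common alternative, shown here as a sketch rather than as part of the original notebook, is to collect out-of-fold predictions with scikit-learn's `cross_val_predict`. The data below is synthetic (`make_regression`), and the names `X_demo`, `y_demo`, and `oof_pred` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the scaled training matrix and target (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# In-sample: the model predicts the same rows it was fitted on.
in_sample_pred = model.fit(X_demo, y_demo).predict(X_demo)

# Out-of-fold: each row is predicted by a clone fitted on the other folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(model, X_demo, y_demo, cv=cv)

in_sample_mse = mean_squared_error(y_demo, in_sample_pred)
oof_mse = mean_squared_error(y_demo, oof_pred)
print(f"In-sample MSE:   {in_sample_mse:.1f}")
print(f"Out-of-fold MSE: {oof_mse:.1f}")
```

In this notebook, the same call with `X_train` and `y_train` could fill `train_predictions[name]` in place of `model.predict(X_train)`, so that the ensemble weights are fitted on predictions for rows each model did not see. Whether that changes the resulting weights was not tested here.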


## Step 3: Save Unoptimized Predictions
Before optimizing the ensemble weights, compute and save the unweighted mean of the three models' test-set predictions as a baseline for comparison.

In [6]:
# Compute mean ensemble prediction without optimization
unoptimized_ensemble = sum(predictions[name] for name in models) / len(models)

# Save unoptimized predictions to submission2.csv
submission2 = pd.DataFrame({"Id": test_ids, "SalePrice": unoptimized_ensemble})
submission2.to_csv("data/submission2.csv", index=False)
print("Unoptimized predictions saved as 'data/submission2.csv'")

Unoptimized predictions saved as 'data/submission2.csv'


## Step 4: Define the PSO Objective Function
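The objective function returns the Mean Squared Error (MSE) of the weighted ensemble on the training data; PSO then searches for the weights that minimize it. The weights are normalized to sum to 1 before they are applied.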

In [7]:
def objective_function(weights):
    weights = np.array(weights)
    if weights.sum() == 0:
        return 1e10
    
    weights = weights / weights.sum()  # Normalize weights
    
    # Compute ensemble predictions on training data
    ensemble_prediction = sum(weights[i] * train_predictions[name] for i, name in enumerate(models))
    
    # Replace NaN values in the predictions
    ensemble_prediction = np.nan_to_num(ensemble_prediction)
    
    # Ensure that the prediction length matches y_train
    if len(ensemble_prediction) != len(y_train):
        raise ValueError(f"Prediction length mismatch: {len(ensemble_prediction)} != {len(y_train)}")
    
    return mean_squared_error(y_train, ensemble_prediction)
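
As a worked example of the arithmetic, the sketch below applies the same normalize-then-blend computation to two made-up prediction vectors (`A` and `B` are hypothetical models, and `weighted_mse` is an illustrative helper, not a function from the notebook). It shows that a blend can score better than either model alone and that rescaling the weights leaves the objective unchanged.

```python
import numpy as np

# Made-up targets and predictions: A overshoots where B undershoots.
y_true = np.array([100.0, 200.0, 300.0])
toy_predictions = {
    "A": np.array([110.0, 190.0, 310.0]),
    "B": np.array([80.0, 220.0, 280.0]),
}

def weighted_mse(weights, preds, y):
    """MSE of the blend after normalizing the weights to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total == 0:
        return 1e10  # same penalty value as the notebook's objective
    weights = weights / total
    blend = sum(w * p for w, p in zip(weights, preds.values()))
    return float(np.mean((y - blend) ** 2))

mse_a_only = weighted_mse([1, 0], toy_predictions, y_true)
mse_b_only = weighted_mse([0, 1], toy_predictions, y_true)
mse_blend = weighted_mse([2, 1], toy_predictions, y_true)
mse_blend_rescaled = weighted_mse([4, 2], toy_predictions, y_true)

print(mse_a_only, mse_b_only, mse_blend, mse_blend_rescaled)
```

Because only the ratio between the weights matters, many different raw weight vectors give the same objective value, which is why the weights returned by the optimizer need to be normalized again before use.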

## Step 5: Optimize Weights with PSO
Use PSO to search for the weights that minimize the objective function, with each weight bounded to the interval [0, 1].

In [8]:
# Define bounds for weights
lb = [0] * len(models)
ub = [1] * len(models)

# Run PSO
optimal_weights, _ = pso(objective_function, lb, ub, swarmsize=100, maxiter=200)

# Normalize optimal weights
optimal_weights = optimal_weights / sum(optimal_weights)

# Display results
for i, name in enumerate(models):
    print(f"Weight for {name}: {optimal_weights[i]}")

Stopping search: maximum iterations reached --> 200
Weight for RandomForest: 1.0
Weight for SVR: 0.0
Weight for KNeighbors: 0.0
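
For readers unfamiliar with PSO, here is a minimal global-best variant in NumPy. It is an illustrative sketch, not pyswarm's implementation: `minimal_pso` is a hypothetical name, there is no early stopping, and particles are simply clipped to the bounds. The `omega`, `phip`, and `phig` arguments are named after the inertia and attraction coefficients that pyswarm exposes, with 0.5 assumed as the value for each.

```python
import numpy as np

def minimal_pso(func, lb, ub, swarmsize=30, maxiter=100,
                omega=0.5, phip=0.5, phig=0.5, seed=0):
    """Minimize func over the box [lb, ub] with a basic global-best PSO."""
    rng = np.random.default_rng(seed)
    lb = np.asarray(lb, dtype=float)
    ub = np.asarray(ub, dtype=float)
    dim = lb.size

    # Random initial positions inside the box, zero initial velocities.
    x = rng.uniform(lb, ub, size=(swarmsize, dim))
    v = np.zeros_like(x)

    # Personal bests and the swarm-wide (global) best.
    p_best = x.copy()
    p_best_val = np.array([func(p) for p in x])
    g_idx = int(np.argmin(p_best_val))
    g_best, g_best_val = p_best[g_idx].copy(), float(p_best_val[g_idx])

    for _ in range(maxiter):
        rp = rng.random((swarmsize, dim))
        rg = rng.random((swarmsize, dim))
        # Inertia + pull toward personal best + pull toward global best.
        v = omega * v + phip * rp * (p_best - x) + phig * rg * (g_best - x)
        x = np.clip(x + v, lb, ub)  # keep particles inside the bounds

        # Update each particle's personal best where it improved.
        vals = np.array([func(p) for p in x])
        improved = vals < p_best_val
        p_best[improved] = x[improved]
        p_best_val[improved] = vals[improved]

        # Update the global best if any personal best beats it.
        g_idx = int(np.argmin(p_best_val))
        if p_best_val[g_idx] < g_best_val:
            g_best, g_best_val = p_best[g_idx].copy(), float(p_best_val[g_idx])

    return g_best, g_best_val


# Usage on a simple convex test function with its minimum at (0.3, 0.7).
target = np.array([0.3, 0.7])
best_x, best_val = minimal_pso(lambda w: float(np.sum((w - target) ** 2)), [0, 0], [1, 1])
print(best_x, best_val)
```

The search is stochastic, so different seeds can return slightly different positions; the `seed` argument here only fixes the sketch's own random generator.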


## Step 6: Build and Save the Final Ensemble
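Combine the test-set predictions using the optimized weights and save the result to a CSV file. Because PSO assigned all of the weight to Random Forest, the optimized ensemble in this run is identical to the Random Forest predictions alone.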

In [9]:
# Compute final ensemble prediction
optimized_ensemble = sum(optimal_weights[i] * predictions[name] for i, name in enumerate(models))

# Save optimized predictions to submission.csv
submission = pd.DataFrame({"Id": test_ids, "SalePrice": optimized_ensemble})
submission.to_csv("data/submission.csv", index=False)
print("Optimized predictions saved as 'data/submission.csv'")

# Re-save the unoptimized baseline (already written in Step 3)
submission2 = pd.DataFrame({"Id": test_ids, "SalePrice": unoptimized_ensemble})
submission2.to_csv("data/submission2.csv", index=False)
print("Unoptimized predictions saved as 'data/submission2.csv'")

Optimized predictions saved as 'data/submission.csv'
Unoptimized predictions saved as 'data/submission2.csv'
