In this notebook, I apply the data preprocessing and feature engineering steps to the housing test set and then predict the sale price of each house from its features. The predictions come from three previously trained regression models (RandomForest, XGBRegressor, and a neural network), and each model's predictions are written to its own CSV file.

# Libraries

In [1]:
import pandas as pd
import numpy as np
import pickle
from pathlib import Path
import dill
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import TargetEncoder
import os
from define_function import *

# Load data

In [2]:
df_test = load_data('test.csv')

In [3]:
id_test = df_test['Id']

# Drop Features

In [None]:
# drop unneeded features
df_test = drop_features(df_test, features_to_drop=['Alley','PoolQC','Fence','MiscFeature'])
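
`drop_features` comes from `define_function`, which is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical `drop_features_sketch` below removes the listed columns and ignores names that are not present. The function name and the demo frame are assumptions for illustration only.

```python
import pandas as pd


def drop_features_sketch(df: pd.DataFrame, features_to_drop: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in for drop_features: return df without the listed columns."""
    # errors="ignore" skips names that are absent instead of raising KeyError
    return df.drop(columns=features_to_drop, errors="ignore")


# Made-up demo data, not rows from test.csv
demo = pd.DataFrame({"Id": [1, 2], "Alley": [None, "Pave"], "LotArea": [8450, 9600]})
trimmed = drop_features_sketch(demo, ["Alley", "PoolQC"])
print(trimmed.columns.tolist())
```

Because `DataFrame.drop` returns a new frame, the input is left unchanged, which matches the reassignment pattern `df_test = drop_features(df_test, ...)` used in the cell above.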

# Clean Data

In [None]:
# Impute data to fill missing values
df_test = clean_data(df_test, train=False)

In [None]:
# check if there are any missing values
df_test.isna().sum().sum()

0
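
The `train=False` flag suggests that the fill values were learned on the training data and are only reused here. `clean_data` itself is not shown, so the following is a minimal pandas sketch of that idea under my own assumptions: median for numeric columns, most frequent value for the rest. The helper name `learn_fill_values` and the demo frames are hypothetical.

```python
import pandas as pd


def learn_fill_values(train: pd.DataFrame) -> dict:
    """Median for numeric columns, most frequent value for everything else."""
    fill = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill[col] = train[col].median()
        else:
            fill[col] = train[col].mode().iloc[0]
    return fill


# Made-up demo data
train_demo = pd.DataFrame({"LotFrontage": [60.0, 80.0, None], "MSZoning": ["RL", "RL", "RM"]})
test_demo = pd.DataFrame({"LotFrontage": [None, 70.0], "MSZoning": [None, "RM"]})

fill_values = learn_fill_values(train_demo)   # computed once, on the training data
test_filled = test_demo.fillna(fill_values)   # reused unchanged on the test data
print(int(test_filled.isna().sum().sum()))
```

Reusing the training statistics keeps the test set from influencing the preprocessing, and it guarantees both sets are filled with the same values.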

# Encode Data

In [None]:
# Encode data to numerical values
Target_Encoding_list = ['MSZoning', 'Street', 'Utilities', 'LotConfig', 'Neighborhood', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'CentralAir', 'Electrical', 'GarageType', 'SaleType']
Ordinal_Encoding_list = ['LotShape', 'LandContour', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'HeatingQC', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'SaleCondition']


encoding_methods = {col: 'target' for col in Target_Encoding_list}
encoding_methods.update({col: 'ordinal' for col in Ordinal_Encoding_list})

df_test = encode_data(df_test, encoding_methods, train=False, target=['SalePrice'])


{'target': TargetEncoder(target_type='continuous'), 'ordinal': OrdinalEncoder()}
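
To make the target-encoding step concrete, here is a simplified pandas sketch of the idea: each category is replaced by the mean of the training target for that category, and categories unseen in training fall back to the global mean. This is an illustration with made-up data, not the code inside `encode_data`. scikit-learn's `TargetEncoder` additionally shrinks category means toward the global mean and cross-fits during `fit_transform`, which this sketch leaves out.

```python
import pandas as pd

# Made-up demo data
train_demo = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "CollgCr"],
    "SalePrice": [140000.0, 160000.0, 200000.0, 220000.0],
})
test_demo = pd.DataFrame({"Neighborhood": ["CollgCr", "NAmes", "Veenker"]})

# Fit: learn one number per category from the training target.
category_means = train_demo.groupby("Neighborhood")["SalePrice"].mean()
global_mean = train_demo["SalePrice"].mean()

# Transform: look each category up; unseen categories get the global mean.
test_encoded = test_demo["Neighborhood"].map(category_means).fillna(global_mean)
print(test_encoded.tolist())  # [210000.0, 150000.0, 180000.0]
```

The test set has no `SalePrice` column, so at prediction time the encoder can only apply a mapping that was learned earlier on the training data.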


# Predictions

### RandomForest Model Predictions

In [None]:
# load model and feature list
with open('trained_model_rf.pickle', 'rb') as f:
    trained_model_rf = dill.load(f)


with open('feature_list.pickle', 'rb') as f:
    train_columns = dill.load(f)


# select only the features used in training
df_test = df_test[train_columns]
for col in train_columns:
    df_test[col] = df_test[col].astype(float)


# predict the target variable
y_new_pred_rf = predict_model(df_test, trained_model_rf)
y_new_pred_rf = y_new_pred_rf.flatten()

Predictions after inverse transform (if applicable):[125860.37171951 146872.91789266 181843.74102924 ... 154667.79169852
 120938.70947389 210821.00266924]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test[col] = df_test[col].astype(float)
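
The `SettingWithCopyWarning` above is raised because `df_test[train_columns]` produces a subset that pandas may treat as a copy, and the loop then assigns into it one column at a time. One way to avoid it, sketched here on a made-up frame, is to select and cast in a single expression so that no assignment into a slice takes place. The names `df_demo` and `train_columns_demo` are stand-ins, not objects from this notebook.

```python
import pandas as pd

# Made-up demo data
df_demo = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "OverallQual": [7, 6]})
train_columns_demo = ["LotArea", "OverallQual"]  # stand-in for the loaded feature list

# Select and cast in one expression: astype returns a new DataFrame,
# so nothing is written into a slice of the original frame.
features = df_demo[train_columns_demo].astype(float)

print(features.dtypes.tolist())
```

Calling `.copy()` on the selection before the loop would silence the warning as well, but the single `astype` call is shorter and casts every column at once.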

### XGBRegressor Model Predictions

In [None]:
# load model and feature list
with open('trained_model_XG.pickle', 'rb') as f:
    trained_model_XG = dill.load(f)


with open('feature_list.pickle', 'rb') as f:
    train_columns = dill.load(f)

# select only the features used in training
df_test = df_test[train_columns]
for col in train_columns:
    df_test[col] = df_test[col].astype(float)


# predict the target variable
y_new_pred_XG = predict_model(df_test, trained_model_XG)
y_new_pred_XG = y_new_pred_XG.flatten()

Predictions after inverse transform (if applicable):[124631.52 154569.9  183071.45 ... 161633.5  122293.63 203357.58]


### Neural Network Model Predictions

In [None]:
# load model
with open('trained_nn_model.pickle', 'rb') as f:
    trained_nn_model = dill.load(f)


# predict the target variable
y_new_pred_nn = predict_model(df_test, trained_nn_model)
y_new_pred_nn

46/46 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Inverse transform produced NaNs. Returning raw predictions.
46/46 ━━━━━━━━━━━━━━━━━━━━ 0s 796us/step
Predictions after inverse transform (if applicable):[[154803.3 ]
 [157160.17]
 [215290.06]
 ...
 [156491.62]
 [172144.95]
 [163895.38]]


array([[154803.3 ],
       [157160.17],
       [215290.06],
       ...,
       [156491.62],
       [172144.95],
       [163895.38]], dtype=float32)
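
The message "Inverse transform produced NaNs. Returning raw predictions." indicates that `predict_model` tries to undo a target transform and falls back to the model's raw output when that fails. `predict_model` is not shown, so the sketch below is only one way such a guard might be written. It assumes a `log1p` target transform (inverse `np.expm1`), and the name `inverse_or_raw` is hypothetical. It checks for any non-finite value, which covers both NaN and infinity.

```python
import numpy as np


def inverse_or_raw(pred, inverse_fn=np.expm1):
    """Apply inverse_fn; return the untouched input if the result is not finite."""
    pred = np.asarray(pred, dtype=float)
    # Silence overflow/invalid warnings; the finiteness check below handles them.
    with np.errstate(over="ignore", invalid="ignore"):
        restored = inverse_fn(pred)
    if not np.all(np.isfinite(restored)):
        return pred  # fallback: the inverse is unusable for these values
    return restored


log_scale = np.array([11.75, 12.10])            # values on a log1p scale
price_scale = np.array([154803.3, 157160.17])   # values already in price units

print(inverse_or_raw(log_scale))    # inverse applied
print(inverse_or_raw(price_scale))  # expm1 overflows, so the input is returned
```

A fallback like this keeps the pipeline running, but it also hides which scale the model was trained on, so it is worth confirming that the returned values are in the expected price range before saving them.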

In [None]:
# Save predictions with corresponding IDs for the random forest model
model_rf = pd.DataFrame({'Id': id_test, 'SalePrice': y_new_pred_rf})
model_rf.to_csv('prediction_rf.csv', index=False)

In [None]:
# Save predictions with corresponding IDs for the XGBRegressor model
model_xg = pd.DataFrame({'Id': id_test, 'SalePrice': y_new_pred_XG})
model_xg.to_csv('prediction_xg.csv', index=False)

In [None]:
# Save predictions with corresponding IDs for the neural network model
model_nn = pd.DataFrame({'Id': id_test, 'SalePrice': y_new_pred_nn.flatten()})
model_nn.to_csv('prediction_nn.csv', index=False)
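
The three save cells repeat the same pattern, and the neural network output needs flattening from shape `(n, 1)` first. A small helper can do the flattening and catch obvious problems before a file is written. The sketch below uses a hypothetical `build_submission` function and made-up Ids; it is not part of `define_function`.

```python
import numpy as np
import pandas as pd


def build_submission(ids, predictions) -> pd.DataFrame:
    """Pair Ids with 1-D predictions and fail fast on obvious problems."""
    preds = np.asarray(predictions, dtype=float).ravel()  # (n, 1) -> (n,)
    ids = pd.Series(ids).reset_index(drop=True)
    if len(ids) != len(preds):
        raise ValueError(f"{len(ids)} ids but {len(preds)} predictions")
    if not np.isfinite(preds).all():
        raise ValueError("predictions contain NaN or infinite values")
    if ids.duplicated().any():
        raise ValueError("duplicate Ids")
    return pd.DataFrame({"Id": ids, "SalePrice": preds})


# Made-up Ids paired with a column-shaped prediction array
submission = build_submission(
    [1461, 1462, 1463],
    np.array([[154803.3], [157160.17], [215290.06]]),
)
print(submission.shape)  # (3, 2)
```

The returned frame can then be written with `to_csv(..., index=False)` exactly as in the cells above.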