<h1> House Price Prediction Challenge</h1>
<h5> By Franke van der Vorm and Bart van Moorsel</h5>

<p> Project House Price Prediction Challenge (HPPC) is a 4-day assignment given by Avans Hogeschool in the Data Science for the Smart Industry minor on 31/10/2022. The original challenge can be found on <a href="https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques">Kaggle</a>. The purpose of this challenge is to test the students' ability to make predictions based on historical data and to evaluate the results. The case for this challenge is to predict house prices based on an unprepared dataset of around 80 columns and 1460 rows. This dataset can be found <a href="https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data">here</a>. </p>

<p> This notebook demonstrates the solution made by the HPPC team. Every collection of cells is described in a way that explains what has been done and why. The solution is structured in 4 sections based on the best practices of the Cross Industry Standard Process for Data Mining (CRISP-DM).

*   Importing the right modules
*   Preprocessing the data
*   Training the linear regression model
*   Evaluating the model</p>

<p> The chosen model for this assignment is a linear regression model. This model was chosen because the target value is numerical, namely the price at which a house will be sold. Combined with a set of feature variables, this makes linear regression a good option.</p>
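
<p> As a rough illustration of why a numerical target suits linear regression, the sketch below fits a model on synthetic data rather than on the Kaggle dataset. The names <code>X_toy</code>, <code>y_toy</code> and <code>toy_reg</code>, as well as the coefficients 3, 2 and 5, are assumptions made up for this example. </p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (hypothetical, not from train.csv):
# the target is an exact linear combination of two features.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + 5.0

# Fit the same estimator type that is used later in this notebook
toy_reg = LinearRegression().fit(X_toy, y_toy)

# The fitted coefficients and intercept should be close to 3, 2 and 5
print(toy_reg.coef_, toy_reg.intercept_)
```

<p> On real house price data the relationship is noisy and only partly linear, so the fit will be far less exact than in this toy case. </p>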

<h3> Importing the right modules </h3>
<p> pandas and NumPy are used for efficient data manipulation. The machine learning module is scikit-learn (sklearn), which is used for fitting and evaluating the model. Lastly, matplotlib is used for visualising the results. Next, the dataset is loaded and inspected. </p>

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, KFold
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv", index_col=0)
df

<h3> Preprocessing the data </h3>
<p> As mentioned before, the dataset is unprepared and not ready for fitting. This means that the data has to be preprocessed first, which requires a solid understanding of the data. The data understanding of the HPPC team is documented in <a href="">the data-understanding report</a>, which also describes how certain data is filtered. </p>

<p> The raw data can be divided into two categories: numerical data and categorical data. Each needs a different approach. If a categorical column has no missing values, it is ready for one-hot encoding. In this coding section, a for loop iterates over all the categorical columns specified in the ohe_list. </p>
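
<p> The check for missing values and the one-hot encoding step can be sketched on a small, made-up frame. The frame <code>toy</code> and its values are hypothetical; only the column names are borrowed from the dataset. Passing <code>dtype=int</code> is an optional choice here to get 0/1 columns instead of booleans. </p>

```python
import pandas as pd

# Hypothetical miniature of the dataset (values made up for illustration)
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes"],
    "LotArea": [8450, 9600, 11250],
})

# Count missing values per column; a categorical column with zero
# missing values can be one-hot encoded directly.
missing = toy.isna().sum()
print(missing)

# One indicator column is created per distinct category
encoded = pd.get_dummies(toy[["Neighborhood"]], prefix="Neighborhood", dtype=int)

# Join the encoded columns back onto the numerical features by index
result = toy[["LotArea"]].join(encoded)
print(result)
```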

In [None]:
#Filter feature data
X_unproc = df[["LotArea","YrSold"]]
ohe_list = ["Neighborhood", "ExterQual"]
X_final = pd.DataFrame()

#Do one hot encoding for every column specified in ohe_list
for column in ohe_list:
    pre_ohe_df = df[[column]]
    ohe_df = pd.get_dummies(pre_ohe_df, prefix=column)
    X_final = ohe_df.join(X_final)
    # print(X_final)
    # print(ohe_df)

# print(X)
# X["YrSold"] = list(map(lambda year: year - 2006, df["YrSold"])) #example of remapping column values 
# print(list(X["YrSold"]))
X_final = X_unproc.join(X_final)
# X_final.to_csv("temp.csv")
X = X_final.values


y = df["SalePrice"].values
print(X.shape)
# print(y.shape)
# # X = X.reshape(1460, 54)
# print(X.shape)




In [None]:
#Split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print(X_train.shape)

# X_train = X_train.reshape(-1, 2)
# print(X_train.shape)

# X_test = X_test.reshape(-1, 1)

# "x_train size: " + str(X_train.size) + " y_train size: " + str(y_train.size) 



In [None]:
#Train LinearRegression model
reg = LinearRegression()
reg.fit(X_train, y_train)


In [None]:
#Evaluate results
# X_test = X_test.reshape(-1, 2)
print(X_test.shape)
y_pred = reg.predict(X_test)
r_squared = reg.score(X_test, y_test)
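# Take the square root explicitly; the squared=False argument is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))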

kf = KFold(n_splits=6, shuffle=True)
cv_scores = cross_val_score(reg, X, y, cv=kf)
cv_mean = np.mean(cv_scores)

print("Root mean squared error: " + str(rmse))
print("R squared: " + str(r_squared))
print("Mean 6-fold CV: " + str(cv_mean))
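
<p> To make the two test-set metrics concrete, the sketch below computes the root mean squared error and the R squared score by hand and compares them with scikit-learn. The arrays <code>y_true</code> and <code>y_hat</code> hold made-up prices, not results from this notebook. </p>

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical sale prices and predictions (every error is 10000 in size)
y_true = np.array([200000.0, 150000.0, 310000.0, 120000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0, 130000.0])

# RMSE: square root of the mean squared residual, in the unit of the target
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R squared: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Cross-check against the scikit-learn implementations
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)
print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

<p> RMSE is expressed in the same unit as the sale price, while R squared is unitless, which is why both are reported. </p>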




In [None]:
#Before evaluating on history, retrieve the history

history = pd.DataFrame(columns=["Attempt", "mean_cv", "rmse"])
try:
    history = pd.read_csv("history.csv")
    print("Found history.csv with rows amount: " + str(history.shape[0]))
except FileNotFoundError:
    print("history.csv not found, creating new one")
    history.to_csv('history.csv', index = False)


In [None]:
#Evaluate based on history and visualize progress


history.loc[len(history.index)] = [int(len(history.index)), cv_mean, int(rmse)]
print(history[-5:-1])
print("Current attempt: ", history[-1:])
history.to_csv('history.csv', index = False)

plt.plot(history["Attempt"], history["mean_cv"])
min_cv_mean = np.min(history["mean_cv"].values)
max_cv_mean = np.max(history["mean_cv"].values)

visual_margin = (max_cv_mean - min_cv_mean) * 0.5 + 0.001 #Makes the graph more eye friendly

plt.xlabel("Attempts")
plt.xlim(0, np.size(history["Attempt"].values))
# plt.ylim(min_r_squared - visual_margin, max_r_squared + visual_margin)
plt.ylabel("Mean cross_val_score")
plt.ylim(0, 1)
plt.title("Cross-Validation score progress")
plt.show()
