# Step 1 - Import Data

In [284]:
import pandas as pd

cities = ['2800', '2820', '2830', '2840', '2850', '2900', '2920', '2930', '2942', '2950', '3000', '3460']
combined_df = pd.DataFrame()

for city in cities:
    filename = f'./data/house_data/house_data_{city}.csv'
    df = pd.read_csv(filename)
    df["City"] = city
    df.dropna()
    combined_df = pd.concat([combined_df, df], ignore_index=True)

    
combined_df[combined_df["City"] == str(3000)].head(5)

Unnamed: 0,Address,X,Y,Price,Type,Size,Squaremeter price,Energy class,Url,City
991,"Heimdalsvej 3A, 1.",56.034845,12.591295,1995000,Ejerlejlighed,68,29338,D,https://www.dingeo.dk/adresse/3000-helsing%C3%...,3000
992,Højvænget 8,56.040647,12.600903,14995000,Villa,340,44102,C,https://www.dingeo.dk/adresse/3000-helsing%C3%...,3000
993,Ribevej 4,56.027213,12.594296,5495000,Villa,169,32514,C,https://www.dingeo.dk/adresse/3000-helsing%C3%...,3000
994,"Fiolgade 9C, 2.",56.033803,12.608923,3895000,Ejerlejlighed,159,24496,C,https://www.dingeo.dk/adresse/3000-helsing%C3%...,3000
995,Valnøddevænget 14,56.01953,12.590993,6750000,Villa,161,41925,B,https://www.dingeo.dk/adresse/3000-helsing%C3%...,3000


# Step 2 - Data Processing

In [217]:
#combined_df = combined_df.dropna()
#combined_df.describe()


# Step 3 - Feauture Selection
Select the relevant features (variables) that you want to use for predicting the price. Exclude any columns that are not useful or not available during prediction.


In [293]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

features = ['X', 'Y', "Price", "Type", 'Size', 'Energy class']
target = "Price"

X = combined_df[features]
y = combined_df[target]

# Perform one-hot encoding for categorical variables
encoder = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), ['Type', 'Energy class'])],
    remainder='passthrough'
)
X_encoded = encoder.fit_transform(X)


# Step 4 - Split the data
Split the data into training and testing sets to evaluate the performance of your machine learning model. The training set will be used to train the model, and the testing set will be used to evaluate its performance.

In [294]:
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Step 5 - Train a model
Choose a suitable machine learning algorithm for your task, such as linear regression, decision tree, or random forest. Train the model on the training data.

In [295]:
model = LinearRegression()
model.fit(X_train, y_train)

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

# Step 6 - Evaluate the model
Evaluate the performance of your trained model using appropriate evaluation metrics, such as mean squared error (MSE), mean absolute error (MAE), or R-squared.

In [266]:
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 5.42555660591112e-17


A mean squared error (MSE) value of 3.744599393294227e-16 suggests that the linear regression model is performing exceptionally well on the test set. The MSE measures the average squared difference between the predicted and actual values. In this case, the extremely small MSE indicates that the predicted values are very close to the actual values, almost negligible in terms of the squared difference.

An MSE close to zero suggests that the model is fitting the data very well, capturing the underlying patterns and minimizing the prediction errors. However, it's important to note that such a low MSE could also indicate overfitting, where the model is too closely tailored to the training data and may not generalize well to unseen data.

To gain a better understanding of the model's performance, you can also consider other evaluation metrics such as the root mean squared error (RMSE), mean absolute error (MAE), or coefficient of determination (R-squared). Additionally, visualizing the predicted values against the actual values in scatter plots or regression plots can provide insights into the model's performance.

In summary, an MSE of 3.744599393294227e-16 suggests that your linear regression model is fitting the data very well, but be cautious of potential overfitting and consider evaluating the model using additional metrics and visualizations.

The value 3.744599393294227e-16 is in scientific notation, where "e-16" indicates "10 to the power of -16." Therefore, the value can be expressed as:

0.0000000000000003744599393294227

In decimal form, this is an extremely small number very close to zero. It represents the mean squared error (MSE), which measures the average squared difference between the predicted and actual values. In this case, the MSE value being very close to zero suggests that the predicted values are almost identical to the actual values, indicating a very accurate model.

However, it's important to note that when dealing with floating-point values and numerical computations, extremely small values like this can be subject to rounding errors and precision limitations. In practical terms, achieving an MSE value of exactly zero is highly unlikely and might indicate some numerical artifacts. Therefore, it's essential to interpret such small values with caution and consider them as close to zero rather than absolute zero.

# Make predictions
Once you have trained and evaluated your model, you can use it to make predictions on new data. Create a function that takes user-specified variables as input and predicts the house price.

In [277]:
# User input
user_input = {
    'X': 55.793,
    'Y': 12.468,
    'Type': 'Ejerlejlighed',
    'Size': 110,
    'Energy class': 'C'
}

# Convert user input to a DataFrame
user_df = pd.DataFrame([user_input])

# Perform one-hot encoding for categorical variables
user_encoded = encoder.transform(user_df)

# Make predictions
user_pred = model.predict(user_encoded)

# Print the predicted price
print("Predicted Price:", user_pred[0])


ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- Size
- X
- Y
