The Aim of the following script is to create a model to predict Inca Tribe House Prices

## Imports

In [134]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression

## Sklearn libraries
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_absolute_error, r2_score,  mean_squared_error, mean_absolute_percentage_error
from sklearn.preprocessing import StandardScaler

from xgboost import XGBRegressor

import category_encoders

## Import Data

In [135]:
## Read CSV to pandas dataframe
Data = pd.read_csv("Inca Tribe House Prices.csv")

In [136]:
## Show top 5 rows
Data.head()

Unnamed: 0,Type,Price,Bedrooms,Bathrooms,Area,Furnished,Level,Compound,Payment_Option,Delivery_Date,Delivery_Term,City
0,Chalet,70000,2.0,2.0,10.0,Yes,Unknown,Fanar De Luna,Cash,Ready to move,Finished,Ain Sukhna
1,Apartment,1500000,3.0,3.0,10.0,No,4,Unknown,Unknown Payment,Ready to move,Unknown,New Hut - El Tagamoa
2,Stand Alone Villa,29000000,5.0,6.0,11.0,No,Unknown,Mivida,Cash,Ready to move,Core & Shell,New Hut - El Tagamoa
3,Chalet,3000000,2.0,2.0,12.0,No,Ground,Marina 5,Cash,Ready to move,Finished,North Coast
4,Apartment,1128000,3.0,2.0,14.0,No,3,Beit Al Watan,Installment,soon,Unknown,New Hut - El Tagamoa


## Data exploration

In [137]:
## Extract numerical features
numerical_data = Data.select_dtypes(include='number')

## Evaluate price correlation
price_corr = numerical_data.corr()['Price'].sort_values(ascending=False).drop('Price')

# Display the correlations
print("Correlation of Price with other numerical features:")
print(price_corr)

Correlation of Price with other numerical features:
Area         0.655411
Bathrooms    0.584470
Bedrooms     0.505422
Name: Price, dtype: float64


In [138]:
## Check average bedroom size per housing type
Type_bedrooms = Data.groupby('Type')['Bedrooms'].mean().reset_index()

# Display the grouped results
print(Type_bedrooms)

                Type  Bedrooms
0          Apartment  2.797507
1             Chalet  2.364477
2             Duplex  3.399535
3          Penthouse  3.127376
4  Stand Alone Villa  4.668922
5   Standalone Villa  4.616145
6             Studio  1.124000
7         Town House  3.654510
8         Twin House  4.027374
9         Twin house  3.629400


In [139]:
## Check Delivery_Term unique values
Data['Delivery_Term'].unique()

array(['Finished', 'Unknown ', 'Core & Shell', 'Not Finished',
       'Semi Finished'], dtype=object)

In [140]:
## Check average price for every bedroom-type grouping
grouped_data = Data.groupby(['Type', 'Bedrooms']).agg(
    mean_price=('Price', 'mean')
).reset_index()

# Display the grouped results
print(grouped_data)

          Type  Bedrooms    mean_price
0    Apartment       1.0  1.918335e+06
1    Apartment       2.0  1.587841e+06
2    Apartment       3.0  2.013230e+06
3    Apartment       4.0  3.171255e+06
4    Apartment       5.0  3.367359e+06
..         ...       ...           ...
73  Twin house       3.0  4.267520e+06
74  Twin house       4.0  6.312660e+06
75  Twin house       5.0  8.820929e+06
76  Twin house       6.0  7.203018e+06
77  Twin house       7.0  1.362082e+07

[78 rows x 3 columns]


In [141]:
## Display unique Types
Data['Type'].unique()

array(['Chalet', 'Apartment', 'Stand Alone Villa', 'Studio',
       'Standalone Villa', 'Twin House', 'Town House', 'Duplex',
       'Penthouse', 'Twin house'], dtype=object)

In [142]:
## Display unique levels
Data['Level'].unique()

array(['Unknown', '4', 'Ground', '3', '2', '1', '9', '6', '10+', '7', '8',
       '10', '5', 'Highest'], dtype=object)

## Data Cleaning

In [143]:
## Check if there are null or NaN values 
Data.isnull().sum()

Type                0
Price               0
Bedrooms          203
Bathrooms         171
Area              471
Furnished           0
Level               0
Compound            0
Payment_Option      0
Delivery_Date       0
Delivery_Term       0
City                0
dtype: int64

In [144]:
## Drop rows with null or NaN values
Data.dropna(inplace=True)

In [145]:
## Check for NaN/Null values
Data[["Bedrooms", "Bathrooms", "Area", "Furnished"]].isnull().sum()

Bedrooms     0
Bathrooms    0
Area         0
Furnished    0
dtype: int64

In [146]:
## Check the amount of 'Unknown' entries per column
unknown_entries = (Data == 'Unknown').sum()
print(unknown_entries)

Type                  0
Price                 0
Bedrooms              0
Bathrooms             0
Area                  0
Furnished          8342
Level              9796
Compound          10698
Payment_Option        0
Delivery_Date      9878
Delivery_Term         0
City                  0
dtype: int64


In [147]:
## Check Delivery term values
Data['Delivery_Term'].unique()

array(['Finished', 'Unknown ', 'Core & Shell', 'Not Finished',
       'Semi Finished'], dtype=object)

In [148]:
## Determine amount of unknowns in Delivery term
unknown_entries = (Data == 'Unknown ').sum()
print(unknown_entries)

Type                 0
Price                0
Bedrooms             0
Bathrooms            0
Area                 0
Furnished            0
Level                0
Compound             0
Payment_Option       0
Delivery_Date        0
Delivery_Term     4546
City                 0
dtype: int64


In [149]:
Data[["Bedrooms", 'Bathrooms', 'Area', 'Furnished']].isna().any().any()

False

## Encoding

The following encoding were used for initial testing and removed from the final testing. Refer to Discussion section.

In [150]:
# mode_value = Data['Furnished'].mode()[0]
# print(mode_value)

# Data['Furnished'] = Data['Furnished'].replace('Unknown', mode_value)

# ## Encode Furnished
# Data['Furnished'] = Data['Furnished'].map({'Yes': 1, 'No': 0})

In [151]:
# # Create the custom mapping for the levels
# level_mapping = {
#     'Ground': 1,
#     '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10,
#     '10+': 11, 'Highest': 12
# }

# # Map values
# Data['Level_encoded'] = Data['Level'].map(level_mapping)

# # Replace 'Unknown' values with the mode of the encoded column
# mode_value = Data['Level_encoded'].mode()[0]
# Data['Level_encoded'] = Data['Level_encoded'].fillna(mode_value)

In [152]:
# # Create mapping for delivery term
# delivery_mapping = {
#     'Finished': 4,
#     'Semi Finished': 3,
#     'Core & Shell': 2,
#     'Not Finished': 1
# }

# # Apply the mapping
# Data['Delivery_term_encoded'] = Data['Delivery_Term'].map(delivery_mapping)

# # Replace 'Unknown ' values with the mode of the encoded column
# Data['Delivery_term_encoded'] = Data['Delivery_term_encoded'].fillna(0)


In [153]:
# Data['Delivery_term_encoded'].head()

In [154]:
# Initialize Binary Encoder using Type column
encoder = category_encoders.BinaryEncoder(cols=['Type'])

# Apply binary encoding to the 'Type' column
Data = encoder.fit_transform(Data)

In [155]:
Data.head()

Unnamed: 0,Type_0,Type_1,Type_2,Type_3,Price,Bedrooms,Bathrooms,Area,Furnished,Level,Compound,Payment_Option,Delivery_Date,Delivery_Term,City
0,0,0,0,1,70000,2.0,2.0,10.0,Yes,Unknown,Fanar De Luna,Cash,Ready to move,Finished,Ain Sukhna
1,0,0,1,0,1500000,3.0,3.0,10.0,No,4,Unknown,Unknown Payment,Ready to move,Unknown,New Hut - El Tagamoa
2,0,0,1,1,29000000,5.0,6.0,11.0,No,Unknown,Mivida,Cash,Ready to move,Core & Shell,New Hut - El Tagamoa
3,0,0,0,1,3000000,2.0,2.0,12.0,No,Ground,Marina 5,Cash,Ready to move,Finished,North Coast
4,0,0,1,0,1128000,3.0,2.0,14.0,No,3,Beit Al Watan,Installment,soon,Unknown,New Hut - El Tagamoa


## Test - Training split

In [156]:
## Assign target variable
Target_variable = Data['Price']

## List of usable features
Features_list = ['Bedrooms', 'Bathrooms', 'Area', 'Type_0', 'Type_1', 'Type_2','Type_3']

## Assign features
Features = Data[Features_list]

In [157]:
print(Features)

       Bedrooms  Bathrooms   Area  Type_0  Type_1  Type_2  Type_3
0           2.0        2.0   10.0       0       0       0       1
1           3.0        3.0   10.0       0       0       1       0
2           5.0        6.0   11.0       0       0       1       1
3           2.0        2.0   12.0       0       0       0       1
4           3.0        2.0   14.0       0       0       1       0
...         ...        ...    ...     ...     ...     ...     ...
26845       5.0        4.0  990.0       0       0       1       1
26846       5.0        4.0  990.0       0       0       1       1
26847       6.0        4.0  990.0       0       1       0       1
26848       5.0        5.0  990.0       0       0       1       1
26849       6.0        5.0  995.0       0       1       0       1

[26693 rows x 7 columns]


In [158]:
# Split the data into training and test data
Features_train, Features_test, Target_train, Target_test = train_test_split(Features, Target_variable,train_size=0.8, test_size = 0.2, random_state=1)


In [159]:
## Scale features for linear regression

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
Features_train_scaled = scaler.fit_transform(Features_train)

Features_test_scaled = scaler.fit_transform(Features_test)

## Model

In [160]:
## Initialise linear model and fit scaled data
model = LinearRegression()
model.fit(Features_train_scaled, Target_train)

In [161]:
# Evaluate test predictions
Target_predicted = model.predict(Features_test_scaled)

# Calculate evaluation metrics
mae = mean_absolute_error(Target_test, Target_predicted)
mse = mean_squared_error(Target_test, Target_predicted)
rmse = np.sqrt(mse)
r2 = r2_score(Target_test, Target_predicted)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)

Mean Absolute Error (MAE): 2281025.9648602563
Mean Squared Error (MSE): 13642673537373.146
R-squared (R²): 0.44967883777930895


In [162]:
average_price = Data['Price'].mean()
percentage_RMSE = (rmse/average_price)*100
print(percentage_RMSE)

84.1861095874359


## XGBOOST model

In [163]:
# Initialize and train the XGBoost model
XGBoost_model = XGBRegressor(eval_metric=mean_absolute_error, n_estimators=50, learning_rate=0.05)
XGBoost_model.fit(Features_train, Target_train)

In [164]:
# Make predictions
Target_predicted = XGBoost_model.predict(Features_test)
# Calculate evaluation metrics
mae = mean_absolute_error(Target_test, Target_predicted)
mse = mean_squared_error(Target_test, Target_predicted)
rmse = np.sqrt(mse)
r2 = r2_score(Target_test, Target_predicted)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R²):", r2)

Mean Absolute Error (MAE): 2096481.4176578012
Mean Squared Error (MSE): 12640119807563.408
Root Mean Squared Error (RMSE): 3555294.616141313
R-squared (R²): 0.4901200703768762


In [165]:
percentage_RMSE = (rmse/average_price)*100
print(percentage_RMSE)

81.0338162206344


## Discussion

The following are points observed from the raw data:

- From the correlation matrix values we can see that Area, amount of Bedrooms and Bathrooms have an average positive correlation with some linearity
- There is a relationship in Price with respect to house Type when looking at some grouped values

Assumptions and procedures made for modelling:
- There are null and NaN values in the data and these were removed from the dataset.
- Type was binary encoded for representation, this was an optimisation to reduce the feature set and sparsity in comparison to one hot encoding
- Level, Furnished and Delivery Term have a large amount of Unknown parameters. These features were, resultantly, dropped. Replacing these values with the mean/mode could result in more bias towards that category. Removing unknown rows could be possible if there was more data.
- Scaling was performed for the linear regression model to reduce bias on Area feature which has larger values in comparison to the other features.
- Standard 80:20 Training to Test data split was performed.


Model performance:
- The linear regression performed slightly worse with respect to performance metrics (r squared and Mean Squared Error (MSE)) in comparison to the XGBoost model. However, the improvement 
in using XGBoost isn't significant.
- Root mean squared error values for both models are at an average of 82% of the average price which is quite poor for modelling prices. Two models performing poorly could point towards 
improving the dataset itself for model performance. 
- Adding features like level, furnished and delivery term could improve accuracy with more data such that we can drop unknown values. Preliminary testing with encoding these features only showed a 1% improvement which isn't worth the extra dimensionality. It also proves replacing unknown values with the mode isn't a fair assumption.





