In [19]:
import numpy as np 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb
from sklearn.metrics import r2_score

In [20]:
# Read the original CSV file
df = pd.read_csv("../data/cleaned_data.csv")

### Data Preprocessing:
##### Boolean Conversion:
Several columns in the dataset hold boolean values: 'Swimming_pool', 'Garden', 'Terrace', 'Open Fire', and 'Furnished'. Because the regression models expect numeric input, we convert these values to integers (0 for False and 1 for True) using the astype(int) method.

In [21]:
# Convert boolean columns to integers (0 for False and 1 for True)
bool_columns = ['Swimming_pool', 'Garden', 'Terrace', 'Open Fire', 'Furnished']
df[bool_columns] = df[bool_columns].astype(int)

##### One-Hot Encoding:
To handle categorical data, we apply one-hot encoding to the 'province' and 'District' columns. Each distinct category becomes its own 0/1 indicator column, so the ML models can use this information as numeric features.

In [22]:
# Perform one-hot encoding for the 'province' and 'District' columns
df = pd.get_dummies(df, columns=['province','District'])
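
As a small illustration of what this step produces, the sketch below applies pd.get_dummies to a toy frame. The frame and its values ("Antwerp", "Namur") are made up for the example and are not taken from cleaned_data.csv.

```python
import pandas as pd

# Toy frame; the values are hypothetical, not from the real dataset
toy = pd.DataFrame({
    "Price": [250000, 320000, 180000],
    "province": ["Antwerp", "Namur", "Antwerp"],
})

# Each distinct category becomes its own indicator column,
# named <original column>_<category>
encoded = pd.get_dummies(toy, columns=["province"])
print(encoded.columns.tolist())
```

The original 'province' column is replaced by one indicator column per category, which is why the feature matrices later in this notebook have many more columns than the raw data.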

##### Separate DataFrames for Apartments and Houses:
We split the dataset into two separate DataFrames: "apartment_df" and "house_df", containing information about apartments and houses, respectively.

In [23]:
# Columns that are not used as model features
columns_to_drop = ['URL', 'Listing_ID', 'Price_per_sqm', 'Type', 'Subtype', 'Listing_address', 'Postal_code', 'Locality', 'Kitchen', 'State of the building']

# Filter and process apartment data
apartment_df = df[df["Type"] == "apartment"].drop(columns=columns_to_drop)
apartment_df.to_csv('../data/apartment_df.csv', index=False)

# Filter and process house data
house_df = df[df["Type"] == "house"].drop(columns=columns_to_drop)
house_df.to_csv('../data/house_df.csv', index=False)

### Linear Regression Model
In this section we use the Linear Regression model from the scikit-learn library to predict property prices from the prepared features. The process is broken down into the following steps:
1. Data Preparation:
Feature Matrix (X): We create the feature matrix X by dropping the target column ('Price') from the DataFrame and converting the result to a NumPy array using the .to_numpy() method. The feature matrix contains the independent variables used for prediction.
Target Vector (y): We create the target vector y by selecting the 'Price' column from the same DataFrame and converting it to a NumPy array. The target vector contains the dependent variable, which is the property price to be predicted.
2. Train-Test Split:
The dataset is divided into training and testing sets using the train_test_split() function from the scikit-learn library. We set test_size=0.2, so 20% of the rows are held out for testing, and random_state=0, so the split is reproducible.
3. Linear Regression Model:
We create an instance of the Linear Regression model (regressor) using the LinearRegression() class from scikit-learn. Linear Regression fits a linear equation to the data, which makes it a natural starting point for predicting continuous values like property prices.

4. Model Training:
The Linear Regression model is trained using the training data (X_train and y_train) via the .fit() method. During training, the model learns the coefficients and intercept that best fit the training data.

5. Evaluation:
The performance of the model is evaluated using the coefficient of determination (R-squared) on both the training and testing sets, computed with the .score() method. The R-squared metric quantifies the proportion of variance in the target variable explained by the model. Higher values indicate a better fit, and the test-set value is the one that reflects performance on unseen data.

The training and test R-squared values are printed as train_score_house and test_score_house for houses, and as train_score_apartment and test_score_apartment for apartments. Comparing the two values shows how well the model fits the training data and how well it generalizes to unseen data.

Linear Regression serves as the baseline model here. In the subsequent sections, we explore more flexible techniques, Decision Tree Regression and XGBoost Regression, to see whether they improve the predictive performance for property prices.
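
For reference, the R-squared value used throughout this notebook can be reproduced by hand as 1 - SS_res / SS_tot. The sketch below is a minimal illustration with made-up prices; the helper name r_squared is hypothetical and not part of the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up prices, for illustration only
y_true = [200000, 300000, 400000]
y_pred = [220000, 290000, 390000]
r2 = r_squared(y_true, y_pred)
print(r2)
```

A model that always predicts the mean of y_true scores 0 under this definition, and a perfect model scores 1, which gives a scale for reading the scores printed below.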

In [24]:
# Linear regression for houses
df = pd.read_csv("../data/house_df.csv")
X_house = df.drop(columns=["Price"]).to_numpy()
y_house = df["Price"].to_numpy().reshape(-1, 1)
print("X shape : ", X_house.shape)
print("y shape : ", y_house.shape)

# Split the house dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_house, y_house, test_size=0.2, random_state=0)
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_test: ", y_test.shape)

# Fit the model and report R-squared on both sets
regressor = LinearRegression()
regressor.fit(X_train, y_train)
print("train_score_house: ", regressor.score(X_train, y_train))
print("test_score_house: ", regressor.score(X_test, y_test))

X shape :  (5975, 64)
y shape :  (5975, 1)
Shape of X_train:  (4780, 64)
Shape of X_test:  (1195, 64)
Shape of y_train:  (4780, 1)
Shape of y_test:  (1195, 1)
train_score_house:  0.439170242623662
test_score_house:  0.4866060744559789


In [25]:
# Linear regression for apartments
df = pd.read_csv("../data/apartment_df.csv")
X = df.drop(columns=["Price"]).to_numpy()
y = df["Price"].to_numpy().reshape(-1, 1)
print("X shape : ", X.shape)
print("y shape : ", y.shape)

# Split the apartment dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_test: ", y_test.shape)

# Fit the model and report R-squared on both sets
regressor = LinearRegression()
regressor.fit(X_train, y_train)
print("train_score_apartment: ", regressor.score(X_train, y_train))
print("test_score_apartment: ", regressor.score(X_test, y_test))

X shape :  (3601, 64)
y shape :  (3601, 1)
Shape of X_train:  (2880, 64)
Shape of X_test:  (721, 64)
Shape of y_train:  (2880, 1)
Shape of y_test:  (721, 1)
train_score_apartment:  0.42964676022099213
test_score_apartment:  0.3892493125485248


### Decision Tree Regression Model
In this section we employ Decision Tree Regression models to predict property prices for apartments and houses. The process is broken down into the following steps:

1. Model Creation:
DecisionTreeRegressor Instances: We create separate instances of the DecisionTreeRegressor model for apartments and houses. The models are initialized with the DecisionTreeRegressor class from scikit-learn, and a random_state parameter is set to ensure reproducibility of the results.

2. Model Training:
Training Data: The Decision Tree models are trained using the training data specific to each type. For apartments, the feature matrix X_train_apartment and target vector y_train_apartment are used, while for houses, X_train_house and y_train_house are utilized.
Model Fitting: The .fit() method is applied to each model with their respective training data. During training, the Decision Tree models recursively split the data based on different features to create a tree-like structure.

3. Prediction:
Test Data Prediction: The trained Decision Tree models are used to make predictions on the test data. For apartments, the feature matrix X_test_apartment is passed to the model, resulting in predicted prices stored in y_pred_apartment. Similarly, for houses, predictions are made using X_test_house, and the results are stored in y_pred_house.

4. Evaluation:
R-squared Calculation: The R-squared metric is calculated to assess the performance of the models. R-squared measures how well the model explains the variance in the target variable relative to the total variance. Higher R-squared values indicate better predictive performance.
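
To make the recursive splitting described above concrete, the sketch below fits a very shallow tree on synthetic data and prints its rules. The single "area" feature, the synthetic prices, and max_depth=2 are assumptions chosen only so the tree is small enough to read; the models in this notebook use the default, unrestricted depth.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: price grows with living area (hypothetical feature)
rng = np.random.default_rng(0)
area = rng.uniform(40, 250, size=200).reshape(-1, 1)
price = 2000 * area.ravel() + rng.normal(0, 20000, size=200)

# max_depth=2 keeps the tree small enough to inspect
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(area, price)

# Each line is a threshold test on a feature; leaves hold the predicted value
rules = export_text(tree, feature_names=["area"])
print(rules)
```

Each internal node tests one feature against a threshold, and each leaf predicts the mean target of the training rows that reach it.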


In [26]:
# Load the per-type datasets and build feature matrices and target vectors
apartment_df = pd.read_csv("../data/apartment_df.csv")
X_apartment = apartment_df.drop(columns=["Price"]).to_numpy()
y_apartment = apartment_df["Price"].to_numpy().reshape(-1, 1)
house_df = pd.read_csv("../data/house_df.csv")
X_house = house_df.drop(columns=["Price"]).to_numpy()
y_house = house_df["Price"].to_numpy().reshape(-1, 1)

In [27]:
X_train_apartment, X_test_apartment, y_train_apartment, y_test_apartment = train_test_split(X_apartment, y_apartment, test_size=0.2, random_state=0)
X_train_house, X_test_house, y_train_house, y_test_house = train_test_split(X_house, y_house, test_size=0.2, random_state=0)

In [28]:
# Create DecisionTreeRegressor instances
dt_regressor_apartment = DecisionTreeRegressor(random_state=0)
dt_regressor_house = DecisionTreeRegressor(random_state=0)

# Fit the models to the training data
dt_regressor_apartment.fit(X_train_apartment, y_train_apartment)
dt_regressor_house.fit(X_train_house, y_train_house)

In [29]:
# Predict on the test data
y_pred_apartment = dt_regressor_apartment.predict(X_test_apartment)
y_pred_house = dt_regressor_house.predict(X_test_house)

# Calculate R-squared for apartment and house models
r2_apartment = r2_score(y_test_apartment, y_pred_apartment)
r2_house = r2_score(y_test_house, y_pred_house)

# Print the R-squared scores
print("R-squared (Apartment):", r2_apartment)
print("R-squared (House):", r2_house)

R-squared (Apartment): 0.28439580719960345
R-squared (House): 0.4157044795354198


### XGBoost Regression Model
In this section we implement XGBoost Regression models to predict property prices for apartments and houses. The process is broken down into the following steps:

1. Model Creation:
XGBRegressor Instances: We create separate instances of the XGBRegressor model for apartments and houses. The models are initialized using the xgb.XGBRegressor class from the XGBoost library with its default hyperparameters. XGBoost is an optimized gradient boosting library for regression and classification tasks.

2. Model Training:
Training Data: The XGBoost Regression models are trained using the training data specific to each type. For apartments, the feature matrix X_train_apartment and target vector y_train_apartment are used, while for houses, X_train_house and y_train_house are utilized.
Model Fitting: The .fit() method is applied to each model with their respective training data. During training, XGBoost applies gradient boosting, which sequentially adds weak learners (decision trees) to improve predictive accuracy.

3. Prediction:
Test Data Prediction: Using the trained XGBoost Regression models, we make predictions on the test data. For apartments, we pass the feature matrix X_test_apartment to the xgb_regressor_apartment model, which generates predicted prices stored in y_pred_apartment. Similarly, for houses, predictions are made using X_test_house, and the results are stored in y_pred_house.

4. Evaluation:
R-squared Calculation: The R-squared metric is used to evaluate the models' performance. R-squared measures how well the models explain the variance in the target variable relative to the total variance. Higher R-squared values indicate better predictive performance.
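
The idea of sequentially adding weak learners can be sketched with plain scikit-learn trees. This is a minimal illustration of gradient boosting with squared error on synthetic data, not XGBoost's actual implementation, which adds regularization and other optimizations. The data, learning_rate, and number of rounds are assumptions picked for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (hypothetical data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() * 10 + rng.normal(0, 1, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant prediction
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    # For squared error, the negative gradient is the residual
    # (up to a constant factor)
    residuals = y - prediction
    weak_learner = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_learner.fit(X, residuals)
    # Each new tree corrects a fraction of the remaining error
    prediction += learning_rate * weak_learner.predict(X)
    trees.append(weak_learner)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks as trees are added.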

In [30]:
# Load the per-type datasets and build feature matrices and target vectors
apartment_df = pd.read_csv("../data/apartment_df.csv")
X_apartment = apartment_df.drop(columns=["Price"]).to_numpy()
y_apartment = apartment_df["Price"].to_numpy().reshape(-1, 1)
house_df = pd.read_csv("../data/house_df.csv")
X_house = house_df.drop(columns=["Price"]).to_numpy()
y_house = house_df["Price"].to_numpy().reshape(-1, 1)

In [31]:
X_train_apartment, X_test_apartment, y_train_apartment, y_test_apartment = train_test_split(X_apartment, y_apartment, test_size=0.2, random_state=0)
X_train_house, X_test_house, y_train_house, y_test_house = train_test_split(X_house, y_house, test_size=0.2, random_state=0)

In [32]:
# Create XGBoost regression instances
xgb_regressor_apartment = xgb.XGBRegressor()
xgb_regressor_house = xgb.XGBRegressor()

# Fit the models to the training data
xgb_regressor_apartment.fit(X_train_apartment, y_train_apartment)
xgb_regressor_house.fit(X_train_house, y_train_house)

In [33]:
# Predict on the test data
y_pred_apartment = xgb_regressor_apartment.predict(X_test_apartment)
y_pred_house = xgb_regressor_house.predict(X_test_house)

# Calculate R-squared for apartment and house models
r2_apartment = r2_score(y_test_apartment, y_pred_apartment)
r2_house = r2_score(y_test_house, y_pred_house)

# Print the R-squared scores
print("R-squared (Apartment):", r2_apartment)
print("R-squared (House):", r2_house)


R-squared (Apartment): 0.3673625456224776
R-squared (House): 0.6548866586000375
