# <center> **DATA SCIENCE 2: ASSIGNMENT 1** 
    
The assignment is aimed at developing a predictive model that minimizes the loss in order to predict real estate prices as accurately as possible. The project creates a linear model, multi-linear model, random forest model, and gradient boosted random forest model. Furthermore, feature engineering was conducted to transform some of the variables through squares, interations, and interacting the squqared variables. 

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


### Import Data

We only are looking at 20% of the dataset. 

In [2]:
prng = np.random.RandomState(20240322)
real_estate_data = pd.read_csv("https://raw.githubusercontent.com/divenyijanos/ceu-ml/2023/data/real_estate/real_estate.csv")
real_estate_sample = real_estate_data.sample(frac=0.2)


### Set X (features) and Y (outcome) variables for predictive modelling 

Setting the outcome to the `house_price_of_unit_area` variable, since this is what we are trying to predict

Set features to: `house_age`, `distance_to_the_nearest_MRT_station`, `number_of_convenience_stores`, `latitude`, `longitude`

Split test and train, with the test size being 30%

In [3]:
real_estate_sample.head()

Unnamed: 0,id,transaction_date,house_age,distance_to_the_nearest_MRT_station,number_of_convenience_stores,latitude,longitude,house_price_of_unit_area
5,6,2012.667,7.1,2175.03,3,24.96305,121.51254,32.1
369,370,2012.667,20.2,2185.128,3,24.96322,121.51237,22.8
158,159,2013.0,11.6,390.5684,5,24.97937,121.54245,39.4
409,410,2013.0,13.7,4082.015,0,24.94155,121.50381,15.4
114,115,2012.667,30.6,143.8383,8,24.98155,121.54142,53.3


In [4]:
outcome = real_estate_sample["house_price_of_unit_area"]
features = real_estate_sample[['house_age', 'distance_to_the_nearest_MRT_station', 'number_of_convenience_stores']]
X_train, X_test, y_train, y_test = train_test_split(features, outcome, test_size=0.3, random_state=prng)
print(f"Size of the training set: {X_train.shape}, size of the test set: {X_test.shape}")


Size of the training set: (58, 3), size of the test set: (25, 3)


The size of the training and test set will work for constructing the models and premptively working on them, but will need to be later tested on the full dataset. This is acting like a validation set.  

### (2 points) Think about an appropriate loss function you can use to evaluate your predictive models. What is the risk (from a business perspective) that you would have to take by making a wrong prediction?

In [5]:
# define loss function
def calculateRMSLE(prediction, y_obs):
    return round(np.sqrt(
        np.mean(
            (
                np.log(np.where(prediction < 0, 0, prediction) + 1) - 
                np.log(y_obs + 1)
            )**2
        )
    ), 4)

#### The Loss Function

This loss function is appropriate, including in a business context or the context of predicting real estate. This function can handle high value predictions, which can be important in price prediction. Unlike Mean Squared Error (MSE), RMSLE can handle asymmetry in prediction errors by comparing the log of predicted values with the log of actual values. Additionally, this function avoids negative predictions `(np.where(prediction < 0, 0, prediction))` here, which will allow for the results to remain interpretable. A negative price prediction would not make sense. The two primary business risks include underestimation and overestimation. Underestimating could cause a loss in revenue, due to setting prices too low, while overestimating could lead to properties not getting purchased or rented due to them being overpriced. Either way, there would be a loss in revenue.  


### (2 points) Build a simple benchmark model and evaluate its performance on the hold-out set (using your chosen loss function).

In [6]:
# estimate benchmark model
benchmark = np.mean(y_train)
benchmark_result = ["Benchmark", calculateRMSLE(benchmark, y_train), calculateRMSLE(benchmark, y_test)]

In [7]:
# collect results into a DataFrame
result_columns = ["Model", "Train", "Test"]
results_df = pd.DataFrame([benchmark_result], columns=result_columns)
results_df

Unnamed: 0,Model,Train,Test
0,Benchmark,0.3908,0.3972


Here, the bench mark model was created based on the mean of the y variable in the training set. This is a very naive model, and not very accurate. There are definitely improvements that can be made. Additionally, the training and test RMSLE scores are not very close to each other. 

### (2 points) Build a simple linear regression model using a chosen feature and evaluate its performance. Would you launch your evaluator web app using this model?

In [8]:
from sklearn.linear_model import LinearRegression
import pandas as pd
# Build a simple linear regression model using 'distance_to_the_nearest_MRT_station' as the feature
lin_reg = LinearRegression().fit(X_train[["distance_to_the_nearest_MRT_station"]], y_train)

# Predictions on training and testing sets
train_predictions = lin_reg.predict(X_train[["distance_to_the_nearest_MRT_station"]])
test_predictions = lin_reg.predict(X_test[["distance_to_the_nearest_MRT_station"]])

# Calculate RMSLE for the training and testing sets
model_train_rmsle = calculateRMSLE(train_predictions, y_train)
model_test_rmsle = calculateRMSLE(test_predictions, y_test)

# Prepare the model's results
model_result = pd.DataFrame([["Simple Linear Regression", model_train_rmsle, model_test_rmsle]],
                            columns=["Model", "Train", "Test"])

# Append model_result to the existing results_df DataFrame
results_df = pd.concat([results_df, model_result], ignore_index=True)

# Display the updated results
results_df


Unnamed: 0,Model,Train,Test
0,Benchmark,0.3908,0.3972
1,Simple Linear Regression,0.3986,0.2477


### (2 points) Build a multivariate linear model with all the meaningful variables available. Did it improve the predictive power?

In [9]:
from sklearn.linear_model import LinearRegression
import pandas as pd

features = ['house_age', 'distance_to_the_nearest_MRT_station', 'number_of_convenience_stores']

# Build and train the model
lin_reg_multi = LinearRegression()
lin_reg_multi.fit(X_train[features], y_train)

# Make predictions on the training and testing sets
train_predictions_multi = lin_reg_multi.predict(X_train[features])
test_predictions_multi = lin_reg_multi.predict(X_test[features])

# Assuming calculateRMSLE is correctly defined and available
model_train_rmsle_multi = calculateRMSLE(train_predictions_multi, y_train)
model_test_rmsle_multi = calculateRMSLE(test_predictions_multi, y_test)

# Prepare the model's results
model_result_multi = pd.DataFrame([["Multivariate Linear Regression", model_train_rmsle_multi, model_test_rmsle_multi]],
                            columns=["Model", "Train", "Test"])

# Append model_result to the existing results_df DataFrame
results_df = pd.concat([results_df, model_result_multi], ignore_index=True)

# Display the updated results DataFrame
results_df


Unnamed: 0,Model,Train,Test
0,Benchmark,0.3908,0.3972
1,Simple Linear Regression,0.3986,0.2477
2,Multivariate Linear Regression,0.2266,0.2542


### (6 points) Try to make your model (even) better. Document your process and its success while taking two approaches:
1. Feature engineering - e.g. including squares and interactions or making sense of latitude & longitude by calculating the distance from the city center, etc.
2. Training more flexible models - e.g. random forest or gradient boosting

#### Feature Engineering

In [10]:
# squared terms
real_estate_sample['house_age_squared'] = real_estate_sample['house_age'] ** 2
real_estate_sample['distance_to_the_nearest_MRT_station_squared'] = real_estate_sample['distance_to_the_nearest_MRT_station'] ** 2
real_estate_sample['number_of_convenience_stores_squared'] = real_estate_sample['number_of_convenience_stores'] ** 2

# interaction terms
real_estate_sample['age_x_distance'] = real_estate_sample['house_age'] * real_estate_sample['distance_to_the_nearest_MRT_station']
real_estate_sample['age_x_stores'] = real_estate_sample['house_age'] * real_estate_sample['number_of_convenience_stores']
real_estate_sample['distance_x_stores'] = real_estate_sample['distance_to_the_nearest_MRT_station'] * real_estate_sample['number_of_convenience_stores']

# interactions between squared terms
real_estate_sample['age_squared_x_distance_squared'] = real_estate_sample['house_age_squared'] * real_estate_sample['distance_to_the_nearest_MRT_station_squared']
real_estate_sample['age_squared_x_stores_squared'] = real_estate_sample['house_age_squared'] * real_estate_sample['number_of_convenience_stores_squared']
real_estate_sample['distance_squared_x_stores_squared'] = real_estate_sample['distance_to_the_nearest_MRT_station_squared'] * real_estate_sample['number_of_convenience_stores_squared']

real_estate_sample.head()


Unnamed: 0,id,transaction_date,house_age,distance_to_the_nearest_MRT_station,number_of_convenience_stores,latitude,longitude,house_price_of_unit_area,house_age_squared,distance_to_the_nearest_MRT_station_squared,number_of_convenience_stores_squared,age_x_distance,age_x_stores,distance_x_stores,age_squared_x_distance_squared,age_squared_x_stores_squared,distance_squared_x_stores_squared
5,6,2012.667,7.1,2175.03,3,24.96305,121.51254,32.1,50.41,4730756.0,9,15442.713,21.3,6525.09,238477400.0,453.69,42576800.0
369,370,2012.667,20.2,2185.128,3,24.96322,121.51237,22.8,408.04,4774784.0,9,44139.5856,60.6,6555.384,1948303000.0,3672.36,42973060.0
158,159,2013.0,11.6,390.5684,5,24.97937,121.54245,39.4,134.56,152543.7,25,4530.59344,58.0,1952.842,20526280.0,3364.0,3813592.0
409,410,2013.0,13.7,4082.015,0,24.94155,121.50381,15.4,187.69,16662850.0,0,55923.6055,0.0,0.0,3127450000.0,0.0,0.0
114,115,2012.667,30.6,143.8383,8,24.98155,121.54142,53.3,936.36,20689.46,64,4401.45198,244.8,1150.7064,19372780.0,59927.04,1324125.0


In [11]:
real_estate_sample.columns

Index(['id', 'transaction_date', 'house_age',
       'distance_to_the_nearest_MRT_station', 'number_of_convenience_stores',
       'latitude', 'longitude', 'house_price_of_unit_area',
       'house_age_squared', 'distance_to_the_nearest_MRT_station_squared',
       'number_of_convenience_stores_squared', 'age_x_distance',
       'age_x_stores', 'distance_x_stores', 'age_squared_x_distance_squared',
       'age_squared_x_stores_squared', 'distance_squared_x_stores_squared'],
      dtype='object')

#### Distance from City Center using Lat/Long

In [12]:
import numpy as np

# Coordinates of New Taipei City center (Banqiao District)
city_center_lat = 25.0143
city_center_lon = 121.4672

def haversine(lat1, lon1, lat2, lon2):
    # Radius of the Earth in kilometers
    R = 6371.0
    # Convert latitude and longitude from degrees to radians
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)
    # Compute differences in coordinates
    dlat = lat2_rad - lat1_rad
    dlon = lon2_rad - lon1_rad
    # Apply Haversine formula
    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance = R * c
    return distance

# Calculate the distance for each property in the DataFrame
real_estate_sample['distance_to_city_center'] = real_estate_sample.apply(
    lambda row: haversine(row['latitude'], row['longitude'], city_center_lat, city_center_lon), axis=1)

# Check the first few rows to verify the new column is added as expected
real_estate_sample.head()


Unnamed: 0,id,transaction_date,house_age,distance_to_the_nearest_MRT_station,number_of_convenience_stores,latitude,longitude,house_price_of_unit_area,house_age_squared,distance_to_the_nearest_MRT_station_squared,number_of_convenience_stores_squared,age_x_distance,age_x_stores,distance_x_stores,age_squared_x_distance_squared,age_squared_x_stores_squared,distance_squared_x_stores_squared,distance_to_city_center
5,6,2012.667,7.1,2175.03,3,24.96305,121.51254,32.1,50.41,4730756.0,9,15442.713,21.3,6525.09,238477400.0,453.69,42576800.0,7.304606
369,370,2012.667,20.2,2185.128,3,24.96322,121.51237,22.8,408.04,4774784.0,9,44139.5856,60.6,6555.384,1948303000.0,3672.36,42973060.0,7.279138
158,159,2013.0,11.6,390.5684,5,24.97937,121.54245,39.4,134.56,152543.7,25,4530.59344,58.0,1952.842,20526280.0,3364.0,3813592.0,8.520418
409,410,2013.0,13.7,4082.015,0,24.94155,121.50381,15.4,187.69,16662850.0,0,55923.6055,0.0,0.0,3127450000.0,0.0,0.0,8.89133
114,115,2012.667,30.6,143.8383,8,24.98155,121.54142,53.3,936.36,20689.46,64,4401.45198,244.8,1150.7064,19372780.0,59927.04,1324125.0,8.319173


In [13]:
from sklearn.model_selection import train_test_split

# Define your features and target variable
features = ['house_age', 'distance_to_the_nearest_MRT_station', 'number_of_convenience_stores',
            'latitude', 'longitude', 'house_age_squared', 'distance_to_the_nearest_MRT_station_squared',
            'number_of_convenience_stores_squared', 'age_x_distance', 'age_x_stores', 'distance_x_stores',
            'age_squared_x_distance_squared', 'age_squared_x_stores_squared', 'distance_squared_x_stores_squared',
            'distance_to_city_center']
target = 'house_price_of_unit_area'

# Split your data into training and testing sets
X = real_estate_sample[features]
y = real_estate_sample[target]
X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(X, y, test_size=0.3, random_state=42)

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Define the pipeline
pipe_rf = Pipeline([
    ("random_forest", RandomForestRegressor(random_state=42))
])

# Fit the model on the training data
pipe_rf.fit(X_train_fe, y_train_fe)

# Assuming you have a correctly defined calculateRMSLE function
train_error = calculateRMSLE(pipe_rf.predict(X_train_fe), y_train_fe)
test_error = calculateRMSLE(pipe_rf.predict(X_test_fe), y_test_fe)

# Prepare the model's results
model_result_rf = pd.DataFrame([["FE Random Forest", train_error, test_error]],
                               columns=["Model", "Train", "Test"])
results_df = pd.concat([results_df, model_result_rf], ignore_index=True)
results_df


Unnamed: 0,Model,Train,Test
0,Benchmark,0.3908,0.3972
1,Simple Linear Regression,0.3986,0.2477
2,Multivariate Linear Regression,0.2266,0.2542
3,FE Random Forest,0.075,0.2175


In [14]:
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
import pandas as pd

# Define the pipeline with XGBRegressor
pipe_xgb = Pipeline([
    ("gradient_boosting", XGBRegressor(random_state=42))
])

# Fit the model on the training data
pipe_xgb.fit(X_train_fe, y_train_fe)

# Assuming calculateRMSLE is already defined
train_error_xgb = calculateRMSLE(pipe_xgb.predict(X_train_fe), y_train_fe)
test_error_xgb = calculateRMSLE(pipe_xgb.predict(X_test_fe), y_test_fe)

# Prepare and append the model's results to the existing results_df DataFrame
model_result_xgb = pd.DataFrame([["FE Gradient Boosted RF", train_error_xgb, test_error_xgb]],
                                columns=["Model", "Train", "Test"])

results_df = pd.concat([results_df, model_result_xgb], ignore_index=True)

# Display the updated results DataFrame
results_df


Unnamed: 0,Model,Train,Test
0,Benchmark,0.3908,0.3972
1,Simple Linear Regression,0.3986,0.2477
2,Multivariate Linear Regression,0.2266,0.2542
3,FE Random Forest,0.075,0.2175
4,FE Gradient Boosted RF,0.0219,0.2932


### (2 points) Would you launch your web app now? What options you might have to further improve the prediction performance?

### (4 points) Rerun three of your previous models (including both flexible and less flexible ones) on the full train set. Ensure that your test result remains comparable by keeping that dataset intact. 

(Hint: extend the code snippet below.) Did it improve the predictive power of your models? Where do you observe the biggest improvement? Would you launch your web app now?

In [15]:
# is what this does is that it finds the first 
real_estate_full = real_estate_data.loc[~real_estate_data.index.isin(X_test.index)]
print(f"Size of the full training set: {real_estate_full.shape}")

Size of the full training set: (389, 8)


In [16]:
# re-initialize test sample for non-flexible linear regression

real_estate_sample = real_estate_data.sample(frac=0.2)

outcome = real_estate_sample["house_price_of_unit_area"]
features = real_estate_sample[['house_age', 'distance_to_the_nearest_MRT_station', 'number_of_convenience_stores']]


X_train, X_test, y_train, y_test = train_test_split(features, outcome, test_size=0.3, random_state=prng)

In [17]:
# setting the training values
X_full_train = real_estate_full[['house_age', 'distance_to_the_nearest_MRT_station', 'number_of_convenience_stores']]
y_full_train = real_estate_full['house_price_of_unit_area']


In [18]:
from sklearn.linear_model import LinearRegression

# Initialize the Linear Regression model
lin_reg_multi_full = LinearRegression()

# Train the model on the full training set
lin_reg_multi_full.fit(X_full_train, y_full_train)

# Make predictions on the full training set and the original test set
train_predictions_full = lin_reg_multi_full.predict(X_full_train)
test_predictions_full = lin_reg_multi_full.predict(X_test)

# Calculate RMSLE for the full training set and original test set
model_train_rmsle_full = calculateRMSLE(train_predictions_full, y_full_train)
model_test_rmsle_full = calculateRMSLE(test_predictions_full, y_test)

# Prepare the model's results
model_result_full = pd.DataFrame([["Multivariate Linear Regression (FULL)", model_train_rmsle_full, model_test_rmsle_full]],
                            columns=["Model", "Train", "Test"])

# Append model_result to the existing results_df DataFrame
results_df = pd.concat([results_df, model_result_full], ignore_index=True)

# Display the updated results DataFrame
results_df


Unnamed: 0,Model,Train,Test
0,Benchmark,0.3908,0.3972
1,Simple Linear Regression,0.3986,0.2477
2,Multivariate Linear Regression,0.2266,0.2542
3,FE Random Forest,0.075,0.2175
4,FE Gradient Boosted RF,0.0219,0.2932
5,Multivariate Linear Regression (FULL),0.2731,0.3667


In [19]:
# squared terms
real_estate_full['house_age_squared'] = real_estate_full['house_age'] ** 2
real_estate_full['distance_to_the_nearest_MRT_station_squared'] = real_estate_full['distance_to_the_nearest_MRT_station'] ** 2
real_estate_full['number_of_convenience_stores_squared'] = real_estate_full['number_of_convenience_stores'] ** 2

# interaction terms
real_estate_full['age_x_distance'] = real_estate_full['house_age'] * real_estate_full['distance_to_the_nearest_MRT_station']
real_estate_full['age_x_stores'] = real_estate_sample['house_age'] * real_estate_full['number_of_convenience_stores']
real_estate_full['distance_x_stores'] = real_estate_full['distance_to_the_nearest_MRT_station'] * real_estate_full['number_of_convenience_stores']

# interactions between squared terms
real_estate_full['age_squared_x_distance_squared'] = real_estate_full['house_age_squared'] * real_estate_full['distance_to_the_nearest_MRT_station_squared']
real_estate_full['age_squared_x_stores_squared'] = real_estate_full['house_age_squared'] * real_estate_full['number_of_convenience_stores_squared']
real_estate_full['distance_squared_x_stores_squared'] = real_estate_full['distance_to_the_nearest_MRT_station_squared'] * real_estate_full['number_of_convenience_stores_squared']

real_estate_full.head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  real_estate_full['house_age_squared'] = real_estate_full['house_age'] ** 2
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  real_estate_full['distance_to_the_nearest_MRT_station_squared'] = real_estate_full['distance_to_the_nearest_MRT_station'] ** 2
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  rea

Unnamed: 0,id,transaction_date,house_age,distance_to_the_nearest_MRT_station,number_of_convenience_stores,latitude,longitude,house_price_of_unit_area,house_age_squared,distance_to_the_nearest_MRT_station_squared,number_of_convenience_stores_squared,age_x_distance,age_x_stores,distance_x_stores,age_squared_x_distance_squared,age_squared_x_stores_squared,distance_squared_x_stores_squared
0,1,2012.917,32.0,84.87882,10,24.98298,121.54024,37.9,1024.0,7204.414085,100,2716.12224,320.0,848.7882,7377320.0,102400.0,720441.4
1,2,2012.917,19.5,306.5947,9,24.98034,121.53951,42.2,380.25,94000.310068,81,5978.59665,,2759.3523,35743620.0,30800.25,7614025.0
2,3,2013.583,13.3,561.9845,5,24.98746,121.54391,47.3,176.89,315826.57824,25,7474.39385,66.5,2809.9225,55866560.0,4422.25,7895664.0
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8,176.89,315826.57824,25,7474.39385,,2809.9225,55866560.0,4422.25,7895664.0
4,5,2012.833,5.0,390.5684,5,24.97937,121.54245,43.1,25.0,152543.675079,25,1952.842,,1952.842,3813592.0,625.0,3813592.0


In [20]:
# Calculate the distance for each property in the DataFrame
real_estate_full['distance_to_city_center'] = real_estate_full.apply(
    lambda row: haversine(row['latitude'], row['longitude'], city_center_lat, city_center_lon), axis=1)

# Check the first few rows to verify the new column is added as expected
real_estate_full.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  real_estate_full['distance_to_city_center'] = real_estate_full.apply(


Unnamed: 0,id,transaction_date,house_age,distance_to_the_nearest_MRT_station,number_of_convenience_stores,latitude,longitude,house_price_of_unit_area,house_age_squared,distance_to_the_nearest_MRT_station_squared,number_of_convenience_stores_squared,age_x_distance,age_x_stores,distance_x_stores,age_squared_x_distance_squared,age_squared_x_stores_squared,distance_squared_x_stores_squared,distance_to_city_center
0,1,2012.917,32.0,84.87882,10,24.98298,121.54024,37.9,1024.0,7204.414085,100,2716.12224,320.0,848.7882,7377320.0,102400.0,720441.4,8.143117
1,2,2012.917,19.5,306.5947,9,24.98034,121.53951,42.2,380.25,94000.310068,81,5978.59665,,2759.3523,35743620.0,30800.25,7614025.0,8.207602
2,3,2013.583,13.3,561.9845,5,24.98746,121.54391,47.3,176.89,315826.57824,25,7474.39385,66.5,2809.9225,55866560.0,4422.25,7895664.0,8.28663
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8,176.89,315826.57824,25,7474.39385,,2809.9225,55866560.0,4422.25,7895664.0,8.28663
4,5,2012.833,5.0,390.5684,5,24.97937,121.54245,43.1,25.0,152543.675079,25,1952.842,,1952.842,3813592.0,625.0,3813592.0,8.520418


In [21]:
# setting the training values
X_full_train_fe = real_estate_full[['house_age', 'distance_to_the_nearest_MRT_station', 'number_of_convenience_stores',
            'latitude', 'longitude', 'house_age_squared', 'distance_to_the_nearest_MRT_station_squared',
            'number_of_convenience_stores_squared', 'age_x_distance', 'age_x_stores', 'distance_x_stores',
            'age_squared_x_distance_squared', 'age_squared_x_stores_squared', 'distance_squared_x_stores_squared',
            'distance_to_city_center']]

y_full_train_fe = real_estate_full['house_price_of_unit_area']

In [22]:
# Define the pipeline
pipe_rf = Pipeline([
    ("random_forest", RandomForestRegressor(random_state=42))
])

# Fit the model on the training data
pipe_rf.fit(X_full_train_fe, y_full_train_fe)

# Assuming you have a correctly defined calculateRMSLE function
train_error = calculateRMSLE(pipe_rf.predict(X_full_train_fe), y_full_train_fe)
test_error = calculateRMSLE(pipe_rf.predict(X_test_fe), y_test_fe)

# Prepare the model's results
model_result_rf = pd.DataFrame([["FE Random Forest (FULL)", train_error, test_error]],
                               columns=["Model", "Train", "Test"])
results_df = pd.concat([results_df, model_result_rf], ignore_index=True)
results_df

Unnamed: 0,Model,Train,Test
0,Benchmark,0.3908,0.3972
1,Simple Linear Regression,0.3986,0.2477
2,Multivariate Linear Regression,0.2266,0.2542
3,FE Random Forest,0.075,0.2175
4,FE Gradient Boosted RF,0.0219,0.2932
5,Multivariate Linear Regression (FULL),0.2731,0.3667
6,FE Random Forest (FULL),0.0762,0.1219


In [23]:
# Define the pipeline with XGBRegressor
pipe_xgb = Pipeline([
    ("gradient_boosting", XGBRegressor(random_state=42))
])

# Fit the model on the training data
pipe_xgb.fit(X_full_train_fe, y_full_train_fe)

# Assuming calculateRMSLE is already defined
train_error_xgb = calculateRMSLE(pipe_xgb.predict(X_full_train_fe), y_full_train_fe)
test_error_xgb = calculateRMSLE(pipe_xgb.predict(X_test_fe), y_test_fe)

# Prepare and append the model's results to the existing results_df DataFrame
model_result_xgb = pd.DataFrame([["FE Gradient Boosted RF (FULL)", train_error_xgb, test_error_xgb]],
                                columns=["Model", "Train", "Test"])

results_df = pd.concat([results_df, model_result_xgb], ignore_index=True)

# Display the updated results DataFrame
results_df

Unnamed: 0,Model,Train,Test
0,Benchmark,0.3908,0.3972
1,Simple Linear Regression,0.3986,0.2477
2,Multivariate Linear Regression,0.2266,0.2542
3,FE Random Forest,0.075,0.2175
4,FE Gradient Boosted RF,0.0219,0.2932
5,Multivariate Linear Regression (FULL),0.2731,0.3667
6,FE Random Forest (FULL),0.0762,0.1219
7,FE Gradient Boosted RF (FULL),0.023,0.1073
