# Let’s say we want to build a model to predict booking prices on Airbnb. Between linear regression and random forest regression, which model would perform better and why?

Random forest regression would be expected to perform better, because it combines ensemble learning with decision trees and does not assume a linear relationship between the features and the price. Linear regression fits a single weighted sum of the features, so it captures thresholds and interactions only if they are engineered by hand. A decision tree partitions the data into hierarchical branches based on feature values: each internal node represents a feature test, and each leaf node holds a predicted value, so the tree estimates a continuous target by following the feature conditions from the root to a leaf. Random forest extends the decision tree: it trains many trees on bootstrap samples of the data and averages their predictions, which gives a more accurate and robust regression model than any single tree.
Random forest regressors can capture complex non-linear relationships between the input features and the target variable. They learn interactions and non-linearities in the data automatically, which suits booking price prediction, where multiple factors influence the outcome.
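
To make the contrast concrete, the following is a minimal sketch on synthetic data, not on the Airbnb listings. The features `distance` and `entire_home` and the price formula are invented for illustration; they build in a threshold and an interaction, the kind of structure a linear model cannot represent without extra feature engineering.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features: distance to the city centre (km) and a room-type flag
distance = rng.uniform(0, 10, n)
entire_home = rng.integers(0, 2, n)

# Synthetic price: a premium that applies only within 3 km of the centre,
# and that is much larger for entire homes (threshold plus interaction)
central = distance < 3
price = 60 + 20 * central + 80 * entire_home * central + rng.normal(0, 5, n)

X = np.column_stack([distance, entire_home])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# Fit both models on the same training split
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Compare them on the same held-out split
r2_linear = r2_score(y_test, linear.predict(X_test))
r2_forest = r2_score(y_test, forest.predict(X_test))
print(f"Linear regression R2: {r2_linear:.2f}")
print(f"Random forest R2:     {r2_forest:.2f}")
```

On data built this way the forest should score a clearly higher R2 than the linear model. On the real dataset the ranking still has to be checked empirically, for example by fitting both models on the same train/test split.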
We therefore proceed to import relevant libraries for the model.

# Import the relevant libraries

In [84]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the data

In [197]:
data = pd.read_csv('Airbnb.list.csv')
data

Unnamed: 0,ID,HOST_ID,NEIGHBOURHOOD_GROUP,LATITUDE,LONGITUDE,ROOM_TYPE,PRICE,MINIMUM_NIGHTS,NUMBER_OF_REVIEWS,LAST_REVIEW,REVIEWS_PER_MONTH,CALCULATED_HOST_LISTINGS_COUNT,AVAILABILITY_365
0,2539,2787,Brooklyn,40.64749,-73.97237,Private room,149,1,9,19/10/2018,0.21,6,365
1,2595,2845,Manhattan,40.75362,-73.98377,Entire home/apt,225,1,45,21/05/2019,0.38,2,355
2,3647,4632,Manhattan,40.80902,-73.94190,Private room,150,3,0,,,1,365
3,3831,4869,Brooklyn,40.68514,-73.95976,Entire home/apt,89,1,270,05/07/2019,4.64,1,194
4,5022,7192,Manhattan,40.79851,-73.94399,Entire home/apt,80,10,9,19/11/2018,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,8232441,Brooklyn,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,6570630,Brooklyn,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,23492952,Manhattan,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,30985759,Manhattan,40.75751,-73.99112,Shared room,55,1,0,,,6,2


# Encode Categorical features

In [198]:
data.columns = data.columns.str.strip().str.upper()
categorical_columns = ['NEIGHBOURHOOD_GROUP', 'ROOM_TYPE']
data_encoded = pd.get_dummies(data, columns=categorical_columns)
data_encoded.head()

Unnamed: 0,ID,HOST_ID,LATITUDE,LONGITUDE,PRICE,MINIMUM_NIGHTS,NUMBER_OF_REVIEWS,LAST_REVIEW,REVIEWS_PER_MONTH,CALCULATED_HOST_LISTINGS_COUNT,AVAILABILITY_365,NEIGHBOURHOOD_GROUP_Bronx,NEIGHBOURHOOD_GROUP_Brooklyn,NEIGHBOURHOOD_GROUP_Manhattan,NEIGHBOURHOOD_GROUP_Queens,NEIGHBOURHOOD_GROUP_Staten Island,ROOM_TYPE_Entire home/apt,ROOM_TYPE_Private room,ROOM_TYPE_Shared room
0,2539,2787,40.64749,-73.97237,149,1,9,19/10/2018,0.21,6,365,False,True,False,False,False,False,True,False
1,2595,2845,40.75362,-73.98377,225,1,45,21/05/2019,0.38,2,355,False,False,True,False,False,True,False,False
2,3647,4632,40.80902,-73.9419,150,3,0,,,1,365,False,False,True,False,False,False,True,False
3,3831,4869,40.68514,-73.95976,89,1,270,05/07/2019,4.64,1,194,False,True,False,False,False,True,False,False
4,5022,7192,40.79851,-73.94399,80,10,9,19/11/2018,0.1,1,0,False,False,True,False,False,True,False,False


In [199]:
data = data.copy()
data['NEIGHBOURHOOD_GROUP'] = data['NEIGHBOURHOOD_GROUP'].map({'Brooklyn':1,'Manhattan': 0, 'Bronx':2,'Queens':3,'Staten Island':4})
data['ROOM_TYPE'] = data['ROOM_TYPE'].map({'Private room':10,'Entire home/apt': 20, 'Shared room':30})
data

Unnamed: 0,ID,HOST_ID,NEIGHBOURHOOD_GROUP,LATITUDE,LONGITUDE,ROOM_TYPE,PRICE,MINIMUM_NIGHTS,NUMBER_OF_REVIEWS,LAST_REVIEW,REVIEWS_PER_MONTH,CALCULATED_HOST_LISTINGS_COUNT,AVAILABILITY_365
0,2539,2787,1,40.64749,-73.97237,10,149,1,9,19/10/2018,0.21,6,365
1,2595,2845,0,40.75362,-73.98377,20,225,1,45,21/05/2019,0.38,2,355
2,3647,4632,0,40.80902,-73.94190,10,150,3,0,,,1,365
3,3831,4869,1,40.68514,-73.95976,20,89,1,270,05/07/2019,4.64,1,194
4,5022,7192,0,40.79851,-73.94399,20,80,10,9,19/11/2018,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,8232441,1,40.67853,-73.94995,10,70,2,0,,,2,9
48891,36485057,6570630,1,40.70184,-73.93317,10,40,4,0,,,2,36
48892,36485431,23492952,0,40.81475,-73.94867,20,115,10,0,,,1,27
48893,36485609,30985759,0,40.75751,-73.99112,30,55,1,0,,,6,2


# Convert the review date to a numeric feature

In [204]:
# Parse day-first dates such as 19/10/2018; unparseable values become NaT
data['LAST_REVIEW'] = pd.to_datetime(data['LAST_REVIEW'], format='%d/%m/%Y', errors='coerce')
# Listings with no review date get a placeholder date
data['LAST_REVIEW'] = data['LAST_REVIEW'].fillna(pd.Timestamp('2023-01-01'))

In [205]:
# Convert the timestamps to integers so the model can use them as a feature
data['LAST_REVIEW'] = data['LAST_REVIEW'].astype('int64')

# Handling of missing values

In [206]:
numeric_columns = ['PRICE', 'MINIMUM_NIGHTS', 'NUMBER_OF_REVIEWS', 'REVIEWS_PER_MONTH', 'CALCULATED_HOST_LISTINGS_COUNT', 'AVAILABILITY_365']
data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())

# Standardization

In [255]:
scaler = StandardScaler()
numeric_columns = ['PRICE', 'MINIMUM_NIGHTS', 'NUMBER_OF_REVIEWS', 'REVIEWS_PER_MONTH', 'CALCULATED_HOST_LISTINGS_COUNT', 'AVAILABILITY_365']
data[numeric_columns] = scaler.fit_transform(data[numeric_columns])

In [256]:
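# Separate the features from the target before dropping incomplete rows
X = data.drop('PRICE', axis=1)
y = data['PRICE']
X = X.dropna()
y = y.loc[X.index]  # keep the target aligned with the remaining rows

# Data Splitting

In [257]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)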

# Create and train the Random Forest Regression model

In [269]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Model prediction

In [270]:
y_pred = rf_model.predict(X_test)

# Evaluation of the model

In [277]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

In [278]:
print(f"Mean Absolute Error: {mae:.2f}")

Mean Absolute Error: 0.26


The Mean Absolute Error (MAE) is a common metric for evaluating regression models such as the Random Forest Regressor. It is the average absolute difference between the actual (observed) values and the model's predicted values. Here the MAE is 0.26.
The MAE is expressed in the same units as the target variable, which is "PRICE". Because "PRICE" was standardized before training, those units are standard deviations of price rather than currency: the model's predictions are, on average, 0.26 standard deviations away from the actual price.
An MAE of roughly a quarter of a standard deviation means that the typical prediction is fairly close to the actual price.
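
Because `PRICE` was passed through `StandardScaler` before training, an error in scaled units can be mapped back to currency by multiplying by the standard deviation the scaler learned. The sketch below assumes a separate scaler fitted on the price column alone (`price_scaler` is a hypothetical name) and uses made-up prices, so the dollar figure it prints is not a result for this dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical nightly prices in dollars (illustrative, not the full dataset)
prices = np.array([[149.0], [225.0], [150.0], [89.0], [80.0]])

# Hypothetical scaler fitted on the price column only
price_scaler = StandardScaler()
prices_scaled = price_scaler.fit_transform(prices)

# StandardScaler divides by the standard deviation stored in scale_,
# so an error in scaled units converts back by multiplying by scale_.
mae_scaled = 0.26
mae_dollars = mae_scaled * price_scaler.scale_[0]
print(f"MAE of {mae_scaled} scaled units is {mae_dollars:.2f} dollars for this toy sample")
```

In the notebook itself, `scaler` was fitted on all of `numeric_columns` with `PRICE` listed first, so `scaler.scale_[0]` would play the same role there.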

In [279]:
print(f"Mean Squared Error: {mse:.2f}")

Mean Squared Error: 0.59


The Mean Squared Error (MSE) is another common metric for evaluating regression models such as the Random Forest Regressor. It is the average squared difference between the actual values and the values predicted by the model. A lower MSE indicates a more accurate fit.
Here the MSE is 0.59, meaning that the squared difference between the actual and predicted prices is 0.59 on average.
Because the errors are squared, the MSE is expressed in squared units of the target. If prices were measured in dollars, the MSE would be in squared dollars and could not be read as a dollar amount; with the standardized "PRICE" used here, 0.59 is in squared standard deviations. Squaring also means that a few large errors raise the MSE far more than many small ones.
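
As a small worked example, the definitions of MAE and MSE can be computed by hand with NumPy. The four actual and predicted prices below are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical actual and predicted prices, in dollars
y_true = np.array([100.0, 150.0, 200.0, 80.0])
y_hat = np.array([110.0, 140.0, 230.0, 80.0])

errors = y_true - y_hat                  # [-10, 10, -30, 0]

# MAE: mean of the absolute errors, in dollars
mae_example = np.mean(np.abs(errors))    # (10 + 10 + 30 + 0) / 4 = 12.5

# MSE: mean of the squared errors, in squared dollars
mse_example = np.mean(errors ** 2)       # (100 + 100 + 900 + 0) / 4 = 275.0

print(f"MAE: {mae_example}")
print(f"MSE: {mse_example}")
```

The single 30-dollar miss contributes 900 of the 1100 total squared error, which is why the MSE reacts so strongly to large errors.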

In [280]:
print(f"Root Mean Squared Error: {rmse:.2f}")

Root Mean Squared Error: 0.77


The Root Mean Squared Error (RMSE) is another widely used metric for regression models. It is the square root of the MSE, with one important practical difference: RMSE is expressed in the same units as the target variable, which makes it easier to interpret.
Here the RMSE is 0.77: averaging the squared differences between the actual and predicted prices and then taking the square root gives 0.77. It summarizes how far the model's predictions depart from the actual prices, giving extra weight to large errors.
For this model, an RMSE of 0.77 means the typical prediction error is about 0.77 standard deviations of the standardized "PRICE". That is well above the MAE of 0.26, which suggests that a minority of listings are predicted badly. Smaller RMSE values indicate predictions that are closer to the actual prices.
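
The gap between RMSE and MAE can be illustrated with two invented sets of errors that have the same MAE. This is only a sketch of the general behaviour, with made-up numbers rather than errors from the model above.

```python
import numpy as np

def mae_and_rmse(errors):
    """Return (MAE, RMSE) for an array of prediction errors."""
    errors = np.asarray(errors, dtype=float)
    return np.mean(np.abs(errors)), np.sqrt(np.mean(errors ** 2))

# Same total absolute error, spread evenly versus concentrated in one prediction
mae_even, rmse_even = mae_and_rmse([1, 1, 1, 1])
mae_spiky, rmse_spiky = mae_and_rmse([0, 0, 0, 4])

print(f"Even errors:  MAE={mae_even}, RMSE={rmse_even}")
print(f"Spiky errors: MAE={mae_spiky}, RMSE={rmse_spiky}")
```

Both sets have an MAE of 1, but the RMSE doubles when the error is concentrated in a single prediction. An RMSE far above the MAE is therefore a hint to look for a few badly predicted rows.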

In [281]:
print(f"R-squared: {r2:.2f}")

R-squared: 0.23


The R-squared (R2) value, also known as the coefficient of determination, measures a regression model's goodness of fit: how much of the variability in the dependent variable (here, the "PRICE" of listings) is explained by the model's independent variables.
With an R2 of 0.23, the model accounts for 23% of the variation in "PRICE" on the test set.
The remaining 77% of the variation is left unexplained by the features in the model.
The model therefore explains only a small portion of the variation in prices. This implies that the model could be improved, or that factors not included in the model are influencing prices.
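
For reference, R2 can be computed directly from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The prices below are invented for illustration, so the resulting value says nothing about the model above.

```python
import numpy as np

# Hypothetical actual and predicted prices, in dollars
y_true = np.array([100.0, 150.0, 200.0, 80.0])
y_hat = np.array([110.0, 140.0, 230.0, 80.0])

# Residual sum of squares: variation the model fails to explain
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: variation around the mean of the actual values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_example = 1 - ss_res / ss_tot

# Baseline: always predicting the mean explains none of the variation
y_mean = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - y_mean) ** 2) / ss_tot

print(f"R2 of the example predictions: {r2_example:.3f}")
print(f"R2 of the mean-only baseline:  {r2_baseline:.3f}")
```

Read this way, an R2 of 0.23 means the model removes 23% of the squared error that a constant prediction at the mean price would make.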