### Import the required libraries. We will import others as we go

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

### Load the CSV data, which is hosted on GitHub

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/house_price_by_area.csv")
df.head()

Unnamed: 0,LotArea,SalePrice
0,138,1204000
1,145,1274000
2,152,1673000
3,152,1232000
4,152,1195600


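### The only preprocessing we apply is scaling, since our columns are on different scales. The data is otherwise clean, so no further preprocessing is needed. Note that the models below are fit on the unscaled df, and tree-based models such as random forests do not depend on feature scale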

In [3]:
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled, columns = df.columns)
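To make the scaling step concrete, here is a minimal sketch of what standardization does to each column: subtract the column mean and divide by the column's population standard deviation. It uses only the first four rows shown by df.head() and plain NumPy rather than StandardScaler, so treat it as an illustration of the arithmetic, not a replacement for the cell above.

```python
import numpy as np

# First four (LotArea, SalePrice) rows from the df.head() output above
data = np.array([
    [138.0, 1204000.0],
    [145.0, 1274000.0],
    [152.0, 1673000.0],
    [152.0, 1232000.0],
])

# Standardize column by column: (value - mean) / population std (ddof=0)
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)
standardized = (data - col_mean) / col_std

# After standardization each column has mean ~0 and std ~1
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

Both columns end up on the same scale, which matters for distance-based or gradient-based models more than for tree ensembles.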

### We will use the default max_depth of None for our first random forest model

In [4]:
model = RandomForestRegressor(n_estimators = 45)

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X = df[['LotArea']]
y = df['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [7]:
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, predictions))
print("RMSE:", np.sqrt(mean_squared_error(y_test, predictions)))
print("R2_Score:", r2_score(y_test, predictions))

MSE: 422490886596.1823
RMSE: 649992.9896515672
R2_Score: 0.3203611544334456
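As a rough guide to how the three numbers above relate, the sketch below computes MSE, RMSE and R2 by hand. The y_true and y_pred arrays are hypothetical values chosen for easy arithmetic; they are not taken from the house price data.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_pred

# MSE is the mean of the squared residuals; RMSE is its square root,
# which puts the error back in the units of the target
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

# R2 compares the model's squared error with that of always predicting the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("RMSE:", rmse)
print("R2_Score:", r2)
```

Because RMSE is in the same units as SalePrice, it is usually the easier of the two error figures to interpret.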


### Now we want to understand how different values of n_estimators and max_depth affect the performance of our model, and identify the best parameters

In [8]:
def evaluate_rf_performance(X_train, X_test, y_train, y_test, n_estimators, max_depth):
    rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    return rmse, r2


results = []

# Iterate over different values of n_estimators and max_depth
for n_estimators in [50, 100, 150]:
    for max_depth in [5, 10, 15]:
        # Call function to evaluate performance
        rmse, r2 = evaluate_rf_performance(X_train, X_test, y_train, y_test, n_estimators, max_depth)
        # Append results to list
        results.append({'n_estimators': n_estimators, 'max_depth': max_depth, 'RMSE': rmse, 'R2 Score': r2})

# Create DataFrame from list of results
results_df = pd.DataFrame(results)
print(results_df)

   n_estimators  max_depth           RMSE  R2 Score
0            50          5  595492.457032  0.429556
1            50         10  642413.479872  0.336119
2            50         15  645465.923451  0.329795
3           100          5  595122.779201  0.430264
4           100         10  636382.485503  0.348526
5           100         15  639620.534238  0.341879
6           150          5  596876.659637  0.426900
7           150         10  637534.304175  0.346165
8           150         15  640830.259738  0.339387


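### Of the nine combinations tested, n_estimators of 100 and max_depth of 5 gives the lowest RMSE (about 595,123) and the highest R2 score (about 0.43). This means that approximately 43% of the variance in SalePrice on the test set is explained by LotArea, the model's only feature. We therefore conclude that the (100, 5) combination is the most accurate of those tried. Note that all three max_depth of 5 runs have R2 scores within about 0.004 of each other, so max_depth has a much larger effect here than n_estimators.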
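The comparison above scores every combination on the test split. One common alternative is to pick the parameters by cross-validation on the training split and keep the test split for a final check. The sketch below shows that idea with scikit-learn's GridSearchCV over the same grid. The data is synthetic (an assumption made so the example runs without downloading the CSV), so the printed numbers say nothing about the house price results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for (LotArea, SalePrice); not the real data set
rng = np.random.default_rng(42)
X = rng.uniform(100, 300, size=(200, 1))
y = 8000 * X[:, 0] + rng.normal(0, 200000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Same grid as the manual loop above
param_grid = {"n_estimators": [50, 100, 150], "max_depth": [5, 10, 15]}

# 5-fold cross-validation on the training split only;
# scikit-learn maximizes scores, so RMSE is reported negated
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
# The refitted best model is scored once on the held-out test split
print("Test RMSE:", -search.score(X_test, y_test))
```

Choosing parameters this way keeps the test score an honest estimate, since the test rows play no part in the selection.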