## House Price Regression Project
This is a small self-built project that uses the `RandomForestRegressor` from the
[scikit-learn](https://scikit-learn.org/stable/) library to predict house prices. The dataset is the
[Melbourne Housing Snapshot](https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot).


In [11]:
# Import the pandas library to read in data
import pandas as pd

# Read in the data using pandas
melb_data = pd.read_csv("melb_data.csv")
print(melb_data.describe())


              Rooms         Price      Distance      Postcode      Bedroom2  \
count  13580.000000  1.358000e+04  13580.000000  13580.000000  13580.000000   
mean       2.937997  1.075684e+06     10.137776   3105.301915      2.914728   
std        0.955748  6.393107e+05      5.868725     90.676964      0.965921   
min        1.000000  8.500000e+04      0.000000   3000.000000      0.000000   
25%        2.000000  6.500000e+05      6.100000   3044.000000      2.000000   
50%        3.000000  9.030000e+05      9.200000   3084.000000      3.000000   
75%        3.000000  1.330000e+06     13.000000   3148.000000      3.000000   
max       10.000000  9.000000e+06     48.100000   3977.000000     20.000000   

           Bathroom           Car       Landsize  BuildingArea    YearBuilt  \
count  13580.000000  13518.000000   13580.000000   7130.000000  8205.000000   
mean       1.534242      1.610075     558.416127    151.967650  1964.684217   
std        0.691712      0.962634    3990.669241   
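
The `count` row above is lower for `Car` (13,518), `BuildingArea` (7,130) and `YearBuilt` (8,205) than for the other columns (13,580), which points to missing values in those columns. A minimal sketch of how this could be checked is below. It uses a small made-up DataFrame (`toy`, with invented values) rather than `melb_data`, so it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for melb_data; the values are invented for illustration only
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 1.0],
    "BuildingArea": [np.nan, 150.0, np.nan, 142.0],
})

# Count the missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# describe() counts only non-missing entries, so for every column
# count + missing should equal the number of rows
non_missing = toy.describe().loc["count"]
print(non_missing + missing_counts)
```

The three features used later in this notebook (`Rooms`, `Postcode`, `Landsize`) all show a count of 13,580 in the output above, so no rows need to be dropped or imputed for this model.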

In [12]:
# Now we select our target variable, i.e. the value we want to predict
y = melb_data.Price

# Now we select our features, i.e. the columns that will be used to make predictions
features = ['Rooms','Postcode','Landsize']

In [13]:
# Select the columns from the dataframe corresponding to the features we wish to use for prediction
X = melb_data[features]

# Preview the data selected
print(X.head())

   Rooms  Postcode  Landsize
0      2    3067.0     202.0
1      2    3067.0     156.0
2      3    3067.0     134.0
3      3    3067.0      94.0
4      4    3067.0     120.0


In [14]:
# Split the data into training data (used to fit the model) and
# validation data (held out to measure the model's prediction error)

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_Y = train_test_split(X, y, random_state=1)
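
When neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the rows for validation, and fixing `random_state` makes the shuffle reproducible. The sketch below illustrates this on made-up data; `toy_X`, `toy_y` and the other names are hypothetical and not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 100 rows; the values are invented for illustration only
toy_X = pd.DataFrame({"Rooms": np.arange(100) % 5 + 1})
toy_y = pd.Series(np.arange(100) * 1000.0)

# Default split: 75% training, 25% validation
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=1)
print(len(tr_X), len(va_X))  # 75 25

# The same random_state gives the same split on a second call
tr_X2, va_X2, tr_y2, va_y2 = train_test_split(toy_X, toy_y, random_state=1)
print(tr_X.index.equals(tr_X2.index))  # True
```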

In [16]:
# Define a random forest model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
random_forest_model = RandomForestRegressor(random_state=1)

# Fit the Model with the training data
random_forest_model.fit(train_X, train_y)

# Make predictions with the model
rf_model_predictions = random_forest_model.predict(val_X)

# Get the MAE (Mean Absolute Error)
random_forest_mae = mean_absolute_error(val_Y, rf_model_predictions)

print("Validation MAE for Random Forest Model: {:,.0f}".format(random_forest_mae))

Validation MAE for Random Forest Model: 214,174
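
To make the metric concrete: the mean absolute error is the average of the absolute differences between actual and predicted prices. The sketch below computes it by hand on three made-up prices and compares the result with `mean_absolute_error`. It also puts the reported MAE of 214,174 next to the mean price of roughly 1,075,684 from the `describe()` output; that mean covers the full dataset rather than the validation split, so the ratio is only a rough guide.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted prices, for illustration only
actual = np.array([500_000.0, 750_000.0, 1_200_000.0])
predicted = np.array([550_000.0, 700_000.0, 1_000_000.0])

# MAE by hand: average of the absolute errors (50,000, 50,000 and 200,000)
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)
print("{:,.0f}".format(manual_mae))   # 100,000
print("{:,.0f}".format(sklearn_mae))  # 100,000

# Reported validation MAE relative to the dataset's mean price (rough guide only)
relative_error = 214_174 / 1_075_684
print("{:.1%}".format(relative_error))  # 19.9%
```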
