## Introduction

In this project, I developed a machine learning model to predict house prices for the Housing Prices Competition for Kaggle Learn Users. The goal is to accurately estimate housing prices based on various features, such as lot size, number of rooms, and overall condition.

To achieve this, I used a Random Forest Regressor, which can capture non-linear relationships and interactions between features without manual feature engineering. The model was trained on the competition's training data with default hyperparameters and evaluated on a held-out validation split using mean absolute error (MAE).

This baseline provides a solid foundation for price prediction and can be further refined with feature selection, hyperparameter tuning, and other ensemble methods.

In [10]:
# Import helpful libraries
import os
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Set up filepaths
train_data_path = os.path.join("data", "train.csv")
test_data_path = os.path.join("data", "test.csv")

# Read CSV files directly
train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

In [11]:
# Reuse the training data loaded above, and separate the target
home_data = train_data
y = home_data.SalePrice

features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Select the columns corresponding to features
# (a bare X.head() in the middle of a cell displays nothing, so it is omitted)
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

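# Define a random forest model and fit it on the training split
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)

# Evaluate on the validation split (scikit-learn expects y_true first, then y_pred)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)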

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))

Validation MAE for Random Forest Model: 21,857
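
A single train/validation split gives one MAE estimate, and that estimate depends on which rows happen to land in the validation set. As a rough sketch of a more stable check, the snippet below averages MAE over several folds with scikit-learn's `cross_val_score`. The helper name `cross_validated_mae` is hypothetical, and the data is synthetic (generated with `make_regression`) so that the snippet runs without `data/train.csv`. In the notebook, `X` and `y` from the cell above would be passed instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cross_validated_mae(X, y, n_splits=5, random_state=1):
    """Return the MAE of a default random forest on each fold (hypothetical helper)."""
    model = RandomForestRegressor(random_state=random_state)
    # scikit-learn scorers follow "higher is better", so MAE is reported negated
    neg_mae = cross_val_score(
        model, X, y, cv=n_splits, scoring="neg_mean_absolute_error"
    )
    return -neg_mae


# Synthetic stand-in data (an assumption, not the competition data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=7, noise=10.0, random_state=1
)
fold_mae = cross_validated_mae(X_demo, y_demo)
print("MAE per fold:", np.round(fold_mae, 1))
print("Mean MAE: {:,.1f}".format(fold_mae.mean()))
```

The spread across folds gives a sense of how much the single-split figure above could move with a different `random_state` in `train_test_split`.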


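## Train a model for the competition
The code cell above trains a Random Forest model on train_X and train_y only, holding out the validation rows to measure the error.
For the competition submission, I build a new Random Forest model and train it on all of X and y, so that no labeled data goes unused.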

In [12]:
# To improve accuracy, create a new Random Forest model to train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=1)

# Fit rf_model_on_full_data on the full training data
rf_model_on_full_data.fit(X, y)

In [13]:
# test_data was already read from data/test.csv in the first cell

# Create test_X which comes from test_data but includes only the columns used for prediction
# The list of columns is stored in a variable called features
test_X = test_data[features]

# Make predictions which we will submit
test_preds = rf_model_on_full_data.predict(test_X)

In [14]:
# Save predictions in the format used for competition scoring
output = pd.DataFrame({'Id': test_data.Id, 'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
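
Before uploading, it can be worth sanity-checking the submission frame. The sketch below is one possible check, not part of the competition tooling. The helper `check_submission` is hypothetical, and it assumes the two-column `Id`/`SalePrice` layout built in the cell above. The rule that prices must be positive is my own assumption. The demo frame is a tiny illustrative example. In the notebook, the call would be `check_submission(output, test_data.Id)`.

```python
import pandas as pd


def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        # Without the expected columns, the remaining checks cannot run
        return ["columns should be exactly ['Id', 'SalePrice']"]
    expected = pd.Series(list(expected_ids))
    if len(submission) != len(expected):
        problems.append("row count does not match the test set")
    elif not (submission["Id"].to_numpy() == expected.to_numpy()).all():
        problems.append("Id values differ from the test set")
    if submission["SalePrice"].isna().any():
        problems.append("SalePrice contains missing values")
    # Assumption: a sale price should always be positive
    if (submission["SalePrice"] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems


# Illustrative frames only; the prices are made up for the demo
bad_demo = pd.DataFrame(
    {"Id": [1461, 1462, 1463], "SalePrice": [120000.0, 155000.0, None]}
)
good_demo = pd.DataFrame(
    {"Id": [1461, 1462, 1463], "SalePrice": [120000.0, 155000.0, 98000.0]}
)
print(check_submission(bad_demo, [1461, 1462, 1463]))
print(check_submission(good_demo, [1461, 1462, 1463]))
```

An empty list means none of these particular checks failed. It does not guarantee that the file will score well.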