Iowa House Prices Prediction

In this notebook, I predict the sale prices of houses in Iowa using two machine learning models: 
DecisionTreeRegressor and RandomForestRegressor. 
The dataset consists of training data (`train.csv`) and test data (`test.csv`) from Iowa's housing market.

Steps Taken:

1. **Data Exploration**: 
    - Understanding the dataset's features and target variable.
    - Exploring the distribution and relationships between variables.


2. **Model Building and Evaluation**:
    - Trained DecisionTreeRegressor and RandomForestRegressor models.
    - Used Mean Absolute Error (MAE) as the evaluation metric to compare the models and different hyperparameter values.

3. **Prediction**:
    - Predicted the sale prices for the houses in the test dataset.
    - Prepared a result file (`result.csv`) for evaluation.

Conclusion:

By experimenting with different models and hyperparameters, 
I aim to find the model that most accurately predicts house sale prices in Iowa.


In [1]:
import pandas as pd #importing Pandas

In [2]:
file_path = '../data/train.csv'
iowa_df = pd.read_csv(file_path) #importing file

In [3]:
iowa_df.columns #viewing the columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [4]:
target = iowa_df.SalePrice #selecting what we need to predict

In [5]:
feature_list = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd'] #selecting the features used to make predictions

In [6]:
features = iowa_df[feature_list] #getting the columns of the features
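
Before modelling, it can help to look at summary statistics and at how each feature relates to the target. The sketch below is only an illustration: `toy_df` and its values are made up (they are not rows from `train.csv`), and it uses just three of the selected column names. On the real data, the same calls could be made on `iowa_df[feature_list + ['SalePrice']]`.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT taken from train.csv.
toy_df = pd.DataFrame({
    'LotArea': [8000, 9500, 11000, 7000],
    'YearBuilt': [2003, 1976, 2001, 1915],
    'FullBath': [2, 2, 2, 1],
    'SalePrice': [200000, 180000, 220000, 140000],
})

# Count, mean, standard deviation, min, quartiles and max for each column.
summary = toy_df.describe()

# Pearson correlation of each feature with the target.
corr_with_price = toy_df.corr()['SalePrice'].drop('SalePrice')

print(summary)
print(corr_with_price.sort_values(ascending=False))
```

A correlation close to 1 or -1 suggests a strong linear relationship with the sale price. Tree-based models can also pick up non-linear relationships, so a low correlation alone is not a reason to drop a feature.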

In [7]:
from sklearn.model_selection import train_test_split #train_test_split splits a dataset into two separate sets:
                                                     #one for training a model and the other for testing its performance.

In [8]:
train_features, test_features, train_target, test_target = train_test_split(features, target, random_state = 0)
    #train_features: Features used for training the model.
    #test_features: Features used for evaluating the model's performance.
    #train_target: Target values corresponding to the training features.
    #test_target: Target values corresponding to the testing features.
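
To make the behaviour of `train_test_split` concrete, here is a minimal sketch on synthetic arrays (`X` and `y` are stand-ins, not the Iowa data). It shows that, when `test_size` is not given, scikit-learn holds out 25% of the rows, and that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature table and the target column.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# With no test_size given, 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(len(X_train), len(X_test))

# The same random_state produces the same split again.
X_train_again, _, _, _ = train_test_split(X, y, random_state=0)
print(np.array_equal(X_train, X_train_again))
```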

In [9]:
from sklearn.metrics import mean_absolute_error #mean_absolute_error measures the average absolute difference between the predicted and actual values
from sklearn.tree import DecisionTreeRegressor #importing DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_features, test_features, train_target, test_target): #returns the MAE of a decision tree with the given max_leaf_nodes
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state = 0) # Creating a DecisionTreeRegressor object
    model.fit(train_features, train_target) # Training the model on the training data
    predicted = model.predict(test_features) # Predictions on the testing data
    mae = mean_absolute_error(test_target, predicted)# comparing test_target and predicted price
    return (mae)
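
As a small worked example of the metric itself, the sketch below computes MAE by hand and compares it with `mean_absolute_error`. The prices are made-up numbers chosen for easy arithmetic, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted prices, for illustration only.
actual = np.array([100_000, 200_000, 300_000])
predicted = np.array([110_000, 190_000, 330_000])

# MAE = mean of |actual - predicted| = (10000 + 10000 + 30000) / 3
manual_mae = np.mean(np.abs(actual - predicted))
library_mae = mean_absolute_error(actual, predicted)

print(manual_mae, library_mae)
```

Because MAE is expressed in the same unit as the target, a value of about 16,667 here means the predictions are off by roughly that many dollars on average.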

In [10]:
for max_leaf_nodes in [5, 50, 500, 5000]: # compare MAE with differing values of max_leaf_nodes
    my_mae = get_mae(max_leaf_nodes, train_features, test_features, train_target, test_target)
    print(f"Max leaf nodes:{max_leaf_nodes} \t\t Mean Absolute Error: {my_mae}")

Max leaf nodes:5 		 Mean Absolute Error: 35190.33670788684
Max leaf nodes:50 		 Mean Absolute Error: 27825.888386265695
Max leaf nodes:500 		 Mean Absolute Error: 32685.401335072846
Max leaf nodes:5000 		 Mean Absolute Error: 33404.21643835616
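
The best setting can also be picked programmatically instead of by reading the printed lines. This is a minimal sketch that assumes the scores are stored in a dictionary; `mae_by_leaf_nodes` is a name introduced here, and its values are the MAEs printed above, rounded to two decimals.

```python
# MAE values copied from the output above (rounded to two decimals).
mae_by_leaf_nodes = {5: 35190.34, 50: 27825.89, 500: 32685.40, 5000: 33404.22}

# Pick the key (tree size) whose MAE is smallest.
best_tree_size = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print(best_tree_size)  # prints 50
```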


From the output above, 50 gives the lowest MAE of the four values tried, so it is the best of these settings for max_leaf_nodes. Next, we try a random forest. 
A random forest builds many trees and makes a prediction by averaging the predictions of its component trees. 
It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters.
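
The averaging can be checked directly. The sketch below fits a small forest on synthetic data (the arrays and the choice of `n_estimators=20` are assumptions made for illustration) and compares the mean of the individual trees' predictions, taken from the fitted `estimators_` list, with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X, y)

# Predict the first five rows with every tree, then average across trees.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's prediction matches the average of its trees.
print(np.allclose(averaged, forest.predict(X[:5])))
```

Each tree is trained on a different bootstrap sample of the rows, so the individual trees make different errors, and averaging tends to cancel some of them out.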

In [11]:
from sklearn.ensemble import RandomForestRegressor


forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_features, train_target)
forest_preds = forest_model.predict(test_features) # predictions on the held-out split
print(mean_absolute_error(test_target, forest_preds))

23009.206570906717


In [12]:
# To improve accuracy, create a new Random Forest model and train it on all of the training data (no held-out split)
rf_model_on_full_data = RandomForestRegressor(random_state = 1)

# fit rf_model_on_full_data on every row of the training data
rf_model_on_full_data.fit(features, target)

Now we apply the model to new data that it has not seen before.

In [13]:
test_data_path = '../data/test.csv'
test_data = pd.read_csv(test_data_path)

In [14]:
test_X = test_data[feature_list]

In [15]:
test_predictions = rf_model_on_full_data.predict(test_X)

In [16]:
output = pd.DataFrame({'Id': test_data.Id, 'SalePrice': test_predictions})
output

        Id  SalePrice
0     1461  122656.58
1     1462  156789.00
2     1463  182959.00
3     1464  178102.00
4     1465  189049.48
...    ...        ...
1454  2915   83645.00
1455  2916   86785.00
1456  2917  151283.01
1457  2918  127878.00


In [17]:
output.to_csv('result.csv',index = False)
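
A quick sanity check on the written file is to read it back and confirm its columns. The sketch below is self-contained: `sample_output` is a two-row stand-in for `output` (using the first two rows shown above), and it writes to a temporary directory rather than to the notebook's working directory.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the `output` frame, using the first two rows shown above.
sample_output = pd.DataFrame({'Id': [1461, 1462],
                              'SalePrice': [122656.58, 156789.00]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'result.csv')
    sample_output.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded['SalePrice'].isna().any())
```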