# Introduction
In this notebook, I build a random forest model to predict house prices from the given data.<br>
I then try to lower the error with a simple feature selection method: I add features one at a time and check whether the mean absolute error (MAE) on the validation set went up or down.

# Importing Data and Necessary Libraries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.ensemble import RandomForestRegressor # The model to be used
from sklearn.metrics import mean_absolute_error # Error metric (lower is better)
from sklearn.model_selection import train_test_split # Split the data into training and validation sets

data_path = '/kaggle/input/home-data-for-ml-course/train.csv' 
home_data = pd.read_csv(data_path)

## Preprocessing
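Here we drop the Id column, one-hot encode the text columns, fill missing values with 0, and then separate the features (X) from the target (y). SalePrice is kept out of X because it is the value we want to predict.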

In [2]:
home_data = home_data.loc[:, home_data.columns != 'Id'] # Remove the Id column, which is useless for training
home_data = pd.get_dummies(home_data) # One-hot encode string columns into True/False columns
home_data = home_data.fillna(0) # Replace NaN values with 0

X = home_data.drop(columns=["SalePrice"])
y = home_data["SalePrice"]
original_features = home_data.loc[:, home_data.columns != 'SalePrice']


print(f"dataset shape (including SalePrice): {home_data.shape}")
print(f"number of features: {original_features.shape[1]}") # Count columns, not rows
print(home_data.head())

dataset shape (including SalePrice): (1460, 288)
number of features: 287
   MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
0          60         65.0     8450            7            5       2003   
1          20         80.0     9600            6            8       1976   
2          60         68.0    11250            7            5       2001   
3          70         60.0     9550            7            5       1915   
4          60         84.0    14260            8            5       2000   

   YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  ...  SaleType_ConLw  \
0          2003       196.0         706           0  ...           False   
1          1976         0.0         978           0  ...           False   
2          2002       162.0         486           0  ...           False   
3          1970         0.0         216           0  ...           False   
4          2000       350.0         655           0  ...           False   

   SaleType_New  SaleTyp
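
To make the preprocessing steps easier to follow, here is a minimal sketch on a tiny made-up table. The values and the `toy` names are assumptions for illustration and do not come from train.csv.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from train.csv)
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotFrontage": [65.0, None, 68.0],   # one missing value
    "Street": ["Pave", "Grvl", "Pave"],  # a string column
    "SalePrice": [200000, 150000, 250000],
})

toy = toy.drop(columns=["Id"])  # Id carries no information about price
toy = pd.get_dummies(toy)       # Street becomes Street_Grvl and Street_Pave
toy = toy.fillna(0)             # the missing LotFrontage becomes 0

toy_X = toy.drop(columns=["SalePrice"])  # features
toy_y = toy["SalePrice"]                 # target

print(list(toy_X.columns))
print(toy_X)
```

Note that filling with 0 treats a missing LotFrontage as a frontage of zero, which is a simplification.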

# Feature Selection
The most important part of this notebook is feature selection, where we discard any feature that doesn't improve the mean absolute error.<br>
## Step 1: Create the feature selection function 
We create the feature_selection function, which takes as input the list of feature names, the training features, the validation features, the training sale prices, and finally the validation sale prices.<br>
It initializes the ML model, fits it on the training data, makes predictions on the validation data, and computes the mean absolute error.<br>
It returns the **MAE**.


In [3]:
model = None
def feature_selection(features, feature_train, feature_val, output_train, output_val):
    global model
    model = RandomForestRegressor(random_state=1) # Current ML model
    model.fit(feature_train[features], output_train) # Train the model with the training data

    predictions = model.predict(feature_val[features]) # Make the predictions
    mae = mean_absolute_error(output_val, predictions) # Calculate the Mean absolute error

    return mae
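
As a small illustration of what the function returns, here is how the MAE works out on three made-up prices. The numbers are assumptions for the example only; the manual calculation is there to show that `mean_absolute_error` is the average of the absolute differences.

```python
from sklearn.metrics import mean_absolute_error

# Invented values for illustration only
actual = [200000, 150000, 250000]
predicted = [210000, 140000, 265000]

mae = mean_absolute_error(actual, predicted)

# Same calculation by hand: average of |actual - predicted|
manual = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(f"sklearn MAE: {mae:.2f}")
print(f"manual MAE:  {manual:.2f}")
```

The errors here are 10000, 10000, and 15000, so both lines print 11666.67. A lower MAE means the predictions are closer to the real prices on average.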

## Step 2: Start selection
Now that we have the function, we add features one by one and test whether each one improves our **MAE**.<br>To do that, I split the data into training and validation sets once, then run a loop that:
1. Adds a new feature to the selected list.
2. Passes the selected features and the split data to the function feature_selection.
3. Checks whether the **MAE** improved or not.
4. If so, keeps the feature; otherwise discards it.

In [4]:
previous_mae = float("inf") # Best MAE so far; the first result always beats infinity
selected_features = []
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1) # Create training data and validation data

for feature in original_features.columns:
    selected_features.append(feature) # Tentatively add the feature to measure the new MAE

    mae = feature_selection(selected_features, train_X, val_X, train_y, val_y)

    if mae <= previous_mae: # Keep the feature if the MAE did not get worse
        print(f"KEPT {feature}: MAE = {mae:.2f}")
        previous_mae = mae
    else:
        print(f"REMOVED {feature}")
        selected_features.remove(feature)

KEPT MSSubClass: MAE = 48346.44
KEPT LotFrontage: MAE = 47086.63
KEPT LotArea: MAE = 44642.28
KEPT OverallQual: MAE = 27946.33
KEPT OverallCond: MAE = 27063.86
KEPT YearBuilt: MAE = 25085.10
KEPT YearRemodAdd: MAE = 24779.62
KEPT MasVnrArea: MAE = 23533.29
KEPT BsmtFinSF1: MAE = 21743.72
KEPT BsmtFinSF2: MAE = 21663.38
KEPT BsmtUnfSF: MAE = 21081.83
KEPT TotalBsmtSF: MAE = 20258.85
KEPT 1stFlrSF: MAE = 19609.53
KEPT 2ndFlrSF: MAE = 18080.86
KEPT LowQualFinSF: MAE = 17861.74
KEPT GrLivArea: MAE = 16816.85
KEPT BsmtFullBath: MAE = 16813.47
REMOVED BsmtHalfBath
KEPT FullBath: MAE = 16757.18
REMOVED HalfBath
REMOVED BedroomAbvGr
REMOVED KitchenAbvGr
REMOVED TotRmsAbvGrd
KEPT Fireplaces: MAE = 16624.27
KEPT GarageYrBlt: MAE = 16208.60
KEPT GarageCars: MAE = 16093.88
REMOVED GarageArea
KEPT WoodDeckSF: MAE = 16064.51
REMOVED OpenPorchSF
REMOVED EnclosedPorch
REMOVED 3SsnPorch
REMOVED ScreenPorch
REMOVED PoolArea
REMOVED MiscVal
REMOVED MoSold
KEPT YrSold: MAE = 16054.22
REMOVED MSZoning_C (a

## Step 3: Test the model
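Finally, we use the model to predict sale prices for the houses in test.csv, after applying the same preprocessing and aligning its columns with the ones the model was trained on.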

In [5]:
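# The global model is the one fitted in the last loop iteration, which may include a feature that was then removed,
# so refit on exactly the selected features before predicting.
model = RandomForestRegressor(random_state=1)
model.fit(train_X[selected_features], train_y)
train_feature_names = model.feature_names_in_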
test_data = pd.read_csv('/kaggle/input/home-data-for-ml-course/test.csv')

test_data_processed = test_data.drop(columns=['Id'])
test_data_processed = pd.get_dummies(test_data_processed)
test_data_processed = test_data_processed.fillna(0)

test_X = test_data_processed.reindex(columns=train_feature_names, fill_value=0)

preds = model.predict(test_X)
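
The reindex call above is what keeps the test columns in line with the training columns. Here is a minimal sketch of its effect on a made-up frame; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Columns the model was trained on (hypothetical example)
train_columns = ["LotArea", "Street_Grvl", "Street_Pave"]

# One-hot encoded test data: Street_Grvl never appears, Alley_Grvl is extra
test_dummies = pd.DataFrame({
    "LotArea": [9000, 12000],
    "Street_Pave": [True, True],
    "Alley_Grvl": [False, True],
})

# Keep only the training columns, in the training order;
# columns missing from the test data are created and filled with 0
aligned = test_dummies.reindex(columns=train_columns, fill_value=0)

print(aligned)
```

Street_Grvl is added as a column of zeros and Alley_Grvl is dropped, so the model sees exactly the columns it was fitted on.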

## Step 4: Create the submission file

In [6]:
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': preds})
output.to_csv('submission.csv', index=False)

# My Thoughts
I improved on the MAE from the Kaggle tutorial, and I'm proud of the results. Still, I noticed that even though this works, I dropped many features that could be useful; maybe in combination with other features they would give a different prediction.<br>
For example, I noticed that the neighborhood columns, which could help the predictions, got dropped because the MAE went up when they were added, so there should be a better method. But that's the limit of my current knowledge.
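
One direction that might be worth trying is scikit-learn's `SequentialFeatureSelector`. In each round it tries every remaining feature and keeps the one with the best cross-validated score, instead of walking through the columns in a fixed order. The sketch below runs on synthetic data, not the housing data, and the settings (20 trees, 2 features, 3 folds) are arbitrary choices to keep it fast. I have not checked whether it improves the results in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic data for illustration only: 6 features, 2 of which carry signal
demo_X, demo_y = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=5.0, random_state=1
)

selector = SequentialFeatureSelector(
    RandomForestRegressor(n_estimators=20, random_state=1),
    n_features_to_select=2,            # arbitrary choice for the demo
    direction="forward",               # start empty and add features
    scoring="neg_mean_absolute_error", # sklearn maximizes scores, so MAE is negated
    cv=3,
)
selector.fit(demo_X, demo_y)

chosen = selector.get_support(indices=True)  # indices of the kept columns
print(f"selected column indices: {chosen}")
```

This is much slower than a single pass because every round refits the model once per candidate feature and fold.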