# End-to-end machine learning project [HOUSING PRICE PREDICTION]

The goal of this project is to predict the sale price of a house from the attributes described below.

1. **Suburb**: Suburb
2. **Address**: Address
3. **Rooms**: Number of rooms
4. **Price**: Price in Australian dollars
5. **Method**: How the property was sold:
    - `S`: property sold
    - `SP`: property sold prior
    - `PI`: property passed in
    - `PN`: sold prior not disclosed
    - `SN`: sold not disclosed
    - `NB`: no bid
    - `VB`: vendor bid
    - `W`: withdrawn prior to auction
    - `SA`: sold after auction
    - `SS`: sold after auction price not disclosed
    - `N/A`: price or highest bid not available
6. **Type**: Property type:
    - `br`: bedroom(s)
    - `h`: house, cottage, villa, semi, terrace
    - `u`: unit, duplex
    - `t`: townhouse
    - `dev site`: development site
    - `o res`: other residential
7. **SellerG**: Real Estate Agent
8. **Date**: Date sold
9. **Distance**: Distance from CBD in Kilometres
10. **Regionname**: General Region (West, North West, North, North east …etc)
11. **Propertycount**: Number of properties that exist in the suburb.
12. **Bedroom2**: Number of bedrooms (scraped from a different source)
13. **Bathroom**: Number of Bathrooms
14. **Car**: Number of car spaces
15. **Landsize**: Land size in square metres
16. **BuildingArea**: Building size in square metres
17. **YearBuilt**: Year the house was built
18. **CouncilArea**: Governing council for the area
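19. **Lattitude**: Latitude of the property (the column name is misspelled in the dataset)
20. **Longtitude**: Longitude of the property (the column name is misspelled in the dataset)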

### Libraries:

In [1]:
import pandas as pd
import numpy as np
import datetime as dt

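# Scikit-Learn ≥0.22 is required (the 'neg_root_mean_squared_error' scorer is used below)
import sklearn
# Compare version numbers, not strings: as strings, "0.9" >= "0.20" would be True
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 22)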

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

## Load the dataset:

In [2]:
df = pd.read_csv('dataset/housing-snapshot/train_set.csv',index_col=0) 

In [3]:
df.head()

| index | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aberfeldie | 241 Buckley St | 4 | h | 1380000.0 | VB | Nelson | 12/08/2017 | 7.5 | 3040.0 | ... | 2.0 | 2.0 | 766.0 | NaN | NaN | Moonee Valley | -37.75595 | 144.90551 | Western Metropolitan | 1543.0 |
| 1 | Northcote | 67 Charles St | 2 | h | 1100000.0 | SP | Jellis | 20/05/2017 | 5.5 | 3070.0 | ... | 1.0 | 1.0 | 189.0 | NaN | NaN | Darebin | -37.7767 | 144.9924 | Northern Metropolitan | 11364.0 |
| 2 | Balwyn North | 42 Maud St | 3 | h | 1480000.0 | PI | Jellis | 15/10/2016 | 9.2 | 3104.0 | ... | 1.0 | 4.0 | 605.0 | 116.0 | 1950.0 | Boroondara | -37.7951 | 145.0696 | Southern Metropolitan | 7809.0 |
| 3 | Brunswick | 13 Percy St | 3 | h | 1055000.0 | S | Nelson | 7/05/2016 | 5.2 | 3056.0 | ... | 1.0 | 1.0 | 324.0 | NaN | 1930.0 | Moreland | -37.7653 | 144.9586 | Northern Metropolitan | 11918.0 |
| 4 | Templestowe Lower | 253 Thompsons Rd | 4 | h | 1000000.0 | VB | hockingstuart | 13/08/2016 | 13.8 | 3107.0 | ... | 3.0 | 2.0 | 728.0 | 164.0 | 1970.0 | Manningham | -37.768 | 145.1027 | Eastern Metropolitan | 5420.0 |


1. **Categorical features** (`categorical_estimators` in the code): Suburb, Address, Type, Method, SellerG, CouncilArea, Postcode and Regionname.
2. **Numerical features** (`numerical_estimators` in the code): Rooms, Date, Distance, Bedroom2, Bathroom, Car, Landsize, BuildingArea, YearBuilt, Lattitude, Longtitude and Propertycount.
3. **Target value:** Price

In [4]:
# Postcode is a label, not a quantity, so treat it as categorical
df['Postcode'] = pd.Categorical(df.Postcode)

# Convert the sale date to a number (day ordinal) so the model can use it.
# Dates are day-first (e.g. 20/05/2017), so parse them with dayfirst=True.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Date'] = df['Date'].map(dt.datetime.toordinal)
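
The dates in the sample rows (for example `20/05/2017` and `13/08/2016`) can only be read day-first, so an ambiguous value such as `12/08/2017` should be read the same way. The sketch below uses only the standard library to illustrate what goes wrong otherwise; the variable names are illustrative and not part of the notebook.

```python
from datetime import datetime

raw = "12/08/2017"  # first row of the training set

# Day-first reading: 12 August 2017 (consistent with the other rows)
day_first = datetime.strptime(raw, "%d/%m/%Y")

# Month-first reading: 8 December 2017 (a silent misparse for this dataset)
month_first = datetime.strptime(raw, "%m/%d/%Y")

# toordinal() counts days since 0001-01-01, which is what the Date column stores
gap_days = month_first.toordinal() - day_first.toordinal()
print(day_first.date(), month_first.date(), gap_days)
```

A misparse of this kind shifts the numeric date feature by almost four months for the affected rows, while unambiguous rows stay correct, so the error is easy to miss.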

## Exploratory data analysis:

#### Categorical features:

In [5]:
categorical_estimators = df[['Suburb', 'Address', 'Type', 'Method', 'SellerG', 'CouncilArea', 'Postcode', 'Regionname']]

In [6]:
# Share of null values
(categorical_estimators.isnull().sum()/len(df))*100

Suburb          0.000000
Address         0.000000
Type            0.000000
Method          0.000000
SellerG         0.000000
CouncilArea    10.180412
Postcode        0.000000
Regionname      0.000000
dtype: float64

In [7]:
# List of unique values for categorical variables

Suburb = df.Suburb.unique()
Address = df.Address.unique()
Type = df.Type.unique()
Method = df.Method.unique()
SellerG = df.SellerG.unique()
CouncilArea = df.CouncilArea.unique()
Postcode = df.Postcode.unique()
Regionname = df.Regionname.unique()

#### Numerical features:

In [8]:
numerical_estimators = df[['Rooms', 'Date', 'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude', 'Propertycount']]

In [9]:
# Share of null values
(numerical_estimators.isnull().sum()/len(df))*100

Rooms             0.000000
Date              0.000000
Distance          0.000000
Bedroom2          0.000000
Bathroom          0.000000
Car               0.460236
Landsize          0.000000
BuildingArea     46.796760
YearBuilt        39.212077
Lattitude         0.000000
Longtitude        0.000000
Propertycount     0.000000
dtype: float64

## Prepare the data for Machine Learning algorithms

In [10]:


# Column positions within the numerical columns (same order as numerical_estimators)
Rooms_ix, Bedroom2_ix, Bathroom_ix, BuildingArea_ix = 0, 3, 4, 7

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Append ratio features to the numerical array."""
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # add 1 to the denominators to avoid division by zero
        rooms_per_building_area = X[:, Rooms_ix] / (1.0 + X[:, BuildingArea_ix])
        if self.add_bedrooms_per_room:
            # despite the flag name, this ratio is bedrooms per bathroom
            bedrooms_per_bathroom = X[:, Bedroom2_ix] / (1.0 + X[:, Bathroom_ix])
            return np.c_[X, rooms_per_building_area, bedrooms_per_bathroom]
        else:
            return np.c_[X, rooms_per_building_area]

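# Replace 0 by NaN so that zeros are imputed as missing values
def replace_0_2_NaN(data):
    data = np.array(data, dtype=float)  # work on a copy; leave the caller's DataFrame untouched
    data[data == 0] = np.nan
    return data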
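
The column positions used by `CombinedAttributesAdder` must match the order of the columns passed to the numerical pipeline. A minimal sketch of deriving them by name instead of hard-coding them, assuming the column order of `numerical_estimators` (the list `num_cols` is an illustrative copy, not a variable from the notebook):

```python
# Column order of the numerical features, copied from numerical_estimators
num_cols = ['Rooms', 'Date', 'Distance', 'Bedroom2', 'Bathroom', 'Car',
            'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude',
            'Longtitude', 'Propertycount']

# Look each position up by name so it cannot drift from the column list
Rooms_ix, Bedroom2_ix, Bathroom_ix, BuildingArea_ix = (
    num_cols.index(name)
    for name in ('Rooms', 'Bedroom2', 'Bathroom', 'BuildingArea')
)
print(Rooms_ix, Bedroom2_ix, Bathroom_ix, BuildingArea_ix)  # 0 3 4 7
```

Looking positions up by name keeps the transformer correct if columns are later added to or removed from the list.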

In [11]:
# Preprocessing pipelines (could be improved)

num0_pipeline = Pipeline([
        ('zeros2NaN',FunctionTransformer(func = replace_0_2_NaN,validate=False)),
        ('imputer', SimpleImputer(strategy="median")),
        ('log',FunctionTransformer(np.log1p, validate=True)),
        ('std_scaler', StandardScaler()),
    ])

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="constant",fill_value='Unknown')),
        ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore')),
    ])
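
To see what the first steps of `num0_pipeline` do, here is a small sketch on made-up land sizes. It assumes that a zero means "not recorded"; the helper `zeros_to_nan` and the toy values are illustrative, and the scaling step is left out so the numbers stay easy to read.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def zeros_to_nan(data):
    """Return a float copy of `data` with zeros replaced by NaN."""
    out = np.array(data, dtype=float)  # copy, so the input is not modified
    out[out == 0] = np.nan
    return out

toy_pipeline = Pipeline([
    ("zeros2nan", FunctionTransformer(zeros_to_nan, validate=False)),
    ("imputer", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p, validate=True)),
])

# Hypothetical land sizes; the 0 is treated as a missing value
toy = np.array([[100.0], [0.0], [300.0], [500.0]])
result = toy_pipeline.fit_transform(toy)

# The zero is replaced by the median of the remaining values (300) before the log
print(result.ravel())
```

Without the first step, the zero would pull the median down and would be kept as a real measurement.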

In [12]:
# All attributes
num_attribs0 = ['Landsize','BuildingArea']
num_attribs1 = list(numerical_estimators)
cat_attribs = ["CouncilArea",'Type','Suburb','Postcode','SellerG']


full_pipeline = ColumnTransformer([
        ("num0", num0_pipeline, num_attribs0),
        ("num1", num_pipeline, num_attribs1),
        ("cat", cat_pipeline, cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(df)
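
Because `cat_pipeline` uses `OneHotEncoder(handle_unknown='ignore')`, a category that was not seen during fitting is encoded as a row of zeros instead of raising an error. A minimal sketch with made-up property types (the arrays are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Property types seen during training (toy data)
train_types = np.array([["h"], ["u"], ["t"]], dtype=object)
# The second value never appears in the training data
test_types = np.array([["u"], ["dev site"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_types)
encoded = encoder.transform(test_types).toarray()

print(encoder.categories_[0])  # learned categories, sorted: h, t, u
print(encoded)                 # the unseen 'dev site' row is all zeros
```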

In [13]:
# Target (label) vector
housing_label = df['Price']

## Model Selection

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    # try 48 (3×4×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8], 'max_depth':[3,5,7,10]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]


forest_reg = RandomForestRegressor(random_state=42)

# train across 5 folds, that's a total of (48+6)*5=270 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_root_mean_squared_error',
                           return_train_score=True)

grid_search.fit(housing_prepared, housing_label)

GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
             param_grid=[{'max_depth': [3, 5, 7, 10],
                          'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_root_mean_squared_error')
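
The number of candidates in a grid is the product of the list lengths in each dictionary, summed over the dictionaries. A quick standard-library check of the grid used above (variable names other than `param_grid` are illustrative):

```python
from math import prod

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8], 'max_depth': [3, 5, 7, 10]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Candidates per sub-grid: product of the number of values of each hyperparameter
combos_per_grid = [prod(len(values) for values in grid.values()) for grid in param_grid]
n_candidates = sum(combos_per_grid)

cv_folds = 5
n_fits = n_candidates * cv_folds  # excludes the final refit of the best candidate

print(combos_per_grid, n_candidates, n_fits)  # [48, 6] 54 270
```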

In [15]:
print("the best parameters are:")
print(grid_search.best_params_)

the best parameters are:
{'bootstrap': False, 'max_features': 2, 'n_estimators': 10}


In [16]:
print("the best trained model:")
grid_search.best_estimator_

the best trained model:


RandomForestRegressor(bootstrap=False, max_features=2, n_estimators=10,
                      random_state=42)
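
With `scoring='neg_root_mean_squared_error'` the search maximises the negative RMSE, so the scores need a sign flip before they can be read as errors. The sketch below shows this on synthetic data; the data, the small grid and the variable names are assumptions for illustration and say nothing about the housing results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=5, random_state=42),
    param_grid={"max_features": [2, 4]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# Flip the sign to get the RMSE in the units of the target
best_rmse = -search.best_score_
rmse_per_candidate = -search.cv_results_["mean_test_score"]

for params, rmse in zip(search.cv_results_["params"], rmse_per_candidate):
    print(params, round(float(rmse), 4))
```

Applied to the notebook's `grid_search`, the same two lines would give the cross-validated RMSE in Australian dollars.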

## Final Model
Create the final model and evaluate it (you should do this only once).

In [17]:
# Load test dataset
X_test = pd.read_csv('dataset/housing-snapshot/test_set.csv',index_col=0)

In [18]:
# Postcode is a label, not a quantity, so treat it as categorical
X_test['Postcode'] = pd.Categorical(X_test.Postcode)

# Convert the sale date to a number, parsing day-first as for the training set
X_test['Date'] = pd.to_datetime(X_test['Date'], dayfirst=True)
X_test['Date'] = X_test['Date'].map(dt.datetime.toordinal)

# Mark categories never seen in training as missing, one column at a time.
# (Assigning NaN to whole rows would wipe every feature of the row, and almost
# every test Address is unseen.)
known_values = {'Type': Type, 'Address': Address, 'Suburb': Suburb, 'Method': Method,
                'SellerG': SellerG, 'Postcode': Postcode, 'Regionname': Regionname,
                'CouncilArea': CouncilArea}
for col, known in known_values.items():
    X_test.loc[~X_test[col].isin(known), col] = np.nan

In [19]:
grid_search = GridSearchCV(RandomForestRegressor(random_state=42, n_estimators=4, max_depth=10),
                  param_grid={'max_features': range(2, 50, 2)},
                  scoring='neg_root_mean_squared_error', return_train_score=True)

grid_search.fit(housing_prepared, housing_label)

final_model = grid_search.best_estimator_

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

# Build the submission table, keeping the test set's own index
df_output = pd.DataFrame({'index': X_test.index, 'Price': final_predictions})

# Write the submission file
df_output.to_csv('baseline.csv',index=False)