MELBOURNE HOUSE PRICE PREDICTION WITH SCIKIT-LEARN

In [1]:
import pandas as pd
# Load the Melbourne housing data (the name data_url avoids shadowing the built-in dir()).
data_url = 'https://raw.githubusercontent.com/AleGL92/Scikit-Learn/main/melb_data.csv'
mel_data = pd.read_csv(data_url)
mel_data.describe()

,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [11]:
mel_data.columns
# Some of these columns may not be useful for the model, so we may leave them out.
# There are also rows with missing values. The following command drops them (axis = 0 means rows, axis = 1 means columns).
m_data = mel_data.dropna(axis = 0)
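
To make the effect of dropna(axis = 0) concrete, here is a minimal sketch on a small, made-up table. The toy values and the names toy, missing_per_column and complete_rows are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, 210.0],
    "YearBuilt": [1900.0, 1900.0, np.nan, 1910.0],
})

# Count the missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=0 drops every row that contains at least one missing value.
complete_rows = toy.dropna(axis=0)
print(f"Rows before: {len(toy)}, rows after: {len(complete_rows)}")
```

In the notebook itself this step is costly: the describe() outputs show 13580 rows before dropping and 6196 rows afterwards.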

In [12]:
# Now we start preparing the model. y is the prediction target (the labels); X holds the features the model learns from.
y = m_data.Price        # equivalent to m_data['Price']
m_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = m_data[m_features]
X.describe()
# X.head()

,Rooms,Bathroom,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,141.568645,1964.081988,-37.807904,144.990201
std,0.971079,0.711362,897.449881,90.834824,38.105673,0.07585,0.099165
min,1.0,1.0,0.0,0.0,1196.0,-38.16492,144.54237
25%,2.0,1.0,152.0,91.0,1940.0,-37.855438,144.926198
50%,3.0,1.0,373.0,124.0,1970.0,-37.80225,144.9958
75%,4.0,2.0,628.0,170.0,2000.0,-37.7582,145.0527
max,8.0,8.0,37000.0,3112.0,2018.0,-37.45709,145.52635


Building the model

In [13]:
from sklearn.tree import DecisionTreeRegressor
m_model = DecisionTreeRegressor(random_state = 1)
# Training a decision tree involves some randomness, so the tree structure can vary each time the model is built.
# Passing a fixed seed to random_state ensures the same result is generated each time we build the model.
# A DecisionTreeRegressor can be pictured as a decision tree made of branches (splits) and leaves (terminal nodes).
m_model.fit(X,y)

print('Predictions for the following houses: \n', X.head())
print('The predictions are: \n', m_model.predict(X.head()))

Predictions for the following houses: 
    Rooms  Bathroom  Landsize  BuildingArea  YearBuilt  Lattitude  Longtitude
1      2       1.0     156.0          79.0     1900.0   -37.8079    144.9934
2      3       2.0     134.0         150.0     1900.0   -37.8093    144.9944
4      4       1.0     120.0         142.0     2014.0   -37.8072    144.9941
6      3       2.0     245.0         210.0     1910.0   -37.8024    144.9993
7      2       1.0     256.0         107.0     1890.0   -37.8060    144.9954
The predictions are: 
 [1035000. 1465000. 1600000. 1876000. 1636000.]
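
The role of random_state can be checked with a small sketch: two trees built with the same seed on the same data should give identical predictions. The synthetic data and the names X_toy, y_toy, first and second are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the housing features (an assumption).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = X_toy[:, 0] * 100 + rng.normal(0, 5, size=200)

# Two models built with the same seed on the same data.
first = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)
second = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)

# With a fixed seed, both trees predict exactly the same values.
same_predictions = np.array_equal(first.predict(X_toy), second.predict(X_toy))
print(same_predictions)
```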


In [14]:
# To judge how good the predictions are, we can use the Mean Absolute Error (MAE), passing it the real values and the predictions.
from sklearn.metrics import mean_absolute_error
m_preds = m_model.predict(X)
mean_absolute_error(y, m_preds)

434.71594577146544
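
The Mean Absolute Error is the average of the absolute differences between real and predicted prices. The sketch below computes it by hand and compares the result with mean_absolute_error; the three price pairs are hypothetical values chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical real and predicted prices (not taken from the dataset).
actual = np.array([1035000.0, 1465000.0, 1600000.0])
predicted = np.array([1000000.0, 1500000.0, 1600000.0])

# MAE = mean(|actual - predicted|)
absolute_errors = np.abs(actual - predicted)
manual_mae = absolute_errors.mean()

library_mae = mean_absolute_error(actual, predicted)
print(manual_mae, library_mae)
```

Because the errors are averaged in the original units, the MAE reads directly as "the prediction is off by this many dollars on average".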

In [15]:
# The results seem good, but we've been predicting on in-sample houses, the same ones used for training. We shouldn't do this, because
# a low in-sample error says little about how the model performs on samples it has never seen before.
# Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model.
# The most straightforward way to do this is to exclude some data from the model-building process, and then use it to test the model's
# accuracy on data it hasn't seen before. This data is called validation data.
# In the next lines, we'll split the data into training and validation sets.

In [16]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
# val_X.head()
# val_y.head()
m_model = DecisionTreeRegressor(random_state = 1)
m_model.fit(train_X, train_y)

val_preds = m_model.predict(val_X)
print(mean_absolute_error(val_y, val_preds))
# Now the results are quite different: the validation MAE (about 251,877) is far worse than the in-sample MAE (about 435).

251876.65138799226
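
As a rough illustration of what the split does, the sketch below runs train_test_split on a toy array. When neither test_size nor train_size is given, scikit-learn holds out 25% of the rows for validation. The toy arrays and the toy_ prefixed names are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with 2 features each; the targets are just the row ids.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Default split: 75% of rows for training, 25% for validation.
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    X_toy, y_toy, random_state=1
)
print(len(toy_train_X), len(toy_val_X))

# No row appears in both sets, so validation rows are unseen during training.
shared_rows = set(toy_train_y) & set(toy_val_y)
print(len(shared_rows))
```

A different proportion can be requested explicitly, for example test_size=0.2.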


In [17]:
# In addition to using validation data, we should also consider overfitting and underfitting:

# Overfitting is when a model matches the training data almost perfectly, but does poorly on validation and
# other new data. In this case, it would mean we made our tree very deep, with many branches, dividing the houses into
# very distinct groups with only a few houses in each leaf, so the model fits the training examples too closely.

# In contrast, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be
# far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to
# capture important distinctions and patterns in the data, so that it performs poorly even on training data, that is called underfitting.
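
The contrast between the two failure modes can be sketched on synthetic data by comparing training and validation MAE for a very small tree and for an unconstrained one. The data-generating formula and all names below are assumptions for illustration, not the notebook's housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic target: a linear signal plus noise (an assumption for this demo).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 2))
y_toy = 50 * X_toy[:, 0] + 20 * X_toy[:, 1] + rng.normal(0, 30, size=500)

tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=1)

results = {}
# max_leaf_nodes=2 forces a tiny tree; None lets the tree grow without limit.
for label, max_leaves in [("underfit", 2), ("unconstrained", None)]:
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves, random_state=1)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[label] = (train_mae, val_mae)
    print(f"{label}: train MAE = {train_mae:.1f}, validation MAE = {val_mae:.1f}")
```

The unconstrained tree should reproduce the training targets almost exactly while doing noticeably worse on validation data, which is the signature of overfitting. The two-leaf tree has a large error even on the training set, which is the signature of underfitting.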

In [18]:
# To avoid overfitting or underfitting, we should control the number of leaf nodes. Here we define a function that returns the validation MAE
# for a given maximum number of leaf nodes. Then we loop over a list of candidate values to see which one performs best.
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_n, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes = max_n, random_state = 1)
    model.fit(train_X, train_y)

    m_preds = model.predict(val_X)
    mae = mean_absolute_error(val_y, m_preds)
    return mae

for max_nodes in [5, 10, 50, 100, 500, 1000, 5000, 10000]:
    my_mae = get_mae(max_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_nodes, my_mae))

# We got the lowest MAE with 100 leaf nodes (238658). Beyond that, the model starts overfitting and the results get worse.

Max leaf nodes: 5  		 Mean Absolute Error:  324110
Max leaf nodes: 10  		 Mean Absolute Error:  301550
Max leaf nodes: 50  		 Mean Absolute Error:  248796
Max leaf nodes: 100  		 Mean Absolute Error:  238658
Max leaf nodes: 500  		 Mean Absolute Error:  239296
Max leaf nodes: 1000  		 Mean Absolute Error:  240794
Max leaf nodes: 5000  		 Mean Absolute Error:  249905
Max leaf nodes: 10000  		 Mean Absolute Error:  249905
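
Rather than reading the best value off the printed table, it can be picked programmatically. The sketch below stores the scores printed above in a dictionary and takes the key with the smallest MAE; the names mae_by_leaf_nodes and best_size are hypothetical.

```python
# Validation MAE per max_leaf_nodes value, copied from the output above.
mae_by_leaf_nodes = {
    5: 324110,
    10: 301550,
    50: 248796,
    100: 238658,
    500: 239296,
    1000: 240794,
    5000: 249905,
    10000: 249905,
}

# min() with key=... returns the dictionary key whose MAE is lowest.
best_size = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print(best_size, mae_by_leaf_nodes[best_size])
```

Inside the notebook, the same dictionary could be built directly with get_mae, for example {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidates}, where candidates is the list of sizes to try.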


In [21]:
# The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much 
# better predictive accuracy than a single decision tree. So this time, we'll be predicting with a RandomForestRegressor instead of
# predicting with DecisionTreeRegressor.
from sklearn.ensemble import RandomForestRegressor
f_model = RandomForestRegressor(random_state = 1)
f_model.fit(train_X, train_y)
f_preds = f_model.predict(val_X)
mae = mean_absolute_error(val_y, f_preds)
print(f'MAE with Random Forest Regressor: {mae}')

# We got an even better result this time (173864).

MAE with Random Forest Regressor: 173864.25945341078
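
The averaging described above can be checked directly: a fitted RandomForestRegressor exposes its individual trees through the estimators_ attribute, and the mean of their predictions should match the forest's own prediction. This is a minimal sketch on synthetic data; the data, n_estimators=20 and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the housing features (an assumption).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 3))
y_toy = 40 * X_toy[:, 0] + rng.normal(0, 10, size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X_toy, y_toy)

new_houses = X_toy[:5]

# One row of predictions per tree, then average across the trees.
per_tree = np.array([tree.predict(new_houses) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_prediction = forest.predict(new_houses)
print(np.allclose(manual_average, forest_prediction))
```

Each tree is trained on a different bootstrap sample of the rows, so the individual predictions differ, and averaging them smooths out the errors of any single tree.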
