# Basic Data Exploration

In [ ]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

In [ ]:
# save filepath to variable for easier access
melbourne_file_path = 'C:\\Users\\matthew.yim\\PycharmProjects\\Kaggle\\melbourne-housing-snapshot\\melb_data.csv'

# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)

In [ ]:
# print a summary of the data in Melbourne data
print(melbourne_data.describe())

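# YearBuilt.max() is the construction year of the newest home, not its age
newest_home_year = melbourne_data['YearBuilt'].max()
avg_lot_size = melbourne_data['Landsize'].mean()
print("Average lot size:", avg_lot_size)
print("Newest home built in:", newest_home_year)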
print(melbourne_data.columns)

In [ ]:
# dropna(axis=0) drops every row that contains a missing value (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
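# A notebook cell only displays its last expression, so print both summaries explicitly
print(X.describe())
print(X.head())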

In [ ]:
# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

In [ ]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

# Model Validation

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Load data
melbourne_file_path = 'C:\\Users\\matthew.yim\\PycharmProjects\\Kaggle\\melbourne-housing-snapshot\\melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Drop rows that have a missing value in any column
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)

In [3]:
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

434.71594577146544
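
For reference, the number above is a mean absolute error: the average of |actual - predicted| over all houses. Below is a minimal pure-Python sketch of that calculation. The function name `mae_by_hand` and the three example prices are made up for illustration; the notebook itself relies on scikit-learn's `mean_absolute_error`.

```python
def mae_by_hand(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Hypothetical prices, not taken from the Melbourne data
actual_prices = [500_000, 750_000, 1_000_000]
predicted_prices = [520_000, 710_000, 1_000_000]

# Absolute errors are 20,000, 40,000 and 0, so the mean is 20,000
example_mae = mae_by_hand(actual_prices, predicted_prices)
print(example_mae)
```

Because the errors are averaged in the target's own units, the result reads directly as "how many dollars off, on average".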

In [ ]:
""" The Problem with "In-Sample" Scores """ 
""" 
    The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building
    the model and evaluating it. Here's why this is bad.
    
    Imagine that, in the large real estate market, door color is unrelated to home price.
    
    However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's
    job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for 
    homes with green doors.
    
    Since this pattern was derived from the training data, the model will appear accurate on the training data.
    
    But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.
    
    Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used
    to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and
    then use that data to test the model's accuracy on data it hasn't seen before. This data is called {validation data}.
"""

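The green-door story can be played out on a toy scale. The sketch below is only an illustration under invented assumptions: the eight (door color, price) pairs are made up, and the "model" simply memorizes the mean price per door color rather than fitting a decision tree. It shows how a pattern that exists only in the training sample yields a perfect in-sample score and a poor score on new data.

```python
from statistics import mean

# Hypothetical toy data: (door_color, price). All values are invented.
# In the training sample, green-door homes happen to be expensive.
train = [("green", 1_000_000), ("green", 1_000_000),
         ("white", 500_000), ("white", 500_000)]
# In new data, door color carries no information about price.
validation = [("green", 500_000), ("white", 1_000_000),
              ("green", 1_000_000), ("white", 500_000)]


def fit_color_means(rows):
    """'Train' by memorizing the mean price seen for each door color."""
    prices_by_color = {}
    for color, price in rows:
        prices_by_color.setdefault(color, []).append(price)
    return {color: mean(prices) for color, prices in prices_by_color.items()}


def mae(model, rows):
    """Mean absolute error of the memorized color means on the given rows."""
    return mean(abs(price - model[color]) for color, price in rows)


model = fit_color_means(train)
in_sample_mae = mae(model, train)        # the memorized pattern fits its own data
validation_mae = mae(model, validation)  # the pattern does not hold on new data
print("in-sample MAE: ", in_sample_mae)
print("validation MAE:", validation_mae)
```

The in-sample error is zero only because the model is graded on the same rows it memorized; the validation rows expose that door color was never informative.
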
In [4]:
from sklearn.model_selection import train_test_split

# Split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to the
# random_state argument guarantees we get the same split every time we run this script

# First split to get a training set
train_X, temp_X, train_y, temp_y = train_test_split(X, y, train_size=0.7, random_state=0)

# Second split to get a validation and test set
val_X, test_X, val_y, test_y = train_test_split(temp_X, temp_y, test_size=0.5, random_state=0)

# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

257947.9300322928
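
The two calls to `train_test_split` above carve the data into roughly 70% training, 15% validation and 15% test rows (the test portion is not used in the cells shown here). The sketch below reproduces that proportion logic with the standard library only. The row count of 1,000 is hypothetical, and the slicing is a simplified stand-in for illustration, not scikit-learn's actual shuffling or rounding.

```python
import random

n_rows = 1_000  # hypothetical row count, chosen only to make the arithmetic easy
indices = list(range(n_rows))
random.Random(0).shuffle(indices)  # fixed seed so the split is repeatable

# Stage 1: keep 70% for training, set the remaining 30% aside
n_train = n_rows * 7 // 10
train_idx, temp_idx = indices[:n_train], indices[n_train:]

# Stage 2: cut the held-out 30% in half -> 15% validation, 15% test
n_val = len(temp_idx) // 2
val_idx, test_idx = temp_idx[:n_val], temp_idx[n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```

Every row lands in exactly one of the three sets, which is what allows the validation and test scores to count as scores on unseen data.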


In [ ]:
"""
Mean absolute error on the in-sample data was about $435, while on the out-of-sample (validation) data it was about $257,948.

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes. As a point of reference, the average home value in the validation data is 1.1 million dollars. So the error in new data is about a quarter of the average home value.

There are many ways to improve this model, such as experimenting to find better features or different model types.
"""

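The "about a quarter" figure can be checked with a few lines of arithmetic. This sketch only reuses the two MAE values printed earlier and the 1.1 million dollar average quoted in the note above; the variable names are chosen here for illustration.

```python
in_sample_mae = 434.71594577146544   # reported above for the training data
validation_mae = 257947.9300322928   # reported above for the validation data
avg_home_value = 1_100_000           # "1.1 million dollars", as stated above

# Validation error as a fraction of the average home value
relative_error = validation_mae / avg_home_value
# How many times larger the validation error is than the in-sample error
ratio_of_errors = validation_mae / in_sample_mae

print(f"validation error as share of average price: {relative_error:.1%}")
print(f"validation MAE is about {ratio_of_errors:.0f}x the in-sample MAE")
```

The share comes out a little under one quarter, and the validation error is several hundred times the in-sample error, which is the gap the note describes.
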
In [5]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree limited to max_leaf_nodes leaves and return its MAE on the validation data."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    prediction_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, prediction_val)
    return mae

In [9]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 525, 550, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5 		 Mean Absolute Error: 345797
Max leaf nodes: 50 		 Mean Absolute Error: 260605
Max leaf nodes: 500 		 Mean Absolute Error: 249284
Max leaf nodes: 525 		 Mean Absolute Error: 251912
Max leaf nodes: 550 		 Mean Absolute Error: 252133
Max leaf nodes: 5000 		 Mean Absolute Error: 259424
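
Reading the best setting off the printed table can also be done programmatically. The sketch below hard-codes the six validation MAE values from the output above so that it runs on its own; the names `scores` and `best_tree_size` are chosen here for illustration.

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above
scores = {5: 345797, 50: 260605, 500: 249284,
          525: 251912, 550: 252133, 5000: 259424}

# The best candidate is the key with the smallest validation MAE
best_tree_size = min(scores, key=scores.get)
print(best_tree_size, scores[best_tree_size])
```

In the notebook itself, the same dictionary could presumably be built with a comprehension such as `{n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidates}`, where `candidates` is the list of sizes to try. Among the sizes tried here, 500 leaves gives the lowest validation error: fewer leaves underfit, more leaves start to overfit.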


# Underfitting and Overfitting

In [ ]:
""" Experimenting With Different Models"""

"""Decision-Tree:
        In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level. That's 1024 leaves.

        When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

        This is a phenomenon called {overfitting}, where a model matches the training data almost perfectly, but does poorly on validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

        At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on training data, that is called {underfitting}.

        The max_leaf_nodes argument provides a very sensible way to control {overfitting} vs {underfitting}.
        - The more leaves we allow the model to make, the more we move from {underfitting} toward {overfitting}.
        
"""

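The doubling argument above is easy to tabulate. The sketch below assumes a perfectly balanced tree in which every group is split at every level, and a hypothetical training set of 6,000 houses (a round number picked for illustration, not the size of the Melbourne data).

```python
n_houses = 6_000  # hypothetical training-set size, for illustration only

# Map each depth to (number of leaves, average houses per leaf)
leaves_by_depth = {}
for depth in (1, 2, 3, 10):
    n_leaves = 2 ** depth  # every level of splits doubles the number of groups
    leaves_by_depth[depth] = (n_leaves, n_houses / n_leaves)

for depth, (n_leaves, per_leaf) in leaves_by_depth.items():
    print(f"depth {depth:>2}: {n_leaves:>4} leaves, ~{per_leaf:.1f} houses per leaf")
```

At depth 10 each of the 1024 leaves averages fewer than six houses under these assumptions, which is why predictions from such leaves track the training data closely but generalize poorly.
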
In [ ]:
""" Conclusion """
"""
    Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions.
    Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.
    * Utilize {validation data}, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.
"""