MODEL OPTIMIZATION
In the previous chapter we built our first supervised learning model. We now
want to improve its accuracy and reduce the effects of overfitting. A good
place to start is modifying the model’s hyperparameters.
Without changing any other hyperparameters, let’s start by modifying
max_depth from “30” to “5.” The model now generates the following results:
# Results will differ due to the randomized data split
Training Set Mean Absolute Error: 129412.51
Although the mean absolute error of the training set is higher, this helps
reduce the problem of overfitting and should improve the results of the test
data. Another step to optimize the model is to add more trees. If we set
n_estimators to 250, we see this result:
# Results will differ as per the randomized data split
Training Set Mean Absolute Error: 118130.46
Test Set Mean Absolute Error: 159886.32
This second optimization reduces the training set’s mean absolute error by
approximately $11,000, and we now have a smaller gap between our training
and test results for mean absolute error.
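The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch using the numbers printed in this chapter; the variable names are my own, and your results will differ because of the randomized data split.

```python
# Mean absolute error values as printed in this chapter
train_mae_depth_5 = 129412.51    # max_depth=5, before adding more trees
train_mae_250_trees = 118130.46  # max_depth=5, n_estimators=250
test_mae_250_trees = 159886.32   # same model, evaluated on the test set

# Reduction in training error gained by adding more trees
train_improvement = train_mae_depth_5 - train_mae_250_trees

# Remaining gap between the test and training error
train_test_gap = test_mae_250_trees - train_mae_250_trees

print("Training error reduced by: %.2f" % train_improvement)
print("Gap between test and training error: %.2f" % train_test_gap)
```

The reduction works out to roughly $11,282, and the model still predicts the test data about $41,756 less accurately than the training data.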
Together, these two optimizations underline the importance of understanding
the impact of individual hyperparameters and tuning them. If you decide to
replicate this supervised machine learning model at home, I recommend that
you test modifying each of the hyperparameters individually and analyze
their impact on mean absolute error. In addition, you will notice changes in
the machine’s processing time based on the hyperparameters selected. For
instance, setting max_depth to “5” reduces total processing time compared to
when it was set to “30” because the maximum number of branch layers is
significantly lower. Processing speed and resources will become an important
consideration as you move on to working with larger datasets.
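One way to test a single hyperparameter in isolation is to loop over its candidate values while holding everything else fixed. The sketch below assumes a small synthetic dataset generated with Scikit-learn's make_regression rather than the Melbourne housing data, and a reduced n_estimators of 50 so it runs quickly. Treat the numbers it prints as illustrative only.

```python
import time

from sklearn import ensemble
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Melbourne housing dataset)
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in [30, 5]:
    # Change only max_depth; every other hyperparameter stays fixed
    model = ensemble.GradientBoostingRegressor(
        n_estimators=50, max_depth=depth, random_state=0
    )
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start

    results[depth] = {
        "train_mae": mean_absolute_error(y_train, model.predict(X_train)),
        "test_mae": mean_absolute_error(y_test, model.predict(X_test)),
        "seconds": elapsed,
    }

for depth, r in results.items():
    print("max_depth=%d  train MAE: %.2f  test MAE: %.2f  fit time: %.2fs"
          % (depth, r["train_mae"], r["test_mae"], r["seconds"]))
```

Comparing the training error, test error, and fit time side by side makes it easier to see what one change actually did before you move on to the next hyperparameter.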
Another important optimization technique is feature selection. As you will
recall, we removed nine features while scrubbing our dataset. Now might be
a good time to reconsider those features and analyze whether they have an
effect on the overall accuracy of the model. “SellerG” would be an interesting
feature to add to the model because the real estate company selling the
property could have some impact on the final selling price.
Alternatively, dropping features from the current model may reduce
processing time without having a significant effect on accuracy—or may
even improve accuracy. To select features effectively, it is best to isolate
feature modifications and analyze the results, rather than applying various
changes at once.
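Isolating feature modifications can be sketched as a loop that drops one feature at a time and records the test error for each run. The dataset here is synthetic, the column names and the evaluate_features helper are hypothetical, and the "Noise" column is deliberately unrelated to price.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical column names
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n),
    "Distance": rng.uniform(1, 40, n),
    "Noise": rng.normal(0, 1, n),  # deliberately unrelated to price
})
df["Price"] = (
    150000 * df["Rooms"] - 5000 * df["Distance"] + rng.normal(0, 20000, n)
)


def evaluate_features(feature_names):
    """Train on the given columns and return the test set mean absolute error."""
    X = df[feature_names].to_numpy()
    y = df["Price"].to_numpy()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    model = ensemble.GradientBoostingRegressor(
        n_estimators=100, max_depth=3, random_state=0
    )
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


all_features = ["Rooms", "Distance", "Noise"]
baseline = evaluate_features(all_features)
print("All features: %.2f" % baseline)

# Drop one feature at a time so each change can be judged in isolation
ablation = {}
for name in all_features:
    kept = [f for f in all_features if f != name]
    ablation[name] = evaluate_features(kept)
    print("Without %-8s: %.2f" % (name, ablation[name]))
```

A feature whose removal barely moves the error is a candidate for dropping, while a large jump in error suggests the feature is worth keeping.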
While manual trial and error can be an effective technique to understand the
impact of variable selection and hyperparameters, there are also automated
techniques for model optimization, such as grid search. Grid search allows
you to list a range of configurations you wish to test for each hyperparameter,
and then methodically tests each possible combination of those values. Each
combination is scored automatically to determine the optimal model. As the
model must test each possible combination of hyperparameters, grid search
does take a long time to run! Example code for grid search is shown at the
end of this chapter.
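To see why grid search is slow, it helps to count the combinations. The sketch below first enumerates a reduced two-hyperparameter grid, then multiplies out the number of values listed per hyperparameter in this chapter's grid search code. The 5-fold cross-validation figure is an assumption based on the GridSearchCV default in recent Scikit-learn releases.

```python
from itertools import product
from math import prod

# A reduced grid with two hyperparameters, to show what "each combination" means
small_grid = {
    'max_depth': [7, 9, 11],
    'max_features': [0.8, 0.9],
}
combinations = [
    dict(zip(small_grid, values)) for values in product(*small_grid.values())
]
for combo in combinations:
    print(combo)

# Number of values listed per hyperparameter in this chapter's full grid
full_grid_sizes = {
    'n_estimators': 3, 'max_depth': 3, 'min_samples_split': 3,
    'min_samples_leaf': 3, 'learning_rate': 4, 'max_features': 2, 'loss': 3,
}
total_combinations = prod(full_grid_sizes.values())

# Assumption: 5-fold cross-validation, so each combination is fitted five times
cv_folds = 5
print("Combinations:", total_combinations)
print("Model fits with %d folds: %d" % (cv_folds, total_combinations * cv_folds))
```

The full grid contains 1,944 combinations, which comes to 9,720 model fits under five folds. Trimming even one list in the grid shortens the run considerably.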
Finally, if you wish to use a different supervised machine learning algorithm
instead of gradient boosting, much of the code used in this exercise can be
reused. For instance, the same code can be used to import a new dataset,
preview the dataframe, remove features (columns), remove rows, split and
shuffle the dataset, and evaluate mean absolute error.
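As a rough illustration of how little needs to change, the sketch below swaps in a random forest, chosen here purely as an example of another Scikit-learn regressor. It assumes synthetic data in place of the housing dataset; only the line that sets up the algorithm differs from the gradient boosting workflow.

```python
from sklearn import ensemble
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Melbourne housing dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Only this line changes when swapping algorithms
model = ensemble.RandomForestRegressor(n_estimators=100, random_state=0)

# Fitting and evaluation are written exactly as before
model.fit(X_train, y_train)
mae_train = mean_absolute_error(y_train, model.predict(X_train))
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print("Training Set Mean Absolute Error: %.2f" % mae_train)
print("Test Set Mean Absolute Error: %.2f" % mae_test)
```

Bear in mind that each algorithm has its own hyperparameters, so the values tuned for gradient boosting will not carry over directly.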
http://scikit-learn.org is a great resource to learn more about other algorithms
as well as the gradient boosting used in this exercise.
For a copy of the code, please contact the author at
oliver.theobald@scatterplotpress.com or see the code example below. In
addition, if you have trouble implementing the model using the code found
in this book, please feel free to contact the author by email for extra
assistance at no cost.

Code for the Optimized Model
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
# joblib is imported directly; sklearn.externals.joblib was removed from scikit-learn
import joblib
# Read in data from CSV
df = pd.read_csv('~/Downloads/Melbourne_housing_FULL-26-09-2017.csv')
# Delete unneeded columns
del df['Address']
del df['Method']
del df['SellerG']
del df['Date']
del df['Postcode']
del df['Lattitude']
del df['Longtitude']
del df['Regionname']
del df['Propertycount']
# Remove rows with missing values
df.dropna(axis=0, how='any', inplace=True)
# Convert non-numerical data using one-hot encoding
features_df = pd.get_dummies(df, columns=['Suburb', 'CouncilArea', 'Type'])
# Remove price
del features_df['Price']
# Create X and y arrays from the dataset
X = features_df.to_numpy(dtype=float)
y = df['Price'].to_numpy()
# Split data into test/train set (70/30 split) and shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Set up algorithm
model = ensemble.GradientBoostingRegressor(
n_estimators=250,
learning_rate=0.1,
max_depth=5,
min_samples_split=4,
min_samples_leaf=6,
max_features=0.6,
loss='huber'
)
# Run model on training data
model.fit(X_train, y_train)
# Save model to file
joblib.dump(model, 'trained_model.pkl')
# Check model accuracy (up to two decimal places)
mae = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % mae)
mae = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % mae)
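The listing above saves the trained model to a file. As a hedged sketch of how that file could be reused later, the example below trains a small model on synthetic data, saves it, loads it back, and confirms that both copies make the same predictions. It assumes a recent environment where joblib is imported directly, and it writes to a temporary folder rather than your working directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import ensemble
from sklearn.datasets import make_regression

# Synthetic stand-in data (an assumption; not the Melbourne housing dataset)
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
model = ensemble.GradientBoostingRegressor(n_estimators=20, random_state=0)
model.fit(X, y)

# Save to a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'trained_model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print("Restored model gives the same predictions:", same_predictions)
```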

Code for Grid Search Model
# Import libraries, including GridSearchCV
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
# joblib is imported directly; sklearn.externals.joblib was removed from scikit-learn
import joblib
from sklearn.model_selection import GridSearchCV
# Read in data from CSV
df = pd.read_csv('~/Downloads/Melbourne_housing_FULL-26-09-2017.csv')
# Delete unneeded columns
del df['Address']
del df['Method']
del df['SellerG']
del df['Date']
del df['Postcode']
del df['Lattitude']
del df['Longtitude']
del df['Regionname']
del df['Propertycount']
# Remove rows with missing values
df.dropna(axis=0, how='any', inplace=True)
# Convert non-numerical data using one-hot encoding
features_df = pd.get_dummies(df, columns=['Suburb', 'CouncilArea', 'Type'])
# Remove price
del features_df['Price']
# Create X and y arrays from the dataset
X = features_df.to_numpy(dtype=float)
y = df['Price'].to_numpy()
# Split data into test/train set (70/30 split) and shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Input algorithm
model = ensemble.GradientBoostingRegressor()
# Set the configurations that you wish to test
param_grid = {
'n_estimators': [300, 600, 1000],
'max_depth': [7, 9, 11],
'min_samples_split': [3, 4, 5],
'min_samples_leaf': [5, 6, 7],
'learning_rate': [0.01, 0.02, 0.6, 0.7],
'max_features': [0.8, 0.9],
# These two losses were named 'ls' and 'lad' in older scikit-learn releases
'loss': ['squared_error', 'absolute_error', 'huber']
}
# Define grid search. Run with four CPUs in parallel if applicable.
gs_cv = GridSearchCV(model, param_grid, n_jobs=4)
# Run grid search on training data
gs_cv.fit(X_train, y_train)
# Print optimal hyperparameters
print(gs_cv.best_params_)
# Check model accuracy (up to two decimal places)
mae = mean_absolute_error(y_train, gs_cv.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % mae)
mae = mean_absolute_error(y_test, gs_cv.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % mae)
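Because the full grid above takes a long time to run, it can help to try the same workflow on a much smaller grid first. The sketch below assumes synthetic data, a four-combination grid, and 3-fold cross-validation. It also sets scoring to mean absolute error, which is my own choice for readability; the listing above relies on the estimator's default score instead.

```python
from sklearn import ensemble
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Melbourne housing dataset)
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# A deliberately tiny grid: 2 x 2 = 4 combinations
small_grid = {
    'n_estimators': [20, 40],
    'max_depth': [2, 4],
}
gs = GridSearchCV(
    ensemble.GradientBoostingRegressor(random_state=0),
    small_grid,
    scoring='neg_mean_absolute_error',
    cv=3,
)
gs.fit(X, y)

print(gs.best_params_)
# The score is negated so that higher is better; flip the sign to read it as MAE
print("Best cross-validated MAE: %.2f" % -gs.best_score_)

# Every combination and its average cross-validated score
for params, score in zip(gs.cv_results_['params'],
                         gs.cv_results_['mean_test_score']):
    print(params, "%.2f" % -score)
```

Printing the score for every combination shows how the winner is picked: the combination with the best average cross-validated score becomes best_params_.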

FURTHER RESOURCES
This section lists relevant learning materials for readers who wish to progress
further in the field of machine learning. Please note that certain details listed
in this section, including prices, may be subject to change in the future.
| Machine Learning |
Machine Learning
Format: Coursera course
Presenter: Andrew Ng
Cost: Free
Suggested Audience: Beginners (especially those with a preference for
MATLAB)
A free and well-taught introduction from Andrew Ng, one of the most
influential figures in this field. This course has become a virtual rite of
passage for anyone interested in machine learning.
Project 3: Reinforcement Learning
Format: Online blog tutorial
Author: EECS Berkeley
Suggested Audience: Upper intermediate to advanced
A practical demonstration of reinforcement learning, and Q-learning
specifically, explained through the game Pac-Man.
| Basic Algorithms |
Machine Learning With Random Forests And Decision Trees: A Visual
Guide For Beginners
Format: E-book
Author: Scott Hartshorn
Suggested Audience: Established beginners
A short, affordable (USD $3.20), and engaging read on decision trees and
random forests with detailed visual examples, useful practical tips, and clear
instructions.
Linear Regression And Correlation: A Beginner's Guide
Format: E-book
Author: Scott Hartshorn
Suggested Audience: All
A well-explained and affordable (USD $3.20) introduction to linear
regression, as well as correlation.
| The Future of AI |
The Inevitable: Understanding the 12 Technological Forces That Will
Shape Our Future
Format: E-Book, Book, Audiobook
Author: Kevin Kelly
Suggested Audience: All (with an interest in the future)
A well-researched look into the future with a major focus on AI and machine
learning by New York Times best-selling author Kevin Kelly. Provides a guide
to twelve technological imperatives that will shape the next thirty years.
Homo Deus: A Brief History of Tomorrow
Format: E-Book, Book, Audiobook
Author: Yuval Noah Harari
Suggested Audience: All (with an interest in the future)
As a follow-up title to the success of Sapiens: A Brief History of Mankind,
Yuval Noah Harari examines the possibilities of the future with notable
sections of the book examining machine consciousness, applications in AI,
and the immense power of data and algorithms.
| Programming |
Learning Python, 5th Edition
Format: E-Book, Book
Author: Mark Lutz
Suggested Audience: All (with an interest in learning Python)
A comprehensive introduction to Python published by O’Reilly Media.
Hands-On Machine Learning with Scikit-Learn and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems
Format: E-Book, Book
Author: Aurélien Géron
Suggested Audience: All (with an interest in programming in Python, Scikit-
Learn and TensorFlow)
As a highly popular O’Reilly Media book written by machine learning
consultant Aurélien Géron, this is an excellent advanced resource for anyone
with a solid foundation in machine learning and computer programming.
| Recommendation Systems |
The Netflix Prize and Production Machine Learning Systems: An Insider
Look
Format: Blog
Author: Mathworks
Suggested Audience: All
A very interesting blog article demonstrating how Netflix applies machine
learning to form movie recommendations.
Recommender Systems
Format: Coursera course
Presenter: The University of Minnesota
Cost: Free 7-day trial or included with $49 USD Coursera subscription
Suggested Audience: All
Taught by the University of Minnesota, this Coursera specialization covers
fundamental recommender system techniques including content-based and
collaborative filtering as well as non-personalized and product-association
recommender systems.
| Deep Learning |
Deep Learning Simplified
Format: Video series
Channel: DeepLearning.TV
Suggested Audience: All
A short video series to get you up to speed with deep learning. Available for
free on YouTube.
Deep Learning Specialization: Master Deep Learning, and Break into AI
Format: Coursera course
Presenter: deeplearning.ai and NVIDIA
Cost: Free 7-day trial or included with $49 USD Coursera subscription
Suggested Audience: Intermediate to advanced (with experience in Python)
A robust curriculum for those wishing to learn how to build neural networks
in Python and TensorFlow, as well as career advice, and how deep learning
theory applies to industry.
Deep Learning Nanodegree
Format: Udacity course
Presenter: Udacity
Cost: $599 USD
Suggested Audience: Upper beginner to advanced, with basic experience in
Python
Comprehensive and practical introduction to convolutional neural networks,
recurrent neural networks, and deep reinforcement learning taught online
over a four-month period. Practical components include building a dog breed
classifier, generating TV scripts, generating faces, and teaching a quadcopter
how to fly.
| Future Careers |
Will a Robot Take My Job?
Format: Online article
Author: The BBC
Suggested Audience: All
Check how safe your job is in the AI era leading up to the year 2035.
So You Wanna Be a Data Scientist? A Guide to 2015's Hottest Profession
Format: Blog
Author: Todd Wasserman
Suggested Audience: All
Excellent insight into becoming a data scientist.
The Data Science Venn Diagram
Format: Blog
Author: Drew Conway
Suggested Audience: All
The popular 2010 data science diagram designed by Drew Conway.

DOWNLOADING DATASETS
Before you can start practicing algorithms and building machine learning
models, you will first need data. For beginners starting out in machine
learning, there are a number of options. One is to source your own dataset
by writing a web crawler in Python or utilizing a click-and-drag tool such
as Import.io to crawl the Internet. However, the easiest and best option to get
started is by visiting kaggle.com.
As mentioned throughout this book, Kaggle offers free datasets for
download. This saves you the time and effort of sourcing and formatting your
own dataset. Meanwhile, you also have the opportunity to discuss and
problem-solve with other users on the forum, join competitions, and simply
hang out and talk about data.
Bear in mind, however, that datasets you download from Kaggle will
inherently need some refining (through scrubbing) to tailor to the machine
learning model that you decide to build. Below are four free sample datasets
from Kaggle that may prove useful to your further learning in this field.
World Happiness Report
What countries rank the highest in overall happiness? Which factors
contribute most to happiness? How did country rankings change between the
2015 and 2016 reports? Did any country experience a significant increase or
decrease in happiness? These are the questions you can ask of this dataset
recording happiness scores and rankings using data from the Gallup World
Poll. The scores are based on answers to the main life evaluation questions
asked in the poll.
Hotel Reviews
Does having a five-star reputation lead to more disgruntled guests, and
conversely, can two-star hotels rock the guest ratings by setting low
expectations and over-delivering? Or are one and two-star rated hotels simply
rated low for a reason? Find all this out from this sample dataset of hotel
reviews. This particular dataset covers 1,000 hotels and includes hotel name,
location, review date, text, title, username, and rating. The dataset is sourced
from Datafiniti’s Business Database, which includes almost every hotel in
the world.
Craft Beers Dataset
Do you like craft beer? This dataset contains a list of 2,410 American craft
beers and 510 breweries collected in January 2017 from CraftCans.com.
Drinking and data crunching is perfectly legal.
Brazil's House of Deputies Reimbursements
As politicians in Brazil are entitled to receive refunds from money spent on
activities to “better serve the people,” there are interesting findings and
suspicious outliers to be found in this dataset. Data on these expenses are
publicly available, but there is very little monitoring of expenses in Brazil. So
don’t be surprised to see one public servant racking up over 800 flights in
twelve months, and another who recorded R$140,000 (USD $44,500) on post
expenses—yes, snail mail!