THE ML TOOLBOX
A handy way to learn a new subject area is to map and visualize the essential
materials and tools inside a toolbox.
If you were packing a toolbox to build websites, for example, you would first
pack a selection of programming languages. This would include frontend
languages such as HTML, CSS, and JavaScript, one or two backend
programming languages based on personal preferences, and of course, a text
editor. You might throw in a website builder such as WordPress and then
have another compartment filled with web hosting, DNS, and maybe a few
domain names that you’ve recently purchased.
This is not an extensive inventory, but from this general list, you can start to
gain a better appreciation of what tools you need to master in order to
become a successful website developer.
Let’s now unpack the toolbox for machine learning.

Compartment 1: Data
In the first compartment is your data. Data constitutes the input variables
needed to form a prediction. Data comes in many forms, including structured
and unstructured data. As a beginner, it is recommended that you start with
structured data. This means that the data is defined and labeled (with a
schema) in a table, as shown here:

Table: Bitcoin price vs. days transpired

Before we proceed, I first want to explain the anatomy of a tabular dataset. A
tabular (table-based) dataset contains data organized in rows and columns. In
each column is a feature. A feature is also known as a variable, a dimension
or an attribute—but they all mean the same thing.
Each individual row represents a single observation of a given
feature/variable. Rows are sometimes referred to as a case or value, but in
this book, we will use the term “row.”

Figure 1: Example of a tabular dataset

Each column is known as a vector. Vectors store your X and y values and
multiple vectors (columns) are commonly referred to as matrices. In the case
of supervised learning, y will already exist in your dataset and be used to
identify patterns in relation to independent variables (X). The y values are
commonly expressed in the final column, as shown in Figure 2.

Figure 2: The y value is often but not always expressed in the far right column
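As a minimal sketch of this anatomy, the following Python snippet builds a small table with Pandas and separates it into X and y. The column names and values are invented for illustration and are not taken from the figures above.

```python
import pandas as pd

# Hypothetical data: each row is one observation, each column is one feature.
data = pd.DataFrame({
    "days_transpired": [1, 2, 3, 4, 5],
    "trading_volume": [120, 135, 128, 150, 160],
    "bitcoin_price": [500, 520, 515, 560, 590],  # y, placed in the final column
})

# Independent variables: several columns (vectors) that together form a matrix.
X = data[["days_transpired", "trading_volume"]]

# Dependent variable: a single column (vector).
y = data["bitcoin_price"]

print(X.shape)  # (number of rows, number of feature columns)
print(y.shape)  # one y value per row
```

Keeping y in the final column is only a convention. Selecting columns by name, as above, works wherever y happens to sit in the table.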

Next, within the first compartment of the toolbox is a range of scatterplots,
including 2-D, 3-D, and 4-D plots. A 2-D scatterplot consists of a vertical
axis (known as the y-axis) and a horizontal axis (known as the x-axis) and
provides the graphical canvas to plot a series of dots, known as data points.
Each data point on the scatterplot represents one observation from the dataset,
with X values plotted on the x-axis and y values plotted on the y-axis.

Table: Independent variable vs. dependent variable
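A 2-D scatterplot of this kind can be sketched in a few lines with Matplotlib. The values below are made up for illustration, and the output file name is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file so that no display window is needed
import matplotlib.pyplot as plt

# Hypothetical observations: one X value and one y value per data point.
x_values = [1, 2, 3, 4, 5]
y_values = [500, 520, 515, 560, 590]

fig, ax = plt.subplots()
ax.scatter(x_values, y_values)           # one dot per observation
ax.set_xlabel("Independent variable (X)")
ax.set_ylabel("Dependent variable (y)")
fig.savefig("scatterplot.png")           # hypothetical output file name
```

Inside Jupyter Notebook you would normally leave out the `matplotlib.use("Agg")` line and let the plot render directly below the cell.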

Compartment 2: Infrastructure
The second compartment of the toolbox contains your infrastructure, which
consists of platforms and tools to process data. As a beginner to machine
learning, you are likely to be using a web application (such as Jupyter
Notebook) and a programming language like Python. There is then a series
of machine learning libraries, including NumPy, Pandas, and Scikit-learn, that
are compatible with Python. Machine learning libraries are collections of
pre-compiled programming routines frequently used in machine learning.
You will also need a machine from which to work, in the form of a computer
or a virtual server. In addition, you may need specialized libraries for data
visualization such as Seaborn and Matplotlib, or a standalone software
program like Tableau, which supports a range of visualization
techniques including charts, graphs, maps, and other visual options.
With your infrastructure spread out across the table (hypothetically of
course), you are now ready to get to work building your first machine
learning model. The first step is to crank up your computer. Laptops and
desktop computers are both suitable for working with smaller datasets. You
will then need to install a programming environment, such as Jupyter
Notebook, and a programming language, which for most beginners is Python.
Python is the most widely used programming language for machine learning
because:
a) It is easy to learn and operate,
b) It is compatible with a range of machine learning libraries, and
c) It can be used for related tasks, including data collection (web
scraping) and data piping (Hadoop and Spark).
Other go-to languages for machine learning include C and C++. If you’re
proficient with C and C++ then it makes sense to stick with what you already
know. C and C++ are the default programming languages for advanced
machine learning because they can run directly on a GPU (Graphics
Processing Unit). Python needs to be converted first before it can run on a
GPU, but we will get to this and what a GPU is later in the chapter.
Next, Python users will typically install the following libraries: NumPy,
Pandas, and Scikit-learn. NumPy is a free and open-source library that allows
you to efficiently load and work with large datasets, including managing
matrices.
Scikit-learn provides access to a range of popular algorithms, including linear
regression, Bayes’ classifier, and support vector machines.
Finally, Pandas enables your data to be represented on a virtual
spreadsheet that you can control through code. It shares many of the same
features as Microsoft Excel in that it allows you to edit data and perform
calculations. In fact, the name Pandas derives from the term “panel data,”
which refers to its ability to create a series of panels, similar to “sheets” in
Excel. Pandas is also ideal for importing and extracting data from CSV files.

Figure 4: Previewing a table in Jupyter Notebook using Pandas
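The following sketch shows roughly what such a preview involves. To keep the example self-contained, the CSV content is written inline as a string. The column names and values are invented, and with a real file you would pass its path to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a downloaded file.
csv_text = """suburb,rooms,price
Abbotsford,2,1480000
Abbotsford,3,1035000
Richmond,2,900000
Richmond,4,1600000
"""

df = pd.read_csv(io.StringIO(csv_text))  # import the CSV data into a table

print(df.head(3))          # preview the first three rows
print(df["price"].mean())  # a spreadsheet-style calculation on one column
```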

In summary, users can draw on these three libraries to:
1) Load and work with a dataset via NumPy.
2) Clean up and perform calculations on data, and extract data from CSV files
with Pandas.
3) Implement algorithms with Scikit-learn.
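The three steps above can be sketched in one short script. The dataset is a made-up example with one missing value, and the column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1) NumPy: load the raw values as a matrix (rows are observations).
raw = np.array([
    [1, 200],
    [2, 300],
    [3, 400],
    [4, 500],
    [np.nan, 600],  # a row with a missing value
])

# 2) Pandas: label the columns and clean up the data.
df = pd.DataFrame(raw, columns=["rooms", "price"])
df = df.dropna()  # drop the row that contains the missing value

# 3) Scikit-learn: fit an algorithm (here, linear regression) to the data.
model = LinearRegression()
model.fit(df[["rooms"]].to_numpy(), df["price"].to_numpy())

prediction = model.predict(np.array([[5]]))  # predicted price for 5 rooms
print(prediction)
```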
For students seeking alternative programming options (beyond Python, C,
and C++), other relevant programming languages for machine learning
include R, MATLAB, and Octave.
R is a free and open-source programming language optimized for
mathematical operations, and conducive to building matrices and statistical
functions, which are built directly into the language libraries of R. Although
R is commonly used for data analytics and data mining, R supports machine
learning operations as well.
MATLAB and Octave are direct competitors to R. MATLAB is a commercial
and proprietary programming language. It is strong at solving
algebraic equations and is also a quick programming language to learn.
MATLAB is widely used in electrical engineering, chemical engineering,
civil engineering, and aeronautical engineering. However, computer scientists
and computer engineers tend not to rely on MATLAB as heavily, especially
in recent times. In machine learning, MATLAB is more often used
in academia than in industry. Thus, while you may see MATLAB featured in
online courses, and especially on Coursera, this is not to say that it’s
commonly used in the wild. If, however, you’re coming from an engineering
background, MATLAB is certainly a logical choice.
Lastly, Octave is essentially a free version of MATLAB developed in
response to MATLAB by the open-source community.

Compartment 3: Algorithms
Now that the machine learning environment is set up and you’ve chosen your
programming language and libraries, you can next import your data directly
from a CSV file. You can find hundreds of interesting datasets in CSV format
from kaggle.com. After registering as a member of their platform, you can
download a dataset of your choice. Best of all, Kaggle datasets are free and
there is no cost to register as a user.
The dataset will download directly to your computer as a CSV file, which
means you can use Microsoft Excel to open and even perform basic
algorithms such as linear regression on your dataset.
Next is the third and final compartment that stores the algorithms. Beginners
will typically start off by using simple supervised learning algorithms such as
linear regression, logistic regression, decision trees, and k-nearest neighbors.
Beginners are also likely to apply unsupervised learning in the form of
k-means clustering and dimensionality reduction algorithms.
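To make the distinction concrete, here is a small sketch that applies one supervised algorithm (k-nearest neighbors) and one unsupervised algorithm (k-means clustering) to the same toy data using Scikit-learn. The data points and parameter values are assumptions chosen so that the two groups are easy to separate.

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: two features per row, forming two well-separated groups.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]  # known labels, used only by the supervised model

# Supervised learning: k-nearest neighbors learns from X and the known y values.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
knn_prediction = knn.predict([[2, 2], [9, 9]])

# Unsupervised learning: k-means groups the rows without ever seeing y.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(knn_prediction)
print(cluster_labels)
```

Note that the cluster numbers produced by k-means are arbitrary identifiers. They tell you which rows belong together, not what the groups mean.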

Visualization
No matter how impactful and insightful your data discoveries are, you need a
way to effectively communicate the results to relevant decision-makers. This
is where data visualization, a highly effective medium to communicate data
findings to a general audience, comes in handy. The visual message conveyed
through graphs, scatterplots, box plots, and the representation of numbers in
shapes makes for quick and easy storytelling.
In general, the less informed your audience is, the more important it is to
visualize your findings. Conversely, if your audience is knowledgeable about
the topic, additional details and technical terms can be used to supplement
visual elements.
To visualize your results you can draw on Tableau or a Python library such as
Seaborn, which are stored in the second compartment of the toolbox.

Advanced Toolbox
We have so far examined the toolbox for a typical beginner, but what about
an advanced user? What would their toolbox look like? While it may take
some time before you get to work with the advanced toolkit, it doesn’t hurt to
have a sneak peek.
The toolbox for an advanced learner resembles the beginner’s toolbox but
naturally comes with a broader spectrum of tools and, of course, data. One of
the biggest differences between a beginner and an advanced learner is the size
of the data they manage and operate. Beginners naturally start by working
with small datasets that are easy to manage and which can be downloaded
directly to one’s desktop as a simple CSV file. Advanced learners, though,
will be eager to tackle massive datasets, well in the vicinity of big data.

Compartment 1: Big Data
Big data is used to describe a dataset that, due to its value, variety, volume,
and velocity, defies conventional methods of processing and would be
impossible for a human to process without the assistance of an advanced
machine. Big data does not have an exact definition in terms of size or the
total number of rows and columns. At the moment, petabytes qualify as big
data, but datasets are becoming increasingly larger as we find new ways to
efficiently collect and store data at low cost. And with big data also comes
greater noise and complicated data structures. A huge part, therefore, of
working with big data is scrubbing: the process of refining your dataset
before building your model, which will be covered in the next chapter.

Compartment 2: Infrastructure
After scrubbing the dataset, the next step is to pull out your machine learning
equipment. In terms of tools, there are no real surprises. Advanced learners
are still using the same machine learning libraries, programming languages,
and programming environments as beginners.
However, given that advanced learners are now dealing with up to petabytes
of data, robust infrastructure is required. Instead of relying on the CPU of a
personal computer, advanced students typically turn to distributed computing
and a cloud provider such as Amazon Web Services (AWS) to run their data
processing on what is known as a Graphics Processing Unit (GPU) instance.
GPU chips were originally added to PC motherboards and video consoles
such as the PlayStation 2 and the Xbox for gaming purposes. They were
developed to accelerate the creation of images with millions of pixels whose
frames needed to be constantly recalculated to display output in less than a
second. By 2005, GPU chips were produced in such large quantities that their
price had dropped dramatically and they’d essentially matured into a
commodity. Although highly popular in the video game industry, the
application of such computer chips in the space of machine learning was not
fully understood or realized until recently.
In his 2016 book, The Inevitable: Understanding the 12 Technological
Forces That Will Shape Our Future, Kevin Kelly, Founding Executive Editor
of Wired magazine, explains that in 2009, Andrew Ng and a team at
Stanford University discovered how to link inexpensive GPU clusters to run
neural networks consisting of hundreds of millions of node connections.
“Traditional processors required several weeks to calculate all the cascading
possibilities in a neural net with one hundred million parameters. Ng found
that a cluster of GPUs could accomplish the same thing in a day.”
As specialized parallel computing chips, GPUs are able to perform
many more floating point operations per second than a CPU, allowing for
much faster solutions to linear algebra and statistics problems.
It is important to note that C and C++ are the preferred languages to directly
edit and perform mathematical operations on the GPU. However, Python can
also be used and converted into C in combination with TensorFlow from
Google.
Although it’s possible to run TensorFlow on the CPU, you can gain up to
about 1,000x in performance using the GPU. Unfortunately for Mac users,
TensorFlow is only compatible with Nvidia GPU cards, which are no longer
available with Mac OS X. Mac users can still run TensorFlow on their CPU
but will need to engineer a patch/external driver or run their workload on the
cloud to access a GPU. Amazon Web Services, Microsoft Azure, Alibaba
Cloud, Google Cloud Platform, and other cloud providers offer pay-as-you-
go GPU resources, which may start off free through a free trial program.
Google Cloud Platform is currently regarded as a leading option for GPU
resources based on performance and pricing. In 2016, Google also announced
that it would publicly release a Tensor Processing Unit designed specifically
for running TensorFlow, which is already used internally at Google.

Compartment 3: Advanced Algorithms
To round out this chapter, let’s have a look at the third compartment of the
advanced toolbox containing machine learning algorithms.
To analyze large datasets, advanced learners work with a plethora of
advanced algorithms including Markov models, support vector machines, and
Q-learning, as well as a series of simple algorithms like those found in the
beginner’s toolbox. But the algorithm family they’re most likely to use is
neural networks (introduced in Chapter 10), which come with their own
selection of advanced machine learning libraries.
While Scikit-learn offers a range of popular shallow algorithms, TensorFlow
is the machine learning library of choice for deep learning/neural networks as
it supports numerous advanced techniques including automatic calculus for
back-propagation/gradient descent. Due to the depth of resources,
documentation, and jobs available with TensorFlow, it is the obvious
framework to learn today.
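To give a feel for what automatic calculus saves you, here is a sketch of gradient descent on a simple linear model where the gradients are worked out by hand. It uses NumPy rather than TensorFlow, and the data, learning rate, and number of steps are arbitrary choices made for illustration.

```python
import numpy as np

# Hypothetical data generated from the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0       # model parameters, starting from zero
learning_rate = 0.05  # size of each update step

for _ in range(2000):
    error = (w * x + b) - y
    # Hand-derived gradients of the mean squared error.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Move each parameter a small step against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # should end up close to the true slope and intercept
```

With only two parameters, deriving the gradients manually is easy. For a neural network with millions of parameters it is not, which is why a library that computes them automatically is so valuable.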
Popular alternative neural network libraries include Torch, Caffe, and the
fast-growing Keras. Written in Python, Keras is an open-source deep learning
library that runs on top of TensorFlow, Theano, and other frameworks, and
allows users to perform fast experimentation in fewer lines of code. Like a
WordPress website theme, Keras is minimal, modular, and quick to get up
and running but is less flexible compared with TensorFlow and other
libraries. Users will sometimes utilize Keras to validate their model before
switching to TensorFlow to build a more customized model.
Caffe is also open-source and commonly used to develop deep learning
architectures for image classification and image segmentation. Caffe is
written in C++ but has a Python interface, and it supports GPU-based
acceleration using the Nvidia cuDNN library.
Released in 2002, Torch is well established in the deep learning community.
It is open-source and based on the programming language Lua. Torch offers a
range of algorithms for deep learning and is used within Facebook, Google,
Twitter, NYU, IDIAP, and Purdue, as well as other companies and research labs.
Until recently, Theano was another competitor to TensorFlow but as of late
2017, contributions to the framework have officially ceased.
Sometimes used alongside neural networks is another advanced approach called
ensemble modeling. This technique essentially combines algorithms and
statistical techniques to create a unified model, which we will explore further
in Chapter 12.
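As a brief preview, one simple form of ensemble modeling is majority voting, where several different algorithms each make a prediction and the most common answer wins. The sketch below uses Scikit-learn's VotingClassifier on made-up data. It is only one of several ensemble approaches and is not necessarily the one covered later in the book.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: two features per row and a known class label for each row.
X = [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Combine three different algorithms into one model that votes on the answer.
ensemble = VotingClassifier(estimators=[
    ("logistic", LogisticRegression()),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
ensemble.fit(X, y)

ensemble_prediction = ensemble.predict([[1.5, 1.5], [8.5, 8.5]])
print(ensemble_prediction)
```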