SETTING UP YOUR DATA
Once you have cleaned your dataset, the next job is to split the data into two
segments for testing and training. It is very important not to test your model
with the same data that you used for training. The ratio of the two splits
should be approximately 70/30 or 80/20. This means that your training data
should account for 70 to 80 percent of the rows in your dataset, and the
remaining 20 to 30 percent of rows form your test data. It is vital to split
your data by rows and not columns.

Figure 1: Training and test partitioning of the dataset 70/30

Before you split your data, it is important that you randomize all rows in the
dataset. This helps to avoid bias in your model, as your original dataset might
be arranged sequentially depending on the time it was collected or some other
factor. Unless you randomize your data, you may accidentally omit important
variance from the training data that will cause unwanted surprises when you
apply the trained model to your test data. Fortunately, Scikit-learn provides a
built-in function to shuffle and randomize your data with just one line of code
(demonstrated in Chapter 13).
After randomizing your data, you can begin to design your model and apply
that to the training data. The remaining 30 percent or so of data is put to the
side and reserved for testing the accuracy of the model.
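As a rough sketch of the shuffle-and-split step described above, the example below uses Scikit-learn's train_test_split function on a small made-up dataset. The toy arrays, the 70/30 ratio and the random_state value are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 10 rows, 2 feature columns (X) and one output (y).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=True (the default) randomizes the row order before splitting.
# test_size=0.3 reserves 30 percent of the rows for testing.
# random_state fixes the shuffle so that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0
)

print(X_train.shape, X_test.shape)
```

The split is made by rows: every row lands in either the training set or the test set, and both sets keep all of the columns.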
In the case of supervised learning, the model is developed by feeding the
machine the training data and the expected output (y). The machine analyzes
the training data to discern relationships between the features (X) and the
output (y), so that it can later calculate y from the features alone.
The next step is to measure how well the model actually performs. A
common approach to analyzing prediction accuracy is a measure called mean
absolute error, which takes the absolute difference between each prediction
and the correct value and averages those differences into a single error score.
In Scikit-learn, the model.predict function takes X (features) and generates a
prediction for each row in the dataset. Mean absolute error is then found by
comparing those predictions with the correct y values, which Scikit-learn's
mean_absolute_error function does for you. Calculate the error once on the
training data and once on the test data. You will know that your model is
accurate when the error is low on both datasets and the gap between the two is
small. This suggests that the model has learned the dataset’s underlying
patterns and trends rather than memorizing the training rows.
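The following is a minimal sketch of measuring mean absolute error with Scikit-learn. The tiny dataset and the choice of linear regression are assumptions made so that the arithmetic can be checked by hand. Note that predict receives only the features (X); the known y values are used afterwards to score the predictions.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical training data: one feature column (X) and the known output (y).
X_train = [[1], [2], [3], [4]]
y_train = [2, 4, 6, 8]

# Hypothetical test data that was held back from training.
X_test = [[5], [6]]
y_test = [11, 11]

model = LinearRegression()
model.fit(X_train, y_train)  # learn from the features and the known outputs

# predict takes features only; it never sees the correct y values.
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# Average absolute gap between the correct values and the predictions.
train_mae = mean_absolute_error(y_train, train_predictions)
test_mae = mean_absolute_error(y_test, test_predictions)

print(train_mae, test_mae)
```

Here the model learns y = 2x, so it predicts 10 and 12 for the test rows. Each prediction misses the correct value of 11 by 1, giving a test error of about 1, while the training error is close to 0.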
Once the model can adequately predict the values of the test data, it is ready
for use in the wild. If the model fails to accurately predict values from the test
data, you will need to check whether the training and test data were properly
randomized. Alternatively, you may need to change the model's
hyperparameters.
Each algorithm has hyperparameters; these are your algorithm settings. In
simple terms, these settings control and impact how fast the model learns
patterns and which patterns to identify and analyze.
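To make the idea of hyperparameters concrete, here is a small sketch that assumes a decision tree regressor; the text does not tie this passage to any particular algorithm, and the values shown are arbitrary. Hyperparameters are chosen by you when the model is created, before any training takes place.

```python
from sklearn.tree import DecisionTreeRegressor

# max_depth and min_samples_leaf are hyperparameters of this algorithm.
# The values 3 and 5 are arbitrary picks for illustration, not recommendations.
model = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=0)

# get_params returns the current settings, including defaults left unchanged.
settings = model.get_params()
print(settings["max_depth"], settings["min_samples_leaf"])
```

If the test error is disappointing, you would create the model again with different settings, retrain it and compare the results.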

Cross Validation
Although the training/test data split can be effective in developing models
from existing data, a question mark remains as to whether the model will
work on new data. If your existing dataset is too small to construct an
accurate model, or if the training/test partition of data is not appropriate, this
can lead to poor estimations of performance in the wild.
Fortunately, there is an effective workaround for this issue. Rather than
splitting the data into two segments (one for training and one for testing), we
can implement what is known as cross validation. Cross validation
maximizes the availability of training data by splitting data into various
combinations and testing each specific combination.
Cross validation can be performed through two primary methods. The first
method is exhaustive cross validation, which involves finding and testing all
possible combinations to divide the original sample into a training set and a
test set. The alternative and more common method is non-exhaustive cross
validation, the best-known form of which is k-fold validation. The k-fold
validation technique involves splitting data into k assigned buckets and
reserving one of those buckets to test the training model at each round.
To perform k-fold validation, data are first randomly assigned to k number of
equal sized buckets. One bucket is then reserved as the test bucket and is used
to measure and evaluate the performance of the model trained on the remaining
(k-1) buckets.

Figure 2: k-fold validation

The cross validation process is repeated k number of times (“folds”). At each
fold, one bucket is reserved to test the training model generated by the other
buckets. The process is repeated until every bucket has served once as the
test bucket, having been part of the training data in every other round. The
results from each round are then averaged to produce a single estimate of the
model's performance.
By using all available data for both training and testing purposes, the k-fold
validation technique reduces the risk that the performance estimate is
distorted by one lucky or unlucky fixed split of training and test data, which
makes problems such as overfitting easier to detect.
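The k-fold procedure can be sketched with Scikit-learn's KFold and cross_val_score. The synthetic data, the choice of k = 5 and the use of linear regression are all assumptions made for illustration. The scoring option returns negated errors, so the sign is flipped to recover the mean absolute error for each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data: 50 rows, one feature, y is roughly 3 times x.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=50)

# Randomly assign the 50 rows to k = 5 buckets of 10 rows each.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Each round trains on four buckets and tests on the one held back.
scores = cross_val_score(
    LinearRegression(), X, y, cv=kfold, scoring="neg_mean_absolute_error"
)

mae_per_fold = -scores  # flip the sign to get ordinary error values
print(mae_per_fold)
print(mae_per_fold.mean())  # single averaged performance estimate
```

A large spread between the per-fold errors would suggest that the result depends heavily on which rows happen to land in the test bucket.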

How Much Data Do I Need?
A common question for students starting out in machine learning is how
much data do I need to train my model? In general, machine learning works
best when your training dataset includes a full range of feature combinations.
What does a full range of feature combinations look like? Imagine you have a
dataset about data scientists categorized by the following features:
- University degree (X)
- 5+ years professional experience (X)
- Children (X)
- Salary (y)
To assess the relationship that the first three features (X) have to a data
scientist’s salary (y), we need a dataset that includes the y value for each
combination of features. For instance, we need to know the salary for data
scientists who have a university degree and 5+ years professional experience
but no children, as well as data scientists who have a university degree, 5+
years professional experience and children.
The more available combinations, the more effective the model will be at
capturing how each attribute affects y (the data scientist’s salary). This will
ensure that when it comes to putting the model into practice on the test data
or real-life data, it won’t immediately unravel at the sight of unseen
combinations.
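One way to see which feature combinations a dataset covers is to list every possible combination and subtract the ones that actually appear. The sketch below assumes the three features are simple yes/no values and uses four made-up rows; salary (y) is left out because only the X columns matter for coverage.

```python
from itertools import product

# Hypothetical rows: (university degree, 5+ years experience, children).
observed_rows = [
    (True, True, False),
    (True, True, True),
    (False, True, False),
    (True, False, False),
]

# Three yes/no features give 2 x 2 x 2 = 8 possible combinations.
all_combinations = set(product([True, False], repeat=3))

# Combinations with no example row are ones the model has never seen.
missing = all_combinations - set(observed_rows)

print(len(all_combinations), len(missing))
for combination in sorted(missing):
    print(combination)
```

With these four rows, half of the eight combinations have no salary recorded, so predictions for those kinds of data scientist would rest on guesswork.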
At a minimum, a machine learning model should typically have ten times as
many data points as the total number of features. So for a small dataset with
three features, the training data should ideally have at least thirty rows.
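The rule of thumb above is simple arithmetic, shown here as a small helper. The function name and its default multiplier are hypothetical and exist only for illustration.

```python
def minimum_rows(n_features, rows_per_feature=10):
    """Rule-of-thumb minimum number of training rows for a feature count."""
    return n_features * rows_per_feature


print(minimum_rows(3))  # three features -> 30 rows
```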
The other point to remember is that more relevant data is usually better than
less. Having more relevant data allows you to cover more combinations and
generally helps to ensure more accurate predictions. In some cases, it might
not be possible or cost-effective to source data for every possible
combination. In these cases, you will need to make do with the data that you
have at your disposal.
The following chapters will examine specific algorithms commonly used in
machine learning. Please note that I include some equations out of necessity,
and I have tried to keep them as simple as possible. Many of the machine
learning techniques that we discuss in this book already have working
implementations in your programming language of choice—no equation
writing necessary.