# Chapter 6 The Universal Workflow of Machine Learning

The machine learning workflow is broadly structured in three parts:

1. Define the task: Understand the problem domain and the business logic underlying what the customer asked for. Collect a dataset, understand what the data represents, and choose how you will measure success on the task.

2. Develop a model: Prepare your data so that it can be processed by a machine learning model, select a model evaluation protocol and a simple baseline to beat, train a first model that has generalization power and that can overfit, and then regularize and tune your model until you achieve the best possible generalization performance.

3. Deploy the model: Present your work to stakeholders, ship the model to a web server, a mobile app, a web page, or an embedded device, monitor the model’s performance in the wild, and start collecting the data you’ll need to build the next-generation model.

## 6.1 Define the task

### 6.1.1 Frame the problem

Some questions that should be top of mind:

+ What will your input data be? What are you trying to predict? You can only
learn to predict something if you have training data available.

+ What type of machine learning task are you facing?

+ What do existing solutions look like?

+ Are there particular constraints you will need to deal with?

Once you’ve done your research, you should know what your inputs will be, what your
targets will be, and what broad type of machine learning task the problem maps to. Be
aware of the hypotheses you’re making at this stage:

1. You hypothesize that your targets can be predicted given your inputs.

2. You hypothesize that the data that’s available (or that you will soon collect) is
sufficiently informative to learn the relationship between inputs and targets.

### 6.1.2 Collect the dataset

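A model’s ability to generalize depends largely on the properties of the data it is trained on: the number of data points you have, the reliability of your labels, and the quality of your features.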

If you’re doing supervised learning, then once you’ve collected inputs (such as
images) you’re going to need annotations for them (such as tags for those images).
These annotations are the targets you will train your model to predict.

#### INVESTING IN DATA ANNOTATION INFRASTRUCTURE

Your data annotation process will determine the quality of your targets, which in turn
determines the quality of your model. Carefully consider the options you have available:
1. Should you annotate the data yourself?

2. Should you use a crowdsourcing platform like Mechanical Turk to collect labels?

3. Should you use the services of a specialized data-labeling company?

To pick the best option, consider the constraints you’re working with:
1. Do the data labelers need to be subject matter experts, or could anyone annotate the data? Annotating CT scans of bone fractures pretty much requires a medical degree.


2. If annotating the data requires specialized knowledge, can you train people to
do it? If not, how can you get access to relevant experts?


3.  Do you, yourself, understand the way experts come up with the annotations? If
you don’t, you will have to treat your dataset as a black box, and you won’t be able
to perform manual feature engineering—this isn’t critical, but it can be limiting.

#### BEWARE OF NON-REPRESENTATIVE DATA

It’s critical that the data used for training be representative of the production data.

If possible, collect data directly from the environment where your model will be used.

A related phenomenon you should be aware of is concept drift. You’ll encounter
concept drift in almost all real-world problems, especially those that deal with user-generated data.

Concept drift occurs when the properties of the production data
change over time, causing model accuracy to gradually decay.
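As a rough illustration of monitoring for this kind of change, the sketch below compares one numerical feature in a production window against the training data. The function name `mean_shift_in_std`, the feature values, and the threshold of 2.0 are all assumptions made up for this example. A check like this only watches a single input feature, so it would not catch a change in the relationship between inputs and targets.

```python
import statistics

def mean_shift_in_std(train_values, prod_values):
    """Distance between the production mean and the training mean,
    expressed in units of the training standard deviation."""
    train_mean = statistics.mean(train_values)
    train_std = statistics.stdev(train_values)
    return abs(statistics.mean(prod_values) - train_mean) / train_std

# Hypothetical values for one feature (say, session length in minutes).
train = [10, 12, 11, 9, 13, 10, 11, 12]
prod_early = [11, 10, 12, 11, 10, 12, 9, 13]   # resembles the training data
prod_late = [18, 20, 17, 19, 21, 18, 20, 19]   # has shifted upward

DRIFT_THRESHOLD = 2.0  # arbitrary cut-off chosen for this sketch

for name, window in [("early", prod_early), ("late", prod_late)]:
    shift = mean_shift_in_std(train, window)
    status = "possible drift" if shift > DRIFT_THRESHOLD else "ok"
    print(f"{name}: shift = {shift:.2f} training std -> {status}")
```

In practice you would run a check like this on a schedule, for every important feature, and treat a flagged window as a prompt to collect fresh labeled data and retrain.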

Keep in mind that machine learning can only be used to memorize patterns that
are present in your training data. You can only recognize what you’ve seen before.

Using machine learning trained on past data to predict the future is making the
assumption that the future will behave like the past. That often isn’t the case.

### 6.1.3 Understand your data

1. If your data includes images or natural language text, take a look at a few samples (and their labels) directly.

2. If your data contains numerical features, it’s a good idea to plot the histogram
of feature values to get a feel for the range of values taken and the frequency of
different values.

3. If your data includes location information, plot it on a map. Do any clear patterns emerge?

4. Are some samples missing values for some features? If so, you’ll need to deal
with this when you prepare the data (we’ll cover how to do this in the next
section).

5. If your task is a classification problem, print the number of instances of each
class in your data. Are the classes roughly equally represented? If not, you will
need to account for this imbalance.

6. Check for target leaking: the presence of features in your data that provide information about the targets and which may not be available in production. If
you’re training a model on medical records to predict whether someone will be
treated for cancer in the future, and the records include the feature “this person has been diagnosed with cancer,” then your targets are being artificially
leaked into your data. 

Always ask yourself: is every feature in your data something that will be available in the same form in production?
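Two of the checks above, counting class instances and finding missing values, can be sketched with the Python standard library alone. The `records` list, its field names, and the helpers `class_counts` and `missing_counts` are hypothetical and exist only for illustration. Real projects would more likely load the data into a dataframe library.

```python
from collections import Counter

# Hypothetical toy dataset: one dict per sample, None marks a missing value.
records = [
    {"age": 34, "income": 52000, "label": "no_churn"},
    {"age": None, "income": 61000, "label": "no_churn"},
    {"age": 45, "income": None, "label": "churn"},
    {"age": 29, "income": 48000, "label": "no_churn"},
    {"age": 51, "income": 75000, "label": "no_churn"},
]

def class_counts(rows, target="label"):
    """Count how many samples belong to each class."""
    return Counter(row[target] for row in rows)

def missing_counts(rows, target="label"):
    """Count missing (None) values per feature, ignoring the target."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            if name != target and value is None:
                counts[name] += 1
    return counts

print("Class counts:", dict(class_counts(records)))
print("Missing values per feature:", dict(missing_counts(records)))
```

Here four of the five samples share one class, which is the kind of imbalance the list above says you need to account for.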

### 6.1.4 Choose a measure of success

To achieve success on a project, you must first define what you mean by success. Accuracy? Precision and recall? Customer retention rate?

Your metric for success will guide all of the technical choices
you make throughout the project.

 For balanced classification problems, where every class is equally likely, accuracy
and the area under a receiver operating characteristic (ROC) curve, abbreviated as ROC
AUC, are common metrics. 
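To make ROC AUC concrete, here is a minimal sketch based on its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration. This quadratic-time version is meant for understanding, not for large datasets.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```

A score of 1.0 means every positive outranks every negative, and 0.5 is what random scoring achieves on average.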

For class-imbalanced problems, ranking problems, or
multilabel classification, you can use precision and recall, as well as a weighted form of
accuracy or ROC AUC.
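The sketch below shows why accuracy alone can mislead on class-imbalanced problems, and how precision and recall expose the difference. The function name `classification_metrics` and both sets of predictions are hypothetical. Precision and recall are defined as 0.0 here when their denominator is zero, which is one convention among several.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Assumed convention: 0.0 when there is nothing to divide by.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical imbalanced dataset: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A always predicts the majority class.
always_negative = [0] * 100

# Model B finds 4 of the 5 positives and raises 2 false alarms.
model_b = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93

for name, preds in [("always negative", always_negative), ("model B", model_b)]:
    m = classification_metrics(y_true, preds)
    print(f"{name}: accuracy={m['accuracy']:.2f} "
          f"precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

On this toy data the always-negative model reaches 95% accuracy while finding none of the positives, so its recall is zero. Model B is only slightly more accurate but has much better precision and recall.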