# Working with Data in OpenCV

Now that we have whetted our appetite for machine learning, it is time to delve a little deeper into the different parts that make up a typical machine learning system.

Machine learning is all about building mathematical models in order to understand data. 

The learning aspect enters this process when we give a machine learning model the capability to adjust its **internal parameters**; we can tweak these parameters so that the model explains the data better. 

In a sense, this can be understood as the model learning from the data. 
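
To make the idea of adjustable internal parameters concrete, here is a minimal sketch in plain Python. It is not taken from the chapter: the toy data, the single-parameter model `y = w * x`, and the helper names `mean_squared_error` and `fit_slope` are all assumptions made for illustration. The parameter `w` is nudged step by step (plain gradient descent) so that the model explains the data better.

```python
# Toy data generated from the rule y = 2 * x (an assumption for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mean_squared_error(w, xs, ys):
    """Average squared gap between the predictions w * x and the targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_slope(xs, ys, learning_rate=0.01, n_steps=200):
    """Adjust the single internal parameter w by gradient descent."""
    w = 0.0  # initial guess for the parameter
    for _ in range(n_steps):
        # Derivative of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Move w a small step in the direction that lowers the error.
        w -= learning_rate * grad
    return w


w = fit_slope(xs, ys)
print(w, mean_squared_error(w, xs, ys))
```

On this toy data, `w` should end up very close to 2, the slope that was used to generate the points, and the error should be far smaller than it was for the initial guess.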

Once the model has learned enough—whatever that means—we can ask it to explain newly observed data.

Hence machine learning problems are always split into (at least) two distinct phases:

- A **training phase**, during which we aim to train a machine learning model on a set of data that we call the **training dataset**.

- A **test phase**, during which we evaluate the learned (or finalized) machine learning model on a new set of never-before-seen data that we call the **test dataset**.

The importance of splitting our data into a training set and a test set cannot be overstated.

We always evaluate our models on an independent test set because we are interested in knowing how well our models generalize to new data. 

In the end, isn't this what learning is all about—be it machine learning or human learning?
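
The split into a training set and an independent test set can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `split_train_test` is a hypothetical helper written for this example, the 25% test fraction and the seed are arbitrary choices, and the "samples" are just the integers 0 to 19.

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=42):
    """Shuffle a copy of the samples and split it into training and test lists."""
    shuffled = list(samples)  # copy, so the caller's data stays untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled samples are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(20))
train_set, test_set = split_train_test(samples)
print(len(train_set), len(test_set))  # prints: 15 5
```

Because every sample lands in exactly one of the two lists, the model never sees the test samples during training, which is what makes the test score a fair estimate of how well it generalizes.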

Machine learning is also all about the **data**.

Data can be anything from images and movies to text documents and audio files.

Therefore, in its raw form, data might be made of pixels, letters, words, or even worse: pure bits.

It is easy to see that data in such a raw form might not be very convenient to work with. 

Instead, we have to find ways to **preprocess** the data in order to bring it into a form that is easy to parse.
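
As a rough illustration of what preprocessing can mean, the sketch below turns two kinds of raw data into numbers that are easier to work with. The helper names `scale_pixels` and `count_words` and the sample inputs are assumptions made for this example; they are not part of OpenCV or of the chapter's code.

```python
from collections import Counter


def scale_pixels(raw_bytes):
    """Map raw 8-bit pixel intensities (0-255) to floats in the range [0, 1]."""
    return [value / 255.0 for value in raw_bytes]


def count_words(text):
    """Turn a raw string into lowercase word counts (a simple bag of words)."""
    return Counter(text.lower().split())


# Three raw pixel values stored as bytes, and one short raw sentence.
pixels = scale_pixels(bytes([0, 51, 255]))
words = count_words("the cat saw the dog")

print(pixels)
print(words["the"])  # prints: 2
```

In both cases the raw form (bytes, characters) is replaced by a numeric representation that a model could consume more directly.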

In this chapter, we want to learn how data fits in with machine learning, and how to work with data using the tools of our choice: OpenCV and Python.

Specifically, we want to address the following questions:

- What does a typical machine learning workflow look like?
- What are training data, validation data, and test data, and what are they good for?
- How do I load, store, and work with such data in OpenCV using Python?

## Outline

- [Dealing with Data Using Python's NumPy Package](02.01-Dealing-with-Data-Using-Python-NumPy.ipynb)
- [Loading External Datasets in Python](02.02-Loading-External-Datasets-in-Python.ipynb)
- [Visualizing Data Using Matplotlib](02.03-Visualizing-Data-Using-Matplotlib.ipynb)
- [Dealing with Data Using OpenCV's TrainData container in C++](02.05-Dealing-with-Data-Using-the-OpenCV-TrainData-Container-in-C%2B%2B.ipynb)

## Starting a new IPython or Jupyter session

Before we can get started, we need to open an IPython shell or start a Jupyter Notebook:

1. Open a terminal like we did in the previous chapter, and navigate to the `opencv-machine-learning` directory:

   ```
   $ cd Desktop/opencv-machine-learning
   ```

2. Activate the conda environment we created in the previous chapter:

   ```
   $ source activate Python3  # Mac OS X / Linux
   $ activate Python3         # Windows
   ```

3. Start a new IPython or Jupyter session:

   ```
   $ ipython           # for an IPython session
   $ jupyter notebook  # for a Jupyter session
   ```

If you chose to start an IPython session, the program should have greeted you with a
welcome message such as the following:
    
    $ ipython
    Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13)
    [MSC v.1900 64 bit (AMD64)]
    Type "copyright", "credits" or "license" for more information.
    IPython 3.5.0 -- An enhanced Interactive Python.
    ? -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help -> Python's own help system.
    object? -> Details about 'object', use 'object??' for extra details.
    
    In [1]:

The line starting with `In [1]` is where you type in your regular Python commands. You
can also press the Tab key while typing the names of variables and functions to have
IPython complete them automatically.
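
As a first command to try at the prompt, the short sketch below reports which Python interpreter the session is running on. It uses only the standard library; the variable names are arbitrary, and the exact output will depend on your installation.

```python
import platform
import sys

# Version string of the interpreter behind this session, such as "3.x.y".
python_version = platform.python_version()

# Path of the interpreter; with an activated conda environment this
# usually points inside that environment's directory.
interpreter_path = sys.executable

print(python_version)
print(interpreter_path)
```

If the reported version does not start with 3, the conda environment from the previous step is probably not active in this terminal.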

If you chose to start a Jupyter session, a new window should have opened in your web
browser, pointing to http://localhost:8888. To create a new notebook, click New in the
top-right corner and select Notebooks (Python3).

This will open a new window that contains an empty page with the same command line as in an IPython session:
    
    In [ ]: