The following additional libraries are needed to run this
notebook. Note that running on Colab is experimental; please open a GitHub
issue if you run into any problems.

In [1]:
!pip install -U mxnet-cu101==1.7.0



# Data Preprocessing
:label:`sec_pandas`

So far we have introduced a variety of techniques for manipulating data that are already stored in tensors.
To apply deep learning to real-world problems,
we often begin by preprocessing raw data rather than with data that are already neatly prepared in the tensor format.
Among the popular data analysis tools in Python, the `pandas` package is commonly used.
Like many other extension packages in the vast ecosystem of Python,
`pandas` can work together with tensors.
So, we will briefly walk through the steps for preprocessing raw data with `pandas`
and converting them into the tensor format.
We will cover more data preprocessing techniques in later chapters.

## Reading the Dataset

As an example,
we begin by (**creating an artificial dataset that is stored in a
csv (comma-separated values) file**)
`../data/house_tiny.csv`. Data stored in other
formats may be processed in similar ways.

Below we write the dataset row by row into a csv file.


In [2]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row represents a data example
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

To [**load the raw dataset from the created csv file**],
we import the `pandas` package and invoke the `read_csv` function.
This dataset has four rows and three columns, where each row describes the number of rooms ("NumRooms"), the alley type ("Alley"), and the price ("Price") of a house.


In [3]:
# If pandas is not installed, just uncomment the following line:
# !pip install pandas
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
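
The `NA` fields written to the file come back as `NaN` because `read_csv` treats the string `NA` as a missing-value marker by default. As a minimal sketch of how one might count the missing entries per column before deciding how to handle them, the snippet below rebuilds the same four rows in memory; the `io.StringIO` stand-in for the file and the name `missing_per_column` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Same content as house_tiny.csv, kept in memory so the sketch is self-contained
csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# isna() marks each missing entry; sum() counts them column by column
missing_per_column = data.isna().sum()
print(missing_per_column)
```

For this dataset, "NumRooms" has two missing entries, "Alley" has three, and "Price" has none.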


## Handling Missing Data

Note that "NaN" entries are missing values.
To handle missing data, typical methods include *imputation* and *deletion*,
where imputation replaces missing values with substituted ones,
while deletion discards the rows or columns that contain missing values. Here we will consider imputation.

By integer-location based indexing (`iloc`), we split `data` into `inputs` and `outputs`,
where the former takes the first two columns while the latter only keeps the last column.
For numerical values in `inputs` that are missing,
we [**replace the "NaN" entries with the mean value of the same column.**]


In [11]:
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
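# numeric_only=True restricts the mean to numeric columns; recent pandas
# versions raise an error on the string column "Alley" otherwise
inputs = inputs.fillna(inputs.mean(numeric_only=True))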
print(inputs)

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
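
For contrast with imputation, deletion can be sketched with `DataFrame.dropna`. The snippet below is only an illustration on a hand-built copy of the raw table; the names `rows_kept` and `cols_kept` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the raw dataset, before any imputation
data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# Deletion by row: drop every row that has at least one missing value
rows_kept = data.dropna(axis=0)

# Deletion by column: drop every column that has at least one missing value
cols_kept = data.dropna(axis=1)

print(len(rows_kept), list(cols_kept.columns))
```

On this tiny dataset every row contains at least one missing value, so row-wise deletion leaves nothing, and column-wise deletion keeps only "Price". This is one reason imputation is the more practical choice here.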


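[**For categorical or discrete values in `inputs`, we consider "NaN" as a category.**]
Since the "Alley" column only takes two types of categorical values "Pave" and "NaN",
`pandas` can automatically convert this column into indicator columns.
Called with its default arguments, `pd.get_dummies` creates one column per observed value, here only "Alley_Pave", so a missing alley type is encoded implicitly:
a row whose alley type is "Pave" sets "Alley_Pave" to 1, and a row with a missing alley type sets it to 0.
Passing `dummy_na=True` would add an explicit "Alley_nan" column that is 1 exactly for the rows with a missing alley type.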


In [12]:
inputs = pd.get_dummies(inputs)
print(inputs)

   NumRooms  Alley_Pave
0       3.0           1
1       2.0           0
2       4.0           0
3       3.0           0
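
If an explicit indicator for the missing category is wanted, `pd.get_dummies` accepts `dummy_na=True`. The following is a minimal sketch on a hand-built copy of the imputed `inputs`; the name `encoded` is hypothetical, and the printed indicator values may appear as integers or booleans depending on the pandas version.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the inputs after mean imputation of "NumRooms"
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# dummy_na=True adds an "Alley_nan" column that flags missing alley types
encoded = pd.get_dummies(inputs, dummy_na=True)
print(encoded)
```

With this option the numeric "NumRooms" column is left as is, while "Alley" is split into "Alley_Pave" and "Alley_nan".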


## Conversion to the Tensor Format

Now that [**all the entries in `inputs` and `outputs` are numerical, they can be converted to the tensor format.**]
Once data are in this format, they can be further manipulated with the tensor functionalities that we introduced in :numref:`sec_ndarray`.


In [16]:
from mxnet import np, npx

X, y = np.array(inputs.values, ctx=npx.gpu()), np.array(outputs.values, ctx=npx.gpu())
X, y

(array([[3., 1.],
        [2., 0.],
        [4., 0.],
        [3., 0.]], dtype=float64, ctx=gpu(0)),
 array([127500, 106000, 178100, 140000], dtype=int64, ctx=gpu(0)))
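
The cell above places the arrays on a GPU. As a rough sketch for checking the same conversion on a machine without a GPU or without MXNet, plain NumPy can be used instead; this is a swapped-in alternative rather than the notebook's method, and the choice of `float32` is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copies of the preprocessed inputs and outputs
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley_Pave": [1, 0, 0, 0],
})
outputs = pd.Series([127500, 106000, 178100, 140000], name="Price")

# to_numpy() returns the underlying values as a NumPy array;
# dtype=np.float32 casts every entry to 32-bit floats
X = inputs.to_numpy(dtype=np.float32)
y = outputs.to_numpy(dtype=np.float32)
print(X.shape, y.shape)
```

Here `X` has one row per house and one column per feature, and `y` holds one price per house.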

## Summary

* Like many other extension packages in the vast ecosystem of Python, `pandas` can work together with tensors.
* Imputation and deletion can be used to handle missing data.


## Exercises

Create a raw dataset with more rows and columns.

1. Delete the column with the most missing values.
2. Convert the preprocessed dataset to the tensor format.


[Discussions](https://discuss.d2l.ai/t/28)
