# Data Preprocessing

So far we have introduced a variety of techniques for manipulating data that are already stored in ndarrays. To apply deep learning to real-world problems, however, we often begin by preprocessing raw data rather than with data already prepared in the ndarray format. Among the popular data analysis tools in Python, the pandas package is commonly used. Like many other extension packages in Python's vast ecosystem, pandas works together with ndarray. We will therefore briefly walk through the steps for preprocessing raw data with pandas and converting them into the ndarray format. We will cover more data preprocessing techniques in later chapters.

# Reading the Dataset

As an example, we begin by creating an artificial dataset stored in a CSV (comma-separated values) file, ../data/house_tiny.csv. Data stored in other formats can be processed in similar ways. The mkdir_if_not_exist function below ensures that the directory ../data exists. The comment "# Saved in the d2l package for later use" is a special marker: the function, class, or import statement that follows it is also saved in the d2l package, so that we can later invoke d2l.mkdir_if_not_exist() directly. We then write the dataset to the CSV file row by row.

In [3]:
import os

# Saved in the d2l package for later use
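def mkdir_if_not_exist(path):
    """Create the directory at path if it does not already exist."""
    # Accept a sequence of path components as well as a single string
    if not isinstance(path, str):
        path = os.path.join(*path)
    if not os.path.exists(path):
        os.makedirs(path)
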
data_file = '../data/house_tiny.csv'
mkdir_if_not_exist('../data')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row is a data point
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')
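
The same steps can also be sketched with the standard library alone. The version below is a minimal illustration, not the d2l helper: it assumes a temporary directory (a stand-in for ../data, chosen here so the sketch has no side effects), relies on os.makedirs with exist_ok=True instead of an explicit existence check, and uses csv.writer to emit the rows.

```python
import csv
import os
import tempfile

# Assumption for illustration: a temporary directory stands in for ../data
data_dir = os.path.join(tempfile.mkdtemp(), 'data')
os.makedirs(data_dir, exist_ok=True)  # No error if the directory already exists
data_file = os.path.join(data_dir, 'house_tiny.csv')

rows = [
    ['NumRooms', 'Alley', 'Price'],  # Column names
    ['NA', 'Pave', 127500],          # Each row is a data point
    [2, 'NA', 106000],
    [4, 'NA', 178100],
    ['NA', 'NA', 140000],
]
# newline='' lets the csv module control line endings itself
with open(data_file, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# Read the file back to check what was written
with open(data_file) as f:
    lines = f.read().splitlines()
print(lines)
```

Using csv.writer avoids assembling the comma-separated strings by hand, which matters once a field may itself contain a comma.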

To load the raw dataset from the CSV file we just created, we import the pandas package and invoke its read_csv function. The dataset has $4$ rows and $3$ columns, where each row describes the number of rooms ("NumRooms"), the alley type ("Alley"), and the price ("Price") of a house. By default, read_csv parses the "NA" entries as missing values, which appear as NaN in the output.

In [4]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
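
Before handling missing values, it can help to check how many there are in each column. The following is a small sketch under the assumption that the same CSV content is read from an in-memory string (via io.StringIO) rather than from ../data/house_tiny.csv; the variable names csv_text and missing_per_column are introduced here for illustration only.

```python
import io

import pandas as pd

# Same content as house_tiny.csv, kept in memory so the sketch is self-contained
csv_text = (
    'NumRooms,Alley,Price\n'
    'NA,Pave,127500\n'
    '2,NA,106000\n'
    '4,NA,178100\n'
    'NA,NA,140000\n'
)
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (number of rows, number of columns)
print(data.dtypes)  # Column types inferred by pandas

# isna() marks missing entries; summing counts them per column
missing_per_column = data.isna().sum()
print(missing_per_column)
```

Here NumRooms has two missing entries, Alley has three, and Price has none.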


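# Handling Missing Data

The NaN entries in the dataset are missing values. Here we impute each missing "NumRooms" value with the mean of that column, and we treat a missing "Alley" value as a category of its own.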

In [9]:
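# Split the columns into inputs (NumRooms, Alley) and outputs (Price)
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
# Fill missing numerical values with the column mean; numeric_only=True
# skips the string-valued Alley column, which has no mean
inputs = inputs.fillna(inputs.mean(numeric_only=True))
print(inputs)
# One-hot encode Alley, with NaN as its own category (Alley_nan);
# dtype=int yields 0/1 indicators rather than booleans
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)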

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
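
To see where the values in the output come from, the two steps can be applied to each column separately. This is a sketch for illustration: the inputs DataFrame is rebuilt by hand, and the names room_mean, filled_rooms, and alley_dummies are not part of the original code.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the two input columns
inputs = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 4.0, np.nan],
    'Alley': ['Pave', np.nan, np.nan, np.nan],
})

# mean() skips NaN by default, so only 2.0 and 4.0 contribute
room_mean = inputs['NumRooms'].mean()
filled_rooms = inputs['NumRooms'].fillna(room_mean)

# dummy_na=True adds an indicator column for missing entries;
# dtype=int stores the indicators as 0/1 instead of booleans
alley_dummies = pd.get_dummies(
    inputs['Alley'], prefix='Alley', dummy_na=True, dtype=int)

print(room_mean)
print(filled_rooms.tolist())
print(alley_dummies)
```

The mean of the two observed room counts is $(2 + 4) / 2 = 3$, which is the value that replaces both missing "NumRooms" entries.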


# Conversion to a NumPy Array

Now that all the entries in inputs and outputs are numerical, they can be converted to the ndarray format. Once the data are in this format, they can be further manipulated with the ndarray functionality that we introduced in :numref:sec_ndarray.

In [13]:
import numpy as np
X, y = np.array(inputs.values), np.array(outputs.values)
print(X, y)

[[3. 1. 0.]
 [2. 0. 1.]
 [4. 0. 1.]
 [3. 0. 1.]] [127500 106000 178100 140000]
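
After the conversion it is worth confirming the shapes and element types of the resulting arrays. The sketch below assumes the preprocessed inputs and outputs are rebuilt by hand, and it uses DataFrame.to_numpy with an explicit dtype as one possible alternative to wrapping .values in np.array.

```python
import numpy as np
import pandas as pd

# Hand-built copies of the preprocessed inputs and outputs
inputs = pd.DataFrame({
    'NumRooms': [3.0, 2.0, 4.0, 3.0],
    'Alley_Pave': [1, 0, 0, 0],
    'Alley_nan': [0, 1, 1, 1],
})
outputs = pd.Series([127500, 106000, 178100, 140000], name='Price')

# Request a single floating-point type for the mixed float/int columns
X = inputs.to_numpy(dtype=np.float64)
y = outputs.to_numpy()

print(X.shape, X.dtype)
print(y.shape, y.dtype)
```

Requesting the dtype explicitly guards against ending up with an object array when the columns of a DataFrame do not share a common numerical type.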
