
# Data Preprocessing
:label:`sec_pandas`

So far, we have been working with synthetic data
that arrived in ready-made tensors.
However, to apply deep learning in the wild
we must extract messy data
stored in arbitrary formats,
and preprocess it to suit our needs.
Fortunately, the *pandas* [library](https://pandas.pydata.org/)
can do much of the heavy lifting.
This section, while no substitute
for a proper *pandas* [tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html),
will give you a crash course
on some of the most common routines.

## Reading the Dataset

Comma-separated values (CSV) files are ubiquitous
for storing tabular (spreadsheet-like) data.
In them, each line corresponds to one record
and consists of several (comma-separated) fields, e.g.,
"Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,field of gravitational physics".
To demonstrate how to load CSV files with `pandas`,
we (**create a CSV file below**) `../data/house_tiny.csv`.
This file represents a dataset of homes,
where each row corresponds to a distinct home
and the columns correspond to the number of rooms (`NumRooms`),
the roof type (`RoofType`), and the price (`Price`).


In [None]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

Now let's import `pandas` and load the dataset with `read_csv`.


In [None]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


## Data Preparation

In supervised learning, we train models
to predict a designated *target* value,
given some set of *input* values.
Our first step in processing the dataset
is to separate out columns corresponding
to input versus target values.
We can select columns either by name or
via integer-location based indexing (`iloc`).
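
As a minimal sketch of the two selection styles, the snippet below builds a small stand-in table (its values are illustrative, not read from the file above) and picks the same columns once by name and once by integer position.

```python
import pandas as pd

# A tiny stand-in table with the same columns as house_tiny.csv
df = pd.DataFrame({
    'NumRooms': [2.0, 4.0],
    'RoofType': ['Slate', 'Slate'],
    'Price': [106000, 178100],
})

# Select by column name
inputs_by_name = df[['NumRooms', 'RoofType']]
targets_by_name = df['Price']

# Select by integer position: all rows, first two columns / third column
inputs_by_pos = df.iloc[:, 0:2]
targets_by_pos = df.iloc[:, 2]

# Both routes pick out the same data
print(inputs_by_name.equals(inputs_by_pos))    # True
print(targets_by_name.equals(targets_by_pos))  # True
```

Selecting by name tends to be more robust when the column order of a file may change, while `iloc` is convenient when only positions are known.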

You might have noticed that `pandas` replaced
all CSV entries with value `NA`
with a special `NaN` (*not a number*) value.
This can also happen whenever an entry is empty,
e.g., "3,,,270000".
These are called *missing values*
and they are the "bed bugs" of data science,
a persistent menace that you will confront
throughout your career.
Depending upon the context,
missing values might be handled
either via *imputation* or *deletion*.
Imputation replaces missing values
with estimates of their values
while deletion simply discards
either those rows or those columns
that contain missing values.
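
The following sketch illustrates both points on a made-up CSV snippet (the rows are assumptions for illustration only): `NA` markers and empty fields are both parsed as missing, and `dropna` implements deletion along either axis.

```python
import io
import pandas as pd

# Illustrative CSV text: 'NA' markers and empty fields both denote missing entries
csv_text = """NumRooms,RoofType,Price
3,,270000
NA,Slate,150000
2,Slate,120000
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count the missing entries in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Deletion, option 1: drop every row that contains a missing value
rows_kept = df.dropna(axis=0)

# Deletion, option 2: drop every column that contains a missing value
cols_kept = df.dropna(axis=1)

print(rows_kept.shape)  # (1, 3)
print(cols_kept.shape)  # (3, 1)
```

On a table this small, deletion throws away most of the data, which is one reason imputation is often preferred.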

Here are some common imputation heuristics.
[**For categorical input fields,
we can treat `NaN` as a category.**]
Since the `RoofType` column takes values `Slate` and `NaN`,
`pandas` can convert this column
into two columns `RoofType_Slate` and `RoofType_nan`.
A row whose roof type is `Slate` will set values
of `RoofType_Slate` and `RoofType_nan` to 1 and 0, respectively.
The converse holds for a row with a missing `RoofType` value.


In [None]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True
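
The indicator columns above are printed as booleans, where `True` and `False` play the role of 1 and 0. If integer indicators are preferred, one option is the `dtype` argument of `get_dummies`, sketched here on a stand-in copy of the `RoofType` column.

```python
import numpy as np
import pandas as pd

# Stand-in for the RoofType column of the dataset above
roof = pd.DataFrame({'RoofType': [np.nan, np.nan, 'Slate', np.nan]})

# dtype=int asks for 0/1 integers instead of booleans
encoded = pd.get_dummies(roof, dummy_na=True, dtype=int)
print(encoded)
```

Either representation converts to the same floating-point values once the data is loaded into a tensor.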


For missing numerical values,
one common heuristic is to
[**replace the `NaN` entries with
the mean value of the corresponding column**].


In [None]:
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True
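
To see where the value 3.0 comes from, here is a small worked example on a stand-in copy of the `NumRooms` column. By default `mean` skips `NaN` entries, so only the observed values 2 and 4 are averaged.

```python
import numpy as np
import pandas as pd

# Stand-in for the NumRooms column of the dataset above
num_rooms = pd.Series([np.nan, 2.0, 4.0, np.nan], name='NumRooms')

# mean() skips NaN by default, so only the two observed values are averaged
col_mean = num_rooms.mean()          # (2.0 + 4.0) / 2
filled = num_rooms.fillna(col_mean)

print(col_mean)         # 3.0
print(filled.tolist())  # [3.0, 2.0, 4.0, 3.0]
```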


## Conversion to the Tensor Format

Now that [**all the entries in `inputs` and `targets` are numerical,
we can load them into a tensor**] (recall :numref:`sec_ndarray`).


In [None]:
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))
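
Note that the tensors above have dtype `torch.float64`, inherited from the NumPy array. PyTorch's default floating-point type is `float32`, so it may be convenient to request it explicitly. A minimal sketch, using a small stand-in array rather than the full `inputs`:

```python
import numpy as np
import torch

# Stand-in for inputs.to_numpy(dtype=float) from the cell above
values = np.array([[3.0, 0.0, 1.0],
                   [2.0, 0.0, 1.0]])

X64 = torch.tensor(values)                       # inherits NumPy's float64
X32 = torch.tensor(values, dtype=torch.float32)  # explicit single precision

print(X64.dtype)  # torch.float64
print(X32.dtype)  # torch.float32
```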

## Discussion

You now know how to partition data columns,
impute missing values,
and load `pandas` data into tensors.
In :numref:`sec_kaggle_house`, you will
pick up some more data processing skills.
While this crash course kept things simple,
data processing can get hairy.
For example, rather than arriving in a single CSV file,
our dataset might be spread across multiple files
extracted from a relational database.
For instance, in an e-commerce application,
customer addresses might live in one table
and purchase data in another.
Moreover, practitioners face myriad data types
beyond categorical and numeric, for example,
text strings, images,
audio data, and point clouds.
Oftentimes, advanced tools and efficient algorithms
are required in order to prevent data processing from becoming
the biggest bottleneck in the machine learning pipeline.
These problems will arise when we get to
computer vision and natural language processing.
Finally, we must pay attention to data quality.
Real-world datasets are often plagued
by outliers, faulty measurements from sensors, and recording errors,
which must be addressed before
feeding the data into any model.
Data visualization tools such as [seaborn](https://seaborn.pydata.org/),
[Bokeh](https://docs.bokeh.org/), or [matplotlib](https://matplotlib.org/)
can help you to manually inspect the data
and develop intuitions about
the type of problems you may need to address.
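
As a rough illustration of the multi-table case mentioned above, the sketch below joins two hypothetical tables with `merge`. All table names, column names, and values are made up for the example.

```python
import pandas as pd

# Hypothetical tables; all names and values here are made up for illustration
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'city': ['Ulm', 'Bern', 'Zurich'],
})
purchases = pd.DataFrame({
    'customer_id': [1, 1, 3],
    'amount': [20.0, 35.5, 12.0],
})

# Left join: keep every purchase and attach the matching customer's city
merged = purchases.merge(customers, on='customer_id', how='left')
print(merged)
```

A left join keeps all rows of `purchases`, so a purchase without a matching customer would show up with a missing `city` rather than being silently dropped.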


## Exercises

1. Try loading datasets, e.g., Abalone from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php) and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?
1. Try indexing and selecting data columns by name rather than by column number. The pandas documentation on [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) has further details on how to do this.
1. How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What happens if you try it out on a server?
1. How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?
1. What alternatives to pandas can you think of? How about [loading NumPy tensors from a file](https://numpy.org/doc/stable/reference/generated/numpy.load.html)? Check out [Pillow](https://python-pillow.org/), the Python Imaging Library.


In [3]:
import pandas as pd

# URL of the Abalone dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
data = pd.read_csv(url, header=None)

# Fraction of all entries that are missing
missing_values = data.isnull().sum().sum()
total_values = data.size
missing_proportion = missing_values / total_values if total_values > 0 else 0
print(f"Fraction of missing values: {missing_proportion}")

# Split the columns by dtype
numeric_columns = data.select_dtypes(include='number').columns
categorical_columns = data.select_dtypes(include='object').columns
text_columns = []  # this dataset has no free-text columns

numeric_proportion = len(numeric_columns) / len(data.columns) if len(data.columns) > 0 else 0
categorical_proportion = len(categorical_columns) / len(data.columns) if len(data.columns) > 0 else 0
text_proportion = len(text_columns) / len(data.columns) if len(data.columns) > 0 else 0

print(f"Fraction of numerical variables: {numeric_proportion}")
print(f"Fraction of categorical variables: {categorical_proportion}")
print(f"Fraction of text variables: {text_proportion}")

Fraction of missing values: 0.0
Fraction of numerical variables: 0.8888888888888888
Fraction of categorical variables: 0.1111111111111111
Fraction of text variables: 0.0


In [4]:
column_names = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
data.columns = column_names
selected_columns = data[['Length', 'Diameter']]
print(selected_columns)

      Length  Diameter
0      0.455     0.365
1      0.350     0.265
2      0.530     0.420
3      0.440     0.365
4      0.330     0.255
...      ...       ...
4172   0.565     0.450
4173   0.590     0.440
4174   0.600     0.475
4175   0.625     0.485
4176   0.710     0.555

[4177 rows x 2 columns]


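3. How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What happens if you try it out on a server?
On a laptop, the size of the dataset that can be loaded is limited by the following:
Memory: a laptop usually has limited RAM (e.g., 8 GB or 16 GB), so a dataset that is too large may not fit in memory. When the dataset approaches or exceeds the available memory, loading fails with an out-of-memory error (`MemoryError`).
Read time: for very large datasets, reading the data from storage (hard disk or SSD) into memory takes a long time. Mechanical hard disks are comparatively slow and SSDs are faster, but the read time still grows noticeably with the amount of data.
Processing: once the dataset is loaded, processing it (handling missing values, conversions, and so on) also costs time and memory. Complex operations can slow a laptop down considerably or even make the program unresponsive.
On a server:
A server usually has more memory and compute resources, so in principle it can load larger datasets. Its memory is still finite, however, so a dataset that is too large runs into the same out-of-memory problem.
Network bandwidth can also become a limiting factor: if the dataset is read from remote storage, the transfer speed affects the loading time.
In addition, a server may run several jobs at once, and competition for resources also affects how fast the dataset is loaded and processed.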
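
One common way around the memory limit is to process a file in pieces. The sketch below is a minimal illustration using the `chunksize` argument of `read_csv`; the in-memory CSV is a stand-in for a file that would be too large to load at once.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a file too large to load at once
csv_text = 'value\n' + '\n'.join(str(i) for i in range(10))

total = 0
n_rows = 0
# chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()
    n_rows += len(chunk)

print(n_rows, total)  # 10 45
```

Only one chunk is held in memory at a time, at the cost of restricting the analysis to computations that can be accumulated chunk by chunk.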

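4. How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?
Dealing with a very large number of categories:
Dimensionality reduction: methods such as principal component analysis (PCA) can map the high-dimensional encoded categorical data to a lower-dimensional representation, reducing the number of dimensions while retaining the main structure of the data.
Clustering: merge similar categories into a few larger ones to simplify the data, for example by clustering the categories with K-Means.
Feature selection: keep the category features that strongly influence the target variable and drop those that have little influence on the result.
If the category labels are all unique:
In this case, applying one-hot encoding directly makes the number of features grow sharply, which can lead to the curse of dimensionality and to overfitting.
Should you include the latter?
It is generally not advisable to include all unique category labels directly as features. First analyze the labels to see whether some underlying pattern or grouping allows them to be merged or transformed, for example by grouping similar labels into one class based on business logic or on attributes of the data. If no effective merging or transformation is possible, methods designed for high-dimensional sparse data, such as sparse matrix representations, are an option, though they add processing complexity.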
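
As one simple instance of merging categories, the sketch below lumps rare labels into a single `other` bucket before one-hot encoding. The data and the frequency threshold are made up for illustration.

```python
import pandas as pd

# Made-up categorical column with a few frequent and several rare labels
colors = pd.Series(['red', 'red', 'red', 'blue', 'blue', 'green', 'teal', 'pink'])

counts = colors.value_counts()
# Keep labels that appear at least twice; the threshold is an arbitrary choice
frequent = counts[counts >= 2].index
grouped = colors.where(colors.isin(frequent), 'other')

encoded = pd.get_dummies(grouped)
print(sorted(encoded.columns))  # ['blue', 'other', 'red']
```

Five distinct labels collapse into three indicator columns here. With all-unique labels, every label would fall into `other`, which reflects that such a column carries no information a model can generalize from.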

[Discussions](https://discuss.d2l.ai/t/29)


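5. What alternatives to pandas can you think of? Check out Pillow, the Python Imaging Library.
Alternatives to pandas:
Dask: a library for parallel computing that can handle datasets larger than memory. It offers data structures similar to those of pandas (such as the Dask DataFrame) and can process data in parallel on multiple cores or multiple machines, which makes it suitable for large-scale datasets.
Vaex: a library for lazy, out-of-core DataFrames that can process and analyze data without loading the whole dataset into memory. Vaex is particularly suited to large tabular data and provides efficient computation and visualization tools.
Polars: a fast, multithreaded, memory-efficient DataFrame library that often outperforms pandas on large datasets and supports reading and writing many data formats.
Pillow (the Python Imaging Library):
Pillow is mainly used to work with image data, e.g., reading, writing, and editing images. It offers a rich set of image operations such as resizing, cropping, rotation, and color conversion. It is a very useful tool for image data and can be combined with deep learning libraries (such as PyTorch or TensorFlow) to preprocess images before feeding them to a model for training or prediction. For example: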

In [5]:
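import os
from PIL import Image

# 'example.jpg' is a placeholder path; if no such file exists, fall back to
# a blank RGB image so that the cell still runs
path = 'example.jpg'
image = Image.open(path) if os.path.exists(path) else Image.new('RGB', (640, 480))
# Resize the image
resized_image = image.resize((224, 224))
# Convert to a grayscale image
gray_image = resized_image.convert('L')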
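
The exercise also mentions loading NumPy arrays from a file. A minimal sketch of a save/load round trip follows; the file name and the temporary directory are arbitrary choices for the example.

```python
import os
import tempfile
import numpy as np

array = np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'array.npy')
    np.save(path, array)    # write the array in NumPy's binary .npy format
    loaded = np.load(path)  # read it back; shape and dtype are preserved

print(loaded.shape, loaded.dtype)     # (2, 3) float32
print(np.array_equal(array, loaded))  # True
```

Unlike CSV, the `.npy` format stores the dtype and shape alongside the data, so no parsing or type inference is needed on load.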