This tutorial walks you through the essentials of indexing and filtering data with Pandas. Think of it as a greatly condensed, opinionated version of [the official indexing documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#).

We'll start by loading Pandas and the data:

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../input/parks.csv', index_col=['Park Code'])

In [3]:
df.head(3)   # head() returns the first 5 rows by default; here we ask for 3

Park Code,Park Name,State,Acres,Latitude,Longitude
ACAD,Acadia National Park,ME,47390,44.35,-68.21
ARCH,Arches National Park,UT,76519,38.68,-109.57
BADL,Badlands National Park,SD,242756,43.75,-102.5


### Indexing: Single Rows
The simplest way to access a row is to pass its position to the `.iloc` indexer. Note that the first row is at position zero, just like list indexes.

In [4]:
df.iloc[2]   # access the third row (position 2)

Park Name    Badlands National Park
State                            SD
Acres                        242756
Latitude                      43.75
Longitude                    -102.5
Name: BADL, dtype: object

The other main approach is to pass a value from your dataframe's index to the `.loc` indexer:

In [5]:
df.loc['BADL']  # access the row whose index label is 'BADL'

Park Name    Badlands National Park
State                            SD
Acres                        242756
Latitude                      43.75
Longitude                    -102.5
Name: BADL, dtype: object
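
To make the relationship between the two indexers concrete, here is a minimal sketch. It assumes a small hand-built dataframe holding the three rows shown above instead of the full `parks.csv` file, and the names `by_position` and `by_label` are purely illustrative.

```python
import pandas as pd

# Assumption: a tiny stand-in for parks.csv, built from the rows printed above.
df = pd.DataFrame(
    {
        "Park Name": ["Acadia National Park", "Arches National Park", "Badlands National Park"],
        "State": ["ME", "UT", "SD"],
        "Acres": [47390, 76519, 242756],
    },
    index=pd.Index(["ACAD", "ARCH", "BADL"], name="Park Code"),
)

by_position = df.iloc[2]       # third row, counted from zero
by_label = df.loc["BADL"]      # row whose index label is 'BADL'

# Both lookups return the same row as a Series.
print(by_position.equals(by_label))

# The index can translate a label into its position.
print(df.index.get_loc("BADL"))
```

`.iloc` only ever looks at positions and `.loc` only ever looks at labels, which is what keeps the two unambiguous.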

### Indexing: Multiple Rows
If we need multiple rows, we can pass in a list of index values. Note that the rows come back in the order of the list, not in their original order!

In [6]:
df.loc[['BADL', 'ARCH', 'ACAD']] # access several index labels at once; rows come back in the order given

Park Code,Park Name,State,Acres,Latitude,Longitude
BADL,Badlands National Park,SD,242756,43.75,-102.5
ARCH,Arches National Park,UT,76519,38.68,-109.57
ACAD,Acadia National Park,ME,47390,44.35,-68.21


In [7]:
df.iloc[[2, 1, 0]]  # positions 2, 1, 0: the third row first, then the second, then the first

Park Code,Park Name,State,Acres,Latitude,Longitude
BADL,Badlands National Park,SD,242756,43.75,-102.5
ARCH,Arches National Park,UT,76519,38.68,-109.57
ACAD,Acadia National Park,ME,47390,44.35,-68.21


Slicing the dataframe just as if it were a list also works.

In [8]:
df[:3] # list-style slice: rows at positions 0, 1 and 2

Park Code,Park Name,State,Acres,Latitude,Longitude
ACAD,Acadia National Park,ME,47390,44.35,-68.21
ARCH,Arches National Park,UT,76519,38.68,-109.57
BADL,Badlands National Park,SD,242756,43.75,-102.5


In [9]:
df[3:6]  # rows at positions 3, 4 and 5 (the fourth through sixth rows)

Park Code,Park Name,State,Acres,Latitude,Longitude
BIBE,Big Bend National Park,TX,801163,29.25,-103.25
BISC,Biscayne National Park,FL,172924,25.65,-80.08
BLCA,Black Canyon of the Gunnison National Park,CO,32950,38.57,-107.72
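
One detail worth seeing in code is how slice endpoints behave. The sketch below assumes a hand-built four-row dataframe taken from the outputs above; `first_two` and `through_badl` are illustrative names. Positional slices exclude the stop position, while label slices with `.loc` include the stop label.

```python
import pandas as pd

# Assumption: a tiny stand-in for parks.csv, built from rows printed above.
df = pd.DataFrame(
    {
        "Park Name": [
            "Acadia National Park",
            "Arches National Park",
            "Badlands National Park",
            "Big Bend National Park",
        ],
        "State": ["ME", "UT", "SD", "TX"],
    },
    index=pd.Index(["ACAD", "ARCH", "BADL", "BIBE"], name="Park Code"),
)

first_two = df.iloc[0:2]              # positions 0 and 1; the stop position is excluded
through_badl = df.loc["ACAD":"BADL"]  # labels ACAD through BADL; the stop label is included

print(list(first_two.index))
print(list(through_badl.index))

# With this string index, a bare integer slice is treated as positional.
print(df[:2].equals(df.iloc[:2]))
```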


### Indexing: Columns
We can access a single column by placing its name in brackets like so (pass a list of names instead to select several columns):

In [10]:
df['State'].head(3)  # first three values of the single column 'State' ('Park Code' is the index name)

Park Code
ACAD    ME
ARCH    UT
BADL    SD
Name: State, dtype: object

You can also access a single column as if it were an attribute of the dataframe, but only if the name has no spaces, uses only basic characters, and doesn't share a name with a dataframe method. So, `df.State` works:

In [11]:
df.State.head(3)

Park Code
ACAD    ME
ARCH    UT
BADL    SD
Name: State, dtype: object

but `df.Park Code` will fail as there's a space in the name:

In [12]:
df.Park Code # fails: attribute access cannot handle a name that contains a space, so bracket notation is the safer habit

SyntaxError: invalid syntax (<ipython-input-12-a282baf0d56f>, line 1)

A name that contains a space can only be used by passing it as a string in brackets. In this dataframe 'Park Code' is the index rather than a column, so the column with the same problem is 'Park Name', which must be accessed as `df['Park Name']`. I recommend either always using that approach or always converting your column names into a valid format as soon as you read in the data so that you don't have to mix the two methods. It's just a bit tidier.

It's a good practice to clean your column names to prevent this sort of error. I'll use a very short cleaning expression here since the names don't have any odd characters. By convention, the names should also be converted to lower case. Pandas is case sensitive, so all later references to the columns will need to use the new names.

In [13]:
df.columns = [col.replace(' ', '_').lower() for col in df.columns]
print(df.columns)

Index(['park_name', 'state', 'acres', 'latitude', 'longitude'], dtype='object')


### Indexing: Columns and Rows
If we need to subset by both columns and rows, we can stack the commands we've already learned.

In [16]:
df[['state', 'acres']][:3]

Park Code,state,acres
ACAD,ME,47390
ARCH,UT,76519
BADL,SD,242756
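
Stacking works for reading data, but the same selection can also be written as a single call. The following is a sketch under the assumption of a small hand-built dataframe with the cleaned column names; `subset_loc` and `subset_iloc` are illustrative names.

```python
import pandas as pd

# Assumption: a tiny stand-in for the cleaned parks dataframe.
df = pd.DataFrame(
    {
        "park_name": [
            "Acadia National Park",
            "Arches National Park",
            "Badlands National Park",
            "Big Bend National Park",
        ],
        "state": ["ME", "UT", "SD", "TX"],
        "acres": [47390, 76519, 242756, 801163],
    },
    index=pd.Index(["ACAD", "ARCH", "BADL", "BIBE"], name="Park Code"),
)

# Rows first, columns second, both by label.
subset_loc = df.loc[["ACAD", "ARCH", "BADL"], ["state", "acres"]]

# The same selection by position: first three rows, columns 1 and 2.
subset_iloc = df.iloc[:3, [1, 2]]

print(subset_loc.equals(subset_iloc))
print(subset_loc.equals(df[["state", "acres"]][:3]))
```

A single `.loc` or `.iloc` call is also the form to prefer when assigning values, since assigning through stacked brackets may modify a temporary copy instead of the original dataframe.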


### Indexing: Scalar Values
As you may have noticed, everything we've tried so far returns a small dataframe or series. If you need a single value, simply pass in a single column and a single index value.

In [17]:
df.state.iloc[2] # a single column plus a single position returns one value

'SD'

Note that you will get a different return type if you pass a single value in a list: a one-element series rather than a scalar.

In [18]:
df.state.iloc[[2]]

Park Code
BADL    SD
Name: state, dtype: object
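
The difference in return types is easy to check directly. This sketch assumes a hand-built three-row dataframe with the cleaned column names; the variable names are illustrative. It also shows `.loc` and `.at` with a row label and a column name as alternative ways to fetch one value.

```python
import pandas as pd

# Assumption: a tiny stand-in for the cleaned parks dataframe.
df = pd.DataFrame(
    {
        "state": ["ME", "UT", "SD"],
        "acres": [47390, 76519, 242756],
    },
    index=pd.Index(["ACAD", "ARCH", "BADL"], name="Park Code"),
)

scalar = df.state.iloc[2]        # plain Python string
one_row = df.state.iloc[[2]]     # Series with a single element

print(type(scalar).__name__)
print(type(one_row).__name__)

# Row label and column name in one call also give a scalar.
via_loc = df.loc["BADL", "state"]
via_at = df.at["BADL", "state"]
print(via_loc == via_at == scalar)
```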

### Selecting a Subset of the Data

The main method for subsetting data in Pandas is called [boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing). First, let's take a look at what pandas does when we ask it to evaluate a boolean expression on a column:

In [19]:
(df.state == 'UT').head(3)

Park Code
ACAD    False
ARCH     True
BADL    False
Name: state, dtype: bool

We get a boolean series with one result per row. Passing that series to the dataframe in brackets gives us the subset of rows where the expression evaluates to `True`.

In [20]:
df[df.state == 'UT']  # all rows where state is 'UT'

Park Code,park_name,state,acres,latitude,longitude
ARCH,Arches National Park,UT,76519,38.68,-109.57
BRCA,Bryce Canyon National Park,UT,35835,37.57,-112.18
CANY,Canyonlands National Park,UT,337598,38.2,-109.93
CARE,Capitol Reef National Park,UT,241904,38.2,-111.17
ZION,Zion National Park,UT,146598,37.3,-113.05


Some of the logical operators differ from plain Python:
- `~` replaces `not`
- `|` replaces `or`
- `&` replaces `and`

If you combine multiple conditions, each one needs to be wrapped in parentheses. For example:

In [21]:
df[(df.latitude > 60) | (df.acres > 10**6)].head(3)

Park Code,park_name,state,acres,latitude,longitude
DENA,Denali National Park and Preserve,AK,3372402,63.33,-150.5
DEVA,Death Valley National Park,"CA, NV",4740912,36.24,-116.82
EVER,Everglades National Park,FL,1508538,25.32,-80.93
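
The example above covers `|`; the sketch below illustrates `&` and `~` in the same style. It assumes a hand-built four-row dataframe using values from the outputs above, and the variable names are illustrative. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>` in Python.

```python
import pandas as pd

# Assumption: a tiny stand-in for the cleaned parks dataframe.
df = pd.DataFrame(
    {
        "state": ["ME", "UT", "SD", "AK"],
        "acres": [47390, 76519, 242756, 3372402],
        "latitude": [44.35, 38.68, 43.75, 63.33],
    },
    index=pd.Index(["ACAD", "ARCH", "BADL", "DENA"], name="Park Code"),
)

# `&` keeps rows where both conditions hold.
big_and_northern = df[(df.latitude > 40) & (df.acres > 100_000)]

# `~` inverts a boolean series.
not_utah = df[~(df.state == "UT")]

print(list(big_and_northern.index))
print(list(not_utah.index))
```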


You can also use more complicated expressions, including lambdas.

In [22]:
df[df['park_name'].str.split().apply(lambda x: len(x) == 3)].head(3)

Park Code,park_name,state,acres,latitude,longitude
ACAD,Acadia National Park,ME,47390,44.35,-68.21
ARCH,Arches National Park,UT,76519,38.68,-109.57
BADL,Badlands National Park,SD,242756,43.75,-102.5
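
As a side note, this particular filter can also be written without a lambda by using the `.str` accessor a second time. The sketch below assumes a hand-built dataframe with three park names taken from the outputs above; `with_apply` and `with_str` are illustrative names.

```python
import pandas as pd

# Assumption: a tiny stand-in for the cleaned parks dataframe.
df = pd.DataFrame(
    {
        "park_name": [
            "Acadia National Park",
            "Big Bend National Park",
            "Black Canyon of the Gunnison National Park",
        ],
    },
    index=pd.Index(["ACAD", "BIBE", "BLCA"], name="Park Code"),
)

# The lambda version from the tutorial: True where the name has exactly three words.
with_apply = df["park_name"].str.split().apply(lambda words: len(words) == 3)

# The same test using .str.len() on the lists produced by .str.split().
with_str = df["park_name"].str.split().str.len() == 3

print(with_apply.tolist() == with_str.tolist())
print(list(df[with_str].index))
```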


### Key Companion Methods: `isin` and `isnull`
These methods make it much easier and faster to perform some very common tasks. Suppose we wanted to find all parks on the West Coast. `isin` makes that simple:

In [23]:
df[df.state.isin(['WA', 'OR', 'CA'])].head()

Park Code,park_name,state,acres,latitude,longitude
CHIS,Channel Islands National Park,CA,249561,34.01,-119.42
CRLA,Crater Lake National Park,OR,183224,42.94,-122.1
JOTR,Joshua Tree National Park,CA,789745,33.79,-115.9
LAVO,Lassen Volcanic National Park,CA,106372,40.49,-121.51
MORA,Mount Rainier National Park,WA,235625,46.85,-121.75
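
`isnull` follows the same pattern: it returns a boolean series that can be used for filtering. The sketch below is hypothetical. It uses a hand-built dataframe in which the `acres` value for ARCH has been deliberately blanked out for illustration (the outputs above show 76519 for that park), and the variable names are illustrative.

```python
import pandas as pd

# Assumption: a tiny stand-in dataframe; the missing value is artificial.
df = pd.DataFrame(
    {
        "state": ["ME", "UT", "SD"],
        "acres": [47390.0, float("nan"), 242756.0],
    },
    index=pd.Index(["ACAD", "ARCH", "BADL"], name="Park Code"),
)

missing_acres = df[df.acres.isnull()]    # rows where acres is missing
known_acres = df[df.acres.notnull()]     # rows where acres is present

print(list(missing_acres.index))
print(list(known_acres.index))
```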


### Less Common Methods
Pandas offers many more indexing methods. You should probably stick to a few of them for the sake of keeping your code readable, but it's worth knowing they exist in case you need to read other people's code or have an unusual use case:

- There are other ways to slice data with brackets. For the sake of readability, please don't use them.
- `.at` and `.iat`: like `.loc` and `.iloc` but much faster, in exchange for only accessing a single value (one row and one column) at a time.
- `.eval`: fast evaluation of a limited set of simple operators. `.query` works by calling this.
- `.ix`: deprecated indexer that tried to guess whether a key should be evaluated with `.loc` or `.iloc`. This led to a lot of subtle bugs! It was removed in pandas 1.0, so if you see this, you're looking at old code that won't work any more.
- `.get`: like `.loc`, but will return a default value if the key doesn't exist. On a series it looks up an index label; on a dataframe it looks up a column.
- `.lookup`: Not recommended. It was deprecated in pandas 1.2 and removed in 2.0.
- `.mask`: like boolean indexing, but returns a dataframe/series of the same size as the original, with every position where the boolean evaluates to `True` set to `NaN`.
- `.query`: similar to boolean indexing. Can be faster for large dataframes. Only supports a restricted set of operations; don't use it if you need `isnull()` or other dataframe methods.
- `.take`: similar to `.iloc` with a list of positions; its `axis` argument selects whether it operates on rows or columns.
- `.where`: like boolean indexing, but returns a dataframe/series of the same size as the original, with every position where the boolean evaluates to `False` set to `NaN`.
- [Multi-indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html): potentially useful for small to mid-sized hierarchical datasets. Slow on larger datasets.
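
A few of the methods in this list are easier to tell apart once you see them side by side. The sketch below assumes a hand-built three-row dataframe with the cleaned column names; the variable names and the label `'ZZZZ'` are illustrative.

```python
import pandas as pd

# Assumption: a tiny stand-in for the cleaned parks dataframe.
df = pd.DataFrame(
    {
        "state": ["ME", "UT", "SD"],
        "acres": [47390, 76519, 242756],
    },
    index=pd.Index(["ACAD", "ARCH", "BADL"], name="Park Code"),
)
acres = df["acres"]

# .where keeps values where the condition is True and fills the rest with NaN.
kept = acres.where(acres > 100_000)

# .mask is the mirror image: values where the condition is True become NaN.
hidden = acres.mask(acres > 100_000)

# .query expresses a boolean filter as a string.
utah = df.query("state == 'UT'")

# .get returns a default instead of raising when the label is missing.
fallback = acres.get("ZZZZ", default=-1)

print(kept.isna().sum(), hidden.isna().sum())
print(list(utah.index), fallback)
```

Note that `.where` and `.mask` preserve the original shape, which is the main reason to reach for them over plain boolean indexing.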