Intro to Data Structures
====

We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started. The fundamental behavior of data types, indexing, and axis labeling / alignment applies across all of the objects. To get started, import NumPy and load pandas into your namespace:

In [2]:
import pandas as pd
import numpy as np

Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless you break it explicitly.

We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in separate sections.

# Series

[`Series`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.Series.html#pandas.Series) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. The basic method to create a Series is to call:

`s = pd.Series(data, index=index)`

Here, `data` can be many different things:

> - a Python dict
> - an ndarray
> - a scalar value (like 5)

The passed **index** is a list of axis labels. Thus, this separates into a few cases depending on what **data** is:

## From ndarray

If `data` is an ndarray, **index** must be the same length as **data**. If no index is passed, one will be created having values `[0, ..., len(data) - 1]`.

In [3]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

s

a   -0.299559
b    0.467014
c   -1.288236
d    0.022349
e   -0.031241
dtype: float64

In [4]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [5]:
pd.Series(np.random.randn(5))

0    0.893753
1   -0.571587
2    1.164287
3    1.616977
4   -0.168251
dtype: float64

**Note** Starting in v0.8.0, pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time. The check is deferred for reasons that are nearly all performance-based (there are many instances in computations, like parts of GroupBy, where the index is not used).
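As a small illustration of the note above, the sketch below builds a Series with a repeated label and shows that the error only appears when an operation needing unique labels is attempted. The variable names are illustrative, and `reindex` is used here as one example of such an operation.

```python
import pandas as pd

# A Series whose index contains the label 'a' twice; construction succeeds.
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Label lookup on a duplicated label returns every matching row.
both_a = dup['a']

# reindex needs unique labels, so the exception surfaces only at this point.
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError:
    reindex_failed = True
```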

## From dict

If `data` is a dict and an **index** is passed, the values in `data` corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the sorted keys of the dict, if possible. (From pandas 0.23 on Python 3.6 and later, the dict's insertion order is kept instead of sorting.)

In [6]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}

In [7]:
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [8]:
pd.Series(d, index=['b', 'c', 'd', 'a'])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

**Note** `NaN` (not a number) is the standard missing data marker used in pandas.

## From scalar value

If `data` is a scalar value, an index must be provided. The value will be repeated to match the length of **index**.

In [9]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

## Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

In [10]:
s['a']

-0.2995590466953986

In [11]:
s['e']

-0.031241355917586703

In [12]:
s

a   -0.299559
b    0.467014
c   -1.288236
d    0.022349
e   -0.031241
dtype: float64

In [13]:
'e' in s

True

In [14]:
'f' in s

False

If a label is not contained in the index, a `KeyError` is raised:

In [15]:
# s['f']
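The cell above is commented out because it would stop the notebook. A minimal sketch of the same lookup, wrapped so the exception can be observed (using a small stand-in Series named `s_demo`, which is not part of the original session):

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0], index=['a', 'b'])

# Looking up a label that is not in the index raises KeyError.
try:
    s_demo['f']
    raised_key_error = False
except KeyError:
    raised_key_error = True
```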

Using the `get` method, a missing label will return `None` or the specified default:

In [16]:
s.get('f')

In [17]:
s.get('f', np.nan)

nan

See also the [section on attribute access](http://pandas.pydata.org/pandas-docs/version/0.20.3/indexing.html#indexing-attribute-access).

### Vectorized operations and label alignment with Series

When doing data analysis, as with raw NumPy arrays, looping through a Series value by value is usually not necessary. Series can also be passed into most NumPy methods expecting an ndarray.

In [18]:
s + s

a   -0.599118
b    0.934029
c   -2.576472
d    0.044699
e   -0.062483
dtype: float64

In [19]:
s * 2

a   -0.599118
b    0.934029
c   -2.576472
d    0.044699
e   -0.062483
dtype: float64

In [20]:
np.exp(s)

a    0.741145
b    1.595225
c    0.275757
d    1.022601
e    0.969242
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [21]:
s[1:] + s[:-1]

a         NaN
b    0.934029
c   -2.576472
d    0.044699
e         NaN
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (`NaN`). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

**Note**

In general, we chose to make the default result of operations between differently indexed objects yield the **union** of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically important information as part of a computation. You of course have the option of dropping labels with missing data via the **dropna** function.
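To make the union-then-drop behavior concrete, here is a short sketch with a small stand-in Series (`s_demo` is an illustrative name with fixed values, so the results are reproducible):

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# Labels 'a' and 'd' each appear in only one operand, so they become NaN.
total = s_demo.iloc[1:] + s_demo.iloc[:-1]

# dropna() discards the labels whose result is missing.
kept = total.dropna()
```

Here `total` keeps all four labels, while `kept` retains only `'b'` and `'c'`.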

### Name attribute

Series can also have a `name` attribute:

In [22]:
s = pd.Series(np.random.randn(5), name='something')
s

0    1.583968
1   -1.518521
2    1.160031
3   -0.174313
4   -1.919506
Name: something, dtype: float64

In [23]:
s.name

'something'

The Series `name` will be assigned automatically in many cases, in particular when taking 1D slices of DataFrame as you will see below.

New in version 0.18.0.

You can rename a Series with the [`pandas.Series.rename()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.Series.rename.html#pandas.Series.rename) method.

In [24]:
s2 = s.rename("different")

In [25]:
s2.name

'different'

Note that `s` and `s2` refer to different objects.
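A quick way to check this, sketched with stand-in names (`s_demo` and `s_renamed` are illustrative):

```python
import pandas as pd

s_demo = pd.Series([1, 2, 3], name='something')

# rename() with a scalar returns a new Series; the original keeps its name.
s_renamed = s_demo.rename('different')
same_object = s_renamed is s_demo
```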

## DataFrame

**DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- [Structured or record](http://docs.scipy.org/doc/numpy/user/basics.rec.html) ndarray
- A `Series`
- Another `DataFrame`

Along with the data, you can optionally pass **index** (row labels) and **columns** (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

### From dict of Series or dicts

The result **index** will be the **union** of the indexes of the various Series. If there are any nested dicts, these will be first converted to Series. If no columns are passed, the columns will be the sorted list of dict keys.

In [26]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

dfa = pd.DataFrame(d)
dfa

|   | one | two |
| --- | --- | --- |
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | 3.0 |
| d | NaN | 4.0 |


In [27]:
pd.DataFrame(d, index=['d', 'b', 'a'])

|   | one | two |
| --- | --- | --- |
| d | NaN | 4.0 |
| b | 2.0 | 2.0 |
| a | 1.0 | 1.0 |


In [28]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

|   | two | three |
| --- | --- | --- |
| d | 4.0 | NaN |
| b | 2.0 | NaN |
| a | 1.0 | NaN |


**Note** When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.

The row and column labels can be accessed respectively by accessing the **index** and **columns** attributes:

In [29]:
dfa.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [30]:
dfa.columns

Index(['one', 'two'], dtype='object')

### From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays. If no index is passed, the index will be `range(n)`, where `n` is the array length.

In [31]:
d = {'one' : [1., 2., 3., 4.],'two' : [4., 3., 2., 1.]}

In [32]:
pd.DataFrame(d)

|   | one | two |
| --- | --- | --- |
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |


In [33]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

|   | one | two |
| --- | --- | --- |
| a | 1.0 | 4.0 |
| b | 2.0 | 3.0 |
| c | 3.0 | 2.0 |
| d | 4.0 | 1.0 |


### From structured or record array

This case is handled identically to a dict of arrays.

In [34]:
dfB = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])  # 'S10': 10-byte string field

In [35]:
dfB[:] = [(1,2.,'Hello'), (2,3.,"World")]

pd.DataFrame(dfB)

|   | A | B | C |
| --- | --- | --- | --- |
| 0 | 1 | 2.0 | b'Hello' |
| 1 | 2 | 3.0 | b'World' |


In [36]:
pd.DataFrame(dfB, index=['first', 'second'])

|   | A | B | C |
| --- | --- | --- | --- |
| first | 1 | 2.0 | b'Hello' |
| second | 2 | 3.0 | b'World' |


In [37]:
pd.DataFrame(dfB, columns=['C', 'A', 'B'])

|   | C | A | B |
| --- | --- | --- | --- |
| 0 | b'Hello' | 1 | 2.0 |
| 1 | b'World' | 2 | 3.0 |


**Note** DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

### From a list of dicts



In [38]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

pd.DataFrame(data2)

|   | a | b | c |
| --- | --- | --- | --- |
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20.0 |


In [39]:
pd.DataFrame(data2, index=['first', 'second'])

|   | a | b | c |
| --- | --- | --- | --- |
| first | 1 | 2 | NaN |
| second | 5 | 10 | 20.0 |


In [None]:
pd.DataFrame(data2, columns=['a', 'b'])

### From a dict of tuples

You can automatically create a multi-indexed frame by passing a dictionary of tuples:

In [40]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

|   |   | a | a | a | b | b |
| --- | --- | --- | --- | --- | --- | --- |
|   |   | **b** | **a** | **c** | **a** | **b** |
| A | B | 1.0 | 4.0 | 5.0 | 8.0 | 10.0 |
| A | C | 2.0 | 3.0 | 6.0 | 7.0 | NaN |
| A | D | NaN | NaN | NaN | NaN | 9.0 |


### From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).
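A minimal sketch of this case, using illustrative names (`ser`, `frame`, `unnamed`); it also shows what happens when the Series has no name:

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='ser')

# The Series name becomes the single column label; the index is kept.
frame = pd.DataFrame(ser)

# Without a name, the column gets the default integer label 0.
unnamed = pd.DataFrame(pd.Series([1.0, 2.0]))
```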

### Missing Data

Much more will be said on this topic in the [Missing data](http://pandas.pydata.org/pandas-docs/version/0.20.3/missing_data.html#missing-data) section. To construct a DataFrame with missing data, use `np.nan` for those values which are missing. Alternatively, you may pass a `numpy.ma.MaskedArray` as the data argument to the DataFrame constructor, and its masked entries will be considered missing.
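Both routes can be sketched as follows. The column names and values are made up for illustration, and float data is used so that the masked entry can be shown as `NaN` without a dtype change:

```python
import numpy as np
import pandas as pd

# Route 1: mark a missing value directly with np.nan.
from_nan = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

# Route 2: mask entries of a NumPy masked array; masked cells become NaN.
masked = np.ma.masked_array([[1.0, 2.0], [3.0, 4.0]],
                            mask=[[False, True], [False, False]])
from_masked = pd.DataFrame(masked, columns=['p', 'q'])
```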

### Alternate Constructors

**DataFrame.from_dict**

`DataFrame.from_dict` takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the `DataFrame` constructor except for the `orient` parameter, which is `'columns'` by default but can be set to `'index'` in order to use the dict keys as row labels.

**DataFrame.from_records**

`DataFrame.from_records` takes a list of tuples or an ndarray with structured dtype. It works analogously to the normal `DataFrame` constructor, except that `index` may be a specific field of the structured dtype to use as the index. For example:

In [41]:
dfB

array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [42]:
pd.DataFrame.from_records(dfB, index='C')

| C | A | B |
| --- | --- | --- |
| b'Hello' | 1 | 2.0 |
| b'World' | 2 | 3.0 |
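The example above uses a structured ndarray. The list-of-tuples form mentioned in the text can be sketched like this, with column names supplied explicitly through `columns` (the data and names are illustrative):

```python
import pandas as pd

records = [(1, 2.0, 'Hello'), (2, 3.0, 'World')]

# Name the fields with `columns`, then promote field 'C' to the index.
frame = pd.DataFrame.from_records(records, columns=['A', 'B', 'C'], index='C')
```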


**DataFrame.from_items**

**Note** `from_items()` is deprecated; use `DataFrame.from_dict(dict(items))` instead. `DataFrame.from_dict(OrderedDict(items))` can be used to preserve the key order.

`DataFrame.from_items` works analogously to the form of the `dict` constructor that takes a sequence of `(key, value)` pairs, where the keys are column (or row, in the case of `orient='index'`) names, and the values are the column values (or row values). This can be useful for constructing a DataFrame with the columns in a particular order without having to pass an explicit list of columns:

In [43]:
pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])

|   | A | B |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


In [44]:
pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))

|   | A | B |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


If you pass `orient='index'`, the keys will be the row labels. But in this case you must also pass the desired column names:

In [45]:
pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])],
                        orient='index', columns=['one', 'two', 'three'])

  


|   | one | two | three |
| --- | --- | --- | --- |
| A | 1 | 2 | 3 |
| B | 4 | 5 | 6 |


In [46]:
pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
                        orient='index', columns=['one', 'two', 'three'])

|   | one | two | three |
| --- | --- | --- | --- |
| A | 1 | 2 | 3 |
| B | 4 | 5 | 6 |


### Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [47]:
dfa['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [48]:
dfa['three'] = dfa['one'] * dfa['two']

In [49]:
dfa['flag'] = dfa['one'] > 2

In [50]:
dfa

Columns can be deleted or popped like with a dict:

In [None]:
del dfa['two']

In [None]:
three = dfa.pop('three')

In [None]:
dfa

When inserting a scalar value, it will naturally be propagated to fill the column:

In [None]:
dfa['foo'] = 'bar'
dfa

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:

In [None]:
dfa['one_trunc'] = dfa['one'][:2]
dfa

You can insert raw ndarrays but their length must match the length of the DataFrame’s index.

By default, columns get inserted at the end. The `insert` method is available to insert at a particular location in the columns:

In [None]:
dfa.insert(1, 'bar', dfa['one'])
dfa
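Returning to the rule that a raw ndarray must match the length of the DataFrame's index, the following sketch shows one assignment that satisfies it and one that does not. The frame and column names are illustrative stand-ins, not the `dfa` object from the session above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# Length 3 matches the index, so the values are assigned positionally.
frame['ok'] = np.array([10, 20, 30])

# Length 2 does not match, so pandas refuses the assignment.
try:
    frame['bad'] = np.array([1, 2])
    length_error = False
except ValueError:
    length_error = True
```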

### Assigning New Columns in Method Chains

New in version 0.16.0.

Inspired by [dplyr’s](http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#mutate) `mutate` verb, DataFrame has an [`assign()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign) method that allows you to easily create new columns that are potentially derived from existing columns.

In [None]:
iris = pd.read_csv('iris.csv')

In [None]:
iris

In [None]:
(iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength']).head())

Above was an example of inserting a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to.

In [None]:
iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] / x['SepalLength'])).head()

`assign` **always** returns a copy of the data, leaving the original DataFrame untouched.

Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to the DataFrame at hand. This is common when using `assign` in chains of operations. For example, we can limit the DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and plot:

In [None]:
(iris.query('SepalLength > 5')
     .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
             PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
     .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))

Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly, this is the DataFrame that’s been filtered to those rows with sepal length greater than 5. The filtering happens first, and then the ratio calculations. This is an example where we didn’t have a reference to the filtered DataFrame available.

The function signature for `assign` is simply `**kwargs`. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a `Series` or NumPy array), or a function of one argument to be called on the `DataFrame`. A *copy* of the original DataFrame is returned, with the new values inserted.
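Because the iris examples depend on a local CSV file, here is a self-contained sketch of the same two calling styles. The frame and column names (`width`, `length`, `ratio`) are illustrative assumptions.

```python
import pandas as pd

frame = pd.DataFrame({'width': [2.0, 3.0], 'length': [4.0, 6.0]})

# Keyword value given as a precomputed Series.
with_value = frame.assign(ratio=frame['width'] / frame['length'])

# Keyword value given as a one-argument function, called on the DataFrame.
with_func = frame.assign(ratio=lambda x: x['width'] / x['length'])
```

Both calls return new DataFrames with a `ratio` column, and `frame` itself is left without one.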

**Warning**

Since the function signature of `assign` is `**kwargs`, a dictionary, the order of the new columns in the resulting DataFrame cannot be guaranteed to match the order you pass in. To make things predictable, items are inserted alphabetically (by key) at the end of the DataFrame.

All expressions are computed first, and then assigned, so you can’t refer to another column being assigned in the same call to `assign`. (Both points describe the pandas version documented here; from pandas 0.23 on Python 3.6 and later, keyword order is preserved and a later expression may refer to a column created earlier in the same call.) For example:

In [None]:
dfB = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
dfB[:] = [(1, 2., 'Hello'), (2, 3., "World")]
dfB = pd.DataFrame(dfB)  # build the DataFrame from the structured array
dfB

In [None]:
# Don't do this, bad reference to `C`
dfB.assign(C = lambda x: x['A'] + x['B'], D = lambda x: x['A'] + x['C'])

In [None]:
# Instead, break it into two assigns
(dfB.assign(C = lambda x: x['A'] + x['B']).assign(D = lambda x: x['A'] + x['C']))

### [Indexing / Selection](http://pandas.pydata.org/pandas-docs/version/0.20.3/dsintro.html#indexing-selection)

The basics of indexing are as follows:

| Operation                       | Syntax          | Result    |
| ------------------------------- | --------------- | --------- |
| Select column                   | `df[col]`       | Series    |
| Select row by label             | `df.loc[label]` | Series    |
| Select row by integer location  | `df.iloc[loc]`  | Series    |
| Slice rows                      | `df[5:10]`      | DataFrame |
| Select rows by boolean vector   | `df[bool_vec]`  | DataFrame |



Row selection, for example, returns a Series whose index is the columns of the DataFrame:

In [None]:
dfa.loc['b']  # select by label

In [None]:
dfa.iloc[2]  # select by integer position
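The rows of the table above can be checked one by one on a small frame. This is a sketch with made-up data; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                     index=['a', 'b', 'c'])

col = frame['one']                  # select column            -> Series
row_label = frame.loc['b']          # select row by label      -> Series
row_pos = frame.iloc[2]             # select row by position   -> Series
sliced = frame[0:2]                 # slice rows               -> DataFrame
filtered = frame[frame['one'] > 1]  # select by boolean vector -> DataFrame
```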

For a more exhaustive treatment of more sophisticated label-based indexing and slicing, see the [section on indexing](http://pandas.pydata.org/pandas-docs/version/0.20.3/indexing.html#indexing). We will address the fundamentals of reindexing / conforming to new sets of labels in the [section on reindexing](http://pandas.pydata.org/pandas-docs/version/0.20.3/basics.html#basics-reindexing).

### Data alignment and arithmetic

Operations between DataFrame objects automatically align on **both the columns and the index (row labels)**. Again, the resulting object will have the union of the column and row labels.

In [None]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

In [None]:
df

In [None]:
df2

In [None]:
df + df2

When doing an operation between DataFrame and Series, the default behavior is to align the Series **index** on the DataFrame **columns**, thus [broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) row-wise. For example:

In [None]:
df - df.iloc[0]

In the special case of working with time series data, where the DataFrame index also contains dates, the broadcasting will be column-wise:

In [None]:
index = pd.date_range('1/1/2000', periods=8)

df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

df

In [None]:
type(df['A'])

In [None]:
df - df['A']

**Warning**

```
df - df['A']
```
is now deprecated and will be removed in a future release. The preferred way to replicate this behavior is

```
df.sub(df['A'], axis=0)
```

In [None]:
df.sub(df['A'], axis=0)

For explicit control over the matching and broadcasting behavior, see the section on [flexible binary operations](http://pandas.pydata.org/pandas-docs/version/0.20.3/basics.html#basics-binop).

Operations with scalars are just as you would expect:

In [None]:
df * 5 + 2

In [None]:
1 / df

In [None]:
df ** 4

Boolean operators work as well:

In [None]:
df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)

In [None]:
df2 = pd.DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)

In [None]:
df1 & df2

In [None]:
df1 | df2

In [None]:
df1 ^ df2

In [None]:
- df1

### Transposing

To transpose, access the `T` attribute (also the `transpose` function), similar to an ndarray:

In [None]:
# only show the first 5 rows
df[:5].T

### DataFrame interoperability with NumPy functions

Elementwise NumPy ufuncs (log, exp, sqrt, ...) and various other NumPy functions can be used with no issues on DataFrame, assuming the data within are numeric:

In [None]:
np.exp(df)

In [None]:
np.asarray(df)

The `dot` method on DataFrame implements matrix multiplication:

In [None]:
df.T.dot(df)

Similarly, the `dot` method on Series implements dot product:

In [None]:
s1 = pd.Series(np.arange(5,10))

s1.dot(s1)

DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics are quite different in places from a matrix.
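One place where the semantics differ is plain `[]` indexing with an integer. The sketch below contrasts the two, using a small illustrative array and the default integer labels:

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)
frame = pd.DataFrame(arr)   # default labels: rows 0..2, columns 0..1

first_row = arr[0]          # ndarray: an integer selects a row
first_col = frame[0]        # DataFrame: [] selects the column labeled 0
same_row = frame.iloc[0]    # positional row access goes through .iloc
```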

### Console display

Very large DataFrames will be truncated when displayed in the console. You can also get a summary using [`info()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.info.html#pandas.DataFrame.info). (Here I am reading a CSV version of the **baseball** dataset from the **plyr** R package):

In [None]:
baseball = pd.read_csv('../../RDatasets/baseball.csv')
baseball

In [None]:
baseball.info()

However, using `to_string` will return a string representation of the DataFrame in tabular form, though it won’t always fit the console width:

In [None]:
print(baseball.iloc[-20:, :12].to_string())

New since 0.10.0, wide DataFrames will now be printed across multiple rows by default:

In [None]:
pd.DataFrame(np.random.randn(3, 12))

You can change how much to print on a single row by setting the `display.width` option:

In [None]:
pd.set_option('display.width', 40) # default is 80
pd.DataFrame(np.random.randn(3, 12))

You can adjust the max width of the individual columns by setting `display.max_colwidth`:

In [62]:
pd.set_option('expand_frame_repr', True)
datafile={'filename': ['filename_01','filename_02'],
          'path': ["media/user_name/storage/folder_01/filename_01",
                   "media/user_name/storage/folder_02/filename_02"]}

In [63]:
pd.set_option('display.max_colwidth',30)

pd.DataFrame(datafile)

|   | filename | path |
| --- | --- | --- |
| 0 | filename_01 | media/user_name/storage/fo... |
| 1 | filename_02 | media/user_name/storage/fo... |


In [64]:
pd.set_option('display.max_colwidth',100)

pd.DataFrame(datafile)

|   | filename | path |
| --- | --- | --- |
| 0 | filename_01 | media/user_name/storage/folder_01/filename_01 |
| 1 | filename_02 | media/user_name/storage/folder_02/filename_02 |


You can also disable the wrapping of wide DataFrames across multiple rows via the `expand_frame_repr` option. This will print the table in one block.
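The display options above are global. One way to try them without leaving them changed is `pd.option_context`, sketched below on an illustrative 12-column frame. Setting `display.max_columns` to `None` is an assumption made here so that no columns are elided and only the wrapping behavior differs between the two blocks.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.zeros((3, 12)))
width_before = pd.get_option('display.width')

# Wrapping disabled: the whole frame is rendered as one block.
with pd.option_context('display.max_columns', None,
                       'display.expand_frame_repr', False):
    one_block = repr(wide)

# Wrapping enabled with a narrow width: the frame is split into chunks.
with pd.option_context('display.max_columns', None,
                       'display.expand_frame_repr', True,
                       'display.width', 40):
    wrapped = repr(wide)

# option_context restores the previous settings when each block exits.
width_after = pd.get_option('display.width')
```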

### DataFrame column attribute access and IPython completion

If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:

In [69]:
df = pd.DataFrame({'foo1' : np.random.randn(5),
                   'foo2' : np.random.randn(5)})

In [70]:
df

|   | foo1 | foo2 |
| --- | --- | --- |
| 0 | 0.058424 | 0.264921 |
| 1 | 0.688179 | -0.046362 |
| 2 | 1.382438 | 0.389125 |
| 3 | 0.640406 | -0.429875 |
| 4 | 1.643994 | 1.126769 |


In [71]:
df.foo1

0    0.058424
1    0.688179
2    1.382438
3    0.640406
4    1.643994
Name: foo1, dtype: float64

The columns are also connected to the IPython completion mechanism so they can be tab-completed:

In [77]:
df.fo<TAB>  # type this at the IPython prompt and press Tab; it is not runnable code

## [Panel](http://pandas.pydata.org/pandas-docs/version/0.20.3/dsintro.html#panel)

**Warning** In 0.20.0, `Panel` is deprecated and will be removed in a future version. See the section [Deprecate Panel](http://pandas.pydata.org/pandas-docs/version/0.20.3/dsintro.html#dsintro-deprecate-panel).

Panel is a somewhat less-used, but still important container for 3-dimensional data. The term [panel data](http://en.wikipedia.org/wiki/Panel_data) is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data and, in particular, econometric analysis of panel data. However, for the strict purposes of slicing and dicing a collection of DataFrame objects, you may find the axis names slightly arbitrary:

> - **items**: axis 0, each item corresponds to a DataFrame contained inside
> - **major_axis**: axis 1, it is the **index** (rows) of each of the DataFrames
> - **minor_axis**: axis 2, it is the **columns** of each of the DataFrames

Construction of Panels works about like you would expect.

## Deprecate Panel

Over the last few years, pandas has increased in both breadth and depth, with new features, datatype support, and manipulation routines. As a result, supporting efficient indexing and functional routines for `Series`, `DataFrame` and `Panel` has contributed to an increasingly fragmented and difficult-to-understand codebase.

The 3-D structure of a `Panel` is much less common for many types of data analysis than the 1-D of the `Series` or the 2-D of the `DataFrame`. Going forward it makes sense for pandas to focus on these areas exclusively.

Oftentimes, one can simply use a MultiIndex `DataFrame` for easily working with higher dimensional data.

In addition, the `xarray` package was built from the ground up, specifically in order to support the multi-dimensional analysis that is one of `Panel`'s main use cases. [Here is a link to the xarray panel-transition documentation](http://xarray.pydata.org/en/stable/pandas.html#panel-transition).

You can see the full-documentation for the [xarray package](http://xarray.pydata.org/en/stable/).
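As a rough sketch of the MultiIndex approach mentioned above, three-dimensional data (items by dates by fields) can be held in a DataFrame whose row index has two levels. The item names, dates, and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D data: 2 items x 3 dates x 2 fields.
items = ['item1', 'item2']
dates = pd.date_range('2000-01-01', periods=3)
values = np.arange(12.0).reshape(6, 2)

# Two index levels stand in for the Panel's items and major_axis.
index = pd.MultiIndex.from_product([items, dates], names=['item', 'date'])
stacked = pd.DataFrame(values, index=index, columns=['A', 'B'])

# Selecting one item recovers an ordinary 2-D DataFrame indexed by date.
item1 = stacked.xs('item1', level='item')
```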

In [84]:
import statsmodels.formula.api as sm
bb = pd.read_csv('../../RDatasets/baseball.csv', index_col='id')
bb.head()

| id | Unnamed: 0 | year | stint | team | lg | g | ab | r | h | X2b | ... | rbi | sb | cs | bb | so | ibb | hbp | sh | sf | gidp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ansonca01 | 4 | 1871 | 1 | RC1 | NaN | 25 | 120 | 29 | 39 | 11 | ... | 16.0 | 6.0 | 2.0 | 2 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| forceda01 | 44 | 1871 | 1 | WS3 | NaN | 32 | 162 | 45 | 45 | 9 | ... | 29.0 | 8.0 | 0.0 | 4 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| mathebo01 | 68 | 1871 | 1 | FW1 | NaN | 19 | 89 | 15 | 24 | 3 | ... | 10.0 | 2.0 | 1.0 | 2 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| startjo01 | 99 | 1871 | 1 | NY2 | NaN | 33 | 161 | 35 | 58 | 5 | ... | 34.0 | 4.0 | 2.0 | 3 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| suttoez01 | 102 | 1871 | 1 | CL1 | NaN | 29 | 128 | 35 | 45 | 3 | ... | 23.0 | 3.0 | 1.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN |


In [83]:
(bb.query('h > 0')
 .assign(ln_h = lambda df: np.log(df.h))
 .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
 .fit()
 .summary()
)

Optimization terminated successfully.
         Current function value: 2.948624
         Iterations 10


|   |   |   |   |
| --- | --- | --- | --- |
| Dep. Variable: | hr | No. Observations: | 18236.0 |
| Model: | Poisson | Df Residuals: | 18227.0 |
| Method: | MLE | Df Model: | 8.0 |
| Date: | Thu, 04 Oct 2018 | Pseudo R-squ.: | 0.5616 |
| Time: | 09:21:10 | Log-Likelihood: | -53771.0 |
| converged: | True | LL-Null: | -122650.0 |
|   |   | LLR p-value: | 0.0 |

|   | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -30.0945 | 0.219 | -137.475 | 0.000 | -30.524 | -29.665 |
| C(lg)[T.AL] | 0.0118 | 0.052 | 0.228 | 0.819 | -0.089 | 0.113 |
| C(lg)[T.FL] | -0.1573 | 0.125 | -1.254 | 0.210 | -0.403 | 0.089 |
| C(lg)[T.NL] | -0.0079 | 0.052 | -0.153 | 0.878 | -0.109 | 0.093 |
| C(lg)[T.PL] | 0.2712 | 0.096 | 2.830 | 0.005 | 0.083 | 0.459 |
| C(lg)[T.UA] | -0.4598 | 0.381 | -1.206 | 0.228 | -1.207 | 0.288 |
| ln_h | 1.1199 | 0.012 | 95.898 | 0.000 | 1.097 | 1.143 |
| year | 0.0136 | 0.000 | 125.931 | 0.000 | 0.013 | 0.014 |
| g | 0.0028 | 0.000 | 13.533 | 0.000 | 0.002 | 0.003 |
