Descriptive statistics
====

There exists a large number of methods for computing descriptive statistics and other related operations on [Series](http://pandas.pydata.org/pandas-docs/version/0.20.3/api.html#api-series-stats), [DataFrame](http://pandas.pydata.org/pandas-docs/version/0.20.3/api.html#api-dataframe-stats), and [Panel](http://pandas.pydata.org/pandas-docs/version/0.20.3/api.html#api-panel-stats). Most of these are aggregations (hence producing a lower-dimensional result) like [`sum()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.sum.html#pandas.DataFrame.sum), [`mean()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.mean.html#pandas.DataFrame.mean), and [`quantile()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.quantile.html#pandas.DataFrame.quantile), but some of them, like [`cumsum()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.cumsum.html#pandas.DataFrame.cumsum) and [`cumprod()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.cumprod.html#pandas.DataFrame.cumprod), produce an object of the same size. Generally speaking, these methods take an **axis** argument, just like *ndarray.{sum, std, ...}*, but the axis can be specified by name or integer:

> - **Series**: no axis argument needed
> - **DataFrame**: “index” (axis=0, default), “columns” (axis=1)
> - **Panel**: “items” (axis=0), “major” (axis=1, default), “minor” (axis=2)

For example:


In [7]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
    
df

|     | one       | two      | three     |
| --- | --------- | -------- | --------- |
| a   | -0.213832 | 2.231999 | NaN       |
| b   | 0.440781  | 0.917159 | 1.161711  |
| c   | -0.623836 | 1.168    | -1.982916 |
| d   | NaN       | 0.681627 | 0.345367  |


df.mean(0)

In [8]:
df.mean(1)

a    1.009084
b    0.839884
c   -0.479584
d    0.513497
dtype: float64
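
The axis can also be given by name instead of by integer. The following is a minimal sketch on a hypothetical frame with fixed values (the name `frame` and its contents are assumptions made for reproducibility, not part of the example above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with fixed values so the results are reproducible.
frame = pd.DataFrame(
    {"one": [1.0, 2.0, np.nan], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# Aggregate down the index: one result per column.
col_means_by_int = frame.mean(axis=0)
col_means_by_name = frame.mean(axis="index")

# Aggregate across the columns: one result per row label.
row_means_by_int = frame.mean(axis=1)
row_means_by_name = frame.mean(axis="columns")

print(col_means_by_name)
print(row_means_by_name)
```

Both spellings should give identical results; the named form tends to be easier to read in longer expressions.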

All such methods have a skipna option signaling whether to exclude missing data (True by default):


In [9]:
df.sum(0, skipna=False)

one           NaN
two      4.998785
three         NaN
dtype: float64

In [10]:
df.sum(axis=1, skipna=True)

a    2.018167
b    2.519651
c   -1.438753
d    1.026994
dtype: float64
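
To make the effect of `skipna` easier to see in isolation, here is a small sketch on a hypothetical three-element Series (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical Series containing one missing value.
values = pd.Series([1.0, np.nan, 3.0])

# skipna=True is the default, so the NaN is ignored.
total_skipping = values.sum()

# With skipna=False the NaN propagates into the result.
total_strict = values.sum(skipna=False)

print(total_skipping)
print(total_strict)
```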

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rescaling data to zero mean and unit standard deviation), very concisely:


In [13]:
ts_stand = (df - df.mean()) / df.std()

ts_stand

|     | one       | two       | three     |
| --- | --------- | --------- | --------- |
| a   | -0.151845 | 1.435446  | NaN       |
| b   | 1.067238  | -0.485939 | 0.809137  |
| c   | -0.915394 | -0.119384 | -1.117992 |
| d   | NaN       | -0.830124 | 0.308855  |


In [12]:
ts_stand.std()

one      1.0
two      1.0
three    1.0
dtype: float64

In [14]:
xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [15]:
xs_stand.std(1)

a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
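
The two standardizations above differ in how the per-column or per-row statistics are aligned with the frame. The sketch below repeats both on a hypothetical frame with fixed values (the data and variable names are assumptions) and also checks that the resulting means are approximately zero:

```python
import numpy as np
import pandas as pd

# Hypothetical data; every row and column has at least two non-missing values.
frame = pd.DataFrame(
    {
        "one": [1.0, 2.0, 3.0, np.nan],
        "two": [2.0, 4.0, 6.0, 8.0],
        "three": [5.0, 1.0, 0.0, 1.0],
    },
    index=["a", "b", "c", "d"],
)

# Column-wise: frame.mean() is indexed by column name, and plain arithmetic
# aligns a Series with the frame's columns, so no axis argument is needed.
col_stand = (frame - frame.mean()) / frame.std()

# Row-wise: frame.mean(axis=1) is indexed by row label, so sub/div must be
# told to align on the index with axis=0.
row_stand = frame.sub(frame.mean(axis=1), axis=0).div(frame.std(axis=1), axis=0)

print(col_stand.mean())
print(row_stand.mean(axis=1))
```

Missing values stay missing after standardization because the arithmetic propagates `NaN` element-wise.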

Note that methods like [`cumsum()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.cumsum.html#pandas.DataFrame.cumsum) and [`cumprod()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.cumprod.html#pandas.DataFrame.cumprod) preserve the location of `NaN` values. This is somewhat different from [`expanding()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.expanding.html#pandas.DataFrame.expanding) and [`rolling()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.rolling.html#pandas.DataFrame.rolling). For more details please see [this note](http://pandas.pydata.org/pandas-docs/version/0.20.3/computation.html#stats-moments-expanding-note).


In [16]:
df.cumsum()

|     | one       | two      | three     |
| --- | --------- | -------- | --------- |
| a   | -0.213832 | 2.231999 | NaN       |
| b   | 0.226949  | 3.149159 | 1.161711  |
| c   | -0.396887 | 4.317158 | -0.821205 |
| d   | NaN       | 4.998785 | -0.475838 |
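
The difference between `cumsum()` and `expanding()` in their handling of `NaN` can be illustrated with a short sketch. The Series here is hypothetical, and `expanding()` is used with its default `min_periods`:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a missing value in the middle.
values = pd.Series([1.0, np.nan, 2.0])

# cumsum keeps NaN at the position where the input was NaN,
# but continues accumulating afterwards.
running = values.cumsum()

# expanding().sum() reports the sum of the non-missing values seen so far,
# so the second position is filled rather than NaN.
expanded = values.expanding().sum()

print(running)
print(expanded)
```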


Here is a quick reference summary table of common functions. Each also takes an optional `level` parameter which applies only if the object has a [hierarchical index](http://pandas.pydata.org/pandas-docs/version/0.20.3/advanced.html#advanced-hierarchical).


| Function   | Description                                |
| ---------- | ------------------------------------------ |
| `count`    | Number of non-null observations            |
| `sum`      | Sum of values                              |
| `mean`     | Mean of values                             |
| `mad`      | Mean absolute deviation                    |
| `median`   | Arithmetic median of values                |
| `min`      | Minimum                                    |
| `max`      | Maximum                                    |
| `mode`     | Mode                                       |
| `abs`      | Absolute Value                             |
| `prod`     | Product of values                          |
| `std`      | Bessel-corrected sample standard deviation |
| `var`      | Unbiased variance                          |
| `sem`      | Standard error of the mean                 |
| `skew`     | Sample skewness (3rd moment)               |
| `kurt`     | Sample kurtosis (4th moment)               |
| `quantile` | Sample quantile (value at %)               |
| `cumsum`   | Cumulative sum                             |
| `cumprod`  | Cumulative product                         |
| `cummax`   | Cumulative maximum                         |
| `cummin`   | Cumulative minimum                         |

Note that some NumPy functions, like `mean`, `std`, and `sum`, exclude NAs by default when given a Series, whereas the same call on the underlying ndarray does not:


In [17]:
np.mean(df['one'])

-0.13229579103157632

In [18]:
np.mean(df['one'].values)

nan

`Series` also has a method [`nunique()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.Series.nunique.html#pandas.Series.nunique) which will return the number of unique non-null values:


In [19]:
series = pd.Series(np.random.randn(500))

In [20]:
series[20:500] = np.nan

In [21]:
series[10:20]  = 5

In [22]:
series.nunique()

11
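
Because the example above relies on random data, a deterministic sketch may make the counting rule clearer. The Series below is hypothetical; the `dropna` parameter of `nunique()` controls whether `NaN` is counted as a value:

```python
import numpy as np
import pandas as pd

# Hypothetical Series: two distinct non-null values plus repeated NaN.
values = pd.Series([1.0, 2.0, 2.0, np.nan, np.nan])

# Default: missing values are not counted.
distinct_non_null = values.nunique()

# With dropna=False, NaN counts as one additional distinct value.
distinct_with_nan = values.nunique(dropna=False)

print(distinct_non_null, distinct_with_nan)
```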