Descriptive statistics
====

pandas provides a large number of methods for computing descriptive statistics and other related operations on [Series](http://pandas.pydata.org/pandas-docs/version/0.20.3/api.html#api-series-stats), [DataFrame](http://pandas.pydata.org/pandas-docs/version/0.20.3/api.html#api-dataframe-stats), and [Panel](http://pandas.pydata.org/pandas-docs/version/0.20.3/api.html#api-panel-stats). Most of these are aggregations (hence producing a lower-dimensional result) like [`sum()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.sum.html#pandas.DataFrame.sum), [`mean()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.mean.html#pandas.DataFrame.mean), and [`quantile()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.quantile.html#pandas.DataFrame.quantile), but some of them, like [`cumsum()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.cumsum.html#pandas.DataFrame.cumsum) and [`cumprod()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.cumprod.html#pandas.DataFrame.cumprod), produce an object of the same size. Generally speaking, these methods take an **axis** argument, just like *ndarray.{sum, std, ...}*, but the axis can be specified by name or integer:

> - **Series**: no axis argument needed
> - **DataFrame**: “index” (axis=0, default), “columns” (axis=1)
> - **Panel**: “items” (axis=0), “major” (axis=1, default), “minor” (axis=2)

For example:

In [75]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
    
df

|   | one       | two       | three     |
| - | --------- | --------- | --------- |
| a | -1.381778 | 1.233242  | NaN       |
| b | 1.26801   | 0.709955  | 1.121913  |
| c | 0.94486   | -0.122784 | -0.555194 |
| d | NaN       | -0.00108  | 0.127829  |


df.mean(0)

In [76]:
df.mean(1)

a   -0.074268
b    1.033293
c    0.088961
d    0.063374
dtype: float64
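
To illustrate that the axis can be given by name as well as by integer, here is a minimal sketch. The frame `demo` and its values are made up for this illustration, and the sketch assumes a pandas version that accepts the string names `"index"` and `"columns"`.

```python
import numpy as np
import pandas as pd

# Small hand-built frame; `demo` is an illustrative name, not part of the text above.
demo = pd.DataFrame({"one": [1.0, 2.0, np.nan], "two": [4.0, 5.0, 6.0]},
                    index=["a", "b", "c"])

# axis="index" (equivalent to axis=0) aggregates down the rows: one value per column.
col_means = demo.mean(axis="index")

# axis="columns" (equivalent to axis=1) aggregates across the columns: one value per row.
row_means = demo.mean(axis="columns")

print(col_means)
print(row_means)
```

The named form tends to read more clearly in code that mixes row-wise and column-wise reductions.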

All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [77]:
df.sum(0, skipna=False)

one           NaN
two      1.819333
three         NaN
dtype: float64

In [78]:
df.sum(axis=1, skipna=True)

a   -0.148536
b    3.099878
c    0.266882
d    0.126748
dtype: float64
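
Because the frame above is random, the effect of `skipna` may be easier to see on fixed values. The following sketch uses a made-up frame (`demo` is a hypothetical name) with one missing entry.

```python
import numpy as np
import pandas as pd

# Illustrative data: column "x" has one missing value, column "y" has none.
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [1.0, 2.0, 3.0]})

# Default (skipna=True): the NaN in "x" is ignored, so the sum is 1.0 + 3.0.
with_skip = demo.sum()

# skipna=False: any NaN in a column makes that column's result NaN.
without_skip = demo.sum(skipna=False)

print(with_skip)
print(without_skip)
```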

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:

In [79]:
ts_stand = (df - df.mean()) / df.std()

ts_stand

|   | one       | two       | three     |
| - | --------- | --------- | --------- |
| a | -1.147466 | 1.224409  | NaN       |
| b | 0.685501  | 0.401297  | 1.055789  |
| c | 0.461965  | -0.908571 | -0.932842 |
| d | NaN       | -0.717135 | -0.122947 |


In [80]:
ts_stand.std()

one      1.0
two      1.0
three    1.0
dtype: float64

In [81]:
xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [82]:
xs_stand.std(1)

a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
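
The two standardizations above can be checked on fixed data. This is a sketch with made-up values (`demo`, `by_column` and `by_row` are names chosen for the illustration). The column-wise form relies on the default alignment of a Series against the columns, while the row-wise form needs `axis=0` so that the per-row statistics are matched against the index.

```python
import numpy as np
import pandas as pd

# Illustrative frame without missing values, so every row and column has a defined std.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                     "b": [2.0, 4.0, 9.0],
                     "c": [5.0, 1.0, 4.0]})

# Column-wise: demo.mean() is indexed by column name and broadcasts over the rows.
by_column = (demo - demo.mean()) / demo.std()

# Row-wise: demo.mean(axis=1) is indexed by row label, so align it on axis=0.
by_row = demo.sub(demo.mean(axis=1), axis=0).div(demo.std(axis=1), axis=0)

print(by_column.std())       # each column has standard deviation 1 (up to rounding)
print(by_row.std(axis=1))    # each row has standard deviation 1 (up to rounding)
```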

Note that methods like [`cumsum()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.cumsum.html#pandas.DataFrame.cumsum) and [`cumprod()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.cumprod.html#pandas.DataFrame.cumprod) preserve the location of `NaN` values. This is somewhat different from [`expanding()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.expanding.html#pandas.DataFrame.expanding) and [`rolling()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.rolling.html#pandas.DataFrame.rolling). For more details please see [this note](http://pandas.pydata.org/pandas-docs/version/0.20.3/computation.html#stats-moments-expanding-note).

In [83]:
df.cumsum()

|   | one       | two      | three    |
| - | --------- | -------- | -------- |
| a | -1.381778 | 1.233242 | NaN      |
| b | -0.113768 | 1.943197 | 1.121913 |
| c | 0.831092  | 1.820413 | 0.566719 |
| d | NaN       | 1.819333 | 0.694548 |
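
The difference between `cumsum()` and `expanding()` in their treatment of `NaN` can be sketched on a three-element Series. The values are made up, and the sketch assumes the default `min_periods=1` for `expanding()`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# cumsum keeps NaN where the input has NaN, but still skips it in the running total.
cumulative = s.cumsum()

# expanding().sum() reports the sum of the non-missing values seen so far,
# so the NaN position is filled with the running total instead.
expanding_sum = s.expanding().sum()

print(cumulative)
print(expanding_sum)
```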


Here is a quick reference summary table of common functions. Each also takes an optional `level` parameter which applies only if the object has a [hierarchical index](http://pandas.pydata.org/pandas-docs/version/0.20.3/advanced.html#advanced-hierarchical).

| Function   | Description                                |
| ---------- | ------------------------------------------ |
| `count`    | Number of non-null observations            |
| `sum`      | Sum of values                              |
| `mean`     | Mean of values                             |
| `mad`      | Mean absolute deviation                    |
| `median`   | Arithmetic median of values                |
| `min`      | Minimum                                    |
| `max`      | Maximum                                    |
| `mode`     | Mode                                       |
| `abs`      | Absolute Value                             |
| `prod`     | Product of values                          |
| `std`      | Bessel-corrected sample standard deviation |
| `var`      | Unbiased variance                          |
| `sem`      | Standard error of the mean                 |
| `skew`     | Sample skewness (3rd moment)               |
| `kurt`     | Sample kurtosis (4th moment)               |
| `quantile` | Sample quantile (value at %)               |
| `cumsum`   | Cumulative sum                             |
| `cumprod`  | Cumulative product                         |
| `cummax`   | Cumulative maximum                         |
| `cummin`   | Cumulative minimum                         |
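
As a sketch of what "Bessel-corrected" and "unbiased" mean in the table, the following compares the pandas defaults (`ddof=1`) with the population versions (`ddof=0`) on a small made-up sample, and checks that `sem` matches the sample standard deviation divided by the square root of the sample size.

```python
import numpy as np
import pandas as pd

# Illustrative sample with mean 5 and sum of squared deviations 32.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = s.std()              # divides by n - 1 (ddof=1, the pandas default)
population_std = s.std(ddof=0)    # divides by n
sample_var = s.var()              # 32 / 7
standard_error = s.sem()          # sample_std / sqrt(n)

print(sample_std, population_std, sample_var, standard_error)
```

Note that NumPy's `np.std` defaults to `ddof=0`, so the two libraries give different results on the same data unless `ddof` is set explicitly.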

Note that by chance some NumPy methods, like `mean`, `std`, and `sum`, will exclude NAs on Series input by default:

In [84]:
np.mean(df['one'])

0.27703058675622944

In [85]:
np.mean(df['one'].values)

nan
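
The same contrast can be reproduced on fixed values. In this sketch (made-up data) `np.nanmean` is shown as one way to skip missing values when only a plain array is available.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Called on a Series, np.mean gives the NaN-skipping result, as in the text above.
via_series = np.mean(s)

# Called on the underlying ndarray, the NaN propagates.
via_array = np.mean(s.to_numpy())

# np.nanmean ignores NaN explicitly and works on plain arrays.
via_nanmean = np.nanmean(s.to_numpy())

print(via_series, via_array, via_nanmean)
```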

`Series` also has a method [`nunique()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.Series.nunique.html#pandas.Series.nunique) which will return the number of unique non-null values:

In [86]:
series = pd.Series(np.random.randn(500))

In [87]:
series[20:500] = np.nan

In [88]:
series[10:20]  = 5

In [89]:
series.nunique()

11
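
A smaller deterministic sketch of the same idea follows. It also uses the `dropna` parameter of `nunique()` to count `NaN` as a value of its own. The data are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, 2.0, np.nan, np.nan])

# Default: missing values are not counted, so the distinct values are 1.0 and 2.0.
distinct_non_null = s.nunique()

# dropna=False: NaN is counted once as an additional distinct value.
distinct_with_nan = s.nunique(dropna=False)

print(distinct_non_null, distinct_with_nan)
```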

**Summarizing data: describe**

There is a convenient [`describe()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.describe.html#pandas.DataFrame.describe) function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course):

In [90]:
series = pd.Series(np.random.randn(1000))

series[::2] = np.nan

series.describe()

count    500.000000
mean      -0.011714
std        0.968774
min       -2.691227
25%       -0.676614
50%       -0.025075
75%        0.647565
max        3.271458
dtype: float64

In [91]:
frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

frame.iloc[::2] = np.nan

frame.head()

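|   | a        | b        | c        | d        | e        |
| - | -------- | -------- | -------- | -------- | -------- |
| 0 | NaN      | NaN      | NaN      | NaN      | NaN      |
| 1 | 0.577046 | 0.923189 | 0.416258 | 2.401154 | 0.178065 |
| 2 | NaN      | NaN      | NaN      | NaN      | NaN      |
| 3 | 0.293774 | 0.864399 | 1.728471 | 1.751394 | 0.421959 |
| 4 | NaN      | NaN      | NaN      | NaN      | NaN      |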


In [92]:
frame.describe()

|       | a         | b         | c         | d         | e         |
| ----- | --------- | --------- | --------- | --------- | --------- |
| count | 500.0     | 500.0     | 500.0     | 500.0     | 500.0     |
| mean  | -0.033734 | -0.055907 | -0.066803 | -0.070475 | -0.002335 |
| std   | 1.021168  | 0.974215  | 0.999073  | 0.990371  | 1.016789  |
| min   | -3.009499 | -2.635734 | -2.965073 | -2.828918 | -3.173394 |
| 25%   | -0.712264 | -0.711691 | -0.759921 | -0.718786 | -0.676743 |
| 50%   | -0.014027 | -0.030214 | -0.091229 | -0.021212 | -0.014233 |
| 75%   | 0.618106  | 0.593661  | 0.600275  | 0.553915  | 0.720832  |
| max   | 3.363386  | 3.003935  | 2.719365  | 2.731559  | 3.132681  |


You can select specific percentiles to include in the output:

In [93]:
series.describe(percentiles=[.05, .25, .75, .95])

count    500.000000
mean      -0.011714
std        0.968774
min       -2.691227
5%        -1.607808
25%       -0.676614
50%       -0.025075
75%        0.647565
95%        1.626051
max        3.271458
dtype: float64

By default, the median is always included.
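
A short sketch of this behaviour: only the 10th and 90th percentiles are requested, yet the 50% row still appears in the result. The Series values are made up for the illustration.

```python
import pandas as pd

# Illustrative data: the integers 1..100 as floats.
s = pd.Series(range(1, 101), dtype="float64")

# Request two percentiles; the median (50%) is added automatically.
summary = s.describe(percentiles=[0.1, 0.9])

print(summary)
```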

For a non-numerical Series object, [`describe()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.Series.describe.html#pandas.Series.describe) will give a simple summary of the number of unique values and most frequently occurring values:

In [94]:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

s

0      a
1      a
2      b
3      b
4      a
5      a
6    NaN
7      c
8      d
9      a
dtype: object

In [95]:
s.describe()

count     9
unique    4
top       a
freq      5
dtype: object

**Note** that on a mixed-type DataFrame object, [`describe()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.describe.html#pandas.DataFrame.describe) will restrict the summary to include only numerical columns or, if none are, only categorical columns:

In [96]:
frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
frame

|   | a   | b |
| - | --- | - |
| 0 | Yes | 0 |
| 1 | Yes | 1 |
| 2 | No  | 2 |
| 3 | No  | 3 |


In [97]:
frame.describe()

|       | b        |
| ----- | -------- |
| count | 4.0      |
| mean  | 1.5      |
| std   | 1.290994 |
| min   | 0.0      |
| 25%   | 0.75     |
| 50%   | 1.5      |
| 75%   | 2.25     |
| max   | 3.0      |


This behaviour can be controlled by providing a list of types as `include`/`exclude` arguments. The special value `all` can also be used:

In [98]:
frame.describe(include=['object'])

|        | a  |
| ------ | -- |
| count  | 4  |
| unique | 2  |
| top    | No |
| freq   | 2  |


In [99]:
frame.describe(include=['number'])

|       | b        |
| ----- | -------- |
| count | 4.0      |
| mean  | 1.5      |
| std   | 1.290994 |
| min   | 0.0      |
| 25%   | 0.75     |
| 50%   | 1.5      |
| 75%   | 2.25     |
| max   | 3.0      |


In [100]:
frame.describe(include='all')

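|        | a   | b        |
| ------ | --- | -------- |
| count  | 4   | 4.0      |
| unique | 2   | NaN      |
| top    | No  | NaN      |
| freq   | 2   | NaN      |
| mean   | NaN | 1.5      |
| std    | NaN | 1.290994 |
| min    | NaN | 0.0      |
| 25%    | NaN | 0.75     |
| 50%    | NaN | 1.5      |
| 75%    | NaN | 2.25     |
| max    | NaN | 3.0      |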


That feature relies on [select_dtypes](http://pandas.pydata.org/pandas-docs/version/0.20.3/basics.html#basics-selectdtypes). Refer to that section for details about accepted inputs.
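
As a rough sketch of that relationship, the column selection made by `include`/`exclude` can be previewed with `select_dtypes` directly. The frame mirrors the one used above; the variable names are chosen for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

# The same selectors that describe() accepts can be passed to select_dtypes().
numeric_only = frame.select_dtypes(include=["number"])
non_numeric = frame.select_dtypes(exclude=["number"])

# Excluding numeric columns leaves only "a", summarized by count/unique/top/freq.
summary = frame.describe(exclude=["number"])

print(list(numeric_only.columns), list(non_numeric.columns))
print(summary)
```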

**Index of Min/Max Values**

The [`idxmin()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.idxmin.html#pandas.DataFrame.idxmin) and [`idxmax()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.idxmax.html#pandas.DataFrame.idxmax) functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

In [101]:
s1 = pd.Series(np.random.randn(5))
s1

0    0.823389
1   -0.255007
2   -0.901245
3   -0.367401
4   -0.551826
dtype: float64

In [102]:
s1.idxmin(), s1.idxmax()

(2, 0)

In [103]:
df1 = pd.DataFrame(np.random.randn(5,3), columns=['A','B','C'])

df1

|   | A         | B         | C         |
| - | --------- | --------- | --------- |
| 0 | -1.68281  | 0.555216  | -1.117237 |
| 1 | -0.921893 | 0.41087   | -0.494201 |
| 2 | 1.426556  | -0.859601 | 0.499369  |
| 3 | 0.3963    | 2.49177   | 0.301529  |
| 4 | 2.357826  | 0.763612  | 1.613856  |


In [104]:
df1.idxmin(axis=0)

A    0
B    2
C    0
dtype: int64

In [105]:
df1.idxmax(axis=1)

0    B
1    B
2    A
3    B
4    A
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, [`idxmin()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.idxmin.html#pandas.DataFrame.idxmin) and [`idxmax()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.idxmax.html#pandas.DataFrame.idxmax) return the first matching index:

In [106]:
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))
df3

|   | A   |
| - | --- |
| e | 2.0 |
| d | 1.0 |
| c | 1.0 |
| b | 3.0 |
| a | NaN |


In [107]:
df3['A'].idxmin()

'd'

**Note** `idxmin` and `idxmax` are called `argmin` and `argmax` in NumPy.
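
One difference worth keeping in mind is that the pandas methods return index labels, whereas the NumPy functions return integer positions. The sketch below reuses the values of `df3` and uses `np.nanargmin` as the NaN-skipping NumPy counterpart; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 1, 1, 3, np.nan], index=list("edcba"))

# pandas: label of the first minimum, with NaN skipped.
label = s.idxmin()

# NumPy: integer position of the first minimum, with NaN ignored.
position = int(np.nanargmin(s.to_numpy()))

print(label, position, s.index[position])
```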

**Value counts (histogramming) / Mode**

The [`value_counts()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.Series.value_counts.html#pandas.Series.value_counts) Series method and top-level function computes a histogram of a 1D array of values. It can also be used as a function on regular arrays:

In [108]:
data = np.random.randint(0, 7, size=50)
data

array([3, 0, 4, 2, 5, 6, 5, 3, 3, 1, 2, 6, 0, 4, 0, 2, 6, 4, 2, 5, 4, 6,
       2, 4, 2, 3, 3, 4, 0, 1, 5, 0, 6, 5, 5, 3, 5, 1, 0, 5, 1, 2, 5, 3,
       3, 2, 0, 5, 1, 4])

In [109]:
s = pd.Series(data)
s.value_counts()

5    10
3     8
2     8
4     7
0     7
6     5
1     5
dtype: int64

In [110]:
pd.value_counts(data)

5    10
3     8
2     8
4     7
0     7
6     5
1     5
dtype: int64
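
Since the data above are random, here is a deterministic sketch that uses the Series method only. It also shows the `normalize` option, which returns relative frequencies instead of counts. The values are made up.

```python
import pandas as pd

s = pd.Series([3, 3, 3, 1, 1, 2])

# Counts are sorted in descending order by default; the index holds the values.
counts = s.value_counts()

# normalize=True divides each count by the number of observations.
shares = s.value_counts(normalize=True)

print(counts)
print(shares)
```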

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:

In [111]:
s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

s5.mode()

0    3
1    7
dtype: int64

In [112]:
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                    "B": np.random.randint(-10, 15, size=50)})

df5

|   | A | B  |
| - | - | -- |
| 0 | 0 | -4 |
| 1 | 3 | 13 |
| 2 | 5 | -4 |
| 3 | 6 | -1 |
| 4 | 2 | 2  |
| 5 | 1 | 12 |
| 6 | 1 | 9  |
| 7 | 3 | 8  |
| 8 | 3 | 13 |
| 9 | 4 | 4  |


In [113]:
df5.mode()

|   | A   | B  |
| - | --- | -- |
| 0 | 1.0 | 2  |
| 1 | NaN | 10 |
| 2 | NaN | 12 |
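
The `NaN` entries in a `DataFrame.mode()` result appear because columns can have different numbers of modes, and shorter columns are padded. A minimal sketch with made-up data:

```python
import pandas as pd

# Column "A" has a single mode (1); column "B" has two modes (5 and 6).
frame = pd.DataFrame({"A": [1, 1, 2, 3], "B": [5, 5, 6, 6]})

modes = frame.mode()

# One row per mode of the column with the most modes; "A" is padded with NaN.
print(modes)
```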


**Discretization and quantiling**

Continuous values can be discretized using the [`cut()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.cut.html#pandas.cut) (bins based on values) and [`qcut()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.qcut.html#pandas.qcut) (bins based on sample quantiles) functions:

In [114]:
arr = np.random.randn(20)

factor = pd.cut(arr, 4)

factor

[(0.354, 0.994], (0.354, 0.994], (-0.286, 0.354], (-0.286, 0.354], (-0.286, 0.354], ..., (0.354, 0.994], (-0.286, 0.354], (0.354, 0.994], (-0.286, 0.354], (-0.286, 0.354]]
Length: 20
Categories (4, interval[float64]): [(-0.929, -0.286] < (-0.286, 0.354] < (0.354, 0.994] < (0.994, 1.634]]

In [115]:
factor = pd.cut(arr, [-5, -1, 0, 1, 5])
factor

[(0, 1], (0, 1], (-1, 0], (-1, 0], (0, 1], ..., (0, 1], (0, 1], (0, 1], (0, 1], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
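
The explicit-bin form can also attach readable names through the `labels` parameter of `cut()`. The sketch below uses made-up values and label names, and shows that bins are closed on the right by default, so `0.0` falls into `(-1, 0]`.

```python
import numpy as np
import pandas as pd

values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
bins = [-5, -1, 0, 1, 5]

# One label per bin; these label names are hypothetical.
names = ["very low", "low", "high", "very high"]

labelled = pd.cut(values, bins, labels=names)

print(list(labelled))
```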

[`qcut()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.qcut.html#pandas.qcut) computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

In [116]:
arr = np.random.randn(30)

arr

array([ 0.4339946 ,  1.25075566,  0.95895605, -0.65227289, -0.45383282,
       -3.22697742, -1.27184963,  0.86365741,  2.24063299,  1.63716785,
       -1.32991143,  0.0986352 ,  0.46357139, -1.23779086, -0.72289319,
       -0.63097156,  0.07247871, -0.13002796, -0.04123338, -0.06036894,
        2.3102521 , -1.60978179, -0.11447037,  0.1779092 , -0.64858589,
       -0.44770163, -1.4732261 ,  0.2301594 , -0.14409644,  0.07399241])

In [117]:
factor = pd.qcut(arr, [0, .25, .5, .75, 1])

factor

[(0.383, 2.31], (0.383, 2.31], (0.383, 2.31], (-3.2279999999999998, -0.651], (-0.651, -0.0874], ..., (-0.651, -0.0874], (-3.2279999999999998, -0.651], (-0.0874, 0.383], (-0.651, -0.0874], (-0.0874, 0.383]]
Length: 30
Categories (4, interval[float64]): [(-3.2279999999999998, -0.651] < (-0.651, -0.0874] < (-0.0874, 0.383] < (0.383, 2.31]]

In [118]:
pd.value_counts(factor)

(0.383, 2.31]                    8
(-3.2279999999999998, -0.651]    8
(-0.0874, 0.383]                 7
(-0.651, -0.0874]                7
dtype: int64
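
With random data the quartile bins come out only approximately equal in size, as the counts above show. On evenly spread data the split is exact, as this sketch illustrates. It passes an integer number of quantiles and `labels=False` to get integer bin codes; the values are made up.

```python
import numpy as np
import pandas as pd

values = np.arange(20)

# q=4 asks for quartiles; labels=False returns the bin number (0..3) per value.
codes = pd.qcut(values, 4, labels=False)

# Count how many values landed in each quartile.
bin_sizes = np.bincount(codes)

print(bin_sizes)
```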

We can also pass infinite values to define the bins:

In [120]:
arr = np.random.randn(20)

factor = pd.cut(arr, [-np.inf, 0, np.inf])

factor

[(0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], (0.0, inf], ..., (0.0, inf], (0.0, inf], (0.0, inf], (0.0, inf], (-inf, 0.0]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
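
A deterministic sketch of the same idea, splitting values by sign with infinite bin edges. The label names are hypothetical. Because bins are closed on the right, `0.0` belongs to the lower bin.

```python
import numpy as np
import pandas as pd

values = np.array([-1.5, 0.0, 2.0])

# Two open-ended bins: (-inf, 0] and (0, inf].
signs = pd.cut(values, [-np.inf, 0, np.inf], labels=["non-positive", "positive"])

print(list(signs))
```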
