Selecting Random Samples
====

Use the [`sample()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.sample.html#pandas.DataFrame.sample) method to select a random sample of rows or columns from a Series, DataFrame, or Panel. The method samples rows by default, and accepts either a specific number of rows/columns to return or a fraction of the rows.

In [3]:
import numpy as np
import pandas as pd

s = pd.Series([0,1,2,3,4,5])

s

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [4]:
# When no arguments are passed, returns 1 row.
s.sample()

1    1
dtype: int64

In [5]:
# One may specify either a number of rows:
s.sample(n=3)

1    1
5    5
0    0
dtype: int64

In [6]:
# Or a fraction of the rows:
s.sample(frac=0.5)
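
The relationship between `n` and `frac` can be checked with a short sketch. The variable names (`by_count`, `by_fraction`, `rejected`) are illustrative only, and the rejection of both arguments together is the behavior I would expect from pandas rather than something stated above.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# n asks for an absolute number of rows.
by_count = s.sample(n=3)

# frac asks for a share of the rows; 0.5 of 6 rows gives 3 rows.
by_fraction = s.sample(frac=0.5)

print(len(by_count), len(by_fraction))

# n and frac are alternatives: passing both is rejected.
rejected = False
try:
    s.sample(n=3, frac=0.5)
except ValueError as err:
    rejected = True
    print("ValueError:", err)
```

Because the draw is random, the selected labels differ from run to run; only the sample sizes are fixed.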

By default, `sample` returns each row at most once, but you can also sample with replacement using the `replace` option:

In [7]:
s = pd.Series([0,1,2,3,4,5])
s

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [8]:
# Without replacement (default):
s.sample(n=6, replace=False)

4    4
1    1
0    0
3    3
5    5
2    2
dtype: int64

In [9]:
# With replacement:
s.sample(n=6, replace=True)

1    1
4    4
4    4
4    4
2    2
1    1
dtype: int64
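
One practical consequence of the `replace` option is that a sample larger than the object is only possible with replacement. The following is a minimal sketch of that behavior; the names `shuffled` and `oversampled` are hypothetical, and the exact error message depends on the installed pandas/NumPy version, so it is printed rather than quoted.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# Without replacement, every label appears at most once,
# so a full-size sample is a permutation of the original.
shuffled = s.sample(n=len(s), replace=False)
print(shuffled.index.is_unique)

# Asking for more rows than exist fails without replacement...
too_large_failed = False
try:
    s.sample(n=10)
except ValueError as err:
    too_large_failed = True
    print("ValueError:", err)

# ...but works with replacement, because rows may repeat.
oversampled = s.sample(n=10, replace=True)
print(len(oversampled))
```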

By default, each row has an equal probability of being selected. To give rows different probabilities, pass sampling weights to `sample` through the `weights` argument. The weights can be a list, a NumPy array, or a Series, but they must have the same length as the object you are sampling. Missing values are treated as a weight of zero, and inf values are not allowed. If the weights do not sum to 1, they are re-normalized by dividing each weight by the sum of the weights. For example:

In [10]:
s = pd.Series([0,1,2,3,4,5])

In [11]:
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [12]:
s.sample(n=3, weights=example_weights)

5    5
4    4
2    2
dtype: int64

In [14]:
# Weights will be re-normalized automatically
example_weights2 = [0.5, 0, 0, 0, 0, 0]

In [15]:
s.sample(n=1, weights=example_weights2)

0    0
dtype: int64
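
The weight rules described above (re-normalization, missing values as zero, matching length) can be exercised in a small sketch. The weight lists and variable names here are made up for illustration, and the loop over seeds is only an informal check, not a proof.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# These weights sum to 10, not 1; they are rescaled to
# [0, 0, 0.2, 0.2, 0.2, 0.4] before drawing.
unnormalized = [0, 0, 2, 2, 2, 4]

# NaN counts as a weight of zero, so only label 5 can be drawn here.
with_missing = [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0]
only_choice = s.sample(n=1, weights=with_missing)
print(only_choice.index[0])

# Labels 0 and 1 have zero weight and never appear, whatever the seed.
drawn = set()
for seed in range(50):
    drawn.update(s.sample(n=3, weights=unnormalized, random_state=seed).index)
print(drawn <= {2, 3, 4, 5})

# A weight list of the wrong length is rejected.
wrong_length_failed = False
try:
    s.sample(n=1, weights=[0.5, 0.5])
except ValueError as err:
    wrong_length_failed = True
    print("ValueError:", err)
```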

When sampling rows of a DataFrame (not columns), you can use one of its columns as sampling weights by passing the column name as a string.

In [17]:
df2 = pd.DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})

df2

   col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1
3     6            0.0


In [18]:
df2.sample(n=3, weights='weight_column')

   col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1
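
As a hedged illustration, passing the column name should behave the same as passing the column itself as a Series. The sketch below compares the two under a fixed seed; `by_name` and `by_series` are hypothetical names, and `random_state=0` is an arbitrary choice made only so the two calls are comparable.

```python
import pandas as pd

df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
                    'weight_column': [0.5, 0.4, 0.1, 0]})

# Passing the column name is shorthand for passing the column itself.
by_name = df2.sample(n=3, weights='weight_column', random_state=0)
by_series = df2.sample(n=3, weights=df2['weight_column'], random_state=0)
print(by_name.equals(by_series))

# Row 3 has weight 0, so a sample of 3 always contains rows 0, 1 and 2
# (their order in the result may vary).
print(sorted(by_name.index))
```

Note that the weight column stays in the returned DataFrame; drop it afterwards if it is not needed.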


`sample` can also sample columns instead of rows when given the `axis` argument.

In [20]:
df3 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

df3

   col1  col2
0     1     2
1     2     3
2     3     4


In [22]:
df3.sample(n=1, axis=1)

   col2
0     2
1     3
2     4
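
A brief sketch of column sampling follows. It assumes that `axis` accepts the string `'columns'` as well as `1`, as other pandas methods do; the names `one_column` and `shuffled_columns` are illustrative.

```python
import pandas as pd

df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# axis=1 draws columns; every row is kept.
one_column = df3.sample(n=1, axis=1)
print(one_column.shape)

# frac works along columns too: frac=1 returns all columns in random order.
shuffled_columns = df3.sample(frac=1, axis='columns')
print(sorted(shuffled_columns.columns))
```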


Finally, you can seed `sample`'s random number generator using the `random_state` argument, which accepts either an integer (as a seed) or a NumPy RandomState object.

In [23]:
df4 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [24]:
# With a given seed, the sample will always draw the same rows.
df4.sample(n=2, random_state=2)

   col1  col2
2     3     4
1     2     3


In [25]:
df4.sample(n=2, random_state=2)

   col1  col2
2     3     4
1     2     3
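
The two forms of `random_state` can be compared in a minimal sketch. The seed value and the names `first`, `second`, `a`, and `b` are arbitrary; the point is only that equal seeds give equal samples.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# The same integer seed reproduces the same sample.
first = df4.sample(n=2, random_state=2)
second = df4.sample(n=2, random_state=2)
print(first.equals(second))

# A RandomState object carries its state forward between calls, so
# successive draws from one object may differ. Re-creating it with
# the same seed restarts the sequence.
rng = np.random.RandomState(2)
a = df4.sample(n=2, random_state=rng)
rng = np.random.RandomState(2)
b = df4.sample(n=2, random_state=rng)
print(a.equals(b))
```

An integer is the simpler choice for a single reproducible draw; a shared RandomState object is useful when several draws should be reproducible as a sequence without all returning the same rows.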
