# **Sampling with and without replacement** #




When we have to extract a random sample from a population (larger dataset) we can use:
* the random sampling routines of Numpy, `numpy.random`  [[doc]](https://numpy.org/doc/stable/reference/random/index.html),
* or the `pd.DataFrame.sample` [[doc]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) of Pandas,
* or the scikit-learn `sklearn.utils.resample` [[doc]](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html).

`NumPy` is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays.

`Pandas` is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

`Scikit-learn` is an open source machine learning library.



**Setting a random seed**

The random draw functions only mimic random processes. These
algorithms are actually complex deterministic processes that generate very
long sequences of pseudorandom numbers.

We can control the numbers that the random number generator
produces by setting a “seed”.
Every time you reset the seed to the same
value, the same sequence of “random” numbers results.

This is useful during code development to check that results are reproducible.
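
The reproducibility check described above can be sketched as follows. This is a minimal illustration that assumes the legacy `np.random.seed` interface used in the rest of this notebook; the variable names `first`, `second` and `same` are illustrative only.

```python
import numpy as np

# Reset the seed to the same value before each draw
np.random.seed(13)
first = np.random.randint(0, 100, size=5)

np.random.seed(13)
second = np.random.randint(0, 100, size=5)

# The two "random" sequences are identical because the seed was reset
same = np.array_equal(first, second)
print(first, second, same)
```

NumPy also provides `np.random.default_rng(seed)`, which returns a separate generator object instead of changing global state.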

## Numpy np.random.choice ##

In [1]:
import numpy as np

In [2]:
np.random.seed(13)

`np.random.choice` generates a random sample from a given 1-D array.

It samples with replacement by default.

`np.random.choice(a, size=None, replace=True, p=None)`

In [3]:
a = [ 27, 34, 45, 63, 73, 88]
a

[27, 34, 45, 63, 73, 88]

In [4]:
np.random.choice(a)

45

In [None]:
np.random.choice(a, 3, replace = False)

In [None]:
np.random.choice(10)

In [None]:
np.arange(30,35)

In [None]:
np.random.choice(np.arange(30,35))

In [None]:
np.random.choice(np.arange(3,15), 10)

In [None]:
np.random.choice(4, 12)

Probability weights can be given through the `p` argument; they must sum to 1.

In [None]:
np.random.choice(4, 12, p=[.4, .1, .1, .4])
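
To see the effect of the weights, one option is to draw a large sample and compare the empirical frequencies with `p`. The sketch below assumes a sample size of 100,000, chosen only so that the frequencies settle close to the weights; `draws` and `freq` are illustrative names.

```python
import numpy as np

np.random.seed(13)
p = [.4, .1, .1, .4]

# Draw many values so that the empirical frequencies approach p
draws = np.random.choice(4, 100_000, p=p)

# Relative frequency of each of the values 0, 1, 2, 3
freq = np.bincount(draws, minlength=4) / draws.size
print(freq)
```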

In [None]:
x = np.random.randint(0, 10, (8, 12)) # see also np.random.rand() and np.random.randn()
x

In [None]:
x = np.random.rand(8, 12) # uniform samples in [0, 1); see also np.random.randn() for standard normal samples
x

In [None]:
x = np.random.randn(8, 12)
x

Individual elements can be sampled from the array

In [None]:
np.random.choice(x.ravel(), 12)

Rows can be sampled

In [None]:
x.shape[0]

In [None]:
idx = np.random.choice(x.shape[0], 4)
print(idx)
x[idx, :]

... or columns

In [None]:
idx = np.random.choice(x.shape[1], 4)
x[:, idx]
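
As an alternative to indexing with sampled positions, the newer `Generator` interface can sample along an axis directly. This sketch assumes NumPy 1.17 or later and uses `np.random.default_rng` instead of the legacy functions shown above; `rows` and `cols` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(13)
x = rng.integers(0, 10, size=(8, 12))

# Sample 4 rows (axis=0) and 4 columns (axis=1), with replacement by default
rows = rng.choice(x, 4, axis=0)
cols = rng.choice(x, 4, axis=1)
print(rows.shape, cols.shape)
```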

## Sampling without replacement

Sampling without replacement can be done by passing the argument `replace=False` to `np.random.choice`.

In [None]:
np.random.choice(10, 4, replace=False)

The requested sample size must not exceed the population size, otherwise:

In [None]:
np.random.choice(4, 12, replace=False) # it raises an error (which can be handled)

In [None]:
try:
  np.random.choice(4, 12, replace=False)
except ValueError as ve:
  print(ve)
  #Cannot take a larger sample than population when 'replace=False'
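
A short sketch of the two valid cases: a sample smaller than the population contains only distinct values, and a sample exactly as large as the population is a permutation of it. The names `sample`, `n_unique` and `full` are illustrative.

```python
import numpy as np

np.random.seed(13)

# Without replacement, every drawn value is distinct
sample = np.random.choice(10, 4, replace=False)
n_unique = len(set(sample.tolist()))

# Asking for exactly the population size is allowed: the result is a permutation
full = np.random.choice(10, 10, replace=False)
print(sample, n_unique, sorted(full.tolist()))
```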

## Not only numbers...

`np.random.choice` can be used with an arbitrary array-like instead of just integers.

In [None]:
a = ['cat','dog','pig','bird','fish']
type(a)

In [None]:
np.random.choice(a, 12)


# Pandas pd.DataFrame.sample

In [None]:
import pandas as pd
import numpy as np

`DataFrame.sample` returns a random sample of items (rows by default) from one axis of the DataFrame.

`DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)`






See also:

`DataFrameGroupBy.sample`
Generates random samples from each group of a DataFrame object.

`SeriesGroupBy.sample`
Generates random samples from each group of a Series object.

Also in this case you can use random_state for reproducibility.
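
A minimal sketch of `random_state` and `frac`, using a hypothetical ten-row DataFrame that is not part of the examples in this notebook:

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
df_toy = pd.DataFrame({"value": range(10)})

# The same random_state selects the same rows
s1 = df_toy.sample(n=3, random_state=123)
s2 = df_toy.sample(n=3, random_state=123)
print(s1.equals(s2))

# frac selects a fraction of the rows instead of a fixed count
half = df_toy.sample(frac=0.5, random_state=123)
print(len(half))
```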

In [None]:
df =  pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
df

Rows can be sampled with and without replacement

In [None]:
sample_data = df.sample(10, replace=True, random_state=123) # the default setting is replace=False
sample_data

A DataFrame column can be used as weights. Rows with larger values in the weighting column are more likely to be sampled.

In [None]:
sample_data = df.sample(n=10, replace=True, weights='num_specimen_seen', random_state=123)
sample_data
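
Weights do not need to sum to 1, since pandas normalizes them, and a row with weight 0 is never drawn. The sketch below illustrates this with a hypothetical extra column `w` that is not part of the original example.

```python
import pandas as pd

df_w = pd.DataFrame({"num_legs": [2, 4, 8, 0],
                     "num_specimen_seen": [10, 2, 1, 8]},
                    index=["falcon", "dog", "spider", "fish"])

# Hypothetical weight column: a weight of 0 excludes the row ("dog")
df_w["w"] = [1, 0, 1, 2]

picked = df_w.sample(n=50, replace=True, weights="w", random_state=123)

# Count how often each row label was drawn
counts = picked.index.value_counts()
print(counts)
```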

Columns can also be sampled

In [None]:
sample_data = df.sample(5, replace=True, axis = 1)
sample_data

## Stratified sampling


We can stratify the sampling procedure, for example according to a category (e.g. the number of wings).

`DataFrameGroupBy.sample(n=None, frac=None, replace=False, weights=None, random_state=None)`

Return a random sample of items from each group.

In [None]:
df.groupby("num_wings").sample(n=6, replace=True)

Sampling probabilities within each group can be controlled by setting `weights`.

In [None]:
df.num_specimen_seen

In [None]:
df.groupby("num_wings").sample(n=6, replace=True, weights=df.num_specimen_seen, random_state=123)
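
When the goal is to keep the relative group sizes, `frac` can be used instead of `n`, because the fraction is applied within each group. This sketch assumes pandas 1.1 or later and uses a hypothetical DataFrame with two groups of unequal size.

```python
import pandas as pd

# Hypothetical data: 8 rows in group "a" and 4 rows in group "b"
df_groups = pd.DataFrame({"group": ["a"] * 8 + ["b"] * 4,
                          "value": range(12)})

# Half of each group is drawn, so the 2:1 proportion is kept
half = df_groups.groupby("group").sample(frac=0.5, random_state=123)
sizes = half["group"].value_counts().to_dict()
print(sizes)
```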

# Scikit-learn resample

`sklearn.utils.resample` resamples arrays or sparse matrices in a consistent way.

`sklearn.utils.resample(*arrays, replace=True, n_samples=None, random_state=None, stratify=None)`



In [None]:
import numpy as np
from sklearn.utils import resample

In [None]:
X_features = np.array([[1., 2.1], [2., 3.1], [0.5, 1.8]])
X_phenotypic = np.array([[1., 0.], [2., 1.], [2., 0.]])
Y_diagnosis = ['mal', 'ben', 'ben']

In [None]:
X_features

In [None]:
X_phenotypic

In [None]:
Y_diagnosis

You can consistently resample different arrays: the same row indices are drawn from each of them.

In [None]:
sX_features, sX_phenotypic, sY_diagnosis = resample(X_features, X_phenotypic, Y_diagnosis, n_samples=2, random_state=23)

In [None]:
sX_features, sX_phenotypic, sY_diagnosis
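
One way to check that the arrays stay aligned is to build labels that are a known function of the features and verify that the relation still holds after resampling. The arrays `X` and `y` below are hypothetical and exist only for this check.

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(5).reshape(-1, 1)   # one feature column holding 0..4
y = np.arange(5) * 10             # each label equals 10 * feature

sX, sy = resample(X, y, n_samples=4, random_state=23)

# Each resampled label still matches its resampled feature row
aligned = bool(np.all(sy == sX[:, 0] * 10))
print(sX.ravel(), sy, aligned)
```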

## Stratified sampling

Sampling using stratification

In [None]:
X_features = np.array([[1., 2.1], [2., 3.1], [2.5, 3.], [0.5, 1.8], [0.7, 1.5], [1.5, 3.8], [0.7, 1.5], [0.3, 1.2]])
Y_diagnosis = ['mal', 'ben', 'ben', 'ben','mal','mal','ben','mal']

In [None]:
sX_features, sY_diagnosis = resample(X_features, Y_diagnosis, n_samples=20, replace=True, stratify=Y_diagnosis, random_state=123)

Stratifying means that the class distribution of the original data is preserved in the resampled data.

Fraction of subjects with a 'mal' diagnosis in the original sample:

In [None]:
Y_diagnosis.count('mal')/len(Y_diagnosis)

In [None]:
sX_features, sY_diagnosis

Fraction of subjects with a 'mal' diagnosis in the resampled sample:

In [None]:
sY_diagnosis.count('mal')/len(sY_diagnosis)

In [None]:
# same call without stratify, for comparison: the class proportions are no longer guaranteed
sX_features, sY_diagnosis = resample(X_features, Y_diagnosis, n_samples=20, replace=True, random_state=123)

In [None]:
sY_diagnosis.count('mal')/len(sY_diagnosis)
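
The stratified case can be condensed into a single self-contained check. This sketch assumes a scikit-learn version whose `resample` accepts `stratify`, stores the labels in a NumPy array rather than a list, and uses hypothetical placeholder features.

```python
import numpy as np
from sklearn.utils import resample

y = np.array(['mal', 'ben', 'ben', 'ben', 'mal', 'mal', 'ben', 'mal'])
X = np.arange(len(y)).reshape(-1, 1)  # hypothetical placeholder features

# Stratified resampling keeps the class proportions of y
sX, sy = resample(X, y, n_samples=20, replace=True, stratify=y, random_state=123)

frac_original = float(np.mean(y == 'mal'))
frac_stratified = float(np.mean(sy == 'mal'))
print(frac_original, frac_stratified)
```

With balanced classes and an even `n_samples`, the stratified fraction should match the original one; without `stratify` it generally fluctuates around it.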