# Sampling with and without replacement




To draw a random sample from a population (a larger dataset), with or without replacement, we can use functions from the `NumPy` [[doc]](https://numpy.org/doc/stable/reference/random/index.html), `Pandas` [[doc]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html), or `Scikit-learn` [[doc]](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) packages.

**Setting a random seed**

Random draw functions only mimic random processes. They are deterministic algorithms that generate very long sequences of pseudorandom numbers.

We can control the numbers that the random number generator produces by setting a "seed": every time the seed is reset to the same value, the same sequence of "random" numbers results.

This is useful during code development, because it makes results reproducible and easy to check.
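
As a minimal sketch of this behaviour, the snippet below resets the legacy global seed and checks that the same numbers come out. The second half uses `np.random.default_rng`, NumPy's newer generator interface, which is not used elsewhere in this notebook and is shown only as an alternative. The variable names are illustrative.

```python
import numpy as np

# Seed the legacy global generator, draw, then reseed and draw again
np.random.seed(13)
first = np.random.rand(3)

np.random.seed(13)  # reset the seed to the same value
second = np.random.rand(3)

print(np.array_equal(first, second))  # True

# Alternative (assumption: NumPy >= 1.17): independent Generator objects
rng_a = np.random.default_rng(13)
rng_b = np.random.default_rng(13)
same_draws = np.array_equal(rng_a.integers(0, 10, 5), rng_b.integers(0, 10, 5))
print(same_draws)  # True
```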

## NumPy

In [None]:
import numpy as np

In [None]:
np.random.seed(13)

`np.random.choice` generates a random sample from a given 1-D array. It samples with replacement by default [[doc]](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html).

`np.random.choice(a, size=None, replace=True, p=None)`

In [None]:
a = [ 27, 34, 45, 63, 73, 88]

In [None]:
# Random sampling with replacement from the list 'a'

np.random.choice(a, 5)

In [None]:
# Random sampling without replacement from the list 'a'
# (pass the list 'a' itself; passing the integer 5 would sample from np.arange(5))

np.random.choice(a, 3, replace=False)
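
To make the difference between the two modes concrete, here is a small sketch using the same list `a`. Without replacement, each element can appear at most once, so the sample cannot be larger than the population. The variable names are illustrative.

```python
import numpy as np

np.random.seed(13)
a = [27, 34, 45, 63, 73, 88]

# With replacement: the same value may be drawn more than once
with_repl = np.random.choice(a, 6, replace=True)

# Without replacement: drawing len(a) items yields a permutation of 'a'
without_repl = np.random.choice(a, 6, replace=False)
print(sorted(without_repl.tolist()) == sorted(a))  # True

# Asking for more items than the population holds raises a ValueError
oversample_error = None
try:
    np.random.choice(a, 7, replace=False)
except ValueError as err:
    oversample_error = err
print("ValueError raised:", oversample_error is not None)  # ValueError raised: True
```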

In [None]:
# Draw a single random integer from 0 to 9

np.random.choice(10)

# The call above is equivalent to np.random.choice(np.arange(10))

`np.arange` returns evenly spaced values within a given interval; `start` is included and `stop` is excluded [[doc]](https://numpy.org/doc/stable/reference/generated/numpy.arange.html#numpy.arange).
 
`numpy.arange([start, ]stop, [step, ]dtype=None, *, like=None)`

In [None]:
start, stop = 20,30
np.arange(start,stop,3)

In [None]:
np.random.choice(np.arange(30,35))

Probability weights can be given through the `p` parameter, with one probability per element. The probabilities must sum to 1.

In [None]:
np.random.choice(4, 12, p=[.4, .1, .1, .4])
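
As a rough check of what `p` does, the sketch below draws many values and compares the observed frequencies with the requested probabilities. The sample size of 10,000 is an arbitrary choice for illustration.

```python
import numpy as np

np.random.seed(13)
p = [0.4, 0.1, 0.1, 0.4]

# Draw many integers from 0..3 with the given probabilities
draws = np.random.choice(4, 10_000, p=p)

# Observed relative frequency of each value 0, 1, 2, 3
freq = np.bincount(draws, minlength=4) / draws.size
print(freq)  # should be close to p, up to sampling noise
```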

`np.random.randint` returns random integers from `low` (inclusive) to `high` (exclusive) [[doc]](https://numpy.org/doc/stable/reference/random/generated/numpy.random.RandomState.randint.html).

`np.random.randint(low, high=None, size=None, dtype=int)`

`np.random.rand` returns random values of a given shape, drawn from a uniform distribution over [0, 1) [[doc]](https://numpy.org/doc/stable/reference/random/generated/numpy.random.RandomState.rand.html).

`np.random.rand(d0, d1, ..., dn)`

`np.random.randn` returns a sample (or samples) from the "standard normal" distribution (mean 0, standard deviation 1) [[doc]](https://numpy.org/doc/stable/reference/random/generated/numpy.random.RandomState.randn.html).

`np.random.randn(d0, d1, ..., dn)`

In [None]:
np.random.randint(0, 10, size=(8, 3))  # Generate a 2D array of random integers with a size of (8, 3)

In [None]:
np.random.rand(4, 4) 

In [None]:
np.random.randn(4, 4)
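
The ranges of these three functions can be checked empirically. The sketch below also shows the usual way to obtain a normal sample with a different mean and standard deviation by shifting and scaling `randn`; the values `mu = 5` and `sigma = 2` are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(13)

ints = np.random.randint(0, 10, size=1000)  # integers in [0, 10)
unif = np.random.rand(1000)                 # floats in [0, 1)
print(ints.min() >= 0, ints.max() <= 9)     # True True
print(unif.min() >= 0.0, unif.max() < 1.0)  # True True

# Shift and scale standard normal draws to get N(mu, sigma^2)
mu, sigma = 5.0, 2.0
norm = mu + sigma * np.random.randn(100_000)
print(norm.mean(), norm.std())  # close to 5 and 2, up to sampling noise
```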

In [None]:
x = np.random.randn(4,5)

print(f"The code above generates a 2D array with {x.shape[0]} rows and {x.shape[1]} columns.")

In [None]:
# Resampling rows' indices of the 2D array 'x'
idx = np.random.choice(x.shape[0], 4)
print(f"Selecting these rows {idx} from the 2D array 'x'")
resampled_x = x[idx, :]

# It's possible to select columns using x.shape[1] instead of x.shape[0]
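
Following the comment above, a minimal sketch of column resampling could look like this. It samples column indices without replacement, which is one possible choice; the names `col_idx` and `resampled_cols` are illustrative.

```python
import numpy as np

np.random.seed(13)
x = np.random.randn(4, 5)

# Sample 3 of the 5 column indices, without replacement
col_idx = np.random.choice(x.shape[1], 3, replace=False)

# Keep all rows, select only the sampled columns
resampled_cols = x[:, col_idx]
print(resampled_cols.shape)  # (4, 3)
```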


## Pandas 

In [None]:
import pandas as pd
import numpy as np

`DataFrame.sample` returns a random sample of items from an axis of the object (rows by default).

`DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)`






See also:

`DataFrameGroupBy.sample`
Generates random samples from each group of a DataFrame object.

`SeriesGroupBy.sample`
Generates random samples from each group of a Series object.

As with NumPy, results can be made reproducible here, in this case by passing `random_state`.

In [None]:
df =  pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
print(df)

Rows can be sampled with and without replacement.

In [None]:
df.sample(10, replace=True, random_state=123)  # the default setting is replace=False
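
The sketch below contrasts the two modes on the same DataFrame. Without replacement, `frac=1` returns every row once in random order, and asking for more rows than exist raises a `ValueError`. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Without replacement (default): frac=1 shuffles the rows
shuffled = df.sample(frac=1, random_state=123)
print(sorted(shuffled.index) == sorted(df.index))  # True

# Without replacement, n cannot exceed the number of rows
try:
    df.sample(n=10, replace=False)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print(oversample_failed)  # True

# With replacement, rows can repeat, so n=10 is allowed
bootstrap = df.sample(n=10, replace=True, random_state=123)
print(len(bootstrap))  # 10
```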

A DataFrame column can be used as weights. Rows with larger values in the weighting column are more likely to be sampled.

In [None]:
df.sample(n=10, replace=True, weights='num_specimen_seen', random_state=123)
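
To see the effect of the weights, the sketch below uses two weighting columns. With `num_wings`, only the falcon has a non-zero weight, so it is the only row that can be drawn. With `num_specimen_seen`, the observed frequencies should approach the normalized weights. The sample size of 10,000 is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Rows with zero weight are never drawn
only_winged = df.sample(n=10, replace=True, weights='num_wings', random_state=123)
print(set(only_winged.index))  # {'falcon'}

# Observed frequencies versus normalized weights
big = df.sample(n=10_000, replace=True, weights='num_specimen_seen', random_state=123)
freq = big.index.value_counts(normalize=True).reindex(df.index, fill_value=0.0)
expected = df['num_specimen_seen'] / df['num_specimen_seen'].sum()
print(pd.DataFrame({'observed': freq, 'expected': expected}))
```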

Columns can also be sampled, by setting `axis=1`.

In [None]:
df.sample(5, replace=True, axis = 1)

### Stratified sampling


We can stratify the sampling procedure, for example according to a category (e.g. the number of wings).

`DataFrameGroupBy.sample(n=None, frac=None, replace=False, weights=None, random_state=None)`

Return a random sample of items from each group.

In [None]:
df.groupby("num_wings").sample(n=6, replace=True)
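
A quick way to confirm that sampling happens within each group is to count the rows returned per group. In this sketch, each of the two `num_wings` groups contributes exactly `n` rows; the name `per_group` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Draw 6 rows, with replacement, from each num_wings group
out = df.groupby("num_wings").sample(n=6, replace=True, random_state=123)

# Number of sampled rows per group
per_group = out.groupby("num_wings").size().to_dict()
print(per_group)  # {0: 6, 2: 6}
```

The group with two wings contains only the falcon, so all six of its rows are the same animal, which is why `replace=True` is needed here.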

Sampling probabilities within each group can be controlled by setting `weights`.

In [None]:
df.groupby("num_wings").sample(n=6, replace=True, weights=df.num_specimen_seen, random_state=123)

## Scikit-learn resample

`sklearn.utils.resample` resamples arrays or sparse matrices in a consistent way: the same row indices are drawn from every array passed in.

`sklearn.utils.resample(*arrays, replace=True, n_samples=None, random_state=None, stratify=None)`



In [None]:
import numpy as np
from sklearn.utils import resample

In [None]:
X_features = np.array([[1., 2.1], [2., 3.1], [0.5, 1.8]])
X_phenotypic = np.array([[1., 0.], [2., 1.], [2., 0.]])
Y_diagnosis = ['mal', 'ben', 'ben']

In [None]:
X_features

In [None]:
X_phenotypic

In [None]:
Y_diagnosis

You can consistently resample different arrays.

In [None]:
sX_features, sX_phenotypic, sY_diagnosis = resample(X_features, X_phenotypic, Y_diagnosis, n_samples= 2, random_state=23)

In [None]:
sX_features, sX_phenotypic, sY_diagnosis

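This code draws `n_samples` rows from the original data, using the same row indices for every array so that features and labels stay aligned. Sampling is with replacement by default (`replace=True`). The `random_state` parameter ensures reproducibility by seeding the random number generator.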
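
The alignment between arrays can be verified with a small sketch. Here `X` and `y` are hypothetical toy arrays built so that each row of `X` encodes its own label, which makes any misalignment easy to detect.

```python
import numpy as np
from sklearn.utils import resample

# Toy data: row i of X is [2*i, 2*i + 1], and y[i] = i
X = np.arange(10).reshape(5, 2)
y = np.arange(5)

sX, sy = resample(X, y, n_samples=4, random_state=23)

# If rows stayed aligned, the first column of sX equals 2 * sy
aligned = np.array_equal(sX[:, 0], 2 * sy)
print(aligned)  # True
```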

### Stratified sampling

With the `stratify` parameter, the sample is drawn in a stratified fashion, using the given array as class labels.

In [None]:
X_features = np.array([[1., 2.1], [2., 3.1], [2.5, 3.], [0.5, 1.8], [0.7, 1.5], [1.5, 3.8], [0.7, 1.5], [0.3, 1.2]])
Y_diagnosis = ['mal', 'ben', 'ben', 'ben','mal','mal','ben','mal']

In [None]:
sX_features, sY_diagnosis = resample(X_features, Y_diagnosis, n_samples=5, replace=True, stratify=Y_diagnosis, random_state=123)

Stratifying means that the class distribution of the original data is preserved (approximately, when `n_samples` does not allow an exact split). Proportion of subjects with a 'mal' diagnosis in the original sample:

In [None]:
Y_diagnosis.count('mal')/len(Y_diagnosis)

In [None]:
sX_features, sY_diagnosis

Proportion of subjects with a 'mal' diagnosis in the resampled sample:

In [None]:
sY_diagnosis.count('mal')/len(sY_diagnosis)
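
The effect of `stratify` is easier to see on an imbalanced dataset. The sketch below uses hypothetical data with 80 'ben' and 20 'mal' labels and compares a stratified resample with a plain one; the array names and sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced labels: 80% 'ben', 20% 'mal'
y = np.array(['ben'] * 80 + ['mal'] * 20)
X = np.arange(100).reshape(100, 1)

# Stratified resample: class proportions of y are preserved
sX, sy = resample(X, y, n_samples=50, replace=True, stratify=y, random_state=123)
n_mal_stratified = int(np.sum(sy == 'mal'))
print(n_mal_stratified / len(sy))  # 0.2

# Plain resample: the proportion fluctuates around 0.2
uX, uy = resample(X, y, n_samples=50, replace=True, random_state=123)
print(np.mean(uy == 'mal'))
```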

In [None]:
sX_features, sY_diagnosis = resample(X_features, Y_diagnosis, n_samples=20, replace=True,  random_state=123)

In [None]:
sY_diagnosis.count('mal')/len(sY_diagnosis)