# Sampling Methods

* **Sampling** is a *process used in statistical analysis in which a predetermined number of observations are taken from a larger population.*

---

## Simple Random Sampling
* **Simple random sampling** is the *basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population).* Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample. Every possible sample of a given size has the same chance of selection. 

![Simple random sampling of a sample “n” of 3 from a population “N” of 12. Image: Dan Kernler |Wikimedia Commons](https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2014/12/Simple_random_sampling-300x231.png)
*Simple random sampling of a sample “n” of 3 from a population “N” of 12. Image: Dan Kernler |Wikimedia Commons*

* Technically, a simple random sample is a set of n objects in a population of N objects where all possible samples are equally likely to happen. Here’s a basic example of how to get a simple random sample: put 100 numbered bingo balls into a bowl (this is the population N). Select 10 balls from the bowl without looking (this is your sample n). Note that it’s important not to look as you could (unknowingly) bias the sample. While the “lottery bowl” method can work fine for smaller populations, in reality you’ll be dealing with much larger populations.
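
The bingo-ball procedure above can be sketched with Python's standard library alone. This is a minimal illustration: the seed and the variable names are assumptions chosen for the example, not part of the original description.

```python
import math
import random

# Population N: 100 numbered bingo balls.
population = list(range(1, 101))

# Fixed seed only so the draw is repeatable (an illustrative choice).
rng = random.Random(0)

# Sample n: draw 10 balls without replacement; every ball is equally likely.
sample = rng.sample(population, k=10)
print(sorted(sample))

# Number of distinct samples of size 10. Under simple random sampling,
# each of them has the same probability of being the one drawn.
n_possible = math.comb(len(population), len(sample))
print(n_possible, 1 / n_possible)
```

For populations too large to enumerate by hand, the same idea applies: a random number generator plays the role of the bowl.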

![](https://research-methodology.net/wp-content/uploads/2015/04/Simple-random-sampling2.png)

In [5]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5000, 4), columns=list('ABCD'))

In [6]:
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.236193 | 0.773223 | -1.524389 | -1.178702 |
| 1 | -0.404296 | 0.675620 | 1.018546 | 0.028778 |
| 2 | -0.444872 | 0.302627 | 1.266296 | 0.750915 |
| 3 | 0.505131 | 1.894547 | 0.873071 | 1.263101 |
| 4 | 0.323050 | 0.744162 | -0.339963 | 0.035713 |
| 5 | 1.249726 | 0.272994 | -1.493245 | -0.841633 |
| 6 | -0.848853 | 0.559860 | 0.089391 | -0.402880 |
| 7 | 0.342564 | 1.890367 | 1.091281 | -0.390963 |
| 8 | 0.151172 | -0.174208 | 1.720094 | -0.136986 |
| 9 | 0.325052 | -0.487900 | -2.641470 | 0.716419 |


In [7]:
sample_df = df.sample(100)

In [8]:
sample_df.shape

(100, 4)
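
As a hedged variation on the cells above, the sample size can also be given as a fraction, and a seed makes the draw repeatable. The seed value 42 and the use of `default_rng` are assumptions for this sketch; `DataFrame.sample` draws without replacement by default.

```python
import numpy as np
import pandas as pd

# Same shape as the DataFrame above; the seed is an illustrative assumption.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((5000, 4)), columns=list("ABCD"))

# Draw 2% of the rows without replacement (the default for DataFrame.sample).
sample_df = df.sample(frac=0.02, random_state=42)

print(sample_df.shape)            # (100, 4)
print(sample_df.index.is_unique)  # True: no row was drawn twice
```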

## Stratified Sampling

---

* **Stratified random sampling** is a method of sampling that *involves dividing a population into smaller sub-groups known as* **strata**. In stratified random sampling, or stratification, the strata are formed based on members' shared attributes or characteristics, such as income or educational attainment.

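* **Stratified random sampling** is also called *proportional random sampling*. It is sometimes also called *quota random sampling*, although quota sampling in the strict sense selects the units within each stratum non-randomly.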

![](https://image.slidesharecdn.com/sampling-stratifiedvscluster-170115160432/95/sampling-stratified-vs-cluster-2-638.jpg?cb=1484496290)

Assume that we need to estimate the average number of votes for each candidate in an election, and that the country has 3 towns:
* Town A has 1 million factory workers,
* Town B has 2 million workers, and
* Town C has 3 million retirees.

We could draw a simple random sample of size 60 from the entire population, but by chance that sample may be poorly balanced across the towns. Such a sample would be biased and could cause a significant estimation error. If we instead draw random samples of 10, 20 and 30 from Towns A, B and C respectively (proportional to each town's population), we can obtain a smaller estimation error for the same total sample size.
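
The proportional allocation in the town example can be checked with a short sketch. The scaled-down population (one row per 1,000 people), the column name `town`, and the seed are hypothetical choices made only to keep the demonstration small.

```python
import pandas as pd

# Town populations from the example above.
town_sizes = {"A": 1_000_000, "B": 2_000_000, "C": 3_000_000}
total_sample = 60

# Proportional allocation: each town's share of the sample equals its
# share of the population (integer division is exact for these numbers).
total_population = sum(town_sizes.values())
allocation = {
    town: total_sample * size // total_population
    for town, size in town_sizes.items()
}
print(allocation)  # {'A': 10, 'B': 20, 'C': 30}

# Hypothetical scaled-down population: one row per 1,000 people.
people = pd.DataFrame(
    {"town": [t for t, size in town_sizes.items() for _ in range(size // 1000)]}
)

# Sample the same fraction within every stratum (town).
frac = total_sample / len(people)
stratified = people.groupby("town").sample(frac=frac, random_state=0)
print(stratified["town"].value_counts().sort_index().to_dict())
```

Sampling the same fraction inside each group is what keeps the town proportions of the sample identical to those of the population.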

### Method

In [18]:
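from sklearn.model_selection import train_test_split

# X (features) and y (categorical labels) are assumed to be defined already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    stratify=y,      # stratify on a categorical variable so both splits keep its class proportions
    test_size=0.25,
)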
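
A self-contained sketch of the call above follows, using synthetic data so that the effect of `stratify` is visible. The dataset parameters and the seed are illustrative assumptions, not values from the original notebook cell.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: roughly 90% class 0 and 10% class 1.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=3, n_redundant=1,
    n_clusters_per_class=1, weights=[0.9, 0.1], flip_y=0, random_state=10,
)

# stratify=y keeps the class proportions of y in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=10
)

print("all:  ", Counter(y))
print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

Without `stratify`, the minority share in the test split can drift away from 10% purely by chance, which matters most when the minority class is small.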

## Random Undersampling and Oversampling

---

![](https://miro.medium.com/max/700/0*u6pKLqdCDsG_5kXa.png)

* A widely adopted technique for dealing with highly imbalanced datasets is called resampling. It consists of *removing samples from the majority class* (**under-sampling**) and/or *adding more examples from the minority class* (**over-sampling**).

In [9]:
from sklearn.datasets import make_classification
X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=100, random_state=10
)
X = pd.DataFrame(X)
X['target'] = y

In [10]:
X

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.327419 | -0.123936 | 0.377707 | -0.650123 | 0.267562 | 1.228781 | 2.208772 | -0.185977 | 0.238732 | -2.565438 | ... | 0.644056 | 0.104375 | -1.703024 | -0.510083 | -0.108812 | -0.230132 | 1.553707 | 1.497538 | -1.476485 | 0 |
| 1 | -0.843981 | -0.018691 | -0.841018 | 1.374583 | 0.157199 | -0.599719 | 2.217041 | -2.032194 | -2.310214 | -0.490477 | ... | 1.360939 | -1.844740 | -0.341096 | 0.137243 | 1.704764 | 0.464255 | 1.225786 | -0.842880 | 1.303258 | 0 |
| 2 | -0.204642 | 0.472155 | -0.140616 | -2.902493 | -1.513665 | 1.149545 | 2.283673 | -0.809117 | -1.723535 | -0.958556 | ... | -0.279701 | -1.431391 | 0.260146 | -0.501306 | -2.320545 | 0.422214 | 1.386474 | -0.073335 | 0.586859 | 0 |
| 3 | 0.208274 | -0.156982 | 0.063369 | -0.545759 | -0.395416 | -2.679969 | 1.507772 | 0.391485 | -0.487337 | -0.946147 | ... | -1.011854 | -1.124795 | 0.347291 | -1.078836 | 0.046923 | -0.978324 | 1.100517 | -0.697134 | 0.339577 | 0 |
| 4 | 0.785568 | 0.208472 | 0.760082 | -0.046130 | 0.310844 | -0.403927 | 1.462897 | 0.962173 | -0.520996 | 1.647360 | ... | 0.316792 | -0.261528 | -1.260698 | 0.822700 | 0.141031 | -0.294805 | 2.216364 | -1.129875 | -1.059984 | 1 |
| 5 | -0.886195 | 0.548814 | -1.844824 | 0.638066 | 0.023932 | 0.491861 | 0.722346 | 0.811078 | -0.468527 | 0.035382 | ... | -0.751144 | 0.148616 | -0.185694 | 2.102140 | -0.166839 | 0.088302 | 0.632036 | 1.766467 | -1.373949 | 0 |
| 6 | -1.396231 | 1.175303 | -0.444875 | -0.061029 | 0.521757 | -0.143775 | 1.580864 | -1.639435 | -0.954991 | -0.628853 | ... | 0.804454 | 0.802000 | 0.440853 | 0.299328 | -1.049694 | -1.638443 | 1.095820 | 1.734194 | -0.244441 | 0 |
| 7 | 2.518215 | -0.242515 | -0.632592 | -2.613839 | -0.180870 | 1.268051 | 0.951805 | 1.686518 | -0.667233 | -0.100769 | ... | -0.757227 | 0.472841 | -0.371284 | 0.704411 | -0.006120 | 0.541701 | 0.740249 | -0.150790 | 0.004707 | 0 |
| 8 | 0.711901 | -2.016997 | -1.256924 | -1.367869 | -0.330754 | 0.061428 | 1.003502 | -1.437870 | -0.878656 | 1.966788 | ... | 0.637275 | 0.647296 | -0.156446 | -0.348578 | -0.583574 | 1.102407 | 1.639860 | -0.030437 | -0.675223 | 1 |
| 9 | -2.068195 | 0.267997 | -0.384181 | -0.183092 | -0.609841 | -0.162082 | 1.504678 | 0.457680 | -3.990568 | 0.480947 | ... | -1.534631 | -0.351153 | 1.208621 | 1.226685 | -2.190442 | 0.343396 | -0.018622 | -2.294159 | -1.489028 | 1 |


We can now do random oversampling and undersampling using:

In [35]:
num_0 = len(X[X['target']==0])
num_1 = len(X[X['target']==1])
print(num_0,num_1)
# random undersample
undersampled_data = pd.concat([ X[X['target']==0].sample(num_1) , X[X['target']==1] ])
print(len(undersampled_data))
# random oversample
oversampled_data = pd.concat([ X[X['target']==0] , X[X['target']==1].sample(num_0, replace=True) ])
print(len(oversampled_data))

90 10
20
180
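
The same two operations can also be written with scikit-learn's `resample` utility, which is a different helper from the `DataFrame.sample` calls used above. The sketch below runs on a small hypothetical frame (the `feature` column and the seed are assumptions) so that it is self-contained.

```python
import pandas as pd
from sklearn.utils import resample

# Small hypothetical imbalanced frame: 90 rows of class 0, 10 of class 1.
data = pd.DataFrame({"feature": range(100), "target": [0] * 90 + [1] * 10})
majority = data[data["target"] == 0]
minority = data[data["target"] == 1]

# Undersample: keep only as many majority rows as there are minority rows.
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=0
)
undersampled = pd.concat([majority_down, minority])

# Oversample: draw minority rows with replacement up to the majority size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
oversampled = pd.concat([majority, minority_up])

print(undersampled["target"].value_counts().to_dict())
print(oversampled["target"].value_counts().to_dict())
```

Note that oversampling with replacement duplicates minority rows, so it is usually applied to the training split only, after any train/test split has been made.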
