<a href="https://colab.research.google.com/github/MinakoNG63/DSFB/blob/main/19_Probability_Sampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 19. Probability Sampling with Python

Term 1 2022 - Instructor: Teerapong Leelanupab

Teaching Assistants:
1. Piyawat Chuangkrud (Sam)
2. Suvapat Manu (Mint)

***

Credit: Roberto Salazar, [Probability Sampling with Python](https://towardsdatascience.com/probability-sampling-with-python-8c977ad78664)

Another good, thorough explanation: [data-sampling-methods-in-python](https://towardsdatascience.com/data-sampling-methods-in-python-a4400628ea1b)

Additional Reading: [Determining the Sample Group (in Thai)](http://pioneer.netserv.chula.ac.th/~jaimorn/re6.htm)
***

### For the following example, let’s draw samples from a set of 10 products using probability sampling to estimate the population mean of a particular measure of interest.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd

# Set random seed
np.random.seed(42)

# Define total number of products
number_of_products = 10

# Create data dictionary
data = {'product_id':np.arange(1, number_of_products+1).tolist(),
       'measure':np.round(np.random.normal(loc=10, scale=0.5, size=number_of_products),3)}

# Transform dictionary into a data frame
df = pd.DataFrame(data)

# Store the real mean in a separate variable
real_mean = round(df['measure'].mean(),3)

# View data frame
df

Unnamed: 0,product_id,measure
0,1,10.248
1,2,9.931
2,3,10.324
3,4,10.762
4,5,9.883
5,6,9.883
6,7,10.79
7,8,10.384
8,9,9.765
9,10,10.271


## 1. Simple Random Sampling

As its name suggests, the simple random sampling method selects random samples from a process or population where every unit has the same probability of getting selected. This is the most direct method of probability sampling.

<figure>
<center>
<img src='https://www.it.kmitl.ac.th/~teerapong/Exxon/Week2/images/1_Simple_Random_Sampling.png' alt='Simple Random Sampling'/>
<figcaption><em>Fig. 1: Simple Random Sampling</em></figcaption></center>
</figure>

In [None]:
# Obtain simple random sample
simple_random_sample = df.sample(n=4).sort_values(by='product_id')

# Save the sample mean in a separate variable
simple_random_mean = round(simple_random_sample['measure'].mean(),3)

# View sampled data frame
simple_random_sample

Unnamed: 0,product_id,measure
2,3,10.324
6,7,10.79
7,8,10.384
8,9,9.765
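
A single sample mean such as `simple_random_mean` can land above or below the population mean purely by chance. As a minimal sketch of that variability, the block below repeats a simple random draw of 4 units many times. The `population` values, the `rng` seed, and `n_draws` are assumptions made for illustration and are not the notebook's own data.

```python
import numpy as np

# Hypothetical population: 10 measures generated like the example above
rng = np.random.default_rng(42)
population = np.round(rng.normal(loc=10, scale=0.5, size=10), 3)
population_mean = population.mean()

# Repeat the simple random draw (4 units, without replacement) many times
n_draws = 2000
sample_means = np.array([
    rng.choice(population, size=4, replace=False).mean()
    for _ in range(n_draws)
])

# Individual sample means vary, but their average stays close to the population mean
print("population mean:     ", round(population_mean, 3))
print("mean of sample means:", round(sample_means.mean(), 3))
print("std of sample means: ", round(sample_means.std(ddof=1), 3))
```

The spread of `sample_means` gives a rough idea of how far a single sample of 4 units may fall from the population mean.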


## 2. Systematic Sampling
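The systematic sampling method selects units based on a fixed sampling interval (i.e., every $n^{th}$ unit is selected from a given process or population). It is often simpler to carry out than simple random sampling and spreads the sample evenly across the population, but it can give biased results when the population has a periodic pattern that coincides with the sampling interval.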

<figure>
<center>
<img src='https://www.it.kmitl.ac.th/~teerapong/Exxon/Week2/images/2_Systematic_Sampling.png' alt='Systematic Sampling'/>
<figcaption><em>Fig. 2: Systematic Sampling</em></figcaption></center>
</figure>

In [None]:
# Define systematic sampling function
def systematic_sampling(df, step):

    indexes = np.arange(0,len(df),step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample

# Obtain a systematic sample and save it in a new variable
systematic_sample = systematic_sampling(df, 3)

# Save the sample mean in a separate variable
systematic_mean = round(systematic_sample['measure'].mean(),3)

# View sampled data frame
systematic_sample

Unnamed: 0,product_id,measure
0,1,10.248
3,4,10.762
6,7,10.79
9,10,10.271
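
The `systematic_sampling` function above always starts at position 0, so units at other offsets can never be selected. A common variant picks the starting position at random within the first interval, which gives every unit a selection probability of 1/`step`. The following is a minimal sketch of that variant; the function name `systematic_sampling_random_start`, the `rng` argument, and the small `population` frame are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

def systematic_sampling_random_start(df, step, rng):
    # Choose the first unit at random within the first interval,
    # then take every `step`-th unit after it
    start = int(rng.integers(0, step))
    positions = np.arange(start, len(df), step)
    return df.iloc[positions]

# Hypothetical population of 10 products
rng = np.random.default_rng(0)
population = pd.DataFrame({"product_id": np.arange(1, 11)})

sample = systematic_sampling_random_start(population, step=3, rng=rng)
print(sample["product_id"].tolist())
```

With 10 units and a step of 3, the sample contains 4 units when the start is 0 and 3 units otherwise.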


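## 3. Cluster Sampling
The cluster sampling method divides the population into clusters of equal size *n* and selects every $T^{th}$ cluster. In the code below, the 10 products are split into 5 clusters of 2 units each, and the clusters with an even id are selected (T = 2).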

<figure>
<center>
<img src='https://www.it.kmitl.ac.th/~teerapong/Exxon/Week2/images/3_Cluster_Sampling.png' alt='Cluster Sampling'/>
<figcaption><em>Fig. 3: Cluster Sampling</em></figcaption></center>
</figure>

In [None]:
def cluster_sampling(df, number_of_clusters):

    # The population must split evenly into the requested number of clusters
    if len(df) % number_of_clusters != 0:
        raise ValueError("The population cannot be divided into clusters of equal size!")

    # Work on a copy so the caller's data frame is not modified
    df = df.copy()

    # Divide the units into clusters of equal size
    # (np.repeat needs an integer repeat count, hence the floor division)
    cluster_size = len(df) // number_of_clusters
    df['cluster_id'] = np.repeat(np.arange(1, number_of_clusters + 1), cluster_size)

    # Keep the units from the clusters that meet the criterion
    # Here, clusters with an even id are selected
    cluster_sample = df[df['cluster_id'] % 2 == 0]
    return cluster_sample

# Obtain a cluster sample and save it in a new variable
cluster_sample = cluster_sampling(df,5)

# Save the sample mean in a separate variable
cluster_mean = round(cluster_sample['measure'].mean(),3)

# View sampled data frame
cluster_sample

Unnamed: 0,product_id,measure,cluster_id
2,3,10.324,2
3,4,10.762,2
6,7,10.79,4
7,8,10.384,4
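
The function above always picks the even-numbered clusters, so the selection itself is deterministic. In cluster sampling as usually described, the clusters are chosen at random and every unit in a chosen cluster is kept. The sketch below illustrates that variant; the names `random_cluster_sampling`, `clusters_to_select`, and `rng`, as well as the `population` frame, are hypothetical.

```python
import numpy as np
import pandas as pd

def random_cluster_sampling(df, number_of_clusters, clusters_to_select, rng):
    # The population must split evenly into the requested number of clusters
    if len(df) % number_of_clusters != 0:
        raise ValueError("The population cannot be divided into clusters of equal size!")

    # Assign consecutive units to clusters 1..number_of_clusters
    cluster_size = len(df) // number_of_clusters
    cluster_ids = np.repeat(np.arange(1, number_of_clusters + 1), cluster_size)

    # Choose whole clusters at random, without replacement
    chosen = rng.choice(np.arange(1, number_of_clusters + 1),
                        size=clusters_to_select, replace=False)

    # Keep every unit that belongs to a chosen cluster
    return df.assign(cluster_id=cluster_ids)[np.isin(cluster_ids, chosen)]

# Hypothetical population of 10 products
rng = np.random.default_rng(1)
population = pd.DataFrame({
    "product_id": np.arange(1, 11),
    "measure": np.round(rng.normal(loc=10, scale=0.5, size=10), 3),
})

cluster_sample = random_cluster_sampling(population, number_of_clusters=5,
                                         clusters_to_select=2, rng=rng)
print(cluster_sample)
```

Because `assign` returns a new data frame, `population` itself is left without a `cluster_id` column.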


## 4. Stratified Random Sampling
The stratified random sampling method divides the population into subgroups (i.e., strata) and then draws a random sample from each stratum, so that every unit within a stratum has the same probability of being selected.

<figure>
<center>
<img src='https://www.it.kmitl.ac.th/~teerapong/Exxon/Week2/images/4_Stratified_Random_Sampling.png' alt='Stratified Random Sampling'/>
<figcaption><em>Fig. 4: Stratified Random Sampling</em></figcaption></center>
</figure>

In [None]:
# Create data dictionary with strata (the subgroup is stored in the 'product_strata' column)
# np.repeat needs an integer repeat count, hence the floor division
data = {'product_id':np.arange(1, number_of_products+1).tolist(),
       'product_strata':np.repeat([1,2], number_of_products//2).tolist(),
       'measure':np.round(np.random.normal(loc=10, scale=0.5, size=number_of_products),3)}

# Transform dictionary into a data frame
df = pd.DataFrame(data)

# View data frame
df

Unnamed: 0,product_id,product_strata,measure
0,1,1,9.853
1,2,1,9.985
2,3,1,10.048
3,4,1,10.332
4,5,1,9.93
5,6,2,9.983
6,7,2,9.625
7,8,2,9.611
8,9,2,10.474
9,10,2,10.79


In [None]:
# Import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit

# Set the split criteria
split = StratifiedShuffleSplit(n_splits=1, test_size=4)

# Perform data frame split
for x, y in split.split(df, df['product_strata']):
    stratified_random_sample = df.iloc[y].sort_values(by='product_id')

# Save the sample mean in a separate variable
stratified_random_mean = round(stratified_random_sample['measure'].mean(),3)

# View sampled data frame
stratified_random_sample

Unnamed: 0,product_id,product_strata,measure
3,4,1,10.332
4,5,1,9.93
8,9,2,10.474
9,10,2,10.79


In [None]:
# Obtain the sample mean for each group
stratified_random_sample.groupby('product_strata').mean().drop(['product_id'],axis=1)

product_strata,measure
1,10.131
2,10.632
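
As an alternative to `StratifiedShuffleSplit`, pandas can draw a fixed number of units from each stratum directly. The sketch below assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available; the `population` frame, the seed values, and the variable names are hypothetical. It also combines the per-stratum means into one estimate, weighting each stratum by its share of the population.

```python
import numpy as np
import pandas as pd

# Hypothetical population with two strata of five products each
rng = np.random.default_rng(42)
population = pd.DataFrame({
    "product_id": np.arange(1, 11),
    "product_strata": np.repeat([1, 2], 5),
    "measure": np.round(rng.normal(loc=10, scale=0.5, size=10), 3),
})

# Draw 2 units at random from each stratum
sample = population.groupby("product_strata").sample(n=2, random_state=42)

# Weight each stratum mean by the stratum's share of the population
weights = population["product_strata"].value_counts(normalize=True)
stratum_means = sample.groupby("product_strata")["measure"].mean()
stratified_estimate = (stratum_means * weights).sum()

print(sample.sort_values(by="product_id"))
print("stratified estimate:", round(stratified_estimate, 3))
```

When the strata are equally large and equally sampled, as here, the weighted estimate coincides with the plain sample mean; the weights matter once strata differ in size.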


## Measure Mean Comparison per Sampling Method
Once samples have been obtained with each sampling technique, let’s compare the sample means with the population mean (which is usually unknown, but not in this case) to see which technique gives the best approximation of the population mean of the measure.

In [None]:
# Create a dictionary with the mean outcomes for each sampling method and the real mean
outcomes = {'sample_mean':[simple_random_mean, systematic_mean, cluster_mean, stratified_random_mean],
           'real_mean':real_mean}

# Transform dictionary into a data frame
outcomes = pd.DataFrame(outcomes, index=['Simple Random Sampling','Systematic Sampling','Cluster Sampling', 'Stratified Random Sampling'])

# Add a value corresponding to the absolute error
outcomes['abs_error'] = abs(outcomes['real_mean'] - outcomes['sample_mean'])

# Sort data frame by absolute error
outcomes.sort_values(by='abs_error')

Unnamed: 0,sample_mean,real_mean,abs_error
Simple Random Sampling,10.316,10.224,0.092
Stratified Random Sampling,10.381,10.224,0.157
Systematic Sampling,10.518,10.224,0.294
Cluster Sampling,10.565,10.224,0.341


## Results
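According to the Measure Mean Comparison per Sampling Method table, the mean of the sample obtained through simple random sampling was the closest to the real mean, with an absolute error of 0.092 units. This ranking reflects a single draw of four units per method, so it does not show that one technique is generally more accurate than another. Note also that the stratified sample was drawn from the data frame regenerated in Section 4, whose own mean is about 10.063, whereas `real_mean` (10.224) was computed from the original data frame.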
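
To compare the techniques more fairly, each one can be repeated many times on the same population and the errors averaged. The block below is a rough sketch of that idea, not the notebook's method: the `values` array, the helper functions, the random-start and random-cluster variants, and `n_repetitions` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical population: 10 measures generated like the example above
rng = np.random.default_rng(42)
values = np.round(rng.normal(loc=10, scale=0.5, size=10), 3)
true_mean = values.mean()

def simple_random(values, rng):
    # 4 units drawn without replacement
    return rng.choice(values, size=4, replace=False)

def systematic(values, rng, step=3):
    # Random start within the first interval, then every step-th unit
    start = int(rng.integers(0, step))
    return values[start::step]

def cluster(values, rng, number_of_clusters=5, clusters_to_select=2):
    # Consecutive units form a cluster; whole clusters are drawn at random
    clusters = values.reshape(number_of_clusters, -1)
    chosen = rng.choice(number_of_clusters, size=clusters_to_select, replace=False)
    return clusters[chosen].ravel()

def stratified(values, rng, per_stratum=2):
    # Strata are the first five and the last five units
    first, second = values[:5], values[5:]
    return np.concatenate([
        rng.choice(first, size=per_stratum, replace=False),
        rng.choice(second, size=per_stratum, replace=False),
    ])

methods = {
    "Simple Random Sampling": simple_random,
    "Systematic Sampling": systematic,
    "Cluster Sampling": cluster,
    "Stratified Random Sampling": stratified,
}

# Average the absolute error of the sample mean over many repetitions
n_repetitions = 1000
mean_abs_error = {}
for name, draw in methods.items():
    errors = [abs(draw(values, rng).mean() - true_mean) for _ in range(n_repetitions)]
    mean_abs_error[name] = float(np.mean(errors))

for name, error in sorted(mean_abs_error.items(), key=lambda item: item[1]):
    print(f"{name}: {error:.3f}")
```

The resulting ordering depends on the population and on how clusters and strata are formed, so it should be read as a property of this toy setup only.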

## Concluding Thoughts
Sampling is a useful and effective way to draw conclusions about a population from a sample. However, analysts and engineers must choose sampling techniques and sample sizes that reduce sampling bias (e.g., selection bias from convenience or systematic sampling, environmental bias, non-response bias) in order to obtain samples that are representative of the population.