# <p style='text-align: center;'> Sampling Techniques for Data Science </p>

## Sampling Methods or Sampling Techniques
In statistics, a **sampling method or sampling technique** is the process of studying a population by gathering information from a subset of it and analyzing that data. It is the basis for working with data when the sample space is too large to study in full.

## Introduction to Population and Sample
To start with, let’s have a look at some basic terminology. It is important to learn the concepts of **population and sample**. The **population** is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse, whereas a **sample** is a subset of observations from the population that ideally is a true representation of the population.

![image.png](attachment:image.png)


Given that experimenting with an entire population is either impossible or simply too expensive, researchers or analysts use samples rather than the entire population in their experiments or trials. To make sure that the experimental results are reliable and hold for the entire population, the sample needs to be a true representation of the population. That is, the sample needs to be unbiased.


<b> Let’s understand the key terms in sampling: the population, the sampling frame, and the sample.
    
- **Population** is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse.
    
    
- **Sampling frame** contains the accessible part of the target population under study. We derive a sample from the sampling frame.
    
    
- **Sample** is a subset of the population, selected through one of various techniques, that ideally is a true representation of the population.


## Advantages of Sampling
Sampling brings many advantages in terms of speed and accuracy. While we are inclined to think that studying each individual in the whole population will lead to accuracy, we tend to overlook the many sources of error that can arise in a study of the whole population. Further, in most cases, it is just not feasible to study the whole population.

A sample can provide accuracy for several reasons. We can deploy trained field workers on whom we can rely to collect the observations, and we can scientifically monitor biases and remove them. Since we are collecting a limited number of observations, we also reduce the possibility of mistakes that come from processing the data. Moreover, the smaller size of the sample means that we can supervise the collection effectively and end up with clean, usable data.

As our analysis will rely on the sample, it’s important that we scientifically approach how we go about selecting samples. However, before we go into sample selection methodologies, let’s look at the errors that can happen while selecting samples.

## Errors in sample selection
Selecting a sample that closely represents the population is critical to business problem-solving. Here are some of the errors:

- **Cyclical business induced errors** — If we are looking at buying behaviour, taking samples around Christmas and Diwali will not be representative of the overall behaviour.


- **Specification error** — If the study is around sales of toys, and we survey the mothers only, that may not be accurate as children influence the buying behaviour.


- **Sample frame error** — This error happens when we select the wrong sub-population. For instance, our study was to understand if the population favours a new policy that has been introduced in India. We survey everyone who speaks English. It may not be accurate as ~90% of the country’s population does not speak English.


## Let’s understand the sampling process

![image.png](attachment:image.png)


- **1. Define target population:** Based on the objective of the study, clearly scope the target population. For instance, if we are studying a regional election, the target population would be all people who are domiciled in the region that are eligible to vote.


- **2. Define Sampling Frame:** The sampling frame consists of the approachable members of the overall population. In the above example, the sampling frame would consist of all the people from the target population who are in the region and can participate in the study.


- **3. Select Sampling Technique:** Now that we have the sampling frame in place, we want to select an appropriate sampling technique. 


- **4. Determine Sample Size:** To ensure that we have an unbiased sample, free from errors and that closely represents the whole population, our sample needs to be of an appropriate size. What is an appropriate size? Well, this is dependent on factors like the complexity of the population under study, the researcher’s resources and associated constraints. Also, it’s important to keep in mind that not all individuals we approach for the study will respond. Researchers like Bartlett et al. suggest that we should increase the number of individuals we approach initially, by as much as 50%, to factor in the non-response rate.


- **5. Collect the Data:** Data collection is critical to solving the business case. We should attempt to ensure that we don’t have too many empty fields in our data, and we document the reasons in cases where the data is missing. This helps in analysis, as this gives us perspective on how to treat the missing data when we perform analysis.


- **6. Assess response rate:** It is important to closely monitor the response rate so that you can make timely changes to your sample collection approach and reach the sample size you determined.
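
As a rough illustration of steps 4 and 6, the sketch below inflates a target sample size to allow for non-response. The helper names `contacts_needed` and `inflate_by_percentage` are hypothetical, and dividing by an expected response rate is an assumption made for illustration rather than a formula from the steps above; the second helper applies the flat increase of up to 50% mentioned in step 4.

```python
import math

def contacts_needed(target_sample_size, expected_response_rate):
    """Number of individuals to approach so that, on average,
    `target_sample_size` of them respond (hypothetical helper)."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("expected_response_rate must be in (0, 1]")
    return math.ceil(target_sample_size / expected_response_rate)

def inflate_by_percentage(target_sample_size, increase=0.5):
    """Flat increase of the initial contact list, e.g. by 50%."""
    return math.ceil(target_sample_size * (1 + increase))

print(contacts_needed(400, 0.5))    # 800
print(inflate_by_percentage(400))   # 600
```

Comparing the observed response rate with the expected one while data collection is running shows early whether more individuals need to be approached.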

## Popular Sampling Techniques
As we know, taking a subset from the sampling frame forms the act of sampling. The various ways in which we can select samples can be divided into two types:

### 1. Probability Sampling:
The probability sampling method utilizes some form of **random selection**. In this method, every eligible individual has a known, nonzero chance of being selected into the sample from the whole sample space. This method is more time-consuming and expensive than the non-probability sampling method. The benefit of using probability sampling is that the sample is far more likely to be representative of the population, although randomness alone does not guarantee it.

### 2. Non-Probability sampling:
This is also referred to as **non-random** sampling. The non-probability sampling method is a technique in which the researcher selects the sample based on subjective judgment rather than random selection. In this method, not all members of the population have a chance to participate in the study.

<b> Whether you go for a probability or a non-probability approach depends on the following factors:

1. Goal and scope of the study.
2. Data collection methods that are feasible.
3. Duration of the study.
4. Level of precision you wish to have from the results.
5. Design of the sampling frame and viability to maintain the frame.


<b> Probability Sampling and Non-Probability Sampling are further divided as follows:
    
<b> Probability Sampling Methods:

- Simple random sampling    
- Systematic sampling
- Clustered sampling    
- Weighted Sampling 
- Stratified sampling
    
    
<b> Non-probability Sampling Methods:
    
- Convenience sampling 
- Consecutive sampling  
- Quota sampling  
- Purposive or Judgmental sampling
- Snowball sampling    

### 1.1 Simple Random sampling:
Simple random sampling is the simplest data sampling technique: it creates a random sample from the original population. In this approach, every observation in the population has the same probability of being selected during the sample generation process. Random Sampling is usually used when we don’t have any prior information about the target population.

For example, consider the random selection of 3 individuals from a population of 10 individuals. Here, each individual has an equal chance of being selected: 1/10 on any single draw, and 3/10 of ending up in the sample of 3.

![image.png](attachment:image.png)

<b> Example:

Suppose we want to select a simple random sample of 200 students from a school of 500 students. We can assign a number from 1 to 500 to every student in the school database and use a random number generator to select a sample of 200 numbers.
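
The student example can be sketched in a few lines. This is a minimal illustration that assumes the students are simply numbered 1 to 500; the variable names and the seed are arbitrary choices, not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seed chosen arbitrarily for reproducibility

student_ids = np.arange(1, 501)        # students numbered 1..500
# draw 200 distinct student numbers, each student equally likely
sample_ids = rng.choice(student_ids, size=200, replace=False)

print(len(sample_ids), len(np.unique(sample_ids)))
```

Setting `replace=False` ensures that no student is selected twice.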

<b> Random Sampling: Python Implementation
    
First, we generate random data that will serve as population data. We will, therefore, randomly sample 10K data points from Normal distribution with mean mu = 10 and standard deviation std = 2. After this, we create a Python function called **random_sampling()** that takes population data and desired sample size and produces as output a random sample.

In [6]:
# import numpy
import numpy as np

# generating population data following Normal Distribution
N = 10000                                                    # Population size
mu = 10                                                      # mean 
std = 2                                                      # Standard Deviation
population_df = np.random.normal(mu,std,N)

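# function that draws a simple random sample of size n (without replacement)
def random_sampling(df, n):
    random_sample = np.random.choice(df,replace = False, size = n)
    return(random_sample)
# note: the sample size passed here is N, so the result is the whole population in shuffled order
randomSample = random_sampling(population_df, N)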
randomSample

array([10.60876528, 12.38190067,  5.68613168, ..., 11.64476715,
        8.50758592, 10.28613311])

In [7]:
len(randomSample)

10000

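<b> From the above, we can see that 10000 records were selected. Because the sample size passed to the function equals the population size N, the result is simply the whole population in shuffled order; in practice, pass a sample size smaller than N to obtain a proper sample.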

### 1.2 Systematic sampling:
**Systematic sampling** is defined as a probability sampling approach where the elements from a target population are selected from a random starting point and after a fixed sampling interval.

Stated differently, systematic sampling is an extension of probability sampling in which members of the group are selected at regular intervals to form a sample. We calculate the sampling interval by dividing the entire population size by the desired sample size.
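
To make the interval calculation concrete, here is a small worked form. The symbols $N$ (population size), $n$ (desired sample size), $k$ (sampling interval) and $r$ (random start) are notation introduced here for illustration, and rounding the interval down is an assumption.

$$k = \left\lfloor \frac{N}{n} \right\rfloor, \qquad \text{selected positions: } r,\; r + k,\; r + 2k,\; \dots,\; r + (n - 1)k, \qquad r \in \{1, \dots, k\}$$

For instance, with $N = 10000$ and $n = 10$, the interval is $k = 1000$.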

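Note that systematic sampling usually produces a random sample but does not address bias in the created sample: if the ordering of the list has a periodic pattern that lines up with the sampling interval, the sample can be unrepresentative.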

![image.png](attachment:image.png)

<b> Example:

Suppose the names of 300 students of a school are sorted in reverse alphabetical order. To select a systematic sample with a sampling interval of 15, we randomly choose a starting number between 1 and 15, say 5. From number 5 onwards, we select every 15th person from the sorted list (5, 20, 35, and so on up to 290). We end up with a sample of 20 students.
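
The student example can be sketched as follows, assuming a list of 300 students, a sampling interval of 15 and a starting number of 5. The placeholder names and the variables `students`, `interval` and `start` are illustrative assumptions.

```python
# Placeholder student names; a real list would come from the school database
students = [f"student_{i:03d}" for i in range(1, 301)]
students_sorted = sorted(students, reverse=True)   # reverse alphabetical order

interval = 15   # sampling interval
start = 5       # starting number (1-based); normally drawn at random from 1..interval

# 1-based positions 5, 20, 35, ... converted to 0-based list indices
positions = range(start, len(students_sorted) + 1, interval)
sample = [students_sorted[p - 1] for p in positions]

print(len(sample))   # 20
```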


<b> Systematic Sampling: Python Implementation
    
We generate data that serves as population data, as in the previous case. We then create a Python function called **systematic_sampling()** that takes the population data and the sampling interval and produces as output a systematic sample.

In [10]:
# import the required libraries
import numpy as np
import pandas as pd

# generating population data following Normal Distribution
N = 10000
mu = 10
std = 2
population_df = np.random.normal(mu,std,N)

# function that creates a sample using Systematic Sampling
def systematic_sampling(df, step):
    # 1-based ids, one per observation
    ids = pd.Series(np.arange(1, len(df) + 1))
    df = pd.Series(df)
    df_pd = pd.concat([ids, df], axis = 1)
    df_pd.columns = ["id", "data"]
    
    # positions increase by the step amount, not by 1; the start is fixed at
    # position 1 here (a random start could be drawn from the first `step` positions)
    selected_index = np.arange(1, len(df), step)
    
    # using iloc to get the data at the selected positions
    systematic_sample = df_pd.iloc[selected_index]
    
    return systematic_sample

n = 10
step = int(N/n)
sample = systematic_sampling(population_df, step)
sample

| | id | data |
|---|---|---|
| 1 | 2 | 9.008347 |
| 1001 | 1002 | 9.800426 |
| 2001 | 2002 | 10.634387 |
| 3001 | 3002 | 8.942525 |
| 4001 | 4002 | 8.995485 |
| 5001 | 5002 | 8.376128 |
| 6001 | 6002 | 8.763486 |
| 7001 | 7002 | 12.720985 |
| 8001 | 8002 | 10.78661 |
| 9001 | 9002 | 9.162616 |


<b> From the above, we can see that we have selected 10 records at a fixed sampling interval of 1000. Note that the code starts at position 1 rather than at a randomly drawn starting point.
    
**Note:** Systematic Sampling usually produces a random sample but does not address the bias in the created sample.

### 1.3 Cluster sampling:
Cluster sampling is a probability sampling technique where we divide the population into multiple clusters (groups) based on certain clustering criteria. We then select one or more clusters at random, using simple random or systematic sampling.

For example, suppose you want to conduct an experiment evaluating the performance of sophomores in business education across Europe. It is impossible to conduct an experiment that involves a student in every university across the EU. Instead, by using Cluster Sampling, we can group the universities of each country into one cluster. Together, these clusters cover the whole sophomore student population in the EU. Next, you can use simple random sampling or systematic sampling to select one or more clusters for your research study.

![image.png](attachment:image.png)

<b> Example:

An educational institution has ten branches across the country with almost the same number of students. If we want to collect some data regarding facilities and other things, we can’t travel to every branch to collect the required data. Hence, we can use random sampling to select three or four branches as clusters.
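
A minimal sketch of the branch example is shown below. The student table, the branch names and the figure of 50 students per branch are hypothetical values made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily

# Hypothetical student table: 10 branches with 50 students each
branch_names = [f"branch_{b}" for b in range(10)]
students = pd.DataFrame({
    "student_id": np.arange(500),
    "branch": np.repeat(branch_names, 50),
})

# Stage 1: randomly pick 3 of the 10 branches (the clusters)
chosen_branches = rng.choice(branch_names, size=3, replace=False)

# Stage 2: keep every student from the chosen branches
cluster_sample = students[students["branch"].isin(chosen_branches)]

print(sorted(chosen_branches), len(cluster_sample))
```

Only the choice of branches is random here; once a branch is chosen, all of its students are included.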


<b> Cluster Sampling: Python Implementation
    
First, we generate data that will serve as population data with 10K observations, and this data consists of the following 4 variables:

- **Price:** generated using Uniform distribution
    
    
- **Id:** Identity
    
    
- **event_type:** which is a categorical variable with 3 possible values {type1, type2, type3}
    
    
- **click:** binary variable taking values {0: no click, 1: click}
    

In [20]:
# import the required libraries
import numpy as np
import pandas as pd

# Generating Population data 
N = 10000                                                    # population size, as in the previous cells
price_vb = pd.Series(np.random.uniform(1,4,size = N))
id = pd.Series(np.arange(0,len(price_vb),1))
event_type = pd.Series(np.random.choice(["type1","type2","type3"],size = len(price_vb)))
click = pd.Series(np.random.choice([0,1],size = len(price_vb)))
df = pd.concat([id,price_vb,event_type, click],axis = 1)
df.columns = ["id","price","event_type", "click"]
df

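| | id | price | event_type | click |
|---|---|---|---|---|
| 0 | 0 | 1.303115 | type3 | 1 |
| 1 | 1 | 1.720620 | type3 | 1 |
| 2 | 2 | 1.165739 | type2 | 1 |
| 3 | 3 | 3.184245 | type2 | 1 |
| 4 | 4 | 2.857674 | type2 | 1 |
| ... | ... | ... | ... | ... |
| 9995 | 9995 | 3.458771 | type2 | 1 |
| 9996 | 9996 | 3.047328 | type3 | 1 |
| 9997 | 9997 | 1.264651 | type2 | 1 |
| 9998 | 9998 | 1.437098 | type1 | 1 |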


Then the function **get_clustered_Sample()** takes as inputs the original data, the number of observations per cluster, and the number of clusters you want to select, and produces as output a clustered sample.

In [21]:
def get_clustered_Sample(df, n_per_cluster, num_select_clusters):
    N = len(df)
    K = int(N/n_per_cluster)
    data = None
    # randomly partition the observations into K clusters of equal size
    for k in range(K):
        sample_k = df.sample(n_per_cluster)
        sample_k["cluster"] = np.repeat(k,len(sample_k))
        df = df.drop(index = sample_k.index)
        data = pd.concat([data,sample_k],axis = 0)

    # pick distinct clusters at random (np.random.randint could pick the same cluster twice)
    random_chosen_clusters = np.random.choice(K, size = num_select_clusters, replace = False)
    samples = data[data.cluster.isin(random_chosen_clusters)]
    return(samples)

sample = get_clustered_Sample(df = df, n_per_cluster = 100, num_select_clusters = 2)
sample

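| | id | price | event_type | click | cluster |
|---|---|---|---|---|---|
| 1412 | 1412 | 1.511824 | type1 | 1 | 11 |
| 8657 | 8657 | 3.040880 | type1 | 1 | 11 |
| 6796 | 6796 | 3.680256 | type1 | 1 | 11 |
| 1701 | 1701 | 1.281335 | type3 | 0 | 11 |
| 6431 | 6431 | 1.190848 | type1 | 1 | 11 |
| ... | ... | ... | ... | ... | ... |
| 9371 | 9371 | 2.330950 | type1 | 0 | 61 |
| 2564 | 2564 | 1.413113 | type3 | 1 | 61 |
| 2886 | 2886 | 3.401720 | type3 | 0 | 61 |
| 5198 | 5198 | 2.906605 | type3 | 1 | 61 |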


**Note:** Cluster Sampling usually produces a random sample but does not address the bias in the created sample.

### 1.4 Weighted Sampling:
In some experiments, you might need each item’s sampling probability to follow a weight associated with that item, so that the proportions of the different types of observations are taken into account. For example, you might need a sample of search engine queries weighted by the number of times each query was performed, so that the sample can be analyzed for its overall impact on the user experience. In this case, Weighted Sampling is preferable to Random Sampling or Systematic Sampling.

Weighted Sampling is a data sampling method with weights that intends to compensate for the selection of specific observations with unequal probabilities (oversampling), non-coverage, non-response, and other types of bias. If a biased data set is not adjusted and a simple random sampling type of approach is used instead, then the population descriptors (e.g., mean, median) estimated from the sample will be skewed and will fail to represent the population correctly.

Weighted Sampling addresses the bias in the sample, by creating a sample that takes into account the proportions of the type of observations in the population. Hence, Weighted Sampling usually produces a random and unbiased sample.

![image.png](attachment:image.png)




<b> Weighted Sampling: Python Implementation
    
The function **get_weighted_sample()** takes as inputs the original data and the desired sample size, and produces as output a weighted sample. Note that the proportions, in this case, are defined based on the click event within each event_type. That is, for each event_type we compute its share of all observations with click = 1 (let’s say X%), and then we randomly draw X% of the desired sample size from that event_type.

In [23]:
def get_weighted_sample(df,n):
    def get_class_prob(x):
        # share of all click events that falls into this event_type, scaled to n
        weight_x = int(np.rint(n * len(x[x.click != 0]) / len(df[df.click != 0])))
        sampled_x = x.sample(weight_x).reset_index(drop=True)
        return (sampled_x)

    # we are grouping by the target class we use for the proportions
    weighted_sample = df.groupby('event_type').apply(get_class_prob)
    print(weighted_sample["event_type"].value_counts())
    return (weighted_sample)

sample = get_weighted_sample(df,100)
sample

type3    34
type1    33
type2    33
Name: event_type, dtype: int64


| event_type | | id | price | event_type | click |
|---|---|---|---|---|---|
| type1 | 0 | 7288 | 3.518471 | type1 | 0 |
| type1 | 1 | 9590 | 3.899270 | type1 | 1 |
| type1 | 2 | 3751 | 2.238137 | type1 | 1 |
| type1 | 3 | 6616 | 3.152926 | type1 | 1 |
| type1 | 4 | 3052 | 2.849388 | type1 | 0 |
| ... | ... | ... | ... | ... | ... |
| type3 | 29 | 8347 | 2.466757 | type3 | 1 |
| type3 | 30 | 570 | 1.908015 | type3 | 0 |
| type3 | 31 | 6674 | 3.102505 | type3 | 0 |
| type3 | 32 | 2589 | 1.896749 | type3 | 1 |


**Note:** Weighted Sampling usually produces a random and unbiased sample.

### 1.5 Stratified sampling:
**Stratified Sampling** is a data sampling approach, where we divide a population into homogeneous subpopulations called strata based on specific characteristics (e.g., age, race, gender identity, location, event type etc.).

Every member of the population studied should be in exactly one stratum. Each stratum is then sampled using another probability sampling method (simple random sampling is a common choice; the implementation in this section uses Cluster Sampling), allowing data scientists to estimate statistical measures for each sub-population. We rely on Stratified Sampling when the population’s characteristics are diverse and we want to ensure that every characteristic is properly represented in the sample.

So, in this article, Stratified Sampling is treated as the combination of Clustered Sampling and Weighted Sampling.

![image.png](attachment:image.png)


<b> Example:
    
Suppose there are three bags (A, B and C), each with different balls. Bag A has 50 balls, bag B has 100 balls, and bag C has 200 balls. We have to choose a sample of balls from each bag proportionally. With a sampling fraction of 10%, that means 5 balls from bag A, 10 balls from bag B and 20 balls from bag C.
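
The proportional allocation in the bag example can be sketched as below. The 10% sampling fraction is inferred from the 5, 10 and 20 balls in the example, and the ball labels are placeholders.

```python
import random

random.seed(1)   # seed chosen arbitrarily for reproducibility

bag_sizes = {"A": 50, "B": 100, "C": 200}
sampling_fraction = 0.10   # assumed fraction, matching 5 / 10 / 20 balls

# number of balls to draw from each bag (stratum)
allocation = {bag: round(size * sampling_fraction) for bag, size in bag_sizes.items()}

# placeholder balls, then a simple random sample within each bag
bags = {bag: [f"{bag}{i}" for i in range(size)] for bag, size in bag_sizes.items()}
stratified_sample = {bag: random.sample(balls, allocation[bag]) for bag, balls in bags.items()}

print(allocation)   # {'A': 5, 'B': 10, 'C': 20}
```

Every bag contributes to the sample, and each contributes in proportion to its size.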


<b> Stratified Sampling: Python Implementation
    
The function **get_stratified_sample()** takes as inputs the original data, the desired sample size, and the number of clusters needed, and it produces as output a stratified sample. Note that this function first builds weighted samples (grouped by event_type and weighted by the click event), labels each one as a cluster, and then randomly selects the requested number of clusters.

In [25]:
def get_stratified_sample(df,n,num_clusters_needed):
    N = len(df)
    num_obs_per_cluster = int(N/n)
    K = int(N/num_obs_per_cluster)

    def get_weighted_sample(df,num_obs_per_cluster):
        def get_sample_per_class(x):
            n_x = int(np.rint(num_obs_per_cluster*len(x[x.click !=0])/len(df[df.click !=0])))
            sample_x = x.sample(n_x)
            return(sample_x)
        weighted_sample = df.groupby("event_type").apply(get_sample_per_class)
        return(weighted_sample)

    stratas = None
    for k in range(K):
        # each cluster is an independent weighted sample of the full data,
        # so the same observation can appear in more than one cluster
        weighted_sample_k = get_weighted_sample(df,num_obs_per_cluster).reset_index(drop = True)
        weighted_sample_k["cluster"] = np.repeat(k,len(weighted_sample_k))
        stratas = pd.concat([stratas, weighted_sample_k],axis = 0)
    # pick distinct clusters at random
    selected_strata_clusters = np.random.choice(K, size = num_clusters_needed, replace = False)
    stratified_samples = stratas[stratas.cluster.isin(selected_strata_clusters)]
    return(stratified_samples)

sample = get_stratified_sample(df = df,n = 100,num_clusters_needed = 2)
sample

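| | id | price | event_type | click | cluster |
|---|---|---|---|---|---|
| 0 | 8990 | 1.004787 | type1 | 1 | 9 |
| 1 | 545 | 2.185406 | type1 | 0 | 9 |
| 2 | 1400 | 2.555783 | type1 | 1 | 9 |
| 3 | 7058 | 1.328249 | type1 | 0 | 9 |
| 4 | 2823 | 1.300704 | type1 | 0 | 9 |
| ... | ... | ... | ... | ... | ... |
| 95 | 6233 | 2.073821 | type3 | 0 | 74 |
| 96 | 8180 | 1.505576 | type3 | 1 | 74 |
| 97 | 9226 | 3.904034 | type3 | 1 | 74 |
| 98 | 1981 | 3.170735 | type3 | 1 | 74 |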


**Note:** Stratified Sampling, as implemented here, is basically the combination of Clustered Sampling and Weighted Sampling.

### 2.1 Convenience Sampling:
In a convenience sampling method, the samples are selected from the population directly because they are conveniently available to the researcher. The samples are easy to select, but the researcher makes no attempt to choose a sample that represents the entire population.

<b> Example:

In researching customer support services in a particular region, we ask a few of our customers to complete a survey on the products after their purchase. This is a convenient way to collect data. However, because we only surveyed customers who bought the same product, the sample is not representative of all the customers in that area.
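
In code, a convenience sample often amounts to taking whichever records are easiest to reach, for instance the first rows of a table. The sketch below uses a hypothetical customer table to show how such a sample can drift away from the population.

```python
import pandas as pd

# Hypothetical customers, listed in the order they happen to be reachable
customers = pd.DataFrame({
    "customer_id": range(1, 13),
    "product": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
})

# Convenience sample: the first 4 customers we happen to reach
convenience_sample = customers.head(4)

# product shares in the population versus in the convenience sample
print(customers["product"].value_counts(normalize=True).to_dict())
print(convenience_sample["product"].value_counts(normalize=True).to_dict())
```

All sampled customers bought product A, even though each product accounts for a third of the population.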

    
### 2.2 Consecutive Sampling:
Consecutive sampling is similar to convenience sampling with a slight variation. The researcher picks a single person or a group of people for sampling, studies them for a period of time, analyzes the results, and then moves on to another group if needed.

    
### 2.3 Quota Sampling:
In the quota sampling method, the researcher forms a sample of individuals chosen to represent the population on specific traits or qualities. The researcher selects sample subsets so that the collected data can be generalized to the entire population.
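
One way to picture quota sampling is to accept respondents in the order they are reached until a fixed quota per trait is full. The respondent table, the `age_group` trait and the quota values below are hypothetical.

```python
import pandas as pd

# Hypothetical respondents in the order they were reached
respondents = pd.DataFrame({
    "respondent_id": range(1, 11),
    "age_group": ["18-30", "18-30", "31-50", "18-30", "51+",
                  "31-50", "18-30", "51+", "31-50", "51+"],
})

quotas = {"18-30": 2, "31-50": 2, "51+": 1}   # hypothetical quotas per trait

# Accept respondents in arrival order until each quota is full (no randomness)
selected = []
filled = {group: 0 for group in quotas}
for row in respondents.itertuples(index=False):
    if filled[row.age_group] < quotas[row.age_group]:
        selected.append(row.respondent_id)
        filled[row.age_group] += 1

quota_sample = respondents[respondents["respondent_id"].isin(selected)]
print(selected)   # [1, 2, 3, 5, 6]
```

The quotas control the composition of the sample, but who fills them depends on who is reached first, not on random selection.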



### 2.4 Purposive or Judgmental Sampling
In purposive sampling, the samples are selected solely on the basis of the researcher’s knowledge. Because that knowledge is instrumental in creating the samples, there is a chance of obtaining highly accurate answers with a minimal margin of error. It is also known as judgmental sampling or authoritative sampling.

    
### 2.5 Snowball Sampling:
Snowball sampling is also known as a chain-referral sampling technique. It is used when the members of the target population have traits that are difficult to find. So, each identified member of the population is asked to find other sampling units, and those sampling units also belong to the same target population.
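
The chain-referral idea can be sketched as a breadth-first walk over a referral network. The participant names, the `referrals` dictionary and the `snowball_sample` helper are hypothetical and serve only as an illustration.

```python
from collections import deque

# Hypothetical referral lists: who each participant can refer
referrals = {
    "ana": ["bo", "cy"],
    "bo": ["dee"],
    "cy": ["dee", "eli"],
    "dee": [],
    "eli": ["ana"],
    "fay": [],   # never referred by anyone, so never reached
}

def snowball_sample(seeds, referrals, max_size):
    """Follow referrals breadth-first from the seed participants."""
    sample, queue = [], deque(seeds)
    seen = set(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        for referred in referrals.get(person, []):
            if referred not in seen:   # avoid recruiting the same person twice
                seen.add(referred)
                queue.append(referred)
    return sample

print(snowball_sample(["ana"], referrals, max_size=4))   # ['ana', 'bo', 'cy', 'dee']
```

Members who are not connected to any seed, such as `fay` in this toy network, can never enter the sample.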

## Probability sampling vs Non-probability Sampling Methods
The table below shows a few differences between probability sampling methods and non-probability sampling methods.

![image.png](attachment:image.png)