<a href="https://colab.research.google.com/github/HanifaElahi/Statistical-Analysis/blob/main/Statistical%20Analysis%20Part%20V%20Sampling%20Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import warnings

warnings.filterwarnings('ignore')

In [2]:
import numpy as np

import pandas as pd

# Sampling

---

Sampling is a statistical method of selecting a subset (called a "sample") from a larger population. The goal is to analyze the sample to infer information about the entire population. Sampling techniques are crucial for efficient data analysis, especially when dealing with large datasets or populations.



# Key Terms in Sampling

## 1. Population

The entire group of individuals or items that you want to study or make conclusions about.

**Example:** All citizens of a country, all employees in a company.

## 2. Sample

A subset of the population selected for analysis.

**Example:** 1,000 citizens surveyed to represent a country of 1 million.

## 3. Sampling Frame

A list or database of all members in the population from which the sample is drawn.

**Example:** A company’s employee directory.

## 4. Sampling Unit

An individual member of the population that can be selected as part of the sample.

**Example:** A single person, household, or company.

## 5. Parameter

A measurable characteristic of a population (e.g., mean, median).

**Example:** The average age of all citizens in a country.

## 6. Statistic

A measurable characteristic derived from the sample, used to estimate the population parameter.

**Example:** The average age of the sampled citizens.

## 7. Sampling Bias

A bias that occurs when the sample is not representative of the population, leading to inaccurate conclusions.

**Example:** Only surveying urban residents about national policies.

## 8. Sampling Error

The difference between a population parameter and a sample statistic due to the sampling process.

**Example:** If the population mean is 50 but the sample mean is 48, the sampling error is 2.
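
The relationship between a parameter, a statistic, and the sampling error can be sketched in a few lines. This is a minimal illustration that assumes a made-up population of ages and a fixed seed; neither comes from the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable

# Hypothetical population: ages of 10,000 people
population = rng.integers(18, 65, size=10_000)

# Simple random sample of 200 people, drawn without replacement
sample = rng.choice(population, size=200, replace=False)

parameter = population.mean()           # population parameter
statistic = sample.mean()               # sample statistic
sampling_error = statistic - parameter  # difference caused by sampling

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
print(f"Sampling error:              {sampling_error:.2f}")
```

The sampling error changes from one sample to the next, which is why it is usually described by its typical size rather than by a single value.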

## 9. Sample Size

The number of individuals or units included in the sample.

**Example:** 200 respondents in a survey.
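
Sample size matters because larger samples tend to produce smaller sampling errors. The sketch below illustrates this with a hypothetical income population; the helper `mean_abs_error`, the seed, and the number of trials are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of 10,000 incomes
population = rng.integers(30_000, 100_000, size=10_000)
true_mean = population.mean()

def mean_abs_error(n, trials=300):
    """Average absolute sampling error of the mean over repeated samples of size n."""
    errors = [
        abs(rng.choice(population, size=n, replace=False).mean() - true_mean)
        for _ in range(trials)
    ]
    return float(np.mean(errors))

for n in (25, 100, 400):
    print(f"n={n:>3}: average absolute error = {mean_abs_error(n):,.0f}")
```

The error shrinks roughly with the square root of the sample size, so quadrupling the sample roughly halves the typical error.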

To draw valid conclusions from your results, you have to carefully decide how you will select a sample that is representative of the group as a whole. This is called a sampling method. There are two primary types of sampling methods that you can use in your research:

- **Probability sampling** involves random selection, allowing you to make strong statistical inferences about the whole group.
- **Non-probability sampling** involves non-random selection based on convenience or other criteria, allowing you to easily collect data.

# Dataset

In [3]:
data = pd.DataFrame({
    'ID': range(1, 101),
    'Age': np.random.randint(18, 65, 100),
    'Gender': np.random.choice(['Male', 'Female'], 100),
    'Income': np.random.randint(30000, 100000, 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Feedback': np.random.choice(['Positive', 'Neutral', 'Negative'], 100)
})

In [4]:
data.shape

(100, 6)

In [5]:
data.dtypes

| Column | dtype |
|---|---|
| ID | int64 |
| Age | int64 |
| Gender | object |
| Income | int64 |
| Region | object |
| Feedback | object |


# Probability sampling methods

- Probability sampling gives every individual in the population a known, non-zero chance of being selected; in simple random sampling, that chance is the same for everyone.

- This approach is commonly applied in quantitative research.

- When the goal is to obtain results that accurately reflect the entire population, probability sampling methods are the most reliable option.

- It is typically more time-consuming and expensive than non-probability sampling.

## 1. Random Sampling

Image Source : TGM Research

<img src = 'https://tgmresearch.com/templates/yootheme/cache/06/definition-06208138.jpeg' width="900" height="750">


### Definition

Random sampling (more precisely, simple random sampling) gives every individual in the population an equal chance of being selected. Because selection depends entirely on chance, it is also known as the “method of chance selection”. It is sometimes loosely called **representative sampling**, although a random sample is representative only on average, not in every draw.

### Key Features

- Equal probability of selection.
- No bias in the selection process.
- Requires a complete population list.

### Pros

- Simple and straightforward to implement.
- Reduces selection bias.
- Results are highly representative of the population (if the sample is large enough).

### Cons

- May not be feasible for large populations.
- Requires a comprehensive list of all individuals in the population.
- Randomness may still lead to unbalanced samples by chance.

### Use Case

- Selecting survey participants from a customer database for market research.

In [6]:
# Random Sampling
def random_sampling(data, sample_size):
    return data.sample(n=sample_size, random_state=42)

In [7]:
random_sample = random_sampling(data, 10)
random_sample

| | ID | Age | Gender | Income | Region | Feedback |
|---|---|---|---|---|---|---|
| 83 | 84 | 44 | Female | 48493 | North | Negative |
| 53 | 54 | 22 | Female | 97721 | South | Neutral |
| 70 | 71 | 47 | Female | 78615 | North | Positive |
| 45 | 46 | 31 | Female | 87865 | East | Positive |
| 44 | 45 | 52 | Male | 87374 | West | Positive |
| 39 | 40 | 48 | Male | 97359 | West | Neutral |
| 22 | 23 | 46 | Female | 46144 | West | Neutral |
| 80 | 81 | 27 | Male | 42370 | East | Negative |
| 10 | 11 | 41 | Male | 95213 | North | Neutral |
| 0 | 1 | 27 | Female | 47472 | North | Negative |
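
The point that randomness can still produce unbalanced samples is easy to see by drawing several samples with different seeds and counting regions. This sketch uses a small stand-in frame named `demo` (an assumption, not the notebook's `data`), so the counts will not match the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the notebook's dataset
demo = pd.DataFrame({
    'ID': range(1, 101),
    'Region': rng.choice(['North', 'South', 'East', 'West'], 100),
})

regions = ['North', 'South', 'East', 'West']
for seed in (1, 2, 3):
    sample = demo.sample(n=10, random_state=seed)
    # Count how many sampled rows fall in each region (0 if a region is missing)
    counts = sample['Region'].value_counts().reindex(regions, fill_value=0)
    print(f"seed={seed}: {counts.to_dict()}")
```

With only 10 rows per sample, the regional split can vary noticeably between draws, and a region can occasionally be missed altogether.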


## 2. Systematic Sampling

Image Source : TGM Research

<img src = 'https://tgmresearch.com/images/library/blog-2023/systematic-sampling/systematic-random-sampling.jpg' width="900" height="750">

### Definition

Systematic sampling selects individuals at regular intervals from an ordered population list.

### Key Features

- Requires an ordered population.
- Selection interval is calculated as [Population Size / Sample Size].

### Pros

- Easy to implement.
- Ensures even distribution of samples.
- Less time-consuming than random sampling.

### Cons

- Can introduce bias if the list has a pattern.
- Not ideal if the population order is cyclical or repetitive.

### Use Case

- Quality control in manufacturing (e.g., inspecting every 10th item produced).


In [8]:
# Systematic Sampling
def systematic_sampling(data, step):
    return data.iloc[::step]

In [9]:
systematic_sample = systematic_sampling(data, 10)
systematic_sample

| | ID | Age | Gender | Income | Region | Feedback |
|---|---|---|---|---|---|---|
| 0 | 1 | 27 | Female | 47472 | North | Negative |
| 10 | 11 | 41 | Male | 95213 | North | Neutral |
| 20 | 21 | 36 | Male | 80730 | South | Positive |
| 30 | 31 | 50 | Male | 74239 | West | Negative |
| 40 | 41 | 37 | Male | 51997 | East | Positive |
| 50 | 51 | 22 | Male | 94679 | East | Negative |
| 60 | 61 | 59 | Female | 31898 | West | Neutral |
| 70 | 71 | 47 | Female | 78615 | North | Positive |
| 80 | 81 | 27 | Male | 42370 | East | Negative |
| 90 | 91 | 39 | Male | 94469 | West | Neutral |
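
The function above always starts at the first row and takes the step as an argument. A common variant derives the interval from the desired sample size and picks a random starting point within the first interval. The sketch below is one possible way to do that; the function name and the `seed` parameter are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def systematic_sampling_random_start(data, sample_size, seed=None):
    """Take every k-th row, starting from a random position in the first interval."""
    k = len(data) // sample_size       # interval = population size / sample size
    if k < 1:
        raise ValueError("sample_size cannot exceed the population size")
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, k))    # random start in [0, k)
    # head() trims the extra row that can appear when len(data) is not a multiple of k
    return data.iloc[start::k].head(sample_size)

demo = pd.DataFrame({'ID': range(1, 101)})  # hypothetical ordered population
sample = systematic_sampling_random_start(demo, sample_size=10, seed=7)
print(sample['ID'].tolist())
```

The random start gives every row the same chance of selection, while the fixed interval keeps the sample evenly spread over the list.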


## 3. Stratified Sampling

Image Source : TGM Research

<img src = 'https://tgmresearch.com/templates/yootheme/cache/1b/definition-1b838754.jpeg' width="900" height="750">

### Definition

The population is divided into subgroups (strata) based on shared characteristics, and random samples are taken from each stratum.

### Key Features

- Subgroups (strata) are predefined based on certain criteria.
- Ensures representation of all subgroups.

### Pros

- Provides more accurate results when subgroups vary significantly.
- Ensures representation of minority groups.

### Cons

- Requires prior knowledge of population strata.
- More complex and time-consuming than random sampling.

### Use Case

- Analyzing voting behavior by gender, age group, or income level.

In [10]:
# Stratified Sampling
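def stratified_sampling(data, stratify_col, sample_size_per_group):
    # Draw the same number of rows from each stratum; iterating over the groups
    # keeps the stratification column in the result regardless of pandas version.
    samples = [group.sample(n=sample_size_per_group, random_state=42)
               for _, group in data.groupby(stratify_col)]
    return pd.concat(samples, ignore_index=True)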

In [11]:
stratified_sample = stratified_sampling(data, 'Region', 3)
stratified_sample

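| | ID | Age | Gender | Income | Region | Feedback |
|---|---|---|---|---|---|---|
| 0 | 72 | 30 | Female | 91091 | East | Negative |
| 1 | 48 | 41 | Male | 65644 | East | Negative |
| 2 | 3 | 52 | Male | 91845 | East | Neutral |
| 3 | 27 | 62 | Male | 62896 | North | Positive |
| 4 | 58 | 32 | Male | 60156 | North | Neutral |
| 5 | 1 | 27 | Female | 47472 | North | Negative |
| 6 | 75 | 44 | Female | 44018 | South | Positive |
| 7 | 44 | 56 | Female | 89425 | South | Negative |
| 8 | 15 | 55 | Female | 31920 | South | Negative |
| 9 | 30 | 49 | Male | 58992 | West | Negative |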
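
The function above takes a fixed number of rows per stratum. When strata differ in size, a common alternative is proportional allocation, where the same fraction is drawn from every stratum so the sample mirrors the population's composition. The sketch below assumes a hypothetical, deliberately unbalanced `demo` frame and uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population with unequal region sizes
demo = pd.DataFrame({
    'ID': range(1, 201),
    'Region': rng.choice(['North', 'South', 'East', 'West'], 200,
                         p=[0.4, 0.3, 0.2, 0.1]),
})

def proportional_stratified_sampling(data, stratify_col, frac, seed=42):
    """Sample the same fraction of rows from every stratum."""
    return data.groupby(stratify_col).sample(frac=frac, random_state=seed)

sample = proportional_stratified_sampling(demo, 'Region', frac=0.2)

# Region shares in the population and in the sample should be close
print(demo['Region'].value_counts(normalize=True).round(2).to_dict())
print(sample['Region'].value_counts(normalize=True).round(2).to_dict())
```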


## 4. Cluster Sampling

Image Source : TGM Research

<img src = 'https://tgmresearch.com/templates/yootheme/cache/45/5-45fcd737.jpeg' width="900" height="750">

### Definition

The population is divided into clusters, and a random sample of clusters is selected. All individuals in the selected clusters are included in the sample.

### Key Features

- Population is naturally divided into clusters.
- Only some clusters are chosen for sampling.

### Pros

- Cost-effective for geographically dispersed populations.
- Reduces time and effort compared to other methods.

### Cons

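- Less accurate when clusters differ substantially from one another, because a few selected clusters may then fail to reflect the whole population.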
- Results depend on the representativeness of the selected clusters.

### Use Case

- Conducting household surveys by selecting specific neighborhoods.

In [12]:
# Cluster Sampling
def cluster_sampling(data, cluster_col, num_clusters):
    clusters = data[cluster_col].unique()
    selected_clusters = np.random.choice(clusters, num_clusters, replace=False)
    return data[data[cluster_col].isin(selected_clusters)]

In [13]:
cluster_sample = cluster_sampling(data, 'Region', 2)
cluster_sample

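| | ID | Age | Gender | Income | Region | Feedback |
|---|---|---|---|---|---|---|
| 2 | 3 | 52 | Male | 91845 | East | Neutral |
| 3 | 4 | 48 | Male | 89735 | East | Neutral |
| 4 | 5 | 59 | Female | 65443 | West | Positive |
| 5 | 6 | 35 | Male | 61110 | East | Neutral |
| 8 | 9 | 64 | Male | 42804 | East | Neutral |
| 11 | 12 | 54 | Female | 73550 | West | Neutral |
| 12 | 13 | 37 | Female | 46069 | East | Positive |
| 13 | 14 | 24 | Female | 97435 | West | Negative |
| 15 | 16 | 21 | Female | 48447 | West | Positive |
| 16 | 17 | 43 | Female | 38381 | West | Positive |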
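
Because the function above draws clusters from NumPy's global random state, it can select different regions on each run. A seeded variant makes the choice repeatable. The sketch below is one way to do this; the function name, the `seed` parameter, and the tiny `demo` frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def cluster_sampling_seeded(data, cluster_col, num_clusters, seed=None):
    """Pick whole clusters at random and keep every row that belongs to them."""
    rng = np.random.default_rng(seed)
    clusters = data[cluster_col].unique()
    selected = rng.choice(clusters, size=num_clusters, replace=False)
    return data[data[cluster_col].isin(selected)]

# Hypothetical population: 12 people spread evenly over 4 regions
demo = pd.DataFrame({
    'ID': range(1, 13),
    'Region': ['North', 'South', 'East', 'West'] * 3,
})

first = cluster_sampling_seeded(demo, 'Region', 2, seed=0)
second = cluster_sampling_seeded(demo, 'Region', 2, seed=0)
print(sorted(first['Region'].unique()), len(first))
print(first.equals(second))  # same seed, same clusters
```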


# Non-Probability Sampling

In a non-probability sample, individuals are selected based on non-random criteria, and not every individual has a chance of being included.

This type of sample is easier and cheaper to access, but it has a higher risk of sampling bias. That means the inferences you can make about the population are weaker than with probability samples, and your conclusions may be more limited. If you use a non-probability sample, you should still aim to make it as representative of the population as possible.

Non-probability sampling techniques are often used in exploratory and qualitative research. In these types of research, the aim is not to test a hypothesis about a broad population, but to develop an initial understanding of a small or under-researched population.




## 1. Convenience Sampling

Image Source : TGM Research

<img src = 'https://tgmresearch.com/templates/yootheme/cache/34/convenience-sampling-definition-3469d2b6.jpeg'  width="900" height="750">

### Definition

Convenience sampling involves selecting samples that are easiest to access.

### Key Features

- Non-random and based on accessibility.
- Quick and simple to implement.

### Pros

- Requires minimal effort and resources.
- Useful for exploratory research.

### Cons

- Highly prone to selection bias.
- Results are not generalizable to the entire population.

### Use Case

- Testing a prototype with employees in the same organization.

In [14]:
# Convenience Sampling
def convenience_sampling(data, sample_size):
    return data.head(sample_size)

In [15]:
convenience_sample = convenience_sampling(data, 5)
convenience_sample

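| | ID | Age | Gender | Income | Region | Feedback |
|---|---|---|---|---|---|---|
| 0 | 1 | 27 | Female | 47472 | North | Negative |
| 1 | 2 | 36 | Male | 84116 | North | Negative |
| 2 | 3 | 52 | Male | 91845 | East | Neutral |
| 3 | 4 | 48 | Male | 89735 | East | Neutral |
| 4 | 5 | 59 | Female | 65443 | West | Positive |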
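
Taking the first rows is harmless when the data is in random order, but it becomes biased as soon as the order carries information. The sketch below assumes a hypothetical customer list sorted by income, highest first, and compares a convenience sample with a simple random sample of the same size.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical customer list that happens to be sorted by income, highest first
incomes = np.sort(rng.integers(30_000, 100_000, 1_000))[::-1]
demo = pd.DataFrame({'Income': incomes})

population_mean = demo['Income'].mean()
convenience_mean = demo.head(50)['Income'].mean()                 # first 50 rows
random_mean = demo.sample(n=50, random_state=0)['Income'].mean()  # simple random sample

print(f"Population:  {population_mean:,.0f}")
print(f"Random:      {random_mean:,.0f}")
print(f"Convenience: {convenience_mean:,.0f}")
```

Here the convenience sample contains only the highest earners, so its mean overstates the population mean by a wide margin, while the random sample lands much closer.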


## 2. Snowball Sampling

Image Source : TGM Research

<img src = 'https://tgmresearch.com/templates/yootheme/cache/dc/snowball-sampling-method-dc052d91.jpeg'  width="900" height="750">

### Definition

Used for hard-to-reach populations, where existing participants recruit future participants.

### Key Features

- Relies on participant referrals.
- Common in social sciences and qualitative research.

### Pros

- Effective for reaching hidden or niche populations.
- Requires fewer initial resources to start.

### Cons

- May introduce bias if participants recruit similar individuals.
- Difficult to determine the sample's representativeness.

### Use Case
- Researching undocumented workers or marginalized communities.

In [16]:
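# Snowball Sampling
def snowball_sampling(data, seed_ids, referrals):
    selected_ids = set(seed_ids)
    for _ in range(referrals):
        # Each wave refers up to 3 people who are not in the sample yet
        candidates = data.loc[~data['ID'].isin(selected_ids), 'ID']
        if candidates.empty:
            break
        referred_ids = candidates.sample(n=min(3, len(candidates))).tolist()
        selected_ids.update(referred_ids)
    return data[data['ID'].isin(selected_ids)]

In [17]:
# Assume IDs 1, 2, 3 are the seeds and they refer more participants
snowball_sample = snowball_sampling(data, seed_ids=[1, 2, 3], referrals=2)
snowball_sample

| | ID | Age | Gender | Income | Region | Feedback |
|---|---|---|---|---|---|---|
| 0 | 1 | 27 | Female | 47472 | North | Negative |
| 1 | 2 | 36 | Male | 84116 | North | Negative |
| 2 | 3 | 52 | Male | 91845 | East | Neutral |

The three seed rows are always part of the result. The referred participants are drawn at random without a fixed seed, so the six additional rows differ from run to run and are not shown here.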
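
In a real study, referrals follow actual relationships rather than random draws. The sketch below models that with an explicit referral network and follows it outward from the seeds for a fixed number of waves. The `network` dictionary, the function name, and the `demo` frame are hypothetical and only serve the illustration.

```python
import pandas as pd

def snowball_from_network(data, seed_ids, network, waves):
    """Follow referral links outward from the seeds for a fixed number of waves."""
    selected = set(seed_ids)
    frontier = set(seed_ids)
    for _ in range(waves):
        # Everyone referred by the latest recruits, minus people already sampled
        referred = {r for pid in frontier for r in network.get(pid, [])} - selected
        if not referred:
            break
        selected |= referred
        frontier = referred
    return data[data['ID'].isin(selected)]

demo = pd.DataFrame({'ID': range(1, 11)})

# Hypothetical referral network: participant ID -> IDs of the people they refer
network = {1: [4, 5], 2: [5], 4: [8], 5: [1, 9]}

sample = snowball_from_network(demo, seed_ids=[1, 2], network=network, waves=2)
print(sample['ID'].tolist())  # [1, 2, 4, 5, 8, 9]
```

Participants who are not connected to the seeds (IDs 3, 6, 7, and 10 here) can never enter the sample, which is the source of the bias listed under Cons.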


## 3. Voluntary Response Sampling


### Definition

Voluntary response sampling involves participants self-selecting to be part of the sample. Individuals voluntarily participate, often in response to an open invitation.

### Key Features

- Relies on participants' willingness to participate.
- Typically involves surveys, polls, or questionnaires distributed to a broad audience.
- Participants are not randomly selected.

### Pros

- Cost-effective and easy to implement.
- Provides insights from highly engaged participants.
- Useful for gathering opinions on public platforms or controversial topics.

### Cons

- Prone to selection bias, as only motivated individuals respond.
- Results may not be representative of the entire population.
- Often overrepresents individuals with strong opinions (positive or negative).

### Use Cases

- Online polls or surveys asking for public opinion.
- Feedback collection from customers via a voluntary feedback form.
- Gauging interest in social or political issues.


In [18]:
# Voluntary Response Sampling
def voluntary_response_sampling(data, voluntary_ids):
    return data[data['ID'].isin(voluntary_ids)]

In [19]:
voluntary_sample = voluntary_response_sampling(data, [10, 20, 30])
voluntary_sample

| | ID | Age | Gender | Income | Region | Feedback |
|---|---|---|---|---|---|---|
| 9 | 10 | 36 | Male | 65327 | North | Positive |
| 19 | 20 | 39 | Male | 64018 | West | Negative |
| 29 | 30 | 49 | Male | 58992 | West | Negative |
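
The function above takes the volunteer IDs as given. To see why self-selection overrepresents strong opinions, the sketch below simulates who chooses to respond. The response rates are invented for illustration, as are the `demo` frame and the opinion shares.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population in which half of the people hold a neutral opinion
demo = pd.DataFrame({
    'ID': range(1, 5001),
    'Feedback': rng.choice(['Positive', 'Neutral', 'Negative'], 5000,
                           p=[0.3, 0.5, 0.2]),
})

# Assumed response rates: people with strong opinions reply far more often
response_rate = {'Positive': 0.30, 'Neutral': 0.05, 'Negative': 0.40}
p_respond = demo['Feedback'].map(response_rate).to_numpy()

# Each person decides independently whether to respond
responded = rng.random(len(demo)) < p_respond
volunteers = demo[responded]

print(demo['Feedback'].value_counts(normalize=True).round(2).to_dict())
print(volunteers['Feedback'].value_counts(normalize=True).round(2).to_dict())
```

Under these assumptions, neutral respondents make up about half of the population but only a small share of the volunteers.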


## 4. Quota Sampling

Image Source : TGM Research

<img src="https://tgmresearch.com/templates/yootheme/cache/2c/quota-sampling-method-2c0bd210.jpeg" width="900" height="750">


### Definition

Quota sampling involves dividing the population into subgroups (quotas) based on specific characteristics and selecting samples from each subgroup to meet predefined quotas.

### Key Features

- Non-random sampling method.
- Requires quotas to be set based on characteristics such as age, gender, income, etc.
- The researcher ensures that the sample matches the quotas.

### Pros

- Ensures representation of key subgroups.
- Less expensive and time-consuming compared to probability sampling methods.
- Useful when population demographics are known.

### Cons

- Results can be biased if the selection within quotas is not random.
- May overlook individuals outside the quotas.
- Sampling error and confidence intervals cannot be estimated reliably, because the selection probabilities are unknown.

### Use Cases

- Market research studies to represent demographics (e.g., 50% male, 50% female).
- Surveys in industries to understand customer preferences across age groups.
- Political polling to include a balanced representation of regions.

In [20]:
# Quota Sampling
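def quota_sampling(data, quotas):
    # quotas maps a column name to the number of rows to take from each of its groups.
    # Rows are drawn at random within each group for simplicity; real quota samples are
    # usually filled non-randomly. Quotas on different columns are filled independently,
    # so the same respondent can appear more than once.
    parts = []
    for column, limit in quotas.items():
        groups = [group.sample(n=limit, random_state=42) for _, group in data.groupby(column)]
        parts.append(pd.concat(groups, ignore_index=True))
    return pd.concat(parts)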

In [21]:
quota_sample = quota_sampling(data, {'Gender': 3, 'Region': 2})
quota_sample

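| | ID | Age | Gender | Income | Region | Feedback |
|---|---|---|---|---|---|---|
| 0 | 84 | 44 | Female | 48493 | North | Negative |
| 1 | 57 | 62 | Female | 81938 | East | Positive |
| 2 | 59 | 23 | Female | 77219 | West | Negative |
| 3 | 58 | 32 | Male | 60156 | North | Neutral |
| 4 | 9 | 64 | Male | 42804 | East | Neutral |
| 5 | 60 | 34 | Male | 75735 | North | Negative |
| 0 | 72 | 30 | Female | 91091 | East | Negative |
| 1 | 48 | 41 | Male | 65644 | East | Negative |
| 2 | 27 | 62 | Male | 62896 | North | Positive |
| 3 | 58 | 32 | Male | 60156 | North | Neutral |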
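
In practice, quotas are often filled non-randomly, for example by accepting respondents in the order they arrive until each group is full. The sketch below shows that variant for a single column; the function name, the per-group quota dictionary, and the `demo` frame are illustrative assumptions.

```python
import pandas as pd

def quota_sampling_first_come(data, column, quotas):
    """Accept respondents in arrival order until each group's quota is filled."""
    counts = {group: 0 for group in quotas}
    accepted = []
    for idx, group in data[column].items():
        if group in counts and counts[group] < quotas[group]:
            accepted.append(idx)
            counts[group] += 1
        if counts == quotas:
            break  # every quota is full
    return data.loc[accepted]

# Hypothetical respondents listed in arrival order
demo = pd.DataFrame({
    'ID': range(1, 11),
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Male',
               'Female', 'Male', 'Female', 'Female', 'Male'],
})

sample = quota_sampling_first_come(demo, 'Gender', {'Male': 2, 'Female': 3})
print(sample['ID'].tolist())  # [1, 2, 3, 6, 8]
```

Because early arrivals fill the quotas, later respondents are never considered, which is one reason sampling error cannot be quantified for this method.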


## 5. Purposive Sampling

Image Source : TGM Research

<img src = 'https://tgmresearch.com/templates/yootheme/cache/1c/purposive-sampling-method-1ccf764b.jpeg' width="900" height="750">

### Definition

Purposive sampling (also called judgmental or selective sampling) involves selecting individuals or units based on specific criteria or purpose defined by the researcher.

### Key Features

- The sample is handpicked based on the research objectives.
- Selection is based on specific traits, knowledge, or experiences relevant to the study.
- Often used in qualitative research.

### Pros

- Allows researchers to focus on specific, relevant individuals or cases.
- Useful for studying niche or specialized groups.
- Helps gather in-depth, detailed data.

### Cons
- Highly prone to researcher bias.
- Results are not generalizable to the broader population.
- May not provide a comprehensive view of the population.

### Use Cases

- Studying rare diseases by selecting patients with the condition.
- Researching expert opinions on a specific topic (e.g., climate change).
- Investigating behaviors of a particular demographic (e.g., high-income earners).


In [22]:
# Purposive Sampling
def purposive_sampling(data, condition):
    return data.query(condition)

In [23]:
purposive_sample = purposive_sampling(data, "Income > 80000")
purposive_sample

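| | ID | Age | Gender | Income | Region | Feedback |
|---|---|---|---|---|---|---|
| 1 | 2 | 36 | Male | 84116 | North | Negative |
| 2 | 3 | 52 | Male | 91845 | East | Neutral |
| 3 | 4 | 48 | Male | 89735 | East | Neutral |
| 10 | 11 | 41 | Male | 95213 | North | Neutral |
| 13 | 14 | 24 | Female | 97435 | West | Negative |
| 20 | 21 | 36 | Male | 80730 | South | Positive |
| 27 | 28 | 60 | Female | 82929 | South | Neutral |
| 31 | 32 | 31 | Female | 99468 | North | Neutral |
| 33 | 34 | 55 | Male | 81154 | North | Negative |
| 35 | 36 | 63 | Male | 92520 | South | Neutral |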
