# **Probability Sampling**
In probability sampling, every element in the population has a known, non-zero chance of being selected. This method is the foundation of inferential statistics, as it allows for the estimation of sampling error and the generalization of findings to the entire population.

In [1]:
import pandas as pd
import numpy as np

# Create a sample population DataFrame
data = {
    'ID': range(1, 101),
    'Age': np.random.randint(18, 65, size=100),
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=100, p=[0.4, 0.3, 0.2, 0.1]),
    'Value': np.random.randn(100) * 100
}
population_df = pd.DataFrame(data)

print("--- Sample Population ---")
print(population_df.head())
print(f"\nPopulation size: {len(population_df)}")
print("\nPopulation category distribution:")
print(population_df['Category'].value_counts())

--- Sample Population ---
   ID  Age Category       Value
0   1   50        C  -39.532689
1   2   37        B  114.108606
2   3   45        A  149.961259
3   4   35        B  105.532590
4   5   49        C -119.618822

Population size: 100

Population category distribution:
Category
A    41
C    25
B    22
D    12
Name: count, dtype: int64
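
The claim that probability sampling allows sampling error to be estimated can be illustrated with a minimal sketch. It assumes a hypothetical population of 1,000 normally distributed values (not the `population_df` above) and a simple random sample drawn without replacement; the names `population`, `fpc`, and `standard_error` are illustrative only.

```python
import numpy as np

# Hypothetical population of 1,000 values; the seed is fixed only for reproducibility.
rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=1000)

# Draw a simple random sample without replacement.
n = 100
sample = rng.choice(population, size=n, replace=False)

# Estimated standard error of the sample mean. The finite population
# correction matters here because the sample is 10% of the population.
N = len(population)
fpc = np.sqrt(1 - n / N)
standard_error = sample.std(ddof=1) / np.sqrt(n) * fpc

# Approximate 95% confidence interval for the population mean.
ci_low = sample.mean() - 1.96 * standard_error
ci_high = sample.mean() + 1.96 * standard_error
print(f"Sample mean: {sample.mean():.2f} (SE {standard_error:.2f})")
print(f"Approx. 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

The interval is only meaningful because the selection probabilities are known; a non-random sample gives no comparable basis for the calculation.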


### 1. `Simple Random Sampling`
This is the most basic form of probability sampling, where each individual in the population has an equal chance of being chosen. It is akin to a lottery system. For example, if you want to select 100 employees from a company of 1000, you could assign a number to each employee and then use a random number generator to pick 100 numbers.

In [3]:
# Set the sample size
sample_size = 10

# Perform simple random sampling
simple_random_sample = population_df.sample(n=sample_size, random_state=42)

print(f"--- Simple Random Sample (n={sample_size}) ---")
print(simple_random_sample)

--- Simple Random Sample (n=10) ---
    ID  Age Category       Value
83  84   53        A  212.888389
53  54   51        A   72.524583
70  71   44        B   91.393785
45  46   23        A   24.642528
44  45   34        B  -40.969078
39  40   64        A   20.937574
22  23   22        B   76.534670
80  81   51        A   19.316612
10  11   31        C   43.224254
0    1   50        C  -39.532689
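
The employee lottery described at the start of this section can be sketched directly with NumPy. This is an illustration under assumptions: the employee numbers 1 to 1000 and the variable names are hypothetical, and the seed is fixed only so the draw is repeatable.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed fixed only so the draw is repeatable

# Hypothetical employee numbers 1..1000.
employee_ids = np.arange(1, 1001)

# Draw 100 distinct IDs; replace=False means no employee is picked twice.
selected = rng.choice(employee_ids, size=100, replace=False)

# Under simple random sampling, each employee's inclusion probability is n / N.
inclusion_probability = len(selected) / len(employee_ids)
print(sorted(selected)[:10])
print(f"Inclusion probability: {inclusion_probability:.2f}")
```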


### 2. `Systematic Sampling`
In this technique, the first individual is selected randomly, and then subsequent individuals are selected at regular intervals (e.g., every 10th person). This method is more straightforward and less time-consuming than simple random sampling but can be biased if there is an underlying pattern in the population that aligns with the sampling interval.

In [5]:
# Set the step size (k)
k = 10
sample_size = len(population_df) // k

# Choose a random starting point
start_index = np.random.randint(0, k)

# Get systematic sample indices
systematic_indices = np.arange(start_index, len(population_df), step=k)

# Extract the sample
systematic_sample = population_df.iloc[systematic_indices]

print(f"--- Systematic Sample (every {k}th element) ---")
print(systematic_sample)

--- Systematic Sample (every 10th element) ---
    ID  Age Category       Value
4    5   49        C -119.618822
14  15   18        A   47.827824
24  25   48        A    4.407001
34  35   53        B   -4.896002
44  45   34        B  -40.969078
54  55   46        A  -65.519102
64  65   39        C  238.245337
74  75   45        A  219.857297
84  85   25        A  -63.019948
94  95   26        A  107.558366
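
The bias risk mentioned above, where a pattern in the population lines up with the sampling interval, can be made concrete with a small sketch. It assumes hypothetical daily sales data with a weekly cycle; `systematic_sample` is an illustrative helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily sales over 52 weeks: days 5 and 6 of each week (the weekend) are higher.
days = np.arange(364)
is_weekend = (days % 7) >= 5
sales = np.where(is_weekend, 200.0, 100.0) + rng.normal(0, 5, size=days.size)

def systematic_sample(values, k, start):
    """Return every k-th element of `values`, beginning at index `start`."""
    return values[start::k]

true_mean = sales.mean()
# k = 7 matches the weekly cycle: every sampled day falls on the same weekday.
aligned_mean = systematic_sample(sales, k=7, start=5).mean()
# k = 10 does not share the cycle, so the sample mixes weekdays and weekends.
unaligned_mean = systematic_sample(sales, k=10, start=5).mean()

print(f"True mean:          {true_mean:.1f}")
print(f"k=7 (aligned):      {aligned_mean:.1f}")
print(f"k=10 (not aligned): {unaligned_mean:.1f}")
```

With the interval equal to the cycle length, the sample contains only weekend days and overstates the mean; an interval that does not share the cycle stays close to it.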


### 3. `Stratified Sampling`
This method involves dividing the population into subgroups, or "strata," based on shared characteristics such as age, gender, or income level. A random sample is then drawn from each stratum. This guarantees that every stratum appears in the sample; when each stratum is sampled at the same rate (proportional allocation), the sample mirrors the population's composition on the stratifying variable. For instance, a political pollster might stratify voters by geographic region so that each region's share of the sample matches its share of the national electorate.

In [14]:
from sklearn.model_selection import train_test_split

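# Stratify by the 'Category' column and keep a 30% sample.
# train_test_split returns (train, test) in that order, so with test_size=0.70
# the first return value holds 30% of the rows and the second holds 70%.
# Note: the variable names below do not follow sklearn's train/test order.
stratified_test_sample, stratified_train_sample = train_test_split(
    population_df,
    test_size=0.70,  # 70% goes to the second return value; 30% stays in the first
    stratify=population_df['Category'],
    random_state=42
)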

print("--- Stratified Sample (30% of population) ---")
print(stratified_test_sample)

--- Stratified Sample (30% of population) ---
     ID  Age Category       Value
67   68   23        B   86.686896
75   76   42        A   84.080934
77   78   53        A  -61.063433
36   37   37        D   31.366740
63   64   30        B   49.350007
34   35   53        B   -4.896002
61   62   44        C  -51.196622
94   95   26        A  107.558366
18   19   19        D  233.733627
43   44   33        C   18.895459
56   57   57        C    4.049253
44   45   34        B  -40.969078
68   69   20        B    8.842671
10   11   31        C   43.224254
64   65   39        C  238.245337
39   40   64        A   20.937574
35   36   54        A  -30.269111
40   41   27        D  -98.258153
0     1   50        C  -39.532689
99  100   33        C  112.590136
80   81   51        A   19.316612
53   54   51        A   72.524583
20   21   40        A   90.519168
14   15   18        A   47.827824
74   75   45        A  219.857297
12   13   30        D   60.125635
24   25   48        A    4.407001
79

In [17]:
print("--- Stratified Sample (70% of population) ---")
print(stratified_train_sample)

--- Stratified Sample (70% of population) ---
    ID  Age Category       Value
17  18   34        C  -62.576204
27  28   61        C   -0.086912
37  38   23        C   28.605998
97  98   21        C -130.258137
98  99   48        A   20.874755
..  ..  ...      ...         ...
21  22   38        A   42.453540
81  82   29        A -196.810130
46  47   47        A  -85.180248
47  48   46        A   88.915672
45  46   23        A   24.642528

[70 rows x 4 columns]


In [None]:
print("\nOriginal population category distribution:")
print(population_df['Category'].value_counts(normalize=True))

print("\nStratified 70% split (stratified_train_sample) category distribution:")
print(stratified_train_sample['Category'].value_counts(normalize=True))

print("\nStratified 30% sample (stratified_test_sample) category distribution:")
print(stratified_test_sample['Category'].value_counts(normalize=True))


Original population category distribution:
Category
A    0.41
C    0.25
B    0.22
D    0.12
Name: proportion, dtype: float64

Stratified 70% split (stratified_train_sample) category distribution:
Category
A    0.414286
C    0.257143
B    0.214286
D    0.114286
Name: proportion, dtype: float64

Stratified 30% sample (stratified_test_sample) category distribution:
Category
A    0.400000
B    0.233333
C    0.233333
D    0.133333
Name: proportion, dtype: float64
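
A proportional stratified draw can also be sketched in pandas alone, without scikit-learn. The example assumes a hypothetical stand-in DataFrame `df` with the same category weights as the population above, and relies on `DataFrameGroupBy.sample`, which is available in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for the population, with the same category weights.
df = pd.DataFrame({
    "ID": np.arange(1, 101),
    "Category": rng.choice(["A", "B", "C", "D"], size=100, p=[0.4, 0.3, 0.2, 0.1]),
})

# Proportional allocation: draw 30% of the rows within each category.
stratified = df.groupby("Category").sample(frac=0.30, random_state=42)

# Compare category shares in the population and in the sample.
comparison = pd.DataFrame({
    "population": df["Category"].value_counts(normalize=True),
    "sample": stratified["Category"].value_counts(normalize=True),
}).round(3)
print(comparison)
```

Because the fraction is applied per group, small strata are rounded to whole rows, so the sample shares match the population shares only approximately.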


### 4. `Cluster Sampling`
Here, the population is divided into clusters, typically based on geographical location. A random sample of these clusters is then selected, and all individuals within the chosen clusters are included in the sample. This is particularly useful when the population is large and geographically dispersed, making other methods impractical. For example, a researcher studying the academic performance of high school students in a state might randomly select a few school districts (clusters) and then survey all the students within those districts.

In [18]:
# Add a 'City' column to act as our cluster
population_df['City'] = np.random.choice(['Pune', 'Mumbai', 'Nagpur', 'Nashik'], size=100)

# Identify unique clusters
all_clusters = population_df['City'].unique()

# Randomly select a number of clusters to sample
num_clusters_to_sample = 2
chosen_clusters = np.random.choice(all_clusters, size=num_clusters_to_sample, replace=False)

# Select all rows belonging to the chosen clusters
cluster_sample = population_df[population_df['City'].isin(chosen_clusters)]

print(f"--- Cluster Sample (from clusters: {chosen_clusters}) ---")
print(cluster_sample)
print(f"\nTotal sample size: {len(cluster_sample)}")

--- Cluster Sample (from clusters: ['Nashik' 'Nagpur']) ---
    ID  Age Category       Value    City
1    2   37        B  114.108606  Nagpur
3    4   35        B  105.532590  Nashik
4    5   49        C -119.618822  Nashik
5    6   29        A   93.883822  Nagpur
6    7   27        A   42.859535  Nagpur
9   10   49        B -158.172535  Nagpur
10  11   31        C   43.224254  Nashik
12  13   30        D   60.125635  Nashik
15  16   36        C  -50.120332  Nagpur
18  19   19        D  233.733627  Nagpur
23  24   46        C  -85.036123  Nashik
24  25   48        A    4.407001  Nashik
25  26   42        A   91.858444  Nashik
31  32   24        B  -79.617470  Nagpur
33  34   48        C  -34.086026  Nagpur
36  37   37        D   31.366740  Nashik
37  38   23        C   28.605998  Nashik
38  39   27        C   86.819331  Nagpur
39  40   64        A   20.937574  Nagpur
40  41   27        D  -98.258153  Nagpur
41  42   58        A  -35.389775  Nagpur
42  43   29        B  -38.369644  Nash

---

# **Non-Probability Sampling**
Non-probability sampling methods do not give every individual a known, non-zero chance of being selected. The selection is often based on the researcher's judgment, convenience, or other non-random criteria. While these methods are usually cheaper and quicker, they carry a higher risk of sampling bias, sampling error cannot be estimated from the design, and the findings may not generalize to the broader population.

### 1. `Convenience Sampling`
As the name suggests, this technique involves selecting individuals who are easiest to reach. Examples include conducting a survey at a mall or interviewing people on a busy street. While simple and quick, it is highly susceptible to bias.

In [19]:
# Convenience sampling is taking what's easiest.
# In a dataframe, this is often the first N rows.
sample_size = 15
convenience_sample = population_df.head(sample_size)

print("--- Convenience Sample (first 15 rows) ---")
print(convenience_sample)

--- Convenience Sample (first 15 rows) ---
    ID  Age Category       Value    City
0    1   50        C  -39.532689  Mumbai
1    2   37        B  114.108606  Nagpur
2    3   45        A  149.961259    Pune
3    4   35        B  105.532590  Nashik
4    5   49        C -119.618822  Nashik
5    6   29        A   93.883822  Nagpur
6    7   27        A   42.859535  Nagpur
7    8   45        B  166.882345  Mumbai
8    9   43        D   -5.536272    Pune
9   10   49        B -158.172535  Nagpur
10  11   31        C   43.224254  Nashik
11  12   19        B  -11.367709  Mumbai
12  13   30        D   60.125635  Nashik
13  14   19        A  -23.898549  Mumbai
14  15   18        A   47.827824  Mumbai


### 2. `Purposive or Judgmental Sampling`
In this method, the researcher uses their expertise to select individuals who are most relevant to the study's objectives. This is often used in qualitative research where the focus is on gaining in-depth knowledge from a specific group. For example, a study on the experiences of startup founders would specifically seek out and select individuals who have started their own companies.

In [20]:
# The researcher deliberately restricts the study to 'Category A' (the judgment call),
# then picks 10 of those rows; random_state only makes the pick repeatable.
purposive_sample = population_df[population_df['Category'] == 'A'].sample(n=10, random_state=42)

print("--- Purposive Sample (10 samples from Category A) ---")
print(purposive_sample)

--- Purposive Sample (10 samples from Category A) ---
    ID  Age Category       Value    City
71  72   56        A   -4.524437  Mumbai
39  40   64        A   20.937574  Nagpur
24  25   48        A    4.407001  Nashik
74  75   45        A  219.857297  Nagpur
14  15   18        A   47.827824  Mumbai
98  99   48        A   20.874755  Nagpur
53  54   51        A   72.524583    Pune
94  95   26        A  107.558366    Pune
79  80   22        A -227.215856  Mumbai
20  21   40        A   90.519168    Pune


### 3. `Quota Sampling`
Similar to stratified sampling, quota sampling involves dividing the population into subgroups. However, the selection from each subgroup is non-random and is done until a predefined number (quota) for each group is met. For example, a researcher might aim to interview 50 men and 50 women, selecting them based on convenience until the quotas are filled.
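
A minimal sketch of the quota idea, assuming a hypothetical DataFrame `respondents` whose row order represents the order in which people happen to be encountered. The `Gender` column and the quota values mirror the 50/50 example above and are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stream of respondents in the order they are encountered.
respondents = pd.DataFrame({
    "ID": np.arange(1, 201),
    "Gender": rng.choice(["M", "F"], size=200),
})

# Hypothetical quotas: stop accepting a group once its quota is met.
quotas = {"M": 50, "F": 50}

# cumcount() numbers each respondent within their group in arrival order,
# so keeping rows numbered below the quota takes the first ones encountered.
arrival_rank = respondents.groupby("Gender").cumcount()
quota_sample = respondents[arrival_rank < respondents["Gender"].map(quotas)]

print(quota_sample["Gender"].value_counts())
```

Unlike the stratified draw, nothing here is random within a group: whoever arrives first fills the quota, which is where the bias enters.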

### 4. `Snowball Sampling`
This technique is used when the target population is hard to find or access. The researcher starts with a few known individuals and then asks them to refer others who meet the study's criteria. This is common in studies of hidden populations, such as homeless individuals or people with rare diseases.
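
The referral process can be sketched as a breadth-first walk over a referral network. Everything in this example is hypothetical: the `referrals` dictionary, the participant labels, and the `snowball_sample` helper are illustrative, and real studies would collect referrals interactively rather than from a known graph.

```python
from collections import deque

# Hypothetical referral network: each person lists the people they can refer.
referrals = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P4", "P5"],
    "P4": ["P6"],
    "P5": [],
    "P6": ["P1"],
    "P7": ["P8"],  # never referred by anyone reachable from the seed
    "P8": [],
}

def snowball_sample(network, seeds, max_size):
    """Collect participants wave by wave, following referrals from the seeds."""
    sampled = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sampled) < max_size:
        person = queue.popleft()
        sampled.append(person)
        for referred in network.get(person, []):
            if referred not in seen:  # do not recruit the same person twice
                seen.add(referred)
                queue.append(referred)
    return sampled

sample = snowball_sample(referrals, seeds=["P1"], max_size=5)
print(sample)  # ['P1', 'P2', 'P3', 'P4', 'P5']
```

People outside the seeds' social reach (here `P7` and `P8`) can never enter the sample, which illustrates why snowball samples may not represent the whole hidden population.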