#### What is Multistage Sampling in Data Analysis?
Multistage Sampling is a complex probability sampling technique where sampling is conducted in multiple stages. Instead of sampling directly from the entire population, the population is divided into hierarchical stages (e.g., regions, schools, individuals). Each stage involves random selection, narrowing the population step by step.

**For example:**

1. Randomly select regions (stage 1).
2. Within the selected regions, randomly select schools (stage 2).
3. Within the selected schools, randomly select students (stage 3).<br>
This method is highly efficient for large and dispersed populations and is often used in surveys like national censuses.



#### Advantages of Multistage Sampling
1. **Cost-Effective:**
Reduces costs by narrowing the sampling frame in stages, instead of sampling the entire population directly.

2. **Flexible:**
Allows researchers to use different sampling techniques at different stages (e.g., cluster sampling in stage 1, simple random sampling in stage 2).

3. **Practical for Large Populations:**
Particularly useful for geographically dispersed populations where accessing the entire population is challenging.

4. **Scalable:**
Easily adaptable to studies of varying sizes and complexity.


#### Disadvantages of Multistage Sampling
1. **Higher Sampling Error:**
Each stage introduces variability, leading to a higher cumulative sampling error compared to simpler methods like stratified sampling.

2. **Potential Bias:**
Improper sampling at any stage can propagate bias through the entire process.

3. **Complexity:**
Designing and implementing multistage sampling requires more effort and expertise.

#### When to Use Multistage Sampling
1. **Large-Scale Surveys:**
National or regional studies with limited resources, such as censuses or educational studies.

2. **Geographically Dispersed Populations:**
When the population spans large areas, making direct sampling impractical.

3. **Hierarchical Structures:**
When the population naturally forms groups (e.g., states, districts, schools, households).

#### Python Code Example

##### Scenario:
You have a dataset of students categorized by region, school, and class. You want to sample:<br>
Random regions (stage 1).<br>
Random schools within the selected regions (stage 2).<br>
Random students within the selected schools (stage 3).<br>

In [95]:
from numpy import random as rnd
import pandas as pd

In [129]:

population = pd.DataFrame({
    "StudentID" : range(1, 1001),
    "Region" : rnd.choice(["West", "East", "North", "South"], size=1000),
    "Schools" : rnd.choice([f"school_{i}"  for i in range(1,15)], size=1000),
    "Classroom" : rnd.choice(["class A", "class B", "class C", "class D"], size=1000),
    "MathScore" : rnd.randint(50, 100, size=1000) 
})
population.head(10)

Unnamed: 0,StudentID,Region,Schools,Classroom,MathScore
0,1,West,school_13,class B,59
1,2,North,school_12,class A,74
2,3,East,school_13,class C,67
3,4,North,school_13,class A,68
4,5,North,school_14,class B,85
5,6,North,school_3,class D,59
6,7,North,school_13,class B,82
7,8,South,school_6,class B,99
8,9,West,school_2,class D,54
9,10,South,school_3,class B,96


In [131]:
# Randomly Selected Region
rnd.seed(42)
selected_regions = rnd.choice(population["Region"].unique(), size=2)
selected_regions

array(['East', 'South'], dtype=object)

In [133]:
# Filter on population with selected regions and choose all of them
filter_by_selected_regions = population[population["Region"].isin(selected_regions)]
filter_by_selected_regions.head()

Unnamed: 0,StudentID,Region,Schools,Classroom,MathScore
2,3,East,school_13,class C,67
7,8,South,school_6,class B,99
9,10,South,school_3,class B,96
10,11,South,school_5,class B,98
11,12,South,school_9,class A,84


In [153]:
# pick up 4 unique schools
schools = filter_by_selected_regions["Schools"].unique()
selected_schools = rnd.choice(schools, size=4, replace=False )
selected_schools

array(['school_3', 'school_13', 'school_14', 'school_9'], dtype=object)

In [164]:
# From filter_by_selected_regions, select all students who are on selected_schools
filter_students_by_schools = filter_by_selected_regions[filter_by_selected_regions["Schools"].isin(selected_schools)]
filter_students_by_schools["Schools"].value_counts()

Schools
school_3     43
school_14    35
school_13    32
school_9     30
Name: count, dtype: int64

In [178]:
# pick up 50 sample student from filter_students_by_schools = ['school_13', 'school_3', 'school_14', 'school_9']
sample_students = filter_students_by_schools.sample(n=50, random_state=42)
sample_students.head(10)

Unnamed: 0,StudentID,Region,Schools,Classroom,MathScore
761,762,South,school_13,class A,59
470,471,South,school_13,class B,60
230,231,East,school_3,class C,69
865,866,East,school_3,class A,68
288,289,South,school_14,class A,76
66,67,East,school_13,class B,53
602,603,South,school_9,class D,85
476,477,East,school_13,class B,77
746,747,East,school_9,class C,73
771,772,South,school_9,class B,97


In [180]:
sample_students["Region"].unique()

array(['South', 'East'], dtype=object)

In [182]:
sample_students["Schools"].unique()

array(['school_13', 'school_3', 'school_14', 'school_9'], dtype=object)