## What is Cluster Sampling in Data Analysis?
Cluster sampling is a probability sampling technique in which the population is divided into groups (clusters) and a random subset of those clusters is selected. In the one-stage form described here, every individual in the selected clusters is included in the sample. This approach is useful when the population is large or geographically dispersed, or when a complete list of individuals is hard to obtain.

#### Steps in Cluster Sampling
1. **Divide the Population into Clusters:**
The population is grouped into smaller, naturally occurring clusters, such as cities, schools, or districts.
2. **Randomly Select Clusters:**
A predetermined number of clusters are randomly selected.
3. **Survey Everyone in the Selected Clusters:**
Instead of sampling individuals, all members of the selected clusters are included in the sample.


#### Advantages of Cluster Sampling
1. **Cost-Effective:**
Reduces costs and effort since data collection is confined to specific clusters.
2. **Feasibility:**
Easier to implement for geographically dispersed populations where a full list of individuals is unavailable.

3. **Efficient Data Collection:**
Surveying entire clusters requires less logistical planning than sampling individuals randomly across the entire population.
4. **Scalability:**
Useful for large populations where simple random sampling is impractical.

#### Disadvantages of Cluster Sampling
1. **Risk of Bias:**
If clusters are not representative of the entire population, the results may be biased.

2. **Increased Sampling Error:**
Members of the same cluster tend to resemble one another, so a cluster sample usually captures less of the population's variability than a simple random sample of the same size. Estimates are therefore less precise.

3. **Dependent Observations:**
Individuals within the same cluster may share common traits, reducing the independence of observations.
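
The loss of precision from similar, dependent observations is often summarized by the design effect. The sketch below is a rough illustration, assuming equal-sized clusters and a known intraclass correlation (ICC). It uses Kish's approximation `deff = 1 + (m - 1) * icc`, where `m` is the cluster size. The function names and the ICC value of 0.1 are illustrative assumptions, not values taken from this article's dataset.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Kish's approximate design effect for equal-sized clusters."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_total: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is roughly worth."""
    return n_total / design_effect(cluster_size, icc)


# Hypothetical design: 3 clusters of 20 students each, assumed ICC of 0.1
deff = design_effect(cluster_size=20, icc=0.1)
n_eff = effective_sample_size(n_total=60, cluster_size=20, icc=0.1)
print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.1f}")
```

Under these assumed numbers, the 60 sampled students carry about as much information as 21 independent observations. When the ICC is zero, the design effect is 1 and clustering costs no precision.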

#### When to Use Cluster Sampling

1. **Geographically Dispersed Populations:**
When the population is spread over a wide area and surveying individuals randomly is logistically challenging.

2. **Cost and Time Constraints:**
When resources are limited, and sampling individuals across the entire population is not feasible.

3. **Heterogeneous Clusters:**
When each cluster is internally diverse and resembles the population as a whole, so that the clusters are similar to one another.

#### Python Code Example

##### Scenario:
You have a dataset of students from different schools, and you want to select a sample by randomly choosing schools (clusters) and including all students from those schools.

```python
import numpy as np
import pandas as pd
```

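```python
# Build a toy dataset: 100 students, each randomly assigned to one of five schools.
# No seed is set at this point, so the counts below will differ from run to run.
data = pd.DataFrame({
    "StudentID": range(1, 101),
    "Schools": np.random.choice(["school A", "school B", "school C", "school D", "school E"], size=100),
    "Score": np.random.randint(50, 100, size=100),  # integer scores from 50 to 99
})
data["Schools"].value_counts()
```

```
Schools
school A    26
school C    26
school D    24
school B    13
school E    11
Name: count, dtype: int64
```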

```python
# Each distinct school is one cluster
cluster = data["Schools"].unique()
print("All clusters:", cluster)
```

```
All clusters: ['school A' 'school C' 'school B' 'school E' 'school D']
```


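```python
# Randomly select 3 schools (clusters)
# Caution: np.random.choice samples with replacement by default, so the same
# school can be drawn twice. Pass replace=False to guarantee three distinct
# clusters. The output recorded below comes from the call exactly as written.
np.random.seed(42)
selected_clusters = np.random.choice(cluster, size=3)
print("Selected Clusters:", selected_clusters)
```

```
Selected Clusters: ['school E' 'school D' 'school B']
```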


```python
# Keep every student who attends one of the selected schools
sample_data = data[data["Schools"].isin(selected_clusters)]
```

```python
# Number of sampled students per selected school
sample_data["Schools"].value_counts()
```

```
Schools
school D    24
school B    13
school E    11
Name: count, dtype: int64
```
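
The selection step in the example above calls `np.random.choice` with its default settings, which sample with replacement, and it sets the seed only after the data has been generated. As a hedged alternative, the sketch below wraps the three steps in one function that draws distinct clusters with a seeded `numpy.random.Generator`. The function name `cluster_sample`, its parameters, and the toy data are assumptions made for illustration. Because it uses a different random generator, it will not reproduce the school counts shown above.

```python
import numpy as np
import pandas as pd


def cluster_sample(df: pd.DataFrame, cluster_col: str, n_clusters: int, seed: int = 0) -> pd.DataFrame:
    """One-stage cluster sample: pick n_clusters distinct clusters, keep all their rows."""
    rng = np.random.default_rng(seed)
    clusters = df[cluster_col].drop_duplicates().to_numpy()
    if n_clusters > len(clusters):
        raise ValueError("n_clusters exceeds the number of available clusters")
    # replace=False guarantees that no cluster is drawn twice
    chosen = rng.choice(clusters, size=n_clusters, replace=False)
    return df[df[cluster_col].isin(chosen)]


# Toy data mirroring the scenario above (hypothetical values, seeded for reproducibility)
rng = np.random.default_rng(123)
schools = ["school A", "school B", "school C", "school D", "school E"]
students = pd.DataFrame({
    "StudentID": range(1, 101),
    "Schools": rng.choice(schools, size=100),
    "Score": rng.integers(50, 100, size=100),  # integer scores from 50 to 99
})

sample = cluster_sample(students, "Schools", n_clusters=3, seed=42)
print(sample["Schools"].value_counts())
```

Passing the seed into the function keeps the cluster draw reproducible regardless of what other random calls ran earlier in the session.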