## What is Stratified Sampling in Data Analysis?
Stratified sampling is a probability sampling technique in which the population is divided into distinct, non-overlapping subgroups, called strata, based on shared characteristics. A random sample is then drawn independently from each stratum, most commonly in proportion to the stratum's size in the population (proportional allocation). This ensures that every significant subgroup of the population appears in the sample.

#### Key Steps in Stratified Sampling
1. **Divide the Population into Strata:** Group individuals based on shared characteristics, such as age, gender, income level, etc.
2. **Determine the Sample Size for Each Stratum:** Decide how many samples should come from each stratum, often proportional to the size of the stratum in the overall population.
3. **Randomly Sample Within Each Stratum:** Randomly select the required number of samples from each group.
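
The three steps above can be sketched in a few lines of pandas. This is a minimal illustration, not a definitive implementation: the toy population, the column names `unit_id` and `stratum`, and the 20% sampling fraction are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy population: 60 "A" units, 30 "B" units, 10 "C" units.
population = pd.DataFrame({
    "unit_id": range(100),
    "stratum": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,
})

# Step 1: group by the stratum column.
# Steps 2-3: draw the same fraction at random from every group, which
# yields per-stratum counts proportional to stratum size.
sample = population.groupby("stratum").sample(frac=0.2, random_state=0)

counts = sample["stratum"].value_counts().sort_index()
print(counts.to_dict())
```

Because each stratum is sampled at the same fraction, the sample keeps the 60/30/10 split of the population (12, 6, and 2 units here).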


#### Advantages of Stratified Sampling
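1. **Improved Representation:** With proportional allocation, every subgroup appears in the sample in its population proportion, whereas a simple random sample achieves this only on average.
2. **Greater Precision:** Reduces sampling error compared to simple random sampling of the same size, especially when units within a stratum are similar to each other but the strata differ from one another.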
3. **Effective for Subgroup Analysis:** Facilitates more accurate insights into individual subgroups, which is helpful for targeted analysis.
4. **Flexibility:** Strata can be customized based on the research question or population characteristics.
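
The precision claim can be checked with a small simulation. The sketch below is illustrative only: the population is synthetic, with two assumed strata whose means differ widely, and the sample size and number of trials are arbitrary choices. It compares how much the sample mean varies across repeated simple random samples versus repeated proportional stratified samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: two strata with very different means.
low = rng.normal(loc=20, scale=2, size=6000)   # 60% of the population
high = rng.normal(loc=80, scale=2, size=4000)  # 40% of the population
everyone = np.concatenate([low, high])

n, trials = 50, 1000
srs_means, strat_means = [], []
for _ in range(trials):
    # Simple random sample of n units from the whole population.
    srs_means.append(rng.choice(everyone, size=n, replace=False).mean())
    # Proportional stratified sample: 30 units from `low`, 20 from `high`.
    s_low = rng.choice(low, size=30, replace=False)
    s_high = rng.choice(high, size=20, replace=False)
    strat_means.append(np.concatenate([s_low, s_high]).mean())

# Spread of the estimated mean across trials (empirical standard error).
srs_se = np.std(srs_means)
strat_se = np.std(strat_means)
print(f"SRS standard error:        {srs_se:.2f}")
print(f"Stratified standard error: {strat_se:.2f}")
```

In this setup the stratified estimate varies far less, because the number of units taken from each stratum is fixed rather than left to chance. The gain shrinks as the strata become more alike.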

#### Disadvantages of Stratified Sampling
1. **Complexity:** Dividing the population into strata requires additional effort and knowledge about the population.
2. **Requires Population Information:** The stratum of every unit must be known before sampling, so the sampling frame has to record the stratifying characteristic for the whole population.
3. **Smaller Sample Sizes for Strata:** If strata are too small, it may limit the reliability of statistical analysis for those subgroups.
4. **Time-Consuming:** Identifying strata and ensuring proportional representation takes more time than simple random sampling.
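
The small-strata problem is easy to see by computing a proportional allocation before drawing anything. In the sketch below, the stratum names and sizes are made up, and `min_per_stratum` is an assumed cutoff chosen for illustration, not a statistical rule.

```python
# Hypothetical stratum sizes for a population of 10,000 (illustrative only).
stratum_sizes = {"urban": 7000, "suburban": 2800, "rural": 200}
n_sample = 50
min_per_stratum = 5  # assumed cutoff for a usable subgroup; choose your own

total = sum(stratum_sizes.values())
# Proportional allocation: each stratum gets its population share of n_sample.
allocation = {
    name: round(n_sample * size / total) for name, size in stratum_sizes.items()
}

for name, count in allocation.items():
    note = " (below the assumed minimum)" if count < min_per_stratum else ""
    print(f"{name}: {count}{note}")
```

A stratum that makes up 2% of the population receives a single unit in a sample of 50, which is too few to say anything reliable about that subgroup on its own.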

#### When to Use Stratified Sampling
1. **Population Diversity:** When the population is diverse, and different subgroups must be represented proportionally.
2. **Subgroup Analysis:** When insights into specific subgroups are required (e.g., analyzing the impact of a product by gender or income level).
3. **Avoiding Bias:** To ensure that small but significant subgroups (e.g., minority groups) are not underrepresented.
4. **Resource Allocation:** When you want to focus resources on specific groups while maintaining a representative sample.
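
When a small subgroup needs more sampled units than its population share would give it, one common approach is disproportionate allocation followed by reweighting. The sketch below assumes a synthetic population with hypothetical `group` and `score` columns and arbitrary per-stratum sample sizes. Each sampled unit is weighted by N_h / n_h, the number of population units it stands for.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical population: 950 majority units and 50 minority units.
population = pd.DataFrame({
    "group": ["majority"] * 950 + ["minority"] * 50,
    "score": np.concatenate([rng.normal(50, 5, 950), rng.normal(70, 5, 50)]),
})

# Disproportionate allocation: the minority group is 5% of the population
# but one third of the sample.
per_stratum_n = {"majority": 50, "minority": 25}
parts = [
    population[population["group"] == g].sample(n=k, random_state=1)
    for g, k in per_stratum_n.items()
]
sample = pd.concat(parts)

# Weight each sampled unit by N_h / n_h for its stratum.
N_h = population["group"].value_counts()
sample["weight"] = sample["group"].map(lambda g: N_h[g] / per_stratum_n[g])

pop_mean = population["score"].mean()
unweighted = sample["score"].mean()
weighted = np.average(sample["score"], weights=sample["weight"])
print(f"population mean:        {pop_mean:.1f}")
print(f"unweighted sample mean: {unweighted:.1f}")
print(f"weighted sample mean:   {weighted:.1f}")
```

The unweighted mean is pulled toward the oversampled group, while the weighted mean recovers an estimate for the whole population. The extra minority units still pay off when that subgroup is analyzed separately.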

#### Python Code Example
##### Scenario:
You have a dataset of 1000 individuals, categorized by gender, and you want to ensure that the sample maintains the same proportion of males and females as the original population.

In [7]:
import numpy as np
import pandas as pd

In [158]:
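# Gender is drawn at random with probabilities 0.6/0.4, so the realized
# split is close to, but not necessarily exactly, 60/40.
population = pd.DataFrame({
    "ID": range(1, 1001),
    "Gender": np.random.choice(["Male", "Female"], size=1000, p=[0.6, 0.4]),
    "Age": np.random.randint(18, 60, size=1000),  # ages 18-59 (upper bound is exclusive)
})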

In [203]:
def stratified_sampling(df, stratify_column, n_sample, random_state=None):
    """Draw a proportionally allocated stratified sample of about n_sample rows."""
    # Share of each stratum in the population, e.g. Male ~0.6 and Female ~0.4.
    proportion = df[stratify_column].value_counts(normalize=True)

    # Convert shares to per-stratum counts, e.g. 60 males and 40 females for
    # n_sample=100. Because of rounding, the counts may not sum exactly to n_sample.
    strata_counts = (proportion * n_sample).round().astype(int)

    stratified_samples = []
    for stratum, count in strata_counts.items():
        # Randomly draw `count` rows from the rows that belong to this stratum.
        stratified_samples.append(
            df[df[stratify_column] == stratum].sample(n=count, random_state=random_state)
        )

    # Combine all sampled strata into a single DataFrame.
    return pd.concat(stratified_samples)


sample_data = stratified_sampling(population, stratify_column="Gender", n_sample=100, random_state=42)
print(sample_data["Gender"].value_counts())

Gender
Male      60
Female    40
Name: count, dtype: int64
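
One caveat with the `.round()` step in `stratified_sampling`: with more than two strata, independently rounded counts do not always add up to `n_sample`. A possible remedy is largest-remainder allocation, sketched below in plain Python. The helper `allocate_proportional` and the stratum sizes are hypothetical and not part of the function above.

```python
import math

def allocate_proportional(stratum_sizes, n_sample):
    """Split n_sample across strata proportionally (largest-remainder method)."""
    total = sum(stratum_sizes.values())
    exact = {s: n_sample * size / total for s, size in stratum_sizes.items()}
    # Start from the rounded-down share of each stratum.
    counts = {s: math.floor(x) for s, x in exact.items()}
    leftover = n_sample - sum(counts.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda s: exact[s] - counts[s], reverse=True)
    for s in by_fraction[:leftover]:
        counts[s] += 1
    return counts

# Hypothetical stratum sizes where plain rounding loses a unit.
sizes = {"A": 335, "B": 335, "C": 330}
rounded = {s: round(10 * size / 1000) for s, size in sizes.items()}
fixed = allocate_proportional(sizes, 10)

print("plain rounding total:   ", sum(rounded.values()))
print("largest-remainder total:", sum(fixed.values()))
```

Here plain rounding assigns 3 units to every stratum, one short of the requested 10, while the largest-remainder allocation always sums to `n_sample`.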
