
### Stratified Sampling:
Stratified sampling is a sampling technique in which the population is divided into distinct, non-overlapping groups, called strata, based on specific characteristics (e.g., age, gender, income level). A random sample is then taken from each stratum, so every subgroup is guaranteed to appear in the sample. With proportional allocation, each stratum's share of the sample also matches its share of the population.

This method improves the representativeness of the sample, especially when the population is heterogeneous overall but the units within each stratum are relatively similar.

### Steps in Stratified Sampling:
1. Divide the Population into Strata:

    - Identify the relevant characteristic(s) and divide the population into non-overlapping subgroups (strata).
2. Determine Sample Size for Each Stratum:

    - Decide the proportion of the sample to be taken from each stratum. This can be based on the size of the stratum relative to the entire population (proportional allocation) or a fixed number per stratum (equal allocation).
3. Random Sampling within Strata:

    - Perform simple random sampling within each stratum to select the required number of samples.
4. Combine Samples:

    - Merge the samples from all strata to form the final stratified sample.
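
The four steps above can be sketched with nothing but Python's standard library. The records, the `draw_stratified_sample` helper, and the per-stratum sizes are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def draw_stratified_sample(records, key, sizes, seed=0):
    """Draw sizes[stratum] records at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Step 1: divide the population into non-overlapping strata.
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)

    # Steps 3 and 4: simple random sampling within each stratum, then combine.
    sample = []
    for stratum, members in strata.items():
        sample.extend(rng.sample(members, sizes[stratum]))
    return sample

# Hypothetical population: 50 IT, 30 HR, and 20 Marketing employees.
population = (
    [{"id": i, "dept": "IT"} for i in range(50)]
    + [{"id": i, "dept": "HR"} for i in range(50, 80)]
    + [{"id": i, "dept": "Marketing"} for i in range(80, 100)]
)

# Step 2: proportional allocation for a total sample of 20 (a 20% rate).
sizes = {"IT": 10, "HR": 6, "Marketing": 4}

sample = draw_stratified_sample(population, key="dept", sizes=sizes)
print(len(sample))
```

Because `random.sample` draws without replacement, no record can be selected twice within a stratum.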

In [5]:
import pandas as pd
import numpy as np

# Sample population data (no NumPy seed is set, so the stratum counts vary between runs)
data = {
    'Customer_ID': range(1, 101),  # 100 customers
    'Age_Group': np.random.choice(['Youth', 'Adult', 'Senior'], size=100, p=[0.3, 0.5, 0.2]),
    'Satisfaction_Score': np.random.randint(1, 6, size=100)  # Satisfaction scores (1–5)
}

df = pd.DataFrame(data)

# Display population distribution by strata
print("Population Distribution:")
print(df['Age_Group'].value_counts())

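# Stratified sampling (proportional allocation)
sample_size = 30  # Target total sample size
strata_proportions = df['Age_Group'].value_counts(normalize=True)  # Share of each stratum
# Note: astype(int) truncates, so the per-stratum sizes can sum to less than
# sample_size (e.g. 17 + 8 + 4 = 29 for stratum counts of 57/29/14).
strata_sample_sizes = (strata_proportions * sample_size).astype(int)  # Samples per stratum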

# Function to sample within each stratum
def stratified_sample(group):
    n = strata_sample_sizes[group.name]  # Get sample size for the current stratum
    return group.sample(n=n, random_state=42)

# Apply stratified sampling
stratified_sample_df = df.groupby('Age_Group', group_keys=False).apply(stratified_sample)

# Display results
print("\nStratified Sample Distribution:")
print(stratified_sample_df['Age_Group'].value_counts())
print("\nStratified Sample Data:")
print(stratified_sample_df)


Population Distribution:
Age_Group
Adult     57
Youth     29
Senior    14
Name: count, dtype: int64

Stratified Sample Distribution:
Age_Group
Adult     17
Youth      8
Senior     4
Name: count, dtype: int64

Stratified Sample Data:
    Customer_ID Age_Group  Satisfaction_Score
4             5     Adult                   3
11           12     Adult                   1
52           53     Adult                   3
25           26     Adult                   3
61           62     Adult                   2
98           99     Adult                   4
47           48     Adult                   1
55           56     Adult                   1
82           83     Adult                   2
24           25     Adult                   2
84           85     Adult                   4
7             8     Adult                   3
79           80     Adult                   4
46           47     Adult                   2
16           17     Adult                   2
30           31     Adult      



In [5]:
import pandas as pd

# Create a sample dataset
data = {
    'ID': range(1, 101),
    'Department': ['IT'] * 50 + ['HR'] * 30 + ['Marketing'] * 20,
    'Satisfaction_Score': [80, 70, 90, 60, 85] * 20
}
df = pd.DataFrame(data)

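# Calculate sample size for each stratum (proportional allocation, total sample size = 20)
stratum_sizes = df['Department'].value_counts(normalize=True) * 20

# Stratified sampling: draw each department's share at random.
# round() guards against a float such as 5.999... being truncated to 5 by int().
stratified_sample = df.groupby('Department', group_keys=False).apply(
    lambda x: x.sample(int(round(stratum_sizes[x.name])))
)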

print(stratified_sample)


    ID Department  Satisfaction_Score
74  75         HR                  85
70  71         HR                  80
62  63         HR                  90
68  69         HR                  60
76  77         HR                  70
59  60         HR                  85
1    2         IT                  70
18  19         IT                  60
43  44         IT                  60
45  46         IT                  80
29  30         IT                  85
23  24         IT                  60
0    1         IT                  80
48  49         IT                  60
32  33         IT                  90
2    3         IT                  90
83  84  Marketing                  60
88  89  Marketing                  60
94  95  Marketing                  85
96  97  Marketing                  70
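
Recent pandas releases warn when `groupby(...).apply(...)` operates on the grouping column, so an alternative worth knowing is `DataFrameGroupBy.sample`, which samples within each group directly. The following is a minimal sketch, assuming pandas 1.1 or later and rebuilding the same 50/30/20 department data. The `frac=0.2` rate and `random_state=0` are choices made for this illustration, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 101),
    'Department': ['IT'] * 50 + ['HR'] * 30 + ['Marketing'] * 20,
    'Satisfaction_Score': [80, 70, 90, 60, 85] * 20,
})

# Sample 20% of every department: a proportional allocation of 20 rows in total.
proportional_sample = df.groupby('Department').sample(frac=0.2, random_state=0)

print(proportional_sample['Department'].value_counts())
```

Passing `n=5` instead of `frac` would give equal allocation with five rows per department.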




In [7]:
# Equal sampling of 5 employees from each department
equal_sample = df.groupby('Department', group_keys=False).apply(lambda x: x.sample(5))
print(equal_sample)

    ID Department  Satisfaction_Score
59  60         HR                  85
63  64         HR                  60
56  57         HR                  70
65  66         HR                  80
57  58         HR                  90
4    5         IT                  85
6    7         IT                  70
0    1         IT                  80
13  14         IT                  60
49  50         IT                  85
94  95  Marketing                  85
97  98  Marketing                  90
96  97  Marketing                  70
95  96  Marketing                  80
88  89  Marketing                  60
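
Equal allocation samples the departments at different rates (5 of 20 Marketing employees but only 5 of 50 in IT), so a plain average over the 15 rows over-weights the smaller departments. One common remedy is to weight each stratum's sample mean by that stratum's share of the population. The sketch below applies this to the scores printed above; the variable names are illustrative.

```python
# Population size of each department (from the dataset definition).
population_sizes = {'IT': 50, 'HR': 30, 'Marketing': 20}

# Satisfaction scores from the equal-allocation sample printed above.
sample_scores = {
    'HR': [85, 60, 70, 80, 90],
    'IT': [85, 70, 80, 60, 85],
    'Marketing': [85, 90, 70, 80, 60],
}

total_population = sum(population_sizes.values())

# Mean score within each stratum.
stratum_means = {dept: sum(s) / len(s) for dept, s in sample_scores.items()}

# Weight each stratum mean by its population share N_h / N.
weighted_mean = sum(
    population_sizes[dept] / total_population * mean
    for dept, mean in stratum_means.items()
)

# Unweighted mean over all sampled rows, for comparison.
all_scores = [s for scores in sample_scores.values() for s in scores]
unweighted_mean = sum(all_scores) / len(all_scores)

print(f"Weighted estimate:   {weighted_mean:.2f}")
print(f"Unweighted estimate: {unweighted_mean:.2f}")
```

With proportional allocation the two estimates coincide, because every stratum is sampled at the same rate.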




In [9]:
import pandas as pd
import numpy as np

# Simulating a large dataset
np.random.seed(42)
data = {
    'Customer_ID': range(1, 10001),  # 10,000 customers
    'Region': np.random.choice(['North', 'South', 'East', 'West'], size=10000, p=[0.4, 0.3, 0.2, 0.1]),
    'Satisfaction_Score': np.random.randint(1, 6, size=10000),  # Satisfaction score: 1-5
    'Purchase_Amount': np.random.uniform(100, 5000, size=10000)  # Random purchase amounts
}

# Create a DataFrame
df = pd.DataFrame(data)

# Analyze population distribution
region_distribution = df['Region'].value_counts(normalize=True)
print("Population Distribution:\n", region_distribution)

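# Define the total sample size and calculate sample size per region.
# Note: rounding each stratum independently does not guarantee that the sizes
# sum to total_sample_size (the sizes printed below sum to 1001).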
total_sample_size = 1000
stratum_sample_sizes = (region_distribution * total_sample_size).round().astype(int)
print("\nSample Sizes Per Region:\n", stratum_sample_sizes)

# Perform Stratified Sampling
stratified_sample = df.groupby('Region', group_keys=False).apply(
    lambda x: x.sample(n=stratum_sample_sizes[x.name], random_state=42)
)

# Display Results
print("\nStratified Sample:\n", stratified_sample.head())
print("\nStratified Sample Distribution:\n", stratified_sample['Region'].value_counts(normalize=True))


Population Distribution:
 Region
North    0.4058
South    0.3055
East     0.1926
West     0.0961
Name: proportion, dtype: float64

Sample Sizes Per Region:
 Region
North    406
South    306
East     193
West      96
Name: proportion, dtype: int64

Stratified Sample:
       Customer_ID Region  Satisfaction_Score  Purchase_Amount
5480         5481   East                   3      4282.472602
8582         8583   East                   1      3320.164876
9699         9700   East                   4      3279.602150
4755         4756   East                   3       310.145123
1685         1686   East                   3      2033.308614

Stratified Sample Distribution:
 Region
North    0.405594
South    0.305694
East     0.192807
West     0.095904
Name: proportion, dtype: float64
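
Truncating or rounding each stratum's quota independently does not guarantee the target total: the first example draws 17 + 8 + 4 = 29 rows rather than 30, and the regional sizes above sum to 1001 rather than 1000. One way to hit the total exactly is largest-remainder allocation: floor every quota, then hand the leftover units to the strata with the largest fractional parts. The helper below is a hypothetical sketch in plain Python, not a pandas function.

```python
import math

def allocate_proportionally(stratum_counts, total_sample_size):
    """Proportional allocation whose sizes sum exactly to total_sample_size."""
    population = sum(stratum_counts.values())
    quotas = {
        name: total_sample_size * count / population
        for name, count in stratum_counts.items()
    }
    # Start from the floor of each quota.
    sizes = {name: math.floor(q) for name, q in quotas.items()}
    shortfall = total_sample_size - sum(sizes.values())
    # Give the remaining units to the strata with the largest fractional remainders.
    by_remainder = sorted(
        quotas, key=lambda name: quotas[name] - sizes[name], reverse=True
    )
    for name in by_remainder[:shortfall]:
        sizes[name] += 1
    return sizes

# Stratum counts taken from the outputs shown earlier.
age_sizes = allocate_proportionally({'Adult': 57, 'Youth': 29, 'Senior': 14}, 30)
region_sizes = allocate_proportionally(
    {'North': 4058, 'South': 3055, 'East': 1926, 'West': 961}, 1000
)
print(age_sizes)
print(region_sizes)
```

The resulting dictionary can be used in place of the rounded series when choosing `n` for each stratum.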


