# Sampling
- Sampling provides a feasible and cost-effective way to gain insights about a population by examining a smaller, representative subset called a sample.

**Benefits of sampling**
- efficiency
- feasibility
- insights and inference
- hypothesis testing
- model building


# Types of Sampling Methods
1. Simple Random Sampling (SRS)
2. Stratified Sampling
3. Cluster Sampling
4. Systematic Sampling

In [1]:
import numpy as np
import pandas as pd

# simulate population
np.random.seed(42)

population_size=100000
population=pd.DataFrame({
    'customerid':range(population_size),
    'order_values':np.random.normal(3000,800,population_size) # here 3000 is mean and 800 is std
})
population.head()

Unnamed: 0,customerid,order_values
0,0,3397.371322
1,1,2889.388559
2,2,3518.15083
3,3,4218.423885
4,4,2812.6773


In [2]:
population['order_values'].mean()

np.float64(3000.7734945127595)

In [3]:
population['order_values'].std()

800.7247670697594

# Simple Random Sample
- Each data point in the population has an equal chance of being selected.
- Commonly used to create training and test datasets for machine learning.
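The train/test use case mentioned above can be sketched with the same `DataFrame.sample` call. This is a minimal illustration, not a full ML workflow: the frame `data`, the 80/20 ratio, and the seed are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical small dataset standing in for the real population
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "customerid": range(1000),
    "order_values": rng.normal(3000, 800, 1000),
})

# Simple random sample of 80% of the rows for training
train = data.sample(frac=0.8, random_state=42)

# Every row that was not drawn goes to the test set
test = data.drop(train.index)

print(len(train), len(test))
```

Because `drop` removes exactly the sampled index labels, the two sets never share a row.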

In [4]:
samplr_Srs=population.sample(n=5000,random_state=42)
samplr_Srs

Unnamed: 0,customerid,order_values
75721,75721,3248.256596
80184,80184,4592.808047
19864,19864,4007.521442
76699,76699,1367.075677
92991,92991,3264.159691
...,...,...
44719,44719,2219.383180
20980,20980,5010.914775
57224,57224,4112.982502
23910,23910,2028.748382


In [5]:
samplr_Srs['order_values'].mean()
# the sample mean (about 3014.7) is slightly above the population mean (about 3000.8); the gap is sampling error

np.float64(3014.6804338225998)
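To judge whether that gap is large, it can be compared with the standard error of the mean, roughly the population standard deviation divided by the square root of the sample size. The sketch below plugs in the values printed in the cells above; it ignores the finite population correction, which is an assumption that matters little when the sample is only 5% of the population.

```python
import math

# Values copied from the notebook outputs above
population_mean = 3000.7734945127595
population_std = 800.7247670697594
sample_mean = 3014.6804338225998
n = 5000

# Approximate standard error of the sample mean (no finite population correction)
standard_error = population_std / math.sqrt(n)

# Gap between the two means, expressed in standard errors
gap = sample_mean - population_mean
gap_in_standard_errors = gap / standard_error

print(round(standard_error, 2), round(gap_in_standard_errors, 2))
```

The gap comes out at a little over one standard error, which is the kind of deviation a simple random sample of this size is expected to show.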

# Stratified Sampling
- Divide the population into homogeneous subgroups (strata).
- Then randomly sample from each stratum.


## Business Case

Customers belong to:

1. Bronze
2. Silver
3. Gold
4. Platinum

High-value customers are fewer but important.

If we use simple random sampling, Platinum customers may be underrepresented.

### When To Use?

- Population has distinct groups
- Groups differ significantly
- Need representation from each group
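When every group must be well represented regardless of its size, one option is equal allocation: draw the same number of rows from each stratum. The sketch below assumes a hypothetical `customers` frame and an arbitrary 200 rows per segment; it uses `DataFrameGroupBy.sample`, available in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd

# Hypothetical population with the same segment shares as the business case
rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "customerid": range(10_000),
    "segment": rng.choice(
        ["Bronze", "Silver", "Gold", "Platinum"],
        size=10_000,
        p=[0.5, 0.3, 0.15, 0.05],
    ),
})

# Equal allocation: 200 customers from every segment, however rare it is
equal_allocation = customers.groupby("segment").sample(n=200, random_state=42)

counts = equal_allocation["segment"].value_counts()
print(counts)
```

A sample drawn this way over-represents Platinum relative to the population, so population-level estimates would need to be reweighted by segment share.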

In [6]:
# add customer segment
population['segment']=np.random.choice(
    ['Bronze','Silver','Gold','Platinum'],
    size=population_size,
    p=[0.5,0.3,0.15,0.05] # probability of each segment, in the same order as the labels above
)
population.head()

Unnamed: 0,customerid,order_values,segment
0,0,3397.371322,Bronze
1,1,2889.388559,Silver
2,2,3518.15083,Bronze
3,3,4218.423885,Silver
4,4,2812.6773,Bronze


In [9]:
# draw 5% from every segment so the sample keeps the population's segment proportions
sample_stratified=population.groupby('segment').apply(
    lambda x:x.sample(frac=0.05)
).reset_index(drop=True)
sample_stratified['segment'].value_counts(normalize=True)



segment
Bronze      0.49790
Silver      0.30014
Gold        0.15037
Platinum    0.05159
Name: proportion, dtype: float64

# Cluster Sampling
## Definition

1. Divide population into clusters (naturally occurring groups)
2. Randomly select some clusters
3. Take all observations from those clusters.


### Business Case

Customers grouped by city:

1. Mumbai
2. Delhi
3. Bangalore
4. Hyderabad

Instead of sampling individuals, we sample cities.

In [12]:
# Add city cluster
population["City"] = np.random.choice(
    ["Mumbai", "Delhi", "Bangalore", "Hyderabad"],
    size=population_size
)

# Randomly select 2 distinct cities (replace=False avoids picking the same city twice)
selected_cities = np.random.choice(
    population["City"].unique(),
    size=2,
    replace=False
)

sample_cluster = population[population["City"].isin(selected_cities)]

selected_cities


array(['Bangalore', 'Hyderabad'], dtype=object)

In [17]:
sample_cluster.head()

Unnamed: 0,customerid,order_values,segment,City
1,1,2889.388559,Silver,Bangalore
2,2,3518.15083,Bronze,Hyderabad
3,3,4218.423885,Silver,Bangalore
4,4,2812.6773,Bronze,Hyderabad
5,5,2812.690434,Gold,Hyderabad


### Business Insight

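Cluster sampling is cheaper because only the selected cities need to be covered.

The trade-off is higher sampling error when the clusters differ heavily from one another.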
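The effect of heterogeneous clusters can be sketched with a made-up example. The city means below are hypothetical and deliberately far apart; the code enumerates every possible pair of cities and measures how far each pair's mean lands from the population mean.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical per-city average order values, chosen to differ strongly
city_means = {"Mumbai": 3600, "Delhi": 3200, "Bangalore": 2800, "Hyderabad": 2400}

frames = [
    pd.DataFrame({"City": city, "order_values": rng.normal(mean, 800, 5000)})
    for city, mean in city_means.items()
]
customers = pd.concat(frames, ignore_index=True)
population_mean = customers["order_values"].mean()

# Mean of the cluster sample for every possible choice of 2 cities
cluster_means = {
    pair: customers.loc[customers["City"].isin(list(pair)), "order_values"].mean()
    for pair in combinations(city_means, 2)
}

# Largest distance between a cluster-sample mean and the population mean
worst_gap = max(abs(m - population_mean) for m in cluster_means.values())
print(round(population_mean), round(worst_gap))
```

Even though each cluster sample contains 10,000 customers, an unlucky pair of cities misses the population mean by several hundred, far more than a simple random sample of that size typically would.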

# Systematic Sampling
## Definition

Select every kth observation from an ordered list, where k is the population size divided by the desired sample size.


### Business Case

Database sorted by Customer_ID.

Select every 20th customer.

In [19]:
k = population_size // 5000
k

20

In [20]:
sample_systematic = population.iloc[::k]

sample_systematic.head()


Unnamed: 0,customerid,order_values,segment,City
0,0,3397.371322,Bronze,Mumbai
20,20,4172.519015,Bronze,Bangalore
40,40,3590.773264,Bronze,Delhi
60,60,2616.66061,Silver,Delhi
80,80,2824.26249,Gold,Bangalore


### Caution

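If the population is sorted by behavior (e.g., high spenders first), or the ordering has a repeating pattern whose period lines up with k, systematic sampling with a fixed starting point can become biased.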
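A deliberately extreme case makes the risk concrete. In the sketch below, a hypothetical list is arranged so that every 20th record is a high spender, which matches the sampling interval exactly. Shuffling the rows before taking every kth one is shown as one possible remedy; the values and the seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

k = 20

# Hypothetical ordered list: every 20th customer spends 6000, the rest 3000
order_values = np.full(100_000, 3000.0)
order_values[::k] = 6000.0
customers = pd.DataFrame({"order_values": order_values})

population_mean = customers["order_values"].mean()

# Systematic sample with a fixed start picks only the high spenders
biased_sample = customers.iloc[::k]

# Shuffling first breaks the alignment between the pattern and k
shuffled = customers.sample(frac=1, random_state=42)
shuffled_sample = shuffled.iloc[::k]

print(population_mean, biased_sample["order_values"].mean())
print(round(shuffled_sample["order_values"].mean(), 1))
```

The fixed-start sample reports an average that is nearly double the true one, while the shuffled version stays close to the population mean.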

### Summary

| Method        | When to Use            | Risk                                      |
| ------------- | ---------------------- | ----------------------------------------- |
| Simple Random | Homogeneous population | Small subgroups may be underrepresented   |
| Stratified    | Important subgroups    | More complex; needs group labels up front |
| Cluster       | Cost reduction         | Higher sampling error                     |
| Systematic    | Easy implementation    | Pattern bias                              |
