## Day 3: Population & Sampling

#### 1. Population vs. Sample

##### Population:
The entire group of individuals or items you want to study.

##### Sample:
A subset of the population, selected for actual analysis. We use samples to make inferences about populations.

##### Example:

Suppose you want to know the average height of all students at a university (the population), but you can only measure 100 of them (the sample). The average height in the sample then serves as your estimate of the average height in the whole population.
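
The idea above can be sketched with simulated data. Everything numeric here is an assumption made for illustration: the population size of 20,000, the normal distribution of heights, and its mean (170 cm) and standard deviation (10 cm) are all made up, not real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: heights (cm) of 20,000 students (synthetic data)
population = rng.normal(loc=170, scale=10, size=20_000)

# Sample: measure only 100 students, chosen without replacement
sample = rng.choice(population, size=100, replace=False)

population_mean = population.mean()  # usually unknown in practice
sample_mean = sample.mean()          # our estimate of the population mean

print(f"Population mean: {population_mean:.2f} cm")
print(f"Sample mean:     {sample_mean:.2f} cm")
```

The two means should be close but not identical. That gap is sampling error, and it tends to shrink as the sample gets larger.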

#### 2. Sampling Methods


##### a. Simple Random Sampling
- Definition: Every member of the population has an equal chance of being selected, and every possible sample of a given size is equally likely to be drawn.

##### Example:



In [None]:
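import numpy as np
students = np.arange(1, 1001)  # Student IDs 1 to 1000
np.random.seed(42)  # Fix the seed so the same sample is drawn on every run
# Draw 100 IDs without replacement, so no student is picked twice
sample = np.random.choice(students, size=100, replace=False)
print("Sampled student IDs:", sample[:10])  # Show the first 10 sampled IDs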
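
The same draw can also be written with NumPy's newer `Generator` interface (`np.random.default_rng`), which NumPy recommends for new code. This is an alternative sketch, not a replacement for the cell above, and it will select different IDs than the legacy `np.random.seed` version even with the same seed value.

```python
import numpy as np

students = np.arange(1, 1001)  # Student IDs 1 to 1000

# A local generator keeps the random state out of NumPy's global state
rng = np.random.default_rng(42)
sample = rng.choice(students, size=100, replace=False)

# Without replacement, no student ID can appear twice
print("Unique IDs in sample:", len(np.unique(sample)))
print("Sampling fraction:", len(sample) / len(students))
```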


##### b. Stratified Sampling
- Definition: The population is divided into non-overlapping subgroups (strata), and a random sample is taken from each stratum. This guarantees that every stratum is represented in the final sample.

##### Example:



In [2]:
import pandas as pd
import numpy as np
# Suppose we have 1000 students with gender labels (M/F)
students = pd.DataFrame({
    'id': np.arange(1, 1001),
    'gender': np.random.choice(['M','F'], size=1000)
})
# Equal allocation: draw 10 students at random from each stratum (gender)
male_sample = students[students['gender'] == 'M'].sample(10, random_state=42)
female_sample = students[students['gender'] == 'F'].sample(10, random_state=42)
stratified_sample = pd.concat([male_sample, female_sample])
print(stratified_sample.head())


      id gender
337  338      M
552  553      M
972  973      M
139  140      M
607  608      M
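
The example above takes the same number of students from each stratum. A common alternative is proportional allocation, where each stratum contributes the same fraction of its members so the sample mirrors the population mix. The sketch below assumes pandas 1.1 or later (for `groupby(...).sample`) and uses a made-up 70/30 gender split purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population: 1000 students, roughly 70% 'M' and 30% 'F'
students = pd.DataFrame({
    'id': np.arange(1, 1001),
    'gender': rng.choice(['M', 'F'], size=1000, p=[0.7, 0.3]),
})

# Draw 10% of each stratum so the sample keeps the population's proportions
proportional_sample = students.groupby('gender').sample(frac=0.1, random_state=42)

print(students['gender'].value_counts(normalize=True))
print(proportional_sample['gender'].value_counts(normalize=True))
```

The two printed distributions should be nearly identical, which is the point of proportional allocation. Equal allocation, as in the cell above, is often preferred when a small stratum would otherwise contribute too few observations.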


##### c. Cluster Sampling
- Definition: The population is divided into clusters (often based on geography or another natural grouping), a few clusters are randomly selected, and all members of those clusters are included.

##### Example:

In [3]:
# Suppose students are grouped in 10 dorms (Dorm_A through Dorm_J)
dorm_names = [f'Dorm_{letter}' for letter in 'ABCDEFGHIJ']
students['dorm'] = np.random.choice(dorm_names, size=1000)
# Randomly pick 2 dorms
selected_dorms = np.random.choice(students['dorm'].unique(), size=2, replace=False)
cluster_sample = students[students['dorm'].isin(selected_dorms)]
print('Selected dorms:', selected_dorms)
print(cluster_sample.head())


Selected dorms: ['Dorm_F' 'Dorm_D']
    id gender    dorm
5    6      M  Dorm_F
7    8      F  Dorm_D
16  17      F  Dorm_F
19  20      M  Dorm_F
20  21      M  Dorm_D
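
One consequence of the definition is that the size of a cluster sample is not chosen directly: it equals the combined size of whichever clusters happen to be selected. The self-contained sketch below checks this on synthetic data. The seed and the random dorm assignment are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical population: 1000 students assigned at random to 10 dorms
dorms = [f"Dorm_{letter}" for letter in "ABCDEFGHIJ"]
students = pd.DataFrame({
    'id': np.arange(1, 1001),
    'dorm': rng.choice(dorms, size=1000),
})

# Randomly select 2 dorms and keep every student living in them
selected_dorms = rng.choice(dorms, size=2, replace=False)
cluster_sample = students[students['dorm'].isin(selected_dorms)]

# The sample size equals the combined size of the selected dorms
dorm_sizes = students['dorm'].value_counts()
expected_size = dorm_sizes.reindex(selected_dorms, fill_value=0).sum()

print("Selected dorms:", selected_dorms)
print("Cluster sample size:", len(cluster_sample))
print("Combined size of selected dorms:", expected_size)
```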
