# **Sampling**

The first example below uses pandas' sample method to draw a simple random sample from the data. The second uses scikit-learn's train_test_split with stratification so that each city is represented in the sample in roughly the same proportion as in the full dataset. Although train_test_split is normally used to split data into training and test sets for machine learning models, here we repurpose it for stratified sampling by keeping only the "test" subset as our sample.

These examples require the pandas and scikit-learn libraries, so make sure you have them installed in your Python environment.
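
A quick way to confirm that both libraries can be imported is sketched below. It relies only on the standard library's importlib, and the `status` dictionary is an illustrative name rather than part of the original notebook. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Map each import name to whether Python can locate it in this environment.
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
status = {name: importlib.util.find_spec(name) is not None
          for name in ("pandas", "sklearn")}

for name, found in status.items():
    print(f"{name}: {'available' if found else 'missing (install it with pip or conda)'}")
```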

In [None]:
import pandas as pd

# Create a DataFrame to simulate our population
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'James', 'Laura', 'Sam', 'Monica', 'Paul', 'Diana'],
        'Age': [28, 34, 45, 32, 22, 29, 41, 25, 33, 40],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Chicago', 'New York']}

df = pd.DataFrame(data)

# SIMPLE RANDOM SAMPLING
# Let's select 5 random individuals from our dataset
sample_random = df.sample(n=5, random_state=1) # random_state for reproducibility

print(sample_random)


    Name  Age         City
2  Peter   45     New York
9  Diana   40     New York
6    Sam   41     New York
4  James   22  Los Angeles
0   John   28     New York
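
A simple random sample does not guarantee that every city appears in proportion to its share of the population. With only 5 draws from 10 rows, a city can be under-represented or missing altogether. The sketch below rebuilds only the City column so that it runs on its own, then places the population and sample shares side by side. The names `cities` and `comparison` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same City values as in the DataFrame above
cities = pd.DataFrame({'City': ['New York', 'Los Angeles', 'New York', 'Chicago',
                                'Los Angeles', 'Chicago', 'New York', 'Los Angeles',
                                'Chicago', 'New York']})

# Simple random sample of 5 rows, as in the cell above
sample_random = cities.sample(n=5, random_state=1)

# Share of each city in the population and in the sample
comparison = pd.DataFrame({
    'population': cities['City'].value_counts(normalize=True),
    'sample': sample_random['City'].value_counts(normalize=True),
}).fillna(0)  # a city missing from the sample gets a share of 0

print(comparison)
```

If the two columns differ noticeably, that is the gap stratified sampling is meant to close.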


In [None]:
# STRATIFIED RANDOM SAMPLING

from sklearn.model_selection import train_test_split

# We use the City column as the basis for stratification
# We are not building a real train/test split here: we discard the "train" half (_)
# and keep the "test" half as a 50% sample stratified by City
_, sample_stratified = train_test_split(
    df,
    test_size=0.5,        # 50% sample size
    stratify=df['City'],  # preserve the City proportions
    random_state=1,       # for reproducibility
)

print(sample_stratified)

     Name  Age         City
2   Peter   45     New York
0    John   28     New York
3   Linda   32      Chicago
4   James   22  Los Angeles
7  Monica   25  Los Angeles
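
As an alternative that avoids scikit-learn, pandas can draw a fixed fraction from each group directly. The following is a minimal sketch, assuming pandas 1.1 or later (where sample is available on grouped objects). It is not the method used above, and the name `sample_by_city` is illustrative. Because each group's size is rounded separately, the total number of rows may differ slightly from the 5 rows returned by train_test_split.

```python
import pandas as pd

# Same population as in the first cell
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'James', 'Laura', 'Sam', 'Monica', 'Paul', 'Diana'],
    'Age': [28, 34, 45, 32, 22, 29, 41, 25, 33, 40],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles',
             'Chicago', 'New York', 'Los Angeles', 'Chicago', 'New York'],
})

# Draw 50% of the rows within each city independently
sample_by_city = df.groupby('City').sample(frac=0.5, random_state=1)

print(sample_by_city)
print(sample_by_city['City'].value_counts())
```

Sampling within each group guarantees that every city with enough rows contributes at least one individual, which train_test_split also achieves through its stratify argument.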
