# Data Sampling - Methods

## 1) Objective

In the previous examples, we dealt with larger-than-memory files. Another time- and cost-efficient way of exploring a large data set is sampling. The idea is that we don't need the entire data set in order to analyze it; instead, we pick specific portions via sampling. Depending on the task, there are different sampling methods:<br>
<br>
- Random Sampling  
- Stratified Sampling  
- Systematic Sampling  
- Cluster Sampling  
- Bootstrap Sampling  
- Oversampling & Undersampling (Basic Concepts)


## 2) Preparation
First, we load the standard Python libraries as usual:

In [None]:
import os                    # for file system operations such as "listdir" or "walk"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We also need a few additional functions that help us assess the quality of a sample (see later).

In [None]:
from scipy.stats import gaussian_kde, entropy, ks_2samp     # gaussian_kde: for plotting a smoothed histogram
                                                            # entropy: for calculating the KL-divergence
                                                            # ks_2samp: for running a KS-test
from sklearn.model_selection import train_test_split        # for stratified sampling

<br>

We want to explore the data set *"DTXSID8031865 HTTr-Summary-2025-10-07.xlsx"*, which contains chemical properties and experimental results of so-called PFAS: **P**er- and poly**F**luoro**A**lkyl **S**ubstances, which are of particular health and environmental interest.

In [None]:
filename = "DTXSID8031865 HTTr-Summary-2025-10-07.xlsx"

In order to locate the file we run *"FindMyFile"* as before...

In [None]:
def FindMyFile(filename: str, ServerHardDiscPath: str = r"c:\Users\MMH_user\Desktop") -> str:
    """
    finds file of name "filename" anywhere in "ServerHardDiscPath" and returns complete path
    """
    for r, d, f in os.walk(ServerHardDiscPath):
        for files in f:
            if files == filename: #for example "MessyFile.xlsx"
                return os.path.join(r, files)
    # no match: fail loudly instead of silently returning None
    raise FileNotFoundError(f'"{filename}" not found in "{ServerHardDiscPath}"')

In [None]:
File = pd.read_excel(FindMyFile(filename))

...and explore the file:

In [None]:
File.head()

<br>

## 3) Random Sampling
The most straightforward method is simple random sampling. Say we want to sample the information stored in column "BMD".

In [None]:
col = 'BMD'

In [None]:
sample_random = File[col].sample(n = 10)
sample_random

How representative the subsample (here n = 10) is depends mainly on the subsample size and also on the total size of the data set.

In [None]:
print("N total = " + str(len(File[col])))
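To get a feeling for the effect of the subsample size, the following sketch repeats the draw many times and measures how much the sample mean scatters. It uses synthetic, normally distributed numbers as a stand-in for a data column (an assumption, not the PFAS file), and the names `population` and `spread` are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)             # fixed seed, only for reproducibility
population = rng.normal(size=10_000)       # synthetic stand-in for a data column

spread = {}
for n in (10, 100, 1000):
    # draw 300 independent subsamples of size n and record their means
    means = [rng.choice(population, size=n, replace=False).mean() for _ in range(300)]
    spread[n] = float(np.std(means))       # scatter of the sample mean
    print(f"n = {n:5d}: std of sample means = {spread[n]:.4f}")
```

The scatter shrinks as *n* grows, which is why larger subsamples tend to be more representative.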

We want to draw subsamples of different sizes, say 1%, 5% etc., and compare the subsample distribution to the distribution of the complete data set in order to get an idea of how representative the subsample is. In order to measure "representativeness" we generate a plot (subsample vs. all data), but also calculate the **KL-divergence** (see lecture), which is entropy-based and tells us how much information we lose if we sample the data with different *n*. The higher the **KL-divergence**, the more the subsample differs from the actual data and the more information we lost.<br>  
Another method to measure how similar two distributions are is the **K**olmogorov-**S**mirnov (KS) test. The p-value from the KS test is the probability of observing the measured (or a more extreme) discrepancy between the two data sets, assuming that both were drawn from the same distribution.
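Before applying both measures to the real column, here is a minimal sketch on synthetic data. The normal distribution, the seed and the names `population`, `kl_values` and `ks_pvalues` are assumptions made only for this illustration. `scipy.stats.entropy(p, q)` normalizes both inputs and returns the KL-divergence of `p` relative to `q`.

```python
import numpy as np
from scipy.stats import gaussian_kde, entropy, ks_2samp

rng = np.random.default_rng(0)                        # fixed seed, only for reproducibility
population = rng.normal(loc=0.0, scale=1.0, size=5000)  # synthetic stand-in for the full data

# common grid on which both smoothed histograms are evaluated
x_vals = np.linspace(population.min(), population.max(), 500)
full_density = gaussian_kde(population)(x_vals)

kl_values, ks_pvalues = {}, {}
for n in (20, 2000):
    sample = rng.choice(population, size=n, replace=False)
    sample_density = gaussian_kde(sample)(x_vals)
    # small epsilon avoids division by zero where a density vanishes
    kl_values[n] = float(entropy(sample_density + 1e-12, full_density + 1e-12))
    ks_pvalues[n] = float(ks_2samp(sample, population).pvalue)
    print(f"n = {n:4d}: KL = {kl_values[n]:.4f}, KS p-value = {ks_pvalues[n]:.4f}")
```

A KL value close to zero means the smoothed sample distribution is close to the full one. A small KS p-value would indicate that the two data sets are unlikely to stem from the same distribution.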

In [None]:
sample_sizes = [1, 5, 10, 20, 50, 90]# in %

In [None]:
def RandomSampleData(df, col: str, sample_sizes: list):

    # Full data distribution
    full_data = df[col].values
    kde_full  = gaussian_kde(full_data)
    n_total   = len(full_data)
    LS        = len(sample_sizes)
    
    # Shared x-axis for KDE plots
    x_vals       = np.linspace(full_data.min(), full_data.max(), 500)
    full_density = kde_full(x_vals)
    
    # Prepare subplots
    fig, axes = plt.subplots(2, int(np.ceil(LS/2)), figsize = (18, 10))
    axes      = axes.flatten()
    
    results   = [None]*LS
    
    for i, (ax, pct) in enumerate(zip(axes, sample_sizes)):

        n_sample = max(2, int(n_total * pct / 100))  # gaussian_kde needs at least 2 data points
        
        # Vectorized random sampling
        sample = df[col].sample(n_sample, replace=False).values
        # KDE of sample for plotting a smooth histogram
        kde_sample     = gaussian_kde(sample)
        sample_density = kde_sample(x_vals)
    
        # KL divergence (add small epsilon for numerical stability)
        KL = entropy(sample_density + 1e-12, full_density + 1e-12)
    
        # KS test
        KS = ks_2samp(sample, full_data).pvalue
    
        results[i] = (pct, KL, KS)
    
        # Plot
        ax.plot(x_vals, full_density, label="Full Data", linewidth=2.5, alpha=0.8)
        ax.plot(x_vals, sample_density, label=f"{pct}% sample", linewidth=2)
        ax.set_xlabel(col)
        ax.set_ylabel('rel. frequency')
        ax.set_title(f"Sample Size: {pct}%\nKL = {KL:.4f}, KS p-val = {KS:.4f}")
        ax.grid(alpha=0.2)
        ax.legend()
    
    plt.tight_layout()
    plt.show()
    
    # Return comparison table (it is displayed when the call is the last line of a cell)
    return pd.DataFrame(results, columns=["Sample %", "KL Divergence", "KS p-value"])

In [None]:
RandomSampleData(File, col, sample_sizes)

<br>

## 4) Stratified Sampling
Random sampling does not take into account that the data might have been drawn from different groups of different sizes. **Stratified sampling maintains the proportion of groups (strata)**. Note that random sampling results in **approximate proportions only on average across many repeated samples, not in any single sample,** especially if the different proportions cover a large dynamic range.<br>
For example, the groups in *'TARGET_LEVEL'* have completely different sizes.

In [None]:
col_groups = 'TARGET_LEVEL'

In [None]:
Groups     = set(File[col_groups])
print(Groups)

Get sizes by counting appearance: 

In [None]:
L = list(File[col_groups])
for g in Groups:
    ct = L.count(g)
    print(g + ": " + str(ct) + " appearances" )

Therefore, we need to sample the groups separately. The **subsamples will have different absolute sizes, but identical relative sizes.**<br>
Say we want to extract 20% of each group. We can take advantage of the function *"train_test_split"*, but only keep the test split (its size is set by *"test_size"*), which contains the subsamples we are looking for.

In [None]:
fraction = 0.2
_, test  = train_test_split(File, test_size = fraction, stratify = File[col_groups])

In [None]:
test[[col_groups, col]]

Let us check if that worked! The subsamples should be roughly *"fraction"* of the size of the original data.

In [None]:
Ltest = list(test[col_groups])
for g in Groups:
    ct     = L.count(g)
    cttest = Ltest.count(g)
    r      = cttest/ct
    print(g + ":\t\t " + str(ct) + " appearances in full data set,\t " + str(cttest) + " appearances in subsample.\t Ratio = " + f"{r: .2f}")
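As a cross-check of the idea, the same kind of stratified draw can be sketched with pandas alone via `groupby(...).sample(frac=...)`. This is an alternative to the `train_test_split` approach above, not the method used in this notebook. The toy frame `toy` and its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)                       # fixed seed, only for reproducibility
toy = pd.DataFrame({
    "group": ["A"] * 1000 + ["B"] * 100 + ["C"] * 10,   # strongly unequal group sizes
    "value": rng.normal(size=1110),
})

# stratified: 20% of every group, so the proportions are preserved exactly
strat = toy.groupby("group").sample(frac=0.2, random_state=1)
# plain random: 20% of all rows, so the group proportions hold only approximately
plain = toy.sample(frac=0.2, random_state=1)

print("stratified:\n", strat["group"].value_counts().sort_index())
print("random:\n", plain["group"].value_counts().sort_index())
```

In the stratified draw even the smallest group is represented, whereas a plain random draw of the same total size may over- or under-represent it.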

Next, let us plot the smoothed histograms of the values of the different groups.

In [None]:
def PlotSampleData(df, Groups, col_groups, col):

    for g in Groups:
        # extracting all(!) values of each group via a boolean mask
        vals = df.loc[df[col_groups] == g, col].values
        n    = len(vals)
        
        if n > 1: # a KDE needs at least two data points, so smaller groups are skipped
            x_vals         = np.linspace(vals.min(), vals.max(), 500)
        
            kde_sample     = gaussian_kde(vals)
            sample_density = kde_sample(x_vals)
    
            #normalization for plotting
            sample_density /= np.sum(sample_density)
        
            plt.plot(x_vals, sample_density, linewidth = 2, label  = g + ": n = " + str(n))
            plt.xlabel(col)
            plt.ylabel('norm. frequency')
            plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
            plt.grid(alpha = 0.2)
    plt.show()

In [None]:
PlotSampleData(File, Groups, col_groups, col)

<br>

## 5) Systematic Sampling
Systematic sampling refers to sampling only **every k-th element** and is easy to implement. It ensures that the sample is evenly spread across the population. However, systematic sampling **should be avoided if periodic patterns** are expected in the data, because the step size *k* may coincide with the period.

In [None]:
k = 20
systematic_sample = File[col].iloc[::k]

In [None]:
systematic_sample
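The following sketch illustrates two points on toy data (the series `data` and `periodic` are assumptions, not columns of the file): a random start offset avoids always beginning with the first row, and a pattern whose period equals *k* makes the systematic sample collapse to a single value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)             # fixed seed, only for reproducibility
k = 20

# 1) random start: pick an offset in [0, k) and then take every k-th element
data = pd.Series(np.arange(100))
start = int(rng.integers(0, k))
systematic_sample = data.iloc[start::k]
print(systematic_sample.tolist())

# 2) pitfall: the values repeat with period k, so every k-th element is identical
periodic = pd.Series(np.tile(np.arange(k), 5))    # 0, 1, ..., 19, 0, 1, ..., 19, ...
biased_sample = periodic.iloc[start::k]
print("population mean:", periodic.mean(), " sample values:", biased_sample.unique())
```

In the second case the sample says nothing about the spread of the population, no matter which offset was chosen.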

<br>

## 6) Cluster Sampling
For cluster sampling, we randomly select a **certain number of clusters (groups)** and then **select all members from these selected clusters**, whereas for stratified sampling we selected all clusters (groups) and then sampled some members from each of them.

In [None]:
nClust            = 3
selected_clusters = np.random.choice(File[col_groups].unique(), size = nClust, replace = False)

In [None]:
print(selected_clusters)

Extracting all members from the randomly selected clusters:

In [None]:
cluster_sample = File[File[col_groups].isin(selected_clusters)]

In [None]:
cluster_sample[[col_groups, col]]

In [None]:
PlotSampleData(File, selected_clusters, col_groups, col)

<br>

## 7) Bootstrap Sampling
Bootstrap sampling is essentially **sampling with replacement** and is often used for **estimating uncertainty**.<br>
When we randomly sampled a certain number *n* of values from our column "BMD", we found in *3)* that the sample distribution only approximates the actual distribution. In order to estimate this uncertainty, we sample our *n* values with replacement.

In [None]:
print(col)

In [None]:
n = 20

In [None]:
sample_random = File[col].sample(n = n)

In [None]:
print("N total = " + str(len(File[col]))) # size of the actual data set

Comparing sample to actual distribution:

In [None]:
full_data = File[col].values
kde_full  = gaussian_kde(full_data)
    
# Shared x-axis for KDE plots
x_vals         = np.linspace(full_data.min(), full_data.max(), 500)
full_density   = kde_full(x_vals)
kde_sample     = gaussian_kde(sample_random)
sample_density = kde_sample(x_vals)

# Plot
plt.plot(x_vals, full_density, label = "Full Data (size " + str(len(File[col])) + ")", linewidth = 2.5, alpha = 0.8)
plt.plot(x_vals, sample_density, label = "Sample (size " + str(n) + ")", linewidth = 2)
plt.xlabel(col)
plt.ylabel('norm. frequency')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
plt.grid(alpha = 0.2)
plt.show()

Let us now bootstrap the sample many times, i.e. we "return" it to the overall data set and then draw *n* data points again, save the result, put the data points back and so on. Some of the data points will be drawn many times during this process.

In [None]:
Nboot = 500 #number of bootstrapping steps

For picking random integers from a certain interval without repetition, we use the library

In [None]:
import random

Check:

In [None]:
Nmin = 10
Nmax = 20
Nnum = 4

In [None]:
for _ in range(10):
    print(random.sample(range(Nmin, Nmax + 1), Nnum))

We will use this for indexing in the following function:

In [None]:
def BootStrap(File, col, Size_sample: int = 50, Nboot: int = 500):
    
    L         = len(File[col])
    Boot_data = np.zeros((Nboot, Size_sample))    #pre-allocating space


    for i in range(Nboot):
        idx          = random.sample(range(0, L), Size_sample) # drawing Size_sample random numbers between 0 and L
        Boot_data[i] = File[col].iloc[idx] 

    return Boot_data

In [None]:
Boot_data = BootStrap(File, col)

In [None]:
print(Boot_data.shape)

In order to understand how we can estimate uncertainty, we plot all the smoothed histograms in one figure and see how the curves vary.

In [None]:
full_data = File[col].values
x_vals    = np.linspace(full_data.min(), full_data.max(), 500)    

full_density   = kde_full(x_vals)
sample_density = kde_sample(x_vals)

#smooth histograms
for i in range(Nboot):
    boot       = Boot_data[i,:]
    kde_sample = gaussian_kde(boot)
    y          = kde_sample(x_vals)
    
    plt.plot(x_vals, y, c = 'black', linewidth = 2.5, alpha = 0.01)
plt.plot(x_vals, full_density, c = 'red', linewidth = 1, alpha = 0.8, label = "Full Data (size " + str(len(File[col])) + ")")    
plt.xlabel(col)
plt.ylabel('norm. frequency')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
plt.grid(alpha = 0.2)
plt.show()

We can clearly see how the sample curves fluctuate. Based on these fluctuations, it is possible to calculate percentiles and interpret them as confidence intervals.

In [None]:
def PlotConfIntervall(File, col, Boot_data, Conf: list = [68, 90, 95, 99]):

    full_data      = File[col].values
    x_vals         = np.linspace(full_data.min(), full_data.max(), 500)    

    full_density   = gaussian_kde(full_data)(x_vals) # computed locally instead of relying on the global kde_full

    Y              = np.zeros((Boot_data.shape[0], x_vals.shape[0])) #pre-allocating space

    #smooth histograms
    for i in range(Boot_data.shape[0]): # number of bootstrap rows, independent of the global Nboot
        boot       = Boot_data[i,:]
        kde_sample = gaussian_kde(boot)
        Y[i,:]     = kde_sample(x_vals)

    #calculating confidence intervals using np.percentile
    for v in Conf:
        alpha      = v/100
        y_ci_lower = np.percentile(Y, (1 - alpha) / 2 * 100, axis = 0)
        y_ci_upper = np.percentile(Y, (1 + alpha) / 2 * 100, axis = 0)
        
        plt.fill_between(x_vals, y_ci_lower, y_ci_upper, color = 'black', alpha = 0.1, label = str(v) + '% Confidence Interval')
    plt.plot(x_vals, full_density, c = 'red', linewidth = 1, alpha = 0.8, label = "Full Data (size " + str(len(File[col])) + ")")    
    plt.xlabel(col)
    plt.ylabel('norm. frequency')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
    plt.grid(alpha = 0.2)
    plt.title('bootstrapped sample')
    plt.show()

In [None]:
PlotConfIntervall(File, col, Boot_data)

Thanks to the percentiles, it is possible to quantify the uncertainty of the sample values. 
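The function *"BootStrap"* above draws unique indices from the full data set in every round. For comparison, the textbook bootstrap resamples one observed sample itself, with replacement. The sketch below shows this variant for the uncertainty of the mean. The exponential toy data and the names `sample`, `boot_means`, `ci_lower` and `ci_upper` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)                   # fixed seed, only for reproducibility
sample = rng.exponential(scale=2.0, size=50)     # synthetic stand-in for one observed sample
n_boot = 2000

# every row holds the indices of one resample: same size as the sample,
# drawn WITH replacement, so a data point can occur several times per row
idx = rng.integers(0, len(sample), size=(n_boot, len(sample)))
boot_means = sample[idx].mean(axis=1)            # one mean per bootstrap resample

# percentile interval covering the central 95% of the bootstrap means
ci_lower, ci_upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.3f}, 95% interval = [{ci_lower:.3f}, {ci_upper:.3f}]")
```

The width of the interval is a direct measure of how uncertain the estimated mean is for a sample of this size.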

<br>

## 8) Oversampling & Undersampling
Imagine you would like to train an ANN on images of cats and dogs. Ideally, the training data is well balanced, say we have 10,000 cat images and 10,000 dog images. Once the ANN is trained, one way to estimate the quality of the classification is to calculate the accuracy: how often did the network classify the images correctly?<br>
Unfortunately, the training **sample is often not well balanced**. Imagine the extreme case of 10,000 dog images but only 1,000 cat images. If the network always votes for *"dog"* regardless of the actual image, the accuracy will be about 91% (10,000 of 11,000 images)! Apparently, with such a stark imbalance a classifier can reach a high accuracy without learning the classes at all. The problem can be addressed with oversampling and undersampling.
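The accuracy of such an "always dog" classifier can be verified in a few lines. The label array below only mimics the class counts from the text, and the additional per-class measure `cat_recall` (fraction of cats recognized as cats) is an illustrative assumption, not part of the original example.

```python
import numpy as np

# labels mimicking the extreme case from the text: 10,000 dogs and 1,000 cats
labels = np.array(["dog"] * 10_000 + ["cat"] * 1_000)

# a "classifier" that ignores its input and always answers "dog"
predictions = np.full(labels.shape, "dog")

accuracy = np.mean(predictions == labels)                     # fraction of correct answers
cat_recall = np.mean(predictions[labels == "cat"] == "cat")   # fraction of cats that were found

print(f"accuracy = {accuracy:.3f}, cat recall = {cat_recall:.3f}")
```

The accuracy looks good although not a single cat was recognized, which is why accuracy alone is misleading on imbalanced data.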

First, we create an imbalanced sample:

In [None]:
df = pd.DataFrame({
    'x': np.random.randn(200), #random numbers stand symbolically for images
    'y': np.random.choice(['dog', 'cat'], size=200, p=[0.9, 0.1])
})

In [None]:
df.head()

Next, we need to determine the size of the different samples.

In [None]:
Ndog = len(df[df['y']=='dog'])
Ncat = len(df[df['y']=='cat'])

In [None]:
print(Ndog, Ncat)

Since the sample *"dog"* is overrepresented, it is called *"majority"* and the underrepresented sample is called *"minority"*.

In [None]:
majority = df[df['y']=='dog']
minority = df[df['y']=='cat']

We can either **oversample the minority**

In [None]:
minority_oversampled = minority.sample(Ndog, replace=True)

In [None]:
minority_oversampled

or **undersample the majority**

In [None]:
majority_undersampled = majority.sample(Ncat, replace=False) # undersampling draws a subset, so no replacement is needed

In [None]:
majority_undersampled

in order to balance the training data.
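To actually obtain a balanced training set, the resampled class still has to be combined with the other class. A minimal sketch, assuming a toy frame `toy` with fixed class counts (180 dogs, 20 cats) so that the result is reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)                  # fixed seed, only for reproducibility
toy = pd.DataFrame({
    "x": rng.normal(size=200),                  # random numbers stand symbolically for images
    "y": ["dog"] * 180 + ["cat"] * 20,
})
majority = toy[toy["y"] == "dog"]
minority = toy[toy["y"] == "cat"]

# oversampling: replicate minority rows (with replacement) up to the majority size
minority_up = minority.sample(len(majority), replace=True, random_state=4)
balanced_over = pd.concat([majority, minority_up]).sample(frac=1, random_state=4).reset_index(drop=True)

# undersampling: keep only a random subset of the majority (without replacement)
majority_down = majority.sample(len(minority), replace=False, random_state=4)
balanced_under = pd.concat([majority_down, minority]).sample(frac=1, random_state=4).reset_index(drop=True)

print(balanced_over["y"].value_counts().to_dict())
print(balanced_under["y"].value_counts().to_dict())
```

The call `sample(frac=1)` only shuffles the rows. Oversampling keeps all data but repeats minority rows, whereas undersampling discards most of the majority rows.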