# Basic Univariate Distributions in Python 

## Reidar B Bratvold, Professor, University of Stavanger 



### Basic Univariate Data Distribution Plotting in Python

Here's a simple workflow with some basic univariate summary statistics and distribution plotting for tabular data (easily extended to gridded data). This should help you get started with data visualization and interpretation.

#### Objective 

The objective is to illustrate univariate distributions: summary statistics, histograms, PDFs, and CDFs.

#### Getting Started

You will need to copy the data file to your working directory. It is available in the same Colab directory as the Jupyter Notebook. A link has been provided on Canvas.

- sample_data.csv


We will also need some standard packages. These should have been installed with Anaconda 3.

In [None]:
import numpy as np                        # ndarrays for gridded data
import pandas as pd                       # DataFrames for tabular data
import os                                 # set working directory, run executables
import matplotlib.pyplot as plt           # for plotting
from scipy import stats                   # summary statistics
import seaborn as sns                     # advanced plotting
from scipy.stats import gaussian_kde

#### Loading Tabular Data

Here's the command to load our comma-delimited data file into a pandas DataFrame object. For fun, try misspelling the file name. You will get a long, ugly error.

In [None]:
#print(os.getcwd())  # Print the current working directory

In [None]:
df = pd.read_csv('sample_data.csv')     # load our data table

We loaded our file into a DataFrame called 'df'. But how do you really know that it worked? Visualizing the DataFrame is useful, and we already learned about these methods in this demo.

We can preview the DataFrame by printing a slice or by utilizing the 'head' DataFrame member function (with a nice and clean format, see below). With the slice we could look at any subset of the data table and with the head command, add parameter 'n=13' to see the first 13 rows of the dataset.  

In [None]:
print(df.iloc[0:5,:])                   # display first 5 samples in the table as a preview
df.head(n=13)                           # we could also use this command for a table preview
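Beyond a visual preview, a few quick checks can confirm that a table loaded as expected. The sketch below is only an illustration: it reads a tiny made-up CSV from a string (the column names follow the data description in this notebook, the values are invented) and reports the shape, the column names, and the number of missing values.

```python
import io
import pandas as pd

# Hypothetical stand-in for sample_data.csv: the column names follow the
# notebook's description, the values are made up for illustration.
csv_text = """X,Y,Facies,Porosity,Perm,AI
100.0,900.0,1,0.115,5.7,5400.0
100.0,800.0,1,0.136,17.2,3900.0
100.0,600.0,0,0.090,1.2,6200.0
"""
demo_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = demo_df.shape                 # number of samples and features
column_names = list(demo_df.columns)           # confirm the expected features loaded
n_missing = int(demo_df.isna().sum().sum())    # total count of missing values

print(f"{n_rows} rows x {n_cols} columns: {column_names}")
print(f"Missing values: {n_missing}")
print(demo_df.dtypes)                          # numeric columns should not be 'object'
```

The same three checks can be run on `df` after loading the real file.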

# **Data Description**

This dataset contains key geophysical and petrophysical properties used in reservoir characterization. Below is a brief description of each parameter:

### **1. X, Y (Spatial Coordinates)**
- Represents the spatial location of data points in the reservoir.
- These coordinates help in mapping and visualization of subsurface properties.

### **2. Facies**
- A categorical variable representing different geological facies in the reservoir.
- Typically, **Facies = 0** and **Facies = 1** indicate different lithological units (e.g., shale vs. sandstone).

### **3. Porosity**
- Denoted as a fraction (0 to 1), indicating the pore space in the rock.
- Higher porosity means more storage capacity for fluids like oil, gas, or water.

### **4. Permeability (Perm)**
- Measured in millidarcies (mD), it quantifies the ability of fluids to flow through the rock.
- Higher permeability values indicate better reservoir quality.

### **5. Acoustic Impedance (AI)**
- Defined as the product of **P-wave velocity** and **rock density**.
- Used in seismic inversion to distinguish between different rock types and fluid contents.
- Higher AI values suggest denser and/or faster materials (e.g., carbonates, tight sands), while lower AI values indicate more porous formations.

This dataset can be used for **reservoir characterization, seismic interpretation, and petrophysical analysis**.

#### Summary Univariate Statistics for Tabular Data

The table includes X and Y coordinates (meters), Facies (1 is sandstone and 0 is interbedded sand and mudstone), Porosity (fraction), permeability as Perm (mDarcy), and acoustic impedance as AI (kg/m2s*10^6).

There are many efficient methods to calculate summary statistics from tabular data in DataFrames. The describe command provides the count, mean, standard deviation, minimum, maximum, and quartiles, all in a tidy table. We use transpose just to flip the table so that features are on the rows and the statistics are on the columns.

In [None]:
df.describe().transpose()               # flip so features are rows, statistics are columns

We can also use a wide variety of statistical summaries built into NumPy's ndarrays. When we use the commands:
```python
df['Porosity']                       # returns a pandas Series
df['Porosity'].values                # returns an ndarray
```
Indexing the DataFrame by column name returns the porosity data as a pandas Series; adding `.values` returns a NumPy ndarray, which gives access to many NumPy methods. I also like to limit the number of printed digits (here with f-string format specifiers such as `:.2f`) so that the reported precision is honest and the output is easy to read.

For example, we can now use commands like these:

In [None]:
print(f"The minimum is {df['Porosity'].min():.2f}.")
print(f"The maximum is {df['Porosity'].max():.2f}.")
print(f"The mean is {df['Porosity'].mean():.2f}.")
print(f"The sample variance is {df['Porosity'].var():.6f}.")
print(f"The standard deviation is {df['Porosity'].std():.2f}.")

Sometimes it is useful to calculate the variance as the difference between the mean of the squares and the square of the mean (the "Difference of Means" form).

### Derivation of Sample Variance Using the Difference of Means

The **sample variance** $s^2$ is defined as:

$$
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

where:
- $x_i$ represents each sample value,
- $\bar{x}$ is the sample mean,
- $n$ is the number of observations.

#### **Step 1: Expand the Squared Terms**
Using the identity:

$$
(x_i - \bar{x})^2 = x_i^2 - 2\bar{x}x_i + \bar{x}^2
$$

Expanding the sum:

$$
\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} \bar{x}^2
$$

#### **Step 2: Use the Definition of the Mean**
Since the mean is:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

we substitute:

$$
\sum_{i=1}^{n} \bar{x}^2 = n \bar{x}^2
$$

and 

$$
\sum_{i=1}^{n} x_i = n \bar{x}
$$

so the equation simplifies to:

$$
\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2
$$

$$
= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
$$

#### **Step 3: Express in Terms of Mean of Squares**
Dividing by $n - 1$:

$$
s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}
$$

Rewriting using the **mean of squared values**:

$$
\bar{x^2} = \frac{1}{n} \sum_{i=1}^{n} x_i^2
$$

we get:

$$
s^2 = \frac{n}{n-1} (\bar{x^2} - \bar{x}^2)
$$

#### **Conclusion**
Thus, we have derived the formula:

$$
s^2 = \frac{n}{n-1} (\bar{x^2} - \bar{x}^2)
$$

This expresses the **sample variance** as a function of:
- $\bar{x^2}$: the mean of squared values, and
- $\bar{x}^2$: the square of the mean.

The **$\frac{n}{n-1}$ factor** (Bessel’s correction) ensures an **unbiased estimate** of the population variance.
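The derived identity can be checked numerically. The following is a minimal sketch on synthetic data (the random values merely stand in for porosity and are an assumption of this example); it compares the identity with NumPy's sample variance, where `ddof=1` applies Bessel's correction.

```python
import numpy as np

rng = np.random.default_rng(seed=73073)          # fixed seed for repeatability
x = rng.normal(loc=0.15, scale=0.03, size=200)   # synthetic "porosity-like" values

n = x.size
mean_of_squares = np.mean(x**2)                  # mean of the squared values
square_of_mean = np.mean(x)**2                   # square of the mean

s2_identity = n / (n - 1) * (mean_of_squares - square_of_mean)
s2_numpy = np.var(x, ddof=1)                     # ddof=1 applies Bessel's correction

print(f"identity: {s2_identity:.8f}, np.var(ddof=1): {s2_numpy:.8f}")
```

One caution: when the mean is very large relative to the spread, subtracting two nearly equal numbers can lose floating-point precision, so the definition-based form is usually the safer choice in code.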


Here are some of the NumPy statistical functions that take ndarrays as inputs. With a multidimensional array, these functions can calculate a statistic by row (axis = 1), by column (axis = 0), or over the entire array (no axis specified). We just have a 1D ndarray, so the axis argument is not needed here.

We calculate the inverse of the CDF, $F^{-1}_x(p)$, with NumPy's percentile function.

In [None]:
porosity = df['Porosity'].values

print(f"The minimum is {np.amin(porosity):.2f}.")
print(f"The maximum is {np.amax(porosity):.2f}.")
print(f"The range (maximum - minimum) is {np.ptp(porosity):.2f}.")
print(f"The P10 is {np.percentile(porosity, 10):.3f}.")
print(f"The P50 is {np.percentile(porosity, 50):.3f}.")
print(f"The P90 is {np.percentile(porosity, 90):.3f}.")
print(f"The P13 is {np.percentile(porosity, 13):.3f}.")
print(f"The median (P50) is {np.median(porosity):.3f}.")
print(f"The mean is {np.mean(porosity):.3f}.")

We can calculate the (Pearson) correlation matrix using `df.corr()`

In [None]:
df.corr()
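By default `df.corr()` reports the Pearson coefficient for each pair of columns. As a rough illustration of what that number is, the sketch below computes it by hand for two synthetic features (both invented for this example) as the covariance divided by the product of the standard deviations, and compares the result with `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
a = rng.normal(size=100)                        # synthetic feature 1
b = 0.8 * a + rng.normal(scale=0.5, size=100)   # synthetic feature 2, correlated with a

# Pearson correlation: covariance divided by the product of standard deviations
cov_ab = np.sum((a - a.mean()) * (b - b.mean())) / (a.size - 1)
r_manual = cov_ab / (a.std(ddof=1) * b.std(ddof=1))

r_numpy = np.corrcoef(a, b)[0, 1]               # off-diagonal entry of the 2x2 matrix
print(f"manual r = {r_manual:.4f}, np.corrcoef r = {r_numpy:.4f}")
```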

We can calculate the CDF value, $F_x(x)$, directly from the data.
* we use a condition to create a boolean array with the same size as the data and then count the cases that meet the condition
* we are assuming equal weighting.

In [None]:
value = 0.10
cumul_prob = np.count_nonzero(porosity <= value) / len(df)

print(f"The cumulative probability for porosity = {value:.2f} is {cumul_prob:.2f}.")
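The percentile (inverse CDF) and the counting-based CDF are two directions of the same relationship, so applying one and then the other should approximately return the starting probability. Here is a small sketch on synthetic uniform values (an assumption standing in for the porosity data).

```python
import numpy as np

rng = np.random.default_rng(seed=42)
values = rng.uniform(0.05, 0.25, size=1000)     # synthetic stand-in for porosity

p = 0.10                                         # target cumulative probability
x_p10 = np.percentile(values, 100 * p)           # inverse CDF: value at P10
cdf_at_x = np.mean(values <= x_p10)              # forward CDF: fraction of data <= value

print(f"P10 value = {x_p10:.4f}, CDF at that value = {cdf_at_x:.3f}")
```

Taking the mean of a boolean array is equivalent to counting the `True` cases and dividing by the number of data, which again assumes equal weighting.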

#### Weighted Univariate Statistics

Later we will talk about weighted statistics. The NumPy function `average` allows for weighted averages, as in the case of statistical expectation and declustered statistics. For demonstration, let's make a weighting array and apply it.

In [None]:
nd = len(df)  # Get the number of data values
wts = np.ones(nd)  # Create an array of ones for equal weighting

equal_weighted_avg = np.average(porosity, weights=wts)

print(f"The equal-weighted average is {equal_weighted_avg:.3f}, the same as the mean above.")

The formula for the **weighted average** is:

$$
\bar{x}_{\text{weighted}} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i}
$$

where:

- $x_i$ are the **porosity values**.
- $w_i$ are the corresponding **weights** (`wts`).
- $n$ is the **number of observations**.
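As a worked example of this formula, the sketch below uses three illustrative values and weights (chosen by hand for this example, not taken from the dataset) and confirms that the term-by-term calculation matches `np.average`.

```python
import numpy as np

x = np.array([0.10, 0.12, 0.20])   # illustrative porosity values (fractions)
w = np.array([1.0, 1.0, 0.5])      # illustrative weights; the high value is down-weighted

manual = np.sum(x * w) / np.sum(w)           # the formula above, term by term
with_numpy = np.average(x, weights=w)        # NumPy's built-in weighted average

print(f"manual = {manual:.4f}, np.average = {with_numpy:.4f}, plain mean = {x.mean():.4f}")
```

Here the numerator is 0.10 + 0.12 + 0.10 = 0.32 and the sum of weights is 2.5, giving 0.128, which is below the equal-weighted mean of 0.14 because the largest value carries less weight.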


We can modify the weights to be 0.5 if the porosity is greater than 13% and retain 1.0 if the porosity is less than or equal to 13%. The results should be a lower weighted average.  

In [None]:
wts = np.ones_like(porosity)  # Create an array of ones with the same shape as porosity

wts[porosity > 0.13] *= 0.5  # Reduce weights for porosity values greater than 0.13

weighted_avg = np.average(porosity, weights=wts)

print(f"The weighted average is {weighted_avg:.3f}, lower than the equal-weighted average above.")

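SciPy's `stats.describe` function provides a handy set of summary statistics in one call. The output looks like a list of values but is actually a `DescribeResult` object, so any one statistic can be extracted by name for use in a workflow, as follows. Note that the reported kurtosis uses Fisher's definition (excess kurtosis), which is 0.0 for a Gaussian distribution.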

In [None]:
# Get summary statistics
por_stats = stats.describe(df['Porosity'])

# Print full summary statistics
print(por_stats)

# Extract and print kurtosis
print(f"Porosity kurtosis is {por_stats.kurtosis:.2f}")

#### Histograms

Let's display some histograms.

#### Display histogram of data

Here's a histogram for porosity.

In [None]:
pormin, pormax = 0.05, 0.25
plt.figure(figsize=(8, 4))
plt.hist(df['Porosity'].values,alpha=0.8,color="darkorange",edgecolor="black",bins=20,range=[pormin,pormax])
plt.title('Histogram'); plt.xlabel('Porosity (fraction)'); plt.ylabel("Frequency")
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.0, wspace=0.1, hspace=0.2); plt.show()

Looks bimodal. 

#### Histogram Bins, Number of Bins and Bin Size

Let's explore a few bin sizes to check their impact on the histogram.

#### Start by creating a slightly modified plot function

See what happens when we use:

* **too large bins / too few bins** - often smooths out the distribution and removes information
* **too small bins / too many bins** - often too noisy, which obscures information
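One way to run this experiment is sketched below. The helper name `plot_hist`, the three bin counts, and the synthetic bimodal data are all assumptions made for this illustration; in the notebook the data argument would be `df['Porosity'].values`.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_hist(ax, data, nbins, xmin, xmax):
    """Draw one histogram panel on ax and return the bin counts."""
    counts, _, _ = ax.hist(data, bins=nbins, range=[xmin, xmax], alpha=0.8,
                           color="darkorange", edgecolor="black")
    ax.set_title(f"{nbins} bins")
    ax.set_xlabel("Porosity (fraction)")
    ax.set_ylabel("Frequency")
    return counts

# Synthetic bimodal data standing in for df['Porosity'].values (an assumption)
rng = np.random.default_rng(seed=0)
demo_por = np.concatenate([rng.normal(0.11, 0.015, 150), rng.normal(0.19, 0.02, 110)])

bin_choices = [3, 20, 200]                      # too few, reasonable, too many
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
all_counts = [plot_hist(ax, demo_por, nb, 0.05, 0.25) for ax, nb in zip(axes, bin_choices)]
plt.tight_layout()
plt.show()
```

With 3 bins the two modes tend to merge, while with 200 bins most bars hold only a handful of samples and the shape is dominated by noise.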


#### Normalized Histograms

Normalized histograms are convenient because each bar reads directly as the relative frequency of its bin, and we can check closure: the relative frequencies over all bins sum to 1.0.

* to do this we need to explicitly set the weight for each datum to $\frac{1}{n}$

In [None]:
weights = np.ones(len(df)) / len(df)
plt.figure(figsize=(8, 4))
plt.hist(porosity,alpha=0.8,color="darkorange",edgecolor="black",bins=25,range=[pormin,pormax],weights=weights)
plt.title('Normalized Histogram'); plt.xlabel('Porosity (fraction)'); plt.ylabel("Prob")
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.0, wspace=0.1, hspace=0.2); plt.show()
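The closure property can be verified without plotting, using `np.histogram`. This sketch uses synthetic values (an assumption) and also contrasts the $\frac{1}{n}$ weighting with `density=True`, which is a different normalization: there the area of the bars sums to 1.0, not the bar heights.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
data = rng.uniform(0.05, 0.25, size=500)        # synthetic values inside the plotting range

# Relative frequency: each datum carries weight 1/n, so bin heights sum to 1
rel_freq, edges = np.histogram(data, bins=25, range=(0.05, 0.25),
                               weights=np.ones(data.size) / data.size)

# Density: bin heights are scaled so that height x bin width sums to 1
density, _ = np.histogram(data, bins=25, range=(0.05, 0.25), density=True)
bin_width = edges[1] - edges[0]

print(f"sum of relative frequencies = {rel_freq.sum():.3f}")
print(f"sum of density heights      = {density.sum():.1f}")
print(f"area under density bars     = {(density * bin_width).sum():.3f}")
```

The density form is the one to use when overlaying a PDF or KDE, since both are then on the same vertical scale.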

## Kernel densities

Let's overlay a kernel density estimate on the histogram.

In [None]:

# Create the figure
plt.figure(figsize=(8, 4))

# Create histogram; density=True scales the bars so their total area is 1.0,
# which puts them on the same scale as the KDE (no extra weights are needed)
plt.hist(porosity, alpha=0.8, color="darkorange", edgecolor="black", bins=25,
         range=[pormin, pormax], density=True, label="Histogram")

# Compute and plot KDE
kde = gaussian_kde(porosity, bw_method=0.1)  # Adjust bandwidth as needed
x_vals = np.linspace(pormin, pormax, 200)  # Smooth x-axis for KDE curve
plt.plot(x_vals, kde(x_vals), color='darkblue', linewidth=3, label="KDE")

# Labels and title
plt.title('Normalized Histogram with KDE', fontsize=16)
plt.xlabel('Porosity (fraction)', fontsize=14)
plt.ylabel("Probability Density", fontsize=14)

# Add legend
plt.legend(fontsize=12)

# Adjust layout and show plot
plt.subplots_adjust(left=0.1, bottom=0.1, right=0.95, top=0.9, wspace=0.1, hspace=0.2)
plt.show()

### Probability Density Functions

If we decide to use the sampled data to represent uncertainty, the practical way to calculate a probability density function (PDF) from data is to use a kernel density estimate (KDE).

* we place a kernel, in this case a parametric Gaussian PDF, at each data value and then calculate the sum of all data kernels.
* constrained for closure such that the area under the curve is 1.0.
* differentiating the data CDF is usually too noisy to be useful.

To demonstrate the KDE method, we calculate the KDE PDF for the first 2, 4, 10, 20, 50, and 200 data.

* when there are very few data you can see the individual Gaussian kernels
* with more data they start to smooth out
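The construction described above (one Gaussian kernel per data value, summed and normalized for closure) can be written out by hand. This is a minimal sketch with three made-up data values and an arbitrarily chosen kernel width; it is not the SciPy implementation, only an illustration of the idea.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian PDF with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

data = np.array([0.08, 0.10, 0.17])      # three illustrative porosity values
kernel_sigma = 0.01                      # kernel width, chosen arbitrarily for the sketch

x_grid = np.linspace(0.0, 0.25, 2001)    # fine grid covering the kernels' support
kernels = np.array([gaussian_pdf(x_grid, d, kernel_sigma) for d in data])
kde_pdf = kernels.sum(axis=0) / data.size   # sum of kernels, divided by n for closure

dx = x_grid[1] - x_grid[0]
area = np.sum(0.5 * (kde_pdf[1:] + kde_pdf[:-1])) * dx   # trapezoidal rule
print(f"area under the hand-built KDE = {area:.4f}")
```

Each kernel integrates to 1.0, so dividing the sum of the $n$ kernels by $n$ is what keeps the area under the final curve at 1.0.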

In [None]:

# Set font sizes
title_fontsize = 30  # Adjust title font size
label_fontsize = 25  # Adjust x and y labels font size
tick_fontsize = 22   # Adjust tick labels font size

nums = [2, 4, 10, 20, 50, 200]

# Define the range for the x-axis
x_vals = np.linspace(0, 0.25, 200)  # 200 points between 0 and 0.25

plt.figure(figsize=(15, 10))  # Adjust figure size for readability

for i, num in enumerate(nums):
    plt.subplot(2, 3, i + 1)

    # Extract the first 'num' data points
    porosity_subset = df['Porosity'].values[:num]

    if len(porosity_subset) > 1:  # Ensure we have enough data for KDE
        kde = gaussian_kde(porosity_subset, bw_method=0.1)  # Bandwidth = 0.1
        plt.plot(x_vals, kde(x_vals), color='darkorange', linewidth=8, alpha=1.0)
    
    plt.xlim([0, 0.25])
    plt.title(f'KDE PDF for First {num} Data', fontsize=title_fontsize)
    plt.xlabel('Porosity (fraction)', fontsize=label_fontsize)
    plt.ylabel("Density", fontsize=label_fontsize)
    
    # Set tick font sizes
    plt.xticks(fontsize=tick_fontsize)
    plt.yticks(fontsize=tick_fontsize)

# Adjust subplot layout
plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=2.1, wspace=0.3, hspace=0.4)
plt.show()

#### What is the impact of changing the kernel width on the KDE PDF model? 

* let's loop over a variety of kernel sizes and observe the resulting PDF with the data histogram.
* note, kernel width is controlled by the bandwidth. In SciPy's `gaussian_kde`, a scalar `bw_method` is used directly as a scaling factor: the kernel standard deviation is the product of this factor and the standard deviation of the feature.

In [None]:
# Define bandwidth values
bandwidths = [0.01, 0.05, 0.1, 0.3]

# Create subplots
plt.figure(figsize=(12, 10))

for i, bw in enumerate(bandwidths):
    plt.subplot(2, 2, i + 1)

    # Compute standard deviation of porosity
    porosity_std = np.std(df['Porosity'])
    
    # Print bandwidth information
    print(f'Bandwidth = {bw}, Bandwidth x Standard Deviation = {bw * porosity_std}')

    # Plot histogram
    plt.hist(df['Porosity'].values, alpha=0.7, color="darkorange", edgecolor="black",
             bins=25, range=[pormin, pormax], density=True, label="Histogram")

    # Compute and plot KDE using SciPy
    kde = gaussian_kde(df['Porosity'].values, bw_method=bw)
    x_vals = np.linspace(pormin, pormax, 200)  # Smooth x-axis for KDE curve
    plt.plot(x_vals, kde(x_vals), color='black', alpha=0.8, linewidth=4.0, label="KDE")

    # Axis limits and labels
    plt.xlim([0.0, 0.3])
    plt.title(f'Histogram and KDE, BW = {bw}', fontsize=14)
    plt.xlabel('Porosity (fraction)', fontsize=12)
    plt.ylabel("Density", fontsize=12)

# Adjust subplot layout and show plot
plt.subplots_adjust(left=0.05, bottom=0.05, right=0.95, top=0.95, wspace=0.3, hspace=0.4)
plt.show()
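The relationship between the bandwidth factor and the kernel width can be checked directly on a fitted `gaussian_kde` object through its `factor` and `covariance` attributes. The sketch below uses synthetic data (an assumption standing in for porosity).

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(seed=3)
data = rng.normal(0.15, 0.05, size=261)          # synthetic stand-in for porosity

bw = 0.1
kde = gaussian_kde(data, bw_method=bw)

kernel_std = float(np.sqrt(kde.covariance[0, 0]))   # standard deviation of each kernel
expected = bw * np.std(data, ddof=1)                # bandwidth x sample standard deviation

print(f"kde.factor = {kde.factor}, kernel std = {kernel_std:.6f}, bw x std = {expected:.6f}")
```

The match uses the sample standard deviation (`ddof=1`); the values printed in the loop above use NumPy's default `ddof=0`, which differs only slightly for a dataset of this size.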

### **Bandwidth in Kernel Density Estimation (KDE)**

#### **What Do These Calculations Tell Us?**

When performing Kernel Density Estimation (KDE), the **bandwidth (`bw`)** parameter controls the level of **smoothing** applied to the estimated probability density function. The effect of bandwidth is influenced by the **spread of the data**, which is measured by the **standard deviation (`σ`)** of the dataset.

In the calculations below, we analyze the impact of different bandwidth values:

| Bandwidth (`bw`) | Bandwidth × Standard Deviation (`bw × σ`) |
|-----------------|-------------------------------------|
| 0.01           | 0.00050                             |
| 0.05           | 0.00248                             |
| 0.1            | 0.00497                             |
| 0.3            | 0.01491                             |

#### **Interpretation:**
- **Bandwidth (`bw`) controls the smoothing level** of the KDE.
- The **product `bw × σ`** is the **kernel standard deviation**: the absolute smoothing width, in porosity units, that indicates how far the kernel placed at each data value spreads.

#### **Key Takeaways:**
1. **For `bw = 0.01`:**  
   - `bw × σ = 0.0005`
   - Very small smoothing → KDE is **highly sensitive** to small variations, potentially capturing noise rather than the true distribution (**overfitting**).
  
2. **For `bw = 0.05`:**  
   - `bw × σ = 0.0025`
   - Still relatively narrow smoothing, capturing finer details but at risk of being too sensitive to small fluctuations.

3. **For `bw = 0.1`:**  
   - `bw × σ = 0.005`
   - A more balanced smoothing effect, offering a compromise between detail and generalization.

4. **For `bw = 0.3`:**  
   - `bw × σ = 0.015`
   - Very wide smoothing → KDE becomes much smoother, possibly missing important features (**underfitting**).

#### **Practical Use:**
- A **small bandwidth** makes the KDE **spiky, tracking individual data values**, revealing fine details but also amplifying noise.
- A **large bandwidth** smooths the KDE into a **broad Gaussian-like curve**, which may lose key structural elements.
- The **optimal bandwidth** depends on the **data distribution**. A common approach is to use **Silverman’s rule of thumb** or **cross-validation** to determine the best bandwidth automatically.



### **Optimal Bandwidth for Kernel Density Estimation (KDE)**

The **optimal bandwidth** for a dataset can be found by using **Silverman’s rule of thumb** or a **cross-validation-based approach**. Below, we compute and visualize the effect of different bandwidth choices.

#### **Silverman’s Rule of Thumb**
Silverman’s rule provides a default bandwidth estimate:

$$
h = 1.06 \cdot \sigma \cdot n^{-1/5}
$$

where:
- $\sigma$ is the **standard deviation** of the data,
- $n$ is the **number of observations**.

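This rule gives an absolute bandwidth $h$, in the units of the data, that adapts to the spread and size of the dataset. It is derived for a roughly Gaussian, unimodal distribution, so it may oversmooth bimodal data.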

### Find the optimal bandwidth for KDE

In [None]:

# Compute standard deviation and sample size
sigma = np.std(porosity, ddof=1)  # Using sample standard deviation (unbiased)
n = len(porosity)

# Compute Silverman's optimal bandwidth
silverman_bw = 1.06 * sigma * n ** (-1 / 5)

# Display the optimal bandwidth value
print(f"Silverman's Optimal Bandwidth: {silverman_bw:.6f}")
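A practical detail when passing this value to SciPy: a scalar `bw_method` in `gaussian_kde` is a dimensionless factor that SciPy multiplies by the data standard deviation, whereas $h$ from the rule above is already in data units. The sketch below, on synthetic data (an assumption), converts $h$ to a factor by dividing by $\sigma$ and compares it with SciPy's built-in `"silverman"` option.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(seed=11)
data = rng.normal(0.15, 0.05, size=261)          # synthetic stand-in for porosity

sigma = np.std(data, ddof=1)
n = data.size
h = 1.06 * sigma * n ** (-1 / 5)                 # absolute bandwidth from the rule above
factor = h / sigma                               # dimensionless factor SciPy expects

kde_manual = gaussian_kde(data, bw_method=factor)
kde_builtin = gaussian_kde(data, bw_method="silverman")   # SciPy's own Silverman factor

print(f"h = {h:.6f}, factor = {factor:.4f}, SciPy silverman factor = {kde_builtin.factor:.4f}")
```

For 1D data SciPy's factor is $(3n/4)^{-1/5} \approx 1.059\, n^{-1/5}$, so the two agree to within about 0.1%.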

#### Using the optimal bandwidth

In [None]:

# Compute Silverman's bandwidth
sigma = np.std(porosity, ddof=1)  # Unbiased standard deviation
n = len(porosity)
silverman_bw = 1.06 * sigma * n ** (-1 / 5)

# Create histogram (density=True already normalizes, so no weights are needed)
plt.figure(figsize=(8, 5))
plt.hist(porosity, alpha=0.8, color="darkorange", edgecolor="black", bins=25,
         range=[pormin, pormax], density=True, label="Histogram")

# Compute and plot KDE; a scalar bw_method is a factor that SciPy multiplies by
# the data standard deviation, so convert the absolute bandwidth h to h / sigma
kde = gaussian_kde(porosity, bw_method=silverman_bw / sigma)
x_vals = np.linspace(pormin, pormax, 200)
plt.plot(x_vals, kde(x_vals), color='darkblue', linewidth=3, label="KDE")

# Labels and title
plt.title('Normalized Histogram with KDE', fontsize=16)
plt.xlabel('Porosity (fraction)', fontsize=14)
plt.ylabel("Probability Density", fontsize=14)

# Add annotation for bandwidth in the upper left corner
plt.annotate(f'Bandwidth: {silverman_bw:.6f}', xy=(0.03, 0.9), xycoords='axes fraction',
             fontsize=14, color='black', bbox=dict(facecolor='white', edgecolor='white', boxstyle='round,pad=0.5'))

# Add legend
plt.legend(fontsize=12)

# Adjust layout and show plot
plt.subplots_adjust(left=0.1, bottom=0.1, right=0.95, top=0.9, wspace=0.1, hspace=0.2)
plt.show()

## Cumulative Distribution Functions

* the y axis is cumulative probability, with a minimum of 0.0 and a maximum of 1.0, as expected for a CDF.
* note that after the initial hist command we can add a variety of elements, such as labels, to our plot as shown below.
* we can increase or decrease the number of bins; once the number of bins exceeds the number of data ($> n$), the CDF is resolved at the resolution of the data

In [None]:
plt.hist(porosity,density=True, cumulative=True, label='CDF',
           histtype='stepfilled', alpha=0.8, bins = 100, color='darkorange', edgecolor = 'black', range=[0.0,0.25])
plt.xlabel('Porosity (fraction)')
plt.title('Porosity CDF')
plt.ylabel('Cumulative Probability')
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.0, wspace=0.1, hspace=0.2)
#plt.savefig('cdf_Porosity.tif',dpi=600,bbox_inches="tight")
plt.show()

#### Calculating and Plotting a CDF 'by Hand'

Let's demonstrate the calculation and plotting of a non-parametric CDF by hand:

1. make a copy of the feature as a 1D array (ndarray from NumPy)
2. sort the data in ascending order
3. assign cumulative probabilities based on the tail assumptions
4. plot cumulative probability vs. value

In [None]:
por = df['Porosity'].copy(deep = True).values # make a deepcopy of the feature from the DataFrame
print('The ndarray has a shape of ' + str(por.shape) + '.')

por = np.sort(por)                           # sort the data in ascending order
n = por.shape[0]                             # get the number of data samples

cprob = np.zeros(n)
for i in range(0,n):
    index = i + 1
    cprob[i] = index / n                     # known upper tail
    # cprob[i] = (index - 1)/n               # known lower tail
    # cprob[i] = (index - 1)/(n - 1)         # known upper and lower tails
    # cprob[i] = index/(n+1)                 # unknown tails  

plt.subplot(111)
plt.plot(por,cprob, alpha = 0.8, c = 'black',zorder=1) # plot piecewise linear interpolation
plt.scatter(por,cprob,s = 20, alpha = 1.0, c = 'darkorange', edgecolor = 'black',zorder=2) # plot the CDF points
plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])
plt.xlabel("Porosity (fraction)"); plt.ylabel("Cumulative Probability"); plt.title("Cumulative Distribution Function")

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.1, wspace=0.1, hspace=0.2)
plt.show()

### Cumulative Probability Equations

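Different formulas are used to compute cumulative probabilities based on tail assumptions. In the formulas below, $i = 0, 1, \ldots, n-1$ is the zero-based position of each value in the sorted data (the code above uses the one-based `index = i + 1`) and $n$ is the number of data.

#### **Known Upper Tail (Default)**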
$$
cprob[i] = \frac{i + 1}{n}
$$
- Assumes the dataset's **upper tail is known**.
- The last data point has **cumulative probability = 1**.
- Commonly used in **empirical CDFs**.

---

#### **Known Lower Tail**
$$
cprob[i] = \frac{i}{n}
$$
- Assumes the dataset’s **lower tail is known**.
- The first data point starts at **0** instead of **$1/n$**.

---

#### **Known Upper and Lower Tails**
$$
cprob[i] = \frac{i}{n - 1}
$$
- Assumes both the **upper and lower tails are known**.
- The first data point has **cumulative probability = 0** and the last has **cumulative probability = 1**.
- Probabilities are spaced in steps of $1/(n-1)$.

---

#### **Unknown Tails (Weibull Plotting Position)**
$$
cprob[i] = \frac{i + 1}{n + 1}
$$
- **Adjusts for unknown tails** by using $n+1$.
- Known as the **Weibull plotting position**; Cunnane's method instead uses $(i + 0.6)/(n + 0.2)$ for the same zero-based $i$.
- **Ensures values never reach 0 or 1 exactly**, useful for interpolation.

---

### **Summary Table**
| **Method** | **Formula** | **Assumptions** |
|------------|------------|----------------|
| **Known Upper Tail** | $ \frac{i + 1}{n} $ | Last value = 1 (default) |
| **Known Lower Tail** | $ \frac{i}{n} $ | First value = 0 |
| **Known Upper & Lower Tails** | $ \frac{i}{n - 1} $ | First value = 0, last value = 1 |
| **Unknown Tails (Weibull)** | $ \frac{i + 1}{n + 1} $ | Ensures values stay within (0,1) |
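To make the four conventions concrete, the sketch below evaluates them for a tiny dataset of $n = 5$ values (the size is chosen only so the numbers are easy to read), using the same zero-based $i$ as the formulas above.

```python
import numpy as np

n = 5
i = np.arange(n)                      # zero-based rank of each sorted data value

cprob_upper = (i + 1) / n             # known upper tail: last value reaches 1
cprob_lower = i / n                   # known lower tail: first value starts at 0
cprob_both = i / (n - 1)              # both tails known: spans 0 to 1
cprob_unknown = (i + 1) / (n + 1)     # unknown tails: stays strictly inside (0, 1)

for name, cp in [("upper", cprob_upper), ("lower", cprob_lower),
                 ("both", cprob_both), ("unknown", cprob_unknown)]:
    print(f"{name:8s}", np.round(cp, 3))

max_gap = np.max(np.abs(cprob_upper - cprob_lower))   # equals 1/n
```

With only five data the conventions differ by as much as $1/n = 0.2$; the gap shrinks in proportion to $1/n$ as the dataset grows.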


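For large datasets, such as the one we are working with, the differences between the CDFs from these formulas are on the order of $1/n$ and can usually be ignored.

Let's check this by computing all four conventions and plotting their differences. A subset of 20 points is plotted for legibility; the probabilities themselves are computed from the full dataset.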

In [None]:

# -----------------------------
# 1. LOAD & SORT THE DATA
# -----------------------------
# (Assuming 'df' is your DataFrame containing the 'Porosity' column)
por = df['Porosity'].copy(deep=True).values  
#print(f'The ndarray has a shape of {por.shape}.')  # e.g., (261,)

# Sort the data
por = np.sort(por)
n_full = por.shape[0]

# -----------------------------
# 2. OPTIONALLY SAMPLE THE DATA
# -----------------------------
use_all_samples = False  # Change to True to plot every sample instead of a subset
n_samples = 20          # Only used if use_all_samples is False

if use_all_samples:
    porosity_sampled = por
    indices = np.arange(n_full)  # all indices
else:
    indices = np.linspace(0, n_full - 1, n_samples, dtype=int)
    porosity_sampled = por[indices]

# -----------------------------
# 3. COMPUTE FOUR CDF VARIATIONS
# -----------------------------
# We'll compute these for the full dataset then sample if needed.
cprob_upper   = np.zeros(n_full)   # Known Upper Tail: (i+1)/n
cprob_lower   = np.zeros(n_full)   # Known Lower Tail: i/n
cprob_both    = np.zeros(n_full)   # Known Upper & Lower Tails: i/(n-1)
cprob_unknown = np.zeros(n_full)   # Unknown Tails (Weibull): (i+1)/(n+1)

for i in range(n_full):
    idx = i + 1  # using 1-based indexing for the formulas
    cprob_upper[i]   = idx / n_full
    cprob_lower[i]   = (idx - 1) / n_full
    cprob_both[i]    = (idx - 1) / (n_full - 1)
    cprob_unknown[i] = idx / (n_full + 1)

# Downsample the CDF arrays if not using all samples
if not use_all_samples:
    cprob_upper   = cprob_upper[indices]
    cprob_lower   = cprob_lower[indices]
    cprob_both    = cprob_both[indices]
    cprob_unknown = cprob_unknown[indices]

# -----------------------------
# 4. COMPUTE DIFFERENCES (relative to Known Upper Tail)
# -----------------------------
# These differences are small (order ~1/n) but we can plot them for clarity.
diff_lower   = cprob_upper - cprob_lower
diff_both    = cprob_upper - cprob_both
diff_unknown = cprob_upper - cprob_unknown

# -----------------------------
# 5. PLOT THE RESULTS
# -----------------------------
fig, axs = plt.subplots(2, 1, figsize=(10, 10), sharex=True)

# Top subplot: the four CDF curves
axs[0].plot(porosity_sampled, cprob_upper,   label="Known Upper Tail",       color='blue')
axs[0].plot(porosity_sampled, cprob_lower,   label="Known Lower Tail",       color='green')
axs[0].plot(porosity_sampled, cprob_both,    label="Known Upper & Lower",    color='red')
axs[0].plot(porosity_sampled, cprob_unknown, label="Unknown Tails (Weibull)", color='orange')
axs[0].set_ylabel("Cumulative Probability")
axs[0].set_title("Empirical CDFs with Different Tail Conventions")
axs[0].legend()
axs[0].grid(True)
axs[0].set_xlim([0.05, 0.25])
axs[0].set_ylim([0, 1])

# Bottom subplot: differences relative to the Known Upper Tail
axs[1].plot(porosity_sampled, diff_lower,   label="Upper - Lower",   marker='o', color='green')
axs[1].plot(porosity_sampled, diff_both,    label="Upper - Both",    marker='s', color='red')
axs[1].plot(porosity_sampled, diff_unknown, label="Upper - Unknown", marker='^', color='orange')
axs[1].set_xlabel("Porosity (fraction)")
axs[1].set_ylabel("Difference")
axs[1].set_title("Differences Relative to Known Upper Tail")
axs[1].legend()
axs[1].grid(True)

# Adjust y-axis limits to zoom in on the small differences:
# The maximum difference is roughly ~1/n_full. For n_full=261, that’s about 0.0038.
axs[1].set_ylim([-0.001, 0.005])

plt.tight_layout()
plt.show()


## In conclusion, let's finish with the histograms of all of our features!

In [None]:

# Define min/max values for the variables
permmin, permmax = 0.01, 3000
AImin, AImax = 1000.0, 8000
Fmin, Fmax = 0, 1
pormin, pormax = 0.05, 0.25  # Adjusted based on the observed histogram

# Create figure and subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Facies Histogram
axes[0, 0].hist(df['Facies'].values, bins=20, range=[Fmin, Fmax], 
                color="darkorange", edgecolor="black", alpha=0.8)
axes[0, 0].set_title('Facies Well Data', fontsize=14)
axes[0, 0].set_xlabel('Facies (1-sand, 0-shale)', fontsize=12)
axes[0, 0].set_ylabel('Frequency', fontsize=12)

# Porosity Histogram
axes[0, 1].hist(df['Porosity'].values, bins=20, range=[pormin, pormax], 
                color="darkorange", edgecolor="black", alpha=0.8)
axes[0, 1].set_title('Porosity Well Data', fontsize=14)
axes[0, 1].set_xlabel('Porosity (fraction)', fontsize=12)
axes[0, 1].set_ylabel('Frequency', fontsize=12)

# Permeability Histogram
axes[1, 0].hist(df['Perm'].values, bins=20, range=[permmin, permmax], 
                color="darkorange", edgecolor="black", alpha=0.8)
axes[1, 0].set_title('Permeability Well Data', fontsize=14)
axes[1, 0].set_xlabel('Permeability (mD)', fontsize=12)
axes[1, 0].set_ylabel('Frequency', fontsize=12)

# Acoustic Impedance Histogram
axes[1, 1].hist(df['AI'].values, bins=20, range=[AImin, AImax], 
                color="darkorange", edgecolor="black", alpha=0.8)
axes[1, 1].set_title('Acoustic Impedance Well Data', fontsize=14)
axes[1, 1].set_xlabel('Acoustic Impedance (kg/m2s*10^6)', fontsize=12)
axes[1, 1].set_ylabel('Frequency', fontsize=12)

# Adjust layout for clarity
plt.tight_layout()
plt.show()

## Comment

This was a basic demonstration of calculating univariate statistics and visualizing data distributions. Much more could be done.

# The End