# Error Quantification in Mechanical Testing Data Processing

## Introduction

In mechanical testing, accurate data analysis is crucial for obtaining reliable results. However, various data processing techniques introduce additional errors that can affect the overall accuracy of the analysis. Understanding and quantifying these errors is essential for ensuring that conclusions drawn from the data are robust and meaningful.

This notebook aims to derive and quantify the **theoretical errors** introduced by common data processing techniques used in mechanical testing workflows. While measurement errors are inherent to experimental data, this notebook will focus on the **errors introduced by numerical methods** during data processing. The goal is to track how these errors propagate through processes such as interpolation, unit conversion, regression, integration, and averaging of samples.

By systematically accounting for errors introduced at each step, we can better understand how data processing affects the final results, ultimately improving the accuracy and reliability of our analysis.

### Scope and Organization

This notebook will focus on the following data processing methods, organized by their frequency of use in mechanical testing:

1. **Interpolation Errors**:
   - Introduced when estimating data points between known values (e.g., force-displacement or stress-strain curves).
   - We will derive and quantify errors for linear, polynomial, and spline interpolation methods.

2. **Unit Conversion Errors**:
   - Introduced when converting between units (e.g., inches to millimeters, pounds to Newtons).
   - While Pint ensures exact conversion factors, we will investigate the propagation of errors from the original data through the conversion process.

3. **Regression Errors for Modulus Calculation**:
   - Introduced when fitting a regression model to calculate mechanical properties such as the modulus of elasticity.
   - We will quantify the errors associated with linear regression and determine confidence intervals for the modulus.

4. **Integration Errors for Energy Calculation**:
   - Introduced when calculating energy from stress-strain curves via numerical integration.
   - We will derive the theoretical errors for Simpson’s rule and the trapezoidal rule, and account for error propagation from the input data.

5. **Averaging Samples Together**:
   - Introduced when averaging data from multiple mechanical tests on different samples.
   - We will quantify the error in the mean and account for the propagation of uncertainties from individual datasets.

### Goals

By the end of this notebook, you will:
- Understand how to derive the theoretical error for various data processing methods.
- Quantify the additional errors introduced at each stage of data processing.
- Be able to propagate these errors through different stages of your analysis.
- Gain insights into the overall reliability of the final results in your mechanical testing workflow.

This notebook will use real mechanical testing data to illustrate each concept and provide practical examples of how to manage and minimize errors during analysis. Let's begin with interpolation, one of the most commonly used techniques in mechanical data analysis, and the errors it introduces.

---

### 1. Interpolation Errors

In mechanical testing workflows, interpolation is often used as an intermediate step for several processes, including:
- **Extracting data at specific discrete points** for averaging or comparison.
- **Downsampling** data to reduce the dataset size while preserving key information.
- **Extrapolating** to obtain derivatives or perform further numerical analysis.

Although interpolation provides a useful way to estimate values between known data points, it introduces errors that can propagate through subsequent calculations.

#### Sources and Quantification of Interpolation Errors

Several factors affect interpolation error, and it is important to understand their impact, especially when interpolation is followed by other operations like averaging or derivative calculations.

1. **Data Spacing (h)**: Interpolation error grows with the spacing between data points, scaling as a power of that spacing that depends on the method. If data is sampled at regular points $ x_i $ with a spacing of $ h $, and the underlying function is sufficiently smooth, the error for different interpolation methods can be expressed as:
   - **Linear interpolation**: The error is of order $ O(h^2) $. This means that the error decreases quadratically as the data points become closer.
   - **Polynomial interpolation (degree n)**: The error is of order $ O(h^{n+1}) $, where $ n $ is the degree of the interpolating polynomial.
   - **Spline interpolation (cubic)**: Cubic splines have an error of order $ O(h^4) $, making them more accurate than linear interpolation for smooth functions.

2. **Higher-Order Derivatives**: The error in interpolation also depends on the smoothness of the underlying data. For smooth functions $ f(x) $, interpolation error is controlled by the higher-order derivatives of $ f(x) $. For example, in linear interpolation, the error at any point in an interval is bounded by:
   $$
   |E(x)| \le \frac{h^2}{8} \max_{\xi \in [x_i, x_{i+1}]} |f''(\xi)|,
   $$
   where the maximum is taken over the interval $[x_i, x_{i+1}]$ and $ f'' $ is the second derivative of the underlying function. Higher-order methods replace this with a higher derivative multiplied by a higher power of $ h $, so they help most when the data is smooth and closely spaced.

3. **Error Propagation in Downsampling**: When interpolating to downsample data, the error introduced by interpolation can compound, especially if the interpolated data is used in subsequent averaging or derivative calculations. The cumulative error is often a function of both the interpolation method and the number of data points downsampled.

4. **Extrapolation and Derivatives**: Interpolating to obtain derivatives introduces additional challenges. The error in a derivative estimated from an interpolant is typically one power of $ h $ worse than the error in the function values themselves. For example:
   - Linear interpolation introduces an error of order $ O(h) $ for the first derivative.
   - For cubic splines, the error in the first derivative is $ O(h^3) $.
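
The $ O(h^2) $ behavior of linear interpolation can be checked numerically. The following is a minimal sketch, assuming a smooth test function ($ \sin x $, for which $ \max |f''| = 1 $) and using `np.interp` for piecewise linear interpolation. The helper name `max_linear_interp_error` is hypothetical and not part of the workflow.

```python
import numpy as np


def max_linear_interp_error(n_nodes, func=np.sin, a=0.0, b=np.pi):
    """Return (h, max abs error) of piecewise linear interpolation of func on [a, b]."""
    x_nodes = np.linspace(a, b, n_nodes)
    h = x_nodes[1] - x_nodes[0]
    # Dense grid on which the interpolant is compared with the true function
    x_fine = np.linspace(a, b, 10001)
    y_interp = np.interp(x_fine, x_nodes, func(x_nodes))
    return h, np.max(np.abs(y_interp - func(x_fine)))


h_coarse, err_coarse = max_linear_interp_error(11)
h_fine, err_fine = max_linear_interp_error(21)  # spacing halved

# Theoretical bound (h^2 / 8) * max|f''|, with max|f''| = 1 for sin(x)
bound_coarse = h_coarse**2 / 8
bound_fine = h_fine**2 / 8

print(f"h = {h_coarse:.4f}: error = {err_coarse:.3e}, bound = {bound_coarse:.3e}")
print(f"h = {h_fine:.4f}: error = {err_fine:.3e}, bound = {bound_fine:.3e}")
print(f"error ratio when h is halved: {err_coarse / err_fine:.2f}")
```

Halving the spacing should reduce the maximum error by roughly a factor of four, and both measured errors should stay below the bound.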

#### Error Formulas for Interpolation

To summarize the typical error orders for different interpolation methods:
- **Linear interpolation**: $ E(x) = O(h^2) $
- **Polynomial interpolation** (degree $ n $): $ E(x) = O(h^{n+1}) $
- **Cubic spline interpolation**: $ E(x) = O(h^4) $

In practical applications, these errors propagate into downstream calculations, so it is essential to account for the total error when interpolating and subsequently performing operations such as averaging or differentiation.

---

### 1. Linear Interpolation (Using `np.polyfit` and `np.poly1d`)
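This method fits a linear model to your data using `np.polyfit` and generates a linear polynomial (`linear_model`) to estimate values at the points specified in `custom_array`. Note that a degree-1 `np.polyfit` is a least-squares fit: it passes exactly through the data only when it is given two points at a time. If more than two points are supplied, the result is a regression line rather than an interpolant, and the error behavior described here applies only to the two-point (piecewise) case.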

- **Error Behavior**: As discussed earlier, linear interpolation errors are of order $ O(h^2) $, where $ h $ is the distance between the known data points. The interpolation assumes a straight-line relationship between points, which works well for linear or near-linear data but can introduce significant errors if the underlying relationship is curved.
  
- **Implication**: If the data has significant curvature between points, this method may underestimate or overestimate values between data points, especially for larger gaps between points.
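
How `linear_model` is built from `np.polyfit` is not shown here, so the following is only a sketch under an assumption: that a degree-1 fit over more than two points is being compared with true piecewise linear interpolation (`np.interp`). The data and variable values are illustrative.

```python
import numpy as np

# Curved test data: y = x^2 sampled at 6 points (spacing h = 1)
x_known = np.linspace(0.0, 5.0, 6)
y_known = x_known**2

# (a) Degree-1 least-squares fit over ALL points: a single regression line
linear_model = np.poly1d(np.polyfit(x_known, y_known, 1))

# (b) Piecewise linear interpolation: a separate straight segment per interval
custom_array = np.array([0.5, 2.5, 4.5])  # query points between the known values
y_piecewise = np.interp(custom_array, x_known, y_known)
y_regression = linear_model(custom_array)
y_true = custom_array**2

# The interpolant reproduces the known points; the regression line does not
residual_at_nodes_interp = np.max(np.abs(np.interp(x_known, x_known, y_known) - y_known))
residual_at_nodes_fit = np.max(np.abs(linear_model(x_known) - y_known))

print("piecewise error :", np.abs(y_piecewise - y_true))
print("regression error:", np.abs(y_regression - y_true))
print("max residual at known points (interp, fit):",
      residual_at_nodes_interp, residual_at_nodes_fit)
```

For this data the piecewise error at each interval midpoint is 0.25, which equals $ \frac{h^2}{8} \max |f''| $ with $ h = 1 $ and $ f'' = 2 $. The regression line has no such bound because it is not constrained to pass through the data.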

### 2. Cubic Interpolation (Using `CubicSpline`)
This method uses `CubicSpline` from SciPy, which fits a piecewise cubic polynomial to your data. It can extrapolate beyond the known data points if `extrapolate=True`.

- **Error Behavior**: The error for cubic splines is of order $ O(h^4) $, meaning it decays much faster than linear interpolation for smooth data. However, if the data is noisy or the function is not smooth, the spline still passes through every point and can overshoot and oscillate between them. This resembles, but is distinct from, Runge's phenomenon, which refers to high-degree global polynomial interpolation.

- **Extrapolation Risk**: Cubic splines tend to behave poorly when extrapolating beyond the range of known data, because the last polynomial piece is simply continued and can grow rapidly outside that range. If extrapolation is not required, it is safer to disable it (`extrapolate=False`), so that out-of-range queries return NaN instead of an unreliable value.

### 3. PCHIP Interpolation (Using `PchipInterpolator`)
This method uses the Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) from SciPy, which preserves the monotonicity of the data: on any interval where the data is monotonic the interpolant is monotonic too, so it does not overshoot.

- **Error Behavior**: PCHIP interpolation generally has an error of order $ O(h^2) $ to $ O(h^3) $, depending on the smoothness of the data. It is designed to handle data with sharp transitions or steep gradients, preventing the overshooting and oscillation seen in cubic splines. While it doesn’t provide the smoothness of cubic splines, it is more stable and reliable in cases where the data changes rapidly.

- **Use Case**: PCHIP is particularly useful for preserving the shape of the data, making it less prone to introducing artifacts, especially in the presence of sharp changes or irregular data.
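
The difference in overshoot between cubic splines and PCHIP can be seen on step-like data. This is a minimal sketch on synthetic data (not data from the workflow), using the default settings of SciPy's `CubicSpline` and `PchipInterpolator`.

```python
import numpy as np
from scipy.interpolate import CubicSpline, PchipInterpolator

# Step-like data, e.g. a sudden change between two plateaus
x_known = np.arange(8, dtype=float)
y_known = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

x_fine = np.linspace(x_known[0], x_known[-1], 701)
y_spline = CubicSpline(x_known, y_known)(x_fine)
y_pchip = PchipInterpolator(x_known, y_known)(x_fine)

# Overshoot = how far the interpolant leaves the range of the data, [0, 1]
spline_overshoot = max(y_spline.max() - 1.0, 0.0 - y_spline.min())
pchip_overshoot = max(y_pchip.max() - 1.0, 0.0 - y_pchip.min())

print(f"CubicSpline overshoot: {spline_overshoot:.4f}")
print(f"PCHIP overshoot:       {pchip_overshoot:.4f}")
```

The spline leaves the range of the data on both sides of the step, while PCHIP stays within it. Any quantity derived from the interpolant, such as a peak stress, would inherit the spline's overshoot.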

---

#### Interpolation Methods and Error Growth

The interpolation methods used in this workflow are selected based on the data characteristics and the need for extrapolation or smoothness. Below are the key interpolation methods applied:

1. **Linear Interpolation (`np.polyfit`)**: Fits a straight line between data points. The interpolation error is proportional to the square of the distance between points ($ O(h^2) $). Suitable for near-linear data or cases where preserving simplicity is more important than capturing curvature.

2. **Cubic Interpolation (`CubicSpline`)**: Uses piecewise cubic polynomials to fit the data. The error scales as $ O(h^4) $, meaning it is more accurate for smooth functions but may overfit in the presence of noise or irregular data. Extrapolation using cubic splines is prone to introducing large errors due to oscillations outside the data range.

3. **PCHIP Interpolation (`PchipInterpolator`)**: Preserves the monotonicity of the data between points, preventing overshooting and oscillations. It is ideal for data with sharp transitions or steep gradients. The error ranges from $ O(h^2) $ to $ O(h^3) $, depending on the data's smoothness.

4. **`interp1d` (SciPy)**: Allows flexible interpolation (linear, cubic, etc.). The error depends on the selected method:
   - Linear: $ O(h^2) $
   - Cubic: $ O(h^4) $
   - Extrapolation can introduce significant errors, particularly for higher-order methods.

Understanding the error introduced by each method helps in selecting the right approach for the specific task, especially when averaging, calculating derivatives, or downsampling data.

---

### General Approach to Estimating Interpolation Error

1. **Obtain Spacing $ h $**: First, calculate the average (or maximum) spacing between the data points. This is crucial because the interpolation error depends directly on the spacing between the known data points.

2. **Estimate the Derivative**: For interpolation methods like linear or cubic, the error formula involves higher-order derivatives (e.g., $ f''(x) $ for linear, $ f''''(x) $ for cubic). These derivatives can be estimated numerically from the data.

3. **Apply the Error Formula**: Use the appropriate error formula based on the method:
   - **Linear interpolation**: $ E(x) \approx \frac{h^2}{8} \max |f''(x)| $
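   - **Cubic spline interpolation**: $ E(x) \approx \frac{h^4}{384} \max |f''''(x)| $. The constant $ 1/384 $ is the classical bound for piecewise cubic Hermite interpolation with exact derivatives; for a complete cubic spline the corresponding constant is $ 5/384 $, so this value is best read as an order-of-magnitude estimate.
   - **PCHIP interpolation**: $ E(x) \approx \frac{h^3}{24} \max |f''(x)| $. This is a heuristic modeled on the cubic case, not a rigorous bound. PCHIP is less smooth than a cubic spline (only its first derivative is continuous), and the expression pairs $ h^3 $ with a second derivative, so it is not dimensionally consistent with a true third-order bound. Use it for relative comparison only.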



In [1]:
import numpy as np
import pandas as pd


def estimate_interpolation_error(df, interp_column, target_column, interpolation_method):
    """Estimate the interpolation error for one column of a DataFrame.

    Uses the mean spacing h of `interp_column` and numerically estimated
    derivatives of `target_column`. Supported methods: "linear", "cubic", "pchip".
    """
    x = df[interp_column]
    y = df[target_column]

    # Step 1: Calculate average spacing h between points
    h = np.mean(np.diff(x))

    # Step 2: Estimate the second derivative numerically (needed by every method)
    second_derivative = np.gradient(np.gradient(y, x), x)

    # Step 3: Apply the error formula for the chosen method
    if interpolation_method == "linear":
        # E ~ (h^2 / 8) * max|f''|
        return (h**2 / 8) * np.max(np.abs(second_derivative))

    if interpolation_method == "cubic":
        # E ~ (h^4 / 384) * max|f''''|; the fourth derivative is obtained by
        # differentiating the second derivative twice more
        fourth_derivative = np.gradient(np.gradient(second_derivative, x), x)
        return (h**4 / 384) * np.max(np.abs(fourth_derivative))

    if interpolation_method == "pchip":
        # Heuristic proxy, not a rigorous bound: (h^3 / 24) * max|f''|
        return (h**3 / 24) * np.max(np.abs(second_derivative))

    raise ValueError(f"Unsupported interpolation method: {interpolation_method}")


In [5]:
import numpy as np
import pandas as pd

# Define the test functions
def linear_function(x):
    return 2 * x + 1

def quadratic_function(x):
    return x**2

def sine_function(x):
    return np.sin(x)

# Generate some test data for each function
x_values = np.linspace(0, 10, 20)  # 20 evenly spaced points between 0 and 10
h_step = np.mean(np.diff(x_values))  # Average spacing between points

# Create DataFrames for each test case
df_linear = pd.DataFrame({'x': x_values, 'y': linear_function(x_values)})
df_quadratic = pd.DataFrame({'x': x_values, 'y': quadratic_function(x_values)})
df_sine = pd.DataFrame({'x': x_values, 'y': sine_function(x_values)})

# Function to test the interpolation error estimation
def test_interpolation_error(df, interp_column, target_column, interpolation_method):
    error_estimate = estimate_interpolation_error(df, interp_column, target_column, interpolation_method)
    return error_estimate

# Analytical error for quadratic function (linear interpolation)
# E(x) = (h^2 / 8) * max(f'')

max_second_derivative_linear = 0  # f''(x) for 2x + 1 is 0
max_second_derivative_quadratic = 2  # f''(x) for x^2 is 2

# Test the linear interpolation error estimation on the linear function
linear_error_estimate = test_interpolation_error(df_linear, 'x', 'y', 'linear')
linear_analytical_error = (h_step**2 / 8) * max_second_derivative_linear
linear_error_discrepancy = np.abs(linear_analytical_error - linear_error_estimate)

quadratic_analytical_error = (h_step**2 / 8) * max_second_derivative_quadratic
# Test the linear interpolation error estimation on the quadratic function
quadratic_linear_error_estimate = test_interpolation_error(df_quadratic, 'x', 'y', 'linear')
quadratic_linear_error_discrepancy = np.abs(quadratic_analytical_error - quadratic_linear_error_estimate)

# Analytical error for sine function (cubic interpolation)
# E(x) = (h^4 / 384) * max(f'''')
max_fourth_derivative_sine = 1  # f''''(x) for sin(x) is sin(x), with max of 1
sine_analytical_error = (h_step**4 / 384) * max_fourth_derivative_sine
# Test the cubic interpolation error estimation on the sine function
sine_cubic_error_estimate = test_interpolation_error(df_sine, 'x', 'y', 'cubic')
sine_cubic_error_discrepancy = np.abs(sine_analytical_error - sine_cubic_error_estimate)

# Print the results
print(f"Linear interpolation error estimate: {linear_error_estimate} (analytical: {linear_analytical_error}) which is {linear_error_discrepancy} off")
print(f"Quadratic interpolation error estimate: {quadratic_linear_error_estimate} (analytical: {quadratic_analytical_error}) which is {quadratic_linear_error_discrepancy} off")
print(f"Sine interpolation error estimate: {sine_cubic_error_estimate} (analytical: {sine_analytical_error}) which is {sine_cubic_error_discrepancy} off")


Linear interpolation error estimate: 2.8447542874744303e-16 (analytical: 0.0) which is 2.8447542874744303e-16 off
Quadratic interpolation error estimate: 0.06925207756232794 (analytical: 0.06925207756232683) which is 1.1102230246251565e-15 off
Sine interpolation error estimate: 0.00016583796788432293 (analytical: 0.00019982709361243881) which is 3.398912572811588e-05 off


---

### Interpolation Error Summary

We tested the interpolation error estimation function on three functions with known analytical solutions: a linear function, a quadratic function, and a sine function. The results are summarized below:

1. **Linear Function** $ f(x) = 2x + 1 $ with Linear Interpolation:
   - **Estimated Error**: $ 2.84 \times 10^{-16} $
   - **Analytical Error**: $ 0 $
   - **Discrepancy**: $ 2.84 \times 10^{-16} $

   The error for the linear function is effectively zero, confirming that linear interpolation introduces negligible error for a perfectly linear function.

2. **Quadratic Function** $ f(x) = x^2 $ with Linear Interpolation:
   - **Estimated Error**: $ 0.069252 $
   - **Analytical Error**: $ 0.069252 $
   - **Discrepancy**: $ 1.11 \times 10^{-15} $

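   The error estimate for linear interpolation is highly accurate for quadratic data. The agreement is essentially exact because the second derivative of a quadratic is constant and the central differences used at interior points reproduce it exactly. However, higher-order methods, such as cubic interpolation, could reduce the error further.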

3. **Sine Function** $ f(x) = \sin(x) $ with Cubic Interpolation:
   - **Estimated Error**: $ 0.0001658 $
   - **Analytical Error**: $ 0.0001998 $
   - **Discrepancy**: $ 3.4 \times 10^{-5} $

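   The estimate for cubic interpolation is about 17% below the analytical value, which is close enough for a diagnostic on smooth functions like sine waves. The gap comes from the numerical estimation of the fourth derivative: each central difference on this coarse grid ($ h \approx 0.53 $) attenuates a sinusoid by a factor of roughly $ \sin(h)/h \approx 0.95 $, and four successive differences give about $ 0.83 $.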

### Impact on Uncertainty

In real-world scenarios, where the true function is unknown, these error estimates provide an important diagnostic tool:

1. **Identifying Issues**: If the estimated interpolation error is large, it can signal that the chosen interpolation method does not fit the data well. For instance, if you apply linear interpolation to a dataset with significant curvature, the error estimate will likely be high.

2. **Comparing Interpolation Methods**: By testing multiple interpolation methods (e.g., linear, cubic, PCHIP) and comparing the estimated errors, you can identify which technique introduces the least uncertainty. A smaller error suggests that the method fits the data more closely, even if the true function is unknown.

3. **Actionable Insights**: If the error is consistently high, it might indicate that the data requires a higher-order interpolation method (such as cubic or spline interpolation). Conversely, if the error is small, the simpler method might suffice, minimizing computational complexity.

### Next Steps

- **Set error thresholds**: Establish a threshold for acceptable error. If the error exceeds this threshold, consider switching to a more sophisticated interpolation technique.
  
- **Refine data processing**: Based on the interpolation error estimates, continuously refine the choice of interpolation techniques in your workflow. This will help improve the accuracy of your processed results, especially when performing operations like averaging or differentiation on interpolated data.
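
One way to turn an error threshold into a decision rule is sketched below. It is an illustration under assumptions: the helper names `derivative_based_estimates` and `choose_method` are hypothetical, the thresholds are arbitrary, and the estimates use the same derivative-based formulas as earlier in this section, restricted to the linear and cubic cases.

```python
import numpy as np


def derivative_based_estimates(x, y):
    """Return error estimates for 'linear' and 'cubic' interpolation."""
    h = np.mean(np.diff(x))
    d2 = np.gradient(np.gradient(y, x), x)    # second derivative
    d4 = np.gradient(np.gradient(d2, x), x)   # fourth derivative
    return {
        "linear": (h**2 / 8) * np.max(np.abs(d2)),
        "cubic": (h**4 / 384) * np.max(np.abs(d4)),
    }


def choose_method(estimates, threshold):
    """Pick the simplest method whose estimated error is below the threshold."""
    for method in ("linear", "cubic"):  # ordered from simplest to most complex
        if estimates[method] <= threshold:
            return method
    return None  # no method meets the threshold; denser sampling may be needed


x = np.linspace(0, 10, 20)
estimates = derivative_based_estimates(x, np.sin(x))

loose = choose_method(estimates, threshold=5e-2)
tight = choose_method(estimates, threshold=1e-3)
print(estimates)
print(f"loose threshold -> {loose}, tight threshold -> {tight}")
```

With the loose threshold the simpler linear method is accepted; with the tight one only the cubic estimate qualifies. When no method qualifies, the function returns `None`, which signals that the data spacing rather than the method is the limiting factor.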



---

### 2. Unit Conversion and Its Impact

In this workflow, unit conversion plays a minimal role in affecting the accuracy of the results due to the following reasons:

1. **Internal Standard Units**: All data is processed and stored in a consistent internal unit system (e.g., SI units). This eliminates the possibility of errors due to mixed units during calculations.

2. **Localized Conversions**: Unit conversions only occur at two points:
   - When a user inputs data in non-standard units, it is converted to the internal unit system for processing.
   - When data is converted to user-requested units for plotting or visualization, which does not impact the processed data or final results.

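3. **Precision with `Pint`**: The `Pint` library is used for all unit conversions, so conversion factors come from its unit definitions rather than hand-entered constants. The arithmetic is still performed in floating point, so each conversion carries a relative rounding error on the order of machine epsilon (about $ 10^{-16} $ in double precision), which is far below typical measurement uncertainty.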

### Negligible Impact on Uncertainty

Since conversions are isolated to input and output stages, and calculations are performed in standardized units, the errors introduced by unit conversions are negligible. Additionally, any rounding errors are localized and unlikely to affect the overall accuracy of the workflow. Therefore, unit conversion has no meaningful impact on the uncertainty of the final results.
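
The size of the rounding error can be illustrated without Pint. The sketch below assumes plain floating-point multiplication by the exact definitions 1 in = 25.4 mm and 1 lbf = 4.4482216152605 N. It does not use or reproduce Pint's API, and the data is synthetic.

```python
import numpy as np

MM_PER_INCH = 25.4             # exact by definition
N_PER_LBF = 4.4482216152605    # exact by definition

rng = np.random.default_rng(0)
displacement_in = rng.uniform(0.001, 10.0, size=100_000)  # synthetic displacements, inches
force_lbf = rng.uniform(0.1, 5000.0, size=100_000)        # synthetic forces, pounds-force

# Convert to internal units and back again
displacement_roundtrip = (displacement_in * MM_PER_INCH) / MM_PER_INCH
force_roundtrip = (force_lbf * N_PER_LBF) / N_PER_LBF

rel_err_disp = np.max(np.abs(displacement_roundtrip - displacement_in) / displacement_in)
rel_err_force = np.max(np.abs(force_roundtrip - force_lbf) / force_lbf)

print(f"max relative round-trip error (in -> mm -> in):  {rel_err_disp:.2e}")
print(f"max relative round-trip error (lbf -> N -> lbf): {rel_err_force:.2e}")
print(f"machine epsilon: {np.finfo(float).eps:.2e}")
```

The round-trip error stays at the level of machine epsilon. Because a conversion is a multiplication by a constant, it also leaves the relative uncertainty of the original measurement unchanged.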

---

### 3. Regression Errors for Modulus Calculation

In mechanical testing, regression analysis is commonly used to calculate material properties such as the modulus of elasticity. The most common method is linear regression, which fits a straight line to the stress-strain curve to determine the slope (modulus).

Because the modulus in this workflow is always obtained by **linear fitting** (e.g., elastic modulus from stress-strain data), this section focuses on the common sources of error in that fit, especially those related to data selection and fitting.

### Key Sources of Regression Error

1. **Data Selection Error (Fitting Range)**:
   - The range of data points selected for fitting can have a significant impact on the accuracy of the linear regression.
   - If the chosen range does not represent the linear portion of the data (e.g., the elastic region in stress-strain curves), the modulus calculation can be inaccurate.
   - **Error Impact**: Poor range selection introduces systematic error, leading to incorrect results.

2. **Measurement Noise**:
   - Real-world data often contains noise from measurement instruments, sample preparation, or testing conditions. Noise introduces uncertainty into the regression model by making it harder to find the true linear relationship.
   - **Error Impact**: Noise leads to a less accurate slope (modulus) estimate and increases the uncertainty in the regression coefficients.

3. **Outliers and Inconsistent Data**:
   - Outliers or inconsistent data points can skew the results of the linear fit, especially if they are not properly handled.
   - **Error Impact**: Outliers can disproportionately influence the slope, leading to an inaccurate modulus.
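
The systematic effect of the fitting range can be illustrated with a noise-free synthetic curve. This is a sketch under stated assumptions: a bilinear stress-strain curve with an illustrative modulus of 200,000 MPa and yield strain of 0.005, and a hypothetical helper `fit_modulus`. None of these values come from the test data.

```python
import numpy as np

E_TRUE = 200_000.0     # assumed modulus, MPa (illustrative)
YIELD_STRAIN = 0.005   # assumed end of the elastic region (illustrative)

strain = np.linspace(0.0, 0.02, 201)
# Bilinear curve: slope E_TRUE up to yield, then 10% of E_TRUE afterwards
stress = np.where(
    strain <= YIELD_STRAIN,
    E_TRUE * strain,
    E_TRUE * YIELD_STRAIN + 0.1 * E_TRUE * (strain - YIELD_STRAIN),
)


def fit_modulus(strain, stress, strain_max):
    """Least-squares slope of stress vs. strain using points with strain <= strain_max."""
    mask = strain <= strain_max
    slope, _intercept = np.polyfit(strain[mask], stress[mask], 1)
    return slope


modulus_elastic = fit_modulus(strain, stress, strain_max=0.004)   # inside elastic region
modulus_too_wide = fit_modulus(strain, stress, strain_max=0.010)  # includes yielding

print(f"elastic-only fit: {modulus_elastic:.0f} MPa")
print(f"too-wide fit:     {modulus_too_wide:.0f} MPa "
      f"({100 * (modulus_too_wide - E_TRUE) / E_TRUE:+.1f}% vs. true)")
```

Even without noise, extending the range past yield biases the modulus low. This bias is systematic, so it is not reflected in the random-error metrics alone and has to be controlled through range selection.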

### Quantifying Regression Error: Metrics for Linear Fitting

For linear regression, several key metrics can be used to quantify the error and uncertainty in the fit:

1. **Standard Error of the Slope**:
   - **Purpose**: The standard error (SE) quantifies the **uncertainty** in the slope (modulus) estimate. It tells you how much the calculated slope may vary due to random errors in the data.
   - **Formula**: 
     $$
     \text{SE} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \times \frac{1}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
     $$
   - **Meaning**: A smaller SE indicates a more reliable estimate of the modulus, while a larger SE suggests higher uncertainty due to noise or poor data quality.
   - **Impact**: SE represents the uncertainty in the modulus. You can use this value to communicate how confident you are in the calculated modulus. A large SE suggests the results may be unreliable, and you may need to refine the range or data quality.

2. **Coefficient of Determination ($R^2$)**:
   - **Purpose**: The $ R^2 $ value measures the **goodness-of-fit** of the linear regression model. It ranges from 0 to 1, with values closer to 1 indicating a better fit.
   - **Formula**: 
     $$
     R^2 = 1 - \frac{\sum (y_i - \hat{y_i})^2}{\sum (y_i - \bar{y})^2}
     $$
   - **Meaning**: A high $ R^2 $ value means the data fits well to a line, indicating that the selected range is appropriate for calculating the modulus. A low $ R^2 $ suggests that the data may not be linear or that the chosen range is poor.
   - **Impact**: If $ R^2 $ is too low, it might indicate that the user should choose a new range. While $ R^2 $ doesn’t directly quantify uncertainty in the modulus, it helps assess whether the data is appropriate for linear regression. In practice, determine a threshold for acceptable $ R^2 $ and re-select the range when the fit falls below it.

3. **Confidence Intervals**:
   - **Purpose**: Confidence intervals can be calculated for the slope to provide a range within which the true modulus is likely to lie. A narrower confidence interval indicates higher confidence in the estimate.
   - **Formula** for the 95% confidence interval of the slope ($ \alpha = 0.05 $):
     $$
     \hat{\beta} \pm t_{\alpha/2, n-2} \cdot \text{SE}
     $$
     where $ \hat{\beta} $ is the fitted slope and $ t_{\alpha/2, n-2} $ is the critical value of the t-distribution with $ n-2 $ degrees of freedom at the desired confidence level.
   - **Impact**: Confidence intervals complement SE by providing an explicit range for the slope estimate, though SE alone is often sufficient to represent uncertainty in the modulus.
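
The three metrics above can be computed directly from their formulas and cross-checked against `scipy.stats.linregress`. This is a minimal sketch on synthetic data (true slope 3, Gaussian noise with a fixed seed); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + rng.normal(0.0, 0.5, size=x.shape)  # synthetic data, true slope 3
n = x.size

# Ordinary least-squares slope and intercept
sxx = np.sum((x - x.mean()) ** 2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
intercept = y.mean() - slope * x.mean()
residuals = y - (slope * x + intercept)

# 1. Standard error of the slope
se_slope = np.sqrt(np.sum(residuals**2) / (n - 2)) / np.sqrt(sxx)

# 2. Coefficient of determination
r_squared = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

# 3. 95% confidence interval for the slope (two-sided, n - 2 degrees of freedom)
t_crit = stats.t.ppf(1.0 - 0.05 / 2.0, df=n - 2)
ci_low, ci_high = slope - t_crit * se_slope, slope + t_crit * se_slope

# Cross-check against SciPy's built-in linear regression
reference = stats.linregress(x, y)

print(f"slope = {slope:.4f} +/- {se_slope:.4f} (SE), R^2 = {r_squared:.5f}")
print(f"95% CI for slope: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"linregress stderr: {reference.stderr:.4f}")
```

The hand-computed standard error should match the `stderr` reported by `linregress`, which confirms that the SE formula above is the quantity SciPy returns for the slope.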



In [None]:
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Example function to perform linear regression and quantify error
def estimate_regression_error(x, y):
    # Perform linear regression
    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    
    # Calculate R^2
    r_squared = r_value**2

    # Return regression metrics
    return {
        'slope': slope,
        'intercept': intercept,
        'R_squared': r_squared,
        'standard_error': std_err
    }

# Example usage with test data
rng = np.random.default_rng(42)  # Fixed seed so the example is reproducible
x_data = np.linspace(0, 10, 50)  # Example x-values
y_data = 3 * x_data + rng.normal(0, 0.5, size=x_data.shape)  # Linear y-values (true slope 3) with noise

regression_results = estimate_regression_error(x_data, y_data)

# Print results
print(f"Slope (modulus): {regression_results['slope']}")
print(f"Intercept: {regression_results['intercept']}")
print(f"R-squared: {regression_results['R_squared']}")
print(f"Standard Error of Slope: {regression_results['standard_error']}")
