Jacob Singer

Pearson Correlation Coefficient


The Pearson Correlation Coefficient (PCC) is a fundamental statistical measure that evaluates the strength and direction of the linear relationship between two continuous variables [1]. Developed by the British mathematician Karl Pearson in the late 19th century, this coefficient has become a cornerstone in statistical analyses across a diverse range of fields, including finance, biology, psychology, and social sciences. It serves to quantify the degree to which two variables are interrelated, providing valuable insights into their relationship [2].


Mathematical Definition: The Pearson correlation coefficient, denoted as (r), is computed using the following formula: 

![Figure #1 PCC.png.gif.jpg](<attachment:Figure #1 PCC.png.gif.jpg>)

Where:
•	n is the number of data points.
•	x and y are the individual data points for the two variables. 
•	∑ denotes the summation across all data points.
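
The formula image is not reproduced as text in this write-up. Judging from the symbols defined above (n, x, y, and ∑), it is presumably the raw-sum form of r, which can be written as:

$$
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
$$

An algebraically equivalent form uses deviations from the sample means $\bar{x}$ and $\bar{y}$:

$$
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^{2}\,\sum (y - \bar{y})^{2}}}
$$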

This formula assesses how changes in one variable are associated with changes in another. The resulting coefficient ranges from -1 to +1: the magnitude reflects the strength of the linear relationship, and the sign indicates its direction.

Interpretation of (r) Values: 
•	Perfect Positive Correlation (r = +1): This value indicates a perfect linear relationship in which an increase in one variable is always accompanied by a proportional increase in the other. For instance, if one were to assess the relationship between the weight of an object and the force required to lift it at a constant acceleration, a perfect correlation would be expected.
•	No Correlation (r ≈ 0): A value close to zero suggests the absence of a linear relationship between the two variables. Changes in one variable do not linearly predict changes in the other, although a nonlinear relationship may still exist.
•	Perfect Negative Correlation (r = -1): This value indicates a perfect inverse linear relationship, meaning that as one variable increases, the other decreases along a straight line. An example is the distance remaining on a fixed route, which falls linearly with the distance already covered. By contrast, the speed of a vehicle and the time it takes to travel a fixed distance are strongly negatively related, but because time varies as 1/speed the relationship is curved, so (r) is negative without reaching exactly -1.

Values that lie between these extremes indicate varying degrees of correlation; the closer (r) is to either -1 or +1, the stronger the linear relationship.
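
The interpretation above can be tried out with a minimal sketch that evaluates the raw-sum formula for r using only the Python standard library. The helper name `pearson_r` and all of the numbers are illustrative assumptions, not measurements. The speed and time pair is included to show that an inverse relationship of the form t = d/v gives a strongly negative r that falls short of -1.

```python
import math


def pearson_r(x, y):
    """Pearson correlation via the raw-sum formula (hypothetical helper)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Illustrative values only (assumed, not measured).
mass_kg = [1.0, 2.0, 3.0, 4.0, 5.0]

# F = m * (g + a) with a = 1 m/s^2 assumed: exactly linear, positive slope.
force_n = [m * (9.81 + 1.0) for m in mass_kg]
r_up = pearson_r(mass_kg, force_n)

# Exactly linear with a negative slope.
y_down = [10.0 - 2.0 * m for m in mass_kg]
r_down = pearson_r(mass_kg, y_down)

# t = d / v for an assumed d = 100 km: monotone decreasing but curved.
speed_kmh = [20.0, 40.0, 60.0, 80.0, 100.0]
time_h = [100.0 / v for v in speed_kmh]
r_curve = pearson_r(speed_kmh, time_h)

print(f"linear, positive slope: r = {r_up:.2f}")
print(f"linear, negative slope: r = {r_down:.2f}")
print(f"time vs speed (curved): r = {r_curve:.2f}")  # roughly -0.90
```

The two exactly linear cases return +1 and -1 up to floating-point rounding, while the curved case stays strictly between -1 and 0.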

Figure Title: Scatter Plot Depicting the Pearson Correlation Between Functional Error (em) and Measured Error (ef) of Current Transformers Across Various Load Percentages

![image.png](attachment:image.png)


Figure Legend: This scatter plot displays the functional error (em) on the x-axis and the measured error (ef) on the y-axis of Current Transformers (CT) at different load percentages. Each point represents a specific measurement pair at a given load percentage. A regression line is fitted to the data points, and the Pearson correlation coefficient (r) is provided to quantify the strength and direction of the linear relationship between em and ef [3].

Summary: In this analysis, the Pearson correlation coefficient was utilized to assess the linear relationship between the functional error (em) and the measured error (ef) of Current Transformers (CT) across varying load percentages [3]. By plotting em against ef and calculating the correlation coefficient, we can determine the degree to which these errors are linearly related. A correlation coefficient close to +1 or -1 indicates a strong linear relationship, while a value near 0 suggests a weak or no linear relationship. This statistical measure aids in understanding the consistency between the functional and measured errors, providing insights into the performance and reliability of the CT measurements under different load conditions [3].

When to Use the Pearson Correlation Coefficient: The application of the Pearson correlation coefficient is most appropriate when analyzing linear relationships between two continuous variables, provided that the data meet the assumptions outlined below [2]:

1. Linearity: The relationship between the variables must be linear. This means that a change in one variable should result in a proportional change in the other, ideally depicted by a straight line when graphed. 
2. Homoscedasticity: The variability of one variable should remain consistent at all levels of the other variable. This characteristic, known as homoscedasticity, ensures that the spread of the data points does not change with the values of either variable. 
3. Normality: Both variables should be approximately normally distributed. This matters mainly for statistical inference (the p-value and confidence intervals) rather than for computing (r) itself, and it can often be evaluated using graphical methods such as histograms or Q-Q plots.
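
As a rough complement to plots, the assumptions above can be screened numerically. The sketch below uses only the Python standard library; the helper names (`fit_line`, `residual_spread_ratio`, `sample_skewness`) and the sample data are hypothetical, and the checks are informal heuristics rather than formal tests. A formal normality test such as Shapiro-Wilk (`scipy.stats.shapiro`) would be the more usual choice in practice.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_square(values):
    """Average of the squared values."""
    return sum(v * v for v in values) / len(values)


def residual_spread_ratio(x, y):
    """Residual spread in the upper half of x divided by the lower half.

    Values far from 1 hint at heteroscedasticity (heuristic only).
    """
    slope, intercept = fit_line(x, y)
    pairs = sorted(zip(x, y))  # order the points by x
    residuals = [yi - (slope * xi + intercept) for xi, yi in pairs]
    half = len(residuals) // 2
    return mean_square(residuals[-half:]) / mean_square(residuals[:half])


def sample_skewness(values):
    """Moment-based skewness; values near 0 are consistent with symmetry."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up data whose scatter grows with x (assumed for illustration).
x = list(range(1, 11))
y = [2 * xi + (0.1 * xi if xi % 2 else -0.1 * xi) for xi in x]

print(f"residual spread ratio: {residual_spread_ratio(x, y):.2f}")
print(f"skewness of x: {sample_skewness(x):.2f}")
```

A spread ratio well above 1, as in this made-up data, suggests that the scatter widens as x grows, in which case a plot of the residuals is worth inspecting before relying on r or its p-value.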

When these assumptions hold, the Pearson correlation coefficient is a useful tool for uncovering linear relationships between variables in a wide range of research contexts. Because it can be strongly influenced by outliers, it is best read alongside a scatter plot of the data.

Example: One situation where PCC is useful is testing the relationship between age and blood pressure, a common application in medical research. Cardiovascular scientists use this technique to quantify how systolic and diastolic blood pressure measurements change across the human lifespan. When researchers collect blood pressure readings alongside demographic data from diverse population samples, the Pearson correlation provides a mathematical framework for evaluating the strength and direction of these associations. A typical analysis might reveal a moderate positive correlation (r ≈ 0.45-0.65) between advancing age and systolic blood pressure, which is consistent with arterial stiffening progressing in a roughly linear fashion throughout adulthood, although a correlation alone does not establish the mechanism. The statistical significance (p-value) of this correlation helps researchers distinguish genuine physiological relationships from random variation, which is essential when developing evidence-based guidelines for hypertension screening and prevention.

Beyond identifying correlations, these analyses often serve as foundational steps toward more sophisticated cardiovascular risk models that incorporate multiple variables such as genetic predisposition, environmental factors, and socioeconomic determinants. Because the Pearson correlation coefficient is unitless and bounded, it allows meaningful comparisons across studies, demographic groups, and geographic regions, contributing to our understanding of how aging affects cardiovascular health at both the individual and population levels.
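
The p-value in such an analysis usually comes from a t-test on r. As a sketch, assuming the null hypothesis of zero population correlation and approximately bivariate normal data, the test statistic for a sample of n pairs is:

$$
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad \text{df} = n - 2
$$

The two-tailed p-value is the probability that a t-distributed variable with n - 2 degrees of freedom exceeds |t| in absolute value. For the same r, a larger sample gives a larger |t| and a smaller p-value, so statistical significance should be read alongside the size of r rather than in place of it.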

AI References: ChatGPT prompt: Help me find resources and relevant information on the Pearson Correlation Coefficient. Give me bullet points and discussion points. Claude prompts: Help me come up with a simple Python function that performs the statistical test and returns the p-value. Help me come up with an example where PCC can be used in medical research.

References: 

1.	Vos, P., Pearson's correlation between three variables; using students’ basic knowledge of geometry for an exercise in mathematical statistics. International Journal of Mathematical Education in Science and Technology, 2009. 40(4): p. 533-541.
2.	Obilor, E.I. and E.C. Amadi, Test for significance of Pearson’s correlation coefficient. International Journal of Innovative Mathematics, Statistics & Energy Policies, 2018. 6(1): p. 11-23.
3.	Burgund, D., et al., Pearson correlation in determination of quality of current transformers. Sensors, 2023. 23(5): p. 2704.




In [1]:
def pearson_correlation_test(x, y):
    '''
    Perform a Pearson correlation test between two arrays of continuous variables.
    
    Parameters:
    x : array_like
        First variable to correlate, must be continuous data.
    y : array_like
        Second variable to correlate, must be continuous data.
        
    Returns:
    r : float
        Pearson correlation coefficient between -1 and 1.
    p_value : float
        Two-tailed p-value for testing non-correlation.
    
    Example:
    >>> x = [1, 2, 3, 4, 5]
    >>> y = [2, 3, 5, 6, 8]
    >>> r, p_value = pearson_correlation_test(x, y)
    >>> print(f"{r:.4f}")
    0.9934
    >>> print(f"{p_value:.6f}")
    0.000643
    
    Notes:
    - The Pearson correlation measures linear relationships only.
    - Values close to 1 indicate strong positive correlation.
    - Values close to -1 indicate strong negative correlation.
    - Values close to 0 indicate little to no linear correlation.
    - Both variables should follow approximately normal distributions for the p-value to be valid.
    - Requires at least 3 paired data points to compute.
    '''
    import scipy.stats as stats
    import numpy as np
    
    # Convert inputs to numpy arrays
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    
    # Check for valid inputs
    if len(x) != len(y):
        raise ValueError("Input arrays must have the same length")
    
    if len(x) < 3:
        raise ValueError("Need at least 3 data points to compute correlation")
    
    # Calculate Pearson correlation and p-value
    r, p_value = stats.pearsonr(x, y)
    
    return r, p_value

In [2]:
# The function is defined in the cell above, so no import is needed

# Create sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 6, 8]

# Call the function
r, p_value = pearson_correlation_test(x, y)

# Print the results
print(f"Pearson correlation coefficient (r): {r:.4f}")
print(f"p-value: {p_value:.6f}")

Pearson correlation coefficient (r): 0.9934
p-value: 0.000643
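
As a hand check on the printed coefficient, the same sample data can be worked through the deviation form of the formula. With x = (1, 2, 3, 4, 5) and y = (2, 3, 5, 6, 8), the means are $\bar{x} = 3$ and $\bar{y} = 4.8$, which gives:

$$
\sum (x - \bar{x})(y - \bar{y}) = 15, \qquad \sum (x - \bar{x})^{2} = 10, \qquad \sum (y - \bar{y})^{2} = 22.8
$$

$$
r = \frac{15}{\sqrt{10 \times 22.8}} = \frac{15}{\sqrt{228}} \approx 0.9934
$$

The corresponding test statistic is $t = \sqrt{r^{2}(n-2)/(1-r^{2})} = \sqrt{225} = 15$ on 3 degrees of freedom, which is consistent with the very small p-value printed above.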
