<h1 style='font-size: 35px; color: green; font-family: Dubai; font-weight: 600'>Effect Size Analysis: Quantifying Relationships and Differences in Data</h1>

Effect size is a statistical measure that quantifies the strength or magnitude of a relationship or difference between variables, providing a standardized way to interpret results. Unlike **p-values**, which only indicate statistical significance, effect size reveals the practical importance of findings, making it an essential tool in research. Common measures include `Cohen's d` for comparing group means, `Pearson’s r` for assessing correlations, and `eta-squared (η²)` or `partial eta-squared (ηp²)` for variance explained in ANOVA. Effect size helps researchers understand the real-world implications of their results, guiding better decision-making and interpretation.
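To make the contrast with p-values concrete, the sketch below holds a tiny standardized difference fixed and varies only the sample size. It assumes two equal-sized groups with unit variance and uses a normal (z) approximation. The function name `two_sample_p_value` and all numbers are illustrative and do not come from the dataset used later.

```python
import math

def two_sample_p_value(d, n_per_group):
    """Two-sided p-value for a standardized mean difference d.

    Assumption: two groups of equal size with unit variance, normal approximation.
    """
    z = d * math.sqrt(n_per_group / 2)       # difference divided by its standard error
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

d = 0.05  # a very small effect, held constant
for n in (50, 500, 5000, 50000):
    print(f"n per group = {n:>6}, d = {d}, p = {two_sample_p_value(d, n):.4g}")
```

The effect size stays at 0.05 throughout, yet the p-value shrinks toward zero as the groups grow, which is why significance alone says little about practical importance.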

<h2 style='font-size: 25px; color: crimson; font-family: Dubai; font-weight: 600'>Import Required Libraries</h2>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import squarify 
import seaborn as sns
print('Libraries loaded Successfully')

Libraries loaded Successfully


<h2 style='font-size: 25px; color: crimson; font-family: Dubai; font-weight: 600'>Loading Dataset</h2>

In [25]:
df = pd.read_csv('Datasets/Hypothesis 101.csv')
df.head()

Unnamed: 0,Fertilizer,Yield (tones/ha),Days to Maturity,Biomass,Dry matter,Irrigation
0,C,38.87497,6.26185,2.525738,75.436679,Furrow Irrigation
1,A,29.921425,7.194156,2.594134,51.828069,Sprinkler Irrigation
2,C,41.152436,7.329974,3.76043,49.058974,Drip Irrigation
3,C,42.161544,7.137822,3.340263,46.778604,Furrow Irrigation
4,A,36.715841,6.61532,3.701663,57.817993,Furrow Irrigation


<h2 style='font-size: 25px; color: crimson; font-family: Dubai; font-weight: 600'>Renaming the columns</h2>

In [26]:
df = df.rename(columns={'Yield (tones/ha)': 'Yield'})


<h2 style='font-size: 35px; color: black; font-family: Dubai; font-weight: 600'>Cohen's d</h2>

Cohen's d expresses the difference between two group means in units of standard deviation. It is typically reported alongside t-tests or other comparisons of two independent samples. By Cohen's conventional benchmarks, |d| around 0.2 is a small effect, around 0.5 a medium effect, and around 0.8 a large effect.
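The `cohens_d` function in the next cell pools the two standard deviations with an unweighted average of the variances, which matches the usual pooled standard deviation only when both groups have the same size. As a minimal sketch for unequal group sizes, the hypothetical helper `cohens_d_pooled` below weights each variance by its degrees of freedom. The two small groups are made-up numbers for illustration.

```python
import numpy as np

def cohens_d_pooled(group1, group2):
    """Cohen's d using the sample-size-weighted pooled standard deviation."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    var1 = g1.var(ddof=1)  # sample variance of group 1
    var2 = g2.var(ddof=1)  # sample variance of group 2
    # Weight each variance by its degrees of freedom
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Hypothetical groups of unequal size
small_group = [5, 7, 9]
large_group = [1, 2, 3, 4, 5]

d_weighted = cohens_d_pooled(small_group, large_group)
print(f"Weighted pooled Cohen's d: {d_weighted:.3f}")
```

With equal group sizes the weighted and unweighted versions agree, so the distinction only matters when the groups being compared differ in size.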

In [27]:
import numpy as np
import pandas as pd

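# Function to calculate Cohen's d for independent samples.
# Note: the pooled SD below is the root of the unweighted mean of the two
# sample variances, which equals the usual pooled SD only for equal group sizes.
def cohens_d(group1, group2):
    mean1 = np.mean(group1)
    mean2 = np.mean(group2)
    std1 = np.std(group1, ddof=1)
    std2 = np.std(group2, ddof=1)
    pooled_std = np.sqrt(((std1 ** 2) + (std2 ** 2)) / 2)
    return (mean1 - mean2) / pooled_std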

# Function to interpret Cohen's d using Cohen's conventional benchmarks
# (0.2 = small, 0.5 = medium, 0.8 = large)
def interpret_cohens_d(d_value):
    magnitude = abs(d_value)
    if magnitude < 0.2:
        interpretation = "Negligible effect size"
        explanation = "The difference between the groups is below the conventional threshold for a small effect."
    elif magnitude < 0.5:
        interpretation = "Small effect size"
        explanation = "There is a small difference between the groups."
    elif magnitude < 0.8:
        interpretation = "Medium effect size"
        explanation = "The difference between the groups is moderate, with noticeable effects."
    else:
        interpretation = "Large effect size"
        explanation = "There is a large difference between the groups, indicating a strong effect."
    return interpretation, explanation

# Function to calculate and interpret Cohen's d for all numerical columns
def calculate_effect_sizes(df, group_column, numerical_columns):
    effect_size_results = []

    for column in numerical_columns:
        group_values = df[group_column].unique()  # Get unique groups in the 'group_column'
        for i in range(len(group_values)):
            for j in range(i + 1, len(group_values)):  # Ensure each pair is unique
                group1 = df[df[group_column] == group_values[i]][column]
                group2 = df[df[group_column] == group_values[j]][column]

                # Calculate Cohen's d for this pair
                d_value = cohens_d(group1, group2)
                interpretation, explanation = interpret_cohens_d(d_value)

                # Store results in a list
                effect_size_results.append({
                    'Column': column,
                    'Pair': f'{group_values[i]} vs {group_values[j]}',
                    'Cohen\'s d': d_value,
                    'Interpretation': interpretation,
                    'Explanation': explanation
                })

    # Create a DataFrame from the results
    effect_size_df = pd.DataFrame(effect_size_results)
    return effect_size_df

# Example usage with your DataFrame
numerical_columns = df.select_dtypes(include=["float64", "int64"]).columns
effect_sizes_df = calculate_effect_sizes(df, group_column="Fertilizer", numerical_columns=numerical_columns)

# Display the results
pd.set_option('display.max_colwidth', 120)  # Increase max column width to avoid truncation
effect_sizes_df


Unnamed: 0,Column,Pair,Cohen's d,Interpretation,Explanation
0,Yield,C vs A,2.257827,Large effect size,"There is a large difference between the groups, indicating a strong effect."
1,Yield,C vs B,0.945697,Large effect size,"There is a large difference between the groups, indicating a strong effect."
2,Yield,A vs B,-1.146051,Large effect size,"There is a large difference between the groups, indicating a strong effect."
3,Days to Maturity,C vs A,0.111385,Negligible effect size,The difference between the groups is below the conventional threshold for a small effect.
4,Days to Maturity,C vs B,0.196781,Negligible effect size,The difference between the groups is below the conventional threshold for a small effect.
5,Days to Maturity,A vs B,0.080842,Negligible effect size,The difference between the groups is below the conventional threshold for a small effect.
6,Biomass,C vs A,0.054233,Negligible effect size,The difference between the groups is below the conventional threshold for a small effect.
7,Biomass,C vs B,0.15536,Negligible effect size,The difference between the groups is below the conventional threshold for a small effect.
8,Biomass,A vs B,0.101509,Negligible effect size,The difference between the groups is below the conventional threshold for a small effect.
9,Dry matter,C vs A,-0.082927,Negligible effect size,The difference between the groups is below the conventional threshold for a small effect.


<h2 style='font-size: 35px; color: Green; font-family: Dubai; font-weight: 600'>Eta-squared (η²)</h2>

**Eta-squared (η²)** is an effect size measure used with **ANOVA**. It quantifies the proportion of the total variance in the dependent variable that is attributable to a specific independent variable (or factor), so it can be read as the percentage of variability that the factor explains. Values range from 0 to 1.
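The definition can be checked on a toy example before turning to the real data. The sketch below computes η² directly from group data as the between-group sum of squares divided by the total sum of squares. The helper `eta_squared_from_groups` and the three small groups are hypothetical and used only for illustration.

```python
import numpy as np

def eta_squared_from_groups(groups):
    """Eta-squared = SS_between / SS_total for a one-way layout."""
    arrays = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(arrays)
    grand_mean = all_values.mean()
    # Between-group SS: squared distance of each group mean from the grand mean,
    # weighted by the group size
    ss_between = sum(len(a) * (a.mean() - grand_mean) ** 2 for a in arrays)
    # Total SS: squared distance of every observation from the grand mean
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Hypothetical data: three groups of three observations each
toy_groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
eta_sq_toy = eta_squared_from_groups(toy_groups)  # SS_between = 42, SS_total = 48
print(f"eta-squared = {eta_sq_toy:.3f}")
```

Here the group means (2, 3, and 7) are far apart relative to the spread within each group, so most of the total variance is between groups.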


In [28]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Function to calculate Eta-squared from ANOVA results
def eta_squared(aov_table):
    # Use positional access; the first row of a one-way table is the factor
    ss_between = aov_table["sum_sq"].iloc[0]  # Between-group sum of squares
    ss_total = aov_table["sum_sq"].sum()  # Total sum of squares
    eta_sq = ss_between / ss_total
    return eta_sq

# Example ANOVA function for a DataFrame
def perform_anova(df, dependent_var, independent_var):
    model = ols(f'{dependent_var} ~ C({independent_var})', data=df).fit()
    aov_table = sm.stats.anova_lm(model, typ=2)
    eta_sq = eta_squared(aov_table)
    return aov_table, eta_sq

# Apply ANOVA on Yield vs. Fertilizer type
aov_results, eta_squared_value = perform_anova(df, 'Yield', 'Fertilizer')

# Prepare a DataFrame to display results
anova_df = pd.DataFrame({
    'Source': aov_results.index,
    'Sum of Squares': aov_results['sum_sq'],
    'Degrees of Freedom': aov_results['df'],
    'F-statistic': aov_results['F'],
    'P-value': aov_results['PR(>F)'],
    'Eta-squared (η²)': [eta_squared_value if index == 'C(Fertilizer)' else np.nan for index in aov_results.index]
})

anova_df


Unnamed: 0,Source,Sum of Squares,Degrees of Freedom,F-statistic,P-value,Eta-squared (η²)
C(Fertilizer),C(Fertilizer),6109.246208,2.0,118.60653,1.3763519999999998e-38,0.444042
Residual,Residual,7649.014472,297.0,,,
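The η² value in the ANOVA table can be reproduced by hand from the two sums of squares it reports. The short check below copies those two numbers from the table; the variable names are chosen here for illustration.

```python
# Sums of squares copied from the ANOVA table above
ss_fertilizer = 6109.246208
ss_residual = 7649.014472

# In a one-way ANOVA the total is the factor SS plus the residual SS
ss_total = ss_fertilizer + ss_residual
eta_sq = ss_fertilizer / ss_total

print(f"eta-squared = {eta_sq:.6f}")
```

In other words, roughly 44% of the variance in `Yield` is associated with the fertilizer type in this dataset.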



<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>Interpretation of Eta-squared (η²):</span>

- **Small Effect Size (η² < 0.06):** The independent variable explains a small proportion of the variance in the dependent variable. Cohen's benchmark for a small effect is about 0.01; values well below that indicate little to no meaningful impact.
- **Medium Effect Size (0.06 ≤ η² < 0.14):** The independent variable explains a moderate proportion of the variance. There is a noticeable effect, but it may not be very strong.
- **Large Effect Size (η² ≥ 0.14):** The independent variable explains a large proportion of the variance in the dependent variable. The effect is strong, and the independent variable has a substantial impact on the dependent variable.


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>Example Interpretation</span>

- **η² = 0.10:** This means that 10% of the variance in the dependent variable is explained by the independent variable. This is considered a **medium** effect size, suggesting a moderate impact.
- **η² = 0.25:** This means that 25% of the variance is explained, which is a **large** effect size, indicating a strong relationship between the independent variable and the dependent variable.
- **η² = 0.02:** This would indicate a **small** effect size, meaning the independent variable explains only a small portion of the variance, and its impact is limited.


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'> Practical Use</span>

- **Small Effect (η² < 0.06):** The factor contributes little to explaining variability in your data; below about 0.01 it can usually be treated as negligible.
- **Medium Effect (0.06 ≤ η² < 0.14):** The factor has a meaningful but moderate influence and should be considered when interpreting results.
- **Large Effect (η² ≥ 0.14):** The factor plays a major role in explaining variability, and its impact should be emphasized.

Eta-squared is particularly useful for understanding the practical significance of a result, beyond just statistical significance (p-value).

<h2 style='font-size: 35px; color: Green; font-family: Dubai; font-weight: 600'>Partial Eta-squared (ηp²)</h2>

**Partial Eta-squared (ηp²)** is a measure of effect size that quantifies the proportion of variance attributable to a particular independent variable (or factor) after the variance explained by the other factors in the model has been set aside. It is computed as SS_factor / (SS_factor + SS_error), so its denominator excludes the other effects instead of using the total variance. This is useful in an analysis of variance (ANOVA) context when multiple factors are involved; in a model with a single factor, ηp² is identical to η².
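The difference between the two measures comes down to the denominator: η² divides by the total sum of squares, while ηp² divides by the effect plus error sums of squares only. The sketch below illustrates this with made-up sums of squares for a balanced two-factor design, where the components add up to the total. The function names and all numbers are hypothetical.

```python
def eta_squared(ss_effect, ss_total):
    """Share of the total variance associated with the effect."""
    return ss_effect / ss_total

def partial_eta_squared(ss_effect, ss_error):
    """Share of the effect-plus-error variance associated with the effect."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical sums of squares for factors A and B plus error (balanced design)
ss_a, ss_b, ss_error = 40.0, 30.0, 30.0
ss_total = ss_a + ss_b + ss_error

eta_a = eta_squared(ss_a, ss_total)                  # 40 / 100
partial_eta_a = partial_eta_squared(ss_a, ss_error)  # 40 / 70

print(f"eta-squared for A:         {eta_a:.3f}")
print(f"partial eta-squared for A: {partial_eta_a:.3f}")
```

Because factor B's variance is removed from the denominator, ηp² for A is larger than η² for A. With only one factor in the model there is nothing to remove, and the two values coincide.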

In [19]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

def compute_partial_eta_squared(df, numerical_columns, factor_columns):
    results = []

    for column in numerical_columns:
        for factor in factor_columns:
            # Perform the ANOVA
            formula = f'{column} ~ C({factor})'
            model = ols(formula, data=df).fit()
            anova_results = anova_lm(model, typ=2)
            
            # Extract the Sum of Squares for Factor and Error
            SS_factor = anova_results['sum_sq'][f'C({factor})']
            SS_error = anova_results['sum_sq']['Residual']
            
            # Calculate Partial Eta-squared (ηp²)
            partial_eta_squared = SS_factor / (SS_factor + SS_error)
            
            # Append the result to the list
            results.append({
                "Column": column,
                "Factor": factor,
                "Partial Eta-squared (ηp²)": partial_eta_squared
            })
    
    # Create a DataFrame from the results
    eta_squared_df = pd.DataFrame(results)
    return eta_squared_df

# Example usage:
# List of numerical columns (restricted here to two outcomes)
numerical_columns = ['Yield', 'Biomass']
# List of categorical factors to compare (e.g., Fertilizer and Irrigation)
factor_columns = ['Fertilizer', 'Irrigation']

# Calculate Partial Eta-squared for all numerical columns and factors.
# Each model contains a single factor, so these values equal plain eta-squared.
eta_squared_df = compute_partial_eta_squared(df, numerical_columns, factor_columns)

# Function to interpret Partial Eta-squared values
def interpret_eta_squared(eta_squared):
    if eta_squared >= 0.14:
        return "Large effect size (≥ 14%)"
    elif eta_squared >= 0.06:
        return "Medium effect size (6% - 14%)"
    else:
        return "Small effect size (< 6%)"

# Add interpretations to the results
eta_squared_df['Interpretation'] = eta_squared_df['Partial Eta-squared (ηp²)'].apply(interpret_eta_squared)

eta_squared_df

Unnamed: 0,Column,Factor,Partial Eta-squared (ηp²),Interpretation
0,Yield,Fertilizer,0.444042,Large effect size (≥ 14%)
1,Yield,Irrigation,0.028331,Small effect size (< 6%)
2,Biomass,Fertilizer,0.004205,Small effect size (< 6%)
3,Biomass,Irrigation,0.012414,Small effect size (< 6%)


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>General Interpretation of Partial Eta-squared (ηp²) </span>

- **Small effect size (ηp² < 0.06)**:  
  A small proportion of the variance is explained by the factor. Cohen's benchmark for a small effect is about 0.01; values well below that suggest a negligible effect on the outcome variable.
  
- **Medium effect size (0.06 ≤ ηp² < 0.14)**:  
  The factor explains a moderate proportion of the variance, indicating a moderate effect on the outcome variable.
  
- **Large effect size (ηp² ≥ 0.14)**:  
  A large proportion of the variance in the dependent variable is explained by the factor, suggesting a strong and substantial effect.

<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>Example Interpretation</span>

For each numerical variable (e.g., `Yield`, `Biomass`), you compute the **Partial Eta-squared (ηp²)** value for factors like **Fertilizer** and **Irrigation**. The values below are hypothetical and only show how each range would be read:

- **For `Yield` and `Fertilizer` with ηp² = 0.12**:
  - This means that **12%** of the variance in `Yield` is explained by the type of **Fertilizer** used, which is a medium effect size, suggesting that Fertilizer has a moderate impact on yield.

- **For `Biomass` and `Irrigation` with ηp² = 0.04**:
  - This means that **4%** of the variance in `Biomass` is explained by the type of **Irrigation** used. This represents a small effect size, indicating a limited effect of irrigation type on biomass.

- **For `Yield` and `Irrigation` with ηp² = 0.005**:
  - This indicates that only **0.5%** of the variance in `Yield` is explained by the type of **Irrigation** used. Since this value is less than 0.01, it implies that irrigation type has a negligible effect on yield in this case.


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>Considerations</span>

- **Partial Eta-squared** depends on which other terms are in the model and on the size of the error variance, so values from different designs are not directly comparable. Interpret the effect size in the context of the study and the data.
- For **multiple factors** or **interaction terms**, Partial Eta-squared can help you assess which factors contribute the most to explaining variance, which is crucial in identifying which factors are the most important drivers of the dependent variable.

In summary, Partial Eta-squared allows you to assess the magnitude of the effects of each factor on the dependent variable, helping you to understand which factors have the most significant influence and whether the effects are strong, moderate, or weak.

<h2 style='font-size: 35px; color: Green; font-family: Dubai; font-weight: 600'>Pearson’s r</h2>

**Pearson’s r** measures the strength and direction of the linear relationship between two continuous variables. Values range from -1 to 1, with 0 indicating no linear relationship, 1 indicating a perfect positive linear relationship, and -1 indicating a perfect negative linear relationship.
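As a minimal sketch of what the coefficient measures, the snippet below computes r from its definition (the sum of products of deviations, divided by the square root of the product of the two sums of squared deviations) and compares it with NumPy's `np.corrcoef`. The helper `pearson_r` and the five data points are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r computed directly from deviations around the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical paired observations
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 4, 5, 4, 5]

r_manual = pearson_r(x_demo, y_demo)
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]  # off-diagonal entry of the 2x2 matrix

print(f"manual r = {r_manual:.4f}, numpy r = {r_numpy:.4f}")
```

Both routes give the same value, which confirms that the library result is just this ratio of co-variation to total variation.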


In [22]:
import pandas as pd
from scipy.stats import pearsonr

def compute_pearson_r(df):
    numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
    results = []

    # Loop through each pair of numerical columns
    for i in range(len(numerical_columns)):
        for j in range(i + 1, len(numerical_columns)):
            col1 = numerical_columns[i]
            col2 = numerical_columns[j]
            
            # Compute Pearson's r and p-value
            r_value, p_value = pearsonr(df[col1], df[col2])
            
            # Interpret the result
            if r_value > 0:
                direction = "positive"
            elif r_value < 0:
                direction = "negative"
            else:
                direction = "no correlation"
            
            if abs(r_value) >= 0.7:
                strength = "strong"
            elif abs(r_value) >= 0.3:
                strength = "moderate"
            else:
                strength = "weak"
            
            # Append the results
            results.append({
                'Variable 1': col1,
                'Variable 2': col2,
                'Pearson\'s r': r_value,
                'P-value': p_value,
                'Direction': direction,
                'Strength': strength
            })
    
    # Convert the results to a DataFrame
    result_df = pd.DataFrame(results)
    return result_df

# Compute Pearson's r for all numerical columns and return as DataFrame
pearson_results_df = compute_pearson_r(df)

pearson_results_df

Unnamed: 0,Variable 1,Variable 2,Pearson's r,P-value,Direction,Strength
0,Yield,Days to Maturity,0.040764,0.48181,positive,weak
1,Yield,Biomass,0.076219,0.187989,positive,weak
2,Yield,Dry matter,-0.000522,0.992816,negative,weak
3,Days to Maturity,Biomass,-0.04457,0.441817,negative,weak
4,Days to Maturity,Dry matter,0.099686,0.084761,positive,weak
5,Biomass,Dry matter,-0.018644,0.747758,negative,weak


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>How to Interpret Pearson's r Results</span>

Each row of the results table can be read along the four dimensions below:


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>1. Value of Pearson's r</span>

- **-1.0 to -0.7**: Strong negative correlation → As one variable increases, the other decreases markedly in a linear fashion.
- **-0.7 to -0.3**: Moderate negative correlation → As one variable increases, the other decreases moderately.
- **-0.3 to 0.0**: Weak negative correlation → Minimal inverse relationship between the variables.
- **0.0**: No correlation → No linear relationship between the variables.
- **0.0 to 0.3**: Weak positive correlation → Minimal direct relationship between the variables.
- **0.3 to 0.7**: Moderate positive correlation → As one variable increases, the other increases moderately.
- **0.7 to 1.0**: Strong positive correlation → As one variable increases, the other increases markedly in a linear fashion.


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>2. P-Value</span>

- **p < 0.05**: The correlation is statistically significant at the 5% level: a correlation at least this strong would be unlikely if the true correlation were zero.
- **p ≥ 0.05**: The correlation is not statistically significant; the data are compatible with no linear relationship, and the observed value may reflect random variation.
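The link between r, the sample size, and the p-value can be sketched with the standard t statistic for a correlation, t = r·√((n − 2) / (1 − r²)). The example below uses the r value for `Days to Maturity` vs `Dry matter` from the results table. The sample size n = 300 is an assumption inferred from the ANOVA degrees of freedom (2 + 297 + 1), and the function name is hypothetical.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r_observed = 0.099686  # Days to Maturity vs Dry matter, from the results table
n_assumed = 300        # assumption: inferred from the ANOVA degrees of freedom

t_stat = correlation_t_statistic(r_observed, n_assumed)
print(f"t = {t_stat:.3f} with {n_assumed - 2} degrees of freedom")
```

Referring this statistic to a t distribution with n − 2 degrees of freedom gives the two-sided p-value. For a fixed r, a larger n produces a larger t, which is why even weak correlations become significant in large samples.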


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>3. Direction of Correlation</span>

- **Positive Correlation**: As one variable increases, the other variable also increases.
- **Negative Correlation**: As one variable increases, the other variable decreases.
- **No Correlation**: Changes in one variable do not predict changes in the other variable.


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>4. Strength of Correlation</span>

The strength describes how closely the data points lie on a straight line:
- **Strong Correlation**: Data points are tightly clustered around the regression line.
- **Moderate Correlation**: Data points are moderately scattered around the regression line.
- **Weak Correlation**: Data points are widely scattered and may not follow a clear linear trend.


<span style='font-size: 15px; color: Green; font-family: Dubai; font-weight: 600'>Use in Practice</span>

- **Weak Correlations** (|r| < 0.3): Might not be meaningful for practical applications, especially if p ≥ 0.05.
- **Strong Correlations** (|r| ≥ 0.7): Suggest a meaningful linear relationship that can guide predictions or decision-making.
- **Correlation vs. Causation**: Even a strong, statistically significant correlation does not imply causation; other factors or confounders should be considered.