<h1 style='font-size: 35px; color: crimson; font-family: Colonna MT; font-weight: 600; text-align: center'> Effect Size Measurements | Real World Implications</h1>

---

<h1 style=' font-weight: 600; font-size: 18px; text-align: left'>1.0. Import Required Libraries</h1>

In [1]:
from statsmodels.stats.anova import anova_lm
from statsmodels.formula.api import ols
import statsmodels.formula.api as smf
import statsmodels.api as sm
import pandas as pd
import numpy as np
import re

pd.set_option('display.max_columns', 70) 
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print("....Libraries Loaded Successfully....")

....Libraries Loaded Successfully....


<h1 style='font-weight: 600; font-size: 18px; text-align: left'>2.0. Import and Preprocess the Dataset</h1>

In [2]:
filepath = "../Datasets/Eggplant Fusarium Fresistance Data.csv"
df = pd.read_csv(filepath)
display(df)

Unnamed: 0,Variety,Resistance Level,Replication ID,Infection Severity (%),Wilt index,Plant height (cm),Days to wilt symptoms,Survival rate (%),Disease incidence (%)
0,EP-R1,Resistant,1,22.50,0.70,88.90,21,88.80,23.40
1,EP-R1,Resistant,2,27.90,1.20,82.20,19,87.70,21.70
2,EP-R1,Resistant,3,21.20,0.00,74.70,17,84.90,27.20
3,EP-R1,Resistant,4,15.50,0.10,93.80,18,90.30,15.00
4,EP-R1,Resistant,5,17.30,0.90,78.10,19,87.00,23.00
...,...,...,...,...,...,...,...,...,...
795,EP-S3,Susceptible,96,75.20,3.60,68.20,7,6.40,85.50
796,EP-S3,Susceptible,97,74.80,4.90,59.50,4,27.20,82.00
797,EP-S3,Susceptible,98,58.10,3.60,78.80,7,30.80,75.40
798,EP-S3,Susceptible,99,54.10,4.10,63.70,7,24.10,81.80


<h1 style='font-weight: 600; font-size: 18px; text-align: left'>3.0. Effect Size Measurements</h1>

<h2 style='font-weight: 600; font-size: 15px; text-align: left'>3.1. Eta-squared (η²)</h2>

**Eta-squared (η²)** is an effect size measure used with **ANOVA**. It is the proportion of the total variance in the dependent variable that is attributable to a given independent variable (factor): η² = SS_between / SS_total. Values range from 0 to 1 and are usually read as the percentage of variability explained by the factor; for example, η² = 0.89 means the factor accounts for 89% of the variance.
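
To make the proportion concrete, here is a minimal sketch that computes η² by hand for a one-way layout, using only the standard library. The function name `eta_squared` and the toy scores are assumptions made for illustration; they are not taken from the eggplant dataset or from statsmodels.

```python
from statistics import mean

def eta_squared(groups):
    """Eta-squared for a one-way layout: SS_between / SS_total."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group size times the squared distance
    # of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Total sum of squares: squared distance of every value from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between / ss_total

# Hypothetical severity scores for three made-up varieties (not real data)
toy_groups = [[20, 22, 24], [40, 42, 44], [60, 62, 64]]
print(round(eta_squared(toy_groups), 3))
```

For these toy numbers the between-group sum of squares is 2400 out of a total of 2424, so η² is about 0.99: almost all of the variability lies between the groups rather than within them.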

In [5]:
def rename(text):
    # Keep letters only so the column name is safe inside a statsmodels formula
    return re.sub(r'[^a-zA-Z]', "", text)

def calculate_eta_squared(aov_table):
    # Row 0 is the first factor; the sum_sq column adds up to SS_total in a one-factor model
    ss_between = aov_table["sum_sq"].iloc[0]
    ss_total = aov_table["sum_sq"].sum()
    return ss_between / ss_total

def perform_anova_and_calculate_eta(df, metrics, group):
    group = [rename(col) for col in group]
    data = df.rename(columns={col: rename(col) for col in df.columns})  # Formula-safe names for every column

    results = []
    for metric in metrics:
        safe_column_name = rename(metric)

        # Regress the metric on each grouping factor as a categorical term
        formula = f'{safe_column_name} ~ ' + ' + '.join([f'C({g})' for g in group])
        model = ols(formula, data=data).fit()
        aov_table = sm.stats.anova_lm(model, typ=2)

        eta_sq = calculate_eta_squared(aov_table)
        aov_table["Eta-squared (η²)"] = np.nan
        aov_table.loc[aov_table.index[0], "Eta-squared (η²)"] = eta_sq  # Add Eta squared to the first row

        anova_df = aov_table.reset_index().rename(columns={"index": "Source"})
        anova_df.insert(0, "Metric", metric)
        results.append(anova_df)

    return pd.concat(results, ignore_index=True)

group=['Variety']
Metrics = ['Infection Severity (%)', 'Wilt index', 'Plant height (cm)', 'Days to wilt symptoms', 'Survival rate (%)', 'Disease incidence (%)']
Eta_squared_df = perform_anova_and_calculate_eta(df, Metrics, group)
Eta_squared_df

Unnamed: 0,Metric,Source,sum_sq,df,F,PR(>F),Eta-squared (η²)
0,Infection Severity (%),C(Variety),432094.82,7.0,897.47,0.0,0.89
1,Infection Severity (%),Residual,54473.51,792.0,,,
2,Wilt index,C(Variety),1705.22,7.0,730.61,0.0,0.87
3,Wilt index,Residual,264.07,792.0,,,
4,Plant height (cm),C(Variety),59239.97,7.0,293.62,0.0,0.72
5,Plant height (cm),Residual,22827.23,792.0,,,
6,Days to wilt symptoms,C(Variety),19302.91,7.0,1091.56,0.0,0.91
7,Days to wilt symptoms,Residual,2000.79,792.0,,,
8,Survival rate (%),C(Variety),641645.8,7.0,1559.68,0.0,0.93
9,Survival rate (%),Residual,46546.44,792.0,,,


<h2 style='font-size: 15px;  font-weight: 600'>3.2. Pearson’s r</h2>

**Pearson’s r** measures the strength and direction of the linear relationship between two continuous variables. Values range from -1 to 1: 0 indicates no linear relationship, 1 a perfect positive linear relationship, and -1 a perfect negative one.
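
As a rough illustration of what the coefficient measures, the sketch below computes r directly from its definition: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The helper name `pearson_r` and the sample values are hypothetical and are not drawn from the dataset.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson's r from the definition, for two equal-length sequences."""
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]  # deviations of x from its mean
    dy = [b - my for b in y]  # deviations of y from its mean
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical values: severity rises while survival falls
severity = [10, 20, 30, 40, 50]
survival = [90, 85, 60, 40, 30]
r = pearson_r(severity, survival)
print(round(r, 2))
```

Here r comes out at about -0.98, a strong negative linear relationship. The sign carries the direction and the absolute value carries the strength, which is the same split the notebook uses when it labels each pair.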

In [17]:
from scipy.stats import pearsonr

def compute_pearson_r(df, numerical_columns):
    results = []

    for i, col1 in enumerate(numerical_columns):
        for col2 in numerical_columns[i+1:]:
            r_value, p_value = pearsonr(df[col1], df[col2])
            direction = ("Positive" if r_value > 0 else 
                         "Negative" if r_value < 0 else "No correlation")
            strength = ("Strong" if abs(r_value) >= 0.7 else 
                        "Moderate" if abs(r_value) >= 0.3 else "Weak")

            results.append({
                'Variable 1': col1, 'Variable 2': col2,
                'Pearson\'s r': r_value, 'P-value': p_value,
                'Direction': direction, 'Strength': strength
            })
    
    return pd.DataFrame(results)

Metrics = ['Infection Severity (%)', 'Wilt index', 'Plant height (cm)', 'Days to wilt symptoms', 'Survival rate (%)', 'Disease incidence (%)']
pearson_results_df = compute_pearson_r(df, numerical_columns=Metrics)
pearson_results_df

Unnamed: 0,Variable 1,Variable 2,Pearson's r,P-value,Direction,Strength
0,Infection Severity (%),Wilt index,0.88,0.0,Positive,Strong
1,Infection Severity (%),Plant height (cm),-0.81,0.0,Negative,Strong
2,Infection Severity (%),Days to wilt symptoms,-0.89,0.0,Negative,Strong
3,Infection Severity (%),Survival rate (%),-0.91,0.0,Negative,Strong
4,Infection Severity (%),Disease incidence (%),0.9,0.0,Positive,Strong
5,Wilt index,Plant height (cm),-0.8,0.0,Negative,Strong
6,Wilt index,Days to wilt symptoms,-0.89,0.0,Negative,Strong
7,Wilt index,Survival rate (%),-0.9,0.0,Negative,Strong
8,Wilt index,Disease incidence (%),0.88,0.0,Positive,Strong
9,Plant height (cm),Days to wilt symptoms,0.81,0.0,Positive,Strong


<h2 style='font-size: 15px;  font-weight: 600'>3.3. Cohen’s d</h2>

**Cohen’s d** is a widely used measure of **effect size** that quantifies the **difference between two group means** in terms of **standard deviation units**. It helps us understand how large or meaningful the difference is, beyond just knowing whether it’s statistically significant.
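
The sketch below shows one common way to express a mean difference in standard deviation units: divide it by a pooled standard deviation in which each group's variance is weighted by its degrees of freedom. The function name `cohens_d_pooled` and the toy groups are assumptions for illustration. When both groups have the same size, this weighting reduces to a simple average of the two variances.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d_pooled(group1, group2):
    """Cohen's d with a degrees-of-freedom-weighted pooled SD."""
    n1, n2 = len(group1), len(group2)
    # statistics.variance uses the n - 1 denominator (sample variance)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)

# Hypothetical infection severity scores with unequal group sizes
resistant = [20, 22, 24, 26]
susceptible = [60, 70, 80]
print(round(cohens_d_pooled(resistant, susceptible), 2))
```

For these made-up groups the means differ by 47 points and the pooled variance is 44, giving d of about -7.09. The negative sign only reflects which group is listed first.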

In [9]:
from itertools import combinations

def compute_cohens_d(df, numerical_columns, group_column):
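    def cohens_d(group1, group2):
        mean1, mean2 = np.mean(group1), np.mean(group2)
        std1, std2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
        # Root mean square of the two sample SDs; this matches the
        # size-weighted pooled SD only when both groups have the same size
        pooled_std = np.sqrt((std1**2 + std2**2) / 2)
        return (mean1 - mean2) / pooled_std if pooled_std != 0 else np.nan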

    def interpret_d(d):
        # Cohen's conventional benchmarks: 0.2 small, 0.5 medium, 0.8 large
        if pd.isna(d):
            return "Not defined (zero pooled SD)"
        abs_d = abs(d)
        if abs_d < 0.2:
            return "Negligible effect size"
        elif abs_d < 0.5:
            return "Small effect size"
        elif abs_d < 0.8:
            return "Medium effect size"
        else:
            return "Large effect size"

    results = []

    unique_groups = df[group_column].dropna().unique()
    for var in numerical_columns:
        for group_a, group_b in combinations(unique_groups, 2):
            group1 = df[df[group_column] == group_a][var].dropna()
            group2 = df[df[group_column] == group_b][var].dropna()

            if not group1.empty and not group2.empty:
                d = cohens_d(group1, group2)
                results.append({
                    "Variable": var,
                    "Group Comparison": f"{group_a} vs {group_b}",
                    "Cohen's d": d,
                    "Interpretation": interpret_d(d)
                })

    return pd.DataFrame(results)

Metrics = ['Infection Severity (%)', 'Wilt index', 'Plant height (cm)', 'Days to wilt symptoms', 'Survival rate (%)', 'Disease incidence (%)']
results = compute_cohens_d(df, numerical_columns=Metrics, group_column='Variety')
results

Unnamed: 0,Variable,Group Comparison,Cohen's d,Interpretation
0,Infection Severity (%),EP-R1 vs EP-R2,-0.03,Negligible effect size
1,Infection Severity (%),EP-R1 vs EP-R3,-0.05,Negligible effect size
2,Infection Severity (%),EP-R1 vs EP-M1,-2.91,Large effect size
3,Infection Severity (%),EP-R1 vs EP-M2,-3.20,Large effect size
4,Infection Severity (%),EP-R1 vs EP-S1,-6.63,Large effect size
...,...,...,...,...
163,Disease incidence (%),EP-M2 vs EP-S2,-3.26,Large effect size
164,Disease incidence (%),EP-M2 vs EP-S3,-3.46,Large effect size
165,Disease incidence (%),EP-S1 vs EP-S2,0.24,Small effect size
166,Disease incidence (%),EP-S1 vs EP-S3,0.07,Negligible effect size


In [10]:
Metrics = ['Infection Severity (%)', 'Wilt index', 'Plant height (cm)', 'Days to wilt symptoms', 'Survival rate (%)', 'Disease incidence (%)']
results = compute_cohens_d(df, numerical_columns=Metrics, group_column='Resistance Level')
results

Unnamed: 0,Variable,Group Comparison,Cohen's d,Interpretation
0,Infection Severity (%),Resistant vs Moderate,-3.03,Large effect size
1,Infection Severity (%),Resistant vs Susceptible,-7.16,Large effect size
2,Infection Severity (%),Moderate vs Susceptible,-2.96,Large effect size
3,Wilt index,Resistant vs Moderate,-2.83,Large effect size
4,Wilt index,Resistant vs Susceptible,-6.41,Large effect size
5,Wilt index,Moderate vs Susceptible,-2.57,Large effect size
6,Plant height (cm),Resistant vs Moderate,1.91,Large effect size
7,Plant height (cm),Resistant vs Susceptible,3.57,Large effect size
8,Plant height (cm),Moderate vs Susceptible,1.95,Large effect size
9,Days to wilt symptoms,Resistant vs Moderate,3.52,Large effect size


<h2 style='font-size: 15px;  font-weight: 600'>3.4. Partial Eta-squared (ηp²)</h2>

**Partial Eta-squared (ηp²)** is a commonly used effect size measure in **ANOVA (Analysis of Variance)**. It is the proportion of variance attributable to a specific factor after **setting aside the variance explained by the other factors** in the model: ηp² = SS_factor / (SS_factor + SS_error). Eta-squared, by contrast, divides by the total sum of squares, so the two measures differ only when the model contains more than one factor; in a one-factor model they are identical. Values range from 0 to 1, and higher values indicate a larger effect. By common convention, 0.01 is read as a small effect, 0.06 as medium, and 0.14 or above as large, although interpretation varies by field.
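
The difference between the two measures only shows up once a second factor is in the model. The sketch below uses a hypothetical balanced two-factor layout (a made-up `treatment` factor alongside variety, with invented scores) and an additive, main-effects-only decomposition computed by hand. It is an illustration under those assumptions, not a reproduction of the notebook's models.

```python
from statistics import mean

# Hypothetical balanced layout: (variety, treatment) -> scores (not real data)
cells = {
    ("A", "control"): [10, 12],
    ("A", "treated"): [20, 22],
    ("B", "control"): [30, 32],
    ("B", "treated"): [40, 42],
}

all_values = [x for values in cells.values() for x in values]
grand_mean = mean(all_values)

def main_effect_ss(position):
    """Main-effect sum of squares for the factor at `position` in the cell key."""
    levels = {}
    for key, values in cells.items():
        levels.setdefault(key[position], []).extend(values)
    return sum(len(v) * (mean(v) - grand_mean) ** 2 for v in levels.values())

ss_total = sum((x - grand_mean) ** 2 for x in all_values)
ss_variety = main_effect_ss(0)
ss_treatment = main_effect_ss(1)
# Error term of the additive model; this subtraction is valid because the layout is balanced
ss_error = ss_total - ss_variety - ss_treatment

eta_sq_treatment = ss_treatment / ss_total
partial_eta_sq_treatment = ss_treatment / (ss_treatment + ss_error)
print(round(eta_sq_treatment, 2), round(partial_eta_sq_treatment, 2))
```

In this toy layout treatment accounts for about 20% of the total variance (200 / 1008) but about 96% of the variance that remains once variety is set aside (200 / 208), which is why ηp² can be much larger than η² in multi-factor designs.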

In [16]:
def interpret_eta_squared(eta_squared):
    if eta_squared >= 0.14: return "Large effect size (≥ 14%)"
    elif eta_squared >= 0.06: return "Medium effect size (6% - 14%)"
    else: return "Small effect size (< 6%)"

def compute_partial_eta_squared(df, variables, categories):
    def safe_rename(text):
        # Keep letters, digits, and underscores so names are safe inside a formula
        return re.sub(r'[^a-zA-Z0-9_]', "", text)

    results = []

    for var in variables:
        for cat in categories:
            # Suffixes keep the metric and factor names distinct after cleaning
            safe_var, safe_cat = safe_rename(var) + "_var", safe_rename(cat) + "_cat"
            temp_df = df.rename(columns={var: safe_var, cat: safe_cat})
            formula = f'{safe_var} ~ C({safe_cat})'
            model = ols(formula, data=temp_df).fit()
            anova_table = anova_lm(model, typ=2)

            # ηp² = SS_factor / (SS_factor + SS_error); with a single factor this equals η²
            ss_factor = anova_table.loc[f'C({safe_cat})', 'sum_sq']
            ss_error = anova_table.loc['Residual', 'sum_sq']
            eta_squared = ss_factor / (ss_factor + ss_error)

            results.append({
                "Variable": var,
                "Factor": cat,
                "Partial Eta-squared (ηp²)": eta_squared,
                "Interpretation": interpret_eta_squared(eta_squared)
            })

    return pd.DataFrame(results)


categories = ['Variety', 'Resistance Level']
Metrics = ['Infection Severity (%)', 'Wilt index', 'Plant height (cm)', 'Days to wilt symptoms', 'Survival rate (%)', 'Disease incidence (%)']
eta_squared_df = compute_partial_eta_squared(df, Metrics, categories)
display(eta_squared_df)

Unnamed: 0,Variable,Factor,Partial Eta-squared (ηp²),Interpretation
0,Infection Severity (%),Variety,0.89,Large effect size (≥ 14%)
1,Infection Severity (%),Resistance Level,0.89,Large effect size (≥ 14%)
2,Wilt index,Variety,0.87,Large effect size (≥ 14%)
3,Wilt index,Resistance Level,0.87,Large effect size (≥ 14%)
4,Plant height (cm),Variety,0.72,Large effect size (≥ 14%)
5,Plant height (cm),Resistance Level,0.72,Large effect size (≥ 14%)
6,Days to wilt symptoms,Variety,0.91,Large effect size (≥ 14%)
7,Days to wilt symptoms,Resistance Level,0.91,Large effect size (≥ 14%)
8,Survival rate (%),Variety,0.93,Large effect size (≥ 14%)
9,Survival rate (%),Resistance Level,0.93,Large effect size (≥ 14%)


---

This analysis was performed by **Jabulente**, a passionate and dedicated data analyst with a strong commitment to using data to drive meaningful insights and solutions. For inquiries, collaborations, or further discussions, please feel free to reach out via the links below.  

--- 

<div align="center">  
    
[![GitHub](https://img.shields.io/badge/GitHub-Jabulente-black?logo=github)](https://github.com/Jabulente)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-Jabulente-blue?logo=linkedin)](https://linkedin.com/in/jabulente-208019349)  [![Email](https://img.shields.io/badge/Email-jabulente@hotmail.com-red?logo=gmail)](mailto:Jabulente@hotmail.com)  

</div>

<h5 style='font-size: 55px; color: crimson; font-family: Colonna MT; font-weight: 600; text-align: center'>THE END</h5>
