<h1 style='font-size: 25px; color: crimson; font-family: Colonna MT; font-weight: 600; text-align: center'>Eta-squared (η²)</h1>
<hr>

- **Eta-squared (η²)** is a widely used effect size measure for ANOVA (Analysis of Variance). It is the proportion of the total variance in a numeric dependent variable that is associated with a categorical independent variable. In simpler terms, η² answers the question: *"How much of the variation in the outcome can be explained by the group differences?"* **Partial Eta-squared** instead divides a factor's sum of squares by that sum plus the error sum of squares, which excludes the variance attributed to other factors in the model. Eta-squared uses the **total** variance as its denominator, so it measures the overall share explained by the factor. In a one-way ANOVA, where only one factor is involved, the two measures are identical.

- For example, suppose a researcher wants to investigate whether different teaching methods (Method A, B, and C) affect students' final exam scores. The exam scores are the dependent variable, and the teaching method is the categorical independent variable. After performing a one-way ANOVA, the researcher calculates η² and finds a value of 0.25. This means that 25% of the total variance in students' exam scores is associated with differences between the teaching methods, which gives a direct sense of how influential the teaching method is in determining the outcome.

- In research practice, η² values are usually interpreted against Cohen's commonly cited benchmarks: about 0.01 indicates a small effect, 0.06 a medium effect, and 0.14 or higher a large effect. These are rules of thumb, and what counts as a meaningful effect depends on the field. Eta-squared is a useful complement to p-values because it describes the **magnitude** of the effect, not just whether the result is statistically significant.

- In this project, I compute this effect size with a **scalable and reusable approach** so that results are easy to obtain across multiple variables. By designing clean and modular functions for computing **Eta-squared (η²)**, I ensure that the code can handle multiple dependent variables automatically, fit the same one-way ANOVA model to each, and generate interpretable outputs with minimal manual intervention. This approach saves time and promotes consistency and accuracy when analyzing different factors or conditions in the dataset. Scalability is especially important when working with large datasets or when extending the analysis to new variables or experimental groups in the future.
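
To make the definition concrete, η² for a one-way design can be written as $\eta^2 = SS_{between} / SS_{total}$, where $SS_{between}$ sums the squared deviations of each group mean from the grand mean (weighted by group size) and $SS_{total}$ sums the squared deviations of every observation from the grand mean. The sketch below computes it by hand with pandas. The function name `eta_squared_by_hand` and the toy scores are hypothetical and only serve as an illustration; they are not part of the dataset analyzed in this notebook.

```python
import pandas as pd


def eta_squared_by_hand(values: pd.Series, groups: pd.Series) -> float:
    """Return SS_between / SS_total for a one-way layout (illustrative helper)."""
    grand_mean = values.mean()
    # Total sum of squares: squared deviation of every observation from the grand mean
    ss_total = ((values - grand_mean) ** 2).sum()
    # Between-group sum of squares: group size times squared deviation of each group mean
    grouped = values.groupby(groups)
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)


# Hypothetical toy scores for three teaching methods (made-up numbers, not real data)
toy = pd.DataFrame({
    "method": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "score": [1, 2, 3, 2, 3, 4, 3, 4, 5],
})

eta_sq = eta_squared_by_hand(toy["score"], toy["method"])
print(round(eta_sq, 2))  # 0.5
```

Here the group means are 2, 3, and 4 around a grand mean of 3, so $SS_{between} = 6$ and $SS_{total} = 12$, and half of the variance in the toy scores is associated with the grouping.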

<h1 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>1.0. Import Required Libraries</h1>

In [5]:
import re
import warnings
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

warnings.simplefilter("ignore")
pd.set_option('display.max_columns', 10)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print("....Libraries Loaded Successfully....")

....Libraries Loaded Successfully....


<h1 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>2.0. Import and Preprocessing Dataset</h1>

In [2]:
filepath = 'Datasets/Fertilizer and Light Exposure Experiment Dataset.csv'
df = pd.read_csv(filepath)
df.sample(10)

| | Fertilizer | Plant Height (cm) | Leaf Area (cm²) | Chlorophyll Content (SPAD units) | Root Length (cm) | Biomass (g) | Seed Yield (g) |
|---|---|---|---|---|---|---|---|
| 10 | Orgarnic | 41.48 | 152.1 | 34.25 | 16.13 | 9.1 | 4.13 |
| 26 | Orgarnic | 46.13 | 165.77 | 30.08 | 19.33 | 9.74 | 4.27 |
| 118 | Orgarnic | 39.6 | 144.32 | 29.12 | 18.74 | 9.15 | 4.46 |
| 8 | Orgarnic | 40.86 | 114.49 | 38.28 | 20.74 | 11.2 | 5.14 |
| 89 | Synthetic | 92.31 | 253.61 | 55.3 | 26.73 | 16.63 | 8.15 |
| 101 | Orgarnic | 40.15 | 116.0 | 33.89 | 16.47 | 10.58 | 4.19 |
| 77 | Orgarnic | 49.74 | 163.25 | 31.22 | 17.41 | 9.44 | 4.18 |
| 65 | Synthetic | 84.68 | 259.71 | 43.47 | 30.71 | 12.37 | 7.97 |
| 96 | Synthetic + Organic | 64.65 | 185.95 | 39.39 | 25.73 | 11.25 | 6.44 |
| 88 | Synthetic | 75.54 | 256.82 | 51.7 | 33.79 | 16.53 | 8.14 |


<h1 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>3.0. Dataset Column Profiling</h1>

In [3]:
def column_summary(df):
    summary_data = []
    
    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()
        
        if num_of_distinct_values <= 10:
            distinct_values_counts = df[col_name].value_counts().to_dict()
        else:
            top_10_values_counts = df[col_name].value_counts().head(10).to_dict()
            distinct_values_counts = {k: v for k, v in sorted(top_10_values_counts.items(), key=lambda item: item[1], reverse=True)}

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
            'distinct_values_counts': distinct_values_counts
        })
    
    summary_df = pd.DataFrame(summary_data)
    return summary_df


summary_df = column_summary(df)
display(summary_df)

| | col_name | col_dtype | num_of_nulls | num_of_non_nulls | num_of_distinct_values | distinct_values_counts |
|---|---|---|---|---|---|---|
| 0 | Fertilizer | object | 0 | 120 | 3 | {'Orgarnic': 44, 'Synthetic': 40, 'Synthetic +... |
| 1 | Plant Height (cm) | float64 | 0 | 120 | 120 | {58.56151388665052: 1, 46.696826238466286: 1, ... |
| 2 | Leaf Area (cm²) | float64 | 0 | 120 | 120 | {185.73856643236127: 1, 138.7980608962804: 1, ... |
| 3 | Chlorophyll Content (SPAD units) | float64 | 0 | 120 | 120 | {46.5196207922374: 1, 34.69363266870892: 1, 51... |
| 4 | Root Length (cm) | float64 | 0 | 120 | 120 | {24.31891050096943: 1, 17.6585349528435: 1, 33... |
| 5 | Biomass (g) | float64 | 0 | 120 | 120 | {11.994074041165357: 1, 8.667791843721698: 1, ... |
| 6 | Seed Yield (g) | float64 | 0 | 120 | 120 | {6.687959618540082: 1, 6.165373569255893: 1, 8... |


<h1 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>4.0. Eta-squared (η²)</h1>

In [6]:
import re
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Utility: Clean variable names for formula compatibility
def rename(text):
    return re.sub(r'[^a-zA-Z0-9]', "", text)

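# Helper: Calculate Eta-squared from a one-way ANOVA table.
# Assumes the factor is the first row and Residual is the only other row,
# so the column total of sum_sq equals the total sum of squares.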
def calculate_eta_squared(aov_table):
    ss_between = aov_table["sum_sq"].iloc[0]
    ss_total = aov_table["sum_sq"].sum()
    return ss_between / ss_total

# Helper: Run ANOVA and calculate Eta-squared
def perform_anova(data, dependent_var, independent_var):
    formula = f"{dependent_var} ~ C({independent_var})"
    model = ols(formula, data=data).fit()
    aov_table = sm.stats.anova_lm(model, typ=2)
    
    eta_sq = calculate_eta_squared(aov_table)
    aov_table["Eta-squared (η²)"] = np.nan
    aov_table.loc[f"C({independent_var})", "Eta-squared (η²)"] = eta_sq
    
    return aov_table.reset_index().rename(columns={"index": "Source"})

# Main: Compute eta squared for multiple dependent variables
def compute_eta_squared(df, independent_variable, dependent_variables):
    results = []
    for dep_var in dependent_variables:
        safe_dep = rename(dep_var)
        temp_df = df.rename(columns={dep_var: safe_dep})  # Non-destructive rename

        aov_df = perform_anova(temp_df, safe_dep, independent_variable)
        aov_df.insert(0, "Dependent Variable", dep_var)  # Preserve original name
        results.append(aov_df)

    results = pd.concat(results, ignore_index=True).fillna("-")
    return results


numeric_variables = df.select_dtypes(include=np.number).columns.tolist()
eta_results = compute_eta_squared(df, independent_variable='Fertilizer', dependent_variables=numeric_variables)
display(eta_results)

| | Dependent Variable | Source | sum_sq | df | F | PR(>F) | Eta-squared (η²) |
|---|---|---|---|---|---|---|---|
| 0 | Plant Height (cm) | C(Fertilizer) | 18145.5 | 2.0 | 126.68 | 0.00 | 0.68 |
| 1 | Plant Height (cm) | Residual | 8379.37 | 117.0 | - | - | - |
| 2 | Leaf Area (cm²) | C(Fertilizer) | 172586.54 | 2.0 | 128.61 | 0.00 | 0.69 |
| 3 | Leaf Area (cm²) | Residual | 78501.9 | 117.0 | - | - | - |
| 4 | Chlorophyll Content (SPAD units) | C(Fertilizer) | 7682.98 | 2.0 | 115.40 | 0.00 | 0.66 |
| 5 | Chlorophyll Content (SPAD units) | Residual | 3894.72 | 117.0 | - | - | - |
| 6 | Root Length (cm) | C(Fertilizer) | 2279.3 | 2.0 | 107.80 | 0.00 | 0.65 |
| 7 | Root Length (cm) | Residual | 1236.97 | 117.0 | - | - | - |
| 8 | Biomass (g) | C(Fertilizer) | 630.13 | 2.0 | 102.03 | 0.00 | 0.64 |
| 9 | Biomass (g) | Residual | 361.28 | 117.0 | - | - | - |
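
The η² values in this table can be read against the benchmarks mentioned in the introduction (0.01 small, 0.06 medium, 0.14 large). The following is a minimal sketch of that step, not part of the original pipeline: `label_effect_size` is a hypothetical helper, the "negligible" label for values below 0.01 is an assumption, and the η² values are copied by hand from the table rather than read from `eta_results`.

```python
def label_effect_size(eta_sq: float) -> str:
    """Map an eta-squared value to a conventional magnitude label."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label for values below the 0.01 benchmark


# Eta-squared values copied from the results table (rounded to two decimals)
reported = {
    "Plant Height (cm)": 0.68,
    "Leaf Area (cm²)": 0.69,
    "Chlorophyll Content (SPAD units)": 0.66,
    "Root Length (cm)": 0.65,
    "Biomass (g)": 0.64,
}

labels = {name: label_effect_size(value) for name, value in reported.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Every reported value is well above 0.14, so each variable falls in the "large" band. To apply the helper to `eta_results` directly, the rows where the η² column holds the "-" placeholder would need to be filtered out first.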


---

This analysis was performed by **Jabulente**, a passionate and dedicated data analyst with a strong commitment to using data to drive meaningful insights and solutions. For inquiries, collaborations, or further discussions, please feel free to reach out via the links below.  

    
<div align="center">  
    
[![GitHub](https://img.shields.io/badge/GitHub-Jabulente-black?logo=github)](https://github.com/Jabulente)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-Jabulente-blue?logo=linkedin)](https://linkedin.com/in/jabulente-208019349)  [![Email](https://img.shields.io/badge/Email-jabulente@hotmail.com-red?logo=gmail)](mailto:Jabulente@hotmail.com)  

</div>

<h1 style='font-size: 55px; color: Tomato; font-family: Colonna MT; font-weight: 700; text-align: center'>THE END</h1>