<h1 style='font-size: 35px; color: crimson; font-family: Colonna MT; font-weight: 600; text-align: center'>Effect Size Measurements </h1>


---

<h2 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>1.0. Import Required Libraries</h2>

In [39]:
# Standard library imports
import math
import re
import warnings

# Third-party numerical/scientific computing imports
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import (
    ttest_ind, ttest_rel, ttest_1samp,
    shapiro, levene, skew, kurtosis, zscore,
    pearsonr
)
from itertools import combinations

# Statistical modeling imports
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
warnings.simplefilter("ignore")
pd.set_option('display.max_columns', 10)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

print("Libraries Loaded Successfully")

Libraries Loaded Successfully


<h2 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>2.0. Import and Preprocessing Dataset</h2>

In [3]:
filepath = 'Datasets/Fertilizer and Light Exposure Experiment Dataset.csv'
df = pd.read_csv(filepath)
df.sample(10)

Unnamed: 0,Fertilizer,Plant Height (cm),Leaf Area (cm²),Chlorophyll Content (SPAD units),Root Length (cm),Biomass (g),Seed Yield (g)
72,Synthetic,78.45,241.59,45.3,32.43,17.06,7.45
24,Synthetic,87.95,294.5,54.66,35.25,14.26,8.43
59,Synthetic,69.64,197.29,51.73,23.93,13.14,6.53
79,Synthetic,70.68,205.59,58.26,29.93,16.71,6.61
8,Orgarnic,40.86,114.49,38.28,20.74,11.2,5.14
41,Orgarnic,53.54,142.94,31.05,18.09,8.95,4.34
14,Orgarnic,44.18,117.59,27.7,18.88,7.96,4.41
4,Orgarnic,41.82,129.78,34.73,19.78,10.55,4.64
70,Synthetic,59.25,208.3,52.91,24.73,11.56,6.9
106,Synthetic + Organic,60.78,197.07,35.93,25.36,11.44,5.29


<h2 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>3.0. Dataset Information / Overview</h2>

In [4]:
df.shape

(120, 7)

In [5]:
df.columns

Index(['Fertilizer', 'Plant Height (cm)', 'Leaf Area (cm²)',
       'Chlorophyll Content (SPAD units)', 'Root Length (cm)', 'Biomass (g)',
       'Seed Yield (g)'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 7 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Fertilizer                        120 non-null    object 
 1   Plant Height (cm)                 120 non-null    float64
 2   Leaf Area (cm²)                   120 non-null    float64
 3   Chlorophyll Content (SPAD units)  120 non-null    float64
 4   Root Length (cm)                  120 non-null    float64
 5   Biomass (g)                       120 non-null    float64
 6   Seed Yield (g)                    120 non-null    float64
dtypes: float64(6), object(1)
memory usage: 6.7+ KB


<h4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>3.1: Columns Summary</h4>

To begin the analysis, we summarize the structure of the dataset column by column. The **data type (dtype)** of each column shows whether it holds numerical or categorical values, which determines the analytical techniques that apply. The **number of distinct values** separates continuous features (many unique values) from discrete or categorical ones (a handful of repeated values), and the value counts show how observations are spread across those values.

The counts of **missing values** and **non-null entries** measure data completeness and indicate whether preprocessing steps such as imputation or cleaning are needed before any statistics are computed.

In [7]:
def column_summary(df):
    summary_data = []
    
    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()
        
        if num_of_distinct_values <= 10:
            distinct_values_counts = df[col_name].value_counts().to_dict()
        else:
            top_10_values_counts = df[col_name].value_counts().head(10).to_dict()
            distinct_values_counts = {k: v for k, v in sorted(top_10_values_counts.items(), key=lambda item: item[1], reverse=True)}

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
            'distinct_values_counts': distinct_values_counts
        })
    
    summary_df = pd.DataFrame(summary_data)
    return summary_df


summary_df = column_summary(df)
display(summary_df)

Unnamed: 0,col_name,col_dtype,num_of_nulls,num_of_non_nulls,num_of_distinct_values,distinct_values_counts
0,Fertilizer,object,0,120,3,"{'Orgarnic': 44, 'Synthetic': 40, 'Synthetic +..."
1,Plant Height (cm),float64,0,120,120,"{58.56151388665052: 1, 46.696826238466286: 1, ..."
2,Leaf Area (cm²),float64,0,120,120,"{185.73856643236127: 1, 138.7980608962804: 1, ..."
3,Chlorophyll Content (SPAD units),float64,0,120,120,"{46.5196207922374: 1, 34.69363266870892: 1, 51..."
4,Root Length (cm),float64,0,120,120,"{24.31891050096943: 1, 17.6585349528435: 1, 33..."
5,Biomass (g),float64,0,120,120,"{11.994074041165357: 1, 8.667791843721698: 1, ..."
6,Seed Yield (g),float64,0,120,120,"{6.687959618540082: 1, 6.165373569255893: 1, 8..."


<h4 style='font-size: 18px; color: Blue; font-family: Colonna MT; font-weight: 600'>3.2: Checking Missing Values</h4>

Checking for missing values is a crucial step in data analysis to assess the completeness and reliability of the dataset. This involves identifying any columns with null or empty entries, which may affect the accuracy of statistical and machine learning models. Missing values can arise due to various reasons, such as incomplete survey responses or data collection errors.

In [8]:
def Missing_values_info(df):   
    isna_df = df.isna().sum().reset_index(name='Missing Values Counts')
    isna_df['Proportions (%)'] = isna_df['Missing Values Counts']/len(df)*100
    return isna_df
    
isna_df = Missing_values_info(df)
isna_df

Unnamed: 0,index,Missing Values Counts,Proportions (%)
0,Fertilizer,0,0.0
1,Plant Height (cm),0,0.0
2,Leaf Area (cm²),0,0.0
3,Chlorophyll Content (SPAD units),0,0.0
4,Root Length (cm),0,0.0
5,Biomass (g),0,0.0
6,Seed Yield (g),0,0.0


<h4 style='font-size:18px; color: Blue; font-family: Colonna MT; font-weight: 600'>3.4: Renaming Columns</h4>

When working with data, especially in formula-based statistical tests like ANOVA, column names containing spaces or special characters can cause errors. To avoid this, the `rename` helper below strips every character that is not an ASCII letter (spaces, parentheses, digits, and symbols such as `²`), leaving names that can be used directly in a model formula. For example, `Plant Height (cm)` becomes `PlantHeightcm`. The renaming is applied automatically to every column, and it is tested here on a copy so the original DataFrame keeps its readable names.

In [17]:
def rename(text):
    text = re.sub(r'[^a-zA-Z]', "",  text) 
    return text

test_df = df.copy()
test_df = test_df.rename(columns={col: rename(col) for col in test_df.columns})
print("\n Columns names before renaming\n")
print(df.columns)
print("\n Columns names after renaming\n")
print(test_df.columns)


 Columns names before renaming

Index(['Fertilizer', 'Plant Height (cm)', 'Leaf Area (cm²)',
       'Chlorophyll Content (SPAD units)', 'Root Length (cm)', 'Biomass (g)',
       'Seed Yield (g)'],
      dtype='object')

 Columns names after renaming

Index(['Fertilizer', 'PlantHeightcm', 'LeafAreacm',
       'ChlorophyllContentSPADunits', 'RootLengthcm', 'Biomassg',
       'SeedYieldg'],
      dtype='object')
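
Stripping every non-letter, as `rename` does, removes word boundaries and could in principle map two different columns to the same name. A minimal alternative sketch follows, assuming lowercase snake_case names are acceptable; `to_snake_case` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def to_snake_case(name):
    # Collapse each run of characters outside [0-9a-zA-Z] into one underscore
    cleaned = re.sub(r'[^0-9a-zA-Z]+', '_', name)
    # Drop leading/trailing underscores and lowercase the result
    return cleaned.strip('_').lower()

columns = ['Plant Height (cm)', 'Leaf Area (cm²)', 'Seed Yield (g)']
renamed = [to_snake_case(col) for col in columns]
print(renamed)
```

The resulting names keep their word boundaries and remain valid identifiers for a model formula.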


<h1 style='font-size: 25px; font-family: Colonna MT; font-weight: 600'>4.0: Statistical Description of the Dataset</h1>

Let's take a moment to quickly explore some essential statistics of our dataset. By using the `describe()` function in pandas, we can generate a summary of key metrics for each numerical column in the dataset. This gives us a bird's-eye view of the data, helping us understand the general distribution and characteristics of the values.

In [9]:
Summary_stats = df.describe().T.reset_index()
Summary_stats.rename(columns={'index': 'Variables'}, inplace=True)
Summary_stats = Summary_stats.round(2)
Summary_stats

Unnamed: 0,Variables,count,mean,std,min,25%,50%,75%,max
0,Plant Height (cm),120.0,60.58,14.93,35.89,46.82,60.18,69.33,95.14
1,Leaf Area (cm²),120.0,181.9,45.93,108.65,142.32,180.71,205.0,312.3
2,Chlorophyll Content (SPAD units),120.0,42.03,9.86,23.99,34.58,40.37,48.66,73.21
3,Root Length (cm),120.0,23.86,5.44,14.75,19.01,23.64,27.07,39.4
4,Biomass (g),120.0,11.97,2.89,7.23,9.63,11.45,13.87,19.61
5,Seed Yield (g),120.0,6.16,1.49,3.86,4.94,6.14,7.03,10.16


<h4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>4.2: Distribution of Continuous Variables</h4>

Let's explore the distribution of continuous variables by examining key statistics. The **Mean** represents the average value, while the **Median** is more robust to outliers. The **Mode** identifies the most frequent value; for continuous measurements in which every value is unique, as here, `mode()` returns all values and the code keeps the smallest one, so the Mode column simply repeats the minimum. **Standard Deviation** and **Variance** indicate how far the data deviate from the mean, with larger values showing greater spread. The **Range** is the difference between the maximum and minimum values. **Skewness** measures the asymmetry of the distribution (0 for a symmetric one), and **Kurtosis** describes its "tailedness"; `scipy.stats.kurtosis` reports excess kurtosis by default, so a normal distribution scores 0 and negative values indicate lighter tails.

In [10]:
def compute_overall_distribution_stats(df):
    results = []
    for col in df.select_dtypes(include=[np.number]).columns:
        mean = df[col].mean()
        median = df[col].median()
        mode = df[col].mode().iloc[0] if not df[col].mode().empty else np.nan
        std_dev = df[col].std()
        variance = df[col].var()
        value_range = df[col].max() - df[col].min()
        skewness_val = skew(df[col], nan_policy='omit')  # Skewness
        kurtosis_val = kurtosis(df[col], nan_policy='omit')  # Kurtosis

        results.append({
            'Parameter': col,
            'Mean': mean,
            'Median': median,
            'Mode': mode,
            'Standard Deviation': std_dev,
            'Variance': variance,
            'Range': value_range,
            'Skewness': skewness_val,
            'Kurtosis': kurtosis_val
        })

    
    result_df = pd.DataFrame(results)
    return result_df

Results = compute_overall_distribution_stats(df)
Results

Unnamed: 0,Parameter,Mean,Median,Mode,Standard Deviation,Variance,Range,Skewness,Kurtosis
0,Plant Height (cm),60.58,60.18,35.89,14.93,222.9,59.26,0.37,-0.73
1,Leaf Area (cm²),181.9,180.71,108.65,45.93,2109.99,203.66,0.56,-0.32
2,Chlorophyll Content (SPAD units),42.03,40.37,23.99,9.86,97.29,49.22,0.59,-0.17
3,Root Length (cm),23.86,23.64,14.75,5.44,29.55,24.65,0.62,-0.22
4,Biomass (g),11.97,11.45,7.23,2.89,8.33,12.38,0.6,-0.48
5,Seed Yield (g),6.16,6.14,3.86,1.49,2.21,6.3,0.5,-0.51
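
To make the Skewness and Kurtosis columns concrete, here is a minimal plain-Python sketch of the moment formulas that correspond to the default settings of `scipy.stats.skew` and `scipy.stats.kurtosis` (biased estimators, excess kurtosis). The function name and the sample data are hypothetical and chosen only for illustration.

```python
def moment_skew_kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    # Central moments with divisor n (population form)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skewness, excess_kurtosis

sample = [1, 2, 3, 4, 10]  # hypothetical data with one large value
skewness, excess_kurtosis = moment_skew_kurtosis(sample)
print(round(skewness, 3), round(excess_kurtosis, 3))
```

The single large value pulls the right tail out, so the skewness is positive, in the same direction as the mildly positive values reported in the table.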


<h4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>4.3: Comparative Analysis</h4>

***Comparing the Means of all the Variables of Interest Across Different Specified Groups***

Now, let's compare the means of each variable across groups to see how it behaves within each category. The table below reports each group mean ± its standard error of the mean (SEM), followed by the grand mean, the overall SEM, and the coefficient of variation (%CV) across all observations. Large gaps between group means relative to their SEMs point to group differences that the effect size measures in Section 5.0 quantify.

In [11]:
def summary_stats(df, Metrics, group):
    # Overall (ungrouped) statistics for the selected metric columns
    overall = df[Metrics]
    grand_mean = overall.mean()
    sem = overall.sem()
    cv = overall.std() / overall.mean() * 100
    # Per-group mean and standard error of the mean
    grouped = df.groupby(group)[Metrics].agg(['mean', 'sem']).reset_index()
    
    summary_df = pd.DataFrame()
    for col in Metrics:
        summary_df[col] = grouped.apply(
            lambda x: f"{x[(col, 'mean')]:.2f} ± {x[(col, 'sem')]:.2f}", axis=1
        )
    
    summary_df.insert(0, group, grouped[group])
    grand_mean_row = ['Grand Mean'] + grand_mean.tolist()
    sem_row = ['SEM'] + sem.tolist()
    cv_row = ['%CV'] + cv.tolist()
    
    summary_df.loc[len(summary_df)] = grand_mean_row
    summary_df.loc[len(summary_df)] = sem_row
    summary_df.loc[len(summary_df)] = cv_row
    
    return summary_df

Metrics = df.select_dtypes(include=np.number).columns.tolist()
Results = summary_stats(df, Metrics, group='Fertilizer')
Results.T

Unnamed: 0,0,1,2,3,4,5
Fertilizer,Orgarnic,Synthetic,Synthetic + Organic,Grand Mean,SEM,%CV
Plant Height (cm),45.39 ± 0.78,74.39 ± 1.81,63.78 ± 1.30,60.58,1.36,24.65
Leaf Area (cm²),136.79 ± 2.32,227.23 ± 5.95,186.68 ± 3.22,181.90,4.19,25.25
Chlorophyll Content (SPAD units),33.02 ± 0.62,52.17 ± 1.18,41.76 ± 0.88,42.03,0.90,23.47
Root Length (cm),18.52 ± 0.26,28.84 ± 0.77,24.84 ± 0.37,23.86,0.50,22.78
Biomass (g),9.29 ± 0.16,14.77 ± 0.38,12.12 ± 0.27,11.97,0.26,24.12
Seed Yield (g),4.68 ± 0.07,7.66 ± 0.18,6.32 ± 0.11,6.16,0.14,24.13
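
As a quick arithmetic check of the SEM and %CV columns, the sketch below recomputes both for Plant Height from the rounded mean and standard deviation in the Section 4.0 summary table. Because the inputs are rounded, it is only expected to agree with the table to two decimals.

```python
import math

n = 120        # rows in the dataset
mean = 60.58   # Plant Height mean (rounded, from the summary table)
std = 14.93    # Plant Height sample standard deviation (rounded)

sem = std / math.sqrt(n)   # standard error of the mean
cv = std / mean * 100      # coefficient of variation, in percent
print(f"SEM = {sem:.2f}, %CV = {cv:.2f}")
```

This prints `SEM = 1.36, %CV = 24.65`, matching the last two columns of the Plant Height row.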


<h4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>4.4: Distributions of Categorical Variables</h4>

Now, let's explore the counts and proportions of categorical variables in our dataset, both individually and across groups. Counts show how many times each category appears, giving insight into the distribution of data within each variable. Proportions reveal the relative frequency of each category, helping us understand the balance or imbalance in the data. By examining these across different groups, we can uncover patterns or relationships that are important for further analysis, aiding in a better understanding of the structure and distribution of categorical variables.

In [12]:
def Distributions_of_Categorical_Variables(df, categories):
    data = []
    for category in categories:
        counts = df[category].value_counts()
        proportions = df[category].value_counts(normalize=True)
        for value, count in counts.items():
            proportion = proportions[value]
            data.append({
                'Category': category,
                'Sub-category': value,
                'Count': count,
                'Proportion': f"{proportion:.2%}" 
            })
    
    result_df = pd.DataFrame(data)
    return result_df
categorical_variables = df.select_dtypes(include=['object']).columns
Results = Distributions_of_Categorical_Variables(df, categorical_variables)
Results.head(10)

Unnamed: 0,Category,Sub-category,Count,Proportion
0,Fertilizer,Orgarnic,44,36.67%
1,Fertilizer,Synthetic,40,33.33%
2,Fertilizer,Synthetic + Organic,36,30.00%


<h1 style='font-size: 25px; font-family: Colonna MT; font-weight: 600'>5.0: Effect Size Measurements </h1>

***Quantifying Relationships and the Magnitudes of Effects***

*Effect size calculation is crucial for understanding the practical significance of the results in a study. While statistical tests like t-tests and ANOVA tell us whether the results are statistically significant, effect size tells us how big or how meaningful that effect actually is. Let's break down the most common effect size measures and how we calculate them.*

<h2 style='font-size: 18px; color: Blue; font-family: Candara; font-weight: 600'>5.1: Pearson’s r</h2>

**Pearson’s r**: This measures the strength and direction of the linear relationship between two continuous variables. Values range from -1 to 1, with 0 indicating no relationship, 1 indicating a perfect positive relationship, and -1 indicating a perfect negative relationship.

In [14]:
from scipy.stats import pearsonr
def compute_pearson_r(df):
    numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
    results = []

    for i, col1 in enumerate(numerical_columns):
        for col2 in numerical_columns[i+1:]:
            r_value, p_value = pearsonr(df[col1], df[col2])

            direction = ("Positive" if r_value > 0 else 
                         "Negative" if r_value < 0 else "No correlation")
            strength = ("Strong" if abs(r_value) >= 0.7 else 
                        "Moderate" if abs(r_value) >= 0.3 else "Weak")

            results.append({
                'Variable 1': col1, 'Variable 2': col2,
                'Pearson\'s r': r_value, 'P-value': p_value,
                'Direction': direction, 'Strength': strength
            })
    
    return pd.DataFrame(results)

# Compute and return the Pearson's r results as a DataFrame
pearson_results_df = compute_pearson_r(df)
pearson_results_df

Unnamed: 0,Variable 1,Variable 2,Pearson's r,P-value,Direction,Strength
0,Plant Height (cm),Leaf Area (cm²),0.89,0.0,Positive,Strong
1,Plant Height (cm),Chlorophyll Content (SPAD units),0.82,0.0,Positive,Strong
2,Plant Height (cm),Root Length (cm),0.84,0.0,Positive,Strong
3,Plant Height (cm),Biomass (g),0.81,0.0,Positive,Strong
4,Plant Height (cm),Seed Yield (g),0.8,0.0,Positive,Strong
5,Leaf Area (cm²),Chlorophyll Content (SPAD units),0.82,0.0,Positive,Strong
6,Leaf Area (cm²),Root Length (cm),0.86,0.0,Positive,Strong
7,Leaf Area (cm²),Biomass (g),0.8,0.0,Positive,Strong
8,Leaf Area (cm²),Seed Yield (g),0.85,0.0,Positive,Strong
9,Chlorophyll Content (SPAD units),Root Length (cm),0.79,0.0,Positive,Strong


<h2 style='font-size: 18px; color: blue; font-family: Candara; font-weight: 600'>5.2: Partial Eta-squared (ηp²)</h2>

**Partial Eta-squared (ηp²)** is an effect size that quantifies the proportion of variance in the dependent variable attributable to a particular independent variable (or factor) after the variance explained by the other factors in the model has been excluded: ηp² = SS_factor / (SS_factor + SS_error). It is most useful in analysis of variance (ANOVA) designs with multiple factors. With a single factor, as in this dataset, there is nothing to partial out, so ηp² equals the ordinary eta-squared (η²) of Section 5.4.

In [36]:
def compute_partial_eta_squared(df, Variables, category):
    results = []
    for column in Variables:
        for factor in category:
            col = rename(column)
            df = df.rename(columns={column: col})
            formula = f'{col} ~ C({factor})'
            model = ols(formula, data=df).fit()
            anova_results = anova_lm(model, typ=2)
            SS_factor = anova_results['sum_sq'][f'C({factor})']
            SS_error = anova_results['sum_sq']['Residual']
            partial_eta_squared = SS_factor / (SS_factor + SS_error)
            
            results.append({
                "Variables": column,
                "Factor": factor,
                "Partial Eta-squared (ηp²)": partial_eta_squared
            })
    

    eta_squared_df = pd.DataFrame(results)
    
    def interpret_eta_squared(eta_squared):
        if eta_squared >= 0.14: 
            return "Large effect size (≥ 14%)"
        elif eta_squared >= 0.06: 
            return "Medium effect size (6% - 14%)"
        else: 
            return "Small effect size (< 6%)"
            
    eta_squared_df['Interpretation'] = eta_squared_df['Partial Eta-squared (ηp²)'].apply(interpret_eta_squared)
    return eta_squared_df


Variables = df.select_dtypes(include=["float64", "int64"]).columns
eta_squared_df = compute_partial_eta_squared(df, Variables, category=['Fertilizer'])
eta_squared_df

Unnamed: 0,Variables,Factor,Partial Eta-squared (ηp²),Interpretation
0,Plant Height (cm),Fertilizer,0.68,Large effect size (≥ 14%)
1,Leaf Area (cm²),Fertilizer,0.69,Large effect size (≥ 14%)
2,Chlorophyll Content (SPAD units),Fertilizer,0.66,Large effect size (≥ 14%)
3,Root Length (cm),Fertilizer,0.65,Large effect size (≥ 14%)
4,Biomass (g),Fertilizer,0.64,Large effect size (≥ 14%)
5,Seed Yield (g),Fertilizer,0.71,Large effect size (≥ 14%)
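
The ratio SS_factor / (SS_factor + SS_error) can also be computed by hand for a single factor. The sketch below does so in plain Python on hypothetical toy groups; the function name and data are assumptions for illustration, and the notebook itself relies on `anova_lm` for the sums of squares.

```python
def partial_eta_squared(groups):
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_factor = 0.0  # between-group sum of squares
    ss_error = 0.0   # within-group (residual) sum of squares
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_factor += len(values) * (group_mean - grand_mean) ** 2
        ss_error += sum((v - group_mean) ** 2 for v in values)
    return ss_factor / (ss_factor + ss_error)

# Hypothetical toy data: three well-separated groups
toy = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}
toy_eta = partial_eta_squared(toy)
print(round(toy_eta, 2))
```

Here SS_factor = 54 and SS_error = 6, so the ratio is 0.9: almost all of the variation lies between the groups rather than within them.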


<h2 style='font-size: 18px; color: blue; font-family: Candara; font-weight: 600'>5.3. Cohen's d</h2>

Cohen's d measures the difference between two group means in units of their pooled standard deviation. It is commonly reported alongside t-tests or other comparisons of two independent samples. By Cohen's conventional benchmarks, |d| around 0.2 indicates a small effect, around 0.5 a medium effect, and 0.8 or more a large effect. The sign of d only reflects the order in which the two groups are subtracted.

In [37]:
def Effectsize(df, numerical_columns, group_column):
    def cohens_d(group1, group2):
        mean1 = np.mean(group1)
        mean2 = np.mean(group2)
        std1 = np.std(group1, ddof=1)
        std2 = np.std(group2, ddof=1)
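        # Unweighted average of the two variances; this equals the standard
        # pooled SD only when both groups have the same number of observations
        pooled_std = np.sqrt(((std1 ** 2) + (std2 ** 2)) / 2)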
        return (mean1 - mean2) / pooled_std
    
    def interpret_cohens_d(d_value):
        # Cohen's conventional benchmarks: 0.2 small, 0.5 medium, 0.8 large
        magnitude = abs(d_value)
        if magnitude < 0.2: interpretation = "Negligible effect size"
        elif magnitude < 0.5: interpretation = "Small effect size"
        elif magnitude < 0.8: interpretation = "Medium effect size"
        else: interpretation = "Large effect size"
        return interpretation
    
    def calculate_effect_sizes(df, group_column, numerical_columns):
        effect_size_results = []
    
        for column in numerical_columns:
            group_values = df[group_column].unique() 
            for i in range(len(group_values)):
                for j in range(i + 1, len(group_values)):
                    group1 = df[df[group_column] == group_values[i]][column]
                    group2 = df[df[group_column] == group_values[j]][column]
                    
                    d_value = cohens_d(group1, group2)
                    interpretation = interpret_cohens_d(d_value)
                    
                    effect_size_results.append({
                        'Variable': column,
                        'Pair': f'{group_values[i]} vs {group_values[j]}',
                        'Cohen\'s d': d_value,
                        'Interpretation': interpretation,
                    })
    
        
        results = pd.DataFrame(effect_size_results)
        return results
        
    results = calculate_effect_sizes(df, group_column, numerical_columns)
    return results

results = Effectsize(df, Metrics, group_column='Fertilizer')
results

Unnamed: 0,Variable,Pair,Cohen's d,Interpretation
0,Plant Height (cm),Synthetic vs Orgarnic,3.27,Large effect size
1,Plant Height (cm),Synthetic vs Synthetic + Organic,1.08,Large effect size
2,Plant Height (cm),Orgarnic vs Synthetic + Organic,-2.77,Large effect size
3,Leaf Area (cm²),Synthetic vs Orgarnic,3.15,Large effect size
4,Leaf Area (cm²),Synthetic vs Synthetic + Organic,1.36,Large effect size
5,Leaf Area (cm²),Orgarnic vs Synthetic + Organic,-2.86,Large effect size
6,Chlorophyll Content (SPAD units),Synthetic vs Orgarnic,3.17,Large effect size
7,Chlorophyll Content (SPAD units),Synthetic vs Synthetic + Organic,1.6,Large effect size
8,Chlorophyll Content (SPAD units),Orgarnic vs Synthetic + Organic,-1.84,Large effect size
9,Root Length (cm),Synthetic vs Orgarnic,2.81,Large effect size
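
The `cohens_d` helper above averages the two group variances, which matches the usual pooled standard deviation only when the groups are the same size; the fertilizer groups here have 44, 40, and 36 plants. A minimal sketch of the sample-size-weighted version follows. The function name and sample data are hypothetical, and the values in the table above were not produced with it.

```python
import math
from statistics import mean, variance

def cohens_d_weighted(group1, group2):
    n1, n2 = len(group1), len(group2)
    # Pooled variance: each sample variance weighted by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)

g1 = [2.0, 4.0, 6.0]              # hypothetical group of 3
g2 = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical group of 5
d_weighted = cohens_d_weighted(g1, g2)
print(round(d_weighted, 3))
```

With these inputs the pooled variance is (2·4 + 4·2.5) / 6 = 3, so d = 1/√3 ≈ 0.577. When group sizes are close, as in this dataset, the weighted and unweighted versions should give similar values.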


<h2 style='font-size: 18px; color: Blue; font-family: Candara; font-weight: 600'>5.4. Eta-squared (η²)</h2>

**Eta-squared (η²)** is an effect size used with **ANOVA** to quantify the proportion of the total variance in the dependent variable that is attributable to a specific independent variable (or factor): η² = SS_between / SS_total. It ranges from 0 to 1 and can be read as the fraction of variability explained by that factor; for example, η² = 0.68 means the factor accounts for 68% of the total variance.

In [38]:
def compute_eta_squared(df, independent_variable, dependent_variables):
    def calculate_eta_squared(aov_table):
        ss_between = aov_table["sum_sq"].iloc[0]
        ss_total = aov_table["sum_sq"].sum()
        return ss_between / ss_total
    
    def perform_anova(df, dependent_var, independent_var):
        model = ols(f'{dependent_var} ~ C({independent_var})', data=df).fit()
        aov_table = sm.stats.anova_lm(model, typ=2)

        eta_sq = calculate_eta_squared(aov_table)
        
        aov_table["Eta-squared (η²)"] = np.nan
        aov_table.loc[f'C({independent_var})', "Eta-squared (η²)"] = eta_sq
        
        return aov_table.reset_index().rename(columns={"index": "Source"})
    
    results = [] 
    for variable in dependent_variables:
        safe_column_name = rename(variable)
        data = df.rename(columns={variable: safe_column_name})
        
        anova_df = perform_anova(data, safe_column_name, independent_variable)
        anova_df.insert(0, "Variable", variable)  # Add original variable name
        results.append(anova_df)
        
    eta_squared_df = pd.concat(results, ignore_index=True)
    
    return eta_squared_df    

numeric_variables = df.select_dtypes(include=np.number).columns.tolist()
results = compute_eta_squared(df, independent_variable='Fertilizer', dependent_variables=numeric_variables)
results

Unnamed: 0,Variable,Source,sum_sq,df,F,PR(>F),Eta-squared (η²)
0,Plant Height (cm),C(Fertilizer),18145.5,2.0,126.68,0.0,0.68
1,Plant Height (cm),Residual,8379.37,117.0,,,
2,Leaf Area (cm²),C(Fertilizer),172586.54,2.0,128.61,0.0,0.69
3,Leaf Area (cm²),Residual,78501.9,117.0,,,
4,Chlorophyll Content (SPAD units),C(Fertilizer),7682.98,2.0,115.4,0.0,0.66
5,Chlorophyll Content (SPAD units),Residual,3894.72,117.0,,,
6,Root Length (cm),C(Fertilizer),2279.3,2.0,107.8,0.0,0.65
7,Root Length (cm),Residual,1236.97,117.0,,,
8,Biomass (g),C(Fertilizer),630.13,2.0,102.03,0.0,0.64
9,Biomass (g),Residual,361.28,117.0,,,
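
As a worked check, the sketch below recomputes η² and the F statistic for Plant Height from the rounded sums of squares and degrees of freedom in the table above. Because the inputs are rounded, agreement is expected only to about two decimals.

```python
# Values copied from the Plant Height rows of the ANOVA table
ss_factor = 18145.50     # C(Fertilizer) sum of squares
ss_residual = 8379.37    # Residual sum of squares
df_factor, df_residual = 2, 117

# Eta-squared: share of the total sum of squares explained by the factor
eta_sq = ss_factor / (ss_factor + ss_residual)

# F statistic: ratio of the factor and residual mean squares
f_stat = (ss_factor / df_factor) / (ss_residual / df_residual)

print(f"eta-squared = {eta_sq:.2f}, F = {f_stat:.2f}")
```

This prints `eta-squared = 0.68, F = 126.68`, matching the reported row. With only one factor in the model, the total sum of squares is just SS_factor + SS_residual, which is why η² here coincides with the partial eta-squared in Section 5.2.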


---

This analysis was performed by **Jabulente**, a passionate and dedicated data scientist with a strong commitment to using data to drive meaningful insights and solutions. For inquiries, collaborations, or further discussions, please feel free to reach out via the links below.

    
<div align="center">  
    
[![GitHub](https://img.shields.io/badge/GitHub-Jabulente-black?logo=github)](https://github.com/Jabulente)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-Jabulente-blue?logo=linkedin)](https://linkedin.com/in/jabulente-208019349)  [![X (Twitter)](https://img.shields.io/badge/X-@Jabulente-black?logo=x)](https://x.com/Jabulente)  [![Instagram](https://img.shields.io/badge/Instagram-@Jabulente-purple?logo=instagram)](https://instagram.com/Jabulente)  [![Threads](https://img.shields.io/badge/Threads-@Jabulente-black?logo=threads)](https://threads.net/@Jabulente)  [![TikTok](https://img.shields.io/badge/TikTok-@Jabulente-teal?logo=tiktok)](https://tiktok.com/@Jabulente)  [![Email](https://img.shields.io/badge/Email-jabulente@hotmail.com-red?logo=gmail)](mailto:Jabulente@hotmail.com)  

</div>


<h1 style='font-size: 55px; color: Tomato; font-family: Colonna MT; font-weight: 700; text-align: center'>THE END</h1>