<h2 style=' color: crimson;font-family: Colonna MT; font-weight: 600; font-size: 35px; text-align: Center'>Analysis of Variance (ANOVA)</h2>

---

***Analysis of Variance (ANOVA)** is a statistical method used to compare the means of three or more groups to determine if there are any statistically significant differences between them. It analyzes the variation within each group and between groups to assess if the observed differences in sample means are likely due to a true effect or if they could have occurred by random chance. ANOVA assumes that the data within each group is normally distributed, that the groups have similar variances (homogeneity of variance), and that the samples are independent. If the ANOVA test reveals significant differences, post-hoc tests can be used to identify which specific groups differ.*
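
The comparison of between-group and within-group variation described above can be sketched by hand. The three groups below are made-up illustrative numbers, not values from the dataset, and the variable names are assumptions chosen for this example; `scipy.stats.f_oneway` is used only as a cross-check.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Three small illustrative groups (hypothetical numbers, not from the dataset)
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([9.0, 10.0, 11.0, 10.0]),
]

k = len(groups)                              # number of groups
n_total = sum(len(g) for g in groups)        # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group variation: how far each group mean lies from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k

# F is the ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = f.sf(f_stat, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA
f_ref, p_ref = f_oneway(*groups)
print(f"manual F = {f_stat:.4f}, p = {p_value:.6f}")
print(f"scipy  F = {f_ref:.4f}, p = {p_ref:.6f}")
```

A large F means the group means are spread out more than the within-group noise would explain, which is what a small p-value reflects.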

*This notebook works through assumption validation and then performs **One-Way**, **Two-Way**, and **Welch's ANOVA** analyses. First, we check the assumptions required for ANOVA: **normality**, **homogeneity of variance**, and **independence** of observations. The Shapiro-Wilk test assesses normality and Levene's test checks for equal variances; visual tools such as Q-Q plots can complement these tests. We then run **One-Way ANOVA**, which compares the means of three or more groups defined by a single factor, and **Two-Way ANOVA**, which examines the main effects of two factors and their interaction on the dependent variable. Because several variables fail the equal-variance check, **Welch's ANOVA** is included as a more robust alternative. These analyses help identify whether group means differ significantly and how the factors relate to the measured plant traits.*

<h2 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>1.0. Import Required Libraries</h2>

In [125]:
from scipy.stats import shapiro, levene
from statsmodels.stats.anova import anova_lm
from statsmodels.formula.api import ols
import statsmodels.formula.api as smf
import statsmodels.api as sm
import pingouin as pg
import pandas as pd
import numpy as np
import re

print("Libraries Loaded Successfully")

Libraries Loaded Successfully


<h2 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>2.0. Import and Preprocessing Dataset</h2>

In [113]:
filepath = 'Datasets/Fertilizer and Light Exposure Experiment Dataset.csv'
df = pd.read_csv(filepath)
df.sample(10)

Unnamed: 0,Fertilizer,Light Exposure,Plant Height (cm),Leaf Area (cm²),Chlorophyll Content (SPAD units),Root Length (cm),Biomass (g),Flower Count (number),Seed Yield (g),Stomatal Conductance (mmol/m²/s)
4,Organic,Full Shade,41.824216,129.775873,34.734928,19.784214,10.547822,15.14083,4.641545,200.536467
11,Control,Full Sun,62.9066,181.200445,46.334206,25.505119,13.943497,21.416247,6.80642,196.979138
46,Synthetic,Partial Shade,65.491129,186.330578,39.112393,26.225549,12.845067,21.052167,7.304005,318.786887
55,Control,Full Sun,59.428476,184.07647,53.486801,23.062762,12.289083,16.896079,6.24433,245.0838
57,Organic,Partial Shade,63.418958,192.57822,46.464603,27.688596,11.442456,19.258213,6.823463,293.842943
39,Control,Full Shade,46.857514,108.646815,33.461977,14.751212,8.754549,12.550712,5.331018,187.082737
97,Control,Full Shade,37.742779,128.156267,32.174037,22.087247,7.732385,13.010168,4.455371,148.685152
51,Organic,Partial Shade,69.293881,187.077486,42.839882,23.269376,12.070809,17.700754,6.138762,230.289963
86,Control,Partial Shade,63.25936,174.563314,36.440202,23.855202,10.301372,19.831889,4.938772,242.412097
94,Organic,Partial Shade,74.503684,166.5767,50.670611,25.019161,11.467505,17.211121,5.275416,260.675109


<h2 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>3.0. Dataset Information / Overview</h2>

In [114]:
df.shape

(120, 10)

In [115]:
df.columns

Index(['Fertilizer', 'Light Exposure', 'Plant Height (cm)', 'Leaf Area (cm²)',
       'Chlorophyll Content (SPAD units)', 'Root Length (cm)', 'Biomass (g)',
       'Flower Count (number)', 'Seed Yield (g)',
       'Stomatal Conductance (mmol/m²/s)'],
      dtype='object')

In [116]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Fertilizer                        120 non-null    object 
 1   Light Exposure                    120 non-null    object 
 2   Plant Height (cm)                 120 non-null    float64
 3   Leaf Area (cm²)                   120 non-null    float64
 4   Chlorophyll Content (SPAD units)  120 non-null    float64
 5   Root Length (cm)                  120 non-null    float64
 6   Biomass (g)                       120 non-null    float64
 7   Flower Count (number)             120 non-null    float64
 8   Seed Yield (g)                    120 non-null    float64
 9   Stomatal Conductance (mmol/m²/s)  120 non-null    float64
dtypes: float64(8), object(2)
memory usage: 9.5+ KB


<h4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>3.1: Columns Summary</h4>

To begin the analysis, it is important to explore the dataset by summarizing its structure and key attributes. This involves examining the **data types (dtypes)** of each column to determine whether they contain numerical or categorical values, which helps in selecting appropriate analytical techniques. Additionally, checking the **number of unique values** in each column provides insight into the variability of the data, distinguishing between continuous and discrete features.  

Assessing **distinct values** allows for a better understanding of the diversity within each variable, while identifying **missing values** is essential to evaluate data completeness and potential gaps that may require handling. Lastly, reviewing the **count of non-null entries** ensures the dataset’s integrity and helps in deciding whether any preprocessing steps, such as data imputation or cleaning, are necessary. This exploratory step lays the foundation for effective analysis and meaningful insights.

In [117]:
def column_summary(df):
    summary_data = []
    
    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()
        
        if num_of_distinct_values <= 10:
            distinct_values_counts = df[col_name].value_counts().to_dict()
        else:
            top_10_values_counts = df[col_name].value_counts().head(10).to_dict()
            distinct_values_counts = {k: v for k, v in sorted(top_10_values_counts.items(), key=lambda item: item[1], reverse=True)}

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
            'distinct_values_counts': distinct_values_counts
        })
    
    summary_df = pd.DataFrame(summary_data)
    return summary_df


summary_df = column_summary(df)
display(summary_df)

Unnamed: 0,col_name,col_dtype,num_of_nulls,num_of_non_nulls,num_of_distinct_values,distinct_values_counts
0,Fertilizer,object,0,120,3,"{'Control': 41, 'Synthetic': 40, 'Organic': 39}"
1,Light Exposure,object,0,120,3,"{'Full Shade': 44, 'Full Sun': 40, 'Partial Sh..."
2,Plant Height (cm),float64,0,120,120,"{58.56151388665052: 1, 46.696826238466286: 1, ..."
3,Leaf Area (cm²),float64,0,120,120,"{185.73856643236132: 1, 138.7980608962804: 1, ..."
4,Chlorophyll Content (SPAD units),float64,0,120,120,"{46.5196207922374: 1, 34.69363266870892: 1, 51..."
5,Root Length (cm),float64,0,120,120,"{24.31891050096943: 1, 17.6585349528435: 1, 33..."
6,Biomass (g),float64,0,120,120,"{11.994074041165357: 1, 8.667791843721698: 1, ..."
7,Flower Count (number),float64,0,120,120,"{19.53594616947752: 1, 15.366158832462084: 1, ..."
8,Seed Yield (g),float64,0,120,120,"{6.687959618540082: 1, 6.165373569255893: 1, 8..."
9,Stomatal Conductance (mmol/m²/s),float64,0,120,120,"{242.41380014645895: 1, 233.65862057163417: 1,..."


<h4 style='font-size: 18px; color: Blue; font-family: Colonna MT; font-weight: 600'>3.2: Checking Missing Values</h4>

Checking for missing values is a crucial step in data analysis to assess the completeness and reliability of the dataset. This involves identifying any columns with null or empty entries, which may affect the accuracy of statistical and machine learning models. Missing values can arise due to various reasons, such as incomplete survey responses or data collection errors.

In [118]:
def missing_values_info(df):
    # Count missing values per column and express them as a percentage of all rows
    isna_df = df.isna().sum().reset_index(name='Missing Values Counts')
    isna_df['Proportions (%)'] = isna_df['Missing Values Counts']/len(df)*100
    return isna_df

isna_df = missing_values_info(df)
isna_df

Unnamed: 0,index,Missing Values Counts,Proportions (%)
0,Fertilizer,0,0.0
1,Light Exposure,0,0.0
2,Plant Height (cm),0,0.0
3,Leaf Area (cm²),0,0.0
4,Chlorophyll Content (SPAD units),0,0.0
5,Root Length (cm),0,0.0
6,Biomass (g),0,0.0
7,Flower Count (number),0,0.0
8,Seed Yield (g),0,0.0
9,Stomatal Conductance (mmol/m²/s),0,0.0


<h4 style='font-size:18px; color: Blue; font-family: Colonna MT; font-weight: 600'>3.3: Renaming Columns</h4>

When working with data, especially with formula-based tests like ANOVA in `statsmodels`, column names containing spaces or special characters (parentheses, slashes, superscripts) can cause parsing errors. To avoid this, the `rename` helper below strips every character that is not an ASCII letter, so `Plant Height (cm)` becomes `PlantHeightcm`. Letter case is preserved, and digits are removed along with spaces and punctuation. The helper is applied automatically inside the ANOVA functions later in the notebook, so the original DataFrame keeps its readable column names.

In [119]:
def rename(text):
    # Keep ASCII letters only: spaces, digits, punctuation and symbols such as ² are dropped
    text = re.sub(r'[^a-zA-Z]', "", text)
    return text

test_df = df.copy()
test_df = test_df.rename(columns={col: rename(col) for col in test_df.columns})
print("\n Columns names before renaming\n")
print(df.columns)
print("\n Columns names after renaming\n")
print(test_df.columns)


 Columns names before renaming

Index(['Fertilizer', 'Light Exposure', 'Plant Height (cm)', 'Leaf Area (cm²)',
       'Chlorophyll Content (SPAD units)', 'Root Length (cm)', 'Biomass (g)',
       'Flower Count (number)', 'Seed Yield (g)',
       'Stomatal Conductance (mmol/m²/s)'],
      dtype='object')

 Columns names after renaming

Index(['Fertilizer', 'LightExposure', 'PlantHeightcm', 'LeafAreacm',
       'ChlorophyllContentSPADunits', 'RootLengthcm', 'Biomassg',
       'FlowerCountnumber', 'SeedYieldg', 'StomatalConductancemmolms'],
      dtype='object')
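
If lowercase names with underscores are preferred, a variant of the cleaning step could look like the sketch below. The function name `to_snake_case` is hypothetical and not part of the notebook; the ANOVA functions later on keep using `rename`.

```python
import re

def to_snake_case(name):
    """Illustrative alternative to `rename`: lowercase with underscores."""
    # Replace every run of non-alphanumeric characters with a single underscore
    cleaned = re.sub(r'[^0-9a-zA-Z]+', '_', name)
    # Drop leading/trailing underscores and lowercase the result
    return cleaned.strip('_').lower()

example_columns = ['Fertilizer', 'Light Exposure', 'Leaf Area (cm²)',
                   'Stomatal Conductance (mmol/m²/s)']
for col in example_columns:
    print(f"{col!r} -> {to_snake_case(col)!r}")
```

Unlike `rename`, this variant keeps digits, so a column whose name starts with a digit would still need extra handling before it can be used in a model formula.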


<h1 style='font-size: 20px; font-family: Colonna MT; font-weight: 600'>4.0: ANOVA Assumption Validation</h1>

<H4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>4.1: Homogeneity of Variance (Homoskedasticity)</H4>

Levene's test assesses **homogeneity of variance (homoskedasticity)**, a key assumption in analyses like **ANOVA**. Its null hypothesis is that all groups have equal variances, so a p-value above 0.05 means there is no significant evidence against the assumption, while a p-value below 0.05 indicates that the variances differ. If the assumption is violated, alternatives such as **Welch's ANOVA** or a data transformation may be needed to obtain reliable results. In the results table below, ✔ marks variables where equal variances are not rejected and ✖ marks variables where they are.

In [127]:
def Levene_test(df, group_cols, numeric_cols=None): 
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
        for g in group_cols:
            if g in numeric_cols:
                numeric_cols.remove(g)
    
    results = []
    for group_col in group_cols:
        for col in numeric_cols:
            grouped_data = [g[col].dropna().values for _, g in df.groupby(group_col)]
            if all(len(g) > 1 for g in grouped_data):  # Ensure each group has enough data
                levene_stat, levene_p = levene(*grouped_data)
                #interpretation = 'Homoscedasticity' if levene_p > 0.05 else 'Heteroscedasticity'
                interpretation = '✔' if levene_p > 0.05 else '✖'
                
            else:
                levene_stat, levene_p, interpretation = None, None, 'Insufficient data'
            
            results.append({
                'Group Column': group_col,
                'Variable': col,
                'Test Statistic': levene_stat,
                'P-Value': levene_p,
                'Interpretation': interpretation
            })
    
    return pd.DataFrame(results)

# Example usage
result_df = Levene_test(df, group_cols=['Fertilizer', 'Light Exposure'])
display(result_df)

Unnamed: 0,Group Column,Variable,Test Statistic,P-Value,Interpretation
0,Fertilizer,Plant Height (cm),5.402929,0.005697095,✖
1,Fertilizer,Leaf Area (cm²),7.773899,0.0006762083,✖
2,Fertilizer,Chlorophyll Content (SPAD units),3.918409,0.02253334,✖
3,Fertilizer,Root Length (cm),4.070979,0.01953442,✖
4,Fertilizer,Biomass (g),11.775226,2.191223e-05,✖
5,Fertilizer,Flower Count (number),4.619853,0.01171949,✖
6,Fertilizer,Seed Yield (g),3.205186,0.04413678,✖
7,Fertilizer,Stomatal Conductance (mmol/m²/s),2.583279,0.07982908,✔
8,Light Exposure,Plant Height (cm),19.047867,6.898926e-08,✖
9,Light Exposure,Leaf Area (cm²),23.555614,2.530944e-09,✖
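
To build intuition for how Levene's test reacts to unequal spread, here is a minimal sketch on synthetic data. The group sizes, means, and standard deviations are assumptions chosen for illustration and have nothing to do with the experiment's data.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(0)

# Three groups with the same standard deviation (5)
equal = [rng.normal(50, 5, 200) for _ in range(3)]
# Three groups where the last one is far more spread out
unequal = [rng.normal(50, sd, 200) for sd in (1, 1, 5)]

# SciPy's default centers on the median (the Brown-Forsythe variant)
stat_eq, p_eq = levene(*equal)
stat_un, p_un = levene(*unequal)

print(f"equal spread:   W = {stat_eq:.3f}, p = {p_eq:.4f}")
print(f"unequal spread: W = {stat_un:.3f}, p = {p_un:.3g}")
```

The second call should produce a much larger statistic and a very small p-value, mirroring the ✖ rows in the table above.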


<h3 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>4.2: Normal Distribution (Normality Test)</h3>

- In statistical analysis, assessing whether data follows a normal distribution is a critical preliminary step, particularly before applying parametric tests such as ANOVA or t-tests. The assumption of normality underpins the reliability of these tests, as violations can lead to misleading results and incorrect conclusions. To evaluate this, normality tests are employed to determine if the distribution of a dataset aligns closely with a theoretical normal distribution. By verifying this assumption, analysts can decide whether the data is suitable for parametric testing or if alternative methods, such as data transformation or non-parametric tests, are more appropriate.

- In our analysis, we use the **Shapiro-Wilk test** to examine normality, since it performs well with small to moderately sized samples. The test measures how closely the ordered sample values match those expected from a normal distribution and yields a **W statistic** and a **p-value**. A **p-value greater than 0.05** means we fail to reject the null hypothesis of normality, so the data are consistent with a normal distribution (this is not proof of normality). Conversely, a **p-value less than 0.05** indicates a significant deviation from normality.

- To complement this, the helper below can optionally draw on the Central Limit Theorem (CLT) through **bootstrapping**. By repeatedly resampling the data with replacement and computing the mean of each resample, we approximate the sampling distribution of the mean. When bootstrapping is enabled, the Shapiro-Wilk test is applied to this distribution of resampled means rather than to the raw data, which checks whether the means (rather than individual observations) are approximately normal, in line with the CLT. If bootstrapping is disabled, the test is applied directly to the original observations in each group, giving the more traditional view of the data's normality. Offering both options gives some flexibility when judging whether the data suit further parametric analysis.
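
The effect of bootstrapping on the shape of the distribution can be illustrated with a small sketch. The skewed sample below is synthetic (drawn from an exponential distribution as an assumption for this example), and skewness is used here only as a simple summary of asymmetry.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# A clearly right-skewed synthetic sample (not from the dataset)
raw = rng.exponential(scale=10.0, size=100)

# Resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(raw, size=raw.size, replace=True).mean()
    for _ in range(1000)
])

skew_raw = skew(raw)
skew_means = skew(boot_means)
print(f"skewness of raw data:        {skew_raw:.3f}")
print(f"skewness of bootstrap means: {skew_means:.3f}")
```

The resampled means are typically far more symmetric than the raw observations, which is why a normality test on bootstrap means can pass even when the test on the raw data fails.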


In [121]:
def bootstrapping(df, column, num_samples=1000, sample_size=30):
    sample_means = []
    for _ in range(num_samples):
        sample = df[column].dropna().sample(n=sample_size, replace=True)
        sample_means.append(sample.mean())
    return sample_means

def shapiro_wilk_test(df, group_col, numeric_cols=None, use_bootstrap=True, num_samples=1000, sample_size=30): 
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
        if group_col in numeric_cols:
            numeric_cols.remove(group_col)
    
    results = []
    for group, group_df in df.groupby(group_col):
        for col in numeric_cols:
            if use_bootstrap:
                data = bootstrapping(group_df, col, num_samples=num_samples, sample_size=sample_size)
            else:
                data = group_df[col].dropna()
                
            if len(data) >= 3:  # Shapiro requires at least 3 values
                stat, p_value = shapiro(data)
                interpretation = 'Normal' if p_value > 0.05 else 'Not Normal'
            else:
                stat, p_value, interpretation = None, None, 'Insufficient data'
            
            results.append({
                'Group': group,
                'Variable': col,
                'Test Statistic': stat,
                'P-Value': p_value,
                'Interpretation': interpretation,
                'Used Bootstrap': use_bootstrap
            })
        
    results_df = pd.DataFrame(results)
    return results_df

result_df = shapiro_wilk_test(df, group_col='Fertilizer', use_bootstrap=False)
display(result_df)

Unnamed: 0,Group,Variable,Test Statistic,P-Value,Interpretation,Used Bootstrap
0,Control,Plant Height (cm),0.921272,0.007493,Not Normal,False
1,Control,Leaf Area (cm²),0.971742,0.392311,Normal,False
2,Control,Chlorophyll Content (SPAD units),0.946995,0.054914,Normal,False
3,Control,Root Length (cm),0.965686,0.247463,Normal,False
4,Control,Biomass (g),0.962779,0.196585,Normal,False
5,Control,Flower Count (number),0.968992,0.319532,Normal,False
6,Control,Seed Yield (g),0.938713,0.028394,Not Normal,False
7,Control,Stomatal Conductance (mmol/m²/s),0.957663,0.130226,Normal,False
8,Organic,Plant Height (cm),0.923545,0.011217,Not Normal,False
9,Organic,Leaf Area (cm²),0.905155,0.003111,Not Normal,False
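
A Q-Q check can complement the Shapiro-Wilk results. The sketch below uses `scipy.stats.probplot` on synthetic data; the DataFrame `demo` and its column names are hypothetical. It looks at residuals (each value minus its group mean), since the normality assumption in ANOVA concerns the residuals rather than the pooled raw values.

```python
import numpy as np
import pandas as pd
from scipy.stats import probplot

rng = np.random.default_rng(1)

# Synthetic example: three groups with different means and a common spread
demo = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 100),
    "value": np.concatenate([rng.normal(m, 2.0, 100) for m in (10, 12, 15)]),
})

# Residuals: each observation minus the mean of its own group
residuals = demo["value"] - demo.groupby("group")["value"].transform("mean")

# probplot returns the Q-Q coordinates and a least-squares fit through them
(theoretical_q, ordered_resid), (slope, intercept, r) = probplot(residuals, dist="norm")
print(f"Q-Q correlation r = {r:.4f}")
```

An `r` close to 1 means the points lie near a straight line, which is what normally distributed residuals produce. Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `probplot` draws the plot itself.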


<h1 style='font-family: Colonna MT; font-weight: 600; font-size: 20px; text-align: left'>5.0. Analysis of Variance (ANOVA)</h1>

<h4 style='font-size: 18px; color: Blue; font-family: Colonna MT; font-weight: 600'>5.1: Analysis of Variance (One Way ANOVA)</h4>

A **one-way ANOVA** (Analysis of Variance) is a statistical test used to determine if there are significant differences between the means of three or more independent groups based on a single factor (or independent variable). It assesses whether the factor has an effect on the dependent variable. If the p-value from the test is less than a specified significance level (usually 0.05), it suggests that at least one group mean is significantly different from the others. The test assumes that the data is normally distributed, the variances are equal across groups (homogeneity of variance), and the observations are independent.

In [122]:
def One_way_anova(data, Metrics, group_cols):
    results = []
    group_cols = [rename(col) for col in group_cols]
    data = data.rename(columns={col: rename(col) for col in data.columns})
    for group in group_cols:
        for col in Metrics:
            column_name = rename(col)  
            formula = f"{column_name} ~ C({group})" 
            model = smf.ols(formula, data=data).fit()
            anova_table = sm.stats.anova_lm(model, typ=2)
            for source, row in anova_table.iterrows():
                p_value = row["PR(>F)"]
                interpretation = "Significant" if p_value < 0.05 else "No significant"
                if source == "Residual": interpretation = "-"
        
                results.append({
                    "Variable": col,
                    "Factor": group.title(),
                    "Source": source,
                    "Sum Sq": row["sum_sq"],
                    "df": row["df"],
                    "F-Value": row["F"],
                    "p-Value": p_value,
                    "Interpretation": interpretation
                })

    return pd.DataFrame(results)

group_cols = ["Fertilizer", "Light Exposure"]
Metrics = ['Plant Height (cm)', 'Leaf Area (cm²)',
       'Chlorophyll Content (SPAD units)', 'Root Length (cm)', 'Biomass (g)',
       'Flower Count (number)', 'Seed Yield (g)']

Anova_results = One_way_anova(df, Metrics, group_cols)
Anova_results

Unnamed: 0,Variable,Factor,Source,Sum Sq,df,F-Value,p-Value,Interpretation
0,Plant Height (cm),Fertilizer,C(Fertilizer),2540.401132,2.0,6.196237,0.002768186,Significant
1,Plant Height (cm),Fertilizer,Residual,23984.469551,117.0,,,-
2,Leaf Area (cm²),Fertilizer,C(Fertilizer),14394.904665,2.0,3.557773,0.03162578,Significant
3,Leaf Area (cm²),Fertilizer,Residual,236693.540764,117.0,,,-
4,Chlorophyll Content (SPAD units),Fertilizer,C(Fertilizer),529.814969,2.0,2.80544,0.06455443,No significant
5,Chlorophyll Content (SPAD units),Fertilizer,Residual,11047.885633,117.0,,,-
6,Root Length (cm),Fertilizer,C(Fertilizer),229.565194,2.0,4.086024,0.0192616,Significant
7,Root Length (cm),Fertilizer,Residual,3286.706916,117.0,,,-
8,Biomass (g),Fertilizer,C(Fertilizer),70.052523,2.0,4.447828,0.01374816,Significant
9,Biomass (g),Fertilizer,Residual,921.364965,117.0,,,-
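
A significant one-way ANOVA only says that at least one group mean differs; a post-hoc test is needed to see which pairs differ. The sketch below uses `scipy.stats.tukey_hsd` (available in recent SciPy releases) on synthetic groups. The group names echo the fertilizer levels, but the numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(7)

# Synthetic groups: the third has a clearly higher mean (not real data)
control = rng.normal(50, 5, 40)
organic = rng.normal(50, 5, 40)
synthetic = rng.normal(60, 5, 40)

res = tukey_hsd(control, organic, synthetic)
labels = ["control", "organic", "synthetic"]

# res.statistic holds pairwise mean differences, res.pvalue the adjusted p-values
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"{labels[i]} vs {labels[j]}: "
              f"diff = {res.statistic[i, j]:.2f}, p = {res.pvalue[i, j]:.4g}")
```

Tukey's HSD assumes equal variances across groups. Since Levene's test flags unequal variances for most variables in this dataset, a procedure designed for heteroscedastic data (such as Games-Howell) may be a better fit in practice.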


<h4 style='font-size: 18px; color: Blue; font-family: Colonna MT; font-weight: 600'>5.2: Two Way ANOVA (Interaction Effect)</h4>

The **interaction effect** in a two-way ANOVA examines how two independent variables (factors) jointly influence the dependent variable. It tests if the effect of one factor changes depending on the level of the other factor. If there’s **no interaction**, the effect of one factor is the same at all levels of the other factor. If there **is an interaction**, the combined effect of the factors is more complex, and their influence on the dependent variable is not simply additive. Understanding the interaction is important because it helps reveal how factors work together to affect the outcome, which may be missed when only considering main effects separately.
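
A tiny numeric illustration may help. The cell means below are hypothetical, chosen only to show what "not simply additive" looks like in a 2×2 layout.

```python
import pandas as pd

# Hypothetical mean plant heights for each fertilizer/light combination
cell_means = pd.DataFrame(
    {"Full Shade": [10.0, 14.0], "Full Sun": [12.0, 22.0]},
    index=["Control", "Organic"],
)

# Effect of switching from Control to Organic at each light level
effect_in_shade = cell_means.loc["Organic", "Full Shade"] - cell_means.loc["Control", "Full Shade"]
effect_in_sun = cell_means.loc["Organic", "Full Sun"] - cell_means.loc["Control", "Full Sun"]

# With no interaction the two effects would be equal, so their difference would be 0
interaction_contrast = effect_in_sun - effect_in_shade

print(cell_means)
print(f"fertilizer effect in shade: {effect_in_shade}")
print(f"fertilizer effect in sun:   {effect_in_sun}")
print(f"interaction contrast:       {interaction_contrast}")
```

Here the fertilizer adds 4 units in shade but 10 units in sun, so the contrast is 6 rather than 0. The interaction term in a two-way ANOVA tests whether such contrasts differ from zero by more than chance would explain.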

In [123]:
def two_way_anova_all(data, numerical_columns, Factor1, Factor2):
    results = []

    Factor1 = rename(Factor1)
    Factor2 = rename(Factor2)
    data = data.rename(columns={col: rename(col) for col in data.columns})
    
    for response_column in numerical_columns:
        # All columns were already renamed above, so only the formula name is needed here
        safe_column_name = rename(response_column)
        formula = f"{safe_column_name} ~ C({Factor1}) + C({Factor2}) + C({Factor1}):C({Factor2})"
        
        model = ols(formula, data=data).fit()
        anova_table = sm.stats.anova_lm(model, typ=2)
        for source, row in anova_table.iterrows():
            p_value = row["PR(>F)"]
            interpretation = "Significant difference" if p_value < 0.05 else "No significant difference"
            if source == "Residual":
                interpretation = "-"
                
            results.append({
                "Variable": response_column,
                "Source": source,
                "Sum Sq": row["sum_sq"],
                "df": row["df"],
                "F-Value": row["F"],
                "p-Value": p_value,
                "Interpretation": interpretation
            })

    results_df = pd.DataFrame(results)
    return results_df

Factor1, Factor2 = "Fertilizer", "Light Exposure"
numerical_columns = df.select_dtypes(include=["float64", "int64"]).columns
Interaction_anova = two_way_anova_all(df, numerical_columns, Factor1, Factor2)
Interaction_anova

Unnamed: 0,Variable,Source,Sum Sq,df,F-Value,p-Value,Interpretation
0,Plant Height (cm),C(Fertilizer),3092.828782,2.0,48.502082,7.28497e-16,Significant difference
1,Plant Height (cm),C(LightExposure),18697.924478,2.0,293.222912,5.018729e-45,Significant difference
2,Plant Height (cm),C(Fertilizer):C(LightExposure),1747.480541,4.0,13.702091,4.110094e-09,Significant difference
3,Plant Height (cm),Residual,3539.064533,111.0,,,-
4,Leaf Area (cm²),C(Fertilizer),19516.697898,2.0,35.5914,1.140949e-12,Significant difference
5,Leaf Area (cm²),C(LightExposure),177708.337114,2.0,324.075748,4.541468e-47,Significant difference
6,Leaf Area (cm²),C(Fertilizer):C(LightExposure),28551.538747,4.0,26.033841,3.125877e-15,Significant difference
7,Leaf Area (cm²),Residual,30433.664903,111.0,,,-
8,Chlorophyll Content (SPAD units),C(Fertilizer),772.906707,2.0,18.855267,8.921936e-08,Significant difference
9,Chlorophyll Content (SPAD units),C(LightExposure),7926.072476,2.0,193.358671,6.805835e-37,Significant difference


<h4 style='font-size: 18px; color: Blue; font-family: Colonna MT; font-weight: 600'>5.3: Welch's ANOVA (Welch's F test)</h4>

Welch's ANOVA (often called Welch's F test) is a statistical test used to compare the means of three or more groups when the assumption of equal variances (homoscedasticity) among the groups is violated. It is an adaptation of the traditional one-way ANOVA that is more robust in the presence of heteroscedasticity (unequal variances) and unequal sample sizes.

In [124]:
def welchs_anova(data, Metrics, group_cols):
    results = []
    
    group_cols = [rename(col) for col in group_cols]
    data = data.rename(columns={col: rename(col) for col in data.columns})
    for group in group_cols:
        for col in Metrics:
            column_name = rename(col)
            
            # Perform Welch's ANOVA using pingouin
            aov = pg.welch_anova(data=data, dv=column_name, between=group)
            
            for _, row in aov.iterrows():
                p_value = row["p-unc"]
                interpretation = "Significant difference" if p_value < 0.05 else "No significant difference"
                results.append({
                    "Variable": col,
                    "Grouping Factor": group.title(),
                    "Source": row["Source"],
                    "df": row["ddof1"],  # Degrees of freedom between groups
                    "F-Value": row["F"],
                    "p-Value": p_value,
                    "Interpretation": interpretation
                })

    return pd.DataFrame(results)

group_cols = ["Fertilizer", "Light Exposure"]
Metrics = df.select_dtypes(include=["float64", "int64"]).columns
welch_results = welchs_anova(df, Metrics, group_cols)
welch_results

Unnamed: 0,Variable,Grouping Factor,Source,df,F-Value,p-Value,Interpretation
0,Plant Height (cm),Fertilizer,Fertilizer,2,7.497738,0.001094265,Significant difference
1,Leaf Area (cm²),Fertilizer,Fertilizer,2,4.666998,0.01252944,Significant difference
2,Chlorophyll Content (SPAD units),Fertilizer,Fertilizer,2,2.957898,0.05811551,No significant difference
3,Root Length (cm),Fertilizer,Fertilizer,2,5.159282,0.008024587,Significant difference
4,Biomass (g),Fertilizer,Fertilizer,2,6.571393,0.002437003,Significant difference
5,Flower Count (number),Fertilizer,Fertilizer,2,2.874003,0.06289439,No significant difference
6,Seed Yield (g),Fertilizer,Fertilizer,2,4.40389,0.01559353,Significant difference
7,Stomatal Conductance (mmol/m²/s),Fertilizer,Fertilizer,2,5.685189,0.005018764,Significant difference
8,Plant Height (cm),Lightexposure,LightExposure,2,150.46019,1.413111e-25,Significant difference
9,Leaf Area (cm²),Lightexposure,LightExposure,2,146.717737,1.258094e-25,Significant difference
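
For reference, the statistic behind Welch's ANOVA can be computed by hand. The sketch below follows the standard textbook formula for Welch's F; the function name `welch_f` and the synthetic groups are assumptions for illustration, and this is not a description of how `pingouin` implements it internally. As a sanity check, with two groups the result should match the square of Welch's t statistic.

```python
import numpy as np
from scipy.stats import f, ttest_ind

def welch_f(*groups):
    """Welch's F test for k independent groups (illustrative sketch)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    # Each group is weighted by the precision of its mean
    w = n / variances
    w_sum = w.sum()
    weighted_mean = (w * means).sum() / w_sum

    numerator = (w * (means - weighted_mean) ** 2).sum() / (k - 1)
    lam = (((1 - w / w_sum) ** 2) / (n - 1)).sum()
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)   # adjusted (non-integer) denominator df
    return f_stat, df1, df2, f.sf(f_stat, df1, df2)

rng = np.random.default_rng(3)
# Synthetic groups with unequal variances and unequal sizes (not real data)
g1 = rng.normal(50, 2, 30)
g2 = rng.normal(52, 6, 45)
g3 = rng.normal(55, 10, 25)

f_stat, df1, df2, p_value = welch_f(g1, g2, g3)
print(f"Welch F = {f_stat:.3f}, df = ({df1}, {df2:.2f}), p = {p_value:.4g}")

# Sanity check: for two groups, Welch's F equals the squared Welch t statistic
f_two, _, _, p_two = welch_f(g1, g2)
t_stat, p_t = ttest_ind(g1, g2, equal_var=False)
print(f"two-group F = {f_two:.4f}, t^2 = {t_stat ** 2:.4f}")
```

The weights `n / variance` mean that noisy groups contribute less to the comparison, which is what makes the test tolerant of unequal variances.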


---

This analysis was performed by **Jabulente**, a passionate and dedicated data scientist with a strong commitment to using data to drive meaningful insights and solutions. For inquiries, collaborations, or further discussions, please feel free to reach out via the links below.

    
<div align="center">  
    
[![GitHub](https://img.shields.io/badge/GitHub-Jabulente-black?logo=github)](https://github.com/Jabulente)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-Jabulente-blue?logo=linkedin)](https://linkedin.com/in/jabulente-208019349)  [![X (Twitter)](https://img.shields.io/badge/X-@Jabulente-black?logo=x)](https://x.com/Jabulente)  [![Instagram](https://img.shields.io/badge/Instagram-@Jabulente-purple?logo=instagram)](https://instagram.com/Jabulente)  [![Threads](https://img.shields.io/badge/Threads-@Jabulente-black?logo=threads)](https://threads.net/@Jabulente)  [![TikTok](https://img.shields.io/badge/TikTok-@Jabulente-teal?logo=tiktok)](https://tiktok.com/@Jabulente)  [![Email](https://img.shields.io/badge/Email-jabulente@hotmail.com-red?logo=gmail)](mailto:Jabulente@hotmail.com)  

</div>


<h1 style='font-size: 55px; color: Tomato; font-family: Colonna MT; font-weight: 700; text-align: center'>THE END</h1>