<h1 style='font-size: 25px; color: crimson; font-family: Colonna MT; font-weight: 600; text-align: center'>Independent (Two-Sample) T-Test</h1>

---

The independent two-sample t-test compares the means of two independent groups to see if they are significantly different. This test is commonly used when you are comparing two distinct groups that have no relationship, such as two different treatment groups in an experiment or two separate populations. 

**Example**: <span style='color: green'>*A researcher might want to compare the average growth and yield performance of wheat grown with organic fertilizer versus synthetic fertilizer. The two samples (wheat grown with organic fertilizer and wheat grown with synthetic fertilizer) are independent of each other, and the test would assess whether the two fertilizers produce significantly different yields.*</span>


**Assumptions for the independent t-test:**
    
1. The data in both groups should be independently sampled.

2. Each group should ideally follow a normal distribution.

3. The variances of the two groups should be equal (homogeneity of variance). If this assumption is violated, alternative tests such as Welch’s t-test may be used.
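
To make the third assumption concrete, here is a minimal sketch that runs both the pooled-variance (Student) and Welch versions of the test on the same data. The two groups are synthetic: their means, spreads, and sizes are made up for illustration and are not taken from the dataset analysed later.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic, illustrative data (NOT the notebook's dataset):
# two groups with different means, spreads, and sizes.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=55, scale=15, size=25)

# Student's t-test pools the two variances (assumes they are equal).
student = ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test estimates each group's variance separately.
welch = ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

When the group sizes and variances are similar, the two versions give nearly the same answer; they tend to diverge when both the sizes and the spreads differ, as in this sketch.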

<h4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>1.0: Import required libraries</h4>

In [14]:
# Statistical tests and pairwise group combinations
from scipy.stats import shapiro, levene, skew, kurtosis
from scipy.stats import ttest_ind
from itertools import combinations

# Data manipulation and visualization
import matplotlib.pyplot as plt
import seaborn as sns  
import pandas as pd  
import numpy as np 
import re 

import warnings  
warnings.simplefilter("ignore")  
pd.set_option('display.max_columns', 7) 
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print("....Libraries Loaded Successfully....")

....Libraries Loaded Successfully....


<h4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>3.0: Import and Preprocessing Dataset</h4>

In [5]:
filepath = "Datasets/Fertilizer and Light Exposure Experiment Dataset.csv"
df = pd.read_csv(filepath)
display(df)

Unnamed: 0,Fertilizer,Light Exposure,Plant Height (cm),...,Flower Count (number),Seed Yield (g),Stomatal Conductance (mmol/m²/s)
0,Control,Full Sun,58.56,...,19.54,6.69,242.41
1,Organic,Full Shade,46.70,...,15.37,6.17,233.66
2,Control,Partial Shade,58.33,...,16.39,5.41,230.07
3,Control,Full Shade,42.73,...,12.45,4.26,154.25
4,Organic,Full Shade,41.82,...,15.14,4.64,200.54
...,...,...,...,...,...,...,...
115,Synthetic,Partial Shade,65.24,...,21.14,7.48,254.78
116,Organic,Partial Shade,63.56,...,16.11,6.17,234.22
117,Control,Partial Shade,62.75,...,17.99,6.18,278.97
118,Control,Full Shade,39.60,...,13.72,4.46,186.87



<h2 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>4.0: Exploratory Data Analysis</h2>

Now, let’s move into Exploratory Data Analysis (EDA) — an important step where we take a closer look at our dataset to understand its structure, identify patterns, detect anomalies, and get a sense of the overall distribution of plant parameters. This will help us gain valuable insights and guide the direction of our statistical tests and interpretations.


<h3 style='font-size: 15px; font-weight: 600'>4.1: Dataset Information Overview</h3>


In [6]:
df.shape

(120, 10)

In [7]:
for column in df.columns.tolist(): print(f"{'-'*15} {column}")

--------------- Fertilizer
--------------- Light Exposure
--------------- Plant Height (cm)
--------------- Leaf Area (cm²)
--------------- Chlorophyll Content (SPAD units)
--------------- Root Length (cm)
--------------- Biomass (g)
--------------- Flower Count (number)
--------------- Seed Yield (g)
--------------- Stomatal Conductance (mmol/m²/s)


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Fertilizer                        120 non-null    object 
 1   Light Exposure                    120 non-null    object 
 2   Plant Height (cm)                 120 non-null    float64
 3   Leaf Area (cm²)                   120 non-null    float64
 4   Chlorophyll Content (SPAD units)  120 non-null    float64
 5   Root Length (cm)                  120 non-null    float64
 6   Biomass (g)                       120 non-null    float64
 7   Flower Count (number)             120 non-null    float64
 8   Seed Yield (g)                    120 non-null    float64
 9   Stomatal Conductance (mmol/m²/s)  120 non-null    float64
dtypes: float64(8), object(2)
memory usage: 9.5+ KB


<h4 style='font-size: 15px; font-weight: 600'>4.1.2: Columns Summary</h4>

In [9]:
def column_summary(df):
    summary_data = []
    
    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()
        
        if num_of_distinct_values <= 10:
            distinct_values_counts = df[col_name].value_counts().to_dict()
        else:
            top_10_values_counts = df[col_name].value_counts().head(10).to_dict()
            distinct_values_counts = {k: v for k, v in sorted(top_10_values_counts.items(), key=lambda item: item[1], reverse=True)}

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
            'distinct_values_counts': distinct_values_counts
        })
    
    summary_df = pd.DataFrame(summary_data)
    return summary_df


summary_df = column_summary(df)
display(summary_df)

Unnamed: 0,col_name,col_dtype,num_of_nulls,num_of_non_nulls,num_of_distinct_values,distinct_values_counts
0,Fertilizer,object,0,120,3,"{'Control': 41, 'Synthetic': 40, 'Organic': 39}"
1,Light Exposure,object,0,120,3,"{'Full Shade': 44, 'Full Sun': 40, 'Partial Sh..."
2,Plant Height (cm),float64,0,120,120,"{58.56151388665052: 1, 46.696826238466286: 1, ..."
3,Leaf Area (cm²),float64,0,120,120,"{185.73856643236132: 1, 138.7980608962804: 1, ..."
4,Chlorophyll Content (SPAD units),float64,0,120,120,"{46.5196207922374: 1, 34.69363266870892: 1, 40..."
5,Root Length (cm),float64,0,120,120,"{24.31891050096943: 1, 17.6585349528435: 1, 26..."
6,Biomass (g),float64,0,120,120,"{11.994074041165357: 1, 8.667791843721698: 1, ..."
7,Flower Count (number),float64,0,120,120,"{19.53594616947752: 1, 15.366158832462084: 1, ..."
8,Seed Yield (g),float64,0,120,120,"{6.687959618540082: 1, 6.165373569255893: 1, 5..."
9,Stomatal Conductance (mmol/m²/s),float64,0,120,120,"{242.41380014645895: 1, 233.65862057163417: 1,..."


<h4 style='font-size: 15px; font-weight: 600'>4.1.4: Checking Missing Values</h4>

Checking for missing values is a crucial step in data analysis to assess the completeness and reliability of the dataset. This involves identifying any columns with null or empty entries, which may affect the accuracy of statistical models.

In [10]:
def Missing_values_info(df):
    # Count missing values per column and express them as a percentage of all rows
    isna_df = df.isna().sum().reset_index(name='Missing Values Counts')
    isna_df['Proportions (%)'] = isna_df['Missing Values Counts']/len(df)*100
    return isna_df
    
isna_df = Missing_values_info(df)
isna_df

Unnamed: 0,index,Missing Values Counts,Proportions (%)
0,Fertilizer,0,0.0
1,Light Exposure,0,0.0
2,Plant Height (cm),0,0.0
3,Leaf Area (cm²),0,0.0
4,Chlorophyll Content (SPAD units),0,0.0
5,Root Length (cm),0,0.0
6,Biomass (g),0,0.0
7,Flower Count (number),0,0.0
8,Seed Yield (g),0,0.0
9,Stomatal Conductance (mmol/m²/s),0,0.0



<h4 style='font-size: 15px; color: green; font-family: Colonna MT; font-weight: 600'>4.1.5: Exploring Invalid Entries and Data Types</h4>


Exploring invalid entries in data types involves identifying values that do not match the expected format or category within each column. This includes detecting inconsistencies such as numerical values in categorical fields, incorrect data formats, or unexpected symbols and typos. Invalid entries can lead to errors in analysis and model performance, making it essential to standardize data types and correct anomalies.

In [11]:
def simplify_dtype(value):
    # Classify a single cell value. Missing is checked first because NaN is itself a float.
    if pd.isna(value): return 'Missing'
    elif isinstance(value, (int, float, np.number)): return 'Numeric'
    elif isinstance(value, (np.datetime64, pd.Timestamp)): return 'Datetime'
    elif isinstance(value, str): return 'String'
    else: return 'Other'

def analyze_column_dtypes(df):
    all_dtypes = {'Numeric', 'Datetime', 'String', 'Missing', 'Other'}
    results = pd.DataFrame(index=df.columns, columns=list(all_dtypes), dtype=object).fillna('-')
    
    for column in df.columns:
        # Share of each value category in the column, as a percentage of all rows
        dtypes = df[column].apply(simplify_dtype).value_counts()
        percentages = (dtypes / len(df)) * 100
        for dtype, percent in percentages.items():
            # Categories that never occur keep the '-' placeholder set above
            results.at[column, dtype] = f'{percent:.2f}%'
    return results

results = analyze_column_dtypes(df)
display(results)

Unnamed: 0,Missing,String,Numeric,Datetime,Other
Fertilizer,-,100.00%,-,-,-
Light Exposure,-,100.00%,-,-,-
Plant Height (cm),-,-,100.00%,-,-
Leaf Area (cm²),-,-,100.00%,-,-
Chlorophyll Content (SPAD units),-,-,100.00%,-,-
Root Length (cm),-,-,100.00%,-,-
Biomass (g),-,-,100.00%,-,-
Flower Count (number),-,-,100.00%,-,-
Seed Yield (g),-,-,100.00%,-,-
Stomatal Conductance (mmol/m²/s),-,-,100.00%,-,-


<h3 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>4.3: Statistical Description of the Dataset</h3>

Let's take a moment to quickly explore some essential statistics of our dataset. By using the describe() function in pandas, we can generate a summary of key metrics for each numerical column in the dataset. This gives us a bird's-eye view of the data, helping us understand the general distribution and characteristics of the values.

In [12]:
summary_stats = df.describe().T.reset_index()
summary_stats

Unnamed: 0,index,count,mean,...,50%,75%,max
0,Plant Height (cm),120.0,60.58,...,60.18,69.33,95.14
1,Leaf Area (cm²),120.0,181.9,...,180.71,205.0,312.3
2,Chlorophyll Content (SPAD units),120.0,42.03,...,40.37,48.66,73.21
3,Root Length (cm),120.0,23.86,...,23.64,27.07,39.4
4,Biomass (g),120.0,11.97,...,11.45,13.87,19.61
5,Flower Count (number),120.0,18.18,...,17.64,21.08,30.03
6,Seed Yield (g),120.0,6.16,...,6.14,7.03,10.16
7,Stomatal Conductance (mmol/m²/s),120.0,241.68,...,240.25,276.74,383.45



<h4 style='font-size: 15px;  font-weight: 600'>4.3.1: Distribution of Continuous variables</h4>


Let’s explore the distribution of **continuous variables** in our dataset by examining key statistics. The ***Mean*** gives the average value, while the ***Median*** gives the middle value and is more robust to outliers. The ***Mode*** identifies the most frequent value; for continuous measurements where every value is unique it is not informative, and pandas simply lists the smallest value first. ***Standard Deviation*** and ***Variance*** show how much the data deviate from the mean, with larger values indicating greater spread. The ***Range*** is the difference between the maximum and minimum values. ***Skewness*** measures the asymmetry of the distribution (0 for a symmetric distribution, positive for a longer right tail). Lastly, ***Kurtosis*** describes the **"tailedness"** of the distribution and hints at how outlier-prone it is; `scipy.stats.kurtosis` reports excess kurtosis by default, so a normal distribution scores 0. Together, these metrics give a comprehensive view of how the data are distributed and help flag potential issues such as skewness or outliers.

In [15]:
def distribution_statistics(df):
    results = []
    for col in df.select_dtypes(include=[np.number]).columns:
        mean = df[col].mean()
        median = df[col].median()
        mode = df[col].mode().iloc[0] if not df[col].mode().empty else np.nan
        std_dev = df[col].std()
        variance = df[col].var()
        value_range = df[col].max() - df[col].min()
        skewness_val = skew(df[col], nan_policy='omit')  # Skewness
        kurtosis_val = kurtosis(df[col], nan_policy='omit')  # Kurtosis


        results.append({
            'Parameter': col,
            'Mean': mean,
            'Median': median,
            'Mode': mode,
            'Standard Deviation': std_dev,
            'Variance': variance,
            'Range': value_range,
            'Skewness': skewness_val,
            'Kurtosis': kurtosis_val
        })

    
    result_df = pd.DataFrame(results)
    return result_df

pd.set_option('display.max_columns', 10) 
Continuous_variables_distribution = distribution_statistics(df)
display(Continuous_variables_distribution)

Unnamed: 0,Parameter,Mean,Median,Mode,Standard Deviation,Variance,Range,Skewness,Kurtosis
0,Plant Height (cm),60.58,60.18,35.89,14.93,222.9,59.26,0.37,-0.73
1,Leaf Area (cm²),181.9,180.71,108.65,45.93,2109.99,203.66,0.56,-0.32
2,Chlorophyll Content (SPAD units),42.03,40.37,23.99,9.86,97.29,49.22,0.59,-0.17
3,Root Length (cm),23.86,23.64,14.75,5.44,29.55,24.65,0.62,-0.22
4,Biomass (g),11.97,11.45,7.23,2.89,8.33,12.38,0.6,-0.48
5,Flower Count (number),18.18,17.64,10.57,4.43,19.6,19.45,0.62,-0.26
6,Seed Yield (g),6.16,6.14,3.86,1.49,2.21,6.3,0.5,-0.51
7,Stomatal Conductance (mmol/m²/s),241.68,240.25,148.69,55.74,3106.48,234.76,0.41,-0.58
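
The kurtosis values in the table are all negative, which can look odd at first. As a small sketch on synthetic data (not the notebook's dataset), the snippet below shows that `scipy.stats.kurtosis` uses the Fisher (excess) definition by default, which is the Pearson definition minus 3.

```python
import numpy as np
from scipy.stats import kurtosis

# Synthetic sample for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=500)

excess = kurtosis(x)                 # default fisher=True: a normal distribution scores about 0
pearson = kurtosis(x, fisher=False)  # Pearson definition: a normal distribution scores about 3

print(f"excess = {excess:.3f}, pearson = {pearson:.3f}")
```

Under the default definition, a negative value indicates lighter tails than a normal distribution rather than an error in the calculation.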


<h4 style='font-size: 15px;  font-weight: 600'>4.3.3: Group-wise Comparative Analysis of Continuous Variables</h4>

Now, let’s turn our attention to comparing the means of variables across different specified groups. By grouping the data based on a categorical feature, we can calculate the mean of each continuous variable within each group. This allows us to identify differences or similarities in average values between groups, offering insights into how the variable behaves under different conditions or categories.


In [17]:
def summary_stats(df, group):
    Metrics = df.select_dtypes(include=np.number).columns.tolist()
    df_without_location = df.drop(columns=[group])
    grand_mean = df_without_location[Metrics].mean()
    sem = df_without_location[Metrics].sem()
    cv = df_without_location[Metrics].std() / df_without_location[Metrics].mean() * 100
    grouped = df.groupby(group)[Metrics].agg(['mean', 'sem']).reset_index()
    
    summary_df = pd.DataFrame()
    for col in Metrics:
        summary_df[col] = grouped.apply(
            lambda x: f"{x[(col, 'mean')]:.2f} ± {x[(col, 'sem')]:.2f}", axis=1
        )
    
    summary_df.insert(0, group, grouped[group])
    grand_mean_row = ['Grand Mean'] + grand_mean.tolist()
    sem_row = ['SEM'] + sem.tolist()
    cv_row = ['%CV'] + cv.tolist()
    
    summary_df.loc[len(summary_df)] = grand_mean_row
    summary_df.loc[len(summary_df)] = sem_row
    summary_df.loc[len(summary_df)] = cv_row
    
    return summary_df

results = summary_stats(df, group='Fertilizer')
results.T

Unnamed: 0,0,1,2,3,4,5
Fertilizer,Control,Organic,Synthetic,Grand Mean,SEM,%CV
Plant Height (cm),54.55 ± 1.58,65.61 ± 2.69,61.84 ± 2.42,60.58,1.36,24.65
Leaf Area (cm²),167.76 ± 4.32,194.26 ± 8.37,184.36 ± 8.09,181.90,4.19,25.25
Chlorophyll Content (SPAD units),39.76 ± 1.12,44.85 ± 1.77,41.60 ± 1.67,42.03,0.90,23.47
Root Length (cm),21.97 ± 0.58,25.15 ± 1.03,24.53 ± 0.86,23.86,0.50,22.78
Biomass (g),10.91 ± 0.26,12.56 ± 0.47,12.48 ± 0.56,11.97,0.26,24.12
Flower Count (number),17.13 ± 0.48,19.36 ± 0.82,18.12 ± 0.75,18.18,0.40,24.35
Seed Yield (g),5.68 ± 0.17,6.41 ± 0.26,6.43 ± 0.25,6.16,0.14,24.13
Stomatal Conductance (mmol/m²/s),221.38 ± 6.67,256.74 ± 9.06,247.81 ± 9.74,241.68,5.09,23.06


<h2 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>5.0: Parametric Test Assumption Validation</h2>

Before conducting hypothesis testing, it's essential to verify that our variables meet the assumptions required for parametric tests—specifically, **normality** and **homogeneity of variance (homoskedasticity)**. To assess normality, we’ll use the Shapiro-Wilk test, which evaluates whether the data are approximately normally distributed within each group. To test for equal variances across groups, we’ll use Levene’s test. Checking these assumptions helps ensure the validity of our statistical results and informs the appropriate choice of analysis.


<h4 style='font-size: 15px; font-weight: 600'>5.1: Homogeneity of Variance (Homoskedasticity)</h4>

Levene’s test assesses homogeneity of variance (homoskedasticity), a key assumption of the standard pooled-variance t-test. Its null hypothesis is that the group variances are equal: a p-value greater than 0.05 gives no evidence against equal variances, while a p-value of 0.05 or less suggests the variances differ. If the assumption is violated, alternatives such as Welch's t-test or a data transformation may be needed for reliable results. The check below runs across all levels of each grouping column at once, so it indicates whether the pooled-variance t-test or Welch's version is the safer choice for a given variable.

In [19]:
def Levene_test(df, group_cols, numeric_cols=None): 
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
        for g in group_cols:
            if g in numeric_cols:
                numeric_cols.remove(g)
    
    results = []
    for group_col in group_cols:
        for col in numeric_cols:
            grouped_data = [g[col].dropna().values for _, g in df.groupby(group_col)]
            if all(len(g) > 1 for g in grouped_data):  # Ensure each group has enough data
                levene_stat, levene_p = levene(*grouped_data)
                interpretation = '✔' if levene_p > 0.05 else '✖'
                #interpretation = 'Homoscedasticity' if levene_p > 0.05 else 'Heteroscedasticity'
                
            else:
                levene_stat, levene_p, interpretation = None, None, 'Insufficient data'
            
            results.append({
                'Group Column': group_col,
                'Variable': col,
                'Test Statistic': levene_stat,
                'P-Value': levene_p,
                'Interpretation': interpretation
            })
    
    return pd.DataFrame(results)

result_df = Levene_test(df, group_cols=['Fertilizer'])
display(result_df)

Unnamed: 0,Group Column,Variable,Test Statistic,P-Value,Interpretation
0,Fertilizer,Plant Height (cm),5.4,0.01,✖
1,Fertilizer,Leaf Area (cm²),7.77,0.0,✖
2,Fertilizer,Chlorophyll Content (SPAD units),3.92,0.02,✖
3,Fertilizer,Root Length (cm),4.07,0.02,✖
4,Fertilizer,Biomass (g),11.78,0.0,✖
5,Fertilizer,Flower Count (number),4.62,0.01,✖
6,Fertilizer,Seed Yield (g),3.21,0.04,✖
7,Fertilizer,Stomatal Conductance (mmol/m²/s),2.58,0.08,✔
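
One way to act on a Levene result is to let it decide which version of the t-test to run for a given pair of groups. The sketch below is only an illustration: `compare_two_groups` is a hypothetical helper, not part of this notebook or of SciPy, and the data are synthetic. Many analysts skip this step and use Welch's test throughout.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper: run Levene's test, then pick Student or Welch."""
    _, levene_p = levene(a, b)
    # Treat variances as equal only when Levene's test does not reject
    equal_var = bool(levene_p > alpha)
    t_stat, p_value = ttest_ind(a, b, equal_var=equal_var)
    return {
        "levene_p": levene_p,
        "test": "Student" if equal_var else "Welch",
        "t": t_stat,
        "p": p_value,
    }

# Synthetic groups with the same mean but very different spreads
rng = np.random.default_rng(1)
narrow = rng.normal(60, 3, size=40)
wide = rng.normal(60, 20, size=40)

result = compare_two_groups(narrow, wide)
print(result)
```

Because Levene's test in the table above was run across all three fertilizer levels together, a pairwise decision like this one would need the test re-run on just the two groups being compared.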


<h4 style='font-size: 15px; font-weight: 600'>5.2: Normal Distribution (Normality Test)</h4>


- In statistical analysis, assessing whether data follows a normal distribution is a critical preliminary step, particularly before applying parametric tests such as ANOVA or t-tests. The assumption of normality underpins the reliability of these tests, as violations can lead to misleading results and incorrect conclusions. To evaluate this, normality tests are employed to determine if the distribution of a dataset aligns closely with a theoretical normal distribution. By verifying this assumption, analysts can decide whether the data is suitable for parametric testing or if alternative methods, such as data transformation or non-parametric tests, are more appropriate.

- In our analysis, we use the **Shapiro-Wilk** test to examine normality, since it performs well with small to moderately sized samples. The test measures how closely the ordered sample values match those expected from a normal distribution and yields both a **W statistic** and a **p-value**. A p-value greater than **0.05** means we fail to reject the null hypothesis of normality, so the data are consistent with a normal distribution (this is not proof of normality). Conversely, a p-value of **0.05** or less indicates that the data deviate significantly from normality.

- To strengthen this approach, we incorporate the **Central Limit Theorem (CLT)** through **bootstrapping**, where appropriate. By repeatedly resampling each group with replacement and computing the mean of each resample, we approximate the sampling distribution of the mean. When bootstrapping is enabled (`use_bootstrap=True`), the Shapiro-Wilk test is applied to these resampled means rather than to the raw data, so it checks whether the distribution of means, not of individual observations, is approximately normal, which is what CLT-based inference relies on. If bootstrapping is disabled, the Shapiro-Wilk test is applied directly to the original observations, giving the more traditional view of the data's normality. This dual approach provides flexibility in assessing whether the data suit further parametric analysis.
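
The bootstrapping idea described above can be sketched on deliberately skewed synthetic data. This is an illustration under stated assumptions: the exponential sample, the seed, and the resample size of 30 are arbitrary choices, not values derived from the dataset.

```python
import numpy as np
from scipy.stats import skew, shapiro

rng = np.random.default_rng(7)

# Strongly right-skewed synthetic sample (illustrative only)
raw = rng.exponential(scale=2.0, size=200)

# Bootstrap: resample with replacement and record each resample's mean
boot_means = np.array([
    rng.choice(raw, size=30, replace=True).mean()
    for _ in range(1000)
])

print(f"skewness of raw data:        {skew(raw):.2f}")
print(f"skewness of bootstrap means: {skew(boot_means):.2f}")
print(f"Shapiro-Wilk p (raw data):   {shapiro(raw).pvalue:.4f}")
```

The resampled means are much less skewed than the raw observations, which is the CLT at work. A "Normal" verdict on bootstrapped means therefore speaks to the sampling distribution of the mean, not to the shape of the raw data.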

In [20]:
def bootstrapping(df, column, num_samples=1000, sample_size=30):
    sample_means = []
    for _ in range(num_samples):
        sample = df[column].dropna().sample(n=sample_size, replace=True)
        sample_means.append(sample.mean())
    return sample_means

def shapiro_wilk_test(df, group_col, numeric_cols=None, use_bootstrap=True, num_samples=1000, sample_size=30): 
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
        if group_col in numeric_cols:
            numeric_cols.remove(group_col)
    
    results = []
    for group, group_df in df.groupby(group_col):
        for col in numeric_cols:
            if use_bootstrap:
                data = bootstrapping(group_df, col, num_samples=num_samples, sample_size=sample_size)
            else:
                data = group_df[col].dropna()
                
            if len(data) >= 3:  # Shapiro requires at least 3 values
                stat, p_value = shapiro(data)
                interpretation = 'Normal' if p_value > 0.05 else 'Not Normal'
            else:
                stat, p_value, interpretation = None, None, 'Insufficient data'
            
            results.append({
                'Main-Group': group_col,
                'Group': group,
                'Variable': col,
                'Test Statistic': stat,
                'P-Value': p_value,
                'Interpretation': interpretation,
                'Used Bootstrap': use_bootstrap
            })
        
    results_df = pd.DataFrame(results)
    return results_df

result_df = shapiro_wilk_test(df, group_col='Fertilizer', use_bootstrap=True)
display(result_df)

Unnamed: 0,Main-Group,Group,Variable,Test Statistic,P-Value,Interpretation,Used Bootstrap
0,Fertilizer,Control,Plant Height (cm),1.0,0.34,Normal,True
1,Fertilizer,Control,Leaf Area (cm²),1.0,0.46,Normal,True
2,Fertilizer,Control,Chlorophyll Content (SPAD units),1.0,0.07,Normal,True
3,Fertilizer,Control,Root Length (cm),1.0,0.11,Normal,True
4,Fertilizer,Control,Biomass (g),1.0,0.33,Normal,True
5,Fertilizer,Control,Flower Count (number),1.0,0.89,Normal,True
6,Fertilizer,Control,Seed Yield (g),1.0,0.49,Normal,True
7,Fertilizer,Control,Stomatal Conductance (mmol/m²/s),1.0,0.64,Normal,True
8,Fertilizer,Organic,Plant Height (cm),1.0,0.47,Normal,True
9,Fertilizer,Organic,Leaf Area (cm²),1.0,0.01,Not Normal,True


<h2 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>6.0. Independent (Two-Sample) T-Test</h2>


<h4 style='font-size: 15px; font-weight: 600'>6.1: Automate Test over Multiple Variables</h4>

In [64]:
def Independent_ttest(df, group_column, Variables):
    unique_groups = df[group_column].unique()
    group_combinations = list(combinations(unique_groups, 2))
    results = []
    for column in Variables:
        for group1, group2 in group_combinations:
            group1_data = df[df[group_column] == group1][column]
            group2_data = df[df[group_column] == group2][column]
            t_stat, p_value = ttest_ind(group1_data, group2_data, equal_var=False)
            
            results.append({
                'Group': group_column,
                'Parameter': column,
                'Group 1': group1,
                'Group 2': group2,
                'T-Statistic': t_stat,
                'P-Value': p_value,
                'Interpretation': 'Significant' if p_value < 0.05 else 'Not Significant'
            })
        
    results_df = pd.DataFrame(results)
    return results_df

group_col = 'Fertilizer'
Variables = ['Plant Height (cm)', 'Leaf Area (cm²)', 'Chlorophyll Content (SPAD units)']
Results = Independent_ttest(df, group_column=group_col, Variables=Variables)
display(Results)

Unnamed: 0,Group,Parameter,Group 1,Group 2,T-Statistic,P-Value,Interpretation
0,Fertilizer,Plant Height (cm),Control,Organic,-3.54,0.0,Significant
1,Fertilizer,Plant Height (cm),Control,Synthetic,-2.52,0.01,Significant
2,Fertilizer,Plant Height (cm),Organic,Synthetic,1.04,0.3,Not Significant
3,Fertilizer,Leaf Area (cm²),Control,Organic,-2.81,0.01,Significant
4,Fertilizer,Leaf Area (cm²),Control,Synthetic,-1.81,0.08,Not Significant
5,Fertilizer,Leaf Area (cm²),Organic,Synthetic,0.85,0.4,Not Significant
6,Fertilizer,Chlorophyll Content (SPAD units),Control,Organic,-2.44,0.02,Significant
7,Fertilizer,Chlorophyll Content (SPAD units),Control,Synthetic,-0.91,0.36,Not Significant
8,Fertilizer,Chlorophyll Content (SPAD units),Organic,Synthetic,1.34,0.18,Not Significant
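
As a sanity check on the table above, Welch's t-statistic can be recomputed by hand from the group means and standard errors reported in the group-wise summary (section 4.3.3), using t = (mean1 - mean2) / sqrt(SEM1² + SEM2²). The helper name `welch_t_from_summary` is hypothetical, and the inputs are the rounded values from that table, so small rounding differences are expected.

```python
import math

def welch_t_from_summary(mean1, sem1, mean2, sem2):
    """Welch's t from group means and standard errors of the mean."""
    return (mean1 - mean2) / math.sqrt(sem1**2 + sem2**2)

# Plant Height (cm): (mean, SEM) per fertilizer group, rounded as in the summary table
control = (54.55, 1.58)
organic = (65.61, 2.69)
synthetic = (61.84, 2.42)

t_control_organic = welch_t_from_summary(*control, *organic)
t_control_synthetic = welch_t_from_summary(*control, *synthetic)

print(f"Control vs Organic:   t = {t_control_organic:.2f}")
print(f"Control vs Synthetic: t = {t_control_synthetic:.2f}")
```

Both values land close to the T-Statistic column for Plant Height (about -3.5 and -2.5), which confirms that `equal_var=False` makes `ttest_ind` use each group's own variance rather than a pooled estimate.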


<h4 style='font-size: 15px; font-weight: 600'>6.2: Automate Test over Multiple Categories and Variables</h4>

In [21]:
def Independent_ttest(df, group_cols, Variables):
    results = []
    for category in group_cols:
        unique_groups = df[category].unique()
        group_combinations = list(combinations(unique_groups, 2))
        
        for column in Variables:
            for group1, group2 in group_combinations:
                group1_data = df[df[category] == group1][column]
                group2_data = df[df[category] == group2][column]
                t_stat, p_value = ttest_ind(group1_data, group2_data, equal_var=False)
                
                results.append({
                    'Group': category,
                    'Parameter': column,
                    'Group 1': group1,
                    'Group 2': group2,
                    'T-Statistic': t_stat,
                    'P-Value': p_value,
                    'Interpretation': 'Significant' if p_value < 0.05 else 'Not Significant'
                })
        
    results_df = pd.DataFrame(results)
    return results_df

group_col = ['Fertilizer', 'Light Exposure']
Variables = ['Plant Height (cm)', 'Leaf Area (cm²)', 'Chlorophyll Content (SPAD units)']
Results = Independent_ttest(df, group_cols=group_col, Variables=Variables)
display(Results)

Unnamed: 0,Group,Parameter,Group 1,Group 2,T-Statistic,P-Value,Interpretation
0,Fertilizer,Plant Height (cm),Control,Organic,-3.54,0.0,Significant
1,Fertilizer,Plant Height (cm),Control,Synthetic,-2.52,0.01,Significant
2,Fertilizer,Plant Height (cm),Organic,Synthetic,1.04,0.3,Not Significant
3,Fertilizer,Leaf Area (cm²),Control,Organic,-2.81,0.01,Significant
4,Fertilizer,Leaf Area (cm²),Control,Synthetic,-1.81,0.08,Not Significant
5,Fertilizer,Leaf Area (cm²),Organic,Synthetic,0.85,0.4,Not Significant
6,Fertilizer,Chlorophyll Content (SPAD units),Control,Organic,-2.44,0.02,Significant
7,Fertilizer,Chlorophyll Content (SPAD units),Control,Synthetic,-0.91,0.36,Not Significant
8,Fertilizer,Chlorophyll Content (SPAD units),Organic,Synthetic,1.34,0.18,Not Significant
9,Light Exposure,Plant Height (cm),Full Sun,Full Shade,14.74,0.0,Significant


---

This analysis was performed by **Jabulente**, a passionate and dedicated data scientist with a strong commitment to using data to drive meaningful insights and solutions. For inquiries, collaborations, or further discussions, please feel free to reach out via the links below.  

---

<div align="center">  
    
[![GitHub](https://img.shields.io/badge/GitHub-Jabulente-black?logo=github)](https://github.com/Jabulente)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-Jabulente-blue?logo=linkedin)](https://linkedin.com/in/jabulente-208019349)  [![Email](https://img.shields.io/badge/Email-jabulente@hotmail.com-red?logo=gmail)](mailto:Jabulente@hotmail.com)  

</div>

<h1 style='font-size: 35px; color: Tomato; font-family: Colonna MT; font-weight: 700; text-align: center'>THE END</h1>