<h1 style='font-size: 35px; color: crimson; font-family: Colonna MT; font-weight: 600; text-align: center'>Hypothesis Testing | Inferential Statistics</h1>

---

<h1 style=' font-weight: 600; font-size: 20px; text-align: left'>1.0. Import Required Libraries</h1>

In [16]:
# Statistical Analysis and Statistical Modeling
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.anova import anova_lm
from statsmodels.formula.api import ols
import statsmodels.formula.api as smf
import statsmodels.api as sm
import pandas as pd
import numpy as np
import sys
import re
import os


pd.set_option('display.max_columns', 70) 
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print("....Libraries Loaded Successfully....")

....Libraries Loaded Successfully....


<h1 style='font-weight: 600; font-size: 18px; text-align: left'>2.0. Import and Preprocessing Dataset</h1>

In [3]:
filepath = "../Datasets/Eggplant Fusarium Fresistance Data.csv"
df = pd.read_csv(filepath)
display(df)

Unnamed: 0,Variety,Resistance Level,Replication ID,Infection Severity (%),Wilt index,Plant height (cm),Days to wilt symptoms,Survival rate (%),Disease incidence (%)
0,EP-R1,Resistant,1,22.50,0.70,88.90,21,88.80,23.40
1,EP-R1,Resistant,2,27.90,1.20,82.20,19,87.70,21.70
2,EP-R1,Resistant,3,21.20,0.00,74.70,17,84.90,27.20
3,EP-R1,Resistant,4,15.50,0.10,93.80,18,90.30,15.00
4,EP-R1,Resistant,5,17.30,0.90,78.10,19,87.00,23.00
...,...,...,...,...,...,...,...,...,...
795,EP-S3,Susceptible,96,75.20,3.60,68.20,7,6.40,85.50
796,EP-S3,Susceptible,97,74.80,4.90,59.50,4,27.20,82.00
797,EP-S3,Susceptible,98,58.10,3.60,78.80,7,30.80,75.40
798,EP-S3,Susceptible,99,54.10,4.10,63.70,7,24.10,81.80


<h1 style='font-weight: 600; font-size: 20px; text-align: left'>5.0. Hypothesis Testing (Inferential Statistics)</h1>

<h4 style='font-size: 15px;  font-weight: 600'>5.1: Group-wise Comparative Analysis of Continuous Variables</h4>

We start by comparing the means of the continuous variables across groups. Grouping the data by a categorical feature (here `Variety` or `Resistance Level`) gives the mean and standard error of each variable within each group, alongside the grand mean, the overall SEM, and the coefficient of variation (%CV). Differences between group means are only descriptive at this stage; the tests in the following sections assess whether they are statistically significant.
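As a minimal sketch of the per-group summary described above, the snippet below computes the mean, the standard error of the mean (SEM), and the %CV for two small made-up groups. The values and the helper name `describe_group` are illustrative assumptions and do not come from the eggplant dataset; only the standard library is used so the arithmetic stays visible.

```python
import math
import statistics

# Hypothetical toy measurements for two groups (not from the eggplant dataset)
toy_groups = {
    "A": [20.0, 22.0, 24.0],
    "B": [70.0, 75.0, 80.0],
}

def describe_group(values):
    """Return (mean, SEM, %CV) for one group of observations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)        # sample standard deviation (n - 1 in the denominator)
    sem = sd / math.sqrt(len(values))    # standard error of the mean
    cv = sd / mean * 100                 # coefficient of variation, in percent
    return mean, sem, cv

for name, values in toy_groups.items():
    mean, sem, cv = describe_group(values)
    print(f"{name}: {mean:.2f} ± {sem:.2f} (CV = {cv:.1f}%)")
```

pandas `std()` and `sem()` also use the sample (n - 1) definition by default, so the same formulas apply to the grouped summaries in this section.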


In [4]:
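def summary_stats(df, group):
    # All numeric columns are summarised, including identifier-like ones such as 'Replication ID'
    Metrics = df.select_dtypes(include=np.number).columns.tolist()
    df_without_location = df.drop(columns=[group])
    # Overall (ungrouped) statistics, appended as extra rows at the end
    grand_mean = df_without_location[Metrics].mean()
    sem = df_without_location[Metrics].sem()
    cv = df_without_location[Metrics].std() / df_without_location[Metrics].mean() * 100
    # Per-group mean and standard error of the mean
    grouped = df.groupby(group)[Metrics].agg(['mean', 'sem']).reset_index()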
    
    summary_df = pd.DataFrame()
    for col in Metrics:
        summary_df[col] = grouped.apply(
            lambda x: f"{x[(col, 'mean')]:.2f} ± {x[(col, 'sem')]:.2f}", axis=1
        )
    
    summary_df.insert(0, group, grouped[group])
    grand_mean_row = ['Grand Mean'] + grand_mean.tolist()
    sem_row = ['SEM'] + sem.tolist()
    cv_row = ['%CV'] + cv.tolist()
    
    summary_df.loc[len(summary_df)] = grand_mean_row
    summary_df.loc[len(summary_df)] = sem_row
    summary_df.loc[len(summary_df)] = cv_row
    
    return summary_df

results = summary_stats(df, group='Variety')
results

Unnamed: 0,Variety,Replication ID,Infection Severity (%),Wilt index,Plant height (cm),Days to wilt symptoms,Survival rate (%),Disease incidence (%)
0,EP-M1,50.50 ± 2.90,44.16 ± 1.06,2.56 ± 0.07,74.44 ± 0.51,11.99 ± 0.15,54.57 ± 1.00,50.54 ± 0.79
1,EP-M2,50.50 ± 2.90,45.44 ± 1.01,2.51 ± 0.07,75.18 ± 0.44,11.85 ± 0.16,55.77 ± 1.00,51.75 ± 1.02
2,EP-R1,50.50 ± 2.90,20.69 ± 0.43,0.73 ± 0.05,84.80 ± 0.59,17.97 ± 0.19,89.25 ± 0.52,25.84 ± 0.72
3,EP-R2,50.50 ± 2.90,20.81 ± 0.46,0.83 ± 0.05,85.33 ± 0.60,17.98 ± 0.20,90.49 ± 0.46,26.17 ± 0.78
4,EP-R3,50.50 ± 2.90,20.89 ± 0.42,0.82 ± 0.05,84.84 ± 0.58,18.51 ± 0.20,89.61 ± 0.47,25.25 ± 0.70
5,EP-S1,50.50 ± 2.90,75.26 ± 1.08,4.20 ± 0.06,65.22 ± 0.50,6.82 ± 0.12,24.14 ± 0.80,81.24 ± 0.75
6,EP-S2,50.50 ± 2.90,73.99 ± 0.91,4.11 ± 0.05,65.47 ± 0.54,6.91 ± 0.10,24.66 ± 0.89,79.57 ± 0.65
7,EP-S3,50.50 ± 2.90,73.91 ± 0.89,4.18 ± 0.06,64.71 ± 0.53,6.76 ± 0.12,24.44 ± 0.75,80.79 ± 0.61
8,Grand Mean,50.50,46.89,2.49,75.00,12.35,56.62,52.64
9,SEM,1.02,0.87,0.06,0.36,0.18,1.04,0.88


In [5]:
results = summary_stats(df, group='Resistance Level')
results.T

Unnamed: 0,0,1,2,3,4,5
Resistance Level,Moderate,Resistant,Susceptible,Grand Mean,SEM,%CV
Replication ID,50.50 ± 2.05,50.50 ± 1.67,50.50 ± 1.67,50.50,1.02,57.20
Infection Severity (%),44.80 ± 0.73,20.80 ± 0.25,74.38 ± 0.56,46.89,0.87,52.63
Wilt index,2.53 ± 0.05,0.79 ± 0.03,4.16 ± 0.03,2.49,0.06,63.01
Plant height (cm),74.81 ± 0.33,84.99 ± 0.34,65.13 ± 0.30,75.00,0.36,13.51
Days to wilt symptoms,11.92 ± 0.11,18.15 ± 0.12,6.83 ± 0.06,12.35,0.18,41.81
Survival rate (%),55.17 ± 0.71,89.78 ± 0.28,24.42 ± 0.47,56.62,1.04,51.84
Disease incidence (%),51.15 ± 0.64,25.75 ± 0.42,80.53 ± 0.39,52.64,0.88,47.36


<h4 style='font-size: 15px; font-weight: 600'>5.3: Analysis of Variance (One-Way ANOVA)</h4>


A one-way ANOVA (analysis of variance) tests whether the means of three or more independent groups differ with respect to a single factor. The null hypothesis is that all group means are equal. If the p-value is below the chosen significance level (0.05 here), we reject the null hypothesis and conclude that at least one group mean differs from the others; ANOVA alone does not indicate which one.
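To make the test statistic concrete, here is a minimal sketch that computes the one-way ANOVA F-statistic by hand. The data and the function name `one_way_f` are made up for illustration; the notebook itself relies on statsmodels, which also supplies the p-value from the F distribution.

```python
import statistics

# Hypothetical toy data: three groups of three observations each
groups = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [2.0, 3.0, 4.0],
    "g3": [5.0, 6.0, 7.0],
}

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way layout."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = statistics.mean(all_values)
    # Between-group sum of squares: group size times squared distance of each group mean
    ss_between = sum(
        len(values) * (statistics.mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Within-group sum of squares: squared deviations from each group's own mean
    ss_within = sum(
        (v - statistics.mean(values)) ** 2
        for values in groups.values()
        for v in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_b, df_w = one_way_f(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")  # F(2, 6) = 13.00
```

A large F means the group means are spread out far more than the scatter within groups would explain.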

In [6]:
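# Column names with spaces or symbols cannot be used directly in a formula string, so keep letters only
def rename(text): return re.sub(r'[^a-zA-Z]', "", text)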
    
def One_way_anova(data, Metrics, group_cols):
    results = []
    original_group_cols = group_cols[:]  # Save original names for reporting
    group_cols = [rename(col) for col in group_cols]
    data = data.rename(columns={col: rename(col) for col in data.columns})
    
    for original_group, group in zip(original_group_cols, group_cols):
        for col in Metrics:
            column_name = rename(col)  
            formula = f"{column_name} ~ C({group})" 
            model = smf.ols(formula, data=data).fit()
            anova_table = sm.stats.anova_lm(model, typ=2)
            
            for source, row in anova_table.iterrows():
                p_value = row["PR(>F)"]
                interpretation = "Significant" if p_value < 0.05 else "Not significant"
                if source == "Residual":
                    interpretation = "-"
                
                results.append({
                    "Variable": col,
                    #"Factor": original_group,  # Use original name here
                    "Source": source,
                    "Sum Sq": row["sum_sq"],
                    "df": row["df"],
                    "F-Value": row["F"],
                    "p-Value": p_value,
                    "Interpretation": interpretation
                })

    return pd.DataFrame(results)

group_cols = ['Variety', 'Resistance Level']
Metrics = ['Infection Severity (%)', 'Wilt index', 'Plant height (cm)', 'Days to wilt symptoms', 'Survival rate (%)', 'Disease incidence (%)']
Anova_results = One_way_anova(df, Metrics, group_cols)
Anova_results

Unnamed: 0,Variable,Source,Sum Sq,df,F-Value,p-Value,Interpretation
0,Infection Severity (%),C(Variety),432094.82,7.0,897.47,0.0,Significant
1,Infection Severity (%),Residual,54473.51,792.0,,,-
2,Wilt index,C(Variety),1705.22,7.0,730.61,0.0,Significant
3,Wilt index,Residual,264.07,792.0,,,-
4,Plant height (cm),C(Variety),59239.97,7.0,293.62,0.0,Significant
5,Plant height (cm),Residual,22827.23,792.0,,,-
6,Days to wilt symptoms,C(Variety),19302.91,7.0,1091.56,0.0,Significant
7,Days to wilt symptoms,Residual,2000.79,792.0,,,-
8,Survival rate (%),C(Variety),641645.8,7.0,1559.68,0.0,Significant
9,Survival rate (%),Residual,46546.44,792.0,,,-
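As a quick consistency check, the F-Value in the table can be recomputed from the Sum Sq and df columns. The sketch below does this for the `Infection Severity (%)` rows, using the rounded figures copied from the output above; the variable names are illustrative.

```python
# Sum Sq and df copied from the 'Infection Severity (%)' rows of the table above
ss_variety, df_variety = 432094.82, 7
ss_residual, df_residual = 54473.51, 792

ms_variety = ss_variety / df_variety      # mean square for the factor
ms_residual = ss_residual / df_residual   # mean square error (residual)
f_value = ms_variety / ms_residual

print(round(f_value, 2))
```

The result agrees with the reported F-Value of 897.47 to two decimals. The p-Value column shows 0.0 because of display rounding; the underlying p-values are very small rather than exactly zero.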


<h4 style='font-size: 15px; font-weight: 600'>5.4: Tukey's Honest Significant Difference (THSD)</h4>

ANOVA shows whether any group means differ, but not which ones. Tukey's Honest Significant Difference (HSD) test fills that gap: it compares every pair of groups while controlling the family-wise error rate (here at alpha = 0.05) and reports, for each pair, the mean difference, an adjusted p-value, a confidence interval, and whether the null hypothesis of equal means is rejected. Pairs that are not rejected can be treated as statistically similar; rejected pairs are distinct.

In [7]:
def Turkey_results(df, Metrics, group=''):
    results_data = []
    for metric in Metrics:
        turkey_results = pairwise_tukeyhsd(endog=df[metric], groups=df[group], alpha=0.05)
        results_table = turkey_results.summary()
        
        for i in range(1, len(results_table)):
            row = results_table.data[i]
            results_data.append({
                'Metric': metric,
                'Group1': row[0],
                'Group2': row[1],
                'Mean Difference': row[2],
                'P-Value': row[3],
                'Lower CI': row[4],
                'Upper CI': row[5],
                'Reject Null': row[6]
            })
        
    # Build the frame once, after every metric has been processed
    result_df = pd.DataFrame(results_data)
    return result_df


Metrics = ['Infection Severity (%)', 'Wilt index', 'Plant height (cm)', 'Days to wilt symptoms', 'Survival rate (%)', 'Disease incidence (%)']
Turkeyresults = Turkey_results(df, Metrics, group='Variety')
pd.set_option("display.float_format", "{:.3f}".format)
Turkeyresults

Unnamed: 0,Metric,Group1,Group2,Mean Difference,P-Value,Lower CI,Upper CI,Reject Null
0,Infection Severity (%),EP-M1,EP-M2,1.270,0.960,-2.294,4.834,False
1,Infection Severity (%),EP-M1,EP-R1,-23.478,0.000,-27.042,-19.914,True
2,Infection Severity (%),EP-M1,EP-R2,-23.354,0.000,-26.918,-19.790,True
3,Infection Severity (%),EP-M1,EP-R3,-23.273,0.000,-26.837,-19.709,True
4,Infection Severity (%),EP-M1,EP-S1,31.092,0.000,27.528,34.656,True
...,...,...,...,...,...,...,...,...
163,Disease incidence (%),EP-R3,EP-S2,54.322,0.000,51.054,57.590,True
164,Disease incidence (%),EP-R3,EP-S3,55.547,0.000,52.279,58.815,True
165,Disease incidence (%),EP-S1,EP-S2,-1.672,0.777,-4.940,1.596,False
166,Disease incidence (%),EP-S1,EP-S3,-0.447,1.000,-3.715,2.821,False
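Two things help when reading this output: how many comparisons there are, and how the `Reject Null` flag relates to the confidence interval. The sketch below illustrates both with the standard library; the helper `rejects_null` is a hypothetical name, and the two example rows are copied from the variety-level output above.

```python
from itertools import combinations
from math import comb

varieties = ["EP-M1", "EP-M2", "EP-R1", "EP-R2", "EP-R3", "EP-S1", "EP-S2", "EP-S3"]

# Tukey's HSD compares every unordered pair of groups once
pairs = list(combinations(varieties, 2))
assert len(pairs) == comb(len(varieties), 2)
print(len(pairs))  # 28 pairs for 8 varieties

def rejects_null(lower_ci, upper_ci):
    """A pair differs significantly when its confidence interval excludes zero."""
    return lower_ci > 0 or upper_ci < 0

# (group1, group2, lower CI, upper CI) copied from two rows of the output above
rows = [
    ("EP-M1", "EP-M2", -2.294, 4.834),     # interval spans zero
    ("EP-M1", "EP-R1", -27.042, -19.914),  # interval entirely below zero
]
for g1, g2, lo, hi in rows:
    print(g1, g2, rejects_null(lo, hi))
```

With six metrics and 28 pairs each, the full variety-level result holds 6 × 28 = 168 comparisons.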


In [8]:
Metrics = ['Infection Severity (%)', 'Wilt index', 'Plant height (cm)', 'Days to wilt symptoms', 'Survival rate (%)', 'Disease incidence (%)']
Turkeyresults = Turkey_results(df, Metrics, group='Resistance Level')
pd.set_option("display.float_format", "{:.3f}".format)
Turkeyresults

Unnamed: 0,Metric,Group1,Group2,Mean Difference,P-Value,Lower CI,Upper CI,Reject Null
0,Infection Severity (%),Moderate,Resistant,-24.003,0.0,-25.779,-22.228,True
1,Infection Severity (%),Moderate,Susceptible,29.583,0.0,27.808,31.359,True
2,Infection Severity (%),Resistant,Susceptible,53.587,0.0,51.999,55.175,True
3,Wilt index,Moderate,Resistant,-1.741,0.0,-1.865,-1.617,True
4,Wilt index,Moderate,Susceptible,1.629,0.0,1.505,1.752,True
5,Wilt index,Resistant,Susceptible,3.37,0.0,3.259,3.481,True
6,Plant height (cm),Moderate,Resistant,10.185,0.0,9.036,11.334,True
7,Plant height (cm),Moderate,Susceptible,-9.673,0.0,-10.822,-8.524,True
8,Plant height (cm),Resistant,Susceptible,-19.859,0.0,-20.886,-18.831,True
9,Days to wilt symptoms,Moderate,Resistant,6.233,0.0,5.892,6.575,True


<h4 style='font-size: 15px; font-weight: 600'>5.5: Compact Letter Display (CLD)</h4>

Finally, we summarize the pairwise comparisons with a **Compact Letter Display (CLD)**. Each group is assigned one or more letters, and groups that share a letter are not significantly different from each other. This condenses the full set of pairwise results into a single table of group means with letters, which is far easier to read than the list of individual comparisons. The `p-value` column flags overall significance (`***`) or its absence (`ns`).
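The idea behind the letters can be sketched in a few lines. This is a simplified illustration, not the implementation of the `compact_letter_table` helper used in this notebook (its source is not shown here): it assumes that "not significantly different" is transitive, so groups fall into disjoint clusters and each group gets exactly one letter. General CLD algorithms also handle overlapping clusters, where a group can carry several letters such as `ab`. The function name and the example pairs are hypothetical.

```python
def letters_from_pairs(groups, not_different):
    """Assign one letter per cluster of groups linked by non-significant pairs.

    Simplified sketch: valid only when 'not significantly different' is
    transitive, so that groups fall into disjoint clusters.
    """
    # Build an undirected adjacency map from the non-significant pairs
    neighbours = {g: set() for g in groups}
    for a, b in not_different:
        neighbours[a].add(b)
        neighbours[b].add(a)

    letters = {}
    next_letter = ord("a")
    for g in groups:
        if g in letters:
            continue
        # Flood-fill the cluster that contains g and give it the next letter
        stack = [g]
        while stack:
            current = stack.pop()
            if current in letters:
                continue
            letters[current] = chr(next_letter)
            stack.extend(n for n in neighbours[current] if n not in letters)
        next_letter += 1
    return letters

groups = ["EP-M1", "EP-M2", "EP-R1", "EP-R2", "EP-S1", "EP-S2"]
# Hypothetical non-significant pairs, for illustration only
not_different = [("EP-M1", "EP-M2"), ("EP-R1", "EP-R2"), ("EP-S1", "EP-S2")]
letters = letters_from_pairs(groups, not_different)
print(letters)
```

The letters themselves are arbitrary labels; only whether two groups share one carries meaning.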


In [15]:
# Make the project's Scripts folder importable so the local CLD helper can be loaded
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
analysis_path = os.path.join(project_root, 'Scripts')
if analysis_path not in sys.path:
    sys.path.append(analysis_path)

from compact_letter_display import compact_letter_table

results = compact_letter_table(df, group="Variety", savepath=None)
display(results.T)

Unnamed: 0,EP-M1,EP-M2,EP-R1,EP-R2,EP-R3,EP-S1,EP-S2,EP-S3,p-value
Replication ID,50.50 ± 2.90 a,50.50 ± 2.90 a,50.50 ± 2.90 a,50.50 ± 2.90 a,50.50 ± 2.90 a,50.50 ± 2.90 a,50.50 ± 2.90 a,50.50 ± 2.90 a,1.0000ns
Infection Severity (%),44.16 ± 1.06 a,45.44 ± 1.01 a,20.69 ± 0.43 c,20.81 ± 0.46 c,20.89 ± 0.42 c,75.26 ± 1.08 b,73.99 ± 0.91 b,73.91 ± 0.89 b,0.0000***
Wilt index,2.56 ± 0.07 a,2.51 ± 0.07 a,0.73 ± 0.05 c,0.83 ± 0.05 c,0.82 ± 0.05 c,4.20 ± 0.06 b,4.11 ± 0.05 b,4.18 ± 0.06 b,0.0000***
Plant height (cm),74.44 ± 0.51 a,75.18 ± 0.44 a,84.80 ± 0.59 c,85.34 ± 0.60 c,84.84 ± 0.58 c,65.22 ± 0.50 b,65.47 ± 0.54 b,64.71 ± 0.53 b,0.0000***
Days to wilt symptoms,11.99 ± 0.15 a,11.85 ± 0.16 a,17.97 ± 0.19 c,17.98 ± 0.20 c,18.51 ± 0.20 c,6.82 ± 0.12 b,6.91 ± 0.10 b,6.76 ± 0.12 b,0.0000***
Survival rate (%),54.57 ± 1.00 a,55.77 ± 1.00 a,89.25 ± 0.52 c,90.49 ± 0.46 c,89.61 ± 0.47 c,24.14 ± 0.80 b,24.66 ± 0.89 b,24.44 ± 0.75 b,0.0000***
Disease incidence (%),50.54 ± 0.79 a,51.75 ± 1.02 a,25.84 ± 0.72 c,26.17 ± 0.78 c,25.24 ± 0.70 c,81.24 ± 0.75 b,79.57 ± 0.65 b,80.79 ± 0.61 b,0.0000***


In [17]:
results = compact_letter_table(df, group="Resistance Level", savepath=None)
display(results.T)

Unnamed: 0,Moderate,Resistant,Susceptible,p-value
Replication ID,50.50 ± 2.05 a,50.50 ± 1.67 a,50.50 ± 1.67 a,1.0000ns
Infection Severity (%),44.80 ± 0.73 b,20.80 ± 0.25 c,74.38 ± 0.56 a,0.0000***
Wilt index,2.53 ± 0.05 b,0.79 ± 0.03 c,4.16 ± 0.03 a,0.0000***
Plant height (cm),74.81 ± 0.33 b,84.99 ± 0.34 c,65.13 ± 0.30 a,0.0000***
Days to wilt symptoms,11.92 ± 0.11 b,18.15 ± 0.12 c,6.83 ± 0.06 a,0.0000***
Survival rate (%),55.17 ± 0.71 b,89.78 ± 0.28 c,24.42 ± 0.47 a,0.0000***
Disease incidence (%),51.15 ± 0.64 b,25.75 ± 0.42 c,80.53 ± 0.39 a,0.0000***


---

This analysis was performed by **Jabulente**, a passionate and dedicated data analyst with a strong commitment to using data to drive meaningful insights and solutions. For inquiries, collaborations, or further discussions, please feel free to reach out via the links below.

---

<div align="center">  
    
[![GitHub](https://img.shields.io/badge/GitHub-Jabulente-black?logo=github)](https://github.com/Jabulente)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-Jabulente-blue?logo=linkedin)](https://linkedin.com/in/jabulente-208019349)  [![Email](https://img.shields.io/badge/Email-jabulente@hotmail.com-red?logo=gmail)](mailto:Jabulente@hotmail.com)  

</div>

<h1 style='font-size: 55px; color: red; font-family: Colonna MT; font-weight: 700; text-align: center'>THE END</h1>