<h1 style='font-size: 35px; color: crimson; font-family: Colonna MT; font-weight: 600; text-align: center'>T-Tests for Pairwise Comparison</h1>

---

- A t-test is a statistical method used to determine whether there is a significant difference between the means of two groups, or between one group's mean and a reference value. It is commonly applied when the population standard deviation is unknown and must be estimated from the sample, which matters most for small samples. The test is based on Student’s t-distribution, which describes how the test statistic behaves when there is no true difference. The t-test is widely used in agriculture, medicine, business, and the social sciences to test hypotheses and support data-driven decisions.

- The core principle of a t-test lies in comparing the mean values of one or two datasets and determining whether the difference between them is statistically significant. This is done by calculating the t-statistic, which measures the size of the difference relative to the variation within the samples. The corresponding p-value is the probability of observing a difference at least this large if the null hypothesis (no true difference) were true. If the p-value is below a predefined significance level (commonly 0.05), the difference is considered statistically significant and the null hypothesis is rejected.

- There are three main types of t-tests, each designed for specific scenarios. The one-sample t-test compares the mean of a single sample to a known or hypothesized population mean. The independent (two-sample) t-test is used to compare the means of two separate groups, such as different farming methods or fertilizer treatments. The paired t-test, on the other hand, is applied when comparing two related measurements, such as soil quality before and after an intervention.

- T-tests play a crucial role in research and data analysis, helping to validate experimental findings and draw meaningful conclusions from data. By quantifying how likely an observed difference would be under random variation alone, they provide a statistical foundation for informed decisions in scientific studies, market research, and experimental designs.
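
To make the t-statistic and p-value described above concrete, here is a minimal sketch that computes a pooled two-sample t-test by hand and compares it with `scipy.stats.ttest_ind`. The yield values are hypothetical and chosen only for illustration; the variable names are not part of the notebook's own code.

```python
import numpy as np
from scipy import stats

# Hypothetical yields for two groups (illustrative values only)
group_a = np.array([70.1, 68.4, 72.3, 69.8, 71.5, 67.9])
group_b = np.array([66.2, 69.0, 65.7, 67.3, 68.1, 64.9])

n_a, n_b = len(group_a), len(group_b)
mean_diff = group_a.mean() - group_b.mean()

# Pooled sample variance (this form assumes equal population variances)
pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
              (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
std_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# t-statistic: difference in means relative to its standard error
t_manual = mean_diff / std_error
dof = n_a + n_b - 2

# Two-sided p-value from Student's t-distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Reference result from SciPy for the same (equal-variance) test
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"scipy : t = {t_scipy:.3f}, p = {p_scipy:.4f}")
```

The two lines of output should agree, which shows that the library call is the same ratio of mean difference to standard error.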

---

<span style='font-size: 20px; color: crimson; font-family: Colonna MT; font-weight: 600'>When to Use a T-Test</span>
 
- Use a t-test when comparing means and the population standard deviation is unknown. It is particularly useful for small samples (<30), where the t-distribution differs noticeably from the normal distribution.  
- Ensure assumptions such as normality, independence, and equal variance (for the standard independent t-test) are met.  
- For clearly non-normal data, especially with small samples, consider nonparametric alternatives such as the Mann-Whitney U test (for independent groups) or the Wilcoxon signed-rank test (for paired groups).  
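
The checklist above can be sketched as a small helper that uses `shapiro` for normality and `levene` for equal variances before choosing a test. The function name `choose_two_group_test` and the simulated data are assumptions made for this illustration. Pre-testing assumptions in this way is a common heuristic, not a definitive procedure.

```python
import numpy as np
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu

def choose_two_group_test(a, b, alpha=0.05):
    """Hypothetical helper: pick a two-group test from simple assumption checks."""
    # Shapiro-Wilk: a small p-value suggests the sample is not normal
    normal = shapiro(a).pvalue > alpha and shapiro(b).pvalue > alpha
    if not normal:
        stat, p = mannwhitneyu(a, b, alternative="two-sided")
        return "Mann-Whitney U", stat, p

    # Levene: a small p-value suggests the variances differ
    equal_var = levene(a, b).pvalue > alpha
    stat, p = ttest_ind(a, b, equal_var=equal_var)
    name = "Student's t-test" if equal_var else "Welch's t-test"
    return name, stat, p

# Simulated data for illustration only
rng = np.random.default_rng(0)
a = rng.normal(70, 8, size=25)
b = rng.normal(74, 8, size=25)

name, stat, p = choose_two_group_test(a, b)
print(f"{name}: statistic = {stat:.3f}, p = {p:.4f}")
```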

<h1 style='font-size: 20px; color: crimson; font-family: Candara; font-weight: 600'>Import Required Libraries</h1>

In [2]:
from scipy import stats
from scipy.stats import shapiro, levene, skew, kurtosis, zscore
from scipy.stats import ttest_ind, ttest_rel, ttest_1samp
from itertools import combinations
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import warnings
import math

warnings.simplefilter("ignore")
pd.set_option('display.max_columns', 10)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print("Libraries Loaded Successfully")

Libraries Loaded Successfully


<h1 style='font-size: 25px; color: crimson; font-family: Colonna MT; font-weight: 600'>1.0: Types and Assumptions of T-Test</h1>

**There are three main types of t-tests, each suited to different types of data and research questions. These include the one-sample t-test, the independent (two-sample) t-test, and the paired t-test. Each test has its specific assumptions, which must be met for the results to be valid.**


<h4 style='font-size: 20px; color: crimson; font-family: Candara; font-weight: 600'>1.1: Independent (Two-Sample) T-Test</h4>

The independent two-sample t-test compares the means of two independent groups to see if they are significantly different. This test is commonly used when you are comparing two distinct groups that have no relationship, such as two different treatment groups in an experiment or two separate populations. 

**Example**: <span style='color: green'>*A researcher might want to compare the average growth and yield of wheat grown with organic fertilizer versus synthetic fertilizer. The two samples (wheat grown with organic fertilizer and wheat grown with synthetic fertilizer) are independent of each other, and the test would assess whether the two fertilizers produce significantly different yields.*</span>

**Assumptions for the independent t-test:**

1. The data in both groups should be independently sampled.

2. Each group should ideally follow a normal distribution.

3. The variances of the two groups should be equal (homogeneity of variance). If this assumption is violated, alternative tests such as Welch’s t-test may be used.
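
As a minimal sketch of the third assumption, the snippet below runs the same comparison with and without the equal-variance assumption. In `scipy.stats.ttest_ind`, `equal_var=True` gives the pooled (Student) test and `equal_var=False` gives Welch's test. The two simulated groups are hypothetical, with deliberately different spreads and sizes.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated groups with unequal variances and unequal sizes (illustration only)
rng = np.random.default_rng(1)
small_spread = rng.normal(70, 2, size=40)
large_spread = rng.normal(72, 12, size=10)

# Pooled-variance (Student) t-test: assumes equal variances
t_student, p_student = ttest_ind(small_spread, large_spread, equal_var=True)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(small_spread, large_spread, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch  : t = {t_welch:.3f}, p = {p_welch:.4f}")
```

When the variances and group sizes differ like this, the two tests generally give different statistics and p-values, which is why Welch's version is often the safer default.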

In [3]:
def dataset_generation(sample_size=1000):
    np.random.seed(42)
    fertilizer = np.random.choice(['Organic', 'Synthetic'], size=sample_size)
    plant_height = np.random.normal(100, 8, size=sample_size)
    Number_of_leaves = np.random.normal(20, 4, size=sample_size)
    Leaf_width = np.random.normal(12, 2, size=sample_size)
    leaf_height = np.random.normal(15, 3, size=sample_size)
    root_length = np.random.normal(25, 9, size=sample_size)
    number_of_pods = np.random.normal(70, 8, size=sample_size)
    pod_length = np.random.normal(18, 5, size=sample_size)
    Yield = np.random.normal(70, 8, size=sample_size)
    
    data = pd.DataFrame({
        "Fertilizer": fertilizer,
        'Plant Height (cm)': plant_height,
        'Number of Leaves': Number_of_leaves,
        'Leaf Width (cm)': Leaf_width,
        'Leaf Height (cm)': leaf_height,
        'Root Length (cm)': root_length,
        'Number of Pods': number_of_pods,
        'Pod Length (cm)': pod_length,
        'Yield (kg/ha)': Yield,
        
    })
    return data

def Independent_ttest(df, group_column, Variables):
    unique_groups = df[group_column].unique()
    group_combinations = list(combinations(unique_groups, 2))
    results = []
    for column in Variables:
        for group1, group2 in group_combinations:
            group1_data = df[df[group_column] == group1][column]
            group2_data = df[df[group_column] == group2][column]
            # equal_var=False runs Welch's t-test, which does not assume equal variances
            t_stat, p_value = ttest_ind(group1_data, group2_data, equal_var=False)
            
            results.append({
                'Parameter': column,
                'Group 1': group1,
                'Group 2': group2,
                'T-Statistic': t_stat,
                'P-Value': p_value,
                'Interpretation': 'Significant' if p_value < 0.05 else 'Not Significant'
            })
        
    results_df = pd.DataFrame(results)
    return results_df

df = dataset_generation(sample_size=1000)
Metrics = df.select_dtypes(include=[np.number]).columns
Results = Independent_ttest(df, group_column='Fertilizer', Variables=Metrics)
Results

| | Parameter | Group 1 | Group 2 | T-Statistic | P-Value | Interpretation |
|---|---|---|---|---|---|---|
| 0 | Plant Height (cm) | Organic | Synthetic | 0.19 | 0.85 | Not Significant |
| 1 | Number of Leaves | Organic | Synthetic | 0.56 | 0.58 | Not Significant |
| 2 | Leaf Width (cm) | Organic | Synthetic | -0.28 | 0.78 | Not Significant |
| 3 | Leaf Height (cm) | Organic | Synthetic | -0.24 | 0.81 | Not Significant |
| 4 | Root Length (cm) | Organic | Synthetic | 1.16 | 0.24 | Not Significant |
| 5 | Number of Pods | Organic | Synthetic | -0.9 | 0.37 | Not Significant |
| 6 | Pod Length (cm) | Organic | Synthetic | -0.31 | 0.76 | Not Significant |
| 7 | Yield (kg/ha) | Organic | Synthetic | 0.87 | 0.38 | Not Significant |


---
<h4 style='font-size: 20px; color: crimson; font-family: Candara; font-weight: 600'>1.2: Paired T-Test</h4>

The paired t-test is used when comparing two related or matched groups. The data consists of pairs of observations that are naturally linked, such as measurements taken before and after a treatment or intervention on the same subjects. The paired t-test is useful when you want to see if there is a significant change within the same sample over time or under different conditions.

**Example**: <span style='color: green'>*A researcher might want to compare the soil pH level before and after adding lime to the soil. Since both measurements are taken from the same plot, they are not independent, and a paired t-test would be used to determine whether there is a significant difference in the pH levels before and after treatment.*</span>

**Assumptions for the paired t-test:**

1. The differences between the paired observations should be approximately normally distributed.

2. The data points should be paired in a meaningful way, such as "before and after" measurements.
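
The first assumption refers to the differences because a paired t-test is equivalent to a one-sample t-test on the per-pair differences, tested against zero. The sketch below checks this equivalence with `ttest_rel` and `ttest_1samp`. The before/after values are small illustrative measurements, and the variable names are assumptions for this example.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Illustrative before/after measurements taken on the same five plots
before = np.array([6.4, 3.3, 6.5, 5.2, 4.1])
after = np.array([8.7, 9.6, 6.9, 9.5, 6.4])

# Paired t-test on the two related samples
t_rel, p_rel = ttest_rel(before, after)

# One-sample t-test on the per-plot differences against a mean of zero
differences = before - after
t_one, p_one = ttest_1samp(differences, 0.0)

print(f"paired    : t = {t_rel:.3f}, p = {p_rel:.4f}")
print(f"one-sample: t = {t_one:.3f}, p = {p_one:.4f}")
```

Because the test only looks at the differences, the pairing must be preserved: row *i* of the "before" column has to belong to the same unit as row *i* of the "after" column.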

In [18]:
def paired_t_test(df, pairs, alpha=0.05): 
    """ Perform paired t-tests for multiple parameter pairs.
    Parameters:
        df (pd.DataFrame): Input dataframe containing paired sample data.
        pairs (dict): Dictionary with keys as parameter names and values as tuples of (before, after) column names.
        alpha (float): Significance level (default is 0.05).
    
    Returns:
        pd.DataFrame: Results including t-statistic, p-value, and interpretation.
    """
    results = []
    for param, (before_col, after_col) in pairs.items():
        if before_col in df.columns and after_col in df.columns:
            # Drop any pair with a missing value so each before/after
            # measurement stays matched to the same row
            paired = df[[before_col, after_col]].dropna()
            before_data = paired[before_col]
            after_data = paired[after_col]
            
            # Perform paired t-test
            t_stat, p_value = ttest_rel(before_data, after_data)
            
            # Interpretation
            conclusion = "Significant Difference" if p_value < alpha else "No Significant Difference"
            
            results.append({
                "Parameter": param,
                "Before Mean": before_data.mean(),
                "After Mean": after_data.mean(),
                "T-Statistic": t_stat,
                "P-Value": p_value,
                "Alpha": alpha,
                "Conclusion": conclusion
            })
    
    return pd.DataFrame(results)



data = {
    "Soil_pH_Before": [6.4, 6.3, 6.5, 6.2, 6.1], "Soil_pH_After": [6.7, 6.6, 6.8, 6.5, 6.4], 
    "Nitrogen_(%)_Before": [6.4, 3.3, 6.5, 5.2, 4.1], "Nitrogen_(%)_After": [8.7, 9.6, 6.9, 9.5, 6.4],
    "Phosphorous (%)_Before": [6.4, 6.3, 6.5, 6.2, 6.1], "Phosphorous (%)_After": [6.7, 6.6, 6.8, 6.5, 6.4],
    "CEC (Meq/100g)_Before": [9.4, 6.3, 8.5, 6.2, 5.1], "CEC (Meq/100g)_After": [6.7, 8.6, 9.8, 6.5, 7.4]  
    }

data = pd.DataFrame(data)
parameter_pairs = {
    "Soil pH": ("Soil_pH_Before", "Soil_pH_After"),
    "Nitrogen (%)": ("Nitrogen_(%)_Before", "Nitrogen_(%)_After"), 
    "Phosphorous (%)": ("Phosphorous (%)_Before", "Phosphorous (%)_After"), 
    "CEC (Meq/100g)": ("CEC (Meq/100g)_Before", "CEC (Meq/100g)_After"),
    }


results_df = paired_t_test(data, parameter_pairs)
results_df

| | Parameter | Before Mean | After Mean | T-Statistic | P-Value | Alpha | Conclusion |
|---|---|---|---|---|---|---|---|
| 0 | Soil pH | 6.3 | 6.6 | -1688025830031779.5 | 0.0 | 0.05 | Significant Difference |
| 1 | Nitrogen (%) | 5.1 | 8.22 | -3.1 | 0.04 | 0.05 | Significant Difference |
| 2 | Phosphorous (%) | 6.3 | 6.6 | -1688025830031779.5 | 0.0 | 0.05 | Significant Difference |
| 3 | CEC (Meq/100g) | 7.1 | 7.8 | -0.75 | 0.49 | 0.05 | No Significant Difference |
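
The enormous t-statistic reported for Soil pH (and for Phosphorous, which uses the same values) is a numerical artifact, not strong evidence. Every pair differs by the same amount, so the standard deviation of the differences is essentially zero and the t-statistic divides by a near-zero standard error. A minimal sketch using the Soil pH values from the cell above:

```python
import numpy as np

# Soil pH values from the example data above
before = np.array([6.4, 6.3, 6.5, 6.2, 6.1])
after = np.array([6.7, 6.6, 6.8, 6.5, 6.4])

# Every plot changes by (almost exactly) the same amount
diff = after - before
print("differences:", diff)

# The spread of the differences is only floating-point noise
diff_sd = diff.std(ddof=1)
print("standard deviation of differences:", diff_sd)
```

With real measurements the differences would vary from plot to plot. When they do not, the paired t-test is degenerate and its output should not be interpreted.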


In [21]:
df = pd.read_excel('Datasets/Pair dataset.xlsx')
df.sample(10)

| | SampleID | Soil pH1 | Soil pH2 | Nitrogen (%) 1 | Nitrogen (%) 2 | Phosphorous (%) 1 | Phosphorous (%) 2 | CEC (Meq/100g) 1 | CEC (Meq/100g) 2 |
|---|---|---|---|---|---|---|---|---|---|
| 58 | 81504 | 4.81 | 6.81 | 2.93 | 2.43 | 2.09 | 5.23 | 268.37 | 311.37 |
| 53 | 82711 | 5.3 | 7.3 | 2.13 | 1.38 | 2.72 | 4.44 | 295.65 | 335.65 |
| 96 | 33078 | 4.31 | 6.31 | 4.27 | 3.4 | 5.96 | 7.07 | 271.32 | 312.32 |
| 63 | 27389 | 5.72 | 7.72 | 4.23 | 3.51 | 4.97 | 7.12 | 307.64 | 364.64 |
| 14 | 71768 | 5.03 | 7.03 | 4.23 | 3.33 | 7.74 | 6.92 | 304.77 | 361.77 |
| 56 | 69746 | 3.51 | 5.51 | 2.69 | 1.84 | 5.23 | 5.17 | 334.15 | 386.15 |
| 44 | 33733 | 3.87 | 5.87 | 4.7 | 3.87 | 5.18 | 7.33 | 239.88 | 285.88 |
| 1 | 46858 | 5.69 | 7.69 | 4.59 | 3.76 | 4.12 | 6.63 | 274.93 | 328.93 |
| 37 | 83040 | 5.09 | 7.09 | 4.19 | 3.43 | 5.07 | 6.87 | 204.17 | 254.17 |
| 45 | 62015 | 5.19 | 7.19 | 2.69 | 2.35 | 4.95 | 5.18 | 294.0 | 340.0 |


In [22]:
parameter_pairs = {
    "Soil pH": ("Soil pH1", "Soil pH2"),
    "Nitrogen (%)": ("Nitrogen (%) 1", "Nitrogen (%) 2"),
    "Phosphorous (%)": ("Phosphorous (%) 1", "Phosphorous (%) 2"),
    "CEC (Meq/100g)": ("CEC (Meq/100g) 1", "CEC (Meq/100g) 2")
}

results = paired_t_test(df, parameter_pairs)
results.round(3)

| | Parameter | Before Mean | After Mean | T-Statistic | P-Value | Alpha | Conclusion |
|---|---|---|---|---|---|---|---|
| 0 | Soil pH | 4.38 | 6.38 | -1.0280176505656318e+17 | 0.0 | 0.05 | Significant Difference |
| 1 | Nitrogen (%) | 3.54 | 3.04 | 17.05 | 0.0 | 0.05 | Significant Difference |
| 2 | Phosphorous (%) | 4.94 | 6.0 | -4.96 | 0.0 | 0.05 | Significant Difference |
| 3 | CEC (Meq/100g) | 272.86 | 322.44 | -84.7 | 0.0 | 0.05 | Significant Difference |


---

<h4 style='font-size: 20px; color: crimson; font-family: Candara; font-weight: 600'>1.3: One-Sample T-Test</h4>



The one-sample t-test is used when you want to compare the mean of a single sample to a known value or a population mean. This test is typically used when you have a sample and want to assess whether its mean differs from a known population mean or a theoretical expectation.

**Example**: <span style='color: green'>*Suppose a farmer wants to test whether the average soil nitrogen level in a plot of land is different from the national average of 2.5%. The one-sample t-test would compare the average nitrogen level in the sample from the plot to the national average of 2.5%.*</span>

**Assumptions for the one-sample t-test:**

1. The sample data should be approximately normally distributed, particularly important for small sample sizes (typically fewer than 30 observations).

2. The data points should be independent of each other, meaning that the measurement of one observation does not influence another.
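
As a minimal sketch of what the one-sample test computes, the snippet below calculates t = (sample mean − hypothesized mean) / (s / √n) by hand and compares it with `scipy.stats.ttest_1samp`. The nitrogen readings are hypothetical, and 2.5 is the reference value from the example above.

```python
import numpy as np
from scipy import stats

# Hypothetical soil nitrogen readings (%) from one plot (illustration only)
sample = np.array([2.9, 3.1, 2.4, 2.8, 3.3, 2.6, 3.0, 2.7])
mu0 = 2.5  # hypothesized population mean from the example above

n = len(sample)

# Standard error of the mean, using the sample standard deviation (ddof=1)
std_error = sample.std(ddof=1) / np.sqrt(n)

# t-statistic and two-sided p-value with n - 1 degrees of freedom
t_manual = (sample.mean() - mu0) / std_error
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Reference result from SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print(f"manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"scipy : t = {t_scipy:.3f}, p = {p_scipy:.4f}")
```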

In [64]:
def dataset_generation(sample_size=1000):
    # Synthetic soil data; the fixed seed keeps the example reproducible
    np.random.seed(42)
    Plot = np.random.choice(['Plot 1', 'Plot 2', 'Plot 3', 'Plot 4'], size=sample_size)
    Nitrogen = np.random.normal(5, 10, size=sample_size)
    Phosphorous = np.random.normal(7, 2, size=sample_size) 
    Calcium = np.random.normal(5, 8, size=sample_size)
    
    data = pd.DataFrame({
        "Plot": Plot,
        'Nitrogen': Nitrogen,
        'Phosphorous': Phosphorous,
        'Calcium': Calcium,
    })
    return data

def one_sample_t_test(df, columns, population_means, alpha=0.05): 
    """ Perform one-sample t-tests for multiple parameters.

    Parameters:
        df (pd.DataFrame): Input dataframe containing sample data.
        columns (list): List of column names to test.
        population_means (dict): Dictionary with column names as keys and hypothesized means as values.
        alpha (float): Significance level (default is 0.05).
    
    Returns:
        pd.DataFrame: Results including t-statistic, p-value, and interpretation.
    """
    results = []
    
    for col in columns:
        # Columns without a hypothesized mean (e.g. "Plot") are skipped
        if col in df.columns and col in population_means:
            sample_data = df[col].dropna()  # Remove NaN values
            pop_mean = population_means[col]
            
            t_stat, p_value = ttest_1samp(sample_data, pop_mean)
            Interpretation = "Significant Difference" if p_value < alpha else "No Significant Difference"
            
            results.append({
                "Parameter": col,
                "Sample Mean": sample_data.mean(),
                "Hypothesized Mean": pop_mean,
                "T-Statistic": t_stat,
                "P-Value": p_value,
                "Alpha": alpha,
                "Conclusion": Interpretation
            })
    
    return pd.DataFrame(results)

df = dataset_generation(sample_size=1000)
population_means = {"Nitrogen": 2.5, "Phosphorous": 3, "Calcium": 4}
results_df = one_sample_t_test(df, df.columns, population_means)
results_df

| | Parameter | Sample Mean | Hypothesized Mean | T-Statistic | P-Value | Alpha | Conclusion |
|---|---|---|---|---|---|---|---|
| 0 | Nitrogen | 5.4 | 2.5 | 9.16 | 0.0 | 0.05 | Significant Difference |
| 1 | Phosphorous | 7.08 | 3.0 | 65.52 | 0.0 | 0.05 | Significant Difference |
| 2 | Calcium | 5.01 | 4.0 | 3.97 | 0.0 | 0.05 | Significant Difference |


---

This analysis was performed by **Jabulente**, a passionate and dedicated data scientist with a strong commitment to using data to drive meaningful insights and solutions. For inquiries, collaborations, or further discussions, please feel free to reach out via the links below.  

---

<div align="center">  
    
[![GitHub](https://img.shields.io/badge/GitHub-Jabulente-black?logo=github)](https://github.com/Jabulente)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-Jabulente-blue?logo=linkedin)](https://linkedin.com/in/jabulente-208019349)  [![X (Twitter)](https://img.shields.io/badge/X-@Jabulente-black?logo=x)](https://x.com/Jabulente)  [![Instagram](https://img.shields.io/badge/Instagram-@Jabulente-purple?logo=instagram)](https://instagram.com/Jabulente)  [![Threads](https://img.shields.io/badge/Threads-@Jabulente-black?logo=threads)](https://threads.net/@Jabulente)  [![TikTok](https://img.shields.io/badge/TikTok-@Jabulente-teal?logo=tiktok)](https://tiktok.com/@Jabulente)  [![Email](https://img.shields.io/badge/Email-jabulente@hotmail.com-red?logo=gmail)](mailto:Jabulente@hotmail.com)  

</div>

<h1 style='font-size: 55px; color: Tomato; font-family: Colonna MT; font-weight: 700; text-align: center'>THE END</h1>