<a href="https://colab.research.google.com/github/Magero-Steven/Research-Analysis/blob/main/Statistical_Hypothesis_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Statistical Hypothesis Testing in Python**

### **Import the required libraries and load the dataset**

In [9]:
import io

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

from google.colab import files

# Upload the CSV file and capture its filename
uploaded = files.upload()
file_name = list(uploaded.keys())[0]  # Name of the first uploaded file

# Read the uploaded CSV bytes into a pandas DataFrame
df = pd.read_csv(io.BytesIO(uploaded[file_name]))

Saving west_pokot_leishmaniasis_study.csv to west_pokot_leishmaniasis_study.csv


# **Independent T-Test**

**Purpose:**

Compare the means of two independent groups.

**Study Example:**

Comparing the average diagnosis delay (in days) between male and female participants.

In [10]:
### 1. Independent T-Test: Compare diagnosis delay days between males and females
male_delay = df[df['gender'] == 'Male']['diagnosis_delay_days']
female_delay = df[df['gender'] == 'Female']['diagnosis_delay_days']
t_stat, p_val = stats.ttest_ind(male_delay, female_delay)
print("\n1. Independent T-Test:")
print(f"T-Statistic = {t_stat:.4f}, P-Value = {p_val:.4f}")



1. Independent T-Test:
T-Statistic = -1.5746, P-Value = 0.1169
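
`stats.ttest_ind` assumes equal variances in the two groups by default. When that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two variants on synthetic data; the arrays `group_a` and `group_b` are hypothetical stand-ins, not the study's `diagnosis_delay_days` column.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two independent groups (NOT the study data).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=30, scale=5, size=80)
group_b = rng.normal(loc=32, scale=12, size=120)

# Student's t-test pools the variances; Welch's t-test does not.
student_t, student_p = stats.ttest_ind(group_a, group_b)
welch_t, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {student_t:.4f}, p = {student_p:.4f}")
print(f"Welch:   t = {welch_t:.4f}, p = {welch_p:.4f}")
```

The two results tend to agree when group sizes and variances are similar, and diverge when they are not.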


# **Chi-Square Test**
**Purpose:**
Test for a relationship between two categorical variables.

**Example:**
Is there an association between gender and preference for public health centers for treatment?


In [11]:
### 2. Chi-Square Test: Relationship between gender and treatment choice
contingency_table = pd.crosstab(df['gender'], df['treatment_choice'])
chi2_stat, chi2_p, dof, expected = stats.chi2_contingency(contingency_table)
print("\n2. Chi-Square Test:")
print(f"Chi2 Statistic = {chi2_stat:.4f}, P-Value = {chi2_p:.4f}")


2. Chi-Square Test:
Chi2 Statistic = 0.1113, P-Value = 0.7387
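
Two details of `stats.chi2_contingency` are easy to miss: it applies Yates' continuity correction by default when the table has one degree of freedom (a 2x2 table), and it returns the expected counts, which can be checked against the usual rule of thumb that each should be at least 5. The sketch below uses a hypothetical 2x2 table of counts, not the study's crosstab, and adds Cramér's V as an effect-size measure.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 counts (rows: gender, columns: treatment choice).
observed = np.array([[40, 60],
                     [45, 55]])

# Default call: Yates' correction is applied because dof == 1.
chi2_corr, p_corr, dof, expected = stats.chi2_contingency(observed)

# Same test without the continuity correction.
chi2_raw, p_raw, _, _ = stats.chi2_contingency(observed, correction=False)

# Rule of thumb: all expected counts should be at least 5.
print("Expected counts:\n", expected)
print("All expected >= 5:", bool((expected >= 5).all()))

# Cramér's V from the uncorrected statistic (ranges from 0 to 1).
n = observed.sum()
cramers_v = np.sqrt(chi2_raw / (n * (min(observed.shape) - 1)))
print(f"Chi2 (corrected) = {chi2_corr:.4f}, Chi2 (raw) = {chi2_raw:.4f}")
print(f"Cramér's V = {cramers_v:.4f}")
```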


# **Pearson Correlation**

**Purpose:**
Measure the strength of a linear relationship between two continuous variables.

**Study Example:**
Assess whether attending community awareness sessions is associated with higher post-session knowledge scores. Because attendance is coded as a binary variable (Yes = 1, No = 0), the Pearson coefficient computed here is a point-biserial correlation.


In [12]:
### 3. Pearson Correlation: Attendance vs Knowledge Scores
# Encode 'Yes' = 1, 'No' = 0
df['attended_numeric'] = df['attended_awareness_sessions'].map({'Yes': 1, 'No': 0})
corr_coeff, corr_p = stats.pearsonr(df['attended_numeric'], df['knowledge_score_post'])
print("\n3. Pearson Correlation:")
print(f"Correlation Coefficient = {corr_coeff:.4f}, P-Value = {corr_p:.4f}")


3. Pearson Correlation:
Correlation Coefficient = 0.0698, P-Value = 0.3262
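
When one of the two variables is a 0/1 indicator, the Pearson coefficient coincides with the point-biserial correlation, which SciPy exposes as `stats.pointbiserialr`. The sketch below checks this on synthetic data; `attended_flag` and `scores` are hypothetical names, not columns from the study file.

```python
import numpy as np
from scipy import stats

# Synthetic data (NOT the study data): a 0/1 flag and a continuous score.
rng = np.random.default_rng(0)
attended_flag = rng.integers(0, 2, size=200)
scores = rng.normal(loc=60, scale=10, size=200) + 2 * attended_flag

r_pearson, p_pearson = stats.pearsonr(attended_flag, scores)
r_pb, p_pb = stats.pointbiserialr(attended_flag, scores)

print(f"Pearson:        r = {r_pearson:.4f}, p = {p_pearson:.4f}")
print(f"Point-biserial: r = {r_pb:.4f}, p = {p_pb:.4f}")
```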


# **One-Way ANOVA (Analysis of Variance)**

**Purpose:**
Compare the means of three or more independent groups.


**Study Example:**
Assess whether mean satisfaction with the information received differs across three outreach methods (community dialogues, radio programs, CHV visits).

In [13]:
### 4. ANOVA: Compare knowledge scores by session attendance
# Illustration with two groups (attended vs. not attended) standing in for the outreach methods
attended = df[df['attended_awareness_sessions'] == 'Yes']['knowledge_score_post']
not_attended = df[df['attended_awareness_sessions'] == 'No']['knowledge_score_post']
f_stat, anova_p = stats.f_oneway(attended, not_attended)
print("\n4. ANOVA:")
print(f"F-Statistic = {f_stat:.4f}, P-Value = {anova_p:.4f}")


4. ANOVA:
F-Statistic = 0.9689, P-Value = 0.3262
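
With only two groups, one-way ANOVA is equivalent to the equal-variance independent t-test: the F-statistic is the square of the t-statistic and the p-values match. That is also why the p-value above equals the one from the correlation with the binary attendance flag. The sketch below demonstrates the identity on synthetic data and then shows the three-group call; all group arrays are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic groups (NOT the study data).
rng = np.random.default_rng(3)
g1 = rng.normal(loc=60, scale=10, size=90)
g2 = rng.normal(loc=62, scale=10, size=110)
g3 = rng.normal(loc=65, scale=10, size=100)

# Two groups: F equals t squared, and the p-values coincide.
t_stat, t_p = stats.ttest_ind(g1, g2)
f_two, f_two_p = stats.f_oneway(g1, g2)
print(f"t^2 = {t_stat**2:.4f}, F = {f_two:.4f}")
print(f"t-test p = {t_p:.4f}, ANOVA p = {f_two_p:.4f}")

# Three groups: the case ANOVA is designed for.
f_three, f_three_p = stats.f_oneway(g1, g2, g3)
print(f"Three-group F = {f_three:.4f}, p = {f_three_p:.4f}")
```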


# **Paired T-Test**

**Purpose:**
Compare means before and after an intervention for the same participants.

**Example:**
*Has participants’ knowledge about leishmaniasis improved after awareness sessions?*

In [14]:
### 5. Paired T-Test: Knowledge scores before and after
paired_t_stat, paired_p_val = stats.ttest_rel(df['knowledge_score_pre'], df['knowledge_score_post'])
print("\n5. Paired T-Test:")
print(f"T-Statistic = {paired_t_stat:.4f}, P-Value = {paired_p_val:.4f}")



5. Paired T-Test:
T-Statistic = -14.1248, P-Value = 0.0000
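
`stats.ttest_rel(a, b)` tests whether the mean of `a - b` is zero, so a negative statistic with the argument order used above means the post-session scores are higher on average. It is the same as a one-sample t-test on the per-participant differences, as the sketch below shows on synthetic data (`pre` and `post` are hypothetical arrays). A p-value printed as `0.0000` is only rounded to zero; scientific notation shows its magnitude.

```python
import numpy as np
from scipy import stats

# Synthetic before/after scores for the same participants (NOT the study data).
rng = np.random.default_rng(11)
pre = rng.normal(loc=50, scale=10, size=100)
post = pre + rng.normal(loc=5, scale=4, size=100)

paired_t, paired_p = stats.ttest_rel(pre, post)
diff_t, diff_p = stats.ttest_1samp(pre - post, 0.0)

print(f"Paired:      t = {paired_t:.4f}, p = {paired_p:.2e}")
print(f"Differences: t = {diff_t:.4f}, p = {diff_p:.2e}")
print(f"Mean change (post - pre) = {(post - pre).mean():.2f}")
```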


# **Mann-Whitney U Test**

**Purpose:**
Non-parametric alternative to the independent t-test.

**Example:**
*Compare diagnosis delays between males and females (when data is not normally distributed).*

In [15]:
### 6. Mann-Whitney U Test: Diagnosis delay between genders
u_stat, mannwhitney_p = stats.mannwhitneyu(male_delay, female_delay)
print("\n6. Mann-Whitney U Test:")
print(f"U-Statistic = {u_stat:.4f}, P-Value = {mannwhitney_p:.4f}")


6. Mann-Whitney U Test:
U-Statistic = 4364.5000, P-Value = 0.1208
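
One way to decide between the t-test and the Mann-Whitney U test is to check each group for normality first, for example with the Shapiro-Wilk test (`stats.shapiro`). The sketch below does this on synthetic, right-skewed data; `delay_a` and `delay_b` are hypothetical stand-ins for the two delay samples, and the 0.05 cutoff is a convention rather than a rule.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed samples (NOT the study data).
rng = np.random.default_rng(7)
delay_a = rng.exponential(scale=20, size=150)
delay_b = rng.exponential(scale=25, size=150)

# Shapiro-Wilk: a small p-value suggests the sample is not normal.
w_a, shapiro_p_a = stats.shapiro(delay_a)
w_b, shapiro_p_b = stats.shapiro(delay_b)
print(f"Shapiro-Wilk p-values: {shapiro_p_a:.2e}, {shapiro_p_b:.2e}")

# If either group looks non-normal, fall back to the rank-based test.
if min(shapiro_p_a, shapiro_p_b) < 0.05:
    u_stat, test_p = stats.mannwhitneyu(delay_a, delay_b, alternative="two-sided")
    print(f"Mann-Whitney U = {u_stat:.1f}, p = {test_p:.4f}")
else:
    t_stat, test_p = stats.ttest_ind(delay_a, delay_b)
    print(f"t = {t_stat:.4f}, p = {test_p:.4f}")
```

Passing `alternative="two-sided"` explicitly keeps the call independent of SciPy's default, which has changed between versions.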


# **Kruskal-Wallis Test**

**Purpose:**
Non-parametric alternative to ANOVA (for 3+ groups).

**Example:**
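*Compare post-session knowledge scores between participants who did and did not attend awareness sessions. The same call extends to three or more groups, such as participants who attended 1, 2, or 3+ sessions.*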

In [16]:
### 7. Kruskal-Wallis Test: Post-session knowledge scores by session attendance (two groups)
kruskal_stat, kruskal_p = stats.kruskal(attended, not_attended)
print("\n7. Kruskal-Wallis Test:")
print(f"Kruskal-Wallis Statistic = {kruskal_stat:.4f}, P-Value = {kruskal_p:.4f}")



7. Kruskal-Wallis Test:
Kruskal-Wallis Statistic = 1.4149, P-Value = 0.2343
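
The Kruskal-Wallis test is most useful with three or more groups. The sketch below splits a DataFrame by a grouping column and passes each group's values to `stats.kruskal`. The `demo` DataFrame and its `session_group` and `delay_days` columns are hypothetical and synthetic, since the study file is not assumed to contain a session-count column.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic three-group data (NOT the study data).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "session_group": np.repeat(["1", "2", "3+"], 50),
    "delay_days": np.concatenate([
        rng.exponential(scale=30, size=50),
        rng.exponential(scale=25, size=50),
        rng.exponential(scale=20, size=50),
    ]),
})

# One array of delays per group, in the order groupby yields them.
samples = [grp["delay_days"].to_numpy() for _, grp in demo.groupby("session_group")]

h_stat, kw_p = stats.kruskal(*samples)
print(f"Groups compared: {len(samples)}")
print(f"Kruskal-Wallis H = {h_stat:.4f}, p = {kw_p:.4f}")
```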


# **Logistic Regression**
**Purpose:**
Model binary outcomes (like treatment success).

**Example:**
*Predict the likelihood of treatment success based on whether the participant attended awareness sessions.*

In [17]:
### 8. Logistic Regression: Predict treatment success based on awareness attendance
# Logistic Regression model
logit_model = smf.logit('treatment_success ~ attended_numeric', data=df).fit()
print("\n8. Logistic Regression Summary:")
print(logit_model.summary())


Optimization terminated successfully.
         Current function value: 0.394911
         Iterations 6

8. Logistic Regression Summary:
                           Logit Regression Results                           
Dep. Variable:      treatment_success   No. Observations:                  200
Model:                          Logit   Df Residuals:                      198
Method:                           MLE   Df Model:                            1
Date:                Mon, 28 Apr 2025   Pseudo R-squ.:                0.002202
Time:                        18:54:20   Log-Likelihood:                -78.982
converged:                       True   LL-Null:                       -79.156
Covariance Type:            nonrobust   LLR p-value:                    0.5549
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            1.6740      0.363      4.608      0.000   