# Hypothesis Testing – Diabetes Health Indicators

This notebook applies statistical hypothesis testing to explore key relationships in the Diabetes Health Indicators dataset.  
The dataset has been cleaned and encoded during the ETL and Feature Engineering stages and is stored in `combined_ml_ready.csv`.

For each hypothesis:
- Define null and alternative hypotheses.
- Select an appropriate statistical test.
- Calculate effect sizes to gauge practical significance.
- Provide an interpretation in plain English.


# Load Libraries and Dataset

We begin by importing the required Python libraries:

- **pandas**: Data manipulation
- **numpy**: Numerical computations
- **scipy.stats**: Statistical tests

We will then load the prepared dataset `combined_ml_ready.csv` for hypothesis testing.

In [1]:
# 1. Import Libraries
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu
import math

In [2]:
# Load dataset
df = pd.read_csv("../data/combined_ml_ready.csv")
df.head()

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,Education_4.0,Education_5.0,Education_6.0,Income_2.0,Income_3.0,Income_4.0,Income_5.0,Income_6.0,Income_7.0,Income_8.0
0,0.0,1.0,1.0,1.0,1.62754,1.0,0.0,0.0,0.0,0.0,...,True,False,False,False,True,False,False,False,False,False
1,0.0,0.0,0.0,0.0,-0.562466,1.0,0.0,0.0,1.0,0.0,...,False,False,True,False,False,False,False,False,False,False
2,0.0,1.0,1.0,1.0,-0.124464,0.0,0.0,0.0,0.0,1.0,...,True,False,False,False,False,False,False,False,False,True
3,0.0,1.0,0.0,1.0,-0.270465,0.0,0.0,0.0,1.0,1.0,...,False,False,False,False,False,False,False,True,False,False
4,0.0,1.0,1.0,1.0,-0.708466,0.0,0.0,0.0,1.0,1.0,...,False,True,False,False,False,True,False,False,False,False


# Helper Functions

We define helper functions for:
- **Cramér’s V**: Effect size for Chi-square tests (categorical-categorical relationships).
- **Interpretation helpers**: Convert numerical effect size into qualitative descriptors.

In [4]:
# Cramér's V function
def cramers_v(confusion_matrix):
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    return np.sqrt(phi2 / min(k-1, r-1))

# Interpretation helper
def interpret_cramers_v(value):
    if value < 0.1:
        return "Negligible"
    elif value < 0.3:
        return "Small"
    elif value < 0.5:
        return "Medium"
    else:
        return "Large"
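
As a quick sanity check of the effect-size formula, the sketch below applies it to two made-up 2×2 tables (the counts are illustrative, not taken from the dataset). It passes `correction=False` so that Yates' continuity correction, which `chi2_contingency` applies by default to 2×2 tables, does not shrink the statistic. That choice, and the name `cramers_v_uncorrected`, are assumptions of this sketch rather than something the notebook does.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_uncorrected(table: pd.DataFrame) -> float:
    """Cramér's V = sqrt(chi2 / (n * min(r - 1, k - 1))), without Yates' correction."""
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, k - 1))))

# Hypothetical counts: rows = exposure (0/1), columns = outcome (0/1)
perfect = pd.DataFrame([[50, 0], [0, 50]])        # outcome fully determined by exposure
independent = pd.DataFrame([[30, 20], [30, 20]])  # identical row proportions

v_perfect = cramers_v_uncorrected(perfect)
v_independent = cramers_v_uncorrected(independent)
print(v_perfect, v_independent)
```

A perfectly associated table should give a value at the top of the 0 to 1 range and an independent one a value at the bottom, which is what makes the thresholds in `interpret_cramers_v` meaningful.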

# Hypothesis 1 – Smoking and Diabetes

**Question:** Do smokers have higher diabetes prevalence than non-smokers?  

- **H0:** Smoking status is independent of diabetes prevalence.  
- **H1:** Smoking status is associated with diabetes prevalence.  

**Test:** Chi-square test of independence  
**Effect Size:** Cramér’s V

In [5]:
# Contingency table
contingency = pd.crosstab(df['Smoker'], df['Diabetes_binary'])
chi2, p, dof, expected = chi2_contingency(contingency)

# Effect size
cv = cramers_v(contingency)

print("Chi-square test p-value:", p)
print("Cramér's V:", cv, "-", interpret_cramers_v(cv))

Chi-square test p-value: 1.4959852814797885e-188
Cramér's V: 0.040293704797596204 - Negligible


**Interpretation:**  
- Extremely small p-value → statistically significant association between smoking and diabetes.  
- Effect size is **negligible (~0.04)**, so the relationship is weak in practical terms.
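
To illustrate why a tiny p-value and a negligible effect size can coexist, the sketch below scales one hypothetical 2×2 table to larger sample sizes while keeping its proportions fixed. The counts are invented for illustration, and Yates' correction is switched off so the effect size stays exactly constant across scales.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts with a weak association (not the dataset's counts)
base = np.array([[520, 480],
                 [480, 520]])

rows = []
for scale in (1, 10, 100):
    table = base * scale  # same proportions, larger sample
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))  # Cramér's V
    rows.append((n, p, v))
    print(f"n={n:>7}  p={p:.3e}  V={v:.3f}")
```

The p-value shrinks as the sample grows, while Cramér's V does not move, so with hundreds of thousands of respondents even a very weak association is flagged as statistically significant.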

In [16]:
# Define a simple add_result function
results = []

def add_result(**kwargs):
    results.append(kwargs)

# Record the Hypothesis 1 result (run this before later cells overwrite chi2, p and cv)
add_result(
    hypothesis="H1",
    question="Smoking is associated with diabetes.",
    test="Chi-square (Smoker x Diabetes_binary)",
    stat_label="chi2",
    stat_value=chi2,
    p=p,
    effect_label="Cramér's V",
    effect_value=cv,
    effect_note=interpret_cramers_v(cv),  # qualitative label for this test's Cramér's V
    notes=pd.crosstab(df['Diabetes_binary'], df['Smoker']).to_string()
)

# Hypothesis 2 – BMI and Diabetes

**Null Hypothesis (H₀):** There is no difference in BMI between people with and without diabetes.  
**Alternative Hypothesis (H₁):** There is a difference in BMI between the two groups.

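We will use the **Mann-Whitney U test**, a non-parametric test of whether values in one group tend to be larger than in the other (it compares medians only when the two distributions have the same shape), and calculate the **rank-biserial correlation** as the effect size.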

In [17]:
# Split into groups
bmi_diabetes = df[df['Diabetes_binary'] == 1]['BMI']
bmi_no_diabetes = df[df['Diabetes_binary'] == 0]['BMI']

# Mann-Whitney U test
stat, p = mannwhitneyu(bmi_diabetes, bmi_no_diabetes, alternative='two-sided')

# Effect size (rank-biserial correlation)
n1, n2 = len(bmi_diabetes), len(bmi_no_diabetes)
rank_biserial = 1 - (2 * stat) / (n1 * n2)

print(f"Mann-Whitney U p-value: {p}")
print(f"Rank-biserial correlation: {rank_biserial}")

Mann-Whitney U p-value: 0.0
Rank-biserial correlation: -0.32377932042043445


**Interpretation:**  
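- The p-value is reported as 0.0 because it is smaller than floating-point precision can represent → BMI differs significantly between diabetic and non-diabetic individuals.  
- The rank-biserial correlation is about **-0.32**, a moderate effect. With the formula used here, and assuming `mannwhitneyu` returns U for the first sample (SciPy 1.7 or later), the negative sign means diabetic respondents tend to have the higher BMI: in roughly 66% of diabetic/non-diabetic pairings the diabetic respondent's BMI is larger.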
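
The sign convention of the rank-biserial formula is easy to get backwards, so here is a minimal sketch on synthetic data (the group names and distributions are made up). It assumes SciPy 1.7 or later, where `mannwhitneyu` returns the U statistic of the first sample, and shows how U relates to the probability that a value from the first group exceeds one from the second.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic scores: group_a is shifted upward relative to group_b
group_a = rng.normal(loc=0.5, scale=1.0, size=400)
group_b = rng.normal(loc=0.0, scale=1.0, size=600)

u_a, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
n_a, n_b = len(group_a), len(group_b)

# Share of (a, b) pairs in which the group_a value is larger (ties would count half)
prob_a_greater = u_a / (n_a * n_b)

# Same form as the notebook's formula: negative when the first group tends to be larger
r_notebook = 1 - 2 * u_a / (n_a * n_b)

print(f"P(a > b) = {prob_a_greater:.3f}, rank-biserial = {r_notebook:.3f}")
```

Under this convention the two quantities are linked by `r = 1 - 2 * P(a > b)`, so a rank-biserial value can always be translated into a pairwise probability that is easier to explain to a non-technical audience.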


In [21]:
add_result(
    hypothesis="H2",
    question="BMI differs between diabetic and non-diabetic respondents.",
    test="Mann–Whitney U (BMI by Diabetes_binary)",
    stat_label="U",
    stat_value=stat,
    p=p,
    effect_label="Rank-biserial",
    effect_value=rank_biserial,
    effect_note="direction & magnitude shown by rank-biserial",
    notes=f"n_diab={len(bmi_diabetes)}, n_nodiab={len(bmi_no_diabetes)}"
)

# Hypothesis 3 – Physical Activity and Diabetes

**Null Hypothesis (H₀):** There is no association between physical activity and diabetes.  
**Alternative Hypothesis (H₁):** There is an association between physical activity and diabetes.

In [22]:
# Hypothesis 3 – Physical Activity & Diabetes
contingency = pd.crosstab(df['Diabetes_binary'], df['PhysActivity'])
chi2, p, dof, expected = chi2_contingency(contingency)

cv = cramers_v(contingency)
strength = interpret_cramers_v(cv)

print(f"Chi-square test p-value: {p}")
print(f"Cramér's V: {cv} - {strength}")

Chi-square test p-value: 0.0
Cramér's V: 0.08789727629790087 - Negligible


**Interpretation:**  
- Very small p-value → statistically significant relationship between physical activity and diabetes.  
- Effect size is **negligible (~0.09)**, so the practical relationship is weak.

In [25]:
add_result(
    hypothesis="H3",
    question="Physical activity is associated with diabetes.",
    test="Chi-square (PhysActivity x Diabetes_binary)",
    stat_label="chi2",
    stat_value=chi2,
    p=p,
    effect_label="Cramér's V",
    effect_value=cv,
    effect_note=strength,
    notes=pd.crosstab(df['Diabetes_binary'], df['PhysActivity']).to_string()
)

# Hypothesis 4 – High Blood Pressure and Diabetes

**Null Hypothesis (H₀):** There is no association between high blood pressure and diabetes.  
**Alternative Hypothesis (H₁):** There is an association between high blood pressure and diabetes.

In [26]:
# Hypothesis 4 – High Blood Pressure & Diabetes
contingency = pd.crosstab(df['Diabetes_binary'], df['HighBP'])
chi2, p, dof, expected = chi2_contingency(contingency)

cv = cramers_v(contingency)
strength = interpret_cramers_v(cv)

print(f"Chi-square test p-value: {p}")
print(f"Cramér's V: {cv} - {strength}")

Chi-square test p-value: 0.0
Cramér's V: 0.22170520449166004 - Small


**Interpretation:**  
- Very small p-value → statistically significant relationship between high blood pressure and diabetes.  
- Effect size is **small (~0.22)**, so the relationship is modest in practice but clearly stronger than for smoking or physical activity.
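
Cramér's V condenses the association into one number, but group-wise prevalence is often easier to communicate. The sketch below shows one way to obtain it with a row-normalised crosstab. It reuses the notebook's column names on a tiny invented frame, so the printed figures say nothing about the real data.

```python
import pandas as pd

# Tiny synthetic frame reusing the notebook's column names; values are invented
toy = pd.DataFrame({
    "HighBP":          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Diabetes_binary": [0, 0, 0, 0, 1, 0, 0, 1, 1, 1],
})

# Row-normalised crosstab: each row sums to 1, giving diabetes prevalence per HighBP group
prevalence = pd.crosstab(toy["HighBP"], toy["Diabetes_binary"], normalize="index")
print(prevalence)

# Difference in diabetes prevalence between the HighBP = 1 and HighBP = 0 groups
risk_difference = prevalence.loc[1, 1] - prevalence.loc[0, 1]
print(f"Risk difference: {risk_difference:.2f}")
```

Applied to `df`, the same call would complement the chi-square result with a direct statement of how much prevalence differs between the two groups.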

In [28]:
add_result(
    hypothesis="H4",
    question="High blood pressure is associated with diabetes.",
    test="Chi-square (HighBP x Diabetes_binary)",
    stat_label="chi2",
    stat_value=chi2,
    p=p,
    effect_label="Cramér's V",
    effect_value=cv,
    effect_note=strength,
    notes=pd.crosstab(df['Diabetes_binary'], df['HighBP']).to_string()
)

# Hypothesis 5 – General Health and Diabetes

**Null Hypothesis (H₀):** There is no association between general health status and diabetes.  
**Alternative Hypothesis (H₁):** There is an association between general health status and diabetes.

In [29]:
# Hypothesis 5 – General Health & Diabetes
contingency = pd.crosstab(df['Diabetes_binary'], df['GenHlth'])
chi2, p, dof, expected = chi2_contingency(contingency)

cv = cramers_v(contingency)
strength = interpret_cramers_v(cv)

print(f"Chi-square test p-value: {p}")
print(f"Cramér's V: {cv} - {strength}")

Chi-square test p-value: 0.0
Cramér's V: 0.24374009231117402 - Small


**Interpretation:**  
- Very small p-value → statistically significant association between self-reported general health and diabetes.  
- Effect size is **small (~0.24)**, the largest Cramér's V among our chi-square tests, so this is one of the stronger practical relationships we tested.

In [30]:
add_result(
    hypothesis="H5",
    question="General health status is associated with diabetes.",
    test="Chi-square (GenHlth x Diabetes_binary)",
    stat_label="chi2",
    stat_value=chi2,
    p=p,
    effect_label="Cramér's V",
    effect_value=cv,
    effect_note=strength,
    notes=pd.crosstab(df['Diabetes_binary'], df['GenHlth']).to_string()
)

### Exporting Hypothesis Testing Results for Tableau


In this final step, we consolidate the results from all five hypotheses into a single summary table.  
For each hypothesis, we previously appended its test name, test statistic, p-value, effect size, interpretation, and any relevant notes using the `add_result()` helper function.  

This table now contains:
- `hypothesis` – the ID for each test (H1–H5)
- `question` – the research question being tested
- `test` – the statistical test applied
- `p` – the p-value from the statistical test
- `effect_value` – the effect size numerical value
- `effect_note` – interpretation of the effect size strength or direction
- `Significant_0.05` – True/False flag for standard 0.05 significance threshold
- `Significant_Bonferroni_0.01` – True/False flag for Bonferroni-adjusted 0.01 threshold
- `notes` – any extra context or sample counts from the data

The file created here, **`hypothesis_results_summary.csv`**, acts as a bridge between our Python statistical analysis and our Tableau dashboard.  
By using this CSV in Tableau, we can:
- Create an interactive table of results
- Highlight significant results with conditional formatting
- Compare effect sizes visually

This structured approach ensures our statistical findings are easily understood by both technical and non-technical audiences.
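
The Bonferroni flag described above simply compares each p-value with 0.05 divided by the number of tests. As a minimal sketch, the snippet below implements that rule next to Holm's step-down procedure, a standard and slightly less conservative alternative that the notebook does not use. The function names and example p-values are hypothetical.

```python
import numpy as np

def bonferroni_flags(p_values, alpha=0.05):
    """Flag p-values below alpha / m, where m is the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / len(p)

def holm_flags(p_values, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value (i = 0, 1, ...) with alpha / (m - i)."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    flags = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (m - rank):
            flags[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return flags

example_p = [0.001, 0.012, 0.02, 0.04, 0.3]  # invented p-values for five tests
print(bonferroni_flags(example_p))
print(holm_flags(example_p))
```

With five tests the Bonferroni threshold is 0.01, matching the `Significant_Bonferroni_0.01` column. Holm's method can retain results that Bonferroni drops while still controlling the family-wise error rate.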

In [32]:
# Create summary DataFrame from results list
summary = pd.DataFrame(results)

# Add significance flags
summary["Significant_0.05"] = summary["p"] < 0.05
summary["Significant_Bonferroni_0.01"] = summary["p"] < (0.05/5)  # Bonferroni correction for 5 tests

# Reorder columns
cols = [
    "hypothesis",
    "question",
    "test",
    "p",
    "effect_value",
    "effect_note",
    "Significant_0.05",
    "Significant_Bonferroni_0.01",
    "notes"
]
summary = summary[cols]

# Export to CSV for Tableau
summary.to_csv("hypothesis_results_summary.csv", index=False)

# Display first few rows
summary.head()

Unnamed: 0,hypothesis,question,test,p,effect_value,effect_note,Significant_0.05,Significant_Bonferroni_0.01,notes
0,H1,Smoking is associated with diabetes.,Chi-square (Smoker x Diabetes_binary),1.495985e-188,0.040294,Negligible,True,True,Smoker 0.0 1.0\nDiabetes_bina...
1,H2,BMI differs between diabetic and non-diabetic ...,Mann–Whitney U (BMI by Diabetes_binary),0.000000e+00,-0.323779,direction & magnitude shown by rank-biserial,True,True,"n_diab=70194, n_nodiab=458118"
2,H3,Physical activity is associated with diabetes.,Chi-square (PhysActivity x Diabetes_binary),0.000000e+00,0.087897,Negligible,True,True,PhysActivity 0.0 1.0\nDiabetes_bina...
3,H4,High blood pressure is associated with diabetes.,Chi-square (HighBP x Diabetes_binary),0.000000e+00,0.221705,Small,True,True,HighBP 0.0 1.0\nDiabetes_bina...
4,H5,General health status is associated with diabe...,Chi-square (GenHlth x Diabetes_binary),0.000000e+00,0.243740,Small,True,True,GenHlth 1.0 2.0 3.0 4.0 ...


## Conclusion

The hypothesis testing stage provided valuable statistical insights into potential relationships between lifestyle factors and diabetes prevalence in the dataset.

Key observations:
- The chi-square tests found statistically significant associations for all four categorical variables, but Cramér's V was negligible for smoking (~0.04) and physical activity (~0.09) and small for high blood pressure (~0.22) and general health (~0.24), so practical relevance is limited despite statistical significance.
- The Mann–Whitney U test showed that BMI differs between diabetic and non-diabetic groups, with a rank-biserial correlation of about -0.32, a moderate effect and the largest in magnitude of the five tests.
- All five results remained significant after the Bonferroni correction (threshold 0.05/5 = 0.01), so the correction, which guards against false positives from multiple testing, did not change any conclusion here.
- All results were compiled into a single structured file, **`hypothesis_results_summary.csv`**, containing p-values (`p`), effect sizes, interpretations, and notes for each hypothesis.
- This file can now be imported into Tableau to create an interactive “Hypothesis Testing Results” view for easier communication to stakeholders.

While statistical significance is important, the small effect sizes suggest that predictive modelling (Machine Learning) may be better suited to capturing complex, multivariate patterns in this dataset.