
# Biostatistical Inference: Hypothesis Testing and Effect Size

This notebook covers statistical inference techniques, including t-tests with Cohen's d, chi-squared tests, one-way ANOVA, and correlation analysis, applied to a diabetes dataset. A download link for the dataset is provided below.
        


## Dataset Information and Download Links

The analysis in this notebook uses a **Diabetes dataset**, which can be downloaded from the following source:

1. **Kaggle:**
   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)
   - This dataset includes clinical patient data for diabetes-related research.

### Dataset Attributes

- **Pregnancies**: Number of pregnancies.
- **Glucose**: Plasma glucose concentration.
- **BloodPressure**: Diastolic blood pressure (mm Hg).
- **SkinThickness**: Triceps skinfold thickness (mm).
- **Insulin**: 2-Hour serum insulin (mu U/ml).
- **BMI**: Body mass index (weight in kg/(height in m)^2).
- **DiabetesPedigreeFunction**: Diabetes pedigree function.
- **Age**: Age of the patient.
- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).

### Usage Notes

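- Ensure the dataset is preprocessed before running the analysis (e.g., handle missing values).
- The code cells below also reference **Gender**, **HDL**, **TG**, and **HbA1c** columns, which are not among the attributes listed above; confirm that your data file contains them, or adapt the column names.
- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for detailed information.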
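
The preprocessing note above can be sketched as follows. This is a minimal illustration on a hypothetical toy DataFrame, not the real file. It assumes that a zero in a clinical measurement such as Glucose, BloodPressure, or BMI encodes a missing value; check the dataset documentation before applying the same rule to your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using column names from the attribute list above
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Assumption: a zero in these measurements stands for "not recorded"
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Count missing values per column, then keep only complete rows
missing_counts = df.isna().sum()
df_complete = df.dropna()

print(missing_counts)
print(f"Complete rows: {len(df_complete)} of {len(df)}")
```

Dropping incomplete rows is the simplest option; imputation (for example with the column median) is an alternative when too many rows would be lost.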
        


## T-Test and Cohen's d: Comparing HDL Between Males and Females

An independent two-sample t-test compares mean HDL levels between males and females, and Cohen's d quantifies the size of the difference.
        

In [None]:

import pandas as pd
import numpy as np
from scipy import stats

# Load the dataset (replace with your file path)
data = pd.read_csv(r'C:\Path\to\diabetes.csv')

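# Select HDL values for males and females, dropping missing values
# (NaN values would otherwise make the t-test return NaN)
HDL_males = data.loc[data['Gender'] == 'M', 'HDL'].dropna()
HDL_females = data.loc[data['Gender'] == 'F', 'HDL'].dropna()

# Perform an independent two-sample t-test (equal variances assumed, the SciPy default)
t_statistic, p_value = stats.ttest_ind(HDL_males, HDL_females)

# Calculate Cohen's d with the sample-size-weighted pooled standard deviation
n_m, n_f = len(HDL_males), len(HDL_females)
pooled_sd = np.sqrt(((n_m - 1) * HDL_males.var() + (n_f - 1) * HDL_females.var()) / (n_m + n_f - 2))
cohens_d = (HDL_males.mean() - HDL_females.mean()) / pooled_sd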

print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
print(f"Cohen's d: {cohens_d}")
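
To make the effect-size calculation easier to check by hand, here is a minimal sketch on hypothetical toy values (not taken from the dataset). The helper name `cohens_d` and the two small groups are illustrative assumptions. The sketch also runs Welch's t-test (`equal_var=False`), a common alternative when the two groups may have unequal variances.

```python
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Cohen's d using the sample-size-weighted pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Hypothetical toy values, not taken from the diabetes dataset
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]

d = cohens_d(group_a, group_b)

# Welch's t-test does not assume equal variances in the two groups
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Cohen's d: {d:.3f}")
print(f"Welch t-statistic: {t_welch:.3f}, p-value: {p_welch:.3f}")
```

By a commonly cited rule of thumb, |d| around 0.2 is considered small, 0.5 medium, and 0.8 large, although these thresholds depend on the field.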
        


## Chi-Squared Test: Diabetes and Elevated Triglycerides

A chi-squared test of independence evaluates the association between diabetes status and elevated triglycerides.
        

In [None]:

from scipy.stats import chi2_contingency

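# Create a new column 'Elevated_TG' (threshold of 1.7 assumes TG is recorded in mmol/L)
# Rows with a missing TG value stay missing instead of being labelled 'No'
data['Elevated_TG'] = (data['TG'] >= 1.7).map({True: 'Yes', False: 'No'}).where(data['TG'].notna())

# Create a contingency table (rows with missing values are excluded by default)
contingency_table = pd.crosstab(data['Outcome'], data['Elevated_TG'])

# Perform chi-squared test (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2, p, dof, expected = chi2_contingency(contingency_table)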

print(f"Chi-squared value: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom: {dof}")
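
Since this notebook pairs each test with an effect size, the following minimal sketch shows one way to do the same for a 2x2 table. The counts are hypothetical and the row and column labels only mirror the names used above. It checks the expected counts and computes the phi coefficient, with `correction=False` so that phi is based on the uncorrected chi-squared statistic.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts, not taken from the diabetes dataset
table = pd.DataFrame(
    [[20, 10], [10, 20]],
    index=["Outcome=0", "Outcome=1"],
    columns=["Elevated_TG=No", "Elevated_TG=Yes"],
)

# correction=False disables the Yates continuity correction for 2x2 tables
chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Rule of thumb: the chi-squared approximation is usually considered
# adequate when every expected count is at least 5
min_expected = expected.min()

# Phi coefficient: effect size for a 2x2 table
n = table.to_numpy().sum()
phi = np.sqrt(chi2 / n)

print(f"Chi-squared: {chi2:.3f}, p-value: {p:.4f}, dof: {dof}")
print(f"Smallest expected count: {min_expected:.1f}")
print(f"Phi: {phi:.3f}")
```

For tables larger than 2x2, Cramér's V generalizes phi by dividing by `n * (min(rows, columns) - 1)` instead of `n` before taking the square root.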
        


## ANOVA: HbA1c by BMI Categories

One-way ANOVA is used to assess differences in mean HbA1c levels across BMI categories.
        

In [None]:

# Define BMI categories with left-closed bins: [0, 25), [25, 30), [30, inf)
# 'Normal' here includes underweight values; missing BMI values stay NaN
# (the if/else version would have labelled them 'Obese')
data['BMI_Category'] = pd.cut(
    data['BMI'],
    bins=[0, 25, 30, np.inf],
    labels=['Normal', 'Overweight', 'Obese'],
    right=False
)

# Collect HbA1c values per category, dropping missing values
groups = [
    data.loc[data['BMI_Category'] == category, 'HbA1c'].dropna()
    for category in ['Normal', 'Overweight', 'Obese']
]

# Perform one-way ANOVA
fvalue, pvalue = stats.f_oneway(*groups)

print(f"F-value: {fvalue}")
print(f"P-value: {pvalue}")
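
An effect size can accompany the F-test as well. The sketch below computes eta-squared (between-group sum of squares divided by total sum of squares) on three hypothetical toy groups, not on the dataset, so the arithmetic is easy to verify by hand.

```python
import numpy as np
from scipy import stats

# Hypothetical toy groups, not taken from the diabetes dataset
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([3.0, 4.0, 5.0]),
]

f_value, p_value = stats.f_oneway(*groups)

# Eta-squared = between-group sum of squares / total sum of squares
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F-value: {f_value:.3f}, p-value: {p_value:.3f}")
print(f"Eta-squared: {eta_squared:.3f}")
```

A significant F-test only indicates that at least one group mean differs; a post-hoc comparison would be needed to identify which groups differ.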
        


## Correlation Analysis: Exploring Relationships Between Variables

A correlation matrix is created to explore relationships between numerical variables in the dataset.
        

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

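# Compute the Pearson correlation matrix over numeric columns only
# (string columns such as 'Gender' raise an error in pandas 2.0 and later)
corr_matrix = data.corr(numeric_only=True)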

# Plot the heatmap
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
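
As a small illustration of restricting the correlation to numeric columns, the sketch below uses a hypothetical toy DataFrame with one text column. `select_dtypes(include="number")` is one way to drop non-numeric columns before calling `corr`, and `method="spearman"` gives a rank-based alternative to the default Pearson coefficient.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "label": ["a", "b", "a", "b"],
})

# Keep only numeric columns before computing correlations
numeric = df.select_dtypes(include="number")

pearson = numeric.corr()                     # linear association (default)
spearman = numeric.corr(method="spearman")   # rank-based association

print(pearson)
print(spearman)
```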
        