
# Clinical Study Biostatistics: T-tests, ANOVA, and Regression Analysis

This notebook demonstrates hypothesis testing (t-tests), ANOVA, and regression analysis on a publicly available diabetes dataset. A link to the dataset is provided for reproducibility.
        


## Dataset Information and Download Links

The examples in this notebook use a diabetes dataset, which you can download from the following source:

1. **Kaggle:**
   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)
   - This dataset provides detailed patient data for diabetes analysis.

### Dataset Attributes

- **Pregnancies**: Number of pregnancies.
- **Glucose**: Plasma glucose concentration.
- **BloodPressure**: Diastolic blood pressure (mm Hg).
- **SkinThickness**: Triceps skinfold thickness (mm).
- **Insulin**: 2-Hour serum insulin (mu U/ml).
- **BMI**: Body mass index (weight in kg/(height in m)^2).
- **DiabetesPedigreeFunction**: Diabetes pedigree function (a score reflecting family history of diabetes).
- **Age**: Age of the patient (years).
- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).

### Usage Notes

- Preprocess the dataset before running the analyses: handle missing values, and normalize features where a method requires it.
- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for more details.
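
As a minimal sketch of the missing-value step: in datasets with these columns, a value of 0 in measurements such as glucose or BMI is physiologically implausible and often stands for an unrecorded value. That is an assumption to verify against the dataset documentation. The helper name `mark_zeros_as_missing`, the column list, and the small demo frame are illustrative only and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Assumption: zeros in these columns encode missing measurements.
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

def mark_zeros_as_missing(df, columns=ZERO_AS_MISSING):
    """Return a copy of df with zeros in the given columns replaced by NaN."""
    cleaned = df.copy()
    # Only touch columns that actually exist in the frame
    present = [c for c in columns if c in cleaned.columns]
    cleaned[present] = cleaned[present].replace(0, np.nan)
    return cleaned

# Tiny made-up frame for illustration only (not real patient data)
demo = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BMI': [33.6, 26.6, 0.0],
    'Outcome': [1, 0, 1],
})
demo_clean = mark_zeros_as_missing(demo)

# Count missing values per column after the replacement
print(demo_clean.isna().sum())
```

With the real data, the same call would be `data = mark_zeros_as_missing(data)`, followed by whichever strategy suits the analysis (dropping rows or imputing). `Outcome` and `Pregnancies` are left alone because 0 is a legitimate value there.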
        


## Two-Sample T-Test: Comparing HbA1c Between Diabetic and Non-Diabetic Groups

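A two-sample t-test compares the mean of a measurement between two independent groups, here diabetic and non-diabetic patients. The dataset has no HbA1c column, so the code below uses `Glucose` as a proxy.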
        

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind

# Load the dataset (replace with your file path)
data = pd.read_csv(r'C:\Path\to\diabetes.csv')

# Separate diabetic and non-diabetic groups
diabetic = data[data['Outcome'] == 1]
non_diabetic = data[data['Outcome'] == 0]

# Student's two-sample t-test on 'Glucose', used here as a proxy for HbA1c
# (the dataset has no HbA1c column). By default ttest_ind assumes equal variances.
t_statistic, p_value = ttest_ind(diabetic['Glucose'], non_diabetic['Glucose'])
print(f"t-statistic: {t_statistic:.3f}")
print(f"p-value: {p_value:.3g}")

# Plot mean glucose per group; error bars show +/- 1 standard deviation
means = [diabetic['Glucose'].mean(), non_diabetic['Glucose'].mean()]
stds = [diabetic['Glucose'].std(), non_diabetic['Glucose'].std()]
labels = ['Diabetic', 'Non-Diabetic']
plt.bar(labels, means, yerr=stds, capsize=5)
plt.ylabel('Mean Glucose Level')
plt.title('Mean Glucose by Diabetes Status (error bars: ±1 SD)')
plt.show()
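
The test above uses SciPy's default, which assumes both groups share the same variance. When group sizes and spreads differ, Welch's variant (`equal_var=False`) is a common alternative, and an effect size such as Cohen's d helps judge how large the difference is, not just whether it is significant. The sketch below runs on synthetic numbers generated for illustration. The group means, spreads, and sizes are made up, and `cohens_d` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic glucose-like values (illustration only, not real patient data)
rng = np.random.default_rng(0)
group_a = rng.normal(140, 30, size=100)
group_b = rng.normal(110, 25, size=200)

# Welch's t-test does not assume equal variances
welch_t, welch_p = ttest_ind(group_a, group_b, equal_var=False)
d = cohens_d(group_a, group_b)

print(f"Welch t-statistic: {welch_t:.3f}")
print(f"Welch p-value: {welch_p:.3g}")
print(f"Cohen's d: {d:.3f}")
```

On the notebook's data, the equivalent call would be `ttest_ind(diabetic['Glucose'], non_diabetic['Glucose'], equal_var=False)`.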
        


## ANOVA: HbA1c by BMI Categories

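One-way ANOVA tests whether the mean of a measurement differs across three or more groups, here BMI categories. As in the t-test, `Glucose` stands in for HbA1c, which the dataset does not include.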
        

In [None]:

from scipy.stats import f_oneway

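# Assign each patient to a BMI category.
# Note: every BMI below 25 is labelled 'Normal', including underweight values
# (below 18.5) and any zeros that stand in for missing measurements.
def bmi_category(bmi):
    if bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

data['BMI_Category'] = data['BMI'].apply(bmi_category)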

# Perform ANOVA
fvalue, pvalue = f_oneway(
    data[data['BMI_Category'] == 'Normal']['Glucose'],
    data[data['BMI_Category'] == 'Overweight']['Glucose'],
    data[data['BMI_Category'] == 'Obese']['Glucose']
)

print(f"F-value: {fvalue}")
print(f"P-value: {pvalue}")
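
To make the F-value less of a black box, the sketch below computes it by hand as the ratio of between-group to within-group mean squares and checks the result against `f_oneway`. The three small groups are made-up numbers for illustration, and `f_statistic` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import f_oneway

def f_statistic(*groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up glucose-like values per BMI category (illustration only)
normal = [95, 102, 110, 99]
overweight = [112, 120, 108, 125]
obese = [130, 128, 141, 135]

manual_f = f_statistic(normal, overweight, obese)
scipy_f, scipy_p = f_oneway(normal, overweight, obese)

print(f"Manual F: {manual_f:.4f}")
print(f"SciPy F:  {scipy_f:.4f}")
```

A large F means the group means differ by more than the within-group scatter would suggest. It does not say which groups differ.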
        


## Logistic Regression: Predicting Diabetes Using Clinical Variables

Logistic regression models the probability of diabetes (`Outcome` = 1) as a function of glucose, BMI, and age.
        

In [None]:

import statsmodels.api as sm

# Prepare data for logistic regression.
# statsmodels does not add an intercept automatically, so include a constant column.
data['Intercept'] = 1
predictors = ['Glucose', 'BMI', 'Age', 'Intercept']
logit_model = sm.Logit(data['Outcome'], data[predictors]).fit()

# Print logistic regression summary
print(logit_model.summary())
        