
# Biostatistical Hypothesis Testing

This notebook demonstrates several hypothesis testing techniques on a diabetes dataset: a two-sample t-test, a chi-squared test of independence, and linear and logistic regression. A download link for the dataset is provided below.
        


## Dataset Information and Download Links

The examples in this notebook use a **Diabetes dataset**, which can be downloaded from the following source:

1. **Kaggle:**
   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)
   - This dataset includes clinical data for diabetes prediction and analysis.

### Dataset Attributes

- **Pregnancies**: Number of pregnancies.
- **Glucose**: Plasma glucose concentration (2 hours into an oral glucose tolerance test).
- **BloodPressure**: Diastolic blood pressure (mm Hg).
- **SkinThickness**: Triceps skinfold thickness (mm).
- **Insulin**: 2-Hour serum insulin (mu U/ml).
- **BMI**: Body mass index (weight in kg/(height in m)^2).
- **DiabetesPedigreeFunction**: Diabetes pedigree function.
- **Age**: Age of the patient (years).
- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).

### Usage Notes

- Preprocess the dataset as needed (e.g., handle missing values, which in this dataset may appear as zeros in columns such as `Glucose`, `Insulin`, or `BMI`).
- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for more information.
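
One preprocessing step often applied to this kind of data is to treat zeros in physiological measurements as missing, since a glucose level or BMI of 0 is not plausible. The sketch below assumes that convention, which should be checked against the dataset documentation before use. It runs on a small made-up DataFrame rather than the real file, and `ZERO_AS_MISSING` and `mark_zeros_as_missing` are illustrative names.

```python
import pandas as pd

# Columns where a value of 0 is ASSUMED to mean "not recorded"
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def mark_zeros_as_missing(df, columns=ZERO_AS_MISSING):
    """Return a copy of df with zeros in the given columns replaced by NaN."""
    cleaned = df.copy()
    for col in columns:
        if col in cleaned.columns:
            # Series.mask replaces values with NaN wherever the condition is True
            cleaned[col] = cleaned[col].mask(cleaned[col] == 0)
    return cleaned

# Tiny made-up sample, not real patient data
sample = pd.DataFrame({
    "Pregnancies": [0, 2, 1],
    "Glucose": [148, 0, 95],
    "BMI": [33.6, 26.6, 0.0],
    "Outcome": [1, 0, 0],
})

cleaned = mark_zeros_as_missing(sample)
print(cleaned.isna().sum())
```

Zeros in `Pregnancies` and `Outcome` are left alone because they are legitimate values. Most of the tests in this notebook would then need rows with missing values to be dropped or imputed for the columns involved.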
        


## T-Test: Comparing Glucose Levels Between Diabetic and Non-Diabetic Groups

A two-sample (independent) t-test compares the mean glucose levels of diabetic and non-diabetic patients.
        

In [None]:

import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset (replace with your file path)
data = pd.read_csv(r'C:\Path\to\diabetes.csv')

# Separate diabetic and non-diabetic groups
diabetic = data[data['Outcome'] == 1]
non_diabetic = data[data['Outcome'] == 0]

# Perform a two-sample t-test on glucose levels
# (ttest_ind assumes equal group variances by default)
t_statistic, p_value = ttest_ind(diabetic['Glucose'], non_diabetic['Glucose'])

print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
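
By default `ttest_ind` runs Student's t-test, which assumes both groups have the same variance. When that assumption is doubtful, Welch's variant can be requested with `equal_var=False`. The sketch below compares the two on made-up numbers (the arrays are illustrative, not taken from the dataset) and recomputes the Welch statistic by hand as a sanity check.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up glucose-like values for two groups of unequal size
group_a = np.array([150.0, 142.0, 165.0, 138.0, 171.0, 155.0])
group_b = np.array([101.0, 110.0, 98.0, 120.0, 105.0, 99.0, 112.0, 108.0])

student = ttest_ind(group_a, group_b)                 # pooled variance
welch = ttest_ind(group_a, group_b, equal_var=False)  # separate variances

# Welch statistic by hand: difference in means over the unpooled standard error
standard_error = np.sqrt(
    group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b)
)
welch_by_hand = (group_a.mean() - group_b.mean()) / standard_error

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.3g}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3g}")
print(f"Welch t by hand: {welch_by_hand:.3f}")
```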
        


## Chi-Squared Test: Diabetes and Elevated BMI

A chi-squared test of independence evaluates the association between diabetes status and elevated BMI (defined here as BMI ≥ 25).
        

In [None]:

from scipy.stats import chi2_contingency

# Flag elevated BMI (BMI >= 25) with a vectorized comparison
data['Elevated_BMI'] = data['BMI'].ge(25).map({True: 'Yes', False: 'No'})

# Create a contingency table
contingency_table = pd.crosstab(data['Outcome'], data['Elevated_BMI'])

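# Perform chi-squared test of independence
# (for a 2x2 table, chi2_contingency applies Yates' continuity correction by default)
chi2, p, dof, expected = chi2_contingency(contingency_table)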

print(f"Chi-squared value: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom: {dof}")
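
The chi-squared statistic compares the observed counts with the counts expected if the two variables were independent. The sketch below works this out by hand for a made-up 2x2 table (the counts are illustrative, not from the dataset) and checks the result against `chi2_contingency`. The continuity correction is switched off with `correction=False` so that the two calculations match.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x2 table: rows = outcome (0, 1), columns = elevated BMI (No, Yes)
observed = np.array([[30, 70],
                     [10, 90]])

# Expected count for each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(
    observed, correction=False
)

print(f"By hand: {chi2_by_hand:.3f}")
print(f"SciPy:   {chi2_scipy:.3f} (dof = {dof_scipy}, p = {p_scipy:.3g})")
```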
        


## Linear Regression: Glucose vs Age

A simple linear regression models glucose level as a linear function of age.
        

In [None]:

import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

# Perform linear regression
linreg = smf.ols(formula='Glucose ~ Age', data=data).fit()

# Print regression summary
print(linreg.summary())

# Plot the data with the fitted regression line and its 95% confidence band
sns.regplot(x='Age', y='Glucose', data=data, ci=95)
plt.title("Linear Regression: Glucose vs Age")
plt.show()
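
With a single predictor, the ordinary least squares estimates have a closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept follows from the two means. The sketch below illustrates this with NumPy on made-up values (not taken from the dataset) and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# Made-up ages and glucose values, for illustration only
age = np.array([21, 25, 30, 35, 40, 45, 50, 55], dtype=float)
glucose = np.array([95, 99, 104, 108, 115, 117, 124, 128], dtype=float)

# Closed-form OLS estimates for one predictor
slope = np.cov(age, glucose, ddof=1)[0, 1] / np.var(age, ddof=1)
intercept = glucose.mean() - slope * age.mean()

# Cross-check with a degree-1 polynomial fit
polyfit_slope, polyfit_intercept = np.polyfit(age, glucose, deg=1)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"polyfit slope = {polyfit_slope:.4f}, intercept = {polyfit_intercept:.4f}")
```

The slope is read as the expected change in glucose per additional year of age, which is how the `Age` coefficient in the regression summary is interpreted.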
        


## Logistic Regression: Predicting Diabetes Using Glucose and BMI

A logistic regression model is fitted to predict diabetes status from glucose level and BMI.
        

In [None]:

import statsmodels.api as sm

# Add an explicit intercept column (sm.Logit does not add one automatically)
data['Intercept'] = 1
# Fit the model: Outcome ~ Glucose + BMI + Intercept
logit_model = sm.Logit(data['Outcome'], data[['Glucose', 'BMI', 'Intercept']]).fit()

# Print logistic regression summary
print(logit_model.summary())
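
The coefficients in a logistic regression summary are on the log-odds scale. Exponentiating a coefficient gives an odds ratio, and the inverse logit turns the linear predictor into a probability. The sketch below uses hypothetical coefficients chosen only for illustration; they are not fitted results from this dataset, and `predicted_probability` is an illustrative helper name.

```python
import math

# HYPOTHETICAL coefficients for illustration only (not fitted values)
coef = {"Intercept": -7.0, "Glucose": 0.04, "BMI": 0.08}

def predicted_probability(glucose, bmi, coef=coef):
    """Apply the inverse logit to the linear predictor."""
    linear = coef["Intercept"] + coef["Glucose"] * glucose + coef["BMI"] * bmi
    return 1.0 / (1.0 + math.exp(-linear))

# Odds ratio for a one-unit increase in glucose, holding BMI fixed
odds_ratio_glucose = math.exp(coef["Glucose"])

p_low = predicted_probability(glucose=100, bmi=30)
p_high = predicted_probability(glucose=160, bmi=30)

print(f"Odds ratio per unit of glucose: {odds_ratio_glucose:.3f}")
print(f"P(diabetes | glucose=100, BMI=30): {p_low:.3f}")
print(f"P(diabetes | glucose=160, BMI=30): {p_high:.3f}")
```

With a fitted statsmodels result, the same odds ratios can be obtained by exponentiating the values in `logit_model.params`.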
        