
# Predictive Biostatistics: Regression and Logistic Models

This notebook fits linear and logistic regression models to clinical variables from a diabetes dataset. A download link for the dataset is provided below.
        


## Dataset Information and Download Links

The analysis in this notebook uses a **Diabetes dataset**, which can be downloaded from the following source:

1. **Kaggle:**
   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)
   - This dataset includes detailed patient data for diabetes prediction and analysis.

### Dataset Attributes

- **Pregnancies**: Number of pregnancies.
- **Glucose**: Plasma glucose concentration (2 hours into an oral glucose tolerance test).
- **BloodPressure**: Diastolic blood pressure (mm Hg).
- **SkinThickness**: Triceps skinfold thickness (mm).
- **Insulin**: 2-Hour serum insulin (mu U/ml).
- **BMI**: Body mass index (weight in kg/(height in m)^2).
- **DiabetesPedigreeFunction**: Diabetes pedigree function.
- **Age**: Age of the patient.
- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).

### Usage Notes

- Preprocess the dataset as needed (e.g., handle missing values).
- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for detailed information.
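
In this dataset, zeros in measurement columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI are physiologically implausible and are commonly treated as missing values. The following is a minimal sketch of recoding them as `NaN`, assuming that convention applies to your copy of the data. The helper name `mark_zeros_as_missing`, the column list, and the small example frame are hypothetical and for illustration only.

```python
import numpy as np
import pandas as pd

# Columns where a value of 0 is assumed to mean "not recorded" (an assumption,
# not something stated in this notebook).
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

def mark_zeros_as_missing(df, columns=ZERO_AS_MISSING):
    """Return a copy of df with zeros in the given columns replaced by NaN."""
    cleaned = df.copy()
    # Only touch columns that actually exist in the frame.
    present = [c for c in columns if c in cleaned.columns]
    cleaned[present] = cleaned[present].replace(0, np.nan)
    return cleaned

# Tiny made-up frame for illustration only (not real patient data).
example = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BMI': [33.6, 26.6, 0.0],
    'Age': [50, 31, 32],
})
cleaned = mark_zeros_as_missing(example)
print(cleaned.isna().sum())  # count of missing values per column
```

The formula interface (`smf.ols`) drops rows with missing values by default, whereas `sm.Logit` called on plain columns does not, so either call `dropna` on the relevant columns first or pass `missing='drop'` to the model.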
        


## Univariate Linear Regression: Glucose ~ BMI

A simple linear regression model predicts plasma glucose from BMI. The dataset has no HbA1c column, so glucose serves as a proxy for HbA1c here.
        

In [None]:

import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset (replace with your file path)
data = pd.read_csv(r'C:\Path\to\diabetes.csv')

# Univariate Linear Regression
model1 = smf.ols(formula='Glucose ~ BMI', data=data).fit()
print(model1.summary())

# Scatter plot with regression line
sns.regplot(x='BMI', y='Glucose', data=data, ci=95)
plt.title("Linear Regression: Glucose vs BMI")
plt.show()
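
The full summary table can be dense, so it may help to pull out the slope, its confidence interval, and R-squared directly. The sketch below does this on synthetic data so that it runs without the CSV file. The frame `demo` and the coefficients used to generate it are made up for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real data (values are invented for illustration).
rng = np.random.default_rng(0)
bmi = rng.uniform(18, 45, size=200)
glucose = 90 + 1.0 * bmi + rng.normal(0, 15, size=200)
demo = pd.DataFrame({'BMI': bmi, 'Glucose': glucose})

fit = smf.ols('Glucose ~ BMI', data=demo).fit()

# Slope: estimated change in Glucose per one-unit increase in BMI.
slope = fit.params['BMI']
# 95% confidence interval for the slope (the default alpha is 0.05).
ci_low, ci_high = fit.conf_int().loc['BMI']

print(f"slope = {slope:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"R-squared = {fit.rsquared:.3f}")
```

With the real data, the same attributes are available on `model1`, for example `model1.params['BMI']`.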
        


## Logistic Regression: Predicting Diabetes Using Glucose Levels

A logistic regression model is created to predict diabetes status based on glucose levels.
        

In [None]:

import statsmodels.api as sm

# Logistic Regression
data['Intercept'] = 1
logit_model = sm.Logit(data['Outcome'], data[['Glucose', 'Intercept']]).fit()

# Print logistic regression summary
print(logit_model.summary())

# Plot probabilities vs glucose levels
fitted_probabilities = logit_model.predict(data[['Glucose', 'Intercept']])
plt.figure(figsize=(10, 6))
sns.scatterplot(x=data['Glucose'], y=fitted_probabilities)
plt.title("Logistic Regression: Predicted Probabilities vs Glucose Levels")
plt.xlabel("Glucose Level")
plt.ylabel("Predicted Probability")
plt.show()
        


## Multiple Linear Regression: Glucose ~ BMI + Age + Diabetes Pedigree

A multiple linear regression model predicts plasma glucose (used as a proxy for HbA1c, which the dataset does not include) from BMI, age, and the diabetes pedigree function.
        

In [None]:

# Multiple linear regression (several predictors, one outcome)
model2 = smf.ols(formula='Glucose ~ BMI + Age + DiabetesPedigreeFunction', data=data).fit()
print(model2.summary())
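
In a model with several predictors, each coefficient is the expected change in the outcome for a one-unit change in that predictor with the others held fixed. One way to see this is to predict for two hypothetical patients who differ only in BMI. The sketch below uses synthetic data. The frame `demo`, the generating coefficients, and the `new_patients` values are all invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real data (values are invented for illustration).
rng = np.random.default_rng(2)
n = 300
demo = pd.DataFrame({
    'BMI': rng.uniform(18, 45, n),
    'Age': rng.integers(21, 80, n),
    'DiabetesPedigreeFunction': rng.uniform(0.1, 2.0, n),
})
demo['Glucose'] = (70 + 0.8 * demo['BMI'] + 0.5 * demo['Age']
                   + 5.0 * demo['DiabetesPedigreeFunction']
                   + rng.normal(0, 15, n))

fit = smf.ols('Glucose ~ BMI + Age + DiabetesPedigreeFunction', data=demo).fit()

# Two hypothetical patients who differ only in BMI (by 10 units).
new_patients = pd.DataFrame({
    'BMI': [25.0, 35.0],
    'Age': [40, 40],
    'DiabetesPedigreeFunction': [0.5, 0.5],
})
predicted = fit.predict(new_patients)
print(predicted)

# The gap between the two predictions equals 10 times the BMI coefficient.
diff = predicted.iloc[1] - predicted.iloc[0]
print(f"difference = {diff:.2f}, 10 * BMI coefficient = {10 * fit.params['BMI']:.2f}")
```

The same call works on the real model: `model2.predict(new_patients)`.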
        


## Multiple Logistic Regression: Diabetes Prediction Using Multiple Variables

A multiple logistic regression model predicts diabetes status from glucose, BMI, age, and blood pressure.
        

In [None]:

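# Multiple logistic regression
# Note: the 'Intercept' column was added to data in the single-predictor logistic regression cell above
predictors = ['Glucose', 'BMI', 'Age', 'BloodPressure', 'Intercept']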
logit_model_multivariate = sm.Logit(data['Outcome'], data[predictors]).fit()

# Print logistic regression summary
print(logit_model_multivariate.summary())
        