# Submodule 2 Project SOLUTIONS: Predictors of Diabetes   
------------------------------------------------------------

## Overview

In this project, students will analyze a dataset containing information about various factors related to diabetes. They will use Python libraries such as NumPy, Pandas, and Matplotlib/Seaborn for data wrangling and visualization, and apply statistical inference techniques to explore relationships between variables and diabetes. The goal is to identify potential causes of diabetes through single and multiple regression analysis.

## Learning Objectives
In this project, you will practice the skills obtained in this module to understand the basis for diabetes. These skills include:
1. Working with Pandas DataFrames and NumPy
2. Visualizing the datasets
3. Performing regression and statistical analysis to assess the relationships between biological characteristics and diabetes.

## Prerequisites

You should have completed all the tutorials in Module 2 and developed some level of comfort with the tools.

## Getting Started
The 5 tasks are described below, each followed by its solution code and a short analysis.

# Investigating Possible Causes of Diabetes Using Data Analysis and Statistical Inference 

Many clinical measurements have been associated with the risk of developing diabetes. You will use your newly developed skills to evaluate how well each measurement in this dataset predicts diabetes. The final column ("Diabetes") is already dummy coded, with 1 indicating diabetes and 0 indicating its absence. Because the outcome is coded 0/1, a simple linear regression on a single factor gives a rough measure of how much of the disease risk is associated with that factor.

Of course, more than one factor may contribute, and the factors may be related to one another.

In your bioinformatics or biological/clinical work, you can expect to do similar kinds of analysis. Compared with running these analyses in Excel or a commercial statistical package, Python lets you analyze datasets of any size rapidly and reproducibly, both numerically and visually. In the cloud, you could analyze datasets on the scale of those created by the Million Veteran Program or UK Biobank.


These are the columns of the diabetes data table:
1. **Pregnancies** This column represents the number of pregnancies the individual has had
2. **Glucose** This column represents the glucose (blood sugar) level measured in the individual, often in mg/dL
3. **Diastolic** This represents the diastolic blood pressure of the individual, measured in mmHg (millimeters of mercury)
4. **Triceps** This might refer to the thickness of the skinfold at the triceps (back of the upper arm), often used as a measure of body fat percentage
5. **Insulin** This column represents insulin levels measured in the individual, often in µU/mL (micro-units per milliliter).
6. **BMI** BMI stands for Body Mass Index, a calculated value derived from an individual's weight and height (weight in kilograms divided by the square of height in meters)
7. **DPF** DPF refers to a Diabetes Pedigree Function, a measure estimating diabetes heredity based on family history
8. **Age** This column represents the age of the individual
9. **Diabetes** This column indicates the presence or absence of diabetes, coded as 1 (presence) or 0 (absence)

## Task 1: Data Exploration and Cleaning

**Objective** Get familiar with the dataset and ensure it is ready for analysis.

1. Load the dataset (diabetes.csv) using pandas and display the first few rows.
2. Summarize the dataset by using df.describe() to check for mean, median, and standard deviation.
3. Identify any missing or invalid values (e.g., 0 values in BMI, Glucose, etc., where they may not make sense).
4. Handle missing or invalid values by replacing invalid zeros in relevant columns (e.g., BMI, Glucose, Diastolic) with the column's mean or median.
5. Create histograms and boxplots for each numeric column to visualize the distribution and detect potential outliers.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Load dataset
df = pd.read_csv("." + os.sep + "Datasets" + os.sep + "diabetes.csv")

# Display first few rows
print(df.head())

# Summary statistics
print(df.describe())

# Check for missing or invalid values (e.g., zeros in specific columns)
invalid_columns = ['glucose', 'diastolic', 'bmi', 'insulin']
for col in invalid_columns:
    print(f"{col} has {sum(df[col] == 0)} invalid entries")

# Replace invalid zeros in each listed column with that column's mean
# (note: this mean is computed with the zeros still included)
for col in invalid_columns:
    df[col] = df[col].replace(0, df[col].mean())

# Histograms for numeric columns
df.hist(bins=20, figsize=(15, 10))
plt.tight_layout()
plt.show()

# Boxplots for detecting outliers
for col in df.columns[:-1]:  # Exclude 'Diabetes'
    sns.boxplot(data=df, x='diabetes', y=col)
    plt.title(f"Boxplot of {col} by Diabetes")
    plt.show()


### Analysis
There are no obvious outliers in either the histograms or the data summary.

HOWEVER, almost half of the insulin entries were recorded as zero. Since our code replaces every 0 with the column average, we should be very cautious about drawing ANY conclusions about insulin effects.
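
The replacement step above uses a column mean that still includes the invalid zeros, which pulls the fill value downward. One possible alternative, sketched here on a small made-up table (the values and the helper name `impute_zeros` are hypothetical, not part of the project dataset), is to treat zeros as missing first and then fill with the median or mean of the valid entries only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from diabetes.csv)
toy = pd.DataFrame({
    "glucose": [0, 100, 120, 140],
    "insulin": [0, 0, 80, 120],
})

def impute_zeros(frame, columns, strategy="median"):
    """Return a copy in which zeros in `columns` are replaced by the
    median (or mean) of the non-zero entries of the same column."""
    out = frame.copy()
    for col in columns:
        valid = out[col].replace(0, np.nan)  # zeros become missing values
        fill = valid.median() if strategy == "median" else valid.mean()
        out[col] = valid.fillna(fill)
    return out

cleaned = impute_zeros(toy, ["glucose", "insulin"])
print(cleaned)
```

Because the helper works on a copy, the original table is left untouched and the two versions can be compared side by side. Imputation of this kind does not recover the missing information, so the caution about insulin still applies.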

## Task 2: Visualizing Relationships Between Variables
**Objective** Explore correlations between independent variables and diabetes.
1. Create pairplots or scatterplots (e.g., using seaborn.pairplot) for all numeric columns against the Diabetes column.
2. Calculate the correlation matrix using np.corrcoef or DataFrame.corr() and visualize it with a heatmap (e.g., using seaborn.heatmap).
3. Identify the top 3 variables that have the strongest correlations with the Diabetes column.

In [None]:
# Pairplot to visualize relationships
sns.pairplot(df, hue='diabetes', vars=['glucose', 'bmi', 'insulin', 'age'])
plt.show()

# Correlation matrix
corr_matrix = df.corr()
print(corr_matrix)

# Heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()


### Analysis
The correlations between diabetes and the set of predictors were positive. The largest correlation coefficients were observed for glucose (0.49), bmi (0.31), and age (0.24), so we will pursue these in the regression analysis.
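
Rather than reading the top correlations off the heatmap, the ranking can also be computed. The sketch below uses a small made-up table and a hypothetical helper, `rank_predictors`; with the project data you would pass `df` and `'diabetes'` instead.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from diabetes.csv)
toy = pd.DataFrame({
    "glucose":  [90, 100, 150, 160, 95, 170],
    "bmi":      [22, 30, 28, 35, 31, 27],
    "age":      [25, 40, 30, 50, 45, 35],
    "diabetes": [0, 0, 1, 1, 0, 1],
})

def rank_predictors(frame, target, top_n=3):
    """Return the top_n predictors ordered by absolute correlation with target."""
    corr = frame.corr()[target].drop(target)  # correlation of each column with the target
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(top_n)        # keep the sign of each correlation

ranked = rank_predictors(toy, "diabetes")
print(ranked)
```

Sorting by absolute value means a strong negative correlation would rank as highly as a strong positive one, while the returned values keep their signs.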

## Task 3: Single Linear Regression Analysis

**Objective** Assess the relationship between individual predictors and diabetes.
1. Perform single linear regression using statsmodels or sklearn to predict Diabetes (as a continuous variable) from each of Glucose, BMI, and Age, the three predictors identified in Task 2.
2. Interpret the results, such as
   - Coefficients and their significance (p-values).
   - R-squared values (goodness of fit) for each predictor.
3. Determine which single variable is the most important in explaining diabetes risk.

In [None]:
import statsmodels.formula.api as smf

# Define independent variables for single regression
single_vars = ['glucose', 'bmi', 'age']
Y = df['diabetes']  # Dependent variable (binary)

# Perform single regressions. 
modelG = smf.ols("diabetes ~ glucose", data=df).fit()
print("Regression with glucose:")
print(modelG.summary())

modelB = smf.ols("diabetes ~ bmi", data=df).fit()
print("Regression with BMI:")
print(modelB.summary())

modelA = smf.ols("diabetes ~ age", data=df).fit()
print("Regression with age:")
print(modelA.summary())


# Each slope is the second fitted parameter (the first is the intercept)
print(f'The regression coefficients: for glucose {modelG.params.iloc[1]:.5f}, for BMI {modelB.params.iloc[1]:.5f}, and for age {modelA.params.iloc[1]:.5f}')

### Analysis
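Although glucose, BMI, and age were the predictors most correlated with diabetes, their regression coefficients are very small. This is partly a matter of units: each coefficient is the change in the 0/1 diabetes outcome per one-unit change in the predictor (for example, per 1 mg/dL of glucose), so predictors measured on a wide numeric scale naturally have small coefficients.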
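
To see why a small coefficient does not mean a weak relationship, the following sketch fits a line to made-up values with `np.polyfit`. It shows that the slope of a simple regression equals the correlation coefficient times the ratio of standard deviations, and that standardizing the predictor rescales the slope. The numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Toy values (hypothetical, not taken from diabetes.csv)
glucose = np.array([90, 100, 150, 160, 95, 170], dtype=float)
diabetes = np.array([0, 0, 1, 1, 0, 1], dtype=float)

# Slope in the original units: change in outcome per 1 unit of glucose
slope, intercept = np.polyfit(glucose, diabetes, 1)

# The same slope recovered from the correlation coefficient
r = np.corrcoef(glucose, diabetes)[0, 1]
slope_from_r = r * diabetes.std(ddof=1) / glucose.std(ddof=1)

# Slope per one standard deviation of glucose (standardized predictor)
z = (glucose - glucose.mean()) / glucose.std(ddof=1)
slope_std, _ = np.polyfit(z, diabetes, 1)

print(f"slope per unit: {slope:.4f}, from r: {slope_from_r:.4f}, per SD: {slope_std:.4f}")
```

The per-unit slope is tiny even though the toy correlation is very high, which is why coefficients from predictors with different units should not be compared directly.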

## Task 4: Multiple Linear Regression

**Objective** Build a comprehensive model to predict diabetes.
1. Perform multiple linear regression using all independent variables (Pregnancies, Glucose, Diastolic, Triceps, Insulin, BMI, DPF, and Age)
2. Evaluate the model, finding rsquared, rsquared_adj, coefficients
3. Check coefficients and their significance.
4. Perform a backward elimination process by removing variables with high p-values (> 0.05) and re-running the regression to refine the model
5. Compare the models (final vs. single linear regression) using AIC to find the **best** predictors of diabetes from this dataset

In [None]:
import statsmodels.api as sm
# Define independent variables for multiple regression
X = df[['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age']]
X = sm.add_constant(X)  # Add intercept
Y = df['diabetes']

# Fit the model
model = sm.OLS(Y, X).fit()

# Model summary
print("Multiple Regression Model Summary")
print(model.summary())


In [None]:
# Backward elimination: iteratively remove the variable with the highest p-value (> 0.05).
# This automates what you would otherwise do by hand, one step at a time.
while True:
    p_values = model.pvalues.drop('const')  # never eliminate the intercept
    max_p = p_values.max()
    if max_p > 0.05:
        excluded_var = p_values.idxmax()
        X = X.drop(columns=[excluded_var])
        model = sm.OLS(Y, X).fit()  # refit without the excluded variable
    else:
        break

print("Refined Model After Backward Elimination")
print(model.summary())


### Analysis
The model with only pregnancies, glucose, BMI, and dpf is BETTER than the model with all the factors: its AIC is lower and its R-squared is almost identical.
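
statsmodels reports the AIC of a fitted model as `model.aic`. To show what that number measures, here is a minimal NumPy sketch that computes an AIC from the residual sum of squares, assuming the usual Gaussian log-likelihood and counting only the regression coefficients as parameters (this is believed to match the statsmodels convention, but treat that as an assumption). The data are simulated and the helper name `ols_aic` is hypothetical.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of a least-squares fit; X must already contain an intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)  # residual sum of squares
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return 2 * k - 2 * log_lik         # penalty for parameters minus fit quality

# Simulated data: one informative predictor and one pure-noise predictor
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = 2.0 * signal + rng.normal(size=n)

ones = np.ones(n)
aic_null = ols_aic(np.column_stack([ones]), y)
aic_reduced = ols_aic(np.column_stack([ones, signal]), y)
aic_full = ols_aic(np.column_stack([ones, signal, noise]), y)

print(f"intercept only: {aic_null:.1f}")
print(f"signal only:    {aic_reduced:.1f}")
print(f"signal + noise: {aic_full:.1f}")
```

Adding a predictor can never raise the AIC by more than 2 (the penalty for one extra parameter), and it lowers the AIC only when the improvement in fit outweighs that penalty. This is the trade-off behind preferring the refined model.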

It appears that dpf (hereditary family factors) is more important than the correlation matrix suggested. Thus, we should run a single linear regression for dpf as well (next cell).

In [None]:
modelDPF = smf.ols("diabetes ~ dpf", data=df).fit()
print("Regression with dpf:")
print(modelDPF.summary())


### Analysis part 2:
The regression coefficient for dpf is substantially larger than those of the three factors we evaluated previously, although coefficients depend on each predictor's units and so are not directly comparable. Despite the large coefficient, the R-squared is quite low (0.03), meaning that only a small fraction of the variation in diabetes status (with or without) can be accounted for by this family factor.
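
The meaning of R-squared as "fraction of variation accounted for" can be checked by hand. The sketch below, on made-up values, computes R-squared from the residual and total sums of squares and confirms that, for a regression with a single predictor, it equals the square of the correlation coefficient.

```python
import numpy as np

# Toy values (hypothetical, not taken from diabetes.csv)
dpf = np.array([0.2, 0.9, 0.3, 0.6, 0.4, 0.5, 0.8, 0.1])
diabetes = np.array([0, 1, 0, 0, 1, 0, 1, 0], dtype=float)

# Fit a straight line and compute the fitted values
slope, intercept = np.polyfit(dpf, diabetes, 1)
predicted = intercept + slope * dpf

ss_res = np.sum((diabetes - predicted) ** 2)        # variation left unexplained
ss_tot = np.sum((diabetes - diabetes.mean()) ** 2)  # total variation in the outcome
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(dpf, diabetes)[0, 1]
print(f"R-squared = {r_squared:.3f}, r squared = {r ** 2:.3f}")
```

This identity is why a correlation of modest size translates into a small R-squared: squaring a value well below 1 shrinks it further.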

## Task 5: Statistical Inference and Hypothesis Testing

**Objective** Draw statistical conclusions about the dataset.
1. Perform hypothesis tests
   - Conduct a t-test to compare the mean Glucose levels between individuals with and without diabetes.
   - Perform ANOVA to check if there are significant differences in BMI across multiple age groups (e.g., group ages into <30, 30-50, >50).
2. Interpret the results and assess the significance of findings.

In [None]:
from scipy.stats import ttest_ind, f_oneway

# T-test: Compare mean Glucose levels for individuals with and without diabetes
diabetes_present = df[df['diabetes'] == 1]['glucose']
diabetes_absent = df[df['diabetes'] == 0]['glucose']
t_stat, p_val = ttest_ind(diabetes_present, diabetes_absent)
print(f"T-test for Glucose levels: t-statistic = {t_stat:.2f}, p-value = {p_val:.4f}")

# T-test: Compare mean family factors (dpf) for individuals with and without diabetes
diabetes_present = df[df['diabetes'] == 1]['dpf']
diabetes_absent = df[df['diabetes'] == 0]['dpf']
t_stat, p_val = ttest_ind(diabetes_present, diabetes_absent)
print(f"T-test for DPF: t-statistic = {t_stat:.2f}, p-value = {p_val:.4f}")

# ANOVA: Compare diabetes rates across age groups (<30, 30-50, >50)
# Note: pd.cut bins are right-inclusive by default, so the '<30' group includes age 30
df['Age_Group'] = pd.cut(df['age'], bins=[0, 30, 50, np.inf], labels=['<30', '30-50', '>50'])
anova_result = f_oneway(df[df['Age_Group'] == '<30']['diabetes'],
                        df[df['Age_Group'] == '30-50']['diabetes'],
                        df[df['Age_Group'] == '>50']['diabetes'])
print(f"ANOVA for Diabetes across Age Groups: F-statistic = {anova_result.statistic:.2f}, p-value = {anova_result.pvalue:.4f}")

# Grouped boxplot for BMI by Age Group
sns.boxplot(data=df, x='Age_Group', y='bmi', hue='diabetes')
plt.title("BMI by Age Group and Diabetes Status")
plt.show()



### Analysis
Diabetes rates are higher with higher glucose and dpf.

Diabetes prevalence differs across age groups in this dataset.

However, we can see in the graph that the BMI of young diabetics is often higher than that of older diabetics.
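
The task description also asks for an ANOVA of BMI across age groups, whereas the solution cell tests diabetes status. A minimal sketch of the BMI version is shown here on a small made-up table (the values are hypothetical); with the project data you would group `df['bmi']` by `df['Age_Group']` in the same way.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Toy data (hypothetical values, not taken from diabetes.csv)
toy = pd.DataFrame({
    "age": [22, 25, 28, 35, 40, 45, 55, 60, 65],
    "bmi": [24, 26, 25, 30, 32, 31, 28, 27, 29],
})

# Same binning as the solution cell (bins are right-inclusive by default)
toy["age_group"] = pd.cut(toy["age"], bins=[0, 30, 50, np.inf],
                          labels=["<30", "30-50", ">50"])

# One array of BMI values per age group
bmi_by_group = [grp["bmi"].to_numpy()
                for _, grp in toy.groupby("age_group", observed=True)]

f_stat, p_val = f_oneway(*bmi_by_group)
print(f"ANOVA for BMI across age groups: F = {f_stat:.2f}, p = {p_val:.4f}")
```

A small p-value indicates that at least one group mean differs from the others; it does not say which one, so a follow-up comparison or a boxplot is still needed to interpret the result.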

## Conclusion
You have demonstrated your ability to use Python for common data science tasks using NumPy, Pandas, Matplotlib, Seaborn, statsmodels, and SciPy.

The final [Module](../Submodule_3_Overview.ipynb) will enable you to use object-oriented programming in Python effectively.

### Clean up
In order to avoid unnecessary charges, be sure to stop your compute instance when you are done working with Jupyter notebooks for the day.