# Submodule 2 Project: Predictors of Diabetes

## Overview

In this project, students will analyze a dataset containing information about various factors related to diabetes. They will use Python libraries such as NumPy, Pandas, and Matplotlib/Seaborn for data wrangling and visualization, and apply statistical inference techniques to explore relationships between variables and diabetes. The goal is to identify potential causes of diabetes through single and multiple regression analysis.

A [solved version](./Submodule_2_Tutorial_7_ProjectSolutions.ipynb) of this exercise is also provided.

## Learning Objectives
In this project, you will practice the skills obtained in this module to understand the basis for diabetes. These skills include:
1. Working with Pandas DataFrames and NumPy
2. Visualizing datasets
3. Performing regression and statistical analysis to assess the relationships between biological characteristics and diabetes.

## Prerequisites

You should have completed all the tutorials in Module 2 and developed some level of comfort with the tools.

## Getting Started
The five tasks are described below. Solutions for each part are provided in a [solved version](./Submodule_2_Tutorial6_ProjectSolutions.ipynb) of this notebook.

# Investigating Possible Causes of Diabetes Using Data Analysis and Statistical Inference 

Many clinical measurements have been associated with the risk of developing diabetes. You will use your newly developed skills to evaluate how well each measurement in this dataset predicts diabetes. The final column ("Diabetes") is already dummy coded, with diabetes = 1. Thus, in a simple linear regression, the slope estimates how much the predicted probability of diabetes changes per unit change in a factor, and R² gives the fraction of the variance in diabetes status associated with that factor.

Of course, more than one factor may contribute, and factors may be correlated or interact with one another.

In your bioinformatics or biological/clinical work, you might expect to do similar kinds of analysis. You should find that, compared to running these analyses in Excel or a commercial statistical package, you can analyze a dataset of any size rapidly and reproducibly, both numerically and visually. In the cloud, you could analyze the kinds of datasets that could be created from the Million Veteran Program or UK Biobank.


These are the columns of the diabetes data table:
1. **Pregnancies** This column represents the number of pregnancies the individual has had
2. **Glucose** This column represents the glucose (blood sugar) level measured in the individual, often in mg/dL
3. **Diastolic** This represents the diastolic blood pressure of the individual, measured in mmHg (millimeters of mercury)
4. **Triceps** This represents the thickness of the skinfold at the triceps (back of the upper arm), often used to estimate body fat
5. **Insulin** This column represents insulin levels measured in the individual, often in µU/mL (micro-units per milliliter).
6. **BMI** BMI stands for Body Mass Index, a calculated value derived from an individual's weight and height (weight in kilograms divided by the square of height in meters)
7. **DPF** DPF refers to a Diabetes Pedigree Function, a measure estimating diabetes heredity based on family history
8. **Age** This column represents the age of the individual
9. **Diabetes** This column indicates the presence or absence of diabetes, coded as 0 (absence) and 1 (presence) of diabetes

## Task 1: Data Exploration and Cleaning

**Objective** Get familiar with the dataset and ensure it is ready for analysis.

1. Load the dataset (diabetes.csv) using pandas and display the first few rows.
2. Summarize the dataset using df.describe() to check the mean, median (reported as the 50% row), and standard deviation of each column.
3. Identify any missing or invalid values (e.g., 0 values in BMI, Glucose, etc., where they may not make sense).
4. Handle missing or invalid values by replacing invalid zeros in relevant columns (e.g., BMI, Glucose, Diastolic) with the column's mean or median.
5. Create histograms and boxplots for each numeric column to visualize the distribution and detect potential outliers.
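
As a starting point for steps 3 and 4, the sketch below marks implausible zeros as missing and imputes them with the column median. It uses a small made-up table in place of diabetes.csv, and the choice of which columns cannot legitimately be zero (`zero_invalid`) is an assumption you should check against your own data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for diabetes.csv; the values are made up for illustration only.
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "Diastolic": [72, 66, 64, 0, 40],
    "BMI": [33.6, 26.6, 23.3, 28.1, 0.0],
    "Diabetes": [1, 0, 1, 0, 1],
})

# Columns where a zero is assumed to be physiologically implausible.
zero_invalid = ["Glucose", "Diastolic", "BMI"]

# Mark invalid zeros as missing so they do not distort the summary statistics.
cleaned = df.copy()
cleaned[zero_invalid] = cleaned[zero_invalid].replace(0, np.nan)

# Count the missing values per column, then impute with each column's median.
n_missing = cleaned[zero_invalid].isna().sum()
cleaned[zero_invalid] = cleaned[zero_invalid].fillna(cleaned[zero_invalid].median())

print(n_missing)
print(cleaned.describe())
```

Converting zeros to NaN before computing the median matters: otherwise the zeros themselves would pull the imputed value down.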


## Task 2: Visualizing Relationships Between Variables
**Objective** Explore correlations between independent variables and diabetes.
1. Create pairplots or scatterplots (e.g., using seaborn.pairplot) for all numeric columns against the Diabetes column.
2. Calculate the correlation matrix using np.corrcoef or DataFrame.corr() and visualize it with a heatmap (e.g., using seaborn.heatmap).
3. Identify the top 3 variables that have the strongest correlations with the Diabetes column.
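
One possible way to approach steps 2 and 3 is sketched below. The data are synthetic (generated so that Glucose drives the outcome most strongly), so the numbers will not match the real dataset; only the column names are taken from the table above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the effect sizes below are assumptions, not real findings.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 7, n),
    "Age": rng.normal(35, 10, n),
    "Diastolic": rng.normal(70, 12, n),
})
score = 0.05 * (df["Glucose"] - 120) + 0.05 * (df["BMI"] - 32) + rng.normal(0, 1, n)
df["Diabetes"] = (score > 0).astype(int)

# Pearson correlation matrix of all numeric columns.
corr = df.corr()

# Rank predictors by the absolute value of their correlation with Diabetes.
top = (
    corr["Diabetes"]
    .drop("Diabetes")
    .abs()
    .sort_values(ascending=False)
    .head(3)
)
print(top)
```

Ranking by absolute value keeps strong negative correlations in the running. Passing `corr` to `seaborn.heatmap` (for example with `annot=True`) draws the matrix as a heatmap.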

## Task 3: Single Linear Regression Analysis

**Objective** Assess the relationship between individual predictors and diabetes.
1. Perform single linear regression using statsmodels or sklearn to predict Diabetes (as a continuous variable) based on Glucose, BMI, and Insulin, one predictor at a time.
2. Interpret the results, such as
   - Coefficients and their significance (p-values).
   - R² values (goodness of fit) for each predictor.
3. Determine which single variable is the most important in explaining diabetes risk.
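
The loop below is a minimal sketch of this task using the statsmodels formula interface. It runs on synthetic data in which Glucose is constructed to be the strongest predictor, so treat the output as an illustration of the workflow rather than a result.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the relationships are assumptions for illustration.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 7, n),
    "Insulin": rng.normal(150, 80, n),
})
score = 0.04 * (df["Glucose"] - 120) + 0.05 * (df["BMI"] - 32) + rng.normal(0, 1, n)
df["Diabetes"] = (score > 0).astype(int)

# Fit one ordinary least squares model per predictor and collect key statistics.
rows = []
for predictor in ["Glucose", "BMI", "Insulin"]:
    fit = smf.ols(f"Diabetes ~ {predictor}", data=df).fit()
    rows.append({
        "predictor": predictor,
        "slope": fit.params[predictor],
        "p_value": fit.pvalues[predictor],
        "r_squared": fit.rsquared,
    })

summary = (
    pd.DataFrame(rows)
    .set_index("predictor")
    .sort_values("r_squared", ascending=False)
)
print(summary)
```

Because Diabetes is coded 0/1, each slope reads as the change in predicted probability of diabetes per unit of the predictor. Calling `fit.summary()` on any single model prints the full regression table.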

## Task 4: Multiple Linear Regression

**Objective** Build a comprehensive model to predict diabetes.
1. Perform multiple linear regression using all independent variables (Pregnancies, Glucose, Diastolic, Triceps, Insulin, BMI, DPF, and Age)
2. Evaluate the model, finding rsquared, rsquared_adj, coefficients
3. Check coefficients and their significance.
4. Perform backward elimination by removing variables with high p-values (> 0.05) and re-running the regression to refine the model.
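5. Compare the models (the final multiple regression vs. the single linear regressions) using AIC to find the **best** predictors of diabetes from this dataset. A lower AIC indicates a better trade-off between fit and model complexity.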

## Task 5: Statistical Inference and Hypothesis Testing

**Objective** Draw statistical conclusions about the dataset.
1. Perform hypothesis tests
   - Conduct a t-test to compare the mean Glucose levels between individuals with and without diabetes.
   - Perform ANOVA to check if there are significant differences in BMI across multiple age groups (e.g., group ages into <30, 30-50, >50).
2. Interpret the results and assess the significance of findings.
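
Both tests are available in `scipy.stats`; the sketch below shows one way to set them up. The data are synthetic, with a Glucose difference between groups built in and no BMI difference across ages, and the use of Welch's unequal-variance t-test is a choice made here, not a requirement of the task.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data; the group difference in Glucose is an assumption.
rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({
    "Age": rng.integers(21, 80, n),
    "BMI": rng.normal(32, 7, n),
    "Diabetes": rng.integers(0, 2, n),
})
df["Glucose"] = rng.normal(110, 25, n) + 30 * df["Diabetes"]

# Two-sample t-test (Welch's version) on Glucose by diabetes status.
with_diabetes = df.loc[df["Diabetes"] == 1, "Glucose"]
without_diabetes = df.loc[df["Diabetes"] == 0, "Glucose"]
t_stat, t_p = stats.ttest_ind(with_diabetes, without_diabetes, equal_var=False)

# Bin integer ages into <30, 30-50, and >50, then run a one-way ANOVA on BMI.
df["AgeGroup"] = pd.cut(
    df["Age"], bins=[0, 29, 50, np.inf], labels=["<30", "30-50", ">50"]
)
groups = [g["BMI"].to_numpy() for _, g in df.groupby("AgeGroup", observed=True)]
f_stat, anova_p = stats.f_oneway(*groups)

print("t-test:", t_stat, t_p)
print("ANOVA: ", f_stat, anova_p)
```

A p-value below your chosen significance level (commonly 0.05) suggests the group means differ. A significant ANOVA does not say which groups differ, so a follow-up pairwise comparison would be needed for that.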

## Conclusion
You have demonstrated your ability to use Python for common data science tasks using NumPy, Pandas, Matplotlib, Seaborn, and statsmodels.

The next module builds on this, using machine learning to extend the analysis of data beyond what a human can hypothesize and to identify underlying relationships.

The final [Module](../Submodule_3_Overview.ipynb) will enable you to effectively use object-oriented programming in Python.

### Clean up
In order to avoid unnecessary charges, be sure to stop your compute instance when you are done working with Jupyter notebooks for the day.