# Supervised Learning - Project

In this project, we carry out a complete supervised machine learning workflow on a "Diabetes" dataset. The dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to predict diagnostically whether a patient has diabetes, based on certain diagnostic measurements included in the dataset.

[Kaggle Dataset](https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset)

# Part I : EDA - Exploratory Data Analysis

For this task, you are required to conduct an exploratory data analysis on the diabetes dataset. You are free to choose which visualizations to use, but your analysis should cover most of the following questions:

- Are there any missing values in the dataset?
- How are the predictor variables related to the outcome variable?
- What is the correlation between the predictor variables?
- What is the distribution of each predictor variable?
- Are there any outliers in the predictor variables?
- How are the predictor variables related to each other?
- Is there any interaction effect between the predictor variables?
- What is the average age of the individuals in the dataset?
- What is the average glucose level for individuals with diabetes and without diabetes?
- What is the average BMI for individuals with diabetes and without diabetes?
- How does the distribution of the predictor variables differ for individuals with diabetes and without diabetes?
- Are there any differences in the predictor variables between males and females (if gender information is available)?
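
As a starting point for several of the questions above, a minimal pandas sketch might look like the following. The column names (`Glucose`, `BMI`, `Age`, `Outcome`) are assumptions about the Kaggle CSV, and the small inline frame holds illustrative rows only, standing in for the real file.

```python
import pandas as pd

# Illustrative stand-in rows. In the project you would load the Kaggle CSV instead,
# e.g. df = pd.read_csv("diabetes.csv")  (the file name is an assumption).
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Age": [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Missing values per column.
missing = df.isna().sum()

# Average age across the whole dataset.
mean_age = df["Age"].mean()

# Average glucose and BMI split by outcome (0 = no diabetes, 1 = diabetes).
group_means = df.groupby("Outcome")[["Glucose", "BMI"]].mean()

# Pairwise Pearson correlation between the predictor variables.
corr = df.drop(columns="Outcome").corr()

print(missing, mean_age, group_means, corr, sep="\n\n")
```

Distributions and outliers are usually easier to judge visually, for example with `df.hist()` or `df.boxplot()`, which are left out here to keep the sketch short.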

See EDA.ipynb

# Part II : Preprocessing & Feature Engineering

You need to preprocess the given dataset. Consider the following tasks and carry out the steps that apply:
- Handling missing values
- Handling outliers
- Scaling and normalization
- Feature Engineering
- Handling imbalanced data
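
Three of the steps above (missing values, outliers, scaling) can be sketched as follows. This is only one possible approach under stated assumptions: median imputation, clipping at 1.5 times the interquartile range, and standardization are choices made for illustration, and the `Glucose` and `Insulin` columns hold made-up values. Feature engineering and class imbalance are not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one extreme value.
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0],
    "Insulin": [94.0, 168.0, 88.0, 846.0, 90.0, 120.0],
})

# 1. Missing values: fill each column with its own median.
imputed = df.fillna(df.median())

# 2. Outliers: clip each column to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = imputed.quantile(0.25)
q3 = imputed.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
clipped = imputed.clip(lower=lower, upper=upper, axis=1)

# 3. Scaling: standardize each column to zero mean and unit variance.
scaled = (clipped - clipped.mean()) / clipped.std(ddof=0)

print(scaled.round(3))
```

In a real pipeline the medians, clipping bounds, and scaling statistics should be computed on the training split only and then applied to the test split, so that no information leaks from the test data.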

See Feature Engineering.ipynb

# Part III : Training ML Model

For this task, you are required to build a machine learning model to predict the outcome variable. This will be a binary classification task, as the target variable is binary. You should select at least two models, one of which should be an ensemble model, and compare their performance.

- Train the models: Train the selected models on the training set.
- Model evaluation: Evaluate the trained models on the testing set using appropriate evaluation metrics, such as accuracy, precision, recall, F1-score, and ROC-AUC.
- Model comparison: Compare the performance of the selected models and choose the best-performing model based on the evaluation metrics. You can also perform additional analysis, such as model tuning and cross-validation, to improve the model's performance.
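
The train, evaluate, and compare loop described above could be sketched like this with scikit-learn. The choice of logistic regression as the baseline and a random forest as the ensemble model is an assumption for illustration, and `make_classification` generates synthetic data that stands in for the preprocessed diabetes features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and binary outcome.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           weights=[0.65, 0.35], random_state=0)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)                # hard class labels
    proba = model.predict_proba(X_test)[:, 1]   # probability of class 1
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Because a single train/test split can be noisy on a small dataset, `sklearn.model_selection.cross_val_score` is a natural next step before declaring one model the winner.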

See Training ML Model.ipynb

# Part IV : Conclusion

From the machine learning models developed and the exploratory data analysis (EDA) conducted, generate four bullet points as your findings.

- Pregnancy appears to be one of the primary contributing factors. This showed up in the PCA and makes intuitive sense. It does not quite align with what the random forest concluded, but that model's accuracy is poor enough that its result should be taken with a grain of salt. The PCA gave pregnancies a very high explained variance ratio, at just over 26%, whereas the random forest importance for pregnancies was barely above 8%.

- Both analyses ranked glucose as either the largest or second-largest factor associated with the incidence of diabetes. This also makes sense: insulin regulates glucose levels, so impaired insulin production due to diabetes will probably result in higher blood sugar levels. The PCA attributed 21.7% of the variance to glucose, and the random forest estimated over 25%.

- I initially expected BMI and Age to be the two most important factors, but this turned out to be quite wrong. PCA placed Age last, with less than 5% contribution, and BMI at under 7%.

- The diabetes pedigree function scores a patient's likelihood of diabetes based on family history. The PCA attributed 5.3% of the variation to it, and the random forest estimated 12.1%. At the very least, genetics does not appear to be the largest factor in predicting whether someone will have diabetes. This leaves controllable factors such as glucose levels and BMI as the most impactful. Individuals can use this knowledge to mitigate their risk of developing diabetes later in life, though that is not exactly a new revelation.
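
When comparing the two sets of percentages above, it helps to recall what each number measures in scikit-learn. `PCA.explained_variance_ratio_` has one entry per principal component (each a linear combination of all features, described by `components_`), while `RandomForestClassifier.feature_importances_` has one entry per original feature. The sketch below shows both on synthetic data. It is not the analysis from the notebooks, and whether the notebooks used these exact attributes is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight predictors and the binary outcome.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0)

# PCA is unsupervised: it never sees y, only the (scaled) predictors.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# One value per principal component, sorted from largest to smallest.
variance_per_component = pca.explained_variance_ratio_

# Rows are components, columns are original features: the loadings show
# how strongly each feature contributes to each component.
loadings = pca.components_

# The random forest is supervised: importances are per original feature.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importance_per_feature = forest.feature_importances_

print("PCA variance per component:", np.round(variance_per_component, 3))
print("Loadings of first component:", np.round(loadings[0], 3))
print("Forest importance per feature:", np.round(importance_per_feature, 3))
```

To relate a component's variance share back to individual features, one would inspect the corresponding row of `loadings`. Permutation importance (`sklearn.inspection.permutation_importance`) is another option for a per-feature ranking that is tied to predictive performance.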