# Machine Learning - Assignment 1

> Hoang Dang - s3927234

- Each section may contain callouts such as "note", "conclusion", etc.
- Reference: https://www.kaggle.com/code/alejopaullier/make-your-notebooks-look-better


## Abstract

This study applies machine learning methods to predict the onset of diabetes in individuals from a range of health and demographic features. It proceeds through Problem Formulation, Exploratory Data Analysis, Evaluation Framework, and Modelling, and closes with the Discussion and Conclusion.

---

## Table of Contents

- [1. Problem Formulation](#problem_formulation)
- [2. Exploratory Data Analysis](#EDA)
  - [2.1. Descriptive Statistics]()
  - [2.2. Variable Distributions]()
  - [2.3. Correlation Analysis]()
  - [2.4. Feature Engineering]()
- [3. Evaluation Framework](#evaluation_framework)
- [4. Modelling](#modelling)
  - [4.1. Models Proposal](#model_proposal)
  - [4.2. Models Implementation](#model_implementation)
  - [4.3. Models Evaluation](#model_evaluation)
- [5. Discussion and Conclusion](#sumup)

## 1. Problem Formulation <a class="anchor" id="problem_formulation"></a>

The first step in developing a model is to formulate the problem in a way that lets us apply machine learning. Here, the task is to predict the **presence or absence of diabetes** using various health metrics and demographic information of individuals.

Based on the problem type, the training dataset, and the `code_book.txt`, we can see that:
- This is a `Supervised Learning` problem, specifically `Binary Classification`, since the training data contains the `Status` column recording the occurrence of diabetes (0 for absence and 1 for presence), which is exactly what we are trying to predict
- The training dataset has 25 columns in total, of which:
  - The `Id` column will not be used since it is only an identifier and carries no information about the occurrence of diabetes
  - The `Status` column will be used as the target variable for the training process
  - The remaining 23 columns are the attributes and will be used for data analysis and training
- Data structure:
  - There are 8 ordinal attributes namely GenHlth (1 - 5), MentHlth (1 - 30), PhysHlth (1 - 30), Age (1 - 13), Education (1 - 6), Income (1 - 8), ExtraMedTest (-199, 199), ExtraAlcoholTest (-199, 199)
  - BMI is the only continuous attribute
  - The remaining 14 features are binary attributes, which can only be either 0 or 1
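
The grouping above can be written down once and reused in later steps. The following is a minimal sketch, not the notebook's actual code: the column names come from the column description table in this report, while the constant names (`ORDINAL`, `CONTINUOUS`, `BINARY`, `FEATURES`, `TARGET`, `ID_COLUMN`) and the helper `check_columns` are hypothetical and exist only for illustration.

```python
# Column groups as described in the Problem Formulation section.
# The constant names are illustrative, not taken from the original notebook.
ID_COLUMN = "Id"
TARGET = "Status"

ORDINAL = [
    "GenHlth", "MentHlth", "PhysHlth", "Age",
    "Education", "Income", "ExtraMedTest", "ExtraAlcoholTest",
]
CONTINUOUS = ["BMI"]
BINARY = [
    "HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex",
]
FEATURES = ORDINAL + CONTINUOUS + BINARY


def check_columns(columns):
    """Return the expected feature/target columns missing from `columns`."""
    present = set(columns)
    return [name for name in FEATURES + [TARGET] if name not in present]


print(len(ORDINAL), len(CONTINUOUS), len(BINARY), len(FEATURES))  # prints: 8 1 14 23
```

With a DataFrame `df`, the features and target could then be selected as `X = df[FEATURES]` and `y = df[TARGET]`, which leaves the `Id` column out without an explicit drop.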

## 2. Exploratory Data Analysis (EDA) <a id="EDA"></a>

### 2.1. Descriptive Statistics

After loading the training dataset into a `Pandas DataFrame`, we can extract some information using the `.shape` attribute and the `.describe()` method. Based on output[2], we can see that:
- Our dataset has 202,944 rows (records) and 24 columns in total (23 attributes plus the `Status` column as the expected output)
- There are **no missing values** in the dataset since every column has a count of exactly 202,944
- The percentages of people with High Blood Pressure and High Cholesterol are nearly the same, at about 42% each
- BMI, **the only continuous attribute**, ranges from 12 to 98. The average BMI is 28.38, which falls in the overweight range (25.0 to 29.9) [ref](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/)
- Most people had a Cholesterol Check (96%) and Health Care Coverage (95%)
- Only a minority of people reported Stroke (4%), Heavy Alcohol Consumption (6%), No Doctor because of Cost (8%), Heart Disease or Attack (9%), and Walking Difficulty (17%)
- More than half of the people have a healthy lifestyle, with Physical Activity at 76%, Fruit Consumption at 64%, and Vegetable Consumption at 81%
- Surprisingly, the statistics of ExtraMedTest and ExtraAlcoholTest are nearly the same
- Only 18% of people have diabetes (the target variable), so the two classes are imbalanced
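
The figures above can be collected programmatically. The sketch below assumes the data is already in a pandas DataFrame with `BMI` and `Status` columns. The helper name `summarize` and the tiny `toy` frame are hypothetical and only demonstrate the calls; they are not the assignment data or the notebook's actual code.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "Status") -> dict:
    """Collect the quantities discussed above from a training DataFrame."""
    n_rows, n_cols = df.shape              # .shape is an attribute, not a method
    stats = df.describe()                  # count, mean, std, min, quartiles, max
    missing = int(df.isna().sum().sum())   # total number of missing cells
    # For a 0/1 column, the mean equals the share of rows equal to 1.
    positive_rate = float(df[target].mean())
    return {
        "rows": n_rows,
        "columns": n_cols,
        "missing": missing,
        "bmi_min": float(stats.loc["min", "BMI"]),
        "bmi_max": float(stats.loc["max", "BMI"]),
        "bmi_mean": float(stats.loc["mean", "BMI"]),
        "positive_rate": positive_rate,
    }


# Tiny made-up frame, only to show the call; it is not the assignment data.
toy = pd.DataFrame({
    "BMI": [22.0, 31.0, 27.0, 40.0],
    "HighBP": [0, 1, 0, 1],
    "Status": [0, 0, 0, 1],
})
summary = summarize(toy)
print(summary)
```

The same idea explains the percentages quoted for the binary attributes: the `mean` row of `.describe()` for a 0/1 column is the fraction of records with value 1.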


### 2.2. Variable Distributions

In addition to the summary statistics provided by `Pandas`, we can use `Matplotlib` and `Seaborn` to visualize the attributes in the dataset. We begin by plotting a `histogram` of each attribute to see its distribution [ref](https://www.itl.nist.gov/div898/handbook/eda/section3/histogra.htm), and then use a `box plot` to see its spread and potential outliers [ref](https://www.itl.nist.gov/div898/handbook/eda/section3/boxplot.htm).
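
As a rough illustration of the two plot types, the sketch below draws a histogram and a box plot of a single attribute side by side. It uses Matplotlib only; Seaborn's `histplot` and `boxplot` could be swapped in. The function name `plot_distribution`, the bin count, and the sample values are assumptions made for this example, not the notebook's actual code or data.

```python
import matplotlib.pyplot as plt


def plot_distribution(values, name, bins=20):
    """Draw a histogram and a box plot of one attribute side by side."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    # Histogram: how often each range of values occurs.
    ax_hist.hist(values, bins=bins)
    ax_hist.set_title(f"Histogram of {name}")
    ax_hist.set_xlabel(name)
    ax_hist.set_ylabel("Count")

    # Box plot: median, quartiles, whiskers, and points beyond the whiskers.
    ax_box.boxplot(values)
    ax_box.set_title(f"Box plot of {name}")
    ax_box.set_ylabel(name)

    fig.tight_layout()
    return fig


# Made-up BMI values, only to show the call; not the assignment data.
fig = plot_distribution([12, 22, 25, 27, 28, 30, 33, 98], "BMI")
```

In a notebook the returned figure is displayed inline; for the real dataset the same function could be called once per column, for example with `df["BMI"]`.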


#### 2.2.1. Histogram

Based on output[4], we can see that:
- 


#### 2.2.2. Box plot

Based on output[5], we can see that:
- 


### 2.3. Correlation Analysis

### 2.4. Feature Engineering

---

- Perform thorough exploratory data analysis to understand the distribution of features, identify correlations, and visualize patterns in the dataset

- Handle missing values, outliers, and perform necessary data preprocessing steps


https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm


- Do we need to check for duplicated data?
- Outliers
- Correlation
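
For the outlier item, one common convention is the 1.5 × IQR rule that box plot whiskers are usually based on. The sketch below shows it with the standard library only. The choice of this rule, the function name `iqr_outliers`, and the sample values are assumptions for illustration; the report has not yet fixed an outlier method.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # method="inclusive" treats the smallest and largest observations as the
    # 0th and 100th percentiles and interpolates linearly in between.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Made-up BMI values, only to show the call; not the assignment data.
lower, upper, outliers = iqr_outliers([12, 22, 25, 27, 28, 30, 33, 98])
print(lower, upper, outliers)  # prints: 14.5 40.5 [12, 98]
```

Whether flagged values should be removed, capped, or kept is a separate decision; extreme BMI values may be genuine measurements rather than errors.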


## 3. Evaluation Framework <a id="evaluation_framework"></a>

## 4. Modelling <a id="modelling"></a>

### 4.1. Models Proposal <a id="model_proposal"></a>

- Propose at least three different machine learning models suitable for predicting the status of diabetes development in patients (only use techniques taught in class up to week 5 - inclusive)

- Justify the selection of each model based on its strengths, weaknesses, and suitability for the task


### 4.2. Models Implementation <a id="model_implementation"></a>

- Implement the proposed machine learning models using appropriate libraries (e.g. scikit-learn, TensorFlow, Keras).

- Fine-tune hyperparameters using cross-validation techniques to optimize model performance
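
Hyperparameter tuning with cross-validation could look roughly like the sketch below, which uses scikit-learn's `GridSearchCV` with stratified folds. Everything specific here is an assumption made for illustration: logistic regression as the model, the grid of `C` values, F1 as the scoring metric, and the synthetic data from `make_classification` standing in for the real training set. The report does not yet name its models or metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 18% positives; not the assignment data.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.82, 0.18], random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold,
# which keeps validation-fold statistics out of the training step.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

search = GridSearchCV(
    estimator=pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",  # assumed metric; the report has not fixed one yet
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the class ratio similar in every fold, which matters when the positive class is a small share of the records.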


### 4.3. Models Evaluation <a id="model_evaluation"></a>

Submit the prediction for patients in a given test set whose dependent variable (i.e., the presence or absence of diabetes) is hidden


## 5. Discussion and Conclusion <a id="sumup"></a>

- Provide a detailed discussion on the effectiveness of different machine learning models for diabetes prediction.

- Discuss the implications of the findings and potential applications of the predictive models in healthcare settings.

- Highlight areas for future research or improvements in predictive modeling for diabetes

The following describes the dataset columns:

| Attribute             | Description                                                                                                              |
|-----------------------|--------------------------------------------------------------------------------------------------------------------------|
| ID                    | Patient ID                                                                                                               |
| Status                | 0 = no diabetes, 1 = prediabetes or diabetes                                                                             |
| HighBP                | 0 = no high blood pressure, 1 = high blood pressure                                                                      |
| HighChol              | 0 = no high cholesterol, 1 = high cholesterol                                                                            |
| CholCheck             | 0 = no cholesterol check in 5 years, 1 = yes cholesterol check in 5 years                                                |
| BMI                   | Body Mass Index                                                                                                          |
| Smoker                | Have you smoked at least 100 cigarettes in your entire life? (0 = no, 1 = yes)                                           |
| Stroke                | (Ever told) you had a stroke (0 = no, 1 = yes)                                                                          |
| HeartDiseaseorAttack | Coronary heart disease (CHD) or myocardial infarction (MI) (0 = no, 1 = yes)                                             |
| PhysActivity          | Physical activity in the past 30 days (not including job) (0 = no, 1 = yes)                                               |
| Fruits                | Consume fruit 1 or more times per day (0 = no, 1 = yes)                                                                 |
| Veggies               | Consume vegetables 1 or more times per day (0 = no, 1 = yes)                                                            |
| HvyAlcoholConsump     | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) (0 = no, 1 = yes) |
| AnyHealthcare         | Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. (0 = no, 1 = yes)   |
| NoDocbcCost           | Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? (0 = no, 1 = yes)|
| GenHlth               | Would you say that in general your health is (scale 1-5): 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor     |
| MentHlth              | Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? (scale 1-30 days) |
| PhysHlth              | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? (scale 1-30 days) |
| DiffWalk              | Do you have serious difficulty walking or climbing stairs? (0 = no, 1 = yes)                                            |
| Sex                   | Sex (0 = female, 1 = male)                                                                                              |
| Age                   | 13-level age category: 1 = 18-24, 9 = 60-64, 13 = 80 or older                                                            |
| Education             | Education level (scale 1-6): 1 = Never attended school or only kindergarten, 2 = Grades 1 through 8 (Elementary), 3 = Grades 9 through 11 (Some high school), 4 = Grade 12 or GED (High school graduate), 5 = College 1 year to 3 years (Some college or technical school), 6 = College 4 years or more (College graduate) |
| Income                | Income scale (scale 1-8): 1 = less than $10,000, 5 = less than $35,000, 8 = $75,000 or more                             |
| ExtraMedTest          | The result of an extra medical test, range (-100, 100)                                                                  |
| ExtraAlcoholTest      | The result of an extra alcohol test, range (-100, 100)                                                                  |
