## Income Prediction for Azubian

#### Business Understanding

##### `Project Overview`
Income inequality - when income is distributed in an uneven manner among a population - is a growing problem in developing nations across the world. With the rapid rise of AI and worker automation, this problem could continue to grow if steps are not taken to address the issue.

##### `Project Goal`
The objective of this challenge is to create a machine learning model to predict whether an individual earns above or below a certain amount.

##### `Business Objectives`
This solution can potentially reduce the cost and improve the accuracy of monitoring key population indicators, such as income level, between census years. This information will help policymakers better manage and reduce income inequality globally.


##### `Source of Data`
The dataset provided for this project is a modified version of a publicly available Johns Hopkins University data source hosted on Kaggle. It includes various patient attributes and their corresponding sepsis status. The dataset is subject to strict usage restrictions and can only be used for the purpose of this assignment.


##### `Key Stakeholders`
- Government Policy Makers: Involves the people who are responsible for managing the affairs of a nation and setting base
- Data Scientists and Developers: Team members involved in the development, training, and deployment of the machine learning model and API.


##### `Evaluation Metrics`
The evaluation metric for this competition is the F1 score, which ranges from 0 (total failure) to 1 (perfect score). Hence, the closer your score is to 1, the better your model.

F1 Score: A performance score that combines precision and recall. It is the harmonic mean of these two quantities. Formula is given as: `2 * Precision * Recall / (Precision + Recall)`

Precision: The proportion of items identified as positive that are actually positive. Formula is given as: `TP / (TP + FP)`

Recall / Sensitivity / True Positive Rate (TPR): The proportion of actual positives that are correctly identified as positive. Formula is given as: `TP / (TP + FN)`

Where:

- TP = True Positive
- FP = False Positive
- TN = True Negative
- FN = False Negative
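The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the competition's scoring code: the function names `confusion_counts` and `precision_recall_f1` and the toy labels are assumptions made for the example, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Apply the precision, recall and F1 formulas to the confusion counts."""
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred, positive)
    # Assumption: report 0.0 instead of raising when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up labels, used only to exercise the formulas.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these toy labels there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all come out to 0.75.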

##### `Success Criteria`
- Accuracy: The final model should have an average accuracy of 0.80 or above.
- Precision and Recall: The final model should maintain both Precision and Recall scores of 0.80 or above.
- F1 Score: The final model should attain an F1 score in the range of 0.75 to 0.85 or higher, in line with state-of-the-art (SOTA) models.
- Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The final model should achieve an AUC-ROC score in the range of 0.80 to 0.90 or higher, in line with state-of-the-art (SOTA) models for sepsis prediction.
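One way to make these criteria checkable is to encode the lower bounds as a small lookup and compare a model's scores against it. The sketch below is only an illustration: the names `SUCCESS_THRESHOLDS` and `unmet_criteria` are hypothetical, the example scores are made up, and the lower end of each stated range is assumed to be the pass mark.

```python
# Lower bounds taken from the success criteria above (assumed to be the pass marks).
SUCCESS_THRESHOLDS = {
    "accuracy": 0.80,
    "precision": 0.80,
    "recall": 0.80,
    "f1": 0.75,
    "auc_roc": 0.80,
}


def unmet_criteria(scores, thresholds=SUCCESS_THRESHOLDS):
    """Return {metric: (score, minimum)} for every metric that is missing or too low."""
    failures = {}
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None or score < minimum:
            failures[metric] = (score, minimum)
    return failures


# Made-up scores for illustration only, not results from this project.
example_scores = {
    "accuracy": 0.84,
    "precision": 0.81,
    "recall": 0.78,
    "f1": 0.79,
    "auc_roc": 0.88,
}

print(unmet_criteria(example_scores))
```

An empty dictionary means every criterion is met; here only recall falls short of its threshold.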


##### `Data Dictionary`

| Column Name       | Attribute/Target | Data Type | Description                                                                                 |
|-------------------|------------------|------------|---------------------------------------------------------------------------------------------|
| **ID**            | N/A              | Integer    | Unique identifier for each patient.                                                         |
| **PRG**           | Attribute        | Float      | Plasma glucose: Measurement of plasma glucose levels.                                       |
| **PL**            | Attribute        | Float      | Blood Work Result-1: Blood work result in mu U/ml.                                          |
| **PR**            | Attribute        | Float      | Blood Pressure: Measurement of blood pressure in mm Hg.                                     |
| **SK**            | Attribute        | Float      | Blood Work Result-2: Blood work result in mm.                                               |
| **TS**            | Attribute        | Float      | Blood Work Result-3: Blood work result in mu U/ml.                                          |
| **M11**           | Attribute        | Float      | Body Mass Index: BMI calculated as weight in kg/(height in m)^2.                            |
| **BD2**           | Attribute        | Float      | Blood Work Result-4: Blood work result in mu U/ml.                                          |
| **Age**           | Attribute        | Integer    | Age: Age of the patient in years.                                                           |
| **Insurance**     | N/A              | Boolean    | Insurance: Indicates whether the patient holds a valid insurance card.                      |
| **Sepssis**        | Target           | Boolean    | Sepsis: Target variable indicating whether the patient will develop sepsis (Positive) or not (Negative). |


##### `Hypothesis Statement`
Sepsis, a life-threatening condition, is a leading cause of mortality in intensive care units. While lack of insurance and age differences have been associated with higher in-hospital mortality due to sepsis, the reasons behind this disparity remain unclear. Insurance can facilitate timely access to care, potentially impacting sepsis outcomes, and age is likely to influence sepsis-related hospitalization.

With this in mind, I investigate the following hypothesis:

`Null Hypothesis (Ho)`: There is no correlation between age and a patient's likelihood of developing sepsis.

`Alternative Hypothesis (Ha)`: There is a statistically significant correlation between age and a patient's likelihood of developing sepsis.
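A hypothesis of this form can be checked by correlating age with the binary outcome and asking how often a correlation at least as strong would arise by chance. The sketch below uses a permutation test on the Pearson (point-biserial) correlation in plain Python. It is one possible approach, not necessarily the test used later in this notebook; the function names, the toy ages and outcomes, and the 0.05 significance level are all assumptions for illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation; with a 0/1 variable this is the point-biserial correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_permutations=5000, seed=0):
    """Two-sided p-value: share of label shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up data for illustration: ages and a 0/1 outcome (1 = Positive).
ages = [22, 25, 31, 38, 45, 52, 60, 67]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]

r = pearson_r(ages, outcome)
p_value = permutation_p_value(ages, outcome)
alpha = 0.05  # assumed significance level
print(f"r={r:.3f}, p={p_value:.4f}, reject Ho: {p_value < alpha}")
```

A p-value below the chosen significance level would lead to rejecting Ho in favour of Ha; otherwise the data do not provide enough evidence of a correlation.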


##### `Analytical Questions`
1. Are elderly people at a higher risk of developing sepsis compared to younger individuals?
2. Are patients with a high BMI at a greater risk of developing sepsis?
3. Are patients with high blood pressure at a greater risk of developing sepsis?
4. Does higher plasma glucose increase the likelihood of developing sepsis?
5. Are patients who hold a valid insurance card at a greater risk of developing sepsis?

 


#### Importations

#### Data Understanding