 <h1><center>Predicting Asthma Diagnosis As A Screening Tool</center></h1>
 

[<center>Link to GitHub Repository</center>](https://github.com/August-JF-Perez/AugustP_Springboard/tree/main/Projects/Capstone2)

# Introduction

- This project aims to build a predictive model in order to screen whether a patient would be diagnosed with asthma.
- The real-world application would be for doctors to more easily determine which patients to focus on, for the more efficient allocation of resources in an already strained healthcare system.
- The final model was trained on 26 features/variables per patient, covering demographic details, lifestyle factors, environmental and allergy factors, medical history, clinical measurements, and symptoms, along with the diagnosis indicator.

- The final goal is to build a classification model with a focus on maximizing the recall score (sensitivity) to decrease false negatives.


Asthma Information from The Mayo Clinic: https://www.mayoclinic.org/diseases-conditions/asthma/symptoms-causes/syc-20369653

# Data Source

The dataset used for all models explored was the "Asthma Disease Dataset" from Kaggle.

[Link to Kaggle Dataset](https://www.kaggle.com/datasets/rabieelkharoua/asthma-disease-dataset?resource=download)

Below is a snapshot of the raw data arranged in a dataframe after importing.

![original_data_df](images/original_data_dataframe.png)

# Data Cleaning & Preprocessing

The raw dataset has 2392 entries and 29 columns. 

To ensure clean data prior to modeling, unnecessary features (PatientID & DoctorInCharge) were removed, as they did not provide information useful for predicting diagnosis.

Cleaning:
- The data was checked for null values, outliers, value consistency within each feature, and detectable irregularities. No occurrences requiring in-depth cleaning were found.

Preprocessing:
- Categorical features were confirmed to be, or converted to, indicators of 0 or 1 (dummy variable conversion)
- Additional features were engineered by combining the features within each group to capture the magnitude of that group, on the premise that compounding factors increase the chance of an asthma diagnosis
- Numeric features were rescaled to range from 0 to 1 (min-max scaling, with 0 representing the minimum value of the feature and 1 representing the maximum)
- Resampling was performed to address the class imbalance of Non-Diagnosed vs Diagnosed for asthma
    - Oversampling performed with replacement to achieve equal count of Diagnosis=0 & Diagnosis=1
    - This was not performed on test/validation data
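The scaling and resampling steps above could be sketched roughly as follows. This is a minimal illustration on a synthetic frame, not the notebook's actual code: the column subset, the 25% hold-out fraction, and the random seeds are all assumptions. Scaling statistics are taken from the training split here, which is a design choice of this sketch; the report does not say which rows were used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned dataset (values and column subset are illustrative).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "bmi": rng.uniform(15, 40, n),
    "lungfunctionfev1": rng.uniform(1, 4, n),
    "smoking": rng.integers(0, 2, n),
    "diagnosis": (rng.random(n) < 0.05).astype(int),  # roughly 5% positives
})

# Hold out a test set first so that resampling never touches it.
test = df.sample(frac=0.25, random_state=0)
train = df.drop(test.index)

# Min-max scale numeric features to [0, 1] using the training split's min and max.
numeric_cols = ["bmi", "lungfunctionfev1"]
col_min = train[numeric_cols].min()
col_range = train[numeric_cols].max() - col_min
train_scaled = train.copy()
test_scaled = test.copy()
train_scaled[numeric_cols] = (train[numeric_cols] - col_min) / col_range
test_scaled[numeric_cols] = (test[numeric_cols] - col_min) / col_range

# Oversample the minority class with replacement until both classes have equal counts.
majority = train_scaled[train_scaled["diagnosis"] == 0]
minority = train_scaled[train_scaled["diagnosis"] == 1]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
train_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)

print(train_balanced["diagnosis"].value_counts())
```

Only `train_balanced` is resampled; `test_scaled` keeps the original class ratio so that the reported metrics reflect the real imbalance.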

Below is a snapshot of the cleaned & preprocessed dataset.

![clean_df](images/clean_dataframe.png)

# EDA: Highlights

**General Observations**

- Continuous columns had relatively flat distributions for all data and when filtered for Diagnosis=0
- Continuous columns had multiple different distributions when filtered for Diagnosis=1
    - For each feature, the histogram showed peaks and valleys, but no peak was tall enough to mark a value of the feature as significant or as a determining factor for asthma
- Categorical columns had very similar distributions for Diagnosis=0 and Diagnosis=1
    - This suggests that even when one class of a category was in the majority, it was not a good indicator by itself of whether the patient would be diagnosed with asthma

## Feature Distributions

### All Data (Unfiltered)

![dist_all](images/distributions_alldata.png)

- **Fairly flat distributions for numeric variables.**
    - Continuous
        - bmi, lungfunctionfev1, lungfunctionfvc
    - Discrete
        - age, ethnicity, educationlevel, physicalactivity, dietquality, sleepquality, pollutionexposure, pollenexposure, dustexposure
- **The ratios within the categorical variables seem to reflect ratios expected in the real population.**
    - Binary
        - gender, smoking, petallergy, familyhistoryasthma, historyofallergies, eczema, hayfever, gastroesophagealreflux, wheezing, shortnessofbreath, chesttightness, coughing, nighttimesymptoms, exerciseinduced, diagnosis (the target variable)
    - Ordinal
        - age, educationlevel, physicalactivity, dietquality, sleepquality, pollutionexposure, pollenexposure, dustexposure
    - Nominal
        - ethnicity

### Diagnosis=0

![dist_all](images/distributions_diag_0.png)

- Nearly the same distributions as the unfiltered data

### Diagnosis=1

![dist_all](images/distributions_diag_1.png)

- Numerical variables seemingly have more varied distributions.
    - This is likely because only about 5% of the total data has Diagnosis = 1.
    - pollutionexposure has an almost bimodal distribution, but the case is not strong enough to confidently classify it as such.
- Categorical variables have extremely similar ratios to the unfiltered and Diagnosis=0 distributions

## Interactions Between Features

Target feature of the dataset: Diagnosis

Correlation Heatmap:

![corr_heatmap](images/corr_heatmap.png)


- Correlation between two features
    - Highest
        - 0.064841 between BMI & DustExposure
    - Lowest
        - -0.059298 between Wheezing & Hayfever

- Correlation between the target (diagnosis) & a feature
    - Highest
        - 0.053956 for ExerciseInduced
    - Lowest
        - -0.039278 for Chesttightness

These correlation coefficients indicate that there is barely any linear relationship between any pair of features, or between any single feature and Diagnosis. For reference, a correlation of 1 or -1 indicates a perfect linear relationship, and 0 indicates none.
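As a rough sketch of how the highest and lowest correlations could be pulled out of a correlation matrix, the snippet below uses pandas on random stand-in data. The four column names are borrowed from the dataset for readability only, and the values are synthetic, so the numbers it produces are unrelated to those reported above.

```python
import numpy as np
import pandas as pd

# Random stand-in data; only the column names resemble the real dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.random((500, 4)),
    columns=["bmi", "dustexposure", "wheezing", "diagnosis"],
)

corr = df.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair appears once and the diagonal is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

highest_pair, highest_value = pairs.idxmax(), pairs.max()
lowest_pair, lowest_value = pairs.idxmin(), pairs.min()

# Correlation of each feature with the target only, sorted from lowest to highest.
target_corr = corr["diagnosis"].drop("diagnosis").sort_values()

print(highest_pair, round(highest_value, 6))
print(lowest_pair, round(lowest_value, 6))
print(target_corr)
```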

<img src="Figures/count.png">

**Balanced vs Imbalanced Dataset**

- The dataset suffered a class imbalance between Diagnosis=0 (Non-Diagnosed for asthma) & Diagnosis=1 (Diagnosed for asthma).
- This imbalance can bias two-class ML models toward the majority class.
- Observations where Diagnosis=1 account for 5.18% of all the data.

- This imbalance was addressed as noted in the Data Cleaning & Preprocessing section

# Model Results

**Methodology**
- Choose multiple classification models in the scikit-learn library that could be reasonably applied to this two-class prediction problem
- Fit on data that has been balanced between classes and test/validate on data that did not receive artificial class balancing
- Return scores and model metrics with a **focus on recall (for Diagnosis = 1)**
- Choose the best 2 models and perform hyperparameter tuning
- Choose best model

# Baseline: Random Guessing

- The baseline against which the classification models were compared was created using scikit-learn's DummyClassifier to **simulate random guessing**.
    - Applying a Grid Search, the dummy classifier hyperparameter 'strategy' was set to the value 'uniform'
- Continuing the focus on maximizing the recall score, **randomly guessing gave a recall of 37.9%**
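A minimal sketch of such a random-guessing baseline is shown below. The feature arrays are random placeholders and the seeds are assumptions; only the test-set class split (33 positives, 565 negatives) mirrors the classification reports in this document. In expectation, uniform guessing recovers half of the positives, but with only a few dozen positive cases a single run can land several percentage points away from 50%.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(42)

# Placeholder features; a dummy classifier ignores them entirely.
X_train = rng.random((1000, 5))
y_train = rng.integers(0, 2, 1000)           # stand-in for the balanced training labels
X_test = rng.random((598, 5))
y_test = np.array([1] * 33 + [0] * 565)      # same class split as the reports below

# strategy="uniform" predicts each class with equal probability.
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_recall = recall_score(y_test, y_pred, pos_label=1)
print(f"Baseline recall for Diagnosis=1: {baseline_recall:.3f}")
```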

# Preliminary Modeling

- 6 classification models were trained and scored based on their predictions
- Some values for model hyperparameters were changed based on applicability to the problem the model was applied to.
- The 2 best models then underwent hyperparameter tuning

## Models and their classification reports

<style>
  table {margin-left: 0 !important;}
</style>
<!-- To left-align the classification report tables -->

### Model 1: Decision Tree

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.94 |     0.95 |       0.95 |     565   |
| 1            |        0.03 |     0.03 |       0.03 |      33   |
| accuracy     |             |          |       0.9  |     598   |
| macro avg    |        0.49 |     0.49 |       0.49 |     598   |
| weighted avg |        0.89 |     0.9  |       0.89 |     598   |


Decision Tree:

- Accuracy: 91%
- Recall: 3%
- High scores for diagnosis=0, low scores for diagnosis=1
- Worse recall than dummy classifier

### Model 2: Random Forest

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.94 |     0.95 |       0.95 |     565   |
| 1            |        0.03 |     0.03 |       0.03 |      33   |
| accuracy     |             |          |       0.9  |     598   |
| macro avg    |        0.49 |     0.49 |       0.49 |     598   |
| weighted avg |        0.89 |     0.9  |       0.89 |     598   |

Random Forest:

- Accuracy: 94%
- Recall: 0%
- Worse recall than dummy classifier

Performed well for diagnosis=0, extremely poorly for predicting diagnosis=1

### Model 3: KNN

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.94 |     0.84 |       0.89 |     565   |
| 1            |        0.04 |     0.12 |       0.06 |      33   |
| accuracy     |             |          |       0.8  |     598   |
| macro avg    |        0.49 |     0.48 |       0.48 |     598   |
| weighted avg |        0.89 |     0.8  |       0.85 |     598   |

KNN:

- Accuracy: 80%
- Recall (diag=1): 12%
- Worse recall than dummy classifier

KNN did well predicting diagnosis=0. It predicted diagnosis=1 slightly better than the previous models, but still poorly.

### Model 4: Logistic Regression

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.95 |     0.59 |       0.73 |    565    |
| 1            |        0.06 |     0.45 |       0.11 |     33    |
| accuracy     |             |          |       0.58 |    598    |
| macro avg    |        0.5  |     0.52 |       0.42 |    598    |
| weighted avg |        0.9  |     0.58 |       0.69 |    598    |

Logistic Regression:

- Accuracy: 58%
- Recall: 45%
- Improved recall over dummy classifier

Low accuracy but much higher recall vs other models.

I theorize this is because the two classes overlap heavily in feature space, which makes it difficult to place a decision boundary that separates them, so the model gets only slightly more than half of its predictions correct.
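To make the accuracy versus recall trade-off concrete, the rounded Logistic Regression report above is consistent with roughly the following confusion counts. These counts are back-calculated for illustration and are an assumption, not values read from the notebook; the actual confusion matrix may differ by a few patients.

```python
# Hypothetical confusion counts, chosen to be consistent with the rounded report.
tn, fp = 333, 232   # Diagnosis=0: 565 patients in the test set
fn, tp = 18, 15     # Diagnosis=1: 33 patients in the test set

recall_pos = tp / (tp + fn)                   # sensitivity for Diagnosis=1
precision_pos = tp / (tp + fp)                # share of flagged patients who have asthma
recall_neg = tn / (tn + fp)                   # specificity
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(recall_pos, 2), round(precision_pos, 2), round(recall_neg, 2), round(accuracy, 2))
# prints: 0.45 0.06 0.59 0.58
```

Under these assumed counts, about 232 non-asthmatic patients would be flagged in order to catch 15 of the 33 asthma cases, which is why precision stays near 6% even though recall beats the baseline.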


### Model 5: Gradient Boosting

Using decision trees

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.94 |     0.92 |       0.93 |    565    |
| 1            |        0.04 |     0.06 |       0.05 |     33    |
| accuracy     |             |          |       0.88 |    598    |
| macro avg    |        0.49 |     0.49 |       0.49 |    598    |
| weighted avg |        0.89 |     0.88 |       0.89 |    598    |

Gradient Boosting:

- Accuracy: 88%
- Recall: 6%
- Worse recall than dummy classifier

Accuracy is lower than the Decision Tree and Random Forest models but higher than KNN (80%) and Logistic Regression (58%). Recall (sensitivity) is only better than the tree & forest models.


### Model 6: AdaBoost classifier

Using Logistic Regression as the estimator, since it has given the best recall so far

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.95 |     0.59 |       0.73 |    565    |
| 1            |        0.06 |     0.45 |       0.11 |     33    |
| accuracy     |             |          |       0.58 |    598    |
| macro avg    |        0.5  |     0.52 |       0.42 |    598    |
| weighted avg |        0.9  |     0.58 |       0.69 |    598    |

AdaBoost Classifier:

- Accuracy: 58%
- Recall: 45%

Almost exactly the same results as Logistic Regression by itself. Changing the sample weights through boosting does not seem to have an effect when using default values.

### Confusion Matrices

Confusion matrices are shown for the models that were selected for hyperparameter tuning.

![knn_confusion_matrix](images/knn_confusion_matrix.png)
![logreg_confusion_matrix](images/logreg_confusion_matrix.png)
![ada_confusion_matrix](images/ada_confusion_matrix.png)

### Top 3 Models
The 3 models with the highest recall for Diagnosis=1 were KNN, Logistic Regression, & AdaBoost(LogisticRegression)

- KNN recall: 12%
- Logistic Regression recall: 45%
- AdaBoost classifier recall: 45%

Note: KNN inclusion explained in "Hyperparameter Tuning" section of this report.

# Hyperparameter Tuning
Models tuned: KNN, Logistic Regression, & AdaBoost(LogisticRegression)

Hyperparameter optimization was performed using the GridSearch approach to find the best modeling parameters and further improve the prediction recall (for Diagnosis = 1) of the models. The results show at most a marginal improvement after tuning: AdaBoost gained about 1 percentage point of recall, while Logistic Regression and KNN did not improve.


Logistic Regression & AdaBoost(LogisticRegression) had the highest recall for Diagnosis = 1 (about 45% for each) and were chosen for hyperparameter tuning.

KNN only had a recall of about 12%. This model underwent hyperparameter tuning as a means of becoming more familiar with the tools used in this project.

The results of tuning KNN are not shown in this report because they neither improve on the original model nor beat the metrics of Logistic Regression or AdaBoost. They are mentioned here as a pointer for readers reviewing the project Jupyter Notebook.

## Tuning Logistic Regression

Best parameters:
 - 'class_weight': 'balanced'
 - 'max_iter': 500
 - 'random_state': 9
 - 'solver': 'liblinear'

Best recall score: 43%
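A grid search that optimizes recall for the positive class could be set up along these lines. The data is synthetic and the grid values are hypothetical, apart from including the best values reported above; the notebook's actual grid and cross-validation settings are not given in this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data; not the asthma dataset.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hypothetical search grid that contains the reported best values.
param_grid = {
    "class_weight": [None, "balanced"],
    "max_iter": [100, 500],
    "solver": ["liblinear", "lbfgs"],
}

search = GridSearchCV(
    LogisticRegression(random_state=9),
    param_grid=param_grid,
    scoring="recall",  # recall of the positive class (label 1)
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.2f}")
```

Note that `best_score_` is a mean over cross-validation folds, so it need not match the recall measured on a separate hold-out set.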

## Tuning AdaBoostClassifier

Best parameters:
- 'estimator': `LogisticRegression(class_weight='balanced', max_iter=500, solver='liblinear')`
- 'learning_rate': 0.1

Best recall score: 46%

## Tuning Results

AdaBoost saw a 1 percentage point improvement in recall, while the tuned Logistic Regression model saw a decrease in recall versus the default model.

AdaBoost Recall Scores:
- Original model: 45%
- Tuned model: 46%

Logistic Regression Recall Scores:
- Original model: 45%
- Tuned model: 43%

## Best & Final Model

AdaBoost Classifier
- 'estimator': `LogisticRegression(class_weight='balanced', max_iter=500, solver='liblinear')`
- 'learning_rate': 0.1



# Conclusion

- The goal of this project was to build a classification model as a screening tool for whether or not a patient would be diagnosed with asthma if tested using laboratory tests or professional medical examination.
    - The final use of the model would be for medical professionals (doctors, RNs, etc.) to more easily prioritize resources towards patients with a higher chance of having asthma.
        - This is in contrast to attempting to keep track of every single patient or waiting until symptoms get worse to fully investigate.
    - The prediction would be based on multiple categories of variables such as family history, environmental factors, and any current symptoms.
    - The model focused on maximizing sensitivity because catching true positives (avoiding false negatives) is more beneficial here than reducing false positives.
        - A false positive for asthma in a screening test (which would result in follow-up examinations and tests) is much less harmful to the patient than a false positive for another disease such as cancer.
            - This is especially true given the mental burden on the patient and the higher cost & time requirements of tests for most other diseases.
- The methodology applied to the prediction of asthma diagnosis and the maximizing of sensitivity was that of exploring the dataset with statistical and plotting methods, applying multiple binary classification models, tuning of hyperparameters of the best 2 models, then selecting the best of the tuned models.
- The overall recall (sensitivity) for correctly predicting a true diagnosis of asthma was fairly low, at 46%.
    - While diagnostic screening tests can have very low sensitivities (when favoring specificity), any test that aims to minimize false negatives requires a high sensitivity (usually > 80%).
- The best model found was the AdaBoost classifier using Logistic Regression as its estimator.
    - The Logistic Regression model alone had similar performance but did not see improvement after hyperparameter tuning.
- For real world application, I would recommend against using the model in its current state as a screening test for asthma.

# Possible Improvements

**Possible limiting factors to model efficacy were:**

- Low count of Diagnosis=1 samples (only about 5% of the original dataset)
- The variables chosen in the study could each have extremely weak correlation to asthma diagnosis
- Variables not collected in the study could have greater correlation to asthma diagnosis
- The sample recorded in the dataset does not truly reflect the population (in such a way that bootstrapping does not fully address the issue)
- Faulty sample collection or incorrect data handling of the dataset prior to being made available
    - This is suspected because the ratios of classes for common symptoms of asthma do not appear to change significantly from Diagnosis=0 to Diagnosis=1
        - Common symptoms: shortness of breath; chest tightness or pain; wheezing when exhaling (a common sign of asthma in children); trouble sleeping caused by the previously listed symptoms

Improvements to the project would include:

- Perform effective feature engineering by learning more about each variable collected in the dataset and gaining a deeper understanding of how they could relate to each other.
- Perform feature selection and apply it to the models.
    - Gaining a better understanding of how to perform effective feature selection is the hurdle for this improvement.
- Gain more data from other studies.
    - This would make the data used for modeling more reflective of the population.
    - This might elucidate the potential inaccuracies within the original dataset.
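As one possible starting point for the feature selection item above, a univariate filter such as scikit-learn's `SelectKBest` could be tried. This is only a sketch of one technique among many: the data is synthetic, and both the scoring function (`mutual_info_classif`) and `k=10` are arbitrary assumptions, not choices made in the project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in with 26 features, of which only a few are informative.
X, y = make_classification(
    n_samples=500, n_features=26, n_informative=5, random_state=0
)

# Keep the 10 features with the highest estimated mutual information with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

kept_indices = selector.get_support(indices=True)
print("Kept feature indices:", kept_indices)
print("Reduced shape:", X_selected.shape)
```

To avoid leakage, the selector would need to be fit on the training split only and then applied to the test split with `selector.transform`.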