 <h1><center>Predicting Asthma Diagnosis As A Screening Tool</center></h1>
 

[<center>Link to GitHub Repository</center>](https://github.com/August-JF-Perez/AugustP_Springboard/tree/main/Projects/Capstone2)

# Introduction

- This project aims to build a predictive model in order to screen whether a patient would be diagnosed with asthma.
- The real-world application would be for doctors to more easily determine which patients to focus on, for the more efficient allocation of resources in an already strained healthcare system.
- The final model was trained with 26 features/variables for each patient, spanning demographic details, lifestyle factors, environmental and allergy factors, medical history, clinical measurements, and symptoms, as well as the diagnosis indicator.

- The final goal is to build a classification model with a focus on maximizing the recall score (sensitivity) to decrease false negatives.
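For reference, recall for the positive class (Diagnosis = 1) is defined below, where $TP$ is the number of diagnosed patients the model flags and $FN$ is the number it misses. The symbols are standard notation introduced here for illustration, not taken from the project notebook.

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

Every missed diagnosis adds to $FN$, so raising recall directly lowers the number of false negatives.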

# Data Source

The dataset used in the models explored was the "Asthma Disease Dataset" from Kaggle.

[Link to Kaggle Dataset](https://www.kaggle.com/datasets/rabieelkharoua/asthma-disease-dataset?resource=download)

Below is a snapshot of the raw data arranged in a dataframe after importing.

![original_data_df](images/original_data_dataframe.png)

# Data Cleaning & Preprocessing

The raw dataset has 2392 entries and 29 columns. 

To ensure clean data prior to modeling, unnecessary features (PatientID & DoctorInCharge) were removed, as they did not provide information useful for predicting diagnosis.

Cleaning:
- The data was checked for null values, outliers, value consistency within each feature, and detectable irregularities. No occurrences requiring in-depth cleaning were found.

Preprocessing:
- Categorical features were confirmed to be, or converted to, indicators of 0 or 1 (dummy variable conversion)
- Additional features were added that combine the features within each group to give the magnitude of that feature group, with the goal of indicating that compounding factors increase the chance of an asthma diagnosis
- Numeric features were rescaled to range from 0 to 1 (0 representing the minimum value of the feature, and 1 representing the maximum)
- Resampling was performed to address the class imbalance of Non-Diagnosed vs Diagnosed for asthma
    - Oversampling performed with replacement to achieve equal count of Diagnosis=0 & Diagnosis=1
    - This was not performed on test/validation data
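The scaling and oversampling steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the DataFrame contents, the 75/25 split, the random seeds, and the choice to fit the scaler on the training split only are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import resample

# Synthetic stand-in for the cleaned dataset (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(5, 80, size=n),
    "BMI": rng.uniform(15, 40, size=n),
    "Smoking": rng.integers(0, 2, size=n),
    "Diagnosis": np.r_[np.ones(20, dtype=int), np.zeros(n - 20, dtype=int)],
})

X = df.drop(columns="Diagnosis")
y = df["Diagnosis"]

# Split first so the test set keeps its original class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Min-max scaling to [0, 1]: fit on the training split, then apply to both splits.
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

# Oversample the minority class with replacement until both classes have equal counts.
train = X_train_scaled.assign(Diagnosis=y_train)
majority = train[train["Diagnosis"] == 0]
minority = train[train["Diagnosis"] == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
train_balanced = pd.concat([majority, minority_upsampled])

X_train_bal = train_balanced.drop(columns="Diagnosis")
y_train_bal = train_balanced["Diagnosis"]
print(y_train_bal.value_counts())
```

Only the training split is resampled here; `X_test_scaled` and `y_test` keep the original class proportions, matching the note above that test/validation data was left untouched.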

Below is a snapshot of the cleaned & preprocessed dataset.

![clean_df](images/clean_dataframe.png)

# EDA: Highlights

**General Observations**

- Continuous columns had relatively flat distributions for all data and when filtered for Diagnosis=0
- Continuous columns had multiple different distributions when filtered for Diagnosis=1
    - For each feature, there were peaks and valleys shown in the histogram but no peak was tall enough to make the value of the feature jump out as significant or shown to be deterministic for asthma
- Categorical columns had very similar distributions when comparing Diagnosis= 0 and 1
    - This suggests that even if one class in the category was in the majority, it did not serve as a good indicator by itself of whether the patient would be diagnosed with asthma

**Interactions Between Features** 

Target feature of the dataset: Diagnosis

Correlation Heatmap:

![corr_heatmap](images/corr_heatmap.png)


- Correlation between two features
    - Highest
        - 0.064841 between BMI & DustExposure
    - Lowest
        - -0.059298 between Wheezing & Hayfever

- Correlation between the target (diagnosis) & a feature
    - Highest
        - 0.053956 for ExerciseInduced
    - Lowest
        - -0.039278 for Chesttightness

These correlation coefficients indicate that there is barely any linear relationship between any two features, or between any single feature and Diagnosis. For reference, a correlation of 1 or -1 would be a perfect linear relationship.
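As a rough sketch of how such extremes can be pulled out of a correlation matrix, the snippet below uses pandas on a small synthetic DataFrame. The column values are random placeholders, so the numbers it prints will not match the figures reported above; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder data; the real project used the cleaned asthma dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "BMI": rng.uniform(15, 40, size=300),
    "DustExposure": rng.uniform(0, 10, size=300),
    "ExerciseInduced": rng.integers(0, 2, size=300),
    "Diagnosis": rng.integers(0, 2, size=300),
})

corr = df.corr()  # Pearson correlation by default

# Correlation of each feature with the target, excluding the target itself.
target_corr = corr["Diagnosis"].drop("Diagnosis").sort_values()

# Feature-to-feature correlations: keep the upper triangle so each pair appears once.
features = corr.drop(index="Diagnosis", columns="Diagnosis")
upper = np.triu(np.ones(features.shape, dtype=bool), k=1)
pairs = features.where(upper).stack().dropna().sort_values()

print(target_corr)
print(pairs.tail(1))  # most positive feature pair
print(pairs.head(1))  # most negative feature pair
```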

<img src="Figures/count.png">

**Balanced vs Imbalanced Dataset**

- The dataset suffered a class imbalance between Diagnosis=0 (Non-Diagnosed for asthma) & Diagnosis=1 (Diagnosed for asthma).
- This implies that two-class ML modeling may suffer from the imbalance in the dataset.
- Observations where Diagnosis=1 account for 5.18% of all the data.

- This imbalance was addressed as noted in the Data Cleaning & Preprocessing section

# Model Results

**Methodology**
- Choose multiple classification models in the SciKit Learn library that could be reasonably applied to this two-class prediction problem
- Fit on data that has been balanced between classes and test/validate on data that did not receive artificial class balancing
- Return scores and model metrics with a **focus on recall (for Diagnosis = 1)**
- Choose the best 2 models and perform hyperparameter tuning
- Choose best model
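The loop below is a minimal sketch of this methodology, assuming synthetic data from `make_classification` and a small, arbitrary subset of models with default settings. The variable names (`X_bal`, `recalls`, and so on) are hypothetical and not taken from the project notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Imbalanced synthetic data standing in for the asthma dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Balance the training split only, by oversampling the positive class with replacement.
neg = np.flatnonzero(y_train == 0)
pos = np.flatnonzero(y_train == 1)
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)
idx = np.concatenate([neg, pos_up])
X_bal, y_bal = X_train[idx], y_train[idx]

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=500),
}

# Fit on balanced data, score on the untouched test split, and track recall for class 1.
recalls = {}
for name, model in models.items():
    model.fit(X_bal, y_bal)
    y_pred = model.predict(X_test)
    recalls[name] = recall_score(y_test, y_pred, pos_label=1)
    print(name)
    print(classification_report(y_test, y_pred, zero_division=0))

best_two = sorted(recalls, key=recalls.get, reverse=True)[:2]
print("Candidates for tuning:", best_two)
```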

# Baseline: Random Guessing

- The baseline against which the classification models were compared was created using SciKit Learn's DummyClassifier() to **simulate random guessing**.
    - Using a Grid Search, the dummy classifier hyperparameter 'strategy' was set to the value 'uniform'
- Continuing the focus on maximizing the recall score, **randomly guessing gave a recall of 37.9%**
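A baseline of this kind can be sketched as below. The feature matrix is random noise and the seed is arbitrary, so the recall printed here is not expected to reproduce the 37.9% above; the class counts (565 and 33) are borrowed from the test-set support shown in the classification reports.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Placeholder features; the dummy classifier ignores them entirely.
rng = np.random.default_rng(0)
X = rng.normal(size=(598, 5))
y = np.r_[np.zeros(565, dtype=int), np.ones(33, dtype=int)]

# strategy="uniform" predicts each class with equal probability, i.e. a coin flip.
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline.fit(X, y)  # fitting only records which class labels exist
y_pred = baseline.predict(X)

baseline_recall = recall_score(y, y_pred, pos_label=1)
print(f"Baseline recall (Diagnosis=1): {baseline_recall:.3f}")
```

A uniform guess flags each patient as positive with probability 0.5, so its recall is 0.5 in expectation. With only a few dozen positive cases, a single run can land noticeably above or below that value.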

# Preliminary Modeling

- 6 classification models were trained and scored based on their predictions
- Some model hyperparameter values were changed from their defaults to suit this problem.
- The 2 best models then underwent hyperparameter tuning

## Models and their classification reports

<style>
  table {margin-left: 0 !important;}
</style>
<!-- To left-align the classification report tables -->

### Model 1: Decision Tree

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.94 |     0.95 |       0.95 |     565   |
| 1            |        0.03 |     0.03 |       0.03 |      33   |
| accuracy     |        0.9  |     0.9  |       0.9  |       0.9 |
| macro avg    |        0.49 |     0.49 |       0.49 |     598   |
| weighted avg |        0.89 |     0.9  |       0.89 |     598   |


Decision Tree:

    - Accuracy: 90%
    - Recall: 3%
    - High scores for diagnosis=0, low scores for diagnosis=1
    - Worse recall than dummy classifier

### Model 2: Random Forest

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.94 |     0.95 |       0.95 |     565   |
| 1            |        0.03 |     0.03 |       0.03 |      33   |
| accuracy     |        0.9  |     0.9  |       0.9  |       0.9 |
| macro avg    |        0.49 |     0.49 |       0.49 |     598   |
| weighted avg |        0.89 |     0.9  |       0.89 |     598   |

Random Forest:

    - Accuracy: 94%
    - Recall: 0%
    - Worse recall than dummy classifier

Performed well for diagnosis=0, extremely poorly for predicting diagnosis=1.

### Model 3: KNN

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.94 |     0.84 |       0.89 |     565   |
| 1            |        0.04 |     0.12 |       0.06 |      33   |
| accuracy     |        0.8  |     0.8  |       0.8  |       0.8 |
| macro avg    |        0.49 |     0.48 |       0.48 |     598   |
| weighted avg |        0.89 |     0.8  |       0.85 |     598   |

KNN:

    - Accuracy: 80%
    - Recall (diag=1): 12%
    - Worse recall than dummy classifier

KNN did well predicting diagnosis=0. For diagnosis=1 it did slightly better than the previous models, but still poorly.

### Model 4: Logistic Regression

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.95 |     0.59 |       0.73 |    565    |
| 1            |        0.06 |     0.45 |       0.11 |     33    |
| accuracy     |        0.58 |     0.58 |       0.58 |      0.58 |
| macro avg    |        0.5  |     0.52 |       0.42 |    598    |
| weighted avg |        0.9  |     0.58 |       0.69 |    598    |

Logistic Regression:

    - Accuracy: 58%
    - Recall: 45%
    - Improved recall over dummy classifier

Low accuracy but much higher recall vs other models.

I theorize this is because the way the data points are grouped makes it difficult to place a decision boundary, so the model gets only about half of its predictions correct.


### Model 5: Gradient Boosting
    Using decision trees

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.94 |     0.92 |       0.93 |    565    |
| 1            |        0.04 |     0.06 |       0.05 |     33    |
| accuracy     |        0.88 |     0.88 |       0.88 |      0.88 |
| macro avg    |        0.49 |     0.49 |       0.49 |    598    |
| weighted avg |        0.89 |     0.88 |       0.89 |    598    |

Gradient Boosting:

    - Accuracy: 88%
    - Recall: 6%
    - Worse recall than dummy classifier

Accuracy is lower than the Decision Tree and Random Forest models, but higher than KNN (at 80%) and Logistic Regression (at 58%). Recall (sensitivity) is only better than the tree & forest models.


### Model 6: AdaBoost classifier
    Using Logistic Regression since that has given the best recall so far

Classification Report:

|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.95 |     0.59 |       0.73 |    565    |
| 1            |        0.06 |     0.45 |       0.11 |     33    |
| accuracy     |        0.58 |     0.58 |       0.58 |      0.58 |
| macro avg    |        0.5  |     0.52 |       0.42 |    598    |
| weighted avg |        0.9  |     0.58 |       0.69 |    598    |

AdaBoost Classifier:

    - Accuracy: 58%
    - Recall: 45%

Almost exactly the same results as Logistic Regression by itself. Changing weights does not seem to have an effect when using default values.

### Confusion Matrices

![tree_confusion_matrix](images/tree_confusion_matrix.png)
![forest_confusion_matrix](images/forest_confusion_matrix.png)
![knn_confusion_matrix](images/knn_confusion_matrix.png)
![logreg_confusion_matrix](images/logreg_confusion_matrix.png)
![gboost_confusion_matrix](images/gboost_confusion_matrix.png)
![ada_confusion_matrix](images/ada_confusion_matrix.png)

Note that KNN, Logistic Regression, & AdaBoost(LogisticRegression) have the greatest number of correct predictions for Diagnosis=1

### The 3 models with the highest recall for Diagnosis=1 were KNN, Logistic Regression, & AdaBoost(LogisticRegression)

- KNN recall: 12%
- Logistic Regression recall: 45%
- AdaBoost classifier recall: 45%

# Hyperparameter Tuning
    Of the models: KNN, Logistic Regression, & AdaBoost(LogisticRegression)

A hyperparameter optimization was performed by applying the GridSearch approach to find the best modeling parameters and further improve the prediction recall (for Diagnosis = 1) of the models. The results show only marginal changes after tuning: recall rose by about one percentage point for AdaBoost and fell slightly for Logistic Regression.


Logistic Regression & AdaBoost(LogisticRegression) had the highest recall for Diagnosis = 1 (about 45% for each) and were chosen for hyperparameter tuning.

KNN only had a recall of about 12%. This model underwent hyperparameter tuning as a means of becoming more familiar with the tools utilized in this project.

The results of tuning KNN are not shown in this report, as they do not improve upon the original model or result in metrics better than Logistic Regression or AdaBoost. This is noted here as a pointer for anyone reviewing the project Jupyter Notebook.
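A recall-focused grid search of the kind described above might look like the following sketch. The parameter grid, the 5-fold cross-validation, and the synthetic data are assumptions made for illustration; the grid actually searched in the project notebook may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Imbalanced synthetic data standing in for the training set.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Hypothetical grid; the values are examples, not the project's search space.
param_grid = {
    "class_weight": [None, "balanced"],
    "solver": ["liblinear", "lbfgs"],
    "max_iter": [500],
}

# scoring="recall" ranks candidates by recall for the positive class (label 1).
search = GridSearchCV(
    LogisticRegression(random_state=9),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.3f}")
```

Note that `best_score_` is a cross-validated score on the training data, so it can differ from the recall measured later on a held-out test set.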

## Tuning Logistic Regression

Best parameters:
 - 'class_weight': 'balanced'
 - 'max_iter': 500
 - 'random_state': 9
 - 'solver': 'liblinear'

Best recall score: 43%

## Tuning AdaBoostClassifier

Best parameters:
- 'estimator': LogisticRegression(
    - class_weight='balanced'
    - max_iter=500
    - solver='liblinear')
- 'learning_rate': 0.1

Best recall score: 46%

## AdaBoost saw a 1 percentage point improvement in recall while the tuned Logistic Regression model saw a decrease in recall versus the default model.

AdaBoost Recall Scores:
- Original model: 45%
- Tuned model: 46%

Logistic Regression Recall Scores:
- Original model: 45%
- Tuned model: 43%

# Further Improvement: Text Classification With Keras Deep Learning Model

The methodology followed for applying the deep learning model is similar to the steps taken in this [Google workshop for Keras BoW](https://github.com/tensorflow/workshops/blob/master/extras/keras-bag-of-words/keras-bow-model.ipynb). The tf-idf data was split into training and test sets. Next, a tokenizer from the Keras library was applied to count the unique words in the vocabulary for this dataset and assign each of those words an index. The fit_on_texts() function was called to create a word index lookup of the vocabulary in the dataset. The vocabulary was limited to the top words by passing a num_words param to the tokenizer. Then the texts_to_matrix method was used to convert the training data and test data into a format that can be passed to the Keras deep learning model. A deep learning model was then built by specifying to Keras the shape of the input data, the output data, and the type of each layer. Keras then fits the model to the input data and evaluates prediction accuracy using the test data.


- Comparison of performance between the deep learning model and the Naive Bayes model on the original data shows that the deep learning model performs significantly better than the Naive Bayes classifier when the problem is extended to more than 6 classes. Both models show a similar decreasing trend in accuracy, but the overall roc_auc score is better for the deep learning model.

<img src="Figures/NB_DL.png">

- The results stated above highlight that the deep learning model is less affected than the linear classifiers by the class imbalance in the given data (since roc_auc is decent for more than 6 classes). It is possible that the threshold number of observations that the DL model requires to adequately train for each class is much less than what is required by a linear classifier. This was further verified by studying the effect of an increasing number of classes on the performance of the DL classifier using oversampled data. The results show that the roc_auc and accuracy are similar for both the original and the resampled dataset when a DL model is applied. As a result, it can be concluded that within the window of this investigation (i.e. number of classes between 2 and 15), unlike linear classifiers, deep neural network modeling is not impacted by imbalance in the data.

<img src="Figures/DL_DLOS.png">

# Conclusion

- The overall accuracy at classifying the characters using different ML models was fairly low.

- TF-IDF was found to be the best word embedding method to be used with linear classifiers. All the linear classifiers (SVM, Naive Bayes, Logistic Regression) exhibited similar performance, with Naive Bayes topping the list in terms of accuracy.

- Investigation of the classifiers' performance for an increasing number of classes in the multiclass classification task shows that roc_auc is a more reliable choice than accuracy for understanding different classifiers' performance.

- While the overall ROC_AUC scores obtained for different approaches are not great for both the linear classifiers and the deep learning model, obtaining overall roc_auc scores >0.5 under different circumstances suggests that machine learning can capture unique traits about movie or TV show characters from reading past transcripts. In terms of roc_auc score, both kinds of classifiers are robust to changes in the number of classes (varied from 2 classes up to 15 classes) in the multiclass classification task.

- While modeling with linear classifiers for this dataset, the performance of a classifier can be impacted by imbalance in the dataset, and can be overcome by applying an oversampling technique. 

- Investigation of the roc_auc score of specific classes shows that the classifier can distinguish lines from the leading actors fairly well. The roc_auc score is the highest for identifying RACHEL. All the supporting actors considered have a slightly lower roc_auc score compared to the leading actors, mirroring the imbalance in the distribution of the data.

- The rank of different classes in terms of roc_auc score does not follow their rank in terms of number of lines in the dataset (e.g. Phoebe has the fewest lines compared to the other characters, but her roc_auc score is better than that of Ross or Chandler when modeled using a one-vs-rest linear classifier on the imbalanced dataset). This suggests that ML models can capture the inherent features particular to these characters and that a model's ability does not entirely depend on the number of observations of the character.

- Comparison of performance between the deep learning model and Naive Bayes model on original data shows that deep learning model performs significantly better than the Naive Bayes Classifier when the problem is extended to classes >6.

- Deep learning model performs better than the linear classifiers. 

- Applying deep learning model to both original and oversampled datasets shows that the roc_auc scores or accuracies are similar for both original and resampled dataset. This implies that within the window of this investigation (i.e. number of classes between 2 to 15), unlike linear classifiers, deep neural network modeling is not impacted by imbalance in the data.



# Future Work

- One possible explanation for word2vec not working well is that the corpus from this dataset is small and does not provide enough information to capture word relationships in the embedding space. This can be explored by training word2vec on a very large corpus (e.g. pretrained models from 100 billion words in Google News) to get a very good word vector before embedding the dataset, and will be considered in a future extension of this project.

- Comparison of class-specific ROC_AUC scores before and after oversampling suggests that the random oversampling algorithm adjusts the imbalance in the majority classes, but not in the minority classes. Applying the deep learning model to both the original and oversampled datasets shows that the roc_auc scores and accuracies are similar for both. The similarity suggests that, unlike linear classifiers, deep neural network modeling is more robust to imbalance in the data. Therefore, more advanced deep learning models can be attempted for better prediction with the current dataset.