# Asthma Prediction Model
August J.F. Perez

Capstone 2 Project

## Purpose, motivation, and description

### Purpose
What problem is being solved

Build a pre-screening tool to identify patients with a high probability of an asthma diagnosis

Focus on maximizing sensitivity (recall)

### Motivation
Why is the problem important

- The healthcare system is barely able to keep up & often lags behind patient needs.
- Large practices and hospitals cannot prioritize screening as much as would be helpful.
- A screening test for asthma would help physicians allocate their time and effort more effectively, and unnecessary use of resources could be prevented or reduced.
- Reducing the time and effort spent identifying patients who might benefit from an asthma diagnosis, rather than waiting for patients to approach their doctor themselves, frees up time for other tasks that require a physician's attention.

____
### Target Feature

- Asthma Diagnosis
    - Binary

____
### Model

AdaBoost using Logistic Regression

    'learning_rate': 0.1
    
    LogisticRegression(class_weight='balanced', max_iter=500, solver='liblinear')

Recall: 46%
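
The configuration above can be assembled in scikit-learn roughly as follows. This is a minimal sketch, not the project's actual training code: the synthetic data from `make_classification`, the split, and every `random_state` value are assumptions for illustration. Only `learning_rate` and the `LogisticRegression` arguments come from the settings listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asthma data (assumption: about 5% positive class).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    n_clusters_per_class=1, weights=[0.95], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

base = LogisticRegression(class_weight="balanced", max_iter=500, solver="liblinear")
# The base estimator is passed positionally because its keyword name
# changed between scikit-learn releases (base_estimator -> estimator).
model = AdaBoostClassifier(base, learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)

recall = recall_score(y_test, model.predict(X_test))
print(f"recall on the toy test split: {recall:.2f}")
```

The recall printed here describes the toy data only and says nothing about the 46% reported for the real dataset.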

## Data Acquisition

Source: Kaggle 🌬️ Asthma Disease Dataset 🌬️

https://www.kaggle.com/datasets/rabieelkharoua/asthma-disease-dataset?resource=download

Downloaded as a CSV and read into a pandas dataframe

## Data Management

- Format: CSV
- Amount of data (rows): 2392
- Processing: pandas pd.read_csv()

## Cleaning

- Initially clean dataset
    - I performed checks and found no problems
- Each observation is a row, each variable is a column
- No nulls
- Feature labels make sense
- Check dtypes
- No Outliers
- No skewed distributions
    - Except asthma diagnosis target feature
        - SMOTE used in training split
        - Test data not artificially balanced
- Checked expected count of unique values for each feature
- Value ranges make sense for non-categorical features


- Dropped 'patient id' & 'doctor in charge' features
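
The checks listed above could be scripted along these lines. This is a hedged sketch: the helper name `basic_quality_report`, the tiny placeholder frame, and the column names `patientid` and `doctorincharge` are assumptions (the real CSV may spell them differently), and the values are made up for illustration.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> dict:
    """Summarise nulls, duplicates, dtypes and unique counts for one dataframe."""
    return {
        "n_rows": len(df),
        "n_nulls": int(df.isna().sum().sum()),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "n_unique": df.nunique().to_dict(),
    }

# Tiny hypothetical frame standing in for the Kaggle CSV.
raw = pd.DataFrame({
    "patientid": [1, 2, 3],
    "age": [63, 26, 57],
    "smoking": [0, 1, 0],
    "diagnosis": [0, 0, 1],
    "doctorincharge": ["Dr_X", "Dr_X", "Dr_X"],
})

report = basic_quality_report(raw)
# errors="ignore" keeps the call safe if a column name differs in the real file.
clean = raw.drop(columns=["patientid", "doctorincharge"], errors="ignore")
print(report["n_nulls"], list(clean.columns))
```

Comparing `n_unique` against the expected number of categories per feature is one way to automate the "expected count of unique values" check.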

____
### Engineered Features
& their input features

- lifequality
    - (Method: Mean)
    - physical activity
    - diet quality
    - sleep quality
- exposure_count
    - (Method: Mean)
    - pollution exposure
    - pollen exposure
    - dust exposure
    - smoking
- lungfunction
    - (Method: Division)
    - lung function fev1
    - lung function fvc
- allergy_count
    - (Method: Sum)
    - pet allergy
    - history of allergies
    - eczema
    - hay fever
- symptom_count
    - (Method: Sum)
    - gastroesophageal reflux
    - wheezing
    - shortness of breath
    - chest tightness
    - coughing
    - nighttime symptoms
    - exercise induced
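
The five engineered features above can be sketched in pandas as follows. The column names are taken from the feature lists in the EDA section, while the helper name `add_engineered_features`, the one-row toy frame, and the direction of the division (FEV1 divided by FVC, the usual clinical ratio) are assumptions rather than confirmed details of the project.

```python
import pandas as pd

LIFE = ["physicalactivity", "dietquality", "sleepquality"]
EXPOSURE = ["pollutionexposure", "pollenexposure", "dustexposure", "smoking"]
ALLERGY = ["petallergy", "historyofallergies", "eczema", "hayfever"]
SYMPTOM = ["gastroesophagealreflux", "wheezing", "shortnessofbreath",
           "chesttightness", "coughing", "nighttimesymptoms", "exerciseinduced"]

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the five engineered columns appended."""
    out = df.copy()
    out["lifequality"] = out[LIFE].mean(axis=1)
    out["exposure_count"] = out[EXPOSURE].mean(axis=1)
    # Assumption: numerator is FEV1 and denominator is FVC.
    out["lungfunction"] = out["lungfunctionfev1"] / out["lungfunctionfvc"]
    out["allergy_count"] = out[ALLERGY].sum(axis=1)
    out["symptom_count"] = out[SYMPTOM].sum(axis=1)
    return out

# One hypothetical patient row, values chosen only to make the arithmetic easy.
toy = pd.DataFrame([{
    "physicalactivity": 2.0, "dietquality": 4.0, "sleepquality": 6.0,
    "pollutionexposure": 3.0, "pollenexposure": 6.0, "dustexposure": 2.0, "smoking": 1,
    "lungfunctionfev1": 2.0, "lungfunctionfvc": 4.0,
    "petallergy": 1, "historyofallergies": 0, "eczema": 1, "hayfever": 0,
    "gastroesophagealreflux": 0, "wheezing": 1, "shortnessofbreath": 1,
    "chesttightness": 0, "coughing": 1, "nighttimesymptoms": 0, "exerciseinduced": 1,
}])
features = add_engineered_features(toy)
print(features[["lifequality", "exposure_count", "lungfunction",
                "allergy_count", "symptom_count"]])
```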

## EDA

Began by looking at all features to see whether any stood out as interesting to investigate further

    No single feature stood out from the rest

numerical features: (count: 10)

'age', 'bmi', 'physicalactivity', 'dietquality', 'sleepquality', 'pollutionexposure', 'pollenexposure', 'dustexposure', 'lungfunctionfev1', 'lungfunctionfvc'

Categorical features (count: 17) (includes target feature 'diagnosis')
- All binary 0 or 1

'gender', 'ethnicity', 'educationlevel', 'smoking', 'petallergy', 'familyhistoryasthma', 'historyofallergies', 'eczema', 'hayfever', 'gastroesophagealreflux', 'wheezing', 'shortnessofbreath', 'chesttightness', 'coughing', 'nighttimesymptoms', 'exerciseinduced', 'diagnosis'

### Distribution analysis

Inspected histograms of
- each feature
- each feature split by target feature (0 or 1 for diagnosis)

**Categorical cols comparative distribution takeaways:**

- Many fewer samples for diagnosis=1 than diagnosis=0
- Presence of symptoms has potential for use in asthma prediction, diagnosis=1 mostly shows symptoms being present more often than not
    - Kind of expected since diagnosis is usually based on symptoms
- Other cols (non-symptoms) are not showing me a clear difference between diagnosis 0 or 1

**Numerical cols Distributions**

Overall flat, even when split by diagnosis

### Inter-feature relationships

heatmap & pairplot used

- heatmap findings:
    - No inter-feature correlation above 0.1 (0.065 was highest value discovered)
- pairplot findings
    - no inter-feature correlations/relationships shown
        - even to target feature
    - Color coded by diagnosis
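
The "highest value" check behind the heatmap finding can also be done numerically. The sketch below is one possible way to do it, assuming Pearson correlation; the helper name `strongest_correlation` and the toy frame are hypothetical and not taken from the project.

```python
import numpy as np
import pandas as pd

def strongest_correlation(df: pd.DataFrame):
    """Return (feature_a, feature_b, r) for the largest |Pearson r| off the diagonal."""
    corr = df.select_dtypes("number").corr()
    abs_corr = corr.abs().to_numpy(copy=True)
    # Ignore each feature's correlation with itself.
    np.fill_diagonal(abs_corr, np.nan)
    i, j = np.unravel_index(np.nanargmax(abs_corr), abs_corr.shape)
    return corr.index[i], corr.columns[j], float(corr.iloc[i, j])

# Toy data: x and y are nearly collinear, z is not.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10.5],
    "z": [5, 1, 4, 2, 3],
})
a, b, r = strongest_correlation(toy)
print(a, b, round(r, 3))
```

The `corr` matrix computed inside the helper is the same object a heatmap would visualise, for example with `seaborn.heatmap`.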

#### Extra Info:
**Relationships between each feature & target**

- Continuous cols:
    - No feature shows a true relationship between a lower, middle, or upper range and a diagnosis=1
    - lungfunctionfev1 & lungfunctionfvc both suggest that a higher value may correlate with diagnosis=1; this appears to be a weak relationship at the current stage of investigation
    - dustexposure seems to correlate with diagnosis=1 at lower values
        - I would have originally expected higher values to correlate
- Categorical cols:
    - All cat cols except those listed below have distributions that are a very close match between diagnosis=0 & =1
    - Cat cols with non-matching distributions between diagnosis=0 & =1, with notes
        - shortnessofbreath: fewer shortnessofbreath=1 than =0 where diag=1 (where diag=0, ratio between shortnessofbreath=0 or 1 was just about equal)
        - chesttightness: fewer chesttightness=1 than =0 where diag=1 (where diag=0, ratio between chesttightness=0 or 1 was just about equal)
        - coughing: fewer coughing=1 than =0 where diag=1 (where diag=0, ratio between coughing=0 or 1 was just about equal)

### Feature Selection

All features were included in the model, as none stood out as better or worse candidates

## Modeling

**Models Used:**
- DummyClassifier (for comparison to other models) (38% recall)
- Decision Tree (3% recall)
- Random Forest (0% recall)
- KNN (12% recall)
- Logistic Regression (45% recall)
- Gradient Boosting (Decision Tree) (6% recall)
- AdaBoost classifier (Logistic Regression) (45% recall before optimizing) (46% recall after optimizing)
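
A comparison like the one above can be run as a simple loop over candidate estimators, scoring each by recall on a held-out split. This sketch uses synthetic data and mostly default hyperparameters, all of which are assumptions: the `stratified` strategy for `DummyClassifier` in particular is a guess, and the numbers it prints are unrelated to the recall values listed above. AdaBoost is left out to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the asthma data (about 10% positives).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

candidates = {
    "dummy": DummyClassifier(strategy="stratified", random_state=1),
    "decision_tree": DecisionTreeClassifier(random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "knn": KNeighborsClassifier(),
    "logistic_regression": LogisticRegression(
        class_weight="balanced", max_iter=500, solver="liblinear"
    ),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

recalls = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    # zero_division=0 avoids a warning if a model predicts no positives at all.
    recalls[name] = recall_score(y_test, clf.predict(X_test), zero_division=0)

for name, value in sorted(recalls.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} recall={value:.2f}")
```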

### Evaluation

**Primary target score: Recall**

Other scores noted but weighed little in final evaluations

    accuracy, precision, f1-score
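
As a worked illustration of how these scores differ, the hand-made labels below give 1 true positive, 2 false negatives, 1 false positive, and 4 true negatives. The labels are invented for the example and are not project results.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 3 real positives, of which the model finds only 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0]

scores = {
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN) = 1 / 3
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP) = 1 / 2
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / total = 5 / 8
    "f1": f1_score(y_true, y_pred),                # 2TP / (2TP + FP + FN) = 2 / 5
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Accuracy looks respectable here even though two of three positives are missed, which is why recall is the more informative score for a screening tool.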

____
Training and test splits were performed

The training set was balanced using SMOTE for diagnosis=1
- Imbalance was 1703 records for diagnosis=0 & 91 records for diagnosis=1
- Balanced to equal number of records each
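
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below hand-rolls that idea with NumPy and scikit-learn purely to show the mechanics and the split-then-balance order. It is a simplified stand-in, not the imbalanced-learn implementation (`imblearn.over_sampling.SMOTE`) that the project presumably used. The function name `smote_like_oversample`, the toy data, and `test_size=0.25` are assumptions (1703 + 91 = 1794 rows is 75% of 2392, which suggests a 75/25 split).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X, y, minority_label=1, k=5, seed=0):
    """Add interpolated minority rows until both classes have equal counts (binary y)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0:
        return X, y
    if len(X_min) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    # Ask for k + 1 neighbours because each row's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_needed)
    picked = neighbours[base, rng.integers(0, neighbours.shape[1], size=n_needed)]
    gap = rng.random((n_needed, 1))  # position along the segment, in [0, 1)
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority_label, dtype=y.dtype)])
    return X_out, y_out

# Toy imbalanced data: 20 positives out of 400 rows.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))
y = np.array([1] * 20 + [0] * 380)

# Split first, then balance only the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
X_bal, y_bal = smote_like_oversample(X_train, y_train)
print("train before:", np.bincount(y_train), "after:", np.bincount(y_bal))
print("test (left untouched):", np.bincount(y_test))
```

Balancing after the split keeps synthetic rows out of the test set, so the test recall reflects the real class imbalance.
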
____