### Further feature and hyperparameter tuning
- The initial pipeline performs poorly on evaluation, indicating the need for further tuning
- The initial pipeline is described first, followed by its measured performance
- Each subsequent modification of the pipeline is listed, evaluated and labeled as meaningful or not based on the improvement it brings
#### The initial pipeline
- Features:
    - Patient-related: age, gender, time in the hospital before ICU admission, ethnicity
    - Events: time-stamped item values are aggregated into: average, standard deviation, trend, range (max-min), count
        - Missing data replaced with 0
    - Target (ICU stay): log transformed
- Feature selection:
    - first items with most appearances: 32
        - 168 features
    - filter by variance: <1%
        - 153 features left
    - filter by correlation: >0.9
        - 79 features left
    - filter by contribution to RMSE (permutation importance): <1%
        - 9 features left
- Chosen model: XGBoost
- Hyperparameters:
    - Grid search:
        - max depth: 3, 5, 7, 10
        - learning rate: .01, .1
        - nr of estimators: 100, 200
        - reg_alpha: .1
- Evaluation on test data:
    - MAE:  9.69  days
    - RMSE: 13.47 days
    - R^2:  -0.29
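
To make the cost of the grid search listed above concrete, the short sketch below enumerates the grid and counts the model fits. The key names (`max_depth`, `learning_rate`, `n_estimators`, `reg_alpha`) are assumed to map onto the bullets above, and five folds are assumed from the cross-validation described elsewhere in these notes.

```python
from itertools import product

# Hyperparameter grid as listed in the initial pipeline.
# Key names are an assumed mapping of the bullets above.
param_grid = {
    "max_depth": [3, 5, 7, 10],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 200],
    "reg_alpha": [0.1],
}

# Every combination the grid search has to evaluate.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_folds = 5  # assumption: five-fold cross-validation
n_fits = len(combinations) * n_folds

print(len(combinations), n_fits)  # 16 80
```

With only 16 combinations the search stays cheap, which leaves room for widening the grid if needed.
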
#### Pipeline modification
- Removing cross-validation and grid search, using default XGBoost parameters
    - Evaluation:
        - MAE:  10.08 days
        - RMSE: 13.73 days
        - R^2:  -0.34
    - Verdict: Cross-validation and grid search are helpful
- Removing only grid search
    - Evaluation:
        - MAE:  10.63 days
        - RMSE: 13.09 days
        - R^2:  -0.5
    - Verdict: Grid search is helpful
- Removing filtering by contribution to RMSE (permutation importance)
    - Evaluation:
        - MAE:  9.45
        - RMSE: 12.79
        - R^2:  -0.16
    - Verdict: Tuning the threshold of this filter may prove useful
- Eliminate filtering by correlation and permutation importance
    - Evaluation:
        - MAE:  8.6
        - RMSE: 11.74
        - R^2:  0.02
    - Verdict: Correlation proves to be a problem, needs tuning or removal
#### Pipeline modification after dropping patients with LOS > 30 days
- Given that 75% of patients stay in the ICU for less than 22 days, removing patients with LOS greater than 30 days seems like a good trade-off
- New results after limiting entries and discarding all feature filters
    - Evaluation:
        - MAE:  5.94
        - RMSE: 6.9
        - R^2:  -0.17
    - Verdict: Based on the R^2, feature filters may still bring promising gains
- Adding only filtering by variance
    - Evaluation:
        - MAE:  5.90
        - RMSE: 7.04
        - R^2:  -0.19
    - Verdict: Practically the same results
- Adding filtering by correlation
    - Evaluation:
        - MAE:  6.28
        - RMSE: 7.21
        - R^2:  -0.24
    - Verdict: Filtering by correlation gives worse results
- Adding all filters
    - Remaining with only 7 relevant features
    - Evaluation:
        - MAE:  5.93
        - RMSE: 6.95
        - R^2:  -0.16
    - Verdict: **Best version so far**
- Leaving only variance and permutation importance feature filters
    - Remaining with 14 relevant features
    - Evaluation:
        - MAE:  6.05
        - RMSE: 7.07
        - R^2:  -0.19
    - Verdict: Filtering by correlation helps XGBoost in permutation importance filtering
- Adding all filters and *imputing missing values with 0*
    - Remaining with only 7 relevant features
    - Evaluation:
        - MAE:  6.20
        - RMSE: 6.90
        - R^2:  -0.14
    - Verdict: Worse MAE with only marginal changes in RMSE and R^2; XGBoost works well with missing values left as null


## **1. Data preprocessing**

### Data transformation
- scaling, encoding
- The age of the patients is computed from the ICU admission time minus their birth date
    - Patients older than 89 appear in the database with an age of 300 in order to protect confidentiality
    - All patients in this group will have their age set to 91.4, the average of their group
- Categorical features of patients are encoded
    - Gender is label encoded, while ethnicity is encoded using one-hot encoding for better model interpretation
- The distribution of the target variable, LOS, is right-skewed
    - To make the target easier to model for a wider range of models, a *log transformation* will be applied to it
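
The transformations above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, the ethnicity categories are placeholders, and the natural logarithm is assumed for the target (which requires a strictly positive LOS).

```python
import math

AGE_CUTOFF = 89         # ages above this are obfuscated in the database
REPLACEMENT_AGE = 91.4  # group average quoted in the text

def correct_age(raw_age):
    """Replace obfuscated ages (stored as about 300) with the group average."""
    return REPLACEMENT_AGE if raw_age > AGE_CUTOFF else raw_age

def one_hot(value, categories, prefix):
    """One binary column per category (hypothetical helper)."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}

def los_to_log(los_days):
    """Forward transform of the target; assumes LOS > 0."""
    return math.log(los_days)

def log_to_los(log_los):
    """Inverse transform, so errors can be reported back in days."""
    return math.exp(log_los)

ages = [67.2, 300.0, 45.0]
print([correct_age(a) for a in ages])          # [67.2, 91.4, 45.0]
print(one_hot("B", ["A", "B"], "ethnicity"))   # {'ethnicity_A': 0, 'ethnicity_B': 1}
```

Predictions made on the log scale have to go through `log_to_los` before MAE and RMSE are reported in days.
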
### Data cleaning
- Since the time window of 24 hours was chosen for prediction, all chart events past that time will be discarded
- There are 4 extreme outliers with LOS above 60 days; they will be discarded, since so few cases are not enough to help model training
- Based on the fact that 75% of the admissions have a LOS under 21 days and 87.2% have a LOS under 30 days, entries with a LOS above 30 days will also be discarded
- Not all ICU stays have all the item values present
    - The two possible approaches to dealing with the missing data are imputing 0 in their place or leaving them as null
    - Since XGBoost, the model planned to be trained, works well with missing values, these will be left as null
    - To expose missingness as a feature, the number of recorded values is added for each item, with 0 meaning that the item is missing
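
As a rough sketch of the two cleaning filters, the snippet below keeps only chart events inside the 24-hour window and drops stays above the LOS cutoff. The event layout (a list of `(timestamp, value)` tuples) and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta

PREDICTION_WINDOW = timedelta(hours=24)
MAX_LOS_DAYS = 30

def events_in_window(events, admission_time, window=PREDICTION_WINDOW):
    """Discard chart events recorded later than `window` after ICU admission."""
    return [(ts, value) for ts, value in events if ts <= admission_time + window]

def keep_stay(los_days, max_los=MAX_LOS_DAYS):
    """Keep only ICU stays whose length of stay does not exceed the cutoff."""
    return los_days <= max_los

admit = datetime(2130, 1, 1, 8, 0)
events = [
    (admit + timedelta(hours=2), 98.0),
    (admit + timedelta(hours=23), 97.0),
    (admit + timedelta(hours=30), 95.0),  # outside the window, dropped
]
kept = events_in_window(events, admit)
print(len(kept))  # 2
```
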
### Feature engineering
- The time-stamps of each chart event will be expressed relative to the admission time
- Given the time-stamped nature of the item entries, the item values for each ICU stay will be aggregated into the following 5 features per item:
    - average
    - standard deviation
    - trend
    - range
    - count
- Since not all ICU stays have the same items in their respective events, only the items with more than 78% appearance will be selected as features
    - This criterion selects the first 32 items; with 5 metrics each, this gives 160 features
    - The sorting is performed by an SQL query that saves the results into `items_appearance_pneumonia.csv`
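
The five per-item aggregates can be sketched as below. The notes do not define "trend", so it is assumed here to be the least-squares slope of the value against hours since admission; the function names are hypothetical. Missing items keep null statistics and get a count of 0.

```python
from statistics import mean, stdev

def trend_slope(hours, values):
    """Least-squares slope of value against time (assumed definition of trend)."""
    t_mean, v_mean = mean(hours), mean(values)
    denom = sum((t - t_mean) ** 2 for t in hours)
    if denom == 0:
        return None  # slope is undefined with a single distinct time point
    return sum((t - t_mean) * (v - v_mean) for t, v in zip(hours, values)) / denom

def aggregate_item(hours, values):
    """Collapse one item's time series into the five per-item features."""
    n = len(values)
    if n == 0:
        # Missing item: statistics stay null, a count of 0 marks the absence.
        return {"mean": None, "std": None, "trend": None, "range": None, "count": 0}
    return {
        "mean": mean(values),
        "std": stdev(values) if n > 1 else None,
        "trend": trend_slope(hours, values),
        "range": max(values) - min(values),
        "count": n,
    }

# Illustrative readings at hours 0, 6, 12 and 18 after admission.
features = aggregate_item([0, 6, 12, 18], [80, 86, 92, 98])
print(features["trend"], features["range"], features["count"])  # 1.0 18 4
```
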
#### Feature selection
- With 160 features plus the encoded categorical ones, and only about 140 entries, careful feature selection is necessary
- The filtering of the features will be performed in three steps (proven useful by iterative experiments):
    1. Filtering out the features with a *variance lower* than 1%
    2. Building a *correlation matrix* on the remaining features and determining the pairs with a correlation higher than 0.9
        - Only one feature of each such pair is worth keeping, so the other one is dropped
    3. Training a dummy XGBoost model in order to extract the permutation importance of the features
        - Dropping the features with a contribution to the RMSE of less than 1%
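
The first two filtering steps can be illustrated without any library, as in the sketch below. It assumes that "variance lower than 1%" means a raw variance below 0.01, that columns contain no nulls, and that the first feature of a correlated pair is the one kept; the feature names are made up. Step 3 is not shown, since it needs a trained model.

```python
from statistics import mean, pvariance

def drop_low_variance(columns, threshold=0.01):
    """Step 1: keep features whose variance reaches the threshold."""
    return {name: vals for name, vals in columns.items() if pvariance(vals) >= threshold}

def pearson(x, y):
    """Pearson correlation; assumes both columns have non-zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Step 2: of each highly correlated pair, keep the feature seen first."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) <= threshold for other in kept.values()):
            kept[name] = vals
    return kept

columns = {
    "hr_mean": [80, 90, 100, 110],
    "hr_max": [85, 95, 105, 115],        # perfectly correlated with hr_mean
    "flag": [1.0, 1.0, 1.0, 1.0],        # zero variance
    "temp_mean": [37.0, 36.5, 38.2, 36.9],
}
selected = drop_correlated(drop_low_variance(columns))
print(list(selected))  # ['hr_mean', 'temp_mean']
```

Running the variance filter first also guarantees that the correlation step never divides by a zero standard deviation.
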
### Data splitting
- train, validation, test
- The test holdout will be 20% out of the total number of entries
    - Although this leaves the model with less training data, a larger test dataset is essential for comprehensive model evaluation
- The model will be trained using 5-fold cross-validation
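
A possible way to build the 20% holdout is sketched below. It assumes the split should also respect patient grouping (as required for training later in these notes), so whole patients are moved to the test set until it holds at least the requested share of stays. The function name and the stay-to-patient mapping are hypothetical.

```python
import random

def grouped_holdout_split(stay_to_patient, test_fraction=0.2, seed=42):
    """Split ICU stays into train/test so that no patient appears in both."""
    patients = sorted(set(stay_to_patient.values()))
    random.Random(seed).shuffle(patients)

    target = test_fraction * len(stay_to_patient)
    test_patients, n_test = set(), 0
    for patient in patients:
        if n_test >= target:
            break
        test_patients.add(patient)
        n_test += sum(1 for p in stay_to_patient.values() if p == patient)

    train = [s for s, p in stay_to_patient.items() if p not in test_patients]
    test = [s for s, p in stay_to_patient.items() if p in test_patients]
    return train, test

# Toy data: 10 patients with 2 ICU stays each.
stays = {f"stay{i}": f"p{i // 2}" for i in range(20)}
train_stays, test_stays = grouped_holdout_split(stays)
print(len(train_stays), len(test_stays))  # 16 4
```
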

## **2. Model selection**
### Choosing the appropriate algorithm
- XGBoost (Extreme Gradient Boosting) is a good fit for this pipeline, offering the following advantages:
    - Good handling of tabular data
    - Good performance on moderate-sized data
    - Robustness for missing values and mixed data types (encoded categorical and numerical values)
    - Explainable feature importance for interpretability
    - Very good handling of non-linear relationships
    - Includes regularization for preventing overfitting
- Alternatives considered:
    - Random forests
        - Are simpler but offer less accurate results
        - Filtering of feature importance is already done during the preprocessing part of the pipeline
    - Neural Networks
        - Need much larger amounts of instances and data; using them would require considering a different diagnosis
### Defining the evaluation strategy
- The key metrics that will be tracked to evaluate the performance of the model are:
    - RMSE (Root Mean Squared Error)
        - Penalizes large errors, especially useful in such critical medical cases
        - RMSE chosen instead of MSE for better interpretability by converting the result back to the original units (days)
    - MAE (Mean Absolute Error)
        - Easily interpretable as the average days mispredicted
    - R^2
        - Measures the proportion of the target's variance explained by the model
        - Values increasing from 0 to 1 indicate increasing performance, while values lower than 0 indicate performance worse than predicting the average
- These metrics can be computed during the five-fold cross-validation, but are more reliable when computed on the test holdout
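
For reference, the three metrics can be written out directly, as in the sketch below (standard definitions, toy numbers). The example also shows the two R^2 reference points mentioned above: predicting the average gives exactly 0, and a worse-than-average prediction gives a negative value.

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target (days)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (days)."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

los_days = [2.0, 4.0, 6.0, 8.0]
mean_baseline = [5.0] * 4          # always predict the average LOS
reversed_pred = [8.0, 6.0, 4.0, 2.0]

print(mae(los_days, mean_baseline))   # 2.0
print(r2(los_days, mean_baseline))    # 0.0
print(r2(los_days, reversed_pred))    # -3.0
```

If the model is trained on the log-transformed target, predictions should be converted back to days before these functions are applied.
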

## **3. Model training**
### Training the model on the training set
- As previously stated, the model will be trained using 5-fold cross-validation
    - This reduces the bias that a single train-test split introduces in the results
    - The cross-validation approach aims to reduce overfitting and the variance of the chosen performance metrics
    - The data utilization is maximized, every ICU stay being used four times for training and one time for testing
- Grouping ICU stays by patient ids is essential, since one patient can have multiple ICU stays
    - This grouping prevents data leakage by not allowing the ICU stays of one patient to be split across training, validation and test data
    - This approach better represents real-world performance
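
The grouping idea can be sketched as a small grouped k-fold splitter. This is an illustrative pure-Python version with a greedy assignment (largest patient first, into the currently smallest fold); in practice a library splitter such as scikit-learn's `GroupKFold` would normally be used instead. The patient ids are made up.

```python
from collections import defaultdict

def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, val_idx) pairs where each group lies in one fold only."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    # Greedy balancing: largest groups first, each into the smallest fold so far.
    folds = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        min(folds, key=len).extend(members[group])

    for val_idx in folds:
        held_out = set(val_idx)
        train_idx = [i for i in range(len(groups)) if i not in held_out]
        yield train_idx, sorted(val_idx)

# One entry per ICU stay; the value is the patient the stay belongs to.
patient_ids = ["a", "a", "b", "c", "c", "c", "d", "e", "f", "g"]
for train_idx, val_idx in group_k_fold(patient_ids):
    print(val_idx, sorted({patient_ids[i] for i in val_idx}))
```

Every stay lands in exactly one validation fold, and no patient is ever present on both sides of a split.
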
### Hyperparameter tuning
- Grid search is chosen for automating the hyperparameter tuning of the model
    - It works by systematically testing the combinations of predefined hyperparameters in order to find the best performing model
    - Seamlessly integrates the five-fold cross-validation
    - Helps balance the bias-variance tradeoff by selecting the combination with the best cross-validated score
    - Provides explainable results, making the best combination of hyperparameters available for inspection
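
The mechanics of grid search with cross-validation can be shown on a deliberately tiny example. In the sketch below a toy "model" (the training mean shrunk toward 0) stands in for XGBoost, plain contiguous folds stand in for the grouped folds, and `shrinkage` is a made-up hyperparameter; only the search loop itself mirrors the description above.

```python
from itertools import product
from statistics import mean

def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) for contiguous, near-equal folds."""
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, val_idx
        start += size

def cv_rmse(y, shrinkage, n_splits=5):
    """Mean validation RMSE of the toy model across the folds."""
    fold_scores = []
    for train_idx, val_idx in k_fold_indices(len(y), n_splits):
        # Toy "model": the training mean, pulled toward 0 by `shrinkage`.
        prediction = sum(y[i] for i in train_idx) / (len(train_idx) + shrinkage)
        mse = mean((y[i] - prediction) ** 2 for i in val_idx)
        fold_scores.append(mse ** 0.5)
    return mean(fold_scores)

def grid_search(y, param_grid, n_splits=5):
    """Score every combination in `param_grid`; return the best and the full table."""
    names = list(param_grid)
    table = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        table.append((cv_rmse(y, n_splits=n_splits, **params), params))
    best_score, best_params = min(table, key=lambda row: row[0])
    return best_params, best_score, table

los_days = [5, 7, 6, 8, 9, 4, 6, 7, 5, 8]
best_params, best_score, table = grid_search(los_days, {"shrinkage": [0.0, 5.0, 50.0]})
print(best_params)  # {'shrinkage': 0.0}
```

The returned `table` is what makes the search inspectable: every combination is listed next to its cross-validated score.
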

## **4. Model evaluation and interpretation**
### Evaluating using metrics on test data prediction
- MAE, RMSE, R^2
### Perform error analysis
- ??
### Explaining model behaviour
- SHAP, LIME
- feature importance
- SHAP (SHapley Additive exPlanations) is a very useful tool for explaining how machine learning models make predictions
    - It assigns each feature an importance value for a specific prediction, showing how much that feature contributed to the model's output
    - A baseline (expected value) is computed to represent the average prediction of the model over the dataset; SHAP uses it to explain how features push the prediction above or below this baseline
    - Positive values indicate a feature increasing the LOS prediction, while negative values indicate a feature decreasing it
    - SHAP can also be used for global interpretability, generating a bar plot of feature importance across all patients
- LIME (Local Interpretable Model-agnostic Explanations) is used to explain only individual predictions
    - It approximates the model behaviour locally with a simpler, interpretable model (like linear regression or decision rules)
    - It explains individual predictions by creating small variations of the input, querying the model, and fitting the simple interpretable model to approximate the model's behavior locally.
    - It highlights key features for a specific case using the surrogate model’s coefficients.
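
The baseline-plus-contributions idea behind SHAP can be illustrated with exact Shapley values on a toy model. This sketch enumerates all feature coalitions, which is only feasible for a handful of features; it is not how the `shap` library computes values for tree models. The linear `predict` function and the background rows are made up for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values for one instance `x`.

    Features outside a coalition take their values from each background row,
    and the resulting predictions are averaged.
    """
    n = len(x)

    def coalition_value(subset):
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (coalition_value(s | {i}) - coalition_value(s))
        phi.append(contribution)

    baseline = coalition_value(set())  # average prediction over the background
    return baseline, phi

def predict(features):
    """Toy stand-in for the trained model."""
    return 2.0 * features[0] + 3.0 * features[1]

background = [[1.0, 0.0], [3.0, 2.0]]
patient = [5.0, 4.0]
baseline, phi = shapley_values(predict, patient, background)
print(baseline, phi, predict(patient))  # 7.0 [6.0, 9.0] 22.0
```

The contributions add up to the gap between the prediction and the baseline (7 + 6 + 9 = 22), which is the property that lets each value be read as a push above or below the average prediction.
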
