<a href="https://www.kaggle.com/code/aisuko/time-series-missing-data-imputation?scriptVersionId=198610826" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

According to https://github.com/amazon-science/chronos-forecasting/issues/60, Chronos-forecasting does not currently support missing-value imputation directly. So how can we do it?

# GluonTS - Probabilistic Time Series Modeling in Python

Common time-series imputation methods include:
* Forward fill
* Mean imputation
* Interpolation

See the GluonTS tutorial on loading data with missing values: https://ts.gluon.ai/stable/tutorials/data_manipulation/pandasdataframes.html#Use-case-2---Loading-data-with-missing-values
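
Before any of these methods can be applied, rows that are absent altogether usually have to be turned into explicit `NaN` values. The following is a minimal pandas-only sketch of that step. The column name, timestamps, and hourly frequency are made up for illustration, and no GluonTS call is shown.

```python
import pandas as pd

# Hypothetical vitals where the 02:00 and 03:00 rows are absent altogether.
df = pd.DataFrame(
    {"heart_rate": [80.0, 82.0, 88.0]},
    index=pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 04:00"]
    ),
)

# Build the regular hourly grid the series is supposed to follow.
full_index = pd.date_range(df.index[0], df.index[-1], freq="h")

# Reindexing inserts the missing timestamps and fills them with NaN,
# so the gaps become visible to any imputation method applied later.
df_regular = df.reindex(full_index)
print(df_regular)
```

After reindexing, the gaps can be counted with `df_regular.isna().sum()` and filled with any of the techniques below.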

---

# Which imputation technique is good for us?

For handling missing data in medical time-series datasets, especially ICU patient data, the imputation method should be robust to ensure that important medical trends and patterns are preserved. Below are various imputation techniques:

## 1. Forward Fill or Last Observation Carried Forward (LOCF)

This method replaces missing values by carrying forward the last observed value.

**Pros**:
- Simple to implement.
- Preserves continuity for vital signs that tend to change gradually.
- Often works well for physiological data, where the last observed value can be indicative of the near future.

**Cons**:
- Not suitable if the missing gap is large.
- Can introduce bias if there is a sudden, significant change in vital signs after the missing data.

**Recommended if**: Missing intervals are small, and vital signs are relatively stable over short periods.
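
As a minimal sketch, forward fill can be done with pandas `Series.ffill`. The heart-rate values and the two-step limit are assumptions chosen for illustration. The `limit` argument keeps long gaps from being filled with a stale value.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly heart-rate readings with a 2-step and a 3-step gap.
idx = pd.date_range("2024-01-01", periods=8, freq="h")
hr = pd.Series(
    [80.0, np.nan, np.nan, 84.0, np.nan, np.nan, np.nan, 90.0], index=idx
)

# Carry the last observation forward, but fill at most 2 consecutive
# missing steps; anything beyond that stays NaN for another method.
hr_locf = hr.ffill(limit=2)
print(hr_locf)
```

With this limit, the first gap is filled completely, while the last step of the 3-step gap is left missing.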

---

## 2. Linear Interpolation

This method fills missing values by interpolating between known values using a straight line.

**Pros**:
- Simple and commonly used for continuous data.
- Suitable for short gaps in vital signs.

**Cons**:
- Not suitable for large gaps or cases where vital signs change rapidly.
- Linear assumptions may not capture non-linear trends.

**Recommended if**: You have small gaps, and the trend of vital signs is generally smooth.
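
A minimal sketch with pandas `Series.interpolate` follows. The blood-pressure values are invented for illustration. `method="time"` weights by the actual timestamps, and `limit_area="inside"` only fills gaps that have an observation on both sides.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly systolic blood pressure with two interior gaps.
idx = pd.date_range("2024-01-01", periods=6, freq="h")
sbp = pd.Series([120.0, np.nan, np.nan, 126.0, np.nan, 130.0], index=idx)

# Draw a straight line between the neighbouring observations.
# limit_area="inside" avoids extrapolating past the first/last reading.
sbp_linear = sbp.interpolate(method="time", limit_area="inside")
print(sbp_linear)
```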

---

## 3. Spline Interpolation

Spline interpolation is a more flexible interpolation method that fits a smooth curve through the data points.

**Pros**:
- Captures non-linear trends better than linear interpolation.
- Useful for handling data with fluctuating trends, such as heart rate or blood pressure.

**Cons**:
- More complex and can overshoot, leading to unrealistic values if the data are noisy.

**Recommended if**: The missing intervals are moderate, and your data shows non-linear trends.
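
One way to sketch this is with SciPy's `CubicSpline`, fitted on the observed points only. The readings and the clipping range of 30 to 220 bpm are assumptions for illustration. The clipping is a simple guard against the overshoot mentioned above, not a clinical rule.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical heart-rate readings at integer time steps, NaN = missing.
t = np.arange(8, dtype=float)
hr = np.array([72.0, 75.0, np.nan, np.nan, 88.0, 85.0, np.nan, 79.0])

observed = ~np.isnan(hr)

# Fit a cubic spline through the observed points only.
spline = CubicSpline(t[observed], hr[observed])

# Evaluate the spline at the missing time steps.
hr_spline = hr.copy()
hr_spline[~observed] = spline(t[~observed])

# Guard against overshoot: clip to an assumed plausible range.
hr_spline = np.clip(hr_spline, 30.0, 220.0)
print(hr_spline)
```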

---

## 4. K-Nearest Neighbors (KNN) Imputation

This method uses similar patients' data (based on other features) to impute missing values.

**Pros**:
- Can capture correlations between different features (e.g., heart rate and blood pressure).
- Suitable for multivariate time-series data like yours with multiple vital signs.

**Cons**:
- Computationally intensive, especially for large datasets.
- Requires careful tuning to choose the number of neighbors and avoid overfitting.

**Recommended if**: You want to exploit relationships between multiple features and handle non-linear patterns in the data.
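
A minimal sketch using scikit-learn's `KNNImputer` follows. The table of vital-sign snapshots, the column names, and `n_neighbors=2` are assumptions for illustration. In practice the features would usually be scaled first, because the neighbor search is distance-based.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical snapshots of three vital signs, one row per observation.
vitals = pd.DataFrame(
    {
        "heart_rate": [80.0, 82.0, np.nan, 110.0, 112.0],
        "sys_bp": [120.0, 121.0, 119.0, 95.0, np.nan],
        "resp_rate": [16.0, 17.0, 16.0, 24.0, 25.0],
    }
)

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors rows that are closest on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(vitals), columns=vitals.columns)
print(imputed)
```

Observed values are left unchanged; only the `NaN` cells are filled.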

---

## 5. Probabilistic Imputation (e.g., Gaussian Processes, Kalman Filters)

This method uses probabilistic models to estimate missing values, considering both temporal dependencies and feature correlations.

**Pros**:
- Handles uncertainty in missing values and produces a probabilistic estimate.
- Captures temporal dependencies and can model trends in time-series data.

**Cons**:
- More complex to implement and computationally expensive.
- Requires a good understanding of the underlying model assumptions.

**Recommended if**: Your dataset has significant missingness, and you want a more statistically rigorous approach to imputation.
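
To illustrate the idea of a probabilistic estimate, here is a minimal sketch of a Kalman filter for a random-walk (local level) model, written in plain NumPy. The function name `kalman_local_level` and the variance settings are hypothetical. It is a filter only, so it uses past observations and not future ones; a smoother or a Gaussian process would be a fuller treatment.

```python
import numpy as np

def kalman_local_level(y, process_var=1.0, obs_var=4.0):
    """Filtered mean and variance for a random-walk state with noisy readings.

    Missing readings (NaN) skip the update step, so the estimate stays flat
    and its variance grows for as long as the gap lasts.
    """
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    if not observed.any():
        raise ValueError("need at least one observed value")

    mean = np.empty(len(y))
    var = np.empty(len(y))
    m = y[observed][0]  # crude initial state: the first observed value
    p = obs_var         # initial uncertainty (an assumption)

    for t in range(len(y)):
        p = p + process_var              # predict: uncertainty grows
        if observed[t]:
            gain = p / (p + obs_var)     # update: blend in the reading
            m = m + gain * (y[t] - m)
            p = (1.0 - gain) * p
        mean[t], var[t] = m, p
    return mean, var

hr = np.array([80.0, 81.0, np.nan, np.nan, np.nan, 86.0])
mean, var = kalman_local_level(hr)

# Keep observed values, fill gaps with the filtered mean,
# and keep the standard deviation as an uncertainty estimate.
hr_imputed = np.where(np.isnan(hr), mean, hr)
hr_std = np.sqrt(var)
print(hr_imputed, hr_std)
```

The growing `hr_std` inside the gap is the main benefit over the deterministic methods: it signals how much the imputed value should be trusted.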

---

## 6. Deep Learning-based Imputation (e.g., Autoencoders, RNNs, GluonTS)

Deep learning models such as Recurrent Neural Networks (RNNs) or models within GluonTS can impute missing values by learning the temporal dependencies and correlations between features. **Note that some papers use LSTMs (a type of RNN) for time-series prediction rather than for missing-data imputation.**

**Pros**:
- Can handle complex, non-linear relationships between features and time steps.
- Well-suited for large, high-dimensional datasets with many missing values.

**Cons**:
- Requires more computational resources and time to train.
- Needs a large amount of data to perform well.

**Recommended if**: You have a large dataset, and you're looking for a powerful method that can capture complex temporal patterns.

---

## Recommended Approach

Given that the mimic_iii dataset contains ICU patients' vital signs, which are critical and can change rapidly, I suggest starting with a combination of:
1. **Forward Fill** (for shorter gaps where continuity is important).
2. **Spline Interpolation** (for capturing non-linear trends).
3. **KNN Imputation** or **Deep Learning-based Imputation (GluonTS)** (for more complex cases where the correlation between vital signs might be important).

This hybrid approach allows you to handle different missing data scenarios flexibly, depending on the nature of the missingness (e.g., short gaps vs. long gaps).
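
A rough sketch of such a gap-dependent strategy is shown here. The helper names `gap_lengths` and `hybrid_impute` and the threshold of two steps are hypothetical. Linear interpolation stands in for the longer-gap method to keep the example dependency-free; spline or KNN imputation could be swapped in at that point.

```python
import numpy as np
import pandas as pd

def gap_lengths(s: pd.Series) -> pd.Series:
    """Length of the NaN run each position belongs to (0 if observed)."""
    is_na = s.isna()
    run_id = (is_na != is_na.shift()).cumsum()          # label each run
    run_len = is_na.groupby(run_id).transform("sum")    # NaNs per run
    return run_len.where(is_na, 0).astype(int)

def hybrid_impute(s: pd.Series, short_gap: int = 2) -> pd.Series:
    """Forward-fill short gaps; interpolate longer interior gaps."""
    gaps = gap_lengths(s)
    locf = s.ffill()
    # Stand-in for the longer-gap method; leading/trailing gaps stay NaN.
    interp = s.interpolate(method="linear", limit_area="inside")

    out = s.copy()
    short = (gaps > 0) & (gaps <= short_gap)
    long = gaps > short_gap
    out[short] = locf[short]
    out[long] = interp[long]
    return out

hr = pd.Series([80.0, np.nan, 82.0, np.nan, np.nan, np.nan, 90.0, 91.0])
hr_hybrid = hybrid_impute(hr, short_gap=2)
print(hr_hybrid.tolist())
```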


# Using a Transformer-based model for imputation

1. Use a pre-trained language model
2. Tokenize the time-series data
3. Fine-tune on the time-series data
4. Predict the missing values
5. Post-process the predictions
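
The tokenization and post-processing steps can be sketched without any model. The example below scales a series by its mean absolute value, maps each scaled value to one of a fixed number of uniform bins, reserves an extra token for missing values, and converts tokens back to values. The bin count, value range, and function names are all assumptions made for illustration, and the model and fine-tuning steps are deliberately left out.

```python
import numpy as np

N_BINS = 1000            # hypothetical number of value tokens
LOW, HIGH = -5.0, 5.0    # hypothetical range of scaled values
MASK_TOKEN = N_BINS      # extra token id reserved for missing values

edges = np.linspace(LOW, HIGH, N_BINS + 1)
centers = (edges[:-1] + edges[1:]) / 2

def tokenize(values):
    """Scale by mean absolute value, then quantize into uniform bins."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    observed = values[~missing]
    scale = float(np.mean(np.abs(observed))) if observed.size else 1.0
    if scale == 0.0:
        scale = 1.0
    scaled = np.clip(values / scale, LOW, HIGH)
    tokens = np.digitize(scaled, edges[1:-1])   # ids 0 .. N_BINS - 1
    tokens[missing] = MASK_TOKEN                # mark positions to predict
    return tokens, scale

def detokenize(tokens, scale):
    """Map token ids back to values (bin centres times the scale)."""
    tokens = np.asarray(tokens)
    out = np.full(tokens.shape, np.nan)
    known = tokens != MASK_TOKEN
    out[known] = centers[tokens[known]] * scale
    return out

hr = np.array([80.0, 82.0, np.nan, 90.0])
tokens, scale = tokenize(hr)
restored = detokenize(tokens, scale)
print(tokens, restored)
```

In a full pipeline, a fine-tuned model would replace each `MASK_TOKEN` with a predicted value token before `detokenize` is called. Quantization loses some precision: with these settings the round-trip error is at most half a bin width times the scale.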

# Reference

* https://github.com/kearnz/autoimpute