<a href="https://www.kaggle.com/code/aisuko/time-series-forecasting?scriptVersionId=199546905" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Time series forecasting, the process of using historical data to predict future outcomes, is key to creating accurate environmental models.


# Common challenges in environmental time series forecasting

Time series forecasting for environmental monitoring is riddled with complexities that don't exist in more controlled fields.

## Irregular and missing data

Environmental data is often collected at irregular intervals, and missing data points are a common issue due to technical malfunctions or extreme weather events. This introduces bias that traditional time series models struggle to handle.

## Nonlinear trends and seasonality

Environmental data usually exhibits strong seasonal patterns (e.g., temperature, rainfall) and long-term trends (e.g., global warming). Capturing both short-term fluctuations and long-term patterns in a single model is difficult.

## High variability and noise

Natural phenomena are subject to high variability, with many factors (e.g., temperature, air pollution, humidity) contributing to the overall state of the environment. This noise makes it difficult for forecasting models to detect true patterns.

## External influence

Environmental systems are affected by external, often unpredictable, forces. For instance, human activities such as deforestation or urbanization can drastically alter predictions that rely solely on historical data.

## Multivariate dependencies

Environmental systems often have multiple variables influencing one another. For example, in air quality forecasting, temperature, humidity, and wind speed all play interconnected roles. Capturing these multivariate dependencies requires more advanced modeling techniques.


In [None]:
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Create the dataset
num_records = 100
patient_ids = np.arange(1, num_records + 1)
ph_values = np.random.uniform(7.35, 7.45, num_records)  # Random pH values within the typical human range (7.35 to 7.45)
hr_values = np.random.randint(60, 100, num_records)      # Random heart rate values from 60 to 99 bpm (upper bound is exclusive)
rr_values = np.random.randint(12, 20, num_records)       # Random respiratory rate values from 12 to 19 breaths per minute
sbp_values = np.random.randint(90, 140, num_records)     # Random systolic blood pressure values from 90 to 139 mmHg

# Create DataFrame
data = {
    'Patient_ID': patient_ids,
    'PH': ph_values,
    'HR': hr_values,
    'RR': rr_values,
    'SBP': sbp_values
}

df = pd.DataFrame(data)

date_range=pd.date_range(start="2024-01-01", periods=num_records, freq='h')
df['Observation_Time'] = date_range

# Display the DataFrame (you can save or process it further as needed)
df.head()

## Randomly setting the data to NaN

In [None]:
fraction_missing=0.3

# Copy the DataFrame to avoid modifying the original one
df_with_missing = df.copy()

# Select random elements to be set as NaN in specific columns
for column in ['PH', 'HR', 'RR', 'SBP','Observation_Time']:
    df_with_missing.loc[
        np.random.choice(df_with_missing.index, size=int(fraction_missing * len(df_with_missing)), replace=False), column
    ] = np.nan

df_with_missing.head(10)


# 1.Handling missing and irregular data

One of the most significant challenges in environmental time series forecasting is dealing with missing or irregular data points.
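
Before filling anything in, it often helps to see where the gaps are. The following is a minimal sketch, assuming a small set of made-up sensor readings (the timestamps and values are illustrative, not from the dataset in this notebook). It resamples irregular observations onto a regular hourly grid so that hours without any reading show up as `NaN`.

```python
import pandas as pd

# Hypothetical irregular sensor readings (illustrative values only)
times = pd.to_datetime([
    "2024-01-01 00:05", "2024-01-01 00:50", "2024-01-01 01:20",
    "2024-01-01 03:40", "2024-01-01 04:10",
])
readings = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0], index=times)

# Average the readings that fall into each 60-minute bin;
# bins with no observation become NaN
regular = readings.resample("60min").mean()

# Timestamps of the hourly bins that have no data
missing_hours = regular[regular.isna()].index

print(regular)
print("Hours without data:", list(missing_hours))
```

Once the series sits on a regular grid, the missing slots can be handled with the interpolation or imputation techniques described in this section.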

## Interpolation

Missing data points can be filled in using interpolation methods such as:
* linear
* spline
* nearest-neighbor interpolation

These methods can help smooth out irregular time series.

In [None]:
df_with_missing['PH']=df_with_missing['PH'].interpolate(method='linear')
df_with_missing.head(10)
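
To see how the three interpolation methods listed above differ, here is a small sketch on a made-up series (the values are illustrative). It assumes SciPy is installed, since pandas relies on it for the `spline` and `nearest` methods, and the spline order of 2 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps (illustrative values only)
readings = pd.Series([1.0, np.nan, 4.0, np.nan, np.nan, 9.0, 16.0])

comparison = pd.DataFrame({
    # Straight line between the neighboring known points
    "linear": readings.interpolate(method="linear"),
    # Smooth curve through the known points; `order` is required for splines
    "spline": readings.interpolate(method="spline", order=2),
    # Copy the value of the closest known point
    "nearest": readings.interpolate(method="nearest"),
})

print(comparison)
```

Linear interpolation is usually a reasonable default for short gaps, while a spline may follow curvature better at the cost of occasional overshoot.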

## Data imputation using machine learning

More advanced techniques such as k-nearest neighbors (KNN) or even neural networks can be used to impute missing data by identifying patterns from other variables in the dataset.

In [None]:
fraction_missing=0.3

# Copy the DataFrame to avoid modifying the original one
df_with_missing_knn = df.copy()

# Select random elements to be set as NaN in specific columns
for column in ['PH', 'HR', 'RR', 'SBP','Observation_Time']:
    df_with_missing_knn.loc[
        np.random.choice(df_with_missing_knn.index, size=int(fraction_missing * len(df_with_missing_knn)), replace=False), column
    ] = np.nan

df_with_missing_knn.head(10)

In [None]:
df_with_missing_knn['Observation_Time']=pd.to_datetime(df_with_missing_knn['Observation_Time'])
df_with_missing_knn.head(5)

In [None]:
from sklearn.impute import KNNImputer

# example of using KNN to impute missing values
imputer=KNNImputer(n_neighbors=5)

missing_columns=df_with_missing_knn[['PH','HR','RR','SBP']]

df_imputed=imputer.fit_transform(missing_columns)

In [None]:
df_imputed_pd=pd.DataFrame(df_imputed, columns=missing_columns.columns)

In [None]:
df_imputed_pd['Patient_ID']=df_with_missing_knn['Patient_ID']
df_imputed_pd['Observation_Time']=df_with_missing_knn['Observation_Time']

In [None]:
df_imputed_pd.head(10)
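
To make the neighbor averaging concrete, here is a minimal sketch on a tiny made-up array (the values are illustrative). With its default settings, `KNNImputer` measures distance using only the features that both rows have present, then fills each gap with the plain mean of that feature across the nearest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny illustrative array with two missing entries
X_small = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Use the 2 nearest rows to fill each missing value
small_imputer = KNNImputer(n_neighbors=2)
X_filled = small_imputer.fit_transform(X_small)

print(X_filled)
```

Here the gap in the first row is filled with the mean of 3 and 5, and the gap in the third row with the mean of 3 and 8. Note that the vital signs in this notebook are generated independently at random, so neighbors carry little real information there; the technique pays off when variables are genuinely correlated.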

## Impute time using linear interpolation for missing times

We should convert the datetime column into numerical form (such as UNIX timestamps), perform the interpolation, and then convert it back into datetime.

In [None]:
df_imputed_pd['Observation_Timestamp'] = df_imputed_pd['Observation_Time'].apply(lambda x: x.timestamp() if pd.notnull(x) else None)

df_imputed_pd['Observation_Timestamp'] = df_imputed_pd['Observation_Timestamp'].interpolate(method='linear')

In [None]:
# Convert timestamps back to datetime
df_imputed_pd['Observation_Time'] = pd.to_datetime(df_imputed_pd['Observation_Timestamp'], unit='s')

In [None]:
df_imputed_pd.drop(columns=['Observation_Timestamp'], inplace=True)

In [None]:
df_imputed_pd

# 2.Nonlinear models for seasonality and trends

Environmental data often displays complex seasonality and long-term trends that cannot be captured by linear models. We can use nonlinear models like Prophet, a forecasting tool developed by Facebook, which is designed to capture these kinds of patterns.

In [None]:
!pip install -U -q prophet==1.1.6

In [None]:
from prophet import Prophet

df_for_prophet = df_imputed_pd[['Observation_Time', 'PH']].rename(columns={'Observation_Time': 'ds', 'PH': 'y'})

In [None]:
# fit Prophet model
model=Prophet(yearly_seasonality=True)
model.fit(df_for_prophet)

In [None]:
import matplotlib.pyplot as plt

# Forecast future values
future=model.make_future_dataframe(periods=365)
forecast=model.predict(future)

# Plot the forecast
model.plot(forecast)
plt.show()
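
Prophet represents seasonality with Fourier terms added on top of a trend. The idea can be sketched without Prophet itself: the example below builds a made-up hourly series with a linear trend and a daily cycle (all numbers are illustrative assumptions), then recovers both with ordinary least squares on a trend column plus one sine/cosine pair. It is a simplified illustration of the concept, not Prophet's actual fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two weeks of hypothetical hourly data: trend + daily cycle + noise
hours = np.arange(24 * 14, dtype=float)
y = (
    5.0
    + 0.01 * hours
    + 2.0 * np.sin(2 * np.pi * hours / 24)
    + rng.normal(scale=0.1, size=hours.size)
)

# Design matrix: intercept, linear trend, and one Fourier pair with a 24-hour period
design = np.column_stack([
    np.ones_like(hours),
    hours,
    np.sin(2 * np.pi * hours / 24),
    np.cos(2 * np.pi * hours / 24),
])

# Least-squares fit of all four coefficients at once
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope, sin_amp, cos_amp = coef

print(f"trend slope: {slope:.4f}, daily amplitude (sin): {sin_amp:.3f}")
```

Adding more sine/cosine pairs at higher frequencies lets the seasonal component take more complex shapes.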

# 3.Using machine learning for noisy and complex data

Machine learning models like `random forests`, `XGBoost` and `Neural Networks` can be more effective at capturing patterns in noisy and highly variable environmental data. XGBoost works well with noisy datasets because it can capture complex, nonlinear relationships in the data, while also providing flexibility to model multivariate time series.

In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

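# prepare data
# Assumption: the feature and target column names were left blank in the original,
# so the other vital signs are used as features and PH as the target here
X=df_imputed_pd[['HR','RR','SBP']]
y=df_imputed_pd['PH']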

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an XGBoost model
model = xgb.XGBRegressor(objective='reg:squarederror')
model.fit(X_train, y_train)

# Make predictions and evaluate
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')
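
One caveat for time series: `train_test_split` shuffles rows by default, so the model can end up training on observations that come after the ones it is tested on. A common alternative is to build lag features and split chronologically. The sketch below assumes a made-up random-walk series; the column names `lag_1` to `lag_3` and the 80/20 split are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical hourly series (a random walk, for illustration only)
times = pd.date_range("2024-01-01", periods=100, freq="60min")
series = pd.Series(rng.normal(size=100).cumsum(), index=times, name="y")

# Each row holds the three previous values as features and the current value as target
features = pd.DataFrame({
    "lag_1": series.shift(1),
    "lag_2": series.shift(2),
    "lag_3": series.shift(3),
    "y": series,
}).dropna()  # the first rows have no complete history

# Chronological split: the earliest 80% for training, the most recent 20% for testing
split_point = int(len(features) * 0.8)
train, test = features.iloc[:split_point], features.iloc[split_point:]

X_train, y_train = train.drop(columns="y"), train["y"]
X_test, y_test = test.drop(columns="y"), test["y"]

print(len(train), "training rows,", len(test), "test rows")
```

The resulting `X_train`, `y_train`, `X_test`, and `y_test` can be passed to a regressor such as `xgb.XGBRegressor` in the same way as above.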

# 4.Incorporating external factors

# 5.Multivariate time series modeling

Modeling multiple variables simultaneously can provide more accurate forecasts. Techniques like **Vector Autoregression (VAR)** or **multivariate LSTMs** can be used for this.
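
As a rough illustration of the VAR idea, the sketch below fits a first-order VAR to two simulated variables using plain least squares in NumPy. The coefficient matrix `A_true`, the noise level, and the zero-mean setup without an intercept are all assumptions made for the example; a dedicated library would normally handle lag selection, intercepts, and diagnostics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" dynamics: each variable depends on the previous
# values of both variables
A_true = np.array([[0.6, 0.2],
                   [-0.1, 0.5]])

# Simulate a two-variable series x_t = A_true @ x_{t-1} + noise
n_steps = 2000
series = np.zeros((n_steps, 2))
for t in range(1, n_steps):
    series[t] = A_true @ series[t - 1] + rng.normal(scale=0.1, size=2)

# VAR(1) as a regression: rows of `current` are predicted from rows of `previous`
previous, current = series[:-1], series[1:]

# Solve current ≈ previous @ B in the least-squares sense; B is the transpose of A
B, *_ = np.linalg.lstsq(previous, current, rcond=None)
A_hat = B.T

# One-step-ahead forecast from the last observation
next_step = A_hat @ series[-1]

print("Estimated coefficient matrix:\n", A_hat.round(2))
```

The off-diagonal entries of the estimated matrix capture how one variable's past influences the other, which is what a univariate model cannot express.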

# Acknowledgement

* https://blog.gopenai.com/is-time-series-forecasting-in-environmental-monitoring-full-of-unsolved-problems-2fbb3f04cd4f

# Reference

* https://www.linkedin.com/posts/bowen-li-10101197_timeseries-llms-timeseries-activity-7248124243433926657-30Qm?utm_source=share&utm_medium=member_desktop
* https://www.linkedin.com/feed/update/urn:li:activity:7216033310441947136?lipi=urn%3Ali%3Apage%3Ad_flagship3_detail_base%3B0w0RFwZJRFi2cQTih%2Bfl1g%3D%3D
* https://arxiv.org/pdf/2402.10198
