# Advanced Time Series Forecasting with LSTM

**Author:** Olivier Robert-Duboille

## 1. Introduction
In this notebook, we explore the capabilities of Long Short-Term Memory (LSTM) networks for forecasting complex time series data. We will move beyond simple ARIMA models and dive into deep learning approaches that can capture non-linear temporal dependencies.

### Objectives:
- Generate a synthetic dataset that mimics seasonal patterns by combining two sine waves with noise.
- Perform rigorous Exploratory Data Analysis (EDA) to understand stationarity and autocorrelation.
- Build and train a stacked LSTM model using TensorFlow/Keras.
- Evaluate performance using RMSE and visual inspection of forecast vs. actuals.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Set plotting style (set_theme is the current name; sns.set is an alias for it)
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 2. Data Generation
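Real-world data is often messy. To ensure we have a clean ground truth for this demonstration, we'll generate a synthetic time series that combines:
1. A lower-frequency sinusoidal wave (amplitude 0.5) with a random frequency and phase offset
2. A higher-frequency sinusoidal wave (amplitude 0.2) with its own random frequency and phase offset
3. Uniform noise in the range [-0.05, 0.05)

The result is a smooth, oscillating signal loosely reminiscent of the seasonal patterns found in sales forecasting. It contains no long-term trend.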

In [None]:
def generate_time_series(n_steps):
    """Return a (n_steps, 1) float32 series: two sine waves plus uniform noise."""
    # Random frequencies and phase offsets, each an array of shape (1,)
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))   # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))  # wave 2
    series += 0.1 * (np.random.rand(n_steps) - 0.5)                # uniform noise in [-0.05, 0.05)
    return series[..., np.newaxis].astype(np.float32)

# Generate data (seeded so the series is reproducible across runs)
np.random.seed(42)
n_steps = 1000
series = generate_time_series(n_steps)
date_range = pd.date_range(start='2020-01-01', periods=n_steps, freq='D')
df = pd.DataFrame(series, columns=['value'], index=date_range)

df.head()
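
The generator above produces two sine waves with uniform noise and no trend. If a series with an explicit linear trend, a single seasonal cycle, and Gaussian noise is wanted instead, a minimal sketch could look like the following. The function name `generate_trend_seasonal_series` and its parameters (`slope`, `period`, `noise_std`, `seed`) are hypothetical and not part of the original notebook.

```python
import numpy as np

def generate_trend_seasonal_series(n_steps, slope=0.5, period=50, noise_std=0.05, seed=0):
    """Hypothetical variant: linear trend + one sinusoid + Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    trend = slope * t / n_steps                        # rises linearly from 0 toward `slope`
    seasonal = 0.5 * np.sin(2 * np.pi * t / period)    # repeats every `period` steps
    noise = rng.normal(0.0, noise_std, size=n_steps)   # Gaussian noise
    series = trend + seasonal + noise
    # Same output layout as generate_time_series: shape (n_steps, 1), float32
    return series[:, np.newaxis].astype(np.float32)

trend_series = generate_trend_seasonal_series(1000)
print(trend_series.shape, trend_series.dtype)
```

Because the output has the same shape and dtype as `generate_time_series`, it should be usable as a drop-in replacement when building `df`.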

## 3. Exploratory Data Analysis (EDA)
Before modeling, we must visualize the data to understand its structure. We'll look at the raw series and the distribution of values.

In [None]:
plt.figure(figsize=(14, 6))
plt.plot(df.index, df['value'], label='Observed Data', color='royalblue')
plt.title('Synthetic Time Series Data', fontsize=16)
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

# Distribution plot
plt.figure(figsize=(10, 4))
sns.histplot(df['value'], kde=True, color='teal')
plt.title('Distribution of Values')
plt.show()
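
The two plots above do not directly show autocorrelation or stationarity. A rough check can be sketched with NumPy alone: compute the sample autocorrelation at several lags, and compare the mean and variance of the first and second halves of the series. The helper `autocorrelation` and the `demo` signal below are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation of a 1-D array for lags 0..max_lag."""
    x = np.asarray(x, dtype=np.float64)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# Illustrative input: a noisy sine wave with a period of 50 steps
rng = np.random.default_rng(0)
t = np.arange(1000)
demo = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=t.size)

acf_values = autocorrelation(demo, max_lag=100)
# Strong positive values near multiples of the period indicate seasonality
print(acf_values[[0, 25, 50]])

# Rough stationarity check: similar mean and variance in both halves
first, second = demo[:500], demo[500:]
print(first.mean(), second.mean(), first.var(), second.var())
```

For the notebook's data, the same helper could be called as `autocorrelation(df['value'].to_numpy(), max_lag=100)`.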

## 4. Data Preprocessing
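LSTMs are sensitive to the scale of input data, so we normalize the values to the range [0, 1] using `MinMaxScaler`. We then slice the series into overlapping windows of `look_back` steps, each paired with the value that follows it, and reshape the windows into the 3D format `[samples, time steps, features]` that Keras LSTM layers expect. Note that the scaler below is fit on the full series, including the portion later used for testing; for a strict evaluation, fit it on the training portion only.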
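
The windowing and reshaping step can be illustrated on a toy array, where the shapes are easy to verify by hand. The names `toy` and `look_back_toy` are hypothetical and chosen only for this sketch.

```python
import numpy as np

toy = np.arange(10, dtype=np.float32)  # stand-in for a scaled series: 0, 1, ..., 9
look_back_toy = 3                      # small window, for illustration only

# Each row holds look_back_toy consecutive values; the target is the next value
windows = np.array([toy[i:i + look_back_toy]
                    for i in range(len(toy) - look_back_toy)])
targets = toy[look_back_toy:]

# Add a trailing feature axis: [samples, time steps, features]
windows_3d = windows.reshape(windows.shape[0], windows.shape[1], 1)

print(windows_3d.shape)         # (7, 3, 1)
print(windows[0], targets[0])   # [0. 1. 2.] 3.0
```

Ten points with a window of three give seven (window, target) pairs, since the last window must leave one value after it to serve as the target.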

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(df)

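def create_sequences(dataset, look_back=60):
    """Build (window, next value) pairs from a 2D array of shape (n, 1)."""
    X, y = [], []
    # The last valid window starts at len(dataset) - look_back - 1
    for i in range(len(dataset) - look_back):
        X.append(dataset[i:i + look_back, 0])  # look_back consecutive inputs
        y.append(dataset[i + look_back, 0])    # the value right after the window
    return np.array(X), np.array(y)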

look_back = 60
X, y = create_sequences(scaled_data, look_back)

# Reshape for LSTM [samples, time steps, features]
X = np.reshape(X, (X.shape[0], X.shape[1], 1))

# Split chronologically into train (first 80%) and test (last 20%)
train_size = int(len(X) * 0.8)
test_size = len(X) - train_size
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

print(f"Training Shape: {X_train.shape}, Testing Shape: {X_test.shape}")
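
One way to keep test-set values from influencing the scaling is to compute the min-max statistics from the training portion only and apply them to both parts. The sketch below does this by hand with NumPy, using the same arithmetic that min-max scaling to [0, 1] implies. The random `values` array and the 80% `split` on raw points are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=(1000, 1)).astype(np.float32)  # illustrative stand-in for df.values

split = int(len(values) * 0.8)  # assumed: the first 80% of raw points are training data
train_raw, test_raw = values[:split], values[split:]

# Min-max statistics come from the training rows only
train_min = train_raw.min(axis=0)
train_max = train_raw.max(axis=0)

def scale(x):
    """Map values to [0, 1] relative to the training range."""
    return (x - train_min) / (train_max - train_min)

train_scaled = scale(train_raw)
test_scaled = scale(test_raw)  # may fall slightly outside [0, 1]

print(train_scaled.min(), train_scaled.max())
```

With scikit-learn, the equivalent would be to call `scaler.fit` on the training rows and `scaler.transform` on both parts, then build the windows from the scaled arrays.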

## 5. Model Building
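We use a Sequential model with two stacked LSTM layers of 50 units each. The first returns its full output sequence so that the second can consume it; the second returns only its final output, which feeds a 25-unit dense layer followed by a single-output regression layer. A Dropout layer (rate 0.2) after each LSTM layer helps reduce overfitting.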

In [None]:
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(look_back, 1)))
model.add(Dropout(0.2))
model.add(LSTM(50, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(25))
model.add(Dense(1))

model.compile(optimizer='adam', loss='mean_squared_error')
model.summary()
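
The parameter counts reported by `model.summary()` can be cross-checked by hand. A standard LSTM layer has four gates, each with input weights, recurrent weights, and a bias vector; a dense layer has one weight per input-output pair plus one bias per unit. The helper names below are hypothetical, and the sketch assumes the default layer settings used above (biases enabled).

```python
def lstm_params(units, input_dim):
    """Trainable parameters in a standard LSTM layer:
    4 gates x (input weights + recurrent weights + biases)."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """Weights plus biases for a fully connected layer."""
    return units * input_dim + units

counts = {
    "lstm_1": lstm_params(50, 1),     # input is 1 feature per time step
    "lstm_2": lstm_params(50, 50),    # input is the 50-unit sequence from lstm_1
    "dense_1": dense_params(25, 50),
    "dense_2": dense_params(1, 25),
}
total = sum(counts.values())
print(counts, total)
```

Dropout layers add no trainable parameters, so they do not appear in the tally.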

In [None]:
history = model.fit(X_train, y_train, batch_size=32, epochs=20, validation_data=(X_test, y_test), verbose=1)

## 6. Evaluation
Visualizing the loss curve helps us check for convergence. Then, we predict on the test set and inverse transform the values to their original scale for interpretation.

In [None]:
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper right')
plt.show()

In [None]:
# Predictions
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)

# Invert predictions and targets back to the original scale.
# inverse_transform expects a 2D array of shape (n_samples, 1).
train_predict = scaler.inverse_transform(train_predict)
y_train_inv = scaler.inverse_transform(y_train.reshape(-1, 1))
test_predict = scaler.inverse_transform(test_predict)
y_test_inv = scaler.inverse_transform(y_test.reshape(-1, 1))

# Calculate RMSE in original units
train_score = np.sqrt(mean_squared_error(y_train_inv[:, 0], train_predict[:, 0]))
print(f'Train Score: {train_score:.4f} RMSE')
test_score = np.sqrt(mean_squared_error(y_test_inv[:, 0], test_predict[:, 0]))
print(f'Test Score: {test_score:.4f} RMSE')
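
An RMSE value is easier to interpret next to a naive baseline. A common sanity check is the persistence forecast, which predicts that the next value equals the last value in the input window. The sketch below computes RMSE directly and applies the baseline to illustrative data shaped like the notebook's windows. The names `rmse`, `persistence_forecast`, `demo_series`, and `look_back_demo` are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length 1-D arrays."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def persistence_forecast(X):
    """Naive baseline: predict the last observed value of each window.
    Expects X with shape [samples, time steps, features]."""
    return X[:, -1, 0]

# Illustrative data (not the notebook's actual values)
rng = np.random.default_rng(0)
demo_series = np.sin(np.linspace(0, 20, 300)) + 0.05 * rng.normal(size=300)
look_back_demo = 10
X_demo = np.array([demo_series[i:i + look_back_demo]
                   for i in range(len(demo_series) - look_back_demo)])[..., np.newaxis]
y_demo = demo_series[look_back_demo:]

baseline_rmse = rmse(y_demo, persistence_forecast(X_demo))
print(f"Persistence baseline RMSE: {baseline_rmse:.4f}")
```

In the notebook, `persistence_forecast(X_test)` would return values in scaled units, so they would need to be inverse-transformed before comparison with `test_score`. A model that does not beat this baseline has probably learned little beyond copying its last input.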

## 7. Conclusion
We built and trained a stacked LSTM model to forecast a synthetic time series one step ahead. The plot below overlays the train and test predictions on the actual series.

In [None]:
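# Shift train predictions so they line up with the targets they estimate
train_plot = np.empty_like(scaled_data)
train_plot[:, :] = np.nan
train_plot[look_back:len(train_predict) + look_back, :] = train_predict

# Test predictions start right after the last train target
test_plot = np.empty_like(scaled_data)
test_plot[:, :] = np.nan
test_start = len(train_predict) + look_back
test_plot[test_start:test_start + len(test_predict), :] = test_predict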

plt.figure(figsize=(14,6))
plt.plot(scaler.inverse_transform(scaled_data), label='Actual Data')
plt.plot(train_plot, label='Train Prediction')
plt.plot(test_plot, label='Test Prediction')
plt.title('LSTM Forecasting Results')
plt.legend()
plt.show()