<div style="font-size:2.5em; font-weight:bold; text-align:center; margin-top:20px;">LSTM Financial Time Series Data Exploration</div>


This notebook explores the financial time series datasets used for LSTM forecasting. We'll analyze stock prices, forex data, economic indicators, and their relationships to understand patterns and prepare the data for LSTM modeling.

# 0. Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import os

# Set plotting style
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
plt.rcParams['figure.figsize'] = [12, 6]
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

# 1. Data Loading and Initial Inspection

Let's start by loading the sample datasets:
- Stock price data (daily)
- Forex data (hourly)
- Economic indicators (monthly)
- Processed features

In [None]:
# Load the datasets
stock_data = pd.read_csv('data/stock_price_sample.csv', parse_dates=['Date'], index_col='Date')
forex_data = pd.read_csv('data/forex_sample.csv', parse_dates=['Timestamp'], index_col='Timestamp')
econ_data = pd.read_csv('data/economic_indicators_sample.csv', parse_dates=['Date'], index_col='Date')
processed_data = pd.read_csv('data/processed_features_sample.csv', parse_dates=['Date'], index_col='Date')

# Display basic information for each dataset
print("="*50)
print("Stock Price Data")
print("="*50)
print(f"Shape: {stock_data.shape}")
print(f"Time range: {stock_data.index.min()} to {stock_data.index.max()}")
print(f"Columns: {stock_data.columns.tolist()}")
print(stock_data.head())
print("\nData types:")
print(stock_data.dtypes)
print("\nSummary statistics:")
print(stock_data.describe())

print("\n"+"="*50)
print("Forex Data")
print("="*50)
print(f"Shape: {forex_data.shape}")
print(f"Time range: {forex_data.index.min()} to {forex_data.index.max()}")
print(f"Columns: {forex_data.columns.tolist()}")
print(forex_data.head())
print("\nData types:")
print(forex_data.dtypes)
print("\nSummary statistics:")
print(forex_data.describe())

print("\n"+"="*50)
print("Economic Indicators Data")
print("="*50)
print(f"Shape: {econ_data.shape}")
print(f"Time range: {econ_data.index.min()} to {econ_data.index.max()}")
print(f"Columns: {econ_data.columns.tolist()}")
print(econ_data.head())
print("\nData types:")
print(econ_data.dtypes)
print("\nSummary statistics:")
print(econ_data.describe())

print("\n"+"="*50)
print("Processed Features Data")
print("="*50)
print(f"Shape: {processed_data.shape}")
print(f"Time range: {processed_data.index.min()} to {processed_data.index.max()}")
print(f"Columns: {processed_data.columns.tolist()}")
print(processed_data.head())
print("\nData types:")
print(processed_data.dtypes)
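
The datasets above are sampled at different frequencies (daily, hourly, monthly), so they have to be aligned before they can be combined into one feature table. The following is a minimal sketch of one way to do that, using synthetic stand-in frames rather than the real files; the `CPI` column, the `FX_Close` name, and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real files; values and the 'CPI' column are illustrative only.
daily_idx = pd.date_range("2023-01-02", periods=10, freq="B")
stock = pd.DataFrame({"Close": np.linspace(100, 109, 10)}, index=daily_idx)

hourly_idx = pd.date_range("2023-01-02", periods=48, freq=pd.Timedelta(hours=1))
forex = pd.DataFrame({"Close": np.linspace(1.05, 1.07, 48)}, index=hourly_idx)

monthly_idx = pd.date_range("2022-12-01", periods=2, freq="MS")
econ = pd.DataFrame({"CPI": [298.0, 300.5]}, index=monthly_idx)

# Hourly -> daily: keep the last quote of each calendar day.
forex_daily = forex["Close"].resample("D").last().rename("FX_Close")

# Monthly -> daily: carry the most recent monthly value forward onto each trading day.
econ_daily = econ.reindex(stock.index, method="ffill")

# Left joins keep the stock calendar; days without forex data stay NaN.
combined = stock.join(forex_daily, how="left").join(econ_daily, how="left")
print(combined.head())
```

Forward-filling by reference date assumes the monthly figure was already known on that date. Real indicators are usually published with a lag, so shifting them by the publication delay may be needed to avoid look-ahead bias.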

# 2. Missing Values and Data Quality

Let's check for missing values in our datasets.

In [None]:
# Check for missing values in each dataset
print("Missing values in Stock Price Data:")
print(stock_data.isnull().sum())

print("\nMissing values in Forex Data:")
print(forex_data.isnull().sum())

print("\nMissing values in Economic Indicators Data:")
print(econ_data.isnull().sum())

print("\nMissing values in Processed Features Data:")
print(processed_data.isnull().sum())

# Check for duplicate timestamps in the index
print("\nDuplicate timestamps in Stock Price Data:", stock_data.index.duplicated().sum())
print("Duplicate timestamps in Forex Data:", forex_data.index.duplicated().sum())
print("Duplicate timestamps in Economic Indicators Data:", econ_data.index.duplicated().sum())
print("Duplicate timestamps in Processed Features Data:", processed_data.index.duplicated().sum())
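
Null counts and duplicate checks do not reveal rows that are missing altogether, such as a trading day that never made it into the file. A minimal sketch of a gap check follows; the helper name `missing_timestamps` is hypothetical, and the business-day calendar (`"B"`) is an assumption that ignores market holidays.

```python
import pandas as pd

def missing_timestamps(index, freq):
    """Return the expected timestamps (at `freq`) that are absent from `index`."""
    expected = pd.date_range(index.min(), index.max(), freq=freq)
    return expected.difference(index)

# Synthetic daily index with one business day (2023-01-05) deliberately left out.
idx = pd.DatetimeIndex(["2023-01-02", "2023-01-03", "2023-01-04", "2023-01-06"])
gaps = missing_timestamps(idx, "B")
print(gaps)
```

Applied to the notebook's data, this could be called as `missing_timestamps(stock_data.index, "B")`. Dates it reports on exchange holidays are expected gaps rather than data errors.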

# 3. Time Series Visualization

Let's visualize the time series data to identify patterns, trends, and seasonality.

In [None]:
# Create a function to plot time series data
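def plot_time_series(data, title, columns=None):
    """Plot the selected columns of a time-indexed DataFrame on one set of axes."""
    if columns:
        data = data[columns]

    # Pass an explicit Axes: DataFrame.plot() would otherwise open its own figure
    # and leave an empty one behind
    fig, ax = plt.subplots(figsize=(15, 8))
    data.plot(ax=ax)
    ax.set_title(title, fontsize=15)
    ax.grid(True, alpha=0.3)
    ax.legend(loc='best')
    fig.tight_layout()
    plt.show()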

# Plot stock price data
plot_time_series(stock_data, 'Stock Price Data (January 2023)', ['Close'])

# Plot High, Low, Open, Close in one chart
plot_time_series(stock_data, 'OHLC Stock Price Data (January 2023)', ['Open', 'High', 'Low', 'Close'])

# Plot trading volume
plt.figure(figsize=(15, 5))
stock_data['Volume'].plot(kind='bar', color='skyblue')
plt.title('Trading Volume (January 2023)', fontsize=15)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Plot forex data
plot_time_series(forex_data, 'Forex Data (January 3, 2023)', ['Close'])

# Plot economic indicators
plot_time_series(econ_data, 'Economic Indicators (2022-2023)')

# Plot selected technical indicators from processed data
plot_time_series(processed_data, 'Technical Indicators (January 2023)', 
                ['SMA_20', 'EMA_12', 'RSI_14'])

# Plot MACD indicators
plot_time_series(processed_data, 'MACD Indicators (January 2023)', 
                ['MACD', 'MACD_Signal', 'MACD_Hist'])

# Plot Bollinger Bands with Close price
plt.figure(figsize=(15, 8))
# Draw on the figure created above; DataFrame.plot() would otherwise open a new one
processed_data[['Close', 'BB_Upper', 'BB_Lower']].plot(ax=plt.gca())
plt.title('Bollinger Bands (January 2023)', fontsize=15)
plt.grid(True, alpha=0.3)
plt.legend(loc='best')
plt.tight_layout()
plt.show()

# 4. Statistical Analysis

Let's perform statistical analysis on the time series data to understand stationarity, autocorrelation, and other properties.

In [None]:
def check_stationarity(time_series, title):
    """Check stationarity of a time series using ADF and KPSS tests"""
    # Augmented Dickey-Fuller test
    result_adf = adfuller(time_series.dropna())
    print(f'ADF Statistic for {title}: {result_adf[0]:.4f}')
    print(f'p-value: {result_adf[1]:.4f}')
    print(f'Critical Values:')
    for key, value in result_adf[4].items():
        print(f'\t{key}: {value:.4f}')
    
    # KPSS test
    result_kpss = kpss(time_series.dropna())
    print(f'\nKPSS Statistic for {title}: {result_kpss[0]:.4f}')
    print(f'p-value: {result_kpss[1]:.4f}')
    print(f'Critical Values:')
    for key, value in result_kpss[3].items():
        print(f'\t{key}: {value:.4f}')
    
    # Plot ACF and PACF
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))
    plot_acf(time_series.dropna(), ax=ax1, title=f'Autocorrelation Function for {title}')
    plot_pacf(time_series.dropna(), ax=ax2, title=f'Partial Autocorrelation Function for {title}')
    plt.tight_layout()
    plt.show()

# Check stationarity of closing prices
check_stationarity(stock_data['Close'], 'Stock Closing Prices')

# Check stationarity of returns
if 'Returns' in processed_data.columns:
    check_stationarity(processed_data['Returns'], 'Stock Returns')

# Check stationarity of forex closing prices
check_stationarity(forex_data['Close'], 'Forex Closing Prices')
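
The two tests printed by `check_stationarity` have opposite null hypotheses: ADF assumes a unit root (non-stationarity), while KPSS assumes stationarity. They are therefore easiest to read together. The sketch below combines the two p-values into one verdict using the commonly cited four-case reading; the function name `interpret_stationarity` and the 0.05 threshold are assumptions, and the p-values in the usage lines are made-up inputs, not results from the sample data.

```python
def interpret_stationarity(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into a single verdict.

    ADF null hypothesis:  the series has a unit root (non-stationary).
    KPSS null hypothesis: the series is stationary.
    """
    adf_stationary = adf_pvalue < alpha      # ADF rejects a unit root
    kpss_stationary = kpss_pvalue >= alpha   # KPSS fails to reject stationarity
    if adf_stationary and kpss_stationary:
        return "stationary"
    if not adf_stationary and not kpss_stationary:
        return "non-stationary"
    if kpss_stationary:
        return "trend-stationary (consider detrending)"
    return "difference-stationary (consider differencing)"

# Hypothetical p-values, for illustration only
print(interpret_stationarity(0.62, 0.01))
print(interpret_stationarity(0.001, 0.10))
```

Price levels often land in the non-stationary case and returns in the stationary one, which is one reason returns are often preferred as model inputs. With a sample as short as the one used here, both tests have little power, so any verdict should be treated as tentative.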

# 5. Correlation Analysis

Let's examine the correlations between different features.

In [None]:
# Correlation analysis for stock data
plt.figure(figsize=(12, 10))
sns.heatmap(stock_data.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix - Stock Price Data', fontsize=15)
plt.tight_layout()
plt.show()

# Correlation analysis for processed features
# Select a subset of columns to keep the plot readable
selected_features = ['Open', 'Close', 'Volume', 'SMA_20', 'EMA_12', 'RSI_14', 
                     'MACD', 'MACD_Signal', 'ATR_14', 'Returns', 'Volatility_20']

plt.figure(figsize=(14, 12))
sns.heatmap(processed_data[selected_features].corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix - Selected Technical Indicators', fontsize=15)
plt.tight_layout()
plt.show()

# Correlation between economic indicators
plt.figure(figsize=(10, 8))
sns.heatmap(econ_data.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix - Economic Indicators', fontsize=15)
plt.tight_layout()
plt.show()
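
Correlations computed on price levels can look strong simply because the series share a trend. A common cross-check is to compute the same correlation on returns. The sketch below illustrates the difference on two synthetic price series (not the sample data) that share a drift but have independent shocks; the column names `A` and `B` and all parameters are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 250

# Two synthetic price series with the same drift but independent daily shocks.
log_ret = 0.01 + 0.01 * rng.standard_normal((n, 2))
prices = pd.DataFrame(100 * np.exp(np.cumsum(log_ret, axis=0)), columns=["A", "B"])

level_corr = prices.corr().loc["A", "B"]
return_corr = prices.pct_change().dropna().corr().loc["A", "B"]

print(f"Correlation of price levels: {level_corr:.2f}")
print(f"Correlation of returns:      {return_corr:.2f}")
```

For the notebook's data, the same idea would be `stock_data.pct_change().corr()`. If the heatmap of levels and the heatmap of returns disagree sharply, the level correlations are probably driven by trend.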

# 6. Distribution Analysis

Let's analyze the distributions of key variables.

In [None]:
# Function to plot distributions
def plot_distributions(data, columns, title):
    n_cols = len(columns)
    fig, axes = plt.subplots(1, n_cols, figsize=(15, 5))
    
    for i, col in enumerate(columns):
        sns.histplot(data[col], kde=True, ax=axes[i] if n_cols > 1 else axes)
        if n_cols > 1:
            axes[i].set_title(col)
        else:
            axes.set_title(col)
    
    plt.suptitle(title, fontsize=15)
    plt.tight_layout()
    plt.show()

# Distribution of stock prices
plot_distributions(stock_data, ['Close'], 'Distribution of Stock Closing Prices')

# Distribution of trading volume
plot_distributions(stock_data, ['Volume'], 'Distribution of Trading Volume')

# Distribution of returns and volatility
if 'Returns' in processed_data.columns and 'Volatility_20' in processed_data.columns:
    plot_distributions(processed_data, ['Returns', 'Volatility_20'], 
                      'Distribution of Returns and Volatility')

# Distribution of technical indicators
plot_distributions(processed_data, ['RSI_14', 'ATR_14'], 
                  'Distribution of Technical Indicators')

# QQ plots for returns (if available)
if 'Log_Returns' in processed_data.columns:
    plt.figure(figsize=(10, 6))
    from scipy import stats
    stats.probplot(processed_data['Log_Returns'].dropna(), dist="norm", plot=plt)
    plt.title('Q-Q Plot of Log Returns', fontsize=15)
    plt.tight_layout()
    plt.show()
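
Histograms and Q-Q plots show departures from normality visually; skewness and excess kurtosis put numbers on them. The sketch below is one minimal way to do this with pandas. The helper name `distribution_summary` is hypothetical, and the heavy-tailed sample is synthetic, standing in for a returns column.

```python
import numpy as np
import pandas as pd

def distribution_summary(series):
    """Return mean, std, skewness and excess kurtosis of a series (NaNs dropped)."""
    s = series.dropna()
    return pd.Series({
        "mean": s.mean(),
        "std": s.std(),
        "skew": s.skew(),
        # pandas reports Fisher (excess) kurtosis, which is about 0 for normal data
        "excess_kurtosis": s.kurt(),
    })

# Synthetic heavy-tailed sample (Student's t with 3 degrees of freedom), illustrative only
rng = np.random.default_rng(42)
sample = pd.Series(rng.standard_t(df=3, size=5000) * 0.01)
print(distribution_summary(sample))
```

On the notebook's data this could be applied as `distribution_summary(processed_data['Returns'])`. Clearly positive excess kurtosis would point to the fat tails that financial returns often show.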

# 7. Feature Engineering Analysis

Let's analyze the relationship between raw prices and engineered features.

In [None]:
# Plot relationship between closing price and moving averages
plt.figure(figsize=(15, 8))
# Draw on the figure created above; DataFrame.plot() would otherwise open a new one
processed_data[['Close', 'SMA_20', 'EMA_12']].plot(ax=plt.gca())
plt.title('Closing Price vs Moving Averages', fontsize=15)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

# Plot RSI and closing price
fig, ax1 = plt.subplots(figsize=(15, 8))

color = 'tab:blue'
ax1.set_xlabel('Date')
ax1.set_ylabel('Closing Price', color=color)
ax1.plot(processed_data.index, processed_data['Close'], color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('RSI', color=color)
ax2.plot(processed_data.index, processed_data['RSI_14'], color=color)
ax2.tick_params(axis='y', labelcolor=color)
ax2.axhline(y=70, color='gray', linestyle='--')
ax2.axhline(y=30, color='gray', linestyle='--')

plt.title('Closing Price vs RSI', fontsize=15)
fig.tight_layout()
plt.show()

# Plot returns vs volatility
if 'Returns' in processed_data.columns and 'Volatility_20' in processed_data.columns:
    plt.figure(figsize=(10, 6))
    plt.scatter(processed_data['Returns'], processed_data['Volatility_20'])
    plt.xlabel('Returns')
    plt.ylabel('Volatility (20-day)')
    plt.title('Returns vs Volatility', fontsize=15)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
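
The engineered columns plotted above are read from a pre-processed file, so the notebook does not show how they were computed. The sketch below derives `SMA_20`, `EMA_12`, and `RSI_14` from a closing-price series using common textbook definitions. It is an assumption that the processed file used these formulas; RSI in particular has several variants, and this one uses Wilder-style exponential smoothing. The function name `add_indicators` is hypothetical.

```python
import numpy as np
import pandas as pd

def add_indicators(close, sma_window=20, ema_span=12, rsi_period=14):
    """Compute SMA, EMA and RSI from a closing-price series (textbook definitions)."""
    out = pd.DataFrame({"Close": close})
    out[f"SMA_{sma_window}"] = close.rolling(sma_window).mean()
    out[f"EMA_{ema_span}"] = close.ewm(span=ema_span, adjust=False).mean()

    # Split day-to-day changes into gains and losses (both non-negative)
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)

    # Wilder-style smoothing of average gains and losses
    avg_gain = gain.ewm(alpha=1 / rsi_period, adjust=False, min_periods=rsi_period).mean()
    avg_loss = loss.ewm(alpha=1 / rsi_period, adjust=False, min_periods=rsi_period).mean()
    rs = avg_gain / avg_loss
    out[f"RSI_{rsi_period}"] = 100 - 100 / (1 + rs)
    return out

# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(7)
close = pd.Series(100 + np.cumsum(rng.standard_normal(60)))
print(add_indicators(close).tail())
```

The first `sma_window - 1` rows of the SMA and the first `rsi_period` rows of the RSI are NaN, because those windows are not yet full. Such rows need to be dropped or handled before building LSTM sequences.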

# 8. Time Series Decomposition

Let's decompose the time series to identify trend, seasonality, and residual components.

In [None]:
# Import required library
from statsmodels.tsa.seasonal import seasonal_decompose

# Time series decomposition of stock closing prices
try:
    # Note: a proper decomposition needs many more data points;
    # this is for illustration with the sample data.
    # period=5 assumes a weekly cycle of five trading days.
    decomposition = seasonal_decompose(stock_data['Close'], model='additive', period=5)
    
    plt.figure(figsize=(12, 10))
    plt.subplot(411)
    plt.plot(decomposition.observed)
    plt.title('Observed', fontsize=12)
    plt.subplot(412)
    plt.plot(decomposition.trend)
    plt.title('Trend', fontsize=12)
    plt.subplot(413)
    plt.plot(decomposition.seasonal)
    plt.title('Seasonality', fontsize=12)
    plt.subplot(414)
    plt.plot(decomposition.resid)
    plt.title('Residuals', fontsize=12)
    plt.tight_layout()
    plt.show()
except ValueError as err:
    # seasonal_decompose raises ValueError when the series is too short or contains NaNs
    print(f"Seasonal decomposition failed: {err}")
    print("A longer series without gaps, spanning several periods, is needed.")

# 9. Volatility Clustering Analysis

Let's analyze volatility clustering in returns, which is important for financial time series.

In [None]:
# Volatility clustering in returns
if 'Returns' in processed_data.columns:
    # Plot returns
    plt.figure(figsize=(15, 6))
    plt.subplot(211)
    plt.plot(processed_data['Returns'])
    plt.title('Stock Returns', fontsize=15)
    plt.grid(True, alpha=0.3)
    
    # Plot squared returns (proxy for volatility)
    plt.subplot(212)
    plt.plot(processed_data['Returns']**2)
    plt.title('Squared Returns (Volatility Proxy)', fontsize=15)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Autocorrelation of squared returns
    # Pass an explicit Axes: plot_acf would otherwise create its own figure
    fig, ax = plt.subplots(figsize=(12, 6))
    plot_acf(processed_data['Returns'].dropna()**2, lags=20, ax=ax)
    plt.title('Autocorrelation of Squared Returns', fontsize=15)
    plt.tight_layout()
    plt.show()
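
Volatility clustering shows up as positive autocorrelation in squared returns even when the returns themselves are close to uncorrelated. The sketch below tabulates both with pandas on a synthetic series that has a calm regime followed by a turbulent one. The series, the regime volatilities, and the helper name `acf_table` are assumptions for illustration; the numbers say nothing about the sample data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic returns: 500 calm days followed by 500 turbulent days (illustrative only).
sigma = np.r_[np.full(500, 0.005), np.full(500, 0.03)]
returns = pd.Series(rng.standard_normal(1000) * sigma)

def acf_table(series, max_lag=5):
    """Lag-k autocorrelations of a series and of its squares, for k = 1..max_lag."""
    lags = range(1, max_lag + 1)
    return pd.DataFrame(
        {
            "returns": [series.autocorr(lag=k) for k in lags],
            "squared_returns": [(series ** 2).autocorr(lag=k) for k in lags],
        },
        index=pd.Index(lags, name="lag"),
    )

table = acf_table(returns)
print(table.round(3))
```

On the notebook's data the call would be `acf_table(processed_data['Returns'].dropna())`. A `squared_returns` column that stays clearly positive over several lags, next to a `returns` column near zero, is the usual signature of clustering.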

# 10. Data Preparation for LSTM

Finally, let's prepare the data for LSTM modeling.

In [None]:
# Import necessary utilities
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

def create_sequences(data, seq_length):
    """Create sequences for LSTM input"""
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        x = data[i:(i + seq_length)]
        y = data[i + seq_length]
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# Example: Prepare closing price data for LSTM
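# 1. Normalize the data
# Fit the scaler on the earliest 70% of rows only (roughly the training period),
# so the min/max of later validation/test data does not leak into training.
# Later values may therefore fall slightly outside [0, 1].
n_fit = int(len(stock_data) * 0.7)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(stock_data[['Close']].iloc[:n_fit])
scaled_data = scaler.transform(stock_data[['Close']])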

# 2. Create sequences
seq_length = 5  # Use 5 days of data to predict the next day
X, y = create_sequences(scaled_data, seq_length)

# 3. Split into train, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, shuffle=False)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, shuffle=False)

# Print shapes
print("Training data shape:")
print(f"X_train: {X_train.shape}")
print(f"y_train: {y_train.shape}")
print("\nValidation data shape:")
print(f"X_val: {X_val.shape}")
print(f"y_val: {y_val.shape}")
print("\nTest data shape:")
print(f"X_test: {X_test.shape}")
print(f"y_test: {y_test.shape}")

# Visualize a sample sequence
plt.figure(figsize=(10, 6))
plt.plot(scaler.inverse_transform(X_train[0]), label='Input Sequence')
# inverse_transform expects a 2-D array, so reshape the single target value
target_price = scaler.inverse_transform(y_train[0].reshape(1, -1))[0, 0]
plt.scatter(seq_length, target_price, color='r', label='Target')
plt.title(f'Sample LSTM Input Sequence and Target (Sequence Length: {seq_length})', fontsize=15)
plt.xlabel('Time Steps')
plt.ylabel('Closing Price')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
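
The `create_sequences` helper above builds windows from a single column and uses that same column as the target. A minimal sketch of a multivariate variant follows, in which the inputs have several features per time step and the target is a separate array. The function name `create_multivariate_sequences` and the toy data are assumptions, not part of the original notebook.

```python
import numpy as np

def create_multivariate_sequences(features, target, seq_length):
    """Build (samples, seq_length, n_features) inputs and next-step targets.

    features: 2-D array of shape (n_steps, n_features)
    target:   1-D array of shape (n_steps,), e.g. the scaled closing price
    """
    features = np.asarray(features)
    target = np.asarray(target)
    xs, ys = [], []
    for i in range(len(features) - seq_length):
        xs.append(features[i:i + seq_length])   # window of seq_length steps
        ys.append(target[i + seq_length])       # value right after the window
    return np.array(xs), np.array(ys)

# Toy data: 10 time steps, 3 features; the target is the first feature.
toy = np.arange(30, dtype=float).reshape(10, 3)
X_multi, y_multi = create_multivariate_sequences(toy, toy[:, 0], seq_length=5)
print(X_multi.shape, y_multi.shape)  # (5, 5, 3) (5,)
```

The resulting `(samples, time steps, features)` layout is the 3-D input shape that LSTM layers in common frameworks expect. As with the univariate case, any scaler should be fitted on the training period only.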

# 11. Conclusion

In this exploratory data analysis, we examined several financial time series datasets:
1. **Stock Price Data**: Daily OHLCV data showing typical stock market patterns
2. **Forex Data**: Hourly exchange rate data showing smaller price movements
3. **Economic Indicators**: Monthly macroeconomic indicators with strong correlations
4. **Processed Features**: Technical indicators derived from price data

Key findings:
- The stock price data shows the trends and day-to-day volatility typical of financial markets
- Technical indicators such as RSI, MACD, and Bollinger Bands summarize momentum and price extremes, which makes them candidate input features
- The returns data shows volatility clustering, a common feature of financial time series
- Several of the economic indicators are strongly correlated with one another

Next steps for LSTM modeling:
1. Extend the data preparation to include multivariate features
2. Implement the LSTM models with different architectures
3. Evaluate model performance with appropriate metrics
4. Compare LSTM predictions with traditional forecasting methods
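
For the evaluation and comparison steps above, a useful reference point is the naive persistence forecast, which predicts that tomorrow's value equals today's. The sketch below computes RMSE and MAE for that baseline. The function names and the four price values are assumptions for illustration, not results from the sample data.

```python
import numpy as np

def persistence_forecast(series):
    """Predict each value as the previous observation (naive random-walk baseline)."""
    series = np.asarray(series, dtype=float)
    return series[:-1], series[1:]  # (predictions, actuals)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

prices = [100.0, 102.0, 101.0, 105.0]  # illustrative values, not taken from the sample data
pred, actual = persistence_forecast(prices)
print(f"RMSE: {rmse(actual, pred):.3f}, MAE: {mae(actual, pred):.3f}")
```

An LSTM that cannot beat this baseline on the held-out test period is probably not extracting useful structure from the price history. The baseline and the model should be scored on the same dates and in the same units.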