# Stock Data Preparation for LSTM Models

## Objective
This notebook prepares cleaned stock CSVs for training LSTM models. For each stock, it builds three separate datasets, one per prediction target:
- Next day price direction (up/down)
- Next week price direction (up/down)
- Next month price direction (up/down)
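
The `next_day_direction`, `next_week_direction`, and `next_month_direction` columns are expected to already exist in the cleaned CSVs. As a rough sketch of how such labels could be derived, the snippet below compares each close with the close a fixed number of trading days later. The helper name `add_direction_labels` and the horizons of 1, 5, and 21 trading days are assumptions for illustration; they are not taken from the cleaning step.

```python
import pandas as pd

# Hypothetical horizons in trading days; the actual cleaning step may differ.
HORIZONS = {"day": 1, "week": 5, "month": 21}

def add_direction_labels(df, horizons=HORIZONS):
    """Add next_{period}_direction columns: 1 if the future close is higher, else 0."""
    out = df.copy()
    for period, h in horizons.items():
        future_close = out["close"].shift(-h)
        direction = (future_close > out["close"]).astype(float)
        # Rows without a future close get NaN instead of a misleading 0.
        out[f"next_{period}_direction"] = direction.where(future_close.notna())
    return out

# Toy series with shortened horizons so the effect is visible on six rows
toy = pd.DataFrame({"close": [10.0, 11.0, 10.5, 12.0, 11.5, 13.0]})
labeled = add_direction_labels(toy, {"day": 1, "week": 2})
print(labeled)
```

Rows at the end of the series have no future close, so their labels stay NaN and would need to be dropped before training.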

## Input
- Cleaned stock CSVs from `../data/cleaned/{stock}.csv`

## Output
- Processed datasets ready for LSTM at `../data/lstm/{period}/{stock}_lstm_{period}.csv`
  - Where period is one of: 'day', 'week', 'month'

## Steps
1. Import libraries and define helper functions
2. Engineer and select features suitable for LSTM
3. Remove highly correlated features
4. Scale features to the [0, 1] range
5. Save prepared datasets
6. Perform quality checks
7. Demonstrate sequence creation for time series input

In [26]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import seaborn as sns
from datetime import datetime

# Set display options
pd.set_option('display.max_columns', None)
np.set_printoptions(precision=3, suppress=True)

## 1. Define Important Functions

In [None]:
def create_sequences(data, seq_length):
    """
    Create sequences of data with a specific length for LSTM input.
    
    Parameters:
    - data: DataFrame with features
    - seq_length: Number of time steps in each sequence
    
    Returns:
    - X: numpy array of sequences (samples, seq_length, features)
    - y: numpy array of target values
    """
    X = []
    y = []
    
    # Extract features and target
    features = data.drop('target', axis=1).values
    targets = data['target'].values
    
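    # Create sequences: rows i .. i+seq_length-1 form one input window
    # NOTE: the label is read from row i+seq_length, one row past the window.
    # If 'target' on row t already encodes the move after row t, the label
    # aligned with the window's last row would be targets[i+seq_length-1].
    for i in range(len(features) - seq_length):
        X.append(features[i:i+seq_length])
        y.append(targets[i+seq_length])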
        
    return np.array(X), np.array(y)
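
For long series the Python loop in `create_sequences` can be replaced by a strided view. The sketch below is one possible vectorized variant built on NumPy's `sliding_window_view`; the function name `create_sequences_vectorized` is hypothetical, and it takes feature and target arrays directly instead of a DataFrame. It keeps the same window-to-label pairing as the loop above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def create_sequences_vectorized(features, targets, seq_length):
    """Build (samples, seq_length, n_features) windows without a Python loop."""
    features = np.asarray(features)
    targets = np.asarray(targets)
    # sliding_window_view appends the window axis last: (n_windows, n_features, seq_length)
    windows = sliding_window_view(features, seq_length, axis=0)
    windows = windows.transpose(0, 2, 1)
    # Mirror the loop version: window i is paired with the target on row i + seq_length,
    # so the final window (which has no following row) is dropped.
    X = windows[:-1]
    y = targets[seq_length:]
    return X, y

features = np.arange(12, dtype=float).reshape(6, 2)  # 6 time steps, 2 features
targets = np.array([0, 1, 0, 1, 1, 0])
X, y = create_sequences_vectorized(features, targets, seq_length=3)
print(X.shape, y.shape)  # (3, 3, 2) (3,)
```

The returned `X` is a read-only view into `features`; call `X.copy()` before modifying it in place.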

In [None]:
def select_features_by_target_correlation(df, corr_threshold=0.8):
    """
    Select features based on correlation with target while removing multicollinearity.
    
    For each group of highly correlated features, keeps the one with the
    highest correlation with the target.
    
    Parameters:
    - df: DataFrame with features and 'target' column
    - corr_threshold: Correlation threshold for considering features as highly correlated
    
    Returns:
    - List of features to drop
    """
    # Calculate correlation matrix
    corr_matrix = df.corr().abs()
    
    # Get correlation with target
    target_corr = corr_matrix['target'].drop('target')
    
    # Get upper triangle of correlations
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    # Find feature groups with high correlation
    to_drop = set()
    
    # For each column
    for c in upper.columns:
        if c == 'target':
            continue
            
        # Find highly correlated features
        correlated_features = [
            r for r in upper.index 
            if upper.loc[r, c] > corr_threshold and r != 'target' and c != 'target'
        ]
        
        # If we found any
        if correlated_features:
            # Add column c and all its correlated features to a group
            corr_group = [c] + correlated_features
            
            # Find correlations with target for this group
            target_corrs = {feat: target_corr[feat] for feat in corr_group}
            
            # Sort by correlation with target (descending)
            sorted_group = sorted(target_corrs.items(), key=lambda x: x[1], reverse=True)
            
            # Keep the feature with highest correlation with target
            to_keep = sorted_group[0][0]
            
            # Mark the rest for removal
            for feat in corr_group:
                if feat != to_keep:
                    to_drop.add(feat)
    
    # Return list of features to drop
    return list(to_drop)
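
To see the idea behind this kind of pruning on a small scale, the sketch below uses a simpler greedy variant rather than the grouping logic above: features are ranked by absolute correlation with the target, and a feature is kept only if it is not highly correlated with one already kept. The function name `greedy_correlation_filter` and the toy data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def greedy_correlation_filter(df, target_col="target", corr_threshold=0.8):
    """Return the features kept after greedy pruning of highly correlated ones."""
    corr = df.corr().abs()
    # Rank candidates by absolute correlation with the target, strongest first.
    ranked = corr[target_col].drop(target_col).sort_values(ascending=False).index
    kept = []
    for feat in ranked:
        # Keep a feature only if it is not too correlated with anything already kept.
        if all(corr.loc[feat, k] <= corr_threshold for k in kept):
            kept.append(feat)
    return kept

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),  # near-duplicate of a
    "c": rng.normal(size=200),             # independent feature
})
toy["target"] = (a + 0.1 * rng.normal(size=200) > 0).astype(int)

kept = greedy_correlation_filter(toy)
print("kept:", kept)
```

Only one of the near-duplicate columns `a` and `b` survives, while the independent column `c` is retained.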

In [None]:
def prepare_stock_data(input_file, output_dir, period, seq_length=10):
    """
    Prepare stock data for LSTM model.
    
    Parameters:
    - input_file: Path to input CSV file
    - output_dir: Directory to save prepared data
    - period: Prediction period ('day', 'week', 'month')
    - seq_length: Length of sequences for LSTM (default: 10)
    """
    print(f"Processing {os.path.basename(input_file)} for {period} prediction...")
    
    try:
        # Read data and ensure chronological order
        df = pd.read_csv(input_file)
        
        # Parse the date column; utc=True converts mixed timezone offsets to UTC
        try:
            df['date'] = pd.to_datetime(df['date'], utc=True)
        except Exception:
            try:
                # If that fails, strip a literal '-04:00' offset and parse the remainder
                df['date'] = pd.to_datetime(df['date'].str.split('-04:00').str[0], errors='coerce')
            except Exception:
                # Last resort: coerce unparseable values to NaT
                print("Warning: Date parsing issue. Converting to datetime without timezone info.")
                df['date'] = pd.to_datetime(df['date'], errors='coerce')
        
        # Make sure the column really has a datetime dtype (timezone-aware or naive)
        if not pd.api.types.is_datetime64_any_dtype(df['date']):
            raise ValueError("Failed to convert date column to datetime")
            
        # Sort by date
        df = df.sort_values('date')
        
        # Define target column based on period
        if period == 'day':
            target_col = 'next_day_direction'
        elif period == 'week':
            target_col = 'next_week_direction'
        elif period == 'month':
            target_col = 'next_month_direction'
        else:
            raise ValueError(f"Invalid period: {period}. Use 'day', 'week', or 'month'.")
        
        # Calculate all features on the original time series BEFORE selecting subset
        
        # 1. Daily change features
        df['day_change'] = df['close'] - df['open']
        df['day_change_pct'] = (df['close'] - df['open']) / df['open']
        
        # 2. Calculate short-term volatility (to capture recent changes better)
        df['volatility_5'] = df['return'].rolling(window=5).std()
        
        # 3. Rate of change over different periods (original series)
        df['roc_3'] = df['close'].pct_change(3)
        df['roc_5'] = df['close'].pct_change(5)
        df['roc_10'] = df['close'].pct_change(10)
        
        # 4. Momentum indicators (original series)
        df['momentum_3'] = df['close'] - df['close'].shift(3)
        df['momentum_5'] = df['close'] - df['close'].shift(5)
        df['momentum_10'] = df['close'] - df['close'].shift(10)
        
        # 5. RSI changes (original series)
        df['rsi_diff'] = df['rsi'] - df['rsi'].shift(1)
        df['rsi_diff_3'] = df['rsi'] - df['rsi'].shift(3)
        df['rsi_ma_diff'] = df['rsi'] - df['rsi'].rolling(window=5).mean()
        
        # 6. Moving average differences - capturing recent trends
        df['ma5_diff'] = df['ma5'] - df['ma5'].shift(1)
        df['ma20_diff'] = df['ma20'] - df['ma20'].shift(1)
        
        # 7. Price-to-MA differences - rate of change in trend strength
        df['price_to_ma5_diff'] = df['price_to_ma5'] - df['price_to_ma5'].shift(1)
        df['price_to_ma20_diff'] = df['price_to_ma20'] - df['price_to_ma20'].shift(1)
        
        # 8. Volatility ratios - recalculated with better window handling
        df['volatility_ratio'] = df['volatility'] / df['volatility'].rolling(window=10).mean()
        df['volatility_change'] = df['volatility'] - df['volatility'].shift(1)
        
        # 9. Volume changes
        df['volume_diff'] = df['volume'] - df['volume'].shift(1)
        df['rel_volume_diff'] = df['rel_volume'] - df['rel_volume'].shift(1)
        
        # 10. Add temporal features - SAFER VERSION
        # Extract date components directly from date column to avoid .dt accessor issues
        try:
            # Only add these if date parsing worked correctly
            if pd.api.types.is_datetime64_any_dtype(df['date']):
                df['day_of_week'] = df['date'].dt.dayofweek
                df['day_of_week_sin'] = np.sin(2 * np.pi * df['date'].dt.dayofweek / 7)
                df['day_of_week_cos'] = np.cos(2 * np.pi * df['date'].dt.dayofweek / 7)
                df['month'] = df['date'].dt.month
                df['month_sin'] = np.sin(2 * np.pi * df['date'].dt.month / 12)
                df['month_cos'] = np.cos(2 * np.pi * df['date'].dt.month / 12)
        except Exception as e:
            print(f"Warning: Could not create temporal features: {str(e)}")
            print("Continuing without temporal features.")
        
        # Select relevant base features plus the new engineered features
        selected_features = [
            # Basic price metrics
            'close', 'return', 
            
            # Moving averages
            'ma5', 'ma20', 'ma50',
            
            # Price to MA relationships
            'price_to_ma5', 'price_to_ma20', 'price_to_ma50',
            
            # Crossover signals
            'ma5_cross_ma20', 'ma20_cross_ma50',
            
            # Volatility measures
            'volatility', 'volatility_5',
            
            # Volume indicators
            'rel_volume',
            
            # RSI indicators
            'rsi', 'rsi_diff',
            
            # Dynamic changes in indicators
            'ma5_diff', 'ma20_diff',
            'price_to_ma5_diff', 'price_to_ma20_diff',
            
            # Rate of change indicators
            'roc_3', 'roc_5', 'roc_10',
            
            # Momentum indicators
            'momentum_3', 'momentum_5', 
            
            # Daily changes
            'day_change_pct',
            
            # Volatility dynamics
            'volatility_change',
            
            # Volume dynamics
            'rel_volume_diff',
            
            # Temporal features (if available)
            'day_of_week_sin', 'day_of_week_cos',
            'month_sin', 'month_cos'
        ]
        
        # Handle missing values with FORWARD fill only (no backfill to avoid leakage)
        df = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas
        
        # Create a new dataframe with only selected features and target
        valid_features = [f for f in selected_features if f in df.columns]
        lstm_df = df[valid_features + [target_col]].copy()
        lstm_df.rename(columns={target_col: 'target'}, inplace=True)
        
        # Remove any remaining NaNs after feature engineering
        lstm_df = lstm_df.dropna()
        
        # IMPROVED FEATURE SELECTION: Use the new function to remove highly correlated features
        # while keeping those most correlated with the target
        to_drop = select_features_by_target_correlation(lstm_df, corr_threshold=0.8)
        
        if to_drop:
            print(f"Removing highly correlated features: {to_drop}")
            print("(Keeping the feature in each correlated group that has highest correlation with target)")
            lstm_df = lstm_df.drop(to_drop, axis=1)
        
        # Split features and target
        features = lstm_df.drop('target', axis=1)
        target = lstm_df['target']
        
        # Scale features using MinMaxScaler
        scaler = MinMaxScaler(feature_range=(0, 1))
        scaled_features = scaler.fit_transform(features)
        
        # Create DataFrame with scaled features
        scaled_df = pd.DataFrame(scaled_features, columns=features.columns)
        scaled_df['target'] = target.values
        
        # Create directory if it doesn't exist
        os.makedirs(os.path.join(output_dir, period), exist_ok=True)
        
        # Create output filename
        stock_name = os.path.splitext(os.path.basename(input_file))[0]  # keeps tickers that contain dots intact
        output_file = os.path.join(output_dir, period, f"{stock_name}_lstm_{period}.csv")
        
        # Save prepared data
        scaled_df.to_csv(output_file, index=False)
        
        print(f"Saved prepared data to {output_file}")
        print(f"Data shape: {scaled_df.shape}")
        print(f"Features: {', '.join(scaled_df.columns[:-1])}")
        print(f"Target balance: {scaled_df['target'].value_counts(normalize=True).to_dict()}")
        
        # Perform a quick check for static values
        print("\nChecking for static features...")
        static_check = scaled_df.iloc[:5].copy()
        static_features = []
        
        for col in static_check.columns:
            if col != 'target' and static_check[col].nunique() == 1:
                static_features.append(col)
        
        if static_features:
            print(f"Warning: The following features appear static in the first 5 rows: {static_features}")
        else:
            print("No static features detected in sample rows. Data looks good!")
        
        return scaled_df, output_file
        
    except Exception as e:
        print(f"Error processing {os.path.basename(input_file)}: {str(e)}")
        print("Traceback:")
        import traceback
        traceback.print_exc()
        raise
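
The temporal features in `prepare_stock_data` map the day of week and the month onto a circle, so that values at the two ends of a cycle (for example December and January) end up close together. A minimal sketch of that encoding, using a hypothetical helper `encode_cyclical` and three made-up dates:

```python
import numpy as np
import pandas as pd

def encode_cyclical(values, period):
    """Map integer cycle positions to (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Illustrative dates: a Monday in January, a Friday in January, the last day of December
dates = pd.to_datetime(["2024-01-01", "2024-01-05", "2024-12-31"])
dow_sin, dow_cos = encode_cyclical(dates.dayofweek, 7)
month_sin, month_cos = encode_cyclical(dates.month, 12)

# Distance between January and December on the circle vs. on the raw 1..12 scale
circle_gap = np.hypot(month_sin[0] - month_sin[2], month_cos[0] - month_cos[2])
raw_gap = abs(dates.month[0] - dates.month[2])
print(circle_gap, raw_gap)
```

On the raw scale January and December are 11 units apart, while on the circle they are neighbours.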

## 2. Define Paths and Process All Stocks

In [31]:
# Define input and output directories
input_dir = '../data/cleaned/'
output_dir = '../data/lstm/'

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Get list of cleaned stock files
stock_files = [f for f in os.listdir(input_dir) if f.endswith('.csv')]

print(f"Found {len(stock_files)} stock files")

Found 20 stock files


In [None]:
# Process each stock for all periods
periods = ['day', 'week', 'month']
results = {}

for stock_file in stock_files:
    stock_name = os.path.splitext(stock_file)[0]
    input_file = os.path.join(input_dir, stock_file)
    
    results[stock_name] = {}
    
    for period in periods:
        scaled_df, output_file = prepare_stock_data(input_file, output_dir, period)
        results[stock_name][period] = {
            'df': scaled_df,
            'file': output_file
        }
        print("---")
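
Note that `prepare_stock_data` fits the scaler on the full history of each stock, so the saved values already reflect minima and maxima from the most recent rows. If a validation period is later carved out of the same file, one common way to avoid that leakage is to compute the scaling statistics on the earliest rows only. The sketch below does this with plain pandas; the function name `chronological_minmax` and the 80/20 split are assumptions. Fitting scikit-learn's `MinMaxScaler` on the training slice and calling `transform` on the validation slice would give the same result.

```python
import numpy as np
import pandas as pd

def chronological_minmax(features, train_frac=0.8):
    """Min-max scale using statistics from the earliest train_frac of rows only."""
    split = int(len(features) * train_frac)
    train, valid = features.iloc[:split], features.iloc[split:]
    lo, hi = train.min(), train.max()
    span = (hi - lo).replace(0, 1.0)  # avoid division by zero for constant columns
    return (train - lo) / span, (valid - lo) / span

toy = pd.DataFrame({"close": np.linspace(100.0, 199.0, 100)})  # steadily rising price
train_scaled, valid_scaled = chronological_minmax(toy)
print(train_scaled["close"].min(), train_scaled["close"].max())
print(valid_scaled["close"].max())  # above 1: these prices were not seen when fitting
```

Validation values outside [0, 1] are expected with this approach, since the later prices were never part of the fit.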

## 3. Quality Checks on Prepared Data

In [33]:
def perform_quality_checks(results):
    """
    Perform quality checks on prepared data.
    
    Parameters:
    - results: Dictionary of prepared data results
    """
    print("\n=== Data Quality Checks ===")
    
    # 1. Check for class imbalance
    class_balance = {}
    
    for stock_name, stock_data in results.items():
        for period, period_data in stock_data.items():
            df = period_data['df']
            up_pct = df['target'].mean() * 100
            # Build the per-period dict lazily so no global 'periods' list is needed
            class_balance.setdefault(period, {})[stock_name] = up_pct
    
    # Display class balance summary
    for period, stocks in class_balance.items():
        avg_balance = sum(stocks.values()) / len(stocks)
        min_balance = min(stocks.values())
        max_balance = max(stocks.values())
        
        print(f"\n{period.capitalize()} prediction class balance:")
        print(f"  Average % Up: {avg_balance:.2f}%")
        print(f"  Range: {min_balance:.2f}% - {max_balance:.2f}%")
        print(f"  Stocks with extreme imbalance (<30% or >70%): ", end="")
        
        extreme = [s for s, v in stocks.items() if v < 30 or v > 70]
        if extreme:
            print(", ".join(extreme))
        else:
            print("None")
    
    # 2. Check a sample dataset to verify scaling is correct
    sample_stock = list(results.keys())[0]
    sample_period = 'day'
    sample_df = results[sample_stock][sample_period]['df']
    
    print("\nVerifying scaling for sample dataset:")
    print(f"  Stock: {sample_stock}, Period: {sample_period}")
    
    # Check min and max values for each feature
    feature_ranges = {}
    for feature in sample_df.columns:
        if feature != 'target':
            min_val = sample_df[feature].min()
            max_val = sample_df[feature].max()
            feature_ranges[feature] = (min_val, max_val)
    
    # Print ranges for first 5 features
    for feature, (min_val, max_val) in list(feature_ranges.items())[:5]:
        print(f"  {feature}: Min={min_val:.4f}, Max={max_val:.4f}")
    
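    # Allow a small tolerance: scaled values can overshoot 0 or 1 by floating-point rounding
    tol = 1e-9
    in_range = all(min_val >= -tol and max_val <= 1 + tol for min_val, max_val in feature_ranges.values())
    print(f"  All features in [0,1] range: {in_range}")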
    
    # 3. Check for correlation among features in the sample dataset
    corr_matrix = sample_df.drop('target', axis=1).corr()
    
    # Count highly correlated feature pairs
    high_corr_count = 0
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > 0.8:
                high_corr_count += 1
    
    print("\nCorrelation check for sample dataset:")
    print(f"  Number of highly correlated feature pairs (>0.8): {high_corr_count}")
    if high_corr_count > 0:
        print("  Some features are still highly correlated, consider further feature selection")
    else:
        print("  No high correlation issues detected")

In [13]:
# Perform quality checks
perform_quality_checks(results)


=== Data Quality Checks ===

Day prediction class balance:
  Average % Up: 52.23%
  Range: 49.34% - 55.08%
  Stocks with extreme imbalance (<30% or >70%): None

Week prediction class balance:
  Average % Up: 55.10%
  Range: 51.02% - 59.94%
  Stocks with extreme imbalance (<30% or >70%): None

Month prediction class balance:
  Average % Up: 58.29%
  Range: 49.46% - 66.04%
  Stocks with extreme imbalance (<30% or >70%): None

Verifying scaling for sample dataset:
  Stock: PYPL, Period: day
  close: Min=0.0000, Max=1.0000
  return: Min=0.0000, Max=1.0000
  price_to_ma5: Min=0.0000, Max=1.0000
  price_to_ma20: Min=0.0000, Max=1.0000
  ma5_cross_ma20: Min=0.0000, Max=1.0000
  All features in [0,1] range: False

Correlation check for sample dataset:
  Number of highly correlated feature pairs (>0.8): 0
  No high correlation issues detected
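
The output above reports `All features in [0,1] range: False` even though every printed feature spans 0.0000 to 1.0000. One plausible cause (an assumption, not verified against this dataset) is floating-point rounding: a scaled maximum can land a single rounding step above 1, which a strict comparison rejects. The sketch below contrasts a strict check with a tolerant one; `within_unit_range` is a hypothetical helper.

```python
import numpy as np

def within_unit_range(values, tol=1e-9):
    """True if all values lie in [0, 1] up to a small numerical tolerance."""
    values = np.asarray(values, dtype=float)
    return bool(values.min() >= -tol and values.max() <= 1 + tol)

scaled = np.array([0.0, 0.5, 1.0000000000000002])  # max is one float64 step above 1
strict = bool(0 <= scaled.min() <= scaled.max() <= 1)
tolerant = within_unit_range(scaled)
print(strict, tolerant)  # False True
```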


## 4. Sample Sequence Creation for LSTM

Now let's demonstrate how to create sequences for LSTM input using the prepared data.

In [34]:
def demonstrate_sequence_creation(results):
    """
    Demonstrate sequence creation for LSTM models.
    
    Parameters:
    - results: Dictionary of prepared data results
    """
    # Select a sample stock and period
    sample_stock = list(results.keys())[0]
    sample_period = 'day'
    sample_df = results[sample_stock][sample_period]['df']
    
    print(f"\n=== Sequence Creation Demonstration for {sample_stock}, {sample_period} ===\n")
    
    # Create sequences with length 10
    seq_length = 10
    X, y = create_sequences(sample_df, seq_length)
    
    # Print information about the sequences
    print(f"Input shape: {X.shape}")
    print(f"Target shape: {y.shape}")
    print(f"Number of sequences: {len(X)}")
    print(f"Sequence length: {X.shape[1]}")
    print(f"Number of features: {X.shape[2]}")
    
    # Calculate class distribution in targets
    unique, counts = np.unique(y, return_counts=True)
    class_dist = dict(zip(unique, counts))
    total = sum(counts)
    
    print("\nTarget class distribution:")
    for cls, count in class_dist.items():
        print(f"  Class {int(cls)}: {count} samples ({count/total*100:.2f}%)")
    
    # Show an example of how these sequences will be used for LSTM training
    print("\nExample sequence (first 3 time steps, first 5 features):")
    example_seq = X[0, :3, :5]  # First sequence, first 3 time steps, first 5 features
    print(example_seq)
    print(f"Target for this sequence: {int(y[0])}")
    
    print("\nThis demonstrates how sequences are prepared for LSTM input.")
    print(f"Each sequence consists of {seq_length} consecutive days of data, with the target being the direction prediction")
    print("for the day immediately following the sequence.")

In [35]:
# Demonstrate sequence creation
demonstrate_sequence_creation(results)


=== Sequence Creation Demonstration for PYPL, day ===

Input shape: (2438, 10, 19)
Target shape: (2438,)
Number of sequences: 2438
Sequence length: 10
Number of features: 19

Target class distribution:
  Class 0: 1164 samples (47.74%)
  Class 1: 1274 samples (52.26%)

Example sequence (first 3 time steps, first 5 features):
[[0.011 0.752 0.    0.    0.313]
 [0.011 0.623 0.    0.    0.313]
 [0.011 0.534 0.    0.    0.313]]
Target for this sequence: 0

This demonstrates how sequences are prepared for LSTM input.
Each sequence consists of 10 consecutive days of data, with the target being the direction prediction
for the day immediately following the sequence.
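
Once sequences exist, they still need to be divided into training and validation sets. For time series this is usually done chronologically rather than by random shuffling, so that validation windows come strictly after training windows. The following is a minimal sketch under that assumption; `chronological_split`, the 20% validation share, and the optional `gap` parameter are illustrative choices, not part of this notebook.

```python
import numpy as np

def chronological_split(X, y, valid_frac=0.2, gap=0):
    """Split ordered sequences into train/validation without shuffling.

    gap drops the last `gap` training windows, which can be used to keep
    overlapping windows from straddling the train/validation boundary.
    """
    n_valid = int(len(X) * valid_frac)
    split = len(X) - n_valid
    train_end = max(split - gap, 0)
    return X[:train_end], y[:train_end], X[split:], y[split:]

# Toy data: 20 ordered windows whose single value equals their position in time
X = np.arange(20, dtype=float).reshape(20, 1, 1)
y = np.arange(20) % 2
X_train, y_train, X_valid, y_valid = chronological_split(X, y, valid_frac=0.2, gap=2)
print(X_train.shape, X_valid.shape)  # (14, 1, 1) (4, 1, 1)
```

Because consecutive windows share most of their rows, setting `gap` to roughly the sequence length is one way to keep the two sets from sharing input rows.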


In [25]:
sample_day = pd.read_csv('../data/lstm/day/AAPL_lstm_day.csv')
sample_week = pd.read_csv('../data/lstm/week/AAPL_lstm_week.csv')
sample_month = pd.read_csv('../data/lstm/month/AAPL_lstm_month.csv')
sample_day.head(5)


Unnamed: 0,close,return,price_to_ma5,price_to_ma20,ma5_cross_ma20,ma20_cross_ma50,volatility,rel_volume,roc_5,volatility_ratio,rsi_diff,rsi_ma_diff,target
0,0.035797,0.453029,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0
1,0.035684,0.453029,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0
2,0.035675,0.456027,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,1
3,0.036928,0.492571,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,1
4,0.038011,0.487344,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0


In [23]:
sample_week.head(5)

Unnamed: 0,close,return,price_to_ma5,price_to_ma20,ma5_cross_ma20,ma20_cross_ma50,volatility,rel_volume,roc_5,volatility_ratio,rsi_diff,rsi_ma_diff,target
0,0.035797,0.453029,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,1
1,0.035684,0.453029,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,1
2,0.035675,0.456027,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,1
3,0.036928,0.492571,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0
4,0.038011,0.487344,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0


In [22]:
sample_month.head(5)

Unnamed: 0,close,return,price_to_ma5,price_to_ma20,ma5_cross_ma20,ma20_cross_ma50,volatility,rel_volume,roc_5,volatility_ratio,rsi_diff,rsi_ma_diff,target
0,0.035797,0.453029,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0
1,0.035684,0.453029,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0
2,0.035675,0.456027,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0
3,0.036928,0.492571,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0
4,0.038011,0.487344,0.605877,0.49431,0.0,0.0,0.067359,0.145222,0.542001,0.290694,0.511208,0.536806,0


## 5. Conclusion

We have successfully prepared and saved the data for LSTM models:

1. Selected appropriate features for time series prediction
2. Added engineered features relevant for stock price direction prediction
3. Removed highly correlated features to reduce dimensionality
4. Scaled all features to [0,1] range for LSTM
5. Created separate datasets for day, week, and month prediction horizons
6. Demonstrated how to create sequences for LSTM input

The prepared data is saved in the following structure:
- `../data/lstm/day/{stock}_lstm_day.csv`
- `../data/lstm/week/{stock}_lstm_week.csv`
- `../data/lstm/month/{stock}_lstm_month.csv`

These datasets are now ready to be used for training LSTM models for stock direction prediction.