# Stock Price Prediction Data Preparation for LSTM

This notebook prepares stock data for LSTM models at three prediction horizons:
1. Next day close price
2. Next week average close price (next 5 trading days)
3. Next month average close price (next 21 trading days)

For each stock, we create three separate datasets, one per prediction horizon.

In [1]:
import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

In [2]:
# List of stock symbols
stocks = ["AAPL", "MSFT", "GOOG", "AMZN", "TSLA", 
          "META", "NVDA", "SPY", "V", "DIS",
          "NFLX", "PYPL", "BABA", "IBM", "AMD",
          "BA", "INTC", "T", "GS", "NKE"]

# Path to the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))

# Paths for input and output data
input_folder = os.path.join(project_root, "data", "lstm","initial")
output_base_folder = os.path.join(project_root, "data", "lstm")

# Create output folders for each prediction period
output_folders = {
    'day': os.path.join(output_base_folder, "day"),
    'week': os.path.join(output_base_folder, "week"),
    'month': os.path.join(output_base_folder, "month")
}

# Ensure output directories exist
for folder in output_folders.values():
    os.makedirs(folder, exist_ok=True)

## Helper Functions for Data Preparation

In [3]:
def prepare_next_day_data_lstm(df):
    """
    Prepare data for next day prediction using LSTM.
    The input already contains the target column 'next_day_close_scaled',
    so all features are kept and an unmodified copy is returned.
    """
    # Make a copy to avoid modifying the original dataframe
    lstm_df = df.copy()
    
    # The target for next day prediction is already in the dataset as 'next_day_close_scaled'
    # No additional processing is needed
    
    return lstm_df

def prepare_next_week_data_lstm(df):
    """
    Prepare data for next week average prediction using LSTM.
    We'll create a new target column that represents the average close price for the next 5 trading days.
    """
    # Make a copy to avoid modifying the original dataframe
    lstm_df = df.copy()
    
    # Ensure date column is datetime
    lstm_df['date'] = pd.to_datetime(lstm_df['date'])
    
    # Sort by date
    lstm_df = lstm_df.sort_values('date')
    
    # Calculate the average close price for the next 5 trading days (1 week)
    # First, get the scaled close prices for the calculation
    close_values = lstm_df['close_scaled'].values
    next_week_avg = []
    
    for i in range(len(close_values)):
        if i + 5 < len(close_values):
            # Calculate average of next 5 days
            avg = np.mean(close_values[i+1:i+6])
            next_week_avg.append(avg)
        else:
            # For the last 5 days, we can't calculate the average, so use NaN
            next_week_avg.append(np.nan)
    
    # Add the calculated average as a new target column
    lstm_df['next_week_close_scaled'] = next_week_avg
    
    # Calculate the unscaled next-week average close price for reference
    # 'close_original' holds the unscaled close, so the same approach applies
    close_original_values = lstm_df['close_original'].values
    next_week_avg_original = []
    
    for i in range(len(close_original_values)):
        if i + 5 < len(close_original_values):
            avg = np.mean(close_original_values[i+1:i+6])
            next_week_avg_original.append(avg)
        else:
            next_week_avg_original.append(np.nan)
    
    lstm_df['next_week_close_original'] = next_week_avg_original
    
    # Add a column indicating whether the next week's average is higher than the current close
    lstm_df['price_up_week'] = (lstm_df['next_week_close_scaled'] > lstm_df['close_scaled']).astype(int)
    
    # Drop rows with NaN values for the target
    lstm_df = lstm_df.dropna(subset=['next_week_close_scaled'])
    
    return lstm_df

def prepare_next_month_data_lstm(df):
    """
    Prepare data for next month average prediction using LSTM.
    We'll create a new target column that represents the average close price for the next 21 trading days.
    """
    # Make a copy to avoid modifying the original dataframe
    lstm_df = df.copy()
    
    # Ensure date column is datetime
    lstm_df['date'] = pd.to_datetime(lstm_df['date'])
    
    # Sort by date
    lstm_df = lstm_df.sort_values('date')
    
    # Calculate the average close price for the next 21 trading days (1 month)
    close_values = lstm_df['close_scaled'].values
    next_month_avg = []
    
    for i in range(len(close_values)):
        if i + 21 < len(close_values):
            # Calculate average of next 21 days
            avg = np.mean(close_values[i+1:i+22])
            next_month_avg.append(avg)
        else:
            # For the last 21 days, we can't calculate the average, so use NaN
            next_month_avg.append(np.nan)
    
    # Add the calculated average as a new target column
    lstm_df['next_month_close_scaled'] = next_month_avg
    
    # Calculate the unscaled next-month average close price for reference
    close_original_values = lstm_df['close_original'].values
    next_month_avg_original = []
    
    for i in range(len(close_original_values)):
        if i + 21 < len(close_original_values):
            avg = np.mean(close_original_values[i+1:i+22])
            next_month_avg_original.append(avg)
        else:
            next_month_avg_original.append(np.nan)
    
    lstm_df['next_month_close_original'] = next_month_avg_original
    
    # Add a column indicating whether the next month's average is higher than the current close
    lstm_df['price_up_month'] = (lstm_df['next_month_close_scaled'] > lstm_df['close_scaled']).astype(int)
    
    # Drop rows with NaN values for the target
    lstm_df = lstm_df.dropna(subset=['next_month_close_scaled'])
    
    return lstm_df
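
The two loops above compute, for each row, the mean of the following 5 (or 21) closes. As a minimal sketch, the same target can be expressed with a rolling mean shifted backwards. The helper names `forward_mean_loop` and `forward_mean_vectorized` and the synthetic series are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def forward_mean_loop(values, horizon):
    """Reference version: mean of the next `horizon` values, NaN near the end."""
    out = []
    for i in range(len(values)):
        if i + horizon < len(values):
            out.append(np.mean(values[i + 1:i + 1 + horizon]))
        else:
            out.append(np.nan)
    return np.array(out)


def forward_mean_vectorized(series, horizon):
    """Same target via a trailing rolling mean shifted back by `horizon` rows."""
    # rolling(horizon).mean() at row j averages rows j-horizon+1 .. j.
    # Shifting by -horizon moves that value to row i = j-horizon,
    # where it covers rows i+1 .. i+horizon.
    return series.rolling(horizon).mean().shift(-horizon)


rng = np.random.default_rng(0)
close = pd.Series(rng.random(30))  # synthetic stand-in for 'close_scaled'

for horizon in (5, 21):
    loop_result = forward_mean_loop(close.to_numpy(), horizon)
    vec_result = forward_mean_vectorized(close, horizon).to_numpy()
    # Both versions agree, including the trailing NaN rows
    assert np.allclose(loop_result, vec_result, equal_nan=True)
```

Either form leaves the last `horizon` rows as NaN, which is why those rows are dropped before saving.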

## Process Each Stock

In [4]:
def process_stocks_lstm():
    """Process all stocks and create datasets for different prediction periods for LSTM model"""
    results = []
    
    for stock in stocks:
        try:
            # Construct input file path
            input_file = os.path.join(input_folder, f"{stock}_lstm.csv")
            
            # Read the stock data
            df = pd.read_csv(input_file)
            
            # Add 'close_original' if it is missing and the columns needed to derive it exist
            scaling_cols = {'close_scaled', 'close_min', 'close_max'}
            if 'close_original' not in df.columns and scaling_cols.issubset(df.columns):
                # Invert min-max scaling: original = min + scaled * (max - min)
                df['close_original'] = df['close_min'] + df['close_scaled'] * (df['close_max'] - df['close_min'])
            
            # Prepare data for different prediction periods
            day_data = prepare_next_day_data_lstm(df)
            week_data = prepare_next_week_data_lstm(df)
            month_data = prepare_next_month_data_lstm(df)
            
            # Save the prepared data
            day_data.to_csv(os.path.join(output_folders['day'], f"{stock}_lstm_day.csv"), index=False)
            week_data.to_csv(os.path.join(output_folders['week'], f"{stock}_lstm_week.csv"), index=False)
            month_data.to_csv(os.path.join(output_folders['month'], f"{stock}_lstm_month.csv"), index=False)
            
            results.append({
                'stock': stock,
                'status': 'Success',
                'day_rows': len(day_data),
                'week_rows': len(week_data),
                'month_rows': len(month_data)
            })
            
        except Exception as e:
            results.append({
                'stock': stock,
                'status': f'Error: {str(e)}',
                'day_rows': 0,
                'week_rows': 0,
                'month_rows': 0
            })
    
    return pd.DataFrame(results)
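
The fallback in `process_stocks_lstm` reconstructs `close_original` by inverting min-max scaling. The sketch below checks that round trip on made-up prices. It assumes the scaled column was produced as `(x - min) / (max - min)`, which the notebook implies but does not show.

```python
import pandas as pd

# Hypothetical raw closes; the real notebook reads these from the input CSVs
raw = pd.Series([100.0, 120.0, 110.0, 150.0])
close_min, close_max = raw.min(), raw.max()

# Forward min-max scaling to the [0, 1] range (assumed scaling scheme)
scaled = (raw - close_min) / (close_max - close_min)

# Inverse, as in process_stocks_lstm: original = min + scaled * (max - min)
restored = close_min + scaled * (close_max - close_min)

# The round trip recovers the raw values up to floating-point error
max_error = (restored - raw).abs().max()
```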

In [5]:
# Process all stocks and display results
results_df = process_stocks_lstm()
results_df

| | stock | status | day_rows | week_rows | month_rows |
|---|---|---|---|---|---|
| 0 | AAPL | Success | 2513 | 2508 | 2492 |
| 1 | MSFT | Success | 2513 | 2508 | 2492 |
| 2 | GOOG | Success | 2513 | 2508 | 2492 |
| 3 | AMZN | Success | 2513 | 2508 | 2492 |
| 4 | TSLA | Success | 2513 | 2508 | 2492 |
| 5 | META | Success | 2513 | 2508 | 2492 |
| 6 | NVDA | Success | 2513 | 2508 | 2492 |
| 7 | SPY | Success | 2513 | 2508 | 2492 |
| 8 | V | Success | 2513 | 2508 | 2492 |
| 9 | DIS | Success | 2513 | 2508 | 2492 |
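
The row counts follow directly from the horizons: the last 5 (or 21) rows have no complete future window and are dropped. A quick check of that arithmetic, using a hypothetical helper `expected_rows`:

```python
def expected_rows(n_rows, horizon):
    """Rows left after dropping the last `horizon` rows, which lack a full future window."""
    return max(n_rows - horizon, 0)


day_rows = 2513  # taken from the results table
week_rows = expected_rows(day_rows, 5)
month_rows = expected_rows(day_rows, 21)
print(week_rows, month_rows)  # prints: 2508 2492
```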


## Data Verification

Let's verify the structure of one example from each prediction period to ensure the data is correctly formatted.

In [6]:
# Sample stock for verification
sample_stock = "AAPL"

try:
    # Load samples from each prediction period
    day_sample = pd.read_csv(os.path.join(output_folders['day'], f"{sample_stock}_lstm_day.csv"))
    week_sample = pd.read_csv(os.path.join(output_folders['week'], f"{sample_stock}_lstm_week.csv"))
    month_sample = pd.read_csv(os.path.join(output_folders['month'], f"{sample_stock}_lstm_month.csv"))
    
    # Display first few rows of each sample
    print(f"Day prediction data sample for {sample_stock}:")
    display(day_sample.head()[['date', 'close_scaled', 'next_day_close_scaled', 'price_up']])
    
    print(f"\nWeek prediction data sample for {sample_stock}:")
    display(week_sample.head()[['date', 'close_scaled', 'next_week_close_scaled', 'price_up_week']])
    
    print(f"\nMonth prediction data sample for {sample_stock}:")
    display(month_sample.head()[['date', 'close_scaled', 'next_month_close_scaled', 'price_up_month']])
    
except Exception as e:
    print(f"Error during verification: {str(e)}")

Day prediction data sample for AAPL:

| | date | close_scaled | next_day_close_scaled | price_up |
|---|---|---|---|---|
| 0 | 2015-05-13 00:00:00-04:00 | 0.03186 | 0.034629 | 1 |
| 1 | 2015-05-14 00:00:00-04:00 | 0.034629 | 0.03446 | 0 |
| 2 | 2015-05-15 00:00:00-04:00 | 0.03446 | 0.035797 | 1 |
| 3 | 2015-05-18 00:00:00-04:00 | 0.035797 | 0.035684 | 0 |
| 4 | 2015-05-19 00:00:00-04:00 | 0.035684 | 0.035675 | 0 |

Week prediction data sample for AAPL:

| | date | close_scaled | next_week_close_scaled | price_up_week |
|---|---|---|---|---|
| 0 | 2015-05-13 00:00:00-04:00 | 0.03186 | 0.035249 | 1 |
| 1 | 2015-05-14 00:00:00-04:00 | 0.034629 | 0.035709 | 1 |
| 2 | 2015-05-15 00:00:00-04:00 | 0.03446 | 0.036419 | 1 |
| 3 | 2015-05-18 00:00:00-04:00 | 0.035797 | 0.036312 | 1 |
| 4 | 2015-05-19 00:00:00-04:00 | 0.035684 | 0.036683 | 1 |

Month prediction data sample for AAPL:

| | date | close_scaled | next_month_close_scaled | price_up_month |
|---|---|---|---|---|
| 0 | 2015-05-13 00:00:00-04:00 | 0.03186 | 0.035357 | 1 |
| 1 | 2015-05-14 00:00:00-04:00 | 0.034629 | 0.035266 | 1 |
| 2 | 2015-05-15 00:00:00-04:00 | 0.03446 | 0.035214 | 1 |
| 3 | 2015-05-18 00:00:00-04:00 | 0.035797 | 0.035084 | 0 |
| 4 | 2015-05-19 00:00:00-04:00 | 0.035684 | 0.034986 | 0 |
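
The printed samples allow a small cross-check of the weekly target. For the 2015-05-13 row, the next five scaled closes are the `close_scaled` values of rows 1 to 4 plus the `next_day_close_scaled` value of row 4 (the close of the following trading day). Their mean should match the first `next_week_close_scaled` value, up to the six-decimal rounding of the display.

```python
import numpy as np

# Scaled closes of the five trading days after 2015-05-13, read off the day sample
next_five = [0.034629, 0.03446, 0.035797, 0.035684, 0.035675]

week_target = float(np.mean(next_five))
print(round(week_target, 6))  # prints: 0.035249
```

This agrees with the value shown in the first row of the week sample.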


## Data Analysis

Let's compare the three prediction targets to see how they differ in level, spread, and correlation.

In [None]:
def analyze_lstm_prediction_targets(stock):
    """Analyze and compare the different prediction targets for a given stock"""
    try:
        # Load data
        day_data = pd.read_csv(os.path.join(output_folders['day'], f"{stock}_lstm_day.csv"))
        week_data = pd.read_csv(os.path.join(output_folders['week'], f"{stock}_lstm_week.csv"))
        month_data = pd.read_csv(os.path.join(output_folders['month'], f"{stock}_lstm_month.csv"))
        
        # Convert dates to datetime for consistent comparison
        day_data['date'] = pd.to_datetime(day_data['date'])
        week_data['date'] = pd.to_datetime(week_data['date'])
        month_data['date'] = pd.to_datetime(month_data['date'])
        
        # Find common date range
        start_date = max(day_data['date'].min(), week_data['date'].min(), month_data['date'].min())
        end_date = min(day_data['date'].max(), week_data['date'].max(), month_data['date'].max())
        
        # Filter data to common date range
        day_filtered = day_data[(day_data['date'] >= start_date) & (day_data['date'] <= end_date)]
        week_filtered = week_data[(week_data['date'] >= start_date) & (week_data['date'] <= end_date)]
        month_filtered = month_data[(month_data['date'] >= start_date) & (month_data['date'] <= end_date)]
        
        # Merge datasets on date to analyze relationships between targets
        # First, select only necessary columns
        day_slim = day_filtered[['date', 'close_scaled', 'next_day_close_scaled', 'price_up']]
        week_slim = week_filtered[['date', 'next_week_close_scaled', 'price_up_week']]
        month_slim = month_filtered[['date', 'next_month_close_scaled', 'price_up_month']]
        
        # Merge
        merged = pd.merge(day_slim, week_slim, on='date')
        merged = pd.merge(merged, month_slim, on='date')
        
        # Calculate statistics about the targets
        stats = {
            'day_mean': merged['next_day_close_scaled'].mean(),
            'week_mean': merged['next_week_close_scaled'].mean(),
            'month_mean': merged['next_month_close_scaled'].mean(),
            'day_std': merged['next_day_close_scaled'].std(),
            'week_std': merged['next_week_close_scaled'].std(),
            'month_std': merged['next_month_close_scaled'].std(),
            'day_week_corr': merged['next_day_close_scaled'].corr(merged['next_week_close_scaled']),
            'day_month_corr': merged['next_day_close_scaled'].corr(merged['next_month_close_scaled']),
            'week_month_corr': merged['next_week_close_scaled'].corr(merged['next_month_close_scaled']),
            'price_up_day_pct': merged['price_up'].mean() * 100,
            'price_up_week_pct': merged['price_up_week'].mean() * 100,
            'price_up_month_pct': merged['price_up_month'].mean() * 100
        }
        
        return stats
    
    except Exception as e:
        print(f"Error analyzing {stock}: {str(e)}")
        return None
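
`pd.merge` performs an inner join by default, so the merged frame in `analyze_lstm_prediction_targets` only contains dates present in all three datasets. The sketch below shows this on small synthetic frames. The `validate="one_to_one"` argument is an optional safeguard added here for illustration; it is not used in the notebook.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=6, freq="D")

# Synthetic stand-ins: the week and month frames are missing trailing dates
day = pd.DataFrame({"date": dates, "next_day": range(6)})
week = pd.DataFrame({"date": dates[:4], "next_week": range(4)})
month = pd.DataFrame({"date": dates[:2], "next_month": range(2)})

# Inner join (the default) keeps only dates found in every frame;
# validate raises MergeError if a date appears more than once on either side
merged = pd.merge(day, week, on="date", validate="one_to_one")
merged = pd.merge(merged, month, on="date", validate="one_to_one")

print(len(merged))  # prints: 2
```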

In [None]:
# Analyze a sample stock
try:
    sample_analysis = analyze_lstm_prediction_targets("AAPL")
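    if sample_analysis is not None:
        # display() is required: a bare expression inside a block is not rendered by the notebook
        display(pd.DataFrame([sample_analysis]).T.rename(columns={0: 'Value'}))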
    else:
        print("Analysis failed or no data available.")
except Exception as e:
    print(f"Error: {str(e)}")

## Feature Analysis

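Let's examine the differences in features between the three datasets. The preparation functions do not modify the feature columns; the week and month datasets only drop trailing rows that lack a full future window, so any differences should be small.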

In [None]:
def compare_feature_distributions(stock):
    """Compare the distribution of features across the three prediction datasets"""
    try:
        # Load data
        day_data = pd.read_csv(os.path.join(output_folders['day'], f"{stock}_lstm_day.csv"))
        week_data = pd.read_csv(os.path.join(output_folders['week'], f"{stock}_lstm_week.csv"))
        month_data = pd.read_csv(os.path.join(output_folders['month'], f"{stock}_lstm_month.csv"))
        
        # Get common numeric features to compare (excluding target variables and non-numeric columns)
        numeric_cols = [col for col in day_data.columns if 
                        col in week_data.columns and 
                        col in month_data.columns and
                        pd.api.types.is_numeric_dtype(day_data[col]) and
                        'next_' not in col and
                        'price_up' not in col and
                        col not in ['date']]
        
        # Create a summary of distribution differences
        summary = {}
        for col in numeric_cols[:10]:  # Limit to first 10 features to avoid information overload
            summary[col] = {
                'day_mean': day_data[col].mean(),
                'week_mean': week_data[col].mean(),
                'month_mean': month_data[col].mean(),
                'day_std': day_data[col].std(),
                'week_std': week_data[col].std(),
                'month_std': month_data[col].std()
            }
        
        return pd.DataFrame(summary).T
    
    except Exception as e:
        print(f"Error comparing features for {stock}: {str(e)}")
        return None

In [None]:
# Compare feature distributions for a sample stock
try:
    feature_comparison = compare_feature_distributions("AAPL")
    if feature_comparison is not None:
        display(feature_comparison)
    else:
        print("Feature comparison failed or no data available.")
except Exception as e:
    print(f"Error: {str(e)}")

## Summary

This notebook has successfully prepared stock data for prediction using LSTM models for three different time periods:

1. Next day close price (daily prediction)
2. Next week average close price (weekly prediction)
3. Next month average close price (monthly prediction)

For each stock, we've created three separate datasets specifically formatted for each prediction period. The data has been saved in the following folders:

- `data/lstm/day/` - Contains data for next-day predictions
- `data/lstm/week/` - Contains data for next-week predictions
- `data/lstm/month/` - Contains data for next-month predictions

Each file follows the naming convention `{stock}_lstm_{period}.csv` (for example, `AAPL_lstm_week.csv`), where `period` is `day`, `week`, or `month`.

The key differences between the datasets are:

1. Day prediction: Uses the existing `next_day_close_scaled` column as the target
2. Week prediction: Creates and uses `next_week_close_scaled` (average close over the next 5 trading days) as the target
3. Month prediction: Creates and uses `next_month_close_scaled` (average close over the next 21 trading days) as the target

Additionally, each dataset contains a binary indicator for whether the price goes up in the respective prediction period.
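
The week and month indicators (`price_up_week`, `price_up_month`) compare the target against the current scaled close. A minimal sketch on made-up values shows one detail worth knowing: a comparison against NaN evaluates to False, so rows without a target would be labeled 0 if they were not dropped afterwards.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "close_scaled": [0.10, 0.20, 0.30],
    # Last row has no complete future window, so its target is NaN
    "next_week_close_scaled": [0.15, 0.18, np.nan],
})

# 1 if the next-week average is above the current close, else 0
frame["price_up_week"] = (
    frame["next_week_close_scaled"] > frame["close_scaled"]
).astype(int)
labels_before_drop = frame["price_up_week"].tolist()

# Dropping rows without a target removes the misleading 0 label
frame = frame.dropna(subset=["next_week_close_scaled"])
labels_after_drop = frame["price_up_week"].tolist()
```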