# Stock Price Prediction Data Preparation for Prophet

This notebook prepares stock data for Prophet forecasts at three prediction horizons:
1. Next-day close price
2. Next-week average close price (mean of the next 5 trading days)
3. Next-month average close price (mean of the next 21 trading days)

For each stock, we create three separate datasets, one formatted for each horizon.
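
In symbols, writing $c_t$ for the close on trading day $t$ (notation introduced here for illustration, not taken from the data files), the targets built by the helper functions in this notebook can be sketched as follows. The day dataset keeps the close unshifted, so the "next day" value presumably comes from forecasting one step ahead rather than from the target itself.

```latex
y_t^{\text{day}} = c_t, \qquad
y_t^{\text{week}} = \frac{1}{5} \sum_{k=1}^{5} c_{t+k}, \qquad
y_t^{\text{month}} = \frac{1}{21} \sum_{k=1}^{21} c_{t+k}
```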

In [1]:
import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

In [2]:
# List of stock symbols
stocks = ["AAPL", "MSFT", "GOOG", "AMZN", "TSLA", 
          "META", "NVDA", "SPY", "V", "DIS",
          "NFLX", "PYPL", "BABA", "IBM", "AMD",
          "BA", "INTC", "T", "GS", "NKE"]

# Path to the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))

# Paths for input and output data
input_folder = os.path.join(project_root, "data", "prophet", "initial")
output_base_folder = os.path.join(project_root, "data", "prophet")

# Create output folders for each prediction period
output_folders = {
    'day': os.path.join(output_base_folder, "day"),
    'week': os.path.join(output_base_folder, "week"),
    'month': os.path.join(output_base_folder, "month")
}

# Ensure output directories exist
for folder in output_folders.values():
    os.makedirs(folder, exist_ok=True)

## Helper Functions for Data Preparation

In [3]:
def prepare_next_day_data(df):
    """
    Prepare data for next day prediction using Prophet.
    Prophet requires 'ds' and 'y' columns for date and target variable.
    
    The format is already correct, but we'll ensure only necessary columns are kept
    and data is properly sorted.
    """
    # Make a copy to avoid modifying the original dataframe
    prophet_df = df.copy()
    
    # Ensure datetime format for ds column
    prophet_df['ds'] = pd.to_datetime(prophet_df['ds'])
    
    # Sort by date
    prophet_df = prophet_df.sort_values('ds')
    
    # Keep only required columns for Prophet: 'ds' and 'y'
    prophet_df = prophet_df[['ds', 'y']]
    
    return prophet_df

def prepare_next_week_data(df):
    """
    Prepare data for next week average prediction.
    We'll create a new column 'y' that represents the average close price for the next 5 trading days.
    """
    # Make a copy to avoid modifying the original dataframe
    df_copy = df.copy()
    
    # Ensure datetime format for ds column
    df_copy['ds'] = pd.to_datetime(df_copy['ds'])
    
    # Sort by date
    df_copy = df_copy.sort_values('ds')
    
    # Calculate the 5-day forward rolling average of close prices (representing a trading week)
    df_copy['next_week_avg'] = df_copy['y'].shift(-5).rolling(window=5).mean()
    
    # Drop rows where next_week_avg is NaN: the first 4 (incomplete rolling window)
    # and the last 5 (shifted past the end of the series), 9 rows in total
    df_copy = df_copy.dropna(subset=['next_week_avg'])
    
    # Create the Prophet format dataframe
    prophet_df = pd.DataFrame({
        'ds': df_copy['ds'],
        'y': df_copy['next_week_avg']
    })
    
    return prophet_df

def prepare_next_month_data(df):
    """
    Prepare data for next month average prediction.
    We'll create a new column 'y' that represents the average close price for the next 21 trading days.
    """
    # Make a copy to avoid modifying the original dataframe
    df_copy = df.copy()
    
    # Ensure datetime format for ds column
    df_copy['ds'] = pd.to_datetime(df_copy['ds'])
    
    # Sort by date
    df_copy = df_copy.sort_values('ds')
    
    # Calculate the 21-day forward rolling average of close prices (representing a trading month)
    df_copy['next_month_avg'] = df_copy['y'].shift(-21).rolling(window=21).mean()
    
    # Drop rows where next_month_avg is NaN: the first 20 (incomplete rolling window)
    # and the last 21 (shifted past the end of the series), 41 rows in total
    df_copy = df_copy.dropna(subset=['next_month_avg'])
    
    # Create the Prophet format dataframe
    prophet_df = pd.DataFrame({
        'ds': df_copy['ds'],
        'y': df_copy['next_month_avg']
    })
    
    return prophet_df
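
To see what the `shift(-n).rolling(window=n).mean()` idiom in the helpers computes, the following toy example applies it to a short series. The series values and variable names are made up for illustration; only the pandas calls match the helpers above.

```python
import pandas as pd

# Toy "close price" series 0.0, 1.0, ..., 11.0 (illustrative values only)
y = pd.Series(range(12), dtype=float)
n = 5  # one trading week, as in prepare_next_week_data

# Idiom used by the helpers: shift back by n, then take a trailing n-row mean
forward_avg = y.shift(-n).rolling(window=n).mean()

# Every surviving value equals the mean of the n values after its position
valid = forward_avg.dropna()
for t, value in valid.items():
    expected = y.iloc[t + 1 : t + 1 + n].mean()
    assert abs(value - expected) < 1e-9

# Alternative order for comparison: trailing mean first, then shift back by n
alt = y.rolling(window=n).mean().shift(-n)

print(forward_avg.tolist())  # NaN for the first n-1 and the last n positions
print(alt.tolist())          # NaN only for the last n positions
```

Both orders give the same values wherever both are defined. Rolling first and shifting afterwards would keep the first `n - 1` rows, but it would also change the row counts this notebook reports, so it is shown only for comparison.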

## Process Each Stock

In [4]:
def process_stocks():
    """Process all stocks and create datasets for different prediction periods"""
    results = []
    
    for stock in stocks:
        try:
            # Construct input file path
            input_file = os.path.join(input_folder, f"{stock}_prophet.csv")
            
            # Read the stock data
            df = pd.read_csv(input_file)
            
            # Prepare data for different prediction periods
            day_data = prepare_next_day_data(df)
            week_data = prepare_next_week_data(df)
            month_data = prepare_next_month_data(df)
            
            # Save the prepared data
            day_data.to_csv(os.path.join(output_folders['day'], f"{stock}_prophet_day.csv"), index=False)
            week_data.to_csv(os.path.join(output_folders['week'], f"{stock}_prophet_week.csv"), index=False)
            month_data.to_csv(os.path.join(output_folders['month'], f"{stock}_prophet_month.csv"), index=False)
            
            results.append({
                'stock': stock,
                'status': 'Success',
                'day_rows': len(day_data),
                'week_rows': len(week_data),
                'month_rows': len(month_data)
            })
            
        except Exception as e:
            results.append({
                'stock': stock,
                'status': f'Error: {str(e)}',
                'day_rows': 0,
                'week_rows': 0,
                'month_rows': 0
            })
    
    return pd.DataFrame(results)

In [5]:
# Process all stocks and display results
results_df = process_stocks()
results_df

| | stock | status | day_rows | week_rows | month_rows |
|---|---|---|---|---|---|
| 0 | AAPL | Success | 2513 | 2504 | 2472 |
| 1 | MSFT | Success | 2513 | 2504 | 2472 |
| 2 | GOOG | Success | 2513 | 2504 | 2472 |
| 3 | AMZN | Success | 2513 | 2504 | 2472 |
| 4 | TSLA | Success | 2513 | 2504 | 2472 |
| 5 | META | Success | 2513 | 2504 | 2472 |
| 6 | NVDA | Success | 2513 | 2504 | 2472 |
| 7 | SPY | Success | 2513 | 2504 | 2472 |
| 8 | V | Success | 2513 | 2504 | 2472 |
| 9 | DIS | Success | 2513 | 2504 | 2472 |
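
The week and month row counts above follow from the window arithmetic: the rolling mean leaves the first `window - 1` rows undefined and the negative shift leaves the last `window` rows undefined. A small sketch, with `expected_rows` as a hypothetical helper name:

```python
def expected_rows(total_rows, window):
    """Rows left after shift(-window).rolling(window).mean() followed by dropna()."""
    dropped = (window - 1) + window  # leading rolling gap + trailing shift gap
    return max(total_rows - dropped, 0)

total = 2513  # day_rows reported in the results above
print(expected_rows(total, 5))   # 2504
print(expected_rows(total, 21))  # 2472
```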


## Data Verification

Let's load one example file for each prediction period and check that it has the expected `ds` and `y` columns.

In [6]:
# Sample stock for verification
sample_stock = "AAPL"

# Load samples from each prediction period
day_sample = pd.read_csv(os.path.join(output_folders['day'], f"{sample_stock}_prophet_day.csv"))
week_sample = pd.read_csv(os.path.join(output_folders['week'], f"{sample_stock}_prophet_week.csv"))
month_sample = pd.read_csv(os.path.join(output_folders['month'], f"{sample_stock}_prophet_month.csv"))

# Display first few rows of each sample
print(f"Day prediction data sample for {sample_stock}:")
display(day_sample.head())

print(f"\nWeek prediction data sample for {sample_stock}:")
display(week_sample.head())

print(f"\nMonth prediction data sample for {sample_stock}:")
display(month_sample.head())

Day prediction data sample for AAPL:

| | ds | y |
|---|---|---|
| 0 | 2015-05-13 00:00:00-04:00 | 28.259068 |
| 1 | 2015-05-14 00:00:00-04:00 | 28.918398 |
| 2 | 2015-05-15 00:00:00-04:00 | 28.878025 |
| 3 | 2015-05-18 00:00:00-04:00 | 29.19648 |
| 4 | 2015-05-19 00:00:00-04:00 | 29.169573 |



Week prediction data sample for AAPL:

| | ds | y |
|---|---|---|
| 0 | 2015-05-19 00:00:00-04:00 | 29.407284 |
| 1 | 2015-05-20 00:00:00-04:00 | 29.484429 |
| 2 | 2015-05-21 00:00:00-04:00 | 29.434644 |
| 3 | 2015-05-22 00:00:00-04:00 | 29.34494 |
| 4 | 2015-05-26 00:00:00-04:00 | 29.36019 |



Month prediction data sample for AAPL:

| | ds | y |
|---|---|---|
| 0 | 2015-06-11 00:00:00-04:00 | 28.264728 |
| 1 | 2015-06-12 00:00:00-04:00 | 28.248069 |
| 2 | 2015-06-15 00:00:00-04:00 | 28.247001 |
| 3 | 2015-06-16 00:00:00-04:00 | 28.256719 |
| 4 | 2015-06-17 00:00:00-04:00 | 28.281495 |
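
The `ds` values in these samples carry a UTC offset (`-04:00`). Prophet is known to reject timezone-aware `ds` columns in the versions we are aware of, so a later modeling step may need plain dates. The sketch below is one way to do that, assuming the strings always start with `YYYY-MM-DD`; `to_naive_dates` is a hypothetical helper, and the second row of the example (including its `-05:00` offset) is made up for illustration.

```python
import pandas as pd

def to_naive_dates(ds):
    """Hypothetical helper: keep only the local calendar date of each timestamp."""
    # Work on the string form so mixed UTC offsets never have to be parsed
    date_part = ds.astype(str).str.slice(0, 10)
    return pd.to_datetime(date_part, format="%Y-%m-%d")

sample = pd.DataFrame({
    "ds": ["2015-05-13 00:00:00-04:00", "2015-12-01 00:00:00-05:00"],
    "y": [28.259068, 30.0],  # second value is invented for this example
})
sample["ds"] = to_naive_dates(sample["ds"])
print(sample.dtypes)
```

Slicing the string keeps the exchange-local trading date, whereas converting to UTC first would move each midnight timestamp to 04:00 or 05:00 UTC.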


## Data Analysis

Let's compare the three prediction targets for a single stock: their means, standard deviations, and pairwise correlations over the common date range.

In [7]:
def analyze_prediction_targets(stock):
    """Analyze and compare the different prediction targets for a given stock"""
    # Load data
    day_data = pd.read_csv(os.path.join(output_folders['day'], f"{stock}_prophet_day.csv"))
    week_data = pd.read_csv(os.path.join(output_folders['week'], f"{stock}_prophet_week.csv"))
    month_data = pd.read_csv(os.path.join(output_folders['month'], f"{stock}_prophet_month.csv"))
    
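    # Convert dates to datetime for consistent comparison; utc=True gives a single
    # datetime dtype even if the UTC offsets in the files differ between rows
    day_data['ds'] = pd.to_datetime(day_data['ds'], utc=True)
    week_data['ds'] = pd.to_datetime(week_data['ds'], utc=True)
    month_data['ds'] = pd.to_datetime(month_data['ds'], utc=True)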
    
    # Find common date range
    start_date = max(day_data['ds'].min(), week_data['ds'].min(), month_data['ds'].min())
    end_date = min(day_data['ds'].max(), week_data['ds'].max(), month_data['ds'].max())
    
    # Filter data to common date range
    day_filtered = day_data[(day_data['ds'] >= start_date) & (day_data['ds'] <= end_date)]
    week_filtered = week_data[(week_data['ds'] >= start_date) & (week_data['ds'] <= end_date)]
    month_filtered = month_data[(month_data['ds'] >= start_date) & (month_data['ds'] <= end_date)]
    
    # Merge datasets on date
    merged = pd.merge(day_filtered, week_filtered, on='ds', suffixes=('_day', '_week'))
    merged = pd.merge(merged, month_filtered, on='ds')
    merged = merged.rename(columns={'y': 'y_month'})
    
    # Calculate basic statistics
    stats = {
        'day_mean': merged['y_day'].mean(),
        'week_mean': merged['y_week'].mean(),
        'month_mean': merged['y_month'].mean(),
        'day_std': merged['y_day'].std(),
        'week_std': merged['y_week'].std(),
        'month_std': merged['y_month'].std(),
        'day_week_corr': merged['y_day'].corr(merged['y_week']),
        'day_month_corr': merged['y_day'].corr(merged['y_month']),
        'week_month_corr': merged['y_week'].corr(merged['y_month'])
    }
    
    return stats

In [8]:
# Analyze a sample stock
sample_analysis = analyze_prediction_targets("AAPL")
pd.DataFrame([sample_analysis]).T.rename(columns={0: 'Value'})

| | Value |
|---|---|
| day_mean | 100.725828 |
| week_mean | 100.930369 |
| month_mean | 101.492727 |
| day_std | 68.098532 |
| week_std | 68.106535 |
| month_std | 68.135278 |
| day_week_corr | 0.998876 |
| day_month_corr | 0.996527 |
| week_month_corr | 0.998107 |
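
Correlations this close to 1 are expected when all three targets are price levels that share the same long-run trend. As a rough cross-check, one could also correlate day-over-day changes. The sketch below does this on a synthetic random walk, not on real stock data; `level_and_change_corr` is a hypothetical helper that reuses the forward-average idiom from the preparation functions.

```python
import numpy as np
import pandas as pd

def level_and_change_corr(close, window):
    """Correlate a series with its forward average, in levels and in first differences."""
    forward_avg = close.shift(-window).rolling(window=window).mean()
    frame = pd.DataFrame({"day": close, "fwd": forward_avg}).dropna()
    level_corr = frame["day"].corr(frame["fwd"])
    change_corr = frame["day"].diff().corr(frame["fwd"].diff())
    return level_corr, change_corr

# Synthetic random walk standing in for a close-price series (assumption)
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 2500).cumsum())

level_corr, change_corr = level_and_change_corr(close, window=5)
print(f"levels: {level_corr:.3f}, changes: {change_corr:.3f}")
```

For a pure random walk the level correlation is high while the change correlation is close to zero, which suggests that the level correlations in the table say little about how predictable the targets are.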


## Summary

This notebook prepared stock data for Prophet at three prediction horizons:

1. Next day close price (daily prediction)
2. Next week average close price (weekly prediction)
3. Next month average close price (monthly prediction)

For each stock, we created three separate datasets, one per horizon. The data is saved in the following folders:

- `data/prophet/day/` - Contains data for next-day predictions
- `data/prophet/week/` - Contains data for next-week predictions
- `data/prophet/month/` - Contains data for next-month predictions

Each file follows the naming convention `{stock}_prophet_{period}.csv`, where `{period}` is `day`, `week`, or `month` (for example, `AAPL_prophet_day.csv`).

Prophet requires input data with a `ds` (datetime) column and a `y` (target value) column, and every prepared dataset contains exactly these two columns.
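
Before handing a prepared file to a model, a quick structural check can catch surprises. The sketch below is a hypothetical `validate_prophet_frame` helper; its checks reflect what the preparation functions in this notebook are meant to produce (sorted dates, no missing targets) and are not an exhaustive list of Prophet's requirements.

```python
import pandas as pd

def validate_prophet_frame(df):
    """Hypothetical check: return a list of problems found in a prepared frame."""
    missing = [col for col in ("ds", "y") if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    ds = pd.to_datetime(df["ds"])
    if not ds.is_monotonic_increasing:
        problems.append("ds is not sorted ascending")
    if ds.duplicated().any():
        problems.append("ds contains duplicate dates")
    if df["y"].isna().any():
        problems.append("y contains missing values")
    return problems

# Small made-up frames; the dates are timezone-naive to keep the example simple
good = pd.DataFrame({"ds": ["2015-05-13", "2015-05-14"], "y": [28.259068, 28.918398]})
bad = pd.DataFrame({"ds": ["2015-05-14", "2015-05-13"], "y": [1.0, None]})

print(validate_prophet_frame(good))  # []
print(validate_prophet_frame(bad))
```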